TidyTuesday: Web Scraping Data using Rvest

  • Published: Aug 21, 2024

Comments • 13

  • @jannonflores1113
    @jannonflores1113 2 years ago

    Thanks so much for this Andrew!!! Cheers!!

  • @afiqyahya3398
    @afiqyahya3398 4 years ago

    Damn, I love how you choose your TidyTuesday content. Can't praise it enough.

  • @Pvillanueva13
    @Pvillanueva13 4 years ago +2

    Thanks for the intro to Rvest! The code as shown doesn't quite work correctly, though, since the get_text and get_link functions assign the same hardcoded link right at the beginning. I was able to get it to work just by deleting those lines - I got 6603 unique "staff members" this way compared to the 33 from this code. Thanks again for the video!

    • @AndrewCouch
      @AndrewCouch  4 years ago +1

      Good catch, I'll make sure to change it!
      -Andrew
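
      For reference, a minimal sketch of what the fix might look like: pass the URL into the helper instead of assigning it inside the function. The CSS selector and function body here are placeholders, not the video's code.
      library(rvest)
      get_links <- function(url) {
        read_html(url) %>%
          html_nodes("a.article-link") %>%  # placeholder selector
          html_attr("href")
      }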

  • @mohamedtekouk8215
    @mohamedtekouk8215 1 year ago

    It works with this example, but with other examples the output shows {xml_nodeset (0)}.

  • @felixzhao9070
    @felixzhao9070 3 years ago +1

    Hi Andrew, thank you so much for sharing this amazing content! I have a question about identifying the total number of pages. In your tutorial you went through a manual process; I wonder if there is any way to have R identify the total number of pages available, because as the number of articles grows, there will be more pages than are currently available. Thanks again!

    • @AndrewCouch
      @AndrewCouch  3 years ago

      I think it depends on the webpage you are scraping. For example, using page=all can sometimes retrieve all of the links in one URL. Another way is to enter a large number and iterate through the pages with the safely function: pages that have no content return an error, but the mapped function still iterates through them.
      library(tidyverse)  # tibble, mutate, map, safely
      tibble(page_num = 1:100) %>%
        mutate(page = paste0("https://fivethirtyeight.com/tag/slack-chat/page/", page_num, "/")) %>%
        mutate(links = map(page, safely(get_links))) %>%
        mutate(links = map(links, "result"))  # keep the result element from each safely() call
      If you are planning on scraping data that gets added to the website over time, I recommend saving the links that have already been scraped and using an anti-join against the full link set when re-running the script. I know this isn't the most efficient way of web scraping, but I hope this helps!
      -Andrew
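
      A rough sketch of the anti-join idea above; the file name, column name, and the assumption that get_links() returns a character vector of URLs are illustrative, not code from the video.
      library(tidyverse)
      old_links <- read_csv("scraped_links.csv")  # links saved on a previous run
      new_links <- tibble(link = get_links("https://fivethirtyeight.com/tag/slack-chat/"))
      new_only <- anti_join(new_links, old_links, by = "link")  # links not yet scraped
      bind_rows(old_links, new_only) %>%
        write_csv("scraped_links.csv")  # save the updated link set for the next run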

    • @felixzhao9070
      @felixzhao9070 3 years ago +1

      @AndrewCouch Thank you so much for your quick reply, Andrew! I will check it out...

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Don't you have to check whether they allow scraping first? There may be no need if there is an API.

    • @AndrewCouch
      @AndrewCouch  2 years ago

      Yes, in general you should look for a robots.txt file on the website or check for an API. I advocate scraping what you need for personal projects, but for professional/work projects I do not scrape and instead purchase data from vendors.
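
      One way to check this programmatically is the robotstxt package (not mentioned in the thread); the path below is only an example.
      library(robotstxt)
      paths_allowed(
        paths = "/tag/slack-chat/",
        domain = "fivethirtyeight.com"
      )  # TRUE/FALSE depending on the site's robots.txt rules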

  • @LK-zt9vf
    @LK-zt9vf 3 years ago

    How do I export this to CSV?
    write.csv(data_slack_pages, "data_test.csv")
    doesn't work.

    • @AndrewCouch
      @AndrewCouch  3 years ago +2

      Is anything in data_slack_pages nested? You may need to unnest a column first.
      Example:
      data_slack_pages %>%
        unnest(nested_column) %>%  # flatten the list-column before writing
        write.csv("data_test.csv")
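
      A quick way to see which columns are list-columns and therefore need unnesting; this assumes data_slack_pages is the tibble built in the video.
      library(purrr)
      map_chr(data_slack_pages, ~ class(.x)[1])  # columns reported as "list" need unnest()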

    • @LK-zt9vf
      @LK-zt9vf 3 years ago +1

      @AndrewCouch Sorry for the slow reply. Worked a treat, thank you, great tutorial! It might help to slow down just a bit for newbies!