Web scraping with rvest (R Case Study). Use rvest to scrape and crawl websites, then parse the HTML.

  • Published: Jan 31, 2025

Comments

  • @ahmed007Jaber
    @ahmed007Jaber 2 years ago +3

    Thank you for this, John. One of the very best tutorials I have ever seen on web scraping. Keep up the good work.

  • @david_daniels
    @david_daniels 1 year ago +1

    This right here is really well done and comprehensive. Thanks a lot John Little.
    😃

    • @JohnLittle1
      @JohnLittle1 1 year ago

      Thanks for the comment. I'm glad to know your thoughts.

  • @vicky29508
    @vicky29508 2 years ago +1

    Hi John, wonderful session. I learned a lot from this video, thanks.

  • @faiazrummankhan5589
    @faiazrummankhan5589 2 years ago +1

    Thanks. Nicely explained!

  • @yusufbas035
    @yusufbas035 2 years ago +1

    Thank you

  • @benjamintreitz1647
    @benjamintreitz1647 3 years ago

    Hello! Thank you for uploading this tutorial. I have a question: reproducing the results works until line 36t. ("nav_results_list

    • @JohnLittle1
      @JohnLittle1 3 years ago +1

      Hi Benjamin. It looks like the HTML at that web site has changed since I wrote the original code. I made some changes to lines 179, 303, and 411. You can see the differences summarized here: github.com/libjohn/workshop_webscraping/commit/6dd9a67b8d298930ddc9628518ae8c5c9559c2d8
      The basic issue is that the web site is now using the '..' designation to reference a relative path one directory above the results page. (I can't say why the site authors are doing this). But to get around it, replace '..' with the actual path:
      mutate(url = str_replace(url, "\\.\\.", "ecartico"))
      And then make some minor updates as a result of this change.
      Hope that helps.
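A minimal sketch of the replacement described in the reply above, run on made-up data. The two relative links and the `results` data frame are hypothetical stand-ins for the scraped results, not values from the actual site; only the `mutate()` / `str_replace()` call comes from the comment.

```r
library(dplyr)
library(stringr)

# Hypothetical relative links of the kind described above, where ".."
# points one directory above the results page.
results <- data.frame(
  url = c("../persons/123", "../persons/456"),
  stringsAsFactors = FALSE
)

# Replace the first ".." in each link with the actual directory name.
# Both dots are escaped so they match literal periods, not "any character".
fixed <- results %>%
  mutate(url = str_replace(url, "\\.\\.", "ecartico"))

fixed$url
```

Note that `str_replace()` only touches the first match in each string, which is what you want here: a link with a single leading `..` is rewritten once and the rest of the path is left alone.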

  • @ajaolekan3934
    @ajaolekan3934 2 years ago +1

    Please, what is the difference between web parsing and web scraping?

    • @JohnLittle1
      @JohnLittle1 2 years ago +2

      Hi Ajao.
      One answer is that scraping refers to the collection and analysis of web data, while parsing is more specifically separating HTML into useful component parts. By this definition, `read_html()` is more of a scraping function while `html_attr()` and `html_text()` are parsing operations.
      Here is an article that attempts to define some issues that surround web scraping: reallifemag.com/fair-game/
      I think of _scraping_ as a somewhat imprecise term that can include _parsing_ as one of the necessary steps to gather and prepare data for analysis.
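To make the distinction in the reply above concrete, here is a small sketch. The inline HTML string is an assumption that stands in for a page fetched from a real URL; the functions are the ones named in the comment, plus `html_elements()` to select nodes.

```r
library(rvest)

# "Scraping" step: read_html() turns a page into a document object.
# A literal HTML string is used here so the example runs offline;
# with a live site you would pass a URL instead.
page <- read_html('
  <html><body>
    <ul>
      <li><a href="/a.html">First</a></li>
      <li><a href="/b.html">Second</a></li>
    </ul>
  </body></html>
')

# "Parsing" steps: pick out nodes, then pull attributes and text from them.
links  <- html_elements(page, "a")
hrefs  <- html_attr(links, "href")
labels <- html_text(links)
```

Here `hrefs` holds the link targets and `labels` holds the visible link text, which are the "useful component parts" the reply refers to.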

    • @ajaolekan3934
      @ajaolekan3934 2 years ago +1

      @JohnLittle1 Thank you very much, that was really helpful.

  • @evertonfonseca8916
    @evertonfonseca8916 3 years ago +1

    thanks

  • @stemengoli6699
    @stemengoli6699 2 years ago

    how scraping when a first web wiki page is made? best

    • @JohnLittle1
      @JohnLittle1 2 years ago

      @stemengoli Can you give me a URL?

  • @moose23rizla
    @moose23rizla 3 years ago +1

    So if the gadget selector doesn't work for a website you are screwed up. You should show us the proper way, scraping data from the html code and not using a tool that works in some cases.

    • @JohnLittle1
      @JohnLittle1 3 years ago +8

      Hi @ProT. I feel like I explained more than just selector gadget, but I'm glad for the feedback. Nonetheless, you bring up an important and unmovable aspect of web scraping: no web scraping technique works all the time, for all pages, of all websites. Please drop in an example URL of a site where selector gadget doesn't work for you. I'm happy to try and provide suggestions and next steps.
      Anyway, it sounds like you've hit a frustration point, which is common in web scraping. A quick suggestion, which does not yet take into account your specific case, is to read the HTML with read_html() and then parse it yourself, i.e., fall back to using regex on the raw HTML. That is a more technical and potentially more robust approach.
      Best
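A rough sketch of the regex fallback suggested in the reply above. The HTML string, the `data-id` attribute, and the pattern are all made up for illustration; a real page would need its own pattern, worked out by inspecting the source.

```r
library(stringr)

# Hypothetical raw HTML. With a live page, one way to get a string like
# this is as.character(rvest::read_html(url)).
raw_html <- '<div class="item" data-id="17">Alpha</div><div class="item" data-id="42">Beta</div>'

# str_match_all() returns one matrix per input string: column 1 is the
# full match, column 2 is the first capture group (the digits).
matches <- str_match_all(raw_html, 'data-id="(\\d+)"')[[1]]
ids <- matches[, 2]
```

Regex patterns like this are tied to the exact markup, so they tend to need updating whenever the site's HTML changes.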

    • @djangoworldwide7925
      @djangoworldwide7925 9 months ago

      Jee what an asshole
      Great vid and great comment prof

  • @ciroweinstein8627
    @ciroweinstein8627 2 years ago

    Dear John, would you by any chance know what this is meant for ::> Disallow: *US_CENSUS_NAME*

    • @JohnLittle1
      @JohnLittle1 2 years ago

      I don't know. Without knowing the context, my guess is that 'Disallow: US_CENSUS_NAME' is listed in some target site's robots.txt file. If that is true, it should mean that the target site does not want any robots or crawlers searching for the path US_CENSUS_NAME. You could check this by manually entering the path into a web browser, as a complete URL appended to the target's domain name. Regardless, if you are crawling a site, you want to make sure your scraper-code does not crawl US_CENSUS_NAME as a target path.
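One way to act on the advice above is to filter candidate paths against the Disallow rules before crawling. This is a simplified base R sketch: the robots.txt lines and candidate paths are invented, and the `is_allowed()` helper is hypothetical. It treats each rule as a plain path prefix and ignores user-agent groups, `Allow:` lines, and wildcards such as `*`.

```r
# Hypothetical robots.txt content; with a real site you would read the
# file from the target's domain instead.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /US_CENSUS_NAME",
  "Disallow: /private/"
)

# Keep only the Disallow lines and strip the directive name.
disallowed <- trimws(sub("^Disallow:", "", grep("^Disallow:", robots_txt, value = TRUE)))
# An empty Disallow value means "nothing is disallowed", so drop it.
disallowed <- disallowed[nzchar(disallowed)]

# A path is allowed when it does not start with any disallowed prefix.
is_allowed <- function(path, rules) {
  !any(startsWith(path, rules))
}

candidate_paths <- c("/index.html", "/US_CENSUS_NAME/page1", "/private/data.csv")
to_crawl <- candidate_paths[vapply(candidate_paths, is_allowed, logical(1), rules = disallowed)]
```

For real crawls, a dedicated parser is a safer choice than this prefix check; the robotstxt package on CRAN is one option.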

    • @ciroweinstein8627
      @ciroweinstein8627 2 years ago +1

      @JohnLittle1 It's an inappropriate web site. I will manually enter the path into a web browser just to see what happens, but I'm extremely curious about why it is there and what it is for...
      Thank you for responding...
      Cheers