Thank you for this, John. One of the very best tutorials I have ever seen on web scraping. Keep up the good work!
Glad it was helpful!
This right here is really well done and comprehensive. Thanks a lot John Little.
😃
Thanks for the comment. I'm glad to know your thoughts.
Hi John, wonderful session. Learned a lot from this video, thanks!
Glad you enjoyed it
Thanks. Nicely explained!
Glad it was helpful!
Thank you
You're welcome
Hello! Thank you for uploading this tutorial. I have a question: reproducing the results works until line 36t. ("nav_results_list
Hi Benjamin. It looks like the HTML at that web site has changed since I wrote the original code. I made some changes to lines 179, 303, and 411. You can see the differences summarized here: github.com/libjohn/workshop_webscraping/commit/6dd9a67b8d298930ddc9628518ae8c5c9559c2d8
The basic issue is that the web site now uses the '..' designation to reference a relative path one directory above the results page. (I can't say why the site authors are doing this.) To get around it, replace '..' with the actual path:
`mutate(url = str_replace(url, "\\.\\.", "ecartico"))`
And then make some minor updates as a result of this change.
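For context, here is a minimal sketch of that fix inside a pipeline. The example hrefs and the base URL are made up for demonstration and are not taken from the workshop script:

```r
# Illustrative sketch: repair relative '..' links before following them.
# The example hrefs and base URL below are placeholders only.
library(tidyverse)

links <- tibble(url = c("../persons/12345", "../persons/67890"))

links |>
  # swap the leading '..' for the actual directory name
  mutate(url = str_replace(url, "\\.\\.", "ecartico")) |>
  # then prepend the domain to form an absolute URL
  mutate(url = paste0("https://example-site.org/", url))
```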
Hope that helps.
Please, what is the difference between web parsing and web scraping?
Hi Ajao.
One answer is that scraping refers to the collection and analysis of web data, while parsing is, more specifically, separating HTML into its useful component parts. By this definition, `read_html()` is more of a scraping function, while `html_attr()` and `html_text()` are parsing operations.
Here is an article that attempts to define some issues that surround web scraping: reallifemag.com/fair-game/
I think of _scraping_ as a somewhat imprecise term that can include _parsing_ as one of the necessary steps to gather and prepare data for analysis.
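To make the distinction concrete, here is a small rvest sketch; the URL and CSS selector are placeholders:

```r
# "Scraping" fetches the raw page; "parsing" separates it into parts.
library(rvest)

page <- read_html("https://example.com")  # scraping: retrieve the raw HTML

page |>
  html_elements("h1") |>                  # parsing: isolate the <h1> nodes
  html_text2()                            # parsing: extract their text
```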
@JohnLittle1 Thank you very much, that was really helpful.
thanks
How do you scrape when a first web wiki page is made? Best
@stemengoli Can you give me a URL?
So if SelectorGadget doesn't work for a website, you are screwed. You should show us the proper way: scraping data from the HTML code, not using a tool that only works in some cases.
Hi @ProT. I feel like I explained more than just SelectorGadget, but I'm glad for the feedback. Nonetheless, you bring up an important and unavoidable aspect of web scraping: no web scraping technique works all the time, for all pages, of all websites. Please drop in an example URL of a site where SelectorGadget doesn't work for you. I'm happy to try to provide suggestions and next steps.
Anyway, it sounds like you've hit a frustration point, which is common in web scraping. A quick suggestion, which does not yet take into account your specific case, is to read the HTML with `read_html()` and then parse the HTML yourself, i.e., fall back to using regex on the raw HTML. That is a more technical and potentially more robust approach.
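A rough sketch of that regex fallback, making no assumptions about your specific site; the URL and pattern are placeholders only:

```r
# Fall back to regex when CSS selectors fail: flatten the parsed page
# to a string and pull attribute values out directly. Illustrative only.
library(rvest)
library(stringr)

raw <- read_html("https://example.com") |>
  as.character()                 # the whole page as one raw HTML string

# capture everything inside href="..." across the document
hrefs <- str_match_all(raw, 'href="([^"]+)"')[[1]][, 2]
hrefs
```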
Best
Gee, what an asshole.
Great vid and great comment, prof.
Dear John, would you by any chance know what this is meant for: Disallow: *US_CENSUS_NAME*
I don't know. Without knowing the context, my guess is that 'Disallow: US_CENSUS_NAME' is listed in some target site's robots.txt file. If that is true, it should mean that the target site does not want any robots or crawlers searching for the path US_CENSUS_NAME. You could check this by manually entering the path into a web browser, as a complete URL appended to the target's domain name. Regardless, if you are crawling a site, you want to make sure your scraper code does not crawl US_CENSUS_NAME as a target path.
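For what it's worth, here is a small R sketch for checking this kind of rule yourself; the domain is a placeholder, and the second part assumes the robotstxt package is installed:

```r
# Inspect a site's crawling rules directly (domain is illustrative)
readLines("https://example.com/robots.txt")

# Or ask programmatically whether a given path may be crawled,
# using the robotstxt package (assumed to be installed)
library(robotstxt)
paths_allowed(paths = "/US_CENSUS_NAME", domain = "example.com")
```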
@JohnLittle1 It's an inappropriate web site. I will manually enter the path into a web browser just to see what happens, but I'm extremely curious why it's there and what it is for...
Thank you for responding...
Cheers