Webscraping in R

  • Published: 22 Aug 2024
  • !! This video was recorded a while ago, and some of the examples no longer work. For the first example (on wikipedia), please check the updated code in this RMarkdown document:
    github.com/ccs...
    And yeah I know, the video is pretty long! It's actually 2 parts (in hindsight). Up till 40:00 it's mainly introducing how this works, and after 40:00 it's walking through two demos. If you're the type of person who first wants to see something in action, you can skip straight to 40:00, and then see whether you want to spend time on learning to understand what's happening there (for which you can use either the video or the RMarkdown document).

Comments • 33

  • @mindandresearch
    @mindandresearch 24 days ago

    You should make more videos. You explained this on point! On R and everything about it, you will surely be the best, no doubt!

  • @alanscott9258
    @alanscott9258 1 year ago +3

    Kasper, just working through your tutorial this week and it is excellent. It is obviously some time since you did the video and the coding on IMDb has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!

  • @moviezone8130
    @moviezone8130 3 months ago

    Kasper, I found it very helpful. It was a great video and you set the bar high. Very informative and filled with concepts.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Congratulations! This is an excellent and lucid explanation of how to web scrape with R's rvest. I had no idea it was this simple.

  • @pieracelis6862
    @pieracelis6862 4 months ago

    Really good tutorial, thanks a lot!! :)

  • @raould2590
    @raould2590 1 year ago

    Excellent one! Thank you for this! Well structured, well explained, and very useful!

  • @Kinglium
    @Kinglium 2 years ago +1

    Thank you so much for all your hard work! I learned a lot from this video!!

  • @conservo3203
    @conservo3203 1 year ago

    Hey Kasper. Thanks for your free YouTube Premium in an Airbnb in Berlin last week 😅. I logged out for you when I went home. 👍🏻

  • @timmytesla9655
    @timmytesla9655 1 year ago

    This was really helpful. Thanks for this awesome tutorial.

  • @Quienescribiohoy
    @Quienescribiohoy 2 years ago

    Thank you for this video, it was really helpful.

  • @Aguaires
    @Aguaires 3 months ago

    Thank you!

  • @hectormercedes6553
    @hectormercedes6553 2 years ago +1

    THANK YOU TEACHER, VERY IMPORTANT, I'M A NEWBIE

  • @Ryan-vc9gc
    @Ryan-vc9gc 1 year ago

    Awesome video thank you

  • @harutyunhakobyan4534
    @harutyunhakobyan4534 1 year ago

    Thank you very much, very helpful.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 1 year ago

    Again, thanks for the fine presentation. How about xpath? Have you considered covering that? I was hoping it would help with a table I was scraping, but I could not figure out what to hang my hat on. The website is very unusual: you can view a table (the one I would like to scrape), but the code returns a list of three tables, not one. What is annoying is that the html code has no distinct tags or marks to work with.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)".
      What can help with those annoyingly long paths is to use something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page and it gives you either the CSS selector or the XPath.
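To make the positional-selector idea above concrete, here is a minimal self-contained sketch with rvest. The nav markup is made up for the example; rvest's minimal_html() builds a page from an HTML string, so no network request is needed:

```r
library(rvest)

# A tiny made-up page with a nav list whose items have no distinctive classes
page <- minimal_html('
  <div id="top-nav">
    <div>
      <nav>
        <ul>
          <li>Home</li>
          <li>About</li>
          <li>Contact</li>
        </ul>
      </nav>
    </div>
  </div>')

# Target the second <li> by traversing down from the closest parent with an id
page %>%
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") %>%
  html_text()
#> [1] "About"
```

The same selector string works in the browser's Inspect panel (Ctrl+F in Chrome's Elements tab), so you can verify it there before using it in R.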

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 1 year ago

      @@kasperwelbers
      Kasper,
      Thanks for the information. It clearly takes a lot of experimenting. I wound up settling on these two code options for trying to extract the third table:
      html_doc |>
        html_elements("table") |>
        html_table(header = TRUE) |>
        pluck(3)
      pluck is from the purrr package (pull will not work here).
      Or using xpath:
      html_doc |>
        html_elements(xpath = '//center[position() = 3]/table') |>
        html_table(header = TRUE)
      The pluck method is more elegant in my mind, but xpath is clearly worth learning at one point or another.
      By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will.
      H

    • @kasperwelbers
      @kasperwelbers  1 year ago

      @@haraldurkarlsson1147 pluck indeed offers a nice solution here!
      There is certainly some value in learning xpath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. Probably the xpath approach is slightly faster, but in webscraping the main speedbump is the http requests, so in practice I think the difference in speed would hardly be noticeable.
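The two options discussed above can be sketched self-contained. The three tables below are made up to mimic the situation; on the real site, html_doc would come from read_html() on the page's URL:

```r
library(rvest)
library(purrr)

# Made-up page: three <center><table> blocks with no distinguishing attributes
html_doc <- minimal_html('
  <center><table><tr><th>a</th></tr><tr><td>1</td></tr></table></center>
  <center><table><tr><th>b</th></tr><tr><td>2</td></tr></table></center>
  <center><table><tr><th>c</th></tr><tr><td>3</td></tr></table></center>')

# Option 1: parse all tables into a list, then pluck the third
html_doc |>
  html_elements("table") |>
  html_table(header = TRUE) |>
  pluck(3)

# Option 2: select the third <center>'s table directly with xpath
# (returns a one-element list of parsed tables)
html_doc |>
  html_elements(xpath = "//center[position() = 3]/table") |>
  html_table(header = TRUE)
```

Note that the native pipe |> requires R 4.1 or later; with older R, substitute magrittr's %>%.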

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Very useful! But my issue is not with the R code but rather with reading through the html code and finding the right places. I had a heck of a time. For one thing, the font was tiny and the code enormously long. Is it searchable?

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      Dear Haraldur, the hardest part of scraping is indeed not so much the code, but learning how to find and select HTML elements. The good news, though, is that this process is the same regardless of what scraping software you use. So if at some point you're using Python, it still applies, and there are also tools for automating an actual web browser, such as Selenium (and RSelenium), for which the main task is also finding and selecting HTML elements. That being said, there are great tools for searching through the HTML code. The main one is the Inspect option, as also discussed in the tutorial. If you use the one in Chrome, you can search for both strings and CSS selectors. So that's a great way to find elements and figure out how to select them with rvest. Also, note that if you right-click an element on the webpage and select Inspect, it automatically shows the HTML code for that element.

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 2 years ago

      @@kasperwelbers
      Kasper, I have indeed used Inspect in Chrome. My main problem is that the font is so small that I have a hard time reading it.

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      @@haraldurkarlsson1147 Ahhh, like that! You should be able to change the font size like any content on a webpage. In Chrome, at least for me, it works by holding Ctrl and then scrolling up/down with my mouse. This changes the font size for whatever window you're pointing at.

  • @thomasberthelot9187
    @thomasberthelot9187 1 year ago

    Hi! I'm a newbie and at the 7th and 8th lines, I get: "Error in read_html(url) : could not find function "read_html"". Could you please tell me what's wrong? It's the same when I run "%>%", I get: "Error in read_html(url) %>% html_element("table.wikitable") :
    could not find function "%>%"". Same with library(tidyverse).
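A "could not find function" error for read_html() or %>% usually means the packages are not loaded in the current session (or not installed at all). A sketch of the usual fix, assuming the tutorial's setup:

```r
# Install once per machine (uncomment if needed):
# install.packages(c("rvest", "tidyverse"))

# Load at the top of every script/session; this is what makes
# read_html() (from rvest) and %>% (loaded via the tidyverse) available.
library(rvest)
library(tidyverse)
```

If library(tidyverse) itself errors, the package is not installed yet, which points to the install.packages() step.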

  • @brittnyfreeman3650
    @brittnyfreeman3650 1 year ago

    Where is the html tutorial link that you mentioned? It’s not in the description of the video.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Hi Brittny, you're right. I replaced the HTML file with a .md file (the RMarkdown file is already knitted), because somehow links to the html file on GitHub didn't work. Did you need an HTML version in particular?

  • @erolarmstrong
    @erolarmstrong 2 years ago

    html_element('table.wikitable')
    Error in UseMethod("xml_find_first") :
    no applicable method for 'xml_find_first' applied to an object of class "character"
    I am getting this error while looking for the HTML node.

    • @kasperwelbers
      @kasperwelbers  2 years ago +1

      Hi Erol. I suspect you are now calling html_element(...) by itself, and not within a pipe.
      The first argument of html_element should be an HTML page, which you create with read_html. So that would look like:
      html_page = read_html("some url")
      html_element(html_page, ".wikitable")
      But the pipe operator allows us to write it like this:
      read_html("some url") %>%
        html_element(".wikitable")
      In this case, the output of read_html is plugged into html_element as the first argument.
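As a self-contained illustration of the equivalence described above (the table is made up via minimal_html(); for the real tutorial you would call read_html() on the Wikipedia URL instead):

```r
library(rvest)

# A stand-in page containing a table with the class used in the tutorial
html_page <- minimal_html(
  '<table class="wikitable"><tr><th>x</th></tr><tr><td>1</td></tr></table>')

# Without the pipe: the page is passed as the first argument
html_element(html_page, ".wikitable") %>% html_table()

# With the pipe: the left-hand side becomes the first argument
html_page %>%
  html_element(".wikitable") %>%
  html_table()
```

Both calls parse the same table, so either style works; the pipe just reads top-to-bottom when several steps are chained.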