Webscraping in R

  • Published: 22 Aug 2024
  • !! This video was recorded a while ago, and some of the examples no longer work. For the first example (on wikipedia), please check the updated code in this RMarkdown document:
    github.com/ccs...
    And yeah I know, the video is pretty long! It's actually 2 parts (in hindsight). Up till 40:00 it's mainly introducing how this works, and after 40:00 it's walking through two demos. If you're the type of person who first wants to see something in action, you can skip straight to 40:00, and then see whether you want to spend time on learning to understand what's happening there (for which you can use either the video or the RMarkdown document).

Comments • 33

  • @mindandresearch
    @mindandresearch 24 days ago

    You should make more videos. You explained this on point! On R and everything about it, you will surely be the best, no doubt!

  • @alanscott9258
    @alanscott9258 1 year ago +3

    Kasper, just working through your tutorial this week and it is excellent. It is obviously some time since you did the video and the coding on IMDb has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!

  • @moviezone8130
    @moviezone8130 3 months ago

    Kasper, I found it very helpful. It was a great video and you set the bar high. Very informative and filled with concepts.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Congratulations! This is an excellent and lucid explanation of how to web scrape with R's rvest. I had no idea it was this simple.

  • @pieracelis6862
    @pieracelis6862 4 months ago

    Really good tutorial, thanks a lot!! :)

  • @raould2590
    @raould2590 1 year ago

    Excellent one! Thank you for this! Well structured, well explained, and very useful!

  • @Kinglium
    @Kinglium 2 years ago +1

    Thank you so much for all your hard work! I learned a lot from this video!!

  • @conservo3203
    @conservo3203 1 year ago

    Hey Kasper. Thanks for your free YouTube Premium in an Airbnb in Berlin last week 😅. I logged out for you when I went home. 👍🏻

  • @timmytesla9655
    @timmytesla9655 1 year ago

    This was really helpful. Thanks for this awesome tutorial.

  • @Quienescribiohoy
    @Quienescribiohoy 2 years ago

    Thank you for this video, it was really helpful.

  • @Aguaires
    @Aguaires 3 months ago

    Thank you!

  • @hectormercedes6553
    @hectormercedes6553 2 years ago +1

    THANK YOU TEACHER, VERY IMPORTANT, I'M A NEWBIE

  • @Ryan-vc9gc
    @Ryan-vc9gc 1 year ago

    Awesome video thank you

  • @harutyunhakobyan4534
    @harutyunhakobyan4534 1 year ago

    Thank you very much, very helpful.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 1 year ago

    Again, thanks for the fine presentation. How about xpath? Have you considered covering that? I was hoping it would help with a table I was scraping, but I could not figure out what to hang my hat on. The website is very unusual: you can view a table (the one I would like to scrape), but the code returns a list of three tables, not one. What is annoying is that the html code has no distinct tags or marks to work with.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)".
      What can help with those annoyingly long paths is to use something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page and it gives you either the CSS selector or the XPath.
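To make the positional-selector idea above concrete, here is a minimal self-contained sketch with rvest. The nav markup is made up for the example; rvest's minimal_html() builds a page from an HTML string, so no network request is needed:

```r
library(rvest)

# A tiny made-up page with a nav list whose items have no distinctive classes
page <- minimal_html('
  <div id="top-nav">
    <div>
      <nav>
        <ul>
          <li>Home</li>
          <li>About</li>
          <li>Contact</li>
        </ul>
      </nav>
    </div>
  </div>')

# Target the second <li> by traversing down from the closest parent with an id
page %>%
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") %>%
  html_text()
#> [1] "About"
```

The same selector string works in the browser's Inspect panel (Ctrl+F in Chrome's Elements tab), so you can verify it there before using it in R.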

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 1 year ago

      @@kasperwelbers
      Kasper,
      Thanks for the information. It clearly takes a lot of experimenting. I wound up settling on these two code options for trying to extract the third table:
      html_doc |>
        html_elements("table") |>
        html_table(header = TRUE) |>
        pluck(3)
      pluck is from the purrr package (pull will not work here).
      Or using xpath:
      html_doc |>
        html_elements(xpath = '//center[position() = 3]/table') |>
        html_table(header = TRUE)
      The pluck method is more elegant in my mind, but xpath is clearly worth learning at one point or another.
      By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will.
      H

    • @kasperwelbers
      @kasperwelbers  1 year ago

      @@haraldurkarlsson1147 pluck indeed offers a nice solution here!
      There is certainly some value in learning xpath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. Probably the xpath approach is slightly faster, but in webscraping the main speedbump is the http requests, so in practice I think the difference in speed would hardly be noticeable.
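The two options discussed above can be sketched self-contained. The three tables below are made up to mimic the situation; on the real site, html_doc would come from read_html() on the page's URL:

```r
library(rvest)
library(purrr)

# Made-up page: three <center><table> blocks with no distinguishing attributes
html_doc <- minimal_html('
  <center><table><tr><th>a</th></tr><tr><td>1</td></tr></table></center>
  <center><table><tr><th>b</th></tr><tr><td>2</td></tr></table></center>
  <center><table><tr><th>c</th></tr><tr><td>3</td></tr></table></center>')

# Option 1: parse all tables into a list, then pluck the third
html_doc |>
  html_elements("table") |>
  html_table(header = TRUE) |>
  pluck(3)

# Option 2: select the third <center>'s table directly with xpath
# (returns a one-element list of parsed tables)
html_doc |>
  html_elements(xpath = "//center[position() = 3]/table") |>
  html_table(header = TRUE)
```

Note that the native pipe |> requires R 4.1 or later; with older R, substitute magrittr's %>%.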

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Very useful! But my issue is not with the R code but rather with reading through the html code and finding the right places. I had a heck of a time. For one thing, the font was tiny and the code enormously long. Is it searchable?

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      Dear Haraldur, the hardest part of scraping is indeed not so much the code, but learning how to find and select HTML elements. The good news, though, is that this process is the same regardless of what scraping software you use. So if at some point you're using Python, it still applies, and there are also tools for automating an actual web browser, such as Selenium (and RSelenium), for which the main task is also finding and selecting HTML elements. That being said, there are great tools for searching through the HTML code. The main one is the Inspect option, as also discussed in the tutorial. If you use the one in Chrome, you can search for both strings and CSS selectors. So that's a great way to find elements and figure out how to select them with rvest. Also, note that if you right-click an element on the webpage and select Inspect, it automatically shows the HTML code for that element.

    • @haraldurkarlsson1147
      @haraldurkarlsson1147 2 years ago

      @@kasperwelbers
      Kasper, I have indeed used Inspect in Chrome. My main problem is that the font is so small that I have a hard time reading it.

    • @kasperwelbers
      @kasperwelbers  2 years ago +2

      @@haraldurkarlsson1147 Ahhh, like that! You should be able to change the font size like any content on a webpage. In Chrome, at least for me, it works by holding Ctrl and then scrolling up/down with my mouse. This changes the font size for whatever window you're pointing at.

  • @thomasberthelot9187
    @thomasberthelot9187 1 year ago

    Hi! I'm a newbie and at the 7th and 8th lines, I get: "Error in read_html(url) : could not find function "read_html"". Could you please tell me what's wrong? It's the same when I run "%>%", I get: "Error in read_html(url) %>% html_element("table.wikitable") :
    could not find function "%>%"". Same with library(tidyverse).
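A "could not find function" error for read_html() or %>% usually means the packages are not loaded in the current session (or not installed at all). A sketch of the usual fix, assuming the tutorial's setup:

```r
# Install once per machine (uncomment if needed):
# install.packages(c("rvest", "tidyverse"))

# Load at the top of every script/session; this is what makes
# read_html() (from rvest) and %>% (loaded via the tidyverse) available.
library(rvest)
library(tidyverse)
```

If library(tidyverse) itself errors, the package is not installed yet, which points to the install.packages() step.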

  • @brittnyfreeman3650
    @brittnyfreeman3650 1 year ago

    Where is the html tutorial link that you mentioned? It’s not in the description of the video.

    • @kasperwelbers
      @kasperwelbers  1 year ago

      Hi Brittny, you're right. I replaced the HTML file with a .md file (the RMarkdown file is already knitted), because somehow links to the html file on GitHub didn't work. Did you need an HTML version in particular?

  • @erolarmstrong
    @erolarmstrong 2 years ago

    html_element('table.wikitable')
    Error in UseMethod("xml_find_first") :
    no applicable method for 'xml_find_first' applied to an object of class "character"
    I am getting this error while looking for the HTML node.

    • @kasperwelbers
      @kasperwelbers  2 years ago +1

      Hi Erol. I suspect you are now calling html_element(...) by itself, and not within a pipe.
      The first argument of html_element should be an HTML page, which you create with read_html. So that would look like:
      html_page = read_html("some url")
      html_element(html_page, ".wikitable")
      But the pipe operator allows us to write it like this:
      read_html("some url") %>%
        html_element(".wikitable")
      In this case, the output of read_html is plugged into html_element as the first argument.
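As a self-contained illustration of the equivalence described above (the table is made up via minimal_html(); for the real tutorial you would call read_html() on the Wikipedia URL instead):

```r
library(rvest)

# A stand-in page containing a table with the class used in the tutorial
html_page <- minimal_html(
  '<table class="wikitable"><tr><th>x</th></tr><tr><td>1</td></tr></table>')

# Without the pipe: the page is passed as the first argument
html_element(html_page, ".wikitable") %>% html_table()

# With the pipe: the left-hand side becomes the first argument
html_page %>%
  html_element(".wikitable") %>%
  html_table()
```

Both calls parse the same table, so either style works; the pipe just reads top-to-bottom when several steps are chained.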