The Rvest & RSelenium Tutorial - Web Scrape Dynamic Tables in R

  • Published: Dec 1, 2024

Comments • 51

  • @imfm
    @imfm 1 year ago +2

    I need to automate pulling data from several websites with atrocious autogenerated spaghetti code. I was trying with Rvest alone and httr and other solutions. I was getting nowhere fast. Then I found this video and boom, I'm in. I can't thank you enough Samer.

  •  1 year ago

    Very well explained. I didn't know about {RSelenium}; it looks really powerful. Thanks!

  • @馮庭萱
    @馮庭萱 2 years ago

    Many thanks. Great explanation, super clear!

  • @AngelFelizF
    @AngelFelizF 2 years ago

    Great video, thanks for sharing

  • @delabungsu6817
    @delabungsu6817 2 years ago

    Thank you Samer.

  • @sarahsuzz
    @sarahsuzz 3 months ago

    I keep getting an error "element not found" when using xpath to locate my "nextpage" button - it is an aria-label and it's located in the div section of the DOM - not sure what I am doing wrong. I have checked my code for typos, very carefully. Can you help?

    • @sarahsuzz
      @sarahsuzz 3 months ago

      I found my issue: my aria-label was not on an "a" tag, it was on a button.
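
For readers hitting the same "element not found" error, here is a minimal sketch of an XPath that does not depend on the tag name. The helper `aria_xpath` and the label text "Next page" are made up for illustration; `remDr` is assumed to be an already open RSelenium client.

```r
# Build an XPath that matches an aria-label on a given tag ("*" = any tag).
# The label text "Next page" used below is a hypothetical example.
aria_xpath <- function(label, tag = "*") {
  sprintf("//%s[@aria-label='%s']", tag, label)
}

aria_xpath("Next page")            # matches any element carrying the label
aria_xpath("Next page", "button")  # matches only <button> elements

# With a live session (remDr is assumed to be an open RSelenium client):
# next_btn <- remDr$findElement(using = "xpath", value = aria_xpath("Next page"))
# next_btn$clickElement()
```

Using `*` avoids having to guess whether the site uses a link or a button, at the cost of possibly matching more than one element.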

  • @RolandŐzse
    @RolandŐzse 1 year ago +1

    Hi,
    Thank you so much for this. I am not that big on coding, and this solution is really easy to follow. Excuse me if I am being dumb. I ran into a problem where you refer to the pagination command at 5:25 using the aria-label. I am trying to scrape a Transfermarkt table, and that field looks pretty different for me:
      
    As you can see, it's an href and not an aria-label. There is a link to the next page on every page, and I do not know how to iterate over this. It works fine if I want the first two pages, but after that it obviously stops working. Could you help me figure out what I should paste into the findElement function? Or is this a whole different situation where I have to do something new? Thank you in advance for your help :)

  • @НикитаБабарыкин-р1ь
    @НикитаБабарыкин-р1ь 11 months ago

    Can you help, please? Error in checkError(res) :
    Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2254 ms: Connection refused
    What could be the problem?

  • @huongheidinguyen337
    @huongheidinguyen337 2 years ago +1

    Thank you for the tutorial. I'm practicing scraping Sephora product reviews and ran into a problem. On my last page, there is still a Next page button (it is just disabled), so there was no error and my Next-page loop didn't end. Do you have any suggestions on how to end the loop in this case?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      If there is a way for you to determine how many pages there are, you can set that as the limit of your loop so that it does not go past that number.
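
A rough sketch of that idea, combined with an early exit when the button is missing or disabled. Everything here is an assumption rather than the code from the video: `scrape_pages`, `scrape_page`, and the XPath are hypothetical, `remDr` is assumed to be an open RSelenium client, and `isElementEnabled()` only helps if the site really disables the button instead of styling it with a CSS class.

```r
# Visit at most `max_pages` pages, stopping early when the "next" button
# is missing or disabled. `scrape_page` is a user-supplied function that
# extracts whatever is needed from the current page.
scrape_pages <- function(remDr, next_xpath, max_pages, scrape_page, pause = 1) {
  results <- list()
  for (i in seq_len(max_pages)) {
    results[[i]] <- scrape_page(remDr)
    if (i == max_pages) break
    next_btn <- tryCatch(
      suppressMessages(remDr$findElement(using = "xpath", value = next_xpath)),
      error = function(e) NULL
    )
    # Stop when there is no button or the browser reports it as disabled
    if (is.null(next_btn) || !isTRUE(unlist(next_btn$isElementEnabled()))) break
    next_btn$clickElement()
    Sys.sleep(pause)  # give the next page time to load
  }
  results
}

# Usage sketch (selector and page count are placeholders):
# pages <- scrape_pages(remDr, "//button[@aria-label='Next page']", 20,
#                       function(r) r$getPageSource()[[1]])
```

The page limit acts as a safety net even when the disabled check works, so the loop cannot run forever.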

  • @yehitzmedapirc
    @yehitzmedapirc 1 year ago

    Hi! What can I do if my "Next button" is different every time?
    I do not have a "next" button; I have to click on the 1, then the 2, etc. on the page.
    Thanks!

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Try to see if the different next buttons have a similar attribute that you can use.
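
One way to read that suggestion, as a sketch only: if the numbered buttons share an attribute that differs just by the page number, the selector can be generated per page. The helper `page_xpaths` and the "Page %d" aria-label pattern are hypothetical; the real attribute has to be found by inspecting the site.

```r
# Build one XPath per page number from a shared attribute pattern.
# The default pattern is a made-up example, not taken from any real site.
page_xpaths <- function(pages, pattern = "//a[@aria-label='Page %d']") {
  sprintf(pattern, as.integer(pages))
}

page_xpaths(2:4)

# With a live session (remDr is assumed to be an open RSelenium client):
# for (xp in page_xpaths(2:4)) {
#   remDr$findElement(using = "xpath", value = xp)$clickElement()
#   Sys.sleep(1)  # wait for the table to refresh before scraping
# }
```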

  • @devypratiwi8103
    @devypratiwi8103 1 year ago

    Hello, thanks for sharing the video!
    I've watched and followed all the steps, but I got an error saying
    Error in java_check() :
    PATH to JAVA not found. Please check JAVA is installed.
    What confuses me is that I've already installed Java and the installation completed, but the error keeps saying that Java is not found. Do you know how to solve this issue? Thank you

  • @eleonoras.2878
    @eleonoras.2878 1 year ago

    Thank you very much for providing such a great explanation! I've encountered an issue in that I'm only seeing a limited selection of chromedriver versions. Unfortunately, none of these versions seem to be compatible with my current Google Chrome version. Would you by any chance have any suggestions on how I might go about resolving this problem? Your insights would be greatly appreciated. :)

    • @SamerHijjazi
      @SamerHijjazi 1 year ago +1

      Thank you for the great feedback! I would suggest running the wdman::selenium function, which will download the latest drivers. Then when you run rsDriver, refer to the chromedriver version that corresponds to yours.
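
A sketch of how that might look in practice. The helper `pick_chromedriver` is hypothetical and the version strings are made-up examples; the commented calls assume the wdman, binman, and RSelenium packages are installed and that a driver matching your Chrome is actually available for download, which is not guaranteed.

```r
# Pick the downloaded chromedriver version that shares a major version with
# the installed Chrome. Returns NA when nothing matches.
pick_chromedriver <- function(available, chrome_version) {
  major <- function(v) sub("\\..*$", "", v)
  matches <- available[major(available) == major(chrome_version)]
  if (length(matches) == 0) return(NA_character_)
  # Highest matching version, compared as version numbers rather than text
  as.character(max(numeric_version(matches)))
}

pick_chromedriver(c("113.0.5672.63", "114.0.5735.16", "114.0.5735.90"),
                  "114.0.5735.199")

# With the packages installed and network access (not run here):
# wdman::selenium(retcommand = TRUE)   # fetches drivers, returns the launch command
# available <- unlist(binman::list_versions("chromedriver"))
# rs <- RSelenium::rsDriver(
#   browser = "chrome",
#   chromever = pick_chromedriver(available, "<your Chrome version>")
# )
```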

    • @eleonoras.2878
      @eleonoras.2878 1 year ago

      @SamerHijjazi I appreciate your response and assistance. Thank you very much. :)

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      @eleonoras.2878 My latest Selenium video might actually be able to solve your issue. ruclips.net/video/BnY4PZyL9cg/видео.htmlsi=RP74unOe8SvxWvPV

  • @arunrajesh5137
    @arunrajesh5137 2 years ago +1

    Watching this tutorial immediately after your Introduction to RSelenium. Really enjoyed learning it from you, Samer. How do we navigate to a webpage that requires a username and password from RSelenium?

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      Thank you, Arun! You can do so by identifying the username and password input boxes and sending the username and password to those boxes using the sendKeysToElement function from RSelenium.
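
A minimal sketch of that approach. The function `log_in`, the CSS selectors, the URL, and the environment variable names are all hypothetical; `remDr` is assumed to be an open RSelenium client, and real login forms may need extra steps (cookie banners, two-step forms, captchas).

```r
# Fill in a login form by sending text to the two input boxes.
# The default selectors are placeholders; replace them with the site's own.
log_in <- function(remDr, username, password,
                   user_css = "input[name='username']",
                   pass_css = "input[name='password']") {
  user_box <- remDr$findElement(using = "css selector", value = user_css)
  user_box$sendKeysToElement(list(username))
  pass_box <- remDr$findElement(using = "css selector", value = pass_css)
  # Sending the Enter key after the password submits many forms
  pass_box$sendKeysToElement(list(password, key = "enter"))
  invisible(TRUE)
}

# Usage sketch; reading credentials from environment variables keeps them
# out of the script:
# remDr$navigate("https://example.com/login")
# log_in(remDr, Sys.getenv("SITE_USER"), Sys.getenv("SITE_PASS"))
```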

    • @arunrajesh5137
      @arunrajesh5137 2 years ago

      @SamerHijjazi Thank you so much...

  • @cameronl1434
    @cameronl1434 2 years ago

    I am very much a beginner with all this, so sorry if this is a stupid question. I have a data table which I want to extract the information from, but when I inspect the code it doesn't have an ID. How can I go about selecting the data table without an ID? Thank you in advance

    • @zahrarahmati8612
      @zahrarahmati8612 2 years ago

      Hello Samer, I have exactly the same problem. Would you please help with this?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Not a stupid question at all! Try using a different attribute to identify your table by.
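
To make that concrete, here is a small sketch using rvest on a made-up page. The class names are invented for the example; with RSelenium the page would typically come from `read_html(remDr$getPageSource()[[1]])` instead of `minimal_html()`.

```r
library(rvest)

# A made-up page with two tables and no id attributes
page <- minimal_html('
  <table class="nav"><tr><td>menu</td></tr></table>
  <table class="stats">
    <tr><th>name</th><th>score</th></tr>
    <tr><td>A</td><td>1</td></tr>
  </table>')

# Select the table by its class with a CSS selector ...
by_class <- html_table(html_element(page, "table.stats"))

# ... or by its position among all tables on the page
by_position <- html_table(html_elements(page, "table"))[[2]]
```

Selecting by class or another attribute tends to survive layout changes better than selecting by position, but position is a workable fallback when the table has no distinguishing attributes at all.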

  • @retobunzli2088
    @retobunzli2088 2 years ago +1

    Hey Samer, love the tutorial, but I ran into an issue I haven't been able to resolve yet. I am using RSelenium to click on a tab that contains the data I want, which works fine if I run the lines of code one after the other, but not in a for loop. I have a list of links the loop should iterate through; on some tries it didn't even click the tab for the first list item, and other times it stopped after just a couple. After just adding a bunch of clickElement() commands it worked for a bit longer (but not in proportion to the number of commands added) and then stopped again. Any idea how to make it run more reliably? My R memory usage is kinda high; could it be due to that? I'm a total noob at R, but it's confusing that it works manually but not in the loop.
    Edit: Also, the netstat free_port function always gives me an 'Error in strsplit(local, ":") : non-character argument'. I wrote it exactly as you have, so no idea why it doesn't work. If I define a port manually (e.g. 14415 or '14415') it says 'Error: port should be an integer value'. My knowledge of maths might be limited, but last time I checked 14415 was an integer lol
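
On the port error in the edit above: one likely cause, offered as an assumption rather than a confirmed diagnosis, is R's type system. A bare number such as 14415 is stored as a double, and '14415' is a character string, so neither passes a strict integer check. The sketch below shows the difference; the commented `rsDriver` call is not run here.

```r
# A bare number is stored as a double in R; the L suffix makes it an integer.
is.integer(14415)
is.integer(14415L)

# Converting also works when the port arrives as text
port <- as.integer("14415")

# With RSelenium installed (not run here):
# rs <- RSelenium::rsDriver(browser = "chrome", port = 14415L)
```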

    • @SamerHijjazi
      @SamerHijjazi 2 years ago +1

      Thank you for the great feedback. I'd have to look at your code to be able to see what's going wrong

    • @retobunzli2088
      @retobunzli2088 2 years ago

      @SamerHijjazi Thanks for the quick response. I thought it might be a common or known issue. I posted the code in a Reddit thread titled "Impossible to run RSelenium's clickElement() in a loop??" 6 days ago.
      Only if you have time and interest though; I don't want to force you to look at my spaghetti code haha

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      @retobunzli2088 I can't find it. Looks like the post was removed. Can you reply to this comment with your loop?

    • @retobunzli2088
      @retobunzli2088 2 years ago

      @SamerHijjazi Yeah, I just saw it did get removed. The loop looks like this:
      for (link in links) {
        remDr$navigate(link)
        results_object <- remDr$findElement(...)
        results_object$clickElement()  # issue here (?)
        table_i_need <- remDr$findElement(...)
        table_html <- (...)$getPageSource()
      }
      and so on, exactly like you did in the video. It worked line by line, which means the CSS selectors should be fine; it's just that the click command doesn't reliably execute. Since the code above probably doesn't help much: the site is (google) 'iaaf 100m times men'. For every athlete I want to go to their profile, click the results tab (this is where it fails randomly), where all the 100m times from the current season are listed, and then extract these values via an HTML table (or similar). The links seem to be correct too; it's just that something about the dynamic nature of this specific site confuses clickElement().

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      @retobunzli2088 My guess is that your loop is running too quickly, so when it gets to the clickElement part it cannot locate the element because the web page is still loading. I would suggest adding a short pause in your loop, long enough for the site to load. You can do so with the Sys.sleep function.
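
The fixed pause can be sketched alongside a variant that polls until the element shows up, which avoids guessing how long the page needs. The helper `wait_for_element` and the selector in the usage sketch are hypothetical; `remDr` is assumed to be an open RSelenium client and `links` a character vector of URLs.

```r
# Poll for an element until it appears or the timeout (in seconds) expires.
wait_for_element <- function(remDr, using, value, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elem <- tryCatch(
      suppressMessages(remDr$findElement(using = using, value = value)),
      error = function(e) NULL  # element not there yet
    )
    if (!is.null(elem)) return(elem)
    if (Sys.time() >= deadline) {
      stop("Element not found within timeout: ", value)
    }
    Sys.sleep(poll)
  }
}

# Usage sketch inside the loop (the selector is a placeholder):
# for (link in links) {
#   remDr$navigate(link)
#   tab <- wait_for_element(remDr, "css selector", "a.results-tab")
#   tab$clickElement()
#   Sys.sleep(2)  # fixed pause, as suggested above, while the table loads
# }
```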

  • @shoakromyusupov7297
    @shoakromyusupov7297 1 year ago

    Really helpful video. Could you make a similar video on scraping data from social media sites like Instagram or LinkedIn, or another site of your choice?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Thank you! I don't think I will. LinkedIn is very difficult to scrape (plus they can close your account for it), and Instagram has its own API.

  • @jeysunez
    @jeysunez 2 years ago

    Would it be possible to hop on a Zoom call for help with a scraping project? I would really appreciate it.

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      I'm currently not offering that. But I might be in the future :)

  • @respanol1970
    @respanol1970 1 year ago

    Amazing!!!

  • @KaraniKeith
    @KaraniKeith 10 months ago

    How do I set up the server with the Firefox browser?

  • @MrNachtduiker
    @MrNachtduiker 2 years ago

    awesome, thanks

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Don't we have to check whether the site allows scraping first?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Sure! This is only for demonstration purposes. But it's good practice to check first.

  • @tarasst6887
    @tarasst6887 1 year ago

    Great!!!!

  • @ahmed007Jaber
    @ahmed007Jaber 2 years ago

    Thank you for this.
    I'm getting the error below
    Error in java_check() :
    PATH to JAVA not found. Please check JAVA is installed.
    whenever running
    rs_driver_object

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      You need to make sure the JDK is properly installed on your machine. If you're on a Windows machine, this tutorial is useful: ruclips.net/video/IJ-PJbvJBGs/видео.html
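
A quick check that may help narrow this down, offered as a sketch: it only reports what the current R session can see. The helper `java_status` is hypothetical. If Java was installed while R or RStudio was already open, restarting the session is worth trying first, since the PATH is usually read at startup.

```r
# Report what R can see of the Java installation. The error message refers to
# the PATH, so `java_on_path` is the value to look at first; JAVA_HOME is
# shown for reference only.
java_status <- function() {
  list(
    java_on_path = unname(Sys.which("java")),  # empty string when not found
    java_home    = Sys.getenv("JAVA_HOME")     # empty string when unset
  )
}

str(java_status())
```

An empty `java_on_path` after a restart suggests the folder containing the `java` executable still needs to be added to the PATH.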

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Neat. I was struggling with some dataset (tiny one) that has commas.

  • @celmywall
    @celmywall 1 year ago

    Thank you for your extraordinary tutorial. I'd like to have your opinion on this error: Error in rbindlist(list(all_data, df)) :
    Column 1 of item 2 is length 3 inconsistent with column 2 which is length 4. Only length-1 columns are recycled.
    > Thank you so much.
    Hey, I solved the error easily. Thanks anyways.

  • @MohammadMohammad-mj6pc
    @MohammadMohammad-mj6pc 2 years ago

    👌👌👌. Can you create a video tutorial for the chromote package?

    • @SamerHijjazi
      @SamerHijjazi 2 years ago +1

      This is a good idea! I'd like to explore the package

  • @glanegons
    @glanegons 2 years ago

    Too good mate, is it possible to share your code? Thanks

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      Thank you for your feedback. I've added the link to the code in the description. :)