I need to automate pulling data from several websites with atrocious autogenerated spaghetti code. I was trying with Rvest alone and httr and other solutions. I was getting nowhere fast. Then I found this video and boom, I'm in. I can't thank you enough Samer.
A year ago
Very well explained. I didn't know about {RSelenium}, looks really powerful. Thanks!
Many thanks. Great explanation, super clear!
Great video, thanks for sharing
Thank you Samer.
I keep getting an error "element not found" when using xpath to locate my "nextpage" button - it is an aria-label and it's located in the div section of the DOM - not sure what I am doing wrong. I have checked my code for typos, very carefully. Can you help?
I found my issue: my aria-label was on a button, not on an "a" tag.
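For anyone hitting the same "element not found" error, the XPath has to name the tag that actually carries the aria-label. Here is a minimal sketch, assuming the label text is "Next page" (a placeholder, copy yours from the inspector) and that `remDr` is an open RSelenium client as in the video. `aria_xpath` is a helper name made up for this example.

```r
# Build an XPath that matches an element by its aria-label.
# tag = "*" matches any tag, so it works whether the label sits on <a> or <button>.
aria_xpath <- function(label, tag = "*") {
  sprintf("//%s[@aria-label='%s']", tag, label)
}

any_tag     <- aria_xpath("Next page")            # matches any element
button_only <- aria_xpath("Next page", "button")  # matches <button> elements only

# With a running session (remDr is assumed to be an open RSelenium client):
# next_button <- remDr$findElement(using = "xpath", value = any_tag)
# next_button$clickElement()
```

Using `*` is the forgiving option. Naming the tag is stricter and can help when several elements share the same label.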
Hi,
Thank you so much for this. I am not that big on coding and this solution is really easy to follow. Excuse me if I am being too dumb. I ran into a problem when you refer to the pagination command at 5:25 using the aria label. I am trying to scrape a transfermarkt table and that field is looking pretty different for me:
As you can see, it's an href and not an aria label. There is a link to the next page on every page and I do not know how to iterate over this. It works fine if I want to do the first two pages, but then it's obviously not working. Could you maybe help me out with what I should copy and paste into the findElement function? Or is this a whole different situation and I have to do something new? Thank you for your help in advance :)
Can you help please? Error in checkError(res) :
Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2254 ms: Connection refused
What could be the problem?
Thank you for the tutorial. I'm practicing scraping Sephora product reviews and ran into a problem. On my last page, there is still a Next page button (it is just disabled), so there was no error and my Next-page loop didn't end. Do you have any suggestions on how to end the loop in this case?
If there is a way for you to determine how many pages there are, you can set that as the limit in your loop so that it does not go past that number.
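The page limit suggested above can be combined with a check on whether the button is still enabled. This is a rough sketch, not the code from the video: `click_through_pages`, `next_xpath`, `max_pages` and `pause` are names chosen here, and it assumes the site disables the button in a way the browser reports through `isElementEnabled()`. Sites that only grey the button out with a CSS class would need a check on that class instead.

```r
# Click "next" until the button is missing, disabled, or max_pages is reached.
# remDr: an open RSelenium client. next_xpath and max_pages are placeholders.
click_through_pages <- function(remDr, next_xpath, max_pages = 50, pause = 2) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    # Keep the HTML of the page currently shown
    pages[[i]] <- remDr$getPageSource()[[1]]

    # findElement throws if nothing matches, so treat an error as "no button"
    next_button <- tryCatch(
      remDr$findElement(using = "xpath", value = next_xpath),
      error = function(e) NULL
    )
    if (is.null(next_button)) break

    # Stop on the last page, where the button exists but is disabled
    if (!isTRUE(unlist(next_button$isElementEnabled()))) break

    next_button$clickElement()
    Sys.sleep(pause)  # give the next page time to load
  }
  pages
}
```

The function returns one HTML string per visited page, which can then be parsed with rvest as in the video.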
Hi! What can I do if my "Next button" is different every time?
I do not have a "next" button, I have to click on the 1, then 2 etc. on the page.
Thanks!
Try to see if the different next buttons have a similar attribute that you can use.
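One attribute numbered buttons often share is that their visible text is simply the page number. A small sketch under that assumption: the links are taken to be `<a>` elements, and `page_link_xpath` is a made-up helper name, so adjust the tag to whatever the inspector shows.

```r
# XPath for a pagination link whose visible text is exactly the page number.
# Assumes the page buttons are <a> elements (check this in the inspector).
page_link_xpath <- function(page_number) {
  sprintf("//a[normalize-space(text())='%d']", as.integer(page_number))
}

second_page <- page_link_xpath(2)

# With a running session, loop over the page numbers you need
# (remDr is assumed to be an open RSelenium client, 5 is an arbitrary last page):
# for (p in 2:5) {
#   remDr$findElement(using = "xpath", value = page_link_xpath(p))$clickElement()
#   Sys.sleep(2)
# }
```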
Hello, thanks for sharing the video!
I've watched and followed all the steps, but I got an error saying
Error in java_check() :
PATH to JAVA not found. Please check JAVA is installed.
What confuses me is that I've already installed Java and the installation completed, but the error keeps saying that Java is not found. Do you know how to solve this issue? Thank you
Thank you very much for providing such a great explanation! I've encountered an issue in that I'm only seeing a limited selection of chromedriver versions. Unfortunately, none of these versions seem to be compatible with my current Google Chrome version. Would you by any chance have any suggestions on how I might go about resolving this problem? Your insights would be greatly appreciated. :)
Thank you for the great feedback! I would suggest running the wdman::selenium function, which will download the latest drivers. Then when you run rsDriver, refer to the chromedriver version that corresponds to yours.
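To pick "the chromedriver version that corresponds to yours", one option is to match on the major version number. The sketch below is an illustration only: `match_chromever` is a made-up helper and the version strings are invented examples. In practice the available versions would come from something like `unlist(binman::list_versions("chromedriver"))` after the drivers have been downloaded, and your Chrome version from chrome://version.

```r
# Pick the installed chromedriver version whose major number matches Chrome's.
# available: character vector of driver versions
# chrome_version: the version of the installed Chrome browser
match_chromever <- function(available, chrome_version) {
  major <- function(v) sub("\\..*$", "", v)
  hits <- available[major(available) == major(chrome_version)]
  if (length(hits) == 0) return(NA_character_)
  # Newest matching build, compared as version numbers rather than as text
  hits[order(numeric_version(hits), decreasing = TRUE)][1]
}

# Invented example values
available <- c("112.0.5615.49", "113.0.5672.63", "114.0.5735.16", "114.0.5735.90")
chosen <- match_chromever(available, "114.0.5735.199")

# The result would then be passed on, for example:
# rs_driver_object <- RSelenium::rsDriver(browser = "chrome", chromever = chosen)
```

If the helper returns `NA`, none of the downloaded drivers matches the installed Chrome, which is the situation described in the question.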
@@SamerHijjazi I appreciate your response and assistance. Thank you very much. :)
@@eleonoras.2878 my latest Selenium video might actually be able to solve your issue. ruclips.net/video/BnY4PZyL9cg/видео.htmlsi=RP74unOe8SvxWvPV
Watching this tutorial immediately after your Introduction to RSelenium. Really enjoyed learning it from you Samer. How do we navigate to a webpage with username and password from RSelenium ?
Thank you, Arun! You can do so by identifying the username and password input boxes and sending the username and password to those boxes using the sendKeysToElement function from RSelenium
@@SamerHijjazi thank you so much...
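The login steps described in the reply can be sketched roughly as follows. The selectors, the URL and the `log_in` helper name are all placeholders for illustration, since every site names its input boxes differently.

```r
# Fill in a login form and submit it.
# remDr: an open RSelenium client. All selectors are site-specific placeholders.
log_in <- function(remDr, login_url, username, password,
                   user_css, pass_css, submit_css) {
  remDr$navigate(login_url)

  # sendKeysToElement expects a list of keys or strings
  user_box <- remDr$findElement(using = "css selector", value = user_css)
  user_box$sendKeysToElement(list(username))

  pass_box <- remDr$findElement(using = "css selector", value = pass_css)
  pass_box$sendKeysToElement(list(password))

  remDr$findElement(using = "css selector", value = submit_css)$clickElement()
  invisible(remDr)
}

# Hypothetical usage, reading credentials from environment variables
# instead of typing them into the script:
# log_in(remDr, "https://example.com/login",
#        Sys.getenv("SITE_USER"), Sys.getenv("SITE_PASSWORD"),
#        "#username", "#password", "button[type='submit']")
```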
Sorry, I am very much a beginner with all this, so sorry if this is a stupid question. I have a data table which I want to extract the information from, but when I inspect the code it doesn't have an ID. How can I go about selecting the data table without an ID? Thank you in advance
Hello Samer, I have exactly the same problem. Would you please help with this?
Not a stupid question at all! Try using a different attribute to identify your table by.
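As a small illustration of "a different attribute", a table can be picked by its class or simply by its position on the page once the page source is handed to rvest. The HTML below is invented and stands in for `remDr$getPageSource()[[1]]`. The class name `stats` is an assumption for the example.

```r
library(rvest)

# Stand-in for remDr$getPageSource()[[1]]; markup and class names are made up
page_source <- '<html><body>
  <table class="nav"><tr><td>menu</td></tr></table>
  <table class="stats">
    <tr><th>player</th><th>goals</th></tr>
    <tr><td>A</td><td>3</td></tr>
  </table>
</body></html>'

page <- read_html(page_source)

# Option 1: select by class (any other attribute works the same way in CSS)
by_class <- html_table(html_element(page, "table.stats"))

# Option 2: select by position, here the second <table> on the page
by_position <- html_table(html_elements(page, "table"))[[2]]
```

Selecting by position is the most fragile choice, because it breaks as soon as the site adds another table above the one you want.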
Hey Samer, love the tutorial but ran into an issue I couldn't resolve yet. I am using RSelenium to click on a tab that contains the data I want, which works fine if I run the lines of code one after the other, but not in a for loop. I have a list of links the loop should iterate through. On some tries it didn't even click the tab for the first list item, other times it stopped after just a couple.. after just adding a bunch of clickElement() commands it worked for a bit longer (but not directly related to the number of commands added) and then stopped again. Any idea how to make it run more stably? My R memory usage is kinda high, could it be due to that? Am a total noob at R, but it's confusing that it works manually but not in the loop
Edit: Also, the netstat free_port function always gives me an 'Error in strsplit(local, ":") : non-character argument'.. I wrote it exactly as you have, so no idea why it doesn't work.. if I define a port manually (e.g. 14415 or '14415') it says 'Error: port should be an integer value'.. my knowledge of maths might be limited but last time I checked 14415 was an integer lol
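On the "port should be an integer value" message: a likely explanation, offered here as an assumption rather than something confirmed in the thread, is that the check is about R's storage type and not about the value. A bare `14415` is stored as a double in R, and `'14415'` is a character string. The `L` suffix creates an integer.

```r
port <- 14415

plain_is_integer     <- is.integer(port)              # a bare number is a double
suffixed_is_integer  <- is.integer(14415L)            # the L suffix makes an integer
converted_is_integer <- is.integer(as.integer(port))  # converting an existing value

print(c(plain = plain_is_integer,
        suffixed = suffixed_is_integer,
        converted = converted_is_integer))

# With RSelenium loaded, the call would then look something like:
# rs_driver_object <- RSelenium::rsDriver(browser = "chrome", port = 14415L)
```

This does not address the `strsplit` error from `free_port()`, which is a separate problem.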
Thank you for the great feedback. I'd have to look at your code to be able to see what's going wrong
@@SamerHijjazi Thanks for the quick response. Thought it might be a common or known issue.. I have posted the code in a reddit thread titled "Impossible to run RSelenium's clickElement() in a loop??" 6 days ago
Only if you have time and interest tho, don't wanna force you to look at my spaghetti code haha
@@retobunzli2088 can't find it. Looks like the post was removed. Can you reply to this comment with your loop?
@@SamerHijjazi yeah, just saw it did get removed. The loop looks like this:
for (link in links) {
  remDr$navigate(link)
  results_object = remDr$findElement(...)
  results_object$clickElement()   # issue here (?)
  table_i_need = remDr$findElement(...)
  table_html = (...)$getPageSource()
and so on, exactly like you did in the video. It worked line by line, which means the CSS selectors should be fine, just that the click command doesn't reliably execute.. since the code above probably doesn't help much, the site is (google) 'iaaf 100m times men', then for every athlete I want to go to their profile, click the results tab (this is where it fails randomly) where all the 100m times from the current season are listed, and then extract these values via html table (or similar). The links seem to be correct too, just something about the dynamic nature of the specific site confuses the clickElement()
@@retobunzli2088 My guess is your loop is running too quickly: when it gets to the clickElement part, it can't locate the element because the web page is still loading. I would suggest you include a small break in your loop to create a pause long enough for the site to load. You can do so by using the Sys.sleep function
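The simplest form of that pause is a fixed `Sys.sleep(2)` after `remDr$navigate(link)`. A slightly more robust variation, sketched below, keeps retrying until the element shows up or a time limit passes. `find_when_ready`, `timeout` and `interval` are names invented for this example and are not part of RSelenium.

```r
# Poll for an element instead of failing on the first miss.
# remDr: an open RSelenium client. timeout and interval are in seconds.
find_when_ready <- function(remDr, using, value, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement throws while the element is not in the page yet
    element <- tryCatch(
      remDr$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(element)) return(element)
    if (Sys.time() >= deadline) stop("Timed out waiting for: ", value)
    Sys.sleep(interval)
  }
}

# Hypothetical use inside the loop (the selector is a placeholder):
# for (link in links) {
#   remDr$navigate(link)
#   results_tab <- find_when_ready(remDr, "css selector", "#results-tab")
#   results_tab$clickElement()
# }
```

Compared with a fixed pause, this waits only as long as needed on fast pages and fails with a clear message on pages that never load.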
Really helpful video. I would like to ask if you can make a similar video on scraping data from social media sites like Instagram or LinkedIn, or another site of your own preference?
Thank you! I don't think I will. LinkedIn is very difficult to scrape (plus they can close your account for it), and Instagram has its own API.
Would it be possible to hop on a zoom for help with a scraping project? I would really appreciate it
I'm currently not offering that. But I might be in the future :)
Amazing!!!
How do I set up the server in the Firefox browser?
awesome, thanks
Don't we have to check whether the site allows scraping first?
Sure! This is only for demonstration purposes. But it's good practice to check first.
Great!!!!
Thank you for this;
getting the below error
Error in java_check() :
PATH to JAVA not found. Please check JAVA is installed.
whenever running
rs_driver_object
You need to make sure the JDK is properly installed on your machine. If you're on a Windows machine, this tutorial is useful: ruclips.net/video/IJ-PJbvJBGs/видео.html
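A quick way to see what R itself can find is to ask it where `java` is. This is only a diagnostic sketch, with `java_on_path` as a made-up helper name. An R session started before Java was installed may still hold the old PATH, so restarting R or RStudio before checking is worth a try.

```r
# TRUE if a "java" executable is visible on the PATH of this R session
java_on_path <- function() {
  unname(nzchar(Sys.which("java")))
}

found <- java_on_path()
print(found)

# Extra context when it is FALSE
print(Sys.which("java"))        # path to the executable, "" if none was found
print(Sys.getenv("JAVA_HOME"))  # "" if the variable is not set
```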
Neat. I was struggling with some dataset (tiny one) that has commas.
Thank you for your extraordinary tutorial. I'd like to have your opinion on this error: Error in rbindlist(list(all_data, df)) :
Column 1 of item 2 is length 3 inconsistent with column 2 which is length 4. Only length-1 columns are recycled.
> Thank you so much.
Hey, I solved the error easily. Thanks anyways.
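The comment does not say how the error was fixed. One common cause of this kind of message is that the scraped columns came back with different lengths (for example 3 names but 4 prices). A minimal sketch of one possible workaround, assuming that cause, is to pad the shorter vectors with `NA` before building the data frame. `pad_to_data_frame` is a helper name invented here and the data is made up.

```r
# Pad every vector in a named list with NA up to the longest one,
# then turn the result into a data frame.
pad_to_data_frame <- function(columns) {
  n <- max(lengths(columns))
  padded <- lapply(columns, function(x) {
    length(x) <- n  # extending a vector fills the new slots with NA
    x
  })
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Made-up scrape result with mismatched lengths
scraped <- list(name = c("a", "b", "c"), price = c("1", "2", "3", "4"))
df <- pad_to_data_frame(scraped)
print(df)
```

Padding only silences the error. If a value is missing in the middle of a column, the rows end up misaligned, so scraping each row or card as a unit is usually the safer fix.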
👌👌👌. Can you create a video tutorial for the chromote package?
This is a good idea! I'd like to explore the package
Too good mate, is it possible to share your code? Thanks
Thank you for your feedback. I've added the link to the code in the description. :)