I love how much knowledge you can pack into such a short vid. I love the editing, the tone, the pace… your tutorials are definitely among the best on YouTube in any field!
man you should post more, love your videos, very easy to follow tutorials 🧡
Your videos are really great man, keep going!
Thank you! Glad you enjoyed it :-)
Sorry, I didn't realize you'd created a series, so I'm going through all of them! Once again, thanks for sharing... this is super useful!
You just helped take my research to an unexpected level of data collection. Thanks so much for this; your videos were an incredible support.
Thanks so much for this. Your videos have been SO helpful for my analytics class....you have no idea. I don't know if you get any financial benefit from these videos at all, but I wanted to express my appreciation
That's great to hear! I was a teaching assistant for a data science/analytics course last year and it's what inspired me to make these videos!
Great tutorial... I loved all four parts... You are awesome... Thank you so much for this... I have been looking for web scraping videos and this is the best... Thank you once again... God bless you...
Glad you enjoyed it!
Excellent video. Perfect example from which I could follow and apply to my own project. Great.
A big thanks to you. I've been looking for this on the internet for hours and you just explained it in 9 minutes. Thank you so much! Keep making videos!
thank you thank you thank you! amazing explanations and demonstration, really clear, thanks again!
Glad it was helpful!
Outstanding video 👏
Hey, great vid. However, after two years the IMDB website has made some changes: when you click the movie link it redirects you to a page that has all the details about the movie, but you can't get the cast using SelectorGadget because the cast are in div elements. You have to click another link on that page, "All cast and crew", and then on that page you can select the cast. So how do we do it, since we can't use the movie_links in the sapply function as it returns nothing?
Awesome
Thank you very very very much 👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏
great explanation !!
Nice presentation, waiting for web scraping of multiple pages
Here's a link to part 3! ruclips.net/video/28pyEDV9mMw/видео.html
Any idea how this would work with continuous/infinite scrolling?
@@MMansouri I will definitely make a video at some point down the line on how to do this -- but I believe you need a combination of rvest and the RSelenium package, which lets you emulate a web browser and make it scroll down
dataslice Thanks for the reply! I thought so too. Looking forward to it... Thanks for the great videos.
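For anyone finding this thread later, here is a very rough sketch of the scroll-then-scrape idea mentioned above. It assumes the RSelenium package plus a working Selenium/browser-driver setup, and the URL and scroll count are placeholders -- treat every detail as an assumption to verify, not a recipe:

```r
# SKETCH ONLY: needs RSelenium and a browser driver installed;
# the URL below is a placeholder, not a real page.
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox")   # starts a Selenium server + browser
remDr  <- driver$client
remDr$navigate("https://example.com/infinite-scroll-page")

# Scroll to the bottom a few times so more content loads
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to load new items
}

# Hand the fully rendered HTML over to rvest as usual
page <- read_html(remDr$getPageSource()[[1]])

remDr$close()
driver$server$stop()
```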
Great tutorial, thanks a lot!
Great job !!! Thanks for your effort and time
No problem, thanks for watching!
Thanks for these tutorials. I tried the code and the variables remain empty. Could this problem be due to a new site restriction? SelectorGadget is not working like in your video now.
I tried doing this on Google News to extract the articles from each of the URLs/headlines. It only works for a single sample URL and not for all of them. Please help. 🙏🙏🙏
Incredible! You literally ELI5'd it
hi.. this is very informative.. unfortunately when I try to do Wikipedia scraping I get the following error: "Error in open.connection(x, "rb") : HTTP error 404". What am I doing wrong?
Is it possible the link was wrong? I haven't run into the error but it looks like there are some potential solutions on Google that might be worth looking at
actually I got the same problem and I didn't find a solution on Google
Error in open.connection(x, "rb") : Could not resolve host: url
Thank you so much for your great tutorial!! I have a question: what should I do when the error "Error in data.frame(name, year, rating, sys, cast, stringsAsFactors = FALSE) : arguments imply differing number of rows: 50, 45" appears, because there are 5 movies on IMDB without a rating?
same with me
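A common fix for that 50-vs-45 mismatch, sketched offline here with made-up HTML and selectors (not IMDB's real class names): first grab one node per movie container, then pull each field from *within* that container with html_node (singular), which returns NA for missing fields instead of silently dropping them.

```r
library(rvest)

# Toy HTML standing in for a results page: two movie "containers",
# the second one has no rating element -- the situation that causes
# the "differing number of rows" error.
html <- read_html('
  <div class="lister-item"><h3 class="title">Movie A</h3><span class="rating">8.1</span></div>
  <div class="lister-item"><h3 class="title">Movie B</h3></div>
')

# One node per movie first...
items <- html %>% html_nodes(".lister-item")

# ...then extract each field per container. html_node (singular) yields
# NA for a missing match, so every column keeps the same length.
name   <- items %>% html_node(".title")  %>% html_text()
rating <- items %>% html_node(".rating") %>% html_text()  # Movie B -> NA, not dropped

data.frame(name, rating, stringsAsFactors = FALSE)
```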
It sucks that IMDB updated its page. I am having issues with the crew section in the data frame. This is because they changed the elements for the cast on movie pages; to get to the cast as displayed in your video (in a form that lets SelectorGadget work), you need to click a separate link. I am very inexperienced, so wiring that extra click into the program isn't something I'm capable of doing.
A minor adjustment can be made to account for the IMDB layout change. All you have to do is replace a segment of the film page's URL with a segment from the full cast page's URL. Doing so gives you the link to that film's full cast page, where the CSS selector for the cast names remains the same as in the tutorial. To do this, add a single line to the pipeline for movie_links:

movie_links = page %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href") %>%
  str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = "fullcredits?ref_=tt_cl_sm") %>%
  paste0("www.imdb.com", .)

Make sure you load the stringr library, or you won't be able to use str_replace(). The fixed() wrapper makes the pattern match literally, which matters here because "?" is a regex metacharacter.
How are you getting the table output in a different tab, and the format looks so clean?
Awesome!
Hello Dataslice, thanks for this awesome tutorial. When I do this for another website, the href value is not visible. They have written something else instead of href. Is there any way to deal with this issue?
The same thing happened to me. I got through it by inspecting whether the attribute is "href" or something else, then copying the different tags into the script until it finally worked.
If you know a better way to fix it, please reply
When you are dragging over the line @1.25, the console instantly shows what is inside, but mine didn't. How do I set that up? I'm just new to learning this. Thank you
I'm actually hitting Command + Enter (Control + Enter for PC) which is a shortcut to run the highlighted code -- apologies for the confusion!
Thank you so much! You explained a complicated question well in a short video. My computer spends more than a minute running the cast = sapply(...) line, and I was confused about why it takes so long. Would you mind explaining the logic behind why this code takes so long to run?
Good question. sapply() acts as a for loop in this case; it takes in movie_links (which is a vector of movie link URLs), iterates through each URL and calls the get_cast function on it (which scrapes the actors on that link), and finally returns a vector with all the results from each URL. The reason it takes so long is that it's scraping n pages, where n is the number of URLs in movie_links.
@@dataslice Thank you! I understand the logic now. However, a few more questions came to mind. Does internet speed affect the running speed? Are there any approaches that can reduce the time consumed when scraping a larger amount of data (i.e. scraping 10,000 movies)? Sorry for asking so much :) My major is statistics and actuarial studies, so I learned R at uni for statistical purposes only and have ground-level knowledge of data science.
@@timfei7221 Yes, internet speed will definitely affect how quickly the code runs since the web requests are being made from your computer. I can't think of a way of reducing the overall time, but it might be better to scrape multiple batches of URLs (e.g. 100 at a time) instead of just one giant list. It wouldn't make it quicker but you could append the results to a data frame and even save it to a .csv so you could at least see the results incrementally.
@@dataslice Thank you so much!
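A minimal sketch of the batching-plus-incremental-save idea from this thread. get_cast here is a fake stand-in (no network calls) just so the mechanics are visible; in practice it would be the real scraper from the video:

```r
# Hypothetical stand-in for the real scraper (the actual one makes
# a network request per URL and returns the cast string).
get_cast <- function(url) paste("cast of", url)

movie_links <- paste0("www.imdb.com/title/tt", 1:250)  # made-up URLs

# Split the links into batches of 100
batches <- split(movie_links, ceiling(seq_along(movie_links) / 100))

# Scrape one batch at a time and append each batch's results to a CSV,
# so a crash partway through doesn't lose everything already scraped.
for (i in seq_along(batches)) {
  cast <- sapply(batches[[i]], get_cast)
  df <- data.frame(link = batches[[i]], cast = cast, stringsAsFactors = FALSE)
  write.table(df, "movies.csv", sep = ",", append = (i > 1),
              col.names = (i == 1), row.names = FALSE)
}
```

This doesn't make the total scrape faster (the time is network-bound), but you can watch the CSV grow and resume from the last finished batch.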
Thanks for this, bro! I was able to scrape propertyfinder because of this. One problem though: how can we make the program faster? Whenever it scrapes across multiple pages, it always takes time, rather than just scraping the outside details.
Why did you define (movie_link) in the get_cast function and not (movie_links) (LINE 14)? I used your code and still got "object movie_cast not found". Help me out, please!
movie_link was just a variable name for the input of the function. For example, I could define:

AddTwo = function(movie_link) {
  return(movie_link + 2)
}

X = 5
AddTwo(X)  # should give me 7
Is your data frame at the end (movies) fully populated?
For some elements, like rating and runtime, the number of rows doesn't match. What can be done if there are missing values for the CSS selectors?
And what if some extra values show up?
Nice videos! Thanks a lot.
The sapply function is relatively slow (about 1 sec per link). What is the reason sapply takes so long? Is there a way to make it faster?
There is a function called "map" from the purrr package. It is now common to use that function for iterative tasks
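A tiny illustration of the sapply/map correspondence. Note that purrr's map family iterates one element at a time just like sapply, so it tidies the code and makes the return type explicit, but it won't by itself speed up network-bound scraping:

```r
library(purrr)

square <- function(x) x^2

squares_sapply <- sapply(1:5, square)   # base R: type inferred from results
squares_map    <- map_dbl(1:5, square)  # purrr: always returns a double vector
```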
I am getting an error when I run line 21: "HTTP error 400". What does this mean?
It may be that the URL is invalid. You might want to print out the URL and see if you can go to it in a web browser, to make sure it actually exists
Thanks, this really helps with the project I'm attempting. However, I have one issue that I cannot find the solution for. I'm trying to grab information from within a page, but the function to retrieve the information doesn't work. Instead, read_html("link") fails with an error stating the link argument is missing with no default, although it's defined identically to how you defined movie_links. When I view the links variable by itself it looks perfect: the list is how it should be, with the address correct for each example player (I'm scraping a football website). I think everything else is working as intended; I just can't finish the data frame because this can't be grabbed for whatever reason. Any help would be much appreciated. Thanks.
Are you doing read_html(link) or read_html(“link”)? The first is passing in a variable named link and the second is the string “link”
Hello, does anyone have an idea how to scrape a webpage's clickable parts? For example, there is a table on the page with a button to collapse and expand it. When I scrape, I can only get the collapsed parts, and I need the expanded parts. Thanks in advance
I followed the same steps but am getting an error: "Error in open.connection(x, "rb") : HTTP error 400." I think it is because of web security...
For me it always separates with a space when pasting to concatenate the links. If I add seq="" it concatenates with an extra space at the end. If I leave seq out completely, I don't get the extra space at the end of the link, but no matter what I do the space between my main URL and the href is not going away. Any help?
Edit: paste0() worked! Without any seq="" specification. Thanks ChatGPT ;)
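For anyone else hitting this: the argument is spelled sep=, not seq=. Since seq isn't an argument of paste(), R quietly treats seq="" as one more string to paste, which is exactly what produces the extra trailing space. A quick demo (with a made-up IMDB href):

```r
base <- "www.imdb.com"
href <- "/title/tt0111161/"

paste(base, href)            # default sep is a space -> space in the middle
paste(base, href, seq = "")  # typo: "" is pasted as an EXTRA element -> trailing space too
paste(base, href, sep = "")  # correct spelling -> no spaces
paste0(base, href)           # shorthand for sep = ""
```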
Fantastic
Hello, great tutorial. I am using it to web scrape job listings for a project. I continuously keep getting "character(0)" when I try to get the job description for just one listing. Is there any way around this? Thanks!
I was also getting the same for the movie cast. I removed one tag from the selector, i.e. selected the correct CSS parameter for html_nodes, and it worked... e.g. for cast, ".primary_photo+td" worked for me; in the video ".primary_photo +td a" was used
Hi, awesome video. You made it very clear and simple. Can you make a video demonstrating how to scrape websites that use JavaScript to render content? The approach will be a little different, I think. Congrats.
Great video! how would you go about getting more than one variable besides cast? I tried to do this and I kept getting an error saying that the read_html argument is missing with no default. Also how would you separate the cast row into separate columns?
Which variable were you trying to grab besides cast? And for separating the cast, the tidyr 'separate' function should allow you to split a column into multiple columns by a character - that might work
@@dataslice I was trying to get the director for each movie along with the cast variable - are you able to do that?
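To make the separate suggestion concrete, here's a minimal sketch with tidyr::separate. The data frame and column names below are made up for illustration; it assumes the cast column is comma-separated, as in the tutorial:

```r
library(tidyr)

movies <- data.frame(
  name = "Movie A",
  cast = "Actor 1, Actor 2, Actor 3",  # one comma-separated string per movie
  stringsAsFactors = FALSE
)

# Split the single cast column into one column per actor
movies_wide <- separate(movies, cast,
                        into = c("actor_1", "actor_2", "actor_3"),
                        sep = ", ")
```

Note that this assumes a fixed number of actors per row; rows with fewer values get NA (with a warning) unless you pass fill = "right".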
You are a God
Absolutely great! Clear and to the point, the best tutorial I have found so far. I have a question: I have tried this method on a different website, but when I collect the data into a dataframe I get the following error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
This is probably because some pages returned no data. Could you help please?
Many thanks for the great videos!
Were you able to figure it out? It could be because the html_nodes() argument is incorrect / doesn't exist on the page. What does your code look like?
Did you figure it out? I have the same problem on my code as well. But I noticed that it works sometimes but sometimes it doesn't. It's really weird.
@@dataslice How do we avoid it? Is it a good approach to use tryCatch inside the function (returning NA for empty info)?
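A sketch of that tryCatch idea: wrap the per-page scrape so that an empty match (or a failed request) yields NA instead of character(0), keeping the data-frame columns the same length. The ".cast" selector and the toy HTML below are made up for illustration; in real use you'd pass read_html(url) in:

```r
library(rvest)

# Returns one string per page, never an empty vector: NA when the
# selector matches nothing or anything inside errors out.
get_cast_safe <- function(html) {
  tryCatch({
    cast <- html %>% html_nodes(".cast") %>% html_text()
    if (length(cast) == 0) NA_character_
    else paste(cast, collapse = ", ")
  }, error = function(e) NA_character_)
}

page_with    <- read_html('<div class="cast">Actor 1</div><div class="cast">Actor 2</div>')
page_without <- read_html('<div class="other">no cast here</div>')

get_cast_safe(page_with)     # "Actor 1, Actor 2"
get_cast_safe(page_without)  # NA
```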
love you
Fantastic! Really, really useful. By the way, I got the error "arguments imply differing number of rows: 24, 23" when scraping my page; can you give any advice on how to fix that?
I just responded to your other comment!
What if the html_node has no href or url? I'm following along using a Goodreads list.
The list and book urls take the following forms, though this may be irrelevant:
Book page: "www.goodreads.com/book/show/..."
List page: "www.goodreads.com/list/show/..."
From the console:
> page %>%
+ html_nodes(".bookTitle span")
{xml_nodeset (100)}
[1] <span ...>Don't Close Your Eyes</span>
[2] <span ...>To Kill a Mockingbird</span>
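A guess at what's happening with the Goodreads list: ".bookTitle span" selects the inner span, which holds the title text but carries no href; the link lives on the enclosing a element with class bookTitle. A toy reproduction (the HTML below is made up, not real Goodreads markup):

```r
library(rvest)

# Toy HTML mimicking one Goodreads list row: the title <span> sits
# inside an <a class="bookTitle">, and only the <a> has the href.
page <- read_html('
  <a class="bookTitle" href="/book/show/123"><span>To Kill a Mockingbird</span></a>
')

# Selecting the span gives the title text but NA for the href:
page %>% html_nodes(".bookTitle span") %>% html_attr("href")  # NA

# Select the anchor itself to get the link:
book_links <- page %>% html_nodes("a.bookTitle") %>%
  html_attr("href") %>%
  paste0("www.goodreads.com", .)
```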
Thanks for sharing the tutorial, really useful. I tried to use the same logic you showed to build the page link "www.hr.gov.ge/JobProvider/UserOrgVaks/Details/62799":
html_nodes(...) %>% html_attr("href") %>% paste("www.hr.gov.ge", ., sep="")
Somehow it does not work; any suggestion on it? Once again, thank you in advance.