I love how much knowledge you can pack into such a short vid. I love the editing, the tone, the pace… your tutorials are definitely among the best on YouTube in any field!
man you should post more, love your videos, very easy to follow tutorials 🧡
Your videos are really great man, keep going!
Thank you! Glad you enjoyed it :-)
Sorry, I didn't realize you'd created a series, so I'm going through all of them! Once again, thanks for sharing... this is super useful!
You just helped take my research to an unexpected level of data collection. Thanks so much for this; your videos were an incredible support.
Thanks so much for this. Your videos have been SO helpful for my analytics class....you have no idea. I don't know if you get any financial benefit from these videos at all, but I wanted to express my appreciation
That's great to hear! I was a teaching assistant for a data science/analytics course last year and it's what inspired me to make these videos!
Great tutorial... I loved all four parts... You are awesome... Thank you so much for this... I have been looking for web scraping videos and this is the best... Thank you once again... God bless you...
Glad you enjoyed it!
Excellent video. Perfect example from which I could follow and apply to my own project. Great.
A big thanks to you. I've been looking for this on the internet for hours and you just explained it in 9 minutes. Thank you so much! Keep making videos!
thank you thank you thank you! amazing explanations and demonstration, really clear, thanks again!
Glad it was helpful!
Outstanding video 👏
Hey, great vid. However, after two years the IMDB website has made some changes: when you click the movie link it redirects you to a page that has all the details about the movie, but you can't get the cast using SelectorGadget because the cast are in div elements. You have to click another link on that page, "All cast and crew", and then on that page you can select the cast. So how do we do it, since we can't use the movie_links in the sapply function as it returns nothing?
Awesome
Thank you very very very much 👏👏👏👏👏👏👏👏👏👏👏👏👏👏👏
great explanation !!
Nice presentation, waiting for web scraping of multiple pages
Here's a link to part 3! ruclips.net/video/28pyEDV9mMw/видео.html
Any idea how this would work with continuous/infinite scrolling?
@@MMansouri I will definitely make a video at some point down the line on how to do this -- but I believe you need a combination of rvest and the RSelenium package, which lets you emulate a web browser and make it scroll down
dataslice Thanks for the reply! I thought so too. Looking forward to it... Thanks for the great videos.
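For anyone finding this thread later, here is a very rough sketch of the scroll-then-scrape idea mentioned above. It assumes the RSelenium package plus a working Selenium/browser-driver setup, and the URL and scroll count are placeholders -- treat every detail as an assumption to verify, not a recipe:

```r
# SKETCH ONLY: needs RSelenium and a browser driver installed;
# the URL below is a placeholder, not a real page.
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox")   # starts a Selenium server + browser
remDr  <- driver$client
remDr$navigate("https://example.com/infinite-scroll-page")

# Scroll to the bottom a few times so more content loads
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to load new items
}

# Hand the fully rendered HTML over to rvest as usual
page <- read_html(remDr$getPageSource()[[1]])

remDr$close()
driver$server$stop()
```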
Great tutorial, thanks a lot!
Great job !!! Thanks for your effort and time
No problem, thanks for watching!
Thanks for these tutorials. I tried the code and the variables remain empty. Could this problem be due to a new site restriction? SelectorGadget is not working like in your video now.
I tried doing this on Google News to extract the articles from each of the URLs/headlines. It only works for a single sample URL and not for all of them. Please help. 🙏🙏🙏
Incredible! You literally ELI5'd it
hi.. this is very informative.. unfortunately when I try to do Wikipedia scraping I get the following error: "Error in open.connection(x, "rb") : HTTP error 404". What am I doing wrong?
Is it possible the link was wrong? I haven't run into the error but it looks like there are some potential solutions on Google that might be worth looking at
actually I got the same problem and I didn't find a solution on Google
Error in open.connection(x, "rb") : Could not resolve host: url
Thank you so much for your great tutorial!! I have a question: what should I do when the error "Error in data.frame(name, year, rating, sys, cast, stringsAsFactors = FALSE) : arguments imply differing number of rows: 50, 45" appears, because there are 5 movies on IMDB without a rating?
same with me
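A common fix for that 50-vs-45 mismatch, sketched offline here with made-up HTML and selectors (not IMDB's real class names): first grab one node per movie container, then pull each field from *within* that container with html_node (singular), which returns NA for missing fields instead of silently dropping them.

```r
library(rvest)

# Toy HTML standing in for a results page: two movie "containers",
# the second one has no rating element -- the situation that causes
# the "differing number of rows" error.
html <- read_html('
  <div class="lister-item"><h3 class="title">Movie A</h3><span class="rating">8.1</span></div>
  <div class="lister-item"><h3 class="title">Movie B</h3></div>
')

# One node per movie first...
items <- html %>% html_nodes(".lister-item")

# ...then extract each field per container. html_node (singular) yields
# NA for a missing match, so every column keeps the same length.
name   <- items %>% html_node(".title")  %>% html_text()
rating <- items %>% html_node(".rating") %>% html_text()  # Movie B -> NA, not dropped

data.frame(name, rating, stringsAsFactors = FALSE)
```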
It sucks that IMDB updated its page. I am having issues with the crew section in the data frame. This is because they changed the elements for the cast on movie pages; to get to the cast as displayed in your video (in a form that lets SelectorGadget work), you need to click a separate link. I am very inexperienced, so wiring that extra click into the program isn't something I'm capable of doing.
A minor adjustment can be made to account for the IMDB layout change. All you have to do is replace a segment of the film page's URL with a segment from the full cast page's URL. Doing so gives you the link to that film's full cast page, where the CSS selector for the cast names remains the same as in the tutorial. To do this, add a single line to the pipeline for movie_links:

movie_links = page %>%
  html_nodes(".lister-item-header a") %>%
  html_attr("href") %>%
  str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = "fullcredits?ref_=tt_cl_sm") %>%
  paste0("www.imdb.com", .)

Make sure you load the stringr library, or you won't be able to use str_replace(). The fixed() wrapper makes the pattern match literally, which matters here because "?" is a regex metacharacter.
How are you getting the table output in a different tab, and the format looks so clean?
Awesome!
Hello Dataslice, thanks for this awesome tutorial. When I do this for another website, the href value is not visible. They have written something else instead of href. Is there any way to deal with this issue?
The same thing happened to me. I got through it by inspecting whether the attribute is "href" or something else, then copying the different tags into the script until it finally worked.
If you know a better way to fix it, please reply
When you are dragging over the line @1.25, the console instantly shows what is inside, but mine didn't. How do I set that up? I'm just new to learning this. Thank you
I'm actually hitting Command + Enter (Control + Enter for PC) which is a shortcut to run the highlighted code -- apologies for the confusion!
Thank you so much! You explained a complicated question well in a short video. My computer spends more than a minute running the cast = sapply(...) line, and I was confused about why it takes so long. Would you mind explaining the logic behind why this code takes so long to run?
Good question. sapply() acts as a for loop in this case; it takes in movie_links (which is a vector of movie link URLs), iterates through each URL and calls the get_cast function on it (which scrapes the actors on that link), and finally returns a vector with all the results from each URL. The reason it takes so long is that it's scraping n pages, where n is the number of URLs in movie_links.
@@dataslice Thank you! I understand the logic now. However, a few more questions came to mind. Does internet speed affect the running speed? Are there any approaches that can reduce the time consumed when scraping a larger amount of data (i.e. scraping 10,000 movies)? Sorry for asking so much :) My major is statistics and actuarial studies, so I learned R at uni for statistical purposes only and have ground-level knowledge of data science.
@@timfei7221 Yes, internet speed will definitely affect how quickly the code runs since the web requests are being made from your computer. I can't think of a way of reducing the overall time, but it might be better to scrape multiple batches of URLs (e.g. 100 at a time) instead of just one giant list. It wouldn't make it quicker but you could append the results to a data frame and even save it to a .csv so you could at least see the results incrementally.
@@dataslice Thank you so much!
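A minimal sketch of the batching-plus-incremental-save idea from this thread. get_cast here is a fake stand-in (no network calls) just so the mechanics are visible; in practice it would be the real scraper from the video:

```r
# Hypothetical stand-in for the real scraper (the actual one makes
# a network request per URL and returns the cast string).
get_cast <- function(url) paste("cast of", url)

movie_links <- paste0("www.imdb.com/title/tt", 1:250)  # made-up URLs

# Split the links into batches of 100
batches <- split(movie_links, ceiling(seq_along(movie_links) / 100))

# Scrape one batch at a time and append each batch's results to a CSV,
# so a crash partway through doesn't lose everything already scraped.
for (i in seq_along(batches)) {
  cast <- sapply(batches[[i]], get_cast)
  df <- data.frame(link = batches[[i]], cast = cast, stringsAsFactors = FALSE)
  write.table(df, "movies.csv", sep = ",", append = (i > 1),
              col.names = (i == 1), row.names = FALSE)
}
```

This doesn't make the total scrape faster (the time is network-bound), but you can watch the CSV grow and resume from the last finished batch.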
Thanks for this, bro! I was able to scrape propertyfinder because of this. One problem though: how can we make the program faster? Whenever it scrapes across multiple pages, it always takes time, rather than just scraping the outside details.
Why did you define (movie_link) in the get_cast function and not (movie_links) (LINE 14)? I used your code and still got "object movie_cast not found". Help me out, please!
movie_link was just a variable name for the input of the function. For example, I could define:

AddTwo = function(movie_link) {
  return(movie_link + 2)
}

X = 5
AddTwo(X)  # should give me 7
Is your data frame at the end (movies) fully populated?
For some elements, like rating and runtime, the number of rows doesn't match. What can be done if there are missing values for the CSS selectors?
And what if some extra values show up?
Nice videos! Thanks a lot.
The sapply function is relatively slow (about 1 sec per link). What is the reason sapply takes so long? Is there a way to make it faster?
There is a function called "map" from the purrr package. It is now common to use that function for iterative tasks
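A tiny illustration of the sapply/map correspondence. Note that purrr's map family iterates one element at a time just like sapply, so it tidies the code and makes the return type explicit, but it won't by itself speed up network-bound scraping:

```r
library(purrr)

square <- function(x) x^2

squares_sapply <- sapply(1:5, square)   # base R: type inferred from results
squares_map    <- map_dbl(1:5, square)  # purrr: always returns a double vector
```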
I am getting an error when I run line 21: "HTTP error 400". What does this mean?
It may be that the URL is invalid. You might want to print out the URL and see if you can go to it in a web browser, to make sure it actually exists
Thanks, this really helps with the project I'm attempting. However, I have one issue that I cannot find the solution for. I'm trying to grab information from within a page, but the function to retrieve the information doesn't work. Instead, read_html("link") fails with an error stating the link argument is missing with no default, although it's defined identically to how you defined movie_links. When I view the links variable by itself it looks perfect: the list is how it should be, with the address correct for each example player (I'm scraping a football website). I think everything else is working as intended; I just can't finish the data frame because this can't be grabbed for whatever reason. Any help would be much appreciated. Thanks.
Are you doing read_html(link) or read_html(“link”)? The first is passing in a variable named link and the second is the string “link”
Hello, does anyone have an idea how to scrape a webpage's clickable parts? For example, there is a table on the page with a button to collapse and expand it. When I scrape, I can only get the collapsed parts, and I need the expanded parts. Thanks in advance
I followed the same steps but am getting an error: "Error in open.connection(x, "rb") : HTTP error 400." I think it is because of web security...
For me it always separates with a space when pasting to concatenate the links. If I add seq="" it concatenates with an extra space at the end. If I leave seq out completely, I don't get the extra space at the end of the link, but no matter what I do the space between my main URL and the href is not going away. Any help?
Edit: paste0() worked! Without any seq="" specification. Thanks ChatGPT ;)
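For anyone else hitting this: the argument is spelled sep=, not seq=. Since seq isn't an argument of paste(), R quietly treats seq="" as one more string to paste, which is exactly what produces the extra trailing space. A quick demo (with a made-up IMDB href):

```r
base <- "www.imdb.com"
href <- "/title/tt0111161/"

paste(base, href)            # default sep is a space -> space in the middle
paste(base, href, seq = "")  # typo: "" is pasted as an EXTRA element -> trailing space too
paste(base, href, sep = "")  # correct spelling -> no spaces
paste0(base, href)           # shorthand for sep = ""
```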
Fantastic
Hello, great tutorial. I am using it to web scrape job listings for a project. I continuously keep getting "character(0)" when I try to get the job description for just one listing. Is there any way around this? Thanks!
I was also getting the same for the movie cast. I removed one tag from the selector, i.e. selected the correct CSS parameter for html_nodes, and it worked... e.g. for cast, ".primary_photo+td" worked for me; in the video ".primary_photo +td a" was used
Hi, awesome video. You made it very clear and simple. Can you make a video demonstrating how to scrape websites that use JavaScript to render content? The approach will be a little different, I think. Congrats.
Great video! how would you go about getting more than one variable besides cast? I tried to do this and I kept getting an error saying that the read_html argument is missing with no default. Also how would you separate the cast row into separate columns?
Which variable were you trying to grab besides cast? And for separating the cast, the tidyr 'separate' function should allow you to split a column into multiple columns by a character - that might work
@@dataslice I was trying to get the director for each movie along with the cast variable - are you able to do that?
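To make the separate suggestion concrete, here's a minimal sketch with tidyr::separate. The data frame and column names below are made up for illustration; it assumes the cast column is comma-separated, as in the tutorial:

```r
library(tidyr)

movies <- data.frame(
  name = "Movie A",
  cast = "Actor 1, Actor 2, Actor 3",  # one comma-separated string per movie
  stringsAsFactors = FALSE
)

# Split the single cast column into one column per actor
movies_wide <- separate(movies, cast,
                        into = c("actor_1", "actor_2", "actor_3"),
                        sep = ", ")
```

Note that this assumes a fixed number of actors per row; rows with fewer values get NA (with a warning) unless you pass fill = "right".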
You are a God
Absolutely great! Clear and to the point, the best tutorial I have found so far. I have a question: I have tried this method on a different website, but when I collect the data into a dataframe I get the following error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
This is probably because some pages returned no data. Could you help please?
Many thanks for the great videos!
Were you able to figure it out? It could be because the html_nodes() argument is incorrect / doesn't exist on the page. What does your code look like?
Did you figure it out? I have the same problem on my code as well. But I noticed that it works sometimes but sometimes it doesn't. It's really weird.
@@dataslice How do we avoid it? Is it a good approach to use tryCatch inside the function (returning NA for empty info)?
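A sketch of that tryCatch idea: wrap the per-page scrape so that an empty match (or a failed request) yields NA instead of character(0), keeping the data-frame columns the same length. The ".cast" selector and the toy HTML below are made up for illustration; in real use you'd pass read_html(url) in:

```r
library(rvest)

# Returns one string per page, never an empty vector: NA when the
# selector matches nothing or anything inside errors out.
get_cast_safe <- function(html) {
  tryCatch({
    cast <- html %>% html_nodes(".cast") %>% html_text()
    if (length(cast) == 0) NA_character_
    else paste(cast, collapse = ", ")
  }, error = function(e) NA_character_)
}

page_with    <- read_html('<div class="cast">Actor 1</div><div class="cast">Actor 2</div>')
page_without <- read_html('<div class="other">no cast here</div>')

get_cast_safe(page_with)     # "Actor 1, Actor 2"
get_cast_safe(page_without)  # NA
```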
love you
Fantastic! Really, really useful. By the way, I got the error "arguments imply differing number of rows: 24, 23" when scraping my page; can you give any advice on how to fix that?
I just responded to your other comment!
What if the html_node has no href or url? I'm following along using a Goodreads list.
The list and book urls take the following forms, though this may be irrelevant:
Book page: "www.goodreads.com/book/show/..."
List page: "www.goodreads.com/list/show/..."
From the console:
> page %>%
+ html_nodes(".bookTitle span")
{xml_nodeset (100)}
[1] <span ...>Don't Close Your Eyes</span>
[2] <span ...>To Kill a Mockingbird</span>
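A guess at what's happening with the Goodreads list: ".bookTitle span" selects the inner span, which holds the title text but carries no href; the link lives on the enclosing a element with class bookTitle. A toy reproduction (the HTML below is made up, not real Goodreads markup):

```r
library(rvest)

# Toy HTML mimicking one Goodreads list row: the title <span> sits
# inside an <a class="bookTitle">, and only the <a> has the href.
page <- read_html('
  <a class="bookTitle" href="/book/show/123"><span>To Kill a Mockingbird</span></a>
')

# Selecting the span gives the title text but NA for the href:
page %>% html_nodes(".bookTitle span") %>% html_attr("href")  # NA

# Select the anchor itself to get the link:
book_links <- page %>% html_nodes("a.bookTitle") %>%
  html_attr("href") %>%
  paste0("www.goodreads.com", .)
```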
Thanks for sharing the tutorial, really useful. I tried to use the same logic you showed to build the page link "www.hr.gov.ge/JobProvider/UserOrgVaks/Details/62799":
html_nodes(...) %>% html_attr("href") %>% paste("www.hr.gov.ge", ., sep="")
Somehow it does not work; any suggestion on it? Once again, thank you in advance.