Web Scrape Text from ANY Website - Web Scraping in R (Part 1)

  • Published: Dec 1, 2024

Comments • 181

  • @AbhijeetSinghs
    @AbhijeetSinghs 4 years ago +23

    Learned more from this than from other 1hr long videos. Thanks for making this video.

  • @-Exen-
    @-Exen- 2 months ago

    This is the first coding project I've ever completed. Your tutorial was extremely intuitive, thank you!

  • @Unbox747
    @Unbox747 3 years ago +2

    You have such a calming voice and such clear explanations!

  • @TheMunishk
    @TheMunishk 3 years ago +3

    Some people can explain things in a neat and simple manner. This video does that

  • @deltax7159
    @deltax7159 2 years ago +2

    great video, been a SAS user for a while but really getting into R, your videos really help, thank you!

  • @kathrynwang4081
    @kathrynwang4081 3 years ago +5

    I thought it was really helpful that you explained what the rvest functions are doing! Thank you!

  • @alvaromartinez3310
    @alvaromartinez3310 2 years ago +3

    Excellent tutorial, I've been searching for this for a long time. Thank you so much, bro. Here you have a new sub

  • @willykitheka7618
    @willykitheka7618 3 years ago +4

    I can't believe you have taught me web scraping in 8 minutes! Thanks a heap! Ooh, I subscribed!

  • @rahulraghavan1894
    @rahulraghavan1894 3 years ago +7

    Amazing tutorial. Quality content!! Subscribed immediately after I saw this one tutorial. Hats off for the good work.

  • @barankaypakoglu7643
    @barankaypakoglu7643 2 years ago +4

    Very clean explanation. Super useful stuff! thank you for this

  • @josephfife8946
    @josephfife8946 2 years ago +5

    Such a great video! Thanks for putting this together! I love how clear and concise you were with each part!
    When I was following along I decided (for my personal use of scraping IMDb) that the content rating (G, PG, PG-13, R, etc.) was important, but I had some issues adding it to the table since not every movie had a content rating available. This is what I ended up doing to get around that issue, in case anyone else finds this useful.
    #Part 1 Select the content rating and a variable that does not change (this one ended up having the text "Rate this")
    get_rating = page %>% html_nodes(".rate , .certificate") %>% html_text()
    #Part 2 Make a for loop that adds in "Not provided" when a movie does not have a rating
    i = 1
    is_null = "Rate this"
    content_rating = "Rate this"
    count_rate = 1
    for(i in get_rating){
    if(get_rating[count_rate] == is_null) {
    content_rating
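
Editor's note: several comments in this thread hit the "differing number of rows" error when a field is missing for some movies. One common way around it, sketched here and not taken from the video, is to select each movie's container first and then look up one child per container with html_node(), which yields NA where nothing matches. The inline HTML and the class names ".lister-item", ".title" and ".certificate" are assumptions for illustration; check the real page with SelectorGadget.

```r
library(rvest)

# Assumption: a tiny inline page stands in for read_html(link).
# The second movie has no certificate element.
html <- '<html><body>
<div class="lister-item"><h3 class="title">Movie A</h3><span class="certificate">PG</span></div>
<div class="lister-item"><h3 class="title">Movie B</h3></div>
<div class="lister-item"><h3 class="title">Movie C</h3><span class="certificate">R</span></div>
</body></html>'
page <- read_html(html)

# One node per movie, so every column ends up with the same length
items <- page %>% html_nodes(".lister-item")

# html_node() (singular) returns exactly one result per item, NA if absent
name <- items %>% html_node(".title") %>% html_text()
content_rating <- items %>% html_node(".certificate") %>% html_text()

# Replace the missing values with a label
content_rating[is.na(content_rating)] <- "Not provided"

movies <- data.frame(name, content_rating, stringsAsFactors = FALSE)
print(movies)
```

Because every column is derived from the same list of containers, data.frame() no longer sees vectors of different lengths.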

  • @silvestrecamposano6317
    @silvestrecamposano6317 5 months ago

    Thank you for the very simplified explanation that we are able to understand.

  • @haoranliu8204
    @haoranliu8204 3 years ago +4

    This is the all time best tutorial!

  • @scpbm
    @scpbm 3 years ago +1

    You've just helped me save time, as I am gathering data from different websites. Thanks a lot!

    • @dataslice
      @dataslice  3 years ago

      Great to hear! Thanks for watching!

  • @loganlloyd3581
    @loganlloyd3581 2 years ago +3

    This is very well done and helps out a lot, thank you!

  • @bastih9816
    @bastih9816 2 years ago

    I don't comment often, but this is such good-quality content, mate

  • @antxnioo
    @antxnioo 2 years ago +2

    i never coded in R. this made it look so easy. Thank you!

  • @fabienneraier1140
    @fabienneraier1140 7 months ago

    A great tutorial, I got it to work right away! Thank you so much! :)

  • @SC-bi6my
    @SC-bi6my 3 years ago

    One of the best videos on YouTube.

  • @evanglaser6517
    @evanglaser6517 8 months ago

    Super helpful and concise, thank you!

  • @giannispets
    @giannispets 3 years ago

    Thank you for the tutorial. Very nice and to the point, without the blah blah

  • @Ricefield88
    @Ricefield88 1 year ago

    Thank you! I’ve tried python and mostly failed but this tutorial worked!

  • @thecardigancardigand
    @thecardigancardigand 2 years ago +1

    Thank you! Very useful and clear explanation.

  • @bunnyhei
    @bunnyhei 2 years ago +1

    Thank you very much! Your great tutorial video is straight to the point!

  • @eloscarc5782
    @eloscarc5782 4 months ago

    Wow, what a great explanation

  • @hcrnn7518
    @hcrnn7518 3 years ago +2

    Thanks, Man..It's so easy to learn from your videos..and I needed this for my work in the office..You have no idea how much time this has saved me..A subscribe and thumbs up from me!!!!!!!!!!

  • @dimasprasetyawardanadana7682
    @dimasprasetyawardanadana7682 6 months ago

    Your video is so helpful. Thanks a lot!

  • @jean777-p2t
    @jean777-p2t 2 years ago

    Thank you very much! I'm learning to use RStudio, and it's my first time practicing web scraping. I'm really so happy :D

  • @retrolu1
    @retrolu1 3 years ago

    That looks so easy, thank you for that

  • @retrosak1977
    @retrosak1977 1 year ago

    Such a great video 👏👏👏

  • @mrk9076
    @mrk9076 2 years ago +2

    Hi everyone!
    Just a question: why doesn't my SelectorGadget give the code when I highlight the text? It just shows "#main a", which is not the code. Can anyone help me, please?

  • @hayekri
    @hayekri 3 years ago

    I hope you get more subscribers b/c this is a very effective overview! Thanks!

  • @sub4morebysquawk427
    @sub4morebysquawk427 3 years ago

    You got me scraping the world wide web. Thanks!

    • @sub4morebysquawk427
      @sub4morebysquawk427 3 years ago +1

      @dataslice, I was trying to do this with Facebook and Google search. I was searching for dentists in the area and wanted a list with their contact numbers, but Inspect only shows me the div part.

  • @terraflops
    @terraflops 3 years ago

    i was hmm, okay, hope this is easier than bs4 in python, and just using the chrome extension with the name variable code .... AWESOME!! that was so easy! Thanks so much

  • @raj-nd6kz
    @raj-nd6kz 2 years ago

    lol at Lagaan being in the list, one of my favorite movies

  • @TheApexsha
    @TheApexsha 1 year ago +1

    Hey, I tried to do this exactly for youtube videos but the columns have 0 characters. Would you know why? Thank you.

    • @ignaciomorenobasanez3821
      @ignaciomorenobasanez3821 5 months ago

      I encountered the same error, but when I tried another page, it worked well. I believe the package does not function directly with pages built using JavaScript.

  • @arshammikaeili7408
    @arshammikaeili7408 2 years ago

    This is the best
    Good quality
    Best way
    Not too long
    Fantastic 👌🏼👌🏼👌🏼

  • @ac6852
    @ac6852 3 years ago

    You are a freaking legend! Thank you for this awesome video!!!!!!!! xoxoxo

  • @eduardobustamante1797
    @eduardobustamante1797 3 years ago

    This is the best tutorial, thank you so much

  • @buraktiras93
    @buraktiras93 2 years ago

    Great content, thanks! Waiting for your new videos!

  • @Swelouise
    @Swelouise 1 year ago +1

    What if you can't select individual data elements on the page?

  • @christianberntsen3856
    @christianberntsen3856 2 years ago +2

    Very nice! However, on some pages the "read_html(link)" gets stuck in an infinite loop. Any idea why?

  • @johnbuhl7863
    @johnbuhl7863 4 years ago +1

    What do I do if the name field is empty?
    I followed along with your example and had no issues, but when I tried doing what I needed it for, I couldn't get any values in "name"

  • @saminba9111
    @saminba9111 2 years ago

    Hi,
    I have a question about your video. Suppose I extract a CSV file from a webpage listing the engine capacity of different car makes/models, so I now have make/model and engine capacity. Should I then manually search the CSV file to find the engine capacity for each make/model in my dataset? I mean, after scraping, do I have to look up the data in the CSV file by hand?

  • @Jason-ot3fu
    @Jason-ot3fu 3 years ago +2

    Hi DataSlice, thanks for the great tutorial. I was wondering why when I type "View(movies)" I can see the synopsis values but when I export it to CSV, I can't see the synopsis values in the CSV file.

    • @dataslice
      @dataslice  3 years ago

      That’s an odd issue - are you sure the synopsis values aren’t there and just hidden? What command are you using to write to the csv?

    • @kavitakamatdivekar5152
      @kavitakamatdivekar5152 2 years ago

      @R for students | Dr. Fahad The synopsis values are there; just increase the height of the Excel row and you can see them.

  • @jonplaud
    @jonplaud 1 year ago

    I got the web scraping part down, but the data.frame keeps showing up as an error.
    I keep getting
    Error in data.frame(name, year, rating, synopsis, stringsAsFactors = FALSE) :
    arguments imply differing number of rows: 51, 50

  • @yashs1999
    @yashs1999 3 years ago

    So helpful, thank you so much!

  • @hm.91
    @hm.91 2 years ago

    Great video! Thank you very much

  • @manu3939393
    @manu3939393 2 years ago

    Mhh, I'm getting "Error in open.connection(x, "rb") : HTTP error 403." if I do this in R for the page I want. Using your Google Sheets Tutorial works, however. But since I need nested links that's not really useful. Any ideas?

  • @ahmedfaraz9813
    @ahmedfaraz9813 2 years ago

    Thanks a lot
    Just one question. On my page some of the movies are missing IMDb ratings, so when I ran the command I got:
    "Error in data.frame(name, year, rating, synopsis, stringsAsFactors = FALSE) :
    arguments imply differing number of rows: 50, 41"
    What should I do about it?

  • @gizl1
    @gizl1 8 months ago

    Great video!!

  • @josephjohns4626
    @josephjohns4626 3 years ago

    @Dataslice, I got the following error message when attempting to do the exact same functions: "> year = page %>% html_nodes(".text-muted.unbold")%>% html_text()
    Error in UseMethod("xml_find_all") :
    no applicable method for 'xml_find_all' applied to an object of class "function""
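
Editor's note: a likely cause of this error, assuming the line page = read_html(link) was skipped or failed in that session, is that no variable called page exists. R then finds the built-in function utils::page() instead, and html_nodes() cannot search a function. The sketch below uses a tiny inline document (an assumption for illustration) in place of the real link.

```r
library(rvest)

# With no `page` variable defined, the name resolves to a base utility function
class(utils::page)  # "function"

# Defining `page` first fixes the lookup.
# Assumption: this inline HTML stands in for read_html(link).
page <- read_html('<html><body><span class="year">(1994)</span></body></html>')

year <- page %>% html_nodes(".year") %>% html_text()
print(year)
```

Running the script from the top, so that read_html() executes before any html_nodes() call, is usually enough.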

  • @ymdec95
    @ymdec95 3 years ago +1

    Hi I tried loading the library (rvest) and library (dplyr) it shows an error saying there is no such package. What should I do?

    • @dataslice
      @dataslice  3 years ago +1

      What's the error you're getting? Did you install.packages("rvest") and install.packages("dplyr") beforehand?

    • @ymdec95
      @ymdec95 3 years ago

      @dataslice Yes, I did install the packages, and a folder was created storing those files as well

  • @Cx787
    @Cx787 3 years ago +3

    Thanks! This is really helpful. One question about the data: I usually work with Spanish web pages, and the text has special characters such as á, é, í, ó, ú. These characters do not appear correctly in the CSV file (they show up as A', Ä, etc.). Any idea how to solve this? I used to replace each one manually lol
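
Editor's note: garbled accents like these often mean the CSV was written as UTF-8 but opened by Excel with a different encoding. A minimal sketch, assuming that is the cause here, is to write the file with readr::write_excel_csv(), which saves UTF-8 with a byte order mark that Excel recognises. The data frame and file path below are made up for illustration.

```r
library(readr)

# Assumption: a small stand-in for the scraped data frame.
# \u escapes keep this script independent of the editor's own encoding.
movies <- data.frame(
  name = c("Am\u00e9lie", "Y tu mam\u00e1 tambi\u00e9n"),
  stringsAsFactors = FALSE
)

path <- file.path(tempdir(), "movies.csv")

# UTF-8 with a byte order mark, so Excel detects the encoding
write_excel_csv(movies, path)

# Reading it back confirms the accented characters survived
back <- suppressMessages(read_csv(path))
print(back$name)
```

If the file is only read back into R, base write.csv(movies, path, fileEncoding = "UTF-8") is another option.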

  • @retro527
    @retro527 3 years ago +1

    you have such a nice voice 🥺❤️❤️❤️

  • @hineshpatel7076
    @hineshpatel7076 2 years ago +2

    hi great video, super useful. Are you able to do a video on scraping behind a login page ?

  • @previncoin8592
    @previncoin8592 3 years ago +1

    My IMDb page has 41 titles, as confirmed at the top. All columns return 41 elements except (year), which returns 44; this causes a mismatch:
    "Error in data.frame(name, year, rating, synopsis, stringsAsFactors = FALSE) :
    arguments imply differing number of rows: 41, 44"
    This is because the first 3 entries in (year) are:
    [1] "IMDb user rating (average)" [2] "Number of votes" [3] "Release year or range"
    I can't see where this is coming from, as there are no extra highlights in the gadget selection. Is there a way to return only numbers for (year)?

    • @kavitakamatdivekar5152
      @kavitakamatdivekar5152 3 years ago

      years= page %>% html_nodes(".lister-item-year") %>% html_text() will work

  • @ahmedfaraz9813
    @ahmedfaraz9813 2 years ago

    My question...when i wrote file to CSV, I did not get the synopsis in Excel file...why is that

  • @ali5t4ir
    @ali5t4ir 3 years ago +1

    Thank you so much for this, well explained!! I have tried this on a website & I get "Error in open.connection(x, "rb") : HTTP error 405." In Python I think they usually set a Header or User-Agent to get around this. Is there any way to incorporate this in R, please?
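
Editor's note: one way to send a custom User-Agent from R is to fetch the page with the httr package and hand the response body to read_html(). This is a sketch, not the method from the video: fetch_page() and the agent string are hypothetical names chosen for illustration, and a site may still refuse the request if it does not permit automated access.

```r
library(httr)
library(rvest)

# Hypothetical helper: request a page with an explicit User-Agent header
fetch_page <- function(url, agent = "Mozilla/5.0 (compatible; my-r-scraper)") {
  resp <- GET(url, user_agent(agent))
  # Turn HTTP errors (403, 405, ...) into a clear R error
  stop_for_status(resp)
  # Parse the body the same way read_html(link) would
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access; the URL is a placeholder):
# page <- fetch_page("https://example.com/some-page")
# name <- page %>% html_nodes(".title") %>% html_text()
```

The returned object can be used with html_nodes() exactly like the page variable in the tutorial.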

  • @jerryzhang2693
    @jerryzhang2693 3 years ago

    Why am I having trouble doing that? The data shows 0 observations and 1 variable. "No data available in table"

  • @sophiej4605
    @sophiej4605 4 years ago

    How can I solve this error? When I run the "movies=data.frame(name,~~)", an error message shows up like "arguments imply differing number of rows: 100,91,1"

  • @buffaloperformanceandanaly1431
    @buffaloperformanceandanaly1431 1 year ago +1

    Awesome video, thanks for sharing! Is there a way to read in images? Thanks!

  • @fk-xj5oj
    @fk-xj5oj 10 months ago

    thank you. thats very helpful

  • @marvelousmike79
    @marvelousmike79 2 years ago

    How do I return values that are N/A? I am trying to scrape Indeed and some postings do not have the same variables e.g. salary.

  • @lifefaithworks
    @lifefaithworks 1 year ago

    Hello, great video! How do you scrape the next page.. etc to the end

  • @onsfarhat1042
    @onsfarhat1042 3 years ago

    Great video! Thanks :)

  • @walrexx_2370
    @walrexx_2370 3 years ago

    thank you for the great tutorial

  • @ucabcd7003
    @ucabcd7003 1 year ago

    Thanks!! I followed your code here, but it does not work. I'm such a neophyte ... does this platform allow scraping? Or maybe I did something wrong?

  • @HadesTimer
    @HadesTimer 2 years ago

    how do you deal with this if you don't have a data frame with the same number of rows? This one lined up but it would be easy to get data from a page like this that doesn't.

  • @previncoin8592
    @previncoin8592 3 years ago

    I got 50 titles & 38 ratings which returned an error, so had to remove rating column to run it. How can missing values be replaced with for instance, N/A?

  • @fleetwoodayisi9308
    @fleetwoodayisi9308 2 years ago

    Is there a way to account for items with a missing variable, for example movies that have no cast, so that the final output does not result in a data frame error?

  • @ammarparmr
    @ammarparmr 3 years ago

    Informative video!!
    I just have a question,
    How to add a random delay time to avoid blocking
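
Editor's note: a random pause between requests can be sketched with base R alone, by drawing a delay with runif() and passing it to Sys.sleep(). The helper name polite_pause() and the default range of 1 to 3 seconds are assumptions for illustration, and a delay by itself does not guarantee a site will not block repeated requests.

```r
# Hypothetical helper: sleep for a random number of seconds in [min, max]
polite_pause <- function(min_seconds = 1, max_seconds = 3) {
  delay <- runif(1, min = min_seconds, max = max_seconds)
  Sys.sleep(delay)
  # Return the delay invisibly so callers can log it if they want
  invisible(delay)
}

# Usage inside a loop over several pages (`links` is a hypothetical vector of URLs):
# for (link in links) {
#   page <- read_html(link)
#   # ... extract fields from `page` ...
#   polite_pause()
# }
```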

  • @previncoin8592
    @previncoin8592 3 years ago

    Very powerful stuff.

  • @nth.education
    @nth.education 2 years ago

    wow, this is so so cool

  • @joaquincarrascosa91
    @joaquincarrascosa91 3 years ago +1

    Great video, do you know how i could scrape the entire text from a website ? I was thinking of using it to make wordclouds as shown in your other video.

  • @nancyachiengodhiambo9727
    @nancyachiengodhiambo9727 4 years ago

    thanks so much, waiting for scraping multiple links

    • @dataslice
      @dataslice  4 years ago

      Hey Nancy -- the rest of the series is up! Part 2 is here: ruclips.net/video/E3pFBp5oPU8/видео.html and 3 and 4 are in the description as well. Thanks for watching!

  • @elizulkatri6758
    @elizulkatri6758 3 years ago

    I can save the data in CSV format, but when I open the file, the data is still not organized and is not in table form. What should I do?

  • @logic0057
    @logic0057 1 year ago

    Awesome!

  • @dimplenain0692
    @dimplenain0692 3 years ago

    What if there are any missing value for any variable like ratings? How to handle these missing values?

  • @boon8472
    @boon8472 3 years ago

    In the CSV file the synopsis is blank because there are commas in it. Is there a way to fix it?

  • @SteashEdits
    @SteashEdits 3 years ago

    I ran the code for the title and it worked perfectly fine. After I added the same code to get the year, neither year nor title worked anymore, giving me an error:
    "no applicable method for 'xml_find_all' applied to an object of class "function""

  • @nathasyapramudita6312
    @nathasyapramudita6312 1 year ago

    Is there any add-on similar to SelectorGadget for Firefox?

  • @GnarTank
    @GnarTank 1 year ago

    Some of the information that I've tried this on is coming out as double in length. I'm trying to practice this more using data from one of my friends league of legends games. Using leagueofgraphs to get the data. For some reason when I try to get the .gameMode information, data seems to double itself. And when I try to get the outcome of the game, Victory/Defeat, it returns the information as either all Victories with 5 blanks or all defeats with 5 blanks. Does any one have any advice how to fix this problem?

  • @DudeGuyWho
    @DudeGuyWho 2 years ago

    Excellent content! How can I download a multiple tab xlsx file into R from a URL. I know how to merge the tabs together once saved locally, but would like to read them in directly from URL into R.

  • @tainafelippe4842
    @tainafelippe4842 8 months ago

    Hello! I love your videos, very easy to understand even for people who have English as a second language like me.
    Unfortunately, when I tried to replicate this script, there's a problem in line 10: when I print line 10 to see its content, it shows "character [0]" instead of the information that appears for you (the names of the movies). I tried using both your example and other websites, but the problem remains. Has anyone else had this issue?
    Thanks!

  • @papaorgen4224
    @papaorgen4224 3 years ago

    could you please do a video with scraping off a website with ? rvest doesn't seem to help

  • @DudeGuyWho
    @DudeGuyWho 2 years ago

    Awesome content! Can you help me understand how to download a multi-sheet xlsx workbook from URL into R? It's only two tabs and I do know how to merge the tabs into a single dataframe once downloaded.

  • @celmywall
    @celmywall 1 year ago

    First, great tutorial! Thank you. I had a problem creating the data frame because I have a different number of rows in some objects (45 or 50), so this is the reported error: Error in data.frame(name, year, rating, synopsis, stringsAsFactors = FALSE) :
    arguments imply differing number of rows: 50, 45. Any suggestion on this? Thank you

  • @pallimanisha3759
    @pallimanisha3759 2 years ago

    I have a problem here.... it is displaying "character (0)" in the console when I run the code. What should I do?

    • @amitkt
      @amitkt 2 years ago

      Hi Manisha, by any chance did you get a solution to this? thank you

    • @ignaciomorenobasanez3821
      @ignaciomorenobasanez3821 5 months ago

      @amitkt I encountered the same error, but when I tried another page, it worked well. I believe the package does not function directly with pages built using JavaScript.

  • @ogclinton4780
    @ogclinton4780 2 years ago

    Great video. Would this work if i want to get data off of a website say number of views and visitors of a website or organization site?

  • @maazafridi2090
    @maazafridi2090 3 years ago

    really awesome

  • @radhikaiyer8012
    @radhikaiyer8012 4 years ago

    Brilliant! Thanks

  • @jasonarchimandritis1183
    @jasonarchimandritis1183 3 years ago +1

    This is great thanks! Curious can this be used to scrape a youtube search result (I tried and couldn't get it to work, but ran your imdb code and it worked fine, not sure if it has something to do with the youtube search code or something) Thanks! :)

    • @dataslice
      @dataslice  3 years ago +2

      Yes, unfortunately this method will only work for sites where the content isn't generated dynamically after the page loads (which rules out YouTube, for example). To scrape YouTube, you'd likely need to use the RSelenium library, which allows for more advanced web scraping techniques

    • @jasonarchimandritis1183
      @jasonarchimandritis1183 3 years ago

      @dataslice Gotcha, thanks so much! I will check that out! Any chance you'll put up an RSelenium tutorial anytime soon? ;)

    • @dataslice
      @dataslice  3 years ago +2

      @jasonarchimandritis1183 I've got a lot of video ideas in the backlog including RSelenium, so hopefully soon!
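
Editor's note: the limitation described in this thread explains the empty "character(0)" results other commenters report. read_html() only sees the HTML the server sends; it does not run JavaScript, so elements a script would create later are simply not there. The inline page below is an assumption for illustration: its only title would be added by a script in a browser.

```r
library(rvest)

# Assumption: a tiny page whose title is created by JavaScript after loading
html <- '<html><body>
<div id="app"></div>
<script>
  var h = document.createElement("h3");
  h.className = "title";
  h.textContent = "Movie";
  document.body.appendChild(h);
</script>
</body></html>'

page <- read_html(html)

# The script never runs in R, so no matching node exists in the parsed document
titles <- page %>% html_nodes("h3.title") %>% html_text()
print(titles)
print(length(titles))
```

A quick check on a real site is to use the browser's "View page source" (not Inspect): if the text you want is missing there, rvest alone will not find it either.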

  • @gonzalomontiel6275
    @gonzalomontiel6275 4 years ago

    I wanted to scrape a page, but then this message appeared: "Error in open.connection(x, "rb") : HTTP error 403." Do you know how to fix it?

    • @DappuDon
      @DappuDon 4 years ago

      add it as an exception so that loop keeps running

  • @michelepaleologo6310
    @michelepaleologo6310 2 years ago

    That’s awesome

  • @sultanhaider9596
    @sultanhaider9596 3 years ago

    Why did you select "No of votes"? I'm not clear on this, kindly help me!

  • @suryamadduri1353
    @suryamadduri1353 4 years ago

    Thanks for your videos. How do I extract movie reviews from IMDb in R? Please suggest

  • @yimeilong5518
    @yimeilong5518 4 years ago

    Hi, thank you so much for your videos. I have a problem when doing so. I use View() to check the output, all columns look great, but when I use write.csv() to export the output, open it, I found some parts are missing, do you know what's the problem? Thank you so much.

    • @dataslice
      @dataslice  4 years ago

      That’s odd. Are you sure they’re completely missing? There may be a new line character before the data, and maybe your CSV viewer isn’t displaying it? Or maybe try cleaning the text in R (removing all special characters from your data)?

    • @yimeilong5518
      @yimeilong5518 4 years ago

      @dataslice Thank you so much. My fault, they are not empty; there is a space at the beginning that made them look like they are empty. LOL
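
Editor's note: the leading whitespace found in this thread can be removed at extraction time with the trim argument of html_text(). A minimal sketch, using a made-up inline snippet and the class name ".synopsis" as assumptions:

```r
library(rvest)

# Assumption: a synopsis surrounded by a newline and spaces, as scraped text often is
page <- read_html('<html><body><p class="synopsis">
    A placeholder synopsis.
  </p></body></html>')

# Default extraction keeps the surrounding whitespace
raw <- page %>% html_nodes(".synopsis") %>% html_text()

# trim = TRUE strips leading and trailing whitespace
clean <- page %>% html_nodes(".synopsis") %>% html_text(trim = TRUE)

print(nchar(raw))
print(clean)
```

For a column that has already been scraped, trimws(movies$synopsis) does the same job before write.csv().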

  • @aleksandrawiacek1892
    @aleksandrawiacek1892 3 years ago

    What if the selection from SelectorGadget consists of two selectors? e.g. .altrow td:nth-child(1) , .row td:nth-child(1)

  • @antonymaina2757
    @antonymaina2757 4 years ago

    When I scrape the title names with .title, they come back in rows alternating with empty quotation marks, i.e. 1. "theboys" 2. " " 3. "ozark". Kindly help me fix it