The Biggest Issues I've Faced Web Scraping (and how to fix them)

  • Published: 24 Nov 2024

Comments • 104

  • @PaoloAnzani_1 6 months ago +53

    In my opinion, having developed multiple web scraping applications, half of the time is spent not on coding but on reverse engineering the web application. Simple ones are just a matter of looking at the requests in dev tools and making the API calls manually, while the most complicated ones involve backtracing how content is loaded on the page to find the JS code responsible for it. Basically it's 70% reverse engineering and 30% coding, if you do things the smart way.

    • @pranitmane 5 months ago

      Yep!

    • @mateusb09 3 months ago +4

      What's the benefit of manually doing API calls instead of just letting selenium click the buttons which will do the exact same thing?

    • @kaj1543 3 months ago

      @mateusb09 Selenium has overhead

    • @Anthony-qg5hj 3 months ago +2

      @mateusb09 because it's faster, less code, lower cost, easier to maintain

    • @mateusb09 3 months ago

      @Anthony-qg5hj I had a Selenium project in which I tried the approach you're talking about. Not only did I need to attach the login cookies (which expire) to the request anyway, but I also had to construct the request skeleton manually.
      So in the end it took about as much effort as just having Selenium click the buttons.
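
A minimal sketch of the "manual API call" approach debated in this thread, assuming the page loads its data from a JSON endpoint that you found in the browser's Network tab. The endpoint URL, query parameters, and cookie name below are hypothetical placeholders; the real values have to be copied from the request the site itself makes, and, as noted above, session cookies expire and need refreshing.

```python
import json
import urllib.parse
import urllib.request


def build_api_request(base_url, params, cookies, user_agent="Mozilla/5.0"):
    """Build (but do not send) a GET request that mirrors a browser XHR call."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}", method="GET")
    request.add_header("Accept", "application/json")
    request.add_header("User-Agent", user_agent)
    # Session cookies copied from a logged-in browser; they expire over time.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    request.add_header("Cookie", cookie_header)
    return request


def fetch_json(request, timeout=10):
    """Send the prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, as they might appear in the Network tab.
req = build_api_request(
    "https://example.com/api/items",
    {"page": 2, "per_page": 50},
    {"sessionid": "abc123"},
)
print(req.full_url)  # https://example.com/api/items?page=2&per_page=50
```

Calling `fetch_json(req)` would then perform the request. Whether this is less work than driving Selenium depends on the site, which is the trade-off the replies above disagree on.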

  • @yafethtb 8 months ago +17

    Yeah. Scraping a dynamic website really makes me want to scream like Linus Torvalds at NVIDIA. And I also hate Cloudflare 😂

    • @gamecast4432 1 month ago

      You can start a new browser or a new context for every "goto()" with a different user-agent; that's how I do it with Cloudflare.
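
One way the "new context per goto()" idea could look, as a sketch only. It assumes a Playwright-style browser object whose `new_context(user_agent=...)`, `new_page()`, `goto()`, `content()`, and `close()` methods behave as in Playwright's sync API; the user-agent strings are illustrative examples, and nothing here guarantees that a given site will accept the requests.

```python
import itertools

# Example user-agent strings (assumptions); use realistic, current values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]


def fetch_pages(browser, urls, user_agents=USER_AGENTS):
    """Load each URL in its own browser context with a rotated user agent."""
    rotation = itertools.cycle(user_agents)
    pages = {}
    for url in urls:
        # A fresh context means fresh cookies and storage for every navigation.
        context = browser.new_context(user_agent=next(rotation))
        try:
            page = context.new_page()
            page.goto(url)
            pages[url] = page.content()
        finally:
            context.close()  # always release the context, even on errors
    return pages
```

With Playwright installed, `browser` would typically come from `sync_playwright()` and `p.chromium.launch()`, and the call would be `fetch_pages(browser, ["https://example.com/"])`.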

  • @delsix1222 8 months ago +30

    interesting timing to see this video, literally the day after I completed my first full-stack application which literally revolves around web-scraping :D

    • @flipygmd 8 months ago +1

      You're the next Mark Zuckerberg

    • @Noumaan_Ahamed 8 months ago

      How do you scrape a secure website?

    • @IshaqKhan010 5 months ago

      share website url

    • @delsix1222 5 months ago

      @IshaqKhan010 Can't share URLs in YT comments; they get autofiltered

    • @pablom8854 3 months ago

      And I'm starting a web scraping project

  • @rikawrites7104 15 days ago

    i started learning about web scraping YESTERDAY, and stumbled upon your video today. GODDAMN the way you explain stuff and speak really stuck with me! thank you for providing such value and motivating me to improve my communication skills as well :D

  • @Dalamain 8 months ago +24

    I used to web scrape all the time, but the obfuscated CSS class names that stupid JS frameworks generate have made it very difficult.

    • @gamecast4432 1 month ago

      I use [data-something="foo"] selectors; luckily, most of the sites I need to scrape make use of this kind of attribute
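
A small sketch of why `data-*` attributes help when class names are obfuscated: the lookup below ignores classes entirely and matches on an attribute value instead. It uses only Python's standard `html.parser`; the `data-testid` attribute and the sample markup are made up for illustration, and a real project would more likely use a CSS selector engine.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DataAttrCollector(HTMLParser):
    """Collect the text of elements whose attribute equals a given value."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.results = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1          # nested tag inside a match
        elif dict(attrs).get(self.attr) == self.value:
            self._depth = 1           # entering a matching element
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:          # matching element just closed
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def select_text(html, attr, value):
    """Return the text of every element where attr == value."""
    collector = DataAttrCollector(attr, value)
    collector.feed(html)
    collector.close()
    return collector.results


# Made-up markup with hashed class names but stable data attributes.
SAMPLE = (
    '<div class="css-1x2y3z">'
    '<span data-testid="price">19.99</span>'
    '<span class="css-9q8r7s" data-testid="title">Blue <b>Widget</b></span>'
    '</div>'
)
print(select_text(SAMPLE, "data-testid", "price"))  # ['19.99']
print(select_text(SAMPLE, "data-testid", "title"))  # ['Blue Widget']
```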

  • @v1d300 8 months ago +7

    I am working on a project that relies heavily on scraping, so I've been doing a lot of research, and it's really hard to find anything good that is not sponsored by Bright Data. I get it, their marketing team has done a great job of tapping a perfect niche of creators who provide valuable information, but it also means that almost every good resource ends up being about using Bright Data, and that's not something I want to pay for when starting a hobby project.
    Anyway, this is a great video either way. I learned a lot of things I hadn't considered in my planning, like ETL (that's a new rabbit hole I need to dive into) or adaptive content extraction to account for layout changes. I was just assuming I would set up reporting to notify me when I start getting no content, and then fix it.
    So thank you for that.
    Do you set up Redis or something similar so that recently requested data is served from a cache rather than being scraped again or read from the DB? Is that necessary?
    And at what point should a webhook be set up, and for what purpose exactly?
    Thank you

  • @xlafxx 8 months ago +1

    I remember starting to watch your videos when I was entering my computer science BA, and now, as a 28-year-old with one semester left to graduate, you're still uploading good content that's unique. I never get tired of your vids, keep it up brother. I'm also concerned about the job market; can you make a vid about new-grad CS students? For example, it seems almost every job wants front end or something, and my school never taught any of it.

    • @mrrobot-mn6re 8 months ago +1

      You want to get a job from what your school taught you? You are in for a ride, brother. Tech is about your own research and self-learning, every fucking day. I pity people who majored in CS because they heard about a programmer earning six figures.

    • @Hshjshshjsj72727 6 months ago

      Unless you went to an Ivy League school and want to be a quant, you've got to do front end; JS, React, and SQL are key for the majority. School is dumb unless it's Ivy League, except for the piece of paper.

  • @JefCollier 3 months ago +1

    I saw this video recommended to me about two days after I had to scrape a ton of images and convert them to a PDF. The images are loaded dynamically and I will confess with shame that my script would scroll slowly down the entire page until it couldn't get any further. Then it would queue up all the appropriate image files and compile them into a local directory before turning them into a single PDF file.

  • @robinbreed2439 1 month ago

    Great video and really nice energy, and I think you answered my question by using the scrape browser to render JavaScript headlessly. Thank you

  • @danielabraham3022 8 months ago +2

    To be honest, i subscribed because the button lit up. Also, I love your content.

  • @V4rrow 8 months ago +19

    Dude is literally Gilfoyle from Silicon Valley (love your vids)

    • @theparten 8 months ago

      i wasn't looking for web scraping video but his face drew my attention, i was like wait this is Gilfoyle right😂❤...

    • @FFl1s 8 months ago

      Fr

  • @redbill5197 8 months ago +6

    Thank you for the amazing video! Much appreciated as a young web developer. By the way, none of the buttons lit up or did any animations... I am a subscriber, so I don't know if that's why.
    Peace!!!

    • @beaconxy 7 months ago

      It actually didn't.

  • @EduardoEscarez 8 months ago +2

    AFAIK the button highlighting is a feature based on video subtitles, including automatically generated ones, but it is still somewhat random. I didn't catch it because I was already subscribed and had liked the video a moment before you said it.

    • @v1d300 8 months ago

      I don't think it's a video subtitles feature. It just happens randomly in my experience. The thumbs-up button shakes and subscribe highlights. Didn't happen for me on this video though :(

  • @doublesushi5990 8 months ago +2

    such a chill vid

  • @Smallbusiness0007 8 months ago +5

    The JD bottle in the background 😉

  • @xdcountry 8 months ago +6

    This guy gets it. I've been there. I can't wait to make this all an easy-ass Python plugin

  • @LM-ty8xg 1 month ago

    Amazing content.
    Brother, please make a video explaining how to scrape dynamically loading Power BI tables on a website. There is simply no change in the HTML/CSS structure when you interact with them 😅

  • @olhodetamarutaca 5 months ago +1

    I really like the way you explain things and also the pronunciation issues

  • @tomasemilio 8 months ago +3

    Boom. Thanks

  • @ramelox 8 months ago +97

    When I see a Bright Data sponsorship, I instantly stop watching. Paying Bright Data is not a web scraping skill.

    • @zeddscarlxrd4331 8 months ago +5

      Do you know how to bypass Cloudflare or CAPTCHAs without Bright Data?

    • @ZacMagee 8 months ago +7

      Some people 😂
      That's like saying.
      "Oh well, these stupid people who drive cars, why would they do that when we still have horses?"

    • @vasyavasin7364 8 months ago +12

      @ZacMagee Why should I pay for it if I can do it for free? 😂

    • @vasyavasin7364 8 months ago

      @zeddscarlxrd4331 You can easily find out how to bypass Cloudflare.

    • @Ohiostategenerationx 7 months ago +1

      @vasyavasin7364 Don't you still need to scrape a bunch of proxies to use?

  • @nrgstudios612 3 months ago

    The subscribe button didn't light up because I was already subscribed 👍

  • @olasunkanmioyetunji9254 7 months ago +1

    Can you recommend a course for learning web scraping? One that teaches the tools and techniques you mentioned, and other concepts.

    • @ravimahto3606 29 days ago

      i am searching for it too, beginner in webscraping

  • @phethindabamkhwanazi3546 8 months ago +1

    Hey, man do you have another channel where you teach live?????

  • @brianmorin5547 7 months ago

    Is there a reason/advantage to using Bright Data's "scraping browser" product instead of integrating their proxy and IP rotation services into a script I'm running on my own server?

  • @manumartinezkcxu 5 months ago

    What are the best AI scraping apps? Any suggestions or recommendations? I'm just looking into how our nonprofit organization aligns with other organizations within a county in California, in order to partner with them.

  • @dmytro-skh 7 months ago

    This video is what I need. But whoa, the screens with code change so fast... I'm too old at 35 to hit the pause button that quickly 😅 Do you have some links with those hacks?

  • @Cryogenics12 8 months ago +2

    Hi Forrest. I was wondering how you still feel about AI and the future of software engineering. With ChatGPT out for over a year now, have your views changed much? Maybe a good topic for another vid.

  • @javancheongyujing2531 8 months ago +1

    Does web scraping fall under data science or software engineering?

    • @dedswift 2 months ago

      Depends on the purpose of the data you’re scraping and how it’s used, but it can be both.

  • @sakibullah3577 2 months ago

    Can anyone help me? I can't seem to get past the Cloudflare loading page with the headless Bright Data web scraper

  • @johnknox4293 8 months ago

    interesting....thanks man

  • @consolemodding1015 3 months ago

    The funny thing is when they block the ranges used by bright data xD

  • @carsonjamesiv2512 8 months ago

    GOOD VIDEO🎉👍

  • @realshiiiiiit8349 8 months ago

    Damn this guy is cool

  • @VishalJangid1 8 months ago +1

    hopefully brightdata ain't a snitch 🫠

  • @juan7114 4 months ago

    I hate 502 errors; I don't know how to solve them

  • @storymode9085 8 months ago

    wow... i got a long way to go

  • @oeerturk 1 month ago

    You said you prepared the video without needing Bright Data, but for every issue except data storage you propose using Bright Data for the most important and challenging parts...? :/

  • @JoaquimDornelles95 8 months ago

    My fucking hero

  • @paulshorey7528 3 months ago

    I like your mustache

  • @OnlyUseMeEquip 5 months ago +1

    If you are using Selenium, Puppeteer, or any other browser automation, you will never be a good web scraper; they are just too damn slow. If you rely on them to get you past the WAF JavaScript function and generate your cookies before you go scrape, others will beat you to the punch with pure code.

    • @consolemodding1015 3 months ago

      Define slow?

    • @OnlyUseMeEquip 3 months ago

      @consolemodding1015 If you have to log in repeatedly and solve CAPTCHAs, that delay is almost negated. Pure-code bots just generate new valid cookies: once you hit a 403 Forbidden or a 401 CAPTCHA, new tokens are loaded and you carry on. Not to mention threads instead of browser instances. Reversing the WAF JS function is the key. A good pure-code bot is likely to be around 100x more efficient than a good browser bot.

    • @mianashhad9802 3 months ago

      How can you scrape dynamic content without these tools? Anything else besides trying to find the API endpoint?
      I am a beginner who knows how to scrape simple pages. I want to learn how to scrape dynamic content. Would love to know your thoughts.

    • @heritage1834 2 months ago +1

      @mianashhad9802 A method that works is to clone the API calls that fetch the data from the backend server. You can find them in the Network tab (Fetch) of your browser's developer tools.

    • @gdolphy 1 month ago

      @mianashhad9802: If the attribute data changes, target the tag. If the tag changes, target the Ajax calls.
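
The "hit a 401 or 403, load new tokens, carry on" loop described earlier in this thread could be sketched roughly as follows. Both callables are hypothetical stand-ins: `fetch(url, tokens)` is assumed to return a `(status, body)` pair, and `refresh_tokens()` is assumed to produce fresh credentials by whatever means the site legitimately allows. Nothing here is specific to any WAF or library.

```python
def fetch_with_refresh(fetch, refresh_tokens, url, max_refreshes=2):
    """Fetch a URL, refreshing tokens and retrying on a 401 or 403 response.

    fetch and refresh_tokens are caller-supplied (hypothetical) callables.
    Returns the last (status, body) pair seen.
    """
    tokens = refresh_tokens()
    for attempt in range(max_refreshes + 1):
        status, body = fetch(url, tokens)
        if status not in (401, 403):
            return status, body          # success or an unrelated error
        if attempt < max_refreshes:
            tokens = refresh_tokens()    # credentials rejected: get new ones
    return status, body                  # still rejected after all refreshes
```

Bounding the number of refreshes keeps a permanently blocked client from retrying forever.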

  • @botobeni 7 months ago

    12:30 nuh uh 🗿🗿

  • @justcode_99 8 months ago

    Your mustache looks like a hedgehog 😂

  • @YouStillNeedToSleep 7 months ago

    Examples. Are you a Leo? he he

  • @GEMSofGOD_com 2 months ago

    Thank you Jesus

  • @francishubertovasquez2139 8 months ago

    Speaking of Females, if Hitler's fuhrer have Magog carrier of motorized machine monsters then the Northern Magog have ice snow predominant in their place near Arctic circle, and ice surface can better conduct gases and science elements and compounds interaction which can attract those science things from everywhere, who between them is stronger except for the Super Magog Dark Matter? Will they suffice at full force during the final battle end times?

  • @abe_is_live 8 months ago

    stop web scraping