Web scraping in Python takes 2 seconds...

  • Published: 29 Sep 2024
  • -- -- (Links on this page may give me a small commission from purchases made - thank you for the support!)
    Roadmap to Become a Data Scientist / Machine Learning Engineer in 2022: • Complete Roadmap to Be...
    Roadmap to Become a Data Analyst in 2022: • Roadmap to Become a Da...
    Roadmap to Become a Data Engineer in 2022: • Full Pathway to Become...
    Here are my favourite resources:
    Best Courses for Analytics:
    ---------------------------------------------------------------------------------------------------------
    + IBM Data Science (Python): bit.ly/3Rn00ZA
    + Google Analytics (R): bit.ly/3cPikLQ
    + SQL Basics: bit.ly/3Bd9nFu
    Best Courses for Programming:
    ---------------------------------------------------------------------------------------------------------
    + Data Science in R: bit.ly/3RhvfFp
    + Python for Everybody: bit.ly/3ARQ1Ei
    + Data Structures & Algorithms: bit.ly/3CYR6wR
    Best Courses for Machine Learning:
    ---------------------------------------------------------------------------------------------------------
    + Math Prerequisites: bit.ly/3ASUtTi
    + Machine Learning: bit.ly/3d1QATT
    + Deep Learning: bit.ly/3KPfint
    + ML Ops: bit.ly/3AWRrxE
    Best Courses for Statistics:
    ---------------------------------------------------------------------------------------------------------
    + Introduction to Statistics: bit.ly/3QkEgvM
    + Statistics with Python: bit.ly/3BfwejF
    + Statistics with R: bit.ly/3QkicBJ
    Best Courses for Big Data:
    ---------------------------------------------------------------------------------------------------------
    + Google Cloud Data Engineering: bit.ly/3RjHJw6
    + AWS Data Science: bit.ly/3TKnoBS
    + Big Data Specialization: bit.ly/3ANqSut
    More Courses:
    ---------------------------------------------------------------------------------------------------------
    + Tableau: bit.ly/3q966AN
    + Excel: bit.ly/3RBxind
    + Computer Vision: bit.ly/3esxVS5
    + Natural Language Processing: bit.ly/3edXAgW
    + IBM Dev Ops: bit.ly/3RlVKt2
    + IBM Full Stack Cloud: bit.ly/3x0pOm6
    + Object Oriented Programming (Java): bit.ly/3Bfjn0K
    + TensorFlow Advanced Techniques: bit.ly/3BePQV2
    + TensorFlow Data and Deployment: bit.ly/3BbC5Xb
    + Generative Adversarial Networks / GANs (PyTorch): bit.ly/3RHQiRj
    Become a Member of the Channel! bit.ly/3oOMrVH
    Follow me on LinkedIn! / greghogg

Comments • 267

  • @GregHogg 8 months ago +22

    I offer 1 on 1 tutoring for Data Structures and Analytics! Email me at greg.hogg1@outlook.com - first call is free!

    • @MungeParty 6 months ago +3

      Yeah but you record programming tutorials with your phone so how good could you possibly be with technology?

  • @jitendravishwakarma7949 1 year ago +535

    It's easy when the data is in a structured format, like the table you showed. The difficulty comes when you want to scrape unstructured information from a website and need to make it structured.

    • @GregHogg 1 year ago +39

      Yes this is true

    • @mahdiabdulhafedh6488 1 year ago +6

      Yeah
      There are websites with no structured data. 😅

    • @RodStremel 1 year ago +14

      @GregHogg so that's definitely not what happens 99% of the time, right? C'mon...

    • @gustavonovakoski4867 1 year ago +2

      I love when I have to scrape some product pricing from a random competitor

    • @infamous8541 1 year ago +5

      That's where Beautiful Soup comes into play

  • @dmytro7441 8 months ago

    Scrape any news website the same way 😅 Good luck 😂

  • @kdt85 6 months ago

    I didn't know 🐼 could do that!

  • @deborshikashyap6745 1 year ago

    I wanted to see my Google history in the Python terminal, but that didn't happen. I used Colab.

  • @test-rj2vl 2 months ago

    And now post a video of yourself scraping Facebook posts where IDs and classes are random.

  • @polycrystallinecandy 7 months ago +67

    Ain't no way bro called the URL "the line of the website" 💀

  • @IntricateMoon 1 year ago +370

    Love the voice cracks. So adorable

    • @Michael-ty2uo 7 months ago +20

      💀 nahh bro

    • @Diaryofaninja 7 months ago +10

      Lil bro is glazing a grown ass man 🤮

    • @OtsoVesterinen 7 months ago +2

      @Diaryofaninja better than you glancing at kids lololol

    • @Diaryofaninja 7 months ago

      @OtsoVesterinen Ur weird asl 😬

    • @OtsoVesterinen 7 months ago

      @Diaryofaninja I'm weird? Dude, you're saying someone liking an adult is weird, just sounds like you like kids

  • @monq02 1 year ago +111

    scraping is actually easy, the hard part is parsing the data and getting what you need in the format you want

    • @GregHogg 1 year ago +10

      That's probably still considered scraping, but yes I very much agree that's the hard part

  • @worldanvilbild3980 7 months ago +33

    The part that makes it difficult is when the website doesn't want anyone to be able to scrape it. That's when you have to use CAPTCHA breakers, proxies, undetectable drivers and 30 concurrent Selenium instances.

    • @vasiliigulevich9202 6 months ago

      It is also often illegal

    • @DreadHalfling9 6 months ago

      @vasiliigulevich9202 it's not but ok

    • @gustavohab 1 month ago

      You can also grab the API if you can, and if needed, insert fake cookies so you can reproduce the scraping more times

  • @azhari7968 1 year ago +32

    99% of the time? I've been scraping websites and they don't even contain any tables at all

    • @Yodella 1 year ago +3

      My thought exactly as a data scientist
      I scrape data, not tables

  • @matthewdaz6185 7 months ago +39

    Cool, now do it on a firewall-protected site behind a login screen, with data that is rendered in a custom-styled React component with no consistent IDs or classes

    • @Michael-ty2uo 7 months ago +5

      Almost sums up my experience trying to webscrape Facebook

    • @TheBencheek 6 months ago

      Give the guy some credit. That's the 1% :D

    • @Michael-ty2uo 6 months ago +1

      @TheBencheek Yeah, but most people don't try to webscrape obscure websites; they go for big names like Facebook, eBay, Amazon, LinkedIn, etc. And those "1%" WILL have some sort of anti-webscraping measures built into their site

    • @TheBencheek 6 months ago

      Trust me, I know :)

    • @sws212 6 months ago +1

      @Michael-ty2uo Most people try to scrape places with usable information, which absolutely includes Facebook, eBay, etc. Most sites with valuable information to scrape generally make it harder to do.

  • @MungeParty 6 months ago +1

    Yeah actually no most of the time the data you want is not already in a table. This video should be called how to do my homework.

  • @biggiebeats1490 6 months ago +1

    A jailbroken ChatGPT prompt can do all this for free, coding is a waste of time

  • @tukankibar4917 5 months ago +1

    It is easy if you are scraping Wikipedia.
    Not easy when you are scraping websites that are complex and oftentimes hostile to being scraped.

  • @nicolasrulli 2 years ago +29

    Voice crack

    • @GregHogg 2 years ago +3

      Funny, I noticed this but listening back I didn't think anyone would hear it 😂

    • @linuxuser2928 1 year ago

      @GregHogg It makes it so much better.
      I'm subscribing because it's endearing to hear someone so young programming and making programming videos :)

    • @Pseudo___ 10 months ago

      @GregHogg how did you think no one would notice this lol

  • @syedhaider0916 1 year ago +15

    I think the problem is when you have to scrape data from a website with pagination, where you want to extract data from all the pages of the website. The thing is, literally every single time the website will block you.
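
A minimal sketch of the pagination loop this comment describes, assuming the site exposes numbered pages and an empty page marks the end. `scrape_pages`, `fetch_page`, and `FAKE_SITE` are hypothetical names made up for illustration, and the fetch step is faked so the loop runs without a network. A pause between requests may lower the chance of being blocked, but it is no guarantee.

```python
import time


def scrape_pages(fetch_page, max_pages=50, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)
        if not rows:
            break  # an empty page is treated as "no more results"
        all_rows.extend(rows)
        sleep(delay_seconds)  # pause between requests to go easy on the server
    return all_rows


# Stand-in for a real site: page number -> rows found on that page (made-up data).
FAKE_SITE = {1: ["a", "b"], 2: ["c"]}


def fake_fetch(page):
    """Hypothetical fetcher; a real one would download and parse page `page`."""
    return FAKE_SITE.get(page, [])


rows = scrape_pages(fake_fetch, delay_seconds=0)
print(rows)
```

Passing the fetcher in as a parameter keeps the loop testable; a real fetcher would download and parse each page and could raise on an HTTP error instead of returning an empty list.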

  • @harry-smith404 7 months ago +6

    Any problem is solved in 2 seconds when you have a library

  • @ammarmeer 7 months ago +1

    That's not what you think. Stop fooling them 😂

  • @sskrylov 7 months ago

    Except that tables are usually custom ones without the tag, or are rendered in JS, or their data is loaded via AJAX, for example. These are overly ideal conditions; in real life you will almost never run into them, so the function is practically useless

  • @demonpandaz8246 1 month ago

    For more encouragement, I learned python web scraping in 6 hours while stoned asf. Just don't give up and you'll get there. :D

  • @jlaviews 5 months ago

    Maybe 10% of the time. Most interesting info is loaded with JavaScript, and that does not load before the read_html call from pandas gets executed
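
When a page fills its table with JavaScript, the data often arrives as a JSON response that can be read directly instead of the rendered HTML. A minimal sketch of turning such a payload into rows, assuming a made-up payload shape: `SAMPLE_PAYLOAD`, the `items` key, the column names, and `rows_from_payload` are all hypothetical, and a real endpoint has to be found in the browser's network tab.

```python
import json

# Made-up example of what a JSON endpoint behind a JavaScript-rendered table might return.
SAMPLE_PAYLOAD = (
    '{"items": [{"name": "Widget", "price": 9.99},'
    ' {"name": "Gadget", "price": 24.5}]}'
)


def rows_from_payload(payload_text, key="items", columns=("name", "price")):
    """Turn a JSON payload into a list of row tuples, one tuple per item."""
    data = json.loads(payload_text)
    return [tuple(item.get(col) for col in columns) for item in data.get(key, [])]


rows = rows_from_payload(SAMPLE_PAYLOAD)
print(rows)
```

If no such endpoint exists, a browser automation tool such as Selenium, mentioned elsewhere in this thread, is the usual fallback.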

  • @nietzschebietzsche 3 months ago

    Where it gets difficult is when you want to do lots of UI interactions and every time the developers change 1 fucking XPath, you have to update your script

  • @SS-gu2tx 8 months ago +1

    This is far from sufficient.

  • @LoopyAnh 1 month ago

    It'd be nice if you did this with a recording of the screen instead of a phone.
    Also wasn't aware of pandas, but it makes it 100x easier than me trying to write it all from scratch.

  • @a13m34 7 months ago

    REAL MAN DO IT THE MANLY WAY, IN C LIKE A MAN WOULD💪💪 (I have forgotten what the sun looks like)

  • @lucagenoni4430 7 months ago

    Dumb video. 1% of the time there is an HTML table accessible without any sort of authentication or running JavaScript to render the website and the data.
    An HTML table in the source code of the website is only for old-fashioned websites.

  • @MartinBarker 6 months ago

    Now go read the source code of pandas that does all the work for you. People think programming is easy when they are literally using other people's hundreds if not thousands of hours to do their work. Now try doing it on an embedded system.
    Now, pandas is not so much of a bad one for this, but just using libraries without the ability to understand their code, or them being massively open source, you're just asking for a supply chain attack.

  • @abhi1196 1 year ago +3

    Would not work with websites that have a firewall, and nowadays even a simple website has one, so a waste of 2 sec

  • @asdanjer 4 months ago +1

    Yeah, Wikipedia tables are really easy. Even Excel does that automatically. But when people say web scraping, that is usually not what they mean. They usually mean unstructured data on sites that render content dynamically, which is most of the Internet nowadays and also where most of the interesting data is.

    • @GregHogg 4 months ago

      Yes you're very right it wouldn't handle that as well :)

  • @yaroslavpanych2067 7 months ago

    Yeah, now scrape it from a table built from pure s, as it has usually been done by dumb site builders for the last 15 years

  • @FLUX07 1 year ago +1

    Please scrape amazon and other shopping sites and compare the prices of the searched product in 30 seconds. Good Luck, Hopefully You Do Not Get Blocked By Amazon And Other Sites

  • @waelodat7258 1 year ago +4

    It's pretty nice, but it doesn't work with more complex websites. Thank you 💗

  • @dacjames 8 months ago +2

    Web scraping is easy to do once and extremely difficult to scale.
    Websites change and have bugs, so a naive scraper will require maintenance at a rate that scales with the number of pages and sites it scrapes.

  • @RicardoSuarezdelValle 6 months ago

    Well, real web scraping is actually an arms race between websites which don't want to get scraped and scrapers who need scraping

  • @SamirKumarPradhan-z6c 7 months ago

    99% of the time it is not use cases like this. It's unstructured data. Don't just make up stuff for views

  • @Skubidi-qy8hb 1 month ago

    Can we extract user data for business purposes? Of course it's not legal

  • @ithakra 4 months ago

    The other real trick is finding reliable information. Difficult to do in the west ..

  • @LinggarMaretvaCendani 11 months ago +2

    I have difficulty when I encounter websites that require specific HTTP headers to access them. If we open the website manually in a browser, we can see the content, inspect elements, and basically get the content, because the HTTP headers are set automatically by the browser itself. But it's different if we access the website through a Python scraping library like urllib, Scrapy, or BeautifulSoup: we can't just paste the URL and get the website content (HTML elements, tables, etc.); we need to specify the HTTP headers to get the elements...
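
A minimal sketch of what this comment describes, using only the standard library's `urllib.request`: build a request that carries browser-like headers before fetching. The header values, the example URL, and the helper names `build_request` and `fetch_html` are assumptions for illustration; which headers a given site actually checks varies, and some sites will still refuse the request.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper/0.1)"):
    """Create a Request carrying headers a browser would normally send."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10):
    """Download a page with the headers above and return it as text."""
    req = build_request(url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request needs no network access, so it is safe to inspect here.
req = build_request("https://example.com/table-page")
print(req.header_items())
```

The text returned by `fetch_html` can then be passed to whatever parser is in use.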

  • @Web.Scraping 27 days ago

    With this method, there is a problem with the login, pagination, captcha, etc

  • @azursmile 9 months ago +2

    Think the difficulty is keeping up with changes in the website as they don't publish versioned specs like formal APIs.

  • @CleanRapMusic 6 months ago +1

    Scraping most large websites is much much harder. It involves JavaScript rendering for SPAs, spoofing browser metadata to emulate a real user, residential proxy servers, etc.

  • @graymars1097 6 months ago

    That was 3 seconds! You're a liar 😡
    😂😂😂😂
    Python is great 😊

  • @architech5940 1 year ago +1

    Not exactly bud. Not even close to 99% of the time because a large part of data science is collecting raw data and putting that data in a spread/table yourself. Most of the time the data you're looking for when web scraping is not already nicely organized in a table or chart.

  • @LifeLess1999 5 months ago

    he's 30 and his voice still hasn't figured out how to not be so pitchy

  • @edwardharding5677 7 months ago

    Can you ever say something without having a voice crack!?!?!

  • @fusebox9725 10 months ago +1

    I'm having trouble installing certain packages for Python (using PyCharm Community Edition)

  • @CodingwithSudhan 1 month ago

    what if they coded that table with divs lmao

  • @TuberTugger 5 months ago

    Wtf. Who said web scraping is hard? It's just highly unethical, and important data will be obfuscated. This is what happens when script kiddies try to make YouTube Shorts.

    • @GregHogg 5 months ago

      Sorry for being a script kiddie 😂

  • @88starkiller 5 months ago

    Guys that just started some form of engineering be like:

  • @propanben5214 7 months ago

    Easy to program when someone else did it before with a library

  • @slyace1301 7 months ago

    Idk man, this sounds like a nightmare for fats security

  • @massibob2004 4 months ago

    Do you know how to extract the visible text of a webpage ?
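
One rough way to approach this question with only the standard library: walk the HTML with `html.parser.HTMLParser` and keep text that is not inside `script` or `style` style tags. This is an approximation, since it ignores elements hidden with CSS and anything added later by JavaScript. `VisibleText`, `visible_text`, and `SAMPLE_PAGE` are hypothetical names for this sketch.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping tags whose content is never displayed."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def visible_text(html):
    """Return the collected text pieces joined by single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


SAMPLE_PAGE = (
    "<body><h1>Hello</h1><style>p{color:red}</style>"
    "<script>var x = 1;</script><p>Visible <b>text</b> here.</p></body>"
)
print(visible_text(SAMPLE_PAGE))
```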

  • @gogutzy 11 months ago

    I can do this in Word - no need to complicate myself with Python...

  • @sonicjoy2002 5 months ago

    I can copy and paste that table into Google Sheets without any code at all.

  • @brandonmartino1363 5 months ago

    Pandas does not load the JS of the web page. Doesn't work for all sites...
    Use Selenium WebDriver....

    • @GregHogg 5 months ago

      Yeah good point!!

  • @diegonova741 6 months ago

    So you can actually use web scraping as a tool to train an AI?

  • @lLenn2 7 months ago

    Damn, bro, you still going through puberty?

  • @TheISP 2 years ago +28

    You've explained it so simply I'm surprised I never realised this is the basic idea

    • @GregHogg 2 years ago

      Glad to hear it!

    • @AtomicPixels 1 year ago

      Right lol. And if you just so happen to want exactly and only exactly a table format. I don’t think this qualifies as a misconception, rather the exception 😂

  • @md.alnahian4613 6 months ago

    What would you do if some random CAPTCHA appears?

  • @marceli1588 5 months ago

    All easy till it's dynamically loaded, half the content is hidden and the website had 10 different designers that never spoke to one another

    • @GregHogg 5 months ago

      Yes exactly 😹

  • @cradokski 5 months ago

    Unfortunately it now gives the error ... File "C:\Program Files\Python312\Lib\urllib\request.py", line 1347, in do_open
    raise URLError(err)
    urllib.error.URLError:

    • @GregHogg 5 months ago

      That's odd

    • @cradokski 5 months ago

      @GregHogg Does it still work for you?
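
The traceback in this thread is cut off before the actual reason, so the cause cannot be told from here. A minimal sketch of catching `urllib.error.URLError` and printing its `reason`, which is usually the first thing to look at. `try_fetch` is a hypothetical helper, and the `file://` URL is only a stand-in that fails without touching the network.

```python
import urllib.error
import urllib.request


def try_fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print(f"Server answered with HTTP {err.code}: {err.reason}")
    except urllib.error.URLError as err:
        # Covers DNS failures, refused connections, SSL problems and similar.
        print(f"Could not reach the server: {err.reason}")
    return None


# A path that does not exist raises URLError, so this returns None.
result = try_fetch("file:///this/path/does/not/exist.html")
print(result)
```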

  • @elmehdi1291 1 year ago +3

    The Voice crack is strong with him !

  • @ktxed 8 months ago +1

    this is possibly the most naive take on web scraping that I have ever seen

  • @Wallie.AiFounder 3 months ago

    Broooo no wayyyy. Lifesaver

  • @boenjuan2042 1 year ago

    Subscribed. Can I learn from you how to get the data table in my case? @Greg Hogg

  • @gro967 6 months ago

    Uses python for web scraping 😂😂😂

  • @chrisspellman5952 5 months ago

    Now do in 2 sec on some crappy intranet website that's poorly maintained but is also super critical

    • @GregHogg 5 months ago

      That would be tricky 😂

  • @MrPierdole123 6 months ago

    Oh boy, if this works you saved me weeks of work

  • @exxon47_ 5 months ago

    Okay but doing the actual web scraping itself is the hard part not using a library that does it

    • @GregHogg 5 months ago

      Absolutely... We would want to automate this as much as possible

  • @Smiley01987 6 months ago

    Maybe mention you're using a library?

  • @shneor.e 7 months ago

    Cool story bro, now go drink some water.

  • @Herzfeld10 5 months ago

    For whom? Web scraping is the first thing you do when learning Python.

    • @GregHogg 5 months ago

      Wasn't for me, personally! I did data science in pandas without scraping first

  • @Afifan909 5 months ago

    What about when you add on a web crawler?

  • @rishabhchoudhary8641 6 months ago

    I have been using Beautiful Soup for this 2-second shit .. damnnnn

    • @GregHogg 5 months ago

      Haha yeah!

  • @Techy504 6 months ago

    Why your voice crack so much 😭😭😭😭

  • @muzzletov 6 months ago

    dude, whats with your voice?

  • @videogamecreativeinc.702 7 months ago

    No one says it's difficult haha

  • @manit77 8 months ago

    My UI is made of nested tables.

  • @karol1158 6 months ago

    HTTP Error 403: Forbidden

  • @DeveloperDrew 7 months ago

    Cloudflare has entered the chat…

  • @timgentemann6324 1 year ago +1

    It’s a very very easy example. It can become very complicated

  • @nagalakshmip8725 10 months ago

    How can I separate the year and rating into separate lists when both have the same tags and list?

    • @GregHogg 10 months ago

      . seperate()
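
One possible answer to the question above, assuming the scraped values come back as a single flat list that strictly alternates year, rating, year, rating. That layout is an assumption, since the original page is not shown, and the values below are made up. `split_alternating` is a hypothetical helper, not a library call.

```python
def split_alternating(values):
    """Split [year, rating, year, rating, ...] into two parallel lists."""
    years = values[0::2]    # items at even positions: 0, 2, 4, ...
    ratings = values[1::2]  # items at odd positions: 1, 3, 5, ...
    return years, ratings


# Made-up sample in the assumed alternating layout.
scraped = ["1994", "9.3", "1972", "9.2", "2008", "9.0"]
years, ratings = split_alternating(scraped)
print(years)
print(ratings)
```

If the two kinds of value do not alternate cleanly, slicing by position breaks, and the values have to be told apart by their parent element or by their format instead.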

  • @LGValdez 7 months ago +5

    Not a misconception at all. Now do it without pandas (adding dependencies is bad for commercial code) on a custom website that constantly changes how data is displayed. Let’s say on one version uses table, the next uses div and for some reason next there are nested divs like react or angular. Now your code should work for all of those. Good luck!

  • @Mcs1v 5 months ago

    thanks Dunning-Kruger

  • @adammiller9029 7 months ago

    You're presuming that the website already has everything stored in a way that is friendly to you. In most cases, if it did in fact have it stored in a data-friendly way, it has an API to expose that data readily. Showing people a website that is basically cherry-picked to make something look easy, then saying that's normal and expected... is going to leave novices confused and experienced engineers irritated.
    When you state a problem is normally solved easily one way, and then people go out into the wild and find out that's not really true... you're manufacturing a problem where people think they have the knowledge they need to solve a problem, so they now think the website is broken or something when it doesn't work. It could also discourage novices from sticking with things when you tell them "This usually works" and it factually does not.

  • @OghamTheBold 7 months ago

    Manchester Birmingham ... Councils have councillor details not in tables but on separate pages with no class - ONS (Office of National Statistics) refuse to publish 'Magistrate Case Data'

  • @TheSahni76 6 months ago

    super smashing 😊

  • @wintercounter2 6 months ago

    Hardness comes from avoiding bot detection, not scraping itself. Experts know how to bypass Cloudflare for example.

  • @valeriusandof9782 6 months ago

    Mind blown, I was expecting requests and BeautifulSoup, didn't know you can just use pandas

  • @PetarVukmanovic 6 months ago

    Sure. Cloudflare.

  • @havenselph 8 months ago

    Jacksfilms? HAHAHA

  • @iJuce 5 months ago

    ThErE iS A HÜgE MIscOnCEption

    • @GregHogg 5 months ago

      wHaT iS iT it

  • @SAL404w 4 months ago

    Ok do it without pandas now

    • @GregHogg 4 months ago

      I'd rather not
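
For readers curious what "without pandas" could look like, here is a minimal sketch that pulls the rows out of a plain HTML table using only the standard library's `html.parser`. It is not how pandas does it internally, and it handles only simple tables (no `rowspan`/`colspan`, no nested tables). `TableParser`, `parse_table`, and the sample table are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row into lists of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return all table rows found in the HTML as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample table for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Examplia</td><td>1,000</td></tr>
  <tr><td>Samplestan</td><td>2,500</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

Tables built from `div` elements, as several comments point out, would need different tag names and probably class-based matching, which is where the real work starts.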

  • @iSaac-kp5lk 6 months ago

    Do chairs now

  • @olegdayver7842 7 months ago

    If you use python :) But what about google spiders? I think it is much harder than just using some library, because you need to be super efficient, when scanning billions of webpages for urls. Also you probably want to scan js scripts as well, they may contain some fetch procedures, especially on react driven apps.

  • @niaei 7 months ago

    Imagine trying to grab data from a server-side processed table. The data is not there until you click on the next button.

  • @yeet3833 7 months ago

    Good luck scraping data from a dynamic website

  • @adelam7534 7 months ago

    How old are you

  • @samanshaukat4472 1 year ago

    How do we get to the page where you enter the website material ... like that main window? I can't find it even after Python has been downloaded