Web scraping in Python takes 2 seconds...
- Published: 29 Sep 2024
- (Links on this page may give me a small commission from purchases made - thank you for the support!)
Roadmap to Become a Data Scientist / Machine Learning Engineer in 2022: • Complete Roadmap to Be...
Roadmap to Become a Data Analyst in 2022: • Roadmap to Become a Da...
Roadmap to Become a Data Engineer in 2022: • Full Pathway to Become...
Here are my favourite resources:
Best Courses for Analytics:
---------------------------------------------------------------------------------------------------------
+ IBM Data Science (Python): bit.ly/3Rn00ZA
+ Google Analytics (R): bit.ly/3cPikLQ
+ SQL Basics: bit.ly/3Bd9nFu
Best Courses for Programming:
---------------------------------------------------------------------------------------------------------
+ Data Science in R: bit.ly/3RhvfFp
+ Python for Everybody: bit.ly/3ARQ1Ei
+ Data Structures & Algorithms: bit.ly/3CYR6wR
Best Courses for Machine Learning:
---------------------------------------------------------------------------------------------------------
+ Math Prerequisites: bit.ly/3ASUtTi
+ Machine Learning: bit.ly/3d1QATT
+ Deep Learning: bit.ly/3KPfint
+ ML Ops: bit.ly/3AWRrxE
Best Courses for Statistics:
---------------------------------------------------------------------------------------------------------
+ Introduction to Statistics: bit.ly/3QkEgvM
+ Statistics with Python: bit.ly/3BfwejF
+ Statistics with R: bit.ly/3QkicBJ
Best Courses for Big Data:
---------------------------------------------------------------------------------------------------------
+ Google Cloud Data Engineering: bit.ly/3RjHJw6
+ AWS Data Science: bit.ly/3TKnoBS
+ Big Data Specialization: bit.ly/3ANqSut
More Courses:
---------------------------------------------------------------------------------------------------------
+ Tableau: bit.ly/3q966AN
+ Excel: bit.ly/3RBxind
+ Computer Vision: bit.ly/3esxVS5
+ Natural Language Processing: bit.ly/3edXAgW
+ IBM Dev Ops: bit.ly/3RlVKt2
+ IBM Full Stack Cloud: bit.ly/3x0pOm6
+ Object Oriented Programming (Java): bit.ly/3Bfjn0K
+ TensorFlow Advanced Techniques: bit.ly/3BePQV2
+ TensorFlow Data and Deployment: bit.ly/3BbC5Xb
+ Generative Adversarial Networks / GANs (PyTorch): bit.ly/3RHQiRj
Become a Member of the Channel! bit.ly/3oOMrVH
Follow me on LinkedIn! / greghogg
I offer 1 on 1 tutoring for Data Structures and Analytics! Email me at greg.hogg1@outlook.com - first call is free!
Yeah but you record programming tutorials with your phone so how good could you possibly be with technology?
It's easy when the data is in a structured format like a table, as you've shown. The difficulty arrives when you want to scrape unstructured information from any website and need to make it structured.
Yes this is true
Yah
There are websites where there is no structured data. 😅
@GregHogg so that's definitely not what happens 99% of the time, right? C'mon...
I love when I have to scrape some product pricing from a random competitor
That's where beautiful soup comes into play
Scrape any news website in the same way 😅 Good luck 😂
I didn't know 🐼 could do that!
I wanted to see my Google history in the Python terminal, but that didn't happen, so I used Colab
And now post a video of yourself scraping Facebook posts where ids and classes are random.
Ain't no way bro called the URL "the line of the website" 💀
Love the voice cracks. So adorable
💀 nahh bro
Lil bro is glazing a grown ass man 🤮
@@Diaryofaninja better than you glancing at kids lololol
@@OtsoVesterinen Ur weird asl 😬
@@Diaryofaninja im weird? dude you're saying someone liking an adult is weird, just sounds like you like kids
scraping is actually easy, the hard part is parsing the data and getting what you need in the format you want
That's probably still considered scraping, but yes I very much agree that's the hard part
The part that makes it difficult is when the website doesn't want anyone to be able to scrape it. That's when you have to use captcha breakers, proxies, undetectable drivers and 30 concurrent Selenium instances.
It is also often illegal
@@vasiliigulevich9202 it's not but ok
You can also grab the api if you can, and if needed, insert fake cookies so you can reproduce the scraping more times
99% of the time? I've been scraping websites and they don't even contain any tables at all
My thoughts exactly as a data scientist
I scrape data, not tables
Cool, now do it on a firewall protected site behind a login screen and data that is rendered in a custom styled react component with no consistent ids or classes
Almost sums up my experience trying to webscrape Facebook
Give the guy some credit. That's the 1% :D
@@TheBencheek Yeah, but most people don't try to webscrape obscure websites, they try to go with big names like Facebook, eBay, Amazon, LinkedIn, etc. And those "1%" WILL have some sort of anti-webscraping measures built into their site
Trust me, I know :)
@@Michael-ty2uo Most people try to scrape places with usable information, which absolutely includes Facebook, eBay, etc. Most sites with valuable information to scrape generally make it harder to do.
Yeah actually no most of the time the data you want is not already in a table. This video should be called how to do my homework.
A jailbroken ChatGPT prompt can do all this for free, coding is a waste of time
It is easy if you are scraping Wikipedia.
Not easy when you are scraping complex websites that are oftentimes hostile to being scraped.
Voice crack
Funny, I noticed this but listening back I didn't think anyone would hear it 😂
@@GregHogg It makes it so much better.
I'm subscribing because it's endearing to hear someone so young programming and making programming videos :)
@@GregHogg how did you think no one would notice this lol
I think the problem is when you have to scrape data from a website with pagination, where you want to extract data from all the pages of the website. The thing is, literally every single time the website will block you.
Any problem is solved in 2 seconds when you have a library
That's not what you think. Stop making fools of them 😂
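For the pagination case, here is a minimal sketch of walking numbered pages with a pause between requests. It assumes a hypothetical `fetch_page` function that downloads and parses one page and returns an empty list once the pages run out; `FAKE_SITE` and `fake_fetch` are made-up stand-ins so the loop runs without a network. A delay makes the scraper more polite, but it is no guarantee against being blocked.

```python
import time
from typing import Callable, Iterator, List


def scrape_pages(fetch_page: Callable[[int], List[str]],
                 delay_seconds: float = 1.0,
                 max_pages: int = 100) -> Iterator[str]:
    """Yield rows from page 1, 2, 3, ... until a page comes back empty."""
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:
            break  # an empty page is treated as the end of the data
        yield from rows
        time.sleep(delay_seconds)  # pause between pages to reduce load on the site


# Hypothetical stand-in for a real download-and-parse function.
FAKE_SITE = {1: ["a", "b"], 2: ["c"], 3: []}


def fake_fetch(page_number: int) -> List[str]:
    return FAKE_SITE.get(page_number, [])


all_rows = list(scrape_pages(fake_fetch, delay_seconds=0))
print(all_rows)
```

Passing the fetch function in as a parameter keeps the paging loop separate from whatever library does the downloading.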
Except that usually the tables are custom ones without the table tag, or they are rendered in JS, or their data is loaded via AJAX, for example. These conditions are too ideal; in real life you practically never encounter them, so the function is practically useless
For more encouragement, I learned python web scraping in 6 hours while stoned asf. Just don't give up and you'll get there. :D
Maybe 10% of the time. Most interesting info is stuck behind JavaScript, which does not load before pandas' read_html gets executed
Where it gets difficult is when you want to do lots of UI interactions and every time the developers change 1 fucking x-path, you have to update your script
This is far from sufficient.
It'd be nice if you did this with a recording of the screen instead of a phone.
Also, I wasn't aware of pandas, but it makes it 100x easier than me trying to write it all from scratch.
REAL MAN DO IT THE MANLY WAY, IN C LIKE A MAN WOULD💪💪 (I have forgotten what the sun looks like)
Dumb video. 1% of the time there is an html table accessible without any sort of authentication or running JavaScript to render the website and the data.
An HTML table in the source code of the website is only for old-fashioned websites.
Now go read the source code of pandas that does all the work for you. People think programming is easy when they are literally using other people's hundreds if not thousands of hours to do their work. Now try doing it on an embedded system.
Now, pandas is not so much of a bad one for this, but just using libraries without the ability to understand their code, or without them being massively open source, you're just asking for a supply chain attack.
Would not work with websites having firewall and nowadays a simple website also contains one so waste of 2 sec
Yeah, Wikipedia tables are really easy. Even Excel does that automatically. But when people say web scraping, that is usually not what they mean. They usually mean unstructured data on sites that render content dynamically, which is most of the Internet nowadays and also where most interesting data is.
Yes you're very right it wouldn't handle that as well :)
Yeah, now scrape it from a table built from pure s as it is usually done by dumb site builders for the last 15 years
Please scrape amazon and other shopping sites and compare the prices of the searched product in 30 seconds. Good Luck, Hopefully You Do Not Get Blocked By Amazon And Other Sites
It's pretty nice, but it doesn't work with more complex websites. Thank you 💗
Web scraping is easy to do once and extremely difficult to scale.
Websites change and have bugs, so a naive scraper will require maintenance at a rate that scales with the number of pages and sites it scrapes.
Well, real web scraping is actually an arms race between websites which don't want to get scraped and scrapers who need scraping
99% of the time is not use cases like this. It's unstructured data. Don't just make up stuff for views
Can we extract user data for business purposes? Of course it's not legal
The other real trick is finding reliable information. Difficult to do in the west ..
I ran into difficulty with websites that require specific HTTP headers. If we open the website manually in a browser, we can see the content, inspect elements, and basically get the content, because the browser sets the HTTP headers automatically. It's different when we access the website through a Python scraping library like urllib, scrapy, or beautifulsoup: we can't just paste the URL and get the website content (HTML elements, tables, etc.); we need to specify the HTTP headers ourselves to get the elements...
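A minimal sketch of the header point, using only the standard library's `urllib.request`. The URL and the header values here are placeholders chosen for illustration, not values any particular site is known to require. The request is only built, not sent, so the snippet runs offline.

```python
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Build a request carrying browser-like headers (illustrative values only)."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


request = build_request("https://example.com/")

# Inspect what would be sent; urllib stores header names in capitalized form.
print(request.header_items())

# To actually send it (requires network access):
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```

Whether a given site then responds normally depends on the site; headers alone may not be enough when other bot checks are in place.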
With this method, there is a problem with the login, pagination, captcha, etc
Think the difficulty is keeping up with changes in the website as they don't publish versioned specs like formal APIs.
Scraping most large websites is much much harder. It involves JavaScript rendering for SPAs, spoofing browser metadata to emulate a real user, residential proxy servers, etc.
That was 3 seconds! You're a liar 😡
😂😂😂😂
Python is great 😊
Not exactly bud. Not even close to 99% of the time because a large part of data science is collecting raw data and putting that data in a spread/table yourself. Most of the time the data you're looking for when web scraping is not already nicely organized in a table or chart.
he's 30 and his voice still hasn't figured out how to not be so pitchy
Can you ever say something without having a voice crack!?!?!
I'm having trouble installing certain packages for python (using pycharm community edition)
what if they coded that table with divs lmao
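If the table is built from divs, a table-only reader has nothing to find, and the rows have to be rebuilt by hand. Here is a rough sketch with the standard library's `html.parser`, assuming made-up markup where each row is a div with class `row` and each cell a div with class `cell`; real sites will use their own, often less tidy, class names.

```python
from html.parser import HTMLParser


class DivTableParser(HTMLParser):
    """Collect rows from div-based markup (class names are assumptions)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._open_divs = []    # class of each currently open div
        self._current_row = []
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class") or ""
        self._open_divs.append(css_class)
        if css_class == "row":
            self._current_row = []
        elif css_class == "cell":
            self._cell_text = []

    def handle_data(self, data):
        # Only keep text that sits directly inside a cell div.
        if self._open_divs and self._open_divs[-1] == "cell":
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._open_divs:
            return
        css_class = self._open_divs.pop()
        if css_class == "cell":
            self._current_row.append("".join(self._cell_text).strip())
        elif css_class == "row":
            self.rows.append(self._current_row)


SAMPLE_HTML = """
<div class="table">
  <div class="row"><div class="cell">Name</div><div class="cell">Price</div></div>
  <div class="row"><div class="cell">Widget</div><div class="cell">9.99</div></div>
</div>
"""

parser = DivTableParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```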
Wtf. Who said web scraping is hard? It's just highly unethical and important data will be obfuscated. This is what happens when script kiddies try to make youtube shorts.
Sorry for being a script kiddie 😂
Guys that just started some form of engineering be like:
Easy to program when someone else did it before with a library
Idk man, this sounds like a nightmare for fats security
Do you know how to extract the visible text of a webpage ?
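One possible approach to the visible-text question, sketched with the standard library's `html.parser`: collect text nodes and skip the contents of tags that browsers do not render as page text. The set of skipped tags is an assumption for this sketch, and elements hidden through CSS or filled in later by JavaScript are not handled.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside of tags whose content is not shown on the page."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE_PAGE = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
)

print(visible_text(SAMPLE_PAGE))
```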
i can do this in word - no need to complicate myself with python...
I can copy and paste that table into google sheet without code at all.
Pandas does not load the JS of a web page. Doesn't work for all sites...
Use Selenium WebDriver....
Yeah good point!!
So you can actually use web scraping as a tool to train an AI?
Damn, bro, you still going through puberty?
You've explained it so simply I'm surprised I never realised this is the basic idea
Glad to hear it!
Right lol. And if you just so happen to want exactly and only exactly a table format. I don’t think this qualifies as a misconception, rather the exception 😂
what would you do if some random captcha appears?
All easy till it's dynamically loaded, half the content is hidden and the website had 10 different designers that never spoke to one another
Yes exactly 😹
Unfortunately it now gives the error ... File "C:\Program Files\Python312\Lib\urllib\request.py", line 1347, in do_open
raise URLError(err)
urllib.error.URLError:
That's odd
@@GregHogg Does it still work for you?
The Voice crack is strong with him !
this is possibly the most naive take on web scraping that I have ever seen
Broooo no wayyyy. Lifesaver
Subscribed. Shall I learn from you how to get the data table in my case? @Greg Hogg
Uses python for web scraping 😂😂😂
Now do in 2 sec on some crappy intranet website that's poorly maintained but is also super critical
That would be tricky 😂
Oh boy if this works you saved me weeks of work
Okay but doing the actual web scraping itself is the hard part not using a library that does it
Absolutely... We would want to automate this as much as possible
Maybe mention you're using a library?
Cool story bro, now go drink some water.
For who ? Web scraping is the first thing you do when learning python.
Wasn't for me, personally! I did data science in pandas without scraping first
What about when you add on a web crawler?
I have been using Beautiful Soup for this 2 second shit .. damnnnn
Haha yeah!
Why does your voice crack so much 😭😭😭😭
dude, whats with your voice?
No one says it's difficult haha
My UI is made of nested tables.
HTTP Error 403: Forbidden
Cloudflare has entered the chat…
It’s a very very easy example. It can become very complicated
How can I separate the year and rating into separate lists when both have the same tags and list?
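One way to approach this, assuming the scraped values arrive as a single flat list of strings: sort each value by what it looks like rather than by its tag. The sample values and both patterns below are made up for illustration; the year pattern assumes four-digit years from 1900 to 2099, and the rating pattern assumes at most one decimal place.

```python
import re
from typing import List, Tuple


def split_years_and_ratings(values: List[str]) -> Tuple[List[int], List[float]]:
    """Sort mixed strings into years and ratings based on their shape."""
    years, ratings = [], []
    for value in values:
        text = value.strip()
        if re.fullmatch(r"(19|20)\d{2}", text):
            years.append(int(text))       # looks like a four-digit year
        elif re.fullmatch(r"\d{1,2}(\.\d)?", text):
            ratings.append(float(text))   # looks like a rating such as 9.3
        # anything else is ignored in this sketch
    return years, ratings


# Made-up sample values standing in for text pulled from identical tags.
scraped = ["1994", "9.3", "1972", "9.2", "2008", "9.0"]
years, ratings = split_years_and_ratings(scraped)
print(years)
print(ratings)
```

If the values strictly alternate, slicing by position (`scraped[0::2]` and `scraped[1::2]`) is simpler, but it silently misaligns the two lists as soon as one value is missing.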
. seperate()
Not a misconception at all. Now do it without pandas (adding dependencies is bad for commercial code) on a custom website that constantly changes how data is displayed. Let’s say on one version uses table, the next uses div and for some reason next there are nested divs like react or angular. Now your code should work for all of those. Good luck!
thanks Dunning-Kruger
You're presuming that the website already has everything stored in a way that is friendly to you. In most cases, if it did in fact have it stored in a data-friendly way, it has an API to expose that data readily. Showing people a website that is basically cherry-picked to make something look easy, then saying that's normal and expected... is going to leave novices confused and experienced engineers irritated.
When you state a problem is normally solved easily one way, and then people go out into the wild and find out that's not really true... you're manufacturing a problem where people think they have the knowledge they need to solve a problem, so now think the website is broken or something when it doesn't work. It could also discourage novices from sticking to things when you tell them "This usually works" and it factually does not.
Manchester Birmingham ... Councils have councillor details not in tables but on separate pages with no class - ONS (Office of National Statistics) refuse to publish 'Magistrate Case Data'
super smashing 😊
Hardness comes from avoiding bot detection, not scraping itself. Experts know how to bypass Cloudflare for example.
Mind blown, I was expecting requests and BeautifulSoup, didn't know you can just use pandas
Sure. Cloudflare.
Jacksfilms? HAHAHA
ThErE iS A HÜgE MIscOnCEption
wHaT iS iT it
Ok do it without pandas now
I'd rather not
Do chairs now
If you use python :) But what about google spiders? I think it is much harder than just using some library, because you need to be super efficient, when scanning billions of webpages for urls. Also you probably want to scan js scripts as well, they may contain some fetch procedures, especially on react driven apps.
Imagine trying to grab data from a server-side processed table. The data is not there until you click on the next button.
Good luck scraping data from a dynamic website
How old are you
How do we get to the page where you enter the website material to .... like that main window? I can't find it even after Python has been downloaded