Python Web-scraping with Selenium vs Scrapy vs BeautifulSoup | Witcher project ep. #1

  • Published: 12 Dec 2024

Comments • 125

  • @LukeBarousse
    @LukeBarousse 2 years ago +47

    Looking forward to this series, Thu! 🙌🏼 Also, love me some Selenium for web scraping!

    • @Thuvu5
      @Thuvu5  2 years ago +6

      Hehe thanks for dropping by Luke!! 🙌🏽💜 Selenium fan here! 👋 I'm figuring stuff out and experimenting with ideas as I go but I'm really enjoying this project!

    • @LetsScrapeData
      @LetsScrapeData 1 year ago

      Thanks. I have created many web scrapers. I can create other scrapers for free if needed. @@Thuvu5

  • @dothework2989
    @dothework2989 2 years ago +7

    As someone who is looking to go into the data field, this was incredibly eye opening on the range of different things it can be applied to. Very grateful for this video, thanks :)

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Aw thank you for this! 🤓🙌🏽

  • @ekaterinaerikhova9305
    @ekaterinaerikhova9305 2 years ago +1

    Thank you so much for this video! I've struggled with data scraping for my project, but with your tutorial, I managed to get the data I needed!

  • @navi_dust
    @navi_dust 2 years ago +3

    I'm so thrilled I found your channel - your content is amazing and inspiring!!! Thank you so much for sharing!

  • @al7240
    @al7240 2 years ago +1

    This confirmed some doubts I had about which frameworks/libraries I'm using. Thanks!

  • @filipesaladini8386
    @filipesaladini8386 2 years ago +1

    I was waiting for a full-length project with all the minor errors and the possibility of coding along like this.
    Great content as always, Thu! Thanks

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Hey Filipe, yeahh.. so glad I kept the promise haha. Thank you for watching 🤩

  • @jordantherubio
    @jordantherubio 2 years ago +1

    Ah yeah, I love the Witcher books! And showing how to solve real problems in your process is great; since I learn better with visuals, this is a great series!

  • @derricktoppert8145
    @derricktoppert8145 2 years ago +3

    I've been using Selenium for web automation, never thought to use it for web-scraping. Thanks a lot for that idea 👋
    Also, looking forward to the rest of the series.

  • @automatedbymarc6556
    @automatedbymarc6556 2 years ago +1

    Cool example! Looking forward to the next episode

  • @Kilo1Nation
    @Kilo1Nation 2 years ago +3

    Thanks for your consistently great videos. I can't wait to see the rest of the project!

    • @Thuvu5
      @Thuvu5  2 years ago

      Aw thank you for consistently checking out my videos, Cole! 😇👋

  • @Barbara-ka1mt
    @Barbara-ka1mt 2 years ago +5

    I've been following you for a while, and I really like your priceless tips. With this project you've officially become my guru!!! 🙌 Can't wait for your next video, and can't wait for the 3rd season!!! 😊

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Hey Barbara, thanks so much for following my channel! 🤗 me too, can’t wait for the 3rd season 🤩

  • @SophiaYangDS
    @SophiaYangDS 2 years ago +3

    Love this series. Looking forward to the next one!

  • @johnwig285
    @johnwig285 2 years ago +3

    This channel is underrated 😭

    • @Thuvu5
      @Thuvu5  2 years ago

      Aw thank you John!! (I think so too 😂🙌)

  • @jasonlewis5125
    @jasonlewis5125 2 years ago +1

    It’s finally here!!!

    • @Thuvu5
      @Thuvu5  2 years ago

      Heck yeah, thank you for your patience 🙈

  • @tusharyadav3299
    @tusharyadav3299 2 years ago +1

    Excited for the complete project...!!! 🔥🔥🔥

  • @aliciatraver7216
    @aliciatraver7216 2 years ago +1

    Awesome video on Selenium! You do a very good job of explaining things step by step. Keep up the great content!

  • @lvhq-lamviechieuquatrongth7513

    Thanks!

  • @anshmarketing-xk3cb
    @anshmarketing-xk3cb 3 months ago

    What a tutorial. You have gained a new subscriber!!

  • @sisnandojunior340
    @sisnandojunior340 2 years ago +1

    Thanks for sharing your knowledge, Thu! Besides practicing my English, I'm learning more about data science. :)

  • @mabenba
    @mabenba 2 years ago

    Thanks for this! I learned to use BeautifulSoup in the Python for Everybody specialization on Coursera, and this video is similar to the final project I made, but with Selenium (in my project I scraped song lyrics). What a great opportunity to learn a new library and make an amazingly fun project along the way.
    Thanks for your amazing content!

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Hey Matias, that’s so cool! Seems like a very interesting project you did! Thanks for watching 🙌🎉

  • @carbon-kevin
    @carbon-kevin 2 years ago +2

    That was pretty fun following along, really starting to like python. I hope to see more like this in the future.

    • @Thuvu5
      @Thuvu5  2 years ago +2

      Ohh that was nice! 🙌🏽🙌🏽 It's always good to know that the code is actually reproducible 😂. Yess, more is definitely coming! I also try to keep a balance between project vids and non-project ones, though. Thank you for watching and following along!

    • @carbon-kevin
      @carbon-kevin 2 years ago +1

      @@Thuvu5 It's definitely reproducible, I made it to the end of the video with the same results. Then cloned my own repo so I could experiment with the data set. Thanks for sharing this information!

  • @taianemonteiroacupuntura
    @taianemonteiroacupuntura 2 years ago +1

    Incredible!!! Very fun project!!! Congrats!!!

  • @KunjaBihariKrishna
    @KunjaBihariKrishna 1 year ago

    GPT-4 with vision and browsing is really changing the web-scraping game. It's not going to do well with high-volume tasks, at least not cheaply, but it will make web scraping much easier in general

  • @ParagOak
    @ParagOak 2 years ago +1

    0:30 I’m sold here. 😊
    (I’m a beginner in the DS field)

  • @juliensoyer5786
    @juliensoyer5786 2 years ago +1

    Thanks for that! Thanks to you, I managed to use it for the lore of a game I love!

    • @Thuvu5
      @Thuvu5  2 years ago

      Great to hear Julien!! 🙌

  • @dhoangk07
    @dhoangk07 2 years ago +1

    Thank you for your lesson about scraping with Selenium.

    • @Thuvu5
      @Thuvu5  2 years ago +1

      You’re very welcome 😊

  • @nicolasmedinacaraballo129
    @nicolasmedinacaraballo129 2 years ago +1

    Hi! Thank you, this video helped me a lot with a little task at my job!! I was a little confused by the XPath, but I solved it by right-clicking on the element and copying the XPath directly

    • @Thuvu5
      @Thuvu5  2 years ago

      Yay 🙌🙌

  • @DevVuiTinh
    @DevVuiTinh 2 years ago +1

    Hi, I see you're making a YouTube channel about exactly the field I'm working in, so I got really excited. I'm currently building and maintaining a project involving web crawling. Of course, it's much more complex than the usual basics. Because of the nature of their business, some websites have anti-crawling mechanisms that use third-party services, for example Cloudflare. In that case you need specialized drivers like undetected-chromedriver. Hope you release many more great videos for everyone to learn from 😁

    • @Thuvu5
      @Thuvu5  2 years ago +1

      OK, thank you for sharing 🤗

  • @shaonsikder556
    @shaonsikder556 2 years ago +1

    Very Impressive Initiative!

  • @taianemonteiroacupuntura
    @taianemonteiroacupuntura 2 years ago +1

    Incredible!!! Very fun project!!! Thanks!!!

  • @thaynaazevedocarvalho1461
    @thaynaazevedocarvalho1461 2 years ago +1

    yes!! can't wait to see how it goes!

  • @Daniel-fn6tj
    @Daniel-fn6tj 2 years ago +1

    Please continue this series

  • @TulkinYusupov
    @TulkinYusupov 2 years ago +1

    Thank you! This was just what I needed!

    • @Thuvu5
      @Thuvu5  2 years ago +1

      So glad it helped, Tulkin 🙌🏽

  • @anarikobi23
    @anarikobi23 2 years ago +2

    Very useful and easy to understand

    • @Thuvu5
      @Thuvu5  2 years ago

      Aw so glad to hear, Anari! 🙌🏽

  • @ferozahmedsoomro359
    @ferozahmedsoomro359 10 months ago +1

    Thanks a lot. The video was very helpful.

  • @ffukue
    @ffukue 1 year ago

    Very good content; congratulations on the videos and on the teaching. It's a lot of fun to study and follow your content, and it made me enjoy using Python again

  • @coscorrodrift
    @coscorrodrift 2 years ago +1

    Wow, this is brilliant. Great explanation of the differences between those three libraries. I think you should link the 2nd part of the series in the end cards instead of those two other videos about stats etc.
    Also, if this title doesn't work well, maybe you could try "Selenium vs Scrapy vs BS4: what do I use in a REAL project" or something like that. I feel like that represents the content of the video very well and could be clickable for many

    • @Thuvu5
      @Thuvu5  2 years ago

      Wow thank you so much for these suggestions! 🙌 You’re absolutely right! This is so helpful, I’ll adjust the title and end screen 😁

  • @Oreoshake02
    @Oreoshake02 2 years ago +1

    Rest aside...She picked up The Witcher 🤩🤩..Queen, you dropped this 👑...

    • @Thuvu5
      @Thuvu5  2 years ago

      Yay, so nice to see another The Witcher fan here! 👋 I can't wait for the next season later this year! 🤩

  • @tjard1990
    @tjard1990 2 years ago +1

    Hey this reminds me of something we did a few years ago, with Instagram... ;')
    Btw, I think there's a better approach than using XPath: CSS selectors. They are a bit faster, especially once you use a lot of queries on one page. Firefox never gives CSS selectors as an option, but Chrome does! Helped me quite a bit in writing other (headless) Selenium applications!
    Also, maybe a note on session storing with Selenium / crawling in your next video?
    Keep up the good work !

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Omg Tjard!!!! Thanks so much for watching my video and commenting. Yesss, miss our Instagram bot 😂. And good to know about the CSS selector! hope you’re doing well 🤗

    • @tjard1990
      @tjard1990 2 years ago

      @@Thuvu5

  • @Huyincon
    @Huyincon 2 years ago +1

    This is so good! ❤

    • @Thuvu5
      @Thuvu5  2 years ago

      Thank you! ❤️

  • @vineeta03
    @vineeta03 2 years ago +2

    Thank you, YouTube algorithm, for bringing me here

    • @Thuvu5
      @Thuvu5  2 years ago

      Yay that’s awesome, Vineeta! Thank you for watching 🤗

  • @jamesc7514
    @jamesc7514 1 year ago

    Awesome :)
    part 2 please :D

    • @Thuvu5
      @Thuvu5  1 year ago

      Link to Part 2 is in the video description 🙂

    • @jamesc7514
      @jamesc7514 1 year ago

      @@Thuvu5 Thank you!!

  • @FIBONACCIVEGA
    @FIBONACCIVEGA 1 year ago +1

    Such a good video . I loved it 🙌🙌🙌🙌🙌

  • @phyra4780
    @phyra4780 2 years ago +1

    Thank you, you're awesome!

    • @Thuvu5
      @Thuvu5  2 years ago

      Aw thank you for watching! ☺️

  • @TheMISBlog
    @TheMISBlog 2 years ago +1

    Very useful video, Thu! Thanks

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Thanks so much TheMis Blog 🙌🏽. Your comments mean a lot 😀

    • @LoneWolf-xz1ln
      @LoneWolf-xz1ln 2 years ago

      @@Thuvu5 🥰🥰🥰stop it now

  • @sundarhistoriya9268
    @sundarhistoriya9268 2 years ago +1

    ❤👌👌👌 awesome

  • @mitsuinormal
    @mitsuinormal 2 years ago +1

    Wow I admire you so much

    • @Thuvu5
      @Thuvu5  2 years ago

      Aw thank you! 🤗🙌

  • @micuzzu
    @micuzzu 2 years ago +1

    BTW, find_element_by_class_name is deprecated. It should be find_element(By.CLASS_NAME, "class name"), like you used for the XPath

  • @eduardotejeda
    @eduardotejeda 2 years ago +1

    Thank you!!!

  • @piggjf
    @piggjf 2 years ago +1

    My first comment was deleted, possibly because it had links that I used for troubleshooting. A couple of improvements that I discovered after finishing your video:
    1. I converted the deprecated find_elements_* functions to just find_elements. This removed the warning message.
    2. For the driver, I found it went faster if I called a headless Chrome, set via Chrome options. This also seems to stop the cookie popup from appearing, for some reason.
    3. Because I'm running WSL2, I had better performance after installing the Chrome driver to a folder within my project folder and running that as a service.
    I'll link my alterations in a follow-up comment in case YouTube is auto-filtering comments with links.

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Oh thank you so much for the improvement points! I’ll probably update some part of the code as you suggested, otherwise please feel free to send me a pull request! 🙌😀

    • @piggjf
      @piggjf 2 years ago +1

      And thank you for the video. I learned a lot going through this walkthrough. Looking forward to the next installment in the series!

    • @Thuvu5
      @Thuvu5  2 years ago +1

      ​@@piggjf I'm so glad! And great work with the pull request, thank you so much for improving the code! 🙌🏽

  • @saurabhthakare4218
    @saurabhthakare4218 2 years ago

    Great content, but could you please show the webdriver-manager error that occurs due to the Chrome driver?

  • @ahmedabushama4024
    @ahmedabushama4024 2 years ago +1

    Nice job

    • @Thuvu5
      @Thuvu5  2 years ago

      Thanks Ahmed!

  • @ifeanyinwobodo8530
    @ifeanyinwobodo8530 1 year ago +1

    Thanks for the video.
    I'm having issues with this project; it keeps giving me:
    AttributeError: module 'selenium' has no attribute 'Chrome'
    What can I do? Your input will be highly appreciated

  • @alexanderchebotariov7230
    @alexanderchebotariov7230 2 years ago

    Thank you for a great video; looking forward to seeing the others. I'd like to ask you a question: when you searched for an element by XPath you used (By.XPATH, ...), not the method find_elements_by_xpath. Is that to show both available methods, or is there a reason for it (e.g., faster, easier to read)?

  • @kenchang3456
    @kenchang3456 2 years ago +1

    Like @Luke Barousse, I'm looking forward to this project. I have a similar project and this will really help kick mine off. Thank you very much :-)

    • @Thuvu5
      @Thuvu5  2 years ago

      Oh what a nice coincidence!! Thanks for watching Ken 🙌

  • @d-rey1758
    @d-rey1758 1 year ago

    Any advice on crawling through a website with a lot of "a href" elements, especially when they are child elements? Selenium seems to struggle with it. Is Selenium even the right tool?

  • @learn_techie
    @learn_techie 2 years ago

    How do I write code to scrape information from the websites on the first 3 pages of Google search results? I mostly see solutions for extracting information from a single URL; I need something more comprehensive. Can I give the website as an argument?

  • @avirathi3450
    @avirathi3450 2 years ago

    I have 170 URLs. How can I extract the text from each of them and do text analysis?
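[Editor's note] For a plain list of URLs a browser is often unnecessary; a sketch with requests and BeautifulSoup (both assumed installed), with a polite delay between fetches:

```python
import time
import requests
from bs4 import BeautifulSoup

def extract_text(html):
    """Strip tags and collapse whitespace from one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(soup.get_text(separator=" ").split())

def scrape_texts(urls, delay=1.0):
    texts = {}
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        texts[url] = extract_text(resp.text)
        time.sleep(delay)  # be gentle with the server
    return texts

print(extract_text("<p>Hello <b>world</b></p>"))  # Hello world
```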

  • @cerealport2726
    @cerealport2726 2 years ago +2

    I tried:
    from selenium import dancemoves
    but sadly I got a traceback error "ImportError: cannot import name 'dancemoves' from 'selenium'"
    I guess I need to update Selenium....?

    • @Thuvu5
      @Thuvu5  2 years ago +2

      LOL yesss I think so! 😂 Or try restarting your computer, it always fixes weird glitches 🤣jk

    • @cerealport2726
      @cerealport2726 2 years ago

      @@Thuvu5 I'm never happy with my code until all the weird glitches work together to cancel each other out! 🥳

  • @yordanadaskalova
    @yordanadaskalova 2 years ago

    Hi Thu, I have one question: the Wiki's Terms and Conditions page forbids any kind of scraping of its content. Isn't this a little, let's say, not legal? Thanks

  • @piyushdwivedi3896
    @piyushdwivedi3896 2 years ago +1

    And the 100th Like goes to me..!!!!

    • @Thuvu5
      @Thuvu5  2 years ago

      Yaaaay, thanks for the nice number! 😀🙌🏽

  • @ZombieHelion2561
    @ZombieHelion2561 2 years ago +1

    I would like to ask for your advice. I have a use case where a website consists of thousands of PDF files which I need to download one by one. Is it possible with web scraping to download all of the PDF files at once?

    • @Thuvu5
      @Thuvu5  2 years ago

      Hey, I’m not sure, as I’ve never downloaded PDFs with this method before, but I think it should be possible to use Selenium to click the download button and download things. I guess it can be done almost all at once if you use a very small waiting time between the downloads

  • @vilasmawal3099
    @vilasmawal3099 1 year ago

    The website I am scraping gives me a captcha to check whether I am human or not. Is there any way to avoid this?

  • @zainiqbal7990
    @zainiqbal7990 1 year ago +1

    Hi Thu, how can I support you besides making purchases through your affiliate links? Let us know

    • @Thuvu5
      @Thuvu5  1 year ago

      Hey Zain, thank you! I really appreciate it! Currently I don't have any other ways you could support me :(, but thank you for offering that!

  • @hoangyenin5945
    @hoangyenin5945 2 years ago +1

    ❤💞❤

  • @leniedor733
    @leniedor733 2 years ago

    Why Selenium for a site whose data is not hidden behind JS?

    • @leniedor733
      @leniedor733 2 years ago

      It seems unnecessary and processing-heavy (Python being already slower than Java)

  • @NormanWasHere452
    @NormanWasHere452 2 years ago

    I cannot for the life of me get past this error: "WebDriverException: Message: chrome not reachable" whenever I try to run driver.get(page_url)

  • @iepnguyenthai6084
    @iepnguyenthai6084 2 years ago +1

    ok

  • @myatcharm982
    @myatcharm982 2 years ago +1

    To all those who followed along and got the 'timeout' error like me: try searching "Stop infinite page load in selenium webdriver python" on Stack Overflow. My bot couldn't extract characters from more than 4 books, and I lost a night's sleep finding the cause. >,

    • @Thuvu5
      @Thuvu5  2 years ago +1

      Oh I'm so sorry to hear this 😅. Thank you for sharing this tip!! 🙌🏽

  • @s6yx
    @s6yx 7 months ago

    playwright >

  • @Frenzymove
    @Frenzymove 1 year ago

    You're rushing through it

  • @arun.kumar.s
    @arun.kumar.s 2 years ago

    Not everyone is a nerd; it's kinda hard to follow. I watched it at a slower pace. Make sure the sequel is for non-nerds.

  • @LoneWolf-xz1ln
    @LoneWolf-xz1ln 2 years ago

    You are my crush 🥰

    • @ruimand9632
      @ruimand9632 2 years ago

      She may be your crush, but after this video she is my god 🤣

    • @Thuvu5
      @Thuvu5  2 years ago +1

      LOL that's funny ;P

    • @lycan2494
      @lycan2494 2 years ago

      SIMP DETECTED!

    • @LoneWolf-xz1ln
      @LoneWolf-xz1ln 2 years ago

      @@Thuvu5 how is it funny?

    • @LoneWolf-xz1ln
      @LoneWolf-xz1ln 2 years ago

      @@Thuvu5 I have nothing to do with data science; still, idk why I watch your videos

  • @LoneWolf-xz1ln
    @LoneWolf-xz1ln 2 years ago

    I have a huge crush on you crush 🥰 😍☺️

  • @jerrycheng8537
    @jerrycheng8537 1 year ago

    Why does it return an empty list in book_categories when I type "book_categories = driver.find_elements(By.CLASS_NAME, 'category-page_member-link')"?

  • @fokrulaminrasel5669
    @fokrulaminrasel5669 2 years ago +1

    driver.get(pageURL) is showing an error. Can you help me?