Scraping multiple websites with one Python script

  • Published: 15 Feb 2023
  • Writing a simple web scraping script to do some basic price comparison
    github.com/jhnwr/youtube
    Scraper API www.scrapingbee.com/?fpr=jhnwr
    Patreon: / johnwatsonrooney
    Donations: www.paypal.com/donate/?hosted...
    Proxies: iproyal.club/JWR50
    Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
    Gear I use: www.amazon.co.uk/shop/johnwat...
    Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
  • Science

Comments • 58

  • @bainsk8 2 months ago

    Great video John, thank you. Very informative.

  • @silkogelman 1 year ago +3

    Thank you John! 🙏
    Informative and it got me a couple of new ideas I want to try now! 💡😀

  • @dennistanui7085 1 year ago +3

    Thanks a lot, always informative. How would you then run the two scrapers concurrently? And how would you pattern match when scraping a lot of products (i.e. scrape all products on both sites, and then create a product_dataframe, for example, with a price comparison)?
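
One possible way to run two scrapers at the same time is a thread pool from the standard library. This is only a sketch: scrape_site_a, scrape_site_b, the product name and the prices are made-up stand-ins, not code from the video, and it covers the concurrency part of the question only.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two site scrapers; real ones would
# fetch and parse a product page. Each returns (product, price).
def scrape_site_a(product: str) -> tuple[str, float]:
    return product, 499.00

def scrape_site_b(product: str) -> tuple[str, float]:
    return product, 479.50

def compare_prices(product: str) -> dict[str, float]:
    scrapers = {"site_a": scrape_site_a, "site_b": scrape_site_b}
    # Threads suit scraping because the work is mostly waiting on the network.
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        futures = {name: pool.submit(fn, product) for name, fn in scrapers.items()}
        # result() blocks until that scraper finishes; [1] is the price.
        return {name: fut.result()[1] for name, fut in futures.items()}

prices = compare_prices("example guitar")
cheapest = min(prices, key=prices.get)
print(prices, cheapest)
```

Each scraper is submitted as its own task, so adding a third site would only mean adding another entry to the scrapers dictionary.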

  • @zhengdiao3494 1 year ago +2

    Learned a lot from your video. I hope you come out with a Neovim editor tutorial. Thank you, sir!

    • @JohnWatsonRooney 1 year ago +1

      I am working on a Neovim video, and thanks for watching.

    • @zhengdiao3494 1 year ago

      @JohnWatsonRooney Thanks, have a nice life!

  • @Wassilvideos 1 year ago

    Hi John, I have a question: can you show me how to scroll down a scrollable ul list in a section of the HTML with Playwright?

  • @gh-sb1dy 1 year ago

    Your vids are great.
    When getting info from a site using Python, is the IP address the same as my computer's, or does Python have its own different IP address? And the same with Scrapy: if I use Scrapy, is the IP address the same as this computer's?
    I ask because some sites have blocks set up to prevent this type of thing, and I don't want my IP to get banned forever.
    Is there any way to bypass this so you don't get banned?

  • @mmemahmoud7274 1 year ago +1

    Nice work as always. Can you please make a video about how to scrape email addresses from a domain?

  • @shehbanpatel 10 months ago

    Hello, I tried this but keep getting the error AttributeError: 'NoneType' object has no attribute 'text'. I printed the text that resp receives, and it doesn't contain the tag that shows up when inspecting the page.

  • @sheikh4awais 1 year ago +10

    Could you also make one tutorial for your code editor setup and the terminal? It looks really cool.

  • @ericxls93 1 year ago +1

    Very good video as usual! Thank you! When is the ChatGPT video coming? 🤔

    • @JohnWatsonRooney 1 year ago +1

      Thanks! Hmm, not a fan of ChatGPT, not sure I'll cover it.

  • @PankajThakur-jq1td 1 year ago +1

    Hey John, how can we scrape a page that requires a zip code before it shows the actual data, plus several navigation steps to reach that data?

    • @JohnWatsonRooney 1 year ago

      Yes, you will need to see how the website works. Sometimes it's an Ajax request when you enter the zip code, which you can copy; other times it might need browser automation.

  • @LHCB6 1 year ago +1

    Thanks for the zz shortcut

    • @JohnWatsonRooney 1 year ago

      It’s a good one I didn’t even know about until recently

  •  1 year ago +1

    so coooool

  • @jigneshprajapati6974 1 year ago

    How do you automate the captcha in Python?

  • @samoylov1973 1 year ago +1

    Thank you for this video! It works wonderfully with a particular item. But what if I want to get multiple items, say, news stories from a website? html.css_first(selector).text().strip() gets only the latest one. css_all doesn't work, and just html.css(selector) won't work either. Please help.

    • @JohnWatsonRooney 1 year ago +1

      Thanks. html.css(selector) will return a list of all matching elements for the given selector, so we can loop through this and call .text() on each iteration to get the data.

    • @samoylov1973 1 year ago

      @JohnWatsonRooney Thank you! Waiting for more videos! Take care!

  • @yawarvoice 1 year ago +2

    Hi,
    I've asked this question on another video of yours as well, but I'm asking here again in case you missed the other one:
    @John I've been following you for a long time and watching all your scraping videos with Python. I have started to create a scraper, but the website is not allowing me access because it considers my script a bot. I have changed the user-agent to the latest Chrome, but the website still recognizes me as a bot. My question is: which combo should I use for scraping slightly complex JS/AJAX/bot-aware websites? People say that Selenium is good for that purpose, but you say that Selenium is not a good option nowadays as it is slow. So what do you suggest? Which combo should I use that can fit many scenarios, if not all?
    Looking forward!
    Thanks.

    • @JohnWatsonRooney 1 year ago +2

      Hi - it depends on the site, but generally I suggest trying: a) adding more headers as well as the user agent, b) trying Playwright/Selenium with the undetectable driver, c) using proxies, d) a combination of all three. Beating some anti-bot protection can be tricky; it takes time to figure out what you need to do to comply.

    • @yawarvoice 1 year ago +1

      @JohnWatsonRooney Normally Cloudflare is the only hindrance. Where can I find detailed documentation for selectolax? Right now I'm writing a scraper using cloudscraper (I found it in a comment answered by you), and it has bypassed Cloudflare. But I'm having trouble with selectolax and am unable to find proper documentation. Is there any other fast alternative to selectolax that has a bigger community?

    • @JohnWatsonRooney 1 year ago +2

      @yawarvoice selectolax is just an HTML parser - the main one in the Python community is BeautifulSoup; you could give that a go.

    • @yawarvoice 1 year ago

      @JohnWatsonRooney Got it. One last thing: which one would you prefer: 1) SE + BS, 2) Playwright + BS, or 3) Cloudscraper + BS?

  • @lasangagamers 1 year ago

    I have written the code, but it does not print any results.

  • @pypypy4228 1 year ago +2

    Great video as always! But how do you manage not to be banned by Amazon? I tried scraping a couple of years ago - it always detected my script as a robot and didn't give me any data.

    • @JohnWatsonRooney 1 year ago +2

      Thanks! I’ve never had an issue with Amazon - I found that I usually just need a user agent and occasionally the language header and I’m good
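
A small sketch of sending a user agent and a language header, using only the standard library. The header values and the URL are placeholders picked for illustration, not the ones used in the video, and no request is actually sent here.

```python
from urllib.request import Request

# Example header values only; copy real ones from your own browser's
# network tab. The URL is a placeholder, and nothing is sent here.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-GB,en;q=0.9",
}

request = Request("https://www.example.com/product/123", headers=headers)

# urllib stores header names capitalised as "User-agent", "Accept-language".
print(request.get_header("User-agent"))
print(request.get_header("Accept-language"))
```

Passing the request to urllib.request.urlopen would send it; the same dictionary can also be given to the headers argument of requests.get or httpx.get.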

    • @pypypy4228 1 year ago +1

      @JohnWatsonRooney thank you! I gotta give it a try!

  • @void-qy4ov 1 year ago

    Hey man, thanks for your tutorials.
    I'm doing a lot of scraping. Lately I needed to get the logos of 20k e-commerce stores.
    IMHO, it was an interesting task. Unfortunately, only about 1/3 could be automated - I went with finding divs, class names, and image sources that have 'logo' in them.
    Maybe you did something like that before and have an interesting strategy?

    • @JohnWatsonRooney 1 year ago

      Hey, thanks. Interesting task, as you say. I would probably save the HTML for each store into a document database like Mongo, and then test different patterns against each - it saves having to make loads of requests over and over. This way you could try different approaches and see which works, updating the database with the logo as you go. It's a theoretical approach, though, and would probably need revising as you go.
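
A rough sketch of that save-once, test-many-patterns idea. It swaps Mongo for the standard library's sqlite3 so it runs without a server, and the store names, HTML snippets and regular expressions are all invented for illustration; a proper HTML parser would likely be more robust than regexes on real pages.

```python
import re
import sqlite3

# In-memory database for the sketch; a file path would persist between runs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (store TEXT PRIMARY KEY, html TEXT, logo TEXT)")

# Made-up pages standing in for HTML saved from earlier requests.
saved = {
    "store-one": '<img class="site-logo" src="/img/brand.png">',
    "store-two": '<img src="/assets/logo.svg" alt="Shop">',
    "store-three": '<img src="/banner.jpg">',
}
db.executemany("INSERT INTO pages (store, html) VALUES (?, ?)", saved.items())

# Candidate patterns, tried in order; each captures the image source.
patterns = [
    re.compile(r'<img[^>]*class="[^"]*logo[^"]*"[^>]*src="([^"]+)"', re.I),
    re.compile(r'<img[^>]*src="([^"]*logo[^"]*)"', re.I),
]

for store, html in db.execute("SELECT store, html FROM pages").fetchall():
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            # Record the hit so later runs can skip stores already solved.
            db.execute("UPDATE pages SET logo = ? WHERE store = ?", (match.group(1), store))
            break

found = dict(db.execute("SELECT store, logo FROM pages ORDER BY store"))
print(found)
```

Stores whose logo column is still empty are the ones that need a new pattern, and they can be retried without making any further requests.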

    • @void-qy4ov 1 year ago

      @JohnWatsonRooney Yep, I skipped the DB part and just used saved pages (I played with filenames to get a correlation to the store identifier). Picking a strategy is the tricky part: every site chooses its own way to keep the logo, even on platforms like Shopify or WP :)

    • @bensikes1640 1 year ago

      I'm trying to scrape addresses (zip code, city, state, etc.) from thousands of websites. How would you recommend I do this? I'm trying regular expressions, but even then it pulls in other info.

  • @c__0ne 1 year ago

    Nice! Is this Neovim? Can you tell me how to get this editor with syntax highlighting, tabs, etc.? Thank you!

    • @JohnWatsonRooney 1 year ago +2

      Yes it is! I am going to do a video on it, but if you google "chrisatmachine basic IDE neovim", it's basically that.

    • @c__0ne 1 year ago +1

      @JohnWatsonRooney thx!

  • @garyjo3229 1 year ago +1

    One question, what is your ide?

    • @JohnWatsonRooney 1 year ago

      Neovim - it’s a slightly modified version of chrisatmachine’s basic ide if you google it

  • @ChristopherBrown-bj4zl 1 year ago +1

    7:05 Yeah but, show me the ugly as sin CSS selectors/HTML. Those are the ones that give me the hardest time. Great vids! Thanks!

    • @JohnWatsonRooney 1 year ago

      Haha, yeah, I understand. I'll include some more wonky stuff going forward.

  • @theDataFixer 1 year ago

    Do you have any web scraping tutorial? From zero to hero? Using the most updated tools and stuff?

    • @JohnWatsonRooney 1 year ago +1

      I haven't done a full long video like that, no; it's spread out over multiple ones. I could add it to my list of things.

    • @theDataFixer 1 year ago +1

      @JohnWatsonRooney Nice! A playlist might work, or a long video, or whatever... paid or not, I'm pretty sure it will be useful for everyone 🙏

  • @gh-sb1dy 1 year ago +1

    Can you please post the code from your videos via a link below, on GitHub, or somewhere similar? It would be so helpful.

    • @JohnWatsonRooney 1 year ago

      github.com/jhnwr/youtube - I am reorganizing my GitHub, but here it is.

  • @bakasenpaidesu 1 year ago +2

    The comment section be like
    Video : "How I survived from dying"
    Comments: the shirt looks good.
    What I mean is everyone is asking for ide 😂

    • @JohnWatsonRooney 1 year ago +1

      Haha yeah, I didn't think people would be that interested in it.

  • @herrpez 10 months ago

    Oops. Misspelled Thomann; better remake the video! 😉

    • @JohnWatsonRooney 10 months ago

      Haha yeah- I have actually done that before!