This AI Scraper Update Changes EVERYTHING!!

  • Published: 14 Nov 2024

Comments • 93

  • @randomchannelname9061
    @randomchannelname9061 1 month ago +23

    Nice job 👍🏻
    Perhaps Llama locally and/or from Groq would be a nice improvement

  • @muhammadadil-v9i
    @muhammadadil-v9i 1 day ago

    I can't explain in words what you do. Thanks for all your efforts!!!

  • @hannespi2886
    @hannespi2886 1 month ago +9

    Can't believe this, you did it. I've been coding non-stop for the last 5 days because of your last video on this, thaaank youu!!

    • @redamarzouk
      @redamarzouk 1 month ago +2

      My pleasure 🙏

    • @JayS.-mm3qr
      @JayS.-mm3qr 19 days ago +1

      How did a scraper help you code?

  • @tanvirahmed1959
    @tanvirahmed1959 1 month ago +22

    Please integrate llama3 locally (without any API), as many of us run llama3 locally.

    • @reezlaw
      @reezlaw 1 month ago

      I don't know anything about this project, but since it's open source I suppose you can just run Ollama with its OpenAI-compatible API and simply replace the base URL in the code; use whatever model you want.
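
The swap described above can be sketched with the standard library alone. The endpoint path and the `llama3` model name assume a default local `ollama serve`; if the project uses the `openai` client instead, the same idea reduces to passing `base_url="http://localhost:11434/v1"` when constructing the client.

```python
import json
from urllib import request

# Ollama's OpenAI-compatible chat endpoint (default local install assumed)
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat request body, which Ollama also accepts."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask_local_model(prompt: str, model: str = "llama3") -> str:
    """POST the payload to a local Ollama server (requires `ollama serve` running)."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```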

  • @SCHaworth
    @SCHaworth 1 month ago +6

    Hmm. I already made a universal headless Chrome scraper. Mine can even interact with the page.
    But you're a better man than me for sharing.

    • @TLCMEDIA1
      @TLCMEDIA1 1 month ago

      Mind sharing your code, mate?

  • @DevJonny
    @DevJonny 1 month ago +1

    Nice to see you getting traction. I would love to see some content on how to mitigate and avoid being blocked, especially by Cloudflare.

  • @GundamExia88
    @GundamExia88 1 month ago +2

    Great video! Having the ability to use locally hosted Ollama on the network would be great. I have Ollama running llama3 on another machine on the same network.

  • @michaelpongrac2364
    @michaelpongrac2364 1 month ago +1

    Great work!!!
    I appreciate that you have already made it run locally and created a resume scraper.
    Would you possibly combine the two by using the resume scraper with additional inputs to create a JSON profile, which could be used as search-criteria input for scraping job searches on sites such as Indeed, Stepstone, or similar?
    It would be great to have the match percentage from the scraping be usable as a filter and/or for sorting.
    The reason I ask is that it has multiple uses. If the JSON search-criteria profile had some other definition, it could still be used as generic input values for the search process, allowing the match-percentage functionality to have a universal application. The second use is to have a single profile that would deliver better search results than the original profiles on sites such as Indeed and Stepstone.
    An additional option could be to use a starting location and radius to help limit the data to be processed. There are map APIs that compute the travel distance between two points as well as the travel time based upon the travel mode (car, bus/train, bike, walk). This would add a lot of value to searches. It could also feed into the match percentage when used.
    I have one additional request: could you add an option to change the language to German? If needed, I can help with the translation, since I'm an American working in Germany. It would make things a lot easier for people in Germany. I already have a JSON structure. If you would like my help, let me know.

  • @ketchup1993
    @ketchup1993 1 month ago +1

    Maybe a way to circumvent the token issue is to calculate tokens, cut before the model's token limit, then continue after the cutoff and iterate until you've got the full page.
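
The chunking loop proposed above can be written tokenizer-agnostically. A sketch: `encode`/`decode` would come from whatever tokenizer matches the model, e.g. `tiktoken.get_encoding("cl100k_base")` for OpenAI models (the 7500 limit in the docstring is illustrative).

```python
def chunk_by_tokens(text, max_tokens, encode, decode):
    """Split `text` into pieces of at most `max_tokens` tokens each,
    so every piece fits under the model's context limit.

    Example with tiktoken (assumed installed):
        enc = tiktoken.get_encoding("cl100k_base")
        chunks = chunk_by_tokens(markdown, 7500, enc.encode, enc.decode)
    """
    tokens = encode(text)
    return [decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```

Each chunk would then be sent to the model in turn and the partial results merged, which is the "iterate until you've got the full page" step.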

  • @Opeyemi.sanusi
    @Opeyemi.sanusi 1 month ago +7

    Love that this is open source. Thank you!🙏🏾 I already knew how you were going to handle the pagination before you started talking 😂 A fix might be to add a starting URL and a field for the second page.
    Another suggestion is a proxy 😢
    I have more interesting additions to this

    • @redamarzouk
      @redamarzouk 1 month ago +3

      Alright, proxy is noted!

    • @dhairyagoel5524
      @dhairyagoel5524 28 days ago

      @@redamarzouk There is a website called Bright Data which solves issues like captchas and such, and it gives us credits; or any similar service.

  • @RJ.M.
      @RJ.M. 1 month ago +1

    You are a wonderful person, thank you for sharing 💪

  • @abdelazizabdelioua890
    @abdelazizabdelioua890 1 month ago

    Bless you! I have a project in mind and this is what I was looking for to monetize it.
    Thanks ❤

  • @iltodes7319
    @iltodes7319 1 month ago +1

    Good job, bro. Please continue.

  • @cadiszu9855
    @cadiszu9855 1 month ago +5

    I auto-subscribe to people who share useful free stuff. Thanks for this!

  • @joelfrojmowicz
    @joelfrojmowicz 1 month ago +1

    Great project, but it would be even greater if you created a Docker container for it and allowed using a local AI (Llama) instead of the cloud.

  • @mikew2883
    @mikew2883 1 month ago +1

    Awesome! 👏

  • @paulham.2447
    @paulham.2447 1 month ago +2

    What can I say? Exceptional! Thank you, sir 👍

  • @aimenkigs
    @aimenkigs 1 month ago +3

    Love the project, man! The update addresses the exact problem I faced before 🔥
    I've tried using GPT-4o-mini and Gemini Flash as well, and they both work smoothly. However, when using the local model, the pagination script throws an error on 'openai.ChatCompletion'. Could this be due to a version issue? Thanks

    • @redamarzouk
      @redamarzouk 1 month ago +2

      My issue with using the local Llama 3.1 8B was really the number of tokens; in my case it was 8k tokens per completion.
      If you have a model with a longer context window and it's still giving you errors, join the Discord and share a screenshot so I can understand the problem better.

    • @ranggasaputra5001
      @ranggasaputra5001 1 month ago

      @@redamarzouk Hello, can you send the Discord link again? The link you previously provided has expired. Thanks 🙏
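
On the `openai.ChatCompletion` error raised in this thread: that attribute was removed in `openai>=1.0`, so pre-1.0 code fails on a new install with exactly that message. A small diagnostic sketch to tell a version mismatch apart from a genuine request error (the two call styles are shown in comments):

```python
import importlib.metadata

# Pre-1.0 style:  openai.ChatCompletion.create(model=..., messages=[...])
# 1.0+ style:     OpenAI().chat.completions.create(model=..., messages=[...])

def installed_openai_is_v1(default=False):
    """Return True if the installed `openai` package is >=1.0, where
    `openai.ChatCompletion` no longer exists. Returns `default` when
    the package is not installed at all."""
    try:
        version = importlib.metadata.version("openai")
    except importlib.metadata.PackageNotFoundError:
        return default
    return int(version.split(".")[0]) >= 1
```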

  • @michaelwallace4757
    @michaelwallace4757 1 month ago

    Very nice! 🎉

  • @shawnsmith9198
    @shawnsmith9198 1 month ago +1

    You are king!

  • @JonathanBarber-hi3vj
    @JonathanBarber-hi3vj 1 month ago

    Thank you so much for this video. I am a no-coder and have no problem following your instructions. I have the latest versions of VS and Python installed, and for some reason I am unable to download the required packages. Can you please advise? Thank you

  • @CTEBACp6uja
    @CTEBACp6uja 1 month ago +2

    Did you try to add a login option for websites requiring it?
    I tried, but often get a response from the website that my browser doesn't support JavaScript, or that it is not enabled and is needed to proceed to login. I tried to enable it in Selenium, but am still getting the same response.
    Btw, thanks for sharing this, very interesting!

  • @yazanrisheh5127
    @yazanrisheh5127 1 month ago +1

    Reda, thank you for this video. I know in your previous version 2 of the scraper you allowed it to add delays to scrape a website, but how would V3 work for infinite-scrolling pagination instead of pages 1, 2, 3, etc.?

    • @redamarzouk
      @redamarzouk 1 month ago

      I have 3 scroll events: the first to half the page height, the second to almost the end, and a last one to the end of the page, with random time delays between them.
      Do you think that's enough for infinite scroll?

  • @CyrilSz
    @CyrilSz 1 month ago

    Incredible, thank you :)

  • @chrystylord2324
    @chrystylord2324 1 month ago

    Hello!! Great video. I want to ask if it's possible to scrape a whole article, for example, with your tool. Unlike a lot of people here, I just want to read articles, light novels and some comics which are behind a paywall. Can your scraper help me with that, or do I need to make some modifications to the code for it to work?

  • @mawkuri5496
    @mawkuri5496 1 month ago +6

    Can I use Llama running locally on my PC?

  • @moeabdo3114
    @moeabdo3114 1 month ago

    Can this scrape from YouTube? For SEO? Thanks for your amazing work.

  • @ChijiokeObi
    @ChijiokeObi 1 month ago

    I believe the way to solve the maximum-token issue is to first strip the HTML of unnecessary tags, scripts, and style blocks before sending it to the LLM.

    • @redamarzouk
      @redamarzouk 1 month ago

      html2text already gets rid of all tags and scripts, but maybe the URLs can be removed as well; that does sometimes decrease the number of tokens in the markdown.
      But the problem is: if the user wants to extract URLs of images or something else, for example, what should happen in that case?

  • @dewilton7712
    @dewilton7712 1 month ago +1

    I keep getting 'Unexpected data format for URL 1' with all sites I try. I have Ollama with Llama 3.1 8B installed locally, if that matters.

  • @mohamedamrbadawi
    @mohamedamrbadawi 1 month ago

    Is it possible to add a search-box feature where you put the search URLs for e.g. Amazon, eBay, Temu to get title and price? A mini price-comparison feature, in short.

  • @vladlemos
    @vladlemos 1 month ago +1

    Very interesting, congratulations on the lesson!

  • @JayS.-mm3qr
    @JayS.-mm3qr 19 days ago

    Thank you for this very interesting scraper. But I just want a scraper that does not require paid API keys. Can someone PLEASE recommend a basic scraper for that? Please.

  • @ambushtunes
    @ambushtunes 1 month ago

    How does one select multiple pages? It doesn't seem to work for me. Great job, btw.

  • @hasanparvez8850
    @hasanparvez8850 1 month ago +1

    Chunking the tokens for Alibaba can solve the issue.

  • @explosiveenterprises1479
    @explosiveenterprises1479 23 days ago

    How would you utilize this to scrape from behind a login? I don't see any of the login info embedded in the URL structure, so I'm unsure of the best way to do this.

  • @omarunzainkun11
    @omarunzainkun11 1 month ago +1

    On your website, one of the files is named "sraper" instead of "scraper", which will eventually cause a "module not found" error. Newbies probably won't realize this even though it's very obvious. Just informing you.

    • @redamarzouk
      @redamarzouk 1 month ago

      Thanks for letting me know, I fixed it!

  • @maxxflyer
    @maxxflyer 1 month ago +1

    Very good

  • @adriangpuiu
    @adriangpuiu 1 month ago +5

    You forgot to specify how to activate the env after creating it. Maybe some don't know how to do it, and they'll install the requirements into the main Python env :P

    • @gamalfarag
      @gamalfarag 1 month ago +1

      Thanks, that solves the error in my setup, but I have another error:
      ModuleNotFoundError: No module named 'scraper'

    • @redamarzouk
      @redamarzouk 1 month ago +1

      Yeah, I should probably add that to the documentation.

    • @adriangpuiu
      @adriangpuiu 1 month ago

      @@gamalfarag pip install scraper
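
For the record, the activation step discussed above is the standard venv workflow (paths assume you are in the repo root; the `requirements.txt` name is the usual convention):

```shell
# create an isolated environment in ./venv (run from the repo root)
python3 -m venv venv

# activate it on macOS/Linux:
source venv/bin/activate
# (on Windows: venv\Scripts\activate)

# with the env active, install the project's dependencies into it
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
```

Skipping the `source` step is exactly how packages end up in the main Python environment instead.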

  • @Dmitrird
    @Dmitrird 1 month ago

    Is it possible to build a table of different URLs and iterate over it automatically?
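
A minimal sketch of that idea: read the URL "table" from a CSV and loop a scrape function over it. `scrape` here stands in for the project's single-URL entry point, whose real name will differ.

```python
import csv

def load_urls(csv_path, column="url"):
    """Read the URL column from a simple CSV table of targets."""
    with open(csv_path, newline="") as f:
        return [row[column] for row in csv.DictReader(f)]

def scrape_all(urls, scrape):
    """Run `scrape` over every URL, collecting per-URL results and
    recording (rather than raising) individual failures."""
    results = {}
    for url in urls:
        try:
            results[url] = scrape(url)
        except Exception as exc:  # keep going if one URL fails
            results[url] = f"error: {exc}"
    return results
```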

  • @Bryan-lu4du
    @Bryan-lu4du 1 month ago

    Could we use the app as an API? I want my app to use your app, essentially.

  • @dhairyagoel5524
    @dhairyagoel5524 23 days ago

    1. Getting an unexpected URL error
    2. If the Chrome driver gets old, do we have to change it or not?
    3. How to deploy it
    4. Proxy

  • @alexscarbro796
    @alexscarbro796 1 month ago

    Does anyone know of a tool that can scrape name and address blocks from a largely fixed area on each page of a multi-page PDF?

  • @SavanVyas91
    @SavanVyas91 1 month ago

    You're doing local scraping, not Puppeteer?

  • @tiagoreis5390
    @tiagoreis5390 1 month ago

    Do you know how many tokens SheIn is? Great work.

    • @redamarzouk
      @redamarzouk 1 month ago

      I didn't try with SheIn before, but they have a fairly simple website; the issue is that every page has 70+ products, meaning it will produce a lot of tokens.

  • @rajvaibhav821
    @rajvaibhav821 1 month ago

    Do we really need the Selenium driver and actually opening a browser? Can it be done without that? Headless?

    • @redamarzouk
      @redamarzouk 1 month ago

      I tried it with headless and headless=new, but it's hit and miss with the infinite-scroll cases. And most pagination details are at the bottom of the page.
      If you want to try it with headless, go to assets.py; the headless option is already there, just place it inside the settings list.
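
For reference, the change described is just moving one Chrome flag into the settings list in assets.py. A sketch with assumed names (the actual variable name and neighbouring flags in the project may differ):

```python
# assets.py (sketch; names assumed): Chrome flags later fed to ChromeOptions
BROWSER_SETTINGS = [
    "--disable-gpu",
    "--window-size=1920,1080",
    "--headless=new",  # include this flag to run without a visible window
]
```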

  • @velocitai
    @velocitai 1 month ago

    Most of my scrapes fail because of a token limit with GPT :/

  • @Benjaminborghini
    @Benjaminborghini 1 month ago

    I can't get this to work on Spotify streams. I want to track all my streams across all my songs. I also made an HTML link for it to scrape multiple links at one time, so nice that you fixed that now! But it seems like Spotify is blocking it anyway. Any tips on how I could scrape this kind of data? Thanks!

    • @redamarzouk
      @redamarzouk 1 month ago +3

      If Spotify is one of those websites that forces a captcha upon opening, that would block the scraping.
      Someone proposed adding an attended mode for the user to solve a captcha and then allow the app to continue its scraping. I think I will be adding this feature next.

  • @pauljones7798
    @pauljones7798 1 month ago +1

    "This AI Scraper Update Changes EVERYTHING!!"
    Please, can it scrape a freelance-services marketplace?

  • @OPMultiplayerCoopGames
    @OPMultiplayerCoopGames 1 month ago

    How can I scrape emails from websites? I need to scan many of them, not just one at a time. Could you help me out? :)

  • @TheBestgoku
    @TheBestgoku 1 month ago +1

    This is great and all, but how about you create a service, even a paid one, to help us not get banned for scraping? Then we'd have something.

  • @shankar9063
    @shankar9063 1 month ago +1

    Omg update

  • @ZeyadAlmothafar
    @ZeyadAlmothafar 1 month ago

    Can I use it to scrape LinkedIn profile data? And is that legal to use commercially (to integrate the data into a web application through APIs)?

  • @towhidurrahman8961
    @towhidurrahman8961 1 month ago +2

    Great job, sir!
    I have a question: is it possible to share the webpage opened by Selenium with the user, allowing them to manually interact with it, such as solving captchas or authenticating, to bypass blockades? Once they clear the obstacles, Selenium can resume scraping.

    • @redamarzouk
      @redamarzouk 1 month ago

      That's actually a great suggestion.

    • @Steve-lu6ft
      @Steve-lu6ft 1 month ago

      @@redamarzouk Can you also do pagination in the same way? I.e., click on the links so it can find the pagination elements.

    • @yazanrisheh5127
      @yazanrisheh5127 1 month ago

      Yes please, Reda, this would be an amazing feature. This way we can pretty much solve every captcha without paying for proxies or coding a captcha solver, etc. We can just let it alert us by sending an SMS to our phone, or something that says "Need to solve captcha, come back to your PC", or maybe just play an audio file saying "Solve the captcha".

  • @minhvuongluu7644
    @minhvuongluu7644 1 month ago

    Can it scrape Google Maps?

  • @AnmolBatti-z5y
    @AnmolBatti-z5y 29 days ago

    Will it bypass bot protection like captchas?

    • @redamarzouk
      @redamarzouk 23 days ago

      It doesn't explicitly bypass a captcha if one arises; the trick is to use the user agent to stop the website from thinking we're bots in the first place.
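
The user-agent trick can be sketched as a helper that builds the relevant Chrome flags. The sample UA string is just an illustrative desktop Chrome UA, and the exact flag set the project uses may differ; `--disable-blink-features=AutomationControlled` is a real Chrome flag that hides `navigator.webdriver` from many bot checks.

```python
def stealth_chrome_args(user_agent=None):
    """Build Chrome command-line flags that make automated browsing
    look less bot-like. Pass the returned strings to ChromeOptions."""
    ua = user_agent or (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    return [
        f"--user-agent={ua}",
        "--disable-blink-features=AutomationControlled",
    ]

# Usage with Selenium (assumed installed):
#   options = webdriver.ChromeOptions()
#   for arg in stealth_chrome_args():
#       options.add_argument(arg)
```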

  • @jeynergilcaga
    @jeynergilcaga 29 days ago

    What about Facebook?

  • @moonwhisperer4804
    @moonwhisperer4804 1 month ago

    I'm looking for a way to go from a list page, find all items, go into the detail page of each item, and extract data from there. Can this do that?

    • @redamarzouk
      @redamarzouk 1 month ago

      Yes, this is the most intuitive way, but even specialized text-to-action apps out there can't do it in a universal way. It's really harder than it sounds.
      That's why getting the pages and then scraping multiple URLs of those pages at the same time is the most compatible way of doing pagination today.

  • @Anesu-nv1mh
    @Anesu-nv1mh 8 days ago

    Can it scrape photos and videos too, and download them?

    • @redamarzouk
      @redamarzouk 8 days ago

      It can scrape links to pictures and videos, but not the files themselves.
      Of course, the links have to be inside the website's markdown.

  • @AIPulse118
    @AIPulse118 1 month ago

    Can it scrape the OpenAI docs? I have yet to be able to scrape their pages.

    • @DevJonny
      @DevJonny 1 month ago

      Do you mean the scraping part itself, or that the LLM blocks the content? You might want to try ScrapingBee.

  • @SoshiForever1_SM
    @SoshiForever1_SM 1 month ago +1

    Incredible

  • @grahamrennie2057
    @grahamrennie2057 8 days ago

    Looks like your website is down...

    • @redamarzouk
      @redamarzouk 8 days ago

      I have just tried to access it and it's up; I checked on isituporjustme and it says it's working fine:
      It's just you. automation-campus.com is up.
      Last updated: Nov 6, 2024, 10:14 PM (1 second ago)