This Open Source Scraper CHANGES the Game!!!

  • Published: 22 Dec 2024

Comments •

  • @redamarzouk · 3 months ago · +43

    Hey everyone,
    Link to code: www.automation-campus.com/downloads/scrapemaster
    My GitHub account has been suspended (I have no idea why), and I didn't receive any warning from GitHub justifying the suspension. I'm confused, because similar AI-scraper projects are on GitHub and none of them got suspended.
    I opened a ticket and I'm waiting for their answer.
    In the meantime, I've shared the code on my website with all the steps to reproduce the AI scraper.

    • @ShaunPrince · 3 months ago · +1

      Let me know if I can help with this. I can set up a Gitea on AWS or something.

    • @Kevinsmithns · 3 months ago · +2

      Yeah, I was just looking and was about to comment.

    • @alex_osti · 3 months ago · +2

      I was about to give it a shot. Waiting for the update. Great work, btw.

    • @rperellor · 3 months ago · +1

      I had the opportunity to view it, but did not clone it.

    • @redamarzouk · 3 months ago · +8

      @rperellor here is the code: www.automation-campus.com/downloads/scrapemaster

  • @RoughSubset · 3 months ago · +163

    So I worked at a company once where the data guy built his own web scraper to scrape pricing data off our competitors' websites. One thing they did to protect their website from scraping was user-agent filtering; the way he overcame this limitation was to keep a very long list of different user agents and rotate them while scraping the website. I think that would be a good addition to your app. A small but useful change.

    • @redamarzouk · 3 months ago · +17

      Yes, if we launch the scraper with the same user agent against the same websites too many times, they will pick up on it and block us.
      The modification will use a list of OS identifiers with their versions, plus different browsers with their versions.
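      The rotation described above can be sketched in a few lines; the agent strings below are illustrative examples, and in practice the pool would be much longer and kept up to date:

```python
import random

# Hypothetical user-agent pool covering different OSes and browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_user_agent() -> str:
    """Return a random user agent for the next scrape run."""
    return random.choice(USER_AGENTS)

# With Selenium the chosen agent would be applied via Chrome options, e.g.:
#   options.add_argument(f"user-agent={pick_user_agent()}")
```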

    • @markomarjanovic8348 · 3 months ago · +18

      @redamarzouk Would it be possible to have a video about implementing proxy rotation? There isn't much about it on YouTube, but I think it's crucially important.

    • @redamarzouk · 3 months ago · +16

      @markomarjanovic8348 Added to the backlog.
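      Until a video covers it, a minimal round-robin sketch of the idea (the proxy addresses are placeholders; a real pool would come from a proxy provider):

```python
import itertools

# Hypothetical proxy pool; real entries would come from a provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# Round-robin cycle so consecutive requests leave from different IPs.
_proxy_cycle = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Return the next proxy in the rotation."""
    return next(_proxy_cycle)

# With Selenium the proxy would be applied via Chrome options, e.g.:
#   options.add_argument(f"--proxy-server={next_proxy()}")
```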

    • @amortalbeing · 3 months ago · +1

      This is a good suggestion; I'd like this to be added as well.

    • @internetperson2 · 3 months ago

      Thirded.

  • @jdnilsen · 3 months ago · +2

    Thanks!

  • @thisisfabiop · 3 months ago · +23

    Amazing work! It works great, but it doesn't handle cases where the results are divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are none left.
    Another great feature (although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI) would be for the scraper to open each item's page and scrape data from there. As you know, the listing page often displays only limited information about the product.

  • @SergeyNumerov · 3 months ago · +31

    Pretty cool.
    Let me point out, though, that the main complexity in scraping is that the relevant content is often hidden: getting to it may require clicking various UX elements.
    So to _really_ crack scraping with AI, we'll need to go agentic: the solution will need to figure out what to click to reveal the information of interest.

    • @SpragginsDesigns · 3 months ago · +3

      Exactly. Anyone interested in helping me build something like this? Or is there something available already?

    • @pyros4333 · 2 months ago · +1

      @SpragginsDesigns You could just hire someone to build it for you easily.

  • @moiguess3256 · 2 months ago · +1

    You've earned a new subscriber. Algerian brother here.

  • @justjosh1400 · 3 months ago · +5

    Definitely going to use this; I think this is awesome. As a suggestion for future options, it would be great to have pagination support and configurable depth. A lot of my scraping is location-based, for instance states → cities → locations, and the data I usually want is within the locations, which may be only a few.

    • @redamarzouk · 3 months ago · +3

      Thank you.
      Yes, pagination will make this complete.
      But I'm thinking about how to make it universal, since it has to work on every website. Would I just add another LLM call to detect any URL pagination pattern, or do you have a better idea of how to do it?

    • @justjosh1400 · 3 months ago · +1

      @redamarzouk That might actually work; even a smaller model would be capable of determining whether a page has pagination. Or have a checkbox for the user to manually say it has pagination, so the LLM only looks for it when asked. That way it's not always looking for it. And when it finds it, return what kind of element it is. IDK.

    • @wdonno · 3 months ago

      @redamarzouk Similar scenarios may offer an interim pathway: if the initial URL prompts for a selection (text input) that determines the next page, could you add the ability to make that selection, ideally from a list of items of prior interest? The recursive ability to select specific buttons according to options on subsequent pages would then solve a large number of use cases (i.e., an ability to map different actions to preselected, known option types). The base use case is downloading files from a selection that varies by initial (or ideally subsequent) text inputs, terminated by pressing a button to download the selected file or files. The approach can then be expanded with more scenarios until it is universal!

    • @justjosh1400 · 3 months ago · +1

      Thinking about it, I just thought: maybe have an area where the user can manually paste a div container grabbed from the inspect tool.
      Or, since we're using an LLM, you could always prompt for it and return the value of the container, e.g., "look to see if this page has pagination at the bottom or top; if so, return a value", and use that value to fill it in.
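      As a non-LLM fallback for the common case, page-number links can often be recovered purely from the hrefs on the first page; a heuristic sketch (the URL patterns handled are assumptions, and an LLM call could cover sites with unusual markup):

```python
import re
from urllib.parse import urljoin

def find_pagination_urls(base_url: str, hrefs: list[str]) -> list[str]:
    """Collect links that differ only by a page number (?page=N, p=N, /page/N)."""
    page_re = re.compile(r"(\?|&|/)(page[=/]?|p=)(\d+)", re.IGNORECASE)
    pages = {}
    for href in hrefs:
        m = page_re.search(href)
        if m:
            # Keep one absolute URL per page number, deduplicating repeats.
            pages[int(m.group(3))] = urljoin(base_url, href)
    return [pages[n] for n in sorted(pages)]
```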

  • @Ant-ym3mw · 3 months ago · +1

    You got yourself a new sub!

  • @danielcave9606 · 2 months ago · +2

    Most of the "traditional" enterprise-grade scraping companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions or billions of pages, every 100th of a cent matters, so they take a composite AI approach: ML models cheaply get the majority of the standard data points for a general schema, and then LLMs do the thing they do best, extracting data from unstructured text, to extend that schema. That way you get the cost efficiency along with the flexibility of LLMs when needed.
    The real benefit of the LLM approach for bigger teams/projects is actually that it abstracts away from hard-coding selectors into your spiders, so they are far more robust and unlikely to break in three months when the website changes its HTML, reducing your maintenance burden/debt. That's my ten cents anyway.
    I personally love what your project does for the everyday person, though: getting small/medium crawls done where price per request isn't so important, and where you have time/space for more rigorous custom QA. I especially love it for content generation, data journalism, chart porn and the like. Great work!

    • @redamarzouk · 2 months ago

      Yeah, I thought I was creating a scraper at scale, but once I started using it extensively I see it more as a productivity tool to help get data quickly without the need for copy-paste.
      Traditional scrapers will still have a place in the market simply because once you want to scrape hundreds of thousands or millions of pages, the cost of paying coders for custom scripts and maintenance makes sense compared to the value of the data scraped.

  • @shawnsmith9198 · 3 months ago · +4

    You're a genius! I'm on a Mac, so I just had to change the driver call, but everything else is working well. Pagination or a series of URLs would be cool. I love how you have it load in the Chrome browser; this really changes how I think about cross-platform apps. I wonder if we can scrape Instagram now. Or what about downloading images? Maybe a simple "copy table" button, since I just copy and paste into Google Docs.

    • @jimbob3823 · 3 months ago

      New to macOS; can you please share your driver path? Not 100% sure which one is the executable. Ty!

    • @thecashlessgamer480 · 3 months ago

      Yes, please, can you help me set it up on my Mac as well?

    • @wavelyveney9021 · 2 months ago

      I need assistance setting up on a Mac.

  • @dimadem · 3 months ago · +1

    Such a good idea and explanation, thank you.

  • @ginocote · 3 months ago · +4

    One of my ideas is to use an AI scraper only for the first test scrape. If it works, you output something like a JSON that holds the id or class of each scraped element, then give that JSON to a conventional, non-AI scraper to scrape the website for free and faster, without needing AI afterwards.

    • @lovol2 · 3 months ago

      This is just writing code. Just copy-paste the HTML into ChatGPT and say "write the code to parse this into JSON". Works really well.
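      The two-pass idea above can be sketched as follows: the selector map is a hypothetical example of what an LLM's first pass might return, and the second pass then runs with no LLM cost. The sketch uses the stdlib XML parser for self-containment; a real version would use BeautifulSoup or lxml with CSS selectors on messy HTML:

```python
import json
import xml.etree.ElementTree as ET

# First pass (done once, by an LLM): field name -> element path.
# Hypothetical output for a product-listing page:
SELECTOR_MAP = json.loads('{"name": ".//h2", "price": ".//span"}')

def scrape_with_map(html: str, selector_map: dict) -> list[dict]:
    """Second pass: plain parsing driven by the saved selector map, no AI."""
    root = ET.fromstring(html)
    rows = []
    for item in root.findall(".//li"):  # one <li> per listed product
        rows.append({field: item.find(path).text
                     for field, path in selector_map.items()})
    return rows
```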

  • @HyperUpscale · 3 months ago · +4

    Can you make it use Ollama on the back end instead of OpenAI?

  • @ErickXavier · 1 month ago

    What about adding pagination support, where the AI goes through page after page to scrape long paginated data?

  • @aveenof · 3 months ago · +2

    Awesome work! Any idea why the scraped output list gets truncated even when input + output tokens are below the max?

    • @redamarzouk · 3 months ago

      In some cases I noticed that GPT-4o mini can't extract all the data from the website.
      I tried with GPT-4o and it was successful.
      So if you're sure your data is in the markdown and GPT-4o mini didn't pick it up, try GPT-4o.

  • @JordanCrawfordSF · 1 month ago

    0:36 - dude got possessed by ChatGPT and his eyes went bananas.

  • @SamirDamle · 3 months ago · +9

    Thanks for the simple tutorial and code.
    Can you add an example of using this scraper with local Ollama and Llama 3.1 instead of OpenAI, to make it totally free?

    • @redamarzouk · 3 months ago · +5

      You're welcome.
      I can add it, but I won't be able to test it.
      My small GPU can't really handle it, especially when I'm filming.

    • @HyperUpscale · 3 months ago

      @redamarzouk YES, PLEASE 🙏!!!

    • @GundamExia88 · 3 months ago

      @redamarzouk I hope this gets added. I prefer to run Ollama locally. I'm only using a GTX 1070, and it works fine.

    • @idrinkmusic · 3 months ago · +1

      @redamarzouk This would be a game-changing update. You earned a sub for this video regardless.

    • @carvierdotdev · 3 months ago

      @GundamExia88 Could you please tell me what models you run? I have a GTX 1080 Ti 11GB, thanks to a friend, and I want to play with it, but I don't even know what's possible 😂😅

  • @CicadaMania · 1 month ago

    Does a Disallow rule in robots.txt, like "User-agent: GPTBot / Disallow: /", stop it from working?
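    Worth noting: robots.txt is advisory, so a Selenium-driven scraper won't stop unless it checks the rules itself. A sketch of an explicit check using the stdlib parser (the rule text below is an illustrative example):

```python
import urllib.robotparser

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check a path against robots.txt rules for a given user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)
```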

  • @LeftBoot · 3 months ago · +1

    How deep will it go? How many pages in?

  • @MoneylessWorld · 3 months ago · +5

    The dependency on OpenAI and the API key is a bummer.
    It would be better if we could plug in our own open-source AI engine and models.

    • @sixman9 · 3 months ago

      If I'm not wrong, tools like Ollama expose local LLMs through part of OpenAI's API surface; the docs mention chat/completions.
      If this scraper uses OpenAI's function-calling interface, you might be out of luck.
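      Ollama does serve an OpenAI-compatible /v1/chat/completions endpoint on its default port (11434), so for plain chat-completion calls the swap is just a base URL, a dummy key, and a local model name. A sketch of the request shape (the model name and prompts are placeholders):

```python
# Ollama's OpenAI-compatible endpoint lives at http://localhost:11434/v1
OLLAMA_BASE_URL = "http://localhost:11434/v1"

def build_chat_request(model: str, system: str, user: str) -> dict:
    """Assemble the same chat-completion body the OpenAI client would send."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

# With the official openai client the change would be only:
#   client = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama")
#   client.chat.completions.create(**build_chat_request("llama3.1", sys_msg, page_md))
```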

    • @91Chanito · 3 months ago · +1

      You can do that with your local LLM.

  • @TheLionsaba · 3 months ago · +1

    Great video as always. The only downside is that it addresses people who work with code and are experienced in data scraping. For no-code (or very-little-code) people like me, I think the best way is to use computer-vision models (VLMs). ChatGPT already has this in its API, but we also have two new open-source models that came out this week: Qwen2-VL and Microsoft Phi-3.5-vision.

    • @quercus3290 · 3 months ago

      LAION has an open-source model that is a very powerful scraper; you will most likely need to fine-tune any vision model.

  • @iltodes7319 · 3 months ago · +1

    Good job, bro, keep it up ❤

  • @rgsiiiya · 3 months ago

    This, and the V2 with Llama, are very interesting concepts and I believe could be tremendously valuable.
    The shortcoming is that it is limited to the single page at the URL location.
    To be truly valuable, it needs to also be a crawler (as you mention).
    Think of the use case of scraping e-commerce sites for product details: any "real" e-commerce site is going to have many, many categories and pages of categorized product listings.
    While you can set up traditional scrapers and manually configure the navigation, this is where AI should really shine: it should be able to figure out the navigation and automatically navigate and scrape the site.

  • @sahil5124 · 3 months ago · +1

    So it's traditional scraping (Selenium and Beautiful Soup), and AI is only used to organize the scraped data into a given format? The AI doesn't do the scraping. Is that correct, or am I missing something?

    • @redamarzouk · 3 months ago

      Yes, the AI does the parsing. But producing unstructured markdown can't really be called traditional scraping; no one would scrape all the unstructured data out of the HTML in a traditional setup.

  • @brbl415 · 3 months ago · +1

    Does it bypass reCAPTCHA?

  • @staticalmo · 3 months ago · +6

    No pagination?

    • @redamarzouk · 3 months ago

      Check the new video; the scraper works with Llama 3.1 and Groq's Llama 70B for free: ruclips.net/video/xrt2GViRzQo/видео.html

  • @djasnive · 3 months ago · +3

    Great project.
    Is it possible to use an open-source, self-hosted model like Llama?

    • @redamarzouk · 3 months ago · +2

      Thank you.
      Yes, it's possible, but I didn't even try this time because GPT-4o and Gemini Flash are so cheap and have a huge context window, so I just went with them.
      But it's perfectly possible; you just need to modify the "format_data" function.

    • @satyaviswapavanranga5915 · 3 months ago

      @redamarzouk Thank you so much. I had the same question; thanks for answering.

  • @minissoft · 3 months ago · +7

    Hello Reda, you should use Polars instead of Pandas; in a lot of cases it's much faster.
    Also, add_argument("--disable-search-engine-choice-screen") is useful, plus maybe ("--headless")?

    • @redamarzouk · 3 months ago · +1

      Oh, I was looking for that "--disable-search-engine-choice-screen" argument; that popup is annoying (even if it doesn't affect the scraping). I will be adding it, thank you!!
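      The two flags mentioned above would be wired up along these lines (a sketch; whether to run headless is a per-run choice):

```python
def build_chrome_flags(headless: bool = False) -> list[str]:
    """Chrome flags discussed above, applied via options.add_argument(...)."""
    flags = ["--disable-search-engine-choice-screen"]  # suppress the search-engine popup
    if headless:
        # Chrome's "new" headless mode renders closer to a regular session.
        flags.append("--headless=new")
    return flags

# In Selenium (sketch):
#   options = webdriver.ChromeOptions()
#   for flag in build_chrome_flags(headless=True):
#       options.add_argument(flag)
```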

  • @daedaluxe · 3 months ago · +1

    I don't think LLMs are ready for this kind of scraping yet. It's better to get an LLM to write a Flask Python app that scrapes manually based on class names, so you pull the correct data with no hallucination. It can also pull images and zip them with zipfile.

    • @redamarzouk · 3 months ago

      LLMs are not all made the same. While scraping websites with 60K+ tokens I noticed that GPT-4o mini gets me only a subset of the data, while the latest GPT-4o manages to get all of it.
      If someone is willing to pay $0.50 to $1 per extraction, they can use GPT-4o with a guaranteed correct and complete output.
      But $1 per extraction is still very high if we want to scale; in that sense it's not ready.
      For most cases, though, mini works great at $0.005 per extraction, and it's absolutely ready.

  • @marcusmayer1055 · 3 months ago · +2

    How do I add a local LLM (Llama) to this project?

    • @redamarzouk · 3 months ago

      I did; watch this video: ruclips.net/video/xrt2GViRzQo/видео.htmlsi=XWUzIu8uBehK4AV5

  • @ScottLahteine · 3 months ago · +3

    The use case I have for a script like this is to scrape my own open-source project's code history, to convert several versions of config files containing lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need for consistent structured output. I look forward to learning more about this new way of scraping and applying it to my own situation. Cheers!

    • @lawrencemanning · 3 months ago

      The problem is that you now have a nondeterministic algorithm taking you from input to output. In other words, the mechanism is fundamentally untestable and unrepeatable. It's basically like feeding data to a bunch of chimpanzees and expecting them to perform the same processing on it each time. This is fine if you have a human checking the output every time (the interactive use case), but any kind of automatic, unattended runs? Forget it.

  • @KPK_7 · 2 months ago

    Any way to scrape Twitter for a specific keyword?

  • @eea8888 · 3 months ago

    What if the data is dynamic, or there's a click needed (like a search button or a select to choose from) before the data appears? What should we do in that case?

  • @LeftBoot · 3 months ago

    Can it be multimodal? Viewing data in an image, and also rendering data tables into an image, e.g., "create a wallpaper of the most important Linux keyboard shortcuts", etc.

  • @mrsai4740 · 3 months ago · +1

    Hmm, it seems like I ran into a limitation. I tried scraping some golf courses (latitudes and longitudes) from Google Maps, but it only ever gives me 30 rows of data. At first I thought this might be an issue with max tokens, so I increased it to the highest possible value, 16384 tokens, but it still only gave me around 30 rows with the same data.

    • @redamarzouk · 3 months ago

      What model have you been using? GPT-4o mini can go up to 128,000 tokens, and in my last video I added Gemini, which can go beyond 1M.
      I've noticed this behavior as well: when a single page has so much data (not just the table with the necessary data, but other data too), we run into a hard limit on how many rows we can scrape (especially with apps like @irbnb and zill0w, where there's a map holding so much data we won't be scraping it). I guess you found the same limitation.

    • @mrsai4740 · 3 months ago

      So I've been experimenting with this code, and I got it to work with pagination by specifying a new field for a next button and a new field for the number of pages. This seems to work well, but it also got me thinking: if we have too many tokens, we could chop the data up and run the pieces through the LLM. The only issue I can see is that if we start batching the data, we could miss critical pieces of information (if we substring at the wrong spot, we may lose rows). I will try out Gemini; I have never used it.
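      The row-loss concern above is usually handled by splitting on line boundaries with a small overlap, so any row cut at a batch edge still appears whole in the neighboring batch (the batch size and overlap below are illustrative; deduplication of the overlapped rows would happen after extraction):

```python
def chunk_lines(markdown: str, max_lines: int = 200, overlap: int = 5) -> list[str]:
    """Split page markdown into overlapping line-based batches for the LLM."""
    lines = markdown.splitlines()
    chunks, start = [], 0
    while start < len(lines):
        chunks.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break
        # Step back by `overlap` so boundary rows appear in both batches.
        start += max_lines - overlap
    return chunks
```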

    • @redamarzouk · 3 months ago

      @mrsai4740 On some websites we can get either the next page or the URLs of the other pages just by specifying them in the fields, using the current version of the scraper.
      But the problem is that most websites don't include all the page URLs on the first page; usually it's in the form
      (1 2 3 4 ... 45 46 47 48), for example.
      In that case we have to ask the LLM to infer the URLs of the other pages from the pattern in the URLs it found.
      Websites where we only have a next button can only be scraped one URL at a time, so the universal approach will take some time and work to figure out.

    • @mrsai4740 · 3 months ago

      @redamarzouk Hmm, maybe we're tackling this the wrong way, because for this to be a universal solution, some legwork by the user seems necessary. In cases like that scrapeme site, it's a lot easier to provide an array of URLs or a template describing all the URLs, but that doesn't handle single-page applications. Some sites have a paginator that updates the current page in place. I guess it's back to the question: how can we programmatically detect the way a site is paginating data?

  • @Alphamaan · 1 month ago

    Can this app click into a car's page to scrape the details, go back, and then click into another car's page to scrape again?

  • @orangehatmusic225 · 3 months ago · +3

    So you can scrape 666.66 pages for $1 based on that usage.

  • @snehasissnehasis-co1sn · 3 months ago · +13

    I want to use a Groq API key because it's free, or a local LLM like Ollama. Please modify the code if possible. Great video!

    • @satyaviswapavanranga5915 · 3 months ago · +1

      Same question; I was wondering, can we do it using Groq or Cohere?

    • @ianmatejka3533 · 3 months ago · +1

      Wrap the Groq API key in os.getenv() instead of passing in the string.
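      For example (the variable name GROQ_API_KEY is the conventional choice, but whatever name is set in the .env works):

```python
import os

def get_api_key(name: str = "GROQ_API_KEY") -> str:
    """Read the key from the environment instead of hard-coding it in source."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"Set {name} in your environment or .env file")
    return key
```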

    • @redamarzouk · 3 months ago · +5

      @snehasissnehasis-co1sn Both have been added.
      I'll present them in the next video.

  • @aleksandars9254 · 3 months ago

    Thanks for the video! What mic are you using?

  • @stokedbeachbum · 3 months ago · +1

    Can it also crawl a site such as Zillow and scrape multiple URLs?

    • @redamarzouk · 3 months ago

      Websites like Zillow tend to have so much data inside them (100K+ tokens), but the answer is still yes.

  • @amortalbeing · 3 months ago · +1

    This was great, thanks.

  • @mzahran001 · 3 months ago · +3

    Thanks for the great video. Idea for the next videos: could you extend the code with crawling, for example getting results from search engines or following a specific path to get more structured data?

    • @redamarzouk · 3 months ago

      You're welcome. Can you elaborate on how that should look?
      It would be awesome, and I've actually given it some thought, but it's hard to get the exact links of multiple pages to extract data from if you don't have the link to the first page.
      Do you think we can trust a search engine to give us the exact links we want to scrape?

  • @chandler_short · 2 months ago

    How about something like scraping Facebook Marketplace or OfferUp?

  • @danielerikschaconbaquerizo2957 · 3 months ago

    What about using the curl_cffi library with requests to simulate a browser, instead of Selenium or Playwright? I think it would be faster.

  • @CryptoDuhd · 3 months ago

    I would love it even more if you created a Docker container that was directly downloadable and installable on a Linux host. A user-agent swap feature (a list of user agents chosen round-robin or randomized) would be great too, along with handling a list of proxies that are also rotated.

    • @redamarzouk · 3 months ago

      I haven't created a Docker container, but I did make it pick a random user agent from a list; you can find the code in this video: ruclips.net/video/xrt2GViRzQo/видео.htmlsi=smByssvvNhudzgRS
      What type of websites will you use this app to scrape?

  • @GabrielM01 · 1 month ago

    Would be nice to have an option to use Ollama, so we can run it locally without OpenAI's proprietary AI.

  • @obey24com · 3 months ago · +1

    What about websites with Cloudflare security, etc.?

    • @TheLionsaba · 3 months ago

      Very important question.

  • @moeabdo3114 · 2 months ago

    Can this scrape from YouTube? For SEO? Thanks for your amazing work.

  • @brianzvc · 1 month ago

    Does this scrape dynamic data?

  • @TLCMEDIA1 · 3 months ago · +1

    This is amazing. I've been trying to reproduce the code, but I keep getting errors. Any chance you could do a step-by-step walkthrough video, the way ChatGPT explains things? Please 🙏🏾

    • @redamarzouk · 3 months ago · +1

      I did; watch this video: ruclips.net/video/xrt2GViRzQo/видео.htmlsi=XWUzIu8uBehK4AV5

    • @TLCMEDIA1 · 3 months ago

      @redamarzouk Appreciate you so much 🙌🏾💯

  • @Anton112eclipse · 2 months ago

    How does it work with pagination?

  • @jewlouds · 3 months ago · +1

    It actually works pretty well.

  • @SohanDomingo · 3 months ago

    What video recording software do you use?

  • @remusomega · 3 months ago

    A really cool feature would be a text splitter that splits the text semantically into small chunks, so we can feed this straight into a RAG pipeline. Right now we typically split things arbitrarily, but semantic splitting is best.

    • @redamarzouk · 3 months ago · +1

      Can you give me an example of an output to split?

    • @TimothyJoh · 3 months ago

      There are many such splitters available in LlamaIndex or LangChain already. Another "automated" way might be to ask GPT-4o mini to split it for you.
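      Since the scraper already emits markdown, a structure-aware baseline is to break on headings and blank lines and then pack pieces up to a size budget; a naive sketch (an embedding-based splitter, or the library splitters mentioned above, would be more genuinely "semantic"):

```python
import re

def split_markdown(md: str, max_chars: int = 800) -> list[str]:
    """Split markdown on headings/blank lines, then pack pieces under max_chars."""
    # Break before each heading and at paragraph boundaries.
    pieces = [p.strip() for p in re.split(r"\n(?=#)|\n\n+", md) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 2 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks
```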

  • @djagryn · 3 months ago · +1

    Super interesting 🎉

  • @edma6613 · 3 months ago

    Could it download or summarize files (PDFs, ...) from a website?

  • @Web.Scraping · 3 months ago

    What about captcha solving, such as Cloudflare, reCAPTCHA, hCaptcha?

  • @DummyAllan · 3 months ago · +1

    I really appreciate the great work you're doing.
    Quick one: what happens with sites that require credentials? How do you handle that case?
    Thanks

    • @redamarzouk · 3 months ago · +1

      That needs an intervention on your side: keep the website open and run the process again, so it has direct access to the data.

  • @mikevinitsky8506 · 3 months ago

    Can you make it spider a website and, if it finds a page that has all the required tags, put the information into JSON, a database, etc.?

  • @nmlker · 3 months ago · +1

    @redamarzouk Nice and easy scraper. I saw that you also have ScrapeMaster 2.0 and installed it. The .env file mentions a Google API key; which one should be added? Do you have a link for where to get this particular Google API key?

    • @redamarzouk · 3 months ago

      Thank you. To get the Google API key, go to aistudio.google.com/app/apikey,
      create a new API key there, and add it to the .env.
      You can find all the details of ScrapeMaster 2.0 here:
      ruclips.net/video/xrt2GViRzQo/видео.htmlsi=KH5bfxyYJ9NV90FU

  • @maxxflyer · 3 months ago

    If I show a screenshot of the Pokémon page to GPT, it will scrape all the data directly. So my first feeling is that the AI is smart enough to suggest the fields in a dropdown menu, so I can choose them, tell it what I really want, and decide on a final label for each one.
    Just an example to start!
    But as I said, ChatGPT can do the same with just a prompt. I don't actually need your app unless the page is full of data; in that case there may be limitations.
    So you should ask yourself what a prompt can't do.
    Anyway, my real problem is having a scraper able to scrape data distributed across various pages, or cases where you must click a "load more" button to reveal elements.
    And I want to be able to specify the download format; GPT can reformat anything into anything.
    Nice work, but there are tons of improvements to be made. I'll follow you to see where you get to.

  • @Daltoncast · 3 months ago

    Takes a screenshot, then extracts with AI?

  • @JuankM1050 · 3 months ago

    Then I tried to make it work with the Google Gemini API, and sadly I could not; it always returns an empty table.

    • @redamarzouk · 3 months ago

      I've just added Gemini to an updated script I'm working on; I also added Llama 3.1.
      Stay tuned for the next video.

  • @cineymatic · 3 months ago · +2

    Great video! I have a few questions though 🤔:
    - Would it be easy to extend it to first log in to a site and then start scraping?
    - Would it be able to click buttons and scrape data from subsequent pages?
    - How does it identify the elements on the page? Do they always have to be under a category or in the form of a table?

    • @redamarzouk · 3 months ago

      For the first two questions the answer is no, unless we create it for specific websites; otherwise we'd have to build a universal text-to-action module with it (which is infinitely harder to do).
      For the last question: as long as the element doesn't require a UI/UX action to appear, the scraper will pick it up.

    • @cineymatic · 3 months ago

      @redamarzouk Thank you for the response.

  • @lyusvirazi6006 · 3 months ago

    Can you scrape PDF files from a website with this?

  • @SoSoInfinite · 2 months ago

    Can this scrape the eBay API?

  • @BohemianAnarchy · 3 months ago

    Curious: why not Puppeteer?

  • @Cygx · 3 months ago

    Why do I need to use an LLM for scraping the data?

    • @redamarzouk · 3 months ago

      Yeah, for one or two websites it doesn't make sense, but being able to scrape any website with one single app is pretty useful.
      Would you still prefer the traditional option even if you have to create a script every time?

  • @ghostwhowalks2324 · 3 months ago

    Can you use Playwright as well?

  • @peladoclaus · 2 months ago

    What's better about this than Google advanced search?

    • @redamarzouk · 2 months ago

      I don't see how they're similar.
      I'm not searching for anything; I'm giving an exact URL from which I want to extract structured data using an LLM.

  • @younube2 · 3 months ago

    Can you input multiple URLs and have the scraper collate and populate the same file?

    • @redamarzouk · 3 months ago

      It can't do that today, but it would be a great addition.

  • @younube2 · 3 months ago

    Does this work on Amazon?

  • @neylz
    @neylz 3 months ago

    Can this be used to scrape Amazon data?

  • @bfamily787
    @bfamily787 3 months ago +3

    Great video! Can you show how to implement a local LLM like Ollama instead of OpenAI?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Thank you.
      This has been requested so many times that I guess I have to make a new video about it.
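      In the meantime, a rough sketch of the swap, assuming a local Ollama server on its default port and a pulled model named `llama3.1` (both are assumptions; adjust to your setup). Only the LLM call changes; the scraping itself stays the same:

```python
import json
import urllib.request

# Assumed local endpoint: Ollama's chat API on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(markdown: str, model: str = "llama3.1") -> dict:
    """Build the chat request that replaces the OpenAI call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Extract the listings as JSON."},
            {"role": "user", "content": markdown},
        ],
        "stream": False,
        "format": "json",  # ask Ollama for structured (JSON) output
    }

def ask_ollama(markdown: str) -> str:
    """POST the scraped markdown to the local model and return its reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(markdown)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Since Ollama exposes a JSON mode via the `format` field, the rest of the pipeline can keep expecting structured output.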

  • @grahamahosking
    @grahamahosking 3 months ago

    Is it possible to add this to Home Assistant?

  • @blunoodle
    @blunoodle 2 months ago

    I used the Replit AI agent to build and deploy a kickass website scraper in like 10 minutes!

  • @menachem-145
    @menachem-145 3 months ago

    How can I work with this on a Mac?

  • @viejitoloco4133
    @viejitoloco4133 3 months ago

    Why do all that random stuff? What's the purpose?

  • @eightrice
    @eightrice 3 months ago

    There is no need to parse the actual scraped data through the LLM.

    • @redamarzouk
      @redamarzouk  3 months ago

      I'm not scraping structured data but unstructured markdown, so parsing is necessary in my case to get the table I want.

  • @joshd265
    @joshd265 3 months ago

    Please can you host this tool online so that non-dev folk can easily access it? It would also be great if the model could summarise and pull keywords out of long product descriptions, etc.

  • @daithi007
    @daithi007 3 months ago

    Do you have to manually accept cookies?

    • @redamarzouk
      @redamarzouk  3 months ago

      No, I didn't need to for the websites I scraped.

  • @aleksd286
    @aleksd286 3 months ago

    The problem isn't scraping the data; it's that if you have a public-facing website, you'll most likely get sued. Nowadays data is copyrighted material.

  • @imsjs78
    @imsjs78 3 months ago

    Sorry, but where can I see the actual code? Do I need to register on some website, or is there a link?

    • @redamarzouk
      @redamarzouk  3 months ago

      The project GitHub link is in the description.

    • @mertgokce6385
      @mertgokce6385 3 months ago +1

      @@redamarzouk Is there something wrong with your GitHub? It's not accessible.

  • @echobucket
    @echobucket 3 months ago

    I wouldn't trust this not to hallucinate. I'm thinking of a famous example where a model misread the columns and concatenated some numbers together instead of treating them as separate values, leading to incorrect results.

    • @redamarzouk
      @redamarzouk  3 months ago

      Most table data ends up with line breaks between values in the markdown.
      Can you share the case where it hallucinated for you? That would be a very interesting use case.
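      A self-contained illustration of why (not the project's actual parser): in markdown, cells are pipe-separated and rows are newline-separated, so adjacent numbers stay distinct instead of getting concatenated:

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Split a markdown table into rows of cell strings."""
    rows = []
    for line in md.strip().splitlines():
        cells = [c.strip() for c in line.strip("|").split("|")]
        if set("".join(cells)) <= {"-", " ", ":"}:  # skip the |---|---| rule
            continue
        rows.append(cells)
    return rows

table = """
| price | beds |
| ----- | ---- |
| 900   | 2    |
| 1200  | 3    |
"""
# parse_markdown_table(table) → [['price', 'beds'], ['900', '2'], ['1200', '3']]
```

Because each value arrives as its own cell, an LLM would have to actively merge two cells to produce a concatenated number.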

  • @atultanna
    @atultanna 3 months ago

    This is a great job. I hope you can share code for auto-blogging; I've looked around but haven't found much. Where can I get in touch?

  • @aijokker
    @aijokker 3 months ago

    Any way to use it with a free model?

    • @redamarzouk
      @redamarzouk  3 months ago

      Yes, the only function that needs to be modified is format_data.
      Make sure the open-source model supports structured output.
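      A hedged sketch of that change — the `format_data` name and the field set here are assumptions mirroring the project, not its exact code. With an open-source model you get a JSON string back and validate it yourself rather than relying on OpenAI's structured-output guarantees:

```python
import json

# Assumed required fields for a listing; adjust to your own schema.
REQUIRED_FIELDS = {"title", "price", "location"}

def format_data(model_response: str) -> list[dict]:
    """Parse the model's JSON string and keep only complete records."""
    records = json.loads(model_response)
    return [r for r in records if REQUIRED_FIELDS <= r.keys()]

# A canned response standing in for a local model's structured output:
raw = '[{"title": "Flat", "price": "900", "location": "Lisbon"}, {"title": "No price"}]'
# format_data(raw) → [{'title': 'Flat', 'price': '900', 'location': 'Lisbon'}]
```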

  • @ditleporc
    @ditleporc 3 months ago

    Good job Reda. What's up with your automation-campus website? Is it down? Too much success?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      Thank you, but the website is up for me. I've just checked on multiple devices and on isitdownorjustme; all working.

    • @ditleporc
      @ditleporc 3 months ago

      @@redamarzouk Zscaler classified your site as suspicious...

  • @w3whq
    @w3whq 3 months ago +1

    Great resource.

  • @mockcrackers7636
    @mockcrackers7636 1 month ago

    Can it scrape LinkedIn?

    • @redamarzouk
      @redamarzouk  21 days ago

      I've tried it and it did scrape it.

  • @ld-yt.
    @ld-yt. 3 months ago

    Why take down the repo?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      My GitHub got suspended; here is a backup link:
      www.automation-campus.com/downloads/scrapemaster

  • @cameronyking
    @cameronyking 3 months ago

    Can this be an API?

  • @SavanVyas91
    @SavanVyas91 3 months ago

    Pagination will be critical for this

  • @CarlvanEijk
    @CarlvanEijk 3 months ago

    404 on your Git? What's going on?

    • @redamarzouk
      @redamarzouk  3 months ago

      GitHub suspended my whole account without warning. I've shared the code; follow the link in my description.

  • @MrTestingchannel1
    @MrTestingchannel1 3 months ago

    Repo deleted or hidden, why?

    • @redamarzouk
      @redamarzouk  3 months ago

      GitHub suspended my account.
      I’ve shared the whole code, link in the description.

  • @MEYER251189
    @MEYER251189 3 months ago

    Does it work on Linux?

    • @redamarzouk
      @redamarzouk  3 months ago

      I haven't tried it, but it should work; nothing should be different.

    • @MEYER251189
      @MEYER251189 3 months ago

      @@redamarzouk I asked because I saw an .exe file inside the repo.

    • @redamarzouk
      @redamarzouk  3 months ago +1

      @@MEYER251189 Oh yeah, I missed that.
      You should download the chromedriver for Linux.

  • @jakob1379
    @jakob1379 3 months ago

    I think they are being too harsh. There are other, more effective ways to scrape that are also promoted all over the place, so it's not a matter of risking spamming pages. It doesn't natively respect robots.txt, though, which might be an issue when promoting tools that don't need configuring.

  • @VaibhavShewale
    @VaibhavShewale 3 months ago +1

    lol, back in college I made a web scraper as my project and got full marks XD

  • @cheveznyc
    @cheveznyc 3 months ago +3

    Suggestions: the ability to scrape Bing, Yahoo, and Google; to check the second page of results; and to flag outdated, accessibility-non-compliant, and non-mobile-friendly websites. 📵 And is there a Google Maps version? 😢😮

  • @BaldyMacbeard
    @BaldyMacbeard 3 months ago +5

    Ah yes. Finally... an even more expensive way to scrape sites than we used to have...

    • @redamarzouk
      @redamarzouk  3 months ago

      Can you elaborate on which part you think is expensive?
      Is it the scraper I made, or LLM scraping generally speaking?

    • @the_real_cookiez
      @the_real_cookiez 3 months ago

      BeautifulSoup is free, and anything with LLM APIs isn't scalable because it's billed per usage. @@redamarzouk

    • @realmstupid-on8df
      @realmstupid-on8df 3 months ago

      $0.0015 is nothing. I bought $1 in Bitcoin at that amount.

  • @hendrikvanbrantegem7526
    @hendrikvanbrantegem7526 3 months ago

    Can you do bulk URLs?

    • @redamarzouk
      @redamarzouk  3 months ago +1

      The Streamlit application is mainly for interactive scraping, but the scraper.py file can be used to launch scraping on a list of URLs.