I created a Google API key, but on running it says the API key is invalid ... even if it's free, does one have to enter card details in order to use it? I also noticed it extracts the whole raw data. If I give a tag, wouldn't it be better to search only that in order to optimise the tokens per minute? On Groq, for example, I have this problem: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-70b-versatile` in organization `org_0.......` on tokens per minute (TPM): Limit 20000, Used 0, Requested 46892.
I have no idea, I haven't received any emails from them(checked my spam folder and everywhere). I thought it was about the "useragent" that I used, but I searched for that exact line of code and I found it in 20k+ repositories, so truly I don't know.
Hey, just wanted to say thanks. Please, in the next version: 1- I want my llama3.1 8b to work via ollama (installation), is there anything special I need to do? 2- pagination (this setting is critical) 3- the fields we define (i.e. name / price): how does it (the LLM) map name to a class like title? (slightly confusing)...
Noted for the first 2 points. For the third, it creates everything as a string, but the llm is smart enough to understand how to format other types like numbers for example. So no loss in format.
Very nice job, but you are not facing the real problem. Does it work with JS-heavy sites? For a basic site with no JS you can make a web scraping tool in minutes with Claude. Plus pagination in many specific cases is really complicated. In my opinion you should use playground to give the user the ability to create a pattern specific to that website (clicking around to memorize what to do to get the info) and then proceed with the auto scraping part.
Thanks Reda ! That's really interesting. I tried to scrape data on websites like amazon by scraping only one single page of a product with Llama3.1. However, I faced the token limit issue although I have a powerful MacBook M3 with 38 GB. The same page works well with Gemini1.5 and gpt. Do you have an explanation please ?
You're most welcome. Yeah, with Llama3.1 8b in LM Studio you'll have to increase your context length to more than 100K tokens (you can do this in the advanced configuration). The smaller models, as good as they are for their size, really can't keep track of the whole completion when extracting from pages whose markdown is more than 60K tokens, and Amazon is one of them.
What? This code is 100% mine, and I first created this project 4 months ago; go back to my channel and watch a whole 20-minute video about it. You could say that the idea has been around for some time, and I've seen big repos trying to do website-to-structured-data (firecrawl, jina AI, scrapegraph AI, etc.), and from there I tried to create a simple, fully open source version because people in the comments asked me to. So yeah, credit to those libraries, but saying I've taken this from someone else is delusional.
I think that without significant user input, it's unlikely to work. First, the script needs to capture the JS elements code from the site. Then, the user must provide what they are looking for, such as specific pages or other elements. Finally, the script + LLM should automatically extract the relevant CSS selector or XPath.
Re: pagination, there would be a few ways to tackle this.
The simplest to implement would be to have the user specify the CSS selector of the next button. Then, the script could retrieve a page, wait a few seconds, launch a click on that selector (e.g. via a JS function), and loop.
Now, if you wanted to make it automatic, I think the simplest way to tackle this would be to write a simple algorithm with a decision tree that is triggered when the DOM of the first page is returned. The algo would go over the DOM and look for specific signs of pagination: a next button, numbers in a tag, links with a valid href attribute (not just a #) that contain a number, etc.
Or if you really want to cover all bases, you could have both: the algo would attempt an auto-detection, but there would be an input as a fallback in case it fails, or in case the user wants to modify what has been detected.
But I wouldn't necessarily use a model for this, as the cost and duration are going to skyrocket compared to running a simple JS or Python algo as traditional scrapers do.
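A minimal sketch of that decision tree with the user input as a fallback, assuming the links have already been pulled out of the DOM as (text, href) pairs. The function name, the strategy labels and the list of "next" labels are made up for illustration and are not taken from the project.

```python
import re

# Link texts that usually mean "next page" (an assumed, incomplete list).
NEXT_LABEL = re.compile(r"(next|›|»|>)", re.IGNORECASE)

def pick_pagination_strategy(links, user_selector=None):
    """links: list of (text, href) tuples taken from the page's <a> tags."""
    # 1. A selector typed in by the user always wins over auto-detection.
    if user_selector:
        return ("user_selector", user_selector)
    # 2. A link whose text looks like a next button and has a real href.
    for text, href in links:
        if href and href != "#" and NEXT_LABEL.fullmatch(text.strip()):
            return ("next_link", href)
    # 3. Numbered links whose href is not just "#" and contains a number.
    numbered = [
        href for text, href in links
        if text.strip().isdigit() and href and href != "#" and re.search(r"\d", href)
    ]
    if numbered:
        return ("numbered_links", numbered)
    # 4. Nothing detected: the caller should ask the user for a selector.
    return ("none", None)
```

When the result is ("none", None), the UI would show the selector input so the user can fill in the gap by hand.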
I personally haven't been able to get the ollama model working, I only got the gpt-4o-mini working, but that said, is cost a factor now that we can scrape with llama?
@mrsai4740 well, I was thinking of cases where, for one reason or another, a local model couldn’t be used.
But aside from that, from my perspective using a simple algorithm seems more viable, as we can roughly know what to expect from the DOM in that regard. There aren’t 10,000 possible implementations of a pagination that are valid HTML, and with parsing and regex it should be fairly easy to develop.
@bluetheredpanda I agree, there should be a way to detect pagination in a page or an item detail view of a page. However, I feel like capturing every single way one may paginate is going to be a challenge. Looking at the DOM itself on some sites, I've seen people use tags to put the pagination in, I've also seen people use to put a pagination in, and these two setups had no specific CSS that made it obvious this was a paginator. What can be done, maybe, is have selenium truly crawl a site by clicking every element it can, and try to parse and combine the results. It sounds doable but the performance will probably be ass.
@mrsai4740 table wouldn’t be valid html, it’s definitely going to be an outlier (that’s actually why I mentioned the input field so users can enter their own targeting).
I wouldn’t click every element at random (even though he did mention turning this into a crawler, so you never know), but beautiful soup already returns the entire DOM. We can scan it for links, and filter those based on the value of the href property, which would achieve the same thing while being a billion times faster.
Re: CSS - modern CSS selectors would 100% allow you to target even if no specific class is used, i.e. a[href$="/page/2/"], which means “a link which points to a page whose URL ends in /page/2/”. CSS attribute selectors can't contain a regex themselves, but combine the idea with a regex check on the href, conceptually a[href$="/page/\d+/"], and it works for any number, not just page 2. There’s definitely something in there.
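The href filtering could be sketched like this. Since a CSS attribute selector cannot hold a regex, the regex runs in Python on every href collected from the page. This sketch uses the standard library's html.parser instead of Beautiful Soup purely to stay self-contained; the /page/N/ URL shape and the names LinkCollector and find_page_links are assumptions for the example.

```python
import re
from html.parser import HTMLParser

# Matches hrefs ending in /page/<number>/ (the trailing slash is optional).
PAGE_HREF = re.compile(r"/page/(\d+)/?$")

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in the document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def find_page_links(html):
    """Return {page_number: href} for links that look like pagination, sorted by page."""
    collector = LinkCollector()
    collector.feed(html)
    pages = {}
    for href in collector.hrefs:
        match = PAGE_HREF.search(href)
        if match:
            pages[int(match.group(1))] = href
    return dict(sorted(pages.items()))
```

Links that are only "#" or that point at unrelated pages simply fail the regex and are dropped, with no clicking involved.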
I cant wait for the pagination part!
Really looking forward to a follow-up video to this!
I'm working hard on that pagination. Whenever I feel like I have a universal approach and test it on a couple of websites, I still find it needs improvement. But the next video will be out soon.
Check if the website has a sitemap. The links in there are usually the content-related ones. Plus, for SEO purposes, most business-related websites will use meaningful keywords in the URLs, which you can use with regex to filter/sort/prioritize. The issue you're trying to articulate is how do you allow the user to specify which content is scraped and paged. In your example, if you're on a shop site, you obviously want shop-related links, you don't really care about the privacy policy or the returns policy. So in the same way you provide tags for data extraction, you could also provide a limited set of content type tags the user could select to guide which links are followed. Use the AI to make a best guess about the nature of the site and then provide some helpful tags. AI detects online store and blog: do you want to scrape both, the shop only, images only, blog only? If the user selects shop data only, then you can get pretty far in finding the links to follow.
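As a rough illustration of the sitemap idea, the sketch below reads the loc entries of a standard sitemap.xml and keeps only the URLs that contain one of the keywords the user picked. It assumes the sitemap text has already been downloaded and uses the sitemaps.org namespace; the function names and the keyword list are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]

def filter_by_keywords(urls, keywords):
    """Keep URLs that contain at least one keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [url for url in urls if pattern.search(url)]
```

A "shop only" choice in the UI could then map to a keyword list such as ["shop", "product"], which drops the privacy policy and the blog before any page is sent to the model.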
Thx. Definitely following this project!
Thank you Thank you Thank you. They can not stop you.
From a performance standpoint, I think it's better to use the LLM to analyze the source page layout and have it write a scrapy (or similar) scraper, and then to use that to scrape the data. Using the LLM to process all the data is fine for one or two pages, but if you need to do a big scrape of 1000s of pages, the performance is going to be very poor compared to writing a dedicated scraper with the LLM and using that.
He listened and he provided
why did they suspend the github account?
It's working now. Check the link in the description. I was going to replicate the entire thing using cursor ai lol thank god he updated it. Thanks Reda
Probably tripped some automatic security measure
Still unavailable
Also curious, never heard of someone having the github account suspended.
Only god knows. I didn't receive any explanation, and my request to reinstate my account can take days, weeks or months (according to their forum).
I'm not alone, this is happening to a lot of people.
The issue is that similar AI scraper projects are still up.
great work!
I think it should also be an advanced spider, which checks the full site structure and then uses the parts that are needed most.
Thank you!
Yeah some prompting to detect the structure of the page will get us the crawler for pagination we want.
I'm literally doing this right now on my own scraper. So far, promising early results.
I'm scraping part of the html structure. Sending to Claude to parse, and then guiding the scraper based on the initial structure it gets. It's also integrating brightdata scraping browser functionality.
Thanks Buddy. I feel sorry for your account suspension. I hope they will remove suspension soon.
Thanks, great project. Regarding the pagination: you could have the user specify a placeholder in the URL to identify the pagination parameter, e.g. page= in ?product=12345&page= for the second page, and then also specify a start and end page number separately. The script would then open the different URLs with the number inserted for that parameter.
or just have the user specify the page= parameter and the start and end pages
Or have the user just provide the URL for, say, the second page with the page= parameter in it, and ask the model to determine it. Not sure how reliable it would be in every case, though, if the pagination param wasn't obvious.
do you think this will not burden the user with more steps, shouldn't we give a suggestion first and let them validate or modify it?
@redamarzouk Yes, probably better to give a suggestion from the URL that they can change.
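The page= idea from this thread could look roughly like the sketch below: take a URL, a parameter name and a start/end page, and build one URL per page. The function name paged_urls and its defaults are hypothetical; only the standard urllib.parse helpers are used.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def paged_urls(url, param="page", start=1, end=3):
    """Build one URL per page number by setting `param` in the query string."""
    parts = urlparse(url)
    # Keep every other query parameter and drop any existing value of `param`.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    urls = []
    for number in range(start, end + 1):
        new_query = urlencode(query + [(param, str(number))])
        urls.append(urlunparse(parts._replace(query=new_query)))
    return urls
```

The parameter name could be pre-filled with a guess taken from the URL the user pasted and left editable, which is the "suggest first, let them validate" flow discussed above.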
Can you guide exactly which google api key to get? As there are lot of options out there. Getting confused.
What about saving the data in a database and running it for a while? That would be amazing ... that would be a really useful tool for so many people.
An idea for the pagination: scrape the source code of the URL, send it to the LLM to recognize the structure, and apply the matching scraping code selected from a library of different approaches?
They ban you because, unlike others I will not mention, you are giving value without the financial lure of a subscription fee. This is bad for the business models of many others, so they will always try to shut you down. Keep being a rebel and giving REAL value, this is the only way someone with nothing can ever have a chance, and trust me, I have used this to get a new income and I was at rock bottom, so thank you!!! Keep giving, IT WILL GIVE BACK :D
Tutorial how to dockerize and launch on own linux server to access from any device and from everywhere would be great. Thanks for the app and code!
You're most welcome, and the docker part will be coming stay tuned!
Hi Marzouk, first thanks so much.
Can you show us which files I need to change to use this on Linux?
Are you planning to dockerize this app in the future?
You're welcome.
About the docker, I got this request so many times now, I will have to create one. Stay tuned for that.
To make it work on Linux, you'll need to change the path of the chromiumdriver because it's different for linux, and of course the commands to create the virtual env are different, but everything else should be the same.
A lot of modern websites don't use pagination but load as you scroll. You have to be able to handle that.
That also will be challenging yeah!
Do you have an example of a website in mind?
Here is how I would implement automatic processing of Paginated and/or Nested data :
The first time the user runs the script, they should get the option of web scraping with pagination and/or nested data. We can prompt the LLM accordingly depending on what they chose and have it store the nested page URLs per listing in the appropriate object as well as extract the selectors for next/previous page elements (or at least one pagination link from the DOM, but that is probably tricky to implement).
The user could have the option at the start to decide whether to process all pages and if not, decide how many pages he wants starting from the current page that was linked. The same goes for Nested Pages, the user can choose to attempt extraction of additional data found in "detail page" for each listing.
The important part is that for either, it must process the pages one step at a time while saving the progress. If anything goes wrong or some pages are missing, there should be clear UI letting the user know that, so he tries to web scrape those pages again. We could also offer the user an input to give us the pagination selector if all else fails.
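The "one step at a time while saving the progress" part could be sketched as below. The scrape_page callable stands in for whatever actually fetches and parses a page, and the JSON state file layout is an assumption for this example, not the project's format.

```python
import json
from pathlib import Path

def scrape_with_checkpoint(urls, scrape_page, state_file):
    """Process URLs one by one, saving progress after each so a rerun can resume."""
    path = Path(state_file)
    # Reload earlier progress if a previous run left a state file behind.
    state = json.loads(path.read_text()) if path.exists() else {"done": {}, "failed": []}
    state["failed"] = []  # failures are recomputed on every run
    for url in urls:
        if url in state["done"]:
            continue  # already scraped in an earlier run
        try:
            state["done"][url] = scrape_page(url)
        except Exception as exc:
            # Record the failure so the UI can offer to retry this page.
            state["failed"].append({"url": url, "error": str(exc)})
        path.write_text(json.dumps(state))
    return state
```

The "failed" list is what the UI would display, and running the function again with the same state file retries only the pages that are missing.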
Good idea
Have you considered making this more scalable by using an LLM to discover the exact settings to use for beautiful soup for each website?
actually that is a good idea because it can make the whole process faster.
So the idea is to give the page structure to the LLM first and let it decide where the data will be located, after that I should create the markdowns only from inside that tag reducing the number of tokens and making the call faster.
I thought about this approach before but was too lazy to recreate the app around it.
Is it able to scrape prices from variable products? Need this for promotional gifts (printed with your Logo)… so Like 100pcs, printed with 1 Color, 2 colors, 3 colors… Same with other qtys. ?
Is it possible to use a vision model to scrape a website that blocks or flags scrapers, by setting up some virtual environment where the LLM can open and control the website, and is able to scroll, check things and even press buttons, so it can fully scrape the site? So, using a normal LLM as an agent, with vision added.
That's just what's on my mind. I don't know if it's too complex or doesn't make sense in terms of efficiency, especially with the compute power it needs.
Perhaps you could allow users to input a URL pattern with [1-X] at the end, so that your code can turn it into page URLs and run on each one.
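A small sketch of that [1-X] pattern, assuming the range is written as [start-end] with plain integers and that only the first range in the pattern is expanded. The function name is made up for the example.

```python
import re

# Matches a range such as [1-25] anywhere in the URL pattern.
RANGE = re.compile(r"\[(\d+)-(\d+)\]")

def expand_url_pattern(pattern):
    """Turn 'https://site/page/[1-3]' into one URL per page number."""
    match = RANGE.search(pattern)
    if not match:
        return [pattern]  # no range given: treat it as a single URL
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{number}{suffix}" for number in range(start, end + 1)]
```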
I love this content, but I'd like to see your takes on the good old regular scraping, like scrapy with proxies
The market will take time to embrace these new methods of scraping with AI, even with very cheap or free options like groq and gemini, so regular scraping is still the dominant way to scrape in general. I see them adding this type of AI integration alongside what they already have, but it will take time.
Can this do dynamic JavaScript related scraping for something that’s not tied to an actual page/route?
what if I wanted to use this scraper to reference a csv file to do skip tracing?
this only fetch the first page. what can we do to scrape from all pages
+1
Hello, is there a docker container for unraid?
Can this tool scrape google maps 🤔
Can it collect emails? On Dun & Bradstreet
How can I make it return more than 10 products?
can you add the option, to send the sorted file back to the llm to add more headers to the table or in general to modify the table, without having to do the scrap again?
I always save every markdown for every scraping inside a folder called output in the project.
you only need to tweak the code a bit to run on the same markdowns again with the fields that you need.
But yeah adding this feature will be nice, can you tell me when have you needed this?
Hello i cant get a display after it scraped.
the pagination thing you can do by checking the URL, I usually do this
How do you do it in case you have numbers of pages with no next button versus times where there is only a next button?
Hi, I want to scrape devices like laptops, mobiles and tablets which support Type-C charging at up to 45W. I want to scrape them from Amazon and also automate this, so that it scrapes again every day and updates the database, and if a device is not present it creates that device in the database. I am a MERN stack developer, so how can I do that?
Idea: What if we don't run headless but instead scrape as you are navigating. At least this way we can try to capture multiple pages and item/detail style websites.
True, I've been avoiding headless for a while, but apparently the new --headless=new flag runs as if you're actually opening the page (I'm not so convinced).
Anyway, if you don't like the headless option, go to assets.py and simply remove it from the list of options; you don't need to touch the code itself.
Why was your github account suspended?
Pagination is so important, but also going one level deeper to scrape data from each item would be a great add!
In addition to pagination with numbers, a complementary approach (perhaps an additional toggle?) involves pages with a Next button. The Forward button is an anchor element. The scraper navigates to the next page and grabs the link until it reaches the last page, where the forward button’s href value is #.
Perhaps a similar approach can be done to scrape data from each item's page - listing all item links present in every page, and then scrape?
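The Next-button loop described here could be sketched as follows. The get_next_href callable is a placeholder for whatever loads a page and reads the forward button's href (Selenium, Beautiful Soup, etc.), and max_pages is an assumed safety limit.

```python
from urllib.parse import urljoin

def crawl_next_links(start_url, get_next_href, max_pages=50):
    """Follow the Next button from page to page until its href is '#' or missing."""
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        href = get_next_href(url)
        if not href or href == "#":
            break  # last page: the forward button no longer points anywhere
        url = urljoin(url, href)  # resolve relative links against the current page
    return visited
```

The same loop shape would work for item detail pages: collect every item link on each visited page, then scrape those URLs in a second pass.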
Yes, for websites that have a list/detail view, it would be nice if we could make selenium go into each item and scrape its data. I think the problem with this is the number of tokens it will end up producing.
so would you suggest having another way of launching the application only to crawl the URLs of the pages and put them somewhere to then launch the scraping on those urls ?
@mrsai4740 Right, but some applications may make it worth it, let's say analyzing competitors. With mini, costs are quite competitive, and if you use local computation it's free!
@redamarzouk that would be amazing
Is anyone able to answer a support question in discord by chance ? its pretty quiet there
NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit:
how can I go about this error?
Me too! Did you find a fix?
what if it has a login page?
i'd love a docker container for this.
Very beautiful and interesting. But as a beginner, I'd need you to explain how to configure the LM Studio server in Python ✅✅😊😊
It's really hard to handle all paginations automatically for any website while also caring about token cost when scraping a mass of pages. But it's a great challenge 😅
How fast is the local Llama 3.1?
It will really depend on each machine, better GPU+Smaller Models = faster inference and vice versa.
In my case I had a generation of 15 tokens per second which feels a bit on the slower side but it's not very bad.
Does it work on Google Maps for scraping leads?
I tried this with Google Maps (tried searching golf courses) and it seemed to work, but it only works on a small set of data since Google only loads like 6 results unless you scroll. I got around this by just copying the element with all the items I was interested in after I scrolled to the end of the list, then I shoved that into an HTML file and used a local file as the URL. The results seemed promising at first, but it looks like the model gives up and doesn't capture everything.
you're right.
When giving really long HTML files to smaller models (and even worse, small local models, 8B or 7B), they seem to give up on getting everything.
So going with gpt-4o or Gemini 1.5 Pro (they give 50 free requests per day) will be your best option in this case of 100K+ tokens.
I was dreaming of this... Have you included a way to anonymize headers / set up a VPN to avoid being banned?
I have a random user agent chosen every time the app launches a new process. It's in the assets file; you can easily add more there without touching the code in the other files!
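For illustration, picking a random user agent per process could be as small as this. The list name `USER_AGENTS` and the strings in it are assumptions, not the actual contents of the assets file.

```python
import random

# Hypothetical list; the real one lives in the assets file and may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def pick_user_agent():
    """Pick one user agent at random for the current process."""
    return random.choice(USER_AGENTS)


# Chrome accepts the user agent as a command-line switch.
user_agent_flag = f"--user-agent={pick_user_agent()}"
print(user_agent_flag)
```

Adding more variety is then just a matter of appending strings to the list. Note that a rotating user agent only changes one header; it does not hide your IP the way a VPN or proxy would.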
dafuq did github suspend your account for? Did you mark it as "Educational" or
If a human can paginate, AI should be able to paginate. True Universal would be a mechanism with image recognition. Feed AI a screenshot of website and “which element does pagination?”
This doesn’t yet cover infinite scroll. You could try to scroll the page and detect if more data is loaded. If so, there you have your pagination.
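A sketch of that scroll-and-check loop, written against two plain callables so it runs without a browser. With Selenium, `scroll_once` could call `driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")` followed by a short wait, and `count_items` could count the elements matching the item selector; both of those wirings are assumptions, as is the fake feed below.

```python
def scroll_until_stable(scroll_once, count_items, max_rounds=20):
    """Scroll until the item count stops growing; return (items, productive_scrolls)."""
    previous = count_items()
    productive_scrolls = 0
    for _ in range(max_rounds):
        scroll_once()
        current = count_items()
        if current <= previous:
            break  # nothing new loaded: no infinite scroll, or end of the list
        previous = current
        productive_scrolls += 1
    return previous, productive_scrolls


class FakeFeed:
    """Stand-in for a page that loads 6 more items per scroll, up to 20 in total."""

    def __init__(self, total=20, batch=6):
        self.total = total
        self.batch = batch
        self.loaded = batch

    def scroll(self):
        self.loaded = min(self.total, self.loaded + self.batch)

    def count(self):
        return self.loaded


feed = FakeFeed()
items, scrolls = scroll_until_stable(feed.scroll, feed.count)
print(items, scrolls)
```

If the first scroll loads nothing, `productive_scrolls` stays at 0, which is the signal that the page is not paginated by infinite scroll.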
I created a Google API key, but on running it says the API key is invalid... even if it's free, must one insert card data in order to use it? I also noticed it extracts the whole raw data. If I give a tag, wouldn't it be better to search only that in order to optimise the tokens per minute? On Groq, for example, I have this problem: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-70b-versatile` in organization `org_0.......` on tokens per minute (TPM): Limit 20000, Used 0, Requested 46892.
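A sketch of the "only send the tag I care about" idea: keep just the text inside one tag and check a rough token estimate against the limit before calling the model. The 4-characters-per-token ratio is a crude assumption (real tokenizers differ), the sample page is made up, and the 20000 limit is the figure from the error above.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text that appears inside a given tag (nested tags included)."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def text_inside(html, target_tag):
    parser = TagTextExtractor(target_tag)
    parser.feed(html)
    return " ".join(parser.chunks)


def estimate_tokens(text, chars_per_token=4):
    """Very rough estimate; assumes about 4 characters per token."""
    return max(1, len(text) // chars_per_token)


TPM_LIMIT = 20000  # limit reported in the 429 error above

PAGE = (
    "<nav>Home About Contact Careers Blog</nav>"
    "<table><tr><td>Laptop</td><td>999</td></tr></table>"
    "<footer>Terms Privacy Cookies Sitemap</footer>"
)
trimmed = text_inside(PAGE, "table")
fits = estimate_tokens(trimmed) <= TPM_LIMIT
print(trimmed, estimate_tokens(trimmed), fits)
```

A request that is larger than the per-minute limit on its own (46892 requested against 20000 here) cannot succeed by retrying later, so trimming or splitting the input is the only way through on that tier.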
I think we can just code and scrape, but we'd need a Chrome extension. So not useful.
Are you using firecrawl?
Not anymore, I've used it in the first video of this series.
@@redamarzouk are you using puppeteer, playwright, or selenium? I'm curious about bot detection...
Why did your github account get suspended? Which policy did you violate?
I have no idea, I haven't received any emails from them (checked my spam folder and everywhere).
I thought it was about the "useragent" that I used, but I searched for that exact line of code and I found it in 20k+ repositories, so truly I don't know.
Can you scrape pdf files from a website ?
No, the scraper captures data that exists in the HTML of the website.
Will I get banned if i upload privately on my gh?
I can't guarantee anything, so it's really up to you, I got banned, hopefully no one else will.
Hey, just wanted to say thanks. Please, in the next version:
1- I want my llama3.1 8b to work via ollama (installation); is there anything special I need to do?
2- pagination (setting is critical)
3- the fields we define (e.g. name / price): how does it (the LLM) map name to class type title? (slightly confusing)...
Noted for the first 2 points.
For the third, it creates everything as a string, but the llm is smart enough to understand how to format other types like numbers for example. So no loss in format.
Very nice job, but you are not facing the real problem. Does it work with JS-heavy sites? For a basic site with no JS you can make a web scraping tool in minutes with Claude. Plus, pagination in many specific cases is really complicated. In my opinion you should use playground to give the user the ability to create a pattern specific to that website (clicking around to memorize what to do to get the info) and then proceed with the auto scraping part.
We should create a scraper to scrape your site to get the code of the scraper.
Good idea 😄😄
Thanks Reda ! That's really interesting. I tried to scrape data on websites like amazon by scraping only one single page of a product with Llama3.1. However, I faced the token limit issue although I have a powerful MacBook M3 with 38 GB. The same page works well with Gemini1.5 and gpt. Do you have an explanation please ?
you're most welcome.
Yeah with Llama3.1 8b in lm studio you'll have to increase your context length to more than 100K tokens (you can do this in advanced configurations).
The smaller models, as good as they are for their size, really can't keep track of the whole completion when extracting from pages whose markdown is more than 60K tokens, and Amazon is one of them.
Is it free?
When you use Llama 3.1 8b, Groq, and Gemini (up to 1500 requests per day), yes it's free.
would rather use playwright instead of selenium
err
Not that I am a bad guy, but this project has been implemented by someone else... the least you could have done is give credit.
What?
This code is 100% mine, and I first created this project 4 months ago; go back to my channel and watch a whole 20-minute video about it.
You could say that the idea has been around for some time, and I've seen big repos trying to do website-to-structured-data (firecrawl, jina AI, scrapegraph AI, etc.), and from there I tried to create a simple, fully open source version because people in the comments asked me to.
So yeah, credits to those libraries, but saying I've taken this from someone else is delusional.
I think that without significant user input, it's unlikely to work. First, the script needs to capture the code of the JS elements from the site. Then, the user must specify what they are looking for, such as specific pages or other elements. Finally, the script + LLM should automatically extract the relevant CSS selector or XPath.