Nice job 👍🏻
Perhaps Llama locally and/or from Groq would be a nice improvement
I agree with you.
I can't explain in words what you do. Thanks for your kind efforts!!!
Can't believe this, you did it. I've been coding non-stop for the last 5 days because of your last video on this, thaaank youu!!
My pleasure 🙏
How did a scraper help you code?
Please integrate llama3 locally (without any API), as many of us run llama3 locally.
I don't know anything about this project, but since it's open source, I suppose you can just run Ollama with its OpenAI-compatible API and simply replace the URL in the code, then use whatever model you want.
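Something along these lines, assuming openai-python >= 1.0 and Ollama's default port; the model name and prompt are placeholders:

```python
# Sketch: point the standard OpenAI client at a local Ollama server.
# Assumes Ollama is running on its default port and the model was pulled.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",  # any locally pulled model
    messages=[{"role": "user", "content": "Extract the product data from this page..."}],
)
print(response.choices[0].message.content)
```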
Hmm. I already made a universal headless Chrome scraper. Mine can even interact with the page.
But you're a better man than me for sharing.
Mind sharing your code, mate?
Nice to see you getting traction. I would love to see some content on how to mitigate and avoid being blocked, especially by Cloudflare.
Great video! Having the ability to use locally hosted Ollama on the network would be great. I have Ollama running llama3 on another machine on the same network.
Great Work!!!
I appreciate that you have already made it run locally and created a resume scraper.
Would you possibly combine the two by using the resume scraper with additional inputs to create a JSON profile, which could then be used as search-criteria input for scraping job sites such as Indeed, StepStone, or other similar sites?
It would be great to have the match percentage from the scraping be usable as a filter and/or sort key.
The reason I ask is that it has multiple uses. If the JSON search-criteria profile had some other definition, it could still be used as generic input values for the search process, thus allowing the match-percentage functionality to have a universal application. The second use is to have a single profile that would deliver better search results than the original profiles on sites such as Indeed and StepStone.
An additional option could be to use a starting location and radius to help limit the data to be processed. There are map APIs that compute the travel distance between two points as well as the travel time based upon the travel mode (car, bus/train, bike, walk). This would add a lot of value to searches. It could also factor into the match percentage when used.
I have one additional request: could you add an option to change the language to German? If you need, I can help with the translation, since I'm an American working in Germany. It would make things a lot easier for people in Germany. I already have a JSON structure; if you would like my help, let me know.
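For illustration, a hypothetical sketch of such a JSON search-criteria profile, written as a Python dict; every field name here is an illustrative assumption, not the commenter's actual structure:

```python
# Hypothetical search-criteria profile; field names are illustrative only.
search_profile = {
    "job_titles": ["Data Engineer", "Python Developer"],
    "skills": ["Python", "SQL", "Selenium"],
    "language": "de",  # UI language option, e.g. German
    "location": {
        "start": "Munich, DE",   # starting location for the radius filter
        "radius_km": 50,
        "travel_mode": "train",  # car, bus/train, bike, walk
    },
    "min_match_percentage": 70,  # filter and/or sort threshold
    "sites": ["indeed", "stepstone"],
}
```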
Maybe a way to circumvent the token issue is to count tokens, cut before the model's token limit, then continue after the cutoff and iterate until you've got the full page.
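A minimal sketch of that iterate-until-done idea, assuming tiktoken for counting and an 8k-token budget; `extract_with_llm` is a hypothetical stand-in for the model call:

```python
# Sketch: split the page's markdown into chunks that fit the token limit,
# then process each chunk in turn until the full page is covered.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 8000  # assumed per-call budget; model-dependent

def chunk_markdown(markdown: str):
    tokens = enc.encode(markdown)
    for start in range(0, len(tokens), MAX_TOKENS):
        yield enc.decode(tokens[start:start + MAX_TOKENS])

# results = [extract_with_llm(chunk) for chunk in chunk_markdown(page_markdown)]
```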
Love that this is open source. Thank you!🙏🏾 I already knew how you were going to handle the pagination before you started talking 😂 A fix might be to add a starting URL and a field for the second page.
Another suggestion is proxy 😢
I have more interesting additions to this.
Alright, proxy is noted!
@@redamarzouk There is a website called Bright Data which solves issues like captchas and all the others, and it gives us credits, so that could work, or any similar service.
You are a wonderful person, thank you for sharing 💪
Thank you, means a lot!🙏
Bless you! I have a project in mind, and this is what I was looking for to monetize it.
Thanks ❤
Good job bro. Please continue
My pleasure!
I auto-subscribe to people who share useful free stuff. Thanks for this!
Great project, but it would be even greater if you created a Docker container for it and allowed the use of local AI (Llama) instead of the cloud.
Awesome! 👏
What can I say? Exceptional! Thank you, sir 👍
Love the project, man! The update addresses the exact problem I faced before 🔥
I’ve tried using GPT-4o-mini and Gemini Flash as well, and they both work smoothly. However, when using the local model, the pagination script throws an error on 'openai.ChatCompletion'. Could this be due to a version issue? Thanks
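For reference, `openai.ChatCompletion` was removed in openai-python 1.0, so a version mismatch is the likely cause; a minimal sketch of the newer call style, with model and prompt as placeholders:

```python
# Sketch: the post-1.0 client replaces the module-level openai.ChatCompletion.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Old (pre-1.0, now raises an error):
#   openai.ChatCompletion.create(model=..., messages=...)
# New:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Find the pagination URLs."}],
)
print(response.choices[0].message.content)
```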
My issue with using the local Llama 3.1 8B was really the number of tokens; in my case it was 8k tokens per completion.
If you have a model with a longer context window and it's still giving you errors, join the Discord and share a screenshot so I can understand the problem better.
@@redamarzouk Hello, can you send the Discord link again? The one you previously provided has expired. Thanks 🙏
Very nice! 🎉
You are king!
Thank you so much for this video. I am a no-coder and have no problem following your instructions. I have the latest versions of VS and Python installed, but for some reason I am unable to install the requirement packages. Can you please advise? Thank you
Did you try to add a login option for websites requiring it?
I tried, but I often get a response from the website that my browser doesn't support JavaScript, or that it is not enabled and is needed to proceed to login. I tried to enable it in Selenium, but I'm still getting the same response.
Btw, thanks for sharing this, very interesting!
Reda, thank you for this video. I know your previous version 2 of the scraper allowed adding delays when scraping a website, but how would V3 handle infinite-scrolling pagination instead of pages 1, 2, 3, etc.?
I have 3 scroll events: the first to half the page height, the second to almost the end, and a last one to the end of the page, with random time delays between them.
Do you think that's enough to handle infinite scroll?
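Roughly like this, as a minimal sketch assuming a Selenium `driver` already open on the page; the fractions and delay range are placeholders:

```python
# Sketch: three staged scrolls with random pauses to let content load.
import random
import time

def scroll_in_stages(driver):
    for fraction in (0.5, 0.9, 1.0):  # half, almost the end, the end
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight * arguments[0]);",
            fraction,
        )
        time.sleep(random.uniform(1.0, 3.0))  # random delay between scrolls
```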
Incredible, thank you :)
Hello!! Great video. I want to ask if it's possible to scrape a whole article, for example, with your tool. Unlike a lot of people here, I just want to read articles, light novels, and some comics which are behind a paywall. Can your scraper help me with that, or do I need to make some modifications to the code for it to work?
Can I use Llama running locally on my PC?
Can this scrape from YouTube? For SEO? Thanks for your amazing work.
I believe the way to solve the maximum-token issue is to first strip the HTML results of unnecessary HTML, script, and style tags before sending them to the LLM.
html2text already gets rid of all tags and scripts, but maybe the URLs can be removed as well; that does sometimes decrease the number of tokens in the markdown.
But the problem is: if the user wants to extract URLs of images or something else, for example, what should happen in that case?
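One option is to make URL stripping a user toggle; a minimal sketch with html2text, where the `keep_urls` flag is a hypothetical new option:

```python
# Sketch: strip link/image URLs only when the user doesn't need them.
import html2text

def to_markdown(html: str, keep_urls: bool = False) -> str:
    h = html2text.HTML2Text()
    h.ignore_links = not keep_urls   # drop <a href> URLs to save tokens
    h.ignore_images = not keep_urls  # drop image URLs too
    return h.handle(html)
```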
I keep getting 'Unexpected data format for URL 1' with all sites I try. I have Ollama with Llama3.1 8b installed locally if that matters.
Is it possible to add a search-box feature where you put in the search URLs for e.g. Amazon, eBay, Temu, to get title and price? A mini price-comparison feature, in short.
Very interesting, congratulations on the lesson!
Thank you for this very interesting scraper. But I just want a scraper that does not require paid API keys. Can someone PLEASE recommend a basic scraper for that?
How does one select multiple pages? It doesn't seem to work for me. Great job btw.
Chunking the tokens for Alibaba can solve the issue.
How would you utilize this to scrape from behind a login? I don't see any of the login info embedded in the URL structure, so I'm unsure of the best way to do this.
On your website, one of the files is named "sraper" instead of "scraper", which will eventually cause a "module not found" error. Newbies probably won't realize this even though it's very obvious. Just informing you.
Thanks for letting me know, I fixed it!
very good
You forgot to specify how to activate the env after they create it. Some may not know how to do it, and they'll install the requirements into the main Python env :P
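For anyone stuck here, the usual flow looks roughly like this, assuming a standard venv setup (paths may differ per project):

```sh
# Create the virtual env, activate it, then install the requirements.
python -m venv venv
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
```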
Thanks, that solves the error in my setup, but I have another error:
ModuleNotFoundError: No module named 'scraper'
Yeah I should probably add that to the documentation
@@gamalfarag pip install scraper
Is it possible to build a table with different URLs and iterate over it automatically?
Could we use the app as an API? I want to have my app use your app, essentially.
1. Getting an unexpected URL error.
2. If the Chrome driver gets old, do we have to change it or not?
3. How to deploy it?
4. Proxy?
Does anyone know of a tool that can scrape name and address blocks from a largely fixed area on each page of a multi-page PDF?
Are you doing local scraping, not Puppeteer?
Do you know how many tokens a SheIn page uses? Great work.
I haven't tried Shein before, but they have a fairly simple website. The issue is that every page has 70+ products, meaning it will produce a lot of tokens.
Do we really need the Selenium driver and to actually open a browser? Can it be done without that? Headless?
I tried it with headless and headless=new, but it's hit and miss with the infinite-scroll cases, and most pagination details are at the bottom of the page.
If you want to try it with headless, go to assets.py; the headless option is already there, just place it inside the settings list.
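Outside this project, the plain-Selenium equivalent is a one-line option; a minimal sketch (the settings list in assets.py may name it differently):

```python
# Sketch: run Chrome headless with Selenium 4's "new" headless mode.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
```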
Most of my scrapes fail because of a token limitation with GPT :/
I can't get this to work on Spotify streams; I want to track all my streams across all my songs. I also made an HTML link for it to scrape multiple links at one time, so it's nice that you fixed that now! But it seems like Spotify is blocking it anyway. Any tips on how I could scrape this kind of data? Thanks!
If Spotify is one of those websites that force a captcha upon opening the website, that would block the scraping.
Someone proposed adding an attended mode for the user to solve a captcha and then allow the app to continue its scraping. I think I will be adding this feature next.
This AI Scraper Update Changes EVERYTHING!!
Please, can it scrape freelance-services marketplaces?
How can I scrape emails from websites? I need to scan many of them, not just one at a time. Could you help me out? :)
This is great and all, but how about you create a service, even if it's paid, to help us not get banned for scraping? Then we'd have something.
Omg update
Can I use it to scrape LinkedIn profile data? And is that legal to use commercially (to integrate the data into a web application through APIs)?
"Great job, sir!
I have a question: Is it possible to share the webpage opened by Selenium with the user, allowing them to manually interact with it-such as solving captchas or authenticating-to bypass blockades? Once they clear the obstacles, Selenium can resume scraping."
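Something like this minimal sketch, assuming plain Selenium with a visible browser; the URL is a placeholder:

```python
# Sketch: pause until the user clears the captcha/login, then resume.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/protected-page")

input("Solve the captcha / log in in the browser window, then press Enter...")

html = driver.page_source  # resume scraping in the now-unblocked session
```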
that's actually a great suggestion
@@redamarzouk Can you also do pagination the same way? I.e., click on the links so it can find the pagination elements.
Yes please, Reda, this would be an amazing feature. This way we can pretty much solve every captcha without paying for proxies or coding a captcha solver, etc. We could just let it alert us by sending an SMS to our phone that says "Need to solve captcha, come back to your PC", or maybe just play an audio file saying "Solve the captcha".
Can it scrape Google Maps?
Will it bypass bot protection like captchas?
It doesn't explicitly bypass a captcha if one arises; the trick is to use the user agent to stop the website from thinking we're bots in the first place.
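A minimal sketch of that user-agent trick with Selenium; the UA string is just an example of a realistic desktop Chrome signature:

```python
# Sketch: replace the default automation user agent with a realistic one.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
```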
What about Facebook?
I'm looking for a way to go from a list page, find all the items, go into the detail page of each item, and extract data from there. Can this do that?
Yes, this is the most intuitive way, but even specialized text-to-action apps out there can't do it in a universal way. It's really harder than it sounds.
That's why getting the pages and then scraping multiple URLs from those pages at the same time is the most compatible way of doing pagination today.
Can it also scrape photos and videos and download them?
It can scrape links to pictures and videos, but not the files themselves.
Of course, the links have to be inside the website's markdown.
Can it scrape the OpenAI docs? I have yet to be able to scrape their pages.
Do you mean the scraping part itself, or that the LLM blocks the content? You might want to try ScrapingBee.
Incredible
Looks like your website is down...
I have just tried to access it and it's up. I checked on isituporjustme and it says it's working fine:
It's just you. automation-campus.com is up.
Last updated: Nov 6, 2024, 10:14 PM (1 second ago)