@@BoominGame everything I've done in this video was for free, they give enough free credits at first, so you can just use all of it and then go to "jena ai" that is still free for now.
Webscraping as it is right now is here to stay and AI will not replace it (it can just enhance it in certain scenarios). First of all the term "scraping" is tossed everywhere and being used vaguely. When you "scrape" all you do is move information from one place to another. For example getting a website's HTML into your computer's memory. Then comes "parsing", which is extracting different entities from that information. For example extracting product price and title, from the HTML we "scraped". These are separate actions, they are not interchangeable, one is not more important than the other, and one can't work without the other. Both actions come with their own challenges. What these kind of videos promise to fix is the "parsing" part of it. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored(whether its HTTP request, browser navigation, RSS feed request, FTP download or a stream of data). It's just semi-automated in the background. Now that we have the fundamentals, let me clearly state this: For the vast majority(99%) of the cases "web scraping with AI" is a waste of time, money, resources and our environment. Time: its deceiving, as AI promises to extract information with a "simple prompt", you'll need to iterate over that prompt quite a few times in order to make a somewhat reliable data parsing solution. In that time you could have built a simple python script to extract the data required. More complicated scenarios will affect both the AI, and the traditional route. Money: You either use 3rd party services for LLM inference or you self-host an LLM. Both solutions in the long term will be in orders of magnitude more expensive than a traditional python script. Resources: A lot of people don't realize this but running an LLM for cases in which an LLM is not needed is extremely wasteful on resources. Ive ran scrapers on old computers, raspberry pi's and serverless functions, this is just a spec of dust of hardware requirements compared to running an LLM on an industrial grade computer with powerful GPU(s) Environment: As per the resources needed, this affects our environment greatly, as new and more powerful hardware needs to be invented, manufactured and ran. For the people that don't know, AI inference machines (whether self-hosted or 3rd party) are powerhouses, thus a lot of watt/hours wasted, fossil fuels burnt etc. Reliability: "Parsing" information with AI is quite unreliable, manly because of the nature of how LLMs work, but also because a lot more points of failure are introduced(information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high loads and inference speed sucks or it fails all together, etc.) Finally: most of AI extraction is just marketing BS letting you believe that you'll achieve something that requires a human brain and workforce with just "a simple prompt". I've been doing web automation and data extraction for more than a decade for a living. Ive also started incorporating AI in some rare cases, where traditional methods just don't cut it. All that being said, for the last 1% of the cases that do make sense to use AI for data parsing, here's what I typically do (after the information is already scraped): 1. First I remove vast majority of the HTML. If you need an article from a website, its not going to be in the , , , tags(you get the idea), so using a python library (I love lxml) I remove all these tags, along with their content. Since we are just looking for an article I will also remove ALL of the HTML attributes, like classes(big one), ids, and so on. After that I will remove all the parent/sibling cases where it looks like a useless staircase of tags. I've tried converting to markdown and parsing, Ive tried parsing with a screenshot, but this method is vastly superior due to important HTML elements still being present, and the general HTML knowledge of LLMs. This step will make each request at least 10 times cheaper, and will allow us to use models with lower context sizes. 2. I will then manually copy the article content that I need and will put it along with the above resulting string into a json object + prompts to extract an article form given HTML, I will do this at least 15 times. This is the step where training data is created. 3. Then I will fine tune a GPT3.5Turbo model with that json data. After 10ish minutes of fine-tuning and around $5-10, I have an "article extraction fine-tuned model", that will always outperform any agentic solution in all areas(price, speed, accuracy, reliability). Then I just feed the model a new(un-seen) piece of HTML that has passed step1(above) and it will reliably spew out an article for a fraction of a cent in a single step (no agents needed). I have a few of those running in production for clients(for different datapoints), and they do very good, but its important that a human goes over the results every now and again. Also if there is an edge case and the fine-tune did not perform well, you just iterate and feed it more training data, and it just works.
Nonsense. Scraping has for 10 years included both fetching data and then structuring it in some format, XML or JSON. Then we can do whatever we want with that structured that. Introducing "parsing" as some distinct construct is inane. More importantly, the way scraping can work today is leagues better than what the likes of APIFY used to do until 2 year ago, and yes this uses LLMs. Expand your reading.
Exactly I didn't really understand the point of firecrawl in this solution!? Does Firecrawl do anything better then free python library. Any suggestion on Python libraries btw?
firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k. Saying that you can just implement these tools easily is frankly disrespectful to the developers who have created these libraries and made them open source for the rest of us. in the example I covered, I didn't show the capabilities of filtering the markdown to only keep the main content in a page nor did I show how to scrape using a search query. I've done scraping professionally for 7+ years now, and the amount of problems you could encounter is immense, from websites blocking you to websites with table looking elements that are in fact just a chaos of divs to Iframes... About Vectorizing your markdown, I once did that on my machine in a "chat with pdf" project, and just with 1024 dimensions and 20 pages of pdf I have to wait long minutes to generate the vectorstore that has to be searched for every request also locally (not everyone has the hardware for it).
@@redamarzouk FireCrawl doesn't offer much value when there are free Python resources and paid tools that let you scrape websites without needing your own API key. You still have to input your OpenAI API key with FireCrawl, making it less appealing. Why pay for something when there are free or cheaper options that are easier to use? Thanks for sharing, but I'll stick with the alternatives.
Web scraping (getting data) and parsing (making sense of it) are two crucial steps for data extraction, often misunderstood as interchangeable. While AI promises a magic solution for parsing, it's expensive, unreliable, and environmentally unfriendly. It's better suited for rare cases where traditional methods struggle. Here, data pre-processing, training data creation, and fine-tuning a specific AI model is the key for success. Overall, scraping and parsing remain essential, with AI as a valuable tool for specific situations.
You said that sometimes the model returning the response with different keynames, but if you pass the pydantic model to the OpenAI model as a function, you can expect invariable object with the keys that you need
Correct I've actually used them while I was playing around with my code (alongside function calling), the issue I found is that I have to explain both pydantic schema and how I made it dynamic, because if I want a universal web scrapper that can use different fields everytime we're scrapping a different website. That ultimately would've made the video a 30mins+ video, so I opted for the easier less performant way.
You're on point with this, using function calling is always better for JSON Consistency. I actually used it when I was creating my original code. The issue is that I have a parameter "Fields" that can change depending on the type of website being scraped. So to account for that in my code I either need to make the schema inside the function calling generic (not so great) or I make it dynamic (really didn't want to go there, it will make the tutorial much more complicated). I also tried using pydantic expressions since Firecrawl has their own LLM Extractor that can use them, but it didn't perform as well. But yeah you're right function calling is always better. Lah yhfdk a sat.
Nice project, I worked on your code base for a while and used Groq mixtral instead, with multiple keys to pass limits, and Firecrawl is not automatic when it comes to pagination, you still need to add HTML code, which defeats the purpose, slow but ok for a free purpose. But I got around that I think. The next step is to use it in the front end. Zillow's API is only available for property developers, so scraping with manual inputs is the only way. However, working with the live API functionality would be the best way forward. Nice job!
Thank you, most websites of real estate or any other industry hold on to their data very close and make you pay if you want to use their API. You'll almost always have to scrape data manually, and yes when it comes to pagination you'll have to make another script to crawl all the pages you'll be scraping.
I can't find a OPENAI model that works for me. I've tried gpt-3, gpt-3.5, gpt-3.5-turbo-1186, I always get a 404 does not exist or you don't have access to it. GPT says use davinci or curie. Any suggestions?
Hello, if you're using open ai api, you need to add the parameter (max_tokens=xxxxxxxx) inside your client open ai call and define a number that don't exceed the max number of token of the model you're using (128 000 for gpt-4o for example)
I have a question, the website that you're using seem to be listings from a city like San Francisco but the results that you're getting only have around 10 entries scraped. Why aren't there more?
The reason is that website like the one I scrapped don't load the data unless we scroll down physically. Meaning we have to open the website and scroll using libraries like playwright that opens a browser instance using chromium and then scroll all the way down and then you'll have all the html.
I was thinking this was going to be similar to PandasAi in which case you can do natural language prompts and the LLM figures out how to convert that into code for you and that is further enhanced with UIs now so you literally get a prompt and off you go. Once the environment is ready there's hardly any coding required. This seems quite a bit more involved than that (?)
I gotta be honest, I didn't even try. I tried to self host an agentic software tool before and my pc was going crazy, it couldn't take the load from Llama3-8B running on LM Studio plus docker plus filming at the same time, I simply don't have the hardware for it. if you want to self host here is the link: github.com/mendableai/firecrawl/blob/main/SELF_HOST.md it is with docker.
Thank you for this wonderful tutorial, but as I am not a software programmer, I like to use web scraping tools for business purposes Is there a way to get this as a simple installer package? or a image for docker?
@@redamarzouk also a VPN would not defend from captcha. They are there for a good reason but would be interesting to find a way around it to build tools for customers
Bed is just short for "bedroom" that someone would put a bed in. Bath is short for bathroom you'd find a toilet in. A half bath is usually very small area with just a toilet and sink and no shower or bathtub. Have you had any luck with firecrawl if you need to login to the site first?
I'm curious - what do you do after structuring the data - do you store it in a vector DB? If so, do you store the Json as it is or something else? And can it actually be completely universal - by that i mean can it structure data by us not providing the fields on which it should strucutre the data. Can we make it in some way where upload a website and it understands the data and structures it according to it?
Hey bro.. this is awesome. being a no code platform user I am unable to grasp your coding though I understanding it. Can you share the scripts you are using please.
I made 2 other videos about this, and the last video I showed how to set this up with minimal coding knowledge (you only need vscode and python configured on your machine) you can follow the video from here ruclips.net/video/xrt2GViRzQo/видео.htmlsi=smByssvvNhudzgRS
You theoretically can use it when it comes to Data Extraction, but you will need a large context window version of Llama3 or Phi3. I've seen a model where they have extended the context length to 1M tokens for Llama3-7B. you need to keep in my that your hardware need to match the requirements.
Thank you. I have a case use, can I use the tool to make querys to a database, save the results as your tutorial shows and also print to PDF the result of every query?
If you already have a database you want to make queries against, you don't need any scraping (unless you need to scrape website to create that database). But yeah it sounds like you can do that without the need for any AI in the loop.
You could just ask GPT-4 one time to generate the extraction code or the tags to look for, per website, so that it doesn't need to always use AI for scraping, and you might get better results, and then in that code if it fails you fall back to regenerating it and cache it again.
Creating a dedicated script for a website is the best way to get the exact data you want, you're right in that sense, and you can always fix it with gpt-4 as well. But let say you're actively scraping 10 competitor websites where you only want to get their pricing updates and their new offerings, will it make sense to you to maintain 10 different scripts rather than have 1 script that can do the job and will need very minimal intervention? It depends on the use case, but there are times where customized scraping code isn't the best approach.
@@redamarzouk I didn't mean like that. I meant you would basically do the same thing as your technique, but you could just use the AI one for each domain, asking it what the CSS selectors are for the elements you're interested in. That way when you're looking for updates you don't even need to do any calls to the LLM unless it fails because the structure is different. You don't even have to maintain multiple scripts, just make a Dictionary with the domain name and the CSS paths and there you go. Of course a lot of different pages may have different structure but you could probably just feed in the HTML from a few different pages of the site and use a prompt telling GPT-4 the URLs and the markup and tell it to figure out the URL pattern that will match the specific stuff to look for. You could even still do this with GPT-3.5-Turbo. Basically the only idea I'm throwing out there is to ask the AI to tell you the tag names and have your code simply extract the info using BeautifulSoup or something else that can grab info out of tags based on CSS query selectors. That way, you can cache that info and then scrape faster after you get that info the initial time. Would only be a little more work but might be a lot better for some use cases. Just thought it was a cool idea
Hi, how can i do to make this Ai read my document pdf with contains a list of 1700 websites links ( yeah i know, it's a lot lol )? I want him to access these 1700 websites and help me to learn their contents by connecting all their infos in a sort of organised bullet-points liste ( a sort of concept Map ), and also create flashcards with the info in these 1700 web sites :) Please help me because i don't know ho to use perplexity to do that...
Yeah you can create the same process with no code tools like Make or Zapier or even with low code tools like UiPath and Power Automate, but I just feel more control over formatting my output and integrating my script with my other local processes when I use code. I still use no code tools for other things.
Hmmmm, I mean... it's pretty good. BUT.... and it's a pretty major BUT. For the sake of cost, I would much rather have a workflow that goes something like: URL -> GET MARKDOWN -> Use LLM to build Beautiful Soup Script for that URL -> Use that Script for future hits on that site. Why? Because it's very unlikely that you'll write a script to only hit a site once. Perhaps a follow up to your work would be something that does both.... URL -> DOES URL HAVE A BS SCRIPT? -> IF YES, run that script and return the data -> IF NO, pass markdown through LLM and create BS script -> Run BS Script
I've tried this. It's not as strait forward as the brut force method. If the LLM cost decreases enough it might be more costly in the log run, given the brut force method is a set and forget all tricks tool.
Scraping in general is a huge industry, a lot of companies need to scrape data about their own products to analyze customer reviews for example and detect trends on which products work better (I know!! why not use an API?? you will be surprised at how little real life business add APIs to their web apps) Companies also scrape competitors websites (No API possible in this case) to stay up to date with their pricing and align their products. Another use case is scraping for advertisers, because they have to analyze sentiment about a person or advertising agency before they can approach them with a brand deal offer. Also people scrape (usually linkedin ) for potential leads that are interested in a certain service (I receive a ton of emails a day because of that) I'm gonna stop here, but yeah web scraping is quite important.
it's open source, this is how you can run it locally and contribute to the project. github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md but honestly as IT folks we gotta stop going at each other for wanting to charge for an app we've created, granted I'm not recommending this to my clients yet and 50$/month is high, but if that what they want to charge it's really up to them.
Yes, it seems to be a good tool, but it's too expensive. In the free version, it offers scraping of 500 pages, but with just 8 requests, I used up 80% of the credit (over 490 pages). It probably arbitrarily chooses how deep the tree reading should go, but there’s no way to stop at the first or first x pages, so even in the paid plans, it seems to be yet another money-draining service. No thanks.
Watch my latest video, I talked about this issue of price and managed to get it work for way cheaper! ruclips.net/video/45hMI2QH1c8/видео.htmlsi=sNZLuizjI_4-1tik
Yea, looking more deeply, It's fake opensource. Having a bunch of code behind an api key and publishing the docs to use that api doesnt make stuff opensource
check the new video, it uses llama3.1 on your own machine, as long as your machine can handle high number of tokens you can do it fully locally: ruclips.net/video/xrt2GViRzQo/видео.html
I actually filmed an hour and I wanted to go through the financials of this method and if it makes sense, but I edited that part out so the video is less than 30mins. but I agree 50$ is high, and the markdowns should be of quality for the tokens to be less therefore cheap LLM cost. btw I"m not sponsored by any means by firecrawl, I was gonna talk about jina ai or scrapegraph-ai which do the same thing before deciding on firecrawl.
Looking forward to the day when all the effort wasted on webscraping warfare and costs for their vodoo is irrelevant with a sufficiently powerful opensource model run locally. It's a BS industry that should be made obsolete.
Totally agree with you, and we do have modified Llama3-8b models that can handle up to 1M tokens. With a state of the art GPU you can run it on your machine. The problem is the consistency of small models is not there yet, I see better results with Phi3 but it simply doesn't have the context window to handle the markdown I've shown in this video. hopefully we'll get there.
Hey everyone! 😊 I'm curious about your thoughts-was the explanation and flow of the video too fast, or was it clear and to the point?
It was perfect!
Its clear and easy to follow. Thanks for sharing! Just subscribed & tweeted as well :)
Is there a free tier/ community edition by installing the firecrawl repo locally and generating an local API?
@@BoominGame everything I've done in this video was for free, they give enough free credits at first, so you can just use all of it and then go to "jena ai" that is still free for now.
@@redamarzouk Pace is ok.But the text is too small.
Webscraping as it is right now is here to stay and AI will not replace it (it can just enhance it in certain scenarios).
First of all the term "scraping" is tossed everywhere and being used vaguely. When you "scrape" all you do is move information from one place to another. For example getting a website's HTML into your computer's memory.
Then comes "parsing", which is extracting different entities from that information. For example extracting product price and title, from the HTML we "scraped".
These are separate actions, they are not interchangeable, one is not more important than the other, and one can't work without the other. Both actions come with their own challenges.
What these kind of videos promise to fix is the "parsing" part of it. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored(whether its HTTP request, browser navigation, RSS feed request, FTP download or a stream of data). It's just semi-automated in the background.
Now that we have the fundamentals, let me clearly state this: For the vast majority(99%) of the cases "web scraping with AI" is a waste of time, money, resources and our environment.
Time: its deceiving, as AI promises to extract information with a "simple prompt", you'll need to iterate over that prompt quite a few times in order to make a somewhat reliable data parsing solution. In that time you could have built a simple python script to extract the data required. More complicated scenarios will affect both the AI, and the traditional route.
Money: You either use 3rd party services for LLM inference or you self-host an LLM. Both solutions in the long term will be in orders of magnitude more expensive than a traditional python script.
Resources: A lot of people don't realize this but running an LLM for cases in which an LLM is not needed is extremely wasteful on resources. Ive ran scrapers on old computers, raspberry pi's and serverless functions, this is just a spec of dust of hardware requirements compared to running an LLM on an industrial grade computer with powerful GPU(s)
Environment: As per the resources needed, this affects our environment greatly, as new and more powerful hardware needs to be invented, manufactured and ran. For the people that don't know, AI inference machines (whether self-hosted or 3rd party) are powerhouses, thus a lot of watt/hours wasted, fossil fuels burnt etc.
Reliability: "Parsing" information with AI is quite unreliable, manly because of the nature of how LLMs work, but also because a lot more points of failure are introduced(information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high loads and inference speed sucks or it fails all together, etc.)
Finally: most of AI extraction is just marketing BS letting you believe that you'll achieve something that requires a human brain and workforce with just "a simple prompt".
I've been doing web automation and data extraction for more than a decade for a living. Ive also started incorporating AI in some rare cases, where traditional methods just don't cut it.
All that being said, for the last 1% of the cases that do make sense to use AI for data parsing, here's what I typically do (after the information is already scraped):
1. First I remove vast majority of the HTML. If you need an article from a website, its not going to be in the , , , tags(you get the idea), so using a python library (I love lxml) I remove all these tags, along with their content. Since we are just looking for an article I will also remove ALL of the HTML attributes, like classes(big one), ids, and so on. After that I will remove all the parent/sibling cases where it looks like a useless staircase of tags. I've tried converting to markdown and parsing, Ive tried parsing with a screenshot, but this method is vastly superior due to important HTML elements still being present, and the general HTML knowledge of LLMs. This step will make each request at least 10 times cheaper, and will allow us to use models with lower context sizes.
2. I will then manually copy the article content that I need and will put it along with the above resulting string into a json object + prompts to extract an article form given HTML, I will do this at least 15 times. This is the step where training data is created.
3. Then I will fine tune a GPT3.5Turbo model with that json data.
After 10ish minutes of fine-tuning and around $5-10, I have an "article extraction fine-tuned model", that will always outperform any agentic solution in all areas(price, speed, accuracy, reliability).
Then I just feed the model a new(un-seen) piece of HTML that has passed step1(above) and it will reliably spew out an article for a fraction of a cent in a single step (no agents needed).
I have a few of those running in production for clients(for different datapoints), and they do very good, but its important that a human goes over the results every now and again.
Also if there is an edge case and the fine-tune did not perform well, you just iterate and feed it more training data, and it just works.
Thanks for taking the time to explain this! Very useful to clarify!
Thanks man. I am specializing in web scraping in my career. Do you have some blog or similar where you share content of web scraping as a career?
Nonsense. Scraping has for 10 years included both fetching data and then structuring it in some format, XML or JSON. Then we can do whatever we want with that structured that. Introducing "parsing" as some distinct construct is inane. More importantly, the way scraping can work today is leagues better than what the likes of APIFY used to do until 2 year ago, and yes this uses LLMs. Expand your reading.
@@ilianos his "explanation" is stupid.
@@rafael_tg watch more sensible videos and comments.
It's easy to do it with free python library. Reading HTML convert to markdown, even convert it for free to vector with transformer ect
Exactly I didn't really understand the point of firecrawl in this solution!? Does Firecrawl do anything better then free python library. Any suggestion on Python libraries btw?
Have you used it on complex websites with s or many ads, or logins or progressive JS based loads, or infinite scrolls? Clearly not.
firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k.
Saying that you can just implement these tools easily is frankly disrespectful to the developers who have created these libraries and made them open source for the rest of us.
in the example I covered, I didn't show the capabilities of filtering the markdown to only keep the main content in a page nor did I show how to scrape using a search query.
I've done scraping professionally for 7+ years now, and the amount of problems you could encounter is immense, from websites blocking you to websites with table looking elements that are in fact just a chaos of divs to Iframes...
About Vectorizing your markdown, I once did that on my machine in a "chat with pdf" project, and just with 1024 dimensions and 20 pages of pdf I have to wait long minutes to generate the vectorstore that has to be searched for every request also locally (not everyone has the hardware for it).
@@redamarzouk FireCrawl doesn't offer much value when there are free Python resources and paid tools that let you scrape websites without needing your own API key. You still have to input your OpenAI API key with FireCrawl, making it less appealing.
Why pay for something when there are free or cheaper options that are easier to use?
Thanks for sharing, but I'll stick with the alternatives.
😂 it's only easy if you haven't done anything of value, scraping in 2024 is hard everyone is blocking you
In the US, a “bedroom” is a room with a closet, a window, and a door that can be closed.
Web scraping (getting data) and parsing (making sense of it) are two crucial steps for data extraction, often misunderstood as interchangeable. While AI promises a magic solution for parsing, it's expensive, unreliable, and environmentally unfriendly. It's better suited for rare cases where traditional methods struggle. Here, data pre-processing, training data creation, and fine-tuning a specific AI model is the key for success. Overall, scraping and parsing remain essential, with AI as a valuable tool for specific situations.
You said that sometimes the model returning the response with different keynames, but if you pass the pydantic model to the OpenAI model as a function, you can expect invariable object with the keys that you need
Also, pydantic models can be scripted to have nested structure, in contrast to json schemas
Correct I've actually used them while I was playing around with my code (alongside function calling), the issue I found is that I have to explain both pydantic schema and how I made it dynamic, because if I want a universal web scrapper that can use different fields everytime we're scrapping a different website. That ultimately would've made the video a 30mins+ video, so I opted for the easier less performant way.
In my experience, function calling is way better at extracting consistent JSON than just prompting. Anyway, تبارك الله على ولد بلادي.
Good idea
You're on point with this, using function calling is always better for JSON Consistency. I actually used it when I was creating my original code.
The issue is that I have a parameter "Fields" that can change depending on the type of website being scraped. So to account for that in my code I either need to make the schema inside the function calling generic (not so great) or I make it dynamic (really didn't want to go there, it will make the tutorial much more complicated).
I also tried using pydantic expressions since Firecrawl has their own LLM Extractor that can use them, but it didn't perform as well.
But yeah you're right function calling is always better. Lah yhfdk a sat.
@@redamarzouk You have a knack for this bro. Keep up the good work. وفقك الله
where is the source code ?
can anyone please help me out
Hey, Github is out?
Nice project, I worked on your code base for a while and used Groq mixtral instead, with multiple keys to pass limits, and Firecrawl is not automatic when it comes to pagination, you still need to add HTML code, which defeats the purpose, slow but ok for a free purpose. But I got around that I think. The next step is to use it in the front end. Zillow's API is only available for property developers, so scraping with manual inputs is the only way. However, working with the live API functionality would be the best way forward. Nice job!
Thank you, most websites of real estate or any other industry hold on to their data very close and make you pay if you want to use their API.
You'll almost always have to scrape data manually, and yes when it comes to pagination you'll have to make another script to crawl all the pages you'll be scraping.
git repo is not show its give 404 error???
This project has evolved and now lives in this github repo
github.com/reda-marzouk608/scrape-master
Good work ! Nice presentation, nice code ! 😃 It will help me a lot. Merci Reda
Appreciate the nice words, you're welcome!
I can't find a OPENAI model that works for me. I've tried gpt-3, gpt-3.5, gpt-3.5-turbo-1186, I always get a 404 does not exist or you don't have access to it. GPT says use davinci or curie.
Any suggestions?
Great many thanks for sharing, quick questions how add line of code to go to page 2 and do the same thing then page 3 and so on Please.
You're welcome, you'll have to crawl the pages first and then loop through them using the script I've shown.
wa ta fiiine a bba reda , scrape lya data a wld aami, w7rrak lya l agents , 💪
What other options are beside Firecrawl? Thanks!
Just found it in the comments: "Firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k."
Exactly Jina AI, scrapegraph AI
are also options
Very helpful. How do you work around the output limit of 4096 tokens?
Hello,
if you're using open ai api, you need to add the parameter (max_tokens=xxxxxxxx) inside your client open ai call and define a number that don't exceed the max number of token of the model you're using (128 000 for gpt-4o for example)
I have a question, the website that you're using seem to be listings from a city like San Francisco but the results that you're getting only have around 10 entries scraped. Why aren't there more?
The reason is that website like the one I scrapped don't load the data unless we scroll down physically. Meaning we have to open the website and scroll using libraries like playwright that opens a browser instance using chromium and then scroll all the way down and then you'll have all the html.
I was thinking this was going to be similar to PandasAi in which case you can do natural language prompts and the LLM figures out how to convert that into code for you and that is further enhanced with UIs now so you literally get a prompt and off you go. Once the environment is ready there's hardly any coding required. This seems quite a bit more involved than that (?)
nice! any idea on how to self host firecrawl? like with Docker?
also, can it be coupled with n8n? how?
I gotta be honest, I didn't even try.
I tried to self host an agentic software tool before and my pc was going crazy, it couldn't take the load from Llama3-8B running on LM Studio plus docker plus filming at the same time, I simply don't have the hardware for it.
if you want to self host here is the link: github.com/mendableai/firecrawl/blob/main/SELF_HOST.md
it is with docker.
@@redamarzouk thanks. Is there any sense to use it with n8n? or maybe n8n can do the same without firecrawl? (noob here)
@@redamarzouk or maybe with things like Flowise?
Neat overview. Curious about API costs associated with these demos. Try zooming into your code for viewers.
watch on big monitors as most coders do
for only the demo you've seen, I spent 0.5$, for creating the code and launching it 60+ times, I spent 3$.
I will zoom in next time.
Thank you for this wonderful tutorial, but as I am not a software programmer, I like to use web scraping tools for business purposes Is there a way to get this as a simple installer package? or a image for docker?
I made a new video on how to set this up on your machine. Hope that can help you: ruclips.net/video/xrt2GViRzQo/видео.html
Amazing video and great explanations. Many thanks.
Appreciate it, Thank you for the kind word!
Wow! The AI was even clever enough to convert square meters into square feet, no need to write a conversion function!
I received this error mensage: "The page returned an error while being scraped."
what if the page has infinite scroll where new data appears as you scroll?
What about captcha
Websites don't like scrappers in general, so extensive scrapping will need a vpn (that can handle the volume of your scrapping).
@@redamarzouk also a VPN would not defend from captcha. They are there for a good reason but would be interesting to find a way around it to build tools for customers
Bed is just short for "bedroom" that someone would put a bed in. Bath is short for bathroom you'd find a toilet in. A half bath is usually very small area with just a toilet and sink and no shower or bathtub. Have you had any luck with firecrawl if you need to login to the site first?
I'm curious - what do you do after structuring the data - do you store it in a vector DB? If so, do you store the Json as it is or something else?
And can it actually be completely universal - by that i mean can it structure data by us not providing the fields on which it should strucutre the data.
Can we make it in some way where upload a website and it understands the data and structures it according to it?
Very helpful. Great job and thanks for sharing
Really Appreciate the kind words, Thank you.
can we use this for email and phone number extraction
Absolutely you just need to change the websites and the fields and you’re good to go
tbarkallah 3lik a bro mashallah
Lah yhafdk
Hey bro.. this is awesome. being a no code platform user I am unable to grasp your coding though I understanding it. Can you share the scripts you are using please.
I made 2 other videos about this, and the last video I showed how to set this up with minimal coding knowledge (you only need vscode and python configured on your machine) you can follow the video from here ruclips.net/video/xrt2GViRzQo/видео.htmlsi=smByssvvNhudzgRS
can Use LLMA 3/ Phi3 on local pc ?
You theoretically can use it when it comes to Data Extraction, but you will need a large context window version of Llama3 or Phi3.
I've seen a model where they have extended the context length to 1M tokens for Llama3-7B.
you need to keep in my that your hardware need to match the requirements.
Damn that was good man !
Glad you liked it, My pleasure 🙏
sir, can i use this script to make a web applapplication?
if yes then how, i am just learing nodejs....
It's not literally beds, rather short hand for bedrooms. 3 bedrooms.
Im just getting "An error occurred: name 'phone_fields' is not defined"
Thank you. I have a case use, can I use the tool to make querys to a database, save the results as your tutorial shows and also print to PDF the result of every query?
If you already have a database you want to make queries against, you don't need any scraping (unless you need to scrape website to create that database).
But yeah it sounds like you can do that without the need for any AI in the loop.
Good technology to keep in good book!
Bghit ghir nfhm chno dawr dial firecrawl fhadchi kamel ? Banli la mafih ta haja spéciale ga3 !!!
Helpful, Thank you
Glad it helped!
Sir how can I scrape raw data?
Tai Lopez?? Just driving my ferrari around the hollywood hills here
I have automate a scraper for zillow through MAKE automation,
Thanks for the helpful content.
You're most welcome!
You could just ask GPT-4 one time to generate the extraction code or the tags to look for, per website, so that it doesn't need to always use AI for scraping, and you might get better results, and then in that code if it fails you fall back to regenerating it and cache it again.
Creating a dedicated script for a website is the best way to get the exact data you want, you're right in that sense, and you can always fix it with gpt-4 as well.
But let say you're actively scraping 10 competitor websites where you only want to get their pricing updates and their new offerings, will it make sense to you to maintain 10 different scripts rather than have 1 script that can do the job and will need very minimal intervention?
It depends on the use case, but there are times where customized scraping code isn't the best approach.
@@redamarzouk I didn't mean like that. I meant you would basically do the same thing as your technique, but you could just use the AI one for each domain, asking it what the CSS selectors are for the elements you're interested in. That way when you're looking for updates you don't even need to do any calls to the LLM unless it fails because the structure is different. You don't even have to maintain multiple scripts, just make a Dictionary with the domain name and the CSS paths and there you go. Of course a lot of different pages may have different structure but you could probably just feed in the HTML from a few different pages of the site and use a prompt telling GPT-4 the URLs and the markup and tell it to figure out the URL pattern that will match the specific stuff to look for.
You could even still do this with GPT-3.5-Turbo. Basically the only idea I'm throwing out there is to ask the AI to tell you the tag names and have your code simply extract the info using BeautifulSoup or something else that can grab info out of tags based on CSS query selectors. That way, you can cache that info and then scrape faster after you get that info the initial time.
Would only be a little more work but might be a lot better for some use cases. Just thought it was a cool idea
Hi, how can i do to make this Ai read my document pdf with contains a list of 1700 websites links ( yeah i know, it's a lot lol )? I want him to access these 1700 websites and help me to learn their contents by connecting all their infos in a sort of organised bullet-points liste ( a sort of concept Map ), and also create flashcards with the info in these 1700 web sites :)
Please help me because i don't know ho to use perplexity to do that...
Does it parse JavaScript, infinity scroll, button click navigations?
Yes, you can ask LLMs to do all that like a human would.
Why do it this way, if you can do this without coding? Make . C for example
Yeah you can create the same process with no code tools like Make or Zapier or even with low code tools like UiPath and Power Automate, but I just feel more control over formatting my output and integrating my script with my other local processes when I use code.
I still use no code tools for other things.
Make and Zapier would get very pricey if this were automated at scale.
I get Error code: 429 when running the code. -'You exceeded your current quota,...
In case you haven't used your OpenAI API key in a while: they changed the way it works, you need to pay in advance to refill your quota
Hmmmm, I mean... it's pretty good. BUT.... and it's a pretty major BUT. For the sake of cost, I would much rather have a workflow that goes something like:
URL -> GET MARKDOWN -> Use LLM to build Beautiful Soup Script for that URL -> Use that Script for future hits on that site.
Why? Because it's very unlikely that you'll write a script to only hit a site once.
Perhaps a follow up to your work would be something that does both....
URL -> DOES URL HAVE A BS SCRIPT? -> IF YES, run that script and return the data -> IF NO, pass markdown through LLM and create BS script -> Run BS Script
Agree...TechSales man are keep poping up in RUclips ..
I've tried this. It's not as strait forward as the brut force method. If the LLM cost decreases enough it might be more costly in the log run, given the brut force method is a set and forget all tricks tool.
Nice video. Just that this cannot scrape where you need to click to reveal some info to scrape!
How is this an Agent though?
Awesome videoooo!
Appreciate it 🙏🙏
awesome bro
Glad you liked it
firecrawl is not open-sourced!!!
You too nailed it. We need to refuse these false open source codes that are in reality commercial endeavours. I use only FREE and OPEN codes.
Except it is.
Refer to its repo, it shows how to run it locally github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
@@redamarzouk I'll take a look. Thanks.
But you are not using the open source, you are using their API... perhaps for the next time that you could do it run locally
@@redamarzouk the open source repo is still not ready for self hosting.
what is this "scraping" good for? I mean what can you use that for? sounds interesting tho
Scraping in general is a huge industry, a lot of companies need to scrape data about their own products to analyze customer reviews for example and detect trends on which products work better (I know!! why not use an API?? you will be surprised at how little real life business add APIs to their web apps)
Companies also scrape competitors websites (No API possible in this case) to stay up to date with their pricing and align their products.
Another use case is scraping for advertisers, because they have to analyze sentiment about a person or advertising agency before they can approach them with a brand deal offer.
Also people scrape (usually linkedin ) for potential leads that are interested in a certain service (I receive a ton of emails a day because of that)
I'm gonna stop here, but yeah web scraping is quite important.
Nefarious reasons. Steal content, creating seo pages on competitor keywords, make bots for social media. Generally nothing of great value.
What about Angie list 😢
Nice idea. Now wake me up when there are no credits involved (completely free).
it's open source, this is how you can run it locally and contribute to the project.
github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
but honestly as IT folks we gotta stop going at each other for wanting to charge for an app we've created, granted I'm not recommending this to my clients yet and 50$/month is high, but if that what they want to charge it's really up to them.
You did a good job explaining something new, but, for 2024, Puppeteer and Cheerio will work better than AI
"Beds" mean number of bedrooms.
That makes more sense, Thank you.
Yes, it seems to be a good tool, but it's too expensive.
In the free version, it offers scraping of 500 pages, but with just 8 requests, I used up 80% of the credit (over 490 pages). It probably arbitrarily chooses how deep the tree reading should go, but there’s no way to stop at the first or first x pages, so even in the paid plans, it seems to be yet another money-draining service.
No thanks.
Watch my latest video, I talked about this issue of price and managed to get it work for way cheaper! ruclips.net/video/45hMI2QH1c8/видео.htmlsi=sNZLuizjI_4-1tik
Yea, looking more deeply, It's fake opensource. Having a bunch of code behind an api key and publishing the docs to use that api doesnt make stuff opensource
check the new video, it uses llama3.1 on your own machine, as long as your machine can handle high number of tokens you can do it fully locally: ruclips.net/video/xrt2GViRzQo/видео.html
@@redamarzouk not a criticism, the content is great. Don't take others faults onto yourself
@@guerra_dos_bichos not at all I love the feedback, it makes for great videos.
another api key to pay ? what's the point of this really ?
You nailed it. We need to refuse these false open source codes that are in reality commercial endeavours. I use only FREE and OPEN codes.
legend
50$ montly fee 🎉😂😅
I actually filmed an hour and I wanted to go through the financials of this method and if it makes sense, but I edited that part out so the video is less than 30mins.
but I agree 50$ is high, and the markdowns should be of quality for the tokens to be less therefore cheap LLM cost.
btw I"m not sponsored by any means by firecrawl, I was gonna talk about jina ai or scrapegraph-ai which do the same thing before deciding on firecrawl.
Looking forward to the day when all the effort wasted on webscraping warfare and costs for their vodoo is irrelevant with a sufficiently powerful opensource model run locally. It's a BS industry that should be made obsolete.
When agents become ubiquitous it will no longer make economic sense for websites to block robots with captchas and anti-scraping tech.
Totally agree with you, and we do have modified Llama3-8b models that can handle up to 1M tokens. With a state of the art GPU you can run it on your machine.
The problem is the consistency of small models is not there yet, I see better results with Phi3 but it simply doesn't have the context window to handle the markdown I've shown in this video.
hopefully we'll get there.
nice
u nice bro