I've been trying it, and the one difficulty I found was the installation, but I think this is a great approach to scraping. Thanks for sharing!
Love how excited you are about your project! Keep it up, man! Great project.
Thanks! Will do!
You deserve way more audience. Keep pushing man!
You wrote this project! U R The Man! :-) Thank you very much.
Great video.
I have a question regarding the exclusion of unwanted content during web page extraction. Specifically, how can headers, footers, navigational elements (including side navigation), and tables of contents be effectively removed? Considering that each website follows a different structure and pattern, it seems impractical to configure exclusion rules for every individual site.
This issue becomes even more critical as it can lead to increased storage requirements and, in some cases, false retrieval results for Retrieval-Augmented Generation (RAG) systems due to the presence of unnecessary content.
Could you share any insights or strategies to address this challenge effectively?
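For context, one generic, site-agnostic strategy is to strip semantic boilerplate tags and navigation-hinting elements from the HTML before indexing it for RAG. Below is a minimal sketch with BeautifulSoup; the tag list and the class/id hints are illustrative assumptions, not a built-in crawl4ai feature.

from bs4 import BeautifulSoup

BOILERPLATE_TAGS = ["header", "footer", "nav", "aside"]
BOILERPLATE_HINTS = ("sidebar", "toc", "table-of-contents", "breadcrumb")

def strip_boilerplate(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Detach semantic boilerplate elements outright.
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.extract()
    # Detach elements whose id/class hints at navigation or a table of contents.
    for tag in soup.find_all(True):
        hints = " ".join([tag.get("id", "")] + tag.get("class", [])).lower()
        if any(h in hints for h in BOILERPLATE_HINTS):
            tag.extract()
    return soup.get_text(separator="\n", strip=True)

This keeps the stored text smaller and less likely to pollute retrieval, at the cost of occasionally dropping legitimate content that happens to live in those containers.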
I can't get the local Llama to work :(
Is it possible to put up a prebuilt Docker image, including the models? I had problems downloading the models during the Docker build. Thanks!
I will work on that. I'm trying to have a version without the model dependency as well.
Hey man. I'm going to be honest: I'm new to data scraping and wanted to ask if crawl4ai can be used to scrape data from TikTok. They have implemented some harsh measures with request rate limits and login requirements. From what I saw, crawl4ai has a login feature, but I just wanted to ask if I'm going in the right direction. Otherwise, it looks great.
Colab link?
Very useful project, I must admit! Is it a recursive crawler? When I say recursive, I mean it (not restricted to a depth threshold). Also, how different is this from FireCrawl, in terms of functionality and other things? I can't wait to get started using this project and give it a shot! Thanks!
Looks exciting. Have you considered a Nix script?
WHAT HAPPENED TO THE FLUTE UNCLE CODE
Hahahaha!! Ok, ok, message received
Really cool man! Can I crawl all accessible subpages from a main page? So I crawl 2 levels in total?
You can send multiple links, so first crawl the main page, then get its links and send them again. However, soon I will release the ability to set the depth and get a cool result for that.
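Until that depth option lands, here is a minimal sketch of the two-step pattern described above. It assumes the AsyncWebCrawler interface (async context manager plus an arun(url=...) call) and extracts the second-level links manually from the returned HTML with BeautifulSoup:

import asyncio
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_two_levels(start_url: str) -> list:
    results = []
    async with AsyncWebCrawler() as crawler:
        # Level 1: crawl the main page.
        main = await crawler.arun(url=start_url)
        results.append(main)
        # Collect same-domain links from the raw HTML.
        soup = BeautifulSoup(main.html or "", "html.parser")
        domain = urlparse(start_url).netloc
        links = {
            urljoin(start_url, a["href"])
            for a in soup.find_all("a", href=True)
            if urlparse(urljoin(start_url, a["href"])).netloc == domain
        }
        # Level 2: crawl each discovered subpage.
        for link in links:
            results.append(await crawler.arun(url=link))
    return results

if __name__ == "__main__":
    pages = asyncio.run(crawl_two_levels("https://example.com"))
    print(f"Crawled {len(pages)} pages")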
I got a result object. How do I parse it?
Result is an object like this:
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: str = None
    markdown: str = None
    extracted_content: str = None
    metadata: dict = None
    error_message: str = None
So you can access these properties (cleaned_html, markdown, extracted_content), or dump the model into a Python dictionary using `result.model_dump()`.
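A tiny self-contained example of that, re-declaring the model above just for illustration (assuming pydantic v2, since model_dump() is used):

from pydantic import BaseModel

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: str = None
    markdown: str = None
    extracted_content: str = None
    metadata: dict = None
    error_message: str = None

# Hypothetical result, just to show how the fields are read.
result = CrawlResult(
    url="https://example.com",
    html="<html>...</html>",
    success=True,
    markdown="# Example page",
)

if result.success:
    print(result.markdown)          # markdown version of the page
    data = result.model_dump()      # plain Python dict of every field
    print(sorted(data.keys()))
else:
    print("Crawl failed:", result.error_message)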
When I am using AsyncWebCrawler, I get a runtime error: there is no current event loop in thread 'MainThread'.
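That RuntimeError typically shows up when asyncio code is driven while no event loop is running (for example, something calling asyncio.get_event_loop() at the top level of a plain script). The usual fix is to put the calls in a coroutine and start it with asyncio.run(); a minimal sketch, assuming the async-context-manager / arun(url=...) usage of AsyncWebCrawler:

import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    # All awaits happen inside this coroutine.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

if __name__ == "__main__":
    # asyncio.run() creates and manages the event loop, so there is
    # always a current loop in the main thread while main() runs.
    asyncio.run(main())

In a notebook, where a loop is already running, you would instead await main() directly in a cell.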