Think you have what it takes to build an amazing AI agent? I'm currently hosting an AI Agent Hackathon competition for you to prove it and win some cash prizes! Register now for the oTTomator AI Agent Hackathon with a $6,000 prize pool! It's absolutely free to participate and it's your chance to showcase your AI mastery to the world:
studio.ottomator.ai/hackathon/register
Nothing beats a hands-on example. Big up to you, Cole!
Yes, please do include Crawl4AI in the local AI starter kit - and thank you so much for your great work!
Thank you and I am certainly planning on it! Quite a few comments calling it out already including yours!
You're a legend with these videos. I'm learning so much from you.
I'm glad - thank you!
Thanks for the helpful tutorial. I got this working on my local n8n Docker instance. I sometimes want to scrape PDFs, so I first included a check to see if the file is binary and then used Extract from PDF instead of your GET /task/taskid ... I'd recommend adding that to your local tutorial so people can scrape PDFs as well!
Great idea for a workflow, also to import other XML resources. Please integrate Crawl4AI into your local n8n Docker project, perhaps directly with Supabase instead of Qdrant.
Thanks and I appreciate you calling out the local AI starter kit! I am certainly planning on adding both Crawl4AI and Supabase into it.
Hey Cole, thank you for another awesome video! So much great info packed in this one!! Keep it up! Jay
Thanks Jay, I appreciate it a lot!
ur content is so valuable my man! cheers from Brazil
Thank you very much!
I always benefit from your content.
Thank you for the amazing content. I will continue to follow it closely!
I'm glad - thank you!
Thank you so much for this video. I've learned so much and can put this into action. A few questions: What are your superpowers? How can you learn so many new things and create amazing content so fast? The other question is how do you make a living? I mean, how much time do you spend on the YouTube channel? And last question, on the workflow, why do you connect the output of the AI agent node to the beginning of the loop?
Great stuff man! I've learned a lot from you, appreciate the time you put into these!
Thank you! I'm glad to hear it!
I've been waiting all day lol!!
Super neat! Thanks! Would be cool to see a tut about retrieving YT video transcriptions into RAG...
Thank you, you bet! Great suggestion, I would love to do this ;)
@ColeMedin maybe with integration of TEN
Nice one! Would be really nice to see an application that combines fine-tuning and RAG. I am currently trying to fine-tune LLaMA to become an expert in suggesting code snippets. The idea is to crawl through a CHM file that contains the SDK documentation, and augment that with a few blog articles.
Maybe something for your next example?
Awesome video mate! How can you turn the simple n8n RAG agent in this video into a fully agentic one? (Perhaps suggesting a part 2 to this video)
Thank you! Yes, this would definitely be an entire second video. But you can set up custom tools in n8n to make it agentic RAG. Basically just tools to interact with the data in other ways besides basic RAG.
Cole, thank you so much for all your work and great content. I can't tell you how helpful it is and how much I have learned, keep up the great work my friend! I do have a quick question. This is probably super simple, but I was thinking about ethically scraping, as you have mentioned more than once, and I wondered if there is a simple step that we could introduce into our workflow to ingest the robots.txt file for any given site, then parse and exclude all the specified off limits files and directories automatically?
You are so welcome, thank you for the kind words! :D
Yes, this is a great idea and something you could set up super easily at the start of the workflow! You could just use the requests module in Python to pull the robots.txt and look for the common lines that specify if you can scrape or not.
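If it helps anyone, here's a minimal sketch of that robots.txt check in Python. It leans on the standard library's robotparser rather than hand-parsing the Disallow lines, and the user agent and URL below are just placeholders:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_scrape(url: str, user_agent: str = "*") -> bool:
    """Fetch the site's robots.txt and check whether `url` may be scraped."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

# Example with a placeholder URL - skip any page the site disallows
print(allowed_to_scrape("https://example.com/docs/page"))
```

In the n8n workflow itself you could drop the same logic into a Code node (or a tiny external service) to filter the URL list before the scraping loop starts.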
Thank you! Was looking for this.
You are so welcome! :D
very good job Cole! 🙂
Thank you Jorge! :D
00:05 - Scrape any website in minutes using n8n and Crawl4AI without coding.
02:06 - Implementing an AI agent using n8n for scraping without code.
05:58 - Setting up Crawl4AI with Docker for web scraping.
07:49 - Setting up Crawl4AI as a Docker API endpoint on Digital Ocean.
11:40 - Setting up Crawl4AI on Digital Ocean is quick and straightforward.
13:36 - Leveraging n8n and Crawl4AI for efficient agent development.
17:17 - Using n8n to split and manage URLs for scraping.
18:55 - Modify workflow for batch scraping using n8n and Crawl4AI.
22:24 - Integrating n8n with Crawl4AI for asynchronous web scraping.
24:16 - Automating web scraping with n8n and Crawl4AI
27:46 - Set up a vector store in Supabase to manage documents.
29:29 - Successfully scraped and processed 148 items across multiple pages.
32:54 - Easily scrape any website with n8n and Crawl4AI without coding.
congrats mate for the very good videos, and thank you very much for sharing
Thank you! You're welcome :)
Thanks for another very well presented video tutorial, Cole. Much appreciated. Scraping the n8n docs in a workflow, it gets to around 69 items and then the HTTP Request node after the Wait node just sits there continuously spinning for some reason. Very weird indeed. I think the issue is some internal resource problem with the Crawl4AI Docker container when it gets down to 1 slot available.
Cole, would you say Digital Ocean is cheaper than Vercel for web app deployment? Would you use the static app feature in Digital Ocean or droplets for web apps?
And what is your advice for a newbie to learn programming? How would you start? What language would you learn? What would you do with the knowledge afterwards? Thank you! I love your channel!
Great questions! Vercel is cheaper for hosting web apps, but you aren't able to host Docker containers like I do with DigitalOcean in this video. For your second question, it depends on the type of app. Some web apps can be served as static content while others are SPAs or other kinds of dynamic pages that would be better suited for a droplet. A droplet is certainly going to be more versatile!
I would start by learning Python and getting good at using an AI IDE like Cursor or Windsurf not just to help you code but to help you understand what it is coding. Don't get lazy with it and let it write everything for you, actually make sure that you understand what is going on and that you have it explain things to you!
Thanks for the kind words! :D
Very nice, but for my use case browser-use is more useful because some websites load the data after a while and Crawl4AI does not return the data for those.
Thank you! And actually Crawl4AI uses Playwright under the hood so it is browser based! You can scrape SPAs and other dynamic web apps that aren't just a bunch of static page URLs.
@ColeMedin browser-use also uses Playwright, but maybe I just didn't find a way to scrape websites that require waiting or interactions to get the data.
You can execute JS with Crawl4AI! So I'd use that to do things like wait for certain elements to appear.
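For reference, here's a rough sketch of what that looks like with the Crawl4AI Python library - js_code and wait_for are the options described in the Crawl4AI docs, and the URL and CSS selector here are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # run arbitrary JS in the page, e.g. scroll to trigger lazy loading
        js_code=["window.scrollTo(0, document.body.scrollHeight);"],
        # wait until a specific element shows up before extracting content
        wait_for="css:.results-loaded",  # placeholder selector
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown[:500])

asyncio.run(main())
```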
Great content, thanks @ColeMedin! Appreciate the high quality content. One thing I see as an opportunity is to provide the source link with the answer the AI agent gives. Is there a way to make that happen?
Thank you very much, you bet! It's not possible with basic n8n RAG, but if you set up a custom retrieval tool using Supabase you could certainly have it cite its sources!
The website that I am trying to scrape allows all user agents to scrape, but it doesn't have a sitemap. What can I do?
Great question! Crawl4AI is able to extract links from any page, so you can start with the home page, have it find links there, and recursively scrape those links! See:
docs.crawl4ai.com/core/link-media/
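As a rough illustration of that recursive approach, here's a sketch using the Crawl4AI Python library - result.links exposing "internal" links is what the linked docs page describes, and the start URL and page cap are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_site(start_url: str, max_pages: int = 50) -> dict[str, str]:
    """Breadth-first crawl: follow internal links discovered on each page."""
    seen: set[str] = set()
    queue = [start_url]
    pages: dict[str, str] = {}
    async with AsyncWebCrawler() as crawler:
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            result = await crawler.arun(url=url)
            if not result.success:
                continue
            pages[url] = result.markdown
            # result.links groups discovered links into "internal"/"external"
            for link in result.links.get("internal", []):
                href = link.get("href")
                if href and href not in seen:
                    queue.append(href)
    return pages

pages = asyncio.run(crawl_site("https://example.com"))
print(f"Crawled {len(pages)} pages")
```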
@ColeMedin ur a G
Hi
Another amazing video thanks.
Any chance this can be integrated with Open WebUI?
Thank you very much! Yes - n8n can be integrated into Open WebUI using Open WebUI's functions or pipelines. I do cover this in a video on my channel, though not with this specific use case:
ruclips.net/video/E2GIZrsDvuM/видео.html
Great! I think Coolify would be better - it has one-click setup for around 200 to 300 open source services, including n8n and Docker, to self-host. Should we expect you to do a no-code vid using Coolify?
Thanks! Yeah Coolify is great as well! I chose DigitalOcean just because I'm familiar with it and it actually seemed the simplest for this. Plus a lot of people use it already to host n8n.
What are you up to Cole!? I feel you're holding back your biggest project to date perhaps with all this scraping, how many of those 'digits' boxes did you pre-order from nvidia? /s
Haha what makes you think I would be holding back with these scraping videos? Honestly the reason I'm stringing a few together here is because people have been finding them really valuable. I do have some big things coming up for AI agents but I'm not holding back with this content ;)
Oh, scrape what I said, sorry! I just realized it's me and the algorithm, well, no comment... great content as always! 😁 So, just one DIGITS PC, right!?
I feel rather bad now, given how much I appreciate these actually, never stop!! Google knows too well why I'm littered with them, and as I set up one thing, a better one joins the party and it be like that, but it's awesome, and I bet it's as frustrating as awesome for you too ("oh boy, the DeepSeek guys with their 'side projects' dropped another model"), only with you putting a lot more effort into it - a big understatement too 💪
I should've ended the first with a /s, but I forget that even while adding extensive amounts of punctuation, I forgot the declaration of the sarcasm function == walk of shame 😬
But shit, I forgot there's actually a few who may sometimes say it while serious 😅 Not on a ColeMedin video though, I thought - you're too likeable for that!
You're totally good man! haha
I appreciate the kind words :D
Please make a locally hosted instance tutorial using docker compose, like the local ai agent
Thanks for the suggestion Brad - I am planning on doing this in the near future!
Total clarity!
Awesome video! This will definitely help me with my next project!
Do you have a video on how to host n8n on Digital Ocean 🌊?
I am not sure I want to run it on my computer since I want it to run 24/7.
Thank you very much, I'm glad to hear it!
I don't have a video but the n8n documentation for hosting on Digital Ocean is super helpful:
docs.n8n.io/hosting/installation/server-setups/digital-ocean/
Thank you ! You're Awesome !
When I hit the /crawl endpoint, mine says "method not allowed"?
Great tutorial! Any way we can ensure it doesn't crawl the same pages when the workflow fails? I have to continuously delete the records from Supabase.
Thank you! Great question too. I would recommend extending this workflow to clear out the vector DB for the specific page before reinserting anything for it. You can use the "page" metadata I show how to add in this video for that.
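For anyone who wants to script that cleanup outside of n8n, here's a hedged sketch using the supabase Python client. It assumes the default "documents" table and "metadata" jsonb column that the n8n Supabase vector store uses, and the project URL and key are placeholders:

```python
from supabase import create_client

# Placeholder credentials - use your own project URL and service role key
supabase = create_client("https://YOUR-PROJECT.supabase.co", "YOUR_SERVICE_ROLE_KEY")

def clear_page(page_url: str) -> None:
    """Delete previously inserted chunks for one page before re-scraping it."""
    # PostgREST lets you filter on a jsonb field with the ->> operator
    supabase.table("documents").delete().eq("metadata->>page", page_url).execute()

clear_page("https://docs.n8n.io/some-page/")  # placeholder page URL
```

Inside n8n itself, a Supabase or Postgres node running the equivalent delete right before the vector store insert step does the same job.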
Bro reminds me of the llama from the emperor's new groove ❤
Haha I haven't heard that one before!
If I needed advanced, specific knowledge extracted and interpreted from, let's say, a programming application and a specific user forum, would it be possible to "chain" Crawl4AI and DeepSeek R1 into a speech-to-text chatbot that could specifically look into those sources first?! 😅 (and logically combine the content?)... that would be awwwesome.
Yes that is certainly possible - I love the idea! I'll be doing more with R1 and RAG soon
Hi Cole, thank you for great content.
Can you please explain how to connect the Postgres Chat Memory node? I've struggled a lot, thanks in advance.
Can we deploy it on vercel? And use the endpoints as apis?
Yes you sure can!
@ColeMedin Great, I was thinking you could find us something that lets us deploy stuff for a span of time, like 30 days, without requiring any credit card. I mean for free.
Render is another good option! They have a great free tier.
@ColeMedin Great. It would be great if your stuff could be run almost for free - I mean your videos mostly using (almost) free tools. It gives you a unique touch in this AI space.
Could this be done on the n8n documentation?
Yes definitely! Here is their sitemap:
n8n.io/sitemap_index.xml
I found this by going to n8n.io/robots.txt
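If you ever want to pull those URLs outside of n8n, here's a small sketch that fetches a sitemap index and flattens it into page URLs. It assumes the standard sitemap XML namespace; the n8n sitemap URL is the one mentioned above:

```python
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Return all page URLs from a sitemap, recursing into sitemap indexes."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    if root.tag.endswith("sitemapindex"):
        # an index file just points at child sitemaps - recurse into each
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(sitemap_urls(loc.text.strip()))
        return urls
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(len(sitemap_urls("https://n8n.io/sitemap_index.xml")))
```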
Just found you here in Oz. For non-techies like me, don't you have a fully packed agent I can just log in to or download? Happy to pay!
how do you set up the postgres chat history db?
n8n does this automatically under the hood when you use the Postgres Chat History Node!
@ColeMedin It says I need to configure the connection. It defaulted to localhost but that did not work - it says unable to connect. Is it set up through my Supabase? How do I connect that?
Are you running Supabase locally? I'm a bit confused! To connect your Supabase, you'll want to go to the "connect" tab in your Supabase dashboard and look for the connection details there to put into n8n. Use the connection details that have the port 6543 instead of 5432!
Waiting for the local version 😊
With the local AI starter kit? I'm definitely planning on adding it!
+1 on that 😊
For my understanding: what are the use cases for something like this? Thanks!
Really it's used for turning an LLM into an expert for any website! Your ecommerce store, documentation for a programming language/library, a portion of Wikipedia, etc.
What are your thoughts on smolagents by Hugging Face? Can we do the same with smolagents?
Kindly put out a video on converting n8n to Python code and working with custom hosted models or your own DB.
Most important ⚠️: compare with Flowise.
Can you please replicate the video using local resources, with Docker for both n8n and Crawl4AI?
Yes I am planning on doing this soon :)
We like it.
Please show a graphs tutorial 🥺 in Pydantic AI.
Hi @Cole
Thanks for the tutorial.
Can you or someone help?
I keep having issues in Digital Ocean - I get "Deployment failed during deploy phase". I tried it twice. It said: "container terminated by the platform due to exceeding resources or your app misbehaving."
If you find a way to fix the bug with workflow execution where it shows that a node has been executed but it has not actually been executed, let me know. It only happens self-hosted.
Is anyone else running into the OpenAI embedding 429 rate limit error? I checked API usage and it shows zero requests for the API key being used in my n8n creds.
This must be when the embeddings are created to insert your pages into the vector DB. I would add a wait node before inserting anything into the vector DB and adjust that until you don't get 429 errors.
Thanks
You bet! Thank you so much for your support!
1 week later and the information in the previous video is already out of date.
What information are you referring to? The other video for doing something similar in Python is still relevant!
Idk honestly I hate that visual programming stuff. Way easier and faster to just write the code, for me
😉
It is free to use the software itself, but the LLM API calls are likely not free. Should probably note that for the less-informed.
Yeah that is true, I appreciate you calling it out! You can always use local LLMs for free though as well as some through APIs like Gemini 2.0 Flash. But yes, in general it'll cost you something.
Hey, just wondering, would something like Gemini 2.0 be efficient at something like this? @ColeMedin
@ColeMedin AFAIK, only Gemini 1.5 Pro is available. You cannot use the 2.0 API anymore at the moment, right?
I used it just a few days ago! Maybe something changed super recently?
Yes!
How would I crawl a whole forums site - multiple discussions with multiple pages? Would I be training a model at that point? Or is it still all in the context window?