n8n + Crawl4AI - Scrape ANY Website in Minutes with NO Code

  • Published: Jan 29, 2025

Comments • 118

  • @ColeMedin
    @ColeMedin  2 days ago +5

    Think you have what it takes to build an amazing AI agent? I'm currently hosting an AI Agent Hackathon competition for you to prove it and win some cash prizes! Register now for the oTTomator AI Agent Hackathon with a $6,000 prize pool! It's absolutely free to participate and it's your chance to showcase your AI mastery to the world:
    studio.ottomator.ai/hackathon/register

  • @Surafel_Demissie
    @Surafel_Demissie 2 days ago +9

    Nothing beats a hands-on example, big up to you Cole

  • @simfor
    @simfor 2 days ago +4

    Yes please do include Crawl4AI in the local AI starter kit, and thank you so much for your great work

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you and I am certainly planning on it! Quite a few comments calling it out already including yours!

  • @itwasntme947
    @itwasntme947 2 days ago +4

    You're a legend with these videos, I'm learning so much from you.

  • @liveitup278
    @liveitup278 2 hours ago

    Thanks for the helpful tutorial. I got this working on my local n8n Docker instance. I sometimes want to scrape PDFs, so I first included a check to see if the file is binary, then used Extract from PDF instead of your GET /task/taskid ... I'd recommend adding that to your local tutorial so people can scrape PDFs as well!

  • @saschadeus6491
    @saschadeus6491 2 days ago +4

    Great idea for a workflow, also for importing other XML resources. Please integrate Crawl4AI into your local n8n Docker project, perhaps directly with Supabase instead of Qdrant.

    • @ColeMedin
      @ColeMedin  2 days ago

      Thanks and I appreciate you calling out the local AI starter kit! I am certainly planning on adding both Crawl4AI and Supabase into it.

  • @SouthbayJay_com
    @SouthbayJay_com 2 days ago +1

    Hey Cole, thank you for another awesome video! So much great info packed in this one!! Keep it up! Jay

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Thanks Jay, I appreciate it a lot!

  • @djpowerboy
    @djpowerboy 2 days ago +1

    your content is so valuable my man! cheers from Brazil

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Thank you very much!

  • @imurakoji
    @imurakoji 2 days ago +1

    I always benefit from your content.
    Thank you for the amazing content. I will continue to follow it closely!

  • @bluegreen-ai
    @bluegreen-ai 1 day ago

    Thank you so much for this video. I've learned so much and can put this into action. A few questions: What are your superpowers? How can you learn so many new things and create amazing content so fast? The other question is how do you make a living? I mean, how much time do you spend on the YouTube channel? And last question, on the workflow, why do you connect the output of the AI agent node to the beginning of the loop?

  • @Paulson970
    @Paulson970 2 days ago +1

    Great stuff man! I've learned a lot from you, appreciate the time you put into these!

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you! I'm glad to hear it!

  • @SouthbayJay_com
    @SouthbayJay_com 2 days ago +3

    I've been waiting all day lol!!

  • @MassimilianoMitch
    @MassimilianoMitch 2 days ago +1

    Super neat! Thanks! Would be cool to see a tut about retrieving YT video transcriptions to RAG...

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you, you bet! Great suggestion, I would love to do this ;)

    • @MassimilianoMitch
      @MassimilianoMitch 2 days ago +2

      @@ColeMedin maybe with integration of TEN

  • @cesarecaoduro8031
    @cesarecaoduro8031 1 day ago

    Nice one! Would be really nice to see an application that combines fine-tuning and RAG. I am currently trying to fine-tune LLaMA to become an expert at suggesting code snippets. The idea is to crawl through a CHM file that contains the SDK documentation, and augment that with a few blog articles.
    Maybe something for your next example?

  • @rigaldamez3468
    @rigaldamez3468 1 day ago +1

    Awesome video mate! How can you turn the simple n8n RAG agent in this video into a fully agentic one? (Perhaps suggesting a part 2 to this video)

    • @ColeMedin
      @ColeMedin  1 day ago +1

      Thank you! Yes, this would definitely be an entire second video. But you can set up custom tools in n8n to make it agentic RAG. Basically just tools to interact with the data in other ways besides basic RAG.

  • @OutdoorInformed
    @OutdoorInformed 1 day ago +1

    Cole, thank you so much for all your work and great content. I can't tell you how helpful it is and how much I have learned; keep up the great work my friend! I do have a quick question. This is probably super simple, but I was thinking about ethical scraping, as you have mentioned more than once, and I wondered if there is a simple step we could introduce into our workflow to ingest the robots.txt file for any given site, then parse it and automatically exclude all the specified off-limits files and directories?

    • @ColeMedin
      @ColeMedin  1 day ago

      You are so welcome, thank you for the kind words! :D
      Yes, this is a great idea and something you could set up super easily at the start of the workflow! You could just use the requests module in Python to pull the robots.txt and look for the common lines that specify if you can scrape or not.
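
Cole's robots.txt suggestion can be sketched in a few lines of Python. This is a hedged sketch, not code from the video: it leans on the standard library's urllib.robotparser rather than hand-parsing the Disallow lines, and the example site and paths are made up.

```python
from urllib import robotparser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# In a workflow you would first download the file, e.g.:
#   robots_txt = requests.get("https://example.com/robots.txt").text
robots_txt = """User-agent: *
Disallow: /private/
"""
print(is_allowed(robots_txt, "https://example.com/private/page"))  # False
print(is_allowed(robots_txt, "https://example.com/docs/page"))     # True
```

The same check could run in a Code node at the start of the workflow, skipping any sitemap URL for which is_allowed returns False.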

  • @RajBiswal_Films
    @RajBiswal_Films 2 days ago +2

    Thank you! Was looking for this.

    • @ColeMedin
      @ColeMedin  2 days ago

      You are so welcome! :D

  • @JorgeCastro-vm1kt
    @JorgeCastro-vm1kt 2 days ago +1

    very good job Cole! 🙂

  • @eklavyaaa
    @eklavyaaa 2 days ago +1

    00:05 - Scrape any website in minutes using n8n and Crawl4AI without coding.
    02:06 - Implementing an AI agent using n8n for scraping without code.
    05:58 - Setting up Crawl4AI with Docker for web scraping.
    07:49 - Setting up Crawl4AI as a Docker API endpoint on Digital Ocean.
    11:40 - Setting up Crawl4AI on Digital Ocean is quick and straightforward.
    13:36 - Leveraging n8n and Crawl4AI for efficient agent development.
    17:17 - Using n8n to split and manage URLs for scraping.
    18:55 - Modify workflow for batch scraping using n8n and Crawl4AI.
    22:24 - Integrating n8n with Crawl4AI for asynchronous web scraping.
    24:16 - Automating web scraping with n8n and Crawl4AI
    27:46 - Set up a vector store in Supabase to manage documents.
    29:29 - Successfully scraped and processed 148 items across multiple pages.
    32:54 - Easily scrape any website with n8n and Crawl4AI without coding.

  • @user-nbfkxngjmyb
    @user-nbfkxngjmyb 2 days ago +1

    Congrats mate on the very good videos, and thank you very much for sharing

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you! You're welcome :)

  • @MartinCooney1
    @MartinCooney1 1 hour ago

    Thanks for another very well presented video tutorial, Cole. Much appreciated. I'm scraping the n8n docs in a workflow. It gets to around 69 items and then the HTTP Request node after the Wait node just sits there continuously spinning for some reason. Very weird indeed. I think the issue is some internal resource problem with the Crawl4AI Docker container when it gets down to 1 slot available.

  • @John-ek5bq
    @John-ek5bq 2 days ago +1

    Cole, would you say DigitalOcean is cheaper than Vercel for web app deployment? Would you use the static app feature in DigitalOcean or droplets for web apps?
    And what is your advice for a newbie learning to program? How would you start? What language would you learn? What would you do afterwards with the knowledge? Thank you! I love your channel!

    • @ColeMedin
      @ColeMedin  2 days ago +2

      Great questions! Vercel is cheaper for hosting web apps, but you aren't able to host Docker containers like I do with DigitalOcean in this video. For your second question, it depends on the type of app. Some web apps can be served as static content while others are SPAs or other kinds of dynamic pages that would be better suited for a droplet. A droplet is certainly going to be more versatile!
      I would start by learning Python and getting good at using an AI IDE like Cursor or Windsurf not just to help you code but to help you understand what it is coding. Don't get lazy with it and let it write everything for you, actually make sure that you understand what is going on and that you have it explain things to you!
      Thanks for the kind words! :D

  • @Techonsapevole
    @Techonsapevole 2 days ago +2

    Very nice, but for my use case browser-use is more useful, because some websites load the data after a while and Crawl4AI does not return the data for those

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you! And actually Crawl4AI uses Playwright under the hood so it is browser based! You can scrape SPAs and other dynamic web apps that don't have a bunch of just static page URLs.

    • @Techonsapevole
      @Techonsapevole 2 days ago +1

      @ColeMedin browser-use also uses Playwright, but maybe I just didn't find a way to scrape websites that require waiting or interactions to get the data

    • @ColeMedin
      @ColeMedin  1 day ago

      You can execute JS with Crawl4AI! So I'd use that to do things like wait for certain elements to appear.

  • @thomashuang8061
    @thomashuang8061 2 days ago +1

    Great content. Thanks @ColeMedin! Appreciate the high quality content. One opportunity I see is to provide the source link with the answer the AI agent gives. Is there a way to make that happen?

    • @ColeMedin
      @ColeMedin  1 day ago

      Thank you very much, you bet! It's not possible with basic n8n RAG, but if you set up a custom retrieval tool using Supabase you could certainly have it cite its sources!

  • @sujikanth
    @sujikanth 2 days ago +5

    The website that I am trying to scrape allows all user agents to scrape, but it doesn't have a sitemap. What can I do?

    • @ColeMedin
      @ColeMedin  2 days ago +2

      Great question! Crawl4AI is able to extract links from any page, so you can start with the home page, have it find links there, and recursively scrape those links! See:
      docs.crawl4ai.com/core/link-media/
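
The recursive approach Cole describes is just a breadth-first traversal over whatever link extractor you have. Here's a hedged sketch of the traversal logic only: fetch_links stands in for a call to Crawl4AI (for example, reading the internal links it extracts for a page), and the tiny in-memory "site" is made up for illustration.

```python
from collections import deque

def crawl_site(start_url, fetch_links, max_pages=50):
    """Breadth-first crawl: visit each page once, following internal links.

    fetch_links(url) -> list of linked URLs; in practice this would wrap
    a Crawl4AI request and read the internal links it extracts.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" standing in for real HTTP responses:
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": [],
    "/docs/api": [],
}
print(crawl_site("/", site.get))  # ['/', '/docs', '/blog', '/docs/api']
```

A real version would also normalize URLs and restrict links to the starting domain before queueing them.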

    • @beansplace
      @beansplace 1 day ago +1

      @ColeMedin you're a G

  • @alqaimyouth
    @alqaimyouth 2 days ago +1

    Hi
    Another amazing video, thanks.
    Any chance this can be integrated with Open WebUI?

    • @ColeMedin
      @ColeMedin  1 day ago +1

      Thank you very much! Yes - n8n can be integrated into Open WebUI using Open WebUI's functions or pipelines. I do cover this in a video on my channel, though not with this specific use case:
      ruclips.net/video/E2GIZrsDvuM/видео.html

  • @PAKYOUTHISM
    @PAKYOUTHISM 2 days ago +1

    Great! Coolify, I think, would be better; it has a one-click setup for around 200 to 300 open source services to self-host, including n8n and Docker. Should we expect you to do a no-code vid using Coolify?

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Thanks! Yeah Coolify is great as well! I chose DigitalOcean just because I'm familiar with it and it actually seemed the simplest for this. Plus a lot of people use it already to host n8n.

  • @HenkHeidstra
    @HenkHeidstra 2 days ago

    What are you up to, Cole!? I feel you're holding back your biggest project to date, perhaps, with all this scraping. How many of those 'DIGITS' boxes did you pre-order from NVIDIA? /s

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Haha what makes you think I would be holding back with these scraping videos? Honestly the reason I'm stringing a few together here is because people have been finding them really valuable. I do have some big things coming up for AI agents but I'm not holding back with this content ;)

    • @HenkHeidstra
      @HenkHeidstra 2 days ago

      Oh, scrape what I said, sorry! I just realized it's me and the algorithm, well no comment.. great content as always!😁 So, just one DIGITS PC right!?

    • @HenkHeidstra
      @HenkHeidstra 2 days ago

      I feel rather bad now, given how much I actually appreciate these; never stop!! Google knows too well why I'm littered with them, and as I set up one thing, a better one joins the party, and it be like that, but it's awesome, and I bet it's as frustrating as awesome for you too ("oh boy, the DeepSeek guys with their 'side projects' dropped another model"), only with you putting a lot more effort into it, a big understatement too 💪
      I should've ended the first with a /s, but even while adding extensive amounts of punctuation, I forgot the declaration of the sarcasm function == walk of shame 😬
      But shit, I forgot there's actually a few who may sometimes say it while serious 😅 Not on a ColeMedin video though, I thought; you're too likeable for that!

    • @ColeMedin
      @ColeMedin  1 day ago +1

      You're totally good man! haha
      I appreciate the kind words :D

  • @BradParler
    @BradParler 2 days ago +1

    Please make a locally hosted instance tutorial using Docker Compose, like the local AI agent

    • @ColeMedin
      @ColeMedin  2 days ago

      Thanks for the suggestion Brad - I am planning on doing this in the near future!

  • @rahulmisra2000
    @rahulmisra2000 2 days ago +1

    Total clarity!

  • @wesayit9057
    @wesayit9057 2 days ago +1

    Awesome video! This will definitely help me with my next project!
    Do you have a video on how to host n8n on DigitalOcean 🌊?
    I am not sure I want to run it on my computer since I want it to run 24/7

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Thank you very much, I'm glad to hear it!
      I don't have a video but the n8n documentation for hosting on Digital Ocean is super helpful:
      docs.n8n.io/hosting/installation/server-setups/digital-ocean/

  • @jeremiealcaraz
    @jeremiealcaraz 2 days ago +1

    Thank you! You're awesome!

  • @HavocYT
    @HavocYT 2 hours ago

    When I hit the /crawl endpoint, mine says method not allowed?

  • @viralidy
    @viralidy 2 days ago +1

    Great tutorial! Any way we can ensure it doesn't re-crawl the same pages when the workflow fails? I have to continuously delete the records from Supabase.

    • @ColeMedin
      @ColeMedin  2 days ago

      Thank you! Great question too. I would recommend extending this workflow to clear out the vector DB for the specific page before reinserting anything for it. You can use the "page" metadata I show how to add in this video for that.
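
In Supabase that clear-out boils down to one DELETE keyed on the "page" metadata. A hedged sketch; the documents table name and the jsonb metadata column are assumptions based on n8n's default Supabase vector store layout, so adjust to your schema.

```python
def delete_page_sql(table: str = "documents") -> str:
    """Parameterized DELETE removing every chunk stored for one scraped page.

    Assumes a jsonb `metadata` column holding the "page" value added in
    the video; the %s placeholder takes the page URL at execution time.
    """
    return f"DELETE FROM {table} WHERE metadata->>'page' = %s"

# With a Postgres client such as psycopg2 (sketch):
#   cur.execute(delete_page_sql(), ("https://docs.example.com/page-1",))
print(delete_page_sql())
```

Running this before the insert step makes a failed-and-restarted workflow idempotent per page.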

  • @rogue.ganker
    @rogue.ganker 2 days ago +1

    Bro reminds me of the llama from the emperor's new groove ❤

    • @ColeMedin
      @ColeMedin  1 day ago +1

      Haha I haven't heard that one before!

  • @guidosc3470
    @guidosc3470 2 days ago +1

    If I needed advanced, specific knowledge extracted and interpreted, from, let's say, a programming application and a specific user forum, would it be possible to "chain" Crawl4AI and DeepSeek R1 into a speech-to-text chatbot that could specifically look into those sources first?!😅 (and logically combine the content?)... that would be awwwesome.

    • @ColeMedin
      @ColeMedin  2 days ago +1

      Yes that is certainly possible - I love the idea! I'll be doing more with R1 and RAG soon

  • @raminseferov2148
    @raminseferov2148 1 day ago

    Hi Cole, thank you for the great content.
    Can you please explain how to connect Postgres Chat Memory? I've struggled a lot; thanks in advance.

  • @anfiiaidev
    @anfiiaidev 2 days ago +1

    Can we deploy it on Vercel? And use the endpoints as APIs?

    • @ColeMedin
      @ColeMedin  2 days ago

      Yes you sure can!

    • @anfiiaidev
      @anfiiaidev 2 days ago +1

      @ColeMedin great, I was wondering if you could find us something that lets us deploy stuff for a span of time, like 30 days, without requiring any credit card. I mean for free

    • @ColeMedin
      @ColeMedin  1 day ago

      Render is another good option! They have a great free tier.

    • @anfiiaidev
      @anfiiaidev 1 day ago

      @ColeMedin great. It would be great if your stuff could be run almost for free, I mean using mostly free tools in your videos. It gives you a little unique touch in this AI space.

  • @xillionlegacy
    @xillionlegacy 2 days ago +1

    Could this be done on the n8n documentation?

    • @ColeMedin
      @ColeMedin  2 days ago

      Yes definitely! Here is their sitemap:
      n8n.io/sitemap_index.xml
      I found this by going to n8n.io/robots.txt
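
Pulling the page URLs out of such a sitemap (or a sitemap index, which lists further sitemaps) is a short XML exercise. A hedged sketch with a made-up inline example; real sitemaps live at URLs like the one above and use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, used by both <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml: str) -> list:
    """Collect every <loc> entry from a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""
print(extract_urls(sample))  # ['https://example.com/docs/', 'https://example.com/blog/']
```

For a sitemap index, you'd fetch each URL this returns and run the same extraction again to get the actual page URLs.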

  • @globalsalesacademy
    @globalsalesacademy 1 day ago

    Just found you here in Oz. For non-techies like me, don't you have a fully packed agent I can just log in to or download? Happy to pay

  • @DougsGarden
    @DougsGarden 2 days ago +1

    How do you set up the Postgres chat history DB?

    • @ColeMedin
      @ColeMedin  2 days ago

      n8n does this automatically under the hood when you use the Postgres Chat History Node!

    • @DougsGarden
      @DougsGarden 2 days ago

      @ColeMedin it says I need to configure the connection. It defaulted to localhost but that did not work; it says unable to connect. Is it set up through my Supabase? How do I connect that?

    • @ColeMedin
      @ColeMedin  1 day ago

      Are you running Supabase locally? I'm a bit confused! To connect your Supabase, you'll want to go to the "connect" tab in your Supabase dashboard and look for the connection details there to put into n8n. Use the connection details that have the port 6543 instead of 5432!

  • @nunajah
    @nunajah 2 days ago +1

    Waiting for the local version 😊

    • @ColeMedin
      @ColeMedin  2 days ago +2

      With the local AI starter kit? I'm definitely planning on adding it!

    • @ThomasMock-c5n
      @ThomasMock-c5n 2 days ago +1

      +1 on that 😊

  • @TonyGonery
    @TonyGonery 1 day ago

    For my understanding: what are the use cases for something like this? Thanks!

    • @ColeMedin
      @ColeMedin  1 day ago

      Really it's used for turning an LLM into an expert for any website! Your ecommerce store, documentation for a programming language/library, a portion of Wikipedia, etc.

  • @ankitgadhvi
    @ankitgadhvi 7 hours ago

    What are your thoughts on smolagents by Hugging Face? Can we do the same with smolagents?

  • @nathamuni9435
    @nathamuni9435 2 days ago +1

    Kindly put out a video on converting n8n to Python code and on working with custom hosted models or your own DB.
    Most important ⚠️: compare with Flowise

  • @noor96883
    @noor96883 1 day ago +1

    Can you please replicate the video using local resources, with Docker for both n8n and Crawl4AI

    • @ColeMedin
      @ColeMedin  1 day ago

      Yes I am planning on doing this soon :)

  • @ten-framework
    @ten-framework 2 days ago +1

    We like it.

  • @ace.1type8z8
    @ace.1type8z8 22 hours ago

    Please show a graphs tutorial 🥺 in Pydantic AI

  • @itsmar1034
    @itsmar1034 1 day ago

    Hi @Cole
    Thanks for the tutorial.
    Can you or someone help?
    I keep having issues in DigitalOcean; I get "Deployment failed during deploy phase". I tried it twice. It said "container terminated by the platform due to exceeding resources or your app misbehaving."

  • @bosleo1130
    @bosleo1130 1 day ago

    If you find a way to fix the bug with workflow execution where it shows that a node has been executed but it has not actually been executed, let me know. Self-hosted only

  • @bryant9820
    @bryant9820 2 days ago

    Is anyone else running into the OpenAI embedding 429 rate limit error? I checked API usage and it shows zero requests on the API key being used in my n8n creds

    • @ColeMedin
      @ColeMedin  2 days ago

      This must be when the embeddings are created to insert your pages into the vector DB. I would add a Wait node before inserting anything into the vector DB and adjust it until you don't get 429 errors.
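
The Wait-node fix maps to the usual retry-with-exponential-backoff pattern. A hedged sketch; RateLimitError and the flaky call below are stand-ins for whatever 429 error your embeddings client actually raises.

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the embeddings API."""

def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call`, doubling the delay after each rate-limit error."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the 429
            sleep(base_delay * (2 ** attempt))

# Demo: a call that is rate-limited twice before succeeding.
attempts = {"count": 0}

def flaky_embed():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "embedded"

result = with_backoff(flaky_embed, sleep=lambda s: None)  # skip real sleeps in the demo
print(result)  # embedded
```

A fixed Wait node is the no-code equivalent of base_delay with no doubling; backoff just adapts the pause to how hard you are being throttled.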

  • @bluegreen-ai
    @bluegreen-ai 1 day ago +1

    Thanks

    • @ColeMedin
      @ColeMedin  21 hours ago

      You bet! Thank you so much for your support!

  • @thespencerowen
    @thespencerowen 2 days ago

    1 week later and the information in the previous video is already out of date.

    • @ColeMedin
      @ColeMedin  2 days ago +1

      What information are you referring to? The other video for doing something similar in Python is still relevant!

  • @angryktulhu
    @angryktulhu 15 hours ago

    Idk, honestly I hate that visual programming stuff. Way easier and faster to just write the code, for me

  • @PyJu80
    @PyJu80 2 days ago +1

    😉

  • @tnypxl
    @tnypxl 2 days ago +3

    It is free to use the software itself, but the LLM API calls are likely not free. Should probably note that for the less-informed.

    • @ColeMedin
      @ColeMedin  2 days ago +2

      Yeah that is true, I appreciate you calling it out! You can always use local LLMs for free though as well as some through APIs like Gemini 2.0 Flash. But yes, in general it'll cost you something.

    • @borismanevski3951
      @borismanevski3951 2 days ago

      Hey, just wondering, would something like Gemini 2.0 be efficient at something like this? @ColeMedin

    • @longho920
      @longho920 2 days ago

      @ColeMedin AFAIK, just Gemini 1.5 Pro is available. You cannot use the 2.0 API anymore at the moment, right?

    • @ColeMedin
      @ColeMedin  1 day ago

      I used it just a few days ago! Maybe something changed super recently?

    • @ColeMedin
      @ColeMedin  1 day ago

      Yes!

  • @That1AiGuy
    @That1AiGuy 16 hours ago

    How would I crawl a whole forums site, multiple discussions with multiple pages? Would I be training a model at that point? Or still all in the context window?