Scrape any website with OpenAI Functions & LangChain

  • Published: Nov 21, 2024

Comments • 106

  • @devlearnllm
    @devlearnllm 1 year ago +14

    Additional details about scraping: on some sites (like WSJ or CNN), scraping for only certain tags yields the best results. Other sites might be different.

    • @stevenwessel9641
      @stevenwessel9641 6 months ago

      I’m working on a couple ai projects with Malik Yusef, Kanye’s main collaborator and one of Virgil’s first mentors. We should connect, lmk 🙏🏼

  • @georgesanchez8051
    @georgesanchez8051 1 year ago +37

    Refreshing to not see some BS clickbait video on LLM uses. Just a clean, focused, and super differentiated walkthrough video. Subscribed, and looking forward to more!

    • @devlearnllm
      @devlearnllm 1 year ago +2

      Thank you. I'm glad that this approach resonates with people.

  • @emlincharly
    @emlincharly 10 months ago +9

    This video feels like a coworker showing me something cool. Really good video man!

  • @alexanderroodt5052
    @alexanderroodt5052 1 year ago +8

    With AI assistance I can scrape hundreds of thousands of products/services a week and now have the facilities to talk to thousands of people at once. Learnt most of it from youtube from people such as yourself who are grossly underappreciated. Keep up the good work and thanks for sharing!

    • @devlearnllm
      @devlearnllm 1 year ago +1

      Thanks for the appreciation. We love it.

  • @richmadrid9563
    @richmadrid9563 1 year ago +4

    This is exactly what I was looking for: a way to scrape websites like a human being, and do it via scripting.
    Also, I like how you explain things clearly and how they work. I found this channel by accident and decided to watch. The next thing I knew, I was a new subscriber!

  • @devinschumacher
    @devinschumacher 1 year ago +6

    You are now officially a real YouTuber.

  • @devlearnllm
    @devlearnllm 1 year ago +9

    Just a slight correction: in LangChain, openai_api_key is a property of the llm object. It's not a global variable.
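
For anyone following along, a minimal sketch of what that looks like (assuming LangChain's ChatOpenAI wrapper; the exact import path varies between LangChain versions):

```python
from langchain.chat_models import ChatOpenAI

# The key is passed to (and stored on) the llm object itself,
# not read from a module-level global in your own code.
llm = ChatOpenAI(
    temperature=0,
    model="gpt-3.5-turbo-0613",
    openai_api_key="sk-...",  # a property of this llm instance
)
```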

  • @walkingwchris
    @walkingwchris 1 year ago +5

    Well done for explaining the why so clearly. You had me in the first minute.

  • @diegosandoval7462
    @diegosandoval7462 1 year ago +15

    🎯 Key Takeaways for quick navigation:
    02:05 🌐 You can scrape websites using LangChain, OpenAI Functions, Playwright, and Beautiful Soup (a fetch-and-clean sketch follows this comment).
    03:55 🧩 OpenAI Functions simplify web scraping by eliminating the need to manually declare HTML tags.
    05:20 🛍️ You can use this approach to scrape e-commerce websites and extract specific information like item titles and prices.
    15:41 🤖 LangChain simplifies interactions with OpenAI's GPT models for various applications, including information extraction.
    23:32 ⚙️ Consider chunking large HTML content and building a FastAPI server to enhance this web scraping tool's capabilities.
    Made with HARPA AI
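
As referenced in the takeaways above, a minimal sketch of the fetch-and-clean step (assuming Playwright's async API and BeautifulSoup 4; the tag whitelist is illustrative and, as the pinned comment notes, the best tags vary by site):

```python
import asyncio

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright


async def fetch_rendered_html(url: str) -> str:
    # Headless Chromium renders JS-heavy pages before we grab the final HTML.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html


def keep_tag_text(html: str, tags: list[str]) -> str:
    # Keep only text inside a few content-bearing tags so the LLM prompt stays small.
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(el.get_text(" ", strip=True) for el in soup.find_all(tags))


if __name__ == "__main__":
    html = asyncio.run(fetch_rendered_html("https://example.com"))
    print(keep_tag_text(html, ["h1", "h2", "p", "span"])[:500])
```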

  • @StudioTatsu
    @StudioTatsu 1 year ago +3

    Note: 'kwargs' usually stands for keyword arguments; normally we call them "keyword args". Nice vid. :)

    • @devlearnllm
      @devlearnllm 1 year ago

      Thank you haha. So obvious in hindsight.

  • @techgeekguru
    @techgeekguru 1 year ago +5

    Very cool stuff! Like the style of narration focusing on conveying information in a straightforward and matter-of-fact manner, without overemphasizing or exaggerating.

  • @mr.gk5
    @mr.gk5 2 months ago +1

    Great stuff, keep on making great AI coding content, you got my sub!

  • @meinbherpieg4723
    @meinbherpieg4723 7 months ago +1

    Good video. Thanks for taking the time to explain the nuances in depth. You've got my sub ha

  • @emmanueladepoju4089
    @emmanueladepoju4089 11 months ago +1

    First video and I like this channel already!🙂

  • @miltondavilaharjula
    @miltondavilaharjula 1 year ago +1

    Great video!! Thank you for sharing. I liked how you simplified the code and explanation. Your project really makes sense, as webpages do change their structure and a traditional approach may break due to those changes.

  • @sandratoolan9598
    @sandratoolan9598 1 year ago +2

    Good luck dude, just keep doing what you doing.

  • @lukeotwell3296
    @lukeotwell3296 7 months ago +2

    That's a high quality vid right there.

  • @tebblesfun
    @tebblesfun 1 year ago +1

    Thank you so much! I didn't know how to implement this and I bumped into your video. Such a saver!

  • @Wildhoneybush1
    @Wildhoneybush1 9 months ago

    Great job, you are a real YouTuber and I can tell that you will become very popular. 😮🎉

  • @zakuro8532
    @zakuro8532 7 months ago +1

    You are the King

  • @jazzzAiman
    @jazzzAiman 1 year ago +2

    Yup, straight to the point

  • @bigbena23
    @bigbena23 6 months ago +1

    Great video.
    I guess modifying this to use a local LLM should be easy, right?

  • @evansmakuba1631
    @evansmakuba1631 7 months ago +1

    bro codes in light mode...respects

  • @MK-jn9uu
    @MK-jn9uu 1 year ago

    The beginning started mid-sentence. Did I miss where you explained how AI will keep us from having to rebuild the scraping code when the website changes?

  • @HazemAzim
    @HazemAzim 1 year ago +1

    Nice and simple. Thanks.

  • @aamdmn2641
    @aamdmn2641 1 year ago

    Hi, great video! I've implemented a similar approach and wanted to see yours, which has given me new inspiration that I'm very grateful for, so thank you! Why did you use Python, given that you mentioned you come from the JavaScript/TypeScript world?

    • @devlearnllm
      @devlearnllm 1 year ago

      Yup, my background was in JS / React

  • @FlutterDev1337
    @FlutterDev1337 1 year ago +1

    This is awesome content btw!

  • @ratnpriyarai4793
    @ratnpriyarai4793 2 months ago +1

    It was quite useful for me.

  • @lamboqin2180
    @lamboqin2180 1 year ago +2

    Thank you for your video and resource! I am trying to build a web app to find news articles that have different standpoints on a chosen topic. Would this code be a good solution for me to scrape news, or is it more suited to something else, like scraping more security-tight websites (since it uses Chromium)? I see the waiting time is quite long too. What LangChain solution/module would you recommend for my project?

    • @devlearnllm
      @devlearnllm 1 year ago +1

      Hey there, the wait time is mostly on the LLM part, not the scraping part. You can definitely use this to scrape news sites. LangChain has the OpenAI Functions extraction chain, which has a nice input parser for extraction. All you have to do is define your schema for scraping, then off you go 🚀
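
For anyone wondering what that looks like in code, a minimal sketch of the "define a schema, then extract" flow (assuming LangChain's create_extraction_chain; the news_headline / news_short_summary field names echo ones mentioned elsewhere in this thread, and your schema can use whatever keys you like):

```python
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# Describe the fields you want back; the chain turns this into an
# OpenAI Functions call behind the scenes.
news_schema = {
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
    "required": ["news_headline"],
}

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_extraction_chain(schema=news_schema, llm=llm)

# `page_text` would be the cleaned text scraped from the news site.
page_text = "Example Daily - Markets rally as ... (cleaned page text goes here)"
print(chain.run(page_text))  # -> list of dicts matching the schema
```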

  • @4ram16
    @4ram16 8 months ago

    I'm on a quest to use an LLM for web scraping without identifying HTML. You gave a lot of valuable background information. You referred to "python things" and talked as though you have experience with NodeJS. Why didn't you use LangChainJS, Puppeteer, and Cheerio? How difficult would it be to rewrite your repo for NodeJS?

  • @tiagoc9754
    @tiagoc9754 10 months ago

    As a dev with a JS background, how has your experience with Python been? Why did you move to Python instead of using LangChainJS? Comparing LangChainJS vs LangChain Python, do you miss many features going from one framework to the other? Have you ever faced an issue with JS that you could only solve with Python?

  • @AlloMission
    @AlloMission 11 months ago +1

    Thanks

  • @SergeyNumerov
    @SergeyNumerov 2 months ago

    I wonder how this would handle dynamic content: as in scraping websites where you have to click stuff to reveal valuable content.

  • @tiagoc9754
    @tiagoc9754 10 months ago

    11:49 Is it safe to remove other tags? It's recommended that web pages contain elements such as section, article, main, menu, header, footer, etc., not to mention h(n), label, span, and aria attributes. I know many pages out there don't follow the "correct" syntax, but especially on huge websites I suppose we'll commonly find those patterns. So would removing those tags not affect the result we expect from the AI integration?
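
For reference, one common pattern (a sketch, not necessarily what the video does) is the inverse of whitelisting: strip tags that rarely carry visible content and keep the semantic structure (section, article, main, headers) intact, which sidesteps the concern above:

```python
from bs4 import BeautifulSoup

# Tags that rarely carry user-visible content; this list is illustrative.
NOISE_TAGS = ["script", "style", "noscript", "svg", "iframe"]

def strip_noise(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for el in soup.find_all(NOISE_TAGS):
        el.decompose()  # remove the element and everything inside it
    # Whatever lives in section/article/main/header/footer is still here.
    return soup.get_text(" ", strip=True)
```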

  • @abhishekchoudhury
    @abhishekchoudhury 1 year ago +1

    hey @llmschool! This is very insightful, and it got me wondering if we can extract security ownership from DEF 14A filings. The difficulty is that each filing has a different structure; can the LLM handle that?

    • @devlearnllm
      @devlearnllm 1 year ago +1

      You can try. Let us know how it goes.

  • @augmentos
    @augmentos 1 year ago +1

    Why, when I try to use functions in the new assistant GPT builder, does it keep telling me the JSON is invalid, and why can't I paste Python or JavaScript in there to be able to scrape the web?

  • @koleshjr
    @koleshjr 1 year ago +2

    Amazing Amazing

  • @plashless3406
    @plashless3406 9 months ago

    Feeding all the HTML to the LLM might exhaust its context length pretty quickly.
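
That is the motivation for the chunking suggested at 23:32 in the takeaways above. A minimal sketch using LangChain's RecursiveCharacterTextSplitter (the chunk sizes are illustrative, and `chain` is an extraction chain like the one sketched earlier):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)

def extract_in_chunks(chain, text: str) -> list[dict]:
    # Run the extraction chain once per chunk and merge the results,
    # so no single call blows past the model's context window.
    results: list[dict] = []
    for chunk in splitter.split_text(text):
        results.extend(chain.run(chunk))
    return results
```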

  • @Guy-Scott
    @Guy-Scott 1 year ago

    Where are you using the OpenAI function-calling functionality? Isn't it the case that OpenAI function calling is supposed to call a specific function inside your program? Or am I missing something?

  • @chadmichaellawson3985
    @chadmichaellawson3985 1 year ago +1

    Fire!!

  • @onirdutta666
    @onirdutta666 1 year ago +2

    I am getting an error: "TypeError: Parameters to generic types must be types. Got {'properties': {'item_title': {'type': 'string'}, 'item_price': {'type': 'number'}, 'item_extra_info." ... Can you help? Thanks in advance.

    • @devlearnllm
      @devlearnllm 1 year ago

      Hey, without looking at your code I'm not sure why that's the case. But I merged my code into LangChain (Python) a couple of weeks ago for this use case, and you can follow the guide here: python.langchain.com/docs/use_cases/web_scraping/

  • @sunilbendre123
    @sunilbendre123 10 months ago

    Thanks for this. Just a quick question: how do I approach this problem if I have, say, 300 website links to scrape?
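
A sketch of one way to fan the same pipeline out over many links and collect results in a CSV (the `extract` callable and the column names are placeholders for whatever single-URL pipeline you already have; for hundreds of links you would likely add retries, rate limiting, and concurrency):

```python
import csv
from typing import Callable, Dict, List

def scrape_many(urls: List[str], extract: Callable[[str], List[Dict]], out_path: str) -> None:
    # `extract` is your existing single-URL pipeline
    # (fetch -> clean -> extraction chain); this just loops and writes rows.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "news_headline", "news_short_summary"])
        writer.writeheader()
        for url in urls:
            for row in extract(url):
                writer.writerow({"url": url, **row})
```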

  • @SilenceOnPS4
    @SilenceOnPS4 1 year ago +1

    Would you know how to scrape PDF documents (download and sort into files) from a website that has a database that is constantly updating? If this is something you can do, I'd love to have a chat and would pay you for your time.
    I am a beginner in this realm, and would love to figure this out.

    • @devlearnllm
      @devlearnllm 1 year ago +1

      For sure. You can reach out to me on LinkedIn: www.linkedin.com/in/haiphunghiem/
      Or chat with me on LangChain Canada's Discord: discord.gg/rtKE2g266C
      (my username is toasted_shibe)

  • @julianomoraisbarbosa
    @julianomoraisbarbosa 1 year ago +2

    # til

  • @evolution3658
    @evolution3658 5 months ago

    What is it for? For what purpose?

  • @BarışAytimur-e8x
    @BarışAytimur-e8x 9 months ago

    Why use Playwright? Can't you use Selenium instead?

  • @jsfnnyc
    @jsfnnyc 1 year ago

    Lolz at the neighbor's trash 😄

  • @Flameandfireclan
    @Flameandfireclan 1 year ago +2

    Hello sir, I'm building commercial software, and I want to ask your permission before I use your code.
    Would it be okay if I cloned your code and used it as part of my software?
    (I am very impressed by what you have built; that's why I'm interested in using it myself.)

    • @devlearnllm
      @devlearnllm 1 year ago +1

      For sure. I'm flattered. And thanks for asking as well. Please credit me (my name and this video) if you don't mind.

    • @Flameandfireclan
      @Flameandfireclan 1 year ago +2

      @devlearnllm Thanks! I'll make sure to include your name (as the author) in the documentation and a link to the video! 🙏

  • @kamalseriki3201
    @kamalseriki3201 1 year ago

    I've been trying to use this in a Django web app with Celery, but I've been getting coroutine errors. I managed to bypass that with the async_to_sync function, but now the task keeps executing without giving any results.
    What can I do?

  • @devlearnllm
    @devlearnllm 1 year ago +2

    Sign up for the upcoming AI Agents Master Course: forms.gle/YuMvqfXo6xXUXaR6A

  • @salmankhandu3819
    @salmankhandu3819 9 months ago

    When we get data from a site and provide it to the LLM for extraction, how can we manage large amounts of data? The data goes to the LLM in chunks, so when there is a lot of it, some data might be truncated.

  • @thomaslyngesen7221
    @thomaslyngesen7221 1 year ago +1

    I find the data returned is not valid; for instance, an article title does not match its summary. Can you comment a little more on the schemas? For example, is the naming of the items important?

    • @devlearnllm
      @devlearnllm 1 year ago

      Sure, which site are you scraping?

    • @thomaslyngesen7221
      @thomaslyngesen7221 1 year ago

      @devlearnllm I get pretty good results with your basic 'news' schema, but nothing with the 'e_commerce' schema, which also seems more detailed. Are you mirroring the item names used on the site you want to scrape?

    • @devlearnllm
      @devlearnllm 1 year ago

      @thomaslyngesen7221 For e-commerce sites, it's quite challenging on the scraping side of things to deliver clean data to the LLM to extract. App Sumo is an easy site to scrape, but Amazon or Best Buy seems more challenging. It'll take some experimentation to get them to work.

    • @devlearnllm
      @devlearnllm 1 year ago

      Make sure to pull my latest code and only scrape for the tag. Then the titles should be accurate. Thanks for pointing this out.

  • @funnyperson4016
    @funnyperson4016 1 year ago

    If I have a list of URLs to scrape, and a website behind a login and password with keywords, an overall score, and other variables I don't need, will this be able to scrape all the keywords from all the URLs into a single CSV file?

  • @Guy-Scott
    @Guy-Scott 1 year ago

    I also printed out the content in the extract function, which is just plain text. How can OpenAI, with just plain text and a schema, convert that plain text to a JSON file? I mean, how does it know where another news_headline or news_short_summary starts?

    • @devlearnllm
      @devlearnllm 1 year ago

      The OpenAI Functions call is encapsulated in LangChain's chain.
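
To unpack that a bit: the chain sends the schema to OpenAI as a function definition and forces the model to "call" it, so the model returns JSON arguments shaped by the schema; the model itself decides, from the text, where one headline or summary ends and the next begins. A rough sketch of the underlying call (using the pre-1.0 openai SDK; the function name is illustrative, and LangChain's real chain also wraps the schema so multiple items can come back as an array):

```python
import json

import openai

schema = {
    "type": "object",
    "properties": {
        "news_headline": {"type": "string"},
        "news_short_summary": {"type": "string"},
    },
}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Extract the article info from: <cleaned page text>"}],
    functions=[{"name": "extract_news", "parameters": schema}],
    function_call={"name": "extract_news"},  # force the model to return JSON arguments
)
arguments = response["choices"][0]["message"]["function_call"]["arguments"]
print(json.loads(arguments))  # values the model filled in for the schema fields
```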

  • @rajnishadhikari9280
    @rajnishadhikari9280 6 months ago

    Can you do the same using an open-source LLM like Llama 3?

  • @HiteshGautam-v6y
    @HiteshGautam-v6y 8 months ago

    Can we scrape deep links of a website as well? For example, scraping the About Us page of a website that was found from its home page. If you can, please post about it.

  • @matheusduzziribeiro5637
    @matheusduzziribeiro5637 9 months ago

    I'm trying to scrape WSJ but I got this error: "RuntimeError: no validator found for , see `arbitrary_types_allowed` in Config". Do you know what this could be?

    • @andrew54292
      @andrew54292 8 months ago

      Did you ever figure that out?

  • @hishamazmy8189
    @hishamazmy8189 7 months ago

    amazing

  • @atrocitus777
    @atrocitus777 1 year ago

    Is this worth doing for data you want to scrape that's behind CAPTCHAs?

    • @devlearnllm
      @devlearnllm 1 year ago

      I haven't tried that yet, but it probably requires some modifications on the Chromium and scraping side (not the extraction side).

    • @atrocitus777
      @atrocitus777 1 year ago

      @devlearnllm OK, I know there are CAPTCHA-solving providers like 2Captcha, but there are also more advanced solutions offered by Bright Data and ScraperAPI. There aren't a lot of video tutorials about those services, but I think this could be pretty powerful when integrated with tools like those.

  • @viktorvegh7842
    @viktorvegh7842 1 year ago

    Don't you have problems with website security? I tried to scrape some websites and I got an IP ban.

  • @CarlChristiansen-ps5ov
    @CarlChristiansen-ps5ov 7 months ago

    I tried to post a comment about a problem I ran into, but for some reason it doesn't show up in the comments? Anyone know why 😅

  • @HappyDataScience
    @HappyDataScience 1 year ago

    If you don't mind, please change the theme.

  • @Ryan-yj4sd
    @Ryan-yj4sd 1 year ago +2

    Nice video. This is totally unscalable, expensive, and very slow. Websites don't change much. You're far better off asking the AI to write a good scraping bot rather than feeding HTML into the bot. 😊

    • @devlearnllm
      @devlearnllm 1 year ago

      For now, everything you said is true (except that websites don't change much: competitors' websites, or listings on JS-heavy websites, change all the time). Over time, we'll see LLM calls become cheaper and faster.
      And how different is asking ChatGPT to write a scraping bot from an LLM call, really?

    • @Ryan-yj4sd
      @Ryan-yj4sd 1 year ago +1

      Feeding in the entire HTML is slow and inefficient. I do some professional scraping, and most of my clients' scrapers run for years with almost no maintenance.

    • @Ryan-yj4sd
      @Ryan-yj4sd 1 year ago +2

      @devlearnllm My suggestion is to use an LLM to make updates to a real scraper on the fly, rather than blindly feeding in 4000 characters of text and asking the LLM to extract. LLM self-attention is O(n^2) in context length, and no cost reduction will solve that, so keeping the context length as low as possible is always important.

    • @devlearnllm
      @devlearnllm 1 year ago

      @Ryan-yj4sd I don't know what you mean by LLM context length being O(n^2), but the output length is what determines the amount of time it takes to generate. It doesn't matter if the prompt is long or short.
      I do like the idea of updating a scraper on the fly, though. It might end up needing as much HTML as possible to generate new code or a new schema accurately anyway.
      But you gave me a better idea: what if you still push the HTML to the LLM once, create a scraper or schema (like you said), and keep using it until the website changes? That's where one could put in an evaluator of some sort (another small LLM call, perhaps?) to check the scraper's work. If the results are poor (you can define what's good/not good for the LLM evaluator), then we run the first step again.
      Thoughts?

    • @Ryan-yj4sd
      @Ryan-yj4sd 1 year ago

      @devlearnllm The algorithm complexity is O(n^2). In other words, each token sits in a double loop. Of course the input length matters! I double-checked as well:
      For transformer-based models like GPT, the primary computational concern is the self-attention mechanism, whose cost is driven mainly by sequence length.
      The self-attention mechanism in a transformer scales as O(n^2 * d), where:
      - n is the number of tokens in the sequence.
      - d is the dimension of the model (i.e., the number of features or hidden units at each layer).
      The quadratic term (n^2) arises from the pairwise comparisons between tokens when calculating attention scores: for each token, the model computes attention scores with every other token.
      Given this, the time taken by self-attention is roughly proportional to the square of the input length (keeping other factors like model dimension and hardware constant). In other words, if you double the input length, you can expect roughly a fourfold increase in the time taken by the self-attention calculations.
      In practice, other factors also influence total processing time, including hardware efficiency, batch processing, and parts of the model that don't scale quadratically. Still, the quadratic relationship is a good rough estimate of how transformers scale with sequence length.
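
Separately, the caching-plus-evaluator idea floated a few replies up can be sketched like this (every function here is a placeholder for illustration, not an existing API): build a site-specific extractor with one expensive LLM call, reuse it cheaply, and only rebuild when a lightweight check flags stale output.

```python
from typing import Callable, Dict, List

Extractor = Callable[[str], List[Dict]]   # html -> extracted rows

def build_extractor_with_llm(html_sample: str) -> Extractor:
    # Expensive step: ask the LLM to produce selectors/code for this site.
    raise NotImplementedError

def looks_healthy(rows: List[Dict]) -> bool:
    # Cheap step: heuristics or a small LLM call that sanity-checks results.
    return bool(rows) and all(r.get("news_headline") for r in rows)

_cache: Dict[str, Extractor] = {}

def scrape(site: str, html: str) -> List[Dict]:
    if site not in _cache:
        _cache[site] = build_extractor_with_llm(html)
    rows = _cache[site](html)
    if not looks_healthy(rows):                     # site layout probably changed
        _cache[site] = build_extractor_with_llm(html)   # regenerate once
        rows = _cache[site](html)
    return rows
```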

  • @SurajSingh-y3n3e
    @SurajSingh-y3n3e 5 months ago

    Bro, I watched a 4-minute ad before getting to the actual video.

    • @devlearnllm
      @devlearnllm 5 months ago

      That's crazy. Let me see if I can change that somehow

  • @dxvfdfx
    @dxvfdfx 1 year ago

    How much do you need to pay for OpenAI Functions if you call it 1000 times?

    • @devlearnllm
      @devlearnllm 1 year ago

      Call it 1000 times and share it with everyone.