Additional detail about scraping: on some sites (like WSJ or CNN), scraping for only certain tags yields the best results. Other sites might be different.
I’m working on a couple of AI projects with Malik Yusef, Kanye’s main collaborator and one of Virgil’s first mentors. We should connect, lmk 🙏🏼
Refreshing to not see some bs clickbait video on LLM-uses. Just a clean, focused, and super differentiated walkthrough-video. Subscribed, and looking forward to more!
Thank you. I'm glad that this approach resonates with people.
This video feels like a coworker showing me something cool. Really good video man!
With AI assistance I can scrape hundreds of thousands of products/services a week and now have the facilities to talk to thousands of people at once. Learnt most of it from youtube from people such as yourself who are grossly underappreciated. Keep up the good work and thanks for sharing!
Thanks for the appreciation then. We love it.
This is exactly what I was looking for: a way to scrape websites like a human being, done via scripting.
Also, I like how you explain things clearly and how they work. I found this channel by accident, and decided to watch it. The next thing I knew, I'm a new subscriber!
You are now officially a real youtuber.
Just a slight correction: openai_api_key is a property of the llm object in LangChain. It's not a global variable.
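For anyone reading later, a minimal sketch of what that looks like (the model name and key string are placeholders):

```python
# Minimal sketch: in LangChain the key is passed per llm instance,
# not read from a global variable. Key and model are placeholders.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    temperature=0,
    model="gpt-3.5-turbo-0613",
    openai_api_key="sk-...",  # a property of this llm object only
)
```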
Well done for explaining the why so clearly. You had me in the first minute.
🎯 Key Takeaways for quick navigation:
02:05 🌐 You can scrape websites using LangChain, OpenAI Functions, Playwright, and Beautiful Soup.
03:55 🧩 OpenAI Functions simplify web scraping by eliminating the need to manually declare HTML tags.
05:20 🛍️ You can use this approach to scrape e-commerce websites and extract specific information like item titles and prices.
15:41 🤖 LangChain simplifies interactions with OpenAI's GPT models for various applications, including information extraction.
23:32 ⚙️ Consider chunking large HTML content and building a FastAPI server to enhance this web scraping tool's capabilities.
Made with HARPA AI
Note: 'kwargs' usually stands for keyword arguments; normally we call them "keyword args". Nice vid. :)
Thank you haha. So obvious in hindsight.
Very cool stuff! Like the style of narration focusing on conveying information in a straightforward and matter-of-fact manner, without overemphasizing or exaggerating.
Great stuff! Keep making great AI coding content; you got my sub!
Good video. Thanks for taking the time to explain the nuances in depth. You've got my sub ha
First video and I like this channel already!🙂
Great video!! Thank you for sharing. I liked how you simplified the code and explanation. Your project really makes sense, as webpages do change their structure and traditional approaches may break because of those changes.
Good luck dude, just keep doing what you're doing.
That's a high quality vid right there.
Thank you so much! I didn't know how to implement this and I bumped into your video. Such a saver!
Great job, you are a real you-tuber and I can tell that you will become very popular. 😮🎉
You are the King
Yup, straight to the point
TY
Great video.
I guess modifying this to use a local LLM should be easy, right?
bro codes in light mode...respects
The beginning started mid-sentence. Did I miss where you explained how AI will keep us from having to rebuild the scraping code when the website changes?
Nice and simple. Thanks!
Hi, great video! I've implemented a similar approach and I wanted to see yours, which has given me new inspiration that I'm very grateful for, so thank you! Why did you use Python, given that you mentioned you're from the JavaScript/TypeScript world?
Yup, my background was in JS / React
This is awesome content btw!
It was quite useful for me.
Thank you for your video and resource! I am trying to build a web app that finds news articles with different standpoints on a chosen topic. Would this code be a good solution for scraping news, or is it better suited to scraping more security-tight websites (since it uses Chromium)? I see the waiting time is quite long, too. What LangChain solution/module would you recommend for my project?
Hey there, the wait time is mostly on the LLM part, not the scraping part. You can definitely use this to scrape news sites. LangChain has the OpenAI Functions extraction chain, which has a nice input parser for extraction. All you have to do is define your schema for scraping, then off you go 🚀
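Roughly, the whole thing boils down to something like this (a sketch; the schema field names are illustrative examples for a news site, not required names):

```python
# A sketch of the extraction chain mentioned above. The schema fields
# are example names; use whatever fits the site you're scraping.
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

schema = {
    "properties": {
        "news_article_title": {"type": "string"},
        "news_article_summary": {"type": "string"},
    },
    "required": ["news_article_title"],
}

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
chain = create_extraction_chain(schema=schema, llm=llm)

extracted = chain.run("...plain text scraped from the news page...")
print(extracted)  # a list of dicts shaped like the schema
```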
I’m on a quest to use an LLM for web scraping without identifying HTML tags. You gave a lot of valuable background information. You referred to “Python things” and talked as though you have experience with NodeJS. Why didn’t you use LangChainJS, Puppeteer, and Cheerio? How difficult would it be to rewrite your repo for NodeJS?
As a dev with a JS background, how has your experience with Python been? Why did you move to Python instead of using LangChainJS? Comparing LangChainJS vs. LangChain Python, do you miss many features going from one framework to the other? Have you ever faced an issue in JS that you could only solve in Python?
Thanks
I wonder how this would handle dynamic content: as in scraping websites where you have to click stuff to reveal valuable content.
11:49 Is it safe to remove other tags? It's recommended that web pages contain elements such as section, article, main, menu, header, footer, etc., not to mention h(n), label, span, and aria attributes. I know many pages out there don't follow the "correct" syntax, but I suppose that especially on huge websites we'll commonly find those patterns. So would removing other tags not affect the result we expect from the AI integration?
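For context, the cleanup step being discussed usually looks something like the sketch below: it strips known-noise tags rather than whitelisting a few, so the semantic elements listed above survive. The exact tag list here is an assumption, not the video's:

```python
# Hedged sketch of a tag-cleanup pass: remove tags that rarely carry
# extractable content and keep everything else (section, article,
# header, span, etc.) intact for the LLM to read.
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "noscript"]):
        tag.decompose()  # delete the tag and its contents
    return soup.get_text(separator=" ", strip=True)
```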
Hey @llmschool! This is very insightful, and it got me wondering if we can extract security ownership from DEF 14A filings. The difficulty is that each filing has a different structure; can the LLM handle that?
You can try. Let us know how it goes.
Why, when I try to use functions under the new Assistants GPT builder, does it keep telling me the JSON is invalid? And I can't paste Python or JavaScript in there to be able to scrape the web.
Amazing Amazing
Feeding all the HTML to the LLM might exhaust its context length pretty quickly.
Where are you using the OpenAI function calling functionality? Isn't OpenAI function calling supposed to call a specific function inside your program? Or am I missing something?
Fire!!
I am getting an error: "TypeError: Parameters to generic types must be types. Got {'properties': {'item_title': {'type': 'string'}, 'item_price': {'type': 'number'}, 'item_extra_info."... Can you help? Thanks in advance.
Hey, without looking at your code I'm not sure why that's the case. But I merged my code into LangChain (Python) a couple of weeks ago for this use case, and you can follow the guide here: python.langchain.com/docs/use_cases/web_scraping/
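The flow in that guide is roughly the sketch below (the URL and tags_to_extract values are placeholders; adjust per site):

```python
# Roughly the pipeline from the linked LangChain guide. The URL and
# tags_to_extract values are placeholders, not recommendations.
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer

loader = AsyncChromiumLoader(["https://news.example.com"])
docs = loader.load()  # renders the page with headless Chromium

bs_transformer = BeautifulSoupTransformer()
docs_clean = bs_transformer.transform_documents(docs, tags_to_extract=["span"])

# docs_clean[0].page_content is the text handed to the extraction chain
```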
Thanks for this. Just a quick question: how do I approach this if I have, like, 300 website links to scrape?
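One way to approach it, sketched below: wrap the single-URL scrape in an async function and fan out with a concurrency cap so you don't hammer the sites. scrape_one here is a hypothetical stand-in for whatever scrape-plus-extract call you already have, not code from the repo:

```python
# Sketch for batching many URLs. scrape_one is a hypothetical stand-in
# for an existing single-URL scrape + extract function.
import asyncio

async def scrape_one(url: str) -> dict:
    await asyncio.sleep(0.1)  # placeholder for the real scrape/extract work
    return {"url": url}

async def scrape_all(urls: list[str], max_concurrency: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrency)  # polite concurrency cap

    async def bounded(url: str) -> dict:
        async with sem:
            return await scrape_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(300)]
results = asyncio.run(scrape_all(urls))
```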
Would you know how to scrape PDF documents (download and sort into files) from a website that has a database that is constantly updating? If this is something you can do, I'd love to have a chat and would pay you for your time.
I am a beginner in this realm, and would love to figure this out.
For sure. You can reach out to me on LinkedIn: www.linkedin.com/in/haiphunghiem/
Or chat with me on LangChain Canada's Discord: discord.gg/rtKE2g266C
(my username is toasted_shibe)
# til
What is it for? For what purpose?
Why use Playwright? Can't you use Selenium instead?
Lolz at the neighbor's trash 😄
The worst.
Hello sir, I’m building commercial software, and I want to ask your permission before I use your code.
Would it be okay if I cloned your code and used it as a part of my software?
(I am very impressed by what you have built that’s why I’m interested in using it myself)
For sure. I'm flattered. And thanks for asking as well. Please credit me (my name and this video) if you don't mind.
@@devlearnllm Thanks! I’ll make sure to include your name (author) in the documentation and a link to the video! 🙏
I've been trying to use this in a Django web app using Celery, but I've been getting coroutine errors. I managed to bypass that with the async_to_sync function, but now the task keeps executing without giving any results.
What can I do?
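In case it helps anyone hitting the same thing, a hedged sketch of one common workaround: give each Celery task its own event loop with asyncio.run instead of async_to_sync, so the coroutine actually completes inside the worker. The broker URL and function names below are illustrative only:

```python
# Hedged sketch: run the async Playwright scrape on a fresh event loop
# per Celery task. Broker URL and names are illustrative placeholders.
import asyncio
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")

async def scrape(url: str) -> str:
    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html

@app.task
def scrape_task(url: str) -> str:
    # asyncio.run creates and closes its own loop, which avoids the
    # "coroutine never awaited" / silently-hanging-task failure modes.
    return asyncio.run(scrape(url))
```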
Sign up for the upcoming AI Agents Master Course: forms.gle/YuMvqfXo6xXUXaR6A
When we get data from a site and provide it to the LLM for extraction, how can we manage large data? Since the data goes to the LLM in chunks, when there's a lot of it, some data might be truncated.
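One common pattern, sketched below: split the cleaned page text into token-bounded, overlapping chunks, run extraction on each chunk, and merge the results. The chunk sizes and schema here are illustrative, not tuned values:

```python
# Sketch: chunk long page text so nothing is silently truncated, then
# run the extraction chain per chunk and merge. Sizes are illustrative.
from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

schema = {"properties": {"news_headline": {"type": "string"}}}
chain = create_extraction_chain(schema=schema, llm=ChatOpenAI(temperature=0))

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=3000,    # stay under the model's context window
    chunk_overlap=200,  # overlap so items spanning a boundary aren't lost
)

page_text = "...long scraped page text..."
results = []
for chunk in splitter.split_text(page_text):
    results.extend(chain.run(chunk))  # merge per-chunk extractions
```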
I find the data returned is not valid; the article title does not match its summary, for instance. Can you comment a little more on the schemas? Like, is the naming of items important?
Sure, which site are you scraping?
@@devlearnllm I get pretty good results with your basic 'news' schema, but nothing with the 'e_commerce' schema, which also seems more detailed. Are you mirroring the item names used on the site you want to scrape?
@@thomaslyngesen7221 For e-commerce sites, it's quite challenging on the scraping side of things to deliver clean data for the LLM to extract from. AppSumo is an easy site to scrape, but Amazon or Best Buy seem more challenging. It'll take some experimentation to get them to work.
Make sure to pull my latest code, and only scrape for the specific tag. Then the titles should be accurate. Thanks for pointing this out!
If I have a list of URLs to scrape, and a website behind a login and password with keywords, an overall score, and other variables I don't need, will this be able to scrape all keywords from all URLs into a single CSV file?
I also printed out the content in the extract function, which is just plain text. How can OpenAI, with just plain text and a schema, convert that plain text to a JSON file? I mean, how does it know where another news_headline or news_short_summary starts?
The OpenAI Functions call is encapsulated in LangChain's chain.
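To unpack that a bit: under the hood, the chain turns your schema into an OpenAI function definition, and the model itself decides where one item ends and the next begins when it fills in the function arguments as JSON. A rough sketch of the raw call (field names and model are illustrative; this uses the pre-1.0 openai SDK):

```python
# Rough sketch of what the chain sends to OpenAI: the schema becomes a
# function definition, and the model returns its arguments as JSON.
import json
import openai

openai.api_key = "sk-..."  # placeholder

functions = [{
    "name": "information_extraction",
    "description": "Extract news items from the given text.",
    "parameters": {
        "type": "object",
        "properties": {
            "news_headline": {"type": "string"},
            "news_short_summary": {"type": "string"},
        },
        "required": ["news_headline"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "...plain page text..."}],
    functions=functions,
    function_call={"name": "information_extraction"},  # force this "function"
)

# The model never runs a function; it just emits structured arguments.
args = json.loads(response.choices[0].message.function_call.arguments)
```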
Can you do the same using an open-source LLM like Llama 3?
Can we scrape deep links of a website as well? Like scraping the About Us page of a website, found from its home page. It'd be great if you could post about that.
I'm trying to scrape WSJ but I got this error: "RuntimeError: no validator found for , see `arbitrary_types_allowed` in Config". Do you know what this could be?
Did you ever figure that out?
amazing
Is this worth doing for data you want to scrape that's behind captchas?
I haven't tried that yet, but it probably requires some modifications on the Chromium and scraping side (not the extraction side).
@@devlearnllm OK, I know there are captcha-solving providers like 2captcha, but then there are more advanced solutions offered by Bright Data and ScraperAPI. There aren't a lot of video tutorials about those services, but I think this could be pretty powerful when integrated with tools like those.
Don't you have problems with website security? I tried to scrape some websites and got an IP ban.
Don't go overboard then lol
I tried to upload a comment about a problem I ran into, but for some reason it doesn't show in the comments? Anyone know why 😅
If you don't mind, please change the theme.
Nice video. This is totally unscalable, expensive, and very slow, though. Websites don't change much. You're far better off asking the AI to write a good scraping bot than feeding HTML into the bot. 😊
For now, everything you said is true (except that websites don't change much; competitors' websites, or listings on JS-heavy websites, change all the time). Over time, we'll see LLM calls become cheaper and faster.
And how much different is asking ChatGPT to write a scraping bot from an LLM call?
Feeding in the entire HTML is slow and inefficient. I do some professional scraping, and most of my clients' scrapers run for years with almost no maintenance.
@@devlearnllm My suggestion is to use the LLM to make updates to a real scraper on the fly, rather than blindly feeding in 4000 characters of text and asking the LLM to extract. LLMs are O(n^2) in context length, and no cost reduction will solve this issue. So keeping the context length as low as possible is always important.
@@Ryan-yj4sd I don't know what you mean by LLMs being O(n^2) in context length, but the output length is what determines the amount of time it takes to generate. It doesn't matter if the prompt is long or short.
I do like the idea of updating a scraper on the fly, though. It might end up needing as much HTML as possible to generate new code or a schema accurately anyway.
But you gave me a better idea: what if you still push the HTML to the LLM once, create a scraper or schema (like you said), and keep using it until the website changes? That's where one can put in an evaluator of some sort (another small LLM call, perhaps?) to check the scraper's work. If the results are poor (you can define what's good or not for the LLM evaluator), then we run the first step again.
Thoughts?
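Sketching that loop out, every helper below is a hypothetical placeholder, not code from the video or the repo:

```python
# Hypothetical sketch of the generate-once / evaluate / regenerate loop
# described above. All helpers are placeholders to show the control flow.

def fetch(url: str) -> str:
    """Fetch the page HTML (requests, Playwright, whatever fits)."""
    ...

def generate_scraper(html: str) -> str:
    """The one big LLM call: emit scraper code or a selector schema."""
    ...

def run_scraper(scraper: str, html: str) -> list[dict]:
    """Apply the cached scraper to a page; cheap, no LLM involved."""
    ...

def looks_good(results: list[dict]) -> bool:
    """Small LLM call (or plain heuristics) judging result quality."""
    ...

def scrape(url: str, cache: dict) -> list[dict]:
    html = fetch(url)
    if url not in cache:
        cache[url] = generate_scraper(html)
    results = run_scraper(cache[url], html)
    if not looks_good(results):              # site layout changed?
        cache[url] = generate_scraper(html)  # pay the big LLM call again
        results = run_scraper(cache[url], html)
    return results
```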
@@devlearnllm The algorithm complexity is O(n^2). In other words, each token sits in a double loop. Of course the input length matters! I double-checked as well:
For transformer-based models like GPT, the primary computational concern is the self-attention mechanism, whose complexity is primarily influenced by the sequence length.
The computational complexity of the self-attention mechanism in a transformer scales as O(n^2 × d), where:
- n is the number of tokens in the sequence.
- d is the dimension of the model (i.e., the number of features or hidden units at each layer).
The quadratic relationship (n^2) arises from the pairwise comparisons between tokens when calculating attention scores. For each token, the model computes attention scores with every other token, leading to the quadratic term.
Given this, the time taken by the model will be proportionally related to the square of the input length (keeping other factors like model dimension and hardware constant). In other words, if you double the length of the input, you might expect roughly a fourfold increase in the time taken by the self-attention calculations.
However, in practice, other factors can influence the total processing time, including hardware efficiency, batch processing, and other parts of the model that don't scale quadratically. Still, the quadratic relationship provides a good rough estimate for the scaling behavior of transformers with respect to sequence length.
Bro, I watched a 4-minute ad before jumping into the actual video.
That's crazy. Let me see if I can change that somehow
How much do you need to pay for OpenAI Functions if you call it 1000 times?
Call it 1000 times and share it with everyone.