Finally a video that I can enjoy without that background noise. Thanks a lot, and please continue without background music!
Interesting, I'm going to give this a go. I've experimented with pydantic for parsing LLM output into JSON, so this is super relevant right now. Thanks Greg, great explainer as always.
Glad it helped!
Greg, great video as always! I achieved the same results by including the desired output in JSON format along with the initial prompt itself, without using the Kor library.
But you have to put all your JSON format text into the prompt, right?
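For anyone curious what the prompt-only approach looks like, here's a minimal sketch: embed the desired schema in the prompt and parse the reply as JSON. The function and field names are made up for illustration; models sometimes wrap JSON in markdown fences, so the parser strips those first.

```python
import json

def build_extraction_prompt(text: str, schema: dict) -> str:
    # Embed the desired output schema directly in the prompt,
    # instead of using Kor to generate it.
    return (
        "Extract the following fields from the text and reply "
        "with ONLY valid JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        f"Text: {text}"
    )

def parse_llm_json(raw: str):
    # Models sometimes wrap JSON in ```json fences; strip them first.
    cleaned = (
        raw.strip()
        .removeprefix("```json")
        .removeprefix("```")
        .removesuffix("```")
        .strip()
    )
    return json.loads(cleaned)

schema = {"name": "string", "city": "string"}
prompt = build_extraction_prompt("Bob lives in Boston", schema)
# Simulated model reply, fenced the way chat models often format it:
result = parse_llm_json('```json\n{"name": "Bob", "city": "Boston"}\n```')
```

The trade-off the thread is pointing at: this works, but you re-implement prompt building, parsing, and fence stripping yourself, which is roughly what Kor packages up.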
Thank you, Greg, for this informative video on using LLMs to extract data from text! I found it particularly valuable for its potential application in skill/information extraction from resumes/CVs submitted to large companies. I also noticed a minor error in the
original code:
"""
output = chain.predict_and_parse(text="...")['data']
printOutput(output)
"""
updated code:
"""
output = chain.run(text="...")['data']
print(output)
"""
Your channel is gold! Thanks a lot for all these tutorials.
Thanks, this was super useful! I would love to get some insight into the feedback you got from those 80 companies.
Most people either wanted the data for investment or sales use cases
@@DataIndependent I am developing a few small tools for a recruitment bureau, so I'm interested; what you mentioned seemed relevant.
Great lectures. Thank you for sharing them with us for free. Thumbs up!
Thank you! I'll also explore function calling for extracting information.
Thanks Greg, this is very relevant, will give Kor a try!
I really liked your video "The Data Learning Journey (Part 1)", and am hoping you will post Part 3 soon.
Great introduction. Perfect pacing. I'm going to do some further research to see if I can figure out a way to use Kor with a local language model, since I deal with confidential patient data in a healthcare setting.
I wonder the same thing; some letters in the Turkish language are problematic.
I'm trying to do it, and it's not working; the model (using Kor) is acting very stupid.
Thanks Greg, this was really helpful!
This was awesome content
very insightful - thank you
Awesome! I need to add another level to this which is openai function calling
Hey Greg, at 7:54: what is the "many=True" attribute in the Text class? Can you please explain it in a bit more detail?
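Not Greg, but as I understand the Kor docs, `many=True` marks a node as repeatable: the extractor may return a list of matches for that field instead of a single value. A rough illustration of the two output shapes, using plain dicts (no Kor needed; field names are hypothetical):

```python
# With many=False, the node is expected at most once; with many=True,
# every match in the input text is collected into a list.
single_shape = {"data": {"person": {"name": "Bob"}}}                  # many=False
many_shape = {"data": {"person": [{"name": "Bob"}, {"name": "Ann"}]}}  # many=True

# Downstream code therefore has to iterate when many=True:
names = [p["name"] for p in many_shape["data"]["person"]]
```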
Come on, why did you steal my idea 😅. I was literally thinking about how to scrape a YouTube channel's data using LLMs. I was looking for the info. You came right on time!
There's a video from James Briggs, IIRC, that does Q&A against a knowledge base of a YouTube channel's video transcripts. Not sure if it was an available dataset or if he extracted the transcripts himself. Hope that helps.
@@adumont Oh thanks. I will look it up
Why is everyone making this? 😂
In the 3rd cell of the Kor Hello World example, the call 'output = chain.predict_and_parse(text=(text))["data"]' must be replaced with 'output = chain.run(text=(text))["data"]' because 'predict_and_parse' has been deprecated.
Yikes - thanks for the catch. I would also recommend looking at function calling from OpenAI in case you want to see a different approach.
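For reference, the function-calling approach amounts to declaring the extraction schema as a JSON Schema tool definition that is sent with the request, so the model replies with structured arguments instead of free text. A sketch of just the payload (no API call here; the field names are illustrative):

```python
# A function definition in the shape the OpenAI chat completions API
# expects: name, description, and JSON Schema parameters.
extract_person = {
    "name": "extract_person",
    "description": "Extract people mentioned in the text",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "city": {"type": "string"},
        },
        "required": ["name"],
    },
}
# This dict would be passed as functions=[extract_person] (legacy) or
# wrapped as tools=[{"type": "function", "function": extract_person}].
```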
Wow!! it's magic
awesome
That's really interesting. Would it be easy (maybe using LangChain) to define required attributes or elements in the schema, and if the LLM can't extract them, have it start a Q&A with the user to ask for the missing elements and attributes until the required fields are complete? That would be awesome for launching follow-up actions, for example.
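One way to sketch this without any special LangChain support: validate the extraction result against a required-field list, and generate a follow-up question for whatever is missing, looping until nothing is. Field names here are hypothetical:

```python
REQUIRED = {"name", "email", "start_date"}

def missing_fields(extracted: dict) -> set:
    # A field counts as missing if it is absent or empty.
    return {f for f in REQUIRED if not extracted.get(f)}

def followup_question(missing: set) -> str:
    # Shown to the user; the caller would loop, merging answers into
    # `extracted`, until missing_fields() returns an empty set.
    return "Could you also provide: " + ", ".join(sorted(missing)) + "?"

gaps = missing_fields({"name": "Bob", "email": ""})
```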
Hey Greg, thanks for this video!
Since there is a limit on accessing the OpenAI API without paying, how can the above implementation be carried out with other open-source LLMs?
Where is the "sign up" you mentioned? This seems very interesting for many applications.
Whoops! I'll put it in the description. This was it:
www.openingattributes.com/
@@DataIndependent I am very impressed as were all my work friends.
This is precisely what I need for my project, but like you said, the cost can spiral out of control. Have you tried it with GPT-3.5? If so, how unreliable was it?
Newbie here: I don't understand why you would need to use the library for this task. Couldn't you just specify the exact output and formatting you need in your LLM prompt? Cheers! 😊
Basically, this abstracts away all the extra work needed for formatting and text extraction and lets you focus on your business logic.
Fantastic tutorial! It would be great to see another tutorial using "transformers" instead of OpenAI, with Chroma or any local database... and how would you save the extracted information? Does Kor tokenize that information, etc.?
Hi Greg, thank you for the great video! How would you go about extracting "tags" or predefined values rather than free-form string text? Especially if the values number in the thousands and are too many to just feed into the prompt (token optimization, etc.). Any ideas? Thank you!
Hmm, good question; check out this tutorial and code.
In cell 15 I have a schema for tags that may be helpful: github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Topic%20Modeling%20With%20Language%20Models.ipynb
ruclips.net/video/pEkxRQFNAs4/видео.html
I'm having some problems running this with Ollama local models (I tried Llama 3.1 and NuExtract) and it's not working... The output has a lot of repetitive info.
After close inspection, it seems the local LLMs don't understand the (somewhat complex) prompt generated by Kor.
I think we'll have more such prompt-based tooling available sooner or later. Any other specific tools you are experimenting with?
Incredible. Million-dollar question 😊: how do you "teach" ChatGPT the schema just once and then validate unlimited texts, without spending tokens on the schema in every prompt and without training the model via fine-tuning?
damn, that's cool
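For what it's worth, I don't think this is possible with a stateless API: the schema tokens have to accompany every request unless you fine-tune (or your provider offers prompt caching, which discounts repeated prefixes rather than eliminating them). What you can do is build the schema prefix once and reuse it across calls; a trivial sketch:

```python
# Built once at startup; still sent (and billed) on every request,
# but never re-authored. Field names are illustrative.
SCHEMA_PREFIX = (
    "Extract {name, date, amount} from the text below "
    "and reply with JSON only.\n"
)

def build_prompt(text: str) -> str:
    # Only the text part varies between requests.
    return SCHEMA_PREFIX + "Text: " + text

prompt = build_prompt("Invoice from Bob, $100, Jan 3")
```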
This looks like a really interesting approach. @DataIndependent any ideas on the best approach for using tabular data (whether from a pandas DataFrame, a PySpark DataFrame, or a SQL table) in conjunction with LLMs? What about combining tabular data with text documents?
Use the pandas agent.
Hey Greg, are you sure this doesn't work well with GPT-3.5?
Can we extract important content from a research paper? Like some text from the abstract and some from the results or an ablation table. Can you make a video about how to customize that text extraction to Google Sheets?
Does anyone know how to do this with an LLM model loaded from transformers?
Hello, how do you connect LangChain not to ChatGPT but to local chatbots via their localhost names?
Is that few-shot NER? 🤔
Yeah, the LLMs are pretty good at it now.
Can I use it to extract events from text using Hugging Face or any other open-source LLM?
Yes, just swap out your model of choice when you make your LLM.
How can I extract the data from an API output as JSON?
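If the API already returns JSON, you typically don't need an LLM at all; parse the response body directly with the stdlib. The field names below are made up for illustration:

```python
import json

# Sample API response body (a JSON string, e.g. from response.text)
raw = '{"items": [{"name": "Widget", "price": 9.99}]}'

data = json.loads(raw)                         # str -> dict
prices = [item["price"] for item in data["items"]]
```

An LLM only earns its keep here when the payload is unstructured text, not when it's already machine-readable.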
Can you please tell me, if I want to give word embeddings or a vector DB instead of text, how can I do that?
What do you mean? could you explain more?
@@DataIndependent Thank you for your reply :)
I am working on a problem where I extract text from websites like Amazon and McDonald's using web scraping and give that raw text to OpenAI so it can extract products or food items along with their price, ratings, discount, etc.
The first problem is that I can't give all the text to OpenAI at once because of the token limit.
Is there a way I can feed the text in chunks?
The second thing is that, to improve performance, instead of giving raw text to OpenAI I want to give embedding vectors of that text via OpenAI embeddings.
I used RetrievalQA and the character text splitter in LangChain to solve this in my previous approach, but how can I do the same with the approach you showed in this video?
Please give me a solution. Thank you for your time ☺️
I saw your videos on token limits and embeddings, but I want to combine those two ideas and ask a query with the help of the Kor library so that I can get the output in a structured format.
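For the token-limit part of the question, the usual pattern is to split the scraped text into overlapping chunks, run the extraction chain on each chunk, and merge the per-chunk results. A stdlib-only sketch of the splitting step (LangChain's CharacterTextSplitter does roughly this, with smarter boundary handling):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100):
    # Slide a window of chunk_size characters, stepping back by
    # `overlap` so entities straddling a chunk boundary are not lost.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = split_text("x" * 2500, chunk_size=1000, overlap=100)
# Each chunk would then go through chain.run(text=chunk)["data"]
# separately, with the extracted records concatenated afterwards.
```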
Is there a way to read an entire PDF with Langchain and Kor?
Oh ya, big time, use a PDF loader and you’re good to go.
In my “question a book” video I read a pdf this way
@@DataIndependent thanks watching that video right now.
@@DataIndependent after watching that video, do I need to use a vector database or can I just use the PDF loader and pipe that directly into Kor?
Can anyone help me with this error [initial_value must be str or None, not dict] while executing chain.predict_and_parse?
Same
I tried `chain.run()` and it worked:
output = chain.run(text=(text))["data"]
printOutput(output)
Is there an existing tool that cuts low-signal text?
What kind of low signal text?
@@DataIndependent It's the term you used for (probably) "filler words": words that do not carry much meaning.
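Not aware of a dedicated tool, but a crude version of filler-word removal is just a stopword filter. A minimal sketch with a hand-made filler set (real pipelines use curated lists, e.g. NLTK's stopword corpus):

```python
FILLERS = {"um", "uh", "like", "basically", "actually", "so"}

def strip_fillers(text: str) -> str:
    # Drop tokens found in the filler set; keep everything else in order.
    return " ".join(w for w in text.split() if w.lower() not in FILLERS)

cleaned = strip_fillers("Um so the model basically works")
```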
Hi, may I know if it works with LinkedIn?
Totally - you just need to access their data somehow
pip install kor? His document doesn't specify...
Yes! I don't run through the dependencies because it's different for everyone. Especially with sub packages.
This is the wrong approach, IMHO. You have to use the output as text and not as an object. If you convert it to an object, you lose the ability to stream the output, which is a main feature of these LLMs. If you want to structure your text, you have to go with Markdown. Not to mention that the translation into an object is never deterministic, due to the nature of LLMs, and you could get something unusable for your front end.
Wait, which point exactly are you talking about? You've got me a bit confused here.
you painted!
m'ke?
It's just too expensive to offer a viable product with OpenAI.
Ada-002 is $0.0004 per 1K tokens...