Master PDF Chat with LangChain - Your essential guide to querying documents
- Published: 15 Jul 2024
- Colab: colab.research.google.com/dri...
Reid Hoffman's Book: www.impromptubook.com/
Free PDF: www.impromptubook.com/wp-cont...
In this video I go through how to chat with and query PDF files using LangChain and create a FAISS vector store.
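The notebook's flow (extract text, split into overlapping chunks, embed, index in FAISS, retrieve similar chunks, stuff them into a prompt) can be sketched roughly as follows. This is a sketch against the classic LangChain API used in the video; the filename and query are placeholders, and actually running it needs an OpenAI API key.

```python
def pdf_qa(pdf_path: str, query: str) -> str:
    """Answer a question about a PDF: extract, chunk, embed, retrieve, ask."""
    # Classic LangChain 0.0.x-era imports, as used in the video.
    from PyPDF2 import PdfReader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI

    # 1. Pull the raw text out of the PDF, page by page.
    reader = PdfReader(pdf_path)
    raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # 2. Split into overlapping ~1000-character chunks.
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1000,
                                     chunk_overlap=200)
    texts = splitter.split_text(raw_text)

    # 3. Embed the chunks and index them in an in-memory FAISS store.
    docsearch = FAISS.from_texts(texts, OpenAIEmbeddings())

    # 4. Retrieve the most similar chunks and "stuff" them into one prompt.
    chain = load_qa_chain(OpenAI(), chain_type="stuff")
    docs = docsearch.similarity_search(query)
    return chain.run(input_documents=docs, question=query)

# Example call (hypothetical file):
# print(pdf_qa("impromptu.pdf", "Who wrote this book?"))
```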
My Links:
Twitter - / sam_witteveen
Linkedin - / samwitteveen
Github:
github.com/samwit/langchain-t...
github.com/samwit/llm-tutorials
It's very generous of you, giving us the source code and explaining everything clearly. This is the kind of channel that deserves a subscription and a follow.
Amazing tutorial video! The pace is just perfect for learning. 👍Thanks!
Thank you so much, Sam.
This is brilliant content. Thanks Sam.
This is amazing, thanks for taking the time to do this
Very helpful, many thanks!
Thank you Sam, for your amazing explanation of the how and why of Q&A on PDFs using LangChain. Looking forward to more such developer-oriented educational videos.
Thanks a lot, your efforts are much appreciated
Brilliant stuff. Really fascinating explanation on how to customise your own AI.
Very nice explanation, great job!
great overview, thanks!
thank you for showing us this
Thanks, I understood it. A really fantastic video.
Outstanding video. Well done showing such a powerful technique.
Great video. Thanks
This is the video i was searching for
This is amazing and very well done. This channel has become my go to every morning. Thank you.
Thanks for the kind words, much appreciated.
Thanks so much!
Fantastic video! I appreciate the inclusion of a Colab project for us to experiment with. It would be amazing to see a similar tutorial on loading multiple PDFs from a Google Drive folder (e.g., "data"), recursively, into a Colab project, enabling interaction for creating outlines, glossaries, taxonomies, and more from multiple PDF sources. I'm interested in an approach resembling ChatGPT, where you can input a long passage of text and generate new content from it, going beyond semantic searches and summaries.
So basically you want an AutoPlagiarize ?
@@RogerBarraud It would only be considered "plagiarized" if you published it as-is without referencing the sources. I use ChatGPT to analyze scientific papers as a novice. For example, I would like to have a whole range of summary types available to synthesize the information: Abstract, Summary, Executive Summary, Briefing, White Paper, Report.
Fantastic stuff. You've got such a knack for describing this stuff. I hope the AIs spare you when they take over.
This is the plan :D
Love your videos bro
thanks, much appreciated
*New Subscriber* Great video! I am interested in learning more about how to load in multiple PDFs. Thanks
Great
🎉 Thanks for the great explanation. Can you explain the process with Open Assistant and a free vector store, and fine-tuning?
Sam - Thank you for this great conceptual explainer on the basic building blocks of leveraging LLMs with Langchain for our own content corpus. One question on the specific use of the PromptTemplate around 12:00 minutes into the video - Prompt has 2 dynamic variables in there named {context} and {question}. However, in the chain.run command, the variables being used are "input_documents" and "question". Where does the variable {context} get defined for the template to use and elaborate in its response?
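For anyone else wondering about this: the "stuff" chain joins the page contents of `input_documents` and binds the result to the `{context}` placeholder before formatting the template; `question` passes straight through. A rough pure-Python sketch of that mapping (illustrative names, not LangChain internals verbatim):

```python
template = (
    "Use the following pieces of context to answer the question.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)

def stuff_prompt(input_documents, question, template=template):
    # The chain joins the retrieved documents' text and binds it
    # to the {context} placeholder; `question` maps to {question}.
    context = "\n\n".join(doc["page_content"] for doc in input_documents)
    return template.format(context=context, question=question)

docs = [{"page_content": "LangChain is a framework for LLM apps."},
        {"page_content": "FAISS is a similarity-search library."}]
print(stuff_prompt(docs, "What is FAISS?"))
```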
Great work, looking forward to the free Hugging Face alternative instead of OpenAI.
Appreciate your work. You got my sub
Hey Sam, love your lectures! Any resources about free alternatives of OpenAI embeddings? Would be really useful! Thanks!
Great videos Sam. A lot of people are jumping on the bandwagon with LLMs, LangChain etc., but yours are clear and well constructed. FAB! I would love to see how you could use Pinecone as a replacement for the vector store you used, as I was unable to make it work with the one in the video.
Thanks for the kind words. Pinecone is an external VectorStore. I am planning a video on ChromaDB, and will look at making one about Pinecone as well. Pinecone has had issues with deleting people's data, but hopefully that is fixed.
Can you please do a video on LlamaIndex (previously GPT Index)? It offers such nice ways to query data over, for example, lists or trees of indices or knowledge graphs. Ultimately I'd use LlamaIndex to handle the indices and storing/retrieving data, then LangChain for the chat and chaining part. The cool thing about LlamaIndex is you can use different ways to retrieve relevant documents/data, for example using a support vector machine to get the top-k highest-matching docs instead of using cosine similarity and taking the top k of that. I think it would be nice to see because the examples provided by LlamaIndex can be a bit confusing, and the whole project deserves way more exposure!
Yes, already working on something like that. I wanted to get a few basic vids out of the way so I can refer to them if people have questions etc., but I totally agree LlamaIndex is interesting.
The part at 12 min is nice.
Brilliant video thanks Sam. Do you know if the LangChain Text Splitter would take titles into consideration when splitting the text? Titles often provide important contextual information, and preserving their relationship with the subsequent text is crucial for maintaining context and meaning.
No, this is just straight character splitting; every character is treated the same. Making a custom splitter is one of the things we are trying at work for a project that does this. That can be done with tools like spaCy etc.
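A title-aware splitter along those lines can be approximated in plain Python: track the most recent heading and prepend it to each chunk so the context travels with the text. The `#`-prefix heading rule below is just an illustrative heuristic, not a LangChain feature:

```python
def split_with_titles(text, chunk_size=500):
    """Split text into chunks, prefixing each with the last-seen heading."""
    chunks, buf, heading = [], "", ""
    for line in text.splitlines():
        if line.startswith("#"):            # markdown-style title heuristic
            heading = line.lstrip("# ").strip()
        buf += line + "\n"
        if len(buf) >= chunk_size:
            # Carry the current heading into the chunk for context.
            chunks.append((heading + "\n" if heading else "") + buf)
            buf = ""
    if buf.strip():
        chunks.append((heading + "\n" if heading else "") + buf)
    return chunks

sample = "# Intro\nSome intro text.\n# Methods\nDetails here."
print(split_with_titles(sample, chunk_size=20))
```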
Thanks Sam for this fantastic video. I am trying to read a complex PDF, for example a company's annual results PDF containing financial details in tabular format. Any suggestions on how to preprocess it and create embeddings?
Awesome! I have a query: at 15:47, can we make the LLM focus only on the document's information, not external world knowledge?
Great video
I have a question...
How can we find which page the answer is generated from, and how can we get just that page's content from the document?
Thank you for this very useful tutorial. I'm curious to know what you would do differently today when querying one or many PDFs. And, what's the best approach (using RAG?) in January, 2024 to simultaneously query several types such as Word, PDF and text?
I think it is RAG today
I've written a prompt for GPT-4 that I use with chatGPT to transform it into a legal assistant, and the results have been stellar. Is it possible to encode this prompt into the system you describe so that the bot operates with it in mind?
Great! Any suggestions for using Hugging Face open-source LLMs?
Thanks a lot for your videos, been following for quite a while and am continually impressed with clarity and quality.
Regarding LangChain, I've noticed that it uses GPT-3 for almost everything by default, even for chat chains. Is there any particular reason why it's not using (chat)GPT-3.5, especially seeing as it's currently cheaper? Is it about temperature etc. being hard to set for chat, or is there some other reason?
You can use it, and I was going to show that as an alternative; it was just going to make the video too long, I felt. The other big issue is that you will usually (perhaps 80% of the time) have to change the prompt to something that works well on the turbo (ChatGPT) API.
Thank you for the video. Have you tried removing stop words from the text chunks before creating the embeddings, or does this degrade the search results?
Removing stop words is a very old-school way of doing NLP, and we mostly haven't done things like that for the last 7 years. It can create issues, and the LLMs themselves are not trained on data like that.
Awesome explanation! Really appreciate your effort. But does this work on a more complicated PDF, such as one that contains tables/graphs? I faced some issues before when trying to read tables from a PDF using something like Tabula.
No, it won't work out of the box, but there are ways you can do it, which is one area we are working on at my work currently. What kind of tables and graphs are you looking to deal with? The Unstructured library is something you can try, but it's still not great for graphs etc.
Thank you for the tutorial! How on earth did I only now stumble upon your channel? By the way, I have a couple of questions, as I've also been building a PDF RAG app recently:
1. Is it possible to add memory into the RetrievalQA chain that you show in the video?
2. How might I create a custom prompt in RetrievalQA to designate a role, for instance, "You are a legal consultant to a multinational corporation, your task is to use the context given below to answer the question..."?
3. Can I achieve the same thing using LCEL (including custom prompt & memory)?
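One pattern that was common with the classic LangChain API for (1) and (2) is sketched below; the prompt wording is illustrative, and `docsearch` stands for the vector store built in the video. For (3), LCEL can express the same prompt-and-memory wiring, just composed with `|` instead of chain classes.

```python
def build_legal_chain(docsearch):
    """Conversational retrieval with memory and a role-setting prompt."""
    # Classic LangChain 0.0.x-era imports.
    from langchain.llms import OpenAI
    from langchain.memory import ConversationBufferMemory
    from langchain.prompts import PromptTemplate
    from langchain.chains import ConversationalRetrievalChain

    # (2) Custom prompt: same {context}/{question} slots the stuff chain expects.
    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=(
            "You are a legal consultant to a multinational corporation. "
            "Use the context given below to answer the question.\n\n"
            "{context}\n\nQuestion: {question}\nAnswer:"
        ),
    )
    # (1) Memory: the chain carries chat history between calls.
    memory = ConversationBufferMemory(memory_key="chat_history",
                                      return_messages=True)
    return ConversationalRetrievalChain.from_llm(
        OpenAI(temperature=0),
        retriever=docsearch.as_retriever(),
        memory=memory,
        combine_docs_chain_kwargs={"prompt": prompt},
    )
```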
How can you integrate Pinecone into this? Can you do a follow-up video and integrate Pinecone into the same example?
Great video, very thorough and well explained. I created my own version of the program with only minor changes. However, it seems to struggle with retrieval of certain information in PDFs of real estate offering memorandums. Do you have any recommendations on how to fix this?
So for something like that, I would first write a fact-extraction chain that goes through the doc and pulls key info out; then I would add that as metadata and have the search use both.
Fantastic video, very well explained with excellent diagram. The Colab gift is very generous of you.
BTW, when loading two PDFs that each have a Table of Contents, a query about the ToC returns only one ToC. Any idea how to overcome this?
Not sure; this code is quite old now, so that could be an issue. I'm planning a big LangChain update vid over the next few weeks.
Hey Sam, could we use that recursively to add short-term and long-term memory to the system? Storing the chat content permanently in a vector store for short-term memory, using GPT to compress it after a given size, and then storing the compressed version as long-term memory, recursively. That would allow for a bot with a real lifetime memory.
Yes, there are a number of papers that do things like this. Check out my video on Generative Agents.
Hi,
Brilliant video. Extremely helpful. Had one question though: how can you chunk a PDF file (with images) or an Excel file?
Excel files can be handled as dataframes; for images in PDFs etc. there are a couple of ways, mostly using a library called Unstructured.
Your videos are awesome. I was able to build a Flutter app that works with a Python backend running in Replit, using FastAPI to serve API endpoints. In my app, I can upload a PDF file and chat with it using an agent with memory. It works fine. However, I need to allow multiple users, each with their own agent and its own memory. I have no idea how to accomplish this. What path would you recommend?
How did you solve it?
@@jimmytorres4181 Using FastAPI with multiple agents in a dictionary, and multiple indexes using FAISS or Pinecone.
what is the difference between the vector storage you are using vs a solution like Pinecone?
Hi, Sam. Excellent video! For:
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_text(raw_text)
If I did not add separator="\n" in the CharacterTextSplitter call, why is the length of texts equal to 1? Hope to get your answer :)
You need something to split the text. It seems to work the same way str.split works in Python.
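For anyone hitting the same thing: it really is analogous to Python's own str.split. If the separator never occurs in the extracted text, nothing gets cut and you end up with a single element:

```python
raw_text = "First line.\nSecond line.\nThird line."

# Splitting on a separator that occurs in the text gives several pieces.
pieces = raw_text.split("\n")
print(len(pieces))  # 3

# Splitting on a separator that never occurs leaves one piece. This is why
# omitting separator="\n" can yield len(texts) == 1: the splitter's default
# separator is a blank line ("\n\n"), which extracted PDF text often lacks.
one_piece = raw_text.split("\n\n")
print(len(one_piece))  # 1
```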
Thanks for the great video. If I have a document that already contains questions and answers, what is the best way to load it into the vector store: only the answers, or both? And how do I give the LangChain prompt template positive and negative examples so that the LLM can do classification? Thanks in advance!
There are a variety of strategies for this, you could do both together or have 2 separate docstores with different indexing. Most importantly though use meta data, so if you index on questions you can refer to an answer easily. I think it is always good though to not just rely on the questions alone personally.
Cool, is it possible to do it with Vicuna/LLaMA?
Yes, but it will probably need some fine-tuning.
Nice video! Is it possible to see also the original text(s) where the chat is extracting the information from?
I have an Info Extraction video coming up this week.
How can we provide the page number of the pdf document (logical page number) in the 'source_documents' as well?
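One way to surface page numbers (a sketch, not from the video): load with PyPDFLoader, which records a `page` entry in each chunk's metadata, then read it back off the returned source documents. Classic LangChain API assumed throughout:

```python
def qa_with_pages(pdf_path, query):
    """Return (answer, page_numbers, page_texts) for a query over one PDF."""
    # Classic LangChain 0.0.x-era imports.
    from langchain.document_loaders import PyPDFLoader
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI

    # PyPDFLoader keeps {"source": ..., "page": n} on every document chunk.
    docs = PyPDFLoader(pdf_path).load_and_split()
    store = FAISS.from_documents(docs, OpenAIEmbeddings())
    qa = RetrievalQA.from_chain_type(
        OpenAI(), retriever=store.as_retriever(),
        return_source_documents=True,
    )
    result = qa({"query": query})
    sources = result["source_documents"]
    pages = [d.metadata.get("page") for d in sources]
    contents = [d.page_content for d in sources]  # just those pages' chunk text
    return result["result"], pages, contents
```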
thanks again for another amazing video!🤩
I'm trying to follow the same method you showed in this video, but sometimes my model answers from outside the given PDF. Do you have any idea how I can solve it? I tried playing with the prompt and prompt template, but it didn't help much...
Is there any way to guarantee it never answers outside the given PDF?
Which API are you using? The ChatGPT one, I know, can hallucinate much more than the other ones.
I love this approach and I think it is key to the functionality of LLMs, but the longer I watch, the weaker and more limited the project appears, introducing more and more potential bottlenecks.
How powerful is this tool in its current state? How high is the price we pay for PDF access?
The price for PDF access? It just runs on your machine; the only cost in the vid is for the LLM.
love the videos, thanks a lot
yes you can pickle it, though probably better to use Chroma DB
@@samwitteveenai why?
Great Video, thanks @samwitteveenai.
Are you using JupyterLab? If not, how do you reformat the return of the model? For example, when you return the prompt of the model, you create a new split cell; what is that?
In the video I am using Colab, which is Google's own version of Jupyter Notebooks. Not sure what you mean by split cell. The LLM output is being parsed by LangChain before it is displayed, so that could be what you mean?
@@samwitteveenai Thanks a lot; I did not know that a Colab notebook shows your markdown text output as you type it. I'm used to VS Code notebooks, which do not do that.
This is great! I can't get the return_intermediate_steps ranking to work unfortunately, but everything else worked pretty well.
Thanks for the video! For a highly technical PDF (maths, physics, etc.), would this be useful at all? Is there a way to make images and formulas also be "vectorized"?
For academic papers you can convert to LaTeX, which a lot of LLMs can deal with. Dealing with images is a lot more challenging. It really depends on what they are and how they are formatted.
@@samwitteveenai I will explore that, thanks a lot mate
Sam, thanks for the great videos you created. This video helped me resolve an issue I had been struggling with for a while. Before, I used a similar flow (load document, split, embed, vector store, then query) but without using a chain. Somehow, the response time for the query was long (more than 10 seconds). I started to use a chain after watching your video, and the response time dropped to 3 seconds. Thank you so much. BTW, when we do load_qa_chain(OpenAI(), chain_type="stuff"), can we specify the OpenAI model version (e.g. gpt-3.5-turbo or text-davinci-003)? It would be great if we could use gpt-3.5-turbo, because it's 10 times cheaper than text-davinci-003 😅 Thanks again
Thank you, Sam. I watched your video ruclips.net/video/biS8G8x8DdA/видео.html today, and it addressed the issues I encountered with your precise, in-depth knowledge. I practiced with your guidance (turbo_llm and prompt), and it works perfectly. Using a system prompt message to enforce rules in gpt-3.5-turbo seems to be challenging for many people (including me). It may be worth having a dedicated video on this topic if other viewers have a similar issue. Thanks again, great mentor 🙏
Thanks John. I will do more with the turbo API going forward.
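For the model-version question above: load_qa_chain accepts any LLM wrapper, so gpt-3.5-turbo can be swapped in via the chat model class rather than OpenAI(). A sketch against the classic LangChain API (model names as of the video):

```python
def make_qa_chain(model_name="gpt-3.5-turbo"):
    """Build a stuff-type QA chain on a chosen OpenAI model."""
    # Classic LangChain 0.0.x-era imports.
    from langchain.chains.question_answering import load_qa_chain
    from langchain.chat_models import ChatOpenAI
    from langchain.llms import OpenAI

    if model_name.startswith(("gpt-3.5", "gpt-4")):
        # Chat-completion models go through the chat wrapper.
        llm = ChatOpenAI(model_name=model_name, temperature=0)
    else:
        # Older completion models, e.g. text-davinci-003.
        llm = OpenAI(model_name=model_name, temperature=0)
    return load_qa_chain(llm, chain_type="stuff")
```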
@samwitteveenai Can I get an image as output based on the questions? If yes, how can I do it?
Suppose I have a PDF file consisting of medical information with only unstructured tables in it. How do I create an LLM that is pretrained on a medical dataset to answer the user's queries based on the tabular data in the PDF?
Hi, Sam! Why do we need to introduce a retriever in the chain?
You don't have to, but that is the format they seem to be moving to, and it also makes it easier when we are swapping out various retrievers etc.
Nice video :)
Does the docsearch method also use LLM calls and therefore cost tokens, or is this handled without LLM calls?
Searching the doc in this version uses an LLM call to get the embedding. The actual search of the doc is all done locally. I have a version coming where the embeddings will be done by a local model for free.
@@samwitteveenai that would be nice
Thank you Sam! This is informative. I'm working on similar projects with more than 200 PDF documents, each one with 300 pages on average. How should I approach this? Any ideas?
The basic concepts are the same; there are some tricks you can do, some of which I am making vids about over the next week or so. I would suggest you start by looking at metadata and incorporating that into your searches.
This is a good explanation of how it works, but is there an app or website where I could upload a PDF and ask it questions about a document? 🤔
Yeah I think there are a few, but you could make your own pretty easy too.
Hello, my use case is that I need to distill 2 million words down to 10k. My problem has been that the max word input isn't enough anywhere I look. How can I achieve this desired output? Thank you.
Is there any way to build the same chatbot or question-answering system using the information on a given website?
Yes; if you look at the video I just released, it has search, and the video coming tomorrow covers webpages.
What if I don't want to use an OpenAI model and want to use some other custom model?
Do I need GPT-4 or any paid plan for this?
If I want to use other LLMs, how can I know whether they are supported by LangChain?
And that's exactly what I've always said will be coming soon 😎 the future is here!
Can you make one not using openai please
Hi, do you have any idea how to create a chatbot that references our own documents, but if it does not find any result, falls back on OpenAI's own knowledge instead of saying "no context found"? I'm a beginner, so I appreciate the help!
Hi even I have the same doubt. Did you find any solution? I would appreciate your help. Thank you.
OK, so this lets the model answer based on context from the vector store. What if the user wants to relate something from the custom knowledge base to other data GPT was already trained on? Is there a way the language model can still piece together an answer that's outside the given context? For example: how can Elon Musk use this document to help stop rockets from exploding?
Not sure exactly what you mean. You can use multiple vector stores for heterogeneous data etc.; you just need to pass it all in the context.
Whenever OpenAI is involved, it should be mentioned in the title or thumbnail. Thanks...
Very interesting. Does this scale to, say, 100,000 pages? Does it work well with something self-hosted like Vicuna?
To get it to scale you would use a better VectorStore. I will look at those in a future video. For Vicuna you may need to fine-tune it a bit, but it's very doable.
@@samwitteveenai very interesting, looking forward to your future videos! might have some interesting applications for investigations with a lot of documents :)
What about adding the concept of memory?
Just use an agent with memory rather than zero-shot. I have a few vids that look at memory and agents, which could be used with this.
Does anyone know what document loader to use for uploaded PDFs? I'm using FastAPI to upload PDFs that I'd like to load into LangChain.
You could have FastAPI upload them to S3/GCS etc. and then just load them in and process them there. The aim would be to have a VectorStore that you could put the chunks into from each upload, e.g. something like Pinecone or Weaviate.
How do I edit the prompt?
Hey, how do I check the prompt for the conversational retrieval chain?
You can just go into the chain and print the prompt out. I show that in quite a few of the notebooks etc.
Hi Sam! Nice video, but I have a question: what if we have multiple PDFs and want to query over those PDFs?
I am making some vids to show different VectorStores, and I will show multiple PDFs as well. With this one you could also do it, as it just splits into text strings.
@@samwitteveenai Sure, for pure QA this approach might work, but when asked questions that require thorough reasoning and comparing information from multiple PDFs
(e.g.: "compare Amazon's expenditures on ecology and Google's expenditures on technology and tell me which company spends more"),
this RetrievalQA chain fails. I had a very bad experience working with agents, because they work for some queries and not for others.
Really want to see a video which shows more complicated scenarios, rather than a simple QA. Thanks for what you are doing tho 👍
What you are describing sounds like multi-hop questions. One way to handle this is to make new representations of your data that bridge the various parts of the info and then store those in your index as well.
@@samwitteveenai It's impossible to build such indices for all possible questions.
How can you export embeddings to avoid repeated charges?
In this project you could just keep the ChromaDB, as it has all the embeddings in there.
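If you stick with the FAISS store from this video, it can also be persisted directly so the embeddings are only paid for once. A sketch using the classic save_local/load_local API:

```python
def save_index(docsearch, path="faiss_index"):
    # Writes the FAISS index and its docstore to disk.
    docsearch.save_local(path)

def load_index(path="faiss_index"):
    """Reload a saved index; stored texts are NOT re-embedded."""
    # Classic LangChain 0.0.x-era imports. The embedding object is only
    # needed to embed future queries, not the already-stored chunks.
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import FAISS
    return FAISS.load_local(path, OpenAIEmbeddings())
```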
How is Pinecone different compared to FAISS?
Mostly, it costs money. It is also persisted, which this example isn't.
Why would you want some overlap of the chunks?
Imagine without the overlap: half of something important might be in one chunk and half in the next chunk, which means neither chunk has enough signal/info to get a good embedding of that topic, so neither will be returned in the semantic search to answer a question about it. With overlap, the chance of this happening goes down a lot.
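The effect is easy to see with a toy chunker (plain Python, illustrative only): without overlap the word "mechanism" is cut in half at a chunk boundary, while with overlap it appears whole in some chunk.

```python
def chunk(text, size, overlap):
    """Fixed-size character chunks; each starts size - overlap after the last."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "The attention mechanism lets each token weigh every other token."

no_overlap = chunk(text, size=20, overlap=0)
with_overlap = chunk(text, size=20, overlap=8)

# A word split across a boundary in no_overlap ("mechan" + "ism")
# survives intact inside one of the overlapping chunks.
print(no_overlap[:2])
print(with_overlap[:2])
```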
Hello Sam,
My daughter will soon do her final-year training, and I advised her to build a searchable library of around 100 books (an encyclopedia of the Arabic language) using a GPT chatbot.
Can you please advise what the best tools are to reach this goal, and can it be done in a timeframe of 6 weeks?
Thank you,
Arezki
The challenge there would be that it is in Arabic. GPT-4 can probably do that. The new PaLM 2 models from Google can also probably handle it. I will try to do a video on that soon. Apart from the language issues, the rest of the process would be basically the same as the chat-to-PDF/Text vids I have made.
@@samwitteveenai Thank you Sam for the information provided; looking forward to seeing the video. I will also advise them to do some research on the topic to get familiar with the research process.
How to remove context confusion?
Suppose I feed the following text to the LLM: "I have a cat. His name is Tom. He plays with my dog, whose name is Moti. He is very charming. I bought a red neckband for him."
My first question: What is the name of the cat?
>> Tom
Second question: What is the color of the neckband?
>>> red
But should that be the answer, given that Tom doesn't have a neckband?
How to fix this?
Can you build the same system with LangChain and Hugging Face models?
If you mean without OpenAI I did a video like that last week.
@@samwitteveenai Yeah, I watched it.
But I mean taking the PDF as input from the user, then finding the similarity between the PDF and the question the user entered, and then having the model respond.
You might use Streamlit as a web app to take the PDF input and let the user ask questions, and of course without OpenAI.
Sam, please make a video on querying JSON using LangChain. I have a big JSON document with 5000+ user records inside it. How do I query that data in LangChain? E.g.: What is the email ID of John? In which country does John live? How many users are from the UK? etc.
Just load it as-is or as a CSV. If you don't know how you want the data to look, then how will the model know?
For a simple test, deserialize the JSON arrays into a list of classes. Pick what element(s) in the class need to be embedded, for example the user name or a description of some sort. Embed it, then save the embedding as part of the serialized class into the JSON. There's your database. Now, when searching, just embed the user's query, then compute cosine similarity between the query and all the embeddings you saved and pick the top 3 or whatever. It's really that easy. I am doing it this way, loading all 12000+ embeddings (1536 dimensions each) locally, and it takes half a second to give me the top 3 results on a mobile CPU. I had GPT-4 help me create an optimized cosine search method for this.
Oh yeah, I almost forgot: once I get the 3 documents, I then have OpenAI turbo 3.5 summarize the docs with prompt instructions.
There is a JSON agent you can use, or like @Faris said, you could convert it.
@@theeFaris Hello, I'm new to LangChain. I have JSON; LangChain should go through the JSON and answer my query, but I am not getting the desired output. I have usernames and email IDs; based on a name, I want the email ID. Not sure what I am missing.
gist.github.com/rajivraghu/c1cfa60b848765e28b78f16269c10f22
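The embed-then-cosine-top-k search described in this thread is only a few lines of plain Python. Toy 3-dimensional vectors stand in here for real 1536-dimensional OpenAI embeddings; the records and scores are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0

def top_k(query_vec, records, k=3):
    """records: list of (payload, embedding); returns the k best payloads."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]),
                    reverse=True)
    return [payload for payload, _ in ranked[:k]]

records = [("John, john@example.com", [1.0, 0.1, 0.0]),
           ("Mary, mary@example.com", [0.0, 1.0, 0.2]),
           ("Ade, ade@example.com",  [0.1, 0.0, 1.0])]

# A query vector close to John's embedding ranks his record first.
print(top_k([0.9, 0.2, 0.1], records, k=1))
```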
Can we use it with the Azure OpenAI API?
Yes, you should be able to swap out the model for the Azure one, though I haven't used that myself.
Are there any issues with using the OpenAI API key?
Can you add an agent without using OpenAI, instead using another open-source model? I cannot find any example of this type; it would be very useful. A lot of people don't like using OpenAI, but there are really very few examples not using it.
You can, but most of the open-source models aren't good enough quality to generate good answers. For doing RAG they are; I did a few vids about this last week.
@@samwitteveenai Thanks for the reply! Is there any video related to agents? I am building a tool similar to the PDF QA you demonstrated; the difference is I want to use an agent to take care of the chatting instead of a pure QA pattern. For example, the tool could respond with things like: "How can I assist you today?" or "I cannot give an accurate answer based on the documents; would you like to provide more information?"
Excellent video, but using OpenAI model(s) repeatedly is a disservice to all. OpenAI/Pinecone charge for using their models/services (Pinecone recently and suddenly changed their free-tier availability for new accounts), but with solid FOSS (free and open source) models available, FOSS LLMs should be preferred in example code. Thank you.
I totally understand where you are coming from. I am planning vids to show a lot of open-source alternatives; this just allowed me to keep the basics simple for this video. Another challenge is that many of the open-source LLMs aren't that great with prompts containing context. I am currently training models specifically for this at work.
It can't even get the author right. :/ Similar to things like ChatPDF, which just hallucinates things that have nothing to do with the document.
There are ways to make it better at getting things like this right in commercial applications. Hallucination is still a massive challenge.
This does not work for technical documentation. I tried it with a proprietary programming-language manual months ago, then asked it to write code using that language, and it was useless. All it really is is a sophisticated search tool, good only for natural language like fiction or commentary etc.
"I tried it with a proprietary programming language manual months ago, then asked it to write code using that language and it was useless." - This certainly won't work in this kind of way. For that you would fine-tune one of the coding models on the data, not just do retrieval.
Hey Sam,
I've said it before, but I just have to thank you again for your incredible videos! Your choice of words and facts, along with your soothing voice, make everything so easy to understand.
I had an idea that I think would be a fantastic addition to your channel. You know those @domainofscience "Map of..." videos, like "The Map of Quantum Computing" (ruclips.net/video/-UlxHPIEVqA/видео.html)? It would be amazing if you could make a "Map of AI" video in the same style. I believe you have a unique talent for breaking down complex topics and making them accessible to everyone.
Keep being awesome, Sam! Can't wait to see more of your great content.
Best, Fredrik
Fredrik, thanks for the kind words. I know the channel you are talking about well and love those videos. This is a really cool idea; let me think about how to do it.
@@samwitteveenai FYI, I reached out earlier to @domainofscience suggesting this but didn't get a reply. Perhaps you two could collaborate if that would be suitable. I for one would be ready to pay for access to a video like that. I'll stop bothering you now. Cheers!
Hi Sam,
For :
chain = load_qa_chain(OpenAI(),
                      chain_type="map_rerank",
                      return_intermediate_steps=True)
query = "who are openai?"
docs = docsearch.similarity_search(query, k=10)
results = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
results
I am getting the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
6 query = "who are openai?"
7 docs = docsearch.similarity_search(query,k=10)
----> 8 results = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
9 results
10
7 frames
/usr/local/lib/python3.10/dist-packages/langchain/output_parsers/regex.py in parse(self, text)
26 else:
27 if self.default_output_key is None:
---> 28 raise ValueError(f"Could not parse output: {text}")
29 else:
30 return {
ValueError: Could not parse output: OpenAI is an artificial intelligence research laboratory consisting of the for-profit company OpenAI LP and its parent organization, the non-profit OpenAI Inc. Score: 100
I even saw an issue open on langchain: github.com/hwchase17/langchain/issues/3970.
I would really appreciate any assistance to address this concern.
Thanks,
Ankush Singal
Maybe try changing the query question.
I failed on this line of code "docsearch = FAISS.from_texts(texts, embeddings)"; it returns ValueError: not enough values to unpack (expected 2, got 1). Do you know what the problem is? I duplicated your steps exactly.
OK, so I solved that one by specifying the model name on line 15: embeddings = OpenAIEmbeddings(model="davinci")
But then I ran into a dependency error on pexpect, which cannot run on Windows. I run my code in a Jupyter notebook.
Then, to solve it, I jumped to Google Colab. And it runs, until it hits an error
on chain = load_qa_chain, where even though it already has the correct answer, it cannot parse the output.
ModuleNotFoundError: No module named 'langchain.chains.question_answering'
Not able to get rid of this error.
How would you code it to have it generate questions about the PDF in this example?