Working with MULTIPLE PDF Files in LangChain: ChatGPT for your Data
- Published: 12 Apr 2023
- Welcome to this tutorial video where we'll discuss the process of loading multiple PDF files in LangChain for information retrieval using OpenAI models like ChatGPT. Our step-by-step guide will explain how to convert PDF files into embeddings based on the chosen large language model. Let's get started!
Welcome to this tutorial where you'll learn how to extract valuable information from your PDFs using LangChain and OpenAI Text Embeddings. We'll guide you step-by-step through the process of setting up LangChain to communicate with your PDF files, allowing you to retrieve information efficiently and effectively. By the end of this tutorial, you'll have the skills necessary to use advanced language processing technology and improve your data analysis.
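The retrieval step described above boils down to embedding both the documents and the question, then picking the closest match. A minimal pure-Python sketch of that idea (the tiny hand-made vectors and file names here only stand in for real OpenAI embeddings):

```python
import math

# Toy embeddings: in practice these come from an embedding model via
# LangChain; these small hand-made vectors only stand in for real ones.
doc_vectors = {
    "invoice.pdf":  [0.9, 0.1, 0.0],
    "research.pdf": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in for the embedded question, e.g. "summarize the research paper".
query_vector = [0.2, 0.8, 0.1]

# Retrieval returns the document whose embedding is most similar.
best = max(doc_vectors, key=lambda name: cosine(doc_vectors[name], query_vector))
```

A real vectorstore (Chroma, FAISS, Pinecone) performs this same comparison over chunk embeddings at scale.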
▬▬▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Support my work on Patreon: Patreon.com/PromptEngineering
🦾 Discord: / discord
▶️️ Subscribe: www.youtube.com/@engineerprom...
📧 Business Contact: engineerprompt@gmail.com
💼Consulting: calendly.com/engineerprompt/c...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
LINKS:
Google Colab: colab.research.google.com/dri...
LangChain: docs.langchain.com/docs/
VectorstoreIndexCreator vectorstore: tinyurl.com/3yz455m3
#LangChain #InformationRetrieval #PDF #OpenAITextEmbeddings #DataAnalysis #LanguageProcessingTechnology #AI #MachineLearning #NaturalLanguageProcessing #NLP #Tutorial
Want to connect?
💼Consulting: calendly.com/engineerprompt/consulting-call
🦾 Discord: discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Join Patreon: Patreon.com/PromptEngineering
▶ Subscribe: www.youtube.com/@engineerprompt?sub_confirmation=1
Suppose my question is about one document, but it's pulling the answer from another document and giving an irrelevant answer. How can we handle this?
Just wanna say thanks a lot for your tutorial!
superb explanation. Thanks
Thank you for sharing! Excellent video.
Thank you very much. Is there any way we can specify which document to scan to find the answers?
Very cool video. Thank you!
Can you choose which model to use? I don't see a completion request with the model statement. Thank you for this video - I'm still learning by doing.
Great video. I'm learning a lot from you. Thank you.
Glad it's helpful :)
Thanks for the excellent video. How do I get the page number of the content and sources? Any suggestions?
Hi, very good work. Thanks! Sorry, but the Google Colab link is invalid.
You are amazing! This is exactly what I was looking for. I might also need to connect with you in future for consultancy on something that I am trying to build.
My man! First, you're a monster. Obviously, I bought you a coffee. Anyway, there were 3 errors/bugs (excuse my language, this is the first time I've coded anything in my life); in case somebody was struggling, I think these fixes are useful. 1) In the 'Connect Google Drive' section, second segment of the code, I had to insert the line import os between pdf_folder_path = f'{root_dir}/data/' and os.listdir(pdf_folder_path). In other words, the full code is: pdf_folder_path = f'{root_dir}/data/' (first line), import os (second line), os.listdir(pdf_folder_path) (third line). 2) In the 'Load Multiple PDF Files' section I included these two lines of code: from langchain.document_loaders import UnstructuredPDFLoader
and from langchain.indexes import VectorstoreIndexCreator. 3) In the Vector Store section I included !pip install unstructured[local-inference] as the first line of code. And that's basically it! Cheers mate!
Thank you!
Thanks stranger! You just fixed my traceback error with suggestion number 3.
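The first fix above is just a missing import. A self-contained sketch of that section of the notebook (a temporary folder and placeholder file names stand in for the real mounted Drive path, which isn't available here):

```python
import os
import tempfile

# Stand-in for the mounted Google Drive folder from the notebook.
root_dir = tempfile.mkdtemp()
pdf_folder_path = f'{root_dir}/data/'
os.makedirs(pdf_folder_path, exist_ok=True)
for name in ('paper1.pdf', 'paper2.pdf', 'notes.txt'):
    open(os.path.join(pdf_folder_path, name), 'w').close()

# Without `import os` somewhere above this call, os.listdir raises
# NameError -- that was the first bug the commenter hit.
pdf_files = sorted(f for f in os.listdir(pdf_folder_path) if f.endswith('.pdf'))
```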
Hey, thanks for the tutorial. I am thinking of creating a voice assistant using OpenAI embeddings; is there any tutorial for this?
Please consider doing a similar video on how to be able to chat more freely with Google Drive PDFS with memory. For example, having the script generate a glossary, an outline, or a lesson plan based on the database of pdfs.
Sure, that's in the plans. There is a video on the channel, 'Crash Course on LangChain'; it has a section on memory, so check that out for the time being :)
This is a great idea, Sam.
Great stuff
Thank you so much, it worked! (But it also required me to install pdfminer and several other things.)
Glad you found it useful. Google colab is sometimes really funny :)
This is excellent. Would love for you to delve deeper into this experimentation. How much did it cost you on OpenAI's end? For embeddings etc.
Thanks, I will be doing a lot more on this. For this video and experimentation, the cost was around $1.
Hello! Nice solution, I wanted to try it; is the Colab link working?
Does this method work with full books, ~300 pages?
This is all good for demos, but is LangChain reliable for production-level apps? Are there any alternatives? @Prompt Engineering
Can you also include how to interact with tables and pictures in a PDF document
Can it answer questions that need information from multiple PDFs?
Hi Prompt Engineering!
Quick question: I like the way you created an index from multiple PDF files and queried from the index. Have you attempted to persist the vectorstore for later use (e.g., query or update with additional documents)?
Chromadb will be a good option for doing that.
@@engineerprompt Can you explain the difference between ChromaDB and Meta's FAISS?
How would you adjust the temperature?
Hi, when installing chromadb I get an error installing hnswlib; how did you fix it?
"Failed building wheel for hnswlib"
Loooooove it ❤
Good stuff. Two questions/suggestions. First, is the data stored locally or in a database like Pinecone? Second, can the intake be modified so that I can use DirectoryLoader?
Good stuff!
Thank you.
1) For this example, it's stored locally, but you can use any database you want.
2) Yes, you can do that. In that case, you will have to define the file type you want to read.
Hope this helps.
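Selecting the intake by file type, as DirectoryLoader does with its glob pattern, can be sketched in plain Python (the folder contents here are made up for illustration):

```python
import glob
import os
import tempfile

# Fake document folder with mixed file types.
folder = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf", "notes.txt"):
    open(os.path.join(folder, name), "w").close()

# A glob pattern defines which file type gets loaded -- the same idea
# DirectoryLoader expresses with its glob argument.
pdf_paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))
pdf_names = [os.path.basename(p) for p in pdf_paths]
```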
Is accuracy being calculated for the model?
Can we compare the differences between 2 PDFs?
Hello, great tutorial! Any idea how to change the max_tokens (output tokens) in this approach? So far I'm getting 256 tokens in the response, while I need much more.
Which part of the script creates embeddings and uses Chroma, please?
Thank you for educating us. I wonder how would you integrate AutoGPT for multiple agents
AutoGPT will decide how many agents to use based on the problem it's trying to solve. I don't think you need to specify the number of agents.
I am very interested in how you made the model of the animated face that is at the beginning of the video, is there a tutorial on your channel on how to make such an avatar?
Yes, check this out: ruclips.net/video/V2efVSXSlqc/видео.html
I have a couple of other videos as well that use open source tools. You can look for those as well.
It works great for understanding my files. But even with a T4 GPU with 16 GB of GPU memory, it takes 2-3 minutes to get an answer for a file with 4 or 5 pages. Is that normal for this GPU?
One more question - do the documents need to be reloaded into a vector every single time? Or can we simply import the query and answer to another Python file?
That's a great question, should have addressed this in the video. You can simply write the embedding into a file and store that instead. Then reuse it whenever you want
I found that I had to add this in order for it to work:
!pip install unstructured[local-inference]
Otherwise I got this error:
ImportError: Following dependencies are missing: pdfminer. Please install them using `pip install unstructured[local-inference]`.
Why is this?
Thanks a lot! This indeed saved my time!
Hi, I have this error. How did you solve it? What is the local-inference?
Thanks for the clear example👍 I have 2 additional questions:
1. If you have a PDF with a mathematical formula, for example to calculate some measure (i.e. BMI), can you also ask for the BMI if you supply your height and weight?
2. If I have a document with questions and answers, how do I feed it in?
Thanks in advance.
1. Might be possible with GPT-4.
2. Should be possible, similar to any other PDF. It will treat it like a normal sequence of tokens.
Please update the colab link, thanks very much!😀
Hi Prompt Engineering
I tried a similar example, but I am getting an error:
Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass `openai_api_key` as a named parameter. (type=value_error)
Awesome video. Is it possible to run this as a regular Python file without a Jupyter notebook? Anything I should be aware of?
Yes, absolutely you can do that. Just create another virtual environment for this and install the packages and you are good to go.
Nice tutorial. I am actually facing a problem when trying to use the Chroma vector store with a persisted index. I have already loaded a document, created embeddings for it, and saved those embeddings in Chroma. The script ran perfectly with LLM and also created the necessary files in the persistence directory (.chroma\index).
However, when I try to initialize the Chroma instance using the persist_directory to utilize the previously saved embeddings, I encounter a NoIndexException error, stating "Index not found, please create an instance before querying". Is there a way to fix it? Could you find a solution and make a video of it?
Additionally, I am curious if these pre-existing embeddings could be reused without incurring the same cost for generating Ada embeddings again, as the documents I am working with have lots of pages. Thanks in advance!
I will try to look at the first problem. As for the second question: yes, you can do that.
Hello, is there any update regarding this problem? By the way, nice vids!!
Just curious, do you have a tutorial for multiple PDFs using Llama 2 and other open-source embeddings?
Yes, just check out my localGPT videos.
May I ask does it work with PDFs having over 4000 tokens (the limit of OpenAI API)? Thanks a lot for providing both guidelines and Colab notebook for immediate use!
That's what chunks are for. The text is split up into those chunks so that it makes them manageable for further processing.
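The chunking idea in the reply above can be sketched with a rough character-based splitter (LangChain's real splitters are token- and separator-aware; this only shows the gist, and the sizes are illustrative):

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Naive splitter: fixed-size character windows with overlap, so an
    answer-bearing passage is less likely to be cut cleanly in half."""
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

long_text = "x" * 4500  # stand-in for a long PDF's extracted text
chunks = split_into_chunks(long_text)
```

Each chunk is embedded separately, so a document far over the model's context limit still fits piece by piece.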
Thank you so much, that's quite helpful! Although it would be great if you could help us give it memory; for example, if I correct a wrong output, the bot should remember it. Have a nice day and keep up the good work.
Glad you found it helpful. As far as memory is concerned, watch this video; there is a section on how to do it. I will be making more detailed videos on it later: ruclips.net/video/5-fc4Tlgmro/видео.html&ab_channel=PromptEngineering
Is there a GitHub repository for all your excellent training videos?
Can we integrate this with Django?
Thanks for the video, it's very useful. Is it possible to integrate a voice assistant that receives a question as input and answers via voice, using the information present in the PDFs? It would be very useful. It could be done with Whisper or Bark. What do you think?
I downloaded many PDF's just waiting for the day this becomes a reality.
Yes, if I find time, I will put together something for this.
I want this too!
Hi, instead of UnstructuredPDFLoader can we use PyPDFLoader? I was using PyPDFLoader with glob and loader_cls. I added 3 PDF files to a folder called pdf, but when I load it and print the length of the documents, it shows a wrong answer like 5 or 6, whereas I only loaded 3 PDFs. Can you please let me know if you have a solution for this?
Hey, what if we want image responses? How do we get those?
Excellent video, exactly what i was looking for. My pdf files are a mess (anyone can relate?) Hundreds of pages, images, scanned documents sometimes.
Can you clarify what is Pinecone and how it could help in this particular workflow?
So this approach will only work on the text part of your documents; I am not aware of any approach that will understand images (yet). Pinecone is a vectorstore (think of it as a database). You can store your embeddings there if you have a very large set of documents. Hope this helps.
Do you have a video like this with a local LLM? Like using LM Studio as a server?
Look at the localGPT project.
Suppose I added PDFs containing details for each employee, and then I ask how many employees have Python experience, or how many employees are there in the company?
Can it respond correctly?
If not, what should be done in order to get a correct response for the above queries?
Thanks!
My question: I want the answers from both sources, indicating that Answer1 is coming from Source1 and Answer2 is coming from Source2.
How can I achieve this?
Can you try this for estimating PDF blueprint files for a commercial window treatment business? And construction firms?
You could. If you have files, I can try.
Wonderful tutorial. Would it be possible to run this through VSCode instead of the browser? I attempted it, and it only shows up in the console without opening the browser.
You should be able to. VSCode has support for Jupyter Notebooks. Check that out.
@@engineerprompt I'm facing a problem with VectorstoreIndexCreator.
Please share the solution.
That's another informative video. Appreciated. I know someone already asked in this thread how to persist the index for later use, and you recommended chromadb as a good choice. However, on my company computer I failed to install chromadb. So how can I use FAISS instead to persist the index?
In that case, use pickle to dump the index to a pickle file.
@@engineerprompt Thanks for the response. Yes, I did that after watching your other videos. All good now.
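The pickle route suggested above can be sketched like this (a plain dict stands in for the index object; whether the real FAISS/LangChain wrapper pickles cleanly depends on the library version, so treat this as the pattern, not a guarantee):

```python
import os
import pickle
import tempfile

# Stand-in for the vector index; the real object would hold embeddings.
index = {"doc1.pdf": [0.1, 0.2, 0.3], "doc2.pdf": [0.4, 0.5, 0.6]}

# First session: dump the index to disk after building it once.
path = os.path.join(tempfile.mkdtemp(), "index.pkl")
with open(path, "wb") as f:
    pickle.dump(index, f)

# Later session: reload instead of re-embedding, avoiding repeat API cost.
with open(path, "rb") as f:
    restored = pickle.load(f)
```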
I want to use an Alpaca or Vicuna model instead of ChatGPT, because ChatGPT has limitations on the requests we send. I just want to use an open-source model instead of ChatGPT; is this possible?
Yes, you can look into Hugging Face embeddings. When I get them working, I will make a tutorial on it.
@@engineerprompt Yes, looking forward to that tutorial where we can read multiple PDFs and query them without using the OpenAI API.
Looking forward to this!
Waiting for this 😊
@@engineerprompt waiting for this
Great stuff! I was looking all over the web for how to do this, and this was the only useful video I could find. I just have one quick question for further work I need to do.
Just wondering, is it possible to make queries which only pertain to a specific document? For example, I only want to know something about the first paper (e.g. authors, title, etc.) but not the second. Let me know how you would go about doing this.
Thank you and glad you liked it. Yes, you can do that by adding metadata and use that as context to the LLM.
@@engineerprompt Thanks! Do you have any videos/resources on how to do this?
Do a demo of it working, I always wanted this.
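The metadata idea from the reply above, in miniature: every chunk records which file it came from, and retrieval is filtered on that field before anything reaches the LLM. The chunk texts and file names below are invented for illustration:

```python
# Each chunk carries metadata identifying its source document.
chunks = [
    {"text": "Authors: Smith et al.", "metadata": {"source": "paper1.pdf"}},
    {"text": "Results section ...",   "metadata": {"source": "paper2.pdf"}},
    {"text": "Title: A Study of X",   "metadata": {"source": "paper1.pdf"}},
]

def restrict_to(source, chunks):
    """Drop every chunk that did not come from the requested file."""
    return [c for c in chunks if c["metadata"]["source"] == source]

# Only paper1's chunks are now candidates for the query about its authors.
paper1_only = restrict_to("paper1.pdf", chunks)
```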
Great content. Would it be easy to modify this process to handle different file formats such as .doc or .txt? Thanks again. I have subscribed.
Thank you, yes, you just need to add different loaders for each file type.
Is it possible to retrieve which section of the PDF it is referring to? (Can it even detect the portion of the chunk in the PDF?)
I am not sure, will look into it.
That would be incredible in order to make scientific reviews and references
@@Myplaylist892 I agree, I will look into it in more details.
@@engineerprompt hi..you got anything on this?
yes that does sound pretty useful
Immediately hit my quota before being able to query. Would Hugging Face be the next route for a free AI version?
Yes, you can try huggingface or if you have the hardware, you can try to run something like localgpt locally
Can I store a vector in a database like Azure and then run just a similarity search or retriever without having to recreate it? Can someone help me?
Apart from OpenAI, who else provides embeddings?
Hugging Face has its own embeddings, and you can also integrate models like BERT.
I am getting this error on VectorstoreIndex creation:
ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (C:\Program Files\Python310\lib\site-packages\pdfminer\utils.py)
Besides PDF and Word files, can localGPT handle Excel and CSV files?
Yes, but you will have to experiment with the embedding model and LLM.
Sorry, I think the Google Colab has a problem. Is anyone else having trouble opening and running it?
Can I use this on my company website for PDF search? Please reply.
Yes, it's simple to integrate.
Thank you for this valuable information. How can I get the page number as a reference along with the source PDF?
I think there is a way, need to check it.
It's listed on the LangChain site.
Thank you!
Thank you for your support!
Now that the text-davinci-003 model has been deprecated, I'm no longer able to use the openai library.
I got the error: openai.error.InvalidRequestError: the model 'text-davinci-003' has been deprecated. Is there a way I can replace it with gpt-3.5-turbo-instruct (the one recommended by OpenAI)?
I have a question:
What should my system requirements be if I want to build a project application using LangChain & OpenAI?
You can run it on any machine that can run Python if you are using OpenAI models. You don't need a GPU in that case.
This error is popping up: "Failed to load the Detectron2 model. Ensure that the Detectron2 module is correctly installed."
after running "index = VectorstoreIndexCreator().from_loaders(loaders)", although I have already installed the detectron2 model.
Any solutions?
If it's a warning, just ignore it.
I got this error:
InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4331 tokens (4075 in your prompt; 256 for the completion). Please reduce your prompt; or completion length.
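That error is plain arithmetic: the retrieved chunks plus the question plus the reserved completion must fit inside the model's window. A quick budget check (the chunk and question sizes below are illustrative numbers, not values from the video):

```python
MAX_CONTEXT = 4097   # model's total window, from the error message
COMPLETION = 256     # tokens reserved for the answer, also from the error
prompt_budget = MAX_CONTEXT - COMPLETION

chunk_tokens = 1000     # rough size of each retrieved chunk
question_tokens = 75    # rough size of the question itself

# Largest number of retrieved chunks that still fits the window;
# lowering k (or the chunk size) is the usual fix for this error.
k = (prompt_budget - question_tokens) // chunk_tokens
```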
Thanks a lot, but I have an error message. When I run the VectorstoreIndexCreator() cell I get the following error: "ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (/usr/local/lib/python3.9/dist-packages/pdfminer/utils.py)". Could you help me?
@Sameul - @Yn Box has answered this question, please see the comments below. Basically you need to do:
!pip install unstructured[local-inference]
I was also facing the same issue as you and his resolution solved it!
@@Prakash-oq5ke This worked! Thank you.
do you know how to incorporate a new LLM like Dolly into LangChain?
tutorial coming soon.........
When I run the VectorstoreIndexCreator() cell i get the following error
ImportError: cannot import name 'open_filename' from 'pdfminer.utils' (/usr/local/lib/python3.9/dist-packages/pdfminer/utils.py)
I tried installing and importing the packages, but that didn't work either. Any solution to this?
Same 🥲
You can try installing this library first. !pip install unstructured[local-inference]
@@samdaniel1368 Thank you for the solution , running it in the first cell and restarting the runtime solved the issue for me
@@samdaniel1368 Perfect thanks
Thanks a lot for the video. I am facing a problem with the access to the Colab file, please, can you help?
What is the issue?
@@engineerprompt Something like "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential"
😢
@@samser1150 Make sure you are not running the Colab in other tabs.
@@engineerprompt solved, thanks a lot
@@samser1150 Perfect, glad was helpful!
Hey, thanks for the video, but the Colab file is private; we can't access it.
It seems to be public, can you not see it at all?
Can you make a video where you make a webapp using langchain & streamlit where you can upload multiple PDF files and ask questions about the files?
It's on my to-do list :)
@@engineerprompt Hope I see it soon🙏🏼🤩
I am using Azure OpenAI. The code is failing at the index creation step, i.e.
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders(loaders)
with the following message
raise error.InvalidRequestError(
openai.error.InvalidRequestError: Must provide an 'engine' or 'deployment_id' parameter to create a
Can you help with how to do this with an Azure OpenAI setup?
Sorry, I haven't used Azure, so I'm not sure what's going on here. It seems like you are having issues accessing the OpenAI API.
What to do about 'detectron2 not installed'?
Are you getting a warning or an error?
How to do the same without OpenAI? I mean, using GPT4All or some other LLM. The point is doing everything "for free" without spending on API calls. And another one: how to do the same on a large codebase? Python, Java, Clojure, etc. Thank you.
Stay tuned, tutorial coming soon :-)
Can you help me? I have an error message.
question = "핵가족 그리고 직계가족이 뭐지?"
response = model1({"question":question}, return_only_outputs=True)
print("Answer : ",response['answer'])
print("Sources : ",response['sources'])
This model's maximum context length is 4097 tokens, however you requested 4903 tokens (4647 in your prompt; 256 for the completion). Please reduce your prompt; or completion length
The cell "index = VectorstoreIndexCreator().from_loaders(loaders)" gives me an error even though I pip installed pdfminer and !pip install unstructured[local-inference]... don't know what to do :(
What's your python version?
@@engineerprompt 3.9
Same error here
Same error, got any solution?
Running Python 3.10 with unstructured[local-inference] installed, I am running into the error at the index = ... line.
The error is:
AttributeError Traceback (most recent call last)
in ()
1 get_ipython().system('pip install unstructured[local-inference]')
----> 2 index = VectorstoreIndexCreator().from_loaders([loaders])
3 index
/usr/local/lib/python3.10/dist-packages/langchain/indexes/vectorstore.py in from_loaders(self, loaders)
67 docs = []
68 for loader in loaders:
---> 69 docs.extend(loader.load())
70 sub_docs = self.text_splitter.split_documents(docs)
71 vectorstore = self.vectorstore_cls.from_documents(
AttributeError: 'list' object has no attribute 'load'
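The traceback above actually shows the bug: from_loaders iterates over its argument and calls .load() on each element, so it must receive the list of loaders itself, not that list wrapped in another list. A pure-Python mock (FakeLoader is invented here to mimic the loader interface) makes the difference concrete:

```python
class FakeLoader:
    """Minimal stand-in for a LangChain document loader."""
    def __init__(self, name):
        self.name = name

    def load(self):
        return [f"contents of {self.name}"]

loaders = [FakeLoader("a.pdf"), FakeLoader("b.pdf")]

# What from_loaders(loaders) effectively does -- this works:
docs = []
for loader in loaders:
    docs.extend(loader.load())

# from_loaders([loaders]) would instead iterate over [[...]] and call
# .load() on the inner *list*, raising exactly the AttributeError above.
```

So the fix in the notebook is VectorstoreIndexCreator().from_loaders(loaders), without the extra brackets.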
How to use gpt-3.5 turbo instead of davinci
In the OpenAI function, set the model variable to gpt-3.5-turbo.
@@engineerprompt Trying to do this here too, but with VectorstoreIndexCreator it's a bit tricky if one doesn't know where to put it. Not choosing gpt-3.5-turbo becomes costly with davinci over time.
How do I extract basic KYC data into Excel from insurance policies and invoices with different structures in PDF?
Can this be done using ChatGPT or any similar AI tool automatically, without any training or annotation?
Yes!
@@engineerprompt how?
@@caankitrmehta2281 You need to design a prompt which will extract this info from the files.
@@engineerprompt What will the approximate cost be?
Error in your Google Colab file: "ModuleNotFoundError: No module named 'pdfminer'"
Check it now!
Getting the following error message when I try to run the 'Load Required Packages' cell:
ModuleNotFoundError: No module named 'langchain'
Any advice?
Seems like you didn't install langchain. In the start there is a cell with the following command.
!pip install langchain
Make sure you run that.
@@engineerprompt Thank you. It worked after refreshing the page, think the error was on the Colab side. Great video!
Has anyone been able to fix the "detectron2 is not installed" issue?
Is it warning or error?
@@engineerprompt it is a warning : "detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy." and it is repeated for each file in the directory that has the PDFs.
Which AI avatar generator are you using? :)
I have a local, open-source workflow for that :-)
@@engineerprompt I feel a need to request a video on it 🙂
How much should one pay for 10 pages, based on your experience?
It depends on how many times you will be prompting, but it's going to be cents or a few dollars at most.
@@engineerprompt I appreciate your comments.
Due to changes in the VectorstoreIndexCreator API, some errors appeared.
To solve it, I did:
embedding_ai = OpenAIEmbeddings() #Use any embedding you want to
index = VectorstoreIndexCreator(embedding = embedding_ai).from_loaders(loaders)
turbo_llm = ChatOpenAI(
temperature=0,
model_name='gpt-3.5-turbo-0125' # default gpt-3.5-turbo
)
#Need to define LLM now
index.query('Tell me something about Interpersonal communication', llm = turbo_llm)
Please provide an updated Google Colab link.
Can you please try now!
How to run this locally without Collab?
You will need to install Python on your machine. Download Visual Studio Code and install it. Then download the notebook shown in the video and run it in Visual Studio Code. Hope this helps. If you are not familiar with the process, I can make a tutorial at some point.
@@engineerprompt I have Python and Visual Studio Code installed, as I run many LLM models locally and do AI training. I just have never used Colab/notebook things.
@@digidope Perfect, then it's just a normal Jupyter notebook once you download it. Just download it and you can run it as a Jupyter notebook.
Thanks!
Thank you, really appreciate your support!
@@engineerprompt I was looking for an alternative to manually creating each loader. Thanks!
Getting an error when opening the link.
what's the error?
@@engineerprompt Colab signed out on a different tab. But I'm signed in.
Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication
@@engineerprompt Yeah I can't access it either. Something about the OAuth token
@@MatthewGunnin Make sure you close all instances of google colab you have running and then open this link. Hope this helps.
Can it work with 100s of PDFs?
You could. I haven't tried it, so I can't say how hard it's going to be. Will look into it.
This would be a very interesting experiment as well.
At which point does it stop being context setting and start being fine-tuning?
Bro, can you make a video on how you link this on your website, plus a UI to make it look better?
@@hamaltarther2515 You can use streamlit, will look into it.
Cannot access the Colab.
Can you please try now!