ChatGPT For Your DATA | Chat with Multiple Documents Using LangChain
- Published: 4 Oct 2024
In this video, I will show you how you can chat with any document. Say you have a folder containing different file formats: a PDF file, a text file, a README file, and others. I will show you how to take all of your data, split it into chunks, create embeddings with OpenAI embeddings, and store them in the Pinecone vector store. Finally, you can chat with your own documents and get insights out of them, similar to using ChatGPT with your own data. Happy learning.
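The flow described above (load files → split into chunks → embed → store vectors → answer questions against the nearest chunks) can be sketched end to end. The block below is a deliberately simplified, self-contained illustration: the hash-based `embed` function and the in-memory list stand in for OpenAI embeddings and Pinecone (both of which need API keys), and every name in it is made up for this sketch.

```python
import hashlib
import math

def embed(text, dim=256):
    """Toy stand-in for OpenAI embeddings: hash character trigrams into a vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def split_into_chunks(text, chunk_size=200, overlap=20):
    """Fixed-size character splitter with overlap, like a basic text splitter."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def build_store(docs):
    """'Vector store' stand-in: a list of (vector, chunk, source) triples."""
    store = []
    for source, text in docs.items():
        for chunk in split_into_chunks(text):
            store.append((embed(chunk), chunk, source))
    return store

def query(store, question, k=4):
    """Embed the question and return the k most similar chunks by dot product."""
    qv = embed(question)
    scored = sorted(store, key=lambda rec: -sum(a * b for a, b in zip(qv, rec[0])))
    return [(chunk, source) for _, chunk, source in scored[:k]]

docs = {
    "notes.txt": "Pinecone is a managed vector database used to store embeddings.",
    "readme.md": "This project demonstrates chatting with documents via LangChain.",
}
store = build_store(docs)
top = query(store, "Where are the embeddings stored?", k=1)
print(top[0][1])  # source file of the best-matching chunk
```

In the real pipeline, `embed` would be OpenAI's embedding model, the list would be a Pinecone index, and the top chunks would be passed to the chat model as context rather than returned directly.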
👉🏼 Links:
GitHub code: github.com/sud...
☕ Buy me a Coffee: ko-fi.com/data...
💬 Build your custom chatbot: www.chatbase.c...
🔗 Other videos you might find helpful:
LangChain: • What is LangChain ? | ...
Open Assistant: • Open Assistant 😱 | Ope...
Chat with any pdf: • ChatPDF | MindBlowing ...
Sketch with pandas: • AI Code-Writing Assist...
Analyzing data with ChatGPT: • Analyzing data with CH...
🔴 RUclips: www.youtube.co...
👔 LinkedIn: / sudarshan-koirala
🐦 Twitter: / mesudarshan
💰🔗 Some links are affiliate links, meaning when you use them, I might get some benefit.
#openai #llm #datasciencebasics #chatwithdata #documents #chatgpt #nlp
Many Thanks for your great work!
It's very well explained and applies to real uses of AI.
Great video. Thanks for your time and explanation.
You’re welcome. Glad that it was helpful.
As always, great tutorials! I would love to see this same topic but without using OpenAI.
You can try this one:
How TO SetUp and USE PrivateGPT | FULLY LOCAL | 100% PRIVATE
ruclips.net/video/VEQ8mxv2MHY/видео.html
I've identified the problem in your code. The issue lies in the creation of chat history. Your code expects a list of tuples, but in your Gradio app, you're creating a list of lists (nested lists), which is causing the code to malfunction.
Please try using the following code instead and replace it in your Gradio block. This updated code should resolve the issue and make it work correctly.
import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def respond(user_message, chat_history):
        # Gradio's Chatbot passes history as a list of lists;
        # the chain expects a list of tuples, so convert first.
        if chat_history:
            chat_history = [tuple(sublist) for sublist in chat_history]
        # Get response from the QA chain (`qa` is defined earlier in the notebook)
        response = qa({"question": user_message, "chat_history": chat_history})
        # Append user message and response to chat history
        chat_history.append((user_message, response["answer"]))
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(debug=True, share=True)
Great, thanks for identifying the problem!
Thank you. How could I print which document (title) it came from and which page(s)? That is useful when there are multiple files with multiple pages in the source directory. Thank you for your time.
Yes, that would be nice, any tips?
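For the question above: in LangChain, each loaded `Document` carries a `metadata` dict (the PDF loaders typically fill in `source` and `page`), and a chain built with `return_source_documents=True` returns those documents alongside the answer, so the title and pages can be printed from there. Here is a minimal, LangChain-free sketch of the idea (the sample chunks and function name are made up):

```python
# Toy chunks, mimicking LangChain's Document(page_content, metadata) shape.
chunks = [
    {"page_content": "Revenue grew 12% in Q3.",
     "metadata": {"source": "report.pdf", "page": 4}},
    {"page_content": "Install with pip install langchain.",
     "metadata": {"source": "README.md", "page": 1}},
]

def answer_with_sources(question, retrieved):
    """Return the answer text plus a readable list of where it came from."""
    sources = [
        f'{doc["metadata"]["source"]} (page {doc["metadata"]["page"]})'
        for doc in retrieved
    ]
    return {"answer": f"Based on {len(retrieved)} chunk(s)...", "sources": sources}

result = answer_with_sources("How did revenue do?", [chunks[0]])
print(result["sources"])  # ['report.pdf (page 4)']
```

In a real chain you would read the same fields off `response["source_documents"]` instead of a hand-built list.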
Awesome tutorial.
Thank you for sharing!
You are welcome 😎
Please do this video again with Streamlit
Yes ❤
So every time I need to chat with my own data I will have to embed the query? That makes it much more expensive, doesn't it?
The query/question is in natural language and needs to be converted to numbers/vectors. So yes, each query needs to be embedded. Embedding is not that expensive. It is how it works :)
Can we do it with Groq ? How would you do the embeddings ?
How can I utilize this ChatBot for my SQL documents?
My question would be: how would you accommodate new data that has to be introduced to this? Would we do the vectorization process all over again, or is there a better way to handle it, even for one document?
You can 'upsert' a document over existing ones. No need to vectorize again.
@@AlgorithmicEchoes So when is vectorization needed and when not? How does that work? In the case of Pinecone, fine, we can simply push the docs and the rest is taken care of, right? Any idea about other cases?
Also, could you share any demo code for this, or refer to it here, so that others can benefit as well? That would be great.
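As a sketch of the 'upsert' idea discussed in this thread (not Pinecone's actual client API, just the concept): if each chunk gets a stable ID, re-upserting overwrites instead of duplicating, and adding one new document only touches that document's entries, so nothing else needs re-embedding.

```python
import hashlib

def chunk_id(source, chunk_text):
    """Stable ID so re-upserting the same chunk overwrites rather than duplicates."""
    return hashlib.sha1(f"{source}:{chunk_text}".encode()).hexdigest()[:12]

def upsert(store, source, chunk_text, vector):
    # Insert if the ID is new, overwrite if it already exists.
    store[chunk_id(source, chunk_text)] = {"source": source, "values": vector}

store = {}
upsert(store, "a.pdf", "first chunk", [0.1, 0.2])
upsert(store, "b.txt", "second chunk", [0.3, 0.4])
# Adding one new document later touches only its own entries:
upsert(store, "c.md", "new chunk", [0.5, 0.6])
# Re-upserting an existing chunk replaces it in place:
upsert(store, "a.pdf", "first chunk", [0.9, 0.9])
print(len(store))  # 3 entries, not 4
```

Pinecone's real `index.upsert(...)` behaves the same way with respect to IDs; with Chroma you can similarly add new documents to a persisted collection without rebuilding it.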
Thank you for the good video. I am curious why you stored the vectors in Chroma first and then again in Pinecone? Thank you
It was just for demonstration purposes. You can choose either Chroma or Pinecone.
hello sir,
What evaluation metrics should we use for our use case? Kindly let me know.
It's "Chunks", not "Choonks".
Just for fun, don't take it seriously... The video is informative and perfect.
Thanks, very good content!
You're welcome!
I tried your tutorial, but got stuck on the Pinecone steps with the error: AttributeError: init is no longer a top-level attribute of the pinecone package. Do you have an updated notebook?
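That `AttributeError` appears with pinecone-client v3 and later, which removed the top-level `pinecone.init(...)` call in favor of a `Pinecone` class. A minimal migration sketch (the API key and index name below are placeholders, not values from the video):

```python
# pinecone-client v3+ replaced pinecone.init(...) with a Pinecone class.
# Old (v2, as used in the video):
#   import pinecone
#   pinecone.init(api_key="...", environment="...")
#   index = pinecone.Index("langchain-demo")
# New (v3+):
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # environment is no longer passed here
index = pc.Index("langchain-demo")     # connect to an existing index
```

On the LangChain side, recent versions pair this with the separate `langchain-pinecone` package (`PineconeVectorStore`) rather than the older wrapper in `langchain.vectorstores`; check the current LangChain docs for the exact import.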
Amazing tutorial! Is there a way to add in the sources as well with the responses?
In many cases, yes. Refer to the documentation; you might find the answer.
@@datasciencebasics Hi, can you please tell which model you used for embeddings? The same gpt-3.5-turbo?
May I know which website you are using to execute step by step? I learnt a lot from this tutorial.
It's Google Colab. Refer to this video for more info -> ruclips.net/video/Xi9-W26cDBs/видео.html
Were you able to figure out the error when entering the second query? I’m running into the same issue.
Many thanks for the great tutorial, but it seems slow. Is there any way to make it run faster? Thanks in advance.
You are welcome. Some options: try different models, and you can also use a cache and see how it performs.
Can it be used with code? For example a .NET project with multiple classes
I am not sure if it's possible with .NET right now; LangChain has Python and JavaScript documentation.
Great Video 💯💥. So, can we add multiple CSV files instead of multiple types of files?
Thanks. It should be possible. You can give it a try :)
Very informative video, Sudarshan. In the Pinecone method, the free version does not support a 1536 embedding size. Suggestions?
Thank you Pranav. I am able to create embeddings with 1536 dimensions. There must be something wrong somewhere, OR Pinecone must have disabled it for newer users.
Why do we split data into chunks of 1000 or 1500 characters and then get the 4 most relevant chunks? Why not more than 1000 or 1500 characters per chunk, or more than 4 relevant chunks? Is there a limit on how many characters we can feed ChatGPT with? How much is the limit? Also, after using the code I checked my API usage in OpenAI and saw that I had used InstructGPT. What is InstructGPT?
Please refer to these materials. Also, in the future you can do a simple Google search to find your answers 😄
www.pinecone.io/learn/chunking-strategies/
Perplexity AI: what is instructGPT www.perplexity.ai/search/what-is-instructGPT-bbhHw0xHRLudFO.l..LIpQ?s=mn
@@datasciencebasics You are right
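A large part of the answer to the chunk-size question above is the model's context window: the retrieved chunks, the prompt template, the question, and the generated answer must all fit inside it. A rough back-of-the-envelope check, using the common (approximate) 4-characters-per-token heuristic and gpt-3.5-turbo's original 4,096-token window:

```python
chunk_size_chars = 1000   # characters per chunk
k = 4                     # chunks retrieved per question
chars_per_token = 4       # rough heuristic for English text

context_tokens = k * chunk_size_chars / chars_per_token
print(context_tokens)     # 1000.0 tokens just for the retrieved context

window = 4096             # gpt-3.5-turbo's original context window
budget_left = window - context_tokens
print(budget_left)        # 3096.0 tokens left for prompt, question, and answer
```

Doubling both the chunk size and the number of retrieved chunks would already consume roughly 4,000 of those tokens, leaving almost no room for the question and answer, which is why moderate values like 1,000-1,500 characters and k = 4 are common defaults.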
Hi, thank you for your contribution. How much data can I use? I mean, can a lot of documents be stored in the vector store?
As far as I know, you can store as much as you want. Give it a try, as I haven't tried storing that much data myself.
Nice video. Can we use open-source models for the same?
Here is with Llama2 ruclips.net/video/VPk-at5oqAY/видео.htmlsi=gkfVmnF0xP7pgJ8C
Is it possible to retrieve the embeddings from Chroma like with Pinecone? My second doubt: first I embedded 2 files, and now I want to add 2 more files. Do I need to embed all 4 files, or can the embeddings of the latest 2 files be combined with those of the first 2? If it is possible, how do I do it?
Yes, it is possible. You can just save the embeddings somewhere and dump the new ones into the same location. This way you don't need to embed everything again. Give it a try yourself and see how it works :)
Hey, I ran it and it worked great, but why are the responses so cold? Is it the temperature?
Hey, might be. You can try playing around with temperature.
Getting some numpy error: "AttributeError: module 'numpy.linalg._umath_linalg' has no attribute '_ilp64' " in all your LangChain related colab notebooks
By just seeing this, I have no clue how to help you. Try using newer versions of the packages, or update the code, as many libraries change over time.
@@datasciencebasics Could you please try rerunning your notebook up to the Directory Loader and check? That seems to be where the issue originates from
Hi, thanks for the video! Is it possible to chunk long HTML code, chat with it, and have GPT help modify the long code?
It is possible to load a GitHub repo, make different chunks, and ask questions related to the code. I will demonstrate this in the next video. Based on this, you might modify your code to achieve what you are looking for.
@@datasciencebasics Thanks for the reply, and in advance for the future video!
When executing the loaders section, it enters a long process that I stopped after 13 minutes. I checked and everything seemed normal.
You might need to uninstall and reinstall Pillow. Not sure if it helps, but you can try:
!pip uninstall -y Pillow
!pip install --upgrade Pillow
import PIL
I copied your Colab document. When I execute "# take all the loaders", I get an error:
ImportError: cannot import name 'is_directory' from 'PIL._util' (/usr/local/lib/python3.10/dist-packages/PIL/_util.py)
How do I solve it?
You might need to uninstall and reinstall Pillow:
!pip uninstall -y Pillow
!pip install --upgrade Pillow
import PIL
Great video, really helpful.
May I know about the different free chatbot UI options, like Gradio, and where to find code for some chatbots?
And one more thing: can we get our document's link from our chatbot?
Thank you.
Please reply to me.
Thanks. You can use Gradio, Streamlit, and others to create a UI for chatbots. Yes, you can also have the sources shown when returning the answers. Please refer to the LangChain documentation about QA with sources.
The last part of the code, Gradio, tried to install malware on my system.
Strange, others are not facing this issue. Were you able to run it?
At the loaders step, where you are creating a list out of all the loaders in documents = [], when I run this piece of code it takes more than 10 minutes and still hasn't executed. Can anyone help?
Hi, it's hard to say what went wrong without seeing what and how you are loading. Hope you already fixed it.
Can I know whether using this method will consume OpenAI tokens for reading and answering queries about the documents?
Yes, it does. Please watch my PrivateGPT video to learn how to chat with documents locally, where OpenAI is not used.
@@datasciencebasics Yes. I have followed your PrivateGPT video. It works, much appreciated!
OK, if I have a CSV file, how can I load it, bro?
You can use CSVLoader for that. I have other videos where I have explained how to deal with CSV files.
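For reference, LangChain's `CSVLoader` turns each CSV row into one document whose content is the row's "column: value" pairs, with the file path and row number in the metadata. Here is a small stdlib-only sketch of that behavior (the sample data and function name are made up for illustration):

```python
import csv
import io

# A stand-in for a file on disk; CSVLoader would take a file path instead.
raw = "name,role\nAda,engineer\nGrace,admiral\n"

def load_csv_rows(fileobj, source="data.csv"):
    """One document per CSV row, with 'column: value' lines as the content,
    similar to what LangChain's CSVLoader produces."""
    docs = []
    for i, row in enumerate(csv.DictReader(fileobj)):
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": i}})
    return docs

docs = load_csv_rows(io.StringIO(raw))
print(len(docs))                # 2
print(docs[0]["page_content"])  # name: Ada / role: engineer
```

Once rows are documents, the same split/embed/store steps from the video apply unchanged.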
Can this be used to chat with JSON files too ?
Yes, we can. Give it a try.
Can you share the Colab link?
Hey, it's in the description of the video 😎
How secure is Pinecone?
It depends on what kind of use case you have, but I would say generally it is secure!
It doesn't let me install the libraries; they take forever, and at the end it crashes and says there's no disk space. I'm on Replit.
Authentication error in the Pinecone steps. How to solve it?
HI. I got the same error. Did you fix it?
@@mohammedelismaili3803 No
Hey, it's related to Pinecone authentication, as the error says. Either it's from the Pinecone side or there is something wrong on your side. Hopefully it will be fixed.
Is there an easier way for non-programmers to chat with 300+ of their own PDFs? Does anyone sell ready-to-run solutions that I can just download, upload 300+ books on the same topic (legal theory, for example) into, and chat with?
You can use the latest model from OpenAI with the retrieval tool, which they claim can handle a 300-page PDF, or you can even create GPTs from OpenAI.
@@datasciencebasics Not 300 pages, I mean 300+ books (PDF, EPUB, and so on). Is it possible? Not online, but offline on my Macintosh: having my own GPT that is trained on my 300 books covering the same particular topic. Train it and then chat with it, asking questions on the topic, which GPT will answer with info from all those 300 books I've loaded it with.
Nothing is impossible, but for these kinds of scenarios, good research into open-source models is needed. I can't just say "use this or that model" right now.
@@datasciencebasics create such an app please (for mac), I will buy it
Is it possible to embed Persian language?
I haven't tried it myself, so I am not sure about it!
@@datasciencebasics it supports
It fails on "import pinecone".
Hey, you might need to check the latest code from the LangChain website. Also, did you install pinecone?