Text splitter creates chunks of 4000 characters... not tokens, I guess. Correct me if I am wrong.
Absolutely, you are right. Thanks for pointing that out.
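For reference, a minimal sketch of how that looks in code; by default LangChain's RecursiveCharacterTextSplitter measures chunks with len(), i.e. in characters, not tokens (the chunk_overlap value here is just illustrative):

```python
# Sketch: RecursiveCharacterTextSplitter counts *characters* by default,
# because its default length_function is len().
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,      # maximum chunk length in characters, not tokens
    chunk_overlap=200,    # overlap between consecutive chunks, also in characters
    length_function=len,  # the default: measure by character count
)

chunks = splitter.split_text("some long document text ... " * 500)
print(len(chunks), max(len(c) for c in chunks))  # every chunk is <= 4000 chars
```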
Thank you for providing the repo link, very much appreciated. The code is very well kept and very easy to read given the file organization. This is a great codebase to start building out advanced use cases. Great work.
Thank you, much appreciated :)
Leon, thanks so much for your excellent content and video editing. One of the very few videos I didn't have to fast-forward! Straight to the point, with the slow parts cut out. Lovely, thanks.
Thank you very much for the acknowledgement, editing is such a big effort
Hi! Do you save your langchain objects using pickle/joblib so you can load them later? Is this necessary? Let's assume a continuously running web server with RAG in the backend. I can't just instantiate the objects every time... I need to load them, right?
Yes, you can save the vectorstore in pickle format. In this video I created the vectorstore from the 50 PDF files first, and then saved it.
The application then just loads the vectorstore and queries it. You can also keep adding new documents and saving it.
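As a rough sketch of that save-once / load-many pattern, assuming the FAISS vectorstore from the video (documents and new_documents stand in for whatever you have loaded; save_local persists the index plus a pickled docstore):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# One-time indexing step: embed the documents and persist the index to disk.
vectorstore = FAISS.from_documents(documents, embeddings)
vectorstore.save_local("vectorstore/")  # writes index.faiss plus a pickle file

# In the long-running web server: load the persisted index instead of re-embedding.
# (Newer LangChain versions also require allow_dangerous_deserialization=True here.)
vectorstore = FAISS.load_local("vectorstore/", embeddings)

# New documents can be appended later and the index saved again.
vectorstore.add_documents(new_documents)
vectorstore.save_local("vectorstore/")
```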
Is this just for devs? Because I fail to understand it from the very beginning. The content in the "create_index" file in the repository is completely different from the file you show in the video.
I am sorry for the confusion. I structured the repository for a better overview; functions that are used in multiple places are in the utils.py file.
Configurations like the chunk size or the path where the vector index is stored are in the config.yaml file.
Let me know if there is something specific you don't understand.
I am also planning an updated version of this video.
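For example, loading those settings could look like this (a hedged sketch; the actual key names in the repo's config.yaml may differ):

```python
import yaml

# Load the settings once at startup; the keys shown here (chunk_size,
# vectorstore_path) are illustrative and may be named differently in the repo.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

chunk_size = config["chunk_size"]              # e.g. 1000 characters per chunk
vectorstore_path = config["vectorstore_path"]  # where the vector index is stored
```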
@@leonsaiagency Nothing to be sorry about, I like the idea behind your project, and I bet it is very easy for someone who is more into this topic.
I am an absolute noob when it comes to GitHub or dev stuff in general, that's the problem :D
I'll try it again in the evening. But before I do, I have one question: I work with a 2020 MacBook M1 (8 GB RAM). Do you think my hardware is good enough, or does this need a bigger CPU?
Astonishing content, man!
Thank you very much 🙏
Thank you so much for your great tutorial. Any advantage of using Supabase instead of FAISS?
Quite a high level, but after watching a few more videos on the topic, I am starting to understand most of it.
I think the problem with Langchain is that it is so incredibly extensive. There are dozens of components, and once you start using not just OpenAI text models but other models as well, it gets really intense.
P.S. Tell me, do you also do 1-on-1 coaching? I am a consultant who would like to add the data/AI market to his "portfolio". But I still have quite a bit to learn for that.
Thank you very much @eugenmalatov5470.
I am currently setting everything up to offer coaching.
For now, feel free to contact me at leonsander.consulting@gmail.com, and we can arrange a free initial consultation to see how I can best help you and to discuss the details.
Great video 🔥
Thank you, much appreciated 🙏
How exactly do you deal with token limits?
You have to account for two token limits. First, the context window of the embedding model; the one in the video has a context window of 256 tokens, while OpenAI embeddings have around 8,000.
So with the recursive character text splitter, we split the documents into chunks of at most 256 tokens, or approximately 1,000 characters.
Second, you have to account for the context window of the LLM, in this case GPT-3.5 Turbo with a context window of about 4,000 tokens. Your prompt plus all the retrieved documents should not exceed that limit. Let's say we set the k parameter to 5; then we would have 5 document chunks of at most 256 tokens each. That is 1,280 tokens plus the tokens from your prompt, well under the 4k context window.
@@leonsaiagency So under the hood, an API call is made to GPT with the 5 retrieved chunks passed in alongside the input?
@@MohammedRahman-x8o exactly
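To make the arithmetic concrete, a small sketch that counts the tokens before the call, using OpenAI's tiktoken library (vectorstore and query are assumed to already exist; 4,096 is GPT-3.5 Turbo's original context window):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# k = 5 chunks of at most 256 tokens each, retrieved from the vectorstore
# loaded earlier (vectorstore and query are assumed to exist at this point).
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
chunks = [doc.page_content for doc in retriever.get_relevant_documents(query)]

total = count_tokens(query) + sum(count_tokens(c) for c in chunks)
assert total < 4096, "prompt + retrieved context would exceed the context window"
```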
Thanks for the video 😊 Is it possible to get the name of the source PDF with the answer?
Sure, it should be in the metadata of the returned documents.
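A short sketch of what that looks like, assuming the vectorstore from the video (LangChain's PDF loaders store the file path under the "source" metadata key):

```python
# Each returned Document carries the metadata its loader attached;
# LangChain's PDF loaders put the file path under "source"
# (and usually the page number under "page").
docs = vectorstore.similarity_search("your question", k=3)

for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"))
```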
Hello. Thank you for the video. Do you think this will work with PDF in other languages, not English? I'm Cambodian and would like to know if it can work with PDF in Khmer language. Thank you.
You're welcome. OpenAI embeddings have been trained on different languages, so you can try it out with them. I did it with German documents and it worked perfectly fine. Try it with a relatively small chunk of text to keep the costs low.
For a frame of reference, roughly 70,000 words would cost 1 cent.
If it does not work, you could include a preprocessing step that translates your documents into English first.
Let me know what the results were if you try it 💪
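The 1-cent figure checks out as a back-of-the-envelope calculation, assuming text-embedding-ada-002 pricing of $0.0001 per 1,000 tokens and roughly 1.3 tokens per word (prices may have changed since):

```python
# Back-of-the-envelope embedding cost: ~1.3 tokens per English word,
# ada-002 priced at $0.0001 per 1,000 tokens (assumed; check current pricing).
words = 70_000
tokens = words * 1.3                # ~91,000 tokens
cost_usd = tokens / 1000 * 0.0001   # ~$0.009, i.e. about one cent
print(f"{cost_usd:.4f} USD")
```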
good work
Hi Dr,
Please make a video on question answering from a particular source (let's say a 1,000-page PDF).
Thanks in advance.
You can use this code to do exactly that, just place the particular PDF in the data folder.
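If you want to point the code at one specific large PDF instead, a minimal sketch (PyPDFLoader is one of LangChain's PDF loaders; the file name and splitter settings here are just illustrative):

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load one large PDF page by page, then split it into retrievable chunks.
pages = PyPDFLoader("data/my_1000_page_document.pdf").load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)
```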
Will you create a video on creating chunks from multiple PDF files in the same directory but in different folders, and also on creating vectors and embeddings using Supabase?
You can get all PDF files from different folders within the same directory with the line glob("./**/*.pdf", recursive=True).
I might look into supabase and make a video on that, thanks for your suggestion.
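Spelled out as a full loading loop, it could look like this (a sketch; PyPDFLoader stands in for whichever loader you use):

```python
from glob import glob
from langchain.document_loaders import PyPDFLoader

# Recursively collect every PDF under the current directory,
# whatever subfolder it sits in, and load them all.
documents = []
for path in glob("./**/*.pdf", recursive=True):
    documents.extend(PyPDFLoader(path).load())
```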
@@leonsaiagency Thank you
Good content! But in my opinion, you could slow down a bit 🙂
Thank you for the feedback, maybe I can change something about that
@@leonsaiagency Or viewers can just set the playback speed in RUclips to 0.75 for slower viewing if they need. Most of the time I'm setting video playback faster.