Index 50 PDF Books In 5 Minutes With Langchain Vectorindex

  • Published: 1 Jan 2025

Comments • 36

  • @ParasJain-id7bz • 1 year ago +7

    The text splitter creates chunks of 4,000 characters, not tokens, I guess. Correct me if I'm wrong.

    • @leonsaiagency • 1 year ago +3

      Absolutely, you are right. Thanks for pointing that out.
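      The distinction above can be illustrated with a simplified stand-in for a character-based splitter. This is not LangChain's actual implementation; the function name and parameters are made up for illustration, but it shows that `chunk_size` is measured with `len()`, i.e. in characters:

      ```python
      def split_by_chars(text, chunk_size=4000, overlap=200):
          """Naive character-based splitter: chunk_size counts characters, not tokens."""
          chunks, start = [], 0
          while start < len(text):
              chunks.append(text[start:start + chunk_size])
              start += chunk_size - overlap  # step back by `overlap` chars each time
          return chunks

      doc = "x" * 10_000
      print([len(c) for c in split_by_chars(doc)])  # [4000, 4000, 2400]
      ```

      A 4,000-character chunk is very roughly 1,000 tokens for English text, so the character and token counts differ by about a factor of four.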

  • @matthewsimon-ol4kb • 1 year ago +4

    Thank you for providing the repo link, much appreciated. The code is very well kept and easy to read given the file organization. This is a great codebase for building out advanced use cases. Great work.

  • @techitint.9100 • 1 year ago +1

    Leon, thanks so much for your excellent content and video editing. One of the very few videos I didn't have to fast-forward! Straight to the point, with the slow parts cut out. Lovely, thanks.

    • @leonsaiagency • 1 year ago

      Thank you very much for the acknowledgement; cutting is such a big effort.

  • @arthurbrc_ • 1 year ago

    Hi! Do you save your LangChain objects using pickle/joblib so you can load them later? Is this necessary? Let's assume a full-time running web server with RAG in the backend. I can't just instantiate the objects every time... I need to load them, right?

    • @leonsaiagency • 1 year ago

      Yes, you can save the vectorstore in pickle format. In this video I created the vectorstore with the 50 PDF files first and then saved it.
      The application then just loads the vectorstore and queries it. You can also keep adding new documents and re-saving it.
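      The build-once / load-many pattern from the reply can be sketched like this. `DummyVectorStore` and the file path are hypothetical stand-ins for illustration, not the repo's actual classes:

      ```python
      import os
      import pickle
      import tempfile

      class DummyVectorStore:
          """Stand-in for the real LangChain vectorstore (hypothetical)."""
          def __init__(self, docs=()):
              self.docs = list(docs)

          def add(self, doc):
              self.docs.append(doc)

      def save_store(store, path):
          # Build once (expensive), then persist to disk.
          with open(path, "wb") as f:
              pickle.dump(store, f)

      def load_store(path):
          # The long-running server only does this cheap load at startup.
          with open(path, "rb") as f:
              return pickle.load(f)

      path = os.path.join(tempfile.gettempdir(), "vectorstore.pkl")  # hypothetical path
      save_store(DummyVectorStore(["chunk from report.pdf"]), path)

      loaded = load_store(path)
      loaded.add("chunk from a new document")  # extend, then re-save later
      print(loaded.docs)
      ```

      One caveat: only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted data.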

  • @songoku69_ • 1 year ago +3

    Is this just for devs? I fail to understand it right at the beginning. The content of the "create_index" file in the repository is completely different from the file you show in the video.

    • @leonsaiagency • 1 year ago +1

      I am sorry for the confusion. I structured the repository for a better overview; functions that get used at multiple points are in the utils.py file.
      Configurations like the chunk size or the path where the vectorindex is stored are in the config.yaml file.
      Let me know if there is something specific you don't understand.
      I am also planning an updated version of this video.
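      A config.yaml of the kind described might look like this; the keys and values here are illustrative guesses, not the repo's actual file:

      ```yaml
      # Hypothetical illustration of the described config.yaml
      chunk_size: 1000            # characters per chunk
      chunk_overlap: 100
      data_path: ./data           # folder containing the PDFs
      vectorindex_path: ./index   # where the vectorstore is persisted
      ```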

    • @songoku69_ • 1 year ago

      @leonsaiagency Nothing to be sorry about, I like the idea behind your project and I bet it is very easy for someone who is more into this topic.
      I am an absolute noob when it comes to GitHub or dev stuff in general, that's the problem :D
      I'll try again in the evening. But before I do, I have one question: I work with a 2020 MacBook M1 (8 GB RAM). Do you think my hardware is good enough, or does this need a bigger CPU?

  • @MasterBrain182 • 1 year ago

    Astonishing content, man!

  • @MohammedAbdulatef • 1 year ago

    Thank you so much for your great tutorial. Any advantage of using Supabase instead of FAISS?

  • @eugenmalatov5470 • 1 year ago

    Pretty high level, but after watching a few more videos on the topic, I am starting to understand most of it.
    I think the problem with Langchain is that it is so incredibly extensive. There are dozens of components, and once you start using not just OpenAI text models but other models as well, it gets really intense.
    P.S. Tell me, do you also do 1-on-1 coaching? I am a consultant who would like to add the data/AI market to his "portfolio". To do that, though, I still have quite a bit to learn.

    • @leonsaiagency • 1 year ago

      Thank you very much, @eugenmalatov5470.
      I am currently setting everything up to offer coaching.
      For now, feel free to contact me at leonsander.consulting@gmail.com, and then we can arrange a free initial consultation to see how I can best help you and to discuss further details.

  • @chineduezeofor2481 • 1 year ago

    Great video 🔥

  • @MohammedRahman-x8o • 1 year ago

    How exactly do you deal with token limits?

    • @leonsaiagency • 1 year ago

      You have to account for two token limits. First, the context window of the embedding model; in the video it has a context window of 256 tokens, while OpenAI embeddings have around 8,000.
      So with the recursive character text splitter, we split the documents into chunks of at most 256 tokens, or approximately 1,000 characters.
      Second, you have to account for the context window of the LLM, in this case GPT-3.5 Turbo with a context window of 4,000 tokens. Your prompt plus all the retrieved documents should not exceed that limit. Let's say we set the k parameter to 5; then we would have 5 document chunks of at most 256 tokens each. That is 1,280 tokens plus the tokens from your prompt, way under the 4K context window.
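      The arithmetic in the reply can be written out as a quick sanity check. The window sizes and k are the values mentioned in the thread, not universal constants:

      ```python
      # Token budget: k retrieved chunks of at most `chunk_tokens` each,
      # plus the prompt, must fit in the LLM's context window.
      EMBEDDING_WINDOW = 256   # embedding model context (tokens), per the video
      LLM_WINDOW = 4000        # GPT-3.5 Turbo context (tokens)
      K = 5                    # number of retrieved chunks

      def fits(prompt_tokens, k=K, chunk_tokens=EMBEDDING_WINDOW,
               llm_window=LLM_WINDOW):
          """True if prompt plus retrieved chunks stays within the LLM window."""
          return prompt_tokens + k * chunk_tokens <= llm_window

      print(K * EMBEDDING_WINDOW)  # 1280 tokens retrieved in the worst case
      print(fits(500))             # 500 + 1280 = 1780 <= 4000 -> True
      ```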

    • @MohammedRahman-x8o • 1 year ago

      @leonsaiagency So under the hood, an API call is made to GPT with the 5 chunks passed alongside the input?

    • @leonsaiagency • 1 year ago

      @MohammedRahman-x8o Exactly.

  • @ahmedmechergui8680 • 1 year ago

    Thanks for the video 😊 Is it possible to get the name of the source PDF with the answer?

    • @leonsaiagency • 1 year ago

      Sure, it should be in the metadata of the returned documents.
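      In sketch form, with a stand-in for LangChain's Document class. The `metadata["source"]` field name follows the common convention of LangChain's PDF loaders, which is an assumption about this repo:

      ```python
      class Doc:
          """Stand-in for langchain.schema.Document."""
          def __init__(self, page_content, metadata):
              self.page_content = page_content
              self.metadata = metadata

      def sources(docs):
          """Distinct source files of the retrieved chunks, in retrieval order."""
          seen = []
          for d in docs:
              src = d.metadata.get("source", "unknown")
              if src not in seen:
                  seen.append(src)
          return seen

      retrieved = [
          Doc("...", {"source": "data/report.pdf", "page": 3}),
          Doc("...", {"source": "data/report.pdf", "page": 7}),
          Doc("...", {"source": "data/manual.pdf", "page": 1}),
      ]
      print(sources(retrieved))  # ['data/report.pdf', 'data/manual.pdf']
      ```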

  • @Neosmey • 1 year ago

    Hello. Thank you for the video. Do you think this will work with PDF in other languages, not English? I'm Cambodian and would like to know if it can work with PDF in Khmer language. Thank you.

    • @leonsaiagency • 1 year ago +2

      You're welcome. OpenAI embeddings have been trained on different languages, so you can try it out with them. I did it with German documents and it worked perfectly fine. Try it with a relatively small chunk of text to keep the costs low.
      For a frame of reference, roughly 70,000 words would cost 1 cent.
      If it does not work, you could include a preprocessing step that translates your documents into English first.
      Let me know the results if you try it
      💪
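      The cost figure can be sanity-checked with rough arithmetic, assuming text-embedding-ada-002 pricing of $0.0001 per 1K tokens and the common ~0.75 words-per-token rule of thumb for English text. Both are assumptions, and prices change:

      ```python
      # Rough check of the "~70,000 words ≈ 1 cent" figure from the reply.
      WORDS = 70_000
      WORDS_PER_TOKEN = 0.75            # rule of thumb; varies by language
      PRICE_PER_1K_TOKENS = 0.0001      # assumed ada-002 pricing, USD

      tokens = WORDS / WORDS_PER_TOKEN  # ≈ 93,333 tokens
      cost = tokens / 1000 * PRICE_PER_1K_TOKENS
      print(f"${cost:.4f}")             # ≈ $0.0093, about one cent
      ```

      Note that Khmer and other non-Latin scripts typically tokenize less efficiently, so the per-word cost there may be noticeably higher.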

  • @SayaliYadav-h7x • 1 year ago

    good work

  • @filmy2666 • 1 year ago

    Hi Dr,
    Please make a video on question answering over a particular source (let's say a 1,000-page PDF).
    Thanks in advance

    • @leonsaiagency • 1 year ago

      You can use this code to do exactly that; just place the particular PDF in the data folder.

  • @SayaliYadav-h7x • 1 year ago

    Will you create a video on creating chunks from multiple PDF files in the same directory but in different folders, and also on creating vectors and embeddings using Supabase?

    • @leonsaiagency • 1 year ago

      You can get all PDF files from different folders within the same directory with the line glob("./**/*.pdf", recursive=True).
      I might look into Supabase and make a video on that, thanks for your suggestion.
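      That pattern can be demonstrated end to end; the directory layout below is made up for illustration:

      ```python
      import os
      import tempfile
      from glob import glob

      # Build a throwaway tree with PDFs at different depths, plus one non-PDF.
      with tempfile.TemporaryDirectory() as root:
          for rel in ("z.pdf", "a/x.pdf", "a/b/y.pdf", "a/notes.txt"):
              path = os.path.join(root, rel)
              os.makedirs(os.path.dirname(path), exist_ok=True)
              open(path, "w").close()

          # With recursive=True, "**" matches zero or more directory levels,
          # so top-level PDFs are included too; the .txt file is not.
          pdfs = glob(os.path.join(root, "**", "*.pdf"), recursive=True)
          names = sorted(os.path.relpath(p, root).replace(os.sep, "/")
                         for p in pdfs)
          print(names)  # ['a/b/y.pdf', 'a/x.pdf', 'z.pdf']
      ```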

    • @SayaliYadav-h7x • 1 year ago +1

      @leonsaiagency Thank you

  • @samsquamsh78 • 1 year ago

    Good content! But in my opinion, you could slow down 🙂

    • @leonsaiagency • 1 year ago

      Thank you for the feedback, maybe I can change something about that.

    • @panvidbill1912 • 1 year ago

      @leonsaiagency Or viewers can just set the playback speed on YouTube to 0.75× for slower viewing if they need to. Most times I'm setting video playback faster.