LangChain: How to Properly Split your Chunks

  • Published: 18 Aug 2023
  • In this video, we take a deep dive into the RecursiveCharacterTextSplitter class in LangChain. How you split your chunks/data determines the quality of the answers you get when you chat with your documents using LLMs. Learn how to properly use the text splitter in LangChain.
    #llm #langchain #PDFchat
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬ CONNECT ▬▬▬▬▬▬▬▬▬▬▬
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    |🔴 Support my work on Patreon: Patreon.com/PromptEngineering
    🦾 Discord: / discord
    ▶️️ Subscribe: www.youtube.com/@engineerprom...
    📧 Business Contact: engineerprompt@gmail.com
    💼Consulting: calendly.com/engineerprompt/c...
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    LINKS: python.langchain.com/docs/mod...
    ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
    All Interesting Videos:
    Everything LangChain: • LangChain
    Everything LLM: • Large Language Models
    Everything Midjourney: • MidJourney Tutorials
    AI Image Generation: • AI Image Generation Tu...
  • Science
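The recursive splitting the video walks through can be sketched in plain Python. This is a minimal illustration of the idea only, not LangChain's actual implementation; in practice you would use `RecursiveCharacterTextSplitter`, whose default separator list is `["\n\n", "\n", " ", ""]` (paragraphs, then lines, then words, then raw characters):

```python
def _merge(pieces, sep, chunk_size):
    """Greedily glue adjacent pieces back together up to chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try each separator in order; recurse into pieces that are still too big."""
    sep = separators[0]
    if sep == "":
        # Last resort: hard character-level cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, chunk_size, separators[1:]))
    return _merge(pieces, sep, chunk_size)
```

The key behavior the video demonstrates falls out of this structure: a paragraph that fits within `chunk_size` is kept whole, and only oversized paragraphs are broken down further, by lines, then words.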

Comments • 77

  • @parisneto
    @parisneto 3 months ago

    Just found your channel, and while I initially wished I'd had you as a professor in a classroom (maybe back in college 30 years ago), I really think you are helping to create a better world for many with your content, careful explanations, and examples. That is the true reason and mission of a teacher. Congrats!

  • @CacoNonino
    @CacoNonino 11 months ago +18

    Please make more videos like this one! Many people got into AI without a coding background; we are missing more detailed videos on these topics!

    • @AJJU_OZA
      @AJJU_OZA 11 months ago

      Answer me...
      Are Prompt Engineering's videos for developers only??

    • @CacoNonino
      @CacoNonino 11 months ago

      @@AJJU_OZA Well, if they were, I would not have been here for so long, hahahaha.
      What I meant is that for those who don't have coding knowledge and want to do more than replicate GitHub repos, this hands-on type of video is phenomenal!
      In my case, I am working on a text-based RPG game, and the basic concept of this video was one I had yet to grasp!
      Answered!

    • @CacoNonino
      @CacoNonino 11 months ago

      @@AJJU_OZA I mean, if the channel also had an LLM Python-focused course, I would be one of the people paying for it.
      I bet there are tons of people changing careers who also need more in-depth explanations of basic concepts like this one!

    • @ml-techn
      @ml-techn 11 months ago

      @@CacoNonino What do you mean by an LLM Python-focused course?

    • @CacoNonino
      @CacoNonino 11 months ago

      @@ml-techn I mean, I studied economics but changed careers midway to data engineering!
      Now I'm more and more building things on top of LLMs.
      All my coding knowledge came from using ChatGPT in the last year, and I think it is the same for a lot of people, hence why the tutorial videos are so popular.
      Am I making sense? I mean, there are a thousand videos out there that mention splitting text into chunks, but not a lot explaining how that is specifically done the way he did it here!

  • @asithakoralage628
    @asithakoralage628 11 months ago +1

    Thank you, you explain things very clearly, and I have been watching your content. It is really good and honest. Please keep making these types of videos. Thanks a lot.

  • @adnanrizve5551
    @adnanrizve5551 9 months ago

    Great work! Very simple but really detailed. Please create more videos for this series.

  • @RealEstate3D
    @RealEstate3D 11 months ago +1

    This is the first time I have seen content on optimal chunk lengths. In addition, it might be interesting to cover how to integrate metadata, for example which page of a book, which URL, or which paragraph of a legal text a chunk comes from. This metadata also takes up space in the retrieval context.
    Good work. Definitely keep going down this road.

  • @wassimsaioudi116
    @wassimsaioudi116 5 months ago

    Incredible! Hope you'll provide more videos like this one!

  • @darshan7673
    @darshan7673 8 months ago

    Great Video, Thanks for creating the video!

  • @user-ip6yq5tz1r
    @user-ip6yq5tz1r 7 months ago

    Great Video, Thanks for creating the video!😀

  • @SmashPhysical
    @SmashPhysical 11 months ago

    Great explanation, thanks, this will be super useful!

  • @e_hana_kakou
    @e_hana_kakou 11 months ago

    Appreciate all your content. I'd love to know more about chunking customization. Thanks! 🤙

  • @yazanrisheh5127
    @yazanrisheh5127 11 months ago +1

    Finally understood this. I remember asking on Discord, and I think you also replied, but the fact that an entire video was made on this made it much, much clearer. Thank you so much!
    Could you make a video about vector stores: which one to use, how to know what to use, and the code behind them? I saw a couple like FAISS, ChromaDB, Deep Lake, etc., and for my chatbot it's pretty much the last thing I have left to do, but I still don't understand most of it for now.

  • @WinstonWalker-fc7ty
    @WinstonWalker-fc7ty 11 months ago +6

    I’d love to see videos on both embedding size and modifying the text splitter! I’m particularly interested in strategies that would enable inclusion of citations, e.g. a medical article that includes numbered citations at the end of each sentence with the reference list at the end of the document.

  • @AA_135
    @AA_135 10 months ago

    Great explanation!

  • @duncanprins9944
    @duncanprins9944 10 months ago

    Great! Much appreciated 😊

  • @deepaksingh9318
    @deepaksingh9318 4 months ago +1

    And I think nobody can explain concepts in an easier way than you do.
    I tried 10 different videos to check how the recursive splitter would behave if a paragraph equals the chunk size, and you explained it. :)
    Love how you cover each and every aspect from a learning point of view. Thanks again.

    • @engineerprompt
      @engineerprompt  4 months ago

      Glad it was helpful. Make sure to watch the next one :)

  • @SachinChavan13
    @SachinChavan13 10 months ago +1

    Please keep making more videos like this. I found this video very helpful.

  • @gangs0846
    @gangs0846 5 months ago

    Thank you!

  • @izainonline
    @izainonline 9 months ago

    Great video for understanding chunks and the text splitter.

  • @Zivafgin
    @Zivafgin 10 months ago

    Great content! Please keep it up :)

  • @hvbris_
    @hvbris_ 11 months ago

    Good video - for the dataset I am working with, I found that splitting by tokens produces better results, but it really depends on the data you're working with, tbh!

  • @ShaneHolloman
    @ShaneHolloman 11 months ago

    Excellent to have someone break these concepts down so clearly. Keep going, this is great!

  • @JourneyMindMap
    @JourneyMindMap 5 months ago

    thanks dude

  • @RichardGetzPhotography
    @RichardGetzPhotography 11 months ago

    Yes please do a video on Embedding settings. I am currently using these.
    Parameters
    ----------
    VECTOR_SIZE: int
    The size of the vector for the text embeddings (e.g., 300).
    WINDOW_SIZE: int
    The context window size for text embeddings, capturing larger contextual information (e.g., 20).
    MIN_COUNT: int
    The minimum frequency count for words to be considered in the text embeddings (e.g., 1).
    EPOCHS: int
    The number of training iterations for the Doc2Vec model (e.g., 500).

  • @mdfarhananis8950
    @mdfarhananis8950 11 months ago +1

    Really useful
    Please continue making these

  • @hl236
    @hl236 4 months ago

    More videos on chunking and embedding please.

  • @TheCloudShepherd
    @TheCloudShepherd 7 months ago

    Damn, you explained that better in 3 minutes than most other videos did in 30.

  • @unshadowlabs
    @unshadowlabs 10 months ago

    I have seen a lot of videos on how to use these chunks with a vector database and have the LLM use RAG as a knowledge base. There seem to be very few videos on how to use the chunked data to fine-tune an LLM like Llama 2. I would love to see a video that covers using raw or chunked data to fine-tune an LLM without having to convert it into something like question-and-answer or instruct formatting.

  • @goncaavci1579
    @goncaavci1579 11 months ago

    Please make a video about embedding size. You are awesome; thank you for the videos.

  • @guanjwcn
    @guanjwcn 11 months ago

    Please continue with these; they are useful.

  • @weber1209rafael
    @weber1209rafael 11 months ago

    Please create more content with in-depth information about how to use this in a smart way. I'm currently building a domain-specific knowledge base to create an "AI expert" in a certain topic, and I am trying to find the right way to store all the knowledge.

  • @arkodeepchatterjee
    @arkodeepchatterjee 11 months ago

    really useful
    please continue making videos like this

  • @gerardorosiles8918
    @gerardorosiles8918 10 months ago

    Very nice video. I think anyone working on semantic search goes through the experience you described here. Have you seen a study that checks the performance of different embeddings with respect to chunk size?
    Also, what are the different available models for embeddings? I have been using FAISS; I have heard you mention another one. What would be a good strategy for picking one vs. another?

  • @nirsarkar
    @nirsarkar 10 months ago

    Please do create one on custom splitting. I have a particular document where I would like to define chunks demarcated by a special sequence.

  • @Ken129100
    @Ken129100 11 months ago +1

    Thanks for the video! What if you want to chunk a large PDF of 300 pages? How do you determine the chunk size? I mean, in your example you can observe the length of each paragraph by inspection, but that might be hard to do for a large file. I would appreciate it if you shared your opinion.

  • @walidmaly3
    @walidmaly3 11 months ago

    Please continue making videos like this. Any chance you can share the code as well?

  • @kenchang3456
    @kenchang3456 5 months ago

    Excellent explanation, thank you. Just curious: why is this the only video in your Demystifying LangChain playlist?

    • @engineerprompt
      @engineerprompt  5 months ago +1

      Thank you. Just way too many things to cover, but now getting back to RAG. Will be making a lot more content on it.

  • @jstormclouds
    @jstormclouds 11 months ago

    I feel I get the gist, but I'm interested in more on this topic.

  • @surajthakkar3420
    @surajthakkar3420 6 months ago

    Hello mate,
    Any chance you can make a video on context-aware chunking, which can improve the quality of chunks/output drastically?

  • @VerdonTrigance
    @VerdonTrigance 4 months ago

    How do I define my own list of separators? Can I set multiple separators for paragraphs and multiple for sentences at the same time?

  • @PerFeldvoss
    @PerFeldvoss 11 months ago

    What if you could preprocess the texts and reorganise sentences by "key subject relationships"? That is, as a supplement to the original text, you could make chunks of text that summarise different key subjects. The AI would produce a (creative) list of these subjects and then use that list when running through the text again... (and you could then "make LangChain know" which sentences actually belong together!)

  • @subhashinavolu1704
    @subhashinavolu1704 10 months ago

    What if the PDF has tables too? I see the PDF loader in LangChain is not reading the tables. How do you solve that? And if it is solved, how does the recursive text splitter work with such tabular data?

  • @MattGoldenberg
    @MattGoldenberg 11 months ago

    Hmmm, curious why you're splitting by character count and not by token count? Our recursive splitter always bottoms out at token count based on the model we're using, since the model can't see character-level data, and the token count is the limiting factor we actually care about at inference time.
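    A token-budget splitter along the lines this comment describes might look like the sketch below. `token_len` here is a stand-in (whitespace word count); in practice you would count real model tokens, for example with `tiktoken`, or use LangChain's `RecursiveCharacterTextSplitter.from_tiktoken_encoder`:

    ```python
    def token_len(text):
        """Stand-in token counter: whitespace-separated words.
        Swap in a real tokenizer (e.g. tiktoken) for model token counts."""
        return len(text.split())


    def split_by_token_budget(sentences, max_tokens):
        """Greedily pack sentences into chunks of at most max_tokens tokens."""
        chunks, current, count = [], [], 0
        for s in sentences:
            n = token_len(s)
            if current and count + n > max_tokens:
                # Current chunk is full; start a new one.
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(s)
            count += n
        if current:
            chunks.append(" ".join(current))
        return chunks
    ```

    The design choice the comment argues for is that the budget is measured in the unit the model actually sees, so a chunk can never silently exceed the context window the way a character budget can.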

  • @waelmashal7594
    @waelmashal7594 10 months ago

    If we check our docs, measure the length of each paragraph, and set the chunk size to the max length, would that help? Or maybe take the average length of all paragraphs? It depends on the splitter. What do you think?

    • @engineerprompt
      @engineerprompt  9 months ago

      This might be dated, but yes, that can be one approach. Another is to use regular expressions if there is a pattern in the data. There are now more advanced retrieval methods that can compress the data in the documents to make it more relevant to the query. A lot is happening in this space.
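      The regex approach mentioned in this reply can be sketched as follows; the `SECTION \d+` heading is a made-up example of a recurring pattern in a document, not anything from the video:

      ```python
      import re


      def split_on_pattern(text, pattern=r"(?=SECTION \d+)"):
          """Split a document wherever a recurring pattern begins.
          The zero-width lookahead keeps the matched heading at the
          start of each chunk instead of discarding it."""
          return [part.strip() for part in re.split(pattern, text) if part.strip()]
      ```

      This gives chunks whose boundaries follow the document's own structure rather than an arbitrary character count.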

  • @computerauditor
    @computerauditor 11 months ago

    🔥🔥🔥

  • @r0f115L4m
    @r0f115L4m 11 months ago +1

    Thank you for your video. What program are you using to create your diagrams?

  • @texasfossilguy
    @texasfossilguy 11 months ago

    What about a dynamic chunk size as a potential future feature? How does this work for a large series of documents like textbooks and other pdfs like science articles, or legal documents? What is a "best guess" for the parameters?

    • @shivanshugautam1381
      @shivanshugautam1381 10 months ago

      Hi, I am also having the same problem. Do you have any idea how we can split our documents into chunks efficiently?

  • @mikelugarte
    @mikelugarte 10 months ago

    I have a CSV file with product descriptions and IDs. I need to query the descriptions with the user input in order to get the product ID. I am using CharacterTextSplitter to split the full file into chunks with one line per chunk. After that, I want to do a similarity_search to get the lines of the CSV containing descriptions similar to the user input. I'm using the "\n" separator to split the text by lines, but for whatever reason it doesn't work sometimes. I'd love to see an example of CharacterTextSplitter in this kind of situation, or how to use RecursiveCharacterTextSplitter to do the same.

    • @user-im6cm7fr8p
      @user-im6cm7fr8p 9 months ago

      I am facing the same issue. I managed to write generic code for chunking; however, I am able to get results only for small datasets, not for large ones. Did you manage to solve it?

  • @rutvikghori2410
    @rutvikghori2410 3 months ago

    Thank you! How can I handle splitting when I have multiple files and want to generate a summary for each individually?

    • @engineerprompt
      @engineerprompt  3 months ago +1

      In that case, look into summarization-specific chains. Map-reduce will be a good start.

    • @rutvikghori2410
      @rutvikghori2410 3 months ago

      @@engineerprompt Suppose these are code files and I want to generate a summary for each separately.
      What should I do?

  • @AJJU_OZA
    @AJJU_OZA 11 months ago

    Sir, are Prompt Engineering's videos for developers only...???

  • @amol5146
    @amol5146 5 months ago

    Can you please explain how the chunk_overlap parameter works?

    • @engineerprompt
      @engineerprompt  5 months ago

      Let's say you define the chunk size to be 1000 characters with an overlap of 200. In this case, the first chunk will be characters 1-1000, and the second chunk will start at 801 and run to 1800, because there is an overlap of 200. Hope this helps.
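      The arithmetic in this reply can be sketched as a sliding window. This is an idealized illustration only; the real splitter also aligns chunk boundaries to separators rather than cutting at exact character offsets:

      ```python
      def overlapping_windows(n_chars, chunk_size=1000, overlap=200):
          """Character-index windows for a fixed-size chunker with overlap:
          each new chunk starts (chunk_size - overlap) characters after
          the previous one."""
          step = chunk_size - overlap
          return [(start, min(start + chunk_size, n_chars))
                  for start in range(0, n_chars, step)]
      ```

      For 1800 characters this yields windows starting at 0, 800, and 1600, matching the 1-1000 and 801-1800 chunks described above (0-indexed here).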

    • @amol5146
      @amol5146 5 months ago

      @@engineerprompt Thank you! Does chunk_overlap also follow the default separator list?

  • @vertigoz
    @vertigoz 2 months ago

    The link no longer works

  • @frazuppi4897
    @frazuppi4897 11 months ago

    In real life you need to do way more, and all the tutorials basically split some okay .txt files, but this is a good introduction.

  • @MuhammadDanyalKhan
    @MuhammadDanyalKhan 5 months ago

    I had a question on this video, i.e., how to split chunks:
    ruclips.net/video/n0uPzvGTFI0/видео.html .... How can I find the best chunk size for financial statements?

  • @fra8156
    @fra8156 11 months ago +1

    What about making a video using a very small LLM that every PC can handle, using it on a very specific task, fine-tuning it, and showing every step from zero to hero, so that we can work offline? That way everyone can get hands-on with this "lab" and learn by doing...

  • @CarlosIvanDonet
    @CarlosIvanDonet 11 months ago

    Does this work with a local cpp model? Like modelname-ggmlv1.q4_1.bin

    • @engineerprompt
      @engineerprompt  11 months ago +1

      Yes, it will work with any model