LangChain: How to Properly Split your Chunks

  • Published: 2 Dec 2024

Comments • 79

  • @deepaksingh9318 9 months ago +2

    I think nobody can explain concepts in an easier way than you do.
    I tried 10 different videos to see how the recursive splitter behaves when a paragraph is the chunk size, and you explained it :)
    Love how you cover every aspect from a learning point of view. Thanks again.

    • @engineerprompt 8 months ago

      Glad it was helpful. Make sure to watch the next one :)

  • @parisneto 8 months ago

    Just found your channel, and while I initially would have wanted you as a professor in a classroom (maybe back in college 30 years ago), I really think you are helping to create a better world for many with your content, careful explanations, and examples. That is the true mission of a teacher. Congrats!

  • @CacoNonino 1 year ago +20

    Please make more videos like this one! Many people got into AI without a coding background; we are missing more detailed videos on these topics!

    • @AJJU_OZA 1 year ago

      Answer me...
      Are Prompt Engineering's videos for developers (appreciation) only??

    • @CacoNonino 1 year ago

      @AJJU_OZA Well, if they were, I would not be here for so long hahahaha.
      What I meant is that for those who don't have coding knowledge and want to do more than replicate GitHub repos, this hands-on type of video is phenomenal!
      In my case I am working on a text-based RPG game, and the basic concept of this video was one I had yet to grasp!
      Answered!

    • @CacoNonino 1 year ago

      @AJJU_OZA I mean, if the channel also had an LLM Python-focused course, I would be one of those paying for it.
      I bet there are a ton of people changing careers who also need in-depth explanations of basic concepts like this one!

    • @ml-techn 1 year ago

      @CacoNonino What do you mean by an LLM Python-focused course?

    • @CacoNonino 1 year ago

      @ml-techn I mean, I studied economics but changed careers midway to data engineering!
      Now I'm more and more building things on top of LLMs.
      All my coding knowledge came from using ChatGPT in the last year, and I think it is the same for a lot of people, hence why these tutorial videos are so popular.
      Am I making sense? I mean, there are a thousand videos out there that mention splitting text into chunks, but not a lot that explain how it is specifically done, the way he did it here!

  • @WinstonWalker-fc7ty 1 year ago +6

    I’d love to see videos on both embedding size and modifying the text splitter! I’m particularly interested in strategies that would enable inclusion of citations, e.g. a medical article that includes numbered citations at the end of each sentence with the reference list at the end of the document.

  • @asithakoralage628 1 year ago +1

    Thank you, you explain things very clearly, and I have been watching your content. It's really good and honest. Please keep making these types of videos. Thanks a lot!

  • @SachinChavan13 1 year ago +1

    Please keep making more videos like this. I found this one very helpful.

  • @RealEstate3D 1 year ago +1

    This is the first time I've seen content on optimal chunk lengths. In addition, it might be interesting to cover how to integrate metadata, for example which page of a book, which URL, or which paragraph in a legal text a chunk comes from. That metadata will also take up space in the retrieval context.
    Good work. Definitely keep going down this road.

  • @wassimsaioudi116 10 months ago

    Incredible! Hope you'll provide more videos like this one!

  • @yazanrisheh5127 1 year ago +1

    Finally understood this. I remember asking on Discord, and I think you also replied, but having an entire video made on this made it much, much clearer. Thank you so much!
    Could you make a video about vector stores: which one to use, how to know what to use, and the code behind them? I saw a couple like FAISS, ChromaDB, Deep Lake, etc. For my chatbot it's pretty much the last thing I have left to do, but I still don't understand most of how vector stores work for now.

  • @TheCloudShepherd 1 year ago

    Damn, you explained that better in 3 minutes than most other videos did in 30.

  • @xiaok80s 1 month ago

    Great explanation. Thanks.

  • @darshan7673 1 year ago

    Great video, thanks for creating it!

  • @ShaneHolloman 1 year ago

    Excellent to have someone break these concepts down so clearly. Keep going, this is great!

  • @adnanrizve5551 1 year ago

    Great work! Very simple but really elaborate. Please create more videos for this series.

  • @hvbris_ 1 year ago

    Good video. For the dataset I am working with, I found that splitting by tokens produces better results, but it really depends on the data you're working with, tbh!

  • @izainonline 1 year ago

    Great video for understanding chunks and text splitters.

  • @SmashPhysical 1 year ago

    Great explanation, thanks, this will be super useful!

  • @e_hana_kakou 1 year ago

    Appreciate all your content. I'd love to know more about chunking customization. Thanks! 🤙

  • @AA_135 1 year ago

    Great explanation!

  • @mdfarhananis8950 1 year ago +1

    Really useful.
    Please continue making these!

  • @Ken129100 1 year ago +1

    Thanks for the video! What if you want to chunk a large PDF of 300 pages? How do you determine the chunk size? In your example you can observe the length of each paragraph by inspection, but that might be hard to do for a large file. I would appreciate it if you shared your opinion.

  • @RichardGetzPhotography 1 year ago

    Yes, please do a video on embedding settings. I am currently using these:

    Parameters
    ----------
    VECTOR_SIZE : int
        The size of the vector for the text embeddings (e.g., 300).
    WINDOW_SIZE : int
        The context window size for text embeddings, capturing larger contextual information (e.g., 20).
    MIN_COUNT : int
        The minimum frequency count for words to be considered in the text embeddings (e.g., 1).
    EPOCHS : int
        The number of training iterations for the Doc2Vec model (e.g., 500).

  • @kenchang3456 10 months ago

    Excellent explanation, thank you. Just curious, why is this video the only one in your Demystifying LangChain playlist?

    • @engineerprompt 10 months ago +1

      Thank you. Just way too many things to cover, but I'm now getting back to RAG. Will be making a lot more content on it.

  • @duncanprins9944 1 year ago

    Great! Much appreciated 😊

  • @unshadowlabs 1 year ago

    I have seen a lot of videos on how to use these chunks with a vector database and have the LLM use RAG as a knowledge base. There seem to be very few videos on how to use the chunked data to fine-tune an LLM like Llama 2. I would love to see a video that covers fine-tuning an LLM on raw or chunked data without having to convert it into something like question-and-answer or instruct formatting.

  • @hl236 9 months ago

    More videos on chunking and embedding please.

  • @weber1209rafael 1 year ago

    Please create more content with in-depth information about how to use this information in a smart way. I'm currently building a domain-specific knowledge base to create an "AI expert" on a certain topic, and I am trying to find the right way to store all the knowledge.

  • @nirsarkar 1 year ago

    Please do create one for custom splitting. I have a particular document where I would like to define chunks demarcated by a special sequence.
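
    For the custom-delimiter case above, a minimal sketch of one way to do it with CharacterTextSplitter; the "-----" delimiter and document_text are made-up placeholders:

    from langchain.text_splitter import CharacterTextSplitter

    # Placeholder document with a special boundary sequence.
    document_text = "Section one ...\n-----\nSection two ...\n-----\nSection three ..."

    # Split on the special sequence. Note that CharacterTextSplitter may
    # merge adjacent short sections back together up to chunk_size.
    splitter = CharacterTextSplitter(separator="-----", chunk_size=200, chunk_overlap=0)
    print(splitter.split_text(document_text))

    # If you strictly want one chunk per section, a plain split is enough:
    chunks = [c.strip() for c in document_text.split("-----") if c.strip()]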

  • @surajthakkar3420 10 months ago

    Hello mate,
    Any chance you can make a video on context-aware chunking, which can improve the quality of chunks/output drastically?

  • @jjolla6391 6 days ago

    An embedder should have an option so that chunks cannot cross paragraph boundaries, even if two paragraphs would fit in one chunk. That way the number of chunks will always be at least the number of paragraphs.
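
    As far as I know LangChain doesn't expose that exact switch, but here is a minimal sketch of how one might enforce it by splitting on paragraphs first (the "\n\n" boundary and the max_chars value are assumptions):

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    def split_per_paragraph(text: str, max_chars: int = 1000) -> list[str]:
        """Never let a chunk span two paragraphs: every paragraph becomes
        at least one chunk, and overlong paragraphs are sub-split."""
        sub_splitter = RecursiveCharacterTextSplitter(
            chunk_size=max_chars, chunk_overlap=100
        )
        chunks: list[str] = []
        for para in text.split("\n\n"):  # paragraph boundary convention
            para = para.strip()
            if not para:
                continue
            if len(para) <= max_chars:
                chunks.append(para)  # short paragraph -> its own chunk
            else:
                chunks.extend(sub_splitter.split_text(para))
        return chunks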

  • @gerardorosiles8918 1 year ago

    Very nice video. I think anyone working on semantic search goes through the experience you described here. Have you seen a study that checks the performance of different embeddings with respect to chunk size?
    Also, what are the different available models for embeddings? I have been using the FAISS models, and I have heard you mention another one. What would be a good strategy to pick one vs. another?

  • @VerdonTrigance 9 months ago

    How do I define my own list of separators? Can I set multiple separators for paragraphs and multiple for sentences at the same time?
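
    A minimal sketch, assuming the stock RecursiveCharacterTextSplitter API: paragraph-level and sentence-level separators go into one ordered list, tried from first to last.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Tried in order: blank lines (paragraphs), single newlines,
    # sentence-ending punctuation, spaces, then raw characters.
    splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", "? ", "! ", " ", ""],
        chunk_size=500,
        chunk_overlap=50,
    )

    text = (
        "First paragraph, long enough to matter.\n\n"
        "Second paragraph. It has two sentences! Does the splitter see them?"
    )
    print(splitter.split_text(text))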

  • @goncaavci1579 1 year ago

    Please make a video about embedding size. You are awesome, thank you for the videos!

  • @subhashinavolu1704 1 year ago

    What if the PDF has tables too? I see the PDF loader in LangChain is not reading the tables. How do you solve that? And if it is solved, how does the recursive text splitter work with such tabular data?

  • @gangs0846 10 months ago

    Thank you!

  • @MattGoldenberg 1 year ago

    Hmmm, curious why you're splitting by character count and not by token count? Our recursive splitter always bottoms out in token count based on the model we're using, as the model can't see character-level data, and the token count is the limiting factor we actually care about when inferencing.
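
    For anyone wanting to try token-based splitting as described above, a minimal sketch using LangChain's tiktoken helper (requires the tiktoken package; the file name is a placeholder):

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Placeholder input file; substitute your own document.
    with open("my_document.txt") as f:
        text = f.read()

    # Chunk length is measured in tokens instead of characters, so
    # chunk_size maps directly onto the model's context budget.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # tokenizer used by recent OpenAI models
        chunk_size=512,
        chunk_overlap=64,
    )

    chunks = splitter.split_text(text)
    print(f"{len(chunks)} chunks")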

  • @guanjwcn 1 year ago

    Please continue with these. They are useful.

  • @samuraibhai007 1 year ago +1

    Thank you for your video. What program are you using to create your diagrams?

  • @JourneyMindMap 10 months ago

    thanks dude

  • @texasfossilguy 1 year ago

    What about a dynamic chunk size as a potential future feature? How does this work for a large series of documents like textbooks and other PDFs such as science articles or legal documents? What is a "best guess" for the parameters?

    • @shivanshugautam1381 1 year ago

      Hi, I am also having the same problem. Do you have any idea how we can chunk our documents efficiently?

  • @Zivafgin 1 year ago

    Great content! Keep it up please :)

  • @rutvikghori2410 8 months ago

    Thank you! How can I resolve splitting issues? Suppose I have multiple files and I want to generate a summary for each one individually.

    • @engineerprompt 7 months ago +1

      In that case, look into summarization-specific chains. Map-reduce will be a good start.

    • @rutvikghori2410 7 months ago

      @engineerprompt Suppose these are code files and I want to generate a summary for each separately.
      What should I do?

  • @AJJU_OZA 1 year ago

    Sir, are Prompt Engineering's videos for developers (appreciation) only...???

  • @walidmaly3 1 year ago

    Please continue making videos like this. Any chance you can share the code as well?

  • @waelmashal7594 1 year ago

    If we check our docs, look at the length of each paragraph, and set the chunk size to the max length, would that help? Or maybe take the average length across all paragraphs? It depends on the splitter. What do you think?

    • @engineerprompt 1 year ago

      This might be dated, but yes, that can be one approach. Another is to use regular expressions if there is a pattern in the data. There are now more advanced retrieval methods that can compress the data in the documents to make it more relevant to the query. A lot is happening in this space.
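
      A rough sketch of the idea in this thread: measure paragraph lengths first, then derive the chunk size from the data instead of guessing (the file name is a placeholder):

      from langchain.text_splitter import RecursiveCharacterTextSplitter

      with open("my_document.txt") as f:  # placeholder file
          text = f.read()

      # Inspect paragraph lengths before committing to a chunk size.
      paragraphs = [p for p in text.split("\n\n") if p.strip()]
      lengths = sorted(len(p) for p in paragraphs)
      print("min:", lengths[0], "median:", lengths[len(lengths) // 2], "max:", lengths[-1])

      # Size chunks from the data: the max keeps every paragraph intact;
      # the average (as suggested above) trades that for tighter chunks.
      splitter = RecursiveCharacterTextSplitter(chunk_size=lengths[-1], chunk_overlap=0)
      chunks = splitter.split_text(text)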

  • @PerFeldvoss 1 year ago

    What if you could preprocess the texts and reorganize sentences by "key subject relationships"? That is, as a supplement to the original text, you could make chunks of text that summarize different key subjects. The AI would produce a (creative) list of these subjects and then use that list when running through the text again... (and you could then "make LangChain know" which sentences actually belong together!)

  • @mikelugarte 1 year ago

    I have a CSV file with product descriptions and IDs. I need to query the descriptions with the user input in order to get the product ID. I am using CharacterTextSplitter to split the full file into chunks with one line per chunk. After that I want to do a similarity_search to get the lines of the CSV that contain descriptions similar to the user input. I'm using the "\n" separator to split the text by lines but, for whatever reason, it doesn't work sometimes. I'd love to see an example of CharacterTextSplitter in this kind of situation, or how to use RecursiveCharacterTextSplitter to do the same.

    • @TousifAhamedNadaf 1 year ago

      I am facing the same issue. I managed to write generic code for chunking; however, I am able to get results only for small datasets, not for large ones. Did you manage to solve it?

  • @arkodeepchatterjee 1 year ago

    Really useful.
    Please continue making videos like this!

  • @jstormclouds 1 year ago

    I feel I get the gist, but I'm interested in more on this topic.

  • @amol5146 10 months ago

    Can you please explain how the chunk_overlap parameter works?

    • @engineerprompt 10 months ago

      Let's say you define the chunk size to be 1000 characters with an overlap of 200. In this case, the first chunk will cover characters 1-1000 and the second chunk will cover 801-1800, because there is an overlap of 200. Hope this helps.

    • @amol5146 10 months ago

      @engineerprompt Thank you! Does chunk_overlap also follow the default separator list?
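
      A toy sketch of the overlap arithmetic explained above. Setting separator="" forces pure character-level splitting, so the boundaries come out exactly as described; with the default separator list, boundaries snap to the nearest separator instead:

      from langchain.text_splitter import CharacterTextSplitter

      text = "abcdefghijklmnopqrstuvwxyz0123"  # 30 characters, easy to eyeball

      # separator="" splits character by character, so chunks are exactly
      # chunk_size long and each one re-uses the last chunk_overlap
      # characters of its predecessor.
      splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=4)

      for chunk in splitter.split_text(text):
          print(repr(chunk))
      # 'abcdefghij'
      # 'ghijklmnop'  <- starts 4 characters before the previous chunk ended
      # 'mnopqrstuv'
      # 'stuvwxyz01'
      # 'yz0123'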

  • @vertigoz 6 months ago

    The link no longer works.

  • @computerauditor 1 year ago

    🔥🔥🔥

  • @MuhammadDanyalKhan 10 months ago

    I had a question on this video, i.e. how to split chunks:
    ruclips.net/video/n0uPzvGTFI0/видео.html ... How can I find the best chunk size for financial statements?

  • @fra4897 1 year ago

    In real life you need to do way more, and all the tutorials basically just split some okay txt files, but this is a good introduction.

  • @fra8156 1 year ago +1

    What about making a video using a very small LLM that every PC can handle: using it on a very specific task, fine-tuning it, and showing every step from zero to hero, all working offline? That way everyone can get hands-on with this "lab" and learn by doing...

  • @CarlosIvanDonet 1 year ago

    Does this work with a local cpp model? Like modelname-ggmlv1.q4_1.bin

  • @貴-b3w 11 months ago

    Great video, thanks for creating it! 😀