Semantic-Text-Splitter - Create meaningful chunks from documents

  • Published: 10 Dec 2024

Comments • 43

  • @codingcrashcourses8533
    @codingcrashcourses8533  9 months ago +1

    I made a mistake in this video: this splitter does NOT accept a full model, it only accepts the tokenizer. Sorry for that. So I am still looking for a good way to create LLM-based chunks :(
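
    The idea the video was after (capacity-limited chunks that respect semantic boundaries) can be sketched in plain Python. A word count stands in for a real tokenizer such as a HuggingFace `tokenizers.Tokenizer`; this is a minimal sketch of the concept, not the library's actual implementation:

```python
import re

def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer's token count; here we just
    count whitespace-separated words."""
    return len(text.split())

def chunk_by_capacity(text: str, capacity: int) -> list[str]:
    """Pack whole sentences into chunks of at most `capacity` tokens:
    split on semantic boundaries (sentences), never mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and count_tokens(candidate) > capacity:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Chunking matters. Bad chunks give bad embeddings. Good chunks respect sentence boundaries."
print(chunk_by_capacity(text, capacity=8))
```

    The real library does this with the tokenizer's own token counts, so chunk sizes line up with what the embedding model actually sees.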

    • @nmstoker
      @nmstoker 9 months ago

      It's a shame but I think the underlying idea of what you were after makes sense. It amuses me that so often people try LLMs with RAG outputs that even a typical human would struggle with!

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago +5

      @@nmstoker I will release a video on how to make an LLM-based splitter next :). When nobody else wants to do it, let's do it ourselves :)

    • @nathank5140
      @nathank5140 9 months ago +1

      Following for that. I have many meeting transcripts with discussions between two or more participants. The conversation is often non-linear, with topics revisited multiple times. I'm trying to find a good way to embed the content. Thinking maybe I'll write one or more articles on each meeting, then chunk those. Not sure; would appreciate any ideas.

    • @vibhavinayak8527
      @vibhavinayak8527 7 months ago

      @@codingcrashcourses8533 Looks like some people have implemented 'Advanced Agentic Chunking' which actually uses an LLM to do so! Maybe you should make a video about it?
      Thank you for your content, love your videos!

    • @codingcrashcourses8533
      @codingcrashcourses8533  7 months ago

      @@vibhavinayak8527 Currently learning LangGraph, but I still struggle with that.

  • @micbab-vg2mu
    @micbab-vg2mu 9 months ago +4

    Thank you for the video :) I agree, random chunking every 500 or 1000 tokens gives random results.

  • @henkhbit5748
    @henkhbit5748 8 months ago +1

    Yes, a much better chunking approach. Thanks for showing 👍

  • @kenj4136
    @kenj4136 9 months ago +3

    Your tutorials are gold, thanks!

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago

      Thanks so much, honestly I am quite surprised that so many people watch and like this video.

  • @Retburstjk
    @Retburstjk 3 months ago

    ty for the concise and helpful video about STS

  • @MikewasG
    @MikewasG 9 months ago +1

    Thank you for your effort! The video is very helpful!

  • @Munk-tt6tz
    @Munk-tt6tz 7 months ago

    Exactly what i was looking for, thanks!

  • @ashleymavericks
    @ashleymavericks 9 months ago

    This is a brilliant idea!

  • @andreypetrunin5702
    @andreypetrunin5702 9 months ago

    Thank you! Very useful!

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 9 months ago

    Very intuitive approach towards RAG performance improvement. I wonder if the bar chart at the end would be better off substituted with a 2-dimensional representation and evaluated with KNN.

  • @AnthonyAlcerro-v6d
    @AnthonyAlcerro-v6d 9 months ago +1

    We need a langchain in production course, hope you consider it!!!

  • @pillaideepakb
    @pillaideepakb 9 months ago

    This is amazing

  • @thevadimb
    @thevadimb 7 months ago +1

    Why didn't you like the Langchain implementation of the semantic splitter? What was the problem with it?

  • @znacibrateSANDU
    @znacibrateSANDU 9 months ago

    Thank you

  • @moonly3781
    @moonly3781 9 months ago

    Stumbled upon your amazing videos and want to thank you for the incredible tutorials. Truly amazing content!
    I'm developing a study advisor chatbot that aims to answer students' questions based on detailed university course descriptions, each roughly the length of a PDF page. The challenge arises with varying descriptions across different universities for similar course names, and the varying length of each course. Each document starts with the university name and the course description.

    I've tried adding the university name and course description before every significant point, which helped when chunking by regex to ensure all relevant information is contained within each chunk. Despite this, when asking university-specific questions, the correct course description for the queried university sometimes doesn't appear in the retrieved chunks. Considering a description is about a page of text, do you have a better approach to this problem, or any tips? Really sorry for the long question :) I would be very grateful for the help.
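
    The metadata-prefix approach described in the question can be sketched as follows; the field names and chunk size are hypothetical:

```python
def chunk_with_header(doc: dict, chunk_size: int = 400) -> list[str]:
    """Prefix every chunk with the document's metadata (university and
    course name, per the approach described above) so each chunk stays
    self-identifying after splitting."""
    header = f"[{doc['university']} - {doc['course']}] "
    body = doc["text"]
    step = chunk_size - len(header)  # leave room for the header in each chunk
    return [header + body[i:i + step] for i in range(0, len(body), step)]

doc = {
    "university": "Example University",
    "course": "Intro to AI",
    "text": "Lorem ipsum " * 100,
}
chunks = chunk_with_header(doc)
print(all(c.startswith("[Example University") for c in chunks))
```

    Every chunk then carries its source university, so a university-specific query has a lexical and semantic anchor in each candidate chunk.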

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago +1

      It depends on your use case. I think embeddings for 1 PDF are quite trashy, but if you need the whole document you can have a look at a parent-child retriever: you embed very small documents, but pass the larger related document to the LLM. Not sure what to do with the noise part; LLMs can handle SOME noise :)
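
      The parent-child retriever suggested here can be sketched in a few lines; real embeddings are replaced by naive keyword overlap just to show the indexing shape:

```python
def build_parent_child_index(parents: dict[str, str], chunk_size: int = 80):
    """Index small chunks for matching, but remember each chunk's parent
    so the full document can be returned to the LLM."""
    index = []  # list of (chunk_text, parent_id)
    for pid, text in parents.items():
        for i in range(0, len(text), chunk_size):
            index.append((text[i:i + chunk_size], pid))
    return index

def retrieve_parent(query: str, index, parents):
    """Score chunks by shared lowercase words (a stand-in for embedding
    similarity), then return the best chunk's whole parent document."""
    qwords = set(query.lower().split())
    best = max(index, key=lambda item: len(qwords & set(item[0].lower().split())))
    return parents[best[1]]

parents = {
    "doc1": "Semantic chunking splits text on meaning boundaries. It keeps topics together.",
    "doc2": "Tokenizers turn text into token ids. Capacity limits bound chunk sizes.",
}
index = build_parent_child_index(parents)
print(retrieve_parent("how does semantic chunking work", index, parents))
```

      The point of the pattern is the indirection: matching happens on small, focused chunks, while the LLM receives the larger parent for context.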

  • @raphauy
    @raphauy 9 months ago

    Thanks for the video. Is there a way to do this with typescript?

  • @maxlgemeinderat9202
    @maxlgemeinderat9202 9 months ago

    Interesting! Saw the LangChain implementation. Do you prefer this one, and could the tokenizer be any embedding model?

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago +2

      There is a difference between an embedding model and a tokenizer, hope you are aware of that. If yes, I didn't understand the question.

  • @jasonsting
    @jasonsting 9 months ago

    Since this solution creates "meaningful" chunks, implying that there can be meaningless or less meaningful chunks, would that then imply that these chunks affect the semantic quality of embeddings/vector database? I was previously getting garbage out of a chromadb/faiss test and this would explain it.

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago +3

      I would argue that there are two different kinds of "trash" chunks: 1. docs that just get cut off and lose their meaning; 2. chunks that are too large and cover multiple topics -> the embeddings just don't mean anything.
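
      The two failure modes can be seen side by side in a small sketch: a fixed-size splitter that cuts mid-sentence versus a greedy sentence packer whose size cap keeps one chunk from swallowing many topics (illustrative only):

```python
import re

def naive_split(text: str, size: int) -> list[str]:
    """Fixed-size splitting: happily cuts mid-word (failure mode 1)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_split(text: str, size: int) -> list[str]:
    """Greedy sentence packing: chunks end on sentence boundaries, and
    the size cap bounds how many topics one chunk can cover (failure mode 2)."""
    chunks, current = [], ""
    for s in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + len(s) + 1 > size:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks

text = "Dogs are loyal pets. Cats are independent. Python is a language."
print(naive_split(text, 25))     # cuts words in half
print(sentence_split(text, 25))  # whole sentences per chunk
```

      With the naive splitter, "Cats" gets separated from its sentence; with the sentence packer, each chunk is a single coherent statement.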

  • @bertobertoberto3
    @bertobertoberto3 8 months ago

    Wow

  • @adanpalma4026
    @adanpalma4026 26 days ago

    It's easy starting from txt, but what happens when it's a PDF and you first have to upload and transform it to text, and then do the semantic splitting?
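
    A minimal PDF-to-chunks pipeline could look like this; the `pypdf` step assumes that package is installed, and the cleanup rules are illustrative:

```python
import re

def clean_extracted_text(raw: str) -> str:
    """PDF extractors often emit hard line breaks, hyphenated words, and
    double spaces; normalise before chunking."""
    text = re.sub(r"-\n", "", raw)        # rejoin words hyphenated at line ends
    text = re.sub(r"\n+", " ", text)      # unwrap hard line breaks
    return re.sub(r"\s{2,}", " ", text).strip()

def pdf_to_text(path: str) -> str:
    """One way to get the txt: pypdf (assumes `pip install pypdf`)."""
    from pypdf import PdfReader
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

# Extraction output tends to look like this before cleanup:
raw = "Semantic split-\nting works on\nclean text."
print(clean_extracted_text(raw))
```

    The cleaned text can then be fed to whatever splitter you use; the main catch is that extraction quality varies a lot between PDFs, so the cleanup step usually needs tuning per source.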

  • @mansidhingra4118
    @mansidhingra4118 6 months ago

    Hi, thanks for this brilliant video. Really thoughtful of you. Just one question: when I tried to import HuggingFaceTextSplitter, I received an ImportError -- "ImportError: cannot import name 'HuggingFaceTextSplitter' from 'semantic_text_splitter'". Any idea how to make it work?

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago

      Currently not. Maybe they changed the import path. What version do you use?

    • @mansidhingra4118
      @mansidhingra4118 6 months ago

      @@codingcrashcourses8533 Thank you for your response. The current version I'm using for semantic_text_splitter is 0.13.3
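
      For readers hitting the same ImportError, a small probe can report which API the installed version exposes. The `TextSplitter` name checked below is an assumption based on the error above (newer releases appear to have replaced `HuggingFaceTextSplitter`), not a verified changelog:

```python
import importlib.util

def available_splitter_api():
    """Probe which semantic_text_splitter class is importable. Returns
    None if the package is not installed at all."""
    if importlib.util.find_spec("semantic_text_splitter") is None:
        return None
    import semantic_text_splitter as sts
    # Check the newer name first, then the older one used in the video.
    for name in ("TextSplitter", "HuggingFaceTextSplitter"):
        if hasattr(sts, name):
            return name
    return None

print(available_splitter_api())
```

      Whichever name the probe reports is the one to import; if it reports None, the package itself is missing from the environment.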

  • @endo9000
    @endo9000 9 months ago

    Is it theoretically possible to have a normal LLM like Llama 2 or Mistral do the splitting?
    The idea would be to have a completely local alternative running on top of Ollama.
    I see that semantic-text-splitter uses a tokenizer for that, and I understand the difference.
    I am just curious if it would be possible.
    Thanks for the vids btw. Learned a lot from that. ✌✌

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago +1

      Sure, that is possible. You can treat it as a normal task for the LLM. I would add that the new chunks should contain characters an output parser can use to create multiple elements from them.
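
      The reply's suggestion (have the LLM emit marker characters that an output parser then splits on) can be sketched like this; `fake_llm` is a hypothetical stand-in for a real model call:

```python
SPLIT_TOKEN = "<<<SPLIT>>>"

PROMPT = (
    "Insert the marker " + SPLIT_TOKEN + " between passages that cover "
    "different topics. Return the full text with the markers added."
)

def fake_llm(text: str) -> str:
    """Stand-in for a real LLM call; a real model would decide where the
    topic boundaries are. Here we pretend it marked one boundary."""
    return text.replace("Meanwhile, ", SPLIT_TOKEN + "Meanwhile, ")

def parse_chunks(llm_output: str) -> list[str]:
    """The output-parser side: split on the marker characters the LLM was
    instructed to emit, dropping empty pieces."""
    return [part.strip() for part in llm_output.split(SPLIT_TOKEN) if part.strip()]

doc = "The splitter uses tokens. Meanwhile, embeddings capture meaning."
print(parse_chunks(fake_llm(doc)))
```

      Swapping `fake_llm` for a local model behind Ollama would give the fully local variant asked about above, with the same parser on the output.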

    • @nathank5140
      @nathank5140 9 months ago

      @@codingcrashcourses8533 What do you mean? Can you provide an example?

    • @codingcrashcourses8533
      @codingcrashcourses8533  9 months ago

      @@nathank5140 I will release a video on that topic on friday! :)

  • @adanpalma4026
    @adanpalma4026 13 days ago

    Hmm, you start from txt and usually you can modify the txt in order to facilitate chunking, but when it's a complex PDF it is not as easy.