Semantic-Text-Splitter - Create meaningful chunks from documents

  • Published: 22 Aug 2024

Comments • 40

  • @codingcrashcourses8533
    @codingcrashcourses8533  6 months ago

    I made a mistake in this video: this splitter does NOT accept a full model, only a tokenizer. Sorry for that. So I am still looking for a good way to create LLM-based chunks :(
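    For reference, a minimal sketch of the tokenizer-driven usage this correction points at: the splitter only consumes a Hugging Face tokenizer for counting tokens, no model. This assumes a recent release of the semantic-text-splitter package plus the tokenizers package; the model name, file name, and capacity are illustrative:

    ```python
    from semantic_text_splitter import TextSplitter
    from tokenizers import Tokenizer

    # Only a tokenizer is needed: it counts tokens per chunk.
    # No LLM or embedding model takes part in the splitting itself.
    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 512)  # max 512 tokens per chunk

    with open("document.txt") as f:
        chunks = splitter.chunks(f.read())

    for chunk in chunks:
        print(len(tokenizer.encode(chunk).ids), chunk[:60])
    ```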

    • @nmstoker
      @nmstoker 5 months ago

      It's a shame, but I think the underlying idea of what you were after makes sense. It amuses me that people so often feed LLMs RAG outputs that even a typical human would struggle with!

    • @codingcrashcourses8533
      @codingcrashcourses8533  5 months ago +5

      @@nmstoker I will show how to make an LLM-based splitter in my next video :). If nobody else wants to do it, let's do it ourselves :)

    • @nathank5140
      @nathank5140 5 months ago +1

      Following for that. I have many meeting transcripts with discussions between two or more participants. The conversation is often non-linear, with topics revisited multiple times. I'm trying to find a good way to embed the content. I'm thinking of writing one or more articles on each meeting and then chunking those. Not sure; I would appreciate any ideas.

    • @vibhavinayak8527
      @vibhavinayak8527 3 months ago

      @@codingcrashcourses8533 Looks like some people have implemented 'Advanced Agentic Chunking' which actually uses an LLM to do so! Maybe you should make a video about it?
      Thank you for your content, love your videos!

    • @codingcrashcourses8533
      @codingcrashcourses8533  3 months ago

      @@vibhavinayak8527 Currently learning LangGraph, but I still struggle with that.

  • @wtcbretburstjk3726
    @wtcbretburstjk3726 3 days ago

    Thanks for the concise and helpful video about STS.

  • @micbab-vg2mu
    @micbab-vg2mu 6 months ago +4

    Thank you for the video :) I agree, random chunking every 500 or 1000 tokens gives random results.

  • @henkhbit5748
    @henkhbit5748 5 months ago +1

    Yes, a much better chunking approach. Thanks for showing it 👍

  • @kenj4136
    @kenj4136 6 months ago +3

    Your tutorials are gold, thanks!

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago

      Thanks so much! Honestly, I am quite surprised that so many people watch and like this video.

  • @Munk-tt6tz
    @Munk-tt6tz 3 months ago

    Exactly what I was looking for, thanks!

  • @MikewasG
    @MikewasG 6 months ago +1

    Thank you for your effort! The video is very helpful!

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 6 months ago

    A very intuitive approach to improving RAG performance. I wonder if the bar chart at the end would be better replaced by a 2-dimensional representation and evaluated with KNN.

  • @user-lg6dl7gr9e
    @user-lg6dl7gr9e 6 months ago +1

    We need a LangChain-in-production course; I hope you consider it!

  • @ashleymavericks
    @ashleymavericks 6 months ago

    This is a brilliant idea!

  • @andreypetrunin5702
    @andreypetrunin5702 6 months ago

    Thank you! Very useful!

  • @maxlgemeinderat9202
    @maxlgemeinderat9202 6 months ago

    Interesting! I saw the LangChain implementation. Do you prefer this one, and could the tokenizer be any embedding model?

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago +2

      There is a difference between an embedding model and a tokenizer; I hope you are aware of that. If yes, I didn't understand the question.
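      To make the distinction in this reply concrete: a tokenizer maps text to token IDs (used by the splitter only to count tokens per chunk), while an embedding model maps text to a dense vector that captures meaning. A minimal illustration, assuming the tokenizers and sentence-transformers packages; the model names are illustrative:

      ```python
      from tokenizers import Tokenizer
      from sentence_transformers import SentenceTransformer

      # Tokenizer: text -> token IDs. It carries no semantics on its own.
      tok = Tokenizer.from_pretrained("bert-base-uncased")
      print(tok.encode("semantic chunking").ids)       # a short list of integers

      # Embedding model: text -> dense vector capturing meaning.
      model = SentenceTransformer("all-MiniLM-L6-v2")
      print(model.encode("semantic chunking").shape)   # (384,) for this model
      ```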

  • @thevadimb
    @thevadimb 3 months ago

    Why didn't you like the LangChain implementation of the semantic splitter? What was the problem with it?

  • @znacibrateSANDU
    @znacibrateSANDU 6 months ago

    Thank you

  • @pillaideepakb
    @pillaideepakb 6 months ago

    This is amazing

  • @moonly3781
    @moonly3781 6 months ago

    Stumbled upon your amazing videos and want to thank you for the incredible tutorials. Truly amazing content!
    I'm developing a study-advisor chatbot that answers students' questions based on detailed university course descriptions, each roughly the length of a PDF page. The challenge is that descriptions for similar course names vary across universities, as does their length. Each document starts with the university name and the course description. I've tried adding the university name and course description before every significant point, which helped when chunking by regex to ensure all relevant information is contained within each chunk. Despite this, when I ask university-specific questions, the correct course description for the queried university sometimes doesn't appear in the retrieved chunks. Considering a description is about a page of text, do you have a better approach to this problem or any tips? Really sorry for the long question :) I would be very grateful for the help.

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago +1

      It depends on your use case. I think embeddings for a whole PDF are quite trashy, but if you need the whole document you can have a look at a parent-child retriever: you embed very small documents but pass the larger, related document to the LLM (see the sketch below). Not sure what to do with the noise part; LLMs can handle SOME noise :)
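      A hedged sketch of that parent-child idea using LangChain's ParentDocumentRetriever (package layout as of recent LangChain releases; the collection name, chunk size, sample document, and query are illustrative):

      ```python
      from langchain.retrievers import ParentDocumentRetriever
      from langchain.storage import InMemoryStore
      from langchain.text_splitter import RecursiveCharacterTextSplitter
      from langchain_community.vectorstores import Chroma
      from langchain_core.documents import Document
      from langchain_openai import OpenAIEmbeddings

      # One Document per course description (page-sized parent docs).
      docs = [Document(page_content="University X. Course: Machine Learning. ...")]

      retriever = ParentDocumentRetriever(
          vectorstore=Chroma(collection_name="courses",
                             embedding_function=OpenAIEmbeddings()),
          docstore=InMemoryStore(),                       # holds the full parent docs
          child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
      )
      retriever.add_documents(docs)

      # Small chunks are matched, but the whole description is returned.
      results = retriever.invoke("Which topics does the ML course at University X cover?")
      ```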

  • @mansidhingra4118
    @mansidhingra4118 2 months ago

    Hi, thanks for this brilliant video. Really thoughtful of you. Just one question: when I tried to import HuggingFaceTextSplitter, I received an ImportError: "ImportError: cannot import name 'HuggingFaceTextSplitter' from 'semantic_text_splitter'". Any idea how to make it work?

    • @codingcrashcourses8533
      @codingcrashcourses8533  2 months ago

      Currently not. Maybe they changed the import path. What version do you use?

    • @mansidhingra4118
      @mansidhingra4118 2 months ago

      @@codingcrashcourses8533 Thank you for your response. The version of semantic_text_splitter I'm currently using is 0.13.3.
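      For anyone hitting the same error: the import path does seem to have changed. In recent releases of semantic-text-splitter (including 0.13.x), the separate HuggingFaceTextSplitter class appears to have been folded into TextSplitter, which now takes the tokenizer through a classmethod constructor. A hedged sketch of the updated usage (model name and capacity are illustrative):

      ```python
      from semantic_text_splitter import TextSplitter
      from tokenizers import Tokenizer

      # Replaces the old `HuggingFaceTextSplitter(tokenizer)` construction.
      tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
      splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 1000)  # chunk capacity in tokens
      chunks = splitter.chunks("your document text here")
      ```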

  • @jasonsting
    @jasonsting 6 months ago

    Since this solution creates "meaningful" chunks, implying that there can be meaningless or less meaningful chunks, does that imply that chunking affects the semantic quality of the embeddings/vector database? I was previously getting garbage out of a ChromaDB/FAISS test, and this would explain it.

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago +3

      I would argue that there are two different kinds of "trash" chunks: 1. docs that just get cut off and lose their meaning, and 2. chunks that are too large and cover multiple topics -> the embeddings just don't mean anything.

  • @bertobertoberto3
    @bertobertoberto3 4 months ago

    Wow

  • @endo9000
    @endo9000 6 months ago

    Is it theoretically possible to have a normal LLM like Llama 2 or Mistral do the splitting?
    The idea would be to have a completely local alternative running on top of Ollama.
    I see that semantic-text-splitter uses a tokenizer for this, and I understand the difference.
    I am just curious whether it would be possible.
    Thanks for the vids btw, learned a lot from them. ✌✌

    • @codingcrashcourses8533
      @codingcrashcourses8533  6 months ago +1

      Sure, that is possible. You can treat it as a normal task for the LLM. I would add that the output should contain separator characters an output parser can use to split it back into multiple chunks (see the sketch at the end of this thread).

    • @nathank5140
      @nathank5140 5 months ago

      @@codingcrashcourses8533 What do you mean? Can you provide an example?

    • @codingcrashcourses8533
      @codingcrashcourses8533  5 months ago

      @@nathank5140 I will release a video on that topic on Friday! :)
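      Until that video is out, here is a hedged sketch of the idea from the reply above: ask a local LLM to insert an explicit separator marker between topics, then split the answer on that marker. It assumes a running Ollama server and the ollama Python package; the model name, marker, and prompt are illustrative:

      ```python
      import ollama

      SEPARATOR = "<<<CHUNK>>>"

      def llm_chunks(text: str) -> list[str]:
          # Ask the model to copy the text and mark topic boundaries.
          prompt = (
              "Copy the following text verbatim, but insert the marker "
              f"{SEPARATOR} between passages that cover different topics:\n\n{text}"
          )
          response = ollama.chat(
              model="mistral",
              messages=[{"role": "user", "content": prompt}],
          )
          raw = response["message"]["content"]
          # The marker plays the role of the 'characters an output parser
          # can use' to turn one LLM answer back into multiple chunks.
          return [c.strip() for c in raw.split(SEPARATOR) if c.strip()]
      ```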

  • @raphauy
    @raphauy 6 months ago

    Thanks for the video. Is there a way to do this with TypeScript?