I made a mistake in this video: this splitter does NOT accept a full model, but only accepts the tokenizer. Sorry for that. So I am still looking for a good way to create LLM-based chunks :(
It's a shame but I think the underlying idea of what you were after makes sense. It amuses me that so often people try LLMs with RAG outputs that even a typical human would struggle with!
@@nmstoker I will release a video on how to make an LLM-based splitter in my next video :). If nobody else wants to do it, let's do it ourselves :)
Following for that. I have many meeting transcripts with discussions between two or more participants. The conversation is often non-linear, with topics revisited multiple times. I'm trying to find a good way to embed the content. Thinking maybe of writing one or more articles on each meeting, then chunking those. Not sure; I would appreciate any ideas.
@@codingcrashcourses8533 Looks like some people have implemented 'Advanced Agentic Chunking' which actually uses an LLM to do so! Maybe you should make a video about it?
Thank you for your content, love your videos!
@@vibhavinayak8527 Currently learning LangGraph, but I still struggle with it.
Thank you for the video :) I agree, random chunking every 500 or 1000 tokens gives random results.
Yes, a much better chunking approach. Thanks for showing it 👍
Your tutorials are gold, thanks!
Thanks so much, honestly I am quite surprised that so many people watch and like this video.
ty for the concise and helpful video about STS
Thank you for your effort! The video is very helpful!
Exactly what i was looking for, thanks!
This is a brilliant idea!
Thank you! Very useful!
Very intuitive approach towards RAG performance improvement. I wonder if the bar chart at the end would be better off substituted with a 2-dimensional representation and evaluated with KNN.
Yes, I did a whole video on how to visualize embeddings.
We need a LangChain-in-production course, hope you consider it!!!
Yes, I indeed plan to do that :)
This is amazing
Why didn't you like the LangChain implementation of the semantic splitter? What was the problem with it?
It does not use an LLM to achieve it.
Thank you
Stumbled upon your amazing videos and want to thank you for the incredible tutorials. Truly amazing content!
I'm developing a study advisor chatbot that aims to answer students' questions based on detailed university course descriptions, each roughly the length of a PDF page. The challenge arises from varying descriptions across universities for similar course names, and from the varying length of each description. Each document starts with the university name and the course description. I've tried adding the university name and course name before every significant point, which helped when chunking by regex to ensure all relevant information is contained within each chunk. Despite this, when asking university-specific questions, the correct course description for the queried university sometimes doesn't appear in the retrieved chunks, often because it's missing from them. Considering a description is about a page of text, do you have a better approach to this problem, or any tips? Really sorry for the long question :) I would be very grateful for the help.
It depends on your use case. I think embeddings for a whole PDF page are quite trashy, but if you need the whole document you can have a look at a parent-child retriever: you embed very small documents, but pass the larger, related document to the LLM. Not sure what to do with the noise part; LLMs can handle SOME noise :)
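For anyone wanting to try the parent-child idea, here is a minimal sketch with LangChain's ParentDocumentRetriever (the chunk sizes and the "course_descriptions.txt" file are placeholders, not from the video):

```python
# Minimal sketch of a parent-child retriever with LangChain (0.1.x-era API).
# Small child chunks are embedded for precise retrieval; the larger parent
# chunk (e.g. a full course description) is what actually reaches the LLM.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

docs = TextLoader("course_descriptions.txt").load()  # placeholder file

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(collection_name="courses", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),  # holds the full parent documents
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=300),    # what gets embedded
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),  # what the LLM sees
)
retriever.add_documents(docs)

# Matching happens on the small child chunks, but the parent documents come back.
results = retriever.get_relevant_documents("Which university offers Machine Learning?")
```

The vector store only ever sees the small child chunks, so their embeddings stay focused, while the docstore hands the full description back to the LLM.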
Thanks for the video. Is there a way to do this with typescript?
Yes, but someone probably has to create that npm package first :)
Interesting! Saw the LangChain implementation. Do you prefer this one, and could the tokenizer be any embedding model?
There is a difference between an embedding model and a tokenizer; hope you are aware of that. If yes, I didn't understand the question.
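For readers following along, the distinction in one short sketch (the model names are just common examples):

```python
# Sketch: a tokenizer maps text to token IDs; an embedding model maps text to vectors.
from sentence_transformers import SentenceTransformer
from tokenizers import Tokenizer

ids = Tokenizer.from_pretrained("bert-base-uncased").encode("hello world").ids
print(ids)  # a short list of integers, with no notion of meaning

vec = SentenceTransformer("all-MiniLM-L6-v2").encode("hello world")
print(vec.shape)  # (384,) - a dense vector you can compare for similarity
```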
Since this solution creates "meaningful" chunks, implying that there can be meaningless or less meaningful chunks, would that then imply that these chunks affect the semantic quality of embeddings/vector database? I was previously getting garbage out of a chromadb/faiss test and this would explain it.
I would argue that there are two different kinds of "trash" chunks: 1. docs that just get cut off and lose their meaning; 2. chunks that are too large and cover multiple topics, so the embeddings just don't mean anything.
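A quick way to see the first failure mode is to compare naive fixed-size slicing with a boundary-aware splitter. A sketch, assuming a recent semantic-text-splitter release where the capacity goes to the constructor (older releases passed it to chunks() instead):

```python
# Sketch: naive fixed-size slicing vs. a boundary-aware splitter.
from semantic_text_splitter import TextSplitter

text = "RAG retrieves documents. Embeddings encode meaning. " * 50

# Failure mode 1: fixed-size chunks routinely end mid-word or mid-sentence.
naive_chunks = [text[i:i + 100] for i in range(0, len(text), 100)]

# The boundary-aware splitter stays under the same limit but prefers sentence
# and paragraph breaks; keeping the capacity small also helps against failure
# mode 2, where one chunk spans multiple topics.
splitter = TextSplitter(100)  # max characters per chunk
semantic_chunks = splitter.chunks(text)

print(repr(naive_chunks[0][-15:]))     # usually cut off mid-word
print(repr(semantic_chunks[0][-15:]))  # ends at a natural boundary
```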
Wow
It's easy starting from txt, but what happens when it's a PDF and you first have to upload and transform it to text, and then do the semantic splitting?
I made a video about multimodal RAG :)
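For the common case of a text-based PDF, one standard pipeline is extract-then-split. A sketch using pypdf ("document.pdf" is a placeholder, and scanned PDFs would need OCR first):

```python
# Sketch: PDF -> plain text -> semantic splitting (text-based PDFs only;
# scanned documents need OCR first, and complex layouts may extract poorly).
from pypdf import PdfReader
from semantic_text_splitter import TextSplitter

reader = PdfReader("document.pdf")  # placeholder path
text = "\n".join(page.extract_text() or "" for page in reader.pages)

splitter = TextSplitter(1000)  # max characters per chunk
chunks = splitter.chunks(text)
```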
Hi, thanks for this brilliant video. Really thoughtful of you. Just one question: when I tried to import HuggingFaceTextSplitter, I received an ImportError: "ImportError: cannot import name 'HuggingFaceTextSplitter' from 'semantic_text_splitter'". Any idea how to make it work?
Currently not. Maybe they changed the import path. What version do you use?
@@codingcrashcourses8533 Thank you for your response. The current version I'm using for semantic_text_splitter is 0.13.3
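For reference: recent releases of semantic-text-splitter appear to have consolidated the splitter classes, so the HuggingFace variant became a constructor on TextSplitter. Something along these lines should work on 0.13.x, though the signatures have shifted between versions, so treat this as a sketch:

```python
# Sketch for recent semantic-text-splitter releases: HuggingFaceTextSplitter
# was folded into TextSplitter as a constructor.
from semantic_text_splitter import TextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 1000)  # max tokens per chunk
chunks = splitter.chunks("your document text here")
# Note: some versions expect the capacity in chunks() instead of the constructor.
```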
Is it theoretically possible to have a normal LLM like Llama 2 or Mistral do the splitting? The idea would be to have a completely local alternative running on top of Ollama. I see that in the case of semantic-text-splitter it uses a tokenizer for that, and I understand the difference. I am just curious if it would be possible. Thanks for the vids btw, learned a lot from them. ✌✌
Sure, that is possible. You can treat it as a normal task for the LLM. I would add that the output should contain delimiter characters an output parser can use to split it back into multiple elements.
@@codingcrashcourses8533 What do you mean? Can you provide an example?
@@nathank5140 I will release a video on that topic on Friday! :)
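Until then, one possible shape for an LLM-based splitter on top of Ollama, as a sketch (the model name, delimiter, and prompt are assumptions, not from the video):

```python
# Sketch: LLM-based chunking via a local model on Ollama. The model inserts
# a delimiter at topic boundaries; splitting on it afterwards is the
# "characters an output parser can use" idea from above.
import ollama

DELIMITER = "<<<SPLIT>>>"  # arbitrary marker the model is asked to insert

def llm_chunk(text: str, model: str = "mistral") -> list[str]:
    prompt = (
        "Copy the following text verbatim, but insert the marker "
        f"{DELIMITER} between passages that cover different topics.\n\n{text}"
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return [c.strip() for c in response["message"]["content"].split(DELIMITER) if c.strip()]

chunks = llm_chunk("...your document text...")
```

Verbatim copying is the weak point here; checking that the rejoined chunks still match the input text is a sensible guard.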
Ummm, you start from txt and usually you can modify the txt to facilitate chunking, but when it's a complex PDF it's not as easy.