I don't think anyone can explain concepts more simply than you do. I tried 10 different videos to see how the recursive splitter behaves when a paragraph equals the chunk size, and you explained it. :) I love how you cover every aspect from a learning point of view. Thanks again.
Glad it was helpful. Make sure to watch the next one :)
Just found your channel, and while I initially wished I could have had you as a professor in a classroom (maybe back in college 30 years ago), I really think you are helping to create a better world for many with your content, careful explanations, and examples. That is the true reason and mission of a teacher. Congrats!
Please make more videos like this one! Many people got into AI without a coding background; we are missing more detailed videos on these topics!
Answer me...
Are Prompt Engineering's videos meant only for developers?

@@AJJU_OZA Well, if it were, I wouldn't have been here for so long, hahaha. What I meant is that for those who don't have coding knowledge and want to do more than replicate GitHub repos, this hands-on type of video is phenomenal! In my case, I am working on a text-based RPG game, and the basic concept in this video was one I had yet to grasp. Question answered!
@@AJJU_OZA I mean, if the channel also had an LLM-focused Python course, I would be one of the people paying for it. I bet there are tons of people changing careers who also need in-depth videos on more basic concepts, like this one!
@@CacoNonino What do you mean by an LLM-focused Python course?
@@ml-techn I mean, I studied economics but changed careers midway to data engineering! Now I'm building more and more things on top of LLMs. All my coding knowledge came from using ChatGPT over the last year, and I think it's the same for a lot of people, hence why the tutorial videos are so popular. Am I making sense? I mean, there are a thousand videos out there that mention splitting text into chunks, but not many explaining specifically how it's done, the way he did it here!
I’d love to see videos on both embedding size and modifying the text splitter! I’m particularly interested in strategies that would enable inclusion of citations, e.g. a medical article that includes numbered citations at the end of each sentence with the reference list at the end of the document.
Thank you, you explain things very clearly, and I have been watching your content. It's really good and honest. Please keep making these types of videos. Thanks a lot.
Please keep making more videos like this. I found this one very helpful.
More to come 😎
This is the first time I've seen content on optimal chunk lengths. It might also be interesting to cover how to integrate metadata, for example which page of a book, which URL, or which paragraph of a legal text a chunk comes from. This metadata will also take up space in the retrieval context. Good work. Definitely keep going down this road.
Incredible! Hope you'll provide more videos like this one!
Finally understood this. I remember asking on Discord, and I think you also replied, but the fact that an entire video was made on this made it much, much clearer. Thank you so much! Could you make a video about vector stores: which one to use, how to know what to use, and the code behind them? I saw a few like FAISS, ChromaDB, Deep Lake, etc., and for my chatbot it's pretty much the last thing I have left to do, but I still don't understand most of how vector stores work.
Damn, you explained that better in 3 minutes than most other videos did in 30 minutes.
Glad it was helpful.
Great explanation. Thanks.
Great video, thanks for creating it!
Excellent to have someone break these concepts down so clearly. Keep going, this is great!
Great work! Very simple but really thorough. Please create more videos for this series.
Good video. For the dataset I am working with, I found that splitting by tokens produces better results, but it really depends on the data you're working with, tbh!
Great video for understanding chunks and the text splitter.
Thank you 🙏
Great explanation, thanks, this will be super useful!
Appreciate all your content. I'd love to know more about chunking customization. Thanks! 🤙
Great explanation !
Really useful
Please continue making these
Thanks for the video! What if you want to chunk a large PDF of 300 pages? How do you determine the chunk size? In your example you can check the length of each paragraph by inspection, but that might be hard to do for a large file. I would appreciate it if you shared your opinion.
Yes please do a video on Embedding settings. I am currently using these.
Parameters
----------
VECTOR_SIZE: int
The size of the vector for the text embeddings (e.g., 300).
WINDOW_SIZE: int
The context window size for text embeddings, capturing larger contextual information (e.g., 20).
MIN_COUNT: int
The minimum frequency count for words to be considered in the text embeddings (e.g., 1).
EPOCHS: int
The number of training iterations for the Doc2Vec model (e.g., 500).
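For reference, a minimal sketch of how these parameters might plug into gensim's Doc2Vec (assuming that is the library behind these settings; the corpus and tags below are placeholders):

```python
# Minimal Doc2Vec sketch using the parameters above (gensim assumed).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["first example document", "second example document"]  # placeholder corpus
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(
    vector_size=300,  # VECTOR_SIZE: dimensionality of each document vector
    window=20,        # WINDOW_SIZE: context window around each word
    min_count=1,      # MIN_COUNT: ignore words rarer than this
    epochs=500,       # EPOCHS: training iterations over the corpus
)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Embed a new piece of text with the trained model.
vector = model.infer_vector("a new query about chunking".split())
```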
Excellent explanation, thank you. Just curious, why is this video the only one in your Demystifying LangChain playlist?
Thank you. Just way too many things to cover, but I'm now getting back to RAG. Will be making a lot more content on it.
Great! Much appreciated 😊
I have seen a lot of videos on how to use these chunks with a vector database so the LLM can use RAG as a knowledge base. There seem to be very few videos on how to use the chunked data to fine-tune an LLM like Llama 2. I would love to see a video that covers fine-tuning an LLM on raw or chunked data without having to convert it into something like question-and-answer or instruct formatting.
More videos on chunking and embedding please.
Please create more content with in-depth information about how to use this information in a smart way. I'm currently building a domain-specific knowledge base to create an "AI expert" on a certain topic, and I am trying to find the right way to store all the knowledge.
Please do create one for custom splitting. I have a particular document where I would like to define chunks demarcated by a special sequence.
Hello mate, any chance you can make a video on context-aware chunking, which can improve the quality of chunks and output drastically?
An embedder (or rather the text splitter feeding it) should have an option so that chunks cannot cross paragraph boundaries, even if two paragraphs would fit in one chunk. That way the number of chunks is always at least the number of paragraphs.
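That is not a built-in switch as far as I know, but one hedged way to get the behaviour is to split on paragraphs first and only let the splitter subdivide paragraphs that are too long, so no chunk ever spans two paragraphs:

```python
# Sketch: enforce paragraph boundaries by splitting each paragraph separately.
# Assumes paragraphs are separated by blank lines; adjust the delimiter as needed.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

def split_per_paragraph(text: str) -> list[str]:
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Each paragraph is split on its own, so chunks never cross a boundary
        # and you always get at least one chunk per paragraph.
        chunks.extend(splitter.split_text(para))
    return chunks
```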
Very nice video; I think anyone working on semantic search goes through the experience you described here. Have you seen a study that checks the performance of different embeddings with respect to chunk size? Also, what are the different models available for embeddings? I have been using the FAISS models, and I heard you mention another one. What would be a good strategy for picking one over another?
How do I define my own list of separators? Can I set multiple separators for paragraphs and multiple for sentences at the same time?
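One possible way to do this (a sketch based on LangChain's RecursiveCharacterTextSplitter, which accepts an ordered separators list and tries each one in turn, so paragraph-level separators can come before sentence-level ones):

```python
# Sketch: custom separator priority - paragraph breaks first, then sentence ends.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],  # tried in this order
    chunk_size=1000,
    chunk_overlap=100,
)
chunks = splitter.split_text(my_document_text)  # my_document_text is a placeholder
```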
Please make a video about embedding size. You are awesome, thank you for the videos.
What if the PDF has tables too? I see the PDF loader in LangChain is not reading the tables. How do I solve that? And once that is solved, how does the recursive text splitter work with such tabular data?
Thank you!
Hmmm, curious why you're splitting by character count and not by token count? Our recursive splitter always bottoms out in token count based on the model we're using, as the model can't see character level data, and the token count is the limiting factor we actually care about when inferencing.
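For anyone who wants the same token-based behaviour in LangChain, the splitter exposes a tokenizer-aware constructor; here is a minimal sketch, assuming a tiktoken encoding (the encoding name and sizes are placeholders to match to your own model):

```python
# Sketch: token-count based splitting via tiktoken.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumed encoding; match it to your model
    chunk_size=512,               # measured in tokens, not characters
    chunk_overlap=50,
)
chunks = splitter.split_text(long_text)  # long_text is a placeholder
```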
please continue with these. they are useful.
Thank you for your video. What program are you using to create your diagrams?
thanks dude
What about a dynamic chunk size as a potential future feature? How does this work for a large series of documents like textbooks, or other PDFs like science articles or legal documents? What is a "best guess" for the parameters?
Hi, I am also having the same problem. Do you have any idea how we can chunk our documents efficiently?
Great content! Keep up please :)
Thank you! How can I handle splitting if I have multiple files and want to generate a summary for each one individually?
In that case, look into summarization-specific chains. Map-reduce will be a good start.
@@engineerprompt Suppose these are code files and I want to generate a summary for each one separately. What should I do?
Sir, are Prompt Engineering's videos meant only for developers?
Please continue making videos like this. Any chance you can share the code as well?
If we check our docs, look at the length of each paragraph, and set the chunk size to the max length, could that help? Or maybe take the average length across all paragraphs? It depends on the splitter. What do you think?
This might be dated, but yes, that can be one approach. Another is to use regular expressions if there is a pattern in the data. There are now more advanced retrieval methods that can compress the data in the documents to make it more relevant to the query. A lot is happening in this space.
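To make the "measure the paragraphs first" idea concrete, here is a hedged sketch that computes the max and average paragraph length and feeds one of them in as the chunk size (the blank-line delimiter is an assumption about your documents):

```python
# Sketch: derive chunk_size from the paragraph lengths of your own documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter

paragraphs = [p for p in raw_text.split("\n\n") if p.strip()]  # raw_text is a placeholder
lengths = [len(p) for p in paragraphs]

max_len = max(lengths)
avg_len = sum(lengths) // len(lengths)

# Using the max keeps every paragraph intact; the average yields smaller,
# more numerous chunks. Either value is just a starting point to evaluate.
splitter = RecursiveCharacterTextSplitter(chunk_size=max_len, chunk_overlap=0)
chunks = splitter.split_text(raw_text)
```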
What if you could preprocess the texts and reorganise sentences by "key subject relationships"? That is, as a supplement to the original text, you could perhaps make chunks of text that summarise different key subjects. The AI would produce a (creative) list of these subjects and then use that list when running through the text again (and you could then "make LangChain know" which sentences actually belong together!).
I have a CSV file with product descriptions and IDs. I need to query the descriptions with the user input in order to get the product ID. I am using CharacterTextSplitter to split the full file into chunks with one line per chunk. After that I want to do a similarity_search to get the lines of the CSV containing descriptions similar to the user input. I'm using the "\n" separator to split the text by lines, but for whatever reason it doesn't work sometimes. I'd love to see an example of CharacterTextSplitter in this kind of situation, or how to use RecursiveCharacterTextSplitter to do the same.
I am facing the same issue. I have managed to write generic chunking code; however, I only get good results for small data sets, not for large ones. Did you manage to solve it?
Really useful. Please continue making videos like this. I feel I get the gist, but I'm interested in more on this topic.
Can you please explain how the chunk_overlap parameter works?
Let's say you define the chunk size to be 1000 characters with an overlap of 200. In that case, the first chunk will cover characters 1-1000 and the second chunk will start at 801 and run to 1800, because there is an overlap of 200. Hope this helps.
@@engineerprompt Thank you! Does chunk_overlap also follow the default list?
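A small sketch that makes the overlap above visible (toy text and sizes; the overlap is assembled from the trailing pieces produced by the same separator list, so it is approximate rather than an exact character count):

```python
# Sketch: print chunks so the overlap region between consecutive chunks is visible.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "The quick brown fox jumps over the lazy dog. " * 10
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=30)

for i, chunk in enumerate(splitter.split_text(text)):
    # Consecutive chunks should share roughly the last ~30 characters of the
    # previous chunk, cut at the nearest separator.
    print(f"chunk {i} ({len(chunk)} chars): {chunk!r}")
```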
The link no longer works
🔥🔥🔥
I had a question on this video, i.e. how to split chunks: ruclips.net/video/n0uPzvGTFI0/видео.html How can I find the best chunk size for financial statements?
In real life you need to do way more, and all the tutorials basically split some well-behaved .txt files, but this is a good introduction.
What about making a video using a very small LLM that every PC can handle: applying it to a very specific task, fine-tuning it, and showing every step from zero to hero, all working offline? That way everyone can get hands-on with this "lab" and learn by doing.
Does this work with a llama.cpp local model? Like modelname-ggmlv1.q4_1.bin
Yes, it will work with any model
Great video, thanks for creating it! 😀