Holy smokes, I'm only 6 minutes in, and this is already by far the best video on this topic on all of YouTube. Your content is extremely valuable, man. Please keep it up!!
Hey DI, I'm an industrial designer. I watch tons of YouTube every single day for research, and last week I built a bot to download captions from YouTube and store them as JSON. It has already saved me a ton of work, and now I have time for a coffee and half an hour of PS4 every day (even though I had to write more than 10 hours of code every day after and between work last year...). The last couple of days I was working until 4am trying to get it to summarize long captions (many of those lines are not important...) and I kept failing. And then, like magic, your video came up! I haven't even watched it yet, but I'm leaving a comment first to say thanks. I trust you.
The LangChain docs are so messy and unclean that I'm considering redoing them in Cantonese...
Thank god you make these videos; you saved our lives.
Thanks for the kind words and good luck!
This content is incredible Greg. It's helping so many of us build the tools of the future (well at least the future of our own workflows!) Thank you!
Really a life-changing playlist
Will check after 7 years
Just wanted to say your code and explanations are so coherent and easy to follow that an innumerate like me who barely knows python was able to grok the entire video played at 1.5x speed. Well done sir! Truly can't wait to try out the clustering technique.
I'm immensely grateful for your enlightening series on the 5 Levels Of LLM Summarizing. The concept of chunks nearest to centroids representing summaries is brilliant and has offered me a fresh perspective. I eagerly anticipate your insights on AGENTS!
Man! I looooooove the Best Vector Representation method!! This is soooo cool and completely solved my problem
I just wanted to thank you for your awesome video on text summarization. Your explanations were clear, concise, and informative, and your demonstrations were really helpful in understanding the concept. Your passion and expertise on the subject really shone through and I look forward to seeing more great content from you in the future!
Thank you very much, that's nice
in brief:
This video demonstrates five levels of text summarization using language models, specifically focusing on OpenAI's GPT architecture.
The video walks through the process of summarizing a few sentences, paragraphs, pages, an entire book, and an unknown amount of text.
Level 1: Basic prompt - The presenter uses a simple prompt to summarize a couple of sentences from a Wikipedia passage on Philosophy.
Level 2: Prompt templates - The presenter introduces prompt templates to dynamically summarize two essays by Paul Graham, showing how to create a one-sentence summary for each.
Level 3: MapReduce method - The presenter explains how to summarize a long document by chunking it into smaller pieces, summarizing each chunk, and then summarizing the summaries.
Level 4: Best representation vectors - The presenter demonstrates a method to summarize an entire book by selecting the top passages that best represent the book, clustering similar passages, and summarizing the most representative passage from each cluster.
Level 5: Unknown amount of text - The video hints at a technique for handling an unknown amount of text but does not provide explicit details.
The speaker demonstrates various techniques to summarize text using OpenAI models, progressing from novice to expert level:
Level 1: Basic Summarization - Using GPT-3.5-turbo to summarize a single passage of text.
Level 2: Summarizing Multiple Passages - Using GPT-3.5-turbo to summarize multiple passages and combine them into a single summary.
Level 3: Summarizing Books - A custom method to summarize an entire book by splitting it into chunks, finding the most important chunks, summarizing those, and then combining the summaries.
Level 4: Summarizing Books with Clustering - A more advanced method that uses clustering to find representative sections of a book before summarizing and combining them.
Level 5: Summarizing an Unknown Amount of Text - Using agents to perform research on Wikipedia and summarize the information found.
The speaker demonstrates how to use these techniques by summarizing a variety of text sources, including passages, books, and information found on Wikipedia.
The video showcases the potential of OpenAI models for summarizing and condensing information while retaining the key points and insights.
Good long summary - thank you!
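For anyone who wants to see the Level 4 ("best representation vectors") idea from the summary above in code, here's a rough sketch of the flow: embed the chunks, cluster them with k-means, and keep the chunk closest to each centroid for summarizing. It assumes the classic LangChain Python API and OpenAI embeddings; the variable names are mine and the details may differ from the notebook.

import numpy as np
from sklearn.cluster import KMeans
from langchain.embeddings import OpenAIEmbeddings

# docs: the list of Document chunks produced by a text splitter (see Level 3)
embeddings = OpenAIEmbeddings()
vectors = np.array(embeddings.embed_documents([d.page_content for d in docs]))

num_clusters = 11
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)

# for each cluster, keep the index of the chunk closest to the centroid
closest_indices = []
for i in range(num_clusters):
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    closest_indices.append(int(np.argmin(distances)))

# summarize these representative chunks individually, then combine the summaries
representative_docs = [docs[i] for i in sorted(closest_indices)]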
Here's some additional info about this kmeans array as I struggled to understand it myself (made by gpt-4):
The output you're seeing is from the labels_ attribute of the trained KMeans model in the sklearn library in Python.
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified beforehand, which is what you've done by setting n_clusters to 11.
The labels_ attribute of the KMeans model returns an array where each element is the cluster label of the corresponding data point. These labels range from 0 to n_clusters - 1. So in your case, the labels range from 0 to 10, since you specified 11 clusters.
To put it in the context of your specific problem, you've passed a list of vectors to the KMeans model. Each vector probably represents a portion of the book you're trying to summarize, perhaps a sentence or a paragraph, which has been transformed into a numerical vector using the langchain library.
The output array([ 2, 2, 2, 8, 8, 2, 5, 1, 1, 7, 7, 4, 4, 9, 10, 5, 5, 5, 3, 3, 3, 0, 0, 10, 10, 6], dtype=int32) then represents the cluster assignments for each of these vectors. For example, the first three vectors were assigned to cluster 2, the next two to cluster 8, and so on.
These cluster assignments are based on the distances between the vectors. Vectors that are closer to each other (and hence more similar) will be assigned to the same cluster. By identifying these clusters, the KMeans algorithm helps you find groups of similar sentences or paragraphs in the book. This can help in summarizing the book by identifying the key themes or topics covered.
Nice! Happy to chat more on this if you have questions
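A tiny, self-contained illustration of the labels_ behavior described above (the random vectors are just a stand-in for the real chunk embeddings):

import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(26, 1536)        # stand-in for 26 chunk embeddings
kmeans = KMeans(n_clusters=11, random_state=42).fit(vectors)

print(kmeans.labels_)                     # one cluster label (0-10) per chunk
print(np.where(kmeans.labels_ == 2)[0])   # indices of every chunk assigned to cluster 2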
This is badass. Such a cool approach with the best representation vectors. Thanks for continuing to put out great work!
Awesome video! Thank you for putting all of this information together! I have been wanting to learn more on how to do k value summarization with long chain. This video was exactly what I needed! Well done!
Concise and informative, as always!
Your videos are extremely well explained and the use cases and examples are top notch. The vector clustering approach is pretty ingenious. Great stuff, keep it up!
Nice! Thank you Dimitar
Thumbnail 10/10
I was actually going to do a Pokémon theme, and I asked ChatGPT which Pokémon have 5 levels, and it said none of them do. So I asked for other metaphors that would appeal to a developer audience; it suggested Mario and then gave me names for those 5 Mario levels. Not bad.
The level of clarity in your content is just insane. I absolutely love it! If I may make a suggestion though - something to consider... Because this technology is growing, changing and evolving so quickly, it would be soooo good to have something like a concept map showing all the main concepts and use cases of, let's say, LangChain, with particular ways of achieving them and links to the videos where each thing is explained :D
I really like your best representation vector approach!
Nice, thank you - let me know how this works for your use case. I wanna see if it holds up in other domains or applications
I want to tell you that I really appreciate the work you put into these tutorials! Really, really helpful. As an educationalist I am trying to get such a system working for making learning plans, lessons, learning goals, exam questions, etc. Your work is really helpful and motivates me to start working on an app that can make this stuff straight from documents. Really appreciate it!
Nice! Thank you!
Level 4 is super interesting. I’ve experimented with recursive summarization, but your method promises better results as well as being cheaper. I need to try it!
Nice thanks - I’m interested to hear how it works for you
I wouldn't have even considered the token limitation. Thank you for another great video.
Very practical and informative video.I was waiting for the vid since I saw your Tweet. Thank you Greg
Great video 👋👋Especially part 4 was illuminating.👍
Great video!
If I am not wrong, the element closest to the cluster centroid is the same as the medoid, which can be computed even if the centroid cannot, just taking the element for which the sum of the distances to the other elements is minimum.
To choose the number of clusters (K) you can use the elbow criterion. I do not know if a Mixture of Gaussians would be a better clustering method, but it would be worth a try.
Nice! Thank you for all those points. That's what I wanted to hear to see where to improve it.
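If it helps, a small NumPy sketch of the medoid idea mentioned above - the cluster member minimizing the summed distance to all other members. This is my own illustration, not code from the video:

import numpy as np

def medoid_index(points):
    # points: (n, d) array of vectors belonging to one cluster
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # pairwise Euclidean distances
    return int(np.argmin(dists.sum(axis=1)))      # member with smallest total distance

# e.g. medoid of cluster 3: medoid_index(vectors[kmeans.labels_ == 3])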
I've been working on parsing large sets of data (and struggling with balancing efficiency and comprehensiveness), and the idea to cluster and find the centers of clusters to get representative values is awesome! Definitely going to be toying around with this idea.
Nice! Let me know how it works for your use case. Super curious to see if this lines up with more examples
The best representation vector approach is slick! 💯
The user can enter different values for map_prompt and combine_prompt; the map step applies a prompt to each document, and the combine step applies one prompt to bring the map results together.
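A minimal sketch of what that looks like, assuming the classic LangChain load_summarize_chain API; the prompt wording here is just an example:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

map_prompt = PromptTemplate(
    template="Write a concise summary of the following:\n\n{text}\n\nCONCISE SUMMARY:",
    input_variables=["text"],
)
combine_prompt = PromptTemplate(
    template="Combine these summaries into one final summary:\n\n{text}\n\nFINAL SUMMARY:",
    input_variables=["text"],
)

chain = load_summarize_chain(
    ChatOpenAI(temperature=0),
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
)
# summary = chain.run(docs)   # docs: list of Document chunks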
Thanks! You helped me get on the right path to my solution!
Very good content, short and to the point.
Nice! Thank you!
Would LOVE to see an end-to-end setup for these 🙏
Super cool. Your tutorials are extremely helpful!
Amazing video Greg!!!
Worth every single minute. Thanks
This is a great video and summary of the various options!
Wow!! the clustering technique was incredible
Glad you liked it!
Nice video. Clear and concise. Well done
Amazing! Great job!😎🥳🦾
Would love an update with local llm for book summaries.
Thank you! Very clear and helpful. I am getting ready to try myself.
Nice! Please let me know how it goes - I'm curious to see how it does on longer bodies of text in the real world
Glad to see that other people are also dealing with this very topic. In scientific terminology the name you are probably looking for is "semantic clustering" ;) Have a look at the WEClustering paper by Mehta/Bawa/Singh. Maybe it is also worth giving a hierarchical clustering algorithm or DBSCAN a shot, because the number of clusters is unknown.
Nice thanks for this. That’s a solid idea and approach.
I’ll try it out
Great video! Some thoughts.
I wonder what would happen if you applied your dimensionality reduction process prior to your k-means clustering algorithm. K-means is highly susceptible to the curse of dimensionality (i.e. the more dimensions you add, the more "space" gets added between points so eventually they're all so far that it's hard to justify one point being in the same cluster) and you're working with high-dimensional space. As a result, dimensionality reduction steps are a pretty common pairing with clustering methods like k-means. I'd suggest looking up PCA and NCA (if you haven't tried them yet) as methods to consider prior to k-means to potentially improve your clustering. Also, look up the "elbow method" on choosing an optimal value of 'k' when you do. That'll help you justify choosing a number of clusters to move forward with.
Last recommendation: alwaysalwaysalways pull a few examples of text that supposedly represents a different cluster (in your case whatever's closest to the centroid). If the model really is interpreting the text the way you suggest (i.e. the intro contains a lot of information that the first few chapters contain), it's always worth poking around and seeing if you see that happening. Thanks for the guide!
So, you're saying, when he gets to 10:01, he could play with the number of clusters? Too many clusters, and it's almost like overfitting where everything has its own cluster, and too few, and you basically end up with one idea. The elbow would be an inflection point?
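In case anyone wants to try these suggestions, here's a quick scikit-learn sketch of both: PCA before k-means, then an inertia plot where the "elbow" (the bend / inflection point) suggests a reasonable k. The random vectors are only a stand-in for real chunk embeddings:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

vectors = np.random.rand(80, 1536)                       # stand-in for chunk embeddings
reduced = PCA(n_components=20).fit_transform(vectors)    # n_components must be <= number of chunks

ks = range(2, 15)
inertias = [KMeans(n_clusters=k, random_state=42).fit(reduced).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()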
Awesome video! You do a great job!
Wow, this is awesome, bro. Grateful 👏👏
Awesome content Greg. Appreciate it!
very cool approaches. thanks for sharing!
KMeans clustering to identify a representative semantic core from a cluster is brilliant... wow... I gotta apply Level 4 to so many datasets now...
The choice of k will affect your results significantly.
If you want to stay in 2 dimensions, consider using the book's table of contents to approximate what k to choose - a good plot should show less overlap in the clusters.
Alternatively, most vector stores already provide semantic search in higher dimensions and use cosine instead of the Euclidean distance in your example - so you could try getting near_term with Weaviate and similar stores, or try hybrid search to get the core concept of each chapter before summarization.
This is awesome! Are there any academic papers about 4 and 5 (or maybe 4 combined with 5)? Landmark attention is extending context windows, but token cost (whether in terms of API calls or GPU memory) is going to be a trade-off factor for the foreseeable future.
Someone sent me this one which looks interesting. "HYBRID LONG DOCUMENT SUMMARIZATION USING C2F-FAR AND CHATGPT: A PRACTICAL STUDY"
I haven't read it, but the problem statement points in this direction I'm told
arxiv.org/pdf/2306.01169.pdf
@@DataIndependent Thank you!
Thank you, really good video.
Awesome share😊
Nice!! Thanks Zeel
I just reduced summarization time on our Gen AI product to at most 10% of what it was, and to about 10% of the cost, using BRV on large documents! Thank you so much! Do you have a Patreon?
Nice! Glad to hear it. Nope, no Patreon. But you sharing your value is great, thank you!
GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT! GREAT!
Hi Greg, I want to take a moment and thank you for creating a very crisp and clear explanation with examples. Great job here. One quick question: what are your thoughts on applying this technique to codebases in any language? I am talking about summarizing a codebase (like the book in your example) in a paragraph. Thanks in advance.
Code is much tougher because the material references other material from across the base.
You could do a highlight overview.
Cursor.sh would be the coolest company to tackle this problem
@@DataIndependent Looking into Cursor, seems like a great product. What do you mean by highlight overview - can you please point me in the right direction? Seems like YouTube comments aren't the best place for this kind of conversation. Thanks in advance. Great stuff!!
Great video!
I think you can reduce the batch size in the LLM to mitigate rate limits and timeouts.
Nice! Thank you Ehmad - yes that is a solid solution
awesome explanation
Awesome video, especially the Vector Clustering approach. I was wondering if you have any reference for this approach, any paper/blog etc?
This is genius
Cannot thank you enough, Greg. I ran into two consecutive issues at In[42]: first, I had to add vectors = np.array(vectors) before the "# Perform t-SNE and reduce to 2 dimensions" step; second, I had to set perplexity in TSNE() to less than 2, which is my n_samples. Otherwise I would get, respectively, AttributeError: 'list' object has no attribute 'shape' and "perplexity must be less than n_samples". However, after all that, once the code ran successfully, I no longer know whether the plot came out correct or is just weird.
Hm, nothing is jumping out to me right now about the problem.
Have you tried putting the error message in gpt 4 to debug?
@@DataIndependent Also, I'm using Google Colab to practice this, and my own data - maybe that's why? Yes, I asked the bot on the MS website, not sure what it's called. ;) I solved it with the solutions posted, but I couldn't be sure whether the plot I got is correct or not.
As a lawyer with no programming experience, I see the enormous potential to transform my profession. I am torn between investing the time to understand your videos or alternatively hiring someone--even DI--to build the tools that I would need. If anyone reading this is available for hire, please let me know!
I have a short list of developers and agencies I send people to if you want to chat about your product.
If you want to level up your skills then learning this would be great, but working with a professional will speed-boost anything production grade.
You can reach me by twitter dm or contact@dataindependent.com
absolutely mind blowing and time saving
Nice! Glad to hear it
I have used the map-reduce technique but it's not fast. What should I do?
Legendary
Thanks Adam. This one is old school!
Great work man! I did it on a book and it worked great, although the k-means plot shows some overlapping clusters; I might need to play around with it. I would love to be able to extract information about a particular topic inside a big document and present a detailed summary - what does this document say about topic X? Would you say that's closer to the 'talk to your document' code you made, just maybe with a more sophisticated prompt?
Nice!! That’s great glad it worked out a bit.
The KMeans algorithm was the right mix of easy and effective for my use case, so I ran with it. There are a ton more algorithms to try which may be better.
For that I would lean on a question and answer type of chain instead of summarizing.
If the topic you want is static, then do a really good “query” which will pull the relevant documents.
If the topic is dynamic, then look into ways to have the LLM help you bolster your starting query which does the retrieval
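A rough sketch of that question-and-answer style chain, assuming the classic LangChain API with a FAISS store; the query is just an example:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# docs: the chunked Documents from the text splitter
vectordb = FAISS.from_documents(docs, OpenAIEmbeddings())

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectordb.as_retriever(search_kwargs={"k": 6}),
)
# answer = qa.run("What does this document say about topic X?")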
Awesome video!...... Is there any ebook or course available that explains everything about LangChain with NLP???
What would you want to see in it?
@@DataIndependent I would want to try Named Entity Recognition, Sentence Classification as well as Summarization for Systematic Literature Review of medical articles...
In cell 32, the vectors need to be wrapped with np.array; otherwise you will get a 'list' object has no attribute 'shape' error.
I wrapped the vectors with np.array but still got ValueError: perplexity must be less than n_samples
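For anyone hitting those same two errors, a small sketch of the fix - convert the list to a NumPy array and keep perplexity below the number of samples (this is standard scikit-learn behavior, not something specific to the notebook):

import numpy as np
from sklearn.manifold import TSNE

vectors = np.array(vectors)             # t-SNE wants an array, not a Python list
n_samples = vectors.shape[0]

tsne = TSNE(
    n_components=2,
    perplexity=min(30, n_samples - 1),  # perplexity must be less than n_samples
    random_state=42,
)
reduced = tsne.fit_transform(vectors)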
For example 4, what would you do if the summaries are too large to fit nicely in the combiner? I'm working on a similar problem; my solution was to chain the summary output to the beginning of the next block of context, but that results in many calls to the LLM with full context
You could then do a map reduce on the large container of summaries. Split it up twice
For level 4, is there a way to save the text in the vector store and then pass the whole vector store into a summarization chain? This would reduce the workload
Great video. Thanks, Greg. I am thinking how:
(1) we could efficiently save the embeddings result into disk so the next time we could load back the data for next use, especially after running FAISS.from_documents(texts, embeddings) and the output is only saved in memory and cannot be retrieved next time I want to use the same vector data ... So that we could continuously grow our vector database
(2) if we are able to save the vector data to local disk, the next question is how we could get all vector data back from the vector database (where the output is the same as the code: vectordb = embeddings.embed_documents([x.page_content for x in docs]))
Appreciate if we have a video on how to manage and grow the vector database. Thanks.
Hey, you could use something like Chroma DB to store the embeddings on disk, and then it would load from disk. It can be done using the persistence attribute
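A minimal sketch of that, assuming the classic LangChain Chroma wrapper; the directory name is arbitrary:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()

# first run: build the store and write it to disk
vectordb = Chroma.from_documents(docs, embeddings, persist_directory="./book_db")
vectordb.persist()

# later runs: load the persisted store instead of re-embedding everything
vectordb = Chroma(persist_directory="./book_db", embedding_function=embeddings)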
Awesome video, thanks! Question: Why do we split the text first using 'RecursiveCharacterTextSplitter' (step A) and then use map-reduce in the 'load_summarize_chain' (step B)? As far as I know map-reduce splits the text itself. Are the reasons that the initial step A puts an upper ceiling on the length of the chunks to ensure that 1) the LLM can handle all chunks within its token limit, and 2) we reduce cost because after pre-selecting with the embeddings the chunks are short enough to cost us less? Thanks for your help Greg
I don't think that map_reduce splits the docs for you. I believe you need to split before
Either way if you do it before you have more control over how the chunks are made
@@DataIndependent Thanks, can you make a video on why we sometimes exceed the token limit with these approaches even though our chunks are small enough, showing tips and tricks on how to work around this? Also on how to save money by using different models for different summarisation tasks? Da Vinci gets quite pricey :D
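For reference, a quick sketch of the split-before-the-chain approach discussed above; capping chunk_size is also the usual way to avoid blowing past the token limit. Assuming the classic LangChain API; the sizes are just example values:

from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"],
    chunk_size=3000,       # upper bound keeps each chunk within the model's token limit
    chunk_overlap=300,
)
docs = splitter.create_documents([book_text])   # book_text: the full text as a single string

chain = load_summarize_chain(ChatOpenAI(temperature=0), chain_type="map_reduce")
# summary = chain.run(docs)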
For the number 4 method, does the number of clusters depend on the length of the docs? If I have 16 docs, will 11 clusters work, or do I have to choose another number?
I would love to see what each method cost was, to get an idea
Further to my previous comment, here's how to understand what each cluster represents: we will need to examine the original data points assigned to each cluster, just to get a deeper dive into what's actually going on there :)
1. Map Each Sentence/Paragraph to its Cluster: You can create a Python dictionary where the keys are the cluster labels and the values are lists of sentences or paragraphs belonging to that cluster.
clustered_sentences = {i: [] for i in range(num_clusters)}
for i, label in enumerate(kmeans.labels_):
    clustered_sentences[label].append(sentences[i])  # `sentences` is your original list of sentences/paragraphs
2. Examine the Sentences/Paragraphs in Each Cluster: You can then print out or otherwise examine the sentences/paragraphs in each cluster to get a sense of what that cluster represents.
for cluster, cluster_sentences in clustered_sentences.items():
    print(f"Cluster {cluster}:")
    for sentence in cluster_sentences:
        print(f" - {sentence}")
Curious to know if you compared performance of curie with gpt35turbo ?
Hi Greg. Thank you for the video. I have a question I'd like to ask. What if I want to work on another book instead of "intothinair"? I tried to load another PDF, but I am getting errors as Colab tells me "the book has 0 tokens in it".
It sounds like that is a data ingestion problem, not sure what the issue is specifically
@@DataIndependent I sort of made it work, although my number-of-documents section outputs 1, and because of this I am not able to use clustering. Ty for your reply.
Please consider making videos on the TypeScript module of LangChain, as some of the methods do not exist or are named differently; it could be a good help
Is there a community, like a Discord or something, where people discuss specific use cases and how to achieve them?
Why do these jupyter notebooks never work for me? Did I miss a requirements file?
@greg how can we compare two documents?
You can get the summaries, then ask the LLM to compare the two
Is there any way to control the length of the summary returned by map_reduce?
Totally, make a custom reduce prompt and tell it you want a shorter summary
Hi, going through the Jupyter notebook, the section on plotting the graph is throwing an error now saying 'list' object has no attribute 'shape'. I am not too familiar with the vectors used, but something seems to be wrong with the fit_transform function when the data is passed into it. Any solution for that?
Did you edit the code at all? Haven’t heard of a problem yet.
Make sure your packages and libraries are updated
@@DataIndependent Thanks for the quick reply. But nope i did not change anything, just running everything through and encountering error. Took a screenshot here pasteboard.co/amwRkxPFxJZ8.png
Can you make a video on how to touch up long documents using ChatGPT or the OpenAI API? My sister owns a consulting firm and is exploring touching up various reports which are roughly 50 pages long.
What do you mean 'touch up' long documents?
Dang. Nice
for high dimensionality data I would probably try UMAP instead of Kmeans
Nice thanks Jean - yeah there is a bunch of good clustering algorithms out there and I went with the quick and easy one after trying a few.
Hi Greg! I really love your videos. Is it possible to create a tutorial on how to create a bot that finds information online, AKA surfs the internet? Thank you for all the good work!
Nice! What types of information are you looking for?
I love "GPT 42K" :)
❤🔥❤🔥❤🔥❤🔥❤🔥❤🔥
Thanks Firsty ha
I love you
Thanks Will
“Napoleon Bonaparte and Serena Williams both achieved remarkable success in their respective fields…” 😂
I'm so confused... I was just trying to summarize a research paper in 1500 words, but it would be faster to do it manually at this point lol.
Where are you confused? How can I help?