You're doing a great job, Maarten; thank you for this video!!
Super awesome... really love the content. Keep making content like this. Thanks!
It works like a charm on my data. The topic labels are now really meaningful and therefore far more useful. You made my day, Maarten. Now I need to include those in my embeddings for semantic search and I think I am good to go. Thanks a lot. Eager to check out your book that is coming (cf. the link in your description).
Pretty incredible. Great coverage!
I am a big fan of BERTopic, and what you are proposing with Llama 2 looks like it solves part of my challenges. I will dedicate my next night to testing it! Thanks for all the great work so far and for sharing this with the community. You are the man.
As promised, I am on the train. First things first: no surprises here. The combo of this video, the Colab, and the dedicated tutorial page is just perfect and educational as usual, Maarten. I like the integration of Llama 2 as a new representation model and the possibility to leverage quantization. I was afraid of not being able to run your experiment on my desktop; you made my day by allowing 4-bit! The results are really promising. This is exactly the type of challenge I was facing with previous topic representations like KeyBERT: they are interesting but prone to interpretation and question loops with end users. This time, with Llama 2, I have the feeling we have the flexibility and versatility we need to guide the topic generation as needed. Really elegant implementation. Thank you, sir! Next step for me is to test it on my use case. Exciting!
Been following BERTopic from the beginning and used it many times along with KeyBERT for work projects and personal projects. Always struggled with the interpretation of topics at the end of the process, but this looks like a great solution. Looking forward to getting your book now. Thanks so much for the tutorial!
I have seen all three videos. Loved them all. Absolutely gold.
Please produce more content! Love BERTopic 💯
THANK YOU MAARTEN, THIS TUTORIAL HAS MADE MY LIFE A LOT EASIER AND HELPED ME FINISH MY PROJECT SUCCESSFULLY!
Thank you for sharing this! Detailed, super informative and very helpful.
Thank you very much for uploading this video. It is very useful for our research work. Really appreciate your work and dedication :)
Very well explained, Maarten. A very inspiring video. And it's great that I can get hands-on with your example myself in Google Colab. I have already tried it on some emails (only using BERTopic, without Llama) and the results are promising. Keep up the good work! I'm already a BERTopic ambassador at work.
Thank you for this informative tutorial! It is really easy to understand and I am ready to implement it.
This is great! Thank you for providing this to the community.
Thank you, Maarten! Looking forward to your next videos.
Something on federated learning would be great too.
That's a good one! I work a lot with federated LLMs nowadays, so I'll keep it in mind 😀
This is exceptionally useful. Thanks a lot!
Thank you, Maarten! Your video and explanation are perfect
Perfect, thanks for this video. I tried so hard to get access to your Medium article but was not able to read it because the content was for premium users with paid subscriptions. Thanks, I was looking for something like this for my solution. I will surely try this one.
Great video. You saved my work.
Great video. Thank you so much sir!
Fascinating! Can’t wait to try this
Really appreciate your work!! Thank you!!
Hi Maarten! I've been following your work for some time and am so happy to see you start a YouTube channel. I am curious how you suggest I apply this (or something similar) to the task of identifying topic timestamps for YouTube videos?
You could use Whisper to convert the audio into text and feed it to BERTopic: towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf
Hi Maarten, great content as always. Would it be possible to make a video on topic distributions? If I've understood well, BERTopic assigns a document to a cluster of documents, hence a single topic per document. What if we want to assign multiple topics? For example, an abstract can talk about sentiment analysis in medical reviews using LLMs, so we want to extract at least three main topics: sentiment analysis, medical reviews, and LLMs. How do we do that? Your answer would be super appreciated!
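For anyone curious how that pipeline could look in code, here is a minimal sketch of the Whisper-to-BERTopic idea (not from the video; the file name and model size are illustrative, and the linked article covers the details):
```python
# A minimal sketch: transcribe audio with Whisper, then cluster the
# transcript segments with BERTopic. "audio.mp3" and "base" are assumptions.
import whisper
from bertopic import BERTopic

# Transcribe the audio; Whisper returns timestamped segments
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
segments = [segment["text"] for segment in result["segments"]]
timestamps = [segment["start"] for segment in result["segments"]]

# Cluster the transcript segments into topics
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(segments)

# Each segment now has a topic and a start time, which together
# approximate topic timestamps for the video.
```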
This video is fantastic.
Great. Thanks for sharing
Great work on this ~topic~. I’d be curious, have you tried using fuzzy clustering algorithms for separating topics? It’s likely that documents sometimes contain multiple topics
Great video.
Maarten, great video on how to use the next iteration of BERTopic with the Llama 2 model. Your examples are all focused on the English language. I have tried BERTopic with Dutch documents, but it fails to generate good-quality topics. Could you make a video on using Dutch or any other language?
That's a great idea! To give you a few quick tips already... using a multilingual embedding model is quite important for properly representing another language, especially if you use KeyBERTInspired. Another trick is to remove Dutch stopwords using the CountVectorizer.
If you combine those tips together with the Best Practices, then that should already give you a head-start: maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html
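Putting those two tips together might look something like the sketch below (not from the video; the stopword list is a small illustrative sample and `docs` stands in for your own Dutch documents):
```python
# A minimal sketch of the multilingual tips above: a multilingual embedding
# model plus Dutch stopword removal via the CountVectorizer.
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

docs = [...]  # assumption: your own Dutch documents go here

# Illustrative sample only; use a full Dutch stopword list in practice
dutch_stopwords = ["de", "het", "een", "en", "van", "dat", "die", "niet"]

# Multilingual embeddings represent Dutch text far better than English-only models
embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Strip Dutch stopwords when building the topic representations
vectorizer_model = CountVectorizer(stop_words=dutch_stopwords)

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    representation_model=KeyBERTInspired(),
)
topics, probs = topic_model.fit_transform(docs)
```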
Could we integrate this with AWS Bedrock? The possibilities are endless! Thank you for your contribution to this field 😊
This is incredible. Fantastic explanation. Thank you so much for the great content! A quick question: if our data consists only of object labels or information about objects detected in images/videos (e.g., "dog," "car," "tree," etc.), can we still use this object label information as input for BERTopic?
With enough documents, I think this should be no problem. Definitely worth trying out!
Considering your innovative approach was a great source of inspiration for me, I'm curious about using my own data. Is it sufficient to focus on the 'abstract' column, or would it be beneficial to include a 'title' column as well? I noticed you extracted 'titles' in your example but didn't use them in the training process (I may have overlooked it). Additionally, the model returned over 100 topics; how can I effectively control the number of topics in the analysis? Thank you again for your contribution.
Thanks for the update - really insightful! Is it possible to use the GPT-3.5 API instead of a local Llama 2?
Can you please do a video on Llama 3.1 for topic modeling and data summarization (like agent-customer chats, reviews, etc.)?
Hi! Unrelated to this video directly, but is there a way to render the visualisation of the clusters as HTML rather than in a Jupyter notebook?
Hi Maarten, does Llama also do a good job labeling Dutch keywords?
Hi Maarten! Thank you so much for the great content! One quick question - would you be able to have Llama 2 label the merged topics when doing hierarchical topic modeling?
Thanks for this great video. Do you think this can be done with game reviews to detect the most important components of a game? I planned to do that with LDA. However, I came across your video, and I thought it would be great to do that with an LLM.
Definitely
The query I have regarding this topic modelling: can we use it anywhere in a Retrieval-Augmented Generation use case, for better fetching of relevant documents and also for better generation of answers?
You could use the constructed topics to categorize the documents that you have. By supplying these documents with additional categories, you create additional constraints/filters for a RAG-based pipeline. Instead of having to search through all documents, it will first determine the category of the question, after which it selects a relevant subset of documents based on that category. There are many more ways you can use BERTopic in RAG, but this can work well if you do not have additional metadata.
@MaartenGrootendorst Thank you so much. I will look into these implementation methods and possibilities.
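As a rough illustration of the category-filtering idea from the reply above, a sketch could look like this (the query string is made up, and `docs` is assumed to be a corpus large enough to cluster):
```python
# A minimal sketch: use BERTopic topics as categories to pre-filter
# the documents a RAG pipeline searches through.
from bertopic import BERTopic

docs = [...]  # assumption: your document collection goes here

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

# At query time, predict which topic the question belongs to...
question = "How do transformers handle long documents?"
predicted_topics, _ = topic_model.transform([question])

# ...and restrict retrieval to documents from that topic only
subset = [doc for doc, topic in zip(docs, topics) if topic == predicted_topics[0]]
```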
Can we use Llama 2 for german topics?
Can we do that with gpt-3.5-turbo?
When using Agglomerative Clustering in this workflow, I get a HUGE topic 0 with 99% of the documents and keywords like "if", "the", and so on... as if it regrouped most of the documents around stopwords. That only happens with Agglomerative Clustering; Mini-Batch KMeans is OK.
Good that you are experimenting with clustering models. As you have noticed, they matter greatly in the construction of the topics, and one can greatly outperform another. I generally hear good stories about HDBSCAN, the default clustering algorithm. Even if you do not want the outliers it generates, there are options for reducing or even removing them: maartengr.github.io/BERTopic/getting_started/outlier_reduction/outlier_reduction.html
thx ❤
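For reference, the outlier-reduction options Maarten links above can be used roughly like this (a minimal sketch, assuming `docs` is your corpus and the default HDBSCAN clustering):
```python
# A minimal sketch: fit with the default HDBSCAN, then re-assign
# outlier documents (topic -1) to their closest topic.
from bertopic import BERTopic

docs = [...]  # assumption: your corpus goes here

topic_model = BERTopic()  # HDBSCAN is the default clustering algorithm
topics, probs = topic_model.fit_transform(docs)

# Re-assign outlier documents to the closest non-outlier topic
new_topics = topic_model.reduce_outliers(docs, topics)

# Update the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)
```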
What if I want the important topics from a single custom document - will it detect them?
Sure, use approximate_distribution: maartengr.github.io/BERTopic/getting_started/distribution/distribution.html
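The linked page shows the details; a minimal sketch of approximate_distribution on a single document could look like this (assuming `topic_model` was already fitted on a corpus; the document string is illustrative):
```python
# A minimal sketch: compute a per-document topic distribution,
# which works even for a single unseen document.
document = "A single custom document whose topics we want to inspect."

# Returns one topic distribution per input document
topic_distr, _ = topic_model.approximate_distribution([document])

# Visualize how strongly each topic is represented in the document
topic_model.visualize_distribution(topic_distr[0])
```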
Hi Maarten! I have been using BERTopic since last year; it's such a useful tool! When I tried this new LLM technique I ran into a problem where KeyBERT and MMR are working fine, but the LLM-generated topics are just giving me repeated nonsense words. Would you have any idea why? It looks like this:
[INST]
I have a topic that contains the following documents:
- How does bekanområområområområområområområområområområområområområområområområområområområområ
My bad, I was being an idiot; it was a problem with the prompting template.