Great demo! Very clear explanation and troubleshooting tips! Well done!
Thank You!
Hi, great video!
Is it possible to map each topic to its respective documents?
Very well explained. Can you suggest a source for text preprocessing before BERTopic?
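One possible approach, as a minimal sketch: BERTopic's `fit_transform(docs)` returns one topic id per document, in the same order as `docs`, so the two lists can be paired up. The `docs` and `topics` values below are made-up placeholders standing in for real model output, and `group_docs_by_topic` is a hypothetical helper, not part of BERTopic.

```python
from collections import defaultdict


def group_docs_by_topic(docs, topics):
    """Group documents by their assigned topic id (hypothetical helper)."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


# Placeholder data (assumption). In practice `topics` would come from
# something like `topics, probs = topic_model.fit_transform(docs)`;
# topic id -1 is the outlier topic.
docs = [
    "stock markets fell sharply",
    "the team won the final",
    "interest rates were raised",
    "lorem ipsum",
]
topics = [0, 1, 0, -1]

docs_by_topic = group_docs_by_topic(docs, topics)
for topic_id, members in sorted(docs_by_topic.items()):
    print(topic_id, members)
```

The same pairing works in the other direction too: `topics[i]` is the topic of `docs[i]`.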
Refer to this: github.com/MaartenGr/BERTopic/issues/40
Very detailed explanation, thank you sir 🙏
Thank You!
1. It is mentioned that existing topic modeling methods such as LDA/NMF have too many parameters to tune, and this seems to be the major motivation for the BERTopic approach. How is this challenge solved in this approach? What is the difference in the number of parameters between LDA/NMF and BERTopic?
2. Why has UMAP been used for dimensionality reduction? Why is it the most effective clustering algorithm? What is the best way to balance the loss of information from reducing to a low dimension against poor clustering? How are the parameters tuned for this? Why has HDBSCAN been used?
Please have a look at the BERTopic paper
I have, sir... please help me understand this...
Can you please find answers to these questions for me? I will be highly obliged to you.... 🙏
For your second question, from the paper: "Moreover, (Allaoui et al., 2020) demonstrated that reducing high dimensional embeddings with UMAP can improve the performance of well-known clustering algorithms, such as k-Means and HDBSCAN, both in terms of clustering accuracy and time"
For your first question, again from the paper: "Conventional models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Non-Negative Matrix Factorization (NMF) (Févotte and Idier, 2011), describe a document as a bag-of-words and model each document as a mixture of latent topics. One limitation of these models is that through bag-of-words representations, they disregard semantic relationships among words. As these representations do not account for the context of words in a sentence, the bag-of-words input may fail to accurately represent documents"
You will have to do your own research regarding the number of parameters in LDA/NMF vs. BERTopic.
Also have a look at the discussion section of the paper, where the author explains the strengths and weaknesses.
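To make the bag-of-words limitation in that quote concrete, here is a small sketch using only the Python standard library. The two example sentences and the `bag_of_words` helper are illustrative assumptions, not taken from the paper: the sentences mean opposite things but produce identical word counts.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


bow_a = bag_of_words("the dog bit the man")
bow_b = bag_of_words("the man bit the dog")

# Same counts for both sentences, even though who bit whom is reversed.
print(bow_a == bow_b)  # True
```

Sentence embeddings, which BERTopic starts from, take word order and context into account, so two sentences like these would not necessarily get the same representation.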
Thank you so much, sir