BERTopic: Topic Modelling with Transformer Embeddings, arXiv dataset Python demo

  • Published: 22 Dec 2024

Comments • 13

  • @prabhacar 2 years ago

    Great demo! Very clear explanation and troubleshooting tips! Well done!

  • @parthrangarajan3241 7 months ago

    Hi, great video!
    Is it possible to map each topic to its respective documents?
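
On mapping topics to documents: BERTopic's `fit_transform(docs)` returns a list of topic ids aligned one-to-one with the input documents, so grouping is a matter of zipping the two lists together. The sketch below assumes you already have such a list; the sample `docs` and `topics` values and the helper name `group_docs_by_topic` are made up for illustration and are not from the video.

```python
from collections import defaultdict

def group_docs_by_topic(docs, topics):
    """Group documents by their assigned topic id.

    `topics` is assumed to be aligned with `docs`, i.e. topics[i] is the
    topic id assigned to docs[i].
    """
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)

# Hypothetical data standing in for real model output.
docs = [
    "graph neural networks for molecules",
    "variational inference in bayesian models",
    "message passing on molecular graphs",
    "a short note",
]
topics = [0, 1, 0, -1]  # BERTopic uses -1 for outlier documents

docs_by_topic = group_docs_by_topic(docs, topics)
for topic_id, members in sorted(docs_by_topic.items()):
    print(topic_id, members)
```

With a fitted model, the same helper can be applied to the `topics` list returned alongside the probabilities.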

  • @mmishrafaculty 1 year ago

    Very well explained. Can you suggest a source for text preprocessing before BERTopic?

  • @adityay525125 2 years ago

    Very detailed explanation, thank you sir 🙏

  • @seemarani7314 2 years ago +1

    1. It is mentioned that existing topic modelling methods such as LDA and NMF have too many parameters to tune, and this seems to be the major motivation for the BERTopic approach. How does BERTopic solve this challenge? How does the number of parameters in LDA/NMF compare with that in BERTopic?
    2. Why has UMAP been used for dimensionality reduction? Why is it the most effective clustering algorithm? What is the best way to balance the loss of information from reducing to low dimensions against poor clustering? How are the parameters tuned for this? Why has HDBSCAN been used?

    • @RitheshSreenivasan 2 years ago

      Please have a look at the BERTopic paper.

    • @seemarani7314 2 years ago

      I have, sir... please help me understand this...

    • @seemarani7314 2 years ago

      Can you please find the answers to these questions for me? I would be highly obliged to you... 🙏

    • @RitheshSreenivasan 2 years ago +1

      For your second question, from the paper: "Moreover, (Allaoui et al., 2020) demonstrated that reducing high dimensional embeddings with UMAP can improve the performance of well-known clustering algorithms, such as k-Means and HDBSCAN, both in terms of clustering accuracy and time"
      For your first question, again from the paper: "Conventional models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Non-Negative Matrix Factorization (NMF) (Févotte and Idier, 2011), describe a document as a bag-of-words and model each document as a mixture of latent topics. One limitation of these models is that through bag-of-words representations, they disregard semantic relationships among words. As these representations do not account for the context of words in a sentence, the bag-of-words input may fail to accurately represent documents"
      You will have to do your own research on the number of parameters in LDA/NMF vs. BERTopic.
      Also have a look at the discussion section of the paper, where the author explains the strengths and weaknesses.
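
The bag-of-words limitation quoted above can be seen in a tiny example. This is only an illustrative sketch with made-up sentences and a hypothetical helper `bag_of_words`, using whitespace tokenization; it is not taken from the paper or the video. Two sentences with opposite meanings end up with identical bag-of-words representations because word order is discarded.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is lost."""
    return Counter(text.lower().split())

# Two made-up sentences with different meanings but the same words.
a = "the dog bit the man"
b = "the man bit the dog"

bow_a = bag_of_words(a)
bow_b = bag_of_words(b)

# The sentences differ, yet their bag-of-words counts are identical.
print(a == b)
print(bow_a == bow_b)
```

A model that only sees these counts cannot tell the two sentences apart, which is the gap that contextual sentence embeddings are meant to address.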

    • @seemarani7314 2 years ago +1

      Thank you so much, sir.