Rasa Algorithm Whiteboard: Levenshtein Vectors

  • Published: 14 Jan 2025

Comments • 2

  • @atinsood 4 years ago

    as always, appreciate walking in depth through this interesting topic. quick follow-up questions:
    1. Why did you end up picking Euclidean distance over cosine distance?
    2. Given this observation, I wonder if you can leverage the whatlies find-similar for spell correction, since Levenshtein gets used for spell correction.
    3. I wonder if the relation gets stronger when the size of the n-gram is smaller. If rather than (1,3) you had picked (3,5), I wonder if the relation would have been weaker. The rationale being: the longer the n-grams, the higher the number of zeros in the sparse matrix, and less and less information gets captured in the sparse matrix to then be transferred over to the dense matrix.
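The sparsity intuition in point 3 can be checked with a toy experiment. The sketch below is a pure-Python stand-in for a character `CountVectorizer`: the word list, the `sparsity_stats` helper, and the distinct-n-gram density measure are all illustrative assumptions, not anything from the video.

```python
from collections import Counter

def char_ngrams(word, lo, hi):
    """Character n-grams of `word` for every n in [lo, hi], with counts."""
    return Counter(word[i:i + n]
                   for n in range(lo, hi + 1)
                   for i in range(len(word) - n + 1))

def sparsity_stats(words, lo, hi):
    """Return (vocabulary size, average fraction of non-zero entries per word)."""
    grams = [char_ngrams(w, lo, hi) for w in words]
    vocab = set()
    for g in grams:
        vocab |= set(g)
    density = sum(len(g) for g in grams) / (len(grams) * len(vocab))
    return len(vocab), density

words = ["hello", "hallo", "hella", "world"]
for lo, hi in [(1, 3), (3, 5)]:
    dims, density = sparsity_stats(words, lo, hi)
    print(f"ngram_range=({lo},{hi}): {dims} dims, density {density:.2f}")
```

Even on this tiny vocabulary, the (3,5) rows come out with a lower fraction of non-zero entries than the (1,3) rows, which is the "more zeros for longer n-grams" effect the question describes.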

    • @RasaHQ 4 years ago +2

      For spell checkers you probably want to keep the original sparse representation; that's bound to result in faster retrieval times.
      I chose Euclidean distance because of its geometric interpretation. Cosine distance does not correlate because it measures the angle between two vectors, and the sparse representations you start with are close to orthogonal, so Euclidean works better here.
      As far as n-grams go, what you mention seems plausible, but a lot of information is lost when we use SVD for the dimensionality reduction. There is a bottleneck there.
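The Euclidean-vs-Levenshtein connection in the reply can be illustrated directly on sparse character n-gram count vectors. This is a minimal sketch, assuming a (1,3) n-gram range and a few hand-picked word pairs; the helpers are hypothetical names, not code from the video or from whatlies.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ngram_vector(word, lo=1, hi=3):
    """Sparse character n-gram count vector as a Counter."""
    return Counter(word[i:i + n]
                   for n in range(lo, hi + 1)
                   for i in range(len(word) - n + 1))

def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))

def cosine_distance(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return 1 - dot / (nu * nv)

# One typo away, three edits away, and an unrelated word.
pairs = [("kitten", "sitten"), ("kitten", "sitting"), ("kitten", "banana")]
for a, b in pairs:
    u, v = ngram_vector(a), ngram_vector(b)
    print(f"{a} vs {b}: lev={levenshtein(a, b)} "
          f"euclid={euclidean(u, v):.2f} cos={cosine_distance(u, v):.2f}")
```

A single character edit perturbs only the handful of n-gram counts touching that position, so the Euclidean distance grows roughly with the number of edits; because two different words share few n-grams, their vectors are nearly orthogonal, which is why the angle (cosine) is a much blunter signal here, as the reply notes.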