Choosing Indexes for Similarity Search (Faiss in Python)

  • Published: 25 Jul 2024
  • Facebook AI Similarity Search (Faiss) is a game-changer in the world of search. It lets us efficiently search a huge range of media, from GIFs to articles, with high accuracy and sub-second query times, even on datasets of a billion or more vectors.
    Faiss owes its success to many things. One of them, in particular, is its flexibility. Faiss recognizes that there is no 'one-size-fits-all' in similarity search.
    Instead, Faiss comes with a wide range of search indexes, which we can mix and match as we choose.
    However, this flexibility raises a question: how do we know which index fits our use case?
    Which index do we choose? Should we use multiple indexes, or is one enough?
    This video explores the pros and cons of some of the most important indexes: Flat, LSH, HNSW, and IVF. We will learn how to decide which one to use, and how the parameters of each index affect it, so that we can build some of the best indexes for semantic search.
    🌲 Pinecone Article:
    www.pinecone.io/learn/vector-...
    🎉 Sign-up For New Articles Every Week on Medium!
    / membership
    Download script for Sift1M dataset:
    gist.github.com/jamescalam/a0...
    Similarity Search Series:
    • Vector Similarity Sear...
    🤖 70% Discount on the NLP With Transformers in Python course:
    bit.ly/3DFvvY5
    👾 Discord
    / discord
    Mining Massive Datasets Book (Similarity Search):
    📚 amzn.to/3CC0zrc (3rd ed)
    📚 amzn.to/3AtHSnV (1st ed, cheaper)
    🕹️ Free AI-Powered Code Refactoring with Sourcery:
    sourcery.ai/?YouTu...
  • Science
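
To make the Flat versus IVF trade-off from the description concrete, here is a minimal NumPy sketch. It is not Faiss code: `flat_search` and `ivf_search` are hypothetical helper names, the data is random, and randomly picked database vectors stand in for the k-means centroids that a real IVF index would train. The 128 dimensions match Sift1M; the other sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nlist = 128, 2000, 16  # 128-d like Sift1M; n and nlist are illustrative

xb = rng.standard_normal((n, d)).astype("float32")  # database vectors
# A query that sits very close to database vector 0.
xq = xb[0] + 0.01 * rng.standard_normal(d).astype("float32")


def sq_dists(a, b):
    """Squared L2 distance from every row of `a` to the vector `b`."""
    return ((a - b) ** 2).sum(axis=1)


def flat_search(xq, k):
    """Exhaustive search: compare the query with every stored vector."""
    return np.argsort(sq_dists(xb, xq))[:k]


# Assumption: random database vectors stand in for trained k-means centroids.
centroids = xb[rng.choice(n, nlist, replace=False)]
# Cell (nearest centroid) of every database vector.
assign = np.array([np.argmin(sq_dists(centroids, v)) for v in xb])


def ivf_search(xq, k, nprobe):
    """Search only the `nprobe` cells whose centroids are closest to the query."""
    cells = np.argsort(sq_dists(centroids, xq))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, cells))  # IDs stored in those cells
    return cand[np.argsort(sq_dists(xb[cand], xq))[:k]]


print(flat_search(xq, 5))         # exact top 5
print(ivf_search(xq, 5, 1))       # fewer comparisons, may miss neighbours
print(ivf_search(xq, 5, nlist))   # probing every cell is exhaustive again
```

Raising `nprobe` scans more cells, which trades speed for recall; probing all `nlist` cells gives the same neighbours as the flat search.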

Comments • 17

  • @harshitjaitly6850 2 years ago +1

    Super Informative Content!
    Thank you so much for this.

  • @narayansharma8797 2 years ago +1

    Thanks a bunch for this, James! It would be really great to see a couple of them explored in depth. Also, if you could benchmark FAISS against ScaNN, it would help a few of us noobs a hell of a lot.
    Great content! Lovely command over your content. Really need more of this.

    • @jamesbriggs 2 years ago +2

      Hey Narayan, there is already a video out covering the 'traditional' version of LSH, and two more videos on the random projection version of LSH (used in Faiss) will be released at 12:00 ET today - and there are plenty more of these on the way ;)
      I love the FAISS vs ScaNN idea too, will be working on it soon!

    • @narayansharma8797 2 years ago

      @jamesbriggs Sold!

  • @Nick-vs1zp 2 years ago

    Great explanations, especially for IVF - it's probably the best explanation for how it works that I've seen.

  • @grayrigel7091 1 year ago +1

    Hi James.
    Thanks for such a wonderful tutorial. Really useful. A quick question: for a new query vector, is it possible to return the IVF cell/partition that it belongs to, instead of returning the neighbors? I think I can measure the distances to the centroids and return the closest centroid. However, I was wondering whether there is a built-in way.

  • @ChrisZuo 15 days ago

    Thank you! The drawings are cute!

  • @katehan9623 2 years ago

    Thank you for your video. Most valuable channel. Do you use a GPU for indexing in these projects?

  • @haneulkim4902 1 year ago

    Thanks for the amazing video! Do you know why simple k-means is not used for these MIPS problems?

  • @itheenigma 3 years ago

    Super useful! Thanks for this video James. For IVF, can we retrieve the clusters that each datapoint belongs to after training (also cluster centroids)?

    • @jamesbriggs 3 years ago +2

      Yes you can, there is info on it here gist.github.com/mdouze/904e0b538ef7767c9e83a45ac1b57d1b
      The code you need to write (after training and adding your data to 'index') is:
      invlists = index.invlists
      all_ids = []  # one array of vector IDs per non-empty cell
      for l in range(index.nlist):  # loop over every IVF cell
          ls = invlists.list_size(l)
          if ls == 0:
              continue  # skip empty cells
          all_ids.append(
              faiss.rev_swig_ptr(invlists.get_ids(l), ls).copy()
          )

    • @itheenigma 3 years ago

      @jamesbriggs legend. Will give it a go. Ta!

  • @mohammadyahya78 1 year ago

    Does the IVF algorithm work with high-dimensional data, such as 100 dimensions, please?

  • @nareshsandrugu6057 2 years ago

    Could you share a video on this? Assume I have binary train and test data and need to calculate the Hamming distance. I couldn't find any videos that do this with Faiss; if you could share one, that would be very helpful.
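
For readers with the same question, the idea behind Hamming-distance search on binary data can be sketched in plain NumPy as below. This is an illustration rather than Faiss code: `hamming_search` is a hypothetical helper and the train and test arrays are random stand-ins. Faiss also ships binary indexes such as `IndexBinaryFlat`, which work on bit-packed `uint8` arrays like the ones built here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # number of bits per vector

# Random 0/1 data standing in for the binary train set.
train = rng.integers(0, 2, size=(1000, d), dtype=np.uint8)
# Test vectors: copies of the first three train vectors with one bit flipped.
test = train[:3].copy()
test[:, 0] ^= 1

# Pack 8 bits into each byte, giving arrays of shape (n, d // 8).
train_packed = np.packbits(train, axis=1)
test_packed = np.packbits(test, axis=1)


def hamming_search(queries, database, k):
    """Return (distances, ids) of the k database rows closest in Hamming distance."""
    xor = queries[:, None, :] ^ database[None, :, :]   # differing bits, packed
    dists = np.unpackbits(xor, axis=2).sum(axis=2)     # count the differing bits
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx


D, I = hamming_search(test_packed, train_packed, k=1)
print(I.ravel(), D.ravel())  # nearest train ID and distance for each test vector
```

The XOR of two packed vectors has a 1 wherever they differ, so counting its set bits gives the Hamming distance.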

  • @viorelteodorescu 1 year ago

    What does IP stand for?
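
In Faiss index names such as `IndexFlatIP`, IP stands for inner product (the dot product), the alternative to L2 (Euclidean) distance. The toy vectors below are made up for illustration; the sketch shows that the two metrics can rank neighbours differently, and that after normalizing vectors to unit length the inner product becomes cosine similarity.

```python
import numpy as np

# Three toy database vectors and one query (made-up values).
xb = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]], dtype="float32")
xq = np.array([0.9, 0.1], dtype="float32")

ip = xb @ xq                        # inner product: larger means more similar
l2 = ((xb - xq) ** 2).sum(axis=1)   # squared L2: smaller means more similar

best_ip = int(np.argmax(ip))        # favours the long vector [3, 3]
best_l2 = int(np.argmin(l2))        # favours the nearby vector [1, 0]

# With unit-length vectors, the inner product equals cosine similarity.
xb_unit = xb / np.linalg.norm(xb, axis=1, keepdims=True)
xq_unit = xq / np.linalg.norm(xq)
best_cos = int(np.argmax(xb_unit @ xq_unit))

print(best_ip, best_l2, best_cos)
```

Because the raw inner product rewards vector length, it is common to normalize vectors first when cosine similarity is what is actually wanted.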

  • @mohammadyahya78 1 year ago

    What is nbits at 10:21, please?
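
Assuming the question refers to the `nbits` argument of the LSH index (`faiss.IndexLSH(d, nbits)`), it sets how many bits each vector is hashed to. In the random projection version of LSH, each bit records which side of one random hyperplane the vector falls on. The sketch below is a NumPy illustration of that idea, not the Faiss implementation; `lsh_hash` and `planes` are hypothetical names.

```python
import numpy as np


def lsh_hash(x, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return (x @ planes.T > 0).astype(np.uint8)


rng = np.random.default_rng(0)
d, nbits = 128, 16

planes = rng.standard_normal((nbits, d))  # normal vectors of nbits random hyperplanes
x = rng.standard_normal((5, d))           # five example vectors

codes = lsh_hash(x, planes)               # shape (5, nbits), entries 0 or 1
print(codes.shape)
print(codes[0])
```

More bits give finer buckets, so the codes separate vectors better, at the cost of more memory and slower comparisons. Only the direction of a vector matters to the code: scaling it by a positive number leaves every bit unchanged.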