Faiss - Introduction to Similarity Search

  • Published: 9 Jul 2024
  • Full Similarity Search Playlist:
    • 3 Traditional Methods ...
    Facebook AI Similarity Search (FAISS) is one of the most popular implementations of efficient similarity search, but what is it - and how can we use it?
    What is it that makes FAISS special? How do we make the best use of this incredible tool?
    Fortunately, it's a brilliantly simple process to get started with. And in this video, we'll explore some of the options FAISS provides, how they work, and - most importantly - how FAISS can make our semantic search faster.
    🌲 Pinecone Article:
    www.pinecone.io/learn/faiss-t...
    📊 Data:
    github.com/jamescalam/data/tr...
    Notebook:
    gist.github.com/jamescalam/71...
    🤖 70% Discount on the NLP With Transformers in Python course:
    bit.ly/3DFvvY5
    🎉 Sign-up For New Articles Every Week on Medium!
    / membership
    👾 Discord
    / discord
    Mining Massive Datasets Book (Similarity Search):
    📚 amzn.to/3CC0zrc (3rd ed)
    📚 amzn.to/3AtHSnV (1st ed, cheaper)
    🕹️ Free AI-Powered Code Refactoring with Sourcery:
    sourcery.ai/?YouTu...
  • Science

Comments • 83

  • @efeberkeerkeskin3375
    @efeberkeerkeskin3375 2 years ago +13

    This is the best tutorial I've ever seen. Very helpful for my final project in college. Thank you James!

  • @arnavmahapatra
    @arnavmahapatra 1 year ago +1

    You're a legend. I've learned a lot from your tutorials. Thank you!

  • @viorelteodorescu
    @viorelteodorescu 1 year ago

    Thank you, James! A very clear and comprehensive tutorial. I understood everything explained, and it was very helpful. I liked the way you covered the whole process, including the data. Great stuff!

  • @d4munche3z
    @d4munche3z 1 year ago +1

    Your videos are incredibly helpful, thank you!

  • @ProtosNo1
    @ProtosNo1 10 months ago +3

    Thanks for such a good tutorial. I am a beginner in DS and had already spent a couple of hours reading the docs on the official FAISS page, but your video is much clearer and helped me solve my matching task🎉
    Thanks one more time and take care😅

  • @codykingham4089
    @codykingham4089 3 years ago

    Great tutorial! Thanks!

  • @goelnikhils
    @goelnikhils 1 year ago +1

    Very good and detailed explanation. So helpful

  • @arminarlert132
    @arminarlert132 2 years ago +1

    This was so helpful, thanks a lot.

  • @user-do1kl5tk8n
    @user-do1kl5tk8n 11 months ago

    A very clear and easily understood tutorial! Thanks a lot.

  • @nareshsandrugu6057
    @nareshsandrugu6057 2 years ago

    It's a simple and clean lecture🙂

  • @sumanthkaushik3898
    @sumanthkaushik3898 2 years ago

    Really well explained! Thanks!

    • @jamesbriggs
      @jamesbriggs  2 years ago

      Welcome! More Faiss content coming Monday :)

  • @pierre-jeanruitort8892
    @pierre-jeanruitort8892 1 year ago

    Keep going, James!!!

  • @aditya_01
    @aditya_01 1 year ago

    Thank you very much for such wonderful content.

  • @maxshibanov818
    @maxshibanov818 5 months ago

    Thanks for the explanation, I'm finally starting to understand.
    I'd love to see not only the stats of query time vs. # vectors, but also how accuracy increases as the nprobe value increases, because it's probably not a linear dependence.

  • @Han-ve8uh
    @Han-ve8uh 1 year ago

    15:36 Is there any formal/intuitive proof of why the true closest database vector could be from another (non-closest) centroid? It feels to me that the point selected in the red region is not the only point in the region, and there will be points to the right of the centroid too (something like "it can't be correct that everyone is above average", so not all points in a cluster can be to the left of the centroid).

  • @aruvii5452
    @aruvii5452 6 months ago +1

    model.encode() itself takes significant time (~200 ms); any suggestions for reducing this time?

  • @tomwalczak4992
    @tomwalczak4992 3 years ago

    Another great video - really good explanation. This seems to be working surprisingly well - even though FAISS uses Euclidean distance (I assume on account of speed). Would be interesting to see how it works for searching both text and image vectors that are aligned in the same semantic space. For very large collections, this would be a great solution in production.

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      Yes, by default we can use Euclidean or inner product - although there are ways to enable cosine sim (and other metrics) in *some* indexes. Text+image similarity search would definitely be an interesting project; I'll make a note to look into it! In production, Faiss is really good.

    • @vincviertytaccount2608
      @vincviertytaccount2608 1 year ago +2

      When all vectors are L2 normalized, the inner (dot) product is equal to the cosine similarity. I also think I read somewhere that in that case the Euclidean distance will have the same ordering as the cosine "distance", since there is a strongly monotonic mapping from the cosine distance to the Euclidean distance, or something like that (don't quote me on it).
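
      A minimal runnable sketch of that normalization trick (dimensions and random data are purely illustrative):
      import faiss
      import numpy as np
      d = 128
      xb = np.random.rand(1000, d).astype('float32')  # database vectors
      xq = np.random.rand(1, d).astype('float32')     # query vector
      faiss.normalize_L2(xb)  # in place; on unit vectors, inner product == cosine
      faiss.normalize_L2(xq)
      index = faiss.IndexFlatIP(d)  # exact inner-product index, no training step
      index.add(xb)
      scores, ids = index.search(xq, 5)  # scores are now cosine similarities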

  • @nikaize
    @nikaize 3 months ago

    Very helpful!

  • @j5ha157
    @j5ha157 1 year ago

    Really good video, thanks for taking the time to share your knowledge! I've got a task where I want to use similarity of documents as a recommendation engine. Would transforming a short passage of text using a sentence transformer and then taking the average vector for each passage work?
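
    A rough sketch of the averaging idea described in this question (the model name and the naive sentence split are assumptions, not a recommendation):
    import numpy as np
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed model
    passage = "First sentence of the passage. Second sentence."
    sentences = passage.split('. ')        # naive split, for illustration only
    vectors = model.encode(sentences)      # one embedding per sentence
    doc_vector = np.mean(vectors, axis=0)  # single averaged vector per passage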

  • @grayrigel7091
    @grayrigel7091 1 year ago

    This is a great introduction. Thanks for the great videos as always James. A quick question: is it possible to get access to these cells with Faiss? In other words, centroid/cell labels for the various vectors. Something similar to a clustering output.

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      Thanks! Yes it's possible, this should help: gist.github.com/mdouze/b0a65aba70a64b00d6425b0e268a4c80
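
      One way to get those cell labels is to ask the IVF index's coarse quantizer for each vector's nearest centroid; a small sketch (sizes illustrative):
      import faiss
      import numpy as np
      d, nlist = 64, 10
      xb = np.random.rand(500, d).astype('float32')
      quantizer = faiss.IndexFlatL2(d)
      index = faiss.IndexIVFFlat(quantizer, d, nlist)
      index.train(xb)
      index.add(xb)
      _, cell_ids = quantizer.search(xb, 1)  # nearest centroid per vector
      print(cell_ids.ravel())  # one cell/cluster label per vector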

  • @areebakhtar6422
    @areebakhtar6422 4 months ago

    Is there a way I can save the vector DB index created to the cloud or somewhere so that others can access it?
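
    Faiss can serialize an index to a local file, which you could then upload to any object store; a sketch (the filename is illustrative, and index is assumed to be an already-built index):
    import faiss
    faiss.write_index(index, "my_index.faiss")   # serialize the index to disk
    # upload "my_index.faiss" to S3/GCS/etc. with your cloud SDK of choice
    index = faiss.read_index("my_index.faiss")   # others reload it like this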

  • @ianhailey
    @ianhailey 1 year ago

    Can the parameterisation of an IVFPQ be automated / trained?

  • @johannamenges3095
    @johannamenges3095 1 year ago

    Great video! But one general question about creating a semantic search engine. When I am creating a semantic search engine, is it important to fine-tune the neural network (for me it's RoBERTa)?

  • @flexwang899
    @flexwang899 1 year ago

    Bro, which of your videos is about vector embeddings?

  • @nickv5013
    @nickv5013 5 months ago

    Would this approach be helpful for building a "Did you mean?" search query correction feature?

  • @arsenyivanov1137
    @arsenyivanov1137 2 years ago

    For some reason I can't understand, this is the only person in the whole world who has made an overview of the Faiss library..

    • @jamesbriggs
      @jamesbriggs  2 years ago

      if Google translate is correct, yes I agree there's very little content on Faiss out there - but I do have a full (free) course on it here: pinecone.io/learn/faiss

  • @williamyeoh4743
    @williamyeoh4743 1 year ago

    Can it understand a question-and-answer type of text format?

  • @shakallb1357
    @shakallb1357 2 years ago

    Great

  • @harisjaved4441
    @harisjaved4441 2 years ago

    Thanks a lot for this James! I do have a question. Does the document vector have a max size? If I want to use Faiss on a dataframe that has vectors with an average word count of 300 words per row, will Faiss truncate those vectors?
    Thanks!

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      Sorry for the late response. Faiss can handle any vector size; however, the sentence transformer (or other model) that you use to encode the vector will truncate. SBERT typically truncates at 128 tokens (I think).
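
      For sentence-transformers models, the truncation limit can be checked and raised; a small sketch (the model name is an assumption):
      from sentence_transformers import SentenceTransformer
      model = SentenceTransformer('all-MiniLM-L6-v2')
      print(model.max_seq_length)  # tokens kept per input; longer text is cut off
      model.max_seq_length = 256   # can be raised, up to the underlying model's limit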

  • @D_Analyst007
    @D_Analyst007 3 months ago

    What happens if the values of different sub-vectors from the same vector are the same? Isn't that an issue?

  • @JOHNSMITH-ve3rq
    @JOHNSMITH-ve3rq 1 year ago +1

    The embedding process was skipped??

  • @arvinflores5316
    @arvinflores5316 1 year ago +1

    Hey love the vid! Would like to ask, should stop words be removed first before going through embeddings for optimal results?

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      Hey Arvin, no. When you're building dense embeddings we tend to use SBERT or similar, and those models actually work better when stopwords are included.

    • @arvinflores5316
      @arvinflores5316 1 year ago

      @@jamesbriggs Interesting. Would love to see an article about this one, or is this more based on experience?

    • @jamesbriggs
      @jamesbriggs  1 year ago +2

      @@arvinflores5316 Both experience and an understanding of transformers. They're trained on huge amounts of data where they learn the context of words based on surrounding words. In the past this was somewhat true with things like word2vec, but those were smaller models and couldn't maintain that much information within their internal model weights. Transformers are much larger and can focus on very nuanced details, including the impact of a "stopword" on the meaning of a particular sentence.

    • @arvinflores5316
      @arvinflores5316 1 year ago

      @@jamesbriggs Oh, that makes sense! Thank you. No wonder it's required for TF-IDF, because of its sparsity.

  • @heejuneAhn
    @heejuneAhn 2 months ago

    Thanks. 1. Can you summarize the order of complexity of IndexFlatL2, IndexIVFFlat and IndexIVFPQ? O(n), O(n * nprobe / nlist), O(log(n)), or so? 2. Please explain more clearly, in a visual way, the role of the quantizer index for the main index. 3. You'd better check the distance values together with the search result indexes, to check how good the results are, especially when you get different results with IVFPQ.

  • @eduardmart1237
    @eduardmart1237 2 years ago

    Do you need to retrain an index when you add several new elements to it?
    How do you do it if you constantly add new data (something like a face recognition library)?

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      You train initially with as much data as possible. You do not need to retrain for every new example, but you should retrain when a significant portion of your data has changed.
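
      A minimal sketch of that train-then-add flow (sizes illustrative):
      import faiss
      import numpy as np
      d, nlist = 64, 100
      xb = np.random.rand(10000, d).astype('float32')
      quantizer = faiss.IndexFlatL2(d)
      index = faiss.IndexIVFFlat(quantizer, d, nlist)
      index.train(xb)          # learn the nlist cell centroids once, up front
      assert index.is_trained
      index.add(xb)            # later vectors can be added without retraining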

  • @TheNano822
    @TheNano822 10 months ago

    Which IDE is this? Can someone tell me?

  • @NitishKumar-pg6qw
    @NitishKumar-pg6qw 3 years ago

    Great video James. Could you please make a video on bi-encoder and cross-encoder based passage retrieval?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      Sure I'll add it to the list, would be a cool topic :)

  • @anoubhav
    @anoubhav 2 years ago

    Can you explain / point to some reference for what "m" and "bits" are in the IVFPQ index at 27:22? Thanks for the awesome videos!
    ----------------------------
    My understanding:
    1. The input vector embedding of dimension d = 768, after PQ, is reduced to dimension "m": we split each vector into m sub-vectors of length d//m and replace each sub-vector with a single centroid ID.
    2. How do we get the centroid ID?
    For all sub-vectors at position i (across all sentence embeddings, ~15000), we perform a clustering. Thus, in total we perform clustering m times (i = 0 --> m-1).
    In each clustering, the number of clusters (IDs) = 2^bits? (is this correct?)
    We used bits = 8, so we replace a sub-vector of size d//m with a single cluster_ID whose value belongs to [0, 2^bits - 1], i.e. 2^bits distinct values. In the video, 256 possible cluster IDs.
    e.g. input = [768-dim vector] --> after PQ -->
    output = [m-dim vector] whose values belong to [0, 2^bits - 1]
    = [8-dim vector] whose values belong to [0, 255]?

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      m is the number of subvectors used in PQ, and bits/nbits is the number of bits used by each subquantizer - I've covered these in later videos of the series:
      PQ deep-dive: ruclips.net/video/t9mRf2S5vDI/видео.html
      PQ and IVFPQ in Faiss: ruclips.net/video/BMYBwbkbVec/видео.html
      I hope that helps! :)

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      Missed your explanation - yes what you said is correct
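
      To put m and nbits in context, a minimal IVFPQ construction might look like this (numbers illustrative):
      import faiss
      d = 768      # original embedding dimensionality
      nlist = 100  # number of IVF cells
      m = 8        # subvectors per vector; each has d // m = 96 dims
      nbits = 8    # bits per subquantizer -> 2**nbits = 256 centroids each
      quantizer = faiss.IndexFlatL2(d)
      index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
      # once trained, each vector is stored as m one-byte centroid IDs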

  • @opalkabert
    @opalkabert 2 years ago +1

    James, is there a function in FAISS that will produce a score or probability when FAISS is used to find similar documents? I want to get a score for each of the top documents FAISS proposes. Thanks

    • @opalkabert
      @opalkabert 2 years ago +1

      This did the trick:
      import faiss
      # flat inner-product index; with L2-normalized vectors, IP == cosine similarity
      index = faiss.index_factory(d, "Flat", faiss.METRIC_INNER_PRODUCT)
      faiss.normalize_L2(doc_embeddings)  # normalize the document vectors in place
      index.add(doc_embeddings)
      print(index.ntotal)                 # number of vectors now in the index
      faiss.normalize_L2(xq)
      distances, indices = index.search(xq, 5)  # don't shadow `index` here
      print('Distance by FAISS:{}'.format(distances))

    • @jamesbriggs
      @jamesbriggs  2 years ago

      Glad you figured it out :)

  • @user-qj3ig7qz3y
    @user-qj3ig7qz3y 1 month ago

    It works, however it's not very precise for scientific documents with a lot of graphs, maths & symbols. Could you please give any idea of how to make it more precise? With high appreciation, and thank you for your tutorial..👍👍

  • @mmoya1135
    @mmoya1135 2 years ago

    Can IVFPQ be trained incrementally? Loading a bunch of dense embeddings all at once requires a lot of memory.

    • @jamesbriggs
      @jamesbriggs  2 years ago

      Yes you can; try to make the smaller sample you train with representative of the expected dataset.
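
      A sketch of that sample-then-batch pattern (the sample size and the batch generator are assumptions):
      import faiss
      import numpy as np
      d, nlist, m, nbits = 768, 100, 8, 8
      quantizer = faiss.IndexFlatL2(d)
      index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
      sample = np.random.rand(20000, d).astype('float32')  # representative sample
      index.train(sample)                # train once, on the sample only
      for batch in embedding_batches():  # hypothetical generator of float32 arrays
          index.add(batch)               # add the full dataset incrementally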

  • @eduardmart1237
    @eduardmart1237 2 years ago

    If the index is more than one PC's RAM, can you distribute it across several machines?

    • @jamesbriggs
      @jamesbriggs  2 years ago

      I believe it is possible but I'm not sure if there is any built-in functionality for this. Take a look at pinecone.io/ if you're struggling with scaling; it's free up to 1M vectors.

  • @eduardmart1237
    @eduardmart1237 2 years ago

    Are there easy ways to use Faiss if the whole index doesn't fit in one PC's RAM?
    Something like a distributed system.

    • @jamesbriggs
      @jamesbriggs  2 years ago

      It is possible to distribute across hardware but I’m not sure how

  • @eduardmart1237
    @eduardmart1237 2 years ago

    Is it possible to improve the speed of the search without sacrificing accuracy, i.e. not using approximate search?
    Or is the flat index the only one that returns 100% accurate results?

    • @jamesbriggs
      @jamesbriggs  2 years ago

      Flat index is the only one with guaranteed 100% recall, but you can get 99.9% with very fast indexes if done well
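
      With an IVF index that trade-off is controlled by nprobe; a small sketch (values illustrative):
      import faiss
      import numpy as np
      d, nlist = 64, 100
      xb = np.random.rand(10000, d).astype('float32')
      quantizer = faiss.IndexFlatL2(d)
      index = faiss.IndexIVFFlat(quantizer, d, nlist)
      index.train(xb)
      index.add(xb)
      index.nprobe = 32  # search more cells: slower, but recall approaches 100%
      D, I = index.search(xb[:1], 5)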

  • @charmz973
    @charmz973 2 years ago

    Thank you bruv, very resourceful, though I can't find the link to the notebook.

    • @jamesbriggs
      @jamesbriggs  2 years ago +2

      here you go - gist.github.com/jamescalam/7117aa92235a7f52141ad0654795aa48

  • @esha_indra
    @esha_indra 2 years ago

    What do you do if the embeddings don't fit into memory?

    • @jamesbriggs
      @jamesbriggs  2 years ago +2

      try a compression method like PQ www.pinecone.io/learn/product-quantization/ :)

  • @Data_scientist_t3rmi
    @Data_scientist_t3rmi 1 year ago

    Hello, great videos!
    I wonder, if we use FAISS, will we not be able to use a cross-encoder for similarity search?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      Thanks! You can use a cross-encoder as a second "reranking" step, but for Faiss itself you need to use the bi-encoder/sentence transformer.
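
      A rough sketch of that retrieve-then-rerank pattern (the cross-encoder model and the variable names query, docs, and candidate_ids are assumptions):
      from sentence_transformers import CrossEncoder
      # step 1: get candidate_ids from Faiss with the bi-encoder, as in the video
      # step 2: rerank just those candidates with a cross-encoder
      reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
      pairs = [(query, docs[i]) for i in candidate_ids]
      scores = reranker.predict(pairs)  # higher score = more relevant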

    • @Data_scientist_t3rmi
      @Data_scientist_t3rmi 1 year ago

      @@jamesbriggs So Faiss also uses a sentence transformer? I have started the article but have not finished it yet; I'm on the step about clustering and product quantization. I'm wondering whether behind these Faiss algorithms there are also any sentence transformer algorithms and cosine similarity.

  • @eduardmart1237
    @eduardmart1237 2 years ago +1

    Can you please make some examples using Faiss with a Flask application?

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      I will, but it may take some time for me to get to it, sorry!

  • @aitarun
    @aitarun 2 years ago

    Great video, watch at 1.75x at least.

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      I've always been told I'm a slow talker haha

  • @SuperMaDBrothers
    @SuperMaDBrothers 1 year ago

    "Centroids are simply the centers of those cells" means absolutely nothing. How do you even define the center of a cell? I know, but your audience doesn't. You need to explain things better. Lose the jargon and just tell us what it is - a chunking mechanism. Have more of an overview please.

    • @jamesbriggs
      @jamesbriggs  1 year ago

      Sure, I try to keep things as simple as possible, but sometimes maybe I miss explaining things as simply as I could have. Thanks for the feedback, I'll try!

    • @SuperMaDBrothers
      @SuperMaDBrothers 1 year ago

      I might have misspoken; this is a good explanation otherwise.