3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

  • Published: 25 Jul 2024
  • Vector similarity search is one of the fastest-growing domains in AI and machine learning. At its core, it is the process of matching relevant pieces of information together.
    Similarity search is a complex topic and there are countless techniques for building effective search engines.
    In this video, we'll cover three vector-based approaches for comparing language and identifying similar 'documents', covering both vector similarity search and semantic search:
    - TF-IDF
    - BM25
    - Sentence-BERT
    📰 Original article:
    www.pinecone.io/learn/semanti...
    🤖 70% Discount on the NLP With Transformers in Python course:
    bit.ly/3DFvvY5
    🎉 Sign-up For New Articles Every Week on Medium!
    / membership
    Mining Massive Datasets Book (Similarity Search):
    📚 amzn.to/3CC0zrc (3rd ed)
    📚 amzn.to/3AtHSnV (1st ed, cheaper)
    👾 Discord
    / discord
    🕹️ Free AI-Powered Code Refactoring with Sourcery:
    sourcery.ai/?YouTu...
    00:00 Intro
    01:37 TF-IDF
    11:44 BM25
    20:30 SBERT
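To get a feel for the first method in the list, TF-IDF fits in a few lines of Python. This is a minimal sketch with a made-up toy corpus and an unsmoothed idf, not the video's exact implementation:

```python
import math
from collections import Counter

# Toy corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term):
    # Inverse document frequency: log(N / df), where df counts
    # how many documents contain the term.
    df = sum(term in doc for doc in tokenized)
    return math.log(N / df) if df else 0.0

def tfidf(term, doc):
    # Term frequency (normalised count in the document) times idf.
    tf = Counter(doc)[term] / len(doc)
    return tf * idf(term)

# Documents mentioning "cat" score above those that don't.
scores = [tfidf("cat", doc) for doc in tokenized]
```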

Comments • 58

  • @stepkurniawan
    @stepkurniawan 7 months ago +2

    7 mins to understand TF-IDF, you're my saviour

  • @mayursanmugam4050
    @mayursanmugam4050 10 months ago +2

    Found this after stumbling around for a good overview of BM25 & SBERT. This is a fantastic initial introduction - enough detail and introduces the right concepts that people can double down on for further learning. Thank you James!

  • @bujin1977
    @bujin1977 a year ago

    Thanks, this was very helpful! I've recently started using SQL Server's full text search capabilities to drive course searches on our college website, but it was all a bit of a "black box" thing. No idea how it worked, I just trusted that it *did* work! Until I got a query from someone who wanted to know how to alter their search results to change the order that we display them on the website. I'm no stranger to complicated mathematical formulae, but I took one look at the BM25 formula on wikipedia and cried! Your explanation made it so much easier to understand what was going on.
    Now comes the hard part. Explaining how the staff member in question can alter their data to boost their results... 😬
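For anyone else who took one look at that Wikipedia formula and cried, the whole of BM25 fits in a short Python sketch. The corpus is a toy placeholder, k1 = 1.2 and b = 0.75 are the commonly cited defaults, and the +1 inside the log is one common idf smoothing variant rather than the only one:

```python
import math
from collections import Counter

# Toy corpus, whitespace-tokenized, for illustration only.
docs = [
    "purple is the best city in the forest",
    "there is an art to getting your way",
    "it is the best of times",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
k1, b = 1.2, 0.75  # common default hyperparameters

def bm25(query, doc):
    score = 0.0
    freqs = Counter(doc)
    for term in query.split():
        df = sum(term in d for d in tokenized)
        if df == 0:
            continue
        # Smoothed idf keeps scores non-negative for very common terms.
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = freqs[term]  # term frequency in this document
        # Saturating tf component, normalised by document length vs average.
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Rank documents for a query, best match first.
ranked = sorted(tokenized, key=lambda d: bm25("best city", d), reverse=True)
```

Boosting a document's ranking then comes down to its term frequencies and length relative to the rest of the corpus (and, in a real engine, the k1/b tuning).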

  • @lukekim4760
    @lukekim4760 2 years ago +3

    I am into document similarity ranking and I love your videos! Thank you so much :)

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      Great to hear! I made a full (and free) course on semantic search if you're interested :) www.pinecone.io/learn/nlp

  • @mcnubn
    @mcnubn 5 months ago

    Really helped clear up BM25 for me! Huge thank you for sharing this!

  • @AjayShivranBCSE
    @AjayShivranBCSE 2 years ago +1

    Great work man!

  • @asedaaddai-deseh8152
    @asedaaddai-deseh8152 2 years ago

    Great explanation!

  • @szymonskorupinski5237
    @szymonskorupinski5237 3 years ago

    Great work!

  • @yonahcitron226
    @yonahcitron226 a year ago

    great explanations! thanks!

  • @Data_scientist_t3rmi
    @Data_scientist_t3rmi a year ago

    Excellent video thank you!

  • @parth191079
    @parth191079 2 months ago

    This is super helpful! Thank you for this video.

  • @leonardvanduuren8708
    @leonardvanduuren8708 a year ago

    Masterful !! Thx for this and all your other stuff !!

  • @li-pingho1441
    @li-pingho1441 10 months ago

    extremely simple explanation!!!!!!!!

  • @ruimelo1039
    @ruimelo1039 2 years ago +2

    I'm doing a uni project on this topic and your explanation was on point! Thank you

  • @qwerty8669
    @qwerty8669 2 years ago

    Thanks this was helpful

  • @LuisRomaUSA
    @LuisRomaUSA 2 years ago +2

    Not many views yet, but please don't stop making content. This is the best video I have found in a week of searching.

    • @jamesbriggs
      @jamesbriggs  2 years ago

      haha happy to hear, I've committed to making videos so I'll be here for a long time 😅
      check out the similarity search playlist if you're interested in these things, just finished it!

  • @MehdiMirzapour
    @MehdiMirzapour 4 months ago

    Great work! You are a great teacher! Although I already know these concepts, I enjoyed watching it a lot.

  • @abhishekrathi6253
    @abhishekrathi6253 2 years ago

    Nice explanation

  • @tomwalczak4992
    @tomwalczak4992 3 years ago +6

    Really good, simple explanations. Also really liked your Udemy course.

    • @jamesbriggs
      @jamesbriggs  3 years ago

      hey Thomas, yes, I remember you left a review on the course! Great to see you here too, and thanks!

    • @tomwalczak4992
      @tomwalczak4992 3 years ago

      @@jamesbriggs Yup ;) I'm getting into NLP so your videos have been super useful. Just finished my first project that uses both sparse and dense embeddings: share.streamlit.io/tomwalczak/pubmed-abstract-analyzer
      And as you say in the video, dense embeddings and complex models don't always work better, at least not out-of-the-box. Looking forward to more vids :)

    • @jamesbriggs
      @jamesbriggs  3 years ago

      @@tomwalczak4992 That's a very cool project, first one too? I'm impressed!
      Awesome to see you're getting into it though, looking forward to seeing you around!

  • @UnpluggedPerformance
    @UnpluggedPerformance 2 years ago

    bro super good explanations

  • @UnpluggedPerformance
    @UnpluggedPerformance 2 years ago +1

    that BERT outcome is certainly cool!!!!! you made my day man!! awesome! how can we support you? (besides likes etc.)

    • @jamesbriggs
      @jamesbriggs  2 years ago

      comments like this! Really happy it helped :)

  • @li-pingho1441
    @li-pingho1441 10 months ago

    thank you so muchhhhhhh

  • @pfinardii
    @pfinardii 2 years ago +1

    Hi James, fantastic video!!! A question: when using BERT to extract dense representations from the hidden_state or last_hidden_state layers, we perform masked_embeddings = embeddings * mask (where mask is BERT's attention_mask output) to zero out the padding tokens. Shouldn't we also consider the special tokens [CLS] and [SEP]? The attention mask for these special tokens is 1, so when using a hidden layer from BERT we would need to slice: masked_embeddings = masked_embeddings[:, 1:sep_token_pos, :], where sep_token_pos is the position of [SEP] in the sequence [[CLS], tokens of the sequence, [SEP], [PAD], [PAD], ...].

    • @jamesbriggs
      @jamesbriggs  2 years ago +1

      hey Paulo, good question. I believe the other sentence transformer models that build these embeddings keep both, but I have never seen them explicitly state that they do (or why) in any papers, so I can't say for sure, sorry!
      Nonetheless, my understanding is that the CLS and SEP tokens are included within the embeddings as they still contain useful information about the input data. The CLS token itself can actually be used in building sentence embeddings (although it is ineffective compared to mean pooling afaik). The significance of that is that the CLS token contains enough information about the sequence to be (somewhat) effectively used as a single vector representation of the whole sequence, so it holds quite useful information that would be lost if removed.
      As for the SEP token, I don't believe it is as important as the CLS, but I can't say I know how relevant it is.
      I'd be curious to see a comparison of embedding performance with/without the CLS/SEP tokens though. I'm sure it has been tested but I've never seen something like that mentioned

    • @pfinardii
      @pfinardii 2 years ago

      @@jamesbriggs Hi James, I did a test with MNR loss. During the tokenization process I set the tokenizer parameter add_special_tokens=False and got 0.83 against 0.81 with the default value (True). I still need a test without only the [SEP] token to make the results more robust. Thanks for the reply :)

    • @jamesbriggs
      @jamesbriggs  2 years ago

      @@pfinardii oh so it's better? Wow I'll have to try it too - that's awesome :)
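The pooling trick discussed in this thread can be sketched with NumPy. Shapes and values here are made up for illustration; in practice `embeddings` would be BERT's last_hidden_state and `attention_mask` would come from the tokenizer:

```python
import numpy as np

# Fake token embeddings: batch=1, seq_len=4, dim=3.
embeddings = np.array([[[1., 1., 1.],    # [CLS]
                        [2., 2., 2.],    # a real token
                        [6., 6., 6.],    # [SEP]
                        [9., 9., 9.]]])  # [PAD] -- must not affect the mean
attention_mask = np.array([[1, 1, 1, 0]])  # 1 for real/special tokens, 0 for padding

mask = attention_mask[..., None]                 # (1, 4, 1), broadcasts over dim
masked = embeddings * mask                       # zero out padding vectors
pooled = masked.sum(axis=1) / mask.sum(axis=1)   # mean over unmasked tokens only

# To also drop [CLS]/[SEP] (as suggested above), zero their entries in the
# mask before pooling instead of slicing; that handles a [SEP] position
# that varies per sequence.
```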

  • @wenzeloong
    @wenzeloong 2 years ago

    It's a great video!! I need your opinion, sir James. In this video you use cosine similarity to calculate the distance. What do you think about combining these methods with ANN (approximate nearest neighbour) using angular distance? Is that better than using cosine similarity?

    • @jamesbriggs
      @jamesbriggs  2 years ago +2

      Hey Iven, thanks! I think you should absolutely use ANN - definitely if you have lots of vectors. As for cosine similarity vs angular similarity, angular similarity can distinguish better between already very similar vectors, but I'm not sure if it is too important in most use-cases. Most applications from pretty smart people tend to use cosine similarity, so that is (for me) evidence that cosine similarity is 'good enough'
      If you're interested in ANN and more of this, I have a big playlist on it here ruclips.net/p/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc
      Hope it helps :)

    • @wenzeloong
      @wenzeloong 2 years ago

      @@jamesbriggs Thank you for your opinion and the playlist is quite amazing..! It helps me a lot.. thank you !
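For the curious, the two measures in this exchange differ only by an arccos. A quick NumPy sketch, using the common 1 − θ/π definition of angular similarity:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angular_sim(a, b):
    # Angular similarity: 1 - angle/pi; spreads out the scores of
    # vectors that are already very similar.
    cos = np.clip(cosine_sim(a, b), -1.0, 1.0)
    return 1.0 - float(np.arccos(cos)) / np.pi

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.1])  # nearly parallel to a
c = np.array([0.0, 1.0])  # orthogonal to a
```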

  • @Krobongo
    @Krobongo 2 years ago

    I'm a bit confused. Is SBERT just the embedding layer, which is fed to an ML model, or is it also the model itself used to do e.g. text classification?

  • @23232323rdurian
    @23232323rdurian 11 months ago

    Point taken and understood about e.g. similar meaning expressed using different words... however in practice a straightforward word-to-word approach with frequency stats works pretty well, because words have usage frequencies: anybody meaning to say [a common word] is gonna say [that word], not [a rare synonym]... like 100-to-1 odds... and [the rare synonym]? Well, that's extremely rare... [the common word] is gonna be 100s of times more frequent in this context...
    ==then further ACROSS languages (e.g. English, Japanese) the word frequencies don't necessarily translate... sometimes frequent English words are infrequent in Japanese and vice versa...

  • @peterthomas7523
    @peterthomas7523 2 years ago

    Excellent video as always :) kinda makes me wonder why I bother spending my grant money on training courses when your whole channel is simply better. I had a question about using S-BERT for similarities between documents, rather than sentences within a document. Could I just average the embedding for the sentences within each document and calculate cosine similarity between these? Or is there a better way? Thanks!

    • @jamesbriggs
      @jamesbriggs  2 years ago

      you can do this but it's not that effective; another option would be to compare all paragraphs and take an average score, or create some sort of threshold like "if 5 paragraphs have similarity > 0.8" etc. It's hard to do. I have a free 'course' on semantic search here, hopefully you can save some more of your grant money:
      www.pinecone.io/learn/nlp/

    • @peterthomas7523
      @peterthomas7523 2 years ago

      @@jamesbriggs Thanks a lot :) I've been working through your pinecone course and am really liking it so far!
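The averaging idea from this reply can be sketched as follows. The `embed` function is a stand-in for a real SBERT model's encode call (e.g. from sentence-transformers); here it just returns fixed random vectors so the snippet stays self-contained:

```python
import numpy as np

rng = np.random.default_rng(42)

def embed(sentences):
    # Placeholder for model.encode(sentences): one 384-d vector per sentence.
    return rng.standard_normal((len(sentences), 384))

def doc_vector(sentences):
    # Naive document embedding: mean of its sentence embeddings.
    return embed(sentences).mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector(["first sentence", "second sentence"])
doc_b = doc_vector(["another doc", "with two sentences"])
similarity = cosine(doc_a, doc_b)
```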

  • @smcgpra3874
    @smcgpra3874 a year ago

    Can we classify tabular data where each row is one dataset?

  • @venkateshkulkarni2227
    @venkateshkulkarni2227 2 years ago

    I think the bert-base-nli-tokens are deprecated now according to the Hugging Face website. Which Sentence Transformer model should we now use for SBERT?

    • @jamesbriggs
      @jamesbriggs  2 years ago +2

      I like mpnet models the most for generic sentence vectors huggingface.co/flax-sentence-embeddings/all_datasets_v3_mpnet-base
      I cover a load of models, training methods, etc in this playlist:
      ruclips.net/p/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO
      Hope it helps :)

  • @23232323rdurian
    @23232323rdurian 11 months ago

    28:50....both B and G SHARE this phrase and its several words, so THAT's why they share a high similarity score...

  • @user-bs9bu1ko5f
    @user-bs9bu1ko5f 2 years ago

    Would love some scripts or subtitles for your videos, thank you!

  • @AlexGuemez
    @AlexGuemez a year ago

    Is there a way to "reverse" TF-IDF to see if Google uses it in its algorithm?

    • @jamesbriggs
      @jamesbriggs  a year ago

      I've not heard of a way but it could be possible. Google's algorithm uses a lot of different things though (BERT included), so I'm not sure it would be possible to identify specific parts of it like TF-IDF.

  • @wilfredomartel7781
    @wilfredomartel7781 2 years ago

    How do you train SBERT for a specific domain?

    • @jamesbriggs
      @jamesbriggs  2 years ago

      hey I have a few articles+videos on this, what does your training data look like? If you have sentence pairs + scores you can use MSE loss which I cover at the end of:
      www.pinecone.io/learn/gpl/
      If you don't have training data and just text data you can use unsupervised methods like GPL (above), GenQ, or TSDAE (all found here):
      www.pinecone.io/learn/nlp/
      If you have sentence pairs *without* labels you can use softmax or preferably MNR loss:
      www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/

  • @edgar23vargas53
    @edgar23vargas53 3 years ago

    Hey is there any way we can get in contact with you?

    • @jamesbriggs
      @jamesbriggs  3 years ago

      Yes on the 'About' page of my YT channel you'll be able to find my email

    • @edgar23vargas53
      @edgar23vargas53 3 years ago

      @@jamesbriggs DMed you on Instagram

    • @edgar23vargas53
      @edgar23vargas53 3 years ago

      @@jamesbriggs DMed you

    • @jamesbriggs
      @jamesbriggs  3 years ago

      @@edgar23vargas53 got it

    • @edgar23vargas53
      @edgar23vargas53 3 years ago

      @@jamesbriggs shot you an email

  • @ErginSoysal
    @ErginSoysal 2 years ago

    You don’t know what b and k in bm25, do you? 😏

  • @gorgolyt
    @gorgolyt 2 years ago +4

    You need to tighten up your math notation. Writing f(t, D) for the "total number of terms in the document" is really confusing and makes no sense. What is t in this function? You either need to sum over all t in D, which you haven't written, or you should just get rid of the t and use some function g(D) to denote the total number of terms in the document. When you get onto BM25 it's even worse, I'm not sure your explanation of your notation is even correct. It should be f(q, D) on the denominator, the same q that is in the numerator, not f(t, D), whatever that means.
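For readers following this critique: the conventional notation (e.g. as on Wikipedia) does put f(q, D), the raw count of query term q in document D, in both the numerator and denominator of each BM25 term, with |D| denoting the document's length in words:

```latex
\mathrm{score}(D, Q) \;=\; \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\!\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
```

Here avgdl is the average document length in the collection, and k1 and b are free parameters.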