Sentence Similarity With Transformers and PyTorch (Python)

  • Published: 10 Jan 2025

Comments • 66

  • @ax5344
    @ax5344 2 years ago

    Gosh, it is so useful. I saw those unsqueeze, last_hidden_state, clamp, etc. commands before, but did not know how to link them with transformer embeddings. This hands-on, detailed explanation helps so much! I did not even know what my question was; now I know both the question and the answer. Thank you!

  • @ravivarma5703
    @ravivarma5703 3 years ago

    This channel is a gem.
    Thanks a lot, keep doing it James.

  • @thelastone1643
    @thelastone1643 3 years ago

    James, your lessons that make advanced NLP topics easy are amazing ...

    • @jamesbriggs
      @jamesbriggs  3 years ago

      That's awesome to hear, thank you!

  • @AlgoTradingX
    @AlgoTradingX 3 years ago +2

    Thank you James, good work as usual. A small comment to help you with the YouTube algorithm, as you deserve.

  • @wazzawazza
    @wazzawazza 3 years ago

    These are all gold. Thank you for making this!

  • @sanskaripatrick7191
    @sanskaripatrick7191 1 year ago

    Works like a charm. Thanks a lot, man.

  • @littlebylittle2237
    @littlebylittle2237 2 years ago

    Thank you so much for your awesome video!!! Seriously helped a lot.

  • @andresmauriciogomezr3
    @andresmauriciogomezr3 1 year ago

    Hello friend, thanks a lot for the explanation. It was very useful to me!

  • @chiweilin6021
    @chiweilin6021 2 years ago

    Thanks for the tutorial!!
    Here is a question: how about using the "pooler_output" tensor from the model instead of mean pooling on last_hidden_state?
    Is it because pooler_output is more suitable for downstream tasks rather than sentence representation?
    I appreciate your answer 🙏
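
    (Editor's note: a minimal sketch of where the two tensors come from in the transformers API; the checkpoint name and sentence are placeholders, not taken from the video.)

      import torch
      from transformers import AutoTokenizer, AutoModel

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
      model = AutoModel.from_pretrained("bert-base-uncased")

      tokens = tokenizer("An example sentence.", return_tensors="pt")
      with torch.no_grad():
          outputs = model(**tokens)

      # pooler_output: the [CLS] embedding passed through a dense layer + tanh,
      # trained for next-sentence prediction rather than semantic similarity
      cls_pooled = outputs.pooler_output                    # shape (1, hidden)

      # mean pooling: average the per-token embeddings in last_hidden_state
      mean_pooled = outputs.last_hidden_state.mean(dim=1)   # shape (1, hidden)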

  • @ax5344
    @ax5344 2 years ago

    Is the mean-pooled embedding computed this way the same as what "bert-as-service" produces?

  • @SuperZhang
    @SuperZhang 2 years ago

    Hi, thanks for the great video! Just one question: when I was running the part "outputs = model(**tokens)", I got an error message saying "ValueError: You need to specify either `text` or `text_target`." Any idea why it happened?

  • @thelastone1643
    @thelastone1643 3 years ago

    I wish you would do a lesson on fine-tuning the sentence-transformer model for sentence similarity on a custom dataset ...

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      That would be pretty interesting, I'll likely look into it sometime soon :)

    • @thelastone1643
      @thelastone1643 3 years ago

      @@jamesbriggs Thanks James, please use RoBERTa as suggested above to deal with out-of-vocabulary tokens ...

  • @ravivarma5703
    @ravivarma5703 3 years ago

    Could you give a deep dive into fine-tuning/retraining a transformer vs. stitching adapters for a downstream task?

  • @ahmedal-saedi4672
    @ahmedal-saedi4672 2 years ago

    Hi James, thank you for the amazing lecture. Could we apply this method to a list of compound terms such as "machine learning", "Deep convolutional neural network", 'Random forests', 'Semisupervised learning', "Internet of things", 'Supervised learning', "deep learning", "Fuzzy logic", "real-time", "Big data", etc., and find semantic similarities between these terms?

  • @cyberzilla7261
    @cyberzilla7261 4 months ago

    Hey, I had a question: how could I do this in multiple languages?

  • @bijaynayak2346
    @bijaynayak2346 2 years ago

    Hi James, in your GitHub code you created the final result pandas dataset; could you share the logic behind it and the new similarity?

  • @freedmoresidume
    @freedmoresidume 2 years ago

    Thank you very much, this is really cool

  • @arinzeakutekwe5126
    @arinzeakutekwe5126 3 years ago

    Thanks for sharing. Can we use this approach to compare two addresses using sentence embeddings? For example, given two columns of addresses, can we find how similar (or not) one address is to the others?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      You could, maybe, but I'm not sure the results would be great. For two addresses I'd use Levenshtein distance. I have a playlist covering these things in depth: ruclips.net/p/PLIUOU7oqGTLhlWpTz4NnuT3FekouIVlqc
      The first video covers Levenshtein, then if that's not enough we have the other videos - there's an article included with each video in the description - I'd definitely recommend those to help! Let me know how it goes or if you have questions :)
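
      (Editor's note: a minimal, self-contained sketch of the Levenshtein distance mentioned above; the two addresses are made-up examples.)

        def levenshtein(a: str, b: str) -> int:
            """Minimum number of single-character edits turning a into b."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, start=1):
                curr = [i]
                for j, cb in enumerate(b, start=1):
                    cost = 0 if ca == cb else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution
                prev = curr
            return prev[-1]

        # two made-up addresses that differ only in abbreviations
        print(levenshtein("221B Baker Street, London", "221 Baker St, London"))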

  • @diegoestebanalvarezmonroy4756
    @diegoestebanalvarezmonroy4756 3 years ago

    Hi James, I'm following this tutorial but using sentence-transformers/distiluse-base-multilingual-cased-v1 with 45000 sentences, the longest being 118, but I'm running out of memory: it says it needs to allocate almost 17 GB. This happens when running outputs = model(**tokens). Any idea why this is happening? I'm using a SageMaker instance with 32 GB RAM... @James Briggs

    • @jamesbriggs
      @jamesbriggs  3 years ago

      Hey Diego, this isn't the best approach when dealing with larger amounts of data; I'd recommend using a vector similarity search library like FAISS. I'm working on a tutorial for exactly this - it should be out in ~2 weeks - so if you want to do everything before then, I'd recommend you look into FAISS and see if you can use it to store your vectors and perform the similarity search. Some good documentation here:
      github.com/facebookresearch/faiss/wiki
      Good luck!
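
      (Editor's note: a minimal sketch of the FAISS approach referenced above, assuming sentence embeddings have already been computed; the array here is random placeholder data.)

        import faiss
        import numpy as np

        # placeholder for real sentence embeddings, shape (n_sentences, hidden_dim)
        embeddings = np.random.rand(45_000, 768).astype("float32")

        # L2-normalise so that inner product equals cosine similarity
        faiss.normalize_L2(embeddings)

        index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
        index.add(embeddings)

        query = embeddings[:1]                  # query with the first sentence
        scores, ids = index.search(query, 5)    # top-5 most similar sentences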

  • @DanielWeikert
    @DanielWeikert 3 years ago

    Thanks. Can you do a deep dive into explaining Transformers?

    • @jamesbriggs
      @jamesbriggs  3 years ago

      I could give it a go for sure - the transformer models, or the huggingface transformers library? And are there any parts in particular that you think would be useful?

    • @williamwang2676
      @williamwang2676 2 years ago

      @@jamesbriggs the huggingface transformers library would be useful

    • @jamesbriggs
      @jamesbriggs  2 years ago

      @@williamwang2676 I have a lot of videos on these in the "NLP" and "NLP for Semantic Search" playlists on my channel, is there anything in particular you're interested in that I'm missing?

  • @jabowery
    @jabowery 3 years ago

    How can the cosine similarity vector be used to point out which words in the original sentences contribute the most to dissimilarity?

    • @jamesbriggs
      @jamesbriggs  3 years ago

      I'm not sure how we could do this - KeyBERT is supposedly good at identifying similar words within a sentence, so maybe that could be the right direction. I've never used it though, so I'm not 100% certain.

  • @thelastone1643
    @thelastone1643 3 years ago

    Which language models [BERT, ELMo, Flair, GPT-2, etc.] are suitable for finding similar sentences or tweets?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      BERT works well; there is also another model called the Universal Sentence Encoder (USE) which is popular, but unfortunately it's only available via TensorFlow as far as I'm aware.
      Check out the sentence-transformers library too, it's very good!
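
      (Editor's note: a minimal sketch of the sentence-transformers route mentioned above; the checkpoint name and sentences are placeholders.)

        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

        sentences = ["Two tweets describing the same event.",
                     "A pair of posts covering one story."]
        embeddings = model.encode(sentences, convert_to_tensor=True)

        # cosine similarity between the two sentence embeddings
        print(util.cos_sim(embeddings[0], embeddings[1]))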

  • @kenzalahbabi-hl8zu
    @kenzalahbabi-hl8zu 2 years ago

    Hi James, such a great video! I am new to the NLP world, but I am working on text similarity on a very domain-specific corpus which is unlabeled. In most cases for text similarity with BERT I see a dataset with pairs of sentences, but in my case that would be very hard to obtain. I just have one question: when we use this kind of method we are unable to judge the performance (accuracy, recall, F1-score). How can I evaluate my model in this case? Thanks in advance!

  • @umeshtiwari9249
    @umeshtiwari9249 1 year ago

    Amazing video again.

  • @meghnasingh9941
    @meghnasingh9941 3 years ago

    Haven't seen this level of in-depth explanation elsewhere. Thank you so much.
    Also, I would be glad if you could provide your insight on appending some neural network layers after the sentence embedding for a downstream application. Do you think it would enrich the embeddings further or make any difference?

    • @jamesbriggs
      @jamesbriggs  3 years ago

      It depends on what you're using them for; generally though, I've found that you can't really enrich embeddings any further than what the big transformer models manage. What sort of downstream applications are you thinking of?

  • @nourassem555
    @nourassem555 3 years ago

    Thanks for the tutorial, it's amazing. I have a question though: what is the difference between using tokenizers here and the other approach you showed in another video using the BERT encoder? Is it that there you just used the library, and here you are actually implementing it from scratch? Thanks in advance.

    • @jamesbriggs
      @jamesbriggs  3 years ago

      Yes that's right, both are exactly the same - and if we compare the output logits we'll even see (almost) the exact same values (with some rounding differences)

  • @diegoestebanalvarezmonroy4756
    @diegoestebanalvarezmonroy4756 3 years ago

    Hi James, awesome video! I have a question: I need to do this but in Spanish. What model should I use to get good results? I'm looking to compare, for example, "Beer" and "Heineken Original Lager" and get a high similarity. Is it possible?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      Yes it should be able to do that, I’m looking and I can’t see any Spanish-specific models (looking at www.sbert.net/docs/pretrained_models.html), but there are a few multilingual models you could try, a little more info on those here:
      www.sbert.net/examples/training/multilingual/README.html
      I don’t know how to train sentence transformer models *yet*, but clearly there’s a gap in non-English sentence transformers, so I’m going to look into it - although it will be some time so don’t wait for me! Good luck :)

    • @diegoestebanalvarezmonroy4756
      @diegoestebanalvarezmonroy4756 3 years ago

      @@jamesbriggs Thanks for the resources, I'll check them out. Is it possible to build my own subject-specific sentence transformer in Spanish? And if it is, do you know where I can find info on how to do this? Because I have a fairly big list of 45k sentences, but the problem I'm trying to solve is very specific.

  • @Yankzy
    @Yankzy 1 year ago

    I've seen your OpenAI embeddings video; is this what they are doing, or are the techniques different?

    • @jamesbriggs
      @jamesbriggs  1 year ago +1

      the concept is the same, they're using transformer models with some sort of pooling to create the embedding models - if you want to dive into it I have a course on all of these techniques here ruclips.net/p/PLIUOU7oqGTLgz-BI8bNMVGwQxIMuQddJO

    • @Yankzy
      @Yankzy 1 year ago

      @@jamesbriggs Awesome! Will look at it

  • @edwsa5063
    @edwsa5063 2 years ago

    Hi James, thank you for this amazing tutorial! Just a quick question: attention_mask is used 2 times: 1) model(input_ids, attention_mask) and then 2) mask_embeddings = embeddings * unsqueezed_expanded_mask. I wonder why the second step is necessary. Why does the model output unnecessary information that we need to cancel with 2)?

    • @jamesbriggs
      @jamesbriggs  2 years ago

      The models by default aren't set up to produce sentence embeddings, so they still output activations at every token position, including padding. When we average activations across the token embeddings (to create sentence embeddings), we need to remove the activations from the token embeddings that align with [PAD] tokens.
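
      (Editor's note: a minimal, self-contained sketch of the masked mean pooling being discussed; the checkpoint name and sentences are assumptions, not necessarily those used in the video.)

        import torch
        from transformers import AutoTokenizer, AutoModel

        name = "sentence-transformers/bert-base-nli-mean-tokens"  # assumed checkpoint
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModel.from_pretrained(name)

        sentences = ["The cat sat on the mat.", "A dog slept in the sun."]
        tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

        with torch.no_grad():
            outputs = model(**tokens)            # attention_mask use 1: ignore [PAD] during attention

        embeddings = outputs.last_hidden_state              # (batch, seq_len, hidden)
        mask = tokens["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
        mask = mask.expand(embeddings.size()).float()        # broadcast over hidden dim

        masked = embeddings * mask               # attention_mask use 2: drop [PAD] rows from the average
        mean_pooled = masked.sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)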

  • @charmz973
    @charmz973 3 years ago

    Thank you, Sir, for your contribution; definitely purchasing your course on Udemy.

    • @jamesbriggs
      @jamesbriggs  3 years ago

      that's awesome to hear, looking forward to seeing you there!

  • @VeronicaSantos
    @VeronicaSantos 3 years ago

    Can I compute similarity between a sentence/paragraph and a word/expression (related to a topic) to identify whether the sentence/paragraph is about that topic?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      I would assume - although I can't say this for sure - that a random irrelevant word *should* have a lower similarity score than a relevant word. But you likely won't get similarity scores as strong as those between two semantically similar sentences.

  • @thelastone1643
    @thelastone1643 3 years ago

    Can this model deal with out-of-vocabulary tokens?

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      No, this one can't; we want byte-level encoding for that, which I believe RoBERTa uses, but BERT doesn't.

  • @rabnawazjansher7661
    @rabnawazjansher7661 3 years ago

    It takes too much time to match similarity this way.
    What would be a faster way?

    • @jamesbriggs
      @jamesbriggs  3 years ago

      I'd recommend something like FAISS

  • @kabeerjaffri4015
    @kabeerjaffri4015 3 years ago

    Just discovered your channel. I can't thank the YT algorithm enough.

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      I'm thanking the algorithm too - super happy you're enjoying it!

    • @kabeerjaffri4015
      @kabeerjaffri4015 3 years ago

      @@jamesbriggs oh my god u actually replied

    • @jamesbriggs
      @jamesbriggs  3 years ago +1

      haha I always do :)

  • @GaganKPolska
    @GaganKPolska 3 years ago

    You are awesome!

  • @AhmedBesbes
    @AhmedBesbes 3 years ago

    Great channel! We both share the same interests!
    Subscribed.

    • @jamesbriggs
      @jamesbriggs  3 years ago

      That's awesome to hear, thanks!

  • @charmz973
    @charmz973 3 years ago

    I guess after this I qualify to be called a BERTISIAN EXPERT, wow.