Subbed after this video. Will keep on checking your content regularly, James. Keep it up!
That's awesome, thanks Andrea!
Lots of help man. Left a like and subscribed great job!
Exactly what I was looking for in a clear and quick video. You gained a subscriber
Awesome to have you here!
@jamesbriggs I have just implemented your code. Works like a charm! One quick question: do you suggest cleaning the sentences with tokenizers/lemmatizers and other NLP techniques before passing them to model.encode(), or leaving them as they are?
@jacopoattolini2085 for transformers you'd typically want to keep full words, so I wouldn't use lemmatization/stemming or stopword removal. Tokenization in some cases, yes; for URLs, for example, it can be a good idea.
Also, depending on your data source (social media for sure), it can be useful to add Unicode normalization, where you'd want to use NFKC in most cases.
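As a minimal sketch of the NFKC normalization mentioned above, Python's standard `unicodedata` module can be applied to each string before encoding. The helper name `normalize_text` and the sample strings are made up for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, folding compatibility characters
    (ligatures, full-width letters, superscripts) into standard forms."""
    return unicodedata.normalize("NFKC", text)


# Illustrative inputs: an "fi" ligature, full-width letters, a superscript
samples = ["ﬁne-tuning", "ＢＥＲＴ", "x²"]
normalized = [normalize_text(s) for s in samples]
print(normalized)  # ['fine-tuning', 'BERT', 'x2']
```

The normalized strings can then be passed to model.encode() in place of the raw ones.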
@jamesbriggs thanks! Will try to experiment a bit. I am working with job description data, so maybe it is better to use the full sentences without transformations.
For anyone getting stuck, switch out 'bert-base-nli-mean-tokens' for 'sentence-transformers/all-mpnet-base-v2'. Hugging Face says the model in this video gives poor quality and redirects you to better/newer models. The replacement model has the best average performance as of now. I'm about to try to send this to a database so I have a column of the 'sentences' and a column with the 'scores' and can sort by the score. Any help sending the final array to a df would be much appreciated!
Hey Nathen, yes, the original SBERT models are pretty outdated - the other model you suggested is much better (and in general, MPNet models make great sentence transformers), thanks for sharing.
For your df problem, you should be able to convert the arrays to lists. Assuming you have a sentence list in `sentences` and a score array in `scores`, you can do something like:
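```
import pandas as pd

# `sentences` is a list of strings, `scores` a NumPy array of similarity scores
df = pd.DataFrame({
    "sentences": sentences,
    "scores": scores.tolist()
})
```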
You may also need to flatten the scores array, so you'd change the above to `scores[0].tolist()`
Hope that helps :)
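Putting the two suggestions together, here is a small sketch that flattens the scores and sorts by them. The sentences and score values are made-up stand-ins, and it assumes `scores` has shape (1, n), which is why `scores[0]` is used.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the real sentences and similarity scores.
# Assumption: `scores` has shape (1, n), one row of scores for the query.
sentences = ["sentence a", "sentence b", "sentence c"]
scores = np.array([[0.31, 0.87, 0.55]])

df = pd.DataFrame({
    "sentences": sentences,
    "scores": scores[0].tolist(),  # flatten (1, n) to a plain list of n floats
})

# Highest-scoring sentences first
df = df.sort_values("scores", ascending=False).reset_index(drop=True)
print(df)
```

From here the frame can be written out with the usual pandas writers, depending on where it needs to go.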
Hi James, I wonder if you could answer a very simple question. If I use "model.encode(sentences)", is there a way to make it faster? Do you know whether encode applies max_length=128 by default, or 512, which is the traditional value used with BERT? If it is 512, can it be reduced to the smaller value?
In your other video, it is very clear how to get the mean with max_length=128 (at 6:30 in ruclips.net/video/jVPd7lEvjtg/видео.html). However, is it possible to adjust this value for model.encode if, by default, a value higher than 128 is applied before averaging?
Thanks a lot in advance.
Sincerely,
F.
Thank you! Super clear. A new subscriber of your channel!
Does this tutorial start in the middle of another tutorial? The third step - model = SentenceTransformer(model_name) - does not work. Are there things we are supposed to download first?
How can we save this model with joblib so we can use it for deployment?
You are the best! Thanks for the tutorial!
Thank you so much for a simple tutorial!
Just wondering: compared to the OpenAI embedding API, which is better? Thank you for your video.
Thanks for your tutorial.
What about using pooler_output as the embedding instead of mean pooling the hidden states of each token?
Is pooler_output better suited to downstream tasks, so that we wouldn't use it as a sentence representation?
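For reference, the mean pooling this question refers to can be sketched in NumPy as below. It does not settle which option works better; it only shows what "mean pooling of the hidden states" computes. The function name `mean_pool` and the toy arrays are assumptions for illustration, not code from the video.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Toy input: 1 sentence, 3 token positions (the last is padding), hidden size 2
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded position is excluded by the mask, so only the first two token vectors contribute to the average.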
Wonderful explanation. I had a couple of questions:
1. How is this model different from the deep learning Siamese model? Or is it the same?
2. Do you have any video explaining the internal or theoretical workings of this model?
Thanks once again
Thank you so much for the awesome explanation!
Do you think this method could also be applied when working with whole paragraphs of text, and not just single sentences? Or is this library not suited for comparing longer texts?
Thank you!
It depends on the length of your text. It's not necessarily restricted to sentences, but the limit (with the model we use here) is 128 tokens, a token being a word or word-piece (sometimes a single word is split into 2+ tokens).
So with 128 tokens, you have a fair bit of flexibility on length :)
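If a text runs past that limit, one option is to split it into chunks and encode each chunk separately. The sketch below is a rough approximation only: it counts whitespace-separated words rather than real tokens, and since a word can become 2+ tokens, the default of 100 words is an arbitrary choice meant to leave headroom under 128. The helper name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, max_words: int = 100) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words.

    Word counts are only a proxy for token counts, so this does not
    guarantee that every chunk fits within the model's token limit.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Illustrative input: a 250-word "paragraph"
paragraph = "word " * 250
chunks = chunk_words(paragraph, max_words=100)
print([len(c.split()) for c in chunks])  # [100, 100, 50]
```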
@jamesbriggs Thank you so much for your answer!
Hi James, good video. I've been trying to get semantic similarity on more abstract concepts, e.g. between "number" and "integer", or "vector" and "list". I attempted this with a custom word2vec vocabulary and pre-trained BERT, but it doesn't produce great results, with other words like "string" appearing closer to "integer" using cosine similarity. Is there a specific approach you would use for fine-tuning for a problem like this?
It might be better to try the BERT token ID embeddings rather than word2vec embeddings - might be more accurate :)
How can we train and deploy this sentence similarity model in SageMaker? My ultimate aim is to deploy this model as a REST API, so that I can use it from a different application. If you have already made any videos, please do share the link.
Nothing specific to SageMaker unfortunately, but I do have an entire (free) course on sentence similarity models here:
www.pinecone.io/learn/nlp/
There are many chapters on different approaches to training, and videos are embedded within each chapter - I hope that helps :)
subscribed, and thanks for this tutorial :)
Where can I download your ipynb file?
Why are we going with BERT transformers when we can do the same thing using TfidfVectorizer? Can you please make a video on the pros and cons of these two approaches? If already posted, please share the link as a reply. Thank you.
Here's a video comparing TF-IDF, BM25 and BERT - ruclips.net/video/ziiF1eFM3_4/видео.html
And another for traditional similarity metrics too (Jaccard, w-shingling, and Levenshtein) - ruclips.net/video/AY62z7HrghY/видео.html
I'm currently working on a big series covering similarity search in depth, so there will be plenty more content on this topic over the next couple of months :)
Hey James, nice tutorial! Do you know the advantage of using sentence BERT over the average embeddings of all words in a sentence using word2vec?
They're generally much more expressive thanks to the bidirectional, multi-head attention mechanisms inside the BERT encoder layers - so generally we would expect sentence BERT to outperform word2vec.
This was very informative. Thanks a lot.
Welcome! Thanks for watching!
Instead of two arrays, how can we do this with two dataframes (df1 and df2), taking one cell from a column of df1 and matching it against every cell of a column of df2, and so on?
It would probably be best to loop through your rows in df1, pull out the value in your df1 cell, then compare that against the full column of df2. You will want to extract both as arrays though - I don't believe sentence-transformers deals directly with df objects.
But, at the same time, if you're dealing with a lot of data I'd recommend something more efficient than this implementation of cosine similarity. Faiss would be a good option, for example, and much faster. I'm working on a Faiss series at the moment; you can find the first video (which is all you need to get started) on my channel page :)
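A sketch of the df1-against-df2 comparison, under a few assumptions: both frames have a `text` column (a made-up name), and the toy vectors stand in for the embeddings you would get by encoding each column as a list. Rather than looping row by row, this computes every pair in a single matrix product, which gives the same scores.

```python
import numpy as np
import pandas as pd


def cosine_sim_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


df1 = pd.DataFrame({"text": ["query one", "query two"]})
df2 = pd.DataFrame({"text": ["doc a", "doc b", "doc c"]})

# Toy vectors standing in for model.encode(df1["text"].tolist()) and
# model.encode(df2["text"].tolist()); real embeddings are much wider.
emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])
emb2 = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

sims = cosine_sim_matrix(emb1, emb2)  # shape (len(df1), len(df2))
best = sims.argmax(axis=1)            # best df2 row for each df1 row
df1["best_match"] = df2["text"].to_numpy()[best]
df1["score"] = sims.max(axis=1)
print(df1)
```

For large frames the full similarity matrix gets expensive, which is where an index such as Faiss comes in.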
@jamesbriggs It didn't work; the precise problem is described at stackoverflow.com/questions/68624306/cosine-similarity-between-columns-of-two-different-dataframe/68626354?noredirect=1#comment121282908_68626354 Any opinion?
This is great!!
Hey James, thanks for the tutorial. How do we fine-tune a BERT model for sentence similarity? Thank you once again for the tutorial.
I believe that if you fine-tune using generic language comprehension methods like MLM and NSP, it should enhance the similarity vectors that BERT produces too. I haven't done this though, so I can't say that for sure, but it's something I'm pretty interested in trying and I expect I will work on it soon - when I do, there'll be a video :)
Thank you for your response. I am kind of confused right now. In the other video, where you built it from scratch: if I do that with custom data, will there be any changes in my results?
@harryrichard2154 in the MLM/NSP videos I fine-tune BERT, so I take the existing BERT model and train it some more on custom data.
This essentially fine-tunes the weights inside the BERT network to be optimized for the custom dataset (e.g. to better understand the custom style of language).
So yes, you would get different results, as for sentence similarity it is those internal weights that we are extracting :)
Hope that helps
The results I'm getting are hit or miss. I'm inserting a string I want to analyze into the first index of the list and running it as your vid showed. I might have one on English ships and get a top result that has something to do with the sea and a ship, but the string "you're fighting like cats and dogs" gives me an incoherent code (an autogenerated image name, I think) despite there being multiple sentences with fighting, cats and dogs in them. Thoughts? It seems to fail more often than not.
Hey Eugene, it can be a bit hit and miss, but overall the performance should be quite good - for the incoherent code, there's a possibility that by pure chance it is encoded to a point in the vector space close to your cats and dogs sentence.
I assume you're doing all of this with a larger dataset? If so, I would recommend using something like Faiss, which handles all of the distance computations (and is much faster). In terms of accuracy, however, this *should* only help if there is something weird happening with your cosine similarity function. Tutorial here: ruclips.net/video/sKyvsdEv6rk/видео.html
Let me know if it helps!
@jamesbriggs the dataset is almost 7,000 strings. Thanks for the tip, I'll give it a try tonight.
I was using SequenceMatcher this afternoon for another project since it made more sense for that (companies in two datasets spelled slightly differently that need matching). I applied it to my original project and it seems to work much better and faster, although it's not as sophisticated as SentenceTransformer. It works better if I strip out determiners and other useless words. "Fighting like cats and dogs" gives me back the string "raining cats and dogs." "Youth Rebel" gives me "Youth Reading", which is close, not ideal, but still usable. I'll try to find time for your Faiss vid tomorrow night and let you know. Oh, and I tried this on a dataset of almost 200,000 strings and it runs in a couple of minutes.
@eugenesheely5288 yes, I've seen SequenceMatcher used - afaik it's calculating the syntactic similarity, rather than the 'semantic' similarity of a sentence transformer. With your use case of finding misspellings, I think your approach is ideal.
Yes, I definitely wouldn't recommend using this code on large datasets, or even slightly not-small datasets - 100% go with Faiss for that - let me know how it goes!
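To make the syntactic side of that comparison concrete, here is a small sketch using `SequenceMatcher` from Python's standard `difflib` module. The helper name `char_similarity` and the example strings are made up; the point is only that the score reflects shared character runs, not meaning.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on matching character runs, not meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Near-identical spellings score high...
print(char_similarity("Acme Corporation", "Acme Corporaton"))
# ...while a paraphrase with different wording scores lower.
print(char_similarity("fighting like cats and dogs", "they argue constantly"))
```

That behavior suits matching misspelled company names, while paraphrases with little character overlap are where sentence embeddings are the better fit.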
@jamesbriggs syntax vs semantics is a very nice way to explain the differences. I'll keep you updated.
It is a helpful video, but can you send me this implementation?