Great explanation! Regarding the "Multiple negative ranking loss"; you say that a_i and p_j should be far away from each other. Don't we risk to create a lot of clusters of pairs? I mean how do we get that a_i and p_j are closer than a_i and p_k if p_k and p_j are rather similar just like your example at 07:00 We would assume that (a_1, p_1) and (a_1, p_2) is closer than (a_1, p_3) . How do we make sure, that hose 3 pairs just not just end up in each of their own corner?
I am doing a sentiment analysis task. I want to train a model on multiple negatives ranking loss. I am dealing positive and negative pairs. (x, (ai,bi),(ak,bk)). x -> query ai, bi -> positive pair ak, bk - > negative pair(I have multiple negative pairs for every (query, positive pair) combination. How can I use multiple negatives ranking loss in this case?
Hi Nils, awesome explanation. What happens to MNR if you have batches where there are two queries with the same answer? The CE matrix will be broken because there are extra ones. What’s the best approach in this case?
Hi Nils , question: if the embedding model is trained on let's say cosine similarly during inference other similarity fuctions also generate decent results why?
Thank for video. i ve a question. when I are trying to extract contextualized word embedding by Bert, always I get out of memory issue on collab. I have a twitter dataset 50000 rows. I couldn't find a solution for it. changing batch size or any other solution really doesn't work at all.
Thanks for the great talk. One Question: What do you mean when you say "for dot-product, longer documents can result in vectors with higher magnitudes"? If I understand correctly you are using mean pooling. Why would the mean of many embeddings (necessarily or probably) have higher magnitude than the mean of few?
Note that BERT produces contextualized word embeddings. So the output of each words depends on all other words in a paragraph. Just because we take the mean, we cannot conclude that the length of the paragraph has no impact on the magnitude of the embedding. The model can simply learn: Long document => word vectors with high magnitude => high magnitude sentence embedding.
Dear Nils, thank you very much for such interesting and rich information video. By the way, I have 2 questions about Sbert: 1 - The best input for Sbert is 2 sentences or we can give more than that? (As I see the output vector have 512 dimensions) 2 - The best comparison for the output vector is scaled-cosine-similarity and not cosine-similarity?
You can input longer texts, up to the sequence length of the model. Some models work will with text up to 512 word pieces. The scaling is just relevant for training, later it doesn't play a role anymore.
Hello sir, thank you for providing an interesting presentation, I am a final semester student and very, very confused cause I new in NLP, my final project is text summarization, and have not gotten any results, do you have any advice on how sbert is used for text summarization or how can I do fine-tuning with my own dataset to get embedding generated by sbert? And all text using my own language not English :). Really appreciate it if you give me an answer, I'm really stressed right now, thanks in advance!
Great explanation!
Regarding the "Multiple negative ranking loss"; you say that a_i and p_j should be far away from each other.
Don't we risk to create a lot of clusters of pairs? I mean how do we get that a_i and p_j are closer than a_i and p_k if p_k and p_j are rather similar just like your example at 07:00
We would assume that (a_1, p_1) and (a_1, p_2) is closer than (a_1, p_3) .
How do we make sure, that hose 3 pairs just not just end up in each of their own corner?
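For reference, the multiple negatives ranking loss is usually formulated as an in-batch softmax over the similarities (s being the similarity function, e.g. a scaled cosine similarity, and B the batch size):

loss = -(1/B) * sum_i log[ exp(s(a_i, p_i)) / sum_j exp(s(a_i, p_j)) ]

So a_i is only pushed away from each p_j relative to its own p_i within the batch; there is no absolute repulsion term, so pairs are ranked per anchor rather than forced into separate corners of the space.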
Thanks @nils, super helpful content as usual.
I am doing a sentiment analysis task. I want to train a model with multiple negatives ranking loss. I am dealing with positive and negative pairs: (x, (ai,bi), (ak,bk)).
x -> query
ai, bi -> positive pair
ak, bk -> negative pair (I have multiple negative pairs for every (query, positive pair) combination).
How can I use multiple negatives ranking loss in this case?
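In case a code sketch helps: in the sentence-transformers library, MultipleNegativesRankingLoss also accepts (anchor, positive, hard negative) triplets, where the third text is treated as an extra negative on top of the in-batch ones. A minimal sketch, assuming each (a, b) pair can be concatenated into a single text; the checkpoint name is just a placeholder:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder checkpoint; any sentence-transformers model can be used here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# One triplet per (query, positive pair, negative pair) combination.
# If a query has several negative pairs, create one triplet per negative;
# the other examples in the batch additionally act as in-batch negatives.
train_examples = [
    InputExample(texts=["query x", "positive pair a_i b_i", "negative pair a_k b_k"]),
    InputExample(texts=["query x", "positive pair a_i b_i", "another negative pair for the same query"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```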
Hi Nils, awesome explanation. What happens to MNR loss if you have batches where there are two queries with the same answer? The cross-entropy label matrix will be broken because there are extra 1s. What's the best approach in this case?
Hi Nils, question: if the embedding model is trained with, let's say, cosine similarity, why do other similarity functions also generate decent results during inference?
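One way to see why (as a side note, assuming the embeddings are normalized to unit length, which many sentence embedding models do): for unit vectors, a · b = cos(a, b) and |a - b|^2 = 2 - 2 * cos(a, b), so cosine similarity, dot product, and (negative) Euclidean distance all produce the same ranking; they only start to differ when the vector magnitudes carry information.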
Thanks for the video. I have a question: when I try to extract contextualized word embeddings with BERT, I always get an out-of-memory issue on Colab. I have a Twitter dataset with 50,000 rows. I couldn't find a solution for it; changing the batch size or any other fix really doesn't work at all.
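A common cause of that (not necessarily yours) is building the autograd graph during inference or keeping all GPU tensors around between batches. A minimal sketch of batched extraction with the Hugging Face transformers API, using torch.no_grad() and moving results off the GPU right away; the variable tweets and the mean-pooling choice are placeholders for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

tweets = ["example tweet one", "example tweet two"]  # replace with the 50,000 texts
embeddings = []

for start in range(0, len(tweets), 64):
    batch = tweets[start:start + 64]
    enc = tokenizer(batch, padding=True, truncation=True,
                    max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():                           # don't build the autograd graph
        hidden = model(**enc).last_hidden_state     # (batch, tokens, hidden_size)
    # Mean-pool over tokens so only one vector per tweet is kept;
    # storing every full token matrix for 50k tweets would exhaust RAM as well.
    mask = enc["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    embeddings.append(pooled.cpu())                 # free GPU memory for the next batch

embeddings = torch.cat(embeddings)                  # shape: (num_tweets, hidden_size)
```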
Thanks for the great talk. One question: what do you mean when you say "for dot-product, longer documents can result in vectors with higher magnitudes"?
If I understand correctly, you are using mean pooling. Why would the mean of many embeddings (necessarily or probably) have a higher magnitude than the mean of few?
Note that BERT produces contextualized word embeddings, so the output for each word depends on all the other words in the paragraph.
Just because we take the mean, we cannot conclude that the length of the paragraph has no impact on the magnitude of the embedding.
The model can simply learn: Long document => word vectors with high magnitude => high magnitude sentence embedding.
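To spell that out: with mean pooling, v = (h_1 + ... + h_n) / n over the contextualized token vectors h_i, and a dot-product score against a query q is q · v = |q| * |v| * cos(theta). Nothing constrains the norms |h_i|, so the model is free to output larger token vectors for long documents, which inflates |v| and hence the dot product, while the cosine similarity is unaffected.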
Hello sir, the slide URL is broken :)
Dear Nils, thank you very much for such an interesting and information-rich video. By the way, I have 2 questions about SBERT:
1 - Is the best input for SBERT 2 sentences, or can we give more than that? (As I see, the output vector has 512 dimensions.)
2 - Is the best comparison for the output vectors scaled cosine similarity rather than plain cosine similarity?
You can input longer texts, up to the sequence length of the model. Some models work well with text up to 512 word pieces.
The scaling is only relevant for training; later it doesn't play a role anymore.
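A small sketch of what that looks like at inference with sentence-transformers (the checkpoint name is just an example): inputs longer than the model's sequence length are simply truncated, and plain cosine similarity is used without any scale factor.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example checkpoint
print(model.max_seq_length)                       # word pieces beyond this limit are truncated

emb = model.encode([
    "A single sentence.",
    "A longer paragraph made of several sentences, up to the model's sequence length.",
])
print(util.cos_sim(emb[0], emb[1]))               # plain cosine similarity at inference
```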
Hello sir, thank you for providing an interesting presentation. I am a final-semester student and very, very confused because I am new to NLP. My final project is text summarization, and I have not gotten any results yet. Do you have any advice on how SBERT can be used for text summarization, or on how I can fine-tune it on my own dataset to get embeddings generated by SBERT? All of my text is in my own language, not English :).
I'd really appreciate it if you could give me an answer; I'm really stressed right now. Thanks in advance!