To try everything Brilliant has to offer, free, for a full 30 days, visit brilliant.org/Deepia.
You’ll also get 20% off an annual premium subscription.
I SWEAR I was trying to understand BYOL just a few minutes and was struggling, then this video came up, THANK YOU! CAN'T WAIT! Also, please do SwaV as well!
Day by day, we inch closer and closer to creating The Great Compressor.
Like the one in the Silicon Valley TV series.
I'd love to be compressed between my robot anime waifu's thighs 🤤
Insane technique! Awesome video, thanks for explaining this with tons of examples.
Really nice video. Love your presentation style, so clean and well explained!
Amazing presentation again 🎉 thank you for your efforts and time
Amazing content! Looking forward to the next videos 😄
How exactly, in the original contrastive loss, does y = 0 in the positive case and y = 1 in the negative case? And what does y represent here? 6:27
Is it just a positive/negative pair label that forces the contrastive loss to focus on either the positive or the negative term in the loss function?
Yes exactly !
@@deror007 But how are the labels collected? Didn't the author say no labels are required?
How does the model/programmer know if two pictures are a positive or negative pair without labels?
@@user-ht4rw5wp4x Well, you have several ways of defining the pairs; for instance, you create positive pairs with data augmentation, as in SimCLR!
I was wondering the same. Even if positive pairs are created by augmentation (11:34 in the video), there is no way to make sure a cat isn't picked for the negative pair. How can we know it's a cat (at 12:15) without the labels?
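To make this concrete, here is a toy sketch (my own code, not from the video) of how pairs can be built with no class labels at all: a positive pair is two random augmentations of the same image, and a negative pair simply takes views of two different images, whatever their unknown classes happen to be.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentation: random 24x24 crop plus random horizontal flip."""
    h, w = img.shape[-2:]
    top = rng.integers(0, h - 24 + 1)
    left = rng.integers(0, w - 24 + 1)
    view = img[..., top:top + 24, left:left + 24]
    return view[..., ::-1] if rng.random() < 0.5 else view

def make_pair(img_a, img_b=None):
    """Return (view1, view2, y) with y = 0 for positives, y = 1 for negatives."""
    if img_b is None:
        # Positive pair: two views of the SAME source image
        return augment(img_a), augment(img_a), 0
    # Negative pair: views of DIFFERENT images; they may still share a class
    return augment(img_a), augment(img_b), 1
```

So the "label" y only records whether the two views came from the same source image, which is exactly why a negative can accidentally be the same animal.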
I like that you're focusing on computer vision
Outstanding technique :D thank you, it was not wrong to subscribe to the channel :D
Great video as always
At 12:04 you say that SimCLR selects multiple negative pairs, and then you show a picture of a cat and a dog. I am confused: is the second dog picture also considered a negative pair even though it's the same animal? If yes, does this mean the model trains to lower the distance ONLY with the original image, even though the others could be dogs?
Exactly! The negatives can be any other image in the batch, including very similar objects.
@@Deepia-ls2fo That is very interesting, thank you for your answer, I have another question if you do not mind
At the end, when comparing classification accuracy, you compare supervised, SimCLR+finetune, and SimCLR. The last one has me confused: how can the model work for classification without any fine-tuning? Or do they not count a trained dense layer that learns to use SimCLR's latent space for classification, with SimCLR+finetune meaning fine-tuning the latent space instead? In short: does fine-tune mean fine-tuning a dense layer or the latent space?
Your videos are high quality and I really love them; sometimes I just wish they were longer and went slightly deeper into the implementation details. Thank you!
Edit: Regarding my first question, since a negative pair can be from the same class (if we imagine the ultimate goal is classification), would a low number of classes (let's say only 2) lower the quality of the latent space due to a high amount of class "collisions"? And conversely, if there are hundreds of classes, would the same class rarely be selected as a negative pair, improving the latent representation?
@@itz_lucky6472 I strongly advise you to read the SimCLR paper, as it is a very easy read and they detail everything.
About the classification task: for SimCLR they use what we call "linear eval", meaning they plug a fully connected head onto the model and train only this part. The difference between "SimCLR" and "SimCLR fine-tune" is that for "SimCLR fine-tune" the weights of the backbone are also modified in a supervised fashion, using a small portion of the data.
For your second question, I have not read much about this, and I'm new to self-supervised learning myself, so I can't answer for sure. I guess you could easily run the experiment with 2 MNIST classes, though. Intuitively, I think taking many semantically similar objects and treating them as negatives is bad for the representation space.
Great video! You mention that the contrastive loss pushes/pulls points; how does the loss function "push away" a point, exactly?
Thanks! It pushes negative pairs apart until their distance reaches the margin, by minimizing the difference between the margin and the distance between the points.
This is the quantity in red at 06:40 :)
The InfoNCE loss at 11:14 looks odd, as D_p is the distance notation from 9:00, but you say it's related to probabilities. It would break the flow to introduce new notation, though. But as it stands, it was a little confusing to see that the loss would be minimized by maximizing D_p. I checked the paper, and it seems the term approximates the "mutual information", which we want to be larger for positive samples. At least that's my rough understanding...
Thanks for the video its a fantastic explanation!
Indeed, I should have taken the time to introduce it properly and use the correct notation.
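A rough InfoNCE sketch may help resolve the notation confusion (the symbols here are mine, not the video's): the quantity being maximized for the positive pair is a similarity, not a distance, and the loss is a softmax cross-entropy over one positive and many negatives.

```python
import numpy as np

def info_nce(s_p, s_n, temperature=0.1):
    """InfoNCE loss given positive-pair similarity s_p and an array of
    negative similarities s_n. Minimizing it raises s_p relative to s_n."""
    logits = np.concatenate([[s_p], s_n]) / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    # Negative log-probability of picking the positive among all candidates
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Reading the positive term as a distance D_p makes the loss look backwards; reading it as a similarity (or a mutual-information estimate, as the paper frames it) makes "minimize the loss by maximizing it" consistent.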
Great content; keep it up
awesome content!
When's the next video? Love these visualizations!
@@dhurbatripathi6924 Thanks! By the end of November!
Awesome Video :D
Nice explanation! It still isn't clear to me how to choose the metric to determine how similar or dissimilar two samples are, is it also learned by the network?
You can choose any differentiable metric; that's one of the strengths of this framework :)
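As a small illustration of "any differentiable metric" (my own sketch): Euclidean distance and cosine distance both plug into the same contrastive framework, and the choice is a design decision rather than something the network learns.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two embeddings."""
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    """1 minus cosine similarity; ignores vector magnitude."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

SimCLR, for instance, uses cosine similarity on normalized embeddings, while the classic margin loss is usually written with Euclidean distance.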
Augmentations ARE the labels, labels of "ignore".
Is the voice in the video the output of a TTS model?
Yes ! It's my voice though :)
Cool
Thanks💀
Hmmmmmm YES