Can you please tell me the advantage of smoothing the logits using temperature? Why can't we use the plain softmax to compare the outputs of the teacher and student models for the distillation loss?
Good question. One way to think about it is this: the teacher's probabilities over the different classes are semantically very rich. They capture the data distribution and the relations between the classes, as explained in the video with the animals example.
But the teacher's probabilities come from a softmax that was trained to match the one-hot encoded labels. So even though the class probabilities from the teacher's final layer carry rich information about the data distribution, the value for the correct class will be very high and the other classes will have very low probabilities.
It's like the signal is there, but it is very hard to see unless we amplify it. That's why we use softmax with temperature. It raises the probabilities of the other classes at the cost of bringing down the probability of the correct class a little (because softmax outputs must sum to 1).
This way, the student can see these other probabilities more clearly and learn from them.
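To make the answer above concrete, here is a minimal sketch in plain Python of softmax with temperature. The logit values and the class order (deer, horse, peacock) are made up for illustration and are not taken from the video; `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (max is subtracted for numerical stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (deer, horse, peacock).
teacher_logits = [8.0, 4.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

With a higher temperature the ranking of the classes stays the same, but the probability of the top class drops and the probabilities of the other classes grow, which is the "amplified signal" the student learns from.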
@dingusagar Wow! Thank you for the elaborate answer! I had the same question and this is very elucidating!
Great Professor.
Easy and high level explanation.
Thanks Sagar for a brilliant explanation of the basics of KD
The explanation looks good. However, many words are unclear because of bad sound quality. My suggestion is to use some AI-based audio enhancement tools to make the voice clearer and noise-free, then update the video. You will definitely get more views.
Thanks for the feedback. Yes the audio is really bad. I am planning to re-record this and upload soon.
Well done, good and simple explanation.
Amazing explanation of knowledge distillation
great explanation!!! thank you!
I wish the sound was better.. maybe you can record it again:)
Thanks :). Sorry about the bad sound quality. Will definitely work on it next time.
Thank you so much, you have earned my subscription
Thanks. Will try to do more such videos.
Amazing that this works at all
Thank you so much.
Please, make a lot of videos in machine learning.
I have a probably stupid question: why don't we just train the Student model directly? Does having the pre-trained Teacher model make the Student model more accurate?
There is no such thing as a stupid question :)
Let me try to answer as per my understanding. Feel free to reply with further queries, if any.
In a simplified analogy, knowledge distillation is like a real-life teacher and student. If a student tries to learn a new subject from scratch all by herself, it takes a lot of time. An intelligent teacher who has already done all the hard work of learning everything can skip the useless info and give the student rich, summarised information, so the student is able to learn it in less time.
The trend today is that really large models trained on huge amounts of data tend to have better representational power and thus higher accuracy. That is why the big companies are in a constant race to build the next biggest model trained on even bigger datasets. In our analogy, this is like the teacher reading lots of books to really understand the subject. Since the teacher has a bigger brain (more layers), it can go through the huge datasets, learn the interesting patterns and discard the useless ones. After this intensive learning is done, the teacher acts as a pretrained model. The output of the teacher model is now very rich in information (see 1:53), which is why a student model with a smaller brain (fewer layers) is able to consume this rich information and learn in a shorter time.
excellent explanation!
I like this video, cuz it helps me
Looking forward to more information about KD. Thank you
Glad to hear that. I was exploring KD in the NLP space and thought of creating a few videos around it. Let me know if there is any specific topic in KD, or in general, that you are looking forward to. If it overlaps with the things that I am exploring, I would be happy to make videos around it.
Wow, thanks for this video!
If the final layer has a sigmoid activation function, can the output of the sigmoid function be used as the input to a softmax function with temperature?
Interesting idea. In theory we could define the loss function like you said and the training would still work, but practically I am not sure to what extent it would help. Should try this out. It is essentially like doing softmax twice. Here is an article on why we shouldn't do that in a normal NN setup. jamesmccaffrey.wordpress.com/2018/03/07/why-you-shouldnt-apply-softmax-twice-to-a-neural-network/
Intuitively, applying softmax twice makes the output distribution smoother, and that's what we are trying to achieve in the KD setup. But if the same effect can be achieved by tuning the hyperparameter T of the softmax with temperature applied directly to the logits, then that's a simpler approach from a training perspective. Nevertheless, it's an interesting idea to explore.
@dingusagar Thanks for the reply. If the final layer is a fully connected layer followed by a sigmoid activation function, essentially the logits would be the inputs going into the sigmoid, right? I guess to perform KD, I would take these inputs and pass them to a softmax function with temperature.
@Speedarion Yes, you are right. Logits are what comes out of the final layer before any activation is applied.
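As a rough illustration of the exchange above, the sketch below compares the two options in plain Python: softmax with temperature applied to the pre-activation logits, versus applied to the sigmoid outputs. The logit values, the temperature and the helper names are assumptions made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_with_temperature(values, temperature=1.0):
    """Softmax over values / temperature (max is subtracted for numerical stability)."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the final fully connected layer, before any activation.
logits = [6.0, 2.0, -3.0]
sigmoid_outputs = [sigmoid(z) for z in logits]

T = 2.0  # assumed temperature
from_logits = softmax_with_temperature(logits, T)
from_sigmoid = softmax_with_temperature(sigmoid_outputs, T)

print("softmax(logits / T):          ", [round(p, 3) for p in from_logits])
print("softmax(sigmoid(logits) / T): ", [round(p, 3) for p in from_sigmoid])
```

Because the sigmoid already squashes every value into the range (0, 1), the softmax over sigmoid outputs is much flatter than the softmax over the raw logits, even before the temperature is tuned. Working from the logits keeps T as the single knob that controls the smoothing.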
Nice Video, thanks for the help
Thank you.
Would be great if you could please improve the audio.
Can you share this PPT?
docs.google.com/presentation/d/1IkPeSGOcUSO_qyCwtrP9ZBMx-l2aBzj7FDqwPLK9Ekk/edit?usp=drivesdk
@dingusagar Can you share the other file, which talks about DistilBERT, too?
@prasanthnoelpanguluri7167 docs.google.com/presentation/d/1wU1ZVkgA-qU-5kkHqe824IVxsyLQEqqojOVaNK6Afv8/edit?usp=sharing
For Loss2, don't you want to do CrossEntropy(p(1), y_true), i.e. use the probabilities from the student without temperature scaling? Also, y_true is a one-hot vector, no? It seems like Loss2 is a CrossEntropy between two one-hot vectors, so I'm unsure if this is right. Am I missing something?
Yes, correct, loss 2 is between two one-hot vectors. Cross entropy is just defined over two distributions and it doesn't really require the distributions to be one-hot or not. What you suggested is also correct, I feel. It's just that this is how the authors defined it initially. Different implementations can modify the loss based on what they find empirically more accurate.
Having said that, intuitively I can think of one reason in favor of this approach: loss 1 already uses the soft predictions, which help the model converge on learning the differences between the rich features of the images from the teacher model. So loss 2 is restricted to just getting the classification correct, which is expressed in one-hot vector format.
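For readers following this exchange, here is a minimal sketch of the variant suggested in the question: Loss1 compares the temperature-softened teacher and student distributions, and Loss2 compares the student's probabilities at T = 1 with the one-hot label. It is not necessarily the exact definition used in the video. The weighting `alpha`, the example logits and all function names are assumptions. The `T ** 2` factor on the soft loss follows the common recommendation from the original distillation paper to keep the two terms on a comparable gradient scale.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (max is subtracted for numerical stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy H(target, predicted) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_logits, y_true,
                      temperature=4.0, alpha=0.5):
    # Loss1: soft teacher targets vs. soft student predictions (both at temperature T).
    soft_teacher = softmax_with_temperature(teacher_logits, temperature)
    soft_student = softmax_with_temperature(student_logits, temperature)
    loss1 = cross_entropy(soft_teacher, soft_student)

    # Loss2: one-hot label vs. student probabilities at T = 1 (no softening).
    hard_student = softmax_with_temperature(student_logits, 1.0)
    loss2 = cross_entropy(y_true, hard_student)

    # alpha is an assumed weighting between the two terms.
    total = alpha * (temperature ** 2) * loss1 + (1.0 - alpha) * loss2
    return total, loss1, loss2

# Hypothetical logits for the classes (deer, horse, peacock); the true class is deer.
total, loss1, loss2 = distillation_loss(
    student_logits=[2.0, 1.0, 0.1],
    teacher_logits=[8.0, 4.0, -2.0],
    y_true=[1.0, 0.0, 0.0],
)
print("loss1:", loss1, "loss2:", loss2, "total:", total)
```

Since y_true is one-hot, Loss2 reduces to the negative log of the probability the student assigns to the correct class, so it only pushes the student toward getting the classification right, as described in the reply.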
Thanks!!
Great video! Love how you simplified it that even a novice like me understood it 😊😊😊
If possible, please use a better mic. The sound quality on this video was a little low and foggy.
Thanks. Glad to hear that.😊 Yes, I will definitely work on the sound quality.
Great video but the sound could be improved
If the model has just 1 and 0 in the actual labels, then you must have mistakenly said that the model predicts with 0.39% that it is a horse. Instead it should be that with 0.39% the model thinks it's the deer.
But I must say your explanation is really good.
@terrortalkhorror Thanks for the feedback. I am not sure if I understood exactly what you pointed out. The predictions are made on the input image. The 3 images on the right are just for visualizing the classes. From the perspective of predicting the input image, the model thinks it is a deer, horse and peacock with probabilities 0.6, 0.39 and 0.01 respectively, as mentioned in the slide. The audio quality is poor and that could have created some confusion.
@dingusagar Yes, you are right. I just rewatched it and now it makes sense.
I just sent a connection request on LinkedIn
Please use a better mic or don't use a mic at all.