Noise-Contrastive Estimation - CLEARLY EXPLAINED!

  • Published: 9 Sep 2024
  • Noise-Contrastive Estimation is a loss function that enables learning representations by comparing positive and negative sample pairs. It came into the limelight as a workaround to approximate softmax in NLP but is now being used in a lot of experiments in Self-supervised representation learning.
    #selfsupervised
    #representationlearning
    #noisecontrastiveestimation
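To make the idea in the description concrete, here is a minimal sketch (not taken from the video) of an NCE-style loss in PyTorch. It scores each data sample against k samples drawn from a noise distribution p_n and trains a binary classifier to tell data and noise apart; the tensor names and shapes are illustrative assumptions.

```python
# Minimal NCE-style loss sketch (illustrative; not the video's code).
# Assumptions: `score_*` are the model's unnormalized log-scores s_theta(x),
# `log_pn_*` are log p_n(x) under the chosen noise distribution, and k noise
# samples are drawn per data sample.
import math
import torch
import torch.nn.functional as F

def nce_loss(score_pos, log_pn_pos, score_noise, log_pn_noise, k):
    """score_pos:    (batch,)   model scores for data samples
       log_pn_pos:   (batch,)   noise log-probabilities of those samples
       score_noise:  (batch, k) model scores for the k noise samples
       log_pn_noise: (batch, k) noise log-probabilities of the noise samples"""
    # Posterior that a sample came from data: sigma(s_theta(x) - log(k * p_n(x)))
    logit_pos = score_pos - (log_pn_pos + math.log(k))
    logit_noise = score_noise - (log_pn_noise + math.log(k))

    # Data samples should be classified as 1, noise samples as 0.
    loss_pos = -F.logsigmoid(logit_pos)                   # -log sigma(logit)
    loss_noise = -F.logsigmoid(-logit_noise).sum(dim=1)   # -log(1 - sigma(logit))
    return (loss_pos + loss_noise).mean()
```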

Comments • 39

  • @user-jm6gp2qc8x
    @user-jm6gp2qc8x 1 year ago +3

    This was wonderful!

  • @melodychan3648
    @melodychan3648 2 years ago +2

    Thanks for sharing. For someone who wants to read the theoretical paper about contrastive learning, this video is really helpful.

  • @sampsonleo7475
    @sampsonleo7475 3 years ago +8

    What an excellent and detailed explanation of noise contrastive estimation! Thanks for sharing.

  • @akshitmaurya4604
    @akshitmaurya4604 3 years ago +3

    Hi Kapil, thank you for the very intuitive explanation of Noise-Contrastive Estimation. I came here while reading a paper on self-supervised learning (NPID). Your explanation really helped a lot. 🤘

  • @TheRohit901
    @TheRohit901 1 year ago +2

    Best video available on this topic. Thank you so much for such a detailed explanation. I wish there were more channels like this. Looking forward to more videos of yours on other papers. Please keep making content like this.

  • @sergeyzaitsev3319
    @sergeyzaitsev3319 1 year ago +1

    Thank you very much for such a comprehensive tutorial

  • @fellowdatascientist3202
    @fellowdatascientist3202 2 years ago +1

    Thanks a lot for your detailed explanation. It was quite helpful to understand the missing parts in the original paper.

  • @ratthachat
    @ratthachat 8 months ago +1

    Really a beautiful tutorial!! I wish you could update this with the newer InfoNCE and point out some similarities and differences.

  • @user-oq4jj3xq6u
    @user-oq4jj3xq6u 19 days ago

    Amazing!!

  • @oukai6867
    @oukai6867 1 year ago +1

    Very clear explanation. Thank you so much.

    • @oukai6867
      @oukai6867 1 year ago

      But I still want to clarify one thing: here you mention that L_NCE seems to be the log-likelihood rather than the loss function, which should have an overall minus sign in front. Am I right?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      When doing minimization you would put the negative sign.
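In other words, the quantity discussed in the video appears to be the NCE objective J_NCE(θ) to be maximized; to use it as a loss for minimization you simply flip the sign, L(θ) = -J_NCE(θ).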

  • @anupriy
    @anupriy 3 years ago +1

    Thank you, sir, for such a detailed explanation!

  • @user-ww6iq7te5p
    @user-ww6iq7te5p 3 years ago +1

    Thanks for your explanation.

  • @aba1371988
    @aba1371988 2 years ago +1

    Really nice!

  • @NonetMr
    @NonetMr 3 years ago +1

    Thank you for a nice explanation. I have a small question at 12:45. You mentioned the function is essentially a logistic regression or a binary cross-entropy loss function. But those losses are usually written something like: Loss = -\sum_i y_i * log(\hat{y}_i). Without the y_i term in front of the log, it is a bit hard for me to relate the loss function in the clip to the one I wrote here. Could you please tell me how to relate them?

    • @KapilSachdeva
      @KapilSachdeva  3 years ago +4

      Indeed, it can be a bit confusing, and I should have done a better job of clarifying it in the tutorial.
      In a binary classification problem, you use the log-loss function, which looks like this:
      -E[ y log(p(y)) + (1 - y) log(1 - p(y)) ]
      The above equation is written using expected-value notation and translates into an average over your mini-batch. Also see the Wikipedia article - en.wikipedia.org/wiki/Cross_entropy - in particular the section "Cross-entropy loss function and logistic regression".
      Now, in binary classification, one of the two terms is always zero, e.g. if y has the label 1 then the second term vanishes. So the utility of the y in front of the log is to eliminate one of the terms (according to the label 1 or 0).
      That said, the formulation of the NCE loss function looks similar to binary log loss except for the y in front of the log, since you could use it for multiclass classification rather than limiting it to the binary case. In other words, its structure looks like that of log loss.
      Hope this helps.
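As a concrete companion to the reply above, here is a small sketch (my own, not from the video) that writes the NCE objective literally as a binary cross-entropy: data samples get label y = 1, noise samples get label y = 0, and the y in front of the log is exactly what selects the right term for each sample. Names and shapes are assumptions.

```python
# NCE written explicitly as binary cross-entropy (illustrative sketch).
import math
import torch
import torch.nn.functional as F

def nce_as_binary_log_loss(score, log_pn, is_data, k):
    """score:   (n,) unnormalized model log-scores s_theta(x)
       log_pn:  (n,) log p_n(x) under the noise distribution
       is_data: (n,) float labels: 1.0 for data samples, 0.0 for noise samples
       k:       number of noise samples drawn per data sample"""
    # Classifier logit for P(D = 1 | x) = sigmoid(s_theta(x) - log(k * p_n(x)))
    logits = score - (log_pn + math.log(k))
    # Standard binary log loss: -E[ y log(p) + (1 - y) log(1 - p) ]
    return F.binary_cross_entropy_with_logits(logits, is_data)
```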

  • @eustin
    @eustin 1 year ago +1

    Thank you, Kapil. Your explanation of this complex topic was very easy to digest!
    What software do you use to create your videos? I would also like to make videos like this.

  • @vero811
    @vero811 5 months ago +1

    I didn't get a headache!

  • @peiyiwang4707
    @peiyiwang4707 1 year ago

    Hello, thank you for the explanation, it was great. But I have one point of confusion.
    At 6:45, you mention that we introduce NCE because we can't compute the partition function.
    But when we use NCE (15:40), we still have to deal with the partition function, and here we deal with it by setting it to the constant 1.
    So why not just set the partition function to 1 from the beginning, so we don't have to introduce NCE at all?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      The confusion is understandable. Here is an attempt to resolve it:
      1) Our "primary" goal is the estimation of a valid probability distribution function.
      2) We identified that the normalizing constant is the problem. There should be no doubt about that - you need it to obtain a "valid" probability distribution function.
      3) When we identify a problem we generally end up focusing on it, so one approach is to "learn" the constant with the help of a neural network. This does not work; I explained why in the tutorial, but you can also just take my word for it for now.
      Now, sometimes instead of focusing on the "problem creator", another way to get to a solution is to focus on what our "original" goal was: estimating a "valid" probability distribution function.
      The finding of the paper was that a decently parameterized neural network (i.e. one with a good number of parameters - a large network), when "trained" using the NCE loss function, can estimate the valid, normalized probability distribution function. In this setup you don't need to deal with the normalizing constant explicitly; the "learning" part implicitly takes care of it for you.
      "Learning" => training
      Training => need for a loss function
      Which loss function can help you train such a thing? It is NCE in this case.
      In brief/summary:
      Your confusion occurs when I (or rather the paper) say "we can ignore the constant". Another way to think of it: we simply drop that term from the mathematical expression, but by using a proper loss function we still end up learning/estimating the valid probability distribution function - our actual goal!
      Hope this makes sense!
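To illustrate the point about ignoring the normalizing constant, here is a rough sketch (an illustration under my own assumptions, not the paper's or the video's code) of a model whose raw output is treated directly as log p_theta(x), i.e. Z is fixed to 1 and never computed. Trained with an NCE-style loss such as the sketches above, a sufficiently large network can learn outputs that are approximately self-normalized.

```python
# Sketch: an unnormalized model trained as if Z = 1 (illustrative only).
import torch
import torch.nn as nn

class UnnormalizedDensityModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        # The output is interpreted directly as log p_theta(x) with Z := 1;
        # no partition function over the whole input space is ever evaluated.
        return self.net(x).squeeze(-1)
```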

    • @peiyiwang4707
      @peiyiwang4707 1 year ago +1

      @@KapilSachdeva Thanks😀

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏

  • @siyaowu7443
    @siyaowu7443 1 year ago

    Terrific tutorial! But I have a question. The model distribution p_theta is the neural network, where theta is the parameters of the neural network. But what exactly will p_n look like? Does it mean that we should generate the k negative samples from something like a Gaussian distribution?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏 Generating the k negative samples is indeed a tricky thing to do. Don't think in terms of a Gaussian distribution; rather, think about how to obtain relevant/appropriate negative samples. There are methods for what is called hard negative sample mining, etc. I am not well versed in those, but that is the direction you should think in.
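For one concrete (hypothetical) example of a noise distribution p_n: in NLP the noise is often a smoothed unigram distribution over the vocabulary rather than a Gaussian. The counts and smoothing exponent below are illustrative only.

```python
# Drawing k negative samples from a unigram-style noise distribution (sketch).
import torch

word_counts = torch.tensor([100.0, 50.0, 10.0, 5.0, 1.0])  # toy vocabulary counts
p_n = word_counts.pow(0.75)       # word2vec-style smoothing of the unigram counts
p_n = p_n / p_n.sum()             # normalize into a probability distribution

k = 4
negatives = torch.multinomial(p_n, num_samples=k, replacement=True)  # noise sample ids
log_pn_neg = torch.log(p_n[negatives])   # the log p_n(x) terms an NCE loss needs
```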

    • @siyaowu7443
      @siyaowu7443 1 year ago

      @@KapilSachdeva Thanks for your reply!