Noise-Contrastive Estimation - CLEARLY EXPLAINED!

  • Published: 9 Sep 2024
  • Noise-Contrastive Estimation is a loss function that enables learning representations by comparing positive and negative sample pairs. It came into the limelight as a workaround to approximate softmax in NLP but is now being used in a lot of experiments in Self-supervised representation learning.
    #selfsupervised
    #representationlearning
    #noisecontrastiveestimation
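To make the idea in the description concrete, here is a minimal sketch (not taken from the video) of an NCE-style loss in PyTorch. It scores each data sample against k samples drawn from a noise distribution p_n and trains a binary classifier to tell data and noise apart; the tensor names and shapes are illustrative assumptions.

```python
# Minimal NCE-style loss sketch (illustrative; not the video's code).
# Assumptions: `score_*` are the model's unnormalized log-scores s_theta(x),
# `log_pn_*` are log p_n(x) under the chosen noise distribution, and k noise
# samples are drawn per data sample.
import math
import torch
import torch.nn.functional as F

def nce_loss(score_pos, log_pn_pos, score_noise, log_pn_noise, k):
    """score_pos:    (batch,)   model scores for data samples
       log_pn_pos:   (batch,)   noise log-probabilities of those samples
       score_noise:  (batch, k) model scores for the k noise samples
       log_pn_noise: (batch, k) noise log-probabilities of the noise samples"""
    # Posterior that a sample came from data: sigma(s_theta(x) - log(k * p_n(x)))
    logit_pos = score_pos - (log_pn_pos + math.log(k))
    logit_noise = score_noise - (log_pn_noise + math.log(k))

    # Data samples should be classified as 1, noise samples as 0.
    loss_pos = -F.logsigmoid(logit_pos)                   # -log sigma(logit)
    loss_noise = -F.logsigmoid(-logit_noise).sum(dim=1)   # -log(1 - sigma(logit))
    return (loss_pos + loss_noise).mean()
```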

Comments • 39

  • @user-jm6gp2qc8x
    @user-jm6gp2qc8x 1 year ago +3

    This was wonderful!

  • @melodychan3648
    @melodychan3648 2 years ago +2

    Thanks for sharing. For someone who wants to read the theoretical paper about contrastive learning, this video is really helpful.

  • @sampsonleo7475
    @sampsonleo7475 3 years ago +8

    What an excellent and detailed explanation of noise contrastive estimation! Thanks for sharing.

  • @akshitmaurya4604
    @akshitmaurya4604 3 years ago +3

    Hi Kapil, thank you for the very intuitive explanation of Noise-Contrastive Estimation. I came here while reading a paper on self-supervised learning (NPID). Your explanation really helped a lot. 🤘

  • @TheRohit901
    @TheRohit901 1 year ago +2

    Best video available on this topic. Thank you so much for such a detailed explanation. I wish there were more channels like this. Looking forward to more videos of yours on other papers. Please keep making content like this.

  • @sergeyzaitsev3319
    @sergeyzaitsev3319 1 year ago +1

    Thank you very much for such a comprehensive tutorial

  • @fellowdatascientist3202
    @fellowdatascientist3202 2 years ago +1

    Thanks a lot for your detailed explanation. It was quite helpful to understand the missing parts in the original paper.

  • @ratthachat
    @ratthachat 8 months ago +1

    Really a beautiful tutorial!! I wish you could update this with the newer InfoNCE and point out some similarities and differences.

  • @user-oq4jj3xq6u
    @user-oq4jj3xq6u 19 days ago

    Amazing!!

  • @oukai6867
    @oukai6867 1 year ago +1

    Very clear explanation. Thank you so much.

    • @oukai6867
      @oukai6867 1 year ago

      But I still want to clarify one thing: here you mention that L_NCE seems to be the log-likelihood rather than the loss function, which should have an overall minus sign in front. Am I right?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      When doing minimization you would put the negative sign.
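In other words, the quantity discussed in the video appears to be the NCE objective J_NCE(θ) to be maximized; to use it as a loss for minimization you simply flip the sign, L(θ) = -J_NCE(θ).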

  • @anupriy
    @anupriy 3 years ago +1

    Thank you, sir, for such a detailed explanation!

  • @user-ww6iq7te5p
    @user-ww6iq7te5p 3 years ago +1

    Thanks for your explanation.

  • @aba1371988
    @aba1371988 2 years ago +1

    Really nice!

  • @NonetMr
    @NonetMr 3 years ago +1

    Thank you for a nice explanation. I have a small question at 12:45. You mentioned the function is essentially a logistic regression or a binary cross-entropy loss function. But those losses are usually written something like: Loss = -\sum_i y_i * log(\hat{y}_i). Without the y_i term in front of the log, it is a bit hard for me to relate the loss function in the clip to the one I wrote here. Could you please tell me how to relate them?

    • @KapilSachdeva
      @KapilSachdeva  3 years ago +4

      Indeed, it can be a bit confusing, and I should have done a better job of clarifying it in the tutorial.
      In a binary classification problem, you use the log-loss function, which looks like this:
      -E[ y log(p(y)) + (1 - y) log(1 - p(y)) ]
      The above equation is written using expected-value notation and translates into an average over your mini-batch. Also see the Wikipedia article - en.wikipedia.org/wiki/Cross_entropy - in particular the section "Cross-entropy loss function and logistic regression".
      Now, in binary classification, one of the two terms is always zero, e.g. if y has the label 1 then the second term vanishes. So the utility of the y in front of the log is to eliminate one of the terms (according to the label 1 or 0).
      That said, the formulation of the NCE loss function looks similar to binary log loss except for the y in front of the log, since you could use it for multiclass classification rather than limiting it to the binary case. In other words, its structure looks like that of log loss.
      Hope this helps.
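As a concrete companion to the reply above, here is a small sketch (my own, not from the video) that writes the NCE objective literally as a binary cross-entropy: data samples get label y = 1, noise samples get label y = 0, and the y in front of the log is exactly what selects the right term for each sample. Names and shapes are assumptions.

```python
# NCE written explicitly as binary cross-entropy (illustrative sketch).
import math
import torch
import torch.nn.functional as F

def nce_as_binary_log_loss(score, log_pn, is_data, k):
    """score:   (n,) unnormalized model log-scores s_theta(x)
       log_pn:  (n,) log p_n(x) under the noise distribution
       is_data: (n,) float labels: 1.0 for data samples, 0.0 for noise samples
       k:       number of noise samples drawn per data sample"""
    # Classifier logit for P(D = 1 | x) = sigmoid(s_theta(x) - log(k * p_n(x)))
    logits = score - (log_pn + math.log(k))
    # Standard binary log loss: -E[ y log(p) + (1 - y) log(1 - p) ]
    return F.binary_cross_entropy_with_logits(logits, is_data)
```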

  • @eustin
    @eustin 1 year ago +1

    Thank you, Kapil. Your explanation of this complex topic was very easy to digest!
    What software do you use to create your videos? I would also like to make videos like this.

  • @vero811
    @vero811 5 months ago +1

    I didn't get a headache!

  • @peiyiwang4707
    @peiyiwang4707 1 year ago

    Hello, thank you for the explanation, it was great. But I have one point of confusion.
    At 6:45, you mention that we introduce NCE because we can't compute the partition function.
    But when we use NCE (15:40), we still have to deal with the partition function, and here we deal with it by setting it to the constant 1.
    So why not just set the partition function to 1 from the beginning, so we don't have to introduce NCE at all?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      The confusion is understandable. Here is an attempt to resolve it:
      1) Our "primary" goal is the estimation of a valid probability distribution function.
      2) We identified that the normalizing constant is the problem. There should be no doubt about that - you need it to obtain a "valid" probability distribution function.
      3) When we identify a problem we generally end up focusing on it, so one approach is to "learn" the constant with the help of a neural network. This does not work; I explained why in the tutorial, but you can also just take my word for it for now.
      Now, sometimes instead of focusing on the "problem creator", another way to get to a solution is to focus on what our "original" goal was: estimating a "valid" probability distribution function.
      The finding of the paper was that a decently parameterized neural network (i.e. one with a good number of parameters - a large network), when "trained" using the NCE loss function, can estimate the valid, normalized probability distribution function. In this setup you don't need to deal with the normalizing constant explicitly; the "learning" part implicitly takes care of it for you.
      "Learning" => training
      Training => need for a loss function
      Which loss function can help you train such a thing? It is NCE in this case.
      In brief/summary:
      Your confusion occurs when I (or rather the paper) say "we can ignore the constant". Another way to think of it: we simply drop that term from the mathematical expression, but by using a proper loss function we still end up learning/estimating the valid probability distribution function - our actual goal!
      Hope this makes sense!
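To illustrate the point about ignoring the normalizing constant, here is a rough sketch (an illustration under my own assumptions, not the paper's or the video's code) of a model whose raw output is treated directly as log p_theta(x), i.e. Z is fixed to 1 and never computed. Trained with an NCE-style loss such as the sketches above, a sufficiently large network can learn outputs that are approximately self-normalized.

```python
# Sketch: an unnormalized model trained as if Z = 1 (illustrative only).
import torch
import torch.nn as nn

class UnnormalizedDensityModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        # The output is interpreted directly as log p_theta(x) with Z := 1;
        # no partition function over the whole input space is ever evaluated.
        return self.net(x).squeeze(-1)
```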

    • @peiyiwang4707
      @peiyiwang4707 1 year ago +1

      @@KapilSachdeva Thanks😀

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏

  • @siyaowu7443
    @siyaowu7443 1 year ago

    Terrific tutorial! But I have a question. The model distribution p_theta is the neural network, where theta is the parameters of the neural network. But what exactly will p_n look like? Does it mean that we should generate the k negative samples from something like a Gaussian distribution?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏 Generating the k negative samples is indeed a tricky thing to do. Don't think in terms of a Gaussian distribution; rather, think about how to obtain relevant/appropriate negative samples. There are methods for what is called hard negative sample mining, etc. I am not well versed in those, but that is the direction you should think in.
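For one concrete (hypothetical) example of a noise distribution p_n: in NLP the noise is often a smoothed unigram distribution over the vocabulary rather than a Gaussian. The counts and smoothing exponent below are illustrative only.

```python
# Drawing k negative samples from a unigram-style noise distribution (sketch).
import torch

word_counts = torch.tensor([100.0, 50.0, 10.0, 5.0, 1.0])  # toy vocabulary counts
p_n = word_counts.pow(0.75)       # word2vec-style smoothing of the unigram counts
p_n = p_n / p_n.sum()             # normalize into a probability distribution

k = 4
negatives = torch.multinomial(p_n, num_samples=k, replacement=True)  # noise sample ids
log_pn_neg = torch.log(p_n[negatives])   # the log p_n(x) terms an NCE loss needs
```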

    • @siyaowu7443
      @siyaowu7443 1 year ago

      @@KapilSachdeva Thanks for your reply!