Deep Learning 22: (4) Variational AutoEncoder : Derivation of the Loss Function

  • Published: 21 Dec 2024

Comments • 65

  • @manan4436
    @manan4436 5 years ago +26

    19:17 Why do I feel like the order of the first two is misplaced... because if there is no KL term then it simply becomes an autoencoder, which is the middle one in the explanation...

    • @sangameshkodge1664
      @sangameshkodge1664 5 years ago +3

      The case with data fidelity and without KL divergence should be the middle plot (but the presenter calls it without data fidelity and with KL divergence), right?
      The same error is made for the leftmost plot.

    • @ximingfeng436
      @ximingfeng436 5 years ago +3

      So, how can we understand "without data fidelity and with KL divergence"? Can I say: in this case, the network has no penalty on the reconstruction process, which means it is not learning the generation process but only the prior p_theta(z)? Thank you so much for pointing this out, guys. I have been following along with this video and it is super helpful.

    • @TianWangLaoZi666
      @TianWangLaoZi666 4 years ago

      Yes, I agree with you. @Ahlad Kumar, could you please confirm the statement?

    • @MSalem7777
      @MSalem7777 3 years ago

      @Manan You are correct. The case without a data fidelity term just has both distributions aligning to the N(0,I) prior, and nothing is learned from the data.

  • @vineetkumar-wr7sx
    @vineetkumar-wr7sx 5 years ago +7

    The derivation of the loss function is a very difficult topic to understand, but you make it very simple, Sir. Thank you.

  • @inazuma3gou
    @inazuma3gou 2 years ago

    If I am understanding correctly, p(x) cannot be calculated because it is intractable (5:40). Therefore, we derived the VAE objective function, which sets a lower bound for p(x) (10:48). By maximizing the objective (or minimizing the loss function), we maximize the probability of our input data given the model parameters.

  • @mahsakhalili5042
    @mahsakhalili5042 6 months ago

    Thank you so much. Without background, this used to be so confusing to me. After watching multiple videos, your content is much clearer and makes more sense to me. Thanks again.

  • @kalmongar3533
    @kalmongar3533 8 months ago

    Hats off to you and the people behind this work. Absolutely amazing.

  • @oceanwave4502
    @oceanwave4502 3 months ago

    26:19 Here, I'm not sure that we replace Σ(x) with exp(Σ(x)) really because it is more numerically stable. I searched Google and found nothing special.
    The output of the hidden layer in the Encoder is (μ, Σ). In the implementation part (the last video in this series), these are the "Mean_layer" and "Standard_deviation_layer" variables. The output Σ can be negative because it comes straight out of a Fully Connected (Dense) layer; however, the standard deviation of a distribution can never be negative. To fix this, we simply interpret the "Standard_deviation_layer" (Σ) variable as "log of variance". When we need the variance itself, we simply compute exp(Σ).
    I think this is the true motivation for replacing Σ(x) with exp(Σ(x)) at 26:19: it is not a "real" variance, but is (interpreted as) "log of variance".
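[Editor's note] The log-variance interpretation this comment describes can be sketched in a few lines of NumPy (the array values below are my own toy numbers, not from the video):

```python
import numpy as np

# A Dense layer's raw output can be any real number, including negatives,
# so it cannot directly be a variance or standard deviation.
raw_output = np.array([-2.0, 0.0, 1.5])  # hypothetical encoder output

# Interpreting the raw output as log-variance guarantees a strictly
# positive variance after exponentiation, even for negative raw values.
variance = np.exp(raw_output)
std_dev = np.sqrt(variance)  # equivalently exp(0.5 * raw_output)

print(variance)  # every entry is > 0
print(std_dev)
```

This is also why many implementations compute the standard deviation as `exp(0.5 * logvar)` inside the reparameterization step.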

  • @ssshukla26
    @ssshukla26 4 years ago +2

    Around 19:00 the order is 2, 1, 3, not 1, 2, 3.

    • @purvanyatyagi2494
      @purvanyatyagi2494 4 years ago

      It means that without any regularisation term the model will learn peaky distributions with low variance.

  • @junfeiwang8763
    @junfeiwang8763 2 years ago +1

    I get something new every time I watch this video.

  • @SravanKumar-jt1ot
    @SravanKumar-jt1ot 4 years ago +2

    One of the best explanations. Can we get those slides?

  • @ZobeirRaisi
    @ZobeirRaisi 4 years ago

    Thanks. A comprehensive and unique explanation, not easily found in other tutorials.

  • @user-ky7bi9cl2z
    @user-ky7bi9cl2z 4 years ago +2

    Can you provide more information about the reason behind the problem of inference? @4:48
    I know the law of total probability, but I can't understand how p(x) is computed in the equation provided, and what is meant by an integral in "closed form" here?

    • @frankwxu
      @frankwxu 3 years ago

      I guess that p(x) is a fixed number and has nothing to do with the loss function.

    • @omarhmedat2064
      @omarhmedat2064 2 years ago +1

      Closed form means a formulation with known or given parameters as well as constants; for example, I can write the sum from 1 to n as n*(n+1)/2.

  • @sarvani5874
    @sarvani5874 5 years ago +3

    Please answer my queries: at 26:18, can we just replace the terms in the equation with exp() like that? Won't the whole value change? And what do you mean by numerically stable?

    • @xruan6582
      @xruan6582 4 years ago

      Same question. Can anyone explain?

    • @himalayakaushik1466
      @himalayakaushik1466 3 years ago

      Logs can be expressed as exponential functions interchangeably. Numerical stability means the computation is less affected by rounding errors.

  • @rishabhkumar722
    @rishabhkumar722 10 months ago

    @Ahlad Kumar Sir, thank you for making these amazing lectures. This is the best and most detailed lecture series on AutoEncoders. Thank you once again. _/\_

  • @BoneySinghal
    @BoneySinghal 4 years ago +2

    At 20:35, the distribution will be peaky (i.e. the variance will not be 1) when the KL divergence is not present (i.e. the regularizer is not present), but you're discussing the case where the KL divergence is present, so how can it be peaky given there is only the KL divergence in the loss function?
    Also, the second column at 20:35 talks about the case where no regularization is present, but you're discussing the case where only the KL divergence is present. This part of the video is somewhat confusing.

    • @lucamarradi3066
      @lucamarradi3066 2 years ago

      Yep, I think he is confusing the two images. The first one on the left represents the case where you have the KL term but not the reconstruction term. The picture in the middle is the other way round (reconstruction is present but not the KL).

  • @xruan6582
    @xruan6582 4 years ago

    Nice detail. (14:47) I think the P(x|z) multivariate distribution lacks a -1 in the exponential component. Forgive me if I am wrong.

  • @MLDawn
    @MLDawn 5 years ago +3

    Thanks a million for this beautiful video. I have 2 questions and I would really appreciate it if you could help me out. I thought that, as part of Variational Inference, we would consider a known distribution Q(Z|X) and try to minimize Dkl(Q(Z|X) || P(Z|X)), where Q is that known distribution. However, at 11:00 you mention that Q is what we are learning and P is the known distribution. Moreover, I understand that we need to minimize Dkl(Q(Z|X) || P(Z|X)); however, in the derivation at 11:00, we take logP(X) to the left side, and now the left side's identity has changed! It is no longer the KL divergence that we wanted to minimize. So why do we ignore that and still try to minimize the right-hand side? Thanks a million in advance for your response.

    • @Texta92
      @Texta92 1 year ago

      I am also not quite sure how it is OK to have log(P(x)) on the left side of the final objective/loss function. Did you get clarity on that in the last three years? :D

    • @binyillikcinar
      @binyillikcinar 6 months ago

      @@Texta92 Dkl(Q(Z|X) || P(Z|X)) = log(P(x)) - L, or L + Dkl = log(P(x)). Since Dkl is always non-negative, L is a lower bound on log(P(x)). And since log(P(x)) is called the "evidence", L is called the "evidence lower bound" (ELBO): en.wikipedia.org/wiki/Evidence_lower_bound
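[Editor's note] The identity in this reply is the standard ELBO decomposition; written out with the same symbols used in the video:

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
      - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big)}_{\mathcal{L}\ \text{(ELBO)}}
  \;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)}_{\ge\, 0}
```

Because the last KL term is non-negative, dropping it gives \(\mathcal{L} \le \log p_\theta(x)\); maximizing \(\mathcal{L}\) therefore pushes up the evidence while shrinking the intractable KL to the true posterior.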

  • @Darkev77
    @Darkev77 2 years ago

    24:31 How and why is sigma a diagonal matrix?

  • @anantmohan3158
    @anantmohan3158 1 year ago

    Hello Ahlad, first of all, let me thank you for making such a wonderful series which explains each and every required concept very beautifully and with ease.
    My questions are:
    Q1) At 20:00, you mentioned that the variance can become zero without the KL term, but we are discussing the second case, where we don't have the fidelity term but do have the KL term.
    I think there is some ambiguity here. Can you check on this?
    Q2) Also, when you say the KL divergence does not allow the pdf of the latent variables to collapse to zero, what does it mean? Does it mean that it can collapse? I understood that it penalizes deviation from the Gaussian distribution, but what about collapsing?
    Waiting for your answer...
    Thank you..!

  • @ankitsingh-xl7bo
    @ankitsingh-xl7bo 4 months ago

    What is the prior distribution? Is it the distribution of the input data? If it is, then how can we assume it to be Gaussian with zero mean and unit variance?

  • @jakevikoren
    @jakevikoren 4 years ago

    Thank you for sharing your knowledge! This is a great explanation.

  • @pranayb8517
    @pranayb8517 5 years ago +1

    I think there's a slight mistake at 13:09: wouldn't theta star be the weights of the Generator network and phi star those of the Recognition network? Amazing video series though. Kudos!!

    • @hieunguyentranchi947
      @hieunguyentranchi947 3 years ago

      Agreed. I came here much later and was looking for any comment on this mistake; fortunately I found yours :)

  • @sagaradoshi
    @sagaradoshi 11 months ago

    Thank you for an excellent series of videos.
    I have one question: at 12:23 we first wrote L(theta, phi) = -Ez[log P_theta(x|z)] + Dkl[Q_phi(z|x) || P_theta(z)], where the first term is the reconstruction loss and the second term is the regularizer. Right? If yes, then when you later wrote logP_theta(x) - Dkl[Q_phi(z|x) || P_theta(z|x)] = -L(theta, phi), you called logP_theta(x) the reconstruction and Dkl[Q_phi(z|x) || P_theta(z|x)] the regularizer. So I am a little confused about which term is the regularizer and which is the reconstruction loss.
    Our equation is logP_theta(x) - Dkl[Q_phi(z|x) || P_theta(z|x)] = -Ez[log P_theta(x|z)] + Dkl[Q_phi(z|x) || P_theta(z)] = -L(theta, phi), right?
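[Editor's note] The two terms of L(theta, phi) from 12:23 can be made concrete with a toy NumPy sketch (all numbers below are my own illustrative values, not from the video; the KL term uses the closed form for a diagonal Gaussian against the N(0, I) prior, and the reconstruction term is approximated as a squared error, i.e. a Gaussian negative log-likelihood up to a constant):

```python
import numpy as np

# Encoder outputs for one input x (toy values):
mu = np.array([0.5, -0.3])       # mean of q_phi(z|x)
log_var = np.array([-0.2, 0.1])  # log-variance of q_phi(z|x)

# Regularizer: D_KL(q_phi(z|x) || p_theta(z)) with p_theta(z) = N(0, I),
# closed form: 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
kl_regularizer = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

# Reconstruction term: -E_z[log p_theta(x|z)], here approximated with one
# latent sample whose decoding gives x_hat (toy values).
x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.1, 1.9, 3.2])
reconstruction_loss = 0.5 * np.sum((x - x_hat) ** 2)

loss = reconstruction_loss + kl_regularizer  # L(theta, phi)
print(loss)
```

Note that the KL term is exactly zero only when mu = 0 and log_var = 0, i.e. when q_phi(z|x) already equals the prior.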

  • @zyzhang1130
    @zyzhang1130 3 years ago +1

    Great lecture, thanks a lot for uploading it. But one question kept baffling me throughout: how come the output of p(x|z) is x (or x-hat) instead of a parameterized probability function like q(z|x)?

  • @ankitsingh-xl7bo
    @ankitsingh-xl7bo 4 months ago

    @19:57 In case 2 you say the KL divergence (the regularizer) is present, but in the second figure (for case 2) it is written "without regularizer"?

  • @hida-steak-donburi
      @hida-steak-donburi 3 years ago +2

    Another random Indian guy saves my day, thank you!

  • @kalmongar3533
    @kalmongar3533 8 months ago

    By P(x), do you mean P(~x) {i.e., P(tilde x)}? 4:25

  • @anirudhbuvanesh4942
    @anirudhbuvanesh4942 4 years ago

    Why is the probability density function of p assumed to be Normally distributed?

  • @omridrori3286
    @omridrori3286 3 years ago

    Hey, I don't understand something in the explanation.
    At 17:41 you said that we have two different distributions, q(z1|x) and q(z2|x),
    but they are not different distributions...
    z1 and z2 are the arguments of these functions, so it is like saying that x^2 and y^2 are not the same function, which is not true...
    So I didn't understand what you meant in everything after that.
    Also, what does it mean that the mean of this distribution is z1? z1 is not the mean of this distribution according to your earlier explanation...
    z1, as you said, is the sample from the distribution parametrized by the parameters we get after transforming the input with the encoder...
    So z1 is not the mean of that distribution; it is a sample from that distribution.
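[Editor's note] The distinction this comment draws (z1 is a sample, not the mean) is exactly what the reparameterization trick makes explicit; a minimal sketch with toy encoder outputs of my own choosing:

```python
import numpy as np

# The encoder predicts (mu, sigma) for a given x; z is then a *sample*
# from N(mu, sigma^2), drawn via z = mu + sigma * eps with eps ~ N(0, I).
rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])    # mean predicted by the encoder (toy values)
sigma = np.array([0.3, 0.2])  # std-dev predicted by the encoder (toy values)

eps = rng.standard_normal(2)  # noise from the standard normal
z1 = mu + sigma * eps         # one sample; in general z1 != mu

print(z1)
```

Averaging many such samples would approach mu, but any single z1 is just one draw from the distribution the encoder parametrizes.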

  • @HuyQuang-kp5es
    @HuyQuang-kp5es 8 months ago

    Thank you for sharing!

  • @charlherbst4583
    @charlherbst4583 2 years ago

    I'm so confused. Where did P_theta(x) go? That's not really an equality there, so you can't just move it over to the other side; that's not how math works. 10:47

  • @rachitsingh8040
    @rachitsingh8040 5 years ago +1

    You did not explain the crucial step: why are we representing q(z given x) as an expectation? Can you explain it?

    • @Trollitytrolltroll
      @Trollitytrolltroll 5 years ago +1

      The KL divergence is the expectation of the log of q over p. In information theory, that's the expected number of bits of information lost when using q to approximate p. E(log(q/p)) is just E(log(q) - log(p)), and therefore E(log(q)) - E(log(p)), since the expectation of a difference is the difference of the expectations. In any case, the reason the expectation is used is simply that this is the definition of the KL divergence, which is what we seek to minimize.
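[Editor's note] The definition in this reply can be checked numerically on a toy discrete example (the two distributions below are my own made-up values):

```python
import numpy as np

# D_KL(q || p) = E_q[log q - log p], computed over a 3-outcome
# discrete distribution.
q = np.array([0.6, 0.3, 0.1])
p = np.array([0.5, 0.25, 0.25])

kl = np.sum(q * (np.log(q) - np.log(p)))  # expectation taken under q

print(kl)  # non-negative; zero only when q and p coincide
```

Using log base 2 instead of the natural log would give the result in bits, matching the information-theoretic reading in the comment.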

  • @harishr5620
    @harishr5620 5 years ago +1

    Why is the loss the negative of the objective function?

    • @Trollitytrolltroll
      @Trollitytrolltroll 5 years ago

      Because you want to maximize the reconstruction likelihood minus the KL divergence, but most optimizations are posed as gradient *descent* problems, so you can just minimize the negative of the objective.

    • @sanjivgautam9063
      @sanjivgautam9063 4 years ago +1

      Minimizing the negative of the objective function is the same as minimizing the error. For convenience, the negative of the objective function is numerically taken as the loss function.
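[Editor's note] The point in these replies is easy to demonstrate on a toy objective of my own (not from the video): maximizing f(w) = -(w - 3)^2 is the same as running gradient descent on the loss L(w) = -f(w).

```python
# Loss = negative of the objective f(w) = -(w - 3)^2, which peaks at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)  # descending on the loss == ascending on f

print(w)  # converges toward 3, the maximizer of the objective
```

The same flip of sign is what turns the ELBO (to be maximized) into the VAE loss (to be minimized).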

  • @Daydream_Dynamo
    @Daydream_Dynamo 7 months ago

    OK, a stupid question here: if x is the observed variable, i.e. we know the data, then why is p(x) hard to approximate? I just don't get it. Correct me or fill me in if I'm missing something!! Thank you.

  • @dadashkuyko2762
    @dadashkuyko2762 3 years ago

    What is Phi and what is Theta?

    • @vahidnikoofard2939
      @vahidnikoofard2939 3 years ago +1

      They are parameters of the distributions. For example, in the case of a Gaussian distribution, theta (or phi) would be the mean and variance of the distribution.

  • @madhavkumar9942
    @madhavkumar9942 10 months ago

    You are great

  • @prateekcaire4193
    @prateekcaire4193 6 months ago

    Thank You!!

  • @LiveLifeWithLove
    @LiveLifeWithLove 3 years ago

    What is theta in P_theta(z)? Is it the same theta as in P_theta(x|z), or is it different? If it is the same, why do we need P_theta(z) to be near Q_phi(z|x), and not P_phi(z) to be near Q_phi(z|x)? You have confused a lot of people by saying "sorry, it's theta; sorry, it's phi" - can you please fix that?

  • @AGhoshCoocking
    @AGhoshCoocking 5 years ago

    Nice lecture, sir. Still some confusion remains:
    1. We cannot find P(x) as it is not tractable; P is an unknown distribution. So where do we get P(z) from?
    2. We can't compute P(z|x), so we approximate it by q(z|x). That's fine. But how are we getting P(x|z) again, in the ELBO equation?

    • @sanjivgautam9063
      @sanjivgautam9063 4 years ago +1

      2. We are getting P(x|z) from Bayes' theorem.

    • @purvanyatyagi2494
      @purvanyatyagi2494 4 years ago

      P(x|z) is the decoder term with parameters theta; this is the conditional distribution describing x given z, where x belongs to the original data distribution.

    • @purvanyatyagi2494
      @purvanyatyagi2494 4 years ago

      P(x) is the intractable distribution of the data x, not of the latent space; the prior distribution of the latent space, p(z), is assumed to be N(0, 1).

  • @apocalypt0723
    @apocalypt0723 4 years ago

    Thank you for the video.

  • @amgadmuhammad2958
    @amgadmuhammad2958 5 years ago

    I guess you mean P(z) at 26:44?

    • @AhladKumar
      @AhladKumar 5 years ago +2

      Yes, that is why I added a caption there.

    • @amgadmuhammad2958
      @amgadmuhammad2958 5 years ago +1

      @@AhladKumar I missed that caption, thanks!

  • @HosseinMasoudi632
    @HosseinMasoudi632 1 year ago

    Thank you, sir.

  • @purvanyatyagi2494
    @purvanyatyagi2494 4 years ago

    Why didn't I come here before?