19:17 Why do I feel that the order of the first two plots is swapped? Because if there is no KL term, the model simply becomes a plain autoencoder, which is the middle one in the explanation...
The plot with data fidelity and without KL divergence should be the middle one (but the presenter calls it "without data fidelity and with KL divergence"), right?
The same error is made for the leftmost plot.
So how can we understand "without data fidelity and with KL divergence"? Can I say that in this case the network has no penalty on the reconstruction, which means it is not learning the generation process but only matching the prior p_theta(z)? Thank you so much for pointing this out, guys. I have been following this video and it is super helpful.
Yes, I agree with you, @Ahlad Kumar could you please confirm the statement?
@Manan You are correct. The case without a data fidelity term just has both distributions aligning to the N(0,I) prior and nothing is learned from the data
The derivation of the loss function is a very difficult topic to understand, but you make it very simple. Thank you, Sir!
you are right, this is a clear explanation.
If I am understanding correctly, p(x) cannot be calculated because the integral is intractable (5:40). Therefore, we derive the VAE objective function, which sets a lower bound on log p(x) (10:48). By maximizing the objective (or minimizing the loss function), we maximize the probability of our input data given the model parameters.
Thank you so much. Without the background, this used to be so confusing to me. After watching multiple videos, your content is much clearer and makes more sense to me. Thanks again!
Hat's off to you and the people behind this work. Absolutely amazing
26:19 Here, I'm not sure we replace Σ(x) with exp(Σ(x)) purely because it is more numerically stable. I searched Google and found nothing special.
The encoder's hidden layer outputs (μ, Σ). In the implementation part (the last video in this series), these are the "Mean_layer" and "Standard_deviation_layer" variables. The output Σ can be negative because it comes straight out of a fully connected (Dense) layer; however, the standard deviation of a distribution can NEVER be negative. To fix this, we simply interpret the "Standard_deviation_layer" output (Σ) as the log of the variance. Whenever we need the variance, we just compute exp(Σ).
I think this is the true motivation for replacing Σ(x) with exp(Σ(x)) at 26:19: the raw output is not a "real" variance but is interpreted as a log-variance.
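A minimal NumPy sketch of this point (my own toy numbers, not from the video): whatever real values a Dense layer emits, interpreting them as log-variance and exponentiating always yields a valid, strictly positive variance and standard deviation.

```python
import numpy as np

# Hypothetical raw outputs of the encoder's "Standard_deviation_layer":
# a Dense layer can produce any real number, including negatives.
log_var = np.array([-3.0, 0.0, 2.5])

# Interpreting the raw output as log-variance and exponentiating
# guarantees a strictly positive variance / standard deviation.
var = np.exp(log_var)
std = np.exp(0.5 * log_var)  # sqrt of the variance

print(var)  # all entries positive, even where log_var was negative
print(std)
```

This is why the exp() does not "change the value": the network is free to learn whatever log-variance makes exp(Σ) come out right.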
Around 19:00 the order is 2,1,3 not 1,2,3.
It means that without any regularisation term the model will learn peaky distributions with very small variance.
I get something every time when I watch this video.
One of the best explanations. Can we get those slides ?
Thanks! A comprehensive and unique explanation that is not easily found in other tutorials.
Can you provide more information about the reason behind the inference problem @4:48?
I know the law of total probability, but I can't understand how p(x) is computed in the equation provided, or what is meant by evaluating the integral in "closed form" here.
I guess p(x) is a fixed number and has nothing to do with the loss function.
"Closed form" means an expression in terms of known or given parameters and constants. For example, the sum from 1 to n can be written in closed form as n*(n+1)/2.
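To illustrate the intractability point (a toy sketch with made-up distributions, not the lecture's model): when p(x) = ∫ p(x|z) p(z) dz has no closed form, all we can do is approximate the integral, e.g. by Monte Carlo sampling z from the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    # Density of N(mean, std^2) evaluated at x
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy model: z ~ N(0, 1), x | z ~ N(f(z), 0.5) with a nonlinear f,
# so p(x) = ∫ p(x|z) p(z) dz has no closed form.
f = np.tanh  # hypothetical nonlinear "decoder" mean function

x = 0.3
z_samples = rng.standard_normal(100_000)          # z ~ p(z)
p_x = gaussian_pdf(x, f(z_samples), 0.5).mean()   # Monte Carlo estimate of p(x)
print(p_x)
```

In a VAE the decoder is a deep network, so the integral is intractable in exactly this sense, and doing Monte Carlo over the whole prior is wasteful; that is what motivates the variational lower bound instead.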
Please answer my query: at 26:18, can we just replace the terms in the equation with exp() like that? Won't the whole value change? And what do you mean by "numerically stable"?
same question. can anyone explain?
Logs and exponentials are inverses of each other, so the two parameterizations are interchangeable. "Numerically stable" means the computation is less affected by rounding errors (and can never produce a negative variance).
@Ahlad Kumar Sir, Thank You for making these amazing lectures. It is the best lecture and detailed lecture on AutoEncoder and its series. Thank You once again. _/\_
At 20:35 the distribution will be peaky (i.e. the variance will not be 1) when the KL divergence is not present (i.e. no regularizer), but you're discussing the case where the KL divergence is present, so how can it be peaky when only the KL divergence is in the loss function?
Also, the second column at 20:35 describes the case where no regularization is present, but you're discussing the case where only the KL divergence is present. This part of the video is somewhat confusing.
Yep, I think he is confusing the two images. The first one on the left represents the case where you have the KL term but not the reconstruction term. The picture in the middle is the other way round (reconstruction present but not KL).
Nice detail: (14:47) I think the P(x|z) multivariate distribution lacks a -1 in the exponential component. Forgive me if I am wrong.
Thanks a million for this beautiful video. I have two questions and I would really appreciate your help. I thought that in variational inference we consider a known (tractable) distribution q(Z|X) and try to minimize Dkl(Q(Z|X) || P(Z|X)), where Q is that known distribution; however, at 11:00 you mention that Q is what we are learning and P is the known distribution. Moreover, I understand that we need to minimize Dkl(Q(Z|X) || P(Z|X)), but in the derivation at 11:00 we move log P(X) to the left side, and now the left-hand side is no longer the KL divergence we wanted to minimize. So why do we ignore that and still optimize the right-hand side? Thanks a million in advance for your response.
I am also not quite sure about how it is ok to have the log(P(x)) on the left side of the final objective/loss function. Did you get clarity on that in the last three years? :D
@@Texta92 Dkl( Q(Z|X) || P(Z|X) ) = log P(X) - L, or L + Dkl = log P(X). Since Dkl is non-negative, L is a lower bound on log P(X). Since log P(X) is called the "evidence", L is called the evidence lower bound (ELBO): en.wikipedia.org/wiki/Evidence_lower_bound
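This identity is easy to check numerically on a toy discrete model (my own example, not from the video): with a finite latent space, every quantity is computable exactly, and log p(x) = L + Dkl(q(z|x) || p(z|x)) holds term by term.

```python
import numpy as np

# Toy discrete latent model: z ∈ {0, 1}, one fixed observation x.
p_z = np.array([0.4, 0.6])            # prior p(z)
p_x_given_z = np.array([0.7, 0.2])    # likelihood p(x|z) for our fixed x
q = np.array([0.5, 0.5])              # arbitrary approximate posterior q(z|x)

p_x = np.sum(p_x_given_z * p_z)        # evidence p(x), tractable here
p_z_given_x = p_x_given_z * p_z / p_x  # true posterior via Bayes' rule

# ELBO = E_q[log p(x|z)] - Dkl(q || p(z))
elbo = np.sum(q * np.log(p_x_given_z)) - np.sum(q * np.log(q / p_z))
kl_posterior = np.sum(q * np.log(q / p_z_given_x))  # Dkl(q || p(z|x))

print(np.log(p_x), elbo + kl_posterior)  # identical values
```

Since the posterior KL term is non-negative, the ELBO can never exceed log p(x), which is exactly why maximizing it is safe even though log p(x) itself is fixed.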
24:31 how and why is sigma a diagonal matrix?
Hello Ahlad, first of all, let me thank you for making such a wonderful series, which explains every required concept beautifully and with ease.
My questions are:
Q1) At 20:00 you mentioned that the variance can become zero without the KL term, but we are discussing the second case, i.e. the one without the fidelity term but with the KL term. I think there is some ambiguity here; can you check this?
Q2) Also, when you say the KL divergence does not allow the pdf of the latent variables to collapse to zero variance, what does that mean? Does it mean it otherwise could collapse? I understood that it penalizes deviation from the Gaussian prior, but what about collapsing?
Waiting for your answer.
Thank you!
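One way to see the "collapse" point numerically (a quick sketch using the standard closed-form Gaussian KL, my own notation): Dkl( N(μ, σ²) || N(0, 1) ) grows without bound as σ → 0, so the KL term actively penalizes the encoder's variance shrinking toward a point mass.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Closed-form Dkl( N(mu, sigma^2) || N(0, 1) )
    return 0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

for sigma in [1.0, 0.1, 0.01, 0.001]:
    print(sigma, kl_to_standard_normal(0.0, sigma))
# The penalty blows up as sigma -> 0: the -log(sigma^2) term dominates,
# so the KL term stops q(z|x) from collapsing to zero variance.
```

Without this term, the reconstruction loss alone would happily drive σ to zero (a deterministic encoding), which is the "peaky distribution" case discussed in the video.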
What is the prior distribution? Is it the distribution of the input data? If it is, then how can we assume it to be Gaussian with zero mean and unit variance?
Thank you for sharing your knowledge! This is a great explanation
I think there's a slight mistake at 13:09, wouldn't theta star be the weights of the Generator network and phi star for the Recognition network? Amazing video series though.. Kudos!!
Agree, I came here so much later and I was looking for any comment on this mistake, fortunately I found yours :)
Thank you for this excellent series of videos.
I have one question: at 12:23 we first wrote L(theta, phi) = - Ez[log P_theta(x|z)] + Dkl[Q_phi(z|x) || P_theta(z)], where the first term is the reconstruction loss and the second term is the regularizer, right? If so, then when you later wrote log P_theta(x) - Dkl[Q_phi(z|x) || P_theta(z|x)] = -L(theta, phi), you called log P_theta(x) the reconstruction and Dkl[Q_phi(z|x) || P_theta(z|x)] the regularizer. So I am a little confused about which term is the regularizer and which is the reconstruction loss.
Our equation is log P_theta(x) - Dkl[Q_phi(z|x) || P_theta(z|x)] = Ez[log P_theta(x|z)] - Dkl[Q_phi(z|x) || P_theta(z)] = -L(theta, phi), right?
Great lecture, thanks a lot for uploading it. But one question kept baffling me throughout: how come the output of p(x|z) is x (or x-hat) instead of a parameterized probability distribution, like q(z|x)?
@19:57 In case 2 you say the KL divergence (regularizer) is present, but the second figure (for case 2) is labeled "without regularizer"?
Another random Indian guy saves my day, thank you!
By P(x), do you mean P(x-tilde)? 4:25
Why is the probability density function of p assumed to be known as Normally distributed?
Hey, I don't understand something in the explanation.
At 17:41 you said that we have two different distributions, q(z1|x) and q(z2|x), but they are not different distributions: z1 and z2 are just arguments of the same function. It is like saying x^2 and y^2 are different functions, which is not true. So I didn't understand what you meant from that point on.
Also, what does it mean that the mean of this distribution is z1? According to your earlier explanation, z1 is not the mean of the distribution; it is a sample from it. The distribution is parameterized by the values we get after transforming the input with the encoder, so z1 is a sample from that distribution, not its mean.
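For what it's worth, here is how I read it (a toy NumPy sketch with made-up numbers): the encoder outputs μ(x) and σ(x) for a given input x, and z1 and z2 are simply two independent samples drawn from the same distribution N(μ(x), σ(x)²).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical encoder outputs for one input x
mu, sigma = 1.5, 0.3   # mean and std of q(z|x)

# z1 and z2 are two samples from the SAME distribution q(z|x),
# drawn via the reparameterization z = mu + sigma * eps, eps ~ N(0, 1)
z1 = mu + sigma * rng.standard_normal()
z2 = mu + sigma * rng.standard_normal()

print(z1, z2)  # two different values, one distribution
```

So q(z1|x) and q(z2|x) in the video are arguably best read as "the density of q(z|x) evaluated at the samples z1 and z2", not as two distinct distributions.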
Thank you for sharing!
I'm so confused: where did P_theta(x) go? That's not really an equality there, so you can't just move it over to the other side; that's not how math works. 10:47
You did not explain the crucial step. Why are we representing q(z given x) as an expectation? Can you explain it?
The KL divergence is the expectation of the log of q over p. In information theory, that's the expected number of bits of information lost when using q to approximate p. E(log(q/p)) is just E(log(q) - log(p)) and therefore E(log(q)) - E(log(p)) since the expectation of a difference is the difference between expectations. In any case the reason why the expectation is used is just because that's the definition of the KL divergence, which is what we seek to minimize.
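The decomposition above is easy to verify numerically on a small discrete pair of distributions (my own toy numbers): E[log(q/p)] under q equals E[log q] − E[log p] under q.

```python
import numpy as np

q = np.array([0.1, 0.6, 0.3])   # approximate distribution
p = np.array([0.3, 0.3, 0.4])   # target distribution

kl = np.sum(q * np.log(q / p))                              # Dkl(q || p)
decomposed = np.sum(q * np.log(q)) - np.sum(q * np.log(p))  # E_q[log q] - E_q[log p]

print(kl, decomposed)  # equal values
```

Note both expectations are taken under q, matching the definition of Dkl(q || p), and the result is always non-negative by Gibbs' inequality.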
Why is the loss the negative of the objective function?
Because you want to maximize the reconstruction likelihood minus the KL divergence, but most optimizations are posed as gradient *descent* problems, you can just minimize the negative of the objective.
Minimizing the negative of the objective function is the same as maximizing the objective. For convenience, the loss function is simply taken to be the negative of the objective function.
OK, a stupid question here: if x is an observed variable, i.e. we know the data, then why is p(x) hard to approximate? I just don't get it. Correct me or fill me in if I'm missing something! Thank you.
What is Phi and what is Theta?
They are parameters of the distribution. For example, in the case of the Gaussian distribution, theta (or phi) are the mean and variance of the distribution.
You are great
Thank You!!
What is theta in P_theta(z)? Is it the same theta as in P_theta(x|z), or is it different? If it is the same, why do we need P_theta(z) to be close to Q_phi(z|x), and not P_phi(z)? You have confused us a lot by saying "sorry, it's theta... sorry, it's phi"; can you please fix that?
Nice lecture, sir. I still have some confusion:
1. We cannot find P(x) as it is not tractable, and P is an unknown distribution. So where do we get P(z) from?
2. We can't compute P(z|x), so we approximate it by q(z|x); that's fine. But how do we get P(x|z) again, in the ELBO equation?
2. We get P(x|z) from Bayes' theorem.
P(x|z) is the decoder term with parameters theta; it is the conditional distribution describing x given z, where x belongs to the original data distribution.
P(x) is the intractable distribution of the data x, not of the latent space; the prior distribution of the latent space, p(z), is assumed to be N(0, I).
Thank you for video
I guess you mean P(z) at 26:44?
yes that is why I added caption there....
@@AhladKumar I missed that caption, thanks!
Thank you, sir.
Why didn't I come here before?