Reparameterization Trick - WHY & BUILDING BLOCKS EXPLAINED!

  • Published: 9 Sep 2024
  • This tutorial provides an in-depth explanation of the challenges of, and remedies for, gradient estimation in neural networks that include random variables.
    While the final implementation of the method (called the Reparameterization Trick) is quite simple, it is both interesting and important to understand how and why the method can be applied in the first place. (A minimal code sketch follows at the end of this description.)
    Recommended videos to watch before this one
    Evidence Lower Bound
    • Evidence Lower Bound (...
    3 Big Ideas - Variational AutoEncoder, Latent Variable Model, Amortized Inference
    • Variational Autoencode...
    KL Divergence
    • KL Divergence - CLEARL...
    Links to various papers mentioned in the tutorial
    Auto-Encoding Variational Bayes
    arxiv.org/abs/...
    Doubly Stochastic Variational Bayes for non-Conjugate Inference
    proceedings.ml...
    Stochastic Backpropagation and Approximate Inference in Deep Generative Models
    arxiv.org/abs/...
    Gradient Estimation Using Stochastic Computation Graphs
    arxiv.org/abs/...
    A thread with some insights about the name - "The Law Of The Unconscious Statistician"
    math.stackexch...
    #gradientestimation
    #elbo
    #variationalautoencoder
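
    A minimal sketch of the trick discussed in the video, assuming a PyTorch-style setup (the function name and the log-variance parameterization are illustrative choices, not taken from the video):

        # Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
        # (which blocks gradients), sample eps ~ N(0, I) from a fixed base distribution
        # and compute z = mu + sigma * eps, so gradients flow back into mu and sigma.
        import torch

        def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
            sigma = torch.exp(0.5 * log_var)   # predicted standard deviation
            eps = torch.randn_like(sigma)      # noise, independent of the parameters
            return mu + sigma * eps            # differentiable w.r.t. mu and log_var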

Comments • 73

  • @anselmud
    @anselmud 2 years ago +26

    I watched your videos on KL, ELBO, VAE, and now this one, in sequence. They helped me a lot to clarify my understanding of Variational Auto-Encoders. Pure gold. Thanks!

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏 ...glad that you found them helpful!

  • @mikhaeldito
    @mikhaeldito 2 years ago +21

    Glad that someone finally took the time to decrypt the symbols in the loss function equation!! What a great channel :)

  • @sklkd93
    @sklkd93 1 year ago +5

    Man, these have to be the best ML videos on YouTube. I don't have a degree in Stats and you are absolutely right - the biggest roadblock to understanding is just parsing the notation. The fact that you explain the terms and give concrete examples for them in the context of the neural network is INCREDIBLY helpful.
    I've watched half a dozen videos on VAEs and this is the one that finally got me to a solid mathematical understanding.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏 I don’t have a degree in stats either 😄

    • @RajanNarasimhan
      @RajanNarasimhan 5 months ago

      @@KapilSachdeva
      What was your path to decoding this? I am curious about where you started and how you ended up here. I am sure that's just as interesting as this video.

  • @ssshukla26
    @ssshukla26 2 years ago +5

    I knew the concept; now I know the maths. Thanks for the videos, sir.

  • @user-lm7nn2jm3h
    @user-lm7nn2jm3h 1 year ago +2

    I have watched so many ML / deep learning videos from so many creators, and you are the best. I feel like I finally understand what's going on. Thank you so much!

  • @ThePRASANTHof1994
    @ThePRASANTHof1994 1 year ago +1

    I just found treasure! This was the clearest explanation I've come across so far... And now I'm going to binge-watch this channel's videos like I do Netflix shows. :D

  • @television9233
    @television9233 2 years ago +5

    For the quiz at the end:
    From what I understood, the Encoder network (parametrized by phi) predicts some mu and sigma (based on input X) which then define a normal distribution that the latent variable is sampled from.
    So I think the answer is 2, "predicts", not "learns".
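
    A quick illustration of that answer (a minimal sketch, assuming a PyTorch-style encoder; the class and layer names are hypothetical): mu and sigma are predicted per input x by the network q_phi, rather than being free parameters learned directly.

        import torch
        import torch.nn as nn

        class Encoder(nn.Module):
            """Amortized encoder: predicts the parameters of q_phi(z|x) for each x."""
            def __init__(self, x_dim: int, z_dim: int, hidden: int = 128):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
                self.mu_head = nn.Linear(hidden, z_dim)       # predicts mu(x)
                self.log_var_head = nn.Linear(hidden, z_dim)  # predicts log sigma^2(x)

            def forward(self, x: torch.Tensor):
                h = self.body(x)
                return self.mu_head(h), self.log_var_head(h)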

  • @vslaykovsky
    @vslaykovsky 2 years ago +2

    This explanation is what I was looking for for many days! Thank you!

  • @prachijadhav9098
    @prachijadhav9098 2 years ago +1

    I was looking for this. It's full of essential information. Convention matters, and you clearly explained the differences in this context.

  • @mohdaquib9808
    @mohdaquib9808 2 years ago +3

    Thanks a lot, sir, for your excellent explanation. It made me understand the key idea behind the reparameterization trick.

  • @SY-me5rk
    @SY-me5rk 2 years ago +2

    I hope to also learn your style of delivery from these videos. It's so effective in breaking down the complexity of topics. Looking forward to whatever your next video is.

  • @chyldstudios
    @chyldstudios 1 year ago +1

    Enjoyed watching your clear explanation of the re-parameterization trick. Well done!

  • @ayushsaraf8421
    @ayushsaraf8421 1 year ago +1

    This series was so informative and enjoyable. Absolutely love it! I hope to understand diffusion models much better and have some ideas about extensions.

  • @adamsulak8751
    @adamsulak8751 8 months ago +1

    Incredible quality of teaching 👌.

  • @midhununni951
    @midhununni951 10 months ago +1

    Incredibly clear, and thank you so much for these videos. Looking forward to more...

  • @leif-martinsunde1364
    @leif-martinsunde1364 2 years ago +2

    Wonderful video Kapil. Thanks from the University of Oslo.

  • @longfellowrose1013
    @longfellowrose1013 2 years ago +2

    Amazing video on VAEs and VI. Could you make a tutorial about variational inference in Latent Dirichlet Allocation? Descriptions and explanations of that part of the literature are rather rare.

  • @alirezamogharabi8733
    @alirezamogharabi8733 2 years ago +1

    Really appreciated; I enjoyed your teaching style and great explanations! Thank you ❤️❤️

  • @somasundaramsankaranarayan4592
    @somasundaramsankaranarayan4592 3 months ago +1

    At 6:39, the distribution p_\theta(x|z) cannot have mean mu and stddev sigma, as the mean and stddev live in the latent space (the space of z) while x lives in the input space.

  • @inazuma3gou
    @inazuma3gou 2 years ago +1

    Wow~ an amazing tutorial. Thank you!

  • @pauledam2174
    @pauledam2174 2 months ago

    Can you turn on the transcript for this? Great explanation!

  • @atharvajoshi4243
    @atharvajoshi4243 1 year ago +1

    Thank you for this series. It has really helped me understand the theoretical basis of the VAE model. I had a couple of questions:
    Q1) At 21:30, is dx = d(epsilon) only because we have a linear location-scale transform, or is that a general property of LOTUS?
    Q2) At 9:00, how are the terms combined to give the joint distribution when the parameters of the distributions are different? We would have the log of the multiplication of the probabilities, but the two thetas are different, right? Sorry if this is a stupid question.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      Q1) It has nothing to do with LOTUS; it is just the linear location-scale transform.
      Q2) Theta here represents the parameters of the "joint distribution". Do not think of it as the log of a multiplication of probabilities; rather, think of it as a distribution of two random variables, with theta representing the parameters of that distribution.
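
      A compact way to see Q1 (generic notation, not necessarily the video's): for the linear location-scale transform $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, the differentials satisfy $dz = \sigma\,d\epsilon$ and the densities satisfy $q_\phi(z)\,dz = p(\epsilon)\,d\epsilon$, so

          \mathbb{E}_{q_\phi(z)}\big[f(z)\big]
            = \int f(z)\, q_\phi(z)\, dz
            = \int f(\mu + \sigma\epsilon)\, p(\epsilon)\, d\epsilon ,

      which is what lets the expectation be taken over the parameter-free base distribution.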

  • @anupgupta3644
      @anupgupta3644 1 year ago +1

    It predicts the parameters of the latent variable.

  • @jimmylovesyouall
    @jimmylovesyouall 1 year ago +1

    At 6:35, shouldn't the output of the decoder be the reconstruction of X, not μ and σ?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      The output of the decoder could be either of the following:
      a) a direct prediction of X (the input vector),
      or
      b) a prediction of the mu and sigma of the distribution from which X came.
      Note that the mu and sigma, if predicted (by the decoder), will be those of X and not Z.
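
      For option (b), and assuming a Gaussian likelihood with diagonal covariance (a common but not the only choice), the reconstruction term in the ELBO becomes the Gaussian log-density

          \log p_\theta(x \mid z)
            = -\tfrac{1}{2} \sum_d \left[ \log\!\big(2\pi\,\sigma_d^2(z)\big) + \frac{\big(x_d - \mu_d(z)\big)^2}{\sigma_d^2(z)} \right],

      which reduces to a (scaled) squared-error reconstruction loss when the predicted sigma is held fixed.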

  • @MLDawn
    @MLDawn 1 year ago

    Absolutely brilliant! One issue that I have is that the Leibniz integral rule is concerned with the support of the integral being a function of the variable w.r.t. which we are trying to take the derivative. I don't see how this applies to our case in your video! Isn't the support here just constant lower and upper bounds rather than a function of the Phi parameter? In other words, am I wrong in saying that the support is NOT a function of Phi, and thus we should be able to move the derivative inside the integral? I would appreciate your feedback on this. Thanks

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      This is where the notation creates confusion. You should think of phi as a function (a neural network in this case) that you are learning/discovering.

  • @omidmahjobian3377
    @omidmahjobian3377 2 years ago +3

    GEM

  • @ArashSadr
    @ArashSadr 2 years ago +1

    As always I am stunned by your video! May I ask with what software you produce such videos?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      🙏 Thanks Arash for the kind words.
      I use PowerPoint primarily; for a very few advanced animations I use manim (github.com/manimCommunity/manim).

  • @blasttrash
    @blasttrash 1 year ago +1

    19:09 Since the base distribution is free of our parameters, when we backprop and differentiate, we don't have to differentiate through the unit normal distribution? Is this correct?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Correct. Now this should also make you think: is this assumption of the prior being standard normal a good assumption?
      There are variants of the variational autoencoder in which you can also learn/estimate the parameters of the prior distribution.
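
      To make the point above concrete (generic notation, not necessarily the video's): with $z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$ treated as a constant sample during backpropagation,

          \frac{\partial z}{\partial \mu_\phi} = 1, \qquad \frac{\partial z}{\partial \sigma_\phi} = \epsilon,

      so the gradient only flows through $\mu_\phi$ and $\sigma_\phi$; the density of the base distribution itself is never differentiated.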

  • @RAP4EVERMRC96
    @RAP4EVERMRC96 1 year ago +1

    8:48 How can the terms be combined if one follows the conventional syntax (sigma denoting the parameters of the density function) and the other the non-conventional syntax (sigma denoting the parameters of the decoder, leading to estimates of the parameters of the density function)? In essence, the sigmas they are referencing are not the same.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Assuming that when you mentioned "sigma" you meant "theta":
      This is yet another example of abuse of notation, and hence your confusion is normal. Even though I say that theta is the parameters of the decoder network, in this situation think that the network has predicted mu and sigma (watch the VAE tutorial), and in the symbolic expression, when combining the two terms, we are considering theta to be the set of mu and sigma.

    • @RAP4EVERMRC96
      @RAP4EVERMRC96 1 year ago +1

      @@KapilSachdeva Thanks for clearing that up, and yes, I meant theta. I always mix them up.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      😊

  • @rubyshrestha5747
    @rubyshrestha5747 2 years ago +1

    Thank you for the detailed explanation. I had one question though: I am not able to understand why we cannot take the derivative over theta inside the integration when the integral is over x (at 20:52). Could you please help me get insight into it?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +2

      Hello Ruby, thanks for your comment and, more importantly, for paying attention. The reason you are confused here is that I have a typo in this example: the dx in this example should have been dtheta.
      Now that I look back, I am not happy with this simpler example that I tried to use before explaining it for the ELBO. Not only is there a typo, but it can create confusion. I would suggest ignoring this (so-called simpler) example and looking at it directly for the ELBO. Apologies!

  • @slemanbisharat6390
    @slemanbisharat6390 1 year ago

    Thank you, sir, for the clear explanation. I want to ask about the expression p(xi, z): is this the joint probability, or is it the likelihood under z and theta?

  • @vslaykovsky
    @vslaykovsky 2 years ago +1

    3:00 Shouldn't it be the "negative reconstruction error" instead?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      Since in optimization we minimize, we minimize the negative ELBO, which results in the negative reconstruction error.
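
      In the standard form (generic notation, not necessarily the video's), the ELBO is

          \mathrm{ELBO}(\theta, \phi)
            = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
              - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),

      so minimizing $-\mathrm{ELBO}$ means minimizing $-\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]$ (the negative reconstruction term) plus the KL term.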

  • @spandanbasu5653
    @spandanbasu5653 1 year ago

    I have a question. From the change-of-variables concept, we take z to be a deterministic function of a sample from the base distribution and the parameters of the target distribution. But when we apply this in the case of the ELBO, we take z to be a deterministic function of phi, x, and epsilon, where phi is the parameters of the encoder network, not the parameters of the target distribution p(z|x). Would this not create any inconsistency in the application?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      The ELBO (the loss function) is used during the "training" of the neural network. During training you are learning the parameters of the encoder (and decoder) networks. Once the networks are trained, q(z|x) will be an approximation of p(z|x).

  • @Daydream_Dynamo
    @Daydream_Dynamo 3 months ago

    It learns the parameter right?

  • @medomed1105
    @medomed1105 2 years ago

    Is there a difference between VAE and GAN?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      They are two different architectures with some shared goals.
      VAEs were primarily designed to do efficient inference in latent variable models (see the previous tutorial for more details on this), but they can be used as generative models.
      A GAN is a generative architecture whose training regime (loss function, setup, etc.) is very different from a VAE's. For a long time GANs produced much better images, but VAEs have now caught up in the quality of generated images.
      Both architectures are somewhat difficult to train, though VAEs are relatively easier to train.
      Hope this sheds some light.

    • @medomed1105
      @medomed1105 2 years ago +1

      @@KapilSachdeva Thank you very much.
      If there is a possibility of making a tutorial about GANs, it would be very much appreciated.
      Thanks again.

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏