Reparameterization Trick - WHY & BUILDING BLOCKS EXPLAINED!

  • Published: 9 Sep 2024
  • This tutorial provides an in-depth explanation of the challenges of, and remedies for, gradient estimation in neural networks that include random variables.
    While the final implementation of the method (called the Reparameterization Trick) is quite simple, it is both interesting and important to understand how and why the method can be applied in the first place. (A minimal code sketch follows at the end of this description.)
    Recommended videos to watch before this one
    Evidence Lower Bound
    • Evidence Lower Bound (...
    3 Big Ideas - Variational AutoEncoder, Latent Variable Model, Amortized Inference
    • Variational Autoencode...
    KL Divergence
    • KL Divergence - CLEARL...
    Links to various papers mentioned in the tutorial
    Auto-Encoding Variational Bayes
    arxiv.org/abs/...
    Doubly Stochastic Variational Bayes for non-Conjugate Inference
    proceedings.ml...
    Stochastic Backpropagation and Approximate Inference in Deep Generative Models
    arxiv.org/abs/...
    Gradient Estimation Using Stochastic Computation Graphs
    arxiv.org/abs/...
    A thread with some insights about the name - "The Law Of The Unconscious Statistician"
    math.stackexch...
    #gradientestimation
    #elbo
    #variationalautoencoder
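
    A minimal sketch of the trick discussed in the video, assuming a PyTorch-style setup (the function name and the log-variance parameterization are illustrative choices, not taken from the video):

        # Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
        # (which blocks gradients), sample eps ~ N(0, I) from a fixed base distribution
        # and compute z = mu + sigma * eps, so gradients flow back into mu and sigma.
        import torch

        def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
            sigma = torch.exp(0.5 * log_var)   # predicted standard deviation
            eps = torch.randn_like(sigma)      # noise, independent of the parameters
            return mu + sigma * eps            # differentiable w.r.t. mu and log_var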

Comments • 73

  • @anselmud
    @anselmud 2 years ago +26

    I watched your videos on KL, ELBO, VAE, and now this one, in sequence. They helped me a lot to clarify my understanding of Variational Auto-Encoders. Pure gold. Thanks!

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏 ...glad that you found them helpful!

  • @mikhaeldito
    @mikhaeldito 2 years ago +21

    Glad that someone finally took the time to decrypt the symbols in the loss function equation!! What a great channel :)

  • @sklkd93
    @sklkd93 1 year ago +5

    Man, these have to be the best ML videos on YouTube. I don't have a degree in Stats and you are absolutely right - the biggest roadblock to understanding is just parsing the notation. The fact that you explain the terms and give concrete examples for them in the context of the neural network is INCREDIBLY helpful.
    I've watched half a dozen videos on VAEs and this is the one that finally got me to a solid mathematical understanding.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏 I don’t have a degree in stats either 😄

    • @RajanNarasimhan
      @RajanNarasimhan 5 months ago

      @@KapilSachdeva
      What was your path to decoding this? I am curious about where you started and how you ended up here. I am sure that's just as interesting as this video.

  • @ssshukla26
    @ssshukla26 2 years ago +5

    I knew the concept; now I know the maths. Thanks for the videos, sir.

  • @user-lm7nn2jm3h
    @user-lm7nn2jm3h 1 year ago +2

    I have watched so many ML / deep learning videos from so many creators, and you are the best. I feel like I finally understand what's going on. Thank you so much!

  • @ThePRASANTHof1994
    @ThePRASANTHof1994 1 year ago +1

    I just found treasure! This was the clearest explanation I've come across so far... And now I'm going to binge-watch this channel's videos like I do Netflix shows. :D

  • @television9233
    @television9233 2 years ago +5

    For the quiz at the end:
    From what I understood, the Encoder network (parametrized by phi) predicts some mu and sigma (based on input X) which then define a normal distribution that the latent variable is sampled from.
    So I think the answer is 2, "predicts", not "learns".
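
    A quick illustration of that answer (a minimal sketch, assuming a PyTorch-style encoder; the class and layer names are hypothetical): mu and sigma are predicted per input x by the network q_phi, rather than being free parameters learned directly.

        import torch
        import torch.nn as nn

        class Encoder(nn.Module):
            """Amortized encoder: predicts the parameters of q_phi(z|x) for each x."""
            def __init__(self, x_dim: int, z_dim: int, hidden: int = 128):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
                self.mu_head = nn.Linear(hidden, z_dim)       # predicts mu(x)
                self.log_var_head = nn.Linear(hidden, z_dim)  # predicts log sigma^2(x)

            def forward(self, x: torch.Tensor):
                h = self.body(x)
                return self.mu_head(h), self.log_var_head(h)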

  • @vslaykovsky
    @vslaykovsky 2 years ago +2

    This explanation is what I was looking for for many days! Thank you!

  • @prachijadhav9098
    @prachijadhav9098 2 years ago +1

    I was looking for this. It's full of essential information. Convention matters, and you clearly explained the differences in this context.

  • @mohdaquib9808
    @mohdaquib9808 2 years ago +3

    Thanks a lot, sir, for your excellent explanation. It made me understand the key idea behind the reparameterization trick.

  • @SY-me5rk
    @SY-me5rk 2 years ago +2

    I hope to also learn your style of delivery from these videos. It's so effective in breaking down the complexity of topics. Looking forward to whatever your next video is.

  • @chyldstudios
    @chyldstudios 1 year ago +1

    Enjoyed watching your clear explanation of the re-parameterization trick. Well done!

  • @ayushsaraf8421
    @ayushsaraf8421 1 year ago +1

    This series was so informative and enjoyable. Absolutely love it! I hope to understand diffusion models much better and have some ideas about extensions.

  • @adamsulak8751
    @adamsulak8751 8 months ago +1

    Incredible quality of teaching 👌.

  • @midhununni951
    @midhununni951 10 months ago +1

    Incredibly clear, and thank you so much for these videos. Looking forward to more...

  • @leif-martinsunde1364
    @leif-martinsunde1364 2 years ago +2

    Wonderful video Kapil. Thanks from the University of Oslo.

  • @longfellowrose1013
    @longfellowrose1013 2 years ago +2

    Amazing video on VAEs and VI. Could you make a tutorial about variational inference in Latent Dirichlet Allocation? Descriptions and explanations of that part of the literature are rather rare.

  • @alirezamogharabi8733
    @alirezamogharabi8733 2 years ago +1

    Really appreciated; I enjoyed your teaching style and great explanations! Thank you ❤️❤️

  • @somasundaramsankaranarayan4592
    @somasundaramsankaranarayan4592 3 months ago +1

    At 6:39, the distribution p_\theta(x|z) cannot have mean mu and stddev sigma, as the mean and stddev live in the latent space (the space of z) while x lives in the input space.

  • @inazuma3gou
    @inazuma3gou 2 years ago +1

    Wow~ an amazing tutorial. Thank you!

  • @pauledam2174
    @pauledam2174 2 months ago

    Can you turn on the transcript for this? Great explanation!

  • @atharvajoshi4243
    @atharvajoshi4243 1 year ago +1

    Thank you for this series. It has really helped me understand the theoretical basis of the VAE model. I had a couple of questions:
    Q1) At 21:30, is dx = d(epsilon) only because we have a linear location-scale transform, or is that a general property of LOTUS?
    Q2) At 9:00, how are the terms combined to give the joint distribution when the parameters of the distributions are different? We would have the log of the multiplication of the probabilities, but the two thetas are different, right? Sorry if this is a stupid question.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      Q1) It has nothing to do with LOTUS; it is just the linear location-scale transform.
      Q2) Theta here represents the parameters of the "joint distribution". Do not think of it as the log of a multiplication of probabilities; rather, think of it as a distribution of two random variables, with theta representing the parameters of that distribution.
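
      A compact way to see Q1 (generic notation, not necessarily the video's): for the linear location-scale transform $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, the differentials satisfy $dz = \sigma\,d\epsilon$ and the densities satisfy $q_\phi(z)\,dz = p(\epsilon)\,d\epsilon$, so

          \mathbb{E}_{q_\phi(z)}\big[f(z)\big]
            = \int f(z)\, q_\phi(z)\, dz
            = \int f(\mu + \sigma\epsilon)\, p(\epsilon)\, d\epsilon ,

      which is what lets the expectation be taken over the parameter-free base distribution.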

  • @anupgupta3644
      @anupgupta3644 1 year ago +1

    It predicts the parameters of the latent variable.

  • @jimmylovesyouall
    @jimmylovesyouall 1 year ago +1

    At 6:35, shouldn't the output of the decoder be the reconstruction of X, not μ and σ?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      The output of the decoder could be either of the following:
      a) a direct prediction of X (the input vector),
      or
      b) a prediction of the mu and sigma of the distribution from which X came.
      Note that the mu and sigma, if predicted (by the decoder), will be those of X and not Z.
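
      For option (b), and assuming a Gaussian likelihood with diagonal covariance (a common but not the only choice), the reconstruction term in the ELBO becomes the Gaussian log-density

          \log p_\theta(x \mid z)
            = -\tfrac{1}{2} \sum_d \left[ \log\!\big(2\pi\,\sigma_d^2(z)\big) + \frac{\big(x_d - \mu_d(z)\big)^2}{\sigma_d^2(z)} \right],

      which reduces to a (scaled) squared-error reconstruction loss when the predicted sigma is held fixed.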

  • @MLDawn
    @MLDawn 1 year ago

    Absolutely brilliant! One issue that I have is that the Leibniz integral rule is concerned with the support of the integral being a function of the variable w.r.t. which we are trying to take the derivative. I don't see how this applies to our case in your video! Isn't the support here just constant lower and upper bounds rather than a function of the Phi parameter? In other words, am I wrong in saying that the support is NOT a function of Phi, and thus we should be able to move the derivative inside the integral? I would appreciate your feedback on this. Thanks

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      This is where the notation creates confusion. You should think of phi as a function (a neural network in this case) that you are learning/discovering.

  • @omidmahjobian3377
    @omidmahjobian3377 2 years ago +3

    GEM

  • @ArashSadr
    @ArashSadr 2 years ago +1

    As always I am stunned by your video! May I ask with what software you produce such videos?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      🙏 Thanks Arash for the kind words.
      I use PowerPoint primarily; for a very few advanced animations I use manim (github.com/manimCommunity/manim).

  • @blasttrash
    @blasttrash 1 year ago +1

    19:09 Since the base distribution is free of our parameters, when we backprop and differentiate, we don't have to differentiate through the unit normal distribution? Is this correct?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Correct. Now this should also make you think: is this assumption of the prior being standard normal a good assumption?
      There are variants of the variational autoencoder in which you can also learn/estimate the parameters of the prior distribution.
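
      To make the point above concrete (generic notation, not necessarily the video's): with $z = \mu_\phi(x) + \sigma_\phi(x)\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$ treated as a constant sample during backpropagation,

          \frac{\partial z}{\partial \mu_\phi} = 1, \qquad \frac{\partial z}{\partial \sigma_\phi} = \epsilon,

      so the gradient only flows through $\mu_\phi$ and $\sigma_\phi$; the density of the base distribution itself is never differentiated.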

  • @RAP4EVERMRC96
    @RAP4EVERMRC96 1 year ago +1

    8:48 How can the terms be combined if one follows the conventional syntax (sigma denoting the parameters of the density function) and the other the non-conventional syntax (sigma denoting the parameters of the decoder, leading to estimates of the parameters of the density function)? In essence, the sigmas they are referencing are not the same.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Assuming that when you mentioned "sigma" you meant "theta":
      This is yet another example of abuse of notation, and hence your confusion is normal. Even though I say that theta is the parameters of the decoder network, in this situation think that the network has predicted mu and sigma (watch the VAE tutorial), and in the symbolic expression, when combining the two terms, we are considering theta to be the set of mu and sigma.

    • @RAP4EVERMRC96
      @RAP4EVERMRC96 1 year ago +1

      @@KapilSachdeva Thanks for clearing that up, and yes, I meant theta. I always mix them up.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      😊

  • @rubyshrestha5747
    @rubyshrestha5747 2 years ago +1

    Thank you for the detailed explanation. I had one question though: I am not able to understand why we cannot take the derivative over theta inside the integration when the integral is over x (at 20:52). Could you please help me get insight into it?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +2

      Hello Ruby, thanks for your comment and, more importantly, for paying attention. The reason you are confused here is that I have a typo in this example: the dx in this example should have been dtheta.
      Now that I look back, I am not happy with this simpler example that I tried to use before explaining it for the ELBO. Not only is there a typo, but it can create confusion. I would suggest ignoring this (so-called simpler) example and looking at it directly for the ELBO. Apologies!

  • @slemanbisharat6390
    @slemanbisharat6390 1 year ago

    Thank you, sir, for the clear explanation. I want to ask about the expression p(xi, z): is this the joint probability, or is it the likelihood under z and theta?

  • @vslaykovsky
    @vslaykovsky 2 years ago +1

    3:00 Shouldn't it be the "negative reconstruction error" instead?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      Since in optimization we minimize, we minimize the negative ELBO, which results in the negative reconstruction error.
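
      In the standard form (generic notation, not necessarily the video's), the ELBO is

          \mathrm{ELBO}(\theta, \phi)
            = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
              - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),

      so minimizing $-\mathrm{ELBO}$ means minimizing $-\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)]$ (the negative reconstruction term) plus the KL term.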

  • @spandanbasu5653
    @spandanbasu5653 1 year ago

    I have a question. From the change-of-variables concept, we take z to be a deterministic function of a sample from the base distribution and the parameters of the target distribution. But when we apply this in the case of the ELBO, we take z to be a deterministic function of phi, x, and epsilon, where phi is the parameters of the encoder network, not the parameters of the target distribution p(z|x). Would this not create any inconsistency in the application?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      The ELBO (the loss function) is used during the "training" of the neural network. During training you are learning the parameters of the encoder (and decoder) networks. Once the networks are trained, q(z|x) will be an approximation of p(z|x).

  • @Daydream_Dynamo
    @Daydream_Dynamo 3 months ago

    It learns the parameter right?

  • @medomed1105
    @medomed1105 2 years ago

    Is there a difference between VAE and GAN?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      They are two different architectures with some shared goals.
      VAEs were primarily designed to do efficient inference in latent variable models (see the previous tutorial for more details on this), but they can be used as generative models.
      A GAN is a generative architecture whose training regime (loss function, setup, etc.) is very different from a VAE's. For a long time GANs produced much better images, but VAEs have now caught up in the quality of generated images.
      Both architectures are somewhat difficult to train, though VAEs are relatively easier to train.
      Hope this sheds some light.

    • @medomed1105
      @medomed1105 2 years ago +1

      @@KapilSachdeva Thank you very much.
      If there is a possibility of making a tutorial about GANs, it would be very much appreciated.
      Thanks again.

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏