Thank you for hand-holding DL aspirants to reach new destinations. Great service to knowledge.
3:59 In the decoder's first linear layer you have only 2 neurons. I mean, if you have 2 neurons from z_mean and 2 neurons from z_log_var, then the decoder's linear layer must contain 4 neurons instead of 2. I don't get it.
Good question! The best way to see how these actually get used is by looking at the forward method:
def forward(self, x):
    x = self.encoder(x)
    z_mean, z_log_var = self.z_mean(x), self.z_log_var(x)
    encoded = self.reparameterize(z_mean, z_log_var)
    decoded = self.decoder(encoded)
So here we can see that these (z_mean and z_log_var) get passed to self.reparameterize, which returns "encoded", which is then passed to the decoder.
Upon inspecting "self.reparameterize" you will see that we use z_mean, z_log_var as parameters for a normal distribution to sample the vector "z" (same as "encoded"):
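def reparameterize(self, z_mu, z_log_var):
    # Sample eps ~ N(0, 1) with the same shape, dtype, and device as z_mu
    # (z_mu.get_device() does not give a usable device for CPU tensors)
    eps = torch.randn_like(z_mu)
    # exp(z_log_var / 2) converts the log-variance to a standard deviation
    z = z_mu + eps * torch.exp(z_log_var / 2.)
    return z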
In other words, the two-dimensional vectors z_mean and z_log_var are not used directly in the decoder; they are just used as the parameters of a 2D Gaussian distribution, from which we sample (via torch.randn) a 2D vector z. I.e., the input to the decoder is the 2D vector "z" (aka "encoded").
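To make the shapes concrete, here is a minimal plain-Python sketch of the same sampling step for a single example. It is deliberately not the PyTorch code from the video, so it runs without any dependencies; the function name reparameterize_one, the seed, and the example numbers are made up for illustration. Four numbers go in (two means, two log-variances), but only two come out, which is why the decoder's first linear layer takes 2 inputs.

```python
import math
import random


def reparameterize_one(z_mean, z_log_var, rng):
    """Sample z = mean + eps * std for one example, with eps ~ N(0, 1)."""
    z = []
    for mu, log_var in zip(z_mean, z_log_var):
        eps = rng.gauss(0.0, 1.0)        # noise, one draw per latent dimension
        std = math.exp(log_var / 2.0)    # log-variance -> standard deviation
        z.append(mu + eps * std)
    return z


rng = random.Random(123)      # hypothetical seed, only for reproducibility
z_mean = [0.5, -1.0]          # 2 numbers from the mean layer
z_log_var = [0.0, -2.0]       # 2 numbers from the log-variance layer

z = reparameterize_one(z_mean, z_log_var, rng)
print(len(z_mean) + len(z_log_var), "encoder outputs ->", len(z), "decoder inputs")
```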
Hello, at 13:53 you said that you are summing over the latent dimension. But aren't the z_mean and z_log_var tensors of the shape (batch size, channels, latent dimension)? In that case wouldn't you sum over axis = 2? Thanks a lot for the videos!
Following up on this, I think the sum over axis = 1 is correct. z_mean and z_log_var are the 2D outputs of the mean and log-variance layers, so their shape is (batch size, latent dimension), and axis = 1 is the latent dimension. The KL divergence formula is first evaluated element-wise, which gives a tensor of shape (batch size, latent dimension); summing over axis = 1 gives one KL value per example, and then you compute the average over the batch. This is similar to taking the MSE loss (with reduction = 'mean'), which first computes the squared differences element-wise and then takes the average.
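As a worked example of the two reduction steps, here is a minimal plain-Python sketch. It assumes z_mean and z_log_var have shape (batch size, latent dimension) and that the KL term is the usual closed form against a standard normal, -0.5 * sum(1 + log_var - mean^2 - exp(log_var)). The function name and the numbers are made up for illustration, and the PyTorch code from the video may differ in detail.

```python
import math


def kl_divergence_loss(z_mean, z_log_var):
    """KL(N(mean, var) || N(0, 1)): sum over latent dim, then mean over batch."""
    per_example = []
    for means, log_vars in zip(z_mean, z_log_var):    # loop over axis 0 (batch)
        total = 0.0
        for mu, log_var in zip(means, log_vars):      # sum over axis 1 (latent)
            total += 1.0 + log_var - mu ** 2 - math.exp(log_var)
        per_example.append(-0.5 * total)
    return per_example, sum(per_example) / len(per_example)


# Batch of 2 examples, latent dimension 2
z_mean = [[0.0, 0.0], [1.0, 0.0]]
z_log_var = [[0.0, 0.0], [0.0, 0.0]]

per_example, loss = kl_divergence_loss(z_mean, z_log_var)
print(per_example)  # one KL value per example
print(loss)         # average over the batch
```

The first example already matches a standard normal, so its KL term is 0; the second has its mean shifted by 1 in one dimension, which costs 0.5. The batch average is therefore 0.25.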
Thanks for the explanation! Unlike the reconstruction loss, which is interpretable, how should we interpret the KL divergence loss? What is an acceptable value? How would the sampled images look if we have a low reconstruction error but a high KL divergence?
Your video is really amazing. Thank you very much for giving us so much knowledge. Can you please tell us how we can get the validation loss evaluation curves?
Thanks :)
Glad to hear you are liking it. I plotted the losses with matplotlib; you can find the code at the top of this file: github.com/rasbt/stat453-deep-learning-ss21/blob/main/L17/helper_plotting.py
Thanks a lot for the VAE series. A small question: since we need the encoder output to be as close to a standard normal distribution as possible, why don't we enforce an activation function on the encoder's linear layers? For example, the mean layer could have a sigmoid activation function and the variance layer a tanh, or something like this?
When running the code on Google Colab, it shows an error in the model.to(DEVICE) part. How can it be corrected?
set_all_seeds(RANDOM_SEED)
model = VAE()
model.to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
I am wondering how the backward pass works through the random function (torch.randn).
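One way to see why this works: the random number eps is drawn once and then treated as a constant, so z = z_mean + eps * exp(z_log_var / 2) is an ordinary differentiable function of z_mean and z_log_var. The plain-Python sketch below checks the two derivatives numerically. The names, the seed, and the numbers are made up for illustration, and it uses finite differences rather than PyTorch's autograd.

```python
import math
import random


def sample_z(mu, log_var, eps):
    """Reparameterized sample; eps is passed in, so this function is deterministic."""
    return mu + eps * math.exp(log_var / 2.0)


rng = random.Random(0)       # hypothetical seed
eps = rng.gauss(0.0, 1.0)    # the only random step; no gradient flows into eps
mu, log_var = 0.3, -0.8

# Analytic derivatives of z with respect to the two encoder outputs
dz_dmu = 1.0
dz_dlog_var = 0.5 * eps * math.exp(log_var / 2.0)

# Central finite differences with the same eps held fixed
h = 1e-6
num_dmu = (sample_z(mu + h, log_var, eps) - sample_z(mu - h, log_var, eps)) / (2 * h)
num_dlog_var = (sample_z(mu, log_var + h, eps) - sample_z(mu, log_var - h, eps)) / (2 * h)

print(dz_dmu, num_dmu)            # the two values should agree closely
print(dz_dlog_var, num_dlog_var)  # the two values should agree closely
```

Because the randomness enters only through eps, gradients can flow back to the mean and log-variance layers even though the sampling itself is not differentiable.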
First of all, thanks a lot! The scatter plot really gives a nice intuition about the latent space. But it got me thinking: will every trained 2D latent space look like this, or will it depend on how someone designed the architecture or trained it? Then I saw that your plot was different from mine, so I guess it's not universal. If it were universal, it would be a huge thing!
Another thing: we are trying to learn a probability distribution, if I'm not wrong. I want to know and visualize the distribution that our network has learned. How can we do that? It's in 2D, so it could be visualized in a 3D graph.
The latent space will depend a bit on the weight of the KL-divergence term (if it is too weak, the latent space will resemble a 2D Gaussian less). Also, since random sampling is involved, the plot may look different every time. Regarding the plot: to plot the distribution in 3D, you'd need some sort of density estimation. This reminds me, I actually wrote a blog post about this a long, long time ago: sebastianraschka.com/Articles/2014_kernel_density_est.html
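A minimal sketch of what such a density estimate could look like, in plain Python: a 2D Gaussian kernel density estimate over latent points. The points, the bandwidth, and the function name are made up for illustration; in practice one would feed in the encoded points from the scatter plot and use the grid values as the height of a 3D surface plot.

```python
import math
import random


def gaussian_kde_2d(points, x, y, bandwidth=0.5):
    """Average of isotropic 2D Gaussian kernels centered on each data point."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        sq_dist = (x - px) ** 2 + (y - py) ** 2
        total += norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / len(points)


# Stand-in for encoded latent vectors: 500 draws from a 2D standard normal
rng = random.Random(1)
latent_points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]

# Evaluate the density on a coarse grid (this would be the z-axis in a 3D plot)
grid = [-3.0 + 0.5 * i for i in range(13)]
density = [[gaussian_kde_2d(latent_points, gx, gy) for gx in grid] for gy in grid]

center = gaussian_kde_2d(latent_points, 0.0, 0.0)
corner = gaussian_kde_2d(latent_points, 3.0, 3.0)
print(center > corner)  # density is higher near the origin than in the corner
```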
Hi Sebastian, I like your videos; they have helped me. I am working on a personal project on variational autoencoders using a Dirichlet distribution, and I am stuck at the point of calculating the binary cross-entropy loss. I would kindly like to request assistance.
Thank you for the video! What's the formula for backpropagation? I did not see the code for the backpropagation part.
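It's part of PyTorch: calling .backward() on the loss computes the gradients automatically via autograd, so there is no hand-written backpropagation code.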