Quite good!
Thank you!
Really well explained, and a compact notebook. It's basically all written directly from Torch, very refreshing to see when so much content is heavily reliant on APIs.
Yeah, it would only take 16 years to create an API like Torch to run on a GPU and then show it in a video.
great video, very clear
Thanks a lot for the video, really helpful for someone trying to grasp these models.
Also, a little typo I noticed: at 16:06, in the cell "# Simulate forward diffusion", noise is being added a little faster than intended.
The culprit is the line "image, noise = forward_diffusion_sample(image,t)": it overwrites the variable "image" at each step of the loop, even though forward_diffusion_sample was built expecting the initial, non-noisy image. So from the second iteration onwards we're adding noise to an already noisy image.
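For reference, the fix is just writing the result to a new variable, roughly like this (my sketch of the corrected cell; dataloader, T, forward_diffusion_sample and show_tensor_image are the notebook's own names):

import torch
import matplotlib.pyplot as plt

# Simulate forward diffusion (corrected)
image = next(iter(dataloader))[0]

plt.figure(figsize=(15, 15))
plt.axis('off')
num_images = 10
stepsize = int(T / num_images)

for idx in range(0, T, stepsize):
    t = torch.tensor([idx], dtype=torch.int64)
    plt.subplot(1, num_images + 1, int(idx / stepsize) + 1)
    # New variable, so forward_diffusion_sample always receives the original x_0
    noisy_image, noise = forward_diffusion_sample(image, t)
    show_tensor_image(noisy_image)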
Hehe, thanks for this finding, this is indeed a bug. I just checked, and it doesn't look very different with the correction (assigning to a new variable). If I'm not mistaken, this led to a multiplication by 2: in every step the pre-computed noise for this t is added, plus the cumulative noise until t (which should be the same as the pre-computed one), hence leading to twice the noise as intended. Anyways, thanks for this comment! :)
@@DeepFindr Hi again. Yeah, the bug didn't really affect the images much, but it might confuse some viewers about whether you're computing x_t from x_0 or from x_{t-1}.
As for the "multiplication by 2" bit, it's not going to be exactly that since the betas are changing and you're adding (t-1)-step noise to t-step noise. Moreover, adding a N(0,1) to another independent N(0,1) is a N(0,2), whose standard deviation is sqrt(2), so what was happening should be closer to multiplication by sqrt(2), even if also not exactly that.
Anyway, since my previous comment I've now finished the video and trained it for 100 epochs so far (with comparable results to yours).
I have two more comments in the latter bits of the video, namely the "sample_timestep" function at 26:59:
- I was rather confused for a while as to why we were returning "model_mean" rather than just "x" for t=0. Though eventually I realized that the t's in the code are offset from the t's in the paper: the code is 0-indexed but the paper is 1-indexed. So the t=0 case in the sample_timestep is really inferring x_0 from x_1 in terms of the paper.
It might be worth adding a comment about this in either the video or the code.
- it took me quite a bit to understand the output of the sample_timestep function. I think I mostly got it now, but this is a really subtle step that is worth demystifying.
Here's my current understanding: in effect our model is always trying to predict x_0 from x_t, but we don't expect the prediction to be great for large t. However, the distribution p(x_(t-1) | x_t, x_0) is a fully known normal distribution, so we instead use the predicted x_0 to approximate this p(...), then sample from that to get x_(t-1).
In retrospect, I've seen multiple videos on diffusion try to describe this process in words as "we predict the full noise, but then we add some of the noise back", but that vague description never made sense to me.
So maybe an extra comment on this could help a future viewer as well.
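To make that concrete, here's roughly what I believe sample_timestep does, written in the noise-prediction form from the paper (my own reconstruction, not the exact notebook code; the schedule tensors and the model come from the notebook, and I use a generic gather helper):

import torch

def gather(vals, t, x_shape):
    # Pick the schedule value for timestep t and reshape it for broadcasting
    out = vals.gather(-1, t)
    return out.reshape(t.shape[0], *((1,) * (len(x_shape) - 1)))

@torch.no_grad()
def sample_timestep(x, t):
    betas_t = gather(betas, t, x.shape)
    sqrt_one_minus_alphas_cumprod_t = gather(sqrt_one_minus_alphas_cumprod, t, x.shape)
    sqrt_recip_alphas_t = gather(sqrt_recip_alphas, t, x.shape)

    # Mean of p(x_{t-1} | x_t): subtract the scaled predicted noise from x_t
    model_mean = sqrt_recip_alphas_t * (
        x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t)

    if t == 0:
        # The code's t=0 is the paper's t=1: return the mean, no extra noise
        return model_mean
    posterior_variance_t = gather(posterior_variance, t, x.shape)
    noise = torch.randn_like(x)
    return model_mean + torch.sqrt(posterior_variance_t) * noise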
Anyway, let me thank you again for the video. My hope is to eventually actually understand stuff like stable diffusion with all its bells and whistles, and this already helped a lot.
And on that note, I noticed that the weights for the network in the video take up 700 MB, compared to something like 4 GB for stable diffusion, so it's maybe not so surprising that this would require a while to train from scratch.
@Luis Pereira yes I totally agree, in retrospect some things could've been more in depth. Meanwhile I've also experimented more and read other papers about these models (and also the connection to score based approaches) which could also be added here. Maybe I'll make an update video some day :)
@@DeepFindr No worries. My experience is that in retrospect nearly everything could have been improved in some way or other.
And if you ever find the time for another video, I at least would be interested. There are a decent number of good YouTube videos on this topic, but this is one of the best I've found.
Great video! For me, the code makes it easier to understand the math than the actual formulas, so videos like these really help.
Yes - I am a slow learner and I will reverse engineer the Python code to understand the math.
It's all probabilities, and all that math we hear in the video feels more like a quantum computer.
Then we approximate all the quantum computer moves with classical CPU (or GPU) ops and we have to understand how much the formulas were rounded, truncated, and what parts were ignored.
I took first prize in exact Math, I still do not have the right soul to eat all these elucubrations ... but I am working on it.
Wait until some operation is executed on Quantums - then it will blow our minds - because this Gauss is a joke compared to Young
You are seeing something that's gonna change the way we see our universe in the upcoming 2-3 years! Save my comment!
Extremely fantastic implementation. I understood the whole diffusion ideas and all mathematical details made sense to me just from your codes.
Very good job! A note: to me, you should add torch.clamp(image, -1.0, 1.0) after each forward_diffusion_sample() call. You can check the behavior with and without the clamp when simulating forward diffusion. The images shown without the clamp seem "not naturally noisy", as the pixel range is no longer between -1 and 1. I don't know how much this affects the final training result; it should be tried.
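Something like this, right inside the plotting loop (a sketch; forward_diffusion_sample and show_tensor_image are the notebook's):

import torch

noisy_image, noise = forward_diffusion_sample(image, t)
# Keep the pixel values in the same [-1, 1] range as the scaled training images
noisy_image = torch.clamp(noisy_image, -1.0, 1.0)
show_tensor_image(noisy_image)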
Yes very good point. Later I also realized this and it actually led to an improvement (on a different dataset however). :)
Thank you! When I was watching I was wondering what would happen as you add the variance and the value exceeds 1. Your answer helped me understand it.
Unable to access the dataset - stanford-cars.
same here
Very nice video with a good explanation. I would like to point out that in your Block class, the same batchnorm is used in different places. Batchnorm is trainable and has weights, so you might want to treat it more like an actual layer rather than a memory-less operation like pooling and ReLU.
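I mean one BatchNorm per conv, roughly like this (a sketch of how I'd structure the Block; the time-embedding handling and up/down sampling are left out for brevity):

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm2d(out_ch)  # its own weights and running stats
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bnorm2 = nn.BatchNorm2d(out_ch)  # a separate layer, not shared
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bnorm1(self.conv1(x)))
        return self.relu(self.bnorm2(self.conv2(h)))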
Hi, thanks for pointing out. This was a little bug, which I've corrected in the original notebook. :)
Amazing video! Highly suggested before diving into the paper
what a great explanation, I will take a deeper look at the code. Thanks
You're welcome! :)
Loved the simple implementation also thanks for sharing additional articles
@13:09 Why isn't sqrt_recip_alphas used anywhere?
Also, why do you calculate sqrt_one_minus_alphas_cumprod, when in the equation we only have 1-alphas_cumprod? Is that a typo?
What's alphas_cumprod_prev exactly?
Can someone please explain what is being done here? And what's the posterior_variance?
Thanks a lot in advance
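Partially answering my own question after reading the DDPM paper: here is roughly how those tensors are precomputed in a standard DDPM setup (my own sketch following Ho et al., not the exact notebook code; sqrt_recip_alphas and posterior_variance are only needed later, in the reverse/sampling step):

import torch
import torch.nn.functional as F

T = 300
betas = torch.linspace(0.0001, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha-bar_t
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)  # alpha-bar_{t-1}

# The forward process samples with standard deviations, hence the square roots
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

# Used only in the reverse (sampling) step
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)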
Maybe it's because I came to this video too late: the Stanford Cars dataset link is now invalid, a 404 error.
Absolutely phenomenal content! Love it ❤️
The dataset is no longer available.
Bro, one day in the future when this channel becomes famous, don't forget I am one of your early fans!
Haha :D I won't
It's a good, concise walk-through with good code implementation examples. However I'd recommend avoiding some ambiguous variable names in code like betas, Block, etc.
The Stanford Cars dataset is no longer available in PyTorch datasets. Do you have any alternate locations for the same data?
Search for Kaggle Stanford Cars - I tried to add the link, but YouTube's AI is blocking all external links (it sends them to manual approval).
Question: at 15:18, why did we not directly scale between -1 and 1? Or are there two different tensors we are scaling, one between 0 and 1 and the other between -1 and 1?
Thanks a lot for your contribution on this. But I'm a bit confused: at 7:30, is q(Xt | Xt-1) the distribution that "the sampled noise" follows, OR the one that "the noised image" follows?
It's the distribution of the noised image :) the distribution of the noise is always gaussian. This formula expresses the mixture of the original input and the noise distribution, hence the distribution of the noised image.
@@DeepFindr Thanks for your reply!! Just one more, please..? Then q(Xt | Xt-1) = N(Xt; ..., BtI) means the variance of Xt is Bt? Someone says V(Xt) eventually becomes 1, so I'm a bit confused...
@@SeonhoonKim Bt is just the variance of this single step. Have a look at the "closed form" part with alpha. Ideally, alpha bar (the cumulative product) becomes 0 at the end, which leads to a variance of 1.
The dataset StanfordCars is no longer available; what alternative can I use?
^ having the same problem
Looks like the torchvision dataset for StanfordCars is now deprecated or sth; the original url from which the function pulls the data is closed
Hey, very interesting. The math is extremely comprehensible the way you explained it. Thanks
Glad you liked it!
At 10:19, q is the noise and the next forward image is x+q. Do I understand that right? Or are we just using q and x interchangeably?
Also, StanfordCars is no longer available; can you please change it?
Oh my god, this explanation is SUPER CLEAR! 🤯
Thank you! I really liked your graphic interpretation of the beta scheduling. It's missing in many other videos about diffusion.
Thanks, great explanation. I wonder how the sampling with T can produce plausible images in just a single pass. I would expect the sampling code to be called recursively T times, with one timestep at each call and z updated according to the result of the previous call, similar to autoregressive architectures.
Thanks! From the training result, what I saw was an image that went from its original version to a less noisy one? I was expecting to see a noisy image converted to a less noisy one or its original one.
Great video, really well explained!
One thing I found suspicious is that in the forward diffusion process you generate noise like this: "noise = torch.randn_like(x_0)". As far as I know, this samples the uniform distribution U(0,1) and not the standard Gaussian.
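A quick check of the two samplers (as far as I can tell, torch.randn_like is actually the standard normal one, and torch.rand_like is the uniform one):

import torch

x_0 = torch.zeros(3, 64, 64)
noise = torch.randn_like(x_0)    # standard normal N(0, 1)
uniform = torch.rand_like(x_0)   # uniform on [0, 1)

print(noise.mean().item(), noise.std().item())      # roughly 0.0 and 1.0
print(uniform.mean().item(), uniform.std().item())  # roughly 0.5 and 0.29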
Great effort, thank you! The simplified version of it is still complicated though :D I'll probably need to watch this a couple more times after reading the resources you attached.
Yes - I am at view number 10 ... and there will be 10 for the math (I kept the dessert for last 🙂)
Shouldn't you use separate BN layers for the 1st and 2nd convolution in a block? In your implementation the batch statistics are shared between the 2 layers, which seems to be a bug.
Yep, you are right. I updated the notebook.
Actually I also found that bug in a local version of the code and forgot to adjust the notebook. Bnorm layers can't be shared as each layer learns individual normalization coefficients.
Thanks for pointing this out :)
Damnnn I wish I watched your video first thing when trying to understand this. Great explanation
How do I make it match a prompt?
For that you need to conditionally generate images, for example using CLIP latents. This simply means that you add some text embedding during the denoising. The best explanation for this can probably be found here: jalammar.github.io/illustrated-stable-diffusion/
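A very rough sketch of that idea (purely illustrative, not the notebook's code; how the text embedding is produced and combined with the time embedding is an assumption here):

import torch
import torch.nn as nn

class ConditionalBlock(nn.Module):
    # Illustrative only: inject a text embedding alongside the time embedding
    def __init__(self, in_ch, out_ch, time_emb_dim, text_emb_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        self.text_mlp = nn.Linear(text_emb_dim, out_ch)  # e.g. a CLIP text embedding

    def forward(self, x, t_emb, text_emb):
        h = self.conv(x)
        # Broadcast both embeddings over the spatial dimensions and add them
        h = h + self.time_mlp(t_emb)[..., None, None]
        h = h + self.text_mlp(text_emb)[..., None, None]
        return h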
@@DeepFindr that’s interesting, ok, thank you!
We will have to add context to the model - I hope to come back with some examples
Thanks for a great tutorial. I think there is a small bug, though, in the implementation of the output layer of the UNet: the output channel dimension is swapped with the kernel size and fixed to 3. Shouldn't it look like this instead: self.output = nn.Conv2d(up_channels[-1], out_dim, 3)
Oh yes :D bugs everywhere.
With output dim 1 it would just produce black and white images, so this bug led to color ;-) have you tried it with another kernel size? Did it make a difference?
@@DeepFindr I actually tried it on medical MRI images which have only one color dim (greyscale). That is where the error was triggered. I kept the kernel size at 3, so no I can’t give any input on the influence of the kernel size.
Hello thank you for the video and code. I have two questions:
Q1- In the Block module 24:16 why is the input channel in `self.conv1` multiplied by 2? The input channel is twice the size of the output based on the `up_channels` list in the `init` of your SimpleUnet class. Is this related to adding "residual x as additional channels" at 24:50?
Q2- How do you control which direction the diffusion goes in? I know this is a very simplified example model, but how would you add the ability to steer the generation towards making a certain class of car, or a car based on text descriptions like "red SUV"? Is there a good explanatory paper, blog post, or video on this that you can recommend (preferably practical, without a lot of math)?
(Thank you again for the video)
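(Answering my own Q1 after staring at the code a bit more: yes, it's the concatenation of the residual - in the up path the upsampled features are concatenated with the stored residual along the channel dimension, so each up Block sees twice as many input channels. A tiny sketch of just that step, with made-up shapes:

import torch

up_features = torch.randn(1, 128, 32, 32)  # output of the upsampling path
residual = torch.randn(1, 128, 32, 32)     # stored from the matching down block

x = torch.cat([up_features, residual], dim=1)
print(x.shape)  # (1, 256, 32, 32) -> conv1 therefore needs 2x the channels
)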
We will have to add context to the model - I hope to come back with some examples
Thanks for the clear explanation and code. But I wonder, how can I use the trained model to generate images? Can you advise?
How to save the model and generate images after training?
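A rough sketch of how that could look with the notebook's pieces (assuming model, device, T, IMG_SIZE, sample_timestep and show_tensor_image as defined there; not tested against the exact code):

import torch

# Save the weights after training
torch.save(model.state_dict(), "ddpm_cars.pth")

# Later: reload and generate one image from pure noise
model.load_state_dict(torch.load("ddpm_cars.pth", map_location=device))
model.eval()

img = torch.randn(1, 3, IMG_SIZE, IMG_SIZE, device=device)
for i in reversed(range(T)):   # T-1, ..., 0
    t = torch.full((1,), i, device=device, dtype=torch.long)
    img = sample_timestep(img, t)
    img = torch.clamp(img, -1.0, 1.0)
show_tensor_image(img.detach().cpu())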
Can you make a video on *Conditional generation in Diffusion models*?
In the Google Colab there are log and exp in the Sinusoidal Embedding block. You did not explain where those come from; I don't see them in the formula at 20:26.
Hi :)
Some implementations of positional embeddings are calculated in log space, that's why you see exp and log there. This usually improves numerical stability and is sometimes also done for loss functions
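For reference, the exp/log is just a rewrite of the 1/10000^(2i/d) frequencies from the transformer-style formula; a typical implementation looks something like this (a sketch, close to what most diffusion repos use, not necessarily the notebook's exact code):

import math
import torch
from torch import nn

class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        half_dim = self.dim // 2
        # 10000^(-i/(half_dim-1)) computed as exp(-i * log(10000)/(half_dim-1))
        scale = math.log(10000) / (half_dim - 1)
        freqs = torch.exp(torch.arange(half_dim, device=time.device) * -scale)
        args = time[:, None].float() * freqs[None, :]
        return torch.cat((args.sin(), args.cos()), dim=-1)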
Thanks for putting this together
Really helpful content, and the recommended resources are very good, thanks
I have a question, sir.
At 13:22, the formula is (1 - alpha_bar), so why does the code use sqrt(1 - alpha_bar)?
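If I read it right, (1 - alpha_bar_t) is the variance of q(x_t | x_0), and the reparameterized sampling step needs the standard deviation, i.e. its square root: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, with noise ~ N(0, I). So the formula and the code agree; the root just comes from going from variance to standard deviation.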
Thank you! This is the best video I've ever seen
Glad that you liked it!
So, stupid question:
In the SimpleUnet class we define the output layer with a parameter of 3 to regain the number of channels our image has. Couldn't we then just pass the image_channels variable there? What if my image is grayscale and has only 1 channel?
Say I have a still image x0 and a pre-initialized noisy image N. I think I can apply noise to x0 by "(1-B)x0 + BN". When B=1, the output is N, the noisy image; when B=0, the output is the still image. But that's just the linear version.
How do I put my own training dataset to the ipynb script?
How can we save all the generated images? As far as my understanding goes, at the end of training there would be generated images of StanfordCars produced from completely noised images.
Very well explained and implemented 👏
Always been told after math classes: "You won't need that in real life anyways" xd
I can relate xD
but what is real life?
@@snoosri Don't take words so literally, especially on media. I believe you can grasp the meaning from context my friend :)
@@snoosriunderrated comment
So true
When you were explaining the code for the noise scheduler, the T value changed from 200 to 300, which I think should also be reflected in different (smaller) betas, because we end up with a smaller cumulative alpha.
Yes good point!
How many epochs does it take to produce anything that does not look like noise? I've downloaded the dataset from Kaggle and replaced the data loader code in the Colab. The forward process works, however training doesn't seem to work: the loss is stuck from the very beginning at ~0.81, it doesn't go down, and the sampled pictures still look like noise. I am at epoch 65 and it does not seem to improve at all.
I didn't understand... why do we have to convert the images to tensors?
Will it be possible to generate new images using this model if it is saved after training? Please share how to generate new images, if possible.
Why are we adding the time embedding to the input features, like literally adding them together? Would a simple concatenation of the input features and the time embedding also be possible? Btw, dope video, thanks for sharing.
Thank you mate for the video, can you make one for the "conditional Diffusion" too? thanks
What a great video! Loved it!
The explanation was awesome 🔥
Thank you! It was helpful
data = torchvision.datasets.StanfordCars(root=".", download=True) is giving an instant error?
Excellent explanation. I learned a lot from your video, thank you~
Really good introduction! Thanks
Hey I have a question: I think in the Colab notebook you only sample one time step from each image in a batch, but I was wondering why we don't take more intermediate time steps from each sample?
Thanks so much !! This is gold! Keep them coming. I’m curious to see the results of the model, can you share some more pictures?
Thank you! Happy that you liked it!
I only have the pictures at the end of this video. Unfortunately I didn't save the model weights after the longer training, because I thought I wouldn't need them anymore :/
If anyone wants to implement DDPM for time-series data, which model would be good instead of a UNet? Any suggestions?
The forward process is very clear. Could you also categorize the code blocks of the backward (reverse) process?
Thanks for the amazing video !
Did you share the code of the implementation for your personal GPU?
Hi, I don't have it anymore, but it's basically the same as this one, just in a Python file.
Also there were some comments below to change parts of the code e.g. Positional embeddings and final filter sizes that might improve the performance.
@@DeepFindr thanks, are you a researcher ?
I work in applied research in the industry. For me the sweet spot between pure research and building software products :)
Check the value of the variable "device" (it should be cpu or cuda (for GPU))
At 11:45, the third line of the formula should have a bar on the top of alpha_t
best explanation, and perhaps the sigma in the normal distribution graph should be sigma^2
Thanks for the great video first! I trained the model on a human face dataset using your code, but the sampling results show checkerboard (grid pattern) artifacts. How can I solve this?
Hi! Make sure to train the model long enough (e.g. Set 1000 epochs and see what happens). Also you might want to fine tune the model architecture and add more advanced layers like attention.
I also encountered weird patterns at first but after training longer the quality got better.
@@DeepFindr Thanks for the advice, I'll try it : )
It helps if you set the final layer's filter size to 1
Really nice exposition. Can you please elaborate on the specifications of the machine it was trained on and approx how long the training took?
Many thanks. Excellent explanation. Can we use the diffusion models for the deblurring of images? These are Generative models, and I want to use them for image restoration problems. Thanks
Hi!
Yes they can be used for image restoration as well. Have you seen this paper: arxiv.org/abs/2201.11793
:)
@@DeepFindr Excellent thanks. You are amazing.
Excellent tutorial, but it looks like the code needs to be updated; the Stanford Cars dataset is gone. It's available on Kaggle, so could you please update it accordingly?
Amazing explanation
I don't know what the p(x_{0:T}) means at 19:12
Hi, thank you for the well explained video! I've been following your code and training the same on StanfordCars dataset. At epoch 65, the sampled images of my training just come out as grey images. Is there something wrong with my training? Should I adjust the learning rate?
also having this issue, did you figure it out?
Yes, the images are close to gray - there is something hidden inside that averages the colors or loses the RGB towards gray.
Did anyone get an output? I don't know why I am getting only noisy images at epoch 0.
Hi, thanks for the awesome work. I would like to reduce the image size, but when I changed it, training stopped working; could you send me some info? Also, I would like to change your code to the DDIM method - is it only the sampling part that changes? Could you send me detailed info?
Thanks! Great animation and explanations..amazing 🙏
love the content brother
Thanks for this amazing video! Do you plan on extending this video to include conditional generation at some point in the future? I would love to see an implementation of the SR3/Palette models that use DDPM for image to image translation tasks such as super-resolution, JPEG restoration etc. In this case, the reverse diffusion process is conditioned on the input image.
That was really clear. Thank you !
Beautifully explained
Thanks a lot. Can I ask how to choose the number of timesteps in diffusion? Is a larger number of timesteps better?
Basically it's a hyperparameter. Not only the step size is relevant, but rather the beta schedule (so start and end values). In my experiments I was simply visualizing the data distributions to determine a good value. You have a good schedule if the last distribution follows a normal gaussian with zero mean and std 1. Also, I have the feeling that a higher number of steps leads to higher fidelity, but I didn't further look into this
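A quick way to check that last point (a sketch; dataloader, T and forward_diffusion_sample as in the notebook, whose exact signature may differ):

import torch

images, _ = next(iter(dataloader))
t = torch.full((images.shape[0],), T - 1, dtype=torch.long)
x_T, _ = forward_diffusion_sample(images, t)

# A good schedule pushes x_T towards N(0, 1): mean close to 0, std close to 1
print(x_T.mean().item(), x_T.std().item())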
Why is the time embedding added to the features after the first CNN layer in the UNet? Why not add the time embedding at the initial step (before the UNet)?
You could also do that, but I added the timestep in each of the Unet blocks.
I think that there are many possibilities to try things out :)
Do you know how to modify this diffusion model to accept a custom data set?
Yes, simply exchange the Dataset class with a custom dataset from pytorch. As long as it's images, the rest should work fine :)
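For example, with torchvision's ImageFolder (a sketch; IMG_SIZE and BATCH_SIZE as in the notebook, and the folder path is just a placeholder):

import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

data_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                     # scales to [0, 1]
    transforms.Lambda(lambda t: (t * 2) - 1),  # then to [-1, 1]
])

# ImageFolder expects one subfolder per class; the labels are simply ignored here
data = torchvision.datasets.ImageFolder("path/to/your/images", transform=data_transform)
dataloader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)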
Thank you for the awesome guide :D
Just 1 simple question. In the plotted image, are we looking at x0, x1, .... x10 where x0 is the image at the very left (denoised version) and x10 image at the very right (noised most)?
How can a model that is only 3.2 GB produce an almost infinite number of image combinations from just a simple text prompt, with so many language variables? What I am interested in is how a prompt of, say, "a monkey riding a bicycle" can produce something that visually represents the prompt. How are the training images tagged and categorized to do this?

As a creative person, we often say that an idea is still misty and not formed yet. What strikes me about this diffusion process is the similarity to how our minds seem to work at a creative level. We iterate and de-noise the concept until it becomes concrete, using a combination of imagination and logic. It is the same process that you described to arrive at the finished formula. What also strikes me about the images produced by these diffusion algorithms is that they look so creative and imaginative. Even artists are shocked when they see them for the first time and realize a machine made them.

My line of thinking here is that we use two main tools to acquire and simulate knowledge and experience: images and language. Maybe this input is then stored in a similar way to a diffusion model within our memory. Logic, creativity and ideas are just a consequence of reconstituting this data according to our current social or environmental needs. This could explain our thinking process and why our memory is of such low resolution. The de-noising process could also explain many human conditions such as depression, and even why we dream, etc.

This brings up the interesting question: could a diffusion model be created to simulate a human personality? Or provide new speed-think concepts and formulas for solving a multitude of complex problems, for that matter. The path would be: 1) diffusion model idea/concept, 2) ask a model like GPT-3 to check if it works, 3) feed back to the diffusion model and keep iterating until it does, in much the same way as de-noising a picture. Just a thought from a diffusion brain.
It's because the subset of possible images we humans are interested in is actually very specific. If you think about it, infinite combinations isn't that complicated. It's when we want specific things that you need more information. It only takes a few KB of code to make a pseudorandom number generator that can theoretically output every possible image, but we would see the vast majority of those permutations as boring rainbow noise. Ironically, the storage space used by generative models is needed to essentially explain what we DON'T want, so that we are left with the very specific subset that does meet our requirements.
1. 3.2 GB - I suppose you are talking about Midjourney and similar, as this model has only 249 MB.
2. Yes, we are reinventing the brain here. The kid who cannot reach the toy until his 200th move (epoch) develops the CNN inside his head to paint, and then sees an electronic CNN doing something similar... it's magic until you see this video 🙂
that's insane math
Awesome video! What software are you using to draw these examples?
Thanks!
Its nothing fancy - a mix of PowerPoint and DaVinci Resolve. :)
Great video! Can you give any tips for 128x128 image generation with that model, please?
Coincidence - I was working on that. Here are the 128x128 results (a 1152 x 768 mosaic) on my site Videnda AI.
I also added the models. I saved 2 versions, the state dictionary and the full model, as I noticed at sampling that you may find differences:
model_dict_epoch_250_128x128_cars.pth
model_full_epoch_250_128x128_cars.pth
model_dict_epoch_499_64x64_cars.pth
model_full_epoch_499_64x64_cars.pth
Thank you for the video.
very nice video and very easy to understand
How long did it take to train 500 epochs on your RTX 3060?
Hi! Good question, it was certainly several hours. I ran it overnight
Thanks for sharing this tutorial. It's very beginner-friendly.
goated content
How did you make sure that the cars generated at the end are truly original generations and not just copies of some cars in the dataset?
This actually relates to all generative models - how to make sure that the model doesn't simply memorize the train set.
For example I've also seen this discussion for GANs: www.lesswrong.com/posts/g9sQ2sj92Nus9DeKX/gan-discriminators-don-t-generalize
There is also some research in that direction: openreview.net/forum?id=PlGSgjFK2oJ
To answer your question: you need to sample some data points and compare them with the nearest matches in the Dataset to be sure the model didn't overfit. More data always helps of course, to make it less likely that the model memorizes the whole dataset.
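A simple version of that check could look like this (a sketch; "generated" is a batch of sampled images and dataloader is the training loader, both assumed to be on the CPU):

import torch

def nearest_train_neighbor(generated, dataloader):
    # For each generated sample, find the training image with the smallest
    # pixel-wise L2 distance, to eyeball whether it is a near-copy.
    best_dist = torch.full((generated.shape[0],), float("inf"))
    best_img = torch.zeros_like(generated)
    gen_flat = generated.flatten(1)
    for images, _ in dataloader:
        dists = torch.cdist(gen_flat, images.flatten(1))  # (n_generated, batch)
        min_d, idx = dists.min(dim=1)
        better = min_d < best_dist
        best_dist[better] = min_d[better]
        best_img[better] = images[idx[better]]
    return best_img, best_dist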