OUTLINE:
0:00 - Intro & Overview
4:10 - Denoising Diffusion Probabilistic Models
11:30 - Formal derivation of the training loss
23:00 - Training in practice
27:55 - Learning the covariance
31:25 - Improving the noise schedule
33:35 - Reducing the loss gradient noise
40:35 - Classifier guidance
52:50 - Experimental Results
Will you cover Nvidia's or Intel's "AI photorealism" examples for turning game images into photorealistic ones? IIRC a new paper was just released on it. Still early work, but it's making better progress, as it no longer fails on the temporal or hallucination (artifacts/errors) problems.
Summary: self-supervised learning. Given a dataset of good images, keep adding Gaussian noise to create sequences of increasingly noisy images. Let the network learn to denoise images based on that. Then the network can "denoise" pure Gaussian random pictures into realistic pictures.
To do: learn some latent space (like a VAE-GAN does) so that it can smoothly interpolate between generated pictures and create nightmare art.
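If it helps, here is a minimal PyTorch-flavored sketch of that training idea (names like `alpha_bar` and the `model(x_t, t)` signature are my assumptions, not code from the paper):

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alpha_bar, optimizer):
    """One self-supervised step: corrupt x0 with Gaussian noise at a
    random timestep, then train the model to predict that noise.
    alpha_bar[t] is the cumulative product of (1 - beta_s) for s <= t."""
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)   # random step per image
    eps = torch.randn_like(x0)                        # the noise to predict
    a = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # closed-form forward jump
    loss = F.mse_loss(model(x_t, t), eps)             # the "L_simple" objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```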
Thanks a lot for the thorough explanation!
It's helping me figure out a topic for my master's degree.
Much much appreciated ^^
My boyfriend wrote these papers. Go Alex Nichol!
And I already felt sorry for your bf
You’ll have to compete for his attention with all the coding fanbois. Either way, lucky girl. Hold onto that guy.
With every great person, there is a great partner
@@LatinDanceVideos YouTube says her name is Samantha Nichol now, so I guess she took your advice.
Lose 10 pounds by cutting your head off??? 😂😂
That notation \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t \mathbf{I}) sets my teeth on edge. Doing this with P, a general PDF, is fine, but I would always write x_t \sim \mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1}, \beta_t \mathbf{I}), since \mathcal{N} is the Gaussian _distribution_ with a defined parameterization. BTW, the reason for the \sqrt{1-\beta_t}\,x_{t-1} is to keep the energy of x_t approximately the same as the energy of x_{t-1}; otherwise, the image would explode to a variance of roughly T\beta after T iterations. It's probably a good idea to keep the neural network inputs in about the same range every time.
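A quick sanity check of that energy argument (my own arithmetic, assuming x_{t-1} already has zero mean and unit variance):

```latex
x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\;\Longrightarrow\;
\operatorname{Var}(x_t) = (1-\beta_t)\operatorname{Var}(x_{t-1}) + \beta_t = (1-\beta_t) + \beta_t = 1,
```

whereas without the \sqrt{1-\beta_t} factor the variances would simply accumulate to \operatorname{Var}(x_0) + \sum_s \beta_s \approx \operatorname{Var}(x_0) + T\beta.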
Don’t forget to edit your text next time you paste it in😮
Historic video! Fun to see it now and compare it to the current state of image generation. I’ll check it again in two years to see how far we’ve got.
lol :)
Love it!! It's called the "number line" in English. Keep up the great work
Yannic, thanks for the video. The audio is a little soft even at max volume (unless I'm wearing my headphones). Is it possible to make it a bit louder?
Thanks a lot! Can't change this one, but I'll pay attention in the future
Yup, correct, most of your videos have quite low volume
Maybe this is just correct -- it's a regular hi-fi audiophile loudness level. Here, there is no need for hyper-compression filters like in commercials and cheap music videos.
@@JurekOK Maybe, but in practice, using my laptop speakers with Windows and YouTube volumes maxed out, it is still pretty low volume. I had to put subtitles on to make sure I didn't miss things here and there, and this was in a fairly quiet room.
Just amazing. I guess I might have spent another whole day reading this paper if I had missed your video. Grateful!
18:46 I guess it's very likely related to Shannon's sampling theorem: reconstructing the data distribution by sampling with the well-defined normal distribution. The number of time steps and beta are closely related to the bandwidth of the data distribution.
Great video! I was surprised to see this after the latest paper just a few days back! Thanks for the great explanations!
Can you please make a video about SNNs and the latest research on SNNs?
Another question. If the network is predicting the noise added to a noisy image, what do you then do with that prediction? Subtract it from the noisy image? Do you then run it back through the network to, again, predict noise?
When you train this network, do you train it to predict only the small amount of noise added to the image between forward-process steps? Or does it try to predict all the noise added to the image up to that point?
Or maybe it's more like the forward process? Starting with the latent x_T as input, the network gives you an 'image' that it thinks is on the manifold (x_{T-1}). At this point, it most likely isn't, but you can move 1/T of the way towards it, like we did moving towards the Gaussian noise to get to x_T. Then repeat...?
More examples and less math always help...
Yes, it's a step-by-step approach. Thus, when 'destroying' the image, the image at step t = the image at step t-1 + a noise step. You just keep adding / stacking noise, adding a bit more noise (to the previous noise) at each new step. It isn't really 'constant', though: the variance / amount of noise added depends on the time step and the schedule. A linear schedule would be constant (adding the same amount of noise at each step), but if you look at the images (de)generated that way, you get a quite long tail of images that contain nearly only noise. Therefore a cosine schedule is used, meaning the variance differs per step, ending up with more information left in the images at the later time steps.
The timestep is actually encoded into the model, so the parameters learned to predict the noise 'shift' depending on t. (At least... in my understanding / words. I'm just a dumb linguist - I don't know any maths either 😅.) Perhaps a better way to explain it is to imagine that at small t, the model can depend on all kinds of visual features (edges, corners, etc.) learned to predict noise. At large t, those features / params get less informative, so you rely on other features to estimate where the noise is. (Thus it's probably not the features that shift depending on t, but their weights.)
When generating a new image, you start at t = T, i.e., pure random noise only. The model first reconstructs to T-1, removing a little noise. Then, taking this image, you again remove a bit more noise, etc. It's an iterative process.
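To make the iterative answer concrete, here is a rough sketch of the reverse (ancestral sampling) loop, following the standard DDPM update; `model` and the shape conventions are assumptions on my part:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generate an image by iteratively denoising pure Gaussian noise.
    betas: 1-D tensor holding the noise schedule (linear or cosine)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                # x_T: pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                           # predicted noise at step t
        # remove a small, schedule-dependent fraction of the predicted noise
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # re-inject some noise
        else:
            x = mean                                      # last step is deterministic
    return x
```

The cosine schedule from the Improved DDPM paper would enter only through `betas` (derived from \bar{\alpha}_t = f(t)/f(0) with f(t) = \cos^2\big(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\big)); the loop itself is unchanged.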
Amazing explanation. Saved me a lot of time!! Thank you!
This makes me think that instead of super-resolution from a lower-res image, it could be even more effective to store a sparse pixel array (with high-res positioning). You could even have another net 'learn' a way of choosing, e.g., which 1000 pixels of a high-res image to store (the pixels providing the most information for reconstruction).
yes... yeeeeeesssssssssssss
Wow, that's a really great idea actually!
Your videos are amazing Yannic, keep it up. Much love
Fascinating, incredible video! Really appreciate the walkthrough! For example, the cosine vs. linear approach to making sure each step in the diffusion is useful - very interesting!
16:55 Denoising depends on the entire data distribution because adding random noise in one step can be done independently of all previous steps; just add a bit of noise wherever you like. But removing noise (the reverse) has to account for the noise added over some number of previous steps. Thus, in the example of denoising a small child's drawing, it's not that we're removing ALL the noise. Instead, the dependence problem arises in simply taking a single step towards a denoised picture.
Can anyone clarify/confirm?
I wonder if multiscale noise would work better. It'd fit better with convolutions. Instead of going from 0% to 100% noise, it could perturb everything from single pixels up to the whole image.
Any results (images) from generative models should be accompanied by the nearest neighbor (VGG latent, etc.) from the training dataset. I am going to train it on MNIST 🏋
There are nearest neighbors in the beginning of the appendix!
@@alexnichol3138 I retract my statement.
@@bg2junge I demand seppuku
This is me being lazy and not looking it up, but if they predict the noise instead of the image, do they actually get the image by iteratively subtracting the predicted noise from the noisy image until they get a clean one?
Yes, pretty much, except doing this in a probabilistic way where you try to keep track of the distribution of the less and less noisy images.
There is this step-wise generation in GANs too, not based on steps from noise to image, but based on the size of the image, like in ProGAN and MSG-GAN. In these models you have discriminators for different sizes of the image, kind of.
yes that should be the same right?
Are you saying it’s not the size of your GAN that matters, but how You use it? 😂
Great material! Honestly, I really enjoy your content!! Keep it up 👏👏
I would say that the \sqrt{1-\beta} is used to converge to \mathcal{N}(0, \sigma), mainly in its mean; otherwise, adding Gaussian noise would just (in expectation) keep x_0 as the mean, instead of 0.
44:14:
p(a|b,c) = p(a,c|b) / p(c|b) = p(a|b) * p(c|b,a) / p(c|b) = Z * p(a|b) * p(c|a,b),
and if c is independent of b given a:
= Z * p(a|b) * p(c|a),
where Z = 1 / p(c|b).
So, given that c is independent of b given a, p(a|b,c) = p(a|b) * p(c|a) / p(c|b).
Here a = x_t, b = x_{t+1}, c = y, Z = 1 / p(y|x_{t+1}).
Then they presumably consider y independent of x_{t+1} given x_t.
The problem is, if they consider y independent of x_{t+1} given x_t, they should arguably also consider y independent of x_t given x_{t+1}, which would basically say p(x_t|x_{t+1},y) = p(x_t|x_{t+1}).
But I guess that's the whole point: actually no, x_t contains more information about y than x_{t+1} does, so y is not independent of x_t given a noisier version of x_t (namely x_{t+1}).
I think it is more natural to do your derivation with a = x_t, b = y, c = x_{t+1}. A fitting probabilistic graphical model would then be y -> x_t -> x_{t+1}. So the class label y clearly determines the distribution of your image at any step, but given the current image x_t you already have a well-defined noise process that tells you how x_{t+1} will be obtained from x_t, and the label then becomes irrelevant.
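For reference, with that graph (y -> x_t -> x_{t+1}) the step both comments circle around works out cleanly; this is my restatement, so double-check against the paper:

```latex
p(x_t \mid x_{t+1}, y)
  = \frac{p(x_t \mid x_{t+1})\, p(y \mid x_t, x_{t+1})}{p(y \mid x_{t+1})}
  = \frac{p(x_t \mid x_{t+1})\, p(y \mid x_t)}{p(y \mid x_{t+1})}
  \propto p(x_t \mid x_{t+1})\, p(y \mid x_t),
```

using p(y | x_t, x_{t+1}) = p(y | x_t), which the graph does license (x_{t+1} adds nothing about y once x_t is known); the denominator is constant in x_t, so the classifier term p(y | x_t) simply re-weights the unconditional reverse step.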
I've only listened to 11 minutes so far, but DDPMs remind me a lot of Compressed (or Compressive) Sensing...
Same, the Steve Brunton videos :D
Thank you Yannic for the video. Quick question: why does adding Gaussian noise to an image require a multivariate Gaussian instead of just a 1D Gaussian? Are the extra dimensions used for the different color channels?
One dimension per pixel 🙂
Thanks a lot for this awesome video. I really needed it
Reminds me of normalizing flows... the direction of the flow leads to a normal form through multiple invertible transformations...
It looks like it, but the transformation (adding some noise) is stochastic and non-invertible.
What is the main takeaway?
Make data into noise, then learn to revert that process.
Train a denoiser, but don't add or remove all the noise in one step.
Diffusing noise with forward sampling is really more entropic in the context of accumulating shared data via the transformer, but visual autoencoders are thin for this Gaussian / Bayes-Gauss mixture, without one transformer per layer.
EDIT: I meant only the prescriptive sense of the statement above, no more.
It wouldn't be OpenAI if they actually released their pretrained models
ClosedAI
@@PaulanerStudios BURN
Well, it's a bit of a moot point now that Stable Diffusion has released theirs. Maybe it isn't matching DALL-E 2 in all areas yet, but it's coming pretty close, especially the 1.5 model (already on DreamStudio, though not available for download quite yet).
What software and hardware do you use to make these videos (drawing tablets, Adobe Reader, others)?
Can anyone tell me what we mean by x_0 ~ q(x_0)? In terms of pictures, what is x_0 and what is the data distribution?
Thank you.
The audio is a bit quiet in this video.
0:00 I didn't realize any of these were generated. Totally fooled my brain's discriminator.
29:00 How can the noise be less than the accumulated noise up to that point? Are we taking into account that some noise added later might undo previously added noise?
50:00 I am not sure how to carry the learnings from diffusion models over to GANs. The only thing I can think of is pre-training the discriminator with real images and noised real images, but that sounds so obvious I am sure hundreds of papers have already done it.
All in all, I would love to see more papers that make neural networks output weird things like probability distributions instead of simple images or word tokens.
Great video. Could you possibly up the volume level for the next video? I notice this video is much quieter than other videos I watch.
Can someone explain the noising process with some pseudocode? Is the noise constantly added (based on t) or blended (based on a percentage of T)? And of course, does it make a difference, and why?
EDIT: Never mind. I always figure it out after asking. :) (I generate some noise, and either blend or lerp towards it, as they are the same.)
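For anyone else wondering, here is a sketch of both views (my own code, not from the paper); because sums of independent Gaussians are Gaussian, the two sample from the same distribution:

```python
import torch

def noise_iteratively(x0, betas):
    """Add a small amount of fresh Gaussian noise at every step."""
    x = x0
    for beta in betas:
        x = (1 - beta).sqrt() * x + beta.sqrt() * torch.randn_like(x)
    return x

def noise_one_shot(x0, betas):
    """Blend x0 with a single noise draw via the closed form q(x_T | x_0)."""
    a_bar = torch.cumprod(1 - betas, dim=0)[-1]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)
```

Both return a sample from \mathcal{N}(\sqrt{\bar{\alpha}_T}\, x_0, (1-\bar{\alpha}_T)\mathbf{I}), which is exactly why training can jump straight to any timestep instead of looping.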
It seems you used a tool to concatenate two paper PDFs together? That's cool; would you mind telling me which tool?
If you're on Mac, there's a native script for that. /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py. Or you can just use Preview.app lol
Detecting signal inside the noise. Wow. It's like a super cheat for cheat sheets. And it works! :D
22:31 Can someone explain how eq. (12) is derived?
It is the product of two Gaussian distributions, q(x_t|x_{t-1}) and q(x_{t-1}|x_0). If Bayes' rule is applied to eq. (12), you get q(x_{t-1}|x_t,x_0) = q(x_t|x_{t-1},x_0) q(x_{t-1}|x_0) / q(x_t|x_0). They also use q(x_t|x_{t-1},x_0) = q(x_t|x_{t-1}), which holds exactly by the Markov property. Then, if the normalization term is absorbed, you get the expressions in (10) and (11).
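Completing the square in that Gaussian product gives the closed form the paper states; I'm reproducing the standard DDPM result from memory, so verify against the paper:

```latex
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big),
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t,
```
```latex
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t .
```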
By the way, these DDPM models seem very related (a practical simplification?) to Neural Autoregressive Flows, where each layer is invertible and performs a small distribution perturbation which vanishes with enough layers.
True! I think the important difference (an implementational simplification) is that you have no a priori restrictions on the DNN architecture here, i.e., the layers do not need to be invertible, and the idea is almost agnostic to the exact DNN architecture you use.
awesome video, thanks!
I was waiting for this, so I hadn't read the paper yet. Thanks, Yannic!
Please make explanation videos on Yang Song's papers too
Can you use this technique to erase adversarial attacks?
That's an interesting idea, although I think we would have to train the network specifically on adversarial noise. Maybe not, though. Not sure, but good idea regardless.
You'd have to be careful, because this technique relies on neural networks that can potentially be attacked
I wonder why they state that the (undefined) norm ||.|| of the covariance tends to 0. Doesn't it tend to whatever the norm of a uniform covariance matrix is?
Isn't the norm of a uniform covariance matrix, with mean = 0 and std = 1, zero?
@@herp_derpingson As far as I know, the norm of a symmetric matrix A is typically defined as the maximum of x^T A x over x^T x = 1. In the case of a standard normal distribution, the covariance is the identity, so x^T A x = 1 for any unit x, and the norm of the covariance would be 1. Am I wrong?
@@nahakuma Nah, I am a bit out of touch with math. You are probably right.
Hi, I watched the video, but this is not a topic I am familiar with. Could anyone please describe in a few sentences how this works? Especially how Disco Diffusion works. Where does it get the graphical elements for the images? How does it connect keywords from the prompt with the artists, the style, etc.? It seems I can use any keyword I want, but if there is a database, it should be limited. Is it trained somehow to learn what the different styles look like? What if I pick an uncommon keyword? So many questions to understand this incredible software. Thanks
50:24 "Distribution shmistribution" 🤩
Lightning
This paper is really well-written.
Amazing video.
Hi! Amazing video, thank you a lot!
But I'm a bit confused about one detail and have a stupid question. We train our model to predict \epsilon using x_t and t, and we also have the formula x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon.
From this we get x_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}.
Here we know the alphas because they are constants, we know x_t (at t = T it's just noise), and we know \epsilon because it is the output of our model -- so why can't we calculate the answer in just one step?
Would be very grateful for an answer!
I have the same question. My hypothesis is that such an x_0 would be very bad. Have you found the answer to this question?
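My understanding (hedged): you can compute that x_0 in one step, and samplers do compute it internally, but at large t the predicted \epsilon is only accurate on average, so the one-shot estimate looks like a blurry dataset-average image; iterating lets the model re-predict \epsilon on progressively cleaner inputs. A tiny sketch of the estimate itself (names are mine):

```python
def predict_x0(x_t, eps_pred, alpha_bar_t):
    """One-shot x0 estimate: exact if eps_pred were the true noise,
    blurry in practice because eps_pred is only the model's best guess."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
```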
How about explaining the code of this paper?
If you add noise from a standard normal distribution thousands of times, isn't the average noise (expected value) added close to zero, resulting in the same image?
Even if they were using standard Gaussians (they aren't), the sum of just two standard Gaussians X and Y is not a standard Gaussian (the variances add up)
But the variance will increase so significantly that it will be just noise. (Assuming the noise terms are all independent.)
@@samernoureddine Thank you! I assumed it was equivalent to sampling 1000 times (for example) from the same distribution N(0, var). Since these samples approximate the distribution N(0, var), I thought the mean of these values would be 0. But I should rather see it as a single sample from N(0, var+var+...+var), right? (since we add up the samples)
@@bertchristiaens6355 that would be right if they just wanted the noise distribution at some time t (and if the mean were zero: it isn't). But they want the noise distribution to evolve with time, and so the total noise at time t+1 is not independent from the total noise at time t
It's like a random walk, the more random choices you make, the further you get from where you started (but unpredictably so)
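The fact being used throughout this thread, for reference:

```latex
X \sim \mathcal{N}(\mu_1, \sigma_1^2),\quad Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\quad X \perp Y
\;\Longrightarrow\;
X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2),
```

so T independent noise draws have a variance that grows with T; the means only cancel in expectation, not for any particular sample.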
What's the purpose of the covariance matrix? Or covariance in general - why is it important to us?
The voice has a problem; it is very low. Please fix that in the next videos. Great video, thank you.
Turn volume up?
Your audio recording volume is too low; I have to increase my volume about 4x compared to other videos. Thanks for the content.
Hi! Please do something about your mic, because the video is very quiet.
But this random image at the end does not contain any information!
super dope
The problem with these solutions is their computing cost. I think they should focus more on that instead; they also rely too much on data.
After the updates, the paper has become so complex to read, with all the math.
Schmidhuber enters the chat.
And now these models are used in DALL·E 2
Video starts at 4:28.
Video ends at 54:33
Still confused about the math theory
Why don't they just use CLIP as a classifier? Does nobody know about this? lol
The better you are at detecting bullshit, the better you are at creating bullshit 😂. None of my work would ever be public-facing until I was sure I could always identify and manipulate it, and I'm sure that's true for any company or skilled researcher. ❤😢
Why even do it when you'd do it in such a hand-wavy manner?
Not too long
You can’t explain the equations. 🙄
This seemed much too long. For instance, you don't need to belabor the notion of denoising for minutes; noise reduction should be in people's vocabulary at this level. I'd suggest going directly to what diffusion models are and preparing succinct explanations instead of just going on for an hour.
Or you could simply skip the parts you already understand ;)
This is free knowledge... so try to criticize nicely or move on to another resource.
Ok, if it is so easy, just do a video yourself. We need videos about AI topics for viewers of all skill levels.
This seems like fair criticism, I don't see why they are being hostile with you
@@mgostIH I understand that some feel defensive, but it wasn't meant as an attack, rather an empowering observation. Communication is vastly more potent the more concise and clear it is.