Great video! Small note: at 5:50 you say they map a 256x256x3 image to a 32x32x256-sized codeword. That's the size of the latent encoding before quantization, but each 256-dim vector is mapped to a single codeword, so the final representation has shape 32x32x1 (1024 codewords total). Later, the 'denoising' model uses a learned embedding to map each codeword to a new vector, giving a 32x32x320 tensor as the input to the denoising UNet.
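The shape bookkeeping can be sketched like this (a toy nearest-neighbour quantizer on random data, with made-up variable names; not the actual VQGAN code):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((1024, 256))   # 1024 codewords, 256-dim each
latent = rng.standard_normal((32, 32, 256))   # encoder output before quantization

# Quantize: each 256-dim vector maps to the index of its nearest codeword,
# so the final representation is a 32x32 grid of integer indices.
flat = latent.reshape(-1, 256)                                    # (1024, 256)
d2 = (flat**2).sum(1, keepdims=True) - 2 * flat @ codebook.T \
     + (codebook**2).sum(1)                                       # squared distances, (1024, 1024)
indices = d2.argmin(axis=1).reshape(32, 32)                       # (32, 32) codeword indices

# The 'denoising' model then embeds each index into a 320-dim vector:
embedding = rng.standard_normal((1024, 320))
unet_input = embedding[indices]               # (32, 32, 320) UNet input
```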
Thank you for the explanation! I was a bit surprised, first because the "latent" encoding is actually larger than the image (2**18 vs 3 * 2**16 values).
Very refreshing to see a CNN again :D
It feels like applying transformers to any topic is currently a guaranteed publication. Really like a CNN coming through as well.
Nice video!
Thanks for the great intro. I wonder why the authors don't use transformers, since denoising with a CNN is kind of local, and the coherency of a whole image requires communication between tokens (or, more intuitively, the denoising module could choose a group of coherent tokens as the denoised tokens).
Oh I found it. The channelwise convolution (an MLP actually) does the global communication.
By stacking many convolutional layers you can also achieve a global view of the entire image. And Stable Diffusion, for example, uses cross-attention that does not attend between image patches, but only between image patches and the text input.
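As a quick sanity check on the receptive-field point, here is back-of-envelope arithmetic assuming plain stride-1 3x3 convolutions (a simplification; real UNets also use downsampling, which grows the receptive field much faster):

```python
# Each stacked 3x3 convolution (stride 1) grows the receptive field by 2,
# so depth alone eventually gives every position a view of the whole
# 32x32 latent grid.
def receptive_field(num_layers: int, kernel: int = 3) -> int:
    return 1 + num_layers * (kernel - 1)

# How many layers until one output position "sees" all 32 positions?
layers_needed = next(n for n in range(1, 100) if receptive_field(n) >= 32)
print(layers_needed)  # 16
```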
Hey Le... er, Ms. Coffee Bean: a request. I love your videos, and watch them regularly, but since I am not actively (hands-on) working on these networks, it's hard to keep track of the steps from one video to another. I know that each video is essentially built upon the previous set (and you reference them excellently!), but I'm wondering if there is a clearer way to describe and explain these Generative AI papers in a more "standalone" fashion? Or, maybe it's impossible? I just feel like the last few months, the progress has been so 'exponential', that there has to be a better way of keeping abreast of things, for those like me who are also quite technical.
Regardless - keep it up!
Thanks, you made such a great point. I wrote this idea onto THE list. 😅
The problem is always that I have more ideas than time to do stuff. So for any YouTuber or person considering doing something on YouTube: there are lots of ideas and more topics than currently covered, so don't see the ML YouTube space as a competitive and crowded one. It's actually quite the opposite.
Very helpful videos
Thanks!
"No matter why you clicked the link". I clicked to see Miss Coffee Bean spin around and dance. XD
🤣😁
@@AICoffeeBreak Thanks for the videos! Lots of misunderstandings of the tech, let alone the legal and moral considerations online. So looking to get a broad understanding, while also being fair to all those involved.
Broad subject videos like yours are really helpful!
Wait WOT lol - this is really cool! This reminds me of the Cold Diffusion paper, which experimented with image degradations other than Gaussian noise. That being said... I think it's a bit odd to say the algorithm doesn't use transformers when the overall pipeline does; it still needs CLIP (a transformer-based text/image embedding model). I get that the actual denoiser network called Paella is fully convolutional, but the 'unnatural' problems you address at 3:00 - doesn't CLIP do all these things, and by extension the Paella pipeline (inheriting all of its problems)?
Separate question: one of the big promises of DDPM (the original diffusion method) is that it doesn't suffer from mode collapse, because of mathematical proofs showing that denoising can get you a score function. Since GANs often suffer from mode collapse, I wonder whether using the GAN component in the VQ-GAN will break these assumptions and lead to lower image generation diversity (a tradeoff for making it faster?).
Hey, can I ask you a question?
I'm very interested in ML even though I'm not in that branch of engineering. I've programmed, built, and used normal feed-forward NNs, autoencoders... But how did you reach your level of understanding? Do you stay very up to date with papers and stuff like that?
It's just that I think there is a very big gap between knowing the basic models and the level of your comment.
@@TileBitan Well I'm a PhD student now lol - it's my job to stay up to date on this kind of thing, as I'm researching this field :P TLDR, I *am* in that branch of science. I'm particularly interested in generative models and have watched many YouTube videos like these, but I also read blog posts a lot when I'm confused. In particular, I read a bunch of things on diffusion (I can't post links in YouTube comments; it deletes them automatically. But I would give you some links lol). I also have labmates who sometimes send interesting papers my way, and I went to NeurIPS/CoRL/other conferences, and that's a great way to keep up to date, as you can get two-minute summaries from basically any author there when you talk to them. In the case of Cold Diffusion... I don't actually remember where I first found out about it lol. As for knowing why DDPMs are said to have no mode collapse, that was probably first in some YouTube video and confirmed after reading the paper and a blog post on energy-based optimization (the VAE explanation of diffusion models by Lilian Weng confused me, but the energy-based one made sense). As for knowing that GANs have mode collapse... well, anyone who's worked with GANs knows how annoying it can be - it's not a very subtle problem when it doesn't work right =P
@@Neptutron Thanks for your comment. I'll definitely try to learn more in similar ways, because I'm obsessed with music generation and I want to make music that is clear and good through AI.
Does anyone have more information about the A100s that might have been used? I'm doing a back-of-the-envelope calculation to estimate the price of building a comparable training cluster. When I search for Nvidia A100 I get a wide range of specifications and price-points.
We used Stability's cluster, which I believe rents servers on AWS for 3-year periods, making it much cheaper than just renting the servers for the specific time. I believe one p4d node (8 A100s) costs about $8,000 per month on a 3-year commitment, whereas if you just rent for one month it's >$20k, I guess.
@@outliier Thank you so much. I'm clearly in the wrong business.
Latent Diffusion Models (LDMs) add Gaussian noise to the latent code, while Paella replaces random entries of the latent code with random entries from the dictionary.
Why is this better? Because the "noise" added in Paella comes only from "the dictionary", which means you don't need to deal with the diffusion model predicting a "mean" representation of the image?
Because diffusion models' intermediate steps look like an "average" representation of what the model thinks the current prediction is (because MSE performs mode averaging, I guess).
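A minimal sketch of this token-replacement "noising" as I read it (hypothetical function and variable names, not the authors' code): with probability t, each latent token is swapped for a random codebook index.

```python
import numpy as np

def noise_tokens(tokens, t, codebook_size, rng):
    """tokens: integer codebook indices; t in [0, 1] is the noise level."""
    mask = rng.random(tokens.shape) < t                       # which positions to corrupt
    random_tokens = rng.integers(0, codebook_size, size=tokens.shape)
    return np.where(mask, random_tokens, tokens)              # swap in random codewords

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(1, 32, 32))   # e.g. a 32x32 grid of codewords
noised = noise_tokens(tokens, t=0.5, codebook_size=1024, rng=rng)
```

At t=0 the tokens pass through unchanged; at t=1 every position is re-randomized, so the "fully noised" state is just a grid of random codebook entries rather than a Gaussian blur.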
Paella is very impressive! Especially with landscapes. I tried the Hugging Face demo and ran it without a prompt. Surprisingly, it consistently created images of sport jackets! I'm wondering if there is a reason for that behavior ¯\_(ツ)_/¯
No kink shaming. 😅
Now seriously: I cannot tell you why it chose the sports jackets from all things it could have chosen.
So this looks like it could give a nice initial image which, with a little noise added, could be fine-tuned with diffusion models. Can an expert weigh in on whether this makes sense?
Chatgpt
How did you know? 😅
Paella can be classified as a diffusion model in my view; it's still denoising.
Are BERT or MaskGIT diffusion models?
I get your point. It's hard to draw a line. Is it diffusion when we are denoising in 200 steps? In 100? What about 20? Or 8?
@@AICoffeeBreak I guess 'diffusion' is a confusing term, since it refers to the very specific (heat-diffusion-inspired) corruption used in the original papers. But to me the key insight isn't any of the mathy diffusion bits; it is the idea of iterative (rather than single-shot) 'uncorruption' of an image. So Paella and others that also take a few steps to iteratively refine the result are, I guess, 'diffusion-inspired', even though we can't technically call them diffusion models.
What's neat about the Paella approach is that there is no reason you couldn't use a transformer or a UNet with attention in place of the CNN-only UNet they use, which could potentially give much better results. You'd still get the benefits of using the quantized representation and predicting token likelihoods rather than raw pixel values. Going to be interesting to see what this inspires a few papers down the line :)
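The "predicting token likelihoods" part can be sketched as a per-position softmax over the codebook (random logits stand in for a real model's output here; the 32x32 grid and 1024 codewords are the sizes mentioned elsewhere in the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 32, 1024))     # per-position codeword scores

# Softmax over the codebook dimension, shifted for numerical stability
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Pick a codeword per position (argmax here; one could also sample)
tokens = probs.argmax(-1)                        # (32, 32) chosen codewords
```

The denoiser's output is thus a classification over discrete codewords at every position, rather than a regression to raw pixel values.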
@@Yenrabbit I thought the Stable Diffusion model does diffusion on a representation rather than on pixels.
@@davidyang102 Yes, Stable Diffusion also employs a variational autoencoder to encode/decode images into a latent representation.
The idea is neat. But it doesn't really work, compared to SD.
The what now? 😲