Paella: Text to image FASTER than diffusion models | Paella paper explained

  • Published: 3 Oct 2024

Comments • 35

  • @Yenrabbit
    @Yenrabbit 1 year ago +13

    Great video! Small note: at 5:50 you say they map a 256x256x3 image to a 32x32x256 sized codeword. That's the size of the latent encoding before quantization but each 256-dim vector is mapped to a single code-word so the final representation is shape 32x32x1 (1024 codewords total). Later, the 'denoising' model uses a learned embedding to map each codeword to a new vector to get a 32x32x320 tensor as the input to the denoising unet.
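
    A minimal sketch of those shapes in PyTorch (the codebook size below is an assumption, not stated in the thread; the 32x32 grid, 256-dim codes, and 320-dim denoiser embedding are taken from the comment above; this is illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

B = 1                                     # batch size
VOCAB = 8192                              # assumed codebook size (not stated in this thread)
z = torch.randn(B, 256, 32, 32)           # encoder output: 32x32 spatial positions, 256 channels
codebook = nn.Embedding(VOCAB, 256)       # codewords, each 256-dim

# Quantization: each 256-dim vector is replaced by the index of its nearest codeword.
flat = z.permute(0, 2, 3, 1).reshape(-1, 256)        # (B*32*32, 256)
dists = torch.cdist(flat, codebook.weight)           # (B*32*32, VOCAB)
indices = dists.argmin(dim=-1).reshape(B, 32, 32)    # (B, 32, 32): the "32x32x1" representation

# Size comparison from the reply below: the pre-quantization latent holds
# 32*32*256 = 2**18 values, more than the 256*256*3 = 3*2**16 input pixels,
# but the quantized code is only 32*32 integer indices.

# The denoising model re-embeds the indices with a separate learned embedding.
denoiser_embed = nn.Embedding(VOCAB, 320)
x = denoiser_embed(indices).permute(0, 3, 1, 2)      # (B, 320, 32, 32): input to the denoising UNet
print(indices.shape, x.shape)
```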

    • @cipritom
      @cipritom 1 year ago +1

      Thank you for the explanation! I was a bit surprised, first because the "latent" encoding is actually larger than the image (2**18 vs 3 * 2**16)

  • @DerPylz
    @DerPylz 1 year ago +11

    Very refreshing to see a CNN again :D

  • @Diego0wnz
    @Diego0wnz 1 year ago +11

    It feels like applying transformers to any topic is currently just a guaranteed publication. Really like the CNN coming through as well.

  • @hannesstark5024
    @hannesstark5024 1 year ago +5

    Nice video!

  • @gruffalosmouse107
    @gruffalosmouse107 1 year ago +4

    Thanks for the great intro. I wonder why the authors don't use transformers, as denoising by a CNN is kind of local, and the coherence of a whole image requires communication between tokens (or, more intuitively, the denoising module can choose a group of coherent tokens as the denoised tokens).

    • @gruffalosmouse107
      @gruffalosmouse107 1 year ago +4

      Oh I found it. The channelwise convolution (an MLP actually) does the global communication.

    • @outliier
      @outliier 1 year ago +5

      By stacking many convolutional layers, you can also achieve a global view of the entire image. And, for example, Stable Diffusion uses cross-attention, which does not attend between image patches; it only attends between image patches and the text input.
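
      A toy sketch of that kind of cross-attention (illustrative shapes and a generic PyTorch layer, not Stable Diffusion's actual implementation): image tokens form the queries, text tokens the keys and values, so image patches never attend to each other in this layer.

```python
import torch
import torch.nn as nn

d_model = 320
img_tokens = torch.randn(1, 32 * 32, d_model)   # queries: flattened image latent (1024 positions)
txt_tokens = torch.randn(1, 77, d_model)        # keys/values: text-encoder output (e.g. 77 tokens)

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
out, _ = cross_attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
print(out.shape)  # (1, 1024, 320): each image token is updated from text features only
```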

  • @zeevabrams
    @zeevabrams 1 year ago +9

    Hey Le... er, Ms. Coffee Bean: a request. I love your videos and watch them regularly, but since I am not actively (hands-on) working on these networks, it's hard to keep track of the steps from one video to another. I know that each video essentially builds upon the previous ones (and you reference them excellently!), but I'm wondering if there is a clearer way to describe and explain these generative AI papers in a more "standalone" fashion? Or maybe it's impossible? I just feel like the progress in the last few months has been so 'exponential' that there has to be a better way of keeping abreast of things for those like me who are also quite technical.
    Regardless - keep it up!

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +5

      Thanks, you made such a great point. I wrote this idea onto THE list. 😅
      The problem is always that I have more ideas than time to do stuff. So for any YouTuber or person considering doing something on YouTube: there are lots of ideas and more topics than are currently covered, so do not see the ML YouTube space as a competitive and crowded one. It's actually quite the opposite.

  • @musawarali7258
    @musawarali7258 1 year ago +2

    Very helpful videos

  • @TechyBen
    @TechyBen 1 year ago +2

    "No matter why you clicked the link". I clicked to see Miss Coffee Bean spin around and dance. XD

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +1

      🤣😁

    • @TechyBen
      @TechyBen 1 year ago +1

      @@AICoffeeBreak Thanks for the videos! There are lots of misunderstandings of the tech online, let alone of the legal and moral considerations. So I'm looking to get a broad understanding while also being fair to all those involved.
      Broad-subject videos like yours are really helpful!

  • @Neptutron
    @Neptutron 1 year ago +5

    Wait WOT lol - this is really cool! This reminds me of the Cold Diffusion paper, which experimented with image degradations other than Gaussian noise. That being said... I think it's a bit odd to say the algorithm doesn't use transformers when the overall pipeline does; it still needs CLIP (a transformer-based text/image embedding tool). I get that the actual denoiser network called Paella is fully convolutional, but the 'unnatural' problems you address at 3:00 - doesn't CLIP do all these things, and by extension the Paella pipeline (inheriting all of its problems)?
    Separate question: one of the big promises of DDPM (the original diffusion method) is that it doesn't suffer from mode collapse, because of some mathematical proofs saying that denoising can get you a score function. I wonder whether, since GANs often suffer from mode collapse, using the GAN component in the VQ-GAN will break these assumptions and lead to lower image generation diversity (a tradeoff for making it faster?).

    • @TileBitan
      @TileBitan 1 year ago +1

      Hey, can I ask you a question?
      I'm very interested in ML even though I'm not in that branch of engineering. I've programmed, built, and used normal feed-forward NNs, autoencoders... But how did you reach your level of understanding? Do you stay very up to date with papers and stuff like that?
      It's just that I think there is a very big gap between knowing the basic models and your comment.

    • @Neptutron
      @Neptutron 1 year ago

      @@TileBitan Well I'm a PhD student now lol - it's my job to stay up to date on this kind of thing, as I'm researching this field :P TLDR, I *am* in that branch of science. I'm particularly interested in generative models and have watched many YouTube videos like these, but I also read blog posts a lot when I'm confused. In particular, I read a bunch of things on diffusion (I can't post links in YouTube comments; it deletes them automatically. But I would give you some links lol). I also have labmates who sometimes send interesting papers my way, and I went to NeurIPS/CoRL/other conferences, and that's a great way to keep up to date, as you can get 2-minute summaries from basically any author there when you talk to them. In the case of cold diffusion... I don't actually remember where I first found out about it lol. As for knowing why DDPMs are said to have no mode collapse, that was probably first in some YouTube video and confirmed after reading the paper and a blog post on energy-based optimization (the VAE explanation of diffusion models by Lilian Weng confused me, but the energy-based one made sense). As for knowing that GANs have mode collapse... well, anyone that's worked with GANs knows how annoying it can be - it's not a very subtle problem when it doesn't work right =P

    • @TileBitan
      @TileBitan 1 year ago +1

      @@Neptutron Thank you for your comment. I'll definitely try to learn more in similar ways, because I'm obsessed with music generation and I want to make music that is clear and good through AI.

  • @ryanhewitt9902
    @ryanhewitt9902 1 year ago +10

    Does anyone have more information about the A100s that might have been used? I'm doing a back-of-the-envelope calculation to estimate the price of building a comparable training cluster. When I search for Nvidia A100 I get a wide range of specifications and price-points.

    • @outliier
      @outliier 1 year ago +6

      We used Stability's cluster, which I believe rents servers for 3-year periods on AWS, making it much cheaper than just renting the servers for the specific time. I believe one p4d node (8 A100s) costs about $8,000 per month on a 3-year commitment, whereas if you just rent for one month it's >$20k, I guess.
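
      For a rough sense of scale, here is the back-of-the-envelope math implied by those figures (the commenter's estimates, not official AWS pricing):

```python
# Commenter's rough estimates, not official AWS pricing.
reserved_per_month = 8_000    # one p4d node (8x A100) per month on a 3-year reservation
on_demand_per_month = 20_000  # same node rented month-to-month (">20k I guess")

hours_per_month = 30 * 24
print(f"reserved:  ~${reserved_per_month / (8 * hours_per_month):.2f} per A100-hour")   # ~$1.39
print(f"on-demand: ~${on_demand_per_month / (8 * hours_per_month):.2f} per A100-hour")  # ~$3.47
```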

    • @ryanhewitt9902
      @ryanhewitt9902 1 year ago

      @@outliier Thank you so much. I'm clearly in the wrong business.

  • @Youkouleleh
    @Youkouleleh 1 month ago

    So Latent Diffusion Models (LDMs) add Gaussian noise to the latent code, while Paella replaces random entries of the latent code with random entries from the dictionary.
    Why is this better? Because the "noise" added for Paella only comes from "the dictionary", which means you don't need to deal with the diffusion model predicting a "mean" representation of the image?
    Because the intermediate steps of diffusion models look like an "average" representation of what the model thinks the current prediction is (because MSE performs mode averaging, I guess).
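
    A small sketch of the difference between the two corruption processes (illustrative PyTorch; the shapes, codebook size, and noise schedule are assumptions, not the authors' implementation):

```python
import torch

torch.manual_seed(0)
vocab_size = 8192                                  # assumed codebook ("dictionary") size
tokens = torch.randint(vocab_size, (1, 32, 32))    # Paella-style quantized latent: integer indices
latent = torch.randn(1, 4, 32, 32)                 # LDM-style continuous latent
t = 0.3                                            # corruption level in [0, 1]

# LDM-style corruption: mix the continuous latent with Gaussian noise.
noisy_latent = (1 - t) ** 0.5 * latent + t ** 0.5 * torch.randn_like(latent)

# Paella-style corruption: replace a fraction t of positions with random codewords.
# Every corrupted position still holds a valid dictionary entry, never a blurry average.
mask = torch.rand(tokens.shape) < t
noisy_tokens = torch.where(mask, torch.randint_like(tokens, vocab_size), tokens)
```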

  • @Micetticat
    @Micetticat 1 year ago +7

    Paella is very impressive! Especially with landscapes. I tried the Hugging Face demo and ran it without a prompt. Surprisingly, it was consistently creating images of sport jackets! I'm wondering if there is a reason for that behavior ¯\_(ツ)_/¯

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +5

      No kink shaming. 😅
      Now seriously: I cannot tell you why it chose sports jackets of all the things it could have chosen.

  • @KonradVoelkel
    @KonradVoelkel 1 year ago +1

    So this looks like it could give a nice initial image which, maybe with a little noise added, could be refined with diffusion models. Can an expert weigh in on whether this makes sense?

  • @KeinNiemand
    @KeinNiemand 1 year ago +3

    ChatGPT

  • @rewixx69420
    @rewixx69420 1 year ago +7

    Paella can be classified as a diffusion model by me, it's still denoising.

    • @AICoffeeBreak
      @AICoffeeBreak  1 year ago +7

      Are BERT or MaskGIT diffusion models?
      I get your point. It's hard to draw a line. Is it diffusion when we are denoising in 200 steps? In 100? What about 20? Or 8?

    • @Yenrabbit
      @Yenrabbit 1 year ago +5

      @@AICoffeeBreak I guess 'diffusion' is a confusing term, since it refers to the very specific (heat-diffusion-inspired) corruption used in the original papers. But to me the key insight isn't any of the mathy diffusion bits; it is the idea of iterative (rather than single-shot) 'uncorruption' of an image. So Paella and others that also take a few steps to iteratively refine the result are, I guess, 'diffusion inspired', even though we can't technically call them diffusion models.
      What's neat about the Paella approach is that there is no reason you couldn't use a transformer or a UNet with attention in place of the CNN-only UNet they use, which could potentially give much better results. You'd still get the benefits of using the quantized representation and predicting token likelihoods rather than raw pixel values. It's going to be interesting to see what this inspires a few papers down the line :)
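
      A rough sketch of that iterative 'uncorruption' loop over token likelihoods, with a dummy stand-in for whatever network (CNN UNet, UNet with attention, or transformer) predicts the logits; the codebook size, schedule, and greedy sampling are simplifying assumptions, so this only shows the idea:

```python
import torch

vocab_size = 8192  # assumed codebook size

class DummyDenoiser(torch.nn.Module):
    """Placeholder for the real network; just returns random token logits."""
    def forward(self, tokens, text_emb):
        return torch.randn(tokens.shape[0], vocab_size, *tokens.shape[1:])

def iterative_refinement(denoiser, text_emb, steps=8, shape=(1, 32, 32)):
    tokens = torch.randint(vocab_size, shape)        # start from fully random tokens
    for i in range(steps):
        logits = denoiser(tokens, text_emb)          # (B, vocab_size, 32, 32): token likelihoods
        tokens = logits.argmax(dim=1)                # commit to the most likely codeword per position
        t = 1.0 - (i + 1) / steps                    # remaining corruption level
        if t > 0:
            # Re-corrupt a fraction t of positions with random codewords and refine again.
            mask = torch.rand(shape) < t
            tokens = torch.where(mask, torch.randint_like(tokens, vocab_size), tokens)
    return tokens

print(iterative_refinement(DummyDenoiser(), text_emb=None).shape)  # torch.Size([1, 32, 32])
```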

    • @davidyang102
      @davidyang102 1 year ago +1

      @@Yenrabbit I thought the Stable Diffusion model was doing diffusion on a representation rather than on pixels.

    • @outliier
      @outliier 1 year ago +1

      @@davidyang102 Yes, Stable Diffusion also employs a variational autoencoder to encode/decode images into a latent representation.

  • @GradientDude
    @GradientDude 1 year ago

    The idea is neat. But it doesn't really work, compared to SD.

  • @kallamamran
    @kallamamran 1 year ago +1

    The what now? 😲