Autoregressive Diffusion Models (Machine Learning Research Paper Explained)

  • Published: 5 Nov 2024

Comments • 31

  • @YannicKilcher
    @YannicKilcher  3 years ago +8

    Discord link: discord.gg/4H8xxDF

  • @ryanbaten
    @ryanbaten 3 years ago +12

    Very similar to XLNet. If I remember correctly, it was also trained autoregressively, in a permuted order similar to this. There were extra tricks that made it train in parallel more efficiently. The paper's authors claimed that the autoregressive training results in a better model and that they would have a V2 soon, but I haven't seen it. It seemed super impressive when it came out, but the idea also seems not to have stood the test of time, since just training the MLM models longer and on comparable amounts of data beat it performance-wise.

  • @Kram1032
    @Kram1032 3 years ago +11

    Oh I like this idea!
    Maybe the part where even the stuff that's already there is being predicted could be exploited to allow the generator to change its mind somehow, deleting/replacing some pixels to converge to something better overall. Could even be done on an already complete image.
    In fact that might be especially helpful for the text variant, so it can delete stuff that didn't work out after all.

  • @ChuanChihChou
    @ChuanChihChou 2 years ago +1

    8:50 BERT is actually also trained to correct some of the input tokens (15% of token positions are chosen, and 10% of those are replaced with a random token, i.e. 1.5% overall). I suspect they could get better generation quality if they also allowed such token correction.
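    For reference, a minimal plain-Python sketch of BERT's standard 80/10/10 corruption rule (not taken from the BERT codebase); the 10% random-replacement branch is where the 1.5% figure above comes from.

      import random

      def bert_corrupt(tokens, vocab, mask_token="[MASK]", select_p=0.15):
          """BERT-style corruption: choose 15% of positions, then
          80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
          corrupted, targets = list(tokens), {}
          for i, tok in enumerate(tokens):
              if random.random() >= select_p:
                  continue                              # position not selected for the loss
              targets[i] = tok                          # the model must predict the original here
              r = random.random()
              if r < 0.8:
                  corrupted[i] = mask_token             # 15% * 80% = 12% of all tokens masked
              elif r < 0.9:
                  corrupted[i] = random.choice(vocab)   # 15% * 10% = 1.5% replaced by a random token
              # else: token kept unchanged but still predicted (another 1.5%)
          return corrupted, targets

    So only roughly 1.5% of all positions ever show the model a genuinely wrong token that it has to correct.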

  • @sujovian
    @sujovian 8 months ago

    The out-of-order discernment of ARDM seems really useful for efficient retrieval augmentation.

  • @SLAM2977
    @SLAM2977 3 years ago +2

    Love these straight to the point honest opinions :)

  • @박재성학생물리·천문
    @박재성학생물리·천문 1 year ago

    Yannic you're a life saver

  • @CristianGarcia
    @CristianGarcia 3 years ago +1

    Not sure if it's mentioned, but there is a tradeoff during training: autoregressive models like GPT can train over a complete sample all at once, whereas here you need to pass all possible masks for it to "learn" the sample, i.e. training could be slower.
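    To make the tradeoff concrete, here is a rough PyTorch-style sketch (hypothetical, with a trivial stand-in model and a guessed mask id, not the paper's code): the left-to-right model gets a loss at every position from one pass, while the order-agnostic variant only trains on the positions hidden by one randomly sampled mask per pass.

      import torch
      import torch.nn.functional as F

      vocab, seq_len, batch = 256, 32, 8
      model = torch.nn.Linear(seq_len, seq_len * vocab)          # stand-in for a real transformer
      x = torch.randint(vocab, (batch, seq_len))

      # GPT-style: every position is a training target in a single forward pass.
      logits = model(x.float()).view(batch, seq_len, vocab)
      ar_loss = F.cross_entropy(logits.reshape(-1, vocab), x.reshape(-1))

      # ARDM-style: sample how many tokens are already "known", mask the rest,
      # and only the masked positions contribute to this pass's loss
      # (the paper's exact reweighting is omitted here).
      t = torch.randint(1, seq_len + 1, (batch, 1))
      ranks = torch.argsort(torch.rand(batch, seq_len), dim=-1)  # a random order per sample
      mask = ranks >= t                                          # positions still to be generated
      x_in = x.masked_fill(mask, 0)                              # 0 used as a [MASK] id (assumption)
      logits = model(x_in.float()).view(batch, seq_len, vocab)
      nll = F.cross_entropy(logits.reshape(-1, vocab), x.reshape(-1), reduction="none")
      ardm_loss = (nll.view(batch, seq_len) * mask).sum(-1).div(mask.sum(-1).clamp(min=1)).mean()

    In expectation only about half of the positions receive a gradient per ARDM pass, which is one way training coverage of a sample can be slower.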

  • @priancho
    @priancho 2 years ago

    Watched twice and understood it ;-) Thanks for the video!

  • @sacramentofwilderness6656
    @sacramentofwilderness6656 3 years ago +1

    Thanks as always for the great content! I wonder whether it is possible to predict some optimal order of decoding: first generate the important details of the image, sentence or any other kind of data (cats, dogs), and then refine the less important parts such as the background. The important parts could serve as anchors for generation.

  • @AlbertZeyer
    @AlbertZeyer 2 years ago

    Just a random idea on splitting the vocabulary (32:40): you could cluster the vocab. This has been done before for hierarchical softmax, so you could still use the same idea as is used for the discretized pixel value classes.
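    A small sketch of the factorization being suggested (hypothetical PyTorch, not anything from the paper): cluster the vocabulary, predict the cluster first, then predict the token within that cluster, so log p(token) = log p(cluster) + log p(token | cluster), and only one cluster's output layer has to be evaluated.

      import torch
      import torch.nn.functional as F

      hidden, n_clusters, cluster_size = 512, 32, 1024           # 32 * 1024 = 32768-token vocab
      cluster_head = torch.nn.Linear(hidden, n_clusters)
      token_heads = torch.nn.ModuleList(
          torch.nn.Linear(hidden, cluster_size) for _ in range(n_clusters)
      )

      def log_prob(h, token_id):
          """log p(token | h) = log p(cluster | h) + log p(token within cluster | h)."""
          cluster_id, within_id = token_id // cluster_size, token_id % cluster_size
          log_p_cluster = F.log_softmax(cluster_head(h), dim=-1)[cluster_id]
          log_p_within = F.log_softmax(token_heads[cluster_id](h), dim=-1)[within_id]  # one head only
          return log_p_cluster + log_p_within

      print(log_prob(torch.randn(hidden), token_id=12345))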

  • @AlbertZeyer
    @AlbertZeyer 2 years ago +1

    Why do you think that a model which is not restricted to left-to-right sampling would always be beaten by an auto-regressive model which is strictly left-to-right? Your argument was that the latter would focus very much on this specific task. But I also see multiple arguments the other way around: The arbitrary order could generalize better. And also, there are probably always better orders than left-to-right, and when the model can automatically use a better order, it could beat the strict left-to-right model.

  • @SuperJg007
    @SuperJg007 3 years ago +1

    best channel ever.

  • @nauman.mustafa
    @nauman.mustafa 3 years ago +1

    It is a really powerful model, and IMO we can specialize it to a much larger number of tasks compared to GPT or GANs, etc.

  • @ssssssstssssssss
    @ssssssstssssssss 3 years ago

    I did some research on this type of machine four years ago or so. Perhaps I should have stuck with it. The use case was much better suited to this type of machine. I believe it is still being used in the software I integrated it into.

  • @sarvagyagupta1744
    @sarvagyagupta1744 3 years ago

    Why are we using a categorical distribution? We are trying to predict pixel values, which in this case are RGB values. So what categories are being used to get the pixel values?
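    For what it's worth, the usual recipe (an assumption about the standard pixel-modeling setup, not a quote from the paper): each colour channel is discretized into 256 intensity levels, so the categories are simply the values 0..255 per channel, e.g.:

      import torch
      from torch.distributions import Categorical

      # One categorical over 256 intensity levels per colour channel of every pixel.
      batch, channels, height, width, levels = 4, 3, 32, 32, 256
      logits = torch.randn(batch, channels, height, width, levels)   # what the network would output
      sample = Categorical(logits=logits).sample()                   # integers in 0..255, shape (4, 3, 32, 32)
      image = sample.float() / 255.0                                 # back to [0, 1] pixel values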

  • @marouanemaachou7875
    @marouanemaachou7875 3 years ago +1

    It does remind me of denoising diffusion models, since BERT-like models are denoising autoencoders. Am I wrong?

  • @G12GilbertProduction
    @G12GilbertProduction 3 years ago

    I bet it's a coincidence with the Bayesian autoencoder technique with multi-layer simultaneous differentials, something like zero-shot but in reverse.

  • @tripzero0
    @tripzero0 3 years ago +2

    Now can we make the diffusion model predict a codebook for a VQGAN?

    • @bluel1ng
      @bluel1ng 3 years ago +1

      Yes, it should definitely be possible to model the discrete latent code of a VQ-VAE with an ARDM. I guess the main advantage compared to VQ-GAN (which uses a classic ARM) would be the possibility of parallel decoding. Also, depending on the architecture, decoding of larger images might be possible (e.g. as diffusion models frequently use a U-Net architecture with attention at its core).
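      A rough sketch of that parallel decoding (hypothetical code; model(codes, unknown) is an assumed callable returning one categorical logit vector per grid position): several code positions are sampled independently per forward pass, so a 16x16 VQ grid needs far fewer than 256 passes.

        import torch

        def parallel_decode(model, grid_len=256, codebook=1024, steps=16):
            """Fill a flat grid of VQ code indices, grid_len // steps positions per forward pass."""
            codes = torch.zeros(grid_len, dtype=torch.long)
            unknown = torch.ones(grid_len, dtype=torch.bool)
            order = torch.randperm(grid_len)                        # a random generation order
            per_step = grid_len // steps
            for s in range(steps):
                idx = order[s * per_step:(s + 1) * per_step]        # positions decoded this step
                logits = model(codes, unknown)                      # one pass for the whole grid
                probs = torch.softmax(logits[idx], dim=-1)
                codes[idx] = torch.multinomial(probs, 1).squeeze(-1)  # independent samples in parallel
                unknown[idx] = False
            return codes

        # e.g. with a dummy "model": parallel_decode(lambda codes, unknown: torch.randn(256, 1024))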

  • @herp_derpingson
    @herp_derpingson 3 years ago +5

    10:40 This is like a regular transformer, but we are predicting more than one token at once and out of order. Or a BERT, but with multiple iterations.
    .
    29:42 I wonder what would happen if, at each step, each generated output pixel had a probability of being overwritten. So the model would now have the option to reconsider its own previous predictions now that it has more input.
    .
    I would like to see how much the output quality degrades as you decrease the number of steps.

    • @YannicKilcher
      @YannicKilcher  3 years ago

      Yes, I've seen multiple people already wondering about the possibility of the model being able to refine its outputs. Very interesting idea!

    • @thegistofcalculus
      @thegistofcalculus 3 years ago

      Yes, overwriting is clearly intriguing, although stability becomes a concern again, and I wonder if the naive approaches would be incentivized to output something very close to the training samples.

  • @thomashirtz
    @thomashirtz 3 years ago +1

    TTP 13:25
    .. Just kidding, nice video :)

  • @matthieulin335
    @matthieulin335 3 years ago +5

    damn looks cool

  • @patf9770
    @patf9770 2 years ago +2

    Been working on a similar idea for the greater part of the last year. Gotta be faster! See the WaveFunctionCollapse procedural generation algorithm; it's simple yet incredibly powerful and works off the principle of generating the pixel that the "model" is most "certain" about at each step: ruclips.net/video/2SuvO4Gi7uY/видео.html
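    A tiny sketch of that most-certain-first heuristic (hypothetical; model(values, unknown) is an assumed callable returning one categorical logit vector per position): at each step, commit the still-unknown position whose predictive distribution has the lowest entropy, loosely mirroring WaveFunctionCollapse.

      import torch

      def lowest_entropy_decode(model, n_positions=256, n_classes=256):
          """Greedy order selection: always decode the position the model is most certain about."""
          values = torch.zeros(n_positions, dtype=torch.long)
          unknown = torch.ones(n_positions, dtype=torch.bool)
          for _ in range(n_positions):
              log_p = torch.log_softmax(model(values, unknown), dim=-1)
              entropy = -(log_p.exp() * log_p).sum(-1)              # per-position uncertainty
              entropy[~unknown] = float("inf")                      # never revisit decided positions
              pos = int(entropy.argmin())                           # the most "certain" cell
              values[pos] = torch.multinomial(log_p[pos].exp(), 1).item()
              unknown[pos] = False
          return values

      # e.g. lowest_entropy_decode(lambda values, unknown: torch.randn(256, 256))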

  • @billykotsos4642
    @billykotsos4642 3 years ago +2

    Not all languages are read from left to right

    • @herp_derpingson
      @herp_derpingson 3 years ago

      You can just reverse it before feeding into the model and then reverse it back after generation.

    • @arturprzybysz6614
      @arturprzybysz6614 3 years ago

      @@herp_derpingson Is it legal?

  • @Gogargoat
    @Gogargoat 3 years ago +2

    Kind of works similarly to how, when the universe decides that a particle exists in one position (when it is observed), it's as if that sucks 1.0 mass from the probability density cloud. In the back of my mind I always kind of wondered how that worked and how that consistency is achieved, and I guess this decoding method is one way.