Discord link: discord.gg/4H8xxDF
Very similar to XLNet. If I remember correctly, it was also trained autoregressively and in a permuted order, similar to this, with extra tricks that let it train in parallel more efficiently. The paper's authors claimed the autoregressive training results in a better model and that a V2 was coming soon, but I haven't seen it. It seemed super impressive when it came out, but the idea also doesn't seem to have stood the test of time, since just training the MLM models longer and on comparable amounts of data beat it performance-wise.
8:50 BERT is actually also trained to correct some of the input tokens (15% of token positions are chosen, and 10% of those are replaced with a random token, so 15% * 10% = 1.5% overall). I suspect they could get better generation quality if they also allowed such token correction.
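For concreteness, that 15% / 10% arithmetic could be sketched like this (a toy corruption function in my own notation, not BERT's actual implementation):

```python
import random

def bert_corrupt(tokens, vocab, mask_token="[MASK]"):
    """BERT-style input corruption: 15% of positions are chosen as
    prediction targets; of those, 80% become [MASK], 10% become a random
    vocab token (the 'correction' case), and 10% stay unchanged."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < 0.15:              # position chosen for prediction
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                out[i] = mask_token             # 80% of chosen: [MASK]
            elif r < 0.9:
                out[i] = random.choice(vocab)   # 10% of chosen: random token
            # remaining 10% of chosen: keep the original token
    return out, targets

# Overall, a random token lands in about 15% * 10% = 1.5% of positions.
```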
Oh I like this idea!
Maybe the fact that even the stuff that's already there is being predicted could be exploited to let the generator change its mind somehow, deleting or replacing some pixels to converge to something better overall. This could even be done on an already complete image.
In fact that might be especially helpful for the text variant, so it can delete stuff that didn't work out after all.
The out-of-order decoding of ARDM seems really useful for efficient retrieval augmentation.
Love these straight to the point honest opinions :)
Not sure if it's mentioned, but there is a tradeoff during training: autoregressive models like GPT can train over a complete sample all at once, whereas here you need to pass all possible masks for it to "learn" the sample, i.e. training could be slower.
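As a rough illustration of that tradeoff, an ARDM-style training step only covers one random mask per sample, so many steps are needed to see all orders. A minimal NumPy sketch with a stand-in model (names and shapes are my own assumptions):

```python
import numpy as np

def ardm_train_step(model, x, vocab_size, rng):
    """One ARDM-style training step (sketch): sample a random number of
    hidden positions t and a random order, mask those positions, and
    compute cross-entropy only where tokens are hidden. Each step covers
    a single mask, hence the slower-training concern above."""
    B, D = x.shape
    t = rng.integers(1, D + 1, size=(B, 1))          # still-unknown positions per sample
    perm = np.argsort(rng.random((B, D)), axis=1)    # a random generation order
    mask = perm < t                                  # True where the token is hidden
    inp = np.where(mask, vocab_size, x)              # extra id plays the role of [MASK]
    logits = model(inp)                              # (B, D, vocab_size)
    # softmax cross-entropy against the true tokens
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    ce = -np.take_along_axis(logp, x[..., None], axis=-1)[..., 0]
    return (ce * mask).sum() / mask.sum()            # loss on masked positions only
```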
Just a random idea on splitting the vocabulary (32:40): you could cluster the vocab. This has been done before for hierarchical softmax, so you could still use the same idea as is used for the discretized pixel value classes.
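The clustering idea could look like a standard two-stage (hierarchical) softmax. A minimal sketch, where `cluster_of` and `idx_in_cluster` are assumed lookup tables:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def two_stage_prob(cluster_logits, within_logits, cluster_of, idx_in_cluster, token):
    """p(token) = p(cluster(token)) * p(token | cluster). With V tokens in
    K clusters of size V/K, each step scores K + V/K values instead of V."""
    c = cluster_of[token]
    return softmax(cluster_logits)[c] * softmax(within_logits[c])[idx_in_cluster[token]]
```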
Yannic you're a life saver
Thanks as always for great content! I wonder whether it is possible to predict some optimal order of decoding. Like, we generate the important details of the image, sentence, or other kind of data first (cats, dogs), and then refine the less important parts (the background). The important parts can serve as anchors for the generation.
Why do you think that a model which is not restricted to left-to-right sampling would always be beaten by an autoregressive model which is strictly left-to-right? Your argument was that the latter focuses very much on this one specific task. But I also see arguments the other way around: the arbitrary order could generalize better. Also, there are probably always better orders than left-to-right, and if the model can automatically use a better order, it could beat the strict left-to-right model.
It is a really powerful model, and IMO we can specialize it to a much larger number of tasks compared to GPT or GANs etc.
Watched twice and understood it ;-) Thanks for the video!
I did some research on this type of machine four years ago or so. Perhaps I should have stuck with it. The purpose was much better suited for this type of machine. I believe it is still being used in the software I integrated it into.
It does remind me of the denoising diffusion models, as BERT-like models are denoising autoencoders. Am I wrong?
Why are we using a categorical distribution? We are trying to predict pixel values, which in this case are RGB values. So what categories are being used to get the pixel values?
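For what it's worth, in these pixel models the categories are usually just the discretized intensity levels: each 8-bit channel becomes a 256-way classification problem. A minimal sketch:

```python
import numpy as np

def sample_pixel_channel(logits, rng):
    """Sample one 8-bit colour channel from a categorical distribution:
    the 256 'categories' are simply the intensity values 0..255, and the
    model outputs one such distribution per pixel per channel."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(256, p=p)   # a sampled intensity in [0, 255]
```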
best channel ever.
Now can we make the diffusion model predict a codebook for a vqgan?
Yes, it should definitely be possible to model the discrete latent code of a VQ-VAE with an ARDM. I guess the main advantage compared to VQ-GAN (which uses a classic ARM) would be the possibility of parallel decoding. Also, depending on the architecture, decoding of larger images might be possible (e.g. diffusion models frequently use a U-Net architecture with attention at its core).
10:40 This is like a regular transformer, but we are predicting more than one token at once and out of order. Or a BERT, but with multiple iterations.
29:42 I wonder what would happen if, at each step, each generated output pixel had a probability of being overwritten. The model would then have the option to reconsider its own previous predictions now that it has more input.
I would also like to see how much the output quality degrades as you decrease the number of steps.
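The overwrite idea from 29:42 could be prototyped with a decode loop like this (purely hypothetical, with a stand-in model that returns per-position distributions):

```python
import numpy as np

def decode_with_overwrite(model, D, vocab_size, steps, p_overwrite, rng):
    """Parallel decoding sketch: at each step predict every position, fill
    the still-empty ones, and with probability p_overwrite also resample
    positions that were already generated (the 'reconsider' idea above)."""
    x = np.full(D, vocab_size)                       # vocab_size acts as the empty/[MASK] id
    for _ in range(steps):
        probs = model(x)                             # (D, vocab_size) distributions
        for i in range(D):
            empty = x[i] == vocab_size
            if empty or rng.random() < p_overwrite:  # fill empties; sometimes overwrite
                x[i] = rng.choice(vocab_size, p=probs[i])
    return x
```

Fewer steps then directly trades quality for speed, which would make the degradation easy to measure.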
Yes, I've seen multiple people already wondering about the possibility of the model being able to refine its outputs; very interesting idea!
Yes, overwriting is clearly intriguing, although stability becomes a concern again, and I wonder if the naive approaches would be incentivized to output something very close to the training samples.
I bet it's a coincidence with the Bayesian autoencoder technique with multi-layer simultaneous differentials, something like zero-shot but in reverse.
TTP 13:25
.. Just kidding, nice video :)
damn looks cool
Not all languages are read from left to right
You can just reverse it before feeding into the model and then reverse it back after generation.
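Something like this, where `model_generate` is a hypothetical string-in, string-out generator:

```python
def generate_rtl(model_generate, prompt):
    """Sketch of the reversal trick for right-to-left scripts: reverse the
    text before feeding a left-to-right model, then reverse the output back."""
    return model_generate(prompt[::-1])[::-1]
```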
@@herp_derpingson Is it legal?
I've been working on a similar idea for the greater part of the last year. Gotta be faster! See the WaveFunctionCollapse procedural generation algorithm; it's simple yet incredibly powerful, and it works off the principle of generating the pixel that the "model" is most "certain" about at each step: ruclips.net/video/2SuvO4Gi7uY/видео.html
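That certainty-first ordering could be sketched like this (my own toy version, not the actual WaveFunctionCollapse implementation; the model is a stand-in that returns per-position distributions):

```python
import numpy as np

def most_certain_first(model, D, vocab_size, rng):
    """WaveFunctionCollapse-flavoured decoding order (a sketch): at each
    step, look at the model's distribution for every undecided position
    and 'collapse' the one with the lowest entropy, i.e. where the model
    is most certain, then recondition on it."""
    x = np.full(D, -1)                                 # -1 marks an undecided position
    order = []
    for _ in range(D):
        probs = model(x)                               # (D, vocab_size)
        ent = -(probs * np.log(probs + 1e-12)).sum(1)  # per-position entropy
        ent[x >= 0] = np.inf                           # never re-pick decided positions
        i = int(ent.argmin())                          # most certain undecided position
        x[i] = rng.choice(vocab_size, p=probs[i])
        order.append(i)
    return x, order
```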
It kind of works like how, when the universe decides that a particle exists in one position (when it is observed), it's as if that sucks 1.0 of mass from the probability density cloud. In the back of my mind I always kind of wondered how that worked and how that consistency is achieved, and I guess this decoding method is one way.