Watching the video is like very good pretraining before fine-tuning by actually reading the paper! Frankly, it has almost halved the time it takes me to go through (and understand) a paper!
I think your explanations are a lot better than Henry’s videos. I would love to hear your explanation, please don’t skip it!
I’m sure Yannic appreciates the compliment, but please don’t pit them against one another as “better than”. It’s better to say which you prefer; don’t minimize another’s hard work.
@@DistortedV12 I think Henry is doing great work too, really appreciate his channel! Not taking away from his work, they both are better than I'll ever be. I just like how Yannic makes his explanations geared towards beginners like me.
I wouldn't have learned a new cool concept without your explanation. Thank you so much!
I was surprised when the paper just came out and you already made a vid on it. Pro youtuber move... btw great explanation, love your content!
3:39 Please always make a full explanation, I can't get enough :)
damn you are speedy with the papers I love that
It would be cool to see if iGPT is good at image segmentation. Thanks for the great video!
Henry AI Labs looks interesting. Subbed to that too. It's a shame that YouTube's recommendation algorithm wasn't able to correlate the channels.
29:59 I think it's called hydra nets. The more sources of gradients you can give a neural network, the better it does. Even if the tasks are unrelated, as long as you have multiple heads at the end of the trunk, it works.
Nice. Would be interesting to see if there's a point where it becomes detrimental.
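For anyone unfamiliar with the multi-head ("hydra") pattern mentioned above, here is a minimal sketch, assuming a PyTorch-style setup with made-up task dimensions: one shared trunk receives gradients from several unrelated heads, and the losses are simply summed.

```python
# Minimal sketch (assumed PyTorch setup, hypothetical task dimensions) of a
# shared trunk with several task heads.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, in_dim=256, hidden=512, head_dims=(10, 5, 1)):
        super().__init__()
        # Shared trunk: every head backpropagates gradients into these weights.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per task.
        self.heads = nn.ModuleList([nn.Linear(hidden, d) for d in head_dims])

    def forward(self, x):
        z = self.trunk(x)
        return [head(z) for head in self.heads]

# Training step: sum the per-task losses so every task contributes gradients
# to the shared trunk, even if the tasks are unrelated.
model = MultiHeadNet()
x = torch.randn(8, 256)
out = model(x)
targets = [torch.randint(0, 10, (8,)), torch.randint(0, 5, (8,)), torch.randn(8, 1)]
loss = (nn.functional.cross_entropy(out[0], targets[0])
        + nn.functional.cross_entropy(out[1], targets[1])
        + nn.functional.mse_loss(out[2], targets[2]))
loss.backward()
```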
A possible implication for future models could be that OpenAI may just use text and image data simultaneously in a combined model, i.e. read and produce image and text data at the same time. E.g. if crawled from the same web page or using captions, a model could potentially learn common representations. A month ago I would have said that this is pretty unreasonable (although there was previous work such as the Image Transformer), but given the kinds of model capacities we see now, I'm not so sure anymore.
Yes I guess VirTex is already going in that direction a bit, but having the transformer architecture throughout will certainly help
Would love to see some attention maps... Really difficult to visualize some sort of hierarchical attention and features like CNNs coming out of this!!
Thanks for the amazing video!!
2:20
The first one of the generated images is so cute.
I want it.
I think you received the paper a week earlier than anyone else, cause you're so fast XD
Just finished reading this paper in the afternoon.
I think the cool thing about this paper is the context reduction and how it can complete the image without permutation invariance over the channels.
Loved this video, and where this research is headed. However, this paper seems to address one of the most basic assumptions around semi-supervised model training schemes outlined in 'The Dark Matter of Intelligence' paper: that we can train vision models the same way we do NLP models, by semi-supervised prediction of the next PIXEL. The Dark Matter paper seems to have gone down a rabbit hole seeking various workarounds for the vision case. Your thoughts?
This is so amazing, it's fucked up. I'm glad I went to Uni to learn Computer Science 4 years ago (at age 38). This is stuff I can now get into more easily.
Great work!
Great job Yannic. Do you mind sharing, at the end of each video, what you're going to cover in the next one? It would allow audiences like me to go through the paper first and share our insights in the comment section when you post the video. Just my two cents.
Well you can pause the video, go read the paper and return to watch it...
Jeremy Kothe big brain
Haha never thought of that 😁 genius
So the quote "What I cannot create, I do not understand" also holds a bit for neural networks =).
Amazing Mate.
Man! You're so fast!
I love your videos, man
Did it figure out by itself that cats can hold a sheet of paper in their paws? Or are such images in the dataset?
I mean that's just common sense :D
This is awesome!!!!!!
I wonder how difficult it would be to switch from image blocking to adding noise and getting denoising out of this? Maybe the BERT model would work better for that.
How would you model noise in a linear fashion? I may be dumb, but I don't see how it will differentiate the information statistics from noise. You could use masking as in BERT, but then you would have to manually define the noise distribution at inference, defeating the purpose. I don't see it 🤔
@@Phobos11 randomly blocking multiple patches of an image and asking the model to predict those patches before averaging over all?
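For the curious, here is a rough sketch of the two corruption schemes being discussed: additive noise (denoising) versus randomly blanked patches (BERT-style masking). The patch size and noise level are arbitrary choices for illustration, not anything from the paper.

```python
# Hedged sketch of two corruption functions; the model would be trained to
# reconstruct the original images from either corrupted version.
import torch

def corrupt_with_noise(images, sigma=0.1):
    """Denoising objective: the model sees images + noise, predicts the clean images."""
    return images + sigma * torch.randn_like(images)

def corrupt_with_masked_patches(images, patch=4, mask_frac=0.15):
    """BERT-style objective: zero out random patches; the model predicts the blanked pixels."""
    b, c, h, w = images.shape
    out = images.clone()
    n_patches = (h // patch) * (w // patch)
    n_mask = int(mask_frac * n_patches)
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_mask]
        for j in idx.tolist():
            r = (j // (w // patch)) * patch
            col = (j % (w // patch)) * patch
            out[i, :, r:r + patch, col:col + patch] = 0.0
    return out

imgs = torch.rand(2, 3, 32, 32)
noisy = corrupt_with_noise(imgs)
masked = corrupt_with_masked_patches(imgs)
# Either way, the reconstruction loss targets the original images:
# loss = ((model(corrupted) - imgs) ** 2).mean()
```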
oh... here it is. :O Thank you!
With "rolled-out" pixels, the last known pixel always has relationships to pixels at each fixed distance away. E.g. given a 32x32 image, the pixel at -1 distance from the pixel to be predicted has similar relationship to the pixel at -32 distance (-1 vertically before "roll-out"). -2 is similar to -64, etc. But with language, there's no repeating 32-word pattern, and there's never a similar relationship between two words at two fixed distances away (maybe in poetry!). Is that fact build into the model before training, or is that a type of "image grammar" that's learned by lower layers?
True. Good point. The model here has to learn these relationships
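A tiny snippet makes the "rolled-out" geometry concrete: after raster-scan flattening of a 32x32 image, the token 1 step back is the left neighbour and the token 32 steps back is the pixel directly above, a relationship the model has to discover on its own.

```python
# Map flattened raster-scan indices back to 2D positions for a 32x32 image.
WIDTH = 32

def flat_to_rc(i, width=WIDTH):
    """Convert a flattened raster-scan index back to (row, col)."""
    return i // width, i % width

i = 5 * WIDTH + 10                # current pixel at row 5, col 10
print(flat_to_rc(i))              # (5, 10)
print(flat_to_rc(i - 1))          # (5, 9)   left neighbour
print(flat_to_rc(i - WIDTH))      # (4, 10)  pixel directly above
print(flat_to_rc(i - 2 * WIDTH))  # (3, 10)  two rows up
```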
So does random cropping induce a non-localised storage patch of weights (in effect providing contrastive weight spaces), which can then combine in a 'holographic manner' to contribute towards an answer?
think eye saccades
I hope you're at least getting some sleep!
I would like to see this done with sparse attention using the row and column for queries and keys. Maybe then you don't have to downsize the images so much.
I don't have a PhD like many of the commenters here, so I'm sorry if my question sounds a bit dumb or goofy, but I wonder whether this paper and the last few papers Yannic has covered (like VirTex, for example) lead to an understanding of the generalizing capacity of the biological brain? Or do we still have a long way to go?
yes, I think what we're doing has relatively little to do with the brain as such :)
31:00 You could use a discriminator from a GAN, and I think that's the most common practice, but it wouldn't be pixel by pixel. Autoregressive models can also use convolutions, though (e.g. PixelCNN). They just kind of use half of a filter, because they can't see what's ahead as that would be cheating :P
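For reference, a minimal sketch of the masked convolution used in PixelCNN-style models: the kernel is zeroed at and beyond the centre position so a filter never sees the pixel it is predicting or anything later in raster-scan order. This is the standard trick, not code from this paper.

```python
# Masked convolution (mask type 'A'): the filter only sees pixels that come
# before the current one in raster-scan order.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        # Block the centre pixel (type A) and everything to its right...
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ...and every row below the centre.
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero the "future" half of the filter
        return super().forward(x)

# Example: a 7x7 masked filter over a single-channel 32x32 image.
layer = MaskedConv2d(1, 16, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 32, 32))  # shape (1, 16, 32, 32)
```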
How exactly do we download this and use it?
Could this be used for compression by only storing the pixel if it's different from what's expected?
Very nice idea!
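As a toy illustration of that predictive-coding idea: store only the difference between each pixel and the model's prediction, then entropy-code the residuals. The predictor below is a hypothetical stand-in (it just repeats the previous pixel); a strong autoregressive model would make the residuals far more compressible.

```python
# Toy predictive-coding sketch: store residuals between actual and predicted
# pixels, then compress them with a generic entropy coder (zlib here).
import zlib
import numpy as np

def predict_next_pixel(prefix):
    """Hypothetical predictor: here it just repeats the previous pixel value."""
    return prefix[-1] if len(prefix) else 0

def compress(pixels):
    residuals = np.empty_like(pixels)
    for i, p in enumerate(pixels):
        residuals[i] = (int(p) - int(predict_next_pixel(pixels[:i]))) % 256
    return zlib.compress(residuals.astype(np.uint8).tobytes())

def decompress(blob, n):
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    pixels = np.empty(n, dtype=np.uint8)
    for i in range(n):
        pixels[i] = (int(predict_next_pixel(pixels[:i])) + int(residuals[i])) % 256
    return pixels

pixels = np.array([10, 10, 11, 11, 11, 200, 201], dtype=np.uint8)
assert np.array_equal(decompress(compress(pixels), len(pixels)), pixels)
```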
Couldn’t the improved linear probing vs model size be just a result of better disentanglement with a larger layer getting probed?
Yes absolutely
Yes, depends on the actual entropy inherent in the input. A larger number of linear terms has a larger entropy, and therefore can support simpler, more linear representations, within that "bandwidth."
hmm, I don't think your comment about linear probing after fine-tuning is likely to help much. iiuc the linear probe accuracy at the last layer should re-discover the fine-tuning result (the 99% accuracy). It seems pretty unlikely (though not impossible) that removing later layers would help, unless you think the model is going to add too much noise in these layers and destroy signal from previous layers.
Yes, but I'm interested in what happens at the middle layers.
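For anyone wondering what a linear probe actually measures: freeze the pretrained network, treat the activations of one chosen layer as fixed features, and fit a plain linear classifier on top. In the sketch below, `extract_features` is a hypothetical placeholder (random features, so the snippet runs on its own); repeating it per layer gives the layer-vs-accuracy curve being discussed here.

```python
# Self-contained linear-probe sketch with placeholder features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def extract_features(images, layer_idx):
    """Placeholder for: run the frozen model up to `layer_idx`, mean-pool over positions."""
    rng = np.random.default_rng(layer_idx)
    return rng.normal(size=(len(images), 512))  # pretend 512-d features

images = np.zeros((1000, 32, 32, 3))            # dummy images
labels = np.random.randint(0, 10, size=1000)    # dummy labels

X = extract_features(images, layer_idx=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# Sweeping layer_idx over all blocks gives the layer-vs-accuracy curve; with
# random features like these the accuracy stays near chance.
```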
So BERT is an autoencoder objective, so the only difference here compared to people trying autoencoders ("back in the day!") for semi-supervised learning is self-attention and a lot more data? Pretty nuts. I guess the comparison of the autoregressive GPT objective against the autoencoder objective is something, though.
It's a de-noising autoencoder, which is not exactly the same as a classic autoencoder, but it does share the objective.
@@YannicKilcher Yup, that's true! I guess I think of denoising as a key augmentation to the normal autoencoder training objective.
@@kazz811 yes, that's a nice way of thinking about it. The other difference is that classical AEs usually have some sort of bottleneck in the middle, which is mostly absent from DAEs
@@YannicKilcher True! But pre-fine-tuning, the middle layers are best for transfer learning, so I guess that's consistent with the emergence of some sort of encoder-decoder structure. It's different with fine-tuning though, when BERT obviously improves substantially.
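To make the two pretraining objectives in this thread concrete, here is a hedged sketch over a flattened pixel sequence: the autoregressive (GPT) loss predicts each next token from everything before it, while the BERT-style loss masks a random subset and scores only those positions. The tiny model is a made-up stand-in, not iGPT.

```python
# Autoregressive vs BERT-style losses over a flattened pixel-token sequence.
import torch
import torch.nn.functional as F

def autoregressive_loss(model, seq):
    # seq: (batch, length) integer pixel tokens; predict positions 1..L-1.
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))

def bert_loss(model, seq, mask_id, mask_frac=0.15):
    mask = torch.rand_like(seq, dtype=torch.float) < mask_frac
    corrupted = seq.masked_fill(mask, mask_id)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], seq[mask])   # loss only on masked positions

# Tiny stand-in model so the snippet runs: embedding + linear head.
vocab = 512  # e.g. a clustered colour palette plus one extra mask token
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab + 1, 64)
        self.head = torch.nn.Linear(64, vocab)
    def forward(self, x):
        return self.head(self.emb(x))

model = TinyModel()
seq = torch.randint(0, vocab, (4, 64))  # 4 flattened "images" of 64 tokens each
print(autoregressive_loss(model, seq).item())
print(bert_loss(model, seq, mask_id=vocab).item())
```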
I'm a dummy who isn't good at computers, how do I use this program?
Probably not for a while :)
@@YannicKilcher Aw dang.
@@Wobuffet3 yeah, you need to install an old version of Ubuntu and try to figure out a lot of stuff.
How to use those!!!!!!
Isn't this similar to PixelGAN?
What if you train this stuff using memes ?
it's more gooder.