Watching the video is like very good pretraining before fine-tuning by actually reading the paper! Frankly, it has almost halved the time it takes me to go through (and understand) a paper!
I think your explanations are a lot better than Henry’s videos. I would love to hear your explanation, please don’t skip it!
I’m sure Yannic appreciates the compliment, but please don’t pit them against one another as “better than”. It’s better to say which you prefer; don’t minimize another’s hard work.
@@DistortedV12 I think Henry is doing great work too, really appreciate his channel! Not taking away from his work, they both are better than I'll ever be. I just like how Yannic makes his explanations geared towards beginners like me.
I wouldn't have learned a new cool concept without your explanation. Thank you so much!
I was surprised when the paper just came out and you already made a vid on it. Pro youtuber move... btw great explanation, love your content!
3:39 Please always make a full explanation, I can't get enough :)
damn you are speedy with the papers I love that
It would be cool to see if iGPT is good at image segmentation. Thanks for the great video!
Henry AI Labs looks interesting. Subbed to that too. It's a shame that YouTube's recommendation algorithm wasn't able to correlate the channels.
29:59 I think it's called hydra nets. The more sources of gradients you can give a neural network, the better it does. Even if the tasks are unrelated, as long as you have multiple heads at the end of the trunk, it works.
Nice. Would be interesting to see if there's a point where it becomes detrimental.
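For anyone unfamiliar with the multi-head ("hydra") pattern mentioned above, here is a minimal sketch, assuming a PyTorch-style setup with made-up task dimensions: one shared trunk receives gradients from several unrelated heads, and the losses are simply summed.

```python
# Minimal sketch (assumed PyTorch setup, hypothetical task dimensions) of a
# shared trunk with several task heads.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, in_dim=256, hidden=512, head_dims=(10, 5, 1)):
        super().__init__()
        # Shared trunk: every head backpropagates gradients into these weights.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head per task.
        self.heads = nn.ModuleList([nn.Linear(hidden, d) for d in head_dims])

    def forward(self, x):
        z = self.trunk(x)
        return [head(z) for head in self.heads]

# Training step: sum the per-task losses so every task contributes gradients
# to the shared trunk, even if the tasks are unrelated.
model = MultiHeadNet()
x = torch.randn(8, 256)
out = model(x)
targets = [torch.randint(0, 10, (8,)), torch.randint(0, 5, (8,)), torch.randn(8, 1)]
loss = (nn.functional.cross_entropy(out[0], targets[0])
        + nn.functional.cross_entropy(out[1], targets[1])
        + nn.functional.mse_loss(out[2], targets[2]))
loss.backward()
```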
A possible implication for future models could be that OpenAI may just use text and image data simultaneously in a combined model, i.e. read and produce image and text data at the same time. E.g. if crawled from the same web page or using captions, a model could potentially learn common representations. A month ago I would have said that this is pretty unreasonable (although there was previous work such as the Image Transformer), but given the kinds of model capacities we see now, I'm not so sure anymore.
Yes I guess VirTex is already going in that direction a bit, but having the transformer architecture throughout will certainly help
Would love to see some attention maps... Really difficult to visualize some sort of hierarchical attention and features like CNNs coming out of this!!
Thanks for the amazing video!!
2:20
The first one of the generated images is so cute.
I want it.
I think you received the paper a week earlier than anyone else, cause you're so fast XD
Just finished reading this paper in the afternoon.
I think the cool thing about this paper is the context reduction and how it can complete the image without permutation invariance over the channels.
Loved this video, and where this research is headed. However, this paper seems to address one of the most basic assumptions around semi-supervised model training schemes outlined in 'The Dark Matter of Intelligence' paper: that we can train vision models the same way we do NLP models, by semi-supervised prediction of the next PIXEL. The Dark Matter paper seems to have gone down a rabbit hole seeking various workarounds for the vision case. Your thoughts?
This is so amazing, it's fucked up. I'm glad I went to Uni to learn Computer Science 4 years ago (at age 38). This is stuff I can now get into more easily.
Great work!
Great job Yannic. Do you mind sharing, at the end of each video, what you're going to cover in the next one? It would allow audiences like me to go through the paper first and share our insights in the comment section when you post the video. Just my two cents.
Well you can pause the video, go read the paper and return to watch it...
Jeremy Kothe big brain
Haha never thought of that 😁 genius
So the quote "What I cannot create, I do not understand" also holds a bit for neural networks =).
Amazing Mate.
Man! You're so fast!
I love your videos, man
Did it figure out by itself that cats can hold a sheet of paper in their paws? Or are such images in the dataset?
I mean that's just common sense :D
This is awesome!!!!!!
I wonder how difficult it would be to switch from image blocking to adding noise and getting denoising out of this? Maybe the BERT model would work better for that.
How would you model noise in a linear fashion? I may be dumb, but I don't see how it will differentiate the information statistics from noise. You could use masking as in BERT, but then you would have to manually define the noise distribution at inference, defeating the purpose. I don't see it 🤔
@@Phobos11 randomly blocking multiple patches of an image and asking the model to predict those patches before averaging over all?
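For the curious, here is a rough sketch of the two corruption schemes being discussed: additive noise (denoising) versus randomly blanked patches (BERT-style masking). The patch size and noise level are arbitrary choices for illustration, not anything from the paper.

```python
# Hedged sketch of two corruption functions; the model would be trained to
# reconstruct the original images from either corrupted version.
import torch

def corrupt_with_noise(images, sigma=0.1):
    """Denoising objective: the model sees images + noise, predicts the clean images."""
    return images + sigma * torch.randn_like(images)

def corrupt_with_masked_patches(images, patch=4, mask_frac=0.15):
    """BERT-style objective: zero out random patches; the model predicts the blanked pixels."""
    b, c, h, w = images.shape
    out = images.clone()
    n_patches = (h // patch) * (w // patch)
    n_mask = int(mask_frac * n_patches)
    for i in range(b):
        idx = torch.randperm(n_patches)[:n_mask]
        for j in idx.tolist():
            r = (j // (w // patch)) * patch
            col = (j % (w // patch)) * patch
            out[i, :, r:r + patch, col:col + patch] = 0.0
    return out

imgs = torch.rand(2, 3, 32, 32)
noisy = corrupt_with_noise(imgs)
masked = corrupt_with_masked_patches(imgs)
# Either way, the reconstruction loss targets the original images:
# loss = ((model(corrupted) - imgs) ** 2).mean()
```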
oh... here it is. :O Thank you!
With "rolled-out" pixels, the last known pixel always has relationships to pixels at each fixed distance away. E.g. given a 32x32 image, the pixel at -1 distance from the pixel to be predicted has similar relationship to the pixel at -32 distance (-1 vertically before "roll-out"). -2 is similar to -64, etc. But with language, there's no repeating 32-word pattern, and there's never a similar relationship between two words at two fixed distances away (maybe in poetry!). Is that fact build into the model before training, or is that a type of "image grammar" that's learned by lower layers?
True. Good point. The model here has to learn these relationships
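A tiny snippet makes the "rolled-out" geometry concrete: after raster-scan flattening of a 32x32 image, the token 1 step back is the left neighbour and the token 32 steps back is the pixel directly above, a relationship the model has to discover on its own.

```python
# Map flattened raster-scan indices back to 2D positions for a 32x32 image.
WIDTH = 32

def flat_to_rc(i, width=WIDTH):
    """Convert a flattened raster-scan index back to (row, col)."""
    return i // width, i % width

i = 5 * WIDTH + 10                # current pixel at row 5, col 10
print(flat_to_rc(i))              # (5, 10)
print(flat_to_rc(i - 1))          # (5, 9)   left neighbour
print(flat_to_rc(i - WIDTH))      # (4, 10)  pixel directly above
print(flat_to_rc(i - 2 * WIDTH))  # (3, 10)  two rows up
```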
So does random cropping induce a non-localised storage patch of weights (in effect providing contrastive weight spaces), which can then combine in a 'holographic manner' to contribute towards an answer?
think eye saccades
I hope you're at least getting some sleep!
I would like to see this done with sparse attention using the row and column for queries and keys. Maybe then you don't have to downsize the images so much.
I don't have a PhD like many of the commenters here, so I'm sorry if my question sounds a bit dumb or goofy, but I wonder whether this paper and the last few papers Yannic has covered (like VirTex, for example) lead to an understanding of the generalizing capacity of the biological brain? Or do we still have a long way to go?
yes, I think what we're doing has relatively little to do with the brain as such :)
31:00 You could use a discriminator from a GAN, and I think that's the most common practice, but it wouldn't be pixel by pixel. Autoregressive models can also use convolutions, though (e.g. PixelCNN). They just kind of use half of a filter, because they can't see what's ahead as that would be cheating :P
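For reference, a minimal sketch of the masked convolution used in PixelCNN-style models: the kernel is zeroed at and beyond the centre position so a filter never sees the pixel it is predicting or anything later in raster-scan order. This is the standard trick, not code from this paper.

```python
# Masked convolution (mask type 'A'): the filter only sees pixels that come
# before the current one in raster-scan order.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, mask_type="A", **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        # Block the centre pixel (type A) and everything to its right...
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # ...and every row below the centre.
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero the "future" half of the filter
        return super().forward(x)

# Example: a 7x7 masked filter over a single-channel 32x32 image.
layer = MaskedConv2d(1, 16, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 32, 32))  # shape (1, 16, 32, 32)
```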
How exactly do we download this and use it?
Could this be used for compression by only storing the pixel if it's different from what's expected?
Very nice idea!
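As a toy illustration of that predictive-coding idea: store only the difference between each pixel and the model's prediction, then entropy-code the residuals. The predictor below is a hypothetical stand-in (it just repeats the previous pixel); a strong autoregressive model would make the residuals far more compressible.

```python
# Toy predictive-coding sketch: store residuals between actual and predicted
# pixels, then compress them with a generic entropy coder (zlib here).
import zlib
import numpy as np

def predict_next_pixel(prefix):
    """Hypothetical predictor: here it just repeats the previous pixel value."""
    return prefix[-1] if len(prefix) else 0

def compress(pixels):
    residuals = np.empty_like(pixels)
    for i, p in enumerate(pixels):
        residuals[i] = (int(p) - int(predict_next_pixel(pixels[:i]))) % 256
    return zlib.compress(residuals.astype(np.uint8).tobytes())

def decompress(blob, n):
    residuals = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    pixels = np.empty(n, dtype=np.uint8)
    for i in range(n):
        pixels[i] = (int(predict_next_pixel(pixels[:i])) + int(residuals[i])) % 256
    return pixels

pixels = np.array([10, 10, 11, 11, 11, 200, 201], dtype=np.uint8)
assert np.array_equal(decompress(compress(pixels), len(pixels)), pixels)
```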
Couldn’t the improved linear probing vs model size be just a result of better disentanglement with a larger layer getting probed?
Yes absolutely
Yes, depends on the actual entropy inherent in the input. A larger number of linear terms has a larger entropy, and therefore can support simpler, more linear representations, within that "bandwidth."
hmm, I don't think your comment about linear probing after fine-tuning is likely to help much. iiuc the linear probe accuracy at the last layer should re-discover the fine-tuning result (the 99% accuracy). It seems pretty unlikely (though not impossible) that removing later layers would help, unless you think the model is going to add too much noise in these layers and destroy signal from previous layers.
Yes, but I'm interested in what happens at the middle layers.
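For anyone wondering what a linear probe actually measures: freeze the pretrained network, treat the activations of one chosen layer as fixed features, and fit a plain linear classifier on top. In the sketch below, `extract_features` is a hypothetical placeholder (random features, so the snippet runs on its own); repeating it per layer gives the layer-vs-accuracy curve being discussed here.

```python
# Self-contained linear-probe sketch with placeholder features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def extract_features(images, layer_idx):
    """Placeholder for: run the frozen model up to `layer_idx`, mean-pool over positions."""
    rng = np.random.default_rng(layer_idx)
    return rng.normal(size=(len(images), 512))  # pretend 512-d features

images = np.zeros((1000, 32, 32, 3))            # dummy images
labels = np.random.randint(0, 10, size=1000)    # dummy labels

X = extract_features(images, layer_idx=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# Sweeping layer_idx over all blocks gives the layer-vs-accuracy curve; with
# random features like these the accuracy stays near chance.
```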
So BERT is an autoencoder objective, so the only difference here compared to people trying autoencoders ("back in the day!") for semi-supervised learning is self-attention and a lot more data? Pretty nuts. I guess the comparison of the autoregressive GPT objective against the autoencoder objective is something, though.
It's a de-noising autoencoder, which is not exactly the same as a classic autoencoder, but it does share the objective.
@@YannicKilcher Yup, that's true! I guess I think of denoising as a key augmentation to the normal autoencoder training objective.
@@kazz811 yes, that's a nice way of thinking about it. The other difference is that classical AEs usually have some sort of bottleneck in the middle, which is mostly absent from DAEs
@@YannicKilcher True! But pre-fine-tuning, the middle layers are best for transfer learning, so I guess that's consistent with the emergence of some sort of encoder-decoder structure. It's different with fine-tuning though, when BERT obviously improves substantially.
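To make the two pretraining objectives in this thread concrete, here is a hedged sketch over a flattened pixel sequence: the autoregressive (GPT) loss predicts each next token from everything before it, while the BERT-style loss masks a random subset and scores only those positions. The tiny model is a made-up stand-in, not iGPT.

```python
# Autoregressive vs BERT-style losses over a flattened pixel-token sequence.
import torch
import torch.nn.functional as F

def autoregressive_loss(model, seq):
    # seq: (batch, length) integer pixel tokens; predict positions 1..L-1.
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))

def bert_loss(model, seq, mask_id, mask_frac=0.15):
    mask = torch.rand_like(seq, dtype=torch.float) < mask_frac
    corrupted = seq.masked_fill(mask, mask_id)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], seq[mask])   # loss only on masked positions

# Tiny stand-in model so the snippet runs: embedding + linear head.
vocab = 512  # e.g. a clustered colour palette plus one extra mask token
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab + 1, 64)
        self.head = torch.nn.Linear(64, vocab)
    def forward(self, x):
        return self.head(self.emb(x))

model = TinyModel()
seq = torch.randint(0, vocab, (4, 64))  # 4 flattened "images" of 64 tokens each
print(autoregressive_loss(model, seq).item())
print(bert_loss(model, seq, mask_id=vocab).item())
```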
I'm a dummy who isn't good at computers, how do I use this program?
Probably not for a while :)
@@YannicKilcher Aw dang.
@@Wobuffet3 yeah, you need to install an old version of Ubuntu and try to figure out a lot of stuff.
How to use those!!!!!!
Isn't this similar to PixelGAN?
What if you train this stuff using memes ?
it's more gooder.