The paper says "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.
Exactly. One of the images shows a dog in a sofa and only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task dependent, so how could it know whether to segment the dog or its limbs for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.
This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background. Check the running dog example on the project page and look at the last frame where the dog is absent.
@@randomthoughts3009 Well, that it has very faint recognition of other things isn't really an excuse. But I guess it could simply result from the focus of the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.
I've read the paper and sadly I didn't find anything new. They just gathered some techniques that already existed and implemented them in a self-supervised way. Funny that it's called DINO, "DIstillation with NO labels", yet normal distillation training doesn't use any labels at all 😂
Many papers are like that. Although the idea is very simple, they try to dress it up to make it look complicated and plausible. I didn't find this paper impressive at all.
WTF. I found this and was just about to suggest it to you over LinkedIn, and then thought... what if I just check whether there are any YouTube videos on it first...
The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in childhood; no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species - hell, of every species. I know cows have a type of attention and understanding which we might call autistic: wherever they walk, if something unknown is in the proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.
Off-topic thing - would you be open to adding donation options in a proof-of-stake coin? I don't have strong opinions about which one; I'd convert to whatever you think is a good option. I don't want to fund GPU demand with my donation :)
Yannic, your cooking video did terribly because this is an AI channel. None of your viewers want to see you cook, even if the recipe was written by an AI.
I mean, of course that's what happens if you make content aimed at a different audience. At the same time, branching out is necessary for channel growth, and most big channels went through a phase where they "changed audiences". I personally liked the cooking video.
OUTLINE:
0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments
Paper: arxiv.org/abs/2104.14294
Blog: ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Code: github.com/facebookresearch/dino
My Video on ViT: ruclips.net/video/TrdevFK_am4/видео.html
My Video on BYOL: ruclips.net/video/YPfUiOMYOEE/видео.html
From the paper it is not clear at all that they detach the gradients of the teacher via the center (C) variable. I'll have to look at their repo to see what is going on. Typically, operations like `mean` still propagate gradients in PyTorch.
Yep, and it didn't help much that they seem to code like 9-year-olds, but from line 304 of main_dino.py ( github.com/facebookresearch/dino/blob/a15f6afee2f3b868f44a5021a0951f718c8e2dd5/main_dino.py#L304 ) it seems clear they are NOT detaching all gradients from the teacher network in the `update_center` method.
it will not be a problem since they don't seem to be using those gradients anywhere, although I haven't verified it
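For reference, the usual way to make the gradient question moot is to keep the whole teacher-side update outside of autograd. A minimal PyTorch sketch of that pattern (the function name, momentum values, and exact update order are my assumptions, not taken from the repo):

```python
import torch

@torch.no_grad()  # everything on the teacher side is excluded from autograd
def update_teacher_and_center(student, teacher, center, teacher_out,
                              momentum=0.996, center_momentum=0.9):
    """Hypothetical sketch of the EMA teacher update plus the center update.

    The teacher parameters become an exponential moving average of the
    student's, and `center` tracks a running mean of teacher outputs.
    Wrapping both in no_grad (or detaching the teacher output) means no
    gradient can flow back through the center, regardless of how `mean`
    behaves elsewhere.
    """
    # EMA update: teacher <- m * teacher + (1 - m) * student
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_((1.0 - momentum) * ps.detach())
    # Running mean of teacher outputs, used for centering the teacher logits
    batch_center = teacher_out.mean(dim=0, keepdim=True)
    center.mul_(center_momentum).add_((1.0 - center_momentum) * batch_center)
    return center
```

Whether the repo actually does the equivalent of this everywhere is exactly the question raised above.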
Thanks a lot Yannic for covering DINO, that’s really an honor ! I’m a big fan of your channel :D
Hi. Enjoyed the paper and the explanation given in this video. Thank you both.
Are you aware of any robustness analysis (in the context of adversarial examples) done for DINO?
I know version 2 is out. Still, congrats on this breakthrough work!
"learning to predict cat from cat ear" is a good summary of this paper.
simplest form of AI lol
I am super impressed by how you nailed the pronunciation of every single name of the paper's authors.
Your comment on augmentations is spot on! I have worked with BYOLs in clinical images for a while now, and choosing the correct augmentations makes a heck of a difference, and there is no way to know the right augmentation without trial and error! I think that's a major downside of BYOL, which will obviously percolate to DINO as well. Thanks for your presentation of the paper!
As often said, what a time to be alive!
wrong channel xd
If you are the guy from "Two Minute Papers", excellent work - we are living in an extraordinary time of human history.
@@michaelwangCH Let's enjoy it; progress is exponential, and we are in a steep region! I've decided to ignore the question of whether we are near the end of human history, at the point where the progress curve goes to infinity...
Actually, I'm not afraid: progress is exponential because making progress means adding knowledge, tools and scientists, which in turn allows faster progress.
But I think it is actually a logistic development that looks very much like exponential growth, except that instead of reaching a singularity, the curve begins to get less steep. That happens when a finite resource is involved. But as a physicist I say: no problem, the observable universe is finite.
@@vsiegel We have a log curve between scientific output (progress) and the resources we put in - the work of a researcher is getting harder, and the difficulty increases every year - in other words, the difficulty increases exponentially with time - which is bad for scientific progress and for the distribution of societal resources. E.g. CERN, with over 6000 scientists and $14B+ in fixed costs per year: those resources could probably be used more productively in other areas of science.
@@vsiegel That assumes knowledge can be added up like chocolate cakes, but that's a wrong hypothesis for individual humans and for humanity. Focusing more knowledge on one topic means other topics get less attention and are forgotten. This is why the definition of "progress" must be chosen carefully, and by some definitions DINO does not represent progress in itself, as it can have negative indirect effects, like any digital technology.
Surprisingly fluent pronunciation of the authors .... bet that took more takes than one would expect :)
Cool paper, thanks for the review!
About centering vs. sharpening, you are right: centering avoids the collapse because each unsupervised class gets pushed towards an equal running average, i.e., each unsupervised class should pick up 1/K of the images because the means of their logits are 0. This way, the model cannot collapse to picking the same class every time. Sharpening makes sure that a single class is picked each time (otherwise, a flat uniform distribution could be the result).
It can still collapse at 0, as the output of a neuron can be 0 (or a very small value) and its running mean also 0. If most of the neurons have very small means and outputs, then isn't it possible for a few classes to always dominate? (This wouldn't happen if we divided by the standard deviation, btw.)
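The mechanism described in this thread can be sketched in a few lines of numpy (the function name and temperature value are illustrative, not the paper's exact implementation): centering subtracts a running mean from the teacher logits so no class dominates on average, and a low temperature sharpens the resulting distribution so one class is picked per image.

```python
import numpy as np

def teacher_probs(logits, center, tau=0.04):
    """Centering then sharpening of teacher outputs (illustrative sketch).

    `center` is a running mean of past teacher logits; subtracting it pushes
    the average logit of every class towards 0.  Dividing by a temperature
    tau < 1 sharpens the softmax, counteracting the uniform-distribution
    collapse that centering alone would allow.
    """
    z = (logits - center) / tau              # center, then sharpen
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

With `tau=1.0` the distribution stays soft; with `tau=0.04` even a small logit gap produces a near one-hot target.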
Outstanding video on the topic. I have watched almost everything on the subject on YT and this is one of the best explanations on VAE.
I very much appreciate the helpful animations and diagrams, I guess it is a lot of work to get these done.
Please keep going; your channel will take off eventually, and I hope you find reward in the fact that the few who find your channel learn a lot from your work.
Thanks a lot, and of course I subbed and liked.
Thank you for this presentation. You made sure to explain all background concepts so someone with limited ml knowledge can still understand. I found that really helpful. Thank you so much!
Was just going through the paper and there's already a video. Noiceeee !!
This video made clear to me the strong occlusion prior being introduced by the local/global student/teacher training method. I hadn't picked up on that in my first read through. Thank you!
Last 10 minutes are really a great explanation of a few concepts
One point to note about this paper is that the dataset consists of object-centred images, and the augmentation method relies on cropping, which learns representations that are invariant to the cropping position. This forms a strong inductive prior that produces representations focused on the objects of interest in the image. The main learning signal that guides the self-supervised learning process comes from the cropping augmentation, so I don't see how such a method could be trained without augmentation. My hypothesis is that this method would not work on datasets that don't have object-centred images - for example, a dataset of images of rooms - since in that case crops would contain different objects that have little in common, which would effectively eliminate the learning signal.
In reinforcement learning similarity of teacher and student responses can probably be used to move an agent into a position where an object is centered in its view.
I think you could extend the system by pre-training on object centered, and then expanding to more natural imagery, such as scenes as you say. But the cropping augmentation would probably still need adjustment.
Excellent insight. Sounds like a good follow up to this paper.
Yeah. In this case we can not expect the model to give us the same representation from both e.g. "sky" and a "cat", cropped from different parts of the same image.
what an insight! thanks for making me think!
I like how they include PyTorch code which makes it so easy to implement compared to heavy latex math
Why dont more papers do this?
@@herp_derpingson Because most papers don't have reproducible results.
@@Metalwrath2 sheeeeeeeesh
Thanks! I hoped you’d be fast to cover DINO... and you delivered! :)
This is super cool! A really clever way to kinda do contrastive stuff without doing contrastive stuff and the results speak for themselves
God damnit every time there's a new method/architecture I want to try out and can't find the time to really use. Thanks for the video, I read the paper but the hindsight and small pieces of knowledge you give us about those methods and why they work are reaaaally good.
Great presentation I like how you show the visual part best part for me as a beginner am very excited to learn this algorithm as well this is very useful information for me because sometimes in everyday life I can read so the audio is so helpful thank you
tnx Dr. Kilcher, what you do is useful af! ;)
Very interesting and amazing results
I love these paper summaries!!!
I really like the acronym of this method. 👀
🦖
Yeah... maybe not. Already getting messages with “See! DINO has attention issues”... 😶 thanks fb
@@dinoscheidt Could you expand on those messages, interested in "DINO has attention issues"!
The attention maps look really good, especially the ones on video. It'd be interesting to see what it does when you occlude the thing in the scene it attended to most, and how many things in the scene it could tell apart as you remove the ones it already attended to.
Regarding the cooking video, I think it would have been better if it had been 90% about the language model and 10% about cooking. I personally would like to see more programming, and possibly interviews with the authors of the papers you reviewed. My 2c.
I had a similar idea. If you paint out the attended object using another system, what happens? Like Yannic's comment about pictures of roads and grass 😂
ruclips.net/video/h3ij3F3cPIk/видео.html
great video, mate! The segmentation results are so good!
Yannic : Nobody takes a picture of dirt and grass and posts it on SM
GameDev artists : Woah look at this dirt patch!
29:15 - It achieves better results with ViT when compared to the "best ResNet," of course, but it's 3.6 times larger in the number of parameters.
They're comparing a ~3.6x LARGER modern architecture (which probably employs an arsenal of training tricks) with ResNet. Shocking, truly groundbreaking, you can get better results with a larger model.
Great explanations, thank you for this quality video!
I loved the 34:38 insight on augmentations!
And I found your concern about the meme culture quite funny :-)
Great insight and comments, thanks Yannic
Saved for later! Yannic dude love your vids!
Stupid! I submitted a similar concept before, and it was rejected because I am not a well-known person. Now, just because FB made it, they are glorified! This is crazy!
Sad
That's why there is arXiv. Didn't you think about publishing there?
@@samanthaqiu3416 For many PhD programs, publishing on arXiv is not good enough.
The data augmentation is important to avoid using clustering, which is not scalable on a huge dataset, because you get a huge cluster-centroid matrix that you need to store and update each time.
Very interesting hypothesis !
Thanks a lot! A dinosaur should be on the cover.
Amazing explanation and paper too. Very interesting!
Confusing to me was:
TL;DR: It seems like it requires video as input, but it works on still images.
In the intro at 0:55, examples are shown, and all of them are videos. At first sight, it seemed obvious to me that it was detecting the moving object. Looking more closely, something more is going on - the movement of the waves is ignored, in a clean way. But still, the information for the separation is available in a very salient way. It took a while until I understood that it is about still images. Now I think the frames of the example videos are processed individually.
Awesome vid thanks! and I see they are linking this video of yours on their git repo!
Two quick notes:
1. The video can replace CVPR
2. If the cat can be recognised by its ear, would that mean some 'generative power' has been created within the student?
It pays attention to patches with maximal change. Of course we, the erect monkeys, also pay attention to visual fields with maximum change, to get food, or escape danger. Why? Because it works and that is how we have evolved, because it worked.
Great video, thanks! Surely the reason for the softmax is that the cross-entropy equation requires probabilities, and the softmax function turns the outputs into probabilities?
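That reading matches the loss as the paper describes it: the teacher's sharpened softmax output is the target distribution, and the student is trained with cross-entropy against it. A small numpy sketch (function names are mine; the temperatures are illustrative, roughly in the range the paper uses):

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax over the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_style_loss(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    """Cross-entropy H(p_t, p_s) between teacher and student distributions.

    The teacher uses a lower temperature (sharper targets) than the student;
    only the student side would receive gradients in actual training.
    """
    p_t = softmax(teacher_logits, tau_t)
    log_p_s = np.log(softmax(student_logits, tau_s))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())
```

The softmax is what makes the cross-entropy well-defined: both arguments are valid probability distributions over the K output dimensions.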
Thanks for your great video. Do you have any video on DINOv2?
Nice final comments. I totally agree that augmentations should be internal to the learning process. As I see it, we humans do something similar by remembering previously seen patterns, as well as by imagining how things would change if we perturbed them (or by actually performing the perturbation). With respect to the global and local crops, does the teacher really only see global crops? Because according to the pseudo-code, both x1 and x2 go into both models.
Love the videos! Will you be providing valuable insight for the papers "Multiscale Vision Transformers", "Vision Transformers for Remote Sensing Image Classification" and "Vision Transformers for Dense Prediction"?
Yannic some constructive feedback, turn up your volume!
Absolutely yesss!
For augmentation, we could instead use noisy input. For the dataset, a reconstructive loss and a world model should give basic objects and cause the model to prefer images that carry more significant (less random) semantic meaning. Then at dream time it can train on the meaningful images.
When will the code for Generative Minimization Networks: Training GANs Without Competition be released?
Great explainer video, but I'm not sure I agree with your conclusion that augmentation may be a major source of the signal the approach is latching onto. My own suspicion is that the cropping is the main reason this approach works.
Great video! So are we going to get a PAWS video next? Pretty please???
well done!
Thank you Yannic!!! Can you do a video about CutLER ? :)
Amazing video, can you please make one on DinoV2
keep going :) very well
Thanks. it's really intresting.
Would love to see time changes in natural video instead of augmentations, to see if "why AI is harder than we think" holds any water
@Robert w No it isn't. We invented a whole new term and everything.
@Robert w Your comment changed. I don't recall exactly what it was initially but the meaning has changed.
The softmax bounds the outputs to the probability simplex. Otherwise the embedding space is unbounded, which gives you an infinite projection space.
Clearly explained, thanks.
Have my updoot. I loved the cooking video btw
Maybe have a separate channel for cooking-style videos so you don't get tanked by the algo.
please make a video explaining about EsVIT. Thanks!
Right now the images for the student model are sampled from the image with different x,y coordinates. What we could also do is to sample them from different timestamps from a video.
The cooking video did not really do "terribly." Yes, perhaps a bit less than the average video, but I watched it and it was adequate. Nonetheless, sometimes we need to try random things to prevent getting stuck in a local maximum. Keep it up!
This paper looks insane
Thanks for the video! Enlightening as always. The audio volume is a bit too low though.
super
Maybe we should try using consecutive frames of a video as augmentations of the same thing; it requires less augmentation engineering, and you could argue that it resembles the data humans learn from as children.
I don't exactly understand how distillation prevents collapse in this model, as explained at 13:53. At 19:59 it is mentioned again that the student cannot output the same thing every time because it is prevented from doing so, but how exactly? Would someone like to elaborate?
Looking into the pseudo-code, the block diagram (Figure 2) isn't a good representation of what's actually happening, right?
At first sight, I thought x2 only goes through the teacher network and x1 only goes through the student network.
ViT for augmentations when?
I am just curious to see people use self-supervision on images that contain multiple classes of interest.
What is the intuition behind it?
How does it work so well without labels?
Yannic, can you explain the intuition?
The intuition is that you try to make the network learn that an image of a cat's ear and a complete image of the cat should have the same representation. The hypothesis is that by forcing the model to learn consistent representations across scales (patch vs. whole image), it can grasp transferable features that are generally useful for computer vision tasks.
@@susdoge3767 thank you. Unsupervised learning is only possible if the latent-space representations are similar to each other (minimizing the distance in latent space). That is also why we observe emergent properties in LLMs: e.g., a Google translator trained on English can surprisingly translate Hindi or other languages it was never trained on. It only works because human languages share a similar structure, which is rooted in human biology, i.e. brain function — those processes are similar across all humans, independent of color, gender, nationality, or race.
@@michaelwangCH that's another cool insight I didn't know!
@@susdoge3767 happy to help — knowledge belongs to the entire human race, not to a small group of people.
Augmentations are so simple in nature that they could be part of the evolutionary dynamics of how human perception develops over time. Maybe in your sleep, different crops of the occipital cortex play this game of augmentation. Maybe you weren't born a tabula rasa, but born with augmentation dynamics built in.
There's no temporal aspect to it?
Terminator misspelt "Facebook" in the movies.
What's the framerate for 1080p? Is it realtime?
Could anyone tell me how the teacher knows there are 'k' classes to be identified in a picture?
Cheers!
Hi, what does it mean to threshold the self-attention maps to keep 60% of the mass? What does "mass" represent here?
Which model is used for downstream tasks? The student or the teacher?
The paper says "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
That's one thing I don't get either...
Polyak averaging
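Right — the teacher update in the paper's pseudocode is an exponential moving average (Polyak averaging) of the student's weights. A minimal sketch in plain Python (parameter lists and the momentum value are illustrative):

```python
def ema_update(teacher_params, student_params, momentum=0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student.
    With momentum close to 1, the teacher changes slowly and acts like an
    ensemble of past students, which helps explain why it can outperform
    any single student snapshot during training."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student, momentum=0.9)
# Teacher moved only 10% of the way toward the student:
assert all(abs(t - e) < 1e-9 for t, e in zip(teacher, [0.1, 0.2]))
```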
Can we use this for object detection tasks?
I didn't properly understand sharpening and centering — can anyone help me understand them intuitively?
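For anyone else stuck here, a rough sketch following the paper's pseudocode (exact hyperparameter values here are illustrative): centering subtracts a running mean of the teacher outputs, which stops any one dimension from permanently dominating (guards against collapse to a single constant output), while sharpening applies a low softmax temperature, which pushes outputs away from the uniform distribution (guards against collapse to a flat spread). Either trick alone still allows a collapse mode; together they balance out.

```python
import math

def teacher_output(logits, center, tau=0.04):
    """Center, then sharpen: softmax((logits - center) / tau) with small tau."""
    shifted = [(l - c) / tau for l, c in zip(logits, center)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, batch_logits, momentum=0.9):
    """Running (EMA) mean of the teacher logits over the batch."""
    batch_mean = [sum(col) / len(col) for col in zip(*batch_logits)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

# Sharpening: a small temperature turns a modest logit gap into a near one-hot.
p = teacher_output([1.0, 0.5, 0.0], center=[0.0, 0.0, 0.0], tau=0.04)
assert p[0] > 0.99
```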
If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.
Exactly. One of the images shows a dog in a sofa and only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task dependent, so how could it know whether to segment the dog or its limbs for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.
This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background. Check the running dog example on the project page and look at the last frame where the dog is absent.
@@randomthoughts3009 well, the fact that it has only very faint recognition of other things isn't really an excuse. But I guess it could simply result from the focus of the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.
What clustering algo does it use on the features?
Linear and kNN, got it...
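(For anyone else wondering: the kNN evaluation just classifies each query by its nearest neighbours among the frozen features. A toy sketch in plain Python with made-up 2-D features:)

```python
def knn_predict(query, features, labels, k=3):
    """Classify a query by majority vote among its k nearest stored features."""
    dists = sorted(
        (sum((q - f) ** 2 for q, f in zip(query, feat)), lab)
        for feat, lab in zip(features, labels)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy "frozen features": two well-separated clusters.
feats = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
labs = ["cat", "cat", "cat", "dog", "dog", "dog"]
assert knn_predict([0.05, 0.05], feats, labs) == "cat"
assert knn_predict([5.0, 5.0], feats, labs) == "dog"
```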
Basically: DINO = BYOL + Transformers
Volume is low in this video.
I've read the paper, and sadly I didn't find anything new. They just gathered techniques that already existed and applied them in a self-supervised way. Funny that DINO stands for "DIstill NO labels", when normal distillation training doesn't use any labels at all 😂
Many papers are like that. Although it is very simple, they try to dress it up to make it seem complicated and plausible. I don't find this paper impressive at all.
WTF. I found this and was just about to suggest it to you over linkedin and thought.. what if I just checked if there were any youtube videos on it first...
Yannic “Lightspeed” Kilcher strikes again
Skynet is coming
Would I rather watch Gordon Ramsay review the latest AI paper, or would I rather watch Yannic? That might answer your question, Yannic 😆
seems like attention is all you need
So fast~
This seems to be an unsupervised clustering algorithm to me. I guess calling it "self-supervised" sounds sexier.
The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in childhood — no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species; hell, of every species. I know cows have a type of attention and understanding that we might call autistic: wherever they walk, if some unknown thing is in the proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.
It looks like double Q-learning. What do you think?
Seems like a game theory problem to me!
Commenting for algo.
offtopic thing - would you be open to adding donation options in a proof of stake coin? I don't have strong opinions about which one, I'd convert to whatever you think is a good option. I don't want to fund gpu demand with my donation :)
You can see the stripes of the horse... Sorry, it's a zebra 🦓 hahaha
cooking was good video
lol
"Cooking video" - Wat.
Yannic, your cooking video did terribly because this is an AI channel. None of your viewers want to see you cook, even if the recipe was written by an AI.
This is probably accurate. You know, I think what would work better? A collaboration video with a cooking channel! You should get in touch with Andong
I mean, of course that's what happens if you make content aimed at a different audience. At the same time, branching out is necessary for channel growth, and most big channels went through a phase where they "changed audiences".
I personally liked the cooking video