The paper says "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.
Exactly. One of the images shows a dog in a sofa and only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task dependent, so how could it know whether to segment the dog or its limbs for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.
This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background. Check the running dog example on the project page and look at the last frame where the dog is absent.
@@randomthoughts3009 Well, that it has very faint recognition of other things isn't really an excuse. But I guess it could simply result from the focus of the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.
I've read the paper and sadly I didn't find anything new. They just gathered some techniques that already existed and implemented them in a self-supervised way. Funny that it's called DINO, "DIstillation with NO labels", yet normal distillation training doesn't use any labels at all 😂
Many papers are like that. Although the idea is very simple, they try to dress it up to make it look complicated and plausible. I didn't find this paper impressive at all.
WTF. I found this and was just about to suggest it to you over LinkedIn, and then thought... what if I just check whether there are any YouTube videos on it first...
The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in childhood; no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species - hell, of every species. I know cows have a type of attention and understanding which we might call autistic: wherever they walk, if something unknown is in the proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.
Off-topic thing - would you be open to adding donation options in a proof-of-stake coin? I don't have strong opinions about which one; I'd convert to whatever you think is a good option. I don't want to fund GPU demand with my donation :)
Yannic, your cooking video did terribly because this is an AI channel. None of your viewers want to see you cook, even if the recipe was written by an AI.
I mean, of course that's what happens if you make content aimed at a different audience. At the same time, branching out is necessary for channel growth, and most big channels went through a phase where they "changed audiences". I personally liked the cooking video.
OUTLINE:
0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments
Paper: arxiv.org/abs/2104.14294
Blog: ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Code: github.com/facebookresearch/dino
My Video on ViT: ruclips.net/video/TrdevFK_am4/видео.html
My Video on BYOL: ruclips.net/video/YPfUiOMYOEE/видео.html
From the paper it is not clear at all that they detach the gradients of the teacher via the center (C) variable. I'll have to look at their repo to see what is going on. Typically, operations like `mean` still propagate gradients in PyTorch.
Yep, and it didn't help much that they seem to code like 9-year-olds, but from line 304 of main_dino.py ( github.com/facebookresearch/dino/blob/a15f6afee2f3b868f44a5021a0951f718c8e2dd5/main_dino.py#L304 ) it seems clear they are NOT detaching all gradients from the teacher network in the `update_center` method.
it will not be a problem since they don't seem to be using those gradients anywhere, although I haven't verified it
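For reference, the usual way to make the gradient question moot is to keep the whole teacher-side update outside of autograd. A minimal PyTorch sketch of that pattern (the function name, momentum values, and exact update order are my assumptions, not taken from the repo):

```python
import torch

@torch.no_grad()  # everything on the teacher side is excluded from autograd
def update_teacher_and_center(student, teacher, center, teacher_out,
                              momentum=0.996, center_momentum=0.9):
    """Hypothetical sketch of the EMA teacher update plus the center update.

    The teacher parameters become an exponential moving average of the
    student's, and `center` tracks a running mean of teacher outputs.
    Wrapping both in no_grad (or detaching the teacher output) means no
    gradient can flow back through the center, regardless of how `mean`
    behaves elsewhere.
    """
    # EMA update: teacher <- m * teacher + (1 - m) * student
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_((1.0 - momentum) * ps.detach())
    # Running mean of teacher outputs, used for centering the teacher logits
    batch_center = teacher_out.mean(dim=0, keepdim=True)
    center.mul_(center_momentum).add_((1.0 - center_momentum) * batch_center)
    return center
```

Whether the repo actually does the equivalent of this everywhere is exactly the question raised above.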
Thanks a lot Yannic for covering DINO, that’s really an honor ! I’m a big fan of your channel :D
Hi. Enjoyed the paper and the explanation given in this video. Thank you both.
Are you aware of any robustness analysis (in the context of adversarial examples) done for DINO?
I know version 2 is out. Still, congrats on this breakthrough work!
"learning to predict cat from cat ear" is a good summary of this paper.
simplest form of AI lol
I am super impressed by how you nailed the pronunciation of every single name of the paper's authors.
Your comment on augmentations is spot on! I have worked with BYOLs in clinical images for a while now, and choosing the correct augmentations makes a heck of a difference, and there is no way to know the right augmentation without trial and error! I think that's a major downside of BYOL, which will obviously percolate to DINO as well. Thanks for your presentation of the paper!
As often said, what a time to be alive!
wrong channel xd
If you are the guy from "Two Minute Papers", excellent work - we are living in an extraordinary time of human history.
@@michaelwangCH Let's enjoy it; progress is exponential, and we are in a steep region! I've decided to ignore the question of whether we are near the end of human history, at the point where the progress curve goes to infinity...
Actually, I'm not afraid: progress is exponential because making progress means adding knowledge, tools and scientists, which in turn allows faster progress.
But I think it is actually a logistic development that looks very much like exponential growth, except that instead of reaching a singularity, the curve begins to get less steep. That happens when a finite resource is involved. But as a physicist I say: no problem, the observable universe is finite.
@@vsiegel We have a log curve between scientific output (progress) and the resources we put in - the work of a researcher is getting harder, and the difficulty increases every year - in other words, the difficulty increases exponentially with time - which is bad for scientific progress and for the distribution of societal resources. E.g. CERN, with over 6000 scientists and $14B+ in fixed costs per year: those resources could probably be used more productively in other areas of science.
@@vsiegel That assumes knowledge can be added up like chocolate cakes, but that's a wrong hypothesis for individual humans and for humanity. Focusing more knowledge on one topic means other topics get less attention and are forgotten. This is why the definition of "progress" must be chosen carefully, and by some definitions DINO does not represent progress in itself, as it can have negative indirect effects, like any digital technology.
Surprisingly fluent pronunciation of the authors .... bet that took more takes than one would expect :)
Cool paper, thanks for the review!
About centering vs. sharpening, you are right: centering avoids the collapse because each unsupervised class gets pushed towards an equal running average, i.e., each unsupervised class should pick up 1/K of the images because the means of their logits are 0. This way, the model cannot collapse to picking the same class every time. Sharpening makes sure that a single class is picked each time (otherwise, a flat uniform distribution could be the result).
It can still collapse at 0, as the output of a neuron can be 0 (or a very small value) and its running mean also 0. If most of the neurons have very small means and outputs, then isn't it possible for a few classes to always dominate? (This wouldn't happen if we divided by the standard deviation, btw.)
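The mechanism described in this thread can be sketched in a few lines of numpy (the function name and temperature value are illustrative, not the paper's exact implementation): centering subtracts a running mean from the teacher logits so no class dominates on average, and a low temperature sharpens the resulting distribution so one class is picked per image.

```python
import numpy as np

def teacher_probs(logits, center, tau=0.04):
    """Centering then sharpening of teacher outputs (illustrative sketch).

    `center` is a running mean of past teacher logits; subtracting it pushes
    the average logit of every class towards 0.  Dividing by a temperature
    tau < 1 sharpens the softmax, counteracting the uniform-distribution
    collapse that centering alone would allow.
    """
    z = (logits - center) / tau              # center, then sharpen
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

With `tau=1.0` the distribution stays soft; with `tau=0.04` even a small logit gap produces a near one-hot target.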
Outstanding video on the topic. I have watched almost everything on the subject on YT and this is one of the best explanations on VAE.
I very much appreciate the helpful animations and diagrams, I guess it is a lot of work to get these done.
Please keep going; your channel will take off eventually, and I hope you find reward in the fact that the few who find your channel learn a lot from your work.
Thanks a lot, and of course I subbed and liked.
Thank you for this presentation. You made sure to explain all background concepts so someone with limited ml knowledge can still understand. I found that really helpful. Thank you so much!
Was just going through the paper and there's already a video. Noiceeee !!
This video made clear to me the strong occlusion prior being introduced by the local/global student/teacher training method. I hadn't picked up on that in my first read through. Thank you!
Last 10 minutes are really a great explanation of a few concepts
One point to note about this paper is that the dataset consists of object-centred images, and the augmentation method relies on cropping, which learns representations that are invariant to the cropping position. This forms a strong inductive prior that produces representations focused on the objects of interest in the image. The main learning signal that guides the self-supervised learning process comes from the cropping augmentation, so I don't see how such a method could be trained without augmentation. My hypothesis is that this method would not work on datasets that don't have object-centred images - for example, a dataset of images of rooms - since in that case crops would contain different objects that have little in common, which would effectively eliminate the learning signal.
In reinforcement learning similarity of teacher and student responses can probably be used to move an agent into a position where an object is centered in its view.
I think you could extend the system by pre-training on object centered, and then expanding to more natural imagery, such as scenes as you say. But the cropping augmentation would probably still need adjustment.
Excellent insight. Sounds like a good follow up to this paper.
Yeah. In this case we can not expect the model to give us the same representation from both e.g. "sky" and a "cat", cropped from different parts of the same image.
what an insight! thanks for making me think!
I like how they include PyTorch code which makes it so easy to implement compared to heavy latex math
Why dont more papers do this?
@@herp_derpingson Because most papers don't have reproducible results.
@@Metalwrath2 sheeeeeeeesh
Thanks! I hoped you’d be fast to cover DINO... and you delivered! :)
This is super cool! A really clever way to kinda do contrastive stuff without doing contrastive stuff and the results speak for themselves
God damnit every time there's a new method/architecture I want to try out and can't find the time to really use. Thanks for the video, I read the paper but the hindsight and small pieces of knowledge you give us about those methods and why they work are reaaaally good.
Great presentation I like how you show the visual part best part for me as a beginner am very excited to learn this algorithm as well this is very useful information for me because sometimes in everyday life I can read so the audio is so helpful thank you
tnx Dr. Kilcher, what you do is useful af! ;)
Very interesting and amazing results
I love these paper summaries!!!
I really like the acronym of this method. 👀
🦖
Yeah... maybe not. Already getting messages with “See! DINO has attention issues”... 😶 thanks fb
@@dinoscheidt Could you expand on those messages, interested in "DINO has attention issues"!
The attention maps look really good, especially the ones on video. It'd be interesting to see what it does when you occlude the thing in the scene it attended to most, and how many things in the scene it could tell apart as you remove the ones it already attended to.
Regarding the cooking video, I think it would have been better if it had been 90% about the language model and 10% about cooking. I personally would like to see more programming, and possibly interviews with the authors of the papers you reviewed. My 2c.
I had a similar idea. If you paint out the attended object using another system, what happens? Like Yannic's comment about pictures of roads and grass 😂
ruclips.net/video/h3ij3F3cPIk/видео.html
great video, mate! The segmentation results are so good!
Yannic : Nobody takes a picture of dirt and grass and posts it on SM
GameDev artists : Woah look at this dirt patch!
29:15 - It achieves better results with ViT when compared to the "best ResNet," of course, but it's 3.6 times larger in the number of parameters.
They're comparing a ~3.6x LARGER modern architecture (which probably employs an arsenal of training tricks) with ResNet. Shocking, truly groundbreaking, you can get better results with a larger model.
Great explanations, thank you for this quality video!
I loved the 34:38 insight on augmentations!
And I found your concern about the meme culture quite funny :-)
Great insight and comments, thanks Yannic
Saved for later! Yannic dude love your vids!
Stupid! I submitted a similar concept before, and it was rejected because I am not a well-known person. Now, just because FB made it, they are glorified! This is crazy!
Sad
That's why there is arXiv. Didn't you think about publishing there?
@@samanthaqiu3416 For many PhD programs, publishing on arXiv is not good enough.
The data augmentation is important to avoid using clustering, which is not scalable on a huge dataset, because you get a huge cluster-centroid matrix that you need to store and update each time.
Very interesting hypothesis !
Thanks a lot! A dinosaur should be on the cover.
Amazing explanation and paper too. Very interesting!
Confusing to me was:
TL;DR: It seems like it requires video as input, but it works on still images.
In the intro at 0:55, examples are shown, and all of them are videos. At first sight, it seemed obvious to me that it was detecting the moving object. Looking more closely, something more is going on - the movement of the waves is ignored, in a clean way. But still, the information for the separation is available in a very salient way. It took a while until I understood that it is about still images. Now I think the frames of the example videos are processed individually.
Awesome vid thanks! and I see they are linking this video of yours on their git repo!
Two quick notes:
1. The video can replace CVPR
2. If the cat can be recognised by its ear, would that mean some 'generative power' has been created within the student?
It pays attention to patches with maximal change. Of course we, the erect monkeys, also pay attention to visual fields with maximum change, to get food, or escape danger. Why? Because it works and that is how we have evolved, because it worked.
Great video, thanks! Surely the reason for the softmax is that the cross-entropy equation requires probabilities, and the softmax function turns the outputs into probabilities?
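That reading matches the loss as the paper describes it: the teacher's sharpened softmax output is the target distribution, and the student is trained with cross-entropy against it. A small numpy sketch (function names are mine; the temperatures are illustrative, roughly in the range the paper uses):

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax over the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_style_loss(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    """Cross-entropy H(p_t, p_s) between teacher and student distributions.

    The teacher uses a lower temperature (sharper targets) than the student;
    only the student side would receive gradients in actual training.
    """
    p_t = softmax(teacher_logits, tau_t)
    log_p_s = np.log(softmax(student_logits, tau_s))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())
```

The softmax is what makes the cross-entropy well-defined: both arguments are valid probability distributions over the K output dimensions.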
Thanks for your great video. Do you have any video on DINOv2?
Nice final comments. I totally agree that augmentations should be internal to the learning process. As I see it, we humans do something similar by remembering previously seen patterns, as well as by imagining how things would change if we perturbed them (or by actually performing the perturbation). With respect to the global and local crops, does the teacher really only see global crops? Because according to the pseudo-code, both x1 and x2 go into both models.
Love the videos! Will you be providing valuable insight for the papers "Multiscale Vision Transformers", "Vision Transformers for Remote Sensing Image Classification" and "Vision Transformers for Dense Prediction"?
Yannic some constructive feedback, turn up your volume!
Absolutely yesss!
For augmentation, we could instead use noisy input. For the dataset, a reconstructive loss and a world model should give basic objects and cause the model to prefer images that carry more significant (less random) semantic meaning. Then at dream time it can train on the meaningful images.
When will the code for Generative Minimization Networks: Training GANs Without Competition be released?
Great explainer video, but I'm not sure I agree with your conclusion that augmentation may be a major source of the signal the approach is latching onto. My own suspicion is that the cropping is the main reason this approach works.
Great video! So are we going to get a PAWS video next? Pretty please???
well done!
Thank you Yannic!!! Can you do a video about CutLER ? :)
Amazing video, can you please make one on DinoV2
keep going :) very well
Thanks. it's really intresting.
Would love to see time changes in natural video instead of augmentations, to see if "why AI is harder than we think" holds any water
@Robert w No it isn't. We invented a whole new term and everything.
@Robert w Your comment changed. I don't recall exactly what it was initially but the meaning has changed.
The softmax bounds the outputs to the probability simplex. Otherwise the embedding space is unbounded, which gives you an infinite projection space.
Clearly explained, thanks.
Have my updoot. I loved the cooking video btw
Maybe have a separate channel for cooking-style videos so you don't get tanked by the algo.
please make a video explaining about EsVIT. Thanks!
Right now the images for the student model are sampled from the image with different x,y coordinates. What we could also do is to sample them from different timestamps from a video.
The cooking video did not really do "terribly." Yes, perhaps a bit less than the average video, but I watched it and it was adequate. Nonetheless, sometimes we need to try random things to prevent getting stuck in a local maximum. Keep it up!
This paper looks insane
Thanks for the video! Enlightening as always. The audio volume is a bit too low though.
super
Maybe we should try using consecutive frames of a video as augmentations of the same thing; it requires less augmentation engineering, and you could argue that it resembles the data humans learn from as children.
I don't exactly understand how distillation prevents collapse in this model, as explained at 13:53. At 19:59 it is mentioned again that the student cannot output the same thing every time because it is prevented from doing so, but how exactly? Would someone like to elaborate?
Looking into the pseudo-code, the block diagram (Figure 2) isn't a good representation of what's actually happening, right?
At first sight, I thought x2 only goes through the teacher network and x1 only goes through the student network.
ViT for augmentations when?
I am just curious to see people use self-supervision on images that contain multiple classes of interest.
What is the intuition behind it?
How does it work so well without labels?
Yannic, can you explain the intuition?
The intuition is that you try to make the network learn that an image of a cat's ear and a complete image of the cat should have the same representation. The hypothesis is that by forcing the model to learn consistent representations across scales (patch vs. whole image), it can grasp transferable features that are generally useful for computer vision tasks.
@@susdoge3767 thank you. Unsupervised learning is only possible if the latent-space representations are similar to each other (minimizing the distance in latent space). That is also why we observe emergent properties in LLMs: e.g., a Google translator trained on English can surprisingly translate Hindi or other languages it was never trained on. It only works because human languages share a similar structure, which is rooted in human biology, i.e. brain function — those processes are similar across all humans, independent of color, gender, nationality, or race.
@@michaelwangCH that's another cool insight I didn't know!
@@susdoge3767 happy to help — knowledge belongs to the entire human race, not to a small group of people.
Augmentations are so simple in nature that they could be part of the evolutionary dynamics of how human perception develops over time. Maybe in your sleep, different crops of the occipital cortex play this game of augmentation. Maybe you weren't born a tabula rasa, but born with augmentation dynamics built in.
There's no temporal aspect to it?
Terminator misspelt "Facebook" in the movies.
What's the framerate for 1080p? Is it realtime?
Could anyone tell me how the teacher knows there are 'k' classes to be identified in a picture?
Cheers!
Hi, what does it mean to threshold the self-attention maps to keep 60% of the mass? What does "mass" represent here?
Which model is used for downstream tasks? The student or the teacher?
The paper says "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
That's one thing I don't get either...
Polyak averaging
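Right — the teacher update in the paper's pseudocode is an exponential moving average (Polyak averaging) of the student's weights. A minimal sketch in plain Python (parameter lists and the momentum value are illustrative):

```python
def ema_update(teacher_params, student_params, momentum=0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student.
    With momentum close to 1, the teacher changes slowly and acts like an
    ensemble of past students, which helps explain why it can outperform
    any single student snapshot during training."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]
student = [1.0, 2.0]
teacher = ema_update(teacher, student, momentum=0.9)
# Teacher moved only 10% of the way toward the student:
assert all(abs(t - e) < 1e-9 for t, e in zip(teacher, [0.1, 0.2]))
```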
Can we use this for object detection tasks?
I didn't properly understand sharpening and centering — can anyone help me understand them intuitively?
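For anyone else stuck here, a rough sketch following the paper's pseudocode (exact hyperparameter values here are illustrative): centering subtracts a running mean of the teacher outputs, which stops any one dimension from permanently dominating (guards against collapse to a single constant output), while sharpening applies a low softmax temperature, which pushes outputs away from the uniform distribution (guards against collapse to a flat spread). Either trick alone still allows a collapse mode; together they balance out.

```python
import math

def teacher_output(logits, center, tau=0.04):
    """Center, then sharpen: softmax((logits - center) / tau) with small tau."""
    shifted = [(l - c) / tau for l, c in zip(logits, center)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, batch_logits, momentum=0.9):
    """Running (EMA) mean of the teacher logits over the batch."""
    batch_mean = [sum(col) / len(col) for col in zip(*batch_logits)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

# Sharpening: a small temperature turns a modest logit gap into a near one-hot.
p = teacher_output([1.0, 0.5, 0.0], center=[0.0, 0.0, 0.0], tau=0.04)
assert p[0] > 0.99
```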
If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.
Exactly. One of the images shows a dog in a sofa and only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task dependent, so how could it know whether to segment the dog or its limbs for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.
This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background. Check the running dog example on the project page and look at the last frame where the dog is absent.
@@randomthoughts3009 well, the fact that it has only very faint recognition of other things isn't really an excuse. But I guess it could simply result from the focus of the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.
What clustering algo does it use on the features?
Linear and kNN, got it...
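(For anyone else wondering: the kNN evaluation just classifies each query by its nearest neighbours among the frozen features. A toy sketch in plain Python with made-up 2-D features:)

```python
def knn_predict(query, features, labels, k=3):
    """Classify a query by majority vote among its k nearest stored features."""
    dists = sorted(
        (sum((q - f) ** 2 for q, f in zip(query, feat)), lab)
        for feat, lab in zip(features, labels)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy "frozen features": two well-separated clusters.
feats = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
labs = ["cat", "cat", "cat", "dog", "dog", "dog"]
assert knn_predict([0.05, 0.05], feats, labs) == "cat"
assert knn_predict([5.0, 5.0], feats, labs) == "dog"
```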
Basically: DINO = BYOL + Transformers
Volume is low in this video.
I've read the paper, and sadly I didn't find anything new. They just gathered techniques that already existed and applied them in a self-supervised way. Funny that DINO stands for "DIstill NO labels", when normal distillation training doesn't use any labels at all 😂
Many papers are like that. Although it is very simple, they try to dress it up to make it seem complicated and plausible. I don't find this paper impressive at all.
WTF. I found this and was just about to suggest it to you over linkedin and thought.. what if I just checked if there were any youtube videos on it first...
Yannic “Lightspeed” Kilcher strikes again
Skynet is coming
Would I rather watch Gordon Ramsay review the latest AI paper, or would I rather watch Yannic? That might answer your question, Yannic 😆
seems like attention is all you need
So fast~
This seems to be an unsupervised clustering algorithm to me. I guess calling it "self-supervised" sounds sexier.
The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in childhood — no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species; hell, of every species. I know cows have a type of attention and understanding that we might call autistic: wherever they walk, if some unknown thing is in the proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.
It looks like double Q-learning. What do you think?
Seems like a game theory problem to me!
Commenting for algo.
offtopic thing - would you be open to adding donation options in a proof of stake coin? I don't have strong opinions about which one, I'd convert to whatever you think is a good option. I don't want to fund gpu demand with my donation :)
You can see the stripes of the horse... Sorry, it's a zebra 🦓 hahaha
cooking was good video
lol
"Cooking video" - Wat.
Yannic, your cooking video did terribly because this is an AI channel. None of your viewers want to see you cook, even if the recipe was written by an AI.
This is probably accurate. You know, I think what would work better? A collaboration video with a cooking channel! You should get in touch with Andong
I mean, of course that's what happens if you make content aimed at a different audience. At the same time, branching out is necessary for channel growth, and most big channels went through a phase where they "changed audiences".
I personally liked the cooking video