Thank you for the detailed and easy-to-understand explanation. I also appreciate you sharing your different perspectives.
I like that you are critical of the papers.
Thank you so much! I struggle a lot to read these papers when I don't have a clue what they are about, this helps tremendously
Frequently the authors give their own presentations. Those are also helpful.
I'm a bit late to the party. My question is about the slot initialization. I'm a bit confused by the explanation from 16:40 to 22:01. At first it sounds like slot attention is a multi-layered transformer with shared weights, where the slots are initialized randomly and then updated/refined at each iteration (this feels similar to DETR). But towards the later part, it says that in each iteration the slots are random, which is conceptually confusing: what does the model learn then? If the initialization is random at every iteration, then the "slot routing" learned in the previous iteration is not carried over to the next one, so each iteration starts from square one, which would be equivalent to a single iteration where the slots start random and then the softmax(QK)·V + MLP happens. So what is learned here? Am I missing something? Or is the difference at test time? In DETR, the test-time queries are initializations learned during training, while for slot attention they are random?
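For what it's worth, here is a minimal numpy sketch of how I understand it (not the authors' code; parameter names are my own). The slots are sampled once per forward pass from a Gaussian with a learned mean and log-std, and are then refined over a few iterations with shared weights. So what is learned are the init distribution's mu/sigma plus the shared query/key/value (and GRU/MLP) weights, not any particular slot sample:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_inputs, n_slots, d = 6, 3, 4

# Learned parameters (trained by backprop in the real model):
mu = np.zeros(d)                 # mean of the slot init distribution
log_sigma = np.zeros(d)          # log-std of the slot init distribution
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

inputs = rng.normal(size=(n_inputs, d))  # encoder features

# Sampled ONCE per forward pass; only mu/log_sigma are learned
slots = mu + np.exp(log_sigma) * rng.normal(size=(n_slots, d))

for _ in range(3):  # T iterations with SHARED weights
    q = slots @ W_q
    k = inputs @ W_k
    v = inputs @ W_v
    logits = k @ q.T / np.sqrt(d)                  # (n_inputs, n_slots)
    attn = softmax(logits, axis=1)                 # slots compete per input
    attn = attn / attn.sum(axis=0, keepdims=True)  # weighted mean over inputs
    slots = attn.T @ v                             # GRU + MLP update omitted
```

Because the sampling is symmetric across slots, no slot has a fixed identity at test time, which is exactly the contrast with DETR's learned per-query embeddings.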
Loving the reviews. My research field is spiking neural networks, but ain't nobody got time for that. I enjoy your high-level decomposition of the papers; it makes them much easier to read afterwards!
I got time for that! I hope you crack the mysteries of spiking neural networks.
11:44 Sorry if I missed it. In the object-discovery architecture alone, how does the model know that it has to segment objects? Let's say each slot just encodes one quadrant of the image and passes it through. The alpha-channel softmax trick could still reconstruct the same image from four flat quadrants. 12:22 Also, the features are just pixels, so how does the transformer know that we are interested in shapes of similar colors? Set prediction, on the other hand, will not work if the model does not put relevant data into the slots. So is the model trained jointly on the two tasks at the same time?

33:35 I agree. In fact, the discoverer of HIV once said, "Modern research is like bringing a lit candle to an already lit room". Competing over whose candle is brightest means many interesting ideas will never be explored, especially in DL, where flexing your GPU budget is the norm.

This reminds me of a 2017 paper, "A simple neural network module for relational reasoning", which uses a pairwise sum of all features instead of a transformer.
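For reference, the core of that Relation Network module is roughly "run a small function g over every pair of features and sum the results". A minimal sketch (g here is just a stand-in for the learned pairwise MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))  # 5 object features, 4-dim each

def g(a, b):
    # stand-in for the learned pairwise MLP g_theta
    return np.tanh(a + b)

# Relation Network core: sum g over all ordered pairs of features
relation = sum(g(features[i], features[j])
               for i in range(len(features))
               for j in range(len(features)))
```

The sum over pairs makes the output permutation-invariant in the features, which is the same property slot attention gets from its attention pooling.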
I think both your questions are due to the fact that we train the network in a supervised fashion. It's just easier to reduce the loss by learning to segment objects than by dividing the image into flat quadrants. I don't think I explained this very well, so my bad :)
@@YannicKilcher Since the supervision comes only from reconstruction, it could still learn to segment things into flat quadrants. I guess the reasons this happens are twofold: the features generated by the convolutional model are, in a sense, local groupings, which restricts the information in the features, and since the number of slots creates a bottleneck, the model prefers to group similar features.

I have read in a few places that vision models are eager to segment spatial information based on texture; don't ask me to cite this though, as I forgot where I read it :). Anyway, IODINE does run experiments showing that their model works to some extent on such datasets where its precursor (R-NEM) didn't. It would be interesting to see how Slot Attention fares on such a dataset.

Also, if possible, please check out and review Learning to Manipulate Individual Objects in an Image: arxiv.org/pdf/2004.05495.pdf. They experiment with pseudo-real-world datasets and do well on texture-based images as well.
Thank you for the explanation Yannic. Great work
Can't believe it took me so long to find your amazing channel... Some of your more advanced topics I'm still struggling to understand, but you manage to put personalities to those papers and thus make them much more relatable, and possible to watch and re-watch until some understanding forms! The "Hold On To Your Papers" guy is pretty amazing too, but he is always so nice to the authors and so non-critical, which is generally a good thing, but somehow it makes those papers seem very unapproachable. Maybe having some opinionated views and criticism actually aids our information comprehension process; maybe having academia present outwardly as something "dry" and "polite" is actually counterproductive?
I have no idea what's best. I don't target any specific level of politeness, I just say whatever comes to mind.
Amazing.. Please keep doing it...
Amazing as always👍
I can't unsee this frowny-face with expressive eyebrows @ 14:02
hahaha same
Nice ^^
LOL, didn't know you went over this paper as well 😅
Please do a video where you compare the different types of attention mechanisms: self-attention, cross-attention, slot attention, and so on.
Anyways, great work!
That video would be way too long I think :D
@@YannicKilcher But sir please do it. It would help very much.
Thank you so much!
Maybe designing an autoregressive framework could address the limitation of a fixed number of slots.
Thank you for the great video!
Excellent video, as usual. What prevents all pixels/features to be mapped to a single slot?
Nothing per se, but that would not reduce the loss, because one slot can only output one label.
Great video :)
can you please review ReZero is All You Need: Fast Convergence at Large Depth
Hi, is slot attention a form of distributed attention?
It's funny that this paper is from Google Brain. They divide the image with this grid and use supervised learning, since the images are generated.

They would then be able to transfer quickly to real images, because every time we complete a reCaptcha, we unconsciously work for Google by telling it which squares of the grid contain a traffic light, or a car, or a zebra, ...

I guess it's already a bit of applied research!
Reminds me of K-means clustering algorithm.
Is there any implementation of the "Hungarian matching loss" mentioned in the video?
Have a look at DETR
docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
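That SciPy function handles the matching step directly. A tiny sketch with made-up pairwise losses between predicted slots and targets (the numbers here are invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# made-up pairwise losses: cost[i, j] = loss(prediction i, target j)
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.4, 0.05]])

rows, cols = linear_sum_assignment(cost)  # optimal 1-to-1 matching
matched_loss = cost[rows, cols].mean()    # train only on matched pairs
```

The trick in these set-prediction papers is that the gradient flows through the loss of the matched pairs, while the matching itself is recomputed from scratch on every batch.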
Dude, you are posting faster than I can keep up with. How do you do that? I already have a backlog of 10 videos. Please go easy on me :)
Don't slow down. Stuff is too fascinating.
If you take requests, would you consider doing the GENESIS paper from ICLR, which similarly does object-centric segmentation (but also generation) in a different, probabilistic way? That could be interesting to compare :] openreview.net/forum?id=BkxfaTVFwH
Really appreciate the effort, but some of your explanation is wishy-washy. For example, at 12:30 you say "if this is 4 and this is 9 ...". One has to have a background in transformers to know you mean the scaled dot product between query and key; a brief comment on what you mean there would have been useful.

You also say that "it is not good to have one feature be attended to by multiple slots", but it never becomes clear how the training enforces that. There is a reason, of course: redundancy will be minimized at the bottleneck (the slots). Mentioning that would be useful.

Thanks for the video though.
This slot attention looks a bit like the dynamic routing of capsule networks...
oh, you mentioned that already...
And I think this kind of dynamic routing could be a hint on how to do predictive processing in deep learning.
@@zhangcx93 Dynamic routing could be a "Hint on" how to do predictive processing in deep learning.
Yannic Kilcher Amazing content as always. Just some food for thought for future videos: have you thought about making a video on hacking tricks for artificial neural networks? Because with every research paper that doesn't put them in there, a little part of me dies inside... 😥
Yes, I share your frustration, but any video like that would be out of date in two weeks
almost like k-means, huh?