Object-Centric Learning with Slot Attention (Paper Explained)

  • Published: 29 Dec 2024

Comments • 44

  • @jihochoi_cs
    @jihochoi_cs 1 month ago +1

    Thank you for the detailed and easy-to-understand explanation. I also appreciate sharing your different perspectives.

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +11

    I like that you are critical of the papers.

  • @dirkneuhauser8213
    @dirkneuhauser8213 4 years ago +3

    Thank you so much! I struggle a lot to read these papers when I don't have a clue what they are about; this helps tremendously.

    • @snippletrap
      @snippletrap 3 years ago

      Frequently the authors give their own presentations. Those are also helpful.

  • @Beau10-j3d
    @Beau10-j3d 1 year ago

    Am a bit late to the party. My question is about the slot initialization. I am a bit confused by the explanation from 16:40 to 22:01, where at first it sounds like slot attention is a multilayered transformer with shared weights, and the slots are initialized randomly and then updated/refined at each iteration (this feels similar to DETR). Towards the later part, it is said that in each iteration the slots are random, which is conceptually confusing to me as to what is learned. If the initialization is random, then the "slot routing" learned in the previous iteration is not carried over to the next iteration, which means that at each iteration the slots start at square one; that would be equivalent to just one iteration where the slots start random and then the softmax(QK)·V + MLP step happens. So what is learned here? Am I missing something? Or is the difference at test time: in DETR, the queries at test time are initializations learned during training, while for slot attention they are random?
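
As I read the paper, the slots are sampled once per forward pass (from a Gaussian with a learned mean and variance) and are then refined for a few iterations with the same shared weights; what is learned are the q/k/v projections, the GRU/MLP update, and the initializer's statistics, not the slot values themselves, and at test time the routing is simply recomputed from a fresh random draw. Below is a minimal NumPy sketch of that loop, not the authors' implementation: the GRU + MLP update is replaced by a residual add, layer norms are omitted, and all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, num_iters=3, dim=64, seed=0):
    """Single-example sketch of the iterative slot update.

    inputs: (N, dim) array of encoder features. In the real model the
    learned parameters are the q/k/v projections, a GRU, an MLP, and the
    (mu, sigma) of the slot initializer; here they are fixed random
    matrices that only illustrate the data flow.
    """
    rng = np.random.default_rng(seed)
    # Stand-ins for learned parameters, shared across all iterations.
    Wq = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    Wk = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    Wv = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    mu, log_sigma = np.zeros(dim), np.zeros(dim)  # learned init distribution

    # Slots are sampled fresh once per forward pass -- this is the random part.
    slots = mu + np.exp(log_sigma) * rng.normal(size=(num_slots, dim))

    k = inputs @ Wk  # (N, dim)
    v = inputs @ Wv  # (N, dim)
    for _ in range(num_iters):  # the same weights are reused every iteration
        q = slots @ Wq                        # (num_slots, dim)
        logits = k @ q.T / np.sqrt(dim)       # (N, num_slots)
        attn = softmax(logits, axis=1)        # softmax over SLOTS: slots compete
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        updates = attn.T @ v                  # weighted mean of values per slot
        slots = slots + updates               # real model: GRU(slots, updates) + MLP
    return slots

features = np.random.default_rng(1).normal(size=(16, 64))
print(slot_attention(features).shape)  # (4, 64)
```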

  • @paulkirkland9860
    @paulkirkland9860 4 years ago +1

    Loving the reviews. My research field is spiking neural networks, but ain't nobody got time for that. I enjoy your high-level decomposition of the papers; it makes all the papers easier to read thereafter!

    • @alexbaranski2506
      @alexbaranski2506 3 years ago

      I got time for that! I hope you crack the mysteries of spiking neural networks.

  • @herp_derpingson
    @herp_derpingson 4 years ago +6

    11:44 Sorry if I missed it. In the object discovery architecture alone, how does the model know that it has to segment objects? Let's say each slot just encodes a quadrant of the image and passes it through. The alpha-channel softmax trick could still reconstruct the same image given 4 flat quadrants of the image (see the compositing sketch after this thread). 12:22 Also, the features are just pixels, so how does the transformer know that we are interested in shapes of similar colors? Set prediction, on the other hand, will not work if the model does not put relevant data in the slots. So, is the model trained jointly on the two tasks at the same time?
    .
    33:35 I agree. In fact, the discoverer of HIV once said, "Modern research is like bringing a lit candle to an already lit room". Competing for whose candle is brightest means many interesting ideas will never be explored, especially in DL, where flexing your GPU budget is the norm.
    .
    This reminds me of a paper from 2017. "A simple neural network module for relational reasoning". It uses a pairwise sum of all features instead of a transformer.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I think both your questions are due to the fact that we train the network in a supervised fashion. It's just easier to reduce the loss by learning to segment objects than by dividing the image into flat quadrants. I don't think I explained this very well, so my bad :)

    • @rudrasaha7663
      @rudrasaha7663 4 years ago +1

      @@YannicKilcher Since the supervision comes only from reconstruction, it could still learn to segment things into flat quadrants. I guess the reasons it ends up segmenting objects instead are twofold: the features generated by the convolutional model are, in a sense, local groupings, which restricts the information in each feature, and since the number of slots creates a bottleneck, the model is pushed to group similar features.
      .
      I have read in a few places that vision models are eager to segment out spatial information based on texture; don't ask me to cite this though, as I forgot where from :). Anyway, IODINE does perform experiments showing that their model works to some extent on such datasets, where its precursor (R-NEM) didn't. It would be interesting to see how Slot Attention fares on such a dataset.
      .
      Also, if possible, please check out and review for us "Learning to Manipulate Individual Objects in an Image": arxiv.org/pdf/2004.05495.pdf. They have experimented with pseudo-real-world datasets and do well on texture-based images as well.
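
For reference, the compositing step discussed in this thread: each slot is decoded into an RGB image plus an alpha logit, the alphas are softmaxed across slots to give masks, and the reconstruction is the mask-weighted sum. Nothing in this step by itself forbids the "four flat quadrants" solution; the pressure toward object masks comes from the slot bottleneck and the locality of the CNN features mentioned above. A minimal sketch with illustrative shapes, not the paper's decoder:

```python
import numpy as np

def composite(slot_rgbs, slot_alpha_logits):
    """Combine per-slot reconstructions with an alpha softmax over slots.

    slot_rgbs:         (num_slots, H, W, 3) -- each slot's decoded image
    slot_alpha_logits: (num_slots, H, W, 1) -- each slot's decoded alpha logit
    Returns the composited image and the per-slot masks.
    """
    a = slot_alpha_logits - slot_alpha_logits.max(axis=0, keepdims=True)
    masks = np.exp(a) / np.exp(a).sum(axis=0, keepdims=True)  # softmax over slots
    image = (masks * slot_rgbs).sum(axis=0)                   # (H, W, 3)
    return image, masks

rng = np.random.default_rng(0)
rgbs = rng.uniform(size=(4, 32, 32, 3))
alphas = rng.normal(size=(4, 32, 32, 1))
img, masks = composite(rgbs, alphas)
print(img.shape, masks.shape)  # (32, 32, 3) (4, 32, 32, 1)
```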

  • @junaidahmed4682
    @junaidahmed4682 4 years ago +3

    Thank you for the explanation Yannic. Great work

  • @not_a_human_being
    @not_a_human_being 4 years ago +5

    Can't believe it took me so long to find your amazing channel... Some of your more advanced topics I'm still struggling to understand, but you manage to put personalities to those papers and thus make them much more relatable - and possible to watch and re-watch until some understanding forms! The "hold on to your papers" guy is pretty amazing too, but he is always so nice to the authors and so non-critical, which is generally a good thing, but somehow it makes those papers seem very unapproachable. Maybe having some opinionated views and criticism actually aids our comprehension; maybe having academia present itself outwardly as something "dry" and "polite" is actually counter-productive?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I have no idea what's best. I don't target any specific level of politeness, I just say whatever comes to mind.

  • @afzalkhan5094
    @afzalkhan5094 4 years ago +1

    Amazing.. Please keep doing it...

  • @youngjin8300
    @youngjin8300 4 years ago +1

    Amazing as always👍

  • @patrickjdarrow
    @patrickjdarrow 4 years ago +5

    I can't unsee this frowny-face with expressive eyebrows @ 14:02

  • @alfcnz
    @alfcnz 1 year ago

    LOL, didn't know you went over this paper as well 😅

  • @jinga-lala
    @jinga-lala 4 years ago +7

    Please do a video where you compare the different types of attention mechanisms, like self-, cross-, and slot attention.
    Anyway, great work!

    • @YannicKilcher
      @YannicKilcher  4 years ago

      That video would be way too long I think :D

    • @shyammarjit5438
      @shyammarjit5438 1 year ago

      @@YannicKilcher But sir, please do it. It would help a lot.

  • @cathycai9167
    @cathycai9167 1 month ago

    Thank you so much!

  • @酷比焍二
    @酷比焍二 1 year ago

    Maybe designing an autoregressive framework could address the limitation of a fixed number of slots.

  • @nilushachamalJ
    @nilushachamalJ 2 years ago

    Thank you for the great video!

  • @johngrabner
    @johngrabner 4 years ago +2

    Excellent video, as usual. What prevents all pixels/features from being mapped to a single slot?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Nothing per se, but that would not be able to reduce the loss function because one slot can only output one label.

  • @sombrero7935
    @sombrero7935 4 years ago +2

    Great video :)
    Can you please review "ReZero is All You Need: Fast Convergence at Large Depth"?

  • @zerogravity5052
    @zerogravity5052 4 years ago +1

    Hi, is slot attention a form of distributed attention?

  • @sayrun75
    @sayrun75 4 years ago

    It's funny that this paper is from Google Brain. They divide the image with this grid, and they use supervised learning since the images are generated.
    They would then be able to quickly transfer to real images, because every time we complete a reCAPTCHA we actually unconsciously work for Google, by telling them in which squares of the grid there is a traffic light, or a car, or a zebra, ...
    I guess it's already a bit of applied research!

  • @MarkoTintor
    @MarkoTintor 4 years ago +2

    Reminds me of the k-means clustering algorithm.
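
The analogy is quite direct: one slot attention iteration has the shape of a soft k-means step, with slots playing the role of cluster centers, except that similarity comes from learned query/key projections and the mean update is replaced by a learned GRU + MLP. A rough sketch of the soft k-means step for comparison (illustrative only, not from the paper):

```python
import numpy as np

def soft_kmeans_step(points, centers, temperature=1.0):
    """One soft k-means update: points are softly assigned to centers
    (competition across centers via a softmax), then each center moves
    to the weighted mean of its points.
    """
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -d2 / temperature
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # softmax over centers
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)   # normalize per center
    return w.T @ points                              # new centers (K, D)

pts = np.random.default_rng(0).normal(size=(100, 2))
ctrs = np.random.default_rng(1).normal(size=(3, 2))
for _ in range(5):
    ctrs = soft_kmeans_step(pts, ctrs)
print(ctrs.shape)  # (3, 2)
```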

  • @anonymous6713
    @anonymous6713 4 years ago +1

    Is there any implementation of the "Hungarian matching loss" mentioned in the video?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Have a look at DETR

    • @luca99417
      @luca99417 4 years ago

      docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
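
Putting the two replies together: as in DETR, the usual recipe is to build a pairwise cost matrix between predicted slots and targets, solve the one-to-one assignment with scipy's linear_sum_assignment, and average the matched costs. A minimal sketch using a squared-error cost (the actual cost in the paper's set prediction task is task-specific):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_matching_loss(pred, target):
    """Permutation-invariant set loss: match each predicted slot to the
    target it fits best (one-to-one), then average the matched errors.

    pred, target: (num_slots, num_props) arrays of per-object property vectors.
    """
    # cost[i, j] = error if prediction i is assigned to target j
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return cost[rows, cols].mean()

pred = np.array([[0.9, 0.1], [0.2, 0.8]])
target = np.array([[0.0, 1.0], [1.0, 0.0]])
print(hungarian_matching_loss(pred, target))  # matches slot 0 <-> target 1, slot 1 <-> target 0
```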

  • @abhisunkara
    @abhisunkara 4 years ago +1

    Dude, you are posting faster than I can keep up with. How do you do that? I already have a backlog of 10 videos. Please go easy on me :)

    • @johngrabner
      @johngrabner 4 years ago +4

      Don't slow down. Stuff is too fascinating.

  • @BobbyWicked
    @BobbyWicked 4 years ago +1

    If you take requests, would you consider doing the GENESIS paper from ICLR, which similarly does object-centric segmentation (but also generation) in a different, probabilistic way? That could be interesting to compare :] openreview.net/forum?id=BkxfaTVFwH

  • @andreygizdov29
    @andreygizdov29 1 month ago

    Really appreciate the effort, but some of your explanation is wishy-washy. I mean, at 12:30 you say "if this is 4 and this is 9 ...". One has to have a background in transformers to know you mean the scaled dot product between query and key. A brief comment on what you mean here would have been useful.
    You also say that "it is not good to have one feature be attended to by multiple slots", but it does not become clear how the training enforces that. There is a reason, of course: redundancy will be minimized at the bottleneck (the slots). Mentioning that would be useful.
    Thanks for the video, though.
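
On the second point: in the paper the attention softmax is taken over the slot axis rather than over the inputs, so the attention weights for each input feature sum to one across slots and the slots compete for features; that competition, together with the bottleneck, is what discourages several slots from grabbing the same feature. A tiny illustration of the axis difference (illustrative shapes only):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = np.random.default_rng(0).normal(size=(6, 3))  # (num_inputs, num_slots)

attn_standard = softmax(logits, axis=0)  # standard attention: each slot's weights over inputs sum to 1
attn_slot     = softmax(logits, axis=1)  # slot attention: each input's weights over slots sum to 1

print(attn_standard.sum(axis=0))  # [1. 1. 1.]  -> slots can all focus on the same input
print(attn_slot.sum(axis=1))      # [1. 1. 1. 1. 1. 1.]  -> slots compete for each input
```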

  • @zhangcx93
    @zhangcx93 4 years ago +2

    This Slot Attention looks a bit like the dynamic routing of capsule networks...

    • @zhangcx93
      @zhangcx93 4 years ago +2

      Oh, you mentioned that already...
      And I think this kind of dynamic routing could be a hint on how to do predictive processing in deep learning.

    • @ta6847
      @ta6847 2 years ago

      @@zhangcx93 Dynamic routing could be a "Hint on" how to do predictive processing in deep learning.

  • @larrybird3729
    @larrybird3729 4 years ago +1

    Yannic Kilcher, amazing content as always. Just some food for thought for future videos: have you had any thoughts about making a video on hacking tricks for artificial neural networks? ... Because with every research paper that doesn't put them in there, a little part of me dies inside... 😥

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      Yes, I share your frustration, but any video like that would be out of date in two weeks

  • @VegetableJuiceFTW
    @VegetableJuiceFTW 4 years ago

    almost like k-means, huh?