This is a gift. The clarity of the explanation, the speed at which it comes out. Thank you for all of your work.
34:08 GOAT explanation of the bbox in the attention feature map.
I had seen your Attention Is All You Need video, and now, watching this, I am astounded by the clarity you bring to your videos. Subscribed!
The attention visualizations are practically instance segmentations. Very impressive results, and great job untangling it all!
Thank you for your wonderful video. When I first read this paper, I couldn't understand what the input to the decoder (the object queries) was, but after watching your video I finally got it: a random vector!
Loved the video! I was just reading the paper.
Just wanted to point out that Transformers, or rather Multi-Head Attention, naturally process sets, not sequences; this is why you have to include the positional embeddings.
Do a video about the Set Transformer! In that paper they call the technique used by the decoder in this paper "Pooling by Multihead Attention".
Very true, I was just still in the mode where transformers are applied to text ;)
What are positional encodings?
@@princecanuma The positional encoding is basically just a function of each token's index in the sequence.
@@snippletrap I had a feeling it was gonna be something that simple. 🤦🏾♂️ AI researchers' naming conventions aren't helping the community, in terms of accessibility lmao
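To make that a bit more concrete: in the original transformer, the index is expanded into a whole vector of fixed sine/cosine features that gets added to the token embedding. DETR itself uses a learned 2-D variant, so treat this as a minimal sketch of the generic recipe only:

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    # One encoding vector per position; even dims get sin, odd dims get cos.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-torch.log(torch.tensor(10000.0)) / d_model))     # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_pe(850, 256)   # e.g. one encoding vector per flattened feature-map pixel
```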
Thank you for the one-line summary of "Pooling by Multihead Attention". This makes it 10x clearer what exactly the decoder is doing. I had a feeling that the "decoder + object seeds" does something similar to ROI pooling, i.e. gathering relevant information for a possible object. I also recommend reading the Set Transformer paper, which enhanced my limited knowledge of attention models. Thanks again for your comment!
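For anyone curious, a rough sketch of that "Pooling by Multihead Attention" idea, using PyTorch's built-in multi-head attention as a stand-in for the Set Transformer's PMA block (so the details are an approximation): a few learned seed vectors cross-attend over the whole set, which is essentially what DETR's object queries do over the image features.

```python
import torch
import torch.nn as nn

d_model, num_seeds = 256, 4
seeds = nn.Parameter(torch.randn(num_seeds, 1, d_model))   # learned seed "queries"
pma = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)

set_elements = torch.randn(850, 1, d_model)                # (set_size, batch, d_model)
pooled, _ = pma(seeds, set_elements, set_elements)         # seeds attend over the set
print(pooled.shape)                                        # torch.Size([4, 1, 256])
```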
Awesome video. Highly recommend reading the paper first and then watching this to solidify understanding. This definitely helped me understand the DETR model more.
Infinite respect for the Ali G reference.
Haha someone noticed :D
Wow, the way you've explained and broken down this paper is spectacular.
Thx mate
I like the way you DECIPHER things! thanks!
Really smart idea about how the (HxW)^2 matrix naturally embeds bounding box information. I am impressed :)
Yup. Subscribed with notifications. I love that you enjoy the content of the papers. It really shows! Thank you for these videos.
Greatest find on YouTube for me to date!! Thank you for the great videos!
You saved my project. Thank you 🙏🏻
"Maximal benefit of the doubt" - love it!
Was waiting for this. Thanks a lot! Also dude, how many papers do you read every day?!!!
Very, very nice explanation; I subscribed just for that quadratic attention explanation. Thanks! :D
Really appreciate the effort you are putting into this. Your paper explanations make my day, every day!
Very informative. Thanks for explanation!
At 16:27 it is claimed that "the transformer is naturally a sequence processing unit". Is it? Isn't it naturally a set processing unit, and isn't that why we put a positional encoding block in front of it?
This video was absolutely amazing. You explained this concept really well, and I loved the bit at 33:00 about flattening the image twice and using the rows and columns to create an attention matrix where every pixel can relate to every other pixel. Also loved the bit at the beginning where you explained the loss in detail; a lot of other videos just gloss over that part. Have liked and subscribed!
Amazing explanation. Keep up the great work.
You are a godsend! Please keep up the good work!
Hi Yannic, amazing video and great improvements in the presentation (time sections on YouTube, etc.). I really like where this channel is going, keep it up.
I read through the paper myself yesterday, as I've been working a bit with that kind of attention for CNNs, and I really liked the way you described the mechanism behind the different attention heads in such a simple and easily understandable way!
Your idea of directly inferring bboxes from two attending points in the "attention matrix" sounds neat and hadn't crossed my mind yet. But I guess you'd probably have to use some kind of NMS again if you did that?
One engineering problem that I came across, especially with those full (HxW)^2 attention matrices, is that they blow up your GPU memory insanely. Thus you can only use a fraction of the batch size, and an (HxW)^2 multiplication also takes forever, which is why the model takes much longer to train (and to run inference, I think).
What impressed me most was that a really quite unsophisticated "learned upscaling and argmax over all attention maps" achieved such great results for panoptic segmentation!
One thing that I did not quite get: can the multiple attention heads actually "communicate" with each other during the "look-up"? Going by the description in Attention Is All You Need, "we then perform the attention function in parallel, yielding d_v-dimensional output values", and the formula Concat(head_1, ..., head_h) W^O, it looks to me like the attention heads do not share information while attending to things. Only W^O might be able, during backprop, to reweight the attention heads if they have overlapping attention regions?
Yes, I see it the same way: the individual heads do independent operations in each layer. I guess the integration of information between them would then happen in higher layers, where their signals could be aggregated by a single head.
Also, thanks for the feedback :)
@@YannicKilcher The multi-head part is the only confusion I have about this great work. In NLP multi-head attention makes total sense: an embedding can "borrow" features/semantics from multiple words at different feature dimensions. But in CV it seems it isn't necessary? The authors didn't do an ablation study on the number of heads. My suspicion is that a single head works almost as well as 8 heads. I'd test it once I get a lot of GPUs...
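To make the independence of the heads concrete, here is a toy sketch of Concat(head_1, ..., head_h) W^O, using random weights instead of trained ones: each head attends entirely on its own, and the only place the heads mix within a layer is the final output projection.

```python
import torch
import torch.nn.functional as F

seq_len, d_model, n_heads = 10, 64, 8
d_head = d_model // n_heads
x = torch.randn(seq_len, d_model)

heads = []
for _ in range(n_heads):
    # Each head has its own projections and attends completely independently.
    Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = F.softmax(q @ k.t() / d_head ** 0.5, dim=-1)
    heads.append(attn @ v)                     # (seq_len, d_head), computed in isolation

W_o = torch.randn(d_model, d_model)
out = torch.cat(heads, dim=-1) @ W_o           # heads only interact through W_o
```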
I'm a bit confused. At 17:17, you are drawing vertical lines, meaning that you unroll the channels (ending up with a vector of features per pixel that is fed into the transformer, "pixel by pixel"). Is that how it's being done? Or should there be horizontal lines (WH x C), where you feed one feature at a time for the entire image into the transformer?
Yes. If you think of text transformers as consuming one word vector per word, the analogy here would be consuming all the channels of a pixel, one pixel at a time.
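A minimal sketch of that flattening (the tensor shapes are assumptions for illustration, not the official DETR code):

```python
import torch

B, C, H, W = 2, 256, 25, 34            # hypothetical backbone output
features = torch.randn(B, C, H, W)

# Each of the H*W pixels becomes one token carrying all C channels,
# giving the (sequence_length, batch, channels) layout a transformer expects.
tokens = features.flatten(2)            # (B, C, H*W)
tokens = tokens.permute(2, 0, 1)        # (H*W, B, C)
print(tokens.shape)                     # torch.Size([850, 2, 256])
```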
Thanks so much for making it so easy to understand these papers.
Thank you very much. This is a very good video. Very easy to understand.
Holy shit. Instant subscribe within 3 minutes. Bravo!!
Are you even human? You're really quick.
Nope .. A Bot
@@m.s.d2656 I don't actually know which is the most impressive
There's a bird!!! There's a bird...
@@krishnendusengupta6158 bird, bird, bird, bird, bird, bird, bird, bird, its a BIRD
Bird is the Word 😂
A naive question: at 39:17, are the attention maps you show here generated within the model itself, or are they fed in from outside at that stage?
They are from within the model
I am having problems understanding the size of the trainable queries. I know each one is a random vector, but of what size? If we want the output to be 1. bounding boxes (query_num, x, y, W, H) and 2. classes (query_num, num_classes), will the size of each object query be a 1x5 vector, [class, x, y, W, H]?
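Not quite: the query size is independent of the output format. Here is a hedged sketch of the usual setup (the 100 queries and hidden size 256 follow the paper's defaults; the single linear box head is a simplification of the paper's small MLP):

```python
import torch
import torch.nn as nn

num_queries, d_model, num_classes = 100, 256, 91

# Each object query is a learned d_model-dimensional vector, not a 5-dim box.
query_embed = nn.Embedding(num_queries, d_model)
print(query_embed.weight.shape)                    # torch.Size([100, 256])

# The decoder turns each query into a d_model output vector; separate heads
# then predict class logits and a 4-number box (cx, cy, w, h).
class_head = nn.Linear(d_model, num_classes + 1)   # +1 for the "no object" class
bbox_head = nn.Linear(d_model, 4)

decoder_out = torch.randn(num_queries, d_model)    # stand-in for the decoder output
logits = class_head(decoder_out)                   # (100, 92)
boxes = bbox_head(decoder_out).sigmoid()           # (100, 4), normalized coordinates
```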
2:47 Worth pointing out that the CNN reduces the spatial size of the image while retaining high-level features, and so massively speeds up the computation.
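A quick illustration of that size reduction (the input resolution and the plain torchvision ResNet-50 are assumptions for the example; exact numbers will vary):

```python
import torch
import torchvision

# Untrained ResNet-50 trunk: the weights don't matter for checking the shapes.
backbone = torchvision.models.resnet50()
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
trunk.eval()

img = torch.randn(1, 3, 800, 1066)       # one image at a typical COCO resolution
with torch.no_grad():
    feat = trunk(img)
print(feat.shape)                        # torch.Size([1, 2048, 25, 34])
# ~850 spatial positions instead of ~850k pixels: far fewer tokens for attention.
```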
Thank you for this content! I have recommended this channel to my colleagues.
"First paper ever to have ever cite a youtube channel." ...challenge accepted.
Great video, very speedy :). How well does this compare to YOLOv4?
No idea, I've never looked into it.
I think it might not be as good rn but the transformer part can be scaled like crazy.
very clear explanation, great work sir. thanks
Amazing and very intuitive explanation - Thanks!
You explained it so well. Thanks, best of luck!
I love how it understands which part of the image belongs to which object (the elephant example), regardless of overlap. It kind of understands depth. Maybe transformers can be used for depth mapping?
Fantastic explanation 👌 Looking forward to more videos ❤️
Great!!! Absolutely great! Fast, to the point, and extremely clear. Thanks!!
A great paper and a great review of the paper! As always nice work!
Thanks! Which one do you think is better, compared to YOLOv8 for example?
Great sharing! I'd like to ask: is there any guideline for deciding how many object queries to use for a particular object detection problem? Thanks!
Hi, thanks Yannic for all the videos. I have a question about recognizing digits in images when they are not handwritten: how can we find digits in the street, like building numbers or numbers on cars...? Thanks in advance.
Thank you so much for the video! It's amazing and gave me a much better understanding of this paper ^^
Very well done and understandable. Thank you!
The object queries remind me of latent variables in variational architectures (VAEs, for example). In those architectures, the latent variables are constrained with a prior. Is this done for the object queries? Would that be a good idea?
Thank you very much for the explanation! I have a couple of questions:
1. Can we consider object queries to be analogous to anchor boxes?
2. Does the attention visualization highlight the parts of the image that the network gives the highest importance to while predicting?
1. Somewhat, but object queries are learned and initially completely independent of the datapoint.
2. Yes, there are multiple ways, but roughly it's what you're saying
Thanks for the walkthrough!
YES! I was waiting for this!
Thanks for great explanation!
This is probably quite a stupid question, but can we just train end to end, from the image embedding to a string of symbols that contains all the information necessary for object detection? I am not arguing that it would be efficient, because of the obvious problems with representing numbers as text, but it could work, right? If yes, then we could drop the requirement for a predefined maximum number of objects to detect.
I guess technically you could solve any problem by learning an end-to-end system to predict its output in the form of a string. T5 is already doing sort of this for text tasks, so it's not that far out there, but I think these custom approaches still work better for now.
Maybe! But getting the neural network to converge on that dataset would be a nightmare. The gradient descent algorithm only cares about one thing, "getting down that hill fast"; with that sort of tunnel vision, it can easily miss important features. By forcing gradient descent to look at the scenery as it climbs down the mountain, you might get lucky and find a helicopter 😆
@@YannicKilcher
Guess it works now. :)
Pix2seq: A Language Modeling Framework for Object Detection
(sorry if I tagged you twice, the first comment had a Twitter link and got removed instantly.)
Thanks Yannic! Great explanation. Since the object queries are learned, and I assume they remain fixed after training, why do we keep the lower self-attention part of the decoder block during inference, instead of just replacing it with the precomputed Q values?
Thank you sooo much for this explanation!!
Excellent job as usual. Congrats on your Ph.D.
Cool trick, adding the positional encoding to K and Q while leaving V without it. Is this unique to DETR?
I'm guessing the decoder learns an offset from these given positions, analogous to how more traditional bounding box algorithms find boxes relative to a fixed grid, with the extra twist that the decoder also eliminates duplicates.
This is the same thing I wanted to ask. Why leave out V? It's not even described in the paper.
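For the record, a minimal single-head sketch of the trick being discussed (toy shapes, no learned projections): the position goes into the queries and keys, so the attention pattern can be location-aware, while the values that actually get averaged stay purely content-based.

```python
import torch
import torch.nn.functional as F

seq_len, d = 850, 256
x = torch.randn(seq_len, d)      # flattened image features (content)
pos = torch.randn(seq_len, d)    # positional encoding, same shape

q = x + pos                      # position-aware queries
k = x + pos                      # position-aware keys
v = x                            # values carry content only, no position

attn = F.softmax((q @ k.t()) / d ** 0.5, dim=-1)   # (seq_len, seq_len) weights
out = attn @ v                                      # weighted sum of content values
```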
What an amazing paper and explanation!
Thank you very much, this was really good.
How do you make the bipartite matching loss differentiable?
The matching itself isn't differentiable, but the resulting differences between matched predictions and targets are, so you just backprop through those.
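A toy sketch of that idea (the shapes and the plain L1 cost are assumptions; DETR's real matching cost also includes class probabilities and GIoU): the Hungarian assignment runs outside the autograd graph, and only the loss on the matched pairs gets gradients.

```python
import torch
from scipy.optimize import linear_sum_assignment

pred_boxes = torch.rand(5, 4, requires_grad=True)    # 5 predicted boxes
gt_boxes = torch.rand(3, 4)                          # 3 ground-truth boxes

# Build the cost matrix without gradients; the assignment itself is discrete.
cost = torch.cdist(pred_boxes.detach(), gt_boxes, p=1)       # (5, 3) L1 costs
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())

# Only the loss computed on the matched pairs is differentiated.
loss = (pred_boxes[pred_idx] - gt_boxes[gt_idx]).abs().mean()
loss.backward()
```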
I wonder if we can use this to generate captions from images using pure transformers.
And also for VQA: we could give the question encoding as input to the decoder.
Excellent work, thanks!
Thank you for providing such interesting paper readings, Yannic Kilcher!
really quite quick. thanks. make more...
Great video! What about a video on this paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows? They split the images into patches, use self-attention locally within every patch, and then shift the patches. Would be great to hear your explanation of this!
Awesome!!! Yannic, by any chance, would you mind reviewing the paper (1) Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning Models or (2) Analyzing and Improving the Image Quality of StyleGAN? I would find it helpful to have those papers deconstructed a bit!
Has anyone tried to run this on a Jetson Nano to compare with previous approaches? How fast is it in comparison with a MobileNet SSD v2?
So basically little people asking lots of questions... nice!
PS. Thanks Yannic for the great analogy and insight...
Object queries using back-propagated random variables are weird to me, and they seem unnecessary intuitively. I suspect that using simple numbers like 1 to N (in one-hot form) would still be feasible.
Love this content bro, thank you so much. Hoping to get an MSc in Artificial Intelligence.
Great channel, subscribed! How does this approach compare to models optimized for size and inference speed on mobile devices, like SSD MobileNet? (See the detection model zoo on the TF GitHub.)
No idea, I'm sorry :)
I wonder how the "object query" is different from the "Region Proposal Network" in R-CNN detectors.
It looks like Faster RCNN may still be better than DETR on smaller objects.
First difference that comes to mind is that the RPN has a chance to look at the image before outputting any region proposal, while the object queries don't. The RPN makes suggestions like "there's something interesting at this location of this image, we should look more into it". The object queries instead are learned in an image-agnostic fashion, meaning that they look more like questions, e.g. "is there any small object in the bottom-left corner?"
Great explanation
really thank you for your explanation!
Thanks for this vid, really fast. I still (after 2 days) haven't tried to run it on my data... feeling bad.
Thanks for the explanation
Excellent
This is a really great idea
Very cool video, thank you!
They do not use any kind of masking for the attention here right?
No, because all their inputs have the same size, they don't need masking
Can you train this on live VR/AR data?
I've always wondered where we could find the code for ML research papers (In this case, we're lucky to have Yannic sharing everything)... Can anyone in the community help me out?
Sometimes the authors create a GitHub repo or put the code as additional files on arXiv, but mostly there's no code.
paperswithcode.com/
Thank you for your detailed explanation. But I still cannot follow the idea of object queries in the transformer decoder. Based on your explanation, N people are each trained to find a different region, starting from a random value. Then why don't we directly grid the image into N parts and get rid of the randomness? In object detection we don't need the stochasticity of a "generator".
I love your channel, thank you soooo much!
I have never been so confused as when you started saying "diagonal" and then went from bottom left to top right; I'm just so used to the matrix convention. 32:40 Absolutely great otherwise.
How are those object queries learnt?
Please make a video on training this model on our own custom datasets.
If you think about it, transformers really are so much more effective than LSTMs for long sequences. The sequence here is of length WxH, which is on the order of thousands... Seriously, Attention Is All You Need was a breakthrough paper, like the one on GANs.
So cool! You are great!
Memory consumption has got to be batshit crazy with this. Would using some form of sparse attention hinder the goal here?
I guess you'd get some moderate drop in performance, but probably you could do it
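For a rough sense of scale, a back-of-the-envelope estimate of the attention memory (the feature-map size, head count and batch size are assumed for illustration):

```python
# One attention weight per pixel pair, per head, per image in the batch.
H, W, heads, batch = 32, 32, 8, 4
tokens = H * W
attn_floats = batch * heads * tokens ** 2
print(attn_floats * 4 / 2 ** 20, "MiB for one fp32 attention map")   # 128.0 MiB
# Doubling H and W multiplies this by 16, before even counting gradients.
```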
Can you do one about EfficientDet?
AI Developer:
AI: 8:36 BIRD! BIRD! BIRD!
Hi Yannic! Great video! I am working on a project, just for fun because I want to get better at deep learning, about predicting sale prices at auctions based on a number of features over time and also the state of the economy, probably represented by the stock market or GDP. So it's a time series prediction project. I want to use transfer learning, i.e. find a good pretrained model I can use. As you seem to be very knowledgeable about state-of-the-art deep learning, I wonder if you have any idea of a model I could use? Preferably one I can use with TensorFlow.
Wow, no clue :D You might want to look, for example, in the ML-for-medicine field, because they have a lot of data over time (heart rate, etc.), or the ML-for-speech field if you have really high sample rates. Depending on your signal, you might want to extract your own features or work with something like a Fourier transform of the data. If you have very little data, it might make sense to bin it into classes rather than use its original value. I guess the possibilities are endless, but ultimately it boils down to how much data you have, which puts a limit on how complicated a model you can learn.
Awesome 🔥🔥🔥
Great content!
Interesting to compare to YOLOv4, which claims to get 65.7% AP50?
But YOLO can't do instance segmentation yet, so Mask R-CNN is probably a better comparison. Also, YOLO probably runs faster than either of these.
Has anyone mentioned CornerNet in relation to the comment at 35:00?
Thanks :)
It's definitely not AGI, following your argument, which is true.
It seems to do more filtering and interpolation than actual reasoning.
I kinda feel disappointed, but this is good progress.
I'm still an amateur in AI, by the way.
Thanks a lot for this, really helpful!
Wow, it's just like how human attention works:
when we focus on one thing, we ignore the other things in an image.
Great video, thank you!