DETR: End-to-End Object Detection with Transformers (Paper Explained)

  • Published: 27 Dec 2024

Comments • 176

  • @slackstation
    @slackstation 4 years ago +122

    This is a gift. The clarity of the explanation, the speed at which it comes out. Thank you for all of your work.

  • @oldcoolbroqiuqiu6593
    @oldcoolbroqiuqiu6593 3 years ago +2

    34:08 GOAT explanation of the bbox in the attention feature map.

  • @ankitbhardwaj1956
    @ankitbhardwaj1956 4 years ago +1

    I had seen your Attention is all you need video and now watching this, I am astounded by the clarity you give in your videos. Subscribed!

  • @Phobos11
    @Phobos11 4 years ago +13

    The attention visualizations are practically instance segmentations. Very impressive results, and great job untangling it all.

  • @トシズァツンツン
    @トシズァツンツン 4 years ago +3

    Thank you for your wonderful video. When I first read this paper, I couldn't understand what the input of the decoder (the object queries) was, but after watching your video I finally got it: random vectors!

  • @CristianGarcia
    @CristianGarcia 4 years ago +54

    Loved the video! I was just reading the paper.
    Just wanted to point out that Transformers, or rather multi-head attention, naturally process sets, not sequences; this is why you have to include the positional embeddings.
    Do a video about the Set Transformer! In that paper they call the technique used by the decoder in this paper "Pooling by Multihead Attention".
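A minimal sketch of the set-processing point (my own example, dimensions made up; not from the video or paper): without positional embeddings, permuting the input tokens of self-attention simply permutes the output, so the model by itself has no notion of order.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(1, 6, 32)        # 6 tokens (e.g. pixels), no position information
perm = torch.randperm(6)

out, _ = attn(x, x, x)                                   # self-attention over the set
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])   # same set, shuffled order

# Permuting the input only permutes the output rows:
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```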

    • @YannicKilcher
      @YannicKilcher 4 years ago +8

      Very true, I was just still in the mode where transformers are applied to text ;)

    • @princecanuma
      @princecanuma 4 years ago

      What are positional encodings?

    • @snippletrap
      @snippletrap 4 years ago +1

      @@princecanuma The positional encoding is simply the index of each token in the sequence.
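For a concrete picture of what that index becomes in practice (a sketch with assumed dimensions, not from the thread): the original Transformer does not feed the raw index but a vector of sines and cosines of it, and DETR uses a 2D variant of the same idea over the image's x/y coordinates.

```python
import torch

def sinusoidal_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Sine/cosine encoding of token positions, as in 'Attention Is All You Need'."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimensions
    angles = pos / (10000 ** (dim / d_model))                             # (P, d_model/2)
    enc = torch.zeros(num_positions, d_model)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

pe = sinusoidal_encoding(50, 64)   # one 64-dim vector per position, added to the token embeddings
```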

    • @coldblaze100
      @coldblaze100 4 years ago +3

      @@snippletrap I had a feeling it was gonna be something that simple. 🤦🏾‍♂️ AI researchers' naming conventions aren't helping the community, in terms of accessibility lmao

    • @chuwang2125
      @chuwang2125 4 years ago

      Thank you for the one-line summary of "Pooling by Multihead Attention". This makes it 10x clearer what exactly the decoder is doing. I had the feeling that the "decoder + object seeds" does something similar to ROI pooling, which is gathering relevant information for a possible object. I also recommend reading the Set Transformer paper, which enhanced my limited knowledge of attention models. Thanks again for your comment!

  • @adisingh4422
    @adisingh4422 3 years ago +2

    Awesome video. I highly recommend reading the paper first and then watching this to solidify understanding. This definitely helped me understand the DETR model better.

  • @AonoGK
    @AonoGK 4 years ago +9

    Infinite respect for the Ali G reference

  • @chaouidhuzgen6818
    @chaouidhuzgen6818 3 years ago

    Wow, the way you've explained and broken down this paper is spectacular.
    Thanks, mate!

  • @zeynolabedinsoleymani4591
    @zeynolabedinsoleymani4591 1 year ago +1

    I like the way you DECIPHER things! thanks!

  • @TheAhmadob
    @TheAhmadob 3 years ago +2

    Really smart idea about how the (HxW)^2 matrix naturally embeds bounding boxes information. I am impressed :)

  • @aashishghosh8246
    @aashishghosh8246 4 years ago +1

    Yup. Subscribed with notifications. I love that you enjoy the content of the papers. It really shows! Thank you for these videos.

  • @sahandsesoot
    @sahandsesoot 4 years ago +4

    Greatest find on YouTube for me to date!! Thank you for the great videos!

  • @sawanaich4765
    @sawanaich4765 3 years ago +1

    You saved my project. Thank you 🙏🏻

  • @edwarddixon
    @edwarddixon 4 years ago +3

    "Maximal benefit of the doubt" - love it!

  • @ramandutt3646
    @ramandutt3646 4 years ago +13

    Was waiting for this. Thanks a lot! Also dude, how many papers do you read everyday?!!!

  • @AlexOmbla
    @AlexOmbla 2 years ago

    Very very nice explanation, I really subscribed for that quadratic attention explanation. Thanks! :D

  • @rishabpal2726
    @rishabpal2726 4 years ago +2

    Really appreciate the effort you are putting into this. Your paper explanations make my day, every day!

  • @Konstantin-qk6hv
    @Konstantin-qk6hv 3 years ago +2

    Very informative. Thanks for the explanation!

  • @TaherAbbasiz
    @TaherAbbasiz 1 year ago

    At 16:27 it is claimed that "the transformer is naturally a sequence processing unit". Is it? Isn't it naturally a set processing unit, and isn't that why we put a positional encoding block before it?

  • @hackercop
    @hackercop 2 years ago

    This video was absolutely amazing. You explained this concept really well, and I loved the bit at 33:00 about flattening the image twice and using the rows and columns to create an attention matrix where every pixel can relate to every other pixel. Also loved the bit at the beginning where you explained the loss in detail; a lot of other videos just gloss over that part. Have liked and subscribed.
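A rough sketch of the flattening described at 33:00 (shapes are illustrative, not the exact DETR code): the CNN feature map is flattened so every pixel becomes a token, and pixel-to-pixel attention is then an (H*W) x (H*W) matrix, one row per pixel attending to every other pixel.

```python
import torch

B, C, H, W = 1, 256, 25, 34                 # e.g. a backbone feature map
feat = torch.randn(B, C, H, W)

tokens = feat.flatten(2).transpose(1, 2)                # (B, H*W, C): one token per pixel
scores = tokens @ tokens.transpose(1, 2) / C ** 0.5     # (B, H*W, H*W) attention logits
attn = scores.softmax(dim=-1)                           # each row: one pixel attending to all pixels
print(attn.shape)                                       # torch.Size([1, 850, 850])
```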

  • @pravindesai6687
    @pravindesai6687 11 months ago

    Amazing explanation. Keep up the great work.

  • @biswadeepchakraborty685
    @biswadeepchakraborty685 4 years ago +2

    You are a godsend! Please keep up the good work!

  • @0lec817
    @0lec817 4 years ago +5

    Hi Yannic, amazing video and great improvements in the presentation (time sections on YouTube etc.). I really like where this channel is going, keep it up.
    I read through the paper myself yesterday, as I've been working with this kind of attention for CNNs a bit, and I really liked the way you described the mechanism behind the different attention heads in such a simple and easily understandable way!
    Your idea of directly inferring bboxes from two attending points in the "attention matrix" sounds neat and hadn't crossed my mind yet. But I guess you would then probably have to use some kind of NMS again?
    One engineering problem I came across, especially with those full (HxW)^2 attention matrices, is that they blow up your GPU memory insanely. You can only use a fraction of the batch size, and an (HxW)^2 multiplication also takes forever, which is why this model takes much longer to train (and, I think, to infer).
    What impressed me most was that an actually very "unsophisticated learned upscaling and argmax over all attention maps" achieved such great results for panoptic segmentation!
    One thing that I did not quite get: can the multiple attention heads actually "communicate" with each other during the "look-up"? Going by the description in Attention Is All You Need ("we then perform the attention function in parallel, yielding dv-dimensional output values") and the formula Concat(head1, ..., headh)W^O, it looks to me like the attention heads do not share information while attending to things. Only W^O might be able, during backprop, to reweight the attention heads if they have overlapping attention regions?

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Yes I see it the same way, the individual heads do independent operations in each layer. I guess the integration of information between them would then happen in higher layers, where their signal could be aggregated in a single head there.

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Also, thanks for the feedback :)

    • @gruffalosmouse107
      @gruffalosmouse107 4 years ago

      @@YannicKilcher The multi-head part is the only confusion I have about this great work. In NLP multi-head attention makes total sense: an embedding can "borrow" features/semantics from multiple words at different feature dimensions. But in CV it seems it's not necessary? The authors didn't do an ablation study on the number of heads. My suspicion is that a single head works almost as well as 8 heads. I would test it once I get a lot of GPUs...

  • @AlexanderPacha
    @AlexanderPacha 4 years ago +1

    I'm a bit confused. At 17:17, you are drawing vertical lines, meaning that you unroll the channels (ending up with a vector of features per pixel that is fed into the transformer, "pixel by pixel"). Is that how it's being done? Or should there be horizontal lines (WH x C), where you feed one feature at a time for the entire image into the transformer?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      Yes. If you think of text transformers as consuming one word vector per word, the analogy here would be that you consume all channels of a pixel, per pixel.

  • @AishaUroojKhan
    @AishaUroojKhan 2 years ago

    Thanks so much for making it so easy to understand these papers.

  • @pokinchaitanasakul-boss3370
    @pokinchaitanasakul-boss3370 3 years ago +1

    Thank you very much. This is a very good video. Very easy to understand.

  • @RyanMartinRAM
    @RyanMartinRAM 1 year ago

    Holy shit. Instant subscribe within 3 minutes. Bravo!!

  • @yashmandilwar8904
    @yashmandilwar8904 4 years ago +121

    Are you even human? You're really quick.

    • @m.s.d2656
      @m.s.d2656 4 years ago

      Nope .. A Bot

    • @meerkatj9363
      @meerkatj9363 4 years ago +2

      @@m.s.d2656 I don't actually know which is the most impressive

    • @krishnendusengupta6158
      @krishnendusengupta6158 4 years ago +3

      There's a bird!!! There's a bird...

    • @sadraxis
      @sadraxis 4 years ago +2

      ​@@krishnendusengupta6158 bird, bird, bird, bird, bird, bird, bird, bird, its a BIRD

    • @TheMatrixTony
      @TheMatrixTony 17 days ago

      Bird is the Word 😂

  • @arnavdas3139
    @arnavdas3139 4 years ago +1

    A naive doubt: at 39:17, are the attention maps you mention generated within the model itself, or are they fed in from outside at that stage?

  • @mariosconstantinou8271
    @mariosconstantinou8271 1 year ago

    I am having trouble understanding the size of the trainable queries. I know it's a random vector, but of what size? If we want the output to be 1. a bounding box (query_num, x, y, W, H) and 2. a class (query_num, num_classes), will each object query be a 1x5 vector [class, x, y, W, H]?
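For reference, a hedged sketch of how this looks in the public DETR code (hyperparameters as in the paper: 100 queries, hidden size 256, 91 COCO classes): the object queries are just learned 256-dimensional vectors, and the box/class shapes described above come from small prediction heads applied to the decoder outputs, not from the queries themselves.

```python
import torch
import torch.nn as nn

num_queries, d_model, num_classes = 100, 256, 91
query_embed = nn.Embedding(num_queries, d_model)   # the "object queries": a learned (100, 256) matrix

decoder_out = torch.randn(num_queries, d_model)    # stand-in for the decoder output, one vector per query
class_logits = nn.Linear(d_model, num_classes + 1)(decoder_out)   # (100, 92); "+1" is the no-object class
boxes = nn.Linear(d_model, 4)(decoder_out).sigmoid()              # (100, 4): (cx, cy, w, h) in [0, 1]
# (in the official code the box head is a small 3-layer MLP rather than a single linear layer)
```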

  • @hackercop
    @hackercop 2 years ago

    2:47 worth pointing out that the CNN reduces the size of the image while retaining high level features and so massively speeds up computation
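A rough sketch of that size reduction (shapes illustrative, assuming a ResNet-50 backbone as in the paper): the backbone downsamples by 32x, so the transformer attends over a small grid of high-level features rather than raw pixels.

```python
import torch
import torchvision

# ResNet-50 without its average-pool and classification head, used as a feature extractor
backbone = torch.nn.Sequential(*list(torchvision.models.resnet50().children())[:-2])

img = torch.randn(1, 3, 800, 1088)                        # a (padded) input image
feat = backbone(img)                                      # (1, 2048, 25, 34): 32x smaller spatially
proj = torch.nn.Conv2d(2048, 256, kernel_size=1)(feat)    # 1x1 conv down to the transformer width
print(feat.shape, proj.shape)
```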

  • @renehaas7866
    @renehaas7866 4 years ago

    Thank you for this content! I have recommended this channel to my colleagues.

  • @tarmiziizzuddin337
    @tarmiziizzuddin337 4 years ago +9

    "First paper ever to cite a YouTube channel." ...Challenge accepted.

  • @Augmented_AI
    @Augmented_AI 4 years ago +4

    Great video, very speedy :). How well does this compare to YOLOv4?

    • @YannicKilcher
      @YannicKilcher 4 years ago

      No idea, I've never looked into it.

    • @gunslingerarthur5865
      @gunslingerarthur5865 4 years ago

      I think it might not be as good rn but the transformer part can be scaled like crazy.

  • @uditagarwal6435
    @uditagarwal6435 2 years ago

    Very clear explanation, great work sir. Thanks!

  • @itaizilberman9979
    @itaizilberman9979 2 years ago

    Amazing and very intuitive explanation - Thanks!

  • @mahimanzum
    @mahimanzum 4 years ago +1

    You explained it so well. Thanks. Best of luck!

  • @arturiooo
    @arturiooo 2 years ago

    I love how it understands which part of the image belongs to which object (elephant example) regardless of overlapping. Kind of understands the depth. Maybe transformers can be used for depth-mapping?

  • @pranabsarkar
    @pranabsarkar 4 years ago +1

    Fantastic explanation 👌 Looking forward to more videos ❤️

  • @opiido
    @opiido 4 years ago

    Great!!! Absolutely great! Fast, to the point, and extremely clear. Thanks!!

  • @michaelcarlon1831
    @michaelcarlon1831 4 years ago

    A great paper and a great review of the paper! As always nice work!

  • @mohammadyahya78
    @mohammadyahya78 6 months ago

    Thanks! Which one do you think is better compared to, for example, YOLOv8?

  • @wjmuse
    @wjmuse 4 years ago +1

    Great sharing! I'd like to ask whether there is any guideline for deciding how many object queries we should use for a particular object detection problem. Thanks!

  • @maloukemallouke9735
    @maloukemallouke9735 3 years ago

    Hi, thanks Yannic for all the videos. I have a question about recognizing digits in images that are not handwritten: how can we find digits in the street, like the numbers of buildings or cars...? Thanks in advance.

  • @Charles-my2pb
    @Charles-my2pb 2 years ago

    Thank you so much for the video! It's amazing and helped me understand this paper much better ^^

  • @Gotrek103
    @Gotrek103 4 years ago +1

    Very well done and understandable. Thank you!

  • @jadtawil6143
    @jadtawil6143 3 years ago

    The object queries remind me of latent variables in variational architectures (VAEs, for example). In those architectures, the latent variables are constrained with a prior. Is this done for the object queries? Would that be a good idea?

  • @himanshurawlani3445
    @himanshurawlani3445 4 years ago +1

    Thank you very much for the explanation! I have a couple of questions:
    1. Can we consider object queries to be analogous to anchor boxes?
    2. Does the attention visualization highlight those parts of the image which the network gives the highest importance to while predicting?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      1. Somewhat, but object queries are learned and initially completely independent of the datapoint.
      2. Yes, there are multiple ways, but roughly it's what you're saying

  • @kodjigarpp
    @kodjigarpp 1 year ago

    Thanks for the walkthrough!

  • @tsunamidestructor
    @tsunamidestructor 4 years ago +3

    YES! I was waiting for this!

  • @sungso7689
    @sungso7689 4 years ago +1

    Thanks for the great explanation!

  • @mikhaildoroshenko2169
    @mikhaildoroshenko2169 4 years ago +1

    This is probably quite a stupid question, but can we just train end to end, from image embedding to a string of symbols which contains all the necessary information for object detection? I am not arguing that it would be efficient, because of obvious problems with representing numbers as text, but it could work, right? If yes, then we could drop the requirement for a predefined maximum number of objects to detect.

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      I guess technically you could solve any problem by learning an end-to-end system to predict its output in the form of a string. T5 is already doing sort of this for text tasks, so it's not so far out there, but I think these custom approaches still work better for now.

    • @larrybird3729
      @larrybird3729 4 years ago +1

      Maybe! but getting the neural-network to converge to that dataset would be a nightmare. The gradient-descent-algorithm only cares about one thing, "getting down that hill fast", with that sort of tunnel-vision, it can easily miss important features. So forcing gradient-descent to look at the scenery as it climbs down the mountain, you might get lucky and find a helicopter😆

    • @mikhaildoroshenko2169
      @mikhaildoroshenko2169 3 years ago

      @@YannicKilcher
      Guess it works now. :)
      Pix2seq: A Language Modeling Framework for Object Detection
      (sorry if I tagged you twice, the first comment had a Twitter link and got removed instantly.)

  • @TheGatoskilo
    @TheGatoskilo 1 year ago

    Thanks Yannic! Great explanation. Since the object queries are learned and I assume they remain fixed after training, why do we keep the lower self-attention part of the decoder block during inference, and not just replace it with the precomputed Q values?

  • @dheerajkhanna7697
    @dheerajkhanna7697 1 year ago

    Thank you sooo much for this explanation!!

  • @johngrabner
    @johngrabner 3 years ago +2

    Excellent job as usual. Congrats on your Ph.D.
    Cool trick adding position encoding to K,Q and leaving V without position encoding. Is this unique to DETR?
    I'm guessing the decoder learns an offset from these given positions, analogous to how more traditional bounding-box algorithms find boxes relative to a fixed grid, with the extra twist that the decoder also eliminates duplicates.
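A minimal sketch of that trick (my own, shapes arbitrary), following the DETR figure: the spatial positional embedding is added to the queries and keys only, so the attention weights are position-aware while the values carry pure content.

```python
import torch
import torch.nn as nn

d_model, n_tokens = 256, 850
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

content = torch.randn(1, n_tokens, d_model)    # encoder features
pos = torch.randn(1, n_tokens, d_model)        # positional embedding (sine or learned)

q = k = content + pos                          # position added to queries and keys...
v = content                                    # ...but not to the values
out, _ = attn(q, k, v)
```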

    • @danielharsanyi844
      @danielharsanyi844 9 months ago

      This is the same thing I wanted to ask. Why leave out V? It's not even described in the paper.

  • @tae898
    @tae898 3 years ago

    What an amazing paper and an explanation!

  • @drhilm
    @drhilm 4 years ago +2

    Thank you very much, this was really good.

  • @tranquil_cove4884
    @tranquil_cove4884 4 years ago +1

    How do you make the bipartite matching loss differentiable?

    • @YannicKilcher
      @YannicKilcher 4 years ago +2

      the matching itself isn't differentiable, but the resulting differences are, so you just take that.
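A simplified sketch of that answer (not the full DETR loss; the official matcher does use scipy's linear_sum_assignment): the Hungarian matching runs without gradients, and only the loss on the matched pairs is backpropagated.

```python
import torch
from scipy.optimize import linear_sum_assignment

pred_boxes = torch.rand(100, 4, requires_grad=True)    # N = 100 predicted boxes
tgt_boxes = torch.rand(3, 4)                           # 3 ground-truth boxes

with torch.no_grad():                                  # the matching itself needs no gradient
    cost = torch.cdist(pred_boxes, tgt_boxes, p=1)     # (100, 3) L1 matching cost
    pred_idx, tgt_idx = linear_sum_assignment(cost.numpy())

# Only the differences on the matched pairs are differentiated:
loss = (pred_boxes[pred_idx] - tgt_boxes[tgt_idx]).abs().mean()
loss.backward()
```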

  • @mathematicalninja2756
    @mathematicalninja2756 4 years ago +4

    I wonder if we can use this to generate captions from images using pure transformers

    • @amantayal1897
      @amantayal1897 4 years ago +2

      And also for VQA: we could give the question encoding as input to the decoder

  • @善民赵
    @善民赵 2 years ago

    Excellent work, thanks!

  • @李祥泰
    @李祥泰 4 years ago

    Thank you for providing such interesting paper readings, Yannic Kilcher!

  • @tianhao7783
    @tianhao7783 4 years ago +1

    Really quite quick. Thanks. Make more...

  • @thivyesh
    @thivyesh 2 years ago +1

    Great video! What about a video on this paper: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"? They split the images into patches, use self-attention locally within every window, and then shift the windows. Would be great to hear your explanation of this!

  • @anheuser-busch
    @anheuser-busch 4 years ago +2

    Awesome!!! Yannic, by any chance, would you mind reviewing the paper (1) Fawkes: Protecting Personal Privacy against Unauthorized Deep Learning Models or (2) Analyzing and Improving the Image Quality of StyleGAN? I would find it helpful to have those papers deconstructed a bit!

  • @Volconnh
    @Volconnh 4 years ago +1

    Has anyone tried to run this on a Jetson Nano to compare with previous approaches? How fast is it in comparison with a MobileNet SSD v2?

  • @wizardOfRobots
    @wizardOfRobots 2 years ago

    So basically little people asking lots of questions... nice!
    PS. Thanks Yannic for the great analogy and insight...

  • @ruanjiayang
    @ruanjiayang 2 years ago +1

    Object queries as back-propagated random variables seem weird to me, and intuitively unnecessary. I suspect that using simple numbers like 1 to N (in one-hot form) would still be feasible.

  • @apkingboy
    @apkingboy 4 years ago +1

    Love this content bro, thank you so much. Hoping to get an MSc in Artificial Intelligence

  • @benibachmann9274
    @benibachmann9274 4 years ago +1

    Great channel, subscribed! How does this approach compare to models optimized for size and inference speed on mobile devices, like SSD MobileNet? (See the detection model zoo on the TF GitHub)

  • @dshlai
    @dshlai 4 years ago +3

    I wonder how “Object Query” is different from “Region Proposal Network” in RCNN detector

    • @dshlai
      @dshlai 4 years ago

      It looks like Faster RCNN may still be better than DETR on smaller objects.

    • @FedericoBaldassarre
      @FedericoBaldassarre 4 years ago

      First difference that comes to mind is that the RPN has a chance to look at the image before outputting any region proposal, while the object queries don't. The RPN makes suggestion like "there's something interesting at this location of this image, we should look more into it". The object queries instead are learned in an image-agnostic fashion, meaning that they look more like questions e.g. "is there any small object in the bottom-left corner?"

  • @JagannathanK-y5e
    @JagannathanK-y5e 1 year ago

    Great explanation

  • @cuiqingli2077
    @cuiqingli2077 4 years ago +1

    really thank you for your explanation!

  • @KarolMajek
    @KarolMajek 4 years ago +2

    Thanks for this vid, really fast. I still (after 2 days) haven't tried to run it on my data - feeling bad

  • @Muhammadw92
    @Muhammadw92 9 months ago

    Thanks for the explanation

  • @quantum01010101
    @quantum01010101 3 years ago +1

    Excellent

  • @kylepena8908
    @kylepena8908 3 years ago

    This is a really great idea

  • @krocodilnaohote1412
    @krocodilnaohote1412 2 years ago

    Very cool video, thank you!

  • @manuelpariente2288
    @manuelpariente2288 4 years ago +1

    They do not use any kind of masking for the attention here right?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      No, because all their inputs have the same size, they don't need masking

  • @erobusblack4856
    @erobusblack4856 1 year ago

    Can you train this on live VR/AR data?

  • @marcgrondier398
    @marcgrondier398 4 years ago +3

    I've always wondered where we could find the code for ML research papers (In this case, we're lucky to have Yannic sharing everything)... Can anyone in the community help me out?

    • @YannicKilcher
      @YannicKilcher 4 years ago +1

      Sometimes the authors create a github repo or put the code as additional files on arxiv, but mostly there's no code.

    • @convolvr
      @convolvr 4 years ago +3

      paperswithcode.com/

  • @JLin-xk9nf
    @JLin-xk9nf 3 years ago

    Thank you for your detailed explanation. But I still cannot follow the idea of the object queries in the transformer decoder. Based on your explanation, N people are trained to each look at a different region, starting from a random value. Then why do we not directly grid the image into N parts and get rid of the randomness? In object detection we do not need the stochasticity of a "generator".

  • @sonOfLiberty100
    @sonOfLiberty100 4 years ago +2

    I love your channel thank you soooo much

  • @FlorianLaborde
    @FlorianLaborde 4 years ago

    I have never been as confused as when you started saying "diagonal" and then going from bottom left to top right. So used to the matrix paradigm. 32:40 Absolutely great otherwise.

  • @DANstudiosable
    @DANstudiosable 3 years ago

    How are those object queries learnt?

  • @arjunpukale3310
    @arjunpukale3310 4 years ago +6

    Please make a video to train this model on our own custom datasets

  • @ruskinrajmanku2753
    @ruskinrajmanku2753 4 years ago

    If you think about it, transformers are really so much more effective than LSTMs for long sequences. The sequence here is of length WxH, which is on the order of thousands... Seriously, Attention Is All You Need was a breakthrough paper, like the one on GANs.

  • @florianhonicke5448
    @florianhonicke5448 4 years ago +2

    So cool! You are great!

  • @glennkroegel1342
    @glennkroegel1342 4 years ago +2

    Memory consumption has got to be batshit crazy with this. Would using some form of sparse attention hinder the goal here?

    • @YannicKilcher
      @YannicKilcher 4 years ago

      I guess you'd get some moderate drop in performance, but probably you could do it

  • @vaibhavsingh1049
    @vaibhavsingh1049 4 years ago +1

    Can you do one about EfficientDet?

  • @christianjoshua8666
    @christianjoshua8666 4 years ago +8

    AI Developer:
    AI: 8:36 BIRD! BIRD! BIRD!

  • @linusjohansson3164
    @linusjohansson3164 4 years ago

    Hi Yannic! Great video! I am working on a project, just for fun because I want to get better at deep learning, about predicting sale prices at auctions based on a number of features over time and also the state of the economy, probably represented by the stock market or GDP. So it's a time-series prediction project. I want to use transfer learning, finding a good pretrained model I can use. As you seem to be very knowledgeable about state-of-the-art deep learning, I wonder if you have any idea of a model I could use? Preferably I should be able to use it with TensorFlow.

    • @YannicKilcher
      @YannicKilcher 4 years ago

      Wow, no clue :D You might want to look for example in the ML for medicine field, because they have a lot of data over time (heart rate, etc.) or the ML for speech field if you have really high sample rates. Depending on your signal you might want to extract your own features or work with something like a fourier transform of the data. If you have very little data, it might make sense to bin it into classes, rather than use its original value. I guess the possibilities are endless, but ultimately it boils down to how much data you have, which puts a limit on how complicated of a model you can learn.

  • @jjmachan
    @jjmachan 4 years ago +2

    Awesome 🔥🔥🔥

  • @frederickwilliam6497
    @frederickwilliam6497 4 years ago +1

    Great content!

  • @gerardwalsh4724
    @gerardwalsh4724 4 years ago +1

    Interesting to compare to YOLOv4, which claims to get 65.7% AP50?

    • @dshlai
      @dshlai 4 years ago +1

      But YOLO can't do instance segmentation yet, so Mask R-CNN is probably the better comparison. Also, YOLO probably runs faster than either of these.

  • @fatmaguney3598
    @fatmaguney3598 4 years ago +1

    Has anyone mentioned CornerNet for the comment around minute 35?

  • @chideraachinike7619
    @chideraachinike7619 4 years ago

    It's definitely not AGI, following your argument - which is true.
    It seems to do more filtering and interpolation than actual reasoning.
    I kinda feel disappointed. But this is good progress.
    I'm still an amateur in AI, by the way.

  • @diplodopote
    @diplodopote 2 years ago

    Thanks a lot for this, really helpful

  • @sanjaybora04
    @sanjaybora04 1 year ago

    Wow, it's the same as how human attention works.
    When we focus on one thing, we ignore other things in an image.

  • @a_sobah
    @a_sobah 4 years ago +1

    Great video, thank you!