ResNet (actually) explained in under 10 minutes

  • Published: 25 Jan 2025

Comments

  • @nialperry9563
    @nialperry9563 1 year ago +24

    Cracking video, Rupert. Well animated and explained. I am already satisfied with my understanding of ResNets after this.

  • @AhmedThahir2002
    @AhmedThahir2002 1 month ago

    This has to be the best explanation of ResNet ever.
    Amazing work, Rupert!

  • @sarthakpatwari7988
    @sarthakpatwari7988 1 year ago +26

    Mark my words, if he becomes consistent, this channel will become one of the next big things in AI.

  • @Cypher195
    @Cypher195 2 years ago +16

    Thanks. Been out of touch with AI for far too long so this summary is very helpful.

    • @rupert_ai
      @rupert_ai 2 years ago +2

      Thanks Aziz, good luck with getting back in touch with AI

  • @prammar1951
    @prammar1951 5 months ago +14

    Everyone is praising the video. Maybe it's just me, but I really didn't understand what the residual connection hopes to achieve, or how it does that. The video didn't make it clear.

    • @TheJDen
      @TheJDen 3 months ago +4

      “Residuals” are what mathematicians call the difference between the actual and predicted data values.
      Imagine you had a simple dataset that looked linear but with some oscillating variation (try plotting x + sin(3x) in a graphing calculator).
      One option to model this data would be to train a network on each x and y. In this case, the model would have to learn both the underlying linear trend (x) and the oscillation (sin(3x)).
      Alternatively, we could estimate the slope of the line (without variations). We could then repeatedly feed the estimated height of the line at x into the network whenever it is training on an (x, y) pair. This way, the model only has to learn the oscillation, that is, the difference between the data and the line: the residual (sin(3x)).
      It makes the model’s job easier because it doesn’t have to learn and keep track of the linear trend (x), since we remind it every few steps. In more complex cases like the one he showed in the video, it means the model doesn’t have to learn both how to maintain a good representation of a flower and how to increase the resolution, only how to increase the resolution (because it always has access to the original flower).
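The toy example in this reply can be checked numerically. The snippet below is a minimal sketch, assuming the linear trend is exactly x and is handed to the model for free (the role the skip connection plays); the variable names are illustrative and nothing here is taken from the video itself.

```python
import numpy as np

# Toy data from the comment: a linear trend plus an oscillation.
x = np.linspace(-3.0, 3.0, 200)
y = x + np.sin(3.0 * x)

# Plain setup: the model must reproduce the whole target y.
plain_target = y

# Residual setup: the trend x is supplied by the "skip", so the
# model only has to produce the difference y - x.
residual_target = y - x

# Adding the skip back onto a perfect residual prediction recovers y.
reconstructed = x + residual_target

print(float(np.abs(plain_target).max()))     # range the plain model must cover
print(float(np.abs(residual_target).max()))  # bounded by 1, since |sin| <= 1
```

The residual target is just sin(3x), which is smaller in range and simpler in shape than the full target.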

  • @poopenfarten4222
    @poopenfarten4222 1 year ago +16

    legit one of the best explanations i found

  • @agenticmark
    @agenticmark 11 months ago

    lol, I have fought that exact trendline so many times in ML :D Great humor. Great video work.

  • @sergioorozco7331
    @sergioorozco7331 1 year ago +1

    Is the right hand side of the addition supposed to have height and width dimension of 32x32 at 7:08? I think there is a small typo in the visual.

  • @devanshsharma5159
    @devanshsharma5159 1 year ago +3

    love the animation! Thanks for the clean and clear explanation!

  • @samruddhisaoji7195
    @samruddhisaoji7195 3 months ago +2

    9:02 I have a doubt: how do the numbers of features on the LHS and RHS match? LHS = w*h*c. RHS = (w/2)*(h/2)*(2*c). Thus RHS = 2*LHS

    • @Bryanvas25
      @Bryanvas25 3 months ago

      Actually RHS = (1/2) * LHS, and yes, I also don't understand that part.

    • @samruddhisaoji7195
      @samruddhisaoji7195 3 months ago

      @Bryanvas25 Yes, you're right about RHS = LHS/2. My bad!
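The arithmetic in this thread is easy to verify. The sketch below uses a hypothetical 64x64 input with 3 channels; `num_features` is an illustrative helper, not something from the video. Halving width and height divides the count by 4 and doubling the channels multiplies it by 2, so the downsampled shape holds half as many values. The addition still works because, as described elsewhere in the comments, the shortcut is itself reshaped to (w/2, h/2, 2c) before the two sides are added.

```python
def num_features(width, height, channels):
    """Number of values in a feature map of the given shape."""
    return width * height * channels

w, h, c = 64, 64, 3  # hypothetical input shape

lhs = num_features(w, h, c)                # block input
rhs = num_features(w // 2, h // 2, 2 * c)  # after halving size, doubling channels

print(lhs, rhs)
```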

  • @ciciy-wm5ik
    @ciciy-wm5ik 5 months ago +2

    At 2:09: image1 - image2 = image3 does not imply image1 + image3 = image2.

    • @gunasekhar8440
      @gunasekhar8440 4 months ago

      I think we need to assume that, because in the paper they let h(x) be the desired mapping, x the input, and f(x) some transformation, so f(x) = h(x) - x.
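Written out in this reply's notation (h for the desired mapping, f for what the block learns), the definition and its rearrangement are shown below. The second line applies the same rearrangement to the image example from the parent comment; it is a purely algebraic restatement, not a claim about what the video shows at that timestamp.

```latex
\[
\begin{aligned}
f(x) = h(x) - x
  &\quad\Longleftrightarrow\quad h(x) = f(x) + x \\
\text{image}_1 - \text{image}_2 = \text{image}_3
  &\quad\Longleftrightarrow\quad \text{image}_1 = \text{image}_2 + \text{image}_3
\end{aligned}
\]
```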

  • @TheBlendedTech
    @TheBlendedTech 2 years ago +6

    Thank you, this was well put together and very useful.

  • @Omsip123
    @Omsip123 1 year ago

    I pushed it to exactly 1k likes, cause it deserves it ... and many more

  • @logon2778
    @logon2778 2 years ago +6

    You say that the identity function is added element-wise at the end of the block. So say I have an identity [1,2] and the result of the block is [3,4]. Would the output of the layer be [4,6]? So it's not a concatenation of the identity function, which would be [1,2,3,4], correct? You basically ensure the identity function has the same dimensionality as the output of the block, then add them element-wise.

    • @rupert_ai
      @rupert_ai 2 years ago

      Hey Logon, great question. You are totally correct: the output from your example (identity [1,2] and block output [3,4]) would be [4,6], i.e. you simply add the values based on their twin positions. You don't concatenate! Yes, the last section on dimension matching covers the scenario where the dimensions don't match (and therefore you can't add them element-wise until you modify them).
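The [1,2] and [3,4] example from this exchange, as a short NumPy sketch (the variable names are illustrative):

```python
import numpy as np

identity = np.array([1, 2])      # values carried by the skip connection
block_output = np.array([3, 4])  # values produced by the block

# What a residual block does: add values at matching positions.
added = identity + block_output

# What it does NOT do: stack the two vectors end to end.
concatenated = np.concatenate([identity, block_output])

print(added.tolist())
print(concatenated.tolist())
```

Addition keeps the shape of the block output, which is why both operands must already have identical dimensions.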

    • @logon2778
      @logon2778 2 years ago +1

      @rupert_ai So in the case of the 1x1 convolutions where there are 3 input channels and 6 output channels of equal size... How are they added element-wise? Are the input features added element-wise twice, once for each set of 3 output channels? Or does it only add element-wise to the first 3 output channels and leave the other 3 untouched?

    • @rupert_ai
      @rupert_ai 2 years ago

      Hi @logon2778, as is standard with convolutional neural networks, each 1x1 convolution takes contributions from all channels (in this case, all 3 channels of the input). So in order to have 6 output channels you have 6 lots of 1x1 convolutions, each taking contributions from all 3 channels.
      In order to halve the size you skip every other pixel (i.e. a stride of 2). That is simply what is used in the original paper; obviously other approaches work too. Now you have a 6-channel output with half the height and width, which matches the network dimensions, and you can do element-wise addition as usual. Have a watch of the video again and look up convolutional basics (I have a video on this, actually); hopefully that might shed some light on things: ruclips.net/video/6VP9k2WM6k0/видео.html

    • @logon2778
      @logon2778 2 years ago +2

      @rupert_ai I understand how convolution works for the most part. At 8:45 you show that there are 6 output channels of equal size to the input. But how can you element-wise add 3 input channels to 6 output channels of equal size? In my mind you have double the dimensions. You have 6 64x64 output channels, but you have 3 64x64 input channels. So how can you element-wise multiply them?

    • @rupert_ai
      @rupert_ai 2 years ago +3

      @logon2778 The section you mention discusses what must be done to the copy of the identity along the residual connection BEFORE you do element-wise addition with the output from the resnet block. The process follows this logic:
      1) Save a copy of your input as the identity (e.g. 3 channels, 64x64).
      2) Run your input through the main block; this outputs a new tensor. This new tensor can have the same dimensions or it can have different dimensions (e.g. 6 channels, 32x32). If it has different dimensions proceed to step 3); if it has the same dimensions proceed to step 4).
      3) Take the copy of the identity from step 1) and apply 6 1x1 convolution kernels with stride 2 to it; this outputs 6 channels, 32x32.
      4) Do element-wise addition with your identity and your resnet block output. Note that if the dimensions changed, then you also changed your identity in step 3) to ensure you can do element-wise addition.
      Element-wise addition is simply adding each corresponding value with one another. E.g. the value in the top left corner of channel 2 of the first tensor is added to the value in the top left corner of channel 2 of the second tensor. You don't do element-wise multiplication as you mention. Hope that clears it up!
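The four steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions: arrays are channels-first, the main block is replaced by a random stand-in tensor, and the projection weights are random rather than learned. `conv1x1_stride2` is a hypothetical helper written for this sketch, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1_stride2(x, weight):
    """1x1 convolution with stride 2 (hypothetical helper).

    x:      (in_channels, height, width)
    weight: (out_channels, in_channels), one 1x1 kernel per output channel
    """
    strided = x[:, ::2, ::2]  # stride 2: keep every other pixel
    # Each output channel is a weighted sum over ALL input channels.
    return np.einsum("oc,chw->ohw", weight, strided)

# Step 1: keep a copy of the input as the identity (3 channels, 64x64).
x = rng.standard_normal((3, 64, 64))
identity = x

# Step 2: stand-in for the main block output (6 channels, 32x32).
# A real block would compute this from x with its own convolutions.
block_output = rng.standard_normal((6, 32, 32))

# Step 3: shapes differ, so project the identity with six 1x1 kernels.
if identity.shape != block_output.shape:
    projection = rng.standard_normal((6, 3))  # normally learned
    identity = conv1x1_stride2(identity, projection)

# Step 4: element-wise addition now works.
out = identity + block_output
print(out.shape)
```

Each of the 6 projected channels mixes all 3 input channels, so no input channel is "added twice" and no output channel is left untouched.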

  • @heathernapthine8775
    @heathernapthine8775 2 months ago

    Is the zero padding only done for layers which increase the size, or is it done for downsampling layers too? Intuitively, if we zero-padded the output in order to add a larger input, this doesn't seem like a downsampled layer?

  • @wege8409
    @wege8409 7 months ago

    6:38 this is the part that really made me understand, thank you

  • @xagent6327
    @xagent6327 2 months ago

    The solution to pad with zeros fixed the number of channels, but how did they then reduce the dimensions from 64x64 to 32x32?
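One way to picture the parameter-free shortcut: the zero padding only supplies the extra channels, while the spatial reduction comes from applying the shortcut with a stride of 2, i.e. keeping every other pixel. The sketch below assumes channels-first arrays and the 3-to-6 channel, 64x64-to-32x32 example discussed in these comments; `zero_pad_shortcut` is a hypothetical helper, not code from the video or the paper.

```python
import numpy as np

def zero_pad_shortcut(x, out_channels):
    """Parameter-free shortcut: subsample spatially, pad channels with zeros."""
    strided = x[:, ::2, ::2]                 # 64x64 -> 32x32 via stride 2
    extra = out_channels - strided.shape[0]  # number of all-zero channels to append
    # np.pad defaults to constant padding with zeros.
    return np.pad(strided, ((0, extra), (0, 0), (0, 0)))

x = np.arange(3 * 64 * 64, dtype=float).reshape(3, 64, 64)
shortcut = zero_pad_shortcut(x, out_channels=6)
print(shortcut.shape)
```

The first 3 channels are the subsampled input and the last 3 are all zeros, so the shortcut adds no learned parameters.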

  • @ShahidulAbir
    @ShahidulAbir 1 year ago +4

    Amazing explanation. Thank you for the video

  • @louisdante8457
    @louisdante8457 5 months ago

    7:53 Why is there a need to preserve the time complexity per layer?

    • @samruddhisaoji7195
      @samruddhisaoji7195 3 months ago

      The number of elements in the input and output of a convolution layer should remain the same, as later we will be performing an element-wise operation.
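Another reading of "time complexity per layer", offered here as an assumption rather than the video's wording: the cost of a convolution layer scales with the number of output positions times input channels times output channels times kernel area. Halving the height and width cuts the positions by 4, and doubling both the input and output channels multiplies the channel product by 4, so the cost per layer stays constant. The shapes below are hypothetical and `conv_multiply_adds` is an illustrative helper.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel=3):
    """Multiply-adds for one conv layer, assuming stride 1 and 'same' padding."""
    return height * width * in_channels * out_channels * kernel * kernel

before = conv_multiply_adds(64, 64, 64, 64)   # hypothetical stage
after = conv_multiply_adds(32, 32, 128, 128)  # half the size, double the filters

print(before, after)
```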

  • @egesener1932
    @egesener1932 2 years ago +2

    Everyone says ResNet solves the vanishing gradient problem, but don't we already use the ReLU function instead of sigmoid to solve it? Also, section 4.1 of the paper says the plain counterpart with batch normalization doesn't suffer from vanishing gradients but still has a higher error rate when the layers are increased from 18 to 34. Can you explain it?

    • @rupert_ai
      @rupert_ai 2 years ago +2

      1) There are multiple things that help solve the vanishing/exploding gradient problem. Residual connections in general help massively with the learning process, as they ground the learning process around the desired result, i.e. you learn the difference between what you have and the correct result (the residual).
      2) Batch normalisation also helps with the vanishing/exploding gradient problem, as it gives the features of each layer a normalised distribution that is scaled so it won't explode or vanish.
      3) On your point about 4.1: they are saying that networks without residual connections (plain) have worse error when they have more layers (18 vs 34), for the exact reason I stated in part 1) of this answer. It is a difficult optimisation problem for the network to solve without the residual; when you add residuals you aren't penalised for adding more layers to your network. Hope that makes sense!
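A short derivation may help with point 1). It is a sketch under simplifying assumptions: the block is written as y = x + F(x), the ReLU after the addition is ignored, and the symbols F and the loss \mathcal{L} are introduced here for illustration.

```latex
\[
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
\]
```

The identity term I passes the gradient from y straight back to x, so the signal reaching earlier layers does not depend solely on the Jacobian of F, which may be small.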

  • @firefistace8569
    @firefistace8569 1 year ago +2

    What is the residual in the image classification task?

    • @rupert_ai
      @rupert_ai 1 year ago +2

      Good question! It can be tricky to understand what the residual might be in the image classification task, as it is more abstract than in the super resolution task. Essentially, you use the feature maps from previous layers and learn the 'residual' between previous layers and the current layer - in essence this makes a very powerful block of computation that is grounded by the skip connections. This makes image classification easier as the network itself can process the image in a more comprehensive way. There really isn't any 'end-to-end' residual in image classification like there is with super resolution. I hope that answers your question.

    • @firefistace8569
      @firefistace8569 1 year ago

      @rupert_ai Thanks!

  • @mohamed_akram1
    @mohamed_akram1 1 year ago +3

    Nice video. Did you use Manim?

    • @rupert_ai
      @rupert_ai 1 year ago +1

      Hey Mohamed! Yes I did - my first video using manim! I hope to use it for some more complex things in the future :)

  • @panjak323
    @panjak323 1 year ago

    Idk why, but simply adding the bicubically upscaled image to the output of a CNN with a pixel shuffling layer achieves much better results than having any number of residual blocks. Also it's much faster.

  • @januarchristie615
    @januarchristie615 1 year ago +1

    Hello, I apologize for my question, but I still don't quite understand why learning residuals improves model predictions.
    Thank you

    • @giovannyencinia9239
      @giovannyencinia9239 1 year ago +1

      I think that is because this architecture can represent the identity function. First you have an input a^[l], which passes through the convolutions, batch normalization, activation function, etc., and finally there is an output z^[l+2] (this output of the hidden layers has some parameters theta). This is where the architecture adds a^[l] back: ReLU(z^[l+2] + a^[l]). In the backpropagation step there is the possibility that the optimal parameters make z^[l+2] equal to 0, so the result is a^[l] (the ReLU leaves a^[l] unchanged because it is already non-negative), which means that the intermediate layers won't be used. If you build a big, deep NN, this architecture can skip the layers (residual blocks) that do not help to reach the local optimum.
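The "block collapses to the identity" argument in this reply can be checked with a small sketch. It makes simplifying assumptions: two dense layers stand in for the convolution and batch-norm layers, and a^[l] is taken to be a ReLU output (so non-negative). `residual_block` is an illustrative helper, not the architecture from the paper.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(a, w1, w2):
    """Two linear layers with a skip connection: ReLU(z + a)."""
    hidden = relu(w1 @ a)
    z = w2 @ hidden      # plays the role of z^[l+2]
    return relu(z + a)   # the skip connection adds a^[l] back

# a^[l] is itself a ReLU output, so every entry is >= 0.
a = relu(np.array([0.5, -1.0, 2.0]))
zeros = np.zeros((3, 3))

# With all weights at zero, z^[l+2] is zero and the block returns a^[l].
out = residual_block(a, zeros, zeros)
print(out.tolist())
```

A plain block with zero weights would output all zeros instead, which is why skipping an unhelpful layer is much easier with the residual form.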

  • @謝其宏-p3z
    @謝其宏-p3z 10 months ago

    It's amazing. Both resnet and this explanation.

  • @djauschan
    @djauschan 11 months ago

    Amazing explanation of this concept.
    Thank you very much

  • @doudouban
    @doudouban 1 year ago

    2:06, the equation shift seems problematic.

  • @RadenRenggala
    @RadenRenggala 1 year ago

    Hello, is the term "residu" referring to the convolutional feature maps from the previous layer that are then added to the feature maps output in the current layer?

    • @rupert_ai
      @rupert_ai 1 year ago

      The residual is actually the 'difference' between two features! In ResNets the feature maps from previous layers are added onto the current feature maps. This means the current layer can learn the 'residual' function, where it only needs to learn the difference.

    • @RadenRenggala
      @RadenRenggala 1 year ago

      @rupert_ai So, the residual is the difference between the current feature map and the previous feature map, and to obtain the residual, we need to perform an addition between those feature maps?
      Thank you.

  • @MuhammadHamza-o3r
    @MuhammadHamza-o3r 5 months ago

    Very well explained

  • @the_random_noob9860
    @the_random_noob9860 10 months ago

    Lifesaver! Also, for classification, it's inevitable that the dimensions go down and channels go up across the network. But the 1x1 convolution on the input features to 'match the dimensions' kind of loses the original purpose, i.e. to retain/boost the original signal. In a sense it's another conv operation whose output is no longer the same as the input (I mean, it could be similar, but certainly not as similar as the input features themselves). The original idea was to have the same input features, so that we could zero out the weights if no transformation is needed.
    At least they're not as different from the input as the features transformed by the usual conv block (conv, pooling, batch norm and activation). Let me know if I am missing anything.

  • @datascience8775
    @datascience8775 2 years ago +3

    Good content, just subscribed, keep sharing.

  • @nxtboyIII
    @nxtboyIII 1 year ago +1

    Great video well explained thanks!

    • @nxtboyIII
      @nxtboyIII 1 year ago +1

      I liked the visuals too

    • @rupert_ai
      @rupert_ai 1 year ago

      @nxtboyIII Thank you Lucas 🙏

  • @christianondo9637
    @christianondo9637 11 months ago

    great video, super intuitive explanation

  • @dapr98
    @dapr98 1 year ago

    Great video! Thanks. Would you recommend ResNet over CNN for music classification?

  • @ColorfullHD
    @ColorfullHD 1 year ago

    Hey, it's 3blue1brown
    All jokes aside, great explanation, cheers

    • @rupert_ai
      @rupert_ai 1 year ago

      Hahaha well it is using his animation library ;) all hail grant sanderson

  • @swedenontwowheels
    @swedenontwowheels 1 year ago +1

    Great content! Thank you for the effort!

  • @SakshamGupta-em2zw
    @SakshamGupta-em2zw 7 months ago

    Love the Music

  • @rezajavadzadeh5597
    @rezajavadzadeh5597 2 years ago +1

    thank you so much

  • @jamesnorton4953
    @jamesnorton4953 2 years ago +1

    🔥

  • @krishnashah6654
    @krishnashah6654 11 months ago

    i'd just say thank you so much man!

  • @tanmayvaity9437
    @tanmayvaity9437 2 years ago +1

    nice video

  • @enzogurijala5464
    @enzogurijala5464 2 years ago +1

    great video

  • @moosemorse1
    @moosemorse1 1 year ago

    Subscribed. Thank you so much

  • @carolinavillamizar795
    @carolinavillamizar795 1 year ago

    Thanks!!

  • @JoydurnYup
    @JoydurnYup 2 years ago +2

    great vid sir

  • @BABA-oi2cl
    @BABA-oi2cl 10 months ago

    Thanks a lot ❤

  • @gusromul3356
    @gusromul3356 9 months ago

    cool info, thanks rupert ai

  • @lifeisbeautifu1
    @lifeisbeautifu1 10 months ago

    that was good!

  • @cocgamingstar6990
    @cocgamingstar6990 1 year ago

    Very bad

    • @rupert_ai
      @rupert_ai 1 year ago +7

      Feel free to leave some constructive feedback :)
      Or did you mean to write badass? if so thanks!