Why Does Batch Norm Work? (C2W3L06)

  • Published: Oct 2, 2024

Comments • 63

  • @AnuragHalderEcon • 6 years ago +57

    Beautifully explained, classic Andrew Ng

  • @epistemophilicmetalhead9454 • 7 months ago

    When X changes (even though the underlying mapping f(x) = y stays the same), you can't expect the same model to keep performing well (e.g. X1 is pictures of black cats only, with y=1 for cats and y=0 for non-cats; if X2 is pictures of cats of all colors, the model won't do too well). This is covariate shift.
    Covariate shift is tackled during training through input standardization and batch normalization.
    Batch normalization keeps the mean and variance of the distribution of hidden unit values in the previous layer fixed, so those values can't shift around much as the earlier layers' parameters change.
    Because the values can't change too much, the coupling between the parameters of different layers is reduced, the layers learn more independently of each other, and learning speeds up.
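
A rough numpy sketch of the mechanism described above (the function name and shapes are illustrative assumptions, not code from the course): batch norm standardizes each hidden unit's pre-activations over the mini-batch, then applies a learnable scale (gamma) and shift (beta).

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z has shape (units, batch): one row per hidden unit, one column per example.
    mu = z.mean(axis=1, keepdims=True)      # per-unit mean over the mini-batch
    var = z.var(axis=1, keepdims=True)      # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance per unit
    return gamma * z_norm + beta            # learned shift (beta) and scale (gamma)

# Example: 4 hidden units, mini-batch of 8 examples
z = np.random.randn(4, 8) * 3.0 + 5.0
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
z_tilde = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=1), z_tilde.std(axis=1))  # roughly zeros and ones
```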

  • @siarez • 2 years ago +8

    The "covariant shift" explanation has been falsified as an explanation for why BatchNorm works. If you are interested check out the paper "How does batch normalization help optimization?"

  • @maplex2656 • 5 years ago +13

    When the previous layer is covered up, everything becomes clear. Brilliant explanation. Batch normalization works similarly to the way input standardization works.

  • @digitalghosts4599 • 5 years ago +24

    Wow, this is the best explanation I've seen so far! I really like Andrew Ng; he has an amazing talent for explaining even the most complicated things in a simple way, and when he has to use mathematics to explain a concept, he does it so brilliantly that it becomes even simpler to understand, not more complicated, as happens with some tutors.

  • @holgip6126 • 5 years ago +12

    I like this guy - he has a calm voice and patience.

  • @aamir122a • 6 years ago +29

    Great work, you have a natural talent for making difficult topics easy to learn.

  • @bgenchel1 • 6 years ago +16

    "don't use it for regularization" - just use it all the time for general good practice, or are there times when I shouldn't use it?

    • @first-thoughtgiver-of-will2456 • 4 years ago

      I think problems may arise if you don't have all of your training data ready and are looking to do some transfer learning (training on new data) in the future, since the batch norm statistics are very domain dependent - hinted at by the mini-batch-size regularization effect, but more importantly by the batch norm parameters themselves. I would still always try to use it. It seems ironic that it generalizes well yet is constrained by the covariance prescribed by the training data.

    • @wtf1570 • 3 years ago

      In some regression problems it hurts the absolute values of the outputs, which might be critical.

  • @oliviergraffeuille9795 • 2 years ago +1

    According to ruclips.net/video/DtEq44FTPM4/видео.html , the covariate shift explanation (which was proposed by the original batch norm paper) has since been debunked by more recent papers. I don't know much about this though; perhaps someone else would like to elaborate.

  • @NeerajGarg1025 • 8 months ago

    Good for building understanding, but a few more numerical calculations would show the effect.

  • @YuCai-v8k • 8 months ago

    Do neural networks always have batch normalization?

  • @ping_gunter • 3 years ago +1

    The original paper that introduced batch normalization (by Sergey Ioffe and Christian Szegedy) says that removing dropout speeds up training without increasing overfitting, and there are also recommendations not to use dropout together with batch normalization since it adds noise to the statistics being computed (mean and variance)... so should we really use DO with BN?

  • @randomforrest9251 • 3 years ago +2

    This guy makes it look so easy... one has to love him

  • @haoming3430 • 2 months ago

    6:00 - I have a question: don't the values of beta[2] and gamma[2] also change during training? Then the distribution of the hidden unit values z[2] also keeps changing, so the covariate shift problem is still there.

    • @haoming3430 • 2 months ago

      Or maybe I should convince myself that beta[2] and gamma[2] don't change much?

  • @yuchenzhao6411 • 4 years ago +1

    Since gamma and beta are parameters that get updated, how can the mean and variance remain unchanged?

  • @amartyahatua • 4 years ago +1

    Best explanation of batch norm

  • @pemfiri • 4 years ago +1

    Don't activation functions such as the sigmoid in each node already normalize the neurons' outputs for the most part?

    • @bharathtejchinimilli320 • 3 years ago

      But the outputs are not zero-centered.

    • @bharathtejchinimilli320 • 3 years ago

      Generally, sigmoids are not used because they saturate and their outputs are not zero-centered; ReLUs are used instead.

  • @quishzhu • 2 years ago

    Thank you

  • @anynamecanbeuse • 5 years ago +1

    I'm confused. Does it normalize across all neurons within each layer, or normalize, for each single neuron, the values computed over a mini-batch?

    • @yueying9083 • 5 years ago

      The second one.

    • @JohnFunnyMIH • 5 years ago +2

      To be precise, it's neither exactly, though it's closer to the second. When training your network, you normalize the Z[l] values - the scalars corresponding to each neuron of the l-th layer - over the mini-batch. Z[l] = W[l] * A[l-1], where W[l] is the current layer's weight matrix and A[l-1] is the previous layer's activations.
      So you normalize numbers that are not yet the current layer's activations, but that are computed from the current layer's weights and the previous layer's activations.
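
A small numpy sketch of that distinction (the layer sizes and batch size below are illustrative assumptions): batch norm reduces over the batch axis, so each hidden unit gets its own mean and variance estimated from the mini-batch, rather than normalizing across the units of a single example.

```python
import numpy as np

m = 64                               # mini-batch size
n_prev, n_l = 10, 5                  # units in layers l-1 and l
W = np.random.randn(n_l, n_prev) * 0.1
A_prev = np.random.randn(n_prev, m)

Z = W @ A_prev                       # pre-activations of layer l, shape (n_l, m)

# Batch norm statistics: one mean/variance per hidden unit, computed over the m examples
mu = Z.mean(axis=1, keepdims=True)   # shape (n_l, 1)
var = Z.var(axis=1, keepdims=True)   # shape (n_l, 1)
Z_norm = (Z - mu) / np.sqrt(var + 1e-8)

# Note: reducing over axis=0 instead (across the units of one example)
# would be layer normalization, which is not what batch norm does.
```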

  • @banipreetsinghraheja8529 • 6 years ago +5

    You said that batch norm limits the change in the values of the 3rd layer (or, more generally, any deeper layer) caused by the parameters of earlier layers. However, when you perform gradient descent, the parameters introduced by batch norm (gamma and beta) are also being learned and keep changing with each update, so the mean and variance of the earlier layers' outputs also change and are not fixed at 0 and 1 (or, more generally, whatever you set them to). So I can't build an intuition for how fixing the mean and variance of the earlier layers prevents covariate shift. Can anyone help me out with this?

    • @bryan3792 • 6 years ago +5

      My understanding is this: imagine the 4 neurons in hidden layer 3 represent the features [shape of the cat's head, shape of the cat's body, shape of the cat's tail, color of the cat]. The first 3 dimensions will have high values as long as there is a cat in the image, but the color varies a lot. So when you normalize this vector, changes in color contribute less to the prediction. Therefore, the features that really matter (like the first three dimensions) have relatively more influence.

    • @usnikchawla • 5 years ago +1

      Having the same doubt

    • @Kerrosene • 5 years ago

      Batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and a "mean" parameter (beta) is added. In other words, batch normalization lets SGD undo the normalization by changing only these two weights per activation, instead of losing the stability of the network by changing all the weights (see the numerical sketch after this thread).

    • @mustafaaydn6270 • 3 years ago

      Page 316 of www.deeplearningbook.org/contents/optimization.html has an answer to this, I guess:
      > At first glance, this may seem useless - why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH'+β is determined solely by β. The new parametrization is much easier to learn with gradient descent.

    • @novinnouri764 • 2 years ago

      @@bryan3792 thanks
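
To make the point in the replies above concrete, here is a small numerical check (the values are made up for illustration): whatever statistics the earlier layers induce on Z, after batch norm the per-unit mean and standard deviation are set directly by beta and gamma, which is what keeps the distribution seen by the next layer stable.

```python
import numpy as np

# Arbitrary per-unit scales and offsets, standing in for whatever earlier layers produce
Z = np.random.randn(3, 256) * np.array([[10.0], [0.5], [3.0]]) + 7.0
gamma = np.array([[2.0], [1.0], [0.3]])   # learned scale
beta = np.array([[-1.0], [4.0], [0.0]])   # learned shift

Z_norm = (Z - Z.mean(axis=1, keepdims=True)) / np.sqrt(Z.var(axis=1, keepdims=True) + 1e-8)
Z_tilde = gamma * Z_norm + beta

print(Z_tilde.mean(axis=1))  # ~[-1, 4, 0]: the mean is whatever beta says
print(Z_tilde.std(axis=1))   # ~[2, 1, 0.3]: the spread is whatever gamma says
```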

  • @IgorAherne • 6 years ago +1

    thank you

  • @XX-vu5jo • 3 years ago

    Keras people need to watch this video!

  • @gugaime • 1 year ago

    Amazing explanation

  • @nikhilrana8800 • 5 years ago

    I'm not able to grasp how batch norm works. Please help me...

  • @qorbanimaq • 3 years ago

    This video is just pure gold!

  • @mariusmic6573 • 2 years ago

    What is 'z' in this video?

  • @s25412 • 3 years ago

    7:55 - why don't we use the mean and variance of the entire training set instead of just those of a mini-batch? Wouldn't this reduce the noise further (similar to using a larger mini-batch size)? Unless we actually want that noise for its regularizing effect? (A sketch of the running-average statistics typically kept for test time follows this thread.)

    • @lupsik1 • 3 years ago +2

      Larger batch sizes can be detrimental; as Yann LeCun once said,
      "training with large minibatches is bad for your health.
      More importantly, it's bad for your test error.
      Friends don't let friends use minibatches larger than 32."
      As far as I understand it, with bigger batches you get stuck in narrower local optima, while the noisier estimates help you generalize better and get pushed out of those optima.
      There's still a lot of debate about this, though, e.g. in some cases with very noisy data like predicting stock prices.

    • @s25412 • 3 years ago

      @@lupsik1 great response!
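
On the question above about using population statistics: during training, batch norm uses the noisy per-mini-batch estimates, and a common approach (covered in the course's test-time video) is to also keep exponentially weighted averages of those estimates for use at inference. A rough sketch, where the function names and the momentum value are assumptions for illustration:

```python
import numpy as np

def update_running_stats(mu_run, var_run, z_batch, momentum=0.9):
    # Exponentially weighted averages of the mini-batch statistics.
    mu_b = z_batch.mean(axis=1, keepdims=True)
    var_b = z_batch.var(axis=1, keepdims=True)
    return (momentum * mu_run + (1 - momentum) * mu_b,
            momentum * var_run + (1 - momentum) * var_b)

def batch_norm_inference(z, mu_run, var_run, gamma, beta, eps=1e-8):
    # At test time, normalize with the running estimates instead of batch statistics.
    return gamma * (z - mu_run) / np.sqrt(var_run + eps) + beta

mu_run, var_run = np.zeros((4, 1)), np.ones((4, 1))
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
for _ in range(200):                          # simulate many training mini-batches
    z_batch = np.random.randn(4, 32) * 2.0 + 3.0
    mu_run, var_run = update_running_stats(mu_run, var_run, z_batch)
z_test = np.random.randn(4, 1) * 2.0 + 3.0
print(batch_norm_inference(z_test, mu_run, var_run, gamma, beta))
```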

  • @MuhammadIrshadAli • 6 years ago

    Thanks for sharing the great video, explained in a simple and clear manner.

  • @youknowhoiamhehe • 4 years ago

    Is he the GOD?

  • @tianyuez • 5 years ago +2

    Andrew Yang is really good at math

  • @best_Vinyl_CollectorinShenZhen • 5 years ago

    If the mini-batch size is only 1, does BN still work?

    • @karthik-ex4dm • 5 years ago +1

      A mini-batch of size 1 is not really a mini-batch - it's just using each data point separately. You cannot batch norm with size = 1.
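
A tiny numerical illustration of why (the numbers are made up): with a single example, each pre-activation equals the batch mean, so the normalized value collapses to zero and the layer's output becomes just beta, regardless of the input.

```python
import numpy as np

z = np.array([[2.7], [-1.3], [0.4]])      # one example, three hidden units
mu = z.mean(axis=1, keepdims=True)        # equals z itself when the batch has one example
var = z.var(axis=1, keepdims=True)        # all zeros
z_norm = (z - mu) / np.sqrt(var + 1e-8)   # all zeros: the input signal is erased
print(z_norm)                             # [[0.], [0.], [0.]]
```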

  • @skytree278 • 5 years ago

    Thank you!

  • @EranM • 5 years ago

    Ingenious!

  • @sandipansarkar9211 • 3 years ago +1

    Great explanation

  • @sudharsaneaswaran2516 • 5 years ago

    What do the colored images have to do with the location of the data points in the graph?

    • @amitkharel1168 • 5 years ago

      pixel values

    • @dianadrives4519 • 5 years ago

      That is just another way to show the difference in distribution between the training and test data. In the images, the distribution difference is shown by one set having black cats and the other having non-black cats, while in the graph it is shown by the different positions of the positive and negative data points. In short, these are two different examples highlighting a single issue, i.e. covariate shift.

  • @arvindsuresh86 • 4 years ago +1

    Wow, great explanation! Thanks!

  • @emrahyigit • 6 years ago

    Great explanation. Thank you.

  • @wajidhassanmoosa362 • 2 years ago

    Beautifully Explained

  • @lakshaithani268 • 5 years ago

    Great explanation