L11.4 Why BatchNorm Works

  • Published: 18 Dec 2024

Comments • 8

  • @simonvutov7575 · 1 year ago

    I keep coming back to these videos because they are so useful!!!! Please keep making these!

  • @mohammadyahya78 · 1 year ago

    Thank you very much. At 8:34, for the distributions of activations at different layers, what are the x, y, and z axes used to plot these 3D diagrams?
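
The axis convention is not spelled out in this thread, but a common choice for such 3D activation-distribution plots (an assumption here, not confirmed by the video) is: x = activation value, y = layer index or training epoch, and z = frequency/density. A minimal Matplotlib sketch of that kind of "waterfall" plot, with synthetic data standing in for real activations:

```python
# Hypothetical sketch: activation histograms per layer drawn as a 3D waterfall plot.
# Assumed axis convention: x = activation value, y = layer index, z = density.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic "activations" for 4 layers; in practice these would be collected
# from a trained network, e.g., via forward hooks.
activations = [rng.normal(0.0, 1.0 + 0.5 * i, size=5000) for i in range(4)]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for layer_idx, act in enumerate(activations):
    density, edges = np.histogram(act, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # One histogram curve per layer, offset along the y axis by the layer index.
    ax.plot(centers, np.full_like(centers, layer_idx), density)

ax.set_xlabel("activation value")
ax.set_ylabel("layer index")
ax.set_zlabel("density")
plt.show()
```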

  • @NgocAnhNguyen-si5rq · 2 years ago · +1

    Wow, such intensive work! Thank you! But I was actually waiting for you to mention that BN can help avoid the exploding/vanishing gradient problem. What do you think about this advantage?

    • @SebastianRaschka · 2 years ago

      Yes, that's definitely an advantage. Keeping the activations in a reasonable range helps with both of those problems.
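
To make "keeping the activations in a reasonable range" concrete, here is a small PyTorch sketch (illustrative code, not from the video; the architecture, depth, and width are arbitrary choices) that compares how the activation scale evolves through a deep tanh MLP with and without BatchNorm. With the default initialization used here, the unnormalized activations shrink toward zero with depth, while the BatchNorm version keeps them at a roughly constant scale, which illustrates why BN helps with the vanishing/exploding-gradient issue mentioned above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp(use_bn, depth=20, width=128):
    # A deep stack of Linear -> (optional BatchNorm) -> Tanh blocks.
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)

x = torch.randn(256, 128)  # a dummy mini-batch

for use_bn in (False, True):
    net = make_mlp(use_bn)
    h, stds = x, []
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.Tanh):
            stds.append(h.std().item())  # activation scale after each block
    print(f"BatchNorm={use_bn}: activation std after blocks 1, 10, 20: "
          f"{stds[0]:.4f}, {stds[9]:.4f}, {stds[-1]:.4f}")
```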

  • @xl0xl0xl0 · 2 years ago · +1

    One comment about BN with small batches: why not track a moving average of the BN statistics over the last few batches (to bring the total number of samples to >= 64)? It sounds like that would solve the issue.

    • @SebastianRaschka · 2 years ago

      That's an interesting idea and would work as well (except, of course, for the first batch).

    • @xl0xl0xl0 · 2 years ago

      @SebastianRaschka I thought more about it, and I think it might not work. In the paper, they stress that BN needs to be part of the optimization loop. If we use EMA values, that goes against this idea, because only a small part of the gradient is allowed to flow back, and we might see the kind of conflict between BN and optimization that they described. I hope what I'm saying makes sense. Still, I will give it a try and see how it goes.
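
For concreteness, here is a rough PyTorch sketch of the moving-average idea discussed in this sub-thread (the `EMANorm` name, hyperparameters, and details are illustrative assumptions, not the video's or the BN paper's method). It also makes the objection above visible: the EMA statistics live in buffers and are updated under `torch.no_grad()`, so gradients flow only through the inputs and not through the batch statistics, which is the conflict between normalization and optimization mentioned in the comment.

```python
import torch
import torch.nn as nn

class EMANorm(nn.Module):
    """BatchNorm-like layer that normalizes with an EMA of batch statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Fold the current batch into the EMA; buffers are updated
            # without tracking gradients.
            with torch.no_grad():
                self.running_mean.lerp_(x.mean(dim=0), self.momentum)
                self.running_var.lerp_(x.var(dim=0, unbiased=False), self.momentum)
        # Normalize with the EMA statistics. Gradients flow only through x,
        # not through the statistics (the concern raised in the reply above).
        x_hat = (x - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        return self.weight * x_hat + self.bias

# Usage sketch:
layer = EMANorm(128)
out = layer(torch.randn(32, 128))  # shape (32, 128), normalized with EMA statistics
```

Batch Renormalization (Ioffe, 2017) addresses the same small-batch setting with a related idea: it combines running statistics with per-batch correction terms while still backpropagating through the current batch's statistics.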

  • @simonvutov7575 · 1 year ago

    Please share your list of papers; I would love to check them out!