I keep coming back to these videos because they are so useful! Please keep making them!
Thank you very much. At 8:34, for the distribution of activations at different layers, may I ask what the x, y, and z axes are, so that these 3D diagrams can be plotted?
Wow, such intensive work! Thank you! But I was actually waiting for you to mention that BN can help avoid the exploding/vanishing gradient issue. What do you think about this advantage?
Yes, that's definitely an advantage. Keeping the activations in a reasonable range helps with both of those problems.
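To make that concrete, here is a small standalone sketch (just an illustration I'm adding here, not code from the video): it stacks ten tanh layers and prints the activation standard deviation with and without a BatchNorm-style standardization. Without it, the signal shrinks layer by layer, which is exactly the regime where gradients vanish; with it, the scale stays roughly constant.

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 100)                                     # batch of 256 samples, 100 features
weights = [0.05 * torch.randn(100, 100) for _ in range(10)]   # 10 layers with smallish weights

h_plain, h_norm = x, x
for i, W in enumerate(weights):
    # Without normalization, the activation scale shrinks layer after layer.
    h_plain = torch.tanh(h_plain @ W)
    # With a BatchNorm-style standardization, it stays near unit variance.
    z = h_norm @ W
    h_norm = torch.tanh((z - z.mean(dim=0)) / (z.std(dim=0) + 1e-5))
    print(f"layer {i + 1:2d}: plain std = {h_plain.std():.1e}, normalized std = {h_norm.std():.2f}")
```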
One comment about BN with small batches: why not track a moving average of the BN statistics over the last few batches (so that the total number of samples is >= 64)? It sounds like that would solve the issue.
That's an interesting idea and would work as well (except, of course, for the first batch).
@SebastianRaschka I thought more about it, and I think it might not work. In the paper they stress that BN needs to be part of the optimization loop. If we use EMA values, that goes against this idea, because only a small part of the gradient is allowed to flow back, and we might see the kind of conflict between BN and optimization they describe. I hope what I'm saying makes sense. Still, I will give it a try and see how it goes.
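In case it helps, here is a rough PyTorch sketch of the EMA idea (my own sketch for this thread, not from the paper or the video; the class name, momentum value, and the 2D-input assumption are all made up). It also shows the point raised above: the running averages are stored as detached buffers, so gradients only flow through the current batch's contribution to the statistics, which is where the conflict with keeping BN inside the optimization loop could show up.

```python
import torch
import torch.nn as nn

class EMABatchNorm(nn.Module):
    """BatchNorm variant that normalizes with an exponential moving average
    of batch statistics instead of the current mini-batch alone
    (for 2D inputs of shape [batch, features])."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_features))   # learnable shift
        self.momentum, self.eps = momentum, eps
        self.register_buffer("ema_mean", torch.zeros(num_features))
        self.register_buffer("ema_var", torch.ones(num_features))

    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Blend the current batch into the running estimates. The stored EMA
            # values are buffers (no grad), so only the current batch's share
            # of the statistics participates in backpropagation.
            mean = (1 - self.momentum) * self.ema_mean + self.momentum * batch_mean
            var = (1 - self.momentum) * self.ema_var + self.momentum * batch_var
            self.ema_mean = mean.detach()
            self.ema_var = var.detach()
        else:
            mean, var = self.ema_mean, self.ema_var
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

Note that with the zero/one initialization the first few batches are dominated by the initial values, which is the "first batch" caveat mentioned above.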
Please share your list of papers; I would love to check them out.