My respect for Ritvik grows exponentially each time I see his explanations. He can beat any prof when it comes to explaining these things. I just feel so lucky to have come across this channel
I just found this channel like 3 days ago and it has been very useful and interesting! Thank you very much
Of course!
Yes, great topic. Absolutely, some of the top ways to fight off vanishing gradients are ReLU (and other advanced activation functions) and residual nets (skip connections).
It's also quite possible to add your own custom residual blocks to any deep network; it's not necessary to use only the ResNet blocks that the framework provides. TensorFlow's functional API makes it pretty straightforward to add skip layers, plus the necessary aggregation layer that combines the main path and the skip path, to layer types other than the convolutional, computer-vision-specific ones (see the sketch below). So while ResNet was originally designed for computer vision, it's not married to that at all.
Other solid ways to fight off vanishing gradients are
- batch normalization, which basically conditions the signal passed to the next hidden layer; and
- smarter initialization of the weights in all your layers, like He initialization when using ReLU (and other initializations suited to other activation functions).
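A minimal sketch of what that comment describes, assuming a small fully connected network; the layer widths and the dense_skip_block helper name are made up for illustration, not taken from the video:

```python
# Minimal sketch (not the video's code): a Dense "residual" block built with
# TensorFlow's functional API, plus He initialization and batch normalization.
import tensorflow as tf
from tensorflow.keras import layers

def dense_skip_block(x, units):
    # Main path: two Dense layers with He initialization and batch norm.
    h = layers.Dense(units, kernel_initializer="he_normal")(x)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("relu")(h)
    h = layers.Dense(units, kernel_initializer="he_normal")(h)
    h = layers.BatchNormalization()(h)
    # Skip path: project the input only if its width differs from the main path.
    skip = x if x.shape[-1] == units else layers.Dense(units)(x)
    # Aggregation layer that combines the main path and the skip path.
    return layers.Activation("relu")(layers.Add()([h, skip]))

inputs = tf.keras.Input(shape=(32,))   # hypothetical feature width
x = dense_skip_block(inputs, 64)
x = dense_skip_block(x, 64)
outputs = layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
```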
Very informative!!! Are you planning to make videos on RNNs (LSTM...) and other types of network models?
Yup they'll likely be coming out within the next month!
Very well explained. I always had trouble understanding this topic, but this video helped me to comprehend it intuitively.
Thanks!
Some questions inspired by your video.
- The earliest layers see the severest form of vanishing gradient. Do later layers undergo vanishing gradients sequentially?
- So what if the earliest layers' weights get stuck; learning can still happen due to weight updates at later layers, right?
- Can we use the vanishing gradient for neural architecture depth search? Start with many layers, train, identify the early layers that got stuck, then discard them and keep a shallower network. This sounds like there is something wrong with it; will this work?
Love it as always!
May I suggest a future video topic: Bayesian Change Point Detection. BCP has so many components that you have already covered (sampling techniques, MCMC, Bayesian statistics) that I think it would make for a great video! (And I'm still slightly confused how it all comes together in the end! lol)
Very well addressed, thank you for the video.
Thanks!
You can also add bias terms to some of the intermediate layers of your network
One of the best channels I've found on YouTube (along the lines of 3Blue1Brown and other such channels). Keep up the good work
Excellent video. You perfectly convey the intuition. Only one doubt left: I cannot see why ReLU is a good solution, given that the gradient vanishes to 0 for negative values. How do we compute backpropagation then?
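For what it's worth, a tiny illustrative check (not from the video): the derivative frameworks use for ReLU in backprop is 1 where the input was positive and 0 where it was negative, so the gradients that do flow aren't repeatedly scaled down the way they are with a saturated sigmoid; a unit stuck on the negative side can, however, go "dead".

```python
# Minimal sketch: ReLU's backprop derivative is 1 for positive inputs and
# 0 for negative inputs -- it never shrinks gradients to a tiny fraction the
# way a saturated sigmoid does.
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.5, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)          # constants aren't watched by default
    y = tf.nn.relu(x)
grad = tape.gradient(y, x)
print(grad.numpy())        # [0. 0. 1. 1.]
```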
Another gem! Great insights! 🔬
Omg thank you so much!!! You saved my thesis ❤❤
Excellent explanation.
Awesome video! Nice channel
Outstanding Video. Really well explained.
Never heard someone call it the most important problem. Interesting viewpoint.
Great depth is where you get the most exponentiation effect, thus the worst vanishing or explosion. But great depth is where the bulk of the power of deep neural nets comes from.
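A rough back-of-the-envelope sketch of that exponentiation effect (ignoring the weight terms, which can damp or amplify it further): the sigmoid's derivative is at most 0.25, so chaining that factor through many layers shrinks the gradient roughly geometrically with depth.

```python
# Back-of-the-envelope: chaining the sigmoid's maximum derivative (0.25)
# through backprop shrinks the gradient geometrically with depth.
for depth in (5, 10, 20, 50):
    print(depth, 0.25 ** depth)
# 5  ~9.8e-04
# 10 ~9.5e-07
# 20 ~9.1e-13
# 50 ~7.9e-31
```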
Super nice explanation!!!
Thanks once again !
Another way to help with the vanishing gradient problem is to adjust your learning rate.
Great tip!
Hats off, Sir!
thanks
Welcome
as always