From now on, if anyone asks me about the vanishing gradient or exploding gradient problem, I will not just answer, I will even teach them a class on it.
The best video I've ever seen
Exactly
I have a small doubt: in the vanishing case the values were very small, but here they are high, yet both use the same equation, right? Or is it because the weights were normal in the vanishing case and high in the exploding case? Your help is really appreciated.
@@kiruthigakumar8557 Even I have the same doubt. If anyone can help, it would be really appreciated.
@@sargun_narula As he said, with sigmoid the values lie between 0 and 1. If the weights are also small when we initialise them, a small network with 1 or 2 hidden layers won't have a vanishing problem. But with something like 10 layers, the derivative keeps decreasing with every layer during backpropagation, so a few layers back from the output (say from the third-last layer onwards) the optimizer becomes very slow to reach the minimum. That is the vanishing gradient. As for the exploding gradient, if the weights are large and the derivative grows during backpropagation, the optimizer may diverge rather than reach the minimum. That is the exploding problem. Simply put, weights shouldn't be initialised too high or too low.
@@kiruthigakumar8557 Irrespective of your activation function, your weights cause the exploding/vanishing gradient problem. Weights shouldn't be initialised too high or too low. Here is the Andrew Ng video on the same topic: ruclips.net/video/qhXZsFVxGKo/видео.html
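To make the replies above concrete, here is a minimal sketch of how the same chain-rule product can shrink or blow up depending only on the weight. It assumes every layer contributes a factor sigmoid'(z) * w, and it holds z at 0 for every layer, which is the best case for the sigmoid slope (0.25). The function name `backprop_factor` and the weight values 0.5 and 500 are illustrative assumptions, not something taken from the video.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid w.r.t. its own input; its maximum is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(weight, num_layers, z=0.0):
    """Product of the per-layer factors sigmoid'(z) * weight over num_layers layers.

    Simplifying assumption: the same weight and the same pre-activation z
    are used at every layer.
    """
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_prime(z) * weight
    return factor

small = backprop_factor(weight=0.5, num_layers=10)    # (0.25 * 0.5) ** 10
large = backprop_factor(weight=500.0, num_layers=10)  # (0.25 * 500) ** 10

print(f"weight=0.5 -> factor {small:.3e}")
print(f"weight=500 -> factor {large:.3e}")
```

Same equation in both cases; only the size of the weight decides whether the product heads towards zero or towards infinity. In a real network a weight of 500 would usually push z far from 0 and saturate the sigmoid, so treat this as an upper-bound illustration rather than a simulation.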
I never got such a clear explanation of deep learning concepts.
I took the Coursera deep learning course. They make it more difficult than it is.
Thank you Krish.
Loving this playlist
Most of these abstract concepts are explained very elegantly
Thank you so much
9:32 peak of interest! Happiness in explaining why it will not converge... I love that reaction!!!😍😍😍
yeah, the time where I smiled with respect :)
This playlist is like a treasure.
I really love your videos. I only started watching your tutorials today. They were really helpful. Thank you so much for sharing your knowledge.
How did I miss this class all these years?
How are you able to simplify the topics so well?
👏
Best explanation for EXPLODING gradient problem on the internet I have encountered so far. Awesome!
Sir, your videos are very educational, and you put a lot of energy into making them. They make the learning process easy, and they have also let me develop an interest in deep learning. That's the best I could have asked for, and you delivered it. Thank you, Sir.
Deep Concepts are getting clear.
Thank you sir. Such a beautiful explanation
That was one of the best explanations of the exploding gradient problem. But please mention the next video in the description box. I found it hard to find.
Congrats on a well-explained topic. Now I know the effect of exploding gradients.
This is super, Krish. It's like a story that you explain... at 9:35 the whole picture jumps into your mind. Neat explanation. Nice work, Krish... waiting for more videos. See you on Saturday. Till then, cheers.
keep up the good work, disrupting the education system. Lots of love
Hats off to you sir, your explanation is top level. Thank you so much for guiding us...
Best explanation so far. No doubt !!!
Love the explanation bro... I used to initialize weights randomly but after watching this, I came to know the impact of such initializations...
Very passionate and articulate lecture, well done.
Very very effective video sir 👍👍👍👍👍👍....my love and gratitude to you 🙏...
The exploding gradient problem is caused by initialising the weights too high. If the weights are large, the gradient values during backprop will be large, which in turn makes each updated weight very different from the old one when updating weights [Wnew = Wold - lr * Grad]. Because of this the weights vary a lot at every epoch, and this is why gradient descent fails to converge.
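The update rule quoted above can be tried on a toy problem. This is only a sketch under stated assumptions: the loss is a hypothetical L(w) = w**2, and `grad_scale` is a made-up stand-in for the factor that backpropagation multiplies into the gradient (1 for a well-behaved network, 125 for the exploding case). Neither comes from the video.

```python
def run_updates(grad_scale, lr=0.1, w=1.0, steps=5):
    """Apply Wnew = Wold - lr * Grad repeatedly on the toy loss L(w) = w**2.

    grad_scale is a hypothetical multiplier standing in for the product of
    per-layer factors picked up during backpropagation.
    """
    history = [w]
    for _ in range(steps):
        grad = grad_scale * 2.0 * w   # dL/dw = 2w, scaled by the backprop factor
        w = w - lr * grad             # the update rule from the comment above
        history.append(w)
    return history

stable = run_updates(grad_scale=1.0)       # each step multiplies w by 0.8
exploding = run_updates(grad_scale=125.0)  # each step multiplies w by -24

print("stable:   ", stable)
print("exploding:", exploding)
```

With the small factor the weight moves steadily towards the minimum at 0. With the large factor every step overshoots, the sign flips each time, and the magnitude keeps growing, which is the "never converges" behaviour described above.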
love your video of machine learning algorithms, kudos
Your classes are quite clear, thank you so much !!!!
Excellent videos bro, I am getting a clear picture of these concepts. Thank you very much for making the videos in a clear, understandable manner.
I am following your every video.
One correction: dL/dW'11 should be (dL/dO31. dO31/dO21. dO21/dO11. dO11/dW'11)
In tutorial 6 also there was a correction...!
is there an explanation
You are right @kueen, krish has missed out the first term in the chain rule.
yes you are right
But what goes in "dL": is it (y-Y)^2, or the log loss function?
Just wanted to know... does the chain rule here refer to partial derivatives?
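For reference, the correction discussed in this thread can be written out as follows. The derivatives are partial derivatives, since each output depends on several weights and inputs. The second line is only an example: it assumes a squared-error loss with target y and prediction O31, which is one of the two options raised above, and a different loss would change only that first factor.

```latex
\[
\frac{\partial L}{\partial W'_{11}}
  = \frac{\partial L}{\partial O_{31}}
    \cdot \frac{\partial O_{31}}{\partial O_{21}}
    \cdot \frac{\partial O_{21}}{\partial O_{11}}
    \cdot \frac{\partial O_{11}}{\partial W'_{11}}
\]

% Assumed example loss (not confirmed by the video): squared error
\[
L = (y - O_{31})^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial O_{31}} = -2\,(y - O_{31})
\]
```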
Very well explained, and the writings and drawings are very clear too by the way
excellent and to the point explanation sir. Waiting for your future videos in Deep Learning.
Amazing In-Depth Explanation!
Amazing explanation sir. I am going to learn whole deep learning from your videos only
There is a mistake in the chain rule. Please correct it.
Yeah I commented on it too
No. It is correct
No it is incorrect
Yes, dL/dO31 is missing.
Question:
Hi Krish. dO21/dO11 is large because we multiply the derivative of the sigmoid (between 0 and 0.25) by a large weight. However, in Tutorial 7 we didn't use this formula (the chain rule derivation); we directly said dO21/dO11 is between 0 and 0.25. Can you please clarify this?
Even I have the same question. Sir, can you please explain this section?
Even I have the same doubt. Can you explain this?
That is because O21 = sigmoid(ff21), and the derivative of sigmoid(x) with respect to its own input x always ranges from 0 to 0.25, whatever the value of x. That range applies to dO21/d(ff21); with respect to O11 the chain rule also multiplies in the weight on O11.
YOU ARE JUST KIND DUDE. THANKS
Awesome explanation! Best video I have seen for this problem.
Superb video once again. But I need to study a little bit of theory, and I still have no idea how interview questions on deep learning are framed.
thanks Krish... nice explanations
Please see, the chain rule has missed something at 2:55. @krish naik
Yes, there is a mistake: del L / del O31 onwards is missing.
@@omkarrane1347 yes this is a miss
Hello sir,
In the vanishing gradient problem you mentioned that the derivative of the sigmoid is always between 0 and 0.25. So the derivative of O21 w.r.t. O11 should be in the range 0 to 0.25, but when you expanded it we got 125. I did not understand how the derivative of the sigmoid exceeded the range 0 to 0.25. It seems contradictory. Hope you can clear my doubt, sir.
I am having the same doubt. Can anyone please explain it?
Even I had this question.
He multiplied 0.25 by the initial weight w21, which was 500. In his case w21 is the derivative of z w.r.t. O11.
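The reply above can be checked numerically. The sketch below assumes z = w21 * O11 + b and O21 = sigmoid(z), with w21 = 500 as in the comment. The values of O11 and the bias b are hypothetical, chosen only so that z = 0, where the sigmoid slope is at its maximum of 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w21 = 500.0        # large initial weight, as in the comment above
o11 = 0.5          # hypothetical output of the previous layer
b = -w21 * o11     # hypothetical bias, chosen so that z = 0

def o21(o):
    # O21 as a function of O11, with w21 and b held fixed
    return sigmoid(w21 * o + b)

z = w21 * o11 + b
d_o21_dz = sigmoid(z) * (1.0 - sigmoid(z))  # sigmoid slope: always within (0, 0.25]
dz_do11 = w21                               # derivative of z w.r.t. O11
analytic = d_o21_dz * dz_do11               # chain rule: 0.25 * 500

# Independent check with a central finite difference
h = 1e-6
numeric = (o21(o11 + h) - o21(o11 - h)) / (2 * h)

print("dO21/dz   =", d_o21_dz)
print("dO21/dO11 =", analytic, "(finite difference:", numeric, ")")
```

So there is no contradiction: the 0 to 0.25 bound holds for dO21/dz, while dO21/dO11 carries the extra factor w21 and can be as large as 125 here.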
Please keep making videos like this!
Another Great Video. Namaste
Pure passion, appreciate it.
Request for a video on side by side comparison of vanishing gradient and exploding gradient...
The activation function is denoted by phi, not to be confused with the symbol for the cyclic integral.
so well explained!
great work.. Kudos to u!!!!!!!!!!
Sir, please do tutorials on machine learning topics like regression, classification and clustering.
In the vanishing gradient video you directly put in values between 0 and 0.25, since the derivative lies in that range, so why not put in values directly here?
I mean, couldn't we have done the same in the vanishing gradient case as well, i.e. expanded the equation and multiplied by the weight?
Even I am having the same doubt. After watching this video, I cannot understand why dO21/dO11 was directly put between 0 and 0.25 in the Vanishing Gradient Problem video.
@krish naik sir, can you please help clarify this doubt?
Yes, it confused me too.
Does the exploding gradient problem occur only with the sigmoid activation function, or with all activation functions?
Awesome 😊👏👍
excellent krish
love to watch your videos
beautiful explanation
Sir, please note that the chain rule was applied incorrectly in the last two videos. Even our teacher, who referred to the video, has written the same mistake in her notes. Ref: del L / del O31 onwards.
I probably made a mistake in the last part
Can you briefly explain what is wrong, so I can understand?
Which one is correct then: the one used in this video or the one used in the previous video?
super explanation sir !!
Best video. Hands down
Best explanation... thanks for making this video.
Very nice explanation. Thanks.
At 08:30, the derivative of O21 w.r.t. O11 is 125, but O21 is a sigmoid function. How can its derivative be 125 when the derivative of the sigmoid function ranges from 0 to 0.25?
Excellent ..!!!
Excellent.
So overall you're saying that if you choose high values for the weights, training will struggle to reach the global minimum, or may never reach it.
Exploding GD explained nicely!
Thank you very much, I learned a lot. I think you forgot one term in the gradient, the first one: dL/dO3
Love your videos and can't thank you enough. Thank you so much for the most awesome lessons.
A small doubt: in another video you said that the derivative of the loss w.r.t. a weight equals the derivative of the loss w.r.t. the output, and so on, but in this video the right-hand side starts directly from the output. Could you please confirm?
These likes will turn into 1M likes after mid-2021. People do not understand the effort and hard work, as they are also not doing anything right now. Wait and watch.
Really very good videos. One doubt: high-value weights cause this exploding problem, but W-old might also be a large value, right? If we compute W-old - dL/dW, that would not cause a big variance, right? Please help me.
So basically Exploding and vanishing are dependent on how the weights are initialised?
Sir, maybe there is a problem in the chain rule that you explain. Something is missing here: the derivative of L with respect to O31.
Sir, the exploding gradient problem occurs only when the weights are too high, and the vanishing gradient occurs only when the weights are too low. Is my assumption correct?
Yes
Thanks krish
I love the energy
great video !
u doing great job man
@7:47 d(w_21 * O_11) = O_11 dw_21 + w_21 dO_11 (why are you assuming w_21 is constant)
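On the question above: the total differential written in the comment is correct, and the chain rule term in question is a partial derivative with respect to O_11, which by definition holds every other variable (including w_21) fixed. A short worked version, assuming z = w_21 O_11 + b with a bias b (the bias is an assumption here, and it does not affect the result):

```latex
\[
z = w_{21}\,O_{11} + b,
\qquad
dz = O_{11}\,dw_{21} + w_{21}\,dO_{11} + db
\]

% Partial derivative w.r.t. O_11: set dw_21 = 0 and db = 0
\[
\frac{\partial z}{\partial O_{11}} = w_{21}
\]

% Partial derivative w.r.t. w_21: set dO_11 = 0 and db = 0
\[
\frac{\partial z}{\partial w_{21}} = O_{11}
\]
```

The O_11 dw_21 term is not ignored in training; it shows up in the separate gradient with respect to w_21.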
Sir, please make a video on Bayes' theorem and its concepts.
awesome video, much respect
You're great!
Sir, why are you not writing the term dL/d(O31) with the other terms?
Shouldn't the derivative be dL/dW'11 = dL/dO31 and then the rest? Could someone please clarify? Thanks
You're right
Thanks Krish for the video. However, I didn't understand how you replaced the loss function with the output of the output layer; it should actually be the real output minus the predicted output. Please advise.
He has just shown that the predicted output is fed as input to the loss function (not that the predicted output is the loss function, as you understood it).
@2.37 you have missed a derivative, dL/dO31, on the RHS.
Awesome video!
Thank you.
Sir, you gave one formula for the chain rule, and here you are giving a different formula for computing the derivative.
Is it the same? Please clear this doubt.
Very well explained, thanks! I have a doubt though: are vanishing and exploding gradients coexistent phenomena? As they both happen during backpropagation, does their occurrence depend exclusively on the value of the loss at a particular epoch? Hope my question is clear.
Even I have the same question. I'd appreciate it if you could clear it up.
Krish Naik, best man!
Sir, why are you missing the first term when writing the chain rule? Can someone please let me know what the correct formula is?
Waiting for future videos on DL
Thanks for this amazing video sir!
Just to summarize, can I say that I will experience this problem only if my weight initialization is very high, the activation function is sigmoid, and the learning rate is also very high, and in no other cases?
The activation function doesn't matter for the exploding gradient problem to occur. High-magnitude weight initialization alone can cause this problem.
deepan chakravarthi
The input to the activation function is proportional to the weights being applied, so the exploding gradient depends indirectly on the activation function and directly on the weights.
The derivative should also be high.
Doubt: the bias that is added, what determines this bias?
For instance, the learning rate was found by optimization models. What methodology is used to introduce the bias?
I have been following your deep learning playlist; this is the 9th video. You teach amazingly, but I am confused whether this is a deep learning class or a mathematics class.
I don't understand the chain rule equation: how do we get the activation function when it should begin from dO21?
Why have you been missing the first term (dL/dO31) in the chain rule equation in two videos in a row? Is there a reason, or is it a mistake?
It was a mistake
@@krishnaik06 How can it be 125 if O21 is a sigmoid? Shouldn't the derivative of the sigmoid be in the range [0, 0.25]?
Thanks for the great explanation. One small doubt. Since we have sigmoid here, if the weight value is around 2, then dO21/dO11 will be 0.25 * 2 = 0.5, and the chain rule product (dO21/dO11) * (dO11/dW11) will be 0.5 * 0.5 = 0.25, taking the weight in dO11/dW11 to be 2 as well. Then instead of exploding it will be shrinking. Can you please suggest how to think about this scenario?
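The arithmetic in this question can be laid out as a small table. This is a rough sketch that follows the commenter's simplification of treating every factor in the chain as (sigmoid slope) * (weight) and using the maximum slope of 0.25; the helper name `max_layer_factor` is made up for illustration.

```python
MAX_SIGMOID_SLOPE = 0.25  # largest possible value of sigmoid'(z), reached at z = 0

def max_layer_factor(weight):
    """Upper bound on one chain-rule factor of the form sigmoid'(z) * weight."""
    return MAX_SIGMOID_SLOPE * abs(weight)

for w in (2.0, 4.0, 500.0):
    per_layer = max_layer_factor(w)
    two_layers = per_layer ** 2
    print(f"weight={w:>5}: per-layer factor <= {per_layer}, over two layers <= {two_layers}")
```

Under this simplification a weight of 2 does shrink the gradient, as the comment says. The per-layer factor only exceeds 1 when the weight magnitude is above 4, so with sigmoid the exploding case needs weights well beyond that, such as the 500 used in the video.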
Thanks a lot sir
Too good man !!! #BohotHard
Base concept:
While performing backward propagation, the derivative of the loss function gets smaller, so the weights are not changed. Then in forward propagation these weights are multiplied by the input values again, which changes the weight values even more from the previous ones. So if we perform backward propagation again, the old weights will be much different.
I'm getting confused by what you said at 3:20. Why do you expand dO21/dO11 for the exploding gradient but not for the vanishing gradient?
at 2:47 you are missing the dL/dO31 term
very good content
At 5:56, shouldn't it be "derivative of z w.r.t. w_11" instead of "derivative of z w.r.t. O_11"?
7:56 there's a mistake in the derivative. Please correct it.
Sir, in the chain rule formula, I guess you have left out del(L)/del(O^31) at the start.