Tutorial 8 - Exploding Gradient Problem in Neural Network

  • Published: 26 Oct 2024

Comments • 176

  • @midhileshmomidi2434
    @midhileshmomidi2434 5 years ago +40

    From now on, if anyone asks me about the vanishing gradient or exploding gradient problem, I will not just answer, I will even take a class on it.
    The best video I've ever seen

    • @manishsharma2211
      @manishsharma2211 4 years ago

      Exactly

    • @kiruthigakumar8557
      @kiruthigakumar8557 4 years ago +3

      I have a small doubt... in vanishing the values were very small, but here they are high, yet both have the same equation, right? Or is it because the weights in vanishing were normal and in exploding they are high?... Your help is really appreciated.

    • @sargun_narula
      @sargun_narula 4 years ago

      @@kiruthigakumar8557 Even I have the same doubt; if anyone can help, it would be really appreciated.

    • @chiragchauhan8429
      @chiragchauhan8429 3 years ago +4

      @@sargun_narula As he said, with sigmoid the values lie between 0 and 1, so if the weights are small when we initialise them, vanishing won't be a problem for a small network with 1 or 2 hidden layers. But if the network uses more layers, say 10, then when backpropagating, say from the third-to-last layer, the derivative keeps decreasing with every layer, and because of that the optimizer becomes too slow to reach the minimum; that is the vanishing gradient. As for the exploding gradient, if the weights are bigger and the derivative keeps increasing while backpropagating, that may make our optimizer diverge rather than reach the minimum, i.e. the exploding problem. Simply put, weights shouldn't be initialized too high or too low.

    • @babupatil2416
      @babupatil2416 3 years ago

      @@kiruthigakumar8557 Irrespective of your activation function, your weights cause the exploding/vanishing gradient problem. Weights shouldn't be initialized too high or too low. Here is the Andrew Ng video on the same topic: ruclips.net/video/qhXZsFVxGKo/видео.html

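A rough numerical sketch of the point made in the replies above: during backpropagation the chain rule contributes roughly one factor of sigmoid'(z) * w per layer, so small weights shrink the gradient and large weights blow it up. The layer count, the value of z, and the weight values below are made up purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # always between 0 and 0.25

def backprop_factor(weight, num_layers, z=0.5):
    """Product of (sigmoid'(z) * weight) across layers, as the chain rule gives."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_derivative(z) * weight
    return factor

print(backprop_factor(weight=0.5, num_layers=10))    # shrinks toward 0: vanishing gradient
print(backprop_factor(weight=500.0, num_layers=10))  # grows huge: exploding gradient
```
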
  • @khalidal-reemi3361
    @khalidal-reemi3361 3 years ago +33

    I have never gotten such a clear explanation of deep learning concepts.
    I took the Coursera deep learning course; they make it more difficult than it is.
    Thank you Krish.

  • @winviki123
    @winviki123 5 years ago +39

    Loving this playlist
    Most of these abstract concepts are explained very elegantly
    Thank you so much

  • @skviknesh
    @skviknesh 3 years ago +5

    9:32 peak of interest! Happiness in explaining why it will not converge... I love that reaction!!!😍😍😍

  • @tarun4705
    @tarun4705 1 year ago +2

    This playlist is like a treasure.

  • @rukeshshrestha5938
    @rukeshshrestha5938 4 years ago +6

    I really love your videos. I only started watching your tutorials today, and they were really helpful. Thank you so much for sharing your knowledge.

  • @somanathking4694
    @somanathking4694 5 months ago

    How did I miss this class all these years?
    How are you able to simplify these topics so well?
    👏

  • @pushkarajpalnitkar1695
    @pushkarajpalnitkar1695 3 years ago

    Best explanation for EXPLODING gradient problem on the internet I have encountered so far. Awesome!

  • @whitemamba7128
    @whitemamba7128 4 years ago +3

    Sir, your videos are very educational, and you put a lot of energy into making them. They make the learning process easy, and they let me develop an interest in deep learning. That's the best I could have asked for, and you delivered it. Thank you, Sir.

  • @raidblade2307
    @raidblade2307 4 years ago +2

    Deep Concepts are getting clear.
    Thank you sir. Such a beautiful explanation

  • @farzanehparvar_
    @farzanehparvar_ 3 years ago

    That was one of the best explanations of the exploding gradient problem. But please mention the next video in the description box; I found it hard to locate.

  • @annalyticsannalizaramos5890
    @annalyticsannalizaramos5890 3 years ago +1

    Congrats on a well-explained topic. Now I know the effect of exploding gradients.

  • @tinumathews
    @tinumathews 5 years ago +3

    This is super, Krish, it's like a story that you explain... at 9:35 the whole picture jumps into your mind. Neat explanation. Nice work Krish... awaiting more videos. Meet you on Saturday... till then, cheers.

  • @anshulzade6355
    @anshulzade6355 2 years ago

    keep up the good work, disrupting the education system. Lots of love

  • @adityashewale7983
    @adityashewale7983 1 year ago

    Hats off to you sir, your explanation is top level. Thank you so much for guiding us...

  • @basharfocke
    @basharfocke 1 year ago

    Best explanation so far. No doubt !!!

  • @aravindpiratla2443
    @aravindpiratla2443 2 years ago

    Love the explanation bro... I used to initialize weights randomly but after watching this, I came to know the impact of such initializations...

  • @ArthurCor-ts2bg
    @ArthurCor-ts2bg 4 years ago +1

    Very passionate and articulate lecture well done

  • @-birigamingcallofduty2219
    @-birigamingcallofduty2219 3 years ago

    Very very effective video sir 👍👍👍👍👍👍....my love and gratitude to you 🙏...

  • @bigbull266
    @bigbull266 3 years ago +6

    The exploding gradient problem comes from initializing the weights too high. If the weights are large, then during backprop the gradient values will be large, which makes the update in [Wnew = Wold - lr * Grad] very large as well. Because of that, the weights swing wildly at every epoch, and this is why gradient descent never converges.

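A minimal, hypothetical sketch of the update rule quoted in the comment above, w_new = w_old - lr * grad, applied to the toy loss L(w) = w^2. The learning rate and the "exploded" gradient scale are invented here purely for illustration.

```python
def descend(w, lr, grad_scale, steps=5):
    """Apply w_new = w_old - lr * grad on the toy loss L(w) = w**2."""
    history = [w]
    for _ in range(steps):
        grad = 2.0 * w * grad_scale  # dL/dw = 2w, scaled up as if the gradient had exploded
        w = w - lr * grad            # the update rule quoted in the comment above
        history.append(round(w, 4))
    return history

print(descend(w=1.0, lr=0.1, grad_scale=1.0))    # moves steadily toward the minimum at w = 0
print(descend(w=1.0, lr=0.1, grad_scale=100.0))  # jumps past the minimum and diverges
```
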
  • @ne2514
    @ne2514 3 years ago

    Love your videos on machine learning algorithms, kudos.

  • @slaozturk47
    @slaozturk47 2 years ago

    Your classes are quite clear, thank you so much !!!!

  • @143balug
    @143balug 4 years ago +1

    Excellent videos bro, I am getting a clear picture of these concepts. Thank you very much for making the videos in such a clear, understandable manner.
    I am following your every video.

  • @kueen3032
    @kueen3032 3 years ago +44

    One correction: dL/dW'11 should be (dL/dO31 . dO31/dO21 . dO21/dO11 . dO11/dW'11)

    • @vikrambharadwaj7072
      @vikrambharadwaj7072 3 years ago +3

      In Tutorial 6 there was also a correction...!
      Is there an explanation?

    • @adarshyadav340
      @adarshyadav340 3 years ago

      You are right @kueen, Krish has missed out the first term in the chain rule.

    • @vvek27
      @vvek27 3 years ago

      yes you are right

    • @manojsamal7248
      @manojsamal7248 3 years ago

      But what goes into "dL"? Is it (y - Y)^2, or does the log loss function go into "dL"?

    • @indrashispowali
      @indrashispowali 2 years ago

      Just wanted to know... does the chain rule here refer to partial derivatives??

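For readability, the corrected chain rule from the thread above, written out in the video's whiteboard notation (with L the loss, O31 the output-layer activation, and w'11 the weight being updated):

```latex
\frac{\partial L}{\partial w'_{11}}
  = \frac{\partial L}{\partial O_{31}}
    \cdot \frac{\partial O_{31}}{\partial O_{21}}
    \cdot \frac{\partial O_{21}}{\partial O_{11}}
    \cdot \frac{\partial O_{11}}{\partial w'_{11}}
```
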
  • @vincenzo3908
    @vincenzo3908 4 years ago +1

    Very well explained, and the writings and drawings are very clear too by the way

  • @rajaramk1993
    @rajaramk1993 5 years ago +1

    Excellent and to-the-point explanation, sir. Waiting for your future videos on deep learning.

  • @yogenderkushwaha5523
    @yogenderkushwaha5523 4 years ago

    Amazing explanation sir. I am going to learn whole deep learning from your videos only

  • @ronishsharma8825
    @ronishsharma8825 4 years ago +18

    There is a mistake in the chain rule, please correct it.

  • @shamussim137
    @shamussim137 3 years ago +4

    Question:
    Hi Krish. dO21/dO11 is large because we multiply the derivative of the sigmoid (between 0 and 0.25) by a large weight. However, in Tutorial 7 we didn't use this formula (the chain rule derivation); we directly said dO21/dO11 is between 0 and 0.25. Please can you clarify this?

    • @hritiknandanwar5095
      @hritiknandanwar5095 2 years ago

      Even I have the same question, sir can you please explain this section?

    • @shrikotha3899
      @shrikotha3899 2 years ago

      Even I have the same doubt... can you explain this?

    • @aadityabhardwaj4036
      @aadityabhardwaj4036 10 months ago

      That is because O21 = sigmoid(ff21), and when we take the derivative of O21 with respect to any variable (be it O11), we know it will range between 0 and 0.25, because the derivative of sigmoid(x) ranges from 0 to 0.25 and x can be any value.

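A short sketch of the distinction being discussed above (not from the video): the derivative of the sigmoid with respect to its own input never exceeds 0.25, but dO21/dO11 also carries the connecting weight via the chain rule, so it can be much larger. The weight 500 and the resulting 125 echo the numbers mentioned in the comments; the input value 0 and zero bias are assumptions chosen so the sigmoid derivative sits at its 0.25 maximum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # w.r.t. its own input z: never above 0.25

def d_output_wrt_prev(w, prev_activation, b=0.0):
    """Derivative of sigmoid(w * prev_activation + b) w.r.t. prev_activation."""
    z = w * prev_activation + b
    return d_sigmoid(z) * w  # the chain rule brings in the weight

print(d_sigmoid(0.0))                                   # 0.25, the sigmoid-derivative ceiling
print(d_output_wrt_prev(w=500.0, prev_activation=0.0))  # 125.0, far beyond 0.25
```
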
  • @4abdoulaye
    @4abdoulaye 4 years ago +1

    YOU ARE JUST KIND DUDE. THANKS

  • @janekou2482
    @janekou2482 4 years ago

    Awesome explanation! Best video I have seen for this problem.

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Superb video once again. But I need to study a little bit of theory. Still, I have no idea how questions on deep learning are framed in an interview.

  • @indrashispowali
    @indrashispowali 2 years ago

    thanks Krish... nice explanations

  • @kishanpandey4798
    @kishanpandey4798 4 years ago +8

    Please see, the chain rule is missing something at 2:55. @krish naik

    • @omkarrane1347
      @omkarrane1347 4 years ago +9

      Yes, there is a mistake; it is missing del L / del O31 onwards.

    • @amrousimen684
      @amrousimen684 4 years ago

      @@omkarrane1347 Yes, this is a miss.

  • @karunasagargundiga5821
    @karunasagargundiga5821 4 years ago +3

    Hello sir,
    In the vanishing gradient problem you mentioned that the derivative of the sigmoid is always between 0 and 0.25. When you took the derivative of the sigmoid, i.e. the derivative of O12 w.r.t. O11, it should have been in the range 0 to 0.25, but when you expanded it we got the answer 125. I did not understand how the derivative of the sigmoid exceeded the range 0 to 0.25. It seems contradictory. Hope you can clear my doubt, sir.

    • @priyanath2754
      @priyanath2754 4 years ago +1

      I am having the same doubt. Can anyone please explain it?

    • @reachDeepNeuron
      @reachDeepNeuron 4 years ago

      Even I had this question

    • @praneetkuber7210
      @praneetkuber7210 4 years ago

      He multiplied 0.25 by the initial weight w21, which was 500. In his case w21 is the derivative of z w.r.t. O11.

  • @DanielSzalko
    @DanielSzalko 5 years ago +2

    Please keep making videos like this!

  • @PeyiOyelo
    @PeyiOyelo 4 years ago +1

    Another Great Video. Namaste

  • @emirozgun3368
    @emirozgun3368 4 years ago +1

    Pure passion, appreciate it.

  • @YoutubePremium-ny2ys
    @YoutubePremium-ny2ys 3 years ago

    Request for a video with a side-by-side comparison of the vanishing gradient and the exploding gradient...

  • @sindhuorigins
    @sindhuorigins 4 years ago +2

    The activation function is denoted by phi, not to be confused with the symbol for a cyclic (contour) integral.

  • @pranavgandhiprojects
    @pranavgandhiprojects 3 months ago

    so well explained!

  • @harshsharma-jp9uk
    @harshsharma-jp9uk 2 years ago

    great work.. Kudos to u!!!!!!!!!!

  • @nareshbabu9517
    @nareshbabu9517 5 years ago +4

    Do tutorials on machine learning topics like regression, classification and clustering, sir.

  • @jasbirsingh8849
    @jasbirsingh8849 4 years ago +4

    In the vanishing gradient video you directly put in values between 0 and 0.25, since the derivative lies in that range, but why not put in direct values here?
    I mean, couldn't we have done the same in the vanishing gradient case as well, i.e. expanding the equation and multiplying by the weight?

    • @anshul8258
      @anshul8258 3 years ago

      Even I am having the same doubt. After watching this video, I cannot understand why (dO21 / dO11) was directly put between 0 and 0.25 in the Vanishing Gradient Problem video.

    • @souravsaha1973
      @souravsaha1973 2 years ago

      @krish naik sir, can you please help clarify this doubt?

    • @elileman6599
      @elileman6599 2 years ago

      Yes, it confused me too.

  • @kalpeshnaik8826
    @kalpeshnaik8826 4 years ago +1

    Is the exploding gradient problem only for the sigmoid activation function, or for all activation functions?

  • @pranjalgupta9427
    @pranjalgupta9427 3 years ago +1

    Awesome 😊👏👍

  • @brindapatel1750
    @brindapatel1750 4 years ago

    excellent krish
    love to watch your videos

  • @nitishkumar-bk8kd
    @nitishkumar-bk8kd 4 years ago

    beautiful explanation

  • @omkarrane1347
    @omkarrane1347 4 years ago +3

    Sir, please note that in the last two videos there was a wrong application of the chain rule. Even our teacher, who referred to the video, has written the same mistake in her notes. Ref: del L / del O31 onwards.

    • @krishnaik06
      @krishnaik06  4 years ago

      I probably made a mistake in the last part

    • @shubhammaurya2658
      @shubhammaurya2658 4 years ago +1

      Can you explain briefly what is wrong, so I can understand?

    • @chinmaybhat9636
      @chinmaybhat9636 4 years ago

      Which one is correct then, the one used in this video or the one used in the previous video??

  • @bangarrajumuppidu8354
    @bangarrajumuppidu8354 3 years ago

    super explanation sir !!

  • @tarunbhatia8652
    @tarunbhatia8652 3 years ago

    Best video. Hands down

  • @ganeshkharad
    @ganeshkharad 4 years ago

    Best explanation... thanks for making this video.

  • @pdteach
    @pdteach 4 years ago

    Very nice explanation. Thanks.

  • @sahilsaini3783
    @sahilsaini3783 3 years ago +2

    At 08:30, the derivative of O21 w.r.t. O11 is 125, but O21 is a sigmoid function. How can its derivative be 125, when the derivative of the sigmoid function ranges from 0 to 0.25?

  • @praneethcj6544
    @praneethcj6544 4 years ago +1

    Excellent ..!!!

  • @sarrae100
    @sarrae100 3 years ago

    Excellent.

  • @invisible2836
    @invisible2836 4 months ago

    So overall you're saying that if you choose high values for the weights, it will cause problems in reaching, or may never reach, the global minimum.

  • @nitayg1326
    @nitayg1326 4 years ago

    Exploding GD explained nicely!

  • @makemoney7506
    @makemoney7506 3 months ago

    Thank you very much, I learned a lot. I think in the gradient you forgot one term, the first one, dL/dO31.

  • @sumeetseth22
    @sumeetseth22 4 years ago

    Love your videos and can't thank you enough. Thank you so much for the awesomest lessons.

  • @komandoorideekshith85
    @komandoorideekshith85 6 months ago

    A small doubt: in another video you said that the derivative of the loss w.r.t. the weight equals the derivative of the loss w.r.t. the output, and so on, but in this video you started directly from the output on the RHS. Could you please confirm this?

  • @ankurmodi4588
    @ankurmodi4588 3 years ago +1

    These likes will turn into 1M likes after mid-2021. People do not understand the effort and hard work, since they are not doing anything themselves right now. Wait and watch.

  • @jagadeeswarareddy9726
    @jagadeeswarareddy9726 3 years ago

    Really very good videos. One doubt: high-value weights cause this exploding problem, but W-old might also be a large value, right? So doing W-old - dL/dW would not cause a big variance, right? Please help me.

  • @SambitBasu22MCA024
    @SambitBasu22MCA024 7 months ago

    So basically exploding and vanishing gradients depend on how the weights are initialised?

  • @shahariarsarkar3433
    @shahariarsarkar3433 3 years ago

    Sir, maybe there is a problem in the chain rule that you explained. Something is missing here, namely the derivative of L with respect to O31.

  • @revanthshalon5626
    @revanthshalon5626 4 years ago +1

    Sir, the exploding gradient problem occurs only when the weights are high, and the vanishing gradient occurs when the weights are too low. Is my assumption correct?

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks krish

  • @Mustafa-jy8el
    @Mustafa-jy8el 4 years ago +1

    I love the energy

  • @emilyme9478
    @emilyme9478 3 years ago

    great video !

  • @sushantshukla6673
    @sushantshukla6673 5 years ago

    You're doing a great job, man.

  • @samyakjain8079
    @samyakjain8079 3 years ago +1

    @7:47 d(w_21 * O_11) = O_11 dw_21 + w_21 dO_11 (why are you assuming w_21 is constant?)

  • @y.mamathareddy8699
    @y.mamathareddy8699 5 years ago +1

    Sir, please make a video on Bayes' theorem and its related concepts...

  • @16876
    @16876 4 years ago

    awesome video, much respect

  • @sushmitapoudel8500
    @sushmitapoudel8500 3 years ago

    You're great!

  • @jayanthAILab
    @jayanthAILab 1 year ago

    Sir, why are you not writing the term dL/d(O31) with the other terms?

  • @KamalkaGermany
    @KamalkaGermany 2 years ago +2

    Shouldn't the derivative be dL/dw'11 = dL/dO31 and then the rest? Could someone please clarify? Thanks.

  • @shambhuthakur5562
    @shambhuthakur5562 4 years ago +5

    Thanks Krish for the video, however I didn't understand how you replaced the loss function with the output of the output layer; it should actually be the real output minus the predicted one. Please suggest.

    • @shashwatsinha4170
      @shashwatsinha4170 3 years ago

      He has just shown that the predicted output will be made an input to the loss function (not that the predicted output is the loss function, as you have understood).

  • @anirbandas6122
    @anirbandas6122 2 years ago

    @2:37 you have missed a derivative, dL/dO31, on the RHS.

  • @thunder440v3
    @thunder440v3 4 years ago

    Awesome video!

  • @vd.se.17
    @vd.se.17 4 years ago

    Thank you.

  • @lakshyarajput1023
    @lakshyarajput1023 3 years ago

    Sir, you showed one formula for the chain rule, and here you are showing a different formula for taking the derivative.
    Is it the same? Please clear up this doubt.

  • @SimoneIovane
    @SimoneIovane 5 years ago +2

    Very well explained, thanks! I have a doubt though: are vanishing and exploding gradients coexistent phenomena? As they both happen in backpropagation, does their occurrence depend exclusively on the value of the loss at a particular epoch? Hope my question is clear.

    • @reachDeepNeuron
      @reachDeepNeuron 4 years ago

      Even I have the same question. I'd appreciate it if you could clarify.

  • @rmn7086
    @rmn7086 3 years ago

    Krish Naik, the best man!

  • @arnavkumar5226
    @arnavkumar5226 2 years ago

    Sir, why are you missing the first term while writing the chain rule? Can someone please let me know what the correct formula is?

  • @saikiran-mi3jc
    @saikiran-mi3jc 5 years ago +1

    Waiting for future videos on DL

  • @jt007rai
    @jt007rai 4 years ago

    Thanks for this amazing video sir!
    Just to summarize, can I say that I would experience this problem only if my weight initialization is very high, the activation function is sigmoid, and the learning rate is also very high, and in no other cases?

    • @32deepan
      @32deepan 4 years ago

      The activation function doesn't matter for the exploding gradient problem to occur. High-magnitude weight initialization alone can cause it.

    • @songs-jn1cf
      @songs-jn1cf 4 years ago

      deepan chakravarthi
      The activation function's input is proportional to the weights being applied, so the exploding gradient depends indirectly on the activation function and directly on the weights.

    • @manishsharma2211
      @manishsharma2211 4 years ago

      The derivative should also be high.

  • @pratikkhadse732
    @pratikkhadse732 4 years ago

    Doubt: the BIAS that is added, what constitutes this bias?
    For instance, the learning rate is found by optimization methods; what methodology is used to introduce the bias?

  • @AbhishekMadankar
    @AbhishekMadankar 3 years ago

    I have been following your deep learning playlist; this is the 9th video... You teach amazingly, but I am confused whether this is a deep learning class or a mathematics class.

  • @smarttaurian30
    @smarttaurian30 1 year ago

    I don't understand the chain rule equation: how do we get the activation function there, when it should begin from dO21?

  • @samikshandas5546
    @samikshandas5546 3 years ago

    Why are you missing the first term (dL/dO31) in the chain rule equation, consistently across two videos? Is there a reason or is it a mistake?

    • @krishnaik06
      @krishnaik06  3 years ago

      It was a mistake

    • @joshgung
      @joshgung 2 years ago

      @@krishnaik06 How can it be 125 if O21 is a sigmoid? Shouldn't the derivative of the sigmoid be in the range [0, 0.25]?

  • @subrataghosh735
    @subrataghosh735 3 years ago

    Thanks for the great explanation. One small doubt/clarification would be helpful. Since we have a sigmoid here, if the weight value is around 2, then the dO21/dO11 value will be 0.25 * 2 = 0.5, and the chain rule product ((dO21/dO11) * (dO11/dW11)) will be 0.5 * 0.5 = 0.25, considering the weight in dO11/dW11 is also 2. Then instead of exploding it will be shrinking. Can you please suggest what the thinking is for this scenario?

  • @benvelloor
    @benvelloor 4 years ago

    Thanks a lot sir

  • @dhruvajpatil8359
    @dhruvajpatil8359 4 years ago

    Too good man !!! #BohotHard

  • @boringhuman9427
    @boringhuman9427 3 years ago +2

    Base concept:
    While performing backward propagation the derivative of the loss function gets lower, hence the weights are barely changed; then again in forward propagation these weights are multiplied with the input values, which changes the weight values even more from the previous ones. So if we perform backward propagation again, the old weights will be much different.

  • @ashwinsenthilvel4976
    @ashwinsenthilvel4976 4 years ago

    I'm getting confused by what you said at 3:20. Why do you expand dO21/dO11 in this exploding gradient video but not in the vanishing gradient one?

  • @sumaiyachoudhury7091
    @sumaiyachoudhury7091 11 months ago

    at 2:47 you are missing the dL/dO31 term

  • @quranicscience9631
    @quranicscience9631 5 years ago

    very good content

  • @MoosaMemon.
    @MoosaMemon. 5 months ago

    At 5:56, shouldn't it be "derivative of z w.r.t. w_11" instead of "derivative of z w.r.t. O_11"?

  • @SuryaDasSD
    @SuryaDasSD 4 years ago +1

    7:56 there's a mistake in the derivative... please correct it.

  • @subhamsekharpradhan297
    @subhamsekharpradhan297 3 years ago

    Sir, in the chain rule formula I guess you have left out the del(L)/del(O^31) term at the start.