Neural Networks Part 7: Cross Entropy Derivatives and Backpropagation

  • Published: 26 Sep 2024

Comments • 320

  • @statquest
    @statquest  3 года назад +9

    The full Neural Networks playlist, from the basics to deep learning, is here: ruclips.net/video/CqOfi41LfDw/видео.html
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @bigbangdata
    @bigbangdata 3 года назад +116

    Your talent for explaining these difficult concepts and organizing the topics in didactic, bite-sized, and visually compelling videos is astounding. Your channel is a great resource for beginners and advanced practitioners who need a refresher on a particular concept. Thank you for all that you do!

  • @Freethinker33
    @Freethinker33 2 года назад +20

    Right now I am reading the ML book "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani. Many times I got stuck on the mathematical details, could not comprehend them, and stopped reading. I love that book a lot, but I felt frustrated. Now I use your videos and read the book side by side, and everything in the book starts making sense. You are such a great storyteller. The way you explain things in the videos with examples, it feels like I am listening to a story: "There was a king ..." It is so soothing, and complex topics become easy. I feel you are my friend and teacher on my ML journey who understands my pain and explains the hard things with ease. BTW, I have done a Master's in Data Science at Northwestern University and got a good ML foundation from that course, but I can tell you I only feel complete now, after going through most of your videos. Mr. Starmer, we are lucky to have you as such a great teacher and mentor. You are gifted at teaching people. I pledge to support your channel from my heart. Thank you.

    • @statquest
      @statquest  2 года назад

      Wow! Thank you very much!!! :)

  • @vishnukumar4531
    @vishnukumar4531 2 года назад +3

    0 comments left unreplied!
    Josh, you are truly one of a kind! ❣❣❣

  • @naf7540
    @naf7540 Год назад +16

    Dear Josh, how is it at all possible to deconstruct so clearly all these concepts, just incredible, thank you very much, your videos are addictive!!

  • @wennie2939
    @wennie2939 3 года назад +17

    Josh Starmer is THE BEST! I really appreciate your patience in explaining the concepts step-by-step!

    • @statquest
      @statquest  3 года назад

      Thank you very much! :)

  • @RubenMartinezCuella
    @RubenMartinezCuella 3 года назад +24

    Even though there are many other youtube channels that also explain NN, your videos are unique in the sense that you break down every single process into small operations that are easy for anyone to understand. Keep up the great work Josh, everyone here appreciates your effort so much!! :D

    • @statquest
      @statquest  3 года назад

      Thank you very much! :)

  • @nangmanlife23
    @nangmanlife23 2 года назад +11

    Your videos are truly astounding. I've gone through so many youtube playlists looking to understand Neural Networks, and none of them can come close to yours in terms of simplicity & content! Please keep up this amazing work for beginners like me :)

  • @iZapz98
    @iZapz98 3 года назад +13

    All your videos have helped me tremendously in studying for my ML exam, thank you

  • @anisrabahbekhoukhe3652
    @anisrabahbekhoukhe3652 Год назад +3

    I literally can't stop watching these vids, help me

  • @YLprime
    @YLprime 7 месяцев назад +3

    This channel is awesome, my deep learning knowledge is skyrocketing every day.

  • @simhasankar3311
    @simhasankar3311 Год назад +2

    Imagine the leaps and bounds we could achieve in global education if this teaching method was implemented universally. We would have a plethora of students equipped with the analytical skills to tackle complex issues. Your contributions are invaluable. Thank you!

  • @Lucas-Camargos
    @Lucas-Camargos Год назад +1

    This is the best Neural Networks example video I've ever seen.

    • @statquest
      @statquest  Год назад +1

      Thank you very much! :)

  • @AbdulWahab-mp4vn
    @AbdulWahab-mp4vn Год назад +2

    WOW! I have never seen anyone explain topics in such minute detail. You are an angel to us Data Science students! Love from Pakistan

  • @farrukhzamir
    @farrukhzamir 5 месяцев назад +2

    Brilliantly explained. You explain the concept in such a manner that it becomes very easy to understand. God bless you. I don't know how to thank you really. Nobody explains like you.❤

  • @yourfavouritebubbletea5683
    @yourfavouritebubbletea5683 Год назад +3

    Incredibly well done. I'm astonished and thank you for letting me not have a traumatic start with ML

  • @johannesweber9410
    @johannesweber9410 4 месяца назад +1

    Nice video! At first I was a little confused (like always), but then I plugged your values and the exact structure of your neural network into my own small framework and compared the results. After I did this, I followed your instructions and implemented the backpropagation step-by-step. Thanks for the nice video!

  • @salahaldeen1751
    @salahaldeen1751 2 года назад +1

    I don't know where else I could understand that like this. Thanks, you're talented!!!

  • @abhishekjadia1703
    @abhishekjadia1703 2 года назад +1

    Incredible !! ...You are not teaching, You are revealing !!

  • @saurabhdeshmane8714
    @saurabhdeshmane8714 Год назад +1

    Incredibly done... it doesn't even feel like we are learning such complex topics... it keeps me engaged through the entire playlist... thank you for such content!!

  • @ligezhang4735
    @ligezhang4735 Год назад +1

    This is so impressive! Especially for the visualization of the whole process. It really makes things very easy and clear!

  • @samerrkhann
    @samerrkhann 3 года назад +3

    Huge appreciation for all the effort you put in. Thank you Josh!

  • @susmitvengurlekar
    @susmitvengurlekar 2 года назад +2

    "I want to remind you" helped me understand why in the world is P(setosa) involved in output of versicolor and virginica.
    Great explanation!

    • @statquest
      @statquest  2 года назад +1

      Hooray!!! I'm glad the video was helpful.

  • @tejaspatil3978
    @tejaspatil3978 2 года назад +1

    Your way of teaching is on the next level. Thanks for giving us these great sessions.

  • @rajpulapakura001
    @rajpulapakura001 Год назад +2

    Clearly and concisely explained! Thanks Josh! P.S. If you know your calculus, I would highly recommend trying to compute the derivatives yourself before seeing the solution - it helps a lot!

  • @pietrucc1
    @pietrucc1 3 года назад +1

    I started using machine learning techniques a little less than a month ago; I found this site and it has helped me a lot, thank you very much!!

  • @nabeelhasan6593
    @nabeelhasan6593 2 года назад +1

    Lastly, I am really thankful for all the hard effort you put into these videos; they immensely helped me build a strong foundation in deep learning

    • @statquest
      @statquest  2 года назад +1

      Thank you very much! :)

  • @Meditator80
    @Meditator80 3 года назад +1

    Thank you so much! The explanation of how to calculate the cross entropy derivative and how to use it in backpropagation is so clear

    • @statquest
      @statquest  3 года назад

      Thank you very much! :)

  • @RC4boumboum
    @RC4boumboum 2 года назад +2

    Your courses are so good! Thanks a lot for your time :)

    • @statquest
      @statquest  2 года назад

      You're very welcome!

  • @donfeto7636
    @donfeto7636 9 месяцев назад +1

    You are a national treasure, BAAAM. Keep making these videos, they are great.

  • @GamTinjintJiang
    @GamTinjintJiang Год назад +1

    Wow~ your videos are so intuitive to me. What a precious resource!

  • @Recordingization
    @Recordingization 3 года назад +1

    Thanks for the nice lecture! I finally understand the derivative of cross entropy and the optimization of the bias.

  • @charliemcgowan598
    @charliemcgowan598 3 года назад +2

    Thank you so much for all your videos, they're actually amazing!

  • @bonadio60
    @bonadio60 3 года назад +1

    Your explanation is fantastic!! Thanks

  • @KayYesYouTuber
    @KayYesYouTuber Год назад +1

    So beautiful. Never seen anything like this!!!

  • @arielcohen2280
    @arielcohen2280 Год назад

    I hate all the songs and the meaningless sound effects, but damn, I have been trying to understand this concept for a hell of a long time and you made it clear

  • @susmitvengurlekar
    @susmitvengurlekar 2 года назад +2

    There is nothing wrong with self-promotion and, frankly, you don't need promotion. Anyone who watches any one of your videos will prefer your videos over any others henceforth.

  • @chethanjjj
    @chethanjjj 3 года назад

    @18:20 is what I've been looking for for a while. Thank you!

  • @gabrielsantos19
    @gabrielsantos19 29 дней назад +1

    Thank you, Josh! 👍

  • @samore11
    @samore11 Год назад +1

    These videos are so good - the explanations and quality of production are elite. My only nitpick is that it's hard for me to see "x" and not think of the letter "x" as opposed to a multiplication sign - but that's a small nitpick.

    • @statquest
      @statquest  Год назад +1

      After years of using confusing 'x's in my videos, I've finally figured out how to get a proper multiplication sign.

  • @r0cketRacoon
    @r0cketRacoon 6 месяцев назад

    Thank you very much for the video.
    Backpropagation with multiple outputs is not that hard for me, but it's really a mess when doing the computations

    • @statquest
      @statquest  6 месяцев назад

      Yep. The good news is that PyTorch will do all that for us.

  • @ariq_baze4725
    @ariq_baze4725 2 года назад +1

    Thank you, you are the best

  • @pedrojulianmirsky1153
    @pedrojulianmirsky1153 2 года назад +1

    Thank you for all your videos, you are the best!
    I have one question though. Lets suppose you have the worst possible fit for your model, where it predicts pSetosa = 0 for instances labeled Setosa, and pSetosa = 1 for those labeled either Virginica or Versicolor.
    Then, for each Setosa labeled instance, you would get dCESetosa/db3 = pSetosa - 1 = -1, and for each nonSetosa labeled instance dCEVersiOrVirg/db3 = pSetosa = +1.
    In summary, the total dCE/db3 would be accumulating either +1 for each Setosa instance and -1 for each non Setosa. So, if you have for example a dataset with 5 Setosa, 2 Versicolor and 3 Virginca:
    dCE(total)/db3 = (1+1+1+1+1) + (-1 -1) +(-1 -1 -1) = 5-2-3 = 0.
    The total dCE/db3 would be 0, as if the model had the best fit for b3.
    Because of this compensation between the opposite signs (+) and (-), the weight (b3) wouldn´t be adjusted by gradient descent, even though the model classifies badly.
    Or maybe I missunderstood something haha.
    Anyways, I got into ML and DL mainly because of your videos, can't thank you enough!!!!!!!

    • @statquest
      @statquest  2 года назад +2

      To be honest, I don't think that is possible because of how the softmax function works. For example, if it was known that the sample was setosa, but the raw output value for setosa was 0, then we would have e^0 / (e^0 + e^versi + e^virg) = 1 / (1 + e^versi + e^virg) > 0.
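
      As a minimal Python sketch of that point (the raw output values below are made-up, not from the video): softmax exponentiates and normalizes, so every class always gets a probability strictly greater than 0.

          import math

          def softmax(raw):
              # exponentiate each raw output value and normalize so the results sum to 1
              exps = [math.exp(r) for r in raw]
              total = sum(exps)
              return [e / total for e in exps]

          # even if the raw output for setosa is 0 (or very negative), e^raw is still > 0,
          # so the predicted probability for setosa can get close to 0 but never reach it
          print(softmax([0.0, 5.0, 5.0]))   # roughly [0.003, 0.498, 0.498]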

  • @madankhatri7727
    @madankhatri7727 8 месяцев назад

    Your explanations of hard concepts are pretty amazing. I have been stuck on a very difficult concept called the Adam optimizer. Please explain it. You are my last hope.

  • @osamahabdullah3715
    @osamahabdullah3715 3 года назад +1

    I really can't get enough of your videos, what an amazing way of explaining things. Thanks for sharing your knowledge with us. When is your next video coming out, please?

    • @statquest
      @statquest  3 года назад +1

      My next video should come out in about 24 hours.

    • @osamahabdullah3715
      @osamahabdullah3715 3 года назад +1

      @@statquest what wonderful news, thank you sir

  • @jamasica5839
    @jamasica5839 3 года назад +1

    This is even more bonkers than Backpropagation Details Pt. 2 :O

  • @shreeshdhavle25
    @shreeshdhavle25 3 года назад +1

    Finally! I was waiting for a new video for so long...

    • @statquest
      @statquest  3 года назад +1

      Thanks!

    • @shreeshdhavle25
      @shreeshdhavle25 3 года назад +1

      @@statquest Thanks to you Josh..... Best content in the whole world.... Also, thanks to you and your content, I am working at Deloitte now.

    • @statquest
      @statquest  3 года назад

      @@shreeshdhavle25 Wow! That is awesome news! Congratulations!!!

  • @Pedritox0953
    @Pedritox0953 3 года назад +1

    Great explanation

  • @nonalcoho
    @nonalcoho 3 года назад +1

    It is really easy to understand even though I am not good at calculus.
    And I got the answer to the question I asked you in the last video about the meaning of the derivative of softmax. I am really so happy!
    Btw, will you make more programming lessons like the ones you made before~?
    Thank you very much!

    • @statquest
      @statquest  3 года назад +1

      I hope to do a "hands on" webinar for neural networks soon.

    • @nonalcoho
      @nonalcoho 3 года назад +1

      @@statquest looking forward to it!

  • @grankoczsk
    @grankoczsk 2 года назад +1

    Thank you so much

  • @rahulkumarjha2404
    @rahulkumarjha2404 2 года назад +2

    Thank you for such an awesome video!!!
    I just have one doubt.
    At 18:12 of the video, the summation has 3 values because there are 3 items in the dataset.
    Let's say we have 4 items in the dataset, i.e. 2 items for setosa, 1 for virginica and 1 for versicolor.
    Then our summation will look like
    {(psetosa - 1) + (psetosa - 1) + psetosa + psetosa}
    i.e. the summation is over the data setosadata_row1, setosadata_row2, versicolordata_row3, virginicadata_row4.
    Am I right?

    • @statquest
      @statquest  2 года назад

      yep
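
      As a small sketch of that summation in Python (the per-row setosa probabilities below are made-up placeholders; in practice each row's p_setosa comes from running that row's measurements through the network):

          # one (p_setosa, observed_setosa) pair per row; observed is 1 for setosa rows, 0 otherwise
          rows = [(0.15, 1),   # setosa, row 1
                  (0.20, 1),   # setosa, row 2
                  (0.30, 0),   # versicolor, row 3
                  (0.25, 0)]   # virginica, row 4

          # d(total CE)/d(b3) = sum over all rows of (p_setosa - observed_setosa)
          d_total_ce_d_b3 = sum(p - observed for p, observed in rows)
          print(round(d_total_ce_d_b3, 2))   # (0.15-1) + (0.20-1) + 0.30 + 0.25 = -1.1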

    • @rahulkumarjha2404
      @rahulkumarjha2404 2 года назад

      @@statquest
      Thank You!!
      Your entire neural network playlist is awesome.

    • @statquest
      @statquest  2 года назад

      @@rahulkumarjha2404 Hooray! Thank you!

  • @stan-15
    @stan-15 2 года назад +1

    Since you used 3 data samples to get the values of the three cross-entropy derivatives, does this mean we must use multiple inputs for one gradient descent step when using cross entropy? (More precisely, does this mean we have to use n input samples, covering all n output classes, in order to be able to compute the appropriate derivative of the bias, and thus to perform one single gradient descent step?)

    • @statquest
      @statquest  2 года назад +1

      No. You can use 1 input if you want. I just wanted to illustrate all 3 cases.

  • @epiccabbage6530
    @epiccabbage6530 Год назад +1

    This has been extremely helpful, this series is great. I am a little confused, though, as to why we repeat the calculations for p.setosa, i.e. why we can't simply run through the calculations once and use the same p.setosa value 3 times (so, x-1 + x + x) and use that for the bias recalculation. But either way this has cleared up a lot for me

    • @statquest
      @statquest  Год назад

      What time point, minutes and seconds, are you asking about?(unfortunately I can't remember all of the details in all of my videos)

    • @epiccabbage6530
      @epiccabbage6530 Год назад +1

      @@statquest starting at 18:50, you go through three different observations and solve for the cross entropy. I'm curious as to why you need to look at three different observations, i.e. why you need to plug in values 3 times instead of just doing it once. If we want psetosa twice and psetosa-1 once, why do we need to do the equation three times, instead of just doing it once? Why can't we just do 0.15-1 + 0.15 + 0.15

    • @statquest
      @statquest  Год назад

      @@epiccabbage6530 Because each time the predictions are made using different values for the petal and sepal widths. So we take that into account for each prediction and each derivative relative to that prediction.

    • @epiccabbage6530
      @epiccabbage6530 Год назад

      @@statquest Right, but why do we look at multiple predictions in the context of changing the bias once? Is it just a matter of batch size?

    • @statquest
      @statquest  Год назад +2

      @@epiccabbage6530 Yes, in this example, we use the entire dataset (3 rows) as a "batch". You can either look at them all at once, or you can look at them one at a time, but either way, you end up looking at all of them.

  • @shubhamtalks9718
    @shubhamtalks9718 3 года назад +1

    BAM! Clearly explained.

  • @GLORYWAVE.
    @GLORYWAVE. 8 месяцев назад

    Thanks Josh for an incredibly well put together video.
    I have two quick questions:
    1) When you initially get that new b3 value of -1.23, and then say to repeat the process, I am assuming the process is repeated with a new 'batch' of 3 training samples, correct? i.e. you wouldn't use the same 3 that were just used?
    2) Are these multi-classification models always structured in such a way that each 'batch' or 'iteration' includes 1 actual observed sample from each class like in this example? It appears that the Total Cross Entropy calculation and derivatives would not make sense otherwise.
    Thanks again!

    • @statquest
      @statquest  8 месяцев назад +1

      1) In this case, the 3 samples are all the data we have, so we reuse them for every iteration. If we had more data, we might have different samples in different batches, but we would eventually reuse these samples at some later iteration.
      2) No. You just add up the cross entropy, regardless of how the samples are distributed among the classes, to get the total.
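
      As a toy sketch of that iteration loop in Python (the raw output values below are made-up placeholders standing in for everything upstream of b3, which is held fixed here; only b3 is updated, with a learning rate of 1 as in the video):

          import math

          # made-up raw output values (before adding b3) for [setosa, versicolor, virginica],
          # one list per training sample, plus the known class index for each row
          raw_pre_b3 = [[-0.5, 1.2, 1.5],   # known setosa
                        [ 0.3, 1.6, 0.4],   # known versicolor
                        [ 0.2, 0.8, 1.9]]   # known virginica
          observed = [0, 1, 2]

          b3, learning_rate = 0.0, 1.0
          for iteration in range(100):                         # reuse the same 3 rows every iteration
              d_total_ce_d_b3 = 0.0
              for pre, obs in zip(raw_pre_b3, observed):
                  raw = [pre[0] + b3, pre[1], pre[2]]          # b3 only shifts the setosa raw output
                  exps = [math.exp(r) for r in raw]
                  p_setosa = exps[0] / sum(exps)               # softmax probability for setosa
                  target = 1.0 if obs == 0 else 0.0            # 1 for the setosa row, 0 otherwise
                  d_total_ce_d_b3 += p_setosa - target         # derivative rule from the video
              b3 -= learning_rate * d_total_ce_d_b3            # gradient descent step for b3
          print(round(b3, 2))                                  # b3 settles at a compromise between the rows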

  • @sachinK-k5q
    @sachinK-k5q 6 месяцев назад

    Please create a similar series for the single-layer perceptron as well, and show the derivatives too

    • @statquest
      @statquest  6 месяцев назад

      I'll keep that in mind.

  • @АлександраРыбинская-п3л

    Dear Josh, I adore your lessons! They make everything so clear! I have a small question regarding this video. Why do you say that the predicted species is setosa when the predicted probability for setosa is only 0.15 (17:13 - 17:20)? There is a larger value (0.46) for virginica in this case (17:14). Why don't we say it's virginica?

    • @statquest
      @statquest  Год назад

      You are correct that virginica has the largest output value - however, because we know that the first row of data is for setosa, for that row, we are only interested in the predicted probability for setosa. This gives us the "loss" (the difference between the known value for setosa, 1, and the predicted value for setosa, 0.15 (except in this case we're using logs)) for that first row. For the second row, the known value is virginica, so, for that row, we are only interested in the predicted probability for virginica.

    • @АлександраРыбинская-п3л
      @АлександраРыбинская-п3л Год назад +1

      Thanks@@statquest

  • @ferdinandwehle2165
    @ferdinandwehle2165 2 года назад

    Hello Josh, your videos inspired me so much that I am trying to replicate the classification of the iris dataset.
    For my understanding, are the following statements true:
    1) The weights between the blue/orange nodes and the three categorization outputs are calculated in the same fashion as the biases (B3, B4, B5) in the video, as there is only one chain rule “path”.
    2) For weights and biases before the nodes there are multiple chain rule differentiation “paths” to the output: e.g. W1 can be linked to the output Setosa via the blue node, but could also be linked to the output Versicolour via the orange node; the path is irrelevant as long as the correct derivatives are used (especially concerning the SoftMax function).
    3) Hence, this chain rule path is correct given a Setosa input: dCEsetosa/dW1 = (dCEsetosa/d”Psetosa”) x (d”Psetosa”/dRAWsetosa) x (dRAWsetosa/dY1) x (dY1/dX1) x (dX1/dW1)
    Thank you very much for your assistance and the more than helpful video.
    Ferdinand

    • @statquest
      @statquest  2 года назад

      I wish I had time to think about your question - but today is crazy busy so, unfortunately I can't help you. :(

    • @ferdinandwehle2165
      @ferdinandwehle2165 2 года назад +1

      @@statquest No worries. The essence of the question is: how to optimize W1? Maybe you could have a think about it on a calmer day (:

    • @statquest
      @statquest  2 года назад

      @@ferdinandwehle2165 Regardless of the details, I think you are on the right track. The w1 can be influenced by a lot more than b3 is.

  • @콘충이
    @콘충이 3 года назад +1

    Appreciated it so much!

  • @minerodo
    @minerodo 11 месяцев назад

    Thank you!! I understood everything, but just a question: here you explain how to modify a single bias, and now I understand how to do it for each one of the biases. My question is, how do you backpropagate to the biases that are in the hidden layer? At what point? After you finish with b3, b4 and b5? Thanks!!

    • @statquest
      @statquest  11 месяцев назад

      I show how to backpropagate through the hidden layer in this video: ruclips.net/video/GKZoOHXGcLo/видео.html

  • @Tapsthequant
    @Tapsthequant 3 года назад

    So much gold in this one video. How did you select the learning rate of 1? In general, how do you select learning rates? Do you have ways to dynamically alter the learning rate in gradient descent? Taking recommendations.

    • @statquest
      @statquest  3 года назад +1

      For this video I coded everything by hand and setting the learning rate to 1 worked fine and was super easy. However, in general, most implementations of gradient descent will dynamically change the learning rate for you - so it should not be something you have to worry about in practice.
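
      For what it's worth, here is a hedged illustration (my own, not from the video) of what "the learning rate is handled for you" can look like: PyTorch's Adam optimizer adapts the effective step size for each parameter as training progresses.

          import torch

          # a stand-in parameter playing the role of b3 (made-up starting value)
          b3 = torch.tensor([0.0], requires_grad=True)
          optimizer = torch.optim.Adam([b3], lr=0.1)   # lr is just a starting scale; Adam adapts the steps

          for _ in range(200):
              optimizer.zero_grad()
              loss = (b3 + 1.23).pow(2).sum()   # toy loss whose minimum is at b3 = -1.23
              loss.backward()                   # compute d(loss)/d(b3)
              optimizer.step()                  # take an adaptively scaled step
          print(b3.item())                      # approaches -1.23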

    • @Tapsthequant
      @Tapsthequant 3 года назад

      Thank you 😊, you know I have been following this series and taking notes. I literally have a notebook.
      I also have Excel workbooks with implementations of the examples. I'm now at this video on CE, taking notes again.
      This is the softest landing I have ever had into a subject. Thank you 😊.
      Now how do I take this subject of Neural Networks further after this series? I am learning informally.
      Thank you Josh Starmer,

    • @statquest
      @statquest  3 года назад

      @@Tapsthequant I think the next step is to learn about RNNs and LSTMs (types of neural networks). I'll have videos on those soon.

  • @michaelyang3414
    @michaelyang3414 3 месяца назад

    Excellent work!!! Could you make one more video to show how to optimize all the parameters at the same time?

    • @statquest
      @statquest  3 месяца назад

      I show that for a simple neural network in this video: ruclips.net/video/GKZoOHXGcLo/видео.html

    • @michaelyang3414
      @michaelyang3414 3 месяца назад +1

      @@statquest Yes, I watched that video several times. Actually, I watched all 28 videos in your neural network/deep learning series several times. I am also a member and have bought your books. Thank you for your excellent work! But that video is just for one input and one output. Would you make another video to show how to handle multiple inputs and outputs, similar to the video you recommended?

    • @statquest
      @statquest  3 месяца назад

      @@michaelyang3414 Thank you very much for your support! I really appreciate it. I'll keep that topic in mind.

  • @sergeyryabov2200
    @sergeyryabov2200 8 месяцев назад +1

    Thanks!

    • @statquest
      @statquest  8 месяцев назад

      TRIPLE BAM!!! Thank you so much for supporting StatQuest!!! :)

  • @MADaniel717
    @MADaniel717 3 года назад +1

    If I want to find the biases of other nodes, do I just take the derivative with respect to them? What about the weights? Just became a member, you convinced me with these videos lol, congrats and thanks

    • @statquest
      @statquest  3 года назад

      Wow! Thank you for your support. For a demo of backpropagation, we start with one bias: ruclips.net/video/IN2XmBhILt4/видео.html then we extend that to one bias and 2 weights: ruclips.net/video/iyn2zdALii8/видео.html then we extend that to all biases and weights: ruclips.net/video/GKZoOHXGcLo/видео.html

    • @MADaniel717
      @MADaniel717 3 года назад

      @@statquest Thanks Josh! Maybe I missed it. I meant the hidden layers' weights and biases.

    • @statquest
      @statquest  3 года назад +1

      @@MADaniel717 Yes, those are covered in the links I provided in the last comment.

  • @lokeshbansal2726
    @lokeshbansal2726 3 года назад

    Thank you so much! You are making some amazing content.
    Can you please suggest a good book on Neural Networks in which the mathematics of the algorithms is explained, or tell us where you learned about machine learning and neural networks?
    Again, thank you for these precious videos.

    • @statquest
      @statquest  3 года назад

      Here's where I learned about the math behind cross entropy: www.mldawn.com/back-propagation-with-cross-entropy-and-softmax/ (by the way, I didn't watch the video - I just read the web page).

  • @sonoVR
    @sonoVR Год назад

    This is really helpful!
    So am I right to assume that, in the end, when using one-hot encoding, we can simplify it to d/dBn = Pn - Tn and d/dWni = (Pn - Tn)Xi?
    Here n indexes the outputs, P is the prediction, T is the one-hot encoded target, i indexes the inputs, Wni is the weight from that input to the respective output, and X is the input.
    Then, when backpropagating, we can transpose the weights, multiply them by the respective error Pn - Tn in the output layer, and sum to get an error for each hidden node, if I'm correct

    • @statquest
      @statquest  Год назад

      For the Weight, things are a little more complicated because the input is modified by previous weights and biases and the activation function. For more details, see: ruclips.net/video/iyn2zdALii8/видео.html

  • @ΓάκηςΓεώργιος
    @ΓάκηςΓεώργιος 3 года назад

    Nice video!
    I only have one question.
    How do I do it when there are more than 3 data points (for example, n for setosa, m for virginica, k for versicolor)?

    • @statquest
      @statquest  3 года назад +1

      You just run all the data through the neural network, as shown at 17:04, to calculate the cross entropy etc.

    • @ΓάκηςΓεώργιος
      @ΓάκηςΓεώργιος 3 года назад +1

      Thank you a lot for your help Josh

  • @hangchen
    @hangchen 7 месяцев назад

    Awesome explanation! Now I understand neural networks in more depth! Just one question - shouldn't the output of the softmax values sum to 1? @18:57

    • @statquest
      @statquest  7 месяцев назад

      Thanks! And yes, the output of the softmax should sum to 1. However, I rounded the numbers to the nearest 100th and, as a result, it appears like they don't sum to 1. This is just a rounding issue.

    • @hangchen
      @hangchen 7 месяцев назад +1

      Oh got it! Right if I add them up they are 1.01, which is basically 1. I just eyeballed it. Should have done a quick mind calc haha! By the way, I am so honored to have your reply!! Thanks for making my day (again, BAM!)!@@statquest

    • @statquest
      @statquest  7 месяцев назад

      @@hangchen :)

  • @tulikashrivastava2905
    @tulikashrivastava2905 3 года назад

    Thanks for posting the NN video series. It was just in time when I needed it 😊 You have the knack of splitting complex topics into logical parts and explaining them like a breeze 😀😀
    Can I request that you share some videos on Gradient Descent Optimisation and Regularization?

    • @statquest
      @statquest  3 года назад +1

      I have two videos on Gradient Descent and five on Regularization. You can find all of my videos here: statquest.org/video-index/

    • @tulikashrivastava2905
      @tulikashrivastava2905 3 года назад

      @@statquest Thanks for your quick reply! I have seen those videos and they are great as usual 👍👍
      I was asking about gradient descent optimisation for deep networks, like Momentum, NAG, Adagrad, Adadelta, RMSProp and Adam, and regularization techniques for deep networks like weight decay, dropout, early stopping, data augmentation and batch normalization.

    • @statquest
      @statquest  3 года назад

      @@tulikashrivastava2905 Noted.

  • @ecotrix132
    @ecotrix132 8 месяцев назад

    Thanks so much for posting these videos! I am curious about this: while using gradient descent for the SSR, one could get stuck at a local minimum. One shouldn't face this problem with cross entropy, right?

    • @statquest
      @statquest  8 месяцев назад +1

      No, you can always get stuck in a local minimum.

  • @jaheimwoo866
    @jaheimwoo866 10 месяцев назад +2

    Save my university life!

  • @kamshwuchin6907
    @kamshwuchin6907 3 года назад

    Thank you for the effort you put into making these amazing videos!! It helps me a lot in visualising the concepts. Can you make a video about information gain too? Thank you!!

    • @statquest
      @statquest  3 года назад +2

      I'll keep that in mind.

    • @raminmdn
      @raminmdn 3 года назад +1

      @@statquest I think videos on the general concepts of information theory (such as information gain) would be greatly beneficial for many, many people out there, and a very nice addition to the machine learning series. I have not been able to find videos as comprehensive (and at the same time as clearly explained) as yours anywhere on RUclips or in online courses, specifically when it comes to concepts that usually seem complicated.

  • @harkatiyoussef9994
    @harkatiyoussef9994 8 месяцев назад +1

    what's the difference between Softplus and Softmax ? Is it only about the softness of the toilet paper ? 🤣🤣🤣
    just kidding, you do an awesome job, your videos are way above everybody else in ML / DL

    • @statquest
      @statquest  8 месяцев назад

      Thank you very much!

  • @andredahlinger6943
    @andredahlinger6943 2 года назад +1

    Hey Josh, awesome videos

    • @statquest
      @statquest  2 года назад

      I think the idea is to optimize for whatever your output ultimately ends up being.

    • @zahari_s_stoyanov
      @zahari_s_stoyanov 2 года назад

      I think he said that this optimization is done instead of, not after SSR. Rather than calculating SSR and dSSR , we go another step further by using softMax, then calculate CE and dCE, which puts the final answers between 0.0 and 1.0 and also provides simpler calculations for backprop :)

  • @Waffano
    @Waffano 2 года назад

    Thanks for all these great videos Josh. They are a great resource for my thesis writing!
    I have a question about the intuition behind all this:
    Intuitively it really doesn't make sense to me why we need to include the error for virginica and versicolor when we are trying to optimize a value that only affects setosa. Would a correct intuition be: it is because they "indirectly" indicate how well the Setosa predictions are doing? In other words, because of SoftPlus, we will always get a probability for Setosa no matter what input we use? And then we might as well use all the data, since more data = better models? Hope I didn't miss anything in the video that explains this!

    • @statquest
      @statquest  2 года назад +1

      To be honest, I'm not exactly sure what time point (minutes and seconds) in the video you are asking about. However, in the examples we are solving for the derivatives with respect to the bias b3, which only affects the output value for Setosa. We want that output value to be very high when we are classifying samples that are known to be setosa, and we want that output value to be very low when we are classifying samples that are known to be some other species. And, because we want it high in one case and low in all others, we need to take all cases into account.

    • @Waffano
      @Waffano 2 года назад +1

      @@statquest Thank you very much!

  • @evilone1351
    @evilone1351 2 года назад

    Excellent series! Enjoyed every one of them so far, but that's the one where I lost it :) Too many subscripts and quotes in formulas.. Math has been abstracted too much here I guess, sometimes just a formula makes it easier to comprehend :D

  • @dr.osamahabdullah1390
    @dr.osamahabdullah1390 3 года назад

    Is there any chance you could talk about deep learning or compressive sensing please? Your videos are so awesome

    • @statquest
      @statquest  3 года назад

      Deep learning is a pretty vague term. For some, deep learning just means a neural network with 3 or more hidden layers. For others, deep learning refers to a convolutional neural network. I explain CNNs in this video: ruclips.net/video/HGwBXDKFk9I/видео.html

  • @مهیارجهانینسب
    @مهیارجهانینسب 2 года назад

    Awesome video. I really appreciate how you explain all these concepts in a fun way.
    I have a question. In the previous video, on softmax, you said the predicted probabilities for the classes are not reliable, even though they correctly classify the input data, because of our random initial values for the weights and biases. Now, by using cross entropy, we basically multiply the observed probability in the data set by log p and then optimize it. So are the predicted probabilities for the different classes of an input reliable now?

    • @statquest
      @statquest  2 года назад

      To be clear, I didn't say that the output from softmax was not reliable, I just said that it should not be treated as a "probability" when interpreting the output.

  • @harshchoudhary2817
    @harshchoudhary2817 2 года назад

    What I see from here is that gradient descent optimizes on the basis of the total cross entropy and tries to minimize it.
    Suppose for some data the actual output is setosa but the neural net predicts versicolor with a very high probability, say close to 1. The loss would still be minimized and gradient descent won't optimize it. So we will get a wrong output with very high probability.
    Is it so, or am I missing something here?

    • @statquest
      @statquest  2 года назад

      See 17:05. For the first row of data, the observed species is "setosa", but setosa gets the lowest predicted probability (0.15) and thus, the Cross Entropy for that row is 1.89. Now, if, instead, the neural net predicted Versicolor for the first row with a probability of 0.98 and the prediction for Setosa was 0.01, then the Cross Entropy would be greater, it would be -log(0.01) = 4.6, and, as a result, the total cross entropy would also be greater (-log(0.01) + -log(0.98) + -log(0.01) = 9.23). So the loss would be significantly greater and gradient descent would optimize it.
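
      The arithmetic in that reply, spelled out in Python (using the probabilities given above; the cross entropy for each row is just -log of the predicted probability of that row's known species):

          import math

          p_known = [0.01, 0.98, 0.01]   # predicted probability of each row's known species
          per_row = [-math.log(p) for p in p_known]
          print([round(ce, 2) for ce in per_row])   # [4.61, 0.02, 4.61]
          print(round(sum(per_row), 2))             # 9.23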

  • @aritahalder9397
    @aritahalder9397 2 года назад

    Hi, do we have to consider the inputs as batches of setosa, versicolor and virginica?? What if, while calculating the derivative of the total CE, we had setosa in the 1st row as well as setosa in the 2nd row?? What would the value of dCE(pred2)/db3 be??

    • @statquest
      @statquest  2 года назад

      We don't have to consider batches - we should be able to add up the losses from each sample for setosa.

  • @hisyamzayd
    @hisyamzayd 2 года назад

    Thank you so much Mr. Josh, I wish I had had this back when I first learned neural networks.
    Let me ask a question: so does Cross Entropy require batch processing, a.k.a. multiple rows of data for each training step? Thank you

    • @statquest
      @statquest  2 года назад

      I don't think it requires batch processing.

  • @user-rt6wc9vt1p
    @user-rt6wc9vt1p 3 года назад

    Are we calculating the derivative of the total cost function (e.g. -log(a) - log(b) - log(c)), or just the loss for that respective weight's output?

    • @statquest
      @statquest  3 года назад

      We are calculating the derivative of the total cross entropy with respect to the bias, b3.

  • @lancelofjohn6995
    @lancelofjohn6995 3 года назад +1

    Bam, this is a nice video.

  • @marahakermi-nt7lc
    @marahakermi-nt7lc 28 дней назад

    Hey Josh, I think there is a mistake in the video at 18:54. If the predicted value is setosa, I think the corresponding raw output for setosa, and also the probability, should be the biggest. Isn't that right?

    • @statquest
      @statquest  27 дней назад

      The video is correct. At that time point the weights in the model are not yet fully trained - so the predictions are not great, as you see. The goal of this example is to use backpropagation to improve the predictions.

    • @marahakermi-nt7lc
      @marahakermi-nt7lc 27 дней назад +1

      @@statquest I'm sorry Josh, my bad, you are a brilliant man, baaaaaam

  • @dianaayt
    @dianaayt 6 месяцев назад

    20:14 If we have a lot more training data, would we just add all of it up in this way to do the backpropagation?

    • @statquest
      @statquest  6 месяцев назад +1

      Yes, or we can put the data into smaller "batches" and process the data batch by batch (so, if we had 10 batches with 50 samples each, we would only add up the 50 values in a batch before updating the parameters).

    • @r0cketRacoon
      @r0cketRacoon 6 месяцев назад +1

      There are some methods like mini-batch gradient descent and stochastic gradient descent; you should do some digging into them.

  • @saraaltamirano
    @saraaltamirano 2 года назад +1

  • @Xayuap
    @Xayuap Год назад

    Yo, Josh,
    in my example, with two outputs,
    if I repeatedly adjust one b, then the other b hardly needs any adjustment.
    Should I adjust both in parallel?

  • @rhn122
    @rhn122 3 года назад

    Hey cool video, though I actually haven't fully watched your neural network playlists, just want to keep things simple with traditional statistics for now hehe!
    But I want to ask you about all these steps and formulas, do you actually always have in mind all of these methods and calculations, or only keep the essential parts and their ups & downs when actually solving practical problems?
    Because I love statistics, but can never fully commit myself to be in one with the calculation steps. I watched your videos to understand the under the hood process, but only keep the essential parts like why it works and its pitfalls, and leaving behind all the calculation tricks.

    • @rhn122
      @rhn122 3 года назад

      As a note, I think understanding the process is crucial to fully understand its strengths and weaknesses, but for the actual formula most of the time if it's too complicated I'll just delegate it to the computer to be processed

    • @statquest
      @statquest  3 года назад +1

      It's perfectly fine to ignore the details and just focus on the main ideas.

  • @beshosamir8978
    @beshosamir8978 2 года назад

    Hi Josh,
    I have a quick question. I saw a video on RUclips where the man explaining the concept said they use the sigmoid function in the output layer for binary classification and ReLU for the hidden layers. So I think we run into the same problem here, which is that the gradient of the sigmoid function is too small, which makes us end up taking small steps. So I thought we could also use cross entropy in this situation, right?

    • @statquest
      @statquest  2 года назад

      I'm not sure I fully understand your question; any time you have more than one category, you can use cross entropy.

    • @beshosamir8978
      @beshosamir8978 2 года назад

      @@statquest
      I mean, can I use cross entropy for binary classification?

    • @statquest
      @statquest  2 года назад +1

      @@beshosamir8978 Yes.

    • @beshosamir8978
      @beshosamir8978 2 года назад

      @@statquest
      So, is it smart to use it for a binary classification problem? Or is it better to just use the sigmoid function in the output layer?

  • @saibalaji99
    @saibalaji99 2 года назад

    Do we use the same training data until all the biases are optimised?

  • @justinwhite2725
    @justinwhite2725 3 года назад

    I think I need a derivatives 101. I've followed every gradient descent/neural net video and you jumped straight to derivatives like we already knew what they were. It's a huge 'black box' to me, and when I try to do exactly what you say I can't get a handle on it, because my exact scenario is different and I don't know how to figure out the derivatives (or even what a derivative actually is)

    • @statquest
      @statquest  3 года назад +2

      Have you seen my video on The Chain Rule? It might help: ruclips.net/video/wl1myxrtQHQ/видео.html

    • @justinwhite2725
      @justinwhite2725 3 года назад

      @@statquest that video assumes you already understand derivatives. I get the chain rule because it's very similar to how we eliminate components in chemistry and physics.
      I still don't get what a derivative is or how it represents a slope, other than in the specific example you've shown there.

    • @statquest
      @statquest  3 года назад

      @@justinwhite2725 Noted

  • @wuzecorporation6441
    @wuzecorporation6441 Год назад

    18:04 Why are we taking the sum of the cross entropy gradients across different data points? Wouldn't it be better to take the gradient for one data point, do backpropagation, and then take the gradient of another data point and do backpropagation again?

    • @statquest
      @statquest  Год назад +1

      You can certainly do backpropagation using one data point at a time. However, in practice, it's usually much more efficient to do it in batches, which is what we do here.

    • @sanjanamishra3684
      @sanjanamishra3684 8 месяцев назад

      @@statquest Thanks for the great series! I had a similar doubt regarding this. I understand the point of processing in batches and taking a batch-wise loss, but what I can't wrap my head around is why we need data points that cover all three categories, i.e. setosa, virginica and versicolor. Does this mean that in practice we have to ensure that each batch covers all the classes, i.e. a classic data imbalance problem? I normally thought that avoiding data imbalance in the overall dataset was enough. Please clarify this, thanks!

    • @statquest
      @statquest  8 месяцев назад

      @@sanjanamishra3684 Who said you needed data points that predict all 3 species?

  • @giacomorotta6356
    @giacomorotta6356 Год назад

    Great video, but I still cannot understand why, in this cross-entropy function, you treat the true output class as the only variable in the function and not also the other classes as variables (they are multiplied by 0 in the cross entropy since they are not the true class, but are still variables in the function). Is it because the other classes' derivatives (the ones you did not consider) are going to be zero, and so they are not relevant for backpropagation?

    • @statquest
      @statquest  Год назад

      What time point, minutes and seconds, are you asking about specifically? Without knowing, I think the answer might start at 11:27 Specifically, b3 only affects the green crinkled surface, so it only changes the prediction for setosa. So, regardless of whether or not the known value is setosa or virginica or versicolor, we are only interested in how b3 changes the green surface, which is the surface we use to predict setosa.

    • @giacomorotta6356
      @giacomorotta6356 Год назад

      ​@@statquest thanks for your reply! my problem is why in this example 11.27 the cross entropy is -log(predicted "p" virginica) and not -log("p" virginica, "p" setosa, "p" versicolor) is this because since they are multiplied by 0 their derivatives are meaningful for backpropagation?

    • @statquest
      @statquest  Год назад

      ​@@giacomorotta6356 I think I understand your question better now, and the answer is in my video on Cross Entropy here: ruclips.net/video/6ArSys5qHAU/видео.html In other words, when the observed value is for "virginica", then the cross entropy terms for "setosa" and "versicolor" are multiplied by 0 (their observed value) and they go away. And thus, when we take the derivative, they are not there to begin with.
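
      A one-line check of that point in Python (made-up probabilities): the terms for the classes whose observed value is 0 vanish before you ever take a derivative.

          import math

          p = [0.15, 0.39, 0.46]   # made-up predicted probabilities for [setosa, versicolor, virginica]
          t = [0.0, 0.0, 1.0]      # observed values when the row is known to be virginica

          full_sum  = -sum(t_j * math.log(p_j) for t_j, p_j in zip(t, p))
          true_only = -math.log(p[2])   # just -log(predicted probability for virginica)
          print(full_sum == true_only)  # True: the setosa and versicolor terms were multiplied by 0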

  • @danielsimion3021
    @danielsimion3021 Месяц назад

    What about the derivatives with respect to the inner weights, like w1 or w2, before entering the ReLU function? Because, for example, w1 affects all 3 raw output values, unlike b3, which affects only the first raw output.

    • @statquest
      @statquest  Месяц назад +1

      See: ruclips.net/video/GKZoOHXGcLo/видео.html

    • @danielsimion3021
      @danielsimion3021 Месяц назад

      @@statquest Thanks for your answer, I've already seen that video; my problem is that w1 affects all 3 raw outputs, so when you take the derivative of the predicted probability with respect to a raw output, which raw output should you use: setosa, virginica or versicolor?
      Whichever you choose, you will get back to w1, because the setosa raw, virginica raw and versicolor raw outputs all have w1 in their expressions.

    • @statquest
      @statquest  Месяц назад +1

      @@danielsimion3021 You use them all.
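
      A sketch of what "use them all" can look like (my own illustration with made-up numbers and names; w1 feeds a hidden node whose activation y1 reaches every raw output through the hidden-to-output weights, so the chain rule sums over all three paths):

          import math

          x_input = 0.5               # made-up input value that w1 scales
          y1 = 0.7                    # made-up activation of the hidden node that w1 feeds
          w_out = [1.0, -0.4, 0.6]    # made-up weights from that hidden node to the 3 raw outputs
          p = [0.15, 0.39, 0.46]      # made-up softmax outputs for [setosa, versicolor, virginica]
          t = [1.0, 0.0, 0.0]         # this row is known to be setosa

          relu_slope = 1.0 if y1 > 0 else 0.0   # derivative of ReLU at the hidden node

          # chain rule: sum the contribution from every raw output, because w1 reaches all of them
          d_ce_d_w1 = sum((p_j - t_j) * w_j for p_j, t_j, w_j in zip(p, t, w_out)) * relu_slope * x_input
          print(round(d_ce_d_w1, 3))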

    • @danielsimion3021
      @danielsimion3021 Месяц назад +1

      @@statquest OK; I did it with pen and paper and finally understood. Thank you very much.

    • @statquest
      @statquest  Месяц назад +1

      @@danielsimion3021 bam! :)

  • @_epe2590
    @_epe2590 3 года назад +1

    Please could you do videos on classification, specifically gradient descent for classification.

    • @statquest
      @statquest  3 года назад

      Can you explain how that would be different from what is in this video? In this video, we use gradient descent to optimize the bias term. In neural network circles, they call this "backpropagation" because of how the derivatives are calculated, but it is still just gradient descent.

    • @_epe2590
      @_epe2590 3 года назад +1

      @@statquest Well, when I see others explaining it, it's usually with a 3-dimensional non-linear graph. When you demo it, the graph always looks like a parabola. Am I missing something important?

    • @statquest
      @statquest  3 года назад

      @@_epe2590 When I demo it, I try to make it as simple as possible by focusing on just one variable at a time. When you do that, you can often draw the loss function as a parabola. However, when you focus on more than one variable, the graphs get much more complicated.

    • @_epe2590
      @_epe2590 3 года назад +1

      @@statquest Ok. And I love your videos, by the way. They are easy to understand and absorb. BAM!

  • @irisfreesiri
    @irisfreesiri 5 месяцев назад

    In my case, my MLP didn't use the bias nodes, but I need to update the weights in the backpropagation process. So I wonder why this video didn't cover that :(

    • @statquest
      @statquest  5 месяцев назад

      I show how to optimize weights in these videos: ruclips.net/video/iyn2zdALii8/видео.html and ruclips.net/video/GKZoOHXGcLo/видео.html and the process is identical in this case. The goal was just to show the principle so that you can combine it with the knowledge gained from the other videos I have on backpropagation.

  • @user-rt6wc9vt1p
    @user-rt6wc9vt1p 3 года назад

    Is the process for calculating derivatives in respect to weights and biases the same for each layer we backpropagate through? Or would the derivative chain be made up of more parts for certain layers?

    • @statquest
      @statquest  3 года назад

      If each layer is the same, then the process is the same.

    • @user-rt6wc9vt1p
      @user-rt6wc9vt1p 3 года назад +1

      great, thanks!

  • @Xayuap
    @Xayuap Год назад

    Double Bam,
    can we use 2 instead of e as the base?
    I mean, it would fit the hardware architecture better.

    • @statquest
      @statquest  Год назад +1

      As long as you are consistent, you can use whatever base you want. But, generally, speaking log base 'e' is the easiest one to work with.

    • @Xayuap
      @Xayuap Год назад +1

      @@statquest Yep, on the whiteboard e would be the most elegant.
      But for the processor, using base 2 in the log or exp would mean just shifting the significand left or right; I'd hope it would be as fast as using ReLU.

  • @nbndanzo3685
    @nbndanzo3685 3 года назад

    Can you help me, dear Josh? I didn't understand the meaning here in the notes at 15:04 and 18:09 (why did we decide to use one prediction for each observation, or to use three observations with two inputs, for finding the derivative?). Thank you very much for everything)

    • @statquest
      @statquest  3 года назад +1

      At 15:04, and earlier, we are simply illustrating how to calculate the derivative, with respect to b_3, of various known species. Since there are three known species, we calculate three separate derivatives, one per known species, with respect to b_3 to cover all possibilities. Later, at 18:09, now that we have the derivatives, one per potential known outcome, we can put them to work using backpropagation, which uses the full dataset to optimize b_3. If you're having trouble understanding how backpropagation works, consider watching this video: ruclips.net/video/IN2XmBhILt4/видео.html

    • @nbndanzo3685
      @nbndanzo3685 3 года назад

      @@statquest Thanks for the answer Josh, I specifically wanted to clarify these details of the lesson: Question 1) At 11:26, why not use the observation (0.04 and 0.42) to calculate the cross-entropy derivative for Virginica? I think this is because a neural network can build, from one observation, only one model graph for one type of flower? Question 2) At 13:52, why do the observations (1 and 0.54) relate to Virginica and not to another species? How did we come to that decision?

    • @statquest
      @statquest  3 года назад +1

      @@nbndanzo3685 1) We don't use 0.04 and 0.42 to calculate the derivative with respect to Virginica because those two measurements were made from a Setosa flower, not a Virginica flower. 2) The values 1 and 0.54 were measurements made from a Virginica flower. In other words, we wandered around the woods until we found a Virginica flower and then we measured its Petal (1) and its Sepal (0.54). In contrast, we found a Setosa flower and its Petal was 0.04 and its Sepal was 0.42. Lastly, we found a Versicolor and measured its Petal and Sepal and got 0.5 and 0.37. Thus, this data is "training data" and the measurements are associated with a specific species of flower because we specifically found each species and measured the petal and sepal sizes.

    • @nbndanzo3685
      @nbndanzo3685 3 года назад +1

      @@statquest thank you Josh)I understood,I love you bro)

    • @statquest
      @statquest  3 года назад +1

      @@nbndanzo3685 Bam!

  • @JainmiahSk
    @JainmiahSk 3 года назад +1

    Love ❤️❤️❤️ it