Mini Batch Gradient Descent (C2W2L01)

  • Published: 17 Jan 2025

Comments • 56

  • @superchargedhelium956
    @superchargedhelium956 3 years ago +32

    This is the best way to learn. I can compartmentalize each portion of the video into a subsection and really train myself efficiently.

    • @RH-mk3rp
      @RH-mk3rp 2 years ago +2

      I agree, and considering how I'll often rewatch a segment of the video, it ends up being epochs = 2.

  • @TheDroidMate
    @TheDroidMate 2 years ago +6

    This is by far the best explanation out there, thanks Andrew 🚀

  • @pivasmilos
    @pivasmilos 4 years ago +8

    Thanks for making the notation beautiful and simple.

  • @javiercoronado4429
    @javiercoronado4429 4 years ago +44

    Why would someone dislike this very high-level material, which Andrew made available for free to anyone?

    • @HimanshuMauryadesigners
      @HimanshuMauryadesigners 3 years ago +5

      envy

    • @moosapatrawala1554
      @moosapatrawala1554 2 years ago +1

      There are so many things he hasn't explained; he just wrote them down.

    • @erkinsagroglu8519
      @erkinsagroglu8519 1 year ago

      @@moosapatrawala1554 Because this is part of a bigger course, as the video title indicates.

    • @torgath5088
      @torgath5088 1 year ago

      How to draw a kitten. Step 1: Draw a line. Step 2: Draw the rest of a kitten

  • @yavdhesh
    @yavdhesh 4 years ago +7

    Thank you, Andrew ji :)

  • @user-cc8kb
    @user-cc8kb 6 years ago +33

    He is so great. Andrew Ng ftw :D

    • @honoriusgulmatico6073
      @honoriusgulmatico6073 4 years ago +1

      So this is what this office looks like when you're NOT taking Coursera ML!

  • @taihatranduc8613
    @taihatranduc8613 4 years ago +2

    you are always the best teacher

  • @iAmTheSquidThing
    @iAmTheSquidThing 6 years ago +11

    I'm wondering if optimization might happen faster by first sorting the entire dataset into categories, and then ensuring that each mini-batch is a stratified sample which approximates the entire dataset.

    • @iAmTheSquidThing
      @iAmTheSquidThing 6 years ago +9

      Spirit - Apparently someone had the idea long before me, and it is effective: arxiv.org/abs/1405.3080
      My understanding is that it ensures each mini-batch approximates the entire dataset at every iteration. You never get an iteration made up almost entirely of samples from one class, which would waste iterations fitting the function to an unrepresentative sample and have to be undone later. (A rough sketch of the idea follows this thread.)

    • @cdahdude51
      @cdahdude51 5 years ago +2

      @@iAmTheSquidThing Why not just shuffle the dataset then?

    • @amanshah9961
      @amanshah9961 5 years ago

      @@iAmTheSquidThing Thanks for the reference :}

    • @cristian-bull
      @cristian-bull 4 years ago

      hey, that's a simple, cool idea you got there
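
    For illustration, a minimal NumPy sketch of the stratified mini-batch idea in this thread (the sketch mentioned above). The function name, array layout, and batch size are assumptions for the example, not something from the video or the linked paper:

        import numpy as np

        def stratified_minibatches(X, y, batch_size=1000, seed=0):
            # Yield mini-batches whose class proportions roughly match the full dataset.
            rng = np.random.default_rng(seed)
            # Shuffle the indices of each class separately.
            per_class = [rng.permutation(np.flatnonzero(y == c)) for c in np.unique(y)]
            # Spread each class evenly over [0, 1) and merge, so consecutive samples
            # cycle through the classes in roughly dataset proportions.
            positions = np.concatenate([(np.arange(len(idx)) + rng.random()) / len(idx)
                                        for idx in per_class])
            order = np.concatenate(per_class)[np.argsort(positions)]
            for start in range(0, len(order), batch_size):
                batch = order[start:start + batch_size]
                yield X[batch], y[batch]

        # Usage (X here holds one example per row, y holds integer class labels):
        # for X_batch, y_batch in stratified_minibatches(X_train, y_train, batch_size=1000):
        #     ...one gradient descent step per batch...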

  • @imanshojaei7784
    @imanshojaei7784 4 years ago +4

    At 8:35, should the sigma be from 1 to 1000 rather than 1 to l?

    • @goktugguvercin8069
      @goktugguvercin8069 4 years ago +2

      Yes, I guess there is a mistake there

    • @RH-mk3rp
      @RH-mk3rp 2 years ago

      I agree, it should be the sum from i=1 to i=mini-batch size, which in this case is 1000. In all the video examples up until now, which used batch gradient descent, it was i=1 to i=m, where m is the number of training samples. (The corrected cost is written out below.)
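
    For reference, writing the mini-batch cost out the way these replies suggest (a reconstruction from the discussion, with a mini-batch size of 1000; the slide's upper limit of l is what the commenters believe is a typo), in LaTeX:

        J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)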

  • @ninhhoang616
    @ninhhoang616 8 months ago

    Great video!

  • @rustyshackleford1964
    @rustyshackleford1964 1 year ago

    Thank you thank you thank you!

  • @ahmedb2559
    @ahmedb2559 2 years ago

    Thank you !

  • @snackbob100
    @snackbob100 4 years ago +3

    So all mini-batch gradient descent is, is taking a vector containing the whole dataset, splitting it up into k subsections, finding the average loss over each subsection, and doing a gradient descent step on each averaged error, instead of doing a gradient descent step on every single loss for each original (x, y) pair (see the sketch after this thread). So it's kind of a dimension reduction technique in a way??

    • @here_4_beer
      @here_4_beer 4 years ago +1

      Well, in principle you are correct. The idea is that a mean over 1000 samples tends to lie close to the true expectation, and the noise in your estimate (its standard deviation) decreases like 1/sqrt(n), where n is the number of samples in a batch. Therefore each cost evaluation is less noisy and training converges faster.

    • @here_4_beer
      @here_4_beer 4 years ago +1

      You want to exploit the weak law of large numbers: imagine you throw a die 10 times and want to estimate its side probabilities. Your estimate would have been more accurate if you had thrown the die 1000 times instead, right?
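
    For illustration, a minimal NumPy sketch of the splitting-and-stepping described in this thread (the sketch referenced above), using a plain linear model with squared error. The model, variable names, and hyperparameters are assumptions for the example, not the course's code:

        import numpy as np

        def minibatch_gd(X, y, batch_size=1000, lr=0.01, epochs=10, seed=0):
            # Mini-batch gradient descent for a linear model y ~ X @ w + b.
            rng = np.random.default_rng(seed)
            m, n = X.shape
            w, b = np.zeros(n), 0.0
            for _ in range(epochs):
                order = rng.permutation(m)              # reshuffle once per epoch
                for start in range(0, m, batch_size):   # one parameter update per mini-batch
                    idx = order[start:start + batch_size]
                    Xb, yb = X[idx], y[idx]
                    err = Xb @ w + b - yb               # residuals on this mini-batch only
                    w -= lr * (Xb.T @ err) / len(idx)   # average gradient over the batch
                    b -= lr * err.mean()                # parameters carry over to the next batch
            return w, b

    Note that nothing is reduced away: every example is still used once per epoch; each update just looks at one batch of examples at a time.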

  • @aravindravva3833
    @aravindravva3833 4 years ago +1

    10:52 - is it 1000 gradient descent steps or 5000?

    • @pushkarwani6099
      @pushkarwani6099 4 years ago

      One mini-batch (1000 training examples) is processed with one gradient descent step at a time, and this is repeated 5000 times, once for each mini-batch.

  • @mishasulikashvili1215
    @mishasulikashvili1215 5 years ago +2

    Thank you sir

  • @JoseRomero-wp4ij
    @JoseRomero-wp4ij 5 years ago +1

    thank you so much

  • @s25412
    @s25412 3 years ago

    @8:34 Why is it being summed from i=1 to l? Shouldn't it be 1000?

    • @windupbird9019
      @windupbird9019 3 years ago

      From my understanding, 1000 is the size of the training batch, while the l refers to the total number of layers in the nn. Since he is doing the forward and backward propagation, the gradient descent would take l steps.

    • @s25412
      @s25412 3 years ago

      @@windupbird9019 In another video by Ng (ruclips.net/video/U-4XvK7jncg/видео.html) at 2:24, he indicates that the number you divide the cost function by and the upper limit of the summation symbol should be identical. So I'm assuming the i=1 to l @ 8:34 is a typo... what do you think?

    • @nhactrutinh6201
      @nhactrutinh6201 3 years ago

      Yes, I think it should be 1000, the mini-batch size. A typo.

  • @parthmadan671
    @parthmadan671 2 years ago

    Do we use the weights of the previous batch to initialise the next batch?

  • @elgs1980
    @elgs1980 2 years ago

    What does processing the samples in the mini batch at the same time mean? Do we average or sum the input data before feeding them to the net?
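
    A hedged illustration of what "processing the mini-batch at the same time" usually means in this course's vectorized notation: the examples are stacked as columns and pushed through each layer in one matrix product, and the inputs themselves are neither summed nor averaged; only the per-example losses are averaged into the mini-batch cost. The layer sizes and names below are made up for the example:

        import numpy as np

        nx, batch_size, hidden = 3, 1000, 4          # illustrative sizes, not from the video
        X_batch = np.random.randn(nx, batch_size)    # one mini-batch: each column is one example
        W1, b1 = np.random.randn(hidden, nx), np.zeros((hidden, 1))

        Z1 = W1 @ X_batch + b1     # all 1000 examples propagated in a single matrix product
        A1 = np.maximum(0, Z1)     # ReLU; still shape (hidden, 1000), nothing averaged yet
        # ...only the final per-example losses are averaged to form the mini-batch cost J.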

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Very good explanation. Need to watch again.

  • @bilalalpaslan2
    @bilalalpaslan2 2 years ago

    Please help me.
    When do the weights and biases of the model update?
    At the end of each batch, or at the end of each epoch?
    I cannot understand this.
    For example,
    our dataset has 1600 X_train samples.
    We choose batch_size = 64 and epochs = 10. Is the number of weight and bias updates 1600/64 = 25, or only 10, once per epoch?

    • @EfilWarlord
      @EfilWarlord 3 months ago

      This is probably a very late reply hahah, but the weights and biases of the model are updated at the end of each mini-batch.
      So if your dataset has 1600 samples and your batch size is 64, then 1600/64 = 25, so in each single epoch the weights and biases are updated 25 times.
      In your case, with a batch size of 64 and 10 epochs, the model's weights and biases will be updated 25 x 10 = 250 times (a small counting sketch follows this reply).
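
    A tiny sketch that just counts the updates from the reply above; the numbers are the ones from the question, and the actual parameter update is omitted:

        n_samples, batch_size, n_epochs = 1600, 64, 10

        updates = 0
        for epoch in range(n_epochs):
            for start in range(0, n_samples, batch_size):   # one update per mini-batch
                updates += 1                                 # the parameter update would happen here

        print(updates)   # 250 == (1600 // 64) * 10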

  • @chinmaymaganur7133
    @chinmaymaganur7133 4 years ago

    What is (nx, m)? I.e., is nx the number of rows and m the number of features (columns), or vice versa?

    • @aayushpaudel2379
      @aayushpaudel2379 4 years ago

      nx is the number of features or input values; m is the number of training examples (see the small shape example below).
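
    For illustration, a small runnable example of those shapes with tiny made-up sizes, following the course convention of one training example per column:

        import numpy as np

        nx, m, batch_size = 4, 10, 5                  # illustrative sizes only
        X = np.random.randn(nx, m)                    # (nx, m): nx features per example, m examples
        Y = np.random.randint(0, 2, size=(1, m))      # (1, m): one label per example

        X_batch_1 = X[:, :batch_size]                 # first mini-batch, shape (nx, 5)
        Y_batch_1 = Y[:, :batch_size]                 # shape (1, 5)
        print(X.shape, Y.shape, X_batch_1.shape, Y_batch_1.shape)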

  • @jalendarch89
    @jalendarch89 7 years ago +2

    At 9:44, can l be 1000?

  • @tonyclvz109
    @tonyclvz109 5 years ago +1

    Andrews + ENSTA =

  • @nirbhaykumarpandey8955
    @nirbhaykumarpandey8955 6 years ago

    Why is X nx by m and not n by m?

    • @rui7268
      @rui7268 5 years ago

      Because nx represents the input data, which can come from a matrix rather than a single integer; e.g., if the input is an RGB image, you have width x height x 3 pixel values per example. It's not about the size of one mini-batch but about how many pixel values each example has. The notation nx is clearer than a plain n.

  • @Gammalight519
    @Gammalight519 3 years ago

    Free education

  • @EranM
    @EranM 6 months ago

    9:10 lol what is this? The sigma should be from 1 to 1000, over the batch examples... what is this sigma over l? Layers are only relevant for the weights, not for y and y_pred.

  • @grantsmith3653
    @grantsmith3653 1 year ago

    I was just thinking that if you increase your mini-batch size, then your error surface gets taller (assuming you're using SSE and it's a regression problem). That means your gradients would get bigger, so your steps would all get bigger... even though (on average) changing your batch size shouldn't change the error surface's argmin. So if you increase the batch size, I think you have to decrease the learning rate by a proportional amount to keep your changes in weights similar (a quick numeric check follows this comment).
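
    A quick numeric illustration of that point for a linear least-squares model; the data and model are made up, and it only compares gradient magnitudes when the loss is summed (SSE) rather than averaged:

        import numpy as np

        rng = np.random.default_rng(0)
        w_true = np.array([2.0, -1.0])
        X = rng.standard_normal((4096, 2))
        y = X @ w_true + 0.1 * rng.standard_normal(4096)
        w = np.zeros(2)

        def sse_grad(Xb, yb, w):
            return 2 * Xb.T @ (Xb @ w - yb)           # summed (not averaged) squared error

        for batch in (64, 256, 1024):
            g = sse_grad(X[:batch], y[:batch], w)
            # The norm grows roughly linearly with the batch size, which is why the
            # learning rate would need to shrink proportionally to compensate.
            print(batch, np.linalg.norm(g) / batch)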

  • @prismaticspace4566
    @prismaticspace4566 4 years ago

    baby training set...weird...

  • @Jirayu.Kaewprateep
    @Jirayu.Kaewprateep 1 year ago

    📺💬 We should study the effects of these mini-batches: instead of training on all samples at the same time, we divide the data into mini-batches and see the effect on gradient descent.
    🧸💬 It is a different story when the inputs are not very correlated, because accuracy and loss will go up and down during training; as in the previous example, the dropout layer technique helps pick out the patterns in the input.
    🧸💬 The problem is how we should set the mini-batch size and the number of new inputs (the distributions should stay the same); this method can train faster when we have a large dataset, but it gains nothing and trains for a long time on weakly related data.
    🐑💬 The accuracy rates and loss values are just numbers, but we can stop at the specific value we want, save, and re-work.
    🐑💬 One example, from an assignment in one of the courses linked here, is changing the batch_size not to train on more input samples, but because it is faster when the model does not forget the most recent inputs while using fewer LSTM layer units.
    🐑💬 You run into this kind of problem when mapping an input vocabulary.
    ( Nothing we found in local cartoons books Cookie run as example )
    VOLA : BURG / GRUB : IBCF / FCBI
    COKKIE : IUUQOK / KOQUUI : PBFPTP / PTPFBP
    RUN! : XAT' / 'TAX : ELS. / .SLE
    VOLA COKKIE RUN!
    ===========================================================================
    GRUB KOQUUI XAT'
    ===========================================================================
    IBCF PTPFBP ELS.
    👧💬 Comeback‼ BirdNest Hidden anywhere else.

  • @nikab1852
    @nikab1852 4 years ago

    thank you sir!

  • @tohchengchuan6840
    @tohchengchuan6840 4 years ago

    Why is y (1, m) instead of (n, m) or (ny, m)?

    • @aayushpaudel2379
      @aayushpaudel2379 4 years ago

      Assuming y takes a real value and not a vector value, like in a binary classification problem (0 or 1) or a regression problem. Hope it makes sense! :D