This is the best way to learn. I can compartmentalize each portion of the video into a subsection and really train myself efficiently.
I agree and considering how I'll often rewatch a segment of the video, it ends up being epochs = 2
This is by far the best explanation out there, thanks Andrew 🚀
Thanks for making the notation beautifully and simple.
why would someone dislike this very high level material, which Andrew made available for free for anyone?
envy
There are so many things he hasn't explained and has just written down.
@@moosapatrawala1554 Because this is part of a bigger course, as described in the video title
How to draw a kitten. Step 1: Draw a line. Step 2: Draw the rest of a kitten
Thank you, Andrew ji :)
He is so great. Andrew Ng ftw :D
So this is what this office looks like when you're NOT taking Coursera ML!
you are always the best teacher
I'm wondering if optimization might happen faster by first sorting the entire dataset into categories, and then ensuring that each mini-batch is a stratified sample which approximates the entire dataset.
Spirit - Apparently someone had the idea long before me and it is effective: arxiv.org/abs/1405.3080
My understanding is that it ensures each mini-batch approximates the entire dataset at every iteration. You never get an iteration made up almost entirely of samples from one class, which would waste iterations fitting the function to an unrepresentative subset, only to be undone in later iterations.
@@iAmTheSquidThing Why not just shuffle the dataset then?
@@iAmTheSquidThing Thanks for the reference :}
hey, that's a simple, cool idea you got there
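For anyone curious, here is a minimal sketch of the stratified mini-batch idea in Python/NumPy. All names here (make_stratified_batches, labels) are made up for illustration; this is a sketch of the general idea, not necessarily the exact method of the linked paper:

    import numpy as np

    def make_stratified_batches(labels, n_batches, rng=None):
        # Yield index arrays whose class mix approximates the whole dataset.
        rng = rng or np.random.default_rng()
        # Independently shuffled index pool for each class.
        pools = [rng.permutation(np.flatnonzero(labels == c)) for c in np.unique(labels)]
        # Split every class pool into n_batches chunks, then combine one chunk per class.
        chunks = [np.array_split(pool, n_batches) for pool in pools]
        for b in range(n_batches):
            yield rng.permutation(np.concatenate([c[b] for c in chunks]))

    # Example: 10 mini-batches over imbalanced binary labels.
    labels = np.array([0] * 900 + [1] * 100)
    for batch_idx in make_stratified_batches(labels, n_batches=10):
        pass  # each batch holds ~90 class-0 and ~10 class-1 indices

Each yielded batch then drives one gradient step, so no step ever sees an almost single-class batch.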
at 8:35 is sigma from 1 to 1000 rather than 1 to l ?
Yes, I guess there is a mistake there
I agree, it should be the sum from i=1 to i=mini-batch size, which in this case is 1000. In the batch gradient descent used in all the video examples up until now, it was i=1 to i=m, where m = number of training examples.
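For reference, the corrected cost for mini-batch t written out (assuming mini-batch size 1000 as in the video, consistent with the batch version J = (1/m) sum over i=1..m):

    J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)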
Great video!
Thank you thank you thank you!
Thank you !
So all mini-batch gradient descent is, is taking a vector containing the whole dataset, splitting it into k subsections, finding the average loss in each subsection, and doing a gradient descent step on each averaged error, instead of doing a gradient descent step on every single loss, i.e. each original (x, y) pair. So it's kind of a dimensionality reduction technique, in a way??
Well, in principle you are correct. The idea is that your mean over 1000 samples converges toward the truth, and the noise of your estimate (its standard deviation) decreases like 1/sqrt(n), where n is the number of samples in a batch (the variance decreases like 1/n). Therefore your cost function evaluation is less noisy and converges faster.
You want to exploit the weak law of large numbers. Imagine you throw a die 10 times and want to estimate its side probabilities: your result would have been less noisy if you had thrown the die 1000 times instead, right?
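That intuition is easy to check numerically. A quick sketch (assuming a fair six-sided die): the spread of the estimated probability shrinks roughly like 1/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(0)

    def spread_of_estimate(n_throws, n_trials=10_000):
        # Estimate P(roll == 6) from n_throws, repeated n_trials times,
        # and return the standard deviation of those estimates around 1/6.
        rolls = rng.integers(1, 7, size=(n_trials, n_throws))
        return (rolls == 6).mean(axis=1).std()

    print(spread_of_estimate(10))    # ~0.12  (very noisy)
    print(spread_of_estimate(1000))  # ~0.012 (about 10x smaller = 1/sqrt(100))

The same effect is what makes a 1000-sample mini-batch gradient much less noisy than a single-sample one.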
10:52 is it 1000 gradient descent steps or 5000??
One mini-batch (1000 training examples) produces one gradient descent step at a time, and this is repeated 5000 times, once per mini-batch.
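Spelling out the arithmetic with the video's numbers (m = 5,000,000 training examples, mini-batches of 1000):

    \text{mini-batches per epoch} = \frac{m}{\text{mini-batch size}} = \frac{5{,}000{,}000}{1000} = 5000

so one pass through the training set performs 5000 gradient descent steps, one per mini-batch.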
Thank you sir
thank you so much
@8:34 why is the sum from i=1 to l? Shouldn't it be 1000?
From my understanding, 1000 is the size of the mini-batch, while the l refers to the total number of layers in the NN. Since he is doing forward and backward propagation, gradient descent would take l steps.
@@windupbird9019 in another video by Ng (ruclips.net/video/U-4XvK7jncg/видео.html) at 2:24, he indicates that the number you divide the cost function by and the upper limit of the summation should be identical. So I'm assuming the i=1 to l at 8:34 is a typo... what do you think?
Yes, I think it should be 1000, the mini-batch size. It's a typo.
Do we use the weights of the previous batch to initialise the next batch?
What does processing the samples in the mini batch at the same time mean? Do we average or sum the input data before feeding them to the net?
Very good explanation. Need to watch it again.
Please help me?
When do the weights and biases of the model update?
At the end of each batch, or at the end of each epoch?
I can't figure this out.
For example, our dataset has 1600 X_train samples, and we choose batch_size = 64 and epochs = 10. Is the number of weight/bias updates 1600/64 = 25 per epoch, or only 10 (once per epoch)?
This is probably a very late reply hahah, but the weights and biases of the model update at the end of each mini-batch.
So if your dataset has 1600 samples and your batch size is 64, then 1600/64 = 25, so in each single epoch the weights and biases are updated 25 times.
In your case, with a batch size of 64 and 10 epochs, the model's weights and biases are updated 25 x 10 = 250 times in total.
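That counting is easy to see in a skeletal training loop (a sketch using the numbers from the question; the variable names are made up):

    # One parameter update per mini-batch, in every epoch.
    m, batch_size, epochs = 1600, 64, 10
    updates = 0
    for epoch in range(epochs):
        for start in range(0, m, batch_size):
            # forward prop, backprop, and one gradient step would go here
            updates += 1
    print(updates)  # 250 = (1600 // 64) batches/epoch * 10 epochs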
What is (nx, m)? I.e., is nx the number of rows and m the number of features (columns), or vice versa?
nx is the number of features or input values; m is the number of training examples.
at 9:44 , can l be 1000 ?
yes, it should be
Andrews + ENSTA =
so true, I love u
why is X nx by m and not n by m ??
Because nx represents the size of one input example, which can come from a structure rather than a single number. E.g., if the input is an RGB image of 64x64 pixels, each example has 64*64*3 = 12288 input values. So it's not only about the number of examples in a mini-batch but also about how many values each example has; the notation nx is clearer than plain n.
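A small shape check of that convention (a sketch with made-up sizes: m examples, each a 64x64 RGB image):

    import numpy as np

    m = 1000
    images = np.zeros((m, 64, 64, 3))   # m RGB images of 64x64 pixels
    X = images.reshape(m, -1).T         # flatten each image: X has shape (n_x, m) = (12288, 1000)
    y = np.zeros((1, m))                # one scalar label per example, stacked as (1, m)
    print(X.shape, y.shape)             # (12288, 1000) (1, 1000)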
Free education
9:10 lol what is this? The sigma should run from 1 to 1000, over the mini-batch examples. What is this sigma over l? Layers are only relevant for the weights, not for y and y_pred.
I was just thinking that if you increase your mini-batch size, your error surface gets taller (assuming you're using SSE and it's a regression problem). That means your gradients get bigger, so your steps all get bigger... even though (on average) changing the batch size shouldn't change the error surface's argmin. So if you increase the batch size, I think you have to decrease the learning rate by a proportional amount to keep your weight updates similar.
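A numerical check of that intuition (a sketch for linear regression with a summed squared error; all names made up): quadrupling the number of examples roughly quadruples the SSE gradient, so the learning rate would need to shrink by the same factor.

    import numpy as np

    rng = np.random.default_rng(1)

    def sse_grad(w, X, y):
        # Gradient of sum((X @ w - y)**2) with respect to w.
        return 2 * X.T @ (X @ w - y)

    w_true = np.ones(5)
    X = rng.normal(size=(2000, 5))
    y = X @ w_true + 0.1 * rng.normal(size=2000)
    w = np.zeros(5)

    g_small = sse_grad(w, X[:500], y[:500])
    g_full = sse_grad(w, X, y)  # 4x as many examples
    print(np.linalg.norm(g_full) / np.linalg.norm(g_small))  # ~4

Note that the cost in the video averages over the mini-batch (the 1/1000 factor), which removes exactly this scaling; the issue only arises with a pure sum like SSE.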
baby training set...weird...
📺💬 We should study the effect of these mini-batches: instead of training on all samples at the same time, we divide the data into mini-batches and watch the effect on gradient descent.
🧸💬 It behaves differently when the inputs are not well correlated, because accuracy and loss will go up and down during training; as in the previous example, the dropout-layer technique helps find the patterns in the input.
🧸💬 The problem is how to set the mini-batch size and the number of new inputs (the class distribution rates should stay the same). This method can train faster when you have a large dataset, but it gives nothing except long training times on weakly related data.
🐑💬 The accuracy and loss values are just numbers, but we can stop at the specific value we want, save, and resume work.
🐑💬 In one course assignment attached at this link, he changed the batch_size not to make it train on more input samples, but because it is faster when the network does not forget the last inputs while using fewer LSTM layer units.
🐑💬 You find this kind of problem when mapping input vocabulary.
( Nothing found in local cartoon books; Cookie Run as an example )
VOLA : BURG / GRUB : IBCF / FCBI
COKKIE : IUUQOK / KOQUUI : PBFPTP / PTPFBP
RUN! : XAT' / 'TAX : ELS. / .SLE
VOLA COKKIE RUN!
===========================================================================
GRUB KOQUUI XAT'
===========================================================================
IBCF PTPFBP ELS.
👧💬 Comeback‼ BirdNest Hidden anywhere else.
thank you sir!
Why is y (1, m) instead of (n, m) or (ny, m)?
Assuming y takes a real value and not a vector value, as in a binary classification problem (0 or 1) or a regression problem. Hope it makes sense! :D