If you are slightly lost, it's possible that you have just stumbled on the video directly without watching the previous 3 explaining exponential weighted averages. Once you see that and come back here, it will make perfect sense.
Explaining the function of each parameter (especially the role of β in both (β) and (1 - β)) is extremely useful for understanding the role of momentum in gradient descent. Really appreciate that!
what is beta(B) here? can you please let me know?
agreed
This is really helpful. I can now fully understand sgd with momentum by understanding the concepts of exponentially weighted average. Thanks a lot.
where is the exponentially weighted average? that is a simple weighted average.
@@usf5914 the weights for each of the previous values decay exponentially, which is why it’s called exponential averaging
Thanks so much! Love that feeling when it all just clicks.
I like this analogy of a ball and a basket. Thanks for this generous video!
Andrew’s great. But was he right when he said at 8:55 that under BOTH versions - with or without the (1-ß) term - 0.9 is the commonly used value for ß? After all, in the version using the (1-ß) term, the left multiplier (0.9) is 9 times LARGER than the right multiplier (0.1), whereas in the alternative version without the (1-ß) term, the left multiplier (0.9) is actually SMALLER than the right multiplier (1), yielding a vastly different weighted average. Thus it can’t be correct that the commonly used value for ß is the same 0.9 under both versions. Does anyone agree? In fact, it would seem that under the latter version ß should be 9.0 rather than 0.9, so that the weighting is equivalent to Andrew’s preferred former version.
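For what it's worth, a quick numerical check (with made-up gradient values) suggests the two versions differ only by a constant factor of 1/(1-β), which frameworks absorb into the learning rate α - which would explain why 0.9 is quoted for both:

```python
# With the same beta = 0.9, the version without the (1 - beta) factor
# produces a velocity that is exactly 1/(1 - beta) = 10x the other one,
# so rescaling alpha makes the two updates identical.
beta = 0.9
grads = [1.0, 0.8, 1.2, 0.9, 1.1]  # made-up gradient values

v_with = 0.0     # v = beta*v + (1 - beta)*dw
v_without = 0.0  # v = beta*v + dw
for dw in grads:
    v_with = beta * v_with + (1 - beta) * dw
    v_without = beta * v_without + dw

print(v_without / v_with)  # ~10, i.e. 1/(1 - beta)
```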
The point about (1 - β) being excluded in some frameworks was very helpful. This sort of thing is why reading the methodology section of papers is totally useless - unscientific and unreproducible (the only way to reproduce is to copy their exact code for a specific framework). Often the exact hyperparameter values are documented, but the details of how they are actually used in the framework are up in the air.
I was just looking it up because I noticed Keras calculates it differently
I directly jumped to this video for reference purposes. Can someone explain what Vdw and Vdb are? Is V our loss function?
Vdw and Vdb are the values computed in the previous step... we are computing an exponentially weighted average. This will probably help: ruclips.net/video/lAq96T8FkTw/видео.html
What are Vdw and Vdb? - @7:15
vdw and vdb are the exponentially weighted averages of the gradients with respect to the optimization parameters (weights and biases). See more in this blog: medium.com/@omkar.nallagoni/gradient-descent-with-momentum-73a8ae2f0954
Yeah, that is confusing and not well explained, even in the above reply. But it seems that Vdw = w_t - w_{t-1} and same for Vdb.
You've probably skipped the previous videos. He clearly explains that the 'V' notation denotes the weighted value: if dw is your current value, Vdw is its weighted average. And dw and db are the gradients with respect to the weights and biases of the neural network.
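To put the notation in code, here's a minimal sketch of one update step (toy scalar values; this is my illustration of the formulas from the video, not code from the course):

```python
def momentum_step(w, b, dw, db, vdw, vdb, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update, following the video's
    notation: Vdw/Vdb are exponentially weighted averages of the gradients."""
    vdw = beta * vdw + (1 - beta) * dw
    vdb = beta * vdb + (1 - beta) * db
    w = w - alpha * vdw
    b = b - alpha * vdb
    return w, b, vdw, vdb

# Toy example with scalar parameters and made-up gradients.
w, b, vdw, vdb = 1.0, 0.5, 0.0, 0.0
w, b, vdw, vdb = momentum_step(w, b, dw=2.0, db=1.0, vdw=vdw, vdb=vdb)
print(vdw, w)  # vdw ~ 0.2, w ~ 0.998
```

The same function works unchanged if w, b, dw, db are NumPy arrays, since every operation is elementwise.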
Where is the bias correction in the steps for the initial steps? Or does it not matter as we are anyway gonna make a large number of steps?
Edit: My bad, he explains it at 6:50
I have learnt so much from you. I really want to thank you for that; I really appreciate your work. I would like to suggest that you think about beginner learners like me, explain the concepts with some analogies, and provide some aid to lead the trail of thinking in an appropriate direction.
In Gilbert Strang's lecture he only mentioned beta, not 1-beta. Will the learning rate effectively be scaled by 1/(1-beta) if the 1-beta term is not used?
Tell me you didn't watch without telling me you didn't watch :) ruclips.net/video/k8fTYJPd3_I/видео.html short answer: yes, it affects scaling of alpha
Which lecture are you referring to? Can you give me the link? I'd like to watch it.
Thank you !
thank you this was helpful
But where did we use the bias correction formula?
Perfect, thank you; showing vectors in vertical direction cancelling out and horizontal vectors adding up was the key point that cleared everything up for me.
Very good explanation of gradient descent with momentum; the example of a ball with speed and acceleration is very intuitive. Thank you.
what are dw and db ??
dw is the calculated gradient for the weights and db is the calculated gradient for the biases.
Could someone explain this to me? For simplicity, let's consider only one hidden layer. For a given mini-batch (let's say t = 1), I can calculate the loss and compute the gradients 'dw' and 'db' ('dw' and 'db' are vectors whose size depends on the number of nodes in the layers). When I want to calculate Vdw (which I expect has the same dimension as dw), do I average over the elements of dw? i.e. Vdw_0 = 0, Vdw_1 = beta*Vdw_0 + (1 - beta)*dw_1... and so on?
What do you mean by average?
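If it helps the question above: there is no averaging across the elements of dw. The recursion Vdw = beta*Vdw + (1-beta)*dw is applied elementwise, once per mini-batch, so Vdw keeps the same shape as dw. A small sketch with made-up gradient vectors:

```python
import numpy as np

beta = 0.9
vdw = np.zeros(3)  # same shape as dw; initialised to zeros

# Stand-in gradients from two successive mini-batches (made-up values).
minibatch_grads = [np.array([0.5, -1.0, 2.0]),   # dw at t = 1
                   np.array([0.4, -0.8, 1.5])]   # dw at t = 2

for dw in minibatch_grads:
    # The recursion is applied elementwise; no averaging across elements.
    vdw = beta * vdw + (1 - beta) * dw

print(vdw.shape)  # (3,) -- same shape as dw
```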
Is there a need for input normalisation when using gradient descent with momentum? My intuition says that normalisation will add very little value (if any) in this case. Is this correct?
I don't understand at all how this can generalize to a neural network, which can have thousands or millions of weights.
For large neural networks you won't have a single weight; instead you'll have a huge vector of all the weights of all the hidden units (of a particular layer) stacked on top of each other. And from there, vectorized functions will take care of the execution.
what causes these oscillations?
Overregulation?
It works like a servo. Too low an error signal and the thing will not follow; too much error signal and it will jump off the table!
If you want faster learning, you might want to increase alpha (the learning rate). But if the error signal (delta) is amplified too much in some places, the activation will be pumped up and overshoot, so at t+1 it will have to change direction - and what if that seems to happen all the time? Networks are big... maybe only some neurons start to oscillate... maybe the important ones too?
Using "momentum" is basically a reverb circuit, and the 1-b formula is a basic crossover mixer or 'blend' knob with which one can smoothly blend between two channels.
If the old deltas are still in the node structure from the previous calculation, one could simply add the old one and divide by two! That would be a 50/50 mix, or a beta of 0.5f. If you use that and overwrite the variable in the neuron, the variable contains 50% new and 50% old value. Next time, the old old value will have become 25%, and over time its contribution will fade out, just like a reverb circuit's tone.
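The fade under that 50/50 blend can be checked directly - feed in an old delta once, then only zeros, and watch its contribution halve each step:

```python
# Track how a single old delta fades under the 50/50 blend (beta = 0.5):
# after each overwrite, its contribution halves: 50%, 25%, 12.5%, ...
beta = 0.5
v = 1.0                        # old delta stored in the neuron
new_deltas = [0.0, 0.0, 0.0]   # no new signal, so we see the pure fade
history = []
for d in new_deltas:
    v = beta * v + (1 - beta) * d
    history.append(v)
print(history)  # [0.5, 0.25, 0.125]
```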
I started overhauling my ANN routines to accelerate training, because I found out my routines are like me: "learn something new and, oh, I've forgotten all the older stuff".
Today I made a training plan that never forgets and is faster 😎
1. From all images, always choose a random one! That way the network has no chance of 'subconsciously learning the ordering of the data', which is not wanted.
2. If an image wins (the network predicts the correct class), fail[i] is reset to 1; no backprop necessary!
3. If an image fails (the network does not predict the correct class), increment fail[i] and use it as a multiplier for your learning rate!
From now on, as soon as items in your list fail, the learning rate will become a monster! Individual items in the list will only get the training they really need when they need it!
I'm trying to make every outside variable become a prosecutor and a lawyer for the network 😎
Rate = Learning rate * consecutive failures over time * sqrt(any vector that suggests learning harder)
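A rough sketch of the three-step plan described above. To be clear, this is the commenter's own recipe, not a standard algorithm, and every name here (pick_and_scale, predict_correct, fail) is made up for illustration:

```python
import random

def pick_and_scale(images, fail, base_rate, predict_correct):
    """Pick a random image; if the network already classifies it correctly,
    reset its failure count and skip backprop; otherwise amplify the
    learning rate by the consecutive-failure count."""
    i = random.randrange(len(images))  # 1. always pick a random image
    if predict_correct(images[i]):
        fail[i] = 1                    # 2. win: reset, no backprop needed
        return i, None                 # None = skip the update this round
    fail[i] += 1                       # 3. fail: failures amplify the rate
    return i, base_rate * fail[i]

# Toy run with a single always-misclassified "image".
fail = [1]
_, rate = pick_and_scale(["img"], fail, 0.01, lambda img: False)
print(rate)  # 0.02 -- the rate doubles after the first consecutive failure
```

Whether this converges well in practice is an open question; repeatedly boosting the rate on hard examples can just as easily cause the oscillations discussed earlier in the thread.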
Can anyone explain why Ng's calculations include beta and 1-beta, but other works like arxiv.org/pdf/1609.04747.pdf show momentum using just beta, not 1-beta (actually they use gamma, not beta, for the proportion of the previous update to carry over)?
Go to 7:35
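For anyone comparing the two conventions numerically: with the same beta, they produce identical parameter trajectories once the learning rate is rescaled by (1-beta). A quick check with made-up gradients:

```python
beta, alpha = 0.9, 0.1
grads = [1.0, -0.5, 0.7, 0.2]  # made-up gradient values

# Ng's form: v = beta*v + (1-beta)*dw,  w -= alpha*v
w1, v1 = 0.0, 0.0
for dw in grads:
    v1 = beta * v1 + (1 - beta) * dw
    w1 -= alpha * v1

# The paper's form (their gamma plays the role of beta): v = beta*v + dw,
# with the learning rate rescaled to alpha*(1 - beta)
w2, v2 = 0.0, 0.0
for dw in grads:
    v2 = beta * v2 + dw
    w2 -= alpha * (1 - beta) * v2

print(abs(w1 - w2) < 1e-9)  # True: same trajectory, different bookkeeping
```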
Just edit your video to remove the high frequency at the end of each word; it is really annoying for sensitive ears like ours.
anyone else hear the high pitched sound in the background?
Thank god. I thought tinnitus had finally caught up with me...
the noise is killing my ears.
I'm so happy that, at almost 30 years old, I'm still able to hear such a high-pitched, annoying noise.
🥟🥟🥟🥟🥟 momo!!!!!!!!!!
I don't know why, but I'm getting faster performance with plain SGD than with momentum.
Great once again; I wish I could go into more depth.
Awesome, bro.
Focus on your studies.
chinese 🥷🥷🍙🍙
Great explanation, but it takes time to understand.
I am watching it again
I have gradient descent with momentum optimizing the parameters of a differential equation. There are 5 parameters. It is impossible to know the shape of the ovals in 5 dimensions. It is also real, imperfect data. Determining alpha and beta is tricky and often takes trial and error. I wish these videos would use the same symbols; there is no consistency. Just about any minimization routine is faster than gradient descent for a small number of parameters.