I think you've been way too harsh on this paper, probably because you're looking at it from the wrong angle. It is correct to say that for any individual optimizer, the results are noisy and weak, but the paper's strength is its analysis of the landscape of optimizers as a whole, and there it holds up much better, benefitting from the law of large numbers. Their conclusions aren't anything unexpected (“there is currently no method that clearly dominates the competition // tuning helps about as much as trying other optimizers // the field is in danger of being drowned by noise // optimizers exhibit a surprisingly similar performance distribution”), but that is still valid, useful science to confirm, and it could have been otherwise. A precise ranking between similar optimizers shouldn't be make-or-break here IMO.
Their code is open source, and one would hope it stands as a useful stepping stone both for further investigation (e.g. if Google throws TPUs at it) and, hopefully, to guide research on optimizers down more provably useful avenues.
I totally agree with you Veedrac, I think it's quite an intuitive paper with a lot of empirical results that anyone can easily understand and acknowledge.
I LOVE that you started with the conclusion
I get your point about using a single random seed during tuning, but I think it's not quite as severe, since it's not exactly one point per algorithm. It is still used for the 8 different problems with distinct initialisations etc.
I guess you could say the 8 problems can be seen as a large meta problem with a single loss function, but it still seems a bit harder to get super lucky on that than it would be on a single run, with a single initialisation, on one problem.
I would rather summarize the paper as:
1. Try with Adam first
2. Instead of spending too much time fine-tuning Adam, try other optimizers out of the box, it will yield the same improvement (if any)
23:28 I read FMNIST VAE as FEMINIST BAE. I should go to bed.
I think some kind of reasoning of why some work better than others is much needed. Maybe some kind of Taxonomy and clustering of optimizers.
I think the authors should have published their research proposal publicly before doing any expensive experiments. This applies to other empirical studies as well. Especially if they are expensive.
>"evidence-backed heuristics"
>seed OBS optimalisation
After these words, I was caught in the Humean guillotine and looking for normatives and objectivisms, and one mention of the Laplace constantative algorithm makes me (currently teaching differential equations) feel weird and howl at the same time.
Lucky sample could be because models are functions and functions have point properties. Like direct value, derivative, second derivative and so on. Maybe it's possible to evaluate a function based on the derivative in that sample point. A lucky function crosses the point but a sure function closes in on the answer point.
Yes
When we lack the understanding of the problems needed to make them convex, how well any of the algorithms works simply depends on luck.
I was watching this and thought "where's AdaBelief?" but then I realised that was released _after_ this paper. It would have been interesting to see how AdaBelief compares to other optimizers, since it's almost like an extension to RAdam, which was one of the optimizers they considered.
Dot products are statistical objects, summary measures. The optimisers only ever search the set of statistical solutions. Not the full vector function composition search space. Which anyway is full of brittle over-fitted solutions you don't generally want.
Nearly any optimiser will do then, including the very weak BP algorithm. You can also use sparse mutations to evolve solutions efficiently on multiple GPUs. Each GPU is given the full neural model and part of the training data. The same short list of sparse mutations is sent to each GPU. They return the cost for their part of the training data. Then the same accept or reject mutations packet is sent to each GPU. Continuous Gray Code Optimization mutations are a good choice. Only small amounts of data need to be shipped around.
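For anyone who wants to see the accept/reject loop described above in code, here is a minimal single-process sketch. Everything in it is an assumption made for illustration: the data shards stand in for GPUs, the model is a toy linear regressor, the helper names (`shard_cost`, `apply_mutation`, `evolve`, `make_shards`) are made up, and the power-of-two step sizes are only a simple stand-in, not an implementation of Continuous Gray Code Optimization.

```python
import random


def shard_cost(weights, shard):
    """Sum of squared errors of a linear model on one data shard."""
    return sum(
        (sum(w * x for w, x in zip(weights, xs)) - y) ** 2 for xs, y in shard
    )


def apply_mutation(weights, mutation):
    """Return a copy of weights with a sparse list of (index, delta) applied."""
    new = list(weights)
    for idx, delta in mutation:
        new[idx] += delta
    return new


def make_shards(n_shards=4, per_shard=10, seed=1):
    """Toy data (assumption): y = 2*x0 - 3*x1, split into equal shards."""
    rng = random.Random(seed)
    shards = []
    for _ in range(n_shards):
        shard = []
        for _ in range(per_shard):
            xs = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            shard.append((xs, 2.0 * xs[0] - 3.0 * xs[1]))
        shards.append(shard)
    return shards


def evolve(shards, n_weights, steps=2000, seed=0):
    """Accept a mutation only if the summed cost over all shards improves."""
    rng = random.Random(seed)
    weights = [0.0] * n_weights
    best = sum(shard_cost(weights, s) for s in shards)
    for _ in range(steps):
        # Sparse mutation: change a single weight by a signed power of two.
        idx = rng.randrange(n_weights)
        delta = rng.choice([-1.0, 1.0]) * 2.0 ** -rng.randint(0, 10)
        mutation = [(idx, delta)]
        candidate = apply_mutation(weights, mutation)
        # Each "worker" reports one number: the cost on its own shard.
        cost = sum(shard_cost(candidate, s) for s in shards)
        if cost < best:  # the same accept decision applies to every worker
            weights, best = candidate, cost
    return weights, best


if __name__ == "__main__":
    final_weights, final_cost = evolve(make_shards(), n_weights=2)
    print(final_weights, final_cost)
```

In a real multi-GPU setup only the mutation list, the per-shard costs, and the accept/reject flag would cross the wire, which is the "small amounts of data" point; this sketch only mimics that structure in one process.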
Also don't forget about inside-out neural networks with fixed dot products and adjustable (parametric) activation functions.
The fixed dot products can be done very efficiently with fast transforms like the FFT. There are some pragmatics like putting a fixed random projection before the first layer and using a final fast transform as pseudo-readout layer. The statistical summary measure behaviour of the dot products ensures good results.
I'm missing something, why do people do parameter searches on # of epochs? Shouldn't you just keep running more epochs until your accuracy/loss is where you need it or you are convinced that it's not going to get there?
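A rough sketch of what one such layer could look like, under loud assumptions: it uses the fast Walsh-Hadamard transform instead of the FFT (same idea of a fixed O(n log n) set of dot products, but real-valued), and a two-slope activation as the adjustable part. The names `fwht`, `two_slope` and `layer` are hypothetical, and the fixed random projection mentioned above is left out.

```python
def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out


def two_slope(x, neg_slope, pos_slope):
    """Parametric activation: separate trainable slopes for x < 0 and x >= 0."""
    return pos_slope * x if x >= 0 else neg_slope * x


def layer(values, slopes):
    """Trainable per-unit activations followed by the fixed, scaled transform."""
    activated = [two_slope(v, neg, pos) for v, (neg, pos) in zip(values, slopes)]
    scale = len(values) ** -0.5  # keeps the transform norm-preserving
    return [scale * v for v in fwht(activated)]


if __name__ == "__main__":
    identity_slopes = [(1.0, 1.0)] * 4
    print(layer([1.0, 0.0, 0.0, 0.0], identity_slopes))
```

The only trainable numbers here are the slopes; the transform itself has no parameters, which is the "inside-out" part.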
yea you never know when to stop
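To make the "you never know when to stop" point concrete, here is a small sketch of patience-based early stopping on a list of validation losses. The function name `early_stop_epoch`, the `patience` and `min_delta` parameters, and the loss values are all made up for illustration; this is not how the paper chose its epoch budgets.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which training stops, epoch of the best loss seen)."""
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: give up here.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


if __name__ == "__main__":
    # Made-up curve: a plateau followed by a late improvement.
    losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
    print(early_stop_epoch(losses, patience=3))
    print(early_stop_epoch(losses, patience=5))
```

With the shorter patience the run stops on the plateau and never sees the late drop, so the stopping rule (or the epoch budget) ends up being one more thing that has to be chosen.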
Would you choose a subset of a 'large' dataset with a large model (e.g. partial ImageNet + ResNet-101) instead of using a small dataset?
"it's very refreshing for a paper to admit its shortcomings. Now here's why it sucks." 😆
Always like it before watching :)
First time I am slightly disappointed with the way a paper is presented. The video seems to be saying 1) it would be much more useful if they used much larger datasets, 2) it would be much more useful if they used more starting points (via random seeds), 3) it is not practically possible to do this. 4) There is a contradiction between 1/2 and 3, but anyway. Sounds a bit too much like the ill-tempered reviewers we all sometimes get (and probably are).
Hi,
Thanks for the great videos!
I have a question, what is the name of the software you are using?
someone mentioned that there is a video about his software on the channel
OneNote
What do you think about current popular optimizers like AdamW and AdaBelief?
sure, if they help
AdaBelief has been awesome so far in my experience, so I'd say give it a try
I didn’t get how tuning made the result worse, in other words, why are there reds on the diagonal? What kind of tuning gives a worse result??lol
I'm kinda repeating Yannic here, I think. The authors used random search, meaning they randomly selected hyperparameters within a predefined search space. With some bad luck due to the limited number of runs, they apparently only found 'worse' parameter combinations compared to the default ones. This could be interpreted as a sign that the optimization method is not stable or at least very sensitive to either the problem domain or hyperparameters. Another possibility is that the search space was poorly chosen, which Yannic explained at the start of the video.
@jonash.4058 Yeah, I heard that. Thanks for the explanation though. Intuitively, it's not really "tuning" if the tuned result is worse than the default. I think I have beef with the tuning method, I would prefer a Bayesian approach starting off with the default or something.
Can you please upload videos with Python code implementations of the topics you cover? It would be great.
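A tiny sketch of the "start off with the default" idea, with the caveat that this is plain random search with the default used as the first trial, not a Bayesian method, and not the paper's tuning protocol. The toy loss, the search range, and the names `tune`, `toy_loss` and `sample_lr` are assumptions for illustration.

```python
import math
import random


def toy_loss(lr):
    """Made-up objective: best learning rate is 3e-3 on a log scale."""
    return (math.log10(lr) - math.log10(3e-3)) ** 2


def sample_lr(rng):
    """Log-uniform sample from an assumed search space [1e-6, 1]."""
    return 10 ** rng.uniform(-6, 0)


def tune(evaluate, sample, budget, default=None, seed=0):
    """Random search; if a default is given it takes one of the trials."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)
    if default is None:
        candidates = [sample(rng) for _ in range(budget)]
    else:
        candidates = [default] + [sample(rng) for _ in range(budget - 1)]
    return min(candidates, key=evaluate)


if __name__ == "__main__":
    default_lr = 1e-3
    blind = tune(toy_loss, sample_lr, budget=5)
    seeded = tune(toy_loss, sample_lr, budget=5, default=default_lr)
    print("default:", toy_loss(default_lr))
    print("random search only:", toy_loss(blind))
    print("random search incl. default:", toy_loss(seeded))
```

Including the default as one trial means the "tuned" result can never be worse than the default on the tuning run itself, at the cost of one sample. It can still come out worse on a different seed, which is the noise issue discussed in the video.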
Thanks for the beautiful work that you do.
Not all papers provide code. If they do, it's just a matter of looking up the paper. It's not uncommon, which is very nice.
@NicheAsQuiche I see. arXiv also partnered with paperswithcode, so it's becoming ever more prevalent.
I hear you coughing covid-19
That's only the luminance allergies. Nothing happened.