Descending through a Crowded Valley -- Benchmarking Deep Learning Optimizers (Paper Explained)

  • Published: 27 Oct 2024

Comments • 36

  • @veedrac
    @veedrac 4 years ago +49

    I think you've been way too harsh on this paper, probably because you're looking at it from the wrong angle. It is correct to say that for any individual optimizer the results are noisy and weak, but the paper's strength is its analysis of the landscape of optimizers as a whole, and there it holds up much better, benefitting from the law of large numbers. Their conclusions aren't anything unexpected ("there is currently no method that clearly dominates the competition", "tuning helps about as much as trying other optimizers", "the field is in danger of being drowned by noise", "optimizers exhibit a surprisingly similar performance distribution"), but that is still valid, useful science to confirm, and it could have been otherwise. A precise ranking between similar optimizers shouldn't be make-or-break here IMO.
    Their code is open-sourced, and one would hope it stands as a useful stepping stone both for further investigation (e.g. if Google throws TPUs at it) and, hopefully, for guiding research on optimizers down more provably useful avenues.

    • @tusharabhishek5979
      @tusharabhishek5979 4 years ago +2

      I totally agree with you, Veedrac. I think it's quite an intuitive paper, with a lot of empirical results that anyone can easily understand and acknowledge.

  • @yabdelm
    @yabdelm 4 years ago +13

    I LOVE that you started with the conclusion

  • @kyriakostp
    @kyriakostp 4 years ago +15

    I get your point about using a single random seed during tuning, but I think it's not quite as severe, since it's not exactly one point per algorithm. The seed is still used across the 8 different problems with distinct initialisations etc.
    I guess you could say the 8 problems can be seen as one large meta-problem with a single loss function, but it still seems a bit harder to get super lucky on that than it would be on a single run, with a single initialisation, on one problem.

  • @eamram
    @eamram 4 years ago +4

    I would rather summarize the paper as:
    1. Try with Adam first
    2. Instead of spending too much time fine-tuning Adam, try other optimizers out of the box; that will yield about as much improvement (if any) as tuning would (a rough sketch of that workflow is below)
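
    A minimal sketch of that workflow, assuming PyTorch, a toy regression task and an arbitrary shortlist of optimizers (the model, data and step budget are illustrative, not taken from the paper):

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      X, y = torch.randn(256, 10), torch.randn(256, 1)         # toy stand-in for a benchmark problem

      optimizers = {
          "Adam":    lambda p: torch.optim.Adam(p),             # library defaults
          "SGD":     lambda p: torch.optim.SGD(p, lr=0.01),
          "RMSprop": lambda p: torch.optim.RMSprop(p),
          "Adagrad": lambda p: torch.optim.Adagrad(p),
      }

      for name, make_opt in optimizers.items():
          model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
          opt = make_opt(model.parameters())
          for _ in range(200):                                  # identical small budget for every optimizer
              opt.zero_grad()
              loss = nn.functional.mse_loss(model(X), y)
              loss.backward()
              opt.step()
          print(f"{name:8s} final loss: {loss.item():.4f}")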

  • @herp_derpingson
    @herp_derpingson 4 years ago +9

    23:28 I read FMNIST VAE as FEMINIST BAE. I should go to bed.
    I think some kind of reasoning about why some optimizers work better than others is much needed. Maybe some kind of taxonomy and clustering of optimizers.
    I think the authors should have published their research proposal publicly before doing any expensive experiments. This applies to other empirical studies as well, especially the expensive ones.

  • @G12GilbertProduction
    @G12GilbertProduction 4 years ago +1

    >"evidence-backed heuristics"
    >seed OBS optimisation
    After these words I was caught in the Humean guillotine, looking for normatives and objectivisms, and the one mention of the Laplace constantative algorithm made me (currently teaching differential equations) feel weird and howl at the same time.

  • @perlindholm4129
    @perlindholm4129 4 years ago +4

    A lucky sample could be because models are functions and functions have point properties, like the direct value, derivative, second derivative and so on. Maybe it's possible to evaluate a function based on the derivative at that sample point. A lucky function crosses the point, but a sure function closes in on the answer point.

  • @cest_heng
    @cest_heng 4 years ago +4

    When we lack the understanding of the problems needed to make them convex, how well all the algorithms work simply depends on luck.

  • @PaganPegasus
    @PaganPegasus 2 years ago

    I was watching this and thought "where's AdaBelief?", but then I realised that it was released _after_ this paper. It would have been interesting to see how AdaBelief compares to the other optimizers, since it's almost like an extension of RAdam, which was one of the optimizers they considered.

  • @nguyenngocly1484
    @nguyenngocly1484 4 years ago +1

    Dot products are statistical objects, summary measures. The optimisers only ever search the set of statistical solutions, not the full vector function composition search space, which anyway is full of brittle, over-fitted solutions you don't generally want.
    Nearly any optimiser will do then, including the very weak BP algorithm. You can also use sparse mutations to evolve solutions efficiently on multiple GPUs. Each GPU is given the full neural model and part of the training data. The same short list of sparse mutations is sent to each GPU, and they return the cost for their part of the training data. Then the same accept-or-reject-mutations packet is sent to each GPU. Continuous Gray Code Optimization mutations are a good choice. Only small amounts of data need to be shipped around.
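
    A single-process simulation of that communication pattern, with a toy linear model and plain Gaussian perturbations standing in for Continuous Gray Code Optimization mutations (worker count, step size and budget are all illustrative):

      import numpy as np

      rng = np.random.default_rng(0)

      # Toy least-squares problem; every simulated "GPU" holds the full weight
      # vector but only a shard of the training data.
      dim, n = 20, 512
      w_true = rng.standard_normal(dim)
      X = rng.standard_normal((n, dim))
      y = X @ w_true
      shards = np.array_split(np.arange(n), 4)                 # 4 simulated workers

      def shard_cost(w, idx):
          r = X[idx] @ w - y[idx]                              # cost one worker reports for its shard
          return float(r @ r)

      w = np.zeros(dim)
      best = sum(shard_cost(w, idx) for idx in shards)

      for _ in range(5000):
          # "Broadcast" the same short list of sparse mutations to every worker.
          coords = rng.choice(dim, size=2, replace=False)
          delta = 0.1 * rng.standard_normal(2)
          w[coords] += delta
          cost = sum(shard_cost(w, idx) for idx in shards)     # workers return their shard costs
          if cost < best:
              best = cost                                      # "broadcast" accept
          else:
              w[coords] -= delta                               # "broadcast" reject and undo

      print("final cost:", best)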

    • @nguyenngocly1484
      @nguyenngocly1484 4 years ago +1

      Also don't forget about inside-out neural networks with fixed dot products and adjustable (parametric) activation functions.
      The fixed dot products can be done very efficiently with fast transforms like the FFT. There are some pragmatics, like putting a fixed random projection before the first layer and using a final fast transform as a pseudo-readout layer. The statistical summary-measure behaviour of the dot products ensures good results.
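
      One way to read that, as a rough NumPy sketch: a fixed random sign flip plus an FFT supplies the fixed dot products, and per-element two-sided slopes are the adjustable activation parameters (all names and constants are illustrative, not from any particular paper):

        import numpy as np

        rng = np.random.default_rng(0)

        class FixedTransformLayer:
            """Fixed mixing (sign flip + FFT) followed by an adjustable activation."""
            def __init__(self, dim):
                self.sign = rng.choice([-1.0, 1.0], size=dim)      # fixed random projection, never trained
                self.pos = np.ones(dim)                            # trainable positive-side slopes
                self.neg = 0.1 * np.ones(dim)                      # trainable negative-side slopes

            def forward(self, x):
                z = np.fft.fft(self.sign * x) / np.sqrt(len(x))    # fixed dot products via a fast transform
                m = z.real + z.imag                                # keep a real-valued signal
                return np.where(m >= 0, self.pos * m, self.neg * m)  # parametric activation

        layer = FixedTransformLayer(8)
        print(layer.forward(rng.standard_normal(8)))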

  • @EternalKernel
    @EternalKernel 4 years ago +1

    I'm missing something: why do people do parameter searches over the number of epochs? Shouldn't you just keep running more epochs until your accuracy/loss is where you need it, or you're convinced that it's not going to get there?
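
    In code, that "just keep going until it stops improving" idea is ordinary early stopping on a validation metric; a minimal sketch with a made-up validation curve (the patience value and epoch cap are arbitrary):

      import numpy as np

      rng = np.random.default_rng(0)

      def validate(epoch):
          # Stand-in for "train one epoch, then evaluate": a noisy, decaying loss.
          return np.exp(-epoch / 30) + 0.02 * rng.standard_normal() + 0.1

      best, best_epoch, patience = np.inf, 0, 10
      for epoch in range(1000):                    # generous cap, not a tuned hyperparameter
          val_loss = validate(epoch)
          if val_loss < best:
              best, best_epoch = val_loss, epoch   # a checkpoint would be saved here
          elif epoch - best_epoch >= patience:
              print(f"stopping at epoch {epoch}; best {best:.3f} at epoch {best_epoch}")
              break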

  • @drhilm
    @drhilm 4 years ago +1

    Would you choose a subset of a 'large' dataset with a large model (e.g. partial ImageNet + ResNet-101) instead of using a small dataset?

  • @MrJaggy123
    @MrJaggy123 4 years ago +8

    "it's very refreshing for a paper to admit its shortcomings. Now here's why it sucks." 😆

  • @yuangwang7772
    @yuangwang7772 4 years ago +8

    Always like it before watching :)

  • @fotisj321
    @fotisj321 4 years ago

    First time I am slightly disappointed with the way a paper is presented. The video seems to be saying: 1) it would be much more useful if they used much larger datasets, 2) it would be much more useful if they used more starting points (via random seeds), 3) it is not practically possible to do this, 4) there is a contradiction between 1/2 and 3, but anyway. Sounds a bit too much like the ill-tempered reviewers we all sometimes get (and probably are).

  • @sohaibattaiki9579
    @sohaibattaiki9579 4 years ago +1

    Hi,
    Thanks for the great videos!
    I have a question: what is the name of the software you are using?

  • @phil.s
    @phil.s 4 years ago

    What do you think about current popular optimizers like AdamW and AdaBelief?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      sure, if they help

    • @konataizumi5829
      @konataizumi5829 3 years ago +1

      AdaBelief has been awesome so far in my experience, so I'd say give it a try

  • @atomscott6495
    @atomscott6495 4 years ago +1

    I didn't get how tuning made the results worse; in other words, why are there reds on the diagonal? What kind of tuning gives a worse result?? lol

    • @jonash.4058
      @jonash.4058 4 years ago +4

      I'm kinda repeating Yannic here, I think. The authors used random grid search, meaning they randomly selected hyperparameters within a predefined search space. With some bad luck, due to the limited number of runs, they apparently only found 'worse' parameter combinations compared to the default ones. This could be interpreted as a sign that the optimization method is not stable, or at least very sensitive to the problem domain or the hyperparameters. Another possibility is that the search space was poorly chosen, which Yannic explained at the start of the video.
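
      A toy illustration of how that can happen, assuming the final test loss depends only on the learning rate and the optimum happens to sit at the library default (the search space and budget below are deliberately unlucky):

        import numpy as np

        rng = np.random.default_rng(0)

        def benchmark(lr):
            # Stand-in for "train and report test loss"; best value near lr = 1e-3.
            return (np.log10(lr) + 3.0) ** 2 + 0.05 * rng.standard_normal()

        default_score = benchmark(1e-3)

        # Random search with a tiny budget over an off-centre log-uniform range:
        # every sample can easily land on a worse setting than the untuned default.
        samples = 10 ** rng.uniform(-1.5, 0.0, size=4)     # lr roughly in [0.03, 1.0]
        tuned_score = min(benchmark(lr) for lr in samples)

        print(f"default: {default_score:.3f}   'tuned': {tuned_score:.3f}")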

    • @atomscott6495
      @atomscott6495 4 years ago +2

      @@jonash.4058 Yeah, I heard that. Thanks for the explanation, though. Intuitively, it's not really "tuning" if the tuned result is worse than the default. I think I have beef with the tuning method; I would prefer a Bayesian approach starting off with the default, or something.

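      Something along those lines can be sketched with Optuna's TPE sampler (one possible tool, not something the paper or this thread specifies): enqueue the default configuration as the first trial, so tuning can only match or improve on it. The objective below is a toy stand-in for a real training run.

        import math
        import optuna

        def objective(trial):
            # Toy stand-in for "train, then return validation loss";
            # the minimum sits at the default lr = 1e-3.
            lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
            return (math.log10(lr) + 3.0) ** 2

        study = optuna.create_study(direction="minimize")
        study.enqueue_trial({"lr": 1e-3})        # evaluate the default configuration first
        study.optimize(objective, n_trials=20)   # TPE (Bayesian-style) sampling afterwards
        print(study.best_params, study.best_value)
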
  • @alankarshukla4385
    @alankarshukla4385 4 years ago +4

    Can you please upload videos based on the topics you teach, like some Python code implementation? It would be great.
    Thanks for the beautiful work that you do.

    • @NicheAsQuiche
      @NicheAsQuiche 4 years ago +6

      Not all papers provide code. If they do, it's just a matter of looking up the paper. It's not uncommon, which is very nice.

    • @swayson5208
      @swayson5208 4 years ago +4

      @@NicheAsQuiche I see arXiv also partnered with Papers with Code, so it's becoming ever more prevalent.

  • @dmitrysamoylenko6775
    @dmitrysamoylenko6775 4 years ago +3

    I hear you coughing COVID-19.