Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained)

  • Published: 29 Sep 2024

Comments • 236

  • @YannicKilcher
    @YannicKilcher  3 years ago +21

    OUTLINE:
    0:00 - Intro & Overview
    1:40 - The Grokking Phenomenon
    3:50 - Related: Double Descent
    7:50 - Binary Operations Datasets
    11:45 - What quantities influence grokking?
    15:40 - Learned Emerging Structure
    17:35 - The role of smoothness
    21:30 - Simple explanations win
    24:30 - Why does weight decay encourage simplicity?
    26:40 - Appendix
    28:55 - Conclusion & Comments
    Paper: mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf

  • @ruemeese
    @ruemeese 3 years ago +140

    The verb 'to grok' has been in niche use in computer science since the 1960s, meaning to fully understand something (as opposed to, for example, just knowing how to apply the rules). It was coined in Robert A. Heinlein's classic sci-fi novel "Stranger in a Strange Land".

    • @michaelnurse9089
      @michaelnurse9089 3 years ago +3

      Thanks for the history. Today it is a somewhat well known word in my anecdotal experience.

    • @ruemeese
      @ruemeese 3 years ago +2

      @@michaelnurse9089 Always going to be new to somebody.

    • @cactusmilenario
      @cactusmilenario 2 years ago

      @@ruemeese thanks

    • @和平和平-c4i
      @和平和平-c4i 2 years ago

      Thanks, that's why it was not in the Webster dictionary ;)

    • @RobertWF42
      @RobertWF42 2 years ago +1

      In my opinion "grokking" shouldn't be used here. There's no grokking happening; instead it's running an ML algorithm for a long time until it fits the test data. But it's still a really cool idea. :-)

  • @tsunamidestructor
    @tsunamidestructor 3 years ago +1

    As soon as I read the title, my mind immediately went to the "reconciling modern machine learning and the bias variance trade-off" paper.

  • @CharlesVanNoland
    @CharlesVanNoland 3 years ago

    Reminds me of a maturing brain, when a baby starts being aware of objects, tracking them, understanding object permanence, etc.

  • @sq9340
    @sq9340 3 years ago

    Maybe... just "maybe", it's not really overfitting but learning quite slowly. If you look at the spikes of training accuracy at around 100%, those could be weak learning signals that keep fixing the model. And the synthetic data "could" be different, meaning the val accuracy will jump up only once you have a nearly perfect model.

  • @gocandra9847
    @gocandra9847 3 years ago +42

    last time i was this early schmidhuber had invented transformers

    • @NLPprompter
      @NLPprompter 3 months ago

      what???

    • @GeorgeSut
      @GeorgeSut 3 months ago

      Schmidhuber also invented GANs, supposedly 😆

  • @vslaykovsky
    @vslaykovsky 3 years ago +46

    This looks like an Aha moment discovered by the network at some point.

  • @martinsmolik2449
    @martinsmolik2449 3 years ago +76

    WAIT, WHAT?
    In my Master's thesis, I did something similar, and I got the same result. A quick overfit and a subsequent snap into place. I never thought that this would be interesting, as I was not an ML engineer, but a mathematician looking for new ways to model structures.
    That was in 2019. Could have been famous! Ah well...

    • @andrewcutler4599
      @andrewcutler4599 3 years ago +1

      What do you think of their line "we speculate that such visualizations could one day be a useful way to gain intuitions about novel mathematical objects"?

    • @thomasdeniffel2122
      @thomasdeniffel2122 2 years ago +22

      Schmidhuber, is it you?

    • @visuality2541
      @visuality2541 2 years ago +1

      may i ask u the thesis title?

    • @arzigogolato1
      @arzigogolato1 2 years ago +8

      As with many discoveries you need both the luck to find something totally new and the knowledge to recognize what's happening... I'm sure there are a lot of missed opportunities out there... Still, the advisor or some of the other professors could have noticed that this was indeed something "off and interesting"... It might actually be interesting to see how and when this happens :)

    • @autolykos9822
      @autolykos9822 2 years ago +6

      Yeah, a lot of science comes from having the leisure to investigate strange accidents, instead of being forced to just shrug and start over to get done with what you're actually trying to do. Overworking scientists results in poor science - and yet there are many academics proud of working 60-80h weeks.

  • @Bellenchia
    @Bellenchia 3 years ago +45

    So our low-dimensional intuitions don't work for extremely high-dimensional spaces. For example, look at the sphere-bounding problem. If you pack four circles in a 2x2 grid, the circle you can draw in the gap between them is small--and that seems like common sense. But in 10D space, the sphere you place in the middle of the packed hyperspheres actually reaches outside the box that encloses them (a quick numeric check follows this thread). I imagine something analogous is happening here: if you were at a mountaintop and you could only feel around with your feet, you of course wouldn't get to mountaintops further away by just stepping in the direction of higher incline, but maybe in higher dimensions these mountaintops are actually closer together than we think. Or in other words, some local minimum of the loss function which we associate with overfitting is actually not so far away from a good generalization, or global minimum.

    • @MrjbushM
      @MrjbushM 3 years ago +2

      Interesting thought…

    • @andrewjong2498
      @andrewjong2498 2 years ago +3

      Good hypothesis, I hope someone can prove it one day

    • @franklydoodle350
      @franklydoodle350 2 years ago +1

      Wow seems you're speaking in the 10th dimension. Your intelligence is way beyond mine.

    • @franklydoodle350
      @franklydoodle350 2 years ago

      Now I wonder if we could push the model to make more rash decisions as the loss function nears 1. In theory I don't see why not.
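
    A quick numeric check of the sphere claim above (a minimal sketch, not from the paper): with unit spheres centred at (±1, ..., ±1) inside the box [-2, 2]^n, the central sphere tangent to all of them has radius sqrt(n) - 1, which exceeds the box half-width of 2 once n reaches 10.

        import math

        # Unit spheres at (+-1, ..., +-1) fill the box [-2, 2]^n.
        # The sphere centred at the origin and tangent to all of them has radius sqrt(n) - 1.
        for n in (2, 4, 9, 10):
            inner_radius = math.sqrt(n) - 1
            print(f"n={n:2d}  inner radius={inner_radius:.3f}  pokes outside the box: {inner_radius > 2}")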

  • @mgostIH
    @mgostIH 3 years ago +69

    Thanks for accepting my suggestion to review this! I think this paper can fall anywhere between "Wow, what a strange behaviour" and "Is more compute all we need?"
    If the findings of this paper are reproduced in a lot of other practical scenarios and we find ways to speed up grokking, it might change the entire way we think about neural networks, where we just need a model that's big enough to express some solution and then keep training it to find simpler ones, leading to an algorithmic Occam's Razor!

    • @TimScarfe
      @TimScarfe 3 years ago +5

      I was nagging Yannic too!! 😂

    • @aryasenadewanusa8920
      @aryasenadewanusa8920 3 years ago +2

      But again, though it's wonderful, in some cases the cost of training such large networks for such a long time to achieve that AHA moment by the network can be too crazy to make sense...

    • @SteveStavropoulos
      @SteveStavropoulos 3 years ago +6

      This seems like random search where the network stumbles upon the solution and then stays there. The "stay there" part is well explained, but the interesting part is the search. And nothing is presented here to suggest search is not entirely random.
      It would be interesting to compare the neural network to a simple random search algorithm which just randomly tries every combination. Would the neural network find the solution significantly faster than the random search?

    • @RyanMartinRAM
      @RyanMartinRAM 2 years ago +1

      @@SteveStavropoulos or just start with different random noise every so often?

    • @brunosantidrian2301
      @brunosantidrian2301 2 years ago +2

      @@SteveStavropoulos Well, you are actually restricting the search to the space of networks that perform well in the training set. So I would say yes.

  • @beaconofwierd1883
    @beaconofwierd1883 3 years ago +10

    I don't see what needs explaining here.
    The loss function is something like L = MSE(w, x, y) + d*w^2 (see the sketch after this thread).
    The optimization tries to get the error as small as possible and also to get the weights as small as possible.
    A simple rule requires a lot less representation than memorizing every datapoint, thus more weights can be 0.
    The rule-based solution is exactly what we are optimizing for; it just takes a long time to find because the dominant term in the loss is the MSE, hence why it first moves toward minimizing the error by memorizing the training data and then slowly moves toward minimizing the weights. The only way to minimize both is to learn the rule. How is this different from simple regularization?
    As for why the generalization happens "suddenly": it is just a result of how gradient descent (with the Adam optimizer) works. Once it has found a gradient direction toward the global minimum it will pick up speed and momentum moving toward that global minimum, but finding a path to that global minimum could take a long time, possibly forever if it gets stuck in a local minimum without any way out. Training 1000x longer to achieve generalization feels like a bad/inefficient approach to me.

    • @beaconofwierd1883
      @beaconofwierd1883 2 years ago

      @@aletheapower1977 Not really though; look at the graphs at 14:48. Without regularization you need about 80% of the data to be training data for generalization to occur, at which point you're basically just training on all your data and the model is more likely to find the rule before it manages to memorize all of the training data. Not really any surprise there either.

    • @beaconofwierd1883
      @beaconofwierd1883 2 years ago

      @@aletheapower1977 Is there a relationship between the complexity of the underlying rule and the data fraction needed to grok? A good metric to use here could be the entropy of the training dataset. One would expect grokking to happen faster and at smaller training data fractions when the entropy of the dataset is low, and vice versa. Not sure how much updating you will be doing, but this could be something worth investigating :) If there's no correlation between entropy and when grokking occurs, that would be very interesting.

    • @beaconofwierd1883
      @beaconofwierd1883 2 years ago

      @@aletheapower1977 I hope you find the time :)
      If not you can always add it as possible future work :)
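
    A minimal sketch of the loss described at the top of this thread (MSE data term plus a weight-decay term); the toy model, data, and the coefficient d below are placeholders, not the paper's setup:

        import torch

        def loss_with_weight_decay(model, x, y, d=1e-4):
            # MSE data term plus d * (sum of squared weights), i.e. explicit L2 / weight decay.
            mse = torch.nn.functional.mse_loss(model(x), y)
            l2 = sum((p ** 2).sum() for p in model.parameters())
            return mse + d * l2

        model = torch.nn.Linear(4, 1)
        x, y = torch.randn(8, 4), torch.randn(8, 1)
        print(loss_with_weight_decay(model, x, y))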

  • @Yaketycast
    @Yaketycast 3 years ago +66

    OMG, I'm not crazy! This has happened to me when I forgot to take down a model for a month.

    • @freemind.d2714
      @freemind.d2714 3 years ago +8

      Same here

    • @TheRyulord
      @TheRyulord 3 years ago +10

      Here I was watching the video and wondering "How do people figure something like this out?" but I guess you've answered my question.

    • @solennsara6763
      @solennsara6763 3 years ago +31

      not paying attention is a very essential skill for achieving progress.

    • @Laszer271
      @Laszer271 3 years ago +5

      @@solennsara6763 That's essentially also how antibiotics were invented.

    • @anshul5243
      @anshul5243 3 years ago +48

      @@solennsara6763 so are you saying no attention is all you need?

  • @black-snow
    @black-snow 2 years ago +5

    after eating melon for some days my model suddenly starts to generalize

  • @st33lbird
    @st33lbird 3 years ago +7

    Code or it didn't happen

  • @nikre
    @nikre 3 years ago +26

    Why would it be specifically t-SNE that reflects the "underlying mechanism" and not a different method? I would like to see how the t-SNE structure evolves over time, and other visualization methods. This might as well be a cherry-picked result.

    • @Mefaso09
      @Mefaso09 3 years ago +2

      Definitely, the evolution over time would be very interesting

  • @pladselsker8340
    @pladselsker8340 1 month ago +2

    How come this has only gained traction and popularity now, 2 years after the paper's release?

  • @Chocapic_13
    @Chocapic_13 3 years ago +31

    It would have been interesting if they had shown the evolution of the sum of the squared weights during training (a sketch for logging it is below).
    Maybe the sum of squared weights gets progressively smaller and this helps the model find the "optimal" solution to the problem starting from the memorized solution. It could also be that the L2 penalty progressively "destroys" the learned representation and adds noise until it finds this optimal solution.
    Very interesting paper and video anyway!
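
    A rough sketch of how one could log this quantity during training; the toy model, random data, and hyperparameters are placeholders, the point is just to record the sum of squared weights at every step and plot it against validation accuracy:

        import torch

        model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
        x, y = torch.randn(256, 10), torch.randint(0, 10, (256,))

        weight_norms = []
        for step in range(1000):
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
            # Sum of squared weights after each update; in a grokking run one would plot this curve.
            weight_norms.append(sum((p ** 2).sum().item() for p in model.parameters()))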

  • @imranq9241
    @imranq9241 3 years ago +3

    How can we prove double descent mathematically? I wonder if there could be a triple descent as well.

  • @nauy
    @nauy 3 years ago +71

    I actually expected this kind of behavior based on the principle of regularization. With regularization (weight decay is L2 regularization), there is always a tension between the prediction error term and the regularization term in the loss. At some point in the long training session, the regularization term starts to dominate and the network, given enough time and noise, finds a different solution that has much lower regularization loss. This is also true when increasing the number of parameters, since doing that also increases the regularization loss. At some point, adding more parameters results in the regularization loss dominating the error loss, and the network latches onto another solution that results in lower regularization loss and overall loss. In other words, regularization provides the pressure that, given enough time and noise in the system, comes up with a solution that uses fewer effective parameters. I am assuming fewer parameters leads to better generalization.

    • @Mefaso09
      @Mefaso09 3 years ago +13

      While this is the general motivation for regularization, the really surprising thing here is that it does still find this better solution by further training.
      Personally I would have expected it to get stuck in a local overfit optimum and just never leave it. The fact that it does "suddenly" do so after 1e5 times as many training steps is quite surprising.
      Also, if the solution really is to have fewer parameters for better generalization, traditionally L1 regularization is more likely to achieve that than L2 regularization (weight decay). Would be interesting to try that.

    • @Mefaso09
      @Mefaso09 3 years ago +3

      Also, I suppose it would be interesting to look at whether the number of weights with significantly nonzero magnitude actually decreases over time, to test your hypothesis (a minimal sketch follows this thread).
      Now if only they had shared some code...

    • @vladislavstankov1796
      @vladislavstankov1796 3 years ago +2

      @@Mefaso09 Agree, it is confusing why training for a much longer time helps. If you remember the example of an overfitting polynomial that matches every data point but is very "unsmooth", the unsmoothness comes from "large" coefficients. When you let the polynomial have way more freedom but severely constrain the coefficients, then it can look like a parabola (even though in reality it will be a polynomial of degree 20), yet it will fit the data much better than any polynomial of degree 2.

    • @markmorgan5568
      @markmorgan5568 3 years ago +5

      I think this is helpful. My initial reaction was “how does a network learn when the loss and gradients are zero?” But with weight decay or dropout the weights do still change.
      Now back to watching the video. :)

    • @visuality2541
      @visuality2541 2 years ago +1

      L2 regularization (weight decay) is more about small-valued parameters than about a smaller number of parameters (sparsity); at least the classical view is so, as you probably already know.
      I personally believe that the finding is related to the Lipschitzness of the overall net and also the angle component of the representation vector,
      but I am not sure what loss function they really used.
      It'd be really surprising if future work shows the superiority of L2 regularization compared to other regularizers for generalization, of course with a solid theory.
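
    A minimal sketch of the two ideas raised in this thread: an L1 penalty as an alternative to L2 (weight decay), and counting how many weights remain significantly nonzero. The threshold and the generic PyTorch model are assumptions, not the paper's code:

        import torch

        def penalty(model, kind="l2"):
            # L2 (weight decay) favours small-valued weights; L1 pushes weights toward exact zeros (sparsity).
            if kind == "l2":
                return sum((p ** 2).sum() for p in model.parameters())
            return sum(p.abs().sum() for p in model.parameters())

        def count_significant_weights(model, threshold=1e-3):
            # Crude sparsity measure: number of weights whose magnitude exceeds the threshold.
            return sum((p.abs() > threshold).sum().item() for p in model.parameters())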

  • @oreganorx7
    @oreganorx7 3 years ago +5

    I would have named the paper/research "Weight decay drives generalization" : P

    • @oreganorx7
      @oreganorx7 2 years ago

      @@aletheapower1977 Hi Alethea, I was mostly just joking around. In my head I took a few leaps and came up with an understanding of why I think this behaviour happens fastest with weight decay. Great work on the paper and research!

  • @dk-zd5rg
    @dk-zd5rg 3 years ago +5

    whoopsie doopsie generalised, yes

  • @UncoveredTruths
    @UncoveredTruths 3 years ago +7

    S5? ah yes, all these years of abstract algebra: my time to shine

  • @drhilm
    @drhilm 3 years ago +59

    I saw a similar idea in the paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks"
    by Elad Hoffer, Itay Hubara, and Daniel Soudry, from 2017. Since then, I always try to run the network past the overfitting point, and in many cases it works.

  • @MCRuCr
    @MCRuCr 3 years ago +11

    This sounds just like your average "I forgot to stop the experiment over the weekend" random scientific discovery.

  • @justfoundit
    @justfoundit 3 years ago +4

    I'm wondering if it works with XGBoost too: increasing the size of the model to 100k and letting the model figure out what went wrong when it starts to overfit at around 2k steps?

  • @Savage2Flinch
    @Savage2Flinch 3 months ago +1

    Stupid question: if your loss on the training set goes to zero, what happens during back-prop if you continue? Shouldn't the weights stop changing after overfitting?

    • @pokerandphilosophy8328
      @pokerandphilosophy8328 1 month ago

      My guess is that the model keeps learning about the deep structure of the problem, albeit not yet in a way that is deep enough to enable generalization outside of the training set. Imagine you are memorizing a set of poems by a poet. At some point you know all of them by rote, but you don't yet understand the style of the poet deeply enough to be able to predict (with above-chance accuracy) how some of the unseen poems proceed. As you keep rehearsing the initial poems (and being penalized for the occasional error) you latch onto an increasing number of heuristics or features of the poet's style. This doesn't significantly improve your ability to recite those poems, since you already know them by rote. It also doesn't significantly improve your ability to make predictions outside of the training set, since your understanding of the style isn't holistic enough. But at some point you have latched onto sufficiently many stylistic features (and learned to tacitly represent aspects of the poet's world view) to be able to understand why (emphasis on "why") the poet chooses this or that word in this or that context. You've grokked and are now able to generalize outside of the training data set.
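
    One concrete piece of the answer to the question above: with weight decay the update does not vanish even when the data gradient is (near) zero, because the decay term keeps pulling the weights toward zero. A toy sketch of an SGD-with-weight-decay update on a single weight (lr and lam are arbitrary values):

        # Update rule: w <- w - lr * (grad_data + lam * w)
        lr, lam = 0.1, 0.01
        w, grad_data = 2.0, 0.0  # zero data gradient, as after the training loss has flattened out
        for step in range(3):
            w = w - lr * (grad_data + lam * w)
            print(step, w)  # w keeps shrinking despite the zero data gradient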

  • @tristanwegner
    @tristanwegner 2 years ago +2

    In a few years a neural network might watch your videos as training data to output better neural networks. I mean, Google owns YouTube and DeepMind, and video is the next step after picture and text creation, so maybe the training is already happening. But with such a huge dataset I hope they weight the videos not by views only, but also by the predicted education of the audience. I mean, they sometimes ask me if I liked a video because it was relaxing, or informative, etc. So thanks for being part of the pre-singularity :)

  • @thunder89
    @thunder89 3 years ago +4

    I cannot find in the paper what the dataset size for fig 6 was -- what percentage of the training data are outliers? It seems interesting that there is a jump between 1k and 2k outliers...

  • @jjpp1993
    @jjpp1993 3 years ago +1

    It's like when you find out that it's easier to lick the sugar cookie than to scrape out the figure with a needle ;)

  • @PhilipTeare
    @PhilipTeare 2 years ago +1

    As well as double descent, I think this relates strongly to the work by Naftali Tishby and his group on the 'forgetting phase' beyond overfitting - where unnecessary information carried forward through the layers gets forgotten until all that is left is what is needed to predict the label: ruclips.net/video/TisObdHW8Wo/видео.html Again, he was running for ~10K epochs on very simple problems and networks.

  • @74357175
    @74357175 3 years ago +2

    What tasks are likely to be amenable to this? Only symbolic? How does the grokking phenomenon appear in non synthetic tasks?

    • @drdca8263
      @drdca8263 3 years ago

      This comment leads me to ask, "What about a synthetic task that isn't symbolic?" E.g. addition mod 23.5, with floating-point numbers (a sketch of such a task is after this thread).
      Or maybe they tried that? Not sure.
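
    A sketch of the kind of non-symbolic synthetic task suggested above (purely hypothetical; nothing in the thread says the paper tried this): float inputs with regression targets y = (a + b) mod 23.5.

        import numpy as np

        rng = np.random.default_rng(0)
        a, b = rng.uniform(0, 100, 10_000), rng.uniform(0, 100, 10_000)
        x = np.stack([a, b], axis=1).astype(np.float32)
        y = ((a + b) % 23.5).astype(np.float32)  # "addition mod 23.5" on floats
        split = 8_000  # train any regressor on the first part, validate on the rest
        x_train, y_train, x_val, y_val = x[:split], y[:split], x[split:], y[split:]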

  • @Phenix66
    @Phenix66 3 years ago +3

    Loved the context you brought into this about double descent! And, as always, really great explanations in general!

  • @kadirgunel5926
    @kadirgunel5926 3 years ago +1

    This paper reminded me of the halting problem. Doesn't it? It can stop, but it may take too much time; it is nondeterministic. I felt like I was living in the 1950s.

  • @CristianGarcia
    @CristianGarcia 3 years ago +3

    Thanks Yannic! Started reading this paper but forgot about it, very handy to have it here.

  • @goodtothinkwith
    @goodtothinkwith 3 months ago

    So if a phenomenon is relatively isolated in the dataset, it’s more likely to grok it? If that’s the case, it seems like experimental physics datasets would be good candidates to grok new physics…

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 2 years ago +1

    "The net is learning the underlying rule", more like stumbling upon it randomly...

  • @GeorgeSut
    @GeorgeSut 3 months ago +1

    So this is another interpretation of double descent, right?

    • @GeorgeSut
      @GeorgeSut 3 months ago

      Also, why didn't the authors train the model for more epochs/optimization steps? I honestly doubt the accuracy stays at that level, noticing that the peak on the right of the plot is cut-off by the plot's boundaries. What if there is some weird periodicity here?

  • @jabowery
    @jabowery 3 years ago +2

    An intuitive way of thinking about the double dip: imagine you're trying to write an algorithm that goes from gzip to bzip2. You must first decompress. Or think about trying to compress video. If you convert from pixels to voxels it seems like a very costly operation, but it allows you to go from a voxel representation to a geometric abstraction in 3D where there is much less change over time. So don't be confused by the parameter count. You are still trying to come up with the smallest number of parameters in the final model.

  • @vochinin
    @vochinin 5 months ago

    OK, isn't it just that the weight decay loss is not included in the validation loss used for control? Controlling for accuracy on its own might be a stupid idea. Finding solutions beyond the overfitting point might seem strange, but if we include the weight decay loss (which is not done in DL libraries, as it's part of the optimizers) we can see a different story. Isn't it so?

  • @IIIIIawesIIIII
    @IIIIIawesIIIII 2 years ago +1

    This may be due to the fact that there are countless combinations of factors that lead to overfitting, but only one "middle point" between those combinations.
    A quicker (non-exponential) way to arrive there would be to overfit a few times with different random seeds and then add an additional layer that checks for patterns these overfitted representations have in common.

  • @BjornHeijligers
    @BjornHeijligers 3 months ago

    Can you explain what "training on the validation or test data" means? Are you using the actual "y" data of the test or validation set to backpropagate the prediction error? How is that different from adding more data to the training set?
    Thanks, trying to understand.

  • @ericadar
    @ericadar 3 years ago +1

    I'm curious if floating-point errors have anything to do with this phenomenon. Is it possible that FP errors accumulate and cause de facto regularization after orders of magnitude more forward and backward passes? Is the orders-of-magnitude delay between the training and validation accuracy jumps shorter for FP16 compared to FP32?

  • @rayyuan6349
    @rayyuan6349 3 years ago +1

    I don't know if the GAN community (perhaps the overparameterization community too) and the general ML community have widely accepted "grokking", but this is absolutely mind-blowing to the field, as it implies that we can just brainlessly add more parameters to get better results.

  • @yasserothman4023
    @yasserothman4023 2 years ago +1

    Thank you for the explanation. Any reference or recommended videos on weight decay in the context of deep learning optimization?
    Thanks

  • @AliMoeeny
    @AliMoeeny 3 years ago +2

    Just wanted to say thank you, I learned a lot.

  • @DmytroKarpovych
    @DmytroKarpovych 3 years ago +1

    So, that's why our brain, which has so many "weights", is so clever? :)

  • @XOPOIIIO
    @XOPOIIIO 3 years ago +3

    My networks are showing the same phenomenon when they get overfitted on the validation set too.

    • @gunter7518
      @gunter7518 2 years ago

      How can I contact you? I want to see these networks.

  • @hacker2ish
    @hacker2ish 8 months ago

    I think weight decay having to do with the simplicity of the trained model is due to the fact that in some cases it may force many parameters to be almost zeroed out (because weight decay means adding the squared 2-norm of the parameters to the objective function, scaled by a constant).

  • @howardkong8927
    @howardkong8927 2 years ago

    Could it be that human intelligence also evolved in such a "snapping" way, which would explain why no other animals have achieved our level of reasoning abilities?

  • @FromBeaverton
    @FromBeaverton 3 years ago +1

    Thank you! Very cool! I have observed the grokking phenomenon and was really puzzled why I do not see anybody talking about it, other than "well, you can try training your neural network longer and see what happens".

  • @CristianGarcia
    @CristianGarcia 3 years ago +1

    The "Deep Double Descent" paper did look into the double descent phenomena as a function of computation time/steps.

  • @patrickjdarrow
    @patrickjdarrow 3 years ago +2

    Some intern probably added a few too many 0's to their n_epochs variable, went on vacation and came back to these results

    • @oscezrcd
      @oscezrcd 3 years ago

      Do interns have vacations?

    • @patrickjdarrow
      @patrickjdarrow 3 years ago

      @@oscezrcd depends on their arrangement

    • @patrickjdarrow
      @patrickjdarrow 2 years ago

      @@aletheapower1977 LOL. Love that

  • @TheMrVallex
    @TheMrVallex 2 years ago +2

    Dude, thanks so much for the video! You just gave me an absolute 5-head idea for a paper :D
    Too bad I work full time as a data scientist, so I have little time for doing research. I don't want to spoil the idea, but I have a strong gut feeling that these guys stumbled onto an absolute goldmine of a discovery!
    Keep up the good work! Love your channel

    • @letsburn00
      @letsburn00 2 years ago

      I suspect that it has something to do with the fact that there is limited entropy change from the randomised initial values to the "fully trained" state. The parameters have so much more capacity to learn, but the dataset can often be learned in more detail.

    • @Zhi-zv1in
      @Zhi-zv1in 5 months ago

      pleassse share the idea i wanna know

  • @MarkNowotarski
    @MarkNowotarski 3 years ago +1

    22:58 "that's a song, no?" That's a song, Yes! ruclips.net/video/XhzpxjuwZy0/видео.html

  • @THEMithrandir09
    @THEMithrandir09 3 years ago +2

    25:50 Looking forward to the Occam's Razor Regularizer.

    • @drewduncan5774
      @drewduncan5774 2 years ago

      en.wikipedia.org/wiki/Minimum_description_length

    • @THEMithrandir09
      @THEMithrandir09 2 years ago

      @@drewduncan5774 great read, thx!

  • @shm6273
    @shm6273 3 years ago +1

    Wouldn't the snapping be a symptom of the loss landscape having a very sharp global minimum and a ton of flat points? Let the network learn forever (and also push it around) and it should eventually find the global minimum. I'm assuming that on a real dataset, the 10^k iterations needed for a snap would grow larger than the number of atoms in the universe.
    EDIT: I commented too soon, seems you had a similar idea in the video.

    • @shm6273
      @shm6273 2 years ago

      @@aletheapower1977 Cool. So you believe the minimum is actually not sharp but pretty flat, and that makes the generalization better? I remember reading a paper about this, I'll link to it a bit later.

  • @chaosordeal294
    @chaosordeal294 2 years ago

    Understand = understand; Grok = understand + smug

  • @TimScarfe
    @TimScarfe 3 years ago +3

    Been waiting for this

  • @escher4401
    @escher4401 3 months ago

    Has anyone tried to do this on NeRFs?

  • @DasGrosseFressen
    @DasGrosseFressen 2 years ago

    Learn the rule? So there was extrapolation?

  • @shammerHammer
    @shammerHammer 2 years ago

    poor fucking google colab servers

  • @agentds1624
    @agentds1624 3 years ago +1

    It would also have been very interesting to see if the validation loss also snaps down. Maybe it's just a coincidence that there is no val loss curve ;). I currently also have a kind of similar case where my loss curves scream overfitting, but the accuracy sometimes shoots up...

    • @Coolguydudeness1234
      @Coolguydudeness1234 3 years ago +2

      The figures and the whole subject of the paper are referring to the validation loss.

  • @tinyentropy
    @tinyentropy 2 years ago

    What's the name of the other paper mentioned?

  • @TheEbbemonster
    @TheEbbemonster 3 years ago +1

    That is grokked up!

  • @TheDavddd
    @TheDavddd 2 years ago

    I don't think weight decay explains the transition, and why it happens once. Weight decay would explain a monotonically decreasing validation loss where weights are high at first and low later. But why is there a transition? Why just one transition?

  • @tristanwegner
    @tristanwegner 2 years ago

    21:33. Your explanation of why weight decay helps with generalization after a lot of training makes intuitive sense. Weight decay is like being forgetful. You try to remember varying details of each sample, but the details are different for each sample, so if you forget a bit, you do not recognize it anymore, and when learning about it again, you pick a new detail to remember. But if you find (stumble upon?) the underlying pattern, each sample reinforces your rule, counteracting the constant forgetting rate, so it will still be there when the training sample comes around again, and thus it stays. Neuroscience has assumed that forgetting is essential for learning, and some savants with amazing memory often have a hard time in real-world situations.

  • @pisoiorfan
    @pisoiorfan 3 years ago

    I wonder if a model after grokking suddenly becomes more compressible through distillation than any of its previous incarnations. That would be expected if it discovers a simple rule that fits all inputs, and it would be the equivalent of the machine having an "Aha!" moment.

  • @davidwuhrer6704
    @davidwuhrer6704 3 years ago

    Interesting.
    I want to mention that humans do not as a rule look for a rule behind a dataset. Most humans just blindly rote memorise data.
    Maybe because they aren't given enough data to generalise. (Although they will happily extrapolate from a single data point.)
    Certainly, school trains humans to regurgitate what they are supposed to learn; maybe they generalise that to real life.
    In any case, mathematicians and engineers are the exception to the rule, being trained to look for rules behind data. Not that this behaviour is exclusive, but it does require unlearning.

  • @Idiomatick
    @Idiomatick 2 years ago

    Your suggestion seems sufficient for both phenomena. Decay can potentially cause worse loss with a 'wild' memorization, so a smoother solution will eventually be found after many, many, many attempts.
    "Worst way to generalize solutions" would have been a fun title.

  • @felix-ht
    @felix-ht 2 years ago

    Feels a lot like the process when a human is learning something new. In the beginning one just tries to soak up the knowledge. And then suddenly you just get how it works and convert the knowledge into understanding. To me this moment always felt very relieving - as much of the knowledge can simply be replaced with the understanding.
    Arguably the driving factor in humans for this might actually be something similar to weight decay. Many neurons storing knowledge is certainly more costly than a few just remembering the rule. Extrapolated even further, this might very well be what drives sentience itself: the simplest solution to being efficient in a reality is to understand its rules and one's role within it.
    So sentience would be the human brain grokking a new, more elegant solution to its training data. One example of this would be human babies suddenly starting to recognize themselves in a mirror.

  • @herp_derpingson
    @herp_derpingson 2 years ago

    This is mind blowing.
    .
    I wonder if a "grokked" image model is more robust against adversarial attacks. Somebody needs to do a paper on that.
    .
    We can also draw parallels to the dimpled manifold hypothesis. Weight decay implies that the dimples would have to have a shallow angle, while still being correct. A dimple with a shallow angle will just have a lot more volume to "catch particles" within it, leading to better generalization.
    .
    I would also be interested in seeing how fast we should decay the weights. My intuition says that it has to be significantly less than the learning rate, otherwise the model won't keep up with the training, while not being so small that the weight decay is nullified by the gradient updates.
    .
    I remember in one of the blog posts of Andrej Karpathy he said that make sure that your model is able to overfit a very small dataset before moving on to next steps. I wonder if now it would become standard practice to "overfit first" a model.
    .
    I wonder if something similar happens in humans. I have noticed that I develop better understanding of some topics after I let them sit in my brain for a few weeks and I forget a lot of the details.

  • @ZLvids
    @ZLvids 2 years ago

    Actually, I have encountered several times an "opposite grokking" phenomenon where the loss suddenly *increases*. Does this phenomenon have a name?

  • @larryjuang5643
    @larryjuang5643 2 years ago

    Wonder if Grokking could happen for a reinforcement learning set up......

  • @bingochipspass08
    @bingochipspass08 2 years ago

    Great walkthrough of the paper Yannic! You covered some details which weren't clear to me initially!

  • @XecutionStyle
    @XecutionStyle 3 years ago

    What does this mean for Reinforcement Learning and Robotics? Does an over-parameterized network with tons of synthetic data generalize better than a simple one with a fraction, but real data?

  • @BitJam
    @BitJam 2 years ago

    This is such a surprising result! How can training be so effective at generalizing when the loss function is already so close to zero? Is the loss function really pointing the way to the generalization when we're already at a very flat bottom? Is it going down when the NN generalizes? How many bits of precision do we need for this to work? I wonder what happens if noise is added to the loss function during generalization. If generalization always slows down when we add noise then we know the loss function is driving it. If it speeds up or doesn't slow down then that points to self-organization.

  • @raunaquepatra3966
    @raunaquepatra3966 3 years ago

    Is there learning rate decay? Then it would be even crazier,
    because jumping from a local minimum to the global one would be counterintuitive.

  • @jeanf6295
    @jeanf6295 3 years ago

    When you are limited to compositions of elementary operations that don't have quite the right structure, it kind of makes sense that you need a lot of parameters to emulate the proper rule, possibly way more than you have data points.
    It's a bit like using NOT and OR gates to represent an operation: sure, it will work, but you will need quite a few layers of computation and you will need to store quite a few intermediate results at each step (see the tiny example below).
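
    A tiny example of the gate analogy: AND (and from there any Boolean function) can be built from NOT and OR alone via De Morgan's law, at the price of extra depth and intermediate values.

        def NOT(a): return 1 - a
        def OR(a, b): return a | b

        def AND(a, b):
            # De Morgan: a AND b == NOT(NOT(a) OR NOT(b)) -- one extra layer and two intermediates.
            return NOT(OR(NOT(a), NOT(b)))

        for a in (0, 1):
            for b in (0, 1):
                assert AND(a, b) == (a & b)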

  • @Lawrencelot89
    @Lawrencelot89 2 years ago

    How is this related to weight regularization? If the weight decay is what causes this, making the weights already near zero should cause this even earlier right?

  • @hannesstark5024
    @hannesstark5024 3 years ago +5

    I guess the most awesome papers are at workshops!?
    It would be really nice to have a plot of the magnitude of the weights over the training time to see if it decreases a lot when grokking occurs.

    • @ssssssstssssssss
      @ssssssstssssssss 3 years ago

      Well. Workshops are a lot more fun than big conferences. And the barrier to entry is typically lower so conservative (aka exploitation) bias has less of an impact.

    • @hannesstark5024
      @hannesstark5024 2 years ago

      @@aletheapower1977 ❤️

  • @abubakrbabiker9553
    @abubakrbabiker9553 2 years ago

    My guess would be that this is more like a guided search for optimal weights, as in: the neural network explores many local minima (with SGD) w.r.t. the training set until it finds some weights that also fit the test dataset. I suspect that if they extended the training period the test accuracy would drop again and might rise later as well.

  • @TheRohr
    @TheRohr 3 years ago

    Does the number of necessary steps correlate with the number necessary for differential evolution algorithms?

  • @themiguel-martinho
    @themiguel-martinho 2 years ago

    Why do many of the binary operations tested include a modulus operator?

    • @themiguel-martinho
      @themiguel-martinho 2 years ago

      Also, since both this work and double descent come from OpenAI, did the researchers discuss it with each other?
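
    For context on the modulus question: the paper's datasets are tables of binary operations over a small finite set, e.g. x ∘ y = (x + y) mod p, with part of the table held out for validation. A rough sketch of building such a dataset (p = 97 and the 50% split are just example values):

        import random

        p = 97
        equations = [(x, y, (x + y) % p) for x in range(p) for y in range(p)]  # full operation table
        random.seed(0)
        random.shuffle(equations)
        split = int(0.5 * len(equations))  # the paper varies this training fraction
        train, val = equations[:split], equations[split:]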

  • @employeeofthemonth1
    @employeeofthemonth1 2 years ago

    So is there a penalty on larger weights in the training algorithm? If yes, I wouldn't find this surprising, intuitively.

  • @herewegoagain2
    @herewegoagain2 2 years ago

    Like you mentioned, I think even a minor amount of noise is enough to influence the 'fit' in higher dimensions. I think they're considering an ideal scenario. If so, it won't 'generalise' to real-life use cases. Very interesting paper nonetheless.

  • @martinkaras775
    @martinkaras775 2 years ago

    In natural datasets where the generating function is unknown (or maybe nonexistent), isn't there a bigger chance of overfitting to the validation set than of finding the actual generating function?

  • @pascalzoleko694
    @pascalzoleko694 3 years ago

    No matter how good a video is, I guess someone has to put a thumb down. Very interesting stuff. I wonder what it costs to run training scripts for soooo long :D

  • @ssssssstssssssss
    @ssssssstssssssss 3 years ago

    Grokking is a pretty awesome name. Maybe I'll find a way to work it into my daily lingo.

  • @andrewminhnguyen9446
    @andrewminhnguyen9446 3 years ago +2

    Makes me think of "Knowledge distillation: A good teacher is patient and consistent," where the authors obtain better student networks by training for an unintuitively long time.
    Also, curious how this might play with "Deep Learning on a Data Diet."
    Andrej Karpathy talks about how he accidentally left a model training over winter break and found it had SotA'ed.

  • @derekxiaoEvanescentBliss
    @derekxiaoEvanescentBliss 3 years ago

    Curious though if this has to do with the relationship between the underlying rule of the task, the implicit regularization of gradient descent, and explicit regularization.
    Is this a property of neural network learning, or is it a property of the underlying manifolds of datasets drawn from the real world and math?
    Edit: ah looks like that's where they went in the rest of the paper

  • @azai.mp4
    @azai.mp4 2 years ago

    How is this different from epoch-wise double descent?

  • @moajjem04
    @moajjem04 3 years ago

    Is this kind of like how we humans sometimes get that gotcha moment where we understand stuff? Like a teacher explaining a concept and we understand it after the nth time it has been explained?

  • @gianpierocea
    @gianpierocea 3 years ago

    Very cool. For some reason what I would like to see is combining this sort of thing with a multitask approach, i.e., making the model solve the binary operation task and some other synthetic task that is different enough. I really doubt this would ever work with current computational resources, but I do feel that if the model could grok that... well, that would be really impressive, closer to AGI.

  • @bojan368
    @bojan368 3 years ago +2

    I wonder what would happen if you trained GPT-3 for a long time; would it become self-aware?

  • @mmsamiei
    @mmsamiei 3 years ago

    I think this paper could be discussed on Machine Learning Street Talk.

  • @ralvarezb78
    @ralvarezb78 3 years ago

    This means that at some point there is a very sharp optimum in the surface of the loss function.

  • @SeagleLiu
    @SeagleLiu 3 years ago

    Yannic, thanks for bringing this to us. I agree with your heuristics on the local minimum vs. global minimum. To me, the phenomenon in this paper is related to the "oracle property" of the Lasso in the traditional statistical learning literature: for a penalized algorithm (Lasso, or equivalently weight decay), the algorithm is almost sure to find the correct non-zero parameters (i.e., the global minimum) as well as their correct coefficients. It would be interesting for someone to study this further.

  • @brll5733
    @brll5733 3 years ago

    Forget the paper. Her name is Alethea Power? How awesome is that?

  • @Champignon1000
    @Champignon1000 3 years ago

    This is super cool, thanks. - I'm trying something where I use the intermediate layers of a discriminator as an additional loss for the generator/autoencoder, to avoid collapse. And adding regularizers really helps a lot. (Specifically, instead of an MSE loss between the real image and the fake image, I use a very deep perceptual loss from the discriminator, so basically only letting the autoencoder know how conceptually different the image was.)