Finally: Grokking Solved - It's Not What You Think

  • Published: 14 Jan 2025

Comments • 60

  • @jabowery
    @jabowery 9 hours ago +10

    For those new to the subject: everything up to 18:22 is pretty much a diagnosis of the disease, not the cure. Don't be discouraged by the long introduction. This video is very important.

  • @desmur36
    @desmur36 11 hours ago +6

    This is the most important video on the internet for our advancement of AI!

  • @hjups
    @hjups 10 hours ago +8

    I think I missed something in your explanation. If the model gets stuck due to softmax collapse, then how does Grokking occur? Presumably, this would require the model to eventually become "unstuck", but that's not possible with a zero gradient. It's also unlikely to occur from floating-point behavior as an underflow will be truncated and an overflow will become inf/NaN. Or is the idea that a high lr avoids this collapse by amplifying small gradients?
    Additionally, if SC is attributed to logit scaling prior to softmax, then does that not suggest that using non-affine scaled normalization prior to the softmax may prevent this issue? Norm(c*f(x)) ~ f(x) / || f(x) ||.

    • @malchemis
      @malchemis 9 hours ago +2

      +1 This does not explain the "suddenly". I may have missed something in the video; I need to take a look at the paper. Is the model getting unstuck simply due to noise (repeated perturbation)? The reasoning would be that a global minimum is stable and more resistant to perturbation than a local one.

    • @mucabi
      @mucabi 6 hours ago +1

      I only skimmed the paper, but if I remember correctly, for grokking to occur consistently you basically need weight decay. So over time the weights get decayed enough to give meaningful gradients again.
      There are also pretty pictures in the paper, take a look :)

    • @devinbae9914
      @devinbae9914 2 hours ago

      @malchemis You should look into spontaneous replica symmetry breaking in diffusion models. Grokking is a similar phenomenon where the state space undergoes a phase transition... it is clear that the energy, in conjunction with some other parameters, must reach some sort of threshold (which itself can be sequestered) before finally settling in a NEW attractor basin, which might be called the "true basin" in contrast to the "false basin" (the initial training plateau). Hope this makes a little more sense?

    • @hjups
      @hjups 2 hours ago +2

      @devinbae9914 I suspect this is an accurate description, but it does not address the issue posed. What exactly causes the phase transition? It can't simply be stochastic chance, since it was established that the gradients are 0 due to softmax collapse, meaning the model is unable to update.
      So either the zero-gradient assumption is false, weight decay helps recover from SC, or the scenario described in the video is flawed (i.e. maybe only models which do not exhibit SC are able to grok, depending on random seeds). Or maybe there's some other factor that we're missing?
      Understanding the mathematical mechanism is crucial to backing the theoretical model.

    • @devinbae9914
      @devinbae9914 30 minutes ago

      @@hjups Yes, very true. I'd wager it's because there is a parameter which causes the phase transition. The weight decay explanation seems pretty reasonable, and it's actually cited as the primary explanation in the paper... other factors like the loss function are also involved (they note MSE loss on shallow networks can induce grokking), as well as weight scaling.
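
      A concrete way to check both halves of this thread (the zero gradient, and the weight-decay escape) is a toy float32 experiment. This is a minimal PyTorch sketch of my own, not code from the paper: scaling up logits that already rank the correct class first drives the cross-entropy gradient to exactly zero, and shrinking them back down, roughly what repeated weight decay does to the weights producing them, makes the gradient reappear.

          import torch

          # Logits that already put the correct class (index 0) first.
          # Scaling them up mimics softmax collapse: in float32 the softmax
          # saturates to one-hot and the gradient underflows to exactly 0.
          # Scaling down (a stand-in for weight decay) restores the gradient.
          target = torch.tensor([0])
          for scale in [100.0, 10.0, 1.0]:
              logits = (scale * torch.tensor([[5.0, 1.0, -2.0]])).requires_grad_()
              loss = torch.nn.functional.cross_entropy(logits, target)
              loss.backward()
              print(f"scale {scale:6.1f}  max|grad| = {logits.grad.abs().max():.3e}")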

  • @thielt01
    @thielt01 13 hours ago +8

    I've been waiting for such a paper!
    I'm more of a hobbyist in the space than an actual researcher.

    • @tescOne
      @tescOne 11 hours ago +1

      same :) Grokking blew my mind some months ago

    • @RickySupriyadi
      @RickySupriyadi 4 hours ago

      Which one? I can't find it.

  • @mrpocock
    @mrpocock 13 hours ago +11

    OK, so can you just slap an L1 loss in parallel to the logit layer that's proportional to the inverse entropy of the logits?
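
    One way to write that suggestion down (a hypothetical, untested sketch; the penalty here is a scalar proportional to the inverse softmax entropy rather than a literal L1 term, and lam/eps are made-up knobs):

        import torch
        import torch.nn.functional as F

        def entropy_penalized_loss(logits, targets, lam=1e-3, eps=1e-6):
            # Cross-entropy plus a term that grows as the softmax entropy
            # collapses toward zero, i.e. as the logits saturate.
            ce = F.cross_entropy(logits, targets)
            p = F.softmax(logits, dim=-1)
            entropy = -(p * torch.log(p + eps)).sum(dim=-1).mean()
            return ce + lam / (entropy + eps)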

    • @therainman7777
      @therainman7777 8 hours ago +1

      I was just thinking the exact same thing.

    • @therainman7777
      @therainman7777 8 hours ago +3

      Ah, now I just got to 19:00 in the video, where the paper’s authors say that they have a method of preventing softmax collapse _without_ regularization (your suggestion is a form of regularization). So maybe they have a better method; I’ll need to finish the video and read the paper to find out 😅

    • @mrpocock
      @mrpocock 7 hours ago +1

      @@therainman7777 there may be a numerical methods trick, but if so, it is beyond my level of understanding.

    • @therainman7777
      @therainman7777 7 hours ago +1

      @@mrpocock Yeah, I finished the video and there was no indication yet of what, if anything, the authors have already figured out. But I believe the poster of the video said he’ll be making a part two, so there may be more info to come. I’m going to read the paper in the meantime.

    • @jakobrees4973
      @jakobrees4973 6 hours ago +2

      @@therainman7777 The trick they use is subtracting the gradient's projection onto the weights from the gradient before using it to take a step. The idea is that the NLM (naïve loss minimization) direction shows up in the model parameters: we don't want to keep moving in the NLM direction, so we remove it from the gradient.
      From my (very quick) testing of the authors' method, it does not always seem to work that well; it is specifically useful once we start to overfit drastically.
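
      For reference, that projection is only a couple of lines. A rough sketch of the idea as described in this reply (not the authors' exact code; the flattening and the plain SGD step are my assumptions):

          import torch

          @torch.no_grad()
          def orthogonal_grad_step(param, lr=1e-3):
              # Remove the gradient component that points along the weights
              # themselves (the direction that merely rescales the logits),
              # then take a plain SGD step with what remains.
              g, w = param.grad.flatten(), param.detach().flatten()
              g_parallel = (g @ w) / (w @ w).clamp_min(1e-12) * w
              param -= lr * (g - g_parallel).view_as(param)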

  • @ibgib
    @ibgib 12 hours ago +4

    How does the most important AI channel have only 50k subs? Great video, ty.

    • @Pure_Science_and_Technology
      @Pure_Science_and_Technology 8 hours ago +1

      I totally agree. I think I’ve watched every video and it’s paid dividends in many areas.

    • @ibgib
      @ibgib 7 hours ago +1

      @Pure_Science_and_Technology I've seen a couple dozen to some degree. It's hard to keep up with such good content, so I have to pick my battles. Can't play this channel at 2x!

  • @be1tube
    @be1tube 2 hours ago +1

    This did not explain grokking to me, because my question is not "why does it take so long to happen" but "why does it happen at all after memorization?" Why is there still a training signal after the network has memorized the training examples?

  • @drowningpenguin1588
    @drowningpenguin1588 12 hours ago +1

    Really appreciate the quality of your videos!

  • @be1tube
    @be1tube 2 hours ago

    There was a paper many months ago showing that amplifying certain frequency components of the gradients (I think it was a low-pass filter, but it could have been the opposite) skipped most of the delay before grokking.
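
    This sounds like the "Grokfast" idea of amplifying the slow (low-frequency) component of the gradients. A minimal sketch of that idea, with an exponential moving average as the low-pass filter; the function name and constants here are my own:

        import torch

        def amplify_slow_gradients(params, ema, alpha=0.98, lam=2.0):
            # Low-pass filter each gradient with an EMA, then add the
            # amplified slow component back onto the raw gradient.
            # Call between loss.backward() and optimizer.step(),
            # with `ema` a dict that persists across steps.
            for p in params:
                if p.grad is None:
                    continue
                ema[p] = alpha * ema.get(p, torch.zeros_like(p)) + (1 - alpha) * p.grad
                p.grad.add_(lam * ema[p])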

  • @KilgoreTroutAsf
    @KilgoreTroutAsf 5 hours ago +4

    This is the vanishing gradient problem all over again.

    • @yeetdeets
      @yeetdeets 5 hours ago +2

      Exploding in this case though, right?

  • @luke.perkin.online
    @luke.perkin.online 11 hours ago +2

    Link the paper in the description please!

    • @k.c.sunshine1934
      @k.c.sunshine1934 6 hours ago

      18:40 After learning from my mistakes, I realize that YouTube is not a fan of direct links (over-protection of copyright, I guess). YouTube is a social media tool rather than an academic system. You can find the link starting at my reference mark.

  • @breakablec
    @breakablec 5 hours ago

    Sounds like:
    1. once the error becomes low, you want to stop the maximisation of already-optimal parameters so that the gradient survives
    2. you want to use number formats that can represent a large integer part and a small fractional part at once, to increase precision

  • @andikunar7183
    @andikunar7183 7 hours ago +1

    Thanks, AMAZING content, WOW! Does this mean, in non-mathematical/layman's terms, that good, smaller-context knowledge samples (decreased dimensionality) during training help with grokking?

    • @polyscopes
      @polyscopes 1 hour ago +1

      I think he was saying that the decreased dimensionality helped prevent memorization, forcing the model to generalize sooner instead of memorizing the training data first and then learning to generalize.

  • @timealchemist7508
    @timealchemist7508 11 hours ago +2

    Ugh… Cliffhanger. 😂 Looking forward to tomorrow’s video!

  • @mohl-bodell2948
    @mohl-bodell2948 3 hours ago +1

    What a cliffhanger...

    • @polyscopes
      @polyscopes 1 hour ago

      For real haha, LLM training getting 2x+ cheaper overnight.

  • @ChaseFreedomMusician
    @ChaseFreedomMusician 7 hours ago

    I'm really looking forward to part 2!!

  • @TimoGraw
    @TimoGraw 11 hours ago +1

    Am I correct in my understanding that this is about a significant reduction in transformer model training time?

    • @GodbornNoven
      @GodbornNoven 10 hours ago +1

      Not exactly, but it does save on costs and a bit on compute, because we no longer need regularization to achieve generalization; it can be achieved through StableMax and AdamW. But it does help our understanding of AI and generalization, which could theoretically allow us to develop better architectures for learning.
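
      For anyone wondering what StableMax is: my reading of the paper is that it keeps the softmax structure but swaps the exponential for a function that grows only linearly on positive inputs, so large logits can't saturate it. A sketch under that assumption (verify against the paper):

          import torch

          def stablemax(x, dim=-1):
              # s(x) grows linearly for x >= 0 and decays as 1/(1 - x)
              # for x < 0, unlike the explosive exp() inside softmax.
              s = torch.where(x >= 0, x + 1.0, 1.0 / (1.0 - x))
              return s / s.sum(dim=dim, keepdim=True)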

  • @irbsurfer1585
    @irbsurfer1585 10 hours ago +1

    Great video!

  • @mohammedbenaissa1278
    @mohammedbenaissa1278 11 hours ago +2

    Why do we have to wait for tomorrow?

    • @johncolbourne7789
      @johncolbourne7789 7 hours ago +2

      😂 Where is my AI paper explainer video, YouTuber slave?

    • @fkxfkx
      @fkxfkx 1 hour ago

      because it is not here yet

  • @Sirus20x6
    @Sirus20x6 1 hour ago

    So train a low-rank model until you run out of space to fit more learning, and slowly increase the quantization precision?

  • @En1Gm4A
    @En1Gm4A 12 hours ago

    Interesting stuff. I ran into such an issue during my master's thesis.

  • @tikendraw
    @tikendraw 4 minutes ago

    Let's say a company hires you as a data science researcher. With your current knowledge and the same limitations as the tech giants, how far could you take an LLM designed and trained from scratch?

  • @mloewen248
    @mloewen248 7 hours ago

    Wouldn't simply adding noise to the input data then solve this issue after repeated epochs?

  • @TheDarkhawk243
    @TheDarkhawk243 3 hours ago

    Why do you never link papers?

  • @АлексейТучак-м4ч
    @АлексейТучак-м4ч 11 hours ago

    Well, the first thing that comes to mind is to use something slower-growing than exponentials in the normalization layer.

  • @mariusj.2192
    @mariusj.2192 8 hours ago

    I don't understand why scaling the logits wouldn't be a helpful learning objective if the correct class already has the largest logit value.
    The whole network before it would be incentivised to increase whatever contributed to the positive logits to make them more positive, and to strengthen the contributions to the negative logits to make them more negative - scaling the output of the LM head is not limited to the adjustment of its own weights.

  • @HoldMyData
    @HoldMyData 7 hours ago

    So now what am I going to waste my time on while waiting for training runs? @24:35 😅 Great explanation, thanks. I remember Nous or someone, when all of this first started, saying "OK, so we just keep going and going and eventually it works". This makes it understandable.

  • @alaneric1618
    @alaneric1618 9 hours ago

    I've often wondered whether, if we used floating point numbers with an extremely large number of bits (like thousands), any novel learning would happen in the least significant digits. I'm surprised this is only now being talked about. Most comp sci people are very familiar with floating point. Many monetary systems have numerical scaling issues that most engineers assume don't exist, because all the small numbers are presumably just a wash. I once wrote a paper showing that this is not always the case and people are leaving money on the table.
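
    The absorption effect described here is easy to reproduce; a two-line NumPy check shows a small update vanishing at float32 precision but surviving at float64:

        import numpy as np

        # Near 1e8 one float32 ULP is 8, so adding 1.0 changes nothing.
        print(np.float32(1e8) + np.float32(1.0) == np.float32(1e8))  # True
        print(np.float64(1e8) + np.float64(1.0) == np.float64(1e8))  # False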

  • @LiesAtRest
    @LiesAtRest 13 hours ago

    I said this under a previous video where you asked for the mathematical reason, but since nobody was interested I deleted it... what I said was: "it is because the 'approximation' system causes a loss of significant digits that are deemed insignificant by a cutoff different from the necessary one" 😥

    • @dot1298
      @dot1298 12 hours ago

      why not just use 64-bit precision everywhere, or even 128-bit precision?!

    • @drdca8263
      @drdca8263 11 hours ago +1

      Why would you delete a comment due to people not being interested?

    • @willr0073
      @willr0073 6 hours ago

      @@dot1298 Memory requirements

  • @aayatrubab
    @aayatrubab 13 hours ago

    nice thanks :)

  • @грешилов
    @грешилов 13 hours ago

  • @wwkk4964
    @wwkk4964 12 hours ago

    Kroggink

  • @rubncarmona
      @rubncarmona 7 hours ago

    This makes me think of nGPT from Nvidia a bunch

  • @seanharbinger
    @seanharbinger 13 hours ago +2

    Grokking is when a hipster in Silicon Valley attempts to summarize a new subject by fluttering their eyelids like an android in download mode - only to emit imbecilic irony.