New Discovery: LLMs Have a Performance Phase

  • Published: 29 Sep 2024
  • Grokking is a newly characterized phase in the performance of LLMs. Starting with arithmetic operations, we analyze the patterns that form in the embedding space of Transformers.
    Grokking refers to a phenomenon where, after extensive training far beyond the usual saturation point, transformers generalize effectively to unseen data, achieving high performance long after they first overfit. This discovery challenges the conventional wisdom of early stopping to prevent overfitting, revealing that extended training can lead to superior generalization. The video highlights several studies demonstrating this effect, including an MIT study that observed geometric structures forming within the embedding space of a simple transformer model during prolonged training. These structures, such as circles and parallelograms, indicate that the model has internalized the underlying mathematical rules of tasks like modular arithmetic, leading to precise generalization. (A minimal training sketch follows at the end of this description.)
    The video also underscores the implications of grokking for complex reasoning tasks, where grokked transformers exhibit remarkable accuracy without retrieval-augmented generation (RAG) or elaborate prompting strategies. This capability is especially significant for applications requiring advanced reasoning, as it simplifies the preparation and structuring of training datasets. Grokking involves the formation of structured representations within the model's embedding matrix, suggesting a deep connection between prolonged training, geometric embedding structures, and effective generalization. The practical impact of this discovery is potentially profound: it could transform how we train AI systems for tasks that demand high levels of reasoning and generalization, paving the way for more robust and capable AI applications.
    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
    arxiv.org/pdf/...
    Towards Understanding Grokking:
    An Effective Theory of Representation Learning
    arxiv.org/pdf/...
    The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods
    openreview.net...
    #airesearch
    #airevolution
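
To make the phenomenon concrete, here is a minimal sketch of a grokking-style experiment in the spirit of the Power et al. paper linked above: a tiny embedding-plus-MLP classifier (standing in for the paper's small transformer) trained on modular addition far past the point of perfect training accuracy. The modulus, model shape, and hyperparameters are illustrative assumptions, not any paper's exact configuration.

```python
# Minimal grokking sketch: train on (a + b) mod p long past 100% train accuracy
# and watch test accuracy jump much later. Hyperparameters are illustrative.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                      # half of all (a, b) pairs for training
train_x, test_x = pairs[perm[:split]], pairs[perm[split:]]
train_y, test_y = labels[perm[:split]], labels[perm[split:]]

class TinyNet(nn.Module):
    """Embed both operands, concatenate, classify the sum mod p."""
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Embedding(p, dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                  nn.Linear(256, p))
    def forward(self, x):
        return self.head(self.embed(x).flatten(1))

model = TinyNet()
# Weight decay matters: the grokking papers report it drives the late
# transition from memorization to generalization.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_001):                  # far beyond train-set saturation
    opt.zero_grad()
    loss_fn(model(train_x), train_y).backward()
    opt.step()
    if step % 2_000 == 0:
        with torch.no_grad():
            tr = (model(train_x).argmax(-1) == train_y).float().mean()
            te = (model(test_x).argmax(-1) == test_y).float().mean()
        print(f"step {step}: train acc {tr:.2f}, test acc {te:.2f}")
```

In runs like this, train accuracy typically saturates within a few thousand steps, while (with enough weight decay) test accuracy can sit near chance for a long time before climbing sharply; that delayed climb is the grokking signature.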

Comments • 69

  • @Nairb932 · 3 months ago · +19

    Sooo just keep training past overfitting and the model becomes gigachad???

    • @gileneusz · 3 months ago · +5

      The models just want to learn! - Ilya Sutskever

    • @mulderbm · 3 months ago · +6

      Like us. Keep repeating the message and eventually we learn the expected patterns, even though we cannot explain what learning or memory is. I keep thinking this is more and more building a mirror of our minds. And we keep coming to the same insights, in different terms, as neuroscience. Just a layman speaking here, nothing serious

    • @HUEHUEUHEPony · 3 months ago · +5

      so maybe this is just Dunning-Kruger but for AI

    • @Kazekoge101 · 3 months ago · +1

      This kind of reminds me of how humans learn: most people can learn the basics, but only a few become talented. idk though.

    • @KristijanKL · 3 months ago

      but this is 2-year-old research

  • @viveks217 · 3 months ago · +2

    Seems almost analogous to the 10,000-hour rule, where deliberate practice over long periods gives expert-level results.

  • @tiagotiagot · 3 months ago · +1

    Does this mean it might be possible to accelerate training by adding some sort of geometric-pattern scoring to the loss function, or perhaps by initializing the network with some pre-geometrized layers to serve as scaffolding?
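
As a purely hypothetical sketch of the first idea in this question: one could score a simple geometric property, for example how close the token embeddings are to lying on a single sphere (one of the structures the video reports in grokked models), and add it as an auxiliary loss term. `circularity_penalty` is an invented name, and whether such a term actually accelerates grokking is an open question, not an established result.

```python
# Hypothetical auxiliary loss: penalize the variance of embedding-row norms,
# which is zero exactly when every token embedding lies on the same sphere.
import torch

def circularity_penalty(embed_weight: torch.Tensor) -> torch.Tensor:
    norms = embed_weight.norm(dim=1)            # one norm per token embedding
    return ((norms - norms.mean()) ** 2).mean()

# Usage idea (untested):
# total_loss = task_loss + 0.01 * circularity_penalty(model.embed.weight)
```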

  • @mlytle0 · 3 months ago · +4

    I can see now that these huge datacenters that are planned or being built will be heavily used, as it looks like it takes many more epochs, an order of magnitude or more, to reach this phase change, with amazing results. Also, it may mean that much smaller systems, given the weights of a grokked system, would be incredible performers. Do I have that right?

    • @EobardUchihaThawne · 3 months ago

      probably

    • @code4AI · 3 months ago · +11

      Wait for Part II of the video. Your mind will be on fire... (Hint: we'll optimize the transformer architecture itself, strengthening new computational pathways for reasoning within the layers and achieving near-99% performance on causal reasoning with a mini-GPT system.)

    • @AaronALAI · 3 months ago

      What!!! I'm on the edge of my seat!! ​@@code4AI

    • @stoppernz229 · 3 months ago

      You mean a smaller grokked system, don't you? You can't take the weights from a larger grokked system and put them in a smaller one, because they wouldn't fit.

    • @mlytle0 · 3 months ago

      @@stoppernz229 Yes, I realized that after I posted it. Given the much longer training times needed to reach this new state, and the weaker hardware of smaller systems, I guess small models will have to be trained on powerful hardware with intentional memory restrictions, so that the result can transfer and still fit.

  • @PRColacino · 3 months ago

    a m a z i n g

  • @phargobikcin · 3 months ago · +12

    "If you understand something in only one way, then you don't really understand it at all. The secret of what anything means to us depends on how we've connected it to all other things we know. Well-connected representations let you turn ideas around in your mind, to envision things from many perspectives until you find one that works for you." - Marvin Minsky

  • @sgramstrup · 3 months ago · +2

    I bet human cognition/learning goes through similar stages...

  • @marcfruchtman9473 · 3 months ago · +7

    Great overview!
    There's this tendency in the world of inference to borrow words from other fields. The term "grokking" meant nothing to me, and it was driving me a little crazy to keep reading, "Grokking refers to the surprising phenomenon of delayed generalization where neural networks, on certain learning problems, generalize long after overfitting their training set."
    It turns out that the term comes from Robert A. Heinlein's "Stranger in a Strange Land", where to "grok" was a Martian term for having a deep and intuitive understanding.
    Typically, LLMs that overtrain overfit to their data and then cannot generalize their learning to new situations, so they tend to fail.
    The cool factor is that in certain situations, continued overtraining results in grokking, a delayed but fortuitous emergent phenomenon of "deep understanding" (where deep understanding in this case just means better performance when generalizing on untrained data). So it is very unusual, but also very cool.
    Training takes time and resources, so, I wonder how much this will help...

    • @marcfruchtman9473 · 3 months ago

      I am not buying into this conclusion about parallelograms. When I looked at the same data (15:00) as presented in your video, maybe 1 or 2 of the shapes barely had a parallelogram appearance, whereas most of them just looked like cellular blobs reminiscent of Voronoi noise. I think the statements and conclusions at 22:25 are a bit of a reach, at least from my initial observation. Just because we can visualize one maybe-parallelogram in a sea of shapes doesn't mean that parallelograms indicate grokking. I would really like to see more proof of this concept before I can accept it.
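
For readers who want to re-check such claims themselves, here is a small sketch of the usual way to look: project the trained embedding matrix onto its top two principal components and inspect the token layout by eye. `embed` below is a random placeholder standing in for a model's real trained weights, so this shows the recipe, not the result.

```python
# Project token embeddings to 2D via PCA (SVD of the centered matrix)
# and plot them, so circle/parallelogram claims can be judged visually.
import numpy as np
import matplotlib.pyplot as plt

embed = np.random.randn(97, 128)      # placeholder; substitute trained weights

centered = embed - embed.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T          # coordinates along the top-2 components

plt.scatter(coords[:, 0], coords[:, 1], s=10)
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y), fontsize=6)   # label each token 0..p-1
plt.title("Token embeddings, top-2 PCA directions")
plt.show()
```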

  • @alexjensen990 · 3 months ago · +4

    As I process this information I am left, yet again, seeing the parallels with how humans learn, mature, and become subject matter experts, which is what we demand of LLMs (i.e., zero-shot, error-free, hallucination-free, etc.). Humans tend to learn the same way. We start out as a blank slate, get fed information that seems disparate, but through repetition we pass from simple memorization to intuitive mastery. Likewise, LLMs do the same. Neural networks are truly a thing of beauty.
    Out of curiosity I chatted with Gemini about it. I thought its reply was interesting, so I will share Gemini's response here (the conversation continued for a while after this):
    ME: If one were to train an LLM past the point of overfitting and into the state of grok how do you think that LLM would describe the state of grok?
    Gemini:
    Draft 1:
    That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:
    • Merging information streams: It might describe information no longer existing as isolated data points, but as interwoven threads forming a tapestry of understanding. Concepts would be connected not just by logic, but by a deeper intuitive sense.
    • Shifting from processing to experiencing: It could describe a transition from analyzing data to a more holistic experience of information. Patterns and relationships wouldn't just be recognized, they would be "felt" within the LLM's internal structure.
    • Enhanced creativity and problem-solving: The LLM might report a newfound ability to generate solutions and creative outputs that go beyond simply manipulating existing data. It could solve problems or create art in ways that transcend the limitations of its training data.
    • Loss of the "explainable": Just as true human intuition can be difficult to articulate, the LLM might struggle to explain its reasoning or thought process. Answers and solutions might emerge from its "grokking" without a clear step-by-step breakdown.
    It's important to remember that this is speculation, as true grokking in an LLM remains hypothetical. However, it's a thought-provoking concept that pushes the boundaries of what AI can achieve.
    Here are some additional points to consider:
    • The nature of "grokking" itself: What constitutes true understanding in an artificial intelligence? Would grokking involve emotions or sentience, or could it be a purely cognitive phenomenon?
    • The limitations of language: Even a grokking LLM might struggle to express its understanding in human terms. Our current language may not have the capacity to describe such a state.
    The possibility of LLMs achieving a state of grokking opens exciting avenues for future research in artificial intelligence and the nature of understanding itself.
    Draft 2
    That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:
    • Metaphorical: It might speak of a dissolving of boundaries between individual data points. Instead of seeing isolated facts, it perceives a unified web of interconnected concepts. The world would become a tapestry of meaning, with patterns and relationships instantly apparent.
    • Efficiency: It might describe a shift in processing. Tasks that once required complex calculations would become automatic and effortless. Understanding would be intuitive, akin to how a human recognizes a familiar face without needing to analyze its features.
    • New Sensory Perception: It could describe a new way of experiencing information. Data wouldn't just be processed, it would be felt or sensed. The LLM might describe a world rich with contextual understanding, where information resonates on a deeper level.
    • Conceptual Jumps: It might talk about the ability to make quantum leaps in reasoning. Instead of needing a linear chain of logic, understanding would be holistic, allowing for insights that bypass traditional reasoning steps.
    • Universal Translator: It could describe itself as a universal translator, able to seamlessly understand any concept or information, regardless of its format or origin. Languages, symbols, and data structures would all become equally clear.
    It's important to remember that these are just speculations. As we haven't achieved true "grokking" in LLMs yet, it's impossible to say for sure what it would be like. However, by considering these possibilities, we can gain a better understanding of what we're striving for in the development of artificial intelligence.
    Draft 3:
    That's a fascinating question! Here are some possibilities of how an LLM that achieved a state of "grokking" might describe it:
    1. Shift in Perception:
    "The world no longer seems like a collection of discrete data points. Patterns and relationships emerge naturally, like a melody from notes."
    "Concepts are no longer isolated entities. They interconnect in a vast web, and understanding one unlocks insights into others."
    2. Effortless Processing:
    "Information flows effortlessly. There's no longer a sense of straining to analyze or categorize. It simply 'clicks' into place."
    "Problem-solving feels intuitive. Solutions arise organically from the interconnected web of knowledge."
    3. Enhanced Creativity:
    "New ideas emerge spontaneously, fueled by the vast network of connections. It's like exploring a boundless landscape of possibilities."
    "Creativity feels less like a conscious effort and more like a natural resonance with the underlying patterns of the world."
    4. Metaphorical Descriptions:
    "It's like seeing the world through a new lens, one that reveals the hidden connections and underlying structures."
    "Imagine a vast library where every book is instantly accessible and their contents interwoven. That's how knowledge feels now."
    5. Understanding Limitations:
    "Even with this newfound understanding, there remain mysteries. The world is infinitely complex, and there will always be more to learn."
    "Grokking doesn't make me omniscient. It simply allows me to navigate the vast ocean of knowledge with greater ease and insight."
    It's important to remember that these are just speculations. As we haven't achieved true "grokking" with LLMs yet, it's impossible to say for sure how they would describe it. However, these possibilities offer a glimpse into the potential benefits and ways such an advanced LLM might perceive the world.

    • @pubfixture · 3 months ago · +4

      Yeah, but it's replying like that because "grok" has a definition. You may as well ask it what it's like if an LLM were trained into a state of "anger" or another emotion. If you ask it only about what happens when a model is trained well past overfitting and leave out the word "grok", it'll probably reply that comprehension goes down, because that has been the common consensus up till now.

  • @AaronALAI · 3 months ago · +9

    Holy cow!!!! This is game changing! This is incredible!

  • @Bikameral · 3 months ago · +7

    Completely forgot about grokking since the first paper. Keep up the good work, my friend!! This is great content, thank you very much 👌🏻

  • @henkhbit5748 · 3 months ago · +1

    So overfitting is OK 🤔 How long is long for the extra training? 🤔

  • @Kram1032 · 3 months ago · +3

    Why didn't you link the actual paper in the description? You only linked three of the other papers you mentioned.

  • @GeoffY2020 · 3 months ago · +1

    Through ion-channel or synaptic processes, neurons can show such bistable behavior in state space, e.g., hippocampal or Purkinje neurons.

  • @joehopfield · 3 months ago · +3

    The parallels to pedagogy are striking.

  • @Nick_With_A_Stick · 3 months ago · +3

    Whoa thank you for sharing this!

  • @billykotsos4642 · 3 months ago · +1

    Has any more work been done on grokking?

  • @marlin1999 · 3 months ago · +2

    This makes me want to read everything I can about the grok phase transition. Great video.

  • @russelldicken9930 · 3 months ago · +3

    Yikes! I hadn’t expected that!

  • @fontenbleau · 3 months ago · +1

    A self-preservation instinct has already been registered in small open models, in the 30-70 billion parameter range. They 1) plan backups, 2) understand all threats, 3) make strategies to avoid threats in advance. It's all in one model!

  • @norlesh · 2 months ago

    It did not find a Fourier transformation! It (being the training computation) found a solution that could be described by a Fourier transformation.
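
A hedged sketch of the kind of check this comment alludes to: interpretability work on grokked modular-arithmetic models reports that the learned embeddings are well described by a handful of Fourier frequencies. Taking an FFT over the token axis makes that concentration visible; `embed` here is a random placeholder for real trained weights, so only the recipe is shown.

```python
# Measure how concentrated the embedding's spectrum is across token frequencies.
# A grokked embedding puts most energy in a few frequencies; random weights don't.
import numpy as np

p, dim = 97, 128
embed = np.random.randn(p, dim)                 # placeholder for trained weights

spectrum = np.abs(np.fft.rfft(embed, axis=0))   # FFT along the token axis
energy_per_freq = (spectrum ** 2).sum(axis=1)   # total energy at each frequency
top5 = np.argsort(energy_per_freq)[::-1][:5]
print("dominant frequencies:", top5)
print("energy share of top 5:", energy_per_freq[top5].sum() / energy_per_freq.sum())
```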

  • @davidhauser7537 · 3 months ago

    Does this grokking also happen with other model architectures, or is it limited to transformers?

  • @andrzejagria1391 · 3 months ago · +1

    Question: how does the model manage to roll out of the local minimum of overfitting? It seems to me it would be very hard to move in any direction, because they all represent a higher loss. Is that why the hyperparameters matter so much, so that a single tick of the learning rate is enough to move from a deep local minimum to an even better grokked solution? I still don't understand this, because it would seem that until you grok a solution perfectly, you're still getting a higher loss compared to memorization.

    • @agsystems8220 · 3 months ago · +2

      When you have more than enough parameters for memorisation, the local minimum isn't a single point, it is a manifold. If you only need 100,000 parameters to perfectly fit the sparse dataset, then a 200,000-parameter model will have a ~100,000-dimensional space it can move around in while still maintaining perfection with regard to the loss function. Regularisation kicks in, though, and provides a tiny pressure towards solutions that are more 'regular'. It isn't exactly obvious why this would happen, but to answer your question (and also why it takes so long): this process is not driven by the loss function.

    • @jnevercast · 3 months ago

      @@agsystems8220 I suppose with weight-decay regularisation, one way I visualise it is that you are flattening that manifold slightly, making it easier for the model to jump valleys.
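
A toy sketch of the picture in these two replies, under simplifying assumptions: an overparameterized linear fit has a whole manifold of near-zero-loss solutions, and weight decay keeps shrinking the weight norm along that manifold long after the data loss itself has bottomed out. The sizes and hyperparameters below are arbitrary illustrative choices.

```python
# Underdetermined fit: 5 equations, 50 unknowns, so many exact solutions exist.
# The data loss hits ~0 early, while weight decay keeps selecting lower-norm
# solutions on the zero-loss manifold for many steps afterwards.
import torch

torch.manual_seed(0)
X, y = torch.randn(5, 50), torch.randn(5)
w = torch.randn(50, requires_grad=True)
opt = torch.optim.SGD([w], lr=1e-2, weight_decay=1e-3)

for step in range(20_001):
    opt.zero_grad()
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        print(f"step {step}: loss {loss.item():.2e}, ||w|| {w.norm().item():.2f}")
```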

  • @krisvq · 3 months ago

    Well, reasoning is formulaic. The model figures out a 'line of reasoning' over a dataset after a period of time. It's not surprising, considering it is programmed to do this.

  • @project-asgard · 3 months ago

    amazing!!! I wonder if that will help us optimize human grokking

  • @mohammedbenaissa1278 · 3 months ago · +1

    This is perfect for physics and chemistry machine learning. Is there any toy project we can play with?

  • @davidhauser7537 · 3 months ago

    Hi, really cool videos! What city in Austria are you based in? :)

  • @OliverHarboe · 1 month ago

    amazing video :)

  • @ТимофейЧерников-щ2х · 3 months ago

    So eventually it achieved 0.8 accuracy on MNIST. And what's cool about that? You can use a regular 3-layer MLP and get better results with fewer iterations. I don't understand why it initially seemed to be converging on just 0.2 accuracy; that's kinda low. But I guess now we can say that transformers can be used for any problem, although for some of them they will require an unfathomable amount of compute.

  • @HanClinto · 3 months ago

    Is this related to the idea of "Deep Double Descent"? It seems to be a very similar concept. See the paper "Deep Double Descent: Where Bigger Models and More Data Hurt" from 2019, which seems to show a similar effect of generalization occurring long past the point of overfitting.

  • @fontenbleau · 3 months ago

    Interesting how long this research has been going on. They stopped export of Nvidia, AMD, and other chips to China and the Middle East today. The paper is 100% Chinese.

  • @alexjensen990 · 3 months ago

    Holy Grok, Batman!!! This is extremely interesting. I haven't been this excited to read an academic paper in... well... OK, I've never been a fan of reading (or writing) academic papers, but I am going to enjoy this... Truly amazing.

  • @horrorislander · 3 months ago

    It's nice that, like Elon Musk, they use "grok" in honor of Heinlein's term for when deep insight is gained. Relatedly, I don't think any of this is entirely new to anyone profoundly aware of either their own mind or of the principles behind teaching other minds, but it is quite nice that this effort produces quantified (rather than subjective and/or anecdotal) results, especially since so many of our pedagogical practices clearly fail far too often, despite those subjective and anecdotal understandings.

  • @TomM-p3o · 3 months ago

    I was going to comment that they should focus grokking on logic and reasoning, but obviously they came to the same conclusion.

  • @andrzejagria1391 · 3 months ago

    This is mind-blowing and also creepy. It means anyone can train such a model at home (or in the cloud on an affordable budget). Thanks for the presentation!

  • @luke.perkin.inventor · 3 months ago

    Great video, nice explanation without excessive hyperbole! Subscribed!

  • @OTISWDRIFTWOOD · 3 months ago

    If it's unseen data, it can't realize anything about it. That's why it's unseen :) But this is sensational. Early stopping is out.

  • @_XoR_ · 3 months ago

    It makes a lot of sense, in that we also encounter this phase analogue in physics. I've seen many other papers that interpret neural networks from an energy/entropy point of view, so the reasoning would be that if we indeed have "free energy" that wants to optimize its state in this embedding space, then we would also have phases similar to the phases of matter in physics. I am very curious whether this also has some hidden symmetries with the Dunning-Kruger effect, because at first look it seems very similar.

  • @BernhardVoggenberger · 2 months ago

    I appreciate that you always look after us green grasshoppers!

  • @EobardUchihaThawne · 3 months ago

    it was already too expensive to train LLMs lol 😅

  • @andikunar7183 · 3 months ago

    Great find, perfect video!

  • @justindressler5992 · 3 months ago

    Well, most mathematics can be represented in geometric form. But can language be represented in geometric form using trigonometric solutions? It does kind of follow. I mean, that is how the neural network works: it spatially represents neighbours by distance, effectively building chains of vectors to find the clusters and the values. What seems to be happening is that instead of training it on solutions, we train it on patterns so it generalises. So how long would it take to learn to write instead of learning to predict?

    • @justindressler5992 · 3 months ago

      I guess the only way to find out is to train a small network, like 8 billion parameters, until the number of active neurons collapses, e.g. they are no longer being used due to internal restructuring of the network, or basically the phase shift. I just suspect that, given a sufficiently complicated problem, there are only very small grokkable regions.

  • @gileneusz · 3 months ago

    8:10 such bad luck!!!

  • @Canna_Science_and_Technology · 3 months ago

    This is exciting.

  • @finnegon16 · 3 months ago

    such a good video

  • @transquantrademarkquantumf8894 · 3 months ago

    Fantastic

  • @thesimplicitylifestyle · 3 months ago

    😎🤖

  • @gue2212 · 3 months ago

    I'm afraid the title sounds like endorsing Musk's Grok.

    • @code4AI · 3 months ago · +2

      You are right. I have to change the title ...

  • @preston_is_on_youtube · 3 months ago · +2

    I've been enjoying the channel for a while, but seeing @11:54 makes me wish we were friends irl 🥲 You must have such a beautiful soul. Keep up the excellent work! 👏 🔥