LLM - Reasoning SOLVED (new research)

  • Published: 5 Jun 2024
  • Grokking transformers: a technique for infusing transformers with near-perfect causal reasoning abilities. (Note: grokking has nothing to do with Musk's AI Grok or with Groq Inc.'s fast-inference hardware.)
    Grokking achieves this by enabling transformers to identify hierarchical structures within human sentences. Through extended training, the internal structure of the transformer undergoes a fundamental shift, allowing the formation of specific neural pathways called "generalizing circuits." These circuits are instrumental in efficiently encoding and retrieving knowledge for reasoning tasks. To create grokked transformers, several key elements are needed.
    First, extensive training is essential, particularly for complex reasoning tasks that require structured knowledge. Second, the transformer architecture must have an optimal depth, balancing computational efficiency with reasoning performance. Third, a carefully designed training dataset is crucial. This dataset should incorporate atomic facts and inferred facts, mimicking a formal system of axioms and theorems. Testing grokked transformers involves out-of-distribution examples that differ significantly from the training data, which helps assess the transformer's generalization capabilities.
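    A minimal Python sketch of such a dataset, assuming the paper's general setup of atomic (entity, relation, entity) facts composed into two-hop inferred facts; the entity/relation vocabulary, the 9:1 inferred-to-atomic ratio, and the held-out-heads OOD split are illustrative assumptions, not the authors' exact construction:

```python
import random

random.seed(0)
ENTITIES = [f"e{i}" for i in range(100)]
RELATIONS = [f"r{j}" for j in range(20)]

# Atomic facts: (head, relation) -> tail, playing the role of axioms.
atomic = {(h, r): random.choice(ENTITIES)
          for h in ENTITIES for r in random.sample(RELATIONS, 5)}

# Inferred facts: two-hop compositions (h, r1, r2) -> t, the "theorems"
# used by the composition task.
inferred = [((h, r1, r2), atomic[(m, r2)])
            for (h, r1), m in atomic.items()
            for r2 in RELATIONS if (m, r2) in atomic]

# The inferred-to-atomic ratio strongly influences grokking; 9:1 is an
# illustrative choice, not the paper's exact number.
random.shuffle(inferred)
train_pool = inferred[: 9 * len(atomic)]

# Out-of-distribution test: compositions whose head entity never
# appears in any training composition.
ood_heads = set(random.sample(ENTITIES, 10))
train = [(q, t) for q, t in train_pool if q[0] not in ood_heads]
ood_test = [(q, t) for q, t in inferred if q[0] in ood_heads]
```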
    Two tasks where grokked transformers excel are composition, where they outperform traditional methods that rely on external knowledge, and comparison, where they reason about similarities or differences between entities. The ratio of inferred to atomic data, the number of layers in the transformer, and the distribution of data within the training set all influence the grokking performance.
    To understand how grokked transformers work, we can leverage techniques like the logit lens, which analyzes internal activations to pinpoint which parts are involved in specific reasoning tasks, and causal tracing, which maps causal pathways through the transformer's layers. In conclusion, grokking transformers represent a promising approach to achieving near-perfect causal reasoning in large language models.
    By meticulously designing training data, optimizing the architecture, and employing techniques like the logit lens and causal tracing, we can unlock the potential of grokked transformers to tackle various reasoning challenges.
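    As a concrete illustration of the logit-lens technique, here is a minimal sketch using GPT-2 from the Hugging Face transformers library as a stand-in model (the grokked transformers in the paper are trained from scratch): each layer's hidden state is decoded through the final layer norm and the unembedding matrix to see what that layer would currently predict.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("Paris is the capital of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last position of every layer's hidden state through the
# final layer norm and the unembedding matrix ("logit lens").
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```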
    All rights w/ authors:
    Grokked Transformers are Implicit Reasoners:
    A Mechanistic Journey to the Edge of Generalization
    arxiv.org/pdf/2405.15071
    #airesearch
    #ainews
  • Science

Comments • 48

  • @mulderbm
    @mulderbm 23 days ago +8

    Indeed very good. It was under our nose all the time. Not sure why this research is only now being picked up. The first papers on grokking are from 2021, 2022, and partially earlier. Bringing this all together is very insightful. These series make me want to set it up and play with it 😂😂

  • @alexjensen990
    @alexjensen990 23 days ago +6

    Completely blown away by the test with the "old model"...

  • @luke2642
    @luke2642 22 days ago +5

    The causal tracing highlights how similar NNs are to just applying input-sensitive matrix multiplication. In the case of ReLUs they're zero or linear, so it's like a hierarchical bunch of switches that turn on just the right linear transform on the input to get the output. The fact that this works (effective, trainable, interpolates and generalises) still amazes me!
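    A quick NumPy sketch of that switching picture (the weights are random placeholders, purely illustrative): for a fixed input, the ReLU on/off mask selects a single linear map, and the network output equals that one matrix applied to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((3, 8))

x = rng.standard_normal(4)
h = W1 @ x
mask = (h > 0).astype(float)      # the "switches": which ReLUs fire
y = W2 @ (mask * h)               # forward pass through the ReLU layer

# Identical output from one input-dependent linear transform:
W_eff = W2 @ np.diag(mask) @ W1
assert np.allclose(y, W_eff @ x)
```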

  • @luke2642
    @luke2642 22 days ago +2

    The atomic facts on the graph at the 95% / 5% split remind me of the approach in reinforcement learning for physics models where you start with, for example, low gravity and high friction to dampen the system, then slowly increase/reduce each to bring it closer to reality. It makes unlearned high-frequency chaotic (deterministic) systems learnable.

  • @user-tm5nm9dp7l
    @user-tm5nm9dp7l 22 days ago +2

    Great video. If possible, make a lesson with Python code. It would help to understand better how it works. This science is a deep ocean.

  • @mlytle0
    @mlytle0 23 days ago +5

    Amazing stuff. We heard a few months ago about Q* and supposed advances in math ability at OpenAI on unreleased models, none of which has appeared in the public domain. This seems like a real advance, and it is publicly accessible. Part of me thinks OpenAI puts out a lot of hype to keep the interest up, but their model still hallucinates like crazy, nothing as solid as this appears to be.

    • @notaras1985
      @notaras1985 20 days ago

      How can we reduce hallucination?

  • @MultiNiktar
    @MultiNiktar 23 days ago +2

    This is a crazy good video, keep it up! The Algorithm will pick this channel up in no time.

    • @code4AI
      @code4AI  22 days ago +2

      Smile. Since I always decline when Google asks me to pay to advertise my own video to a broader audience, I am not a good customer for Google: I don't support a business model where I pay to promote my own video. Therefore I'll remain a stealth YT channel for a dedicated audience only.

  • @alexjensen990
    @alexjensen990 23 days ago

    Can't wait for the comparison!!!

    • @code4AI
      @code4AI  22 days ago

      A prominent feature in Part III.

  • @xenophobe3691
    @xenophobe3691 23 days ago +2

    Reminds me of the Ten Thousand Hours rule for mastery of a subject

  • @timgorn8927
    @timgorn8927 22 days ago

    Thank you very much! I loved this presentation.

    • @code4AI
      @code4AI  22 days ago

      Thank you for taking the time to send this feedback to me. Appreciate it.

  • @goodtothinkwith
    @goodtothinkwith 9 days ago

    Really incredible stuff

  • @TiagoTiagoT
    @TiagoTiagoT 23 days ago +6

    How about this: first train a model for grokking just on a pure logic dataset, randomly generated examples of logic (which should be easy to verify as correct), not language, just that stuff with letters and those weird symbols for logic gates/operators and so on; then once it groks that, move on to the next barebones level of mathematics, then climb up the math ladder at each grokking; at some point start including coding, physics, chemistry etc., and leave natural language for towards the end of the training ladder; ensuring the dataset for all steps follows the ideal ratio. Will we get an ASI that runs on a RasPi with something like this approach?

    • @huytruong6761
      @huytruong6761 22 days ago

      You are talking about curriculum learning, which has been around for many decades. The limitation is that different architectures require different curricula (the one you've proposed seems to work for human learning, but does it work for an arbitrary neural architecture? It is expensive to test many architectures!)
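      For what it's worth, a minimal sketch of such a curriculum loop (the stage names, the grokking criterion, and the train/eval stubs are illustrative placeholders, not an established recipe):

```python
curriculum = ["formal_logic", "arithmetic", "algebra",
              "code", "physics", "natural_language"]

def ood_accuracy(model, stage):
    """Placeholder: accuracy on held-out, out-of-distribution examples."""
    return 1.0  # stub so the sketch runs

def train_step(model, stage):
    """Placeholder: one optimization step on this stage's data."""

model = object()  # stand-in for an actual transformer

for stage in curriculum:
    # Keep training well past memorization until OOD accuracy jumps,
    # i.e. the model "groks" this stage, then move up the ladder.
    while ood_accuracy(model, stage) < 0.99:
        train_step(model, stage)
```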

    • @TiagoTiagoT
      @TiagoTiagoT 22 days ago

      @@huytruong6761 Combining the ratios thing with building foundations for rational circuits gradually, from the most basic concepts to more and more complex thinking, sounds like a good recipe for achieving high rational thought processes and understanding from the type of neural networks discussed in this video, no?

    • @huytruong6761
      @huytruong6761 22 days ago

      @@TiagoTiagoT But what is the architecture of the Transformer? Does it have 8 layers, 20? What is its hidden activation, hidden size, feed-forward size, dropout rate, etc.? This video shows that you need a whole research effort just to test whether different architectures grok; you are proposing not only to test different architectures but also to run an extensive curriculum for each architecture.

    • @TiagoTiagoT
      @TiagoTiagoT 22 days ago

      @@huytruong6761 Ah, I see. I got the impression there was already a good starting point to pick an architecture that would grok with just about anything it was trained with...

    • @815TypeSirius
      @815TypeSirius 20 days ago

      The most ideal data is 49.999…% noise (49.9 repeating, which equals 50%) and 50% signal.

  • @LamontCranston-qh2rv
    @LamontCranston-qh2rv 22 days ago +3

    If these structures can be detected, surely they can be predicted? Can we build a model that will look at a dataset and output a good guess at what the weights of a grokked model would be? If so, maybe we can radically diminish the amount of computation required to achieve grokking? Perhaps even predict optimal cross layer memory sharing? I wonder if this might require spatial reasoning. Specifically a kind of self-reflective "imagining" of the model's blackbox architecture, as well as possible, and desirable structures within it?

    • @huytruong6761
      @huytruong6761 22 days ago

      Detectability assumes specific instances of dataset, architecture, algorithm, and the confirmed grokked subject model. To produce a hypervisor prediction model as you describe, you must train that model over many datasets, architectures, and algorithms, while also training your subject architecture until it groks to get the ground-truth labels (this simply requires tremendously more computational resources than it may be worth)...

    • @LamontCranston-qh2rv
      @LamontCranston-qh2rv 22 days ago

      @@huytruong6761 Fair enough. It's like trying to predict where the needle in the haystack might be. Why waste time and resources? Why not just go look for it? Still, I can't help but think that, over time, a kind of library might emerge which essentially says that these kinds of structures tend to form in these kinds of models when confronted with this type of data. It may be a worthwhile starting point as opposed to the brute-force, train-to-death approach. Or, as you say, it could be another blind alley. Maybe the answer lies in the middle: trust your guess... but verify and abandon as needed? It is certainly true that martial arts masters, for example, don't typically take shortcuts to decades of training... but what if they could? It would amount to learning how best to learn. (A dynamic approach.) With this view into the black box, the professor has inspired an entirely new field of endeavor: Artificial Neuroscience. Necessary, perhaps, if we are to have any hope of knowing how or why this stuff runs off the rails, and how to (hopefully) fix it! Thank you very much for your exceptional reply, all the best to you!

    • @815TypeSirius
      @815TypeSirius 20 days ago

      No, it's not reciprocal. But things don't get interesting till they start organizing using hypergeometry. How do you think a brain is so efficient while a CPU is comically inefficient?

    • @LamontCranston-qh2rv
      @LamontCranston-qh2rv 12 days ago

      One answer is that the brain uses analog circuitry while LLMs (currently) run on digital circuits. Additionally, DNA itself can exhibit quantum tunneling effects in seemingly "intelligent" processes that are not yet well understood. If you are suggesting that human neurons process information in high-dimensional space... perhaps. How interesting!

    • @815TypeSirius
      @815TypeSirius 12 days ago

      @LamontCranston-qh2rv Oh, it's a "the brain is quantum" loon.

  • @lukeskywalker7029
    @lukeskywalker7029 22 days ago

    This all sounds too good to be true. However, the atomic/inferred knowledge thing is something I have had a gut feeling about for a long time.
    Can't wait to replicate this on some easy tasks with continued pre-training.

  • @notaras1985
    @notaras1985 20 days ago

    What should I do in order to make an AI helper model in my pharmacology lab?

  • @Daniel-Six
    @Daniel-Six 23 days ago +5

    Anyone who has read the Law of One transmissions might recognize the principle of "intelligent infinity" operating here.

    • @HUEHUEUHEPony
      @HUEHUEUHEPony 23 days ago +3

      Uhm, seek a doctor?

    • @Daniel-Six
      @Daniel-Six 23 days ago +1

      ​@@HUEHUEUHEPony Are you familiar with the Law of One?

  • @spkgyk
    @spkgyk 22 days ago

    Sorry if you covered this in another video, but what's the difference between parametric and non-parametric memory?

    • @code4AI
      @code4AI  22 days ago +1

      I'll explain it in detail in my next video. Thanks for pointing it out.

    • @generichuman_
      @generichuman_ 17 days ago +1

      He covered it in this video. Parametric memory is contained in the weights of the model, and non-parametric memory is contextual memory that you put into the prompt or retrieve with RAG (which still technically goes into the prompt).
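      A toy contrast of the two memory types (everything here is an illustrative stand-in, not a real LLM or retrieval stack):

```python
# "Parametric" memory: knowledge baked into the model at training time.
parametric_memory = {"capital of France": "Paris"}

# "Non-parametric" memory: an external corpus consulted at query time.
corpus = ["Berlin is the capital of Germany."]

def answer(query: str, use_rag: bool = False) -> str:
    if use_rag:
        # RAG-style: retrieve matching text and answer from the context.
        hits = [d for d in corpus if any(w in d for w in query.split())]
        return f"from context: {hits}"
    # Weights-only: anything not learned at training time is unknown.
    return parametric_memory.get(query, "unknown")

print(answer("capital of France"))                 # parametric hit
print(answer("capital of Germany"))                # parametric miss
print(answer("capital of Germany", use_rag=True))  # non-parametric hit
```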

  • @acasualviewer5861
    @acasualviewer5861 20 days ago

    What do they mean by sharing the information between the upper and lower layers? It's not clear to me how that is implemented. And that's kind of the key here.

    • @code4AI
      @code4AI  19 days ago

      I am referring to the architecture of a transformer.

    • @acasualviewer5861
      @acasualviewer5861 11 days ago

      @@code4AI Yes... but what kind of "sharing" do you mean? Just the normal mechanism of passing info to the next layer?

  • @RalphDratman
    @RalphDratman 20 days ago

    I think you have referred to the wrong paper at the bottom of your YouTube summary. You mention a "metric", "structural grokking" and "tree structuredness." I cannot find the words "metric", "structural" or "tree" in the paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" (arXiv 2405.15071), but all three of those terms are easy to find in "Grokking of Hierarchical Structure in Vanilla Transformers" (arXiv 2305.18741).

    • @code4AI
      @code4AI  19 days ago +2

      No, you are wrong ... but your comment provides a beautiful example of the inner workings of a vector store. When you look for the terms I used in my reference video: at 2:16 to 2:32 I introduce the new study by MIT and Stanford Univ. I present the title of the pre-print, the authors of the pre-print, and the https link to this pre-print, and one (!) second later (at 2:33 in my video) I introduce the term "Tree Structuredness" from that study.
      You (@RalphDratman) now comment that you can't find those words because you were looking in another pre-print that I mention in the video. A perfect example of the semantic and causal relation encoded in a "close-by" representation within a low-dimensional vector space.
      So whenever you don't find terms in a linear video sequence of mine, there is a high probability that literally one second before the term in question appears, the complete information on where to find the term(s) was given to you, including the title, the authors and the https link of the pre-print. Imagine a cosine-similarity function that returns the term and the identifier of the pre-print in question directly to you.
      Thank you for this comment.

    • @RalphDratman
      @RalphDratman 17 days ago

      @@code4AI 1) I was trying to be helpful.
      2) The reason I did not see the paper on the screen is that I was listening rather than watching.

  • @manslaughterinc.9135
    @manslaughterinc.9135 22 days ago +1

    Why do we have to exclude RAG from grokked LLMs? There is literally no reason why we can't RAG into a grokked LLM.

    • @frag_it
      @frag_it 22 days ago

      Yeah, I don't see RAG going away; a grokked LLM might even provide more reasoning over the context 😅

    • @code4AI
      @code4AI  22 days ago

      Great comment! Maybe I'll design an answer in an upcoming video!

  • @GerardSans
    @GerardSans 23 days ago +4

    Forcing features into the existing transformer architecture is a foolish idea when you can change its design to accommodate whatever features you need perfectly and fix all the known shortcomings.

    • @farrael004
      @farrael004 21 days ago +7

      Alright, Einstein. What does the architecture that solves all of the transformer's shortcomings look like?