GROKKED LLM beats RAG Reasoning (Part 3)
- Published: 26 Sep 2024
- We open the black box of GROKKED LLMs and analyze each layer of the transformer architecture for its performance in causal reasoning, after the grokking phase transition of our LLM.
Current research in AI clearly indicates that established LLMs, such as Gemini Pro 1.5 or GPT-4 Turbo, fail at deep reasoning, even when integrated into complex RAG systems.
A grokking phase transition is essential for an LLM to activate its high-performance phase, reaching close to 99% accuracy on unseen tasks in the development and test datasets.
#airesearch
#ainews
#insights
Honestly, yours has become my favorite vlog. Just fantastic.
A strong motivational feedback for future work. Thanks.
Would love a tutorial on grokking phi-3, this grokking thing is hard to wrap my head around.
It's nothing terribly special once you get into it. Basically, LLMs "learn" by producing patterns in matrix multiplications that are helpful for predicting information. I.e., if "dog" is the fifth token in your LLM's vocabulary, and the next word in the sequence should be "dog", then it's helpful to have a pattern of weights that predicts "5" in that sequence.
So, you end up with these really complicated patterns of weights that are hard to understand, and progress towards predicting the correct answer.
But there are multiple ways to get to the right answer, including "wrong" ways, in the sense that they're shortcuts which might work on the training data, but not all data you might want to predict (remember when your math teacher got mad if you didn't show your work?).
Basically, grokking is just continuing to train after your LLM's loss has already gone down on the training set, until it snaps to the "underlying algorithm" and starts predicting the right answers on the test dataset.
For example, if you want an LLM to predict sin(x) = y, at first it might be really bad, then start predicting the right values for some inputs, but not the right answer for all values... until you train it long enough that it generalizes (because understanding the formula is easier than memorizing every possible value in a lookup table of floating-point numbers).
In other words: LLMs memorize answers in normal training, but with training aimed specifically at grokking, they "understand".
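The memorize-vs-generalize distinction described above can be sketched with a toy example (my own illustration, not from the video): a lookup table is perfect on the points it has seen and useless everywhere else, while the recovered rule works on any input.

```python
import math

# Toy contrast: "memorization" as a lookup table of seen (x, sin x) pairs,
# vs. a "grokked" model that has recovered the underlying rule.
train_x = [0.0, 0.5, 1.0, 1.5]
lookup = {x: math.sin(x) for x in train_x}

def memorizer(x):
    # Perfect on training data, returns None on anything unseen.
    return lookup.get(x)

def generalizer(x):
    # The underlying algorithm: works for every x, seen or not.
    return math.sin(x)

print(memorizer(0.5), generalizer(0.5))  # both answer the training point
print(memorizer(0.7), generalizer(0.7))  # only the generalizer handles 0.7
```

The point of grokking is that, with enough training past the memorization stage, the network's weights stop behaving like the lookup table and start behaving like the rule.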
@@novantha1 wow, that actually makes a lot of sense, thank you.
I was sooo waiting for this follow-up :D. Thank you!🤩
In LLMs, is there a concept of "strong generalisation" (defined as two equivalent/identical networks trained on non-overlapping sets of data, both performing at 100%), as seen in BF-CNNs?
It's a bit off topic, but it's great work, also showing generalisation and geometric interpretations: "Generalization in diffusion models arises from geometry-adaptive harmonic representations". There's a great YouTube video; Zahra gives a great talk on it. It builds on earlier work by Eero Simoncelli's team on "bias-free CNNs" for denoising, which demonstrates that without the bias parameters the weights generalise much better: train on 5 dB of noise and it works on 30 dB of noise, whereas a regular wx+b fails. They visualise the manifolds too; it's really a good explanation!
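The bias-free point comes down to one line of algebra: a purely linear map commutes with rescaling of its input, while an affine map with a bias term does not, which is why a bias-free denoiser can transfer across noise levels. A minimal sketch of that property (my own illustration, not code from the cited papers):

```python
# A bias-free layer is scaling-equivariant: f(s*x) == s*f(x),
# so behavior learned at one input scale carries over to others.
# A bias term breaks this equivariance.
def with_bias(x, w=2.0, b=1.0):
    return w * x + b

def bias_free(x, w=2.0):
    return w * x

x, s = 3.0, 10.0
print(bias_free(s * x) == s * bias_free(x))  # True: equivariant
print(with_bias(s * x) == s * with_bias(x))  # False: bias breaks it
```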
Just discovered a related paper, 700 citations: "Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models"
As always, thank you for an amazing explanation and review!
I'm glad you found the work interesting and valuable.
Incredible stuff, thank you for your work! this is just amazing
Fantastic video as usual... Typically I find myself full of ideas after watching your videos. This time I find myself unsure how I might implement this information moving forward... I guess I will have to sit with it for a while. It's not lost on me, the irony that my favorite author growing up was Robert Heinlein, and Stranger in a Strange Land was the first book I read from him; yet this is the one topic whose knowledge I cannot immediately use in my projects... 😞
you are a legend sir!
Awesome video. So they have grokked some LLM to perform tests, but there is no grokked LLM that is in the public domain or publicly accessible? Why? Or am I missing something?
Amazing topic, thank you!!!
Great video thank you! Love your engaging style too :)
Thanks for watching!
Amazing! Thanks so much for sharing your amazing work!
Thank you for taking the time to write feedback. It is really important for me to also get positive comments.
Great content.
Great video.
Great video, thanks
Glad you liked it!
I've been thinking about a kind of vector database for grokking; it seems it would still facilitate RAG quite nicely too... opinions?
I still don't understand why grokked models are set against RAG. Why can't we combine grokked models with RAG systems?
New upcoming video will explain my thoughts in more detail. Great comment.
This is research-frontier knowledge.
I have some doubts I would like to clear.
Will grokking be effective by focusing only on dataset construction, if we choose to extend the fine-tuning of preexisting pretrained transformer architectures such as Llama 3 8B?
Do you pretrain using existing data as atomic facts and use fine-tuning for inferred facts?
If you fine-tune, what strategy do you go by? Do you fine-tune the entire network so that all gradients are affected and can hopefully reach the grokked state? That strategy might induce drastic forgetfulness, not to mention the mind-splitting compute power required to essentially extend the pretraining.
Or do you fine-tune via something like PEFT, or by training only the last few layers, resulting in not utilizing all the neurons in the grokked state, with only the trainable neurons essentially reaching the grokked state?
And the most important one for me (probably): any resources on how to start coding a grokked transformer?
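On the last question: a common starting point for coding a grokking experiment is modular arithmetic, training a small transformer with strong weight decay far past the point where training loss is near zero. Here is a minimal sketch of the dataset construction for that standard toy setup (my own illustration, not code from the video; the fraction and hyperparameters are assumptions):

```python
import random

# Standard grokking toy task: learn (a + b) mod p from a fraction of all pairs.
p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
random.seed(0)
random.shuffle(pairs)

frac = 0.4  # a small training fraction makes the grokking transition visible
split = int(frac * len(pairs))
train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
test = [((a, b), (a + b) % p) for a, b in pairs[split:]]

# From here, the usual recipe is: train a small transformer on `train` with
# AdamW and strong weight decay, run far more steps than needed to fit the
# training set, and track accuracy on `test` to watch for the late jump.
```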
Dayum
thanks
That's incredible! Does this mean that the path to AGI has been paved? Or am I overestimating the results?
Could not wait for this one. At dinner, it dawned on me that papers on this topic from 3 years back were by OpenAI researchers. So if they played with this back then, are the current models even state of the art, or are they much farther ahead and just milking the current models, like their adoptive parent did in the desktop-application area? It would make Sam's words true that they will steamroll many of the current layer-1 derivatives like RAG and CoT. Someone else also commented that this research is quite old; so if it is, why is this reasoning not already more ironed out and implemented in the OpenAI APIs? Even Google could have implemented it, as much of the grokking research is tied to researchers from them.
Some of Sam Altman's comments about "getting much more out of our training data" make me think that OpenAI groks grokking?
@@densonsmith2 That, and smarter synthetic generation, probably. Feed the models on niche topics with little data generated from themselves, enough to grok. But we do not know; it is a black box. But that should not prevent us from playing around with this 😀
@@mulderbm In the previous video he mentioned that the researchers used the original transformer architecture for grokking (not the decoder-only, GPT-style transformer). I'm guessing, but it seems to me that the reason could be the architecture.
Part 4 could be based on "Grokfast: Accelerated Grokking by Amplifying Slow Gradients".
I don't think that paper has much to offer; it's basically introducing momentum, which we've known about for 40 years.
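For context on that debate: as I read the Grokfast paper, its EMA variant keeps a low-pass-filtered copy of each gradient (its "slow" component) and adds an amplified version back before the optimizer step, which is indeed momentum-like in structure. A scalar sketch, with illustrative parameter names of my own:

```python
# Hedged sketch of the Grokfast-EMA idea (my reading of the paper):
# maintain an exponential moving average of the gradient and add an
# amplified copy of it back to the raw gradient before the optimizer step.
def grokfast_ema(grad, ema, alpha=0.98, lam=2.0):
    ema = alpha * ema + (1 - alpha) * grad  # low-pass filter of the gradient
    return grad + lam * ema, ema            # raw gradient + amplified slow part

# Toy usage on a stream of identical scalar "gradients":
ema = 0.0
for g in [1.0, 1.0, 1.0]:
    g_amp, ema = grokfast_ema(g, ema)
print(g_amp)  # grows above 1.0 as the slow component accumulates
```

The structural similarity to classical momentum is exactly what the comment above is pointing at; the paper's claim is about when and how much this accelerates the grokking transition, not that the filter itself is new.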
Truly outstanding! Thank you so much for creating and sharing such high quality content!
I think a good method of grokking would be to train on data compressed by synonyms
qwen 2 please
I can't seem to find part two. It's not easily searchable or linked in the description.
So if you want to find Part 2 and you are watching Part 3 on a YT channel, there is a tab called "Videos", where you find all public videos of this channel in chronological order. And if you still struggle to find the video: given there are three videos whose titles include "Grokked" or "Grokking", and you are at Part 3, the probability that one of the other two videos is Part 2 is non-zero. Thank you for your comment.
@code4AI Hi, yes, sorry, I had a brain-fart moment; I found it once I engaged my brain. Thanks for the help.