GROKKED LLM beats RAG Reasoning (Part 3)
- Published: 26 Sep 2024
- We open the black box of GROKKED LLMs and analyze each layer of the transformer architecture for its performance in causal reasoning, after the grokking phase transition of our LLM.
Current research in AI clearly indicates that established LLMs, such as Gemini Pro 1.5 or GPT-4 Turbo, fail at deep reasoning, even when integrated into complex RAG systems.
A grokking phase transition is essential for an LLM to activate its high-performance phase, reaching close to 99% accuracy on unseen tasks in the development and test datasets.
#airesearch
#ainews
#insights
Honestly, yours has become my favorite vlog. Just fantastic.
A strong motivational feedback for future work. Thanks.
Would love a tutorial on grokking phi-3, this grokking thing is hard to wrap my head around.
It's nothing terribly special once you get into it. Basically, LLMs "learn" by producing patterns in matrix multiplications that are helpful for predicting information. I.e., if "dog" is the fifth token in your LLM's vocabulary, and the next word in the sequence should be "dog", then it's helpful to have a pattern of weights that predicts "5" in that sequence.
So, you end up with these really complicated patterns of weights that are hard to understand, and progress towards predicting the correct answer.
But there are multiple ways to get to the right answer, including "wrong" ways, in the sense that they're shortcuts which might work on the training data, but not all data you might want to predict (remember when your math teacher got mad if you didn't show your work?).
Basically, grokking is just continuing to train after your LLM's loss has already gone down on the training set, until it snaps to the "underlying algorithm" and starts predicting the right answers on the test dataset.
For example, if you want an LLM to predict sin(x) = y, at first it might be really bad, then start predicting the right values for some inputs, but not the right answer for all values... until you train it long enough that it generalizes (because understanding the formula is easier than memorizing every possible value in a lookup table of floating-point numbers).
In other words: LLMs memorize answers in normal training, but with training aimed specifically at grokking, they "understand".
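The memorize-vs-generalize distinction described above can be sketched with a toy example (my own illustration, not from the video): a lookup table is perfect on the points it has seen and useless everywhere else, while the recovered rule works on any input.

```python
import math

# Toy contrast: "memorization" as a lookup table of seen (x, sin x) pairs,
# vs. a "grokked" model that has recovered the underlying rule.
train_x = [0.0, 0.5, 1.0, 1.5]
lookup = {x: math.sin(x) for x in train_x}

def memorizer(x):
    # Perfect on training data, returns None on anything unseen.
    return lookup.get(x)

def generalizer(x):
    # The underlying algorithm: works for every x, seen or not.
    return math.sin(x)

print(memorizer(0.5), generalizer(0.5))  # both answer the training point
print(memorizer(0.7), generalizer(0.7))  # only the generalizer handles 0.7
```

The point of grokking is that, with enough training past the memorization stage, the network's weights stop behaving like the lookup table and start behaving like the rule.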
@@novantha1 wow, that actually makes a lot of sense, thank you.
I was sooo waiting for this follow-up :D. Thank you!🤩
In LLMs, is there a concept of "strong generalisation" (defined as two equivalent/identical networks trained on non-overlapping sets of data, both performing at 100%), as seen in BF-CNNs?
It's a bit off topic, but it's great work, also showing generalisation and geometric interpretations: "Generalization in diffusion models arises from geometry-adaptive harmonic representations". There's a great YouTube video; Zahra gives a great talk on it. It builds on earlier work by Eero Simoncelli's team on "bias-free CNNs" for denoising, which demonstrates that without the bias parameters the weights generalise much better: train on 5 dB of noise and it works on 30 dB of noise, whereas a regular wx+b fails. They visualise the manifolds too; it's really a good explanation!
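The bias-free point comes down to one line of algebra: a purely linear map commutes with rescaling of its input, while an affine map with a bias term does not, which is why a bias-free denoiser can transfer across noise levels. A minimal sketch of that property (my own illustration, not code from the cited papers):

```python
# A bias-free layer is scaling-equivariant: f(s*x) == s*f(x),
# so behavior learned at one input scale carries over to others.
# A bias term breaks this equivariance.
def with_bias(x, w=2.0, b=1.0):
    return w * x + b

def bias_free(x, w=2.0):
    return w * x

x, s = 3.0, 10.0
print(bias_free(s * x) == s * bias_free(x))  # True: equivariant
print(with_bias(s * x) == s * with_bias(x))  # False: bias breaks it
```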
Just discovered a related paper, 700 citations: "Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models"
As always, thank you for an amazing explanation and review!
I'm glad you found the work interesting and valuable.
Incredible stuff, thank you for your work! this is just amazing
Fantastic video as usual... Typically I find myself full of ideas after watching your videos. This time I find myself unsure how I might implement this information moving forward... I guess I will have to sit with it for a while. It's not lost on me, the irony that my favorite author growing up was Robert Heinlein, and Stranger in a Strange Land was the first book I read from him; yet this is the one topic whose knowledge I cannot immediately use in my projects... 😞
you are a legend sir!
Awesome video. So they have grokked some LLM to perform tests, but there is no grokked LLM that is in the public domain or publicly accessible? Why? Or am I missing something?
Amazing topic, thank you!!!
Great video thank you! Love your engaging style too :)
Thanks for watching!
Amazing! Thanks so much for sharing your amazing work!
Thank you for taking the time to write feedback. It is really important for me to also get positive comments.
Great content.
Great video.
Great video, thanks
Glad you liked it!
I've been thinking about a kind of vector database for grokking; it seems it would still facilitate RAG quite nicely too... opinions?
I still don't understand why grokked models are set against RAG. Why can't we combine grokked models with RAG systems?
New upcoming video will explain my thoughts in more detail. Great comment.
This is research-frontier knowledge.
I have some doubts I would like to clear.
Will grokking be effective by focusing only on dataset construction, if we choose to extend the fine-tuning of preexisting pretrained transformer architectures such as Llama 3 8B?
Do you pretrain using existing data as atomic facts and use fine-tuning for inferred facts?
If you fine-tune, what strategy do you go by? Do you fine-tune the entire network so that all gradients are affected and can hopefully reach the grokked state? That strategy might induce drastic forgetfulness, not to mention the mind-splitting compute power required to essentially extend the pretraining.
Or do you fine-tune via something like PEFT, or by training only the last few layers, resulting in not utilizing all the neurons in the grokked state, with only the trainable neurons essentially reaching the grokked state?
And the most important one for me (probably): any resources on how to start coding a grokked transformer?
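On the last question: a common starting point for coding a grokking experiment is modular arithmetic, training a small transformer with strong weight decay far past the point where training loss is near zero. Here is a minimal sketch of the dataset construction for that standard toy setup (my own illustration, not code from the video; the fraction and hyperparameters are assumptions):

```python
import random

# Standard grokking toy task: learn (a + b) mod p from a fraction of all pairs.
p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
random.seed(0)
random.shuffle(pairs)

frac = 0.4  # a small training fraction makes the grokking transition visible
split = int(frac * len(pairs))
train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
test = [((a, b), (a + b) % p) for a, b in pairs[split:]]

# From here, the usual recipe is: train a small transformer on `train` with
# AdamW and strong weight decay, run far more steps than needed to fit the
# training set, and track accuracy on `test` to watch for the late jump.
```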
Dayum
thanks
That's incredible! Does this mean that the path to AGI has been paved? Or am I overestimating the results?
Could not wait for this one. At dinner, it dawned on me that papers on this topic from 3 years back were by OpenAI researchers. So if they played with this back then, are the current models even state of the art, or are they much farther ahead and just milking the current models, like their adoptive parent did in the desktop-application area? It would make Sam's words true that they will steamroll many of the current layer-1 derivatives like RAG and CoT. Someone else also commented that this research is quite old; so if it is, why is this reasoning not already more ironed out and implemented in the OpenAI APIs? Even Google could have implemented it, as much of the grokking research is tied to researchers from them.
Some of Sam Altman's comments about "getting much more out of our training data" make me think that OpenAI groks grokking?
@@densonsmith2 That, and smarter synthetic generation, probably. Feed the models on niche topics with little data generated from themselves, enough to grok. But we do not know; it is a black box. But that should not prevent us from playing around with this 😀
@@mulderbm In the previous video he mentioned that the researchers used the original transformer architecture for grokking (not the decoder-only, GPT-style transformer). I'm guessing, but it seems to me that the reason could be the architecture.
Part 4 could be based on "Grokfast: Accelerated Grokking by Amplifying Slow Gradients".
I don't think that paper has much to offer; it's basically introducing momentum, which we've known about for 40 years.
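For context on that debate: as I read the Grokfast paper, its EMA variant keeps a low-pass-filtered copy of each gradient (its "slow" component) and adds an amplified version back before the optimizer step, which is indeed momentum-like in structure. A scalar sketch, with illustrative parameter names of my own:

```python
# Hedged sketch of the Grokfast-EMA idea (my reading of the paper):
# maintain an exponential moving average of the gradient and add an
# amplified copy of it back to the raw gradient before the optimizer step.
def grokfast_ema(grad, ema, alpha=0.98, lam=2.0):
    ema = alpha * ema + (1 - alpha) * grad  # low-pass filter of the gradient
    return grad + lam * ema, ema            # raw gradient + amplified slow part

# Toy usage on a stream of identical scalar "gradients":
ema = 0.0
for g in [1.0, 1.0, 1.0]:
    g_amp, ema = grokfast_ema(g, ema)
print(g_amp)  # grows above 1.0 as the slow component accumulates
```

The structural similarity to classical momentum is exactly what the comment above is pointing at; the paper's claim is about when and how much this accelerates the grokking transition, not that the filter itself is new.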
Truly outstanding! Thank you so much for creating and sharing such high quality content!
I think a good method of grokking would be to train on data compressed by synonyms
qwen 2 please
I can't seem to find part two. It's not easily searchable or linked in the description.
So if you want to find Part 2 and you are watching Part 3 on a YT channel, there is a tab called "Videos", where you find all public videos of this channel in chronological order. And if you still struggle to find the video: given there are three videos whose titles include "Grokked" or "Grokking", and you are at Part 3, the probability that one of the other two videos is Part 2 is non-zero. Thank you for your comment.
@code4AI Hi, yes, sorry, I had a brain-fart moment; I found it once I engaged my brain. Thanks for the help.