great as always!
Please keep up the good work, videos like these are incredibly helpful!
As someone who self-studies ML, your channel, along with similar others (Yannic Kilcher's, AI coffee talks), is such a great source of insight and education. Hopefully one day I'll have the honour of doing my PhD with you.
🙏
ahh been binging these since yesterday. really gives me a lot to think about regarding ways to extend the usual transformer architecture for different purposes.
thank you Professor!
- from sunny syracuse
Glad they're useful! Was just in Syracuse last month. Beautiful time of the year.
Thanks for the great video! This is a clearer explanation of long-context extension than any other source I've seen.
One confusion I have with your video is that you connect high-frequency rotation subspaces with short lengths (which I assume means small values of n-m). Why is there such a connection? In particular, why would tweaking a high-frequency rotation be especially impactful on shorter gaps?
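To make the question concrete, here's a quick numpy sketch I put together (my own toy numbers, not anything from the video) of how the rotation angle in each RoPE subspace scales with the gap n-m:

```python
import numpy as np

# Toy RoPE frequencies: theta_i = base^(-2i/d) for the i-th 2D subspace (illustrative values)
d, base = 128, 10000.0
i = np.arange(d // 2)
theta = base ** (-2 * i / d)      # i=0 is the highest frequency, i=d/2-1 the lowest

for gap in [1, 4, 16, 256]:       # relative distance n - m
    angles = gap * theta          # rotation angle applied to each subspace
    print(f"n-m={gap:4d}  highest-freq angle={angles[0]:8.2f} rad  "
          f"lowest-freq angle={angles[-1]:.5f} rad")
```

The highest-frequency subspace sweeps through whole radians even at n-m=1, while the lowest-frequency one barely moves until the gap is in the hundreds, which is the connection I'm trying to understand.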
Awesome video. Very informative. Thanks a lot!
No, I have NOT memorized the equation for self-attention (I can barely read the math in these papers) 😅.
But this has shown me what methods are effective and when.
Great visualizations. Fantastic work.
Think of it like F=ma or E=mc^2. It's just "softmax(QKᵀ)V" all the way down these days.
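If anyone wants that one line spelled out, here's a minimal numpy sketch (toy shapes, single head, no masking or scaling tricks beyond the standard 1/sqrt(d_k)):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted average of value vectors

n, d = 4, 8                                           # toy sequence length and head dim
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)                       # (4, 8)
```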
What would also be interesting is to see how fp precision affects long context.
HF, for example, is quite selective about when the model uses F32 and when it uses the base dtype. In MistralRotaryEmbedding, the sin/cos arguments are computed in F32, so at least there are >64K distinct values, but even then it depends on inv_freq (length dim/2), which is cast to F32.
I suspect that when there are more positions than a single value's precision can distinguish, the model will be more affected by this kind of "noise" as the number of tokens grows. (Though with limited VRAM that's not exactly a problem worth worrying about.)
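A rough sketch of the effect I mean (my own toy setup, not HF's actual code path; dim and base are just illustrative):

```python
import numpy as np

# Toy check of how rotary-angle rounding error grows with position
# (illustration only, not MistralRotaryEmbedding's exact computation).
dim, base = 128, 1e6
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))              # float64 reference frequencies

for pos in [1_000, 64_000, 1_000_000]:
    angles_f32 = np.float32(pos) * inv_freq.astype(np.float32)       # what an F32 path would compute
    angles_f64 = pos * inv_freq                                      # high-precision reference
    err = np.abs(np.sin(angles_f32.astype(np.float64)) - np.sin(angles_f64)).max()
    print(f"pos={pos:>9}  max |sin error| = {err:.2e}")
```

Since the angle is position * inv_freq, float32's roughly constant relative error turns into an absolute rotation error that grows with position, which is the kind of "noise" I'm imagining at very long contexts.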