ALiBi | Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation
- Published: 7 Jul 2024
- 👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► / discord
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
In this video I cover ALiBi model from the "Train Short, Test Long: Attention With Linear Biases Enables Input Length Extrapolation" paper.
Instead of using positional embeddings (as in, e.g., the original Transformer), they add non-learnable, linearly decaying biases directly to the query-key attention scores, and achieve extraordinary extrapolation results (i.e. great performance on sequences longer than those seen during training).
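As a rough sketch of the idea (not the authors' implementation; the helper names below are made up for illustration): each head gets a fixed slope from a geometric sequence, and the bias added to the score between query position i and key position j is just -slope * (i - j), so the diagonal is 0 and attention decays linearly with distance.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n).
    For 8 heads this gives 1/2, 1/4, ..., 1/256, as in the paper."""
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Additive bias for one head's query-key score matrix.

    bias[i][j] = -slope * (i - j) for key positions j <= i;
    future positions (j > i) are set to -inf, i.e. the causal mask."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]
```

Since the biases are fixed (not learned) and depend only on relative distance, nothing about them is tied to the training sequence length, which is what makes extrapolation possible.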
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper: ofir.io/train_short_test_long...
✅ Code: github.com/ofirpress/attentio...
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 Intro
02:00 Main results
03:30 Time and memory tradeoffs
05:00 ALiBi method explained
10:00 Results, low data regime
12:35 Generalization to different datasets
13:15 Results, big data regime
16:00 Why does it work and future work
21:10 Outro
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► / theaiepiphany
One-time donation:
www.paypal.com/paypalme/theai...
Much love! ❤️
Huge thank you to these AI Epiphany patreons:
Eli Mahler
Petar Veličković
Zvonimir Sabljic
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and in general, a stronger focus on geometrical and visual intuition, rather than the algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► / aleksagordic
Twitter ► / gordic_aleksa
Instagram ► / aiepiphany
Facebook ► / aiepiphany
💻 FOLLOW ME ON GITHUB FOR ML PROJECTS:
GitHub ► github.com/gordicaleksa
📚 FOLLOW ME ON MEDIUM:
Medium ► / gordicaleksa
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#alibi #transformers #extrapolation
Thanks for doing a video overview of our paper!
Here are some answers to your questions and comments:
1. You asked why ALiBi's speed in Figure 2 was slightly higher: there's no real reason for this. There's just variance when running inference/training on our hardware, and as we say in the paper, the speeds of sinusoidal and ALiBi are basically identical; those variations are all within noise.
2. The 100MB of memory overhead is only sometimes necessary (when batch_size > 1).
3. Thanks for finding that typo :)
4. Nonoverlapping evaluation is basically the only evaluation method that's used in production. Sliding window evaluation is way too slow and it's basically only used in an academic context.
5. For Figure 10: The difference from the figures that were shown before is *not* the length of the examples. The difference is that here we use sliding window evaluation instead of nonoverlapping evaluation.
6. Let me try to explain our analysis section with an example. If you do nonoverlapping evaluation on sequences of length 1000, then 10% of predictions you make have 100 tokens of context or less. If you take that same model and run nonoverlapping evaluation with 2000 token sequences, now only 5% of predictions have 100 tokens of context or less.
So by simply being able to handle longer sequences, you can massively reduce the early token curse. (100 is just an arbitrary small number that represents a 'short context')
That's why our model improves perplexity in the nonoverlapping evaluation setting!
During sliding window evaluation, *every* prediction uses the maximal amount of context. And so there is never an early token curse there! And so ALiBi's benefit, of cancelling out the early token curse, is not as helpful, and our performance remains flat.
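The fraction argument in point 6 can be checked with a one-liner (a hypothetical helper, assuming the "context" of a token is simply its position within the nonoverlapping chunk):

```python
def short_context_fraction(seq_len, short=100):
    """Under nonoverlapping evaluation, the first `short` tokens of every
    chunk are predicted with at most `short` tokens of context, so the
    fraction of such 'early token curse' predictions is short / seq_len."""
    return min(short, seq_len) / seq_len
```

So doubling the evaluation chunk length from 1000 to 2000 halves the fraction of short-context predictions from 10% to 5%, exactly as in the example above.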
Super useful!! Thank you Ofir! I'll parse 5/6 tomorrow. 😄
If you drop the attention to the more distant past tokens, then they begin to specialize for local computation, and their computation isn't lost the further you go into the future, because it gets referenced in the higher layers and comes back as a general summary of the past.
Transformers are indeed on 🔥🔥🔥! This paper is a competitor for the value/lines of code ratio. Or is it?
👨👩👧👦 JOIN OUR DISCORD COMMUNITY:
Discord ► discord.gg/peBrCpheKE
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
❤️ Become The AI Epiphany Patreon ❤️
Patreon ► www.patreon.com/theaiepiphany
This was very helpful. Thanks for the videos, keep them coming!
Then is it implemented in the decoder part of the Transformer, since we use masking there?
Thank you
You're on 🔥!