ALiBi - Train Short, Test Long: Attention with linear biases enables input length extrapolation

  • Published: 4 Jul 2024
  • #alibi #transformers #attention
    Transformers are essentially set models that need additional inputs to make sense of sequence data. The most widespread such inputs are position encodings or position embeddings, which inject sequence-index information in various forms. However, this limits the resulting model: it cannot run inference on sequences longer than those it was trained on, because it would encounter unfamiliar position encodings. ALiBi solves this by proposing simple, fixed linear biases as position information, adding negligible overhead in time and memory; surprisingly, the resulting model can handle inference on sequences many times as long as its training sequences.
    OUTLINE:
    0:00 - Intro & Overview
    1:40 - Position Encodings in Transformers
    4:55 - Sinusoidal Position Encodings
    11:50 - ALiBi Position Encodings
    20:50 - How to choose the slope parameter
    23:55 - Experimental Results
    29:10 - Comments & Conclusion
    Paper: ofir.io/train_short_test_long...
    Code: github.com/ofirpress/attentio...
    Abstract:
    Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi’s inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
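
    For readers who want the mechanics spelled out, here is a minimal NumPy sketch of what the abstract describes: a fixed, head-specific slope m scales the query-key distance and is subtracted from the attention logits before the softmax. This is an illustration of the idea under my own naming and shapes, not the authors' reference implementation.

```python
import numpy as np

def alibi_attention(q, k, v, m):
    """Causal self-attention for one head with an ALiBi linear bias.

    q, k, v: arrays of shape (seq_len, d_head); m: this head's fixed slope.
    """
    seq_len, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)                  # (seq_len, seq_len) logits

    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]              # i - j: how far back key j lies from query i
    scores = scores - m * distance                      # 0 on the diagonal, then -m, -2m, ... going back
    scores = np.where(distance < 0, -np.inf, scores)    # causal mask: no attending to the future

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# The bias is recomputed from whatever seq_len shows up at inference time,
# so there is no trained position table to outgrow.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(6, 4))
print(alibi_attention(q, k, v, m=0.5).shape)            # (6, 4)
```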
    Authors: Ofir Press, Noah A. Smith, Mike Lewis
    Links:
    TabNine Code Completion (Referral): bit.ly/tabnine-yannick
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    BiliBili: space.bilibili.com/1824646584
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science

Comments • 51

  • @YannicKilcher
    @YannicKilcher  2 years ago +5

    OUTLINE:
    0:00 - Intro & Overview
    1:40 - Position Encodings in Transformers
    4:55 - Sinusoidal Position Encodings
    11:50 - ALiBi Position Encodings
    20:50 - How to choose the slope parameter
    23:55 - Experimental Results
    29:10 - Comments & Conclusion

    • @user-zq4oe6ro2w
      @user-zq4oe6ro2w 2 years ago

      NVAX is literally collapsing, how's that for short long

  • @ofirpress
    @ofirpress 2 years ago +33

    Yannic thank you so much for making this amazing summary video!
    Here are a few comments I had:
    1. "There's no reason why this shouldn't work for full (mask-less) self-attention"- correct! I've already heard from people who have gotten this to work for encoder-attention in NMT, and we'll definitely have more on that.
    2. About the different heads having different slopes, I just wanted to add that although I didn't play with this too much, it did seem like when we had multiple heads that used the same slope it would degrade performance.
    3. Small note about our baseline for WikiText- it's actually not a model we designed at all, it's just us using the language model from Baevski & Auli's Adaptive Word Embedding paper.
    4. Our ALiBi model actually runs just as fast as the sinusoidal model. Throughout the paper you might see that our model and the sinusoidal one have a tiny difference in training or evaluation speed, but that's just the amount of variance we get on our hardware (so you'll even have the same model showing slightly different speeds sometimes too).
    5. I know our experiments on WikiText-103 seem too good to be true, but we've uploaded our models so anyone can check us! In addition- WikiText-103 is basically a toy dataset at this point, and as we later show in the paper our results on the big dataset are much closer to the sinusoidal model. Since WikiText-103 is so small, a model with a strong inductive bias will achieve amazing results there, but that advantage almost disappears when you train on a much much larger dataset with a much greater compute budget.
    6. Our language modeling results have already been replicated by multiple external groups, and as previously stated, others have managed to make this work for machine translation. Gathering language modeling results for 3 different datasets was a lot of work, and that's why we didn't have experiments on other domains, but now that we have more time we're definitely going to explore different applications, models and domains.

    • @hope_pead
      @hope_pead 2 years ago

      Hi. Super impressed by ALiBi. But can you help me understand why softmax suppression should work? If you distort the softmax so that the past becomes less relevant the further back you go, as Yannic said in the video you basically tell the model "I don't care if K_1 is important to Q_6, pay less attention to it". This could maybe work in the lower layers, but the higher layers, where the model learns much more intricate representations (like whether a sentence is from someone else's POV, for example), have to take into account a lot of information from the distant past. How do you explain the fact that suppressing the past doesn't hurt the representation ability when the context depends on the distant past?

    • @ofirpress
      @ofirpress 2 years ago +2

      @@hope_pead Thanks!
      Some of the heads are able to look far back, and for the rest of them, I would conclude that ALiBi works because even in the normal (sinusoidal) transformer, not much context is required for SoTA perplexity.
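
A quick way to see the "some heads look far back" point: with flat content logits, the weight a query assigns to a key d tokens back decays like exp(-m*d), so the slope alone decides how local a head is. A small illustrative check (my own numbers, not from the paper):

```python
import numpy as np

def mass_on_recent(m, window, total):
    """Fraction of attention mass (assuming flat content logits) that lands on
    the most recent `window` keys out of `total`, for a head with slope m."""
    d = np.arange(total)              # distances 0 .. total-1 tokens back
    w = np.exp(-m * d)                # unnormalized ALiBi-shaped weights
    return w[:window].sum() / w.sum()

for m in (1 / 256, 1 / 16, 0.5):      # a shallow, a medium and a steep slope
    frac = mass_on_recent(m, window=64, total=2048)
    print(f"slope {m:.4f}: {frac:.2f} of the mass is on the last 64 of 2048 tokens")
```

A shallow slope like 1/256 leaves most of the mass on keys further back than 64 tokens, while a steep slope makes the head effectively local.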

  • @SLAM2977
    @SLAM2977 2 years ago +19

    Great explanation of positional encoding Yannic! Best I have seen so far. All the way to 100k! :)

  • @KristoferPettersson
    @KristoferPettersson 2 years ago +1

    Mr Kilcher, you are awesome! Thank you for these well made lessons!

  • @neonardo77
    @neonardo77 2 years ago +1

    thanks for your articulate pronunciation haha. it really helps me, a non-native english speaker, understand the content way better.

  • @odysseus9672
    @odysseus9672 2 years ago +8

    Roughly speaking: Fourier transform is no good. Long live the Laplace transform!

  • @angry4rtichoke646
    @angry4rtichoke646 2 years ago +2

    I am looking for research to do with the UW CSE department, and this video could not have come at a more perfect time! I am excited to better understand the concepts in this video soon )

  • @BenZ-os4ms
    @BenZ-os4ms 2 years ago +12

    That came at just the right time, have been doing texture classification using transformer neural networks and have been looking for a way to generalise to longer sequence lengths - this might just be the thing I'm looking for. Thanks, love your videos!

    • @WatchAndGame
      @WatchAndGame 2 years ago

      What do you mean with textures exactly?

    • @BenZ-os4ms
      @BenZ-os4ms 2 years ago +1

      @@WatchAndGame Using sequential textural data captured using a robotic arm, sliding a sensor across varying materials :)

    • @WatchAndGame
      @WatchAndGame 2 years ago

      @@BenZ-os4ms oh interesting, so basically detecting the materials in front of the sensor?

    • @BenZ-os4ms
      @BenZ-os4ms 2 years ago

      @@WatchAndGame Yep! Pretty much 😀

    • @ofirpress
      @ofirpress 2 years ago

      Awesome!

  • @draxd3045
    @draxd3045 2 years ago +1

    Many channels talked about the same paper, but Yannic's is always my favorite

  • @ashokkumarj594
    @ashokkumarj594 2 years ago

    Thank you so much for your great explanation😍

  • @adamrak7560
    @adamrak7560 2 years ago +4

    One thing you have not mentioned: the potential of the network to bring interesting info from the beginning of the sequence, not by defeating the biases, but by jumping forward in every layer.
    Even less important info can be propagated forward this way, if you have enough layers.

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 2 years ago +11

    Don't want to be that guy claiming their work has no value because of prior work, and I'm not saying that... but this is literally the same attention bias as proposed in LeViT, just in 1D rather than 2D. Ctrl-F for LeViT in their paper shows no hits. Useful insight as to how this enables generalization from short to long sequences... I didn't realize sinusoidal sucks so much at that, though then again relative position encodings are not a new idea exactly and I'd wager someone must have made that observation before too... The geometric series per head seems like a useful and potentially novel idea though.

    • @ofirpress
      @ofirpress 2 years ago +1

      1. ALiBi is not "literally the same attention bias as proposed in LeViT".
      Here's the relevant quote from the LeViT paper "Each head has H ×W parameters corresponding to different pixel offsets". This is basically a 2D version of the T5 bias that we describe and cite heavily in our paper. Our main idea is to use a simple linear function that's a function of the relative distance instead of learning separate biases for each relative distance.
      2. "relative position encodings are not a new idea exactly" definitely! We cite many previous relative position methods, including Shaw et al. which IIRC was the first paper to talk about relative positioning.
      3. "id wager someone must have made that observation before too" where?
      4. "Useful insight as to how this enables generalization from short to long sequences" and "The geometric series per head seems like a useful and potentially novel idea though" thank you!

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 2 years ago

      @@ofirpress Fair enough; didn't realize T5 also shared that character. In that context the trainable weights of LeViT seem like a relevant difference. Would have been interesting to see their 1D equivalent also thrown into the comparison, I suppose. Note that I mean my comment about not 'not having' value non-sarcastically. (Wow, lots of negations.) I think novelty is way overrated compared to thorough investigation and understanding, and no doubt your paper contributes to that.

  • @anirbansen7132
    @anirbansen7132 1 month ago

    Very Helpful

  • @leotrisport
    @leotrisport 2 years ago

    Super cool!

  • @patf9770
    @patf9770 2 years ago +5

    Very curious how this generalizes to 2-d inputs (images) or n-d.
    Edit: seems like you'd have to indicate direction somehow

  • @MachineLearningStreetTalk
    @MachineLearningStreetTalk 2 years ago +4

    Nearly first 👌😀

  • @herp_derpingson
    @herp_derpingson 2 years ago

    I never fully understood why the original transformer used the weird sinusoid thing. In hindsight, making the position encodings linear makes much more sense.
    28:35 The early token curse is just that at the beginning of evaluation, the transformer doesn't have enough context to make reliable predictions?
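
For reference, the "weird sinusoid thing" from the original transformer is the fixed encoding below (the standard Vaswani et al. formulation; the helper name is mine):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position encodings from 'Attention Is All You Need':
    PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    idx = np.arange(0, d_model, 2)[None, :]        # even channel indices
    angles = pos / np.power(10000.0, idx / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even channels: sine
    pe[:, 1::2] = np.cos(angles)                   # odd channels: cosine
    return pe                                      # added to the word embeddings

print(sinusoidal_positions(seq_len=8, d_model=16).shape)   # (8, 16)
```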

  • @arijaa.9315
    @arijaa.9315 5 months ago

    11:46 If the positional encoding is injected into the query and key, then after multiplication the positional encoding already has its effect in the weight matrix that is multiplied by the values, right? Then this positional information is transferred to the next layer, and we again add the positional encoding to the keys and queries. What I did not get is: if the values are position-free, why do the hidden layers not transfer the position information to the next layer?

  • @a3ytc
    @a3ytc 2 years ago +2

    Wonder how this will work with bidirectional input - if you just apply the distance penalty in both directions then it doesn't know e.g. if the word is n tokens before or n tokens after (in theory I guess this means you can reverse the sentence and have the same prediction for the missing token)

    • @YannicKilcher
      @YannicKilcher  2 years ago

      True. I would guess that's fine, though.
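
To illustrate the concern in the question above (a hypothetical bidirectional variant, not something the paper proposes): if the penalty depends only on the absolute distance, a key n tokens before a query and one n tokens after it receive exactly the same bias.

```python
import numpy as np

def symmetric_alibi_bias(seq_len, m):
    """Hypothetical encoder-style variant that penalizes |i - j| only.
    Note bias[i, i - n] == bias[i, i + n]: the bias by itself cannot tell
    'n tokens before' from 'n tokens after'."""
    pos = np.arange(seq_len)
    return -m * np.abs(pos[:, None] - pos[None, :])

b = symmetric_alibi_bias(seq_len=5, m=1.0)
print(b[2, 0], b[2, 4])   # -2.0 -2.0: the same penalty on either side
```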

  • @MrDREAMSTRING
    @MrDREAMSTRING 2 years ago

    Shouldn't they compare with the case where only the most recent N tokens are fed into the model for inference, since practically their approach just blindly makes the attention weights of early tokens close to zero?

    • @ofirpress
      @ofirpress 2 years ago

      Ya that's the 'sliding window' evaluation approach that we discuss in the analysis section and also in a previous paper ('Shortformer'). It works well with the sinusoidal model but it is *very* slow.
      So our advantage is that we achieve these great perplexity results *and* we're super fast!
      Tell me if you have more questions :)
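
For anyone unfamiliar with the trade-off described here: sliding-window evaluation gives every prediction (up to) a full window of fresh context at the cost of roughly one forward pass per token, which is why it is accurate but very slow. A rough sketch of that loop, with a dummy stand-in for the language model (hypothetical interface) so it runs:

```python
import numpy as np

def dummy_lm(context, vocab=100):
    """Stand-in for a real language model, only here so the sketch runs:
    returns uniform next-token log-probs, one row per context position."""
    return np.full((len(context), vocab), -np.log(vocab))

def sliding_window_nll(tokens, window, lm=dummy_lm):
    """Score each token given the most recent `window` tokens of context.
    One forward pass per predicted token -- this is the slowness in question."""
    nll = []
    for t in range(1, len(tokens)):
        context = tokens[max(0, t - window):t]    # fresh context for this prediction
        logp = lm(context)                        # logp[i] predicts the token after context[i]
        nll.append(-logp[-1, tokens[t]])          # negative log-prob of the true next token
    return float(np.mean(nll))

print(sliding_window_nll(list(range(20)), window=8))   # ~4.61 == log(100) for the dummy model
```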

  • @dragoninfire123
    @dragoninfire123 2 years ago +1

    Could someone explain how the transformers can extrapolate to longer sequence lengths?
    If a model is designed to handle an input sequence length of 1024 during training, how can it handle 2048 or longer sequences at inference?
    Thanks!

    • @ekstrapolatoraproksymujacy412
      @ekstrapolatoraproksymujacy412 2 years ago

      You should learn about the transformer architecture so that you really understand it, then it is self-explanatory. In one layer, every input goes through the same shared weights of the key, query and value layers, and then they are summed (every output is a weighted sum of all inputs transformed by the value layer), so it doesn't matter how many inputs there are and where they are; they are treated the same and summed. This can be a problem in tasks where the position of an input is relevant, and then positional encodings are needed. You shouldn't try to understand papers like this one if you are lacking a basic understanding of the architecture, it's just a waste of your time

    • @ofirpress
      @ofirpress 2 years ago +1

      I think it's a great question! We talk about this a bit in the analysis section. Basically we think our ALiBi method forces the model to kind of make every prediction with a virtual sliding window of say 1024 and so while we feed it 2048 tokens at a time, for each of the predictions in there it's probably only using around 1024 tokens of context. A lot more research is needed here though to understand exactly what's going on internally and how we can make this whole thing even better!

    • @dragoninfire123
      @dragoninfire123 2 years ago

      @@ofirpress Thank you Ofir for your explanation and contribution to the paper!

  • @G12GilbertProduction
    @G12GilbertProduction 2 years ago

    BiT-related transformer with 1024 tokens? It's not going through with zero-sum method qualitatively. :D

  • @oneman7094
    @oneman7094 2 years ago +1

    Why not learn m? (The scalar that multiplies the matrix which is added to the attention matrix)

    • @ofirpress
      @ofirpress 2 years ago +3

      We do mention in the paper that if you'd like to make m learned it will slow down training by about 3%. In our experiments it was just really tricky to make trainable slopes work. Sometimes we'd learn negative slopes which would kill any extrapolation ability, and then when we tried using ReLU to make sure that the slopes are never negative that made the slopes we got perform badly.
      I'm definitely sure that with a bit more work we'll get trainable slopes to work, and I've also started hearing from other people that they've made it work.
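
One way to picture the bounding that comes up later in this thread (a hypothetical PyTorch sketch, not the authors' code): route the learned slope through a sigmoid so it can never become negative during training.

```python
import torch

class TrainableSlopes(torch.nn.Module):
    """Per-head slopes learned through a sigmoid, so they stay in (0, max_slope)
    and can never go negative (an illustration of the bounding discussed here)."""

    def __init__(self, n_heads, max_slope=1.0):
        super().__init__()
        self.raw = torch.nn.Parameter(torch.zeros(n_heads))  # unconstrained parameters
        self.max_slope = max_slope

    def forward(self):
        return self.max_slope * torch.sigmoid(self.raw)      # always strictly positive

slopes = TrainableSlopes(n_heads=8)
print(slopes())   # eight slopes, all 0.5 * max_slope at initialization
```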

    • @oneman7094
      @oneman7094 2 years ago

      @@ofirpress Thanks! That makes sense.

  • @liptherapy
    @liptherapy 2 years ago

    why is this unlisted?

    • @YannicKilcher
      @YannicKilcher  2 years ago +3

      it took very long to process the HD version

  • @yeaves
    @yeaves 1 year ago

    That subtracting something in log space means multiplication or division is true, but ALiBi subtracts it inside the softmax function. So I think your explanation in terms of log space isn't correct, is it?

    • @yeaves
      @yeaves 1 year ago

      Naah, you are right. I was confused.
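
The equivalence being discussed can be checked numerically: subtracting a per-key bias from the logits inside the softmax is the same as multiplying the corresponding unnormalized weights by exp(-bias) and then renormalizing, which is the "multiplication in log space" reading.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=6)          # raw attention logits for one query
bias = 0.5 * np.arange(6.0)          # an ALiBi-style per-key distance penalty

a = softmax(scores - bias)           # subtract the bias inside the softmax
b = np.exp(scores) * np.exp(-bias)   # ...equals scaling the unnormalized weights...
b /= b.sum()                         # ...and renormalizing
print(np.allclose(a, b))             # True
```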

  • @sieyk
    @sieyk 2 years ago +3

    Did they state why m can't be a learned parameter? I get the feeling next paper will be "Wow, we can do amazing image synthesis now because we made m a learned parameter lol"

    • @ofirpress
      @ofirpress 2 years ago +1

      We do mention in the paper that if you'd like to make m learned it will slow down training by about 3%. In our experiments it was just really tricky to make trainable slopes work. Sometimes we'd learn negative slopes which would kill any extrapolation ability, and then when we tried using ReLU to make sure that the slopes are never negative that made the slopes we got perform badly.
      I'm definitely sure that with a bit more work we'll get trainable slopes to work, and I've also started hearing from other people that they've made it work.

    • @sieyk
      @sieyk 2 years ago

      @@ofirpress thanks for clearing that up! Admittedly I only skimmed through the paper looking for something along the lines of "m as a trainable parameter".
      How was the performance when bounding m using sigmoid with a trainable spacing d_m?

    • @ofirpress
      @ofirpress 2 years ago

      @@sieyk I'm not sure what you mean by "trainable spacing", but bounding m using a sigmoid is exactly what makes it possible to train these slopes!

    • @sieyk
      @sieyk 2 years ago

      @@ofirpress I mixed up the function of m in my last question. I did not realise that you used a geometric series for m across the heads, which explains why a trainable m would not add much.
      Also, I appreciate your patience: when I was reading through the paper I noticed that, indeed, you already stated that sigmoid was the best choice for trainable geometric functions. Sorry about that!
      What I meant to ask was, how well did it do if each head had its own trainable m, discarding the geometric series? I understand that the geometric series was a substitute for positional encoding, so perhaps that idea wouldn't work at all 😅.
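
Since the geometric series keeps coming up: the paper sets the slopes for n heads to a geometric sequence that starts at 2^(-8/n) and uses that same value as its ratio, so 8 heads get 1/2, 1/4, ..., 1/256. A short sketch reproducing that recipe:

```python
import numpy as np

def alibi_slopes(n_heads):
    """Head-specific ALiBi slopes: a geometric sequence starting at
    2**(-8 / n_heads) with that same value as its ratio."""
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

print(alibi_slopes(8))   # [0.5, 0.25, 0.125, ..., 0.00390625]
```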

  • @JamesAwokeKnowing
    @JamesAwokeKnowing 2 years ago

    Larger m equals more myelinated nerves.