Big Bird: Transformers for Longer Sequences (Paper Explained)

  • Published: 3 Oct 2024

Comments • 71

  • @connectrRomania
    @connectrRomania 4 years ago +25

    Started to get addicted to your paper explanations, especially since I'm interested in the NLP field. Don't stop, please keep it up 👍🏼

  • @andres_pq
    @andres_pq 4 years ago +41

    Yannic can you make a video comparing all these O(n) transformers (Linformer, Reformer, etc)? It is getting confusing

    • @norik1616
      @norik1616 4 years ago

      Transformers are RNNs, the "true"* Linformer!
      *until SOTA moves further

  • @tojewel
    @tojewel 4 years ago

    This channel is just getting better and better exponentially with every video. I have no academic background in ML, but you made me watch a 40-minute video on this. Amazing!!!

  • @adamadiallo845
    @adamadiallo845 4 years ago +16

    Yannic, if you team up with someone doing animations to visualize the concepts, your channel will become the most influential in AI.

    • @dmitrysamoylenko6775
      @dmitrysamoylenko6775 4 years ago +2

      I prefer drawings done by hand in real time

    • @audrius0810
      @audrius0810 4 years ago +2

      I think it is perfect the way it is right now. It feels much more personal, and not to mention the videos can be put out much faster (not that I can keep up with them now, haha). The additions to the drawings and the hand-made highlights as the talk progresses are, I find, an unparalleled learning aid.

    • @zoeweil7844
      @zoeweil7844 4 years ago +3

      Actually, I also very much prefer the real time drawings by hand - really helps with learning.

    • @RaviAnnaswamy
      @RaviAnnaswamy 4 years ago

      @@zoeweil7844 I agree. In-progress scribbles from first principles teach a lot more than a polished finished product, because otherwise we do not know where to pay attention first :) Pun unintended.

    • @binjianxin7830
      @binjianxin7830 4 years ago +1

      No, too many graphics are distracting. Please don't do this like sirajing Siraj

  • @bluel1ng
    @bluel1ng 4 years ago +11

    Window+random attention reminds me of functional columns in the neocortex + long range connections. Maybe the random connections could be drawn from a power-law distribution (dependent on distance) to emulate biological neural networks. At least as soon as we move forward to neuromorphic chips we will have to make similar tradeoffs as biological systems (connections per volume limit etc.).
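    A minimal sketch of that suggestion (my own illustration, not something from the paper): sample each token's random attention partners with probability proportional to 1/distance^alpha instead of uniformly. The function name and parameters below are made up for the example.

    ```python
    import numpy as np

    # Hypothetical distance-dependent random attention: partner j for token i is
    # chosen with probability proportional to |i - j|^(-alpha). alpha = 0 recovers
    # the uniform random attention BigBird actually uses.
    def powerlaw_random_partners(n, num_random=3, alpha=1.5, seed=0):
        rng = np.random.default_rng(seed)
        partners = []
        for i in range(n):
            others = np.array([j for j in range(n) if j != i])
            weights = 1.0 / np.abs(others - i) ** alpha
            partners.append(rng.choice(others, size=num_random, replace=False,
                                       p=weights / weights.sum()))
        return partners  # partners[i] = extra positions token i attends to

    print(powerlaw_random_partners(16)[:4])
    ```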

  • @scottmiller2591
    @scottmiller2591 4 years ago +1

    I've worked w/random projections w/compressed sensing, fast SVD, and neural networks for decades, and I delight whenever I see them put to good use (although, as you point out, they build up the random, then don't really use it!). These applications amuse me every time I think of Yann LeCun saying that random projections are the stupidest thing you could possibly do. He was worked up about ELMs at the time, which correctly have a huge number of problems with attribution, but it still makes me grin.

  • @bdennyw1
    @bdennyw1 4 years ago +1

    Thanks for getting to this one so promptly.

  • @yaxiongzhao6640
    @yaxiongzhao6640 4 years ago

    Very clear explanation as always!
    I don't quite feel the criticism is fully merited.
    The whole idea of deep nets is that they preserve the universal approximation ability while reducing computing and memory demands. A single-hidden-layer net is already a universal approximator; the later, deeper nets just offer a different trade-off in the computing regime.
    Per this paper, the addition of random attention is a form of approximation of the full attention mechanism. By itself it is straightforward, even trivial, but it opens a new trade-off space where one can train for longer with less memory and meanwhile get a bigger attention range. That makes exploration more flexible.
    Love to hear others' comments.

  • @zoeweil7844
    @zoeweil7844 4 years ago +3

    Yannic, I know that DistilBERT came out a while ago but do you think you could do a video on it? Keep up the amazing work!!!

  • @ThomasTomiczek
    @ThomasTomiczek 1 year ago

    Funny idea - definitely an idea for larger attention spans. It moves from O(N^2) to O(KN) where K is a constant - you need more layers than before to facilitate the random walk. So for N=1 you end up more expensive, and for N=2 you end up more expensive if K is 2 - likely it is larger. The benefit only slowly materializes. N=5 means K has to be 5 to break even... that may already be hard. I wonder, though, about the overhead of the random blocks - it is a lot more processing than a simple matrix operation, so that has to be evaluated (see the rough break-even sketch below).
    I would also assume that the random points are identical on every layer - definitely more efficient. The problem with drawing them fresh on every layer is that a cluster may point to a dead end, and with more layers you end up with more and more unconnected connections that are dead ends.
    Definitely a nice idea to go from a very brutal use of computing resources to something that is deeper - significantly deeper - but not as compute-intensive on every layer. And given that the latter is quadratic and there is a serious need for serious attention windows (i.e. 64k+), that may be a brutal benefit.
    There is also the idea of having layers with different sizes: first and last have the same size, then go in and have N/2 in every other layer.
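    For a rough feel for that break-even point, here is a toy cost comparison (my own numbers, not from the paper): full attention costs about L * N^2, the sparse variant about L' * K * N with some extra layers L' to route information, so the sparse model only wins once N exceeds roughly (L'/L) * K.

    ```python
    # Toy break-even check for the argument above: full attention ~ L * N^2,
    # sparse attention ~ L' * K * N with a few extra layers. Illustrative numbers only.
    def full_cost(n, layers=12):
        return layers * n * n

    def sparse_cost(n, layers=16, k=512):  # assumed: 4 extra layers, ~512 attended positions
        return layers * k * n

    for n in (512, 1024, 4096, 16384):
        print(n, full_cost(n), sparse_cost(n), sparse_cost(n) < full_cost(n))
    # sparse wins once n > (16/12) * 512 ≈ 683
    ```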

  • @tomw4688
    @tomw4688 3 years ago +1

    Thanks for taking the time to make this video. Do you have an opinion on which one is better: Performer or Big Bird?

  • @mrpocock
    @mrpocock 4 years ago +22

    This sounds like it should be equivalent to a strangely shaped dropout

    • @RaviAnnaswamy
      @RaviAnnaswamy 4 years ago +6

      Good viewpoint. Actually, there is a term for this called DropConnect. While Dropout means dropping activations, DropConnect means zeroing some weights (connections), and it is used for regularization. I first saw it in the AWD-LSTM implementation by Stephen Merity.
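      A quick NumPy sketch of the distinction (illustration only, ignoring the usual train-time rescaling): Dropout zeroes activations, DropConnect zeroes individual weights.

      ```python
      import numpy as np

      # Dropout vs. DropConnect on a single linear layer.
      rng = np.random.default_rng(0)
      x = rng.normal(size=(4, 8))   # batch of activations
      W = rng.normal(size=(8, 8))   # layer weights
      p = 0.5                       # drop probability

      dropout_out = (x * (rng.random(x.shape) > p)) @ W       # zero some *activations*
      dropconnect_out = x @ (W * (rng.random(W.shape) > p))   # zero some *weights* (connections)
      ```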

  • @snippletrap
    @snippletrap 4 years ago +1

    What is missing is dynamic global attention. I.e. instead of just CLS and SEP tokens, let the model discover which tokens should have global / greater attention window size.

  • @navidhakimi7122
    @navidhakimi7122 4 years ago +1

    Very well done video. Thanks, man.

  • @kappadistributive
    @kappadistributive 4 years ago +6

    Regarding the transition from O(n^2) to O(n): you are, of course, right that the necessity to add more layers (in the worst case) mitigates this win in overall compute and even memory. However, if you think about model partitioning (more specifically, placing each layer on its own compute node), it seems to me that, if price is no object, you can train much larger models this way.
    Now couple this with model distillation (though dealing with sparse attention in model distillation won't be entirely trivial) and it seems you might be able to distill reasonably sized, well-performing models for long sequences.

    • @timdernedde993
      @timdernedde993 4 years ago

      Do you know how large N must be before even a single full attention layer no longer fits on, let's say, a 2080 Ti?

    • @kappadistributive
      @kappadistributive 4 years ago +4

      ​@@timdernedde993 It depends on a couple of factors. Let's make a few simplifying assumptions: We use base BERT's configuration and only consider the memory requirements for the weights of one transformer layer plus the states for the input, output and intermediate states with batch size 1. Let's also assume that all states and weights are 4 byte floats.
      This leads to
      - 3 * 12 * 768 * (768/12) = 3 * 768^2 weights for the attention projection matrices
      - 768^2 weights for the combination matrix
      - 2 * 768 * 2048 weights for the feed forward network
      plus
      - n * 768 states for the input
      - 3 * n * 12 * (768/12) = n * 3 * 768 states for the attention projections
      - n^2 * 12 * (768/12) = n^2 * 768 states for the attention results
      - n * 768 states for the combination result
      - n * 2048 states for the hidden layer in the feed-forward network
      - n * 768 states for the output layer of the feed-forward network
      In total this results in
      (768n^2 + 6656n + 5505024) * 4 bytes.
      A 2080 Ti has 11 GB ≈ 11,000,000,000 bytes of VRAM.
      This leads to a maximum sequence length of about 1886 tokens for the forward pass.
      In reality, the number is lower because our optimizer (e.g. Adam) requires us to store
      additional information.
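      For anyone who wants to check the arithmetic, here is a small script reproducing the estimate above under the same simplifying assumptions (BERT-base-like dimensions with a 2048-wide feed-forward layer, batch size 1, fp32, forward pass only):

      ```python
      # Back-of-the-envelope VRAM estimate from the comment above.
      d_model, ffn, bytes_per_float = 768, 2048, 4

      weights = 3 * d_model**2 + d_model**2 + 2 * d_model * ffn   # QKV + combination + FFN

      def activations(n):
          return (n * d_model          # input
                  + 3 * n * d_model    # Q, K, V projections
                  + n**2 * d_model     # attention results (the quadratic term)
                  + n * d_model        # combination result
                  + n * ffn            # FFN hidden layer
                  + n * d_model)       # FFN output

      vram = 11_000_000_000            # ~11 GB on a 2080 Ti
      n = 1
      while (weights + activations(n + 1)) * bytes_per_float <= vram:
          n += 1
      print(n)                         # ≈ 1886 tokens
      ```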

    • @timdernedde993
      @timdernedde993 4 years ago

      Thanks for the detailed answer! Somehow that was surprising to me. Just by gut feeling I would have hoped for more tokens.

    • @kappadistributive
      @kappadistributive 4 years ago

      @@timdernedde993 Neural NLP is, in many cases, very resource intensive. Training BERT from scratch costs something in the neighborhood of $250,000 in terms of cloud-compute costs. (Assuming you pay full price in Google's cloud.)

  • @PratikBhavsar1
    @PratikBhavsar1 4 years ago +3

    Amazing summary! I was thinking about these 3 points
    1) Table 1 shows random attention outperforms window attention. This seems counter-intuitive to me given that maximum information is in window tokens.
    2) They start MLM training from RoBERTa checkpoint - is that ok given RoBERTa was trained with full attention?
    3) Is this faster than RoBERTa for shorter sequences (

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      -> see answer on twitter, I guess :)

    • @snippletrap
      @snippletrap 4 years ago

      @@YannicKilcher link please

  • @RobNeuhaus
    @RobNeuhaus 4 years ago

    Nice video. I would have liked a little more attention to the random connections vs no random connections benchmarks to see how much they really buy.

  • @kristoferkrus
    @kristoferkrus 3 years ago +1

    If the only reason they use random attention is to make sure they can route the information in a logarithmic number of steps (i.e., layers), why not just use dilated windowed attention with a dilation that doubles each layer, like the dilated convolutions in CNNs (what WaveNet used)? That seems like a much more straightforward approach to achieve the same thing.
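    A minimal sketch of that alternative (mine, not something the paper does): build a per-layer boolean attention mask whose dilation doubles with depth, WaveNet-style, so any two positions become connected within a logarithmic number of layers.

    ```python
    import numpy as np

    def dilated_window_mask(n, window=3, dilation=1):
        """Boolean (n, n) mask: token i attends to j if i - j is a multiple of
        `dilation` and at most `window` hops away on either side."""
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            for k in range(-window, window + 1):
                j = i + k * dilation
                if 0 <= j < n:
                    mask[i, j] = True
        return mask

    # One mask per layer, dilation doubling each layer (1, 2, 4, 8, ...)
    n, num_layers = 16, 4
    masks = [dilated_window_mask(n, window=3, dilation=2 ** layer) for layer in range(num_layers)]
    ```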

  • @malikrumi1206
    @malikrumi1206 3 years ago

    Dear Yannic,
    New subscriber here. I was totally shocked when you explained that each 512 token sequence is a separate and independent document. In all the times I have read about BERT or looked at tutorials, that point has never been made clear and driven home as you just did. I thought, ‘Well, I’ll just put them all back together again at the end’.
    I’m still not really sure what all of this means. I would guess that the embeddings are all in one space no matter how many documents there were. This would enable BERT to recognize a new combination of (word 3 + word 4) when it had previously seen them separately in two different documents. But what about making sense of a single large document, or an entire corpus of them? Does BERT experience ‘forgetting’, be it catastrophic or not? How effective, or efficient, or even sensible, is an aggregation of 20, 200, or 2000 512 token sequences going to be? If a lengthy article starts talking about x, and then ten pages later shows how x impacts the interpretation of result y, how is an aggregation going to know what that is?
    And while I have seen tutorials talk about truncating longer token sequences in one fashion or another, you are also the first one to actually explain why we have the 512 limit in the first place.
    Finally, I would ask that you do a single video with a head to head comparison of Longformer vs. BigBird, and any other long form transformer I haven’t heard of yet. Thanks so much.

  • @pavel5074
    @pavel5074 21 days ago

    I skimmed through the paper and couldn't find the part where they state that the random attention pattern is the same from layer to layer... Are you sure the layers didn't have different patterns for the same batch? Mind pointing to the exact location in the paper where you got this idea? (Sorry for nit-picking, but this part seems to be important.)

  • @chenchen4244
    @chenchen4244 3 years ago

    This video is s0o0o0o0 excellent.

  • @REMIXofGenerationNOW
    @REMIXofGenerationNOW 4 years ago

    I think the Reformer idea is more promising. Somehow clustering the embeddings makes more sense to me than just random combinations or larger window sizes.

  • @veedrac
    @veedrac 4 years ago +1

    They are a bit half-hearted about their random tokens. You'd think ITC would reliably outperform ETC if it was important.

  • @alexfedorov5774
    @alexfedorov5774 4 years ago

    Good analogy with convolutions. Couldn't the random/window/global mechanisms be used to implement approximations of simple dense NN layers (and extend convolution layers with more information)? In computer vision, this could allow uniting convolution and dense layers (and using more dense-analogue layers). I wonder what the performance would be.

  • @maverick3069
    @maverick3069 4 years ago +1

    At 27:38, in the left figure, why are the keys and queries assigned blocks? Shouldn't each of them be assigned cells, which are then grouped together to save computation time? But if they do need to be assigned blocks, then what are the individual cells?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I think this might just be sloppy notation / visualization for making a point.

    • @maverick3069
      @maverick3069 4 years ago

      @@YannicKilcher Got it! Thanks Yannic!

  • @slavamoiseev803
    @slavamoiseev803 2 years ago

    Looks like we could do better than attending to nodes randomly + window + globally, by learning the pattern from data.

  • @GuillermoValleCosmos
    @GuillermoValleCosmos 3 years ago

    How many layers are needed for the full-attention transformer to compute any function? I'd need to see how they define universality, hmm.

  • @markdaoust4598
    @markdaoust4598 4 years ago

    The "global attention" for important things that don't change... isn't that what we usually call "context"?
    It's just getting combined with this global message slot... like "squeeze-excitation" layers?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I think context refers to slightly different things; this is for special tokens, like CLS or SEP.
      And I don't know enough about squeeze-excitation to give a coherent answer here :)

  • @wesleyolis
    @wesleyolis 4 years ago

    I like how you think!! Membit, or possibly a shared membit at times, routing of information to be able to make decisions on. 😉

  • @norik1616
    @norik1616 4 years ago

    I would be interested in the results with a block_size small enough to fit a reasonable batch size and length on state-of-the-art (LoL) commodity HW. As you mention, all of the papers always say it's linear complexity, but then they just use the gained performance to increase the model size in some sense. ResNet and other papers show results with smaller sizes, which leads me to the conclusion that MLM quality degrades fast when the size of the model is under the current SOTA size. If that's true, it would be an interesting insight - in line with the performance gains from increasing GPT size and data.

  • @menzithesonofhopehlope7201
    @menzithesonofhopehlope7201 4 years ago

    Do the techniques presented in this paper help with the issues Dmitry Lepikhin and collaborators found in their 1-trillion-parameter GShard model?
    Love your videos.

    • @YannicKilcher
      @YannicKilcher  4 years ago +2

      They are orthogonal, since the GShard paper deals with the feed-forward layers.

  • @krzysztofwos1856
    @krzysztofwos1856 4 years ago

    This attention mechanism seems reminiscent of Echo State Networks

  • @RedShipsofSpainAgain
    @RedShipsofSpainAgain 4 years ago

    3:29 "...you need to have information routed FROM every token TO every token." So this sounds like a fully-connected (dense) layer. What's the difference?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      In attention, the routing is dynamic.

    • @jonasleothunberg
      @jonasleothunberg 4 years ago +1

      To add to Yannic's answer, the output of one attention head is calculated as
      y = softmax[ (W_q X) (W_k X)^T * (1/sqrt(k)) ] (W_v X) = F( Q K^T ) V
      where ^T denotes the transpose and k is the length of the latent representation (a hyperparameter of the model architecture).
      Whereas a fully connected layer is typically just
      y = act_func( W x + b )
      where W has non-zero elements.
      So there is a lot more going on in the attention head, and I think the most important part is that the "weight matrix" F(Q K^T) is a non-linear function of the input, so you could say that the encoder stack learns a kind of higher-order perspective and models the distribution W(X). A rough sketch of this difference follows below.
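      A rough NumPy sketch of that difference (single head, no masking or multi-head machinery, names chosen for the example): a dense layer applies a fixed weight matrix, while an attention head first builds an input-dependent routing matrix softmax(Q K^T / sqrt(d)) and then mixes the values with it.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, d = 8, 16                                  # sequence length, latent dimension
      X = rng.normal(size=(n, d))

      # Fully connected layer: fixed, input-independent weights
      W, b = rng.normal(size=(d, d)), np.zeros(d)
      dense_out = np.tanh(X @ W + b)

      # Single attention head: the mixing weights themselves are a function of X
      W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
      Q, K, V = X @ W_q, X @ W_k, X @ W_v
      scores = Q @ K.T / np.sqrt(d)                 # token-to-token compatibility, shape (n, n)
      scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
      A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
      attn_out = A @ V                              # dynamic routing: rows of A depend on the input
      ```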

  • @soumyasarkar4100
    @soumyasarkar4100 3 years ago

    You did not say what a block really is :(

  • @vimostan269
    @vimostan269 4 years ago

    Matrix block-wise computation

  • @markthart2661
    @markthart2661 4 years ago +2

    So, as an engineering hack to speed it up, they let the number of random tokens scale with the input length... making it O(n^2) again.

  • @mikemihay
    @mikemihay 4 years ago +1

    Is the code for Big Bird available?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Doesn't seem like it; I'd keep up to date here: github.com/huggingface/transformers/issues/6113

  • @mathematicalninja2756
    @mathematicalninja2756 4 years ago +1

    This looks the same as Longformer?

    • @YannicKilcher
      @YannicKilcher  4 years ago +13

      It adds random attention and has a theoretical analysis.

  • @yaxiongzhao6640
    @yaxiongzhao6640 4 years ago

    Of course Yannic posted a new video on the newest interesting paper!

  • @mdmishfaqahmed5523
    @mdmishfaqahmed5523 4 years ago

    What is the difference between Google Research, USA and Google Research, Mountain View, CA, USA?

    • @snippletrap
      @snippletrap 4 years ago +1

      Good question. No idea why they split the authors. Makes more sense internally to Google than it does to the rest of the world.

    • @mdmishfaqahmed5523
      @mdmishfaqahmed5523 4 years ago

      @@snippletrap I mean, if it's Google USA plus Google Switzerland, then that is a different story.

  • @АлексейТучак-м4ч
    @АлексейТучак-м4ч 4 years ago

    This all reminds me of mixing genes through sexual reproduction and the reason it's sometimes superior to asexual reproduction.
    Here is the correspondence:
    generations - layers
    specimens - nodes
    mating - attending
    genes - vector coordinates
    In real life, specimens tend not to mate entirely randomly: the best ones tend to mate with the best ones, and if a specimen has poor genetics it should not expect to find a good mate.
    That's part of the reason why good genes agglomerate with good genes and bad ones are brushed out of the population.
    So you could attend better than at random, or else we would have mated randomly.
    I would like to know how the probability of node interaction (genes appearing in one body) depends on the network graph (ancestry) and its construction procedure, and how graph theory comes into these scenarios.
    See also: en.wikipedia.org/wiki/Vicar_of_Bray_(scientific_hypothesis)

  • @bojangles5503
    @bojangles5503 4 years ago

    Please use a microphone pop filter.