Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

  • Published: 2 Jan 2025

Comments • 14

  • @EniAxon1 · 8 months ago +1

    Thanks, another great vid!

  • @宋佳璇-k4j · 2 months ago

    Great vids! But one thing I didn't figure out: how is the training process accomplished?

  • @husienvora9954 · 8 months ago +1

    great vid gabriel 👍

  • @ericl227 · 8 months ago +2

    17:50, shouldn't H_i be a summation of k_j and v_j over j instead of over i, where j goes from 1 to i?

    • @gabrielmongaras · 8 months ago +1

      Yep, nice catch! I put a note in the video about the reindexing.
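For reference, unrolling the recurrent memory update gives the summation with the corrected index. This is a sketch in the notation used in the discussion above, assuming H_i stands for the compressive memory after the i-th key/value pair has been written and sigma is the ELU + 1 nonlinearity:

$$
H_i = \sum_{j=1}^{i} \sigma(k_j)^{\top} v_j, \qquad z_i = \sum_{j=1}^{i} \sigma(k_j)
$$

The running index inside the sums is j; i only appears as the upper limit.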

  • @TheVirgile27 · 8 months ago

    One thing I don't understand well: after training, how do we manage the final output? For a large input, do we "force" the model to speak directly, i.e. produce an output for each input, or do we first feed in all of the input and then read the output at a certain point? Basically, one could wait until the reading part is complete and then force an answer. Maybe I'm not being clear (English isn't my language), but it seems there could be several ways to retrieve an output from this type of transformer. Please be kind, and thanks for the video :)

  • @EobardUchihaThawne · 8 months ago

    I have to get used to this scientific notation; I struggle to write code from these articles myself.

  • @danieldeychakiwsky1928 · 8 months ago

    Great vids! At around 3 minutes, when you get into the attention matrices, I think the dimensions aren't right: if Q is d by s and K is d by s and we take Q K-transpose, then a d by s matmul with an s by d gives a d by d matrix, but the video shows that matrix as s by s.

    • @gabrielmongaras · 8 months ago

      Thanks! In that part I transposed the diagram because I thought it looked a little better that way. Sometimes the diagrams I draw are transposed, but I try to label the dimensions to avoid ambiguity. So the s by s matrix is the resulting matrix, not a d by d one. An s by s matrix captures relations between sequence positions, while a d by d matrix captures relations between feature dimensions across the entire sequence.
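As a concrete shape check, here is a minimal NumPy sketch (s and d stand for the sequence length and head dimension; the variable names are just for illustration):

```python
import numpy as np

s, d = 8, 64                 # sequence length, head dimension
Q = np.random.randn(s, d)    # queries: one d-dimensional vector per token
K = np.random.randn(s, d)    # keys: same layout

scores = Q @ K.T             # (s, d) @ (d, s) -> (s, s): token-to-token relations
gram = Q.T @ K               # (d, s) @ (s, d) -> (d, d): dimension-to-dimension relations

print(scores.shape, gram.shape)   # (8, 8) (64, 64)
```

Whether QK^T comes out s by s or d by d depends only on which way the matrices are laid out, which is why labeling the dimensions in the diagram resolves the ambiguity.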

  • @M-ed5ct · 8 months ago

    Thanks for the video!
    Just one question. The dimension of the H state is fixed, but it accumulates additional information as it proceeds through the token sequence. After the first segment, H1 just "summarizes" one segment, but after segment N, Hn summarizes the current segment plus Hn-1, which is itself the summary of all the past context. Do you think it would make sense to increase H's dimension as the context grows, i.e. have the dimension of Hn grow with n? The idea is to keep the information per bit in H constant, so that we can really grow to unlimited context without the state becoming a bottleneck.

    • @gabrielmongaras · 8 months ago

      I think it makes sense to increase the hidden state, though doing so would result in a memory dependence on the sequence length during inference, which is currently a big problem. One can think of a softmax-attention transformer as having an infinite hidden state (the keys/values are just stacked), whereas an RNN has a constant-size hidden state. Perhaps something in the middle would perform better than an RNN but not require as much memory as a Transformer? (A sketch of the fixed-size memory update follows this thread.)

    • @M-ed5ct · 8 months ago

      @gabrielmongaras Yeah, the trick is to find a state-update function x_{n+1} = S(x_n, segment_n) such that dim(x_{n+1}) > dim(x_n), i.e. projecting the vector x_n into a bigger space while preserving its semantics and folding in segment_n's new data. Indeed, because the state dimension in the paper is tailored for quite a long context, with a growing state you could even start from a _smaller_ state x_1 and then grow it with the number of segments... so for a not-too-large context you might even get a memory reduction!
      But I don't see memory usage as a problem: you can always clamp it to a maximum if really needed, a kind of max_memory parameter... it can't be worse than the original.
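To make the constant-size-state trade-off discussed above concrete, here is a minimal NumPy sketch of a fixed-size compressive memory in the spirit of the paper's linear update (without the delta-rule variant). The function and variable names are my own for illustration, not the reference implementation:

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1 keeps activations positive, as used for the linear-attention memory
    return np.where(x > 0, x + 1.0, np.exp(x))

d_key, d_value = 64, 64
M = np.zeros((d_key, d_value))   # compressive memory: fixed shape, independent of context length
z = np.zeros(d_key)              # normalization term

def update(M, z, K_seg, V_seg):
    """Fold one segment's keys/values into memory: M <- M + sigma(K)^T V."""
    sK = elu_plus_one(K_seg)
    return M + sK.T @ V_seg, z + sK.sum(axis=0)

def retrieve(M, z, Q_seg):
    """Read from memory: A_mem = sigma(Q) M / (sigma(Q) z)."""
    sQ = elu_plus_one(Q_seg)
    return (sQ @ M) / (sQ @ z)[:, None]

# However many segments are folded in, M and z keep the same shape.
for _ in range(1000):
    K_seg = np.random.randn(32, d_key)
    V_seg = np.random.randn(32, d_value)
    M, z = update(M, z, K_seg, V_seg)

out = retrieve(M, z, np.random.randn(32, d_key))
print(M.shape, z.shape, out.shape)   # (64, 64) (64,) (32, 64) regardless of segment count
```

Growing dim(H_n) with n, as suggested above, would trade this constant footprint for more capacity; clamping that growth with a max_memory-style cap would land somewhere between the RNN-style fixed state and a full KV cache.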

  • @YEETSWORLDWIDE · 8 months ago +2

    so basically what you're telling me is the world is going to end