Do we need Attention? - Linear RNNs and State Space Models (SSMs) for NLP

  • Published: Jan 4, 2025

Comments • 31

  • @jasonzhai2584
    @jasonzhai2584 10 months ago +6

    I'm an MSc student who is new to deep learning and had only just heard of SSMs; without being taught these at school, I really struggled to get my head around these concepts the first time. This introduction is ABSOLUTELY AMAZING! In just 40 minutes the material is presented in such a concise yet information-rich way that it is understandable even for a newbie like me. I am confident that this video paves my way toward understanding more advanced papers on the topic. Thank you!

  • @varunsaagars
    @varunsaagars 1 year ago +11

    🎯 Key Takeaways for quick navigation:
    00:00 🤖 *This talk explores the question of whether we need attention mechanisms in neural networks for natural language processing.*
    01:06 🧠 *Transformers use attention layers to compute interactions between components, which can become complex for long sequences.*
    04:32 ⏳ *Transformer models face limitations due to their quadratic dependency on sequence length, affecting both training and generation speed.*
    07:04 🌐 *Researchers are exploring alternatives to attention mechanisms in neural networks for NLP.*
    11:30 🔄 *Linear RNNs (recurrent neural networks without a non-linearity in the recurrence) are more efficient than traditional RNNs for sequence tasks.*
    15:52 💡 *Linear RNNs can be computed efficiently using techniques like Fourier transforms or associative scans, making them faster to train (a small numerical sketch follows this list).*
    20:58 📊 *Continuous time State Space Models (SSMs) are used to explore different parameterizations of linear RNNs, allowing for effective long-range sequence modeling.*
    23:18 🏆 *Linear RNNs with SSM parameterization have shown promising results in various machine learning tasks, including language modeling and natural language processing.*
    27:00 🧠 *Linear RNNs and State Space Models (SSMs) can effectively handle the routing components in transformer-style models, simplifying their structure.*
    27:40 📊 *When fine-tuning linear RNN-based models for tasks involving long-range sentence matching, the kernels learn to look at longer ranges of information, adapting their coefficients accordingly.*
    28:23 📈 *Linear RNN hybrid models, combining attention layers with linear RNNs, have shown improved perplexity compared to open source Transformer models, even with a similar number of parameters.*
    30:44 🧩 *Researchers have explored simpler parameterizations for linear RNNs, such as using a diagonal form or damped exponential moving averages, achieving good results on long-range tasks.*
    32:44 🔄 *A new approach called RWKV combines linear RNNs to create an efficient model inspired by Transformer attention, potentially competing with large-scale Transformers.*
    34:16 💡 *Scaling up linear RNNs for larger models, such as a 14 billion parameter language model, shows promise for competing with Transformers in zero-shot prediction tasks.*
    36:22 🛠️ *Challenges in adopting linear RNNs in practice include support for complex numbers, efficient Fourier transforms, numerical stability, and the need for system-level improvements.*
    39:06 📣 *While attention remains dominant in NLP and deep learning, exploring alternative approaches, developing new algorithms, and building communities for scaling are essential for future innovations.*
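
    A minimal sketch (not the speaker's code) of the point at 15:52, assuming a diagonal linear RNN with an illustrative scalar decay a and made-up sizes; it shows that the recurrence can be computed step by step or, equivalently, through a precomputed kernel applied as a causal convolution:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 16, 4                     # sequence length, hidden size
    a = 0.9                          # scalar (diagonal) state decay
    B = rng.normal(size=(d,))        # input projection (kept 1-D for simplicity)
    u = rng.normal(size=(T,))        # scalar inputs u_1 ... u_T

    # (1) Sequential recurrence h_t = a * h_{t-1} + B * u_t: O(T) dependent steps.
    h = np.zeros(d)
    seq_states = []
    for t in range(T):
        h = a * h + B * u[t]
        seq_states.append(h)
    seq_states = np.stack(seq_states)

    # (2) Kernel form h_t = sum_k a^k * B * u_{t-k}: the powers of a form a
    #     fixed kernel, so every h_t can be computed in parallel as a causal
    #     convolution (via an FFT, or an associative scan, for long sequences).
    kernel = a ** np.arange(T)                                  # [1, a, a^2, ...]
    conv = np.array([np.sum(kernel[:t + 1][::-1] * u[:t + 1]) for t in range(T)])
    kernel_states = conv[:, None] * B[None, :]

    assert np.allclose(seq_states, kernel_states)               # both routes agree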

  • @tharunbhaskar6795
    @tharunbhaskar6795 1 year ago +10

    Hands down, the best video I found after all my searching. I was trying to get into S4, Mamba, Linear RNNs and the like, but most of the videos I watched were very difficult to understand. This one made a lot of sense, and I'm looking forward to more such videos.

    • @srush_nlp
      @srush_nlp  1 year ago +3

      Thanks! Working on some followups.

  • @varunsai9736
    @varunsai9736 1 month ago

    Amazing talk, professor. I recently attended your talk at Penn State.

  • @wentropia
    @wentropia 1 year ago +7

    Very good! I was really lost in the Mamba paper, but now I understand a little. Thank you!

    • @srush_nlp
      @srush_nlp  1 year ago +4

      That's great to hear. I hope to add a Mamba version soon as well.

  • @kevon217
    @kevon217 1 year ago

    Incredible walkthrough. Appreciate the time taken to explain simply and succinctly.

  • @kimchi_taco
    @kimchi_taco 1 year ago +1

    Thanks for the kind introduction to SSMs. However, I'm not very convinced, because:
    1. It still learns static routing (27:00). This seems prone to overfitting the training data. Do you think it generalizes well enough to OOD data, like GPT-4 does?
    2. An SSM compresses the memory of the entire history into one latent vector. Can it provide all the relevant information to future tokens?
    3. LSTMs were introduced to resolve the vanishing/exploding-gradient issue that arises when an RNN's matrix is multiplied L times. SSMs reintroduce that repeated matrix multiplication. Don't they have the same gradient-vanishing issue?

  • @AI_ML_DL_LLM
    @AI_ML_DL_LLM 1 year ago +1

    Somewhat similar to the Rocket method; thanks for the clear explanation.

  • @giuseppelombardi2964
    @giuseppelombardi2964 3 months ago

    I think we need two aspects in sequence modeling: memory capacity and selection. How can we retain useful information effectively and efficiently?

  • @dannyleybzon26
    @dannyleybzon26 11 months ago

    Wow, this is the best explanation I've seen! Is there a Mamba-specific one coming out? :D

    • @srush_nlp
      @srush_nlp  11 months ago

      Yes, but it's a lot to learn!

  • @A2ATemp
    @A2ATemp 1 year ago +2

    Keep up the good work!

  • @MartinBlais
    @MartinBlais 10 months ago

    Sasha, this was excellent. Thank you. Just a note that the volume was barely audible; it would be useful to normalize the audio levels before uploading. Thanks again.

    • @srush_nlp
      @srush_nlp  10 months ago

      Sorry! Was just learning how to do it at this point. Later videos are better.

  • @ln2deep
    @ln2deep 1 year ago

    Is the performance of the linear models in Dao et al. better because dropping attention frees up extra parameters for training, alongside reasonable long-distance modelling? We lose some of the completeness of attention, so it's surprising that the perplexity would be lower. I suppose it could also be because you need less data to learn reasonable approximations, so maybe they are more data-efficient?

  • @ninefates9882
    @ninefates9882 1 year ago

    In early 2024, your side of the bet looks a lot better off than it did a mere 6 months ago. 😀

  • @thomasjohnson4842
    @thomasjohnson4842 1 year ago +2

    Great talk! You said "This talk predates the work on Mamba, but covers foundational preliminaries. Mamba version coming soon." Any word on when that video will be out? Also interested in RWKV v6

  • @simonl1938
    @simonl1938 1 year ago

    How does the state in the SSM not explode in size without an activation function?

  • @simonl1938
    @simonl1938 1 year ago

    What does the C HiPPO matrix look like? Is it learned?

  • @AbhinavSharma-o8u
    @AbhinavSharma-o8u 1 year ago

    Why can't we form kernels using non-linear RNNs?

    • @srush_nlp
      @srush_nlp  1 year ago +2

      Because unrolling the non-linear RNN formula, x_{t+1} = relu(A x_t + B u_{t+1}) = relu(A relu(A x_{t-1} + B u_t) + B u_{t+1}), doesn't allow us to rearrange the terms into either associative or kernel form. Basically, the relu breaks the algebra that lets us rearrange things.
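
      A tiny numerical sketch of that point (the matrices, sizes, and names here are illustrative, not from the talk): the two-step linear recurrence regroups into a sum of kernel terms, while inserting a relu at each step does not.

      import numpy as np

      rng = np.random.default_rng(1)
      relu = lambda z: np.maximum(z, 0.0)

      A = 0.5 * rng.normal(size=(3, 3))     # state matrix (scaled down for stability)
      B = rng.normal(size=(3, 2))           # input matrix
      u1, u2 = rng.normal(size=2), rng.normal(size=2)
      x0 = rng.normal(size=3)               # initial state

      # Linear case: stepping and the unrolled (kernel) form give the same x2,
      # because matrix multiplication distributes over the sum.
      x2_steps  = A @ (A @ x0 + B @ u1) + B @ u2
      x2_kernel = A @ A @ x0 + A @ B @ u1 + B @ u2
      print(np.allclose(x2_steps, x2_kernel))          # True

      # Non-linear case: relu does not distribute over the sum, so the
      # computation cannot be pre-split into independent per-input kernel terms.
      x2_relu       = relu(A @ relu(A @ x0 + B @ u1) + B @ u2)
      x2_relu_split = relu(A @ A @ x0) + relu(A @ B @ u1) + relu(B @ u2)
      print(np.allclose(x2_relu, x2_relu_split))       # False (generically)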

  • @vertovitch9989
    @vertovitch9989 1 year ago

    Nice Freudian slip on slide 13 ;)

    • @vertovitch9989
      @vertovitch9989 1 year ago

      "On January 1, 2027, an Attention-based model will be state-of-the-art in natural language processing."

  • @randomman5188
    @randomman5188 1 year ago

    Attention is all you need!

  • @icant1112
    @icant1112 1 year ago

    😘

  • @PerFeldvoss
    @PerFeldvoss 1 year ago

    Did we really ask James, really?