I'm an MSc student who is new to deep learning and had only just heard of SSMs; without being taught these at school, I really struggled to get my head around these concepts at first. This introduction is ABSOLUTELY AMAZING! In just 40 minutes the material is presented in such a concise yet information-rich way that it is understandable even for a newbie like me. I am confident this video paves the way for me to understand more advanced papers on the topic. Thank you!
🎯 Key Takeaways for quick navigation:
00:00 🤖 *This talk explores the question of whether we need attention mechanisms in neural networks for natural language processing.*
01:06 🧠 *Transformers use attention layers to compute interactions between components, which can become complex for long sequences.*
04:32 ⏳ *Transformer models face limitations due to their quadratic dependency on sequence length, affecting both training and generation speed.*
07:04 🌐 *Researchers are exploring alternatives to attention mechanisms in neural networks for NLP.*
11:30 🔄 *Linear RNNs are more efficient than traditional non-linear RNNs for sequence tasks.*
15:52 💡 *Linear RNNs can be computed efficiently using techniques like Fourier transforms or associative scans, making them faster to train (see the sketch after this list).*
20:58 📊 *Continuous time State Space Models (SSMs) are used to explore different parameterizations of linear RNNs, allowing for effective long-range sequence modeling.*
23:18 🏆 *Linear RNNs with SSM parameterization have shown promising results in various machine learning tasks, including language modeling and natural language processing.*
27:00 🧠 *Linear RNNs and State Space Models (SSMs) can effectively handle the routing components in transformer-style models, simplifying their structure.*
27:40 📊 *When fine-tuning linear RNN-based models for tasks involving long-range sentence matching, the kernels learn to look at longer ranges of information, adapting their coefficients accordingly.*
28:23 📈 *Linear RNN hybrid models, combining attention layers with linear RNNs, have shown improved perplexity compared to open source Transformer models, even with a similar number of parameters.*
30:44 🧩 *Researchers have explored simpler parameterizations for linear RNNs, such as using a diagonal form or damped exponential moving averages, achieving good results on long-range tasks.*
32:44 🔄 *A new approach called RWKV combines linear RNNs into an efficient model inspired by Transformer attention, potentially competing with large-scale Transformers.*
34:16 💡 *Scaling up linear RNNs for larger models, such as a 14 billion parameter language model, shows promise for competing with Transformers in zero-shot prediction tasks.*
36:22 🛠️ *Challenges in adopting linear RNNs in practice include support for complex numbers, efficient Fourier transforms, numerical stability, and the need for system-level improvements.*
39:06 📣 *While attention remains dominant in NLP and deep learning, exploring alternative approaches, developing new algorithms, and building communities for scaling are essential for future innovations.*
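Regarding the associative-scan point above (15:52), here is a minimal sketch, assuming a diagonal linear recurrence x_k = a⊙x_{k-1} + b⊙u_k with made-up sizes and plain NumPy; none of this is code from the talk. The idea is that each step can be written as a pair that composes with an associative operator, which is what lets a parallel scan compute the whole sequence in logarithmic depth.

```python
import numpy as np

# Toy diagonal linear RNN: x_k = a * x_{k-1} + b * u_k (elementwise).
# Dimensions and values are illustrative, not from the talk.
L, D = 8, 4                      # sequence length, state size
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, D)    # recurrent (diagonal) weights, |a| < 1 for stability
b = rng.normal(size=D)           # input weights
u = rng.normal(size=(L, D))      # input sequence

# 1) Sequential unrolling: O(L) steps that cannot be parallelized over time.
x = np.zeros(D)
seq_states = []
for k in range(L):
    x = a * x + b * u[k]
    seq_states.append(x)
seq_states = np.stack(seq_states)

# 2) Associative "scan" view: each step is the pair (a, b*u_k), and pairs
#    compose with an associative operator, so they can be combined in
#    parallel (log depth) on hardware that supports it.
def combine(left, right):
    a1, x1 = left
    a2, x2 = right
    return a2 * a1, a2 * x1 + x2

elems = [(a, b * u[k]) for k in range(L)]
acc = elems[0]
scan_states = [acc[1]]
for e in elems[1:]:              # serial here for clarity; the point is associativity
    acc = combine(acc, e)
    scan_states.append(acc[1])
scan_states = np.stack(scan_states)

assert np.allclose(seq_states, scan_states)
```

Frameworks such as JAX expose this pattern directly (e.g. `jax.lax.associative_scan`), which is part of what makes linear RNNs so much faster to train than a step-by-step loop.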
Hands down, the best video I found after all my searching. I was trying to get into S4, Mamba, linear RNNs and so on, but most of the videos I came across were very difficult to understand. This one made a lot of sense; looking forward to more videos like it.
Thanks! Working on some follow-ups.
Amazing talk, professor. I recently attended your talk at Penn State.
Very good! I was really lost in the Mamba paper, but now I understand a little. Thank you!
That's great to hear. I hope to add a Mamba version soon as well.
Incredible walkthrough. Appreciate the time taken to explain simply and succinctly.
Thanks for the kind introduction to SSMs. However, I'm not fully convinced, because:
1. It still learns static routing (27:00). That seems likely to overfit to the training data. Do you think it generalizes well enough to out-of-distribution data, the way a model like GPT-4 must?
2. An SSM compresses the memory of all history into one latent vector. Can it provide all the relevant information to future tokens?
3. LSTMs were introduced to resolve the vanishing/exploding gradient issue that arises because the RNN's recurrent matrix is multiplied L times (so its eigenvalues explode or vanish). SSMs reintroduce that plain recurrence. Why don't they suffer from the vanishing gradient issue?
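To make point 3 concrete, here is a small illustrative NumPy check (my own toy numbers, not anything from the talk): the Jacobian of the final state with respect to the initial state of a linear recurrence x_k = A x_{k-1} + B u_k is A^L, so gradient magnitude is governed by the spectral radius of A. As I understand it, SSM parameterizations such as S4 or the LRU constrain or initialize the eigenvalues to sit just inside the unit circle, which keeps this product well behaved.

```python
import numpy as np

# Toy illustration: ||A^L|| shrinks or blows up with the spectral radius of A.
# The diagonal matrices and radii below are made up for illustration.
L = 200
for radius in (0.90, 0.999, 1.05):
    A = np.diag(np.full(4, radius))              # diagonal A with given spectral radius
    jac_norm = np.linalg.norm(np.linalg.matrix_power(A, L))
    print(f"spectral radius {radius:>5}: ||A^L|| = {jac_norm:.3e}")

# Roughly: 0.90 vanishes (~1e-9), 1.05 explodes (~3e4),
# while 0.999 stays O(1), i.e. gradients remain usable.
```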
Somewhat similar to the Rocket method; thanks for the clear explanation.
I think we need two aspects in sequence modeling: memory capacity and selection. How can we retain useful information effectively and efficiently?
Wow, this is the best explanation I've seen! Is there a Mamba-specific one coming out? :D
Yes, but it's a lot to learn!
Keep up the good work!
Sasha, this was excellent. Thank you. Just a note that the volume was barely audible. It would be useful to normalize the audio levels before upload. Thanks again.
Sorry! Was just learning how to do it at this point. Later videos are better.
Is the performance of the linear models in Dao et al. better because dropping attention frees up additional parameters for training, alongside reasonable long-distance modelling? We lose some of the completeness of attention, so it's surprising that the perplexity would be lower. I suppose it could also be that you need less data to learn reasonable approximations, so maybe they are more data efficient?
In early 2024, your side of the bet looks a lot better off than it did a mere 6 months ago. 😀
Great talk! You said "This talk predates the work on Mamba, but covers foundational preliminaries. Mamba version coming soon." Any word on when that video will be out? Also interested in RWKV v6
It's on the top of my to-do list.
I'm looking forward to this
How does the state in the SSM not explode in size without an activation function?
What does the C HiPPO matrix look like? Is it learned?
Why can't we form kernels using non-linear RNNs?
Because with a non-linear RNN the unrolled recurrence looks like x_{k+1} = relu(A·relu(A·x_{k-1} + B·u_{k-1}) + B·u_k), and the nested relus don't allow us to rearrange the terms into either an associative or a kernel (convolution) form. Basically, the relu breaks the algebra that lets us rearrange things.
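To expand on this with a small numeric check (illustrative NumPy, not from the talk): dropping the relu, the recurrence x_k = A x_{k-1} + B u_k unrolls to x_k = Σ_j A^(k-1-j) B u_j, i.e. a convolution of the inputs with the fixed kernel (B, AB, A²B, ...), which is exactly what enables the kernel/FFT trick. With a relu at every step the nonlinearities stay nested, so no such input-independent kernel exists.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 3
A = 0.5 * rng.normal(size=(D, D))   # toy recurrent matrix (values illustrative)
B = rng.normal(size=(D, 1))
u = rng.normal(size=(L, 1))

# Linear recurrence: running the loop and applying the precomputed kernel
# (B, AB, A^2 B, ...) give the same final state.
x = np.zeros((D, 1))
for k in range(L):
    x = A @ x + B * u[k]
kernel = [np.linalg.matrix_power(A, j) @ B for j in range(L)]
x_kernel = sum(kernel[L - 1 - j] * u[j] for j in range(L))
assert np.allclose(x, x_kernel)

# With a relu inside, x_k = relu(A x_{k-1} + B u_k), the nesting
# relu(A relu(...) + B u_k) cannot be expanded into A^j terms,
# so there is no fixed kernel to precompute (and no FFT shortcut).
```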
Nice Freudian slip on slide 13 ;)
"On January 1, 2027, an Attention-based model will be state-of-the-art in natural language processing."
Attention is all you need!
😘
Did we really ask James, really?