OUTLINE:
0:00 - Introduction
0:45 - Transformers vs RNNs vs S4
6:10 - What are state space models?
12:30 - Selective State Space Models
17:55 - The Mamba architecture
22:20 - The SSM layer and forward propagation
31:15 - Utilizing GPU memory hierarchy
34:05 - Efficient computation via prefix sums / parallel scans
36:01 - Experimental results and comments
38:00 - A brief look at the code
I have to say - your return to making more frequent videos is making me very happy. I used to watch your videos before reading the papers.
good to see you here, Horia. :)
11:15 They did experiments up to 3 billion parameters iirc. There is a mamba 3B model available on huggingface at least
How does it compare to a 3B transformer model?
@@hunterkudo9832 It’s better
@@hunterkudo9832 It's roughly on par with a 7B param transformer
This is hot.
(___0__)__(____0__)
\_________________/
🎯 Key Takeaways for quick navigation:
00:00 📜 *This video discusses the Mamba paper, which introduces a linear-time sequence modeling approach with Selective State Spaces.*
01:19 🔄 *Transformers have the advantage of dynamic and selective attention but suffer from quadratic computation and memory requirements, while RNNs have limited memory and scalability.*
03:05 🔄 *Backpropagation through time in RNNs can be memory-intensive and lead to gradient problems.*
05:51 🔄 *State space models, like S4, offer an alternative to RNNs and Transformers with linear computation but lack data-dependent transitions.*
09:34 🔄 *Mamba introduces Selective State Spaces, relaxing the input-independence constraint while retaining linear computation, making it a hybrid between SSM and LSTM.*
11:39 🚀 *Mamba is competitive with Transformers, especially on long sequences and dense data like language and genomics.*
12:22 📚 *The paper addresses computational efficiency and context-based reasoning, improving on previous SSM models.*
16:12 🤖 *Mamba's architecture combines selective state spaces with other components, providing linear scaling in sequence length.*
18:40 🚀 *Mamba offers fast training and inference, scaling linearly in sequence length during training and handling sequences up to one million in length.*
26:33 🧮 *The video walks through a worked example computing the output Y3, showing how it depends on earlier timesteps through repeated applications of the A, B, and C matrices (see the short sketch after this list).*
28:28 🖥️ *Because those parameters are input-independent, certain quantities can be precomputed, making it possible to compute the output for a new sequence very quickly.*
29:38 🧩 *Mamba focuses on making parameters input-dependent, achieving efficiency through GPU memory optimization.*
32:07 🚀 *The paper reduces data movement and utilizes fast memory (SRAM) for matrix multiplications, resulting in significant speed improvements.*
34:21 🧬 *Mamba performs the scan operation differently, using a prefix-sum technique to accommodate input-dependent elements.*
36:18 📈 *Mamba shows promising scaling performance for large-scale sequence models, outperforming other attention-free models.*
37:23 💻 *Inference throughput on an A100 GPU is strong, and the advantage over Transformers grows as the batch size increases.*
37:36 🧪 *The paper discusses the intricacies of the efficient implementation, providing insights into memory transfers and cost reductions.*
39:50 📊 *The code implementation includes various components such as input projection, 1D convolution, discretization, recurrence, and gating pathways for efficient sequence modeling.*
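To make the Y3 and precomputation takeaways above concrete, here is a minimal sketch (toy sizes, random matrices, all names hypothetical) of the linear SSM recurrence unrolled by hand. With input-independent A, B, C, as in S4, the scalars C·A^k·B form a kernel that can be precomputed once and reused for any new sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 4                                # state size, sequence length (toy values)
A = 0.3 * rng.normal(size=(N, N))          # scaled down so powers of A stay tame
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Sequential recurrence: h_t = A h_{t-1} + B x_t,  y_t = C h_t  (with h_{-1} = 0)
h = np.zeros((N, 1))
ys = []
for t in range(L):
    h = A @ h + B * x[t]
    ys.append((C @ h).item())

# Unrolled: y_3 = C A^3 B x_0 + C A^2 B x_1 + C A B x_2 + C B x_3.
# The scalars C A^k B form a kernel K that does not depend on the input,
# so (for S4-style fixed A, B, C) it can be precomputed once.
K = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)]
y3 = sum(K[k] * x[L - 1 - k] for k in range(L))

print(ys[-1], y3)                          # the two numbers match
```

Mamba's selective variant gives up exactly this precomputability, because B, C, and the step size become functions of the input, which is why the efficient scan and GPU memory tricks discussed later are needed.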
This is a great video with a great rundown of Mamba. Was traveling when the Mamba paper came out and coincidentally stumbled upon this video today. This was a big time-saver to catch me up on the gist of it. I'll make sure to watch more of your videos in the future. Big thumbs up!
Could see this as a "memory" architecture for an actual transformer, remembering distinctive contexts for a long time, while using transformers for the much more complicated and sophisticated logical reasoning where directed focus and attention is really needed.
Huge for assistants, and agents with longer term memory, as well as AI companions
transformers have problems of their own for composition and logical reasoning right?
This is definitely the best explanation video for Mamba I've seen. Thank you!
These kinds of videos are great on an early Christmas morning. You know you are still not really awake. You won't get it anyway. But it kickstarts your brain into work mode.
I work with state space models as a control/optimization engineer on a daily basis. But that diagram of the state space model has got to be the most confusing thing I’ve seen in my life lol
Agreed, both state space models and the Kalman filter have been so thoroughly described in every control theory handbook, and I have never seen anything like this diagram 😅
can you guys recommend a handbook with a clearer representation?
@@ivanzhelyazkov8099 Well, I think you could find a good introduction to state-space systems in Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 is on state space models, but I would also recommend reading up on general properties of LTI systems. There could be better books though; I read it quite some time ago.
@@ivanzhelyazkov8099 Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 deals with SS models, but I recommend looking at chapter 1 and LTI systems as well.
I'd love another video diving deeper!
I think this is very similar to "Retentive Network", which Yannic covered a few months ago. The state transition model reminds me of a linear Kalman filter. Anyway, I can't believe a single memory vector can carry all the necessary information for every token, one size fits all.
Well, they actually mention this in the paper, and yeah Kalman filters are a type of state-space model
@andreaterlizzi lol, if that's true, can it really handle non-Gaussian processes?
This is gonna be crazy if you think about it. It's like you could initialize an "Assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state space to save on compute for generating the next tokens because they don't need to be re-loaded every time. This also means that agents could also all have their own different personalities and behaviors without significant fine tuning requirements
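A toy sketch of the state-caching idea in the comment above, assuming a plain recurrent update (random matrices, hypothetical file name, not the actual Mamba API): process the long prompt once, save the fixed-size state, and later resume generation from it without re-feeding the prompt:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8                                   # toy state size
A = 0.2 * rng.normal(size=(N, N))       # made-up fixed dynamics
B = rng.normal(size=(N, 1))

def step(h, x):
    """One recurrent update; the state is all the model keeps of the past."""
    return A @ h + B * x

# Process a long "system prompt" once and save the resulting state.
prompt = rng.normal(size=10_000)
h = np.zeros((N, 1))
for x in prompt:
    h = step(h, x)
np.save("assistant_state.npy", h)       # the whole "personality" is N numbers

# Later (or in another process): resume from the saved state, never
# touching the original prompt again.
h = np.load("assistant_state.npy")
for x in rng.normal(size=5):            # new user tokens
    h = step(h, x)
```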
Papers! TBH, I don’t even watch any other vids in this channel.
There are other videos?
@@sagetmaster4
Yannic had pretty good ML news videos but then he got busy with OpenAssistant and probably other things...
Now I usually watch "AI Explained" for ML news.
@@Hexanitrobenzene that OpenAI fanboy hype merchant, who probably can't define a perceptron, is no replacement for Yannic's ML News. Better than nothing, but while Kilcher's coverage gave me the feeling of being at an academic event for those who actually study and engineer ML techniques to learn about new ideas they might want to apply and to help them stay up to date, AI Explained seems to just want to blow the minds of tech enthusiasts who are only interested in how close to AGI we are, and always leaves me with a strong smell of bull regarding the significance of whatever he's been making out is a huge leap forward. How much the general culture of the AI community has gone downhill since ChatGPT was released and caught the attention of pseudo-intellectuals is so depressing.
@@EdFormer
I don't think such a strong criticism is warranted. You described him like he was ColdFusion or something like that :)
Of course, if you want a raw engineering take, he is not a replacement, but considering all the countless clickbaity AI-related videos in my sidebar, he seems the most serious of those whose audience is semi-technical. He always reads the papers thoroughly, catches some less well-known trends, and does his own investigations. He even did a thorough analysis of MMLU dataset bugs. Not a NeurIPS-level researcher, but still a serious guy.
@@Hexanitrobenzene absolutely fair response. I guess I have just channelled my dismay regarding the decline of the once great culture of our community at him specifically, since his channel is the only example of these new AI channels (which I think are illustrative of that decline) that I am prepared to engage with. I still watch his videos, so must find some value there. I just really miss being able to rely on regular, diverse, and legitimately technical/academic content from the likes of Yannic and Letitia that really added another dimension to my never ending literature review. Even Edan Meyer, an undergrad at the time, provided the perspective that I think is sorely missed. I feel these channels have struggled to keep up with the cookie cutter garbage that is now filling my recommendations, and again, I am probably focusing my blame for that on AI Explained. One valid criticism I have of AI Explained though is the misleading perspective I think he provides that is AI = LLMs. I'm probably very biased as someone with a background in computer vision, but there's so much more going on than just LLMs. I find it mind blowing, given how things were previously, that I cannot find a good deep dive video on the Segment Anything Model (SAM) or DINOv2.
Finally, I knew I could count on you!
More than the merits and demerits of transformers, the best part is its potential for working across modalities: text, audio, and voice clips.
Very happy transformers aren't the only game in town.
This looks a lot like the state-space representation from control theory. What they are presenting is basically a learnable dynamic system with a linear state transition matrix A and an input-dependent input matrix B, which makes it non-linear, and the same for the observation matrix C. Looks like a massive upgrade over transformers for stuff like music generation, maybe even ViT-based models. What isn't clear to me is how they learn the A matrix; it seems that the farther away the context is, the more severe the vanishing gradient problem, so the nearest elements in the sequence would be by far the most significant.
Something about the “selective” piece maybe? My thought would be that the forgetting rules differ based on a lot of factors, so there are many opportunities for the model to be induced to not forget potentially relevant details
The A matrix is also input-dependent, but yeah, it's hard to believe that it can "remember" things that are 10^6 steps upstream (basically the K term in eq. 3a and 3b).
@@邵斌-n2r He says A is still a parameter at 30:13 and it shows on the paper as well. I think he made a mistake when he said A was input dependent because it seems not to be.
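To make the thread above concrete, here is a rough sketch of the selective parameterization as I read the paper (toy shapes, random weights standing in for the learned projections): A itself is a plain parameter with no length axis, but the discretized transition exp(Δ_t·A) still changes at every step because Δ_t, like B_t and C_t, is computed from the input:

```python
import numpy as np

rng = np.random.default_rng(2)
L, D, N = 6, 3, 4                        # sequence length, channels, state size
x = rng.normal(size=(L, D))

# A is an ordinary learned parameter: shape (D, N), no length axis (cf. Algorithm 2).
A = -np.exp(rng.normal(size=(D, N)))     # negative entries keep the recurrence stable
W_delta = rng.normal(size=(D,))          # toy stand-ins for the learned projections
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))

softplus = lambda z: np.log1p(np.exp(z))

h = np.zeros((D, N))
ys = []
for t in range(L):
    delta = softplus(x[t] * W_delta)[:, None]      # (D, 1) input-dependent step size
    A_bar = np.exp(delta * A)                      # (D, N) varies with t only via delta
    B_t = x[t] @ W_B                               # (N,)   input-dependent
    C_t = x[t] @ W_C                               # (N,)   input-dependent
    h = A_bar * h + (delta * B_t) * x[t][:, None]  # Euler-style discretization of B
    ys.append(h @ C_t)                             # one output scalar per channel
y = np.stack(ys)                                   # (L, D)
```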
I was waiting for your review.
An interesting topic and research explained in this paper. I recommend watching!
Interesting that A^n actually works for long sequences. I would have expected a severe degradation of performance as sequences get longer...
My thought exactly; I need to implement this in code to understand how/why this works.
I think there are some parameterization constraints, like the eigenvalues of A cannot be positive for A^n to be stable. The H3 and HiPPO papers talk more about those conditions (there are also diagonal and other structural constraints on A to make things more efficient, iirc).
Second this. If the eigenvalue is negative, then it will vanish even more quickly (consider a context length of 10^6).
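A tiny numerical illustration of the stability point in this thread (plain linear algebra, nothing Mamba-specific): with a diagonal discrete transition whose entries have magnitude below one, the contribution of an old input decays like a^n, and anything above one blows up:

```python
import numpy as np

# Contribution of an input n steps back through a diagonal transition is a**n.
n = np.arange(0, 100_001, 20_000)
stable, unstable = 0.999, 1.001

print("steps back    :", n)
print("|a| < 1 decays:", stable ** n)    # shrinks toward zero
print("|a| > 1 blows :", unstable ** n)  # grows without bound

# In continuous time the same condition reads Re(eigenvalue) <= 0, because the
# discretized factor exp(delta * eigenvalue) then has magnitude at most one.
```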
I think something is off about your explanation of the A_t prefix products around 35min. The dimensions given in Algorithm 2 imply that A remains constant across timesteps, since it has no L component.
I wanted to ask the same thing. I think it is a mistake in the video. Also before, he mentioned that A is constant and only B and C depend on the input.
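On the prefix-product question: whatever varies per step (A-bar through Δ, or B and C directly), the recurrence h_t = a_t·h_{t-1} + b_t is associative, which is what makes a parallel scan possible. A minimal scalar sketch (hypothetical names) checking that a scan with the right combine operator reproduces the sequential loop:

```python
import numpy as np

rng = np.random.default_rng(3)
L = 8
a = rng.uniform(0.5, 1.0, size=L)   # per-step decay factors (e.g. exp(delta_t * A))
b = rng.normal(size=L)              # per-step inputs (e.g. B_t * x_t)

# Sequential reference: h_t = a_t * h_{t-1} + b_t, starting from h = 0.
h, ref = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    ref.append(h)

# The same update as an associative combine on (a, b) pairs:
# doing (a2, b2) after (a1, b1) maps h to a2*(a1*h + b1) + b2.
def combine(p, q):
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)

acc = (1.0, 0.0)                    # identity element of the combine
scan = []
for t in range(L):
    acc = combine(acc, (a[t], b[t]))
    scan.append(acc[1])             # the b-component is exactly h_t

print(np.allclose(ref, scan))       # True
```

Because the combine is associative, a library scan can evaluate it as a balanced tree, giving O(log L) parallel depth instead of a length-L sequential chain.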
Thank you for the paper review, it always helps!! Happy holydays to everyone 🍾
Happy holidays
excellent talk explaining a non trivial paper. Thanks!
SRAM has very low latency, but the total bandwidth is less than what the GPU-CPU link can do (especially with DMA). You could use something like buffer object streaming (this is the name OpenGL has for it) to reduce memory usage massively. This would also allow for much larger batches, as they are computed concurrently with the same weights. Does anyone know if this is already being done? I can't find anything on the topic.
Do it. Can you talk about this in private? I'm very interested in the topic of optimizing operations right now
I don't understand what got me to even watch this at 2:00 in the morning.
This very much triggers the same intuitive reservations and skepticism I feel towards linear transformers and other attempts at improving scaling by throwing away one of the essential components of what makes transformers work. I am not convinced that any of these architectures actually bring the same scaling we have seen with transformers so far, where the only limiting factor seems to be the amount and quality of the training data.
The future looks like collections of models working in concert, so having transformers doing most of the work in most systems with certain domains using a different architecture seems plausible
After "Text Embeddings Reveal (Almost) As Much As Text" came out, I became convinced that encoder-decoder transformers learn embeddings that behave just like an RNN's hidden/recurrent state. If you plot the embedding of a sentence unmasked sequentially, you get a trajectory. That paper shows this trajectory is quite informative w.r.t. the original text.
This is interesting because that suggests you can train an RNN on the learned embeddings. Since you already have the embeddings, there is no need for back-prop through time. It would be like training a regular neural net on input-output examples, where the input is the embedding at time t and the output is the embedding at time t+1. It's only speculation, but it could be a practical way of distilling pre-trained transformers into RNNs for deployment.
P.S. By unmasking a sentence sequentially, I mean for example "hel", "hell", "hello".
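A rough sketch of the distillation idea above, assuming you already have prefix embeddings e_0 … e_T from a frozen encoder (random placeholders here): fit a single-step model e_{t+1} ≈ f(e_t) from plain input-output pairs, with no backpropagation through time. The simplest possible instance is a linear least-squares map:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 200, 32                      # number of prefixes, embedding size
E = rng.normal(size=(T, d))         # placeholder for real prefix embeddings

# Input-output pairs: embedding of prefix t  ->  embedding of prefix t+1.
X, Y = E[:-1], E[1:]

# Simplest "recurrent cell": a linear map fit by least squares, no BPTT.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("train MSE:", np.mean((X @ W - Y) ** 2))
# With real embeddings you would swap the linear map for a small MLP and
# train it the same way, one (e_t, e_{t+1}) pair at a time.
```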
@@dennismertens990 I agree that these architectures in general and Mamba in particular seem like they are a very good idea for more task specific deep learning. I should have specified that I was referring to the general purpose LLMs or more multi-modal models which develop some level of reasoning and general abstraction capability. For more constrained tasks like DNA modelling this might very well be a huge leap forward.
In regards to sequential unmasking, it would of course still be token based, not character based, but I don't think you can get away with throwing away the attention matrix without a massive loss in performance even if you train an embedding.
@@sagetmaster4 A lot of people have been saying this for many years at this point. And for a long time this seemed like a very reasonable assumption: if you can not scale generally, you have to specialize. It's the same with general compute as well as many other areas. But I haven't seen a single example of this actually working to a degree that outperforms just putting the same amount of extra compute into a larger or longer-trained model. I am beginning to think that we are missing some very fundamental and essential insight to make this actually work. Sparse network designs like mixture-of-experts seem to have some real benefits here, but only for inference speed. But I would argue this is only tangentially related to the whole heterogeneous architecture idea.
I for one think the next major architectural step probably needs to be intermediate higher-level memory instead of just scaling the context. Being able to store abstractions in memory seems like such a fundamental component of human thinking that I can't help but think that it might have the potential for major improvements in an LLM as well.
The other thing that will eventually be necessary is a way for models to spend significantly more time "thinking" before answering. There were a few attempts with traditional LLMs and a thinking-token to essentially allow more inference steps for the next token. And the results looked promising in the paper. But now it seems to have been mostly forgotten about. So a more fundamental way for the model to recurse internally might be necessary for introspective logic.
@@lexer_ If this paper's claims are actually true, then it has better perplexity on The Pile (general-purpose language modeling) than traditional transformers. We've also seen that attention actually struggles quite a bit with information retrieval on long sequences, so while it is powerful, it's not a perfect mechanism. A lot of older subquadratic transformer papers basically just did "attention but worse" (e.g. window + random attention), and so they naturally had tradeoffs that a completely different mechanism like this isn't necessarily going to have.
What are the differences between RWKV, RetNet, and Mamba? Which here has the closest architecture to transformers?
This is maybe worth a video all to itself, but they're all different types / levels of tradeoff between a transformer and an RNN, basically
For A^n, will so many multiplications by A lead to an explosion of values?
Thanks for talking about this model! ❤
Impressive, thx as always 🎉 happy holidays
Regarding not having a facecam: for people still learning English, having a visual reference can drastically improve listening comprehension.
Please do more papers, like before!
this is my christmas present
Great explanation. Will you please share the link to the github repo?
😊 This is the second video I've watched on this paper, and I understand it better now. The paper is kind of too hard for me 😂
I haven't seen any Mamba implementations in PyTorch. Also, will it replace Transformers?
will we use mamba to install mamba?
I hope not. The whole conda ecosystem should not be used for serious production code. Pyenv + poetry + pip is all you need.
@@GatlingNG conda can install things that pip can't.
@@vcool We have containers for that, not some half-assed, cobbled-together venv manager with barely maintained packages.
If Mamba can handle such long sequences, does it need tokenization?
You can't really just input words like that into any model, but it might still be easier in other ways. In the S4 talk video Albert Gu talks about how much easier it was for people in other fields to use their model since it's a one size fits all with great performance without tuning.
Neither transformers nor Mamba need tokenization; it's mostly just for efficiency. But I'm curious to hear about the potential implications and whether Mamba would be more capable of handling larger vocabularies.
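A small illustration of the tokenization point: raw bytes already give a vocabulary-free encoding of any text, at the cost of much longer sequences, which is exactly where a linear-time model would matter (the subword comparison in the comments is a rough, hypothetical figure):

```python
text = "state space models"

# Byte-level "tokens": no learned vocabulary, every id is in [0, 256).
byte_ids = list(text.encode("utf-8"))
print(len(byte_ids), byte_ids[:8])   # one id per byte, so sequences get long

# A subword tokenizer would emit only a handful of ids for this string, from
# a vocabulary of tens of thousands: shorter sequences, but a big embedding
# table and awkward handling of rare or novel strings.
```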
Brilliant explanation!
Thanks!
Was waiting for this, thanks!
What does the "zoom" mean in the video?
good video, Yannic has a strong ability to dissect and extract key points
Okay, so the hidden layer is just a really big IIR filter: en.wikipedia.org/wiki/Infinite_impulse_response
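That observation can be checked directly: a single scalar SSM channel h_t = a·h_{t-1} + b·x_t, y_t = c·h_t is a first-order IIR filter, so SciPy's lfilter reproduces the hand-rolled recurrence (made-up coefficients, just a sanity check):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
x = rng.normal(size=16)
a, b, c = 0.9, 0.5, 2.0                 # made-up scalar SSM "matrices"

# Hand-rolled recurrence: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t
h, y_rec = 0.0, []
for xt in x:
    h = a * h + b * xt
    y_rec.append(c * h)

# The same thing as the first-order IIR difference equation
#   y[n] - a*y[n-1] = (c*b)*x[n]
y_iir = lfilter([c * b], [1.0, -a], x)

print(np.allclose(y_rec, y_iir))        # True
```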
helped so much to understand!
What happens if you limit the training corpus to enwik9 and then just hammer away?
Nice! Was waiting for this 😁
It sounds like S4 solves the vanishing and exploding gradient problem, so that it works for very long sequences. Now I'm curious how that works...
Can we use this model for image classification, like ViT?
Yep, you can. Check out V Mamba (Vision Mamba).
@@EngineeredFemale but there is no training code for vision mamba.
This is so great.
You are just amazing.
Can somebody help me understand this? In figure 1 in the paper it says “Structured SSMs independently map each channel (e.g. 𝐷 = 5) of an input 𝑥 to output 𝑦 through a higher dimensional latent state ℎ (e.g. 𝑁 = 4)”. How are four dimensions higher dimensional than five?
I am not trolling and I understand that it says “e.g.” and that it could not be deliberate. But given the quality of the paper that seems unlikely.
Is there another way to interpret higher dimensional? Does it just mean “not a scalar”?
I found the solution. Well, GPT4 did, but who’s counting? 😅
Each of the five channels is mapped to four hidden dimensions. Of course, now it all makes sense.
This is what the horse said: “The caption mentions that structured State Space Models (SSMs) map each channel of an input to an output through a higher dimensional latent state h (for example, N = 4) while avoiding the materialization of a large effective state space. The inconsistency you're pointing out seems to be related to the dimensionality D given as an example (i.e., D = 5) versus the term "higher dimensional."
This likely means that each channel D is independently mapped, and the "higher dimensional" aspect refers to the combination of these mappings to create a state space with more dimensions than the individual inputs. The effective state space would then be D × N, which is larger than the dimensionality of each individual channel D alone. So, in this context, "higher dimensional" does not contradict the example dimensions given; rather, it points to the resultant multi-dimensional space created by the model's structure.”
Why a "horse" said that? @@mkamp
Thanks, great!
6:50 RNN: y(t+1) = sigmoid(y(t) + x(t))
State-space: Well, let’s get rid of non-linear sigmoid so that y(t+1) = y(t) + x(t) = … = y(0) + x(0) + x(1) + … + x(t)
Now you change the game lol.
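For what it's worth, the simplified linear case in the comment above really does collapse to a prefix sum; a quick check with y(0) = 0:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# y(t+1) = y(t) + x(t) with y(0) = 0, unrolled, is just a running sum of x.
y, y_loop = 0.0, []
for xt in x:
    y = y + xt
    y_loop.append(y)

print(np.allclose(y_loop, np.cumsum(x)))   # True
```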
I want to see a model that requires log memory of length, which could be better than requiring linear memory.
This model has constant memory at inference; it has linear memory during the training phase.
Would love to see some classification of machine learning itself using AI, there seems to be a lot of stuff called different that is functionally very similar
I am super sceptical towards these reduced models. But I think they are interesting as a kind of ablation study for the Transformer architecture. Helping us to understand their inner mechanisms better.
lovely
Super nice
Generative Pretrained Mamba
I've always preferred videos with no face cam, actually.
Wowww
tragic loss of not seeing your sunglasses*
fify
I can't help but feeling like the claim that this will "replace" Transformers is a bit much... Especially after watching the video
Remember, training data contains biased and wrong information, partial truths, high adv. data, school philosophy, and ideas optimized for the wrong time or era... and a lot more. You need to somehow filter or devalue these, or put them in some different value dimensions. This would make hallucination go down a little more. My noob opinion.
Yet again I can't understand a word of the maths except the intro. I've heard of LSTM, big O, matrices, FFT... that's about it.
"its just basic 2nd year maths" - i hate people who say this. they have no idea of the stuggles of people who have to work for 10 years just to afford 1 year of university or cant understand why paying 2 months of your income for 1 months of rent is a barrier to learning this IMPOSSIBLE maths!
ITS NOT MATHS ITS LATIN CODE FOR BASIC SHIT ONLY YOU UNDERSTAND. THERES NO WAY TO LEARN THIS IF YOUR PARENTS ARENT MILLIONAIRES
@@monoham1 My parents were poor immigrants from Eastern Europe, and I took loans to go to a public university (Canadian) that cost $10k/yr. I got good grades and the govt (of Canada) wrote off half those loans. The US has cheap state schools too, right?
You don't need to be a millionaire.
You just need time and interest.
Why learn multiplication and division when all you need is addition and real numbers? Same principle here. Reduce complexity
It really makes no sense that all these efficient architectures come in small sizes like 3B models. If they trade off capabilities for performance, they should release 7B or 13B models; why would anyone run a 3B Mamba or RWKV if 7B Mistral runs on everything? Nice tech, but as long as it's under 7B it's just a demo.
This is research, not engineering. Maybe they just had limited compute resources.
I was surprised to learn that transformers rely on a classical algorithmic hack to have a kind of memory. I'm quite sure that's a flawed premise that won't last. It reminds me of bag-of-words, which was a tremendously flawed premise, despite how relatively amazingly well transformers work. Most approaches so far seem hacky.
I've thought ahead a bit, and I figure that any kind of real-world, well-rounded AI needs a vastly more complex and sophisticated architecture. That's not to say it couldn't happen relatively quickly, but LLMs ain't it. Just the ability to hold a thought and progressively work on it with shifting focus is way beyond LLMs. And that's analogous to image generation, which superficially looks very close to flawless but in reality is also very far from the holy grail, for much the same reason.
Before most of you were born ..... lol 😅
Thank you!
LSTMs, transformers, and anything in a similar direction are just not gonna work.
what do you mean by 'not gonna work'? Transformers have already proven their worth, haven't they?
Opinion: over-engineering architectures won't actually solve the real problem at hand, which is intelligence.
overbuilding allows one to handle more unforeseen types of problems, though.