Appreciate your effort to make this video.
My pleasure, thanks for watching :)
I came here for giant snake vs giant robot fight
Thanks for watching :)
Thanks for the effort you put into this detailed comparison, I learned a few more things. Btw, the editing and graphics in this video were really good 👍
Glad you liked it!
Can Mamba have its input RoPE scaled? It seems it doesn't require positional encoding, but this might make it extremely efficient for second-order optimization techniques.
In Mamba, the sequence length can be scaled up to a million (i.e., million-token sequences). It also uses standard gradient-based training (I did not find any information on second-order optimization in their method): they train for 10k to 20k gradient steps.
beautifully explained!
Glad you think so!
I'd like to see this tested out for larger models, comparable to Llama 2. One question that I have is whether there are diminishing returns for long-distance relationships compared to a context window of sufficient size. Is it enough for people to give up (tried-and-true?) Transformers, with their explicit modeling of the context, for something that is more selective?
A thoughtful observation! Yes, the authors of Mamba have already tested it against Transformer-based architectures, such as PaLM and LLaMA, and a number of other models. Here's what they state in their article (page 2):
"With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has 5× generation throughput compared to Transformers of similar size, and Mamba-3B’s quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B)."
Regarding scaling the sequence length, I explained a bit in the video. Here's a bit more from their article (page 1):
"The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length."
There's also an interesting summary table of model evaluations (Zero-shot Evaluation, page 13) comparing different Mamba model sizes against GPT-2, the H3 Hybrid model, Pythia, and RWKV, where in each instance Mamba exceeds these models' performance (check out the accuracy values on each dataset, especially for the Mamba 2.8-billion-parameter model; it is truly unique).
And, thanks for watching :)
A 7B to 10B Mamba would be interesting to judge, but right now it seems it's really good with long content in the small-model space.
You are right! Generally speaking, a larger parameter count gives better results when tuning models. But Mamba's claim is that we don't necessarily need larger models; a more efficiently designed model can perform comparably to other models even when trained on less data and with fewer parameters. I suggest section 4.3.1 of their article, where they discuss "Scaling: Model Size"; it can give you a good perspective. Thanks for watching :)
I'm enjoying these mamba videos you're sharing with us, thanks
Glad you like them!
keep up the good work, btw you got a new sub!
Thanks for the sub!
You clearly explain the difference between Transformer and Mamba, thank you,
but could you also give the reference for the paper you mention in the video so I can dive in?
Hi, glad the video was helpful. The reference for the paper is also mentioned multiple times in the video, but here's the full reference for your convenience:
Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Just when I thought I had caught up with GPTs and Transformers, BOOM, MAMBA!!!
I know, right?!
I can't keep up. Is Mamba like the Mistral model, or is it an LLM technology?
Mamba is an LLM, but it has a unique architecture: a blend of traditional SSM-based models with a multi-layer perceptron (MLP), which adds 'selectivity' to the flow of information in the system (unlike Transformer-based models, which typically take the whole context, i.e., all the information, into account to predict the next word). If you are still confused, I recommend watching the video on this channel called "This is how exactly language models work", which gives you a perspective on different types of LLMs :)
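To make the 'selectivity' idea a bit more concrete, here is a deliberately simplified PyTorch sketch of the selective SSM recurrence described in the paper: the step size delta and the matrices B and C are computed from the current input, so the hidden state can choose what to keep and what to forget. The class name ToySelectiveSSM and all layer sizes are illustrative choices of mine; the real Mamba block also includes a convolution, gating, and a hardware-aware parallel scan that this slow Python loop leaves out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy illustration only: a sequential version of the selective SSM recurrence.
# delta, B and C depend on the input, which is what gives the model "selectivity".
class ToySelectiveSSM(nn.Module):
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model, d_state))   # fixed negative state matrix
        self.to_delta = nn.Linear(d_model, d_model)             # input-dependent step size
        self.to_B = nn.Linear(d_model, d_state)                 # input-dependent input matrix
        self.to_C = nn.Linear(d_model, d_state)                 # input-dependent output matrix

    def forward(self, x):                                       # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        h = x.new_zeros(batch, d_model, self.A.shape[1])        # one hidden state per channel
        outputs = []
        for t in range(seq_len):
            xt = x[:, t]                                        # (batch, d_model)
            delta = F.softplus(self.to_delta(xt))               # how much of this token to absorb
            B, C = self.to_B(xt), self.to_C(xt)                 # (batch, d_state) each
            A_bar = torch.exp(delta.unsqueeze(-1) * self.A)     # discretised decay of the old state
            h = A_bar * h + (delta.unsqueeze(-1) * B.unsqueeze(1)) * xt.unsqueeze(-1)
            outputs.append((h * C.unsqueeze(1)).sum(-1))        # y_t = C h_t, per channel
        return torch.stack(outputs, dim=1)                      # (batch, seq_len, d_model)
```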
I'm working on a project that uses sequence data, but not language specifically. In a Transformer-like network, instead of an embedding layer for the source and target, I have linear layers; I also pass both source and target to the forward pass. In an LSTM-like network, I don't even need this step; I just use the standard torch LSTM cell, and only the source is needed for the forward pass. Does anyone have a code example of how I can do this with Mamba? I'm having difficulty figuring it out.
Hey, I just found a PyTorch implementation of Mamba at this link. I haven't gone through it personally, but if it is helpful please do let me know: medium.com/ai-insights-cobet/building-mamba-from-scratch-a-comprehensive-code-walkthrough-5db040c28049
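In the meantime, here is a minimal sketch of the usual approach for non-language sequences: replace the token-embedding layer with a linear projection and run the result through Mamba blocks, decoder-style, so (as in your LSTM setup) only the source sequence is needed in the forward pass. It assumes the official mamba_ssm package from the state-spaces/mamba repository (which needs a CUDA GPU); the wrapper class MambaForContinuousSequences, the layer counts, and the regression head are hypothetical choices for illustration, not something prescribed by the paper.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # official package from the state-spaces/mamba repo (CUDA only)

class MambaForContinuousSequences(nn.Module):
    """Linear projection in place of an embedding, a stack of Mamba blocks, and a small head."""
    def __init__(self, in_features, d_model=64, n_layers=2, out_features=1):
        super().__init__()
        self.in_proj = nn.Linear(in_features, d_model)          # instead of nn.Embedding
        self.blocks = nn.ModuleList(
            Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2) for _ in range(n_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.out_proj = nn.Linear(d_model, out_features)         # e.g. a regression/forecasting head

    def forward(self, src):                                      # src: (batch, seq_len, in_features)
        h = self.in_proj(src)                                    # (batch, seq_len, d_model)
        for block in self.blocks:
            h = h + block(h)                                     # residual around each Mamba block
        return self.out_proj(self.norm(h))                       # (batch, seq_len, out_features)

# Usage: only the source sequence goes into the forward pass.
model = MambaForContinuousSequences(in_features=8).to("cuda")
x = torch.randn(4, 128, 8, device="cuda")
y = model(x)                                                     # shape: (4, 128, 1)
```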
Let's goooo! Mamba basically has similar memory to humans, but brains do tend to forget when the information is unnecessary, so there's that.
That's right. Essentially, the main idea behind SSM architectures (i.e., maintaining a hidden state) is to be able to manage the flow of information in the system.
Great video.
Thanks for the visit!
Low audio volume.
Thanks for your feedback!
Why didn't they call it Godzilla?
Funny :) It seems to be as powerful as one!