Insane, I loved the way you went through multiple important prior papers before talking about mamba!
Depth-first search of the dependency tree of papers :)
Stanford labs are thriving right now. To think all this work is made OPEN SOURCE at a time of hostile and fierce competition among the big tech companies.
The original Transformer was open-sourced by Google
Also, GPT and GPT-2 were open source
This is no surprise to those in the community
2 Timothy 3:16
New World Translation of the Holy Scriptures (Study Edition)
16 All Scripture is inspired of God and beneficial for teaching, for reproving, for setting things straight, for disciplining in righteousness,
Like a knife, a weapon available to everyone is not a weapon anymore; it's just a mere tool... #hail_to_the_open_source_and_public_research
Is not Mathematics the language of God?
@@nikoladjordjevic4477 Those companies all turned closed source once money got involved.
I really, really like the build-up of ideas through papers. It's a great way to introduce the idea while giving references that we can look up and trace ourselves, and for someone coming onto the scene with no context from the last few years of research it provides a neat overview.
Absolutely fantastic. Personally, I would be happy to watch a much longer video: same structure, just slower and broken down a bit more.
This is not a complaint. The video is awesome as it is. Just feedback.
Thank you for such a good survey of the prior work! Your effort is noted and appreciated!
Much appreciated!
Hope the open source community builds on this
Well get on it. The open source community is also 🫵
WE are the open source community ☺️
The authors already keep their code open source so the work is replicable. It's common practice in ML research.
Except for GPT-3 and 4 XD @@rrestoring_faith
I'm interested in getting into the open source community, but I'm not too sure how to start. Is there like a given list of problems to solve or do I just start experimenting?
Always appreciate your excellent video explanations of cutting edge papers, thanks!
Thanks!
Thats a really high quality content. I also really like the way you highlight the text when you read over it, this makes it easier to follow along!
Thanks!
Thanks for this, I feel caught up again! I've seen several papers popping up with alternatives to the transformer architecture, but I lacked a framework to grok them. The way you put this paper in a broader context, both in terms of the new benchmark for long range arenas and the emphasis on "no free lunch" w/re to LTI vs SSM was really helpful.
Can you send some links to those papers on alternative transformer architectures?
The technique of solving long-term memory problems using polynomial projection is somewhat similar to using FFT for multiplication. Essentially, both methods use highly efficient information representations with almost orthogonal channel capacity to represent the original information
I don't understand anything
Yeah, but only big math heads will actually ever fully understand it.
Ironically many of the ML concepts in the video went over my head but this Fourier analogy was more approachable for me.
The crux of this network's performance lies in the fact that they use the coefficients of Legendre polynomials as a basis, which allows the information to be highly compressed with minimal loss. Thinking about sequence memory, it moves away from iterative or recursive processing to a more holistic, algebraic form of memory management.
In line with your viewpoint, this work is actually similar to using the FFT for n-bit multiplication
@@xyh6552 I think it basically is a high-dimensional FFT that's tracking location in the model's similarly high-dimensional memory/association space. It should provide near-perfect representation and recall, and higher efficiency, for recurrent networks.
U lost me at "legendre"
@@xyh6552 Yep, the FFT uses the Fourier basis; this is using the Legendre basis.
@christophkogler6220 Similar to your viewpoint, from the perspective of solving the Kakeya conjecture in finite fields, I believe the main idea is to utilize the rigidity of polynomials to achieve efficient compression. I speculate that the effect of utilizing the relationship between polynomials and roots in polynomial splitting fields is essentially replacing one "n" in the complexity with "logn"
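Rough sketch of the Legendre intuition being discussed here (this is not the HiPPO update rule itself, just the compression idea, with a made-up signal and coefficient count): fit a long signal with the first few Legendre coefficients and rebuild it from that tiny state.

```python
import numpy as np
from numpy.polynomial import legendre as leg

# Toy signal: 1000 samples on [-1, 1] (the natural domain of Legendre polynomials).
t = np.linspace(-1.0, 1.0, 1000)
signal = np.sin(3 * np.pi * t) + 0.5 * np.cos(7 * np.pi * t)

# Compress: least-squares fit of the first 32 Legendre coefficients.
coeffs = leg.legfit(t, signal, deg=31)   # 32 numbers instead of 1000 samples

# Reconstruct the whole history from the compressed state.
recon = leg.legval(t, coeffs)

print("compression:", signal.size, "->", coeffs.size)
print("max reconstruction error:", np.max(np.abs(recon - signal)))
```

HiPPO's contribution is doing this online, updating the coefficients with a recurrence as each new sample arrives instead of refitting the whole history.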
I liked this video so much that I reached for the like button 3 times while watching it.
Awesome context on S4. This is extremely helpful for getting the context and stripping the hype to get to the meaning.
That's definitely a sub and I am off to watch all the other videos
I need an ‘explain it like I’m five’ version of this. 😄
But I hope it means something strong is coming down the pipe.
Actual ELI5: Many current AI models rely on 'MLP (Multi-Layer Perceptron)' and 'Transformer' blocks in their design. The "problematic" (but also usually the 'smart') one is the 'Transformer' block. These need more and more resources to process the context as the context size increases, making scaling up VERY difficult - for an 8x larger context you need about 64x the resources. This is because Transformers compare every part of the context to every other part of the context, every time.
The Mamba architecture excludes both the MLP and Transformer blocks for the new 'Mamba' block. It needs the same amount of resources for an increase in context size no matter how large the context already is. For an 8x larger context, you would only need about 8x the resources. That means that - compared to a Transformer based model - you could give it way more input at once and get way more output at once, with the same memory resources.
If the method works at larger scales, Mamba could be another significant step forward for AI capabilities.
Most current public-facing LLM models, like ChatGPT, use Transformers in their architecture. Transformers include 'self-attention', which basically weighs the importance of everything against everything else, all at once. This means they process any input in approximately O(N^2) time and memory (where N is the input length). As input / context length increases, their demands scale incredibly high. Anybody with a decent GPU technically CAN run a local LLM, it's just small, slow, and dumb. To run anything decent, you end up needing tens (maybe even hundreds) of gigabytes of extremely fast memory, which means workstation GPUs that cost thousands, or even entire GPU clusters.
The Mamba architecture is basically an entirely different TYPE of AI, more similar to a Recurrent Neural Network, and is both faster and more memory efficient. It processes and considers information sequentially, instead of all at once, but can ALSO ignore unimportant information. The architecture would be able to process an input in approximately O(n+L) time and memory, where n is essentially some constant and L is input length. If it continues to work so efficiently at increasingly large scales, it means literally orders of magnitude faster output and lessened memory requirements for a given context window, which can allow model context size to be increased massively while still using less computational resources than the previous methods.
This part is basically educated guesswork, as this level of linear algebra / ML theory is a fair bit over my head: I think Legendre memory cells basically create a high dimensional Fast Fourier Transform from the Legendre polynomials (1:25), which fits neatly into my mental model of AI. In a certain mathematical sense, everything an AI knows can be represented as an incredibly complex interplay of gradients/associations between multidimensional vectors. A multidimensional FFT thus allows you to track your 'location' (the context) within this interplay as the model works, efficiently and with little (to near-zero) accuracy loss. They also allow you to accurately recompose the input from memory, thus allowing the model to efficiently recall or ignore only parts of the input that matter. The 'importance' of any specific part of the input to the output is probably based on some calculation of distance from the FFT to the 'current context'. If it isn't a 'near' association, it probably doesn't really matter to what you're doing, and so can be ignored.
And here's a summary of the results of the paper from ChatGPT, after I had a little Q&A with it:
Summarize the benefits the Mamba architecture has over the Transformers architecture when at similar scales.
The Mamba architecture offers several significant benefits over traditional Transformer architectures, particularly when both are at similar scales. Here's a summary of these advantages:
1) Higher Throughput and Efficiency: Mamba achieves a much higher throughput in both inference and training compared to Transformers. Specifically, it has been noted to have 5× higher throughput during inference and up to 40× faster efficiency in training operations. This increased efficiency is especially beneficial when dealing with large-scale models and data.
2) Linear Scaling with Sequence Length: Unlike Transformers, which have quadratic scaling with sequence length, Mamba scales linearly. This is a substantial advantage for processing long sequences, as it ensures more predictable and manageable growth in computational requirements and memory usage as sequence length increases.
3) Improved Generation Throughput: In tasks like language modeling, Mamba not only outperforms Transformers of the same size but also matches or even exceeds the performance of Transformers that are twice its size. This indicates higher efficiency and effectiveness of Mamba in generating outputs.
4) Effective Handling of Longer Sequences: Mamba is particularly adept at handling long sequences, outperforming Transformer models in tasks involving extended contexts. Its design allows it to focus on the most relevant parts of a sequence, enhancing its ability to generalize to much longer sequences than it was trained on.
5) Simplified Architecture: By omitting attention and MLP blocks, Mamba’s architecture is more streamlined than that of traditional Transformers. This simplification contributes to its efficiency, especially in dealing with long sequences.
6) Hardware Optimization: Mamba’s hardware-aware algorithm makes it more compatible with modern GPU architectures, leading to better performance on current hardware platforms. This optimization is crucial for achieving faster processing speeds and more efficient utilization of computational resources.
In summary, Mamba offers significant improvements over Transformers in terms of efficiency, scalability, and effectiveness, particularly at similar scales. Its innovations in architecture and design enable it to handle longer sequences more efficiently, making it a strong candidate for various applications in fields requiring efficient sequence modeling.
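If it helps, the scaling argument from the ELI5 above is easy to see in a toy script (deliberately naive NumPy, made-up sizes, no learned selection, nothing like the real fused Mamba kernel): attention materialises an L×L matrix, while an SSM-style scan only ever carries a fixed-size state.

```python
import numpy as np

L_seq, d, n = 512, 64, 16   # sequence length, model width, state size (toy numbers)
x = np.random.randn(L_seq, d)

# Self-attention style: every position attends to every other -> O(L^2) work/memory.
scores = x @ x.T / np.sqrt(d)                      # (L, L) matrix
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ x                             # (L, d)

# SSM/recurrent style: carry a fixed (n, d) state, one update per step -> O(L) work.
A = 0.9 * np.eye(n)                                # toy fixed dynamics
B = 0.1 * np.random.randn(n, 1)
C = 0.1 * np.random.randn(1, n)
h = np.zeros((n, d))
scan_out = np.empty_like(x)
for t in range(L_seq):
    h = A @ h + B @ x[t:t+1, :]                    # state update, cost independent of L
    scan_out[t] = (C @ h)[0]

print("attention intermediate:", weights.shape, " scan state:", h.shape)
```

Mamba's extra trick is making B, C and the step size depend on the input (the 'selective' part) and fusing the scan into one GPU kernel, which is where the measured speedups come from.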
Thank you! This was excellent.
I think this answer is wonderful… and can tell it’s ChatGPT generated 😅
@@christophkogler6220I think this was more of an "explain like I have a bachelor's in CS," but that's exactly what I needed 🙂 Thanks for writing it out
@@alexander191297 Only the part after I mention ChatGPT :)
I noticed that @havenhq had tuned a chat version of the pretrained Mamba-2.8B on Hugging Face. I played with it on Colab and it feels like a decent chatbot already. I'm very excited about the future of this architecture
You have any code for it?
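I haven't run that exact checkpoint, but something along these lines should work on Colab, assuming a recent transformers release (Mamba support landed around v4.39), a GPU runtime, and that the model id below is still on the Hub; swap in havenhq's chat-tuned checkpoint if you want actual chat behaviour:

```python
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-2.8b-hf"   # assumed id; substitute your preferred Mamba checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain state space models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The base 2.8B model isn't instruction-tuned, so for a chatbot you'd load the tuned weights and use their chat template (tokenizer.apply_chat_template) if one is provided.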
Man, those papers include hardcore numerical linear algebra :D
S4 seems to go the hardest with its convolutional trick, but then everyone else goes "fuck this complicated shit, it's too constraining, let's just parallelize more!" and honestly if I had been the one coming up with that clever math I'd feel so cheated 😂.
This does it for my 'aspiration video' of the week.
Great.
Thanks for the video, would love to have a more detailed explaination based on the related works before!
As a person new to the field, I greatly appreciate the way you presented things here!
Thanks!
Honestly, how do you make sense of these papers? I've listened to the whole video and still haven't got a clue what it is about. Quite a lot of brilliant people out there do work like this.
Very encouraging that they included the situation in which S6 did poorly! If there are no other catches this looks incredible!
Tri Dao is one hell of a contributor
Remember, the RWKV mentioned is the one from its paper, RWKV v4; there isn't yet a paper for v5 and v6, but v6 is similar to Mamba
Edit: it was updated today
How similar? well, I don't know, check it at the repo
Amazing work ! Keep 'em coming !
Thanks, will try!
Very interesting and well explained. Thanks a lot.
Thanks for your work !! It is really helpful to look through the related works 😮😮
I'm only at 7:13, right after 'spicy'. Subscribed. Great format and amazing delivery!
Thanks!
Keep these coming! Great video.
When you stand back and squint your eyes at these papers they almost have a turbo encabulator quality to them.
Lol, yeah. I noticed that most fields tend towards that as you get towards the bleeding edge. Sometimes I have to stop what I'm working on and just appreciate how it looks like nonsense to an outside viewer
Great review! Looking forward to playing with it soon :)
Thanks!
Would this help a regression-based transformer whose data is based on the stock market's price action?
Or is it more for multi-media?
Thank you for bringing this to our attention; it has been really insightful
great summary and even better presenting voice.
Thanks!
This is my first time watching your channel.
Impressive walkthrough.
When I first heard of Q* my imagination started to build a very similar architecture... I don't follow too much of the technical, but I saw how the sandwiched gates, shown in the video, could be used almost in an analogue fashion. This is brilliant!
Watching this made me grin like crazy...
This might not be zero memory, but dang if it isn't a huge step in that direction. Using local memory is genius. And that token interpretation length, yes...
So... physically, I guess, in my mind the next step is to localize the memory to the operation even more, but it looks like in that architecture it's as local as it's going to get...
What about something like... "Sample-and-hold," from actual analogue circuits? That might be something to think about.
It's practically certain that someone is already training a 7B+ Mamba model out there, most likely even bigger.
true
Just as complex numbers work well for SSMs in audio, I can't help but wonder whether split-complex numbers would help SSM performance in language tasks (considering the hyperbolic flavor of split-complex numbers and the benefits of hyperbolic embeddings when encoding hierarchical data).
It certainly seems plausible. In my experience, while hyperbolic embeddings make strong intuitive sense for hierarchical data, I've never seen them yield significant gains (the kinds of works I'm familiar with are of this flavour: arxiv.org/abs/2304.09172). If your experience has been different, I'd be curious to hear.
What source do you suggest to understand the algebra and math behind these works? I really struggled to understand most of the concepts, though I have a fairly good grasp of the math behind transformers.
First of all, I recommend you guys 3Blue1Brown's linear algebra videos. Then, if you already have a solid foundation, I would recommend the book "Linear Algebra Done Right"
Amazing video! Subscribed.
Can you use the Mamba architecture in conjunction with BitNet b1.58?
This also begs the question of how the hardware-aware process fares when the memory between system and GPU is fully shared...
Great video, thanks for sharing!
beautifully developed narrative
This video reminds me of the fact that I do not understand mathematics🙃
How is Mamba similar or different to multi-expert models?
What is the minimum card spec (memory, CUDA, tensor cores, whatever) to run this model?
Feels like it's becoming more and more similar to physics... 🙂
I think about 8 people followed what you were saying but I appreciate the effort
Thanks!
super cool work
cool presentation
Thanks!
Thx RUclips for recommendations
Isn't this somewhat a proof of, or an addition to, Lee Cronin's Assembly Theory, if you can rebuild the input u from the components of m?
can’t wait for a model named kobe to come out
This makes so much sense. So obvious..
Idk man, I kinda like the shapeshifting sapient robots fighting over their home planet Cybertron, why would you wanna replace them
Thanks
Thanks!
Turning an image into a flattened sequence... I wonder if they are using space-filling curves, or just going line by line? ... I wonder which "regularity" would be more useful? Or something else even?
- To be fair, having no implicit notion of "relative position of 2 pixels" (which I believe brains have) seems really expensive, if it then has to fully recover that structure from just a sequence of tokens ...
Yes - this is a good point. I think the reason flattening is performed without retaining 2d structure is precisely because it makes for a particularly challenging modelling task.
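For what it's worth, the usual choice is plain row-major (line-by-line) flattening. Here's a tiny sketch of that versus a "snake" (boustrophedon) ordering, which at least keeps consecutive tokens spatially adjacent across row boundaries; a Hilbert curve would preserve locality even better but takes more code:

```python
import numpy as np

H, W = 4, 4
img = np.arange(H * W).reshape(H, W)   # toy "image" whose pixel values are their indices

# Row-major (line-by-line) flattening: what most sequence-model-on-images setups use.
row_major = img.reshape(-1)

# Boustrophedon / "snake" flattening: reverse every other row so neighbouring
# tokens in the sequence are always neighbouring pixels in the image.
snake = img.copy()
snake[1::2] = snake[1::2, ::-1]
snake = snake.reshape(-1)

print("row-major:", row_major)
print("snake:    ", snake)
```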
thanks for the video
Since the big companies are building their LLMs on transformers with all those resources and time, I doubt they'd change unless the results were dramatically better, so Mamba, while impressive, doesn't seem to be it.
Thanks!
god, these kids really find the cure for cancer
Great video
"infinite" context length is effectively the main thing we needed. This is very exciting.
Good video Sam
I remember reading a paper on this in 2021. Why wasn't it adopted earlier? Link in the reply.
drive.google.com/file/d/1-67LHZbCoDmzLWYp_4ZUXNzavcbGNMGa/view?usp=drivesdk
Good find. I guess mamba is a popular name...
Thanks!
Thanks!
Is Gemini based on this? The logo spiral seems to look like the Legendre polynomial graph.
is it conversational? can it talk to itself or several instances?
Lesssgo kobe ✨️
Like a knife, a weapon available to everyone is not a weapon anymore it's just a mere tool... #hail_to_the_open_source_and_public_research
I want to see this applied to reasoning tokens
- Understand Mamba's significance by exploring its efficient state space model design and selective state mechanism (00:04).
- Review the scale issues with Transformers and the emergence of efficient alternatives like Mamba for long sequence modeling (00:31).
- Examine the Hippo Recurrent Memory and its application in sequence modeling for improved performance (01:29).
- Recognize the role of kernel Fusion, parallel scan, and recomputation techniques in Mamba's efficient memory usage (09:55).
- Consider the empirical results showcasing Mamba's high performance on various tasks, including long sequence modeling and DNA classification (13:02).
- Analyze the trade-offs in model design, noting how selection mechanisms can impact performance on different data modalities (15:27).
- Investigate the limitations of current empirical evaluations and the need to test Mamba on larger model sizes (15:43).
- Dive into the released GitHub code to experiment with the Mamba model firsthand (15:59).
In the real world, LeCun and Hinton's ideas haven't yet been optimized and deployed at scale in commerce... 😂 But it's fun to try and keep up
Transformers don't scale on long sequence operations because generative AI neural networks work better spreading attention over the parameters. We shall see if Mamba can do what it claims after a large model is doing inference.
any llm using it?
finally apparently near-infinite contexts!
Added to my plays list on AI
Here after Mistral's release of their Mamba code model 😄
that's enough math for a lifetime. Amazing.
If transformers scale pretty well, I can't think of a reason why Mamba wouldn't scale. At least off the top of my head. Let's see what happens!
I can't be the only one who clicked on the video expecting the other Transformers ...
If you are interested in doing mechanistic interpretability on the Mamba model, send me a DM.
I'm thinking of writing something like Neel Nanda's TransformerLens for Mamba, or some lower-hanging fruit as a start.
I'm so tired that I read this title literally and it took me some time to understand why it is probably not a video about using snakes in place of ChatGPT.
Cool beans
The machines now can get epilepsy lol
Kobe and Gigi 🏀8️⃣💛💜2️⃣4️⃣🖤
Whattttt 😵💫😵💫😵💫
Need an ELI5 section 😅😂
in the uk, mamba is the nickname for a hard drug
Interesting. But as usual it suffers from acronym overload.
Why would a snake replace Transformers, it can't even turn into a truck
Aha yes, this are the OK! 👍 becas I is smart here to, and No can be maybi. Good! Do it Now!
No, it's a replacement for conda.
Maybe you could better explain some sentences instead of just highlighting them and reading them aloud. I get that you want a shorter video, but sometimes you could speak to us like we're 10 years old. It would help with understanding. In the worst case, generate special versions using a GPT ("explain this passage to me as if I was 15") and just read that. Thanks.
Thanks for the feedback!
what
superficial and misleading
Remix it with your fixed leads. 🙏🏼
great video
Thanks!