MAMBA from Scratch: Neural Nets Better and Faster than Transformers

  • Published: Jun 6, 2024
  • Mamba is a new neural network architecture that came out this year, and it performs better than transformers at language modelling! This is probably the most exciting development in AI since 2017. In this video I explain how to derive Mamba from the perspective of linear RNNs. And don't worry, there's no state space model theory needed!
    Mamba paper: openreview.net/forum?id=AL1fq...
    Linear RNN paper: openreview.net/forum?id=M3Yd3...
    #mamba
    #deeplearning
    #largelanguagemodels
    00:00 Intro
    01:33 Recurrent Neural Networks
    05:24 Linear Recurrent Neural Networks
    06:57 Parallelizing Linear RNNs
    15:33 Vanishing and Exploding Gradients
    19:08 Stable initialization
    21:53 State Space Models
    24:33 Mamba
    25:26 The High Performance Memory Trick
    27:35 The Mamba Drama

Comments • 207

  • @jamescamacho3403
    @jamescamacho3403 18 days ago +58

    As someone actively working on this stuff, this channel has the best explanations on the internet, and the 'tuber actually understands what is going on.

    • @Quarky_
      @Quarky_ 5 days ago +2

      3blue1brown of deep learning?

    • @codybarton2090
      @codybarton2090 2 days ago +1

      I’d love feedback on Reddit if you’re working on this as well; I also put up some concepts on the cosmo knowledge YouTube channel.

  • @jarib3858
    @jarib3858 1 month ago +23

    One small note on RNNs: reservoir computing is a very high-dimensional random RNN with a linear regression readout, so there is no exploding or vanishing gradient. Reservoir computing is currently the standard for non-linear dynamic time series prediction.

  • @jawadmansoor6064
    @jawadmansoor6064 1 month ago +86

    Wow, you've made some difficult, I mean extremely difficult, algorithms look easy. Thank you.

  • @ithaca2076
    @ithaca2076 1 month ago +7

    Absolutely love the quality and information of this video!!! Please keep up the good work, this is amazing.

  • @kamdynshaeffer9491
    @kamdynshaeffer9491 1 month ago +7

    Absolutely amazing vid. Just subbed after getting recommended to this channel. Never stop making videos dude

  • @timseguine2
    @timseguine2 6 days ago +1

    Thanks for the clear explanation. This gives me enough understanding to not only implement it myself, but to also have some ideas for sensible architecture modifications.

  • @IllIl
    @IllIl 1 month ago +2

    Thank you! Your channel is an invaluable resource on here. Hope you keep making these videos!

  • @peterdemore7239
    @peterdemore7239 26 days ago +8

    Brutal. I'm going to have to watch this about 30 times. Love it.

  • @tellu5493
    @tellu5493 1 month ago +4

    This was very good, and I hope you make more videos like this!

  • @BooleanDisorder
    @BooleanDisorder 1 month ago +8

    You have such a pleasant voice 😊
    Thanks for helping me understand better.
    Please keep making videos. ❤

    • @luke2642
      @luke2642 1 month ago

      in para-lllelll :-D

  • @wargreymon2024
    @wargreymon2024 11 days ago +2

    The level of detail and intuition you dig into is excellent 💯🔥

  • @anrilombard1121
    @anrilombard1121 1 month ago +13

    Currently testing it on molecular generation, so excited to see where these strengths hold and where they falter :)

  • @markdatton1348
    @markdatton1348 1 month ago +3

    Awesome video. I love the speed and the depth of this, it's perfect

  • @honglu679
    @honglu679 1 month ago +13

    Wow, excellent explanation. It covers all the essence of the paper with just enough math/algo. Thank you so much! If you don't mind, please make a video on RWKV (v6 has some new modifications), which is another strong linear RNN model. I am curious how it compares to Mamba.

  • @kalkhasse
    @kalkhasse 11 days ago +4

    I love how you nail the level of detail in the explanations. Perfect for me at least.

  • @anthonybernstein1626
    @anthonybernstein1626 1 month ago +4

    Amazing explanation, thank you!

  • @EkShunya
    @EkShunya 1 month ago +41

    please open your community tab
    your content is incredible

  • @rikkathemejo
    @rikkathemejo 21 days ago +10

    Nice video! I just wanted to point out that the parallel scan algorithm can also be implemented in O(n) time (instead of the O(n log(n)) version presented in the video), and this is the version that Mamba uses.
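
A minimal sketch of the two scan variants this comment contrasts, assuming a scalar (elementwise) linear recurrence h_t = w_t * h_{t-1} + x_t. The function names are illustrative and not taken from the video or the Mamba code. Each step is represented as a pair (w, x), and composing two steps gives another pair, which is what makes a parallel scan possible. The sketch only counts combine operations; whether the linear-work variant matches what Mamba's GPU kernel does is not shown here.

```python
def combine(earlier, later):
    """Compose two steps of the map h -> w * h + x; `earlier` is applied first."""
    w1, x1 = earlier
    w2, x2 = later
    return (w2 * w1, w2 * x1 + x2)


def sequential_scan(ws, xs):
    """Reference loop: h_t = w_t * h_{t-1} + x_t, starting from h = 0."""
    h, out = 0.0, []
    for w, x in zip(ws, xs):
        h = w * h + x
        out.append(h)
    return out


def doubling_scan(ws, xs):
    """log2(n) rounds with up to n combines each: O(n log n) total work."""
    elems = list(zip(ws, xs))
    step = 1
    while step < len(elems):
        # Each entry of the new list depends only on the old list,
        # so one round could run fully in parallel.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    return [x for _, x in elems]


def _prefix_combine(elems):
    """Recursive pairing: T(n) = T(n/2) + O(n) combines, so O(n) total work."""
    n = len(elems)
    if n == 1:
        return list(elems)
    pairs = [combine(elems[i], elems[i + 1]) for i in range(0, n - 1, 2)]
    scanned = _prefix_combine(pairs)  # scanned[k] covers elements 0..2k+1
    out = [elems[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], elems[i]))
    return out


def linear_work_scan(ws, xs):
    """Same outputs as the loop above, using O(n) combines in O(log n) depth."""
    return [x for _, x in _prefix_combine(list(zip(ws, xs)))]


ws = [0.9, 0.5, 1.1, 0.7, 0.8]
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(sequential_scan(ws, xs))
print(doubling_scan(ws, xs))
print(linear_work_scan(ws, xs))
```

All three functions should print the same sequence of hidden states up to floating-point rounding; only the number and ordering of combine calls differ.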

  • @MarcosScheeren
    @MarcosScheeren 8 days ago +2

    Subscribed! That's some 3Blue1Brown level stuff! Amazing!

  • @RexPilger
    @RexPilger 25 days ago +32

    About peer review: As one comment noted, there could be many more candidate papers presented than could be accommodated at the venue. However, this video argues, the rejection justification for this paper is inadequate at best. Some comments ask whether the rejection is important; for academics, the answer is yes, because presentations and publications count for tenure, promotions, and raises plus continued funding of the research. Since several comments plus the video indicate that the algorithm had already received a lot of publicity, for the sake of the project it may not matter if it can continue to be funded, especially if commercial implementations are successful. What is interesting in any case is that the paper exists; in effect it has been published; the authors may not get the desired credit for formal publication, but their work and the reviewer comments are out there now. A couple of decades ago that would not have been the case; most people in the field would be unaware of the algorithm.
    In terms of peer review, in general (outside of AI), in my field, one of the natural sciences, a paper I submitted for publication encountered an editor plus two reviewers who were well qualified in the field; after asking for two revisions to the manuscript, the third version was rejected. Interestingly, all three scientists had published research which my paper undermined; they may well have lost funding for their research or even their position had that manuscript of mine been published (I speculate here). Peer review cuts both ways. While iterating with the editor and reviewers I continued to expand my research project and made some additional discoveries. Following the rejection I wrote a completely different paper which incorporated my initial work supplemented by the new discoveries; happily it was published a few months ago (in a different journal). I'm formally retired now, but continue to do research.
    To young researchers -- never give up. Learn from rejection, refine your work, be humble, exercise integrity and honesty, and take pride in your accomplishments, even if only a few know about them. Peer review (by humans) is a necessity and will continue to be. There is no such thing as a perfect filter, but science and technology would be overwhelmed by irrelevancy, dishonesty, and duplication of effort without it. AI may become a useful filtering tool, but science is a human endeavor.

  • @koka3243
    @koka3243 1 month ago +2

    Great video! Thanks!

  • @PliniusSecundus
    @PliniusSecundus 8 days ago +1

    Great job! Your channel is a treasure.

  • @diabolo19x
    @diabolo19x 1 month ago +3

    Incredible work. I mean REALLY incredible

  • @alexmomot6268
    @alexmomot6268 1 month ago +3

    Thx a lot for the interesting video! 💛💙

  • @InfiniteQuest86
    @InfiniteQuest86 1 month ago +8

    I like how we now call 1 billion parameters small.

  • @nialv7985
    @nialv7985 1 month ago +6

    Thanks for this explanation! Phrasing mamba in terms of a Linear RNN makes it much easier to understand.
    You've done a lot already with this video, but I just want to ask for a little bit more. Since the original Mamba paper presented the model in terms of SSMs, many, many implementations of Mamba also use that language, and I have difficulty wrapping my head around mapping their code back to the concepts in this video. I wish you could explain how the concepts in the Mamba paper (∆, A, B, C, D, discretization, etc.) map back to the parameters of a linear RNN, which would help a lot.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +11

      Sure. In the state space terminology, A in ℂ^d is the learnable parameter that is used to make the recurrent weight vector; the equivalent in my video is a+bi, with a, b in ℝ^d as learnable parameters and i the imaginary unit. B, C in ℂ^{d×d} are the complex matrices applied before and after the recurrence respectively, equivalent to the P and Q matrices in my video, and are also learnable parameters.
      The SSM performs a discretization of the parameters, which creates A^bar = exp(ΔA) and B^bar = (ΔA)^-1 (exp(ΔA) - I) ΔB. Note that A^bar and B^bar are what are actually used in the computation. This discretization is equivalent to the stable reparameterization outlined in my video. In the SSM formulation, they phrase the discretization as modifying B into B^bar, but note that B is the matrix which is applied to the input, so multiplying B with Δ is equivalent to multiplying the input x with Δ and leaving B unchanged, which is how it is described in my video.
      One last thing to be aware of is that in the state space literature, the models are often described as having another "state dimension" N in addition to the model dimension d. This state dimension is equivalent to the factor by which the output vector's dimension is expanded, so for example Mamba uses N=16, i.e. expands outputs by a factor of 16. Let me know if you still have any questions!
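
The discretization formulas in this reply can be checked numerically. Below is a minimal sketch for a single diagonal entry, assuming scalar complex values for A and B (so the matrix inverse becomes a division). The function name `discretize` and the sample values are illustrative, not taken from the Mamba code.

```python
import cmath


def discretize(a, b, delta):
    """Discretize one diagonal entry of a state space layer.

    a     : complex continuous-time parameter (one entry of A)
    b     : input weight for that entry (one entry of B)
    delta : positive step size
    """
    z = delta * a
    a_bar = cmath.exp(z)                          # A_bar = exp(delta * A)
    b_bar = (cmath.exp(z) - 1) / z * (delta * b)  # B_bar = (dA)^-1 (exp(dA) - I) dB
    return a_bar, b_bar


a = complex(-1.0, 2.0)  # illustrative value with a negative real part
b = 1.0
for delta in (0.001, 0.1, 1.0):
    a_bar, b_bar = discretize(a, b, delta)
    print(delta, abs(a_bar), b_bar)
```

With a negative real part in `a`, the magnitude of `a_bar` is exp(delta * Re(a)), which stays below 1 and approaches 1 as `delta` shrinks. For small `delta`, `b_bar` is close to `delta * b`, which is consistent with the remark that scaling B by Δ amounts to scaling the input by Δ.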

    • @nialv7985
      @nialv7985 1 month ago +1

      @algorithmicsimplicity Thank you so much!

  • @phmfthacim
    @phmfthacim 1 month ago +3

    This is amazing!

  • @2255.
    @2255. 1 month ago +7

    underrated channel

  • @pi5549
    @pi5549 1 month ago +6

    Another beautiful exposition. Further points: (1) HiPPO itself comes from attempting to approximate a spiking net with an SSM (Voelker 2017/8), (2) we do have O(N log N) transformer hacks now, (3) RWKV is a promising arch that deserves a place in this arena.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +6

      I haven't heard of any O(NlogN) transformer hacks that preserve performance, got any links?
      And yeah RWKV is promising, I would've loved to talk about it as well but the video was getting long lol.

  • @tulgatbolderdene7493
    @tulgatbolderdene7493 1 month ago +103

    This just shows how RNNs are way too natural of an architecture to ignore. Maybe the solution to a gradient descent problem is to not use gradient descent at all. There has to be a different way to update parameters than this bizarre hack and slash let ||x_0|| = 1 for RNNs.

    • @BooleanDisorder
      @BooleanDisorder 1 month ago +16

      Meta-learning could potentially be one way. Like a neural "module" in the model that looks at how changes in the first layers affect the representation space deeper in the network, and vice versa. It would have to have some goal and reward itself.

    • @tempname8263
      @tempname8263 1 month ago +35

      But gradient descent is too natural of an algorithm to ignore >.

    • @ckpioo
      @ckpioo 1 month ago

      @tempname8263 It's actually not natural at all; gradient descent itself is the one big difference between a human brain and any neural network.

    • @egor.okhterov
      @egor.okhterov 1 month ago

      @tempname8263 no

    • @ultrasound1459
      @ultrasound1459 1 month ago

      ​@BooleanDisorder you have 10 missed calls from Juergen Schmidhuber 🧏‍♂️

  • @luke2642
    @luke2642 1 month ago +3

    great video. That trick around the 26 minute mark of doing 16x compute almost for free (in terms of time) because of memory bottlenecks is really neat. I wonder how many other architectures would benefit from that kind of design optimisation?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago

      It appears that it is only useful for linear recurrent layers, because the main computation is just performing elementwise multiplication between the previous output vector and the recurrent weight vector, which means you have O(d) parameters and you do O(d) compute, and transferring one parameter takes longer than doing one operation. For other kinds of layers, such as fully connected layers, you are doing at least a matrix-vector multiplication, which means you are doing O(d^2) compute, and that usually takes much longer than transferring O(d) parameters.

  • @MrStevemur
    @MrStevemur 29 days ago +1

    I appreciate the soothing piano music. Currently the words are only slightly better than Charlie Brown listening to adults talk, but I hope to dive in.

  • @itsyaro1297
    @itsyaro1297 1 day ago +1

    Hey man! Really appreciate the technical detail in your videos

  • @OscarTheStrategist
    @OscarTheStrategist 1 month ago +2

    Amazing video, insta-sub!

  • @1LC4P1T4L1ST4
    @1LC4P1T4L1ST4 1 month ago +10

    I believe that the transformer does have a quadratic cost in memory (specifically self-attention (SA)). The attention matrix in SA is n by n, thus n^2 (n being the number of tokens). The reviewer is probably referring to that bit. Anyway, rejecting Mamba was hecking stupid. Great video!

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +8

      The matrix is indeed n^2, but you never need to materialize the full matrix at the same time. You can materialize one column at a time, which is exactly what FlashAttention does, resulting in O(n) memory (still O(n^2) compute though).
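
To make the memory argument concrete, here is a minimal sketch of computing one attention output without ever storing the full list of scores, using a running maximum and running sum (often called online softmax). It is written in plain Python for a single query and is only an illustration of the idea; the real FlashAttention kernel works on tiles in fast GPU memory, and the function names here are made up for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def naive_attention_row(q, keys, values):
    """Stores all n scores for this query before normalizing."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def streaming_attention_row(q, keys, values):
    """Visits keys one at a time, keeping only O(1) running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        new_max = max(running_max, s)
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        running_sum = running_sum * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-0.5, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(naive_attention_row(q, keys, values))
print(streaming_attention_row(q, keys, values))
```

Both functions should agree up to rounding. Repeating the streaming version for every query still costs O(n^2) compute, but the extra memory per query no longer grows with n.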

    • @1LC4P1T4L1ST4
      @1LC4P1T4L1ST4 1 month ago +2

      I have no idea how FlashAttention manages to be faster and more memory friendly. Are you sure that the attention matrix is never fully in memory (regardless of the type of memory)? However, the classical implementation didn't use FlashAttention, so I believe that the reviewer is referring to that.

    • @1LC4P1T4L1ST4
      @1LC4P1T4L1ST4 1 month ago +5

      I have rechecked the paper and it appears that FlashAttention is linear in memory. The work of Tri Dao is magic to me.

  • @nyyotam4057
    @nyyotam4057 1 month ago +2

    So how close is the weight estimator to the MMSE (minimal mean square error) estimator? Can the MAMBA arch be improved even more, using a sparse covariance matrix and an application of a 'true' Kalman filter? Or is it already as close as it can get?

  • @MarcosPedroLeal
    @MarcosPedroLeal 1 month ago +2

    Loved your videos. Which software or library do you use to make these animations? Is it manim?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago

      It is a combination of Manim (for rendering LaTeX) and my own renderer written in PyTorch (for 3D stuff).

  • @Mohammed-rx6ok
    @Mohammed-rx6ok 1 month ago +2

    Good job 👏

  • @hackerborabora7212
    @hackerborabora7212 1 month ago +6

    This algo is new and you already made a video about it, I love you. I will subscribe to your channel, keep going.

  • @ollybreh95
    @ollybreh95 27 days ago +1

    Woah big claim! I’m excited

  • @blacklistnr1
    @blacklistnr1 1 month ago +3

    Nice video! What I didn't understand is what happens to the stable weights during training. Particularly:
    - How are they kept stable?
    - How can the model learn while being so restricted?
    What I'm guessing is that some form of the Delta is also used in training to keep the weights in those ranges + rely a lot more on the accuracy to carry the information.
    Is this correct? Does it imply that using double instead of float gives it a better ability to learn?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +7

      Great question. The answer is it's really complicated and no one knows for sure.
      There is nothing explicitly keeping the weights stable during training. They can (and probably do) become unstable. The thing is, there are actually thousands of different weights in the vector. At initialization, all of the weights are essentially one, so information from anywhere in the input can influence the gradient, but the model is incredibly restricted (cannot perform meaningful transformations in the recurrence). Then SOME of those weights change and enter the unstable regime, so they can no longer carry information long distance but can do more interesting computations, while others remain stable. And in the fully-connected layers between recurrences, all weights can communicate information with each other. So you have this complicated system where weights are changing at different rates, some remain stable, some become unstable, and that allows for interesting computation to be done and information to be propagated long distances.

    • @blacklistnr1
      @blacklistnr1 1 month ago +2

      @algorithmicsimplicity Thanks for the reply! That's quite interesting, different propagation lengths didn't even cross my mind.
      It'd be really funny if after all this work the model learned unstable weights and became forgetful :))

  • @Alulapower
    @Alulapower 1 month ago +22

    Good video to explain mamba : I understand something

    • @harrysvensson2610
      @harrysvensson2610 1 month ago +2

      You see, it's O(n log(n)) instead of O(n^2) without any penalties. Okay?
      100% crystal clear, right? //end of joke

    • @BooleanDisorder
      @BooleanDisorder 1 month ago

      @harrysvensson2610 That means that, basically, the compute transformers need for a prompt scales as x², where x is the prompt length. This is also called quadratic scaling, since x² is the area of a square with side x. So if you write a prompt of 5 words, that's 25 units of compute, since 5*5=25. You can see how this gets really crazy at high token counts. Mamba scales differently, so you need much less compute per prompt.

  • @drdca8263
    @drdca8263 1 month ago +10

    Here’s an idea that probably wouldn’t work:
    What if instead of algebraically guaranteeing that some operation is a monoid so that one can use the parallelizing thing that combines n inputs in O(log(n)) steps in n processors,
    what if you just had some operation, learned by a NN, which has “how much it deviates from being a monoid operation” as part of the loss?
    Like, suppose you randomly selected some pair of consecutive applications of the operation, and also computed it in the opposite order, and took the L^2 norm of the difference between the results, and multiplied that by some weighting, and made that a term in the loss?
    Like, within the family of continuous and piecewise-smooth monoidal operations, perhaps some of them would be better at selective remembering?
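
One way to read the penalty proposed in this comment is as a measure of how far an operation is from being associative, since associativity is the property the parallel scan relies on. The sketch below is an interpretation, not the commenter's code: it compares (a op b) op c with a op (b op c) and returns the squared L2 distance. The operation `tanh_mix` is a hypothetical stand-in for a learned network.

```python
import math


def associativity_penalty(op, a, b, c):
    """Squared L2 distance between (a op b) op c and a op (b op c)."""
    left = op(op(a, b), c)
    right = op(a, op(b, c))
    return sum((l - r) ** 2 for l, r in zip(left, right))


def elementwise_add(u, v):
    """Associative, so the penalty is zero up to rounding."""
    return [x + y for x, y in zip(u, v)]


def tanh_mix(u, v):
    """Hypothetical stand-in for a learned operation; not associative in general."""
    return [math.tanh(x + 2.0 * y) for x, y in zip(u, v)]


a, b, c = [0.1, 0.2], [0.3, -0.4], [0.5, 0.6]
print(associativity_penalty(elementwise_add, a, b, c))
print(associativity_penalty(tanh_mix, a, b, c))
```

In a training setup, a term like this would be multiplied by a weighting and added to the task loss, with a, b, c sampled from consecutive elements of the sequence. A small penalty on sampled triples does not guarantee exact associativity, so a scan over a learned operation would only approximate the sequential result.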

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +4

      That sounds really interesting, you should try it out!

    • @drdca8263
      @drdca8263 1 month ago +2

      @algorithmicsimplicity Thanks! Unfortunately I am lazy...
      And, there’s already another “what if I did [X]?” machine learning project I barely started (“what if I tried to add a simple approximation to what copying heads do to an n-gram model”, which seems like it should be much easier, but I’ve barely written the n-gram model part of it, and ChatGPT honestly wrote most of that). Haven’t even started on the “compute statistics about whether copying a word from previously in the current text, or going by the corpus as a whole, is more accurate in this context” part...

    • @CyrusEstavillo
      @CyrusEstavillo 1 month ago

      @drdca8263 That's a lame response. Try it. Make something in this world.

    • @TheDoomerBlox
      @TheDoomerBlox 18 days ago +1

      It's only yet another silly experiment to do the seemingly impossible in the hottest meme area, picking your nose seems like a more productive waste of time.
      But imagine, if you found something really cool and nobody would listen. That would be funny, that would be cool.

    • @gnaarW
      @gnaarW 4 days ago

      @TheDoomerBlox If you were able to build a RecNN that outperforms current state-of-the-art models and put it on Hugging Face, people would care about that 🤷🏼‍♂️

  • @luiscedillo9321
    @luiscedillo9321 11 days ago

    You are a golden channel

  • @Singularity606
    @Singularity606 1 month ago +8

    There seems to be a growing zoo of related architectures that attempt to supersede the transformer. Besides Mamba, there's also RetNet, GLA, Based, and HGRN. And the secret upcoming xLSTM. Someone also mentioned RWKV. Are all these converging to something? And when will we see a frontier model based on this new paradigm?

    • @BooleanDisorder
      @BooleanDisorder 1 month ago +2

      The main problem with transformers is how compute scales with input length. Mamba tries to be as good as transformers at high-dimensional representations without the extreme compute scaling. So effectively we want much more complex representations in the end, without needing a nuclear power plant and a supercomputer for inference. Transformers could continue to get better, but it takes an astronomical amount of compute atm.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +4

      I believe we are converging to hybrids of Transformers and dynamic linear RNNs, such as Griffin (arxiv.org/abs/2402.19427). There are already open source Mamba language models with a few billion parameters; training and testing full size models takes about a year.

    • @MrObveous777
      @MrObveous777 1 month ago

      @algorithmicsimplicity "training and testing full size models takes about a year." Why so long?

    • @BC-bn7xd
      @BC-bn7xd 1 month ago

      @MrObveous777 I think just training it can take weeks if not months.

  • @Nerdimo
    @Nerdimo 15 days ago

    Would you mind explaining the associativity at 10:37? My assumption is that f is the linear recurrence function, but how is it equal to a pair consisting of the matmul between W2 and W1 and the second term? Wouldn’t f output a vector, so how could it be equal to the pair of vectors on the right-hand side?

  • @maximilianchrzon4545
    @maximilianchrzon4545 1 month ago +2

    Your videos are so good man, keep it up, seriously. This is probably beneath you, but could you maybe make a video on how neural networks are computed on machines in general, or maybe on GPUs? As someone who did not learn computer science in uni, this would be an interesting topic for me to learn, and maybe it would help me fundamentally understand NNs better.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +2

      That's an interesting topic, I was planning on making videos about how CPUs and GPUs work at the physical level (e.g. logical gates are built out of transistors, addition and multiplication are built out of logical gates). Neural nets are just implemented as a bunch of matrix multiplications (you put all the neuron weights in one matrix and multiply it with the input). Is that what you are asking about?

    • @maximilianchrzon4545
      @maximilianchrzon4545 1 month ago

      @algorithmicsimplicity Yeah, that sounds about right, thank you. Maybe you could use matrix multiplication as a case example of those inner workings :) Anyway, thanks for making awesome videos.

    • @ArtOfTheProblem
      @ArtOfTheProblem 23 days ago

      @maximilianchrzon4545 3b1b has this covered pretty well already.

  • @baskaisimkalmamisti
    @baskaisimkalmamisti 5 days ago +1

    Fourier's paper was rejected by Laplace as not being mathematically rigorous enough. (A note for people not from a signal processing background: the Laplace transform is the extended version of the Fourier transform.)

  • @nikilragav
    @nikilragav 1 month ago +3

    I really wish that when you're talking about things happening in parallel, your animations happened in parallel. Like 8:30. I think it would really improve the comprehensibility of your explanation

  • @harshvardhanv3873
    @harshvardhanv3873 19 days ago +5

    we need more videos from you, especially one from basics

    • @algorithmicsimplicity
      @algorithmicsimplicity 19 days ago

      Any topics in particular you'd like to see?

    • @harshvardhanv3873
      @harshvardhanv3873 19 days ago +2

      @algorithmicsimplicity We need video series on the math for ML: linear algebra, calculus, and probability and statistics, each covered separately from an ML perspective. After that we would like to learn more about basic concepts like regression, classification, clustering, etc. We would also like to learn more about the types of learning: unsupervised, semi-supervised and self-supervised. Also some basic architectures like the RNN types (LSTM, GRU, hybrids), basic ANN, MLP, and even the recent KAN and NTK.

    • @algorithmicsimplicity
      @algorithmicsimplicity 19 days ago +1

      @harshvardhanv3873 Got it. I am definitely planning to do videos on calculus and probability for ML soon. After that I can do videos on the types of ML.

    • @harshvardhanv3873
      @harshvardhanv3873 19 days ago +2

      @algorithmicsimplicity Sure, waiting for your videos ✌

    • @sichengmao4038
      @sichengmao4038 16 days ago

      Well, maybe 3b1b's videos already fulfill what you need on the prerequisites of ML.

  • @tomashonzik1758
    @tomashonzik1758 2 days ago +1

    Thanks!

  • @YA-yr8tq
    @YA-yr8tq 15 days ago

    the channel is great and the material is awesome! the only catch is: the piano in the background makes it hard to focus..

  • @blutwurst9000
    @blutwurst9000 25 days ago +1

    Love the video, but I have a question: shouldn't the approximation at 17:00 be something like n*w^(n-1)*0.001*x, so isn't there an n missing? Or how was the approximation done?

    • @algorithmicsimplicity
      @algorithmicsimplicity 25 days ago +1

      Ahh yes you're right, there should be an n out the front, the gradient is proportional to nw^(n-1)x. The vanishing/exploding gradient arguments are still the same though, the linear scaling factor doesn't matter compared to the exponential scaling for large n.
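
The corrected gradient can be checked numerically. The sketch below assumes the simplest scalar case discussed in this exchange, where the output after n steps is w^n * x, and compares the analytic gradient n * w^(n-1) * x against a finite-difference estimate. The function names and sample values are illustrative.

```python
def unrolled_output(w, x, n):
    """Apply h -> w * h n times starting from h = x, giving w**n * x."""
    h = x
    for _ in range(n):
        h = w * h
    return h


def analytic_gradient(w, x, n):
    """Derivative of w**n * x with respect to w."""
    return n * w ** (n - 1) * x


def numeric_gradient(w, x, n, eps=1e-6):
    """Central finite difference of the unrolled recurrence."""
    return (unrolled_output(w + eps, x, n) - unrolled_output(w - eps, x, n)) / (2 * eps)


# Small n: the two gradients should agree closely.
print(analytic_gradient(0.9, 1.0, 10), numeric_gradient(0.9, 1.0, 10))

# Large n: the exponential factor dominates the linear factor n.
for w in (0.9, 1.0, 1.1):
    print(w, analytic_gradient(w, 1.0, 1000))
```

For n = 1000 the factor of n changes the gradient by three orders of magnitude, while w^(n-1) moves it by dozens of orders of magnitude in either direction, which is why the vanishing and exploding argument is unaffected.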

  • @tannergilliland6105
    @tannergilliland6105 22 days ago +2

    If you ever get the time I would love to see another video on Mamba implementation but dumbed down even more. Like to the level of StatQuest videos. They need to make you feel special while also showing the math step by step like it's 9th grade.

    • @algorithmicsimplicity
      @algorithmicsimplicity 22 days ago +2

      Thanks for the suggestion, there will probably be improved versions of Mamba coming out soon, I will make a more basic explanation video for them when they do.

  • @agsystems8220
    @agsystems8220 1 month ago +27

    RNNs are constrained by having to hold all their information in a single embedding space, so this space needs to be extremely large. It needs to hold every piece of information in the context that might come in useful at some point. Transformers can distribute information between many tokens, so can operate with a much smaller embedding space, at least in theory. The memory complexity of an RNN with a given structure is quadratic in the size of the embedding space, meaning we really pay big time for that increased embedding size. I wonder if that is what the reviewer was getting at.
    The results were impressive, but they haven't been followed up by success at larger model sizes which I would have expected to have already happened if it was going to. It is a cool mathematical trick to make it work, and demonstrates that language is surprisingly linear, but once you start to hit truly non linear questions I would expect it to stop improving. Overhyped IMO.

    • @howuhh8960
      @howuhh8960 1 month ago +6

      If you stack multiple linear RNN layers they can handle non-linear dependencies across time, so "demonstrates that language is surprisingly linear, but once you start to hit truly non linear questions" is not true, as the Mamba model as a whole (multiple layers) is a nonlinear RNN.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +20

      The really cool thing about linear RNNs is that increasing the size of the embedding space only has linear cost, not quadratic. The recurrence operator only performs elementwise multiplication with the embedding vector. This is why Mamba is able to increase the size of the embedding vector by a factor of 16 at essentially no cost. If you were willing to incur some additional cost, you could easily make the embedding vectors even larger. When you expand the embedding vector by a factor of a few thousand, now you're talking about as much memory as a transformer with a few thousand tokens of the original size.
      Works are currently in progress to train larger model sizes, it takes about a year from start to finish to train a full sized model. Mamba already achieves state of the art performance for ~3b sized language modelling, this is HIGHLY HIGHLY non-linear.
      And finally, while there are some aspects in which transformers are still superior to dynamic linear RNNs, hybrid architectures such as Griffin (arxiv.org/abs/2402.19427 ) appear to give the best of both worlds, handily outperforming both.

  • @mehnot8193
    @mehnot8193 1 month ago +3

    Extremely noob question but, at 13:52 why aren't the input vectors x multiplied by P^-1 instead of P? Don't you need to convert them to the eigenbasis before applying the D transformation (or, equivalently, taking the Hadamard product with the diag(D) vector)?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +2

      Yes, I should have applied P^-1 first to be consistent with my earlier notation W=PDP^-1. Of course, the naming is just a matter of preference, you can equivalently call the first matrix which is applied P or P^-1, so long as the two matrices are inverse of each other it doesn't matter which is called which.
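
The ordering discussed here can be checked on a tiny example. The sketch below uses the W = P D P^-1 notation from this reply with a hand-picked 2x2 matrix P and a diagonal D (both illustrative values, not from the video). It compares applying W repeatedly with changing basis once via P^-1, scaling elementwise by D^n, and changing back via P.

```python
def matmul(a, b):
    """Product of two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


P = [[1.0, 1.0], [1.0, -1.0]]  # illustrative change-of-basis matrix
D = [0.9, 0.5]                 # diagonal of D, stored as a vector
P_inv = inverse_2x2(P)
W = matmul(matmul(P, [[D[0], 0.0], [0.0, D[1]]]), P_inv)


def apply_dense(x, n):
    """n matrix-vector products with the full matrix W."""
    for _ in range(n):
        x = matvec(W, x)
    return x


def apply_diagonal(x, n):
    """Same result: P^-1 once, n elementwise steps folded into D**n, then P once."""
    z = matvec(P_inv, x)
    z = [d ** n * zi for d, zi in zip(D, z)]
    return matvec(P, z)


x0 = [1.0, 2.0]
print(apply_dense(x0, 5))
print(apply_diagonal(x0, 5))
```

Swapping the names of P and P^-1 everywhere gives the same computation, which is the point made in the reply: only the fact that the two matrices are inverses of each other matters.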

    • @mehnot8193
      @mehnot8193 1 month ago

      @algorithmicsimplicity Oh ok, that makes sense now! Thanks a lot for your answer and this amazing video ^^

  • @tantzer6113
    @tantzer6113 1 month ago +5

    Enjoyed this. Given that its performance is comparable to or better than transformers as verified independently in several papers, is Mamba gaining a foothold among practitioners?

    • @mimotron
      @mimotron 1 month ago

      It does : ruclips.net/video/9s-9aSobky8/видео.html

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +5

      Definitely, lots of open source language models are switching to Mamba. Mamba is also being used for other tasks as well, e.g. arxiv.org/abs/2401.09417
      Also, Google DeepMind recently released this paper (arxiv.org/abs/2402.19427) on hybrid dynamic linear RNNs and transformers, which achieves really good results. Dynamic linear RNNs are definitely going to become mainstream.

  • @jhonny1682
    @jhonny1682 25 days ago +1

    Can you make an explanation video like this one on Liquid Time Constant Networks 🙏

  • @augmentos
    @augmentos 1 month ago +1

    Great video, would prefer no music but that’s me

  • @yoloswaggins2161
    @yoloswaggins2161 1 day ago +1

    A guy who actually understands this stuff

  • @karius85
    @karius85 1 month ago +4

    Appreciate the breakdown. I think there are a few more things at play here for the reject that is somewhat overlooked in the discussion at the end. Specifically, there are issues with anonymity and using "hype" to push a paper through an academic conference. I speculate that this was the underlying reason for rejecting the paper.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +16

      Cool, if that was the reason for the reject they should have said that in the rationale for the reject. Instead they made up a bunch of criticisms which are either 1) irrelevant or 2) blatantly untrue. That's a bad look for the conference, as it makes it seem like their reviewers are unqualified to judge academic works.

    • @karius85
      @karius85 1 month ago +3

      @@algorithmicsimplicity Absolutely agree. In my experience, the quality of conference reviewers is extremely variable. Almost all researchers I know have horror stories about how incompetent and outright adversarial reviewers can be. Many great papers are rejected without sufficient basis, and mediocre papers are included for seemingly no good reason. Many experienced researchers don't want to review anymore.
      Just a comment on the reject; it might have been a conscious decision to not actually bring the anonymity issues up in the rebuttal to avoid further disputation. But, I am just speculating here with little to no factual basis.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +8

      It could very well have been a conscious decision, but I think it was the wrong decision. From an outside perspective, it looks like a fantastic paper was rejected because of clueless reviewers. That's far more damaging to the conference's integrity than whatever conflicts might arise from anonymity violation disputes.

    • @karius85
      @karius85 1 month ago +2

      @@algorithmicsimplicity Independently of what one may think of the paper, I agree that the justification for the reject was weak. Unfortunately, I don't think it matters much for the integrity of the conference in the long run, as this has happened in all the other big conferences in the past. Authors generally adapt and move on. What makes this unique is the hype around Mamba. Previously, no single member of the general public would have been interested in the review decision of a single paper in AI / ML. Now, the community extends far beyond academics, for better or worse. All in all, I hope it serves to incentivise stronger review processes for the future.

    • @karius85
      @karius85 1 month ago +2

      On a side note, I really enjoy your content, keep up the good work 👏

  • @francescorossi7582
    @francescorossi7582 3 days ago +1

    Thanks for the video. Why do you use matrix diagonalization instead of SVD at 13:00? SVD can decompose any matrix and you do not need to introduce complex numbers. The power trick also works with SVD with respect to the singular values.

    • @algorithmicsimplicity
      @algorithmicsimplicity 3 days ago +1

      With SVD you get W=USV for a diagonal matrix S, but the U and V are not necessarily inverse of each other, so when you take W^2=USVUSV you can't cancel out the inner VU.
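A small worked example may make the reply above concrete. This is only a sketch with hand-picked 2x2 matrices (all the values and helper names below are illustrative assumptions, not taken from the video): with an eigendecomposition W = P D P^-1 the inner P^-1 P cancels, so W^n = P D^n P^-1, while for W = U S V the inner V U is generally not the identity, so U S^n V is not W^n.

```python
def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matpow(m, n):
    """Compute m**n by repeated multiplication (n >= 1)."""
    out = m
    for _ in range(n - 1):
        out = matmul(out, m)
    return out

def diag_pow(d, n):
    """Raise a diagonal matrix to the n-th power entry by entry."""
    size = len(d)
    return [[d[i][j] ** n if i == j else 0 for j in range(size)]
            for i in range(size)]

# Eigendecomposition W = P D P^-1 (hand-picked, illustrative values).
P = [[1, 1], [0, 1]]
D = [[2, 0], [0, 3]]
P_inv = [[1, -1], [0, 1]]
W = matmul(matmul(P, D), P_inv)
eig_shortcut = matmul(matmul(P, diag_pow(D, 3)), P_inv)
eig_ok = eig_shortcut == matpow(W, 3)      # the inner P^-1 P cancels

# SVD-style factorization W2 = U S V with orthogonal U, V and diagonal S.
U = [[0, -1], [1, 0]]                      # 90 degree rotation
S = [[2, 0], [0, 1]]
V = [[1, 0], [0, 1]]
W2 = matmul(matmul(U, S), V)
svd_shortcut = matmul(matmul(U, diag_pow(S, 2)), V)
svd_ok = svd_shortcut == matpow(W2, 2)     # V U is not the identity here

print(eig_ok, svd_ok)
```

Note that W2 in this sketch is a scaled rotation with no real eigenvalues, which is the kind of matrix where diagonalization needs complex numbers.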

    • @francescorossi7582
      @francescorossi7582 3 days ago +1

      @@algorithmicsimplicity You are right, in my mind I was assuming W to be symmetric.

  • @oraz.
    @oraz. 1 month ago +1

    One thing I don't understand is the HIPPO matrix, and what they mean by a structured matrix in the context of differential equations.

  • @dntbther9298
    @dntbther9298 10 days ago +2

    How about RWKV ?

  • @erikxu3472
    @erikxu3472 1 month ago +1

    Mamba Mentality

  • @thebrownfrog
    @thebrownfrog 1 month ago +3

  • @londonl.5892
    @londonl.5892 7 days ago +1

    Why did the "W_2 W_1" on the right at 10:48 change into a "W_1 W_2" by 12:34?

  • @nias2631
    @nias2631 2 days ago +1

    I have no particular opinion on transformers or MAMBA since, for my work, I never use these. But as for peer review I think that Open Review itself is a great "filter for the filter". The research community can actively review the reasoning for accept/reject as you did in this video. For most journals not using Open Review the process is fairly opaque.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 day ago

      Absolutely agree, the transparent review process is definitely a net benefit for the community as a whole.

  • @oleonardohn
    @oleonardohn 1 month ago +2

    I haven't found any significant evidence suggesting that Mamba models outperform Transformers, except that their attention mechanism does not scale quadratically with the context length. Am I missing something?

    • @ilonachan
      @ilonachan 1 month ago +1

      I mean, even if it just accomplished the tasks about as well as transformers qualitatively, the better compute scaling alone is pretty significant.

    • @oleonardohn
      @oleonardohn 1 month ago

      @@ilonachan Sure, but as far as I'm concerned, there is not much evidence it can qualitatively perform the same tasks either. Some people reported that Mamba's state space doesn't perform as well as true attention for long contexts.

  • @timeflex
    @timeflex 19 days ago +2

    GPT mafia 😞 Probably just can't lose their faces and title of "the best LLM tech" (and, perhaps, contracts as well).

  • @zyzhang1130
    @zyzhang1130 17 days ago +1

    Can skip connections be used here to tackle vanishing/exploding gradient problems?

    • @algorithmicsimplicity
      @algorithmicsimplicity 17 days ago +1

      Skip connections would be equivalent to adding 1 to each recurrent weight, which still doesn't fix the problem that you need the weights to be close to 1. You would still need to change the initialization so that the weights are initialized all very close to 0 (so when you add 1 with the skip connection they become close to 1).
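The equivalence described in the reply above can be checked on a toy scalar recurrence. This is a minimal sketch, assuming the simplest possible linear RNN, h_t = w * h_{t-1} + x_t, with a skip connection that adds h_{t-1} back in; the function names and input values are hypothetical.

```python
import math

def run_with_skip(w, xs, h0=0.0):
    """Linear recurrence with a skip connection: h = w*h + x + h."""
    h = h0
    for x in xs:
        h = w * h + x + h
    return h

def run_plain(w, xs, h0=0.0):
    """Plain linear recurrence: h = w*h + x."""
    h = h0
    for x in xs:
        h = w * h + x
    return h

xs = [1.0, -0.5, 2.0, 0.25]   # arbitrary illustrative inputs
w = 0.125                     # small weight, as if initialized near 0

with_skip = run_with_skip(w, xs)
plain_shifted = run_plain(w + 1.0, xs)   # same recurrence, weight w + 1
print(with_skip, plain_shifted)
```

Both runs give the same final state, so the skip connection only shifts the effective weight from w to w + 1. The weights still have to be initialized near 0 for the effective weight to land near 1.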

    • @zyzhang1130
      @zyzhang1130 17 days ago

      @@algorithmicsimplicity Thanks for the explanation. Just to add: I'm not aware that in the vision domain people traditionally initialise different layers specifically to tackle vanishing/exploding gradient issues (on top of skip connections). Does the mechanism you mentioned emerge spontaneously from training?

    • @algorithmicsimplicity
      @algorithmicsimplicity 17 days ago

      @@zyzhang1130 Vanishing and exploding gradients only arise in recurrent neural networks because you use the same weights in each iteration. This is what causes the exponential growth/decay. In the vision domain you use feed-forward nets, with different weights in each layer, so this isn't an issue and you don't need specialized initializations.
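The exponential growth or decay from reusing the same weight at every step can be illustrated with a scalar example. This is a sketch under the assumption of a scalar linear recurrence h_t = w * h_{t-1} + x_t, where the gradient of the final state with respect to the initial state is w multiplied by itself once per step; the function name and the chosen weights are illustrative.

```python
def recurrent_gradient(w, steps):
    """d h_T / d h_0 for h_t = w * h_{t-1} + x_t: the same w applied `steps` times."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad

shrinking = recurrent_gradient(0.9, 100)   # weight slightly below 1
growing = recurrent_gradient(1.1, 100)     # weight slightly above 1
stable = recurrent_gradient(1.0, 100)      # weight exactly 1

print(shrinking, growing, stable)
```

Over 100 steps a weight of 0.9 shrinks the gradient to a tiny fraction of its original size and a weight of 1.1 blows it up by several orders of magnitude, which is why the recurrent weights need to sit very close to 1.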

    • @zyzhang1130
      @zyzhang1130 17 days ago

      @@algorithmicsimplicity Hmm, pretty sure it does occur in vision as well (at least one of them) for deep models; that's why they added skip connections.

    • @algorithmicsimplicity
      @algorithmicsimplicity 17 days ago +1

      @@zyzhang1130 Using ReLU activation functions is enough to completely solve vanishing and exploding gradients in feed-forward networks. Residual connections solve a related but different problem, shattered gradients: arxiv.org/abs/1702.08591

  • @SolathPrime
    @SolathPrime 1 month ago +1

    [6:28]: While that sounds somewhat good, in practice it doesn't work like that
    Alternating between linear recurrent and non-linear dense layers doesn't give that much of an advantage in context :(
    The gradients vanish or explode after a while, and it requires some sort of sigmoid transformation + some value
    Say for example an architecture like this:
    ```plaintext
    Dense -> Sigmoid -> Recurrent -> Dense -> Sigmoid -> Recurrent -> Dense -> Softmax
    ```
    By the time the gradients reach the first Recurrent layer, they have lost most of their value :(

  • @KeinNiemand
    @KeinNiemand 11 days ago

    What about Mamba-transformer hybrids?

  • @codybarton2090
    @codybarton2090 2 days ago +1

    So keep some aspects of privacy laws coherent but merge the different sides of the web in a quantum computer

  • @goblinkoma
    @goblinkoma 1 month ago +7

    peer review be like
    thats a nice method for building houses, its a shame it doesn't also cook burgers
    what

  • @yonnn7523
    @yonnn7523 1 month ago

    Great video as always, but it would be even better without the distracting background music.

  • @a.gholiha6884
    @a.gholiha6884 8 days ago

    Where is the source code to run and test it ?

  • @tempname8263
    @tempname8263 1 month ago +5

    21:48
    33%? Dude, it's 3.4x improvement. Measuring improvement relative to accuracy instead of error rate is dumb, since that'd mean that difference between 100% accuracy and 99% is just 1%, which is not representative of anything.

    • @harrysvensson2610
      @harrysvensson2610 1 month ago +5

      Everyone has issues when it comes to calculating with percentages.
      Here's an example:
      Imagine a game character with armor. The character has 98% damage reduction, then puts on some more armor and reaches 99% damage reduction.
      How much less damage does the tank take compared to before putting on the extra armor? 100%? 50%? 1%?
      If you math it out it's obviously 50% less damage taken, since there's 2% between 98% and 100%. And one of those 2% is now removed, hence 1/2 -> 50% less damage taken compared to before.
      But you know what? Not everyone agrees that it is 50%. Understanding percentages is difficult.
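The armor arithmetic above can be written out explicitly. This is a small sketch using exact fractions; the helper name `damage_taken` is hypothetical. The same calculation applies to accuracy versus error rate: going from 98% to 99% accuracy halves the error rate.

```python
from fractions import Fraction

def damage_taken(reduction_percent):
    """Share of incoming damage that still gets through, as an exact fraction."""
    return Fraction(100 - reduction_percent, 100)

before = damage_taken(98)   # 2/100 of incoming damage gets through
after = damage_taken(99)    # 1/100 of incoming damage gets through

absolute_change = before - after              # measured against incoming damage
relative_change = (before - after) / before   # measured against damage taken before

print(absolute_change, relative_change)
```

The two answers in the thread correspond to these two quantities: one percentage point in absolute terms, one half in relative terms.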

    • @BooleanDisorder
      @BooleanDisorder 1 month ago

      @@harrysvensson2610 Yeah, the armor thing is a great example. The higher the damage and the more important a tank is, the more important that single percent becomes. It could literally mean the difference between surviving a blow from a boss and dying.

    • @ScorpioneOrzion
      @ScorpioneOrzion 1 month ago +3

      @@harrysvensson2610 It depends: in the armor example, it's 1% absolute and 50% relative.

    • @harrysvensson2610
      @harrysvensson2610 1 month ago

      @@ScorpioneOrzion Exactly.

    • @tempname8263
      @tempname8263 1 month ago

      @@harrysvensson2610 It's not that it's difficult, it's just that most people make leaps in logic, where they don't even think about relative to *what* they are measuring the percentage.

  • @ArtArtisian
    @ArtArtisian 1 month ago +1

    Eh - re controversy, I don't think peer review broke here. The *conference* however, took a much deserved status hit for mismanaging review.

  • @yqisq6966
    @yqisq6966 21 days ago +7

    Peer review is broken nowadays because people have little time to actually read through a manuscript with attention to details given the amount of pressure to publish their own papers. So when you have more papers out there than the time people can spend on reviewing, you get low quality peer review.

  • @unkarsthug4429
    @unkarsthug4429 1 month ago +6

    People keep making things that they say are "better than transformers", but none of them are actually getting used. At this point, hearing people say that has sort of become meaningless from the number of false alarms. Feels like every few months we have something "better than transformers", like RetNets were claimed to be. We'll have to wait and see which actually turn out to be better with time.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +8

      Yep, but Mamba is different, it is already being used in open source language model projects.

    • @Supreme_Lobster
      @Supreme_Lobster 29 days ago +2

      investor money is generally spent conservatively. it will take at least a few months for them to see the upside in divesting from super large transformers and moving on to MAMBA (or upcoming derivatives). Remember, Transformer was first published in 2017, and it took until at least 2020 for any "large" (> 3B) model to come out.

  • @justtoleavecomments3755
    @justtoleavecomments3755 1 month ago +1

    "Small models up to a few billion params"
    I think people have forgotten what small means 😂

  • @EnricoGolfettoMasella
    @EnricoGolfettoMasella 1 month ago

    It’s faster but not better, bro!

  • @poloceccati
    @poloceccati 1 month ago

    Peer review of science papers can become a fight for fame and career among greedy scientists, more than a filter for truth, using small insignificant mistakes as excuses for paper rejection.

  • @Matlockization
    @Matlockization 29 days ago

    How many parameters does the human brain have, 86 billion?

    • @redswap
      @redswap 14 days ago +1

      No it's way more, you have to take into account the number of synapses (100 trillion). Then you also have to take into account the fact that the way human neurons work is much more complicated than an artificial neural network (requires at least 6 layers of artificial neurons to simulate a human pyramidal neuron cell with 98% accuracy, for example). So taking all of this into account, the human brain should have between 10 and 100 quadrillion parameters. And this doesn't even take into account the fact that there are at least 200 different types of neurons in the entire brain (most of them are in the brain stem because it had more time to evolve). If we take this into account, and divide this by the added complexity of the transformer architecture (about 4, don't ask how I found this number 😅), then the human brain has the equivalent intelligence potency of a transformer model with in between 500 quadrillion and 5 quintillion parameters.

    • @Matlockization
      @Matlockization 14 days ago

      @@redswap Ok, thank you.

  • @googleyoutubechannel8554
    @googleyoutubechannel8554 21 days ago +1

    Best review of Mamba on YT, but it still fails to address "better" at _what_ than transformers? Language modeling? What is that? Next token prediction? (_what_ prediction...). There's still this big hole in AI theory, it doesn't feel like anyone has any idea what framework they're even working under... nobody is sure what they're doing. There are all these papers that show xyz score on xyz benchmark... a benchmark somebody's dog came up with over breakfast... and nobody has good evidence of what property xyz benchmark is even measuring, why we should care, never mind what it's supposed to 'mean'.

  • @dunar1005
    @dunar1005 16 days ago +1

    I love those sleep inducing videos with someone just talking gibberish that no one understands 😇

  • @zyzhang1130
    @zyzhang1130 13 days ago

    Just came across a Stanford NLP lecture, in which the lecturer says transformers have quadratic memory cost ruclips.net/video/LWMzyfvuehA/видео.html (28:05). I'm relatively unfamiliar with this topic, so I'm not sure why he makes the opposite claim to what you explained.

    • @algorithmicsimplicity
      @algorithmicsimplicity 13 days ago +2

      Transformers as they were originally implemented in 2017 used quadratic memory. In 2022, a new implementation called FlashAttention (arxiv.org/abs/2205.14135 ) was made which is faster (though still O(n^2) compute) and uses O(n) memory. If that Stanford lecturer said they use quadratic memory then they haven't implemented a transformer in the last 2 years.
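To illustrate how attention can avoid storing the full n x n score matrix, here is a rough pure-Python sketch of the streaming ("online") softmax idea. It is not the FlashAttention kernel itself, which works on tiles in fast GPU memory; the function names, the toy inputs, and the omission of masking and of the 1/sqrt(d) scaling are all simplifying assumptions.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full n x n score matrix (quadratic memory)."""
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    out = []
    for row in scores:
        row_max = max(row)
        weights = [math.exp(s - row_max) for s in row]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def streaming_attention(q, k, v):
    """Same result, but each query keeps only a running max, sum and output vector."""
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for kj, vj in zip(k, v):
            s = sum(a * b for a, b in zip(qi, kj))
            new_max = max(running_max, s)
            scale = math.exp(running_max - new_max)  # rescale earlier terms
            p = math.exp(s - new_max)
            running_sum = running_sum * scale + p
            acc = [a * scale + p * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Toy inputs (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

reference = naive_attention(q, k, v)
streamed = streaming_attention(q, k, v)
print(reference)
print(streamed)
```

The streaming version still does O(n^2) arithmetic, but the extra state per query is constant in the number of keys, which is the sense in which memory becomes O(n).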

    • @zyzhang1130
      @zyzhang1130 13 days ago

      @@algorithmicsimplicity i see. Thanks for the info. This lecture teaches from the very beginning so it makes sense they talked about the earlier version of transformer first

  • @Neomadra
    @Neomadra 8 days ago +1

    Why make peer review public when there's no way to remove obviously wrong reviews

    • @algorithmicsimplicity
      @algorithmicsimplicity 8 days ago +1

      It's supposed to act as an incentive for peer reviewers to write quality reviews, since they can be seen by anyone. Didn't work in the case of Mamba lol.

  • @ze5os427
    @ze5os427 1 month ago +1

    5:02 for people who don't get it, vanishing/exploding gradients is just basically dementia for AI in a way, except it's where the AI is starting to become incapable of learning any further

  • @AkarshanBiswas
    @AkarshanBiswas 1 month ago +8

    Who cares about these academic peer reviews? 😅 But anyways, the only downsides I have seen in S6 are that it is not always stable during training (I have seen huge outliers) and it performs worse than transformers at copying.

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +5

      Yep, apparently there are a few things where Mamba performs worse than transformers. Hopefully hybrid architectures such as Griffin arxiv.org/abs/2402.19427 give the best of both worlds.

    • @chickenp7038
      @chickenp7038 1 month ago +1

      The Jamba paper says they add RMSNorm to intermediate activations, just not which activations. Do you have any ideas?

    • @jks234
      @jks234 1 month ago +2

      IMO peer review is key to being able to continue to move forward.
      As I heard George Hotz say, “Intelligence is compression.”
      We advance in all directions, reconvene and compress into useful theories and explanations, and using our newfound perspective, explore once again.
      I personally am amazed at how much more clearly I understand the shortcomings addressed by transformers vs RNNs after this video and how this is related to the fundamental nature of backpropagation.
      Also, I found the analogy with convolution at the beginning quite insightful.
      It allowed me to understand that RNNs are essentially a dynamic programming algorithm (recursive and sequential), while transformers are parallel and thus no longer “time-bound”.
      I have been doing a lot of reflection after reflecting on that.
      Primarily my own assertion that RNNs are probably not a good path to go down… because my own thinking is not fundamentally recursive like that.
      My own thinking is probabilistic and going in many directions at once. At least, that is my experience. And thus, it feels much more aligned with the matmul model of transformers. Decoupled from the sequence and evaluated from a much higher level than sequential recurrence.
      I have personally been reflecting on how perhaps the next step would be an “importance metric”.
      Just as humans naturally filter quite quickly for importance and feel that certain paths hold more promise, I feel that this might be a promising next step for transformers. In a phrase, “filtering heuristics”.
      MoE, but at the attention level.

    • @augmentos
      @augmentos 1 month ago

      Griffin is the first, but why are none of the major labs putting out a huge Mamba model on par with the larger transformer models? Even if mixed. Any insights? Is there a summary anywhere of the ways it doesn’t shine? Has anyone tried Mamba with BitNet?

    • @chickenp7038
      @chickenp7038 1 month ago

      @@augmentos because the big labs don’t publish what they are doing. they better be doing it because it works.

  • @BasitMustafa
    @BasitMustafa 26 days ago

    Love everything about your videos, thank you so much, but please reconsider the background music (as in removing it). Those who desire it can easily add in their own. I find it distracting.

  • @leonmozambique533
    @leonmozambique533 1 month ago

    lmao Mamba got rejected from ICLR

  • @clamhammer2463
    @clamhammer2463 22 days ago

    I'm dizzy

  • @gunaysoni6792
    @gunaysoni6792 5 days ago

    You're overselling Mamba a little. Transformers are not yet dethroned and for many tasks they'll perform better than RNNs. Try giving Mamba a passage and then asking it a question based on the passage. RNNs are sensitive to the order of the information presented to them. Great video but you should also talk about the shortcomings of mamba

  • @jawadmansoor6064
    @jawadmansoor6064 1 month ago +2

    Faster: yes. Better: how?

    • @algorithmicsimplicity
      @algorithmicsimplicity 1 month ago +4

      Better scores on language modelling perplexity and downstream reasoning tasks.

    • @stephaneduhamel7706
      @stephaneduhamel7706 1 month ago +1

      @@algorithmicsimplicity At least at "small" scales

    • @BooleanDisorder
      @BooleanDisorder 1 month ago +3

      @@stephaneduhamel7706 Even if it doesn't scale well by itself, it's extremely impressive and important work. It will definitely be relevant going forwards.

    • @jawadmansoor6064
      @jawadmansoor6064 1 month ago

      @@algorithmicsimplicity The largest Mamba (or any other state space model) I saw was less than 7B parameters. Also, to use Mamba they had to do some tricks, some difficult math, and make calculations within CPU memory (or that is how I understood it); since that memory is not very large, they can't build large models. And it is commonly believed that the larger the model, the better it is at generalization, i.e. understanding and doing "tasks".

  • @cutmasta-kun
    @cutmasta-kun 1 month ago +1

    Dude, the music in the background is killing me -.- Silence would be much better!

  • @Originalimoc
    @Originalimoc 10 days ago +1

    Hahahaha another joke in academia, probably some corruption inside I guess

  • @vidyagaems4063
    @vidyagaems4063 8 days ago

    hhehe "N-words"

  • @igorg4129
    @igorg4129 1 month ago +1

    Nice, but please stop uptalking, you are several levels above this habit.

    • @shpensive
      @shpensive 1 month ago +1

      Nonsense, speak up or down or sideways-inside-out

    • @kalisticmodiani2613
      @kalisticmodiani2613 16 days ago

      uptalking is just accent. Everybody has an accent.