MAMBA and State Space Models explained | SSM explained

  • Published: 20 Dec 2024

Comments • 161

  • @drummatick
    @drummatick 9 months ago +24

    I have a question. Given that SSMs are entirely linear, how do they square with the universal approximation theorem? A lack of nonlinear activations should make them particularly bad at approximating functions, but they are not.
    Am I missing something?
    Also, really loved the video!

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +33

      Exactly, SSMs are linear. 2:05 This is why they are part of larger neural nets, which bring in the nonlinearities needed to fulfil the universal approximation theorem.
      You can think of SSMs like attention in Transformers: they are just a module (the module that aggregates information from the whole sequence, so that the current token is informed of its context).
      Great question, thanks!
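
A minimal sketch of the point in this reply, assuming a tiny already-discretized SSM with made-up sizes and random parameters (none of these values come from the video). The bare recurrence obeys superposition, and a nonlinearity placed around it (here `tanh`, as a hypothetical stand-in for the surrounding network) breaks that linearity:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                               # state size and sequence length (arbitrary)
A = np.diag(rng.uniform(0.1, 0.9, N))      # hypothetical stable, discretized A
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))

def ssm(x):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a 1-D input sequence."""
    h = np.zeros((N, 1))
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append((C @ h).item())
    return np.array(ys)

def block(x):
    """The SSM wrapped in a nonlinearity (stand-in for the larger network)."""
    return np.tanh(ssm(x))

u, v = rng.normal(size=L), rng.normal(size=L)

# Superposition holds for the bare SSM ...
ssm_is_linear = np.allclose(ssm(2 * u + 3 * v), 2 * ssm(u) + 3 * ssm(v))
# ... but not once a nonlinearity is applied on top of it.
block_is_linear = np.allclose(block(2 * u + 3 * v), 2 * block(u) + 3 * block(v))
print(ssm_is_linear, block_is_linear)
```

The SSM only mixes information along the sequence; the expressive power the universal approximation theorem talks about comes from the nonlinear layers around it.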

    • @drummatick
      @drummatick 9 months ago +6

      Thank you!
      One last question. Since the output is computed through a convolution rather than a recurrent function, do we use backpropagation through time to update A, B and C? Or just normal backprop?

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +17

      @@drummatick You ask so many great questions! (I'll pin your original comment, as I think others might have the same question.)
      As the state of token t depends on the state of token t-1, we still need to do backpropagation through time (otherwise we would not know what to backpropagate at t-1 if we did not resolve t first). But because of the linearity, backprop through time is reportedly more stable for Mamba than for, e.g., LSTMs.
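
The convolution-versus-recurrence question can be checked numerically. The sketch below assumes a time-invariant SSM (the same A, B, C at every step) with random, hypothetical parameters, and shows that unrolling the recurrence gives the same output as convolving the input with the kernel K[k] = C A^k B:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 3, 10                               # arbitrary state size and length
A = np.diag(rng.uniform(0.2, 0.9, N))      # hypothetical discretized parameters
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# Recurrent view: one state update per token.
h = np.zeros((N, 1))
y_rec = []
for t in range(L):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())
y_rec = np.array(y_rec)

# Convolutional view: precompute the kernel K[k] = C A^k B once,
# then y_t = sum_k K[k] * x[t - k] (a causal convolution).
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(L)])
y_conv = np.convolve(x, K)[:L]

print(np.allclose(y_rec, y_conv))
```

This equivalence relies on the parameters being fixed across time steps. Once they depend on the input, as in Mamba's selective SSM, a single fixed kernel no longer exists and the recurrence has to be evaluated with a scan instead.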

    • @drummatick
      @drummatick 9 months ago +6

      @@AICoffeeBreak Thank you! Makes sense. That might also explain why Mamba doesn't suffer from the vanishing gradient problem, as the propagated gradients vanish linearly? Correct me if I'm wrong.

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +9

      @@drummatick Yes, no activation function to squash signals to zero.

  • @partywen
    @partywen 9 months ago +4

    Thanks! Looking forward to a Hyena video :)

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Thank you so much! It seems like you paid for the coffee I'm drinking right now!

  • @peabrane8067
    @peabrane8067 8 months ago +3

    Thank you for the shoutout to my repo!
    I later realized it was an application of a known idea, the "Heinsen sequence", which is a pretty cool way to do certain associative scan operations via cumsum.
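
A rough sketch of how a first-order linear recurrence can be evaluated with cumulative operations instead of a loop. This assumes scalar, strictly positive coefficients and is only my reading of the kind of cumsum trick the comment refers to; the function names are hypothetical and this is not the code from the repo:

```python
import numpy as np

def scan_sequential(a, b):
    """Reference loop for x_t = a_t * x_{t-1} + b_t with x_{-1} = 0."""
    x, out = 0.0, []
    for a_t, b_t in zip(a, b):
        x = a_t * x + b_t
        out.append(x)
    return np.array(out)

def scan_cumsum(a, b):
    """Same recurrence via cumulative ops.

    With P_t = a_0 * ... * a_t, unrolling gives x_t = P_t * sum_{k<=t} b_k / P_k.
    Assumes every a_t is nonzero and P_t does not underflow.
    """
    P = np.cumprod(a)
    return P * np.cumsum(b / P)

rng = np.random.default_rng(2)
a = rng.uniform(0.5, 1.0, 32)   # hypothetical decay factors
b = rng.normal(size=32)
print(np.allclose(scan_sequential(a, b), scan_cumsum(a, b)))
```

For long sequences the running product can underflow, which is why practical versions of this idea tend to work in log space rather than multiplying directly.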

  • @theruisu21
    @theruisu21 9 months ago +6

    great video! thanks

  • @ShadowHarborer
    @ShadowHarborer 10 months ago +50

    I have to give a presentation on Mamba next week and I've been waiting for this video so I could finally learn what the hell I need to talk about

    • @projectsspecial9224
      @projectsspecial9224 10 months ago +5

      Don't feel dumb. The first time I heard about transformers used for LLMs, I thought it was a new movie prequel. Oh, well, I guess I'm dumber. 😂

    • @sampadk04
      @sampadk04 9 months ago +1

      Us

  • @alex-beamslightchanal8743
    @alex-beamslightchanal8743 4 months ago +2

    Thanks!

  • @OlgaIvina
    @OlgaIvina 1 month ago +1

    Thank you very much for this thorough, well-curated, and comprehensive review of MAMBA.

    • @AICoffeeBreak
      @AICoffeeBreak 1 month ago

      Thank you, for your appreciation!
      I just saw you on LinkedIn, let's stay connected!

  • @kumarivin3
    @kumarivin3 1 month ago +2

    Thank you so much !! you really super simplified it for any beginner level deep learner to understand

  • @李洛克-m1u
    @李洛克-m1u 3 months ago +3

    Thank you! This is by far the easiest-to-understand and most concise video that teaches the concepts of SSMs

  • @darkswordsmith
    @darkswordsmith 9 months ago +4

    I'm not entirely sure how SSMs differ from RNNs, especially regarding how attention is being used. There's still the bottleneck of h_t to h_{t+1} between time steps, which was one of the motivations for the attention layer -- so that information in one part of the sequence doesn't have to be squeezed before computation with information from another part of the sequence.
    Is the main innovation from RNN to SSM the fixed delta, A, B, C formulation, such that training can be done in parallel for all time steps?

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +2

      SSMs are linear RNNs. But linear alone is not powerful enough (universal approx. theorem). 2:05 This is why they are part of larger neural nets, which bring in the nonlinearities needed to fulfil the universal approximation theorem. You can think of SSMs like attention in Transformers: they are just a module (the module that aggregates information from the whole sequence, so that the current token is informed of its context).
      But unlike attention, this information aggregation module works token after token, while attention has access to all tokens at once. So once a token is wrongly forgotten, it is gone.
      Therefore, current research cannot really leave attention behind, because it is so useful for retrieval and finding needles in a haystack: arxiv.org/abs/2402.04248
      or arxiv.org/abs/2402.19427

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Thanks for the amazing question.
      And yes, to your last question: SSMs are linear RNNs, which is exactly what makes them computable in parallel at training time.
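
Why linearity allows parallel training can be sketched with an associative scan. Each step h -> a_t * h + b_t is an affine map, and composing affine maps is associative, so partial results can be merged in about log2(L) rounds instead of L sequential steps. The code below is a NumPy illustration with scalar coefficients and hypothetical function names; real implementations run this as a fused GPU kernel:

```python
import numpy as np

def combine(left, right):
    """Compose two affine steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def parallel_scan(a, b):
    """Hillis-Steele style inclusive scan; every round is one vectorized op."""
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(a):
        # Position t merges the partial result ending at t - shift
        # with its own partial result ending at t.
        a_new, b_new = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = a_new, b_new
        shift *= 2
    return b   # b[t] now equals h_t for an initial state of zero

def sequential_scan(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

rng = np.random.default_rng(3)
a = rng.uniform(0.1, 1.0, 37)   # length deliberately not a power of two
b = rng.normal(size=37)
print(np.allclose(parallel_scan(a, b), sequential_scan(a, b)))
```

A nonlinear RNN cell cannot be merged this way, because composing two nonlinear steps does not collapse into another step of the same simple form.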

    • @darkswordsmith
      @darkswordsmith 9 months ago

      Thanks for the great explanation! I suppose things will become clearer when I try to work through the math myself.
      When I first read about SSMs, I saw a big connection between them and Kalman filtering. But now I see the connection is probably just inspirational.

  • @outliier
    @outliier 9 months ago +4

    Really cool video! Thank youuuu

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +2

      Thanks, nice to see you here, Dominik, aahm I mean Outlier!

  • @hannes7218
    @hannes7218 7 months ago +3

    this explanation was excellent. Thank you very much :)

  • @jamescunningham8092
    @jamescunningham8092 9 months ago +8

    This is exactly the level of detail I needed right now. Thank you so much!

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +3

      Thanks! This was also the level I was looking for when trying to understand Mamba. So I decided to explain what I understood.

  • @Thomas-gk42
    @Thomas-gk42 10 months ago +9

    I understand 10% of that stuff, but the presentation is lovely.

  • @rodrigomeireles5966
    @rodrigomeireles5966 10 months ago +5

    @AICoffeeBreak, thank you for the awesome video. Very small pet peeve which had me re-check all the math: at 11:20 it would make the explanation much easier to follow if you kept x 0-indexed, as that is the notation you had been using since the beginning. Also, maybe make it explicit that you're taking t = L, although this is kind of obvious. This was an awesome lecture, thank you again.

  • @고준성-m4g
    @고준성-m4g 8 months ago +4

    Great.
    There are a lot of failed explanations or completely wrong approaches to SSMs and Mamba on the internet, but I finally found exactly what I wanted.
    Thank you for the video.

  • @faysoufox
    @faysoufox 10 months ago +4

    Nice video, good overview, which is what I was searching for

  • @harumambaru
    @harumambaru 9 months ago +4

    Nice T-shirt! So excited to listen about new models!

  • @doublesami
    @doublesami 8 months ago +2

    Looking forward to a video on Vision Mamba, to catch the points I missed while reviewing other content related to it.

  • @DerPylz
    @DerPylz 10 months ago +9

    I was waiting for exactly this topic! Thanks so much!

  • @cosmic_reef_17
    @cosmic_reef_17 10 months ago +10

    Hats off to you for this amazing video! Best explanation of Mamba I have seen.

  • @Emresessa
    @Emresessa 10 months ago +4

    Awesome video! I especially like the simple explanation and the visuals.

  • @HeliyaHasani-x7b
    @HeliyaHasani-x7b 10 months ago +4

    Amazing explanation! Please create more content!

  • @doctor_xin1452
    @doctor_xin1452 5 months ago +2

    cool video!!!! Thanks a lot!!!

  • @AM-yk5yd
    @AM-yk5yd 10 months ago +10

    Great video. Speaking of VMamba.
    For some reason it seems people in the medical imaging field are more excited about Mamba than people working on LLMs.
    So many that they are competing among themselves. "Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models. Notably, on AbdomenMRI, Encoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba by an average score of 3.58%".
    If you pick a random Mamba paper from arXiv (there are dozens of them already), it will probably be related to medical image segmentation.

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +4

      Yes, this was exactly my impression when I talked to some friends working on medical imaging yesterday! 🤗 Unfortunately, I couldn't mention this insight in the video as it was already almost done with editing.

  • @AaronNicholsonAI
    @AaronNicholsonAI 6 months ago +3

    Thank you so much! Super helpful.

  • @bareMinimumExample
    @bareMinimumExample 9 months ago +5

    Does matrix A change during inference or only during training?

  • @ruchiradhar1589
    @ruchiradhar1589 10 months ago +5

    A big thanks for a comprehensive explanation of the Mamba Architecture & computations, @AICoffeeBreak!

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago

      Thank you for your visit and for leaving this wonderful comment!

  • @BooleanDisorder
    @BooleanDisorder 10 months ago +10

    I can't wait for neural networks that are tokenless and take bits directly. Where words are learned through images and sound rather than as their own modality. And richer input with simultaneous multimodal vectors.

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +3

      Did you see MambaByte? 20:45
      But exactly, I agree with you. This would be tremendous for both multilinguality and multimodality. Tokenizers are very tedious.

  • @LorenzoCassano-l2h
    @LorenzoCassano-l2h 9 months ago +4

    I did not understand why we need to perform a discretization step. What are the advantages? I understood what happens from a math point of view, but why do we need to change from real numbers to discrete numbers? Is it something related to backpropagation?

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Great question, and I do not know exactly the answer myself. I assume you could do without it and hope for SGD to find the right parameter settings. But I assume the theoretical soundness helps to stabilize training. I would be curious to find out more about this.

    • @v-ba
      @v-ba 9 months ago +2

      @@AICoffeeBreak Isn't it because with a continuous function we would have to solve ODEs with proper solvers, but if we discretize it and apply the Euler method, we get a recurrence that can be optimized with backprop?

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Exactly. That's a great way to put it with the right background knowledge.
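
To make the discretization step concrete, here is a small sketch on a scalar ODE h'(t) = a*h(t) + b*x(t) with a constant input x = 1. The values of `a`, `b` and `delta` are made up for illustration. It compares the Euler rule mentioned above with the zero-order-hold (ZOH) rule, which is the one the Mamba paper describes, against the closed-form solution:

```python
import numpy as np

a, b = -1.0, 1.0          # hypothetical continuous-time parameters
delta, steps = 0.1, 50    # step size and number of discrete steps

def rollout(a_bar, b_bar, steps):
    """Iterate h_k = a_bar * h_{k-1} + b_bar * x with x = 1 and h_0 = 0."""
    h, out = 0.0, []
    for _ in range(steps):
        h = a_bar * h + b_bar * 1.0
        out.append(h)
    return np.array(out)

# Euler: h_{k+1} = h_k + delta * (a*h_k + b*x)
euler = rollout(1.0 + delta * a, delta * b, steps)

# Zero-order hold: exact when the input is constant within each step
zoh = rollout(np.exp(delta * a), (np.exp(delta * a) - 1.0) / a * b, steps)

# Closed-form solution of the ODE sampled at the same time points
t = delta * np.arange(1, steps + 1)
exact = (b / a) * (np.exp(a * t) - 1.0)

print(np.max(np.abs(euler - exact)), np.max(np.abs(zoh - exact)))
```

Either rule turns the differential equation into a plain recurrence with parameters `a_bar` and `b_bar`, and that recurrence is what gets trained with ordinary backprop. They differ only in how closely they track the continuous system for a given step size.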

  • @elinetshaaf75
    @elinetshaaf75 5 months ago +3

    omg, omg, omg, this is such a great explanation of mamba!

  • @markr9640
    @markr9640 9 months ago +3

    OK, Miss Coffee bean, you really made my head spin this time! Keep up the good work. We love you!

  • @serta5727
    @serta5727 10 months ago +5

    Thank you for the great Mamba explanation

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +2

      Thank You for your visit and for leaving this comment!

  • @JuliusSmith
    @JuliusSmith 10 months ago +6

    Excellent summary! Note BTW that the B discretization has been simplified to B * \Delta, because it made little difference (see mamba-minimal comments).
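
A quick numerical check of why that simplification costs little, assuming a diagonal A with made-up negative entries and B set to ones. For a diagonal A, the zero-order-hold rule gives B_bar = (exp(delta*A) - 1) / A * B, and its first-order Taylor expansion in delta is exactly delta * B:

```python
import numpy as np

A_diag = -np.arange(1.0, 5.0)   # hypothetical diagonal of A: -1, -2, -3, -4
B = np.ones(4)

def b_bar_zoh(delta):
    """Zero-order-hold discretization of B for a diagonal A."""
    return (np.exp(delta * A_diag) - 1.0) / A_diag * B

def b_bar_simple(delta):
    """Simplified rule B_bar = delta * B."""
    return delta * B

deltas = (0.001, 0.01, 0.1)
gaps = [np.max(np.abs(b_bar_zoh(d) - b_bar_simple(d))) for d in deltas]
for d, g in zip(deltas, gaps):
    print(d, g)
```

The gap shrinks roughly with delta squared, so for small step sizes the two rules are nearly indistinguishable.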

  • @norabelrose198
    @norabelrose198 10 months ago +9

    nitpick: with FlashAttention, transformers have linear memory requirements, but the compute complexity is still quadratic
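
The distinction can be sketched in NumPy with an online softmax over blocks of keys. This is not the actual FlashAttention kernel (which also tiles the queries and is built around GPU on-chip memory); the function names and block size are my own, and it only illustrates the idea. The full L x L score matrix is never stored, yet every query-key pair is still scored once, so compute stays quadratic:

```python
import numpy as np

def attention_full(Q, K, V):
    """Standard attention: materializes the whole (L, L) score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def attention_blockwise(Q, K, V, block=4):
    """Same result, holding only an (L, block) slice of scores at a time."""
    L, d = Q.shape
    out = np.zeros((L, V.shape[1]))
    m = np.full((L, 1), -np.inf)   # running row-wise max of the scores
    s = np.zeros((L, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=1, keepdims=True))
        scale = np.exp(m - m_new)          # rescale what was accumulated so far
        P = np.exp(S - m_new)
        s = s * scale + P.sum(axis=1, keepdims=True)
        out = out * scale + P @ Vb
        m = m_new
    return out / s

rng = np.random.default_rng(4)
Q, K, V = rng.normal(size=(10, 8)), rng.normal(size=(10, 8)), rng.normal(size=(10, 5))
print(np.allclose(attention_full(Q, K, V), attention_blockwise(Q, K, V)))
```

For a fixed block size the extra memory grows linearly with sequence length, while the number of score entries computed is still L times L.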

  • @MaJetiGizzle
    @MaJetiGizzle 10 months ago +7

    Thanks for the MAMBA video!
    I always appreciate your insight on these new, influential papers! Your thoughts always pair well with a good cup of coffee. 😁☕️

  • @harumambaru
    @harumambaru 9 months ago +4

    My favourite kind of animation of Ms Coffee Bean is on 19:47

  • @jefferychen8330
    @jefferychen8330 9 months ago +4

    Thanks for the video! I’ve been confused between RNN and MAMBA for several days😂

  • @juanmanuelcirotorres6155
    @juanmanuelcirotorres6155 10 months ago +5

    thanks a lot for this (love your sweater)

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +1

      Thank You for your kind words!

  • @Ben_D.
    @Ben_D. 9 months ago +4

    The ASMR at the end is nice. I always sit and wait for it. 🙂

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +2

      😂I should start a channel reading ML papers in ASMR style.

  • @davidgraey
    @davidgraey 10 months ago +6

    this was such a well summarized explanation, thank you! ❤

  • @TimScarfe
    @TimScarfe 10 months ago +5

    Awesome explanation Letitia!! ☕️

  • @dunamaiTheSheep
    @dunamaiTheSheep 8 months ago +2

    Incredible work. Thank you !

    • @AICoffeeBreak
      @AICoffeeBreak 8 months ago +1

      I hope I'll remember everything, so I can present this paper tomorrow in the reading group. 😅

  • @nerisozen9029
    @nerisozen9029 6 months ago +1

    thanks! also very cute merch!

  • @hannesstark5024
    @hannesstark5024 10 months ago +6

    Neat summary! Thank you ^^

  • @samanthaqiu3416
    @samanthaqiu3416 9 months ago +5

    only way I managed to understand up to 40% about how mamba is implemented was watching this video

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Give these things some time to crystallize. Your understanding will grow even bigger with time. Also, better explainer resources will keep coming.

  • @Summersault666
    @Summersault666 10 months ago +5

    Can you share some info about the material you used to learn Mamba?

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +3

      Sure! You can find most of these things in the video description.
      1. Foremost, the paper and its references
      2. 📕The Annotated S4: srush.github.io/annotated-s4/
      I also found this after publishing the video: jameschen.io/jekyll/update/2024/02/12/mamba.html
      Yannic Kilcher has already done a great video about Mamba and I wholeheartedly recommend it! My only problem with it is that he sticks too closely to the paper, and the Mamba paper (while well written) scatters the architecture all over the text. It also assumes knowledge of previous papers, so I honestly did not understand how it all hangs together, what Mamba actually is, and how tokens flow through it.

    • @Summersault666
      @Summersault666 10 months ago

      @@AICoffeeBreak wow, those are great. Thanks!

  • @AICoffeeBreak
    @AICoffeeBreak 10 months ago +11

    Celebrating our merch launch, here is a limited time offer!
    👉 Get 25% discount on AI Coffee Break Merch with the code MAMBABEAN: 🛍 aicoffeebreak.creator-spring.com/

    • @harumambaru
      @harumambaru 9 months ago +1

      Isn't it cannibalism when Coffee Bean drinks coffee?

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +1

      Have you checked the cup? 🤔 It's hot water. Beans drinking hot water is how coffee is made! ☕

  • @marzi869
    @marzi869 9 months ago +5

    Wasn't your voice enough? now I have a crush on you!!!

  • @arthurpenndragon6434
    @arthurpenndragon6434 10 months ago +6

    I am in general happy that we are trying to leave transformers behind. RWKV and SSMs look promising. Transformers are inelegant, and by their very formulation a resignation from looking for better methods.

  • @Theodorus5
    @Theodorus5 9 months ago +4

    Yea that was good

  • @MarcCasalsiSalvador
    @MarcCasalsiSalvador 8 months ago +1

    min 11:30 x = (x_0,...x_{L-1}) Amazing video

  • @jiahao2709
    @jiahao2709 6 months ago +2

    Thank you a lot for your explanation! May I know which software you use to make the animations in your video? Thanks!

    • @AICoffeeBreak
      @AICoffeeBreak 6 months ago +1

      My editor uses Adobe Premiere Pro for video editing (this is also when Ms. Coffee Bean comes in).
      I use PowerPoint for all other visualisations.

    • @jiahao2709
      @jiahao2709 6 months ago +1

      @@AICoffeeBreak really beautiful!

  • @commonwombat-h6r
    @commonwombat-h6r 10 months ago +9

    i barely grasped the transformers and now they have this new thing.....

    • @tmjz7327
      @tmjz7327 10 months ago

      well you had 6 years to look into transformers, so..

    • @projectsspecial9224
      @projectsspecial9224 10 months ago +1

      Don't feel bad. NLP and Linguistic Models had been around for decades. It really takes a long time because processing power was so expensive. I did NLP AI demos in year 2000 and they told me that the tech was too advanced for its time and they didn't know what to do with it! 😅

  • @Micetticat
    @Micetticat 10 months ago +7

    I knew you would have covered this hot topic! I bet now you are working on RWKV ! They are discovering so many alternative architectures these days!

  • @krttd
    @krttd 10 months ago +7

    I wish I could find a TensorFlow implementation with all the optimizations! I need to bite the bullet and switch to Torch soon.

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +5

      Good luck with switching your faith.

    • @ultrasound1459
      @ultrasound1459 10 months ago +2

      Move asap bro. Pytorch > TensorFlow

    • @TheRyulord
      @TheRyulord 10 months ago +1

      The pytorch version has a bunch of specialized CUDA kernels so I hope you know CUDA too

  • @kamiboy
    @kamiboy 4 months ago +2

    Do I understand correctly that in MAMBA, to predict the current token, the architecture only has access to an embedding influenced by all previous (downstream) tokens, with no influence from future (upstream) tokens? This is kind of a weakness, isn't it? I think the way LLMs are constructed, their classification tasks can take into account all tokens in the sequence when predicting any token in the sequence.

    • @AICoffeeBreak
      @AICoffeeBreak 4 months ago

      Yes, exactly. If the summary / history token misses something, it is gone forever. The hope is to learn to retain everything (important).

    • @kamiboy
      @kamiboy 4 months ago +1

      @@AICoffeeBreak No, what I actually meant was something else. Let me explain it like this. Let us say the input sequence is "I LIKE TO [CLS] A BURGER", and I want to predict which token most likely fits in place of the [CLS] token. In an LLM, the classification of [CLS] would be able to take into account all tokens before [CLS], "I LIKE TO", as well as all tokens after it, "A BURGER". But from what I understand, MAMBA would only use "I LIKE TO" to predict, because the embeddings used to predict the current token only depend on what came before it in the sequence. Am I understanding this right? This is certainly the case for plain SSMs, but I am not sure about the selective SSMs of MAMBA.

    • @AICoffeeBreak
      @AICoffeeBreak 4 months ago

      @kamiboy Ah, I understand, you mean bidirectional models like BERT that can classify things in the middle of the sequence. Yes, encoder models like BERT can look into the future. GPT models don't, because their attention is causally masked. There, the art is to frame / prompt the problem such that the answer is at the end of the sequence. This is quite simple, and works for MAMBA too: "I like to eat a [CLS] burger. What word does [CLS] stand for? The answer is:"
      But there is no such limitation on MAMBA, as one can make it bidirectional as well, as bidirectional LSTMs used to be: run the model once forwards to summarise the sequence to the left of CLS. Then run it once again from the right until the CLS.
      Concatenate the forward and backward summaries (token embeddings) and classify the CLS token based on that.
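
The bidirectional recipe described here can be sketched with a toy causal scan. The recurrence below uses a single fixed decay `a` as a hypothetical stand-in for a full Mamba layer, and the function names are mine. The point is only the wiring: one pass left to right, one pass right to left, then concatenate the two summaries per position:

```python
import numpy as np

def scan(x, a=0.9):
    """Causal linear recurrence h_t = a * h_{t-1} + x_t over the rows of x (L, D)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + x[t]
        out[t] = h
    return out

def bidirectional(x):
    """Concatenate a left-to-right and a right-to-left summary per position."""
    fwd = scan(x)               # row t summarizes x[0..t]
    bwd = scan(x[::-1])[::-1]   # row t summarizes x[t..L-1]
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(5)
x = rng.normal(size=(6, 3))     # 6 tokens with 3 features each (made-up sizes)
feats = bidirectional(x)

# Changing the last token leaves the forward summary at position 2 untouched,
# but the backward summary at position 2 does see the change.
x_changed = x.copy()
x_changed[5] += 1.0
feats_changed = bidirectional(x_changed)
print(np.allclose(feats[2, :3], feats_changed[2, :3]),
      np.allclose(feats[2, 3:], feats_changed[2, 3:]))
```

A classifier placed on the concatenated features at the [CLS] position then has access to context from both sides.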

    • @kamiboy
      @kamiboy 4 months ago +1

      @@AICoffeeBreak excellent, thanks for the clarification. I was already thinking that if I was correct then doing the bidirectional trick would be a solution. So that is how BERT works, neat to know.

  • @qichaoying4478
    @qichaoying4478 7 months ago +4

    how can i buy your tshirt?

    • @AICoffeeBreak
      @AICoffeeBreak 7 months ago +3

      On the AICoffeeBreak merch store. 🤭 aicoffeebreak.creator-spring.com/

    • @qichaoying4478
      @qichaoying4478 7 months ago +2

      @@AICoffeeBreak Oh I see. At the time I asked I hadn't watched to the end. 🤣

  • @kahoonalagona7123
    @kahoonalagona7123 10 months ago +4

    Amazing !!

  • @mkamp
    @mkamp 10 months ago +4

    Awesome video.
    So for NLP we only have discrete inputs, hence we don’t need delta, right?
    And then we also cannot learn different deltas for different layers to be resolution independent as we could for continuous signals? No?

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +4

      Thank you! About your question: we always need the delta for discrete outputs. The delta-less formulation is for writing the differential equation, which assumes continuous transitions with infinitesimal steps. When we go to matrices and vectors and do numerical approximations (all of deep learning), we need to discretize and choose delta steps.

    • @mkamp
      @mkamp 10 months ago +1

      @@AICoffeeBreak Thanks for your answer and your patience, Ms Coffee Bean. What I am stuck on is that we use delta to discretize continuous inputs, but in the NLP applications of Mamba we already have discrete inputs, the tokens. What would we then apply delta to?
      Also, it would be conceivable that we could learn to fuse together, for example, every three consecutive tokens using delta, but then delta would be discrete (1, 2, 3, 4 tokens per window) and could not be differentiated/learned?

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +5

      @@mkamp Thanks for the follow-up, now I better understand the confusion. It all has to do with terminology and what words mean in different ML contexts.
      I think you are talking about **the inputs**, and whether they are discrete or continuous. Language is discrete and symbolic, but with our neural networks we pretend it is continuous: we map words to tokens and tokens to vectors that can take continuous values.
      What I am talking about with SSMs is the continuous / discrete nature of **the model**, in the way it updates the hidden state h. When we model sequences and want to compute hidden states for the tokens in a sequence, we jump in large (discrete) steps from token to token (word to word, as in "The big fox sat on the mat ...").
      * In contrast, a differential equation is continuous: how h changes in time (h dot at 06:13) is an infinitesimally small step from the last state of h. So we are modelling the example sinusoid curve with the smallest step imaginable (the assumption of calculus that steps can be infinitely small; think of the integral vs. the discrete sum). I hope you can see that we cannot move continuously from "The" to "big", as there is not much semantics in between those words (there are some semantic steps in between, but not infinitely many small steps. I mean, there is "cat", there is "woman", we have a "catwoman", a "woman cat", but not much else). So we cannot use differential equations to model jumps from token to token, and we use SSMs instead: 👇
      * The SSM formulation is the discrete version of the continuous equation: we are modelling that sine curve with larger steps (not infinitesimal steps), given by delta.
      Which means that we are jumping from token to token in our language / sequence applications. One h is "The", the next h has to model "big", the next "fox" ... So we are jumping in large, discrete steps from word to word (token to token). This is why we cannot use differential equations directly, but discretized versions of them: SSMs.

    • @mkamp
      @mkamp 10 months ago +1

      Thanks for the answer and for taking the time!@@AICoffeeBreak

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +1

      @@mkamp Videos are these static, uneditable things (on YT). I could have made this point much clearer in the video if I had known about your question beforehand. And of course, it is all also limited by how much time I can invest in making one video.

  • @RalphDratman
    @RalphDratman 10 months ago +4

    Good job as far as possible, but I fear this topic may require more like a semester course than a single video.
    As a side note, I find it striking that mathematics is at the heart of the new (and miraculous) world of AI. If this rate of math penetration continues, kids will soon need to go directly from third grade to graduate school.

  • @andreicozma6026
    @andreicozma6026 7 months ago +1

    14:06 so they reinvented memoization and gave it a different fancy name?

  • @edhofiko7624
    @edhofiko7624 7 months ago +1

    so whats next? kalman filter with learned dynamic?

  • @啊瑤
    @啊瑤 6 months ago +3

    Very well explained! Love from China 💓

  • @rock_sheep4241
    @rock_sheep4241 10 months ago +7

    Looks more like a RNN model

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +4

      It is a linear RNN with a lot of Schnickschnack around it. 😅

    • @pooroldnostradamus
      @pooroldnostradamus 10 months ago +4

      @@AICoffeeBreak to use the scientific terminology

    • @AICoffeeBreak
      @AICoffeeBreak 10 months ago +1

      @@pooroldnostradamus 🤣🤣🤣

  • @science_electronique
    @science_electronique 4 months ago +2

    mamba for video

  • @garyhuntress6871
    @garyhuntress6871 10 months ago +9

    The explanation of the SSM sure sounds similar to a Kalman filter to me!

    • @RFantiniCosta
      @RFantiniCosta 9 months ago +2

      You are absolutely right, sir. That's because Kalman Filter is indeed a State Space Model

  • @davidespinosa1910
    @davidespinosa1910 3 months ago +2

    Transformer memory space is linear, not quadratic -- see the FlashAttention paper.

    • @AICoffeeBreak
      @AICoffeeBreak 3 months ago +1

      True with Flashattention! Yes.

  • @abdulhamidmerii5538
    @abdulhamidmerii5538 7 months ago +4

    Am I the only one that finds mamba much harder to grasp than transformers?

    • @AICoffeeBreak
      @AICoffeeBreak 7 months ago

      Let's see: what was the first architecture you learned? Was it transformers or LSTMs?

  • @MyrLin8
    @MyrLin8 7 months ago +2

    merch lol

  • @keeperofthelight9681
    @keeperofthelight9681 10 months ago +2

    Mamba jamba samma

    • @AICoffeeBreak
      @AICoffeeBreak 9 months ago +2

      This spawns music in my head. 🎶😅

  • @nobo6687
    @nobo6687 10 months ago +2

    Hey Ai Coffe Break create video on cited papers from Gemini 1.5 Pro paper 😮❤

  • @csepartha
    @csepartha 9 months ago

    Kindly make a tutorial to fine tune an open source LLM model on many pdfs data. The fine tuned LLM must be able to answer the questions from the pdfs accurately.

  • @skullteria
    @skullteria 8 months ago

    its kinda obv that you dont rlly understand this shit yet good enough to explain it better. Its like trying to reverse analyzing a circuit board instead of going from the intended use case, to the technical rules to the abstraction to the design to the analogies. You just explained me what a mosfet is for and how it works inside of a circuit board... yeah well nice. This isnt a test. Its a lesson. Try to explain it in human language, feel into it. THEN you can down to why all that stuff achieves that and which analogies could be used for it.
    This video right now just shows one thing... you are smart. But it doesnt show me that you understood teaching.

    • @AICoffeeBreak
      @AICoffeeBreak 8 months ago +5

      Hi skullteria, with my videos I always have to make the difficult decision of what will be the target audience of the video. I have a general guideline that I set myself when I started this channel, which was to broadly aim the videos at computer science students somewhere in their bachelors. With a topic like mamba, I had to gloss over many complicated topics (for some of which I have already made some more in-depth explanations) to not make the video over an hour and still get into the level of detail that many of my regular viewers expect. You will find in the comments many people for whom this was exactly the right balance. But this of course means that there will also be people who will not find the video super useful, and I'm very sorry that you're one of them. If there's a specific topic for which you'd like a more high-level, more intuitive explanation, absolutely put it in the comments and I'll add it to my to-do list :)

  • @গারমারাগেলো
    @গারমারাগেলো 10 months ago

    Why should I tell you what I like?

  • @paratracker
    @paratracker 8 months ago

    I'd love to see your take on Hasani, Lechner, et al's Closed-form Continuous Time Neural Nets, Liquid Structural State-Space Models, Forward Invariance of Neural ODEs, ... and maybe compare with MAMBA SSSM