Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)

  • Published: 20 Dec 2024

Comments • 130

  • @YannicKilcher
    @YannicKilcher  1 year ago +21

    OUTLINE:
    0:00 - Introduction
    0:45 - Transformers vs RNNs vs S4
    6:10 - What are state space models?
    12:30 - Selective State Space Models
    17:55 - The Mamba architecture
    22:20 - The SSM layer and forward propagation
    31:15 - Utilizing GPU memory hierarchy
    34:05 - Efficient computation via prefix sums / parallel scans
    36:01 - Experimental results and comments
    38:00 - A brief look at the code

  • @HoriaCristescu
    @HoriaCristescu 1 year ago +50

    I have to say - your return to making more frequent videos is making me very happy. I used to watch your videos before reading the papers.

    • @ousooo
      @ousooo 10 months ago

      good to see you here, Horia. :)

  • @stephaneduhamel7706
    @stephaneduhamel7706 1 year ago +42

    11:15 They did experiments up to 3 billion parameters, IIRC. There is a Mamba 3B model available on Hugging Face, at least.

    • @hunterkudo9832
      @hunterkudo9832 1 year ago +1

      How does it compare to a 3B transformer model?

    • @greengoblin9567
      @greengoblin9567 1 year ago +9

      @@hunterkudo9832 It's better.

    • @TheRyulord
      @TheRyulord 1 year ago +21

      @@hunterkudo9832 It's roughly on par with a 7B param transformer

    • @bmoharryman5809
      @bmoharryman5809 1 year ago +7

      This is hot.

    • @erickmarin6147
      @erickmarin6147 1 year ago

      (___0__)__(____0__)
      \_________________/

  • @varunsaagars
    @varunsaagars 1 year ago +19

    🎯 Key Takeaways for quick navigation:
    00:00 📜 *This video discusses the Mamba paper, which introduces a linear-time sequence modeling approach with Selective State Spaces.*
    01:19 🔄 *Transformers have the advantage of dynamic and selective attention but suffer from quadratic computation and memory requirements, while RNNs have limited memory and scalability.*
    03:05 🔄 *Backpropagation through time in RNNs can be memory-intensive and lead to gradient problems.*
    05:51 🔄 *State space models, like S4, offer an alternative to RNNs and Transformers with linear computation but lack data-dependent transitions.*
    09:34 🔄 *Mamba introduces Selective State Spaces, relaxing the input-independence constraint while retaining linear computation, making it a hybrid between SSM and LSTM.*
    11:39 🚀 *Mamba is competitive with Transformers, especially on long sequences and dense data like language and genomics.*
    12:22 📚 *The paper addresses computational efficiency and context-based reasoning, improving on previous SSM models.*
    16:12 🤖 *Mamba's architecture combines selective state spaces with other components, providing linear scaling in sequence length.*
    18:40 🚀 *Mamba offers fast training and inference, scaling linearly in sequence length during training and achieving performance on sequences up to one million in length.*
    26:33 🧮 *The video explains a sequence modeling technique involving the computation of Y3, which depends on various time steps and matrices.*
    28:28 🖥️ *The technique allows for the precomputation of certain parameters, making it possible to compute the output of a new sequence instantly.*
    29:38 🧩 *Mamba focuses on making parameters input-dependent, achieving efficiency through GPU memory optimization.*
    32:07 🚀 *The paper reduces data movement and utilizes fast memory (SRAM) for matrix multiplications, resulting in significant speed improvements.*
    34:21 🧬 *Mamba performs zoom operations differently by using a prefix sum technique to accommodate input-dependent elements.*
    36:18 📈 *Mamba shows promising scaling performance for large-scale sequence models, outperforming other attention-free models.*
    37:23 💻 *Inference throughput on an A100 GPU is good and improves as batch size increases when compared to Transformers.*
    37:36 🧪 *The paper discusses the intricacies of the efficient implementation, providing insights into memory transfers and cost reductions.*
    39:50 📊 *The code implementation includes various components such as input projection, 1D convolution, discretization, recurrence, and gating pathways for efficient sequence modeling.*
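
A minimal, self-contained sketch of the selective-SSM recurrence the outline above describes (input-dependent Δ, B, C; discretize; recur; read out). The weights are random placeholders and the discretization of B is a simplified Euler form, so treat it as an illustration of the mechanism rather than the paper's fused implementation.

```python
# Toy selective SSM for a single input channel: delta, B, C are computed from
# the current input (the "selection"); A stays a fixed (learned) diagonal.
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 4                        # sequence length, state dimension
x = rng.standard_normal(L)          # one input channel over time

A = -np.abs(rng.standard_normal(N))     # diagonal A with negative entries
w_delta = rng.standard_normal()         # placeholder projection weights
w_B = rng.standard_normal(N)
w_C = rng.standard_normal(N)

h = np.zeros(N)
y = np.zeros(L)
for t in range(L):
    delta = np.log1p(np.exp(w_delta * x[t]))   # softplus keeps the step size > 0
    B_t = w_B * x[t]                           # input-dependent B and C
    C_t = w_C * x[t]
    A_bar = np.exp(delta * A)                  # discretized transition
    B_bar = delta * B_t                        # simplified (Euler) discretization
    h = A_bar * h + B_bar * x[t]               # h_t = A_bar_t * h_{t-1} + B_bar_t * x_t
    y[t] = C_t @ h                             # y_t = C_t h_t

print(np.round(y, 3))
```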

  • @SebastianRaschka
    @SebastianRaschka 11 months ago +12

    This is a great video with a great rundown of Mamba. I was traveling when the Mamba paper came out and coincidentally stumbled upon this video today. This was a big time-saver to catch me up on the gist of it. I'll make sure to watch more of your videos in the future. Big thumbs up!

  • @jonatan01i
    @jonatan01i 1 year ago +17

    I could see this as a "memory" architecture for an actual transformer, remembering distinctive contexts for a long time, while using transformers for the much more complicated and sophisticated logical reasoning where directed focus and attention is much needed.

    • @albertmashy8590
      @albertmashy8590 1 year ago +1

      Huge for assistants, and agents with longer term memory, as well as AI companions

    • @dibbidydoo4318
      @dibbidydoo4318 1 year ago

      Transformers have problems of their own with composition and logical reasoning, right?

  • @YangLi-gw9nb
    @YangLi-gw9nb 6 months ago +1

    This is definitely the best explanation video for Mamba I've seen. Thank you!

  • @OperationDarkside
    @OperationDarkside 1 year ago +5

    These kinds of videos are great on an early christmas morning. You know you are still not really awake. You won't get it anyway. But it kickstarts your brain into work mode.

  • @josephle5686
    @josephle5686 11 months ago +16

    I work with state space models as a control/optimization engineer on a daily basis. But that diagram of the state space model has got to be the most confusing thing I’ve seen in my life lol

    • @brunomaruszczak4016
      @brunomaruszczak4016 11 months ago +1

      Agreed, both state space models and the Kalman filter have been so thoroughly described in every control theory handbook, and I have never seen anything like this diagram 😅

    • @ivanzhelyazkov8099
      @ivanzhelyazkov8099 11 months ago +4

      can you guys recommend a handbook with a clearer representation?

    • @brunomaruszczak4016
      @brunomaruszczak4016 9 months ago

      @@ivanzhelyazkov8099 Well I think you could find some good introduction to state space systems in Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 is on state space models but I would also recommend reading up on general properties of LTI systems. There could be better books though, I’ve read it quite some time ago

    • @brunomaruszczak4016
      @brunomaruszczak4016 9 months ago

      @@ivanzhelyazkov8099 Signals and Systems: Fundamentals by Gang Li and others. Chapter 7 deals with SS models, but recommend looking at chapter 1 and LTI systems as well

  • @barni_7762
    @barni_7762 1 year ago +11

    I'd love another video diving deeper!

  • @kimchi_taco
    @kimchi_taco 1 year ago +19

    I think this is very similar to "Retentive Network", which Yannic covered a few months ago. The state transition model reminds me of a linear Kalman filter. Anyway, I can't believe a single memory vector can carry all the necessary information for every token, one-size-fits-all.

    • @andreaterlizzi
      @andreaterlizzi 1 year ago +4

      Well, they actually mention this in the paper, and yeah Kalman filters are a type of state-space model

    • @axe863
      @axe863 8 months ago

      @andreaterlizzi lol, if that's true, can it really handle non-Gaussian processes?

  • @albertmashy8590
    @albertmashy8590 1 year ago +18

    This is gonna be crazy if you think about it. It's like you could initialize an "assistant" or agent with a huge prompt, but rather than including that information every time, you "save" that state space to save on compute when generating the next tokens, because the prompt doesn't need to be re-loaded every time. This also means that agents could all have their own different personalities and behaviors without significant fine-tuning requirements.
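
A rough sketch of the "save the state" idea, with everything below (the `step` update, the shapes, the file name) made up for illustration: run the long prompt once, keep the fixed-size state, and start later generations from it instead of re-processing the prompt.

```python
# Hypothetical illustration: cache a fixed-size recurrent state instead of a growing KV cache.
import numpy as np

N = 16

def step(h, token_embedding):
    # stand-in for one SSM/RNN update
    return 0.9 * h + token_embedding

rng = np.random.default_rng(0)
prompt = rng.standard_normal((1000, N))   # pretend: the embedded "personality" prompt

h = np.zeros(N)
for tok in prompt:
    h = step(h, tok)                      # pay for the prompt once

np.save("assistant_state.npy", h)         # constant size, regardless of prompt length

# later: resume generation from the saved state without touching the prompt again
h0 = np.load("assistant_state.npy")
```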

  • @vzxvzvcxasd7109
    @vzxvzvcxasd7109 1 year ago +92

    Papers! TBH, I don’t even watch any other vids on this channel.

    • @sagetmaster4
      @sagetmaster4 1 year ago +7

      There are other videos?

    • @Hexanitrobenzene
      @Hexanitrobenzene 11 months ago +3

      @@sagetmaster4
      Yannic had pretty good ML news videos but then he got busy with OpenAssistant and probably other things...
      Now I usually watch "AI Explained" for ML news.

    • @EdFormer
      @EdFormer 10 months ago

      @@Hexanitrobenzene That OpenAI fanboy hype merchant, who probably can't define a perceptron, is no replacement for Yannic's ML News. Better than nothing, but while Kilcher's coverage gave me the feeling of being at an academic event for those who actually study and engineer ML techniques to learn about new ideas they might want to apply and to help them stay up to date, AI Explained seems to just want to blow the minds of tech enthusiasts who are only interested in how close to AGI we are, and always leaves me with a strong smell of bull regarding the significance of whatever he's been making out is a huge leap forward. How much the general culture of the AI community has gone downhill since ChatGPT was released and caught the attention of pseudo-intellectuals is so depressing.

    • @Hexanitrobenzene
      @Hexanitrobenzene 10 months ago

      @@EdFormer
      I don't think such a strong criticism is warranted. You described him like he was ColdFusion or something like that :)
      Of course, if you want a raw engineering take, he is not a replacement, but considering all the countless clickbaity AI related videos on my side bar, he seems the most serious of those whose audience is semi-technical. He always reads the papers thoroughly, catches some less well known trends, does his own investigations. He even did a thorough analysis of MMLU dataset bugs. Not a NeurIPS level researcher, but still a serious guy.

    • @EdFormer
      @EdFormer 10 months ago

      @@Hexanitrobenzene absolutely fair response. I guess I have just channelled my dismay regarding the decline of the once great culture of our community at him specifically, since his channel is the only example of these new AI channels (which I think are illustrative of that decline) that I am prepared to engage with. I still watch his videos, so must find some value there. I just really miss being able to rely on regular, diverse, and legitimately technical/academic content from the likes of Yannic and Letitia that really added another dimension to my never ending literature review. Even Edan Meyer, an undergrad at the time, provided the perspective that I think is sorely missed. I feel these channels have struggled to keep up with the cookie cutter garbage that is now filling my recommendations, and again, I am probably focusing my blame for that on AI Explained. One valid criticism I have of AI Explained though is the misleading perspective I think he provides that is AI = LLMs. I'm probably very biased as someone with a background in computer vision, but there's so much more going on than just LLMs. I find it mind blowing, given how things were previously, that I cannot find a good deep dive video on the Segment Anything Model (SAM) or DINOv2.

  • @Summersault666
    @Summersault666 1 year ago +10

    Finally, I knew I could count on you!

  • @bibhabasumohapatra
    @bibhabasumohapatra 1 year ago +2

    More than the merits and demerits versus transformers, the best part is that it works across modalities: text, audio, and voice clips.

  • @jsalsman
    @jsalsman 1 year ago +11

    Very happy transformers aren't the only game in town.

  • @vladimirtchuiev2218
    @vladimirtchuiev2218 1 year ago +9

    This looks a lot like a state-space representation from control theory. What they are presenting is basically a learnable dynamical system with a linear state transition matrix A and an input-dependent input matrix B, which makes it non-linear, and the same goes for the observation matrix C. Looks like a massive upgrade over transformers for stuff like music generation, maybe even ViT-based models. What isn't clear to me is how they learn the A matrix; it seems that the farther away the context is, the more severe the vanishing gradient problem, and the nearest elements in the sequence are by far the most significant.

    • @patrickl5290
      @patrickl5290 1 year ago

      Something about the “selective” piece maybe? My thought would be that the forgetting rules differ based on a lot of factors, so there are many opportunities for the model to be induced to not forget potentially relevant details

    • @邵斌-n2r
      @邵斌-n2r 11 months ago

      The A matrix is also input-dependent, but yeah, it's hard to believe that it can "remember" things that are 10^6 steps upstream (basically the K term in eq. 3a and 3b).

    • @xxlvulkann6743
      @xxlvulkann6743 8 months ago +1

      @@邵斌-n2r He says A is still a parameter at 30:13, and it's shown that way in the paper as well. I think he made a mistake when he said A was input-dependent, because it seems not to be.
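
As I read the paper's equations (worth double-checking against eqs. (1)-(4) and Algorithm 2 there), the two readings above reconcile as follows: A is a learned parameter with no time index, while the step size Δ_t, together with B_t and C_t, is computed from the input, so the discretized transition still changes per token:

```latex
% Zero-order-hold discretization with input-dependent step \Delta_t:
\bar{A}_t = \exp(\Delta_t A), \qquad
\bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t
% Recurrence and readout:
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t
```

with Δ_t, B_t, C_t functions of x_t and A a (diagonal) learned parameter shared across time.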

  • @vikaspoddar001
    @vikaspoddar001 1 year ago +8

    I was waiting for your review

  • @РудаковАртем
    @РудаковАртем 8 months ago

    An interesting topic and research explained in this paper. I recommend watching!

  • @TheEbbemonster
    @TheEbbemonster 1 year ago +10

    Interesting that A^n actually works for long sequences. I would have expected a severe degradation of performance as sequences get longer...

    • @ItsRyanStudios
      @ItsRyanStudios 11 months ago +3

      My same thought; I need to implement this in code to understand how/why this works.

    • @Thien--Nguyen
      @Thien--Nguyen 11 months ago +1

      I think there are some parameterization constraints, like the eigenvalues of A cannot be positive, for A^n to be stable. The H3 and HiPPO papers talk more about those conditions (there are also diagonal and other constraints on A to make things more efficient, IIRC).

    • @邵斌-n2r
      @邵斌-n2r 11 months ago

      Second this. If the eigenvalue is negative, then the contribution will vanish even more quickly (consider a context length of 10^6).
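
A toy check of the decay this thread worries about, under assumptions of my own (a single negative eigenvalue and a fixed Δ): the per-step factor is exp(Δ·a), so what survives 10^6 steps is governed almost entirely by how small the (in Mamba, input-dependent) step size is.

```python
# With a negative continuous-time eigenvalue a, the discrete factor exp(delta*a)
# is < 1; shrinking delta is what lets the model slow down forgetting.
import numpy as np

a = -1.0                         # one eigenvalue of A (negative => stable)
for delta in (1.0, 1e-3, 1e-6):  # candidate step sizes
    per_step = np.exp(delta * a)            # contraction per time step
    after_1e6 = per_step ** 1_000_000       # what survives 10^6 steps later
    print(f"delta={delta:g}: per-step {per_step:.6f}, after 1e6 steps {after_1e6:.3e}")
```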

  • @jonsimonatwork
    @jonsimonatwork 11 months ago +3

    I think something is off about your explanation of the A_t prefix products around the 35-minute mark. The dimensions given in Algorithm 2 imply that A remains constant across timesteps, since it has no L dimension.

    • @NLogSpace
      @NLogSpace 10 months ago +2

      I wanted to ask the same thing. I think it is a mistake in the video. Also, earlier he mentioned that A is constant and only B and C depend on the input.
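
Separately from whether A varies over time, the reason such a recurrence can be parallelized at all is that updates of the form h → a·h + b compose associatively. This is my own minimal illustration of that composition rule, not the paper's hardware-aware kernel:

```python
# The recurrence h_t = a_t * h_{t-1} + b_t admits a parallel scan because
# the pairs (a, b) compose associatively.
import numpy as np

def combine(left, right):
    """Compose two affine maps h -> a*h + b (left applied first)."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 1.0, L)      # per-step (possibly input-dependent) decay
b = rng.standard_normal(L)        # per-step input contribution

# Sequential reference
h, seq = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    seq.append(h)

# Inclusive scan with the associative combine (serial here, parallelizable in O(log L) depth)
acc, scanned = (a[0], b[0]), [(a[0], b[0])]
for t in range(1, L):
    acc = combine(acc, (a[t], b[t]))
    scanned.append(acc)

print(np.allclose(seq, [s[1] for s in scanned]))  # True: the b-component is h_t (h_0 = 0)
```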

  • @pladselsker8340
    @pladselsker8340 1 year ago +3

    Thank you for the paper review, it always helps!! Happy holidays to everyone 🍾

  • @amirronen1913
    @amirronen1913 11 months ago

    Excellent talk explaining a non-trivial paper. Thanks!

  • @simonl1938
    @simonl1938 11 months ago +3

    SRAM has very low latency, but the total bandwidth is less than what the GPU-CPU link can do (especially with DMA). You could use something like buffer object streaming (this is the name OpenGL has for it) to reduce memory usage massively. This would also allow for much larger batches, as they are computed concurrently with the same weights. Does anyone know if this is already being done? I can't find anything on the topic.

    • @afbf6522
      @afbf6522 11 months ago

      Do it. Can you talk about this in private? I'm very interested in the topic of optimizing operations right now

  • @aleksszukovskis2074
    @aleksszukovskis2074 11 months ago +2

    I don't understand what got me to even watch this at 2:00 in the morning.

  • @lexer_
    @lexer_ 1 year ago +5

    This very much triggers the same intuitive reservations and skepticism I feel towards linear transformers and other attempts at improving scaling by throwing away one of the essential components of what makes transformers work. I am not convinced that any of these architectures actually bring the same scaling we have seen with transformers so far, where the only limiting factor seems to be the amount and quality of the training data.

    • @sagetmaster4
      @sagetmaster4 1 year ago +1

      The future looks like collections of models working in concert, so having transformers doing most of the work in most systems with certain domains using a different architecture seems plausible

    • @dennismertens990
      @dennismertens990 1 year ago +1

      After "Text Embeddings Reveal (Almost) As Much As Text" came out, I became convinced that encoder-decoder transformers learn embeddings that behave just like an RNN's hidden/recurrent state. If you plot the embedding of a sentence unmasked sequentially, you get a trajectory. That paper shows this trajectory is quite informative w.r.t. the original text.
      This is interesting because that suggests you can train an RNN on the learned embeddings. Since you already have the embeddings, there is no need for back-prop through time. It would be like training a regular neural net on input-output examples, where the input is the embedding at time t and the output is the embedding at time t+1. It's only speculation, but it could be a practical way of distilling pre-trained transformers into RNNs for deployment.
      P.S. By unmasking a sentence sequentially, I mean for example "hel", "hell", "hello".
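
A bare-bones version of that speculation, purely illustrative (the embeddings below are random stand-ins): fit an ordinary regressor mapping the embedding at step t to the embedding at step t+1, with no backpropagation through time.

```python
# Treat consecutive prefix embeddings as (input, target) pairs and fit a
# linear "RNN step" by plain least squares.
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 64
E = rng.standard_normal((T, d))            # pretend: embeddings of growing prefixes

X, Y = E[:-1], E[1:]                       # predict embedding at t+1 from t
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # fitted step, no backprop through time
print(np.mean((X @ W - Y) ** 2))           # training error of the fitted step
```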

    • @lexer_
      @lexer_ 1 year ago

      ​@@dennismertens990 I agree that these architectures in general and Mamba in particular seem like they are a very good idea for more task specific deep learning. I should have specified that I was referring to the general purpose LLMs or more multi-modal models which develop some level of reasoning and general abstraction capability. For more constrained tasks like DNA modelling this might very well be a huge leap forward.
      In regards to sequential unmasking, it would of course still be token based, not character based, but I don't think you can get away with throwing away the attention matrix without a massive loss in performance even if you train an embedding.

    • @lexer_
      @lexer_ 1 year ago +3

      @@sagetmaster4 A lot of people have been saying this for many years at this point. And for a long time this seemed like a very reasonable assumption. If you cannot scale generally, you have to specialize. It's the same with general compute as well as many other areas. But I haven't seen a single example of this actually working to a degree that outperforms just putting the same amount of extra compute into a larger or longer-trained model. I am beginning to think that we are missing some very fundamental and essential insight to make this actually work. Sparse network designs like mixture of experts seem to have some real benefits here, but only for inference speed. But I would argue this is only tangentially related to the whole heterogeneous architecture idea.
      I for one think the next major architectural step probably needs to be intermediate higher-level memory instead of just scaling the context. Being able to store abstractions in memory seems like such a fundamental component of human thinking that I can't help but think that it might have the potential for major improvements in an LLM as well.
      The other thing that will eventually be necessary is a way for models to spend significantly more time "thinking" before answering. There were a few attempts with traditional LLMs and a thinking-token to essentially allow them more inference steps for the next token. And the results looked promising in the paper. But now it seems to have been mostly forgotten about. So it seems a more fundamental way for the model to recurse internally might be necessary for introspective logic.

    • @TheRyulord
      @TheRyulord 1 year ago +5

      @@lexer_ If this paper's claims are actually true, then it has better perplexity on The Pile (general-purpose language modeling) than traditional transformers. We've also seen attention actually struggle quite a bit on information retrieval with long sequences, so while it is powerful, it's not a perfect mechanism. A lot of older subquadratic transformer papers basically just did "attention but worse" (e.g. window + random attention) and so they naturally had tradeoffs that a completely different mechanism like this isn't necessarily going to have.

  • @charmy1138
    @charmy1138 1 year ago +2

    What are the differences between RWKV, RetNet, and Mamba? Which here has the closest architecture to transformers?

    • @stellabiderman4056
      @stellabiderman4056 1 year ago +3

      This is maybe worth a video all to itself, but they're all different types / levels of tradeoff between a transformer and an RNN, basically

  • @leisha1519
    @leisha1519 11 months ago +1

    For A^n, will so many multiplications of A lead to an explosion of parameters?

  • @rault.7108
    @rault.7108 1 year ago

    Thanks for talking about this model! ❤

  • @kaikapioka9711
    @kaikapioka9711 1 year ago +1

    Impressive, thx as always 🎉 happy holidays

  • @sirati9770
    @sirati9770 11 months ago

    Regarding not having a facecam: for people still learning English, having a visual reference can drastically improve listening comprehension.

  • @srh80
    @srh80 1 year ago +2

    Please do more papers, like before!

  • @ithaca2076
    @ithaca2076 1 year ago +3

    this is my christmas present

  • @许翔宇
    @许翔宇 8 months ago

    Great explanation. Will you please share the link to the GitHub repo?

  • @004307ec
    @004307ec 11 months ago +1

    😊 This is the second video I've watched on this paper, and I understand it better now. The paper is kind of too hard for me 😂

  • @ngayminh8463
    @ngayminh8463 9 months ago

    I haven't seen any Mamba implementations in PyTorch. Also, will it replace Transformers?

  • @qingyangzhang6093
    @qingyangzhang6093 1 year ago +5

    will we use mamba to install mamba?

    • @GatlingNG
      @GatlingNG 1 year ago +5

      I hope not. The whole conda ecosystem should not be used for serious production code. Pyenv + Poetry + pip is all you need.

    • @vcool
      @vcool 1 year ago +2

      @@GatlingNG conda can install things that pip can't.

    • @GatlingNG
      @GatlingNG 1 year ago

      @@vcool We have containers for that, not some half-assed, cobbled-together venv manager with barely maintained packages.

  • @RafalLewczuk-g6m
    @RafalLewczuk-g6m 1 year ago +7

    If Mamba can handle such long sequences, does it need tokenization?

    • @simonl1938
      @simonl1938 1 year ago +1

      You can't really just input words like that into any model, but it might still be easier in other ways. In the S4 talk video, Albert Gu talks about how much easier it was for people in other fields to use their model, since it's one-size-fits-all with great performance without tuning.

    • @albertmashy8590
      @albertmashy8590 1 year ago +2

      Neither transformers nor Mamba need tokenization; it's more just for efficiency. But I'm curious about the potential implications and whether Mamba would be more capable of handling larger vocabularies.

  • @yannickpezeu3419
    @yannickpezeu3419 1 year ago +1

    Brilliant explanation!
    Thanks!

  • @patrickl5290
    @patrickl5290 1 year ago

    Was waiting for this, thanks!

  • @feiyuchen1383
    @feiyuchen1383 9 months ago

    What does the "zoom" mean in the video?

  • @corgirun7892
    @corgirun7892 1 year ago

    good video, Yannic has a strong ability to dissect and extract key points

  • @PeterIsza
    @PeterIsza 1 year ago +1

    Okay, so the hidden layer is just a really big IIR filter: en.wikipedia.org/wiki/Infinite_impulse_response
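
That framing fits the time-invariant (LTI / S4-style) case; the selective version is closer to an IIR filter whose coefficients change every sample. A quick sanity check with illustrative coefficients, assuming SciPy is available:

```python
# A scalar SSM step h[n] = a*h[n-1] + b*x[n] is exactly a first-order IIR filter.
import numpy as np
from scipy.signal import lfilter

a_coef, b_coef = 0.9, 0.1
x = np.random.default_rng(0).standard_normal(32)

# Manual recurrence
h, manual = 0.0, []
for xn in x:
    h = a_coef * h + b_coef * xn
    manual.append(h)

# Same thing via the standard IIR difference equation y[n] - a*y[n-1] = b*x[n]
iir = lfilter([b_coef], [1.0, -a_coef], x)

print(np.allclose(manual, iir))  # True
```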

  • @横川俊介-x4f
    @横川俊介-x4f 4 months ago

    helped so much to understand!

  • @jabowery
    @jabowery 1 year ago

    What happens if you limit the training corpus to enwik9 and then just hammer away?

  • @theskydebreuil
    @theskydebreuil 1 year ago

    Nice! Was waiting for this 😁

  • @davidespinosa1910
    @davidespinosa1910 2 months ago

    It sounds like S4 solves the vanishing and exploding gradient problem, so that it works for very long sequences. Now I'm curious how that works...

  • @ayushtibrewal4535
    @ayushtibrewal4535 10 months ago

    Can we use this model for image classification, like ViT?

    • @EngineeredFemale
      @EngineeredFemale 10 months ago +1

      Yep, you can. Check out V Mamba (Vision Mamba).

    • @ayushtibrewal4535
      @ayushtibrewal4535 10 months ago

      @@EngineeredFemale But there is no training code for Vision Mamba.

  • @PicaPauDiablo1
    @PicaPauDiablo1 1 year ago +2

    This is so great.

  • @heyman620
    @heyman620 1 year ago +1

    You are just amazing.

  • @mkamp
    @mkamp 11 months ago

    Can somebody help me understand this? In figure 1 in the paper it says “Structured SSMs independently map each channel (e.g. 𝐷 = 5) of an input 𝑥 to output 𝑦 through a higher dimensional latent state ℎ (e.g. 𝑁 = 4)”. How are four dimensions higher dimensional than five?
    I am not trolling, and I understand that it says “e.g.” and that it might not be deliberate. But given the quality of the paper, that seems unlikely.
    Is there another way to interpret higher dimensional? Does it just mean “not a scalar”?

    • @mkamp
      @mkamp 11 months ago +1

      I found the solution. Well, GPT4 did, but who’s counting? 😅
      Each of the five channels is mapped to four hidden dimensions. Of course, now it all makes sense.
      This is what the horse said: "The caption mentions that structured State Space Models (SSMs) map each channel of an input to an output through a higher-dimensional latent state h (for example, N = 4) while avoiding the materialization of a large effective state space. The inconsistency you're pointing out seems to be related to the dimensionality D given as an example (i.e., D = 5) versus the term 'higher dimensional.'
      This likely means that each channel D is independently mapped, and the 'higher dimensional' aspect refers to the combination of these mappings to create a state space with more dimensions than the individual inputs. The effective state space would then be D × N, which is larger than the dimensionality of each individual channel D alone. So, in this context, 'higher dimensional' does not contradict the example dimensions given; rather, it points to the resultant multi-dimensional space created by the model's structure."
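
A quick shape sketch of that resolution, using the figure's example sizes (my own illustration):

```python
# Each of the D=5 input channels gets its own N=4-dimensional hidden state,
# so the effective state per time step is D x N = 20 numbers -- "higher
# dimensional" than the D=5 input, even though N < D.
import numpy as np

D, N = 5, 4
x_t = np.zeros(D)           # input at one time step: D channels
h_t = np.zeros((D, N))      # hidden state: one N-vector per channel
print(x_t.size, h_t.size)   # -> 5 20
```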

    • @akolec
      @akolec 10 months ago

      Why did a "horse" say that? @@mkamp

  • @berkk1993
    @berkk1993 1 year ago +1

    Thanks, great!

  • @duongbinh23
    @duongbinh23 6 months ago

    6:50 RNN: y(t+1) = sigmoid(y(t) + x(t))
    State-space: Well, let's get rid of the non-linear sigmoid so that y(t+1) = y(t) + x(t) = … = y(0) + x(0) + x(1) + … + x(t)
    Now you change the game lol.
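
For what it's worth, the linearity being poked at here is exactly what buys the speed: without the nonlinearity, the recurrence collapses to an associative prefix operation (a plain cumulative sum in this stripped-down case), which is what enables the parallel training mode. Illustrative only:

```python
# Sequential "RNN without the sigmoid" vs. the same thing as one parallel-friendly primitive.
import numpy as np

x = np.arange(5.0)

y, seq = 0.0, []
for xt in x:
    y = y + xt
    seq.append(y)

print(np.allclose(seq, np.cumsum(x)))  # True
```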

  • @vcool
    @vcool 1 year ago

    I want to see a model that requires memory logarithmic in the sequence length, which could be better than requiring linear memory.

    • @johnzinhoinhoinho
      @johnzinhoinhoinho 1 year ago +1

      This model has constant memory at inference; it has linear memory during the training phase.

  • @erickmarin6147
    @erickmarin6147 1 year ago

    Would love to see some classification of machine learning itself using AI; there seems to be a lot of stuff with different names that is functionally very similar.

  • @alivecoding4995
    @alivecoding4995 11 months ago

    I am super sceptical towards these reduced models. But I think they are interesting as a kind of ablation study for the Transformer architecture, helping us to understand their inner mechanisms better.

  • @apilaiteloc6520
    @apilaiteloc6520 7 months ago

    lovely

  • @mohdil123
    @mohdil123 1 year ago

    Super nice

  • @tjpld
    @tjpld 1 year ago +2

    Generative Pretrained Mamba

  • @worthstream
    @worthstream 11 months ago

    I've always preferred videos with no face cam, actually.

  • @RaghaVamsi
    @RaghaVamsi 7 months ago

    Wowww

  • @NoNameAtAll2
    @NoNameAtAll2 10 months ago

    tragic loss of not seeing your sunglasses*
    fify

  • @thinkalinkle
    @thinkalinkle 11 months ago

    I can't help feeling like the claim that this will "replace" Transformers is a bit much... especially after watching the video.

  • @csabaczcsomps7655
    @csabaczcsomps7655 1 year ago +2

    Remember, training data contains biased, wrong, and only partially true information, highly adversarial data, school-of-thought philosophy, outdated and era-optimized ideas... and a lot more. You need to somehow filter or devalue these values, or put them in some different value dimensions. This would make hallucination go down a little more. My noob opinion.

  • @monoham1
    @monoham1 11 months ago

    Yet again I can't understand a word of the maths except the intro.
    I've heard of LSTM
    big O
    matrix
    FFT
    that's about IT

    • @monoham1
      @monoham1 11 months ago

      "It's just basic 2nd year maths" - I hate people who say this. They have no idea of the struggles of people who have to work for 10 years just to afford 1 year of university, or can't understand why paying 2 months of your income for 1 month of rent is a barrier to learning this IMPOSSIBLE maths!
      IT'S NOT MATHS, IT'S LATIN CODE FOR BASIC SHIT ONLY YOU UNDERSTAND. THERE'S NO WAY TO LEARN THIS IF YOUR PARENTS AREN'T MILLIONAIRES.

    • @justtoleavecomments3755
      @justtoleavecomments3755 8 months ago

      @@monoham1 My parents were poor immigrants from Eastern Europe, and I took loans to go to a public university (Canadian) that cost $10k/yr. I got good grades and the govt (of Canada) wrote off half those loans. The US has cheap state schools too, right?
      You don't need to be a millionaire.
      You just need time and interest.

  • @krimdelko
    @krimdelko 1 year ago

    Why learn multiplication and division when all you need is addition and real numbers? Same principle here. Reduce complexity

  • @jondo7680
    @jondo7680 1 year ago +1

    It really makes no sense that all these efficient architectures come in small sizes like 3B models. If they trade off capabilities for performance, they should release 7B or 13B models; why would anyone run a 3B Mamba or RWKV if 7B Mistral runs on everything? Nice tech, but as long as it's under 7B it's just a demo.

    • @wenhanzhou5826
      @wenhanzhou5826 10 months ago

      This is research, not engineering. Maybe they just had limited compute resources.

  • @DanFrederiksen
    @DanFrederiksen 1 year ago

    I was surprised to learn that transformers rely on a classical algo hack to have a kind of memory. I'm quite sure that's a flawed premise that won't last. It reminds me of bag of words, which was a tremendously flawed premise. Despite how relatively amazingly well transformers work, most approaches so far seem hacky.
    I've thought ahead a bit, and I figure that any kind of real-world, well-rounded AI needs a vastly more complex and sophisticated architecture. That's not to say it couldn't happen relatively quickly, but LLMs ain't it. Just the ability to hold a thought and progressively work on it with shifting focus is way beyond LLMs. And that's analogous to image generation, which superficially looks very close to flawless but in reality is also very far from the holy grail, for much the same reason.

  • @axe863
    @axe863 8 months ago

    Before most of you were born ..... lol 😅

  • @JRyang-py4qp
    @JRyang-py4qp 5 months ago

    3Q

  • @hanyanglee9018
    @hanyanglee9018 1 year ago +1

    LSTM, transformer, and anything in a similar direction are just not gonna work.

    • @Japneets1
      @Japneets1 10 months ago

      what do you mean by 'not gonna work'? Transformers have already proven their worth, haven't they?

  • @BB-sd6sm
    @BB-sd6sm 1 year ago

    Opinion: over-engineering architectures won't actually solve the real problem at hand, which is intelligence.

    • @erkinalp
      @erkinalp 1 year ago

      overbuilding allows one to handle more unforeseen types of problems, though.