Is Mamba Destroying Transformers For Good? 😱 Language Models in AI

  • Published: Dec 14, 2024

Comments • 34

  • @soccerdadsg
    @soccerdadsg 5 months ago +2

    Appreciate your effort to make this video.

    • @analyticsCamp
      @analyticsCamp  5 months ago

      My pleasure, thanks for watching :)

  • @viswa3059
    @viswa3059 10 months ago +14

    I came here for giant snake vs giant robot fight

  • @Researcher100
    @Researcher100 10 months ago +4

    Thanks for the effort you put into this detailed comparison, I learned a few more things. Btw, the editing and graphics in this video were really good 👍

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 6 months ago +1

    Can Mamba have its input RoPE scaled? It seems it doesn't require positional encoding, but this might make it extremely efficient for second-order optimization techniques.

    • @analyticsCamp
      @analyticsCamp  6 months ago

      In Mamba, the sequence length can be scaled up to a million (i.e., million-length sequences). It is trained with ordinary first-order gradients (I did not find any info on second-order optimization in their method); they train for 10k to 20k gradient steps.

  • @thatsfantastic313
    @thatsfantastic313 5 months ago +1

    beautifully explained!

  • @mintakan003
    @mintakan003 10 months ago +1

    I'd like to see this tested out for larger models, comparable to Llama 2. One question that I have is whether there are diminishing returns for long-distance relationships, compared to a context window of sufficient size. Is it enough for people to give up (tried and true?) transformers, with explicit modeling of the context, for something that is more selective?

    • @analyticsCamp
      @analyticsCamp  10 months ago +1

      A thoughtful observation! Yes, it seems the authors of Mamba have already tested it against Transformer-based architectures such as PaLM and LLaMA, plus a bunch of other models. Here's what they say in their article, page 2:
      "With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has 5× generation throughput compared to Transformers of similar size, and Mamba-3B’s quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B)."
      With regards to scaling the sequence length, I have explained a bit in the video. Here's a bit more explanation from their article, page 1:
      "The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length."
      There's also an interesting summary table of model evaluation (Zero-shot Evaluation, page 13) comparing different Mamba model sizes with GPT-2, the H3 hybrid model, Pythia, and RWKV, where in each instance Mamba exceeds these models' performance (check out the accuracy values on each dataset, especially for the Mamba 2.8-billion-parameter model; it is truly impressive).
      And, thanks for watching :)
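      To make the quadratic-vs-linear point from that quote concrete, here is a rough back-of-the-envelope sketch (my own toy arithmetic, not a calculation from the paper):

      ```python
      # Rough cost comparison: self-attention scales as L^2 in the window
      # length L, while an SSM-style scan scales as L (model width d fixed).
      # The constants here are arbitrary; only the growth rates matter.
      def attention_cost(L, d=1024):
          return L * L * d          # every token attends to every other token

      def ssm_scan_cost(L, d=1024, n=16):
          return L * d * n          # one state update per token (state size n)

      for L in [1_000, 10_000, 100_000, 1_000_000]:
          print(f"L={L:>9,}  attention/ssm ~ {attention_cost(L) / ssm_scan_cost(L):,.0f}x")
      ```

      The gap grows linearly with the window length, which is why attention becomes the bottleneck for very long contexts while a state-space scan does not.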

  • @richardnunziata3221
    @richardnunziata3221 10 months ago +1

    A 7B to 10B Mamba would be interesting to judge, but right now it seems it's really good with long content in the small-model space.

    • @analyticsCamp
      @analyticsCamp  10 months ago

      You are right! Generally speaking, a larger parameter count gives better results. But Mamba's claim is that we don't necessarily need larger models, just a more efficient model design that can perform comparably to other models, even when trained on less data and with fewer parameters. I suggest their article, section 4.3.1, where they talk about "Scaling: Model Size", which can give you a good perspective. Thanks for watching :)

  • @optiondrone5468
    @optiondrone5468 10 months ago +1

    I'm enjoying these mamba videos you're sharing with us, thanks

  • @yuvrajsingh-gm6zk
    @yuvrajsingh-gm6zk 10 months ago +2

    keep up the good work, btw you got a new sub!

  • @MrJohnson00111
    @MrJohnson00111 10 months ago +1

    You clearly explain the difference between Transformer and Mamba, thank you.
    But could you also give the reference for the paper you mention in the video, so I can dive in?

    • @analyticsCamp
      @analyticsCamp  10 months ago

      Hi, glad the video was helpful. The reference for the paper is also mentioned multiple times in the video, but here's the full reference for your convenience:
      Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

  • @70152136
    @70152136 10 months ago +2

    Just when I thought I had caught up with GPTs and Transformers, BOOM, MAMBA!!!

  • @consig1iere294
    @consig1iere294 10 months ago +1

    I can't keep up. Is Mamba like the Mistral model, or is it an LLM technology?

    • @analyticsCamp
      @analyticsCamp  10 months ago +1

      Mamba is an LLM, but it has a unique architecture: a blend of traditional SSM-based models with a multi-layer perceptron (MLP), which adds 'selectivity' to the flow of information in the system (unlike Transformer-based models, which typically take in the whole context, i.e., all the information, to predict the next word). If you are still confused, I recommend watching the video on this channel called "This is how exactly language models work", which gives you a perspective on different types of LLMs :)
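      As a very simplified sketch of what that 'selectivity' means (a toy illustration only, not the actual Mamba code): the state-update parameters are computed from the current input, so each token can decide what gets written into, kept in, and read out of the hidden state.

      ```python
      import numpy as np

      # Toy selective state-space recurrence (simplified; NOT the real Mamba kernel).
      # The step size dt, "write" vector B_t and "read" vector C_t all depend on the
      # current input x_t -- that input dependence is the "selectivity".
      rng = np.random.default_rng(0)
      d, n, L = 4, 8, 10                    # channels, state size, sequence length
      x = rng.normal(size=(L, d))           # a toy input sequence

      A = -np.abs(rng.normal(size=n))       # stable (negative) diagonal state matrix
      W_B = 0.1 * rng.normal(size=(d, n))   # input -> per-token B_t
      W_C = 0.1 * rng.normal(size=(d, n))   # input -> per-token C_t
      W_dt = 0.1 * rng.normal(size=d)       # input -> per-token step size

      h = np.zeros((d, n))                  # one small hidden state per channel
      outputs = []
      for t in range(L):
          dt = np.log1p(np.exp(x[t] @ W_dt))            # softplus -> positive step
          B_t, C_t = x[t] @ W_B, x[t] @ W_C             # input-dependent write/read
          h = np.exp(dt * A) * h + dt * np.outer(x[t], B_t)  # decay old info, write new
          outputs.append(h @ C_t)                       # read out one value per channel
      print(np.stack(outputs).shape)                    # (L, d), same shape as the input
      ```

      A Transformer, by contrast, keeps the whole context around and lets attention decide what to look at on every step; the SSM instead compresses the past into the hidden state h and selectively updates it.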

  • @ricardofurbino
    @ricardofurbino 10 months ago

    I'm doing work that uses sequence data, but it's not specific to language. In a transformer-like network, instead of an embedding layer for the source and target, I have linear layers; also, I send both source and target to the forward pass. In an LSTM-like network, I don't even need this step; I just use the standard torch LSTM cell, and only the source is needed for the forward pass. Does someone have a code example of how I can do this with Mamba? I'm having difficulty figuring it out.

    • @analyticsCamp
      @analyticsCamp  10 months ago +1

      Hey, I just found a PyTorch implementation of Mamba at this link. I haven't gone through it personally, but if it is helpful please do let me know: medium.com/ai-insights-cobet/building-mamba-from-scratch-a-comprehensive-code-walkthrough-5db040c28049
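      If it helps, here is a rough, untested sketch of how one might wire a Mamba block into your setup with the mamba-ssm package from the official repo (treat the layer sizes and constructor arguments as illustrative and double-check them against the package README): replace the embedding with an nn.Linear projection and, as in your LSTM case, pass only the source sequence through the forward pass.

      ```python
      import torch
      import torch.nn as nn
      # Assumes the `mamba-ssm` package (github.com/state-spaces/mamba) is installed;
      # its Mamba block currently expects a CUDA device.
      from mamba_ssm import Mamba

      class MambaForContinuousSequences(nn.Module):
          """Toy model: linear 'embedding' -> Mamba block -> per-step linear head."""
          def __init__(self, in_features, d_model=64, out_features=1):
              super().__init__()
              self.proj_in = nn.Linear(in_features, d_model)    # replaces nn.Embedding
              self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
              self.proj_out = nn.Linear(d_model, out_features)  # prediction per time step

          def forward(self, src):             # src: (batch, seq_len, in_features)
              h = self.proj_in(src)           # (batch, seq_len, d_model)
              h = self.mamba(h)               # (batch, seq_len, d_model)
              return self.proj_out(h)         # (batch, seq_len, out_features)

      # Hypothetical usage -- like your LSTM setup, only the source is passed forward:
      # model = MambaForContinuousSequences(in_features=8).cuda()
      # x = torch.randn(4, 256, 8, device="cuda")
      # y_hat = model(x)                      # (4, 256, 1)
      ```

      There is no separate target input here; for sequence-to-sequence prediction you would train it autoregressively (predict the next step from the current one) rather than feeding source and target separately as in an encoder-decoder Transformer.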

  • @Kutsushita_yukino
    @Kutsushita_yukino 10 months ago +1

    Let's goooo, Mamba basically has similar memory to humans. But brains do tend to forget information when it's unnecessary, so there's that.

    • @analyticsCamp
      @analyticsCamp  10 months ago

      That's right. Essentially, the main idea behind SSM architectures (i.e., having a hidden state) is to manage the flow of information in the system.

  • @datascienceworld
    @datascienceworld 10 months ago +1

    Great video.

  • @user-du8hf3he7r
    @user-du8hf3he7r 10 months ago

    Low audio volume.

  • @raymond_luxury_yacht
    @raymond_luxury_yacht 10 months ago

    Why didn't they call it Godzilla?

    • @analyticsCamp
      @analyticsCamp  10 months ago

      Funny :) It seems to be as powerful as one!