Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

  • Published: May 28, 2024
  • In this video I will introduce all the innovations in the Mistral 7B and Mixtral 8x7B models: Sliding Window Attention, KV-Cache with Rolling Buffer, Pre-Fill and Chunking, Sparse Mixture of Experts (SMoE). I will also guide you in understanding the most difficult parts of the code: Model Sharding and the use of the xformers library to compute the attention for multiple prompts packed into a single sequence. In particular, I will show the attention computed using BlockDiagonalCausalMask, BlockDiagonalMask and BlockDiagonalCausalWithOffsetPaddedKeysMask.
    I will also show you why Sliding Window Attention allows a token to "attend" to other tokens outside the attention window, by linking it with the concept of Receptive Field, typical of Convolutional Neural Networks (CNNs). Of course, I will prove it mathematically.
    When introducing Model Sharding, I will also talk about Pipeline Parallelism, because the official Mistral repository refers to microbatching.
    I have released a copy of the Mistral code commented and annotated by me (especially the most difficult parts): github.com/hkproj/mistral-src...
    Slides PDF and Python Notebooks: github.com/hkproj/mistral-llm...
    Prerequisite for watching this video: • Attention is all you n...
    Other material for better understanding Mistral:
    Grouped Query Attention, Rotary Positional Encodings, RMS Normalization: • LLaMA explained: KV-Ca...
    Gradient Accumulation: • Distributed Training w...
    Chapters
    00:00:00 - Introduction
    00:02:09 - Transformer vs Mistral
    00:05:35 - Mistral 7B vs Mistral 8x7B
    00:08:25 - Sliding Window Attention
    00:33:44 - KV-Cache with Rolling Buffer Cache
    00:49:27 - Pre-Fill and Chunking
    00:57:00 - Sparse Mixture of Experts (SMoE)
    01:04:22 - Model Sharding
    01:06:14 - Pipeline Parallelism
    01:11:11 - xformers (block attention)
    01:24:07 - Conclusion
  • Science

Comments • 107

  • @ankush4617
    @ankush4617 5 months ago +29

    👏
    Keep up the great job, Umar!

  • @varunsaagars
    @varunsaagars 5 months ago +22

    Requesting SSS4 and Mamba explanations. Great work😊

    • @HimanshuSharma-eg5li
      @HimanshuSharma-eg5li 5 months ago

      What's SSS4?

    • @unclecode
      @unclecode 5 months ago

      @HimanshuSharma-eg5li Structured State-Space Sequence (S4) model, or Selective State Space Model; a sort of linear counterpart to the attention mechanism.

    • @umarjamilai
      @umarjamilai 4 months ago +11

      You're welcome: ruclips.net/video/8Q_tqwpTpVU/видео.html

    • @pratyushrao7979
      @pratyushrao7979 3 months ago

      @umarjamilai Bruh is too OP

  • @rahulsawant2093
    @rahulsawant2093 5 months ago +11

    I haven't seen a channel with such informative videos on Data Science. Please continue doing this.... Great thanks to you and the team.

  • @markm4642
    @markm4642 4 months ago +11

    Absolutely well written, clearly explained and very valuable content as always, Umar. Keep perfecting your craft. 100k subs by Dec 2024; you are opening lots of doors in AI education.

  • @Paluth
    @Paluth 5 days ago

    Thank you very much, your videos are excellent as always. Keep up the good work, if you have the time!

  • @jman5447
    @jman5447 28 days ago +1

    Thank you! Your clear explanation really makes my life easier!

  • @RayGuo-bo6nr
    @RayGuo-bo6nr 5 months ago +13

    Thanks! Great job! Thank you!

  • @andikunar7183
    @andikunar7183 5 months ago +6

    Amazing content, you are a great explainer/teacher, thanks a lot!!!

  • @unclecode
    @unclecode 5 months ago +5

    👏
    I support and subscribe to anyone who demystifies AI and helps democratize it. Keep up the fantastic job, Umar! Thanks!

    • @umarjamilai
      @umarjamilai 5 months ago +2

      Thank you very much for your support! I wish you, your family and loved ones a happy new year!

    • @unclecode
      @unclecode 5 months ago +1

      @umarjamilai You're welcome, I wish the same for you and your loved ones. Could you please let me know whether you have any content focused on the transformer's last step, where a linear layer picks the next token based on the output of the decoder? Basically the head MLP. Thanks again.

    • @umarjamilai
      @umarjamilai 5 months ago +1

      @unclecode If you watch my video on how to code a transformer from scratch, you will learn all about the transformer, including the normalization and the last layer. I believe the best way to learn a model is to code it from scratch and see it in action.

    • @unclecode
      @unclecode 5 months ago

      @umarjamilai Roger that

  • @manishsharma2211
    @manishsharma2211 4 months ago +2

    One heck of a video, Umar, thank you.
    PS: at 16:44, the kernel moves to the next 3x3 grid only when the stride is 1 [just FYI for anyone who might have doubts about this].

  • @michellem6685
    @michellem6685 28 days ago +1

    amazing explanation

  • @user-hd7xp1qg3j
    @user-hd7xp1qg3j 5 months ago +3

    Thanks for listening to the request made last time for MoE. You explain and elucidate the material in a very understandable way.

  • @rajgothi2633
    @rajgothi2633 2 months ago +1

    Really good explanation... Please keep uploading such content. It inspires many researchers.

  • @andikunar7183
    @andikunar7183 5 months ago +9

    Thank you!

    • @umarjamilai
      @umarjamilai 5 months ago +1

      Thank you for your support!

  • @karanjakhar
    @karanjakhar 5 months ago +3

    Great content. Well explained. Loved it. Please keep up the great job. Thanks.

  • @snowflareai
    @snowflareai 4 months ago +7

    Thanks!

    • @umarjamilai
      @umarjamilai 4 months ago

      Thank you very much for your support! Let's connect on LinkedIn

  • @jasonma3449
    @jasonma3449 2 months ago +1

    exceptionally clear illustration on the SWA concept!

  • @goelnikhils
    @goelnikhils 5 months ago

    What an explanation of Sliding Window Attention, KV Cache, Rolling Buffer Cache, and Mistral. Amazing work. Amazing content. I have been following Umar, and whatever content he creates is top notch.

  • @wilsvenleong96
    @wilsvenleong96 5 months ago +1

    Your content is god-given! I live for your content! Thank you so very much!

  • @kozer1986
    @kozer1986 5 months ago +1

    Amazing!!! Simply amazing! Haven't seen a channel with such explanation on those topics!!!

  • @rraviteja
    @rraviteja 3 months ago +1

    Super content and explanation, thanks! Please upload videos regularly.

  • @justjeremiah4255
    @justjeremiah4255 4 months ago +1

    Great video as usual, Umar! Thank you, sir.

  • @vigneshkumar1318
    @vigneshkumar1318 1 month ago +1

    Thank you! I understood a lot from this.

  • @yukewang3164
    @yukewang3164 3 months ago +1

    Great explanation, very helpful, thanks!

  • @trungquang1581
    @trungquang1581 2 months ago +1

    great job, thanks a lot man

  • @alessiocaffi5992
    @alessiocaffi5992 5 months ago +1

    Watching your vids is worth the time even for ppl not too much into AI yet. Got here from trying to understand Karpathy's vids, great job. Would be nice if someone on YT would make a vid on how to create an attoGPT/attoLM, or call it bookGPT (bookLM), from any book, e.g. DanteGPT 🙂, to train on a consumer PC without advanced GPUs.

  • @aam1819
    @aam1819 3 months ago +1

    Fantastic explanation! Thank you!

  • @gangs0846
    @gangs0846 2 months ago +1

    Helped a lot, thank you

  • @AndreasAlexandrou-to5pw
    @AndreasAlexandrou-to5pw 3 months ago +1

    Excellent as always. Thank you!

  • @anshul.singhs
    @anshul.singhs 5 months ago +3

    Thanks! Was waiting for it. Can you do Mamba and S4 next?

  • @angelinakoval8360
    @angelinakoval8360 4 months ago +1

    Thank you for the video, a lot of new information for me!

  • @hichamelkaissi7786
    @hichamelkaissi7786 4 months ago +1

    Quality content.. Thank you immensely ❤

  • @raahuldutta
    @raahuldutta 5 months ago +2

    Again another great video😊

  • @Itay12353
    @Itay12353 1 month ago +1

    You Are King!

  • @islamtorky1762
    @islamtorky1762 5 months ago +2

    Great work! Can you do a video for flash attention? Thanks!

  • @akashkumar-jg4oj
    @akashkumar-jg4oj 2 months ago +1

    This is literal gold!!!

  • @GrifinsBrother
    @GrifinsBrother 4 months ago

    Amazing job, keep going!

  • @utkarshjain3814
    @utkarshjain3814 3 months ago +1

    bro is doing god's work. Keep it up!

  • @cfalguiere
    @cfalguiere 4 months ago +1

    Thanks for sharing

  • @jatinarora6680
    @jatinarora6680 3 months ago

    Very detailed explanation! Thanks for the video. Could you also make a video on vision transformers like BEiT?

  • @Yassjams
    @Yassjams 1 month ago

    Amazing video! Can you do an explanation of the Falcon architecture? 🙏🙌

  • @baothach9259
    @baothach9259 3 months ago +1

    This video is so good!!!!

  • @lukeskywalker7029
    @lukeskywalker7029 2 months ago

    Another great one! Any chance you'll take on "The Era of 1-bit LLMs" paper next? ;)

  • @haralc
    @haralc 1 month ago +1

    Thanks

  • @random-ds
    @random-ds 3 months ago

    Thank you for this great video.
    I have a question though. When Mistral released the instruct-v2 model, did they follow the exact same architecture and change only the data and the training procedure, or could they also have tweaked the classic Mistral architecture a little? Thanks in advance!

  • @aamir122a
    @aamir122a 4 months ago

    Open-source multimodal models (MMLLMs) are also becoming mainstream; please do an episode on them as well.

  • @waynelau3256
    @waynelau3256 2 months ago

    Hey Umar, great video! I have some questions: how does SWA work during training? I am trying to wrap my head around how the previous context is fed to the window. From my understanding of the Mistral model, one of the tokens caters to the previous attentions. In this case, wouldn't this make it autoregressive and not parallelizable, because the previous attention needs to be computed?

  • @ihitsuperhuman3227
    @ihitsuperhuman3227 23 days ago +1

    thanks

  • @subhamkundu5043
    @subhamkundu5043 4 months ago

    Amazing content. Are you going to make a video on coding an MoE model from scratch?

  • @user-xg6ez8mj7i
    @user-xg6ez8mj7i 5 months ago

    Great content as always. can you do a video about ControlNet?

  • @amitshukla1495
    @amitshukla1495 5 months ago +2

    Wohooo 🥳

  • @AndreasAlexandrou-to5pw
    @AndreasAlexandrou-to5pw 3 months ago

    A question on batching: as far as I understand, batching inputs together has minimal cost at inference, i.e. 100 forward passes through all the decoder layers take roughly the same amount of time irrespective of the batch size.
    The video mentions that compute is wasted when calculating attention for the padding tokens, and thus concludes that unrolling the batch is preferable?
    I don't see how this makes sense from a performance standpoint. Compute is heavily underutilised during attention, so the "wasted attentions" do not really cost anything. On the other hand, unrolling the batch multiplies the number of forward passes by the batch size. For example, a batch of 5 inputs with a length of 100 takes 100 forward passes in the first case, but 500 passes after unrolling.
    Am I missing something here? Doesn't unrolling completely nullify the performance gain from avoiding the wasted attentions?
    Edit: Tested this:
    - Sequence length: 1024, batch size 1: takes ~38 seconds.
    - Sequence length: 1024, batch size 4: takes ~39 seconds.
    - Sequence length: 4096, batch size 1: takes ~155 seconds.
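
For reference, the video description talks about prompts "packed into a single sequence" whose attention is computed with a block-diagonal causal mask. The sketch below shows what such a mask looks like. It is plain Python for illustration only, not the xformers API, and the function name and prompt lengths are made up.

```python
def block_diagonal_causal_mask(prompt_lengths):
    """Mask for several prompts concatenated into one sequence.

    mask[i][j] is True when token i may attend to token j: both tokens must
    belong to the same prompt, and j must not come after i.
    """
    # owner[t] is the index of the prompt that token t belongs to.
    owner = [p for p, n in enumerate(prompt_lengths) for _ in range(n)]
    total = len(owner)
    return [[owner[i] == owner[j] and j <= i for j in range(total)]
            for i in range(total)]


lengths = [3, 1, 2]                      # three hypothetical prompts
mask = block_diagonal_causal_mask(lengths)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Token slots processed in one pass: padded batch vs. packed sequence.
padded_slots = len(lengths) * max(lengths)
packed_slots = sum(lengths)
print(padded_slots, packed_slots)
```

In this toy case the packed sequence holds 6 token slots instead of the 9 a padded batch would need, and all prompts still go through the model in a single pass; the mask only prevents tokens from attending across prompt boundaries.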

  • @zhenfutaofang2534
    @zhenfutaofang2534 5 months ago +1

    Amazing video!!! Keep it up!

    • @umarjamilai
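      @umarjamilai 5 months ago +1

      Thank you! I have a WeChat group in China about AI and deep learning. If you would like to chat, send me a message on LinkedIn and I will invite you to join.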

    • @zhenfutaofang2534
      @zhenfutaofang2534 4 months ago

      @umarjamilai OK

  • @siqb
    @siqb 4 months ago

    During training, or even inference, when we use "[SOS] Love that" as input, do we use the embedding of 'that' to pass to the softmax to predict 'can'?

    • @umarjamilai
      @umarjamilai 4 months ago

      Only during inference. During training you just compare the entire output with the target to calculate the loss.
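
The distinction in the reply above can be sketched in plain Python. The vocabulary, the logits, and the function names are all made up for illustration; a real model would produce the logits from its final linear layer.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and made-up logits: one row per input position.
vocab = ["[SOS]", "Love", "that", "can"]
inputs = ["[SOS]", "Love", "that"]
targets = ["Love", "that", "can"]        # the inputs shifted left by one
logits = [
    [0.1, 2.0, 0.3, 0.2],                # output at position "[SOS]"
    [0.2, 0.1, 1.5, 0.4],                # output at position "Love"
    [0.0, 0.3, 0.2, 2.5],                # output at position "that"
]


def training_loss(logits, targets):
    """Training: average cross-entropy over every position at once."""
    losses = []
    for row, target in zip(logits, targets):
        probs = softmax(row)
        losses.append(-math.log(probs[vocab.index(target)]))
    return sum(losses) / len(losses)


def next_token(logits):
    """Inference: only the last position's output picks the next token (greedy)."""
    probs = softmax(logits[-1])
    return vocab[probs.index(max(probs))]


loss = training_loss(logits, targets)
predicted = next_token(logits)
print(loss, predicted)
```

Here the training loss uses all three rows, while the greedy next-token choice only looks at the row for 'that'.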

  • @XartakoNP
    @XartakoNP 1 month ago

    Around minute 14, you explain that sliding window attention results in fewer dot products. From your explanation I gather that the sliding window mask is applied after the Q@Kt operation, where we perform all the dot products between the Q and K tensors. Is that operation fused in some way, or is there a trick to achieve the reduction in the number of dot products?
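
To make the question concrete, here is a small sketch that builds a causal sliding-window mask and counts how many query/key pairs it keeps. It assumes the window counts the current token plus the W-1 tokens before it, and it says nothing about how a particular kernel is implemented: whether the masked dot products are actually skipped depends on the attention implementation used.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def count_allowed(mask):
    return sum(sum(row) for row in mask)


seq_len, window = 8, 3                   # toy sizes chosen for illustration
mask = sliding_window_causal_mask(seq_len, window)

dense_pairs = seq_len * seq_len                  # full Q @ K^T
causal_pairs = seq_len * (seq_len + 1) // 2      # causal mask only
window_pairs = count_allowed(mask)               # causal + sliding window

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(dense_pairs, causal_pairs, window_pairs)
```

If the mask is applied to a dense Q@Kt product, all `dense_pairs` dot products are still computed; the saving only materialises when the implementation computes just the `window_pairs` entries that survive the mask.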

  • @kenilshah-hb6fy
    @kenilshah-hb6fy 1 month ago

    I have one point!
    At 5:46, a table is shown in which the 2nd row, 2nd column reads "No. of Encoder Layers".
    My question:
    If Mistral uses decoder layers, then why are we considering 32 as the number of encoder layers?

  • @elieelezra2734
    @elieelezra2734 5 months ago

    Hi Sir, great work as usual. However, I have a question regarding the gate in the 'Sparse Mixture of Experts' section. Is it a simple one-layer network that produces 8 logits? Thanks! Keep up the good work!

    • @umarjamilai
      @umarjamilai 5 months ago +3

      Yes, for every token in the sequence it produces 8 numbers. The two highest numbers indicate which FFN the token should run through.

    • @elieelezra2734
      @elieelezra2734 5 months ago +1

      @umarjamilai Correct me if I'm wrong: this means that the behavior of this kind of block is not the same during training and during inference. During training, the token embedding goes through all 8 feed-forward networks, then the outputs of the two best are selected according to the output of the gate, whereas during inference the token embedding goes only through the two best feed-forward networks according to the gate. Again, thanks a lot for your time and your explanation, I really appreciate it.

    • @tryit-wv8ui
      @tryit-wv8ui 4 months ago +1

      @elieelezra2734 @umarjamilai Yep, same question here.

    • @tryit-wv8ui
      @tryit-wv8ui 4 months ago

      @umarjamilai Is elie elezra's assertion above correct?

    • @umarjamilai
      @umarjamilai 4 months ago +1

      @tryit-wv8ui Hi! The behavior during training and inference IS EXACTLY THE SAME: what I have shown for inference is exactly what happens during training. That is how the gate function is trained to produce logits and select the best feed-forward networks for each token, and it is also the reason why some feed-forward networks will "specialize" in particular subsets of the tokens (for example, some may specialize in Japanese tokens, others in English tokens, etc.).
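
The routing described in this thread (a gate producing 8 numbers per token, with the two highest selecting the feed-forward networks to run) can be sketched as follows. This is a toy illustration in plain Python: the experts are simple scaling functions, the gate weights are hand-picked, and taking a softmax over only the two selected logits to weight the outputs is an assumption based on the Mixtral paper rather than something stated in the thread.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, gate_weights, experts, top_k=2):
    # Gate: a single linear layer producing one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the top_k experts for this token.
    chosen = sorted(range(len(experts)), key=lambda e: logits[e],
                    reverse=True)[:top_k]
    # Assumption: mixing weights are a softmax over the chosen logits only.
    weights = softmax([logits[e] for e in chosen])
    output = [0.0] * len(token)
    for weight, e in zip(weights, chosen):
        expert_out = experts[e](token)   # only the chosen experts are run
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, chosen


# Toy experts: expert e simply scales its input by (e + 1).
experts = [lambda v, s=e + 1: [s * x for x in v] for e in range(8)]

# Hand-picked gate weights so that experts 5 and 2 win for this token.
gate_weights = [[0.0] * 4 for _ in range(8)]
gate_weights[5][0] = 2.0
gate_weights[2][3] = 2.0

token = [1.0, 0.0, -1.0, 0.5]
output, chosen = moe_forward(token, gate_weights, experts)
print(chosen, output)
```

The same `moe_forward` path would be used in training and in inference, which matches the reply above: the unselected experts are never evaluated for that token in either case.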

  • @Anson-rr6ej
    @Anson-rr6ej 3 months ago

    Great videos. Are the 8 experts and the gating function different in each layer? So in total there are 8 x 32 experts, is this correct?

    • @umarjamilai
      @umarjamilai 3 months ago

      Yes, each layer has different experts: 8 per layer, so in total 8x32.

    • @Anson-rr6ej
      @Anson-rr6ej 3 months ago

      @umarjamilai Thank you!

  • @vinc6966
    @vinc6966 5 months ago

    Great video, but I have two questions about sliding window attention:
    1. How does applying a mask to tokens outside the sliding window make attention more efficient? We still have to perform calculations on an NxN matrix, just with some zeros. Are floating-point operations on zeros faster?
    2. The receptive field increases as depth increases. Consequently, in Mistral only the last layer can attend to all tokens, so tokens have less time to communicate. If we have a task that requires N steps to be solved and ALL of the information from the tokens, will the model be able to solve it?
    Thanks

    • @umarjamilai
      @umarjamilai 5 months ago +2

      Hi!
      1. When you know that the product of two matrices will contain many zeros, you can use "sparse attention", which represents matrices in a way very similar to Python dictionaries, storing only the values at the non-zero indices. Many deep learning frameworks support sparse matrix multiplication; if I remember correctly, DeepSpeed supports sparse attention calculation.
      2. It is wrong to say that the last layer will attend to all tokens. A token only attends to the W preceding tokens, where W is the size of the sliding window. But because of how information gets "accumulated" in the embedding after each layer, we can claim that information "flows" from one token to another even if they are outside the window. You're right in saying that the information carried this way is less "strong" (it's like hearing a piece of news from a friend instead of reading it yourself in the newspaper: every intermediate person will alter the real story). If a task requires the information of all the tokens, the model MAY (we can't be sure) still be able to perform it, but it all depends on how many layers you have and on the size of the sliding window.
      Have a nice day!
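
The "flow" of information across layers described in point 2 can be checked with a small sketch. It composes the sliding-window mask with itself once per layer and counts how many input positions can influence the last token. The sizes are toy values, and the window is assumed to include the current token plus the W-1 tokens before it.

```python
def window_mask(seq_len, window):
    """mask[i][j] is True when token i attends directly to token j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def compose(a, b):
    """Result[i][j] is True if i reaches some k via a, and k reaches j via b."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def receptive_field_sizes(seq_len, window, num_layers):
    mask = window_mask(seq_len, window)
    reach = mask
    sizes = []
    for _ in range(num_layers):
        # Number of input positions that can influence the last token so far.
        sizes.append(sum(reach[-1]))
        reach = compose(reach, mask)
    return sizes


sizes = receptive_field_sizes(seq_len=10, window=4, num_layers=4)
print(sizes)  # [4, 7, 10, 10]
```

With these toy values each extra layer extends the reach by W-1 = 3 positions until the whole sequence is covered, which is the indirect, layer-by-layer propagation the reply describes; within any single layer a token still attends to at most W positions.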

    • @vinc6966
      @vinc6966 5 months ago

      @umarjamilai Okay, I think that answers my questions ;) Thanks a lot!

  • @MrNathanShow
    @MrNathanShow 3 months ago

    Is the xformers part primarily used for training, or more for when we have a service and want to support the generation of outputs? Also, is each expert trained independently, or are they all trained on the same dataset? From what I understand, the MoE layer is just a feed-forward linear layer, i.e. a set of weights. I think I might be wrong though... Thank you!

    • @MrNathanShow
      @MrNathanShow 3 months ago

      Ok, so each "expert" is technically just a feedforward output that is gate controlled by a linear series of weights. The top two are selected to post process the token at the end.

    • @MrNathanShow
      @MrNathanShow 3 months ago

      The whole data set is used to train each of these experts.

  • @user-tb4sg6lo8f
    @user-tb4sg6lo8f 4 months ago

    At 9:25, why are Q and K the same matrices in the case of self-attention? There are different linear layers for mapping the input sequence to queries and keys, aren't there?

    • @umarjamilai
      @umarjamilai 4 months ago

      I recommend you watch my previous video on the Transformer, where I explain the origin of the Q, K and V matrices.

    • @siqb
      @siqb 4 months ago

      Yup. Q, K, V are 3 different projections of the input. If they were literally the same, QK^T would be a symmetric matrix.

    • @umarjamilai
      @umarjamilai 4 months ago +1

      @siqb You're right. I should have mentioned that. Because I was talking about the "tokens" they "refer to" and not the individual values they are made up of, it may have caused confusion. Thanks for pointing it out.
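
The symmetry point raised in this thread is easy to verify on a toy example. In the sketch below, the input `x` and the projection matrices `w_q` and `w_k` are small hand-picked values chosen for illustration; the helper functions are plain-Python stand-ins for matrix operations.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(n) for j in range(n))


# Three tokens with 2-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 2.0], [0.0, 1.0]]           # hypothetical query projection
w_k = [[1.0, 0.0], [3.0, 1.0]]           # hypothetical key projection

# If Q and K were literally the same matrix, the scores would be symmetric.
same_scores = matmul(x, transpose(x))

# With different projections, Q @ K^T is in general not symmetric.
q = matmul(x, w_q)
k = matmul(x, w_k)
projected_scores = matmul(q, transpose(k))

print(is_symmetric(same_scores), is_symmetric(projected_scores))
```

Both score matrices have one row and one column per token, which is why it is still natural to label their axes with the same tokens even though the underlying Q and K values differ.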

  • @sahilc7750
    @sahilc7750 2 months ago

    Is there a way to learn about the different boilerplate code that xformers provides and how it operates? Their GitHub is not very intuitive.

  • @madhusudhanreddy9157
    @madhusudhanreddy9157 5 months ago +1

    Please create a video on implementing RAG with an LLM and a vector database, sir.

  • @pratyushrao7979
    @pratyushrao7979 3 months ago

    I had a query regarding the rolling buffer cache. Why did they not use a queue for storing the vectors instead of a rolling buffer cache? I know there's an issue with the implementation of a queue, but wouldn't it be far less complex time-wise? Instead of O(n) you can roll back in O(1).

    • @umarjamilai
      @umarjamilai 3 months ago +1

      You can implement it however you like, but you should always avoid shrinking and growing tensors, because it may move data around the GPU memory, which is slow.
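
A minimal sketch of the idea in the reply above: a cache whose storage is allocated once and overwritten in place, so nothing grows or shrinks. This is plain Python with a list standing in for a preallocated tensor; the class and method names are hypothetical and are not taken from the Mistral code.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for position i is stored in slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window     # allocated once, never resized
        self.next_pos = 0

    def append(self, entry):
        # Overwrite the oldest slot in place: an O(1) write, no reallocation.
        self.slots[self.next_pos % self.window] = entry
        self.next_pos += 1

    def in_order(self):
        """Return the cached entries from oldest to newest."""
        if self.next_pos <= self.window:
            return self.slots[:self.next_pos]
        start = self.next_pos % self.window
        return self.slots[start:] + self.slots[:start]


cache = RollingBufferCache(window=4)
for token in ["the", "cat", "is", "on", "a", "chair"]:
    cache.append(token)                  # strings stand in for K/V vectors

print(cache.slots)       # ['a', 'chair', 'is', 'on']
print(cache.in_order())  # ['is', 'on', 'a', 'chair']
```

Writes are constant-time, as with a queue, but the storage never changes size; the only extra work is reordering the slots when the entries are needed from oldest to newest.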

    • @pratyushrao7979
      @pratyushrao7979 3 months ago

      @umarjamilai Okay, thank you. Your explanation was great!

  • @GrifinsBrother
    @GrifinsBrother 4 months ago

    But your explanation about the specialisation of experts is wrong. The paper states that the knowledge of each expert is distributed equally and there is no specialisation. Check the "Routing analysis" section of the paper.

    • @umarjamilai
      @umarjamilai 4 months ago

      The paper on the actual performance of the mixture of experts came out AFTER I published my video. What I was talking about is not what actually happens (I didn't have data on the actual behavior back then), but the intuition behind creating a mixture of experts: the idea is that each model, hopefully, specializes in a subset of the data. It may also happen that the models do NOT specialize, as in the case of Mixtral. I believe the authors of Mixtral also hoped for some kind of specialization, but in reality it didn't happen.

  • @user-ri9xz1dc6l
    @user-ri9xz1dc6l 3 months ago +1

    amazing

  • @farzinhaddadpour7192
    @farzinhaddadpour7192 4 months ago +2

    Thanks!

    • @umarjamilai
      @umarjamilai 4 months ago

      Thank you very very very much!

  • @avogadroarts4366
    @avogadroarts4366 4 months ago

    Thanks

  • @reginoldlu
    @reginoldlu 4 months ago +2

    Thanks!

    • @reginoldlu
      @reginoldlu 4 months ago

      Thank you! Requesting FlashAttention and FlashAttention-2! Keep it up!! 😀

    • @reginoldlu
      @reginoldlu 4 months ago

      I just connected with you on LinkedIn

  • @ml.9106
    @ml.9106 2 months ago

    Thanks!

  • @mihirrege206
    @mihirrege206 3 months ago

    Thanks!