Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

  • Published: 31 Jul 2024
  • Full coding of LLaMA 2 from scratch, with a full explanation, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, KV Cache, Grouped Query Attention (GQA), the SwiGLU activation function, and more!
    I explain the most commonly used inference methods: Greedy, Beam Search, Temperature Scaling, Random Sampling, Top K, and Top P (see the Top-P sketch after the chapter list below).
    I also explain the math behind the Rotary Positional Embedding, with step-by-step proofs.
    Repository with PDF slides: github.com/hkproj/pytorch-llama
    Download the weights from: github.com/facebookresearch/l...
    Prerequisites:
    1) Transformer explained: • Attention is all you n...
    2) LLaMA explained: • LLaMA explained: KV-Ca...
    Chapters
    00:00:00 - Introduction
    00:01:20 - LLaMA Architecture
    00:03:14 - Embeddings
    00:05:22 - Coding the Transformer
    00:19:55 - Rotary Positional Embedding
    01:03:50 - RMS Normalization
    01:11:13 - Encoder Layer
    01:16:50 - Self Attention with KV Cache
    01:29:12 - Grouped Query Attention
    01:34:14 - Coding the Self Attention
    02:01:40 - Feed Forward Layer with SwiGLU
    02:08:50 - Model weights loading
    02:21:26 - Inference strategies
    02:25:15 - Greedy Strategy
    02:27:28 - Beam Search
    02:31:13 - Temperature
    02:32:52 - Random Sampling
    02:34:27 - Top K
    02:37:03 - Top P
    02:38:59 - Coding the Inference
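    As referenced above, here is a minimal Top-P (nucleus) sampling sketch in PyTorch. It follows the approach described in the video, but the function name and shapes are illustrative, not necessarily the repository's exact code:

        import torch

        def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
            # probs: (batch, vocab_size) softmax probabilities for the next token.
            probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
            cumulative = torch.cumsum(probs_sort, dim=-1)
            # Zero out tokens whose cumulative probability, excluding themselves,
            # already exceeds p (the top-1 token is always kept).
            mask = cumulative - probs_sort > p
            probs_sort[mask] = 0.0
            probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))  # renormalize
            next_token = torch.multinomial(probs_sort, num_samples=1)
            return torch.gather(probs_idx, -1, next_token)  # map back to vocab ids

    For example, with p = 0.9 only the smallest set of tokens whose total probability mass reaches 90% remains eligible for sampling.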
  • Science

Comments • 81

  • @imbingle
    @imbingle 23 days ago +1

    Would love to see lighter-weight LLMs trained on custom datasets. Thanks for the video! This channel is a gold mine.

  • @ravimandliya1881
    @ravimandliya1881 11 months ago +4

    Very excited for this!!! Weekend is going to be fun!

  • @marshallmcluhan33
    @marshallmcluhan33 10 months ago +5

    Thanks for explaining all of these concepts. Keep up the good work 😎

  • @dongdongqiaqia
    @dongdongqiaqia 11 months ago +4

    Marked for my next watch. Thanks for producing such a high-quality video for the series. Hope you have fun in China.

  • @user-jf6li8mn3l
    @user-jf6li8mn3l 6 months ago +1

    Thank you for such a detailed analysis of the architecture and implementation features of the model! You are very good at presenting information!

  • @RaghavendraK458
    @RaghavendraK458 6 months ago +3

    Very good video. You have a knack for conveying complex content in an understandable format. Thank you, and keep up the great work!

  • @gabchen
    @gabchen 11 months ago +6

    Haven't watched the full video yet, but thanks for the promising content. Please keep it going.
    Would like to see more of the environment setup and the debugging process.

  • @mazenyasser8299
    @mazenyasser8299 6 months ago +2

    You are a hidden gem, great explanation with theoretical and technical concepts.

  • @TheMzbac
    @TheMzbac 7 months ago +3

    Highly recommended for anyone who wants to understand open-source LLMs inside and out.

  • @sounishnath513
    @sounishnath513 11 months ago +9

    No comments... Need to learn many things... Thank you very much for creating such interesting and helpful content...
    I am fortunate that I found your channel.

  • @yonistoller1
    @yonistoller1 8 months ago +3

    Thank you so much for sharing this, it was really well done!

  • @tljstewart
    @tljstewart 11 months ago +3

    Great content as usual! Thanks

  • @wilfredomartel7781
    @wilfredomartel7781 8 months ago

    Amazing work, Umar.

  • @GrifinsBrother
    @GrifinsBrother 6 months ago +1

    Incredible explanation!

  • @jiaxingyu8300
    @jiaxingyu8300 10 months ago +1

    Thank you so much for sharing!

  • @oiooio7879
    @oiooio7879 11 months ago +3

    Great video, very educational.

  • @n.8642
    @n.8642 1 month ago +1

    Thanks! I learned a lot from your excellent video.

  • @modaya3382
    @modaya3382 8 months ago +2

    Thank you very much for your efforts

  • @renanangelodossantos4726
    @renanangelodossantos4726 4 months ago

    EXCELLENT! I would like to see the same series with LLaVA.

  • @马国鑫
    @马国鑫 3 days ago +1

    Thanks! I learned a lot from your excellent video.

  • @saima6759
    @saima6759 2 months ago +1

    This is hardcore machine learning engineering!

  • @hussainshaik4390
    @hussainshaik4390 10 months ago +2

    Great content!

  • @jasonzhai2584
    @jasonzhai2584 3 months ago

    Thanks for the amazing tutorial! As a student I found it very clear to follow. Just a minor issue to point out: during the illustration of RoPE, the efficient computing equation for complex frequencies (Eq. 34 in the original paper) should have the third matrix's last two terms as $-x_{d}$ and then $x_{d-1}$. The video shows $-x_{d-1}$ and then $x_{d}$, probably just a typo reversing the order of the subscripts.
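    For reference, the corrected form of Eq. 34 (the element-wise realization of RoPE from the RoFormer paper), with the last two entries of the third vector being $-x_d$ and then $x_{d-1}$, in LaTeX:

        R^{d}_{\Theta,m}\,\mathbf{x} =
        \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix}
        \otimes
        \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix}
        +
        \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix}
        \otimes
        \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}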

  • @pi5549
    @pi5549 11 months ago +14

    Might you consider creating a Discord guild? I'd love to hang with the people that are watching these videos!

    • @umarjamilai
      @umarjamilai  11 months ago +6

      Hi! I am considering it; I'll let you know with a public post when it's online 🤖🦾

    • @FireFly969
      @FireFly969 3 months ago

      Yep, such great people

    • @Umar-Ateeq
      @Umar-Ateeq 1 month ago

      Great idea man!!

  • @user-yf7qv8zj6y
    @user-yf7qv8zj6y 11 months ago +3

    This is the way!

  • @atanuchowdhury6582
    @atanuchowdhury6582 8 months ago +1

    Awesome work, boss!

  • @RayGuo-bo6nr
    @RayGuo-bo6nr 8 months ago +2

    Thanks! 谢谢你! (Thank you!)

  • @tarequeovi4051
    @tarequeovi4051 11 months ago +2

    Great Content

  • @sharjeel_mazhar
    @sharjeel_mazhar 1 month ago

    Umar bhai, your tutorials on transformer architectures and open-source LLMs are truly remarkable. As a Pakistani, seeing your expertise in deep learning is incredibly inspiring. Have you ever considered creating Urdu versions of your content? It could make your valuable knowledge more accessible to a wider audience. Your contributions are invaluable to the global tech community. Keep up the fantastic work! Huge fan of your work. May ALLAH bless you with health and success!

  • @stsouko
    @stsouko 11 months ago +2

    Wow. Now I've got this trick.

  • @Patrick-wn6uj
    @Patrick-wn6uj 3 months ago +2

    55:44 "I could have also written the code and not tell you and not tell you anything but I like to give proof to what i do " Wow thank you for going that extra mile we really appreciate it.

  • @SumanGameDev
    @SumanGameDev 4 months ago +1

    Oh boy, this is an amazing video!

  • @ehsanzain5999
    @ehsanzain5999 10 months ago

    Thank you very much, Umar, for the effort here. One question: will there be PPO and fine-tuning on top of this in future videos?

  • @user-vh5ni1gs3w
    @user-vh5ni1gs3w 5 months ago +1

    Great video ❤

  • @justcars2454
    @justcars2454 3 months ago +2

    It's an honor to be among the 23,500 viewers who watched this video. Thank you so much, Umar Jamil, for your content!

  • @umarjamilai
    @umarjamilai  11 months ago +5

    As always the PDF slides and the source code are available on GitHub: github.com/hkproj/pytorch-llama/
    Prerequisites:
    1) Transformer explained: ruclips.net/video/bCz4OMemCcA/видео.html
    2) LLaMA explained: ruclips.net/video/Mn_9W1nCFLo/видео.html

  • @mohammadyahya78
    @mohammadyahya78 1 month ago +1

    Amazing!

  • @zz79ya
    @zz79ya 2 months ago

    Thanks for your lecture. I have a question: what happens if start_pos exceeds the KV cache size? If this code does not handle that situation, what kind of additional modification would we need?

  • @shamimibneshahid706
    @shamimibneshahid706 6 months ago

    Hi, I want to fine-tune the model. In that case, will I need to get rid of the KV caching?

  • @adatalearner8683
    @adatalearner8683 3 months ago

    Why is the context window size limited? Is it because these models are based on Transformers, and for a given Transformer architecture, long-distance semantic relationship detection is bounded by the number of words / context length?

  • @zhenfutaofang2534
    @zhenfutaofang2534 8 months ago

    Does anyone know how to run the code on a CUDA 4090 GPU? I ran into an out-of-memory error.

  • @jensenlwt
    @jensenlwt 1 month ago

    Can somebody help explain why, when calculating theta, we are not including the -2, e.g., theta = theta ** (-2 * theta_numerator / head_dim)?
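    (A likely answer, sketched under the assumption that the code builds the frequencies with torch.arange(0, head_dim, 2) as shown in the video: the factor of 2 is already baked into theta_numerator, and taking the reciprocal supplies the minus sign.)

        import torch

        head_dim = 128
        theta_numerator = torch.arange(0, head_dim, 2).float()      # equals 2i for i = 0 .. head_dim/2 - 1
        theta_a = 1.0 / (10000.0 ** (theta_numerator / head_dim))   # as written in the video
        theta_b = 10000.0 ** (-theta_numerator / head_dim)          # with the minus sign explicit
        assert torch.allclose(theta_a, theta_b)                     # identical by the laws of exponents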

  • @PaoloTshiyole
    @PaoloTshiyole 6 months ago

    Great video!
    So what about the dataset used in this video?

  • @hautran-uc8gz
    @hautran-uc8gz 4 months ago

    Thank you!

  • @skanderbegvictor6487
    @skanderbegvictor6487 6 months ago

    I tried loading the model on an M1 Mac with 8 GB of RAM, but it seems to require more memory (I am guessing 28 GB).

  • @IRFANSAMS
    @IRFANSAMS 5 months ago

    Can I use the open-source LLaMA 2 model indefinitely, or can I code along with you and use the model?

  • @adatalearner8683
    @adatalearner8683 3 months ago

    Let's say an LLM application has a context window of 4,000 words and also supports chat history. So a user can effectively send more than the allowed number of words across prompts and still get answers related to the earlier conversation? How does this work?

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 7 months ago +1

    Thanks!

    • @umarjamilai
      @umarjamilai  7 months ago

      Thank you Diego for your support!

  • @edoziemenyinnaya7637
    @edoziemenyinnaya7637 9 months ago

    Please can we get the training code too?

  • @hussainshaik4390
    @hussainshaik4390 10 months ago +4

    Thanks

    • @umarjamilai
      @umarjamilai  10 months ago +2

      Thank you for your support!

  • @wilfredomartel7781
    @wilfredomartel7781 8 months ago

    🎉🎉

  • @tharunbhaskar6795
    @tharunbhaskar6795 4 months ago

    What are the system requirements to run inference for this model? By the way, it's a great video.

  • @feixyzliu5432
    @feixyzliu5432 6 months ago

    Wouldn't it be 'cur_pos - 1' for the start_pos argument (line 81 in inference.py, 2:45:58)?

  • @edoziemenyinnaya7637
    @edoziemenyinnaya7637 9 months ago +1

    Do you have a Discord channel?

  • @user-xt7bu8sz7j
    @user-xt7bu8sz7j 5 months ago +1

    Watching it again.

  • @Rookie_AI
    @Rookie_AI 7 months ago

    Where do you apply the causal mask?

    • @Rookie_AI
      @Rookie_AI 7 months ago

      And the sliding window attention. Thank you!

    • @feixyzliu5432
      @feixyzliu5432 6 months ago

      A causal mask is not needed, since the KV cache is used.
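      (A minimal sketch of why, assuming one query token per decoding step; the shapes are illustrative, not the repository's exact code:)

          import math
          import torch

          batch, n_heads, head_dim, cached_len = 1, 8, 64, 10
          q = torch.randn(batch, n_heads, 1, head_dim)               # the single new token
          keys = torch.randn(batch, n_heads, cached_len, head_dim)   # cached past tokens only

          scores = torch.matmul(q, keys.transpose(-2, -1)) / math.sqrt(head_dim)
          attn = torch.softmax(scores, dim=-1)  # (.., 1, cached_len): every key is a past position, so there is no future to mask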

  • @mathlife5495
    @mathlife5495 10 months ago +1

    A suggestion for all your videos: increase the font size or the zoom level. They are somewhat hard to read.

    • @umarjamilai
      @umarjamilai  10 months ago +1

      Thanks for your feedback! I'll keep that in mind 🤗

  • @coolguy69235
    @coolguy69235 8 months ago

    Is LLaMA 2 an encoder-only or a decoder-only model?

    • @umarjamilai
      @umarjamilai  8 months ago

      People call it "Decoder-Only", because it resembles the Decoder of the Transformer, but it lacks the Cross Attention. Technically it's the Encoder of the Transformer plus a Linear Layer and Softmax. But commonly, people call LLaMA a "decoder only" and BERT a "Encoder only" model.

    • @coolguy69235
      @coolguy69235 8 months ago

      @@umarjamilai Thanks a lot for your prompt reply. And amazing video

  • @spencerfunk6697
    @spencerfunk6697 1 month ago

    Please do Mistral!

  • @user-yf5wy7qk9r
    @user-yf5wy7qk9r 8 months ago +1

    We need one more video explaining how to download the weights and run inference, because it is not clear.

    • @umarjamilai
      @umarjamilai  8 months ago

      Hi! To download the LLaMA weights, you need to request access to it by using the following link: ai.meta.com/resources/models-and-libraries/llama-downloads/
      Meta will send you an email with the details on how to download the model.

  • @wd25548
    @wd25548 3 months ago

    Great video! One question though: in ruclips.net/video/oM4VmoabDAI/видео.htmlsi=TBFoV5Kj0lnbNaee&t=4272, why do we have to have a "gamma" of all ones? I compared the code with and without self.weight, and the outputs are the same.

    • @wd25548
      @wd25548 3 months ago

      Oh, forgive the dumb question. For anyone else wondering: self.weight is learnable.
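      (To make that concrete, a minimal RMSNorm sketch, assumed to match the video's approach: gamma starts as all ones, which is why the outputs initially match, but it is a learnable parameter.)

          import torch
          import torch.nn as nn

          class RMSNorm(nn.Module):
              def __init__(self, dim: int, eps: float = 1e-6):
                  super().__init__()
                  self.eps = eps
                  self.weight = nn.Parameter(torch.ones(dim))  # gamma: initialized to 1, updated during training

              def forward(self, x: torch.Tensor) -> torch.Tensor:
                  # Normalize by the root mean square over the feature dimension, then scale.
                  rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
                  return self.weight * (x * rms)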

  • @feixyzliu5432
    @feixyzliu5432 6 months ago

    Thank you for the wonderful lecture. I'm wondering why you use torch.matmul / transpose in the video but torch.einsum in the slides. They are mathematically equivalent, but what about their efficiency? Which one runs faster?
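    (A small equivalence check with illustrative shapes; in practice both formulations usually dispatch to the same batched-matmul kernels, so their speed is typically comparable:)

        import torch

        q = torch.randn(2, 8, 16, 64)  # (batch, heads, seq, head_dim)
        k = torch.randn(2, 8, 16, 64)

        s1 = torch.matmul(q, k.transpose(-2, -1))
        s2 = torch.einsum("bhqd,bhkd->bhqk", q, k)
        assert torch.allclose(s1, s2, atol=1e-5)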

  • @forresthu6204
    @forresthu6204 6 months ago +1

    Thanks!