LLM inference optimization: Architecture, KV cache and Flash attention

  • Published: Dec 22, 2024

Comments • 7

  • @cliffordino
    @cliffordino 2 months ago +1

    Nicely done and very helpful! Thank you!! FYI, the stress is on the first syllable of "INference", not the second ("inFERence").

    • @yanaitalk
      @yanaitalk  2 months ago

      Copy that! Thank you😊

  • @gilr8723
    @gilr8723 26 days ago +1

    Thank you! It was very informative and well explained. Is it possible to access the PDF slides you presented?

    • @yanaitalk
      @yanaitalk  20 days ago +1

      Sure. The majority of the slides are taken from the AWS tutorial: drive.google.com/file/d/1uVhHtRBwXy7o8ejaS6Ab6pSybkzticE3/view

  • @johndong4754
    @johndong4754 3 months ago

    I've been learning about LLMs over the past few months, but I haven't gone into too much depth. Your videos seem very detailed and technical. Which one(s) would you recommend starting off with?

    • @yanaitalk
      @yanaitalk  3 months ago

      There are excellent courses from DeepLearning.ai on Coursera. To go even deeper, I recommend reading the technical papers directly, which gives you a deeper understanding.

    • @HeywardLiu
      @HeywardLiu 2 months ago +2

      1. Roofline model
      2. Transformer arch. > bottleneck of attention > flash attention
      3. LLM inference can be divided into a prefill stage (compute-bound) and a decode stage (memory-bound); see the sketch after this list
      4. LLM serving: paged attention, radix attention
      If you want to optimize inference performance, this review paper is awesome: LLM Inference Unveiled: Survey and Roofline Model Insights
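
To make point 3 of the comment above concrete, here is a minimal back-of-the-envelope roofline sketch in Python (not from the video; it assumes a hypothetical 7B-parameter fp16 model, A100-class peak numbers, and the rough 2 * params * tokens FLOP rule of thumb, ignoring attention FLOPs and KV-cache traffic). It shows why prefill lands above the ridge point (compute-bound) while single-token decode sits far below it (memory-bound).

    # Minimal roofline sketch: arithmetic intensity of prefill vs. decode.
    # Assumptions (hypothetical): 7B-parameter fp16 model, A100-class peaks,
    # ~2 FLOPs per parameter per token, weights streamed from HBM once per forward.

    PEAK_FLOPS = 312e12           # A100 fp16 tensor-core peak, FLOP/s (assumed)
    PEAK_BW = 2.0e12              # A100 HBM bandwidth, ~2 TB/s (assumed)
    RIDGE = PEAK_FLOPS / PEAK_BW  # FLOP/byte needed to become compute-bound

    N_PARAMS = 7e9                # hypothetical model size
    BYTES_PER_PARAM = 2           # fp16 weights

    def arithmetic_intensity(tokens_per_forward: int) -> float:
        """FLOPs per byte of weight traffic for one forward pass."""
        flops = 2 * N_PARAMS * tokens_per_forward  # ~2 FLOPs per param per token
        bytes_moved = N_PARAMS * BYTES_PER_PARAM   # weights read once per forward
        return flops / bytes_moved

    for name, tokens in [("prefill (512-token prompt)", 512), ("decode (1 token/step)", 1)]:
        ai = arithmetic_intensity(tokens)
        regime = "compute-bound" if ai > RIDGE else "memory-bound"
        print(f"{name}: {ai:.0f} FLOP/byte vs ridge {RIDGE:.0f} FLOP/byte -> {regime}")

Batching multiple decode requests raises the effective tokens per forward pass, which is one reason serving-side techniques such as paged attention (point 4 above) focus on keeping batches large.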