LLM inference optimization: Architecture, KV cache and Flash attention

  • Published: 21 Nov 2024
  • Science

Comments • 5

  • @cliffordino
    @cliffordino A month ago +1

    Nicely done and very helpful! Thank you!! FYI, the stress is on the first syllable of "INference", not the second ("inFERence").

    • @yanaitalk
      @yanaitalk  A month ago

      Copy that! Thank you😊

  • @johndong4754
    @johndong4754 2 months ago

    I've been learning about LLMs over the past few months, but I haven't gone into too much depth. Your videos seem very detailed and technical. Which one(s) would you recommend starting off with?

    • @yanaitalk
      @yanaitalk  2 months ago

      There are excellent courses from DeepLearning.ai on Coursera. To go even deeper, I recommend reading the technical papers directly, which gives you a deeper understanding.

    • @HeywardLiu
      @HeywardLiu A month ago

      1. Roofline model
      2. Transformer arch. > bottleneck of attention > flash attention
      3. LLM inference can be divided into a prefill stage (compute-bound) and a decode stage (memory-bound); see the sketch after this comment
      4. LLM serving: paged attention, radix attention
      If you want to optimize inference performance, this review paper is awesome: LLM Inference Unveiled: Survey and Roofline Model Insights
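
To make point 3 of the comment above concrete, here is a back-of-the-envelope roofline estimate of why prefill tends to be compute-bound while decode is memory-bound. This Python sketch is illustrative only; the hidden size, prompt length, fp16 byte counts, and the ridge-point figure in the final comment are assumed values, not numbers from the video.

```python
# Illustrative roofline sketch (assumed numbers, not from the video):
# why prefill tends to be compute-bound and decode memory-bound.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Hypothetical 7B-class projection layer: a (d x d) weight matrix in fp16.
d = 4096
bytes_per_value = 2
weight_bytes = d * d * bytes_per_value

# Prefill: one matmul over the whole prompt of s tokens (a GEMM).
s = 2048
prefill_flops = 2 * s * d * d                               # 2*m*n*k
prefill_bytes = weight_bytes + 2 * s * d * bytes_per_value  # weights + in/out activations

# Decode: each step processes a single new token with the same weights (a GEMV).
decode_flops = 2 * 1 * d * d
decode_bytes = weight_bytes + 2 * 1 * d * bytes_per_value

print(f"prefill: ~{arithmetic_intensity(prefill_flops, prefill_bytes):.0f} FLOPs/byte")
print(f"decode:  ~{arithmetic_intensity(decode_flops, decode_bytes):.2f} FLOPs/byte")

# A modern data-center GPU has a ridge point on the order of 10^2 FLOPs/byte
# (assumed figure), so prefill lands in the compute-bound region while decode
# sits far below the ridge, i.e. it is limited by memory bandwidth.
```

With these assumed numbers, prefill comes out around a thousand FLOPs per byte while decode is near one FLOP per byte, far below a typical GPU ridge point. That gap is why decode throughput is bounded by memory bandwidth and why KV caching, batching, and serving techniques such as paged attention target that stage.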