Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)

  • Published: 29 Feb 2024
  • Recording of a presentation I delivered on 28 February 2024 for the Winter 2024 course CS 886: Recent Advances on Foundation Models at the University of Waterloo. We delve into novel techniques and recent research aimed at significantly enhancing the efficiency and scalability of Large Language Model (LLM) inference.
    This lecture covers the following topics:
    - Efficient Memory Management for Large Language Model Serving with PagedAttention
    - Flash-Decoding for long-context inference
    - Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding
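    A minimal sketch of the PagedAttention idea behind the first topic: vLLM stores the KV cache in fixed-size physical blocks and keeps a per-sequence block table mapping logical token positions to those blocks, so cache memory is reserved on demand rather than pre-allocated for the maximum sequence length. The Python below is an illustrative toy of that bookkeeping only; the names (BlockAllocator, Sequence, BLOCK_SIZE) are invented for this sketch and are not vLLM's actual API.

      # Toy sketch of paged KV-cache bookkeeping (not vLLM code).
      BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is 16)

      class BlockAllocator:
          """Hands out free physical block indices from a fixed pool."""
          def __init__(self, num_blocks: int):
              self.free_blocks = list(range(num_blocks))

          def allocate(self) -> int:
              if not self.free_blocks:
                  raise RuntimeError("KV cache is full; request must wait or be preempted")
              return self.free_blocks.pop()

          def free(self, block: int) -> None:
              self.free_blocks.append(block)

      class Sequence:
          """Tracks one request's block table and logical token count."""
          def __init__(self):
              self.block_table: list[int] = []  # logical block index -> physical block index
              self.num_tokens = 0

          def append_token(self, allocator: BlockAllocator) -> tuple[int, int]:
              """Reserve a KV slot for one new token; return (physical_block, slot_offset)."""
              if self.num_tokens % BLOCK_SIZE == 0:
                  # Current block is full (or no block exists yet): grab a new physical block.
                  self.block_table.append(allocator.allocate())
              block = self.block_table[-1]
              offset = self.num_tokens % BLOCK_SIZE
              self.num_tokens += 1
              return block, offset

      if __name__ == "__main__":
          allocator = BlockAllocator(num_blocks=8)
          seq = Sequence()
          for t in range(20):
              block, offset = seq.append_token(allocator)
              print(f"token {t:2d} -> physical block {block}, slot {offset}")
          # 20 tokens occupy only 2 blocks; nothing is reserved up front for the
          # maximum context length, which is the memory saving PagedAttention targets.
          print("block table:", seq.block_table)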
  • Science

Comments • 5

  • @noblesmathews • 1 month ago

    If you are interested in this area and would like to explore the other topics we discussed in the course, please check out the references and the other videos made by my classmates, linked at cs.uwaterloo.ca/~wenhuche/teaching/cs886/

  • @thepresistence5935 • 2 months ago

    Can you share the previous lecture? It would be useful to watch.

    • @noblesmathews • 1 month ago

      Hi! The previous lecture was given by my classmate; you can find it at ruclips.net/video/RfD5tPoMnZY/видео.html

  • @SpartanPanda • 2 months ago

    Not able to find part 1 of this

    • @noblesmathews • 1 month ago

      Hi! The previous lecture was given by my classmate; you can find it at ruclips.net/video/RfD5tPoMnZY/видео.html