E07 | Fast LLM Serving with vLLM and PagedAttention

  • Published: 5 Oct 2024
  • Fast LLM Serving with vLLM and PagedAttention (SOSP'23)
    Abstract: LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention achieves up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo for the past 5 months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plans.
    Bio: Zhuohan Li is a CS PhD student at UC Berkeley, where he is advised by Professor Ion Stoica. He is interested in designing and building efficient machine-learning systems. Recently, he has been focusing on the training and serving of large models, specifically LLMs. His works include Alpa, AlpaServe, Vicuna, and vLLM (PagedAttention). He completed his BS at Peking University and has interned at Microsoft Research, Anyscale, and Google Brain.
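
    Sketch: The block-based KV-cache management that the abstract attributes to PagedAttention can be illustrated with a toy example: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, much like pages in virtual memory. This is a simplified illustration only, not vLLM's actual implementation; BlockAllocator, Sequence, and the free-list policy below are assumptions made for exposition.

        from dataclasses import dataclass, field

        BLOCK_SIZE = 16  # tokens per KV block; one of the sizes discussed in the comments below

        class BlockAllocator:
            """Free-list allocator over a fixed pool of physical KV-cache blocks."""
            def __init__(self, num_blocks: int):
                self.free_blocks = list(range(num_blocks))

            def allocate(self) -> int:
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted; preempt or swap a sequence")
                return self.free_blocks.pop()

            def free(self, block_id: int) -> None:
                self.free_blocks.append(block_id)

        @dataclass
        class Sequence:
            """One request: a logical token stream plus its block table."""
            block_table: list = field(default_factory=list)  # logical block -> physical block
            num_tokens: int = 0

            def append_token(self, allocator: BlockAllocator) -> None:
                # A new physical block is allocated only when the last one fills up,
                # so waste is bounded by one partially filled block per sequence.
                if self.num_tokens % BLOCK_SIZE == 0:
                    self.block_table.append(allocator.allocate())
                self.num_tokens += 1

        allocator = BlockAllocator(num_blocks=8)
        seq = Sequence()
        for _ in range(20):          # decode 20 tokens
            seq.append_token(allocator)
        print(seq.block_table)       # 20 tokens are backed by just 2 physical blocks

    Because blocks are allocated on demand and freed when a request finishes, memory that a contiguous pre-allocated KV cache would reserve for the maximum sequence length can instead be shared across many concurrent requests, which is where the throughput gain over naive serving comes from.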

Comments • 13

  • @pingkeeng7305 10 months ago +1

    Thank you for sharing!!👍

  • @ethanhe42 11 months ago +1

    thanks for sharing!

  • @aron8500 10 months ago +1

    Is there a way to get the PowerPoint slides?

  • @ginsongsong 11 months ago

    Thanks for sharing; it was educational for me.
    One question: is the block size (16/32) related to the warp size (half-warp/warp)? I'm wondering about the reasoning behind how the block size for the KV cache was chosen.

    • @stevenshi8687 11 months ago

      As I understand it, the block size is not related to the warp size (which depends on the compute unit). The block size was determined experimentally, based on the trade-off between cache locality (which favors larger blocks) and internal fragmentation (which larger blocks make worse). Feel free to correct me if I am wrong!
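
      For intuition, the internal-fragmentation side of that trade-off is easy to quantify: a sequence wastes at most block_size - 1 KV slots in its last, partially filled block, so larger blocks waste more memory per sequence even as they improve locality. A back-of-the-envelope sketch in Python (illustrative numbers only, not from the talk):

          # Expected KV slots wasted per sequence in its last, partially
          # filled block, assuming sequence lengths fall uniformly within
          # a block: roughly (block_size - 1) / 2 on average.
          for block_size in (1, 8, 16, 32, 256):
              avg_waste = (block_size - 1) / 2
              worst_waste = block_size - 1
              print(f"block_size={block_size:>3}: "
                    f"avg~{avg_waste:6.1f}, worst={worst_waste:3d} wasted slots/seq")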

  • @shabdanbatyrkulov2791 7 months ago

    Thanks for sharing!
    Is it possible to turn on automatic subtitles (with translation)?

    • @MLSysSingapore 5 months ago

      Thank you for the suggestion! We wanted to, but RUclips is not giving us the option😭 Sorry for the inconvenience!

  • @chenghao0825 11 months ago

    Is there an implementation that works with Azure?

  • @maciejgawinecki1270 1 year ago +1

    Is there an English-language version?

    • @MLSysSingapore 11 months ago +1

      Hi! Sorry, we only have a Chinese version, and RUclips currently does not support auto-generating subtitles for Chinese. We will take this into consideration and upload English-language videos in the near future!

    • @njulijianguo 9 months ago

      Maybe I can translate it for you?

    • @MLSysSingapore 8 months ago

      @njulijianguo Thanks for volunteering!