Efficient LLM Inference (vLLM KV Cache, Flash Decoding & Lookahead Decoding)
- Published: 29 Feb 2024
- Recording of a presentation I delivered on 28 February 2024 for the Winter 2024 course CS 886: Recent Advances on Foundation Models at the University of Waterloo. In this lecture, we delve into novel techniques and recent research aimed at significantly improving the efficiency and scalability of Large Language Model (LLM) inference.
This lecture covers the following topics:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Flash-Decoding for long-context inference
- Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding
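To give a flavour of the first topic above: PagedAttention stores the KV cache in fixed-size blocks and gives each sequence a block table mapping logical token positions to physical blocks, much like virtual memory paging. Here is a minimal sketch of that bookkeeping; all names and sizes are illustrative and are not vLLM's actual API.

```python
BLOCK_SIZE = 4  # tokens per KV block (illustrative; vLLM uses e.g. 16)

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> None:
        """Reserve KV-cache space for one new token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:
            # current block is full (or the sequence is new): grab a fresh one
            table.append(self.free_blocks.pop())

    def physical_slot(self, seq_id: int, token_pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block id, offset)."""
        block = self.block_tables[seq_id][token_pos // BLOCK_SIZE]
        return block, token_pos % BLOCK_SIZE

# Decode 6 tokens for one sequence: only two 4-token blocks are needed,
# so memory is allocated on demand instead of reserved for the max length.
cache = PagedKVCache(num_blocks=8)
for pos in range(6):
    cache.append_token(seq_id=0, num_tokens_so_far=pos)
print(cache.block_tables[0])      # two physical block ids
print(cache.physical_slot(0, 5))  # second block, offset 1
```

The point of the indirection is that blocks need not be contiguous, so fragmentation is bounded by one block per sequence and blocks can be shared or freed independently; see the PagedAttention paper for the real design.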
If you are interested in this area and would like to explore the other topics we discussed in the course, please check out the references and the other videos made by my classmates, linked at cs.uwaterloo.ca/~wenhuche/teaching/cs886/
Can you link the previous lecture? It would be useful to watch.
Hi! The previous lecture was given by my classmate; you can find it at ruclips.net/video/RfD5tPoMnZY/видео.html
I'm not able to find part 1 of this.
Hi! The previous lecture was given by my classmate; you can find it at ruclips.net/video/RfD5tPoMnZY/видео.html