Efficient LLM Inference with SGLang, Lianmin Zheng, xAI
- Published: Dec 22, 2024
- In this Advancing AI 2024 Luminary Developer Keynote, Dr. Lianmin Zheng introduces SGLang, a high-performance serving framework optimized for inference with LLMs and vision-language models.
SGLang’s core techniques include RadixAttention for improved KV cache reuse and jump-forward decoding for faster grammar-guided decoding. Additional optimizations, such as low-overhead CPU scheduling and torch-native integrations (e.g., torch.compile and torchao), further improve efficiency. Benchmark results show that SGLang outperforms other state-of-the-art inference engines.
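To make the two core techniques concrete, here is a toy Python sketch of the idea behind RadixAttention, not SGLang's implementation: requests are keyed by their token prefix in a tree, so a new request can reuse the KV cache of the longest prefix it shares with earlier requests. The real system stores GPU KV-cache blocks at the nodes, compresses edges (a true radix tree rather than this per-token trie), and evicts with an LRU policy, all of which are omitted here.

```python
class Node:
    def __init__(self):
        self.children = {}  # token id -> child Node
        self.kv = None      # placeholder for a cached KV-cache block

class RadixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a finished request so later requests can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())

cache = RadixCache()
cache.insert([1, 2, 3, 4])               # tokens of a completed request
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: three tokens of KV reusable
```

Jump-forward decoding can be sketched in the same spirit. SGLang works with a compressed finite-state machine over the tokenizer vocabulary; this hypothetical example reduces the grammar to a fixed JSON template to show the core idea: spans that the grammar fully determines are appended directly, and the model is only called for the free spans.

```python
# Template grammar: literal spans are fixed; None marks model-decoded spans.
TEMPLATE = ['{"name": "', None, '", "age": ', None, "}"]

def generate(decode_free):
    """Jump over literal spans; call the model only for free spans."""
    out = []
    for part in TEMPLATE:
        if part is None:
            out.append(decode_free("".join(out)))  # model decodes here
        else:
            out.append(part)  # jump-forward: appended without any decoding
    return "".join(out)

answers = iter(["Alice", "30"])           # stand-in "model" outputs
print(generate(lambda prefix: next(answers)))
# -> {"name": "Alice", "age": 30}
```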
As an open-source project with broad adoption, SGLang is also deployed for production serving at xAI.
Speaker: Lianmin Zheng, xAI
Gain access to AMD developer tools and resources.
www.amd.com/en...
The information contained in this video represents the view of AMD or the third-party presenter as of the date presented. AMD and/or the third-party presenters have no obligation to update any forward-looking content in the above presentations. AMD is not responsible for the content of any third-party presentations and does not necessarily endorse the comments made therein. GD-84.
© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.