Unlocking Local LLMs with Quantization - Marc Sun, Hugging Face
- Published: Nov 3, 2024
This talk will share the story of quantization, its rise in popularity, and its current status in the open-source community. We'll begin by reviewing key quantization papers, such as QLoRA by Tim Dettmers and GPTQ by Elias Frantar. Next, we'll demonstrate how quantization can be applied at various stages of model development, including pre-training, fine-tuning, and inference. Specifically, we'll share our experience in pre-training a 1.58-bit model, show how fine-tuning is achievable using PEFT + QLoRA, and discuss optimizing inference performance with torch.compile or custom kernels. Finally, we'll highlight efforts within the community to make quantized models more accessible, including how the transformers library incorporates state-of-the-art quantization schemes and how to run GGUF models from llama.cpp.
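To make the core idea concrete before the talk's details: quantization maps floating-point weights to low-bit integers plus a scale factor, trading a small amount of precision for large memory savings. The sketch below is a minimal, illustrative absmax (symmetric) int8 quantizer written with NumPy; it is not taken from any of the papers or libraries mentioned above, and the function names are our own.

```python
import numpy as np

def quantize_absmax(x: np.ndarray, bits: int = 8):
    """Symmetric absmax quantization: scale so the largest |value| maps to qmax."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = np.max(np.abs(x)) / qmax    # one scale per tensor (per-tensor quantization)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

# Quantize a toy weight vector and measure the reconstruction error.
weights = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_absmax(weights)
recovered = dequantize(q, scale)
max_error = np.max(np.abs(weights - recovered))
```

Schemes like GPTQ and the NF4 data type used by QLoRA are far more sophisticated (error-compensating rounding, non-uniform quantiles, per-block scales), but they all build on this basic map-to-integers-plus-scale idea.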