How to Quantize an LLM with GGUF or AWQ

  • Published: Dec 13, 2024

Comments • 58

  • @TrelisResearch
    @TrelisResearch  1 year ago +1

    A GPTQ script is now included if you purchase the scripts OR access to the ADVANCED-fine-tuning repo in the description.

    • @Kuchiriel
      @Kuchiriel 5 months ago +1

      Dude, I thank you very much for the video, but why in the name of... are you selling pieces of code ppl can find on the internet or have ChatGPT generate for them? o.O

    • @TrelisResearch
      @TrelisResearch  5 months ago

      @@Kuchiriel you're welcome!
      And yes, ppl can definitely code up what I show in the videos! It's a matter of how much time they want to spend debugging and digging to get things to work.
      And, I even recommend the DIY approach if you want to learn (although quite a few people find it helpful to have working scripts to follow along).

  • @saramirabi1485
    @saramirabi1485 9 months ago

    Ah, this video was all I wanted, thanks a lot. I think I should check out all your videos.

  • @lrkx_
    @lrkx_ 1 year ago +1

    You mentioned it’s possible, but hard, to fine-tune a pre-quantized GGUF model. Given it’s possible, could it be done on a Mac? I’m assuming the smaller file size and lower precision would mean fewer resources would be required?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Yes to all of those questions! See here, but only for Llama models: github.com/ggerganov/llama.cpp/pull/2632

    • @lrkx_
      @lrkx_ 1 year ago

      @@TrelisResearch great, thank you for the link. Much appreciated.

    • @Heisenberggg-7
      @Heisenberggg-7 11 months ago

      Hey, May I know what approach you took?

  • @efexzium
    @efexzium 1 year ago +1

    How can we save the GGUF model to local storage for zero download time in RunPod?

    • @TrelisResearch
      @TrelisResearch  1 year ago +1

      You can create a storage volume in RunPod, then start a pod connected to that storage volume. That will save your file there. Next time you start, there will be no download time.
      It's a bit unusual though to use GGUF with RunPod. GGUF is usually for running on Mac, in which case your files will be downloaded locally so there should be no download time.

    • @efexzium
      @efexzium 1 year ago +1

      So the idea was to use a GGUF model that runs in GPU mode using ctransformers [cuda], so we can load bigger models on smaller GPUs and save some money 💰 (see the sketch after this thread)

    • @RonanMcGovern
      @RonanMcGovern 1 year ago

      @@efexzium GGUF is optimized for Apple silicon. Typically AWQ is better for GPUs.

    • @efexzium
      @efexzium 10 months ago

      Yes lol, I'm using GPTQ now because RunPod CPUs leave much to be desired. Since there are a lot of videos saying GPTQ is deprecated, I guess they misled me. @@TrelisResearch
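
    A minimal sketch of the GGUF-on-GPU idea from this thread, assuming ctransformers is installed with CUDA support and the GGUF file is already saved on an attached RunPod volume (paths and file name are placeholders):

      # Load a GGUF file from a local volume path with GPU layer offload.
      from ctransformers import AutoModelForCausalLM

      llm = AutoModelForCausalLM.from_pretrained(
          "/workspace/models",                       # placeholder volume path
          model_file="llama-2-7b-chat.Q4_K_M.gguf",  # placeholder file name
          model_type="llama",
          gpu_layers=50,  # number of layers to offload to the GPU; 0 = CPU only
      )

      print(llm("Explain AWQ vs GGUF in one sentence:", max_new_tokens=64))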

  • @varunponugoti7502
    @varunponugoti7502 12 hours ago

    How do you quantize a vision-language model?

  • @efexzium
    @efexzium 10 months ago

    Hi Trelis, great video. Question: do you have any videos on training a model with LoRA and PEFT, then putting that trained adapter on an unquantized model to merge it and save it to HF for production? In other words, it seems like there's not a lot of content on how to use LoRA/PEFT-trained adapters in production. My guess is we would have to merge the adapters into the unquantized model and then quantize it? Any video or idea where to find documentation for this?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Yup there’s a vid on this channel called pushing models to huggingface. It covers those options!
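
    A rough sketch of that merge-then-push flow with peft and transformers (the repo IDs are placeholders); the merged bf16 model is what you would then quantize with AWQ or GGUF:

      # Merge a LoRA adapter into the bf16 base model, then push the merged
      # weights to the Hub so they can be quantized afterwards.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from peft import PeftModel

      base_id = "meta-llama/Llama-2-7b-hf"      # placeholder base model
      adapter_id = "your-username/llama2-lora"  # placeholder adapter repo

      base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
      model = PeftModel.from_pretrained(base, adapter_id)
      merged = model.merge_and_unload()  # folds the LoRA weights into the base model

      tokenizer = AutoTokenizer.from_pretrained(base_id)
      merged.push_to_hub("your-username/llama2-merged")    # placeholder target repo
      tokenizer.push_to_hub("your-username/llama2-merged")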

  • @bhoumikshah5695
    @bhoumikshah5695 9 months ago

    Where have you declared what data to use for AWQ quantisation?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      It’s hidden in the awq library so you don’t see it. I think it defaults to wikitext or C4. Check out the AutoAWQ repo on GitHub for the custom params to pass.
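
    A minimal sketch of passing calibration data explicitly to AutoAWQ via calib_data (the model path is a placeholder; defaults and accepted values may differ between AutoAWQ versions):

      # Quantize with AutoAWQ while choosing the calibration data explicitly.
      from awq import AutoAWQForCausalLM
      from transformers import AutoTokenizer

      model_path = "mistralai/Mistral-7B-v0.1"  # placeholder source model
      quant_path = "mistral-7b-awq"             # local output directory

      quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

      model = AutoAWQForCausalLM.from_pretrained(model_path)
      tokenizer = AutoTokenizer.from_pretrained(model_path)

      # calib_data accepts a dataset name or a list of your own calibration strings
      model.quantize(tokenizer, quant_config=quant_config, calib_data="pileval")

      model.save_quantized(quant_path)
      tokenizer.save_pretrained(quant_path)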

  • @rodrimora
    @rodrimora 1 year ago

    If I have a 3090 and a 3080 that would add up to a weird amount of VRAM, like 34GB. Can I quantize a model to 2.5 or 3.5 bits?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      You just need to be able to load the model in bf16, so the approx RAM needed should be around the # of parameters x 2 bytes.
      So for a 7B model you need probably 15 GB. For 13B, double that.
      I'm not 100% sure, but I think quantization should work even if you load the model across multiple devices (including cpu).

    • @rodrimora
      @rodrimora 1 year ago

      @@TrelisResearch hi! Yes, I mean to quantize using RunPod to have more VRAM available to load the model in bf16. But instead of quantizing it to 4 bit, can I do it to fractional bits like 3.5? Is that possible? As models typically fit into 24 or 48GB, but not 34GB.

    • @TrelisResearch
      @TrelisResearch  1 year ago

      @@rodrimora Yes! That's possible. The way you do 3.5 is to quantize some weights to 4 bits and some to 3.
      Here is a list of the options: github.com/ggerganov/llama.cpp/tree/master/examples/quantize
      If you're using the Trelis scripts you would replace Q_4 with Q3_K_S (see the sketch after this thread).

    • @rodrimora
      @rodrimora 1 year ago

      @@TrelisResearch And sorry for the noob question, but how does one quantize different weights of one model? In the video you seem to quantize the whole model to 4 bits, for example. What exactly are the weights? Yeah! My plan is to use your script. Buying it right now to test.

    • @TrelisResearch
      @TrelisResearch  1 year ago

      @@rodrimora Yeah, the selection of which weights to quantize is done by the quantization script, you just need to specify which quantization type you want - as per my last answer :)
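
    A rough sketch of that mixed-precision step, calling llama.cpp's quantize tool with the Q3_K_S type from Python (paths are placeholders; the binary is ./quantize in older llama.cpp builds and ./llama-quantize in newer ones):

      # Estimate the bf16 load memory, then quantize to a 3-bit K-quant type.
      import subprocess

      params_billion = 13           # e.g. a 13B model
      bf16_gb = params_billion * 2  # roughly 2 bytes per parameter in bf16
      print(f"Approx. RAM to load in bf16: ~{bf16_gb} GB")

      subprocess.run(
          [
              "./quantize",                # or ./llama-quantize, depending on version
              "models/model-f16.gguf",     # placeholder: fp16 GGUF from the convert script
              "models/model-Q3_K_S.gguf",  # placeholder output path
              "Q3_K_S",                    # mixed 3-bit K-quant instead of Q4
          ],
          check=True,
      )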

  • @btMunishKumar
    @btMunishKumar 7 months ago

    How to create imatrix calculations for quantizations?

    • @TrelisResearch
      @TrelisResearch  7 months ago

      You can take a look here: github.com/Cornell-RelaxML/quip-sharp
      But I wouldn't go too deep on this because it's not very widely supported, and I'm unsure whether they have custom kernels that are as good as AWQ or Marlin.

  • @andrewchen7710
    @andrewchen7710 10 months ago

    Trelis, great video! Though I'm curious: you mentioned that AWQ is data-dependent, yet I did not see its quantization script utilizing any dataset. What's happening?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      yeah, the quanting script does load a small dataset. AWQ is activation aware quantization. Activations are the product of input tokens and weights, so you need to have a dataset in order to be activation aware.

  • @raphaellfms
    @raphaellfms 1 year ago +1

    How can I use an AWQ model locally?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      AWQ can only be used locally if you have GPUs. It won't work if you have a Mac. What do you have locally?
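
    A minimal sketch of running a pre-quantized AWQ checkpoint locally on an NVIDIA GPU with AutoAWQ (the repo ID is a placeholder; exact loading arguments can vary between versions):

      # Load and run an AWQ-quantized model on a CUDA GPU.
      from awq import AutoAWQForCausalLM
      from transformers import AutoTokenizer

      quant_path = "TheBloke/Mistral-7B-v0.1-AWQ"  # placeholder AWQ repo

      model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
      tokenizer = AutoTokenizer.from_pretrained(quant_path)

      inputs = tokenizer("What is AWQ?", return_tensors="pt").to("cuda")
      outputs = model.generate(**inputs, max_new_tokens=64)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))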

  • @rongronghae
    @rongronghae 2 months ago

    Can I convert a private local model to GGUF format without uploading it to Hugging Face?

    • @TrelisResearch
      @TrelisResearch  2 months ago

      Yes, you can convert from safetensors to GGUF locally, no need to upload.
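
    A rough sketch of that offline conversion using the convert script that ships with llama.cpp (paths are placeholders; the script is convert.py in older checkouts and convert_hf_to_gguf.py in newer ones):

      # Convert a local safetensors checkpoint to GGUF entirely offline.
      import subprocess

      subprocess.run(
          [
              "python", "convert_hf_to_gguf.py",       # run from the llama.cpp checkout
              "/path/to/local/model",                  # placeholder: folder with config.json + safetensors
              "--outfile", "/path/to/model-f16.gguf",  # placeholder output file
              "--outtype", "f16",
          ],
          check=True,
      )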

  • @Vishanoberoi
    @Vishanoberoi 10 months ago

    I finetuned Llama 2 using nf4 BitsAndBytesConfig quantization. Can I now use these other methods (AWQ, GPTQ) on this?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      if you have saved the model in nf4, I think you can still load it in bf16 and then do a quant.
      or, you can merge the trained adapter to a reloaded bf16 base model. check out the recent vid on pushing and pulling from huggingface for a description.

  • @marcelocorreiajornal
    @marcelocorreiajornal 1 year ago

    Is it possible to apply llama.cpp to quantize (GGUF) other models' architectures like Mistral? In other words, using this same script?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      Yes, llama.cpp supports Mistral quanting: github.com/ggerganov/llama.cpp

    • @Heisenberggg-7
      @Heisenberggg-7 11 months ago

      Hey, May I know what approach you took?

  • @pec8377
    @pec8377 1 year ago +1

    Hi, instead of uploading file by file you can upload the whole directory using upload_folder.

    • @TrelisResearch
      @TrelisResearch  1 year ago

      thanks, I'll make the script update shortly!
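
    A minimal sketch of that single-call upload with huggingface_hub's upload_folder (the local path and repo ID are placeholders; assumes you are already logged in):

      # Upload the whole quantized-model directory in one call instead of file by file.
      from huggingface_hub import HfApi

      api = HfApi()
      api.upload_folder(
          folder_path="models/TinyLlama-1.1B-Chat-GGUF",     # placeholder local directory
          repo_id="your-username/TinyLlama-1.1B-Chat-GGUF",  # placeholder target repo
          repo_type="model",
      )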

  • @atomhero2830
    @atomhero2830 1 year ago +2

    Can you talk about ExLlama and ExLlama v2, which help speed up GPTQ inference?
    And also, can you talk about a similar method to speed up GGUF?

    • @TrelisResearch
      @TrelisResearch  1 year ago

      ExLlama is an alternative to using transformers (from huggingface) for inference, but it only works for Llama models. It's a more fundamental, from-the-roots approach, so it can probably offer speed-ups. On the other hand, it doesn't have the same team size and support as transformers, so it's hard to keep up with all of the improvements that come out.
      ExLlama v2 includes an option to quantize like GPTQ - it seems a bit of a hybrid with GGUF as it allows for mixed-precision quantization.
      Unfortunately, my sense is that GPTQ is inferior to AWQ - but perhaps ExLlama will move to AWQ.
      What are you inferencing on? What hardware?

    • @Heisenberggg-7
      @Heisenberggg-7 11 months ago

      Hey, May I know what approach you took?

  • @md.rakibulhaque2262
    @md.rakibulhaque2262 1 year ago

    While uploading to HF, getting this error:
    ValueError: Provided path: 'models/TinyLlama-1.1B-Chat-v0.3.Q4_K.gguf' is not a file on the local file system

    • @TrelisResearch
      @TrelisResearch  10 months ago

      just seeing this now, did you get this resolved? seems you hadn't downloaded the model to the models directory?

  • @ShivaniModi-w4c
    @ShivaniModi-w4c 1 year ago +1

    Can I quantize a model using AWQ on A100 GPU?

  • @efexzium
    @efexzium 1 year ago +1

    Super interesting work.

  • @techteaching2742
    @techteaching2742 1 year ago +1

    excellent presentation

  • @leefeng5067
    @leefeng5067 1 year ago

    learned a lot. Thank you!

  • @hinton4214
    @hinton4214 11 months ago

    Brilliant, thanks!

  • @nguyentrong0603
    @nguyentrong0603 4 months ago

    Nice video!!!

  • @parmanandchauhan6182
    @parmanandchauhan6182 6 months ago

    Excellent work. Subscribed today. Keep making info videos.

  • @greenptgt4258
    @greenptgt4258 2 months ago

    How do we integrate a custom quantization algorithm into the process?

    • @TrelisResearch
      @TrelisResearch  2 months ago

      Oof, that's tricky because if you don't do it at the kernel level (in CUDA) it'll just be really slow.
      Have you something custom in mind?

    • @greenptgt4258
      @greenptgt4258 2 months ago

      @@TrelisResearch Yes, I want to try different quantization algorithms on the same model.