A GPTQ script is now included if you purchase the scripts OR access to the ADVANCED-fine-tuning repo in the description.
Dude, I thank you very much for the video, but why in the name of... are you selling pieces of code ppl can find on the internet or have ChatGPT generate for them? o.O
@@Kuchiriel you're welcome!
And yes, ppl can definitely code up what I show in the videos! It's a matter of how much time they want to spend debugging and digging to get things to work.
And, I even recommend the DIY approach if you want to learn (although quite a few people find it helpful to have working scripts to follow along).
Ah, this video was all I wanted, thanks a lot. I think I should check out all your videos.
You mentioned it’s possible, but hard, to fine-tune a pre-quantized GGUF model. Given it’s possible, could it be done on a Mac? I’m assuming the smaller file size and lower precision would mean fewer resources would be required?
Yes to all of those questions! See here, but only for Llama models: github.com/ggerganov/llama.cpp/pull/2632
@@TrelisResearch great, thank you for the link. Much appreciated.
Hey, may I know what approach you took?
How can we save the GGUF model to local storage for zero download time in RunPod?
You can create a storage volume in RunPod, then start a pod connected to that storage volume. That will save your file there. Next time you start, there will be no download time.
It's a bit unusual though to use GGUF with RunPod. GGUF is usually for running on a Mac, in which case your files will be downloaded locally, so there should be no download time.
So the idea was to use a GGUF model that runs in GPU mode using ctransformers [CUDA], so we can load bigger models on smaller GPUs and save some money 💰
@@efexzium GGUF is optimized for Apple Silicon. Typically AWQ is better for GPUs.
Yes lol, I'm using GPTQ now because RunPod CPUs leave much to be desired. There are a lot of videos saying GPTQ is deprecated, so I guess they misled me. @@TrelisResearch
How do you quantize a vision language model?
Hi Trelis, great video. Question: do you have any videos on training a model with LoRA and PEFT, then putting that trained adapter on an unquantized model to merge it and save it to HF for production? In other words, it seems like there's not a lot of content on how to use LoRA/PEFT-trained adapters in production. My guess is we would have to merge the adapters into the unquantized model and then quantize it? Any video or idea where to find documentation for this?
Yup, there's a vid on this channel called "pushing models to huggingface". It covers those options!
Where have you declared what data to use for AWQ quantisation?
It's hidden in the AWQ library so you don't see it. I think it defaults to wikitext or C4. Check out the AutoAWQ repo on GitHub for the custom params to pass.
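For anyone hunting for where that parameter lives, here is a rough sketch of an AutoAWQ quantization call with the calibration data passed explicitly. The model path, output folder, and the "pileval" value are assumptions; check the AutoAWQ README for the exact names and defaults in your version.

```python
# Rough sketch using the AutoAWQ library (pip install autoawq).
# Paths and the calib_data value are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"   # hypothetical source model
quant_path = "Mistral-7B-v0.1-AWQ"         # local output folder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# calib_data is where the calibration dataset is chosen; if you don't pass it,
# the library falls back to its built-in default set.
model.quantize(tokenizer, quant_config=quant_config, calib_data="pileval")

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```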
If I have a 3090 and a 3080 that would add up to a weird amount of VRAM, like 34GB. Can I quantize a model to 2.5 or 3.5 bits?
You just need to be able to load the model in bf16, so the approximate RAM needed is around the number of parameters × 2 bytes.
So for a 7B model you probably need ~15 GB. For 13B, double that.
I'm not 100% sure, but I think quantization should work even if you load the model across multiple devices (including CPU).
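As a quick back-of-the-envelope check (my own numbers: 2 bytes per parameter in bf16 plus roughly 10% overhead, where the overhead figure is a guess):

```python
def bf16_load_gib(params_billions: float, overhead: float = 1.10) -> float:
    """Rough RAM to load a model in bf16: 2 bytes per parameter plus some overhead."""
    return params_billions * 1e9 * 2 * overhead / 2**30

for n in (7, 13, 70):
    print(f"{n}B params -> ~{bf16_load_gib(n):.0f} GiB in bf16")
# prints roughly 14, 27 and 143 GiB respectively
```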
@@TrelisResearch hi! Yes, I mean to quantize using RunPod to have more VRAM available to load the model in bf16. But instead of quantizing to 4 bits, can I do fractional bits like 3.5? Is that possible? Models typically fit into 24 or 48 GB, but not 34 GB.
@@rodrimora Yes! That's possible. The way you get 3.5 is to quantize some weights to 4 bits and some to 3.
Here is a list of the options: github.com/ggerganov/llama.cpp/tree/master/examples/quantize
If you're using the Trelis scripts, you would replace Q_4 with Q3_K_S.
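Not the Trelis script itself, but a generic sketch of the underlying llama.cpp step, assuming you already have an f16 GGUF and a built llama.cpp checkout (the binary is called quantize in older builds and llama-quantize in newer ones):

```python
import subprocess

# Generic llama.cpp quantization step (not the Trelis script).
# Assumes an f16 GGUF already exists and llama.cpp is built in the current directory.
fp16_gguf = "models/my-model-f16.gguf"
out_gguf = "models/my-model-Q3_K_S.gguf"

# Q3_K_S is one of the mixed-precision "k-quant" types: most weights go to
# 3 bits while a few sensitive tensors stay at higher precision, which is
# how you land at a fractional average number of bits per weight.
subprocess.run(["./quantize", fp16_gguf, out_gguf, "Q3_K_S"], check=True)
```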
@@TrelisResearch sorry for the noob question, but how does one quantize different weights of one model differently? In the video you seem to quantize the whole model to 4 bits, for example. What exactly are the weights? Yeah, my plan is to use your script. Buying it right now to test.
@@rodrimora Yeah, the selection of which weights get which precision is handled by the quantization script; you just need to specify which quantization type you want, as per my last answer :)
How do you create imatrix calculations for quantization?
You can take a look here: github.com/Cornell-RelaxML/quip-sharp
But I wouldn't go too deep on this because it's not very widely supported, and I'm unsure whether they have custom kernels that are as good as AWQ or Marlin.
Trelis, great video! Though I'm curious: you mentioned that AWQ is data-dependent, yet I did not see its quantization script utilizing any dataset. What's happening?
Yeah, the quantization script does load a small dataset. AWQ is activation-aware quantization. Activations are the product of input tokens and weights, so you need a dataset in order to be activation aware.
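As a toy illustration of why the dataset matters (this is not the real AWQ code, and the model name and calibration text are just placeholders), you can log per-channel activation magnitudes with forward hooks; statistics like these are what an activation-aware method uses to decide how to rescale weights before quantizing:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy illustration only, NOT the actual AWQ implementation.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v0.3"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
tok = AutoTokenizer.from_pretrained(model_id)

act_scales = {}  # per-layer, per-input-channel max |activation|

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        scale = x.abs().amax(dim=tuple(range(x.dim() - 1)))  # pool over batch/tokens
        act_scales[name] = torch.maximum(act_scales.get(name, scale), scale)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

calib_texts = ["The quick brown fox jumps over the lazy dog."]  # stand-in for wikitext/C4 samples
with torch.no_grad():
    for text in calib_texts:
        model(**tok(text, return_tensors="pt"))

for h in handles:
    h.remove()

# act_scales now shows which input channels see the largest activations;
# an activation-aware quantizer protects the corresponding weight channels.
```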
How can I use an AWQ model locally?
AWQ can only be used locally if you have GPUs. It won't work if you have a Mac. What do you have locally?
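If there is an NVIDIA GPU locally, here's a minimal loading sketch. The repo name is just an example AWQ checkpoint (swap in your own), and it assumes a recent transformers version with autoawq installed so the AWQ weights load directly:

```python
# Minimal sketch for a local NVIDIA GPU: pip install autoawq transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"  # example AWQ repo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

inputs = tok("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```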
Can I convert a private local model to GGUF format without uploading it to Hugging Face?
Yes, you can convert from safetensors to GGUF locally; no need to upload.
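A rough local conversion sketch, assuming a llama.cpp checkout (the converter script name varies by version: convert.py in older checkouts, convert_hf_to_gguf.py in newer ones; the paths are placeholders):

```python
import subprocess

# Rough sketch, run from the root of a llama.cpp checkout.
# The local directory just needs the usual config.json + safetensors files.
local_model_dir = "models/my-private-model"
subprocess.run(
    ["python3", "convert_hf_to_gguf.py", local_model_dir,
     "--outfile", "models/my-private-model-f16.gguf",
     "--outtype", "f16"],
    check=True,
)
# The resulting .gguf can then be quantized with the llama.cpp quantize tool,
# all without anything leaving your machine.
```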
I fine-tuned Llama 2 using nf4 BitsAndBytesConfig quantization. Can I now use these other methods (AWQ, GPTQ) on this?
If you have saved the model in nf4, I think you can still load it in bf16 and then do a quant.
Or you can merge the trained adapter into a reloaded bf16 base model. Check out the recent vid on pushing and pulling from Hugging Face for a description.
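A minimal merge sketch with PEFT; the base model ID and adapter path are placeholders for whatever you actually fine-tuned:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: use your own base model and adapter directory.
base_id = "meta-llama/Llama-2-7b-hf"
adapter_dir = "./my-nf4-trained-adapter"

# Reload the base in bf16 (not nf4) and fold the LoRA weights into it.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("./merged-bf16")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged-bf16")
# The merged bf16 folder is what you would then feed to AWQ / GPTQ / GGUF quantization.
```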
Is it possible to use llama.cpp to quantize (GGUF) other model architectures like Mistral? In other words, using this same script?
Yes, llama.cpp supports Mistral quantization: github.com/ggerganov/llama.cpp
Hey, may I know what approach you took?
Hi, instead of uploading file by file you can upload the whole directory using upload_folder.
Thanks, I'll make the script update shortly!
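For reference, a minimal upload_folder call from huggingface_hub; the repo name is a placeholder and it assumes you are already logged in:

```python
from huggingface_hub import HfApi

# Assumes you're logged in (huggingface-cli login) or have HF_TOKEN set.
api = HfApi()
api.upload_folder(
    folder_path="models",                     # local directory with the GGUF files
    repo_id="your-username/your-model-GGUF",  # placeholder repo name
    repo_type="model",
)
```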
Can you talk about ExLlama and ExLlama v2, which help speed up GPTQ inference?
And can you also talk about similar methods to speed up GGUF?
ExLlama is an alternative to transformers (from Hugging Face) for inference, but it only works for Llama models. It's a more fundamental, lower-level approach, so it can probably offer speed-ups. On the other hand, it doesn't have the same team size and support as transformers, so it's hard for it to keep up with all of the improvements that come out.
ExLlama v2 includes an option to quantize like GPTQ - it seems a bit of a hybrid with GGUF as it allows for mixed-precision quantization.
Unfortunately, my sense is that GPTQ is inferior to AWQ - but perhaps ExLlama will move to AWQ.
What are you inferencing on? What hardware?
Hey, may I know what approach you took?
While uploading to HF, getting this error:
ValueError: Provided path: 'models/TinyLlama-1.1B-Chat-v0.3.Q4_K.gguf' is not a file on the local file system
Just seeing this now; did you get this resolved? It seems you hadn't downloaded the model to the models directory?
Can I quantize a model using AWQ on A100 GPU?
Yes, easily!
Super interesting work.
excellent presentation
learned a lot. Thank you!
Brilliant, thanks!
Nice video!!!
cheers
Excellent work. Subscribed today. Keep making info videos.
How do we integrate a custom quantization algorithm into the process?
Oof, that's tricky, because if you don't do it at the kernel level (in CUDA) it'll just be really slow.
Have you something custom in mind?
@@TrelisResearch yes, I want to try different quantization algorithms on the same model.
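One low-effort way to experiment before writing kernels is simulated ("fake") quantization: quantize and immediately dequantize the weights so the model still runs in float, letting you compare quality across schemes while getting none of the speed or memory benefits. A toy round-to-nearest group-quantization sketch, with arbitrary bit width and group size:

```python
import torch

def fake_quantize_group(w: torch.Tensor, bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulated round-to-nearest group quantization: quantize then dequantize,
    so quality can be measured without writing CUDA kernels.
    Assumes w is 2D and in_features is divisible by group_size."""
    out_features, in_features = w.shape
    grouped = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1
    scale = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(grouped / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_features, in_features)

# Example usage: apply to every Linear weight, then evaluate perplexity to
# compare schemes before investing in custom kernels.
# for module in model.modules():
#     if isinstance(module, torch.nn.Linear):
#         module.weight.data = fake_quantize_group(module.weight.data)
```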