LLaMa GPTQ 4-Bit Quantization. Billions of Parameters Made Smaller and Smarter. How Does it Work?

  • Published: 27 Oct 2024

Comments • 61

  • @kaymcneely7635 • 1 year ago • +16

    You really put a lot of your time and effort into these highly informative videos. Thank you so much

  • @42svb58 • 1 year ago • +1

    This is exactly the level of explanation I need: I can pick up the key concepts and then dive deeper at my own pace. Keep it up!

  • @beerreeb123 • 1 year ago • +2

    I sincerely appreciate your willingness to share the results of your research and understanding!

  • @nightwintertooth9502 • 1 year ago • +4

    Thanks for publishing this. I am glad someone is breaking it down, as I have been talking over people's heads about this quite a lot for the last three weeks.

    • @nightwintertooth9502 • 1 year ago • +2

      Thanks for the love. If you're doing the deep dive you should definitely touch on PEFT and how block windowing is achieved in GPTQ's kernel code with transformation matrices and diagonals. This is how we are able to define block size and make training 65B+ models possible by loading only what's being worked on into VRAM as a transform block and freezing the rest of the weights. The HuggingFace docs lovingly touch on PEFT in the transformers library. Great strides have been taken to make this accessible to the everyday person. Occ34ns fork contains the kernel code with PEFT for LLaMA variants by mosaic. I had to have a wet math moment when I dug through it, since I work with transformation matrices in 3D graphics acceleration. Didn't think you could do that to tensors, but yes, you can once they have been quantized. :D Credits to him for thinking outside the box and making training happen on CPU only and such. He forked GPTQ and did absolutely magic things to it. :)

    • @AemonAlgiz • 1 year ago • +3

      I did a quick video on LoRA PEFT already! Though I did intentionally keep the rank decomposition on the LoRA matrices a bit higher level and only discussed attaching them to the feed-forward layer (there is a small sketch of that idea after this thread).
      I think a very technical deep-dive series would be a lot of fun, though I'm still trying to find a balance between technical depth and keeping the videos generally consumable. It's challenging to find the balance that engages software engineers like you and me but can also be enjoyed by enthusiasts.
      Thanks for the comment and thanks for watching!

    • @nightwintertooth9502 • 1 year ago

      @AemonAlgiz of course. Glad someone's able to get it out in a digestible fashion.
      Bonus: get yourself signed up for Microsoft Build if you haven't already. They will be granting Copilot X and GPT-4 plugin access to RSVPs. I'm running this stuff in Azure. They're going to be discussing a lot of AI news, handing out MCA subscription credits, and all that good stuff to play around with.
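
The LoRA/PEFT idea discussed in this thread boils down to freezing the original weights and learning a small low-rank update on top of them. Below is a minimal PyTorch sketch of that rank decomposition; the layer sizes and rank are made-up illustration values, not anything from the video:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update: y = base(x) + (alpha/r) * x A^T B^T."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # freeze the original weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # low-rank factor A
            self.B = nn.Parameter(torch.zeros(base.out_features, r))        # B starts at zero, so the update is initially a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    # Hypothetical usage: wrap one feed-forward projection of a transformer block
    ff = nn.Linear(4096, 11008)
    ff_lora = LoRALinear(ff, r=8)

Only A and B receive gradients; the base weights stay frozen, which is what makes fine-tuning very large models tractable on limited VRAM.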

  • @nacs • 1 year ago • +1

    Really nice. As someone who barely knows how matrices and such work, you made these quantization concepts easy to understand.

  • @MaJetiGizzle • 1 year ago • +5

    Fantastic explanation and great tutorial! Hoping this channel grows a lot in the future!

  • @vishnunair3805 • 1 year ago • +6

    This is ridiculously well explained and easy to understand for someone only beginning to explore this rabbit hole. Whatever motivates you to keep making these videos, I hope it continues. I am going to go ahead and check out the rest of your library. I also hope you continue to explain concepts around the subject of these models. Thank you.

    • @AemonAlgiz • 1 year ago

      Thank you! I'm glad it was helpful and I am definitely here to stay!

    • @U2VidWVz • 1 year ago

      @@AemonAlgiz 👀 No activity on GitHub since June either. Really appreciate this video and hope you make your way back soon. Thanks!

  • @EduardPokonechnyi • 1 month ago

    Nice explanation, thank you!!

  • @quinn479 • 1 month ago

    Thanks for the clear and concise explanation, it was perfect.

  • @redfield126 • 1 year ago • +1

    Your intelligence is impressive as it compensates for my lack of understanding 😅, but thanks to your articulate explanations, I believe I'm grasping it. I'm grateful to you for imparting such incredible content.

  • @alx8439 • 1 year ago

    It's so sad you abandoned your channel. Your explanations are gems

  • @logan56 • 1 year ago

    Dude keep these great videos up. We appreciate you

  • @KitcloudkickerJr • 1 year ago • +1

    this channel is a gold mine

    • @AemonAlgiz • 1 year ago • +1

      Thank you so much!

    • @KitcloudkickerJr • 1 year ago

      @@AemonAlgiz I should be thanking you, lol. This is wonderful education in a well-explained manner.

  • @fahnub • 1 year ago • +1

    Thanks for all the effort that went into making this video. Very informative indeed.

  • @dhirajkumarsahu999 • 6 months ago

    Thank you so much for simplifying this to such an extent. Subscribed.

  • @smellslikeupdog80 • 1 year ago • +1

    This was a great explanation, thank you.

  • @enmanuelsancho • 1 year ago • +2

    Wow, really good explanation. The part about encoding the 16-bit float as an 8-bit integer by scaling is pretty intuitive, but the process of adding the error back so that small values are less likely to fail is mind-blowing. I didn't expect it to work, but if it's being implemented right now, it's because it does.

    • @AemonAlgiz • 1 year ago • +1

      It took me a bit to realize that's why the inverse Hessian was there! It blew my mind when I realized it.
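
The scaling trick described in this thread, stripped to its core, looks like the following (a minimal sketch of symmetric "absmax" 8-bit quantization; GPTQ itself works group by group and uses the inverse Hessian to decide how the rounding error gets spread, as the reply above notes):

    import torch

    def absmax_quantize_int8(w):
        """Symmetric 8-bit quantization: scale by the largest magnitude so values land in [-127, 127]."""
        scale = w.abs().max() / 127.0
        q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
        return q, scale

    w = torch.randn(4, 4)            # stand-in for a block of fp16/fp32 weights
    q, scale = absmax_quantize_int8(w)
    w_hat = q.float() * scale        # dequantize
    err = w - w_hat                  # per-weight rounding error
    print(err.abs().max())           # GPTQ folds this error back into the weights it has not quantized yet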

  • @TrelisResearch • 1 year ago • +1

    What’s your view on bitsandbytes NF4 versus GPTQ for quantisation?

  • @jonrross • 1 year ago • +1

    Great video as always. Thanks for sharing your knowledge.

  • @yolo12999 • 1 year ago • +2

    Underrated channel. You, sir, deserve more subs.
    P.S.: Could you do the same for GGML? And if you already did, a playlist with GGML, GPTQ, LoRA, QLoRA, 4-bit vs 8-bit, and performance based on parameter count (3B, 7B, etc.) would be nice to have. A lot of channels cover the models as a whole, but most of them never cover the process behind them. Your video was easy to follow and made the basics behind LLM quantization understandable. Keep it up.

    • @AemonAlgiz • 1 year ago • +2

      Thank you! I will be covering GGML in the video after the next one! I think it’s an incredibly powerful tool.

  • @vq8gef32 • 7 months ago

    Amazing, loved it

  • @Heccintech • 1 year ago • +2

    Wow this is a great video

  • @samlaki4051 • 1 year ago • +1

    this is so cool!

  • @hoatran-lv5rj • 11 months ago

    Thanks. Where can I find the model 'lmsys_vicuna-7b-delta-v1.1' that you mentioned in your demonstration?

  • @heejuneAhn • 1 year ago • +1

    Thank you for this nice introduction to GPTQ. Can you also explain how these quantized parameters are actually run on the GPU? I am more interested in the inference process: what types of variables and operations are used on the GPU, whether the quantized params are de-quantized before use or used in their quantized state, and how the scaling factors are saved and restored.

    • @AemonAlgiz • 1 year ago

      This is a great question! If you want to see how these values are cached and used, AutoGPTQ's CUDA kernels are a great example.
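
For readers who don't want to dig through kernel code, here is a rough sketch of what inference with the quantized weights amounts to: the stored integers plus per-group scales and zero points are turned back into floating-point weights and multiplied with the activations. Real kernels (AutoGPTQ's included) keep the 4-bit values packed and fuse the unpacking into the matmul rather than materializing a full fp16 matrix; the shapes and names below are illustrative only:

    import torch

    def dequant_matmul(x, q_weight, scale, zero, group_size=128):
        """q_weight: (out, in) integers in 0..15 (shown unpacked for clarity).
        scale, zero: (out, in // group_size) per-group parameters saved at quantization time."""
        out_f, in_f = q_weight.shape
        g = in_f // group_size
        q = q_weight.to(scale.dtype).view(out_f, g, group_size)
        w = (q - zero.unsqueeze(-1)) * scale.unsqueeze(-1)   # dequantize: (q - zero) * scale
        return x @ w.view(out_f, in_f).T

    # Toy shapes, just to show the plumbing
    x = torch.randn(2, 256)
    qw = torch.randint(0, 16, (512, 256), dtype=torch.uint8)
    scale = torch.rand(512, 2)
    zero = torch.full((512, 2), 8.0)
    print(dequant_matmul(x, qw, scale, zero).shape)          # torch.Size([2, 512])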

  • @tomm9716 • 5 months ago

    Great explanation.
    Some questions:
    When we are quantising and computing the quantisation loss, do we not need to supply some data to compute the loss against? If not, how exactly is this loss computed? (Surely we need some inputs and expected outputs to compute this loss; is this why all of the weight errors were 0 when you quantised?)
    If we do, could this be interpreted as a form of post-training quantisation 'fine-tuning'? By this I mean that we could use domain data in the quantisation process to help preserve the emergent features in the model that are most useful for our specific domain dataset.
    Thanks!
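
On the first question above: GPTQ does use data. It quantizes one layer at a time and measures the error on the layer's outputs over a small calibration set rather than on the weights themselves, and no gradient updates are applied. A rough sketch of that per-layer objective (all shapes here are made up):

    import torch

    def layer_quant_loss(X, W, W_q):
        """How much the layer's outputs change on calibration inputs X.
        X: (n_samples, in_features) activations collected from a calibration set
        W, W_q: (out_features, in_features) original and quantized weights"""
        return torch.norm(X @ W.T - X @ W_q.T) ** 2

    X = torch.randn(16, 64)
    W = torch.randn(32, 64)
    W_q = W + 0.01 * torch.randn_like(W)   # stand-in for a quantized copy of W
    print(layer_quant_loss(X, W, W_q))

Because the calibration data steers which activation directions are preserved most faithfully, using domain-specific text there is arguably a mild, training-free way to bias the quantized model toward your domain, though it is much weaker than actual fine-tuning.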

  • @cyberalchemy3884 • 1 year ago

    Very helpful! Also, if you could sync your voice to the video more precisely, it would improve the overall quality.

  • @spamdball • 2 months ago

    Thanks a lot for this explanation. How can you even out the errors via the next weights when you do not know in advance what activation value the weights will be multiplied with?
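
The short answer is that GPTQ does not know the future activations exactly; it uses the calibration activations as a proxy. The Hessian of the layer-wise error, roughly H = 2 X Xᵀ built from those activations plus a small damping term, captures how sensitive the layer's output is to each weight, and each column's rounding error is pushed onto the not-yet-quantized columns through the inverse of H. A deliberately simplified, one-row sketch (the real algorithm works in blocks and uses a Cholesky factorization, so treat this as illustration only):

    import torch

    def gptq_row_sketch(w, H_inv, scale):
        """Quantize one weight row column by column, compensating each rounding
        error in the columns that have not been quantized yet.
        w: (n,) one row of the weight matrix
        H_inv: (n, n) inverse Hessian from calibration activations
        scale: quantization step size for this row"""
        w = w.clone()
        q = torch.zeros_like(w)
        for i in range(len(w)):
            q[i] = torch.round(w[i] / scale).clamp(-127, 127)
            err = (w[i] - q[i] * scale) / H_inv[i, i]
            w[i + 1:] -= err * H_inv[i, i + 1:]   # spread the error over the remaining columns
        return q

    # Toy usage: with an identity "Hessian" this degenerates to plain rounding
    w = torch.randn(8)
    print(gptq_row_sketch(w, torch.eye(8), scale=0.05))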

  • @andrinf.1863 • 1 year ago

    At 1:45, where you defined the range of values for the 8-bit zero-point quantization: the way I understand it is that we only use 8 bits to store our weight values, so we would be using an interval of 256 values. So wouldn't it be [-127, 128] instead of [-127, 127]?

  • @aitarun • 1 year ago

    Thanks. How do you run the converted model?

  • @SinanAkkoyun • 1 year ago • +1

    Hey! Thank you so much for the video! I wanted to ask: what exact role does the dataset play during quantization? The code you showed uses wikitext2 as the dataset for quantization.
    I am looking forward to your response!
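
For context on the question: the dataset is a calibration set. A small number of samples (often 128 or so) are passed through the model so the quantizer can record the activations each layer sees; the weights are never trained on it. Below is a sketch of how such samples are typically prepared with the Hugging Face datasets/transformers libraries; the model name and preprocessing here are illustrative, not the exact code from the video:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # hypothetical model id

    # A small slice of wikitext-2 is enough; it is only used to collect activation
    # statistics, never to update the weights.
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = [t for t in data["text"] if len(t) > 200][:128]
    calib = [tokenizer(t, return_tensors="pt", truncation=True, max_length=2048) for t in texts]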

  • @richardyao9012 • 10 months ago

    You wrote -127 to 127. 8-bit integers are -128 to 127. The quantization method you described loses slightly more precision than necessary.

  • @kevinehsani3358 • 1 year ago

    One point I am not clear on: you divide by 127, which corresponds to 8 bits, but then you say you are going to do 4-bit quantization. Do you actually divide the matrices by 63 for 4 bits, or still by 127?

  • @jwadaow • 1 year ago • +1

    Do you divide by the largest number to get the scaling factor, or do you divide by the modulus?

    • @AemonAlgiz • 1 year ago

      For zero-point you use the largest value, though there are other techniques like binning.
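
A small worked example of that scaling factor: the divisor is the largest magnitude the target signed integer type can hold, so 127 for 8-bit and 7 for symmetric signed 4-bit (15 if the 4-bit values are kept unsigned with a zero point), and the weight with the largest absolute value is the one that maps onto it:

    def scale_for(weights, bits):
        """Symmetric scaling factor: largest |weight| over the largest magnitude
        a signed integer of the given bit width can hold."""
        qmax = 2 ** (bits - 1) - 1
        return max(abs(w) for w in weights) / qmax

    w = [0.3, -1.2, 0.7]
    print(scale_for(w, 8))   # 1.2 / 127 ≈ 0.00945
    print(scale_for(w, 4))   # 1.2 / 7   ≈ 0.171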

  • @ozne_2358 • 1 year ago • +1

    Do you know what the weight distribution looks like for LM transformers? For convolutional neural networks the weight distribution tends to be sort of Gaussian/Laplacian, meaning that there are many smaller weights and increasingly fewer large ones. This has implications for the compressibility of said weights and more.

    • @AemonAlgiz • 1 year ago • +1

      I suspect the weight distribution follows some normal curve, since we can detect outlier features. There is also a new paper, Hyena, which describes a way to find what they call "Hyena matrices" for the weights, where the weights can be diagonalized. So we may be looking at O(n*ln(n)) computational (time and memory) complexity alongside 100k+ token contexts very soon.
      Paper: arxiv.org/pdf/2302.10866v3.pdf

    • @ozne_2358 • 1 year ago

      @@AemonAlgiz Thanks!
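
If you want to eyeball that distribution yourself, here is a quick sketch; the tensor below is a stand-in, and for a real model you would flatten any linear layer's .weight instead:

    import torch

    w = torch.randn(1_000_000)                        # stand-in for a layer's weights
    hist = torch.histc(w, bins=11, min=-4.0, max=4.0)
    print((hist / hist.sum()).tolist())               # most of the mass sits in the bins around zero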

  • @kevinehsani3358 • 1 year ago

    I tried to duplicate this. Everything seems fine up to the point when I run it and the process gets killed by the system. My GPU is an A2000 with 8 GB; I guess you need at least 14 GB. I tried to reduce 32 bits to 16 but it got killed on that too. Any ideas?

  • @alexandergloux6747 • 1 year ago • +1

    What is the significance of 127?

    • @AemonAlgiz • 1 year ago

      It’s the largest value a signed 8-bit integer can hold.

  • @HostileRespite • 1 year ago • +1

    OMG the maths! 🤣

    • @AemonAlgiz • 1 year ago • +2

      It took me a whole day to realize why the inverse Hessian was there, haha.

    • @HostileRespite • 1 year ago

      @@AemonAlgiz Thank God for Mr. Wolfram... 🤣

    • @AemonAlgiz • 1 year ago • +2

      @@HostileRespite I didn’t even think of checking Wolfram for something on it! Do they have some documentation on it?
      My PhD is in Physics, so I tend to just stare at papers until I figure it out haha

    • @HostileRespite • 1 year ago • +1

      @@AemonAlgiz REALLY? Respect! Ex nuclear munitions tech here. Glorified torque wrench twister, nothing so glamorous as you, but I know more of the... uh... impractical side... of your studies. I went in as a dumb kid who loved science and came out a lot wiser, but I still love science. Got interested in zero-point energy back in the day before RL sidetracked me. Anyway, enjoy your stuff! I should have figured you were a bit like me; birds of a feather stubbornly die-hard together... when someone has already done the math for us. 🤣

  • @yusufkemaldemir9393 • 1 year ago • +1

    Thanks. You speak fast; do you mind slowing down a little bit? The background sound needs to be removed. Also, please zoom in on the sections where you write code. It is impossible to see what you write.

    • @AemonAlgiz • 1 year ago

      Hey there! Which sections are difficult for you to see? This isn’t a complaint I’ve gotten before; I set the font pretty large.
      Edit: I rewatched this one and I did keep the font too small. Sorry, that was a mistake on this one!

  • @klammer75 • 1 year ago • +1

    I’m really digging your mathematical explanations! Keep it up and subscribed for the mafs!🦾🤓

    • @AemonAlgiz • 1 year ago • +1

      Thanks, the one coming out today is pretty math heavy!