Your videos on LoRA finally made the concepts click for me. They were clearly explained! Thank you for the content you make.
Glad it helped. Welcome 😊
That was really well explained, with intuitive diagrams and explanations. Thanks for the video, just subscribed.
thank you! :)
Thank you, that's a beautiful explanation!
One thing I struggle to understand is the term "quantization blocks" at 4:30 - why do we need several of them?
In my understanding from the video, we consider using 3 blocks of 16 bits to describe a number, which is 48 bits and more expensive than a 32-bit float.
But couldn't we just use 16*3 = 48 bits per number instead? Using 48 bits (without splitting them) would give us very high precision within the [0,1] range, thanks to the powers of two.
I did ask GPT, and it responded that there is a 'Scale Factor' and a 'Zero-Point', which are constants that shift and stretch the distribution at 6:02.
Although I understand these might be those quantization constants, I am not entirely sure what the 64 blocks described in the video at 6:52 are.
Is this because the rank of the matrix decomposition is 1, with 64 entries in both vectors?
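In case it helps others puzzling over the same point, here is a minimal sketch of block-wise absmax quantization. The block size of 64 and the int8 target are assumptions for illustration (QLoRA itself uses a 4-bit NF4 data type), but the role of the per-block constant is the same:

```python
import numpy as np

def quantize_blockwise(weights, block_size=64):
    """Quantize a flat float32 array in independent blocks.

    Each block stores low-bit integers plus ONE float scale
    (the 'quantization constant'), so an outlier weight only
    hurts its own block rather than the whole tensor.
    """
    w = weights.reshape(-1, block_size)             # split into blocks
    scales = np.abs(w).max(axis=1, keepdims=True)   # per-block absmax
    q = np.round(w / scales * 127).astype(np.int8)  # map to [-127, 127]
    return q, scales

def dequantize_blockwise(q, scales):
    """Recover an approximation of the original weights."""
    return (q.astype(np.float32) / 127.0) * scales

# toy usage: 256 weights -> 4 blocks of 64, each with its own scale
w = np.random.randn(256).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s).reshape(-1)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

The reason for many small blocks is that each block gets its own scale, so a single outlier weight only degrades the precision of its 64 neighbours instead of the whole weight matrix.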
Nicely explained! Keep the good work going!! 🤗
Thank you 🙂
Are you interested in more theory or more hands-on, implementation-style videos? Your input will be very valuable 👍
@@AIBites I'm interested in more videos on concept understanding, as the implementations are easily available.
@@AIBites Yes. We all want that.
Thanks for connecting the dots!
glad you liked! :)
Thank you for the explanation! I find that it's very helpful!
Awesome explanation! ❤
glad you think so and thank you indeed :)
Neat explanation
Glad you think so!
Very well explained
Thanks so much 😊
amazing explanation
Glad you think so!
Waiting to see it.
Sure! :)
Hey. I need your help. I have a curated set of notes and books and I wish to use it to finetune a model. How can it be done?
Would you like to see a fine-tuning video on text data? Would that be useful? Do you have any suggestions on a dataset I can show fine-tuning on?
@@AIBites Yes, I have a suggestion: fine-tuning a model on my journal, in which I have written about the truth of nonduality and the illusory nature of reality. I am also actively curating books on truth, and would love your help.
@@AIBites That would be very helpful. There aren't many good videos on fine-tuning out there.
hope the fine-tuning video was of some help
I don’t understand why you say that LoRA is fast for inference… in any case you need to forward through the full rank pretrained weights + low-rank finetuned weights.
Ah yes. If only we could quantize the weights, we could do better than the pre-trained weights. You are making a fair point here. Awesome, and thank you! :)
@@AIBites Yeah, if only we could replace the pretrained full-rank weights with the low-rank weights... Really nice video and illustrations! Thanks a lot!
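For anyone else reading this thread: the usual resolution is that the low-rank update can be folded into the pretrained weights once fine-tuning is finished, so inference costs the same as the base model. A minimal NumPy sketch (the dimensions, rank, and variable names are made up for illustration):

```python
import numpy as np

d, r = 1024, 8                      # model dim and LoRA rank (illustrative)
W = np.random.randn(d, d)           # frozen pretrained weight
B = np.random.randn(d, r)           # trained low-rank factors
A = np.random.randn(r, d)

x = np.random.randn(d)

# during training / unmerged inference: two forward paths
y_unmerged = W @ x + (B @ A) @ x

# before deployment: fold the update into a single matrix, once
W_merged = W + B @ A
y_merged = W_merged @ x             # one matmul, same cost as the base model

print(np.allclose(y_unmerged, y_merged))  # True
```

The unmerged form is what you train with; the merged form is what you deploy, which is why merged LoRA adds no extra inference latency.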
Thank you sire
my pleasure Rahul! :-)
Bro, your statement from 05:22 is completely wrong and misleading.
LoRA is used for fine-tuning LLMs when full fine-tuning is not possible. It does so by freezing all model weights and adding and training low-rank matrices (A*B) in the attention modules.
LoRA speeds up training and reduces memory requirements, but it does not provide a speedup during inference. If the LLM is too large to be handled by LoRA due to GPU memory limitations, quantized LoRA is used to fine-tune the model. Overall, QLoRA is the more advanced solution when LoRA alone cannot handle large models for fine-tuning.
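For readers who want to see the freezing idea concretely, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch. It is an illustration under assumed names and hyperparameters, not the implementation used in the video or in the peft library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative only).

    The pretrained weight stays frozen; only the small matrices A and B
    are trained, so the trainable parameter count drops from
    d_out*d_in to r*(d_in + d_out).
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, r))        # trainable, zero init
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + low-rank update path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# toy usage: wrap one projection the way you would an attention projection
layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```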
Thanks for your feedback. I think we are pretty much on the same page. Can you be more specific about what I got wrong? Unfortunately I won't be able to edit the video, but I can at least pin a message to viewers pointing out the errata.