You seem to be the kind of ai expert that I am trying to become. Very impressive.
At least as good as, and at times better than, every other equivalent tutorial on the subject at this time.
Thanks so much for the kind words!
Mind blown into 3 billion pieces
Hi Chris! Thanks for the course. I want to learn more. May God bless you 🤲
Very Cool
Hugging Face has done so much of the heavy lifting for us; they are actually amazing.
Also, when I first heard about LoRA I thought the implementation was complicated (utilizing some efficient SVD or other numerical method to achieve the decomposition of the full weight-update matrix). It turns out it literally just starts with the two smaller matrices and backprop does all the work, lol.
Backprop coming to the rescue again!
Oh, Wise Wizard, I bow before your might. Please, continue to guide me.
Ok, you got me absolutely amused by the results.
Also, thanks for showing that there's a LoRA library out there: I tried to do it on my own.
Very intuitive! I didn't even yawn throughout the whole video lol. Keep up the good work! :)
Great video! So much passion, love it.
Thanks for the video, it was very useful and clear
Subscribed and Thumbs up, appreciate the videos.
Astonishing content Man 🚀
WoW! You are amazing man!
Hi Chris! Thank you so much for such a detailed tutorial. Loved every bit of it. In a time of big corporations trying to monopolise the technology, people like you give hope and knowledge to so many others! Really appreciate it. You've made the LoRA tutorial easy to understand.
Just had a question. I guess you have answered it in some way already, but just wanted to confirm. GPT-2 is somewhat old, so does this method apply to GPT-2 also? I mean, can we use a GPT-2 model instead of Bloom?
You can use LoRA with anything that has weight matrices!
Thank you!!!
Amazing set of videos!
Can you please update on the model that is doing text-to-SQL that you've mentioned? This is very important to me :)
Who, what, where, why and when. I am grateful for your video. Please can you give a use case, and start your videos with the end in mind.
Absolutely! I'll seek to do that going forward!
@@chrisalexiuk Thanks for not taking offence. I have ASD and ADHD, so it's super hard to focus without an idea of what you are making and what problem you are trying to solve. Apologies for the directness.
Informative video. Can you suggest some GPU compute resources? The aim is to implement the learnings, so I would like to know the cheapest possible option.
Lambda Labs has great prices right now, otherwise Colab Pro is an affordable and flexible option.
Trying to get some people interested in product development and modification, but they have a requirement that material can't leave the building. That means no internet; everything has to be done on our machines in-house, and we can't share it with Colab or anybody. It would be nice if you did more shows on keeping complete control of material, because there are so many people who are just scared to death of breaches.
what is meant by causal language model? I assume it has nothing to do with the separate field of Causal AI.
A causal language model is a model that predicts the next token in the series. It only looks at tokens on the "left" or "backward" and cannot see future tokens.
It's confusing because, as you noted, it has nothing to do with Causal AI.
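That "left-only" view can be sketched as a causal attention mask. This is a toy illustration (not the actual transformer implementation): row i is a query position, and a 1 means that position is allowed to attend to the column's position.

```python
# Toy causal mask: position i may attend only to positions j <= i.
# Anything to the "right" (the future, j > i) is masked out with a 0.
def causal_mask(seq_len):
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each token's prediction can only use the tokens that came before it, which is exactly what "causal" means here.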
What are the ways you mentioned to more efficiently teach a model new knowledge rather than new structures?
You'd be looking at something like continued pre-training. I perhaps misspoke by saying "more efficient", I meant to convey that LoRA might not be the best solution for domain-shifting a model - and so there are more *effective* ways to domain-shift.
Hi Chris, about the block of code with `model.gradient_checkpointing_enable()`, which increases the stability of the model: have you made any previous videos where I can read and learn about this? If not, are there any resources you would recommend to understand it?
Basically, you can think of it this way:
As we need to represent tinier and tinier numbers - we need more and more exponents. There are a number of layers which tend toward very tiny numbers, and if we let those layers stay in 4bit/8bit it might have some unintended side-effects. So, we let those layers stay in full precision so as to not encounter those nasty side-effects!
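You can see the underflow problem directly with NumPy: half precision has far fewer exponent bits than full precision, so very small magnitudes simply vanish.

```python
import numpy as np

# fp16 has far fewer exponent bits than fp32, so very small magnitudes
# underflow to zero -- exactly the failure mode described above.
tiny = 1e-8
print(np.float16(tiny))   # prints 0.0 -- underflows in half precision
print(np.float32(tiny))   # still representable in full precision
```

Keeping the tiny-valued layers in full precision avoids exactly this kind of silent loss of information.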
@@chrisalexiuk ohhh ok understood. This was by far the clearest explanation abt this. Thank you!
Thanks for sharing this tutorial. I get 'IndexError: list index out of range' when reading from the hub. I just copied and pasted the code; it happens at the 6th progress bar. Any solution? Model: bloom-1b7
Could you share with me your notebook so I can determine what the issue is?
Hi Chris, can you make a video on this or give me some pointers?
Scaling with LangChain: how to have multiple sessions with an LLM, meaning how to have a server with the LLM and serve multiple people concurrently. What would the system requirements be to run such a setup? I believe we will need Kubernetes for the scaling.
You'll definitely need some kind of load balancing/resource balancing. I'll go over some more granular tips/tricks in a video!
Amazing work. Can you put up something similar for fine-tuning MPT-7B model?
I switched the model to MPT-7B, but I keep getting this error during training: "TypeError: forward() got an unexpected keyword argument 'inputs_embeds'". I am scratching my head but can't seem to figure out what went wrong.
Sure, I can try and do this!
Thanks for the great video! So what is the better way to teach a model new knowledge, if fine-tuning is somehow only good for structure? Thanks much!
Continued Pre-Training or Domain Adaptive Pre-Training!
Sir, is it possible to share the Colab notebook? For extractive QA, how will we evaluate and compare with other models? For example, how will we implement EM and F1 and compare against other BERT or LLM models?
Yes, sorry, I will be sure to update the description with the Notebook used in the video!
@@chrisalexiuk Thank you, it will be very much appreciated.
@@chrisalexiuk we will appreciate it
colab.research.google.com/drive/1GzHdbIarvnRee_Ix9bdhx1a1v0_G_eqo?usp=sharing
Getting "ValueError: Attempting to unscale FP16 gradients." when running the cell with trainer.train(). Any idea?
I'm getting the same error for "bloom-1b7". Did your problem get resolved?
@@shashankjainm5009 I am getting the same error. Did you fix it?
Thank you for the great tutorial! How do we set that we only want to fine-tune query_key_value and the rest of the weights are frozen?
By using the adapter method, you don't need to worry about that! The base model will remain frozen - and you will not train any model layers.
Any chance you can make an example of fine-tuning Code Llama like this?
I might, yes!
@chrisalexiuk It'd be greatly appreciated. There are almost no implementation docs or examples around for using LoRA 😀
Hey Chris, awesome video! Thank you for it. Can you please help me out here? I am using your notebook, but when I do model.push_to_hub, adapter_config.json and adapter_model.bin are not being uploaded to the Hugging Face Hub; instead I only see
1. generation_config.json
2. pytorch_model.bin
3. config.json
What am i doing wrong here?
I figured out the problem: it was this line after the training: `model = model.merge_and_unload()`
Yes! Sorry, Adil!
We are only pushing the actual *LoRA* weights to the hub - and merging the model back will mean that the entire model will be pushed to hub.
Great troubleshooting!
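For intuition, here is a NumPy sketch of what merging does mathematically. The adapter files hold only the two small matrices, while merging folds their product back into the full-sized base weight (the LoRA alpha/r scaling factor is omitted here for brevity, and the dimensions are made up for illustration):

```python
import numpy as np

d_out, d_in, r = 512, 512, 8
W = np.random.randn(d_out, d_in)          # frozen base weight (not pushed as part of the adapter)
A = np.random.randn(r, d_in) * 0.01       # LoRA "down" matrix (trained)
B = np.random.randn(d_out, r) * 0.01      # LoRA "up" matrix (trained)

# The adapter checkpoint stores only A and B -- a tiny fraction of the base weight.
adapter_params = A.size + B.size          # 2 * 8 * 512 = 8,192
base_params = W.size                      # 512 * 512 = 262,144

# Merging folds the low-rank update into the base weight, so what gets
# pushed afterwards is the full-sized matrix, not the small adapter:
W_merged = W + B @ A
print(W_merged.shape)                     # same shape as the full base weight
```

So after `merge_and_unload()`, `push_to_hub` uploads a regular full model checkpoint rather than the small adapter files.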
Amazing.
I tried this on my desktop, which has an NVIDIA GeForce 3060, and I was able to run only 6 steps.
On Windows I wasn't able to run at all, as I am facing some issues with the bitsandbytes library.
Also, I used bloom-1b7.
But after doing the whole exercise, I see that the generated output doesn't stop after CONTEXT, QUESTION and ANSWER; it keeps generating text, which includes EXAMPLE and so on.
Though the notebook adds bitsandbytes at the start using "import bitsandbytes as bnb", bnb is not used anywhere else.
So I thought commenting that line out would make my script work on Windows, but no: even without that line, the script I wrote mimicking your Colab notebook didn't work on Windows.
Can you tell me how the notebook depends on bitsandbytes?
Bitsandbytes is leveraged behind the scenes through the HuggingFace library.
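For intuition, this is roughly the idea behind 8-bit weight loading: a simplified absmax quantization sketch, not bitsandbytes' actual kernels.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric int8 quantization: scale by the largest magnitude."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
# int8 storage is 4x smaller than fp32, at the cost of a small rounding error
```

When you pass the 8-bit loading options to `from_pretrained`, HuggingFace calls into bitsandbytes to store and compute with weights in this kind of compressed form, which is why the library must be importable even if your own code never touches `bnb` directly.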
Tried this, and it's interesting that the 3b/7b1 Bloom models perform WORSE on my test questions after this training than bloom-1b1.
Hmmmm. That's very interesting!
I wonder specifically why, it would be interesting to know!
@@chrisalexiuk I didn't change the other parameters, though. Maybe rank and batch size should be higher for models with higher parameter counts.
@@chrisalexiuk Man, it gets weirder now. I tried doing more steps with a smaller learning rate and a smaller batch size on a bigger model (bloom-3b). It started adding explanation sections and generating, well, explanations.
@@РыгорБородулин-ц1е What learning rate were you using? Is it the same as mentioned in the BLOOM paper? Also, what is the current learning rate?
Great video! What would be different if I downloaded the model locally instead of on Colab? Which lines in the code would change?
You should be able to largely recreate this process locally - but you would need to `pip install` a few more dependencies. You can find which by looking at what the colab environment has installed - or using a tool like pipreqs!
Do you know how much GPU RAM the meta-llama/Llama-2-70b-chat model would take to fine-tune?
Thanks for explaining the implementation in such an easy way.
I wanted to play around with this, so I used the free-tier Google Colab with a T4 GPU and the smaller "bigscience/bloom-1b7" model. The inference method make_inference(context, question) is giving me the error below. Is this because of using the free-tier GPU? Training and all the previous steps executed without any issues. Would be great if you could shed some light on this!
Error :
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
Hi, a question: can we use LoRA just to reduce the size of a model and run inference, or do we always have to train it?
LoRA will not reduce the size of the model during inference. It actually adds a very small amount extra - this is because the memory savings come from reduced number of optimizer states.
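The "very small amount extra" is easy to quantify with back-of-the-envelope arithmetic (the dimensions below are hypothetical, chosen for illustration):

```python
# For one d_out x d_in weight matrix, LoRA adds B (d_out x r) and A (r x d_in).
d_out, d_in, r = 4096, 4096, 8

full = d_out * d_in              # 16,777,216 params in the frozen base matrix
extra = r * (d_in + d_out)       # 65,536 extra LoRA params
print(f"overhead: {100 * extra / full:.2f}%")   # prints "overhead: 0.39%"
```

At inference, the base weights are still all there, plus this sliver of adapter weights; the big savings during training come from only needing optimizer states (and gradients) for the adapter parameters.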
The notebook linked doesn't match the one used in the video. Is the notebook in the video available somewhere?
Thanks, great video!
Ah, so sorry! I'll resolve this ASAP.
I've updated the link - please let me know if it doesn't resolve your issue!
Sorry about that!
Please don't scream 😬
facing KeyError: 'h.0.input_layernorm.bias' when downloading from the hub
Hmmm.
Are you using the base notebook?
@@chrisalexiuk Yeah, I just changed the model to 1b7.
Could you try adding `device_map="auto"` to your `.from_pretrained()` method?
Also, are you using a GPU enabled instance for the Notebook?