This guy knows his stuff
I appreciate it!
Finally! Someone who actually walks through the code line by line with explanations :D Thank you for creating this tutorial
You’re welcome!
Really like this video! Compared to other QLoRA videos out there, there's a lot more detail about what the hyperparameters do, etc. Also thanks for including the different Dataset methods!!
glad it helped!
Prob one of the best walkthroughs, thanks!
No problem!
This guy knows his stuff of course!
Thanks, I was going to try and figure this out myself. Appreciate the walkthrough
For sure!
Great video! Explained everything clearly! The notebook runs smoothly out of the box, LOL!
Loved it. Very clear and very detailed explanation. Following.
This is really well explained.
Thanks!
This is so cool. Would be great to have a good understanding of the timeline for fine-tuning a model in this awesome new way of doing it. Especially how it relates to the size of a given dataset
That's also going to depend on factors such as model size and context length. For a smaller model, the fine-tuning will be much faster. Last night I was fine-tuning a 5 billion parameter model on a sequence-to-sequence dataset of ~50,000 rows, and it looked like it would take 36 hours on my RTX 4090. But this varies a lot. Ideally, it would be good to find ways to fine-tune models so that it can be done in less time.
The QLoRA paper says:
"Using QLORA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna [10] benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark."
So I certainly think it is possible to fine-tune a model and get great performance in 12-24 hours.
Hi Luke, does this work for other models like Llama 2? And by doing it this way, is everything kept locally on your own machine? I'm specifically thinking about the dataset. If I have a custom dataset that I do not want uploaded to any of Hugging Face's servers, will this approach ensure that? From the code in your video, it doesn't seem to utilize any external servers for fine-tuning, correct?
Love your tutorial on QLoRA. Would you happen to have a tutorial on fine-tuning an LLM with PDF or DOCX files? I already tried vector embeddings with the Falcon-7B model to embed some of my local data files, but the output was not good. I wanted to see if the output would be better with fine-tuning.
I'm pondering the best way to structure datasets for fine-tuning a language model. Does the structure depend on the specific LLM used, or is it more about the problem we aim to solve? For instance, if I want to train a model on my code repository and my dataset includes columns like repository name, path to file, code, etc., would it enable the model to answer specific questions about my repository?
Great information. My big question right now is how I can take the data I'm interested in fine-tuning with and turn it into a format that will work best for QLoRA.
Are you asking about the process of creating a dataset? Or just formatting a dataset that is already created?
I think the formatting shouldn’t be too challenging. But creating a good dataset for a new use case might be more challenging
@@lukemonington1829 I'm talking about creating a dataset. I don't know how it's supposed to be formatted or how much it needs to be cleaned up to be high enough quality.
One question: in the fine-tuning dataset, the quote is {"quote":"“Ask not what you can do for your country. Ask what’s for lunch.”","author":"Orson Welles","tags":["food","humor"]}, so why does the output after fine-tuning still show the quote from JFK?
Awesome, but how do you merge the LoRA into the model like in the original 4-bit repo?
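For anyone else wondering about this, one common route is peft's merge_and_unload; here is a minimal sketch, not code from the video, with placeholder model/adapter paths, and assuming the base weights are loaded in a non-quantized dtype:

```python
# Minimal sketch of merging a LoRA adapter back into its base model with peft.
# Paths/names are placeholders; merging is done on non-quantized (e.g. fp16)
# base weights, not the 4-bit weights used during QLoRA training.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",               # placeholder base model
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder adapter path
merged = model.merge_and_unload()        # fold the LoRA weights into the base weights
merged.save_pretrained("merged-model")   # save a standalone merged checkpoint
```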
If I wanted to fine-tune the model on a language that the original training data likely didn't include, how much text would I need and how would I approach that?
Well, there are multiple considerations that go with that. For example, what kind of task are you looking to do in the other language? Is it just next-token prediction? Or are you looking to do something like sequence-to-sequence translation, where you give it input in one language and it translates into another? In a case like that, you would set up your training pipeline slightly differently from the way it is done in the video. This would be done by designing a dataset that has an input and an output, and then maybe creating a pipeline using AutoModelForSeq2SeqLM.
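To make that concrete, here is a minimal sketch of what such a seq2seq setup could look like; this isn't from the video, and the model name and the "input"/"output" column names are placeholder assumptions.

```python
# Minimal sketch of a sequence-to-sequence fine-tuning setup (not from the
# video). The model name and the "input"/"output" column names are
# placeholders; swap in your own model and dataset columns.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # Source text becomes the encoder input
    model_inputs = tokenizer(example["input"], truncation=True, max_length=512)
    # Target text becomes the labels the decoder is trained to generate
    labels = tokenizer(text_target=example["output"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = dataset.map(preprocess)  # then train with Seq2SeqTrainer or Trainer
```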
The amount of data required would also depend on the task and on the size of the model. Some of the most powerful LLMs have an emergent capability where they have already picked up additional languages they were not explicitly meant to learn. Additionally, there are many LLMs available on the HuggingFace hub that are already trained to work with multiple languages. So, depending on your use case, you can pick the right model accordingly. The larger models are much more powerful. Usually I try to use the largest LLM possible, since that will get the best response. I notice a huge jump in performance as I move from using a 3B model to a 7B and up to 30B+.
To answer your question about how much text would be required, this can vary, but on the OpenAI website, they say: "The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality." But I think if you were teaching an LLM that had no prior exposure to a new language, it would likely take much more than a few hundred examples.
@@lukemonington1829 thanks for the elaborate response.
Actually, the language I want to train it on (it's Persian, btw) is already generated by some of the free language models; however, they don't produce a single error-free sentence at the moment, and even GPT-4 makes many mistakes writing it.
As my country is lacking teachers, I was thinking about training it to explain educational material in a back-and-forth fashion where students can ask anything. So it will be similar to training an assistant, I think.
The fact that any kind of training can be done with just a few hundred examples is great news to me. I was expecting much higher numbers, as I am very new to all this.
Thanks
@@sinayagubi8805 I'm going to be posting my next video (hopefully tomorrow) where I'll touch on this point a little bit. The video will be a practical guide to using LLMs.
This falls into the machine translation task of text generation. It turns out that for general machine translation tasks, using a regular LLM (without fine-tuning) is better. But when it comes to low-resource machine translation, where there is a limited amount of parallel corpora available, fine-tuning the model can lead to a large increase in performance
What about using a vector DB like Chroma for the dataset? Is that possible in this case?
@@lukemonington1829 cool, eagerly waiting for it.
Is there a version of this tutorial that isn't using a Jupyter notebook? I hate working with these
Hi, could you show this with a different model, like a LLaMA model, and different datasets? I tried to change the model to a LLaMA model and got a bunch of errors in the tokenizer. Also, it would be great if you made a video about how to prepare and clean datasets for training! Thanks
I appreciate the suggestions! Yeah, I think that would be good. I'm looking into preparing and cleaning my own datasets too. I might make a video on that!
And yes, it looks like it doesn't work with all models. So far, I've tested it with Pygmalion and WizardLM, and I'll be testing others in the future. I tried LLaMA as well and also got errors. Also, I was initially having trouble getting this running locally, so I ended up creating a Dockerfile and docker-compose, and that solved my problem. Let me know if it would be helpful for me to post the code on GitHub for that!
Thinking about it a little more, it should definitely work with the LLaMA model, since the Guanaco model family was fine-tuned based on the LLaMA model. There must be a way to do it.
@@lukemonington1829 Thank you
Great video! Is it compatible with WizardLM or Pygmalion? Or do we always need the base models unquantized? If so, a list of LLM suggestions would be really helpful and awesome.
I just tested it with Pygmalion and it worked for me. Just make sure to change your LoraConfig so that target_modules=None. This way, peft will figure out the correct target_modules based on the model_type.
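For reference, a minimal sketch of a LoraConfig set up that way; the r / alpha / dropout values here are just illustrative, not necessarily the exact ones from the video:

```python
# Minimal LoraConfig sketch with target_modules=None so peft infers the
# correct modules from the model_type. Hyperparameter values are illustrative.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                  # LoRA rank
    lora_alpha=32,        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=None,  # let peft pick target modules based on model_type
)
```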
Also, I just tested with ehartford/WizardLM-7B-Uncensored and that one worked as well. The base models do not need to be quantized in order for this to work. It is possible to load an unquantized model in 4-bit now with this technique. It doesn't work with all models, though. Also, I was initially having trouble getting this running locally, so I ended up creating a Dockerfile and docker-compose, and that solved my problem. Let me know if I should post the code on GitHub for that!
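For anyone trying this, here is a minimal sketch of loading an unquantized checkpoint in 4-bit; the quantization settings are illustrative defaults rather than the precise values from the video.

```python
# Minimal sketch of loading a full-precision checkpoint in 4-bit with
# transformers + bitsandbytes. The quantization settings below are
# illustrative defaults, not necessarily the exact ones from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ehartford/WizardLM-7B-Uncensored"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for speed/stability
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                       # place weights on available GPUs
)
```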
Luke, your presentation is excellent. Your code works up until training is complete. After that, I really need to save it locally. I realize you set this up for running on Colab, but the Python code itself runs fine on Python 3.10. However, if I save with model.save_pretrained(model_id), it fails to load successfully. Could you be so kind as to include a save-locally segment in your video and then demonstrate the ability to load and query that saved model? The code you show after training does not work. Much obliged, amigo!
Amazing video
do you have a video on pre-training an LLM?
No, I don't have any videos on that. Pre-training can be a lot more resource-intensive and requires more GPU VRAM.
With parameter-efficient techniques, fine-tuning can now be done on a single GPU. But when it comes to pre-training from scratch, that requires substantial compute. For example, MPT-7B was trained on 440 GPUs over the course of 9.5 days.
May be a discussion on what issues arise with scaling. Like parallelism.
@@user-wr4yl7tx3w The parallelism is an interesting point. I remember reading the research papers as these transformers were growing.
Initially, the entire model could fit on a single GPU. But then the models grew too large to fit on a single GPU, so they had to find a way to split a model onto multiple GPUs by putting a layer onto each GPU (or some other splitting method).
Then the models kept growing and got to a size where a single layer couldn't fit on a single GPU. So that required further advancements so that a part of a layer could fit on each GPU.
This all ties back to the hardware problem of AI: the fact that model size has increased by 1000x+ while GPU VRAM size has only increased by ~3x. I was thinking about creating a video on that. It might be an interesting video.
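As a small illustration of the layer-splitting idea (not from the video), the transformers + accelerate integration can already do a naive version of this, assigning whole layers to whichever GPUs are available; the model name below is only an example.

```python
# Naive layer splitting across devices via device_map="auto" (requires the
# accelerate package). "facebook/opt-6.7b" is only an example model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    device_map="auto",       # whole layers are assigned to GPUs (and CPU if needed)
)
print(model.hf_device_map)   # shows which device each layer ended up on
```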
Yay! 🎉
😎
I'm desperately searching for this type of content
Don't suppose you know how to modify the code to work on a TPU?
Hmm 🤔 I haven’t done any work with the TPUs, so I’m not sure on this one
captain?
This looks like a website you are using, which is the opposite of fine-tuning something locally.
sauce?
So that is the easy way to do this ...