This guy knows his stuff
I appreciate it!
Finally! Someone who actually walks through the code line by line with explanations :D Thank you for creating this tutorial
You’re welcome!
Really like this video! Compared to other QLoRA videos out there, there's a lot more detail about what the hyperparameters do, etc. Also thanks for including the different Dataset methods!!
glad it helped!
Prob one of the best walkthroughs, thanks!
No problem!
This guy knows his stuff of course!
Thanks, I was going to try and figure this out myself. Appreciate the walkthrough
For sure!
Great video! Explained everything clearly! The notebook runs smoothly out of the box, LOL!
Loved it. Very clear and very detailed explanation. Following.
This is really well explained.
Thanks!
This is so cool. Would be great to have a good understanding of the timeline for fine-tuning a model in this awesome new way of doing it. Especially how it relates to the size of a given dataset
That's also going to depend on factors such as model size and context length. For a smaller model, the fine-tuning will be much faster. Last night I was fine-tuning a 5 billion parameter model on a sequence-to-sequence dataset of ~50,000 rows, and it looked like it would take 36 hours on my RTX 4090. But this varies a lot. Ideally, it would be good to find ways to fine-tune models so that it can be done in less time.
The QLoRA paper says:
"Using QLORA, we train the Guanaco family of models, with the second best model reaching 97.8% of the performance level of ChatGPT on the Vicuna [10] benchmark, while being trainable in less than 12 hours on a single consumer GPU; using a single professional GPU over 24 hours we achieve 99.3% with our largest model, essentially closing the gap to ChatGPT on the Vicuna benchmark."
So I certainly think it is possible to fine-tune a model and get great performance in 12-24 hours.
Hi Luke, does this work for other models like Llama 2? And by doing it this way, is everything kept locally on your own machine? I'm specifically thinking about the dataset. If I have a custom dataset that I do not want uploaded to any of Hugging Face's servers, will this approach ensure that? From the code in your video, it doesn't seem to utilize any external servers for fine-tuning, correct?
Love your tutorial on QLoRA. Would you happen to have a tutorial on fine-tuning an LLM with PDF or DOCX files? I already tried vector embeddings with the Falcon-7B model to embed some of my local data files, but the output was not good. I wanted to see if the output would be better with fine-tuning.
I'm pondering the best way to structure datasets for fine-tuning a language model. Does the structure depend on the specific LLM used, or is it more about the problem we aim to solve? For instance, if I want to train a model on my code repository and my dataset includes columns like repository name, path to file, code, etc., would it enable the model to answer specific questions about my repository?
Great information. My big question right now is how I can take the data I'm interested in fine-tuning with and turn it into a format that will work best for QLoRA.
Are you asking about the process of creating a dataset? Or just formatting a dataset that is already created?
I think the formatting shouldn’t be too challenging. But creating a good dataset for a new use case might be more challenging
@@lukemonington1829 I'm talking about creating a dataset. I don't know how it's supposed to be formatted or how much it needs to be cleaned up to be high enough quality.
One question: in the fine-tuning dataset, the quote is {"quote":"“Ask not what you can do for your country. Ask what’s for lunch.”","author":"Orson Welles","tags":["food","humor"]}, so why does the output after fine-tuning still show the quote from JFK?
Awesome, but how do you merge the LoRA into the model like in the original 4-bit repo?
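For anyone else wondering about this, one common route is peft's merge_and_unload; here is a minimal sketch, not code from the video, with placeholder model/adapter paths, and assuming the base weights are loaded in a non-quantized dtype:

```python
# Minimal sketch of merging a LoRA adapter back into its base model with peft.
# Paths/names are placeholders; merging is done on non-quantized (e.g. fp16)
# base weights, not the 4-bit weights used during QLoRA training.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",               # placeholder base model
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder adapter path
merged = model.merge_and_unload()        # fold the LoRA weights into the base weights
merged.save_pretrained("merged-model")   # save a standalone merged checkpoint
```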
If I wanted to fine-tune the model on a language that the original training data likely didn't include, how much text would I need and how would I approach that?
Well, there are multiple considerations that go with that. For example, what kind of task are you looking to do in the other language? Is it just next-token prediction? Or are you looking to do something like sequence-to-sequence translation, where you give it input in one language and it translates into another? In a case like that, you would set up your training pipeline slightly differently from the way it is done in the video. This would be done by designing a dataset that has an input and an output, and then maybe creating a pipeline using AutoModelForSeq2SeqLM.
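To make that concrete, here is a minimal sketch of what such a seq2seq setup could look like; this isn't from the video, and the model name and the "input"/"output" column names are placeholder assumptions.

```python
# Minimal sketch of a sequence-to-sequence fine-tuning setup (not from the
# video). The model name and the "input"/"output" column names are
# placeholders; swap in your own model and dataset columns.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # Source text becomes the encoder input
    model_inputs = tokenizer(example["input"], truncation=True, max_length=512)
    # Target text becomes the labels the decoder is trained to generate
    labels = tokenizer(text_target=example["output"], truncation=True, max_length=512)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# tokenized = dataset.map(preprocess)  # then train with Seq2SeqTrainer or Trainer
```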
The amount of data required would also depend on the task and on the size of the model. Some of the most powerful LLMs have an emergent capability where they have already picked up additional languages they were not explicitly meant to learn. Additionally, there are many LLMs available on the HuggingFace hub that are already trained to work with multiple languages. So, depending on your use case, you can pick the right model accordingly. The larger models are much more powerful. Usually I try to use the largest LLM possible, since that will get the best response. I notice a huge jump in performance as I move from using a 3B model to a 7B and up to 30B+.
To answer your question about how much text would be required, this can vary, but on the OpenAI website, they say: "The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality." But I think if you were teaching an LLM that had no prior exposure to a new language, it would likely take much more than a few hundred examples.
@@lukemonington1829 thanks for the elaborate response.
Actually, the language I want to train it on (it's Persian, btw) is already generated by some of the free language models; however, they don't produce a single error-free sentence at the moment, and even GPT-4 makes many mistakes writing it.
As my country is lacking teachers, I was thinking about training it to explain educational material in a back-and-forth fashion where students can ask anything. So it will be similar to training an assistant, I think.
The fact that any kind of training can be done with just a few hundred examples is great news to me. I was expecting much higher numbers, as I am very new to all this.
Thanks
@@sinayagubi8805 I'm going to be posting my next video (hopefully tomorrow) where I'll touch on this point a little bit. The video will be a practical guide to using LLMs.
This falls into the machine translation task of text generation. It turns out that for general machine translation tasks, using a regular LLM (without fine-tuning) is better. But when it comes to low-resource machine translation, where there is a limited amount of parallel corpora available, fine-tuning the model can lead to a large increase in performance
What about using a vector DB like Chroma for the dataset? Is that possible in this case?
@@lukemonington1829 cool, eagerly waiting for it.
Is there a version of this tutorial that isn't using a Jupyter notebook? I hate working with these
Hi, could you show this with a different model, like a LLaMA model, and different datasets? I tried to change the model to a LLaMA model and got a bunch of errors in the tokenizer. Also, it would be great if you made a video about how to prepare and clean datasets for training! Thanks
I appreciate the suggestions! Yeah, I think that would be good. I'm looking into preparing and cleaning my own datasets too. I might make a video on that!
And yes, it looks like it doesn't work with all models. So far, I've tested it with Pygmalion and WizardLM, and I'll be testing others in the future. I tried LLaMA as well and also got errors. Also, I was initially having trouble getting this running locally, so I ended up creating a Dockerfile and docker-compose, and that solved my problem. Let me know if it would be helpful for me to post the code on GitHub for that!
Thinking about it a little more, it should definitely work with the LLaMA model, since the Guanaco model family was fine-tuned based on the LLaMA model. There must be a way to do it.
@@lukemonington1829 Thank you
Great video! Is it compatible with WizardLM or Pygmalion? Or do we always need the base models unquantized? If so, a list of LLM suggestions would be really helpful and awesome.
I just tested it with Pygmalion and it worked for me. Just make sure to change your LoraConfig so that target_modules=None. This way, peft will figure out the correct target_modules based on the model_type.
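For reference, a minimal sketch of a LoraConfig set up that way; the r / alpha / dropout values here are just illustrative, not necessarily the exact ones from the video:

```python
# Minimal LoraConfig sketch with target_modules=None so peft infers the
# correct modules from the model_type. Hyperparameter values are illustrative.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                  # LoRA rank
    lora_alpha=32,        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=None,  # let peft pick target modules based on model_type
)
```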
Also, I just tested with ehartford/WizardLM-7B-Uncensored and that one worked as well. The base models do not need to be quantized in order for this to work. It is possible to load an unquantized model in 4-bit now with this technique. It doesn't work with all models, though. Also, I was initially having trouble getting this running locally, so I ended up creating a Dockerfile and docker-compose, and that solved my problem. Let me know if I should post the code on GitHub for that!
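For anyone trying this, here is a minimal sketch of loading an unquantized checkpoint in 4-bit; the quantization settings are illustrative defaults rather than the precise values from the video.

```python
# Minimal sketch of loading a full-precision checkpoint in 4-bit with
# transformers + bitsandbytes. The quantization settings below are
# illustrative defaults, not necessarily the exact ones from the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ehartford/WizardLM-7B-Uncensored"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for speed/stability
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                       # place weights on available GPUs
)
```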
Luke, your presentation is excellent. Your code works up until training is complete. After that, I really need to save it locally. I realize you set this up for running on Colab, but the Python code itself runs fine on Python 3.10. However, if I save with model.save_pretrained(model_id), it fails to load successfully. Could you be so kind as to include a save-locally segment in your video and then demonstrate the ability to load and query that saved model? The code you show after training does not work. Much obliged, amigo!
Amazing video
do you have a video on pre-training an LLM?
No, I don't have any videos on that. Pre-training can be a lot more resource-intensive and requires more GPU VRAM.
With parameter-efficient techniques, fine-tuning can now be done on a single GPU. But when it comes to pre-training from scratch, that requires substantial compute. For example, MPT-7B was trained on 440 GPUs over the course of 9.5 days.
May be a discussion on what issues arise with scaling. Like parallelism.
@@user-wr4yl7tx3w The parallelism is an interesting point. I remember reading the research papers as these transformers were growing.
Initially, the entire model could fit on a single GPU. But then the models grew too large to fit on a single GPU, so they had to find a way to split a model onto multiple GPUs by putting a layer onto each GPU (or some other splitting method).
Then the models kept growing and got to a size where a single layer couldn't fit on a single GPU. So that required further advancements so that a part of a layer could fit on each GPU.
This all ties back to the hardware problem of AI: the fact that model size has increased by 1000x+ while GPU VRAM size has only increased by ~3x. I was thinking about creating a video on that. It might be an interesting video.
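As a small illustration of the layer-splitting idea (not from the video), the transformers + accelerate integration can already do a naive version of this, assigning whole layers to whichever GPUs are available; the model name below is only an example.

```python
# Naive layer splitting across devices via device_map="auto" (requires the
# accelerate package). "facebook/opt-6.7b" is only an example model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    device_map="auto",       # whole layers are assigned to GPUs (and CPU if needed)
)
print(model.hf_device_map)   # shows which device each layer ended up on
```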
Yay! 🎉
😎
I'm desperately searching for this type of content
Don't suppose you know how to modify the code to work on a TPU?
Hmm 🤔 I haven’t done any work with the TPUs, so I’m not sure on this one
captain?
This looks like a website you are using, which is the opposite of fine-tuning something locally.
sauce?
So that is the easy way to do this ...