UPDATE APRIL 24th 2024
VRAM requirements have been greatly reduced by adding gradient checkpointing (all figures below are for 16-bit training):
LLaVA 1.5
- liuhaotian/llava-v1.5-7b takes a minimum of 24 GB to train and will run on a single A6000.
- liuhaotian/llava-v1.5-13b REQUIRES VRAM OF
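For reference, here is a minimal sketch of how gradient checkpointing is typically enabled on a transformers-style model (the actual training scripts are in the ADVANCED-vision repo; `model` here is assumed to be the loaded LLaVA model):

```python
# Minimal sketch, assuming `model` is a loaded transformers PreTrainedModel:
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save VRAM
model.enable_input_require_grads()     # keep gradients flowing to adapter weights when checkpointing is on
model.config.use_cache = False         # the KV cache is incompatible with gradient checkpointing during training
```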
Nice video. Where do you work in Dublin? I am from India and I want to work at your company. I have a master's degree in AI. I am currently working at an Indian company, but they are not providing remote work and the pay is also low. So please let me know if there is something for me.
Most underrated YouTube channel
Thank you so much for this! I’m a student working on a research project involving fine tuning and I’m kind of shocked how relevant this is to me. Thanks for sharing so much knowledge and then going through the code too!
You’re welcome
One thing to note, it took 9x A6000s for me, as 7 caused CUDA to run out of memory. Nevertheless, this is the best channel to learn how to fine-tune models and it is worth buying the repos.
Interesting - which model, the 34B? And did you change batch size or context length?
@@TrelisResearch I used the 34B, and didn’t change the configurations. I’m sure that I could have gotten away with 8 GPUs but 7 ran a bit short.
Thank you, professor. I do think there is a high probability of gaining a large audience with such nuanced content, and I hope you earn loads of money because you are a professional. Thank you!
Reminds me of the Person of Interest episode called "If-Then-Else" where "The Machine" has to make a choice among nearly infinite possibilities. Great show for ML enthusiasts.
Great video. Is there a public repository where we can access the code/data for this?
Howdy, I've put some public links in the description to materials that might be useful.
The detailed scripts are in trelis.com/ADVANCED-vision , which you can access via a one-off lifetime membership.
Best LLaVA tune!
51 minutes of your life well worth it. Kudos. I learned a lot. Quick question - got any vids or tips on making something like LLaVA from scratch? Moondream's a good example. I have watched your other vids, but those are more about fine-tuning, like this one. I wanna grasp the whole process of merging models, building the adapter, training it, and shipping a new multimodal version of the original language model. Thx again
I guess you watched the moondream video I made, right? That's a start.
Yeah building from scratch is a bit more involved as you have to make the loading scripts. Again, the moondream model repo is a good place to look and get inspiration. I may get around to building from scratch at some point.
Amazing content!!!
Thanks a lot, I really enjoyed watching this masterpiece
Cheers
Hello
Are embeddings expert resamplers? I just read about the Prismer VLM
Amazing video! Thank you for sharing!❤
Thanks again for another great video and tutorial. How much effort would it require to swap out your code to work with Mixtral 8x7b? I assume it isn't as trivial as swapping out the model name and fine-tuning. Do you foresee any issues with combining these with Instruct models instead of the base chat models?
Good Q. I don’t think it would take much work, although - Mixtral doesn’t quite fit on a single A100, so training will be slower. Maybe 24 hours on 8 A100s…
Btw I’m also just fine tuning so if you wanted to swap in Mixtral, it’s maybe better to use the original code.
Thank you for your video! I'm just starting in AI and it helped me a lot.
Great materials! Do you have this repo on your GitHub?
Yup, this is in the ADVANCED-vision repo for purchase from trelis.com/ADVANCED-vision
Great video! I have fine-tuned a Llama 2 model on a V100 previously, but I'm wondering if a model like llava-v1.6-mistral-7b on Hugging Face would be too large to fit in the 16 GB available on the V100? Any suggestions on how to figure out how much VRAM a model requires? It doesn't seem to be too obvious a lot of the time from the documentation.
Yeah, so Llama 7B has 7B parameters and in 16-bit, that's two bytes per parameter, so you need about 14 GB of VRAM to load the model, plus some headroom for kv cache for context length.
For LLaVA you additionally need space for the image model AND you need space for the kv cache for the images. The vision model is quite small - a few hundred GB in size - so that shouldn't add much. I see on the repo that the files are around 16 GB in total for model plus vision.
However, the vision model is cast up to 32-bits, so that can also double its size. All in all - in 16-bit - it won't be possible to fit in 16 GB of VRAM unless you do quantization. There's a flag to set that, but it's not stable and I had issues trying it. Basically, the LLaVA 1.6 model is not well supported in HuggingFace, so custom scripts are needed like I showed in the video here.
However, you can train llava 1.5 with 4-bit quantization and that should fit in your V100.
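As a rough rule of thumb, the arithmetic above can be sketched like this (weights only; the KV cache, activations and any optimizer state come on top):

```python
def weight_vram_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just to hold the model weights."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(weight_vram_gb(7))     # ~13 GB for a 7B model in 16-bit (2 bytes per parameter)
print(weight_vram_gb(7, 4))  # ~26 GB if the same weights get cast up to 32-bit
```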
Thank you for taking the time to reply! I assume you meant a few hundred MB for the vision model? That's interesting on the differences between training 1.5 vs 1.6 currently. Do you think there might be some more out-of-the-box approaches to fine-tuning 1.5 or would it still require more custom scripts like yours?
@@SeánCarmody-y3p oops, yes, hundreds of MB.
Actually I just tested 1.6 again yesterday and I think it should be ok with about 24 GB of VRAM.
Regarding more out-of-the-box, I'm a bit puzzled why this hasn't happened, and it's been a month or so now, perhaps we'll just have to look towards the next model.
@@TrelisResearch correct me if I'm wrong: from what you stated above, you mean your script can fine-tune 1.6 on a 24 GB 3090?
@@nguyenhoangnam in principle it should be possible, but in practice the scripts for 1.6 take quite a bit more work. There are some notes on trelis.com/advanced-vision
❤❤❤❤❤ Kindly also create a video on fine-tuning HiFi-GAN for natural speech synthesis.
Hi, love the content btw. Do you think fine-tuning Phi-2 with this approach might be a good idea, similar to what moondream does? And will this same script work for Phi-2?
Yes, in principle that would work, although you would need to instantiate the model correctly, swapping in Phi for Mistral/Llama
Man this helps me a lot. Thanks ❤
@jacekb4057 hi, did you create a notebook related to this?
Hi Ronan, once the model is trained, can we ask it to generate an image of a wooden rook or a black/white rook? Or is this model just classifying whether it is a rook or a king piece?
Nice question.
The model is just classifying/describing. To go the other direction (generation) you need a diffusion model that basically starts with features and then renders and smooths those out.
How many examples do you need to fine-tune LLaVA to get better results? 100 examples? What's the minimum number?
It depends on how broad the concepts are that you're aiming to build into the model. For a very narrow fine-tune, it's possible that just 25 images might be enough. You can get a rough sense from the video here and this application.
Now, if you additionally wanted to train on other board games, you'd need quite a few more examples.
Hi, do you think it's possible to fine-tune the LLaVA-OneVision model with traffic data for traffic scene understanding in the case of autonomous vehicles? I am working on a similar project and would love your take on this. Also, if you don't mind sharing, is there any cloud-based service that provides low-cost GPUs for fine-tuning these models?
Yes, it should be - although these models are quite big compared to what is used for autonomous driving (I think) because you need very fast inference.
Vast.ai is probably lowest cost but Runpod has a better UI. You can see a few tips (see fine-tuning) in the github.com/TrelisResearch/one-click-llms repo
Amazing and very informative. Can you please also show us how to fine-tune LLaVA 1.5?
Same approach! I used the same script!
Very well explained.
Thank you for sharing, very interesting.
Wow, your trained model summarizing given pictures is very impressive and fast.
What type of hardware is behind the scenes handling all your site?
have a great day. 🙂
I'm running on A6000s on runpod! See: github.com/TrelisResearch/install-guides/blob/main/llm-notebook-setup.md
Thank you for this video. I have a question and I would be happy if you could answer it: do you think that multimodal AIs like LLaVA can be fine-tuned for fraud detection in identity documents (passports, ID cards, driver's licenses)?
Yes, this sounds like a good use case.
very informative
On a first watch-through, my impression is that fine-tuning LLaVA takes a much longer script than fine-tuning Llama.
Yeah, it's much longer because you can't use out-of-the-box trainers with default data preparation (the preparation of prompts for a model that takes images as well as text is different).
Probably out of the box will come, but will take some time.
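To illustrate why the preparation differs (this is only a rough sketch in the LLaVA 1.5 style, not the repo's actual code), each training row has to interleave an image placeholder with the text and mask the prompt tokens out of the loss:

```python
IGNORE_INDEX = -100      # positions with this label are excluded from the loss
IMAGE_TOKEN = "<image>"  # LLaVA-style placeholder that gets expanded into image features

def build_example(tokenizer, question: str, answer: str) -> dict:
    # Vicuna-style chat template used by LLaVA 1.5; other versions use different templates.
    prompt = f"USER: {IMAGE_TOKEN}\n{question} ASSISTANT:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(" " + answer + tokenizer.eos_token, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + answer_ids,
        # mask the prompt (including the image placeholder) so only the answer is trained on
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids,
    }
```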
Hey! Are you making something similar for the multimodal-output LLaVA (Interactive)?
Can you please provide the GitHub repo for this code? It would be of much help. Thanks in advance
Howdy! The GitHub repo is available for purchase at trelis.com/advanced-vision . The other option, if that isn't affordable, is to work directly with the LLaVA scripts from the LLaVA GitHub repo, or else take a look at the Idefics 2 model card on Hugging Face and the launch blog post.
Is using rank-stabilized LoRA here a good idea? And is 16 too low a rank for, say, ~1000 examples of fine-tuning data?
Rank-stabilized LoRA should always be a good idea. A rank of 16 might be ok for 1000 rows, but yeah, you could bump up to 32. I don't have exact numbers in mind.
@@TrelisResearch thanks for answering me! I watched your live stream on Lora and qlora which was also quite helpful.
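For concreteness, a sketch of that suggestion with the peft library (the target modules here are just example attention projections; adjust them to the linear layers your model actually exposes):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,             # bumped from 16 for ~1000 rows, per the reply above
    lora_alpha=32,
    use_rslora=True,  # rank-stabilized LoRA scaling (alpha / sqrt(r) instead of alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```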
Can inference be run on our local machines, without the use of a server?
Yes! You can run with llama.cpp, or with MLX if you have a Mac. If you look at my latest video on Tool Use you can see me running models locally
Sorry, I'm realising now that your question is about multi-modal models. This is a bit harder. I think maybe moondream can be run locally but you'll have to look at that video and then look at the moondream github to see if there are options.
Did you run into this error when checking LoraConfig?
ValueError: Target module Sequential(
(0): Linear(in_features=1024, out_features=4096, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=4096, out_features=4096, bias=True)
) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.
peft version: 0.11.2.dev0, running on an NVIDIA A10G.
Everything ran correctly until that part.
Great video!
Seems like you tried to set a module as trainable that is not a linear layer. Just look at your LoRA modules and try commenting them out one by one, or comment them all out and then include them one by one. Use print(model) to see the list of modules.
@@TrelisResearch the layer is the mm_projector (the adapter), which is composed of Sequential(
(0): Linear(in_features=1024, out_features=4096, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=4096, out_features=4096, bias=True)
). Did you train the adapter as a whole without any issues, or just the linear parts of the adapter?
@@semigoso7274 ah yes, you can't do that because the GELU isn't a linear layer.
You have to target "model.mm_projector.0" and "model.mm_projector.2"
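Put as a sketch (the adapter module names are taken from the error above; the language-model projections listed are just examples), the config ends up targeting the two Linear layers of the adapter individually rather than the Sequential as a whole:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj",      # example language-model projections - adjust to taste
        "model.mm_projector.0",  # first Linear of the adapter
        "model.mm_projector.2",  # second Linear (index 1 is the GELU, which PEFT can't wrap)
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```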
Wouldn't it be easier to load the model however it comes and then loop through all the modules, setting them to bfloat16?
Yeah, now that you say it, I don't see why not. Sounds better
UPDATE: Yeah, I had forgotten that the main reason not to do this is that you need more VRAM to first load everything in float32 (or whatever the default is). So you may OOM
@@TrelisResearch oh wow, I hadn't thought of that. Feels like a lot of hassle; hats off that you pushed through to make it happen.
But upon more thinking:
- Can you not change the number of gpus aft...
- No, no, I'll do one better: either load it in fp16, or if that doesn't work, loop through on the CPU, send one set of parameters at a time to the GPU and convert it to bfloat16, then move on to the next, and so on
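A rough sketch of that loop idea (assuming `model` is already loaded on the CPU in its default dtype; this is not the repo's actual code):

```python
import torch

for name, child in model.named_children():
    child.to(dtype=torch.bfloat16)  # cast this submodule's weights while they are still in CPU RAM
    child.to("cuda")                # then move the already-halved weights to the GPU
    print(f"moved {name} to cuda in bfloat16")
```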
Hi! Can I get the notebook repo for fine-tuning multimodal LLaVA?
Check out Trelis.com/ADVANCED-vision
Awesome
Can you show how to fine-tune Google's Gemma models?
Same approach as in the embeddings vs fine-tuning videos. Btw, I'm unsure Gemma is that good compared to Mistral or OpenChat
Hello, I'm having problems with the image.py file when I try to use a raw image URL. What can I do?
This is the error I have: cannot identify image file
Howdy! If you purchased repo access, it's best to post an issue there.
If you're using a URL, then put the relevant portion of the image.py code into ChatGPT and ask it to adjust it to allow an image OR a URL to be passed as the input.
@@TrelisResearch thank you, really. And one more thing: what is the entire code for fine-tuning on a dataset?
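As a rough illustration of the URL fix suggested above (not the repo's actual image.py), a loader that accepts either a local path or a URL could look like this:

```python
import io
import requests
from PIL import Image

def load_image(path_or_url: str) -> Image.Image:
    if path_or_url.startswith(("http://", "https://")):
        response = requests.get(path_or_url, timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of a confusing PIL error
        return Image.open(io.BytesIO(response.content)).convert("RGB")
    return Image.open(path_or_url).convert("RGB")
```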
Can it work with multiple images in a single prompt?
It can!
Will these fine tuning projects run on Colab Pro (A100) as is?
LLaVA 1.5 will, but LLaVA 1.6 won't for now: the memory requirement to fine-tune is 100 GB. It should be a lot lower, but there is an open comment on the GitHub repo about that high memory usage.
So you need 2x A100s or 3x A6000s
@@TrelisResearch But we can use 4-bit quantization to fine-tune LLaVA 1.6, and that will run on Google Colab, right?
@@DeviGoneMad in principle yes, but I haven't been able to get quantization working with the 1.6 models (as opposed to 1.5). :(
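For what it's worth, a standard 4-bit (QLoRA-style) loading config with transformers looks like the sketch below; as noted, this worked for LLaVA 1.5 but I couldn't get it working for the 1.6 models:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bfloat16
)
# then pass quantization_config=bnb_config to the model's from_pretrained(...) call
```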
May I do all this on Windows?
You can do it on Windows if you have a GPU. If you don't have a separate GPU, then you won't have enough RAM.
Why did you post this video on YouTube when you are trying to sell the repo? Please change the video title.
Howdy! Hopefully you can learn quite a lot without buying the repo.
I don't have ads on this channel and those who do buy repos help to support the channel. That's the business model.
I appreciate everything you share @@TrelisResearch
If you reversed the axes, the queen would be on h5; maybe it's not a standard chess board? I'm not a big chess guy.
Yeah, it's possible that's the mix-up
Oh hell yeah
Hey!! Please, how can I estimate the GPU requirements for fine-tuning the following model using LoRA?
model name : llava-hf/llava-v1.6-mistral-7b-hf
I've just pinned a comment showing memory requirements, you'll see it there.
Thank you so much.