UPDATE APRIL 24th 2024
VRAM requirements have been greatly reduced by adding gradient checkpointing (all figures below are for 16-bit training):
LLaVA 1.5
- liuhaotian/llava-v1.5-7b takes a minimum of 24 GB to train and will run on a single A6000.
- liuhaotian/llava-v1.5-13b REQUIRES VRAM OF
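For reference, here is a minimal sketch of how gradient checkpointing is typically enabled on a transformers-style model (the actual training scripts are in the ADVANCED-vision repo; `model` here is assumed to be the loaded LLaVA model):

```python
# Minimal sketch, assuming `model` is a loaded transformers PreTrainedModel:
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save VRAM
model.enable_input_require_grads()     # keep gradients flowing to adapter weights when checkpointing is on
model.config.use_cache = False         # the KV cache is incompatible with gradient checkpointing during training
```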
Nice video. Where do you work in Dublin? I am from India and I want to work at your company. I have a master's degree in AI. I am currently working at an Indian company, but they are not providing remote work and the pay is also low. So please let me know if there is something for me.
Most underrated YouTube channel
Thank you so much for this! I’m a student working on a research project involving fine tuning and I’m kind of shocked how relevant this is to me. Thanks for sharing so much knowledge and then going through the code too!
You’re welcome
One thing to note, it took 9x A6000s for me, as 7 caused CUDA to run out of memory. Nevertheless, this is the best channel to learn how to fine-tune models and it is worth buying the repos.
Interesting - which model, the 34B? And did you change batch size or context length?
@@TrelisResearch I used the 34B, and didn’t change the configurations. I’m sure that I could have gotten away with 8 GPUs but 7 ran a bit short.
Thank you, professor. I do think there is a high probability of gaining a large audience with such nuanced content, and I hope you earn loads of money because you are a professional. Thank you!
Reminds me of the Person of Interest episode called "If-Then-Else" where "The Machine" has to make a choice among nearly infinite possibilities. Great show for ML enthusiasts.
Great video. Is there a public repository where we can access the code/data for this?
Howdy, I've put some public links in the description to materials that might be useful.
The detailed scripts are in trelis.com/ADVANCED-vision , which you can access via a one-off lifetime membership.
Best LLaVA tune!
51 minutes of your life well worth it. Kudos. I learned a lot. Quick question - got any vids or tips on making something like LLaVA from scratch? Moondream's a good example. I have watched your other vids, but those are more about fine-tuning, like this one. I wanna grasp the whole process of merging models, building the adapter, training it, and shipping a new multimodal version of the original language model. Thx again
I guess you watched the moondream video I made, right? That's a start.
Yeah building from scratch is a bit more involved as you have to make the loading scripts. Again, the moondream model repo is a good place to look and get inspiration. I may get around to building from scratch at some point.
Amazing content!!!
Thanks a lot, I really enjoyed watching this masterpiece
Cheers
Hello
Are embeddings expert resamplers? I just read about the Prismer VLM
Amazing video! Thank you for sharing!❤
Thanks again for another great video and tutorial. How much effort would it require to swap out your code to work with Mixtral 8x7b? I assume it isn't as trivial as swapping out the model name and fine-tuning. Do you foresee any issues with combining these with Instruct models instead of the base chat models?
Good Q. I don’t think it would take much work, although - Mixtral doesn’t quite fit on a single A100, so training will be slower. Maybe 24 hours on 8 A100s…
Btw I’m also just fine tuning so if you wanted to swap in Mixtral, it’s maybe better to use the original code.
Thank you for your video! I'm just starting in AI and it helped me a lot.
Great materials! Do you have this repo on your GitHub?
Yup, this is in the ADVANCED-vision repo for purchase from trelis.com/ADVANCED-vision
Great video! I have fine-tuned a Llama 2 model on a V100 previously, but I'm wondering if a model like llava-v1.6-mistral-7b on Hugging Face would be too large to fit in the 16 GB available on the V100? Any suggestions on how to figure out how much VRAM a model requires? It doesn't seem to be too obvious a lot of the time from the documentation.
Yeah, so Llama 7B has 7B parameters and in 16-bit, that's two bytes per parameter, so you need about 14 GB of VRAM to load the model, plus some headroom for kv cache for context length.
For LLaVA you additionally need space for the image model AND you need space for the kv cache for the images. The vision model is quite small - a few hundred GB in size - so that shouldn't add much. I see on the repo that the files are around 16 GB in total for model plus vision.
However, the vision model is cast up to 32-bits, so that can also double its size. All in all - in 16-bit - it won't be possible to fit in 16 GB of VRAM unless you do quantization. There's a flag to set that, but it's not stable and I had issues trying it. Basically, the LLaVA 1.6 model is not well supported in HuggingFace, so custom scripts are needed like I showed in the video here.
However, you can train llava 1.5 with 4-bit quantization and that should fit in your V100.
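As a rough rule of thumb, the arithmetic above can be sketched like this (weights only; the KV cache, activations and any optimizer state come on top):

```python
def weight_vram_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just to hold the model weights."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

print(weight_vram_gb(7))     # ~13 GB for a 7B model in 16-bit (2 bytes per parameter)
print(weight_vram_gb(7, 4))  # ~26 GB if the same weights get cast up to 32-bit
```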
Thank you for taking the time to reply! I assume you meant a few hundred MB for the vision model? That's interesting on the differences between training 1.5 vs 1.6 currently. Do you think there might be some more out-of-the-box approaches to fine-tuning 1.5 or would it still require more custom scripts like yours?
@@SeánCarmody-y3p oops, yes, hundreds of MB.
Actually I just tested 1.6 again yesterday and I think it should be ok with about 24 GB of VRAM.
Regarding more out-of-the-box, I'm a bit puzzled why this hasn't happened, and it's been a month or so now, perhaps we'll just have to look towards the next model.
@@TrelisResearch correct me if I'm wrong: from what you stated above, you mean your script can fine-tune 1.6 on a 24 GB 3090?
@@nguyenhoangnam in principle it should be possible, but in practice the scripts for 1.6 take quite a bit more work. There are some notes on trelis.com/advanced-vision
❤❤❤❤❤ Kindly also create a video on fine-tuning HiFi-GAN for natural speech synthesis.
Hi, love the content btw. Do you think fine-tuning Phi-2 with this approach might be a good idea, similar to what moondream does? And will this same script work for Phi-2?
Yes, in principle that would work, although you would need to instantiate the model correctly, swapping in Phi for Mistral/Llama
Man this helps me a lot. Thanks ❤
@jacekb4057 hi, did you create a notebook related to this?
Hi Ronan, once the model is trained, can we ask it to generate an image of a wooden rook or a black/white rook? Or is this model just classifying whether it is a rook or a king piece?
Nice question.
The model is just classifying/describing. To go the other direction (generation) you need a diffusion model that basically starts with features and then renders and smooths those out.
How many examples do you need to fine-tune LLaVA to get better results? 100 examples? What's the minimum number?
It depends on how broad the concepts are that you're aiming to build into the model. For a very narrow fine-tune, it's possible that just 25 images might be enough. You can get a rough sense from the video here and this application.
Now, if you additionally wanted to train on other board games, you'd need quite a few more examples.
Hi, do you think it's possible to fine-tune the LLaVA-OneVision model with traffic data for traffic scene understanding in the case of autonomous vehicles? I am working on a similar project and would love your take on this. Also, if you don't mind sharing, is there any cloud-based service that provides low-cost GPUs for fine-tuning these models?
Yes, it should be - although these models are quite big compared to what is used for autonomous driving (I think) because you need very fast inference.
Vast.ai is probably lowest cost but Runpod has a better UI. You can see a few tips (see fine-tuning) in the github.com/TrelisResearch/one-click-llms repo
Amazing and very informative. Can you please also show us how to fine-tune LLaVA 1.5?
Same approach! I used the same script!
Very well explained.
Thank you for sharing, very interesting.
Wow, your trained model summarizing given pictures is very impressive and fast.
What type of hardware is behind the scenes handling all your site?
have a great day. 🙂
I'm running on A6000s on runpod! See: github.com/TrelisResearch/install-guides/blob/main/llm-notebook-setup.md
Thank you for this video. I have a question and I would be happy if you could answer it: do you think that multimodal AIs like LLaVA can be fine-tuned for fraud detection in identity documents (passports, ID cards, driver's licenses)?
Yes, this sounds like a good use case.
very informative
On a first watch-through, my impression is that fine-tuning LLaVA takes a much longer script than fine-tuning Llama.
Yeah, it's much longer because you can't use out-of-the-box trainers with default data preparation (the preparation of prompts for a model that takes images as well as text is different).
Probably out of the box will come, but will take some time.
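To illustrate why the preparation differs (this is only a rough sketch in the LLaVA 1.5 style, not the repo's actual code), each training row has to interleave an image placeholder with the text and mask the prompt tokens out of the loss:

```python
IGNORE_INDEX = -100      # positions with this label are excluded from the loss
IMAGE_TOKEN = "<image>"  # LLaVA-style placeholder that gets expanded into image features

def build_example(tokenizer, question: str, answer: str) -> dict:
    # Vicuna-style chat template used by LLaVA 1.5; other versions use different templates.
    prompt = f"USER: {IMAGE_TOKEN}\n{question} ASSISTANT:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(" " + answer + tokenizer.eos_token, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + answer_ids,
        # mask the prompt (including the image placeholder) so only the answer is trained on
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids,
    }
```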
Hey! Are you making something similar for the multimodal-output LLaVA (Interactive)?
Can you please provide the GitHub repo for this code? It would be of much help. Thanks in advance
Howdy! The GitHub repo is available for purchase at trelis.com/advanced-vision . The other option, if that isn't affordable, is to work directly with the LLaVA scripts from the LLaVA GitHub repo, or else take a look at the Idefics 2 model card on Hugging Face and the launch blog post.
Is using rank-stabilized LoRA here a good idea? And is 16 too low a rank for, say, ~1000 examples of fine-tuning data?
Rank-stabilized LoRA should always be a good idea. A rank of 16 might be ok for 1000 rows, but yeah, you could bump up to 32. I don't have exact numbers in mind.
@@TrelisResearch thanks for answering me! I watched your live stream on Lora and qlora which was also quite helpful.
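For concreteness, a sketch of that suggestion with the peft library (the target modules here are just example attention projections; adjust them to the linear layers your model actually exposes):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,             # bumped from 16 for ~1000 rows, per the reply above
    lora_alpha=32,
    use_rslora=True,  # rank-stabilized LoRA scaling (alpha / sqrt(r) instead of alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```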
Can inference be run on our local machines, without the use of a server?
Yes! You can run with llama.cpp, or with MLX if you have a Mac. If you look at my latest video on Tool Use you can see me running models locally
Sorry, I'm realising now that your question is about multi-modal models. This is a bit harder. I think maybe moondream can be run locally but you'll have to look at that video and then look at the moondream github to see if there are options.
Did you run into this error when checking LoraConfig?
ValueError: Target module Sequential(
(0): Linear(in_features=1024, out_features=4096, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=4096, out_features=4096, bias=True)
) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.
peft version: 0.11.2.dev0, running on an NVIDIA A10G.
Everything ran correctly until that part.
Great video!
Seems like you tried to set a module as trainable that is not a linear layer. Just look at your LoRA modules and try commenting them out one by one, or comment them all out and then include them one by one. Use print(model) to see the list of modules.
@@TrelisResearch the layer is the mm_projector (the adapter), which is composed of Sequential(
(0): Linear(in_features=1024, out_features=4096, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=4096, out_features=4096, bias=True)
). Did you train the adapter as a whole without any issues, or just the linear parts of the adapter?
@@semigoso7274 ah yes, you can't do that because the GELU isn't a linear layer.
You have to target "model.mm_projector.0" and "model.mm_projector.2"
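Put as a sketch (the adapter module names are taken from the error above; the language-model projections listed are just examples), the config ends up targeting the two Linear layers of the adapter individually rather than the Sequential as a whole:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj",      # example language-model projections - adjust to taste
        "model.mm_projector.0",  # first Linear of the adapter
        "model.mm_projector.2",  # second Linear (index 1 is the GELU, which PEFT can't wrap)
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```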
Wouldn't it be easier to load the model however it comes and then loop through all the modules, setting them to bfloat16?
Yeah, now that you say it, I don't see why not. Sounds better
UPDATE: Yeah, I had forgotten that the main reason not to do this is that you need more VRAM to first load everything in float32 (or whatever the default is). So you may OOM
@@TrelisResearch oh wow, I hadn't thought of that. Feels like a lot of hassle; hats off that you pushed through to make it happen.
But upon more thinking:
- Can you not change the number of gpus aft...
- No, no, I'll do one better: either load it in fp16, or if that doesn't work, loop through on the CPU, send one set of parameters at a time to the GPU and convert it to bfloat16, then move on to the next, and so on
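A rough sketch of that loop idea (assuming `model` is already loaded on the CPU in its default dtype; this is not the repo's actual code):

```python
import torch

for name, child in model.named_children():
    child.to(dtype=torch.bfloat16)  # cast this submodule's weights while they are still in CPU RAM
    child.to("cuda")                # then move the already-halved weights to the GPU
    print(f"moved {name} to cuda in bfloat16")
```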
Hi! Can I get the notebook repo for fine-tuning multimodal LLaVA?
Check out Trelis.com/ADVANCED-vision
Awesome
Can you show how to fine-tune Google's Gemma models?
Same approach as in the embeddings vs fine-tuning videos. Btw, I'm unsure Gemma is that good compared to Mistral or OpenChat
Hello, I'm having problems with the image.py file when I try to use a raw image URL. What can I do?
This is the error I have: cannot identify image file
Howdy! If you purchased repo access, it's best to post an issue there.
If you're using a URL, then put the relevant portion of the image.py code into ChatGPT and ask it to adjust it to allow an image OR a URL to be passed as the input.
@@TrelisResearch thank you, really. And one more thing: what is the entire code for fine-tuning on a dataset?
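As a rough illustration of the URL fix suggested above (not the repo's actual image.py), a loader that accepts either a local path or a URL could look like this:

```python
import io
import requests
from PIL import Image

def load_image(path_or_url: str) -> Image.Image:
    if path_or_url.startswith(("http://", "https://")):
        response = requests.get(path_or_url, timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of a confusing PIL error
        return Image.open(io.BytesIO(response.content)).convert("RGB")
    return Image.open(path_or_url).convert("RGB")
```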
Can it work with multiple images in a single prompt?
It can!
Will these fine tuning projects run on Colab Pro (A100) as is?
LLaVA 1.5 will, but LLaVA 1.6 won't for now: the memory requirement to fine-tune is 100 GB. It should be a lot lower, but there is an open comment on the GitHub repo about that high memory usage.
So you need 2x A100s or 3x A6000s
@@TrelisResearch But we can use 4-bit quantization to fine-tune LLaVA 1.6, and that will run on Google Colab, right?
@@DeviGoneMad in principle yes, but I haven't been able to get quantization working with the 1.6 models (as opposed to 1.5). :(
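For what it's worth, a standard 4-bit (QLoRA-style) loading config with transformers looks like the sketch below; as noted, this worked for LLaVA 1.5 but I couldn't get it working for the 1.6 models:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bfloat16
)
# then pass quantization_config=bnb_config to the model's from_pretrained(...) call
```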
May I do all this on Windows?
You can do it on Windows if you have a GPU. If you don't have a separate GPU, then you won't have enough RAM.
Why did you post this video on YouTube when you are trying to sell the repo? Please change the video title.
Howdy! Hopefully you can learn quite a lot without buying the repo.
I don't have ads on this channel and those who do buy repos help to support the channel. That's the business model.
I appreciate everything you share @@TrelisResearch
If you reversed the axes, the queen would be on h5; maybe it's not a standard chess board? I'm not a big chess guy.
Yeah, it's possible that's the mix-up
Oh hell yeah
Hey!! Please, how can I estimate the GPU requirements for fine-tuning the following model using LoRA?
model name : llava-hf/llava-v1.6-mistral-7b-hf
I've just pinned a comment showing memory requirements, you'll see it there.
Thank you so much.