Faster LLM Inference: Speeding up Falcon 7b (with QLoRA adapter) Prediction Time

  • Published: 11 Sep 2024

Comments • 32

  • @sherryhp10
    @sherryhp10 1 year ago +5

    Man, this is the best channel in the LLM era

  • @venelin_valkov
    @venelin_valkov  1 year ago +5

    Fine-tuning Falcon 7b tutorial (requires MLExpert Pro): www.mlexpert.io/prompt-engineering/fine-tuning-llm-on-custom-dataset-with-qlora
    Full text tutorial: www.mlexpert.io/prompt-engineering/faster-llm-inference

  • @arinco3817
    @arinco3817 1 year ago +2

    Thanks for taking the time to put this video together. Really informative and helping me grasp these concepts.

  • @mantrapatel8128
    @mantrapatel8128 1 year ago +1

    Bro solved my brain in a single video

  • @thevadimb
    @thevadimb 1 year ago +2

    Thank you for your great videos! Very informative and straight to the point!

  • @hakikitosunpasa335
    @hakikitosunpasa335 1 year ago +1

    Thank you for your effort, your videos are great and go straight into practical use. Great work!

  • @paulinagc6986
    @paulinagc6986 1 year ago

    You rule! Thanks so much for doing these videos on LLMs

  • @user-ew8ld1cy4d
    @user-ew8ld1cy4d 1 year ago +1

    Thank you for another great video!

  • @tarun4705
    @tarun4705 1 year ago +1

    Very informative

  • @dataflex4440
    @dataflex4440 1 year ago +1

    Very informative. Now please make tutorials on Llama Index, as there is a lot of buzz around it.

  • @IchSan-jx5eg
    @IchSan-jx5eg 1 year ago +2

    Hello, great video so far. Let me ask some questions here:
    1. What should I do if my training loss does not decrease consistently (sometimes up, sometimes down)?
    2. How do I use multiple GPUs with QLoRA? I always get OOM when I use Falcon-40B, so I rented 2 GPUs from a cloud provider. Unfortunately, it ran on just 1 GPU.

    • @austinmoss4033
      @austinmoss4033 1 year ago +1

      Try decreasing your learning rate to help with the training loss
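
      A minimal sketch of what that could look like (assumed values, not the notebook's exact arguments; only the lower learning rate and gentler schedule matter here):

      from transformers import TrainingArguments

      # A lower learning rate plus a warmup/cosine schedule usually smooths a noisy loss curve.
      training_args = TrainingArguments(
          output_dir="outputs",
          learning_rate=5e-5,               # down from a typical 2e-4 QLoRA setting
          lr_scheduler_type="cosine",
          warmup_ratio=0.05,
          per_device_train_batch_size=1,
          gradient_accumulation_steps=16,
      )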

    • @egehanyorulmaz4965
      @egehanyorulmaz4965 10 months ago

      You need to use DeepSpeed for multi-GPU training.
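
      A minimal sketch of one way to wire that up (an assumption about what "use deepspeed" means here; the config and script names are placeholders, and how well it plays with 4-bit QLoRA depends on library versions):

      from transformers import TrainingArguments

      # Point the Trainer at a DeepSpeed ZeRO config so training is sharded
      # across all visible GPUs. "ds_config_zero2.json" is a placeholder name
      # and must exist on disk.
      training_args = TrainingArguments(
          output_dir="outputs",
          per_device_train_batch_size=1,
          gradient_accumulation_steps=16,
          bf16=True,
          deepspeed="ds_config_zero2.json",
      )

      # Launch one process per GPU with the DeepSpeed launcher, e.g.:
      #   deepspeed --num_gpus=2 train.py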

  • @jdlovely
    @jdlovely 1 year ago +1

    Thank you for the great video. Is there a place for subscribers to ask questions if they join?

  • @kishoretvk
    @kishoretvk 1 year ago +2

    Can you cover an open-source language model, for example an open-source LLaMA implementation, so a beginner can understand the actual implementation?
    Stanford Alpaca and the others are all tuned on top of an existing model, but is there anything like LLaMA to really understand from the ground up? I see GPT-Neo or nanoGPT, but they are not actual LLaMA implementations.

  • @user-zq9bp5yv6z
    @user-zq9bp5yv6z 1 year ago +1

    Great video! Just curious, why do you have the -- NORMAL --, -- VISUAL --, -- INSERT -- indicators when you click on each code block? Is that some functionality from Google?

  • @sohelshaikhh
    @sohelshaikhh 1 year ago +1

    Thank you so much for this great video. I have some doubts and it would be great if you could help me understand:
    1. The temperature parameter does not change the response at all. Did I do something terribly wrong?
    2. The method presented in lit-parrot works with the base Falcon model, but how do I load the model that we trained using QLoRA?
    Thanks again for such amazing content.
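
    On both points, a minimal sketch (assumed names; "trained-model" is a placeholder for the saved adapter directory): temperature only changes the output when sampling is enabled, and a QLoRA adapter can be attached to the base model with PEFT:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base Falcon model, then attach the QLoRA adapter saved after fine-tuning.
    base_model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-7b",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    model = PeftModel.from_pretrained(base_model, "trained-model")
    tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

    inputs = tokenizer("Summarize this email: ...", return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,    # without do_sample=True, greedy decoding ignores temperature
        temperature=0.7,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))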

  • @ikjb8561
    @ikjb8561 1 year ago +1

    Hi Venelin, I just subscribed to your website. Can you tell me if there is a way to limit answers to adapter data only and not the base model?

  • @vakkalagaddatarun4371
    @vakkalagaddatarun4371 1 year ago +1

    How can we use falcon 7b for summarization tasks?

  • @untiteledx253
    @untiteledx253 1 year ago

    Thanks for the informative video!
    One question: why does loading in 8-bit take less time than 4-bit? Shouldn't it be the other way around, since the 8-bit format has higher precision?

  • @prashantjoshi537
    @prashantjoshi537 1 year ago

    The content is really informative; it saved me almost a week :-). One quick question: I was trying to load Falcon-40B from Google Drive, but it showed me this error: HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name':
    '/content/drive/MyDrive/falcon/model/'. Use `repo_type` argument if needed. Any possible suggestions? It would be a great help.
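
    That error usually means transformers never recognized the string as a local path and fell back to treating it as a Hub repo id. A minimal sketch of one thing to check (an assumption, since the notebook isn't shown): mount Drive first and confirm the directory exists before loading:

    import os
    from google.colab import drive
    from transformers import AutoModelForCausalLM

    drive.mount("/content/drive")
    model_dir = "/content/drive/MyDrive/falcon/model"  # path from the error message

    # If the directory is missing (e.g. Drive not mounted), from_pretrained treats
    # the string as a repo id and raises HFValidationError instead of loading from disk.
    assert os.path.isdir(model_dir), f"Directory not found: {model_dir}"

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        trust_remote_code=True,
        device_map="auto",
    )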

  • @shivammittal1051
    @shivammittal1051 1 year ago +1

    Hi Venelin! Great work, but a quick question.
    I'm confused.
    How is load_in_8bit better than load_in_4bit (QLoRA bitsandbytes) in terms of execution time?
    Shouldn't loading in 4-bit be faster than loading in 8-bit?
    Please clarify!

    • @venelin_valkov
      @venelin_valkov  1 year ago +3

      I am comparing inference (not loading) time in the video. Let me know if you do any loading time comparisons.
      Thanks for watching!
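
      For anyone who wants to reproduce it, a minimal sketch of that kind of inference-time comparison (not the video's notebook; the model id, prompt, and generation settings are assumptions):

      import time
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

      MODEL_ID = "tiiuae/falcon-7b"
      tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
      prompt = tokenizer("The best way to speed up inference is", return_tensors="pt")

      configs = {
          "8-bit": BitsAndBytesConfig(load_in_8bit=True),
          "4-bit NF4": BitsAndBytesConfig(
              load_in_4bit=True,
              bnb_4bit_quant_type="nf4",
              bnb_4bit_compute_dtype=torch.bfloat16,
          ),
      }

      for name, bnb_config in configs.items():
          model = AutoModelForCausalLM.from_pretrained(
              MODEL_ID,
              quantization_config=bnb_config,
              device_map="auto",
              trust_remote_code=True,
          )
          inputs = {k: v.to(model.device) for k, v in prompt.items()}
          start = time.perf_counter()
          model.generate(**inputs, max_new_tokens=64)
          print(name, f"{time.perf_counter() - start:.2f}s")  # rough single-run timing
          del model
          torch.cuda.empty_cache()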

    • @shivammittal1051
      @shivammittal1051 1 year ago

      @@venelin_valkov Yes, you're comparing inference time in the video, but shouldn't the inference time for 4-bit also be less than for 8-bit?
      A 4-bit model means the model is smaller than the 8-bit model, which should lead to faster inference, but your results show the opposite.
      I might be wrong, it's just a logical question!
      Could you please explain? It might clear up some of my doubts.
      Thanks!

    • @venelin_valkov
      @venelin_valkov  1 year ago +2

      @@shivammittal1051 sure, my guess was the same as yours. However, the experiments (at least with these library versions and hardware) show something different. Try the notebook and tell me if you can reproduce it.
      Thanks for watching!

    • @tarun4705
      @tarun4705 1 year ago +2

      @@shivammittal1051 Yes, you're right, the 4-bit model's inference time should be much faster than the 8-bit model's. I have fine-tuned a language translation model on Llama 7B using QLoRA, and the inference time in 4-bit is very slow, but the GPTQ 4-bit quantized version of the same model is very fast. I think the issue lies in the Hugging Face library's 4-bit implementation. It might get better in the future.

    • @harigovind511
      @harigovind511 1 year ago +2

      @@shivammittal1051 It should be the case, but I believe the current 4-bit inference kernel is not optimized and the team is working on it.

  • @jaivalani4609
    @jaivalani4609 1 year ago

    I have seen slower inference times for 4-bit quantized models compared to 8-bit quantized ones; people complain about QLoRA because of it.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    I think it might be useful to explain what the lines of code are actually doing.

  • @odev6764
    @odev6764 1 year ago +1

    Thank you so much, your video is helping a lot with my tests. I have a doubt, maybe you know what is happening. I'm training locally on multiple RTX 3060 Ti GPUs, and I notice the memory is fully used on all of them, but CUDA is not fully used on all of them: one GPU gets 100% CUDA usage, while the processing is not distributed equally across the others, and some of them use only 10%. What I did differently from you is remove max_steps so the entire dataset I'm using gets processed, and increase per_device_train_batch_size to 3 and gradient_accumulation_steps to 16, because I'm using 4 GPUs. Do you have any tips?

    • @odev6764
      @odev6764 1 year ago

      Another thing I noticed: the run was supposed to take 1814 steps, I had been running for 11 hours, and at step 409 it simply stopped with no error. Looking at the CUDA usage on the GPUs, there is no activity, only the memory is still fully used. I guess it broke without raising any error; the progress hasn't moved for the last hour.
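
      One possible explanation for the uneven GPU usage described above (an assumption, since the setup isn't shown): if the model is loaded with device_map="auto", it is split layer-wise across the four cards, so only one GPU computes at any moment. For data parallelism, each process keeps the whole model on its own GPU instead. A minimal sketch:

      import os
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      # Launch one process per GPU, e.g.:  accelerate launch --num_processes=4 train.py
      local_rank = int(os.environ.get("LOCAL_RANK", 0))

      model = AutoModelForCausalLM.from_pretrained(
          "tiiuae/falcon-7b",
          quantization_config=BitsAndBytesConfig(load_in_4bit=True),
          device_map={"": local_rank},  # whole model on this process's GPU
          trust_remote_code=True,
      )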