My dude, you're the superhero of these tutorials! I was just thinking about how I'm annoyed that these LLMs take so long to respond. And bam, you posted this wonderful video! Thank you!
Thanks so much for the kind words :)
Inference was the main bottleneck of LLMs; this is amazing, thank you so much 🤩🤩🤩. Please make a video on this PagedAttention algorithm 🤩🤩
Glad you liked it. Thanks for the suggestion!
I ended up here after not being able to get OpenLLM working due to various issues on my PC locally. This is awesome. I got it working in a few minutes with different models. Thank you! Subscribing!
Glad it helped!
Bro, please mention the model.
Thanks, you are the vLLM of this space.
Love the speed of your videos. Colab lets more of us learn with less $$.
Absolutely, thanks for the support!
Thank you, your videos are becoming a daily thing for me
Happy to hear that!
great work man...
Can't thank you enough. Thanks again. Great to see more Indian AI tech talent out there :D
Fantastic share!
Thank you! Cheers!
Awesome video with great value.
Thanks for watching!
Amazing tutorial my friend! Was looking for this. It would have been more helpful if you could explain how to deploy LLMs created using vLLM inference engine.
Fantastic video! Just what I wanted to see.
Wow. Thank you for always bringing news to us ;)
My pleasure!
What would you love to see more of on this channel? It might help me prioritize new content.
@@1littlecoder One thing I believe could be very useful is the ability to 'talk' with sales reports. Something that makes the AI understand that what it's accessing is a sales report and not just a bunch of CSV data. This would go far beyond 'talk to your PDF' ;)
mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving
that PagedAttention tech sounds lit, giving it those crazy throughput numbers
but, not gonna lie, using Google Colab as a production environment? kinda sus
still, respect the hustle for making it accessible to peeps without fancy GPUs
mad respect for that grind
Is this an AI comment ?
❤ great job as always. Keep it up
Wow, this is awesome!
nice one.
Thank you! Cheers!
you saved our lives
That's Good 🔥🔥
You are an LLM angel :)
great!
It’s like another inference engine I’ve seen for LLMs called OpenLLM
Exactly. Nice observation. OpenLLM is in my list to cover 🚀
@@1littlecoder It doesn't give an OpenAI API token?
I read that it doesn't support quantized models. Using ExLlama for quantized Llama models is faster, with a lower memory footprint.
Thanks for the confirmation. So what is the best option for exposing Llama models? FastAPI?
Great information, really appreciated 🎉🎉🎉.
If possible, can you show how we can add our own data (an Excel report) along with the model so that the LLM can answer from our data too?
Awesome video! But while running inference and hitting the endpoint in Postman, it shows that the Jupyter notebook server is running instead of the answer from the LLM... /v1/completions
Thanks.
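In case it helps anyone hitting the same thing: the Jupyter page usually means the request is going to the notebook's own port rather than the port (or tunnel URL) the vLLM server is listening on. A minimal sketch of a request against vLLM's OpenAI-compatible completions endpoint, with the URL and model name as placeholders for whatever you actually served:

import requests

# Placeholder URL: use your tunnel URL or http://localhost:8000 where the vLLM server runs
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "facebook/opt-125m",   # must match the --model passed when starting the server
    "prompt": "San Francisco is a",
    "max_tokens": 64,
    "temperature": 0.7,
}

# The server returns the same JSON shape as OpenAI's completions API
response = requests.post(url, json=payload)
print(response.json()["choices"][0]["text"])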
Amazing 🔥🔥. BTW, any idea how we can plug this into Gradio? That way sharing and access would be much easier.
Hi. Really interesting and great work. If I am using a model via an OpenAI-like API, how can I implement a RAG system on top of it? How can I pass the prompt and the context to the model?
Thanks for this wonderful video. I want to know: can we do RAG over the model with vLLM? Also, can we run vLLM in a Kubernetes cluster?
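Not the author, but since vLLM exposes an OpenAI-style /v1/completions endpoint, a bare-bones RAG flow is: retrieve your context chunks however you like (vector DB, keyword search), paste them into the prompt, and send that prompt to the endpoint. A rough sketch with placeholder names; retrieve_chunks is assumed to be your own retrieval function:

import requests

VLLM_URL = "http://localhost:8000/v1/completions"   # placeholder: your vLLM server URL

def answer_with_rag(question, retrieve_chunks):
    # retrieve_chunks(question) -> list[str] comes from your own retriever (FAISS, Chroma, BM25, ...)
    context = "\n\n".join(retrieve_chunks(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    payload = {"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 256}
    return requests.post(VLLM_URL, json=payload).json()["choices"][0]["text"]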
So if I understand correctly, it doesn't work for GPTQ and GGML models?
Are there chat completion and embedding endpoints in the API? Is it possible to use an instruct model?
You're correct. It doesn't work with quantized models yet. I'll check on the chat completion part.
How does it compare to 4-bit quantization?
Awesome, man! Does it work with LangChain?
Can we use LangChain along with vLLM? When we use QA chains, we actually create an LLM instance using LangChain. In that case, how can we use vLLM?
exactly what I am trying to do!
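It should work: since the server speaks the OpenAI protocol, LangChain's OpenAI-compatible wrappers can point at it, and recent LangChain versions ship a dedicated VLLMOpenAI class. A rough sketch, assuming a vLLM server already running on port 8000 and a placeholder model name:

from langchain.llms import VLLMOpenAI   # newer versions: from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",                      # vLLM doesn't check the key
    openai_api_base="http://localhost:8000/v1",  # placeholder: your server / tunnel URL
    model_name="facebook/opt-125m",              # must match the served model
)

# This llm object can then be dropped into LLMChain, RetrievalQA, etc.
print(llm.invoke("What is so special about vLLM?"))   # on older LangChain versions: llm("...")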
Sir, can we also host fine-tuned models on it? Like models fine-tuned using Unsloth?
How do I load it on multiple GPUs, like loading a single LLM on 4 GPUs so its layers get divided? I tried tensor parallel, but it causes my GPUs to load up to full memory (Mixtral 8x7B, 4x RTX 4090, 24GB each).
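Not the author, but two things that may help: vLLM pre-allocates most of each GPU's free memory for the KV cache up front (gpu_memory_utilization defaults to about 0.9), so all four cards showing near-full memory is expected behaviour rather than a leak, and tensor parallelism is turned on with tensor_parallel_size (--tensor-parallel-size on the server). A rough sketch with the offline API; whether fp16 Mixtral actually fits in 4x24 GB is a separate question:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed HF repo id for Mixtral 8x7B
    tensor_parallel_size=4,        # shard the weights across 4 GPUs
    gpu_memory_utilization=0.85,   # lower this if you need headroom for other processes
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)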
Not able to replicate the result.
Even after 10 minutes it was stuck at "pip install vllm".
Let's see again after a few months.
By the way, I was trying serverless GPUs on RunPod. The cold start is 30 seconds (for the first request). It is just awesome. Just pay as you go.
If you know any other method by which we can reduce inference time, please share. Thanks.
Strange. Did it work?
@@1littlecoder vLLM is not working, at least for me.
RunPod serverless is working. (That's a totally different topic I am talking about, not related to vLLM.)
Hey, do you know of a solution for this? I'm looking for something similar to ChatGPT in that you host/serve one LLM and then multiple users can access it. Or do you have to serve one LLM for each user? I'm looking to build a chatbot with a QLoRA trained on it for doing tech support/sales.
Hey, did you find any solution for this? I have the same problem... your help will be appreciated.
No, what I have found is that self-served LLMs really start to lag after a short period of time. For most of my purposes it would be fine, since I'm just doing tech support chat. But also, multi-user could work with many small models spun up. I have 48GB, so maybe I could do three or four chat sessions with 7B models. I can do a 70B, which is good, but not with 4 simultaneous users.
But even so, I haven't been able to get a model running as well as I have with ChatGPT with my own docs embedded.
What kind of solution are you looking to work on? @@pramodpatil2883
So if I had to choose, one of the best LLMs from that selection would be Falcon?
Hey my friend, I have this setup: Ryzen 9 5900X, 48GB DDR4 RAM, and an RTX 3090 MSI OC. So if you need help with testing, reply; I can give you my PC remotely so you can help yourself, and I will learn from you if I can.
That's so kind of you. I'll let you know here in a reply if such a setup is required. Honestly, every YouTuber has to pick a niche, and my niche is mostly people without powerful NVIDIA GPUs.
@@1littlecoder Exactly. I live in Brazil, and here the price of a GPU machine is desperate =P
Does vLLM work on OpenAI Whisper models?
Thanks for the video. The only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?
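In case it helps: there isn't a classic batch-size flag, but a few engine arguments control memory use. On the free Colab T4 (16 GB) you generally need a model around 7B or smaller in half precision, a reduced context length, and optionally fewer concurrent sequences. A rough sketch; the same knobs exist on the server as --dtype, --max-model-len, --max-num-seqs and --gpu-memory-utilization:

from vllm import LLM

llm = LLM(
    model="facebook/opt-1.3b",     # placeholder: pick something small enough for a 16 GB T4
    dtype="half",                  # the T4 has no bfloat16 support, so use fp16
    max_model_len=2048,            # smaller context window -> smaller KV cache
    max_num_seqs=16,               # fewer sequences batched together at once
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM is allowed to claim
)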
Why does it only support a few models? What are the limitations? When are they going to support Vicuna? Why use vLLM over FastAPI? Sorry for the many questions!
Please compare ExLlama vs vLLM.
Hello, in your opinion, what's better to use in production, TGI or vLLM?
Hey buddy, can you show how to deploy vLLM with SkyPilot?
Can we launch an Ollama model as an API running in Google Colab?
Does it support quantized models? Is it supported in Oobabooga already? Quite an interesting topic - I hear people are using it in production often.
Why can't you use vLLM for training?
Hi, nice video. How can I use vLLM on AWS SageMaker?
Hey, your videos are nice. Can you please give me the steps for how to test my Llama 2 trained model? I already trained a Llama 2 7B chat model on my data using transformers, merged the adapter with the model, and pushed it to my Hugging Face repo.
Does it support a device_map="auto" type of thing?
For loading the model across multiple GPUs?
Why, when I make the API endpoint, does it give me a runtime error? Has anyone faced the same issue?
Hey, amazing video dude. Is it possible to run GGML or GPTQ models through vLLM?
Thanks! Not at this moment.
I have fine-tuned Falcon-7B on a custom dataset using QLoRA. Can I use vLLM with the fine-tuned model instead of the pre-trained one?
I guess so, if you have pushed the final merged model to the HF Hub. Yes, you can (most likely).
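If useful for others: the key step is merging the QLoRA adapter back into the base weights and pushing the merged model to the Hub; vLLM then loads it like any other HF model. A minimal sketch with a made-up repo id:

from vllm import LLM

# "your-username/falcon-7b-qlora-merged" is a placeholder for your own merged-and-pushed repo
llm = LLM(model="your-username/falcon-7b-qlora-merged", trust_remote_code=True)  # Falcon needs trust_remote_code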
Does it support chat completion endpoint?
Cool, does it also work CPU-only, like with llama.cpp?
It currently doesn't support quantization. So I don't think the CPU would be powerful enough to run those.
Can we use it with LangChain?
Is it possible to use this solution through LangChain?
Could you tell us how to run this on Cloud Run or a similar service?
How to do batch inference?
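For offline batch inference, vLLM's Python API takes a whole list of prompts at once and schedules them internally (that's where the continuous batching shines). Roughly, with a placeholder model:

from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain PagedAttention in one sentence:",
    "Write a haiku about GPUs:",
]

llm = LLM(model="facebook/opt-125m")   # placeholder model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches all prompts in one call
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)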
But why use vLLM when you can use the ChatGPT API?
Is this a temporary deployment or a permanent one? Like, can we always use the API?
Does this support serverless? Also, how would you host it with SageMaker?
How do I load a custom fine-tuned model using vLLM?
Did you get the solution? Please let me know.
I ran the exact command that you gave in the free Colab tier, and it gives me CUDA out of memory. What can I do? Any suggestions?
Did you use the same model as mine or any other big model?
Can it run on Colab 24/7? How much would it cost to let it run for 1 month?
How long does the pip install take? Mine has been going for like 45 minutes in Google Colab.
I guess it took about 20 mins in my case
Hi bro, can you give me a reference for "how to hide files like whisper-jax on Hugging Face"?
You don't need a tunnel if you set --host 0.0.0.0
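If it helps anyone, binding to all interfaces looks roughly like this (model name is a placeholder). On your own machine or LAN that makes the server reachable without a tunnel; on Colab the VM still isn't publicly routable, which is why the video uses a tunnel.

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000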
Very interesting video. Thanks!
Unfortunately, I encountered problems with the initial "!pip install vllm":
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
tensorflow-probability 0.22.0 requires typing-extensions
Bro, I really need your help. I'm really stuck.
What is it, bro?
@@1littlecoder I'm literally stuck fine-tuning Llama 3 8B Instruct on a custom dataset. Will you be able to help?
My dataset is large, with lengthy sequences as well.
Because it's another Indian guy talking way too fast, I have to slow down the speed to understand something..... 🙄
Did you manage to understand it when you slowed down the speed?
Can I host my own Falcon fine-tuned model from HF using the same mechanism? E.g., iamauser/falcon-7b-finetuned?
Yes. I think it uses transformers to download the model, so it should ideally work.
@@1littlecoder thanks