My dude, you're the superhero of these tutorials! I was just thinking about how I'm annoyed that these LLMs take so long to respond. And bam, you posted this wonderful video! Thank you!
Thanks so much for the kind words :)
Inference was the main bottleneck of LLMs; this is amazing, thank you so much 🤩🤩🤩. Please make a video on this PagedAttention algorithm 🤩🤩
Glad you liked it. Thanks for the suggestion!
I ended up here after not being able to get OpenLLM working due to various issues on my PC locally. This is awesome. I got it working in a few minutes with different models. Thank you! Subscribing!
Glad it helped!
Bro, please mention the model.
Thanks, you are the vLLM of this space.
Love the speed of your videos. Colab lets more of us learn with less $$.
Absolutely, thanks for the support!
Thank you, your videos are becoming a daily thing for me
Happy to hear that!
great work man...
Can't thank you enough. Thanks again. Great to see more Indian AI tech talent out there :D
Fantastic share!
Thank you! Cheers!
Awesome video with great value.
Thanks for watching!
Amazing tutorial my friend! Was looking for this. It would have been more helpful if you could explain how to deploy LLMs created using vLLM inference engine.
Fantastic video! Just what I wanted to see.
Wow. Thank you for always bringing news to us ;)
My pleasure!
What would you love to see more of on this channel? It might help me prioritize new content.
@@1littlecoder One thing I believe could be very useful is the ability to 'talk' with sales reports. Something that makes the AI understand that what it's accessing is a sales report and not just a bunch of CSV data. This would go far beyond 'talk to your PDF' ;)
mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving
that PagedAttention tech sounds lit, giving it those crazy throughput numbers
but, not gonna lie, using Google Colab as a production environment? kinda sus
still, respect the hustle for making it accessible to peeps without fancy GPUs
mad respect for that grind
Is this an AI comment ?
❤ great job as always. Keep it up
Wow, this is awesome!
nice one.
Thank you! Cheers!
you saved our lives
That's Good 🔥🔥
You are an LLM angel :)
great!
It’s like another inference engine I’ve seen for LLMs called OpenLLM
Exactly. Nice observation. OpenLLM is in my list to cover 🚀
@@1littlecoder It doesn't give an OpenAI API token?
I read that it doesn't support quantized models. Using ExLlama for quantized Llama models is faster, with a lower memory footprint.
Thanks for the confirmation. So what is the best option for exposing Llama models? FastAPI?
Great information, really appreciated 🎉🎉🎉.
If possible, can you show how we can add our own data (an Excel report) along with the model so that the LLM can answer from our data too?
Awesome video! But while running inference and hitting the endpoint in Postman, it shows that the Jupyter notebook server is running instead of the answer from the LLM... /v1/completions
Thanks.
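In case it helps anyone hitting the same thing: the Jupyter page usually means the request is going to the notebook's own port rather than the port (or tunnel URL) the vLLM server is listening on. A minimal sketch of a request against vLLM's OpenAI-compatible completions endpoint, with the URL and model name as placeholders for whatever you actually served:

import requests

# Placeholder URL: use your tunnel URL or http://localhost:8000 where the vLLM server runs
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "facebook/opt-125m",   # must match the --model passed when starting the server
    "prompt": "San Francisco is a",
    "max_tokens": 64,
    "temperature": 0.7,
}

# The server returns the same JSON shape as OpenAI's completions API
response = requests.post(url, json=payload)
print(response.json()["choices"][0]["text"])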
Amazing 🔥🔥. BTW, any idea how we can plug this into Gradio? That way sharing and access would be much easier.
Hi. Really interesting and great work. If I am using a model via an OpenAI-like API, how can I implement a RAG system on top of it? How can I pass the prompt and the context to the model?
Thanks for this wonderful video. I want to know: can we do RAG over the model with vLLM? Also, can we run vLLM in a Kubernetes cluster?
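Not the author, but since vLLM exposes an OpenAI-style /v1/completions endpoint, a bare-bones RAG flow is: retrieve your context chunks however you like (vector DB, keyword search), paste them into the prompt, and send that prompt to the endpoint. A rough sketch with placeholder names; retrieve_chunks is assumed to be your own retrieval function:

import requests

VLLM_URL = "http://localhost:8000/v1/completions"   # placeholder: your vLLM server URL

def answer_with_rag(question, retrieve_chunks):
    # retrieve_chunks(question) -> list[str] comes from your own retriever (FAISS, Chroma, BM25, ...)
    context = "\n\n".join(retrieve_chunks(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    payload = {"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 256}
    return requests.post(VLLM_URL, json=payload).json()["choices"][0]["text"]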
So if I understand correctly, it doesn't work for GPTQ and GGML models?
Are there chat completion and embedding endpoints in the API? Is it possible to use an instruct model?
You're correct. It doesn't work with quantized models yet. I'll check on the chat completion part.
How does it compare to 4-bit quantization?
Awesome, man! Does it work with LangChain?
Can we use LangChain along with vLLM? When we use QA chains, we actually create an LLM instance using LangChain. In that case, how can we use vLLM?
exactly what I am trying to do!
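It should work: since the server speaks the OpenAI protocol, LangChain's OpenAI-compatible wrappers can point at it, and recent LangChain versions ship a dedicated VLLMOpenAI class. A rough sketch, assuming a vLLM server already running on port 8000 and a placeholder model name:

from langchain.llms import VLLMOpenAI   # newer versions: from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",                      # vLLM doesn't check the key
    openai_api_base="http://localhost:8000/v1",  # placeholder: your server / tunnel URL
    model_name="facebook/opt-125m",              # must match the served model
)

# This llm object can then be dropped into LLMChain, RetrievalQA, etc.
print(llm.invoke("What is so special about vLLM?"))   # on older LangChain versions: llm("...")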
Sir, can we also host fine-tuned models on it? Like models fine-tuned using Unsloth?
How do I load it on multiple GPUs, like loading a single LLM on 4 GPUs so its layers get divided? I tried tensor parallel, but it causes my GPUs to load up to full memory (Mixtral 8x7B, 4x RTX 4090, 24GB each).
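Not the author, but two things that may help: vLLM pre-allocates most of each GPU's free memory for the KV cache up front (gpu_memory_utilization defaults to about 0.9), so all four cards showing near-full memory is expected behaviour rather than a leak, and tensor parallelism is turned on with tensor_parallel_size (--tensor-parallel-size on the server). A rough sketch with the offline API; whether fp16 Mixtral actually fits in 4x24 GB is a separate question:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed HF repo id for Mixtral 8x7B
    tensor_parallel_size=4,        # shard the weights across 4 GPUs
    gpu_memory_utilization=0.85,   # lower this if you need headroom for other processes
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)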
Not able to replicate the result.
Even after 10 minutes it was stuck at "pip install vllm".
Let's see again after a few months.
By the way, I was trying serverless GPUs on RunPod. The cold start is 30 seconds (for the first request). It is just awesome. Just pay as you go.
If you know any other method by which we can reduce inference time, please share. Thanks.
Strange. Did it work?
@@1littlecoder vLLM is not working, at least for me.
RunPod serverless is working. (That's a totally different topic I am talking about, not related to vLLM.)
Hey, do you know of a solution for this? I'm looking for something similar to ChatGPT in that you host/serve one LLM and then multiple users can access it. Or do you have to serve one LLM for each user? I'm looking to build a chatbot with a QLoRA trained on it for doing tech support/sales.
Hey, did you find any solution for this? I have the same problem... your help will be appreciated.
No, what I have found is that self-served LLMs really start to lag after a short period of time. For most of my purposes it would be fine, since I'm just doing tech support chat. But also, multi-user could work with many small models spun up. I have 48GB, so maybe I could do three or four chat sessions with 7B models. I can do a 70B, which is good, but not with 4 simultaneous users.
But even so, I haven't been able to get a model running as well as I have with ChatGPT with my own docs embedded.
What kind of solution are you looking to work on? @@pramodpatil2883
So if I had to choose, one of the best LLMs from that selection would be Falcon?
Hey my friend, I have this setup: Ryzen 9 5900X, 48GB DDR4 RAM, and an RTX 3090 MSI OC. So if you need help with testing, reply; I can give you my PC remotely so you can help yourself, and I will learn from you if I can.
That's so kind of you. I'll let you know here in a reply if such a setup is required. Honestly, every YouTuber has to pick a niche, and my niche is mostly people without powerful NVIDIA GPUs.
@@1littlecoder Exactly. I live in Brazil, and here the price of a GPU machine is desperate =P
Does vLLM work on OpenAI Whisper models?
Thanks for the video. The only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?
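In case it helps: there isn't a classic batch-size flag, but a few engine arguments control memory use. On the free Colab T4 (16 GB) you generally need a model around 7B or smaller in half precision, a reduced context length, and optionally fewer concurrent sequences. A rough sketch; the same knobs exist on the server as --dtype, --max-model-len, --max-num-seqs and --gpu-memory-utilization:

from vllm import LLM

llm = LLM(
    model="facebook/opt-1.3b",     # placeholder: pick something small enough for a 16 GB T4
    dtype="half",                  # the T4 has no bfloat16 support, so use fp16
    max_model_len=2048,            # smaller context window -> smaller KV cache
    max_num_seqs=16,               # fewer sequences batched together at once
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM is allowed to claim
)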
Why does it only support a few models? What are the limitations? When are they going to support Vicuna? Why use vLLM over FastAPI? Sorry for the many questions!
Please compare ExLlama vs vLLM.
Hello, in your opinion, what's better to use in production, TGI or vLLM?
Hey buddy, can you show how to deploy vLLM with SkyPilot?
Can we launch an Ollama model as an API running in Google Colab?
Does it support quantized models? Is it supported in Oobabooga already? Quite an interesting topic - I hear people are using it in production often.
Why can't you use vLLM for training?
Hi, nice video. How can I use vLLM on AWS SageMaker?
Hey, your videos are nice. Can you please give me the steps for how to test my Llama 2 trained model? I already trained a Llama 2 7B chat model on my data using transformers, merged the adapter with the model, and pushed it to my Hugging Face repo.
Does it support a device_map="auto" type of thing?
For loading the model across multiple GPUs?
Why, when I make the API endpoint, does it give me a runtime error? Has anyone faced the same issue?
Hey, amazing video dude. Is it possible to run GGML or GPTQ models through vLLM?
Thanks! Not at this moment.
I have fine-tuned Falcon-7B on a custom dataset using QLoRA. Can I use vLLM with the fine-tuned model instead of the pre-trained one?
I guess so, if you have pushed the final merged model to the HF Hub. Yes, you can (most likely).
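If useful for others: the key step is merging the QLoRA adapter back into the base weights and pushing the merged model to the Hub; vLLM then loads it like any other HF model. A minimal sketch with a made-up repo id:

from vllm import LLM

# "your-username/falcon-7b-qlora-merged" is a placeholder for your own merged-and-pushed repo
llm = LLM(model="your-username/falcon-7b-qlora-merged", trust_remote_code=True)  # Falcon needs trust_remote_code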
Does it support chat completion endpoint?
Cool, does it also work CPU-only, like with llama.cpp?
It currently doesn't support quantization. So I don't think the CPU would be powerful enough to run those.
Can we use it with LangChain?
Is it possible to use this solution through LangChain?
Could you tell us how to run this on Cloud Run or a similar service?
How to do batch inference?
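For offline batch inference, vLLM's Python API takes a whole list of prompts at once and schedules them internally (that's where the continuous batching shines). Roughly, with a placeholder model:

from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain PagedAttention in one sentence:",
    "Write a haiku about GPUs:",
]

llm = LLM(model="facebook/opt-125m")   # placeholder model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches all prompts in one call
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)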
But why use vLLM when you can use the ChatGPT API?
Is this a temporary deployment or a permanent one? Like, can we always use the API?
Does this support serverless? Also, how would you host it with SageMaker?
How do I load a custom fine-tuned model using vLLM?
Did you get the solution? Please let me know.
I ran the exact command that you gave in the free Colab tier, and it gives me CUDA out of memory. What can I do? Any suggestions?
Did you use the same model as mine or any other big model?
Can it run on Colab 24/7? How much would it cost to let it run for 1 month?
How long does the pip install take? Mine has been going for like 45 minutes in Google Colab.
I guess it took about 20 mins in my case
Hi bro, can you give me a reference for "how to hide files like whisper-jax on Hugging Face"?
You don't need a tunnel if you set --host 0.0.0.0
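If it helps anyone, binding to all interfaces looks roughly like this (model name is a placeholder). On your own machine or LAN that makes the server reachable without a tunnel; on Colab the VM still isn't publicly routable, which is why the video uses a tunnel.

python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000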
Very interesting video. Thanks!
Unfortunately, I encountered problems with the initial "!pip install vllm":
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
tensorflow-probability 0.22.0 requires typing-extensions
Bro, I really need your help. I'm really stuck.
What is it, bro?
@@1littlecoder I'm literally stuck fine-tuning Llama 3 8B Instruct on a custom dataset. Will you be able to help?
My dataset is large, with lengthy sequences as well.
Because it's another Indian guy talking way too fast, I have to slow down the speed to understand something..... 🙄
Did you manage to understand it when you slowed down the speed?
Can I host my own Falcon fine-tuned model from HF using the same mechanism? E.g., iamauser/falcon-7b-finetuned?
Yes. I think it uses transformers to download the model, so it should ideally work.
@@1littlecoder thanks