Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM !!!

  • Published: 25 Nov 2024

Comments • 111

  • @seanmurphy9273
    @seanmurphy9273 a year ago +18

    My dude, you're the superhero of these tutorials! I was just thinking about how I'm annoyed these LLMs take so long to respond, and bam, you posted this wonderful video! Thank you!

    • @1littlecoder
      @1littlecoder  a year ago +2

      Thanks so much for the kind words :)

  • @shamaldesilva9533
    @shamaldesilva9533 a year ago +6

    Inference was the main bottleneck of LLMs; this is amazing, thank you so much 🤩🤩🤩. Please make a video on this PagedAttention algorithm 🤩🤩

    • @1littlecoder
      @1littlecoder  a year ago +1

      Glad you liked it. Thanks for the suggestion!

  • @khorLDW
    @khorLDW 4 months ago +1

    I ended up here after not being able to get OpenLLM working due to various issues on my PC locally. This is awesome. I got it working in a few minutes with different models. Thank you! Subscribing!

  • @deabyam
    @deabyam a year ago +3

    Thanks, you are the vLLM of this space.
    Love the speed of your videos. Colab lets more of us learn with less $$

    • @1littlecoder
      @1littlecoder  a year ago +2

      Absolutely, thanks for the support!

  • @mohegyux4072
    @mohegyux4072 a year ago +1

    Thank you, your videos are becoming a daily thing for me

  • @sujantkumarkv5498
    @sujantkumarkv5498 a year ago +1

    Great work, man...
    Can't thank you enough. Thanks again. Great to see more Indian AI tech talent out there :D

  • @prestonmccauley43
    @prestonmccauley43 a year ago +1

    Fantastic share!

  • @karamjittech
    @karamjittech a year ago +1

    Awesome video with great value.

  • @maazshaikh7905
    @maazshaikh7905 4 months ago

    Amazing tutorial, my friend! I was looking for this. It would have been even more helpful if you could explain how to deploy LLMs served with the vLLM inference engine.

  • @gpsb121993
    @gpsb121993 6 months ago

    Fantastic video! Just what I wanted to see.

  • @MarceloLimaXP
    @MarceloLimaXP a year ago +1

    Wow. Thank you for always bringing news to us ;)

    • @1littlecoder
      @1littlecoder  a year ago

      My pleasure!

    • @1littlecoder
      @1littlecoder  a year ago +1

      What would you love to see more of on this channel? It might help me prioritize new content.

    • @MarceloLimaXP
      @MarceloLimaXP a year ago

      @@1littlecoder One thing I believe could be very useful is the ability to 'talk' with sales reports. Something that makes the AI understand that what it's accessing is a sales report and not just a bunch of CSV data. This would go far beyond 'talk to your PDF' ;)

  • @moondevonyt
    @moondevonyt a year ago

    Mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving.
    That PagedAttention tech sounds lit, giving it those crazy throughput numbers.
    But, not gonna lie, using Google Colab as a production environment? Kinda sus.
    Still, respect the hustle for making it accessible to peeps without fancy GPUs.
    Mad respect for that grind.

  • @marilynlucas5128
    @marilynlucas5128 a year ago +1

    ❤ great job as always. Keep it up

  • @jankothyson
    @jankothyson a year ago

    Wow, this is awesome!

  • @mdmishfaqahmed2138
    @mdmishfaqahmed2138 a year ago +1

    nice one.

  • @thedoctor5478
    @thedoctor5478 a year ago

    you saved our lives

  • @riyayadav8468
    @riyayadav8468 a year ago

    That's Good 🔥🔥

  • @BrokenRecord-i7q
    @BrokenRecord-i7q a year ago

    You are an LLM angel :)

  • @JohnVandivier
    @JohnVandivier a year ago

    great!

  • @marilynlucas5128
    @marilynlucas5128 a year ago +1

    It’s like another inference engine I’ve seen for LLMs called OpenLLM

    • @1littlecoder
      @1littlecoder  a year ago

      Exactly. Nice observation. OpenLLM is on my list to cover 🚀

    • @marilynlucas5128
      @marilynlucas5128 a year ago

      @@1littlecoder It doesn't give an OpenAI API token?

  • @rageshantony2182
    @rageshantony2182 a year ago +2

    I read that it doesn't support quantized models. Using ExLlama for quantized LLaMA models is faster, with a low memory footprint.

    • @nat.serrano
      @nat.serrano a year ago

      Thanks for the confirmation. So what is the best option to expose LLaMA models? FastAPI?

  • @arjunm2467
    @arjunm2467 a year ago +1

    Great information, really appreciated 🎉🎉🎉.
    If possible, can you show how we can add our own data (an Excel report) along with the model, so that the LLM can answer from our data too?

  • @sakshatkatyarmal2303
    @sakshatkatyarmal2303 a year ago

    Awesome video! But while inferencing and using the endpoint in Postman, it shows that the Jupyter notebook server is running and not the answer from the LLM... v1/completions
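    That symptom usually means the request is reaching the notebook's own port rather than the vLLM server. A minimal sketch of querying the OpenAI-compatible completions route, assuming the server from the video is up on port 8000; the URL and model name below are placeholders, not the video's exact values:

    ```python
    import requests

    BASE_URL = "http://localhost:8000"  # or the public tunnel URL, not the Jupyter/Colab port

    resp = requests.post(
        f"{BASE_URL}/v1/completions",
        json={
            "model": "facebook/opt-125m",  # must match the --model the server was launched with
            "prompt": "San Francisco is a",
            "max_tokens": 64,
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["text"])
    ```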

  • @nic-ori
    @nic-ori a year ago

    Thanks.

  • @anki1289
    @anki1289 a year ago +1

    Amazing 🔥🔥. BTW, any idea how we can plug this into Gradio? That way sharing and access will be much easier.
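    A minimal sketch of a Gradio front end sitting on top of a running vLLM endpoint; this is not from the video, and the URL and model name are placeholders:

    ```python
    import gradio as gr
    import requests

    VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder: your server or tunnel URL
    MODEL = "facebook/opt-125m"                        # placeholder: the --model the server uses

    def ask(prompt: str) -> str:
        # Forward the textbox contents to the vLLM OpenAI-compatible endpoint.
        r = requests.post(VLLM_URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 128})
        return r.json()["choices"][0]["text"]

    # share=True gives a temporary public link, which is handy from Colab.
    gr.Interface(fn=ask, inputs=gr.Textbox(lines=4), outputs=gr.Textbox()).launch(share=True)
    ```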

  • @davidfa7363
    @davidfa7363 5 months ago

    Hi. Really interesting and great work. If I am using a model via an OpenAI-like API, how can I implement a RAG system on top of it? How can I pass the prompt and the context to the model?
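    With an OpenAI-style completions API, RAG usually just means concatenating the retrieved context into the prompt. A minimal sketch, assuming you already have text chunks back from a retriever; the endpoint, model name, and helper are placeholders:

    ```python
    import requests

    VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
    MODEL = "facebook/opt-125m"                        # placeholder model name

    def answer_with_context(question: str, chunks: list[str]) -> str:
        # Stuff the retrieved chunks into the prompt ahead of the question.
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )
        r = requests.post(VLLM_URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 256})
        return r.json()["choices"][0]["text"].strip()

    # The chunks would normally come from a vector-store lookup (FAISS, Chroma, etc.).
    print(answer_with_context("What was Q3 revenue?", ["Q3 revenue was $1.2M, up 8% QoQ."]))
    ```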

  • @santoshshetty6
    @santoshshetty6 10 months ago

    Thanks for this wonderful video. I want to know whether we can do RAG over the model with vLLM. Also, can we run vLLM in a Kubernetes cluster?

  • @loicbaconnier9150
    @loicbaconnier9150 a year ago +1

    So if I understand correctly, it doesn't work for GPTQ and GGML models?
    Are chat completion and embeddings available in the API? Is it possible to use an instruct model?

    • @1littlecoder
      @1littlecoder  a year ago +2

      You're correct. It doesn't work with quantized models yet. I'll check on the chat completion part.

  • @aozynoob
    @aozynoob a year ago +1

    How does it compare to 4-bit quantization?

  • @pavanpraneeth4659
    @pavanpraneeth4659 a year ago

    Awesome, man. Does it work with LangChain?

  • @AbyJames-i4r
    @AbyJames-i4r a year ago +1

    Can we use LangChain along with vLLM? When we use QA chains, we actually create an LLM instance using LangChain. In that case, how can we use vLLM? (See the sketch after this thread.)

    • @larsuk9578
      @larsuk9578 a year ago

      Exactly what I am trying to do!
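      One way to do it: LangChain can treat a running vLLM server as an OpenAI-compatible backend. A minimal sketch, assuming a recent langchain-community release is installed and the server from the video is up; the URL and model name are placeholders:

      ```python
      # pip install langchain langchain-community
      from langchain_community.llms import VLLMOpenAI

      # Point LangChain's OpenAI-compatible wrapper at the vLLM server instead of api.openai.com.
      llm = VLLMOpenAI(
          openai_api_key="EMPTY",                      # vLLM does not check the key by default
          openai_api_base="http://localhost:8000/v1",  # placeholder: your server or tunnel URL
          model_name="facebook/opt-125m",              # placeholder: the --model the server runs
          max_tokens=128,
      )

      # This llm can now be dropped into QA chains like any other LangChain LLM instance.
      print(llm.invoke("Explain PagedAttention in one sentence."))
      ```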

  • @bornclasher1294
    @bornclasher1294 2 months ago

    Sir, can we also host fine-tuned models on it? Like models fine-tuned using Unsloth?

  • @_vismayJain
    @_vismayJain 4 months ago

    How to load it on multiple GPUs, like loading a single LLM on 4 GPUs so its layers get divided? I tried tensor parallel, but it causes my GPUs to load up to full memory (Mixtral 8x7B, 4x RTX 4090, 24 GB each).
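    Filling the GPUs is largely expected behaviour: vLLM pre-allocates KV-cache blocks up to its gpu_memory_utilization fraction. A minimal sketch of sharding one model across 4 GPUs with the Python API; the model id and settings are placeholders, not values from the video:

    ```python
    from vllm import LLM, SamplingParams

    # Shard one model across 4 GPUs with tensor parallelism. Lower
    # gpu_memory_utilization if you need headroom for other processes.
    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model id
        tensor_parallel_size=4,
        gpu_memory_utilization=0.85,
        dtype="half",
    )

    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)
    ```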

  • @nithinbhandari3075
    @nithinbhandari3075 a year ago +1

    Not able to replicate the result.
    Even after 10 minutes it's stuck at "pip install vllm".
    Let's see after a few months.
    By the way, I was trying serverless GPUs on RunPod. The cold start is 30 seconds (for the first request). It is just awesome. Just pay as you go.
    If you know any other method by which we can reduce inference time, please share. Thanks.

    • @1littlecoder
      @1littlecoder  a year ago

      Strange. Did it work?

    • @nithinbhandari3075
      @nithinbhandari3075 a year ago +1

      @@1littlecoder
      vLLM is not working, at least for me.
      RunPod serverless is working. (That is a totally different topic I am talking about, not related to vLLM.)

  • @pointlesspos8440
    @pointlesspos8440 a year ago +2

    Hey, do you know of a solution for this? I'm looking for something similar to ChatGPT in that you host/serve 1 LLM and then multiple users can access it. Or do you have to serve 1 LLM for each user? I'm looking to build a chatbot with a QLoRA trained on it for doing tech support/sales. (See the concurrency sketch after this thread.)

    • @pramodpatil2883
      @pramodpatil2883 a year ago

      Hey, did you find any solution for this? I am facing the same problem. Your help will be appreciated.

    • @pointlesspos8440
      @pointlesspos8440 a year ago

      No. What I have found is that self-served LLMs really start to lag after a short period of time. For most of my purposes, it would be fine, since I'm just doing tech support chat. But also, doing multi-user could work with many small models spun up. I have 48GB, so maybe I could do three or four chat sessions with 7B models. I can do a 70B, which is good, but not with 4 simultaneous users.
      But even so, I haven't been able to get a good model running as I have with ChatGPT with my own docs embedded.
      What kind of solution are you looking to work on?
      @@pramodpatil2883
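      For the multi-user question above: one vLLM server can serve many users at once, because its continuous batching packs concurrent requests into the same forward passes. A minimal sketch of several clients hitting one shared endpoint; the URL and model name are placeholders:

      ```python
      import requests
      from concurrent.futures import ThreadPoolExecutor

      VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder: one shared server for all users
      MODEL = "facebook/opt-125m"                        # placeholder model name

      def ask(prompt: str) -> str:
          r = requests.post(VLLM_URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 64})
          return r.json()["choices"][0]["text"]

      # Simulate several users chatting at the same time; vLLM schedules the
      # concurrent requests together instead of serving them one by one.
      prompts = [f"User {i}: summarise vLLM in one line." for i in range(8)]
      with ThreadPoolExecutor(max_workers=8) as pool:
          for answer in pool.map(ask, prompts):
              print(answer.strip())
      ```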

  • @ilianos
    @ilianos a year ago

    So if I had to choose, one of the best LLMs from that selection would be Falcon?

  • @ghaithkhelifi66
    @ghaithkhelifi66 a year ago +3

    Hey my friend, I have this setup: Ryzen 9 5900X with 48 GB DDR4 RAM and an MSI RTX 3090 OC. If you need help with testing, reply; I can give you my PC remotely so you can help yourself, and I will learn from you if I can.

    • @1littlecoder
      @1littlecoder  a year ago +2

      That's so kind of you. I'll let you know here in a reply if such a setup might be required. Honestly, every YouTuber has to pick a niche, and my niche is mostly people without powerful NVIDIA GPUs.

    • @MarceloLimaXP
      @MarceloLimaXP a year ago

      @@1littlecoder Exactly. I live in Brazil, and here the price of a GPU machine is desperate =P

  • @TechieBlogging
    @TechieBlogging a year ago

    Does vLLM work on OpenAI Whisper models?

  • @cmeneseslob
    @cmeneseslob a year ago

    Thanks for the video. The only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?
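    A few vLLM knobs usually help with out-of-memory on a small Colab GPU. A minimal sketch using the Python API, with the model id as a placeholder (max_model_len and max_num_seqs may not exist in very old vLLM releases):

    ```python
    from vllm import LLM, SamplingParams

    # Ways to shrink vLLM's memory footprint on a small GPU (e.g. a Colab T4):
    llm = LLM(
        model="facebook/opt-1.3b",    # placeholder: try a smaller model first
        dtype="half",                 # fp16 weights
        gpu_memory_utilization=0.80,  # reserve less of the GPU for the KV cache
        max_model_len=2048,           # shorter context means a smaller KV cache
        max_num_seqs=8,               # cap how many sequences are batched together
    )

    print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
    ```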

  • @nat.serrano
    @nat.serrano a year ago

    Why does it only support a few models? What are the limitations? When are they going to support Vicuna? Why use vLLM over FastAPI? Sorry for the many questions!

  • @rageshantony2182
    @rageshantony2182 a year ago

    Please compare ExLlama vs vLLM

  • @aliissa4040
    @aliissa4040 a year ago

    Hello, in your opinion, what's better to use in production, TGI or vLLM?

  • @solomonaryeetey7370
    @solomonaryeetey7370 a year ago

    Hey buddy, can you show how to deploy vLLM with SkyPilot?

  • @rkp23in
    @rkp23in 9 months ago

    Can we launch an Ollama model as an API running in Google Colab?

  • @alx8439
    @alx8439 a year ago

    Does it support quantised models? Is it supported in Oobabooga already? Quite an interesting topic - I hear people are using it in production often.

  • @brandomiranda6703
    @brandomiranda6703 9 months ago

    Why can't you use vllm for training?

  • @Gerald-xg3rq
    @Gerald-xg3rq 8 months ago

    Hi, nice video. How can I use vLLM on AWS SageMaker?

  • @davidlazer3641
    @davidlazer3641 a year ago

    Hey, your videos are nice. Can you please give me the steps for how to test my Llama 2 trained model? I already trained the Llama 2 7B chat model with my data using transformers, merged it with the base model, and pushed it to my Hugging Face repo.

  • @bashamsk1288
    @bashamsk1288 a year ago

    Does it support a device_map="auto" type of thing?
    For loading a model on multiple GPUs?

  • @mohamedsheded4143
    @mohamedsheded4143 a year ago

    Why does it give me a runtime error when I create the API endpoint? Anyone facing the same issue?

  • @HarshVerma-xs6ux
    @HarshVerma-xs6ux a year ago

    Hey, amazing video dude. Is it possible to run GGML or GPTQ models through vLLM?

  • @prudhvithtavva7891
    @prudhvithtavva7891 a year ago

    I have fine-tuned Falcon-7B on a custom dataset using QLoRA; can I use vLLM over the fine-tuned model instead of the pre-trained one?

    • @1littlecoder
      @1littlecoder  a year ago

      I guess if you have pushed the final merged model to the HF Hub, yes, you can (most likely).
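      A minimal sketch of what "push the final merged model to the Hub" could look like; this is not from the video, and the adapter path, repo name, and model ids are placeholders, assuming the adapter was trained with peft:

      ```python
      # pip install transformers peft
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from peft import PeftModel

      BASE = "tiiuae/falcon-7b"                 # base model the QLoRA adapter was trained on
      ADAPTER = "./qlora-falcon7b-adapter"      # placeholder: your local adapter directory
      REPO = "your-username/falcon-7b-merged"   # placeholder: target HF Hub repo

      # Load the base model, apply the LoRA adapter, and merge the weights back in.
      base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", trust_remote_code=True)
      merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

      merged.push_to_hub(REPO)
      AutoTokenizer.from_pretrained(BASE).push_to_hub(REPO)

      # The merged repo can then be served like any other Hugging Face model, e.g.:
      #   python -m vllm.entrypoints.openai.api_server --model your-username/falcon-7b-merged
      ```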

  • @chiggly007
    @chiggly007 a year ago

    Does it support chat completion endpoint?

  • @Techonsapevole
    @Techonsapevole a year ago

    Cool, does it also work CPU-only, like llama.cpp?

    • @1littlecoder
      @1littlecoder  a year ago +1

      It currently doesn't support quantization. So I don't think the CPU would be powerful enough to run those.

  • @Gokulhraj
    @Gokulhraj a year ago

    Can we use it with LangChain?

  • @mtteslian9159
    @mtteslian9159 a year ago

    Is it possible to use this solution through LangChain?

  • @VijayasarathyMuthu
    @VijayasarathyMuthu a year ago

    Could you tell us how to run this on Cloud Run or a similar service?

  • @Ryan-yj4sd
    @Ryan-yj4sd a year ago

    How to do batch inference?
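    Besides the API server, vLLM's Python API can run offline batch inference over a whole list of prompts in one call. A minimal sketch, with the model id as a placeholder:

    ```python
    from vllm import LLM, SamplingParams

    # Offline batch inference: pass all prompts at once and let vLLM batch them internally.
    llm = LLM(model="facebook/opt-125m")  # placeholder model id
    params = SamplingParams(temperature=0.8, max_tokens=64)

    prompts = [
        "The capital of France is",
        "Explain PagedAttention in one sentence:",
        "Write a haiku about GPUs:",
    ]

    for out in llm.generate(prompts, params):
        print(out.prompt, "->", out.outputs[0].text.strip())
    ```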

  • @don-jp2rs
    @don-jp2rs 6 months ago

    But why use vLLM when you can use the ChatGPT API?

  • @bornclasher1294
    @bornclasher1294 3 months ago

    Is this a temporary deployment or permanent? Like, can we use the API always?

  • @SloanMosley
    @SloanMosley a year ago

    Does this support serverless? Also, how would you host it with SageMaker?

  • @True_Feelingsss...
    @True_Feelingsss... 9 months ago

    How to load a custom fine-tuned model using vLLM?

    • @mdhuzaifapatel
      @mdhuzaifapatel 4 months ago

      Did you get the solution? Please let me know.

  • @davidlazer3641
    @davidlazer3641 a year ago

    I ran the exact command that you gave on the free Colab tier; it gives me CUDA out of memory. What can I do? Any suggestions?

    • @1littlecoder
      @1littlecoder  a year ago

      Did you use the same model as mine or any other big model?

  • @unimposings
    @unimposings a year ago

    Can it run on Colab 24/7? How much would it cost to let it run for 1 month?

  • @urisrssfeeds
    @urisrssfeeds a year ago

    How long does the pip install take? Mine has been going for like 45 minutes in Google Colab.

    • @1littlecoder
      @1littlecoder  a year ago

      I guess it took about 20 mins in my case

  • @viratchoudhary6827
    @viratchoudhary6827 a year ago

    Hi bro, can you give me a reference: "how to hide files like whisper-jax on Hugging Face"?

  • @fxhp1
    @fxhp1 10 months ago

    You don't need a tunnel if you set --host 0.0.0.0
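    A minimal sketch of what that looks like when the server and the client share a network; the IP, port, and model name are placeholders (a Colab VM is not directly reachable, so the tunnel is still needed there):

    ```python
    # Launch the server bound to all interfaces (placeholder model id):
    #   python -m vllm.entrypoints.openai.api_server \
    #       --model facebook/opt-125m --host 0.0.0.0 --port 8000
    #
    # Any machine on the same network can then reach it via the host's IP:
    import requests

    resp = requests.post(
        "http://192.168.1.50:8000/v1/completions",  # placeholder LAN IP of the serving machine
        json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 32},
    )
    print(resp.json()["choices"][0]["text"])
    ```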

  • @yosefmoatti3633
    @yosefmoatti3633 10 months ago

    Very interesting video. Thanks!
    Unfortunately, I encountered problems with the initial "! pip install vllm":
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    lida 0.0.10 requires kaleido, which is not installed.
    lida 0.0.10 requires python-multipart, which is not installed.
    tensorflow-probability 0.22.0 requires typing-extensions

  • @siddharths5135
    @siddharths5135 3 months ago

    Bro, I really need your help, I'm really stuck.

    • @1littlecoder
      @1littlecoder  3 months ago

      What, bro?

    • @siddharths5135
      @siddharths5135 3 months ago

      @@1littlecoder I'm literally stuck fine-tuning Llama 3 8B Instruct with a custom dataset. Will you be able to help?

    • @siddharths5135
      @siddharths5135 3 months ago

      My dataset is large, with lengthy sequences as well.

  • @michabbb
    @michabbb a year ago

    Because it's another Indian guy talking way too fast, I have to slow down the speed to understand anything..... 🙄

    • @1littlecoder
      @1littlecoder  a year ago

      Did you manage to understand it when you slowed down the speed?

  • @heythere6390
    @heythere6390 a year ago

    Can I host my own Falcon fine-tuned model from HF using the same mechanism? E.g. iamauser/falcon-7b-finetuned?

    • @1littlecoder
      @1littlecoder  a year ago

      Yes. I think it uses transformers to download the model, so it should ideally work.

    • @heythere6390
      @heythere6390 a year ago

      @@1littlecoder thanks