How to pick a GPU and Inference Engine?

  • Published: 16 Jan 2025

Comments • 48

  • @Moonz97
    @Moonz97 5 months ago +2

    Awesome overview. Love it. How come llama.cpp wasn't in the inference engine comparison?

    • @TrelisResearch
      @TrelisResearch  5 months ago +5

      I probably should have included it, although it's slower.
      For Llama 3.1 8B on an H100 SXM it gives:
      batch 1: 90
      batch 64: 15
      This is slower than all the other engines.
      You can see the results here: docs.google.com/spreadsheets/d/15MJAjBoQdFacmNEEwmfqLVRtSgAOqgsn9APFYo_Q59I/edit?usp=sharing

    • @Moonz97
      @Moonz97 5 months ago +2

      @@TrelisResearch appreciate it! Thanks for sharing the results

  • @abhijitnayak1639
    @abhijitnayak1639 5 months ago +2

    As always top-notch content, love it!!

    • @abhijitnayak1639
      @abhijitnayak1639 5 months ago

      Have you compared the inference speed of ExLlamaV2 and SGLang?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      I haven't tested ExLlamaV2, but if it uses Marlin kernels it will be fast. As to whether it's fast for batching, that depends on the scheduler. The fastest models are the fp8 or INT4 AWQ models, and they can run with SGLang or vLLM.
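
      As a minimal sketch of that setup (not from the video; the model id is just an example AWQ checkpoint), serving an INT4 AWQ model with vLLM's offline Python API looks roughly like this:

      ```python
      from vllm import LLM, SamplingParams

      # Example AWQ-quantized checkpoint; any INT4 AWQ quant of Llama 3.1 8B works similarly.
      llm = LLM(
          model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
          quantization="awq",
      )

      params = SamplingParams(temperature=0.0, max_tokens=64)
      outputs = llm.generate(["Explain why AWQ speeds up decoding."], params)
      print(outputs[0].outputs[0].text)
      ```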

  • @InstaKane
    @InstaKane 5 months ago

    As a devops/platform engineer, I learned a lot watching your video! Cheers!

  • @gody7334-news-co8eq
    @gody7334-news-co8eq 5 months ago

    Thanks Trelis, thank you for conducting these experiments; you saved me lots of time.

  • @bphilsochill
    @bphilsochill 5 months ago

    Extremely insightful! Thanks for releasing this.

  • @anunitb
    @anunitb 5 months ago

    This channel is pure gold. Cheers mate

  • @sharannagarajan4089
    @sharannagarajan4089 2 months ago +2

    You're the only sane AI guy

  • @mahermansour1131
    @mahermansour1131 4 months ago

    This is an amazing video, thank you!

  • @fabioaloisio
    @fabioaloisio 5 months ago

    Thanks Ronan, awesome content as always. Suggestion for a future video: a MoA implementation (with 2 or more layers) would be much appreciated. Maybe using small language models (or 8B models!?) with llama.cpp or llama.mojo would achieve performance comparable to some frontier model. I don't know, it's only a hypothesis.

    • @TrelisResearch
      @TrelisResearch  5 months ago +1

      Yeah, good shout. Interestingly, we've seen a move away from MoE with the Llama models. I suspect the training instability is underappreciated. I'll pin a comment on llama.cpp now.

  • @BoeroBoy
    @BoeroBoy 5 months ago

    I love this. Interesting that there doesn't seem to be a mention of Google's TPUs though, which were built for lower-precision AI floating-point matrix math.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      I just didn't think of it, but it's a good idea.
      Seems vLLM supports doing this: docs.vllm.ai/en/latest/getting_started/tpu-installation.html#installation-with-tpu
      I also created an issue on SGLang: github.com/sgl-project/sglang/issues/919
      Nvidia does benefit enormously from lots of libraries optimising heavily for CUDA, but I'd keep an open mind as to whether TPUs could still be faster.

  • @rbrowne4255
    @rbrowne4255 5 months ago

    Excellent overview. With diffusion models and vision transformers, are the performance numbers similar?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      In principle they could be, but these libraries are more focused on text outputs.

  • @kunalsuri8316
    @kunalsuri8316 5 months ago +1

    Any idea how well these optimizations work on T4?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Unfortunately the T4 is now very old and doesn't support AWQ as far as I know. This means the templates here won't work well.
      Your best bet may be to run a llama.cpp server - you can check the one-click-llms repo.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      You can also run TGI using --quantize bitsandbytes-nf4 or --quantize eetq (for 8-bit), but these will be slower than AWQ.
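
      For reference, a hedged sketch of launching TGI with one of those flags (the Docker image tag, model id, volume, and ports here are assumptions, not from the video):

      ```python
      import subprocess

      # Launch TGI with on-the-fly 8-bit eetq quantization
      # (swap in "bitsandbytes-nf4" for 4-bit).
      subprocess.run([
          "docker", "run", "--gpus", "all", "--shm-size", "1g",
          "-p", "8080:80",
          "-v", "tgi-data:/data",
          "ghcr.io/huggingface/text-generation-inference:latest",
          "--model-id", "meta-llama/Llama-3.1-8B-Instruct",
          "--quantize", "eetq",
      ], check=True)
      ```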

  • @tikz.-3738
    @tikz.-3738 5 months ago +1

    vLLM seemed fastest a month ago, and now SGLang has come out of nowhere, or at least that's been my experience, and it's taken over by storm, which is kind of crazy. It's still missing a lot of features that vLLM has; hopefully both projects learn and improve.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Yeah SGLang has drawn a ton from vLLM.

  • @Bluzë-o5b
    @Bluzë-o5b 5 months ago

    Great video

  • @peterdecrem5872
    @peterdecrem5872 5 months ago

    Do any of these give the same answer with the same parameters (temperature 0, etc.) as HF TRL or Unsloth? I am struggling to collect usable user feedback for the next iteration of fine-tuning if what the user sees doesn't tie in with what the model sees. I suspect it has to do with optimizations/quantizations in the serving. Thanks.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Howdy! Could you clarify your question?
      TRL and Unsloth are both for training, whereas here I'm talking about inference.

    • @peterdecrem5872
      @peterdecrem5872 5 months ago +1

      @@TrelisResearch Yes. I learned yesterday that the inference might differ because of the KV cache in 16 bits (so reordering) vs HF. The way to reduce this is beam search and higher precision. The rounding and ordering make a difference. You won't see it in averages, but compare individual results and then you will notice. Same model.

  • @sammcj2000
    @sammcj2000 5 months ago

    MistralRS would be good to see as a comparison

  • @mahermansour1131
    @mahermansour1131 4 months ago

    Hi, can you do paid consultation calls and put the link in your description? I would like to book a call with you.

    • @TrelisResearch
      @TrelisResearch  4 months ago

      there's an option on Trelis.com/About

  • @TemporaryForstudy
    @TemporaryForstudy 5 months ago

    Hey Trelis, I have a question about API development with LLMs. Suppose my LLM takes 2 GB of RAM when loaded, and then we can do inference with it. Now if I make an API and two requests come in at the same time, does another 2 GB copy of the model get loaded into RAM, or does the second request wait for the first to complete? And what if we get 100 requests? I just need to determine how much RAM I would need.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Actually, if you have multiple requests, they use the same weights.
      Calculations are done in parallel, and the model weights are only read in once (per parallel calculation).
      VRAM usage does increase a bit as you increase batch size, but that's because more activations (layer outputs) are stored for each of the input sequences. This tends to be small relative to the model weights.
      The point of these inference engines is that they handle all of this so sequences can be processed in parallel (including when one request starts after another - this just means, for example, that the fifth token of the first request might be processed in parallel with the first token of the second request).
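
      A small sketch (not from the video) of what this looks like from the client side, assuming a vLLM or SGLang server is already running locally with an OpenAI-compatible API: many requests in flight, but only one copy of the weights in memory.

      ```python
      import concurrent.futures

      import requests

      URL = "http://localhost:8000/v1/completions"   # assumed local server address
      MODEL = "meta-llama/Llama-3.1-8B-Instruct"     # assumed model name

      def ask(prompt: str) -> str:
          # Every request hits the same server process; the engine batches them,
          # so only one copy of the model weights sits in memory. Extra VRAM goes
          # to the KV cache / activations for each in-flight sequence.
          payload = {"model": MODEL, "prompt": prompt, "max_tokens": 64}
          return requests.post(URL, json=payload, timeout=120).json()["choices"][0]["text"]

      prompts = [f"Write one sentence about request number {i}." for i in range(100)]
      with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
          results = list(pool.map(ask, prompts))

      print(f"Received {len(results)} completions")
      ```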

    • @TemporaryForstudy
      @TemporaryForstudy 5 months ago

      @@TrelisResearch Okay, so does this come with the Hugging Face model, or do I need to write code myself to do all these things?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      @@TemporaryForstudy No need to write code - that's the point of using these inference engines, like SGLang or vLLM. If you want to rent a GPU, the best approach is to use a Docker image, like the one-click templates I show. If you own a GPU, then it's best to install SGLang or vLLM on your computer; they handle batching.
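
      A minimal sketch of that "install it and let the engine handle batching" path (the model path and port are assumptions): once the server is up, it exposes an OpenAI-compatible API and batches incoming requests for you.

      ```python
      import subprocess

      # Start an SGLang server locally; it serves an OpenAI-compatible API
      # and handles continuous batching of concurrent requests.
      subprocess.run([
          "python", "-m", "sglang.launch_server",
          "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
          "--port", "30000",
      ], check=True)
      ```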

  • @AbdulRahman-vj9el
    @AbdulRahman-vj9el 2 months ago

    Amazing

  • @Max6383-je9ne
    @Max6383-je9ne 5 months ago

    You only focused on Nvidia cards, but AMD seems to be competing pretty well from what I see, at least in terms of hardware. Is the software support for LLM inference decent enough?
    For example, you could fit 405B in 4 MI300X for a similar price per card to the H100. The AMD card should also beat the H100 in a head-to-head speed comparison, at least if the software efficiency does not disappoint.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Yeah, those are all fair comments. Running on AMD is a bit less well supported, but it was more a limit on how much I could stuff into this video.
      Indeed, I should do some digging on AMD; it would be interesting to benchmark it head to head with Nvidia.

    • @BellJH
      @BellJH 1 month ago

      @@TrelisResearch I don't suppose you have any experience with Intel Gaudi for inference? Any good?

    • @TrelisResearch
      @TrelisResearch  1 month ago +1

      @@BellJH I don't, but a video on alternative GPUs is on my list... so I can try to aim for that as well as AMD.

  • @kashifsaeed2154
    @kashifsaeed2154 3 months ago

    Hi, please reply: how do you calculate the cost? I am trying to calculate it and I get 10 dollars per million.
    Here is my working:
    3600 seconds in an hour, and 55 tokens per second, so that is 198,000 tokens per hour - I think your calculation is wrong.

    • @TrelisResearch
      @TrelisResearch  3 months ago

      And you need to divide by batch size!

    • @kashifsaeed2154
      @kashifsaeed2154 3 months ago

      Please reply and tell me.

    • @kashifsaeed2154
      @kashifsaeed2154 3 months ago

      @@TrelisResearch I don't understand; please explain it simply - what is batch size?
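
      A worked sketch of the cost arithmetic in this thread, following the reply above: count the tokens from every sequence the GPU serves in parallel (the batch), not just one. The GPU rental price below is an assumption; 55 tokens/s is the figure from the comment above.

      ```python
      gpu_cost_per_hour = 3.00   # assumed rental price for the GPU, $/hour
      tokens_per_second = 55     # per-sequence generation speed from the comment above
      batch_size = 64            # number of sequences the engine serves in parallel

      # Total tokens produced in an hour across all parallel sequences:
      # 55 * 3600 = 198,000 per sequence, times 64 sequences = 12,672,000.
      tokens_per_hour = tokens_per_second * 3600 * batch_size

      cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
      print(f"${cost_per_million:.2f} per million tokens")  # ~$0.24/M at batch 64, vs ~$15/M at batch 1
      ```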