Awesome overview. Love it. How come llama.cpp wasn't in the inference engine comparison?
I probably should have included it, although it's slower.
For Llama 3.1 8B on an H100 SXM it gives:
batch 1: 90
batch 64: 15
This is slower than all the other engines.
You can see the results here: docs.google.com/spreadsheets/d/15MJAjBoQdFacmNEEwmfqLVRtSgAOqgsn9APFYo_Q59I/edit?usp=sharing
@@TrelisResearch appreciate it! Thanks for sharing the results
As always top-notch content, love it!!
Have you compared the inference speed of LLMs: ExLlamaV2 and SGLang?
I haven't done ExLlamaV2, but if it uses Marlin kernels it will be fast. Whether it's fast for batching depends on the scheduler. The fastest models are the FP8 or INT4 AWQ models, and they can run with SGLang or vLLM.
As a DevOps/platform engineer, I learned a lot watching your video! Cheers!
Thanks
Thanks Trelis, thanks for conducting these experiments, you saved me lots of time.
Extremely insightful! Thanks for releasing this.
This channel is pure gold. Cheers mate
You're the only sane AI guy
haha thanks, appreciate it
This is an amazing video, thank you!
cheers, you're welcome
Thanks Ronan, awesome content as always. Suggestion for a future video: a MoA implementation (with 2 or more layers) would be very much appreciated. Maybe using small language models (or 8B models!?) with llama.cpp or llama.mojo would achieve performance comparable to some frontier models. I don't know, it's only a hypothesis.
Yeah, good shout. Interestingly we've seen a move away from MoE with the Llama models. I suspect the training instability is underappreciated. I'll pin a comment now on llama.cpp.
I love this. Interesting that there doesn't seem to be a mention of Google's TPUs though, which were built for lower-precision AI floating-point matrix math.
I just didn't think of it but it's a good idea.
Seems vLLM supports doing this: docs.vllm.ai/en/latest/getting_started/tpu-installation.html#installation-with-tpu
I also created an issue on SGLang: github.com/sgl-project/sglang/issues/919
Nvidia does benefit enormously from lots of libraries being heavily optimised for CUDA, but I'd keep an open mind as to whether TPUs could still be faster.
Excellent overview. With diffusion models and vision transformers, are the performance numbers similar?
In principle they could be, but these libraries are more focused on text outputs.
Any idea how well these optimizations work on T4?
Unfortunately the T4 is now very old and, as far as I know, doesn't support AWQ. This means the templates here won't work well.
Your best bet may be to run a llama.cpp server - you can check the one-click-llms repo.
You can also run TGI using --quantize bitsandbytes-nf4 or --quantize eetq (for 8 bit), but they will be slower than AWQ.
vLLM seemed fastest a month ago and now SGLang has dropped out of nowhere (or at least that's been my experience) and taken over by storm. Kind of crazy. It is missing a lot of features that vLLM has; hopefully both projects learn and improve.
Yeah, SGLang has drawn a ton from vLLM.
Great video
Do any of these give the same answer as HF TRL or Unsloth with the same parameters (temperature 0, etc.)? I am struggling to collect usable user feedback for the next iteration of fine-tuning if what users see doesn't tie with what the model sees. I suspect it has to do with optimizations/quantizations in the serving. Thanks
Howdy! Could you clarify your question?
TRL and Unsloth are both for training, whereas here I'm talking about inference.
@@TrelisResearch yes. I learned yesterday that the inference output might differ from HF because the KV cache is in 16 bits (so a different order of operations). The way to reduce this is beam search and higher precision. The rounding and order make a difference. You won't see it in averages, but compare individual results and you will notice. Same model.
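The point about rounding and order can be seen without any model at all. Below is a minimal sketch in plain Python (64-bit floats; the effect is generally larger at 16-bit precision) where the same three numbers are summed in two different orders. The helper name `sum_in_order` is purely illustrative. A serving engine whose kernels accumulate in a different order than HF can end up with slightly different logits in the same way, and at temperature 0 a near-tie between two tokens can then flip.

```python
def sum_in_order(values):
    """Add floats strictly left to right, the way a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# Same inputs, different order of additions, different rounding along the way.
print("identical results:", forward == backward)
print("absolute difference:", abs(forward - backward))
```

The difference is tiny, which is why it disappears in averages but can still show up when comparing individual outputs.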
MistralRS would be good to see as a comparison
have you got a link to that?
Hi, can you do paid consultation calls and put the link in your description? I would like to book a call with you.
there's an option on Trelis.com/About
Hey Trelis, I have a question about API development with LLMs. Suppose my LLM takes 2 GB of RAM when we load it, and we can now run inference with it. If I build an API and two requests come in at the same time, is another 2 GB copy of the model loaded into RAM, or does the second request wait for the first to complete? I'm not sure how this works. And what if we get 100 requests? I just need to determine how much RAM I would need.
Actually if you have multiple requests, they use the same weights.
Calculations are done in parallel and the model weights are only read in once (per parallel calculation).
VRAM usage does increase a bit as you increase batch size, but this is because more activations (layer outputs) are stored for each of the input sequences. This tends to be small relative to the model weights.
The point of these inference engines is that they handle everything so that sequences can be processed in parallel, including when one request starts after another. This just means, for example, that the fifth token of the first request might be processed in parallel with the first token of the second request.
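To put rough numbers on the memory point, here is a back-of-the-envelope sketch. The 2 GB figure comes from the question; the per-request overhead of 0.01 GB is a made-up placeholder (an assumption, not a measurement), since the real value depends on the model and sequence length and should be measured.

```python
def estimate_memory_gb(weights_gb, per_sequence_gb, batch_size):
    """Weights are loaded once; only the per-sequence part grows with batch size."""
    return weights_gb + per_sequence_gb * batch_size


WEIGHTS_GB = 2.0        # model size from the question
PER_SEQUENCE_GB = 0.01  # ASSUMPTION: hypothetical per-request overhead, measure your own

for batch_size in (1, 2, 100):
    total = estimate_memory_gb(WEIGHTS_GB, PER_SEQUENCE_GB, batch_size)
    print(f"{batch_size:>3} concurrent requests: ~{total:.2f} GB")
```

The thing to take away is the shape of the formula rather than the numbers: 100 requests do not need 100 copies of the weights.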
@@TrelisResearch okay. So does this come with the Hugging Face model, or do I need to write code myself to do all these things?
@@TemporaryForstudy no need to write code, that's the point of using these inference engines, like SGLang or vLLM. If you want to rent a GPU, the best approach is to use a Docker image, like the one-click templates I show. If you own a GPU, then it's best to install SGLang or vLLM on your computer; they handle batching.
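For a feel of what the engine's scheduler does on your behalf, here is a toy simulation. It is only an illustration of the idea of batching requests that arrive at different times, not how SGLang or vLLM are actually implemented; the function `simulate` and the request names are hypothetical.

```python
from collections import deque


def simulate(requests):
    """requests: list of (name, arrival_step, tokens_to_generate).

    Returns one batch per decoding step; each batch is a list of
    (request_name, token_number) pairs that are processed together.
    """
    pending = deque(sorted(requests, key=lambda r: r[1]))
    active = {}  # name -> [tokens_done, tokens_to_generate]
    batches = []
    step = 0
    while pending or active:
        # Admit every request that has arrived by this step.
        while pending and pending[0][1] <= step:
            name, _, n_tokens = pending.popleft()
            active[name] = [0, n_tokens]
        # One decoding step: every active request advances by one token.
        batch = []
        for name, state in active.items():
            state[0] += 1
            batch.append((name, state[0]))
        # Drop requests that have produced all their tokens.
        for name in [n for n, s in active.items() if s[0] >= s[1]]:
            del active[name]
        batches.append(batch)
        step += 1
    return batches


# Request A starts at step 0 and needs 6 tokens; B arrives at step 4 and needs 3.
schedule = simulate([("A", 0, 6), ("B", 4, 3)])
for step, batch in enumerate(schedule):
    print(step, batch)
```

At step 4 the batch contains the fifth token of A together with the first token of B, which is the situation described above.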
Amazing
You only focused on Nvidia cards, but AMD seems to be competing pretty well from what I see, at least in terms of hardware. Is the software support for LLM inference decent enough?
For example, you could fit 405B in 4 MI300X for a similar price per card to the H100. The AMD card should also beat the H100 in a head-to-head speed comparison, at least if the software efficiency does not disappoint.
Yeah, those are all fair comments. Running on AMD is a bit less supported, but it was more a limit on how much I could stuff into this video.
Indeed I should do some digging on AMD; it would be interesting to benchmark it head to head with Nvidia.
@@TrelisResearch I don't suppose you have any experience with Intel Gaudi for inference? Any good?
@@BellJH I don't, but I have on my list to do a video on alternative GPUs... so can try to aim for that as well as AMD
Hi, please reply: how do you calculate the cost? When I try to calculate it, I get 10 dollars per million tokens. Here is my working: there are 3600 seconds in an hour and 55 tokens per second, so that is 198,000 tokens per hour. I think your calculation is wrong.
And you need to divide by batch size!
Please reply.
@@TrelisResearch I don't understand. Please explain in simple terms: what is batch size?
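Batch size is the number of requests the GPU is serving at the same time. The arithmetic in this exchange can be sketched as below. The GPU price of 1.98 dollars per hour is an assumption, back-derived from the commenter's own figures (10 dollars per million tokens at 198,000 tokens per hour). The batch size of 64 is just an example, and the sketch also assumes each request keeps running at 55 tokens per second when batched, which in practice it usually will not.

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second_per_request, batch_size=1):
    """Dollars per million generated tokens across all requests in the batch."""
    total_tokens_per_hour = tokens_per_second_per_request * batch_size * 3600
    return gpu_price_per_hour / total_tokens_per_hour * 1_000_000


GPU_PRICE_PER_HOUR = 1.98  # ASSUMPTION: implied by 10 dollars/M at 198,000 tokens/hour
TOKENS_PER_SECOND = 55     # per request, from the comment above

single = cost_per_million_tokens(GPU_PRICE_PER_HOUR, TOKENS_PER_SECOND)
batched = cost_per_million_tokens(GPU_PRICE_PER_HOUR, TOKENS_PER_SECOND, batch_size=64)

print(f"one request at a time: ${single:.2f} per million tokens")
print(f"64 requests in a batch: ${batched:.2f} per million tokens")
```

With one request at a time this reproduces the roughly 10 dollar figure; with many requests sharing the same GPU hour, the cost per token is divided by the batch size.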