AI Hardware Writeup digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways
Loving the self hosted AI content, keep it up!
There is a topic that is one video away from filming that I hope will be a lot of fun for you (you have likely already done all of it).
I love your pivot to AI dude. I'm actually using ollama for some projects at work and this is very useful stuff. Keep up the good work and hope you're doing well!
Yeah, Ollama has made small-scale local AI easy and efficient, and we have a wealth of good models for smaller-sized users that are really impressive in their capabilities.
I'm really frustrated with the results but it was necessary and I thank you for your work!
Which part of the results frustrated you? That it didn't speed up frustrates me, but I'll be checking into vLLM, which might help on that front.
This is gold. This is what I want more of. Sub and like.
You mentioned inference, but what about training? Can you mix GPUs and VRAM?
This is very useful. I am building a local AI cluster and I am trying to balance learning and watts used. There are a few mini PCs where you can allocate as much as 16 GB of RAM to the integrated GPU for less than the price of a 12 GB GPU. The value is there.
Thanks! The MacBook Pro M3 Max with 128 GB unified memory (latest and fastest Apple arch, ~$7,200) runs Llama 3.1 70B at Q4 at about 6 tokens per second, which comes out to $1,200 per tok/s. A 2x 3090 setup runs the same model and quant at 17.6 tokens per second at a cost of around $1,800 for a complete build, or roughly $102 per tok/s. That is more than a 10x difference in performance per dollar (quick math sketch below).
I keep seeing folks suggesting M3 Macs are a better option on performance/cost grounds... and I absolutely disagree. I have a friend who runs the maxed-out M3 128 GB and is vocally disappointed in local inference. That was not their reason for going with that laptop, however, which I think makes the most sense: LLM use is a secondary use case for a Mac.
I will say that for idle-state wattage they do much better, without a doubt. I just don't think going under a 70B model at Q4 is a decent experience, even with the latest SOTA like Llama 3.1, and without a doubt I get unhappy at under 15 tokens per second myself.
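Here is the quick math, for reference. A minimal sketch in Python, using the prices and token rates quoted above as rough assumptions, not measurements of anyone else's hardware:

```python
# Rough cost-per-throughput comparison using the figures quoted above.
setups = {
    "MacBook Pro M3 Max 128GB": {"price_usd": 7200, "tok_per_s": 6.0},
    "2x RTX 3090 build":        {"price_usd": 1800, "tok_per_s": 17.6},
}

for name, s in setups.items():
    # Dollars of hardware per token/second of 70B Q4 throughput
    print(f"{name}: ${s['price_usd'] / s['tok_per_s']:,.0f} per tok/s")

ratio = (7200 / 6.0) / (1800 / 17.6)
print(f"Roughly {ratio:.1f}x better price/performance for the 3090 build")
```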
@@DigitalSpaceport M series Macs idle at around 6-12 watts of power and at full maxed-out load run around 65 watts. You never have to turn them off. A multi-GPU system like that will idle at around 100 watts and at full load will pull 600-1000 watts, so basically around 10x worse power consumption. If you want to leave your computer on all the time, the Mac wins. If not, then maybe a multi-GPU system wins; however, you can't use it to train any models due to the lack of GPU RAM. Also, a Mac is totally silent. If you don't have a basement or a server closet to stash a noisy fan-based GPU system and you value your sanity, get a Mac.
A dual 3090 setup on a consumer mobo with an Intel i5 chip idles close to 45 watts: about 20 watts for the system and 12.5 watts for each 3090. Yes, it will use more electricity under load, but it will also generate tokens dramatically faster and get back to an idle state sooner. It is higher at idle, without a doubt, but there is a dual 4090 video coming next, and we should see very good idle numbers on those and maybe faster processing. Will be fun to see. If you are training models, yeah, you would buy more GPUs and make sure you have a better mobo/CPU with more PCIe lanes. A quad 3090 is around $5,000 fully built and substantially faster.
I'm not trying to say one is better than the other, as there are of course tradeoffs with each. However, being specific with numbers is also important. One factor that is highly logical and makes great sense: if you already use and love Macs, go Mac without a doubt. Also, if you have high electricity rates, that should factor in. For me, tokens per second matters, and I prefer a larger parameter count at a lower quant over a smaller model at a higher quant; on Llama 3.1 specifically, the 70B at Q4 feels much better than the 8B at FP16. Space available should also factor in, as a quad-GPU rig has a decent 16" by 22" footprint.
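To put rough numbers on that idle-versus-load tradeoff, here is a back-of-the-envelope daily energy sketch. The wattages, token rates, and daily token count are assumptions pulled from this thread (700 W load for the dual 3090 is a guess inside the 600-1000 W range mentioned above), not measurements:

```python
# Rough daily energy: generate a fixed number of tokens, idle the rest of the day.
TOKENS_PER_DAY = 50_000  # hypothetical daily usage

setups = {
    "M3 Max (idle ~9 W, load ~65 W, ~6 tok/s)":       {"idle_w": 9,  "load_w": 65,  "tok_per_s": 6.0},
    "2x 3090 (idle ~45 W, load ~700 W, ~17.6 tok/s)": {"idle_w": 45, "load_w": 700, "tok_per_s": 17.6},
}

for name, s in setups.items():
    gen_hours = TOKENS_PER_DAY / s["tok_per_s"] / 3600   # hours spent generating
    idle_hours = 24 - gen_hours                          # rest of the day at idle
    wh_per_day = gen_hours * s["load_w"] + idle_hours * s["idle_w"]
    print(f"{name}: {gen_hours:.2f} h generating, ~{wh_per_day:,.0f} Wh/day")
```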
@@DigitalSpaceport I am a Mac user who also has a dedicated Windows Gaming PC with NVIDIA gpu, and I am really disappointed with the direction NVIDIA is going in right now. Their GPUs are getting bigger and bigger and consuming more and more power and they now cost as much as an entire PC build alone. That’s why I’m looking at the new M4 Macs which will be coming out soon as a total replacement for windows games running on Crossover. The same goes for running LLMs which are really just a hobbyist curiosity for me right now. If I could run a large LLM and run windows games both on my M4 Mac without having to buy an expensive and power hungry NVIDIA GPU, that’s a win win for me personally. It also reduces the footprint of having to have multiple large PC boxes taking up space, sucking down power and generating unnecessary heat. I don’t want or need a bunch of power hungry space heaters taking up space in my office. And then there is the noise issue. I hate noisy PC fans. With a Mac you don’t have that problem as it runs completely silent.
Yeah, I'm not at all anti-Mac either. I think they are making new advancements that will be really interesting and have great value if they keep on their current trajectory. Plus, Windows has just gotten to be such junkware now, and it's really buggy. Macs are great for not having that jank.
good job, thanks
Hey, very interesting video! I am curious what would happen if you paired an RTX 3090 with an older 24 GB card like a P40, or an even older M40. Would the extra VRAM be a net gain, or would mixing generations hurt performance? Thanks for the content. I will be following, and hopefully soon make my own AI home lab. 💪
From the Pascal generation I have a 1070. I'll add it into the test I'm shooting right now with the 4090 pair, lol! IDK what the outcome will be, but we will all learn together I guess.
I've got 3x 3060 that I have been running on an X370 board. I haven't had any issues running them at x8/x8/x4 on the PCIe lanes, and I have had a lot of fun using Mixtral 42B; it's very usable.
I went with a price/performance tradeoff - 3x 4060Ti 16GB, with the option to add another. They are a bit slower for LLM throughput, but the VRAM means I can still run the largish models without throwing over to the CPU. And they consume less power too, which is nice. I have an older 2080ti around that I may throw in, but it would be interesting to see results for mixed architecture setups.
I like the 4070 16GB, great SKU, but those are like the hottest ticket items now and hard to find near MSRP.
Do you have any benchmarks of your own to share? Like how many tps does your 3x4060ti reach on the same quants as in this video?
@@alx8439 Yes, I ran the same questions on my setup using a few different models. I’ll try and post them maybe Monday when I can sit down and compile them.
I can't post my entire results here, but I'll summarize the Llama 3.1 70B Q8 results so you can see how it scales. The 3x 4060 Ti 16GB GPUs, with no CPU offload, run that model on those questions at between 6 and 8 tokens/sec. This isn't surprising, since the three cards together have about the same memory bandwidth as a single 3090 but more total memory, and the model gets split three ways with the associated overhead. The 4060s don't have the same memory bandwidth as the 3090, and that bandwidth is a good part of where the 3090's speed comes from. So I would say I'm about on par with the 3090, but better off in absolute memory available without going to the CPU. And the three GPUs are cheaper than even one 3090 and use no more than 300 W when running, so I think it's a good deal.
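For anyone wanting to sanity-check bandwidth-bound decode speeds, here is a rough rule-of-thumb sketch in Python. The formula (tokens/s ≈ memory bandwidth ÷ bytes read per token) is only an idealized upper bound for single-request decoding, and the example numbers are illustrative, not the benchmarks from this thread:

```python
def decode_upper_bound(model_bytes_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized single-batch decode rate: every weight is read once per token."""
    return bandwidth_gb_s / model_bytes_gb

# Illustrative only: a ~40 GB quantized model on a ~936 GB/s card (3090-class bandwidth)
print(f"{decode_upper_bound(40, 936):.1f} tok/s upper bound")

# Splitting layers across cards doesn't add bandwidth for a single request;
# each token still walks through every layer in sequence, plus transfer overhead.
```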
Did you mean 4060 Ti? The Ventus 3090s like these are $700 on eBay.
Would you consider taking orders for these on your store?
Not sure what you are thinking but feel free to email me social@digitalspaceport.com and elaborate.
Part of me wants to see if I can't run a local LLM for coding on my 3080 (10gb).
You should be able to fit an 8B model on that
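Rough math on why that fits, as a quick sketch with approximate bytes-per-weight figures; the flat overhead allowance is a guess, and the KV cache actually grows with context length:

```python
def est_vram_gb(params_b: float, bytes_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Very rough VRAM estimate: weights plus a flat allowance for KV cache/activations."""
    return params_b * bytes_per_weight + overhead_gb

print(f"8B @ Q4  : ~{est_vram_gb(8, 0.6):.1f} GB")   # ~6 GB, comfortable on a 10 GB 3080
print(f"8B @ Q8  : ~{est_vram_gb(8, 1.1):.1f} GB")   # ~10 GB, very tight on a 3080
print(f"8B @ FP16: ~{est_vram_gb(8, 2.0):.1f} GB")   # ~17.5 GB, does not fit
```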
If you want a cat story: i5 6700, 24 GB RAM, no GPU, Llama 3.2 on Ollama. Cat story at 9 tps.
That's not bad for an i5 6700!
I think you're making one mistake here: your tests are not identical in terms of the context being sent to the model. This is because you keep reusing the same chat, which already has historical messages in it, and that causes Open WebUI to send the whole chat history each time. You should open a new chat instead and stick to the exact same order of the same messages.
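One way to take the chat history out of the equation entirely is to hit the Ollama API directly with the exact same standalone prompt each run. A minimal sketch in Python, assuming a default Ollama install listening on localhost:11434; the model tag is just an example:

```python
import json
import urllib.request

def bench(model: str, prompt: str) -> float:
    """Send one standalone prompt (no chat history) and return generation tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(f"{bench('llama3.1:70b', 'Write a short story about a cat.'):.1f} tok/s")
```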
I thought llama.cpp could not benefit from multiple GPUs for processing, only for adding VRAM. Maybe you should test with vLLM or TensorRT.
When you say benefit, could you elaborate? I've got vLLM high on my software test to-do list. Llama.cpp is what Ollama calls under the hood, and it does fit model layers into VRAM across multiple GPUs well. It doesn't appear to run the cores equally hard as more GPUs are added, however. Rank novice is my current class, so I'm eager to get faster inference if possible.
@@DigitalSpaceport From reading some r/LocalLLaMA Reddit posts, I got the impression that llama.cpp is good for memory distribution but cannot use all GPU cores simultaneously; that's why you are getting the same tokens per second even when removing a GPU.
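If you do try vLLM, the relevant knob is tensor parallelism, which splits each layer across the GPUs so they compute on every token together instead of taking turns. A minimal sketch of the offline API in Python; the model name is just an example (it is gated on Hugging Face), and a 70B model would need a quantized variant to fit in 2x 24 GB:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size splits every layer across the GPUs,
# so both cards work on each token instead of handing layers back and forth.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Write a short story about a cat."], params)
print(outputs[0].outputs[0].text)
```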
Oooo 😳 okay, it's now next on my to-do list lol.
Maybe with 3090s NVLinked the VRAM could be shared and bring some benefit, idk.
I'll be testing NVLink when I do the A5000 video, as I have two of them and an NVLink bridge. Stay tuned, those are after the 4090s.
@@DigitalSpaceport The A6000 and 3090 use the same NVLink, and you can use the 6000 link on a 3090 =)
Is there a possibility to earn something from these setups?
How fast is your upload speed? If you have something like 1 Gb upload, yeah, you could, but you wouldn't have much flexibility around turning the machine off or using it yourself if it gets leased. You also pretty much need higher-end cards.
@@DigitalSpaceport To serve those models to other people, you have to buy server cards like the RTX 4500/5000/6000. NVIDIA does not license consumer cards for serving. Very interesting videos!
Let's try 8 GPUs.
I am thinking about it...
Can you evaluate the performance of speech to speech, like in this tutorial? The performance of my current setup, which only has one 3090, is quite slow. I’m wondering if having four 3090 GPUs can result in a speed that is doubled or tripled. Thank you.
ruclips.net/video/yvikqjM8TeA/видео.html