Homelab AI Server Multi-GPU Benchmarks - Multiple 3090s and 3060 Ti, Mixed PCIe, VRAM Performance

  • Published: 1 Dec 2024

Comments • 49

  • @DigitalSpaceport
    @DigitalSpaceport  1 day ago

    AI Hardware Writeup: digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways

  • @tjb_altf4
    @tjb_altf4 2 months ago +3

    Loving the self hosted AI content, keep it up!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      There is a topic one video away from filming that I hope will be a lot of fun for you (though you have likely already done all of it).

  • @ChrisDupres
    @ChrisDupres 2 months ago +3

    I love your pivot to AI dude. I'm actually using ollama for some projects at work and this is very useful stuff. Keep up the good work and hope you're doing well!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      Yeah, Ollama has made small-scale hosting easy and efficient. We have a wealth of good models for smaller-scale users that are really impressive in their capabilities.

  • @LucasAlves-bs7pf
    @LucasAlves-bs7pf 2 months ago

    I'm really frustrated with the results but it was necessary and I thank you for your work!

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      Which part of the results frustrated you? The fact that it didn't speed up frustrates me, but I'll be checking into vLLM, which might help on that front.

  • @neponel
    @neponel 2 months ago +1

    This is gold. This is what I want more of. Sub and like.

  • @XaviSanz35
    @XaviSanz35 2 months ago +3

    You talked about inference, but what about training? Can you mix GPUs and VRAM for that?

  • @FSK1138
    @FSK1138 2 months ago +1

    This is very useful. I am building a local AI cluster and I am trying to balance learning and watts used.
    There are a few mini PCs where you can allocate as much as 16 GB of RAM to the integrated GPU for less than the price of a 12 GB GPU.
    The value is there.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +3

      Thanks! The MacBook Pro M3 Max 128GB unified (latest and fastest arch, $7,200) runs Llama 3.1 70B at Q4 at 6 tokens per second, which comes out to $1,200 per tok/s. A 2x 3090 setup runs the same model and quant at 17.6 tokens per second at a cost of around $1,800 for a complete build, or $102.85 per tok/s. That is roughly a 10x difference in performance per dollar.
      I keep seeing folks suggesting M3 Macs are a better option on performance/cost grounds, and I absolutely disagree. I have a friend who runs the maxed-out M3 128 and is vocally disappointed in local inference. That was not their reason for going with that laptop, however, which I think makes the most sense: LLM use is a secondary use case for a Mac.
      I will say that for idle-state wattage they will do much better, without doubt. I just don't think going under a 70B model at Q4 is a decent experience even with the latest SOTA like Llama 3.1, and I get unhappy at anything under 15 tokens per second myself.
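
A quick back-of-envelope sketch of the price-per-throughput comparison above, using only the prices and token rates quoted in this reply (a rough sketch, not independent measurements):

```python
# Price per unit of throughput, using the figures quoted in the reply above.
setups = {
    "MacBook Pro M3 Max 128GB": {"price_usd": 7200, "tok_per_s": 6.0},
    "2x RTX 3090 build":        {"price_usd": 1800, "tok_per_s": 17.6},
}

for name, s in setups.items():
    dollars_per_tok_s = s["price_usd"] / s["tok_per_s"]
    print(f"{name}: ${dollars_per_tok_s:,.2f} per tok/s")

# ~$1,200 per tok/s for the Mac vs ~$102 per tok/s for the dual 3090 build,
# roughly an order-of-magnitude gap in price per unit of throughput.
```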

    • @BrentLeVasseur
      @BrentLeVasseur 2 months ago

      @@DigitalSpaceport M-series Macs idle at around 6-12 watts and at a fully maxed-out load run around 65 watts. You never have to turn it off. A multi-GPU system like that will idle at around 100 watts and at full load will pull 600-1000 watts, so basically 10x worse power consumption. If you want to leave your computer on all the time, then the Mac wins. If not, then maybe a multi-GPU system wins, though you can't use it to train any models due to lack of GPU RAM. Also, a Mac is totally silent. If you don't have a basement or a server closet to stash your noisy fan-based GPU system in, and you value your sanity, get a Mac.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      A dual 3090 on a consumer mobo with an Intel i5 chip idles close to 45 watts: that's 20 watts for the system and 12.5 for each 3090. Yes, it will use more electricity under load, but it will also generate tokens dramatically faster and get back to an idle state sooner. Idle draw is higher, without doubt, but you are going to get a dual 4090 video next, and we will have a very good idle on those and maybe faster processing. It will be fun to see. If you are training models, yeah, you would buy more GPUs and ensure you have a better mobo/CPU with more PCIe lanes. A quad 3090 is around $5,000 fully built and specifically much faster. I'm not trying to say one is better than the other, as there are of course tradeoffs with each, but being specific about numbers is also important. One of the biggest factors, which is highly logical, is that if you already use and love Macs, go Mac without a doubt. If you have high electric rates, that should factor in too. For me, tokens per second matters, and I prefer a larger parameter count at a lower quant over a smaller model at a higher quant; on Llama 3.1 specifically, the 70B at Q4 feels much better than the 8B at fp16. Space available should also factor in, as a quad-GPU rig has a decent 16" by 22" footprint.
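
To make the "generates tokens faster and gets back to idle sooner" point concrete, here is a rough energy-per-response sketch using only the load wattages and token rates quoted in this thread (the 65 W and 600 W load figures and both token rates are the thread's rough numbers, not measurements):

```python
# Energy per 1,000-token response = load power x generation time,
# using the rough wattages and token rates quoted in this thread.
def joules_per_response(load_watts: float, tok_per_s: float, tokens: int = 1000) -> float:
    seconds = tokens / tok_per_s
    return load_watts * seconds

mac_j = joules_per_response(load_watts=65, tok_per_s=6.0)    # ~10.8 kJ over ~167 s
duo_j = joules_per_response(load_watts=600, tok_per_s=17.6)  # ~34.1 kJ over ~57 s

print(f"M3 Max:  {mac_j / 1000:.1f} kJ per 1k tokens")
print(f"2x 3090: {duo_j / 1000:.1f} kJ per 1k tokens")
# The GPU box burns more energy per response but finishes roughly 3x sooner
# and then drops back to its ~45 W idle, so total daily energy depends heavily
# on duty cycle, not just peak wattage.
```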

    • @BrentLeVasseur
      @BrentLeVasseur 2 месяца назад

      @@DigitalSpaceport I am a Mac user who also has a dedicated Windows Gaming PC with NVIDIA gpu, and I am really disappointed with the direction NVIDIA is going in right now. Their GPUs are getting bigger and bigger and consuming more and more power and they now cost as much as an entire PC build alone. That’s why I’m looking at the new M4 Macs which will be coming out soon as a total replacement for windows games running on Crossover. The same goes for running LLMs which are really just a hobbyist curiosity for me right now. If I could run a large LLM and run windows games both on my M4 Mac without having to buy an expensive and power hungry NVIDIA GPU, that’s a win win for me personally. It also reduces the footprint of having to have multiple large PC boxes taking up space, sucking down power and generating unnecessary heat. I don’t want or need a bunch of power hungry space heaters taking up space in my office. And then there is the noise issue. I hate noisy PC fans. With a Mac you don’t have that problem as it runs completely silent.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      Yeah, I'm not at all anti-Mac either. I think they are making advancements that will be really interesting and offer great value if they keep on their current trajectory. Plus, Windows has just gotten to be such junkware now, it's really buggy. Macs are great for not having that jank.

  • @alexo7431
    @alexo7431 2 months ago

    good job, thanks

  • @emilianosteinvorth7017
    @emilianosteinvorth7017 2 months ago +2

    Hey, very interesting video! I am curious what would happen if you pair an RTX 3090 with an older 24GB card like a P40, or an even older M40. Would the extra VRAM be good, or would mixing generations hurt performance? Thanks for the content, I will be following, and hopefully soon make my own AI homelab. 💪

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +3

      From the Pascal generation I have a 1070. I'll add it into the test I'm shooting right now with the 4090 pair, lol! IDK what the outcome will be, but we will all learn together I guess.

  • @elliotthanford1328
    @elliotthanford1328 2 months ago +2

    I've got 3x 3060 that I have been running on an X370 board. I haven't had any issues running at x x8 x4 on the PCIe lanes, and I have had a lot of fun using Mixtral 42b; it's very usable.

  • @ChrisCebelenski
    @ChrisCebelenski 2 months ago +1

    I went with a price/performance tradeoff - 3x 4060Ti 16GB, with the option to add another. They are a bit slower for LLM throughput, but the VRAM means I can still run the largish models without throwing over to the CPU. And they consume less power too, which is nice. I have an older 2080ti around that I may throw in, but it would be interesting to see results for mixed architecture setups.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      I like the 4070 16GB, great SKU, but those are like the hottest ticket items now and hard to find near MSRP.

    • @alx8439
      @alx8439 2 months ago

      Do you have any benchmarks of your own to share? Like how many tps does your 3x4060ti reach on the same quants as in this video?

    • @ChrisCebelenski
      @ChrisCebelenski 2 months ago +1

      @@alx8439 Yes, I ran the same questions on my setup using a few different models. I’ll try and post them maybe Monday when I can sit down and compile them.

    • @ChrisCebelenski
      @ChrisCebelenski 2 months ago +1

      I can't post my entire results here, but I'll summarize the Llama3.1:70B Q8 results so you can see how it scales. The 3x 4060 Ti 16GB GPUs, with no CPU offload, run that model on those questions at between 6 and 8 t/s. This isn't surprising: the three together have about the same memory bandwidth but more total memory, and the model gets split three ways with the associated overhead. The 4060s don't have the same memory bandwidth as the 3090, and that bandwidth is a good part of the speed the 3090 gets. So I would say I'm about on par with the 3090, but better off in absolute memory available without going to the CPU. And the three GPUs are cheaper than even one 3090 and use no more than 300W when running, so I think it's a good deal.
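
For context on the bandwidth point above, a rough spec-sheet comparison (the bandwidth and VRAM numbers below are published specifications, treated here as approximations):

```python
# Spec comparison behind the "similar aggregate bandwidth, more total VRAM"
# reasoning above (published spec-sheet figures, approximate).
cards = {
    "RTX 3090":    {"vram_gb": 24, "bw_gb_s": 936, "count": 1},
    "RTX 4060 Ti": {"vram_gb": 16, "bw_gb_s": 288, "count": 3},
}

for name, c in cards.items():
    total_vram = c["vram_gb"] * c["count"]
    total_bw = c["bw_gb_s"] * c["count"]
    print(f"{c['count']}x {name}: {total_vram} GB VRAM, {total_bw} GB/s aggregate")

# 1x 3090: 24 GB, 936 GB/s; 3x 4060 Ti: 48 GB, 864 GB/s aggregate --
# similar total bandwidth on paper with twice the VRAM, though layer
# splitting adds per-token overhead since each token passes through the
# cards in sequence.
```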

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Did you mean 4060 Ti? The Ventus 3090s like these are $700 on eBay.

  • @Centaurman
    @Centaurman 2 months ago +2

    Would you consider taking orders for these on your store?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Not sure what you are thinking but feel free to email me social@digitalspaceport.com and elaborate.

  • @cracklingice
    @cracklingice 2 months ago +1

    Part of me wants to see if I can't run a local LLM for coding on my 3080 (10gb).

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      You should be able to fit an 8B model on that
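
As a rough fit check for a 10 GB card (weights-only estimates; the KV cache and context add roughly 1-2 GB more, depending on settings):

```python
# Rough weights-only VRAM estimate for an 8B model at common quant levels.
# KV cache and context add roughly 1-2 GB on top, depending on settings.
PARAMS_B = 8
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

for quant, bpp in BYTES_PER_PARAM.items():
    print(f"8B @ {quant}: ~{PARAMS_B * bpp:.0f} GB of weights")

# ~4 GB at Q4 and ~8 GB at Q8, so a 10 GB 3080 holds an 8B model at Q4
# comfortably, and at Q8 with a modest context; fp16 (~16 GB) will not fit.
```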

  •  1 month ago +1

    If you want a cat story: i5 6700, 24 GB, no GPU, Llama 3.2 on Ollama. Cat story at 9 tps.

  • @alx8439
    @alx8439 2 months ago +1

    I think you're making one mistake here: your tests are not identical in terms of the context being sent to the model. That's because you keep reusing the same chat, which already has some historical messages in it, so Open WebUI sends the whole chat history each time. You should open a new chat instead and literally stick to the same order of the same messages.
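
One way to avoid the chat-history problem entirely is to benchmark against the Ollama HTTP API directly, sending the same single prompt with no prior context on every run. A minimal sketch, assuming Ollama is on its default port and the model tag shown is already pulled:

```python
# Stateless single-prompt benchmark against Ollama's /api/generate endpoint,
# so every run sends exactly the same context.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:70b"   # assumed model tag; use whatever you have pulled
PROMPT = "Write a short story about a cat."

resp = requests.post(
    OLLAMA_URL,
    json={"model": MODEL, "prompt": PROMPT, "stream": False},
    timeout=600,
)
data = resp.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tok_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{MODEL}: {tok_per_s:.1f} tok/s "
      f"({data['eval_count']} tokens in {data['eval_duration'] / 1e9:.1f} s)")
```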

  • @nandofalleiros
    @nandofalleiros 2 months ago

    I thought llama.cpp could not benefit from multiple GPUs for processing, only for adding VRAM. Maybe you should test with vLLM or TensorRT.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +1

      When you say benefit, could you elaborate? I've got vLLM high on my software test to-do list. Llama.cpp is what Ollama calls under the hood, and it does a good job fitting model layers into VRAM across multiple GPUs. It doesn't appear to run the cores equally hard as more GPUs are added, however. I'm a rank novice still learning, so I'm eager to get faster inference if possible.

    • @nandofalleiros
      @nandofalleiros 2 months ago +1

      @@DigitalSpaceport From reading some r/LocalLLaMA Reddit posts I got the impression that llama.cpp is good for memory distribution but cannot use all GPU cores simultaneously; that's why you are getting the same tokens per second even when removing a GPU.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      Oooo 😳 okay, it's now next on my to-do list lol.
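
For anyone who wants to try the vLLM route discussed above, a minimal tensor-parallel sketch looks like the following (the model id and GPU count are placeholders; vLLM shards the weights across the GPUs and runs them in parallel rather than layer by layer, which is the behavior the thread is hoping for):

```python
# Minimal vLLM sketch with tensor parallelism across 2 GPUs.
# The model id and tensor_parallel_size are illustrative; a 70B model on
# 2x 24 GB cards would need a quantized (AWQ/GPTQ) checkpoint, not fp16.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; swap in your model
    tensor_parallel_size=2,                    # shard weights across 2 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short story about a cat."], params)
print(outputs[0].outputs[0].text)
```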

  • @jonathanmayor3942
    @jonathanmayor3942 2 months ago

    Maybe with the 3090s NVLinked the VRAM could be shared and bring some benefit, idk.

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago +2

      I'll be testing NVLink when I do the A5000 video, as I have two of them and an NVLink bridge. Stay tuned, those are coming after the 4090s.

    • @jonathanmayor3942
      @jonathanmayor3942 2 months ago +2

      @@DigitalSpaceport The A6000 and 3090 use the same NVLink, and you can use the A6000 bridge on the 3090 =)

  • @mcunumberone613
    @mcunumberone613 2 months ago

    Is there a possibility to earn something from these setups?

    • @DigitalSpaceport
      @DigitalSpaceport  2 months ago

      How fast is your upload speed? If you have something like 1 Gb upload speeds, yeah, you could, but you wouldn't have much flexibility around turning the machine off or using it yourself if it gets leased. You also pretty much need higher-end cards.

    • @wentropia
      @wentropia 2 months ago

      @@DigitalSpaceport To serve those models to other people, you have to buy server cards, like RTX 4500/5000/6000. NVIDIA does not license consumer cards to serve. Very interesting videos!

  • @anpowersoftpowersoft
    @anpowersoftpowersoft 2 months ago

    Let's try 8 GPUs.

  • @TuanAnhLe-ef9yk
    @TuanAnhLe-ef9yk 2 months ago

    Can you evaluate the performance of speech to speech, like in this tutorial? The performance of my current setup, which only has one 3090, is quite slow. I’m wondering if having four 3090 GPUs can result in a speed that is doubled or tripled. Thank you.
    ruclips.net/video/yvikqjM8TeA/видео.html