3090 vs 4090 Local AI Server LLM Inference Speed Comparison on Ollama

  • Published: 1 Dec 2024

Comments • 87

  • @DigitalSpaceport
    @DigitalSpaceport  1 day ago

    AI Hardware Writeup digitalspaceport.com/homelab-ai-server-rig-tips-tricks-gotchas-and-takeaways

  • @I1Say2The3Truth4All
    @I1Say2The3Truth4All 1 month ago +19

    Looks like if the GPU use case is LLMs, the 3090 is hugely more economical! Great comparison with surprising results indeed, and exactly what I wanted to see. Thank you! :)

    • @critical-shopper
      @critical-shopper 1 month ago +6

      llama3.2:3b-instruct-fp16:
      105 t/s on the 4090 Strix, 95 t/s on the 3090 PNY
      gemma2:27b-instruct-q5_K_M:
      38.4 t/s on the 4090 Strix, 34.5 t/s on the 3090 PNY
      I can buy four 3090s for the price of the 4090 Strix.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +4

      In my case the 2x 4090s equaled the price of 4x 3090s plus the new pads I had to apply to them. If you buy used, lower-priced 3090s, you should expect to clean and repad them imo.

    • @fcmancos884
      @fcmancos884 1 month ago +1

      @@critical-shopper Around 10-11% faster. What about a 3090 with an OC on the memory? The 3090's memory runs at 19.5 Gbps and the 4090's at 21 Gbps...

    • @ken860000
      @ken860000 25 days ago

      @@DigitalSpaceport I am looking for a motherboard to use for AI training and inference. I have a question: if I have 2x 3090, do both PCIe slots need to connect to the CPU directly, or can one slot connect to the CPU and the other to the chipset?
      It is too expensive to get a motherboard that has two slots connected to the CPU directly. ;(

  • @claybford
    @claybford 1 month ago +2

    THANK YOU for making a chart! I've been poking around your videos trying to get some straightforward info on how GPU model and quantity affect TPS in inference, and this is super helpful.

  • @arnes12345
    @arnes12345 1 month ago +4

    Great comparisons! Just one minor note: you mentioned that the last question is to check how fast it is with a short question. But when you ask it in the same thread as everything else, Open WebUI passes that entire thread as input context plus your short question, so you're really testing speed at a long context size at that point. The token/sec numbers being consistently slightly lower for the long story and the final question confirm that. TL;DR - start a new chat to test short questions.

    • @ScottDrake-hl6fk
      @ScottDrake-hl6fk 10 days ago

      You can also reuse the seed value for repeatable testing (see the sketch below).
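
      A minimal sketch of that kind of test (not from the video; it assumes a local Ollama server on the default port and an already-pulled model tag, here the llama3.2:3b-instruct-fp16 tag quoted elsewhere in these comments): send the short question as a fresh, single-turn request with a fixed seed and zero temperature, and compute tokens/s from the response fields.

          import requests

          # Hypothetical one-shot benchmark; swap the model tag for whatever you are testing.
          resp = requests.post(
              "http://localhost:11434/api/generate",
              json={
                  "model": "llama3.2:3b-instruct-fp16",
                  "prompt": "Why is the sky blue?",          # short prompt, no prior chat history
                  "stream": False,
                  "options": {"seed": 42, "temperature": 0},  # repeatable output across runs
              },
          ).json()

          # eval_duration is reported in nanoseconds
          tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
          print(f"{resp['eval_count']} tokens generated at {tps:.1f} t/s")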

  • @YUAN_TAO
    @YUAN_TAO 1 month ago

    Great video man, thank you!😊

  • @reverse_meta9264
    @reverse_meta9264 1 month ago +9

    TL;DR - for LLM inference, the 4090 is generally not meaningfully faster than the 3090

    • @tsizzle
      @tsizzle 1 month ago

      @@reverse_meta9264 what about for LLM training?

    • @reverse_meta9264
      @reverse_meta9264 1 month ago

      @ I don't know, but I can't imagine any single card with 24GB of VRAM does particularly well training large models.

  • @ewenchan1239
    @ewenchan1239 1 month ago

    This is consistent with the 3090 results that I've posted as comments back to this channel/your videos, albeit with different models.
    It averages out to somewhere around 5-8% slower than a 4090, which really isn't a lot.
    I'm glad that I bought the 3090s during the mining craze (where I was able to make my money back from said mining), and now I can also use those 3090s for AI tasks like these.

  • @CYBONIX
    @CYBONIX 1 month ago

    Well done! I'll be looking out for the image generation comparison between the two cards.

  • @mattfarmerai
    @mattfarmerai 1 month ago

    Great video, I would love to see an image generation comparison.

  • @markldevine
    @markldevine 1 month ago

    Great content. My new DIY idea: WRX90 (or whatever might be next) with a Zen 5 9965X (perhaps), which has the new Zen 5 I/O die (not a disappointment like the desktop chips), for an in-home build. Getting residential quotes for 2x new electrical circuits/outlets now. Completed in 2026H1, probably.
    Your channel will be required viewing. Thanks for posting content!

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      In '26 it may be a newer variant, but you NEED to watch my next video before you go all in on the top-of-the-line WRX90 with a 7995WX. Shocking findings, my friend.

  • @viniciusms6636
    @viniciusms6636 1 month ago

    Great job.

  • @aravintz
    @aravintz 1 month ago +1

    Thank you so much!! Will SLI make a difference when we use 2 GPUs? And please do test 70B.

  • @KayWessel
    @KayWessel 16 days ago +1

    Thanks for the nice comparison. Today I use a GTX 1080 8GB and am planning to upgrade to a 3090 24GB, but I'm wondering about the 4070 Ti Super 16GB. What do you think? My PC is an Intel Core i5-9400F @ 2.90GHz with 32GB of 3200MHz memory. Prices are about 20% higher on the 4070.

    • @DigitalSpaceport
      @DigitalSpaceport  16 days ago

      VRAM is always #1 and 24GB is a lot. It allows a good step up in models to run at higher stored parameters.

  • @theunsortedfolder6082
    @theunsortedfolder6082 1 month ago +1

    So in your case the model fits into a single GPU, and then it didn't matter whether there were one or two GPUs? The speed was the same? What about, say, 3x 3070 Ti vs 1x 3090? I'm thinking a lot of these cards exist because they were used for mining, probably purchased in the very last phase of the mining era, so they're around in large numbers and relatively new. Any test with that?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      So if a model can fit into a single GPU, the ollama service will put it on just one. Like 1x 3090 vs 2x 3060 12GB is what you're asking, since they both add up to 24GB? I don't have an answer on that. Yes, many are available and indeed were used for mining. A good reason to anticipate a repad on a 3090, as they are very hot cards. Other 3000-series cards generate much less heat stress. (A quick way to check which GPU a model actually landed on is sketched below this thread.)

    • @Espaceclos
      @Espaceclos 1 month ago

      @@DigitalSpaceport It would be cool to see the difference between two 12GB cards and one 24GB card of the same series, i.e. 3060s vs a 3090.
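
      A minimal way to do that check (not from the video; it assumes an NVIDIA setup with nvidia-smi on the PATH and a model already loaded, e.g. via ollama run in another terminal):

          import subprocess

          def gpu_memory_used():
              """Return (gpu_index, MiB used) pairs reported by nvidia-smi."""
              out = subprocess.run(
                  ["nvidia-smi", "--query-gpu=index,memory.used",
                   "--format=csv,noheader,nounits"],
                  capture_output=True, text=True, check=True,
              ).stdout
              return [tuple(int(v) for v in line.split(", "))
                      for line in out.strip().splitlines()]

          # If the model fits on one card, only that card's memory.used should jump;
          # a split load shows up as increased usage on multiple indices.
          for idx, used_mib in gpu_memory_used():
              print(f"GPU {idx}: {used_mib} MiB in use")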

  • @moozoo2589
    @moozoo2589 1 month ago +4

    4:26 Again seeing CPU utilization at 100% and GPU at 81%. Have you tried using a more performant CPU to see if you can saturate the GPU fully?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +3

      This is actually a good time for me to test this. I have a Threadripper out of its case right now. Great Q!

    • @selub1058
      @selub1058 17 days ago

      @@DigitalSpaceport Results?

    • @DigitalSpaceport
      @DigitalSpaceport  17 days ago

      @selub1058 Results are in this video: ruclips.net/video/qfqHAAjdTzk/видео.html

  • @janreges
    @janreges 1 month ago

    Hi Jerod,
    thank you for this video and your channel in general! I've been watching your videos since 2021 (because of XCH - I'm a farmer with 2.3 PiB of drives in the house, now 4.93 PiB effective, with one RTX 4090) and you are one of my favorite creators.
    The RTX 3090 is definitely the most sensible choice for AI homelabbers today. However, I would be very happy if, when you have the HW and time, you could do a comparison of the RTX 3090 with e.g. a 6900 XT (with ROCm) for LLM models that can fit in 16GB of VRAM.
    Btw, according to my measurements and for my project's needs, Qwen2.5:7b (Q4_K_M) is the best current model. It runs very fast even on a single GPU like the RTX 3080 10GB (about 90 tokens/s) and returns really high-quality, precise information like other 30B+ models.
    Thank you for your work, and I look forward to all your other videos in the AI HW/SW realm as well ;)

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      Sweet, I may try to get a loaner ROCm-capable card to test. I think their prices are very good; if they can drop in and work with something like Ollama, it would be great. I agree Qwen 2.5 is very good. I do like the new Nemotron tune Nvidia released, but that is a 70B, while Qwen has a great variety of sizes all the way down. Cheers

    • @slowskis
      @slowskis 1 month ago +1

      I use the qwen2.5:14b with a 6800XT and it runs great. Uses 15.2/16 GB of dedicated GPU memory 🥰

    • @janreges
      @janreges 1 month ago

      @@slowskis Thanks for the information, man! How many tokens/s are you getting? And on what platform (Win/Linux/macOS) and with what LLM tool are you running it?

    • @slowskis
      @slowskis 1 month ago

      @@janreges I am running on Windows 10 with Ollama. I have not measured tokens/s.

  • @KonstantinsQ
    @KonstantinsQ 1 month ago +5

    It would probably be interesting to compare 2x 4090 against 3x 3090, as they cost about the same. Also, which user-friendly motherboards can fit 2-3x 3090s? Are there, for example, X670E motherboards that can fit 3x GPUs with 8 PCIe lanes each and make it work? Or would it make sense to run two 3090s at x8 and one at x4? It's not easy to find even an X670E that can fit 2x 3090s, because not every motherboard has dual x8 slots, and not every one has enough spacing between slots to plug them in directly without risers, so there are probably only a few options. And another question: is there a benefit from NVLink between 3090s, does it bring value? Many questions - I'd be glad for clarification on any of them, because a build on an X670E would be much more cost- and ease-friendly for a home setup. But 2x 24GB of VRAM might not be enough for larger models; 3x 24 = 72GB of VRAM can theoretically fit 70B LLMs, but is that even possible with an X670E or similar motherboards, and is it worth it? Thanks in advance! And good content! :)

    • @loktar00
      @loktar00 1 month ago +3

      @@KonstantinsQ PCIe lanes don't matter much for inference, just for initial loading. I'm using 2x 3090s and a 3060 Ti on an x1.

    • @comedicsketches
      @comedicsketches 1 month ago +1

      You can get a used GPU server for under $2k with eight 2-width-spaced x16 slots and two x8 slots. They hold up to four 3090s at full speed with fans, and ten 3090s if you make them 2-width or less. It's not worth messing with regular hardware if you have a real use case. The 3090s are much better value for LLM inference, but 4090s have more compute if your application isn't memory limited.

    • @KonstantinsQ
      @KonstantinsQ 1 month ago +1

      @@loktar00 Please explain more, or suggest where I can dive deeper to understand what you meant. :)

    • @ScottDrake-hl6fk
      @ScottDrake-hl6fk 10 days ago +1

      NVLink is irrelevant here and 8 lanes are fine; PCIe Gen 4 is important and fast storage is important. There are several older consumer mobo/CPU combos that get you x16/x8/x8/x8/x8/x8/x8, and 128GB of DDR4 is enough to run. Good luck everyone.

  • @ДмитрийКарпич
    @ДмитрийКарпич 1 month ago

    Interesting results - it seems that's the real improvement in the new GPU generation, besides the frame generator etc. To my mind the main bottleneck is memory: on paper they have nearly the same bandwidth, 936.2 GB/s vs 1.01 TB/s. But for image generation the huge ~1.5x advantage in CUDA compute may count for something.

  • @MrAtomUniverse
    @MrAtomUniverse 27 days ago

    What about Llama 3.2 in q4? The models on Ollama are generally q4 by default, right?

    • @DigitalSpaceport
      @DigitalSpaceport  26 days ago +1

      If you grab the default pull tag, yes, that is q4. However, if you click Tags you can download q8 or fp16 as well as several other variants. (A small pull example is below this thread.)

    • @MrAtomUniverse
      @MrAtomUniverse 26 days ago

      @@DigitalSpaceport If you have a chance in the future, please do test Llama 3.2 in Q4 on the 4090, thank you so much!
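
      A minimal sketch of pulling a non-default quantization (assumptions: Ollama is installed locally and the tag strings match what the model's Tags page currently lists - the fp16 tag below is the one quoted earlier in these comments):

          import subprocess

          # The bare model name resolves to the default (q4) build.
          subprocess.run(["ollama", "pull", "llama3.2"], check=True)

          # An explicit tag selects another quantization, e.g. the fp16 variant.
          subprocess.run(["ollama", "pull", "llama3.2:3b-instruct-fp16"], check=True)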

  • @Alkhymia
    @Alkhymia 1 month ago

    What CPU do you recommend for the 3090 and AI stuff?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Fast single-core max speed matters and core count doesn't. I can't report on P-cores vs E-cores as I don't have Intel.

  • @tsizzle
    @tsizzle 1 month ago

    Since NVLink is no longer supported on the 4090 (Ada Lovelace), is it better to get two 3090s and NVLink them together? Also, is the VRAM pooled, i.e. 24GB + 24GB = one GPU of 48GB? Does a whole LLM need to fit in the VRAM of a single 24GB GPU, or can it fit in 48GB of VRAM across two GPUs? Or do you have to use some sort of model sharding, distributed training, paged optimization, QLoRA, KV caching, etc.? How many GPUs does it take to run Llama 3.1 405B?

    • @lietz4671
      @lietz4671 1 month ago +1

      In one video I saw a message that running the 405B model requires at least 221GB of VRAM. So you would need at least nine 3090 24GB cards to run the 405B model; with fewer cards, processing will be that much slower. In another video I saw the 405B model being run on a 4090, and output only started after waiting more than 20 minutes. So it's probably best to give up on running the 405B model on a personal computer.

  • @sebastianpodesta
    @sebastianpodesta 1 month ago

    Great!! Thanks for your video!!
    Do you know if the GPUs were running at x4, x8 or x16 PCIe lanes? I'm looking forward to building a mini AI server, and I see that the new Intel Core Ultra 200 series will have more PCIe lanes - enough to run 2 GPUs at x16 - but I don't know how much of a difference this will make.
    I hope in the future we can combine NPU and GPU VRAM for different tasks, since the new motherboards will support 192GB, but only at DDR5 speeds for now.
    Can't wait to see the Stable Diffusion benchmarks!!

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      Hey, thanks for your question. I read a lot of findings, but it's nice to validate the assumptions and 3090 vs 4090 directly. On PCIe lanes: if you are sticking to pure inference, the impact on performance will be negligible outside initial model loading. I will likely attempt to quantify this to get a firm reading, however. If you are doing a lot of embeddings, say with a large document collection for RAG, it can use more bandwidth, but that depends a lot on the embedding model and the documents themselves. Even in those instances, x4 at Gen 4 is still a massive amount of bandwidth. I would factor it in for, say, an x1-to-x16 riser, but it's negligible at x4 or greater. Do also check any specific consumer motherboard for limitations that affect onboard M.2 usage if you enable, say, a 3rd slot - sometimes that is shared and can disable a portion of the onboard M.2. I really like my NVMe storage as it noticeably speeds everything up.

  • @sciencee34
    @sciencee34 1 month ago

    Hi, have you looked at laptop GPUs and how those compare?

    • @DigitalSpaceport
      @DigitalSpaceport  26 days ago +1

      I have a 6GB 3060 in my laptop and it works great for inference workloads (like Ollama), but unfortunately it's 6GB and not the 8GB or 12GB that desktop 3060s have. I could have spent a little extra and gotten a larger-VRAM card, but now it is what it is.

  • @alx8439
    @alx8439 1 month ago

    Interesting to see that the TPS degrades if you keep using the same chat for every next question asked. I have a gut feeling this happens because the whole conversation history is sent to the model each time and it has to process it first - there's no caching in Ollama / Open WebUI. It also looks like some of them (Ollama or Open WebUI) calculate the generation speed wrongly, without accounting for the time it took to parse the context. To get around this you need to start a new conversation for each new query.
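
    A small sketch that makes this visible (not from the video; it assumes a local Ollama server on the default port and a pulled llama3.2 model): replay a growing message history through /api/chat and print how many prompt tokens were evaluated on each turn alongside the pure generation speed. Ollama reports context processing (prompt_eval_*) separately from generation (eval_*), so you can see how much of a long chat's time goes to re-reading history rather than generating.

        import requests

        URL = "http://localhost:11434/api/chat"
        history = []

        for question in ["Why is the sky blue?",
                         "And why are sunsets red?",
                         "Summarize both answers in one sentence."]:
            history.append({"role": "user", "content": question})
            data = requests.post(URL, json={"model": "llama3.2",
                                            "messages": history,
                                            "stream": False}).json()
            history.append(data["message"])  # keep the assistant reply in the thread
            gen_tps = data["eval_count"] / (data["eval_duration"] / 1e9)  # ns -> s
            print(f"prompt tokens evaluated: {data.get('prompt_eval_count', 0):4d}  "
                  f"generation: {gen_tps:.1f} t/s")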

  • @nosuchthing8
    @nosuchthing8 1 month ago

    How do they fit these monster video cards into laptops?

  • @KonstantinsQ
    @KonstantinsQ 1 month ago +1

    And since it's already clear that for some time the 3090 will be the best choice for the money, it would be interesting to see a comparison or analysis of different 3090 models: are they all the same for LLMs, or are there differences and benefits between them? And how do you pick the best 3090 remotely, on eBay or anywhere else? Thanks

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      I'll put together a written thing on the website and send that over here to you.

  • @nandofalleiros
    @nandofalleiros 1 month ago

    I'm using a 4090 and a 3090 for image generation with Flux1-dev here. The 4090 generates a 1024x1024 image in 16 seconds; the 3090, capped at 280W, generates one in 30s.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      That's a big difference.

    • @jeffwads
      @jeffwads 1 month ago

      Twice as fast? Hmm.

    • @nandofalleiros
      @nandofalleiros 1 month ago

      Yes, but you should consider that it's a 450W card against a 280W one. I'll undervolt and raise the power limit on the 3090 to 400W and test again. I'll also add at least 500MHz to the memory speed.

  • @Espaceclos
    @Espaceclos 1 month ago

    What CPU were you using? It seemed to be at 100%

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      This is getting tested right now. Video on that very soon. Maybe tomorrow. Interesting stuff.

  • @KonstantinsQ
    @KonstantinsQ 1 month ago

    By the way, how about 2x 1080 Ti with 11GB of VRAM each vs a 3090? VRAM 22GB vs 24GB, watts 250x2 vs 350, price about $200 x2 vs $1000. That means for $1000 I could get 4x 1080 Tis. What are the other downsides of 2x 1080 Ti for AI?

    • @KonstantinsQ
      @KonstantinsQ 1 month ago

      Or even the 2060 12GB or 3060 12GB versions for $200-250 - are they a good catch, no?

  • @alirezashekari7674
    @alirezashekari7674 1 month ago

    awesome

  • @silentage6310
    @silentage6310 22 days ago

    It's cool!
    We need this comparison for image generation - Flux1-dev or SD.

    • @DigitalSpaceport
      @DigitalSpaceport  20 days ago

      It is kinda weird - this didn't start out as a testing lab thing, but it has evolved into one. I'm working on it and have a ComfyUI workflow set up, but man, I am a total noob with imgen. Any recommendations on how to start?

    • @silentage6310
      @silentage6310 17 days ago

      @@DigitalSpaceport I recommend simply using Automatic1111 (or the new fork with Flux support, Forge). It's portable and simple to use, and it shows the time spent on each image generation.

    • @DigitalSpaceport
      @DigitalSpaceport  16 days ago

      Flux Forge looks fantastic. Thanks!

  • @issa0013
    @issa0013 1 month ago

    Can you upgrade the VRAM in the 3090 to 48GB?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      That's a skill I don't have myself, but is this a service you can send GPUs in for?

  • @masmullin
    @masmullin 11 days ago

    FYI: I've tested the 4090 vs 3090 vs 7900 XTX for image generation. Rough numbers, as these just come from memory:
    Flux1.dev fp8 (from the ComfyUI Hugging Face repo), 20 steps, 1024x1024:
    7900 XTX: 42-45 seconds
    3090: 35-38 seconds
    4090: ~12 seconds
    TL;DR: the 4090 is a monster for image gen. Similar differences can be seen with SD3.5 and SDXL.
    For reference, the 7900 XTX will run qwen2.5-32b:Q4 at 22 t/s.

    • @DigitalSpaceport
      @DigitalSpaceport  1 day ago

      TY for dropping stats! I appreciate it. Helps guide me immensely.

  • @melheno
    @melheno 1 month ago

    The performance difference between the 3090 and 4090 is only the memory bandwidth difference, which is expected.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +2

      I was in the Ollama repo last night and there are apparently issues with both Ada performance and fp16 that have been addressed and will land in upcoming releases. Looks like I'll be doing some retesting!

  • @ScottDrake-hl6fk
    @ScottDrake-hl6fk 10 days ago

    I started with two 12GB 3060s (availability), then added two 10GB 3080s. The 3080s finish about twice as fast. 70B at q4 and q8 is not blazing fast, though it's usable. My upgrade path could be replacing the 3060s with two more 10GB 3080s, OR maybe one 24GB 3090. I suspect more smaller-capacity cards can apply more compute than one big card doing it all - I may be wrong, can anyone confirm?

    • @DigitalSpaceport
      @DigitalSpaceport  10 days ago

      @@ScottDrake-hl6fk A single 3090 is faster, as it doesn't split the workload. This is due to how llama.cpp handles parallelism: if you split 4 ways, each GPU process runs at about 25% right now. There are ways to change that, but it's up to the devs.
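
      A minimal way to sanity-check that on your own rig (not from the video; it assumes an NVIDIA setup, that your Ollama build honors CUDA_VISIBLE_DEVICES, and that no other Ollama instance already holds the default port): start the server pinned to one GPU, run your usual benchmark prompt, then restart it with all GPUs visible and compare tokens/s.

          import os
          import subprocess

          def serve_on(gpu_indices: str) -> subprocess.Popen:
              """Start 'ollama serve' restricted to the given comma-separated GPU indices."""
              env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_indices)
              return subprocess.Popen(["ollama", "serve"], env=env)

          # server = serve_on("0")      # confine the model to a single card
          # ... run the benchmark, record t/s, then server.terminate() ...
          # server = serve_on("0,1")    # let Ollama split across two cards and rerun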

  • @mz8755
    @mz8755 1 month ago

    The results seem a bit odd to me... it can't be that close

  • @Dj-Mccullough
    @Dj-Mccullough 1 month ago

    Man, you can really tell that Nvidia GPUs hold their value based on RAM amounts. The 3080 Ti is about half the used price of the 3090. I finally found a 3090 for 680, though - good enough for me. People who want to get into playing with this stuff can realistically do it with as little as a 1070 8GB.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Yeah, I showed the 1070 Ti back in that mix-and-match video and the performance is really very good. I was surprised, and I hope to get the message out more that pretty much any Nvidia card can do some level of inference work.

  • @crazykkid2000
    @crazykkid2000 1 month ago

    Image generation is where you will see the biggest difference

  • @autohmae
    @autohmae 1 month ago

    Honestly, I think VRAM makes the biggest difference in performance.

  • @DataJuggler
    @DataJuggler 1 month ago

    My 3090 only has 23 gigs, I think. Or that is what Omniverse shows.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      It should be 24 - maybe it's displaying GiB and not GB.

    • @DataJuggler
      @DataJuggler 1 month ago

      @@DigitalSpaceport You are right. I looked it up in Task Manager > Performance, and then Bing Chat explained to me the dedicated memory and the shared 64 gigs the GPU can offload to.