4090 Local AI Server Benchmarks

  • Published: Nov 28, 2024

Comments • 61

  • @alexandrew83
    @alexandrew83 1 month ago +8

    Good Morning Everyone. Have an amazing day.

  • @ericvaish8841
    @ericvaish8841 1 month ago

    Man, you have no idea how much help this video has been. There are so few reviews of the 4090 for LLMs. Awesome video, kudos!!

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Glad it has helped. Do you have any other questions about 4090s for LLMs?

    • @ericvaish8841
      @ericvaish8841 1 month ago

      @@DigitalSpaceport You have answered every question I had with your diverse testing. Thanks! The VRAM really holds this GPU back in NLP tasks. Would you recommend waiting for the 5090?

  • @jacocoetzee762
    @jacocoetzee762 1 month ago +2

    The video I was looking for ❤. Thank you so much. Would it be possible to “cluster” these GPUs and potentially run larger models?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Yes, it does work like that. Here is a dual 4090 video: ruclips.net/video/Aocrvfo5N_s/видео.html
      You should also check this channel's history for quad demonstrations. You can also mix and match generations and VRAM sizes, to a point.

    • @jacocoetzee762
      @jacocoetzee762 1 month ago

      @@DigitalSpaceport oh sweet! Thanks! Checking it out now.

  • @dubesor
    @dubesor 1 month ago +1

    You can definitely run Nemotron 70B on a single 4090; I do that every day. You can offload about half of the model onto VRAM and compute the other layers on the CPU. This gives me around 2.5 tok/s, which is still slow but workable.
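
    For reference, this split is llama.cpp's layer offloading; a minimal sketch with the llama-cpp-python bindings, where the GGUF path and the 40-layer figure are assumptions to tune for a 24 GB card:

        from llama_cpp import Llama

        # Load a 70B GGUF with only part of the network in VRAM; the
        # remaining layers are evaluated on the CPU. The path and the
        # layer count below are assumptions, not exact values.
        llm = Llama(
            model_path="models/Llama-3.1-Nemotron-70B-Instruct-Q4_K_M.gguf",
            n_gpu_layers=40,   # roughly half the layers on a 24 GB 4090
            n_ctx=4096,
        )

        out = llm("Explain layer offloading in one paragraph.", max_tokens=256)
        print(out["choices"][0]["text"])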

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      We should see 3090 and 4090 prices come down when the 5090 drops, so hopefully 2.5 t/s won't have to be the norm. I also suspect NVIDIA will release a 32B in the next few months.

    • @lglewis976
      @lglewis976 26 days ago

      Running MarsupialAI's Llama-3.1-Nemotron-70B-Instruct_iMat_GGUF/Llama-3.1-Nemotron-70B-Instruct_iQ2xxs.gguf (19.1GB) in LM Studio on a 7900 XTX: 14.81 tok/sec. It fits.

    • @maxmustermann194
      @maxmustermann194 23 days ago

      @@DigitalSpaceport My dual 3090 rig runs Nemotron 70B Q4 (a 43 GB model) at 14.7 TPS. Definitely usable. That's with a 270 W power limit applied.
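
      For reference, a per-GPU power cap like that is usually set with nvidia-smi's -pl flag; a minimal sketch for a dual-3090 box, where the GPU indices and the 270 W figure are assumptions and the command needs admin rights:

          import subprocess

          # Cap each GPU's board power at 270 W. The value must fall within
          # the supported range that "nvidia-smi -q -d POWER" reports.
          for gpu_index in (0, 1):
              subprocess.run(
                  ["nvidia-smi", "-i", str(gpu_index), "-pl", "270"],
                  check=True,
              )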

  • @computersales
    @computersales 1 month ago +2

    I think it would be fun to see a GPU showdown for these AI tasks. Compare some Tesla GPUs and some budget consumer GPUs.

  • @MenkarX
    @MenkarX 1 month ago +1

    For the 5090, NVIDIA should include a trolley.

  • @Duodduck
    @Duodduck 1 month ago +1

    I think I know why Qwen went crazy: by default, Ollama uses a 2048-token context limit, so Qwen exceeded the limit, couldn't see "limit to 5000 words" anymore, and just kept going. In Open WebUI, you can set the context length.
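
    For reference, the context window can also be raised per request through Ollama's HTTP API; a minimal sketch using the num_ctx option (the model tag and the 8192-token window are assumptions):

        import requests

        # Ask Ollama for a larger context window so a long prompt, including
        # its "limit to 5000 words" instruction, stays visible to the model.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "qwen2.5:32b-instruct",       # hypothetical tag
                "prompt": "Write a short story. Limit it to 5000 words.",
                "stream": False,
                "options": {"num_ctx": 8192},          # default is 2048
            },
            timeout=600,
        )
        print(resp.json()["response"])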

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      I think you're onto something there. I had set it to 4096, but it's back to its default now.

  • @sebastianpodesta
    @sebastianpodesta 1 month ago +1

    Hi, another great video! Regarding the speed: is there a big difference in speed between the 4090 and the 3090, because of the extra CUDA cores?

    • @Lexxxco1
      @Lexxxco1 1 month ago

      Yes, in some AI applications the 4090 is 60-80% faster than the 3090, such as when the model needs to be unloaded several times. In most applications the 4090 is ~40% faster.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +2

      I'm going to test this, since I have things in a mess right now anyway. I'll use the same models unless you have an extra one you would like checked.

  • @lietz4671
    @lietz4671 1 month ago

    The number of output tokens depends on the number of CUDA cores and the amount of VRAM.
    The 3090 has been discontinued, so you can no longer buy it new.
    In that case, which 40-series card would give an output token rate similar to the 3090?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      If you are only doing inference, the model size you want to run should drive the purchase decision. If you're targeting 8B-size models, a 4070 16GB is a very economical path. For larger model sizes, a combination of GPU VRAM reaching 48GB is preferred.

  • @maxmustermann194
    @maxmustermann194 1 month ago

    That is a comically large card :D

  • @codescholar7345
    @codescholar7345 1 month ago

    Great video! I downloaded both recommended models and they are super fast. Is there a site that lets one know which models will fit completely in a GPU at certain quantization levels? Do you have a RAG video with Open WebUI and a 3090? Thanks

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      I just use the GB figures on the Ollama site, and realistically not a lot of older models are that good. I don't go back past Llama 3.1 myself for main models. The RAG video is getting redone. Making a video on a topic that's new to me gets out of scope fast, and folks shouldn't watch multi-hour videos of me rambling and fixing things.

    • @codescholar7345
      @codescholar7345 1 month ago

      @@DigitalSpaceport Sounds good. I used to try to get large models running, but it's better to get models that fit 100% in the GPU. I found a decent calculation site, and it looks like qwen2.5:32b-instruct-q5_K_S is the largest new model that can fit 100% in a 3090, at about 20 TPS. I've been so busy setting up an EPYC virtualization server (virtualized storage server too) and an AI workstation, but it's all coming together! I want to get RAG and SearXNG set up for Open WebUI. I wish I could tell my AI to build those projects while I sleep without me submitting hundreds of commands.. :)
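
      For reference, the back-of-the-envelope math those calculator sites do is roughly parameter count times bits-per-weight, plus KV cache and runtime overhead; a minimal sketch where the KV-cache and overhead figures are assumptions, not exact numbers:

          def estimate_vram_gb(params_b: float, bits_per_weight: float, ctx: int,
                               kv_gb_per_1k_ctx: float = 0.4,
                               overhead_gb: float = 1.5) -> float:
              """Very rough VRAM estimate for a quantized model.

              params_b: parameters in billions (32 for a 32B model)
              bits_per_weight: effective bits of the quant (Q4_K_M is ~4.8)
              ctx: context window in tokens
              """
              weights_gb = params_b * bits_per_weight / 8    # the weights
              kv_gb = ctx / 1000 * kv_gb_per_1k_ctx          # the KV cache
              return weights_gb + kv_gb + overhead_gb

          # Example: a 32B model at ~4.8 bits/weight with a 4k context.
          print(f"{estimate_vram_gb(32, 4.8, 4096):.1f} GB estimated")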

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      The SearXNG/Redis/Tika setup video I did has all that compose stuff linked in the description, so you could paste it in. Might save some time. I think it's labelled vision + web in the video history. Don't forget to leave some room for embedding and vision models.

    • @codescholar7345
      @codescholar7345 1 month ago

      @@DigitalSpaceport Okay cool, I'll check it out. Thanks!

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Never miss a chance to get a view, they tell me, lol: ruclips.net/video/IC_LGmqjryg/видео.html

  • @ChrisCebelenski
    @ChrisCebelenski 1 month ago

    What I've come down to, for personal use anyway, is that it's not really about tokens/sec once you pass a certain threshold of, say, 9 or 10; anything past that is gravy. It's more about memory usage on the GPU now, and the larger models are still just out of reach for a single consumer card. Once you get past the consumer cards, it gets expensive real fast, and you need special cooling, etc. Finally, holy fat bottoms, Batman! That card is massive! I don't have a single case that would fit that monster, not even the Supermicro AI server I built, which is in a 4U rack! 32GB in the 4090 would be better, and really what we need is a cheaper H100 with modern cores in a consumer package.

  • @adharshkl7336
    @adharshkl7336 1 month ago

    Can we use 2 RTX 3060 12GB cards so combined we get 24GB of VRAM? One more question: how many TOPS will I get from a single RTX 3060 12GB card on average, irrespective of which model we are using?
    I love your content

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      I can answer your first question: yes, easily. The second is around 90, but I think that may not be as important a metric.

  • @HaydonRyan
    @HaydonRyan 1 month ago

    Awesome video. Would love to see an AMD MI60 in a rig! Older, but cheap and 32GB.

  • @gaidin
    @gaidin 1 month ago

    I love playing with Flux and some other models on a mid-to-high-end GPU... but your use case seems to be spending $5k to have two high-end GPUs tell you what you can cook with the contents of your pantry lol. Not hating on the love for crazy good hardware... but I don't understand the application of it in this case.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Practical application and creativity are the areas those questions are testing. I do use it for such tasks myself, as well as many other tasks not demonstrated so far. The question evaluates quality in no small part on things like following details and completing answers, which surface the minor things that can fail. Also, I have a lot more than 2 high-end GPUs 🤓

  • @billkillernic
    @billkillernic 1 month ago +1

    1:22 What kind of motherboard?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      The build specs are linked in the description. Sorry, I can't copy-paste for some reason in the Studio app, and I'm out right now.

  • @ChrisCebelenski
    @ChrisCebelenski 1 month ago

    Because people will ask me, here are some of my numbers with 3x 4060 Ti 16GB GPUs: Llama 3.2 3B_instruct_fp16, 38 T/S on 1 GPU. Qwen2.5 32b_instruct Q6_K, 9.6 T/S on 3 GPUs. Qwen does cause Ollama to freak out, however, as noted; if anyone can suggest how to get these models working, it would be appreciated. (I got "GGGGGGGGGG" as the output for the story question, and then it was unresponsive until reset.)

  • @timothymchudson
    @timothymchudson 1 month ago

    How much would you accept for 1 hour of consultation?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      I am sharing as I learn, but I am not qualified to consult on these topics. I do appreciate the thought, however 🥰

    • @timothymchudson
      @timothymchudson 1 month ago

      @DigitalSpaceport Would you accept $25 an hour? I'm sure I won't need anything past 1 hour for our first meeting. I just want to run a vision by you, and you tell me if it's possible in your opinion, and if so, what equipment/setup I'll need to bring it to life.

  • @jeffwads
    @jeffwads 1 month ago

    When I saw it, I immediately thought: how big is the 5090 going to be…

  • @zahirkhan778
    @zahirkhan778 1 month ago

    Is the 4090 that big, or are you a small person? I'm confused.

  • @iseverynametakenwtf1
    @iseverynametakenwtf1 1 month ago +1

    Weird, my Strix 4090 gets way better results. Might it be my 128GB of RAM?

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago

      Can you tell me more about your complete hardware and software setup and results please?

    • @iseverynametakenwtf1
      @iseverynametakenwtf1 1 month ago

      @@DigitalSpaceport I have an ASUS Pro WS WRX80E-SAGE SE WiFi II with the ASUS Strix 4090, a Threadripper Pro 3955WX, and 128GB RAM. I am getting about 40% better results than you with the same models; not sure if it is because of my RAM or because I am using the GGUF versions in LM Studio.

  • @cracklingice
    @cracklingice 1 month ago

    If I were paying an outrageous amount to get a GPU early, I would much rather pay the scalper, because most of them are just normal people trying to make a bit of cash on the side because life is hard. NVIDIA is not hard up for money.

    • @cracklingice
      @cracklingice 1 month ago

      I also have no desire to pay a dime over a $900 MSRP for an 80-class card at any point in time.

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      Scalpers are unlikely to go away. A $3500 5090 on eBay is likely.

  • @ME-dg5np
    @ME-dg5np 1 month ago

    I wonder... next, AMD RDNA4 with 42 GB VRAM? AMD, wake up, hurry up!
    🎉😅

    • @DigitalSpaceport
      @DigitalSpaceport  1 month ago +1

      AMD can also get MAD, like they make their GPU customers, lol

  • @prfrag
    @prfrag 1 month ago

    The 4090 is almost bigger than you 😀