LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500

  • Published: 24 Dec 2024
  • Sitting down to run some tests with i9 9820x, Tesla M40 (24GB), 4060Ti (16GB), and an A4500 (20GB)
    Rough edit in lab session
    Our website: robotf.ai
    Machine specs here: robotf.ai/Mach...
    GPUs being tested: (These are affiliate-based links that help the channel if you purchase from them!)
    Tesla M40 amzn.to/3Yf4yXC
    4060ti 16GB amzn.to/3NeSEGT
    RTX A4500 20GB amzn.to/3TXtAYR
    GPU Bench Node Components: (These are affiliate-based links that help the channel if you purchase from them!)
    Open Air Case amzn.to/3U08Y27
    30cm Gen 4 PCIe Extender amzn.to/3Unhclh
    20cm Gen 4 PCIe Extender amzn.to/4eEiosA
    1 TB NVME amzn.to/4gWFcFb
    Corsair RM850x amzn.to/3NkITa4
    128GB Lexar SSD amzn.to/3TZYYGh
    G.SKILL Ripjaws V Series DDR 64GB Kit amzn.to/4dAZrWm
    Core I9 9820x amzn.to/47UuIST
    Noctua NH-U12DX i4 CPU Cooler: amzn.to/3TZ7O6R
    Supermicro C9X299-PGF Logic Board amzn.to/3BxbWVr
    Remote Power Switch amzn.to/3BubQOg
    Recorded and best viewed in 4K
    Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!

Comments • 62

  • @Lemure_Noah
    @Lemure_Noah 1 month ago +1

    Thanks for this nice test!
    I just bought the 4060 Ti 16GB to complement my two RTX 3070 8GB cards - now I have 32GB, good enough to run Mixtral 8x7B or a Qwen2.5-Coder 32B model.
    A note: for small models, like Llama 3.2 3B, I load the model on just one GPU, as splitting an LLM across all GPUs hurts tokens per second a lot. Only big models take advantage of multi-GPU, due to memory constraints.
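
    (For illustration: a minimal sketch of the single-GPU approach described above, assuming a CUDA setup and the llama-cpp-python binding rather than whatever this commenter actually runs; the model path is hypothetical.)

        # Restrict the process to one GPU before any CUDA library initializes,
        # so a small model is not split across cards.
        import os
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # use only the first GPU

        from llama_cpp import Llama

        # Hypothetical path; any small GGUF model works the same way.
        llm = Llama(model_path="./models/llama-3.2-3b-q4_k_m.gguf", n_gpu_layers=-1)
        print(llm("Why is the sky blue?", max_tokens=64)["choices"][0]["text"])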

  • @andrewowens5653
    @andrewowens5653 6 months ago +5

    Thank you. It would be interesting to see some evaluation of multiple consumer GPUs working on the same LLM.

    • @RoboTFAI
      @RoboTFAI 6 months ago +1

      I have another video testing 1, 2, 3, 4, and 6 4060s (which I consider consumer level) together on the same LLM here - ruclips.net/video/Zu29LHKXEjs/видео.html - but if you have more specific ideas, please let me know.

  • @DarrenReidAu
    @DarrenReidAu 3 months ago +3

    Great breakdown. Since Ollama support for AMD has become decent, a good bang for the buck is the MI50 16GB. I did a similar test for comparison and it comes in a bit above the 4060 Ti for output, with prompt tokens faster due to sheer memory speed (HBM2). ~20 tok/s out. Not bad for a card that can be had on eBay for $150-$200 USD.

    • @RoboTFAI
      @RoboTFAI 3 months ago +2

      Def not bad, I'm looking around for AMD cards to throw into the testing

  • @fooboomoo
    @fooboomoo 5 months ago +4

    Great content, and relevant to me since I recently bought a 4060 Ti 16GB for AI.

    • @RoboTFAI
      @RoboTFAI 5 months ago

      thanks for watching!

    • @Matlockization
      @Matlockization 3 months ago

      What do you want the AI to do for you?

    • @TheJunky228
      @TheJunky228 3 days ago

      I'm thinking of maybe getting two of those for AI. How's yours doing?

  • @C650101
    @C650101 2 months ago +3

    I want to run big models cheaply. I use a 1080 Ti now on an 8B Llama - fast enough, but I would like a reliable code assistant with a bigger model. Suggestions? Can you test multiple 3060s in parallel on a big model?

    • @Lemure_Noah
      @Lemure_Noah 1 month ago

      You can add another cheap 8GB GPU and run models like Qwen2.5 Coder from the 1.5B up to the 14B (GGUF); that one uses about 11GB of VRAM plus extra for context. Qwen2.5 Coder 32B requires a 24GB VRAM setup.
      Small models, like the 3B ones, run better in single-GPU mode.

  • @benjaminhudsondesign
    @benjaminhudsondesign 6 days ago

    I may be wrong, but I'm pretty sure you can change the seed from random to fixed, so given the same prompt with the same seed, the responses should be exactly the same across multiple tests.

    • @RoboTFAI
      @RoboTFAI 6 days ago

      You are correct, and you can absolutely do that! I normally don't do that in the tests (the ones on the channel, at least).
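
      (For illustration: a minimal sketch of the fixed-seed idea, assuming a LocalAI or other OpenAI-compatible endpoint at a hypothetical local address and a hypothetical model name; exact reproducibility still depends on the backend honoring the seed.)

        # Send the same prompt twice with temperature 0 and a fixed seed;
        # a backend that honors the seed should return identical text.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # hypothetical endpoint

        def ask(prompt: str) -> str:
            resp = client.chat.completions.create(
                model="llama-3.1-8b-instruct",  # hypothetical model name
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                seed=42,  # fixed seed instead of random
            )
            return resp.choices[0].message.content

        print(ask("Name three GPUs.") == ask("Name three GPUs."))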

  • @ZIaIqbal
    @ZIaIqbal 3 months ago +1

    Can you try to run the llama3.1 405B model on the CPU and see what kind of response we can get?

    • @RoboTFAI
      @RoboTFAI 3 months ago

      I haven't tried pure CPU inference, but I did do it with distributed inference over the network in another video. We can certainly try, as I have nodes with the RAM to do it.

    • @ZIaIqbal
      @ZIaIqbal 3 months ago +1

      @@RoboTFAI oh, can you send me the link to the other video, I would be interested to see how you did the distributed setup.

    • @RoboTFAI
      @RoboTFAI 3 months ago

      @@ZIaIqbal ruclips.net/video/CKC2O9lcLig/видео.html - that's the Llama 3.1 405B distributed inference video. It's using LocalAI (llama.cpp workers/etc) under the hood:
      LocalAI docs on distributed inference: localai.io/features/distribute/
      Llama.cpp docs: github.com/ggerganov/llama.cp...

  • @Matlockization
    @Matlockization 3 months ago +1

    1. Is it possible to run the LLM on both the CPU & GPU at the same time? 2. And how come AMD GPUs aren't used that much in AI? 3. What do you believe is the minimum Nvidia GPU for AI? 4. How important is the amount of RAM?

    • @RoboTFAI
      @RoboTFAI 3 months ago +1

      1. Yes! Normally controlled by the `gpu_layers` setting in the model config, which determines how many layers to offload to the GPU(s); the rest will use RAM/CPU.
      2. Nvidia is just mainstream, and their software support is pretty far ahead. AMD is definitely being used too - you don't hear about it as much, but there are tons of large orgs running big clusters of AMD-based cards.
      3. That depends on your needs and your expectations of model response times (TPS). Most models can run on a good CPU if you are patient enough for the responses.
      4. Not that important UNLESS you want to be able to do #1 - and split models, or run them purely on CPU inference. If so, you want as much RAM as possible (same thing we all want from our GPUs!)
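
      (For illustration of point 1: a minimal sketch of partial offloading, assuming the llama-cpp-python binding rather than the channel's own setup; there the analogous knob is n_gpu_layers, and the path and layer count are hypothetical.)

        # Offload only part of the model to the GPU(s); the remaining
        # layers stay in system RAM and run on the CPU.
        from llama_cpp import Llama

        llm = Llama(
            model_path="./models/codestral-22b-q4_k_m.gguf",  # hypothetical path
            n_gpu_layers=30,  # e.g. 30 layers on the GPU, the rest on CPU/RAM
            n_ctx=8192,       # context window also affects VRAM/RAM use
        )
        print(llm("Write a haiku about VRAM.", max_tokens=64)["choices"][0]["text"])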

    • @Matlockization
      @Matlockization 3 months ago +1

      @@RoboTFAI Thank you for your generous response. And I'm now a subscriber.

  • @jeroenadamdevenijn4067
    @jeroenadamdevenijn4067 6 months ago +3

    If I run Codestral 22B Q4_K_M on my P5000 (Pascal architecture), I get 11 t/s evaluation, so that means the P5000 performs at around 75% of a 4060 Ti. But when I open Nvidia Power Management I can observe it only consumes 140W under load, while it should be able to go up to 180W. BTW, both these cards have 288GB/s memory bandwidth. I must have a bottleneck in my system, which is an Intel 11th gen i7 laptop (4-core CPU) with an eGPU over Thunderbolt 3.

    • @RoboTFAI
      @RoboTFAI 6 months ago +2

      That's pretty decent speed in that setup

    • @jeroenadamdevenijn4067
      @jeroenadamdevenijn4067 6 months ago +1

      @@RoboTFAI It does slow down with larger context though, say 8~9 t/s, and when I go for Q5_K_S that becomes 7~8 t/s - still doable.

    • @stevenwhiting2452
      @stevenwhiting2452 5 months ago +1

      Play with your data chunk sizes; it's usually unoptimised memory movement that limits the throughput. Nvidia has a tutorial that explains CUDA much better than I can. The P40 & P100 do the same thing on some models too.

  • @six1free
    @six1free 5 months ago

    So I swung a 4060 laptop and a 4070 Ti Super and have spent the last couple of days migrating my PC into an AI server. I haven't yet gotten to the AI, but in the meanwhile I'm putting the warranties to the test with some hardcore mining - almost nostalgic for when bitcoin was $10/BTC.
    I am realizing the 16GB VRAM is a bit of a bottleneck though. Do you think adding an M40 or two would help? Will the GPUs be able to crosstalk each other's VRAM?

    • @RoboTFAI
      @RoboTFAI 5 months ago +1

      Yes, and I will answer some of this question in the next video! Mixing GPUs/tensor splitting.

    • @six1free
      @six1free 5 months ago

      @@RoboTFAI Sweet, sounds like a good video.

  • @PedroBackard
    @PedroBackard 3 months ago

    What software is this? The GUI, I mean, that you use - where can I download it?

    • @RoboTFAI
      @RoboTFAI 3 months ago +1

      The testing platform? That's a custom-built Streamlit/Python/LangChain app I built specifically for my lab, so it's not really an app I distribute.

    • @strt
      @strt 2 months ago

      @@RoboTFAI but it looks like a great tool!

  • @tsclly2377
    @tsclly2377 5 months ago +1

    P40 vs 3090 Ti... just because there is so much of a price difference - and what loading speeds can you get if your files are on a P900 Optane (280GB)? [assuming one is setting up batch processing]

    • @RoboTFAI
      @RoboTFAI 5 months ago

      I don't have either card to test with; I'll ask around with friends/etc. Or I might try to trade for a 3090, since everyone goes after them for their rigs... power hungry though.

  • @nithinbhandari3075
    @nithinbhandari3075 5 months ago +1

    Thanks for comparing the different GPU hardware.
    Can you run a test with, say, 6k input tokens and 1k output tokens?
    Then we would know how large LLMs perform with 6k input and 1k output tokens.

    • @RoboTFAI
      @RoboTFAI 5 months ago

      Yea we can absolutely run some tests with much larger prompts/etc!

  • @donaldrudquist
    @donaldrudquist 5 months ago

    What application are you using to run this?

    • @RoboTFAI
      @RoboTFAI 5 months ago

      It's custom built by me - combo of Streamlit, Python, Langchain, etc, etc

  • @delightfulThoughs
    @delightfulThoughs 1 month ago +1

    I wish someone would test those X99 motherboards with two Xeon processors, 64 threads, and up to 256 gigabytes of RAM. Would that run 70B models at at least 3 tokens per second?

    • @RoboTFAI
      @RoboTFAI 1 month ago

      I don't have any dual-processor X99 boards - but I do have single-processor X299 boards, one with 256GB.

    • @delightfulThoughs
      @delightfulThoughs 1 month ago

      @@RoboTFAI Does it run 70B models at 3 tokens per second or more?

  • @fulldivemedia
    @fulldivemedia 5 months ago +1

    thanks

    • @RoboTFAI
      @RoboTFAI 5 months ago

      You're welcome!

  • @aaviusa835
    @aaviusa835 1 month ago +1

    I was considering buying 12 Tesla M40s so I could train and use the largest language models. But after calculating how much wattage and electricity that is, I realized the city and the electric company might pay me a visit to figure out what's going on.😅

    • @RoboTFAI
      @RoboTFAI 1 month ago

      haha, they won't ask questions as long as you pay your bills! Luckily some of my lab is actually solar powered so helps offset costs (not really 🤣)

  • @fulldivemedia
    @fulldivemedia 5 months ago

    Great content. My problem is choosing an AM5 motherboard; I have 3 I've got my eye on, but I don't know which one is more future-proof:
    MSI MEG X670E ACE
    ASUS ProArt X670E
    ASUS ROG Strix X670E-E Gaming
    Can you help?
    I want it mostly for AI art and such. MSI costs more; ROG and ProArt are the same price (but I still don't know which of those two is better - the ProArt has 2 PCIe slots at x8/x8 but the ROG is x8/x4). Is the MSI better than the ProArt?

    • @noth606
      @noth606 4 months ago

      Old question, but just commenting in case anyone else wonders - this sort of question has no proper answer, since you provided no info whatsoever about your planned config. The primary differentiator is most likely price; if all you do is run one GPU on them and that's it, the cheapest is probably the best bang for your buck. The PCIe claim makes no sense and doesn't seem likely to be true either. My 2ct is that I've at times had issues with both ASUS and MSI, but the differentiator is that ASUS did "fix it" by issuing a refund, while MSI did not. So I personally would not pay money for an MSI board. Well, I'd give maybe $20 for one at most. ASUS I have continued to use for years after and never again had issues with.
      I run ROG boards myself, several of them. My main box is an X299 ROG Rampage VI EE right now. My 'BS tolerance' is very low; I'm an ex IT pro and run mostly professional gear, Dell and HPE, but I do run ASUS custom stuff next to that.

    • @user-lg4le8xr4s
      @user-lg4le8xr4s 3 months ago

      The ProArt might serve you better than the ROG for workstation-type work, at least. The ones you listed are all X670E chipset. One thing to note: I don't know the specific differences in this generation, but the ProArt boards in general tend to be made with IOMMU stuff in mind (virtualization, passthrough, etc.) compared to the ROG stuff. I know I've seen better IOMMU group separation in ProArt boards at least, but I don't know if that extends to better PCIe buses/switches or what - I only know I've seen a card in its own IOMMU group on a ProArt when the same card was grouped with a controller or something on the ROG.

  • @georgepongracz3282
    @georgepongracz3282 5 months ago

    It would be interesting to compare a 4070 Ti Super to the 4060 Ti, to see if the scaling is proportional to cost.

    • @RoboTFAI
      @RoboTFAI 5 months ago

      Don't have one to test with, but if you want to send me one I am happy to throw it through the gauntlet hahaha

  • @STEELFOX2000
    @STEELFOX2000 5 months ago

    Is it possible to use an RX 6800 for this task?

    • @RoboTFAI
      @RoboTFAI 5 months ago

      I do not have any AMD cards to test with, but there is ROCm for AMD and llama.cpp/LocalAI/etc etc do support it these days.

  • @mohammdmodan5038
    @mohammdmodan5038 5 months ago

    I'm planning to buy a GPU. I have 2 choices, a P100 and an M40 24GB, and I want to run an 8B model - is either enough for that? Currently I have a Ryzen 5 3600, 16GB DDR4, and a 1TB NVMe.

    • @mohammdmodan5038
      @mohammdmodan5038 5 months ago

      You have an M40, right? Can you provide tokens/s?

    • @RoboTFAI
      @RoboTFAI 5 months ago

      The P100 is Pascal architecture and newer than the M40, which is Maxwell architecture - so I would always recommend the newer card, depending of course on your budget and needs. Both will be power hungry.
      Llama 3.1 8B? Depends on context size... it defaults to 128k, which is going to be heavy on your VRAM depending on quant/etc.
      To give an idea, Meta publishes this as a guide (taken from huggingface.co/blog/llama31) on just context size vs KV cache size. You still have to load the model, other layers, etc., etc....
      Model Size   1k tokens   16k tokens   128k tokens
      8B           0.125 GB    1.95 GB      15.62 GB
      70B          0.313 GB    4.88 GB      39.06 GB
      405B         0.984 GB    15.38 GB     123.05 GB
      I actually have 3 old M40s sitting around in the lab, as that is where I started my AI journey over a year ago! So yes, I can do testing with them.
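
      (For illustration: a rough back-of-the-envelope sketch of where those KV cache numbers come from, assuming an fp16 cache and Llama 3.1 8B's published dimensions of 32 layers, 8 KV heads, and head dim 128.)

        # Approximate KV cache size: 2 (K and V) * layers * kv_heads * head_dim
        # * bytes_per_value, per token of context.
        def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
            per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
            return per_token * n_tokens / 1024**3

        # Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128
        print(kv_cache_gb(32, 8, 128, 128_000))  # ~15.6 GB at 128k context
        print(kv_cache_gb(32, 8, 128, 1_000))    # ~0.12 GB at 1k context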

  • @TheJunky228
    @TheJunky228 3 days ago

    I'm on an outdated high-end 4GB GPU from 2014 (so no acceleration) and an outdated high-end CPU from 2010, lmfao. It takes several hours before the first letter of a response is typed back to me, so I can only really do a single prompt a day hahaha. Also, the whole time it's thinking, my system is sucking down 270W, and it only idles at 145W... I guess if my GPU helped it would add another 250W or so...

  • @jackflash6377
    @jackflash6377 5 months ago +2

    A4500 vs RTX 3090??

    • @RoboTFAI
      @RoboTFAI 5 months ago +2

      Attempting to acquire a 3090 for the channel, stand by!

  • @marsrocket
    @marsrocket 5 months ago

    Llama 3 7B runs in near real-time on an Apple M1 processor, and presumably faster on an M2 or M3.

    • @RoboTFAI
      @RoboTFAI 5 months ago

      It does, I haven't brought Apple Silicon into the mix on the channel just yet - but I have a few M1, M1 Max as my daily machines

  • 2 months ago

    Running an OptiPlex 7040 SFF with 24 GB DDR4, an i5-6700 (3.4 GHz, 4 cores), no GPU. I get 5 tokens per second with `ollama run llama3.1:8b --verbose`, and 9 TPS on the new 3.2 3B, on the single test "write a 4000 word lesson on the basics of python". It's usable. `ollama run codestral` (22B) pulled a 12 GB file. Same test: it used 99% CPU, 0% GPU, 13 GB RAM. It crawled for 7 minutes... 1.8 TPS. But it ran.

  • @Johan-rm6ec
    @Johan-rm6ec 4 months ago

    With these kinds of tests, 2x 4060 Ti 16GB must be included, and how it performs. 24GB is not enough, and 32GB on a Quadro-class card is around 2700 euros, so 2x 16GB seems like a sweet spot that you should cover. Know your audience, know the sweet spots - those are the videos people want to see.

    • @RoboTFAI
      @RoboTFAI 4 months ago

      Adding in 2x 4060s won't really increase the speed over 1 of them, at least not noticeably. There are some other videos on the channel addressing this topic a bit. Scaling out on the number of video cards is really just meant to gain you that extra VRAM (see the sketch below). So it's always a balance of your budget, costs, power usage, and your expectations (this is the most important one).
      Lower, lower your expectations until your goals are met! haha
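
      (For illustration: a minimal sketch of splitting for VRAM rather than speed, assuming the llama-cpp-python binding rather than the channel's own setup; the tensor_split ratios and model path are hypothetical.)

        # Split a model that does not fit on one 16GB card across two cards;
        # throughput stays roughly the same, but the combined VRAM lets it load.
        from llama_cpp import Llama

        llm = Llama(
            model_path="./models/qwen2.5-coder-32b-q4_k_m.gguf",  # hypothetical path
            n_gpu_layers=-1,          # offload everything
            tensor_split=[0.5, 0.5],  # spread tensors evenly over two GPUs
        )
        print(llm("def quicksort(", max_tokens=128)["choices"][0]["text"])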