LocalAI LLM Testing: i9 CPU vs Tesla M40 vs 4060Ti vs A4500
- Published: 24 Dec 2024
- Sitting down to run some tests with an i9-9820X, Tesla M40 (24GB), 4060 Ti (16GB), and an A4500 (20GB)
Rough edit in lab session
Our website: robotf.ai
Machine specs here: robotf.ai/Mach...
GPUs being tested: (These are affiliate-based links that help the channel if you purchase from them!)
Tesla M40 amzn.to/3Yf4yXC
4060 Ti 16GB amzn.to/3NeSEGT
RTX A4500 20GB amzn.to/3TXtAYR
GPU Bench Node Components: (These are affiliate-based links that help the channel if you purchase from them!)
Open Air Case amzn.to/3U08Y27
30cm Gen 4 PCIe Extender amzn.to/3Unhclh
20cm Gen 4 PCIe Extender amzn.to/4eEiosA
1 TB NVME amzn.to/4gWFcFb
Corsair RM850x amzn.to/3NkITa4
128GB Lexar SSD amzn.to/3TZYYGh
G.SKILL Ripjaws V Series DDR 64GB Kit amzn.to/4dAZrWm
Core I9 9820x amzn.to/47UuIST
Noctua NH-U12DX i4 CPU Cooler: amzn.to/3TZ7O6R
Supermicro C9X299-PGF Logic Board amzn.to/3BxbWVr
Remote Power Switch amzn.to/3BubQOg
Recorded and best viewed in 4K
Your results may vary due to hardware, software, model used, context size, weather, wallet, and more!
Thanks for this nice test!
I just bought the 4060 Ti 16GB to complement my two RTX 3070 8GB cards - now I have 32GB, enough to run Mixtral 8x7B or the Qwen2.5-Coder 32B model.
A note: for small models, like Llama 3.2 3B, I keep the model on just one GPU, as splitting an LLM across all GPUs really hurts tokens per second. Only big models take advantage of multi-GPU, due to memory constraints.
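For anyone curious how to pin a small model like that to one card, here is a minimal sketch using llama-cpp-python (one common backend, not necessarily the commenter's stack); the model path and device index are placeholder assumptions:

```python
# Minimal sketch: keep a small model entirely on one GPU so no cross-GPU
# traffic eats into tokens/sec. Path and device index are placeholders.
import os

# Expose only the card you want to dedicate to the small model (assumed to be
# CUDA device 0). Must be set before anything initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-3b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload every layer; a 3B quant fits comfortably in 8GB+ VRAM
    n_ctx=4096,
)

out = llm("Q: Why keep small models on a single GPU?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```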
Very nice!
Thank you. It would be interesting to see some evaluation of multiple consumer gpus working on the same llm.
I have another video testing 1, 2, 3, 4, and 6 4060s (which I consider consumer level) together on the same LLM here - ruclips.net/video/Zu29LHKXEjs/видео.html - but if you have more specific ideas please let me know.
Great breakdown. Since Ollama support for AMD has become decent, a good bang-for-buck card is the MI50 16GB. I did a similar test for comparison and it comes in at about the 4060 Ti's level for output, with faster prompt processing due to sheer memory speed (HBM2). ~20 tok/sec out. Not bad for a card that can be had on eBay for $150-$200 USD.
Def not bad, I'm looking around for AMD cards to throw into the testing
great content and relevant to me since I recently bought a 4060 ti 16gb for ai.
thanks for watching!
What do you want the AI to do for you ?
I'm thinking of maybe getting two of those for ai. how's your one doing?
I want to run big models cheaply. I use a 1080 Ti now on 8B Llama - fast enough, but I would like a reliable code assistant with a bigger model. Suggestions? Can you test multiple 3060s in parallel on a big model?
You can add another cheap 8GB GPU and run models like Qwen2.5 Coder from the 1.5B up to the 14B (GGUF) (that one uses 11GB of VRAM plus extra for context). The Qwen Coder 32B requires a 24GB VRAM setup.
Small models, like the 3B ones, run better in single-GPU mode.
I may be wrong, but I am pretty sure you can change the seed from random to fixed, so given the same prompt with the same seed the responses should be exactly the same across multiple tests.
You are correct, and you can absolutely do that! I normally don't do that in the tests (the ones on the channel, at least)
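As a rough illustration of that (not the channel's actual harness), a fixed-seed request against a LocalAI server via its OpenAI-compatible API might look like this; the endpoint URL and model name are placeholders:

```python
# Sketch: pass a fixed seed so the same prompt + same settings reproduce the
# same completion across runs. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instruct",                  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        seed=42,                                        # fixed seed instead of random
    )
    return resp.choices[0].message.content

# With an identical prompt, seed, and sampling settings, repeated calls should
# return identical text, which makes hardware A/B comparisons less noisy.
print(ask("Explain KV cache in one sentence.") == ask("Explain KV cache in one sentence."))
```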
Can you try to run the llama3.1 405B model on the CPU and see what kind of response we can get?
I haven't tried it with pure CPU inference, but I did do it with distributed inference over the network in another video. We can certainly try, as I have nodes with the RAM to do it.
@RoboTFAI oh, can you send me the link to the other video? I would be interested to see how you did the distributed setup.
@ZIaIqbal ruclips.net/video/CKC2O9lcLig/видео.html - that's the Llama 3.1 405B distributed inference video. It's using LocalAI (llama.cpp workers/etc.) under the hood:
LocalAI docs on distributed inference: localai.io/features/distribute/
Llama.cpp docs: github.com/ggerganov/llama.cp...
1. Is it possible to run the LLM on both the CPU & GPU at the same time? 2. And how come AMD GPUs aren't used that much in AI? 3. What do you believe is the minimum Nvidia GPU for AI? 4. How important is the amount of RAM?
1. Yes! Normally controlled by the `gpu_layers` setting in the model config, which determines how many layers to offload to the GPU(s); the rest will use RAM/CPU (see the sketch after these answers).
2. Nvidia is just mainstream, and their software support is pretty far ahead. AMD is definitely being used too - you don't hear about it as much, but there are plenty of large orgs running big clusters of AMD-based cards.
3. That depends on your needs and your expectations for model response times (TPS). Most models can run on a good CPU if you are patient enough for the responses.
4. Not that important UNLESS you want to do #1 - split models between GPU and CPU, or run them purely on CPU inference. If so, you want as much RAM as possible (same thing we all want from our GPUs!)
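A minimal sketch of #1 using llama-cpp-python for illustration (LocalAI exposes the same idea through the `gpu_layers` field in a model's YAML config); the model path, layer count, and thread count below are placeholder assumptions:

```python
# Partial offload: the first N layers live in VRAM, the remainder run on the
# CPU from system RAM. Values below are placeholders, not a recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,   # how many layers to push onto the GPU(s)
    n_ctx=8192,
    n_threads=16,      # CPU threads that handle the layers left in RAM
)

print(llm("Say hi in five words.", max_tokens=16)["choices"][0]["text"])
```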
@RoboTFAI Thank you for your generous response. And I'm now a subscriber.
If I run Codestral 22B Q4_K_M on my P5000 (Pascal architecture), I get 11 t/s evaluation, so the P5000 performs at around 75% of a 4060 Ti. But when I open Nvidia Power Management I can observe it only consumes 140W under load, while it should be able to go up to 180W. BTW, both these cards have 288GB/s memory bandwidth. I must have a bottleneck in my system, which is an Intel 11th-gen i7 laptop (4-core CPU) with the eGPU over Thunderbolt 3.
That's pretty decent speed in that setup
@RoboTFAI It does slow down with larger context though, to say 8-9 t/s, and when I go for Q5_K_S that becomes 7-8 t/s - still doable.
Play with your data chunk sizes; it's usually unoptimised memory movement that limits the throughput. Nvidia has a tutorial that explains CUDA much better than I can. The P40 and P100 do the same thing on some models too.
So I picked up a 4060 laptop and a 4070 Ti Super and have spent the last couple of days migrating my PC into an AI server. I haven't yet gotten to the AI, but in the meanwhile I'm putting the warranties to the test with some hardcore mining - almost nostalgic for when Bitcoin was $10/BTC.
I am realizing the 16GB of VRAM is a bit of a bottleneck though. Do you think adding an M40 or two would help? Will the GPUs be able to access each other's VRAM?
Yes, and I will answer some of this question in the next video! Mixing GPUs/tensor splitting.
@RoboTFAI sweet, sounds like a good video
What software is this? The GUI that you use, I mean - where can I download it?
The testing platform? That's a custom-built Streamlit/Python/LangChain app I built specifically for my lab, so it's not really an app I distribute.
@RoboTFAI but it looks like a great tool!
P40 vs 3090 Ti... just because there is so much of a price difference. And what can you get in loading speeds if your files are on a P900 Optane (280GB)? [Assuming one is setting up batch processing.]
I don't have either card to do testing with, will ask around friends/etc. Or might try to trade for a 3090 since everyone goes after them for their rigs...power hungry though
Thanks for comparing the different GPU hardware.
Can you run a test with, say, 6k input tokens and 1k output tokens?
That way we can see how large LLMs perform with a 6k input and 1k output.
Yea we can absolutely run some tests with much larger prompts/etc!
What application are you using to run this?
It's custom built by me - combo of Streamlit, Python, Langchain, etc, etc
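Not the author's app, but a rough sketch of what a small Streamlit + LangChain throughput harness pointed at a LocalAI endpoint could look like; the endpoint URL and model name are assumptions:

```python
# Toy tokens-per-second bench: send one prompt to an OpenAI-compatible server
# (e.g. LocalAI) and report output tokens divided by wall-clock time.
import time

import streamlit as st
from langchain_openai import ChatOpenAI

st.title("LLM throughput mini-bench")

prompt = st.text_area("Prompt", "Write a 500 word story about a robot lab.")
model = st.text_input("Model name", "llama-3.1-8b-instruct")  # placeholder

if st.button("Run"):
    llm = ChatOpenAI(
        base_url="http://localhost:8080/v1",  # LocalAI's OpenAI-compatible API
        api_key="not-needed",
        model=model,
        temperature=0.7,
    )
    start = time.time()
    result = llm.invoke(prompt)
    elapsed = time.time() - start

    # Rough estimate from the usage metadata the server returns.
    usage = result.response_metadata.get("token_usage", {})
    completion_tokens = usage.get("completion_tokens", 0)
    tps = completion_tokens / elapsed if elapsed > 0 else 0.0

    st.write(result.content)
    st.metric("Output tokens/sec", f"{tps:.1f}")
```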
I wish someone would test those X99 motherboards with two Xeon processors, 64 threads, and up to 256 gigabytes of RAM. Would that run 70B models at 3 tokens per second or more?
I don't have any dual-processor X99 boards, but I do have single-processor X299 boards, one with 256GB.
@RoboTFAI does it run 70B models at 3 tokens per second or more?
thanks
You're welcome!
I was considering buying 12 Tesla M40s so I could train and use the largest language models. But after calculating how much wattage and electricity that is, I realized the city and the electric company might pay me a visit to figure out what's going on.😅
haha, they won't ask questions as long as you pay your bills! Luckily some of my lab is actually solar powered so helps offset costs (not really 🤣)
Great content. My problem is choosing an AM5 motherboard; I have 3 that I've got my eye on, but I don't know which one is more future-proof:
MSI MEG X670E ACE
ASUS ProArt X670E
ASUS ROG Strix X670E-E Gaming
Can you help?
I want it mostly for AI art and such. The MSI costs more; the ROG and ProArt are the same price (but I still don't know which of these two is better - the ProArt runs 2 PCIe slots at x8/x8 while the ROG is x8/x4). Is the MSI better than the ProArt?
Old question, but just commenting in case anyone else wonders - this sort of question has no proper answer since you provided no info whatsoever about your planned config. The primary differentiator is most likely price; if all you do is run one GPU on them and that's it, the cheapest is probably the best bang for your buck. The PCIe claim doesn't quite make sense and doesn't seem likely to be true either. My 2 cents: I've at times had issues with both ASUS and MSI, but the differentiator is that ASUS did "fix it" by issuing a refund, MSI did not. So I personally would not pay money for an MSI board. Well, I'd give maybe $20 for one at most. ASUS I have continued to use for years after and never again had issues with.
I run ROG boards myself, several of them. My main box is an X299 ROG Rampage VI EE right now. My "BS tolerance" is very low - I'm an ex IT pro and run mostly professional gear, Dell and HPE, but I do run ASUS custom stuff next to that.
The ProArt might serve you better than the ROG for workstation-type work at least. The ones you listed are all the X670E chipset. One thing to note: I don't know the specific differences in this generation, but the ProArt boards in general tend to be made with IOMMU, virtualization, and passthrough in mind compared to the ROG stuff. I've seen better IOMMU group separation on ProArt boards at least, but I don't know if that extends to better PCIe bus/switches or what - I only know I've seen a card sit in its own IOMMU group on a ProArt when the same card was grouped with a controller or something on the ROG.
It would be interesting to compare a 4070 Ti Super to the 4060 Ti, to see if the scaling is proportional to cost.
Don't have one to test with, but if you want to send me one I am happy to throw it through the gauntlet hahaha
Is it possible to use an RX 6800 for this task?
I do not have any AMD cards to test with, but there is ROCm for AMD and llama.cpp/LocalAI/etc etc do support it these days.
I'm planning to buy a GPU. I have 2 choices, a P100 and an M40 24GB, and I want to run an 8B model - is that enough for it? Currently I have a Ryzen 5 3600, 16GB DDR4, and a 1TB NVMe.
You have an M40, right? Can you provide tokens/s?
The P100 is Pascal architecture and newer than the M40, which is Maxwell architecture - so I would always recommend the newer card, depending of course on your budget and needs. Both will be power hungry.
Llama 3.1 8B? Depends on context size... it defaults to 128k, which is going to be heavy on your VRAM depending on quant/etc.
To give an idea, Meta publishes this as a guide (taken from huggingface.co/blog/llama31) on just context size vs KV cache size. You still have to load the model, other layers, etc....
Model Size 1k tokens 16k tokens 128k tokens
8B 0.125 GB 1.95 GB 15.62 GB
70B 0.313 GB 4.88 GB 39.06 GB
405B 0.984 GB 15.38 GB 123.05 GB
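As a sanity check on that table, a back-of-the-envelope calculation using Llama 3.1 8B's published shape (32 layers, 8 KV heads via GQA, head dim 128) and an fp16 KV cache lands on roughly the same numbers; those architecture values are stated here as assumptions for illustration:

```python
# KV cache size: K and V are each stored per layer, per KV head, per head dim,
# per token, at 2 bytes per value for fp16.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1024**3

for ctx in (1_000, 16_000, 128_000):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(32, 8, 128, ctx):.2f} GiB")
# Prints roughly 0.12, 1.95, and 15.6 GiB, in line with the 8B row above.
```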
I actually have 3 old M40s sitting around in the lab, as that is where I started my AI journey over a year ago! So yes, I can do testing with them.
I'm on an outdated high-end 4GB GPU from 2014 (so no acceleration) and an outdated high-end CPU from 2010, lmfao. It takes several hours before the first letter of a response is typed back to me, so I can only really do a single prompt a day hahaha. Also, the whole time it's thinking my system is sucking down 270W, and it only idles at 145W... I guess if my GPU helped it would add another 250W or so...
A4500 vs RTX3090 ??
Attempting to acquire a 3090 for the channel, stand by!
Llama 3 7B runs in near real-time on an Apple M1 processor, and presumably faster on an M2 or M3.
It does, I haven't brought Apple Silicon into the mix on the channel just yet - but I have a few M1, M1 Max as my daily machines
Running an OptiPlex 7040 SFF with 24GB DDR4, an i5-6700 (3.4 GHz, 4 cores), no GPU. I get 5 tokens per sec with `ollama run llama3.1:8b --verbose`, and 9 tps on the new 3.2 3B, on the single test "write a 4000 word lesson on the basics of python". It's usable. `ollama run codestral` (22B) pulled a 12 GB file. Same test: it used 99% CPU, 0% GPU, and 13 GB of RAM. It crawled for 7 minutes... 1.8 tps. But it ran.
With these kinds of tests, 2x 4060 Ti 16GB should be included, and how it performs. 24GB is not enough, and 32GB on a Quadro-class card is around 2700 euros. So 2x 4060 Ti seems like a sweet spot that you should cover. Know your audience, know the sweet spots - those are the videos people want to see.
Adding in 2x 4060s won't really increase the speed over 1 of them, at least not noticeably. There are some other videos on the channel addressing this topic a bit. Scaling out the number of video cards is really just about gaining extra VRAM. So it's always a balance of your budget, costs, power usage, and your expectations (that last one is the most important).
Lower, lower your expectations until your goals are met! haha
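To make that trade-off concrete, here is a hedged sketch (llama-cpp-python used for illustration; the model path and split ratios are placeholders) of splitting one large model across two 16GB cards - the win is the combined 32GB of VRAM, not extra tokens/sec:

```python
# Tensor split across two GPUs: a ~20GB 32B quant only fits because both
# cards' VRAM is pooled; throughput stays close to a single card's.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-coder-32b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,          # offload everything across the available GPUs
    tensor_split=[0.5, 0.5],  # share layers roughly evenly between GPU 0 and GPU 1
    n_ctx=8192,
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```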