Threadripper 3990X user here; they will never fix usage of more than 64 threads in Windows. I've tried it all.
Sounds like it's time for you to do the free upgrade to a superior Linux-based OS
@@WartimeFriction you forgot the "I use arch btw" as part of your comment
I think the culprit is that Win32 API function: GetLogicalProcessorInformation only supports up to 64 processing units, because it reports CPUs with a single 64-bit affinity mask (one processor group). GetLogicalProcessorInformationEx is the more modern, group-aware replacement.
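For the curious, a minimal C sketch of the difference - just an illustration; the group-aware GetLogicalProcessorInformationEx works too but needs a variable-length buffer, so GetActiveProcessorCount is the short version:

```c
/* Sketch: counting logical processors on Windows with and without
   processor-group awareness. Legacy APIs only see the calling
   thread's group, which is capped at 64 logical processors. */
#define _WIN32_WINNT 0x0601  /* GetActiveProcessorCount needs Win7+ */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Legacy view: processors in the current group only (max 64). */
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    printf("Current group: %lu logical processors\n", si.dwNumberOfProcessors);

    /* Group-aware view: every logical processor in every group. */
    printf("All groups:    %lu logical processors\n",
           (unsigned long)GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```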
Proton will have a maximum of 64 threads
Even Windows Pro for Workstations?
Also, about the Ti in the GPU, NVIDIA pronounces it both ways. Jensen, the CEO, pronounces it as T-I (Tee-eye), while Jeff Fisher pronounces it as Ti (Tie/Ty)
The plot thickens!
Internally I say T-I, but when I pronounce it out loud it comes out "Tie", so who knows lol
@@JeffGeerling I guess you need to send Red Shirt Jeff to NVIDIA HQ so we may know the answer; better not have another GIF situation
@@JeffGeerling well it's originally Ti-tanium, isn't it? So it makes sense. But I've only ever heard Tee Eye.
@@JeffGeerling I don't know if I can handle your Tie pronunciation Jeff. It's like a punch to the ol' squeedily spooch.
@@JeffGeerling it used to be T-I, but they retconned it into Tie. My personal belief is a Texas Instruments trademark got involved.
One thing on your RAM vs core discussion: L3 cache requirements scale non-linearly with core counts thanks to the increased incidence of L2 cache misses.
That's why the chip architecture is critical with more and more cores. AMD, Intel, and Ampere all seem to take slightly different approaches. I've enjoyed some of the ChipsandCheese articles on these new architectures!
Do increased L2 misses increase or decrease pressure on L3? If it's non-linear then is it log, exponential, or polynomial?
@@shanent5793 It's non-linear and there's an exact formula. Let's say you have a 5% chance of a cache miss per core, so a 95% chance of a cache hit. The percentage chance of a cache miss with N cores is (1 - (.95^N)) * 100. Obviously the chance of a miss - that 5% - is dependent upon the workload. The more misses you have, the greater the pressure. And the fewer the RAM channels you have the greater the effect of L3 cache misses.
@@QuentinStephens that's just the chance of at least one miss. The number of misses is binomial, so the expected total converges to linear: 128 cores are expected to have twice as many misses in total vs. 64 cores. Either way, more cores cause more L3 pressure, so why does the Ampere only have 16MB, which is less than desktop CPUs with only 6 cores or 12 threads?
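For anyone following along, the two quantities being argued about look like this, assuming independent misses with per-core probability p (the 5% in the example above):

```latex
X \sim \mathrm{Binomial}(N, p), \qquad
\Pr(X \ge 1) = 1 - (1 - p)^N \quad \text{(saturates toward 1)}, \qquad
\mathbb{E}[X] = Np \quad \text{(linear in } N\text{)}
```

So the chance of *any* miss flattens out, while the expected *total* number of misses (the thing that loads the L3) grows linearly with core count.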
@@shanent5793 I'm not sure you're correct about the binomiality but yes, I do agree that the 16 MB cache does seem rather low, especially when we have Epyc CPUs with 1 GB cache for similar numbers of cores.
Honestly, I wouldn't be at all surprised if Valve told us tomorrow that they're releasing a fork of Box86 and Box64 built right into Steam, to support all Steam games on ARM and RISC-V.
Valve would be insane enough to do this, and there's no number 3 involved, so it's allowed.
It would make sense if they’re considering using ARM for a Steam Deck successor, like maybe that new Qualcomm one that’s meant to be really good?
It's one of the reasons I love my Steam Deck so much. Issue? Not in 2 hours haha
I don't know; CodeWeavers contacted Valve about built-in support for CrossOver in macOS Steam, and they still haven't done anything about it (source: I contacted CodeWeavers myself about it, and they said that they did pitch the idea and that it is up to Valve)
@@KingVulpes Because Valve's primary focus is on Linux, not macOS.
Another thing is that CrossOver is a paid product; I find it highly unlikely that CodeWeavers was interested in just providing it to Valve for free without getting a cut. That's probably why Valve wasn't interested.
Providing x86 emulation for ARM, however, could directly benefit Valve as it would allow for future low-draw devices, although I'm not holding my breath.
The problem is not Steam; the problem is that you will use it to play only the simplest games, for the simple reason that it's ARM.
3 grand for a 128 core CPU. I remember when Intel used to charge 5 grand for a quad core server. Lol, what an exciting time to be alive. I will buy one in a few years when it's stable and on the used market for a reasonable price.
The issue is the lack of support from software.
Not enough stuff makes use of all the cores.
@@dzello I have a feeling that golang with a huge workload would do pretty well.
Yeah, lol. Can't get to 128 x86 cores at $3K even on Threadripper, either, unless it's used.
@@DeltaSierra426 Those limitations are definitely unfortunate.
Making a powerful CPU by making it bigger with a bigger socket? Easy.
Making a powerful GPU by making it bigger with a bigger socket? Easy.
Even if we don't improve the technology, we can add more and make it bigger.
But then...
Games: I'll use 1/128 of your CPU and 1/3 of your GPU.
@@dzello I think making a program able to use the potential of this hardware isn't that hard; it's just that people don't usually do it. With time, and more and more complex software, this extra horsepower might be needed... though there's indeed a limit for consumer-grade applications, and crossing that limit is just being inefficient or lazy with your code
I really want to see these in a consumer level platform while keeping itself upgradeable.
most people will be pleased even with half the quality; they kind of work well together
We're finally returning to the RAM situation we had a decade ago, where workstation motherboards had lots of RAM slots. My (now very old) Supermicro X8DAH+-F board has 18 (9 per CPU). IMO, the biggest problem with modern processors is the extremely limited number of PCIe lanes available. Look at chip specs over the years, and it's something that has steadily decreased. With Thunderbolt and NVMe, PCIe lanes are the most limiting feature on all my computers - even laptops.
Yeah; I have run into that on my Ryzen 7000 series desktop; there are few motherboards that even expose the lanes in a way that lets me fully utilize them :(
The nice thing with this Ampere chip is it has 128 lanes, and almost all are usable on this motherboard! Still always want more, for more IO :)
Still running a 4790K on my seedbox due to this. Haven't found a non-server mobo with 10 onboard SATA ports for spinning drives since that generation, for any other CPU I've bought.
128 PCIe 4.0 lanes is plenty; that's 512 GB/s in aggregate, more than enough to saturate 6 channels of DDR4-3200 with only 154 GB/s of half-duplex bandwidth. It's up to the motherboard or backplane designer to allocate them.
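The back-of-the-envelope, using the usual round numbers (~2 GB/s per PCIe 4.0 lane per direction, 8 bytes per DDR4 transfer):

```latex
\underbrace{128 \times 2\,\mathrm{GB/s} \times 2}_{\text{PCIe 4.0, both directions}} \approx 512\,\mathrm{GB/s},
\qquad
\underbrace{6 \times 3200\,\mathrm{MT/s} \times 8\,\mathrm{B}}_{\text{DDR4-3200, six channels}} \approx 154\,\mathrm{GB/s}
```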
The issue is that the PCIe lanes are used for M.2 slots and other onboard functions that didn't exist on boards 12 years ago. Back then those PCIe lanes mostly went to actual expansion slots.
@@arof7605 My 4790K was a beast - even though I was never able to overclock it, it ran my main computer for over half a decade, and its core performance was *never* the bottleneck.
But I'm surprised you're still using it - how do you live with a mere 32GB of RAM? (asked half-jokingly)
Something to note about NVIDIA's ARM binary drivers: they have driver library files for x86-64 and aarch64, but they don't have armhf driver libs for software running under box86. That is, box86 converts 32-bit Intel into 32-bit ARM, not into 64-bit ARM. For i386 games, you'd likely need to use an AMD GPU -- Polaris (RX5xx) or older.
One game I find very useful for checking the performance of GPUs on ARM is Veloren. It uses Metal on macOS, Vulkan on Linux, and Vulkan or DX12 on Windows (though there's no ARM Windows build).
But why do all Source games crash on Linux?
And I have an RTX A5000, and my platform is amd64.
They crashed the same way as in this video, but on amd64, not arm64.
Quite a few of the games that failed were shooters with anti-cheat; might that be the common denominator?
@danagoyette7932 What titles run well for you on ARM? All old DOS titles?
Thermal paste: forget what LTT says, it's a physical junction that transfers heat; the larger the contact, the more heat can move across it.
So you are completely right to spread the thermal paste out. Physics!
ARM is really making huge moves. I'm convinced that very soon they will have 6-core, 8-core, and 16-core lineups for consumers
Odroid N2+ is 6-core, Orange Pi 5 is 8-core, both of which can be purchased today for relatively dirt cheap!
RISC-V is jumping into the fray. I am looking forward to getting my 64-core dev board in December. I am so happy to have this level of competition in the market again.
@@adamschackart6859 But those aren't something you'd put in a tower case, and they don't have a socketed CPU, socketed memory, or PCIe slots
That's great for those using them for production, but are they going to be clocked at the kinds of speeds we're seeing currently?
Actually, processors on smartphones are ARM, and they are usually 6-8 cores, so yes that already happened years ago lol
I appreciate the effort you make to provide lots of details.
I wish we had something like Micro Center where I'm from. Tech heaven
This is so fucking sick man, I love the development that ARM desktop / server cores have been making! I know we have other architectures as well (RISC-V), and it's awesome that they're all making strides, but to see this amount of progress now? Fuck yeah!
I remember watching your older videos where you literally couldn't detect the GPU or even push anything out to the frame buffer, but now look at it :D
Now make it a hackintosh
I'm ready for (another) ARM desktop!
9:30 this makes me disproportionately happy as a Linux fanboy. Finally the tables have turned.
1:47 LOVE the "18 minute pickup" at Microcenter. I've built both of my kids' gaming towers by picking out the parts, hitting "buy" and driving right over. Even picked up Dell XPS 13s for each of them the same way.
Finally someone who prespreads their thermal compound! 😃
I've always seen so many people just leave it to squish itself out, but I learned from my dad, who worked on computers for 20+ years, that prespreading is better.
Hey Jeff, running the Bedrock edition, especially the mobile version as you did, is far too easy a challenge for your rig.
I suggest running and comparing the latest Java version and a specific modded version: Fabulously Optimized.
To get any architecture incompatibilities out of the way, consider using a launcher that comes as a JAR file, such as the Technic launcher.
Make sure to use the latest JRE (20-21) and set the proper JVM flags (see the sketch below).
Additional bonuses: shaders, a resource pack with parallax mapping + Physics Mod Pro (then grab a fire extinguisher).
Looking forward to hearing from you!
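In case it helps anyone, one commonly recommended starting point is Aikar's G1 flags - this is just a sketch (the heap size and JAR name are placeholders, and exact flags vary by Java version):

```
java -Xms8G -Xmx8G -XX:+UseG1GC -XX:+ParallelRefProcEnabled \
     -XX:MaxGCPauseMillis=200 -XX:+UnlockExperimentalVMOptions \
     -XX:+DisableExplicitGC -jar your-launcher.jar
```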
I admire it so much that you are able to work around such unusual circumstances. I can't even get a Linux graphics driver fully working in an ideal setup.
Just buy hardware for Linux, instead of the other way around.
Buy all AMD. It'll all work out of the box if it's over a year or so old.
@@robkam643400 I understand why you would want to buy an AMD GPU for Linux, but what's the point of swapping an Intel CPU for an AMD one? (unless you mean the Intel ME, but it works the same with Windows)
Dang. Getting a type 1 hypervisor on that thing would be SWEEEEET
IKR. Imagine how many VMs or LXC containers we could squeeze inside it
11:00 I think this has been a problem in Cinebench since its inception. Originally it was only an issue for very niche 4- and 8-socket systems, but with EPYC, Threadripper, and Xeon Platinum (Cascade Lake) offering up to 56-64 cores per socket and 2-4 sockets, many cores started going unused in and after 2019
@Den Verga Intel is NOT at ARM levels! You need Apple, UNIX!
@@lucasrem we're not talking about performance, only core count. IIRC, when I built my 12-core dual-processor Xeon X5690 desktop, the current-at-the-time version of Cinebench only supported 16 threads, not 24
@@lucasrem Apple is not at Windows' level in chemistry, electronics, mechanics, physics.
macOS is vanilla UNIX. In the past, engines, cars, and machines were designed on TRUE UNIX like Solaris and IRIX, not TikToker UNIX like macOS. Now that real UNIX has been replaced by Windows and Linux.
No engineer can design a chip or an engine on a Mac.
What I learned here... it's amazing, but it doesn't work yet!!!! Great job BTW! I loved this video!!
Exciting to see ARM gaining! Fantastic for servers (specifically high thread/process count web servers), etc.
Really cool. BTW Jeff, I found a way easier way to connect LTE modems: via ModemManager and NetworkManager. No need to install the QMI libraries; they're already in Debian 12.
The only downside of Microcenter is that Brentwood Promenade is a post-apocalyptic hellscape of a parking lot. The only saving grace is that there is a super secret way there that allows you to skip the bulk of the minivan wars.
We don't talk about the secret entrances 🤫
It sounds like the real bottleneck here is DDR5 support, which the upcoming Ampere revision adds.
Which is even faster.
This is a surprisingly effective workstation for a development board, and further software support should improve it even further. I could see Blackmagic integrating one with a pile of their PCIe cards to build a behemoth video switching workstation capable of real-time effects - and driver support is a lot easier when you make the cards!
@jrshaul What DDR did you use, Ampere? ARM doesn't need more than DDR5-6000!
Could you try spinning up hundreds or (thousands?!?) of docker containers with Kubernetes? With all those CPU cores it's gotta be really fast to ramp up the instances.
That's in part what it's for. However, like Jeff was saying, you do run into memory bandwidth limitations, meaning you can't expect a linear performance curve based on the number of cores you have. My expectation is that a lot of customers running off-the-shelf applications will probably benefit more from the lower-core-count SKU, but if you design your application around the server, the 128-core will probably be worth it.
@@Megabean I used to have a workload that was shared-nothing, buffered data by thread, was computationally heavy, had fairly small unit sizes of data (commonly
@@thewiirocks That's cool; sadly I don't have enough background to fully understand. I do 3D rendering though; I use a Java application called Chunky, a voxel-type renderer (might be using the wrong term) that does photorealistic rendering. I've been able to saturate my server with it, with 64 cores and 128 threads. Idk how much memory plays into it, outside of it using every bit of memory you allocate to it.
@@Megabean The biggest thing you need to consider for memory is how long you're keeping data in the L1 and L2 caches, and whether or not you're unnecessarily evicting data and then asking for it back.
A very common pattern in modern software is to perform one operation at a time (e.g. an addition of two values) across a large collection of data, looping over the data separately for each operation. This is _terrible_ for the cache, as the CPU is forced to evict each record to make room for the next record in the collection, thereby capping your throughput at the memory bandwidth and making your caches useless.
This can be hard to detect, as test data sets tend to be small enough to fit within L3 and therefore run faster than memory bandwidth alone would allow. It's only once the data sizes are scaled up that the true limits of the memory bandwidth are hit. Worse yet, the CPU will look busy to the operating system even though it's spending most of its time doing nothing.
What you really want to do is bring in a record of data, perform all the operations you possibly can on it, then be done with it for that computational cycle. That maximizes the amount of time a record can be held in the CPU caches. If done correctly you may be able to operate entirely out of the L1 cache, which can easily provide an order-of-magnitude performance improvement.
I really miss living near a Micro Center. It really is the best PC store I've ever been to. Please come to the PNW!
We approve your usage of SuperTuxKart!
Keep them videos coming please. This will greatly help Windows on ARM development going forward before the X Elite drops.
… and after, Ampere multi-core performance is in another league
@aliyuabba4575 Xcode, Apple. Do it better????
@@JoeSpeed Ampere is crap for video and CFD.
Upgraded my laptop's monitor to 4K, and with 100% scaling I can read the text on your screen at 0:55.
With any other scaling the text becomes blurrier, and if I right-click on the video and click Stats for Nerds, the resolution of the viewport changes with the scaling.
Also, on my Win 10 laptop I can't just hover over the speaker icon in the taskbar and scroll to change the volume, which Win 11 can.
Sorry for the unrelated comment but hey, good to see you are doing well and are in good health!
I wish I had a local Micro Center…
At 2:43
Ubuntu and Windows for ARM....
Did you try any other Linux distro?? Just curious on that....
I have been coming down here to suggest ChimeraOS because it runs Steam very well, but then I remembered it may not have an ARM flavor... if it does, that might be a good way to go!! Manjaro apparently has the ability to act like SteamOS since both of them are based on Arch Linux....
Hope you have an excellent day!!
HELL of a video Jeff !
Great video on components and benchmarks. Looks like you also have lots of data on DIMMs; waiting for a new video on them too.
We'll see; right now most of the data is spread across some GitHub issues. I may do at least a blog post on it at some point.
Nice. They really make amazing stuff. Too bad I can't afford it. Would love an Ampere workstation so much. But I'm happy with RK3588 and my pc when I need it.
Nice video!
Almost getting one myself! Is it the 2.8 GHz version of the CPU that Ampere will ship?
Regarding the Mac, let's not forget the M2 Max and M3 Max have tremendous memory bandwidth - 400 GB/s, quite a bit more than a DDR4 system, I believe. That may make them faster on memory-bandwidth-limited problems, such as several types of simulations and other work with a low flops-per-byte ratio.
AmpereOne has DDR5 memory system support; however, I have not seen it as easily available as this CPU is.
With only 3 out of 4 memory channels connected, maybe the 96-core version is a "better fit", as the amount of bandwidth per core will be quite a bit better - for anything bandwidth-sensitive, that is.
8:43 13B isn't actually all that big, and with your RAM pool you should easily be able to go up to 40B. Are you using the ARM-friendly llama.cpp in OpenBLAS mode for your back-end? If so, you should be seeing a lot more CPU activity than that.
I've been running 70B-parameter quantized GGUF models with llama.cpp on my Windows machine with 64GB of memory. Definitely way more room to grow here.
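For reference, roughly what that looks like with a 2023-era llama.cpp checkout - hedged, since build flags and binary names have shifted between versions, and the model filename here is just an example:

```
make LLAMA_OPENBLAS=1
./main -m llama-2-70b.Q4_K_M.gguf -t 128 -n 256 -p "Hello"
```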
The machine specs gives me the same feelings as when I saw and heard a monster truck performing in person for the first time! 🔥🤯
11 years ago: "Nvidia is too pro-Windows, they hate Linux!"
Today: "Nvidia is too pro-Linux, they hate Windows!"
I think it’s time to revisit this machine again, with the improvements in Windows 11 for ARM
To put your Cinebench 2024 score into x86 context: an Intel i9-13900KS at 5.6 GHz scores 2379 multi and 142 single, while an AMD 7950X3D at 4.5 GHz scores 1829 multi and 111 single.
Single-core is obviously lacking on that 128-core, but the multi-core for sure ain't.
It's one of those "well one ain't good enough, let's just throw ALL THE CORES in there" problems :)
I really want to see the single core specs on AmpereOne. Or see Apple create a 128 core monster M2 Ultra Ultra Supreme :D
@@JeffGeerling The upcoming revision supports DDR5. Assuming twice the memory bandwidth and adequate driver support, perhaps a 4K+ Cinebench score is in the cards?
This is soooo cool! I can definitely use this for my MS Excel worksheets!
O yaaaa! Micro Center... Inconveniently located for 90% of ALL OF US!
Or anyone outside the U.S.
2:49 That RAM cannot keep up is not only the case for server CPUs. Even for these super fast data-center GPUs, 2 TB/s of VRAM bandwidth cannot keep up, because compute TFLOPS is still so much larger. They could cut the GPU die size in half and the software would still perform the same. Nearly all compute software is bandwidth-bound nowadays.
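The quick way to see it is arithmetic intensity. Taking the 2 TB/s figure above and, say, a 100 TFLOPS part (an assumed round number), a kernel has to perform

```latex
\frac{100 \times 10^{12}\ \mathrm{FLOP/s}}{2 \times 10^{12}\ \mathrm{B/s}} = 50\ \mathrm{FLOPs\ per\ byte}
```

just to keep the ALUs fed - and most real kernels come nowhere close to that.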
Could you run virtual machines on this hardware? Could QEMU/KVM emulate a Raspberry Pi, macOS, or even an x86 OS? Imagine running a virtual cluster of Pis! And while Quickemu can run macOS, running the latest Apple silicon version could be very useful.
Yes, QEMU supports emulating the Raspberry Pi.
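Something along these lines, assuming a recent QEMU (the machine type is raspi3b there; older builds call it raspi3) with a kernel, DTB, and image pulled from a Raspberry Pi OS release:

```
qemu-system-aarch64 -M raspi3b -kernel kernel8.img \
  -dtb bcm2710-rpi-3-b.dtb -sd raspios.img \
  -append "root=/dev/mmcblk0p2 rw" -serial stdio
```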
Just went to Micro Center yesterday to pick up the ASUS EVA parts. Love seeing my hometown Micro Center here - STL rep!
One large advantage of the Apple RAM is that it's unified, meaning you essentially have 192GB of VRAM, which is useful for machine learning tasks.
Exactly
If the GPU uses 192 GB of RAM, will the CPU work on air?
Do you think the OS will use zero gigs?
Why do I get the same morbid pleasure as if I was watching a Lamborghini being put together
I tried Box86 and Box64 a long time ago :-/
It's nice to see someone else having better luck
Really amazing; I did not know there are already Ampere CPU workstations in the field!
That's really cool! Imagine if we could use a RISC-V CPU to game on Linux!
Bandwidth not keeping up with compute power has long been an issue. One amusing statistic is that standard floppy disks are faster than a typical NVMe drive (relative to capacity). You can read a 1440k floppy in about 45 seconds, but a Samsung 990 PRO 2TB will take over four and a half minutes. Even the IOPS per megabyte is a bit better on the floppy: with a slow step rate of 8ms you'd have a worst case of 840ms access time, or 0.82 IOPS/MB. The 990 is 1.4 million IOPS best case, which comes out to 0.7 IOPS/MB.
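The capacity-relative math, for anyone checking (using the 990 PRO's rated ~7,450 MB/s sequential read):

```latex
\text{floppy: } \frac{1.44\ \mathrm{MB}}{45\ \mathrm{s}} \Rightarrow 45\ \mathrm{s\ per\ full\ disk},
\qquad
\text{990 PRO: } \frac{2{,}000{,}000\ \mathrm{MB}}{7450\ \mathrm{MB/s}} \approx 268\ \mathrm{s} \approx 4.5\ \mathrm{min}
```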
12:19 One of the problems you've likely run into with Valve and Halo games is anti-cheat; both VAC and Easy Anti-Cheat DO NOT like emulation. However, WINE seems to have been made compatible lately, I guess to support the Steam Deck, which doesn't use emulation, just the Proton translation layer - whose predecessor used to get you banned from CS:GO, if I recall.
Linux is killing it on ARM! Great video!!
For at least one of those games, the text console had an error message about "out of thread IDs". Presumably it's trying to spin up one thread per core or per SMT. If you can artificially limit the number of cores that the OS sees, or that it shows to userspace programs, you might have a shot at getting these to work...
Does ARM have SMT? Turning that off would be interesting too.
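One cheap way to experiment on Linux is to clamp the CPU affinity mask before launching the game (same idea as taskset). A sketch - with the caveat that some programs read the total core count rather than the mask, so it won't fool everything; and for what it's worth, the Neoverse N1 cores in the Altra are single-threaded, so there's no SMT to turn off:

```c
/* Sketch: restrict this process (and anything it execs) to CPUs
   0-63, so thread-per-core programs spawn fewer threads. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 64; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    execvp(argv[1], &argv[1]);  /* child inherits the mask */
    perror("execvp");
    return 1;
}
```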
There's something poetic about having a build this absurdly overpowered and only being able to play Minecraft and not much else. Hopefully ARM will get some love due to the Copilot+ thing (however horrible it might be).
Please more Ampere content
Next Video Suggestion: How to make the 128C ARM Computer.
4:15 What is the difference between 2Rx8 and 1Rx16, and similarly between 2Rx8 and 1Rx8? You understand my point - what's the difference? Which is faster?
Awesome video! ARM is a fascinating architecture; I can't wait to see where it goes in the near future!
Good video. Easy to understand information.
1:18 Jeff, with all this power comes great responsibility. I can't believe you missed the low hanging fruit.
Next time!
@@JeffGeerling Very cool, I'll be watching for it. I almost bought one of these systems back when you showcased it but now that I've also become one of the masses without a job I'll hang onto the pennies and live vicariously through you!
@@vincei4252 Aww, I hope you will get back into gainful employment soon!
@@JeffGeerling Thanks man! I won't lie, retirement looks good too ☺
@@vincei4252 Ha! Well then, here's to finding some things that will keep the brain going!
God I wish there was a microcenter near me
1.3 TFLOPS is at least double what the RTX 4070 Ti can do. The CPU can also access much more memory with lower latency, so there's no comparison.
"Ti" is an abbreviation of "Titan" so it's pronounced like the first syllable. Titan never made sense anyways because the Titans lost to the Olympians, so Ti was just a face-saving compromise. The company is still named after one of the seven deadly sins, which shows that they can't let go of something that sounds cool
Are you sure you are not just citing the effect of the abysmal FP64 performance of the card? A 4070 Ti really ought to be able to do much better than 1.3 in single precision, a.k.a. FP32. The CPU would also be faster in FP32, but only by a factor of 2x, whereas I would expect at least ~6 TFLOPS from the 4070 Ti in FP32.
@@TheBackyardChemist Compute TFLOPS is traditionally FP64. Ada has a 1:64 FP32/64 ratio so it's around 40 TFLOPS FP32 on the GPU. You could emulate 64 bit math and get the ratio down to 1:4 but it's not IEEE compliant. CPUs should be 1:2 but they run into power limits and drop the clock speed if it's too much work. Of course these are all peak theoretical figures, branching code and sparse access won't allow the GPU to reach its maximum performance
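The numbers behind that, for anyone checking:

```latex
\frac{40\ \mathrm{TFLOPS_{FP32}}}{64} \approx 0.63\ \mathrm{TFLOPS_{FP64}}\ \text{(4070 Ti peak)},
\qquad
2 \times 0.63 \approx 1.3\ \mathrm{TFLOPS}
```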
I would try it on Fedora, which has a vanilla and nearly bleeding-edge Linux kernel; plus they have proper NVIDIA support with Wayland now, and a special RAM config (which requires no swap now). Things might run better. And I always use the Flatpak version of Steam; it runs quite well
Whoa. A Rust developer's dream: instant compilation
Just a quick question Jeff: does the board have the ability to disable the ASPEED GPU??? Just asking because you said that the BIOS goes through the ASPEED iGPU. With the server boards that I own with an ASPEED BMC/GPU, if you disable the GPU portion then the BIOS can be shown/accessed through a discrete GPU. Just a thought, as I am unfamiliar with the Ampere boards and how they work compared to x86
Right now it seems like no. Not sure if that will change.
Roger. I'm hoping to get a similar setup to try Ampere
Hello. How about using it as a web server and for virtualization (ESXi, and Windows Server with Hyper-V)?
Can you do some tests for these? And maybe compare with some Xeon processors? How fast is MySQL on those CPUs?
I really look at these ARM CPUs and I see they might change the server world; I'm really thinking of getting an ARM server.
You forgot the magic key for switching... For just the cost of the upgrade to 192 GB of RAM on the M2, you can purchase 80% of this machine.
Amazing!
I would totally have an Arm CPU in a desktop!
Imagine the possibilities. Unfortunately, I don't think it will be such easy access ($); maybe in 15 to 20 years?
You are very wrong about Blender - all you had to do was install the compatibility layer from the MS Store lol. I use Blender a lot on Windows on ARM devices; if you can do that, please retest the results for a future video. Sincerely yours, Greg
I have never expected an AHOC shoutout when watching a video on ARM CPUs, but here we are. Nice!
Looks like all the games that ran using Steam were made in Unity: Super Hot, Horizon Chase and Kerbal Space Program.
Would be great to know if you eventually manage to get Llama to use the GPU on the ARM system.
That's a lot of RAM! Holy cow!
Hello, how does it compare with a CISC machine in matrix handling? This is most important in scientific work. M1, M2, etc. are for simple arithmetic in ray tracing, I believe.
I'm really interested in the LLM and machine learning aspect of this. I'm about to upgrade my old dual 24 core Xeon (w/ 512GB of ECC) to a modern high core count plus high end GPU. This is absolutely on my radar now. Do you have specific motherboard recommendations?
If you're serious about the LLM aspects, the best options would be some of the server builds from Gigabyte, Supermicro, ASUS, or one of those vendors. ServeTheHome has some interesting reviews of GPU-heavy Ampere machines used for that purpose.
$99 for a decent 3D printer… Man, I wish we had Micro Center in the UK
What do you still have?
Something like this + OCP Time Card + Debian = my dream Linux desktop.
Yey, you tested KSP! Best game ever
The key difference of the M2 Ultra is that its 192 GB of RAM is also VRAM, which makes it possible to run some 180B LLMs on it - not possible on other consumer-level PCs.
If they use 192 GB as VRAM, what will the CPU use - air or water?
I'm pretty confident those workloads mostly won't be run on a Mac.
There's no way that in real life VRAM can access 192 GB on a Mac.
I quote what a teacher at a famous university said:
"The more advanced your job is, the fewer Macs you will see"
Jeff is reaching LTT levels of sponsor segue
Walking around the Micro Center store was one of my favorite parts. We don’t have those here. I’d watch an entire video of that.
When your Microcenter is also Jeff Geerling's Microcenter....hey, did you leave anything for me Jeff?
Not *this* time!
Can you please do a video where you compare the i9-13900KS with the Ampere Altra?
Out of curiosity, have you tried rendering with Kdenlive?
No, I have been planning on trying some video editing at some point, but haven't had a chance yet.
The only problem with Microcenter is the lack of Microcenter; many folks are still stuck driving 3 hours plus to get to one, which is a hard sell over sitting on our bums at home and ordering like *snap*. Hopefully they ramp up their pace of opening new stores across the country, but only time will tell.
@DeltaSierra426 Why move to rural America? You did it yourself!
Microcenter you can do in big communities only!
@@lucasrem I despise the hustle and bustle of big city life. Less stress, better air quality, big yards (some have creeks, woods, etc. for hunting, fishing, recreation, gardening/farming, and so on), and in our area, a local telco invested in fiber-to-the-home, so tech life is still good; doesn't take much longer these days for etail shipping to these areas.
I actually live in a city of about 40,000. This provides the benefits of nearby shopping, food, jobs, etc. without being a monstrosity.
There are still plenty of big cities across the U.S. that don't have a Microcenter; I still stand by my statement that I'd like to see them expand more quickly. Even a place like Austin, TX -- "the new Silicon Valley" (but they call it Silicon Hills) -- doesn't have a Microcenter and requires driving 4 hours to Dallas or 3 hours to Houston to get to one.
@@DeltaSierra426 expanding too quickly can wreck a business. Perhaps they need to set up a good shipping process that beats Amazon, such as the ability to schedule the delivery in a 2 hour time window.
I'm in Canada, never been into one of their stores.
Htop cores being like 5 chars wide is hilarious
Awesome content as always!!!!
Will "professional-grade" graphics cards (nVidia Quadro, AMD FirePro etc.) also work in this machine?
I have tested an RTX 8000 and it worked well too.
@@JeffGeerling Yeah, that's good 'cause you could do some top-notch 3D CAD /* FreeCAD also works on ARM, doesn't it? */ construction or CGI work on such a machine.
6:07 What? No 1.21 teraflops joke?
The time of the pc2 is coming!!
Did you install the CUDA libraries? IIRC they don't come with a normal GPU driver install. Might be why Blender could not use them.
I’d like to see gaming evolve for ARM CPUs. Just to see where it goes
Great stuff, thank you, Jeff!
8:39 Actually, CUDA works in Blender, but it's a bit of a hassle - the same process that's needed to use an NVIDIA GPU in a Google Colab local runtime
Also, that silent Steam game death is a classic Visual C++ redistributable bug.
It was happening to me like crazy on my brand-new Windows laptop: a bug was making games think the various libraries were installed when they weren't, so their first-run script that checks for the library and installs it if it's missing would report success, but then the game would launch, try to use the missing library, and fail. I bet something similar happens here, related to incompatibilities between Linux, ARM, and MSVC.
Just curious: did you try x265 CPU encoding? It gives nice quality for the bitrate, and you have the cores
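If anyone wants to try it, this is the usual shape of the command (filenames are placeholders; libx265 will happily spread across lots of threads, and -preset slower trades time for quality):

```
ffmpeg -i input.mkv -c:v libx265 -preset slower -crf 22 -c:a copy output.mkv
```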
I just read a 3-year-old post on Reddit saying that open-source AMD GPU drivers can be used on ARM Linux (an Oland GPU, with blobs for initialization).
It is a shame that open-source drivers cannot simply be compiled when needed, and that games are not compiled for Linux (x86 and ARM) and Vulkan.
And it was a pleasure watching that it is possible, as with SuperTuxKart.
It seems NVIDIA and ARM are making SoCs for laptops and handhelds, for MS WoS, ChromeOS, and Linux; perhaps good drivers will come with them in the future.