Well, some of Valve's patents for a standalone VR headset and some of their code in dev branches of SteamVR and SteamOS mention an Arm chip. Besides hinting that Valve is going for a dual-chip system like the Apple Vision Pro, but with an x86_64 chip instead of the M2, those code lines imply that there may be a Quest-like mode where the OS could run from the R1-equivalent Arm chip. That, alongside the patents about a modular VR headset, is why @SadlyItsBradley thinks the x86_64 chip may go in a removable module. Add to this that, based on more datamining, Bradley suspects Valve is counting on making non-VR games and software work in a VR/AR environment, a bit like what Apple marketed, mixed with the Game mode/Desktop mode approach of the Steam Deck, to attract newcomers to VR by extending the "same portable screen" aspect. (A "first get people to use flat-screen apps in a VR environment, then get them to use full VR apps" strategy.) If Bradley's theory is right, and depending on which functionality is lost or kept in the one-chip mode (which would run on the Arm chip), Valve might be working on x86-to-Arm translation layers.
There's no good reason they shouldn't be. I know when 3D stacking started becoming a thing, Intel tried to push SoC-style motherboards, which means preconfigured CPU/RAM, etc. and no upgradability/repairability. There was a lot of pushback against that. In a twist of events, that's exactly what Apple Silicon is: unrepairable Arm SoCs.
Steam recently had a massive UI update that replaced a lot of the UI with Electron-based elements, which is probably why your experience with Steam is much better now.
Ah, that would make sense. The old UI would work a little better if you disabled GPU acceleration, but now it seems to work just fine. I wonder if moving to Electron might make the transition to arm64 easier, too? I can't imagine Steam will remain x86-exclusive forever!
@@JeffGeerling They have hired the gal who has been working on making Apple’s M-series GPUs work on Linux. So it's possible they want her expertise for ARM-based gaming.
You can actually run x86_64 programs on ARM and other architectures under Linux, using a little program called box64. I honestly can see ARM being the future for most use cases, due to its high density, low energy consumption and high performance with a smaller instruction set
This! I remember folks trying out Proton with box64 a while back. It would be interesting to see how it works here, so we can compare the performance of modern games on it.
Haha so true. Like I said, it's not what I'd call "good" code. I still subscribe but only because of 25 years of muscle memory from back in the Photoshop 2.5.1 days lol
It's not about being competent, Apple simply added a few private extensions to the ARM chip for operations that the emulator has to run a lot. This is a luxury Microsoft can't do, since they don't produce their own ARM chips.
At least right now, Lightroom and Photoshop work natively. But they have 20+ apps, so 10% of them being native isn't very promising. Imagine trying to drive Premiere via emulation.
@@anlumo1 'This is a luxury Microsoft can't do', really? Microsoft has more money than God and even sold their own ARM machine running Windows! Microsoft could absolutely do the same thing, but they can't help themselves and must always half-ass everything they do. Microsoft just doesn't care.
I don't even care if Apple's ARM chips are faster. Just the fact that these platforms aren't locked into a walled garden puts these Ampere systems in a league of their own!
This 96-core Ampere Altra is based on the Neoverse N1 core, which is derived from the Cortex-A76 (2018). It's a 5-year-old core, yet it has the IPC of Intel Skylake / AMD Zen 2, which was on par with the top x86 architectures. However, an A76 core consumes only 1 W at 2.6 GHz, with an area of 1.4 mm² on TSMC 7nm, resulting in incredible performance per watt that shines even today. Ampere uses the maximum configuration the N1 platform can provide, 128 cores. A similar CPU from Amazon, the Graviton 2, used a 64-core config. FYI, today's latest Neoverse V2 (Cortex-X3 based) platform has 19% higher IPC than Zen 4, supports up to 512 cores per socket (256 cores per chiplet) and has support for revolutionary SVE2 length-agnostic vectors (up to 2048-bit). Nvidia uses V2 in the 144-core Grace Superchip. Amazon will use it in the next Graviton 4, I guess. Surprisingly, Ampere's next gen should use their own in-house core, AmpereOne, leaving ARM's Neoverse license. BTW the new Cortex-X4 has 33% higher IPC than AMD Zen 4 ... however the next server/workstation Neoverse V3 should be based on the Cortex-X5 in 2024. Basically AMD Zen 5 is dead on arrival because it cannot beat the X4's IPC, not to speak of the X5's IPC. Apple M2 has 56% higher IPC than AMD Zen 4. Yeah, Apple has been king of microarchitecture since Jim Keller's 2015 A9 Twister core release (IPC higher than Intel's Skylake). *IPC = Geekbench6 ST pts / GHz
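For anyone who wants to reproduce the "IPC" figure as defined at the end of that comment (Geekbench 6 single-thread points per GHz), here is a minimal Python sketch. The function names are made up for illustration, and it is only a proxy: it says nothing about cache sizes, memory, or sustained clocks.

```python
def ipc_proxy(gb6_single_thread_score: float, clock_ghz: float) -> float:
    """Geekbench 6 single-thread points per GHz, the comment's stand-in for IPC."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return gb6_single_thread_score / clock_ghz


def relative_advantage_pct(ipc_a: float, ipc_b: float) -> float:
    """How much higher ipc_a is than ipc_b, in percent."""
    if ipc_b <= 0:
        raise ValueError("ipc_b must be positive")
    return (ipc_a / ipc_b - 1.0) * 100.0
```

To use it, feed in a score and the clock the core actually held during the run, e.g. `relative_advantage_pct(ipc_proxy(score_a, ghz_a), ipc_proxy(score_b, ghz_b))`. Any numbers you plug in should be your own measurements; none are supplied here.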
Exactly. If Apple just sold their systems as open platforms instead, they would have a much larger market - but alas they do love their Apple Store revenue, so it will be a cold day in hell before that happens.
@@richard.20000 "Similar CPU from Amazon the Graviton 2 used 64-core config" Zero point talking about a CPU only available for cloud instances. "Apple M2 has 56% higher IPC than AMD Zen 4. Yeah, Apple is king of microarchitecture since Jim Keller's 2015 A9 Twister core release (IPC higher than Intel's Skylake)." And yet Genoa is not exactly taking a beating from Apple's highest-end chip. There is far more to performance at the higher end than just single-threaded IPC. "new Cortex X4 has 33% higher IPC than AMD Zen 4" Says who? There isn't a single implementation of X4 in the wild yet, and cache size affects core performance significantly, from what we have seen of ARM's claimed figures with ideal cache size vs Qualcomm Snapdragon implementations with reduced cache size. "Basically AMD Zen 5 is dead on arrival because it cannot beat X4's IPC not speaking of X5's IPC" Utterly pointless remark, considering that potential IPC advantage would be eroded simply by running x86 software for anyone needing legacy support (which is a HUGE part of the industry, like it or not). At this point the IPC gains of ARM are slowing generation by generation as they run into exactly the same problems x86 ODMs suffer in design - in another 5 years the situation will be pretty much neck and neck. ARM was better in the lower power region - but as it scales it is running into a wall; this much is fully evident to anyone looking at actual performance figures and power draw. "and has support for revolutionary SVE2 length agnostic vectors" It was revolutionary a decade ago when they published the ARGON experimental variable-length vector instruction set. These days it is just an implementation of an idea that has already taken root in at least one other architecture (RISC-V) and was implemented in Fujitsu's ARMv8-based supercomputer Fugaku years ago - Fujitsu actually co-developed the original SVE with ARM.
@@richard.20000 The thing is, the ISA doesn't really affect the performance or the power consumption. The whole CISC vs. RISC thing is from the old days and doesn't matter much now. As for that claim about the Cortex X4 having 33% higher IPC than AMD Zen 4, we haven't seen any real-world implementation of the Cortex X4 yet. So it's hard to say how it will actually perform, plus other factors like cache size can impact core performance, so it's not just about IPC numbers. Today IPC is "how well you are able to keep the core fed", which can vary a lot from one µarch to another. Saying that AMD Zen 5 is "dead on arrival" because it can't beat the hypothetical IPC of future ARM cores is kind of pointless. There are a lot of software applications that still rely on x86, so running them on ARM would be a challenge and may affect performance depending on the implementation. Compatibility also matters a lot in this segment. And the support for SVE2 length-agnostic vectors might sound cool, but it's not as revolutionary as you think. Other architectures like RISC-V have similar concepts already. It's not a game-changer anymore.
With FEX-Emu, you may be able to run Steam and other x86 apps on Linux. In fact, Asahi Lina used that method to run Steam on an M2 Mac (even Proton worked).
It would be super exciting to see Steam under Box86/Box64 in the next video! It's a bit of a hassle to set up, but should give you a very competent translation layer.
Computer engineer here, the basic idea is shared memory in these massively multicore CPUs works with "neighborhoods" of cores to cut down on the number of tag bits required in memory to mark which cores have a block of memory loaded. As you can imagine, crossing neighborhoods is much slower than working within your own neighborhood. This is why monolith generally doesn't perform as well
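A rough way to see the tag-bit saving described above is the Python sketch below. It assumes a simple bit-vector directory (one presence bit per tracked sharer) and a two-level split into neighborhoods; the comment doesn't say which scheme Ampere actually uses, so both the scheme and the function names are illustrative assumptions.

```python
import math


def full_map_bits(cores: int) -> int:
    """Flat tracking: one presence bit per core for every tracked block."""
    return cores


def neighborhood_bits(cores: int, neighborhoods: int) -> int:
    """Two-level tracking: one bit per neighborhood, plus one bit per core
    inside a single neighborhood."""
    cores_per_neighborhood = math.ceil(cores / neighborhoods)
    return neighborhoods + cores_per_neighborhood
```

Under those assumptions, 128 cores need 128 bits per block when tracked flat, but only 4 + 32 = 36 bits with four neighborhoods. The price is the one mentioned in the comment: a lookup that crosses neighborhoods takes an extra hop.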
Though for some reason on Linux monolith was doing well compared to quadrant... I know Anandtech did a lot of testing though on Ampere, Epyc, and Xeon, and found some interesting differences (all are tradeoffs, but some work better than others depending on the workload).
@@JeffGeerling That is pretty interesting. I should specify I'm not an expert on parallel stuff, I've generally worked at the individual core level, and I definitely am not a software guy. So I'm totally speculating here, but my guess would be that going Monolith allows the scheduler to be a bit liberal and play around with crossing neighborhoods when it determines it's worth it. This is something I could definitely imagine Linux being better at handling than Windows for ARM. Though again, if someone who actually knows about the Linux scheduler wants to chime in and correct me I'd welcome it.
@@JeffGeerling I think the Linux scheduler does a bunch to try to avoid cache thrashing. It may actually switch the CPU out of monolith mode (if this is possible) and go about trying to schedule based on cache. I know more than nothing about the Linux scheduler. But I'm far from an expert.
@@JeffGeerling Just checked out the Anandtech article (I'm currently on vacation so I don't have the time to really dig into understanding it, so I'm again sticking the "speculation" disclaimer on this comment), but it looks like Ampere's mesh is really slow when it comes to multiple cores within the same quad trying to share a cache line that exists in another socket. Specifically, this is the line I'm referencing: "and the aforementioned core-to-core within a socket with a remote cache line from ~650ns to ~450ns - still pretty bad". This can slow down Monolith mode if the scheduling system is unaware of it. As for quadrant mode, it seems to segment the SLC (system level cache) evenly between the four quadrants (so each quad only gets 4MB of SLC) instead of allowing it to be shared between them. This means that for workloads that don't use all the quads, you're losing a lot of cache potential and having to go out to main memory much more often. Edited to clarify a couple of things.
@@JeffGeerling Sorry for the comment spam, but going off the comment by @Omnifarious0 I'd wager that in "Monolith" mode the scheduler is still trying to keep things within the quadrants, but since Monolith mode opens up the cache you can stay on chip way more. Windows probably struggles with scheduling threads to avoid remote (off quad) cache line accesses and that's why it suffers compared to Linux in Monolith mode despite the unsegmented cache.
I think a big problem with ARM support is that companies might be more interested in just riding things out until RISC-V gains some steam, since ARM has been rather... particular with their licensing lately.
This, and companies always locking down their ARM products so you can't run any software or operating system but the one they approve of. People keep saying "NoT aLl CoMpAnIeS" without stopping to think that no one with any weight is bucking the trend of locking down the hardware. It's all fine and well to buy these Chinese SBCs and super expensive tech demos like this one, but for the consumer who doesn't have the money to burn on a 158293 core whatever ARM CPU, we're going to be stuck with locked-down and unsupported hardware.
It's not that. Even RISC-V will hit the exact same problem: the size of the market is the problem, and there's not much demand. Desktop ARM on Linux probably wouldn't be as developed as it is if not for the Raspberry Pi.
The problem these days: Apple lost interest in RISC, and with Windows still mostly on x86, there is no motivation for devs to go back to the RISC era, which was not much fun for any devs from the '90s to 2007. It was beneficial for Apple and PC systems when devs optimised for it, but that was time-consuming and bad ports were a constant problem back then. Now everyone uses x86 CPUs for the lazy approach, with game engines that compile everything automatically. Manual compiling on a RISC architecture is not very user-friendly, even though it gives more control over raw performance and utilises the cores better than the automatic CISC compilers, which don't give users complete control of the CPU cores. It's too generic a method, like the compilers under Linux, where devs have to choose which type of CPU the software will run on: type 1 to type 4? The majority don't have type 4, while type 2 is compatible with every x86 CPU from 1995 to 2023, so it's logical that they compile for that.
Yeah, especially now that the relevant specs are being finalised and added to RVA23. I’d expect RISC-V to go mainstream in a few years, at least on smartphones.
As a heads-up for 13:28: Ampere Altra (Max) has up to 128 PCIe Gen4 lanes in 1P. 192 is only available in 2P, using lanes from both CPUs, with 64 lanes being used as the interconnect between the two chips.
@@Dave102693 The Windows dev kit has been kind of flaky. I had it shut off a couple of times when doing something processor/graphics intensive. I tried running Portal 2 just to see how bad gaming performance would be on a translation layer, and the system shut off a couple of times.
@@XeonProductions Has anyone gotten Linux running on it? If so, try running games with Box86/Box64 or FEX. Windows x86-to-ARM translation is so bad it loses to both of the Linux translation layers. If Linux is not ported yet, try to run Wine within OpenBSD with Box86/Box64, if that's even possible.
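The lane arithmetic in that correction is easy to check. This small Python sketch uses only the two figures given in the comment (128 lanes per socket, 64 lanes in total spent on the socket-to-socket link); the constant and function names are just for illustration.

```python
LANES_PER_SOCKET = 128          # 1P figure from the comment
INTERCONNECT_LANES_TOTAL = 64   # lanes spent linking the two sockets, per the comment


def usable_lanes(sockets: int) -> int:
    """PCIe Gen4 lanes left for devices in a 1P or 2P system."""
    if sockets == 1:
        return LANES_PER_SOCKET
    if sockets == 2:
        return 2 * LANES_PER_SOCKET - INTERCONNECT_LANES_TOTAL
    raise ValueError("only 1P and 2P configurations are covered")
```

So a 2P board ends up with 2 × 128 − 64 = 192 lanes for devices, which matches the figure quoted for the dual-socket case.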
NUMA is complicated stuff. It also partitions access to RAM channels and capacity. If you were to run quadrant mode and assign a VM or process to each quadrant, that would be fine. Normal apps often don't know how to efficiently use several quadrants at the same time though - they need to be specifically programmed to be NUMA-aware.
And if the application is not NUMA aware then it can potentially be limited to a single “quadrant” at the OS level with processor affinity / pinning. So you might end up running 4 copies of whatever app, one per quadrant, or assign some other workload to another quadrant. I don’t have a lot of experience here so I’m not sure how that works for memory allocation. Thread affinity is easy, though, and I’ve used core pinning on Windows and Linux to avoid L3 misses on multi-CCD Ryzen processors.
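As a sketch of the "one copy per quadrant" idea, here is how the pinning could look from Python on Linux. `os.sched_setaffinity` is a real standard-library call (Linux only); the assumption that each quadrant is a contiguous, equal-sized range of core IDs is mine and should be checked against the real topology (for example with `lscpu` or `numactl --hardware`) before relying on it. Note that this only covers thread placement, not where memory gets allocated.

```python
import os


def quadrant_cores(total_cores: int, quadrants: int = 4) -> list[set[int]]:
    """Split core IDs 0..total_cores-1 into contiguous, equal-sized groups.
    Assumes quadrants map to contiguous core IDs, which may not hold."""
    if total_cores % quadrants != 0:
        raise ValueError("total_cores must divide evenly into quadrants")
    size = total_cores // quadrants
    return [set(range(q * size, (q + 1) * size)) for q in range(quadrants)]


def pin_current_process(cores: set[int]) -> bool:
    """Pin the calling process to `cores`. Returns False where unsupported
    or when none of the requested cores are available to this process."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. Windows or macOS
    target = cores & os.sched_getaffinity(0)
    if not target:
        return False
    os.sched_setaffinity(0, target)
    return True
```

Each of the four worker copies would call something like `pin_current_process(quadrant_cores(96)[n])` with its own `n` at startup.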
@@JJfromCowtown When it comes to Zen (except 1st gen Threadripper/Epyc), you shouldn't need to do thread pinning. These CPUs present as a single NUMA domain and the scheduler is supposed to be L3 partition aware, such that it tries to fill one CCD with threads of a given task first before allocating resources from the other one.
@@Psychx_ , I agree that’s how it is supposed to work. In practice, both Windows and Linux schedulers will move threads related to gaming tasks from one CCD to another, leading to stutters. I observed this on Zen 2 mostly, with a 3800X. Limiting games to the second CCD improved min frame times. Maybe the scheduler tries to limit to one L3 domain/CCD but there must be more to it that I don’t understand.
@@JJfromCowtown I'm using a 5900X and before that a 3900X. There have been changes regarding this in the latest kernels, plus distros tend to configure the scheduler differently (i.e. by setting CONFIG_WQ_POWER_EFFICIENT_DEFAULT in the kernel config). All I can say is that it does work as expected and respects L3 cache partitions when configured correctly. Btw, I'd highly recommend using the BORE scheduler for Linux gaming. It delivers much better frametimes than stock CFS, while being simpler (simple is good, fast execution of scheduler code and less breaking changes by upstream) than ProjectC/PDS (another great scheduler replacement with a different, more complicated approach).
@@Psychx_ thank you and I will try that in my Linux env. I am on a single CCD these days (5800X3D) but "free" performance is worth chasing. Given all the talk about big/little core scheduling issues in Windows (particularly in Windows 10) I have always doubted that optimizing for L3 mattered much to the scheduler in that environment.
The Aspeed AST2500 is actually another full Arm computer in the system, with cores, memory, storage, and its own OS/firmware. It is not really meant to be a GPU; you are using the old (I think Matrox-sourced IP?) and small GPU designed to output video via VGA to server KVM carts. ANC and SNC are generally used by the HPC folks to localize memory access between segments of cores. It effectively splits your CPU up into a configurable number of slices (often four) to help with this. In the old days, this was not as noticeable. Now, pegging CPUs to things like chiplets, memory controllers, and local high-speed memory is important to minimize hops on internal fabrics and relieve congestion / lower latency. We will show it on the upcoming Intel Xeon Max with HBM video as well as on the AMD Bergamo side. Typically, you would not set this on an Ampere Altra design, just because of where they are intended to be deployed.
This is why I subscribed to the channel all that time ago! I can't wait until the day that I can confidently switch away from my Intel x86/64 machine over to an ARM64 machine with a dedicated GPU and see a better power to performance ratio overall!
I'm sure you're aware, but there are x86 emulation layers for Linux like Box64 and FEX, which have reasonable performance. I'd be interested to see you test some more interesting stuff out with those on this hardware.
@@clonkex All Adobe stuff has auto-saving functions, no need to bash the keyboard every 2 minutes like a caveman. Still does not save you if it decides to corrupt the project files though.
@@marcogenovesi8570 I don't use autosave (unless you mean the automatic background recovery saves) because I prefer to control exactly what gets saved (so I don't save the file with some half-baked experimental changes I don't necessarily want - because of course, if it crashes at that point, I can't undo). But honestly I've rarely seen crashes. Maybe once, twice at the absolute most, in over 8 years of using Photoshop on Windows for my daily work. Then again, maybe you're using InDesign or some other product I'm not familiar with.
I wish society would just navigate towards ARM. The efficiency gains are *insane*; it would literally make laptops so much more useful, with insane performance and battery life. Just look at how insane the M1 and M2 Macs are.
The Windows API previously handled only 32 cores, and all apps on Windows used that API, so Microsoft created a new API that handles a lot more. But many people still use the old answer found on the web for how to get the number of cores in C++ on Windows. I ran into that problem the first time I used a game engine on a 192-core Threadripper at work.
If you're looking for good games for cross-platform benchmarks, I believe Minecraft (Java) runs natively on Windows on ARM now (it has run on ARM Linux forever). Bedrock now runs on ChromeOS, so maybe we'll have Minecraft RTX running natively on Ampere before long.
You should be able to run any architecture of your choosing on this Arm Altra CPU, and you can do so transparently by hooking QEMU into binfmt_misc; that should allow you to get Steam running on this Altra. I'm only familiar with how to do so on Gentoo Linux, but I imagine the concept is identical on other distros.
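To make the "old API vs. new API" point concrete, here is a hedged Python sketch using `ctypes`. As far as I know, Windows splits logical processors into processor groups, and older calls only report the caller's own group, while `GetActiveProcessorCount` with `ALL_PROCESSOR_GROUPS` counts all of them. The cap of 64 per group below is my assumption for 64-bit Windows (the limit quoted above may refer to older or 32-bit builds), and on non-Windows systems the sketch simply falls back to `os.cpu_count()`.

```python
import ctypes
import math
import os
import sys

ALL_PROCESSOR_GROUPS = 0xFFFF  # Win32 constant accepted by GetActiveProcessorCount
MAX_PER_GROUP = 64             # assumed cap per processor group on 64-bit Windows


def groups_needed(logical_processors: int) -> int:
    """Minimum number of processor groups for a given logical CPU count."""
    return math.ceil(logical_processors / MAX_PER_GROUP)


def logical_processor_count() -> int:
    """Count logical processors across all groups where possible."""
    if sys.platform == "win32":
        kernel32 = ctypes.windll.kernel32
        kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
        kernel32.GetActiveProcessorCount.restype = ctypes.c_uint32
        count = kernel32.GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)
        if count:
            return count
    return os.cpu_count() or 1
```

With that assumption, a machine exposing 192 logical processors needs three groups, so code that only looks at a single group would see a third of the machine at best.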
Yeah, also search to see if anyone has already posted about it. Some more information: I've used it more from the x86 side, but it should work fine in the other direction too. The nice thing with binfmt is that it only emulates userspace; syscalls are handled natively by the kernel. The only problem is getting all the x86 binaries: for that you either need the magic `dpkg --add-architecture amd64`, a chroot / LXC / Docker container, or NixOS with `pkgs.pkgsCross.*`
Something that hasn't been talked about in the video (at least as far as I noticed): have you tried deploying Microsoft's OpenCL and OpenGL Compatibility Pack to the Windows on ARM system to see if it makes any difference for apps that depend on OpenGL? I have tried this on an older GPU that only supported OpenGL 3.1 at most by default and had no issues with it being overridden to GL 3.3 by the compatibility layer (albeit on an x86_64 machine).
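The reason binfmt can do this transparently is that the kernel matches on the first bytes of the file, and for ELF binaries those bytes include the target architecture. Below is a small Python sketch that reads that field itself, as a way to check which binaries would need an emulator on a given host. The `e_machine` numbers are from the ELF specification; the function names are just for illustration.

```python
import struct

# e_machine values from the ELF specification
ELF_MACHINES = {3: "x86", 40: "arm", 62: "x86_64", 183: "aarch64", 243: "riscv"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture named in an ELF header (first 20 bytes)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    byte_order = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")


def elf_machine_of(path: str) -> str:
    """Convenience wrapper: read the first 20 bytes of a file on disk."""
    with open(path, "rb") as fh:
        return elf_machine(fh.read(20))
```

Calling `elf_machine_of` on a binary from an amd64 chroot should report `x86_64`, while the host's own binaries on an Altra should report `aarch64`.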
As someone who has been using Windows on ARM for a little over a year now (Device: Surface Pro X), I have found that it still doesn't run Roblox smoothly from the MS Store or the browser as well as a Mac does. Moreover, Roblox now requires the system to be 64-bit, and I can't access and play Roblox from the browser at all, except through the MS Store where it works. However, the overall user experience has been decent. I appreciate the fact that it can connect to LTE and has a long-lasting battery
Linux actually has a translation layer for x86 or any other arch; it's called QEMU. You can literally chroot into an Arm rootfs from an x86 computer or the other way around, though you do need to set up binfmt.
Wow, whatever optimizations and tweaks you've made to your video making workflow have paid off. This video was *really* well done! Super excited for the 128 core upgrade livestream.
12:26 "Linux doesn't have an x86 translation layer" It has three actually: FEX-Emu, Box86/64 and QEMU's "User Mode Emulation". You should try to get one of them running; you'll probably be able to play modern AAA games on ARM Linux using them with Wine/Proton.
You can bypass the login requirement by just putting in “a” for the email and password fields. I’ve done this more times than I can count at this point, no need to disable Wi-Fi or anything.
Nice to see ARM in the 'traditional' desktop form factor. The carrier board for the CPU socket & RAM is something I hoped to see from the Apple Mac Pro. It's a shame that at a time when Apple is doing ARM on the desktop, it's still off in its own world and not helping ARM anywhere else.
Exactly, would've loved to see Apple find a way to at least give upgradeable RAM, or at least put the SoC on a carrier like COM-HPC or even something proprietary so it could be upgraded.
It's nice to see an ARM system that is actually upgradeable and using standard components, like x86 desktop computers. Maybe at some point we can get an actually like-for-like comparison between x86 and ARM in terms of performance and power consumption. Comparing to Apple's stuff is... well, _not_ apples-to-apples (heh), because their stuff is all integrated. Of course they'll have e.g. power consumption benefits just from that already. I want to know how ARM does compared to x86 when it's a socketed CPU with "external" RAM and storage and GPU and so on on both sides.
The integrated aspect of the M-series Macs actually improves the performance of those chips. ARM is incredibly memory hungry (bandwidth and maybe size) vs x86. For x86, slower memory doesn't hurt the CPUs that much (cache layout, complex instruction sets). The IPC of the ARM CPU above is incredibly bad even when you compare natively compiled games against x86. The M2 chips are like several decades ahead.
Like the guy above me said, integrated RAM or an SoC / SoM benefits performance greatly, hence the Apple M1 / M2 have great performance, but it comes with a pricey chip that is not upgradeable unless you replace the whole chip.
Thanks I'm watching your progress with great interest. My comment is, being someone who abandoned Windows in 2001, I say who cares how it performs, but your performance review (I skipped a lot in the WinDoze review) just proves my opinion. The only OS for machines like this is Linux. It's even getting mature enough to give Apple a possible competitor, though lacking many applications that keep me using Mac OS. When I say Linux , I mean Unix, so Mac OS, Linux, implementations of Unix. It'll be interesting to see the 128 core.
Some people might wonder why the performance in Crysis was so bad. Crysis is single-threaded. Of the 96 available cores, Crysis was only using 1 of them. I remember back in the day a lot of people bought a quad-core upgrade hoping to play Crysis with higher performance than their dual-core systems, and they were thoroughly disappointed to find the quad-core performed worse than the dual-core due to the quad-core having a lower clockspeed. The singlethreadedness of Crysis was THAT intense.
@1:00 - Crysis likely runs badly because it's so heavily single-threaded (at least assuming it's the original version you tried!), so even with a translation layer that can spread load somewhat, it's probably bottlenecking severely. The remaster is also kind of jank (and probably has similar issues) because it still has aspects of CryEngine and is based upon the console ports for some reason; but that's another topic!
Yeah, and the Ampere Altra Max's single thread performance is only on par with older desktop CPUs, so that makes sense. I have yet to see if the new AmpereOne CPUs have better single-threaded performance. The Apple M-series chips really shine there.
I've heard that ARM is way more power efficient than x86, so if that's the case then we should be switching over to it, and the fact that Microsoft won't do that is really baffling.
Hi, Jeff. I tested a research project on the 80-core chip and discovered that atomic operation performance tends to drop dramatically when exceeding 40 threads under high-contention workloads. I speculate that it might have something to do with the core-to-core interconnect. I think that might be the reason why the performance does not look good for the 96-core relative to the 60-core.
You will likely need to upgrade to a board that supports all 8 memory channels to get a teraflop, unless the measurement is purely in Level 1/2/x cache. Also, if you are only using single-rank sticks of RAM, that will limit performance as well. 2 ranks of 4Gb or 8Gb density chips per memory channel is typically the best-performing setup for DDR4.
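To put a number on why the channel count matters, here is a minimal Python sketch of the usual theoretical peak bandwidth formula: channels × transfer rate × bytes per transfer. The 64-bit channel width is standard for DDR4; the DDR4-3200 speed in the usage note is my assumption, since the comment doesn't state what memory is installed.

```python
def peak_bandwidth_gb_s(channels: int, transfer_rate_mt_s: int,
                        bus_width_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).
    Real-world throughput is lower and also depends on ranks and timings."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * transfer_rate_mt_s * bytes_per_transfer / 1000
```

Assuming DDR4-3200, `peak_bandwidth_gb_s(8, 3200)` works out to 204.8 GB/s, while `peak_bandwidth_gb_s(6, 3200)` is 153.6 GB/s, so every unpopulated or unsupported channel costs 25.6 GB/s of headroom.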
The best thing about the ARM architecture becoming so popular, IMO, is that it forces low-level software to become more easily compiled to other architectures. It's like how in the early days, you built a game for a single PC platform and then had to individually port it to every other platform. Then publishers wanted to easily sell games on consoles and PC from launch so game engines evolved to support more than one platform (usually favouring consoles, with wonky results on PC - see Mass Effect 1). Nowadays game engines have generally excellent abstraction layers that make it trivial to produce games for virtually any platform. The same thing just needs to happen for CPU architectures. Which, it sort of already has in many ways, especially for usermode software, but it's still far from trivial to get OS kernels and drivers compiled for new architectures.
There are technically options for translation on Linux. Rosetta, for example, can be run on Linux (Apple provides a binary for VMs, and it works on Asahi), although I don't know if it would actually run on anything other than M1/M2. Additionally, there is the QEMU binfmt_misc layer, and I'd be very curious how well that would run with that many cores.
@@JeffGeerling If I remember correctly I don't think it does, not in the Linux version. But don't quote me on that. Even then, I have no idea whether it works with different page sizes though. I guess the only way to find out is for you to try it. Or for you to ask Hector Martin.
Some of those older games, like the one that loads and goes to a black screen, can sometimes be fixed by launching at a lower resolution: right-click the application in File Explorer, open Properties, and in the compatibility settings choose the lower resolution option that pops up. But that's on a regular PC. I have no idea how it would be on that dev PC.
It's not a bad price for high-end workstation or server parts, but it is definitely not cheap, either! I hope Ampere is around for a long time and can slowly roll out more on the desktop side, so we can see more Arm options there.
Have you heard of box64? I haven't actually used it, but it's a compatibility layer for running x86 on Arm, like Rosetta or WoW64. They do specify that Steam works on it, so maybe you could try it sometime.
Can't believe you have not tried running Minecraft! Minecraft runs on nearly every 64-bit OpenGL 3+ capable machine. The only obstacle is Minecraft's launcher being amd64-exclusive, but third-party launchers help here.
Yes!!!!!! This is exactly what I wanted to see!! Running Windows on a machine like that is my favorite kind of mad science!!! And maybe in the future Valve would consider a SteamOS with Arm support.
At about 12:30 you mentioned x86 and amd64 compatibility issues for Steam. Have you looked into box32 and box64 to see if they would resolve your issues? I've had minimal success using them on an RPi 4 and Orange Pi 5 to run Steam. I'd love to see what you're able to do with those programs on a machine like this.
This tells us one thing: unlike x86 CPUs, it seems ARM needs a very well-optimized platform to reach its full potential. Also, I think the approach might need to be different. I bought the Pinebook Pro, an ARM-based Linux laptop, before Apple went to ARM for the Mac, and then I picked up an M1 Max MacBook Pro 16 inch. The Pinebook is an all-in-one SoC as well. There were leaks of Apple trying out socketed RAM and external GPUs with their ARM designs in the Mac Pro, but ultimately they didn't go that route. It may be that the CPU needs the RAM and GPU to be closer for the architecture to rival the brute-force method of an x86 chip. Just my speculations about ARM. But I can say that when ARM runs well, it is a super stable and nice experience on Linux or macOS.
x86 and ARM pretty much represent two ends of the spectrum of CPU design philosophies: CISC and RISC respectively. CISC provides you with a large library of features built into the CPU, at the cost of poor efficiency because many features go unused but still need power and space. RISC provides you with a simple, minimal but extremely flexible built-in set of features, and expects you to build the rest of the features in software as you go. However, with modern compilers and improved automation tools, optimisation isn't nearly as much of a problem today as it was 30 years ago. The only reason x86 feels easier to optimise for now is mostly that it benefits from 30 years of consumer-oriented optimisation, whereas ARM has until recently been mostly focused on industrial or mobile optimisation. The real issue is that the market for ARM is very fragmented. Because the ARM company allows you to make whatever customisations you desire to the chip, for a fee, not every ARM chip is guaranteed to behave exactly the same. Applications that stick purely within the CPU will work fine, but anything that interacts with external hardware, e.g. graphics, may run up against unexpected quirks of specific models or manufacturers.
Hi Jeff, I really enjoy your channel. It's been a nice distraction since my mom died earlier this year. I'm glad that you are doing better since your operation in December. I look forward to watching lots of your videos. Have a wonderful day. Jeff
@Jeff Geerling: Can you *please* elaborate on your statement "Since Linux doesn't have an x86 translation layer" at 12:28? Because `binfmt-support` (along with `qemu-user-static` and maybe other dependencies) is supposed to address this and has been around for a couple of moons by now. Try "debootstrap --arch=amd64 jammy " with and without the aforementioned packages in place on your arm64 host. Works like a charm (you want to install `schroot` to enter said environment) and is actually well-documented.
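To check whether those packages actually registered an x86 handler on an arm64 host, one option is to read the kernel's binfmt_misc directory. The Python sketch below parses the per-handler files under `/proc/sys/fs/binfmt_misc/`; that location and the line-based file layout are standard on Linux, but the function names are made up for illustration and which handlers you will see depends on the distro and installed packages.

```python
from pathlib import Path


def parse_binfmt_entry(text: str) -> dict:
    """Parse the contents of one /proc/sys/fs/binfmt_misc/<name> file."""
    lines = text.strip().splitlines()
    entry = {"enabled": bool(lines) and lines[0].strip() == "enabled"}
    for line in lines[1:]:
        key, _, value = line.partition(" ")
        entry[key.rstrip(":")] = value.strip()
    return entry


def registered_handlers(root: str = "/proc/sys/fs/binfmt_misc") -> dict:
    """Map handler name -> parsed entry; empty if binfmt_misc is not mounted."""
    handlers = {}
    base = Path(root)
    if not base.is_dir():
        return handlers
    for item in base.iterdir():
        # "register" and "status" are control files, not handlers
        if item.name in ("register", "status") or not item.is_file():
            continue
        try:
            handlers[item.name] = parse_binfmt_entry(item.read_text())
        except OSError:
            continue
    return handlers
```

If an entry whose `interpreter` points at a qemu x86_64 binary shows up as enabled, the kernel should hand amd64 ELF files to it automatically, which is what makes entering the debootstrap environment work.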
Just FYI, Jeff: the ASPEED GPU would be the reason some of the games would not load. The ASPEED GPU is designed to give you basic graphics output, intended for debug use. The ASPEED chip is an SoC, meaning it's a computer used to start up and manage the system before anything else loads. If you can, try using a discrete GPU in there; I think you will find games will run a lot better :)
Nvidia's current Linux graphics drivers have an issue where, if you run a game full screen, it gets Vsync'ed to the monitor without any ability to override that. If you had run Dhewm windowed, it would've uncapped the FPS. Nvidia released a statement that there will be a fix for this in the next driver release.
Could you make a follow up video testing games on Linux using Box86, Box64 and FEX? Asahi Lina made a video a few months ago testing some games using these on M1 macbooks with their video drivers and managed to get steam and quite a few games running. I think it would be interesting to compare these 2 setups.
1995 Microsoft: Internet? No, use MSN and a modem. 2023 Microsoft: you must use MSN to log in to your machine. Yes, it’s still just MSN, with Passport, rebranded. They’ve been shoving this crap down our throats for 25 years.
Microsoft thinks you’ve seen nothing yet. “The presentation has been revealed as part of the ongoing FTC v. Microsoft hearing, as it includes Microsoft’s overall gaming strategy and how that relates to other parts of the company’s businesses. Moving ‘Windows 11 increasingly to the cloud’ is identified as a long-term opportunity in Microsoft’s ‘Modern Life’ consumer space.” “So, what if Microsoft extended the capabilities it currently bills to businesses on a per-user, per-month basis to general consumers? That is precisely what the Redmond, Washington-based company is envisioning, according to internal documents made public thanks to the FTC vs. Microsoft hearing currently taking place.”
I absolutely can see Valve moving over to ARM in the future. If they were to pull an Apple and develop their own silicon they wouldn't have to rely on the next great thing from AMD as a lot of mini system builders seem to be doing right now. This could increase their battery life from 4 to 8 hours and be the ultimate handheld gaming which they seem to be going for lately. If I were a betting man I would say Steam Deck 4 will be ARM based.
Memory channels give a great boost. It's ridiculous how many times you'll get a pre-built PC with 2 memory sticks in a 4-slot, dual-channel machine, but they installed both sticks in the same channel. I've gotten some of my friends boosts of up to 15% by just moving a memory stick over a slot. If anyone wants to check themselves, I'd consult the motherboard manual first to make sure, but memory should typically be inserted in alternating slots, like first stick in slot 1, second in slot 3, etc.
Yeah, Windows on Raspberry has been pretty cool to see, and it's conceptually similar, though the best thing about the Ampere setup is the normal installer can be run without modification since the system has UEFI and doesn't need special boot stuff in the image.
A bad Windows design rant is always welcome. Perhaps if EVERYBODY tells them it is stupid, they may one day realize it's stupid. We hope...... God, we hope. Thanks Jeff, great and informative video.
Really excited to see the future of ARM. It cannot be overstated how much of an impact in the medium term Apple moving to the ARM architecture will make.
Ahem, @JeffGeerling, there IS a translation layer for Linux: it's called QEMU! Or rather QEMU user mode specifically, which doesn't do any of the fancy VM stuff. It translates the guest program's instructions and passes its syscalls through to the host kernel, so you can run Arm on x86, or the other way around, or whatever. Of course it isn't perfect either, but y'know, it's pretty good. There is also jart's blink, which is also rather worth checking out: "Blink goes 2x faster than qemu-x86_64 on some benchmarks, such as SSE integer / floating point math. Blink is also much faster at running ephemeral programs such as compilers." claims the readme, and I believe it.
I'm pretty surprised, tbh, that you don't know about our lord and saviour, QEMU user mode. It's the reason WHY the full-system emulator is called qemu-*system*-x86_64: the user-mode one already uses the name qemu-x86_64. Blink, however, is a much newer project.
I do, but I had trouble with Box64 and put a pin in trying that out for now; what I mean is there's nothing built into the distro (Ubuntu in this case) like WOW64 or Rosetta 2.
The thing that peeves me with Windows on ARM has honestly never been Windows itself. The emulation is great, except that .NET apps and whatnot can't help but snitch about it being ARM64. So stuff like Easy Anti-Cheat fails for no reason because Epic cba to fix their junk (so no VRChat on ARM64 :( ). I also have no idea why Microsoft hasn't sat down and written their own USB CDC drivers for all the Arduinos out there; that's the other thing that gets iffy for me.
It's great to see the ARM architecture picking up steam. Now if only Steam would pick up ARM.
B+ tier pun! Just like my segues
I see what you did there.
And I like it. 👍
*Now :D but love the pun :)
Well, some of Valve's patents for a standalone VR headset and some of their code in dev branches of SteamVR and SteamOS include mentions of an Arm chip. Besides hinting that Valve is going for a dual-chip system like the Apple Vision Pro, but with an x86_64 chip instead of the M2, said lines of code imply that there may be a Quest-like mode where the OS could run from the R1-equivalent Arm chip. That, alongside the patents about a modular VR headset, is why @SadlyItsBradley thinks that the x86_64 chip may go in a removable module.
Add to this that, based on more datamining, Bradley suspects Valve is counting on making non-VR games and software work in a VR/AR environment, a bit like what Apple marketed, mixed with the Game mode/Desktop mode approach of the Steam Deck, to attract newcomers to VR by building on the "same portable screen" aspect. (A "first get people to use flat-screen apps in a VR environment, then get them to use full VR apps" strategy.)
If Bradley is right in his theory, and depending on the functionality lost or kept in the one-chip mode (which would run on an Arm chip), Valve might work on x86-to-Arm translation layers.
proton by steam?
Hopefully mainstream ARM desktops also have socketed CPUs like this one.
Socketed Arm CPUs ftw!
There's no good reason they shouldn't be. I know when 3D stacking started becoming a thing, Intel tried to push SoC-style motherboards, which means preconfigured CPU/RAM, etc. and no upgradability/repairability. There was a lot of pushback against that.
In a twist of events, that's exactly what Apple Silicon is. Unrepairable Arm SoCs.
@skechergn You mean "Going back to blade socketed CPUs" right? Pentium 2 and early Athlon chips came on a board and some had dual sided cooling
@@JeffGeerling Try using Box64 to run x86-64 on ARM.
@@lionelt.9124 Yes, I was also just posting about box64/box32 for running x86 Wine on Arm.
Steam recently had a massive UI update that replaced a lot of the UI with Electron-based elements, which is probably why your experience with Steam is much better now.
Ah, that would make sense. The old UI would work a little better if you disabled GPU acceleration, but now it seems to work just fine.
I wonder if moving to electron might make the transition to arm64 easier, too? I can't imagine Steam will remain X86-exclusive forever!
Steam uses its own browser-based desktop framework.
The browser in Steam is a custom build of Chromium; the presentation layer is Electron, which is compiled into the client.
@@JeffGeerling They have hired the gal who has been working on making Apple’s M series GPUs work on Linux. So it’s possible they want her expertise for ARM-based gaming.
much better?? my man we live in two different worlds. Steam performs like shit to me, granted i am on archlinux
You can actually run x86_64 programs on ARM and other architectures under Linux, using a little program called box64. I honestly can see ARM being the future for most use cases, due to its high density, low energy consumption and high performance with a smaller instruction set
afaik there is also fex for linux
I hope he sees this and gets on that!
This! I remember folks trying out proton with box64 a while back. It would be interesting to see how it works here, so we can compare performance of modern games on it
Also I guess QEMU can do it ?
This.
Just a quick Google search away; poorly researched on the Linux side, as usual here.
Even if Microsoft had a competent translation layer for ARM, Creative Cloud would still run slowly. Hell, Creative Cloud runs slowly on x86...
Haha so true. Like I said, it's not what I'd call "good" code. I still subscribe but only because of 25 years of muscle memory from back in the Photoshop 2.5.1 days lol
It's not about being competent, Apple simply added a few private extensions to the ARM chip for operations that the emulator has to run a lot. This is a luxury Microsoft can't do, since they don't produce their own ARM chips.
@@anlumo1 But if chip makers like Ampere did it, I'm sure Microsoft would happily support it.
At least right now, Lightroom and Photoshop work natively. But they have 20+ apps, so 10% of them being native isn't very promising. Imagine trying to drive Premiere via emulation.
@@anlumo1 'This is a luxury Microsoft can't do', really? Microsoft has more money than God and even sold their own ARM machine running Windows! Microsoft could absolutely do the same thing, but they can't help themselves and must always half-ass everything they do. Microsoft just doesn't care.
I don't even care if Apple's ARM chips are faster. Just the fact that these platforms aren't locked into a walled garden puts these Ampere systems in a league of their own!
This 96-core Ampere Altra is based on the Neoverse N1 core, which is derived from the Cortex A76 (2018). It's a 5-year-old core, yet it has the IPC of Intel Skylake / AMD Zen 2, which was on par with the top x86 architectures. However, an A76 core consumes only 1 W at 2.6 GHz, with an area of 1.4 mm² on TSMC 7 nm, resulting in incredible performance per watt that shines even today. Ampere uses the maximum configuration the N1 platform can provide, with 128 cores. Similar CPU from Amazon the Graviton 2 used 64-core config.
FYI, today's latest Neoverse V2 (Cortex X3 based) platform has 19% higher IPC than Zen 4, it supports up to 512 cores per socket (256 cores per chiplet) and has support for revolutionary SVE2 length agnostic vectors (up to 2048-bit). Nvidia uses V2 in the 144-core Grace Superchip. Amazon will use it in the next Graviton 4, I guess. Surprisingly, Ampere's next gen should be their own in-house core, AmpereOne, leaving ARM's Neoverse license.
BTW new Cortex X4 has 33% higher IPC than AMD Zen 4 ... however next server/workstation Neoverse V3 should be based on Cortex X5 in 2024. Basically AMD Zen 5 is dead on arrival because it cannot beat X4's IPC not speaking of X5's IPC.
Apple M2 has 56% higher IPC than AMD Zen 4. Yeah, Apple is king of microarchitecture since Jim Keller's 2015 A9 Twister core release (IPC higher than Intel's Skylake).
*IPC = Geekbench6 ST pts / GHz
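The footnote above defines "IPC" as a benchmark score per GHz rather than measured instructions per clock. A short Python sketch of that proxy follows; the scores and clock speeds in it are made-up placeholders for illustration, not measurements of any real chip.

```python
def ipc_proxy(single_thread_score: float, clock_ghz: float) -> float:
    """Score-per-GHz proxy for IPC, as defined in the comment above."""
    if clock_ghz <= 0:
        raise ValueError("clock speed must be positive")
    return single_thread_score / clock_ghz


def percent_higher(a: float, b: float) -> float:
    """How much higher a is than b, in percent."""
    return (a / b - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the arithmetic.
    core_a = ipc_proxy(single_thread_score=3000, clock_ghz=5.0)
    core_b = ipc_proxy(single_thread_score=2600, clock_ghz=3.5)
    print(f"core A: {core_a:.1f} pts/GHz, core B: {core_b:.1f} pts/GHz")
    print(f"core B is {percent_higher(core_b, core_a):.1f}% higher per GHz")
```

Note that a proxy like this folds cache size, memory latency and boost behaviour into one number, which is part of what the replies further down object to.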
Exactly.
If Apple just sold their systems as open instead they would have a much larger market - but alas they do love their Apple store revenue so it will be a cold day in hell before that happens.
@@wkwk2o384ur It's not locked to Snapdragon.
There is a build that runs on Raspberry Pi too.
@@richard.20000 "Similar CPU from Amazon the Graviton 2 used 64-core config"
Zero point talking about a CPU only available for cloud instances.
"Apple M2 has 56% higher IPC than AMD Zen 4. Yeah, Apple is king of microarchitecture since Jim Keller's 2015 A9 Twister core release (IPC higher than Intel's Skylake)."
And yet Genoa is not exactly taking a beating from Apple's highest end chip.
There is far more to performance at the higher end than just single threaded IPC.
"new Cortex X4 has 33% higher IPC than AMD Zen 4"
Says who?
There isn't a single implementation of X4 in the wild yet and cache size affects core performance significantly from what we have seen of ARM claimed figures with ideal cache size vs Qualcomm Snapdragon implementations with reduced cache size.
"Basically AMD Zen 5 is dead on arrival because it cannot beat X4's IPC not speaking of X5's IPC"
Utterly pointless remark considering that potential IPC advantage would be eroded simply by running x86 software for anyone needing legacy support (which is a HUGE part of the industry like it or not).
At this point the IPC gains of ARM are slowing generation by generation as they run into exactly the same problems as x86 ODMs suffer in design - in another 5 years the situation will be pretty much neck and neck.
ARM was better in the lower power region - but as it scales it is running into a wall, this much is fully evident to anyone looking at actual performance figures and power draw.
"and has support for revolutionary SVE2 length agnostic vectors"
It was revolutionary a decade ago when they published the ARGON experimental variable length vector instruction set.
These days it is just an implementation of an idea already taken root in at least one other architecture (RISC-V) and implemented in Fujitsu's ARMv8 based supercomputer Fugaku years ago - Fujitsu actually co developed the original SVE with ARM.
@@richard.20000
The thing is, the ISA doesn't really affect the performance nor the power consumption. The whole CISC vs. RISC thing is from the old days and doesn't matter much now.
As for that claim about the Cortex X4 having 33% higher IPC than AMD Zen 4, we haven't seen any real-world implementation of the Cortex X4 yet. So it's hard to say how it will actually perform, plus other factors like cache size can impact core performance, so it's not just about IPC numbers. Today IPC is "how well you are able to keep the core fed", which can vary a lot from one µarch to another.
Saying that AMD Zen 5 is "dead on arrival" because it can't beat the hypothetical IPC of future ARM cores is kind of pointless. There are a lot of software applications that still rely on x86, so running them on ARM would be a challenge and may affect performance depending on the implementation. Compatibility also matters a lot in this segment.
And the support for SVE2 length agnostic vectors might sound cool, but it's not as revolutionary as you might think. Other architectures like RISC-V have similar concepts already. It's not a game-changer anymore.
With FEX-emu, you may be able to run Steam and other x86 apps on Linux. In fact, Asahi Lina used that method to run Steam on an M2 Mac (even Proton worked).
I've also heard that box86 (and box64) does a pretty good job running Steam.
Box86 or box64 would be the better option these days IMO
It would be super exciting to see Steam under Box86/Box64 in the next video! It's a bit of a hassle to set up, but should give you a very competent translation layer.
You know something is good when even Torvalds says he's using it in the release notes.
Marcan never disappoints 😎
No idea what the gaming performance is like but you can use QEMU User space emulation to run programs compiled for another cpu
Computer engineer here, the basic idea is shared memory in these massively multicore CPUs works with "neighborhoods" of cores to cut down on the number of tag bits required in memory to mark which cores have a block of memory loaded. As you can imagine, crossing neighborhoods is much slower than working within your own neighborhood. This is why monolith generally doesn't perform as well
Though for some reason on Linux monolith was doing well compared to quadrant... I know Anandtech did a lot of testing though on Ampere, Epyc, and Xeon, and found some interesting differences (all are tradeoffs, but some work better than others depending on the workload).
@@JeffGeerling That is pretty interesting. I should specify I'm not an expert on parallel stuff, I've generally worked at the individual core level, and I definitely am not a software guy. So I'm totally speculating here, but my guess would be that going Monolith allows the scheduler to be a bit liberal and play around with crossing neighborhoods when it determines it's worth it.
This is something I could definitely imagine Linux being better at handling than Windows for ARM. Though again, if someone who actually knows about the Linux scheduler wants to chime in and correct me I'd welcome it.
@@JeffGeerling I think the Linux scheduler does a bunch to try to avoid cache thrashing. It may actually switch the CPU out of monolith mode (if this is possible) and go about trying to schedule based on cache.
I know more than nothing about the Linux scheduler. But I'm far from an expert.
@@JeffGeerling Just checked out the Anandtech article, (I'm currently on vacation so I don't have the time to really dig into understanding it so I'm again sticking the "speculation" disclaimer on this comment) but it looks like Ampere's mesh is really slow when it comes to multiple cores within the same quad trying to share a cache line that exists in another socket.
Specifically, this is the line I'm referencing: "and the aforementioned core-to-core within a socket with a remote cache line from ~650ns to ~450ns - still pretty bad". This can slow down Monolith mode if the scheduling system is unaware of this.
As for quadrant mode, going quadrant mode seems to segment the SLC (system level cache) evenly between the four quadrants (so each quad only gets 4MB of SLC). Instead of allowing it to be shared between them. This means for workloads that don't use all the quads you're losing a lot of cache potential and having to go out to main memory much more often.
Edited to clarify a couple things.
@@JeffGeerling Sorry for the comment spam, but going off the comment by @Omnifarious0 I'd wager that in "Monolith" mode the scheduler is still trying to keep things within the quadrants, but since Monolith mode opens up the cache you can stay on chip way more. Windows probably struggles with scheduling threads to avoid remote (off quad) cache line accesses and that's why it suffers compared to Linux in Monolith mode despite the unsegmented cache.
I think a big problem with ARM support is that companies might be more interested in just riding things out until RISC V gains some steam; since ARM has been rather.. particular with their licensing lately.
This, and companies always locking down their ARM products so you can't run any software or operating system but the one they approve of. People keep saying "NoT aLl CoMpAnIeS" without stopping to think that no one with any weight is bucking the trend of locking down the hardware. It's all fine and well to buy these Chinese SBCs and super expensive tech demos like this one, but for the consumer who doesn't have the money to burn on a 158293-core whatever ARM CPU, we're going to be stuck with locked-down and unsupported hardware.
It's not that; even RISC-V will hit the exact same problem. It's the size of the market that is the problem: there's not much demand. Desktop ARM on Linux probably would not be as developed as it is if not for the Raspberry Pi.
The problem these days: Apple lost interest in RISC, and with Windows still making up the majority of the x86 platform, there is no motivation for devs to go back to the RISC era, which was not much fun for any devs from the 90s to 2007. Optimising for it was beneficial for Apple and PC systems, but it was time-consuming, and bad ports were a constant problem back then. Now everyone uses x86 CPUs for the lazy approach: a game engine that compiles everything automatically. Manual compiling on a RISC architecture is not very user-friendly, even though it gives more control over raw performance and utilises the cores better than the automatic approach of CISC compilers, which doesn't give users complete control of the CPU cores. It's too generic a method, like the compilers under Linux, where devs have to choose which type of CPU the software will run on (type 1 to type 4?). The majority don't have type 4, while type 2 is very compatible with all x86 CPUs from 1995 to 2023, so it's logical how they compile...
@@ivankintober7561 ???? Modern Apples are currently on RISC architecture. M1 and M2 chipsets are RISC.
Yeah, especially now that the relevant specs are being finalised and added to RVA23. I’d expect RISC-V to go mainstream in a few years, at least on smartphones.
As a heads-up 13:28 Ampere Altra (Max) is up to 128 PCIe Gen4 lanes 1P. 192 is only in 2P using lanes from both CPUs and 64 lanes being used as the interconnect between the two chips.
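The lane counts in the correction above come down to simple arithmetic. Here is a tiny Python sketch, reading the comment as 128 lanes per CPU with 64 lanes in total given over to the socket-to-socket link in a 2P system. The function name and defaults are illustrative, and it only models the 1P and 2P cases mentioned.

```python
def usable_pcie_lanes(sockets: int,
                      lanes_per_cpu: int = 128,
                      interconnect_lanes: int = 64) -> int:
    """PCIe Gen4 lanes left for devices, per the figures quoted above."""
    if sockets not in (1, 2):
        raise ValueError("only 1P and 2P configurations are modelled here")
    total = sockets * lanes_per_cpu
    # In 2P, some lanes from the two CPUs form the link between the chips.
    return total if sockets == 1 else total - interconnect_lanes


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n}P: {usable_pcie_lanes(n)} lanes")
```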
Oops! I thought I had double-checked that, but nope.
So exciting to see the future of ARM. I had to settle for buying a Windows Dev Kit 2023 to have an ARM desktop.
Well if you want that, but more insane, the Dev Kit is a fun upgrade, if a little pricey, and it uses a bit of power at idle :O
How good is it?
update us
@@Dave102693 The Windows Dev Kit has been kind of flaky. I had it shut off a couple of times when doing something processor/graphics intensive. I tried running Portal 2 just to see how bad gaming performance would be on a translation layer, and I had the system shut off a couple of times.
@@XeonProductions Has anyone got Linux running on it? If so, try running games with box86/64 or FEX; Windows x86-to-ARM translation is so bad it loses to both of the Linux translation ones. If Linux is not ported yet, try to run Wine within OpenBSD with box86/64, if it's even possible.
NUMA is complicated stuff. It also partitions access to RAM channels and capacity. If you were to run quadrant mode and assign a VM or process to each quadrant, that would be fine. Normal apps often don't know how to efficiently use several quadrants at the same time though - they need to be specifically programmed to be NUMA-aware.
And if the application is not NUMA aware then it can potentially be limited to a single “quadrant” at the OS level with processor affinity / pinning. So you might end up running 4 copies of whatever app, one per quadrant, or assign some other workload to another quadrant.
I don’t have a lot of experience here so I’m not sure how that works for memory allocation. Thread affinity is easy, though, and I’ve used core pinning on Windows and Linux to avoid L3 misses on multi-CCD Ryzen processors.
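As a rough illustration of the core pinning described above, here is a minimal Python sketch for Linux using `os.sched_setaffinity`. Which CPU numbers make up a quadrant or CCD is hardware-specific, so the `quadrant_cpus` range in the example is a hypothetical placeholder, and this only sets CPU affinity, not memory placement.

```python
import os


def pin_current_process(cpus):
    """Restrict this process to the given CPUs (Linux only).

    CPUs that are not currently available to the process are ignored.
    Returns the affinity set that is in effect afterwards.
    """
    allowed = os.sched_getaffinity(0)
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Hypothetical layout: pretend CPUs 0-23 form one quadrant.
    quadrant_cpus = range(0, 24)
    print("now running on CPUs:", sorted(pin_current_process(quadrant_cpus)))
```

Child processes inherit the affinity, so a launcher script like this can start a non-NUMA-aware app already confined to one quadrant.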
@@JJfromCowtown When it comes to Zen (except 1st gen Threadripper/Epyc), you shouldn't need to do thread pinning.
These CPUs present as a single NUMA domain and the scheduler is supposed to be L3 partition aware, such that it tries to fill one CCD with threads of a given task first before allocating resources from the other one.
@@Psychx_ , I agree that’s how it is supposed to work.
In practice, both Windows and Linux schedulers will move threads related to gaming tasks from one CCD to another, leading to stutters.
I observed this on Zen 2 mostly, with a 3800X. Limiting games to the second CCD improved min frame times. Maybe the scheduler tries to limit to one L3 domain/CCD but there must be more to it that I don’t understand.
@@JJfromCowtown I'm using a 5900X and before that a 3900X. There have been changes regarding this in the latest kernels, plus distros tend to configure the scheduler differently (i.e. by setting CONFIG_WQ_POWER_EFFICIENT_DEFAULT in the kernel config).
All I can say is that it does work as expected and respects L3 cache partitions when configured correctly.
Btw, I'd highly recommend using the BORE scheduler for Linux gaming. It delivers much better frametimes than stock CFS, while being simpler (simple is good, fast execution of scheduler code and less breaking changes by upstream) than ProjectC/PDS (another great scheduler replacement with a different, more complicated approach).
@@Psychx_ thank you and I will try that in my Linux env. I am on a single CCD these days (5800X3D) but "free" performance is worth chasing.
Given all the talk about big/little core scheduling issues in Windows (particularly in Windows 10) I have always doubted that optimizing for L3 mattered much to the scheduler in that environment.
I love the rant mode about Windows and needing a Microsoft account. It is indeed ridiculous
The Aspeed AST2500 is actually another full Arm computer in the system with cores, memory, storage, and running its own OS/ firmware. It is not really meant to be a GPU, you are using the old (I think Matrox sourced IP?) and small GPU designed to output video via VGA to server KVM carts.
ANC and SNC are generally used by the HPC folks to localize memory access between segments of cores. It effectively splits your CPU up into a configurable number of slices (often four) to help with this. In the old days, this was not as noticeable. Now, pegging CPUs to things like chiplets, memory controllers, and local high-speed memory is important to minimize hops on internal fabrics and relieve congestion/lower latency.
We will show it on the upcoming Intel Xeon Max with HBM video as well as the AMD Bergamo side. Typically, you would not set this on an Ampere Altra design just because of where they are intended to be deployed.
This is but one of the many reasons I tell people to subscribe to you ;)
This is why I subscribed to the channel all that time ago! I can't wait until the day that I can confidently switch away from my Intel x86/64 machine over to an ARM64 machine with a dedicated GPU and see a better power to performance ratio overall!
Love you pushing ARM. Would love to also see some RISC-V. Thank you.
Someone needs to ship me one of those server RISC-V chips to mess with :D
I'm sure you're aware, but there are x86 emulation layers for linux like Box64 and FEX, which have reasonable performance. I'd be interested to see you test some more interesting stuff out with those on this hardware.
Great video. I used ARM on desktop for the first time in the late 1980s when my school got their first Acorn Archimedes PC.
Jeez
Oh wow, you can have Adobe products crash and eat hours of your work on ARM too now. Truly excellent
Not to defend Adobe at all, but... ctrl+s my guy. No software crash should ever lose you more than about 2 minutes of work. Ctrl+s. Lol.
@@clonkex all adobe stuff has auto-saving functions, no need to bash keyboard every 2 minutes like a caveman.
Still does not save you if it decides to corrupt the project files though
@@marcogenovesi8570 I don't use autosave (unless you mean the automatic background recovery saves) because I prefer to control exactly what gets saved (so I don't save the file with some half-baked experimental changes I don't necessarily want - because of course, if it crashes at that point, I can't undo). But honestly I've rarely seen crashes. Maybe once, twice at the absolute most, in over 8 years of using Photoshop on Windows for my daily work. Then again, maybe you're using InDesign or some other product I'm not familiar with.
I'm very much enjoying your Ansible book, Jeff. Many thanks for all of your contributions!
(I hope RedHat decides to do the noble thing, there... 🤨)
It would be interesting to see how it would be running Windows games in Linux using Proton and the Box86 / Box64 userspace emulators if at all.
Okay
I wish society would just navigate towards ARM. The efficiency gains are *insane,* it would literally make laptops so much more useful, with insane performance and battery life, just look at how insane that M1 and M2 Macs are.
The Windows API previously only handled up to 64 logical processors (32 on 32-bit Windows); all apps on Windows used this API, so Microsoft created a new API (processor groups) that handles a lot more. But many people still use the old answer found on the web for how to get the number of cores in C++ on Windows. I ran into that problem the first time I used a game engine on a 192-core Threadripper at work.
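To show why the old answer undercounts: on 64-bit Windows, logical processors are organised into processor groups of at most 64, and group-unaware calls such as `GetSystemInfo` only report the calling thread's group, while `GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)` reports all of them. The Python sketch below only models that arithmetic. It assumes groups are filled to 64 in order, which is a simplification, since Windows also takes NUMA layout into account when forming groups.

```python
def split_into_groups(logical_processors: int, group_size: int = 64) -> list:
    """Split a processor count into groups of at most group_size (simplified)."""
    if logical_processors < 1 or group_size < 1:
        raise ValueError("counts must be positive")
    full, rest = divmod(logical_processors, group_size)
    return [group_size] * full + ([rest] if rest else [])


def legacy_visible(logical_processors: int, group_size: int = 64) -> int:
    """What a group-unaware query would see: only the first group."""
    return split_into_groups(logical_processors, group_size)[0]


if __name__ == "__main__":
    total = 192  # the machine size mentioned in the comment above
    print("groups:", split_into_groups(total))
    print("seen by a group-unaware app:", legacy_visible(total))
```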
I am excited to see ARM desktops like this. I've been wanting this for a long while.
If you're looking for good games for cross-platform benchmarks, I believe Minecraft (Java) runs natively on Windows on ARM now (it's run on the ARM Linux forever). Bedrock now runs on ChromeOS, so maybe we'll have Minecraft RTX running natively on Ampere before long.
You should be able to run any architecture of your choosing on this Arm Altra CPU, and you can do so transparently by hooking QEMU up to binfmt_misc; that should allow you to get Steam running on this Altra. I'm only familiar with how to do so on Gentoo Linux, but I imagine the concept is identical on other distros.
This! Was looking for this comment! :)
Yeah, I also searched to see if anyone had already posted it.
Some more information:
I have used it more from the x86 side, but it should also work fine from the other side.
The nice thing with binfmt is that it only emulates userspace; syscalls are handled natively by the kernel.
The only problem is getting all the x86 binaries.
For that you either need the magic `dpkg --add-architecture amd64`,
a chroot or LXC/Docker container,
or NixOS with `pkgs.crossPkgs.*`
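Before chasing x86 binaries, it can help to confirm that a QEMU handler is actually registered with binfmt_misc. Registered handlers show up as small text files under `/proc/sys/fs/binfmt_misc/`. The Python sketch below parses them; the helper names are made up for illustration, and the entry name `qemu-x86_64` in the example is an assumption, since the name depends on how your distro's packages register it.

```python
from pathlib import Path


def parse_binfmt_entry(text: str) -> dict:
    """Parse the contents of one /proc/sys/fs/binfmt_misc/<name> file."""
    lines = text.strip().splitlines()
    if not lines:
        return {"enabled": False}
    # The first line is "enabled" or "disabled".
    entry = {"enabled": lines[0].strip() == "enabled"}
    for line in lines[1:]:
        # Remaining lines look like "interpreter /path", "flags: F", "magic 7f45..."
        key, _, value = line.partition(" ")
        entry[key.rstrip(":")] = value.strip()
    return entry


def registered_handlers(root: str = "/proc/sys/fs/binfmt_misc") -> dict:
    """Return {name: parsed entry} for every registered handler, if any."""
    handlers = {}
    base = Path(root)
    if not base.is_dir():
        return handlers
    for path in base.iterdir():
        if path.name in ("register", "status"):
            continue  # control files, not handlers
        try:
            handlers[path.name] = parse_binfmt_entry(path.read_text())
        except OSError:
            continue
    return handlers


if __name__ == "__main__":
    found = registered_handlers()
    # "qemu-x86_64" is an assumed entry name; check the printed list on your system.
    print("registered:", sorted(found))
    print("x86_64 handler:", found.get("qemu-x86_64", "not registered"))
```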
Easier to just use Box64... And probably faster.
I recommend trying FEX or Box86/64 to run x86 games in the next video.
I really wondered why it was not shown in this video.
You're gonna have to come back to Cambridge to visit the Arm HQ!
Something that hasn't been talked about in the video (at least I thought so): Have you tried deploying Microsoft's OpenGL & OpenCL compatibility pack to the Windows on ARM system to see if that made any difference for apps that depend on OpenGL? I have tried this on an older GPU that only supported OpenGL 3.1 at most by default and had no issues with it getting overridden to GL 3.3 by the compatibility layer (albeit on an x86_64 machine).
No, didn't try that, but something else to add to the list :)
As someone who has been using Windows on ARM for a little over a year now (Device: Surface Pro X), I have found that it still doesn't run Roblox smoothly from the MS Store or the browser as well as a Mac does. Moreover, Roblox now requires the system to be 64-bit, and I can't access and play Roblox from the browser at all, except through the MS Store where it works. However, the overall user experience has been decent. I appreciate the fact that it can connect to LTE and has a long-lasting battery
Linux actually has a translation layer for x86 or any other arch: it's called QEMU. You can literally chroot into an Arm rootfs from an x86 computer or the other way around; you do need to set up binfmt.
Box64 has higher performance than QEMU.
@@larsradtke4097 yup, less target though, point was, linux can do it :)
I love the idea of ARM getting more widespread, along with RISC-V. Very nice videos! Thanks, Jeff!
Wow, whatever optimizations and tweaks you've made to your video making workflow have paid off. This video was *really* well done! Super excited for the 128 core upgrade livestream.
12:26 "Linux doesn't have an x86 translation layer"
It has three actually: FEX-Emu, Box86/64 and Qemu's "User Mode Emulation". You should try to get one of them running, you'll probably be able to play modern AAA games on ARM Linux using them with Wine/Proton.
Have you tried using FEX for the translation layer on Linux? It's supposed to be very good on the qualcomm boards.
You can bypass the login requirement by just putting in “a” for the email and password fields. I’ve done this more times than I can count at this point, no need to disable Wi-Fi or anything.
Would be interested to see Linux performance in QEMU or box64
Thank you so much for this video. I have been waiting for an update for over a month.
That’s not Crysis, that’s a PowerPoint presentation. 😂
Hi Jeff! I just wanted to let you know that I simply love your videos, you are awesome. That is all!
Nice to see ARM in the 'traditional' desktop form factor.
The carrier board for the CPU socket & RAM is something I hoped to see from the Apple Mac Pro. It's a shame that at a time when Apple is doing ARM on the desktop, it's still off in its own world and not helping ARM anywhere else.
Exactly, would've loved to see Apple find a way to at least give upgradeable RAM, or at least put the SoC on a carrier like COM-HPC or even something proprietary so it could be upgraded.
Unfortunately, Apple can only get the performance it does by tying everything together so it cannot be upgraded.
I'm glad to see you are doing better! These videos are great for info and inspiration.
I wonder what its DXVK performance is like, either via Box64 on Linux or via a DLL override on Windows.
"With great power comes more performance" would make a nice t-shirt.
It's nice to see an ARM system that is actually upgradeable and using standard components, like x86 desktop computers.
Maybe at some point we can get an actually like-for-like comparison between x86 and ARM in terms of performance and power consumption. Comparing to Apple's stuff is... well, _not_ apples-to-apples (heh), because their stuff is all integrated. Of course they'll have e.g. power consumption benefits just from that already. I want to know how ARM does compared to x86 when it's a socketed CPU with "external" RAM and storage and GPU and so on on both sides.
The integrated aspect of the M series Macs actually improves the performance for those chips.
ARM is incredibly memory hungry (bandwidth and maybe size) vs x86. For x86, slower memory doesn't hurt the CPUs that much (cache layout, complex instruction sets).
The IPC of the ARM CPU above is incredibly bad even when you take native compiled games vs x86.
The M2 chips are like several decades ahead.
Like the guy above me said, integrated RAM or an SoC / SoM benefits performance greatly, which is why the Apple M1 / M2 perform so well, but it comes with a pricey chip that can't be upgraded without replacing the whole thing.
6:30 Yes, Steam had a major update a couple days back, addressing the slow UI was one of the improvements for all platforms.
Yay, I got lucky timing!
Thanks, I'm watching your progress with great interest. My comment is, being someone who abandoned Windows in 2001, I say who cares how it performs, but your performance review (I skipped a lot in the WinDoze review) just proves my opinion. The only OS for machines like this is Linux. It's even getting mature enough to give Apple a possible competitor, though lacking many applications that keep me using macOS. When I say Linux, I mean Unix, so macOS, Linux, implementations of Unix. It'll be interesting to see the 128 core.
Arm's getting there and I'm so excited! We're years off, of course, but at least now it also looks like we're years in as well!
Finally, the ARM gaming video the world has waited for! Thanks Jeff!
It's been there for more than a year... check out the Mi Pad and other Android devices that have been running Windows games on Windows 11 Arm for the last year.
Some people might wonder why the performance in Crysis was so bad. Crysis is single-threaded. Of the 96 available cores, Crysis was only using 1 of them.
I remember back in the day a lot of people bought a quad-core upgrade hoping to play Crysis with higher performance than their dual-core systems, and they were thoroughly disappointed to find the quad-core performed worse than the dual-core due to the quad-core having a lower clockspeed. The singlethreadedness of Crysis was THAT intense.
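For anyone who wants to put numbers on the single-threaded point above, here is a minimal sketch using Amdahl's law. The parallel fraction (10%) and the clock speeds are made-up illustrative values, not measurements of Crysis or of any real CPU.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work scales across cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_performance(clock_ghz: float, parallel_fraction: float, cores: int) -> float:
    """Very rough model: clock speed scaled by the Amdahl speedup (ignores IPC)."""
    return clock_ghz * amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical workload where only 10% of the frame time can run in parallel.
    fraction = 0.10
    for cores in (1, 2, 4, 96):
        print(f"{cores:>2} cores: {amdahl_speedup(fraction, cores):.2f}x")
    # Hypothetical chips: a 3.0 GHz dual-core vs. a 2.4 GHz quad-core.
    print("dual-core:", round(relative_performance(3.0, fraction, 2), 2))
    print("quad-core:", round(relative_performance(2.4, fraction, 4), 2))
```

Under those assumptions the extra cores barely help, and the higher-clocked dual-core comes out ahead of the quad-core, which matches the experience described above.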
I've heard that there are some good (still being worked on) translation layers for Linux, box64 for example
I took a quick stab at getting that up and running but had a couple issues. I may take more time to test it later!
@@JeffGeerling FEX should also be pretty good :)
@@erikreider yeah, that's the one I couldn't remember the name for
Looking forward to similar videos for RISC-V.
@1:00 - Crysis likely runs badly because it's so heavily single threaded (at least assuming it's the original version you tried!), so even with a translation layer that can spread load somewhat, it's probably bottlenecking severely. The remaster is also kind of jank (and probably has similar issues) because it still has aspects of CryEngine and is based upon console ports for some reason; but that's another topic!
Yeah, and the Ampere Altra Max's single thread performance is only on par with older desktop CPUs, so that makes sense. I have yet to see if the new AmpereOne CPUs have better single-threaded performance. The Apple M-series chips really shine there.
I've heard that arm is way more power efficient than x86 so if that's the case then we should be switching over to it and the fact that Microsoft won't do that is really baffling.
Running a VM on the Oracle Cloud with an Ampere CPU, great performance.
And (though I'm no fan of Oracle), they just certified Oracle's DB software for running on Ampere.
Wow, great job! Do you have a guide? I tried to run it using QEMU but nothing boots.
Hi, Jeff.
I tested a research project on the 80-core chip and discovered that atomic operation throughput tends to drop dramatically when exceeding 40 threads under high-contention workloads. I speculate that might have something to do with the core-to-core interconnect.
I think that might be the reason why the performance does not look good for the 96-core compared to the 60-core.
Will likely need to upgrade to a board that supports all 8 memory channels to get a teraflop unless the measurement is in pure Level 1/2/x cache. Also if you are only using single rank sticks of ram that will limit performance as well. 2 ranks of 4Gb or 8Gb density chips per memory channel is typically the best performance for DDR4.
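To give a feel for why the channel count matters, here is a rough sketch of the theoretical peak DDR bandwidth calculation (transfers per second × bytes per transfer × channels). The DDR4-3200 speed is an assumption for illustration; the actual memory speed and the number of populated channels on this board are not stated above, and real-world throughput is lower than the theoretical peak.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: int, channels: int, bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    if transfer_rate_mt_s <= 0 or channels <= 0 or bus_width_bits <= 0:
        raise ValueError("all arguments must be positive")
    bytes_per_transfer = bus_width_bits // 8  # a standard DDR4 channel is 64 bits wide
    bytes_per_second = transfer_rate_mt_s * 1_000_000 * bytes_per_transfer * channels
    return bytes_per_second / 1e9


if __name__ == "__main__":
    # Assumed DDR4-3200 (3200 MT/s); swap in the real speed of your DIMMs.
    for channels in (1, 2, 4, 6, 8):
        print(f"{channels} channel(s): {peak_bandwidth_gb_s(3200, channels):.1f} GB/s")
```

With those assumed numbers, each extra populated channel adds about 25.6 GB/s of peak bandwidth, which is the ceiling for any benchmark that streams data from RAM rather than cache.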
The best thing about the ARM architecture becoming so popular, IMO, is that it forces low-level software to become more easily compiled to other architectures. It's like how in the early days, you built a game for a single PC platform and then had to individually port it to every other platform. Then publishers wanted to easily sell games on consoles and PC from launch so game engines evolved to support more than one platform (usually favouring consoles, with wonky results on PC - see Mass Effect 1). Nowadays game engines have generally excellent abstraction layers that make it trivial to produce games for virtually any platform. The same thing just needs to happen for CPU architectures. Which, it sort of already has in many ways, especially for usermode software, but it's still far from trivial to get OS kernels and drivers compiled for new architectures.
There are technically options for translation on Linux. Rosetta, for example, can be run on Linux (Apple provides a binary for VMs and Asahi), although I don't know if it would actually run on anything other than M1/M2. Additionally, there is the QEMU binfmt_misc layer, and I'd be very curious how well that would run with that many cores.
Yeah, Rosetta would be amazing, but AFAICT it has a lot of M-series specific instructions that would not work on other Arm chips.
Although now I wonder if you could run Rosetta on some sort of emulation layer itself...
@@JeffGeerling If I remember correctly I don't think it does, not in the linux version. But don't quote me on that. Even then, I have no idea whether it works with different page sizes though. I guess the only way to find out is for you to try it. Or for you to ask Hector Martin.
Some of those older games, like the one that loads and goes to a black screen, can sometimes be fixed by launching at a lower resolution: right-click on the application in the file manager, open Properties, and then in the compatibility settings switch to the lower resolution option that pops up. But that's on a regular PC. I have no idea how it would be on that Dev PC.
I wish the dev kit didn't cost such an insane amount
It's not a bad price for high-end workstation or server parts, but it is definitely not cheap, either! I hope Ampere is around for a long time and can slowly roll out more on the desktop side, so we can see more Arm options there.
There's more platforms coming from ODMs that would suit non data centre uses.
Have you heard of Box64? I haven't actually used it, but it's a compatibility layer for running x86_64 software on Arm, like Rosetta or WOW64. They do specify that Steam works on it; maybe you could try it sometime.
Can't believe you have not tried running Minecraft! Minecraft runs on nearly every 64-bit OpenGL 3+ capable machine. The only obstacle is Minecraft's launcher being amd64 exclusive, but third-party launchers help here.
I hadn't tried that, but it's on my list now.
Yes!!!!!! This is exactly what I wanted to see!! Running windows on a machine like that is my favorite kind of mad science!!!
And maybe in the future Valve would consider a SteamOS with Arm support.
Maybe something like Box64 would work well as a translation layer? Even if it doesn't, qemu's user mode emulation should do the trick.
At about 12:30 you mentioned x86 and amd64 compatibility issues for Steam. Have you looked into Box86 and Box64 to see if they resolve your issues? I've had minimal success using them on an RPi 4 and an Orange Pi 5 to run Steam. I'd love to see what you're able to do with those programs on a machine like this.
This tells us one thing: unlike x86 CPUs, it seems ARM needs a very strong optimization platform to reach its full potential. Also, I think the approach might need to be different. I bought the Pinebook Pro, an ARM-based Linux laptop, before Apple went to ARM for the Mac, and then I picked up an M1 Max MacBook Pro 16 inch. The Pinebook is an all-in-one SoC as well. There were leaks of Apple trying out socketed RAM and external GPUs with their ARM designs in the Mac Pro, but ultimately they didn't go that route. It may be that the CPU needs the RAM and GPU to be closer for the architecture to rival the brute force method of an x86 chip. Just my speculations about ARM. But I can say when ARM runs well it is a super stable and nice experience on Linux or macOS.
x86 and ARM pretty much represent two ends of the spectrum of CPU design philosophies: CISC and RISC respectively. CISC provides you with a large library of features built into the CPU, at the cost of poor efficiency because many features go unused but still need power and space. RISC provides you with a simple, minimal but extremely flexible built-in set of features, and expect you to build the rest of the features within software as you go.
However, with modern compilers and improved automation tools, optimisation isn't nearly as much of a problem today as it was 30 years ago. The only reason x86 feels easier to optimise now is mostly that it benefits from 30 years of consumer-oriented optimisation, whereas ARM has until recently been mostly focused on industrial or mobile optimisation.
The real issue is that the market for ARM is very fragmented. Because the ARM company allows you to make whatever customisations you desire to the chip, for a fee, not every ARM chip is guaranteed to behave exactly the same. Applications that stick purely within the CPU will work fine, but anything that interacts with external hardware e.g. graphics may run up against unexpected quirks of specific models or manufacturers.
Hi Jeff, I really enjoy your channel. It's been a nice distraction since my mom died earlier this year. I'm glad that you are doing better since your operation in December. I look forward to watching lots of your videos. Have a wonderful day. Jeff
I am doing much better now! The operation was a total success, and I'm so glad I did it.
"Great power comes with more performance"
-Jeff Geerling
Had to give a like (even though I would anyway) for the Rocky Linux shirt!!! Love the support for Rocky!
I think if they continue work on x86 emulation and arm CPUs, I can easily see this being a really good idea
If it’s hardware supported, then it’ll finally work properly
@Jeff Geerling: Can you *please* elaborate on your statement "Since Linux doesn't have an x86 translation layer" at 12:28? Because `binfmt-support` (along with `qemu-user-static` and maybe other dependencies) is supposed to address this and has been around for a couple of moons by now. Try "debootstrap --arch=amd64 jammy " with and without the aforementioned packages in place on your arm64 host. Works like a charm (you want to install `schroot` to enter said environment) and is actually well-documented.
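For anyone wanting to check whether those handlers are actually registered before trying the chroot, here is a small sketch that reads the kernel's binfmt_misc directory. It assumes binfmt_misc is mounted at the usual `/proc/sys/fs/binfmt_misc`; the handler names you see (for example `qemu-x86_64`) depend on your distro's packages, and the sample text in the comments is only there to show the layout of an entry.

```python
from pathlib import Path

# Usual mount point for binfmt_misc on Linux (assumed; adjust if yours differs).
BINFMT_DIR = Path("/proc/sys/fs/binfmt_misc")


def parse_binfmt_entry(text: str) -> dict:
    """Parse one binfmt_misc entry, e.g.:

    enabled
    interpreter /usr/bin/qemu-x86_64-static
    flags: F
    offset 0
    magic 7f454c46...
    """
    lines = text.strip().splitlines()
    info = {"enabled": bool(lines) and lines[0].strip() == "enabled"}
    for line in lines[1:]:
        key, _, value = line.partition(" ")
        info[key.rstrip(":")] = value.strip()
    return info


def list_handlers(binfmt_dir: Path = BINFMT_DIR) -> dict:
    """Return {handler_name: parsed_entry} for every registered handler."""
    handlers = {}
    if not binfmt_dir.is_dir():
        return handlers  # not Linux, or binfmt_misc is not mounted
    for entry in sorted(binfmt_dir.iterdir()):
        if entry.name in ("register", "status"):
            continue  # control files, not handlers
        try:
            handlers[entry.name] = parse_binfmt_entry(entry.read_text())
        except OSError:
            continue  # unreadable entry; skip it
    return handlers


if __name__ == "__main__":
    for name, info in list_handlers().items():
        state = "enabled" if info["enabled"] else "disabled"
        print(f"{name}: {state}, interpreter={info.get('interpreter', '?')}")
```

If nothing for the foreign architecture shows up in the output, the emulator packages are probably not installed or not registered yet.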
Have you tried QEMU userspace emulation on Linux? That should bridge the architecture gap much like on macOS and Windows. Specifically for @ 12:16.
Just FYI Jeff, the ASPEED GPU would be the reason some of the games would not load. The ASPEED GPU is designed to give you basic graphics output, intended for debug use. The ASPEED chip is an SoC, meaning it's a computer used to start up and manage the system before anything else loads.
If you can, try using a discrete GPU in there. I think you will find games will run a lot better :)
If it ain't broke, you aren't done yet ! Great efforts.
I hope one day I'll be able to use an arm desktop as my next main (linux) machine
Nvidia's current Linux graphics drivers have an issue where if you run a game full screen it gets Vsync'ed to the monitor without any ability to override that.
If you had run Dhewm windowed, it would've uncapped the FPS.
Nvidia released a statement that there will be a fix for this with the next driver release.
Could you make a follow-up video testing games on Linux using Box86, Box64 and FEX? Asahi Lina made a video a few months ago testing some games using these on M1 MacBooks with their video drivers and managed to get Steam and quite a few games running. I think it would be interesting to compare these 2 setups.
I agree. I also don't know whether Windows translation or Box86/Box64 is further along in terms of compatibility and performance.
Well, Maxon listened. Cinebench 2024 now has a Windows Arm version that released today.
Oh nice! Will check it out.
1995 Microsoft: Internet? No, use MSN and a modem.
2023 Microsoft: you must use MSN to log in to your machine.
Yes, it’s still just MSN, with Passport, rebranded. They’ve been shoving this crap down our throats for 25 years.
@@The_Funguseater thanks for your permission 🙏
1995 was 28 years ago
So for almost 30 years they've been trying to shove this down our throats…
@@The_Funguseater heh everyone is free to use Linux because it’s free
Microsoft thinks you’ve seen nothing yet.
“The presentation has been revealed as part of the ongoing FTC v. Microsoft hearing, as it includes Microsoft’s overall gaming strategy and how that relates to other parts of the company’s businesses. Moving ‘Windows 11 increasingly to the cloud’ is identified as a long-term opportunity in Microsoft’s ‘Modern Life’ consumer space.”
“So, what if Microsoft extended the capabilities it currently bills to businesses on a per-user, per-month basis to general consumers? That is precisely what the Redmond, Washington-based company is envisioning, according to internal documents made public thanks to the FTC vs. Microsoft hearing currently taking place.”
@@fila1445 was being as generous as I could, but the last good product out of Redmond was BASIC-80.
Yup, that's right. That processor in your thumbnail is my current processor in my PC.
It's an older CPU, sir, but it checks out!
I absolutely can see Valve moving over to ARM in the future. If they were to pull an Apple and develop their own silicon they wouldn't have to rely on the next great thing from AMD as a lot of mini system builders seem to be doing right now. This could increase their battery life from 4 to 8 hours and be the ultimate handheld gaming which they seem to be going for lately. If I were a betting man I would say Steam Deck 4 will be ARM based.
Memory channels give a great boost.
It's ridiculous how many times you'll get a pre-built PC with 2 memory sticks in a 4 stick, dual channel machine, but they installed both sticks in the same channel. I've gotten some of my friends boosts of up to 15% by just moving a memory stick over a slot. If anyone wants to check themselves, I'd consult the motherboard manual first to make sure, but memory should typically be inserted in alternating slots, like first stick in slot 1, second in slot 3 etc.
I hope to see gaming more mainstream on Linux.
9:04 this is because threads require a lot of memory and I am sure 96 cores will love the extra RAM
I wonder if it would be possible to Hackintosh this. I suspect there will be a *TON* of challenges in doing so, but maybe given time.
I can’t wait for the day it’s possible
There is QEMU userland emulation which in combination with linux binfmt_misc basically is a transparent emulation layer for almost any architecture.
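As a rough illustration of how that transparent dispatch can work: binfmt_misc handlers are typically registered with a magic-byte pattern that matches the start of an ELF file, including the `e_machine` field that names the target architecture. The sketch below reads that field from the first 20 bytes of a binary. The small name table is only a subset chosen for this example, and the function name is made up.

```python
import struct
import sys

# Subset of ELF e_machine values (from the ELF specification), for illustration only.
EM_NAMES = {3: "x86", 40: "arm", 62: "x86_64", 183: "aarch64", 243: "riscv"}


def elf_machine(header: bytes) -> str:
    """Return the architecture name encoded in an ELF header (first 20 bytes suffice)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    fmt = "<H" if header[5] == 1 else ">H"
    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes) and e_type (2 bytes).
    (machine,) = struct.unpack_from(fmt, header, 18)
    return EM_NAMES.get(machine, f"unknown ({machine})")


if __name__ == "__main__":
    # Inspect the given file, or the running Python interpreter by default.
    path = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    try:
        with open(path, "rb") as f:
            print(path, "->", elf_machine(f.read(20)))
    except (OSError, ValueError) as exc:
        print(f"could not inspect {path}: {exc}")
```

When the kernel sees a binary whose header matches a registered pattern for a foreign architecture, it hands the file to the registered interpreter (for example a QEMU user-mode binary) instead of failing with an exec format error.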
I'm just happy with Windows on my Raspberry Pi 4B
Yeah, Windows on Raspberry has been pretty cool to see, and it's conceptually similar, though the best thing about the Ampere setup is the normal installer can be run without modification since the system has UEFI and doesn't need special boot stuff in the image.
A bad Windows design rant is always welcome. Perhaps if EVERYBODY tells them it is stupid, they may one day realize it's stupid. We hope...... God we hope. Thanks Jeff, great and informative video.
What if they realize it's stupid already, but they choose to do it anyway? 🤔
Given M$'s track record, they'll do whatever they want...
This is why they say, Windows is a great OS--if your time is worth nothing.
Really excited to see the future of ARM. It cannot be overstated how much of an impact in the medium term Apple moving to the ARM architecture will make.
I wanna see box64 on this
Interesting video. Work in progress from the look of it.
Yep, still plenty more to test!
I've never been this early for a video
Can't say that anymore!
ahem, @JeffGeerling there IS a translation layer for linux, it's called QEMU!
Or rather, QEMU User Mode, specifically. It doesn't do any of the fancy VM stuff; it merely translates the guest code and syscalls between architectures, so you can run Arm on x86, or the other way around, or whatever. Of course, it isn't perfect either, but y'know, it's pretty good. There is also jart's blink, which is also rather worthy of checking out: "Blink goes 2x faster than qemu-x86_64 on some benchmarks, such as SSE integer / floating point math. Blink is also much faster at running ephemeral programs such as compilers." claims the readme, and I believe it
I'm pretty surprised tbh that you don't know about our lord and saviour, QEMU user-mode. It's the reason WHY the full-system binary is called qemu-*system*-x86_64: the user-mode one already uses the name qemu-x86_64
blink however, is a much newer project
I do, but I had trouble with Box64 and put a pin in trying that out for now; what I mean is there's nothing built into the distro (Ubuntu in this case) like WOW64 or Rosetta 2.
Unfortunately Apple still blows the competition away when it comes to single core performance and that’s very important for regular desktop use.
Did you try Box64 / Box86 for emulating the architecture? I don't know the current state, but it might be worth a try.
The thing that peeves me with Windows on ARM has honestly never been Windows itself, the emulation is great except that .NET apps and whatnot can't help but snitch about it being an ARM64. So stuff like EasyAnticheat fails for no reason because Epic cba to fix their junk (so no VRChat on ARM64 :( )
I also have no idea why Microsoft hasn't sat down and written their own USB CDC drivers for all the Arduinos out there; that's the other thing that gets iffy for me.
I subscribed for this exact video and it's finally here!
But! Does it play Crysis?
Loving all the little details in the video. Very well done Jeff.