"19.5, 67.0, 45.0"
"That's numberwang!"
Aha haha best comment. Pinned
Seems incredibly cool. But I'd need proof it actually works first
Sounds like "trust me bro" performance
With some of these modern efficiency-optimized server CPUs being crammed full of those little e-cores, I don't really see the value proposition for this thing. They claim they'll have some super advanced branch prediction system that will drastically improve throughput, but that seems like a very tough problem to solve.
And if there's enough money in HPC, it wouldn't surprise me to see Nvidia splitting their high-end GPUs into low-precision-optimized and high-precision-optimized product lines. They have the budget for it.
I remember Jim Keller saying CPUs are mostly about branch prediction nowadays. This seems like branch prediction on steroids. EXCITING!!
This sounds fancy but in practice is unlikely to work. HPC codes need dumb vector processors with the FP64 vector compute throughput and the memory bandwidth and capacity to back it up. They don't need fancy dynamic branch prediction. HPC codes don't use branching for the most part, so there is nothing that smart branch prediction can even optimize. The telemetry collection for this branch prediction will probably even slow it down.
If the thing doesn't support OpenCL/SYCL and instead needs recompiling, it is basically DOA. Recompiling for special hardware never goes smoothly; there is always some detail that doesn't work and needs extra debugging, and developers don't have the time or the money to adjust their codes for another proprietary chip that does things differently than the industry standard. See Intel Knights Corner and how that worked out...
You are right if it is only applied to standard, massively parallel tasks, like training the weights and biases of a static feed-forward structure / neural net. However, these architectures are popular to a large extent because of the availability of this kind of uniform hardware. There are many tasks that a) are performed on the CPU today but could be faster (there are examples in the video's description), and b) approaches that are not popular because they are slow. In these domains, NextSilicon can have an impact.
HPC does use branching in scientific applications; it is limited by GPUs being vector-only, and IBM Power is still around for good reason. But you are right about the work in porting -- and that also counts against GPUs when applications don't lend themselves to floating point; there are surely a few in meteorology and climate science.
And even with the branches: in decent code you are already reaching the hardware limits: either you fully saturate the ALUs or the memory bandwidth.
(Or, in bad cases like we had today, networking... 250 workers can put some load on the system.)
There is one important number missing!
And it means a lot to me: 42.
I find it hard to believe that this is true. It's like the theoretical advantages of the Java VM over AoT-compiled code. They never materialized.
HPC is not a small market; if they can get a good chunk of it, it's a good niche.
This looks really interesting in combination with Mojo. The big problem in HPC is getting the right ALU in the right configuration. I'm curious how they get that much performance out of branch prediction. In the optimization problems I know, we had to explore a search space until we hit a sufficiently low error, for example. The branches were: run this stuff, and occasionally check if it's sufficient. From my understanding, utilizing the ALUs correctly by compiling certain functions like matrix multiply with the hardware in mind would be much more effective, as it cuts down the serial path you are parallelizing.
The combination of AI and HPC will be an interesting market. An engineer can explore a vastly bigger search space if you train AI to do textbook engineering and give it the right amount of HPC. This kind of workflow is already done in fluid dynamics for vehicles and buildings and has been used to optimize combustion in ICEs. I'm also pretty sure turbine manufacturers use it as well.
IBM tried to innovate in that space with variable precision. The idea was to brute-force the search space in low resolution, sort the results, and recompute the areas of interest with higher precision. But I don't think it got adopted that well. Probably just too complex to handle.
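To make the coarse-then-refine idea concrete, here is a toy C++ sketch of my own (not IBM's actual scheme): sweep the search space cheaply in float, keep the most promising candidates, then re-evaluate a small neighbourhood around each one in double.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy objective; the double version stands in for the "expensive, accurate" pass.
static float  f_coarse(float x) { return std::sin(x) + 0.05f * x * x; }
static double f_fine(double x)  { return std::sin(x) + 0.05  * x * x; }

int main() {
    const int    kCoarseSamples = 10000;
    const double kLo = -10.0, kHi = 10.0;

    // Pass 1: cheap low-precision sweep of the whole search space.
    std::vector<std::pair<float, double>> coarse;  // (value, x)
    coarse.reserve(kCoarseSamples);
    for (int i = 0; i < kCoarseSamples; ++i) {
        double x = kLo + (kHi - kLo) * i / (kCoarseSamples - 1);
        coarse.emplace_back(f_coarse(static_cast<float>(x)), x);
    }

    // Keep the 5 most promising coarse candidates (smallest value first).
    std::partial_sort(coarse.begin(), coarse.begin() + 5, coarse.end());

    // Pass 2: re-evaluate a fine grid around each candidate in double.
    double best_x = coarse[0].second, best_v = f_fine(best_x);
    for (int c = 0; c < 5; ++c) {
        for (int j = -50; j <= 50; ++j) {
            double x = coarse[c].second + j * 1e-4;
            double v = f_fine(x);
            if (v < best_v) { best_v = v; best_x = x; }
        }
    }
    std::printf("minimum ~ f(%.6f) = %.6f\n", best_x, best_v);
}
```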
The SIMD instructions on my general purpose (AMD) CPU give good baby-step giant-step performance, but I haven't compared with the latest Power ISA to really say. Stuff like that which is heavy on the branching doesn't lend itself well to GPUs. (edit: %s/prediction/branching)
Well, it depends on how you branch. I played around with fixed-step Euler and Runge-Kutta on my GPU, and if you run the same instructions on a large enough dataset, they fit really well. The same is true for back-propagation-based algorithms. But it gets tricky when you have something like ODE45, which has variable step size and conditional branching. For those, a few dozen cores that combine the benefits of different initial conditions and variable step size/branching would be best.
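For anyone curious what that difference looks like in code, here is a minimal scalar sketch (a toy of my own, not tied to any particular hardware): fixed-step Euler does identical work every iteration, so thousands of instances stay in lockstep, while an adaptive scheme accepts or rejects steps based on an error estimate, so instances drift apart and diverge.

```cpp
#include <cmath>
#include <cstdio>

// dy/dt = -y, exact solution y = y0 * exp(-t).
static double f(double /*t*/, double y) { return -y; }

// Fixed-step Euler: identical control flow for every instance -> GPU-friendly.
double euler_fixed(double y, double t_end, double h) {
    for (double t = 0.0; t < t_end; t += h) y += h * f(t, y);
    return y;
}

// Toy adaptive step (step doubling): the accept/reject branch and the
// data-dependent step size are what make instances diverge on SIMD hardware.
double euler_adaptive(double y, double t_end, double tol) {
    double t = 0.0, h = 1e-2;
    while (t < t_end) {
        double big   = y + h * f(t, y);                        // one step of h
        double half  = y + 0.5 * h * f(t, y);
        double small = half + 0.5 * h * f(t + 0.5 * h, half);  // two steps of h/2
        double err   = std::fabs(big - small);
        if (err < tol) { y = small; t += h; h *= 1.5; }        // accept, grow step
        else           { h *= 0.5; }                           // reject, shrink step
        if (t + h > t_end) h = t_end - t;                      // don't overshoot
    }
    return y;
}

int main() {
    std::printf("fixed:    %.6f\n", euler_fixed(1.0, 5.0, 1e-3));
    std::printf("adaptive: %.6f\n", euler_adaptive(1.0, 5.0, 1e-6));
    std::printf("exact:    %.6f\n", std::exp(-5.0));
}
```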
I'm all for divergence-busting techniques, even if quite a lot of HPC workloads don't absolutely need them. I suspect the bigger challenge with general vector compute is around memory access, Ian mentioned this briefly but it looks worth digging into.
So it's basically a super wide out-of-order execution pipeline with speculative execution on steroids. So instead of, like, 4 FPUs per core they have, say, 256, and a massive execution buffer (I expect the HBM is for that) to keep them fed.
Their ISA is probably designed for OoO, which is why the recompile is enough.
The branch predictor simply learns and works farther ahead than any current CPU.
Or that's what I'm getting from this.
No ISA, it's a dataflow
The latest versions of the .NET runtime profile the code and re-JIT it in second and third tiers to tune hot paths and emit better assembly instructions.
GPUs are only "fixed" within one dispatch / one draw call. However, many practical compute problems/3D scenes are split into thousands if not millions of dispatches / draw calls, and CPU-side code can adjust the size of dispatches / length of draw calls at runtime.
BTW, on Windows it is critical to control the count of in-flight compute thread groups / draw calls, because the OS insists GPUs should stay responsive at all times even when loaded, and has the timeout detection and recovery (TDR) feature in the OS kernel to enforce the policy.
To be fair, until work graphs arrived in D3D12, said flexibility was tricky to implement. D3D11 supports indirect dispatches / draw calls, queries to track completion of things, other queries to measure time spent computing / rendering things, but developers need to build their custom pipelines on top of these primitives.
How is it different from traditional branch prediction and speculative execution?
Both end up going towards a fixed compute array. Here the size of the compute array changes given the workflow.
@@TechTechPotato Thank you Dr Potato
@@TechTechPotato and they do that without being an FPGA?
Guess they'll have fixed silicon with various paths/pipelines/SE & BP widths. Then the compiler wraps in some path-optimiser tables that cause registers to get used in a programmatic way, attached to the various pipelines.
So at some point the path lookups get evaluated and compete for optimal use.
Feels like it doesn't work for modelling where 'every solution' gets calculated. Or anything like Monte Carlo.
Also, devs at the Department of Energy know how to optimise, or they are using machine learning to assist in optimising code.
Makes sense to let Nvidia GPUs run the code optimisers.
@@quantumbacon thank you!
As an older gamer, you always surprise me with how little I know about things outside of gaming. Thanks very much for the video.
IF runtime adaptive acceleration really works (big if), is it fair to think that it could jump big parts of the CUDA moat?
When I have a set of ALUs and some branchy code, the proportion of ALUs running the rare path automatically goes up as the rare path gets more common. This gets more complicated with vector units, which suffer a large slowdown when a small number of lanes follow a rare path, but if that's what is being addressed here then the message is being lost/oversimplified. 😢 On the other hand, data flow programming is awesome, so nice to hear something which sounds like that 🎉
It's something we'll go into in time as we dive deeper, for sure
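To put rough numbers on the divergence point a couple of comments up, here is a toy cost model (the width, probabilities and costs are all made-up assumptions, nothing vendor-specific): a scalar core pays for the slow path only when it is taken, while a 32-lane vector with masking effectively pays for both paths whenever any lane diverges.

```cpp
#include <cstdio>
#include <random>

int main() {
    const int    n = 1'000'000;
    const int    W = 32;            // assumed vector width (lanes)
    const double p = 0.01;          // probability an element takes the rare path
    const double fast_cost = 1.0, slow_cost = 20.0;

    std::mt19937 rng(42);
    std::bernoulli_distribution rare(p);

    double scalar = 0.0, simd = 0.0;
    for (int base = 0; base < n; base += W) {
        bool any_rare = false;
        for (int lane = 0; lane < W; ++lane) {
            bool r = rare(rng);
            scalar += r ? slow_cost : fast_cost;   // scalar pays per element
            any_rare |= r;
        }
        // With masking/predication the whole vector runs both paths if any lane diverges.
        simd += W * (any_rare ? fast_cost + slow_cost : fast_cost);
    }
    std::printf("scalar cost: %.0f\n", scalar);
    std::printf("simd   cost: %.0f  (%.2fx)\n", simd, simd / scalar);
}
```

With these assumed numbers a 1% rare path already costs the vector unit several times more than the scalar core, which is the "small number of lanes" slowdown described above.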
Intel Xe has good 64b performance; they purposefully favored HPC rather than AI training, and it is used in Argonne's Aurora computer.
Xe2 ("Battlemage" when in the Arc graphics form factor) should be even better with 64b int support. Intel oneAPI already offers write-once-compile-everywhere SYCL/C++, across vendors and device types (AdaptiveCpp is another SYCL compiler alternative to oneAPI).
Nvidia double precision fell off a cliff years ago (like 10% of FP32 these days, basically just software support rather than native 64b register size); my old 2012 AMD W7000 GCN 1.0 had DP speed exactly half of SP speed due to 64b registers that were split for 32b.
I'm curious why ARC is FPGA, at least at driver level to "make" it work as a GPU
Nvidia has 64-bit performance that is 1/4 that of 32-bit on their fattest chips. The ones that make it into GeForce cards do not. GA100 had full 64b support; GA102 did not.
Goes to show the cost of branched compute tasks on GPUs. For baby-step giant-step I find better performance on Ryzen than can be extracted from any Radeon.
Blackwell cut a lot on FP64 since their focus is now on AI.
@@rightwingsafetysquad9872 Ah, yes, the x100 chips are actually at the expected 2:1 for physical 64-bit, but all the others (even in the enterprise line) have something near a 40:1 drop-off for double precision, at least since Pascal.
(I'm only referring to compute speed inside the GPU without memory bottleneck considerations)
1:10 No amount of bits can please the real hardcore people. Check out arbitrary precision maths 😂 (Hunt for primes and pi decimals in particular)
Yep, it's literally unavoidable, although you can use integers in place of large floating-point numbers; you just have to adjust the formulas.
It's true I need a 160 bit integer math processor, I don't even care about this floating point stuff, I'm not trying to make a bad poetry machine
@@foobarf8766 what kinda math are you doing where you’re not ever going to get decimals? Or even just _use_ decimals? Prime number hunting the slow way?
(Especially b/c I’m pretty sure that prime number hunting takes advantage of AVX floats.)
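For what it's worth, the "integers instead of large floats" trick mentioned a few comments up is basically fixed-point arithmetic. A minimal Q32.32 sketch (a toy format of my own choosing, relying on the GCC/Clang __int128 extension for the multiply):

```cpp
#include <cstdint>
#include <cstdio>

// Toy Q32.32 fixed point: value = raw / 2^32. Add/sub are plain integer ops;
// multiply needs a 128-bit intermediate and a shift to re-scale.
using q32_32 = int64_t;
constexpr int FRAC_BITS = 32;

q32_32 from_double(double v) { return static_cast<q32_32>(v * (1ULL << FRAC_BITS)); }
double to_double(q32_32 v)   { return static_cast<double>(v) / (1ULL << FRAC_BITS); }

q32_32 fx_mul(q32_32 a, q32_32 b) {
    // __int128 is a GCC/Clang extension; other compilers need a wide-multiply intrinsic.
    return static_cast<q32_32>((static_cast<__int128>(a) * b) >> FRAC_BITS);
}

int main() {
    q32_32 a = from_double(3.25), b = from_double(-1.5);
    std::printf("3.25 * -1.5 = %f\n", to_double(fx_mul(a, b)));
    std::printf("3.25 + -1.5 = %f\n", to_double(a + b));
}
```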
Your video about all-big-core smartphones cleared up how I think about Apple CPUs now. I just disregard the little cores. So the M4 is fast, but with only 3-4 big cores, you might consider the M4 Pro for any CPU-intensive workflows, and the M4 Max is not much better, so only buy that if you need more GPU than the M4 Pro, and expect a slight battery life hit. Thanks Dr. Cutress!
I'm seeing a lot of mentions of branch prediction, but this seems to be a misunderstanding (or am I the one misunderstanding?)
My reading is that the code "flow" is analyzed, and this is used to allocate more compute "width" in that area, whereas branch prediction is about guessing what comes next at a branch so you don't starve your pipeline. Branch prediction may be an important part of this chip, but it isn't what's being showcased.
That turns the computation problem into VLIW. Maybe the chip is dynamically assigning how many execution units go to each branch? So less-taken branches get 4 units while the most common code path gets 16. I can see this working somewhat. But instruction retiring and the branch misprediction penalty are going to be nuts. And writing a compiler for this sounds like a yucky problem (assuming they use some annotation in the ISA).
I still find it hard to solve FP64 conversion with the mantissa or exponent part when it's a negative number. I remember Pascal had much better FP64 performance than Maxwell, and my friend's 1050 Ti was way faster than a GTX 980 in peak CUDA workloads. Great to see this approach.
Nice to see that FP64 is not being ignored entirely these days 😅 Our NWP models are increasingly mixed FP32/FP64 precision, but a large part of the code will always just need many FP64 flops.
Does this only support OMP target, or is there something similar to CUDA/HIP to program this? I'm wondering if it's worth it to port GPU codes to this (ones that use CUDA/HIP and not OMP target) that are mainly memory-bandwidth constrained. Or is this more intended for codes that are CPU-only?
Sounds like branch prediction at a larger scale, reorganizing the placement of code on chip to optimize data flow. They should be able to measure the performance per Watt boost on open source science codes, so I'd expect it works pretty well if they've done that. It'll depend on the code though. Interesting.
Would be interesting to see how this could integrate with MLIR (the LLVM magic that Mojo uses). I also wonder if, with sufficient support, this could be well suited to accelerating functional programming languages without the traditional FP penalty when translating to something that will run on traditional hardware. That might make some HPC-using mathematicians *very* happy.
I remember the Fairchild multi chip module CPUs in the 1980’s.
Intel promised that Itanium just needed a better compiler. How'd that work out?
They positioned that against IBM Power, which was like 20 years of compiler work ahead, so not bad considering? But OpenCL is a thing now, so maybe this has a better chance?
Itanium relied 100% on the compiler. That was the whole point: do all the pipelining stuff in software at compile time rather than in hardware at every execution, and thus a net savings in silicon area and power consumption.
8:05 "It's not that. They've told me it's not that." Yeah? Well, then what is it?
What's the difference between the runtime optimization performed by Maverick 2 and something like a JIT compiler or branch prediction? Is the idea that the more-used code paths actually get more hardware, whereas JIT compilers and branch prediction create heuristics for code paths in software?
Is this just a CGRA with extra steps?
I’m probably wrong, but this kinda reminds me of those Transmeta Crusoe chips from the early 2000’s.
And then a small code change destroys your performance because their (proprietary?) runtime optimiser no longer understands what you're trying to do.
This already happens with speculative execution, just that programming for magic performance gains is no fun.
To my understanding, speculative execution is exactly what they're doing, but bigger than ever before.
A giant branch predictor, and a ton of FPUs per core fed by this branch predictor. The recompile is probably for their own ISA that they designed for OoO.
@@nextlifeonearth This is why we don't follow instructions. No processor core, no execution pipeline, no ISA. Stay tuned for a technology launch in a few months.
Sounds interesting - an accelerator that adapts over time to your code and improves performance and efficiency, but if there is a bug inside this system, that will be very unfunny to debug. Well, we'll have to wait and see. Thx 4 the news.
So they basically made a (next-gen JIT meets the V8 engine) processor for general compute.
Seems to me like a clever SW solution looking for a hardware demonstration, to later also target "legacy" CPU and GPU mixtures. As I learned from SPICE circuit simulation on GPU, part of the code is easy to port to a small graph flow of some hundred lines, but anything sparse-matrix gets hit by long memory latency. FEM is possibly a different target, where very small kernels are at 100% compute and the HBM only feeds huge chunks of data partitions.
Still, all of these have memory/compute separation. I think the real thing is coming with stacked memory, where the interconnects are counted in millions, each transferring billions per second, not hundreds transferring tens of billions. This will break the memory limit open for new kinds of code.
As long as no input is serially dependent on any other.
I have been imagining such a chip myself. A self-programming FPGA, if you will. Amazing that someone is going to build it.
So this is like an adaptive ASIC, but they are also saying for 100% sure they will do the software and not lump it on the customer? Could this be the way to go if you "just want more HPC"?
"It's not that. We've said it's not that."
"Ok, it's kinda that, but with a patent. Big difference."
What chemistry did you need FP64 for? 32-bit covers quite a range.
Doing disclosures at the end is dodgy...
If you don't want to spend watch time, at least put a text up...
Well, this seems like they have to produce working compilers for their hardware, and that is really hard. I guess we should wait and see, but Intel tried with IA-64 and even they could not get the compilers working.
I think I know what it is. Quick background: I specialised in DSP when I was in school, so I've worked with fixed-precision DSPs, MATLAB & gcc in the past.
Based on what I'm hearing, I will generalise this chip as a general-purpose multi-precision DSP that supports pipelining & branch prediction.
Imagine a DSP that only runs Intel SSE & AVX with SMT support but with on-die HBM, with a compiler that has a front end like MATLAB but can directly target this new chip.
Interested to learn how wrong I am when the chip comes out.
@@erictayet It is a dataflow hardware, and stay tuned to learn more!
@@elad_raz So like a state machine as implemented in an FPGA to simulate a K-map? But each state machine has an ALU/FPU to run in a neural net rather than a simple comparator?
Just shooting wildly here. I have worked with Altera FPGAs in my work and it's a completely different way of thinking about how the machine works. Certainly not the Von Neumann machine I'm used to coding for.
Appreciate the perspective. The late disclosure seemed a little disingenuous.
Ian, please talk about what's happening with Super Micro; shares are down 30% due to their auditors Ernst & Young resigning today. The tech press seems oddly quiet about the whole thing, which has been ongoing for months.
Company investor relations tend to only talk to the investor press. It's rare that Tech Press get a call about share prices
@@TechTechPotato It just seems weird to me that nobody has even spoken of it, especially since the DoJ started investigating them for alleged accounting violations. I'm just wondering if this is going to tank their AI dreams.
@@karehaqt I watched something a few weeks ago that explained exactly how naughty they were being. Something to do with multiple companies colluding to inflate the books. I think that probably explains the media blitz they have been doing across YT talking about their DC solutions. Probably an attempt to drown out the bad news.
I just watched a PR piece (L1tech) about Super Micro and I was still shocked people see them as a legit company, because I remember when we had to replace all our servers because they were bugged with Chinese spy chips... SuperMicro is greasing the right wheels, apparently!
modern NPUs only do INT8 (plus a bit more FP on the DSPs)... so I am now wondering if you can write some kernels to do fp32 math with the int8 MACs
Possible yes, but throughput will be awful, especially with emulation support for denormals. So it doesn't really make sense.
@ProjectPhysX I have seen doom run on worse hardware... But this will be my summer project for the winter
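The core of that trick would be limb decomposition: split each 24-bit FP32 significand into 8-bit chunks so the wide multiply becomes a grid of 8x8 products, which is what an int8 MAC array is good at. A rough CPU-side sketch of just the significand multiply (sign, exponent, rounding and denormals are left out, and that is where the real pain lives):

```cpp
#include <cstdint>
#include <cstdio>

// Multiply two 24-bit significands using only 8x8-bit products, the way an
// int8 MAC array would see the work. Result is the exact 48-bit product.
uint64_t mul24_via_int8(uint32_t a, uint32_t b) {
    uint8_t al[3] = { uint8_t(a), uint8_t(a >> 8), uint8_t(a >> 16) };
    uint8_t bl[3] = { uint8_t(b), uint8_t(b >> 8), uint8_t(b >> 16) };
    uint64_t acc = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            acc += uint64_t(uint16_t(al[i]) * bl[j]) << (8 * (i + j));  // 8x8 -> 16-bit partial product
    return acc;
}

int main() {
    // 1.5 and 1.25 as FP32 significands with the implicit leading 1 made explicit.
    uint32_t a = 0xC00000;  // 1.5  * 2^23
    uint32_t b = 0xA00000;  // 1.25 * 2^23
    uint64_t p = mul24_via_int8(a, b);
    // p = product * 2^46; shift back down to recover the value.
    std::printf("1.5 * 1.25 = %f\n", double(p) / double(1ULL << 46));
}
```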
SRC systems developed similar technology under the banner of reconfigurable computing.
This looks awfully similar to profile-guided optimization (PGO), which collects runtime information to help an ordinary compiler (like GCC) optimize code better for that execution pattern / scenario.
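For reference, the usual GCC PGO flow the comparison leans on: build instrumented, run a representative workload, then rebuild with the collected profile. A minimal example with a hypothetical hot.cpp:

```cpp
// hot.cpp -- code with a data-dependent branch is where PGO helps the most.
#include <cstdio>

int classify(int x) {
    if (x % 97 == 0) return 1;  // rare path; with profile data the compiler moves it off the hot layout
    return 0;                   // common path stays on the fall-through
}

int main() {
    long rare = 0;
    for (int i = 0; i < 100'000'000; ++i) rare += classify(i);
    std::printf("%ld\n", rare);
}

// Typical GCC workflow (Clang is analogous with -fprofile-instr-generate/-use):
//   g++ -O2 -fprofile-generate hot.cpp -o hot   # instrumented build
//   ./hot                                        # run a representative workload
//   g++ -O2 -fprofile-use      hot.cpp -o hot   # rebuild using the collected profile
```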
Do they have a good OpenCL driver? I am not going to write vendor-specific code for the product of a company that might do a nitrogen triiodide impression and go poof at any moment.
That's the beauty, the code here isn't vendor specific.
@@TechTechPotato I am not convinced yet but I hope they succeed
17:20 is this a 20 min ad?
Is it only better at FP64 in performance/power, but not lower precision?
Founder of NextSilicon is Elad Raz. According to Founders Village: "Mr. Raz served in the elite 8200 intelligence unit of the Israel Defense Forces". The more you know... ⭐
Israel does have mandatory military service. A lot of tech people there have been in intelligence units one way or another - it's why Israel is a tech hub.
Hi, Ian. Does RISC-V already have instructions equivalent to Neon, SVE and SVE2 of ARM CPUs?
RVV goes down that path :)
So my favourite density functional theory package running at the speed of a GPU? Without having to ask the devs to rewrite their whole codebase in CUDA? I am interested.
Sounds IA-64-like in its reliance on compiler and predication at least for the initial setup, while the hardware reconfiguration must have really terrible latency if they go with split resources on branches rather than just redoing the ops using the full hardware on a miss.
I'm thinking that the 800-pound gorillas (Intel/AMD) are going to come out with special compilers that convert your C++/Fortran code to their architecture without all the hand-porting effort.
Isn't there GDDR7 memory?
Remember these numbers. Look at this cool new tech! Numbers under embargo... bah! lol
Looks very cool though.
Gives me vibes of Intel's alleged Royal Core project with rentable units. An architecture that dynamically adapts to the workload to improve performance. Interesting stuff!
A lot of the best models use neural networks as part of the model, but not as the full model, so there will be continued use for improvement in the non-ML part. For example, if you know the exact physics of a simulation or the laws it obeys, using ML might be frankly stupid if your computer can handle computation of those exact terms well. It may even take less computation power, as you get rid of superfluous things and instead everything goes to actual calculation. If you do not know the exact physics or laws, or it's computationally impossible, you can still get a boost by modelling what you can model and using a neural network to correct that model in a hybrid approach. The part not based on a neural network can lead to the neural network being more grounded in reality (so better results), needing less training because it is grounded, being faster at times, and more. This is very much a case-by-case thing.
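The hybrid structure being described usually boils down to "known physics step plus a learned correction of the residual". A bare-bones C++ sketch, with a hand-written term standing in for the trained neural network (the constants and the correction term are made up for illustration):

```cpp
#include <cmath>
#include <cstdio>

// Known physics: a damped oscillator step we can write down exactly.
double physics_step(double x, double v, double dt, double &v_out) {
    const double k = 1.0, c = 0.05;           // stiffness and damping we think we know
    v_out = v + dt * (-k * x - c * v);
    return x + dt * v_out;
}

// Stand-in for a trained correction model: in the hybrid approach this would be
// a small neural net trained on (state -> residual) pairs from measurements.
double learned_correction(double x, double v) {
    return -0.002 * x * std::fabs(v);          // hypothetical learned drag term
}

int main() {
    double x = 1.0, v = 0.0, dt = 1e-2;
    for (int i = 0; i < 1000; ++i) {
        double v_next;
        double x_phys = physics_step(x, v, dt, v_next);
        // Hybrid update: the exact model does the heavy lifting, the learned part
        // only corrects what the exact model misses.
        x = x_phys;
        v = v_next + dt * learned_correction(x, v_next);
    }
    std::printf("x(10s) = %.4f, v(10s) = %.4f\n", x, v);
}
```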
After years of reading about miracle devices which turned out to be flops, I'd rather believe it when I see it. Related to the precision issue, I would like to see a scalable FPU that goes beyond FP64 in hardware when needed (high-order polynomials, etc.), more than some magical branch prediction.
It may not be their intended usage, but I have an idea for a game I've wanted to make for a long time, that I think this technology from Next-Silicon will make incredibly easy for me to accomplish now, in comparison to before. Before, I was looking at the potential of having to deploy servers just for hosting background logic going on in the game, not even multiplayer aspects. This just flipped the table for me. If it can be integrated well enough with the kind of system I have in mind right now... It could be done all in one server. Before, I was looking at a potential of a cluster, and gasping at the prices.
So instead I decided to downgrade some of the graphics that I would end up using, because it would at least free up some compute in the cpu and gpu.
But with this... that's not necessary anymore. If I understand correctly, that is. If I do understand correctly, I can offload all the game logic onto the accelerator, allowing the GPU and CPU involved to do their own tasks separately. Or in the worst-case scenario, it merely makes the operation of all that game logic much more efficient while still being a load on the CPU and GPU to some extent. But if that's the worst case, I can work with that. I think.
What's the game idea? As much as I would love to share, it also would be a dang shame if said idea were poached. I will instead say this at the very least. Imagine an MMO where everything you do actually affects everyone else, and not just through some premade restrictive scripts, but actual logic dictating what the most likely scenario is next. When you pull a pail of water from a river, it actually reduces in amount flowing behind you. If you chop down a tree, it actually stays chopped down until a new one can grow to replace it, properly. Not spawning in on a set timer. If you pull too much water from that river, the tree may not grow at all due to lack of ground water in the area now. (If taken far enough.)
The way I was looking at the likely path for coding something like that, I was met with the need for parallelism. And a lot of compute capability. You aren't running something like that on a typical CPU, to put it bluntly. And GPUs in the consumer market, well... not happening there either. So I started to look at accelerators. And that's how I got to the server clusters.
I put that game idea on hold, because I just cannot even begin to afford to do something on that scale. But with this Maverick chip. I feel like Pandora seeing hope at the bottom of the box.
How is this different from branch prediction? I even remember the DEC Alpha, which had a processor monitoring the CPU for metrics like this. Unless the processor can reconfigure an FPGA that is optimal for the observed calculations, then rewrite the code to maximize efficiency on the fly, the design will not be optimal.
I really want to understand how this hardware works. Is it a variation of a CGRA? Regardless, extraordinary claims require extraordinary evidence.
Also curious, but is it really that extraordinary? IBM made similar leaps between Power generations (4096 entries in the Power10 TLB), and the Intel/AMD entry into the HPC space with GPUs is because of their price point, not their capabilities.
Isn't it at a cost disadvantage, from a power-draw aspect?
Good one
It's not an FPGA+ASIC programming another FPGA within the SoC, like it wasn't commingling of funds at FTX crypto.
Is this Transmeta CPUs reinvented?
12:15 This is such a marketing graph (well, because it is!). But I hate these graphs :/ especially because I can't see the numbers or more details.
It was very interesting that you would display who and how many accelerators were used. I don't see why Intel can't populate their P & E cores in a grid with GPU cores right now. However, I think AMD is closer to this practically than Intel. Obviously, I have concerns about latency.
I wish them the best, but that sounds very much like what Intel said with Larrabee: you don't have to adapt your program, just recompile it and our compiler/lib/JIT will do the rest. And well, that didn't work out as well as everyone hoped.
Ian, I think you might be giving people the impression that FP64 makes calculations at 64-bit precision.
This is incorrect.
Doesn't MI300X (and MI300A) have superb FP64 performance, while having the memory BW to support it?
You mentioned Nvidia, but AMD is, I believe, a bigger player in HPC; they power some of the world's best HPC supercomputers.
Based on total compute, yes, but AMD is only in a small handful of (top) systems.
Yes MI300X is 82 TFlops vector FP64, and 163 TFlops matrix FP64. That thing is a beast and it will be hard for a startup to become even remotely competitive.
AMD will probably do better than Nvidia in HPC since they did not gut their FP64 path like Blackwell did. Also, there is less of a software issue there.
The Road Not Taken
By Robert Frost
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I-
I took the one less traveled by,
And that has made all the difference.
So it's a feed-forward DPU. We have had these for ages and I don't think it's going to help for most HPC tasks.
If you mean the IBM/DARPA thing, that was never going to go retail, but now that OpenCL is a thing, this might have a chance?
I'm hoping like hell that Radeon will bring back some FP64 compute performance to the masses with UDNA. Nvidia severely nerfed it with Maxwell and even many Quadros are nerfed. The upsell is insanely steep. Radeon isn't much better right now, with the focus on CDNA for that type of workload.
It doesn't accelerate AI in any way, does it? Just to make sure.
Only at full precision, not reduced precision modes
I mean, that is what branch predictors are already doing. And everything you and they have presented so far sounds exactly like a CPU with an FPGA and some fixed-function blocks - which falls flat in terms of performance compared to the more normal vectorisation approach for most cases, but can be faster if the workload is not your normal memory-intensive task but rather you need some more complex operations and have extra blocks for those (some extra trig hardware, etc.).
And really? Code is mostly taking the most-likely path? XD
Hi sir, love your videos very much, I admire your hard work, thank you so much again. Could you please do a video on the MediaTek Dimensity 9400 vs Snapdragon 8 Elite? I'm thinking of getting the Samsung S25 Ultra, or possibly an Oppo flagship high-end smartphone with the MediaTek 9400. Which one performs better? Thanks for your time, keep it up😀😀
I'm waiting for a D9400 and S8E sample
😊ok thanks
What if we made an ASIC that just "changes"?
Sounds like branch prediction😅
AMD accelerators get 3x better FP64 compute than Nvidia; there is a reason national labs are buying AMD-based supercomputers.
this is nice!
All that about flow was pretty much the ... before profit. It explained nothing. Magic happens, perf goes to the moon, trust me bro.
ok it sounds interesting :)
This chip looks like a nightmare to optimize for. Look, on the first run this part of code was slow, I'm going to optimize it. On the next run this part of code does not matter anymore due to hardware magic (optimization), but the code is still slow in some other place instead, back to square one.
Does the number 0.8373 mean anything to you?
Kind of a shame CDNA wasn't mentioned at all when it's much more HPC-focused than the Nvidia counterparts.
More content to come ! :)
Does MATLAB run on it?
@@TechTechPotato Btw, any plans to talk about SOI? It's been absent of late since GF gave up on 7nm. With all this new quest for high performance, I wonder why what seems like an "easy" avenue for a frequency and/or efficiency boost isn't being used.
it's*
Ponte Vecchio did 52 TFLOPS of FP64, but Intel sunset it. Was that more hardware or software that limited its adoption?
A bit of both, but also the theoretical memory bandwidth was almost impossible to achieve. The Chips and Cheese team even worked with Intel for their coverage and struggled to get >50%.
The memory bandwidth is a great point. 47 tiles or something, and we're seeing memory issues with Foveros on ARL. Thank you for the reply. Hopefully it can get remedied for Falcon Shores, if that is still an HPC part.
It's more that the design of PVC is just not good. From what we see in Aurora, the 52 TFLOPS is peak and unsustainable under real-life conditions. MI250X, on the other hand, can do 45 TFLOPS consistently in Frontier.
'applications run orders of magnitude faster' ... big claims
I wonder if Intel will acquire them only to sell off 5 years later.
L1 size is a huge red flag for me. Money can buy good nodes and a lot of HBM, but this L1 amount sounds like BS.
There's 256-bit floating point?
Theoretically you can have any power of two for your size, it just gets really impractical really fast. Pretty much nobody does more than 256 bits.
Another video, another mention of "clients". These are becoming ads more and more, and with the disclosure near the end.
This video isn't an ad. But good try though. I'm an analyst and consultant. All my clients, past and present, are listed in the description. I'm very open about this.
Intel and AMD should be here with products like this... where are they? Smoking blockchains behind the bike sheds again?
Radeon VII still good then?
Efficiency ain't great, and the software stack needs work, but zoom zoom
@TechTechPotato Yea, tell me about the AMD software/driver stack. I sometimes get AMD driver crashes just from playing YouTube videos on my 7950X3D iGPU. When I actually edit videos it becomes a nightmare.
From my tinkering several years ago, Radeon VII efficiency doubles from reducing core and memory clocks by 25% each.
I thought AMD had good hardware with 64bit FP support
Please stop using twitter.
This smells like vaporware to get investor money. If you have this (which seems more like SW than HW), you are making bank or getting bought for billions and landing at AMD or Intel.
Even compiler auto-parallelization (which this seems to be) has not been cracked for 15+ years. The best we have is Nvidia's threading model and stuff like OpenMP, which when I worked in the field was always losing to MPI.
That's why it's not that :)
You could have at least attempted to be critical of their claims. This just sounds like an advertisement for venture capitalists.