Maverick Chips for the Next Silicon Generation

  • Published: Nov 20, 2024

Comments • 169

  • @greggleswong
    @greggleswong 22 days ago +59

    "19.5, 67.0, 45.0"
    "That's numberwang!"

  • @bean_TM
    @bean_TM 22 days ago +37

    Seems incredibly cool. But I'd need proof it actually works first

  • @Pavlobot5
    @Pavlobot5 22 days ago +49

    Sounds like "trust me bro" performance

    • @fnorgen
      @fnorgen 22 days ago +7

      With some of these modern efficiency-optimized server CPUs being crammed full of those little e-cores, I don't really see the value proposition for this thing. They claim they'll have some super-advanced branch prediction system that will drastically improve throughput, but that looks like a very tough problem to solve.
      And if there's enough money in HPC, it wouldn't surprise me to see Nvidia splitting their high-end GPUs into low-precision-optimized and high-precision-optimized product lines. They have the budget for it.

    • @lalishansh
      @lalishansh 18 days ago

      I remember Jim Keller saying CPUs are mostly about branch prediction nowadays. This seems like branch prediction on steroids. EXCITING!!

  • @ProjectPhysX
    @ProjectPhysX 22 days ago +27

    This sounds fancy but in practice is unlikely to work. HPC codes need dumb vector processors with the FP64 vector compute throughput and the memory bandwidth and capacity to back it up. They don't need fancy dynamic branch prediction. HPC codes don't use branching for the most part, so there is nothing that smart branch prediction can even optimize. The telemetry collection for this branch prediction will probably even slow it down.
    If the thing doesn't support OpenCL/SYCL and instead needs recompiling, it is basically DOA. Recompiling for special hardware never goes smoothly; there is always some detail that doesn't work and needs extra debugging, and developers have neither the time nor the money to adjust their codes for another proprietary chip that does things differently than the industry standard. See Intel Knights Corner and how that worked out...
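
    For illustration, the branch-free style typical of such codes (a NumPy sketch, not anyone's actual kernel): per-element conditions become masks and selects, so there is nothing for a branch predictor to learn.

    ```python
    import numpy as np

    u = np.random.rand(1_000_000)
    # No if/else per element: both sides are computed and then selected by a
    # mask, so every element does identical, fully vectorizable work.
    flux = np.where(u > 0.5, 0.9 * u, 1.1 * u)
    ```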

    • @henriksundt7148
      @henriksundt7148 21 days ago +3

      You are right if it is only applied to standard, massively parallel tasks, like training the weights and biases of a static feed-forward structure / neural net. However, these architectures are popular to a large extent because of the availability of this kind of uniform hardware. There are so many tasks that (a) are performed on the CPU today but could be faster (there are examples in the video's description), and (b) so many approaches that are not popular because they are slow. In these domains, NextSilicon can have an impact.

    • @foobarf8766
      @foobarf8766 21 days ago +3

      HPC does use branching in scientific applications; it's limited by GPUs being vector-only, and IBM Power is still around for good reason. But you are right about the work involved in porting -- and that also counts against GPUs when those applications don't lend themselves to floating point. There are surely a few of those in meteorology and climate science.

    • @ABaumstumpf
      @ABaumstumpf 21 days ago

      And even with the branches: in decent code you are already reaching the hardware limits. Either you fully saturate the ALUs or the memory bandwidth.
      (Or, in bad cases like we had today, networking... 250 workers can put some load on the system.)
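
      That ALU-or-bandwidth ceiling is the roofline model; a minimal sketch with made-up machine numbers, not any specific chip:

      ```python
      def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
          """Attainable throughput = min(compute roof, memory roof)."""
          return min(peak_gflops, bandwidth_gbs * flops_per_byte)

      # A kernel doing 0.25 FLOP/byte on a 100 GFLOP/s, 50 GB/s machine:
      print(roofline_gflops(100.0, 50.0, 0.25))  # 12.5 -> memory bound
      ```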

  • @TheCebulon
    @TheCebulon 21 days ago +7

    There is one important number missing!
    And it means a lot to me: 42.

  • @bernadettetreual
    @bernadettetreual 21 days ago +4

    I find it hard to believe that this is true. It's like the theoretical advantages of the Java VM over AoT-compiled code. They never materialized.

  • @autohmae
    @autohmae 22 days ago +7

    HPC is not a small market; if they can get a good chunk of it, that's a good niche

  • @MrHaggyy
    @MrHaggyy 22 days ago +9

    This looks really interesting in combination with Mojo. The big problem in HPC is getting the right ALU in the right configuration. I'm curious how they get that much performance out of branch prediction. In the optimization problems I know, we had to explore a search space until we hit a sufficiently low error, for example. The branches were: run this stuff, and occasionally check if it's sufficient. From my understanding, utilizing the ALU correctly by compiling certain functions like matrix multiply with the hardware in mind would be much more effective, as it cuts down the serial path you are parallelizing.
    The combination of AI and HPC will be an interesting market. An engineer can explore a vastly bigger search space if you train AI to do textbook engineering and give it the right amount of HPC. This kind of workflow is already done in fluid dynamics for vehicles and buildings, and has been used to optimize combustion in ICEs. I'm also pretty sure turbine manufacturers use it as well.
    IBM tried to innovate in that space with variable precision. The idea was to brute-force the search space in low resolution, sort the results, and recompute the areas of interest with higher precision. But I don't think it got adopted that well. Probably just too complex to handle.
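
    That coarse-then-refine idea in miniature, as a hedged sketch (toy objective; float32 for the brute-force scan, float64 to recompute the areas of interest):

    ```python
    import numpy as np

    def f(x):
        return np.sin(3.0 * x) + 0.1 * x * x   # toy objective to minimize

    # Coarse pass: brute-force the search space in low precision.
    xs = np.linspace(-10.0, 10.0, 100_000, dtype=np.float32)
    order = np.argsort(f(xs))                  # sort the results...
    seeds = xs[order[:8]].astype(np.float64)   # ...keep the areas of interest

    # Fine pass: recompute only around the best candidates in high precision.
    best = min((f(np.linspace(s - 1e-3, s + 1e-3, 1001)).min(), s) for s in seeds)
    print(best)
    ```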

    • @foobarf8766
      @foobarf8766 21 days ago +2

      The SIMD instructions on my general-purpose (AMD) CPU give good baby-step giant-step performance, but I haven't compared with the latest Power ISA to really say. Stuff like that, which is heavy on the branching, doesn't lend itself well to GPUs. (edit: %s/prediction/branching)
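
      For context, baby-step giant-step finds x with g^x ≡ h (mod p) via a lookup table and a data-dependent branch per step, the integer-heavy pattern being described; a minimal sketch:

      ```python
      from math import isqrt

      def bsgs(g, h, p):
          """Find x with pow(g, x, p) == h in O(sqrt(p)) time and memory."""
          m = isqrt(p - 1) + 1
          baby = {pow(g, j, p): j for j in range(m)}  # baby steps: g^j
          step = pow(g, -m, p)                        # modular inverse (Python 3.8+)
          gamma = h % p
          for i in range(m):                          # giant steps: h * g^(-i*m)
              if gamma in baby:                       # hash lookup + branch per step
                  return i * m + baby[gamma]
              gamma = gamma * step % p
          return None

      assert bsgs(2, 9, 11) == 6                      # 2^6 = 64 ≡ 9 (mod 11)
      ```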

    • @MrHaggyy
      @MrHaggyy 17 days ago

      Well, it depends on how you branch. I played around with fixed-step Euler and Runge-Kutta on my GPU, and if you run the same instructions on a large enough dataset, they fit really well. The same is true for back-propagation-based algorithms. But it gets tricky when you have something like ODE45, which has variable step size and conditional branching. For those, a few dozen cores that combine the benefits of different initial conditions and variable step size/branching would be best.
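
      A toy sketch of the contrast (scalar Python; step doubling stands in for ODE45's embedded error estimate):

      ```python
      def euler_fixed(f, y, t, t_end, n):
          """Fixed step: identical work every step, so lanes stay in lockstep."""
          h = (t_end - t) / n
          for _ in range(n):
              y, t = y + h * f(t, y), t + h
          return y

      def euler_adaptive(f, y, t, t_end, tol=1e-6, h=0.01):
          """Step doubling: the accept/reject branch makes lanes with different
          initial conditions diverge on SIMD/GPU hardware."""
          while t < t_end:
              h = min(h, t_end - t)
              one = y + h * f(t, y)                    # one full step
              mid = y + (h / 2) * f(t, y)              # two half steps
              two = mid + (h / 2) * f(t + h / 2, mid)
              if abs(two - one) < tol:                 # accept and grow the step...
                  y, t, h = two, t + h, h * 1.5
              else:                                    # ...or reject and shrink it
                  h *= 0.5
          return y

      f = lambda t, y: -y                              # toy ODE: y' = -y
      print(euler_fixed(f, 1.0, 0.0, 1.0, 1000), euler_adaptive(f, 1.0, 0.0, 1.0))
      ```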

  • @capability-snob
    @capability-snob 22 days ago +3

    I'm all for divergence-busting techniques, even if quite a lot of HPC workloads don't absolutely need them. I suspect the bigger challenge with general vector compute is around memory access; Ian mentioned this briefly but it looks worth digging into.

  • @nextlifeonearth
    @nextlifeonearth 21 days ago +8

    So it's basically a super-wide out-of-order execution pipeline with speculative execution on steroids. So instead of like 4 FPUs per core they have, say, 256, and a massive execution buffer (I expect the HBM is for that) to keep them fed.
    Their ISA is probably defined for OoO, which is why the recompile is enough.
    The branch predictor simply learns and works farther ahead than any current CPU.
    Or that's what I'm getting from this.

    • @elad_raz
      @elad_raz 20 days ago

      No ISA, it's a dataflow

  • @keyboard_g
    @keyboard_g 21 days ago +3

    The latest versions of the .NET runtime profile running code and JIT it a second and third time to tune hot paths and emit better assembly instructions.
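
    Roughly that idea, sketched in Python rather than .NET (a hypothetical decorator, not the actual runtime mechanism): count calls, then swap in a recompiled version once a path proves hot.

    ```python
    HOT = 1000  # promotion threshold, like a tiered-JIT call counter

    def tiered(optimize):
        """Run the quick tier-0 version until it is hot, then promote it."""
        def wrap(func):
            state = {"calls": 0, "impl": func}
            def dispatch(*args, **kwargs):
                state["calls"] += 1
                if state["calls"] == HOT:
                    state["impl"] = optimize(func)  # "recompile" the hot path
                return state["impl"](*args, **kwargs)
            return dispatch
        return wrap

    @tiered(optimize=lambda f: f)  # stand-in: a real JIT emits better code here
    def kernel(x):
        return x * x
    ```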

  • @soonts
    @soonts 21 days ago +4

    GPUs are only “fixed” within one dispatch / one draw call. However, many practical compute problems/3D scenes are split into thousands if not millions of dispatches / draw calls, and CPU-side code can adjust the size of dispatches / length of draw calls at runtime.
    BTW, on Windows it is critical to control the count of in-flight compute thread groups / draw calls, because the OS insists GPUs should stay responsive at all times even when loaded, and has the timeout detection and recovery (TDR) feature in the OS kernel to enforce that policy.
    To be fair, until work graphs arrived in D3D12, said flexibility was tricky to implement. D3D11 supports indirect dispatches / draw calls, queries to track completion of things, and other queries to measure time spent computing / rendering, but developers need to build their custom pipelines on top of these primitives.

  • @MoonDweller1337
    @MoonDweller1337 22 days ago +15

    How is it different from traditional branch prediction and speculative execution?

    • @TechTechPotato
      @TechTechPotato  22 days ago +16

      Both end up going towards a fixed compute array. Here the size of the compute array changes given the workflow.

    • @TheWunder
      @TheWunder 22 days ago +6

      ​@@TechTechPotato Thank you Dr Potato

    • @lucasfernandesgrotto6279
      @lucasfernandesgrotto6279 22 days ago +1

      @@TechTechPotato and they do that without being an FPGA?

    • @quantumbacon
      @quantumbacon 22 days ago +3

      Guess they'll have fixed silicon with various paths/pipelines/SE & BP widths. Then the compiler wraps in some path-optimiser tables that cause registers to get used in a programmatic way, which are attached to the various pipelines.
      So at some point the path lookups get evaluated and compete for optimal use.
      Feels like it doesn't work for modelling where 'every solution' gets calculated, or anything like Monte Carlo.
      Also, devs at the Department of Energy know how to optimise, or they are using machine learning to assist in optimising code.
      Makes sense to let Nvidia GPUs run the code optimisers.

    • @lucasfernandesgrotto6279
      @lucasfernandesgrotto6279 22 days ago

      @@quantumbacon thank you!

  • @jamesdk5417
    @jamesdk5417 22 days ago +23

    As an older gamer, you always surprise me with how little I know about things outside of gaming. Thanks very much for the video.

  • @HansCNelson
    @HansCNelson 14 days ago

    IF runtime-adaptive acceleration really works (big if), is it fair to think that it could jump big parts of the CUDA moat?

  • @ullibowyer
    @ullibowyer 22 days ago +3

    When I have a set of ALUs and some branchy code, the proportion of ALUs running the rare path automatically goes up as the rare path gets more common. This gets more complicated with vector units, which suffer a large slowdown when a small number of lanes follow a rare path; but if that's what is being addressed here, then the message is being lost/oversimplified. 😢 On the other hand, dataflow programming is awesome, so it's nice to hear something that sounds like that 🎉
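
    A toy cost model of that lane effect (assuming a unit that serializes over the distinct paths present in a batch):

    ```python
    def simd_batch_cost(lane_paths, path_cost):
        """A vector unit pays for every path any lane takes, not the average."""
        return sum(path_cost[p] for p in set(lane_paths))

    cost = {"common": 10, "rare": 100}
    print(simd_batch_cost(["common"] * 32, cost))             # 10
    print(simd_batch_cost(["common"] * 31 + ["rare"], cost))  # 110: one rare lane, 11x slower
    ```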

    • @TechTechPotato
      @TechTechPotato  22 days ago +2

      It's something we'll go into in time as we dive deeper, for sure

  • @mytech6779
    @mytech6779 22 days ago +11

    Intel Xe has good 64-bit performance (they purposefully favored HPC rather than AI training), and it is used in Argonne Lab's Aurora computer.
    Xe2 ("Battlemage" when in the Arc graphics form factor) should be even better, with 64-bit int support. Intel oneAPI already offers write-once, compile-everywhere SYCL/C++ across vendors and device types (AdaptiveCpp is another SYCL compiler, an alternative to oneAPI).
    Nvidia double precision fell off a cliff years ago (like 10% of FP32 these days, basically just software support rather than native 64-bit register size); my old 2012 AMD W7000 (GCN 1.0) had DP speed exactly half of SP speed, due to 64-bit registers that were split for 32-bit.

    • @BozesanVlad
      @BozesanVlad 21 days ago

      I'm curious why Arc is FPGA, at least at the driver level, to "make" it work as a GPU

    • @rightwingsafetysquad9872
      @rightwingsafetysquad9872 21 days ago +1

      Nvidia has 64-bit performance that is 1/4 that of 32-bit on their fattest chips. The ones that make it into GeForce cards do not. GA100 had full 64-bit support; GA102 did not.

    • @foobarf8766
      @foobarf8766 21 days ago +1

      Goes to show the cost of branched compute tasks on GPUs. For baby-step giant-step I find better performance on Ryzen than can be extracted from any Radeon.

    • @xpk0228
      @xpk0228 21 days ago +1

      Blackwell cut a lot of FP64 since their focus is now on AI.

    • @mytech6779
      @mytech6779 21 days ago +2

      @@rightwingsafetysquad9872 Ah yes, the x100 chips actually have the expected 2:1 from physical 64-bit, but all the others (even in the enterprise line) have something near a 40:1 drop-off for double precision, at least since Pascal.
      (I'm only referring to compute speed inside the GPU, without memory-bottleneck considerations.)

  • @Chriva
    @Chriva 22 days ago +16

    1:10 No amount of bits can please the real hardcore people. Check out arbitrary-precision maths 😂 (the hunt for primes and digits of pi in particular)

    • @incription
      @incription 22 days ago +2

      Yep, it's literally unavoidable, although you can use integers in place of large floating-point numbers; you just have to adjust the formulas
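
      For example, Q16.16 fixed point: carry a scale factor through the formulas and all the arithmetic stays integer (a minimal sketch):

      ```python
      SCALE = 1 << 16                  # Q16.16: 16 integer bits, 16 fraction bits

      def to_fix(x: float) -> int:
          return round(x * SCALE)

      def fix_mul(a: int, b: int) -> int:
          return (a * b) >> 16         # re-scale: the raw product carries SCALE^2

      def to_float(a: int) -> float:
          return a / SCALE

      a, b = to_fix(3.25), to_fix(0.5)
      assert to_float(fix_mul(a, b)) == 1.625
      ```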

    • @foobarf8766
      @foobarf8766 21 days ago +2

      It's true, I need a 160-bit integer math processor; I don't even care about this floating-point stuff. I'm not trying to make a bad poetry machine

    • @levygaming3133
      @levygaming3133 21 days ago +1

      @@foobarf8766 what kinda math are you doing where you’re not ever going to get decimals? Or even just _use_ decimals? Prime number hunting the slow way?
      (Especially b/c I’m pretty sure that prime number hunting takes advantage of AVX floats.)

  • @djsnowpdx
    @djsnowpdx 22 days ago +1

    Your video about all-big-core smartphones cleared up how I think about Apple CPUs now. I just disregard the little cores. So the M4 is fast, but with only 3-4 big cores, you might consider the M4 Pro for any CPU-intensive workflows; and the M4 Max is not much better, so only buy that if you need more GPU than the M4 Pro, and expect a slight battery-life hit. Thanks Dr. Cutress!

  • @dddslimebbb
    @dddslimebbb 21 days ago +4

    I'm seeing a lot of mentions of branch prediction, but this seems to be a misunderstanding (or am I the one misunderstanding?)
    My reading is that the code "flow" is analyzed, and this is used to allocate more compute "width" in that area, whereas branch prediction is about guessing which way a branch goes so you don't starve your pipeline. Branch prediction may be an important part of this chip, but it isn't what's being showcased.

    • @clehaxze
      @clehaxze 21 days ago

      That turns the computation problem into VLIW. Maybe the chip is dynamically assigning how many execution units each branch gets? So less-taken branches get 4 units while the most common code path gets 16. I can see this working somewhat. But instruction retiring and the branch-misprediction penalty are going to be nuts. And writing a compiler for this sounds like a yucky problem (assuming they use some annotation in the ISA).

  • @vasudevmenon2496
    @vasudevmenon2496 21 days ago +1

    I still find it hard to solve FP64 conversion with the mantissa or exponent part when it's a negative number. I remember Pascal had much better FP64 performance than Maxwell, and my friend's 1050 Ti was way faster than a GTX 980 in peak CUDA workloads. Great to see this approach.

  • @TheGreenPianist
    @TheGreenPianist 22 days ago +1

    Nice to see that FP64 is not ignored all the way in these times 😅 Our NWP models are increasingly mixed FP32/FP64 precision, but a large part of the code will always just need many FP64 flops

  • @Cybot_2419
    @Cybot_2419 22 days ago +2

    Does this only support OMP target, or is there something similar to CUDA/HIP to program it? I'm wondering if it's worth porting GPU codes to this (ones that use CUDA/HIP and not OMP target) that are mainly memory-bandwidth constrained. Or is this more intended for CPU-only codes?

  • @tristan7216
    @tristan7216 22 days ago +1

    Sounds like branch prediction at a larger scale, reorganizing the placement of code on the chip to optimize data flow. They should be able to measure the performance-per-watt boost on open-source science codes, so I'd expect it works pretty well if they've done that. It'll depend on the code though. Interesting.

  • @dddslimebbb
    @dddslimebbb 21 days ago +1

    Would be interesting to see how this could integrate with MLIR (the LLVM magic that Mojo uses). I also wonder if, with sufficient support, this could be well suited to accelerating functional programming languages without the traditional FP performance penalty when translating to something that will run on traditional hardware. That might make some HPC-using mathematicians *very* happy.

  • @rb8049
    @rb8049 22 days ago +2

    I remember the Fairchild multi-chip-module CPUs in the 1980s.

  • @henrycobb
    @henrycobb 21 days ago +6

    Intel promised that Itanium just needed a better compiler. How'd that work out?

    • @foobarf8766
      @foobarf8766 21 days ago +2

      They positioned that against IBM Power, which was like 20 years of compiler work ahead, so not bad considering? But OpenCL is a thing now, so maybe this has a better chance?

    • @mytech6779
      @mytech6779 21 days ago +1

      Itanium relied 100% on the compiler. That was the whole point: do all the pipelining stuff in software at compile time rather than in hardware at every execution, and thus get a net savings in silicon area and power consumption.

  • @countdown4100
    @countdown4100 21 days ago +1

    8:05 "It's not that. They've told me it's not that." Yeah? Well, then what is it?

  • @darveshgorhe
    @darveshgorhe 21 days ago +1

    What's the difference between the runtime optimization performed by Maverick 2 and something like a JIT compiler or branch prediction? Is the idea that the more-used code paths actually get more hardware, whereas JIT compilers and branch prediction create heuristics for code paths in software?

  • @Swordhero111
    @Swordhero111 21 days ago +2

    Is this just a CGRA with extra steps?

  • @thegeforce6625
    @thegeforce6625 21 days ago

    I’m probably wrong, but this kinda reminds me of those Transmeta Crusoe chips from the early 2000s.

  • @rwantare1
    @rwantare1 22 days ago +10

    And then a small code change destroys your performance because their (proprietary?) runtime optimiser no longer understands what you're trying to do.
    This already happens with speculative execution; it's just that programming for magic performance gains is no fun.

    • @nextlifeonearth
      @nextlifeonearth 21 days ago +2

      To my understanding, speculative execution is exactly what they're doing, but bigger than ever before.
      A giant branch predictor, and a ton of FPUs per core fed by it. The recompile is probably for their own ISA that they designed for OoO.

    • @elad_raz
      @elad_raz 21 days ago +1

      @@nextlifeonearth This is why we don't follow instructions. No processor core, no execution pipeline, no ISA. Stay tuned for a technology launch in a few months.

  • @MrGarrax
    @MrGarrax 22 days ago +1

    Sounds interesting: an accelerator that adapts over time to your code and improves performance and efficiency. But if there is a bug inside this system, it will be very unfunny to debug. Well, we'll have to wait and see. Thx 4 the news.

  • @JoeHacobian
    @JoeHacobian 19 days ago

    So they basically made a (JIT-next meets the V8 engine) processor for general compute

  • @reinerfranke5436
    @reinerfranke5436 22 days ago +1

    Seems to me like a clever SW solution looking for a hardware demonstration, which could later also target "legacy" CPU and GPU mixtures. As I learned from SPICE circuit simulation on GPU, part of the code is easy to port to a small graph flow of a few hundred lines, but anything sparse-matrix gets hit by long memory latency. FEM is possibly a different target, where very small kernels run at 100% compute and the HBM only feeds huge chunks of partitioned data.
    Still, all of these keep memory and compute separate. I think the real thing is coming with stacked memory, where the interconnects are counted in millions, each transferring billions per second, not hundreds transferring tens of billions. That will break the memory limit open for new applications of code.

  • @platin2148
    @platin2148 20 days ago

    As long as no input is serially dependent on any other

  • @artifactingreality
    @artifactingreality 21 days ago

    I have been imagining such a chip myself: a self-programming FPGA, if you will. Amazing that someone is going to build it.

  • @quibster
    @quibster 19 days ago

    So this is like an adaptive ASIC, but they are also saying for 100% sure they will do the software and not lump it on the customer? Could this be the way to go if you "just want more HPC"?

  • @sambojinbojin-sam6550
    @sambojinbojin-sam6550 21 days ago

    "It's not that. We've said it's not that."
    "Ok, it's kinda that, but with a patent. Big difference."

  • @DanFrederiksen
    @DanFrederiksen 21 days ago

    What chemistry did you need FP64 for? 32-bit covers quite a range

  • @acasccseea4434
    @acasccseea4434 21 days ago +4

    Doing disclosures at the end is dodgy...
    If you don't want to spend watch time on it, at least put text up...

  • @xpk0228
    @xpk0228 21 days ago

    Well, this means they have to produce working compilers for their hardware, and that is really hard. I guess we should wait and see, but Intel tried with IA-64 and even they could not get the compilers working.

  • @erictayet
    @erictayet 21 days ago +1

    I think I know what it is. Quick background: I specialised in DSP when I was in school, so I've worked with fixed-precision DSPs, MATLAB & gcc in the past.
    Based on what I'm hearing, I'd generalise this chip as a general-purpose multi-precision DSP that supports pipelining & branch prediction.
    Imagine a DSP that only runs Intel SSE & AVX with SMT support, but with on-die HBM and a compiler that has a front-end like MATLAB but can directly target this new chip.
    Interested to learn how wrong I am when the chip comes out.

    • @elad_raz
      @elad_raz 20 days ago +1

      @@erictayet It is dataflow hardware; stay tuned to learn more!

    • @erictayet
      @erictayet 17 days ago

      @@elad_raz So like a state machine as implemented in an FPGA to simulate a K-map? But each state machine has an ALU/FPU to run in a neural net rather than a simple comparator?
      Just shooting wildly here. I have worked with Altera FPGAs in my work, and it's a completely different way of thinking about how the machine works. Certainly not the von Neumann machine I'm used to coding for.

  • @NickChapmanThe
    @NickChapmanThe 21 days ago +2

    Appreciate the perspective. The late disclosure seemed a little disingenuous.

  • @karehaqt
    @karehaqt 22 days ago +18

    Ian, please talk about what's happening with Super Micro: shares down 30% due to their auditors Ernst & Young resigning today. The tech press seems oddly quiet about the whole thing, which has been ongoing for months.

    • @TechTechPotato
      @TechTechPotato  22 days ago +13

      Company investor relations tend to only talk to the investor press. It's rare that the tech press gets a call about share prices

    • @karehaqt
      @karehaqt 22 days ago +3

      @@TechTechPotato It just seems weird to me that nobody has even spoken of it, especially since the DoJ started investigating them for alleged accounting violations. I'm just wondering if this is going to tank their AI dreams.

    • @muhdiversity7409
      @muhdiversity7409 22 days ago +2

      @@karehaqt I watched something a few weeks ago that explained exactly how naughty they were being: something to do with multiple companies colluding to inflate the books. I think that probably explains the media blitz they have been doing across YT talking about their DC solutions. Probably an attempt to drown out the bad news.

    • @todorkolev7565
      @todorkolev7565 22 days ago +1

      I just watched a PR piece (L1tech) about Super Micro and I was still shocked people see them as a legit company, because I remember when we had to replace all our servers because they were bugged with Chinese spy chips... SuperMicro is greasing the right wheels, apparently!

  • @Veptis
    @Veptis 22 days ago +1

    Modern NPUs only do INT8 (plus a bit more FP on the DSPs)... so I am now wondering if you can write some kernels to do FP32 math with the INT8 MACs

    • @ProjectPhysX
      @ProjectPhysX 22 days ago +2

      Possible, yes, but throughput will be awful, especially with emulation support for denormals. So it doesn't really make sense.

    • @Veptis
      @Veptis 22 days ago +1

      @ProjectPhysX I have seen Doom run on worse hardware... but this will be my summer project for the winter
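
      A rough sketch of the limb-splitting idea from this thread: multiply two positive floats using only byte-sized products accumulated in a wide integer (no rounding, negatives, or denormal handling):

      ```python
      import math

      def mul_via_int8_macs(a: float, b: float) -> float:
          ma, ea = math.frexp(a)                       # a = ma * 2**ea, 0.5 <= ma < 1
          mb, eb = math.frexp(b)
          ia, ib = int(ma * 2**24), int(mb * 2**24)    # 24-bit integer mantissas
          la = [(ia >> s) & 0xFF for s in (0, 8, 16)]  # three byte limbs each
          lb = [(ib >> s) & 0xFF for s in (0, 8, 16)]
          acc = 0
          for i, x in enumerate(la):                   # 9 small MACs per multiply
              for j, y in enumerate(lb):
                  acc += (x * y) << (8 * (i + j))
          return acc * 2.0 ** (ea + eb - 48)

      assert mul_via_int8_macs(1.5, 2.0) == 3.0
      ```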

  • @TheLkdude
    @TheLkdude 20 days ago

    SRC Systems developed similar technology, under reconfigurable computing

  • @adul00
    @adul00 21 days ago +1

    This looks awfully similar to profile-guided optimization (PGO), which collects runtime information to help an ordinary compiler (like GCC) optimize code better for that execution pattern / scenario.

  • @TheBackyardChemist
    @TheBackyardChemist 22 days ago +6

    Do they have a good OpenCL driver? I am not going to write vendor-specific code for the product of a company that might do a nitrogen triiodide impression and go poof at any moment.

    • @TechTechPotato
      @TechTechPotato  22 days ago +6

      That's the beauty, the code here isn't vendor specific.

    • @TheBackyardChemist
      @TheBackyardChemist 22 days ago +3

      @@TechTechPotato I am not convinced yet but I hope they succeed

  • @Quarky_
    @Quarky_ 21 days ago +1

    17:20 is this a 20 min ad?

  • @alexg50446
    @alexg50446 13 days ago

    Is it only better at FP64 in performance/power, and not at lower precision?

  • @moienahmadi2377
    @moienahmadi2377 5 hours ago

    Founder of NextSilicon is Elad Raz. According to Founders Village: "Mr. Raz served in the elite 8200 intelligence unit of the Israel Defense Forces". The more you know... ⭐

    • @TechTechPotato
      @TechTechPotato  4 hours ago

      Israel does have mandatory military service. A lot of tech people there have been in intelligence forces one way or another - it's why Israel is a tech hub.

  • @thiagofreire4496
    @thiagofreire4496 22 days ago

    Hi, Ian. Does RISC-V already have instructions equivalent to Neon, SVE and SVE2 of ARM CPUs?

  • @xwingfighter999
    @xwingfighter999 17 days ago

    So my favourite density functional theory package running at the speed of a GPU? Without having to ask the devs to rewrite their whole codebase in CUDA? I am interested.

  • @proesterchen
    @proesterchen 22 days ago

    Sounds IA-64-like in its reliance on the compiler and predication, at least for the initial setup. And the hardware reconfiguration must have really terrible latency if they go with splitting resources across branches rather than just redoing the ops using the full hardware on a miss.

  • @evdrivertk
    @evdrivertk 19 days ago

    I'm thinking that the 800-pound gorillas (Intel/AMD) are going to come out with special compilers that convert your C++/Fortran code to their architecture without all the hand-porting effort.

  • @juancarlospizarromendez3954
    @juancarlospizarromendez3954 21 days ago

    Is there no GDDR7 memory?

  • @Mark_Williams.
    @Mark_Williams. 21 days ago

    Remember these numbers. Look at this cool new tech! Numbers under embargo... bah! lol
    Looks very cool though.
    Gives me vibes of Intel's alleged Royal Core project with rentable units: an architecture that dynamically adapts to the workload to improve performance. Interesting stuff!

  • @gadlicht4627
    @gadlicht4627 22 days ago +1

    A lot of the best models use neural networks as part of the model but not the full model, so there will be continued use for improvements in the non-ML part. For example, if you know the exact physics of a simulation or the laws it obeys, using ML might frankly be stupid if your computer can handle computing those exact terms well. It may even take less computation power, as you get rid of superfluous things and everything goes to actual calculation. If you do not know the exact physics or laws, or it's computationally impossible, you can still get a boost by modelling what you can and using a neural network to modify that model, in a hybrid approach. The part not based on a neural network can make the neural network more grounded in reality (so better results), need less training, be faster at times, and more. This is very much a case-by-case thing.

  • @dankodnevic3222
    @dankodnevic3222 21 days ago

    After years of reading about miracle devices that turned out to be flops, I'd rather believe it when I see it. On the precision issue, I would like to see a scalable FPU that goes beyond FP64 in hardware when needed (high-order polynomials, etc.), more than some magical branch prediction.

  • @ManuFortis
    @ManuFortis 21 days ago

    It may not be their intended usage, but I have an idea for a game I've wanted to make for a long time, that I think this technology from Next-Silicon will make incredibly easy for me to accomplish now, in comparison to before. Before, I was looking at the potential of having to deploy servers just for hosting background logic going on in the game, not even multiplayer aspects. This just flipped the table for me. If it can be integrated well enough with the kind of system I have in mind right now... It could be done all in one server. Before, I was looking at a potential of a cluster, and gasping at the prices.
    So instead I decided to downgrade some of the graphics that I would end up using, because it would at least free up some compute in the cpu and gpu.
    But with this... that's not necessary anymore. If I understand correctly, that is. If I do understand correctly, I can offload all the game logic going on onto the accelerator, allowing the GPU and CPU involved to do their own tasks separately. Or in the worst-case scenario, it merely makes the operation of all that game logic much more efficient while still being a load on the CPU and GPU to some extent. But if that's the worst case, I can work with that. I think.
    What's the game idea? As much as I would love to share, it also would be a dang shame if said idea were poached. I will instead say this at the very least. Imagine an MMO where everything you do actually affects everyone else, and not just through some premade restrictive scripts, but actual logic dictating what the most likely scenario is next. When you pull a pail of water from a river, it actually reduces in amount flowing behind you. If you chop down a tree, it actually stays chopped down until a new one can grow to replace it, properly. Not spawning in on a set timer. If you pull too much water from that river, the tree may not grow at all due to lack of ground water in the area now. (If taken far enough.)
    The way I was looking at the likely path for coding something like that, I was met with the need for parallelism. And a lot of compute capability. You aren't running something like that on a typical CPU, to put it bluntly. And GPUs in the consumer market, well... not happening there either. So I started to look at accelerators. And that's how I got to the server clusters.
    I put that game idea on hold, because I just cannot even begin to afford to do something on that scale. But with this Maverick chip. I feel like Pandora seeing hope at the bottom of the box.

  • @skypickle29
    @skypickle29 22 days ago

    How is this different from branch prediction? I even remember the DEC Alpha, which had a processor monitoring the CPU for metrics like this. Unless the processor can reconfigure an FPGA so it is optimal for the observed calculations, then rewrite the code to maximize efficiency on the fly, the design will not be optimal.

  • @MaxHaydenChiz
    @MaxHaydenChiz 21 days ago +1

    I really want to understand how this hardware works. Is it a variation of a CGRA? Regardless, extraordinary claims require extraordinary evidence.

    • @foobarf8766
      @foobarf8766 21 days ago

      Also curious, but is it really that extraordinary? IBM made similar leaps between Power generations (4096 entries in the Power10 TLB), and the Intel/AMD entry into the HPC space with GPUs is because of their price point, not their capabilities.

  • @MrMrMrMrT
    @MrMrMrMrT 18 days ago

    Isn't it at a cost disadvantage? From a power-draw aspect

  • @sameeranjoshi1087
    @sameeranjoshi1087 22 days ago

    Good one

  • @RwilliaMHI
    @RwilliaMHI 21 days ago

    It's not an FPGA+ASIC programming another FPGA within the SoC, like it wasn't commingling of funds at FTX crypto.

  • @kamilhorvat8290
    @kamilhorvat8290 22 days ago

    Is this the Transmeta CPU reinvented?

  • @1introvert_guy
    @1introvert_guy 22 days ago +1

    12:15 This is such a marketing graph (well, because it is!). But I hate these graphs :/ especially because I can't see the numbers or more details.

  • @Matlockization
    @Matlockization 21 days ago

    It was very interesting that you displayed who used accelerators, and how many. I don't see why Intel can't populate their P & E cores in a grid with GPU cores right now. However, I think AMD is closer to this in practice than Intel. Obviously, I have concerns about latency.

  • @PterAntlo
    @PterAntlo 21 days ago

    I wish them the best, but that sounds very much like what Intel said with Larrabee: you don't have to adapt your program, just recompile it and our compiler/lib/JIT will do the rest. And well, that didn't work out as well as everyone hoped.

  • @quantumbacon
    @quantumbacon 22 days ago

    Ian, I think you might be giving people the impression that FP64 makes calculations at 64-bit precision.
    This is incorrect (the FP64 significand is only 53 bits).

  • @bayanzabihiyan7465
    @bayanzabihiyan7465 22 days ago +2

    Doesn't MI300X (and MI300A) have superb FP64 performance while having the memory BW to support it?
    You mentioned Nvidia, but AMD is, I believe, a bigger player in HPC; they power some of the world's best HPC supercomputers.

    • @TechTechPotato
      @TechTechPotato  22 days ago

      Based on total compute, yes, but AMD is only in a small handful of (top) systems.

    • @ProjectPhysX
      @ProjectPhysX 22 days ago +2

      Yes, MI300X is 82 TFlops vector FP64, and 163 TFlops matrix FP64. That thing is a beast, and it will be hard for a startup to become even remotely competitive.

    • @xpk0228
      @xpk0228 21 days ago +1

      AMD will probably do better than Nvidia in HPC, since they did not gut their FP64 path like Blackwell did. Also, there is less of a software issue there.

  • @philflip1963
    @philflip1963 22 days ago

    The Road Not Taken
    By Robert Frost
    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;
    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,
    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.
    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I-
    I took the one less traveled by,
    And that has made all the difference.

  • @jedijackattack3594
    @jedijackattack3594 22 days ago +1

    So it's a feed-forward DPU. We have had these for ages, and I don't think it's going to help with most HPC tasks.

    • @foobarf8766
      @foobarf8766 21 days ago

      If you mean the IBM/DARPA thing, that was never going to go retail. But now that OpenCL is a thing, this might have a chance?

  • @jimtekkit
    @jimtekkit 21 days ago

    I'm hoping like hell that Radeon will bring back some FP64 compute performance to the masses with UDNA. Nvidia severely nerfed it with Maxwell, and even many Quadros are nerfed; the upsell is insanely steep. Radeon aren't much better right now, with their focus on CDNA for that type of workload.

  • @incription
    @incription 22 days ago

    It doesn't accelerate AI in any way, does it? Just to make sure

    • @TechTechPotato
      @TechTechPotato  22 days ago +1

      Only at full precision, not reduced precision modes

  • @ABaumstumpf
    @ABaumstumpf 21 days ago

    I mean, that is what branch predictors are already doing. And everything you and they have presented so far sounds exactly like a CPU with an FPGA and some fixed-function blocks, which falls flat in terms of performance compared to the more normal vectorisation approach for most cases, but can be faster if the workload is not your normal memory-intensive task but rather needs some more complex operations for which you have extra blocks (some extra trig hardware, etc.).
    And really? Code is mostly taking the most-likely path? XD

  • @JohnJohn-ts6ux
    @JohnJohn-ts6ux 22 days ago

    Hi sir, love your videos very much, I admire your hard work, thank you so much again. Could you please do a video on the MediaTek 9400 CPU vs the Snapdragon Elite? I'm thinking of getting a Samsung Ultra 25, or possibly an Oppo high-end flagship smartphone with the MediaTek 9400 CPU. Which one performs better? Thanks for your time, keep it up 😀😀

  • @MasamuneX
    @MasamuneX 21 days ago

    What if we made an ASIC that just "changes"?

  • @acasccseea4434
    @acasccseea4434 21 days ago

    Sounds like branch prediction 😅

  • @kilngod1943
    @kilngod1943 21 days ago

    AMD accelerators get 3x better FP64 compute than Nvidia; there is a reason national labs are buying AMD-based supercomputers.

  • @TheoneandonlyRAH
    @TheoneandonlyRAH 22 days ago

    this is nice!

  • @vogue43
    @vogue43 21 days ago +1

    All that about flow was pretty much the ... before profit. It explained nothing. Magic happens, perf goes to the moon, trust me bro.

  • @MrAndrzejWu
    @MrAndrzejWu 21 days ago

    OK, it sounds interesting :)

  • @firsttyrell6484
    @firsttyrell6484 22 days ago +3

    This chip looks like a nightmare to optimize for. Look, on the first run this part of the code was slow, so I'm going to optimize it. On the next run that part of the code doesn't matter anymore due to hardware magic (optimization), but the code is still slow in some other place instead; back to square one.

  • @oj0024
    @oj0024 22 days ago

    Does the number 0.8373 mean anything to you?

  • @cj09beira
    @cj09beira 22 days ago +6

    Kind of a shame CDNA wasn't mentioned at all, when it's much more HPC-focused than the Nvidia counterparts

    • @TechTechPotato
      @TechTechPotato  22 days ago +7

      More content to come ! :)

    • @rb8049
      @rb8049 22 days ago

      Does MATLAB run on it?

    • @cj09beira
      @cj09beira 22 days ago

      @@TechTechPotato Btw, any plans to talk about SOI? its been absent of late since GF gave up on 7nm. With all this new quest for high performance, I wonder why what seems like an "easy" avenue for a frequency and/or efficiency boost isn't being used.

    • @JorgetePanete
      @JorgetePanete 9 days ago

      it's*

  • @jonathanjones7751
    @jonathanjones7751 22 days ago +1

    Ponte Vecchio did 52 TFLOPS of FP64, but Intel sunset it. Was it more the hardware or the software that limited its adoption?

    • @TechTechPotato
      @TechTechPotato  22 days ago +4

      A bit of both, but also the theoretical memory bandwidth was almost impossible to achieve. The Chips and Cheese team even worked with Intel for their coverage and struggled to get >50%.

    • @jonathanjones7751
      @jonathanjones7751 22 days ago

      The memory bandwidth is a great point. 47 tiles or something, and we're seeing memory issues with Foveros with ARL. Thank you for the reply. Hopefully it can be remedied for Falcon Shores, if that is still an HPC part.

    • @xpk0228
      @xpk0228 21 days ago +1

      It's more that the design of PVC is just not good. From what we see in Aurora, the 52 TFLOPS is a peak figure and unsustainable under real-life conditions. MI250X, on the other hand, can do 45 TFLOPS consistently in Frontier.

  • @alexcastas8405
    @alexcastas8405 21 days ago

    "Applications run orders of magnitude faster"... big claims

  • @RicoElectrico
    @RicoElectrico 21 days ago

    I wonder if Intel will acquire them, only to sell them off 5 years later.

  • @pcoverthink
    @pcoverthink 21 days ago

    The L1 size is a huge red flag for me. Money can buy good nodes and a lot of HBM, but this L1 amount sounds like BS

  • @AhmadAli-kv2ho
    @AhmadAli-kv2ho 22 days ago

    There's 256-bit floating point?

    • @lbgstzockt8493
      @lbgstzockt8493 22 days ago

      Theoretically you can have any power of two for your size; it just gets really impractical really fast. Pretty much nobody does more than 256 bits.

  • @ultraveridical
    @ultraveridical 18 days ago

    Another video, another mention of "clients". These are becoming more and more like ads, and with the disclosure near the end.

    • @TechTechPotato
      @TechTechPotato  18 days ago +1

      This video isn't an ad. But good try though. I'm an analyst and consultant. All my clients, past and present, are listed in the description. I'm very open about this.

  • @foobarf8766
    @foobarf8766 21 days ago

    Intel and AMD should be here with products like this... where are they? Smoking blockchains behind the bike sheds again?

  • @LogioTek
    @LogioTek 22 days ago +3

    Radeon VII still good then?

    • @TechTechPotato
      @TechTechPotato  22 days ago +5

      Efficiency ain't great, and the software stack needs work, but zoom zoom

    • @LogioTek
      @LogioTek 22 days ago

      @TechTechPotato Yeah, tell me about the AMD software/driver stack. I sometimes get AMD driver crashes just from playing YouTube videos on my 7950X3D iGPU. When I actually edit videos it becomes a nightmare.
      From my tinkering several years ago, Radeon VII efficiency doubles from reducing core and memory clocks by 25% each.

  • @DS-pk4eh
    @DS-pk4eh 21 days ago

    I thought AMD had good hardware with 64-bit FP support

  • @toddstewart4579
    @toddstewart4579 11 days ago

    Please stop using Twitter.

  • @shieldtablet942
    @shieldtablet942 22 days ago +2

    This smells like vaporware to get investor money. If you really have this (which seems more like SW than HW), you are making bank or getting bought for billions and landing at AMD or Intel.
    Even compiler auto-parallelization (which this seems to be) has not been cracked in 15+ years. The best we have is Nvidia's threading model and stuff like OpenMP, which, when I worked in the field, was always losing to MPI.

  • @sixteenornumber
    @sixteenornumber 19 days ago +1

    You could have at least attempted to be critical of their claims. This just sounds like an advertisement for venture capitalists.