Great video honestly, give the marketing guy kudos. I know it's a marketing video with all the setups for the founder to easily lay up, but it almost felt like an actual educational interview.
That being said, going into the challenges faced by Nvidia only makes me more impressed by them, not less. Blackwell is an amazing feat and I'm excited to see the improvements in Rubin.
A fault-tolerant design that avoids cutting something apart only to reconnect it later is simple and genius. The Internet was also built on the assumption that it keeps working even if parts fail, and look how well that scaled! Happy that they are succeeding with a truly different approach to chip manufacturing.
I guarantee Nvidia's GPUs can have faulty cores. If enough shaders or compute units don't work, Nvidia sells the chip as a cheaper, lower-performance part, a practice called "binning."
The Internet has redundancies.
@@frankdelahue9761 The internet is also extremely slow and inefficient.
It takes a genius to know a genius, and you are not a genius.
@@eglintonflats But by your own axiom how would you know that?
Nailed it! With thousands of connections between these chips, alignment is not easy! Cerebras is clever to take the WSI route to circumvent this issue!
Both companies suffer from thermal load-balancing issues. They are just spreading FUD! Pure spin-doctoring.
@@GodzillaGoesGaga please explain
@@GodzillaGoesGaga Thanks for introducing me to the term spin doctor👍
This was very informative thanks. Looking forward to trying out your chips.
You can stop fantasizing. One of these wafer-scale chips consumes 20 kilowatts and people estimate it costs $2-3 million, plus all the cooling and power supplies and interconnects. It goes into a custom rack in a data center.
Sometimes you have to see the forest for the trees. Ingenious solution to latency, thermal, and manufacturing issues.
Flexible connectors in chips to compensate for different thermal expansions is interesting. I wonder if a future approach could also be to never let the chip cool down much after initial installation, i.e. keep an idle load applied so it does not cool off. Accepting a limited number of big thermal cycles might be a worthwhile tradeoff if it buys wiggle room in other areas. It would not work for smartphones or end-user devices with peak loads and many off periods, but on a server farm it sounds reasonable.
The tradeoff there is energy consumption. Heat is a byproduct of power leakage, which is inherent in all silicon circuits. If you artificially load processing cores just for the sake of maintaining a thermal profile then you're wasting energy, which also isn't cheap, especially at the scales they're talking about here.
Are you going to pay that energy bill?!
@@mdswish First, there is a new-fangled invention called fan-speed control.
Second, the sort of tasks being done on devices of this class are generally long-running and scheduled ahead, with customers waiting and bidding for time. The downtime between runs is insignificant.
Creep failure
Here are the Nvidia solutions as I see them. 1) Load balance. This will avoid differential thermal expansion issues; an IR camera can be used, and the s/w and microcode guys can look at this. 2) Shut down any cores that get too hot and let them cool to ambient (i.e. active thermal management at the package level). 3) Do redundant compute at the rack level. This will guarantee that if anything goes wrong, the results are still correct. TMR can be done, but RAID-style redundant compute is more efficient; go with whichever is more efficient on compute. You sacrifice a bit of performance for reliability. These are all tractable problems that can be solved. New technology always has stuff like this, and engineers find creative solutions. That's what we do!!
BTW load balancing is relatively easy, since I'm sure the CUDA tools can do profiling/performance coverage. Once profiling is done, you pair up the processors with the CUDA tasks so that they have similar profiles from a performance perspective.
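A rough Python sketch of that profiling-then-pairing idea, purely illustrative: the kernel names, costs, and temperatures below are invented, and a real scheduler would use live telemetry rather than a static table.

```python
# Hypothetical sketch: assign the heaviest kernels (by profiled cost) to the
# coolest dies so thermal load stays roughly even across the package.

def balance(kernel_costs: dict[str, float], die_temps: dict[int, float]) -> dict[str, int]:
    """Map each kernel to a die: heaviest work goes to the coolest silicon."""
    kernels = sorted(kernel_costs, key=kernel_costs.get, reverse=True)  # heavy first
    dies = sorted(die_temps, key=die_temps.get)                         # cool first
    return {k: dies[i % len(dies)] for i, k in enumerate(kernels)}

if __name__ == "__main__":
    profile = {"attention": 9.1, "mlp": 6.4, "layernorm": 0.7, "softmax": 0.5}  # ms per call
    temps = {0: 71.0, 1: 64.5}                                                   # degrees C
    print(balance(profile, temps))  # attention lands on die 1, the cooler one
```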
You could try to use the Nvidia Broadcast feature to get rid of room noise and echo
I know what you mean, but this was minor by comparison. I was once in a law tutorial temporarily moved to an unfamiliar room with a group of 8 or so, and the tutor and the tutees had to abandon it because the literally cube-shaped room's reverb/echo made it impossible to communicate on a group level. It was headache-inducing and near impossible to maintain visual concentration while listening coherently at the same time. It was an all-white room, so it might have been usable for art, photography, design etc.
Lmao 😂
Thanks for the video. Transformers are now HBM-bound due to the KV cache; what's the per-core bandwidth of Cerebras compared to HBM3e or beyond?
Incredible explanation. Please do another asap!
Nice video. Curious about one part. You mention having logic & memory (L/M, time 12:00) right next to each other as being a great innovation. Yet how is this any different from NV shared memory, that is, fast memory local to a group of cores? Could you elaborate on why your logic-memory design is better?
20:20 "in the center .. you have direct connection", and assuming meaning the (I/O) pins there only (not on/near the edges, because of expansion). You must send in data (and calculated output from) to all the million cores from there then. Since the cores are small, then this is very special purpose (likely no caches or global address space), though you have good bandwidth no next code, but different latency depending on how far.
Very informative, thank you
Thanks for the great explanation J.P.!
My knowledge of the complexities involved in chip manufacturing has grown tremendously over the past 5 days. This video is the icing on the cake: I believe the engineers at Nvidia may end up drowning trying to drastically change the architecture of the processed wafer, which, as I recall, is a game of chance. Meanwhile, in its gaming GPU market, Nvidia has cherry-picked the best GPUs to fit in its top-of-the-line cards. So for every point that was made in the video, I HOPE the engineers at Nvidia are listening!
If NVIDIA's earnings next week are anything like their GPUs, get ready for something in high demand and even higher in value
14:12 - From my understanding, Cerebras is a new semiconductor fab, like Intel or ARM, building computers from scratch rather than using a common architecture. A great way to be innovative, and it would require a lot of testing and validation since it is a completely new process of making computers. Basically an SoC on a wafer, 100% custom made.
ARM doesn't make or sell chips at all; it licenses the ARM instruction set architecture and chip designs implementing the ARM ISA to other companies. Nor is Cerebras a semiconductor fabricator: it has a fab etch its wafer-scale "chip." In 2019 Wired wrote "to construct its giant chip, Cerebras worked closely with contract chip manufacturer TSMC."
@@skierpage I think his point was they're taking a completely different approach which would need a whole different manufacturing and integration process. Not easy to pull off.
Very interesting presentation. But I am sure Nvidia experts might have a different story. And I think that software support for AI processors is also very critical going forward. Nvidia has CUDA behind it, which is a huge bonus for them. In fact, the reason that RISC was in the shadow of Intel's x86 architecture is precisely because of software stack issues. Nevertheless, one can do nothing but admire the Cerebras team for their vision and innovation.
If anyone knows a good book or review article about modern GPU/NPU/TPU architecture (not CPU) please post in reply to my comment; I really appreciate it!
Ouf... 10 to 15um external connections on a separate silicon wafer! That is a challenge. Not sure how AMD is doing it, but as with any new tech, innovative new solutions will be needed. Great video guys.
Very impressive explanation... definitely the way forward for large scale AI.
It's not definite at all. Look at the distance signals have to travel across the wafer. As and if fabs hit fundamental circuit etching limits around the 1-nanometer feature size, the way to increase circuit density is to go 3D. If TSMC and Nvidia solve this generation, they will be well-placed for future generations. And there's no explanation here of why AMD and others have been successful with chiplet packaging. Meanwhile, Cerebras seems to have only a handful of customers building research supercomputers.
fantastic presentation
Is the 44GB of SRAM distributed across the wafer enough for data fetching in training and reasoning? It seems that Cerebras removes the HBM or DDR and instead loads data directly from NVMe into its on-wafer SRAM?
Please make a cerebras brand wafer cookie! Preferably wafer-scale too.
Great talk. Very informative.
How are you routing external data to the local memories of all the cores? How sensitive is this routing to wafer faults and heating during operation?
great video, I'm waiting for the IPO
How does the SRAM in WSEs compare to HBM on conventional GPUs? Even GPUs have on-die SRAM caches with high bandwidth and low latency. The packaging challenges come from packaging the GPU dies with HBM chips, right? It's not possible that WSEs do not need HBM (right?)
Assuming WSEs need HBM dies, wouldn't similar packaging challenges arise even there?
What are the memory sizes supported by your devices?
The people at TSMC are working 12 by 7 right now. Miracle workers. Can they save the launch?
Can you please explain how you could fabricate logic and memory on the same die? Don't they have different fabrication methods? Memory doesn't scale as well as logic, right?
Any GPU and CPU has memory in the form of cache already.
It's SRAM, which uses the same process as logic.
@@dudi2k So *all* the logic and memory are on the same die?!? You realize there is a space limitation, right? You're going to run out of room if you're trying to make a cutting-edge chip.
You're still going to need interposers to connect the memory or IO die to the logic die. Period. Or greatly cripple your logic performance.
@@aqualung2000 That doesn't seem to be true for AI workloads. Nvidia GPU scaling is almost linear up to thousands of GPUs.
So, as long as you have enough memory, performance will be fine.
If I'm not wrong, IBM designed a chip with very limited memory per GPU, and it works well for inference; at 12nm it can be more efficient than the newest Nvidia GPUs.
Wonderful discussion. Thank you
I appreciate what Cerebras is doing for AI chip making.
But unfortunately I can never benefit from it as a direct consumer.
I can't buy a wafer and use it at home like I can Nvidia, AMD and Apple products, etc.
Thank you for this information. I hope you find success. But I can't be as excited for your company as I would be for another.
Great explanation!
Why doesn't he explain the modularity and yield advantages of heterogeneous advanced packaging? What about Cerebras at 5nm? How many cores will be disabled? Won't optical alignment suddenly be trickier, and getting a usable wafer exponentially more difficult? With a modular approach you can do quality control on the individual parts.
great presentation.
Interesting solution, however what about the power consumption of such a device (equivalent to ~51 GPU chips?). At 1 kW per GPU that means tens of kW to deliver to this mega-chip, which seems a challenge, not only at the VR module level but also at a higher level (blade level). Does it also mean that the high-speed interconnect for scaling up Cerebras GPUs is limited to ~50 chips, or can it be extended, as with Nvidia NVLink, to 1k or 2k chips (or more)?
great innovation. will love to explore the software optimizations to run an LLM on this...
Wonder how the wafer scale engine approach impacts the form factor for server racks etc. This is fundamentally changing the size of the PCB and hence all HW elements of the system. Any insights on this?
They fit in the same racks.
When are we getting a Cerebras GPU for consumer gaming and workstations, to compete with Nvidia in gaming, VR, AI, etc.?
Never; each of these chips costs more than most people's net worth.
what (if any) architectural trade-offs had to be made to accommodate waferscale?
what is the practical limit on core size?
I wonder this too … yes, a very small amount of memory is very close to a very small amount of compute … but now some large proportion of memory is very far from some large proportion of compute … in quite a variable way for each unit of compute/memory … so more software logic must be needed to account for all these differences in latency, and wouldn't you have to throttle to the highest-latency connection?????
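A tiny numeric illustration of that "throttle to the slowest link" worry, with made-up latencies: if a group of cores exchanges data in lockstep, each step takes as long as the worst per-link latency, not the average.

```python
# Hypothetical per-neighbor link latencies for one core, in nanoseconds.
link_latency_ns = [2, 3, 3, 4, 9, 2, 3]

average = sum(link_latency_ns) / len(link_latency_ns)
lockstep_step = max(link_latency_ns)   # everyone waits for the slowest exchange

print(f"average link: {average:.1f} ns, lockstep step time: {lockstep_step} ns")
```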
@@SpindicateAudio i think also there might just be an issue with total amount of data you can push through the wafer.
i would expect there to be an effect of core size on total throughput, though I am not familiar enough with their layout to make any concrete prediction of what specifically it would be
@@ChrisJackson-js8rd yes yes, they will still need stacks of hbm “off-chip” … so latency / power consumption improvements will only apply to the fraction of the memory on-chip … so gains would only be marginal …… that said, if it also solves various packaging problems those gains might be enough
@@SpindicateAudio i suspect they might be doing something very clever and leveraging the machine state as a sort of "pseudo-cache" in a way that you can't on traditionally architected systems.
but this is speculation on my part
i don't have a system to test
The stacked silicon has solder bumps. Due to a meniscus forming during the soldering, they are self-aligning. Nvidia can solve the thermal issue by load-balancing the GPUs, making sure they are doing the same amount of work. This will keep the thermal levels balanced. The 'differential thermal' is the issue here. Big wafers have problems too, since the interconnect can have defects as well. This means you may get load-balance issues across the wafer which could cause differential thermal issues and cracking in the wafer. Same problem.
It should be noted that Nvidia's GPUs have local memory on them too. They seem to be extolling a virtue (that they perceive as unique to them) that their competition already has! Nvidia GPUs have many cores and many levels of local memory. This is how a GPU works as a tiled processing unit. Cerebras are trying to spread FUD.
@@GodzillaGoesGaga That is called on-die cache; nobody was hiding it, and they even mentioned existing fault-tolerant static RAM.
Cache capacity is orders of magnitude too small to handle most tasks as the only memory, and static RAM is orders of magnitude too expensive in die area for main-memory capacities. Though they didn't mention the capacity or type of integrated RAM on their hybrid cores.
@@mytech6779 Not just on-die cache; there is local SRAM for calculations and register storage.
@@GodzillaGoesGaga Correct, registers are normally a form of SRAM; this is true on basically every CPU, GPU, and MCU since the invention of the integrated circuit.
"local SRAM for calculations" is called cache.
@@mytech6779 Local SRAM for calculations is NOT called cache. A cache is temporary storage between main memory and internal memory and is typically used for register transfer. If a register is used a lot, it remains in cache and is then updated once the system can find an appropriate timeslot. There are two mechanisms, write-through and write-back. Local SRAM for calculations is different: it is typically close to the ALU(s) and is part of the vector processing engine.
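A toy Python model of the two write policies named above (write-through vs write-back), only to illustrate the terminology; it does not model any real GPU or WSE memory hierarchy.

```python
class Cache:
    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.lines: dict[int, int] = {}     # address -> cached value
        self.dirty: set[int] = set()        # only used in write-back mode
        self.memory: dict[int, int] = {}    # stand-in for main memory

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)            # defer the memory update
        else:
            self.memory[addr] = value       # write-through: update memory now

    def evict(self, addr: int) -> None:
        if self.write_back and addr in self.dirty:
            self.memory[addr] = self.lines[addr]   # flush dirty line on eviction
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

wb = Cache(write_back=True)
wb.write(0x10, 42)
print(0x10 in wb.memory)   # False: main memory not updated yet
wb.evict(0x10)
print(wb.memory[0x10])     # 42: the dirty line was flushed at eviction
```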
@Jean Pierre If alignment of pins on memory & GPUs needs such high precision, is it possible to 3D print the entire design? The silicon between memory pins & GPU pins would then be in exactly the right place. Thank you
That’s what they are doing
heat warping bridges, no bridges when too small, any solution?
Just as I imagined. But do you create the memory-let yourself?
What about the extreme cooling needed to run these Cerebras wafers? Isn't 2/3 of the entire rack just cooling?
That's true for Nvidia DGX systems as well. Supplying 15 kW of power and 10 kW of cooling is hard.
That is an ideal CEO; he knows the stuff he is CEO about. Weird that Nvidia cuts 2 dies and then glues them back together again. I can understand that they support other products with these cores, but they might as well cut fewer dies to support the shown architecture.
Building blocks for Modularity
Cadence Ansys
Why doesn’t Cerebras beat Nvidia in sales? Is it missing a software developer community or lacking investors? Take the company public to accelerate its growth!
Simple question: do you believe in microservice or monolithic design? Even if I believe everything said here is right for now, I will still choose modular design. Memory and CPU can develop differently, probably at different sizes and with different materials in the future. Now you're trying to combine them in one piece, probably of the same material. I am not a chip expert, but I feel modular design will always have its place.
Eye-opener on what is happening with Nvidia; need to buy stock when the IPO comes.
Better to buy before the IPO ;-)
Can I play Need for Speed: Unbound on this chip?
In theory it sounds amazing. Would love to hear about the tangible results of all these good ideas.
How. Do. I. Buy. Cerebras. Stock.
Can anyone explain what keeps TSMC/NVIDIA from replicating this approach, given it seems technically and economically viable? I have watched several seminar sessions from their execs, but couldn't tell what technical barrier sets them apart.
It would require a break from NVIDIA's existing production chain, part manufacturers, and current expertise, which is a nontrivial cost and time sink, and thus would likely take them some time to perfect. (Designing this chip is also less trivial than they make it out to be, of course.) NVIDIA is probably also delayed by uncertainty about the viability of Cerebras, which makes them hesitant to jump ship completely, thus splitting their efforts between mimicking Cerebras and further upgrading their current-generation graphics cards on the same technology line.
Fundamentally, NVIDIA/TSMC can do it, however. For Cerebras, though, having done the research ahead of time gives them an advantage for a while. Perhaps this ultimately ends in them being bought out five years down the line, or it leads to Cerebras branching out into other areas. (Shattering their wafer into smaller chunky dies for consumer GPUs?)
There is the additional side benefit that it provides more competition for NVIDIA, which should help drive down the high profit margins NVIDIA holds, which is nice for the people/companies purchasing them.
I thought Cerebras does use TSMC for their fabrication, no?
TSMC is Cerebras' fabrication partner, so TSMC has proven themselves well able to execute on this approach. For NVIDIA it likely has to do with inertia and the existing design and process they've committed to: they've been building discrete GPUs originally targeted as consumer products since 1993 and only relatively recently been growing their designs up into the supercomputing space. All of their prior art and experience has essentially been single die, single GPU with external memory, and they seem determined to plow ahead with trying to figure out how to push this approach to ever more demanding levels of compute. I have no doubt NVIDIA has the sheer talent and expertise to execute something like the WSE, but history has shown time and time again that companies become risk-averse as they become successful, and retraining/retooling for wafer scale would represent a significant disruption of their existing product pipeline. Cerebras approached the problem with a clean sheet and went top down for supercomputer/data center style computing, and in fact will likely never scale down to anywhere near a consumer product. Cerebras may have also staked out territory in terms of trade secrets and patent protection that will prevent competitors from copying some of their innovations - wafer-scale integration has historically been considered a dead end after some spectacular failures in the 1970s and 1980s, including perhaps most famously Gene Amdahl's Trilogy Systems. It's important to keep in mind that "wafer-scale systems" back in the 1980s meant something more in the range of 100mm wafers vs. the 300mm wafers Cerebras has used for the WSE line, so Cerebras has not only revived an approach that had been considered practically impossible but done so at a much larger scale and complexity than their predecessors.
@@TheParadoxy yes, but that's just another reason Nvidia can trust their foundry to deliver the same.
@@nicholasolsen2177 thanks for your response. I am not arguing against the value that Cerebras is bringing to the industry here. I am just trying to assess whether Cerebras truly has an edge to make them compete with Nvidia and others in the future AI computing platform.
Yes, Nvidia could continue to be risk-averse and not fight against its inertia, but sheer computing power is only part of what makes Nvidia irreplaceable today. IMO it's the entire ecosystem for AI developers, including the CUDA language, SDK, libraries, etc.
We have the same inertia problem for developers too: they will have to adapt to the new programming platform provided by Cerebras. Yes, they will learn and adjust if the performance indeed outperforms, but it will take time for numerous developers to adopt the Cerebras SW platform as mainstream. If the WSE turns out to deliver high computing power with high yield, I am speculating that it may be faster for Nvidia to replicate that approach for part of their server-based GPUs while maintaining the entire SW platform for programmers. I guess only time will tell. Let me know if I missed anything.
good video
Hi there, can we build such a chip with the right infrastructure and connections with TSMC to fabricate it?
Is memory bandwidth such a big deal when external bandwidth somewhere else will obviously be the next bottleneck? E.g., you would need to build a complete SoC with memory in a single-die package, but you will be limited by the fastest Ethernet protocol and connection methodology, limiting usefulness.
This right here is why NVIDIA will miss earnings next week. Guidance will be lower due to the Blackwell delay here which is really not easy to solve. This is not a trivial problem - you can only go so small without breaking anything to endure that amount of heat.
Amazing.
How can we find out who's adopting Cerebras? I would have thought they could sell 100% of what TSMC is capable of outputting without even trying. Is that not true? Is this x86 architecture?
Microsoft Bing CoPilot links to the CS-3 announcement which says Argonne National Laboratory and "G42" in Abu Dhabi are using it, and mentions GlaxoSmithKline, Lawrence Livermore National Laboratory, and the National Energy Technology Laboratory. Cerebras claims "Cerebras already has a size-able backlog of orders for CS-3 across enterprise, government and international clouds." Well, as with any startup, orders are cheap; it's shipments with revenue that matter. Unfortunately for Cerebras, I don't see any indication that any top AI company developing a frontier multi-modal model is working with them.
TSMC fabricates 3+ million wafers a year; there's no way the demand for a massive complicated AI supercomputer will ever be that high.
Haha, Intel has licensed the x86 architecture to very few companies, basically just AMD and VIA Technologies decades ago. And that bloated CISC instruction set with all its extensions and different register sizes and vector add-ons is a terrible fit for 850,000 tiny "Processing Elements" on one wafer each with only 48 kB of local SRAM. Cerebras hides it, but from the Cerebras SDK Technical Overview, each PE has a microarchitecture that implements a custom instruction set which seems undocumented. Instead you write a custom program in the C/Zig-like Cerebras Software Language that tells each PE exactly how to manipulate tensors. Then you write a control program for "the host" which specifies which rectangle of how many PEs on the wafer you want to run your code, and "the host" loads them up. Fascinating, but it doesn't run Nvidia's CUDA GPU tools, and figuring out how to adapt existing machine learning software to it will be... challenging. National labs can develop custom software for it, just as they write their own code for their more conventional CPU+GPU supercomputers.
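A hypothetical host-side sketch of the programming model described above: pick a rectangle of PEs, split a tensor into per-PE tiles, and give each PE its tile. This is not the real Cerebras SDK API (that uses the CSL language and its own host runtime); the names and shapes here are invented purely to illustrate the partitioning idea.

```python
from dataclasses import dataclass

@dataclass
class Tile:
    pe_x: int
    pe_y: int
    rows: tuple[int, int]   # row range of the tensor owned by this PE
    cols: tuple[int, int]   # column range of the tensor owned by this PE

def partition(tensor_shape: tuple[int, int], rect: tuple[int, int]) -> list[Tile]:
    """Split a 2D tensor across a rect of (width x height) PEs, one tile each."""
    (rows, cols), (w, h) = tensor_shape, rect
    tiles = []
    for y in range(h):
        for x in range(w):
            tiles.append(Tile(
                pe_x=x, pe_y=y,
                rows=(y * rows // h, (y + 1) * rows // h),
                cols=(x * cols // w, (x + 1) * cols // w),
            ))
    return tiles

if __name__ == "__main__":
    for t in partition((1024, 1024), (4, 2))[:3]:
        print(t)   # each of the 4x2 PEs owns a 512x256 block of the tensor
```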
Dude its like a layered pyramid
PLEASE have someone do a noise filter pass on the video. It's impossible to suffer through that noise.
How do I invest in Cerebras? 😂
going to go IPO as CBRS apparently, just filed for it, don't think a date yet
If Cerebras has better yield and only advantages, why doesn't it have more market share? Why isn't it cheaper than Nvidia?
It's a new company with a different custom programming model. It seems U.S. national labs and a few other customers are going to write their own software in Cerebras' special CSL programming language to directly code tensor operations for those million little Processing Elements on each Cerebras wafer. But the big AI companies seem to be sticking with scaling what they know onto more and more Nvidia chips (and Google's own TPU chips).
It depends on where the issue is. There are DRAM vendor issues and core issues. If you put them together, time to market becomes another problem.
"the devil is in the details"
Enjoyable
EPFL 🚀🚀🚀🚀🚀🚀🚀
Enter the electron microscope
I guess Intel is able to do something very similar to CoWoS-L using EMIB… so it's not fundamentally impossible
EMIB is equivalent to CoWoS-S
They could create a self-aligning seat instead of mounting on a flat surface.
Nvidia might buy this company. Is this what they want?
If so, Nvidia needs more than a few months' delay.
cerebras... all... in...
"David defeats Goliath."
GAZUA (let's go!)
10 years behind Nvidia - so sorry
Weird to be talking about another teams delay
Niche company. Can go nowhere
yes, but your 'chips' only have 40GB of memory
40GB of on-die SRAM, so in comparison to GPU designs one should think of that more like having 40GB of cache. NVIDIA doesn't even publish specs for their on-die SRAM/cache sizes, so one can assume it's not worth bragging about; by your definition NVIDIA's chips have 0GB. I'm sure you're aware that HBM3 is extremely slow compared to the memory on WSE-3; it's an order of magnitude difference. For those who can read, the WSE-3 specs state up to 1.2PB of external memory.
@@nicholasolsen2177 SRAM (on-chip RAM), compared to HBM (off-chip RAM), is much faster, but also a lot more expensive.
@@nicholasolsen2177 But they didn't address the problem of how they are interfacing with that 1.2PB of external memory ...
@@carloslemos6919 External memory interfaces are a solved problem and have been for decades. If you want them, the controllers are off-the-shelf parts or EDA tools. I'm sure it would be interesting to hear more details about Cerebras' fabric and external memory interface, but that's one of the least interesting things about what they've accomplished and wasn't the scope of this discussion of wafer scale vs. interposers. NVIDIA themselves are just dropping standard HBM3 controllers and memory chips on their designs.
I know the Asian guy is there to direct the flow of the conversation, but I wish he would let the albino finish talking before interjecting; it was very awkward watching that. Also not a good idea to be wearing a white shirt when you're already a white guy with white hair standing in front of a whiteboard.
Yes, but ordinary people can buy GPUs; what about your chip? No one can buy it.
the Nvidia B200 is a data center GPU, ordinary people won't be able to afford it! 😂
Oh I'm sure I'll walk down to my local Best Buy and get a $45,000 B200 GPU /s
Can't wait to play Crysis on my WSE-3.
*Turns on PC.
The lights dim.
Granny's pacemaker stops.
A flock of birds fall from the sky.
The screen flickers and a pop-up message appears: "you're about to summon Skynet. Continue?"
What does that have to do with Cerebras' business or target market? Looks like some PC parts review site must have linked to this video.
I'm not going to watch,
do AI because nothing will happen anyway,
and AI will explain your delay