"It can draw out data in a few nanoseconds." No. SRAM can draw out data in fractions of a nanosecond. In 3nm-7nm, small, even stock TSMC SRAMs (use in an L1) can manage 250ps clock to data. Larger macros (used in an L3) can still achieve
Your statement about AMD and cache is incomplete, so much so that it can give people who don't understand how CPUs and cores work the wrong impression. SRAM is used in cores for registers along with cache. Registers are a collection of latches just like SRAM is, although the configurations will be different. There aren't many registers compared to cache. Registers can NEVER be moved off the main core die, because they are integral to the operation of the core; you need data registers to do pretty much ANYTHING in a CPU core. Registers are temporary storage for instructions, and they are tied into the instruction logic. A very simple thing like A + B could mean getting data from memory to add together and storing an answer back in memory, or it could mean these values are constants built into the instruction. In either case, it's data that can be put into registers; you then add those registers together and load the answer back into another register. The very next instruction may need these values, so having the data in registers means the next instruction doesn't have to call that data from memory again. So that's one thing: registers are integral to the operation of instructions, and that will NEVER move off the core die.

The next thing is cache. You mentioned L3 cache, but unless a person understands how a CPU works there isn't enough information there to understand the implication. There are typically 3 levels of cache in a core (and multiple cores in a CPU): L1, L2, L3. L1 is closest to the instruction logic and operates at the same speed. There is not much of it because it's the hardest to make PERFECT, since it has to, once again, operate at the speed of the core. Next there is L2 cache, of which there is more. The problem with having more cache is that searching it takes longer, so the time to fetch from L2 is longer than from L1. The time to pull data from L1 is a single clock cycle; the time for L2 varies based on the CPU type. L2 ALSO HAS to be on the core die, because it's a larger pool of data/instructions that the core has been using and may need again. L3 is the slowest, and also the largest set of cache. THIS is what can be pushed off onto another die, as long as the connections between those two die operate fast enough, which is the key.

Cache replicates what is in memory, but there's a small amount of cache and a large amount of memory. As a core needs data, if that data hasn't been accessed already it's either on disk or in RAM (main memory). If it's on disk, it will get loaded into memory first, then pulled into the core. It HAS to go into L1 to be used by the core. Caching methods vary, so I will give ONE method of caching. There is only a small amount of L1 cache, which runs at the speed of the instruction logic. There is also separate cache for instructions and cache for data. Instructions of course run incredibly fast, as in a VERY rough figure of about a billion a second. This varies because there are wait times, so you can't simply take a clock speed and say there is one instruction per clock cycle; that is why I'm giving a VERY rough figure. But say 1 billion a second. In that time period L1 will have been changed out millions of times, because L1 doesn't store much. One way to handle a caching scheme is, when L1 is full and you need something that isn't in L1 (where all data and instructions are dealt with from), to take the data/instructions used furthest back in time (so you need a time stamp for what's in cache) and push that down to L2.

If L2 is full, then you take the data/instructions used furthest back in time and push them down to L3. As was said, L1 - L3 represent data/instructions that came out of main memory. You never need to update instructions, since instructions don't change, but you do update data, and since L1 - L3 can hold updated values for data that's stored in memory, you also need to write those changes back to main memory. So the key takeaway is: L1 HAS to be right next to the instruction logic, runs at the core speed, holds data and instructions each in a separate space, and HAS to be on the same die as the cores. L2 ALSO has to be on the same die as the cores, because you often need to access data/instructions you've already used but that got pushed down to L2 since L1 is small. L3 is the ONLY cache you can push onto another die, AS LONG AS you can clock the connection between the two die fast enough.
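A toy model of the eviction scheme described above, assuming plain LRU replacement and tiny made-up capacities; real CPUs use set-associative structures and more complex replacement policies, so this is only a sketch of the "push the oldest line down a level" idea.

```python
# Each level keeps the most recently used lines; whatever was used furthest
# back in time gets pushed down to the next, larger, slower level.
from collections import OrderedDict

class CacheLevel:
    def __init__(self, name, capacity, next_level=None):
        self.name, self.capacity, self.next_level = name, capacity, next_level
        self.lines = OrderedDict()          # address -> data, oldest first

    def insert(self, addr, data):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # refresh its "time stamp"
        self.lines[addr] = data
        if len(self.lines) > self.capacity:
            victim, vdata = self.lines.popitem(last=False)   # least recently used
            if self.next_level:             # push the victim down (L1 -> L2 -> L3)
                self.next_level.insert(victim, vdata)

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr]
        return None

l3 = CacheLevel("L3", capacity=8)
l2 = CacheLevel("L2", capacity=4, next_level=l3)
l1 = CacheLevel("L1", capacity=2, next_level=l2)

for addr in range(5):                       # touch 5 lines through a 2-entry L1
    l1.insert(addr, f"data{addr}")

assert l1.lookup(0) is None                 # oldest lines were evicted from L1...
assert l2.lookup(0) == "data0"              # ...and landed in L2
```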
Thanks for this analysis, it provides a very clear explanation of the challenges in modern chip design. I wonder if stacked cache has trade-offs as well, such as latency from being further away from the logic core. We seem to be at the limit now with 5nm. To be honest this is better than I expected; I remember telling a co-worker 10 years ago that 7nm would be the limit. These multi-patterning designs may allow us to get smaller, but at substantially lower yields and higher costs. This is, I guess, why cutting-edge tech is getting more expensive.
A stacked chip can be closer than most of the surface of an on-die cache. The problem is that chip-to-chip vias use much more space than in-chip lines. They use footprint, they constrain line placement and spacing, and they require stronger electrical drive.
I was going to ask why SRAM can't keep shrinking at the same rate as logic (weird since both just use transistors), and what possible replacements there are (STT-RAM, magnetic RAM, etc.)
@@alphar9539 But those will affect the logic circuits as well. I suspect the difference is that SRAM is not clocked at such a high rate, so it is expected to retain the data it handles for a longer period of time.
@@jamesphillips2285 logic circuits do not have to be error free. Errors in SRAM are unresolvable most of the time, while logic errors can be corrected without a fatal mistake for the CPU.
Could something like ReRam overtake on-die SRAM at some point? And I'm told AMD's V-Cache is more dense than equivalent on-die SRAM because the process is optimized for it instead of being also for logic?
A bit worrying: if the development tails off, and we still increase compute demand by 25% year on year or more, I guess we will start using that much more electricity per year. Might be a real problem eventually.
Well, given that semiconductor production shares a lot of the logistics of solar panel production, maybe we can fend off that problem for a few years.
SRAM is something I would not expect most people to know anything about unless they have some knowledge of CPU architecture. That is because these days it is used as the cache that comes as a part of the CPU itself. I do remember a time when it came as a module you could install similar to a DIMM. That was back in the days of the Pentiums when MMX was the thing to have. If you ever saw a Pentium II with the heatsink off you would notice two chips next to the processor. Those were the SRAM chips. Those ones ran at half the speed of the processor itself. Why half the speed? I don't know! It must have been a technical limitation at the time or a strange decision by one or more of Intel's engineers.
One of the worst examples I've personally owned of "Most of the chip is cache" is the Pentium M Dothan - the 2nd version of the P-M with 2MB cache. It's kind of obscene looking.
With all this discourse around memory, could you please do a video on memristors? It looks like a technology that should not be slept on and I think we'd all value your opinion on the matter.
I'm happy improvement has slowed down. Maybe now companies will be forced to actually pay attention to optimization and web devs will have to learn how memory works
Are SEUs and feature size an issue? Yes/no. L1 and L2 cache really need EDAC to avoid the CPU crashing. The SDRAM FIT rate was about one event per megabit per month at ground level; this was researched by IBM and others. 24/7 operation can be an issue.
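To make "EDAC" concrete, here is a toy Hamming(7,4) single-error-correcting sketch in Python. It is purely illustrative: real cache ECC is wider (e.g. SECDED over 64-bit words) and implemented in hardware, and the bit layout below is just the textbook arrangement, not any particular CPU's.

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits; any single bit
# flip (an SEU) can be located and corrected.

def hamming74_encode(d):
    """d: list of 4 data bits -> 7 code bits in positions p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4      # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4      # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """c: list of 7 code bits -> (corrected data bits, error position or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = no error, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome

# A single upset in the stored word is detected and corrected:
code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1                         # simulate a bit flip (an SEU)
data, pos = hamming74_decode(code)
assert data == [1, 0, 1, 1] and pos == 5
```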
Still, I didn't fully understand why shrinking SRAM transistors is a challenge compared to logic transistors. It follows from the video that shrinking any transistor is a challenge, not just the SRAM ones. Or have I misunderstood something?
Minutes 6-8: power leakage. For logic it is just power inefficiency, but for SRAM it is loss of value / instability. Minute 9 onwards: process variation means problems are uneven between different chips. So when SRAM cells shrink they get less and less reliable, which is why in smaller processes you still make an exception and keep the SRAM cells big: you do not want random bit flips in the cache due to power leakage etc.
@@randomgeocacher My understanding is that all those problems (static and dynamic power dissipation as well as static noise) are relevant for any transistor, regardless of its application in a circuit. Why doesn't shrinkage of transistors bring problems for the logic, like an ALU producing incorrect results when summing two values?
@@DenisBazhenov SRAM holds a value; the logic emits values that go into the next gate (transistors). I.e. for logic, a minor deviation from 1 will correct itself in the next D flip-flop or logic cell, while the small 6-transistor SRAM doesn't have this continuously self-correcting property. Honestly I did not study the drawing or the detailed explanation in the video, but I think that if we rewatched it carefully we would see how the 6-transistor SRAM is different/optimized in a way that loses continuous self-correction. I imagine a big SRAM cell built the way it is taught in school (a flip-flop with many gates) would have far fewer of these issues, but then it would take a lot more transistors. So basically designers had to make a trade-off between more logic/transistors in the SRAM or sticking to the popular 6-transistor SRAM; but that cell has reached its minimal size. That's my understanding anyway: the popular SRAM cell cannot deal with power leaks, and that prevents it being made smaller. You either fix physics, or keep it at its current size, or abandon this SRAM cell and go for a more complex SRAM cell without these issues. So it is this particular size-optimized cell that has the problem, and none of the options are particularly advantageous in the current technologies.
@@DenisBazhenov For most logic the output goes into a forgiving next stage, like a D flip-flop or logic gates. If you lose a little bit of power there will be transistors correcting the loss. Apparently the space-efficient 6-transistor SRAM doesn't have this self-correcting property, and thus becomes unstable faster than other parts when shrunk. Maybe the circuit diagram gives some clues as to why? I think the video tried to explain this, hmm.
The SRAM below likely needs to be more stable for critical functions, while the stacked SRAM is for excess capacity. Stability is most necessary for the most critical SRAM.
@@alphar9539 That's not how memory works, nearer memory is used to cache the most recently accessed parts of the next layer, but just because something hasn't been accessed very recently doesn't mean it's less important, indeed many critical operating system data structures will only be accessed occasionally.
Different process/technology maybe? I would guess the CPU dies are built in one technology and have to deal with its constraints, power, heat issues etc. An SRAM cache die built for a single purpose, with nothing else on the die/package, has more freedom to optimize.
@@randomgeocacher looks like the answer from someone else is that the SRAM on the logic layer needs to connect with the logic in a “legacy” manner which also prevents optimized SRAM layout. In contrast the solely SRAM layer is designed only with SRAM in mind and therefore uses a slightly more efficient layout. This actually makes sense as the design of a Ryzen Logic layer likely was developed before 3d SRAM stacking was prototyped. The connections therefore were set up in an efficient manner for the “legacy” design. Maybe a future design could be more efficient and denser
“Have to” implies it is a law of nature, so no? Traditionally DRAM is so much slower and more unpredictable, for example the refresh cycle. So DRAM as we know it today certainly would not replace L1 cache. Maybe some really speed-optimized DRAM variant could tackle L2 or L3. I.e. DRAM doesn't have an obvious way to replace SRAM, but especially for L2 and L3 I wouldn't dare say that designers can never come up with new trade-offs or new designs. So everything depends on whether some smarty-pants developer can make new designs and trade-offs that make sense. The further away from the CPU core, it "only" has to be closer and faster than accessing main RAM to be a good trade-off; the closer it is to the CPU core, the more we want near-zero latency and predictable access times.
Within the logic circuits themselves (as in the CPU) bits can be represented as dynamic charges, very much like in DRAM. A circuit node would be pre-charged in one part of the clock cycle, then discharged or not depending on the input signals to the logic element, and then the result would be sensed by the next level of logic. Preventing the charge from leaking away between the pre-charging of the node and the sensing of the charge later is what sets the slowest clock frequency for such circuits. All of the earliest CPUs were like that. If you wanted to execute one command at a time, you could not just suspend the clock; instead there were special circuits for freezing the execution while all the circuits still cycled at their normal rate. But then some CPUs started to be designed without using such tricks, and were "completely static." In this case one could simply slow the clock all the way to zero frequency without losing any functionality. So, in principle, dynamic elements could be utilized anywhere in the circuit. Whether it would make sense for the cache would depend on the achieved trade-offs in speed and density. Currently this does not seem very attractive, especially for a process not optimized for DRAM.
2:32 what? are we just glossing over that like its no big deal or something? he legit just went hey mate lemme borrow your table for a bit imma make something revolutionary rq
Coding is one of the big problems.. too much 'paranoia' (hash checks, Hamming checks and engility registry loops)... better to go more RISC and back to 16-bit for many processes, using more stable, larger SRAM designs
Very confusing video title for those who are interested in bicycles - where SRAM is one of the main competitors to Shimano as manufacturers of group sets (eg gears, brakes, pedals, cranks, wheel hubs etc).
Doesn't change the fact that the distance to the DIMMs is the bottleneck. We've had 7-8ns latency RAM since EDO, but that pesky "universal speed limit" we call the speed of light won't let us go any faster unless we move closer. So we move as close as we can get: onto the chip.
@andersjjensen Quantum computing fixes all of that. If the CPU does true parallel computing, it's another solution to save time instead of queuing in line for certain clock cycles, especially if the software is designed to handle multi-threaded computing. Simulating the physics of trillions of individual particles would help greatly. This would require fewer clock cycles and less reliance on DDR RAM speed.
@@cj09beira Quantum is promising. If it actually works, it's almost like magic. It will give you the same answer. The problem with it is that it is not stable; you will run into errors after a long duration. Is there any error-correcting method? Is there a way to check the code after each process? Can it even check itself? The state of matter is hard to control. If you really want to build a robot, an android per se, quantum will make a robot as smart or smarter than a human. They already can map human brain patterns to some extent. Load that onto a quantum computer and it will probably have its own consciousness.
@@alphar9539 Yeah, a lot of people say 90nm broke frequency scaling, but 65nm gained pretty nicely over 90nm for the Pentium 4 (when OCed, 65nm would go about 1 GHz higher on air). Core 2 also had nice frequency bumps down to 45nm. But seeing leakage go from nothing to a lot in one node is interesting.
"Sits between" is just hierarchical; it doesn't necessarily serve as an interface between the processor and main memory. If the processor checks for data in some sort of SRAM and it's not there, the processor then has to request the data from memory, so it's not like the SRAM serves as an intermediary.
Well, I think it is an electrical consideration. The logic definitely uses more current than the SRAM, so having it at the bottom, nearest to the power source, makes more sense.
@@tylerweigand8875 Ah, I didn't realize that. I always kind-of assumed that only one fetch got sent out and the cache handled requesting it from main memory if necessary.
"It can draw out data in a few nanoseconds." No. SRAM can draw out data in fractions of a nanosecond. In 3nm-7nm, small, even stock TSMC SRAMs (use in an L1) can manage 250ps clock to data. Larger macros (used in an L3) can still achieve
To be precise, it depends on the CPU clock speed and how many bytes per design are read/written at once. Yes, nanoseconds are DDR RAM, not SRAM.
@@Fiercesoulking Not really. SRAM timing is unrelated to CPU frequency, other than that some macros may happen to run on the same clock. PVT corner and geometry dictate the clock to data time.
@@Fiercesoulking it is common to use 2x clock on SRAM blocks to simulate dual-port access
That’s insane. Light barely travels across the entire CPU during that time, yet we read data in that timespan.
@@lbgstzockt8493 "the entire CPU" is absolutely huge compared to the size of an SRAM cell. Like comparing walking across the block to circling the Earth.
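A rough back-of-envelope check of the two comments above; the die and cell dimensions are assumed round numbers for illustration, not measurements of any specific chip.

```python
# How far light gets in the ~250 ps clock-to-data figure quoted above,
# compared with an assumed ~20 mm die and an assumed ~0.2 um SRAM bitcell.

C = 299_792_458            # speed of light in vacuum, m/s
clock_to_data = 250e-12    # 250 ps
die_width = 20e-3          # assumed die width, m
sram_cell = 0.2e-6         # assumed bitcell width, m

light_distance = C * clock_to_data                 # ~7.5 cm in vacuum
print(f"light travels ~{light_distance*100:.1f} cm in {clock_to_data*1e12:.0f} ps")
print(f"that is only ~{light_distance/die_width:.0f} die widths,")
print(f"while ~{die_width/sram_cell:,.0f} SRAM cells fit across one die width")
```

On-chip signals propagate well below the vacuum speed of light, so the real margin is even tighter than this suggests.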
As said in the video, we're at a point where all of the challenges are starting to dominate. Lower voltages mean reduced noise margins. Smaller feature sizes increase tunneling effects and the requirements for even higher-precision lithography. We'll probably overcome these, but it will be slow going and more evolutionary than someone coming out and saying we've solved them all. We also have opportunities to try new computing architectures to avoid some of the shortcomings. We have a history of hitting a roadblock and coming up with clever solutions that nobody saw coming.
This is something a well-trained AI could tackle. Meaning it could analyze possible solutions given as input, much deeper than any human could.
@@paulmichaelfreedman8334
not really, AI can't invent new magic physics...
@@paulmichaelfreedman8334 Definitely not. Our AI is basically just pattern recognition; it doesn't function like a human brain that thinks. If it gave us a solution, it would either be copying a human or 'hallucinating' and accidentally giving a solution.
@lt2660 My apologies, I meant AGI/ASI
@@Napoleonic_S I never implied that, I implied it could do a much deeper analysis of (human conjured) theories. And I also should have said AGI/ASI
Never had a transistor compared to yeast before. It's these kinds of analogies that I'm here for.
That's before the transistor got as small as yeast ... Well, transistors shrank to yeast size back in the 1980s. A couple of orders of magnitude smaller now.
The research needs another direction too. Rather than focusing on the increasingly difficult problem of preventing quantum tunneling from happening, we should open up the study of how to utilize quantum tunneling effects to achieve better reliability & performance.
AMD said that using anything lower than 5N for large SRAM blocks _costs more than they taste_ (as we say in Sweden).
And don't forget "costs the shirt" ;) Best regards, Ajsof-bajs 💩
No. That's what a customer paying for Epyc-X said.
For L3 it's quite sensible, but for L1 it's a no-go
Yes and no. Actually the shrink in size starts to become negligible at the 7nm node, but AMD will use 6nm, a variant of 7nm, for a die that is, say, L3 cache. This is using TSMC; other companies' nodes are different.
Inside a core there is no choice but to use the node the core is on, because certain cache HAS to be built into the core, which is true of L1 and L2. L3 is also built into the core but pushed further away, and using interposers you can join one die to another and put L3 cache elsewhere. AMD has X3D CPU parts that have L3 on both the core die and another die that is stacked onto the core. I could see a day when AMD moves to what Intel is starting to do now (Meteor Lake) and has chiplets that connect directly together, instead of having to send data through what they call Infinity Fabric, which has to multiplex data to send it to the right place; that is how AMD connects core dies (CCDs) to each other and to an I/O die (IOD). If this happens then, instead of stacking dies, AMD could make an L3 chiplet that sits next to the core die(s). At which point AMD and Intel both could push ALL L3 onto another die.
What does 5N mean? Five nanometer?
«More memory is as good as I remember»
~ Asianometry, 2024
Write that down 😂
Not always true. It depends on the application. Some things benefit the overall system. There are applications that benefit from RAM speed, most importantly timing.
Cache on the CPU does just that. Repetitive tasks that use the same memory benefit from on-CPU cache. A higher CPU clock does not always make things faster when the cache bottlenecks it.
@@clintcowan9424 Sent it to friends and family
I think he could have continued, "... though I don't remember anybody ever saying that".
@@BlueRice Not only repetitive, just everything that follows locality
My lecturer at the university of cambridge recommended your video! I was already a fan but was surprised to hear a recommendation in one of my lectures
90%+ SRAM is certainly a lot, but having a lot of SRAM on a chip has a lot of benefits. For one, while it leaks, it consumes a lot less power than logic. It's also super regular and hand-crafted (unlike most logic), so, in a way, it's a very efficient use of area (especially compared to registers). It also has low power states you can use to save power in a graduated manner. Hotspots are basically always places where you don't have enough SRAM.
Designers go out of their way to try to turn structures that use registers into a structure that can use an SRAM for these benefits. The real difficulty is that to get the best density, you need to use 1RW-ported SRAM macros, which puts real limits on how you can use them. Nonetheless, this trade-off is almost always worthwhile.
SRAM that’s stacked runs hot and is often the hottest part of the SoC.
@@alphar9539 Sure. Anything stacked gets hot. 🙂
One other advantage of SRAM over random logic is that it is MUCH easier to test (also including self-test in the field).
Aren't registers made of the same transistors as (on-chip) SRAM?
Do you mean separate chip SRAM? That registers are intertwined with logic thus of more expensive design? Or just less optimized because less regular?
@@musaran2 Yes they are both made by the same process, but the transistors in the SRAM (and the fabric to read/write from the port) are super optimized because they are so regular and well understood. My understanding (I just use these macros; I don't design them) is that normal registers need to be built more conservatively for DRC, toggle more often, and have irregular control wiring.
All this takes me back to the early 1970s when we were shocked that manufacturers could put 1024 bits of static RAM on a single chip. Moore's Law has been one heck of a ride.
Popular Electronics January 1975 was my start in computer electronics. Heck of a ride - I designed the IO board used in the World's First BBS. And now look at what has happened.
Very interesting video ! And thanks for attribution ) I get a bit of a deja-vu every time I see my photos, but this gets cleared as I read the text below :-)
I love the jokes you throw from nowhere, pure silicon comedy
ya his humor is very glassy
Sillycom. Get it? Sillycom... Bad joke. Bad!
🤓
He's very sili and conedic
The progress within the semiconductor industry is an example of what can happen when people have a shared goal and actually choose to be effective and pragmatic about finding solutions, not that it's flawless, just better than almost every other human project
It is because there are companies all over the world competing neck and neck that have to make the best products to stay in the market. In pretty much anything else, companies compete neck and neck to monopolize their respective markets and then don't bother to bring out any better products after that.
Member how bad it was when AMD nearly flipped due to bulldozer fiasco and Intel fed us 4core CPUs with 2% perf increments for years?
i wonder if TSMC and Samsung emails have employee pronouns in their emails. i'm going to guess not!
@@alquinn8576 I've worked with a company that works with TSMC. No, they don't. Almost no one in the semiconductor industry does.
@@alquinn8576 Taiwan is probably the most LGBT-friendly nation in Asia
I love how we are getting so small that quantum mechanics is required to do anything more...or how it's been making it impossible to squeeze more performance out.
Yes, it's ridiculous really, how far we've come technologically in the last 120 years.
We've been there for a while!
Technically we have been employing quantum mechanics on every transistor regardless of how small it is.
@deang5622 I have to disagree: by the end of the 1800s we had quite enough understanding of crystals, atoms and electrons that a scientist of the day could understand how a diode functions and eventually a transistor.
Classical physics is certainly enough to understand what's going on.
@@thewheelieguy Did I understand correctly that scientists are still just trying to avoid quantum effects (e.g. tunneling) and build chips according to these classical physics laws? It just gets complicated to design with billions of these simple transistors in one chip.
"...Just kidding, nobody says that..." Your jokes are improving. That was good.
12:45 the saying goes: CACHE RULES EVERYTHING AROUND ME
data data bytes yall
So, another word for a latch is a flip flop. but latches/flip flops come in many forms. This is a VERY basic latch, the minimal transistors needed to retain a state of a 0 or a 1. If you were to look at a logic gate breakdown of this latch it's VERY easy to make sense of it. Looking at a transistor breakdown, you have to understand the transistors. You have in effect two sides for a latch. Each side outputs the opposite of each other. So, if one side is outputting a low (can't use 0 or 1 here) voltage, the other is outputting a high. You can also say there is one side that has a true output and the other is false, or say one side is low when true, and the other side is high when true. So, if a 1 is stored in that latch then one side is low = true and the other side is high = true. If a 0 is stored, those two signals are inverse. So, the high = true output would be low = false.
The way a latch works is the output of each side feeds back to the input of the other side, creating a loop that locks that logic state in the absence of another input. Data will come in on one side of the latch. If the data (0 or 1) is the same, the latch stays in the same state, if it's different, the latch changes state. This is a very simple thing.
This is basic digital logic.
The reason why this is used is because you can switch transistors WAY faster than switching anything else. You can switch the state of this latch at the clock speed that the CPU is running at.
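A minimal behavioral sketch of the feedback loop described above, modeled in Python as a NOR-based SR latch. It is a toy model of the cross-coupled idea only, not a transistor-level 6T SRAM cell, and the fixed-step settling loop is an assumption made purely for simulation.

```python
# Each side's output feeds the other side's input, so with no input asserted
# the loop holds its state indefinitely.

def sr_latch_step(q, q_bar, set_bit, reset_bit):
    """One settling step of a cross-coupled NOR pair."""
    new_q     = not (reset_bit or q_bar)
    new_q_bar = not (set_bit or q)
    return new_q, new_q_bar

def settle(q, q_bar, set_bit, reset_bit, steps=4):
    for _ in range(steps):                      # iterate until the loop is stable
        q, q_bar = sr_latch_step(q, q_bar, set_bit, reset_bit)
    return q, q_bar

q, q_bar = settle(False, True, set_bit=True, reset_bit=False)   # write a 1
assert (q, q_bar) == (True, False)
q, q_bar = settle(q, q_bar, set_bit=False, reset_bit=False)     # hold: no input
assert (q, q_bar) == (True, False)                              # state is retained
q, q_bar = settle(q, q_bar, set_bit=False, reset_bit=True)      # write a 0
assert (q, q_bar) == (False, True)
```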
This is incorrect. Latches and flip flops are different. Latches are level sensitive, flip flops are edge sensitive. A good chunk of money has been lost while designing a chip when someone accidentally put latches instead of flip flops.
Wikipedia:
> The term flip-flop has historically referred generically to both level-triggered (asynchronous, transparent, or opaque) and edge-triggered (synchronous, or clocked) circuits that store a single bit of data using gates.[1] Modern authors reserve the term flip-flop exclusively for edge-triggered storage elements and latches for level-triggered ones.[2][3]
I have been so asianometry-pilled that I actually appreciate your "dram" joke.
The simple fact that we are reliably working on such small scales that the size where quantum tunneling becomes a problem is considered outdated and quaint boggles my mind.
I saw the video caption and thought “I know Shimano are dominant but I didn’t think it was that bad”. 😂
go ride a bike !!
same 😂
Production quality has gone up. The background is soothing, gives me PS3 music visualizer vibes
Am I the only one who came here thinking Shimano had finally achieved total dominance?
You should make the point that SRAM in a processor is present in large amounts because it's used in caches , not being used as generally addressed working store, except for some embedded SOC applications.
Also, SRAM is made with a standard "Logic Gate" fab process and DRAM uses very different materials and layering.
From what I understand SRAM is both used in registers and larger caches. I don't know why the distinction though.
It makes sense to use older nodes for SRAM stacked on newer process node logic chips. The cost would be pretty good as older nodes demand drops off and the equipment gets amortized.
Apple's integrated HBM is also a good solution as making the ram faster puts less pressure to need the even faster caches to be quite so big.
Everything is a tradeoff and there are some reasonable tradeoffs to choose from.
Yeah, but that means cost wouldn't decrease any further in the future. In the past cost per bit decreased with each node shrink.
@@cube2fox certainly it becomes challenging to reduce cost of sram from here forwards regardless what strategy is used. However, if a fixed node size becomes standard for sram, you can reduce costs in two major ways. One, you can optimize the cost of producing and operating lithography at the "final" process node. Two, due to stable demand for the process node, you can amortize the investment in plant and equipment over a longer time period. Not nearly as good as Moore's law, but it's not nothing.
great vid! imec had a very neat vertical nanowire fet sram cell design that from my armchair seemed to take advantage of the novel way that xtor design interconnects with one another
You commented 2 months ago?
patreon's a thing @@Tandanuu
Stop using your quantum private network to connect to events in the future ikarosav. This is supposed to be an under-the-table project...
@@Tandanuu 0:59 Likely
"Patreon supporters get early access to videos"@@Tandanuu
Would be interesting to see a video on alternatives to Si.
It's always been a side niche, as Si has always been either further shrinkable or amenable to assists from things like straining or copper/cobalt interconnects.
3D stacking advances will add life, as will photonics.
Alternatives show at least some advantage over Si: cubic boron arsenide, molybdenum disulfide, GaN, CNT, graphene, organic electronics....
I thought you mean SRAM the bike component company! My hobby/interests have collided
More of a shimano guy but sram is good too
Gen two Campy ergo is my fav. 11clicks of triple crank goodness on the left shifter and they just fit my hands@@patrickglaser1560
Yeah, "end of SRAM" made my heart sink
Crashing a bike into a silicon wafer would be very expensive.
DRAM is just horrible for fast-access, low-latency stuff, so SRAM will never die for registers, fast cache, non-static FPGA LUTs.....
For slow stuff such as slow bulk memory, SRAM is long dead already, except for small microcontrollers or special applications, like non-volatile RAM.
eDRAM is quite a bit faster though.
@@fungo6631 Aren't tRCD and tCL still in the tens of ns for eDRAM?
@@volodumurkalunyak4651 I dunno, the Gamecube and Wii's eDRAM had a 5 ns latency.
Very informative and well put together as always. Tell us about possible alternative solutions in a future video please!
What about, instead of trying to get the data/memory to work on smaller and smaller chips, you put processors on all the memory chips?
Instead of a Von Neumann Central Processing Unit with external memory storage, we adopt a Central Memory Unit with external processing chips on a super fast bus.
"More memory is as good as I remember"
~ Steven Hawking, 1856
"Like its cousin dhram, S-ram is..." Smooth.
I am going to start using "more memory is as good as I remember" and "more memory is better than I remember".
At the age of 65yrs old, I agree!
Thanks!
My solution was always more level 1 SRAM; 80% of the SoC, wow.
Keep innovating what we use is the solution here.
Apple M3, M4, M5 and beyond show the SoC emerging as efficient compute: since all the IC parts are so close, it's faster while using way less energy.
Man, you are cool!
Thanks for all deep tech videos you make.
1:52 RIP 3D XPoint. We hardly knew ye
"We will need more alternative solutions." Yeah, like write computer programs in languages such as C, C++, D, Zig, Rust instead of freaking Electron (JavaScript) on desktop or freaking Node (again, JavaScript) on server. Here, free performance boost without changing nanometers or dealing with quantum tunneling.
Or my favourite language… assembly. When it’s hard to write you write it efficiently. Straight ahead C is pretty good, C++ is very inefficient comparatively, and uses a lot more RAM, especially on run-in-place embedded systems where code runs straight from flash and not RAM. I design car ECUs and that’s how they work.
But that means you'll need to get rid of web dev diversity hire that got the job for all but their skills.
@@fungo6631 oh, no
anyways... 😂
I just remember when they said C# and Java were bloated, but hey, they chose an even more bloated solution: the web browser, killing Flash and Silverlight, and then reintroducing a similar solution: WebAssembly. I smell a political reason behind those trends. Another example: Google and JPEG XL; Google still insists on forcing everyone to accept WebP and AVIF as the de facto replacements for JPEG and PNG.
The reason the chip is filled with 90+ percent of SRAM is because it's the most efficient thing to do when you run out of things to use the logic circuits for. Nobody said they had to put that SRAM in there, they could just leave it out. But large on-die cache is very good for performance, because main memory is many times slower. But I think the rise of GPU shows there's room to grow the number of parallel cores.
The problem is that most things that are highly parallel problems by nature have all been moved to the GPU by now. I can personally use as many cores as I can get because software compilation is one of the few problems that are highly parallel but also highly branched, which GPUs suck at. But mostly everyone else, outside of scientific modelling and whatnot, don't need more cores. They need faster cores.
@@andersjjensen Faster cores have been stalled for almost two decades. It was quite the ride up until the early 2000s -- exponential increases every couple of years. Back in the late 90s I was going to wait for a computer upgrade until they got to 10GHz clock rate. Still waiting...
@@chrimony Clock rate is nothing, IPC (Instructions Per Clock) is everything. My 7950X3D does more than twice the work per clock tick that my 2700X did on a per-core basis. So no, we are not getting twice as fast cores every two years, but heck, that ride was already over by the Pentium 2. But I absolutely wouldn't call it "stalled". Sure, Intel were eating their crayons and sniffing their glue for 7-8 years straight because they fumbled hard and couldn't get off 14nm, but these days 25-30% better per-core performance each generation is normal.
@@andersjjensen Clock rate is not "nothing". While it can be abused/misused, a generic CPU from the 66MHz era is never going to outperform a 1GHz generic CPU, regardless of the architecture. It would have been really nice if clock speeds had kept scaling at the pace of transistor counts. Those days were insane, and it went on for decades. Quite the wonderful run.
Gains are still happening, and there have been benefits like reduced power-usage, but the free lunch is over.
@@chrimony it was a fun time to be alive for sure
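For what it's worth, the two positions in this thread are tied together by one identity: single-core throughput is IPC times clock frequency. The sketch below only illustrates that arithmetic with made-up numbers; they are not measurements of the chips mentioned.

```python
# Throughput (instructions per second) = IPC x clock. Both factors matter;
# the figures here are hypothetical, chosen only to show the calculation.

def throughput_ips(ipc, clock_hz):
    return ipc * clock_hz

old_core = throughput_ips(ipc=1.0, clock_hz=4.3e9)   # hypothetical older core
new_core = throughput_ips(ipc=2.2, clock_hz=5.7e9)   # hypothetical newer core
print(f"per-core speedup ~{new_core / old_core:.1f}x")  # ~2.9x from both factors
```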
Stone age: 50,000+ years
Bronze-iron age: 5,000+ years
Industrial age: 150 years
Computer age: 70 years
Our modern age: 20 years (micro computer technology in every aspect of our lives).
The acceleration of technology is incredible when compared to previous eras.
History isn't a linear progression; we've just been lucky in the last two thousand or so years not to be thrown back into the dark ages. The sun could easily throw us there with a little sneeze.
@@cj09beira Bigger chance that we humans do it ourselves and get thrown back through our own fault. By war, or by weather, or by not learning and just playing games, who knows.
I read the thumbnail as "THE END OF SPAM" and was like, gosh, there's light at the end of the tunnel after all!
The end of the quantum tunnel.
Pronounced perfectly: S-RAM. Now do the same for DRAM as in D-RAM.
A foolish consistency… is the hobgoblin of little minds…. - Ralph Waldo Emerson 😃
We’ll all be back next week to watch Jon cover the next topic… pronounced any way he so desires….
I have to agree. All of us nerds in my nerd world (including my being an actual SRAM chip designer in a dram/sram/rom design group back in the Stone Age) called it DEE-RAM, not dram.
Heard TSMC can increase the sram density if they remove all logic. Not sure how that works though.
Removing all logic removes a lot of noise, and this allows higher density. The thing with SRAM is that it is partially analog. The two bitlines do not only carry data, they also serve a function when reading and writing. When you have logic that is very noisy, the bitlines are giant antennas that might flip the state of the SRAM cell.
That is exactly how AMD's 3D V-Cache works. There is nothing but SRAM on the chiplet (and of course the connecting pins), so the gate patterning can be optimized solely for that layout.
As a general rule, chip processes are optimized for either speed (logic), density (memory) or efficiency (mobile/embedded).
Mixing on a chip means compromising, though some tricks allow to tune transistors on the same chip.
I still enjoyed the video, but I kind of feel like I missed the part about why SRAM shrinkage is a problem with the current cutting-edge nodes in particular. I certainly understand the core trade-offs and issues with SRAM cell design better, but which part of that is the problem with, for example, 3nm? Is the answer essentially that we don't know, because TSMC is being tight-lipped about the exact details, so we can only draw more general conclusions?
Ah, I think I can answer my own question after watching it again. It's a question of yield. Denser SRAM causes yields to drop below the limit needed to be commercially viable.
@@lexer_ SRAM has not scaled in roughly a decade. Now it isn’t even shrinking at all. So we are going 3d and stacking. But that’s expensive and runs into heat issues. Like a hot sandwich where the meat gets extra cooked
I wonder if SRAM can be produced with much greater defect tolerance than logic - I mean, after the fact, we can just run a test to see which cells are faulty and burn in a hardware-level mapping that shuts off the faulty cells, so we never have to deal with them.
They've done this for many years.
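For anyone curious what "done this for many years" looks like in practice: SRAM macros ship with spare rows and columns, and after a production test, fuse-programmed mapping redirects failing addresses onto the spares. Here is a minimal, purely illustrative sketch of the idea; the class name, sizes, and structure are my own and don't reflect any vendor's actual repair scheme.

```python
# Toy illustration of row-redundancy repair in an SRAM macro.
# A post-production test finds bad rows; a small fuse-programmed map
# redirects accesses to spare rows. Names and sizes are made up.

class RepairableSram:
    def __init__(self, rows=1024, spare_rows=8, faulty_rows=None):
        self.data = [[0] * 64 for _ in range(rows + spare_rows)]
        self.rows = rows
        # "Fuse map": faulty row address -> spare row address.
        spares = iter(range(rows, rows + spare_rows))
        self.remap = {bad: next(spares) for bad in (faulty_rows or [])}

    def _resolve(self, row):
        # Redirect accesses to a remapped (spare) row if needed.
        return self.remap.get(row, row)

    def write(self, row, col, bit):
        self.data[self._resolve(row)][col] = bit

    def read(self, row, col):
        return self.data[self._resolve(row)][col]

# Rows 3 and 700 failed the production test; they get remapped to spares.
sram = RepairableSram(faulty_rows=[3, 700])
sram.write(3, 10, 1)            # physically lands in a spare row
assert sram.read(3, 10) == 1    # the defect is invisible to the user
```

The point is that the repair is done once, at test time, so it costs nothing on the access path afterwards.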
Also something to mention is that actually moving the data to and from ever larger caches requires infrastructure, infrastructure which increases the total power draw and latency.
To enable it, try copying Windows 3.1 and dos files from the memory to the micro sd card in about 500 mb quantity. When the copying is faster, you activated this little portion. That is one, why a card is called hc, xc. It acts as a coprocessor. It also acts as a sound processor or codec. The card has Physx, but it is separate. When the portion is active, the card becomes highly reliable. Good for managers who handle precious data.
What were you supposed to comment on? This doesn't seem like it.
@@fungo6631 seems like ai garbage to me.
2:30 There was SRAM (Static Random Access Memory) long before 1963.
Why not try to differentiate a little between that and integrated SRAM...
As a bicyclist, I was confused by the title.
Now I have more questions
I'm always thinking that if the nodes keep getting smaller, the chance of them being harmed gets higher (and even those lil bacteria might be able to crack it).
It's great content. I love your channel, and it has allowed me to learn A LOT about semiconductors and technology. I've started to learn how to program microcontrollers. BUT. I'm also watching it from Poland, and SRAM in Polish - when read as a word - means "I'm taking a poop", and not in a polite way, so the title "Can SRAM keep shrinking" is REALLY FUNNY
Since the active layer on a silicon chip is just a few layers and nanometers thick, they could go more vertical like 3D NAND. There's plenty of vertical room.
Except the heat concentrates then and bakes the interior SRAM. It is a solution, but not a long term solution.
"more memories as far as I remember" is a perfect 10/10 memorable
I love those images you found of the grid of SRAM cells; the pattern is so mesmerizing to stare at, some reminded me of pictures of DNA/chromosomes
A geometric order, an insulated border.
Shrinking a transistor by half means its linear dimension goes as the square root of the surface area, so this should result in more than 50% faster electron traversal. However, we only see a 20-40% improvement because of efficiency losses.
This means that a slowing ability to shrink the transistor is expected; we're getting _more_ than double the theoretical cap every time the transistor shrinks by half.
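Just to put rough numbers on that geometry argument (my own back-of-the-envelope reading, and the answer depends on whether "shrinking by half" means halving the area or halving the linear size):

```latex
% Halving the AREA only shrinks the linear dimension by 1/sqrt(2):
A \to \frac{A}{2} \;\Rightarrow\; L \to \frac{L}{\sqrt{2}} \approx 0.71\,L
\qquad \text{(traversal distance drops by } \approx 29\%\text{)}

% Halving the LINEAR size quarters the area and halves the traversal distance:
L \to \frac{L}{2} \;\Rightarrow\; A \to \frac{A}{4}
\qquad \text{(traversal distance drops by } 50\%\text{)}
```

Either way, real-world speed gains depend on far more than path length (drive current, capacitance, interconnect), which is presumably where the 20-40% figure comes from.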
Your statement about AMD and cache is incomplete, so much so that it can give the wrong impression to people who don't understand how CPUs and cores work.
SRAM is used in cores for registers along with cache. Registers are a collection of latches just like SRAM is, although the configurations will be different. There aren't many registers compared to cache.
Registers can NEVER be moved off the main core die, because they are integral to the operation of the core; you need data registers to do pretty much ANYTHING in a CPU core. Registers are temporary storage for instructions, and they are tied into the instruction logic. A very simple operation like A + B could mean getting data from memory to add together and storing the answer back in memory, or it could be that these values are constants built into the instruction. In either case, it's data that can be put into registers; you then add those registers together and load the answer back into another register. The very next instruction may need these values, so having the data in registers means the next instruction doesn't have to call that data from memory again. So that's one thing: registers are integral to the operation of instructions, and that will NEVER move off the core die.
The next thing is cache. You mentioned L3 cache, but unless a person understands how a CPU works there isn't enough information there to understand the implication.
There are typically 3 levels of cache (and there are multiple cores in a CPU): L1, L2, L3. L1 is closest to the instruction logic and operates at the same speed. There is not much of it because it's the hardest to make PERFECT, since it has to, once again, operate at the speed of the core. Next is L2 cache, of which there is more. The problem with having more cache is that searching it takes longer, so the time to fetch from L2 is longer than from L1. The time to pull data from L1 is only a few clock cycles; the time for L2 varies based on the CPU type. L2 ALSO has to be on the core die, because it's a larger pool of data/instructions that the core has been using and may need again. L3 is the slowest and also the largest cache. THIS is what can be pushed off onto another die, as long as the connections between those two die operate fast enough, which is the key.
Cache replicates what is in memory, but there's a small amount of cache and a large amount of memory. As a core needs data, if that data hasn't been accessed already, it's either on disk or in RAM (main memory). If it's on disk, it will get loaded into memory first, then pulled into the core. It HAS to go into L1 to be used by the core. Caching methods vary, so I will describe ONE method of caching. There is only a small amount of L1 cache, which runs at the speed of the instruction logic. There is also separate cache for instructions and cache for data. Instructions of course run incredibly fast, a VERY rough figure being about a billion per second. This varies because there are wait times, so you can't simply take a clock speed and say there is one instruction per clock cycle; that's why I'm giving a VERY rough figure. But say 1 billion a second. In that time period, L1 will have been changed out millions of times, because L1 doesn't store much. One way to handle a caching scheme, when L1 is full and you need something that isn't in L1 (where all data and instructions are dealt with from), is to take the data/instructions used furthest back in time (so you need a time stamp for what's in cache) and push that down to L2. If L2 is full, you take the data/instructions used furthest back in time and push them down to L3. (A toy sketch of this eviction idea follows after this comment.)
As was said, L1 - L3 holds data/instructions that came out of main memory. You never need to update instructions, since instructions don't change, but you do for data, and since L1 - L3 can hold updated values for data that's stored in memory, you also need to write those changes back to main memory.
So the key takeaway is: L1 HAS to be right next to the instruction logic, runs at the core speed, holds data and instructions each in a separate space, and HAS to be on the same die as the cores. L2 ALSO has to be on the same die as the cores, because you often need to access data/instructions you've already used, but since L1 is small they got pushed down to L2. L3 is the ONLY cache you can push onto another die, AS LONG AS you can clock the connection between the two die fast enough.
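To make the "push the oldest entry down a level" idea from a couple of paragraphs up concrete, here is a deliberately simplified sketch: single-word "lines", tiny capacities, no write-back to memory, and nothing about how a real CPU indexes its sets. It only illustrates the demotion policy described above, not any actual processor's behaviour.

```python
# Toy model of the eviction scheme described above: when a level is full,
# the least-recently-used line is pushed down to the next level.

from collections import OrderedDict

class ToyCacheLevel:
    def __init__(self, name, capacity, lower=None):
        self.name = name
        self.capacity = capacity
        self.lower = lower              # next level down (L1 -> L2 -> L3)
        self.lines = OrderedDict()      # address -> data, oldest first

    def insert(self, addr, data):
        if addr in self.lines:
            self.lines.move_to_end(addr)        # refresh the "time stamp"
            self.lines[addr] = data
            return
        if len(self.lines) >= self.capacity:
            victim, vdata = self.lines.popitem(last=False)  # oldest entry
            if self.lower:                                   # demote, don't discard
                self.lower.insert(victim, vdata)
        self.lines[addr] = data

l3 = ToyCacheLevel("L3", capacity=8)
l2 = ToyCacheLevel("L2", capacity=4, lower=l3)
l1 = ToyCacheLevel("L1", capacity=2, lower=l2)

for addr in range(6):                   # touch more lines than L1 can hold
    l1.insert(addr, f"data{addr}")

print(list(l1.lines))   # the 2 most recently used addresses stay in L1
print(list(l2.lines))   # the older ones got demoted to L2
```

A real cache also writes dirty data back to memory on eviction, as the comment above notes; that step is left out here for brevity.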
Thanks for the citations, they are invaluable.
SRAM profit margins are the lowest in the industry. Safe bet is they will up the price and pace the supply.
And perhaps that will finally mean the end of incompetent web development diversity hire as people will be forced to optimize their code better.
Thanks for this analysis, it provides a very clear explanation of the challenges in modern chip design. I wonder if stacked cache has trade-offs as well, such as latency from being further away from the logic core. We seem to be near the limit now with 5nm; to be honest this is better than I expected, I remember telling a co-worker 10 years ago that 7nm would be the limit. These multi-patterning designs may allow us to get smaller, but at substantially lower yields and higher costs. This is, I guess, why cutting-edge tech is getting more expensive.
A stacked die can actually be closer than much of the area of an on-die cache.
The problem is that chip-to-chip vias use much more space than on-chip lines.
They take up footprint, they constrain line placement and spacing, and they require stronger electrical drive.
@@musaran2 cool makes sense
How dare you ;) Bi-stable latch is a pinnacle of logic circuits...
And very well named too.
I was going to ask why SRAM can't keep shrinking at the same rate as logic (weird, since both just use transistors), and what possible replacements there are (STT-RAM, magnetic RAM, etc.)
He said so. Tunneling and defects.
@@alphar9539 But those will affect the logic circuits as well.
I suspect the difference is that SRAM is not clocked at such a high rate, so it is expected to retain the data it handles for a longer period of time.
if you shrink it too much, it becomes a DRAM 😂 an unintended one
@@jamesphillips2285 logic circuits do not have to be error free. Errors in SRAM are unresolvable most of the time, while logic errors can be corrected without a fatal mistake for the CPU.
@@alphar9539 There is ECC: but it has a performance penalty.
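For what it's worth, the kind of ECC being referred to is usually a single-error-correct code over each stored word. The toy Hamming(7,4) below shows the basic correct-a-single-flip mechanism; real cache ECC works on much wider words (e.g. 64 data bits plus 8 check bits) and adds an extra parity bit to also detect double errors, and the extra encode/check logic on every access is where the performance penalty comes from. This is just the textbook version, not any particular CPU's implementation.

```python
# Minimal Hamming(7,4) single-error-correcting code.

def encode(d):                      # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def decode(c):                      # c = 7-bit codeword, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3 # = position of the flipped bit (1..7), 0 if none
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1        # correct the single-bit error
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1                      # a cosmic-ray-style single bit flip
assert decode(stored) == word       # ECC silently repairs it
```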
4:35 i want someone to explain the circuit in detail here! how does it save memory
More memory is as good as I remember ... very memorable!
Could something like ReRAM overtake on-die SRAM at some point? And I'm told AMD's V-Cache is denser than the equivalent on-die SRAM because its process is optimized for SRAM alone instead of also having to support logic?
ReRAM is probably far too slow and at best suited as an alternative for flash, not DRAM or even SRAM.
Cheer up buddy ! The RAM will get sorted some how...cheers !
Cheese!
A bit worrying: if development tapers off and we still increase compute demand by 25% year on year or more, I guess we will start using that much more electricity per year. Might be a real problem eventually.
It already is a problem. Bitcoin was wasting so much power that China had to ban it. The big cloud providers use terrifying amounts of electricity.
Bro, we have 1kW per square meter coming from above in daytime.
Well, given that semiconductor production shares a lot of logistics with solar panel production, maybe we can fend off that problem for a few years.
SRAM is something I would not expect most people to know anything about unless they have some knowledge of CPU architecture. That is because these days it is used as the cache that comes as a part of the CPU itself. I do remember a time when it came as a module you could install similar to a DIMM. That was back in the days of the Pentiums when MMX was the thing to have. If you ever saw a Pentium II with the heatsink off you would notice two chips next to the processor. Those were the SRAM chips. Those ones ran at half the speed of the processor itself. Why half the speed? I don't know! It must have been a technical limitation at the time or a strange decision by one or more of Intel's engineers.
I come up with my new plan whenever I SRAM! (Polish pun, see the comment above.)
Since transistors are grown up from the sub straight now, is anyone working on 3D-fabbed or multi-layer SRAM?
substrate*
One of the worst examples I've personally owned of "Most of the chip is cache" is the Pentium M Dothan - the 2nd version of the P-M with 2MB cache. It's kind of obscene looking.
With all this discourse around memory, could you please do a video on memristors? It looks like a technology that should not be slept on and I think we'd all value your opinion on the matter.
I'm happy improvement has slowed down. Maybe now companies will be forced to actually pay attention to optimization and web devs will have to learn how memory works
SEUs and feature size: is this an issue? Yes/no? L1 and L2 cache really need EDAC to avoid crashing the CPU. The SDRAM FIT rate was
about one event per megabit per month at ground level. This was researched by IBM and others. 24/7 operation can be an issue.
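Taking that historical figure at face value (one upset per megabit per month) and assuming, purely for illustration, a hypothetical 32 MiB unprotected cache:

```latex
32\ \text{MiB} = 256\ \text{Mbit}, \qquad
256\ \text{Mbit} \times \frac{1\ \text{upset}}{\text{Mbit}\cdot\text{month}}
= 256\ \frac{\text{upsets}}{\text{month}}
\approx 8.5\ \frac{\text{upsets}}{\text{day}}
```

Which is exactly why caches of any size carry parity or ECC (EDAC), as the comment says; the per-bit rate of modern cells will differ, but the scaling with capacity is the point.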
“You are my density”
-George McFly
I actually felt like I knew something for once when I knew the solutions immediately: stacking and gate-all-around.
Still, I didn't fully understand why shrinking SRAM transistors is more of a challenge compared to logic transistors. It follows from the video that shrinking any transistor is a challenge, not just SRAM ones. Or am I misunderstanding something?
Minutes 6-8, power leakage. For logic it is just power inefficiency but for SRAM it is loss of value / instability. Minute 9 onwards, process variation means problems are uneven between different chips. So when SRAM shrinks they get less and less reliable, so in smaller processes you still make an exception and make the SRAM cells big because you do not want random bit flips in the cache due to power leakage etc.
@@randomgeocacher My understanding is that all those problems (static and dynamic power dissipation as well as static noise) are relevant for any transistor, regardless of its application in a circuit. Why doesn't shrinking transistors cause problems for the logic, like an ALU producing incorrect results when summing two values?
@@DenisBazhenov SRAM holds a value; logic emits values that go into the next gate (transistors). I.e. for logic, a minor deviation from 1 will correct itself in the next flip-flop or logic cell, while the small 6-transistor SRAM cell doesn't have this continuously self-correcting property. Honestly I did not study the drawing or the detailed explanation in the video, but I think that if we rewatched it carefully we would see how the 6-transistor SRAM is different/optimized in a way that loses continuous self-correction. I imagine that a big SRAM cell built the way it is taught in school (a flip-flop made of many gates) would have far fewer of these issues, but then it would use a lot more transistors. So basically designers had to trade off between more logic/transistors in the SRAM cell or sticking with the popular 6-transistor SRAM; but that cell has reached its minimal size. That's my understanding anyway: the popular SRAM cell cannot deal with power leakage, and that prevents it from being made smaller. You either fix physics, or keep it at its current size, or abandon this SRAM cell and go for a more complex cell without these issues. So it is this particular size-optimized cell that has the problem, and none of the options are particularly advantageous in current technologies.
@@DenisBazhenov For most logic, the output goes into a forgiving next stage, like a flip-flop or logic gate. If you lose a little bit of power, there will be transistors correcting the loss. Apparently the space-efficient 6-transistor SRAM doesn't have this self-correcting property… and thus becomes unstable faster than other parts when shrunk. Maybe the circuit diagram gives some clues as to why? I think the video tried to explain this, hmm.
Don't forget that the internet is level 7 in the memory hierarchy…
I know that the stacked SRAM on AMD's CPUs is higher density than the SRAM on the integrated circuit below it. Does anyone know why that is?
The SRAM below likely needs to be more stable for critical functions, while the stacked SRAM handles the excess. Stability matters most for the most critical SRAM.
@@alphar9539 Both SRAMs are L3 cache. As far as I know the cores don't differentiate between the two
@@alphar9539 That's not how memory works, nearer memory is used to cache the most recently accessed parts of the next layer, but just because something hasn't been accessed very recently doesn't mean it's less important, indeed many critical operating system data structures will only be accessed occasionally.
Different process/technology maybe? I would guess the CPU dies are built in one technology and have to deal with its constraints, power, heat issues etc. An SRAM cache die built for a single purpose, with nothing else on the die/package, has more freedom to optimize.
@@randomgeocacher Looks like the answer from someone else is that the SRAM on the logic die needs to connect with the logic in a "legacy" manner, which also prevents an optimized SRAM layout. In contrast, the SRAM-only die is designed solely with SRAM in mind and therefore uses a slightly more efficient layout.
This actually makes sense, as the design of the Ryzen logic die was likely developed before 3D SRAM stacking was prototyped. The connections were therefore set up in a manner efficient for the "legacy" design. Maybe a future design could be more efficient and denser.
Main memory, then magnetic disks? How old are the pictures you are reusing?!
Shimano will be happy about this
jeez, I thought this was about a new bicycle groupset!
Me too!
Does a L1/L2 cache have to be SRAM instead of DRAM?
“Have to” implies it is a law of nature, so no. Traditionally DRAM is much slower and less predictable, for example because of the refresh cycle. So DRAM as we know it today certainly would not replace L1 cache.
Maybe some really speed-optimized DRAM variant could tackle L2 or L3. I.e. DRAM doesn't have an obvious way to replace SRAM, but especially for L2/L3
I wouldn't dare say that designers can never come up with new trade-offs or new designs. So everything depends on whether some smarty-pants developer can make new designs and trade-offs that make sense. The further away from the CPU core, it "only" has to be closer and faster than accessing the main RAM to be a good trade-off; the closer it is to the CPU core, the more we want near-zero latency and predictable access times.
Within the logic circuits themselves (as in the CPU), bits can be represented as dynamic charges, very much like in DRAM. A circuit node would be pre-charged in one part of the clock cycle, then discharged or not depending on the input signals to the logic element, and then the result would be sensed by the next level of logic. Preventing the charge from leaking away between the pre-charging of the node and sensing it later is what sets the slowest clock frequency for such circuits.
All of the earliest CPUs were like that. If you wanted to execute one instruction at a time, you could not just suspend the clock; instead there were special circuits for freezing execution while all the circuits still cycled at their normal rate. But then some CPUs started to be designed without such tricks, and were "completely static." In this case one could simply slow the clock all the way down to zero frequency without losing any functionality.
So, in principle, dynamic elements could be utilized anywhere in the circuit. Whether it would make sense for the cache would depend on the achieved trade-offs in speed and density. Currently this does not seem very attractive, especially for a process not optimized for DRAM.
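As a back-of-the-envelope illustration of that "slowest clock frequency" constraint, here is a toy calculation with entirely made-up numbers (the capacitance, threshold and leakage values are assumptions for illustration; real figures vary enormously by process and circuit):

```python
# Toy model of a dynamic (precharged) logic node, as described above.
# All numbers are invented for illustration only.

C_NODE = 1e-15        # node capacitance: ~1 fF (assumed)
V_DD = 0.8            # supply voltage in volts (assumed)
V_SENSE = 0.5         # minimum voltage the next stage still reads as '1' (assumed)
I_LEAK = 50e-9        # leakage current draining the node: 50 nA (assumed)

# Charge we can afford to lose before the stored '1' is misread.
delta_q = C_NODE * (V_DD - V_SENSE)

# Time until the leaking node droops below the sensing threshold.
t_droop = delta_q / I_LEAK

# If the node must be re-precharged every clock cycle, the clock period
# cannot exceed t_droop, which sets a *minimum* usable clock frequency.
f_min = 1.0 / t_droop

print(f"node droops below threshold after {t_droop * 1e9:.1f} ns")
print(f"=> clock must run faster than {f_min / 1e6:.0f} MHz to keep the value alive")
```

With these numbers the node holds its value for about 6 ns, so the clock could not be slowed below roughly 170 MHz, which is exactly why such a design is not "completely static".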
You missed out SSD, sitting between Main Memory and Magnetic Disk.
2:32 what? are we just glossing over that like its no big deal or something? he legit just went hey mate lemme borrow your table for a bit imma make something revolutionary rq
CRAM video when?
Coding is one of the big problems… too much 'paranoia' (hash checks, Hamming checks and engility registry loops)… better to go more RISC and back to 16-bit for many processes, using more stable, larger SRAM designs.
As a cyclist this really confused me.
had to scroll down to find this comment 🚲🚲
Very confusing video title for those who are interested in bicycles, where SRAM is one of the main competitors to Shimano as a manufacturer of groupsets (e.g. gears, brakes, pedals, cranks, wheel hubs etc).
Intel started their career making memory before switching to processors. Maybe they could use their fabs for 10nm DIMMs.
Doesn't change the fact that the distance to the DIMMs is the bottleneck. We've had 7-8ns latency RAM since EDO, but that pesky "universal speed limit" we call the speed of light won't let us go any faster unless we move closer. So we move as close as we can get: onto the chip.
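Rough numbers for that distance argument, assuming signal propagation at about half the speed of light on a PCB and a hypothetical 15 cm path from the CPU to a DIMM:

```latex
v \approx 0.5\,c \approx 15\ \tfrac{\text{cm}}{\text{ns}}, \qquad
t_{\text{round trip}} \approx \frac{2 \times 15\ \text{cm}}{15\ \text{cm/ns}} = 2\ \text{ns}
```

That is a couple of nanoseconds spent on the wire alone, before the DRAM array even starts working, while on-die SRAM sits millimetres from the core.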
@andersjjensen Quantum computing fixes all of that.
If the CPU does true parallel computing, that's the other solution to save time, instead of queuing in line for certain clock cycles, especially if the software is designed to handle multithreaded computing. Simulating the physics of trillions of individual particles would benefit greatly. This would require fewer clock cycles and fewer accesses limited by DDR RAM speed.
@@BlueRice For parallel programs we have GPUs; quantum will never take over, we need our computers to always give us the same answers.
@@cj09beira Quantum is promising. If it actually works, it's almost like magic, and it will give you the same answer. The problem with it is that it's not stable: you will run into errors after long durations. Are there error-correcting methods? Is there a way to check the result after each process? Can it even check itself? The state of matter is hard to control.
If you really want to build a robot, an android per se, quantum will make a robot as smart as or smarter than a human. They can already map human brain patterns to some extent. Load that onto a quantum computer and it would probably have its own consciousness.
12:46 🙂
Wait... this is not about the cycling industry in Taiwan.
7:41 that must be why Intel’s 90nm node suddenly looked really bad on power, even for Pentium M Dothan initially..
I think that’s where frequency stopped scaling exponentially as well. Dennard problem
@@alphar9539 Yeah a lot of people say 90 broke frequency scaling but 65nm gained pretty nicely over 90nm for the Pentium 4 (when OCed, 65nm would go about 1 GHz higher on air). Core 2 also had nice frequency bumps to 45nm. But seeing leakage go from 0 to a lot in one node is interesting
Seems like this is where the third dimension inevitably has to come into play.
stacking, stacking everything is the way forward
What are the possible solutions for replacing SRAM?
2:14 it’s not just me, right?
Really interesting, thanks for the video.
You forgot magnetic core and drum memory!
and punchcards (add in the latency of a human punching them and filing them)
Cache is king!
Given SRAM sits between the processor and main memory, wouldn't it make more sense to have it underneath the logic?
"Sits between" is just hierarchical; it doesn't necessarily serve as an interface between the processor and main memory. If the processor checks for data in some sort of SRAM and it's not there, the processor then has to request the data from memory, so it's not like the SRAM serves as an intermediary.
Well, I think it is an electrical consideration. The logic definitely uses more current than the SRAM, so having it at the bottom, nearest to the power source, makes more sense.
@@tylerweigand8875 Ah, I didn't realize that. I always kind-of assumed that only one fetch got sent out and the cache handled requesting it from main memory if necessary.
I love how quantum tunneling is "more intuitive"!
I thought backside power delivery was expected to help SRAM density.