On a serious note though: rather than a superscalar architecture, wouldn't it be more effective to put two pipelines in the CPU, with both of them sharing the same execution units?
postvideo97 In a sense, yes, except that there is only one pipeline. While one thread has a particular execution unit tied up - say, waiting for data to arrive from main memory, which can take hundreds of CPU cycles - you can instead execute operations from a different thread that doesn't need that data, or that execution unit. (Note that such a wait would normally only happen after a branch-prediction failure; otherwise the CPU would have noticed ahead of time that the data was needed and fetched it into the cache, or even a register, in advance.)
Yoppy Halilintar, two pipelines sharing the same execution units is pretty much _the opposite_ of what you want, because delays during the execution and complex dependencies would "stall" _both_ of them. *edit:* Matthew is right, that's not what SMT (aka hyperthreading for Intel) does.
Nice explanation of "out of order" execution. I knew you were going to make one minor mistake, not that it matters for the point discussed in this video. You threw in multiply as the operation before that last variable. You didn't take into account the typical order of operations. The multiply would be executed first.
Thanks for reminding me I need to put in some more time on that. I'm still in the early piece-of-cake puzzles. Well, they certainly are compared to where I got up to in TIS-100 and Shenzhen I/O.
Does the CPU decide all this in real time? How does it do all this?! Isn't it just supposed to be an electromechanical part? If no software intervention occurs here, this might as well be black magic to me.
I have the same question. I'm having trouble imagining how it could possibly be faster to make a bunch of checks on multiple instructions and cache states than it would be to just perform the add/multiply.
Plea from a software user: no matter how efficient the code, please ask yourself how usable it is to the end user. Elegant solutions to programming are so far removed from usability that I fear the connection gets lost sometimes.
Chances are you've never knowingly seen an elegant programming solution. You'd have to actually look at the source code for that, which, as a self-declared software user, you wouldn't and probably can't do. What you're talking about is most likely just bad UI design.
Dohh! I agree and stand corrected. I sort of knew when I made the comment that it was off topic. Please pass my concerns to any UI designers you might know.
CrashCourse has a crash course on computer science; they go very in-depth about how CPUs work and what assembly code does, but they keep it brief and simple enough that it's very easy to follow for the layman - provided you have the attention span and pay attention.
A programmer expects the lines of code they write to be 'executed' by the CPU sequentially. It turns out, though, that CPUs move them around and execute them 'out of order' because that's faster. Yet to any outside observer (the programmer) it still looks like the code (which is translated into machine instructions) runs sequentially.
Fz Does this apply to, let's say, 3D applications? For example RTS video games, where some of the things need to be executed sequentially and multi-threading is of no help. Noob here.
OOO happens in any program that runs on a CPU that supports it (pretty much everything). In the case of a video game, which consists of CPU + GPU parts, the CPU part will therefore be executed out of order. FYI, this isn't something that the programmer can control. It just happens in hardware.
For DDR4-4200 RAM with a CAS latency of 19 cycles, the time required to fetch the first word, assuming the appropriate row is already activated, is about 9.5 ns. However, each _sequential_ word after that would only need about 0.24 ns to fetch, meaning 4 contiguous words would only require about 10.25 ns, and 8 would require only about 11.25 ns. Of course, if the next required word is in another column, you would have to wait the 9.5 ns again, and if it's in another *row*, then you'll need to wait even longer, as your RAM will need to be issued the Precharge command, and then the Active command on the correct row, before the next Read command can be issued. The ALU, OTOH, would usually only need one CPU clock cycle to complete whatever it's doing, especially for a simple operation like addition or multiplication, which is on the order of 0.24 ns. Some ALUs can even do multiple such operations in a single cycle, and if you were using floating point instead of integers, it is relatively common to do a multiply and an add in one operation.
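To make that arithmetic easy to play with, here is a tiny C sketch using the same assumed figures (DDR4-4200, CAS 19). The exact first-word number depends on how you count cycles, so treat the output as a rough estimate rather than a measurement.

#include <stdio.h>

/* Rough burst-read latency estimate using the figures assumed in the
   comment above (DDR4-4200 = 4200 MT/s, CAS latency of 19 I/O-clock
   cycles). These are assumptions for illustration, not measurements. */
int main(void) {
    const double transfers_per_s = 4200e6;                /* 4200 MT/s            */
    const double io_clock_hz     = transfers_per_s / 2.0; /* DDR: 2 transfers/clk */
    const double cas_cycles      = 19.0;

    double first_word_ns = cas_cycles / io_clock_hz * 1e9; /* ~9 ns    */
    double per_word_ns   = 1.0 / transfers_per_s * 1e9;    /* ~0.24 ns */

    for (int words = 1; words <= 8; words *= 2)
        printf("%d contiguous word(s): ~%.2f ns\n",
               words, first_word_ns + (words - 1) * per_word_ns);
    return 0;
}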
This all should be done by the compiler (or the programmer) - investing logic to "correct" a less than optimal code sequence that has to be present in EVERY CHIP and operate EVERY TIME you run your program? That's just clearly not the best answer. I recognize it gave the hardware designers all kinds of opportunity to feel like they're clever, but it's a waste of resources.
+Grayevel But when you reuse the same register you can't reorder it as easily. Register renaming takes care of that, but explaining that in this video would probably be out of scope.
I found the code in this example a bit simplistic; it just covers simple pipelines of algebraic instructions. The real meat of speculative execution comes with branch prediction and caching, and this is where the whole Spectre issue pops up. These complexities are only briefly hinted at in the last 30 seconds of the video.
It still introduces the basic premise of how it functions. Complication is usually not a great way to introduce an idea to those who are unfamiliar with it.
I believe he covered these in the (previous) Spectre & Meltdown video, which was specifically about speculative execution and branch prediction. The point of this video was to explain out-of-order and compare to in-order execution.
Gordon Richardson The title talks about superscalarity, which is the first step to branch prediction and then speculative execution. Can’t fit all three successive topics in a 15-minute video! ;)
I'm 67 yo. I'm amazed that the training I was given in the 80's on early microprocessors, combined with the fun I had writing op-codes for my Commodore 64 enabled me to follow this. Thanks for the instruction!
Coool!
"I'm 67 yo" is now one of my favorite phrases
You were always ahead of your time 😀
you are not alone
He is a really slow C compiler. I'll stick to my usual command, GCC.
Very verbose also
And he only supports ARM! And not Open Source!
I don't know.
He's telling us what he's doing and explaining it all, so doesn't that technically make him an open source compiler?
Or someone just set the -v flag.
JSR $DEADBEEF
_"using the Computerphile paper in a _*_radically_*_ different orientation"_
such a rebel :D
7:35
I'd love to see an explanation of 'side-channels' and how you turn a timing of a memory operation into a specific value from memory.
The timing is not turned into a value. The timing of the operation is used to determine whether the CPU read the value from the cache or main memory.
talleddie81 And, to add further to this, if you know when sensitive (eg. kernel) operations are being executed, you can figure out where they're actually stored, bypassing ASLR. This takes some time, as it's a pretty noisy side channel, but can be pretty effective, as it may take many such probing operations to gather data but as billions of operations are executed per second, it actually doesn't take much real time to get some interesting data, and the longer you do it, the more precisely you can hone in on your target address.
Since a computer might have 16+ GB of RAM how do you even start to get an idea of where in the memory you need to be looking if all you know is that it did have to hit the RAM due to the timing?
As Matthew Ducker said, it is possible to break the ASLR. What you then can figure out is where the user data and kernel data are stored in RAM. As far as figuring out what specific data is stored at each address, that is a very difficult and complicated topic. As far as your original question, the timing is only used to determine where the data came from. Knowing that the data came from the cache can be a clue to an attacker that the data was from a previous operation. In the case of an attack, this previous operation could be a memory read forced by the attacker that should not have occurred.
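To make the "did it come from the cache or from RAM?" test concrete, here is a minimal sketch of the timing measurement itself. It is x86-specific (__rdtsc, _mm_clflush and _mm_lfence come from <x86intrin.h>); the array and the flush-then-retime idea are only for illustration, not an exploit.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_lfence */

static uint8_t probe_line[64];

/* Time a single access in TSC ticks: a small number means the cache
   line was already present, a large one means it came from DRAM. */
static uint64_t time_access(volatile uint8_t *addr) {
    _mm_lfence();
    uint64_t start = __rdtsc();
    (void)*addr;
    _mm_lfence();
    return __rdtsc() - start;
}

int main(void) {
    volatile uint8_t *p = probe_line;

    _mm_clflush(probe_line);          /* evict it from the cache        */
    uint64_t cold = time_access(p);   /* slow: fetched from main memory */
    uint64_t warm = time_access(p);   /* fast: now sitting in the cache */

    printf("cold: %llu ticks, warm: %llu ticks\n",
           (unsigned long long)cold, (unsigned long long)warm);
    return 0;
}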
Thanks for the replies!
I have a request / maybe constructive feedback: I think it would be neat if you could update / create new Computerphile playlists. There are tons of videos I'd like to rewatch, but it's a bit of a pain to look for them one by one. Specifically, I'd want to rewatch all the explanations of exploits/security breaches, for example.
Six years on and no reply, which suggests: "I am wrong".
"...that we talked about in the caching video, many years ago" *cuts to video clip of the same shirt*
Consistency is something that is sorely needed on YT.
not the same shirt, look closer...
Not actually the same shirt, but there is more than a passing resemblance...
_tom scott wants to know your location_
That’s how you know he is the real deal…
Thank you for mitigating the screeching sound of the markers!
Nice job! I had to click through a few explanations before I got to this one. Went straight to the point and kept me engaged, without getting buried in technical details.
Since Dr. B is right-handed I would like to recommend that the camera be located over his left shoulder instead of his right.
Love the shows.
Assuming integers, some more time can be saved if the multiplication is done earlier (it can run in parallel with the load instructions).
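As a small illustration in C (variable names taken from the video's example): the multiply only depends on d and e, so an out-of-order core - or a scheduling compiler - can start it while the other loads are still outstanding.

/* a + b + c + d*e: the multiply depends only on d and e, so it can
   issue as soon as those two values are available, overlapping with
   the loads of a, b and c that are still in flight. */
int sum_expr(int a, int b, int c, int d, int e) {
    int prod = d * e;          /* can start early               */
    return a + b + c + prod;   /* adds wait on their own inputs */
}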
I think the unfortunate situation (like any security) is that this is not a pure computing problem, but a human one. Imagine how much more efficient computers and networks could be without the overhead of dealing with untrustworthy influences. :/
A hardware bug that allows user-level programs access to kernel space, or to other user-level processes' memory address space, defeats the purpose of having virtual memory security in the first place. We should all be outraged that speculative execution doesn't block cache writes for instructions on a mispredicted branch. From what I can tell, engineers were well aware of this problem but ignored it, because they assumed the seemingly random nature of cache page reading and writing would be difficult to exploit, and because of the extra cost of blanking out a cache page, or blocking the writing of that cache page, after a failed branch prediction. People wanted faster recovery from a failed branch prediction for marketing their CPUs. Now they've got more marketing by being able to sell Spectre/Meltdown-proof CPUs.
Nicely presented. I thought of the CDC 7600 designed by Seymour Cray as you were using the Acorn RISC machine in your example. The 7600 was superscalar & pipelined with a multiply unit, divide unit, adder, load/store unit, all 60 bit floating point. Integer operations were 48 bits using the same units but exponent fixed at zero. The Fortran compiler did critical path scheduling of expression evaluation in code generation. An instruction word stack handled decode and issue. Tight loops fit in the IWS and executed without instruction fetch.
It's really funny how for the past 20 years no one mentioned this issue, but now when it is known the comment section of every video about Meltdown and Spectre is full of experts on the matter.
When the x86 architecture started out more than 40 years ago, the design was entirely open, and exploiting flaws was trivially easy. Security features have been added in layers over the last few decades, while maintaining backward compatibility of instruction sets and memory addressing modes. At the same time numerous enhancements have been added, all adding to overall complexity.
This is not how you would design a secure CPU from the ground up, and it does not surprise me when vulnerabilities proliferate. Trading-off speed and convenience, versus security and robustness, is seldom a winning strategy.
On a personal note, some of us old fogeys were around 20-30 years ago, writing low-level machine code and understanding how the CPU worked, and well aware of (some of) the vulnerabilities.
Well, I have a similar surprise - not that people know about it, but that I've seen multiple new, popular, programmer-friendly sources on pipelining and how it works just this year, before Spectre and Meltdown. It's an odd coincidence and I wonder what the catalyst is. Maybe it's just me being human and seeing patterns where there are none. But CppCon had a talk covering it just now in 2017, and I can't recall any other talks that have. I've watched those a lot.
I was introduced to this in 2013-2014 I think.
One popular issue that underlies this is the simple question: Why should I upgrade to an expensive new CPU, when due to heat dissipation limits, the maximum clock speed is pretty much the same as last year's model?
Moore's Law has not ended, but it continues to be implemented in ways that are not obvious to the layperson. With previous generations of processors the differences were large and quantifiable. Now it's all about cache size, incremental improvements, and reduced power consumption.
IMO discussing these fundamental factors has forced the topic of speculative execution into the public consciousness, whereas it was previously known only to a limited number of geeks...
It is kind of strange how this issue was revealed just when, according to some, we have reached the limit of traditional CPUs (silicon chips). If it is not a mere coincidence, I can speculate and say that now that silicon chips can't get more powerful at the same rate as before, CPU makers will have to find another way to pitch us their new products: "Look at our new CPU. It is not more powerful than our previous ones, but it has a new architecture and is not vulnerable to Meltdown and Spectre, so you'd better buy it!" But this is ONLY speculation. I have my doubts that Intel would be willing to lose so much stock value over this.
This general issue has been known for a long time---cryptographic processors are hardened against it. Those things aren't used because they are faster (often they are not), but because they take extra measures against a variety of out-of-band timing attacks. This is just the first time someone looked for and found a way to exploit it on a general purpose CPU with usable results instead of just some academic "oh, interesting". (Also, add media hype.)
Thanks a lot for your work! I really like these videos with Dr. Bagley. He explains everything very well, and the deep level of how computers work is very interesting.
What a great channel, thank you for this.
tl;dr: optimization isn't all about using the fewest instructions, it's about using them in the right order and sometimes using a less "efficient" instruction to achieve parallelization so you can use as much of the CPU's power at once as possible.
Dr. Bagley always delivers to the forefront of my curiosities. I hope to be an example of one of the individuals who may never see the footsteps of higher education, and yet prove that we can indeed continue to prove ourselves as veritable complements to the field of computer science.
It's just amazing how much time goes into making these videos. Thank you!
Very well explained
Another thing that I noticed, that wasn't mentioned in the video, is that reordering the code also frees up register space for reuse.
For example (6:55)
load r0;
load r1;
add r0 = r0+r1;
load r2;
Valid point, but that opens up a whole new layer of complexity...
Hey, if designing a CPU was easy, everyone would be doing it xD
Yes, Intel CPUs can actually do this. However, they typically do the exact opposite:
Consider this code.
...
add r0 = r0+r1;
load r1;
...
Notice that the load instruction needs to wait for the add instruction to finish, because they use the same register. An Intel CPU will simply use a different free register in the load instruction and adjust the rest of the code accordingly.
....
add r0 = r0+r1;
load r2;
....
I'm confused why the processor would ever do the optimizing, instead of a combination of the compiler/interpreter (for the particular bit of code) and the OS (for different processes and whatnot) doing all the optimizing, since those actually have all the information about what will be run.
Actually they do not have all the information about what will be run (if they did, programs could be sped up a lot). You have to take into account the dynamic factors. Values in cache, branch prediction, utilization of individual cores (hyperthreading) etc. all affect the program execution severely and they're very hard to predict during compilation (although compilers of course try to do their best and you can help them with profile-guided optimization).
The processor is the only one that knows whether an item has been fetched from memory previously, and is in the cache, which provides a huge speedup. The compiler cannot possibly know the contents of the cache, although it should do some optimisation of its own.
BTW, modern software can be rather inefficient, and if it weren't for fast CPUs, things would sometimes go very slowly...
That's exactly how Intel's Itanium CPUs work.
If the compiler or interpreter were to try to do it, it would be what's known as a premature optimization, because you'd be optimizing for an assumed/theoretical CPU instead of knowing what it's actually capable of. It could be that your optimizations work well for a select few, or even a great number, of CPUs on the market today, but tomorrow will come, new CPUs will be released, and your modified code could very well perform worse on those. You should just let the CPU itself do what it knows it can do.
ZeikJT if your compiler knows which CPU the code will run on (and given the size of actual executable machine code versus the size of storage, there's no reason not to include all existing CPUs), then it can target all of them. If a new CPU is built, you can recompile your code to make it the most optimized.
"I'm out of order?! You're out of order! The CPUs are out of order!"
Debanik Dawn If I was half the CPU I used to be, I'd take a pipeline to this place!
Out of order ? Who do you think you are talking to? I've been around you know!
The Orona lift(elevator) is out of service! Out of order! press the alarm button
Spent my life on the H/W side of the fence as a developer, and have NEVER understood why problems like this - which could be addressed by having architecture-specific compilers written once and used once to generate optimised code - are always moved into H/W, creating massive complexity (and hence bugs that turn up months later and can't be retro-fixed) and burning power on every single execution cycle, on every single machine, every single time it runs that bit of code.
I agree that, usually, generalisation = slow and optimisation = complex, but surely it's only logical to put the complexity into that part of the system that can be easily changed when problems arise (as they always do with complexity) and which only entail effort/energy/time once at the start of the process. For H/W, the KISS mantra reigns supreme and complexity should be reserved for those things that can't be done up-front.
I think we took a misstep in processor design decades ago. Modern processors have become so complex that no one person can understand all of them (I mean really, REALLY understand - down to the gate level of what's going on in all cases). As a result, we wind up with things like Spectre/Meltdown and so on, which happen because the left hand doesn't know what the right hand is doing. What we chose to do decades ago was to add complex logic to our cores, in an effort to get them to execute code faster. We've gotten to the point where all that stuff represents more of the logic on the chips than the actual compute logic does.
What we should have done instead was to embrace the multi-core idea much, MUCH sooner. We should have kept our cores dirt simple, and just piled more and more and more of them onto the chip. Use ALL of the logic for the business of computing. Of course, this would have required us to face multi-thread programming much sooner than we otherwise did, but we've wound up having to face it anyway. If we'd just swallowed that pill sooner then we would NOT have processors that no one can understand and I wager that we would have much more secure, reliable systems that didn't plague us with all of the difficulties that our current processors do.
You can't really say "That wouldn't have worked as well," because we DON'T KNOW. Software would have evolved in a different way, and we don't have the software we'd have gotten from that other path, so we don't really know where we'd be on overall performance at this point.
We let the tail wag the dog at every turn, though, and now we are where we are. I don't know if there will ever be a way out. Generally speaking, though, I oppose letting whatever body of legacy software we happen to have "at the moment" dictate how we design future hardware. The hardware design should lead, and the software design should follow.
It's mind-blowing to think how much stuff is written and executed just for something easy that we all take for granted!!
Anyone noticed the CD hanging out of the right iMac?
Anyone notice that he was wearing the same shirt in the cache flashback video from a few years ago?
Bill Parsons yes - Looks like Sean's continuity briefing paid off.
with programs like these, many modern CPUs will send a few memory fetch requests one after the other.
while the CPU is waiting for the memory it usually does other tasks.
when the memory arrives, it might arrive out of order (out of order as in, you get b, then a, then d, then c)
so it will compute the calculations by the order of arrival.
Literally programming a queue problem for an assignment as I'm listening to this
Super scaler? There was Sega arcade hardware called the Sega Super Scaler, though I think that was referring to its ability to scale sprites. Look at games like After Burner, OutRun and Thunder Blade, just to name a few.
FYI - the way this fictional CPU executes the code also uses instruction-level parallelism. I don't think there is any useful CPU design that has one but not the other, which means they go hand in hand.
The color palette at 5:53, the left side, with example line "01 LDR R0, a", is challenging to read by my color-deficient eyes. Please reconsider that particular font / background color combo.
William Hebert no
BTW I think the Pentium Pro from 1995 was the first Intel CPU with out-of-order (as well as speculative) execution. The original Pentium (1993) didn't support those features, AFAIK.
I think the Pro was indeed the first Intel with speculative execution, not sure about out-of-order.
*edit:* apparently both
The original Pentium was superscalar but didn't support OOE, that is correct. In the P5's case it had two execution units that could execute instructions in parallel, but it didn't make any decisions more complicated than "Can I execute the next instruction in the second pipeline or not?". The Pentium tried to pair off instructions. Pairs could enter both pipelines, while unpaired instructions could only enter the primary pipeline.
Thanks for explaining how that works! great editing
I would like to see him explain hyper-threading
While we're here, may I suggest another video on how adders/multipliers are built in the CPU itself? Maybe explain the difference between ripple carry adders and carry lookaheads and that kind of stuff :D
At the end of this video he says something that I think is correct, but the entire tech media has gotten wrong about Spectre / Meltdown, perhaps because the people who wrote Spectre and Meltdown papers got it wrong themselves. Spectre is a class of attacks that takes advantage of speculative execution. The attack concept does NOT rely on out-of-order execution.
It could very well be that OOO machines make it easier, or that only the OOO processors run far enough ahead into the speculative path to pull this attack off, but conceptually, Spectre is a speculation issue, not an OOO issue.
Probably true, but AFAIK all CPUs that run speculative execution, also run out-of-order execution. The reality is likely to be messy...
Well, in PC-land, it all went OOO with the Pentium Pro, but the Pentium Classic and its variations had a branch predictor. But it also only had a 5 stage pipeline and dual issue, only one of which could handle a load.
You know that to "surface" data, the meltdown code example requires the ability to get "far enough ahead" to do a speculative load followed by a second speculative load whose address depends on the value loaded in the first.
I don't think that's possible in a short pipeline without many execution units, so older processors probably are not subject to this exploit.
OTOH, there may be modern in-order processors that have deeper pipelines and are superscalar, with an LS unit and two ALUs, that could be exploited. Some of the more modern ARM processors might qualify. ARM11 implementations are 8 and 9 stages deep. I think most (all?) of the modern ARM "A" cores are OOO, but I would not be surprised to see that some architectural licensees have built their own cores that are deep, SS, but not OOO. In MIPS-land, it may be similar.
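For reference, the "second load whose address depends on the first load" shape being discussed looks roughly like this. It is only a harmless, user-space illustration of the dependency chain, with made-up array names - not an exploit; in Meltdown the first load would target memory the process isn't allowed to read.

#include <stddef.h>
#include <stdint.h>

static uint8_t data[256];          /* stands in for whatever the first load reads */
static uint8_t probe[256 * 4096];  /* one cache line per possible byte value      */

/* Two dependent loads: the address of the second is computed from the
   value returned by the first. Run speculatively, the second load drags
   a value-dependent cache line in, which timing can later reveal. */
uint8_t dependent_loads(size_t i) {
    uint8_t v = data[i];           /* first load                          */
    return probe[v * 4096];        /* second load, address depends on v   */
}

int main(void) { return (int)dependent_loads(0); }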
Needs a follow up video on how the bugs work themselves
the whole freaking system is out of order!
Cause when you stick your hand into a pile of goo that was your BEST FRIEND'S FACE, you don't know what to do!!
If people would compile their own software, you could do all this optimization with the compiler, and CPUs could be a lot simpler, with much less power draw, while still being just as fast in the final execution.
Speculative execution is NOT the problem. The fact that another process can access the results of the execution IS the problem.
The WALL between separate processes is not being enforced.
Great Video!!!
I still have some questions:
What part of the CPU looks at the instructions and evaluates a better order to execute them in? How does that not take more time than just executing in the order they were given? And do compilers like GCC rearrange the order first, or is it usually the cpu's job. If the C compiler does rearrange the order, can it inform the CPU that it's already been optimized, and to not waste time checking?
Is there any overhead in the CPU by re-ordering the instructions during OOE?
Well, it's not the execution units themselves that do the reordering. Within the CPU, obviously, some component is required to analyse instructions and their dependencies in order to re-order them. Obviously this costs some area on the chip, and energy, but in the end it should make execution faster.
Thanks for the reply. Now that I think about it, the overhead of a potential re-order (+ new execution time) obviously has to be smaller than the original execution time in order to actually enhance the performance.
The main benefit of out-of-order execution is not to re-order the instructions, but to ensure that the CPU doesn't sit idle while waiting for data to be fetched from memory. In almost all cases there is something else useful that can be done, rather than doing nothing!
What if we re-order the instruction ourselves? would the CPU still do the re-ordering part?
In short, yes. Out-of-order designs typically require *much* more die area compared to an in-order design, and they also tend to use more power. In-order designs need higher clocks to have the same performance as an out-of-order design, but they still tend to be more efficient for low-medium performance levels. For the highest performance, you just can't clock an in-order design any higher (or it becomes inefficient to do so), and an out-of-order design is better.
There are a number of modern, medium-performance in-order designs for exactly this reason, most notably the ARM Cortex-A53, which is the primary core used in virtually every low-end and mid-range smartphone (because of cost). The Cortex-A53 is also paired with higher-performance cores in higher-end smartphones, which allows the higher-power out-of-order cores to shut off when the phone is idle or under light loads (ARM calls this big.LITTLE; there's also a new version called DynamIQ).
Wouldn't the time taken by the CPU to re-order the instructions wipe out any time gained by being able to perform those instructions in parallel? In other words, re-ordering the instructions makes it quicker to do them, but you waste time re-ordering before you can start.
dharma6662013 No, as this is usually performed by the decoder. The ALU and L/S units aren't yet involved. At this stage they will also perform things like checking to see whether data needed by the decoded operation requires data not in cache - if it's not, it will be prefetched, so that it is in cache when it's needed later. This is also where branch prediction comes in - if a branch hasn't been executed yet, it doesn't know whether data used by each branch will be needed, so it will gather the data for the operations involved in the branch it predicts will be taken based on previous behaviour. It may also perform speculative execution (this depends on the design of the specific CPU implementation)
Please forgive my ignorance, but that just seems to "kick the can down the road". Something, somewhere, has to spend time re-ordering things. The result is that the CPU can run things faster. How do we know, and how to we measure, how much the time used re-ordering compares to the time saved *by re-ordering*?
dharma6662013 I would assume that chip designers and their respective companies have done quite some testing on this.
You might want to look up which generation of chips was the first one to implement such a thing and how much faster they got.
dharma6662013 tl;dr: thinking about how long something might take is faster than doing it.
CPUs have an instruction prefetch where the next instructions are loaded into cache before they are executed, usually in 16-byte segments. That gets into branch prediction, and what if you jump to a different address. But the main takeaway regarding instruction reordering, and pipelining in general, is that it can be done _combinatorially_ - meaning a logic circuit that does not use clock cycles, but acts as a direct function on its own. As soon as you feed in the input, given some gate delays, the output appears on the other side. For the purposes of this discussion, just think of it as being an instant process. It's a very long and complicated "if" statement that happens all at once in hardware.
This is such relaxing stuff for my high-level-language-oriented brain.
12:40 actually, instruction 8 (MUL) could happen earlier, during 6 and 7. It still wouldn't be faster than the reordered code, though.
Wouldn't it be more beneficial to do the multiplication first? Because surely a MULT takes more time than an ADD?
Joris Not necessarily. It depends on the particular values. Some multiplications can be done in a single cycle. Notably, power-of-two multiplications (for integers, anyway) will just be converted to bit-shifts (a single cycle operation), but there are still others that may also take only a single cycle.
Divisions are worse (again, except by powers of two, which are just bit-shifts for integers), particularly the modulo (remainder) operation. These can take many cycles.
The implementations of ALUs have many complex tricks to allow for very fast execution - I recommend reading more about them! :)
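A quick way to convince yourself of the power-of-two point above, for unsigned integers (signed values complicate the division and remainder cases):

    #include <assert.h>

    int main(void)
    {
        unsigned int x = 1234;

        /* For unsigned integers, multiplying or dividing by a power of two is
           just a shift, and a remainder by a power of two is just a mask -
           exactly the kind of cheap special case an ALU (or compiler) exploits. */
        assert(x * 8u == x << 3);
        assert(x / 8u == x >> 3);
        assert(x % 8u == (x & 7u));
        return 0;
    }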
In this particular case it would! If you fetch d and e first, you can do the multiplication while a, b and c are being fetched.
Guy Maor yeah, so he showed how the processor would use OOE to speed things up. If it had been done right, it would choose to do the multiplication first in most cases (since a multiplication consists of bit shifts and adds, rather than just adding two numbers). This would mean that at the end it would have to wait on an ADD instead of a MULT, which would speed the whole process up slightly.
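To put rough numbers on this thread, here is a toy model of an in-order machine with one load/store unit and one ALU, run over the "natural" order and a reordered version of a + b + c + d*e. The latencies (load 3 cycles, add 1, multiply 3) and the issue rules are made up purely for illustration - they are not the video's figures or any real CPU's:

    #include <stdio.h>
    #include <string.h>

    /* Toy model: one load/store unit (LSU), one ALU. An instruction issues in
       program order once its sources are ready and its unit is free; independent
       instructions on different units may issue in the same cycle. The point is
       only that the same instructions, reordered, finish sooner. */

    enum unit { LSU, ALU };

    struct insn {
        enum unit u;
        int latency;
        int dest;      /* value produced (index into ready[]), -1 if none */
        int src[2];    /* values consumed, -1 if unused */
    };

    static int run(const struct insn *prog, int n)
    {
        int ready[16];               /* cycle at which each value is ready */
        int unit_free[2] = {0, 0};   /* cycle at which each unit is free   */
        int last_issue = 0, finish = 0;
        memset(ready, 0, sizeof ready);

        for (int i = 0; i < n; i++) {
            const struct insn *in = &prog[i];
            int start = last_issue;                          /* in-order issue */
            if (start < unit_free[in->u]) start = unit_free[in->u];
            for (int s = 0; s < 2; s++)
                if (in->src[s] >= 0 && start < ready[in->src[s]])
                    start = ready[in->src[s]];
            unit_free[in->u] = start + 1;    /* each unit accepts one issue per cycle */
            last_issue = start;
            int done = start + in->latency;
            if (in->dest >= 0) ready[in->dest] = done;
            if (done > finish) finish = done;
        }
        return finish;
    }

    /* values: 0=a 1=b 2=c 3=d 4=e, 5..8 = temporaries */
    #define LD(v)        { LSU, 3, (v), {-1, -1} }
    #define ST(v)        { LSU, 3, -1, {(v), -1} }
    #define ADD(d, x, y) { ALU, 1, (d), {(x), (y)} }
    #define MUL(d, x, y) { ALU, 3, (d), {(x), (y)} }

    int main(void)
    {
        /* a + b + c + d*e in the "natural" order */
        struct insn naive[] = {
            LD(0), LD(1), ADD(5, 0, 1), LD(2), ADD(6, 5, 2),
            LD(3), LD(4), MUL(7, 3, 4), ADD(8, 6, 7), ST(8),
        };
        /* same instructions, but d and e are loaded first so the multiply
           overlaps with the remaining loads and adds */
        struct insn reordered[] = {
            LD(3), LD(4), LD(0), MUL(7, 3, 4), LD(1), ADD(5, 0, 1),
            LD(2), ADD(6, 5, 2), ADD(8, 6, 7), ST(8),
        };

        printf("naive:     %d cycles\n", run(naive, 10));
        printf("reordered: %d cycles\n", run(reordered, 10));
        return 0;
    }

With these made-up latencies the reordered sequence finishes a few cycles earlier (15 vs 18 here), simply because the multiply's operands arrive early enough for it to hide behind the remaining loads.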
Why wouldn't it take more cycles to analyze and determine an optimal order than you'd save by using that new order? Does the compiler that originally compiled the code handle this? Or is this truly on the fly?
Modern CPUs have sufficiently complex hardware to analyse instructions several steps before they are actually executed. IMO the example chosen is simplistic, and not a good example of how pipelining works in practice.
Maybe an off-topic question about CPUs, for the time frame of around 1998-2006: was the PowerPC actually faster than the x86, as Apple always stated, even though its clock frequency was a lot lower?
How does an out-of-order CPU work? Is there a separate module that reorders instructions?
I like this guy, thank you very much for your time - very informative
The best part is when he uses the computerphile paper in a radically different orientation
My knowledge of how a physical processor actually works is low, but I am a mathematician by trade and find this optimisation procedure quite interesting. So I don't know if what I'm about to say actually makes sense. But:
The two sets of instructions in this video only differed in the order of operations. The only data that seems to be needed to run the code in the theoretically most efficient way possible is: what dependencies there are between the instructions, whether they can be run concurrently, and the timings that the operations take. I guess the rub is that the latter isn't really deterministic (or perhaps it is, up to a reasonable margin of error?).
Still: is a simple on-the-fly optimisation (one that is actually implemented at the moment) essentially one which chooses processes that allow other concurrent ones? If module A of the processor is awaiting a new instruction, first it looks at those available, then prioritises those which allow, say, a computation on a currently unused module B (prioritised, perhaps, because it is a slower component of the processor)... and so on in a similar fashion? I guess the mathematical structure I have in mind is a kind of dependency tree which forms part of the data of the instructions, perhaps with some other weights so as to incentivise some processes (those which take place on slower components of the processor).
Lots of gaps here, but I find this optimisation problem theoretically quite interesting and would like to know the current state of the art. It reminds me a lot of FP, where because of lazy evaluation you can ensure that functions are performed in an order so as to not have superfluous operations. Sounds like similar ideas could be useful here.
Excellent video. Thank you!
If the assembly was originally written in the optimal order, will the CPU's useless attempt to reorder them cause an overhead?
Not likely. During design and testing the CPU will be optimised to avoid this kind of wastage. Modern processors actually have huge amounts of overhead, but this is all geared towards the fastest outcome. Low-power alternative processors that have less overhead continue to be available for specialised applications.
a man's man
On a serious note, though: rather than a superscalar architecture, isn't it more effective to put two pipelines in the CPU, with both of them sharing the same execution units?
Yoppy Halilintar This is what SMT does I believe.
Short answer. No.
postvideo97 In a sense, yes, except that there is only one pipeline. While one thread has a particular execution unit tied up - say, waiting for data to arrive from main memory, which can take hundreds of cycles of CPU time (note that this would normally only happen after a branch-prediction miss; usually the CPU has already noticed ahead of time that the data will be needed and requested it in advance, so it's already in the cache or even a register) - you can instead execute operations from a different thread that doesn't need that data, or that execution unit.
Yoppy Halilintar, two pipelines sharing the same execution units is pretty much _the opposite_ of what you want, because delays during the execution and complex dependencies would "stall" _both_ of them.
*edit:* Matthew is right, that's not what SMT (aka hyperthreading for Intel) does.
Guy Maor, how does multicore "share the same execution units" ?!
Did Lander have sound too?? Thanks.
So you're saying they're not CPU aligned? Do we have to talk about parallel universes?
Very well explained!
Nice explanation of "out of order" execution. I knew you were going to make one minor mistake, not that it matters for the point discussed in this video. You threw in multiply as the operation before that last variable. You didn't take into account the typical order of operations. The multiply would be executed first.
you guys really should do a video together with level1techs
Great explanation. Thanks.
Hello did Lander ever have sound at all?? Thanks.
Great video.
man, talk about a fantastic breakdown of the topic!
Can the clones execute Order 66, while the CPU is executing these instructions? I mean they don't depend on each other or anything.
jokes on me, i'm still using an in-order CPU (D2700)
This video taught me how to play the game Silicon Zeros...
Thanks for reminding me I need to put in some more time on that. I'm still in the early piece-of-cake puzzles. Well, they certainly are compared to where I got up to in TIS-100 and Shenzhen I/O.
Hmm interesting, I was unaware of the actual implementation of orders.
Interesting stuff!
Really very interesting! Thank you..
Shouldn't you run B,C, and D at the beginning so the multiply can run at the same time as W?
Having a link to the caching video would be pretty cool.
It's from 2015: ruclips.net/video/6JpLD3PUAZk/видео.html
+Josh Hayes ruclips.net/video/6JpLD3PUAZk/видео.html
And now we move to multi core cache management and prefetching?
Does the CPU decide all this in real time? How does it do all this?! Isn't it just supposed to be an electronic part? If no software intervention occurs here, this might as well be black magic to me.
I have the same question. I'm having trouble imagining how it could possibly be faster to make a bunch of checks on multiple instructions and cache states than it would be to just perform the add/multiply.
I think the 386 was the first commercial superscalar processor.
Very interesting, thanks.
Take a shot every time he says "load store unit"
4:20 that "c" is moving! What? Did that happen in editing?
You're out of order! You're out of order! The whole CPU is out of order! They're out of order!
is this the same as pipelining? It sounds very similar.
Plea from a software user: no matter how efficient the code, please ask yourself how usable it is to the end user. Elegant solutions to programming are so far removed from usability that I fear the connection gets lost sometimes.
Chances are you've never knowingly seen an elegant programming solution. You'd have to actually look at the source code for that, which, as a self-declared software user, you wouldn't and probably can't do. What you're talking about is most likely just bad UI design.
Dohh! I agree and stand corrected. I sort of knew when I made the comment that it was off topic. Please pass my concerns to any UI designers you might know.
I didn't understand a thing but that's cool.
CrashCourse has a crash course on computer science; they go very in-depth about how CPUs work and what assembly code does, but they still keep it brief and simple enough that it's very easy to follow for the layman, provided you pay attention.
A programmer expects the lines of code they write to be "executed" by the CPU sequentially. It turns out, though, that CPUs move them around and execute them "out of order" because that's faster. Yet to any outside observer (the programmer) it still looks as if the code (which is translated into machine instructions) runs sequentially.
Fz Does this apply to, let's say, 3D applications? For example RTS video games, where some things need to be executed sequentially and multi-threading is of no help. Noob here.
OoO happens in any program that runs on a CPU that supports it (pretty much everything). In the case of a video game, which consists of CPU + GPU parts, the CPU part will therefore be executed out of order. FYI, this isn't something the programmer can control; it just happens in hardware.
Same and I had a final exam on this a month ago...
I thought it took less than 100 nanoseconds to get data from main memory, not 200. How can we calculate this? Basing it on 4200 MHz RAM.
For DDR4-4200 RAM with a CAS latency of 19 cycles, the time required to fetch the first word, assuming the appropriate row is already activated, is about 9.5 ns. However, each _sequential_ word after that would only need about 0.24 ns to fetch, meaning 4 contiguous words would only require about 10.25 ns and 8 only about 11.25 ns.
Of course, if the next required word is in another column, you would have to wait the 9.5 ns again, and if it's in another *row*, then you'll need to wait even longer, as your RAM will need to be issued the Precharge command, and then the Active command on the correct row before the next Read command can be issued.
The ALU, OTOH, would usually only need one CPU clock cycle to complete whatever it's doing, especially for a simple operation like addition or multiplication, which is on the order of 0.24 ns. Some ALUs can even do multiple such operations in a single cycle, and if you were using floating point instead of integers, it is relatively common to do multiply and add in one operation.
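For anyone who wants to redo that arithmetic, here is a back-of-the-envelope version of those figures, assuming DDR4 moves two transfers per I/O clock (so 4200 MT/s means a roughly 2100 MHz clock) and counting CAS latency in those clock cycles, with row activation and precharge ignored. Depending on how you round, the results land in the same ballpark as the numbers above:

    #include <stdio.h>

    int main(void)
    {
        const double transfers_per_sec = 4200e6;          /* DDR4-4200: 4200 MT/s */
        const double clock_hz = transfers_per_sec / 2.0;  /* ~2100 MHz I/O clock  */
        const int cas_cycles = 19;                        /* CL19                 */

        double first_word_ns = cas_cycles / clock_hz * 1e9;   /* time to first word  */
        double per_word_ns   = 1.0 / transfers_per_sec * 1e9; /* each word after it  */

        printf("first word: %.2f ns\n", first_word_ns);
        printf("4 words:    %.2f ns\n", first_word_ns + 3 * per_word_ns);
        printf("8 words:    %.2f ns\n", first_word_ns + 7 * per_word_ns);
        return 0;
    }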
Damn that shirt is fly
His shirts are magnificent.
He's the embodiment of British style.
Like, weird funky shirts.
Reminds me of James May.
He is very fly!
If you like British guys wearing funky shirts, you should also watch Curious Droid videos.
For the code shown here, a second load/store unit would speed up the execution immensely.
How will they make future CPUs?
Nice! Thank you!
7:35 We've gone from landscape to portrait !
This all should be done by the compiler (or the programmer) - investing logic to "correct" a less than optimal code sequence that has to be present in EVERY CHIP and operate EVERY TIME you run your program? That's just clearly not the best answer. I recognize it gave the hardware designers all kinds of opportunity to feel like they're clever, but it's a waste of resources.
Is it (a+b+c+d)*e or a+b+c+(d*e)?
Super informative!!
Out of service out of order please press the alarm button!
Do you have a PATREON page where we can contribute?
Interesting, though it sounded like CPUs were out of order rather than (still) being out of order.
It would have been faster if he did the multiplication first.
Correct. He could load "a" while multiplying, instead of stalling, but the video is still very clear on how out of order works, so...
horrible use of registers xD bad compiler xD
which one is the bad version?
The second version. After you add R1 to R0, you can reuse R1 for the next value from memory (c) because you don't need b anymore.
+Grayevel
But when you reuse the same register you can't reorder it as easily. Register renaming takes care of that, but explaining that in this video would probably be out of scope.
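A rough C analogy of why register reuse gets in the way (variables standing in for registers; the real mechanism is in hardware, but the idea is the same):

    /* Recycling one temporary creates a write-after-read hazard: the load of c
       must not overwrite r1 before the first ADD has read it, which limits how
       far the hardware can move that load up. */
    int reuse_one_temp(const int *m)
    {
        int r0, r1;
        r0 = m[0];       /* LDR R0, a */
        r1 = m[1];       /* LDR R1, b */
        r0 = r0 + r1;    /* ADD R0, R0, R1 */
        r1 = m[2];       /* LDR R1, c - must wait (or be renamed) */
        r0 = r0 + r1;    /* ADD R0, R0, R1 */
        return r0;
    }

    /* With distinct names there is no false dependency, so all three loads can be
       started early. Register renaming gives the CPU this freedom even when the
       program keeps reusing R1. */
    int fresh_names(const int *m)
    {
        int r0 = m[0];
        int r1 = m[1];
        int r2 = m[2];
        return r0 + r1 + r2;
    }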
ya man XD
NERD!!! xD
All of that just to say the main point at the end?! I was lost for the first 10 minutes.
Nice.
I found the code in this example a bit simplistic; it just covers simple pipelining of algebraic instructions. The real meat of speculative execution comes with branch prediction and caching, and this is where the whole Spectre issue pops up. These complexities are only briefly hinted at in the last 30 seconds of the video.
It still introduces the basic premise of how it functions. Complication is usually not a great way to introduce an idea to those who are unfamiliar with it.
He has already done the video detailing the exploits themselves.
They should make a part 2
I believe he covered these in the (previous) Spectre & Meltdown video, which was specifically about speculative execution and branch prediction. The point of this video was to explain out-of-order execution and compare it to in-order execution.
Gordon Richardson The title talks about superscalarity, which is the first step to branch prediction and then speculative execution. Can’t fit all three successive topics in a 15-minute video! ;)