I like the little anecdote at the end about the ray tracer and changing a test to gain a big speed boost.
Thanks! It was a real head-scratcher! I've tried to post a link to where I discussed it more, but it seems to have been filtered; if you look around I've got a talk where I go into more detail.
@@MattGodbolt can you give the name of the video to search?
@@NoNameAtAll2 I tried that with my first reply, which seemed to get removed, but "path tracing three ways" should get it. The bit is near the end :)
@@MattGodbolt Is it the one with v=HG6c4Kwbv4I (that's a YouTube video ID that goes at the end of the URL)?
I don't write any software that's time-critical or where there aren't a million other uncontrollable slowdowns (my stuff runs on shared cloud services) but I found this anecdote really opened up my horizons and my way of thinking about programming.
In awe of the people who came up with such simple ideas to do branch prediction. And the people who work out how to build that logic in silicon, so that it can run in one clock tick, are gods!
A lot of this type of stuff can be laid at the feet of Cray computers. They invented a lot of this type of tech back in the '70s and '80s. The fact that this is in pretty much ALL modern CPUs is a feat of engineering.
I may be wrong on details, but I am pretty sure Cray did groundbreaking work on branch prediction, deep pipelining, out-of-order execution, and the like.
Coming up with the idea is "easy", actually implementing it is the hard part
@@jeromethiel4323 Cray was an unquestionable genius. Often underappreciated IMHO.
Kudos to the host for tending to ask very good questions about the topic being discussed.
Being able to explain a complex technical subject in a way I can understand is an amazing skill.
During my CS degree we had some classes about CPU architecture and pipelines. I was always impressed at how complicated the things we take for granted actually are, and what we studied was very, very basic, not even close to the magic that is branch prediction.
I've been programming for a very long time but I didn't realise how sophisticated these branch predictors could get. The idea that it can compute a simple hash in a single clock cycle and use that to capture patterns is fascinating. Now that makes me want to go look into the details of some of these open CPU designs :)
This was an amazing explanation of branch prediction. I've been in tech for more of my life than not and I've known that branch prediction was a thing, but could not fathom how it worked even after some reading online and this made it approachable. Thank you :)
Branch prediction is why there are a lot of algorithms that work faster on sorted data, even if the order of elements theoretically doesn't matter to the algorithm.
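For example (just a sketch of the classic demo, with a made-up threshold; the exact speed-up depends on the hardware): summing only the elements above a threshold gets much faster once the data is sorted, because the branch inside the loop becomes predictable.

```cpp
#include <cstdint>
#include <vector>

// Sums elements >= 128. On shuffled data the `if` is effectively a coin
// flip; once the vector is sorted it becomes a long run of "false" followed
// by a long run of "true", which the branch predictor learns almost perfectly.
std::int64_t sum_big(const std::vector<std::uint8_t>& data) {
    std::int64_t sum = 0;
    for (std::uint8_t v : data) {
        if (v >= 128) {   // this is the branch the predictor has to guess
            sum += v;
        }
    }
    return sum;
}
// Sorting first (std::sort from <algorithm>) costs O(n log n) but can make
// this loop itself several times faster on typical hardware.
```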
I'm sure it helps, but it's not the sole reason.
One of the big benefits of sorted data is that it allows for binary search. The best low-tech example would be something like a phone book or dictionary: you can jump to the middle page, see whether the thing you're searching for is earlier or later in the data, discard the other half, then repeat the process. If you had a list of everyone alive on Earth it would only take 33 steps to look up anyone, whereas if the list wasn't sorted the worst case would take 8 billion steps. Having the list sorted makes it ~200 million times faster, even without any fancy CPU tricks.
There are also things with nested loops where there can be performance gains from having the loop aligned to how the data is stored in memory. E.g. in graphics programming, when storing pixel data, it can make a big difference whether you loop over it row by row or column by column. I guess branch prediction comes into that, though I thought it was more down to the memory/storage controllers than the CPU pipeline.
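A minimal sketch of the phone-book idea, i.e. plain binary search over data that is already sorted:

```cpp
#include <cstddef>
#include <vector>

// Classic binary search: about log2(n) probes, so ~33 for 8 billion entries.
// Returns the index of `target`, or -1 if it isn't present.
long long binary_search(const std::vector<long long>& data, long long target) {
    std::size_t lo = 0, hi = data.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;  // midpoint without overflow
        if (data[mid] < target) {
            lo = mid + 1;   // discard the lower half
        } else if (data[mid] > target) {
            hi = mid;       // discard the upper half
        } else {
            return static_cast<long long>(mid);
        }
    }
    return -1;
}
```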
Side note: Branch prediction is incompatible with keeping memory secret. Disable branch prediction when handling secrets.
In goes branch prediction, out comes secret.
IBM managed to slow down their mainframe using branch prediction. How often do you have a JMP (else branch)? DSPs just had zero-overhead loop instructions, similar to the one in the 80186. So at the start of the loop you have an instruction where the immediate value says from where to jump back to here. It only needs a single register, not a table. Works on inner loops.
And then there is hyper threading, where you fill the pipeline with the lower priority thread instead.
No need for speculation or attack vectors.
The ARM7TDMI in the GBA had a flag to instruct it to follow up branches. But how does it know that there is a branch before decoding? So it still needs a cache: one bit for each memory address to remember an up branch. At least this is well documented, and the compiler can optimize for it.
Even with this nice prediction: why not follow both paths with a ratio? One cycle to advance this path, three for the other. Stop at load/store to avoid hacks or inconsistencies.
The PS3 showed the way: more cores, no CISC-like optimization per core. Similar today with GPUs and their RISC-V cores.
love this explanation - plain and simple!
Anybody else amazed by the fact that Matt wrote the Fibonacci sequence in x86 and just knew the sizes of the instructions?
Very cool - great, understandable explanation!
I'm _predicting_ that the one-character change was from a short-circuit && to a bitwise &. The former might be compiled as two branch instructions, while the latter as only one.
Bingo. Well in this exact case a || to a |. And it wasn't 100% effective; sometimes the compiler still decided it was going to use two branches.
What I don’t quite understand, and this is perhaps because the metaphor breaks down, is what is the decoding robot actually doing? It takes a piece of information, and ‘decodes’ it into a different piece of information? But why is this information understood by the next robot where the original information wasn’t?
I presume this has something to do with determining which physical circuitry actually executes the instruction, but I can’t really visualise how that happens.
The decoder can, for example, work out which bits of the instruction are the memory address, register address, ALU operation code, etc., and then forward those bits to the right units of the processor. In some other processor implementation, the decoder could just check the operation code and make a microcode jump to the microcode address handling that instruction.
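To make that concrete, here's a toy sketch where the instruction layout is entirely made up (it's not any real ISA): "decoding" is just pulling fixed bit fields out of the instruction word so each piece can be routed to the right place.

```cpp
#include <cstdint>

// Hypothetical 32-bit instruction layout (invented for illustration):
//   bits 26..31: opcode, bits 21..25: destination register,
//   bits 16..20: source register, bits 0..15: immediate value.
struct Decoded {
    std::uint8_t  opcode;
    std::uint8_t  rd;
    std::uint8_t  rs;
    std::uint16_t imm;
};

Decoded decode(std::uint32_t instr) {
    Decoded d;
    d.opcode = (instr >> 26) & 0x3F;   // which operation (add, load, branch, ...)
    d.rd     = (instr >> 21) & 0x1F;   // register that receives the result
    d.rs     = (instr >> 16) & 0x1F;   // register providing an operand
    d.imm    =  instr        & 0xFFFF; // constant / address offset
    return d;
}
// The opcode steers the ALU, the register numbers go to the register file,
// and the immediate goes to the address adder or ALU input.
```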
It's hard to explain, as in a modern design it works a bit differently.
But to keep it simple: the initial piece of information is just a pack of 1s and 0s. The branch predictor is going to predict whether it's a branch, whether it needs to be taken, and where to. It doesn't even have to read the 1s and 0s to know whether the instruction is a branch; everything is done based on previously seen branches, using tables to track things.
The later decode stage transforms this pack of bits into a series of actions to do. This sets up what needs to be done to execute the instruction, e.g. for an addition: where to get the two values to add, where to put the result...
It's a fair question. In older chips the decoding was often straightforward, implemented as a kind of ROM called a PLA that mapped opcodes to sets of things to do, cycle by cycle.
In modern CPUs like x86s, the instructions are very complex and difficult to decode, and they actually decode into a whole other sequence of smaller operations called "micro ops". Maybe if we get time we will go over that in one of these videos! There's complexity there too!
Instructions pack as much information into as few bits as possible; decoders unpack this information. For a simple CPU it does something like converting the binary-coded add instruction into an "on" signal to the execution hardware that performs the add operation. In modern CPUs, instructions are very complex and need multiple steps to execute, so the decoder breaks the complex instructions down into multiple simpler operations called micro-ops. It can also do the reverse: fuse multiple instructions into one micro-op.
I realized this is Godbolt!!!!
As a software developer I'm wondering how you optimize for branch prediction when the cpu is effectively a black box. I guess you can only speculate that you are getting branches wrong or maybe there is a cpu setting to store branch prediction hits and misses?
Modern Intel CPUs are chock-full of performance-related counters, actually.
vtune
On Linux `perf stat -- program goes here` like `perf stat -- ls -l` (or whatever). I had that cued up to demo but the conversation went in a slightly different direction :)
As long as your branch has some kind of pattern to it, the CPU will do decent prediction. If the branch is completely random the CPU will miss half of the time and you are better off trying to rewrite the code to branchless, for example by using the conditional move instruction. You can often persuade a compiler to produce branchless code by using the C/C++ ? : operator in clever ways.
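For example (a sketch; whether you actually get a conditional move depends on the compiler and optimization level):

```cpp
#include <cstdint>
#include <vector>

// Branchy version: an unpredictable `if` inside a hot loop mispredicts
// roughly half the time on random data.
std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        if (x > threshold) sum += x;
    }
    return sum;
}

// "Branchless" version: the ternary is often compiled to a conditional move
// (cmov) or select, trading the branch for a data dependency.
std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        sum += (x > threshold) ? x : 0;
    }
    return sum;
}
```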
In C++ you can add [[likely]] and [[unlikely]] attributes in the code. The compiler is then supposed to do that optimization for you if it wants to (basically).
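Something like this (C++20; they only say which path is expected, and the compiler is free to ignore them):

```cpp
#include <stdexcept>

// [[likely]] / [[unlikely]] mainly influence code layout, e.g. keeping the
// error path out of the hot, fall-through path. They don't program the
// branch predictor directly.
int parse_digit(char c) {
    if (c < '0' || c > '9') [[unlikely]] {
        throw std::invalid_argument("not a digit");  // expected to be rare
    }
    return c - '0';  // the common case stays on the straight-line path
}
```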
I seem to have missed the original, but this guy seems great at explaining CPU stuff.
Any chance of a further video about how the Spectre class of vulnerabilities fits into this? (My limited understanding is there are a few more things going on in between, but that seems like the extreme example of branch prediction going wrong.)
Two thoughts. First, when does it make sense to just add a couple more robots to the middle of the pipeline so that you have two pipelines in effect? That way you aren't ever flushing your cache; you're simply deciding which pipeline assembly line to continue processing, so you're throwing away some work, but it doesn't stall the process. Second, at what point will we start to see neural networks used for branch prediction? Seems like you could start using backpropagation to apply weights for recognizing patterns for branch prediction.
AMD's CPUs have used a neural network for a long time now for branch prediction, as you suggest (google it)
I don't understand the example at the end. What was the difference between the before and after? I understand C and assembly if that helps explain it.
Thanks, I needed this.
Wouldn't it be cool to submit in-memory programs to a RAM pipeline much like shader programs can be submitted to a GPU pipeline?
That might be something we have to do to prevent spectre-like bugs by design.
Programmable branch prediction? The idea makes my head spin!
Unroll your loops?
Wonderful!
Is that Ray Tracing video at the end soon to be released? Can't find it via search by name
I wanna know about that black screen in the background showing followers, stock, etc. That looks like a cool project
It's a Tidbyt showing some standard things plus some website stats
The part where the branch predictor increments/decrements the probability of each branch prediction reminded me of JITs, which were also covered recently on Computerphile. Do I understand correctly that this branch prediction adjustment also happens at runtime? Or could the program be dry-run a couple of times during the compilation process to preconfigure the branch predictor somehow? It's a fascinating piece of technology either way :)
The branch predictor is entirely live, based on the current run and history of the program. Some older Intel chips did let compilers place some branch hints, but they have been removed as... to decode the hints you need to have already fetched and decoded the instructions... by which time it's probably too late :)
But the ideas are similar, yes. Just even more micro-level than the tricks JITs pull
@@MattGodbolt thank you for taking the time to answer! Have been loving the series :)
If we can decode that an instruction is a branch way ahead of the execution step that will decide to take it or not, isn't it possible to build a second pipeline in parallel as soon as we know that this instruction is a branch that could be taken, such that when we come to the execution step that will decide if we have to take it or not we only need to decide if we stay on the actual pipeline or switch to the second one we built in parallel ?
Yeah. Only way to restore a wrong prediction. Anything below this does more harm than benefit.
Still don’t want to leak speculative LOAD and STORE to the outside. Memory mapped IO?
Yes, this is called speculative execution. Instead of taking one branch, the CPU executes both and discards the one it wasn't supposed to take. In CPUs today this only happens when the CPU has no other work to do, which can be quite often when waiting for memory operations, or even when just waiting for the comparison instruction to finish, which can take a while given how deeply pipelined the CPU is.
NOP isn't strictly doing nothing, it does something that _changes_ nothing. On x86, NOP is equivalent to "XCHG AX, AX", which is just swapping register AX with itself. No change, but still doing something. 8 opcodes are used for instructions that swap one of the general-purpose registers with AX, one of which just happens to correspond to using AX as the nominated register, and which gets the name NOP instead of what it actually does.
Does that mean, the first iteration of any loop is a bit slower than the following iterations?
My take away:
Branch Prediction: When I see this, I will give you that, noted.
Excellent video thank you
What happens when the predictor makes the fetcher fetch both branches? If it sees a branch at an address that is not in the table, does that speed up the processor?
I recall simple memories like this used in artificial life in the 90s to find apples around trees...Animat? MIT Artificial Life publication
they have basically figured out how to make a machine learning Reinforcement-Learning prediction model in a SINGLE tick!
But probably this thing again is split up into 3 pipeline stages for some reason. Like, look at MIPS and tell me how register based instructions need more than 3 stages! MIPS says: LOAD needs exactly two cycles and two stages more. This is obviously not correct if cache is involved.
Couldn’t a neural network be implemented for this?
Edit: Turns out it can be: neural branch prediction
Yes it can. AMD are using perceptrons as fast predictors for their Zen processors. But the misprediction rates are high, so they are also supplementing it with a slower but more accurate predictor.
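Very roughly, a perceptron predictor works like this toy software sketch (sizes, hashing and thresholds are made up and nothing like the real Zen hardware):

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// Toy perceptron branch predictor in the spirit of Jiménez & Lin.
constexpr int kHistoryLen = 16;   // bits of global branch history
constexpr int kTableSize  = 256;  // number of perceptrons

struct Perceptron {
    std::array<int, kHistoryLen + 1> w{};  // w[0] is the bias weight
};

std::array<Perceptron, kTableSize> g_table;
std::array<int, kHistoryLen> g_history{};  // +1 = taken, -1 = not taken

// Prediction: bias plus dot product of weights with recent branch outcomes.
int perceptron_output(std::uint64_t pc) {
    const Perceptron& p = g_table[pc % kTableSize];
    int y = p.w[0];
    for (int i = 0; i < kHistoryLen; ++i) y += p.w[i + 1] * g_history[i];
    return y;  // predict taken if y >= 0
}

// Training: adjust weights only when the prediction was wrong or not confident.
void perceptron_train(std::uint64_t pc, int y, bool taken) {
    constexpr int kThreshold = 32;
    Perceptron& p = g_table[pc % kTableSize];
    const int t = taken ? 1 : -1;
    if ((y >= 0) != taken || std::abs(y) <= kThreshold) {
        p.w[0] += t;
        for (int i = 0; i < kHistoryLen; ++i) p.w[i + 1] += t * g_history[i];
    }
    // Shift the actual outcome into the global history.
    for (int i = kHistoryLen - 1; i > 0; --i) g_history[i] = g_history[i - 1];
    g_history[0] = t;
}
```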
Pipelining is hard stuff, but very well explained. 😊
Figured they always simultaneously executed both branches until something wrote to memory or the branch was fully known.
Schrodinger's CPU
Given there's usually a branch every 4 to 6 instructions and the pipeline can be tens of instructions long, it quickly gets out of hand: each branch would bifurcate again and again...it's better (currently!) to guess and commit to the guess
Thank you - It's just a little less dark magic.
How do you go from 2 ifs to 1 if in 1 byte?
&& and || operators will short circuit, which means that in expression "foo() && bar()", bar will be called only when foo returns true. Replacing them with bitwise & or | will unconditionally evaluate both sides, removing the branch.
Compilers can sometimes optimize those for you, if the operations are cheap and evaluating the right-hand side won't affect the program's behavior. For example, a branch in (x > 0 && x < 10) can be optimized out, but a branch in (p != NULL && *p == 42) can't and shouldn't be, because dereferencing a null pointer would crash the program.
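As a sketch of what that one-character change can look like (the ray/box style test here is made up, just to show the shape of it):

```cpp
// Short-circuiting ||: the right-hand comparison is skipped when the left one
// is true, which the compiler may implement as two separate branches.
bool misses_shortcircuit(float t_near, float t_far) {
    return t_far < 0.0f || t_near > t_far;
}

// Bitwise |: both comparisons are always evaluated (they're side-effect free,
// so that's safe here) and OR'd as plain bits, typically leaving at most one
// branch for the predictor to deal with.
bool misses_bitwise(float t_near, float t_far) {
    return (t_far < 0.0f) | (t_near > t_far);
}
```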
Nice 👍
That ray trace thing is better done with a collision map though? You're already drawing every object into 3d space, just note the id of a triangle in a collision map for it and have the ray lookup the cells directly. There's no comparing of "is it to the right or left", it's just "What do I load here?" where the default id (0) just loads a value of no effect against the light.
The sound of the markers is "awesome"
Branch prediction is a thing I have started considering as a bit of an old relic of its time.
I suspect it will be gone in the near future, since it hasn't actually been useful in practice since the introduction of out-of-order execution.
(I also feel that this comment is exceptionally short and only people who have thoroughly studied out-of-order execution will catch my drift. Just decode both sides; execution can't keep up with the decoder, so interleaving it for a few tens of cycles is meaningless as far as the instruction queue/buffer is concerned. One won't get bubbles during this process, and if one does, then lengthen the queue/buffer to give execution more scraps to mull over as the decoder gets ahead again.)
Now, if one doesn't use out-of-order execution and has a lengthy pipeline, then yes, prediction is very useful. (Unless one also cares about constant time; then prediction and out-of-order execution are both one's nemesis.)
But out of order execution pretty much relies 100% on accurate branch prediction! I hope to cover that (and indeed the reason I've done BP is to lay the groundwork for future videos that cover OoO)
It is just as likely to be off to the left, right.
If a human could read the matrix like Neo he would be the closest.
I think I missed how it is known the prediction failed.
Who is keeping tabs on the predictor? 😅
When the execution step at the end of the line processes the branch instruction properly, it compares the proper answer to the prediction. If they don't match then it pulls the horn and dumps the conveyor belt same as before.
@@thewhitefalcon8539 yes I have seen the vid but it still does not click.
The predictor can give the wrong address to go to based on previous behavior. Is code being executed/evaluated before it is really executed? If you catch my drift 😅
The first 100 times it predicted right. But the 101st time its prediction was wrong.
Which address is being executed at that specific time?
@@ryan-heath I don't think of it as executing an address directly ever; it's executing whatever is in the pipeline (presuming the pipeline is loaded correctly). The steps are abstracted so the CPU can proceed faster from the cached instructions in the pipeline, not pulling from an addressed memory location, which would take longer to pull than it does to execute, IIRC. It is the jump instruction being executed that would reveal whether the pipeline has loaded the correct prediction. In the infinite loop example, it can't be predicted wrong after 100 loops, so that example doesn't directly address that, but if it was a conditional branch operation instead of an unconditional jump, then it would be the execution of the conditional branch that reveals whether the prediction is correct.
@@MNbenMN hmm I think I get what you are saying.
So the cache contains the instruction and the address it was loaded from.
The branch instruction can now check whether the needed address is already loaded in the cache. If it is not, the prediction was wrong.
@@ryan-heath That sounds about right for the extent of the explanation in this video, as far as I understand it. However, the modern implementations of branch prediction and caching are more sophisticated/complex with parallel threads, to the point of unintentionally introducing Spectre exploit vulnerabilities, and I am no expert on CPU architecture to the details on that level.
Godbolt!
Here after the Windows update with AMD branch prediction optimizations.
I'm interested in so many odd subjects. 😢
They have 2.39 Million subscribers.
I don't like your example - the predictive robot should be able to recognize an UNCONDITIONAL jump. I feel like that should be within the capabilities of a fetch unit. Unconditional calls as well. I understand calls raise some delicate issues, but after all, the fetch unit is the one that knows what the return address is going to be. The execute unit shouldn't have any awareness at all of where in memory the instructions it's executing have come from. In a properly "clean" design that would mean that the fetch unit would "own" the return stack. Modern software strategies make that problematic - just one example of how we haven't followed our best possible path. We really shouldn't be mingling "fetch relevant" and "execution relevant" information in a single data structure.
11:00 That's currying.
First robot has sharingan
I wonder how many years until this task is done by a built-in LLM-like predictor that is training in real time or one/few-shotting it....
LLM is overkill. You can embed a NN and do backpropagation / Hebb's rule in hardware.
@@orlandomoreno6168 Next-Token-Prediction seems like the perfect skill for this task; at the speed things have been progressing, it should be a matter of years at most before LLMs can predict CPU operations faster than CPUs can run natively. I forgot which one it was, but recently one of the normal LLMs trained on human language was shown to be able to learn machine code from in-context demonstrations and demonstrated the ability to replicate the behavior of a Turing machine; imagine what one trained specifically on CPU operations running on specialized ASIC might achieve in a few years.
edit: I found it I think, it was Claude 3 Opus
2:03
Crypto Bull run is making waves everywhere and I have no idea on how it works. What is the best step to get started please,,
Am facing the same challenges right now and I made a lots of mistakes trying to do it on my own even this video doesn't give any guidelines
I will advise you to stop trading on your own if you continue to lose. I no longer negotiate alone, I have always needed help and assistance
You're right! The current market might give opportunities to maximize profit within a short term, but in order to execute such strategy, you must be a skilled practitioner.
Inspiring! Do you think you can give me some advice on how to invest like you do now?
If you are not in the financial market space right now, you are making a huge mistake. I understand that it could be due to ignorance, but if you want to make your money work for you...prevent inflation
"Congrats to everyone Who is early and who found this comment.. 🐼😊
Oh look it predicts just like a human does. Must be sentient.
There are lots of different ideas to make predictors, from simple to complex. The most basic ones just have a table of like 16 slots and they write down which direction the branch went last time and overwrite the oldest one. Some AMD CPUs use small neural networks.
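For flavour, a toy software version of that table-of-counters idea, here with 2-bit saturating counters (real predictors add history bits, tags and much more):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One 2-bit counter per slot, indexed by the branch address:
// 0-1 predict "not taken", 2-3 predict "taken".
class TwoBitPredictor {
    std::array<std::uint8_t, 1024> counters_{};  // all start "strongly not taken"

    static std::size_t index(std::uint64_t pc) { return (pc >> 2) % 1024; }

public:
    bool predict(std::uint64_t pc) const {
        return counters_[index(pc)] >= 2;
    }

    // Once the branch really executes, nudge the counter towards what happened,
    // saturating at 0 and 3 so one odd outcome doesn't flip a stable prediction.
    void update(std::uint64_t pc, bool taken) {
        std::uint8_t& c = counters_[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};
```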
AI researchers be like
It doesn't "predict just like a human does". Neural networks *do* predict like a human does. No need to be sassy just because you don't understand the differences in their mechanisms.
@@IceMetalPunk It's a joke. Anyone who predicts is basically a crazy person.
@@IceMetalPunk Also I can be however I want. If you don't like it then that's your problem. You don't know me.
first and shuddup yes i am
Not a great video...
I gave on it before 1/2 way.
I just don't like how this guy is trying to convey his messages.
@JamesSharman ;)