The explanations in your videos are so crisp. Really appreciate the quality of these - keep it up :)
Hello and thank you very much for your comment! Glad you liked the video =)
I have a computer architecture exam late this morning and woke up extra early to go to the hospital for a visit, so I'm watching this video while I'm waiting 🙌🏻
Hello and thank you for your comment! Do take care and all the best for your exam =)
I had completely forgotten all of this since studying for my Comp Arch class. Your video was the refresher on the introduction that I needed. Thank you.
You're welcome! Glad to be of help =)
Amazing video, it really made me understand why the PPE cores used in both CELL and Xenon were so underwhelming. They really suffered from all the bad stuff mentioned in this video: long pipelines, lots of stalls, lack of out-of-order execution and more. It also made me realize how important it was to rely on the SPEs as much as possible in CELL's case, which BTW was a big PITA. Cool stuff.
Oh wow, this is a great case study, thank you for sharing! Its pipeline is 23 stages! Really interesting to read about.
@@NERDfirst Prescott P4: Hold my beer!
At least that's x86 - a CISC instruction set so it's less out of place!
Thanks for enlightening me about heuristics. I loved the graphical representation of the "shifts" in your presentation on pipelines and "stalls" that happen and avoiding them along the way. I knew just a moment before you showed us that the instructions were about to be reordered. My understanding has been improved. My knowledge of assembly language helped a lot, I just never bothered to look into the matter as you have done. Thanks a lot.
Hello and thank you very much for your comment! Glad you enjoyed the video, and really appreciate you sharing your "aha" moment - That's one of the things I live for as an educator =)
Heuristics make me want to see a CPU (simulation) where the scalar CPU splits up into two threads at every branch (becomes superscalar). Store commands write into a FIFO! Then, when the branch condition is clear, a whole tree of threads is flushed. The store FIFO of the taken branch is flushed to memory. This might be a useful operation mode for those 16-core RISC-V chips.
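One way to picture the idea above is the minimal Python sketch below. It assumes a single branch rather than a whole tree of threads, and the function name `run_speculatively`, the `(address, value)` store tuples and the dict standing in for memory are all invented for illustration, not taken from any real CPU or simulator:

```python
from collections import deque


def run_speculatively(memory, taken_path, not_taken_path, resolve_branch):
    """Run both sides of one branch, buffer their stores, commit only the winner.

    Each path is a list of (address, value) stores (a hypothetical format).
    """
    fifos = {"taken": deque(), "not_taken": deque()}
    for name, path in (("taken", taken_path), ("not_taken", not_taken_path)):
        for address, value in path:
            # Stores are buffered in the path's FIFO, not yet visible in memory.
            fifos[name].append((address, value))

    # The branch condition only becomes known after both paths have run ahead.
    winner = "taken" if resolve_branch() else "not_taken"

    # Drain the surviving FIFO to memory in program order.
    for address, value in fifos[winner]:
        memory[address] = value

    # The losing FIFO is simply dropped, so its stores never reach memory.
    return winner


if __name__ == "__main__":
    memory = {0: 0}
    run_speculatively(memory, [(0, 1), (1, 2)], [(0, 9)], lambda: True)
    print(memory)
```

Extending this to a full tree would mean one FIFO per speculative thread, which is where the hardware cost of the idea would start to grow quickly.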
Thank you, this helped clarify some things I came across for the Comptia A+ exam. Much appreciated.
You're welcome! Very happy to be of help :)
THANK YOU SIR!! I made many Minecraft CPUs when I was 13. Back then there weren't many videos or resources that didn't explain pipelining in terms of "car assembly lines" or "laundry", or 4000-page university PDFs from the 90s. Thank you so much, good sir.
You're welcome! Very happy to be of help =) I think those are fairly textbook explanations so it's no wonder you see them a lot. Analogies are good too I suppose, but I guess nothing beats visualizing it properly!
@@NERDfirst i struggled with this for so long. but thanks to u maybe i can try playing minecraft again. have a good day!!
Good luck! Consider planning out your design first using actual logic components before doing it in game. Redstone is a whole different level of complexity!
@@NERDfirst ohh alright, thanks!!
Great quality video, easy to understand for people who do not come from the computer science world, great job!
Hello and thank you very much for your comment! Glad you liked the video :)
Your content is so professional. Can you also make videos on modern microprocessor architectures like the i3, i5, i7, etc.?
Hello and thank you for your comment! Unfortunately those architectures are far more complex (some modern architectures have twenty or more pipeline stages) so I haven't gotten round to learning about them.
That's not the only reason for pipelining. You could build a CPU that does the whole instruction in one clock (one rising, one falling edge). But you still have propagation time that limits max clock speed (and computation speed). Pipelining lets you break up the propagation path into smaller chunks and raise clock speeds.
Hello and thank you for your comment! To be fair, increasing clock speed this way isn't going to increase the overall speed of computation - No point getting your clock speeds up to 20GHz if every instruction has to make its way through 100 pipeline stages!
Ultimately it's less about managing propagation delay - In fact having multiple pipeline stages _increases_ the total per-instruction propagation delay since it makes the circuitry more complex. The advantage comes about from the "parallelism" where we essentially start on the next instruction before the last one is complete.
@@NERDfirst let's say you have an ALU that has 100 ns of propagation delay. Now you split that up into two 50 ns steps with some latches in between. You just almost doubled your instructions per second due to doubling the clock rate. This is pipelining and its most important reason.
What you are referencing is superscalarity and out of order execution - the use of multiple execution units to their full extent.
I think we're talking about the same things using different words, or maybe I just wasn't explicit enough on the point. My way of explaining it (at 3:32) assumes that pipeline stages exist but instructions are processed to completion before the next instruction enters the pipeline. Your way of explaining it does away with the pipeline model and considers the execution of an instruction as a single large step.
I didn't explicitly mention propagation delay by name to reduce cognitive load, but I do believe the understanding conveyed is the same. If I understand your explanation correctly, you get a doubling of instructions per second _because_ of instruction-level parallelism. At the end of the day, if you double the clock speed but each instruction takes two clock cycles to complete, the number of instructions per second is exactly the same. It is because of superscalarity allowing you to have multiple instructions in the ALU at once that you can have a performance benefit.
Do let me know if I'm understanding you wrongly. It's been a while since I did this stuff.
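To put rough numbers on the two views in this exchange, here is a small Python sketch. It assumes an ideal pipeline with no stalls and reuses the 100 ns and 50 ns figures from the comment above. The helper name `total_time_ns` and the `overlapped` flag are made up for illustration:

```python
def total_time_ns(num_instructions, num_stages, stage_ns, overlapped):
    """Time to finish a stream of instructions, assuming no stalls or flushes.

    overlapped=False: each instruction finishes every stage before the next starts.
    overlapped=True:  a new instruction enters the first stage on every clock.
    """
    if num_instructions == 0:
        return 0
    if overlapped:
        # Fill the pipeline once, then one instruction completes per cycle.
        cycles = num_stages + num_instructions - 1
    else:
        cycles = num_stages * num_instructions
    return cycles * stage_ns


if __name__ == "__main__":
    n = 1000
    print("one 100 ns step:             ", total_time_ns(n, 1, 100, False), "ns")
    print("two 50 ns steps, no overlap: ", total_time_ns(n, 2, 50, False), "ns")
    print("two 50 ns steps, overlapped: ", total_time_ns(n, 2, 50, True), "ns")
```

Under these assumptions, splitting the step in two without any overlap takes the same total time as the single 100 ns step, while letting consecutive instructions overlap roughly halves it.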
@@NERDfirst In my example my single ALU can be in two discrete steps of executing two instructions - first half of a new instruction and second half of an older instruction. You can imagine my pipeline like this (a modification of the classic RISC pipeline):
Fetch
Decode
Execution 1
Execution 2
Memory
Write Back
I have divided the execution stage in two. This is because my hypothetical ALU would have 100 ns of propagation delay and would limit the clock to 10 MHz. By splitting it up I now have a slightly longer pipeline, but my largest propagation delay went down to, let's say, 55 ns (because we had to add latches in between the stages, it's not exactly half). Now my CPU can run at 18 MHz. Both of those frequencies roughly translate to instructions per second, because in both cases the instructions complete "in a single cycle" due to pipelining. This is the advantage of longer pipelines - as long as you get an uninterrupted stream of instructions you can get a boost in IPS because you have a higher max clock. This is of course not ideal, because you have branches in the code and those stall or flush the pipeline.
You are executing multiple instructions at a time because the result of one step is passed on to be computed in the next. Basically it's an improvement over very old CPUs, which executed those steps one after another (because pipelining needs additional circuitry), so you got one instruction every, for example, 4 clocks.
But you can't compute more instructions at a time than you have pipeline stages. For that you need superscalarity - having multiple ALUs, multiple address generation units, etc. working at the same time - and to make it work right you also use out-of-order execution, so you can fill up those elements' pipelines (yes, everything is pipelined in a modern CPU).
What I was implying earlier was that a Harvard architecture CPU could execute a full instruction in a single clock - because both instruction and data are supplied at the same time - but it might not run at a very fast clock because data has to propagate through the whole datapath in that one clock cycle.
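The arithmetic in this comment can be checked with a short Python sketch. It assumes an ideal, uninterrupted instruction stream (no branches, stalls or flushes) and that the slowest stage alone sets the clock period. The function names `max_clock_mhz` and `instructions_per_second` are invented for illustration:

```python
def max_clock_mhz(slowest_stage_ns):
    """The clock period must cover the slowest stage; 1000 / ns gives MHz."""
    return 1000.0 / slowest_stage_ns


def instructions_per_second(num_instructions, num_stages, slowest_stage_ns):
    """Throughput of an ideal pipeline that accepts one instruction per clock."""
    total_cycles = num_stages + num_instructions - 1
    total_seconds = total_cycles * slowest_stage_ns * 1e-9
    return num_instructions / total_seconds


if __name__ == "__main__":
    n = 1_000_000
    # Classic 5-stage pipeline limited by a 100 ns ALU.
    print(f"{max_clock_mhz(100):.2f} MHz", f"{instructions_per_second(n, 5, 100):.3e} IPS")
    # 6-stage pipeline with the ALU split, slowest stage now 55 ns.
    print(f"{max_clock_mhz(55):.2f} MHz", f"{instructions_per_second(n, 6, 55):.3e} IPS")
```

For a long enough stream the extra pipeline stage barely matters, so the throughput ratio approaches the clock ratio of 100 / 55, or about 1.8.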
Very well explained.
Hello and thank you for your comment! Very happy to be of help =)
Superb explanation 🎉
Hello and thank you very much for your comment! Very happy to be of help :)
Thanks, boss, clear as glass, love you
Hello and thank you for your comment! Glad to be of help =)
Great explanation, thanks
You're welcome! Glad to be of help =)
Great video, well done!
Hello and thank you very much for your comment! Glad you liked the video :)
Great video!!!
Hello and thank you very much for your comment! Glad you liked the video :)
Great video, keep it up!
Hello and thank you very much for your comment! Glad you liked the video :)
these videos are really good
Hello and thank you very much for your comment! Glad you liked the video =)
Damn, your videos are so nice!!!
Thank you very much! I remember your comment on another one of my videos as well, glad to know you like my work =)
Great content 👍
Hello and thank you very much for your comment! Glad you liked the video =)
well put
Thank you very much! Glad you liked the video :)
thank you
You're welcome! Glad to be of help :)
great thanks
You're welcome! Glad to be of help :)
Good stuff
Thank you! Glad you liked the video :)
Why do I see in some materials that the order of the process is IF (Instruction Fetch) -> ID (Instruction Decode) -> EX (Instruction Execute) -> MEM (Access Memory Operand) -> WB (Write Back)?
Hello and thank you for your comment! If I'm not wrong, what you've described is specifically the MIPS pipeline. Different architectures can have a different number and order of pipeline stages, so this isn't universal. What I've shown in the video isn't linked to any specific assembly architecture, it's just a generic abstract pipeline to make understanding things easier.
I think that MIPS tries to speed up write back. When every value flows through the pipeline for 5 cycles, we can turn off power for that register for this time. Leakage should bring it to a middle state between on and off. Then we write back, which is still a little power hungry due to the fan-out, and then turn on power to let the bits flip into their intended states.
I have a question: not all instructions have a write back, i.e. they don't write their results back to registers, memory, etc. For example, on the 8080, jmp instructions do not write back to anywhere. Another example would be a MOV instruction, which moves data from memory/registers to registers/memory.
So what happens when an instruction has no write back? Does it execute a no-op?
Again, I’m still quite the novice, thanks
Hello and thank you for your comment! Yes, instructions that don't require any action to be taken on any stage would still have to go through the stage, but will do nothing there.
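As a rough illustration of that answer, the Python sketch below traces a few instructions through the five stage names quoted earlier in the thread. The `ACTIVE_STAGES` table is a simplified assumption about which stages do real work for each kind of instruction, not a description of the 8080 or any specific CPU. Lower-case entries mark a stage the instruction passes through while doing nothing:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

# Hypothetical table: the stages in which each kind of instruction does real work.
ACTIVE_STAGES = {
    "add": {"IF", "ID", "EX", "WB"},
    "load": {"IF", "ID", "EX", "MEM", "WB"},
    "store": {"IF", "ID", "EX", "MEM"},
    "jmp": {"IF", "ID", "EX"},
}


def trace(program):
    """Return one row per instruction showing which stage it occupies each cycle."""
    total_cycles = len(program) + len(STAGES) - 1
    rows = []
    for index, op in enumerate(program):
        row = ["."] * total_cycles
        for offset, stage in enumerate(STAGES):
            # The instruction still occupies the stage for one cycle either way;
            # lower case means it does nothing there.
            row[index + offset] = stage if stage in ACTIVE_STAGES[op] else stage.lower()
        rows.append(row)
    return rows


if __name__ == "__main__":
    program = ["load", "add", "jmp"]
    for op, row in zip(program, trace(program)):
        print(f"{op:<6}" + " ".join(f"{cell:<3}" for cell in row))
```

Here the `jmp` row ends in `mem wb`: it still takes its turn in both stages, which keeps every instruction the same length in the pipeline.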
@@NERDfirst Thanks, that makes sense
could you change the song please, my brain is burning because of this :(
but I understand the concept, thanks :) like
Oh sorry about that! I compared levels with popular YouTubers and realized my BGM was turned down much lower than theirs. I'd hoped for it to be out of the way but looks like you still picked up on it. I'll see what I can do for future videos!
@@NERDfirst thanks