If you're actually back.. that would be fantastic. I love your stuff. Possibly hard-to-grasp concepts broken down into byte-sized pieces for the rest of us to process. Keep up the amazing work, hope to see more of you!
It is one of the best videos I could find on this topic. I think the short debunking of the "logical cores" term at the end was very important. Most videos and articles describe HT (Intel's SMT) as "it adds logical cores so the CPU is much faster", which is wrong when put that way. Thank you for the excellent overview.
Not quite. SMT implementations often do duplicate resources. At minimum you need a second program counter. And many in-order implementations end up duplicating register files (or having extended window sets).
And while SMT does allow threads to share/split hardware resources, that's often not where the performance gains come from, though it is easier to get additional IPC if you are pulling from 2, 4, or 8 dependency graphs rather than one. A lot of it comes from being able to keep busy while waiting on memory latency, which is why mainframe chips and GPUs support 8 contexts at once.
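To put rough numbers on the latency-hiding point, here is a toy model in Python. It is only a sketch under loud assumptions: one execution unit, every thread alternating a fixed burst of compute with a fixed memory stall, and a fixed-priority pick of the first ready thread. The function name `utilization` and its parameters are made up for illustration and do not describe any real core.

```python
def utilization(threads, compute_cycles, stall_cycles, total_cycles=1000):
    """Fraction of cycles a single execution unit is busy (toy model)."""
    ready_at = [0] * threads               # cycle at which each thread may run again
    remaining = [compute_cycles] * threads  # work left in the current compute burst
    busy = 0
    for cycle in range(total_cycles):
        for t in range(threads):
            if ready_at[t] <= cycle:
                # First ready thread gets the execution unit this cycle.
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Burst finished: the thread now waits on memory.
                    remaining[t] = compute_cycles
                    ready_at[t] = cycle + 1 + stall_cycles
                break
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "contexts:", utilization(n, compute_cycles=1, stall_cycles=3))
```

With 1 cycle of work per 3 cycles of stall, one context keeps the unit busy a quarter of the time, and extra contexts fill the gaps until the unit saturates. Real memory latencies are far longer and far less regular than this.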
And from the program's perspective, instructions never execute willy-nilly. The Tomasulo algorithm for out-of-order execution adds another 2 stages to Fetch-Decode-Execute-Write. You now get Fetch-Decode-Rename-Execute-Write-Commit, and you always commit in order. SMT does drop nicely out of this without a lot of extra transistors, but it works best when you have an abundance of physical resources (reservation stations, ROBs, and execution pipelines).
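The "execute out of order, commit in order" idea can be sketched with a tiny reorder-buffer model in Python. This is an illustration, not how any particular chip does it: `commit_order` is a hypothetical name, each instruction is just a name plus the cycle its execution finishes, and commit width is assumed unlimited.

```python
from collections import deque


def commit_order(instructions):
    """instructions: list of (name, finish_cycle) in program order.

    Returns a list of (name, commit_cycle) in the order they commit.
    """
    rob = deque(instructions)  # the reorder buffer keeps program order
    committed = []
    cycle = 0
    while rob:
        name, finish = rob[0]
        if finish <= cycle:
            # Only the oldest entry may commit, and only once it has finished.
            committed.append((name, cycle))
            rob.popleft()
        else:
            cycle += 1  # younger finished instructions wait behind the oldest
    return committed


if __name__ == "__main__":
    # The load is oldest but finishes last; add and mul finish early.
    print(commit_order([("load", 5), ("add", 1), ("mul", 3)]))
```

Even though `add` and `mul` finish executing first, nothing becomes architecturally visible until the older `load` has committed, which is the illusion of sequential execution the program relies on.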
Also, a context switch doesn't just flush the pipeline; it has to save the architectural state. (You've got to clear the workbench before changing projects, or the parts for one thing will end up in another.)
The reason that two threads can't normally co-exist on a single logical processor is pretty simple. It's like two people trying to use the same desk to get work done. Without some sort of prior agreement you'll set something down, then the other guy will set something on top of it, and you'll be sending someone else's personal love letter out to be photocopied and distributed as an office memo. Not to mention it doesn't really make sense, as you only have one program counter and you can only tick along it or jump to a specific other place. The example you show at 13:50 is a general concurrency problem, and not solved by SMT or context switching.
That's why most ISAs have a conventional calling format, which lets you call functions in the same context without them stomping over registers the caller will need upon return.
And I feel there's some conflation here between the OS scheduler, which allows concurrency/multitasking, and the hardware mechanism for keeping the architecturally visible resources of SMT threads separated. As far as the OS scheduler is concerned, each logical thread is a core. If the OS is SMT-aware it can be a little smarter about how it spreads things out, but that isn't necessary for the scheduler to function.
Hi there, thanks for commenting.
For your first point, I should have specified that execution resources are not duplicated. It is true that there are scheduler resources duplicated to accommodate the extra contexts. I am not familiar with enough implementations to know about when register file resources are duplicated and I feel that's far out of scope.
Your second point I do agree with. Having an extra thread on the core does raise the likelihood of an instruction having the resources required to be executed, increasing execution unit uptime. I neglected to mention that point.
For your third point, I feel I was not clear enough. I probably shouldn't have said "things can execute willy nilly" when earlier I made a huge point of "we care that things are done in a specific order". I should have said that the actions of a scheduler or core are not always going to be deterministic, even to the scheduler itself, since there's no real way to know when any given resource will be free. A programmer can't assume they'll know the actions of a scheduler.
Fourth point: out of scope. A lot of the things in this comment are definitely really valuable, and you likely have more knowledge than me, but this video was already very long and I don't think there's a lot of value in adding every single possible detail. A pipeline flush is one thing that can be explained simply to illustrate why a context switch isn't great. I also did mention in the video that there is more to a context switch than just the pipeline flush.
I don't really understand the example you give in your fifth point. Sorry about that.
"And I feel there's some conflation here between the OS scheduler which allows concurrency/multitasking": you mention this but I don't see where you point it out. I feel like I very specifically make this delineation near the end of the video, stating that these two things are distinctly different. Maybe I wasn't clear.
Thanks for your reply
@@elegeto Sure, being able to talk about something in sort-of detail is quite an art. It's hard to be accurate enough without boring the layman to death with the details.
Basically each thread must have its own copy of the architectural registers (the ones addressable by the assembly/ISA). Intel and ARM A76/77/78 do it with a thread-aware rename backed by a large shared physical register file. SPARC did it by adding extra windows (banks of 8 registers) and dynamically allocating them to threads as needed. I believe early Power simply had a split register file w/o rename. GPUs, on the other hand, statically allocate register resources to warp/wave groups and can switch between groups with a very minimal context change (flush the pipeline, change the register offset index), though they won't run two threads at the exact same time.
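A rough Python sketch of the "thread-aware rename backed by a shared physical register file" idea, to show why two threads can both use the same architectural register name without colliding. Everything here is a simplification and an assumption for illustration: the class and method names are hypothetical, and the old physical register is freed immediately, whereas real hardware would only free it once the overwriting instruction commits.

```python
class RenameTable:
    """Per-thread architectural-to-physical maps over one shared pool."""

    def __init__(self, num_threads, num_physical):
        self.free = list(range(num_physical))             # shared physical registers
        self.maps = [dict() for _ in range(num_threads)]  # one map per hardware thread

    def write(self, thread, arch_reg):
        """Allocate a fresh physical register for a write to arch_reg."""
        phys = self.free.pop(0)  # raises IndexError if the pool is exhausted
        old = self.maps[thread].get(arch_reg)
        self.maps[thread][arch_reg] = phys
        if old is not None:
            self.free.append(old)  # simplification: freed at once, not at commit
        return phys

    def read(self, thread, arch_reg):
        """Look up which physical register currently holds arch_reg."""
        return self.maps[thread][arch_reg]


if __name__ == "__main__":
    rt = RenameTable(num_threads=2, num_physical=8)
    print(rt.write(0, "rax"), rt.write(1, "rax"))  # two different physical registers
```

The state that has to be duplicated per thread is the mapping, while the pool of physical registers is shared, which is roughly why this style of SMT is cheap on an out-of-order core that already renames.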
The process stages (fetch, decode, execute,...) can be shared but state must be kept separate, and keeping state requires a fair number of transistors.
And gosh yes, there's so much non-determinism in a modern CPU: cache layout, data structure alignment, other code running, and speculation and mispredicts can all have a significant performance impact.
But for SMT I think that's getting down into the details. The big idea is the logical replication of architectural state, plus some mechanism to mix the instruction streams, whether it be fine-grained multitasking, round-robin issue (scalar or superscalar), or intermixed issue with a failsafe to make sure one thread can't dominate too badly. From what I can tell, Intel mixes/flip-flops fetch and decode as much as it can and reserves a minimum guaranteed buffer for each thread, and execution sticks to issuing from the oldest ready reservation station.
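The simplest of those mixing schemes, round-robin, can be sketched in a few lines of Python. This is a toy with assumed names (`round_robin_fetch`, instruction streams as plain lists of strings); it takes one instruction per thread per turn and skips threads that have run dry, with none of the buffering or fairness safeguards a real front end has.

```python
def round_robin_fetch(streams):
    """Interleave instruction streams, one instruction per thread per turn.

    Returns a list of (thread_id, instruction) in fetch order.
    """
    positions = [0] * len(streams)  # a per-thread "program counter"
    merged = []
    while any(pos < len(s) for pos, s in zip(positions, streams)):
        for t, stream in enumerate(streams):
            if positions[t] < len(stream):
                # Tag each instruction with its thread so state stays separable.
                merged.append((t, stream[positions[t]]))
                positions[t] += 1
    return merged


if __name__ == "__main__":
    print(round_robin_fetch([["a1", "a2", "a3"], ["b1"]]))
```

Note that each thread's own instructions still appear in their original order, and every instruction carries a thread tag. Those two properties are what let the streams share pipeline stages without sharing state.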
For the fifth point, the program's model of the world is inherently sequential (for von Neumann and Harvard architectures). Instruction 1 goes, then 2, then 3, then 4, unless an instruction in the stream tells it to skip to somewhere else, where it happily begins counting again.
Now this isn't what really happens in a modern x86, as you well know, but it's the illusion that must be kept for the program to execute. Each step is reliant on the state of the processor from the last step. Mixing in random instructions from elsewhere breaks this contract unless you can separate the state information.
And the conflation I saw wasn't between the program's view of a processor and the chip designer's view of a core, which you explained quite well, but between the two schedulers themselves. The OS one has to do a fairly complex balancing act and track a lot of variables, but the hardware scheduler is perhaps a slightly glorified queue.
@@WorBlux One of the issues with talking about a technology like SMT is that, as you just pointed out, the implementation of the same concept can vary greatly from architecture to architecture. I try to keep most of my explanation (and knowledge tbh) aimed towards x86_64 since that's what most folk are familiar with on the consumer side of things (or I guess that's slowly changing day by day to ARM). There are so many cool ways developers achieve the same things through completely different methods. My knowledge of registers, and of how data is managed overall, is severely lacking though!
Architectural state is an interesting concept I don't know enough about. I wasn't aware it was duplicated; I thought that the state was a single entity completely managed by the scheduler that took into account all running threads at once, which now that I say that aloud... doesn't make much sense since that's not the default state.
"...but of the two schedulers themselves." Now I totally understand what you're aiming at. To be frank, I didn't really give any mention at all to the OS scheduler... You mentioned in your original comment that an OS that is SMT-aware can make some more intelligent decisions as to how things are spread out, but does that mean the OS scheduler can simply pile more threads onto a core and "hope" everything works out, or can it explicitly state "run x program and y program on the same core, with these parameters to govern SMT"? I'd think it wouldn't be able to, but that does seem like something that'd be really useful.
You've definitely given me a lot of food for thought and things to look into!
@@elegeto
Ya, x86 is just a corner of the computing world, a very popular corner, but a lot exists outside of it. And to complicate things, Alder Lake is here with 16 cores and 24 threads, with its "little" Gracemont core looking like the ARM A76 internally, and the A78 looking a lot like AMD's Zen.
The OS scheduler being more aware of the chip topology and being able to take and generate scheduling hints is going to be a big part of performance and power efficiency going into the future with more heterogeneous designs.
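On the question of whether software can say "put these on the same core": on Linux, at least, the topology is exposed through sysfs and a process can be pinned to a set of logical CPUs. The Python sketch below is Linux-only and hedged accordingly. The sysfs path and `os.sched_setaffinity` are real, but the helper names (`parse_cpu_list`, `smt_siblings`, `pin_to_core_of`) are made up for illustration, and pinning only restricts where the OS may place a process. It gives no control over how the core arbitrates between its SMT threads.

```python
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,4" or "0-1,8-9" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def smt_siblings(cpu):
    """Logical CPUs sharing a physical core with `cpu` (Linux sysfs only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


def pin_to_core_of(cpu, pid=0):
    """Restrict process `pid` (0 = the caller) to the SMT siblings of `cpu`."""
    os.sched_setaffinity(pid, smt_siblings(cpu))
```

Calling `pin_to_core_of(0)` from two different processes would confine both to the logical CPUs of whichever core holds CPU 0, which is about as close as user space gets to "run x and y on the same core".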
I hope you're doing well. I just found your video and it helped me a lot. I have exams tomorrow,
so thanks, keep going.
Hey Sebastian, are you back on RUclips? Thanks for the video, you are good at explaining things ^.^
What does a multithreaded core look like physically on a CPU compared to a non-multithreaded core? How is it designed?
Multithreading is mainly a scheduler implementation. The scheduler decides what is "next" in line for the CPU to do, and the coordination of instructions for any particular thread is mandated by the scheduler. The rest of the core will usually have higher capacity too to handle the extra throughput a multi-threaded scheduler can schedule, but yeah, it's mainly a scheduler thing. Not to downplay it though, as the scheduler is one of the most important pieces of architecture design.
great video
Very nice
Definitely remembering to call it SMT not hyperthreading now :D