This was so insightful. Thanks a lot.
I really appreciate you taking the time to make these videos; they're very insightful and understandable even with limited design knowledge. Do you plan to do any videos on how SIMD is implemented on these processors? Your mention of microcode emulation piqued my interest.
Thanks! That's the goal, but I am always concerned that they will end up being too technical.
I didn't plan on doing a video on SIMD, but I can. The only interesting parts in the PMMX were the resource sharing (since the MMX function units could be claimed by the U or V pipe), the register file duplication and power gating (I believe that's the reason why a FP/MMX switch is so expensive), and maybe the interaction with U-V load/stores. Other than that, the SIMD function units were built to do all of the 64-bit operations without any specific microcode.
The K7 on the other hand was a bit more interesting, since it had a 32-bit datapath and was designed for the 64-bit 3DNow! and not SSE. So all of the SSE instructions required a double issue (one way to look at that is: the SSE instructions were translated into two 3DNow instructions). I can talk about that when I cover the K7 front end.
Or were you more interested in how the actual function units were implemented?
As for microcode emulation, that's what Intel calls the vector decoding process. If an instruction requires more cycles or micro ops than the regular decoders can handle, then it needs to be sequenced from a ROM. It's effectively "emulating" the behavior of the massive PLA used by the 386 and 486, but is also capable of offloading the work from the hardware decoders (making them simpler and faster).
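If it helps to picture that, here is a rough behavioral sketch in Python (purely illustrative; the opcode names, micro-op lists, and the dictionary lookup are stand-ins, not Intel's actual decoder or ROM format):

```python
# Sketch only: "simple" instructions come straight out of a hardware decoder,
# while anything that needs more micro-ops is sequenced from a microcode ROM.
MICROCODE_ROM = {
    # hypothetical complex instructions -> micro-op sequences played out over several cycles
    "REP_MOVSB": ["load_byte", "store_byte", "dec_count", "loop_check"],
    "FP_MMX_SWITCH": ["save_fp_state", "remap_registers", "restore_mmx_state"],
}

def decode(instruction):
    """Return the micro-op sequence for one instruction."""
    if instruction in MICROCODE_ROM:
        # "vector decode": the hardware decoder only finds the ROM entry point;
        # the sequencer then streams the micro-ops on the following cycles.
        return MICROCODE_ROM[instruction]
    # common case: the hardware decoder emits the micro-op(s) directly
    return [instruction.lower()]

print(decode("ADD"))        # ['add'] - handled by the hardware decoder
print(decode("REP_MOVSB"))  # sequenced from the ROM
```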
@@RTLEngineering Whoa, thanks for giving such a detailed response! Yeah, it was more the SIMD front end that I was curious about, and you pretty much answered all my questions with that :) Very interesting to hear about the K7 though; I always believed that 3DNow! was something tacked on rather than something the CPU was designed around, though I suppose it makes sense when you consider release timing. Looking forward to your video on the K7 front end!
@@RTLEngineering I always saw SIMD as a simple control unit with 4 execution units behind an unpacker stage. Maybe that's a very simplified view, or completely wrong.
But why didn't they just use 4 arithmetic pipelines and schedule scalar (even from a different context with shadow registers) and SIMD operations alike?
Besides, if the long-term goal is an FPGA implementation, I wouldn't mind a global view of the architecture: how the pieces fit together and how well FPGA specifics can be used.
For example, the async units on the P4 were a disaster, but on an FPGA, SRAM and DSPs can clock almost twice as fast as the whole CPU. Could that be useful?
@@vincentvoillot6365 That's one way to think of SIMD, though MMX was a bit more complicated. There would be 8x 8-bit execution units in this case, which could link together to form 4x 16-bit or 2x 32-bit execution units, meaning that they couldn't be duplicated blocks. The operations performed by MMX had to be much faster than the integer ALUs could handle, at the cost of being less versatile. Also, the PMMX had two MMX ALUs, meaning it could have up to 16 execution paths in parallel.
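One way to picture the lane linking (a behavioral Python sketch, not the actual circuit): the same 64-bit adder is split into 8-, 16-, or 32-bit lanes just by deciding where carries are allowed to propagate.

```python
def partitioned_add(a, b, lane_bits):
    """64-bit SIMD add with carries blocked at lane boundaries.
    lane_bits = 8, 16, or 32 selects 8x8-bit, 4x16-bit, or 2x32-bit operation."""
    mask = (1 << lane_bits) - 1
    result = 0
    for lane in range(64 // lane_bits):
        shift = lane * lane_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of each lane
    return result

# 0xFF + 0x01 wraps within its byte lane instead of carrying into the next one
print(hex(partitioned_add(0x00FF, 0x0001, 8)))   # 0x0
print(hex(partitioned_add(0x00FF, 0x0001, 16)))  # 0x100
```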
There may have also been the desire to not modify the integer paths, since that would have required more work/effort, which means it would have taken longer. The end result was treating it like another coprocessor (similar to the FPU).
By global view, do you mean going over how things executed (going over the block diagram and a pipeline diagram)? Or do you mean something like going over a dieshot and comparing it to the block diagram? (I have the former as part of a longer video that I am thinking about cutting down and releasing as a long and short version, and I have been considering the latter).
Async in a FPGA becomes very tricky though, especially when it comes to constraints. You would be better off mapping it to a synchronous block that runs faster, if that's what you are implying. Although, the P4 was much faster than any FPGA could reasonably handle.
In terms of general mapping to a FPGA, that's a very complicated topic. Henry Wong went over that in his dissertation, and I have been thinking about doing a video on it. Essentially, the "gate" delay in a FPGA is much larger than it would be in pure silicon, so the optimal number of pipeline stages for a given speed target is larger than it would otherwise be. But Intel had a paper looking at this for the P4, where more stages means a longer delay to evaluate branches, and that reduces performance. So there's an optimal number of pipeline stages, which sets 1) the complexity of each stage, 2) the LUT depth per stage, 3) the maximum achievable clock speed, and 4) the maximum performance for a design. Essentially, if you were to implement a P4, the optimal design might end up being 40 stages (longer than the original P4), which means it would take maybe a 30% performance hit and only be able to achieve a clock speed of 300 MHz. The same thing is true for the PMMX (though not 40 stages), but since the PMMX is so old, you can cover some of that performance hit with better resource mapping and some newer execution methods like a better branch predictor (even GSHARE would do significantly better - for a P4, you would have to move to something like a perceptron-based predictor).
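For reference, GSHARE is simple enough to sketch in a few lines of Python (behavioral only; the table size and initial counter values here are arbitrary choices, not taken from any specific design):

```python
class Gshare:
    """Minimal GSHARE predictor: the global branch history XORed with the
    branch PC indexes a table of 2-bit saturating counters."""
    def __init__(self, index_bits=12):
        self.index_bits = index_bits
        self.table = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) & ((1 << self.index_bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.index_bits) - 1)

bp = Gshare()
print(bp.predict(0x401000))  # cold prediction: not taken
bp.update(0x401000, True)    # train on the actual outcome
```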
@@RTLEngineering When I said global view, I meant block and pipeline diagrams. Die shots are interesting, but from what I understand of Henry Wong's thesis, trying to reproduce an existing CPU 1:1 on an FPGA is far from efficient. I'm not advocating for a P4 implementation, I'm not that perverse ^^.
And as you said, the PMMX is an old beast and could use some modern features.
I get that each gate level introduces a propagation delay, and the goal is to create small "fast" blocks and string them together in a pipeline.
Mr. Wong builds a comparator-adder in three levels and targets 200-250 MHz; a DSP48E can do the same thing on 48 bits in its second stage at 400-500 MHz, and to my knowledge the only CPU with async units was the P4. "The iDEA DSP Block Based Soft Processor for FPGAs" uses a DSP as the execution unit; not the most efficient, but it could be interesting for SIMD.
Personally, I'm still stuck on his cache diagrams; it's like looking at an M.C. Escher drawing.
PS: sorry for my broken English.
This reminds me of Jim Keller's comment that x86 decode "isn't all that bad if you're building a big chip".
Excellent as always. I really like your idea of pages; they look like bitplanes for microcode.
* Some instructions can be fused together; could the decoder handle that with the help of a dedicated page?
* Some thoughts about hardware multiplication: using the FPU is tempting (the 80-bit extended precision format has a 64-bit mantissa), but wouldn't that be funky register-wise?
I read that a 32x32 multiply takes between 10 and 20 cycles on the Pentium, which seems pretty quick without dedicated hardware. I thought a shifter and an adder would take at least 32 cycles in the worst case.
* You point out that the U and V pipelines each have a load and store unit on the Pentium and Pentium MMX. All the others (Pentium Pro included) have a load unit and a store unit, each with its own pipeline.
I suppose that's due to the out-of-order architecture? And/or the widening of the address space to 35-36 bits? About the implementation, depending on the peripherals (like PCIe or SATA on an Artix through LiteX), is it viable to keep a 32-bit address space, push directly to 48 bits to be future-proof, or something in between?
* In a pipelined architecture, doesn't the worst-case scenario depend largely on the addressing mode? Two instructions with immediate addressing take one fetch for both; direct mode adds two more reads, and indirect mode two more again. Cache or not, the CPU still has one memory interface with limited bandwidth and latency. Where and how do you arbitrate all these memory accesses?
Thanks! The page idea is similar to the idea of bitplanes. Although it's not my idea - it's how Intel and AMD describe the opcode tables.
- The only instruction fusion that the PMMX did was FXCH with a float, although that would have been a very primitive form of it. Fusion with more complicated instructions like conditional jumps wouldn't be introduced until the Pentium 3. Although, the PMMX does give some idea as to how that might have been implemented in the Pentium 3 (since it can issue a compare and a jump as an instruction pair).
The concern with using a dedicated page there is that you first need to identify both instructions, which can be a variable number of bytes apart. Fusion would make more sense at the micro-op packing stage, which may be how the decoding part was later implemented. An alternative could be to determine fusion on the first occurrence and then mark it in a predecode cache for the next time the instruction comes around.
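As a rough illustration of fusion at the micro-op packing stage (a hypothetical sketch, not the PMMX's or Pentium 3's documented mechanism): a compare immediately followed by a conditional jump gets packed as one fused op.

```python
# Hypothetical packing rule: CMP/TEST followed directly by a conditional jump fuses.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jg", "jb", "ja"}

def pack_uops(decoded):
    """decoded: list of (mnemonic, operands) tuples in program order."""
    packed, i = [], 0
    while i < len(decoded):
        op = decoded[i]
        nxt = decoded[i + 1] if i + 1 < len(decoded) else None
        if nxt and op[0] in FUSIBLE_FIRST and nxt[0] in CONDITIONAL_JUMPS:
            packed.append((f"fused_{op[0]}_{nxt[0]}", op[1] + nxt[1]))
            i += 2  # both instructions consumed as a single micro-op
        else:
            packed.append(op)
            i += 1
    return packed

print(pack_uops([("cmp", ["eax", "1"]), ("je", ["target"]), ("add", ["ebx", "2"])]))
```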
- The FPU didn't have a 64-bit multiplier. I can't say for certain how big it was, but from the die shot I want to say 24 bits for FP32, with a latency of 2 cycles. FP64 would take longer to compute a multiplication, and so would FP80. For registers, that wouldn't have been an issue since the multiplier should be able to operate off of an exchange bus (think about the case where an operand comes from memory, it would need to be fed in through the exchange bus without touching the register file). As for a shifter and an adder, that could be done quicker if you include a low-bit multiplication. For example, if you can do a 2-bit multiplication, then you can speed it up, and that would be small enough to fit within the shifter path. The AMD K6 did something like this for division. Though, looking at the die shot, there appears to be a multiplier of similar size to the FPU's in the integer execution part, possibly 16 bits. Typically the documentation says "this unit contains a multiplier", which is why I wasn't sure.
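To make the 2-bits-per-step idea concrete (a behavioral sketch, not the actual K6 or Pentium circuit): by consuming two multiplier bits per iteration, a 32x32 multiply needs 16 passes through the shifter/adder instead of 32.

```python
def mul_radix4(a, b, width=32):
    """Shift-and-add multiply that consumes 2 bits of b per iteration,
    so a 32-bit multiply takes width/2 = 16 iterations instead of 32."""
    product = 0
    for step in range(width // 2):
        two_bits = (b >> (2 * step)) & 0b11  # tiny 2-bit "multiplier"
        partial = a * two_bits               # 0, a, 2a, or 3a
        product += partial << (2 * step)
    return product & ((1 << (2 * width)) - 1)

assert mul_radix4(123456, 654321) == 123456 * 654321
print(mul_radix4(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFF * 0xFFFFFFFF)  # True
```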
- The load/store comment was correct, and essential in this case. But these load/store units are much simpler than those in the OoO processors. The simplicity comes from the fact that they are in-order, meaning you don't need to worry about segment renaming or a load/store queue (the load/store queue is likely the main reason why the OoO designs had fewer Ld/St units; there also wasn't much need for more, since instructions execute out of order anyway, whereas in the Pentium, having only one Ld/St unit would cause terrible deadlock).
- I don't recall there being a 48-bit address extension for x86 (the 32-bit version), so that wouldn't really work. The best you could do is the 35-bit extension before jumping to a 64-bit core. I have been considering whether or not it makes sense to include the 35-bit extensions though, given that memory sizes and PCI allocations were also much smaller than they are today (even the Voodoo5 GPU only used 128MB), and that 3 extra bits adds another LUT delay for tag checks. As for SATA, it doesn't require direct address mapping, so that's a non-issue.
- The only limitation with addressing modes is the length decoding complexity. The more complexity in determining the length of the instruction, the slower it will be to calculate. If you can perfectly determine the length of the instructions though, then the two AGUs in the PMMX could both handle any address mode in a single cycle. Obviously if there is a register dependency or address clash, then the two instructions need to be serialized. In fact, the PMMX will squash the second instruction if the first one performs a memory write while the second performs a memory read. This is because it can't accurately determine if there will be a memory ordering conflict. The data cache is banked so that both pipelines can R/W the data cache, and the data tags and TLB are dual ported.
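A behavioral sketch of how that dual access might be arbitrated (my guess at the rules from the behavior described above; the bank hash and priorities are assumptions, not the real circuit):

```python
def issue_memory_ops(u_op, v_op, n_banks=8):
    """u_op / v_op: (kind, address) with kind in {'load', 'store', None}.
    Returns which pipe(s) get to access the data cache this cycle."""
    if u_op[0] is None or v_op[0] is None:
        return ("U" if u_op[0] else None, "V" if v_op[0] else None)
    # squash rule: a store in U followed by a load in V can't be ordering-checked
    # in time, so V is held back for a cycle
    if u_op[0] == "store" and v_op[0] == "load":
        return ("U", None)
    # banked cache: both proceed only if they hit different banks
    bank = lambda addr: (addr >> 2) % n_banks  # assumed: bank chosen by low address bits
    if bank(u_op[1]) == bank(v_op[1]):
        return ("U", None)  # U has priority, V retries next cycle
    return ("U", "V")

print(issue_memory_ops(("store", 0x1000), ("load", 0x2000)))  # ('U', None)
print(issue_memory_ops(("load", 0x1000), ("load", 0x1004)))   # ('U', 'V')
```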
@@RTLEngineering Yes, my bad, 48-bit addressing is only for x86-64; I guess 35 bits would be wide enough (32 GB). And about the SATA, sorry, that must have been a brain fart.
Agreed, PCI and AGP cards don't need more than 32 bits, and most PCI Express cards (besides GPUs) don't need it either.
There is a lot to get right with the memory subsystem: ordering conflicts, deadlock, rewriting code, data stream detection, cache configuration, etc. It deserves its own video.
For example, I'm not familiar with the term "address clash" (sounds violent, I already like it ^^).
Yep, don't forget about cache coherency, and handling self modifying code (that's even more annoying). I think that would take too many videos to discuss though. Entire dissertations have been written on each of those topics.
Address clashing is when both pipelines try to access the same memory, at the same time. And because x86 allows for unaligned accesses, that can be even more complicated with partial overlap. The P5/PMMX also had bank conflicts to worry about, but that's not an issue in a FPGA. In either case, the U pipeline would get priority, and then the V pipeline would get its chance.
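For the partial-overlap case, the check itself is basically a byte-range intersection (a sketch; presumably the real hardware does this with a few comparators on the address bits rather than full arithmetic):

```python
def accesses_clash(addr_a, size_a, addr_b, size_b):
    """True if two (possibly unaligned) accesses touch any common byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

# a 4-byte store at 0x1002 partially overlaps a 2-byte load at 0x1005
print(accesses_clash(0x1002, 4, 0x1005, 2))  # True
print(accesses_clash(0x1000, 4, 0x1004, 4))  # False: adjacent but disjoint
```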
@@RTLEngineering Ha yes, self-modifying code (or as I mistakenly called it, rewriting code :D). It's annoying; you need to keep track of all the instruction addresses between the decoder and the commit buffer (or at least the window), and flush the whole damn thing and/or invalidate the I-cache if you write to that address range.
I figure you could just compare the target address to the first 27-28 bits of the pointer register? There must be a more elegant way to do it.
A memory-mapped cache with an inclusive policy sitting on the bus could help with the penalty (what is an L2 cache? ^^).
Strangely, I didn't find much on this topic.
I believe most processors implement it through a sort of mini-TLB. Effectively, you have a CAM cache that holds the cache line tag for every instruction in the pipeline. When an instruction is retired, it can be removed from the CAM. Then you also prevent cache lines from being resident in both the I$ and D$, so any self-modifying code would either cause thrashing or would have to write to a non-cacheable address. In the latter case, you would then have to check the CAM whenever a write occurs and flush the pipeline (even if it's not cacheable, it can still have a cache tag).
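Behaviorally, that check might look something like this (my reading of the scheme above, in Python; the line size and the flush policy are assumptions):

```python
class InflightTagCAM:
    """Tracks the cache-line tags of every instruction currently in flight.
    A store that hits any of them means possible self-modifying code."""
    LINE_SHIFT = 5  # assumed 32-byte cache lines

    def __init__(self):
        self.tags = {}  # line tag -> number of in-flight instructions using it

    def on_fetch(self, pc):
        tag = pc >> self.LINE_SHIFT
        self.tags[tag] = self.tags.get(tag, 0) + 1

    def on_retire(self, pc):
        tag = pc >> self.LINE_SHIFT
        self.tags[tag] -= 1
        if self.tags[tag] == 0:
            del self.tags[tag]

    def store_needs_flush(self, addr):
        """Check the CAM on every write; a hit forces a pipeline flush."""
        return (addr >> self.LINE_SHIFT) in self.tags

cam = InflightTagCAM()
cam.on_fetch(0x1000)                  # instruction at 0x1000 enters the pipeline
print(cam.store_needs_flush(0x1010))  # True: the store hits an in-flight line
```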
AI voice overs are unlistenable.