I've been trying to work out how they actually implemented the multiplier in the real r4300i design. The datapath diagram in "r4300i datasheet" shows they are using a "CSA multiplier block" and feeding it's result into the main 64bit adder every cycle (which saves gates, why use dedicated full adders at the end of the CSA array when you already have one). Going back to the r4200, there is a research paper explaining how the pipeline works in great detail, and the r4300i is mostly just an r4200 with a cut-down-bus and larger multiplier. The r4200 uses a 3bit multiplier, shifting 3 bits out to LO every cycle (or the guard bits for floats) and latching HI on the final cycle (floats use an extra cycle to shift 0 or 1 bits right then repack). I'm assuming they use much the same scheme, but shifting out more bits per cycle. So it's not that the r4300i has multipliers that take 3 and 6 cycles then take two cycles to move the result to lo/hi, but that the 24bit and 54 bit multiplies can finish 1 cycle sooner. So I think the actual timings are: 3 cycles for 24bit, 4 cycles for 32 bit, 6 cycles for 54 bit and 7 cycles for 64 bit (though, you need an extra bit for unsigned multiplication) To get these timings, the r4300 would need a 10 bit per cycle multiplier. If I'm understanding the design correctly: Every cycle, the CSA block adds ten 64 bit wide partial products. 10 bits are immediately ready shifting 10 bits out to LO, and the remaining 63 bits of partial sums and shifted carries are latched into the adder's S and T inputs. On the next cycle, the CSA block also takes the reduced partial sums from the adder's result as an 11th input to the CSA array.
Aside from Karatsuba, which is not really relevant here, I was about to suggest using the integrated registers of the DSP blocks, some of which can do MAC, multiply then accumulate. This would reduce the number of DSP blocks. Also : what about a FSM to sequence the operations ?
That can be done, but it's a bit more complicated and becomes vendor specific. So if I did that for a Xilinx chip, I would have to explicitly create a DSP block and set the control bits. The same HDL would not work on an Intel chip. And when you're working on a part so far down in the hierarchy, you really want to avoid vendor specific instantiation. You could instead build up the multiplication via accumulation of smaller multipliers (that's what many chips in the past did), however, that ends up requiring a FSM as you suggested, and will use up more logic resources. Typically in these instances, the LUTs are more precious than the DSPs, so letting the synthesis tool figure out how to connect the DSPs to minimize resource usage is probably going to yield better results.
There are several VLSI multiplier implementations, however, CSA seems to be the most prevalent. The Karatsuba multipliers are often smaller and use less power, but are also slower than the CSA multipliers. Either way, I suspect that explicitly implementing a multiplier IP (instead of using a multiplier or DSP block) in an FPGA, will be extremely inefficient... since it will use a lot of LEs and operate slower than the specialized in silicon multiplier block. I haven't tried it though. When you tried it in the past, did you check to see what the estimated speed / resource usage was implementing the Karatsuba vs. a multiplier block? (It would be interesting to see how much they differ).
@@temlib1411 Aren't most FPGA multipliers 18x18? (I know they used to be smaller, 9x9?). I don't think that a Karatsuba implementation would be advantageous in an FPGA unless you were really short on multiplier blocks. Since all of the adds / subs are binary, you would need to do 7 instead of 3, which is quite a few LEs, plus a significantly increased propagation delay (6 stages instead of 2).
I've been trying to work out how they actually implemented the multiplier in the real r4300i design.
The datapath diagram in "r4300i datasheet" shows they are using a "CSA multiplier block" and feeding it's result into the main 64bit adder every cycle (which saves gates, why use dedicated full adders at the end of the CSA array when you already have one). Going back to the r4200, there is a research paper explaining how the pipeline works in great detail, and the r4300i is mostly just an r4200 with a cut-down-bus and larger multiplier. The r4200 uses a 3bit multiplier, shifting 3 bits out to LO every cycle (or the guard bits for floats) and latching HI on the final cycle (floats use an extra cycle to shift 0 or 1 bits right then repack).
I'm assuming they use much the same scheme, but shifting out more bits per cycle. So it's not that the r4300i has multipliers that take 3 and 6 cycles then take two cycles to move the result to lo/hi, but that the 24bit and 54 bit multiplies can finish 1 cycle sooner.
So I think the actual timings are: 3 cycles for 24bit, 4 cycles for 32 bit, 6 cycles for 54 bit and 7 cycles for 64 bit (though, you need an extra bit for unsigned multiplication) To get these timings, the r4300 would need a 10 bit per cycle multiplier.
If I'm understanding the design correctly: Every cycle, the CSA block adds ten 64 bit wide partial products. 10 bits are immediately ready shifting 10 bits out to LO, and the remaining 63 bits of partial sums and shifted carries are latched into the adder's S and T inputs. On the next cycle, the CSA block also takes the reduced partial sums from the adder's result as an 11th input to the CSA array.
Very informative, thank you.
Where was this video when I was scratching my head with _exactly_ the same problem and had to come up with all this by myself :_(
Thanks for these videos. They are excellent.
Do you know what the tradeoffs are between a CSA and a Wallace / Dadda multiplier?
Aside from Karatsuba, which is not really relevant here, I was about to suggest using the integrated registers of the DSP blocks, some of which can do MAC, multiply then accumulate. This would reduce the number of DSP blocks. Also : what about a FSM to sequence the operations ?
That can be done, but it's a bit more complicated and becomes vendor specific. So if I did that for a Xilinx chip, I would have to explicitly create a DSP block and set the control bits. The same HDL would not work on an Intel chip. And when you're working on a part so far down in the hierarchy, you really want to avoid vendor specific instantiation.
You could instead build up the multiplication via accumulation of smaller multipliers (that's what many chips in the past did), however, that ends up requiring a FSM as you suggested, and will use up more logic resources. Typically in these instances, the LUTs are more precious than the DSPs, so letting the synthesis tool figure out how to connect the DSPs to minimize resource usage is probably going to yield better results.
There is also Karatsuba!
I tried once, it wasn't that useful in FPGAs with plenty of hardware multipliers, though.
There are several VLSI multiplier implementations, however, CSA seems to be the most prevalent. The Karatsuba multipliers are often smaller and use less power, but are also slower than the CSA multipliers. Either way, I suspect that explicitly implementing a multiplier IP (instead of using a multiplier or DSP block) in an FPGA, will be extremely inefficient... since it will use a lot of LEs and operate slower than the specialized in silicon multiplier block. I haven't tried it though. When you tried it in the past, did you check to see what the estimated speed / resource usage was implementing the Karatsuba vs. a multiplier block? (It would be interesting to see how much they differ).
@@RTLEngineering
I just tried to save a multiplier. Asssuming 16x16 hardware multipliers.
[XH|XL] * [YH|YL] = XH.YH*2^32 + XL.YL + (XH.YL + XL.YH) * 2^16 : 4 different multiplications.
Karatsuba:
T=(XH-XL).(YH-YL) = XH.YH + XL.YL - XL.YH - XH.YL
[XH|XL] * [YH|YL] = XH.YH*2^32 + XL.YL + (XH.YH + XL.YL - T) * 2^16 : 3 different multiplications, but more add/sub.
@@temlib1411 Aren't most FPGA multipliers 18x18? (I know they used to be smaller, 9x9?). I don't think that a Karatsuba implementation would be advantageous in an FPGA unless you were really short on multiplier blocks. Since all of the adds / subs are binary, you would need to do 7 instead of 3, which is quite a few LEs, plus a significantly increased propagation delay (6 stages instead of 2).
@@RTLEngineering Yes. That's it. In modern FPGAs, multipliers are cheap!
I was also about to suggest Karatsuba but I also agree that "multipliers are cheap".