Very interesting video. Glad to see someone giving some attention to CISC. Are you working on an x86 implementation?
Speaking of CISC:
Could you take a look at how ao486 is implemented, or at how it could be improved?
From what I've heard, the lack of an FPU is an issue when aiming for Pentium compatibility. I guess the transcendentals are too costly to implement?
Have you taken a look at FloPoCo? Apparently it generates very well optimized and pipelined ALUs and FPUs per FPGA target.
Are the decoder and control the only ISA-specific units? Could a cache or a TLB unit be reused on a different CPU?
Would you give some love to the 68K family? Decoding seems simpler (as does the FPU on the 68040/68060).
Thanks, I am working on an x86 implementation, but progress is slow. I do sort of need a better x86 implementation than the ao486 for another project that I am working on (it's good for 486 stuff, but nothing newer).
I have looked at how the ao486 was implemented, and it is feasible to improve it or add an FPU to it, for example. But that would require understanding the undocumented mess of control signals and datapath, and it would likely also require modifying the control sequence compiler. I don't really have the patience or dedication to consider doing that, and would likely be better off writing something from scratch.
Transcendentals are okay to implement; you sequence them via approximate algorithms using microcode. Essentially, it's like taking the C math.h implementations and making the hardware run the operations directly. The biggest problem with implementing an FPU (aside from the FPU itself, which is no more complicated than an entire 68k) is linking it to the main x86 control path. That means either that A) the 486 needs to understand that the FPU exists and work with it (if it were external), or B) the 486 itself needs to decode and issue FPU instructions. Neither of those tasks is all that complicated by itself, but figuring out which control signals need to be modified, and where to hook into the datapath, is less trivial.
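As a rough illustration of what "sequencing via approximate algorithms" means, here is a minimal C sketch of a truncated polynomial approximation of sine over a reduced range; the point is that a transcendental boils down to a fixed chain of multiplies and adds that microcode can drive through the existing FP multiplier and adder. The coefficients and range assumption are illustrative, not the actual x87 algorithm (which uses its own CORDIC-like / rational methods).

```c
#include <math.h>
#include <stdio.h>

/* Illustrative only: a short Taylor-style approximation of sin(x) for
 * |x| <= pi/4, i.e. after range reduction.  Each multiply/add below is
 * the kind of primitive step an FPU microcode sequence could issue. */
static double sin_poly(double x)
{
    double x2 = x * x;
    /* Coefficients 1, -1/3!, 1/5!, -1/7! (series truncated). */
    return x * (1.0 + x2 * (-1.0 / 6.0
              + x2 * (1.0 / 120.0
              + x2 * (-1.0 / 5040.0))));
}

int main(void)
{
    double x = 0.5;  /* already inside the reduced range */
    printf("approx = %.9f, libm = %.9f\n", sin_poly(x), sin(x));
    return 0;
}
```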
I had not heard of FloPoCo (had to look it up), but I don't think it will be useful in this case. If the goal were getting something working without regard for performance, then maybe. But inside a CPU implementation, which control signals are used, the number of pipeline stages, the technology mapping, etc., all matter.
The decoder and control are not only ISA specific, but microarchitecture specific. The same is also true for other parts of the CPU. However, it's possible to adapt a microarchitecture to other ISAs. A simple example of that is the simulator gem5, which has an ISA-specific front end but a more generic backend. To do so, however, you need to cover a larger set of functions with the other components. For example, a K6, which used micro-ops, could probably implement ARM if you modified the decoders. It couldn't implement MIPS or RISC-V, though, unless the logical registers were increased from 32 to 64. If that were done, then it could also probably implement those architectures, though the MIPS TLB would be an issue. Which goes into your other question: a TLB can be reused in parts, but some require specific behavior that conflicts with other implementations. The MIPS TLB, for example, operates completely differently from x86 and ARM (it relies on low-level software management rather than page walking).
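To make that TLB contrast concrete, here is a hedged C sketch of the two refill styles. The data structures, field names, and the flat two-level table layout are simplifying assumptions, not any real implementation: x86/ARM-style hardware walks the page tables on a miss and installs the entry itself, while MIPS-style hardware only raises an exception and leaves the refill to an OS handler.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative structures only -- sizes and layout are assumptions. */
typedef struct { uint32_t vpn, pfn; bool valid; } tlb_entry_t;

#define TLB_ENTRIES 64
static tlb_entry_t tlb[TLB_ENTRIES];

/* x86/ARM style: on a miss, a hardware page walker reads the page
 * tables in memory and installs the translation, invisibly to software. */
uint32_t hardware_refill(uint32_t vpn,
                         const uint32_t *pgdir,     /* level-1 table  */
                         const uint32_t *pgtables)  /* level-2 tables */
{
    uint32_t dir_idx = vpn >> 10;                   /* top 10 bits    */
    uint32_t tbl_idx = vpn & 0x3FF;                 /* low 10 bits    */
    uint32_t pt_base = pgdir[dir_idx];              /* level-1 lookup */
    uint32_t pfn = pgtables[pt_base + tbl_idx] >> 12; /* level-2      */
    tlb[vpn % TLB_ENTRIES] = (tlb_entry_t){ vpn, pfn, true };
    return pfn;
}

/* MIPS style: hardware only signals a TLB-miss exception; the OS
 * handler computes the mapping and writes it into the TLB itself
 * (via TLBWR/TLBWI instructions -- modeled here as a plain store). */
void software_refill_handler(uint32_t vpn, uint32_t pfn_chosen_by_os)
{
    tlb[vpn % TLB_ENTRIES] = (tlb_entry_t){ vpn, pfn_chosen_by_os, true };
}
```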
The 68k family does have a much simpler decoder, as I understand it. But I'm not very knowledgeable there. Learning / diving into another ISA wouldn't be a good use of my time at the moment, given that there are plenty of other people already far more knowledgeable than me (who wrote 68k FPGA implementations).
@RTLEngineering I wish you good luck; I will follow this series.
Is it just me, or are a TLB and a cache very similar?
What do you think about generators like Chisel, Migen, or SpinalHDL? I was thinking of using one of these for prototyping the most complex parts, because I want to keep what's left of my sanity.
I'm more interested in the 68K. I know there are many implementations of the 68000, one of the 68010/68020, a 68030 without an MMU, and the 68080 from the Apollo Team (a 68060-like core with no MMU, and not open source). I was thinking of starting from scratch and maybe skipping unused/obsolete instructions like BCD, aiming for a modern 32-bit 68K with an MMU (to run Linux and such). And beyond the original 68000, there isn't much technical documentation, so I started looking at the closest architecture, the good old x86, hoping I can transpose some things.
I will certainly need it, already had a few setbacks, and had to scale back the scope a bit (I was originally going to implement a K6, or something similar).
A TLB and a cache can be similar, but it depends on the associativity. With the Pentium, for example, they have similar associativity (2-way for the cache, and 4-way for the TLB). There are some x86 implementations with fully associative TLBs though, just like with MIPS. And the refill mechanism also depends on the implementation: x86 has a hardware refill for both cache and TLB, whereas MIPS only has it for the cache. (A non-hardware-assisted cache refill mechanism would be a scratchpad, like the one the PlayStation 2 had.)
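As a rough illustration of where the two structures overlap, here is a minimal C model of an N-way set-associative lookup; the same tag-compare logic can serve as a cache tag check or a TLB lookup, with only the meaning of the tag and payload changing. The sizes and field names are assumptions for the sketch, not taken from any particular CPU.

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4
#define SETS 32

/* One generic way entry: for a cache the payload would identify the
 * line data and state, for a TLB it would hold the physical frame
 * number plus permission bits.  The lookup logic is the same. */
typedef struct { bool valid; uint32_t tag; uint32_t payload; } way_t;

static way_t ways[SETS][WAYS];

/* Set-associative lookup: the index selects a set, then all WAYS tags
 * are compared (in parallel in hardware, a loop here). */
bool lookup(uint32_t index, uint32_t tag, uint32_t *payload_out)
{
    for (int w = 0; w < WAYS; w++) {
        if (ways[index % SETS][w].valid &&
            ways[index % SETS][w].tag == tag) {
            *payload_out = ways[index % SETS][w].payload;
            return true;   /* hit */
        }
    }
    return false;          /* miss: the refill policy is where cache
                              and TLB diverge (line fill vs. page walk
                              vs. software handler) */
}
```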
I have looked at all of those generators, and used a few for stitching parts together. I'm not particularly fond of them though. They are really good for rapid prototyping, and are powerful for verification (using formal verification). But they are almost impossible to debug (if formal verification is missing conditional state tests), and they give you very little control over synthesis for performance or area (you can always black box, but then what's the point of using the generator). That doesn't mean they are useless, but you could end up putting almost as much time into them as writing straight up HDL.
I know there is a lot of interest in the 68k, but there's been a lot of work done there. The only work done with x86 at the 486 level are the ao486 and the v586, neither of which have FPU support. And that hasn't progressed any further in the last 5-8 years. So that's why my focus has been there.
There probably would be some interest in a more powerful 68k though, something that's compatible but also more performant (I suspect a lot of Amiga enthusiasts would be thrilled). And you may be able to adapt something similar to the Pentium architecture to the 68K ISA. You would have to keep the BCD stuff for compatibility though, unless you are able to use an ISA variant that doesn't include it. But you don't need to keep it performant: if you make the back end microcode-based (like the Pentium was), then you can just implement BCD via microcode emulation.
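For a sense of what "BCD via microcode emulation" could look like, here is a hedged C sketch of a packed-BCD add broken into the kind of primitive ALU steps (binary add, digit check, add-6 correction, carry out) that a microcoded backend could sequence. It's a generic digit-correction routine, not the actual 68k ABCD microcode.

```c
#include <stdint.h>
#include <stdio.h>

/* Packed-BCD add of two bytes (two decimal digits each), expressed as
 * simple steps a microcode sequence could issue.  Illustrative only. */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry_out)
{
    unsigned sum = a + b;                             /* binary add   */

    if ((sum & 0x0F) > 9 || ((a & 0x0F) + (b & 0x0F)) > 0x0F)
        sum += 0x06;                                  /* fix low digit */
    if ((sum & 0xF0) > 0x90 || sum > 0xFF)
        sum += 0x60;                                  /* fix high digit */

    *carry_out = (sum > 0xFF);                        /* decimal carry */
    return (uint8_t)sum;
}

int main(void)
{
    int c;
    uint8_t r = bcd_add(0x38, 0x49, &c);  /* 38 + 49 = 87 decimal */
    printf("%02X carry=%d\n", r, c);      /* prints 87 carry=0    */
    return 0;
}
```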
Would it be possible to publish the other decoding simulation videos as well? I found them very helpful for understanding the underlying architecture and principles.
Especially the 8086, P6, K6 and K7 would be very welcome 🙏
The plan is to publish the P6 and K6 ones after I finish / upload the front end decoding video on the P6 / K6. Same goes for the K7. The 8086 hasn't been completed yet since it's incredibly slow (every byte gets a cycle, so that's quite a few frames of animation to churn through). Also, the P6/K6/K7 ones may be less helpful due to how short they are (the K7 only takes 11 cycles to go through the whole stream).
I just watched the video of the P5 decoding as well. Did I understand that right, that the 486 can theoretically reach a maximum of 1 IPC, whereas the P5 can reach a theoretical maximum of 2 IPC?
Furthermore, the P5 is bound by a good prediction of the instruction length (V decoder) and the number of prefix bytes per instruction, while the 486 is only bound by the prefix bytes?
Yep, that's all correct. The P55C has a few improvements in regards to prefix bytes though, and also has the length predecoders (which are not 100% accurate). So the P55C can get closer to the maximum IPC of 2 when compared with the P5. Also, the P55C is more tolerant to pairability restrictions due to the instruction FIFO, which would otherwise lead to a complete front-end stall in the P5.
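To tie the IPC numbers to something concrete, here is a heavily simplified C sketch of the kind of pairing decision the P5 front end makes each cycle: two decoded instructions issue together (reaching 2 IPC) only if both are "simple" and the V-pipe instruction doesn't depend on the U-pipe instruction's result. The real Pentium has many more restrictions (flags, prefixes, U-only/V-only instruction classes), so treat this as an assumption-laden illustration rather than the actual rule set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Heavily simplified model of P5 issue pairing -- no flag
 * dependencies, prefix penalties, or U-only/V-only classes modeled. */
typedef struct {
    bool    simple;       /* "simple" one-cycle, pairable instruction  */
    uint8_t dest_reg;     /* destination register id                   */
    uint8_t src_regs[2];  /* source register ids (use an otherwise     */
                          /* unused sentinel id for missing sources)   */
} decoded_insn_t;

/* Can instruction v issue in the V pipe alongside u in the U pipe
 * in the same cycle? */
bool can_pair(const decoded_insn_t *u, const decoded_insn_t *v)
{
    if (!u->simple || !v->simple)
        return false;                     /* only simple ops pair      */
    if (v->dest_reg == u->dest_reg)
        return false;                     /* write-after-write hazard  */
    for (int i = 0; i < 2; i++)
        if (v->src_regs[i] == u->dest_reg)
            return false;                 /* read-after-write hazard   */
    return true;                          /* both issue: a 2 IPC cycle */
}
```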