Great argument! One question on the LUT-OR implementation though: could using tristate buffers enabled by the decoder help? The AND-OR stage is essentially doing that already. Which of the two would be preferable, from a first-cut perspective?
If you want to "shave" a tiny fraction of a picosecond: initialise register 0 of the register set to 0 during boot, then never enable writes to it afterwards, so you save a test and a gate in the critical datapath during the read cycle.
The advantage is that the register write address is known well in advance, whereas the read address must be provided extremely fast by the instruction decoder. So you can control the "write enable" signal outside of the critical datapath.
That's a good idea, actually. The read logic currently replicates that test, so this would save a bit of logic and only require propagation in the write-back stage, which is rarely the cause of timing problems.
I may have to implement that solution, since the RF stage of the VR4300 is only half a cycle.
Thanks for the suggestion!
Actually, I just checked what I did, and I effectively did what you suggested. Instead of testing for address 0 on reads, I inhibit writes to address 0 during the clock update - i.e. I used a loop to hook up the register write clocks, which skips index 0.
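Roughly, the write gating looks like this in Verilog (just a sketch - my actual code is VHDL, and names like wr_en / wr_addr are illustrative):

    module regfile32 (
        input  wire        clk,
        input  wire        wr_en,
        input  wire [4:0]  wr_addr,
        input  wire [31:0] wr_data,
        input  wire [4:0]  rd_addr,
        output wire [31:0] rd_data
    );
        reg [31:0] regs [0:31];
        integer i;

        initial regs[0] = 32'd0;              // r0 fixed at zero from configuration

        always @(posedge clk) begin
            // Loop over registers 1..31 only; index 0 never gets a clocked write.
            for (i = 1; i < 32; i = i + 1)
                if (wr_en && (wr_addr == i))
                    regs[i] <= wr_data;
        end

        assign rd_data = regs[rd_addr];       // no "address == 0" mux in the read path
    endmodule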
What happens when you force the Altera implementation to use MLABs (distributed RAM) as well? Cyclone V should support this.
The Cyclone V doesn't have distributed RAM in the same way as the 7-series. The 7-series has SLICELs (logic) and SLICEMs (memory), whereas the Cyclone V only has the equivalent of the SLICEL (Altera calls them MLABs, as I recall). They are also implemented differently: the V series uses traditional mux-based LUTs while the 7-series uses SRAM blocks. The main difference between the SLICEL and SLICEM is that the SLICELs are read-only. So when you implement distributed RAM on the 7-series, it uses the SLICEM blocks (which have different configurations, the largest being 2-bit x 128 entries and the smallest 6-bit x 32 entries - both configurations use only a single SLICEM). On the V series, distributed RAM instead uses the registers / FFs in the MLABs. I could be mistaken though - it's possible that I missed something.
I'm not sure if you still have the OR vs mux question. On the V series they should be functionally equivalent; however, the OR implementation lends itself better to a tree-based routing scheme, so it will have lower resource utilization at the cost of propagation delay. The 7-series, on the other hand, has dedicated mux components in the SLICEs, so registers + mux can implement one bit of 4 registers in a single SLICE (using both the FFs and the mux), whereas the OR implementation would need 2x the SLICEs. Though you still gain the advantage of the tree-based OR method being easier to route. That was the whole point of trying both, and it did differ on some of the implementations.
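As a rough sketch of the two read structures (one bit of 4 registers, with made-up signal names - "sel" is assumed to be a one-hot decode of the read address):

    module read4_or (
        input  wire [3:0] sel,    // one-hot decode of the read address
        input  wire [3:0] q,      // one bit from each of the 4 registers
        output wire       rd
    );
        // AND-OR form: gate each register bit with its decode line, then OR-reduce.
        assign rd = |(q & sel);
    endmodule

    module read4_mux (
        input  wire [1:0] addr,   // binary read address
        input  wire [3:0] q,
        output wire       rd
    );
        // Mux form: maps onto the dedicated F7/F8 muxes in a 7-series SLICE.
        assign rd = q[addr];
    endmodule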
Note: I have solved the "mux wall" problem for the version with individual flip-flops: connect.ed-diamond.com/GNU-Linux-Magazine/GLMF-218/Quelques-applications-des-Arbres-Binaires-a-Commande-Equilibree
It might not be the best solution for your case, because I think specialised BRAM is best, but it's good to know you can make large muxes at a lower propagation/fanout price :-)
Sure, that's a solution, but not a very code-friendly one (i.e. it's hard to tell what it's doing). For this case, using a BRAM - or, on the 7-series, the SLICEM RTL RAM - is significantly more efficient.
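For reference, a small array with an asynchronous read like the sketch below will normally infer SLICEM LUT RAM on the 7-series without any vendor-specific attribute (port names are illustrative; a (* ram_style = "distributed" *) hint can be added if the tool needs a push):

    module regfile_lutram (
        input  wire        clk,
        input  wire        wr_en,
        input  wire [4:0]  wr_addr,
        input  wire [31:0] wr_data,
        input  wire [4:0]  rd_addr_a,
        input  wire [4:0]  rd_addr_b,
        output wire [31:0] rd_data_a,
        output wire [31:0] rd_data_b
    );
        reg [31:0] mem [0:31];

        always @(posedge clk)
            if (wr_en)
                mem[wr_addr] <= wr_data;       // synchronous write

        assign rd_data_a = mem[rd_addr_a];     // asynchronous (combinational) reads;
        assign rd_data_b = mem[rd_addr_b];     // the tool duplicates the RAM per port
    endmodule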
My brain hurts.
Because something that should be simple is not so trivial? Or was there a part that was confusing, and should have been explained better?
RTL Engineering
It was all explained perfectly, it's just a lot of info to absorb. lol
I've had a tiny bit more success with booting the SGI Indy BIOS on the DE1 btw.
I got the serial output working in the same way as I did for the PS1 BIOS - by finding the routine which writes the data in a register (in this case, $v1) to the serial chip, then grabbing that value on the next instruction, and writing it to a UART TXD block.
Unfortunately, the serial strings are word swapped, so I'll have to figure that out, but I can at least see where the boot process is failing now.
I'm spoofing the reg values for how many "SIMMs" are installed, and the BIOS is now attempting to do a quick RAM test.
But, the aoR3000 core is not allowing writes to the bus in the 0x9FC00000 range (uncached?), so I'll have to tweak that some more.
Not sure why it's trying to write to the BIOS's location, but maybe it gets unmapped by another reg just beforehand?
Or, far more likely is that it's trying to use the TLB, which I of course disabled previously when I was messing with the PS1 stuff.
Re-enabling the TLB shouldn't be too hard, but I expect tons of other issues afterwards, as there's still no FPU.
I'm tempted to plug in the R4300i, which is already on an adapter board, but then I lose most of the debugging and register grabs that I can currently do with the aoR3000 core.
I'd also need to write a state machine for the SysAD / SysCMD bus, but it might not be too bad.
Anyway, it's been fun to mess with, and learn more about the SGI hardware.
Really enjoyed your vid again, as it's very relevant to the PS1, SGI, and obviously the N64. ;)
I posted the link to this vid in #n64dev on IRC earlier, as I know they will also appreciate it.
Awesome, thanks!
There is one way to tell if it's expecting to go through the TLB: were there any TLBWI or TLBWR instructions that went by? By default, the MIPS TLB starts in an undefined state, and it's the job of the BIOS to initialize it, if it plans on using it. I was trying to get u-boot to work on a C++ implementation of an R4000 core, which kept trying to write to the BIOS ROM. The reason turned out to be that the jump target was read from an address in the TLB-mapped region, but the TLB was never initialized (I hadn't implemented the TLB yet). I had to add a quick 1:1 map for the TLB region, and that got it past that point of the boot sequence.
As for the serial port, shouldn't it be attempting to write to a specific serial device? Those are typically memory-mapped at a physical address, so you should be able to read the output from the system bus write attempts. Or better yet, you could find out what type of serial device it was expecting and implement that in the FPGA (serial devices are pretty simple). Also, it's probably not word swapped - perhaps it's a different endianness than you expected?
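For example, a bare-bones 8N1 transmitter is only a few dozen lines. A sketch (the baud divisor and port names are placeholders, not the Indy's actual serial chip):

    module uart_tx #(
        parameter integer CLKS_PER_BIT = 434       // e.g. 50 MHz / 115200 baud (assumed clock)
    )(
        input  wire       clk,
        input  wire       send,                    // pulse high with a new byte
        input  wire [7:0] data,
        output reg        txd = 1'b1,              // line idles high
        output wire       busy
    );
        reg [8:0]  shreg = 9'h1FF;                 // {stop bit, data[7:0]}
        reg [3:0]  bits  = 4'd0;
        reg [15:0] cnt   = 16'd0;

        assign busy = (bits != 4'd0);

        always @(posedge clk) begin
            if (send && !busy) begin
                txd   <= 1'b0;                     // drive the start bit
                shreg <= {1'b1, data};
                bits  <= 4'd9;                     // 8 data bits + stop bit remaining
                cnt   <= 16'd0;
            end else if (busy) begin
                if (cnt == CLKS_PER_BIT - 1) begin
                    cnt   <= 16'd0;
                    txd   <= shreg[0];             // LSB first
                    shreg <= {1'b1, shreg[8:1]};
                    bits  <= bits - 4'd1;
                end else begin
                    cnt <= cnt + 16'd1;
                end
            end
        end
    endmodule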
RTL Engineering
Yep, I need to check for any TLB instructions next.
I'm hoping it will boot just a little further after I get the SDRAM working, as I really want to start implementing some basic graphics stuff.
As with most systems though, I highly suspect it will copy a chunk of the BIOS into RAM first, so it has probably tried to set up the TLB already.
The Indy has a fairly complex Memory Controller ASIC as well, but most of that seems to relate to SDRAM mapping, parity checking, refreshing etc., so I can probably bypass most of that.
It uses mainly "PC style" stuff for the keyboard, mouse, interrupt controller, but all integrated into one chip (the IOC).
The serial chip core is also on that chip, but uses a (frankly overcomplex) Z80SCC core.
It looked like a PITA to implement properly, so I just did the register grab thing for now, and can pause the aoR3000 while each character is being sent via RS232 to the PC.
In fact, the aoR3000 is only clocked at around 4.65 MHz right now, because the current SDRAM controller has to be 1/16th speed on the user side atm, so I don't need to pause the CPU for the serial output, even at a whopping 115,200 baud. lol
It is all getting a bit unwieldy now though, as I had to add some extra bus outputs to the aoR3000 so that it bypasses the (disabled) cache and can write data to the IO regs more easily.
The aoR3000 isn't really set up for accessing this mem-mapped IO stuff, so it wouldn't do the usual Avalon bus write requests in those ranges, hence me having to export the internal data bus stuff, and the exec_cmd_store signal for those writes.
Also, the DE1 Flash chip (for the Indy BIOS) is only used in 8-bit wide mode, the SDRAM is only 16-bit wide, and the aoR3000 of course expects 32-bit wide accesses for everything.
Getting the basic memory state machine working for all of that was another huge PITA when I was messing with the PS1 stuff.
Basically, it's all a huge kludge, and I don't really know what I'm doing, but it's fun to learn (when it works. lol)
(Oh, and the other reason everything is clocked so slow is due to the 70ns Flash on the DE1, so only ~14 MHz max.
It was originally a 90ns chip as well, but I replaced it. Terasic cut costs on the later DE1 boards, and stuck the even slower chip on it. grrr. lol)
I should really learn to use Qsys more, and move all of this over to my Cyc V GX board, and the LPDDR2, but I don't have a straightforward way of loading the BIOS image, and the GX board doesn't have parallel Flash. The aoR3000 core is at least already set up for Qsys, I guess.
For whatever reason, I've never had much luck getting a simple NIOS-II core running in the past?
I'm quite allergic to having to use a Linux environment just to compile something, as I've also had an untold number of issues there too.
It would be ideal if I could have a simple Qsys setup for loading the Indy BIOS image from microSD into LPDDR2, then it might be somewhat easier to manage all of the sub-blocks.
I even wrote a FAT32 loader in pure Verilog years ago, but the SD card controller is very fussy about the type of card, and I never added cluster chain following, so it only worked with non-fragmented files. sigh. lol
Plus, I only just bought a DE10 Nano about 6 months ago, and have been trying to get to grips with the SoC / HPS stuff for the MiSTer retro cores.
All of this stuff, and too many other projects, is why my brain hurts. lol :p
To force Vivado to use BRAM, use:

(* ram_style = "block" *)
reg [7:0] RAM [RAM_SIZE-1:0];
reg [7:0] read_data;

always @(posedge clk)
begin
    if (write)
        RAM[addr] <= data_in;   // data_in / read_data names assumed to complete the snippet
    read_data <= RAM[addr];     // registered read so the array maps to BRAM
end
Sure, however there are two problems with your suggestion:
1) I am using VHDL, not Verilog (though there is obviously a VHDL equivalent)
2) The goal of the HDL was to be cross-platform. So if I added Vivado-specific implementation hints, it wouldn't compile for Quartus.
You are also incorrect about the Xilinx BRAM latch. The BRAMs can be used as "asynchronous" RAM in a sense - you just need to provide the address in the previous clock cycle. More specifically: the address is always latched on the clock for a Xilinx BRAM, but the data is not (the output register is optional). That can, however, lead to timing budget issues. The Xilinx Memory Resources guide discusses this in detail.
Also, one of the testing issues I was having was that the output register (used to test timing) was being inferred as the data latch for the BRAM. So I probably do have to redo the experiments in this video using the keyword hints - although it's sort of moot, since the register file design will depend on the architecture implementation (i.e. you can implement an architecture that is optimized for the register file's technology mapping).
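To illustrate the two read behaviors (a Verilog sketch, not the VHDL from the video; port names are made up and the attribute is only there to force the mapping):

    module bram_read_styles (
        input  wire        clk,
        input  wire        wr_en,
        input  wire [9:0]  wr_addr,
        input  wire [31:0] wr_data,
        input  wire [9:0]  rd_addr,
        output reg  [31:0] rd_latency1,    // valid one cycle after the address
        output reg  [31:0] rd_latency2     // optional output register: better timing, +1 cycle
    );
        (* ram_style = "block" *)
        reg [31:0] mem [0:1023];

        always @(posedge clk) begin
            if (wr_en)
                mem[wr_addr] <= wr_data;
            rd_latency1 <= mem[rd_addr];   // address latched on the clock, data out next cycle
            rd_latency2 <= rd_latency1;    // extra pipeline register on the data output
        end
    endmodule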
@@RTLEngineering Sorry for the confusion on the BRAM description.
I'm kind of a newbie at RTL and sometimes I screw up the terms :D, sorry.
I use Verilog because I come from the C/C++ world :D. VHDL syntax is kind of difficult for me to understand, but at least I can translate from VHDL to Verilog :D.
Now I'm working on a five-stage pipelined RISC-V core. Previously I did a version where load, execute, and write-back happen in the same cycle, with an interrupt controller (~100 MHz on Artix, -1 speed grade) (ehh, very simple stuff :D).
That previous core was developed for generic use, to eat fewer resources - ~900 LUTs on both Xilinx Artix and Lattice MachXO3L (including ROM and RAM). Now I want to do a performance core: multiple instructions per cycle, out-of-order execution...
I'll take the idea of changing the polarity of the BRAM clock to build the register file from BRAM, so the register output data can feed back through the ALU in the same clock cycle (not for performance, but just to fit it on very small FPGAs, like 1-2-4K LUT4 Lattice parts, since the ALU then only has half a cycle minus the BRAM output delay to execute the operation).
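A sketch of that half-cycle trick (illustrative names, not my actual code - the BRAM/EBR mapping would come from the vendor's own hint or primitive):

    module regfile_negedge (
        input  wire        clk,
        input  wire        wr_en,
        input  wire [4:0]  wr_addr,
        input  wire [31:0] wr_data,
        input  wire [4:0]  rd_addr,
        output reg  [31:0] rd_data
    );
        reg [31:0] mem [0:31];

        // Opposite clock polarity to the rest of the CPU: read data appears about
        // half a cycle after the rising edge, leaving the other half for the ALU.
        always @(negedge clk) begin
            if (wr_en)
                mem[wr_addr] <= wr_data;
            rd_data <= mem[rd_addr];
        end
    endmodule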
I hope you will continue to post videos like this; there are very few videos that talk in detail about RTL implementation. I have a lot to learn about RTL, and yes, it is very useful to see some code, even in VHDL.
Best regards
No worries, some of the terminology can be a bit confusing.
That's the reason why I prefer VHDL - it's very different from C/C++. When working on RTL projects, you will typically create your module and then create a C/C++ model to test against it in simulation (I am part way through a video talking about that). So switching quickly between C/C++ and Verilog would become very confusing, since they look so similar but describe completely different behavior. There is also the fact that VHDL is much stricter in how it's typed (it's very explicit). It's very hard to make a syntactical bug in VHDL - it just won't compile. Compare that with Verilog, where 80% of the bugs in the Verilog modules I have worked on were syntax issues that compiled without warning. Also, VHDL was the first RTL language that I learned, so that probably adds to why I prefer it.
That’s impressive that you were able to get a single-stage RISC-V to run at around 100 MHz (or was that in the fast corner?). I was recently working on a 16-bit MIPS CPU (implements most of the MIPS I ISA in 16-bit instruction encoding), which was also single-stage (technically 3, because of the instruction fetch), and it could only run around 87 MHz on the same chip. I think the biggest issue was the address generation feedback (for loads and stores, as well as branches). Were you using BRAMs for the registers instead? If so, then it would be 2 stages, right? (RF + EX/MEM/WB)
I highly recommend against jumping into an out-of-order architecture from a single-cycle CPU. You should work on them in the order that computing architecture developed, since you solve simpler problems that are needed for the next generation. So single-cycle -> pipelined -> super-scalar -> out-of-order (OoO) -> simultaneous-multithreading (SMT). I know it may sound cool to jump directly into an OoO CPU, but the complexity difference there is like comparing a bicycle to a jet airliner. The process for figuring out how to do a pipelined CPU with data hazard detection is a complex one, and you need that solution for the super-scalar, OoO, and SMT.
I will continue to post videos like this (there are others on my channel), eventually. I am a bit busy with research as well as another project in real life, so I haven’t had time to make any additional videos (they take a lot of time to produce).
@@RTLEngineering I will try to learn VHDL; I've heard from many RTL developers that it's better and more restrictive.
The RISC-V ISA is very friendly for RTL implementation.
The registers are implemented in distributed RAM with one write and two read ports. The implementation decodes the instruction, reads the register data, does the arithmetic, and stores the result back in a single cycle. Jumps and conditional jumps take two cycles, because I use BRAMs for program storage and have to discard the next instruction after a jump, and loads take two cycles because data from a BRAM can only be used one cycle after the address is issued.
Now I've finished a 5-stage RISC-V implementation that executes one instruction per cycle; the penalty for a jump / conditional jump is 4 cycles (three for flushing stages 1-3 and one because of the BRAM), and it runs at ~160 MHz with timing violations. The issue is the feedback from the temporary write-data registers used in stages 4 and 5 before the data is written back to the register bank: I read the register data in stage 3 and write it back in stage 5, so for two cycles the write-back data sits in temporary registers, and I need to check for the most recently changed data in the pipe and use it in the current arithmetic operation; if the source register's data is not in the pipeline write-back registers, I take it from the register bank.
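Roughly, that forwarding check looks like this (a sketch with made-up stage and signal names - the newest in-flight value wins, otherwise the register bank read is used):

    module bypass_mux (
        input  wire [4:0]  src_addr,       // source register of the instruction in execute
        input  wire [31:0] rf_data,        // value read from the register bank
        input  wire        mem_wr_en,      // stage-4 result, still in a temporary register
        input  wire [4:0]  mem_wr_addr,
        input  wire [31:0] mem_wr_data,
        input  wire        wb_wr_en,       // stage-5 result, about to be written back
        input  wire [4:0]  wb_wr_addr,
        input  wire [31:0] wb_wr_data,
        output wire [31:0] src_data
    );
        wire hit_mem = mem_wr_en && (mem_wr_addr == src_addr) && (src_addr != 5'd0);
        wire hit_wb  = wb_wr_en  && (wb_wr_addr  == src_addr) && (src_addr != 5'd0);

        // Priority: the younger (stage-4) result shadows the older (stage-5) one.
        assign src_data = hit_mem ? mem_wr_data :
                          hit_wb  ? wb_wr_data  :
                                    rf_data;
    endmodule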
I will keep optimizing the design until I squeeze the last MHz out of it and get rid of the timing violations; after that, I will begin work on the super-scalar implementation.
Because it is a RISC architecture, there are cases where three or more instructions could be executed at once, considering that I read the register data in stage 3 and write back in stage 5, so I have 3 cycles to work with data held in the temporary registers.
To get to out-of-order execution, I need an I-cache implementation (already done) that lets me pre-decode several instructions in a line at once. The advantage of the RISC-V ISA is that you always look at the same 12 bits of the 32-bit instruction word, and only 2 bits can overlap with instructions that use constants, but those are at fixed locations, so it's much easier to implement several decoders and a lookup-table sorter. And because the addresses of the two read registers and the one write register are always in the same place inside the instruction word, it's much easier to implement an out-of-order design. At least that's how I see it now, and I haven't even begun work on the out-of-order implementation, so imagine what is in my brain :))))))))))))).
Anyway.
One idea I have, for emulating the VR4300 and other obsolete cores on a RISC-V core, is to combine software emulation with custom hardware acceleration.
For example, a lot of the time goes into decoding the emulated instruction in software and then executing it (events take relatively little time to emulate because they are rare compared with the application actually being executed).
You could create a co-processor wired to the RISC-V core in the IO domain: you push instructions to the co-processor to be executed, with its registers wired into the RISC-V IO space (you could even give the co-processor dedicated RAM, also wired into the IO domain). Whenever the emulated core wants to talk to any resource other than its registers (or its dedicated RAM), it signals the RISC-V core with an action request code, which can be an index into a function vector table; for external interrupts, IO, and the cache, it would use the RISC-V ecosystem.
You could also use any other well-maintained core with a bigger ecosystem that already has the IO and cache developed, so you would only need to write the instruction decoder and customize an existing ALU; the rest can be emulated in software (or software with hardware acceleration).
That’s a typical implementation. I am still surprised that you achieved the speed that you listed. Have you tested it in simulation? Does it correctly sequence jumps / branches, as well as memory reads?
Why would the jump / branch penalty be 4? In MIPS, the jump penalty is 1, which is hidden by the branch-delay slot. A penalty of 4 is very inefficient... You may want to take another look at your architecture and figure out how to reduce it.
At 160 MHz, I wouldn’t worry about squeezing the last few MHz out of the design unless you have a reason to do so. Basically, you have two main targets for RTL implementations - speed and area. For a chip like the Artix-7, unless you are specifically implementing a system that should run faster, 160 MHz is a good target - it’s not worth optimizing further. If you want to make it as small as possible though, then optimization would make sense. Though keep in mind that the two contradict each other, so you can make it small or fast, not both.
I am spending so much time optimizing my VR4300, because I want to use the same parts in an R5900, which needs to run at 300 MHz. If I were only doing the VR4300 though, I would stop at 150 MHz.
Before moving on to the next architecture, make sure the CPU executes instructions correctly. Being able to synthesize the design doesn’t mean that it’s logically correct.
For the I-cache and D-cache, I would implement those on the pipelined CPU first. MIPS CPUs have had an I and D cache since the R3000 back in 1988. So you should go back and do that before moving on, too.
That’s an interesting idea for the VR4300 on RISC-V.