Building a single-stage CPU in Ripes (Part 2)
See here for step 1!
Step 4: Load instructions
We can now easily implement LW, since its address computation is the same as for an I-type ADD (ADDI): rs1 plus a sign-extended 12-bit immediate. There aren’t any new subcomponents to add; we just have to make some connections to data memory and create a new input to the register data_in multiplexer. The memory controller itself does the sign-extension or zero-extension required for LB, LH, LBU, and LHU as long as we provide the correct control signal, so we get those operations for “free” as well.
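To make the extension behavior concrete, here is a minimal Python sketch of what a load writes back to rd. It is a software model only, not the Ripes memory component: the function names `effective_address` and `load_value` are made up for illustration, and we assume the control signal is simply the instruction's funct3 field (LB=000, LH=001, LW=010, LBU=100, LHU=101 in RV32I) and that memory is little-endian.

```python
MASK32 = 0xFFFFFFFF

def effective_address(rs1_val: int, imm: int) -> int:
    """Same computation as ADDI: rs1 plus the sign-extended immediate."""
    return (rs1_val + imm) & MASK32

def load_value(mem: bytes, addr: int, funct3: int) -> int:
    """Return the 32-bit pattern a load would write to rd.

    funct3 selects width and extension (hypothetical control encoding,
    taken directly from the RV32I funct3 field).
    """
    width = {0b000: 1, 0b001: 2, 0b010: 4, 0b100: 1, 0b101: 2}[funct3]
    signed = funct3 in (0b000, 0b001, 0b010)  # LB, LH, LW
    raw = int.from_bytes(mem[addr:addr + width], "little", signed=signed)
    return raw & MASK32  # negative values become their 32-bit two's complement

mem = bytes([0x80, 0xFF, 0x00, 0x00])
lb = load_value(mem, 0, 0b000)   # sign-extended byte
lbu = load_value(mem, 0, 0b100)  # zero-extended byte
```

The only difference between LB and LBU (or LH and LHU) in this model is the `signed` flag, which is why a single control signal into the memory controller is enough.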
Step 5: Store instructions
For a store, we have to add functionality for an S-type instruction to the decoder and controller. We also finally connect the controller to the wr_en signal of the register file, since stores are the first instructions we’ve encountered that do not write to a register!
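As a rough illustration of the new decoder and controller logic, the sketch below reassembles the S-type immediate, which RV32I splits across bits 31:25 and 11:7 of the instruction, and derives a register write enable from the opcode. The helper names (`s_type_imm`, `reg_wr_en`) are hypothetical, and the write-enable rule only covers the instructions implemented up to this step.

```python
OPCODE_STORE = 0b0100011

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def s_type_imm(inst: int) -> int:
    """Rebuild the 12-bit store offset: imm[11:5] = inst[31:25], imm[4:0] = inst[11:7]."""
    imm = (((inst >> 25) & 0x7F) << 5) | ((inst >> 7) & 0x1F)
    return sign_extend(imm, 12)

def reg_wr_en(opcode: int) -> bool:
    """Register-file write enable; at this step, only stores keep it low."""
    return opcode != OPCODE_STORE

# SW x5, -4(x2), assembled field by field
sw = (0x7F << 25) | (5 << 20) | (2 << 15) | (0b010 << 12) | (0x1C << 7) | OPCODE_STORE
offset = s_type_imm(sw)
```

The immediate is split this way so that rs1 and rs2 stay in the same bit positions as in the other formats, which keeps the register-file connections unchanged.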
We can now run HW1d on our processor.
Step 6: Unconditional jumps
To implement control logic, we first work with the jump instructions, JAL and JALR. Note that the unlinked jump (J) is a pseudo-instruction encoded as JAL with rd being x0 (so we do not implement it separately in hardware).
JAL and JALR have very similar behavior – both compute a new address for the PC using an immediate, and both write the value of the return address (pc + 4) to the rd register. The difference is that JAL uses the current PC to compute the new address, and JALR uses the value in rs1. Because they encode immediates differently (JALR needs to leave room for rs1 and is an I-type instruction, but JAL is not), we have to add some decoder logic for the J-type instruction.
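The following Python sketch models the two jumps side by side. It is only an illustration of the decoder logic and the PC adder, with made-up function names; the bit positions of the J-type immediate follow the RV32I encoding (imm[20] in bit 31, imm[10:1] in bits 30:21, imm[11] in bit 20, imm[19:12] in bits 19:12). Like the circuit in this post, the `jalr` model does not clear the LSB of the target.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value: int, bits: int) -> int:
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def j_type_imm(inst: int) -> int:
    """Unscramble the 21-bit J-type immediate (bit 0 is always 0)."""
    imm = (
        (((inst >> 31) & 0x1) << 20)
        | (((inst >> 12) & 0xFF) << 12)
        | (((inst >> 20) & 0x1) << 11)
        | (((inst >> 21) & 0x3FF) << 1)
    )
    return sign_extend(imm, 21)

def i_type_imm(inst: int) -> int:
    """12-bit I-type immediate from inst[31:20]."""
    return sign_extend((inst >> 20) & 0xFFF, 12)

def jal(pc: int, inst: int) -> tuple[int, int]:
    """Return (new PC, value written to rd): target is PC-relative."""
    return (pc + j_type_imm(inst)) & MASK32, (pc + 4) & MASK32

def jalr(pc: int, rs1_val: int, inst: int) -> tuple[int, int]:
    """Return (new PC, value written to rd): target is rs1-relative."""
    return (rs1_val + i_type_imm(inst)) & MASK32, (pc + 4) & MASK32

# JAL x1, +8 assembled by hand: imm[10:1] = 4, rd = x1, opcode = 1101111
new_pc, link = jal(0x100, (4 << 21) | (1 << 7) | 0x6F)
```

Both functions return the same link value, pc + 4; only the base of the address computation differs.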
Note that the RISC-V specification states that the LSB of the new PC address for JALR is cleared after the add (it is guaranteed to be 0 for JAL – verify for yourself that this is the case). Technically, we would need to enforce this in hardware in order for our computer to be safe, because an instruction like JALR x1, x0, 1 would try to write a value of 1 to the PC, and RISC-V instructions are aligned on 2- or 4-byte boundaries depending on the extension. For simplicity’s sake, we will throw caution to the wind and trust that the compiler/assembly developer will supply properly aligned targets to JALR (just don’t ship a CPU like this!).
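For reference, the rule we are skipping is small. A minimal Python sketch of a spec-compliant target computation (the function name `jalr_target` is made up here, and this logic is not part of the circuit we build) looks like this:

```python
MASK32 = 0xFFFFFFFF

def jalr_target(rs1_val: int, imm: int) -> int:
    """Add rs1 and the sign-extended immediate, then force bit 0 to zero."""
    return ((rs1_val + imm) & MASK32) & ~1

# JALR x1, x0, 1: x0 reads as 0, so the raw sum is 1 and the LSB is cleared
target = jalr_target(0, 1)
```

In hardware this amounts to tying bit 0 of the adder output to ground on the JALR path. Note that it only guarantees 2-byte alignment; on a core without compressed instructions, a target that is not a multiple of 4 is still misaligned.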
The ALU isn’t used at all for our implementation of JAL and JALR – we could use the ALU to compute the new PC address, but instead, we choose to create a new adder (P&H does this as well). This is because branch instructions will also require us to compute a new PC address, and we will need the ALU to compute the various branch comparisons. Other implementations (such as the single-stage processor that comes with Ripes) choose to do it the other way, using the ALU to compute the new address and introducing a new unit for branching logic.
Step 7: AUIPC
Now that we’re working more with the PC-related parts of the circuit, let’s implement AUIPC, which adds the immediate to the PC and stores the result in the destination register. We could implement this instruction using either the ALU or the adder we are using to compute jump addresses. Let’s go with the first option, so that we don’t have to add logic to route the result of the PC adder to rd.
Paired with JALR, AUIPC allows the program to jump to instructions that are farther away than would be possible with just JAL (about ±1 MiB from the PC) or with JALR alone. More information is given on page 16 of the RISC-V spec.
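To see how the far jump works, here is a small Python model of the AUIPC + JALR pairing. The helper names are hypothetical. The one subtle step is splitting a 32-bit offset into a 20-bit upper part and a 12-bit lower part: because JALR sign-extends its 12-bit immediate, the upper part has to be rounded up whenever bit 11 of the offset is set.

```python
MASK32 = 0xFFFFFFFF

def auipc(pc: int, imm20: int) -> int:
    """rd = pc + (20-bit immediate placed in bits 31:12)."""
    return (pc + ((imm20 & 0xFFFFF) << 12)) & MASK32

def split_offset(offset: int) -> tuple[int, int]:
    """Split a PC-relative offset into (hi20 for AUIPC, lo12 for JALR)."""
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF   # round to compensate for lo12's sign
    lo12 = ((offset & 0xFFF) ^ 0x800) - 0x800   # sign-extended low 12 bits
    return hi20, lo12

def far_jump_target(pc: int, offset: int) -> int:
    """Model `AUIPC t0, hi20` followed by `JALR x0, t0, lo12`."""
    hi20, lo12 = split_offset(offset)
    base = auipc(pc, hi20)          # value left in the temporary register
    return (base + lo12) & MASK32   # JALR adds the low part

target = far_jump_target(0x1000, 0x12345ABC)
```

For the offset 0x12345ABC the low 12 bits are 0xABC, which JALR would read as negative, so the split bumps the upper part from 0x12345 to 0x12346 and uses a low part of -0x544.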
Step 8: Conditional branches
The remaining control transfer instructions are the conditional branches.
Note that we don’t have to add hardware support for BGT, BGTU, BLE, and BLEU because they are pseudo-instructions: the assembler synthesizes them by swapping the operands of BLT[U] and BGE[U].
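The rewrite the assembler performs can be sketched in a few lines of Python. The table and function name below are only illustrative; the mapping itself (for example, BGT rs, rt becomes BLT rt, rs) is the standard RISC-V pseudo-instruction expansion.

```python
# Pseudo-branch -> real branch executed with the two source registers swapped
PSEUDO_BRANCHES = {"BGT": "BLT", "BGTU": "BLTU", "BLE": "BGE", "BLEU": "BGEU"}

def expand_branch(mnemonic: str, rs: str, rt: str, offset: int) -> tuple:
    """Return the real branch that the hardware actually sees."""
    real = PSEUDO_BRANCHES.get(mnemonic)
    if real is None:
        return (mnemonic, rs, rt, offset)   # already a real instruction
    return (real, rt, rs, offset)           # a > b  is the same test as  b < a

expanded = expand_branch("BGT", "x5", "x6", 16)
```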
To perform the branch comparison for BLT and BGE, we can use the ALU with operation LT (used directly for BLT, and inverted for BGE). To perform the branch comparison for BEQ/BNE, we use the ALU operation SUB as the equality check: the result is zero exactly when the two operands are equal. This justifies the zero signal for the ALU – we now have a single-bit signal that we can use for the result of the comparisons.
We need a control bit to either pass through or invert the zero signal. Instead of a multiplexer, an XOR gate can be used to conditionally invert a single bit.
Take a look at the multiplexer that selects the new PC. Because the choice between PC+4 and a newly computed address depends on the result of the ALU, we combine the zero signal with some signals from the controller (Branch, Jump, InvZero) to make the decision.
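Here is one way to model the branch decision in Python. The signal names Branch, Jump, and InvZero come from the circuit, but the polarity is an assumption on our part: we take zero to be 1 when the ALU result is 0, so a set-less-than result of 1 shows up as zero = 0. Under that assumption BNE and BLT need InvZero = 1, and BEQ and BGE use the zero signal as-is. The unsigned rows (using an assumed LTU operation) and the control table are illustrative and may not match the actual controller.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op: str, a: int, b: int) -> tuple[int, int]:
    """Return (result, zero) for the three operations branches need."""
    if op == "SUB":
        result = (a - b) & MASK32
    elif op == "LT":
        result = int(to_signed(a) < to_signed(b))
    elif op == "LTU":
        result = int((a & MASK32) < (b & MASK32))
    else:
        raise ValueError(op)
    return result, int(result == 0)

# mnemonic -> (ALU operation, InvZero), assuming zero = 1 when result == 0
BRANCH_CONTROL = {
    "BEQ": ("SUB", 0), "BNE": ("SUB", 1),
    "BLT": ("LT", 1), "BGE": ("LT", 0),
    "BLTU": ("LTU", 1), "BGEU": ("LTU", 0),
}

def branch_taken(mnemonic: str, rs1_val: int, rs2_val: int) -> bool:
    op, inv_zero = BRANCH_CONTROL[mnemonic]
    _, zero = alu(op, rs1_val, rs2_val)
    return bool(zero ^ inv_zero)            # the XOR gate

def next_pc(pc: int, target: int, branch: int, jump: int,
            inv_zero: int, zero: int) -> int:
    """PC multiplexer: take the computed target on a jump or a taken branch."""
    take = jump | (branch & (zero ^ inv_zero))
    return target if take else (pc + 4) & MASK32
```

For example, `branch_taken("BLT", 0xFFFFFFFF, 1)` is true because 0xFFFFFFFF is -1 as a signed value, while `branch_taken("BLTU", 0xFFFFFFFF, 1)` is false.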
We have now implemented a basic RV32I CPU! If we had done this in hardware, we would have a working computer.
Notice that, due to dependencies, large parts of this circuit sit idle over the course of execution (for example, the data_mem component cannot do any meaningful work until the ALU output changes to a valid address). Next time, we will explore how to mitigate this by introducing instruction-level parallelism in the form of pipelining.