Building a single-stage CPU in Ripes (Part 1 -- Arithmetic instructions)
Preliminary notes
RISC-V 32I ISA
We will be working with the RISC-V 32I (32-bit integer) base instruction set. As always, the RISC-V specification can be found here.
Working with Ripes
We will build several CPUs using Ripes. Ripes gives us a lot of functionality, but we will mainly be doing two things: defining new components, and connecting components (that we have written or that exist in Ripes/VSRTL) to create a CPU. The easiest way to get caught up with the conventions of Ripes is to look at an example, such as the single stage processor they provide (we will be deviating from the provided design a bit, both to show that different implementations of the same ISA can exist, and because we will be building out our CPU incrementally).
When connecting components together to create a CPU, we need to define them as subcomponents (e.g. SUBCOMPONENT(alu, TYPE(ALU<XLEN>)); in rvss.h). We also need to make sure that every input of a subcomponent has been connected. The syntax for this is [wire source] >> [wire dest] (e.g. control->alu_ctrl >> alu->ctrl; in rvss.h). In hardware, this would look like placing electronic components on a circuit board and wiring the outputs of some components to the inputs of others. We are just using VSRTL to describe this hardware for simulation.
When defining our own component, we need to define input and output ports (e.g. INPUTPORT(op1, W); in vsrtl_adder.h) and define how the outputs are set on each clock cycle. The syntax is [output] << [=] { [function to compute output] }; (e.g. out << [=] { return op1.sValue() + op2.sValue(); }; in vsrtl_adder.h). We get the values of the input ports by using .uValue() or .sValue() (unsigned or signed values, respectively). The function to compute the output is run on every clock cycle.
Colorblind-accessible palettes
Unfortunately, VSRTL uses pure red (0xff0000) and pure green (0x00ff00) to represent on/off control signals. If you would like to change the colors of the indicators, you can do so by applying the provided patch (since most of the files that render the graphics are in the VSRTL submodule repo, this is the easiest way for us to provide this change). This patch changes instances of Qt::green to QColor{87,196,173} and Qt::red to QColor{219,67,37}, as per the recommendations here. Feel free to adapt the patch file to colors of your choosing.
# in the top-level Ripes repo:
patch -p0 < redgreen.patch
Coding along
The notes below give some example code snippets of relevant changes we make, but this page won’t include all of the code for each step. To access this code, go to the src/processors/CS1952y/single_stage_cpu directory in our course Ripes repo. The step0 subdirectory gives the code we end up with for step 0, and so on. The layouts directory gives the layout for every step (which you can select in the processor selection dialog to match the step we’re on). You will need to copy the code from a particular stepn subdirectory into the src/processors/CS1952y/single_stage_cpu directory and rebuild Ripes in order to run the partial processor from that step. To see what changes are made between steps, you can run your diff program of choice (Milda likes the Diff & Merge extension for VSCode).
Step 0: Incrementing the PC
We start with the bare-bones components we will need to implement our CPU: the register file, the instruction memory, the data memory, the ALU, and the PC register. An ALU takes in two operands and a control signal to select the operation to be done, and produces the result of that operation on the operands. We build on the rv_alu
provided by Ripes by adding a “zero” signal, which is high when the ALU result is 0 and low otherwise. We will see why the zero output is helpful later on.
We’ll ignore most of these components for now and wire them up as we need them (to make sure the circuit runs in Ripes, we’ll just wire a 0 signal to the input of each one). Let’s first wire up a circuit to read an instruction from memory (using the current value of the PC as an address) and increment the PC by 4 at every clock cycle.
Part of cs1952y1s_cpu.h
// ** ADVANCING THE PC **
pc_inc->out >> pc_reg->in;
pc_reg->out >> pc_inc->op1;
4 >> pc_inc->op2;
// ** Instruction memory **
instr_mem->setMemory(m_memory);
pc_reg->out >> instr_mem->addr;
// ** Registers **
registers->setMemory(m_regMem);
0 >> registers->r1_addr;
0 >> registers->r2_addr;
0 >> registers->wr_addr;
0 >> registers->data_in;
0 >> registers->wr_en;
// ** ALU **
0 >> alu->op1;
0 >> alu->op2;
ALUOp::NOP >> alu->ctrl;
// ** Data memory **
data_mem->mem->setMemory(m_memory);
0 >> data_mem->addr;
0 >> data_mem->data_in;
0 >> data_mem->wr_en;
MemOp::NOP >> data_mem->op;
If we step through any RISC-V program, we see that the PC indeed gets incremented, and that the output of instruction memory is the corresponding instruction.
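The behavior of this fetch loop can be mimicked in a few lines of plain C++. This is only a sketch and not Ripes or VSRTL code: the names pc_after, fetch, and program are hypothetical, and it assumes the program is loaded starting at address 0.

```cpp
#include <cstdint>

// Hypothetical three-instruction program; each entry is one 32-bit word.
constexpr uint32_t program[] = {
    0x00000013,  // addi x0, x0, 0 (NOP)
    0xfff00593,  // addi x11, x0, -1
    0x00158593,  // addi x11, x11, 1
};

// Models pc_reg and pc_inc: the PC starts at 0 and grows by 4 each cycle.
constexpr uint32_t pc_after(unsigned cycles) {
    uint32_t pc = 0;
    for (unsigned i = 0; i < cycles; ++i) {
        pc += 4;  // mirrors pc_inc->out >> pc_reg->in
    }
    return pc;
}

// Models instr_mem: memory is byte-addressed, so the word index is addr / 4.
constexpr uint32_t fetch(uint32_t addr) {
    return program[addr / 4];
}
```

After two cycles the PC is 8, so fetch(pc_after(2)) returns the third word. The increment is 4 because every RV32I instruction is 4 bytes long.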
This is a great first step, because now we have a different instruction available to the CPU at each clock cycle, and we can start deciphering the bits therein. Instead of tackling all possible instructions at once, let’s go through the RISC-V spec piece-by-piece and figure out what we need for each instruction type.
Step 1: Adding support for Register-Immediate instructions
As we learned, these instructions provide an immediate value, an input register, and a function field, and write the output to the destination register. They are I-type instructions (for now we ignore LUI, AUIPC, and the shift operations):
Note that a NOP (no-op, or empty instruction) is a pseudoinstruction encoded as ADDI x0 x0 0. Since x0 is hardwired as the constant 0, it is a read-only register, and “writing” to it has no effect. Thus, as long as we implement ADDI correctly, NOP will work correctly, as well.
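To make the NOP encoding concrete, here is a small sketch that packs the I-type fields into a 32-bit word. The helper encode_i_type and the constant names are hypothetical (they are not part of Ripes); the field positions follow the I-type layout in the RISC-V spec.

```cpp
#include <cstdint>

// Packs the I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode.
constexpr uint32_t encode_i_type(int32_t imm, uint32_t rs1, uint32_t funct3,
                                 uint32_t rd, uint32_t opcode) {
    return ((static_cast<uint32_t>(imm) & 0xfff) << 20) |  // bits 20 to 31
           ((rs1 & 0x1f) << 15) |                          // bits 15 to 19
           ((funct3 & 0x7) << 12) |                        // bits 12 to 14
           ((rd & 0x1f) << 7) |                            // bits 7 to 11
           (opcode & 0x7f);                                // bits 0 to 6
}

constexpr uint32_t OP_IMM = 0b0010011;  // opcode of the register-immediate group
constexpr uint32_t FUNCT3_ADDI = 0b000;

// ADDI x0, x0, 0: every field except the opcode is zero.
constexpr uint32_t nop = encode_i_type(0, 0, FUNCT3_ADDI, 0, OP_IMM);
```

The resulting word is 0x00000013, which is the value you should see at the output of instruction memory whenever a NOP is fetched.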
For now, we’ll ignore the opcode and assume all instructions are I-type.
To provide support for this instruction, we need a translator that takes in the 32-bit instruction (the output of the instr_mem component) and:
- Extracts the immediate value, source register address, and destination register address
- Transforms the funct3 field of the instruction to the proper control signal of the ALU
The first step is just a matter of selecting the correct bits according to the I-type instruction fields. For the second step, we need to know what bits of the funct3 field correspond to what operations, which we can find with the help of the table in chapter 19 of the RISC-V spec:
Translator bit extraction (cs1952y1s_translate.h)
// Registers
rs1 << [=] { return (instr.uValue() >> 15) & 0x001f; }; // bits 15 to 19
rd1 << [=] { return (instr.uValue() >> 7) & 0x001f; }; // bits 7 to 11
// Immediate
imm <<
[=] { return instr.sValue() >> 20; }; // bits 20 to 31 (sign-extended)
This translation would be done in hardware, and we define a switch statement to model it in our software simulator.
funct3 switch statement
// ALU control signal
alu_ctrl << [=] {
  auto funct3 = (instr.uValue() >> 12) & 0x0007; // bits 12 to 14
  switch (funct3) {
    case 0b000: // ADDI
      return ALUOp::ADD;
    case 0b010: // SLTI
      return ALUOp::LT;
    case 0b011: // SLTIU
      return ALUOp::LTU;
    case 0b100: // XORI
      return ALUOp::XOR;
    case 0b110: // ORI
      return ALUOp::OR;
    case 0b111: // ANDI
      return ALUOp::AND;
    default:
      throw std::runtime_error("Invalid funct3 field");
  }
};
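As a worked example of the two translator jobs, the sketch below decodes one concrete word, 0xfff00593 (addi x11, x0, -1), outside of Ripes. The names Decoded, AluOp, translate, and sign_extend_12 are hypothetical stand-ins for the course code, and the sign extension is written out explicitly instead of relying on a signed right shift.

```cpp
#include <cstdint>

enum class AluOp { ADD, LT, LTU, XOR, OR, AND, INVALID };

struct Decoded {
    uint32_t rs1;
    uint32_t rd;
    int32_t imm;
    AluOp op;
};

// Maps funct3 to an ALU operation, as in the switch statement above.
constexpr AluOp alu_op_for(uint32_t funct3) {
    switch (funct3) {
        case 0b000: return AluOp::ADD;  // ADDI
        case 0b010: return AluOp::LT;   // SLTI
        case 0b011: return AluOp::LTU;  // SLTIU
        case 0b100: return AluOp::XOR;  // XORI
        case 0b110: return AluOp::OR;   // ORI
        case 0b111: return AluOp::AND;  // ANDI
        default:    return AluOp::INVALID;
    }
}

// Sign-extends a 12-bit field: flip the sign bit, then subtract its weight.
constexpr int32_t sign_extend_12(uint32_t bits) {
    return static_cast<int32_t>((bits & 0xfff) ^ 0x800) - 0x800;
}

constexpr Decoded translate(uint32_t instr) {
    return Decoded{(instr >> 15) & 0x1f,         // rs1: bits 15 to 19
                   (instr >> 7) & 0x1f,          // rd:  bits 7 to 11
                   sign_extend_12(instr >> 20),  // imm: bits 20 to 31
                   alu_op_for((instr >> 12) & 0x7)};
}
```

For 0xfff00593 this yields rs1 = 0, rd = 11, imm = -1, and the ADD operation. Unlike the simulator code, the sketch returns an INVALID marker instead of throwing, which keeps it usable in constant expressions.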
Once we define the translator, we make the appropriate connections to the ALU and the register file.
We can run some I-type instructions through this CPU (for example, addi x11, x0, 0xbad, addi x11, x11, 1), and observe that the computations indeed get performed. Note that the result of the writeback to the register file only becomes visible in the clock cycle after the one in which the operation runs (we will revisit this design of the register file when we get to pipelining).
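The one-cycle delay on writeback can be seen in a toy model. The sketch below is not the Ripes register file: RegFile, step_addi, and x11_at_start_of_cycle are hypothetical names, the model handles ADDI only, and it uses the smaller immediates 5 and 1 for illustration.

```cpp
#include <cstdint>

struct RegFile {
    uint32_t regs[32] = {};
    // x0 always reads as 0.
    constexpr uint32_t read(uint32_t addr) const { return addr == 0 ? 0 : regs[addr]; }
    // Writes to x0 are dropped.
    constexpr void write(uint32_t addr, uint32_t value) {
        if (addr != 0) regs[addr] = value;
    }
};

constexpr int32_t sign_extend_12(uint32_t bits) {
    return static_cast<int32_t>((bits & 0xfff) ^ 0x800) - 0x800;
}

// Executes one ADDI word; the write lands at the end of the cycle.
constexpr void step_addi(RegFile& rf, uint32_t instr) {
    const uint32_t rs1 = (instr >> 15) & 0x1f;
    const uint32_t rd = (instr >> 7) & 0x1f;
    const uint32_t imm = static_cast<uint32_t>(sign_extend_12(instr >> 20));
    rf.write(rd, rf.read(rs1) + imm);
}

// Value of x11 as seen at the start of the given cycle (0-based).
constexpr uint32_t x11_at_start_of_cycle(unsigned cycle) {
    constexpr uint32_t prog[] = {
        0x00500593,  // addi x11, x0, 5
        0x00158593,  // addi x11, x11, 1
    };
    RegFile rf{};
    for (unsigned i = 0; i < cycle && i < 2; ++i) {
        step_addi(rf, prog[i]);
    }
    return rf.read(11);
}
```

During cycle 0, x11 still reads 0 even though the first ADDI is executing. The value 5 is visible from cycle 1, and 6 from cycle 2.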
Step 2: Adding support for the Register-Register instructions
These instructions provide two input register addresses, two function fields, and write the output to the destination register. They are R-type instructions:
Notice that, while the lower 20 bits are used for the same purpose as the I-type instructions we looked at earlier, the upper 12 bits serve a different purpose. For the I-type instructions, the immediate field went into the op2 input for the ALU. For the R-type instructions, we have to read the value of the rs2 register and then use it as an input to the ALU. How do we determine which data to use? We can create a control unit to output a selection signal for a mux that selects which data (immediate for I-type instructions, rs2 register value for R-type instructions) goes into the second input of the ALU based on the opcode and function field(s).
To make a stronger distinction between “decoding” and “controlling,” we split up our translator from the previous part such that:
- The decode unit splits up the instruction into named bit fields and puzzles together and sign-extends any immediate(s). In hardware, this would simply be done with wires, but we encompass it in its own component to draw some abstraction boundaries in our diagram.
- The control unit creates control signals to the subsequent components of the CPU, such as the control signal of the ALU and the selection bits to any multiplexed data. In hardware, this would be done using logic gates that essentially create a lookup/translation table. Notice the clever design of the funct3 fields of the two types of instructions – for example, 100 is the funct3 field for XOR and for XORI. This is not a coincidence – since the same ALU control signal is used for both instructions, this enables less hardware to be used when creating the control unit in real life.
Controlling the selection signal for the ALU2Sel mux in cs1952y1s_control.h
alu2_sel << [=] {
  switch (opcode.uValue()) {
    case 0b0010011: // I-type
      return ALU2Sel::IMM;
    case 0b0110011: // R-type
      return ALU2Sel::REG2;
    default:
      return ALU2Sel::IMM;
  }
}; // alu2_sel
The decode unit is agnostic to the type of operation – it will always output something for imm as if the instruction were I-type, and something for rs2 as if the instruction were R-type. Of course, if the instruction is not of those types, the data in those outputs is essentially garbage. That is why we need the control unit, to make sure that only the well-formed data is routed to the other components using muxes.
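To see why the mux matters, the following sketch decodes both fields unconditionally and lets the opcode pick one. The names Fields, decode, alu2_sel, and alu_op2 are hypothetical, and the mux is modeled as a plain conditional, not as a VSRTL component.

```cpp
#include <cstdint>

enum class Alu2Sel { IMM, REG2 };

struct Fields {
    uint32_t opcode;
    uint32_t rs2;  // meaningful only for R-type instructions
    int32_t imm;   // meaningful only for I-type instructions
};

// Decode unit: always produces both fields, whatever the instruction type.
constexpr Fields decode(uint32_t instr) {
    return Fields{instr & 0x7f, (instr >> 20) & 0x1f,
                  static_cast<int32_t>(((instr >> 20) & 0xfff) ^ 0x800) - 0x800};
}

// Control unit: chooses the mux input from the opcode.
constexpr Alu2Sel alu2_sel(uint32_t opcode) {
    return opcode == 0b0110011 ? Alu2Sel::REG2 : Alu2Sel::IMM;
}

// Mux: forwards exactly one of its two data inputs to the ALU.
constexpr uint32_t alu_op2(uint32_t instr, uint32_t rs2_value) {
    return alu2_sel(decode(instr).opcode) == Alu2Sel::REG2
               ? rs2_value
               : static_cast<uint32_t>(decode(instr).imm);
}
```

For sub x3, x1, x2 (0x402081b3) the decoded imm is 1026, which is just the funct7 and rs2 bits read as a number, and the mux ignores it in favor of the rs2 register value. For addi x11, x0, -1 (0xfff00593) the decoded rs2 is 31, which is equally meaningless, and the mux forwards the immediate.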
Our CPU can now handle two types of instructions!
Our control unit will become very complicated as we implement the rest of the instructions! For legibility, we have hidden the control unit and indicated the signals that are connected to control inputs/outputs using bold and italicized text.
Step 3: Adding support for the remaining Register-Immediate instructions
Understanding the principle of the control unit, we can now implement shifts with immediates, as well as LUI.
In terms of decoding, the shift operations are a special case of the I-type instruction, where the upper 7 bits of the immediate field act more like a funct7 field (a fact which we take advantage of when implementing the control unit). Since an instruction has at most one immediate, we introduce an immediate selection mux, so that the control unit can route the correct type and size of immediate based on the type of instruction.
Updated logic for alu_ctrl to support shift-immediates
if (opcode.uValue() == 0b0010011) { // I-type
  switch (funct3.uValue()) {
    ...
    case 0b001: // SLLI
      return ALUOp::SL;
    ...
    case 0b101: // SRLI and SRAI
      if (funct7.uValue() == 0) {
        return ALUOp::SRL;
      } else if (funct7.uValue() == 0b0100000) {
        return ALUOp::SRA;
      } else {
        throw std::runtime_error("Invalid upper 7 bits for SRLI/SRAI");
      }
    ...
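The sketch below works through the SRLI/SRAI distinction on two concrete words. It is a standalone illustration with hypothetical names (ShiftOp, shift_op, shamt, srl, sra) and is not the course ALU; the arithmetic shift is written portably instead of relying on a signed right shift.

```cpp
#include <cstdint>

enum class ShiftOp { SLL, SRL, SRA, INVALID };

// Chooses the shift from funct3 and the upper 7 bits of the instruction.
constexpr ShiftOp shift_op(uint32_t instr) {
    const uint32_t funct3 = (instr >> 12) & 0x7;
    const uint32_t funct7 = instr >> 25;
    if (funct3 == 0b001) return ShiftOp::SLL;  // SLLI
    if (funct3 == 0b101) {                     // SRLI or SRAI
        if (funct7 == 0) return ShiftOp::SRL;
        if (funct7 == 0b0100000) return ShiftOp::SRA;
    }
    return ShiftOp::INVALID;
}

// Shift amount: only bits 20 to 24 of the instruction.
constexpr uint32_t shamt(uint32_t instr) { return (instr >> 20) & 0x1f; }

// Logical right shift fills with zeros.
constexpr uint32_t srl(uint32_t value, uint32_t n) { return value >> n; }

// Arithmetic right shift fills with copies of the sign bit.
constexpr uint32_t sra(uint32_t value, uint32_t n) {
    return (value & 0x80000000u) ? ~(~value >> n) : value >> n;
}
```

The words 0x00335293 (srli x5, x6, 3) and 0x40335293 (srai x5, x6, 3) differ only in bit 30. Applied to 0xfffffff0 (that is, -16), the logical shift gives 0x1ffffffe while the arithmetic shift gives 0xfffffffe (that is, -2).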
LUI does not make use of the ALU, but instead writes its immediate directly to the destination register. We introduce a multiplexer that selects between this immediate and the ALU output as the data written back to the register file.
Immediate selection and rd logic
imm_sel << [=] {
  switch (opcode.uValue()) {
    case 0b0010011: // Register-immediate
      if (funct3.uValue() == 0b001 || funct3.uValue() == 0b101) { // shift
        return ImmSel::Ishift;
      } else {
        return ImmSel::I;
      }
    case 0b0110111: // LUI
      return ImmSel::U;
    default:
      return ImmSel::I;
  }
}; // imm_sel
rd_sel << [=] {
  switch (opcode.uValue()) {
    case 0b0110111: // LUI
      return RdSel::IMM;
    default:
      return RdSel::ALU;
  }
}; // rd_sel
When implementing the decode unit, we must also pay attention to how the U-immediate is expanded into 32 bits: it occupies bits 12 to 31 of the instruction, and the lower 12 bits of the result are zero.
Immediate decode
// Immediates
imm_I <<
[=] { return instr.sValue() >> 20; }; // bits 20 to 31 (sign-extended)
imm_Ishift <<
[=] { return (instr.uValue() >> 20) & 0x001f; }; // bits 20 to 24
imm_U << [=] { return instr.sValue() & 0xfffff000; }; // bits 12 to 31
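Running the three decoders on a single word shows why the immediate selection mux is needed. This is a standalone sketch: the functions reuse the port names imm_I, imm_Ishift, and imm_U for readability but are plain C++ functions, not VSRTL ports.

```cpp
#include <cstdint>

// I-immediate: bits 20 to 31, sign-extended.
constexpr int32_t imm_I(uint32_t instr) {
    return static_cast<int32_t>(((instr >> 20) & 0xfff) ^ 0x800) - 0x800;
}

// Shift amount: bits 20 to 24, never sign-extended.
constexpr uint32_t imm_Ishift(uint32_t instr) { return (instr >> 20) & 0x1f; }

// U-immediate: bits 12 to 31 stay in place, lower 12 bits become zero.
constexpr uint32_t imm_U(uint32_t instr) { return instr & 0xfffff000u; }

constexpr uint32_t lui_x5 = 0x123452b7;  // lui x5, 0x12345
```

All three decoders produce a value for lui x5, 0x12345, but only imm_U gives the intended 0x12345000. The other two return 291 and 3, which are fragments of the same bits, so the control unit has to select the U output for this opcode.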
Since AUIPC involves the PC, we’ll hold off on implementing control for this operation until we discuss jumps and branches.
You should now be able to run your code from HW1b on this CPU!
Next time, we will implement memory and control logic instructions.