Previously: Single-stage CPU part 1

Next up: Five-stage CPU part 1

Step 4: Load instructions

[Figure: load_functs]

We can now easily implement LW, since the address computation is the same rs1-plus-immediate addition that an I-type ADDI performs. There aren’t any new subcomponents to add; we just have to make some connections to data memory and create a new input to the register data_in multiplexer. The memory controller itself does the sign-extend or zero-extend computation required for LB, LH, LBU, and LHU as long as we provide the correct control signal, so we get those operations for “free” as well.
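For reference, the extension step is simple to express in software. Below is a minimal C++ sketch of what a memory controller has to do per load variant, keyed on funct3 – this is purely illustrative, not the simulator’s actual memory-controller code.

Sketch: load-value extension by funct3

    #include <cstdint>
    #include <stdexcept>

    // Illustrative sketch of load extension, keyed on funct3.
    uint32_t extendLoad(uint32_t raw, uint32_t funct3) {
      switch (funct3) {
      case 0b000: // LB: sign-extend the low byte
        return static_cast<uint32_t>(static_cast<int32_t>(static_cast<int8_t>(raw & 0xFF)));
      case 0b001: // LH: sign-extend the low halfword
        return static_cast<uint32_t>(static_cast<int32_t>(static_cast<int16_t>(raw & 0xFFFF)));
      case 0b010: // LW: full word, no extension needed
        return raw;
      case 0b100: // LBU: zero-extend the low byte
        return raw & 0xFF;
      case 0b101: // LHU: zero-extend the low halfword
        return raw & 0xFFFF;
      default:
        throw std::runtime_error("Invalid funct3 field");
      }
    }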

Updated CPU wiring

    // Data memory
    data_mem->mem->setMemory(m_memory);
    alu->res >> data_mem->addr;
    0 >> data_mem->data_in; // nothing to write yet; stores come in step 5
    0 >> data_mem->wr_en;   // memory writes disabled until stores are added
    control->mem_op >> data_mem->op;

[Figure: step4]

Step 5: Store instructions

[Figure: store]

For a store, we have to add functionality for an S-type instruction to the decoder and controller, which involves the same type of code we’ve been writing already.

[Figure: store_functs]
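The one genuinely new piece of decoder work is the S-type immediate, whose bits are split across two fields of the instruction word. Here is a minimal sketch of the reconstruction – the function name is illustrative, not a component in our circuit.

Sketch: S-type immediate reconstruction

    #include <cstdint>

    // imm[11:5] sits in instruction bits 31..25, imm[4:0] in bits 11..7.
    int32_t decodeSTypeImm(uint32_t instr) {
      uint32_t imm = ((instr >> 25) & 0x7F) << 5   // imm[11:5]
                   | ((instr >> 7) & 0x1F);        // imm[4:0]
      // Sign-extend from 12 bits.
      return static_cast<int32_t>(imm << 20) >> 20;
    }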

Up until this point, the wr_en signal of the register file has been hard-wired to 1, and the wr_en signal of the data memory has been hard-wired to 0, because all of the instructions we’ve encountered have written to a register and not written to memory. The store instruction does the opposite, so we need to update our controller with two outputs that get connected to these signals.

Controller wr_en logic for register file and data memory

    reg_w << [=] {
      switch (opcode.uValue()) {
      case 0b0010011: // register-immediate
      case 0b0110011: // register-register
      case 0b0110111: // LUI
      case 0b0000011: // load
        return 1;
      case 0b0100011: // store
      default:
        return 0;
      }
    }; // reg_w;

    mem_w << [=] { return opcode.uValue() == 0b0100011; }; // store

We can now run HW1d on our processor.

[Figure: step5]

Step 6: Unconditional jumps

[Figure: jal jalr]

To implement control logic, we first work with the jump instructions, JAL and JALR. In a high-level programming language, function calls are compiled into JAL or JALR instructions. The unlinked jump (J) does not appear in the RISC-V spec because it is a pseudo-instruction encoded as JAL with rd set to x0, so we do not implement it separately in hardware; the assembler takes care of synthesizing it.

JAL and JALR have very similar behavior – both compute a new address for the PC using an immediate, and both write the value of the return address (pc + 4) to the rd register (which means we add another input to the RdSel mux). The difference is that JAL uses the current PC to compute the new address, and JALR uses the value in rs1 – this is the purpose of the PCAdd1 mux in the circuit below. Because they encode immediates differently (JALR needs to leave room for rs1 and is an I-type instruction, but JAL is not), we have to add some decoder logic for the J-type instruction. We also have to introduce a multiplexer that chooses what the next PC is, based on whether or not the instruction is a jump.
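The J-type immediate is the most scrambled of the formats: its bits are spread across the upper 20 bits of the instruction, and bit 0 is implicitly zero, so the offset is always a multiple of 2. Here is a sketch of the reconstruction (again with an illustrative name, not the simulator’s decoder).

Sketch: J-type immediate reconstruction

    #include <cstdint>

    // imm[20|10:1|11|19:12] is packed into instruction bits 31..12.
    int32_t decodeJTypeImm(uint32_t instr) {
      uint32_t imm = ((instr >> 31) & 0x1) << 20    // imm[20]
                   | ((instr >> 12) & 0xFF) << 12   // imm[19:12]
                   | ((instr >> 20) & 0x1) << 11    // imm[11]
                   | ((instr >> 21) & 0x3FF) << 1;  // imm[10:1]
      // Sign-extend from 21 bits.
      return static_cast<int32_t>(imm << 11) >> 11;
    }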

Warning: the RISC-V specification states that the LSB of the new PC address for JALR is cleared to zero after the add (for JAL it is guaranteed to already be 0 – verify for yourself that this is the case). Technically, we would need to enforce this in hardware in order for our computer to be safe, because an instruction like JALR x1, x0, 1 would try to write a value of 1 to the PC, which is an invalid instruction address since RISC-V instructions sit on 2- or 4-byte boundaries depending on the extension. For simplicity’s sake, we will throw caution to the wind and trust that the compiler/assembly developer will supply properly aligned immediate operands to JALR (just don’t ship a CPU like this!).
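If we did want to enforce the rule, the fix is a one-bit mask on the computed jump target. A sketch of the spec-conformant computation, with illustrative names:

Sketch: spec-conformant JALR target

    #include <cstdint>

    // Add rs1 and the sign-extended I-immediate, then clear bit 0.
    uint32_t jalrTarget(uint32_t rs1_val, int32_t imm) {
      return (rs1_val + static_cast<uint32_t>(imm)) & ~1u;
    }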

[Figure: jump_functs]

The ALU isn’t used at all for our implementation of JAL and JALR – we could use the ALU to compute the new PC address, but instead, we choose to create a new adder (P&H does this as well). This is because branch instructions will also require us to compute a new PC address, and we will use the ALU to perform the various branch comparisons. Other implementations (such as the single-stage processor that comes with Ripes) choose to do it the other way, using the ALU to compute the new address and introducing a new unit for branching logic.

PC selection circuit

    registers->r1_out >> pc_add1->get(PCAdd1::RS); // JALR
    pc_reg->out >> pc_add1->get(PCAdd1::PC);       // JAL
    control->pc_add1 >> pc_add1->select;
    pc_add1->out >> pc_add->op1;
    imm_sel->out >> pc_add->op2;
    pc_inc->out >> pc_sel->get(PCSel::INC);
    pc_add->out >> pc_sel->get(PCSel::J);
    control->jump >> pc_sel->select;

[Figure: step6]

Step 7: AUIPC

[Figure: auipc]

Now that we’re working more with the PC-related parts of the circuit, let’s implement AUIPC, which adds the immediate to the PC and stores the result in the destination register. We could implement this instruction either using the ALU or the adder we are using to compute jump addresses. Let’s do the first option, so that we don’t have to add logic to output the result of the PC adder to rd. This means adding a mux for the first input of the ALU. Check that the mux signals that are lit up in the picture below correspond with your understanding of the data that should be used for this instruction:

[Figure: step7]

AUIPC builds a PC-relative address in a register, which (combined with JALR) lets a program jump to instructions farther away than JAL’s immediate alone can reach. More information is given on page 16 of the RISC-V spec.
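To see how, consider the way an assembler expands a far, PC-relative call into an AUIPC/JALR pair: the 32-bit offset is split into a 20-bit upper part (AUIPC’s immediate) and a 12-bit lower part (JALR’s immediate), with a small correction because the lower part is sign-extended. The sketch below is illustrative and not part of our CPU.

Sketch: splitting a 32-bit PC-relative offset for AUIPC + JALR

    #include <cstdint>
    #include <utility>

    // Returns {hi, lo} such that (hi << 12) + lo == offset, with lo in
    // [-2048, 2047]. Adding 0x800 before shifting compensates for the
    // later sign-extension of the low 12 bits.
    std::pair<int32_t, int32_t> splitOffset(int32_t offset) {
      int32_t hi = (offset + 0x800) >> 12; // AUIPC immediate (imm[31:12])
      int32_t lo = offset - hi * 4096;     // JALR 12-bit immediate
      return {hi, lo};
    }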

Step 8: Conditional branches

The remaining control transfer instructions are the conditional branches.

[Figure: branches]

BGT, BGTU, BLE, and BLEU are pseudoinstructions synthesized by swapping the operands of BLT[U] and BGE[U].

[Figure: branch_functs]

To perform the branch comparison for BLT and BGE, we can use the ALU with operation LT (directly for BLT, and with the result inverted for BGE); BLTU and BGEU work the same way with LTU. For BEQ/BNE, we use the ALU operation SUB, since the result is zero exactly when the operands are equal.

Updated ALU control for branching

    alu_ctrl << [=] {
      ...
      } else if (opcode.uValue() == 0b1100011) { // branches
        switch (funct3.uValue()) {
        case 0b000: // BEQ
        case 0b001: // BNE
          return ALUOp::SUB;
        case 0b100: // BLT
        case 0b101: // BGE
          return ALUOp::LT;
        case 0b110: // BLTU
        case 0b111: // BGEU
          return ALUOp::LTU;
        default:
          throw std::runtime_error("Invalid funct3 field");
        }

Instead of using the entire ALU result, all we need to know is whether the output of the ALU is zero or non-zero, which justifies the zero signal we added to the ALU all the way back at the beginning of this journey. Based on the operation, we need a control bit to either pass through or invert the zero signal (for example, two numbers are equal exactly when the ALU output after subtracting is 0, so a BEQ branch is taken when the zero signal is high, while a BNE branch is taken when it is low). Instead of a multiplexer, an XOR gate can be used to invert a single bit.

Zero-bit inversion control

    inv_zero << [=] {
      if (opcode.uValue() == 0b1100011) { // Branches
        switch (funct3.uValue()) {
        case 0b000: // BEQ
        case 0b101: // BGE
        case 0b111: // BGEU
          return 0;
        case 0b001: // BNE
        case 0b100: // BLT
        case 0b110: // BLTU
          return 1;
        default:
          throw std::runtime_error("Invalid funct3 field");
        }
      } else {
        return 0; // don't care
      }
    }; // inv_zero

The select input of the multiplexer that chooses the next PC is now driven by a small logic-gate circuit: if the instruction is an unconditional jump, OR if it is a branch AND the branch should be taken according to the (possibly inverted) ALU zero signal, then the next PC comes from the adder with the immediate. Otherwise, it is just the next sequential instruction (PC + 4).
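Expressed as a boolean function over the control and ALU signals, the next-PC selection looks like the sketch below (the names mirror the signals discussed above, not exact component names).

Sketch: next-PC select logic

    // True when the next PC should come from the immediate adder rather
    // than PC + 4. branch_taken folds in the optional zero-bit inversion.
    bool pcFromAdder(bool jump, bool branch, bool zero, bool inv_zero) {
      bool branch_taken = zero != inv_zero; // XOR of the two bits
      return jump || (branch && branch_taken);
    }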

[Figure: step8]

We have now implemented a basic RV32I CPU! If we had done this in hardware, we would have a working computer.

You may be realizing that, at any given time, large parts of this circuit are sitting idle as electrical signals travel from the PC register to instruction memory, to the decoder, to control logic, to the ALU, to data memory, and finally to the RdSel and PC-selection muxes. In fact, while we have a working CPU, we could do better in terms of performance by allowing all of these units to be used in parallel on different instructions. This is called pipelining, and we will explore it in the next lecture.