Homework 2: Pipelined Processors in Ripes
Due: Wednesday, 2/21 at 10pm
Overview
In class, we implemented a single-stage CPU in Ripes, and then used the concept of instruction-level parallelism to create a five-stage pipelined CPU. In this assignment, you will explore the consequences of creating four- and six- stage CPUs. You will identify data and control hazards, implement the forwarding and hazard units for the CPUs, and perform some analysis about the relative performance of the pipelines.
Note: the Autograder will become available before 2/9. We will cover hazards and flushing the pipeline in class on 2/9 (the implementation of forwarding will be left to the notes). You can get started on trying out code on the provided processors (and even try your hand at identifying hazards!) before then. You also have enough background to attempt conceptual questions 1a-d and 2a after the 2/7 lecture.
Stencil code
We will be working in our course Ripes repo, which is already on the Docker containers in the provided dev environment. Perform a git pull
and make
(both in the Ripes
directory) to build the stencil code. The files you will be working on are in the src/processors/CS1952y/four_stage_cpu
and src/processors/CS1952y/six_stage_cpu
subdirectories. The processor circuits, including the completed decode/control units from the single-stage processor, are provided for you. You will be completing the hazard and forwarding units. Remember that all of the code for the single-stage and five-stage CPUs we built in class has been provided for you in this repo, as a reference.
Don’t forget the Ripes/RISC-V resources linked on the course page, and the lecture notes linked on the schedule page! Before starting the homework, read through the forwarding discussion of the five-stage pipeline: this is “Part 0” of the homework and gives context to the forwarding units you will be implementing.
Grading information
You will be graded on the functionality and performance of your hazard and forwarding units, as well as your analysis. See the handin section for a list of files to turn in. We expose some basic Autograder tests, but it’s up to you to thoroughly test your pipelines for how they respond to hazards. Keep in mind that the actual amount of code you have to write for this assignment is not that high: the majority of the time spent will be reasoning about hazards and resolving design decisions.
The processors should be able to handle every instruction on page 104 of the RISC-V spec (the 32I base instruction set table) except for FENCE
, FENCE.I
, ECALL
, EBREAK
, CSRRW
CSRRS
, CSRRC
, CSRRWI
, CSRRSI
, CSRRCI
.
The pipelines
Four-stage pipeline: This processor gets rid of the ID stage, putting the decoding and control into the “I” (previously IF) stage, and the register file write/read into the Ex stage. Reading from the register file is assumed to be fast, so moving the register file to the execute stage might make sense if the memory operations (on data/instruction memory) are the bottleneck.
Six-stage pipeline: This processor moves register reading to its own stage (“RR”) after the decode stage (essentially, it is the opposite of the 4-stage modification and therefore might make sense if the decode cycle from the 5-stage pipeline is the bottleneck).
Note that you will have to make a decision in which stage the hazard detection unit operates, by either using the rs[1/2]_id
from the decode stage, or from the register read stage (i.e. the output of the ID/RR register).
Tips
There are a lot of wires and signals in the images above. Don’t panic! This is a complicated circuit because CPUs are complicated little pieces of electronics, but here are some tips to navigate the circuitry:
- Just like we did in class, you can trace an instruction by looking at the control signals that are high (look at the highlighted indicator on each multiplexer). Get to know each processor by running a few instructions at a time and observing their journey through the pipeline.
- If you don’t know where a wire goes, click on it to highlight it. You can also take a look at the
cs1952y4s_cpu
andcs1952y6s_cpu
code to see how the inputs/outputs of the components connect to each other. - We deliberately gave you more inputs to the forward and hazard units than you need – do part 1 first, without thinking about the signals, and then think about how your drawings in part 1 can help you identify the logic you need to write in part 2.
- Make use of both the processor view and editor view to see which instructions are in which stage of the pipeline. You can place breakpoints if you want to debug longer programs, and you can right-click on any input/output port to display its value if you need to see a specific signal in more detail – read the Ripes documentation (linked on the resources page) for more information on navigating Ripes.
Part 1: Identifying hazards
In a pdf file, describe all of the data/control hazards that can occur in each pipeline. For each hazard, clearly label which processor it corresponds to, and include:
- A snippet of code (6 instructions or fewer) that would run incorrectly on a CPU without a forwarding/hazard unit.
- A 2-3 sentence explanation of what the hazard is and what the expected and incorrect results are (this can be supplemented with comments in your code, as we had in the course notes on pipelining).
- A note on whether the hazard can be resolved by forwarding or requires stalling.
- A pipeline diagram similar to Figure 4.27 or 4.28 (depending on if the resolution uses forwarding or stalling) or the diagrams from the lecture notes that shows the stage that each instruction is in at each clock cycle.
Note that you can confirm that code contains a hazard by running it on the given processors (without the forwarding/hazard units) and observing that an incorrect side effect (register or memory write) occurs.
You will be graded on identifying all hazards that occur in a pipeline, accurately drawing them, and identifying whether they need to be handled by forwarding or stalling.
What does “all hazards” mean? You don’t have to draw every possible interaction between instructions, but there should be enough to cover different cases of flushing/stalling/forwarding. For example, the ex/ex and lui examples in the notes both show forwarding from ex to ex and mem to ex, so only one is necessary to show the dependency between those stages. Even if you don’t draw all possible examples, it is beneficial for you to think through all the different data that might need to be forwarded and the ways forwards/stalls/flushes might interact (because we will test for it in our autograder!). It will also be helpful for your examples to show interactions of instructions that do not create hazards, so that you do not forward/stall unnecessarily.
Part 2: Implementing the forwarding and hazard units
Using your observations in part 1, complete the code for the forwarding and hazard units. A large part of this task will be understanding the circuit and control signals, based on what we saw in lecture for the single- and five-stage CPUs. Do not modify any files besides cs1952y[4/6]s_[forward/hazard].h
– these will be the only Ripes files you turn in, which our autograder will slot into the Ripes repo in order to build your processor. Also, do not add/remove any inputs/outputs from these files. You might find that some input signals should remain unused or some outputs should remain high/low/not use a given Mux selection signal. That’s part of the design challenge of this homework!
For the six-stage CPU, you will have to decide whether your hazard unit lives in the ID stage or the RR stage. If it lives in the ID stage, you cannot use the rr_rs[1/2] input signals to detect a hazard. If it lives in the RR stage, you cannot use the id_rs[1/2] input signals to detect a hazard.
Note that it is possible to create a really conservative CPU that just stalls whenever a hazard of any sort is detected. While this would be a working processor (as long as stalling is implemented correctly), it would not be a very efficient processor. A large part of your implementation grade will come from your processors’ ability to stall only when necessary in order to avoid degrading performance.
Part 3: Analysis
Measuring performance: one goal of this homework is for you to identify when deeper pipelining is desirable and when it degrades performance. Write two different programs, each of which uses at least one of each instruction in the register/register, register/immediate, memory, and control transfer categories (so, at least four different instructions). The programs should also run for 50 cycles or more. Assume that the 6-stage pipeline clock cycle time is 70% of the 4-stage pipeline clock cycle time. One program should result in a lower execution time for the four-stage pipeline, and the other program should result in a lower execution time for the six-stage pipeline. You will turn these programs into gradescope as 4s_better.s
and 6s_better.s
. In your PDF file, for each program, briefly explain why the program performs better on one pipeline over the other.
Quantitative analysis: in your PDF file, include answers to the following conceptual questions:
Question 1: (These numbers are adapted from Exercise 4.7 of P&H)
Assume the following latencies for each circuit component:
- Instruction memory/Data memory: 250ps
- Register file: 150ps
- Mux: 25ps
- ALU: 200ps
- Adder: 150ps
- Single gate: 10ps
- Extract bits: 5ps
- Sign extend: 50ps
- Control: 50ps
- Register read: 30ps
- Register setup: 20ps
“Register read” is the time needed after the rising clock edge for the new register value to appear on the output. This value applies to the PC and the pipeline registers. “Register setup” is the amount of time a register’s data input must be stable before the rising edge of the clock. This value applies to the PC, the pipeline registers, and the register file. (Wording directly from P&H 4.7)
Note: while this is not true in real life, assume for the sake of this analysis that the hazard and forwarding units have 0 latency.
a. What is the minimum clock cycle time of the four stage CPU? The six stage CPU?
b. Do these numbers justify the pipeline divisions for the (four/five/six-stage) pipelined CPUs? That is, could we move components to different pipeline stages, or combine stages altogether, and get a better minimum clock cycle time? If yes, explain why one of the four/five/six-stage CPUs leads to the optimal CPU cycle time possible for the given latencies. If no, draw a new pipeline that has all of the components of our single-stage CPU but that minimizes the CPU cycle time. As part of this analysis, consider whether some components can be split up across stages, or if they have to stay together in one stage.
c. (Inspired by P&H Exercise 4.10) Assume we could double the size of the register file, such that we now have 64 registers. Say this increases the latency of the register file to 230ps. Does this change which of the four/five/six-stage CPUs would now have the lowest CPU cycle time?
d. What changes would need to happen to the RISC-V ISA specification to allow for 64 registers rather than 32? Specifically, what would be the consequences on the control transfer and memory instructions? How could you use a combination of instructions to load a 32-bit immediate into the register?
e. Having extra registers can reduce the need to save register values to the stack (and load them from the stack later). We saw that a Mem/Ex hazard, such as lw x10, 0(sp)
, addi x10, x10, 3
, introduces a delay of one cycle. Say N% of instructions in your program introduce a Mem/Ex hazard that requires stalling (and all other instructions do not introduce stalling). Using your answers to b and c, what is the value of N (if any exists) where this larger register file becomes worth it?
Question 2: (Adapted from Exercise 4.11 of P&H) RISC-V is a load-store architecture, which means that all computations are done on registers. Let’s explore what would have to change about our processor if we allowed a single instruction to compute an arithmetic result and store it to memory: consider a “store sum” instruction (ss rs1, rs2, imm
) whose behavior is: Mem[Reg[rs1]]=Reg[rs2]+immediate
.
a. Using our single-stage processor in Ripes as a starting point, draw out what the path of this instruction would look like through the circuit. If you need to add any new components or modify existing components, indicate that in your diagram/in a short explanation. Also explain what new signals (if any) are needed from the control unit to support this instruction.
b. Give an example of a program where the use of the ss
instruction introduces a data hazard in the five-stage processor. Explain how the current forwarding/hazard units are able to take care of this hazard, or explain how the units would need to be changed to detect and account for this hazard.
Question 3: As we’ll learn later in the semester, modern CPUs try to spend as much time as possible doing computations, so they will employ techniques such as branch prediction to try to avoid wasted cycles. While this has done great things for throughput of instructions, it means that modern CPUs spend quite a bit of time doing computations that may end up having no effect, which has implications on energy usage. In fact, recent trends in processors have been moving away from deep pipelining. As your textbook puts it, simpler processors may deliver better performance per joule. In this question, we explore the consequences of wasted computations in our pipelines.
When designing our five-stage pipelined CPU, we chose to deal with all control hazards (jumps and branches) at the mem stage, since that is when the ALU result is available for computing whether a branch is taken. You may have observed that control hazards are really costly, since they require us to flush multiple instructions out of the pipeline. From an energy perspective, it’s not ideal to let an instruction get partway through a pipeline before flushing it. Your book discusses an optimization for adding simple PC address computation and branch comparison logic at the decode stage, in order to reduce the flush from three instructions to just one. The earlier branch comparison gets fairly complicated because it requires new forwarding logic. On the other hand, computing the new PC is easy and can be done in the decode stage, since we just need the immediate from the instruction and the current PC. This is true for JAL
as well, because the new PC address again depends on the current PC and the immediate field. However, JALR
throws a wrench in the works, because the new PC is computed relative to a register value, so we have to compute the new address in the Ex stage for JALR
. Considering these complexities, answer the following questions:
a. Assume that much less power is used by a pipeline stage when all inputs are low (for example, when the pipeline register is cleared), so that we only care about the work done by a pipeline stage when the corresponding pipeline register is not cleared/passing through an all-0 signal. If each pipeline stage uses one unit of energy per clock cycle, compute the wasted energy that results from partially computing the instructions that get flushed right after a jump for each of the four, five, and six stage pipelines.
b. For the five-stage pipeline, propose a solution for reducing the wasted energy that comes from JAL
/JALR
instructions. Include a sketch of how each instruction moves through the pipeline by highlighting the control signals that would be used to move the data around for each, and explain how your approach reduces wasted energy. Design ideas you might consider are: different datapaths for each instruction (such that JALR
and JAL
use different components of the circuit), different behavior of the hazard unit for each instruction, or even introducing hardware registers, adders, or other new components.
Part 4: Reflection
At the end of your pdf, include your answers to the following questions:
- What were your main takeaways from this assignment?
- What suggestions do you have for improving the assignment in the future?
- What questions do you still have about pipelining/CPU design?
Handin
Your submission should include the following files, all made to the same Gradescope submission:
- Your implementations for
cs1952y[4/6]s_[forward/hazard].h
(four files) from Part 2. - Your
4s_better.s
and6s_better.s
programs from Part 3. - Your pdf, which should include: your answers to part 1, your explanation from the “measuring performance” task in part 3, your answers to the “quantitative analysis” task in part 3, and your reflection from part 4.
Changelog
This will be updated whenever any clarifications have been added to this assignment. See also the FAQ on Ed!
2/18, 10:00am: slight clarity rewording for quantitative Q1
2/7, 3:50pm: added some clarifying words to conceptual question 1
2/9, 11:00am: changed performance analysis question to be about execution time, not CPI
2/12, 1:10pm: added some clarification for how many hazard diagrams are enough