Homework 3: Cache Assignment in gem5
Overview
In this assignment, you will implement a writeback cache from scratch and integrate it into gem5 to simulate real programs. This will allow you to evaluate your cache under different workloads and with different configurations. There will be a few iterations of your implementation, where each step is designed to build on the previous step. You will only submit the code associated with the final iteration to Gradescope, but you will also submit a written component whose questions you should answer as you go.
gem5 is a complicated tool. Using it or extending it straight out of the box can be tricky, so it will be useful to examine the gem5 cheat sheet and/or to attend/revisit the gear-up session in order to familiarize yourself with the tool. Feel free to come by office hours for further clarifications, but note that we may refer you to the available resources if we believe the answer can be found there!
Stencil code
Get started by cloning the repo if you are working locally, or by pulling the latest version if you are working in the course Docker container! The files that you will need in this assignment are src/mem/cache/micro-cache.{hh,cc}, and you can test your code by running ./build/RISCV/gem5.debug configs/assignments/cache/cache-assignment.py or ./build/RISCV/gem5.debug configs/assignments/cache/cache-assignment-docker.py.
Submission and grading
You will upload a written and programming component to one Gradescope assignment. The written component should be uploaded as a PDF. For the code, upload the micro-cache.{hh,cc} files.
We will run some basic autograder tests on your code (such as checking that a higher-latency cache performs worse than a lower-latency one), all of which will be visible to you. We will also check your code manually to make sure that it conforms to the directions put forth in this handout. Finally, we will grade your responses to the written questions to assess your conceptual understanding of caching.
Part 0: Protocol Description
Before starting to code, read through this assignment handout and make sure you understand the general role of the cache you’re being asked to implement. To guide your work, sketch a finite state machine controller for the cache protocol that deals with CPU requests (i.e., you don’t have to describe the coherence protocols). The state machine is driven by the handleRequest and handleResponse callbacks (i.e., your transitions should happen based on the information from these functions). You do not have to turn this in, but we will ask you for it to help us check your conceptual understanding when you ask us questions. Your state machine should be complete enough that the pseudocode for the assignment follows from it. It doesn’t need to be wholly exhaustive, but you should enter the state machine at a starting state and consider the following questions:
- What sort of signals does the cache need to keep track of?
- Is the request a read? Write?
- Does the request hit in the cache given the current state?
- What should be sent to memory and when? What should happen on the response?
Why an FSM?
A cache is a complicated mechanism that may require several steps to complete a single action. A finite state machine allows you to express the order in which these steps happen. For example, caches might wait for a response from memory before being able to do some next step, and this sort of behavior is straightforward to express as a "wait state" of an FSM. For a more detailed overview, check out section 5.9 of P&H!
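As a concrete (and entirely optional) starting point, here is a minimal sketch of the kind of controller you might draw. The state and signal names are our own invention and are not required by the stencil:

```cpp
// A minimal sketch of one possible controller; the state names below are
// illustrative assumptions, not something the stencil code defines.
enum class CacheState {
    Idle,        // no outstanding miss; handleRequest may accept a new CPU request
    WaitForMem   // a fill (and possibly a writeback) was sent; waiting for handleResponse
};

// Transitions driven by the two callbacks:
//   handleRequest:  Idle -> Idle        on a hit  (respond to the CPU right away)
//   handleRequest:  Idle -> WaitForMem  on a miss (set blocked, send a fill to memory)
//   handleResponse: WaitForMem -> Idle  (install the data, clear blocked, respond to the CPU)
```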
Part 1: One-Block Cache
In the first part of this assignment, you will implement a cache that only ever consists of a single 64-byte block, based on this study. The aim of this part of the assignment is to familiarize you with the gem5 environment and to evaluate the performance impact of a single-block cache across different benchmarks.
In the micro-cache.cc file, you will see that there are two functions for you to complete: handleRequest and handleResponse. For the cache to work, these functions need to implement well-defined behavior, which should conform to your state machine as described in Part 0.
handleRequest receives a PacketPtr (i.e., Packet *) from the CPU, and will perform either the hit or the miss routine. The hit routine occurs if the data being searched for is currently in the cache; otherwise, perform the miss routine. In either case, handleRequest should increment the stats.hits or stats.misses counter so that your assignment can be graded!
On the other hand, handleResponse should perform the post-memory-access routine. That is, the pkt in the function arguments is the response to a memory request (that you will create!), not the CPU request. Remember, your response may need to replace some existing data in the cache, so be sure to handle this case appropriately!
A few things to note: in the handleRequest function, your implementation should ensure that only one memory request can be outstanding at a time by setting the “blocked” flag. Because of this, you do not need to enforce memory consistency or implement any synchronization in your cache. However, you do need to ensure that the blocked flag is set back to false when the protocol demands it.
Furthermore, you may assume that CPU requests will not cross blocks (i.e., you will not need to fetch multiple blocks to service a single request).
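Since every block is 64 bytes, a common way to compare a CPU request against a cached block is to mask off the offset bits of the address. A small sketch, where the helper name is ours rather than part of the stencil:

```cpp
// Hypothetical helper: align an address to the start of its 64-byte block.
// Only the 64-byte block size comes from the handout; the rest is illustrative.
Addr blockAlign(Addr addr)
{
    return addr & ~Addr(63);   // clear the low 6 offset bits
}
// A request hits when its block-aligned address matches the (valid) block's address.
```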
Overall, gem5 uses many objects with many member functions – most of which are not relevant to this assignment. Full listings of packet functions can be found in the src/mem/packet.hh header, or in the formal documentation. For your convenience, here are some functions that you may find useful:
- Addr Packet::getAddr() (e.g., pkt->getAddr()): returns the address of the current packet. Note that Addr is an alias for uint64_t.
- unsigned Packet::getSize() (e.g., pkt->getSize()): returns the size of the current packet.
- bool Packet::isWrite() (e.g., pkt->isWrite()): returns true when the current packet has data to write to the relevant block (pkt->hasData() should also return true here).
- bool Packet::isRead() (e.g., pkt->isRead()): returns true when the packet should read data from the relevant block.
- const T* Packet::getConstPtr<T>() (e.g., pkt->getConstPtr<uint8_t>()): returns a pointer of type T to the data stored in the packet.
- void Packet::writeData(uint8_t *p) (e.g., pkt->writeData(p)): copies the data stored in the packet into the buffer pointed to by p.
- void Packet::setData(const uint8_t *p) (e.g., pkt->setData(p)): copies the data at pointer p into the packet.
- bool Packet::needsResponse() (e.g., pkt->needsResponse()): returns true when the current packet should be returned to the CPU with the relevant fields updated (data should be set in the packet if it is a read, etc.).
- void Packet::makeTimingResponse() (e.g., pkt->makeTimingResponse()): converts the packet from a request into the appropriate response.
Furthermore, there are a few functions and fields to help you in micro-cache.hh:
- struct Block: we have provided a block struct for you. Your solution should create the appropriate number of Block instances and maintain them throughout the execution.
- void writebackData(bool dirty, uint64_t addr, uint8_t *data): given the appropriate input, creates the appropriate memory request for an eviction target. You can also use this function as a reference if you need to create a Packet.
- void sendToMem(): takes the packet set as the current to_mem packet and calls the appropriate port function to send it to memory.
- void sendToCpu(): takes the packet set as the current to_cpu packet and calls the appropriate port function to send it back to the processor.
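To tie the callbacks and helpers together, here is a rough sketch of how the one-block protocol might be organized. Treat it as pseudocode: the members blk, blocked, outstanding, to_mem, and to_cpu are placeholders we made up, and how you construct the memory-side fill packet is left to you.

```cpp
// Sketch only. blk, blocked, outstanding, to_mem, and to_cpu are assumed member
// names; writebackData, sendToMem, sendToCpu, and the stats counters come from
// the provided micro-cache.hh.
void MicroCache::handleRequest(PacketPtr pkt)
{
    Addr blk_addr = pkt->getAddr() & ~Addr(63);   // 64-byte block alignment

    if (blk.valid && blk.addr == blk_addr) {
        stats.hits++;
        // Service the access from blk.data (read: fill the packet; write: update
        // the block and mark it dirty), then respond if the packet needs it.
        if (pkt->needsResponse()) {
            pkt->makeTimingResponse();
            to_cpu = pkt;
            sendToCpu();
        }
    } else {
        stats.misses++;
        blocked = true;        // only one outstanding memory request at a time
        outstanding = pkt;     // remember the CPU request so handleResponse can finish it
        if (blk.valid && blk.dirty)
            writebackData(true, blk.addr, blk.data);
        // Build a read request for blk_addr, set it as to_mem, then:
        sendToMem();
    }
}

void MicroCache::handleResponse(PacketPtr pkt)
{
    // pkt is memory's response to the fill we issued above, not the CPU request.
    // Install the returned data into blk, replay the saved CPU request
    // (outstanding) as a hit, and clear the blocked flag so new requests can proceed.
    blocked = false;
}
```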
Testing
Once you have finished, be sure to test your cache against each of the sample test files in the cache-assignment.py script! If your simulation crashes or hangs (i.e., runs for more than 5 minutes without finishing), be sure to check out the Common Errors section of the gem5 cheat sheet.
Note from the TAs: unfortunately, we have found that not all of the test programs work the same whether you’re developing locally, in the course development container, or on the autograder. We do know that a correct implementation will work on the autograder, but a correct solution may not work in your local development environment (sorry!). To help determine whether your implementation works in your environment, please try the following:
- in configs/assignments/cache, we have also provided a control.py script, which doesn’t use the MicroCache. If running the control doesn’t work for a particular binary in your environment, please try a different binary from the test programs (or a different environment) for your local testing
- if you are having little luck finding a working binary, add the option --maxtick 10000000000 to the end of your run command
Written Response
Please answer the following questions for Part 1.
- What is the memory consistency model for the one-block cache? How is the memory consistency model enforced?
- What assumptions does your cache make about requests coming from the processor side? Is this assumption safer if your cache is an L1 or L2 cache? Why?
Part 2: Fully Associative Cache
Your cache implementation may well be fully functional after Part 1, but realistically it’s not super usable. Let’s add some capacity to this cache so that we can store more than 64 bytes at a time. Your implementation should use the p->size parameter in the constructor of the MicroCache, and allocate the appropriate amount of space for the blocks. You may assume that all sizes will be provided in kilobytes, as a value immediately followed by kB without a space. There are multiple ways to map memory addresses to blocks, and this part will introduce one instance of this – a “fully associative” cache. This means that any memory address can go in any of the blocks (or, in terms of the terminology we’ll see in class, the cache has one set that holds all of the blocks).
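For example, assuming gem5’s parameter handling has already converted p->size into a byte count by the time the constructor runs (worth verifying in the generated params header), the number of blocks might be derived like this:

```cpp
// Assumption: p->size is the total capacity in bytes (e.g., "4kB" -> 4096)
// and blocks is something like a std::vector<Block> member we added ourselves.
unsigned num_blocks = p->size / 64;   // fixed 64-byte block size
blocks.resize(num_blocks);
```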
The key change from your single-block cache is that your cache will now have multiple blocks. This results in two main differences: for one, the handleRequest function will now have to search across several blocks in order to determine whether to follow the hit or miss procedure; and the handleResponse function now needs to choose which block to fill with the new data. In this case, we would like to:
- Fill an empty block if one exists, or
- Choose a block in the cache to evict in order to make space for this new data to fill.
There are several ways to choose the block to evict; your job in this part is to implement the least recently used (LRU) eviction policy. That is, in scanning for a place to put the new data, you should track which block was used least recently. Then, write that block’s data back to lower levels of the cache hierarchy as done in Part 1. To do this, you should modify what is stored with each cache block in the micro-cache.hh file, and update a timestamp with curTick() each time the block is accessed.
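A sketch of what the victim search could look like, assuming the Block struct has been extended with a last_used tick (refreshed with curTick() on every access); all of the field names here are guesses about your own code rather than the stencil’s:

```cpp
// Sketch only: last_used is a field we add to Block ourselves; the other field
// names (valid, dirty, addr, data) are assumptions about your Block layout.
Block *victim = nullptr;
for (Block &b : blocks) {
    if (!b.valid) { victim = &b; break; }            // prefer an empty block
    if (!victim || b.last_used < victim->last_used)
        victim = &b;                                 // otherwise, oldest access wins
}
if (victim->valid && victim->dirty)
    writebackData(true, victim->addr, victim->data); // evict dirty data to memory
```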
Least recently used is far from the only possible replacement policy. In fact, there are several others, including least frequently used, time-aware least recently used, most recently used, etc. (to read more, see here). From this list, choose one alternative time-based replacement policy and one frequency-based replacement policy to use in the written section.
Written Response
Please answer the following question for Part 2.
- In code, you probably implemented your set by searching through the tag fields of each block in the set in some way. This implies that the code’s runtime grows linearly with the number of blocks in the cache. Because gem5 is simulating hardware, it does not simulate timing by measuring the runtime of the simulation code; instead, it schedules events separated by some fixed, configurable latency. For example, we could set the latency to 20ns for the single-block cache from Part 1, and set it to some different number for our fully-associative cache to account for the additional hardware. If you wanted to assess different caching configurations using gem5, how would you go about choosing this number? Think about justifying one type of cache vs. another to your boss using these numbers – what evidence would you need to provide to support the numbers gem5 reports? What limitations of the comparison would you need to explain to your boss? When comparing different caches, are there quantities that gem5 cannot measure/report?
- Suppose you have a 4-block fully associative cache. Come up with a memory trace (a sequence of addresses and operation types) for which the LRU replacement policy is sub-optimal compared to your selected alternative replacement policies. Then provide a different memory trace for which the LRU policy is optimal compared to those same alternative policies.
- Besides having increased complexity over the single-block cache, our fully-associative cache also has increased capacity. If we wanted to do an apples-to-apples comparison of our single-block cache and our fully-associative cache, we could increase the block size of our single-block cache to the total capacity of the fully associative cache. While the hardware would certainly be less complex, what disadvantages would this giant-block cache have over a fully-associative cache of the same capacity? Consider how this cache would fit into the entire hierarchy and how it would respond to different patterns of memory access.
Part 3: Set Associative Cache
Suppose you have a large cache with many blocks (e.g., 1MB). Adding hardware to try to match the tag of every block in the cache (e.g., as in Fig. 5.18 of P&H) is unreasonable. What can be done? One possibility is to restrict the possible locations to which an address can be mapped (like a hash map). In a set-associative cache, each address can be mapped to any of the blocks within one particular set. If there are n blocks per set, these caches are called n-way set associative caches. This entails partitioning the blocks in the cache into sets. A consequence of this partitioning is that the tasks of searching for an address and choosing an eviction target are reduced to scanning n blocks.
If a 2-way set associative cache has 64 blocks (i.e., a 4kB cache with 64-byte blocks), then there are 32 possible sets that a block can be mapped to. When searching for an address in the cache, we will not look in every set for that block. Instead, addresses are assigned to cache sets based on the index bits of the address (the bits just above the block offset), while the highest-order bits form the tag used to identify a block within its set. To visualize different configurations of set-associativity, it will be helpful to review the figures and examples in section 5.4 of P&H.
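Concretely, with 64-byte blocks the low 6 bits of an address are the block offset, the next log2(number of sets) bits select the set, and everything above that is the tag. A hedged sketch of the bit manipulation (variable names are ours, and num_sets is assumed to be a power of two):

```cpp
// Illustrative only: num_sets is assumed to be a power of two.
unsigned index_bits = 0;
for (unsigned s = num_sets; s > 1; s >>= 1)
    index_bits++;                                 // index_bits = log2(num_sets)

unsigned set = (addr >> 6) & (num_sets - 1);      // set index: bits above the 6 offset bits
Addr     tag = addr >> (6 + index_bits);          // tag: everything above the index bits
```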
Your task is to modify your fully associative cache implementation to implement an n-way set associative cache, as specified by the input parameter p->assoc in the class constructor.
Written Response
Please answer the following questions for Part 3.
- Think about the latencies of your simulated caches. Does it make sense to have equivalent latencies for different associativities? Why or why not?
- What is the implication of having an associativity of 1 (called a direct-mapped cache)? If your application uses 1MB of data, and you have a 1MB 1-way set associative cache, would you expect the hit rate to be higher or lower than with an 8-way set associative cache? Would your performance be better? Justify your hypothesis.
Part 4: Cache Parallelism (written only)
Recall that, with instruction- and thread-level parallelism, we may have multiple memory operations in parallel. In this section, you’ll consider the implications of multiple requests for the cache in parallel – both for different addresses/sets and for the same set!
Notice how, in your implementation, we still allow only a single memory access into the cache at a time. We made this design decision to simplify the coordination of multiple memory operations, so that our hardware maintains clean memory consistency. For example, suppose we had multiple responses in the same cycle, and both needed to choose a block to evict. In our current implementation, both would choose the same block, and the second would overwrite the first without ever keeping both values in the cache at the same time. This would be bad!
However, our current means of blocking is wholly inefficient once there are multiple sets. If two memory requests fall in different sets, then there is no need for one to block the other’s execution.
- One solution is for each set to have its own blocking flag. What would need to change in order to support parallel hits? What about parallel misses? How would you suggest that the cache controller synchronize this logic?
- What you have described in the previous part is a lightweight miss status holding register (MSHR), a hardware primitive used in real caches (blocking and non-blocking) to maintain the global memory consistency model while handling parallel misses to similar and dissimilar addresses in memory. What is the relationship between the maximum achievable parallelism in a cache (i.e., the bandwidth of a cache) and the number of MSHRs that the cache can support? It may help to justify your answer with an example.
Handin
Your submission should include the following files, all made to the same Gradescope submission:
- Your implementation of micro-cache.{hh,cc} from Part 3
- Your PDF with your written responses
Changelog
This will be updated whenever any clarifications have been added to this assignment. See also the FAQ on Ed!