# **Topics in Computer Architecture Research**

**Best Paper Nominees at HPCA 2024** 



### Outline

- Toru Koizumi (Nagoya Institute of Technology), Ryota Shioya, Shu Sugita, Taichi Amano, Yuya Degawa, Junichiro Kadomoto, Hidetsugu Irie, Shuichi Sakai (University of Tokyo) "Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors"
- Johannes Wikner, Daniel Trujillo, Kaveh Razavi (ETH Zurich) "Phantom: Exploiting Decoder-detectable Mispredictions"
- Bongjoon Hyun, Taehun Kim, Dongjae Lee, Minsoo Rhu (KAIST) "Pathfinding Future PIM Architectures by **Demystifying a Commercial PIM Technology**"

**Clockhands** (Best Paper Nominee MICRO23)

## **Remember Re-ordering?**

### Minimize WAR and WAW hazards by register renaming!



### Anyone ever write code like this?





### Problem

- O3 CPUs use a lot of power to perform tasks like register renaming, but register renaming is limited by false dependences
- False dependences are somewhat inevitable because developers (and compilers) are bad at assigning registers!
- But, this isn't their fault, it's the ISA's fault!
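To make the false-dependence problem concrete, here is a minimal sketch of what a renamer does. It assumes a toy three-operand instruction format `(dest, src1, src2)` and an unlimited pool of physical registers named `p0`, `p1`, ...; both are illustrative assumptions, not details from the paper.

```python
from itertools import count


def rename(instrs):
    """Map architectural registers to fresh physical registers.

    instrs: list of (dest, src1, src2) architectural register names.
    Assumption: unlimited physical registers, so no free-list pressure.
    """
    fresh = count()
    mapping = {}  # architectural name -> current physical name

    def lookup(reg):
        # Live-in registers get a physical register on first use.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    out = []
    for dest, src1, src2 in instrs:
        # Sources read the current mapping, so true (RAW) dependences survive.
        p1, p2 = lookup(src1), lookup(src2)
        # Every write gets a brand-new physical register, so WAR and WAW
        # hazards on the architectural name disappear.
        mapping[dest] = f"p{next(fresh)}"
        out.append((mapping[dest], p1, p2))
    return out


program = [
    ("a0", "s0", "s1"),  # a0 = s0 op s1
    ("a1", "a0", "s1"),  # a1 = a0 op s1  (RAW on a0)
    ("a0", "s2", "s3"),  # a0 = s2 op s3  (WAW on a0, WAR with the read above)
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

The first and third instructions both write `a0`, but after renaming they write different physical registers, so the third no longer has to wait for the second. The hardware cost of doing this lookup every cycle is what the paper wants to avoid.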

### Chat with your neighbors!

- What questions do you have?
- Brainstorm: How could we fix the problem?
  - What leads to a false dependence?
  - What about the ISA causes this?

### **Proposed Solution**

- No destination registers in ISA!
- All instruction operands are in terms of "how many instructions ago" the operand was produced

### **Proposed Solution (v1)**

```
// step 1. access some data
lw   a0, 1024(s0)
addi a0, a0, 2
sw   a0, 1024(s0)
// step 2. access some other data
lw   a0, 2048(s0)
addi a0, a0, 4
sw   a0, 2048(s0)
```

Becomes...

```
// step 1. access some data
lw   1024([n])
addi [0], 2
sw   [0], 1024([n + 2])
// step 2. access some other data
lw   2048([n + 3])
addi [0], 4
sw   [0], 2048([n + 5])
```

The next register is allocated from a ring buffer of available physical registers!
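The translation from register names to "how many instructions ago" can be sketched in a few lines. This is a toy model, not the paper's compiler: it assumes every instruction counts toward the distance, that `[0]` means the immediately preceding instruction, and that `s0` was produced `n` instructions before the snippet starts (here `n = 10`, picked arbitrarily). Immediates are left out.

```python
def to_distances(instrs, live_in):
    """Rewrite source registers as distances to their producing instruction.

    instrs: list of (opcode, dest_or_None, [source registers]).
    live_in: register -> distance as seen by the first instruction.
    """
    # Position of each register's most recent producer; live-ins sit at
    # negative positions before the snippet.
    last_writer = {reg: -(dist + 1) for reg, dist in live_in.items()}
    out = []
    for idx, (op, dest, srcs) in enumerate(instrs):
        out.append((op, [idx - last_writer[s] - 1 for s in srcs]))
        if dest is not None:
            last_writer[dest] = idx
    return out


n = 10  # hypothetical distance to the instruction that produced s0
program = [
    ("lw",   "a0", ["s0"]),
    ("addi", "a0", ["a0"]),
    ("sw",   None, ["a0", "s0"]),
    ("lw",   "a0", ["s0"]),
    ("addi", "a0", ["a0"]),
    ("sw",   None, ["a0", "s0"]),
]
converted = to_distances(program, {"s0": n})
for op, dists in converted:
    print(op, dists)
```

With `n = 10` the distances to `s0` come out as 10, 12, 13 and 15, i.e. `n`, `n + 2`, `n + 3` and `n + 5`, matching the slide. Note how the distance to `s0` keeps growing even though `s0` never changes.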



### **Proposed Solution (v1)**

- No false dependences
- Uses way more registers: a new one for almost every instruction
- How do we handle loops?
  - for (int i = 0; i < 100; i++)
  - *i* keeps getting further in distance away
  - Handle this by calling

# **Proposed Solution (v2)**

- Don't just naively allocate from a ring buffer...
- Different variables have different purposes
  - 1. s -> stack pointer and function args
  - 2. t -> temporary variables
  - 3. u -> variables with long lifetimes
  - 4. v -> loop constants
- Let's allocate registers according to their usage as defined by the developer!
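A minimal sketch of why splitting the ring by purpose helps, assuming a toy model in which each hand (`s`, `t`, `u`, `v`) keeps its own write counter and an operand is named by a (hand, distance) pair. The `Hands` class and its methods are hypothetical names for illustration and do not model the actual hardware.

```python
class Hands:
    """Toy model: one independent write counter per hand."""

    def __init__(self, names=("s", "t", "u", "v")):
        self.writes = {name: 0 for name in names}

    def write(self, hand):
        # Allocate the next slot in this hand only; other hands do not move.
        self.writes[hand] += 1
        return self.writes[hand]

    def distance(self, hand, value_id):
        # How many writes to this hand have happened since the value was made.
        return self.writes[hand] - value_id


hands = Hands()
limit = hands.write("v")       # loop constant lives in hand v
first_temp = hands.write("t")  # a temporary lives in hand t
for _ in range(100):
    hands.write("t")           # two temporaries per iteration
    hands.write("t")

limit_distance = hands.distance("v", limit)
temp_distance = hands.distance("t", first_temp)
print(limit_distance, temp_distance)
```

The loop constant stays at distance 0 in hand `v` no matter how many temporaries the loop body produces, while the old temporary in hand `t` drifts 200 slots away. In the single-ring v1 scheme the loop constant would drift in the same way.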





# **Proposed Solution (v2)**



# What is the maximum allowable number of usable registers in each ISA?

## **Proposed Solution (v2)**

- Implemented compiler as LLVM extension
- Implemented ISA on an FPGA



Figure 17: Frequency at which a destination register is defined with a lifetime greater than a certain number of instructions (same as Fig. 4).



Figure 18: Frequency at which a destination register is defined with a lifetime greater than a certain number of instructions (same as Fig. 4). The vertical axes indicate definition frequency and the horizontal axes indicate register lifetime.

# What do you think?

# Demystifying a Commercial PIM Technology (Best Paper Winner HPCA24)

### **Remember the Memory Hierarchy?**



Thinking about architecture as boxes hides details!



### Let's Dig into Memory!



"According to our study, more than 88% of the total training time is consumed by transferring data" ---SmartInfinity (HPCA 2024)



### Let's Dig Into Memory!



### Let's Dig Into Memory!



(Figure labels: NDP vaults, main memory)



# **Processing In-Memory (PIM)**

- What if we didn't have to transfer data into the memory hierarchy?
- What if there was a data processing unit on the memory die?



Image credit: https://thememoryguy.com/upmem-releases-processor-in-memory-benchmark-results/



Larger bandwidth: 2.5 terabytes per second of memory bandwidth

Image credit: https://www.upmem.com/

# **Processing In-Memory (PIM)**

- Great idea, right?
- "We've investigated applying PIM to our workloads and determined there are several challenges to using these approaches. Perhaps the biggest challenge of PIM is its programmability. It is hard to anticipate future model compression methods, so programmability is required to adapt to these. PIM must also support flexible parallelization since it is hard to predict how much each dimension (of embedding tables) will scale in the future." - Facebook, 2021

### **UPMEM-PIM**



### **UPMEM-PIM**

- 20 double-ranked UPMEM-PIM DIMMs
- 8 DPUs per DPU chip (where each DPU has 64MB of private memory + scratchpads)
- 8 DPU chips per memory rank
- $20 \times 2 \times 8 \times 8 = 2560$  DPUs!
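The DPU count can be checked with a quick calculation. This sketch assumes the four factors are DIMMs, ranks per DIMM, chips per rank and DPUs per chip, and that the 64MB of private memory is per DPU. The constant names are made up for illustration.

```python
DIMMS = 20
RANKS_PER_DIMM = 2       # "double-ranked"
CHIPS_PER_RANK = 8
DPUS_PER_CHIP = 8
PRIVATE_MB_PER_DPU = 64  # assumption: 64MB belongs to each DPU

total_dpus = DIMMS * RANKS_PER_DIMM * CHIPS_PER_RANK * DPUS_PER_CHIP
total_private_gb = total_dpus * PRIVATE_MB_PER_DPU / 1024

print(total_dpus)        # 2560
print(total_private_gb)  # 160.0
```

Under these assumptions the system has 2560 DPUs sitting next to 160 GB of DPU-private memory.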

### uPIMulator



Fig. 4: uPIMulator simulation framework overview.



### Chat with your neighbors!

- What do you think about PIM?
  - Do you agree with Facebook's claim about usability?
  - What do you think are the implications of specialization versus generalizability?
  - Will UPMEM fail???

### Limitations of PIM

- Really hard to do coherence between processor memory and PIM memory
- Is it useful to have programmable logic on a memory die?
  - Why not just make a fixed-logic accelerator?
- What if my offloaded program could benefit from a bigger cache?
- What's the right granularity of computation? PIM versus NDP versus NMP versus NSP versus PUM

# Phantom (Best Paper Nominee MICRO23)

### **Remember Flush + Reload Attack?**



shared cache





### **Remember Flush + Reload Attack?**

```
if (strcmp(argv[1], "supersecretdata") == 0) {
    x = data[addr1];
} else {
    x = data[addr2];
}
```



shared cache
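As a refresher, the three steps of Flush + Reload (flush, let the victim run, reload and time) can be sketched with a simulated cache. Everything here is a toy: `ToyCache`, the latencies of 1 and 100 "cycles", and the addresses `0x40` and `0x80` are made up for illustration. A real attack uses a cache-line flush instruction and a cycle counter on memory shared with the victim.

```python
class ToyCache:
    HIT, MISS = 1, 100  # made-up latencies in "cycles"

    def __init__(self):
        self.lines = set()

    def flush(self, addr):
        self.lines.discard(addr)

    def access(self, addr):
        latency = self.HIT if addr in self.lines else self.MISS
        self.lines.add(addr)  # any access brings the line into the cache
        return latency


def victim(cache, guess, addr1, addr2):
    # Mirrors the slide's snippet: the branch outcome picks the line touched.
    cache.access(addr1 if guess == "supersecretdata" else addr2)


def flush_reload(cache, run_victim, addr1, addr2):
    cache.flush(addr1)        # 1. flush both candidate lines
    cache.flush(addr2)
    run_victim()              # 2. let the victim run
    t1 = cache.access(addr1)  # 3. reload and time each line
    t2 = cache.access(addr2)
    return "addr1" if t1 < t2 else "addr2"


cache = ToyCache()
right = flush_reload(cache, lambda: victim(cache, "supersecretdata", 0x40, 0x80), 0x40, 0x80)
wrong = flush_reload(cache, lambda: victim(cache, "something else", 0x40, 0x80), 0x40, 0x80)
print(right, wrong)
```

The attacker never sees `x`, only which line reloads quickly, and that is enough to learn which side of the branch ran.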

### We know this is going to execute speculatively!







### **Remember Speculative Execution?**

- We use the BPB to predict outcomes and the BTB to predict targets
  - This allows the processor to race ahead of the actually evaluated truth most of the time!
- We use the ROB to execute instructions without committing them
- The window of time between when a prediction is made and when it is evaluated is called the speculative window, and what runs inside it is called transient execution
- Code snippets (gadgets) that leak information speculatively are called Spectre gadgets

## **Types of Spectre Attacks**

- Spectre v1 (Bounds Check Bypass): If an array is accessed inside a speculative window at a parameterized index, pass an illegal index to read memory from elsewhere. The branch will be squashed, but the data will be in the cache
- Spectre v2 (Branch Target Injection): An attacker may poison all of the potential branch targets for particular addresses to point to some malicious code. Only when the branch target is resolved will the malicious region be squashed, but the data from this malicious region will be in the cache
- Phantom: you will help unpack!
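The Spectre v1 pattern can be sketched as a toy simulation. This is not an exploit. It assumes a cache modeled as a Python set, a `mispredict` flag standing in for a mistrained branch predictor, and a one-byte secret placed right after a four-byte public array. All names are hypothetical.

```python
def victim_read(memory, array_len, index, cache, mispredict):
    """Bounds-checked read whose check may be bypassed transiently."""
    in_bounds = index < array_len
    if in_bounds or mispredict:
        value = memory[index]  # possibly a transient, out-of-bounds load
        cache.add(value)       # secret-dependent probe line gets cached
        if in_bounds:
            return value       # architecturally committed result
    return None                # transient work is squashed, cache is not


def recover_byte(cache):
    # Attacker probes all 256 lines and looks for the single fast one.
    hits = [line for line in range(256) if line in cache]
    return hits[0] if len(hits) == 1 else None


memory = bytes([1, 2, 3, 4]) + b"K"  # public array followed by a secret
cache = set()
result = victim_read(memory, 4, 4, cache, mispredict=True)
leaked = recover_byte(cache)
print(result, chr(leaked))
```

The architectural result is `None` because the out-of-bounds read is squashed, yet the cache footprint still reveals the secret byte `K`.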

### Chat with your neighbors!

"We hypothesize that the asymmetric combinations of branch types will likely lead to short mispredictions that the CPU can detect during decode due to mismatching instruction types. Consequently, our analysis could benefit from observation channels that allow us to infer how far in the pipeline a mispredicted control flow advances. For example, if we observe transient memory operations from the mispredicted target, we can infer that the mispredicted control flow reached Execute (EX) and advanced through the preceding stages, namely IF and ID"

### Phantom

- Train a BTB entry that maps the instruction at address A to target instruction C
- Flush instruction B from the ICache so that fetching it and decoding it will be slow
- So long as B also maps to C, C will execute transiently



Figure 4: In ①, A creates a BTB entry to C, so that in ②, the victim instruction of B may reuse that BTB entry. The instructions in C emit a transient execution signal. By fetching and decoding C, transient fetch and transient decode signals are already emitted.
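The aliasing idea can be sketched with a toy BTB. This model assumes the BTB is indexed by the low 12 bits of the fetch address with no tag check, which is a simplification chosen for illustration; real index and tag functions are more involved. The addresses for A, B and C are made up.

```python
class ToyBTB:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.entries = {}

    def train(self, branch_addr, target):
        self.entries[branch_addr & self.mask] = target

    def predict(self, fetch_addr):
        return self.entries.get(fetch_addr & self.mask)


def frontend(btb, fetch_addr, is_branch):
    """Return (transiently fetched target, whether decode must resteer)."""
    target = btb.predict(fetch_addr)
    # The prediction is made before decode, so the target is fetched even if
    # the instruction at fetch_addr turns out not to be a branch at all.
    resteer = target is not None and not is_branch
    return target, resteer


A, B, C = 0x401234, 0x801234, 0x7000  # A and B share their low 12 bits
btb = ToyBTB()
btb.train(A, C)                                      # step 1: A creates the entry
target, resteer = frontend(btb, B, is_branch=False)  # step 2: B reuses it
print(hex(target), resteer)
```

B is not a branch (think of the `nop`), so the decoder detects the misprediction and resteers, but by then C has already been fetched. Flushing B from the ICache, as the slide describes, lengthens the time before decode catches the mistake.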



### Phantom

- How bad is it?
- Take Vasilis' class to see just how bad this is!

### 7.1 Breaking kernel image KASLR

We show how we can derandomize kernel image KASLR on AMD microarchitectures with Phantom speculation. We run Linux kernel 5.19 with the latest patches.

```
nop    DWORD PTR [rax+rax*1+0x0]
push   rbp
mov    rbp,rsp
```

Listing 1: We trigger speculation at the nop instruction in \_\_task\_pid\_nr\_ns(). Found at kernel image offset 0xf6520.

### Phantom: Follow Up Thoughts

- Is this actually a new Spectre variant?
- Is this dangerous?
- Is this cool?

## **Concluding Thoughts**

- Research in architecture is super diverse!
- General themes are:
  - 1. Software uses hardware, so hardware should be better!
  - 2. Software uses hardware, so software should be better!
  - 3. Architecture is heterogeneous, can we offload?
  - 4. Can we make components in architecture more efficient?
  - 5. Architecture is broken because it's insecure
- You are now hopefully equipped with the vocabulary to see why these problems are interesting and/or hard!