## **Assessing vector processors**



#### **Vector processors: summary so far**

Place data in vector registers for computation

Run same operation on every element of a vector

Necessary operations:

- Load and store data between memory and vector register
- Set vector length (setvl)
- Computations on vectors (add, multiply, reduce, compare, merge…)
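To make the role of setvl concrete, here is a minimal C sketch of a loop processed one vector-length strip at a time. `MAX_VL`, `set_vl`, and `axpy_strips` are hypothetical names chosen for illustration, and the inner loop stands in for a single vector instruction operating on `vl` elements.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 64  /* assumed hardware maximum vector length (hypothetical) */

/* Hypothetical stand-in for setvl: grant min(remaining, MAX_VL) elements. */
size_t set_vl(size_t remaining) {
    return remaining < MAX_VL ? remaining : MAX_VL;
}

/* y[i] = a * x[i] + y[i], processed one strip of vl elements at a time. */
void axpy_strips(size_t n, double a, const double *x, double *y) {
    assert(x != NULL && y != NULL);
    for (size_t done = 0; done < n; ) {
        size_t vl = set_vl(n - done);       /* set vector length for this strip */
        for (size_t i = 0; i < vl; i++) {   /* models one vector op on vl elements */
            y[done + i] = a * x[done + i] + y[done + i];
        }
        done += vl;
    }
}
```

With `n = 100` and `MAX_VL = 64`, the loop runs two strips: one of 64 elements and one of 36.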

#### **Clarification: mask encoding**

#### 5.3.1. Mask Encoding

*[source](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-mask-encoding)*

Where available, masking is encoded in a single-bit vm field in the instruction (inst[25]).




Vector masking is represented in assembler code as another vector operand, with .t indicating that the operation occurs when v0.mask[i] is 1 (t for "true"). If no masking operand is specified, unmasked vector execution (vm=1) is assumed.




#### **Mask example**

```
for (int i = 0; i < 64; i++) {
    if (x[i] != 0) {
        y[i] = a * x[i];
    }
}

       li s0, a
       vld v1, s1
       vld v2, s2
       vmsne v0, v1, 0           # v0[i] = x[i] != 0 ? 1 : 0
       vmul.vx v1, v1, s0        # x[i] = a * x[i]
       vmerge v2, v2, v1, v0     # y[i] = v0[i] ? x[i] : y[i]
       # or, replacing the two lines above: vmul.vx v2, v1, s0, v0.t
       vst v2, s2
```
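The effect of the mask can be emulated in scalar C. This is only a sketch: `masked_scale` is a hypothetical helper, and the 64-element limit is an assumption taken from the loop bound in the example. The point it illustrates is that masked-off elements keep their old value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Emulates a compare-to-mask followed by a masked multiply:
   y[i] = a * x[i] only where x[i] != 0; other elements of y are untouched. */
void masked_scale(size_t vl, int a, const int *x, int *y) {
    assert(vl <= 64);                  /* assumed maximum vector length */
    bool mask[64];
    for (size_t i = 0; i < vl; i++) {
        mask[i] = (x[i] != 0);         /* build the mask, one bit per element */
    }
    for (size_t i = 0; i < vl; i++) {
        if (mask[i]) {
            y[i] = a * x[i];           /* runs only where the mask is 1 */
        }
    }
}
```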



#### **Simple approach: single pipelined FU**

Upsides:

- Less hardware
- Shorter clock cycle
- One result/cycle

Data within a vector is assumed independent: no hazards

#### **More efficient approach: multiple FUs**




**Why not just have 64 unpipelined FUs?**

#### **Lanes**



#### **? ? ?**

Theory: if we can start a load every cycle, we eventually reach a throughput of one element per cycle.

Practice: how do we start a load every cycle if loads take multiple cycles?

#### **Memory banks**


#### **Complication of memory access**

How do we vectorize this code?

```
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        A[i][j] = 0;
        for (int k = 0; k < 100; k++) {
            A[i][j] += B[i][k] * C[k][j];
        }
    }
}
```


#### **Strided loads/stores**

#### 7.5. Vector Strided Instructions

Vector strided loads and stores
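A strided load can be pictured as reading every `stride`-th element starting from a base address. The sketch below emulates that in C. `strided_load` is a hypothetical helper, and it measures the stride in elements for brevity, whereas the RISC-V V strided instructions take a byte stride.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates a strided vector load: dst[i] = base[i * stride].
   The stride is counted in elements here (an assumption of this sketch). */
void strided_load(double *dst, const double *base, size_t stride, size_t vl) {
    assert(dst != NULL && base != NULL);
    for (size_t i = 0; i < vl; i++) {
        dst[i] = base[i * stride];
    }
}
```

For a row-major 100x100 matrix `C`, column `j` would be fetched as `strided_load(col, &C[0][j], 100, 100)`, since consecutive elements of a column sit one full row apart in memory.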










#### **? ? ?**

How do strided accesses complicate the advantages of memory banks?
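One way to explore this question is to count how many distinct banks a strided access stream actually touches. The sketch below assumes word-interleaved banks, where the bank is the word address modulo the number of banks. That mapping, the 64-bank limit, and the name `banks_touched` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the distinct banks touched by vl accesses at a fixed stride,
   assuming bank = word_address % nbanks (an assumed interleaving scheme). */
size_t banks_touched(size_t stride, size_t nbanks, size_t vl) {
    assert(nbanks > 0 && nbanks <= 64);
    int seen[64] = {0};
    size_t count = 0;
    for (size_t i = 0; i < vl; i++) {
        size_t bank = (i * stride) % nbanks;
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

With 8 banks, a stride of 1 spreads accesses over all 8 banks, a stride of 2 uses only 4, and a stride of 8 sends every access to the same bank.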

#### **Sparse accesses**

Not all vector-like memory accesses use every element

```
for (int i = 0; i < len; i++) {  // len = number of index pairs; m and n are index arrays
    X[m[i]] = X[m[i]] + Y[n[i]];
}
```
Solution: gather-scatter

**Gather**: collect all valid  $X[m[i]]$ ,  $Y[n[i]]$  in smaller vectors

**Scatter**: put the data back into  $X[m[i]]$ ,  $Y[n[i]]$ 

In the RISC-V V extension: indexed loads/stores, plus the vrgather instruction

Also useful for avoiding computations on 0-valued elements (**why? where?**)
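The gather and scatter steps can be sketched in scalar C as follows. The function names are hypothetical. The sketch assumes at most 64 elements per call and that the index array `m` holds no repeated indices, so the scatter writes each destination once.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: dense[i] = src[idx[i]] (models an indexed load). */
void gather(double *dense, const double *src, const size_t *idx, size_t vl) {
    assert(dense != src);
    for (size_t i = 0; i < vl; i++) dense[i] = src[idx[i]];
}

/* Scatter: dst[idx[i]] = dense[i] (models an indexed store). */
void scatter(double *dst, const double *dense, const size_t *idx, size_t vl) {
    assert(dst != dense);
    for (size_t i = 0; i < vl; i++) dst[idx[i]] = dense[i];
}

/* X[m[i]] += Y[n[i]] as gather, dense add, scatter. */
void sparse_add(double *X, const double *Y,
                const size_t *m, const size_t *n, size_t vl) {
    assert(vl <= 64);              /* assumed maximum vector length */
    double vx[64], vy[64];
    gather(vx, X, m, vl);          /* collect X[m[i]] into a dense vector */
    gather(vy, Y, n, vl);          /* collect Y[n[i]] into a dense vector */
    for (size_t i = 0; i < vl; i++) vx[i] += vy[i];   /* dense vector add */
    scatter(X, vx, m, vl);         /* write the results back to X[m[i]] */
}
```

Only `X` is scattered back in this sketch, because `Y` is read but never modified by the loop.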

#### **Measuring performance: chaining**

What does it look like to execute this code with one load/store unit and one ALU/mul unit?

```
vld v0, s1
vmul.vx v1, v0, t0 
vld v2, s2
vadd.vv v3, v1, v2
vst v3, t1
```




#### **Measuring performance: convoys and chimes**

Convoy: set of vector instructions that can potentially execute together

Chime: time it takes to execute a convoy

- Convoy 1: `vld v0, s1`; `vmul.vx v1, v0, t0`
- Convoy 2: `vld v2, s2`; `vadd.vv v3, v1, v2`
- Convoy 3: `vst v3, t1`

**Approximation of runtime for this vector machine: 3 chimes (~3 \* vlen clock cycles).**

**What complicates this metric?**
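The chime count can be reproduced with a small greedy grouping routine. This is a sketch under two assumptions: chaining lets dependent instructions share a convoy, so only structural hazards (running out of a functional unit) start a new one, and instructions are grouped in program order. The names `unit` and `count_convoys` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum unit { LSU, ALU, NUM_UNITS };   /* load/store unit, ALU/mul unit */

/* Greedy convoy count: start a new convoy whenever the next instruction
   needs a unit type that is already fully occupied in the current convoy. */
size_t count_convoys(const enum unit *instrs, size_t n,
                     const int *units_available) {
    size_t convoys = 0;
    int used[NUM_UNITS] = {0};
    for (size_t i = 0; i < n; i++) {
        enum unit u = instrs[i];
        assert(u < NUM_UNITS && units_available[u] > 0);
        if (convoys == 0 || used[u] == units_available[u]) {
            convoys++;                                   /* open a new convoy */
            for (int k = 0; k < NUM_UNITS; k++) used[k] = 0;
        }
        used[u]++;
    }
    return convoys;
}
```

For the five-instruction sequence (load, multiply, load, add, store) with one unit of each type, this returns 3, so the estimate is 3 \* vlen cycles. Giving the machine a second load/store unit drops the count to 2.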

#### **Startup, dead time**




#### **Amdahl's law**

Used to assess theoretical effectiveness of speedup

In a nutshell: gains in speeding up a portion of a program are limited by the fraction of time that portion is actually used

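Mathematically, where p is the fraction of execution time that benefits from the improvement and s is the speedup of that portion: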

$$
S_{\text{latency}}(s) = \frac{1}{(1-p) + \frac{p}{s}}
$$

For parallelization: serial bottleneck (non-parallelizable code) limits effectiveness of vector processors
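To get a feel for how quickly the serial portion dominates, the formula can be evaluated directly. `amdahl_speedup` is a hypothetical helper name, with `p` the fraction of execution time that is sped up and `s` the speedup of that fraction.

```c
#include <assert.h>

/* Overall speedup when a fraction p of execution time is sped up by a factor s. */
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, with 90% of the runtime vectorizable and a 64x speedup on that portion, the overall speedup is only about 8.8, and no value of s can push it past 1/(1-p) = 10.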

#### **Compiler effectiveness**



Figure G.9 Result of applying vectorizing compilers to the 100 FORTRAN test kernels. For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the Cray X-MP show the large dependence on compiler technology.

#### **Cray-1 Architecture**



*[image source](https://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html)*
