# SIMD in modern computers



### Measuring performance: convoys and chimes

Convoy: set of vector instructions that can potentially execute together

Chime: time it takes to execute a convoy

Approximation of runtime for this vector machine: 3 chimes (~3 \* vlen/lanes clock cycles)

What complicates this metric?
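The chime approximation can be sketched in a few lines. This is a minimal illustration, not a performance model: the function name `estimate_cycles` and the example machine parameters (64-element vectors, 4 lanes) are assumptions chosen for illustration. It deliberately ignores effects such as pipeline start-up overhead and memory stalls.

```python
import math


def estimate_cycles(num_convoys: int, vlen: int, lanes: int) -> int:
    """Rough runtime estimate: one chime per convoy.

    Each chime is approximated as ceil(vlen / lanes) clock cycles,
    i.e. the time to push one full vector through the lanes.
    """
    cycles_per_chime = math.ceil(vlen / lanes)
    return num_convoys * cycles_per_chime


if __name__ == "__main__":
    # Hypothetical machine: 3 convoys, 64-element vectors, 4 lanes.
    print(estimate_cycles(3, 64, 4))  # 3 chimes * 16 cycles = 48
```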

### Measuring performance: chaining



# Cray-1 Architecture (1976)




# Cray X1 architecture (2003)

ISA designed from scratch

Multi-stream processor consisting of four single-stream processors

Each SSP has: scalar unit/scalar cache, 2-lane vector unit

Connected to external caches (mostly for scalars, but can be used by vectors for programs w/ high temporal locality, or bypassed)

Each MSP can have up to 2048 outstanding memory requests



### Cray X1 nodes (H&P fig. G.12)



NUMA: non-uniform memory access

Ecaches only cache their local memory

Writes to remote memory invalidate corresponding ecache data

# NEC SX Aurora (2018)


Shared LLC (last-level cache) with 128 banks

Uses high-bandwidth memory (HBM)



# ???

What are the limits of vector processors?

### **VLIW** paper

Fisher, Joseph A. "Very long instruction word architectures and the ELI-512." Proceedings of the 10th Annual International Symposium on Computer Architecture. 1983.

### WHY NOT VECTOR MACHINES?

Vector machines seem to offer much more parallelism than the factor of 2 or 3 that current VLIWs offer. Although vector machines have their place, we don't believe they have much chance of success on general-purpose scientific code. They are crucifyingly difficult to program, and they speed up only inner loops, not the rest of the code.

And vectorizing works only on inner loops; the rest of the code gets no speedup whatsoever. Even if 90% of the code were in inner loops, the other 10% would run at the same speed as on a sequential machine. Even if you could get the 90% to run in zero time, the other 10% would limit the speedup to a factor of 10.

### Amdahl's law

Used to assess theoretical effectiveness of speedup

In a nutshell: gains in speeding up a portion of a program are limited by the fraction of time that portion is actually used

Mathematically:

$$S_{\text{latency}}(s) = \frac{1}{(1-p)+\frac{p}{s}}$$

For parallelization: serial bottleneck (non-parallelizable code) limits effectiveness of vector processors
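As a small worked example of the formula, the sketch below evaluates it directly. Here `p` is the fraction of execution time that benefits from the speedup and `s` is the speedup of that fraction; the function name `amdahl_speedup` is an assumption made for illustration. The 90%/10% case mirrors the numbers in the Fisher quote above.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    # 90% of the time is vectorizable and runs 8x faster (hypothetical).
    print(f"{amdahl_speedup(0.9, 8):.2f}")  # about 4.71

    # Even with an infinitely fast vector unit, the serial 10% caps
    # the overall speedup at roughly 10x.
    print(f"{amdahl_speedup(0.9, float('inf')):.2f}")
```

Note how far the 8x case falls short of 8x overall: the serial fraction dominates quickly as `s` grows.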

### Recent supercomputer headlines

- "New York DFS to acquire supercomputer to understand and regulate AI" (July 05, 2023)
- "The IRS is buying an AI supercomputer from NVIDIA"
- "NOAA completes upgrade to weather and climate supercomputer system"

(also headlines about India, Japan, China, Germany, Brazil, ... in the last year)




### OK, fine... but what about DLP for the rest of us?




### SIMD for multimedia

RGBA images: 8 bits/channel (32 bits total)

Audio: 8, 16, 24, or 32 bits per sample

Simplifications of SIMD for multimedia: might not need strided access, gather/scatter, masked operations, custom vector length

 $\rightarrow$  Doesn't typically make sense to put a powerful VPU on a processor

Enter multimedia SIMD extensions

How can smaller data widths make SIMD functionality easier to add to CPUs?

### **RISC-V P: packed SIMD**

(Doesn't actually exist, but the letter "P" is reserved for such a thing)

Reuses floating-point registers

Packs multiple values in one register based on configuration

Ex: 64-bit register can hold 8 8-bit values, 4 16-bit values, 2 32-bit values, or 1 64-bit value

Requires special load/store operations

Hardware support for parallel operation on each value in register
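To make the packing idea concrete, here is a minimal software emulation of a lane-wise add on a 64-bit register. It is only a sketch: the helper names (`pack_lanes`, `unpack_lanes`, `packed_add`) are hypothetical, lanes are treated as unsigned with wraparound, and real packed-SIMD hardware does this in a single instruction by cutting the carry chain at lane boundaries rather than with the masking trick used here.

```python
MASK64 = (1 << 64) - 1


def pack_lanes(values, lane_bits):
    """Pack unsigned values into one 64-bit integer, lane 0 in the low bits."""
    lane_mask = (1 << lane_bits) - 1
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & lane_mask) << (i * lane_bits)
    return reg


def unpack_lanes(reg, lane_bits):
    """Split a 64-bit integer back into its lanes, lane 0 first."""
    lane_mask = (1 << lane_bits) - 1
    return [(reg >> shift) & lane_mask for shift in range(0, 64, lane_bits)]


def packed_add(a, b, lane_bits):
    """Lane-wise wraparound add: carries never cross a lane boundary."""
    # Mask selecting the top bit of every lane.
    high = 0
    for shift in range(0, 64, lane_bits):
        high |= 1 << (shift + lane_bits - 1)
    low = ~high & MASK64
    # Add the low bits of each lane (this cannot overflow out of a lane),
    # then fold in the top bits with XOR so no carry leaves the lane.
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)


if __name__ == "__main__":
    a = pack_lanes([250, 1, 2, 3, 4, 5, 6, 7], 8)
    b = pack_lanes([10] * 8, 8)
    # Lane 0 wraps (250 + 10 = 260 -> 4) without disturbing lane 1.
    print(unpack_lanes(packed_add(a, b, 8), 8))
```

The same `packed_add` works for 16-, 32-, or 64-bit lanes by changing `lane_bits`, which mirrors the "based on configuration" point above.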

### ARMv6 SIMD

Packs multiple 16- or 8-bit values into 32-bit registers



### Use of ARM NEON

Compilers are sometimes hit-or-miss when figuring out if they can vectorize code

Multimedia applications: people can use libraries

To get more flexibility than a library, ARM provides intrinsics





## x86: MMX, SSE, AVX

MMX: not an acronym, packs values in 64-bit registers, supports integer operations only

SSE: "Streaming SIMD Extensions", 128-bit registers, allows for floating point

AVX: "Advanced Vector Extensions", 256-bit registers holding 8x32-bit or 4x64-bit values (AVX2 adds gather, AVX-512 supports 512-bit registers)

In typical x86 fashion, operand size is fixed in the opcode (so there are hundreds of instructions for each extension)

### **x86 AVX-512 VNNI**

Vector Neural Network Instructions

Useful for CNNs (Convolutional Neural Networks)

Figure: each output lane computes $A_0 B_0 + A_1 B_1 + C_0$ (multiply pairs of inputs, then accumulate)
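The A0B0 + A1B1 + C0 pattern on this slide is a fused multiply-and-accumulate over adjacent input pairs. The sketch below emulates that per-lane arithmetic with plain Python integers. The function name `dot_accumulate` is hypothetical, and the sketch assumes two inputs per accumulator lane as on the slide; it does not model the fixed lane widths, overflow, or saturation behavior of any real instruction.

```python
def dot_accumulate(a, b, c):
    """Per-lane multiply-accumulate over adjacent input pairs.

    For each accumulator lane i, returns
        c[i] + a[2*i] * b[2*i] + a[2*i + 1] * b[2*i + 1]
    """
    if len(a) != len(b) or len(a) != 2 * len(c):
        raise ValueError("a and b need two inputs per accumulator lane")
    return [
        c[i] + a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1]
        for i in range(len(c))
    ]


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [5, 6, 7, 8]
    c = [100, 200]
    # Lane 0: 100 + 1*5 + 2*6 = 117; lane 1: 200 + 3*7 + 4*8 = 253
    print(dot_accumulate(a, b, c))
```

Folding the multiply, the pairwise add, and the accumulate into one step is what makes this shape useful for the dot products inside convolution layers.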



## From the Intel optimization manual

- P 5-11 (193): Converting to SIMD chart
- P 8-9 (287): Blocking (handling large matrices)
- P 14-2 (390): PCMPxSTRy (see also 14-12 onward)
- P 15-7 (445): Mixing SSE and AVX (YMM register)
- P 15-20 (458): Data alignment and caches
- P 15-24 (462): Masked loads and paging




