## **GPU** memory

Final project abstract + hypothesis instructions posted (linked right above schedule)





#### **DRAM and GDDR**





(Not pictured: further organized into banks) Activating a row takes several cycles After row is activated, data can be read

#### Latency/locality tradeoff:

Controller waits for enough requests to a single row before servicing all of them at once

Generally, GDDR (graphics DDR) variants have higher latency and higher bandwidth

#### memory: matrix multiply GPU



multiple columns/rows may not fit in shared memory column doesn't play well with cache lines

every thread here will access same column ...but all the rest will access different columns

## Tiling

(full code, including loading from CPU memory to device memory, in image source link)

```
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
```

```
// for-loop on m (number of tiles to load):
Matrix Asub = GetSubMatrix(A, blockRow, m);
Matrix Bsub = GetSubMatrix(B, m, blockCol);
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
shared_float Bs[BLOCK_SIZE][BLOCK_SIZE];
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
```

Each thread does two loads/stores here Can hardware design make this more efficient?



#### Shared memory as cache

// For CSR row multiplication example, adapted from P&H fig. B.8.5

```
__shared__ float cache [blocksize];
unsigned int block_begin = blockIdx.x * blockDim.x;
unsigned int block_end = block_begin + blockDim.x;
unsigned int row = block_begin + threadIdx.x;
if(row<num_rows) cache [threadIdx.x] = x[row];
syncthreads();
```

```
// when reading each x_j
if (j >= block_begin j < block_end)
    x_j = cache[j-block_begin];
else
    x j = x[j];</pre>
```

Not guaranteed locality (as we were in matrix multiplication), but increases performance as long as multiplying row *i* accesses values of x near *x*[*i*]

#### **Coalesced memory access**

Coalescing unit detects if accesses from same warp are in adjacent addresses and performs single, wide access (reduces uses of DRAM line)

Works for both global memory and local memory!

Important for programmer to be mindful of memory indexing

Example: avoid having each thread do its own malloc (source)

```
__shared__ int* data;
if (threadIdx.x == 0) {
    size_t size = blockDim.x * 64;
    data = (int*)malloc(size);
}
__syncthreads();
```

now adjacent threads can do
adjacent accesses into data,
 eg data[threadIdx.x]!

### Shared memory banks

Shared memory is banked (32 banks for 32 threads/warp; successive words in successive banks)

*really* fast as long as no bank conflicts (have to do an extra round of accesses for every conflict – can significantly slow down warp)

Which code is better for working with data of length n = 2 \* blocksize when i = threadIdx.x?

A[i \* 2] = A[i \* 2] + B[i \* 2] // 0, 2, 4, 6... 2n - 2 A[i \* 2 + 1] = A[i \* 2 + 1] + B[i \* 2 + 1] // 1, 3, 5, 7... 2n - 1

#### VS

A[i] = A[i] + B[i] // 0, 1, 2, 3, ... n - 1A[n + i] = A[n + i] + B[n + 1] // n, n + 1, n + 2, ... 2n - 1

#### What's wrong with our CSR mult?

```
void csrMult(int n, int* Rp, int* C, float* V, float* x, float* y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < n) {
        int rBeg = Rp[r];
        int rSize = Rp[r + 1] - Rp[r];
        float sum = 0
        for (int i = 0; i < rSize; i++) {</pre>
            sum += V[rBeg + i] * x[C[rBeg + i]];
        ş
        y[r] = sum;
    }
z
```

#### Solution: pad and transpose



Image source: Kirk, David B., and W. Hwu Wen-Mei. *Programming massively parallel processors: a hands-on approach*. Morgan kaufmann, 2016., figs 10.8 and 10.9 <u>Brown library access</u>

|        |          |       |           |   |          |      |     |   |   |   | Values |   |   | Columns |   |  |  |
|--------|----------|-------|-----------|---|----------|------|-----|---|---|---|--------|---|---|---------|---|--|--|
|        | Thread 1 | N     | Thread 3  |   | Т        | hrea | d 0 |   | 3 | 1 | *      |   | 0 | 2       | * |  |  |
| Thread |          |       |           |   | Т        | hrea | d 1 |   | * | * | *      |   | * | *       | * |  |  |
| ad 0   |          |       |           |   | Thread 2 |      |     |   | 2 | 4 | 1      |   | 1 | 2       | 3 |  |  |
|        |          |       |           |   | Thread 3 |      |     |   | 1 | 1 | *      |   | 0 | 3       | * |  |  |
| ``     | Ite      | ratio | n U∖<br>¥ |   |          |      |     |   |   |   |        | ) |   |         | ) |  |  |
| Data   | 3        | *     | 2         | 1 | 1        | *    | 4   | 1 | * | * | 1      | * | ] |         |   |  |  |
| г      |          |       |           |   |          |      |     |   |   |   |        |   | ٦ |         |   |  |  |
| Index  | 0        | *     | 1         | 0 | 2        | *    | 2   | 3 | * | * | 3      | * |   |         |   |  |  |

Padding: allows for avoiding control flow divergence Transpose: allows for coalescing In general will run faster, despite extraneous multiplies by 0

### Inclusive scan

Also called cumulative sum, prefix sum

Used for load-balancing algorithms, polynomial evaluation, etc turns  $[x_0, x_1, x_2, ..., x_k]$  into  $[x_0, (x_0 \oplus x_1), (x_0 \oplus x_1 \oplus x_2), ..., (x_0 \oplus x_1 \oplus x_2 \oplus ... \oplus x_k)]$ e.g. [1, 3, 0, 7] turns into [1, 4, 4, 11]

Sequentially:

```
y[0] = x[0]
for (int i = 1; i <= k; i++) {
    y[i] = y[i - 1] + x[i];
}</pre>
```



#### Parallel scan

(Other, efficient algorithms exist – see Kirk and Hwu Chapter 8)



# ???

 \_\_syncthreads() synchronizes all threads within a block.
 How can threads in different blocks safely communicate with each other?

(Answer: atomic global memory accesses)

#### **CPU/GPU** communication

Must copy between CPU and GPU memory before/after launching kernel

#### Full code

```
int main() // on host
Ł
    // Allocate input vectors h_A and h_B in host memory
    float* h A = (float*)malloc(size);
    // Initialize input vectors
    . . .
    // Allocate vectors in device memory
    float* d A;
    cudaMalloc(&d_A, size);
    // Copy vectors from host memory to device memory
   cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
```

## **Unified** memory

GPU can access CPU's memory

Traditionally used with integrated graphics (GPU on same SoC as CPU)

CUDA example





## HW/SW interface

Vector processors and SIMD multimedia modifies the ISA to support DLP. BUT:

- Supports only modest levels of parallelism (for SIMD extensions)
- Requires changes to ISA
- Requires compiler that can effectively vectorize code (or a skilled programmer)

SPMD model/GPUs: Allows programmer to write a kernel, which the hardware schedules on many threads. BUT:

- Proper performance requires proper understanding of architecture (branching/control divergence, memory access, synchronization)
- Requires interaction of CPU/GPU (might be a pro or a con)

#### Bonus: tensor cores

Circuits on GPUs optimized for AI

Uses mixed-precision to speed up math, reduce memory demands

Claim ~8x performance gains over SM matrix multiply

