### COMP 633 - Parallel Computing

Lecture 6 Tue Sep 7, 2021

## SMM (1) Memory Hierarchies and Shared Memory



## Topics

- Memory systems
  - organization
  - caches and the memory hierarchy
  - influence of the memory hierarchy on algorithms
- Shared memory systems
  - Taxonomy of actual shared memory systems
    - UMA, NUMA, cc-NUMA

## **Recall PRAM shared memory system**

#### PRAM model

- assumes access latency is constant, regardless of value of *p* or the size of memory
- simultaneous reads permitted under CR model and simultaneous writes permitted under CW model
- Physically impossible to realize
  - processors and memory occupy physical space
    - speed of light limitations

$$L = \Omega\left((p+m)^{1/3}\right)$$

- CR / CW must be reduced to ER / EW
  - requires  $\Omega(\lg p)$  time in general case
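The logarithmic cost can be illustrated with a broadcast. Under CR all p processors read one cell in a single step. Under ER each cell may be read by only one processor per step, so the number of copies can at most double per round. The sketch below simulates this recursive doubling sequentially; the function name `erew_broadcast` is hypothetical and is not part of the lecture material.

```python
def erew_broadcast(value, p):
    """Simulate broadcasting `value` to p processors using exclusive reads.

    In each round every processor that already holds the value copies it
    to one processor that does not, so no cell is read twice in a round.
    Returns (cells, rounds).
    """
    cells = [None] * p
    cells[0] = value              # processor 0 starts with the value
    have = 1                      # number of processors holding the value
    rounds = 0
    while have < p:
        # processor i (i < have) feeds processor i + have:
        # each source cell is read once, each target cell is written once
        for i in range(min(have, p - have)):
            cells[i + have] = cells[i]
        have = min(2 * have, p)
        rounds += 1
    return cells, rounds
```

For p = 1024 the simulation takes 10 rounds, matching lg p.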



## Anatomy of a processor ↔ memory system

- Performance parameters of Random Access Memory (RAM)
  - latency L
    - elapsed time from presentation of memory address to arrival of data
      - address transit time
      - memory access time  $t_{mem}$
      - data transit time
  - bandwidth W
    - number of values (e.g. 64 bit words) delivered to processor per unit time
      - simple implementation W ~ 1/L



### **Processor vs. memory performance**

- The memory "wall"
  - Processors compute faster than memory delivers data
    - increasing imbalance  $t_{arith} \ll t_{mem}$



## Improving memory system performance

- Decrease latency L to memory
  - speed of light is a limiting factor
    - bring memory closer to processor
- Decrease memory access time by using 2D memory layout
  - access time  $\propto$  s<sup>1/2</sup> for a memory of size s (VLSI)
- Use different memory technologies
  - DRAM (Dynamic RAM) 1 transistor per stored bit
    - High density, low power, low cost, but long access time
  - SRAM (Static RAM) 6 transistors per stored bit
    - Short access time, but low density, high power, and high cost.



# Improving memory system performance (1)

- Decrease latency using cache memory
  - low latency access to frequently used values, high latency for the remaining values



#### Example

- 90% of references are to cache with latency L<sub>1</sub>
- 10% of references are to memory with latency L<sub>2</sub>
- average latency is 0.9L<sub>1</sub> + 0.1L<sub>2</sub>
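The weighted average above can be evaluated directly. This is a minimal sketch; the function name `average_latency` and the sample latencies (2 and 60 cycles) are illustrative assumptions, not measurements.

```python
def average_latency(hit_rate, l_cache, l_mem):
    """Average latency when a fraction `hit_rate` of references hit the cache.

    l_cache and l_mem are the cache and memory latencies (L1 and L2 in the
    example), in the same time unit.
    """
    return hit_rate * l_cache + (1.0 - hit_rate) * l_mem


# Illustrative values only: 90% hits at 2 cycles, 10% misses at 60 cycles.
print(average_latency(0.9, 2, 60))   # about 7.8 cycles
```

Even with a 90% hit rate, the slow references dominate the average in this example (6 of the roughly 7.8 cycles).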

# Improving memory system performance (2)

#### Increase bandwidth W

- multiport (parallel access) memory
  - multiple reads, multiple exclusive writes per memory cycle
    - High cost, very limited scalability




- "blocked" memory
  - memory supplies block of size b containing requested word
    - supports *spatial locality* in cache access



# **Improving memory system performance (2)**

- Increase bandwidth W (contd)
  - pipeline memory requests
    - requires independent memory references



- interleave memory
  - problem: memory access is limited by t<sub>mem</sub>
  - use m separate memories (or memory banks)
  - W ~ m / L if references *distribute* over memory banks
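Whether references distribute over the banks depends on the access pattern. The sketch below assumes low-order interleaving, where word address a lives in bank a mod m. That mapping is an assumption for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def bank_histogram(addresses, m):
    """Count references per bank, assuming bank = address mod m."""
    return Counter(a % m for a in addresses)


def max_bank_load(addresses, m):
    """Largest number of references landing on a single bank."""
    return max(bank_histogram(addresses, m).values())


m = 8
unit_stride = range(0, 64)            # 64 consecutive words
bank_stride = range(0, 64 * m, m)     # 64 words, stride equal to m

print(max_bank_load(unit_stride, m))  # 8: evenly spread over the 8 banks
print(max_bank_load(bank_stride, m))  # 64: every reference hits bank 0
```

With unit stride each bank serves 1/m of the references, so W ~ m / L is attainable. With stride m all references serialize on one bank and the bandwidth falls back to that of a single memory.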



# Latency hiding

- Amortize latency using a pipelined interleaved memory system
  - k independent references in  $\Omega(L + k \cdot t_{proc})$  time
    - O(L/k) amortized (expected) latency per reference
- Where do we get independent references?
  - out-of-order execution of independent load/store operations
    - found in most modern performance-oriented processors
    - partial latency hiding:  $k \sim 2$ to $10$  references outstanding
  - vector load/store operations
    - small vector units (AVX512)
      - vector length 2-8 words (Intel Xeon)
      - partial latency hiding
    - high-performance vector units (NEC SX-9, SX-Aurora)
      - vector length k = L /  $t_{proc}$  (128-256 words)
      - crossbar network to highly interleaved memory (~ 16,000 banks)
      - full latency hiding: amortized memory access at processor speed
  - multithreaded operation
    - independent execution threads with individual hardware contexts
      - partial latency hiding: 2-way hyperthreading (Intel)
      - full latency hiding: 128-way threading with high-performance memory (Cray MTA)
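The difference between partial and full latency hiding can be seen by evaluating the pipelined cost for different k. The sketch treats the bound $L + k \cdot t_{proc}$ as an exact cost, which is a simplifying assumption. The function name and the values L = 128 and t_proc = 1 are illustrative.

```python
def amortized_latency(k, L, t_proc):
    """Per-reference latency when k independent references are pipelined.

    Assumes the k references complete in exactly L + k * t_proc time.
    """
    return (L + k * t_proc) / k


L, t_proc = 128, 1   # illustrative values, in cycles
for k in (1, 2, 10, 128):
    print(k, amortized_latency(k, L, t_proc))
```

With k = 1 the full latency is exposed (129 cycles here). With k between 2 and 10 it is only partly hidden. With k = L / t_proc the amortized cost drops to 2 t_proc, which is within a constant factor of processor speed.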

# Implementing the PRAM

- How close can we come to O(1) latency PRAM memory in practice?



#### examples

- NYU Ultracomputer (1987), IBM RP3 (1991), SBPRAM (1999)
  - logarithmic depth combining network eliminates memory contention time for CR, CW
    - »  $\Omega(\lg p)$  latency in network is prohibitive

## Implementing PRAM – a compromise

- Using latency hiding with a high-performance memory system
  - implements  $p \cdot k$  processor EREW PRAM slowed down by a factor of k
    - use  $m \ge p \cdot (t_{mem} / t_{proc})$  memory banks to match memory reference rate of p processors
    - total latency 2L for  $k = L / t_{proc}$  independent random references at each processor
    - O(t<sub>proc</sub>) amortized latency per reference at each processor
  - unit latency degrades in the presence of concurrent reads/writes
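The relations above can be collected into a small calculator. This is a sketch under the assumption that all quantities are positive integers in the same time unit. The function name `pram_emulation` and the sample machine parameters are hypothetical.

```python
def pram_emulation(p, L, t_mem, t_proc):
    """Parameters for emulating an EREW PRAM by latency hiding.

    p      : number of physical processors
    L      : one-way memory latency
    t_mem  : memory (bank) access time
    t_proc : time between successive references issued by one processor
    """
    def ceil_div(a, b):
        return -(-a // b)

    k = ceil_div(L, t_proc)            # independent references per processor
    m = ceil_div(p * t_mem, t_proc)    # banks needed to match reference rate
    return {
        "k": k,
        "banks": m,
        "virtual_processors": p * k,   # size of the emulated EREW PRAM
        "slowdown": k,
        "batch_latency": 2 * L,        # total latency for k references
        "amortized_latency": 2 * L / k,
    }


# Illustrative machine: 16 processors, L = 128, t_mem = 32, t_proc = 1 cycle.
print(pram_emulation(16, 128, 32, 1))
```

For these sample values the emulation needs 512 banks and 128 outstanding references per processor, which gives a sense of why the approach is expensive.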



Bottom line: doable but very expensive and only limited scaling in p



### Memory systems summary

#### Memory performance

- Latency is limited by physics
- Bandwidth is limited by cost
- Cache memory: low latency access to some values
  - caching frequently used values
    - rewards *temporal locality* of reference
  - caching consecutive values
    - rewards spatial locality of reference
  - decrease average latency
    - 90% fast references, 10% slow references: effective latency =  $0.9L_1 + 0.1L_2$

#### Parallel memories

- 100 independent references ≈ 100 fast references
- relatively expensive
- requires parallel processing

### Simple uniprocessor memory hierarchy

- Each component is characterized by
  - capacity
  - block size
  - (associativity)
- Traffic between components is characterized by
  - access latency
  - transfer rate (bandwidth)
- Example:
  - IBM RS6000/320H (ca. 1991)

| Storage<br><u>component</u> | Latency<br>(cycles) | Transfer Rate<br>(words [8B] / cycle) |
|-----------------------------|---------------------|---------------------------------------|
| Disk                        | 1,000,000           | 0.001                                 |
| Main memory                 | 60                  | 0.1                                   |
| Cache                       | 2                   | 1                                     |
| Registers                   | 0                   | 3                                     |



## **Cache operation**

- ABC cache parameters
  - associativity
  - block size
  - capacity
- CCC performance model
  - cache misses can be
    - compulsory
    - capacity
    - conflict
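The ABC parameters and the three miss classes can be tied together in a small simulator. The sketch assumes LRU replacement and the common classification rule: a first touch of a block is compulsory, a miss that a fully associative cache of the same capacity would also take is a capacity miss, and any other miss is a conflict miss. The class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set holding up to `ways` blocks with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()

    def access(self, block):
        """Return True on a hit; on a miss insert block, evicting the LRU one."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)
        self.blocks[block] = True
        return False


def classify_misses(addresses, associativity, block_size, capacity):
    """Count hits and compulsory/capacity/conflict misses for byte addresses."""
    num_blocks = capacity // block_size
    num_sets = num_blocks // associativity
    sets = [LRUSet(associativity) for _ in range(num_sets)]
    fully_assoc = LRUSet(num_blocks)    # reference cache of equal capacity
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        block = addr // block_size
        hit = sets[block % num_sets].access(block)
        fa_hit = fully_assoc.access(block)
        if hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(block)
    return counts


# Direct-mapped, 64-byte blocks, 256-byte capacity: addresses 0 and 256 collide.
print(classify_misses([0, 256, 0, 256], 1, 64, 256))
```

In the example the two blocks map to the same set, so after the two compulsory misses every access is a conflict miss even though the cache is mostly empty.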



### **Cache operation: read**

Example parameters: associativity = 256-way, block size = 64 bytes (512 bits)







### **Computational Intensity: a key metric limiting performance**

- Computational intensity of a problem
  - $I = \dfrac{\text{total number of arithmetic operations required, in flops}}{\text{size of input} + \text{size of result, in 64-bit words}}$

- BLAS Basic Linear Algebra Subroutines
  - Asymptotic performance limited by computational intensity

- Notation: $A, B, C \in \Re^{n \times n}$,  $x, y \in \Re^n$,  $a \in \Re$

|        | name                           | defn                           | flops                                 | refs                                      | I          |
|--------|--------------------------------|--------------------------------|---------------------------------------|-------------------------------------------|------------|
| BLAS 1 | scale                          | y = ax                         | n                                     | 2n                                        | 0.5        |
|        | triad                          | y = ax + y                     | 2n                                    | 3n                                        | 0.67       |
|        | dot product                    | x∙y                            | 2n                                    | 2n                                        | 1          |
| BLAS 2 | matrix-vector                  | y = y + Ax                     | 2n<sup>2</sup>+n                      | n<sup>2</sup>+3n                          | ~ 2        |
|        | rank-1 update                  | $A = A + xy^{T}$               | 2n<sup>2</sup>                        | 2n<sup>2</sup>+2n                         | ~ 1        |
| BLAS 3 | Matrix product                 | C = C + AB                     | 2n <sup>3</sup>                       | <b>4</b> n <sup>2</sup>                   | n/2        |
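The intensities in the table follow from dividing the flop counts by the reference counts. The sketch below evaluates them for a concrete n, using the counts exactly as tabulated. The function name `blas_intensity` is hypothetical.

```python
def blas_intensity(n):
    """Computational intensity (flops per 64-bit word) for each operation."""
    ops = {
        # name: (flops, refs), as listed in the table
        "scale": (n, 2 * n),
        "triad": (2 * n, 3 * n),
        "dot product": (2 * n, 2 * n),
        "matrix-vector": (2 * n * n + n, n * n + 3 * n),
        "rank-1 update": (2 * n * n, 2 * n * n + 2 * n),
        "matrix product": (2 * n ** 3, 4 * n * n),
    }
    return {name: flops / refs for name, (flops, refs) in ops.items()}


for name, intensity in blas_intensity(1000).items():
    print(f"{name:15s} {intensity:8.2f}")
```

Only the BLAS 3 matrix product has an intensity that grows with n. The BLAS 1 and BLAS 2 operations stay at a small constant, so their performance is bounded by memory bandwidth regardless of problem size.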

# Effect of the memory hierarchy on execution time



Performance of naive  $N \times N$  matrix multiply on an IBM RS6000/320 uniprocessor. Time in clock cycles per multiply-add (note  $\log_{10}$  scales). Source: Alpern *et al.*, "The Uniform Memory Hierarchy Model of Computation", *Algorithmica*, 1994
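For reference, the naive algorithm measured in the figure can be written as the triple loop below. The tiled variant is a standard restructuring that works on b × b sub-blocks so that each block can stay in a fast level of the hierarchy. It is included as an illustrative assumption about how an algorithm can adapt to the hierarchy, not as the method used in the cited paper, and the block size b is a hypothetical tuning parameter.

```python
def matmul_naive(A, B, n):
    """C = A * B for n x n matrices stored as lists of rows."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_tiled(A, B, n, b):
    """Same product, computed block by block with block size b."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # multiply one b x b block of A by one b x b block of B
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Both versions perform the same 2n<sup>3</sup> flops. The tiled version reuses each loaded block about b times, which raises the effective computational intensity seen by the slower memory levels.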

## Shared memory taxonomy

- Uniform Memory Access (UMA)
  - Processors and memory separated by network
  - All memory references cross network
  - Only practical for machines with full latency hiding
    - Parallel vector processors, multi-threaded processors
    - Expensive, rarely available in practice




# Shared memory taxonomy

- Non-Uniform Memory Access (NUMA)
  - Memory is partitioned across processors
  - References are local or non-local
    - Local references
      - low latency
    - Non-local references
      - high latency
    - ratio of non-local to local latency
      - large
  - Examples
    - BBN TC2000 (1989)



- Poor performance unless extreme care is taken in data placement

# **Combining (N)UMA with cache memories**

#### Processor-local caches

- Cache all memory references
- Must reflect changes in value due to other processors in system
- Cache-misses
  - Usual: compulsory, capacity, and conflict misses
  - New: coherence misses

#### Cache-coherent UMA examples

- Conventional PC-based SMP systems
  - Network is a shared bus
  - Limited scaling ( $p \le 4$ )
  - mostly extinct
- Server-class machines
  - Dual or Quad socket (single card)
  - Intel Xeon or AMD EPYC ( $20 \le p \le 128$ )
  - prevalent
- Cache-coherent NUMA examples
  - scales to larger processor count
    - SGI UltraViolet (p ~ 1024)
    - rare


# Incorporating shared memory in the hierarchy

- Non-local shared memory
  - can be viewed as additional level in processor-memory hierarchy
- Shared-memory parallel programming
  - extension of memory hierarchy techniques
  - goal:
    - concurrent transfer through parallel levels




### Modern shared-memory server: Intel Xeon series



COMP 633 - Prins

# **AMD** Infinity

- Speed of light inconveniently slow!
  - miniaturize size of memory and processors
- Single card server
  - 7 nm process technology
  - 64-256 cores total
  - 4 TB memory

