# Hardware Compute Partitioning on NVIDIA GPUs

Joshua Bakita and James H. Anderson

Department of Computer Science, University of North Carolina, Chapel Hill



# How can we do more, with less?

#### How can we do more, with less, on the CPU?



#### How can we do more, with less, on the GPU?

**Compute Units**

→ No hardware partitioning available

**Memory Caches and Interconnects**

→ Concurrency, with hardware partitioning [1]



#### Why concurrency on-GPU?

#### Some assumptions worth revisiting...



#### Assumption: No Capacity To Reclaim

#### Assumption: Interference Worse than On-CPU







With key insights drawn from **GPU architectural norms** and **native GPU scheduling systems**, we achieve all three for *any* NVIDIA GPU from the past 10 years.

Closest prior work: AMD Compute Unit Masking [14, 15]

# Enabling <u>Hardware-Enforced</u> Compute Partitioning

Goal 1 of 3

### Hardware-Enforced Partitioning: Why Hardware Enforcement?

Tasks may misbehave due to:



and these are fatal to cooperation-based software partitioning.

### <u>Hardware-Enforced</u> Partitioning: Elucidating GPU Capability

Untapped documentation:

- Patent Applications
- Granted Patents

Patents may describe non-existent inventions. We verify by cross-referencing:

- Open-Source Headers
- Open-Source Drivers
- NVIDIA Documentation
- Experiments




#### **Hardware-Enforced** Partitioning

Can these fields control kernel-to-SM assignment?

**One Sentence of Documentation** 





#### Illuminating GPU Terms & Our Timeline Figures



#### Applying the SM\_DISABLE\_MASK
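As a rough illustration of what applying such a mask means, the sketch below treats the mask as a 64-bit word in which a set bit disables the unit at that index. The bit polarity is inferred from the name SM\_DISABLE\_MASK, and the helper names `unit_enabled` and `count_enabled` are hypothetical; neither is taken from NVIDIA documentation or from this presentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if unit `idx` may run work under `disable_mask`.
 * Assumption: a set bit disables the unit at that bit index. */
bool unit_enabled(uint64_t disable_mask, unsigned idx)
{
    return idx < 64 && ((disable_mask >> idx) & 1u) == 0;
}

/* Hypothetical helper: number of units, among the first `total`,
 * that the mask leaves enabled. */
unsigned count_enabled(uint64_t disable_mask, unsigned total)
{
    assert(total <= 64); /* a 64-bit mask can describe at most 64 units */
    unsigned n = 0;
    for (unsigned i = 0; i < total; i++)
        if (unit_enabled(disable_mask, i))
            n++;
    return n;
}
```

Under these assumptions, a mask of `0xF` on an 8-unit GPU would leave units 4 through 7 available to the kernel.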

# Enabling <u>Flexible</u> Compute Partitioning

Goal 2 of 3

#### **Flexible** Partitioning

Given working partitioning, is it flexible and reliable enough to be useful?

#### Means of answering:

- Investigate hardware design
- Test with benchmarks

We do both.

### **Flexible** Partitioning: Investigating Hardware Design

We elucidate the design norms of **NVIDIA's GPU hardware** scheduling pipeline.



















# Enabling <u>Easily Applicable</u> Compute Partitioning

Goal 3 of 3

#### **Easily Applicable** Partitioning

Very portable: works on any NVIDIA GPU of compute capability 3.5 or newer (2013 onward) with CUDA 10.2 or newer (2019 onward)

Key insight: GPU scheduling hardware changes little from generation to generation

On Linux:

- 0. Download libsmctrl.h and libsmctrl.so
- 1. #include "libsmctrl.h" and link with -lsmctrl
- 2. libsmctrl\_set\_global\_mask(uint64\_t default\_mask);
- 3. libsmctrl\_set\_stream\_mask(cudaStream\_t, uint64\_t mask);

No kernel configuration, no driver configuration, and no superuser permissions.
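The steps above might be combined as in the following sketch. The helpers `mask_enabling_range` and `partitions_overlap` are hypothetical names introduced here, not part of libsmctrl. The code assumes that a set bit in a mask disables the corresponding unit (as the name SM\_DISABLE\_MASK suggests) and that at most 64 units are maskable. The two libsmctrl calls appear only in a comment, using the signatures listed above, so the block compiles without CUDA or the library installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: build a disable mask that leaves only units
 * [first, first + count) enabled. Assumption: set bit = unit disabled. */
uint64_t mask_enabling_range(unsigned first, unsigned count)
{
    assert(count >= 1 && first < 64 && count <= 64 - first);
    /* Special-case count == 64 to avoid an undefined 64-bit shift. */
    uint64_t ones = (count == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << count) - 1);
    return ~(ones << first);
}

/* Hypothetical helper: true if two disable masks leave at least one
 * unit enabled in both partitions. */
bool partitions_overlap(uint64_t mask_a, uint64_t mask_b)
{
    return (~mask_a & ~mask_b) != 0;
}

/* Intended use with the calls listed above (not compiled here):
 *
 *   // Default for every kernel: only units 0-3.
 *   libsmctrl_set_global_mask(mask_enabling_range(0, 4));
 *   // Kernels launched on `stream`: only units 4-7.
 *   libsmctrl_set_stream_mask(stream, mask_enabling_range(4, 4));
 */
```

Checking `partitions_overlap` before applying two masks is one way to keep partitions disjoint. Which hardware unit each bit index corresponds to is not stated on these slides, so that mapping should be confirmed against the library's documentation.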

Code is open source and documented. See <u>https://www.cs.unc.edu/~jbakita/rtas23-ae.html</u> to get started.



#### Testing with Real-World Software

### Conclusions

We build spatial partitioning for GPU compute units that is:

- **Hardware-Enforced:** Is there a hardware capability? Yes, the SM\_DISABLE\_MASK.
- **Flexible:** How can we be confident this will work widely? Hardware norms, benchmarks, and real-world software support it.
- **Easily Applicable:** Can we make GPU spatial partitioning easy? Yes, via our 1-line, no-install Linux API.

#### What you have to read the paper for...

Evaluation:

- Adversarial tests
- How GPU pitfalls noted in prior work still affect partitioned GPUs
- Hazards of overlapping partitions
- Comparison to prior work
- Full details on our system setup and configuration

API:

- Details on how we modify the TMD
- Full details on our supported API calls, with examples
- Details on our API to query GPU silicon configuration
- List of every GPU, CUDA version, and CPU architecture we tested portability on

Regarding GPUs:

- Distinction
- Extensive details on the NVIDIA GPU hardware scheduling pipeline, including:
  - The Host Interface
  - The Compute Front End
  - The Task Management Unit
  - The Work Distribution Unit
  - CPU-to-GPU Buffer Design
- GPU cache hierarchy and bus interconnect layout
- Plus more details and background on everything covered in this presentation

### Thanks! Questions?

Future work:

- Cross-context partitioning
- Criticality-aware time-slice scheduling

Contact:

- Email: <u>jbakita@cs.unc.edu</u>
- Twitter: <u>@JJBakita</u>
- Web: <u>https://cs.unc.edu/~jbakita</u>

