Artifact Evaluation for “Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems”

This page details the artifacts from our paper:

J. Bakita and J. H. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems”, Proceedings of the 37th Euromicro Conference on Real-Time Systems (ECRTS), to appear, Jul 2025. (PDF)

We also describe how to use these artifacts to reproduce our experiments.

Overview

This paper builds on four pre-existing software artifacts:

  1. libsmctrl/nvtaskset: Now includes the nvtaskset tool to spatially partition the GPU between unmodified tasks.
  2. nvdebug: Now provides information about which SMs correspond to which GPCs, and provides much more information on the runlist internals used by MPS.
  3. cuda_scheduling_examiner: Now supports a new configuration value that allows competing benchmarks to run indefinitely.
  4. gpu-microbench: Now includes a benchmark for measuring a CUDA task’s startup time.

These all have their own README.md files; we just summarize here. You will need to have the CUDA SDK and the equivalent of Ubuntu/Debian’s build-essential package (gcc, etc.) installed to build these tools.

Artifact #1: libsmctrl/nvtaskset

This library implements our partitioning mechanism.

To obtain and build this tool:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/ -b ecrts25-ae
cd libsmctrl
make tests libsmctrl.so nvtaskset
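
As a quick sanity check after building (an optional step on our part, not from the authors' instructions), you can run the bundled GPC-enumeration test, which is also used later in this guide:

./libsmctrl_test_gpc_info

It should print the number of enabled GPCs on your GPU and a TPC mask for each.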

Artifact #2: nvdebug

This Linux kernel module is used to study how enabling MPS changes the runlist configuration, and to obtain the correspondence between TPC and GPC IDs.

The source code for this artifact can be browsed online. To obtain and build this module, ensure that the kernel headers are installed (e.g., the linux-headers-generic package on Ubuntu/Debian), and:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/nvdebug.git/ -b ecrts25-ae
cd nvdebug
make
sudo insmod nvdebug.ko
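
To confirm the module loaded (an optional check; the exact interface paths are described in nvdebug's README, so the /proc path below is an assumption), inspect the kernel log and the module's procfs directory:

lsmod | grep nvdebug
sudo dmesg | tail    # check for any load-time messages from the module
ls /proc/gpu0/       # assumed location of nvdebug's per-GPU interface files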

Artifact #3: cuda_scheduling_examiner

This tool, originally by Nathan Otterness, was used in our work to test the behavior of MPS and to benchmark enforcement capability and partitioning granularity in our evaluation section.

We do not change the core part of this tool; our repository only adds our benchmark configuration files and some postprocessing scripts.

Obtain this code via:

git clone https://github.com/JoshuaJB/cuda_scheduling_examiner_mirror.git -b ecrts25-ae
cd cuda_scheduling_examiner_mirror

Then edit LIBSMCTRL_PATH at the top of the Makefile to reflect the location that libsmctrl was downloaded to, and build it via:

make all

If you have difficulty building the tool, please see the contained README.md for help. Please ensure you have Python 2, Python 3, python-tk, python-matplotlib, python3-matplotlib, and libsmctrl available to fully use this tool and its included scripts.
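
On Debian/Ubuntu, the Python 3 subset of these dependencies (sufficient for the plotting steps used in this guide; a suggestion on our part, not a command from the tool's README) can be installed via:

sudo apt install python3 python3-tk python3-matplotlib python3-numpy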

Artifact #4: gpu-microbench

We use this suite of high-accuracy benchmarks and measurement tools to examine the scheduling behavior of individual GPU contexts on the GPU copy and compute engines.

We extend this suite by adding measure_launch_oh and measure_startup_oh, which measure the time to launch a CUDA kernel and the time to initialize a CUDA program, respectively.

The source code for this artifact can be browsed online. To obtain and build this suite:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/gpu-microbench.git/
cd gpu-microbench
make all

System Requirements for Reproducing Our Experiments

We detail how to reproduce the evaluation experiments in Section 6 of the paper; namely the results plotted in Figures 11, 12, and 13.

Our experiments were originally performed in Google Cloud on an a2-highgpu-1g instance in europe-west4-b that includes one NVIDIA A100 GPU. We configured the system with 1 vCPU per core (i.e., SMT disabled), and with the GPU driver (available on this page) installed as:

sudo ./NVIDIA-Linux-x86_64-550.78.run --silent --no-install-compat32-libs --no-opengl-files --no-wine-files --no-peermem --no-drm

However, those without easy access to an A100 GPU should be able to reproduce all but our MiG experiments on any NVIDIA GPU of the Volta-generation (2018) or newer with at least two GPCs.

Please do not run anything else on the GPU while running the experiments. If this is a personal workstation, we suggest logging in over SSH and stopping the graphical session (e.g., via sudo systemctl stop gdm3 on Ubuntu/Debian). Our experiments are designed to be run headlessly.

The below experiments assume a single GPU, and do not explicitly call out how to set the GPU ID used for each experiment.
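
If your machine has multiple GPUs, the standard CUDA mechanism for restricting which one is used is the CUDA_VISIBLE_DEVICES environment variable; setting it before running the experiments should confine the CUDA-based tools to one GPU (this is general CUDA behavior, not something the scripts configure for you, and it does not affect nvidia-smi or kernel modules):

export CUDA_VISIBLE_DEVICES=0    # expose only GPU 0 to CUDA applications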

Setup

Install CUDA

Install the CUDA SDK (if not already present in /usr/local/cuda):

wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run

This will bring up a UI. Accept the EULA, and deselect all the unneeded portions of CUDA, such that your configuration looks like this:

┌───────────────────────────────────────────────────
│ CUDA Installer 
│ + [ ] Driver
│ - [X] CUDA Toolkit 12.4
│    + [X] CUDA Libraries 12.4
│    - [X] CUDA Tools 12.4
│       + [X] CUDA Command Line Tools 12.4
│       + [ ] CUDA Visual Tools 12.4
│    + [X] CUDA Compiler 12.4
│   [ ] CUDA Demo Suite 12.4
│   [ ] CUDA Documentation 12.4
│ - [ ] Kernel Objects
│   Options
│   Install 

Then press “Install”. Any warnings printed about adjusting PATH variables can be ignored; our tools automatically know where to find CUDA. You can then delete the installer:

rm cuda_12.4.0_550.54.14_linux.run
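
As an aside, if you prefer a fully non-interactive installation, the same runfile also supports a silent, toolkit-only mode (note that this installs the complete toolkit rather than the trimmed selection shown above):

sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit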

Install Other Dependencies

Install the remaining dependencies; on Debian/Ubuntu:

sudo apt install build-essential linux-headers-generic jq

If also plotting on this system:

sudo apt install python3-numpy python3-matplotlib

Download Artifacts and Build

Download our automated scripts:

wget http://cs.unc.edu/~jbakita/ecrts25-ae/clone_and_build_ecrts25.sh
wget http://cs.unc.edu/~jbakita/ecrts25-ae/evaluate_ecrts25.sh
wget http://cs.unc.edu/~jbakita/ecrts25-ae/plot_results_ecrts25.py

Download and build all the tools:

bash ./clone_and_build_ecrts25.sh

Configure Partitions

Every GPU is slightly different (even GPUs of the same model), so you will need to configure your GPU partitions no matter your choice of GPU. On non-A100 GPUs, we only support testing no partitioning, MPS provisioning, and libsmctrl/nvtaskset/nvsplit partitioning (no MiG).

For A100 GPUs

Run the ./libsmctrl/libsmctrl_test_gpc_info command. The output should look like this:

bakitajoshua@a100-ae:~/attempt/libsmctrl$ ./libsmctrl_test_gpc_info 
./libsmctrl_test_gpc_info: GPU0 has 7 enabled GPCs.
./libsmctrl_test_gpc_info: Mask of 7 TPCs associated with GPC 0: 0x00810204081020
./libsmctrl_test_gpc_info: Mask of 7 TPCs associated with GPC 1: 0x01020408102040
./libsmctrl_test_gpc_info: Mask of 8 TPCs associated with GPC 2: 0x02040810204081
./libsmctrl_test_gpc_info: Mask of 8 TPCs associated with GPC 3: 0x04081020408102
./libsmctrl_test_gpc_info: Mask of 8 TPCs associated with GPC 4: 0x08102040810204
./libsmctrl_test_gpc_info: Mask of 8 TPCs associated with GPC 5: 0x10204081020408
./libsmctrl_test_gpc_info: Mask of 8 TPCs associated with GPC 6: 0x20408102040810
./libsmctrl_test_gpc_info: Total of 54 enabled TPCs.

Note how two GPCs have only 7 TPCs, whereas the others have 8. Which two GPCs have only 7 TPCs will vary (it is a function of random manufacturing defects), and the partitions must be configured such that one of the 7-TPC GPCs is in the partition for the benchmark under test, and the other is assigned to one of the competing benchmark instances.

Edit the variables GPCS_A, GPCS_B, GPCS_C, and GPCS_D at the top of evaluate_ecrts25.sh to reflect this partitioning. Four GPC IDs should be assigned to GPCS_A (the partition of the benchmark under test), and one should be a 7-TPC GPC. The remaining three GPCs should be distributed to the competing benchmarks. For the libsmctrl_test_gpc_info output above, the default configuration:

GPCS_A=0,1,2,3
GPCS_B=4
GPCS_C=5
GPCS_D=6

should be changed to:

GPCS_A=1,2,3,4
GPCS_B=0
GPCS_C=5
GPCS_D=6

Remember, this will vary for each A100, and so your configuration may not match this example.
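
For intuition (this is not a required step, and how the evaluation script combines masks internally is an assumption on our part), you can OR together the per-GPC TPC masks reported by libsmctrl_test_gpc_info to see exactly which TPCs a partition spans. For the GPCS_A=1,2,3,4 assignment in the example above:

python3 -c 'print(hex(0x01020408102040 | 0x02040810204081 | 0x04081020408102 | 0x08102040810204))'
# 0xf1e3c78f1e3c7 -> 31 TPCs (7 + 8 + 8 + 8) for the benchmark under test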

For Non-A100 GPUs

Run the ./libsmctrl/libsmctrl_test_gpc_info command. The output should look something like this (example from an RTX 6000 Ada):

jbakita@yamaha:/tmp/ae-test/libsmctrl$ ./libsmctrl_test_gpc_info 
./libsmctrl_test_gpc_info: GPU0 has 12 enabled GPCs.
./libsmctrl_test_gpc_info: Mask of 5 TPCs associated with GPC 0: 0x000800800800800800
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 1: 0x001001001001001001
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 2: 0x002002002002002002
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 3: 0x004004004004004004
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 4: 0x008008008008008008
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 5: 0x010010010010010010
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 6: 0x020020020020020020
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 7: 0x040040040040040040
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 8: 0x080080080080080080
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 9: 0x100100100100100100
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 10: 0x200200200200200200
./libsmctrl_test_gpc_info: Mask of 6 TPCs associated with GPC 11: 0x400400400400400400
./libsmctrl_test_gpc_info: Total of 71 enabled TPCs.

Note that this GPU has 12 GPCs, and all but one have six TPCs. To partition this GPU between the benchmark under test and the three competing benchmark instances we run, it would make sense to split it in half, and then subdivide one half among the competitors. That is, when using libsmctrl/nvtaskset/nvsplit: GPCs 0-5 (6 GPCs) for the benchmark under test, 6-7 (2 GPCs) for the first competitor, 8-9 (2 GPCs) for the second competitor, and 10-11 (2 GPCs) for the third competitor. When using MPS, that corresponds to a 50%, 17%, 17%, and 16% execution resource provisioning configuration, respectively.

These settings can be configured for your GPU by editing the variables at the top of evaluate_ecrts25.sh. By default, these are configured for a typical A100 GPU:

# What GPCs to give the task under test (A), and the three competing tasks (B-D), respectively
# The GPCs in GPCS_A should be mutually exclusive from those in GPCS_B, GPCS_C, or GPCS_D
GPCS_A=0,1,2,3
GPCS_B=4
GPCS_C=5
GPCS_D=6

# Percentage of GPU to give the task under test (A), and the three competing tasks (B-D), respectively
# The total of these values should equal 100
PCT_A=57
PCT_B=15
PCT_C=14
PCT_D=14

To change this for the RTX 6000 Ada (for example), one would change these to:

GPCS_A=0,1,2,3,4,5
GPCS_B=6,7
GPCS_C=8,9
GPCS_D=10,11
PCT_A=50
PCT_B=17
PCT_C=17
PCT_D=16
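
For background, MPS execution resource provisioning of this kind is normally applied through the MPS control daemon and the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. The sketch below illustrates the general mechanism only; evaluate_ecrts25.sh manages MPS on your behalf, and my_benchmark is a hypothetical client:

nvidia-cuda-mps-control -d                            # start the MPS control daemon
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50 ./my_benchmark   # hypothetical client limited to ~50% of the SMs
echo quit | nvidia-cuda-mps-control                   # shut the daemon down when finished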

Evaluation

You can then run all except the MiG experiments via:

bash ./evaluate_ecrts25.sh

This will take about an hour on an A100; expect longer on slower GPUs. (Sample sizes are unchanged for startup overhead and partition enforcement tests, but reduced to 10,000 for launch overhead and to 1 for each granularity setting.)
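
While the script runs, you can optionally monitor GPU activity from a second terminal with standard tooling (not part of the authors' workflow):

watch -n 1 nvidia-smi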

To evaluate MiG (A100 only), enable MiG mode (requires a restart) before continuing:

sudo nvidia-smi -i 0 -mig 1
sudo reboot
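
Once the machine is back up, you can optionally confirm that MiG mode took effect with a standard nvidia-smi query:

nvidia-smi --query-gpu=mig.mode.current --format=csv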

After the reboot is complete, you can run the MiG experiments via:

bash ./evaluate_ecrts25.sh mig

To disable MiG (requires a reboot), run:

sudo nvidia-smi -i 0 -mig 0
sudo reboot

Plotting

Once complete, you can plot all the results. If you did not run the MiG experiments, create placeholders for the plotting script first by duplicating some other log:

cp startup_oh_mps.log startup_oh_mig.log
cp launch_oh_mps.log launch_oh_mig.log
cp ecrts25_isol_mps_mb_stripped.json ecrts25_isol_mig_mb_stripped.json
cp ecrts25_isol_mps_rw_stripped.json ecrts25_isol_mig_rw_stripped.json

If you changed the MPS primary partition size to be something other than 57%, please also update the PCT_A global in plot_results_ecrts25.py to reflect this change.

Make sure you are connected via an SSH session with X11 forwarding enabled (the -Y option to ssh), then generate and display the plots via:

python3 ./plot_results_ecrts25.py

Each plot will pop up as a window; close the window to switch to the next plot.

If any results look odd, the evaluation of any given partitioning mechanism can be safely rerun by calling evaluate_ecrts25.sh with the name of the mechanism, e.g., ./evaluate_ecrts25.sh libsmctrl.