These are the steps required to reproduce the experiments in our paper:

J. Bakita and J. H. Anderson, "Hardware Compute Partitioning on NVIDIA GPUs", Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 54-66, May 2023. (PDF)

These steps were last updated .


There are three software artifacts from our work:

  1. Our libsmctrl library, which allows for GPU partitioning
  2. Our modified version of the cuda_scheduling_examiner project, used in microbenchmarks throughout
  3. Our version of Darknet and YOLOv2 used in the case study, modified to use libsmctrl

We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout.

Artifact #1: libsmctrl

This core library (browse code online) implements SM/TPC partitioning, and is a prerequisite to all other steps. As noted in the paper, this library supports GPUs of compute capability 3.5 and later (see NVIDIA's documentation to look up what your GPU's compute capability is) on CUDA 8.0 through CUDA 12.1.
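As a quick sanity check before building, the support ranges quoted above can be written as a small test. This is only a sketch, not part of libsmctrl: the function names `parse_version` and `is_supported` are made up for illustration, and you must look up your GPU's compute capability and your CUDA version yourself.

```python
def parse_version(text):
    """Turn a string such as "6.1" or "12" into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def is_supported(compute_capability, cuda_version):
    """Check both values against the ranges stated in the text above.

    Hypothetical helper: compute capability 3.5 or later,
    CUDA 8.0 through CUDA 12.1 (inclusive).
    """
    capability_ok = tuple(compute_capability) >= (3, 5)
    cuda_ok = (8, 0) <= tuple(cuda_version) <= (12, 1)
    return capability_ok and cuda_ok
```

For example, `is_supported(parse_version("6.1"), parse_version("11.8"))` evaluates to `True`, while a compute capability of 3.0 or a CUDA version of 12.2 falls outside the stated ranges.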

To obtain and build our library, ensure that CUDA and gcc are installed, then run:

git clone
cd libsmctrl
make libsmctrl.a

See libsmctrl.h for details on how to use our API, if you would like to experiment with it.

Artifact #2: cuda_scheduling_examiner

This tool, originally by Nathan Otterness, was modified in our work to support SM partitioning.


We are in the process of upstreaming our changes to the tool. In the meantime, you can obtain code to reproduce the experiments via the following steps:

git clone -b rtas23-ae
cd cuda_scheduling_examiner

Then edit LIBSMCTRL_PATH at the top of the Makefile to reflect the location that libsmctrl was downloaded to, and run:

make all

If you have difficulty building the tool, please see the contained for help. Please ensure you have Python 2, Python 3, python-tk, python-matplotlib, python3-matplotlib, and libsmctrl available to fully use this tool and the included scripts.

Running Experiments

This tool was used to generate figures 3, 4, 10, 11, 12, and 13 in the paper, with the following correspondence between GPU and figure:

Figure | GPU
3 | GTX 1060 (3GB)
4 | GTX 1060 (3GB) and Tesla P100
10 | GTX 970
11 | GTX 1060 (3GB)
12 | GTX 1060 (3GB)
13 | Titan V

All experiments should run on any libsmctrl-supported GPU, but the desired effect may not be evident if your GPU is much larger or smaller than ours. Note that it should not be necessary to disable graphical output while running these experiments, although we did so in ours. If you are an artifact evaluator and would like access to our GPU systems, please reach out to us via the artifact evaluation chair, and we can promptly provide access (with the exception of the Tesla P100, which was rented).

Given a configured GPU, our experiments can be reproduced by running ./bin/runner configs/<config> in the cuda_scheduling_examiner directory, with <config> replaced by one of the following names:

Figure | Configuration | Notes
3 | rtas23_nosplit.json (left) and rtas23_split.json (right) | Needs at least 6 TPCs for partitioning to be evident.
4 | rtas23_striped_1060.json (left) and rtas23_striped_p100.json (right) | Requires a GPU with 1 SM/TPC on the left (e.g., GTX 970, 1060, 1070, 1080, and earlier), and 2 SM/TPC on the right (e.g., Tesla P100, Titan V, and newer).
10 | rtas23_sayhellortas.json | Requires a GPU with at least 12 TPCs for everything to fit.
11 | rtas23_priority_blocking.json | Requires a GPU with 32 WDU slots (compute capability 3.5, 3.7, 5.0, 5.2, or 6.1) and at least 5 TPCs for everything to fit. Our plotting software struggles to correctly display this complicated plot, and tends to erroneously render thread blocks on top of each other. The plots in the paper were edited from the SVG output to correct block overlaps. However, blocking should still be evident, even if the plot is somewhat garbled.
12 | rtas23_greedy_blocking.json | The outcome of this experiment depends on which stream manages to launch its kernels first. The streams race to the GPU, and we find that on our system, stream 1 wins about 50% of the time (yielding the result at left) and stream 2 wins the remaining 50% of the time (yielding the result at right). The distribution may be slightly or significantly different on your system. You can force one ordering over the other by increasing the "release_time" for the corresponding kernel to a fraction of a second.
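The figure-to-configuration mapping and the hardware requirements listed above can be collected into a short script. This is a sketch under assumptions, not something shipped with the tool: the helper names (`runner_commands`, `unmet_requirements`, `run_figure`) are hypothetical, and the TPC count and compute capability of your GPU must be supplied by you.

```python
import subprocess

# Figure number -> configuration files, copied from the list above.
FIGURE_CONFIGS = {
    3: ["rtas23_nosplit.json", "rtas23_split.json"],
    4: ["rtas23_striped_1060.json", "rtas23_striped_p100.json"],
    10: ["rtas23_sayhellortas.json"],
    11: ["rtas23_priority_blocking.json"],
    12: ["rtas23_greedy_blocking.json"],
}

# Minimum TPC counts from the notes (figures with no stated minimum are omitted).
MIN_TPCS = {3: 6, 10: 12, 11: 5}

# Compute capabilities listed above as having 32 WDU slots (needed for Figure 11).
WDU32_CAPABILITIES = {(3, 5), (3, 7), (5, 0), (5, 2), (6, 1)}


def runner_commands(figure):
    """Return one ./bin/runner argument list per config of the figure."""
    return [["./bin/runner", "configs/" + name] for name in FIGURE_CONFIGS[figure]]


def unmet_requirements(figure, tpc_count, compute_capability):
    """Return reasons why a GPU may not show the effect for the figure."""
    problems = []
    needed = MIN_TPCS.get(figure)
    if needed is not None and tpc_count < needed:
        problems.append("needs at least %d TPCs, GPU has %d" % (needed, tpc_count))
    if figure == 11 and tuple(compute_capability) not in WDU32_CAPABILITIES:
        problems.append("needs a GPU with 32 WDU slots")
    return problems


def run_figure(figure, examiner_dir):
    """Run every config of the figure from inside the examiner checkout."""
    for command in runner_commands(figure):
        subprocess.run(command, cwd=examiner_dir, check=True)
```

Calling `unmet_requirements(10, 9, (6, 1))` reports the missing TPCs before any time is spent on a run. The SM/TPC requirement of Figure 4 is left out of the check because it differs between the left and right plots.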

Each result can be plotted by running ./scripts/ -d results/<config name without .json>. This requires an X server. If connecting remotely, we suggest using X forwarding by adding the -Y parameter to your ssh invocation.
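The results directory passed to -d follows directly from the config name, which the sketch below makes explicit. The helper names are hypothetical, and because the plotting script is not named here, its path is taken as a caller-supplied argument rather than guessed.

```python
import os


def results_dir(config_name):
    """Map a config file name to its results directory (name without .json)."""
    stem, extension = os.path.splitext(os.path.basename(config_name))
    if extension != ".json":
        raise ValueError("expected a .json config, got %r" % (config_name,))
    return "results/" + stem


def plot_command(plot_script, config_name):
    """Build the argument list for plotting one experiment's results.

    plot_script is whatever plotting script path you use; it is an
    assumption of this sketch and not filled in here.
    """
    return [plot_script, "-d", results_dir(config_name)]
```

The returned list can be handed to `subprocess.run` from the cuda_scheduling_examiner directory once an X server (or X forwarding) is available.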

Figure 13: SE-packed vs SE-distributed

This figure is a composite of many experiments, and will take about a half hour to run all together. To run and plot these experiments:

  1. Build the libsmctrl shared library (needed for pysmctrl) via make run in the libsmctrl directory
  2. cd into the cuda_scheduling_examiner_mirror directory.
  3. Set the PYTHONPATH environment variable to the location you downloaded libsmctrl to. Ex: export PYTHONPATH=/home/ae/libsmctrl.
  4. Run ./scripts/ -d 1 -s for the SE-distributed experiments.
  5. Run ./scripts/ -d 1 for the SE-packed experiments.
  6. Plot both sets of experiments by running ./scripts/ -d results/cu_mask.

Table II: GPUs Supported

The above instructions can be used to test for partitioning support on all platforms.

For convenience on embedded systems, we provide pre-built ARM64 binaries for testing on the Jetson TX2, TX1, and Nano. You need at least JetPack 3.2.1 (2018, CUDA 9.0) for the pre-built binaries to run (JetPack 4.4+ with CUDA 10.2+ is strongly recommended). See the enclosed for details, or just run:

sudo apt install python-tk
unzip libsmctrl_test_jetson
cd libsmctrl_test_jetson
./bin/runner rtas23_striped_tx2.json
./bin/runner rtas23_unstriped_tx2.json
python2 ./ -d results/

For the last command, X is required. (You can alternatively copy results and *.py to another machine with python2.7 and python-tk and run the plotting command there.) After closing the first plot, the second one should appear. If the two plots differ, stream-based SM partitioning works on your platform!

Helpful Tip: See this helpful page for info about which CUDA and kernel versions are bundled with each JetPack release. Use the instructions on this page to quickly flash a newer release without the messy and unnecessary JetPack SDK Manager.

Artifact #3: Modified Darknet & Case Study

Due to an unexpected hardware failure, we have been delayed in posting the instructions for reproducing our case study. If you are an artifact evaluator and must conclude your evaluation before that date, we apologize that you will be unable to test this component of our work.

[Internal Only] Reproduce on Google Cloud

To verify that libsmctrl and cuda_scheduling_examiner are portable to a variety of GPUs of different architectures, we tested our tools in several Google Cloud VMs with different GPUs. In order to minimize execution time (and thus cost), we mount our internal file server, which holds pre-built binaries, onto the VM. External users will need to install the CUDA toolkit, and download and build our tools using the above instructions.

Tested VM configs:

Machine Type | GPUs
a2-highgpu-1g | 1 x NVIDIA A100 40GB
n1-standard-1 | 1 x NVIDIA Tesla P100
n1-standard-1 | 1 x NVIDIA T4

Using Internal Data Server

In the instructions below, substitute the driver link with the latest one from the NVIDIA Archive.

Create a new VM, then run:

sudo apt update
sudo apt install -y sshfs build-essential
chmod +x
sudo ./ --silent --no-opengl-files
sudo mkdir /playpen
sudo chmod 777 /playpen
mkdir /playpen/jbakita
sshfs -o default_permissions,ssh_command="ssh -J" /playpen/jbakita

Enter credentials (twice), then:

cd /playpen/jbakita/gpu_subdiv/cuda_scheduling_examiner_mirror/
./bin/runner configs/rtas23_sayhellortas.json

On another machine with X available, run python2 ./scripts/ -d results/rtas23_sayhellortas and verify that "Hello RTAS" is visible.