These are the steps required to reproduce the experiments in our paper:
J. Bakita and J. H. Anderson, "Hardware Compute Partitioning on NVIDIA GPUs", Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 54-66, May 2023. (PDF)
These steps were last updated .
There are four software artifacts from our work:
- libsmctrl library, which allows for GPU partitioning
- cuda_scheduling_examiner project, used in microbenchmarks throughout
- libsmctrl
We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout.
libsmctrl
This core library (browse code online) implements SM/TPC partitioning, and is a prerequisite to all other steps. As noted in the paper, this library supports GPUs of compute capability 3.5 and later (see NVIDIA's documentation to look up what your GPU's compute capability is) on CUDA 8.0 through CUDA 12.1.
To obtain and build our library, ensure that CUDA and gcc are installed, then run:
git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/
cd libsmctrl
make libsmctrl.a
See libsmctrl.h for details on how to use our API, if you would like to experiment with it.
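If you do experiment with the API, you will need to build TPC bitmasks. The sketch below shows one way to compute such a mask in Python. It assumes that masks are 64 bits wide and that a set bit disables the corresponding TPC; please confirm this convention against libsmctrl.h before relying on it. The helper name tpc_disable_mask and the 9-TPC GPU in the example are hypothetical and not part of libsmctrl.

```python
def tpc_disable_mask(enabled_tpcs, num_tpcs):
    """Return a 64-bit mask that leaves only `enabled_tpcs` enabled.

    Assumption (verify in libsmctrl.h): a set bit DISABLES that TPC.
    """
    if not 0 < num_tpcs <= 64:
        raise ValueError("num_tpcs must be between 1 and 64")
    enabled = 0
    for tpc in enabled_tpcs:
        if not 0 <= tpc < num_tpcs:
            raise ValueError("TPC index %d is out of range" % tpc)
        enabled |= 1 << tpc
    # Invert so that every TPC that was not requested has its bit set,
    # then truncate to 64 bits.
    return ~enabled & 0xFFFFFFFFFFFFFFFF


# Hypothetical example: keep TPCs 0-2 enabled on a 9-TPC GPU.
print(hex(tpc_disable_mask([0, 1, 2], 9)))  # 0xfffffffffffffff8
```

The resulting integer is what you would pass as the mask argument, under the assumptions above.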
cuda_scheduling_examiner
This tool, originally by Nathan Otterness, was modified in our work to support SM partitioning.
We are in the process of upstreaming our changes to the tool. In the meantime, you can obtain code to reproduce the experiments via the following steps:
git clone https://github.com/JoshuaJB/cuda_scheduling_examiner_mirror.git -b rtas23-ae
cd cuda_scheduling_examiner_mirror
Then edit LIBSMCTRL_PATH at the top of the Makefile to reflect the location that libsmctrl was downloaded to, and run:
make all
If you have difficulty building the tool, please see the contained README.md for help. To fully use this tool and its included scripts, please ensure you have Python 2, Python 3, python-tk, python-matplotlib, python3-matplotlib, and libsmctrl available.
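A quick way to spot missing prerequisites before building is sketched below. It only checks that commands are on the PATH and that modules are importable by the interpreter running the script (Python 3 here), so Python 2 packages such as python-tk would need a separate check under python2. The command and module names passed in at the bottom are our guesses at how the prerequisites appear on a typical system, not names taken from the tool.

```python
import importlib.util
import shutil


def missing_prerequisites(commands, modules):
    """Return the commands not found on PATH and the modules that cannot be imported."""
    missing = [c for c in commands if shutil.which(c) is None]
    # find_spec returns None for top-level modules that are not installed.
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


# Assumed names; adjust them to match your distribution.
missing = missing_prerequisites(
    ["gcc", "nvcc", "python2", "python3"],
    ["matplotlib", "tkinter"],
)
print("Missing:", ", ".join(missing) if missing else "nothing")
```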
This tool was used to generate figures 3, 4, 10, 11, 12, and 13 in the paper, with the following correspondence between GPU and figure:
Figure | GPU |
---|---|
3 | GTX 1060 (3GB) |
4 | GTX 1060 (3GB) and Tesla P100 |
10 | GTX 970 |
11 | GTX 1060 (3GB) |
12 | GTX 1060 (3GB) |
13 | Titan V |
All experiments should run on any libsmctrl-supported GPU, but the desired effect may not be evident if your GPU is much larger or smaller than ours. Note that it should not be necessary to disable graphical output while running these experiments, but we did so in our experiments. If you are an artifact evaluator and would like access to our GPU systems, please reach out to us via the artifact evaluation chair, and we can promptly provide access (with the exception of the Tesla P100, which was rented).
Given a configured GPU, our experiments can be reproduced by running ./bin/runner configs/<config> in the cuda_scheduling_examiner directory, with <config> replaced by one of the following names:
Figure | Configuration | Notes |
---|---|---|
3 | rtas23_nosplit.json (left) and rtas23_split.json (right) | Needs at least 6 TPCs for partitioning to be evident. |
4 | rtas23_striped_1060.json (left) and rtas23_striped_p100.json (right) | Requires a GPU with 1 SM/TPC on left (e.g. GTX 970, 1060, 1070, 1080 and earlier), and 2 SM/TPC on right (e.g. Tesla P100, Titan V, and newer). |
10 | rtas23_sayhellortas.json | Requires a GPU with at least 12 TPCs for everything to fit. |
11 | rtas23_priority_blocking.json | Requires a GPU with 32 WDU slots (compute capability 3.5, 3.7, 5.0, 5.2, or 6.1) and at least 5 TPCs for everything to fit. Our plotting software struggles to correctly display this complicated plot, and tends to erroneously render thread blocks on top of each other. The plots in the paper were edited from the SVG output of view_blocksbysm.py to correct block overlaps. However, blocking should still be evident, even if the plot is somewhat garbled. |
12 | rtas23_greedy_blocking.json | The outcome of this experiment is a function of which stream manages to launch its kernels first. They race to the GPU, and we find that on our system, stream 1 wins about 50% of the time (yielding the result at left) and stream 2 wins the remaining 50% of the time (yielding the result at right). The distribution may be slightly or significantly different on your system. You can force one ordering over the other by increasing the "release_time" for the corresponding kernel to a fraction of a second. |
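To force one ordering in the Figure 12 experiment, the "release_time" of one kernel can be raised in the configuration file. The sketch below shows the idea on an in-memory configuration. It assumes that the JSON has a top-level "benchmarks" list with one entry per kernel and that "release_time" is given in seconds; check both against the README.md and the actual rtas23_greedy_blocking.json. The example dictionary and the helper name delay_benchmark are hypothetical.

```python
import copy
import json


def delay_benchmark(config, index, release_time):
    """Return a copy of `config` with one benchmark's release_time changed.

    Assumes config["benchmarks"] is a list of per-kernel settings.
    """
    updated = copy.deepcopy(config)
    updated["benchmarks"][index]["release_time"] = release_time
    return updated


# Hypothetical stand-in for a loaded configuration file.
example = {
    "name": "hypothetical example",
    "benchmarks": [{"label": "stream 1"}, {"label": "stream 2"}],
}

# Delay the second kernel by a fraction of a second so the first launches first.
delayed = delay_benchmark(example, 1, 0.1)
print(json.dumps(delayed, indent=2))
```

With a real file, you would load it with json.load, apply the change, and write the result to a new file under configs/ so that the original stays untouched.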
Each result can be plotted by running ./scripts/view_blocksbysm.py -d results/<config name without .json>. This requires an X server. If connecting remotely, we suggest using X forwarding by adding the -Y parameter to your ssh invocation.
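When reproducing several figures, it can help to generate the run and plot commands for each configuration rather than typing them by hand. The sketch below only builds and prints the two commands described above for each configuration name in the table; it does not execute them. The helper name reproduction_commands is hypothetical.

```python
from pathlib import Path


def reproduction_commands(config_file):
    """Return the runner and plotting commands for one configuration file."""
    name = Path(config_file).name
    if not name.endswith(".json"):
        raise ValueError("expected a .json configuration file")
    # Results are stored under the configuration name without its extension.
    result_dir = "results/" + name[: -len(".json")]
    return [
        ["./bin/runner", "configs/" + name],
        ["./scripts/view_blocksbysm.py", "-d", result_dir],
    ]


configs = [
    "rtas23_nosplit.json",
    "rtas23_split.json",
    "rtas23_striped_1060.json",
    "rtas23_striped_p100.json",
    "rtas23_sayhellortas.json",
    "rtas23_priority_blocking.json",
    "rtas23_greedy_blocking.json",
]

for config in configs:
    for command in reproduction_commands(config):
        print(" ".join(command))
```

Each printed line can be pasted into a shell opened in the tool's directory, or the lists can be passed to subprocess.run if you prefer to automate the whole sequence.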
Figure 13 is a composite of many experiments, and takes about half an hour to run in full. To run and plot these experiments:
1. Build the libsmctrl shared library (needed for pysmctrl) by running make libsmctrl.so in the libsmctrl directory.
2. cd into the cuda_scheduling_examiner_mirror directory.
3. Set the PYTHONPATH environment variable to the location you downloaded libsmctrl to. Ex: export PYTHONPATH=/home/ae/libsmctrl
4. Run ./scripts/test_cu_mask.py -d 1 -s for the SE-distributed experiments.
5. Run ./scripts/test_cu_mask.py -d 1 for the SE-packed experiments.
6. Plot the results by running ./scripts/view_scatterplots_rtns.py -d results/cu_mask.

The above instructions can be used to test for partitioning support on all platforms.
For convenience on embedded systems, we provide pre-built ARM64 binaries for testing on the Jetson TX2, TX1, and Nano. You need at least JetPack 3.2.1 (2018, CUDA 9.0) for the pre-built binaries to run (JetPack 4.4+ with CUDA 10.2+ is strongly recommended). See the enclosed README.md for details, or just run:
sudo apt install python-tk
wget https://www.cs.unc.edu/~jbakita/rtas23-ae/libsmctrl_test_jetson.zip
unzip libsmctrl_test_jetson
cd libsmctrl_test_jetson
./bin/runner rtas23_striped_tx2.json
./bin/runner rtas23_unstriped_tx2.json
python2 ./view_blocksbysm.py -d results/
For the last command, X is required. (You can alternatively copy results and *.py to another machine with python2.7 and python-tk and run the plotting command there.) After closing the first plot, the second one should appear. If the two plots differ, stream-based SM partitioning works on your platform!
Tip: See this page for info about which CUDA and kernel versions are bundled with each JetPack release. Use the instructions on this page to quickly flash a newer release without the messy and unnecessary JetPack SDK Manager.
Due to an unexpected hardware failure, we have been delayed in posting the instructions for reproducing our case study. If you are an artifact evaluator and must conclude your evaluation before that date, we apologize that you will be unable to test this component of our work.
To verify that libsmctrl and cuda_scheduling_examiner are portable to a variety of GPUs of different architectures, we tested our tools in several Google Cloud VMs with different GPUs. In order to minimize execution time (and thus cost), we mounted our internal fileserver, which holds pre-built binaries, onto the VM. External users will need to install the CUDA toolkit, and download and build our tools using the above instructions.
Tested VM configs:
Machine Type | GPUs |
---|---|
a2-highgpu-1g | 1 x NVIDIA A100 40GB |
n1-standard-1 | 1 x NVIDIA Tesla P100 |
n1-standard-1 | 1 x NVIDIA T4 |
In the instructions below, replace the driver link with the latest one from the NVIDIA archive.
Create a new VM, then run:
sudo apt update
sudo apt install -y sshfs build-essential
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/530.30.02/NVIDIA-Linux-x86_64-530.30.02.run
chmod +x NVIDIA-Linux-x86_64-530.30.02.run
sudo ./NVIDIA-Linux-x86_64-530.30.02.run --silent --no-opengl-files
sudo mkdir /playpen
sudo chmod 777 /playpen
mkdir /playpen/jbakita
sshfs -o default_permissions,ssh_command="ssh -J jbakita@rtsrv-eth.telenet.unc.edu" jbakita@bonham.cs.unc.edu:/playpen/jbakita /playpen/jbakita
Enter credentials (twice), then:
cd /playpen/jbakita/gpu_subdiv/cuda_scheduling_examiner_mirror/
./bin/runner configs/rtas23_sayhellortas.json
On another machine with X available, run python2 ./scripts/view_blocksbysm.py -d results/rtas23_sayhellortas and verify that "Hello RTAS" is visible.