These are the steps required to reproduce the experiments in our paper:
J. Bakita and J. H. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 54-66, May 2023. (PDF)
There are four software artifacts from our work:

- The libsmctrl library, which allows for GPU partitioning
- The cuda_scheduling_examiner project, used in microbenchmarks throughout
- Our modified variant of Darknet, used in the case study
- Our LITMUS^RT kernel branch and liblitmus fork, used for CPU scheduling in the case study
We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout.
libsmctrl
This core library (browse code online) implements SM/TPC partitioning, and is a prerequisite to all other steps. As noted in the paper, this library supports GPUs of compute capability 3.5 and later (see NVIDIA’s documentation to look up what your GPU’s compute capability is) on CUDA 8.0 through CUDA 12.1.
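If you are unsure of your GPU's compute capability, you can also query it programmatically; below is a small sketch (not part of libsmctrl; the file name cc_check.cu is arbitrary) using the standard CUDA runtime API:

// Print each GPU's compute capability. Build with: nvcc -o cc_check cc_check.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d (%s): compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}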
To obtain and build our library, ensure that CUDA and gcc
are installed, then run:
git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/
cd libsmctrl
make libsmctrl.a
See libsmctrl.h
for details on how to use our API, if you would like to experiment with it.
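As a rough illustration of how the API can be used, the sketch below assumes the libsmctrl_set_stream_mask() function declared in libsmctrl.h, in which (to our understanding) a set bit in the mask disables the corresponding TPC for that stream; check the header for the exact names and semantics in your version:

// Sketch: partition TPCs between two CUDA streams using libsmctrl.
// Link against libsmctrl.a and the CUDA runtime.
#include <cuda_runtime.h>
#include "libsmctrl.h"

int main() {
    cudaStream_t stream_a, stream_b;
    cudaStreamCreate(&stream_a);
    cudaStreamCreate(&stream_b);

    // Assumed semantics: a set bit disables that TPC for the stream.
    // Give TPCs 0-3 to stream A, and all remaining TPCs to stream B.
    libsmctrl_set_stream_mask(stream_a, ~0xfull);
    libsmctrl_set_stream_mask(stream_b, 0xfull);

    // ...launch kernels into stream_a and stream_b as usual...
    return 0;
}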
cuda_scheduling_examiner
This tool, originally by Nathan Otterness, was modified in our work to support SM partitioning.
We are in the process of upstreaming our changes to the tool. In the meantime, you can obtain code to reproduce the experiments via the following steps:
git clone https://github.com/JoshuaJB/cuda_scheduling_examiner_mirror.git -b rtas23-ae
cd cuda_scheduling_examiner_mirror
Then edit LIBSMCTRL_PATH
at the top of the Makefile to reflect the location that libsmctrl
was downloaded to, and:
make all
If you have difficulty building the tool, please see the included README.md for help. To fully use this tool and its included scripts, ensure that Python 2, Python 3, python-tk, python-matplotlib, python3-matplotlib, and libsmctrl are available.
This tool was used to generate figures 3, 4, 10, 11, 12, and 13 in the paper, with the following correspondence between GPU and figure:
Figure | GPU |
---|---|
3 | GTX 1060 (3GB) |
4 | GTX 1060 (3GB) and Tesla P100 |
10 | GTX 970 |
11 | GTX 1060 (3GB) |
12 | GTX 1060 (3GB) |
13 | Titan V |
All experiments should run on any libsmctrl-supported GPU, but the desired effect may not be evident if your GPU is much larger or smaller than ours. Note that it should not be necessary to disable graphical output while running these experiments, although we did so in ours. If you are an artifact evaluator and would like access to our GPU systems, please reach out to us via the artifact evaluation chair, and we can promptly provide access (with the exception of the Tesla P100; this GPU was rented).
Given a configured GPU, our experiments can be reproduced by running ./bin/runner configs/<config> in the cuda_scheduling_examiner_mirror directory, with <config> replaced by one of the following names:
Figure | Configuration | Notes |
---|---|---|
3 | rtas23_nosplit.json (left) and rtas23_split.json (right) | Needs at least 6 TPCs for partitioning to be evident. |
4 | rtas23_striped_1060.json (left) and rtas23_striped_p100.json (right) | Requires a GPU with 1 SM/TPC on the left (e.g., GTX 970, 1060, 1070, 1080, and earlier) and 2 SMs/TPC on the right (e.g., Tesla P100, Titan V, and newer). |
10 | rtas23_sayhellortas.json | Requires a GPU with at least 12 TPCs for everything to fit. |
11 | rtas23_priority_blocking.json | Requires a GPU with 32 WDU slots (compute capability 3.5, 3.7, 5.0, 5.2, or 6.1) and at least 5 TPCs for everything to fit. Our plotting software struggles to correctly display this complicated plot, and tends to errantly render thread blocks on top of each other. The plots in the paper were edited from the SVG output of view_blocksbysm.py to correct block overlaps; however, blocking should still be evident, even if the plot is somewhat garbled. |
12 | rtas23_greedy_blocking.json | The outcome of this experiment is a function of which stream manages to launch its kernels first. They race to the GPU, and we find that on our system, stream 1 wins about 50% of the time (yielding the result at left) and stream 2 wins the remaining 50% of the time (yielding the result at right). The distribution may be slightly or significantly different on your system. You can force one ordering or the other by increasing the "release_time" for the corresponding kernel to a fraction of a second. |
Each result can be plotted by running ./scripts/view_blocksbysm.py -d results/<config name without .json>. This requires an X server. If connecting remotely, we suggest using X forwarding by adding the -Y parameter to your ssh invocation.
Figure 13 is a composite of many experiments, and will take about a half hour to run all together. To run and plot these experiments:

1. Build the libsmctrl shared library (needed for pysmctrl) by running make libsmctrl.so in the libsmctrl directory.
2. cd into the cuda_scheduling_examiner_mirror directory.
3. Set the PYTHONPATH environment variable to the location you downloaded libsmctrl to. Ex: export PYTHONPATH=/home/ae/libsmctrl
4. Run ./scripts/test_cu_mask.py -d 1 -s for the SE-distributed experiments.
5. Run ./scripts/test_cu_mask.py -d 1 for the SE-packed experiments.
6. Plot the results with ./scripts/view_scatterplots_rtns.py -d results/cu_mask.

The above instructions can be used to test for partitioning support on all platforms.
For convenience on embedded systems, we provide pre-built ARM64 binaries for testing on the Jetson TX2, TX1, and Nano. You need at least Jetpack 3.2.1 (2018, CUDA 9.0) for the pre-built binaries to run (Jetpack 4.4+ with CUDA 10.2+ is strongly recommended). See the enclosed README.md for details, or just run:
sudo apt install python-tk
wget https://www.cs.unc.edu/~jbakita/rtas23-ae/libsmctrl_test_jetson.zip
unzip libsmctrl_test_jetson
cd libsmctrl_test_jetson
./bin/runner rtas23_striped_tx2.json
./bin/runner rtas23_unstriped_tx2.json
python2 ./view_blocksbysm.py -d results/
For the last command, X is required. (You can alternatively copy results
and *.py
to another machine with python2.7
and python-tk
and run the plotting command there.) After closing the first plot, the second one should appear. If the two plots differ, stream-based SM partitioning works on your platform!
Helpful Tip: See this helpful page for info about which CUDA and kernel versions are bundled with each JetPack release. Use the instructions on this page to quickly flash a newer release without the messy and unnecessary JetPack SDK Manager.
By default, Darknet does not support running multiple concurrent inferences. We add support for this by modifying Darknet, principally by removing any uses of CUDA's "NULL" stream (which implicitly synchronizes across threads), and by changing global variables to be __thread-scoped instead. On top of this modified Darknet variant, we add a new mode to the detector tool that runs multiple inference threads in parallel with GPU partitioning from libsmctrl.
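To illustrate the flavor of these changes, here is a hypothetical sketch (not the actual Darknet diff; thread_init and copy_input_async are made-up names for this example):

// Before: one global handle, with copies and launches on the NULL stream,
// which implicitly synchronizes across all host threads.
// After: thread-local state plus an explicitly created per-thread stream.
#include <cuda_runtime.h>
#include <cublas_v2.h>

__thread cudaStream_t thread_stream;
__thread cublasHandle_t thread_handle;

void thread_init(void) {
    cudaStreamCreate(&thread_stream);
    cublasCreate(&thread_handle);
    cublasSetStream(thread_handle, thread_stream);
}

void copy_input_async(float *dst, const float *src, size_t n) {
    // An asynchronous copy on the per-thread stream avoids the NULL
    // stream's implicit cross-thread synchronization.
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyHostToDevice, thread_stream);
}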
We base our modifications off of the Darknet fork maintained by AlexeyAB, since it has several bugfixes and performance improvements vs the original.
For CPU scheduling during the case study, we use the partitioned fixed-priority (P-FP) scheduler from the LITMUS^RT 5.4 Linux kernel variant. This allows us to guarantee that the Darknet threads run at the highest priority, and that each thread starts its inference at exactly the same time.
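For background, admitting a thread as a real-time task under LITMUS^RT typically follows the standard liblitmus pattern sketched below; the parameter values are placeholders (not those used in the case study), and the case-study code itself is authoritative:

// Sketch of the usual liblitmus setup for one thread; assumes init_litmus()
// was already called once in main(). Costs and periods are placeholders.
#include <litmus.h>

void become_rt_task(int cpu) {
    struct rt_task param;
    init_rt_task_param(&param);
    param.exec_cost = ms2ns(50);
    param.period = ms2ns(100);
    param.cpu = cpu;                       // partition for the P-FP plugin
    param.priority = LITMUS_HIGHEST_PRIORITY;

    be_migrate_to_domain(cpu);             // move onto the assigned core
    set_rt_task_param(gettid(), &param);
    task_mode(LITMUS_RT_TASK);             // switch this thread to real-time mode
    wait_for_ts_release();                 // block until release_ts triggers the release
}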
Edit Notice, Feb 2025: These instructions were not originally included nor evaluated due to an unexpected hardware failure, and were added in early 2025 upon request.
You will need to download and build both the LITMUS^RT kernel and the liblitmus toolkit/library. This will require super-user privileges.
To install the kernel build dependencies, run:
sudo apt install -y build-essential flex bison libssl-dev
For systems without apt
, use your distribution-provided tool to install make
, gcc
, flex
, bison
, and the headers for libssl
. (These instructions assume you already have git
installed.)
Next, download the kernel source and our provided config into the litmus-rt subdirectory, then build and install the kernel:
git clone --branch linux-5.4-litmus https://github.com/JoshuaJB/litmus-rt.git
cd litmus-rt
wget "https://www.cs.unc.edu/~jbakita/rtas23-ae/.config" # Provided default. Alternate: use `make config`
make --jobs=8 bzImage modules # For faster builds, run as many concurrent build jobs as you have cores, e.g. --jobs=32 on a 32-core system
sudo make INSTALL_MOD_STRIP=1 modules_install install # This copies the kernel to /boot/
liblitmus
To get and build liblitmus
(in the liblitmus
sibling directory to litmus-rt
):
cd ..
git clone https://github.com/JoshuaJB/liblitmus.git
cd liblitmus
make
Now restart your machine, and select the LITMUS^RT kernel in your bootloader, or, on Ubuntu, auto-reboot into LITMUS^RT via the following commands:
sudo bash -c 'echo "GRUB_DEFAULT=saved" >> /etc/default/grub'
sudo update-grub
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.4.224-litmus+"
This (1) tells the GRUB bootloader to use the saved
boot selection if it exists, and (2) uses the grub-reboot
tool to set the selection for the next boot. (See man grub-reboot
and info grub -n "Simple Configuration"
for more.)
Now, restart the system. After it finishes booting back up, verify that the LITMUS^RT kernel loaded successfully by checking that uname -r prints 5.4.224-litmus+.
We recommend several other options to reduce CPU-related jitter while running the case study.
Prevent interrupts from being moved to cores being used for experiments:
sudo systemctl disable irqbalance
Redirect all interrupts to core 0 by default by adding the line GRUB_CMDLINE_LINUX_DEFAULT="irqaffinity=0"
to /etc/default/grub
.
Set aside some cores, such that they can only be used if explicitly requested by an application. We tested on a 16-core (32-thread) AMD 3950X, and so we set aside CPU15 and CPU31 (a.k.a. the 16th core and its paired SMT thread). Do this by adding isolcpus=15,31
inside the quotes of the GRUB_CMDLINE_LINUX_DEFAULT
line mentioned above.
Reduce spurious timer interrupts by disabling the periodic scheduling timer on the isolated cores used for the experiments. Do this by adding nohz_full=15,31 inside the quotes of the GRUB_CMDLINE_LINUX_DEFAULT line mentioned above. (With all three changes applied, that line should read GRUB_CMDLINE_LINUX_DEFAULT="irqaffinity=0 isolcpus=15,31 nohz_full=15,31".)
After making these changes, run sudo update-grub and reboot to apply them. You may need to re-run the grub-reboot command above to ensure that you reboot back into the LITMUS^RT kernel.
Once you have the system ready-to-go, disable low-power states to reduce job release latency (bringing a core out of a low-power idle state is slow). Replace cpu15
and cpu31
with your choice of CPUs (if different).
echo "n/a" > /sys/devices/system/cpu/cpu15/power/pm_qos_resume_latency_us
echo "n/a" > /sys/devices/system/cpu/cpu31/power/pm_qos_resume_latency_us
To learn more about this setting, see the Linux documentation.
Our variant of Darknet is available on GitHub. To build Darknet, the CUDA toolkit must already be installed.
Ensure that libsmctrl
and liblitmus
have already been downloaded and compiled.
To download and compile it, run:
git clone https://github.com/JoshuaJB/darknet.git -b rtas23-ae
cd darknet
make --jobs=8
Ignore the (many) build warnings; these are issues in upstream Darknet and should not affect the case study. (The build assumes that the libsmctrl
and liblitmus
directories are at ../libsmctrl
and ../liblitmus
respectively. Specify the LIBSMCTRL
or LIBLITMUS
variables to make if they are stored elsewhere.)
Running YOLOv2 requires the model weights. To download them from the official site, run (from within the darknet
directory):
wget https://pjreddie.com/media/files/yolov2-voc.weights
(Requires at least 194 MiB of free disk space.)
You will need to obtain the PascalVOC 2012 training/validation dataset that is used as input to YOLOv2 during the case study. The official download is available here: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar (2GB .tar
file). You can download and extract the dataset via the following commands:
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar
(This will put the images in ./VOCdevkit/VOC2012/JPEGImages/
.)
The case study should be configured to minimize any potential CPU-related interference. In addition to the Linux configuration options mentioned above, the cores that each YOLOv2 instance runs on must be configured. The default configuration is set for the AMD 3950X used in our case study. The AMD 3950X is composed of two Core Complex (CCX) dies, each containing 8 cores (16 threads). Linux numbers these physical cores as CPU0–CPU7 for the first CCX and CPU8–CPU15 for the second. By default, we run the first YOLOv2 instance on the first CCX (specifically CPU7), and run any subsequent YOLOv2 instances on the second CCX (CPU8–CPU15). If your system topology is different, please reconfigure the literals on lines 2195, 2207–2208, 2223–2227, 2248–2252, and 2293–2294 in darknet/src/detector.c
to specify different CPU IDs.
Configure Darknet to find the input data you downloaded by updating darknet/data/voc.2012.all with the paths to the dataset downloaded earlier. We recommend doing this via a find-and-replace, e.g., if you extracted VOCdevkit to be a sibling directory to darknet:
sed -i "s/\/playpen\/jbakita\/DarkerNet/../g" data/voc.2012.all
We add five new modes to the Darknet detector
: rtas23-case1
, rtas23-case2
, rtas23-case3
, rtas23-case4
, and rtas23-split
. The first four generate the data for Tbl. 3 in the paper, and the last generates the data for Fig. 14 in the paper.
Check that nvidia-smi
does not list any tasks running on the GPU before starting. If any tasks are running, please close them first (e.g., sudo systemctl stop gdm3
to shut down Gnome and X).
Enable the LITMUS^RT P-FP scheduler:
sudo liblitmus/setsched P-FP
Running any of the first four tests is as simple as:
cd darknet
sudo -E CUDA_VISIBLE_DEVICES=0 ./darknet detector rtas23-case1 cfg/voc-rtas23.data cfg/yolov2-voc.cfg yolov2-voc.weights -out rtas23
in one console (replacing rtas23-case1
with your desired case; assuming CUDA device #0 is the one you want to test on; assuming the directory containing libsmctrl.so
is ../libsmctrl
). Once “Waiting for task system release…” is printed, run the following in another console to begin the benchmark:
sudo liblitmus/release_ts
This runs 10,000 samples by default to match the paper, taking 70 minutes. Do not use the machine for anything else during the duration of the case study to minimize the chance of interference. If you would like a shorter test, change the sample count ("10000"
) on line 2161 in src/detector.c
to a smaller number. 1,000 samples should be reasonably sufficient, while taking only seven minutes to run.
To run the experiments for the fifth test (Fig. 14), edit darknet/run_split.sh
to ensure that the path to liblitmus’s release_ts
binary is correct, and to update the CUDA device from #0 as needed, then:
cd darknet
sudo -E ./run_split.sh
This runs 10,000 samples under each configuration, taking a total of approximately 8.5 hours to run all seven configurations.
The output files for each experiment are all named with the prefix MmmDD-HHMM, where Mmm is the abbreviated month name, DD is the numeric day of the month, HH is the hour (24-hour clock), and MM is the minute that the experiment started at. The postfix varies with each experiment, as follows:

- Case #1: _8alone_10k.txt
- Case #2: _8shared_10k-A.txt
- Case #3: _8shared_with4semievil_10k.txt
- Case #4: _8split_with4semievil_10k.txt
- Fig. 14 (rtas23-split): -split10k-A.txt and -split10k-B.txt
Plotting/analyzing these files requires Python 3, Matplotlib, and NumPy. Install these on Ubuntu via sudo apt install python3-matplotlib python3-numpy
.
Here is the plotting and statistics code for Cases #1–4:
import numpy
import matplotlib.pyplot as plt
## From http://code.activestate.com/recipes/511478/ (r1)
import math
import functools
def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.
    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.
    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1
## End from http://code.activestate.com/recipes/511478/
import sys
res = {}
mem_res = {}
memw_res = {}
samples = {}
max_res = {}
def load_file(f):
    # Parse one results file: collect the per-job timing samples (column 6)
    # keyed by the first column, and accumulate the remaining counters into
    # the global dictionaries above.
    res = {}
    with open(f) as fp:
        for line in fp:
            s = line.split()
            if s[0] not in res:
                res[s[0]] = [int(s[5])]
                mem_res[s[0]] = int(s[8])
                memw_res[s[0]] = int(s[9])
                samples[s[0]] = int(s[4])
                max_res[s[0]] = int(s[5])
            else:
                res[s[0]].append(int(s[5]))
                mem_res[s[0]] += int(s[8])
                memw_res[s[0]] += int(s[9])
                max_res[s[0]] = max(int(s[5]), max_res[s[0]])
    return res
def plot_datum(r, xlim=None, title=None):
    plt.figure(figsize=(6.4, 2.4)) # 6.4 x 4.8 is default
    if title:
        print("data name: " + title)
    print("min: %f"%min(r))
    print("mean: %f"%numpy.mean(r))
    print("95th pct: %f"%percentile(sorted(r), 0.95))
    print("99th pct: %f"%percentile(sorted(r), 0.99))
    print("max: %f"%max(r))
    print("st dev: %f"%numpy.std(r))
    plt.hist(r, 200)
    plt.ylabel("Samples")
    if xlim:
        plt.xlim(xlim)
    plt.xlabel("Milliseconds per job")
    # From https://stackoverflow.com/a/52961228
    plt.axvline(r.mean(), color='k', linestyle='dashed', linewidth=1)
    min_ylim, max_ylim = plt.ylim()
    min_xlim, max_xlim = plt.xlim()
    mid_xlim = min_xlim + (max_xlim - min_xlim) / 2
    plt.text(mid_xlim, max_ylim*0.9, 'Mean: {:.2f}'.format(r.mean()))
    # End from https://stackoverflow.com/a/52961228
    plt.text(mid_xlim, max_ylim*0.8, 'Standard Deviation: {:.2f}'.format(r.std()))
    plt.tight_layout()
    plt.show()

def load_and_plot(f, xlim=None):
    res = load_file(f)
    for k in res:
        r = res[k][1:-1] # drop the first and last samples
        r = numpy.divide(r, 1000*1000) # convert to ms
        plot_datum(r, xlim, k)
Then, to plot and get statistics for any of the first four cases, run the above, then:
load_and_plot("./darknet/Feb17-2002_8alone_10k.txt")
or, to limit the region plotted to a specific range, e.g., only from 10 ms to 140 ms, pass the range as the second parameter:
load_and_plot("./darknet/Feb17-2002_8alone_10k.txt", (10, 140))
We suggest running these plotting commands in a Python-notebook-like environment (e.g., Jupyter) for ease-of-use, but you can also just paste all of the above into a .py
file and run it from the console on a machine with a display.
For Fig. 14, you will need different plotting code. Run it after the code above, since it reuses the same imports along with the load_file() and percentile() helpers:
blue_rtas23 = "#3d85c6"
red_rtas23 = "#cc4125"
green_rtas23 = "#6aa84f"
def plot_slider(a_series, b_series):
    plt.plot(a_series[:,1], "-", color=blue_rtas23, label="YOLOv2 Instance A")
    plt.fill_between(range(a_series.shape[0]), a_series[:,0], a_series[:,2], color=(0.5,0.5,0.5,0.1))
    plt.plot(b_series[:,1], "--", color=red_rtas23, label="YOLOv2 Instance B")
    plt.fill_between(range(a_series.shape[0]), b_series[:,0], b_series[:,2], color=(0.5,0.5,0.5,0.1))
    plt.xlabel("Number of TPCs allocated to Instance B")
    plt.ylabel("Execution Time")
    plt.legend()
    plt.ylim(0,160)
    plt.xlim(0,a_series.shape[0]-1)
    # Add (ms) labels to axis
    locs, labels = plt.yticks()
    ms_ticks = ["%.0f ms"%(loc) for loc in locs]
    plt.yticks(locs, ms_ticks)
    # Fix x axis
    locs, labels = plt.xticks()
    x_ticks = ["%.0f"%(8-(loc+1)) for loc in locs]
    plt.xticks(locs, x_ticks)
    ax2 = plt.gca().twiny()
    ax2.set_xlim(plt.xlim())
    ax2.set_xticks(locs)
    ax2.set_xticklabels(["%.0f"%(loc+1) for loc in locs])
    ax2.set_xlabel("Number of TPCs allocated to Instance A")
def min_avg_max(f):
    res = load_file(f)
    # XXX: We assume only one data series in the file
    for k in res:
        r = res[k][1:-1]
        r = numpy.divide(r, 1000*1000) #convert to ms
        max_i = max(range(len(r)), key=r.__getitem__)
        print("min: %f"%min(r))
        print("mean: %f"%numpy.mean(r))
        print("95th pct: %f"%percentile(sorted(r), 0.95))
        print("99th pct: %f"%percentile(sorted(r), 0.99))
        print("max: %f"%max(r))
        print("st dev: %f"%numpy.std(r))
        return (min(r), numpy.mean(r), max(r))
Then, to generate the plot, replace the entries in the prefix_ae
list below with those printed by run_split.sh
, and run:
prefix_ae=["Mar05-1321", "Mar05-1431", "Mar05-1542", "Mar05-1652", "Mar05-1802", "Mar05-1912", "Mar05-2023"]
a_ae_series = numpy.zeros((7, 3))
b_ae_series = numpy.zeros((7, 3))
for i in range(7):
    a_ae_series[i] = min_avg_max("./darknet/"+prefix_ae[i]+"-split10k-A.txt")
    b_ae_series[i] = min_avg_max("./darknet/"+prefix_ae[i]+"-split10k-B.txt")
plot_slider(a_ae_series, b_ae_series)
(These cleanup steps are for after all experiments are complete.)
From the parent directory of the litmus-rt
folder:
rm -rf litmus-rt liblitmus # Delete source, configuration, and build files
sudo rm /boot/*litmus* # Delete installed kernel
And uninstall any packages added to support building the kernel (e.g. flex
, bison
, and libssl-dev
).
Undo all the tunings:

- Remove the GRUB_DEFAULT and GRUB_CMDLINE_LINUX_DEFAULT lines from /etc/default/grub.
- Run sudo update-grub.
- Re-enable automatic interrupt balancing with sudo systemctl enable irqbalance.

To verify that libsmctrl and cuda_scheduling_examiner are portable to a variety of GPUs of different architectures, we tested our tools in several Google Cloud VMs with different GPUs. To minimize execution time (and thus cost), we mount our internal fileserver onto the VM with pre-built binaries. External users will need to install the CUDA toolkit, and download and build our tools using the above instructions.
Tested VM configs:
Machine Type | GPUs |
---|---|
a2-highgpu-1g | 1 x NVIDIA A100 40GB |
n1-standard-1 | 1 x NVIDIA Tesla P100 |
n1-standard-1 | 1 x NVIDIA T4 |
In the instructions below, substitute the driver download link with the latest from the NVIDIA driver archive.
Create a new VM, and:
sudo apt update
sudo apt install -y sshfs build-essential
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/530.30.02/NVIDIA-Linux-x86_64-530.30.02.run
chmod +x NVIDIA-Linux-x86_64-530.30.02.run
sudo ./NVIDIA-Linux-x86_64-530.30.02.run --silent --no-opengl-files
sudo mkdir /playpen
sudo chmod 777 /playpen
mkdir /playpen/jbakita
sshfs -o default_permissions,ssh_command="ssh -J jbakita@rtsrv-eth.telenet.unc.edu" jbakita@bonham.cs.unc.edu:/playpen/jbakita /playpen/jbakita
Enter credentials (twice), then:
cd /playpen/jbakita/gpu_subdiv/cuda_scheduling_examiner_mirror/
./bin/runner configs/rtas23_sayhellortas.json
On another machine with X available, run python2 ./scripts/view_blocksbysm.py -d results/rtas23_sayhellortas
and verify that “Hello RTAS” is visible.