Artifact Evaluation Instructions for “Hardware Compute Partitioning on NVIDIA GPUs”

These are the steps required to reproduce the experiments in our paper:

J. Bakita and J. H. Anderson, “Hardware Compute Partitioning on NVIDIA GPUs”, Proceedings of the 29th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 54-66, May 2023. (PDF)


Overview

There are three software artifacts from our work:

  1. Our libsmctrl library, which allows for GPU partitioning
  2. Our modified version of the cuda_scheduling_examiner project, used for the microbenchmarks throughout the paper
  3. Our version of Darknet and YOLOv2 used in the case study, modified to use libsmctrl

We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout.

Artifact #1: libsmctrl

This core library (browse code online) implements SM/TPC partitioning and is a prerequisite for all other steps. As noted in the paper, this library supports GPUs of compute capability 3.5 and later (see NVIDIA’s documentation to look up your GPU’s compute capability) on CUDA 8.0 through CUDA 12.1.

To obtain and build our library, ensure that CUDA and gcc are installed, then run:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/
cd libsmctrl
make libsmctrl.a

See libsmctrl.h for details on how to use our API, if you would like to experiment with it.
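
As a quick orientation, the sketch below shows how partitioning is typically applied from application code. It assumes the libsmctrl_set_stream_mask() call and the disable-bit mask semantics described in libsmctrl.h; treat it as an illustrative sketch, not authoritative documentation, and check the header for the exact signatures.

// Illustrative sketch only -- verify names and mask semantics against libsmctrl.h.
// Assumption: masks are per-TPC disable bitmasks (a set bit removes that TPC).
#include <cuda_runtime.h>
#include <stdint.h>
#include "libsmctrl.h"

int main() {
  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  // Hypothetical split on a 6-TPC GPU: TPCs 0-2 for stream A, TPCs 3-5 for stream B.
  libsmctrl_set_stream_mask(stream_a, ~0x07ull); // disable everything except TPCs 0-2
  libsmctrl_set_stream_mask(stream_b, ~0x38ull); // disable everything except TPCs 3-5

  // ...launch kernels into stream_a and stream_b as usual...

  cudaStreamDestroy(stream_a);
  cudaStreamDestroy(stream_b);
  return 0;
}

Link such a program against the libsmctrl.a built above, plus your usual CUDA libraries.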

Artifact #2: cuda_scheduling_examiner

This tool, originally by Nathan Otterness, was modified in our work to support SM partitioning.

Setup

We are in the process of upstreaming our changes to the tool. In the meantime, you can obtain code to reproduce the experiments via the following steps:

git clone https://github.com/JoshuaJB/cuda_scheduling_examiner_mirror.git -b rtas23-ae
cd cuda_scheduling_examiner_mirror

Then edit LIBSMCTRL_PATH at the top of the Makefile to reflect the location that libsmctrl was downloaded to, then run:

make all
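
For reference, if libsmctrl were cloned to /home/ae/libsmctrl (a hypothetical path, matching the example used later in these instructions), the edited line would read something like the following; the exact assignment syntax in the Makefile may differ:

LIBSMCTRL_PATH = /home/ae/libsmctrl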

If you have difficulty building the tool, please see the included README.md for help. Please ensure you have Python 2, Python 3, python-tk, python-matplotlib, python3-matplotlib, and libsmctrl available to fully use this tool and its included scripts.

Running Experiments

This tool was used to generate Figures 3, 4, 10, 11, 12, and 13 in the paper, with the following correspondence between figure and GPU:

Figure  GPU
3       GTX 1060 (3GB)
4       GTX 1060 (3GB) and Tesla P100
10      GTX 970
11      GTX 1060 (3GB)
12      GTX 1060 (3GB)
13      Titan V

All experiments should run on any libsmctrl-supported GPU, but the desired effect may not be evident if your GPU is much larger or smaller than ours. It should not be necessary to disable graphical output while running these experiments, but we did so in ours. If you are an artifact evaluator and would like access to our GPU systems, please reach out to us via the artifact evaluation chair, and we can promptly provide access (with the exception of the Tesla P100, which was rented).

Given a configured GPU, our experiments can be reproduced by running ./bin/runner configs/<config> in the cuda_scheduling_examiner_mirror directory, with <config> replaced by one of the following names:

Figure 3: rtas23_nosplit.json (left) and rtas23_split.json (right)
  Needs at least 6 TPCs for partitioning to be evident.
Figure 4: rtas23_striped_1060.json (left) and rtas23_striped_p100.json (right)
  The left requires a GPU with 1 SM/TPC (e.g., GTX 970, 1060, 1070, 1080, and earlier), and the right a GPU with 2 SMs/TPC (e.g., Tesla P100, Titan V, and newer).
Figure 10: rtas23_sayhellortas.json
  Requires a GPU with at least 12 TPCs for everything to fit.
Figure 11: rtas23_priority_blocking.json
  Requires a GPU with 32 WDU slots (compute capability 3.5, 3.7, 5.0, 5.2, or 6.1) and at least 5 TPCs for everything to fit. Our plotting software struggles to correctly display this complicated plot, and tends to errantly render thread blocks on top of each other. The plots in the paper were edited from the SVG output of view_blocksbysm.py to correct block overlaps. However, blocking should still be evident, even if the plot is somewhat garbled.
Figure 12: rtas23_greedy_blocking.json
  The outcome of this experiment is a function of which stream manages to launch its kernels first. The streams race to the GPU, and we find that on our system, stream 1 wins about 50% of the time (yielding the result at left) and stream 2 wins the remaining 50% of the time (yielding the result at right). The distribution may be slightly or significantly different on your system. You can force one ordering over the other by increasing the “release_time” of the corresponding kernel to a fraction of a second.

Each result can be plotted by running ./scripts/view_blocksbysm.py -d results/<config name without .json>. This requires an X server. If connecting remotely, we suggest using X forwarding by adding the -Y parameter to your ssh invocation.

Figure 13: SE-packed vs. SE-distributed

This figure is a composite of many experiments and takes about half an hour to run in total. To run and plot these experiments (a consolidated command sequence is shown after the list):

  1. Build the libsmctrl shared library (needed for pysmctrl) by running make libsmctrl.so in the libsmctrl directory.
  2. cd into the cuda_scheduling_examiner_mirror directory.
  3. Set the PYTHONPATH environment variable to the location you downloaded libsmctrl to, e.g., export PYTHONPATH=/home/ae/libsmctrl.
  4. Run ./scripts/test_cu_mask.py -d 1 -s for the SE-distributed experiments.
  5. Run ./scripts/test_cu_mask.py -d 1 for the SE-packed experiments.
  6. Plot both sets of experiments by running ./scripts/view_scatterplots_rtns.py -d results/cu_mask.
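
For convenience, the full sequence looks like the following (assuming, hypothetically, that libsmctrl and cuda_scheduling_examiner_mirror were both cloned under /home/ae):

cd /home/ae/libsmctrl && make libsmctrl.so
cd /home/ae/cuda_scheduling_examiner_mirror
export PYTHONPATH=/home/ae/libsmctrl
./scripts/test_cu_mask.py -d 1 -s   # SE-distributed
./scripts/test_cu_mask.py -d 1      # SE-packed
./scripts/view_scatterplots_rtns.py -d results/cu_mask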

Table II: GPUs Supported

The above instructions can be used to test for partitioning support on all platforms.

For convenience on embedded systems, we provide pre-built ARM64 binaries for testing on the Jetson TX2, TX1, and Nano. You need at least JetPack 3.2.1 (2018, CUDA 9.0) for the pre-built binaries to run (JetPack 4.4+ with CUDA 10.2+ is strongly recommended). See the enclosed README.md for details, or just run:

sudo apt install python-tk
wget https://www.cs.unc.edu/~jbakita/rtas23-ae/libsmctrl_test_jetson.zip
unzip libsmctrl_test_jetson
cd libsmctrl_test_jetson
./bin/runner rtas23_striped_tx2.json
./bin/runner rtas23_unstriped_tx2.json
python2 ./view_blocksbysm.py -d results/

For the last command, X is required. (You can alternatively copy results and *.py to another machine with python2.7 and python-tk and run the plotting command there.) After closing the first plot, the second one should appear. If the two plots differ, stream-based SM partitioning works on your platform!

Helpful tip: See this page for info about which CUDA and kernel versions are bundled with each JetPack release. Use the instructions on this page to quickly flash a newer release without the messy and unnecessary JetPack SDK Manager.

Artifact #3: Modified Darknet & Case Study

By default, Darknet does not support running multiple concurrent inferences. We add support for this by modifying Darknet, principally by removing all uses of CUDA’s “NULL” stream (which implicitly synchronizes across threads) and by changing global variables to be __thread-scoped instead. On top of this modified Darknet, we add a new mode to the detector tool that runs multiple inference threads in parallel with GPU partitioning from libsmctrl.
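
To make the nature of these changes concrete, here is a hypothetical fragment (not Darknet’s actual code) showing both patterns: an explicit per-thread stream in place of the NULL stream, and a __thread-scoped variable in place of a shared global.

// Hypothetical illustration of the modification patterns described above.
#include <cuda_runtime.h>

// Each inference thread gets its own stream and state rather than sharing globals.
static __thread cudaStream_t inference_stream;

void thread_init(void) {
  // Use an explicit per-thread stream instead of CUDA's NULL stream,
  // which implicitly synchronizes across threads.
  cudaStreamCreate(&inference_stream);
}

void copy_input_async(float *dst, const float *src, size_t bytes) {
  // Before: cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice); // uses the NULL stream
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, inference_stream);
}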

We base our modifications on the Darknet fork maintained by AlexeyAB, since it has several bugfixes and performance improvements over the original.

For CPU scheduling during the case study, we use the partitioned fixed-priority (P-FP) scheduler from the LITMUSRT 5.4 Linux kernel variant. This allows us to guarantee that the Darknet threads run at the highest priority, and that each thread starts its inference at exactly the same time.
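
For background, a thread typically becomes a fixed-priority LITMUSRT task via liblitmus roughly as sketched below; the parameter values are placeholders and the exact sequence used in darknet/src/detector.c may differ, so treat this only as an orientation to the API.

// Rough liblitmus sketch (placeholder values); assumes init_litmus() was called
// once at program start-up. Check liblitmus's headers for authoritative signatures.
#include <litmus.h>

void become_rt_task(int cpu) {
  struct rt_task param;

  init_rt_task_param(&param);
  param.exec_cost = ms2ns(10);           // placeholder budget
  param.period    = ms2ns(33);           // placeholder period
  param.cpu       = cpu;                 // P-FP is partitioned, so pick a CPU
  param.priority  = LITMUS_HIGHEST_PRIORITY;

  be_migrate_to_domain(cpu);             // move onto the assigned CPU
  set_rt_task_param(gettid(), &param);
  task_mode(LITMUS_RT_TASK);             // switch this thread into real-time mode

  wait_for_ts_release();                 // block until release_ts triggers the release
}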

Edit Notice, Feb 2025: These instructions were not originally included nor evaluated due to an unexpected hardware failure, and were added in early 2025 upon request.

Setup LITMUSRT

You will need to download and build both the LITMUSRT kernel and the liblitmus toolkit/library. This will require super-user privileges.

Obtain, build, and install kernel

To install the kernel build dependencies, run:

sudo apt install -y build-essential flex bison libssl-dev

For systems without apt, use your distribution-provided tool to install make, gcc, flex, bison, and the headers for libssl. (These instructions assume you already have git installed.)

Next, download the kernel source (into the litmus-rt subdirectory) and our configuration, then build and install the kernel:

git clone --branch linux-5.4-litmus https://github.com/JoshuaJB/litmus-rt.git
cd litmus-rt
wget "https://www.cs.unc.edu/~jbakita/rtas23-ae/.config" # Provided default. Alternate: use `make config`
make --jobs=8 bzImage modules # For faster builds, run as many concurrent build jobs as you have cores, e.g. --jobs=32 on a 32-core system
sudo make INSTALL_MOD_STRIP=1 modules_install install # This copies the kernel to /boot/

Obtain and build liblitmus

To get and build liblitmus (in the liblitmus sibling directory to litmus-rt):

cd ..
git clone https://github.com/JoshuaJB/liblitmus.git
cd liblitmus
make

Next, boot into the LITMUSRT kernel: either restart your machine and select the LITMUSRT kernel in your bootloader, or, on Ubuntu, set the system to automatically boot into LITMUSRT on the next restart via the following commands:

sudo bash -c 'echo "GRUB_DEFAULT=saved" >> /etc/default/grub'
sudo update-grub
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.4.224-litmus+"

This (1) tells the GRUB bootloader to use the saved boot selection if it exists, and (2) uses the grub-reboot tool to set the selection for the next boot. (See man grub-reboot and info grub -n "Simple Configuration" for more.)

Now, restart the system. After it finishes booting back up, verify that the LITMUSRT kernel loaded by checking that uname -r prints 5.4.224-litmus+.

We recommend several other options to reduce CPU-related jitter while running the case study.

Prevent interrupts from being moved to cores being used for experiments:

sudo systemctl disable irqbalance

Redirect all interrupts to core 0 by default by adding the line GRUB_CMDLINE_LINUX_DEFAULT="irqaffinity=0" to /etc/default/grub.

Set aside some cores, such that they can only be used if explicitly requested by an application. We tested on a 16-core (32-thread) AMD 3950X, and so we set aside CPU15 and CPU31 (a.k.a. the 16th core and its paired SMT thread). Do this by adding isolcpus=15,31 inside the quotes of the GRUB_CMDLINE_LINUX_DEFAULT line mentioned above.

Reduce spurious timer interrupts by disabling the periodic scheduling timer on the isolated cores used for the experiments. Do this by adding nohz_full=15,31 inside the quotes of the GRUB_CMDLINE_LINUX_DEFAULT line mentioned above.
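
Putting the bootloader settings together, the relevant lines in /etc/default/grub on our 16-core setup would end up looking like the following (adjust the CPU numbers to match the cores you set aside):

GRUB_DEFAULT=saved
GRUB_CMDLINE_LINUX_DEFAULT="irqaffinity=0 isolcpus=15,31 nohz_full=15,31"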

After making these changes, run sudo update-grub and reboot to apply them. You may need to re-run the grub-reboot command above to ensure that you reboot back into the LITMUSRT kernel.

Once you have the system ready to go, disable low-power states to reduce job-release latency (bringing a core out of a low-power idle state is slow). Replace cpu15 and cpu31 below with your choice of CPUs (if different).

echo "n/a" > /sys/devices/system/cpu/cpu15/power/pm_qos_resume_latency_us
echo "n/a" > /sys/devices/system/cpu/cpu31/power/pm_qos_resume_latency_us

To learn more about this setting, see the Linux documentation.

Setup Darknet

Our variant of Darknet is available on GitHub. To build Darknet, the CUDA toolkit must already be installed.

Ensure that libsmctrl and liblitmus have already been downloaded and compiled.

To download and compile it, run:

git clone https://github.com/JoshuaJB/darknet.git -b rtas23-ae
cd darknet
make --jobs=8

Ignore the (many) build warnings; these are issues in upstream Darknet and should not affect the case study. (The build assumes that the libsmctrl and liblitmus directories are at ../libsmctrl and ../liblitmus respectively. Specify the LIBSMCTRL or LIBLITMUS variables to make if they are stored elsewhere.)
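
For example, with the dependencies in non-default locations (paths hypothetical), the build could be invoked as:

make --jobs=8 LIBSMCTRL=/path/to/libsmctrl LIBLITMUS=/path/to/liblitmus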

Download Weights for YOLOv2

Running YOLOv2 requires the model weights. To download them from the official site, run (from within the darknet directory):

wget https://pjreddie.com/media/files/yolov2-voc.weights

(Requires at least 194 MiB of free disk space.)

Obtain Input Data

You will need to obtain the PascalVOC 2012 training/validation dataset that is used as input to YOLOv2 during the case study. The official download is available here: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar (2GB .tar file). You can download and extract the dataset via the following commands:

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar

(This will put the images in ./VOCdevkit/VOC2012/JPEGImages/.)

Configure Case Study

The case study should be configured to minimize any potential CPU-related interference. In addition to the Linux configuration options mentioned above, the cores that each YOLOv2 instance runs on must be configured. The default configuration is set for the AMD 3950X, as used in our case study. The AMD 3950X is composed of two Core Complex (CCX) dies, each containing 8 cores (16 threads). Linux numbers these physical cores as CPU0–CPU7 for the first CCX and CPU8–CPU15 for the second. By default, we run the first YOLOv2 instance on the first CCX (specifically CPU7), and run any subsequent YOLOv2 instances on the second CCX (CPU8–CPU15). If your system topology is different, please reconfigure the literals on lines 2195, 2207–2208, 2223–2227, 2248–2252, and 2293–2294 in darknet/src/detector.c to specify different CPU IDs.
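
If you are unsure how Linux has numbered your cores, lscpu can print the mapping from logical CPU IDs to physical cores, which can guide the choice of those literals:

lscpu --extended   # the CPU and CORE columns show which CPU IDs share a physical core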

Configure darknet to be able to find the input data by updating darknet/data/voc.2012.all with the paths to the dataset downloaded earlier. We recommend doing this via a find-and-replace, e.g., if you extracted VOCdevkit to be a sibling directory to darknet:

sed -i "s/\/playpen\/jbakita\/DarkerNet/../g" data/voc.2012.all

Run Case Study Experiments

We add five new modes to the Darknet detector: rtas23-case1, rtas23-case2, rtas23-case3, rtas23-case4, and rtas23-split. The first four generate the data for Table III in the paper, and the last generates the data for Fig. 14 in the paper.

Check that nvidia-smi does not list any tasks running on the GPU before starting. If any tasks are running, please close them first (e.g., sudo systemctl stop gdm3 to shut down Gnome and X).

Enable the LITMUSRT P-FP scheduler:

sudo liblitmus/setsched P-FP
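
To confirm the switch took effect, the active plugin can be read back from the LITMUSRT proc interface (this path is standard for LITMUSRT kernels; adjust if your build differs):

cat /proc/litmus/active_plugin   # should print P-FP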

Running any of the first four tests is as simple as:

cd darknet
sudo -E CUDA_VISIBLE_DEVICES=0 ./darknet detector rtas23-case1 cfg/voc-rtas23.data cfg/yolov2-voc.cfg yolov2-voc.weights -out rtas23

in one console (replacing rtas23-case1 with your desired case; assuming CUDA device #0 is the one you want to test on; assuming the directory containing libsmctrl.so is ../libsmctrl). Once “Waiting for task system release…” is printed, run the following in another console to begin the benchmark:

sudo liblitmus/release_ts

This runs 10,000 samples by default to match the paper, taking about 70 minutes. Do not use the machine for anything else for the duration of the case study, to minimize the chance of interference. If you would like a shorter test, change the sample count ("10000") on line 2161 in src/detector.c to a smaller number; 1,000 samples should be reasonably sufficient, while taking only seven minutes to run.

To run the experiments for the fifth test (Fig. 14), edit darknet/run_split.sh to ensure that the path to liblitmus’s release_ts binary is correct, and to update the CUDA device from #0 as needed, then:

cd darknet
sudo -E ./run_split.sh

This runs 10,000 samples under each configuration, taking a total of approximately 8.5 hours to run all seven configurations.

Plotting Results

The output files for each experiment are all named with the prefix MmmDD-HHMM, where Mmm is the abbreviated month name, DD is the numeric day of the month, HH is the hour (24-hour clock), and MM is the minute at which the experiment started. The suffix varies with each experiment; for example, the rtas23-split runs produce files ending in -split10k-A.txt and -split10k-B.txt.

Plotting/analyzing these files requires Python 3, Matplotlib, and NumPy. Install these on Ubuntu via sudo apt install python3-matplotlib python3-numpy.

Here is the plotting and statistics code for Cases #1–4:

import numpy
import matplotlib.pyplot as plt

## From http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1
## End from http://code.activestate.com/recipes/511478/

import sys

res = {}
mem_res = {}
memw_res = {}
samples = {}
max_res = {}

def load_file(f):
    res = {}
    with open(f) as fp:
        for line in fp:
            s = line.split()
            if s[0] not in res:
                res[s[0]] = [int(s[5])]
                mem_res[s[0]] = int(s[8])
                memw_res[s[0]] = int(s[9])
                samples[s[0]] = int(s[4])
                max_res[s[0]] = int(s[5])
            else:
                res[s[0]].append(int(s[5]))
                mem_res[s[0]] += int(s[8])
                memw_res[s[0]] += int(s[9])
                max_res[s[0]] = max(int(s[5]), max_res[s[0]])
    return res

def plot_datum(r, xlim=None, title=None):
    plt.figure(figsize=(6.4, 2.4)) # 6.4 x 4.8 is default
    if title:
        print("data name: " + title)
    print("min: %f"%min(r))
    print("mean: %f"%numpy.mean(r))
    print("95th pct: %f"%percentile(sorted(r), 0.95))
    print("99th pct: %f"%percentile(sorted(r), 0.99))
    print("max: %f"%max(r))
    print("st dev: %f"%numpy.std(r))
    plt.hist(r, 200)
    plt.ylabel("Samples")
    if xlim:
        plt.xlim(xlim)
    plt.xlabel("Milliseconds per job")
    # From https://stackoverflow.com/a/52961228
    plt.axvline(r.mean(), color='k', linestyle='dashed', linewidth=1)
    min_ylim, max_ylim = plt.ylim()
    min_xlim, max_xlim = plt.xlim()
    mid_xlim = min_xlim + (max_xlim - min_xlim) / 2
    plt.text(mid_xlim, max_ylim*0.9, 'Mean: {:.2f}'.format(r.mean()))
    # End from https://stackoverflow.com/a/52961228
    plt.text(mid_xlim, max_ylim*0.8, 'Standard Deviation: {:.2f}'.format(r.std()))
    plt.tight_layout()
    plt.show()

def load_and_plot(f, xlim=None):
    res = load_file(f)
    for k in res:
        r = res[k][1:-1]
        r = numpy.divide(r, 1000*1000) #convert to ms
        plot_datum(r, xlim, k)

Then, to plot and get statistics for any of the first four cases, run the above, then:

load_and_plot("./darknet/Feb17-2002_8alone_10k.txt")

or, to limit the region plotted to a specific range, e.g., only from 10 ms to 140 ms, pass the range as the second parameter:

load_and_plot("./darknet/Feb17-2002_8alone_10k.txt", (10, 140))

We suggest running these plotting commands in a Python-notebook-like environment (e.g., Jupyter) for ease-of-use, but you can also just paste all of the above into a .py file and run it from the console on a machine with a display.

For Fig. 14, you will need different plotting code:

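# Note: this block reuses numpy, matplotlib.pyplot (plt), percentile(), and
# load_file() from the Cases #1-4 block above, so run that block first.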
blue_rtas23 = "#3d85c6"
red_rtas23 = "#cc4125"
green_rtas23 = "#6aa84f"
def plot_slider(a_series, b_series):
    plt.plot(a_series[:,1], "-", color=blue_rtas23, label="YOLOv2 Instance A")
    plt.fill_between(range(a_series.shape[0]), a_series[:,0], a_series[:,2], color=(0.5,0.5,0.5,0.1))
    plt.plot(b_series[:,1], "--", color=red_rtas23, label="YOLOv2 Instance B")
    plt.fill_between(range(a_series.shape[0]), b_series[:,0], b_series[:,2], color=(0.5,0.5,0.5,0.1))
    plt.xlabel("Number of TPCs allocated to Instance B")
    plt.ylabel("Execution Time")
    plt.legend()
    plt.ylim(0,160)
    plt.xlim(0,a_series.shape[0]-1)
    # Add (ms) labels to axis
    locs, labels = plt.yticks()
    ms_ticks = ["%.0f ms"%(loc) for loc in locs]
    plt.yticks(locs, ms_ticks)

    # Fix x axis
    locs, labels = plt.xticks()
    x_ticks = ["%.0f"%(8-(loc+1)) for loc in locs]
    plt.xticks(locs, x_ticks)

    ax2 = plt.gca().twiny()
    ax2.set_xlim(plt.xlim())
    ax2.set_xticks(locs)
    ax2.set_xticklabels(["%.0f"%(loc+1) for loc in locs])
    ax2.set_xlabel("Number of TPCs allocated to Instance A")
def min_avg_max(f):
    res = load_file(f)

    # XXX: We assume only one data series in the file
    for k in res:
        r = res[k][1:-1]
        r = numpy.divide(r, 1000*1000) #convert to ms
        max_i = max(range(len(r)), key=r.__getitem__)
        print("min: %f"%min(r))
        print("mean: %f"%numpy.mean(r))
        print("95th pct: %f"%percentile(sorted(r), 0.95))
        print("99th pct: %f"%percentile(sorted(r), 0.99))
        print("max: %f"%max(r))
        print("st dev: %f"%numpy.std(r))
        return (min(r), numpy.mean(r), max(r))

Then, to generate the plot, replace the entries in the prefix_ae list below with those printed by run_split.sh, and run:

prefix_ae=["Mar05-1321", "Mar05-1431", "Mar05-1542", "Mar05-1652", "Mar05-1802", "Mar05-1912", "Mar05-2023"]
a_ae_series = numpy.zeros((7, 3))
b_ae_series = numpy.zeros((7, 3))
for i in range(7):
    a_ae_series[i] = min_avg_max("./darknet/"+prefix_ae[i]+"-split10k-A.txt")
    b_ae_series[i] = min_avg_max("./darknet/"+prefix_ae[i]+"-split10k-B.txt")
plot_slider(a_ae_series, b_ae_series)
plt.show() # Display the figure (needed when running outside a notebook)

Uninstalling LITMUSRT

(For after all experiments are complete.)

From the parent directory of the litmus-rt folder:

rm -rf litmus-rt liblitmus # Delete source, configuration, and build files
sudo rm /boot/*litmus*     # Delete installed kernel

Also uninstall any packages added to support building the kernel (e.g., flex, bison, and libssl-dev).

Undo all the tunings:

  1. Remove the GRUB_DEFAULT and GRUB_CMDLINE_LINUX_DEFAULT lines from /etc/default/grub.
  2. Reset GRUB: sudo update-grub.
  3. Reenable IRQ balancing: sudo systemctl enable irqbalance.
  4. Reboot.

[Internal Only] Reproduce on Google Cloud

To verify that libsmctrl and cuda_scheduling_examiner are portable to a variety of GPUs of different architectures, we tested our tools in several Google Cloud VMs with different GPUs. To minimize execution time (and thus cost), we mount our internal fileserver, which holds pre-built binaries, onto the VM. External users will need to install the CUDA toolkit, then download and build our tools using the above instructions.

Tested VM configs:

Machine Type    GPUs
a2-highgpu-1g   1 x NVIDIA A100 40GB
n1-standard-1   1 x NVIDIA Tesla P100
n1-standard-1   1 x NVIDIA T4

Using Internal Data Server

In the instructions below, substitute the driver link with the latest one from the NVIDIA driver archive.

Create a new VM, then run:

sudo apt update
sudo apt install -y sshfs build-essential
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/530.30.02/NVIDIA-Linux-x86_64-530.30.02.run
chmod +x NVIDIA-Linux-x86_64-530.30.02.run
sudo ./NVIDIA-Linux-x86_64-530.30.02.run --silent --no-opengl-files
sudo mkdir /playpen
sudo chmod 777 /playpen
mkdir /playpen/jbakita
sshfs -o default_permissions,ssh_command="ssh -J jbakita@rtsrv-eth.telenet.unc.edu" jbakita@bonham.cs.unc.edu:/playpen/jbakita /playpen/jbakita

Enter credentials (twice), then:

cd /playpen/jbakita/gpu_subdiv/cuda_scheduling_examiner_mirror/
./bin/runner configs/rtas23_sayhellortas.json

On another machine with X available, run python2 ./scripts/view_blocksbysm.py -d results/rtas23_sayhellortas and verify that “Hello RTAS” is visible.