About

This page details the steps to reproduce the experiments for the paper: Predictable GPU Sharing in Component-Based Real-Time Systems.

The SMLP (Streaming-Multiprocessor Locking Protocol) is a locking protocol for the streaming multiprocessors (SMs) of NVIDIA GPUs. A GPU kernel can run on any number of SMs, which enables the novel pi-blocking analysis presented in the paper. Please read the paper for more information.

There are two experiments. Experiment 1 simulates the SMLP and compares it against the coarse-grain OMLP. Experiment 2 confirms the expected result that highly parallel GPU kernels benefit from additional SMs only up to a point. Importantly, allocating extra SMs to a kernel beyond its parallelism requirements does not improve the kernel's execution time, which allows the SMLP to utilize GPUs better than a coarse-grain lock.


Experiment 1

First, compile the simulation on either Windows or Linux (see the platform-specific instructions below). To generate graphs, you will need Python 3.10+ and a way to open .ipynb files; one option is Jupyter, or you can use the Python files directly if you do not want to use the notebook.

The simulation has three modes: generating task sets, simulating task sets, and running multiple simulations simultaneously. Generating a single task set and then simulating it is helpful when you want to visualize only one simulation pass.

You may wish to fine-tune the parameters so the simulation runs well on your machine. To do so, see gedf_sim.cpp and note that it creates two threads for every value of Theta. Importantly, see gedf_multisim.cpp and note the PARALLEL_SETS definition (by default, PARALLEL_SETS=5). When running multiple simulations, PARALLEL_SETS instances of gedf_sim are created; thus, the total number of simulation threads running concurrently will be at most 2 * |ThetaSet| * PARALLEL_SETS. With the parameters given below, Theta takes four values (1.5 to 3 in steps of 0.5), so at most 2 * 4 * 5 = 40 threads will run concurrently.

When running the binary without any arguments, the necessary parameters are printed out. To generate results similar to those used in the paper, we recommend the following arguments:
.\ECRTS24ArtifactEval.exe -a -Mmin 4 -Mmax 16 -Mstep 4 -tminset 3 10 50 -tmaxset 33 100 200 -n 150 -taskSets 1000 -utilMin 0.2 -utilMax 0.9 -utilStep 0.1 -Hmin 8 -Hmax 64 -Hstep 8 -Tmin 1.5 -Tmax 3 -Tstep 0.5 -p 0.5 -o p0.5.csv
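On Linux, pass the same arguments to the smlpsim binary produced by the Linux build instructions below, e.g.:
./smlpsim -a -Mmin 4 -Mmax 16 -Mstep 4 -tminset 3 10 50 -tmaxset 33 100 200 -n 150 -taskSets 1000 -utilMin 0.2 -utilMax 0.9 -utilStep 0.1 -Hmin 8 -Hmax 64 -Hstep 8 -Tmin 1.5 -Tmax 3 -Tstep 0.5 -p 0.5 -o p0.5.csv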

These arguments will generate 1000 task sets for every combination of U, M, and Tmin/Tmax. Each task set is then simulated for every combination of H and Theta. If you do not have enough time to simulate such a large set of data, consider reducing -taskSets 1000 to a lower value, or download this csv of results for use in graph generation.

Note: the generated csv file will contain millions of lines (with the arguments above: 8 utilizations × 4 values of M × 3 Tmin/Tmax pairs × 1000 task sets, each simulated for 8 values of H × 4 values of Theta, i.e., on the order of 3 million results), which not all csv file viewers may be able to render.

Once you have generated results, this python notebook can parse the data and generate graphs.

Repeat the above experiment with different values for p (such as 0.1 and 0.25).
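For example, for p = 0.25 (only -p and the output file name change):
.\ECRTS24ArtifactEval.exe -a -Mmin 4 -Mmax 16 -Mstep 4 -tminset 3 10 50 -tmaxset 33 100 200 -n 150 -taskSets 1000 -utilMin 0.2 -utilMax 0.9 -utilStep 0.1 -Hmin 8 -Hmax 64 -Hstep 8 -Tmin 1.5 -Tmax 3 -Tstep 0.5 -p 0.25 -o p0.25.csv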

 
The left graph uses H=16, M=8, Theta=2.5, Tmin=10, Tmax=100, and p=0.5. The right graph uses the same parameters, but with U fixed at 0.6 across the shown values of H.

You can compare your generated results to the graph above. Note how the pi-blocking under the OMLP increases dramatically as system utilization increases (since more jobs are issuing requests). The pi-blocking observed under the SMLP also increases, but because the SMLP only uses as many SMs as needed to optimally run each GPU kernel, the effect is significantly mitigated. Note: if you used a small value for -taskSets or -n, the graph might not be as smooth.


Experiment 2 - Linux

Requirements: g++ (tested on g++ 14), nvcc, and a CUDA version compatible with libsmctrl (as of this writing, 8.1 to 12.2). We tested on CUDA 12.0.

Experiment 2 currently cannot be compiled on Windows and requires Linux. Our experiments used Ubuntu 22.04.

This experiment verifies that additional SMs reduce the execution duration of highly parallel GPU kernels. To spatially partition the GPU, we use libsmctrl, which is compatible with NVIDIA GPUs.

Obtain libsmctrl from the official repo. Here is the version that was used for the paper.

Run make on libsmctrl to generate the .so file.

Make sure your LD_LIBRARY_PATH contains the directory where libsmctrl.so is located.
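For example (the install path here is hypothetical):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/libsmctrl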

If you haven't already, clone the GitHub repository for the experiments.

Run make in the libsmctrlTest directory.

You may wish to modify main.cu to match the number of TPCs on your NVIDIA GPU by changing TOTAL_TPCs. Additionally, you can make the kernel duration longer, or tune the kernel for a certain number of SMs, by changing the variable OPTIMAL_WIDTH. If you want your kernel to be optimal for a certain number of SMs, x, then pick an optimal width of 2048 * x (e.g., 2048 * 8 = 16384 for a kernel optimal on 8 SMs). You can also try values above the number of TPCs, which results in the same pattern.

Once the binary is generated, run it, and it will output the execution durations of GPU kernels when allocated a decreasing number of SMs.
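For reference, below is a minimal sketch of such a measurement loop. It is not the artifact's main.cu: the busyWork kernel, the TOTAL_TPCS value, and the output format are hypothetical, and it assumes libsmctrl's libsmctrl_set_global_mask API, in which a set bit in the mask disables the corresponding TPC (a TPC contains one or more SMs, depending on the architecture).

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>
#include "libsmctrl.h"

#define TOTAL_TPCS 16            // hypothetical; set to your GPU's TPC count
#define OPTIMAL_WIDTH (2048 * 8) // kernel tuned for 8 SMs, per the 2048*x rule

// Hypothetical compute-bound kernel standing in for the artifact's workload.
__global__ void busyWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; k++)
            data[i] = data[i] * 1.000001f + 0.5f;
}

int main() {
    int n = OPTIMAL_WIDTH;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Allocate all TPCs first, then one fewer on each iteration.
    for (int tpcs = TOTAL_TPCS; tpcs >= 1; tpcs--) {
        // Bits 0..tpcs-1 clear (TPC enabled); all higher bits set (disabled).
        uint64_t disable_mask = ~((1ull << tpcs) - 1);
        libsmctrl_set_global_mask(disable_mask);

        cudaEventRecord(start);
        busyWork<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%2d TPCs enabled: %.3f ms\n", tpcs, ms);
    }

    cudaFree(d);
    return 0;
}

When compiling a sketch like this with nvcc, link against the libsmctrl.so built earlier. With this kernel shape, execution time should stop improving once the enabled TPCs cover the SMs implied by OPTIMAL_WIDTH, matching the pattern described above.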


Compare your results to the image above. The kernel used above runs optimally with 8 SMs and sees no meaningful improvement when allocated more. When allocated fewer, note how the execution duration increases.


Experiment 1: Windows Instructions

Requirements: Visual Studio. The experiments were compiled with Visual Studio 2022, but other versions should work fine.

First, clone the GitHub repository.

Open the ECRTS24ArtifactEval.sln solution file. If you are on a different version of VS, you will be asked to convert the project files; this is safe to do.

Ensure the project build configuration is in "Release" mode. Debug mode runs significantly slower.

These options are already applied, but if you are recreating the build, note that we use the following options: /std:c++latest (/std:c++20 will also work), /MT, and /fp:precise.

Once the executable is compiled, please follow the general instructions above.


Experiment 1: Linux Instructions

Requirements: g++ (tested on g++14)

First, clone the GitHub repository.

Make sure g++ is installed:
sudo apt install g++

To compile the sim, run:
g++ -o smlpsim main.cpp util.cpp task_gen.cpp gedf_multisim.cpp gedf_sim.cpp gedf_sim_thread.cpp
You may also want to add -O2 (to match the Release configuration used on Windows) and, on toolchains that require it, -pthread for std::thread support.

Once the executable is compiled, please follow the general instructions above.