Artifact Evaluation Instructions for “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”

This page details the artifacts from our paper:

J. Bakita and J. H. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium, to appear, May 2024. (PDF)

We also describe how to use these artifacts to reproduce our experiments.

Overview

There are two software artifacts from our work:

  1. The benchmark suite gpu-microbench, for examining the scheduling behavior of independent GPU engines
  2. Our nvdebug tool, a Linux kernel module for probing NVIDIA GPU state

We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout with the CUDA SDK and the equivalent of Ubuntu’s build-essential package installed.

Artifact #1: gpu-microbench

We use this suite of high-accuracy benchmarks and measurement tools to examine the scheduling behavior of individual GPU contexts on the GPU copy and compute engines.

The source code for this artifact can be browsed online. To obtain and build this suite, simply ensure that the CUDA SDK is available, then:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/gpu-microbench.git/
cd gpu-microbench
make all

To build the gl_copies benchmark, install the freeglut headers and shared library via the Ubuntu package freeglut3-dev, or from their website, then run:

cd copy_experiments
make gl_copies

Run any benchmark with the --help argument for usage information. To run a CUDA benchmark on a specific GPU in a multi-GPU system, set the environment variable CUDA_VISIBLE_DEVICES (docs). Note that, by default, CUDA orders devices by compute power. nvidia-smi and nvdebug instead number devices by PCI ID. To force CUDA’s device numbering (including that used by CUDA_VISIBLE_DEVICES) to match nvidia-smi, set the environment variable CUDA_DEVICE_ORDER to PCI_BUS_ID. See Appdx. A for an example.

Note that exec_logger and copy_monitor in the paper are respectively named preemption_logger and mon_cross_ctx_copies in the repository (subject to change).

Artifact #2: nvdebug

This Linux kernel module is used to obtain information about GPU topology and scheduling, as well as control scheduling to a more limited extent. See Appdx. A in the paper for more details on its capabilities.

The source code for this artifact can be browsed online. To obtain and build this module, ensure you have the kernel headers installed (the linux-headers-generic package on Ubuntu), then:

git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/nvdebug.git/ -b rtas24-ae
cd nvdebug
make
sudo insmod nvdebug.ko

For each NVIDIA GPU X installed in the system, this module exposes virtual files corresponding to that GPU in the /proc/gpuX/ directory. Note that GPUs are numbered by PCI ID (as with nvidia-smi), and that this is not the same ordering CUDA uses to number devices by default.
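
As a quick sanity check that the module loaded, the following Python sketch lists each per-GPU directory and the API files it exposes; the exact set of files varies by GPU generation and nvdebug version.

#!/usr/bin/env python3
# Sanity check: list each /proc/gpuX directory created by nvdebug and its API files.
# The exact set of files varies by GPU generation and nvdebug version.
import os

for entry in sorted(os.listdir("/proc")):
    if entry.startswith("gpu") and entry[3:].isdigit():
        gpu_dir = os.path.join("/proc", entry)
        print(entry + ":", ", ".join(sorted(os.listdir(gpu_dir))))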

Reproducing our Experiments

The following table details which GPU and tool are used to obtain and render the experimental result in each figure of our paper.

Result | GPU | Tool | Renderer
Fig. 5 | GTX 1060 3GB | cuda_scheduling_examiner (rtas24_false_dependency.json config) | cuda_scheduling_examiner
Fig. 6 | GTX 1060 3GB | gpu-microbench (preemption_logger) | Matplotlib
Fig. 7 | GTX 1060 3GB | gpu-microbench (mon_cross_ctx_copies) | Matplotlib
Fig. 8 | GTX 1080 Ti | gpu-microbench (preemption_logger and mon_cross_ctx_copies_and_compute) | Matplotlib
Fig. 9 | Jetson TX2 | gpu-microbench (preemption_logger and mon_cross_ctx_copies) | Matplotlib
Tbl. II | GTX 1060 3GB | nvdebug (device_info API) | LaTeX Table
Tbl. III | Jetson Orin | nvdebug (device_info API) | LaTeX Table
Tbl. IV | RTX 6000 Ada | nvdebug (device_info API) | LaTeX Table
Fig. 10 | RTX 6000 Ada and GTX 1080 Ti | gpu-microbench (copy_contender and gl_copies) | Matplotlib
Fig. 11 | GTX 1060 3GB, RTX 2080 Ti, RTX 6000 Ada, and Jetson Orin | nvdebug (lce_for_pceX, shared_lce_for_grceX, pce_mask, and device_info APIs) | Inkscape Drawing
Listing 1 | Any Pascal+ GPU | nvdebug (runlistX API) | LaTeX Listing

Note that in addition to our novel artifacts, we utilize the cuda_scheduling_examiner tool by Nathan Otterness.

Our experimental code is portable to any NVIDIA GPU of the Pascal architecture (2016) or newer, but most experiments must be run on a particular GPU to obtain the same result. Some experiments are more flexible, as indicated in the following descriptions of each.

Reproducing Fig. 5

This figure demonstrates R2: A task’s number of channels limits intra-task parallelism.

This experiment should work on any NVIDIA GPU with at least nine SMs on an x86_64 platform.

Setting up cuda_scheduling_examiner

Download and build cuda_scheduling_examiner using the official instructions. Alternatively, on a machine with the CUDA SDK installed, just run the following commands in the parent directory of where you would like cuda_scheduling_examiner to end up:

export PATH=$PATH:/usr/local/cuda/bin
git clone --depth 1 https://github.com/yalue/cuda_scheduling_examiner_mirror.git
cd cuda_scheduling_examiner_mirror
make
git clone https://github.com/WojciechMula/canvas2svg.git scripts/canvasvg
cd ..

Running Experiments for Fig. 5 (top)

Download the experimental configuration rtas24_false_dependency.json file and run it with cuda_scheduling_examiner:

wget https://www.cs.unc.edu/~jbakita/rtas24-ae/rtas24_false_dependency.json
cd cuda_scheduling_examiner_mirror
./bin/runner ../rtas24_false_dependency.json

This will generate ./results/rtas24_false_dependency_X.json (where X ∈ [1, 9]) for the baseline case (eight channels). Plot this via scripts/view_blocksbysm.py as included with cuda_scheduling_examiner:

./scripts/view_blocksbysm.py results/rtas24_false_dependency_*

Note that an X server is required for this plotting script. Please use X forwarding if this is a remote machine.

The result from this experiment should look similar to that at top in Fig. 5 of the paper, with K33 not starting until after K3 finishes.

Running Experiments for Fig. 5 (bottom)

Repeat the steps used for generating and plotting Fig. 5 (top), but prefix CUDA_DEVICE_MAX_CONNECTIONS=9 before ./bin/runner to increase the number of channels created by CUDA, as in:

CUDA_DEVICE_MAX_CONNECTIONS=9 ./bin/runner ../rtas24_false_dependency.json

The result from this experiment should look similar to that at bottom in Fig. 5 of the paper, with K33 starting immediately after release.

Reproducing Fig. 6

This figure demonstrates R4 (a runlist may have up to one task active per associated engine) for the Compute/Graphics Engine.

This experiment should work on any NVIDIA GPU of the Pascal generation or newer.

After setting up gpu-microbench (see above), simply co-run two instances of preemption_logger on the same GPU. preemption_logger prints status messages to stderr, and execution intervals to stdout. Save the output execution intervals for plotting; we recommend doing that via stream redirection.

Approximately 100 timeslices should be sufficient, and CPU-based times are unneeded (so pass -r). Putting this all together:

cd gpu-microbench
# Start "Exec Logger 1"
./preemption_logger 100 -r > fig6.1.log &
# Wait a few seconds for initialization, then start "Exec Logger 2"
./preemption_logger 100 -r > fig6.2.log &
# After "Exec Logger 1" finishes, run
./constant_cycles_kernel 100

As “Exec Logger 2” has been asked to log at least 100 timeslices, it will run indefinitely after “Exec Logger 1” completes, until constant_cycles_kernel is run to trigger timeslicing. In the paper, these timeslices of “Exec Logger 2” against constant_cycles_kernel are out-of-frame.

Examine fig6.1.log and fig6.2.log and note that the logged intervals are mutually exclusive (each line contains the comma-separated start and end times of one execution interval). We intend to provide code to reproduce our plot at a later date.
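
As a quick automated check of mutual exclusivity, the following sketch compares every interval pair across the two logs; it assumes the comma-separated start/end format described above:

#!/usr/bin/env python3
# Check that no execution interval in fig6.1.log overlaps one in fig6.2.log.
# Assumes each line holds a comma-separated "start, end" pair, as described above.
import numpy as np

a = np.loadtxt("fig6.1.log", delimiter=",", ndmin=2)
b = np.loadtxt("fig6.2.log", delimiter=",", ndmin=2)
overlaps = 0
for s1, e1 in a:
    for s2, e2 in b:
        if s1 < e2 and s2 < e1:  # standard interval-overlap test
            overlaps += 1
print(f"Overlapping interval pairs: {overlaps} (expect 0)")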

Reproducing Fig. 7

This figure demonstrates R4 (a runlist may have up to one task active per associated engine) for the Copy Engine.

This experiment should work on any CUDA-capable NVIDIA GPU.

After setting up gpu-microbench (see above), co-run two instances of “Copy Monitor” via mon_cross_ctx_copies, configured to monitor progress from the CPU. mon_cross_ctx_copies prints status messages to stderr, and the times at which pages are copied to stdout. Save the output copy timestamps for plotting; we recommend doing that via stream redirection.

To co-run two instances of “Copy Monitor”, run mon_cross_ctx_copies with two CPU-monitored (-c) copies of 50 MiB (12800 pages), and save the results to fig7.log:

cd gpu-microbench/copy_experiments
./mon_cross_ctx_copies -c -c -s 12800 > fig7.log

Press enter when prompted. Once the copies complete, the result file contains one comma-separated line of timestamps per copy, with one timestamp per copied page. So, for two copies of 100 pages, the output file will have two lines, each of 100 timestamps.

To plot the result file as fig7.pdf as in the paper, use the following python script:

#!/usr/bin/env python3
import numpy
import matplotlib
matplotlib.use('Agg') # Use a headless plotting frontend
import matplotlib.pyplot as plt
plt.rcParams["font.size"] = 8 # Default IEEEtran footnote size
plt.rcParams['figure.dpi'] = 96 # DPI used by inkscape

# Google Drive colors
dark_orange = '#e69138'
dark_blue = '#3d85c6'
dark_green = '#6aa84f'

# Load the copy log (change this if you named the log differently)
res_x = numpy.loadtxt("fig7.log", delimiter=",")
res_x -= res_x.min() # Rebase everything to zero
res_x /= 1000*1000 # Convert to milliseconds

plt.figure(figsize=(3.5,2))#, tight_layout = {'pad': 0})
plt.plot(res_x[0].transpose(), range(res_x.shape[1]), c=dark_blue, label="Copy Monitor 1")
plt.plot(res_x[1].transpose(), range(res_x.shape[1]), c=dark_orange, linestyle=":", label="Copy Monitor 2")

# Adjust the x limit to change how many ms are plotted, and the y limit to change how many MiB are shown
plt.xlim(0,20.48)
plt.ylim(0,7500)
plt.xlabel("Time [milliseconds (ms)]")
plt.ylabel("Data Copied")
plt.legend(loc="lower right")

# Fix y axis
locs, labels = plt.yticks()
locs = range(0, int(locs[-1]), int(6*1024*1024/4096)) # Every 6 MiB
y_ticks = ["%.0f MiB"%((page_cnt*4096)/(1024*1024)) for page_cnt in locs] # Convert pages to MiB
plt.yticks(locs, y_ticks)

plt.tight_layout() # Gets it closest to the desired size
plt.savefig("fig7.pdf") # Save plot to fig7.pdf

Note that the resultant plot may look different if your GPU copies at a different speed, or if the copies do not start at exactly the same time. The key property that should be consistent is that the amount of data copied only increases for one copy monitor at a given instant.
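
As a rough automated check of that property, the following sketch bins the fig7.log timestamps and counts how many bins saw progress from both monitors; only a small number of bins (e.g., those spanning a switch between monitors) should register both:

#!/usr/bin/env python3
# Rough check of the key property above: bin the per-page timestamps from
# fig7.log and count how many time bins saw progress from both copy monitors.
# A count near zero (out of ~1000 bins) indicates the copies were serialized.
import numpy as np

ts = np.loadtxt("fig7.log", delimiter=",", ndmin=2)  # one row of timestamps per copy
ts -= ts.min()
bin_edges = np.linspace(0, ts.max(), 1001)  # ~1000 bins across the experiment
h0, _ = np.histogram(ts[0], bin_edges)
h1, _ = np.histogram(ts[1], bin_edges)
both = int(np.count_nonzero((h0 > 0) & (h1 > 0)))
print(f"{both} of {len(bin_edges) - 1} bins saw progress from both monitors")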

Reproducing Fig. 8

This figure demonstrates R5 (i): Runlists enable independent inter-task parallelism.

This experiment should work on any CUDA-capable NVIDIA GPU with a separate copy runlist (all NVIDIA discrete GPUs we are aware of).

This experiment depends on the “Exec+Copy” benchmark, which is not yet of sufficient quality to add to the gpu-microbench repository. This will be fixed and posted after the ECRTS’24 deadline on March 1st…

Reproducing Fig. 9

This figure demonstrates R5 (ii): Independent inter-task parallelism is not possible without multiple runlists.

This experiment should work on any CUDA-capable NVIDIA GPU without a separate copy runlist (only the Jetson TX2/Parker SoC to our knowledge).

After setting up gpu-microbench (see above), co-run one instance of “Copy Monitor” via mon_cross_ctx_copies against one instance of “Exec Logger” via preemption_logger. The copy should be 200 MiB (51200 pages) or larger, and must be monitored from the CPU only. First, initialize the “Copy Monitor”, but do not press enter after initialization completes:

cd gpu-microbench
./copy_experiments/mon_cross_ctx_copies -c -s 51200 > fig9b.log

Once it finishes initializing, press Ctrl+Z to suspend it. Now launch the “Exec Logger”; logging 1,000 execution intervals should be plenty:

./preemption_logger 1000 > fig9a.log &

After this initializes (wait a few seconds), bring mon_cross_ctx_copies back to the foreground:

fg

Press enter and wait for the copy to complete. As in the instructions for Fig. 6, use ./constant_cycles_kernel 1000 to trigger preemption_logger to terminate and output its execution intervals. The format of fig9b.log is identical to that for Fig. 7, except that it contains only one line (one copy); the Fig. 7 plotting script can be adapted accordingly (see the sketch below). The format of fig9a.log is identical to that for Fig. 6; see the notes there on how to read the file. You should discover that the copy proceeds only periodically, and that no execution intervals correspond to the times during which it does not proceed (the first one or two execution intervals should continue uninterrupted during the entire copy).
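
Because fig9b.log contains only one comma-separated line, numpy.loadtxt returns a one-dimensional array by default, so the Fig. 7 script needs a small adjustment. The following is a minimal sketch of that adaptation, assuming the same timestamp units as the Fig. 7 script and a hypothetical output name of fig9b.pdf:

#!/usr/bin/env python3
# Plot fig9b.log (a single CPU-monitored copy) in the same style as the Fig. 7
# script; only the loading and indexing differ because there is one copy here.
import numpy
import matplotlib
matplotlib.use('Agg') # Use a headless plotting frontend
import matplotlib.pyplot as plt

res_x = numpy.loadtxt("fig9b.log", delimiter=",", ndmin=2) # keep a 2-D shape
res_x -= res_x.min() # Rebase everything to zero
res_x /= 1000*1000 # Convert to milliseconds (same conversion as the Fig. 7 script)

plt.figure(figsize=(3.5,2))
plt.plot(res_x[0], range(res_x.shape[1]), label="Copy Monitor 1")
plt.xlabel("Time [milliseconds (ms)]")
plt.ylabel("Pages Copied")
plt.legend(loc="lower right")
plt.tight_layout()
plt.savefig("fig9b.pdf")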

Reproducing Tbl. II, Tbl. III, and Tbl. IV

Via data from GPU topology registers, these tables support R6 (a runlist may be bound to more than one engine) and R7 (each engine is bound to only one runlist).

This experiment should work on any NVIDIA GPU of the Kepler generation or newer. The tables in the paper are from the GTX 1060 3GB, Jetson Orin, and RTX 6000 Ada respectively, but the results on any GPU should be consistent with our rules.

After setting up nvdebug (see above), dump the GPU topology using the device_info API:

cat /proc/gpu0/device_info

The resultant output shows a list of GPU engines and which runlist each is associated with. The key fields are Engine Type (together with its instance number, which distinguishes multiple engines of the same type) and Runlist ID; these give the engine-to-runlist bindings shown in the tables.

For example, here is the raw output for the GTX 1060 3GB used to construct Tbl. II:

| Host's Engine ID:  0
| Runlist ID:        0
| Interrupt ID:     12
| Reset ID:         12
| Engine Type:       0 (Graphics/Compute)
| BAR0 Base 0x00400000
|           instance 0
| Fault ID:          0
+---------------------+
| Host's Engine ID:  7
| Runlist ID:        0
| Interrupt ID:      5
| Reset ID:          6
| Engine Type:      19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
|           instance 0
| Fault ID:         21
+---------------------+
| Host's Engine ID:  8
| Runlist ID:        0
| Interrupt ID:      6
| Reset ID:          7
| Engine Type:      19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
|           instance 1
| Fault ID:         22
+---------------------+
| Host's Engine ID:  5
| Runlist ID:        5
| Interrupt ID:      7
| Reset ID:         21
| Engine Type:      19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
|           instance 2
| Fault ID:         27
+---------------------+
| Host's Engine ID:  1
| Runlist ID:        1
| Interrupt ID:     17
| Reset ID:         15
| Engine Type:      16 (NVDEC: NVIDIA Video Decoder)
| BAR0 Base 0x00084000
|           instance 0
| Fault ID:          2
+---------------------+
| Host's Engine ID:  3
| Runlist ID:        3
| Interrupt ID:     15
| Reset ID:         14
| Engine Type:      13 (SEC: Sequencer)
| BAR0 Base 0x00087000
|           instance 0
| Fault ID:         18
+---------------------+
| Host's Engine ID:  2
| Runlist ID:        2
| Interrupt ID:     16
| Reset ID:         18
| Engine Type:      14 (NVENC0: NVIDIA Video Encoder #0)
| BAR0 Base 0x001c8000
|           instance 0
| Fault ID:         25
+---------------------+
| Host's Engine ID:  6
| Runlist ID:        6
| Interrupt ID:     10
| Reset ID:         22
| Engine Type:      19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
|           instance 3
| Fault ID:         28
+---------------------+
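
To condense a dump like the above into per-engine runlist bindings (as needed for the tables), a small parsing sketch such as the following can be used; it assumes the "| Field: value" block layout shown above and gpu0:

#!/usr/bin/env python3
# Summarize engine-to-runlist bindings from nvdebug's device_info output.
# Assumes the "| Field:  value" block layout shown above; adjust the GPU number as needed.
runlist = engine = instance = None
with open("/proc/gpu0/device_info") as f:
    for line in f:
        line = line.strip(" |\n")
        if line.startswith("Runlist ID:"):
            runlist = line.split(":", 1)[1].strip()
        elif line.startswith("Engine Type:"):
            engine = line.split(":", 1)[1].strip()
        elif line.startswith("instance"):
            instance = line.split()[-1]
        elif line.startswith("+---") and engine is not None:  # end of one engine block
            print(f"Runlist {runlist}: {engine}, instance {instance}")
            runlist = engine = instance = None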

Reproducing Fig. 10

This figure demonstrates rule R8: Copy engines may appear to violate R7 due to copy-engine-specific shared hardware.

Each part of this experiment requires a different GPU. One GPU must map the GRCE for OpenGL texture uploads to the same PCE as used for CUDA copies from device to host, and the other GPU must not. Of our GPUs, we are only certain that the RTX 6000 Ada satisfies the first condition, whereas almost any NVIDIA GPU satisfies the second. Ideally, each GPU should be installed in an identical system to maximize comparability.

Experiments Without OpenGL Competitor

After setting up gpu-microbench (see above), run one instance of copy_contender. To run a 200 MiB copy from the GPU 1001 times (500 iterations), as in the paper:

cd gpu-microbench/copy_experiments
./copy_contender 51200 500 from > fig10.log

The output from this program is designed to be human-readable. To strip out text and formatting:

cat fig10.log | tr -d "[:alpha:]/" | cut -d " " -f 2 > fig10.log.clean

And to get the maximum copy length in milliseconds (as plotted in the paper):

cat fig10.log.clean | sort -n | tail -1

Repeat this experiment for each GPU to get the baseline “Without OpenGL Competitor” numbers.

Experiments With OpenGL Competitor

This experiment requires a working display physically attached to the GPU under test, with an OpenGL-supporting window manager running. On a headless Ubuntu system, installing the packages xorg, openbox (a minimal window manager), and freeglut3-dev (an OpenGL wrapper) is all the extra software necessary (these minimal packages avoid pulling in unnecessary dependencies). Note that a non-headless variant of the NVIDIA driver must be installed, such as that included in the nvidia-driver-535-server package.

Once a display is installed and working, with Xorg running, build (see above) and run gl_copies:

cd gpu-microbench/copy_experiments
DISPLAY=:0 ./gl_copies &

This should cause a window showing a flickering textured triangle to open on the display. Leave this running in the background and repeat the same copy_contender steps as in the no-competitor case.

Repeat this experiment for each GPU to get the “With OpenGL Competitor” numbers. You should find that the “With OpenGL Competitor” numbers are significantly higher on the GPU with a PCE mapping conflict.
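
A minimal sketch for that comparison, assuming one cleaned log was saved per configuration (the file names below are placeholders) and that the cleaned values are per-copy durations in milliseconds, as extracted above:

#!/usr/bin/env python3
# Compare worst-case copy lengths with and without the OpenGL competitor.
# File names are placeholders; point them at your cleaned copy_contender logs.
import numpy as np

baseline = np.loadtxt("fig10_without_gl.log.clean")
competitor = np.loadtxt("fig10_with_gl.log.clean")
print(f"Max copy length without competitor: {baseline.max():.2f} ms")
print(f"Max copy length with competitor:    {competitor.max():.2f} ms")
print(f"Worst-case slowdown factor:         {competitor.max() / baseline.max():.2f}x")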

Reproducing Fig. 11

This figure supports rule R8: Copy engines may appear to violate R7 due to copy-engine-specific shared hardware. Specifically, this figure shows how seemingly-independent Logical Copy Engines may share an underlying Physical Copy Engine.

The shared hardware that R8 refers to is only present on NVIDIA GPUs of the Pascal generation and newer, and so such a GPU is required.

After setting up nvdebug (see above):

  1. Obtain the runlist to LCE/GRCE mappings via the same process used for the tables (recall that GRCE0 and GRCE1 are synonyms for LCE0 and LCE1 respectively).
  2. Determine the number of Physical Copy Engines (PCEs) by counting the number of bits set in the mask returned by cat /proc/gpu0/pce_mask (e.g. a mask of 0x7 is 0b0111, indicating three PCEs).
  3. Determine which LCE is associated with each PCE by running cat /proc/gpu0/lce_for_pceX for each PCE X (X ∈ [0, 2] on the GTX 1060 3GB).
  4. For any GRCEs not yet associated with a PCE, determine which LCE they are mapped on to via cat /proc/gpu0/shared_lce_for_grceY (where Y is 0 or 1 for GRCE0 or GRCE1 respectively).

We suggest completing this process on paper by first writing columns of all the LCEs, all their associated runlists, then all the PCEs, as in Fig. 11. Then, as you read registers, draw arrows between the entries in each column until the whole figure is populated. Some PCEs may be unassociated with any LCE, and some LCEs may be unassociated with any PCE; these units are sometimes purposefully dormant due to hardware errata or an insufficient number of PCEs (as noted in the paper). It is generally not possible to determine why an LCE or PCE is unused from the APIs of nvdebug alone; we draw that information from other sources, such as the commit log of the open-source nvgpu driver.
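
To gather the raw values for this process in one place, the following sketch simply prints each relevant nvdebug file for gpu0; adjust the GPU number for your system, and note that files absent on your GPU are skipped or reported:

#!/usr/bin/env python3
# Print the nvdebug files used to reconstruct Fig. 11 for /proc/gpu0.
# Adjust the GPU number as needed; unreadable files are reported rather than fatal.
import glob

paths = ["/proc/gpu0/pce_mask"]
paths += sorted(glob.glob("/proc/gpu0/lce_for_pce*"))
paths += sorted(glob.glob("/proc/gpu0/shared_lce_for_grce*"))
for path in paths:
    try:
        with open(path) as f:
            print(f"{path}: {f.read().strip()}")
    except OSError as err:
        print(f"{path}: <unreadable: {err}>")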

Reproducing Listing 1

This listing demonstrates the runlist API of nvdebug.

This experiment should work on any NVIDIA GPU of the Kepler through Volta generations (support for this API on Turing, Ampere, Ada, and Hopper is coming soon).

After setting up nvdebug (see above), start any GPU-using application, and dump the Compute/Graphics runlist (runlist 0):

cat /proc/gpu0/runlist0

Appendices

This section contains various examples and supplementary content only relevant to some audiences.

Appendix A: Setting Device Visibility

Here’s an example of using nvidia-smi, CUDA_DEVICE_ORDER, and CUDA_VISIBLE_DEVICES to make only the GTX 1060 3GB visible to the CUDA sample deviceQueryDrv:

jbakita@jbakita-old:~$ nvidia-smi # Determine GPU IDs when sorted by PCI ID using nvidia-smi:
Wed Feb 21 20:08:42 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   37C    P0    43W / 160W |      0MiB /  4041MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   61C    P0    24W / 120W |      0MiB /  3019MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
jbakita@jbakita-old:~$ export CUDA_DEVICE_ORDER=PCI_BUS_ID # Switch CUDA to use PCI-ID-ordering
jbakita@jbakita-old:~$ export CUDA_VISIBLE_DEVICES=1 # Only make device 1 visible to CUDA
jbakita@jbakita-old:~$ /playpen/jbakita/CUDA/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv
/playpen/jbakita/CUDA/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 3GB"
  CUDA Driver Version:                           10.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 3019 MBytes (3166109696 bytes)
  ( 9) Multiprocessors, (128) CUDA Cores/MP:     1152 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Sizes                    1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

There are ways to do this without setting CUDA_DEVICE_ORDER, but we suggest doing so for consistency with nvdebug as well.

Appendix B: Notes for Internal Users and Artifact Evaluators

yamaha.cs.unc.edu has been set up with display output on the GTX 1080 Ti (primary adapter). The only way we are presently aware of to get gl_copies to run on the RTX 6000 Ada (secondary adapter) in that machine is to physically remove the GTX 1080 Ti.

For portability, use the nvcc version at /playpen/jbakita/CUDA/cuda-archive/cuda-10.2/bin/nvcc for all experiments on x86_64 machines. (The gpu-microbench makefiles support a custom nvcc via the NVCC variable, e.g. make NVCC=/playpen/jbakita/CUDA/cuda-archive/cuda-10.2/bin/nvcc all.)