This page details the artifacts from our paper:
J. Bakita and J. H. Anderson, “Demystifying NVIDIA GPU Internals to Enable Reliable GPU Management”, Proceedings of the 30th IEEE Real-Time and Embedded Technology and Applications Symposium, to appear, May 2024. (PDF)
We also describe how to use these artifacts to reproduce our experiments. These steps were last updated .
There are two software artifacts from our work:
- The nvdebug tool, a Linux kernel module for probing NVIDIA GPU state
- The gpu-microbench suite, for examining the scheduling behavior of independent GPU engines

We now discuss how to acquire and set up each artifact. Note that we assume a Linux-based system throughout, with the CUDA SDK and the equivalent of Ubuntu's build-essential package installed.
gpu-microbench
We use this suite of high-accuracy benchmarks and measurement tools to examine the scheduling behavior of individual GPU contexts on the GPU copy and compute engines.
The source code for this artifact can be browsed online. To obtain and build this suite, simply ensure that the CUDA SDK is available, then:
git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/gpu-microbench.git/
cd gpu-microbench
make all
To build the gl_copies benchmark, install the freeglut headers and shared library via the Ubuntu package freeglut3-dev (or from their website), then run:
cd copy_experiments
make gl_copies
Run any benchmark with the --help argument for usage information. To run a CUDA benchmark on a specific GPU in a multi-GPU system, set the environment variable CUDA_VISIBLE_DEVICES (docs). Note that, by default, CUDA orders devices by compute power; nvidia-smi and nvdebug instead number devices by PCI ID. To force CUDA's device numbering (including that used by CUDA_VISIBLE_DEVICES) to match nvidia-smi, set the environment variable CUDA_DEVICE_ORDER to PCI_BUS_ID. See Appdx. A for an example.
Note that exec_logger and copy_monitor in the paper are respectively named preemption_logger and mon_cross_ctx_copies in the repository (subject to change).
nvdebug
This Linux kernel module is used to obtain information about GPU topology and scheduling, as well as control scheduling to a more limited extent. See Appdx. A in the paper for more details on its capabilities.
The source code for this artifact can be browsed online. To obtain and build this module, ensure you have the kernel headers installed (the linux-headers-generic package on Ubuntu), then:
git clone http://rtsrv.cs.unc.edu/cgit/cgit.cgi/nvdebug.git/ -b rtas24-ae
cd nvdebug
make
sudo insmod nvdebug.ko
For each NVIDIA GPU X installed in the system, this module exposes virtual files corresponding to that GPU in the /proc/gpuX/ directory. Note that GPUs are numbered by PCI ID (as with nvidia-smi), and that this is not the same ordering CUDA uses to number devices by default.
The following table details which GPU and tool are used to obtain and render the experimental result in each figure of our paper.
Result | GPU | Tool | Renderer
---|---|---|---
Fig 5 | GTX 1060 3GB | cuda_scheduling_examiner (rtas24_false_dependency.json config) | cuda_scheduling_examiner
Fig 6 | GTX 1060 3GB | gpu-microbench (preemption_logger) | Matplotlib
Fig 7 | GTX 1060 3GB | gpu-microbench (mon_cross_ctx_copies) | Matplotlib
Fig 8 | GTX 1080 Ti | gpu-microbench (preemption_logger and mon_cross_ctx_copies_and_compute) | Matplotlib
Fig 9 | Jetson TX2 | gpu-microbench (preemption_logger and mon_cross_ctx_copies) | Matplotlib
Tbl 2 | GTX 1060 3GB | nvdebug (device_info API) | LaTeX Table
Tbl 3 | Jetson Orin | nvdebug (device_info API) | LaTeX Table
Tbl 4 | RTX 6000 Ada | nvdebug (device_info API) | LaTeX Table
Fig 10 | RTX 6000 Ada and GTX 1080 Ti | gpu-microbench (copy_contender and gl_copies) | Matplotlib
Fig 11 | GTX 1060 3GB, RTX 2080 Ti, RTX 6000 Ada, and Jetson Orin | nvdebug (lce_for_pceX, shared_lce_for_grceX, pce_map, and device_info APIs) | Inkscape Drawing
Listing 1 | Any Pascal+ GPU | nvdebug (runlistX API) | LaTeX Listing
Note that in addition to our novel artifacts, we utilize the cuda_scheduling_examiner tool by Nathan Otterness.
Our experimental code is portable to any NVIDIA GPU of the Pascal architecture (2016) or newer, but most experiments must be run on a particular GPU to obtain the same result. Some experiments are more flexible, as indicated in the following descriptions of each.
Fig. 5
This figure demonstrates R2: A task's number of channels limits intra-task parallelism.
This experiment should work on any NVIDIA GPU with at least nine SMs on an x86_64 platform.
cuda_scheduling_examiner
Download and build cuda_scheduling_examiner using the official instructions. Alternatively, on a machine with the CUDA SDK installed, just run the following commands in the parent directory of where you would like to put cuda_scheduling_examiner:
export PATH=$PATH:/usr/local/cuda/bin
git clone --depth 1 https://github.com/yalue/cuda_scheduling_examiner_mirror.git
cd cuda_scheduling_examiner_mirror
make
git clone https://github.com/WojciechMula/canvas2svg.git scripts/canvasvg
cd ..
Download the experimental configuration file rtas24_false_dependency.json and run it with cuda_scheduling_examiner:
wget https://www.cs.unc.edu/~jbakita/rtas24-ae/rtas24_false_dependency.json
cd cuda_scheduling_examiner_mirror
./bin/runner ../rtas24_false_dependency.json
This will generate ./results/rtas24_false_dependency_X.json (where X ∈ [1, 9]) for the baseline case (eight channels). Plot these via scripts/view_blocksbysm.py as included with cuda_scheduling_examiner:
./scripts/view_blocksbysm.py results/rtas24_false_dependency_*
Note that X11 is required for this plotting script. Please use X forwarding if this is a remote machine.
The result from this experiment should look similar to that at top in Fig. 5 of the paper, with K33 not starting until after K3 finishes.
Repeat the steps used for generating and plotting Fig. 5 (top), but prefix CUDA_DEVICE_MAX_CONNECTIONS=9 before ./bin/runner to increase the number of channels created by CUDA, as in:
CUDA_DEVICE_MAX_CONNECTIONS=9 ./bin/runner configs/rtas24_false_dependency.json
The result from this experiment should look similar to that at bottom in Fig. 5 of the paper, with K33 starting immediately after release.
Fig. 6
This figure demonstrates R4: A runlist may have up to one task active per associated engine, for the Compute/Graphics Engine.
This experiment should work on any NVIDIA GPU of the Pascal generation or newer.
After setting up gpu-microbench (see above), simply co-run two instances of preemption_logger on the same GPU. preemption_logger prints status messages to stderr, and execution intervals to stdout. Save the output execution intervals for plotting; we recommend doing so via stream redirection. Approximately 100 timeslices should be sufficient, and CPU-based times are unneeded (so pass -r). Putting this all together:
cd gpu-microbench
# Start "Exec Logger 1"
./preemption_logger 100 -r > fig6.1.log &
# Wait a few seconds for initialization, then start "Exec Logger 2"
./preemption_logger 100 -r > fig6.2.log &
# After "Exec Logger 1" finishes, run
./constant_cycles_kernel 100
As “Exec Logger 2” has been asked to log at least 100 timeslices, it will run indefinitely after “Exec Logger 1” completes, until constant_cycles_kernel is run to trigger timeslicing. In the paper, these timeslices of “Exec Logger 2” against constant_cycles_kernel are out-of-frame.
Examine fig6.1.log and fig6.2.log and note that the logged intervals are mutually exclusive (each line contains the “[start], [end]” times of one interval of execution). We intend to provide code to reproduce our plot at a later date.
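In the meantime, the mutual-exclusivity property can be checked mechanically. The following is a minimal Python sketch under our assumptions about the log format (one "[start], [end]" pair of integer timestamps per line, as described above); the function names are our own:

```python
def parse_intervals(lines):
    """Parse '[start], [end]' lines into (start, end) integer tuples."""
    return [tuple(int(t) for t in line.split(",")) for line in lines if line.strip()]

def mutually_exclusive(a, b):
    """True if no interval in `a` overlaps any interval in `b`."""
    return all(e1 <= s2 or e2 <= s1 for s1, e1 in a for s2, e2 in b)

# Synthetic example: logger 1 ran during [0,10) and [20,30); logger 2 during [10,20)
log1 = parse_intervals(["0, 10", "20, 30"])
log2 = parse_intervals(["10, 20"])
print(mutually_exclusive(log1, log2))  # True: the timeslices alternate
```

To check real logs, read each file with open(...).readlines() and pass the two parsed lists to mutually_exclusive.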
Fig. 7
This figure demonstrates R4: A runlist may have up to one task active per associated engine, for the Copy Engine.
This experiment should work on any CUDA-capable NVIDIA GPU.
After setting up gpu-microbench (see above), co-run two instances of “Copy Monitor” via mon_cross_ctx_copies, configured to monitor progress from the CPU. mon_cross_ctx_copies prints status messages to stderr, and the times at which pages are copied to stdout. Save the output copy timestamps for plotting; we recommend doing so via stream redirection.
To co-run two instances of “Copy Monitor”, run mon_cross_ctx_copies with two CPU-monitored (-c) copies of 50 MiB (12800 pages), and save the results to fig7.log:
cd gpu-microbench/copy_experiments
./mon_cross_ctx_copies -c -c -s 12800 > fig7.log
Press enter when prompted. Once the copies complete, the result file contains one comma-separated line of timestamps per copy, with one timestamp per copied page. So, for two copies of 100 pages, the output file will have two lines, each of 100 timestamps.
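Before plotting, the shape of the result file can be sanity-checked with a short Python sketch; the helper name and its error handling are our own, based on the format described above:

```python
def load_copy_log(text, expected_copies, expected_pages):
    """Parse a copy log: one comma-separated line of per-page timestamps
    per copy. Raises ValueError if the shape is unexpected."""
    rows = [[int(t) for t in line.split(",")] for line in text.strip().splitlines()]
    if len(rows) != expected_copies:
        raise ValueError(f"expected {expected_copies} copies, got {len(rows)}")
    if any(len(r) != expected_pages for r in rows):
        raise ValueError("wrong number of per-page timestamps in some row")
    return rows

# Synthetic example: two copies of three pages each
rows = load_copy_log("100, 200, 300\n150, 250, 350", 2, 3)
print(len(rows), len(rows[0]))  # 2 3
```

For the real experiment, pass the contents of fig7.log with expected_copies=2 and expected_pages=12800.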
To plot the result file as fig7.pdf as in the paper, use the following Python script:
#!/usr/bin/env python3
import numpy
import matplotlib
matplotlib.use('Agg') # Use a headless plotting frontend
import matplotlib.pyplot as plt
plt.rcParams["font.size"] = 8 # Default IEEEtran footnote size
plt.rcParams['figure.dpi'] = 96 # DPI used by inkscape
# Google Drive colors
dark_orange = '#e69138'
dark_blue = '#3d85c6'
dark_green = '#6aa84f'
# Load the copy log (change this if you named the log differently)
res_x = numpy.loadtxt("fig7.log", delimiter=",")
res_x -= res_x.min() # Rebase everything to zero
res_x /= 1000*1000 # Convert to milliseconds
plt.figure(figsize=(3.5,2))#, tight_layout = {'pad': 0})
plt.plot(res_x[0].transpose(), range(res_x.shape[1]), c=dark_blue, label="Copy Monitor 1")
plt.plot(res_x[1].transpose(), range(res_x.shape[1]), c=dark_orange, linestyle=":", label="Copy Monitor 2")
# Adjust the x limit to change how many ms are plotted, and the y limit to change how many MiB are shown
plt.xlim(0,20.48)
plt.ylim(0,7500)
plt.xlabel("Time [milliseconds (ms)]")
plt.ylabel("Data Copied")
plt.legend(loc="lower right")
# Fix y axis
locs, labels = plt.yticks()
locs = range(0, int(locs[-1]), int(6*1024*1024/4096)) # Every 6 MiB
y_ticks = ["%.0f MiB"%((page_cnt*4096)/(1024*1024)) for page_cnt in locs] # Convert pages to MiB
plt.yticks(locs, y_ticks)
plt.tight_layout() # Gets it closest to the desired size
plt.savefig("fig7.pdf") # Save plot to fig7.pdf
Note that the resultant plot may look different if your GPU copies at a different speed, or if the copies do not start at exactly the same time. The key property that should be consistent is that the amount of data copied only increases for one copy monitor at a given instant.
Fig. 8
This figure demonstrates R5 (i): Runlists enable independent inter-task parallelism.
This experiment should work on any CUDA-capable NVIDIA GPU with a separate copy runlist (all NVIDIA discrete GPUs we are aware of).
This experiment depends on the “Exec+Copy” benchmark, which is not yet of sufficient quality to add to the gpu-microbench repository. This will be fixed and posted after the ECRTS’24 deadline on March 1st.
Fig. 9
This figure demonstrates R5 (ii): Independent inter-task parallelism is not possible without multiple runlists.
This experiment should work on any CUDA-capable NVIDIA GPU without a separate copy runlist (only the Jetson TX2/Parker SoC to our knowledge).
After setting up gpu-microbench (see above), co-run one instance of “Copy Monitor” via mon_cross_ctx_copies against one instance of “Exec Logger” via preemption_logger. The copy should be 200 MiB (51200 pages) or larger, and must be monitored from the CPU only. First, initialize the “Copy Monitor”, but do not press enter after initialization completes:
cd gpu-microbench
./copy_experiments/mon_cross_ctx_copies -c -s 51200 > fig9b.log
Once it finishes initializing, press Ctrl+Z to suspend it. Now launch the “Exec Logger”; logging 1,000 execution intervals should be plenty:
./preemption_logger 1000 > fig9a.log &
After this initializes (wait a few seconds), bring mon_cross_ctx_copies back to the foreground:
fg
Press enter and wait for the copies to complete. As in the instructions for Fig. 6, use ./constant_cycles_kernel 1000 to trigger preemption_logger to terminate and output its execution intervals. The format of fig9b.log is identical to that for Fig. 7; that plotting script (with a modified filename) can be repurposed here. The format of fig9a.log is identical to that for Fig. 6; see the notes there on how to read the file. You should discover that the copy proceeds only periodically, and that no execution intervals correspond with the times during which it does not proceed (the first one or two execution intervals should continue uninterrupted during the entire copy).
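The cross-referencing of the two logs can also be done in Python. Here is a rough sketch under our assumptions about the formats described above (per-page copy timestamps for fig9b.log, "[start], [end]" execution intervals for fig9a.log, all as comparable integers); the function name is our own:

```python
def copies_during_execution(copy_times, exec_intervals):
    """Return the copy timestamps that fall inside any execution interval."""
    return [t for t in copy_times
            if any(start <= t <= end for start, end in exec_intervals)]

# Synthetic example: the copy only progresses in the gaps between intervals
exec_intervals = [(0, 100), (200, 300)]
copy_times = [150, 160, 350]
print(copies_during_execution(copy_times, exec_intervals))  # []
```

On a GPU without a separate copy runlist, the returned list should be empty (or nearly so) for the non-uninterrupted execution intervals.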
Tbls. 2–4
Via data from GPU topology registers, these tables support R6: A runlist may be bound to more than one engine, and R7: Each engine is bound to only one runlist.
This experiment should work on any NVIDIA GPU of the Kepler generation or newer. The tables in the paper are from the GTX 1060 3GB, Jetson Orin, and RTX 6000 Ada respectively, but the results on any GPU should be consistent with our rules.
After setting up nvdebug (see above), dump the GPU topology using the device_info API:
cat /proc/gpu0/device_info
The resultant output shows a list of GPU engines and which runlists they are associated with. The key fields are:
- Engine Type: the type of hardware engine; e.g., Compute/Graphics or Logical Copy Engine (LCE).
- instance: which instance of the engine type this is; e.g., instance 2 of an LCE means this is LCE2.
- Runlist ID (or RL Base on Ampere+): a numeric identifier for the runlist which feeds work to this engine.

For example, here is the raw output for the GTX 1060 3GB used to construct Tbl. II:
| Host's Engine ID: 0
| Runlist ID: 0
| Interrupt ID: 12
| Reset ID: 12
| Engine Type: 0 (Graphics/Compute)
| BAR0 Base 0x00400000
| instance 0
| Fault ID: 0
+---------------------+
| Host's Engine ID: 7
| Runlist ID: 0
| Interrupt ID: 5
| Reset ID: 6
| Engine Type: 19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
| instance 0
| Fault ID: 21
+---------------------+
| Host's Engine ID: 8
| Runlist ID: 0
| Interrupt ID: 6
| Reset ID: 7
| Engine Type: 19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
| instance 1
| Fault ID: 22
+---------------------+
| Host's Engine ID: 5
| Runlist ID: 5
| Interrupt ID: 7
| Reset ID: 21
| Engine Type: 19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
| instance 2
| Fault ID: 27
+---------------------+
| Host's Engine ID: 1
| Runlist ID: 1
| Interrupt ID: 17
| Reset ID: 15
| Engine Type: 16 (NVDEC: NVIDIA Video Decoder)
| BAR0 Base 0x00084000
| instance 0
| Fault ID: 2
+---------------------+
| Host's Engine ID: 3
| Runlist ID: 3
| Interrupt ID: 15
| Reset ID: 14
| Engine Type: 13 (SEC: Sequencer)
| BAR0 Base 0x00087000
| instance 0
| Fault ID: 18
+---------------------+
| Host's Engine ID: 2
| Runlist ID: 2
| Interrupt ID: 16
| Reset ID: 18
| Engine Type: 14 (NVENC0: NVIDIA Video Encoder #0)
| BAR0 Base 0x001c8000
| instance 0
| Fault ID: 25
+---------------------+
| Host's Engine ID: 6
| Runlist ID: 6
| Interrupt ID: 10
| Reset ID: 22
| Engine Type: 19 (LCE: Logical Copy Engine)
| BAR0 Base 0x00104000
| instance 3
| Fault ID: 28
+---------------------+
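Checking R6 and R7 against this output can also be automated. The following Python sketch groups engines by Runlist ID; the regular expressions are our assumption, matched against the field layout shown above:

```python
import re

def engines_by_runlist(text):
    """Group engine names (type abbreviation + instance) by Runlist ID,
    given device_info-style output."""
    by_runlist = {}
    for block in text.split("+---------------------+"):
        runlist = re.search(r"Runlist ID: (\d+)", block)
        etype = re.search(r"Engine Type: \d+ \(([^):]+)", block)  # e.g. 'LCE', 'Graphics/Compute'
        inst = re.search(r"instance (\d+)", block)
        if runlist and etype and inst:
            name = f"{etype.group(1).strip()}{inst.group(1)}"
            by_runlist.setdefault(int(runlist.group(1)), []).append(name)
    return by_runlist

sample = """\
| Runlist ID: 0
| Engine Type: 0 (Graphics/Compute)
| instance 0
+---------------------+
| Runlist ID: 0
| Engine Type: 19 (LCE: Logical Copy Engine)
| instance 1
"""
print(engines_by_runlist(sample))  # {0: ['Graphics/Compute0', 'LCE1']}
```

R6 holds when some runlist maps to multiple engines (as runlist 0 does above); R7 holds because each engine block carries exactly one Runlist ID.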
Fig. 10
This figure demonstrates rule R8: Copy engines may appear to violate R7 due to copy-engine-specific shared hardware.
Each part of this experiment requires a different GPU. One GPU must map the GRCE for OpenGL texture uploads to the same PCE as used for CUDA copies from device to host, and the other GPU must not. Of our GPUs, we are only certain that the RTX 6000 Ada satisfies the first condition, whereas almost any NVIDIA GPU satisfies the second. Ideally, each GPU should be installed in an identical system to maximize comparability.
After setting up gpu-microbench (see above), run one instance of copy_contender. To run a 200 MiB copy from the GPU 1001 times (500 iterations), as in the paper:
cd gpu-microbench/copy_experiments
./copy_contender 51200 500 from > fig10.log
The output from this program is designed to be human-readable. To strip out text and formatting:
cat fig10.log | tr -d "[:alpha:]/" | cut -d " " -f 2 > fig10.log.clean
And to get the maximum copy length in milliseconds (as plotted in the paper):
cat fig10.log.clean | sort -n | tail -1
Repeat this experiment for each GPU to get the baseline “Without OpenGL Competitor” numbers.
This experiment requires a working display physically attached to the GPU under test, with an OpenGL-supporting window manager running. On a headless Ubuntu system, installing the packages xorg, openbox (a minimal window manager), and freeglut3-dev (an OpenGL wrapper) is all the extra software necessary (these minimal packages avoid pulling in unnecessary dependencies). Note that a non-headless variant of the NVIDIA driver must be installed, such as that included in the nvidia-driver-535-server package.
Once a display is installed and working, with Xorg running, build (see above) and run gl_copies:
cd gpu-microbench/copy_experiments
DISPLAY=:0 ./gl_copies &
This should cause a window displaying a flickering textured triangle to open on the display. Leave this running in the background and repeat the same steps to run copy_contender as without a competitor.
Repeat this experiment for each GPU to get the “With OpenGL Competitor” numbers. You should find that the “With OpenGL Competitor” numbers are significantly higher on the GPU with a PCE mapping conflict.
Fig. 11
This figure supports rule R8: Copy engines may appear to violate R7 due to copy-engine-specific shared hardware. Specifically, this figure shows how seemingly-independent Logical Copy Engines may share an underlying Physical Copy Engine.
The shared hardware that R8 refers to is only present on NVIDIA GPUs of the Pascal generation and newer, and so such a GPU is required.
After setting up nvdebug (see above):
1. Determine which PCEs are present via cat /proc/gpu0/pce_mask (e.g., a mask of 0x7 is 0b0111, indicating three PCEs).
2. Read the LCE-to-PCE mappings via cat /proc/gpu0/lce_for_pceX for each PCE X (X ∈ [0, 2] on the GTX 1060 3GB).
3. Read the GRCE sharing configuration via cat /proc/gpu0/shared_lce_for_grceY (where Y is 0 or 1 for GRCE0 or GRCE1 respectively).

We suggest completing this process on paper by first writing columns of all the LCEs, all their associated runlists, and then all the PCEs, as in Fig. 11. Then, as you read the registers, draw arrows between the entries in each column until the whole figure is populated. Some PCEs may be unassociated with any LCE, and some LCEs may be unassociated with any PCE; these units are sometimes purposefully dormant due to hardware errata or an insufficient number of PCEs (as noted in the paper). It is not possible to generally determine why an LCE or PCE is unused just from the APIs of nvdebug; we draw that information from other places, such as the commit log of the open-source nvgpu driver.
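The pce_mask bit arithmetic above can be sketched in a couple of lines of Python; the helper name is our own:

```python
def pces_from_mask(mask):
    """Return the indices of the PCEs whose bits are set in a pce_mask value."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# e.g., a mask of 0x7 (0b0111) indicates PCE0, PCE1, and PCE2 are present
print(pces_from_mask(0x7))  # [0, 1, 2]
```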
Listing 1
This listing demonstrates the runlist API of nvdebug.
This experiment should work on any NVIDIA GPU of the Kepler through Volta generations (support for this API on Turing, Ampere, Ada, and Hopper is coming soon).
After setting up nvdebug (see above), start any GPU-using application, and dump the Compute/Graphics runlist (runlist 0):
cat /proc/gpu0/runlist0
Appendix A
This section contains various examples and supplementary content only relevant to some audiences.
Here’s an example of using nvidia-smi, CUDA_DEVICE_ORDER, and CUDA_VISIBLE_DEVICES to make only the GTX 1060 3GB visible to the CUDA sample deviceQueryDrv:
jbakita@jbakita-old:~$ nvidia-smi # Determine GPU IDs when sorted by PCI ID using nvidia-smi:
Wed Feb 21 20:08:42 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 970 Off | 00000000:01:00.0 Off | N/A |
| 40% 37C P0 43W / 160W | 0MiB / 4041MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 106... Off | 00000000:02:00.0 Off | N/A |
| 0% 61C P0 24W / 120W | 0MiB / 3019MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
jbakita@jbakita-old:~$ export CUDA_DEVICE_ORDER=PCI_BUS_ID # Switch CUDA to use PCI-ID-ordering
jbakita@jbakita-old:~$ export CUDA_VISIBLE_DEVICES=1 # Only make device 1 visible to CUDA
jbakita@jbakita-old:~$ /playpen/jbakita/CUDA/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv
/playpen/jbakita/CUDA/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 3GB"
CUDA Driver Version: 10.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 3019 MBytes (3166109696 bytes)
( 9) Multiprocessors, (128) CUDA Cores/MP: 1152 CUDA Cores
GPU Max Clock rate: 1709 MHz (1.71 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Sizes 1D=(131072) 2D=(131072, 65536) 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
There are ways to do this without setting CUDA_DEVICE_ORDER, but we suggest doing so for consistency with nvdebug as well.
yamaha.cs.unc.edu has been set up with display output on the GTX 1080 Ti (primary adapter). The only way we are presently aware of to get gl_copies to run on the RTX 6000 Ada (secondary adapter) in that machine is to physically remove the GTX 1080 Ti.
For portability, use the nvcc version at /playpen/jbakita/CUDA/cuda-archive/cuda-10.2/bin/nvcc for all experiments on x86_64 machines. (The gpu-microbench makefiles support a custom nvcc via the NVCC variable, e.g., make NVCC=/playpen/jbakita/CUDA/cuda-archive/cuda-10.2/bin/nvcc all.)