My Ramblings
A Systems-View of CUDA 4.0
4/16/11
(image of Fermi die, courtesy of Nvidia)
CUDA is a great platform for high-performance, data-parallel computing, but its architecture and implementation leave something to be desired from a systems perspective. The situation is improving in CUDA 4.0, but there's still a way to go.
A large focus of my research as a graduate student is investigating the integration of GPUs as general-purpose co-processors into real-time systems. Nvidia and their CUDA technology certainly lead the pack in the field of general-purpose computation on GPUs. However, Nvidia's focus on single-application domains has left CUDA weak from the systems point of view, one where we evaluate the effectiveness of GPUs from a holistic, system-wide perspective.
In the traditional graphics domain, GPUs are primarily used to perform one task at a time: render graphics for a fullscreen video game, play a video, etc. That's it. A GPU can do both, but it rarely has to do so at the same time (at least in any strenuous sense). This common single-use case has bled into Nvidia's CUDA design and implementation. Obviously, some architectural artifacts from the old GPU are going to carry forward into current and future architectures. On the hardware side, Nvidia has been making strides to make the GPU more flexible as a system co-processor. The Fermi architecture, which has been out for about a year I think, started to introduce non-single-use features such as concurrent GPU kernel execution (though said kernels still had to originate from the same host-side thread). Things have been slower to advance on the software side, though this is beginning to change with the new CUDA 4.0 API, which is currently out in its third release candidate.
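To make the concurrent-kernel point concrete, here is a minimal sketch (the kernel, sizes, and data are invented, and error checking is omitted) of two kernels launched into separate streams from one host thread; on a Fermi-class GPU they may execute concurrently:

// Two independent kernels in two non-default streams. On hardware that
// supports concurrent kernel execution, the GPU is free to overlap them.
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int N = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches return immediately; being in different streams means
    // the GPU is allowed to run them side by side.
    busy_kernel<<<N / 256, 256, 0, s1>>>(d_a, N);
    busy_kernel<<<N / 256, 256, 0, s2>>>(d_b, N);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}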
Here are a few features I have found in CUDA 4.0 that make CUDA more friendly for systems programmers (note: this is entirely from my experience with the Linux CUDA 4.0 runtime and driver):
Concurrent active GPU connections. CUDA 4.0 allows a single thread of execution to have several active connections (contexts) to more than one GPU at a time. In CUDA 3.2, one could stack active contexts (by way of the low-level CUDA driver API), but a single thread of execution could only use one GPU at a time. Concurrent active connections allow new programming paradigms, such as letting a program make use of any free GPU instead of waiting for a particular one, while maintaining a low GPU-switching latency.
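As a rough illustration (a sketch only: the kernel and buffer sizes are invented and error checking is omitted), a single host thread under the 4.0 runtime can switch between devices with cudaSetDevice() and keep work in flight on each:

// One host thread driving up to two GPUs. With the CUDA 4.0 runtime,
// cudaSetDevice() simply switches the active context; allocations and
// launches after the call target the newly selected device.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

int main(void)
{
    const int N = 1 << 20;
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    float *d_buf[2] = { 0, 0 };

    // Queue independent work on each device without blocking in between.
    for (int dev = 0; dev < ndev && dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMalloc((void **)&d_buf[dev], N * sizeof(float));
        scale<<<N / 256, 256>>>(d_buf[dev], N, 2.0f);
    }

    // Now wait for each device and clean up.
    for (int dev = 0; dev < ndev && dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_buf[dev]);
    }
    return 0;
}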
Unified address spaces. I haven't had a chance to try this new feature out, but it begins to address the programming difficulties that arise when you have to manage disjoint host-side and GPU-side memory domains. In CUDA 3.2, a programmer had to explicitly shuttle data back and forth between the GPU and the host. While this is still an important part of CUDA programming, the new unified address space features allow code to share pointers between GPU and host. This gives a more seamless coupling between CPU and GPU code.
I believe the win with this technology is twofold. First, it allows improved support for C++. Imagine a class where one member function executes on the CPU while another executes on a GPU, or CPU code that is able to traverse GPU data structures and vice versa. Second, current hardware trends are merging CPUs and GPUs onto single chips with shared physical memory (including the cache!). The old CUDA 3.2 memory architecture is merely an artificial partitioning on such platforms.
The line dividing the CPU and GPU into distinct parts is blurring. This will only make CUDA adoption easier and more flexible. (To give credit where credit is due, I don’t believe Nvidia was the first to demonstrate this kind of technology in a GPU-based setting---I think that distinction may go to a group of Intel Larrabee researchers a year or two ago... looking for the paper...).
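For what it's worth, here is roughly what the unified address space buys you in practice (a sketch assuming a 64-bit host, a Fermi-class GPU, and the 4.0 runtime; I haven't exercised this exact code): with every allocation living in one address range, the runtime can infer the direction of a copy from the pointers themselves.

// Under unified virtual addressing, cudaMemcpyDefault replaces the
// explicit HostToDevice / DeviceToHost direction flags.
#include <cuda_runtime.h>

int main(void)
{
    const size_t SZ = 1 << 20;
    float *h_buf, *d_buf;

    cudaHostAlloc((void **)&h_buf, SZ, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void **)&d_buf, SZ);                           // device memory

    // The runtime figures out where each pointer lives.
    cudaMemcpy(d_buf, h_buf, SZ, cudaMemcpyDefault);   // host -> device
    cudaMemcpy(h_buf, d_buf, SZ, cudaMemcpyDefault);   // device -> host

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}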
Improved support for host-side pinned memory. CUDA 3.2 allowed a user to allocate pinned (non-pageable) memory through the cudaMallocHost() function. A programmer might opt to use pinned memory because it greatly reduces the overhead of shuttling memory between the CPU and GPU through hardware-optimized transfers (DMA). However, in CUDA 3.2, this memory couldn't be preserved across GPU context lifetimes; the memory was automatically freed when the GPU connection under which it was allocated was closed. Since this memory was allocated on the host, I think this was a really poor choice by the API designers, but from a single-use point of view, I guess I can understand.
In contrast, CUDA 4.0 allows host-side pinned memory to survive GPU tear-down through the new cudaHostRegister() / cudaHostUnregister() APIs. These APIs allow traditionally allocated memory (e.g., from mmap()) to be pinned and used with DMA while not being completely controlled by the lifetime of the GPU. These APIs may also make it possible to share pinned memory between host-side address spaces (processes), since mmap() is happy to set this up. However, I haven't tried this yet and it may not work. I'm also not sure how this could actually be of any practical use, but it is an interesting idea.
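Here is a sketch of how I imagine these APIs being used (untested; the size and flag are just my guesses at reasonable values): pin an ordinary mmap()ed region, use it for a DMA-friendly transfer, and then unpin it while the mapping itself lives on.

// Pinning memory that CUDA did not allocate. cudaHostRegister() wants a
// page-aligned region, which mmap() gives us for free; error checking is
// omitted for brevity.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t SZ = 4 * sysconf(_SC_PAGESIZE);

    // An ordinary anonymous mapping; CUDA knows nothing about it yet.
    void *h_buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Pin it for DMA. Its lifetime stays under our control, not the GPU's.
    cudaHostRegister(h_buf, SZ, cudaHostRegisterPortable);

    void *d_buf;
    cudaMalloc(&d_buf, SZ);
    cudaMemcpy(d_buf, h_buf, SZ, cudaMemcpyHostToDevice);

    // Unpin; the mapping survives until munmap().
    cudaHostUnregister(h_buf);
    cudaFree(d_buf);
    munmap(h_buf, SZ);
    return 0;
}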
These new features begin to bridge the gap between a distinct single-use GPU device and an integrated system co-processor. However, there are still some annoying warts:
Getting blocking-based synchronization to work properly is a PITA. CUDA allows the user to specify how a host-side CPU program should wait for results from the GPU. The choices basically boil down to two options: spin or block. Spinning is the default behavior and consumes a ton of CPU time since the waiting program continuously polls the GPU's status. This may be acceptable behavior from a single-use perspective, or even one where the number of CPUs greatly exceeds the number of GPUs. However, in a system like Bonham, the CPU:GPU ratio is 12:8. Who wants to sacrifice 8 CPUs to shave just a tiny bit of latency between the CPU and GPU?
The alternative is to block a process (suspend it from CPU execution) while it waits for the GPU to report back that it's done doing whatever it needed to do. Suspension-based blocking is perfectly normal systems behavior. It's fine for many OS-internal data structures and most I/O devices. I don't know why Nvidia feels their devices can't suspend by default when they're not driving a graphics display. Worse, you have to jump through several hoops to get suspension-based blocking to work. For example, specifying suspension-based blocking (cudaSetDeviceFlags( cudaDeviceScheduleBlockingSync )) before GPU initialization is not enough if you choose to use cudaEvents to introduce synchronization points into your code; you also have to create those cudaEvents with cudaEventCreateWithFlags( cudaEventBlockingSync ). Why aren't the already supplied device flags transitive? And even that's not enough if you have more than one thread using a GPU in your system (at least in the latest CUDA 4.0 release candidate). Your threads will happily suspend from execution as needed until they collide with each other within the Nvidia driver. When this happens, both threads begin to spin, probably on some internal Big GPU Lock! Annoying. The only solution that I have found to work in all situations is cudaStreams.
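For reference, here are the hoops laid out in code (a minimal sketch with a stand-in kernel and no error checking): the device flag has to be set before the context exists, and each cudaEvent needs its own blocking-sync flag on top of that.

// Suspension-based blocking the hard way: device flag first, then a
// per-event flag, then wait on the event instead of the default spin.
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    // Must happen before any call that creates the GPU context.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const int N = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, N * sizeof(float));  // context created here
    cudaMemset(d_buf, 0, N * sizeof(float));

    // The device flag alone is not honored by events; they need their own.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    work<<<N / 256, 256>>>(d_buf, N);
    cudaEventRecord(done, 0);

    // With both flags set, this suspends the calling thread rather than spinning.
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaFree(d_buf);
    return 0;
}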
Nvidia: You must have your reasons for spinning all the time in both the CUDA runtime and the GPU driver, but maybe suspending isn't all that bad? Spinning is rarely a good idea from the systems perspective unless the duration of the spin is on the order of microseconds. When a GPU kernel takes 10ms to execute, what are a few more microseconds of latency due to suspension-based blocking? Further, spinning prevents progress of other ready-to-run threads of execution. Those other threads are all going to execute at some point, so why not run them while the GPU-using process won't mind? Otherwise, the GPU-using process could get switched out in the future, possibly washing out the latency reductions.
Tremendous per-process device memory overhead. The CUDA runtime has to allocate a portion of GPU device memory to every process that has an active connection to the GPU. This is fine for a single-use setting, but it really limits the number of processes that can simultaneously share a single GPU. In my experiments, I have found that I can’t have more than eight or nine processes with simultaneous open connections to a GTX-470 (which has 1.28GB of memory) before the GPU runs out of memory; and this is just memory lost to overheads---nothing has even been allocated in user code yet! A modern CPU platform addresses these kinds of limitations through shared libraries and memory mapping tricks. However, I wonder if the GPU hardware is too primitive at this point to provide the necessary memory protections and bookkeeping to pull something like this off. The new unified address space features may help alleviate some of this problem, but my instincts tell me that there will probably still be a non-trivial amount of memory that must be lost to per-process overheads.
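The overhead is easy to observe with a toy program like this one (a sketch; the sleep is only there so that several instances can pile up while you watch the free pool shrink):

// Each instance opens a GPU connection without allocating any user data,
// reports how much device memory remains, and then idles. Running a
// handful of copies side by side shows the per-process cost.
#include <cuda_runtime.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cudaFree(0);  // forces context creation without allocating anything

    size_t free_mem = 0, total_mem = 0;
    cudaMemGetInfo(&free_mem, &total_mem);
    printf("pid %d: %zu of %zu bytes free\n",
           (int)getpid(), free_mem, total_mem);

    sleep(60);    // hold the connection open for the other instances
    return 0;
}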
The GPU driver very much remains closed source. The GPU market has been very competitive for many years, and Nvidia and AMD/ATI have responded by keeping the details of how their devices work secret, partly through closed source drivers. As the CPU and GPU continue to become more seamlessly integrated, I wonder how long Nvidia can keep their pretty walled garden safe. (AMD/ATI has shown itself to be more friendly to the open source sector, though I haven't taken a serious look at AMD/ATI technologies yet.)
Consider this: Intel doesn't mandate that all of their x86 CPUs use an Intel-provided closed source hypervisor. Instead, everything is generally open and there are mountains of freely available design documents ready to aid in the development of custom software platforms. The Nouveau effort does try to open up the Nvidia platform to open source development, but Nouveau performance falls pretty far short of the Nvidia-provided drivers. Even if an open source community CUDA came out tomorrow, its performance and feature set would probably lag behind Nvidia's for some time, if Nouveau is any sort of guide.
Maybe I'm being a bit idealistic, but it would be great if Nvidia began to release their driver (or at least core parts of it) as open source.* Heck, even a pluggable architecture allowing third parties to extend driver features would be very welcome, at least for a systems programmer.
* The Nvidia driver on Linux includes a Linux/Driver glue layer that can be edited and compiled by anyone, though the licensing of this code is unclear: the source code says “Don’t touch!” but the accompanying documentation says “Have fun and please share your patches with us if you like.”
I'm really looking forward to gaining more experience with CUDA 4.0. I believe it's part of a greater CPU architectural shift towards tightly integrated heterogeneous processors (Cell was (is?) way ahead of its time). It will make the challenging move to multicore platforms over this last decade look like child's play.