Left: SSAO with 24 bits per mantissa. Right: SSAO with 12.5 bits per mantissa (on average).
Fig. 1. The image on the left was computed at full floating point precision (24 bits in the mantissa), while the image on the right was computed with an average precision of 12.5 bits in its pixel shaders. There are no perceptible differences between the two images, yet the reduced-precision image saved 71% of the energy in the pixel shader stage's arithmetic, or up to 20% of the GPU's overall energy.
Left: raw difference between full and reduced precision SSAO frames. Right: saturated difference between full and reduced precision SSAO frames.
Fig. 2. The raw difference between the two images in Figure 1 (left) and the saturated difference (right). The raw errors are indiscernible; the maximum error is only 29/255. Only when the errors are scaled in magnitude to saturate the image's range are they seen. Even at this magnitude, they are still minor and not noticeable in practice.
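The relationship between a raw and a saturated difference image can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to produce Figure 2: it assumes 8-bit single-channel pixel values stored in flat lists, and it assumes that "saturated" means a linear rescaling so that the largest error maps to the top of the 0-255 range. The function names are hypothetical.

```python
def raw_difference(full, reduced):
    """Per-pixel absolute difference between two equally sized 8-bit images."""
    if len(full) != len(reduced):
        raise ValueError("images must have the same number of pixels")
    return [abs(a - b) for a, b in zip(full, reduced)]


def saturated_difference(diff, max_value=255):
    """Linearly rescale a difference image so its largest error equals max_value.

    Assumption: linear scaling; the source does not state the exact mapping.
    """
    peak = max(diff, default=0)
    if peak == 0:
        # Identical images: nothing to amplify.
        return list(diff)
    return [round(d * max_value / peak) for d in diff]


# Tiny made-up example (values are illustrative, not measured data).
full_frame = [10, 200, 128, 64]
reduced_frame = [10, 171, 130, 64]
raw = raw_difference(full_frame, reduced_frame)
amplified = saturated_difference(raw)
```

With this scaling, a small raw error that is hard to see on its own becomes visible only after amplification, which is the effect the right-hand panel relies on.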

In this work, we seek to realize energy savings in modern pixel shaders by reducing the precision of their arithmetic. We explore three schemes for controlling this reduction. The first is a static analysis technique, which analyzes shader programs to choose precisions with guaranteed error bounds. This approach may be too conservative in practice since it cannot take advantage of run-time information, so we also examine two methods that take the actual data values into account: a programmer-directed approach and a closed-loop error-tracking approach, both of which can lead to higher savings. To use this last method, we developed several heuristics to control how the precisions change over time. We simulate several series of frames from commercial applications to evaluate the performance of these different schemes. The static, programmer-directed, and closed-loop approaches save on average 31%, 70%, and 62% of the energy in the pixel shader's arithmetic, respectively, which could amount to as much as 10-20% of the GPU's energy as a whole.
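To make "reducing the precision of arithmetic" concrete, the following Python sketch emulates a reduced-precision result by clearing the low-order significand bits of a 32-bit float. It is only an illustration under stated assumptions: the function name `reduce_precision` is hypothetical, the bit count includes the implicit leading 1 (so 24 means full single precision), and truncation is assumed as the rounding behavior, which the text above does not specify.

```python
import struct

FULL_PRECISION = 24  # float32 significand bits, counting the implicit leading 1


def reduce_precision(x, bits):
    """Return x as a float32 value truncated to `bits` significand bits.

    Assumption: low-order bits are simply cleared (truncation toward zero).
    """
    if not 1 <= bits <= FULL_PRECISION:
        raise ValueError("bits must be between 1 and 24")
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    # Clear the (24 - bits) least significant stored mantissa bits.
    drop = FULL_PRECISION - bits
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", raw & mask))
    return y


# Example: the same value at full precision and at 12 bits.
full = reduce_precision(0.7, 24)
coarse = reduce_precision(0.7, 12)
```

Under truncation, keeping p bits bounds the relative error of a single value by 2^-(p-1), which gives a rough feel for how error grows as bits are removed.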

Each of our techniques for choosing precisions resulted in different precisions, error characteristics, overheads, and energy savings. In general, the static analysis approach requires no user interaction and incurs no runtime overhead, but it selected relatively high precisions, leading to low energy savings and low errors. There was no clear winner among our dynamic approaches; they all resulted in similar precision selections (Figure 3), giving moderate energy savings with low errors. The magnitude of these errors was such that they were never noticed in practice, either in still frames or in video sequences. (This 6.8MB .mp4 video shows the dynamic precision selection results in a split-screen format. The left half of the frame is the full-precision result, while the right half is the frame rendered with our simple dynamic approach. The average precision used in this approach is displayed in the upper-right corner of the video.) However, these dynamic approaches require runtime monitoring and prohibitively invasive redesigns of existing shader architectures.

We feel that our programmer-directed approach is the most promising. It requires minimal user interaction (see this 4.75MB .mkv video for an example of how an artist might use this system) and yields very high energy savings and, by definition, unnoticeable errors. Figure 1, above, shows an example frame captured from a screen-space ambient occlusion demo at full precision (left) and reduced precision (right, 12.5 bits per mantissa on average). These images are nearly identical; Figure 2 shows the raw difference between the two images (left) and the saturated difference (right).

The savings seen in the pixel shader's arithmetic are important, but our approach also enables further savings in data communication. Since computations will not be using every bit, there is no need to move or store these unused bits. This is part of our ongoing work.

Precisions chosen by our dynamic approaches with static precisions for reference.
Fig. 3. Our dynamic approaches all choose lower precisions than the static analysis, which follows from their ability to examine the actual data used, rather than necessarily relying on a worst-case assumption. Among our dynamic approaches, there was no clear winner. We chose the "simple with delay" control scheme, since it performed as well as or better than the "simple" scheme, and its extra control was minimal, unlike the more complex control schemes.
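The idea of a closed-loop controller with a delay can be illustrated with a short Python sketch. The details of the "simple with delay" scheme are not given here, so everything below is an assumption made for illustration: the class name, the one-bit step size, the rule of raising precision immediately when the measured error exceeds a threshold, and the rule of lowering it only after a fixed number of consecutive frames below the threshold.

```python
class DelayedPrecisionController:
    """Hypothetical closed-loop precision controller (not the authors' scheme)."""

    def __init__(self, error_threshold, delay_frames, min_bits=1, max_bits=24):
        self.error_threshold = error_threshold
        self.delay_frames = delay_frames
        self.min_bits = min_bits
        self.max_bits = max_bits
        self.bits = max_bits      # start at full precision
        self.quiet_frames = 0     # consecutive frames under the threshold

    def update(self, measured_error):
        """Feed one frame's measured error; return the precision for the next frame."""
        if measured_error > self.error_threshold:
            # Error too high: back off right away and restart the delay counter.
            self.bits = min(self.max_bits, self.bits + 1)
            self.quiet_frames = 0
        else:
            self.quiet_frames += 1
            if self.quiet_frames >= self.delay_frames:
                # Error has stayed low long enough: try one bit less.
                self.bits = max(self.min_bits, self.bits - 1)
                self.quiet_frames = 0
        return self.bits


# Illustrative run with made-up error values.
controller = DelayedPrecisionController(error_threshold=1.0, delay_frames=2)
history = [controller.update(e) for e in (0.1, 0.1, 5.0, 5.0, 0.2, 0.2)]
```

In a sketch like this, the delay keeps the controller from dropping precision on a single quiet frame, at the cost of one extra counter per controlled precision.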

Table 1: Programmer-Directed Results.
Scene            Precision (bits)   PSNR (dB)   Savings   Precision (bits)   Savings
Depth of Field   12.0               45.6        79%       18.5               33%