Department of Computer Science, UNC Chapel Hill

GPUSort: High Performance Sorting using Graphics Processors

Introduction

Most present day computers sport powerful GPUs (Graphics Processing Units) capable of beating even high-end CPUs in terms of raw FLOPS when performing at their peak. Consequently, there has been lot of work recently on implementing different problems on GPUs to exploit their high processing power. One such problem is sorting, which has been a major computational bottleneck in applications spanning the breadth of computer science, most importantly in databases and data mining. We have designed algorithms to perform sorting on the GPU and extract the maximum computational throughoutput of the GPU. Our algorithm maps the bitonic sorting network to GPUs using simple texture mapping operations and effectively utilizes the memory bandwidth using cache-efficient memory accesses. We have observed that our implementation is faster than qsort() running on a 3.4 GHz Pentium processor system with a NVIDIA GeForce 6800 Ultra GPU. GPUSort is a C++ class which contains our implementation. It is capable of sorting floating point values in both 16 and 32-bit precision, and can also quickly sort records based on a floating point key. Please refer to the documentation for details regarding the API and the contents of the distribution. Also, please read through the system requirements below before using the library.

System Requirements

  • OS: Microsoft Windows 2000/XP and Linux
  • RAM: At least a size of graphics processor video memory is required.
  • GPU: NVIDIA GeForce/Quadro family card with support for the following OpenGL extensions:
    1. EXT_framebuffer_object
    2. ARB_texture_rectangle
    3. ARB_fragment_program
  • The above requirements are met by NV30-based GPUs and above (GeForce FX and GeForce 6 series)
  • The library has been tested on the following cards:
    • GeForce 6800 Ultra/GTO
    • GeForce 5950 Ultra
    • GeForce 5800
    • Quadro FX 4000
    • Laptop graphics cards: QuadroFX 700 Go, QuadroFX 1000 Go, QuadroFX 1400 Go, GeForce 6800 Go
      For obtaining reasonably high performance, we recommend a PC with AGP8X/PCI-Express NVIDIA GeForce 6800 GT or faster GPU.
  • Video RAM: The Video RAM will determine the maximum array length that can be sorted on the GPU. A rough guideline for sorting 32-bit floats is: Maximum array length in millions = Video RAM in MB / 16. Therefore, on a card with 256 MB VRAM, the maximum-length array which can be sorted is 256/16 = 16 Million values
  • Drivers: Latest drivers from NVIDIA (version 7772 or higher for windows, and 7664 for linux)
  • .

Note:

  • ATI cards: ATI cards are not supported in the present release of GPUSort mainly due to the lack of suport for framebuffer objects and ARB_texture_rectangle in current ATI drivers. These cards will be supported in future releases.
  • Non power-of-two arrays: The present implementation will not sort arrays with non power-of-two size completely on the GPU. Our present implementation takes the largest power of two smaller than the input array size, sorts that part on the GPU and merges the result with the "tail" after running qsort() on it. Therefore, the code will run slower on non-power-of-two arrays because the CPU will also be used. We are currently working on addressing this issue.

 

©2003 Department of Computer Science, UNC Chapel Hill