GPUSORT - Documentation

Contents of the Distribution

The archive contains all the libraries and include files needed to build applications using GPUSort. The only assumption made about the target system is the presence of a working OpenGL runtime with the correct drivers for the video card and the presence of GLUT. In case you are not sure about the video drivers on your system and own an NVIDIA card, go here to install the latest drivers for your system. The following is a description of what each folder contains:

src: Source code for the GPUSort library
include: Include files for applications using GPUSort
Samples: Three examples demonstrating usage of the GPUSort library
Project files: Project files for Microsoft Visual Studio 6.0, Microsoft Visual Studio .NET 2003 and higher, and makefiles for linux
lib: Contains the GPUSort library for Win32 and linux

Building an application

Setting up the library for use in your application is simple.

Build the gpusort library
Make sure you link to glut32.lib and gpusort.lib (or libgpusort.a on linux)
The include folder included with the distribution should be in your include path
Once the above two things have been done, all you need to do is #include <GPUSort.h> in your application

GPUSort API

Since the API of GPUSort is very simple, instead of a function-by-function description of the class, we will present simple example codes to illustrate the usage. If you are looking for more elaborate information please go through GPUSort.h, which contains elaborate function descriptions. GPUSort provides two interfaces. The first one is very simple while the second one exposes more of what's going on inside. It is in general recommended that you use the simple interface unless you want more control over whats going on inside. One such application is to find out how much time is being spent in transferring data between the CPU and GPU versus the computation.

Sorting Scalars: Simple Interface

Below is a sample code snippet which sorts an array of 32-bit floating point numbers.

#include <GPUSort.h>

int main(){
    ...
    float *input = new float[SIZE];

    for(int i=0;i<SIZE;i++) input[i]=(float)rand();

    GPUSort gpusort = new GPUSort(FLOAT32);

    SORT_ERROR err = gpusort->sort(input, SIZE);

    if(err!=SORT_SUCCESS){

        char *err_string = gpusort->sortErrorString(err);

        fprintf(stderr,"Error while sorting: %s\n",err_string);

        delete[] err_string;

    }

    delete(gpusort);
    ...
}

The above code first allocates an array and fills it up with random content. Next, it creates a GPUSort object with FLOAT32 as an argument, which means the data will be sorted on the GPU in 32-bit precision. In case precision is not a big issue, then you should pass in FLOAT16 as an argument instead, as it is about twice as fast. Currently, GPUSort supports 16 and 32 bit floating-point precision.

On the next line, sorting on the GPU is invoked with the input array and its size as the parameters. The resulting sorted sequence overwrites the contents of the input array. The remaining code tests if the sorting resulted in an error and if so, prints an error message using the provided utility function sortErrorToString(). Also, it is important to free up the object once you're done because this application will take up your video RAM almost completely and its important to free up those resources.

Once a GPUSort object has been created, it can be used to sort any number of arrays before being destroyed.

Sorting multi-field records: Simple Interface

To sort multi-field records, one has to first create a new array which contains (key,pointer) tuples, where the pointer points to the actual struct containing the other fields of the record.

#include <GPUSort.h>

int main(){
    ...
    SortPtr *input = new SortPtr[SIZE];

    //Fill the input[] with some values 

    GPUSort gpusort = new GPUSort(FLOAT32);

    SORT_ERROR err = gpusort->sortPtr(input, SIZE);

    if(err!=SORT_SUCCESS){

        char *err_string = gpusort->sortErrorString(err);

        fprintf(stderr,"Error while sorting: %s\n",err_string);

        delete[] err_string;

    }

    delete(gpusort);
    ...
}

The only difference from the previous example is the data type of the array, which is now SortPtr instead of float, and the function call is now sortPtr() instead of sort. Take a look at example3 in the samples directory for an example of using this interface.

Advanced Interface

The code snippet shown below sorts an array of 32-bit floating point values / (key,value) tuples but using separate calls to internal functions to set the sorting parameters and upload, sort and readback the data.

#include <GPUSort.h>

int main(){
    ...

    /** SORT SCALAR 32-bit FLOATING POINT VALUES **/

    float *input = new float[SIZE];
 
    ...

    GPUSort gpusort = new GPUSort(FLOAT32);

    SORT_ERROR err = gpusort->setSortParams(SIZE,SORT_SCALARS)


    /** SORT 32-bit (KEY,VALUE) TUPLES */

    SortPtr *input = new SortPtr[SIZE];

    ...

 
    GPUSort gpusort = new GPUSort(FLOAT32);

    SORT_ERROR err = gpusort->setSortParams(SIZE,SORT_TUPLES)


    /** THE REMAINING CODE IS INVARIANT NO MATTER WHICH OF THE ABOVE TWO YOU USE */

    if(err!=SORT_SUCCESS){

        char *err_string = gpusort->sortErrorString(err);

        fprintf(stderr,"Error while sorting: %s\n",err_string);

        delete[] err_string;

    }else{

        gpusort->uploadData(input);

        gpusort->bitonicSort();

        gpusort->readbackData(input);

    }

    delete(gpusort);
    ...
}

The inital part of the code sets the internal parameters using a call to setSortParams. All subsequent calls on the GPUSort object assume the information passed herein. The first option configures GPUSort to sort arrays of a particular size (SIZE), the individual values being scalars (SORT_SCALARS). The second option configures GPUSort to sort tuples instead (SORT_TUPLES). The uploadData() call copies the data from the host memory to the video memory as a texture. The bitonicSort() call executes bitonic sort on the data now present in the video memory. Finally, the readbackData() call copies the sorted data from the video memory back into the specified array. This array could be the same one which was used for uploading the data, as is the case in this example. However, one has to be a bit careful while using this interface. Our sorting algorithm assumes the array size is a power of two for achieving maximum performance. The implementation of the simple interface handles non-power-of-two arrays by sorting the largest power of two smaller than the given size on the GPU, and the "tail" on the CPU using qsort(). The performance can be improved by using threading to perform the sorting on GPU and CPU in parallel and will be a part of the future release.

Note:

To ensure maximum performance, a GPUSort object takes up the maximum video memory available on creation. Therefore, in our current implementation, only one GPUSort object is active at any time and more importantly, the object is cleaned when it is no longer required. Ideally you should not need to instantiate more than one object because the same object can be used to sort any number of arrays. Failing to do so may lead to severely degraded performance. It is also possible to add an interface where multiple objects are created with lower memory requirements.

Department of Computer Science, UNC Chapel Hill

GPUSort: High Performance Sorting using Graphics Processors