COMP 790 (GPGP) Project: nVIDIA G80 for Sparse Matrix-Vector Multiplication

COMP 790 (GPGP) Project:
nVIDIA G80 for Sparse Matrix-Vector Multiplication

As of now, I have a CUDA SpMV program which runs in both emulation and on the hardware. The approach is similar to Sashi's implementation. There was an unfortunate duplication of work here. However, Sashi and Suddha are now moving on to Sparse matrix-matrix and SVD, while I continue work on better methods for Sparse matrix-vector based on problem size, and integration into Conjugate Gradient.

In the current scheme, each thread computes the dot product for one entry in the output vector, referencing one row in the matrix. An arbitrary number of threads per block (256), was chosen based on recommendations in the performance chapter of the CUDA manual. It is suggested that the number of threads be high to more effectively hide latency. Subsequent testing will be done to test if this is in fact the optimal block size. Other changes to upcoming versions will also influence this choice.

I still have yet to do thorough benchmarking on my first implementation, but I have an encouraging preliminary result: the gpu completes 1000 iterations of a problem with N=32k and NNZ=640k in 1.04 seconds versus 2.25 seconds using the SpMV cpu code from NAS CG (on gamma-x2-cs). I will soon complete more rigorous testing, but this is a decent ballpark result for a first implementation.

Jan and Lars have worked on an analytic model for their CUDA nbody program, which guides the partitioning of work based on problem size and g80 characteristics. I am working to apply such a model to SpMV, though for this problem the sparcity and irregularity present additional constraints, which we have identified as follows:

Available parallelism
Potential reuse
Capacity constraints of the various memories
Multithreading constraints
Thread/block/grid layout
Data distribution and blocking for the memory hierarchy
Amount of sequential work done for latency hiding

In addition to the analytic model, there are several next steps forward. Using the texture and/or constant memory on the g80 to store the vector or portions of it, is one such step. These memories are cached, and could provide a performance benefit when reuse is possible. I will also be looking at whether it is better to have a small persistent number of blocks versus a large number of short-lived blocks with finer-grained work partitioning. These investigations will feed the development of the analytic model as well. It is anticipated that the model will be useful for similar problems.

Change in project scope (2/19)

Based on our interpretation of the work done by the folks at Berkeley, Jan and I have decided that I should concentrate on the CUDA implementation of SpMV. We will compare results in terms of GFLOPS, but it seems that they have done a lot of work and covered most of the ground there. They haven't done very large matrices, but it is a fair argument that multiple Cells could be used in that case. However, we don't have a cluster of Cells.

A CUDA implementation would be more novel. We discussed several schemes for partitioning, dealing with irregular structure, and permutations for better work scheduling. We believe that trying more of these approaches would probably be better worth our time.

Original Project Description (2/9)

The Cell Broadband Engine (BE) and nVIDIA G80 are both commodity processors developed primarily to support the gaming industry. However, the use of specialized commodity processors, especially graphics processing units (GPUs), for general purpose computation has flourished in recent years. Spurred by the success of this work, nVIDIA has introduced the Compute Unified Driver Architecture (CUDA) for the new G80 GPU. CUDA's data parallel, multithreaded programming model exploits the massively parallel design of the G80 chip.

The Cell Broadband Engine (BE) is a powerful multicore chip based on IBM's PowerPC architecture. The heart of the Sony PlayStation 3 gaming console, it can also be used for general purpose compution. Its high speed, vector-oriented Synergistic Processing Elements (SPEs) make it an attractive option for compute-intensive applications

We propose to compare the peformance of the G80 and Cell platforms on sparse matrix-vector multiplication (SpMV) . The problem is to multiply an n x n matrix M by a vector V of size n. While n can be O(10k) or even O(10M), each row of M has, on average, only a small number of nonzero elements. Moreover, the distribution of the nonzero elements in SpMV problem sets is often irregular. Due to these characteristics, achieving high performance on this computation is a formidable challenge.

The computation will be performed in the context of the Conjugate Gradient (CG) method to approximate the principal eigenvalue. SpMV is performed in each iteration of the CG kernel. In each iteration, the vector V is a function of the result vector from the SpMV calculation in the previous iteration.

Both the G80 and the Cell BE feature some memory stores which are local to each processing engine on the chip. We will explore two regimes of problem size, one in which the vector V is small enough to keep resident in the local memories, and one in which it is too large to fit. The expectation is that the computation will be slower on the larger problem sizes due to the cost of fetching portions of V multiple times. In addition to developing implementations tuned to these two regimes, we will incorporate a front-end module to select which implementation to use on an input problem based on its size.

SpMV has been widely studied in the high performance computing community for many years. A myriad of techniques have been developed to partition the work and exploit system memory hierarchies. We will employ elements of these techiques which suit the CUDA and Cell architectures. Williams et al. have implemented SpMV as part of their seminal evaluation of Cell for scientific computing [1]. Whereas their implementations targeted matrices with n ranging from 16k to 256k, our work will accomodate much larger problem sizes. Bolz et al. implemented SpMV on a previous generation of programmable GPUs [2]; we are unaware of existing implementations for CUDA.

Expected Milestones

March 1: Cell BE - Implementation for small problem size
April 1: G80 - Implementation for small problem size
May 1: Implementations for large problem size,
Front-end to select correct implementation based on problem size

References

[1] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The potential of the cell processor for scientific computing. In Proc. of 2006 ACM International Conference on Computing Frontiers, May 2006.

[2] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. In Proc. of ACM SigGraph '03, July 2003.