# Parallel Solutions to Irregular Problems using HPF

Lars Nyland, Siddhartha Chatterjee, Jan Prins
Department of Computer Science, University of North Carolina, Chapel Hill, NC

February 24, 1997

HPF User's Group Meeting
Santa Fe, New Mexico

## The Source of Irregular Problems...

Goal: Parallelize code to get high performance!

Start with scalar functions

• e.g. f = Gm₁m₂/r² or F = 9C/5 + 32
• no parallelism

First step: vectorize functions

• compute many similar scalar expressions on sequences of data yielding sequences of results, or
• reduction, prefix, suffix operations
• fᵢ = Σₖ Gmᵢmₖ/rᵢₖ²
• vector (or flat) parallelism
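As an illustration of this flat (vector) parallelism, the force sum fᵢ = Σₖ Gmᵢmₖ/rᵢₖ² can be sketched in Python (Python is used here purely for illustration; the inputs are made-up 1D positions, not from the talk):

```python
# Flat (vector) data parallelism: evaluate the same scalar expression
# f_i = sum_k G * m_i * m_k / r_ik**2 across a whole sequence at once.
G = 6.674e-11  # gravitational constant

def pairwise_force_sums(mass, pos):
    """For each body i, sum the scalar force magnitude over all other bodies k."""
    n = len(mass)
    f = [0.0] * n
    for i in range(n):
        # This inner map-and-reduce is the part a vector machine runs in parallel.
        f[i] = sum(
            G * mass[i] * mass[k] / (pos[i] - pos[k]) ** 2
            for k in range(n) if k != i
        )
    return f
```

Each f[i] is an independent reduction, which is exactly the structure that vectorizes.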

## One More Step Leads to Nested Data Parallelism

Next step: Add another dimension to vector solution to create nested data parallelism
• vector of vectors
• sequence of sequences
• list of lists
• 2-dimensional array

Examples

• Sort each sequence in a sequence of sequences
• Apply different motion to a sequence of bitmaps
• Subselection on 2D array (graph edges)
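The first example above, sorting each inner sequence independently, is the simplest picture of nested data parallelism. A minimal Python sketch (the outer comprehension is the outer parallel level; each inner sort could itself run in parallel):

```python
# Nested data parallelism: apply a (potentially parallel) operation
# independently to each element of a ragged sequence of sequences.
def nested_sort(seqs):
    # Outer level: one task per inner sequence; inner level: a parallel sort.
    # The inner sequences may all have different lengths (ragged data).
    return [sorted(s) for s in seqs]
```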

Nesting depth potentially unlimited

## Nested Sequences: Regular or Not?

Some are regular
• forall i in 1:n do fᵢ = Σⱼ Gmᵢmⱼ/rᵢⱼ²

Some in between

• forall (i in 1:n, j in 1:i-1) let f = Gmᵢmⱼ/rᵢⱼ²
• F[i] += f, F[j] -= f

Some are not

• Sort a sequence of sequences with quick-sort
• qsort_p([ [values-below-pivot], [values-above-pivot] ])
• Compute forces from list of lists of atom interactions
• forall (i in 1:n, j in i+1:len(i), in_range(i,j))
• force_calc(atom(i), atom(j))

## Support for Nested Data Structures in HPF

Multidimensional arrays (from Fortran 90/95)
• Array operators and functions
• spread, combine, pack
• Reduction functions by dimension
• Support for irregularity?
• Masks supported

Segmented vectors (in HPF library)

• Two-level support for irregular data
• Segmented vector operations
• prefix, suffix, scatter

## An Example: Electrostatic Interaction with Cutoff

• Typical Molecular Dynamics Code
• Approximation that improves complexity from O(n²) to O(n)
• Calculate F = k q₁ q₂ r / |r|³, where |r| < cutoff radius
• Implementations
• Wishful thinking
• Serial Fortran
• HPF Array Parallelism
• HPF Segmented Vector Parallelism

## Ideal Implementation

```
pairlist = (/ (i, j) where distance(i,j) <= cutoff, i = 1, n, j = i+1, n /)

forall (i,j) in pairlist
  F(i) += force(i,j)
  F(j) -= force(i,j)

function force(i,j) = k * q(i) * q(j) * (X(i) - X(j)) / norm2(X(i) - X(j))**3
```

## A Simple, Serial Implementation

```fortran
do i = 1, n
  do j = i+1, n
    r = dist(i, j)
    if (r <= cutoff) then
      k = G * mass(i) * mass(j) / r ** 3
      xx = x(i) - x(j)
      Fxx = k * xx
      Fx(i) = Fx(i) + Fxx
      Fx(j) = Fx(j) - Fxx
      ! do the same for Fy and Fz ...
    endif
  enddo
enddo
```

## HPF Solution, using Array Operators

Perform masked array operations
```fortran
forall (i = 1:n)
  forall (j = i+1:n, dist(i,j) <= cutoff)
    mask(i,j) = .true.
    r(i,j) = dist(i,j)
    k(i,j) = G * mass(i) * mass(j) / r(i,j) ** 3
    xx(i,j) = x(i) - x(j)
    Fxx(i,j) = k(i,j) * xx(i,j)
    ! do the same for y and z ...
  end forall
end forall
Fx = sum(Fxx, dim=2, mask=mask) - sum(Fxx, dim=1, mask=mask)
```
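The final sum line works because each masked pair (i, j) with j > i contributes +Fxx(i,j) to particle i (the dim=2 row sum) and -Fxx(i,j) to particle j (the dim=1 column sum), so no force is double-counted and Newton's third law holds. A Python sketch of that fold on toy arrays (names mirror the HPF version; values are illustrative):

```python
def fold_pair_forces(Fxx, mask):
    """Fx[i] = sum_j mask[i][j]*Fxx[i][j] - sum_j mask[j][i]*Fxx[j][i],
    i.e. HPF's  sum(Fxx, dim=2, mask) - sum(Fxx, dim=1, mask)."""
    n = len(Fxx)
    Fx = [0.0] * n
    for i in range(n):
        for j in range(n):
            if mask[i][j]:          # pair (i, j) was within the cutoff
                Fx[i] += Fxx[i][j]  # action on particle i  (row sum)
                Fx[j] -= Fxx[i][j]  # opposite reaction on j (column sum)
    return Fx
```

Because every contribution appears once with each sign, the total force sums to zero.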


Alternatively, delete one 'sum' operation, replacing it with two more assignments:
```fortran
forall (i = 1:n)
  forall (j = i+1:n, dist(i,j) <= cutoff)
    mask(i,j) = .true.
    mask(j,i) = .true.
    r(i,j) = dist(i,j)
    k(i,j) = G * mass(i) * mass(j) / r(i,j) ** 3
    xx(i,j) = x(i) - x(j)
    Fxx(i,j) = k(i,j) * xx(i,j)
    Fxx(j,i) = -Fxx(i,j)
    ! do the same for y and z ...
  end forall
end forall
Fx = sum(Fxx, dim=2, mask=mask)
```


## Part I

Calculate compressed pairlist of indices
• idx1 = 1 1 1 1 2 2 3 3 4
• idx2 = 2 3 4 5 3 5 4 5 5
```fortran
mask = .false.
forall (i = 1:n)
  forall (j = i+1:n, dist(i,j) <= cutoff)
    f_seg(i,j) = i
    f_ind(i,j) = j
    mask(i,j) = .true.
  end forall
end forall
nz = count(mask)
allocate(idx1(nz))
! and allocate idx2, F, k
idx1 = pack(f_seg, mask)
idx2 = pack(f_ind, mask)
```
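In effect, Part I filters the strict upper triangle by the cutoff mask and packs the surviving (i, j) indices into dense vectors. A Python sketch of that compression (not the HPF pack itself; `dist` is passed in as a function and 1-based indices match the Fortran):

```python
def build_pairlist(n, dist, cutoff):
    """Pack the (i, j) indices of in-cutoff pairs, j > i, into dense
    vectors idx1, idx2 (the effect of pack(f_seg, mask), pack(f_ind, mask))."""
    idx1, idx2 = [], []
    for i in range(1, n + 1):          # 1-based, as in the Fortran code
        for j in range(i + 1, n + 1):
            if dist(i, j) <= cutoff:   # the forall mask
                idx1.append(i)         # f_seg(i, j) = i
                idx2.append(j)         # f_ind(i, j) = j
    return idx1, idx2
```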

## Part II

Now perform a series of general indexing operations on dense vectors
```fortran
! Array ops for x, y, z force computation
forall (i = 1:nz)
  k(i) = G * mass(idx1(i)) * mass(idx2(i)) / (dist(idx1(i), idx2(i)) ** 3)
end forall
F = k * (x(idx1) - x(idx2))
allocate(zero(n))
zero = 0.
Fx = sum_scatter(F, zero, idx1) - sum_scatter(F, zero, idx2)
! Do the same array operations for Y and Z forces ...
```
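The key primitive here is SUM_SCATTER, which scatter-adds each packed pair force into its particle's slot. A Python emulation of the semantics as used above (a sketch only; 1-based indices to match the Fortran):

```python
def sum_scatter(values, base, idx):
    """Emulate HPF's SUM_SCATTER: out[idx[i]] += values[i], starting from base."""
    out = list(base)
    for v, i in zip(values, idx):
        out[i - 1] += v            # idx is 1-based, as in the Fortran code
    return out

def fold_forces(F, idx1, idx2, n):
    """Fx = sum_scatter(F, zero, idx1) - sum_scatter(F, zero, idx2):
    each pair force acts on particle idx1 and reacts on particle idx2."""
    zero = [0.0] * n
    plus = sum_scatter(F, zero, idx1)
    minus = sum_scatter(F, zero, idx2)
    return [p - m for p, m in zip(plus, minus)]
```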

## Data Mapping Strategy

Do-loop Version
• Do nothing

Array Version

• Cyclic distribution of 2D array
• Fancy schemes with replication

Dense Vector Version

• Cyclic distribution for part I
• Block or cyclic distribution shouldn't matter for Part II
• Generalized array indexing generates all-to-all personalized communication

## Performance Results

Experiments on SP-2 at Cornell
• IBM's xlhpf compiler
• PGI's pghpf compiler

Compiled successfully, but...

• Incomplete implementations of HPF_LIBRARY
• Enough in PGI implementation to support segmented version
• Not so in IBM implementation

Improvements

• xlhpf version core-dumped with pure function in forall mask
• 3x improvement by giving it its own loop
• Inlining 'dist' by hand gave 10x performance improvement

Profiling tools helpful

Pack slow in pghpf

Difficult to explain distribution and alignment costs

## Updated Performance Results

Experiments on Origin2000 at NCSA
• PGI's pghpf compiler

Algorithms

• Do Loop version is just Fortran 77
• Array version uses array operations
• Array+ version moves the "forall filter" to a separate loop
• Seg version uses pack and sum_scatter
• Inline versions have the distance function inlined by hand (9-90x faster)

Difficult to explain lack of performance improvement, despite multiple attempts at ALLOCATE

## Comments on Implementations

Segmented Vector is mostly O(n) dense vector operations

Didn't use segmented operations

• No segmented reduce functions in HPF, used scatter instead

Implementation of F90 and HPF library routines

• Pack slow
• Sum_Scatter, Sum fast on single processor
• Parallel performance?

Sensitivity to coding details

• 3x performance improvement moving forall filter to its own forall loop

Change

```fortran
forall (..., dist(i,j) <= cutoff)
```

to

```fortran
forall (...) d(i,j) = dist(i,j)
forall (..., d(i,j) <= cutoff)
```

• 10x for inlining dist by hand; can the compiler do this?

## General Parallelization Technique

Any pure scalar function can be 'parallelized' into a nested data-parallel implementation

Also pure vector functions

• regardless of differing sizes of nested vectors

Write scalar code

For first level of parallelism

• use vector notation and operations
• promote pure scalar functions to pure vector functions by wrapping in forall statement

For a second level of parallelism

• use segmented vectors as 2D arrays
• promote pure vector functions to nested vector functions
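The promotion steps above can be sketched in Python (illustration only; `fahrenheit` mirrors the talk's F = 9C/5 + 32 example, and `promote` stands in for wrapping a function in a forall):

```python
def fahrenheit(c):
    """A pure scalar function: F = 9C/5 + 32."""
    return 9 * c / 5 + 32

def promote(f):
    """Promote a pure function elementwise over a sequence (one forall level)."""
    return lambda xs: [f(x) for x in xs]

vec_fahrenheit = promote(fahrenheit)          # first level: flat data parallelism
nested_fahrenheit = promote(vec_fahrenheit)   # second level: nested data parallelism
```

Applying `promote` twice handles ragged inner sequences with no extra machinery, which is exactly the "regardless of differing sizes" point above.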

## HPF Support for Specific Operations

HPF supports vector extension of pointwise operations
• a = a + b * c => A = A + B * C

Expansion operations

• Spread, Range

Conditionals

• Pack
• Combine

Reduction

• Segmented-reduction via xxx_scatter

Prefix, Suffix, Scatter

• Segmented versions enable operations such as indexing within segments
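To illustrate what "segmented" means here, the following Python sketch shows a segmented plus-prefix that restarts at each segment boundary; segments are described by a vector of lengths (one possible descriptor chosen for clarity; HPF's actual segment encoding differs):

```python
def segmented_prefix_sum(values, seg_lengths):
    """Inclusive plus-scan over a flat vector, restarting the running
    sum at each segment boundary given by seg_lengths."""
    out, pos = [], 0
    for length in seg_lengths:
        acc = 0
        for v in values[pos:pos + length]:
            acc += v
            out.append(acc)
        pos += length
    return out
```

With segmented scans like this, per-segment indexing and ranking fall out of flat vector operations.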

## Automatic Methods

Compilers can generate flat and nested data parallel versions of functions automatically
• NESL, Proteus technology

Relies upon:

• Pure functions
• Pack, combine operations
• Array operations
• Reduce, segmented-reduce operations
• Prefix, segmented-prefix operations

## Conclusions

HPF has enough power to adequately express solutions for irregular problems

Parallelization involves condensing irregular 2D data into segmented vectors

Technique can be implemented

• manually
• in compilers and libraries
• in source-to-source transformation systems

Good parallel implementations of HPF library functions are critical

## What I like about HPF

First high-level language (with user group > epsilon) that supports irregular computations

Supports nested data structures through multidimensional arrays, nestable using pointers

Supports nested parallel iterative constructs

• array assignment
• forall construct
• DO INDEPENDENT directive
• Library functions

Exactly what is required for nested data parallelism (multi-level vectorization of scalar functions)

## What I'd like in HPF

Support for 0-length segments in segmented arrays
• Not all problems are linear systems of equations

Segmented-reduce intrinsics

• reduction and scan operations are related in HPF, but they have different heritages, and it shows
• Using xxx_scatter(a, b, i1, i2, ...) is an expensive operation compared to xxx_reduce(a, segment=s)

Indexing ragged arrays

• as if 2D arrays

Multi-level nested ragged arrays?

• Like Matlab 5.0 