Parallel Solutions to Irregular Problems using HPF 
Lars Nyland, Siddhartha Chatterjee, Jan Prins
February 24, 1997, HPF User's Group Meeting
The Source of Irregular Problems... 
Goal: Parallelize code to get high performance!
Start with scalar functions
First step: vectorize functions

One More Step Leads to Nested Data Parallelism 

Next step: add another dimension to the vector solution
to create nested data parallelism
Examples
Nesting depth potentially unlimited 
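The distinction between the two levels can be sketched in plain Python (my illustration, not from the talk): a flat data-parallel step applies one operation elementwise over a vector, while a nested step applies that whole vector operation to each element of a (possibly ragged) vector of vectors.

```python
def flat_parallel(xs):
    # one level of parallelism: each element is independent
    return [x * x for x in xs]

def nested_parallel(xss):
    # two levels: the outer map and each inner map could both run in parallel
    return [flat_parallel(xs) for xs in xss]

ragged = [[1, 2, 3], [4], [5, 6]]
print(nested_parallel(ragged))  # [[1, 4, 9], [16], [25, 36]]
```

Note that the inner sequences need not have equal lengths, which is exactly what makes the problem irregular.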




Nested Sequences: Regular or Not? 

Some are regular
Some are in between
Some are not


Support for Nested Data Structures in HPF 
Multidimensional arrays (from Fortran 90/95)
Segmented vectors (in HPF library)

Nested sequences are represented as flat data vectors with segment descriptors
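The segment-descriptor representation can be sketched in a few lines of Python (my example data, not from the talk): a nested sequence becomes one flat value vector plus a vector of segment lengths, and the nesting can be recovered from the pair.

```python
nested = [[10, 20, 30], [40], [50, 60]]

data = [x for seg in nested for x in seg]   # flat value vector
seg_len = [len(seg) for seg in nested]      # segment descriptor

# Recover the nesting from (data, seg_len):
recovered, pos = [], 0
for n in seg_len:
    recovered.append(data[pos:pos + n])
    pos += n
```

All subsequent operations can then run on the dense `data` vector, with `seg_len` carried along to delimit segments.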
An Example: Electrostatic Interaction with Cutoff 


Ideal Implementation 
pairlist = (/ (i, j) where distance(i,j) <= cutoff, i = 1, n, j = i+1, n /)

forall (i,j) in pairlist
   F(i) += force(i,j)
   F(j) -= force(i,j)

function force(i,j) = k * q(i) * q(j) * (X(i) - X(j)) / norm2(X(i) - X(j))**3
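The ideal formulation above can be mimicked directly in ordinary Python (a sketch, not HPF; the coordinates `X`, charges `q`, constant `k`, and `cutoff` are invented 1-D test data):

```python
k = 1.0
X = [0.0, 1.0, 3.0, 3.5]
q = [1.0, -1.0, 2.0, 1.0]
cutoff = 2.0
n = len(X)

def force(i, j):
    # 1-D pair force, matching k * q(i) * q(j) * (X(i) - X(j)) / |X(i) - X(j)|**3
    r = X[i] - X[j]
    return k * q[i] * q[j] * r / abs(r) ** 3

# pairlist: all (i, j) with i < j whose separation is within the cutoff
pairlist = [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(X[i] - X[j]) <= cutoff]

F = [0.0] * n
for (i, j) in pairlist:
    f = force(i, j)
    F[i] += f      # equal and opposite contributions (Newton's third law)
    F[j] -= f
```

The irregularity is visible in `pairlist`: each particle interacts with a different number of neighbors.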
A Simple, Serial Implementation 
do i = 1, n
   do j = i+1, n
      r = dist(i,j)
      if (r <= cutoff) then
         k   = G * mass(i) * mass(j) / r ** 3
         xx  = x(i) - x(j)
         Fxx = k * xx
         Fx(i) = Fx(i) + Fxx
         Fx(j) = Fx(j) - Fxx
         ! do same for Fy and Fz ...
      endif
   enddo
enddo
HPF Solution, using Array Operators 
Perform masked array operations
forall (i = 1:n)
   forall (j = i+1:n, dist(i,j) <= cutoff)
      mask(i,j) = .true.
      r(i,j)    = dist(i,j)
      k(i,j)    = G * mass(i) * mass(j) / r(i,j) ** 3
      xx(i,j)   = x(i) - x(j)
      Fxx(i,j)  = k(i,j) * xx(i,j)
      ! do same for y and z ...
   end forall
end forall
Fx = sum(Fxx, dim=2, mask=mask) - sum(Fxx, dim=1, mask=mask)
...
Alternatively, one 'sum' operation can be dropped at the cost of two extra assignments
forall (i = 1:n)
   forall (j = i+1:n, dist(i,j) <= cutoff)
      mask(i,j) = .true.
      mask(j,i) = .true.
      r(i,j)    = dist(i,j)
      k(i,j)    = G * mass(i) * mass(j) / r(i,j) ** 3
      xx(i,j)   = x(i) - x(j)
      Fxx(i,j)  = k(i,j) * xx(i,j)
      Fxx(j,i)  = -Fxx(i,j)
      ! do same for y and z ...
   end forall
end forall
Fx = sum(Fxx, dim=2, mask=mask)
...
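A NumPy analogue of this masked O(n^2) array formulation (my invented data, not the talk's): build the full n x n pairwise quantities, mask them by the cutoff, and reduce along an axis.

```python
import numpy as np

G = 1.0
x = np.array([0.0, 1.0, 3.0, 3.5])
mass = np.array([1.0, 2.0, 1.0, 1.0])
cutoff = 2.0

dx = x[:, None] - x[None, :]          # xx(i,j) = x(i) - x(j)
r = np.abs(dx)
mask = (r <= cutoff) & (r > 0)        # cutoff test; exclude self-pairs

k = np.zeros_like(dx)
k[mask] = G * (mass[:, None] * mass[None, :])[mask] / r[mask] ** 3
Fxx = k * dx                          # full antisymmetric force matrix

# One reduction suffices because Fxx(j,i) = -Fxx(i,j) is stored explicitly,
# as in the second forall variant.
Fx = Fxx.sum(axis=1)
```

The cost is O(n^2) storage and work even when few pairs survive the cutoff, which is what motivates the segmented-vector version that follows.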
HPF Solution, Using Segmented Vectors, Part I 

Calculate compressed pairlist of indices

mask = .false.
forall (i = 1:n)
   forall (j = i+1:n, dist(i,j) <= cutoff)
      f_seg(i,j) = i
      f_ind(i,j) = j
      mask(i,j)  = .true.
   end forall
end forall
nz = count(mask)
allocate(idx1(nz))          ! and allocate idx2, F, k
idx1 = pack(f_seg, mask)
idx2 = pack(f_ind, mask)
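The pack step has a direct NumPy analogue (invented data, not the talk's): compress the upper triangle of the cutoff mask into dense index vectors, with `np.nonzero` playing the role of `pack(f_seg, mask)` and `pack(f_ind, mask)`.

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 3.5])
cutoff = 2.0

r = np.abs(x[:, None] - x[None, :])
mask = np.triu(r <= cutoff, k=1)      # j > i only, within the cutoff

idx1, idx2 = np.nonzero(mask)         # compressed pairlist, one entry per pair
nz = len(idx1)
```

After this step all O(n^2) structure is gone: `idx1` and `idx2` are dense length-`nz` vectors indexing only the interacting pairs.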
HPF Solution, Using Segmented Vectors, Part II 

Now perform a series of general indexing operations on dense vectors 
! Array ops for x, y, z force computation
forall (i = 1:nz)
   k(i) = G * mass(idx1(i)) * mass(idx2(i)) / dist(idx1(i), idx2(i)) ** 3
end forall
F = k * (x(idx1) - x(idx2))
allocate(zero(n))
zero = 0.
Fx = sum_scatter(F, zero, idx1) - sum_scatter(F, zero, idx2)
! Do same array operations for Y and Z forces ...
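A NumPy analogue of this dense O(nz) stage (again my invented data): gather with fancy indexing, then let `np.add.at` stand in for HPF's `sum_scatter`, which accumulates duplicate indices rather than overwriting them.

```python
import numpy as np

G = 1.0
x = np.array([0.0, 1.0, 3.0, 3.5])
mass = np.array([1.0, 2.0, 1.0, 1.0])
idx1 = np.array([0, 1, 2])            # packed pairs within the cutoff
idx2 = np.array([1, 2, 3])

r = np.abs(x[idx1] - x[idx2])
k = G * mass[idx1] * mass[idx2] / r ** 3
F = k * (x[idx1] - x[idx2])           # one force value per interacting pair

Fx = np.zeros_like(x)
np.add.at(Fx, idx1, F)                # sum_scatter(F, zero, idx1)
np.add.at(Fx, idx2, -F)               # minus sum_scatter(F, zero, idx2)
```

Note `Fx[idx1] += F` would be wrong when an index repeats; the unbuffered `np.add.at` is what gives the scatter-add semantics of `sum_scatter`.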
Data Mapping Strategy 
Do-loop Version
Array Version
Dense Vector Version

Performance Results 
Experiments on SP2 at Cornell
Compiled successfully, but...
Improvements
Profiling tools helpful
Pack slow in pghpf
Difficult to explain distribution and alignment costs 
Updated Performance Results 
Experiments on Origin2000 at NCSA
Algorithms
Difficult to explain lack of performance improvement, despite multiple attempts at ALLOCATE 
Comments on Implementations 
Segmented Vector is mostly O(n) dense vector operations
Didn't use segmented operations
Implementation of F90 and HPF library routines
Sensitivity to coding details
Change
   forall (..., dist(i,j) <= cutoff)
to
   forall (...) d(i,j) = dist(i,j)
   forall (..., d(i,j) <= cutoff)
General Parallelization Technique 
Any pure scalar function can be 'parallelized' for a nested
data-parallel implementation
Also pure vector functions
Write scalar code
For the first level of parallelism
For a second level of parallelism

HPF Support for Specific Operations 
HPF supports vector extension of pointwise operations
Expansion operations
Conditionals
Reduction
Prefix, Suffix, Scatter

Automatic Methods 
Compilers can generate flat and nested data parallel versions of functions
automatically
Relies upon:

Conclusions 
HPF has enough power to adequately express solutions for irregular
problems
Parallelization involves condensing irregular 2D data into segmented vectors
The technique can be implemented
Good parallel implementations of HPF library functions are critical 
What I like about HPF 
First high-level language (with user group > epsilon) that supports irregular
computations
Supports nested data structures through multidimensional arrays, nestable using pointers
Supports nested parallel iterative constructs
Exactly what is required for nested data parallelism (multilevel vectorization of scalar functions) 
What I'd like in HPF 
Support for 0-length segments in segmented arrays
Segmented-reduce intrinsics
Indexing ragged arrays
Multi-level nested ragged arrays?

