Parallel Solutions to Irregular Problems using HPF

Lars Nyland, Siddhartha Chatterjee, Jan Prins
Department of Computer Science, University of North Carolina, Chapel Hill, NC

February 24, 1997

HPF User's Group Meeting
Santa Fe, New Mexico

The Source of Irregular Problems...

Goal: Parallelize code to get high performance!

Start with scalar functions

  • e.g., f = G*m1*m2/r^2 or F = 9*C/5 + 32 (see the sketch below)
  • no parallelism
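A minimal sketch of such a pure scalar function in Fortran; the name force_mag, the argument names, and the use of SI units are illustrative assumptions, not taken from the original code:

  ! Pure scalar function: the force magnitude between two point masses
  pure real function force_mag(m1, m2, r)
    real, intent(in) :: m1, m2, r        ! masses and separation distance
    real, parameter  :: G = 6.674e-11    ! gravitational constant (SI units)
    force_mag = G * m1 * m2 / r**2
  end function force_mag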

First step: vectorize functions

  • compute many similar scalar expressions on sequences of data yielding sequences of results, or
  • reduction, prefix, suffix operations
  • f_i = Sum_k (G * m_i * m_k / r_ik^2)
  • vector (or flat) parallelism (see the sketch after this list)
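A minimal sketch of the vectorized (flat data-parallel) form in HPF-style Fortran, assuming a mass vector m(1:n), a pairwise distance matrix r(1:n,1:n), and a work array fmat(1:n,1:n); all names are illustrative:

  ! Flat data parallelism: a masked forall fills the whole interaction matrix,
  ! then a single array reduction produces the result vector
  fmat = 0.0
  forall (i = 1:n, k = 1:n, i /= k)
    fmat(i,k) = G * m(i) * m(k) / r(i,k)**2
  end forall
  f = sum(fmat, dim=2)     ! f(i) = Sum_k G*m(i)*m(k)/r(i,k)^2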

One More Step Leads to Nested Data Parallelism

Next step: Add another dimension to vector solution to create nested data parallelism
  • vector of vectors
  • sequence of sequences
  • list of lists
  • 2-dimensional array

Examples

  • Sort each sequence in a sequence of sequences
  • Apply different motion to a sequence of bitmaps
  • Subselection on 2D array (graph edges)

Nesting depth potentially unlimited
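One plausible nested form, assuming ragged rows stored in a rectangular array a(1:nrows, 1:maxlen) with per-row lengths rowlen(1:nrows); names are illustrative:

  ! Nested data parallelism: an independent inner data-parallel operation per row,
  ! even though every row may have a different length
  forall (i = 1:nrows)
    rowsum(i) = sum(a(i, 1:rowlen(i)))
  end forall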

[Figures: Ragged Array, Many Bitmaps, Sparse Pairlist]

Nested Sequences: Regular or Not?

Some are regular
  • forall i in 1:n do f_i = Sum_j (G * m_i * m_j / r_ij^2)

Some in between

  • forall (i in 1:n, j in 1:i-1) let f = G * m_i * m_j / r_ij^2
    • F[i] += f, F[j] -= f

Some are not

  • Sort a sequence of sequences with quick-sort
    • qsort_p([ [values-below-pivot], [values-above-pivot] ])
  • Compute forces from list of lists of atom interactions
    • forall (i in 1:n, j in i+1:len(i), in_range(i,j))
      • force_calc(atom(i), atom(j))
[Figures: Regular Array, LU Array, Ragged Array, Sparse Array]

Support for Nested Data Structures in HPF

Multidimensional arrays (from Fortran 90/95)
  • Array operators and functions
    • spread, combine, pack
  • Reduction functions by dimension
  • Support for irregularity?
    • Masks supported

Segmented vectors (in HPF library)

  • Two-level support for irregular data
  • Segmented vector operations
    • prefix, suffix, scatter
[Figure: a ragged array represented as a segmented array]
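A minimal sketch of this two-level representation; HPF has no ragged-array type, so a flat value vector plus a hand-maintained segment descriptor stand in for it (the names vals, seg_len, seg_start, and row_of are illustrative):

  ! Ragged array [[3,1,4], [1,5], [9,2,6,5]] flattened into one value vector
  ! plus a segment descriptor recording each row's start and length
  integer, parameter :: vals(9)      = (/ 3, 1, 4,  1, 5,  9, 2, 6, 5 /)
  integer, parameter :: seg_len(3)   = (/ 3, 2, 4 /)
  integer, parameter :: seg_start(3) = (/ 1, 4, 6 /)   ! 1 + prefix sums of seg_len
  ! Element (i,j) of the ragged array is vals(seg_start(i) + j - 1); per-segment
  ! reductions can use the HPF library, e.g. sum_scatter(vals, zeros, row_of),
  ! where row_of(k) gives the row that owns vals(k)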

An Example: Electrostatic Interaction with Cutoff

  • Typical Molecular Dynamics Code
    • Approximation that improves complexity from O(n^2) to O(n)
    • Calculate F = k * q1 * q2 * r / |r|^3, where |r| < cutoff radius
  • Implementations
    • Wishful thinking
    • Serial Fortran
    • HPF Array Parallelism
    • HPF Segmented Vector Parallelism
    [Figure: MD with cutoff]

    Ideal Implementation

    pairlist = (/ (i, j) where distance(i,j) <= cutoff, i = 1, n, j = i+1, n /)
    forall (i,j) in pairlist
      F(i) += force(i,j)
      F(j) -= force(i,j)
    function force(i,j) = k * q(i) * q(j) * (X(i) - X(j)) / norm2(X(i) - X(j))**3

    A Simple, Serial Implementation

    do i = 1,n
      do j = i+1,n 
        r = dist(i,j)
        if (r <= cutoff) then 
          k = G * mass(i) * mass(j) / r ** 3 
          xx = x(i) - x(j)
          Fxx = k * xx 
          Fx(i) = Fx(i) + Fxx
          Fx(j) = Fx(j) - Fxx
          ! do same for Fy and Fz ... 
        endif 
      enddo 
    enddo
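    The loop assumes a pure distance function dist(i,j); a minimal sketch follows, with the coordinate arrays x, y, z assumed visible by host or module association (an assumption, since the original routine is not shown):

    ! Pure, so it may also appear later inside forall masks and bodies
    pure real function dist(i, j)
      integer, intent(in) :: i, j
      dist = sqrt((x(i) - x(j))**2 + (y(i) - y(j))**2 + (z(i) - z(j))**2)
    end function dist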

    HPF Solution, Using Array Operators

    Perform masked array operations
      mask = .false.
      forall (i = 1:n)
        forall (j = i+1:n, dist(i,j) <= cutoff) 
          mask(i, j) = .true.
          r(i,j) = dist(i,j)
          k(i,j) = G * mass(i) * mass(j) / r(i,j) ** 3
          xx(i,j) = x(i) - x(j) 
          Fxx(i,j) = k(i,j) * xx(i,j) 
          ! do same for y and z ... 
        end forall
      end forall
      Fx = sum(Fxx, dim=2, mask=mask) - sum(Fxx, dim=1, mask=mask)

      ...

    Or eliminate one 'sum' operation by adding two more assignments inside the forall
      mask = .false.
      forall (i = 1:n)
        forall (j = i+1:n, dist(i,j) <= cutoff)
          mask(i, j) = .true.
          mask(j,i) = .true.
          r(i,j) = dist(i,j)
          k(i,j) = G * mass(i) * mass(j) / r(i,j) ** 3
          xx(i,j) = x(i) - x(j)
          Fxx(i,j) = k(i,j) * xx(i,j)
          Fxx(j,i) = -Fxx(i,j) 
          ! do same for y and z ... 
        end forall
      end forall
      Fx = sum(Fxx, dim=2, mask=mask)

      ...

    HPF Solution, Using Segmented Vectors

    Part I

    Calculate compressed pairlist of indices
    • idx1 = 1 1 1 1 2 2 3 3 4
    • idx2 = 2 3 4 5 3 5 4 5 5
    mask = .false. 
    forall (i = 1:n) 
      forall (j = i+1:n, dist(i,j) <= cutoff)
        f_seg(i, j) = i
        f_ind(i, j) = j
        mask(i, j) = .true.
      end forall
    end forall
    nz = count(mask)
    allocate(idx1(nz))
    ! and allocate idx2, F, k
    idx1 = pack(f_seg, mask)
    idx2 = pack(f_ind, mask)

    HPF Solution, Using Segmented Vectors

    Part II

    Now perform a series of general indexing operations on dense vectors
    ! Array ops for x,y,z force computation
    forall (i = 1:nz) 
      k(i) = G*mass(idx1(i))*mass(idx2(i))/(dist(idx1(i),idx2(i))**3) 
    end forall 
    F = k * (x(idx1) - x(idx2)) 
    allocate(zero(n)) 
    zero = 0. 
    Fx = sum_scatter(F, zero, idx1) - sum_scatter(F, zero, idx2) 
    ! Do same array operations for Y and Z forces ...

    Data Mapping Strategy

    Do-loop Version
    • Do nothing

    Array Version

    • Cyclic distribution of 2D array
    • Fancy schemes with replication

    Dense Vector Version

    • Cyclic distribution for Part I
    • Block or cyclic should not matter for Part II (see the directive sketch below)
    • Generalized array indexing generates all-to-all personalized communication
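    One plausible rendering of these mappings as HPF directives; the distributions shown are illustrative and not reproduced from the experiments:

    ! Array version: spread the triangular iteration space with a cyclic mapping
    !HPF$ DISTRIBUTE Fxx(CYCLIC,CYCLIC)
    !HPF$ ALIGN mask(i,j) WITH Fxx(i,j)
    ! Segmented-vector version, Part I: cyclic over rows to balance the masked work
    !HPF$ DISTRIBUTE f_seg(CYCLIC,*)
    ! Part II: dense vectors; BLOCK or CYCLIC should behave similarly
    !HPF$ DISTRIBUTE idx1(BLOCK)
    !HPF$ ALIGN idx2(i) WITH idx1(i)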

    Performance Results

    Experiments on SP-2 at Cornell
    • IBM's xlhpf compiler
    • PGI's pghpf compiler

    Compiled successfully, but...

    • Incomplete implementations of HPF_LIBRARY
      • Enough in PGI implementation to support segmented version
      • Not so in IBM implementation

    Improvements

    • xlhpf version core-dumped with pure function in forall mask
      • 3x improvement by giving it its own loop
    • Inlining 'dist' by hand gave 10x performance improvement

    Profiling tools helpful

    Pack slow in pghpf

    Difficult to explain distribution and alignment costs

    Updated Performance Results

    Experiments on Origin2000 at NCSA
    • PGI's pghpf compiler

    Algorithms

    • Do Loop version is just Fortran 77
    • Array version uses array operations
    • Array+ version moves "forall filter" to a separate loop
    • Seg Version uses pack and sum_scatter
    • Inline versions have "inlined" distance function (9-90x faster)

    Difficult to explain lack of performance improvement, despite multiple attempts at ALLOCATE

    Comments on Implementations

    The segmented-vector version is mostly O(n) dense vector operations

    Didn't use segmented operations

    • No segmented reduce functions in HPF, used scatter instead

    Implementation of F90 and HPF library routines

    • Pack slow
    • Sum_Scatter, Sum fast on single processor
    • Parallel performance?

    Sensitivity to coding details

    • 3x performance improvement from moving the forall filter to its own forall loop

        Change

        forall (..., dist(i,j) <= cutoff) 

        to

        forall (...) d(i,j) = dist(i,j)
        forall (..., d(i,j) <= cutoff)
    • 10x for inlining dist; can the compiler do this?

    General Parallelization Technique

    Any pure scalar function can be 'parallelized' for a nested data-parallel implementation

    Also pure vector functions

    • regardless of differing sizes of nested vectors

    Write scalar code

    For first level of parallelism

    • use vector notation and operations
    • promote pure scalar functions to pure vector functions by wrapping them in a forall statement (see the sketch after this list)

    For a second level of parallelism

    • use segmented vectors as 2D arrays
    • promote pure vector functions to nested vector functions
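    A minimal sketch of the first promotion, assuming a pure scalar function such as force_mag(m1, m2, r) from the earlier sketch; m0 (a reference mass) and r(1:n) (distances) are illustrative:

    ! Wrapping a pure scalar function in a forall promotes it to a vector
    ! computation: each element is an independent call, free to execute in parallel
    forall (i = 1:n)
      fvec(i) = force_mag(m(i), m0, r(i))
    end forall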

    HPF Support for Specific Operations

    HPF supports vector extension of pointwise operations
    • a = a + b * c => A = A + B * C

    Expansion operations

    • Spread, Range

    Conditionals

    • Pack
    • Combine

    Reduction

    • Segmented-reduction via xxx_scatter

    Prefix, Suffix, Scatter

    • Segmented versions enable operations such as indexing within segments
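    A minimal sketch of a segmented prefix, using the SEGMENT argument of the HPF library sum_prefix function; a change of truth value in seg marks a segment boundary (the data values and names are illustrative):

    use HPF_LIBRARY
    integer :: vals(9), run(9), ones(9), pos(9)
    logical :: seg(9)
    ! Ragged array [[3,1,4], [1,5], [9,2,6,5]] flattened; a change of value in seg
    ! marks each row boundary, so the prefix operations restart there
    vals = (/ 3, 1, 4,  1, 5,  9, 2, 6, 5 /)
    seg  = (/ .true., .true., .true.,  .false., .false.,  .true., .true., .true., .true. /)
    run  = sum_prefix(vals, segment=seg)   ! running sums restart within each row
    ones = 1
    pos  = sum_prefix(ones, segment=seg)   ! 1 2 3  1 2  1 2 3 4 : index within segment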

    Automatic Methods

    Compilers can generate flat and nested data parallel versions of functions automatically
    • NESL, Proteus technology

    Relies upon:

    • Pure functions
    • Pack, combine operations
    • Array operations
    • Reduce, segmented-reduce operations
    • Prefix, segmented-prefix operations

    Conclusions

    HPF has enough power to adequately express solutions for irregular problems

    Parallelization involves condensing irregular 2D data into segmented vectors

    Technique can be implemented

    • manually
    • in compilers and libraries
    • in source-to-source transformation systems

    Good parallel implementations of HPF library functions are critical

    What I like about HPF

    First high-level language (with user group > epsilon) that supports irregular computations

    Supports nested data structures through multidimensional arrays, nestable using pointers

    Supports nested parallel iterative constructs

    • array assignment
    • forall construct
    • DO INDEPENDENT directive
    • Library functions

    Exactly what is required for nested data parallelism (multi-level vectorization of scalar functions)

    What I'd like in HPF

    Support for 0-length segments in segmented arrays
    • Not all problems are linear systems of equations

    Segmented-reduce intrinsics

    • reduction and scan operations are related in HPF,
      • but they have different heritages, and it shows
    • Using xxx_scatter(a, b, i1, i2, ...) is an expensive operation compared to xxx_reduce(a,segment=s)

    Indexing ragged arrays

    • as if 2D arrays

    Multi-level nested ragged arrays?

    • Like Matlab 5.0

    [Figure: nested segmented array]

    Artifacts

    1. Actual N-body code in HPF
    2. Another version
    3. The 'Makefile' we used