A Prototyping System for Parallel and Distributed Applications

Table of Contents:

  1. Principal Investigator.
  2. Productivity Measures.
  3. Summary of Objectives and Approach.
  4. Detailed Summary of Technical Progress.
  5. Transitions and DOD Interactions.
  6. Software and Hardware Prototypes.
  7. List of Publications.
  8. Invited and Contributed Presentations.
  9. Honors, Prizes or Awards Received.
  10. Project Personnel Promotions.
  11. Project Staff.
  12. Multimedia URL.
  13. Keywords.
  14. Business Office.
  15. Expenditures.
  16. Students.
  17. Book Plans.
  18. Sabbatical Plans.
  19. Related Research.
  20. History.

- Revised: Thu Feb 6 11:18:51 1997 by nyland@cs.unc.edu

Principal Investigator.

Productivity Measures.

Summary of Objectives and Approach.

  This work addresses a fundamental problem: naive implementations of abstract models of parallel computation lead to impractical implementations, whereas machine-specific models make the analysis of even the simplest programs intractable. The goal of our effort is to provide a software design methodology that supports the architecture-independent development of parallel and distributed systems through a process of prototyping and refinement. The Proteus system provides the language and tools supporting this methodology, and comprises: a wide-spectrum parallel programming notation whose interpretive execution allows the early prototyping of specifications; techniques for the (semi-automatic) refinement of architecture-independent specifications to lower-level programs optimized for specific architectures, followed by translation to executable low-level parallel languages; and an execution system supporting interoperability with foreign codes and run-time performance analysis. Within this general paradigm our work has emphasized the following areas:

  1. Transformation of nested data-parallel programs into an executable vector model [PP93,PPW95].
  2. A refinement-based approach to performance prediction using parallel cost models that vary with the level of refinement [LMR95].
  3. A model of design and development based on formal category-theoretic concepts [Srin95].

  This work has been driven by several challenge problems, including the Fast Multipole Algorithm for N-body simulation, a radar tracking and response doctrine system for the Naval Surface Warfare Center Aegis Hiper-D program, and, most recently, multitarget tracking.
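
The flattening idea behind the transformation work above can be illustrated with a small sketch. This is not Proteus code; all names here are illustrative. A nested sequence is represented as a flat data vector plus a segment descriptor, so that an apply-to-all over the subsequences becomes a single flat vector operation.

```python
# Sketch of flattening nested data parallelism (illustrative names only):
# a nested sequence becomes (segment lengths, flat data), and a nested
# apply-to-all becomes one flat vector operation.

def flatten(nested):
    """Split a nested sequence into (segment lengths, flat data)."""
    segs = [len(s) for s in nested]
    data = [x for s in nested for x in s]
    return segs, data

def unflatten(segs, data):
    """Rebuild the nested sequence from its flattened form."""
    out, i = [], 0
    for n in segs:
        out.append(data[i:i + n])
        i += n
    return out

def seg_add_scalar(segs, data, k):
    # One flat vector op replaces a nested loop over subsequences.
    return segs, [x + k for x in data]

nested = [[1, 2], [], [3, 4, 5]]
segs, data = flatten(nested)
segs, data = seg_add_scalar(segs, data, 10)
print(unflatten(segs, data))   # [[11, 12], [], [13, 14, 15]]
```

The segment descriptor carries all nesting structure, so irregular (ragged) nested sequences vectorize just as well as rectangular ones.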

Detailed Summary of Technical Progress.

  1. Participation in the Hiper-D GeoServer follow-on activity.

    From July through October of 1994, we participated in a community-wide prototyping experiment defined by the Hiper-D program at NSWC as part of their next-generation Aegis development effort. Using Proteus, we developed a series of prototypes to explore design approaches for Radar Validation using constructive solid geometry methods. Three prototypes were developed, along with a harness to connect the prototypes to the network application. All prototypes were executable using the Proteus interpreter, were rapidly developed, and were judged to have contributed useful knowledge about design options to NSWC.

    In addition, one prototype was transformed to parallel code which can be run on several parallel machines (including the Cray at NCSC and the MasPar at UNC). This computational kernel is controlled by the network harness, using the interoperability capabilities of Proteus.

  2. Development of a concurrent object model for process parallelism. Significant progress has been made in language design through the development of an extensible foundation for explicit task parallelism. Communication occurs through a shared object model in which access to shared state is controlled through object methods, with class directives constraining the mutual exclusion of methods [GPR+94]. Predefined classes, such as single-assignment objects that synchronize a producer with a consumer, together with provisions for private state with barrier synchronization, allow the expression of a wide range of parallel computing paradigms, which is key to providing an expressive and uniform vehicle for refinement.
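
A single-assignment object of the kind mentioned above can be sketched as follows. This is a hypothetical illustration, not the Proteus object model itself: the first write fixes the value, readers block until it is written, and method-level mutual exclusion stands in for the class directives described above.

```python
import threading

# Illustrative sketch of a "single-assignment" shared object: the first
# write fixes the value for good; readers block until the value exists.
# The per-object lock plays the role of the mutual-exclusion directives.

class SingleAssignment:
    def __init__(self):
        self._cond = threading.Condition()
        self._set = False
        self._value = None

    def write(self, value):
        with self._cond:
            if self._set:
                raise RuntimeError("already assigned")
            self._value, self._set = value, True
            self._cond.notify_all()      # wake any waiting consumers

    def read(self):
        with self._cond:
            self._cond.wait_for(lambda: self._set)
            return self._value

cell = SingleAssignment()
consumer = threading.Thread(target=lambda: print(cell.read()))
consumer.start()
cell.write(42)          # producer releases the consumer
consumer.join()         # prints 42
```

The same shell, with different directives, yields other synchronization disciplines (e.g., a barrier or a bounded buffer), which is the extensibility argument made above.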

  3. Development and release of Specware 1.0

    Specware is Kestrel Institute's next-generation system for the composition and refinement of specifications, generalizing and extending earlier work on KIDS and DTRE. The latest release includes a fully general Lisp code generator and full support for graphically based refinement of specifications.

  4. Development of a framework for performance prediction in which the computing model varies with the level of refinement. We have developed a framework for performance prediction in which the computing model varies with the level of refinement [MNPR94], and have concomitantly developed improved parallel computational models (LogP-HMM, LogP-UMH) [LMR95]. To support early and incrementally more accurate analyses, performance estimates are obtained using increasingly detailed parallel computational models as program refinement progresses. To support the assessment of multi-paradigm programs, different models are used to analyze code segments following different paradigms, such as the VRAM for data parallelism and the LogP for message passing, with suitable instrumentation to attach the model to the design. We have also developed a framework for assessing and designing new models of computation to support more detailed performance analyses, based on the notion of resource metrics, which are parameterized abstractions of architectural details. For example, to model costs such as cache and I/O more accurately, we have developed a new hybrid model of parallel computation, the LogP-HMM model [LMR95], which fills a gap in the hierarchy of refined models by extending a network model (LogP) with a sequential hierarchical memory model (HMM). More recently we have explored further extensions of the LogP-HMM model that replace the memory hierarchy with more advanced models such as UMH (Uniform Memory Hierarchy) [LMR95a].
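
The refinement-level idea can be made concrete with a toy example, costing the same broadcast under two models: an abstract one-step view early in refinement, and a LogP estimate (latency L, overhead o, gap g, P processors) later. The greedy tree schedule below is a standard LogP broadcast construction; the parameter values are made up for illustration.

```python
import heapq

# Costing one operation (broadcast) under two models at different
# refinement levels. Parameter values are illustrative, not measured.

def pram_broadcast_steps(P):
    # Abstract early-refinement model: broadcast is a single parallel step.
    return 1

def logp_broadcast_time(P, L, o, g):
    # LogP broadcast as a greedy tree: every processor holding the datum
    # resends every max(g, o) cycles; a message lands o + L + o after the
    # send begins. Simulate arrival times with a priority queue.
    arrivals = [0]            # root has the datum at t = 0
    have = 1
    heap = [0]                # times at which some holder can next send
    while have < P:
        t = heapq.heappop(heap)
        recv = t + o + L + o  # send overhead + latency + receive overhead
        arrivals.append(recv)
        have += 1
        heapq.heappush(heap, t + max(g, o))   # sender continues sending
        heapq.heappush(heap, recv)            # receiver starts sending
    return max(arrivals)

print(pram_broadcast_steps(16))               # 1
print(logp_broadcast_time(16, L=5, o=2, g=3))
```

The gap between the two numbers is exactly the information a later refinement stage adds; plugging in a memory-hierarchy term on top of LogP is the step the LogP-HMM model takes.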

  5. Algebraic framework for synthesizing high-performance programs from tensor products for out-of-core computations. Reif, Gupta, and Li have developed an algebraic framework for the automatic synthesis of efficient parallel programs from high-level tensor product specifications for a large class of block recursive algorithms. This framework is suitable for developing out-of-core (I/O-intensive) parallel programs for algorithms such as fast Fourier transforms, which are used in many key application areas. Tensor products succinctly capture the semantics of block-cyclic distributions as well as the data access patterns of out-of-core computations. We have targeted the program synthesis towards distributed hierarchical memory machines, including tightly coupled MIMD machines and network-connected workstations. We capture the performance properties of the machines with abstract computation models that include variations of the LogP-HMM model [LMR95] and the striped two-level memory model of Vitter and Aggarwal. Data layout annotations such as cyclic(B), found in High Performance Fortran, are used to de-cluster the data among the disks. The idea of de-clustering data is used by most current parallel file systems, such as PIOUS and Vesta, which therefore provide a suitable testbed for efficiently implementing our approach.
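
The core algebraic move can be shown in miniature. A tensor product operator y = (A ⊗ B) x is never materialized; instead it is factored as (A ⊗ I)(I ⊗ B), and each factor becomes a blocked pass over the data, which is what makes the block-cyclic, out-of-core mapping possible. The sketch below (plain Python, illustrative only) checks the factorization on a small example.

```python
# y = (A kron B) x evaluated without forming the Kronecker product,
# via the factorization (A kron B) = (A kron I_n)(I_m kron B),
# where A is m x m and B is n x n. Each factor is a blocked pass.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def apply_I_kron_B(B, x, m, n):
    # (I_m kron B): apply B independently to each contiguous block of length n.
    out = []
    for b in range(m):
        out.extend(matvec(B, x[b * n:(b + 1) * n]))
    return out

def apply_A_kron_I(A, x, m, n):
    # (A kron I_n): apply A across the blocks, elementwise within blocks.
    out = [0] * (m * n)
    for i in range(m):
        for j in range(m):
            for k in range(n):
                out[i * n + k] += A[i][j] * x[j * n + k]
    return out

def kron_apply(A, B, x):
    m, n = len(A), len(B)
    return apply_A_kron_I(A, apply_I_kron_B(B, x, m, n), m, n)

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(kron_apply(A, B, [1, 2, 3, 4]))   # [10, 7, 22, 15]
```

In the out-of-core setting each "block" lives on a different disk stripe, so the two passes translate directly into I/O schedules over the de-clustered data.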

  6. Development and prototyping of an efficient data-parallel algorithm for multitarget tracking. Researchers at Duke have developed an efficient data-parallel algorithm for the core computation of the Joint Probabilistic Data Association (JPDA) multitarget tracking algorithm, in which the problem is to derive the probabilities of target-measurement associations from a priori state estimates. The algorithm is an equational refinement of the recursive modified matrix permanent formulation first proposed by Zhou; it is efficient in that it avoids redundant computation of subexpressions, and data-parallel in that it is expressed in a functional style using sequence aggregates. The algorithm is presented as an equational specification whose structure is amenable to expression in the nested data-parallel language Proteus [Mil95]. We are currently transcribing the algorithm into Proteus and obtaining complexity analyses and comparative benchmarks for the implementation.
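
The connection to matrix permanents can be sketched as follows. This is not the data-parallel algorithm of [Mil95] or Zhou's modified recursion; it is a plain memoized permanent expansion over column subsets, shown only to make visible the shared-subexpression reuse that the equational refinement exploits.

```python
from functools import lru_cache

# Illustrative only: JPDA association probabilities reduce to (modified)
# matrix permanents. Memoizing the expansion over unused-column subsets
# shows the subexpression sharing the equational refinement exploits.

def permanent(M):
    n = len(M)

    @lru_cache(maxsize=None)
    def perm(row, cols):          # cols: frozenset of unused column indices
        if row == n:
            return 1.0
        return sum(M[row][c] * perm(row + 1, cols - {c}) for c in cols)

    return perm(0, frozenset(range(n)))

# With rows as measurements and columns as targets, the permanent sums
# the likelihoods of all complete measurement-target assignments.
M = [[0.9, 0.1], [0.2, 0.8]]
print(permanent(M))     # 0.9*0.8 + 0.1*0.2 = 0.74
```

Without memoization the expansion does factorial work; sharing the subset-indexed subproblems is what brings the cost down, and expressing that sharing over sequence aggregates is what makes the computation data-parallel.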

  7. One-step transformation of Proteus programs.

    Using the NSWC Geoserver as a driving problem, we substantially advanced our transformation capabilities. Proteus programs written in a functional, single-assignment style using nested sequences are automatically transformed to C programs that rely on a vector library. The library is implemented for several parallel machines, giving us portability at a small performance penalty. Initial tests on a Cray YMP indicate that the transformed programs run within a factor of four of the speed of comparable Fortran programs. This result is the basis of further research to increase performance through increased register reuse, pipelined operations, and the elimination of temporaries.

  8. Work-efficient Execution of Proteus Programs

    The transformations developed by Prins & Palmer [PP93] and the supporting library routines are not necessarily work-efficient. Careful examination, reordering, and rewriting of the transformation rules and data-parallel library routines now preserve the asymptotic work complexity of the source program in the transformed Proteus programs.
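
What work-(in)efficiency means here can be shown with a toy example, unrelated to any specific Proteus rule: two formulations of prefix sums that produce the same result, one doing quadratic work and one doing linear work. Preserving the source program's asymptotic work is the property the rewritten rules guarantee.

```python
# Toy illustration of work-(in)efficiency: both functions compute prefix
# sums, but the naive data-parallel form does Theta(n^2) work while the
# rewritten form does Theta(n), matching the sequential source program.

def prefix_sums_naive(xs):
    ops = 0
    out = []
    for i in range(len(xs)):          # each position resums its whole prefix
        s = 0
        for j in range(i + 1):
            s += xs[j]
            ops += 1
        out.append(s)
    return out, ops

def prefix_sums_efficient(xs):
    ops = 0
    out, s = [], 0
    for x in xs:                      # each element touched exactly once
        s += x
        ops += 1
        out.append(s)
    return out, ops

xs = list(range(10))
a, wa = prefix_sums_naive(xs)
b, wb = prefix_sums_efficient(xs)
assert a == b
print(wa, wb)   # 55 10
```

A flattening rule that silently turns the right-hand form into the left-hand one is "correct" but work-inefficient; avoiding exactly this kind of blow-up is the point of the rewriting described above.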

  9. Piece-wise execution of Proteus programs

    The transformation system generates programs that aggressively build the longest possible vectors for parallel execution. This implies that all elements of a vector must be in memory at once, leading to extremely large memory demands for moderate size programs. To alleviate this problem, work is underway to repeatedly execute a series of vector operations on small segments of the vector that will fit in the available memory. The operation sequences usually begin with operations that increase memory usage (e.g., distribution), and end with data reduction operations (e.g., sum, product). We call this style of execution "piece-wise execution", and the result is programs with substantially reduced memory requirements.
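
The memory effect of piece-wise execution can be sketched with a toy pipeline (hypothetical names; the real mechanism operates on the generated vector code, not on Python lists): a distribute step followed by a reduction, run either on the full vector or chunk by chunk.

```python
# Sketch of "piece-wise execution": rather than materializing the full
# intermediate vector (distribute, square, then sum), run the same
# operation sequence chunk by chunk so live memory never exceeds `chunk`.

def monolithic(xs, n):
    big = [x for x in xs for _ in range(n)]   # memory: len(xs) * n elements
    return sum(v * v for v in big)

def piecewise(xs, n, chunk=4):
    total = 0
    buf = []

    def flush():
        nonlocal total
        total += sum(v * v for v in buf)      # reduction over one piece
        buf.clear()

    for x in xs:
        for _ in range(n):                    # distribute, one element at a time
            buf.append(x)
            if len(buf) == chunk:
                flush()                       # memory never exceeds `chunk`
    flush()
    return total

xs = [1, 2, 3]
assert monolithic(xs, 5) == piecewise(xs, 5)
print(piecewise(xs, 5))   # 70
```

The reduction at the end of each piece is what makes the chunked schedule legal: because sum is associative, the pieces can be folded in as they are produced.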

  10. Provably correct complexity measures for transformed Proteus programs.

    The transformation of Proteus programs relies on a set of complex rules to generate flat data-parallel code. It is crucial that these transformations preserve the work and step complexities of the source program. We have developed a formal semantics of programs and used this to prove the transformations correct for three different target models: the CREW, bounded-contention CREW and EREW variants of the VRAM (vector random-access machine). The techniques used in the proof are novel and may be of independent interest. The step/work metrics are compositional, allowing the complexity of a program to be calculated from the complexities of its subprograms.
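
The compositionality of the step/work metrics can be illustrated with a tiny expression language (invented here for illustration; it is not the Proteus source language): the cost of an expression is computed bottom-up from the costs of its subexpressions, with work summing across the elements of a parallel map while steps take only one extra tick.

```python
# Illustrative compositional work/step metrics over a toy expression
# language: ("const",), ("add", e1, e2), ("parmap", k, body).

def cost(expr):
    """Return (work, step) for an expression, computed compositionally."""
    op = expr[0]
    if op == "const":
        return (1, 1)
    if op == "add":                      # sequential combination
        w1, s1 = cost(expr[1])
        w2, s2 = cost(expr[2])
        return (w1 + w2 + 1, s1 + s2 + 1)
    if op == "parmap":                   # apply body to each of k elements
        k, body = expr[1], expr[2]
        w, s = cost(body)
        return (k * w, s + 1)            # work sums; steps run in lockstep
    raise ValueError(op)

e = ("parmap", 8, ("add", ("const",), ("const",)))
print(cost(e))   # (24, 4)
```

Because the metrics compose, proving that each individual transformation rule preserves (work, step) suffices to prove it for whole programs, which is the structure of the correctness argument above.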

  11. FY-96 PLANS:

Transitions and DOD Interactions.

  1. The Naval Surface Warfare Center has conducted two experiments in prototyping, exploring the usefulness of prototyping with regard to their software system to be deployed in 2003. We participated in both experiments, implementing a small-yet-complete program in the first, and a doctrine-validation component of a distributed application (with other components developed by other ProtoTech groups) in the second. Proteus programs were installed and run for demonstrations at NSWC and IDA. A videotape was produced showing the results of this collaboration.
  2. Duke and UNC are collaborating with John Board in the E.E. Department of Duke University and J. Hermans of the Biochemistry Department at UNC in the investigation and implementation of parallel Fast Multipole Algorithms (FMA) for molecular dynamics simulations. UNC has just completed 2 years of a 5-year $2.5M NIH grant for the development and implementation of parallel algorithms for molecular dynamics. The use of Proteus is an integral component of this effort.
  3. Further collaboration in the parallelization of Molecular Dynamics has been established with the Theoretical Biophysics Group in the Beckman Institute at the University of Illinois (Urbana-Champaign). The group is directed by Dr. Klaus Schulten, the principal investigator of a 5-year Grand Challenge Award for Structural Biology, who is aided by faculty members, post-doctoral fellows and graduate students in the Biophysics, Chemistry, Computer Science and Physics departments.
  4. The Medical Imaging group at UNC is using Proteus to develop sophisticated parallel algorithms for segmenting medical images into physical objects.
  5. Our Fast Multipole Algorithm code is being examined by others, at Duke University, CMU and Brooklyn Polytechnic, as a guide to lead their own implementations.
  6. Release of Specware 1.0 to Motorola, the Air Force Institute of Technology (AFIT), NSA, Syracuse University and Rome Labs. Motorola is specifying and implementing the kernel of a secure operating system using Specware. A number of students are using Specware at AFIT. Rome Labs is evaluating Specware for use within their research programs in software engineering. Syracuse University is applying Specware to hardware verification, and is examining the integration of CSP notations and Specware.

Software and Hardware Prototypes.

  1. The Radar Validation Server of the AEGIS Geoserver Experiment:
  2. Trans, a Proteus transformation system:
  3. Specware:

List of Publications.

  1. [FPPN94] R. E. Faith, D. W. Palmer, J. F. Prins, L. S. Nyland, "Practical Issues in the Flattening of Nested Parallelism with Program Transformations" (Extended Abstract)
  2. [FNPP94] R. E. Faith, L. Nyland, D. Palmer, J. Prins, "The Proteus NS Grammar", UNC-CS TR94-029.
  3. [FY95] R. E. Faith, W. Yakowenko, "The Khepera Transformation System User's Manual", UNC Technical Report.
  4. [GLR95] S. Gupta, Z. Li, and J. Reif, "Synthesizing Efficient Programs for Two-Level Memories from Tensor-products". Proc. of the 7th IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems Washington DC, October 1995, pp. 510-513.
  5. [GPR+94] A. Goldberg, J. Prins, J. Reif, R. Faith, Z. Li, P. Mills, L. Nyland, D. Palmer, J. Riely, S. Westfold, "The Proteus System for the Development of Parallel Applications", in Prototyping Languages and Prototyping Technology, M. Harrison, ed., (to appear) Springer-Verlag, 1996. (40 pp)
  6. [GMN+94] A. Goldberg, P. Mills, L. Nyland, J. Prins, J. Reif, J. Riely, "Specification and Development of Parallel Algorithms with the Proteus System", in Specification of Parallel Algorithms, G. Blelloch, M. Chandy, S. Jagannathan, eds., AMS, 1995. (pp 383-399)
  7. [KGP94] S. Kumar, S. Goddard and J. Prins, "Connected-Components Algorithms for Mesh-Connected Parallel Computers", presented at DIMACS workshop on parallel algorithms, 10/94. DIMACS proceedings, IEEE Press (to appear).
  8. [LRG95] Z. Li, J.H. Reif, and S.K.S. Gupta, "Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms using Block-Cyclic Data Distributions". Submitted for publication.
  9. [LMN+94] Z. Li, P. Mills, L. Nyland, J. Prins, J. Reif, W. Yakowenko, "A Radar-validation Server", http://particle.cs.unc.edu/nswc2/REPORT/REPORT.html.
  10. [LMR95a] Z. Li, P. H. Mills, and J. H. Reif, "Models and Resource Metrics for Parallel and Distributed Computation". To appear in Journal of Parallel Algorithms and Applications, Vol. 9, No. 4, Dec. 1995.
  11. [LMR95] Z. Li, P. H. Mills, and J. H. Reif, "Models and Resource Metrics for Parallel and Distributed Computation". Proc. 28th Annual Hawaii International Conference on System Sciences (HICSS-28 Parallel Algorithms Software Technology Track), Wailea, Maui, Hawaii, January 3-6, IEEE Press, 1995.
  12. [MNPR94] P. H. Mills, L. S. Nyland, J. F. Prins, and J. H. Reif, "Software Issues in High-Performance Computing and a Framework for the Development of HPC Applications," In: Developing a Computer Science Agenda for High Performance Computing (U. Vishkin, ed.) pp.110-117, ACM, 1994.
  13. [Mil95] P. Mills, "An Efficient Data-Parallel Algorithm for Association in Multi-Target Tracking", Draft Technical Report, Department of Computer Science, Duke University, Sept. 1995.
  14. [PPCF96] D. Palmer, J. Prins, S. Chatterjee, R. Faith, "Piecewise Execution of Nested Data-Parallel Programs", Proc. Workshop on Languages and Compilers for Parallel Computing, Columbus, OH, (to appear) Springer-Verlag, 1996. (18pp)
  15. [PP95] D. Palmer, J. Prins, "Work-efficient Nested Data-Parallelism", Proc. of the 5th Symposium on the Frontiers of Massively Parallel Processing, IEEE 1995. (pp 186-193)
  16. [RPI95] J. Riely, J. Prins, S. Iyer, "Provably Correct Vectorization of Nested Parallel Programs", in Programming Models for Massively Parallel Computers, (to appear) IEEE 1995. (10pp)
  17. [SRIN95] Y. V. Srinivas and Richard Jullig, "Specware: Formal Support for Composing Software", Proceedings of the Conference on Mathematics of Program Construction, Kloster Irsee, Germany, July 1995.

Invited and Contributed Presentations.

  1. J. Prins, "Development of Efficient Parallel Algorithms for Unstructured Problems", EUROSIM Congress on Massively Parallel Processing: Applications and Development, Delft, The Netherlands, 6/23/94.
  2. J. Prins, "A Refinement-Based Approach to Parallel Software Development and Maintenance", RCI Forum on Parallel Software Engineering for C3I and Large Scale Commercial Applications, Orlando, FL, 1/24/95.
  3. L. Nyland, "Multiple Proteus Prototypes for AEGIS Doctrine Validation", Prototech Quarterly Meeting, San Diego, CA, 9/29/94.
  4. J. Prins, "High-Performance Irregular Computation using High-Level Programming Languages and Novel Compilation Techniques", Swiss High Performance Computing Seminar Series, CSCS, Manno, Switzerland, 10/6/95.
  5. L. Nyland, "Doctrine Validation Component of the Prototech AEGIS Geoserver Experiment", DSSA Quarterly meeting, Institute for Defense Analysis, 11/30/94.
  6. J. Reif, "Parallel Molecular Computation: Models and Simulations", 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'95), Santa Barbara, CA, July 1995.
  7. J. Reif, "Predictive Computing: An Emerging Paradigm for Efficient Computation", Dartmouth Institute for Advanced Graduate Studies (DAGS94), Hanover, New Hampshire, July 1994.
  8. J. Reif, "Work Efficient Parallel Solution of Toeplitz Systems and Polynomial GCD", 27th ACM Symposium on Theory of Computing (STOC 95), Las Vegas, NV, May 1995.
  9. J. Reif, "Efficient Parallel Solution of Sparse Eigenvalue and Eigenvector Problems", 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS'95), Milwaukee, WI, October 1995.

Honors, Prizes or Awards Received.

Project Personnel Promotions.

Project Staff.

  1. Name: Dr. Allen Goldberg
  2. Name: Dr. Jan F. Prins
  3. Name: Dr. John H. Reif
  4. Name: Dr. Lars S. Nyland
  5. Name: Mr. Peter H. Mills

Multimedia URL.

  1. EOYL FY95
  2. QUAD FY95
  3. EOYL FY94
  4. The Proteus System Home Page
  5. The AEGIS Experiment with the Prototech Community (ARPA Videotape)
  6. "A Radar-validation Server", a report on our interaction in the Prototech prototyping experiment with NSWC.

Keywords.

  1. Prototype Program Development
  2. Parallel Programming
  3. Program Transformation
  4. Fast Multipole Algorithm (FMA)
  5. Radar Validation
  6. Geoserver

Business Office

Students

  1. Name: Mr. Shenfeng Chen
  2. Name: Mr. Rickard Faith
  3. Name: Mr. Ashish Gehani
  4. Name: Mr. Zhiyong Li
  5. Name: Mr. Daniel W. Palmer
  6. Name: Mr. James W. Riely
  7. Name: Mr. Eric Schultes
  8. Name: Ms. Hongyan Wang
  9. Name: Mr. William Yakowenko

Book Plans

Sabbatical Plans

  • Person: Jan F. Prins
  • Location: Zurich, CH
  • Institution: ETH Theoretical Computer Science
  • Sabbatical Year: 1996-1997 Academic Year
  • Foreign S&T Reporting Interest: Yes

Related Research

  1. Architecture Independent Development of Parallel Software (supported by Rome Laboratory)
  2. NIH Parallel Computing Research Resource for Structural Biology
  3. Multi-user Distributed Interactive Simulation, an ARPA-funded CAETI project
  4. Mailing lists:
    • proteus-users: a mailing list for Proteus users and designers.
    • parlunch: a mailing list announcing talks for the Parallel Lunch Seminar.

    To subscribe to either, send mail to majordomo@cs.unc.edu with the message "subscribe list-name".


The performance of two scientific codes has been impacted by our work related to molecular dynamics.

  1. The parallel performance of the molecular dynamics simulator SIgMA was improved by using load-balancing techniques developed as part of our FMA work. The result is highly scalable performance for small (under 10K atom) simulations.
  2. The performance of the Fast Multipole code developed at Duke University by J. Board et al. was improved by 50% through a single simple change, a modification to the algorithm that only became apparent when the algorithm was expressed at a high level in Proteus.