Revised: Tue Dec 6 2016 by email@example.com
This is an introductory graduate course covering several aspects of
parallel and high-performance computing.
Upon completion, you should
- be able to design and analyze parallel algorithms for a variety of
problems and computational models,
- be familiar with the hardware and software organization of
high-performance parallel computing systems, and
- have experience with the implementation of parallel applications
on high-performance computing systems, and be able to measure,
tune, and report on their performance.
Additional information including the course syllabus can be found in the
All parallel programming models discussed in this class are supported on
or phaedra which are available for use in this class.
The midterm will be held in class Tue Oct 18.
- The scope of the exam is material in lectures 1-12.
- You have 75 minutes to complete the exam.
- You may consult all course notes and other course materials during the exam.
You may use an electronic device to access these materials or as a calculator,
but the device may not be used to access other materials or to communicate in any fashion.
- Some sample questions are available.
- Please familiarize yourself with the dissemination barrier described on slide 10 of lecture 12.
There will be a question related to this implementation of a barrier.
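For review, the communication pattern of a dissemination barrier can be checked with a short sequential simulation. The sketch below assumes the textbook formulation (in round k, thread i signals thread (i + 2^k) mod P, for ceil(log2 P) rounds), which may differ in detail from slide 10 of lecture 12. The function names are hypothetical, and the code only simulates the flow of arrival information; it is not a usable barrier for real threads.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds needed for p threads: the smallest r with 2^r >= p. */
int dissemination_rounds(int p)
{
    int r = 0;
    assert(p >= 1 && p <= (1 << 30));
    while ((1 << r) < p)
        r++;
    return r;
}

/* Thread that thread i signals in round k (assumed textbook pattern). */
int dissemination_partner(int i, int k, int p)
{
    assert(p >= 1 && i >= 0 && i < p && k >= 0 && k < 30);
    return (i + (1 << k)) % p;
}

/* Sequential simulation of the information flow for p <= 64 threads.
   know[i] is a bitmask of the threads whose arrival thread i has
   (directly or transitively) been told about.
   Returns 1 if every thread knows about all p arrivals after
   dissemination_rounds(p) rounds, and 0 otherwise. */
int dissemination_covers_all(int p)
{
    uint64_t know[64], next[64], all;
    int i, k, rounds;

    assert(p >= 1 && p <= 64);
    all = (p == 64) ? ~(uint64_t)0 : (((uint64_t)1 << p) - 1);
    for (i = 0; i < p; i++)
        know[i] = (uint64_t)1 << i;   /* initially each thread knows only itself */

    rounds = dissemination_rounds(p);
    for (k = 0; k < rounds; k++) {
        for (i = 0; i < p; i++)
            next[i] = know[i];
        for (i = 0; i < p; i++)       /* all signals of a round are sent concurrently */
            next[dissemination_partner(i, k, p)] |= know[i];
        for (i = 0; i < p; i++)
            know[i] = next[i];
    }
    for (i = 0; i < p; i++)
        if (know[i] != all)
            return 0;
    return 1;
}
```

For example, dissemination_rounds(8) is 3, and in round 1 of a 4-thread barrier thread 3 signals thread 1. The simulation also covers thread counts that are not powers of two, such as P = 5.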
- This course will use Piazza to manage class questions and discussions online.
(some material local-access only)
- (for Thu Dec 1) Look at Kumar et al.,
Basic Communication Operations.
- (for Tue Nov 17) Skim
MPI tutorial by Blaise Barney, LLNL.
- (for Tue Nov 14) Skim the
Questions and Answers about BSP
pp 1-25. We will not use BSPLib directly; rather, we will use the BSP model together
with communication operations from the MPI library.
- (for Thu Oct 27) Read material on
Intel Xeon Phi
- (for Tue Oct 11) Read Nyland et al.,
Fast N-Body Simulation with CUDA.
Check supplementary materials for CUDA in the Software section below.
- (for Thu Oct 6) Skim through introductory material on
- (for Thu Sep 29) Read
Hennessy & Patterson Ch. 8, sections 8.5 - 8.6
(synchronization primitives in shared memory, and
memory consistency models).
- (for Tue Sep 27) Read
Memory consistency models tutorial (sections 1-6, pp 1-17).
- (For Thu Sep 22)
The Implementation of the Cilk-5 Multithreaded Language,
sections 1 - 3.
For a modern look at Cilk, see this short tutorial on
Intel Cilk Plus,
which also covers the expression of data parallelism using MATLAB-like array notation
(generally executed using the Advanced Vector Extensions in Intel processors).
- (For Tue Sep 20) Look through
OpenMP Tutorial sections 7-9 and read about TASK directives.
- (For Tue Sep 13) Look through
OpenMP Tutorial sections 1-6. Most examples are
shown in C/C++ and in Fortran. We will be using C/C++.
Ignore WORKSHARE and TASK directives, and discussion of
nested parallel constructs.
- (for Thu Sep 8) Read the overview of
Memory Hierarchy in Cache-based Systems.
- (For Tue Sep 6) PRAM Handout, sections 3.6, 4.1, skim section 5
- (For Tue Aug 30) PRAM handout, sections 3.2, 3.3, 3.5
- (For Thu Aug 25) Read PRAM Handout sections 1, 2, 3.1 (pp 1-8)
- (For Tue Aug 23) Look over the course overview.
Written and Programming Assignments
Platforms and Programming Models
- The Bass system supports the OpenMP and MPI programming models.
The general instructions for
getting started on bass
are supplemented below with specific instructions for each programming model.
When you log in to bass.cs.unc.edu you are connected to a specific node on Bass dedicated
to interactive program development. You can compile programs on this node.
Shared-memory programs run within an individual node on Bass. Distributed-memory programs run
across multiple nodes on Bass. Do not use the login node to run your programs,
although a short debug test that runs for a few seconds on no more than 4 cores should be OK.
In general, programs that need multiple nodes, dedicated nodes, or GPUs should be
submitted to queues managed by the Grid Engine job scheduler.
- The Phaedra system is a compute server supporting the OpenMP, Cilk, and Xeon Phi accelerator programming models.
- gamma-x51-1 is a server supporting the CUDA accelerator programming model.
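The 4-core limit for short debug runs on the Bass login node can be enforced from inside an OpenMP program. The following is a minimal sketch under stated assumptions: cap_threads is a hypothetical helper name, not part of any course material, while omp_get_max_threads and omp_set_num_threads are standard OpenMP runtime routines. The _OPENMP guard lets the same source compile without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* present only when the compiler enables OpenMP */
#endif

/* Limit subsequent parallel regions to at most `limit` threads.
   Returns the thread count that will be requested; in a build
   without OpenMP this is always 1. */
int cap_threads(int limit)
{
    int n = 1;
    assert(limit >= 1);
#ifdef _OPENMP
    n = omp_get_max_threads();
    if (n > limit)
        n = limit;
    omp_set_num_threads(n);
#endif
    return n;
}
```

Calling cap_threads(4) at the start of main keeps a debug run within the limit. Setting the environment variable OMP_NUM_THREADS=4 before running the program has a similar effect without changing the source.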
- OpenMP reference: Specification of
OpenMP 3.0 API for C/C++.
You may be more interested in
OpenMP support in gcc 4.4.7 (the compiler on Bass).
- Bass-specific material
Getting started on Bass.
- To get accurate performance information, run your programs
on a dedicated node, either as a batch job with a shell script myjob using
qsub -pe smp 16 myjob
or interactively via
qlogin -pe smp 16
Do not park yourself on this node, as everyone else in the class will be held up.
- A directory with the sample diffusion program
discussed in class.
- Command lines for the compilation and execution of programs on
- C compilation to create a sequential program (the compiler ignores OpenMP
directives and does not link with the OpenMP runtime library):
gcc -O3 -o prog prog.c (GNU C compiler 4.4.7)
- C compilation to create a parallel program (OpenMP 3.0 directives are honored
and the program is linked with the OpenMP runtime library):
gcc -fopenmp -O3 -o prog prog.c (GNU C compiler 4.4.7)
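The two builds above can share one source file. The sketch below is an illustration, not the course's sample code: the function names are hypothetical, and it assumes only the standard _OPENMP macro, which an OpenMP-enabled compiler defines. Without -fopenmp the pragma is ignored and the loop runs sequentially; with -fopenmp the loop iterations are divided among threads and the partial sums are combined by the reduction clause.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>   /* included only in the -fopenmp build */
#endif

/* Threads a parallel region may use; 1 in the sequential build. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Timestamp in seconds: wall-clock time in the OpenMP build,
   processor time from clock() in the sequential build. */
double timer_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Sum of the integers 1..n. The loop variable is declared before
   the loop so the file also compiles in gcc's default C89 mode. */
long sum_to(long n)
{
    long i;
    long total = 0;
    assert(n >= 0);
#pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++)
        total += i;
    return total;
}
```

Both builds return the same result from sum_to; only the number of threads doing the work differs. Taking the difference of two timer_seconds() readings around a call gives a rough elapsed time for comparing the two builds.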
- Phaedra-specific material
- Phaedra is a 20-core Xeon E5-2650 server with eight attached
Intel Xeon Phi 5110P accelerators.
The server hosts the Intel Parallel Studio XE 2017
compilers and performance analysis tools to access the accelerators.
Students in COMP 633 have a login on phaedra, and OpenMP programs run directly
on the server.
- The Intel C compiler icc directly supports OpenMP 4.0 with tasking
and accelerator offload, and Cilk extensions to C including Cilk array notation.
(C arrays permit aliasing, which can inhibit vectorization.)
- Be sure to source /opt/intel/bin/compilervars.sh (bash) or
source /opt/intel/bin/compilervars.csh (csh)
to access the Intel compilers and tools.
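The remark above about aliasing can be illustrated in plain C. When two pointer arguments might refer to overlapping storage, the compiler must assume that a store through one can change what is read through the other, which can prevent it from vectorizing the loop. One standard remedy, shown here as a sketch rather than as course-provided code, is the C99 restrict qualifier, by which the programmer promises that the arrays do not overlap. The function name saxpy is a hypothetical choice.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = y[i] + a * x[i] for i in [0, n).
   `restrict` asserts that x and y do not overlap, so the compiler
   is free to load and store several elements at a time. Calling
   this with overlapping arrays is undefined behavior. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    size_t i;
    assert(n == 0 || (x != NULL && y != NULL));
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The restrict keyword requires C99 mode, for example gcc -std=c99; gcc's older default C89 mode does not recognize it.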
- GPU: CUDA 7.0 on gamma-x51-1.
- Intel Xeon Phi: Parallel Studio XE 2017 on phaedra
- MPI reference material
- Running MPI programs on Bass
This list will evolve throughout the semester. Specific reading
assignments are listed above.
- PRAM Algorithms, S. Chatterjee, J. Prins,
COMP 633 course notes, 2015.
- Memory Hierarchy in Cache-Based Systems,
R. v.d. Pas, Sun Microsystems, 2003.
- OpenMP tutorial, Blaise Barney, LLNL, 2013.
- The Implementation of the Cilk-5 Multithreaded Language,
M. Frigo, C. Leiserson, K. Randall, in
Proceedings of ACM Conf. on Programming Language Design and
- Shared Memory Consistency Models: A Tutorial,
S. V. Adve, K. Gharachorloo, DEC Western Research Labs Report 95/7, 1995.
- Computer Architecture: A Quantitative Approach 2nd ed,
D. Patterson, J. Hennessy, Morgan-Kaufmann 1996.
- Fast N-Body Simulation with CUDA, L. Nyland, M. Harris, J. Prins,
GPU Gems 3, 2008.
- An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors,
James Reinders, Intel Corp, 2012.
- "Questions and Answers about BSP", D. Skillicorn, J. Hill,
and W. McColl, Scientific Programming 6, 1997.
- Message Passing Interface,
Blaise Barney, LLNL, 2015.
- Introduction to Parallel Computing: Design and Analysis of
V. Kumar, A. Grama, A. Gupta, G. Karypis, Benjamin-Cummings, 1994.