Revised: Mon Dec 10 2018 by firstname.lastname@example.org
This is an introductory graduate course on parallel computing.
Upon completion, you should
- be able to design and analyze parallel algorithms for a variety of
problems and computational models,
- be familiar with the hardware and software organization of
high-performance parallel computing systems, and
- have experience with the implementation of parallel applications
on high-performance computing systems, and be able to measure,
tune, and report on their performance.
Additional information, including the course syllabus, can be found in the course overview.
The final exam will be held on Thursday December 13 at noon in our regular classroom.
Two practice problems and their solutions are available.
The midterm will be held in-class on Thursday Oct 11.
Some practice questions are available.
If you're registered in the class, you should be able to log in to phaedra.cs.unc.edu and have a home directory
for your work.
- This course will use Piazza to manage class questions and discussions online.
(some material local-access only)
(for Tue Nov 27)
Look over Kumar et al.,
Basic Communication Operations.
(for Tue Nov 20)
MPI tutorial by Blaise Barney, LLNL.
(for Thu Nov 8)
Skim the Questions and Answers about BSP,
pp 1-25. We will not use BSPLib directly; rather, we use the BSP model together
with communication operations from the MPI library.
(For Thu Oct 25)
Read Nyland et al., Fast N-Body Simulation with CUDA.
(for Tue Oct 23)
Have a look at the CUDA C Programming Guide (v9.2)
and the CUDA Best Practices Guide (v9.2).
(For Tue Oct 2)
Hennessy & Patterson Ch. 8, sections 8.5 - 8.6
(synchronization primitives in shared memory, and memory consistency models).
(For Thu Sep 27)
Memory consistency models tutorial (sections 1-6, pp 1 -17).
(For Thu Sep 20)
The Implementation of the Cilk-5 Multithreaded Language, sections 1 - 3.
(For Tue Sep 18) Look through
OpenMP Tutorial sections 7-9 and read about TASK directives.
Cilk is a simple extension of C that includes tasking.
For a modern look at Cilk, see this short tutorial on
Intel Cilk Plus,
which also includes the expression of data parallelism using Matlab-like array notation
(which uses the vector operations on Intel Xeon processors).
(For Thu Sep 13) UNC has cancelled classes for the remainder of this week,
see reading assignment for Tue Sep 18.
(For Tue Sep 11)
OpenMP Tutorial sections 3-5 and section 6 only up to the first exercise. Most examples are
shown in C/C++ and in Fortran, so just ignore the Fortran.
(For Thu Sep 6)
Read the overview of
Memory Hierarchy in Cache-based Systems.
(For Tue Sep 4)
Read PRAM Handout; skim section 5.
(For Thu Aug 30)
Read PRAM Handout sections 3.6, 4.1
(For Tue Aug 28)
Read PRAM Handout sections 3.2, 3.3, 3.5
(For Thu Aug 23)
Read PRAM Handout sections 1, 2, 3.1 (pp 1 - 8)
(For Thu Aug 23)
Review the course overview.
Written and Programming Assignments
- Written Assignments
- (Aug 28) Written assignment WA1 is available. Due date is Tue Sep 11.
WA1 sample solutions are available.
(Oct 2) Written assignment WA2 is available. Due date is Wed Oct 10 at 2PM.
WA2 sample solutions are available.
(Nov 15) Written assignment WA3 is available. Due date is Tue Dec 4
(at the start of class!).
WA3 sample solutions are available.
- Programming Assignments
- (Sep 11) Programming Assignment PA1(a) is available. Due date is Thu Sep 20
(at the start of class!).
A reference implementation of sequential all-pairs n-body simulation, written in C and including a test case for verifying
your implementation, is available: pa1-ref.c
Compile with gcc or icc: icc -fopenmp -Ofast pa1-ref.c -o pa1-ref
(Oct 2) Programming Assignment PA1(b) is available. Due date is
Tue Oct 16 (at the start of class!).
Updated Programming Assignment PA2 is available. Due date is
Sat Dec 13 (project). (The original handout can be found here)
Platforms and Programming Models
- phaedra is an Intel Xeon E5-2650v4 compute server dedicated to this class with 20 cores
and an attached Nvidia Titan V (Volta) accelerator. COMP 633 students have logins on phaedra.
OpenMP, Cilk, and Cuda programming models are supported.
- longleaf is a research computing cluster with ~350 Intel Xeon E5-2643 nodes providing 24 cores per node.
OpenMP and Cilk programming models are supported on individual nodes. Compute jobs are submitted using slurm.
- dogwood is a research computing cluster with ~240 Intel Xeon E5-2699A nodes providing 44 cores per node.
The MPI programming model is supported to coordinate and communicate among nodes. A subset of the nodes have Intel Xeon Phi (KNL) accelerators.
The individual nodes support MPI, OpenMP, Cilk, OpenACC, and Intel offload (Intel Xeon Phi) programming models.
Compute jobs are submitted using slurm.
- Intel compilers for C/C++ (icc/icpc) support OpenMP 4.5 with tasking
and accelerator offload, and Cilk extensions to C including Cilk array notation.
- On phaedra, source /opt/intel/bin/compilervars.sh (bash) or
source /opt/intel/bin/compilervars.csh (csh)
to access the Intel compilers and tools.
- On research computing clusters use "module add icc" to access Intel compilers.
- Shared memory parallel programming. Specification of the
OpenMP 4.5 API for C/C++ (supported by the Intel compilers).
For a more accessible introduction see the tutorial for OpenMP 3.1 in the Bibliography below.
- Nvidia GPUs: programmed using Cuda C (Compute Capability 7.0 for V100 on phaedra).
- Intel Xeon Phi: C/C++ (or Fortran) with Intel offload directives or OpenMP accelerator (target) directives on dogwood.
- MPI reference material
- MPI programs can be submitted to dogwood
This list will evolve throughout the semester. Specific reading
assignments are listed above.
- PRAM Algorithms, S. Chatterjee, J. Prins,
COMP 633 course notes, 2015.
Memory Hierarchy in Cache-Based Systems,
R. v.d. Pas, Sun Microsystems, 2003.
OpenMP 3.1 tutorial, Blaise Barney, LLNL, 2015.
The Implementation of the Cilk-5 Multithreaded Language,
M. Frigo, C. Leiserson, K. Randall, in
Proceedings of the ACM Conf. on Programming Language Design and Implementation (PLDI), 1998.
- Shared Memory Consistency Models: A Tutorial,
S. V. Adve, K. Gharachorloo, DEC Western Research Labs Report 95/7, 1995.
- Computer Architecture: A Quantitative Approach, 2nd ed.,
J. Hennessy, D. Patterson, Morgan Kaufmann, 1996.
Fast N-Body Simulation with CUDA, L. Nyland, M. Harris, J. Prins,
GPUGems 3, 2008.
An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors,
James Reinders, Intel Corp, 2012.
- Questions and Answers about BSP, D. Skillicorn, J. Hill,
and W. McColl, Scientific Programming 6, 1997.
- Message Passing Interface,
Blaise Barney, LLNL, 2015.
- Introduction to Parallel Computing: Design and Analysis of
Algorithms - Chapter 3,
V. Kumar, A. Grama, A. Gupta, G. Karypis, Benjamin-Cummings, 1994.
This page is maintained by
Send mail if you find problems.