Revised: Thu Nov 14, 2019 by email@example.com
This is an introductory graduate course on parallel computing.
Upon completion, you should
- be able to design and analyze parallel algorithms for a variety of
problems and computational models,
- be familiar with the hardware and software organization of
high-performance parallel computing systems, and
- have experience with the implementation of parallel applications
on high-performance computing systems, and be able to measure,
tune, and report on their performance.
Additional information can be found in the sections below.
- We will use the shared memory multiprocessor phaedra.cs.unc.edu in this class
- Use this machine for assignment pa1a (sequential implementation) and pa1b (parallel implementation)
- Log in with your CS username and password. You will have a home directory on phaedra.
I will send individual emails to students not in CS with their CS login and additional instructions.
- You will need to collect performance data in pa1a. Don't wait until the last minute, as
multiple simultaneous performance runs will give a distorted picture of performance.
- This course will use Piazza to manage class questions and discussions online.
(some material local-access only)
(for Tue Nov 19) Look over Kumar et al.,
Basic Communication Operations.
(For Tue Nov 12) Skim
MPI tutorial by Blaise Barney, LLNL.
(For Tue Nov 5)
Skim the Questions and Answers about BSP,
pp. 1-25. We will not use BSPLib directly; rather, we use the BSP model together
with communication operations from the MPI library.
(For Oct 22)
Look over the Cuda programming guide (9.2)
and the Cuda best practices guide (9.2).
Read Nyland et al., Fast N-Body Simulation with CUDA.
(For Oct 3)
Hennessy & Patterson Ch. 8, sections 8.5 - 8.6
(synchronization primitives in shared memory, and memory consistency models).
(For Thu Sep 26)
Memory consistency models tutorial (sections 1-6, pp 1 -17).
(For Tue Sep 17) Look through
OpenMP Tutorial sections 7-9 and read about TASK directives.
Cilk is a simple extension of C that includes tasking.
For a modern look at Cilk, see this short tutorial on
Intel Cilk Plus,
which also includes expression of data parallelism using Matlab-like array notation
(using the vector operations on Intel Xeon processors).
(For Thu Sep 12)
Look through OpenMP Tutorial
sections 3-5 and section 6 only up to the first exercise. Most examples are
shown in C/C++ and in Fortran, so just ignore the Fortran.
(For Thu Sep 5)
Read the overview of
Memory Hierarchy in Cache-based Systems
(For Tue Sep 3)
Read PRAM Handout; skim section 5
(For Thu Aug 29)
Read PRAM Handout sections 3.6, 4.1
(For Tue Aug 27)
Read PRAM Handout sections 3.2, 3.3, 3.5
(For Thu Aug 22)
Read PRAM Handout sections 1, 2, 3.1 (pp. 1-8)
Written and Programming Assignments
- Written Assignments
(Aug 27) Written assignment WA1. Please read submission instructions.
Due date is Tue Sep 10.
WA1 sample solutions
(Sep 26) Written Assignment WA2. Due date is Oct 8 at start of class.
You may work with another classmate or on your own.
WA2 sample solutions
(Nov 14) Written Assignment WA3 ... stay tuned
- Programming Assignments
(Sep 10) Programming Assignment PA1(a) is available.
You may work with a partner on this project or on your own.
Due date is Thu Sep 19 (at the start of class!).
- Sample evolution of 3 body system: 3 bodies, initially at rest, with positions and masses as shown at time zero
(Updated) All-pairs sample implementation v1.c - compile with icc or gcc: icc -Ofast -fopenmp -o v1 v1.c
Half-pairs sample implementation v3n.c - compile with icc or gcc: icc -Ofast -fopenmp -o v3n v3n.c
(Sep 24) Programming Assignment PA1(b) is available.
Due date is Oct 15 (start of class).
- (Oct 29) Programming Assignment PA2 is available.
Platforms and Programming Models
- phaedra is an Intel Xeon E5-2650v4 compute server dedicated to this class with 20 cores
and an attached Nvidia Titan V100 accelerator. COMP 633 students have logins on phaedra.
OpenMP, Cilk, and Cuda programming models are supported.
- longleaf is a research computing cluster with ~350 Intel Xeon E5-2643 nodes providing 24 cores per node.
OpenMP and Cilk programming models are supported on individual nodes. Compute jobs are submitted using slurm.
- dogwood is a research computing cluster with ~240 Intel Xeon E5-2699A nodes providing 44 cores per node.
The MPI programming model is supported to coordinate and communicate among nodes.
A subset of the nodes have Intel Xeon Phi (KNL) accelerators.
The individual nodes support MPI, OpenMP, Cilk, OpenACC, and Intel offload (Intel Xeon Phi) programming models.
Compute jobs are submitted using slurm.
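A compute job on either cluster is submitted with sbatch. The script below is a hypothetical template: the job name, core count, time limit, and program name are placeholders, so check the research-computing documentation for the real partition names and limits before using it:

```shell
#!/bin/bash
# Hypothetical slurm batch script for a shared-memory (OpenMP) job on one
# node; all values below are illustrative placeholders.
#SBATCH --job-name=pa1b
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --time=00:10:00

# give OpenMP one thread per allocated core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./v1 10000
```

Submit with `sbatch job.sh`; `squeue -u $USER` shows the job's status in the queue.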
- Intel C/C++ compiler (icc/icpc 2019)
- supports OpenMP 4.5 with tasking and accelerator offload.
- supports Cilk extensions to C, including Cilk array notation, but these are deprecated (they will be dropped after 2019)
- On phaedra, source /opt/intel/bin/compilervars.sh intel64 (bash) or
source /opt/intel/bin/compilervars.csh intel64 (csh) to access the Intel compilers and tools.
- On research computing clusters use "module add icc" to access Intel compilers.
- GNU compiler (gcc/g++ 4.8.5)
- supports OpenMP 3.1
- to use gcc/g++ on phaedra make sure you have /usr/bin on your path.
- Shared memory parallel programming. Specification of the
OpenMP 4.5 API for C/C++ (supported by the Intel compilers).
For a more accessible introduction see the tutorial for OpenMP 3.1 in the Bibliography below.
- Nvidia GPUs: programmed using Cuda C (Compute Capability 7.0 for V100 on phaedra).
- Intel Xeon Phi: C/C++ (or Fortran) with Intel offload directives or OpenMP accelerator directives on dogwood.
- MPI reference material
- MPI programs can be submitted to dogwood
This list will evolve throughout the semester. Specific reading
assignments are listed above.
- PRAM Algorithms, S. Chatterjee, J. Prins,
COMP 633 course notes, 2015.
- Memory Hierarchy in Cache-Based Systems,
R. van der Pas, Sun Microsystems, 2003.
- OpenMP 3.1 tutorial, Blaise Barney, LLNL, 2015.
- Cilk Plus Tutorial (online tutorial).
- Shared Memory Consistency Models: A Tutorial,
S. V. Adve, K. Gharachorloo, DEC Western Research Labs Report 95/7, 1995.
- Computer Architecture: A Quantitative Approach, 2nd ed.,
J. Hennessy, D. Patterson, Morgan Kaufmann, 1996.
- Fast N-Body Simulation with CUDA, L. Nyland, M. Harris, J. Prins,
GPU Gems 3, 2008.
- Questions and Answers about BSP, D. Skillicorn, J. Hill,
and W. McColl, Scientific Programming 6, 1997.
- Message Passing Interface,
Blaise Barney, LLNL, 2015.
- Introduction to Parallel Computing:
Design and Analysis of Algorithms - Chapter 3,
V. Kumar, A. Grama, A. Gupta, G. Karypis, Benjamin-Cummings, 1994.
This page is maintained by
Send mail if you find problems.