Computing devices

Complications due to existing hardware

Claudia Frauen, Georgiana Mania, Jan Frederik Engels

Recap from last week

  • What is parallelism?
  • Which type of parallelism did we discuss?
  • What is actually done in parallel?
  • Which technique did we use for parallelisation?

Shared memory parallelism

What is a high performance computer?

What is a high performance computer?

Finally, what are OpenMP threads?


  • Threads are independent control flows through the same program using the same shared memory
  • One thread usually maps to one core

Limits of OpenMP

  • Can’t scale beyond one node
  • Can get complicated if architecture is complicated

The technique is called “shared-memory parallelism”.

Beyond shared memory

If we want to scale beyond one node, what do we need?

Interlude: More complex example

Stencil operations



Stencil operations


\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

Stencil operations


\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

Stencil operations - domain decomposition

\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

Stencil operations - domain decomposition

\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

  • looking at a single block
  • no problem when computing central points

Stencil operations - domain decomposition

\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

  • but what about points on the border?

Stencil operations - halo cells

\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)

  • Solution: Introduction of halo cells, which need to be communicated from neighboring blocks

Beyond shared memory

If we want to scale beyond one node, what do we need?

  • Interconnect between nodes
  • Boundary-data exchange
  • Collective communications

Distributed memory parallelism

  • The dominating standard is MPI (Message Passing Interface)
  • An MPI-Program has multiple ranks which can run on different nodes.
  • Point-to-Point mechanism transfers data (“messages”) between two ranks.
    • Often used for boundary exchange
  • A lot of the details are beyond the scope of this lecture.

OpenMP vs MPI

OpenMP MPI
shared memory distributed memory
can’t scale beyond one node can scale across many nodes
uses independent threads uses ranks that need to communicate with each other
automatic boundary exchange due to shared memory explicit boundary exchange needed
pragmas inserted in the code tell the compiler where to parallelise API to vendor specific libraries

OpenMPI is a MPI implementation and has nothing to do with OpenMP

Message Passing Interface (MPI)

MPI: How does it work?

#include <mpi.h>           // include the header for the MPI library 

int main(int argc, char** argv){

   MPI_Init(&argc, &argv); // initiate MPI computation

   MPI_Finalize();         // terminate MPI computation

}

MPI: How does it work?

#include <mpi.h>           // include the header for the MPI library

int main(int argc, char** argv){

   int no_of_ranks, my_rank;

   MPI_Init(&argc, &argv); // initiate MPI computation

   MPI_Comm_size(MPI_COMM_WORLD, &no_of_ranks); // get number of MPI ranks

   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  // get rank of this process

   MPI_Finalize();         // terminate MPI computation

}

Collectives

  • Collective communication that involves a group of processes (ranks)
  • Examples: broadcast, gather, scatter, reduction …
  • Read more on mpi-forum.org

Collectives - Aggregate results

int MPI_Reduce(void* sendbuf, void* recvbuf, int count, 
               MPI_Datatype datatype, 
               MPI_Op op,   // MPI_MAX, ...
               int root, 
               MPI_Comm comm)  


Hands-on Session!

  1. Get the example codes from here

  2. Load the GNU compiler and the OpenMPI library on Levante

module load gcc openmpi/4.1.2-gcc-11.2.0
  1. Compile the first example code with MPI and run with 4 ranks
mpicc mpi_example_1.c -o mpi.x -lm
mpirun -n 4 mpi.x  
  1. Can you explain the output? (You can ignore any warnings regarding UCX)

Hands-on Session!

  1. Compile the second example code with MPI and run with 4 ranks
mpicc mpi_example_2.c -o mpi.x -lm
mpirun -n 4 mpi.x
  1. The code is supposed to calculate the maximum number in a vector, but so far it is only doing it locally for each rank. Adapt the code to calculate the global maximum. There are already hints in the code what you need to do.

  2. All ranks print the same. Can you fix that?

MPI+X

  • MPI can be combined with shared memory paradigms, like OpenMP or OpenACC
  • Current state of the art

But what actually does the work?

Existing technologies

  • CPU - compute units in laptops, supercomputers
  • GPU - initially designed for graphycs processing only, now also used (as accelerators) in supercomputers
  • (Vector Engines e.g. NEC Vector)
  • FPGA/ASIC
  • 🦄?

CPU vs GPU (on Levante)

1 CPU node has 2x AMD EPYC 7763 Milan

1 GPU node has 4x NVIDIA A100, each with 128 SM

More insights in the “Memory hierarchies” lecture on July 2nd

Categorization

  • MIMD: Multiple Instructions, Multiple Data
    • Each unit executes a different instruction on different chunks of data
    • E.g. multiple cores in a CPU
  • SIMD: Single Instruction, Multiple Data
    • Each unit executes the same instruction on different chunks of data
    • E.g. vector engines

Technology is evolving fast!

Data from Wikipedia

Top performance is keeping up!

Data from Top500

Additional reading

  • HLRS MPI course material: https://www.hlrs.de/training/self-study-materials/mpi-course-material
  • MPI standard 3.1 documentation as PDF
  • “Computer Architecture - A Quantitative Approach” - J. Hennessy and D. Patterson