Complications due to existing hardware



The technique is called “shared-memory parallelism”.
If we want to scale beyond one node, what do we need?
\(x_t(i,j)=f(x_{t-1}(i-1,j),x_{t-1}(i,j),x_{t-1}(i+1,j))\)
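The update rule above only reads values from the previous time step, so every point of the new time step can be computed independently. On a single node this is what makes shared-memory parallelism straightforward. The following is a minimal sketch in C with OpenMP, not the lecture's own code: the choice of `f` as the mean of its three inputs, the row-major layout and the fixed boundary rows are all assumptions made for illustration, and `stencil_step` is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical choice of f: the mean of the three inputs. */
static double f(double left, double centre, double right)
{
    return (left + centre + right) / 3.0;
}

/* One time step of
 *   x_t(i,j) = f(x_{t-1}(i-1,j), x_{t-1}(i,j), x_{t-1}(i+1,j)).
 * Arrays are stored row-major as x[i * nj + j]. Rows i = 0 and i = ni - 1
 * have no neighbour on one side; as an assumption they are copied unchanged. */
void stencil_step(const double *x_old, double *x_new, int ni, int nj)
{
    assert(ni > 0 && nj > 0);
    assert(x_old != x_new); /* the update must read the previous time step */

    /* Boundary rows: copied unchanged (assumption, see above). */
    for (int j = 0; j < nj; ++j) {
        x_new[j] = x_old[j];
        x_new[(ni - 1) * nj + j] = x_old[(ni - 1) * nj + j];
    }

    /* Interior rows: iterations are independent, so the threads can
     * share the work without any explicit data exchange. */
    #pragma omp parallel for
    for (int i = 1; i < ni - 1; ++i) {
        for (int j = 0; j < nj; ++j) {
            x_new[i * nj + j] = f(x_old[(i - 1) * nj + j],
                                  x_old[i * nj + j],
                                  x_old[(i + 1) * nj + j]);
        }
    }
}
```

With GCC the pragma is enabled by compiling with `-fopenmp`; without that flag the loop simply runs serially. All threads read `x_old` from the same memory, which is why this approach stops at the boundary of one node.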

| OpenMP | MPI |
|---|---|
| shared memory | distributed memory |
| can’t scale beyond one node | can scale across many nodes |
| uses independent threads | uses ranks that need to communicate with each other |
| no boundary exchange needed, because all threads see the same memory | explicit boundary exchange needed |
| pragmas inserted in the code tell the compiler where to parallelise | calls to a standardised library API, often provided by vendor-specific implementations |
OpenMPI is an MPI implementation and has nothing to do with OpenMP.
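To make "explicit boundary exchange" concrete, the sketch below emulates it in plain serial C, without MPI, for a one-dimensional version of the stencil. Each emulated rank owns a few cells plus one ghost cell on each side, and the ghost cells must be filled from the neighbouring rank before the local update. Everything here is an assumption for illustration: `rank_domain`, `exchange_halos`, `local_step`, the sizes, and `f` as a mean. In a real MPI program each rank would hold only its own sub-domain, and the copies in `exchange_halos` would be messages, for example via `MPI_Sendrecv`.

```c
#include <assert.h>

#define N_RANKS 2 /* hypothetical number of ranks */
#define N_LOCAL 3 /* cells owned by each rank */

/* Per-rank layout: [left ghost | N_LOCAL owned cells | right ghost] */
typedef struct {
    double x[N_LOCAL + 2];
} rank_domain;

/* Hypothetical choice of f: the mean of the three inputs. */
static double f(double left, double centre, double right)
{
    return (left + centre + right) / 3.0;
}

/* Emulated boundary exchange between neighbouring ranks.
 * With MPI, each assignment would be a message between two ranks. */
void exchange_halos(rank_domain d[], int n_ranks)
{
    assert(n_ranks >= 1);
    for (int r = 0; r + 1 < n_ranks; ++r) {
        d[r].x[N_LOCAL + 1] = d[r + 1].x[1];   /* right ghost of r   <- first cell of r+1 */
        d[r + 1].x[0]       = d[r].x[N_LOCAL]; /* left ghost of r+1  <- last cell of r    */
    }
}

/* Local update of the owned cells; only valid after exchange_halos. */
void local_step(const rank_domain *prev, rank_domain *next)
{
    next->x[0] = prev->x[0];                     /* ghosts are carried over and */
    next->x[N_LOCAL + 1] = prev->x[N_LOCAL + 1]; /* refreshed by the next exchange */
    for (int i = 1; i <= N_LOCAL; ++i) {
        next->x[i] = f(prev->x[i - 1], prev->x[i], prev->x[i + 1]);
    }
}
```

A time step is then `exchange_halos` followed by `local_step` on every rank. The outer ghost cells of the first and last rank are never overwritten by the exchange, so in this sketch they act as fixed boundary values.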
#include <mpi.h> // include the header for the MPI library

int main(int argc, char** argv){
    int no_of_ranks, my_rank;
    MPI_Init(&argc, &argv); // initialise the MPI environment
    MPI_Comm_size(MPI_COMM_WORLD, &no_of_ranks); // get number of MPI ranks
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); // get rank of this process
    MPI_Finalize(); // terminate MPI computation
}

Get the example code from here
Load the GNU compiler and the OpenMPI library on Levante
The code is supposed to calculate the maximum value in a vector, but so far each rank only computes its own local maximum. Adapt the code to calculate the global maximum. The code already contains hints about what you need to do.
All ranks print the same result. Can you fix that?
Existing technologies
1 CPU node has 2x AMD EPYC 7763 (Milan)
1 GPU node has 4x NVIDIA A100, each with 108 SMs
More insights in the “Memory hierarchies” lecture on July 2nd
Data from Wikipedia
Data from Top500