Parallelism

Cut my life into pieces
This is my last resort

Georgiana Mania, Claudia Frauen, Jan Frederik Engels

Motivation

  • We have a serial code and want to make it faster
  • Plan of action:
    • Cut problem into smaller pieces
    • Use independent compute resources for each piece
  • Outlook for next week: The individual computing element does no longer get much faster, but there are more of them

This lecture

  • Is mostly about parallelism as a concept
  • Next week: Complications due to existing hardware

Our example problem

  • 1d Tsunami equation
  • Korteweg–De Vries equation - “PDE modeling waves on shallow water surfaces” Wikipedia
  • Discretization not numerically accurate

Our example problem

while ( t < 50 ){
        //[...]
        for(i=0; i<npoints; i++){
                //[...]
                u[i]=u2[i]-(delta_t/delta_x)*lu(h1,u1,i);
                h[i]=h2[i]-(delta_t/delta_x)*lh(h1,u1,i);
        }
}

Decomposing problem domains

Our example problem domain

Other problem domains


  • Climate model (e.g. ICON) horizontal grid decomposition

adapted from original image by MPI-M

Introducing OpenMP

  • A popular way to parallelize code
  • Pragma-based parallelization API
    • You annotate your code with parallel regions and the compiler does the rest
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2 * i;

(#pragma is a keyword for the compiler)

OpenMP threads

  • OpenMP uses something called threads
    • Wait until next week for a definition


N = 8
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2 * i;

Hands-on Session!

  1. Get the example code from here

  2. Load the GNU compiler on Levante

module load gcc
  1. Compile and run the serial example and check the runtime
gcc main.c -o serial.x -lm
 ./serial.x
  1. Now use OpenMP and compare the runtime
gcc -fopenmp main.c -o parallel.x -lm
OMP_NUM_THREADS=2 ./parallel.x
  1. See next slide!

Hands-on Session!

  1. Now compile/run with
gcc -fopenmp main.c -o parallel.x -lm -DWRITE_DECOMP
OMP_NUM_THREADS=4 ./parallel.x
  1. What does the additional output mean?

  2. Now uncomment/adapt

    • schedule(static,100)
    • schedule(static,10) and interpret the results.

Scaling

Bake mini-pancake dessert

Images generated by Pradipta Samanta (DKRZ) with DALL-E

Strong vs weak scaling

Starting with 1 pan, we can (perfect) scale this by using \(N\) pans in two ways:

Parameter Initial Strong scaling Weak scaling
Resources
(e.g. pans)
1 \(N\) \(N\)
Total workload
(e.g. pancake count)
\(8\) \(8\) \(8 \times N\)
Workload per worker
(e.g. pancakes per pan)
\(8\) \(8/N\) \(8\)
Total time \(T\) \(T/N\) \(T\)

Beyond independent iterations

What is going wrong here?

    temp = 0;
#pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        temp = 2 * a[i];
        b[i] = temp + 4;
    }

Solution

    temp = 0;
#pragma omp parallel for private(temp)
    for (int i = 0; i < N; ++i) {
        temp = 2 * a[i];
        b[i] = temp + 4;
    }

The problem is called “data race”.

What is happening here?

    int a[] = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20};
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];

What is happening here?

    int a[] = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20};
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];

Assume 2 threads!

Solution

    int a[] = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20};
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];

For other reduction operations like e.g. max see OpenMP documentation

Other common errors

  • Race conditions
    • The outcome of a program depends on the relative timing of multiple threads.
  • Deadlocks
    • Multiple threads wait for a resource that cannot be fulfilled.
  • Starvation
    • A thread is blocked indefinitely waiting on a resource

Finally: A definition of parallelism

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem

Introduction to Parallel Computing Tutorial, LLNL

Types of parallelism

  • Data-level parallelism
    • what we’ve been discussing
  • Task-level parallelism
    • Example: Atmosphere ocean coupling

Precondition for parallel execution


Two consecutive instructions or code segments can be executed in parallel if they are independent.


What does dependence mean?

Code and data dependence

  • Data dependence - the instructions share the same data
a = b;
c = a + b; //  flow dependence
  • Control dependence - execution order decided at runtime
for (int i = 1; i < n ; i++) {
  if (a[i-1] > a[i]) { 
    a[i] = a[i] + 1;
   } else {
    a[i] = 0;
   } 
} 
  • Resource dependence - share the same resource
a = b;
b = 42; // write after read: a has an old value

Bernstein’s parallelism conditions

The intersection between read-write set, write-read set and write-write set of instructions is null.

– Bernstein’s condition for data independence.

c = a + b;  // S1
d = a - b;  // S2

Read and write sets for S1 and S2: \[ R_1 = \{a,b\} ; W_1 = \{c\} \\ R_2 = \{a,b\} ; W_2 = \{d\} \\ \]

Bernstein’s conditions:

\[ R_1 \cap W_2 = \emptyset \\ W_1 \cap R_2 = \emptyset \\ W_1 \cap W_2 = \emptyset \]

S1 and S2 can be executed in parallel!

Bernstein’s parallelism conditions

c = a + b;  // S1
d = a - b;  // S2

Replace a with c in statement 2

c = a + b;  // S1
d = c - b;  // S2

Read and write sets for S1 and S2: \[ R_1 = \{a,b\} ; W_1 = \{c\} \\ R_2 = \{c,b\} ; W_2 = \{d\} \\ \]

Bernstein’s conditions:

\[ R_1 \cap W_2 = \emptyset \\ W_1 \cap R_2 = \{c\} \\ W_1 \cap W_2 = \emptyset \]

S1 and S2 can NOT be executed in parallel!

Best practices

  • Parallelisation should not change the results! Exceptions to be discussed next week!

Additional reading

  • “Computer Architecture - A Quantitative Approach” - J. Hennessy and D. Patterson
  • “Introduction to High Performance Computing for Scientists and Engineers” - G. Hager and G. Wellein
  • “Introduction to Parallel Computing Tutorial” - Online, Lawrence Livermore National Laboratory