Memory Hierarchies

Dominik Zobel and Florian Ziemen

Memory Hierarchies

  • Background
  • Why you should care
  • How to use it to your advantage

Intended Takeaways

  • Locality matters: data-centric view
  • Think workbench: Operating with parts of the data
  • Processor tries to be busy all the time (prefetching)
  • Latency and memory sizes of components

Questions

  • Why not keep everything in memory?
  • What to do with really big data?
  • How to speed up processing data?

Memory Pyramid (upwards)

Speed and access time

Processor speed vs. main memory speed

Based on figure from “Computer Architecture” by J. Hennessy and D. Patterson

Disk I/O timings

Levante (Fixed) Laptop (Fixed) Levante (Random) Laptop (Random)
Create data 0.56 0.23 1.66 0.88
Store data 2.23 4.04 2.23 3.44
Load data 0.76\(^*\) 0.88\(^*\) 0.76\(^*\) 0.92\(^*\)
Process data 0.76 0.38 0.76 0.37

Time in seconds using a 2 GB numpy array (\(128 \times 128 \times 128 \times 128\)) either with a fixed number or random number in each entry

\(^*\): Lower than one due to caching effects

Disk I/O

  • Reading/writing to file is rather expensive
  • If necessary during computation, try doing it asynchronously
  • If possible, keep data in memory

Memory access patterns

Execution speed depends on data layout in memory

program loop_exchange
   implicit none
   integer, parameter :: nel = 20000
   ! Note: Stay below 46000 to prevent overflows below
   integer, dimension(nel, nel) :: mat
   integer :: i, j

   do i = 1, nel
      do j = 1, nel
         mat(j, i) = (i-1)*nel + j-1
      end do
   end do
end program loop_exchange

Loop order with optimal access of elements (contiguous memory).

Hands-On

  1. Compile the Fortran code from the previous slide (also here) on Levante or your PC. On Levante load the gcc module first (module load gcc). Then measure the time needed to run the program, i.e.
gfortran loops.f90 -o loops
time ./loops
  1. Exchange the loop variables i and j in line 8 and 9 and compile again. How does it impact the run time?

  2. Try different values for nel (for the original loop order and the exchanged one). How does the matrix size relate to the effect of exchanged loops?

Memory Models

  • Study effect of latencies, cache sizes, block sizes, …
  • Here just focus on latency

First model version

One layer of RAM cache between the CPU and the disk.

Memory access time for first version

Cache Access Time Hit Ratio
Memory \(T_M\) \(H_M\)
Disk \(T_D\)
  • Parallel and serial requests possible

\[\begin{align} T_{avg,p} &= H_M T_M + (1-H_M) \cdot \color{blue}{T_D}\\ T_{avg,s} &= H_M T_M + (1-H_M) \cdot \color{blue}{(T_M + T_D)} \end{align}\]

Second model version

Three layers of caches

Memory access time for second version (1/2)

Cache Access Time Hit Ratio
\(L_1\) \(T_1\) \(H_1\)
\(L_2\) \(T_2\) \(H_2\)
\(L_3\) \(T_3\)

Memory access time for second version (2/2)

  • Average memory access time \(T_{avg,p}\) for parallel access (processor connected to all caches)

\[\begin{align} T_{avg,p} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot \color{blue}{T_2}\\ &+ ((1-H_1)\cdot(1-H_2))\cdot \color{blue}{T_3} \end{align}\]

  • Average memory access time \(T_{avg,s}\) for serial access

\[\begin{align} T_{avg,s} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot \color{blue}{(T_1+T_2)}\\ &+ ((1-H_1)\cdot(1-H_2))\cdot \color{blue}{(T_1+T_2+T_3)} \end{align}\]

Processor techniques

The general view

If data is not available in the current memory level

  • register spilling

Register has to look for the data in the L1 cache

  • cache miss

The current cache has to fetch the data from the next cache or main memory

  • page fault

Data was not found in main memory and has to be loaded from disk

Requesting unavailable data

  • Sending request
  • If needed, forward request until found
  • Load data into cache(s)
  • Process available data

Provide data which might be needed

  • caching

Keep data around which was needed once

  • prefetching

Load data which might be needed soon (spatial or temporal proximity, heuristics)

  • branch prediction

Similar to prefetching, load data needed for different code paths

Use cached data

  • Sending request
  • Data is already present
  • Load data into cache
  • Process it

Memory hierarchy on Levante

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

  • Base frequency: 2.45 GHz

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

Latency Capacity
Register ~0.4 ns 1 KB
  • L1-L3 Cache are a few times slower

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

Latency Capacity
Register ~0.4 ns 1 KB
L1 Cache ~1 ns 32 KB
L2 Cache a few ns 512 KB
L3 Cache ~10 ns 32 MB
  • 256 GB of main memory (default) with a theoretical memory bandwidth of ~200 GB/s

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

Latency Capacity
Register ~0.4 ns 1 KB
L1 Cache ~1 ns 32 KB
L2 Cache a few ns 512 KB
L3 Cache ~10 ns 32 MB
Main Memory 10s of ns 256 GB
  • Fast Data as Flash based file system

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

Latency Capacity
Register ~0.4 ns 1 KB
L1 Cache ~1 ns 32 KB
L2 Cache a few ns 512 KB
L3 Cache ~10 ns 32 MB
Main Memory 10s of ns 256 GB
SSD 100s of µs 200 TB
  • File system at Levante ~130 PB, limited by quota for project

Memory Pyramid (downwards)

Based on a typical Levante node (AMD EPYC 7763)

Latency Capacity
Register ~0.4 ns 1 KB
L1 Cache ~1 ns 32 KB
L2 Cache a few ns 512 KB
L3 Cache ~10 ns 32 MB
Main Memory 10s of ns 256 GB
SSD 100s of µs 200 TB
Hard disk a few ms 130 PB

Memory Mountain (1/2)

Code for program contained in “Computer Systems”:

https://csapp.cs.cmu.edu/3e/mountain.tar

  • Process a representative amount of data
  • Use stride between array elements to control spatial locality
  • Use size of array to control temporal locality
  • Also warm up the cache before the actual measurements

Memory Mountain (2/2)

\(\approx\) Factor 20 between best and worst access

Hands-On

  1. Download and extract the C source code of the memory mountain program linked in the from previous slides
  2. Compile the program and run it on your PC or a Levante compute node
  3. Which factor do you get between best and worst performance?
  4. (optional) Visualize your results

Different architectures

  • Different caches are available
  • Speed and size of caches varies
  • Basic understanding helps in all cases
  • Hardware-specific knowledge allows additional fine-tuning

Memory on Levante GPUs

  • For a NVIDIA A100 80GB GPU (4x in a Levante GPU node)
  • Register and L1 Cache for one (of 108) Streaming Multiprocessor of a GPU
Latency Capacity
Register ~1 ns 4 x 64 KB
L1 Cache a few ns 192 KB
L2 Cache (shared) ~10 ns 40MB
Main Memory (HBM2e) 10s of ns 80 GB

Summary

Observations

  • Gap between processor and memory speeds. Hierarchy needed because of discrepancy between speed of CPU and (main) memory

  • exploit accessing data and code stored close to each other (temporal and spatial locality)

Resources

  • “Computer Systems: A Programmer’s Perspective” by R. Bryant and D. O’Hallaron, Pearson
  • “Computer Architecture” by J. Hennessy and D. Patterson, O’Reilly