





Based on figure from “Computer Architecture” by J. Hennessy and D. Patterson
| Levante (Fixed) | Laptop (Fixed) | Levante (Random) | Laptop (Random) | |
|---|---|---|---|---|
| Create data | 0.56 | 0.23 | 1.66 | 0.88 |
| Store data | 2.23 | 4.04 | 2.23 | 3.44 |
| Load data | 0.76\(^*\) | 0.88\(^*\) | 0.76\(^*\) | 0.92\(^*\) |
| Process data | 0.76 | 0.38 | 0.76 | 0.37 |
Time in seconds using a 2 GB numpy array (\(128 \times 128 \times 128 \times 128\)) either with a fixed number or random number in each entry
\(^*\): Lower than one due to caching effects
Execution speed depends on data layout in memory
Loop order with optimal access of elements (contiguous memory).
gcc module first (module load gcc). Then measure the time needed to run the program, i.e.Exchange the loop variables i and j in line 8 and 9 and compile again. How does it impact the run time?
Try different values for nel (for the original loop order and the exchanged one). How does the matrix size relate to the effect of exchanged loops?
One layer of RAM cache between the CPU and the disk.



| Cache | Access Time | Hit Ratio |
|---|---|---|
| Memory | \(T_M\) | \(H_M\) |
| Disk | \(T_D\) |
\[\begin{align} T_{avg,p} &= H_M T_M + (1-H_M) \cdot \color{blue}{T_D}\\ T_{avg,s} &= H_M T_M + (1-H_M) \cdot \color{blue}{(T_M + T_D)} \end{align}\]
Three layers of caches


| Cache | Access Time | Hit Ratio |
|---|---|---|
| \(L_1\) | \(T_1\) | \(H_1\) |
| \(L_2\) | \(T_2\) | \(H_2\) |
| \(L_3\) | \(T_3\) |
\[\begin{align} T_{avg,p} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot \color{blue}{T_2}\\ &+ ((1-H_1)\cdot(1-H_2))\cdot \color{blue}{T_3} \end{align}\]
\[\begin{align} T_{avg,s} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot \color{blue}{(T_1+T_2)}\\ &+ ((1-H_1)\cdot(1-H_2))\cdot \color{blue}{(T_1+T_2+T_3)} \end{align}\]



Register has to look for the data in the L1 cache
The current cache has to fetch the data from the next cache or main memory
Data was not found in main memory and has to be loaded from disk





Keep data around which was needed once
Load data which might be needed soon (spatial or temporal proximity, heuristics)
Similar to prefetching, load data needed for different code paths



Based on a typical Levante node (AMD EPYC 7763)
Based on a typical Levante node (AMD EPYC 7763)
| Latency | Capacity | |
|---|---|---|
| Register | ~0.4 ns | 1 KB |
Based on a typical Levante node (AMD EPYC 7763)
| Latency | Capacity | |
|---|---|---|
| Register | ~0.4 ns | 1 KB |
| L1 Cache | ~1 ns | 32 KB |
| L2 Cache | a few ns | 512 KB |
| L3 Cache | ~10 ns | 32 MB |
Based on a typical Levante node (AMD EPYC 7763)
| Latency | Capacity | |
|---|---|---|
| Register | ~0.4 ns | 1 KB |
| L1 Cache | ~1 ns | 32 KB |
| L2 Cache | a few ns | 512 KB |
| L3 Cache | ~10 ns | 32 MB |
| Main Memory | 10s of ns | 256 GB |
Based on a typical Levante node (AMD EPYC 7763)
| Latency | Capacity | |
|---|---|---|
| Register | ~0.4 ns | 1 KB |
| L1 Cache | ~1 ns | 32 KB |
| L2 Cache | a few ns | 512 KB |
| L3 Cache | ~10 ns | 32 MB |
| Main Memory | 10s of ns | 256 GB |
| SSD | 100s of µs | 200 TB |
Based on a typical Levante node (AMD EPYC 7763)
| Latency | Capacity | |
|---|---|---|
| Register | ~0.4 ns | 1 KB |
| L1 Cache | ~1 ns | 32 KB |
| L2 Cache | a few ns | 512 KB |
| L3 Cache | ~10 ns | 32 MB |
| Main Memory | 10s of ns | 256 GB |
| SSD | 100s of µs | 200 TB |
| Hard disk | a few ms | 130 PB |
Code for program contained in “Computer Systems”:
stride between array elements to control spatial localitysize of array to control temporal locality
\(\approx\) Factor 20 between best and worst access
| Latency | Capacity | |
|---|---|---|
| Register | ~1 ns | 4 x 64 KB |
| L1 Cache | a few ns | 192 KB |
| L2 Cache (shared) | ~10 ns | 40MB |
| Main Memory (HBM2e) | 10s of ns | 80 GB |
Gap between processor and memory speeds. Hierarchy needed because of discrepancy between speed of CPU and (main) memory
exploit accessing data and code stored close to each other (temporal and spatial locality)