



## Performance Tuning of Scientific Codes with the Roofline Model

1:30pm

1:35pm

2:10pm

2:40pm

3:00pm

3:30pm

3:45pm

4:30pm

4:55pm

Introductions / Administration

Roofline Introduction

CARM / Energy / GPUs

Intel Advisor Installation

coffee break

Introduction to Intel Advisor

Hands-on with Intel Advisor

**HPC Application Studies** 

closing remarks / Q&A

all

Samuel Williams

Aleksandar Ilic

Zakhar Matveev

Zakhar Matveev

all

Charlene Yang

all







## Introductions

#### **Samuel Williams**

**Computational Research Division Lawrence Berkeley National Lab** 

SWWilliams@lbl.gov

#### Aleksandar Ilic

Institute of Systems and Computer Engineering, Portugal

aleksandar.ilic@inesc-id.pt

#### **Charlene Yang**

NERSC Lawrence Berkeley National Lab

CJYang@lbl.gov

#### **Zakhar Matveev**

**Intel Corporation** 

zakhar.a.matveev@intel.com

#### Kiril Rogozhin

**Intel Corporation** 

kirill.rogozhin@intel.com



# Materials: USB / Downloads





#### more Roofline at SC'18...

P3HPC Workshop Friday 8:30am D174 "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", Yang, Gayatri, Kurth, Basu, Ronaghi, Adetokunbo, Friesen, Cook, Doerfler, Oliker, Deslippe, Williams





### Don't forget to take the Survey...

http://bit.ly/sc18-eval







# Introduction to the Roofline Model

#### Samuel Williams

**Computational Research Division Lawrence Berkeley National Lab** 

SWWilliams@lbl.gov



## Acknowledgements

- This material is based upon work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under Award Number DE-AC02-05CH11231.
- This material is based upon work supported by the DOE RAPIDS SciDAC Institute.
- This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-05CH11231.





## Background

#### Why Use Performance Models or Tools?

- Identify performance bottlenecks
- Motivate software optimizations
- Determine when we're done optimizing
  - Assess performance relative to machine capabilities
  - Motivate need for algorithmic changes
- Predict performance on future machines / architectures
  - Sets realistic expectations on performance for future procurements
  - Used for HW/SW Co-Design to ensure future architectures are well-suited for the computational needs of today's applications.



- Many different components can contribute to kernel run time.
- Some are application-specific, and some architecture-specific.

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send: Wait ratio Network Gap

#MPI Wait's Network Latency



Can't think about all these terms all the time for every application...

```
Computational
  Complexity #FP operations Flop/s
        Cache data movement Cache GB/s
        DRAM data movement DRAM GB/s
          PCle data movement PCle bandwidth
                       Depth OMP Overhead
            MPI Message Size Network Bandwidth
           MPI Send: Wait ratio Network Gap
                 #MPI Wait's Network Latency
```



 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send:Wait ratio Network Gap

#MPI Wait's Network Latency



 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s

Cache data movement Cache GB/s

DRAM data movement DRAM GB/s

PCIe data movement PCIe bandwidth

Depth OMP Overhead

MPI Message Size Network Bandwidth

MPI Send: Wait ratio Network Gap

#MPI Wait's Network Latency



LogGP

 Because there are so many components, performance models often conceptualize the system as being dominated by one or more of these components.

#FP operations Flop/s
Cache data movement Cache GB/s
DRAM data movement DRAM GB/s
PCIe data movement PCIe bandwidth
Depth OMP Overhead
MPI Message Size Network Bandwidth
MPI Send:Wait ratio Network Gap
#MPI Wait's Network Latency





## Roofline Model:

**Arithmetic Intensity and Bandwidth** 

#### Performance Models / Simulators

- Historically, many performance models and simulators tracked time to predict performance (i.e. counting cycles)
- The last two decades saw a number of latency-hiding techniques...
  - Out-of-order execution (hardware discovers parallelism to hide latency)
  - HW stream prefetching (hardware speculatively loads data)
  - Massive thread parallelism (independent threads satisfy the latency-bandwidth product)
- resulted in a shift from a latency-limited computing regime to a throughput-limited computing regime



#### **Roofline Model**

- Roofline Model is a throughputoriented performance model...
  - Tracks <u>rates</u> not times
  - Augmented with Little's Law (concurrency = latency\*bandwidth)
  - Independent of ISA and architecture (applies to CPUs, GPUs, Google TPUs<sup>1</sup>, etc...)



https://crd.lbl.gov/departments/computer-science/PAR/research/roofline



- One could hope to always attain peak performance (Flop/s)
- However, finite reuse and bandwidth limit performance.
- Assuming perfect overlap of communication and computation...

```
Time = max #FP ops / Peak GFlop/s
#Bytes / Peak GB/s
```





- One could hope to always attain peak performance (Flop/s)
- However, finite reuse and bandwidth limit performance.
- Assuming perfect overlap of communication and computation...



(compute, flop/s)



- One could hope to always attain peak performance (Flop/s)
- However, finite reuse and bandwidth limit performance.
- Assuming perfect overlap of communication and computation...





- One could hope to always attain peak performance (Flop/s)
- However, finite reuse and bandwidth limit performance.
- Assuming perfect overlap of communication and computation...



Note, Arithmetic Intensity (AI) = Flops / Bytes (as presented to DRAM)





- Plot Roofline bound using
   Arithmetic Intensity as the x-axis
- Log-log scale makes it easy to doodle, extrapolate performance along Moore's Law, etc...
- Kernels with Al less than machine balance are ultimately DRAM bound (we'll refine this later...)





#### Roofline Example #1

- Typical machine balance is 5-10 flops per byte...
  - 40-80 flops per double to exploit compute capability
  - Artifact of technology and money
  - Unlikely to improve
- Consider STREAM Triad...

```
#pragma omp parallel for
for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
}</pre>
```

- 2 flops per iteration
- Transfer 24 bytes per iteration (read X[i], Y[i], write Z[i])
- AI = 0.083 flops per byte == Memory bound





#### Roofline Example #2

- Conversely, 7-point constant coefficient stencil...
  - 7 flops
  - 8 memory references (7 reads, 1 store) per point
  - Cache can filter all but 1 read and 1 write per point
  - Al = 0.44 flops per byte == memory bound,
     but 5x the flop rate





- Imagine a mix of loop nests
- Flop/s alone may not be useful in deciding which to optimize first





We can sort kernels by Al ...





- We can sort kernels by Al ...
- ... and compare performance relative to machine capabilities





- Kernels near the roofline are making good use of computational resources
  - kernels can have low performance (Gflop/s), but make good use of a machine
  - kernels can have high performance (Gflop/s), but make poor use of a machine







# Refining Roofline: Memory Hierarchy & DLP

- Processors have multiple levels of memory/cache
  - Registers
  - L1, L2, L3 cache
  - MCDRAM/HBM (KNL/GPU device memory)
  - DDR (main memory)
  - NVRAM (non-volatile memory)
- Applications have locality in each level
  - Unique data movements imply unique Al's
  - Moreover, each level will have a unique bandwidth



- Construct superposition of Rooflines...
  - Measure bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





- Construct superposition of Rooflines...
  - Measure bandwidth
  - Measure AI for each level of memory
  - Although an loop nest may have multiple Al's and multiple bounds (flops, L1, L2, ... DRAM)...
  - ... performance is bound by the minimum





#### Data, Instruction, Thread-Level Parallelism...

- We have assumed one can attain peak flops with high locality.
- In reality, we must ....
  - Use special instructions (e.g. FMA)
  - Vectorize loops (16 flops per instruction)
  - Hide FPU latency (unrolling, out-of-order execution)
  - Use all cores & sockets
- Without these, ...
  - Peak performance is not attainable
  - Some kernels can transition from memory-bound to compute-bound





#### Data, Instruction, Thread-Level Parallelism...

- We have assumed one can attain peak flops with high locality.
- In reality, we must ....
  - Use special instructions (e.g. FMA)
  - Vectorize loops (16 flops per instruction)
  - Hide FPU latency (unrolling, out-of-order execution)
  - Use all cores & sockets
- Without these, ...
  - Peak performance is not attainable
  - Some kernels can transition from memory-bound to compute-bound









### Roofline Model:

Roofline-driven Performance Optimization

 Broadly speaking, there are three approaches to improving performance:





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware, unit-stride)





- Broadly speaking, there are three approaches to improving performance:
- Maximize in-core performance (e.g. get compiler to vectorize)
- Maximize memory bandwidth (e.g. NUMA-aware, unit stride)
- Minimize data movement (e.g. cache blocking)









# Roofline In Practice: Evolution at LBL / NERSC



### Step 1: Machine Characterization

#### **Machine Characterization**

- "Theoretical Performance" numbers can be highly optimistic...
  - Pin BW vs. sustained bandwidth
  - TurboMode / Underclock for AVX
  - compiler failings on high-Al loops.
- LBL developed the Empirical Roofline Toolkit (ERT)...
  - Characterize CPU/GPU systems
  - Peak Flop rates
  - Bandwidths for each level of memory
  - MPI+OpenMP/CUDA == multiple GPUs







# Step 2: Application Characterization

#### **Measuring Al**

- To characterize execution with Roofline we need...
  - Time
  - Flops (=> flop's / time)
  - Data movement between each level of memory (=> Flop's / GB's)
- We can look at the full application...
  - Coarse grained, 30-min average
  - Misses many details and bottlenecks
- or we can look at individual loop nests...
  - Requires auto-instrumentation on a loop by loop basis
  - o Moreover, we should probably differentiate data movement or flops on a core-by-core basis.



#### **How Do We Count Flop's?**

#### **Manual Counting**

- Go thru each loop nest and count the number of FP operations
- ✓ Works best for deterministic loop bounds
- ✓ or parameterize by the number of iterations (recorded at run time)
- X Not scalable

#### **Perf. Counters**

- Read counter before/after
- ✓ More Accurate
- ✓ Low overhead (<%) == can run full MPI applications
- ✓ Can detect load imbalance
- X Requires privileged access
- X Requires manual instrumentation (+overhead) or full-app characterization
- **X** Broken counters = garbage
- X May not differentiate FMADD from FADD
- X No insight into special pipelines 48

#### **Binary Instrumentation**

- Automated inspection of assembly at run time
- ✓ Most Accurate
- ✓ FMA-, VL-, and mask-aware
- ✓ Can count instructions by class/type
- ✓ Can detect load imbalance
- ✓ Can include effects from non-FP instructions
- ✓ Automated application to multiple loop nests
- X >10x overhead (short runs / reduced concurrency)

#### **How Do We Measure Data Movement?**

#### **Manual Counting**

- Go thru each loop nest and estimate how many bytes will be moved
- Use a mental model of caches
- ✓ Works best for simple loops that stream from DRAM (stencils, FFTs, spare, ...)
- **X** N/A for complex caches
- X Not scalable

#### Perf. Counters

- Read counter before/after
- ✓ Applies to full hierarchy (L2, DRAM,
- ✓ Much more Accurate
- ✓ Low overhead (<%) == can run full MPI applications
- ✓ Can detect load imbalance
- X Requires privileged access
- X Requires manual instrumentation (+overhead) or full-app characterization

#### **Cache Simulation**

- Build a full cache simulator driven by memory addresses
- Applies to full hierarchy and multicore
- ✓ Can detect load imbalance
- ✓ Automated application to multiple loop nests
- **X** Ignores prefetchers
- X >10x overhead (short runs / reduced concurrency)



#### Previously Cobbled Together Tools...

- Use tools known/observed to work on NERSC's Cori (KNL, HSW)...
  - Used Intel SDE (Pin binary instrumentation + emulation) to create software Flop counters
  - Used Intel VTune performance tool (NERSC/Cray approved) to access uncore counters
- Accurate measurement of Flop's (HSW) and DRAM data movement (HSW and KNL)
- Used by NESAP (NERSC KNL application readiness project) to characterize apps on Cori...



http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/



#### **Intel Advisor**

#### Includes Roofline Automation...

- ✓ Automatically instruments applications (one dot per loop nest/function)
- ✓ Computes FLOPS and AI for each function (CARM)
- ✓ AVX-512 support that incorporates masks
- ✓ Integrated Cache Simulator¹ (hierarchical roofline / multiple Al's)
- Automatically benchmarks target system (calculates ceilings)
- ✓ Full integration with existing Advisor capabilities



http://www.nersc.gov/users/training/events/roofline-training-1182017-1192017







# Hierarchical Roofline vs. Cache-Aware Roofline

...understanding different Roofline formulations in Intel Advisor

#### There are two Major Roofline Formulations:

#### Hierarchical Roofline (original Roofline w/ DRAM, L3, L2, ...)...

- Williams, et al, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", CACM, 2009
- Chapter 4 of "Auto-tuning Performance on Multicore Computers", 2008
- Defines multiple bandwidth ceilings and multiple Al's per kernel
- Performance bound is the minimum of flops and the memory intercepts (superposition of original, single-metric Rooflines)

#### Cache-Aware Roofline

- Ilic et al, "Cache-aware Roofline model: Upgrading the loft", IEEE Computer Architecture Letters, 2014
- Defines multiple bandwidth ceilings, but uses a single AI (flop:L1 bytes)
- As one looses cache locality (capacity, conflict, ...) performance falls from one BW ceiling to a lower one at constant AI

#### Why Does this matter?

- Some tools use the Hierarchical Roofline, some use cache-aware == Users need to understand the differences
- Cache-Aware Roofline model was integrated into production Intel Advisor
- Evaluation version of Hierarchical Roofline<sup>1</sup> (cache simulator) has also been integrated into Intel Advisor



#### **Hierarchical Roofline**

- Captures cache effects
- Al is Flop:Bytes after being filtered by lower cache levels
- Multiple Arithmetic Intensities (one per level of memory)
- Al dependent on problem size (capacity misses reduce Al)
- Memory/Cache/Locality effects are observed as decreased AI
- Requires performance counters or cache simulator to correctly measure Al

#### **Cache-Aware Roofline**

- Captures cache effects
- Al is Flop:Bytes as presented to the L1 cache (plus non-temporal stores)
- Single Arithmetic Intensity
- Al independent of problem size

- Memory/Cache/Locality effects are observed as decreased performance
- Requires static analysis or binary instrumentation to measure Al



#### **Example: STREAM**

- L1 Al...
  - 2 flops
  - 2 x 8B load (old)
  - 1 x 8B store (new)
  - = 0.08 flops per byte
  - No cache reuse...
    - Iteration i doesn't touch any data associated with iteration i+delta for any delta.
- ... leads to a DRAM AI equal to the L1 AI

```
#pragma omp parallel for
for(i=0;i<N;i++){
    Z[i] = X[i] + alpha*Y[i];
}</pre>
```



#### **Example: STREAM**

#### **Hierarchical Roofline**

#### **Cache-Aware Roofline**







#### Example: 7-point Stencil (Small Problem)

#### L1 Al...

- 7 flops
- 7 x 8B load (old)
- 1 x 8B store (new)
- = 0.11 flops per byte
- some compilers may do register shuffles to reduce the number of loads.

#### Moderate cache reuse...

- old[k][j][i+1] is reused on next iteration of i.
- old[k][j+1][i] is reused on next iteration of j.
- old[k+1][j][i] is reused on next iterations of k.

#### ... leads to DRAM Al larger than the L1 Al



## Example: 7-point Stencil (Small Problem) Hierarchical Roofline Cache-Aware Roofline







## Example: 7-point Stencil (Small Problem) Hierarchical Roofline Cache-Aware Roofline







## Example: 7-point Stencil (Large Problem) Hierarchical Roofline Cache-Aware Roofline







## Example: 7-point Stencil (Observed Perf.) Hierarchical Roofline Cache-Aware Roofline







## Example: 7-point Stencil (Observed Perf.) Hierarchical Roofline Cache-Aware Roofline











### Questions?





### Don't forget to take the Survey...

http://bit.ly/sc18-eval







### Backup





# Refining Roofline: NUMA

#### **NUMA Effects**

- Cori's Haswell nodes are built from 2 Xeon processors (sockets)
  - Memory attached to each socket (fast)
  - Interconnect that allows remote memory access (slow == NUMA)
  - Improper memory allocation can result in more than a 2x performance penalty









# Refining Roofline: Instruction Issue Bandwidth

#### Superscalar vs. Instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance





#### Superscalar vs. Instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance





#### Superscalar vs. instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance





#### Superscalar vs. instruction mix

- Define in-core ceilings based on instruction mix...
- e.g. Haswell
  - 4-issue superscalar
  - Only 2 FP data paths
  - Requires 50% of the instructions to be FP to get peak performance
- e.g. KNL
  - 2-issue superscalar
  - 2 FP data paths
  - Requires 100% of the instructions to be FP to get peak performance









### Refining Roofline:

Compulsory, Capacity, and Conflict misses

 Naively, we can bound Al using only compulsory cache misses

$$AI = \frac{\#Flop's}{Compulsory Misses}$$





- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al

$$AI = \frac{\text{#Flop's}}{\text{Compulsory Misses + Write Allocates}}$$





- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty







- Naively, we can bound Al using only compulsory cache misses
- However, write allocate caches can lower Al
- Cache capacity misses can have a huge penalty
- Compute bound became memory bound

 $AI = \frac{\#Flop's}{Compulsory Misses + Write Allocates + Capacity Misses}$ 





### LIKWID:

**Performance Counters** 

#### **LIKWID**

- LIKWID provides easy to use wrappers for measuring performance counters...
  - ✓ Works on NERSC production systems
  - ✓ Distills counters into user-friendly metrics (e.g. MCDRAM Bandwidth)
  - ✓ Minimal overhead (<1%)
    </p>
  - ✓ Scalable in distributed memory (MPI-friendly)
  - ✓ Fast, high-level characterization
  - X No timing breakdowns
  - X Suffers from Garbage-in/Garbage Out(i.e. hardware counter must be sufficient and correct)

https://github.com/RRZE-HPC/likwid

http://www.nersc.gov/users/software/performance-and-debugging-tools/likwid

