Benchmarking
Covers ML benchmarking with MLPerf, profiling tools, roofline analysis, and honest performance measurement methodology.
- Explain the MLPerf benchmark suite and interpret training and inference benchmark results
- Apply profiling tools to identify performance bottlenecks across the ML stack
- Implement roofline analysis to classify operations as compute-bound or memory-bound
- Design rigorous benchmark experiments with proper warmup, statistical analysis, and controlled environments
- Evaluate the limitations of standard benchmarks and when custom benchmarks are necessary
01 MLPerf and Standard Benchmarks
MLPerf is the industry-standard benchmark suite for measuring ML system performance. Developed by a consortium of industry and academic organizations, MLPerf provides standardized workloads, datasets, and measurement rules that enable fair comparison across different hardware platforms, frameworks, and system configurations.
MLPerf
An industry-standard benchmark suite maintained by the MLCommons consortium that provides standardized workloads, datasets, and measurement rules for comparing ML training and inference performance across hardware platforms and software frameworks.
Figure: ML Hardware Benchmark Comparison
Training and Inference Benchmark Suites
MLPerf includes separate benchmark suites for training and inference. The training benchmarks measure time-to-train for workloads spanning image classification (ResNet-50), object detection (SSD, RetinaNet), natural language processing (BERT), recommendation (DLRM), and reinforcement learning. The inference benchmarks measure throughput and latency across similar workloads under different scenarios.
| Benchmark Suite | What It Measures | Key Workloads | Primary Metric |
|---|---|---|---|
| MLPerf Training | Time to reach target quality | ResNet-50, BERT, DLRM, RetinaNet | Time-to-train (minutes) |
| MLPerf Inference | Throughput and latency | Same model architectures | Queries/sec, latency percentiles |
| DAWNBench | Training cost efficiency | Image classification, QA | Dollar cost to target accuracy |
| DeepBench | Individual operation speed | GEMM, convolution, RNN | TFLOPS, time per operation |
Table 12.1: Major ML benchmark suites and their focus areas.
MLPerf Inference defines four scenarios: Single-Stream (one query at a time, measuring latency), Multi-Stream (multiple inputs per query, for multi-camera systems), Server (variable arrival rate, measuring tail latency), and Offline (all data available at once, measuring throughput). Each scenario reflects a distinct real-world deployment pattern.
Standardized benchmarks serve a critical role in the ML ecosystem by providing objective comparison points. However, they also have limitations. Benchmark workloads may not represent actual production workloads. Optimizations specific to benchmark rules may not generalize. And the competitive pressure to achieve top benchmark results can distort engineering priorities.
When a measure becomes a target, it ceases to be a good measure. MLPerf scores drive hardware purchasing decisions worth millions of dollars, creating intense pressure to optimize specifically for benchmark workloads. Always verify that benchmark improvements translate to your actual production workloads before making investment decisions.
Deeper Insight: Benchmarks shape the industry. MLPerf results directly influence hardware purchase decisions worth billions of dollars annually. When a chip vendor posts a record-breaking MLPerf score, it can shift procurement decisions at hyperscalers and enterprises worldwide. The benchmarks don't just measure performance; they define what "good performance" means, steering R&D investment and architectural choices across the entire ML hardware ecosystem.
Domain-Specific Benchmarks
Beyond MLPerf, domain-specific benchmarks serve particular needs. DAWNBench focuses on training cost efficiency. DeepBench isolates individual operation performance. Application-specific benchmarks in areas like autonomous driving, medical imaging, and natural language understanding provide more relevant metrics for practitioners in those domains.
Start with MLPerf for hardware comparison and vendor evaluation. Use domain-specific benchmarks when evaluating solutions for a particular application. Better still, build an internal benchmark that mirrors your actual production workload; this provides the most actionable performance data.
- MLPerf — Industry-standard suite for cross-platform comparison of training and inference.
- DAWNBench — Focuses on cost-to-accuracy, measuring dollar efficiency of training.
- DeepBench — Micro-benchmarks isolating individual DNN operations (GEMM, convolution).
- Application benchmarks — Domain-specific workloads for autonomous driving, NLP, medical imaging.
MLPerf
An industry-standard benchmark suite that provides standardized workloads and measurement rules for comparing ML system training and inference performance.
Benchmark Suite
A collection of standardized workloads and measurement methodologies designed to evaluate and compare system performance across implementations.
02 Profiling Tools and Techniques
Profiling is the systematic measurement of where time and resources are spent during model training or inference. Without profiling, optimization is guesswork. Profiling tools reveal the actual bottlenecks, which are often surprising and not where engineers initially expect them.
Profiling
The systematic measurement of resource utilization — including time, memory, compute, and bandwidth — across all components of an ML system to identify performance bottlenecks and guide optimization efforts.
Engineers frequently spend days optimizing the wrong part of the system. A common pattern is spending effort on GPU kernel optimization when the real bottleneck is data loading on CPU. Always profile first to identify the actual bottleneck. Five minutes of profiling can save weeks of misdirected optimization.
Hardware-Level Profilers
GPU profilers like NVIDIA Nsight Systems and Nsight Compute provide detailed visibility into GPU utilization, memory bandwidth usage, kernel execution times, and communication overhead. Nsight Systems gives a timeline view of the entire application, showing how CPU, GPU, and network activities overlap. Nsight Compute provides deep analysis of individual kernel performance.
| Profiling Tool | Level | What It Shows | Best For |
|---|---|---|---|
| Nsight Systems | Hardware (system-wide) | CPU/GPU timeline, kernel launches, memory copies | Finding pipeline stalls and overlap issues |
| Nsight Compute | Hardware (kernel-level) | SM utilization, memory throughput, instruction mix | Optimizing individual CUDA kernels |
| PyTorch Profiler | Framework | Op-level timing, memory allocation, data loading | General training bottleneck identification |
| TensorBoard Profiler | Framework | Op timing, device placement, input pipeline | TensorFlow/JAX training optimization |
| Intel VTune | Hardware (CPU) | CPU utilization, cache misses, threading | CPU-bound preprocessing bottlenecks |
Table 12.2: Profiling tools at different abstraction levels.
Framework-Level Profilers
Framework-level profilers like PyTorch Profiler and TensorBoard Profiler operate at a higher level of abstraction, showing time spent in specific operations, memory allocation patterns, and data loading bottlenecks. These tools are easier to use than hardware-level profilers and are sufficient for identifying most common performance issues.
```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(model, batch)
        prof.step()
```
Systematic Profiling Methodology
Effective profiling requires a systematic methodology. Start with a high-level timeline to identify whether the bottleneck is in data loading, forward pass, backward pass, or communication. Then drill down into the bottleneck using more detailed tools. Measure under realistic conditions because bottlenecks shift with configuration.
- Capture a high-level timeline (Nsight Systems or framework profiler) to see the big picture.
- Identify the dominant bottleneck: data loading, forward pass, backward pass, or communication.
- Drill down with a detailed profiler (Nsight Compute for GPU kernels, VTune for CPU).
- Optimize the identified bottleneck and re-profile to confirm improvement.
- Repeat — fixing one bottleneck often reveals the next one.
Bottlenecks shift dramatically with configuration. A model may be compute-bound at batch size 256 but memory-bound at batch size 8. Always profile with production-realistic batch sizes, data pipelines active, and distributed training enabled if applicable.
Profiling
The systematic measurement of resource utilization (time, memory, bandwidth) across system components to identify performance bottlenecks.
GPU Utilization
The percentage of time GPU compute units are actively executing operations, a key indicator of whether computation is efficiently utilizing hardware.
03 Roofline Analysis
The roofline model is a visual performance analysis framework that determines whether a workload is compute-bound or memory-bound. It plots achievable performance (FLOP/s) as a function of operational intensity (FLOPs per byte of memory accessed), with the hardware's peak compute and peak memory bandwidth forming the "roofline."
Roofline Model
A visual performance model that plots achievable computation throughput (FLOP/s) against operational intensity (FLOPs/byte). The "roofline" is formed by the hardware's peak compute rate (flat ceiling) and peak memory bandwidth (sloped ceiling), revealing whether a workload is compute-bound or memory-bound.
P = \min\left(\pi, \beta \cdot I\right)

where P is the attainable performance (FLOP/s), \pi is the hardware's peak compute rate (FLOP/s), \beta is the peak memory bandwidth (bytes/s), and I is the operational intensity (FLOP/byte).

Memory-Bound Workloads
A workload that falls on the sloped portion of the roofline is memory-bound: performance is limited by memory bandwidth rather than compute capability. Optimization should focus on reducing memory accesses through techniques like operator fusion, data layout optimization, or increased computation reuse.
For memory-bound operations, the key is to increase operational intensity — do more computation per byte loaded from memory. Operator fusion (combining multiple element-wise operations into a single kernel) is the most effective technique, as it eliminates intermediate memory reads and writes.
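To see why fusion raises operational intensity, a back-of-the-envelope traffic model helps. This is a sketch under a simplified assumption: each unfused element-wise op reads the full tensor from DRAM and writes it back, while a fused kernel reads the input once and writes the final output once.

```python
def elementwise_traffic(n_elems, n_ops, bytes_per_elem=2, fused=False):
    """Model DRAM bytes moved by a chain of n_ops element-wise ops
    on a tensor of n_elems elements (bytes_per_elem=2 assumes FP16).

    Unfused: every op reads and writes the whole tensor.
    Fused: one read of the input plus one write of the final output.
    """
    if fused:
        return 2 * n_elems * bytes_per_elem
    return 2 * n_elems * bytes_per_elem * n_ops
```

Under this model, fusing a chain of k element-wise ops cuts memory traffic by a factor of k, which is exactly why fused kernels move memory-bound ops up the roofline.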
Compute-Bound Workloads
A workload that falls on the flat portion of the roofline is compute-bound: performance is limited by the hardware's peak computation rate. For compute-bound workloads, optimization should focus on using more efficient operations, lower precision formats, or hardware with higher peak throughput.
| Operation Type | Typical Binding | Operational Intensity | Optimization Strategy |
|---|---|---|---|
| Large GEMM / Convolution | Compute-bound | High (100+ FLOP/byte) | Lower precision (FP16/INT8), Tensor Cores |
| Attention (large seq length) | Compute-bound | High | FlashAttention, mixed precision |
| Element-wise (ReLU, add) | Memory-bound | Low (1-2 FLOP/byte) | Operator fusion, in-place operations |
| Batch Normalization | Memory-bound | Low | Fuse with adjacent convolution |
| Softmax | Memory-bound | Low | Fused attention kernels |
| Small FC layers | Memory-bound | Low-Medium | Batching, operator fusion |
Table 12.3: Common neural network operations and their typical roofline classification.
In practice, most neural network operations fall into one of two categories. Large matrix multiplications (convolutions, attention, linear layers with large batch sizes) tend to be compute-bound. Element-wise operations, batch normalization, and small matrix operations tend to be memory-bound. Understanding this distinction guides optimization strategy for each part of the model.
Consider an NVIDIA A100 GPU with 312 TFLOP/s FP16 peak compute and 2 TB/s memory bandwidth. The ridge point (where sloped and flat regions meet) occurs at 312/2 = 156 FLOP/byte. Operations with intensity below 156 FLOP/byte are memory-bound; above 156 are compute-bound. A large GEMM at ~200 FLOP/byte is compute-bound and should use Tensor Cores. A ReLU at ~1 FLOP/byte is deeply memory-bound and should be fused with adjacent operations.
Operational intensity (also called arithmetic intensity) is the ratio of floating-point operations to bytes transferred from memory. It is the single most important metric for determining whether an operation will benefit more from faster compute or faster memory. Calculate it as: total FLOPs in the operation divided by total bytes read and written.
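The calculation can be sketched for a GEMM C = A·B, where FLOPs = 2mnk and the bytes are the three matrices each read or written once. The default peak figures below are the A100 numbers quoted above; treat them as assumptions to replace with your own hardware's specifications.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Operational intensity (FLOP/byte) of a GEMM C = A @ B.

    FLOPs: 2*m*n*k (each multiply-accumulate counts as 2 FLOPs).
    Bytes: read A (m*k) and B (k*n), write C (m*n), once each.
    bytes_per_elem=2 assumes FP16.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def classify(intensity, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Classify an operation against the roofline ridge point.

    Ridge point = peak compute / bandwidth = 156 FLOP/byte for the
    A100 figures used in the text.
    """
    ridge = peak_tflops / bandwidth_tbs
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

For a 4096-cubed FP16 GEMM this gives roughly 1365 FLOP/byte, well into compute-bound territory, while an element-wise op at ~1 FLOP/byte classifies as memory-bound.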
Roofline Model
A performance analysis framework that visualizes the relationship between computational throughput and memory bandwidth to identify whether workloads are compute-bound or memory-bound.
Compute-Bound
A workload whose performance is limited by the hardware's computational throughput rather than memory bandwidth.
04 Performance Measurement Methodology
Accurate performance measurement requires careful methodology to avoid common pitfalls. Warmup runs must precede measurement to ensure caches are populated, JIT compilation is complete, and GPU clock rates have stabilized. Without warmup, initial runs will be misleadingly slow.
Warmup Runs
Initial execution iterations performed before taking measurements, allowing caches to populate, JIT compilers to optimize hot paths, and GPU clock rates to boost to steady-state frequencies. Skipping warmup leads to misleadingly slow initial measurements.
GPU clock rates can take several seconds to ramp from idle to boost frequency. PyTorch's JIT (TorchScript, torch.compile) compiles on the first invocation. CUDA contexts must be initialized. Always discard the first 3-5 iterations before recording measurements. A single "cold" run can be 10x slower than steady-state performance.
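A minimal timing harness implementing this warmup-then-measure pattern might look like the following framework-agnostic sketch; for GPU work you would also synchronize the device (e.g. torch.cuda.synchronize()) before each timestamp so queued kernels are included.

```python
import time

def benchmark(fn, warmup=5, iters=30):
    """Time fn() per call, discarding warmup iterations first.

    Warmup lets caches populate, JIT compilation finish, and clocks
    ramp to steady state; only the subsequent iters are recorded.
    """
    for _ in range(warmup):  # discard cold runs
        fn()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times
```

The harness returns raw per-iteration times rather than a single average, so statistics (median, percentiles, confidence intervals) can be computed afterwards.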
Statistical Rigor
Statistical rigor is essential for meaningful performance comparisons. Report confidence intervals, not just averages. Use sufficient repetitions to establish statistical significance. Measure with and without competing system load to understand sensitivity. Small differences (under 5%) between configurations may not be meaningful given measurement noise.
CI = \bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}

Run at least 10-30 repetitions after warmup. Compute the coefficient of variation (CV = std/mean). If CV > 5%, investigate sources of variance (thermal throttling, competing processes, NUMA effects) before drawing conclusions. For publication-quality benchmarks, use at least 50 repetitions and report the median with the interquartile range.
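The confidence-interval formula and the CV rule can be sketched with only the standard library. The hardcoded critical value is an assumption (the two-sided 95% t value for n = 30, df = 29); look up the value for your actual sample size.

```python
import statistics

def confidence_interval(samples, t_crit=2.045):
    """95% CI for the mean: x_bar +/- t * s / sqrt(n).

    t_crit=2.045 assumes n=30 (df=29); replace for other sample sizes.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    half = t_crit * statistics.stdev(samples) / n ** 0.5
    return mean - half, mean + half

def coefficient_of_variation(samples):
    """CV = sample std / mean; above ~0.05, investigate variance sources."""
    return statistics.stdev(samples) / statistics.mean(samples)
```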
Why is reporting p99 latency more important than average latency for production ML systems?
Controlling the Environment
Benchmarking environments must be carefully controlled. GPU clock rates, power limits, driver versions, framework versions, and competing processes all affect measurements. Documenting and controlling these factors enables reproducible benchmarks.
- Lock GPU clock rates using nvidia-smi to prevent frequency scaling during measurement.
- Record driver version, CUDA version, cuDNN version, and framework version.
- Disable GPU boost or set a fixed power limit for consistent thermal behavior.
- Use process isolation (taskset, numactl) to pin CPU cores and avoid NUMA effects.
- Use containerized environments (Docker, Singularity) to standardize the software stack.
- Report hardware SKU, memory configuration, and interconnect topology.
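Part of the checklist above can be snapshotted automatically and stored alongside results. This sketch records Python and platform details; the nvidia-smi query is an assumption that only succeeds on machines with NVIDIA drivers installed, so it degrades gracefully elsewhere.

```python
import platform
import subprocess
import sys

def capture_environment():
    """Snapshot software/hardware context to publish with benchmark data."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        # Query GPU name and driver version; fails cleanly without a GPU.
        env["gpu"] = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        env["gpu"] = "unavailable"
    return env
```

Serializing this dictionary next to the raw measurements makes results reproducible and auditable later.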
Common Measurement Mistakes
Common measurement mistakes include cherry-picking the best run instead of reporting statistics, measuring artificial workloads that do not represent production use, ignoring data loading and preprocessing overhead, and comparing across different batch sizes or precision without normalization.
A team reported a 30% speedup from a new kernel implementation. On closer inspection, they had compared a single best-case run of the new kernel against the average of the old kernel. They had also disabled the data pipeline and measured only the compute portion. When measured end-to-end with proper statistics, the actual improvement was 4% — within measurement noise. Honest benchmarking requires transparent methodology.
Warmup Runs
Initial execution runs performed before measurement to populate caches, complete JIT compilation, and stabilize hardware clock rates.
Statistical Significance
The confidence that observed performance differences are real rather than due to measurement noise, typically established through repeated trials and confidence intervals.
05 Benchmarking Best Practices
Effective benchmarking starts with clearly defining what you want to measure and why. Are you comparing hardware platforms? Evaluating optimization techniques? Selecting between model architectures? The measurement approach should be tailored to answer the specific question at hand.
End-to-End vs. Micro-Benchmarks
End-to-end benchmarking captures the true system performance by measuring the complete pipeline from data loading through inference and postprocessing. Micro-benchmarks isolate individual operations or components for detailed analysis. Both perspectives are valuable, but end-to-end measurements are what ultimately matter for production performance.
End-to-End Benchmark
A performance measurement that captures the complete system pipeline from data ingestion through model inference and postprocessing, reflecting the true production performance experienced by end users.
| Benchmark Type | Scope | Strengths | Limitations |
|---|---|---|---|
| End-to-End | Full pipeline | Reflects real performance, captures interactions | Hard to diagnose root cause of issues |
| Micro-Benchmark | Single operation | Easy to isolate and optimize | May miss pipeline-level bottlenecks |
| Component Benchmark | Subsystem (e.g., data loader) | Good balance of scope and detail | May miss cross-component interactions |
Table 12.4: Benchmarking at different scopes.
A system composed of individually fast components can still be slow overall. Pipeline stalls, memory contention, and synchronization overhead between components can dominate total latency. This is why end-to-end benchmarks are essential — they capture the interactions that micro-benchmarks miss.
Reproducibility
Reproducibility is a key quality indicator for benchmarks. Others should be able to independently verify your results given the same hardware, software, and configuration. Publishing benchmark code, configurations, and raw measurement data (not just summary statistics) enables community verification and builds trust in the results.
For every benchmark, publish: (1) the exact code or script used, (2) hardware configuration including GPU model, memory, and interconnect, (3) software versions for driver, CUDA, framework, and dependencies, (4) raw measurement data — not just summary statistics, (5) environment details including container image or Dockerfile.
Continuous Benchmarking in CI/CD
Benchmarking should be integrated into the CI/CD pipeline as a continuous practice, not a one-time event. Performance regression detection catches optimizations that inadvertently slow other parts of the system. Trend tracking over time reveals the compounding effect of many small improvements or regressions.
```yaml
# Example: GitHub Actions benchmark workflow
name: Performance Regression Check
on: [pull_request]
jobs:
  benchmark:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmark suite
        run: python benchmarks/run_all.py --output results.json
      - name: Compare with baseline
        run: python benchmarks/compare.py --baseline main --current results.json --threshold 5
```
Example: Automated performance regression detection in CI/CD.
Set regression thresholds carefully. Too tight (e.g., 1%) triggers frequent false alarms from measurement noise. Too loose (e.g., 20%) misses real regressions. A 5% threshold with statistical significance testing is a reasonable starting point. Adjust based on your measurement variance.
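The threshold comparison itself can be sketched as follows. This is a hypothetical stand-in for a compare script like the one in the workflow above; a production gate would add the statistical significance test the text recommends.

```python
import statistics

def regression_check(baseline, current, threshold_pct=5.0):
    """Flag a regression when mean latency grows by more than threshold_pct.

    baseline, current: lists of per-iteration latencies (post-warmup).
    Returns the percent change and whether it exceeds the threshold.
    """
    base_mean = statistics.mean(baseline)
    cur_mean = statistics.mean(current)
    change_pct = 100.0 * (cur_mean - base_mean) / base_mean
    return {"change_pct": change_pct,
            "regression": change_pct > threshold_pct}
```

A 3% slowdown passes the default 5% gate, while a 10% slowdown fails it, matching the noise-vs-sensitivity tradeoff described above.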
End-to-End Benchmark
A benchmark that measures the complete system pipeline performance, including data loading, computation, and postprocessing.
Performance Regression
An unintended decrease in system performance caused by a code or configuration change, detected through continuous benchmarking.
Key Takeaways
1. MLPerf provides standardized benchmarks for fair comparison, but production workloads may differ significantly from benchmark workloads.
2. Profiling is essential before optimization; without measurement, optimization is guesswork.
3. The roofline model reveals whether each operation is compute-bound or memory-bound, directly guiding optimization strategy.
4. Accurate benchmarking requires warmup runs, statistical rigor, controlled environments, and transparent reporting.
5. Continuous benchmarking in CI/CD detects performance regressions and tracks optimization progress over time.