Benchmarking
Covers ML benchmarking with MLPerf, profiling tools, roofline analysis, and honest performance measurement methodology.
- Explain the MLPerf benchmark suite and interpret training and inference benchmark results
- Apply profiling tools to identify performance bottlenecks across the ML stack
- Implement roofline analysis to classify operations as compute-bound or memory-bound
- Design rigorous benchmark experiments with proper warmup, statistical analysis, and controlled environments
- Evaluate the limitations of standard benchmarks and when custom benchmarks are necessary
01 MLPerf and Standard Benchmarks
MLPerf is the industry-standard benchmark suite for measuring ML system performance. Developed by a consortium of industry and academic organizations, MLPerf provides standardized workloads, datasets, and measurement rules that enable fair comparison across different hardware platforms, frameworks, and system configurations.
MLPerf
An industry-standard benchmark suite maintained by the MLCommons consortium that provides standardized workloads, datasets, and measurement rules for comparing ML training and inference performance across hardware platforms and software frameworks.
Figure: ML Hardware Benchmark Comparison
Training and Inference Benchmark Suites
MLPerf includes separate benchmark suites for training and inference. The training benchmarks measure time-to-train for workloads spanning image classification (ResNet-50), object detection (SSD, RetinaNet), natural language processing (BERT), recommendation (DLRM), and reinforcement learning. The inference benchmarks measure throughput and latency across similar workloads under different scenarios.
| Benchmark Suite | What It Measures | Key Workloads | Primary Metric |
|---|---|---|---|
| MLPerf Training | Time to reach target quality | ResNet-50, BERT, DLRM, RetinaNet | Time-to-train (minutes) |
| MLPerf Inference | Throughput and latency | Same model architectures | Queries/sec, latency percentiles |
| DAWNBench | Training cost efficiency | Image classification, QA | Dollar cost to target accuracy |
| DeepBench | Individual operation speed | GEMM, convolution, RNN | TFLOPS, time per operation |
Table 12.1: Major ML benchmark suites and their focus areas.
MLPerf Inference defines four scenarios: Single-Stream (one query at a time, measuring latency), Multi-Stream (multiple inputs per query, for multi-camera systems), Server (variable arrival rate, measuring tail latency), and Offline (all data available at once, measuring throughput). Each scenario reflects a distinct real-world deployment pattern.
Standardized benchmarks serve a critical role in the ML ecosystem by providing objective comparison points. However, they also have limitations. Benchmark workloads may not represent actual production workloads. Optimizations specific to benchmark rules may not generalize. And the competitive pressure to achieve top benchmark results can distort engineering priorities.
When a measure becomes a target, it ceases to be a good measure. MLPerf scores drive hardware purchasing decisions worth millions of dollars, creating intense pressure to optimize specifically for benchmark workloads. Always verify that benchmark improvements translate to your actual production workloads before making investment decisions.
Deeper Insight: Benchmarks shape the industry. MLPerf results directly influence hardware purchase decisions worth billions of dollars annually. When a chip vendor posts a record-breaking MLPerf score, it can shift procurement decisions at hyperscalers and enterprises worldwide. The benchmarks don't just measure performance; they define what "good performance" means, steering R&D investment and architectural choices across the entire ML hardware ecosystem.
Domain-Specific Benchmarks
Beyond MLPerf, domain-specific benchmarks serve particular needs. DAWNBench focuses on training cost efficiency. DeepBench isolates individual operation performance. Application-specific benchmarks in areas like autonomous driving, medical imaging, and natural language understanding provide more relevant metrics for practitioners in those domains.
Start with MLPerf for hardware comparison and vendor evaluation. Use domain-specific benchmarks when evaluating solutions for a particular application. Better still, build an internal benchmark that mirrors your actual production workload; this provides the most actionable performance data.
- MLPerf — Industry-standard suite for cross-platform comparison of training and inference.
- DAWNBench — Focuses on cost-to-accuracy, measuring dollar efficiency of training.
- DeepBench — Micro-benchmarks isolating individual DNN operations (GEMM, convolution).
- Application benchmarks — Domain-specific workloads for autonomous driving, NLP, medical imaging.
MLPerf
An industry-standard benchmark suite that provides standardized workloads and measurement rules for comparing ML system training and inference performance.
Benchmark Suite
A collection of standardized workloads and measurement methodologies designed to evaluate and compare system performance across implementations.
02 Profiling Tools and Techniques
Profiling is the systematic measurement of where time and resources are spent during model training or inference. Without profiling, optimization is guesswork. Profiling tools reveal the actual bottlenecks, which are often surprising and not where engineers initially expect them.
Profiling
The systematic measurement of resource utilization — including time, memory, compute, and bandwidth — across all components of an ML system to identify performance bottlenecks and guide optimization efforts.
Engineers frequently spend days optimizing the wrong part of the system. A common pattern is spending effort on GPU kernel optimization when the real bottleneck is data loading on CPU. Always profile first to identify the actual bottleneck. Five minutes of profiling can save weeks of misdirected optimization.
Hardware-Level Profilers
GPU profilers like NVIDIA Nsight Systems and Nsight Compute provide detailed visibility into GPU utilization, memory bandwidth usage, kernel execution times, and communication overhead. Nsight Systems gives a timeline view of the entire application, showing how CPU, GPU, and network activities overlap. Nsight Compute provides deep analysis of individual kernel performance.
| Profiling Tool | Level | What It Shows | Best For |
|---|---|---|---|
| Nsight Systems | Hardware (system-wide) | CPU/GPU timeline, kernel launches, memory copies | Finding pipeline stalls and overlap issues |
| Nsight Compute | Hardware (kernel-level) | SM utilization, memory throughput, instruction mix | Optimizing individual CUDA kernels |
| PyTorch Profiler | Framework | Op-level timing, memory allocation, data loading | General training bottleneck identification |
| TensorBoard Profiler | Framework | Op timing, device placement, input pipeline | TensorFlow/JAX training optimization |
| Intel VTune | Hardware (CPU) | CPU utilization, cache misses, threading | CPU-bound preprocessing bottlenecks |
Table 12.2: Profiling tools at different abstraction levels.
Framework-Level Profilers
Framework-level profilers like PyTorch Profiler and TensorBoard Profiler operate at a higher level of abstraction, showing time spent in specific operations, memory allocation patterns, and data loading bottlenecks. These tools are easier to use than hardware-level profilers and are sufficient for identifying most common performance issues.
```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(model, batch)
        prof.step()
```
Systematic Profiling Methodology
Effective profiling requires a systematic methodology. Start with a high-level timeline to identify whether the bottleneck is in data loading, forward pass, backward pass, or communication. Then drill down into the bottleneck using more detailed tools. Measure under realistic conditions because bottlenecks shift with configuration.
- Capture a high-level timeline (Nsight Systems or framework profiler) to see the big picture.
- Identify the dominant bottleneck: data loading, forward pass, backward pass, or communication.
- Drill down with a detailed profiler (Nsight Compute for GPU kernels, VTune for CPU).
- Optimize the identified bottleneck and re-profile to confirm improvement.
- Repeat — fixing one bottleneck often reveals the next one.
Bottlenecks shift dramatically with configuration. A model may be compute-bound at batch size 256 but memory-bound at batch size 8. Always profile with production-realistic batch sizes, data pipelines active, and distributed training enabled if applicable.
Profiling
The systematic measurement of resource utilization (time, memory, bandwidth) across system components to identify performance bottlenecks.
GPU Utilization
The percentage of time GPU compute units are actively executing operations, a key indicator of whether computation is efficiently utilizing hardware.
03 Roofline Analysis
The roofline model is a visual performance analysis framework that determines whether a workload is compute-bound or memory-bound. It plots achievable performance (FLOP/s) as a function of operational intensity (FLOPs per byte of memory accessed), with the hardware's peak compute and peak memory bandwidth forming the "roofline."
Roofline Model
A visual performance model that plots achievable computation throughput (FLOP/s) against operational intensity (FLOPs/byte). The "roofline" is formed by the hardware's peak compute rate (flat ceiling) and peak memory bandwidth (sloped ceiling), revealing whether a workload is compute-bound or memory-bound.
P = \min\left(\pi, \beta \cdot I\right)

where P is the attainable performance (FLOP/s), \pi is the hardware's peak compute rate (FLOP/s), \beta is the peak memory bandwidth (bytes/s), and I is the operational intensity (FLOP/byte).

Memory-Bound Workloads
A workload that falls on the sloped portion of the roofline is memory-bound: performance is limited by memory bandwidth rather than compute capability. Optimization should focus on reducing memory accesses through techniques like operator fusion, data layout optimization, or increased computation reuse.
For memory-bound operations, the key is to increase operational intensity — do more computation per byte loaded from memory. Operator fusion (combining multiple element-wise operations into a single kernel) is the most effective technique, as it eliminates intermediate memory reads and writes.
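To see why fusion raises operational intensity, a back-of-the-envelope traffic model helps. This is a sketch under a simplified assumption: each unfused element-wise op reads the full tensor from DRAM and writes it back, while a fused kernel reads the input once and writes the final output once.

```python
def elementwise_traffic(n_elems, n_ops, bytes_per_elem=2, fused=False):
    """Model DRAM bytes moved by a chain of n_ops element-wise ops
    on a tensor of n_elems elements (bytes_per_elem=2 assumes FP16).

    Unfused: every op reads and writes the whole tensor.
    Fused: one read of the input plus one write of the final output.
    """
    if fused:
        return 2 * n_elems * bytes_per_elem
    return 2 * n_elems * bytes_per_elem * n_ops
```

Under this model, fusing a chain of k element-wise ops cuts memory traffic by a factor of k, which is exactly why fused kernels move memory-bound ops up the roofline.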
Compute-Bound Workloads
A workload that falls on the flat portion of the roofline is compute-bound: performance is limited by the hardware's peak computation rate. For compute-bound workloads, optimization should focus on using more efficient operations, lower precision formats, or hardware with higher peak throughput.
| Operation Type | Typical Binding | Operational Intensity | Optimization Strategy |
|---|---|---|---|
| Large GEMM / Convolution | Compute-bound | High (100+ FLOP/byte) | Lower precision (FP16/INT8), Tensor Cores |
| Attention (large seq length) | Compute-bound | High | FlashAttention, mixed precision |
| Element-wise (ReLU, add) | Memory-bound | Low (1-2 FLOP/byte) | Operator fusion, in-place operations |
| Batch Normalization | Memory-bound | Low | Fuse with adjacent convolution |
| Softmax | Memory-bound | Low | Fused attention kernels |
| Small FC layers | Memory-bound | Low-Medium | Batching, operator fusion |
Table 12.3: Common neural network operations and their typical roofline classification.
In practice, most neural network operations fall into one of two categories. Large matrix multiplications (convolutions, attention, linear layers with large batch sizes) tend to be compute-bound. Element-wise operations, batch normalization, and small matrix operations tend to be memory-bound. Understanding this distinction guides optimization strategy for each part of the model.
Consider an NVIDIA A100 GPU with 312 TFLOP/s FP16 peak compute and 2 TB/s memory bandwidth. The ridge point (where sloped and flat regions meet) occurs at 312/2 = 156 FLOP/byte. Operations with intensity below 156 FLOP/byte are memory-bound; above 156 are compute-bound. A large GEMM at ~200 FLOP/byte is compute-bound and should use Tensor Cores. A ReLU at ~1 FLOP/byte is deeply memory-bound and should be fused with adjacent operations.
Operational intensity (also called arithmetic intensity) is the ratio of floating-point operations to bytes transferred from memory. It is the single most important metric for determining whether an operation will benefit more from faster compute or faster memory. Calculate it as: total FLOPs in the operation divided by total bytes read and written.
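The calculation can be sketched for a GEMM C = A·B, where FLOPs = 2mnk and the bytes are the three matrices each read or written once. The default peak figures below are the A100 numbers quoted above; treat them as assumptions to replace with your own hardware's specifications.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    """Operational intensity (FLOP/byte) of a GEMM C = A @ B.

    FLOPs: 2*m*n*k (each multiply-accumulate counts as 2 FLOPs).
    Bytes: read A (m*k) and B (k*n), write C (m*n), once each.
    bytes_per_elem=2 assumes FP16.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def classify(intensity, peak_tflops=312.0, bandwidth_tbs=2.0):
    """Classify an operation against the roofline ridge point.

    Ridge point = peak compute / bandwidth = 156 FLOP/byte for the
    A100 figures used in the text.
    """
    ridge = peak_tflops / bandwidth_tbs
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

For a 4096-cubed FP16 GEMM this gives roughly 1365 FLOP/byte, well into compute-bound territory, while an element-wise op at ~1 FLOP/byte classifies as memory-bound.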
Roofline Model
A performance analysis framework that visualizes the relationship between computational throughput and memory bandwidth to identify whether workloads are compute-bound or memory-bound.
Compute-Bound
A workload whose performance is limited by the hardware's computational throughput rather than memory bandwidth.
04 Performance Measurement Methodology
Accurate performance measurement requires careful methodology to avoid common pitfalls. Warmup runs must precede measurement to ensure caches are populated, JIT compilation is complete, and GPU clock rates have stabilized. Without warmup, initial runs will be misleadingly slow.
Warmup Runs
Initial execution iterations performed before taking measurements, allowing caches to populate, JIT compilers to optimize hot paths, and GPU clock rates to boost to steady-state frequencies. Skipping warmup leads to misleadingly slow initial measurements.
GPU clock rates can take several seconds to ramp from idle to boost frequency. PyTorch's JIT (TorchScript, torch.compile) compiles on the first invocation. CUDA contexts must be initialized. Always discard the first 3-5 iterations before recording measurements. A single "cold" run can be 10x slower than steady-state performance.
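A minimal timing harness implementing this warmup-then-measure pattern might look like the following framework-agnostic sketch; for GPU work you would also synchronize the device (e.g. torch.cuda.synchronize()) before each timestamp so queued kernels are included.

```python
import time

def benchmark(fn, warmup=5, iters=30):
    """Time fn() per call, discarding warmup iterations first.

    Warmup lets caches populate, JIT compilation finish, and clocks
    ramp to steady state; only the subsequent iters are recorded.
    """
    for _ in range(warmup):  # discard cold runs
        fn()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times
```

The harness returns raw per-iteration times rather than a single average, so statistics (median, percentiles, confidence intervals) can be computed afterwards.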
Statistical Rigor
Statistical rigor is essential for meaningful performance comparisons. Report confidence intervals, not just averages. Use sufficient repetitions to establish statistical significance. Measure with and without competing system load to understand sensitivity. Small differences (under 5%) between configurations may not be meaningful given measurement noise.
CI = \bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}}

Run at least 10-30 repetitions after warmup. Compute the coefficient of variation (CV = std/mean). If CV > 5%, investigate sources of variance (thermal throttling, competing processes, NUMA effects) before drawing conclusions. For publication-quality benchmarks, use at least 50 repetitions and report the median with the interquartile range.
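The confidence-interval formula and the CV rule can be sketched with only the standard library. The hardcoded critical value is an assumption (the two-sided 95% t value for n = 30, df = 29); look up the value for your actual sample size.

```python
import statistics

def confidence_interval(samples, t_crit=2.045):
    """95% CI for the mean: x_bar +/- t * s / sqrt(n).

    t_crit=2.045 assumes n=30 (df=29); replace for other sample sizes.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    half = t_crit * statistics.stdev(samples) / n ** 0.5
    return mean - half, mean + half

def coefficient_of_variation(samples):
    """CV = sample std / mean; above ~0.05, investigate variance sources."""
    return statistics.stdev(samples) / statistics.mean(samples)
```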
Why is reporting p99 latency more important than average latency for production ML systems?
Controlling the Environment
Benchmarking environments must be carefully controlled. GPU clock rates, power limits, driver versions, framework versions, and competing processes all affect measurements. Documenting and controlling these factors enables reproducible benchmarks.
- Lock GPU clock rates using nvidia-smi to prevent frequency scaling during measurement.
- Record driver version, CUDA version, cuDNN version, and framework version.
- Disable GPU boost or set a fixed power limit for consistent thermal behavior.
- Use process isolation (taskset, numactl) to pin CPU cores and avoid NUMA effects.
- Use containerized environments (Docker, Singularity) to standardize the software stack.
- Report hardware SKU, memory configuration, and interconnect topology.
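Part of the checklist above can be snapshotted automatically and stored alongside results. This sketch records Python and platform details; the nvidia-smi query is an assumption that only succeeds on machines with NVIDIA drivers installed, so it degrades gracefully elsewhere.

```python
import platform
import subprocess
import sys

def capture_environment():
    """Snapshot software/hardware context to publish with benchmark data."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    try:
        # Query GPU name and driver version; fails cleanly without a GPU.
        env["gpu"] = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        env["gpu"] = "unavailable"
    return env
```

Serializing this dictionary next to the raw measurements makes results reproducible and auditable later.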
Common Measurement Mistakes
Common measurement mistakes include cherry-picking the best run instead of reporting statistics, measuring artificial workloads that do not represent production use, ignoring data loading and preprocessing overhead, and comparing across different batch sizes or precision without normalization.
A team reported a 30% speedup from a new kernel implementation. On closer inspection, they had compared a single best-case run of the new kernel against the average of the old kernel. They had also disabled the data pipeline and measured only the compute portion. When measured end-to-end with proper statistics, the actual improvement was 4% — within measurement noise. Honest benchmarking requires transparent methodology.
Warmup Runs
Initial execution runs performed before measurement to populate caches, complete JIT compilation, and stabilize hardware clock rates.
Statistical Significance
The confidence that observed performance differences are real rather than due to measurement noise, typically established through repeated trials and confidence intervals.
05 Benchmarking Best Practices
Effective benchmarking starts with clearly defining what you want to measure and why. Are you comparing hardware platforms? Evaluating optimization techniques? Selecting between model architectures? The measurement approach should be tailored to answer the specific question at hand.
End-to-End vs. Micro-Benchmarks
End-to-end benchmarking captures the true system performance by measuring the complete pipeline from data loading through inference and postprocessing. Micro-benchmarks isolate individual operations or components for detailed analysis. Both perspectives are valuable, but end-to-end measurements are what ultimately matter for production performance.
End-to-End Benchmark
A performance measurement that captures the complete system pipeline from data ingestion through model inference and postprocessing, reflecting the true production performance experienced by end users.
| Benchmark Type | Scope | Strengths | Limitations |
|---|---|---|---|
| End-to-End | Full pipeline | Reflects real performance, captures interactions | Hard to diagnose root cause of issues |
| Micro-Benchmark | Single operation | Easy to isolate and optimize | May miss pipeline-level bottlenecks |
| Component Benchmark | Subsystem (e.g., data loader) | Good balance of scope and detail | May miss cross-component interactions |
Table 12.4: Benchmarking at different scopes.
A system composed of individually fast components can still be slow overall. Pipeline stalls, memory contention, and synchronization overhead between components can dominate total latency. This is why end-to-end benchmarks are essential — they capture the interactions that micro-benchmarks miss.
Reproducibility
Reproducibility is a key quality indicator for benchmarks. Others should be able to independently verify your results given the same hardware, software, and configuration. Publishing benchmark code, configurations, and raw measurement data (not just summary statistics) enables community verification and builds trust in the results.
For every benchmark, publish: (1) the exact code or script used, (2) hardware configuration including GPU model, memory, and interconnect, (3) software versions for driver, CUDA, framework, and dependencies, (4) raw measurement data — not just summary statistics, (5) environment details including container image or Dockerfile.
Continuous Benchmarking in CI/CD
Benchmarking should be integrated into the CI/CD pipeline as a continuous practice, not a one-time event. Performance regression detection catches optimizations that inadvertently slow other parts of the system. Trend tracking over time reveals the compounding effect of many small improvements or regressions.
```yaml
# Example: GitHub Actions benchmark workflow
name: Performance Regression Check
on: [pull_request]
jobs:
  benchmark:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmark suite
        run: python benchmarks/run_all.py --output results.json
      - name: Compare with baseline
        run: python benchmarks/compare.py --baseline main --current results.json --threshold 5
```
Example: Automated performance regression detection in CI/CD.
Set regression thresholds carefully. Too tight (e.g., 1%) triggers frequent false alarms from measurement noise. Too loose (e.g., 20%) misses real regressions. A 5% threshold with statistical significance testing is a reasonable starting point. Adjust based on your measurement variance.
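The threshold comparison itself can be sketched as follows. This is a hypothetical stand-in for a compare script like the one in the workflow above; a production gate would add the statistical significance test the text recommends.

```python
import statistics

def regression_check(baseline, current, threshold_pct=5.0):
    """Flag a regression when mean latency grows by more than threshold_pct.

    baseline, current: lists of per-iteration latencies (post-warmup).
    Returns the percent change and whether it exceeds the threshold.
    """
    base_mean = statistics.mean(baseline)
    cur_mean = statistics.mean(current)
    change_pct = 100.0 * (cur_mean - base_mean) / base_mean
    return {"change_pct": change_pct,
            "regression": change_pct > threshold_pct}
```

A 3% slowdown passes the default 5% gate, while a 10% slowdown fails it, matching the noise-vs-sensitivity tradeoff described above.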
End-to-End Benchmark
A benchmark that measures the complete system pipeline performance, including data loading, computation, and postprocessing.
Performance Regression
An unintended decrease in system performance caused by a code or configuration change, detected through continuous benchmarking.
Key Takeaways
1. MLPerf provides standardized benchmarks for fair comparison, but production workloads may differ significantly from benchmark workloads.
2. Profiling is essential before optimization; without measurement, optimization is guesswork.
3. The roofline model reveals whether each operation is compute-bound or memory-bound, directly guiding optimization strategy.
4. Accurate benchmarking requires warmup runs, statistical rigor, controlled environments, and transparent reporting.
5. Continuous benchmarking in CI/CD detects performance regressions and tracks optimization progress over time.