Part 3: Optimization · Chapter 9 · ~35 min read

Efficient AI

Introduces computational efficiency principles for ML. Covers techniques for doing more with less, including efficient model design and resource-aware approaches.

Keywords: computational efficiency, model compression, resource constraints, efficient design, FLOPs analysis
Learning Objectives
  • Analyze the five dimensions of ML efficiency: FLOPs, memory, latency, energy, and monetary cost
  • Apply the roofline model to determine whether an operation is compute-bound or memory-bound
  • Evaluate efficient model design strategies including transfer learning and architecture search
  • Compare resource budget trade-offs across training, inference, and full lifecycle costs
  • Design an efficiency-first ML system that meets target quality within a given resource budget

01 Computational Efficiency in ML

Computational efficiency is the practice of achieving the best possible model quality within a given resource budget. As ML models have grown exponentially in size, efficiency has become not just a nice-to-have but a critical requirement for practical deployment. The gap between what is computationally possible and what is economically feasible continues to widen.

Definition

Computational Efficiency

The practice of maximizing model quality relative to consumed resources. Efficiency is multidimensional, spanning FLOPs, memory footprint, inference latency, energy consumption, and monetary cost. Improving efficiency means pushing the Pareto frontier of these trade-offs.

Figure 9.1: The efficiency-accuracy spectrum across model families. Models from MobileNet (3.4M parameters) through GPT-4-class systems (~1.8T parameters) are plotted by compute cost (GFLOPs, log scale) against benchmark accuracy, with circle size proportional to parameter count. The Pareto frontier connects the optimal trade-off points.

Dimensions of Efficiency

Efficiency can be measured along multiple dimensions: computational cost (FLOPs), memory footprint, inference latency, energy consumption, and monetary cost. These dimensions are related but not equivalent: a model with fewer FLOPs may still have higher latency because its operations are memory-bandwidth-bound, and a model with lower energy consumption may cost more to run because it requires specialized hardware.

Table 9.1: The five key dimensions of ML efficiency.

| Dimension | Metric | What It Captures | Limitation |
|---|---|---|---|
| Computational cost | FLOPs | Total arithmetic operations | Ignores memory bandwidth and parallelism |
| Memory footprint | Peak MB/GB | Maximum memory required | Doesn't reflect bandwidth utilization |
| Inference latency | Milliseconds | Time for a single prediction | Varies by batch size and hardware |
| Energy consumption | Joules per inference | Power cost per prediction | Hard to measure, hardware-dependent |
| Monetary cost | $/1K inferences | Dollar cost of serving | Depends on cloud pricing and utilization |

FLOPs Are Not Latency

A common mistake is equating fewer FLOPs with faster inference. A depthwise separable convolution has far fewer FLOPs than a standard convolution, but it has lower arithmetic intensity and may be memory-bandwidth-bound on GPUs. Always measure actual latency on target hardware rather than relying solely on FLOP counts.

Quick Check

A model with fewer FLOPs will always have lower inference latency than a model with more FLOPs on the same hardware. True or false?

Not quite. FLOPs measure arithmetic operations but ignore memory access patterns, hardware utilization, and operation scheduling. A depthwise separable convolution has fewer FLOPs than a standard convolution but can actually be slower on GPUs due to lower arithmetic intensity. Always benchmark on real hardware.

The Pareto Frontier

The Pareto frontier of efficiency represents the set of models where no dimension can be improved without degrading another. The goal of efficient AI research is to push this frontier, finding designs that achieve better trade-offs than previously possible. Techniques span the entire pipeline from architecture design through training to deployment.

Definition

Pareto Frontier

The set of optimal designs where no efficiency dimension (accuracy, latency, memory, cost) can be improved without degrading at least one other dimension. Models on the frontier represent the best achievable trade-offs given current techniques.

Design for Efficiency from Day One

Efficiency considerations must be incorporated from the earliest stages of system design, not bolted on as an afterthought. Designing for efficiency from the start leads to fundamentally different architectural choices than designing for accuracy alone and then compressing. Ask "what hardware will this model run on?" before choosing an architecture.

The Exponential Cost of Scale

Training costs for frontier models have increased roughly 10x per year. GPT-3 (2020) cost an estimated $4.6M to train. GPT-4 (2023) reportedly cost over $100M. By 2025-2026, frontier model training runs are estimated to exceed $1B. This trajectory is unsustainable without fundamental efficiency improvements, making this chapter's techniques critically important for the future of AI.

Computational Efficiency

The practice of maximizing model quality relative to resource consumption across dimensions like FLOPs, memory, latency, and energy.

Pareto Frontier

The set of optimal designs where no efficiency dimension can be improved without degrading another, representing the best achievable trade-offs.

02 Efficient Model Design

Efficient model design builds efficiency into the architecture from the ground up rather than compressing an existing model. This approach often yields better accuracy-efficiency trade-offs because the architecture is optimized holistically rather than having efficiency applied as a post-processing step.

MobileNets: Efficient Convolutions

MobileNets pioneered efficient architecture design for mobile devices by replacing standard convolutions with depthwise separable convolutions. MobileNetV2 introduced inverted residual blocks with linear bottlenecks. MobileNetV3 combined these with hardware-aware neural architecture search (NAS) to achieve state-of-the-art mobile efficiency.

Definition

Depthwise Separable Convolution

A factorized convolution that splits a standard convolution into two steps: a depthwise convolution (one filter per input channel for spatial filtering) followed by a 1x1 pointwise convolution (for channel mixing). This reduces computation by a factor of roughly k^2 compared to standard convolutions, where k is the kernel size.

\text{Cost ratio} = \frac{\text{Depthwise Separable}}{\text{Standard}} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}
MobileNet Cost Savings

A 3x3 standard convolution with 512 input and 512 output channels on a 14x14 feature map requires 924M FLOPs. The equivalent depthwise separable convolution requires only 105M FLOPs, an 8.8x reduction. MobileNetV2 achieves 72% ImageNet accuracy at just 300M FLOPs, compared to ResNet-50 at 4,100M FLOPs (76% accuracy).
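To make this arithmetic easy to reproduce, here is a minimal Python sketch (the helper names are illustrative, not from any library) that applies the chapter's FLOP formulas, counting 2 FLOPs per multiply-add:

```python
def conv2d_flops(k, c_in, c_out, h_out, w_out):
    """Standard k x k convolution: 2 * k^2 * C_in * C_out * H_out * W_out FLOPs."""
    return 2 * k * k * c_in * c_out * h_out * w_out

def depthwise_separable_flops(k, c_in, c_out, h_out, w_out):
    """Depthwise spatial filtering per channel, then a 1x1 pointwise channel mix."""
    depthwise = 2 * k * k * c_in * h_out * w_out
    pointwise = 2 * c_in * c_out * h_out * w_out
    return depthwise + pointwise

std = conv2d_flops(3, 512, 512, 14, 14)                 # ~924M FLOPs
sep = depthwise_separable_flops(3, 512, 512, 14, 14)    # ~105M FLOPs
print(f"{std/1e6:.0f}M vs {sep/1e6:.0f}M -> {std/sep:.1f}x reduction")
```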

EfficientNet: Compound Scaling

EfficientNet demonstrated that balanced scaling of depth, width, and resolution through compound scaling consistently outperforms scaling along a single dimension. The EfficientNet family achieved better accuracy than previous architectures at much lower computational cost, establishing a new accuracy-efficiency baseline.

d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi \quad \text{s.t.} \quad \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2
Why Compound Scaling Works

Scaling only depth creates vanishing gradient problems. Scaling only width leads to diminishing returns at large widths. Scaling only resolution increases FLOPs quadratically. Compound scaling balances all three dimensions, ensuring that each additional unit of compute is spent where it provides the most accuracy gain.
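As a worked illustration, the sketch below evaluates the compound scaling rule for increasing values of the compound coefficient φ. The base coefficients α = 1.2, β = 1.1, γ = 1.15 are the grid-searched values reported for EfficientNet, used here purely as an example.

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # depth, width, resolution base multipliers

def compound_scale(phi):
    # d = alpha^phi, w = beta^phi, r = gamma^phi
    return alpha ** phi, beta ** phi, gamma ** phi

# Constraint alpha * beta^2 * gamma^2 ~ 2 means each +1 in phi roughly doubles FLOPs.
print(f"constraint value: {alpha * beta**2 * gamma**2:.2f}")
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```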

Efficient Attention for Transformers

For Transformer models, efficiency improvements focus on reducing the quadratic cost of self-attention. Efficient attention variants reduce memory and computation through different approaches.

Table 9.2: Efficient attention variants and their complexity profiles.

| Method | Approach | Time Complexity | Memory Complexity |
|---|---|---|---|
| Standard Attention | Full n x n attention matrix | O(n^2 d) | O(n^2) |
| Linformer | Low-rank projection of K, V | O(n k d) | O(n k) |
| Performer | Kernel approximation (FAVOR+) | O(n d^2) | O(n d) |
| Flash Attention | Tiled IO-aware computation | O(n^2 d) | O(n) |
| Multi-Query Attention | Shared K, V heads | O(n^2 d) | O(n^2), but smaller KV cache |

Flash Attention Is Now Standard

Flash Attention does not change the mathematical output of attention — it produces the exact same result. It achieves its speedup by tiling the computation to minimize GPU HBM reads/writes, exploiting the much faster on-chip SRAM. Because it is mathematically equivalent and strictly better in both speed and memory, Flash Attention should be the default choice for all Transformer training and inference.
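In practice, using it can be as simple as calling PyTorch's fused attention op, which dispatches to a Flash-Attention-style kernel when the hardware, dtype, and shapes allow. A minimal sketch; the shapes and the CUDA/float16 setting are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 8, 16 heads, sequence length 4096, head dim 64.
q = torch.randn(8, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch selects the fastest available backend (flash, memory-efficient, or math);
# the result is the exact attention output either way, only speed and memory differ.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```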

Deeper Insight

Most trained neural networks are vastly over-parameterized: the majority of weights contribute almost nothing to the final prediction. Pruning exploits this redundancy by zeroing out the smallest-magnitude weights, producing a sparse model that computes the same function with a fraction of the active connections. The "lottery ticket hypothesis" suggests the original dense network contains a small subnetwork that, if trained in isolation, would reach the same accuracy.

Think of it like a company with 1,000 employees where only 100 do critical work. Pruning is the organizational restructuring that identifies the essential staff. The company runs almost as effectively afterward, but at a tenth of the cost, as long as you identify the right people to keep.


MobileNet

A family of efficient CNN architectures designed for mobile and edge devices, using depthwise separable convolutions to reduce computation.

Flash Attention

A memory-efficient attention implementation that reduces memory usage from O(n^2) to O(n) by tiling computations to exploit GPU SRAM.

03 Resource Constraints and Budget Analysis

Real-world ML deployments face resource constraints that vary dramatically depending on the target platform. Cloud deployments may have generous compute budgets but face cost optimization pressure. Mobile deployments are constrained by battery life and thermal limits. Edge deployments on microcontrollers may have only hundreds of kilobytes of memory.

Resource Budget Analysis

Resource budget analysis quantifies the constraints that a model must satisfy. This includes peak memory (the maximum memory required at any point during inference), compute budget (total FLOPs per inference), latency budget (maximum acceptable response time), and energy budget (maximum energy per inference). These budgets define the feasible region for model selection.

Table 9.3: Resource budgets across common deployment targets.

| Deployment Target | Latency Budget | Memory Budget | Power Budget | Example Model |
|---|---|---|---|---|
| Cloud (batch) | 100s of ms | 16-80 GB (GPU) | 200-300 W | LLaMA 70B on A100 |
| Cloud (real-time) | <50 ms | 8-16 GB | 200-300 W | BERT-base for search |
| Mobile | <30 ms | 2-4 GB | 3-5 W | MobileNetV3 for camera |
| Edge / IoT | <10 ms | 256 KB - 4 MB | 1-100 mW | TinyML keyword detection |
| Browser (WASM) | <100 ms | 100-500 MB | N/A (user device) | MediaPipe hand tracking |

Definition

Resource Budget

A quantified set of constraints that a deployed model must satisfy, typically including peak memory, compute (FLOPs), latency, energy, and monetary cost per inference. The budget defines the feasible region within which a model must operate on the target platform.
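Such a budget can be written down as data and checked mechanically before deployment. A minimal sketch using the mobile row of Table 9.3; the measured numbers for the candidate model are hypothetical:

```python
# Budget from the mobile row of Table 9.3; the measurements below are hypothetical.
budget   = {"latency_ms": 30, "peak_memory_mb": 2048, "power_w": 5.0}
measured = {"latency_ms": 22, "peak_memory_mb": 310,  "power_w": 2.1}

violations = {k: (measured[k], budget[k]) for k in budget if measured[k] > budget[k]}
print("fits budget" if not violations else f"over budget: {violations}")
```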

The Roofline Model

The roofline model provides a framework for understanding whether a model is compute-bound or memory-bound on a given hardware platform. By plotting operational intensity (FLOPs per byte of memory access) against achieved throughput, the roofline model reveals whether optimization should focus on reducing computation or improving memory access patterns.

Definition

Roofline Model

A visual performance model that plots achieved throughput (FLOP/s) against operational intensity (FLOPs per byte of memory transferred). The "roof" is formed by the hardware's peak compute (flat ceiling) and peak memory bandwidth (sloped line). A workload below the roof is either compute-bound or memory-bound depending on which ceiling it hits.

\text{Attainable FLOP/s} = \min\left(\text{Peak FLOP/s},\; \text{Bandwidth} \times \text{Operational Intensity}\right)
Using the Roofline in Practice

If your model sits on the sloped (memory-bound) portion of the roofline, reducing FLOPs won't help — you need to reduce memory traffic (via operator fusion, quantization, or increased batch size). If it sits on the flat (compute-bound) ceiling, reducing FLOPs directly reduces latency. Profile your model with tools like NVIDIA Nsight Compute or PyTorch Profiler to determine where you stand.
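The roofline bound itself is one line of arithmetic. The sketch below applies it to a hypothetical accelerator (100 TFLOP/s peak, 2 TB/s memory bandwidth; both numbers are made up for illustration) to show where the memory-bound and compute-bound regimes meet:

```python
def attainable_flops(peak_flops, bandwidth, operational_intensity):
    """Roofline model: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_flops, bandwidth * operational_intensity)

peak, bw = 100e12, 2e12          # hypothetical: 100 TFLOP/s peak, 2 TB/s bandwidth
ridge = peak / bw                # 50 FLOPs/byte: below this, the workload is memory-bound

for oi in (1, 10, 50, 200):      # operational intensity in FLOPs per byte moved
    perf = attainable_flops(peak, bw, oi)
    regime = "memory-bound" if oi < ridge else "compute-bound"
    print(f"OI {oi:>3} FLOPs/byte -> {perf / 1e12:5.1f} TFLOP/s ({regime})")
```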

Full Lifecycle Cost Analysis

Cost analysis must consider the full lifecycle: training cost, serving cost, and the engineering cost of optimization. A model that is 10% more efficient to serve but requires 3x the engineering effort to optimize may not be worthwhile unless the deployment scale is large enough to justify the investment.

Premature Optimization

Don't optimize for efficiency before you have validated that the model solves the right problem. A perfectly optimized model that solves the wrong problem has zero value. First validate accuracy and product-market fit with an unoptimized model, then invest in efficiency as serving costs become material.

Roofline Model

An analytical performance model that shows the achievable throughput as a function of operational intensity, revealing whether computation is memory-bound or compute-bound.

Operational Intensity

The ratio of floating-point operations to bytes of memory accessed, determining whether a workload is limited by compute or memory bandwidth.

04 FLOPs Analysis and Complexity

FLOPs (floating-point operations) provide a hardware-independent measure of computational cost that enables comparison across different models and architectures. Understanding FLOPs complexity is essential for predicting inference time, training cost, and energy consumption.

Definition

FLOPs (Floating-Point Operations)

A hardware-independent count of arithmetic operations (additions and multiplications) required to execute a computation. FLOPs enable comparing model efficiency across different architectures without reference to specific hardware. Note: "FLOP/s" (or "FLOPS") refers to operations per second, a throughput metric, whereas "FLOPs" is a total count.

FLOPs by Layer Type

For fully connected layers, the FLOPs count is straightforward: 2 * input_size * output_size (one multiply and one add per operation). For convolutional layers, FLOPs depend on kernel size, input dimensions, and number of channels. For Transformer self-attention, FLOPs scale quadratically with sequence length.

Table 9.4: FLOPs formulas for common neural network layers.

| Layer Type | FLOPs Formula | Scaling |
|---|---|---|
| Fully Connected | 2 * C_in * C_out | Linear in dimensions |
| Convolution (k x k) | 2 * k^2 * C_in * C_out * H_out * W_out | Linear in spatial size, quadratic in kernel size |
| Self-Attention | 2 * n^2 * d + 2 * n * d^2 | Quadratic in sequence length n |
| Depthwise Conv (k x k) | 2 * k^2 * C * H_out * W_out | Linear in channels (no cross-channel mixing) |

\text{FLOPs}_{\text{attention}} = 2n^2d + 2nd^2 = 2nd(n + d)
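Plugging numbers into this formula shows how quickly the quadratic term takes over. A small sketch, assuming a BERT-base-like hidden size of d = 768:

```python
def attention_flops(n, d):
    """Self-attention FLOPs per Table 9.4: 2*n^2*d + 2*n*d^2 = 2*n*d*(n + d)."""
    return 2 * n * d * (n + d)

d = 768   # hidden size, roughly BERT-base
for n in (512, 1024, 2048, 4096):
    print(f"sequence length {n:>4}: {attention_flops(n, d) / 1e9:5.1f} GFLOPs")
```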

Why FLOPs Are an Imperfect Proxy

However, FLOPs are an imperfect proxy for actual performance. Memory bandwidth, operation scheduling, and hardware utilization all affect real-world speed. A model with fewer FLOPs may actually be slower if it uses operations that are poorly supported by the target hardware or has irregular memory access patterns.

The FLOPs-Latency Gap

Consider: a standard convolution has more FLOPs than a depthwise separable convolution, yet on GPUs the depthwise version can be slower because it has lower arithmetic intensity and underutilizes tensor cores. Similarly, operations like gather, scatter, and non-maximum suppression have few FLOPs but high latency due to irregular memory access. Always measure real latency on target hardware.
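When in doubt, benchmark. A minimal PyTorch timing sketch (the model and input are placeholders to fill in for your deployment target); the warm-up iterations and explicit synchronization matter because GPU execution is asynchronous:

```python
import time
import torch

def measure_latency_ms(model, example_input, iters=100, warmup=10):
    """Average wall-clock latency per forward pass, in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):               # warm up kernels and caches
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()          # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

# Example usage (placeholders):
# model = torchvision.models.resnet50().cuda()
# print(measure_latency_ms(model, torch.randn(1, 3, 224, 224, device="cuda")))
```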

FLOPs vs. Real-World Speed

EfficientNet-B0 has 390M FLOPs and achieves 77.1% ImageNet accuracy. RegNetY-400M also has ~400M FLOPs and achieves 74.1% accuracy. But on an NVIDIA V100 GPU, RegNet runs 3x faster because its regular structure maps better to GPU parallelism. FLOPs are the same, but throughput differs dramatically.

FLOPs vs. MACs

MACs (multiply-accumulate operations) are an alternative measure that counts each multiply-add pair as a single operation, resulting in values roughly half the FLOP count. When comparing published efficiency numbers, it is important to verify whether the metric is FLOPs or MACs, as confusion between the two can lead to 2x errors in reported efficiency.

Always Check the Unit

When reading a paper that claims "X GFLOPs," verify whether they mean FLOPs or MACs. Some papers use "FLOPs" to mean MACs, leading to a 2x discrepancy. The safest approach is to compute FLOPs yourself using tools like torchprofile, fvcore, or ptflops, and to always state your convention explicitly.
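A sketch of doing that count with fvcore, assuming it is installed. Note the convention caveat: fvcore's counter treats one fused multiply-add as a single operation, so its output is effectively MACs and should be doubled to match this chapter's 2-FLOPs-per-MAC convention (worth double-checking against the tool's documentation for your version).

```python
import torch
import torchvision
from fvcore.nn import FlopCountAnalysis

model = torchvision.models.resnet18()
counter = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224))

macs = counter.total()   # fvcore counts one multiply-add as one operation
print(f"~{macs / 1e9:.2f} GMACs  ~=  {2 * macs / 1e9:.2f} GFLOPs (2 FLOPs per MAC)")
```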

FLOPs

A hardware-independent count of floating-point operations used to compare the computational cost of models; the related throughput metric FLOP/s measures operations per second.

MACs

Multiply-accumulate operations, counting each fused multiply-add as one operation; approximately half the FLOP count for the same computation.

05 Efficient Training Strategies

Training efficiency is as important as inference efficiency, especially for large models where training costs can reach millions of dollars. Efficient training strategies reduce the time, compute, and energy required to reach a target model quality.

Transfer Learning and Fine-Tuning

Transfer learning and fine-tuning are among the most effective efficiency strategies. Starting from a pre-trained model and fine-tuning on the target task requires far less data and compute than training from scratch. Foundation models have made this approach dominant: most practitioners now start from pre-trained checkpoints rather than random initialization.

Definition

Transfer Learning

The practice of reusing a model's learned representations from one task (typically a large-scale pretraining task) as the starting point for a different target task. This dramatically reduces both the data and compute needed to achieve good performance on the target task, because the model starts with useful features rather than random weights.

LoRA adapters reduce fine-tuning cost to under 1% of full training (see Table 9.5).
The Foundation Model Paradigm

Transfer learning has evolved into the foundation model paradigm: train one large model on a massive, general dataset (e.g., all of the internet), then fine-tune it for thousands of specific tasks. This is radically more efficient than training a separate model from scratch for each task. A single GPT-4 training run, amortized across millions of downstream applications, is far cheaper per application than individual training.

Table 9.5: Training efficiency strategies ordered by cost and data requirements.

| Strategy | Training Cost | Data Required | When to Use |
|---|---|---|---|
| Train from scratch | 100% (baseline) | Large dataset | Novel domain with no relevant pre-trained model |
| Fine-tune all layers | 5-20% | Moderate dataset | Target task differs significantly from pretraining |
| Fine-tune last layers | 1-5% | Small dataset | Target task is similar to pretraining task |
| LoRA / Adapters | <1% | Small dataset | Resource-constrained fine-tuning of large models |
| Prompt tuning | <0.1% | Few examples | Very large models, minimal compute available |

Quick Check

LoRA (Low-Rank Adaptation) achieves less than 1% of the training cost of full fine-tuning. How does it accomplish this?

Not quite. LoRA inserts small trainable low-rank decomposition matrices alongside the frozen pretrained weights. Only these adapter matrices (a tiny fraction of total parameters) are updated during fine-tuning, dramatically reducing compute, memory, and storage requirements.
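The mechanism is small enough to sketch directly. Below is an illustrative LoRA-style wrapper around a frozen linear layer (not the reference implementation from the LoRA paper or the peft library); the rank and alpha values are typical but arbitrary choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Pretrained path plus the low-rank correction; only lora_a/lora_b get gradients.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

With rank 8, each adapter adds only rank * (in_features + out_features) trainable parameters per layer, which is why the trainable fraction lands well under 1% for large models (the <1% row in Table 9.5).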

Deeper Insight

When a large model predicts "cat: 0.7, lynx: 0.2, dog: 0.1," the relationships between wrong answers carry rich information: the model has learned that cats look more like lynxes than dogs. This soft probability distribution contains far more learning signal than the hard label "cat." A small student model trained to mimic these soft outputs learns these inter-class relationships and achieves accuracy far beyond what it could learn from hard labels alone.

Think of it like an expert chef mentoring an apprentice. Instead of just saying "this dish needs salt" (hard label), the chef explains "it needs a pinch of salt, it is close to right on acidity, and the sweetness is completely off" (soft targets). The apprentice learns much faster from this nuanced feedback than from simple pass/fail judgments.
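A minimal sketch of the distillation loss this motivates, in the spirit of Hinton-style knowledge distillation; the temperature and mixing weight are typical but illustrative defaults.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's softened outputs."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```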

Progressive Training

Progressive training starts with a smaller model or lower resolution and gradually increases capacity during training. This approach trains faster because early learning happens on a cheaper model, and complexity is added only when needed. Progressive resizing for image models and progressive layer unfreezing for language models are common instantiations.

Progressive Resizing

Training an image classifier: start at 64x64 resolution for the first 30% of epochs, increase to 128x128 for the next 30%, then train at full 224x224 resolution for the final 40%. The low-resolution phases are 4-16x cheaper per step. Overall training time drops by 40-60% with negligible accuracy loss, because early training mostly learns coarse features that don't require high resolution.
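One way to express such a schedule, as a sketch: the resolutions and epoch fractions mirror the example above, and the training-loop internals are elided as placeholders.

```python
# Phases as (fraction of total epochs, input resolution), per the example above.
schedule = [(0.30, 64), (0.30, 128), (0.40, 224)]
total_epochs = 100

completed = 0
for fraction, resolution in schedule:
    phase_epochs = round(fraction * total_epochs)
    # Rebuild the data pipeline so inputs are resized to `resolution`;
    # the model and optimizer state carry over across phases.
    for _ in range(phase_epochs):
        completed += 1
        # train_one_epoch(model, loader_at(resolution))   # placeholder hook
    print(f"epoch {completed}: finished phase at {resolution}x{resolution}")
```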

Curriculum Learning

Curriculum learning orders training examples from easy to hard, mimicking how humans learn. By presenting simpler examples first, the model develops basic representations before tackling difficult cases. Combined with data filtering to remove noisy or redundant examples, curriculum learning can achieve the same accuracy with significantly less compute.

Definition

Curriculum Learning

A training strategy that presents examples in a meaningful order, typically from easy to hard. The "difficulty" can be measured by loss magnitude, prediction confidence, or data complexity metrics. This mirrors pedagogical practice: master fundamentals before advanced topics.
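One common way to operationalize "easy to hard" is to score each example by the loss of a cheap proxy (or the current) model and sort ascending. A minimal sketch with placeholder data-loading assumptions (each dataset item is an (input, label) tensor pair):

```python
import torch

def curriculum_order(model, dataset, loss_fn):
    """Return dataset indices sorted from easiest (lowest loss) to hardest."""
    losses = []
    model.eval()
    with torch.no_grad():
        for x, y in dataset:                     # one example at a time, for clarity
            loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            losses.append(loss.item())
    return sorted(range(len(losses)), key=losses.__getitem__)

# easy_to_hard = curriculum_order(proxy_model, train_set, torch.nn.CrossEntropyLoss())
# Early epochs train on the first portion of easy_to_hard; later epochs use the full set.
```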

Data Pruning for Efficiency

Not all training data contributes equally to model quality. Studies show that training on a carefully selected 50% subset of the data can match the accuracy of training on the full dataset. Use techniques like influence functions, data Shapley values, or simple heuristics (remove near-duplicates and low-confidence examples) to identify the most valuable training examples.

Transfer Learning

The practice of reusing a model trained on one task as the starting point for training on a different but related task, dramatically reducing required data and compute.

Curriculum Learning

A training strategy that presents examples in order of increasing difficulty, improving training efficiency and sometimes final model quality.

Key Takeaways

  1. Efficiency must be designed in from the start, not bolted on as an afterthought through post-training compression.
  2. FLOPs are a useful but imperfect measure of efficiency; actual performance depends on memory access patterns and hardware utilization.
  3. The roofline model reveals whether inference is compute-bound or memory-bound, guiding optimization strategy.
  4. Transfer learning from pre-trained models is the most effective efficiency strategy for most practical applications.
  5. Resource budget analysis must consider the full lifecycle cost, including training, serving, and engineering effort.


Chapter Complete

Up next: Model Optimizations
