Chapter 10: Model Optimizations

Deep dive into quantization, pruning, knowledge distillation, and operator fusion. Practical techniques for shrinking models without sacrificing accuracy.

Topics: quantization, pruning, knowledge distillation, operator fusion, model compression

Learning Objectives
  • Compare post-training quantization and quantization-aware training in terms of accuracy and deployment effort
  • Explain structured vs. unstructured pruning and their implications for hardware acceleration
  • Implement a knowledge distillation pipeline to compress a large teacher model into a smaller student
  • Analyze how operator fusion reduces memory traffic and improves inference throughput
  • Design a combined optimization pipeline that sequences quantization, pruning, and distillation effectively

01 Quantization

Quantization reduces the numerical precision of model weights and activations from floating-point (typically FP32) to lower-bit representations (INT8, INT4, or even binary). This directly reduces model size, memory bandwidth requirements, and computational cost, while leveraging integer arithmetic units that are faster and more energy-efficient on most hardware.

Definition

Quantization

The process of mapping continuous or high-precision floating-point values to a discrete set of lower-precision values. For neural networks, this means converting FP32 weights and activations to INT8, INT4, or other reduced formats. The quantization function maps a real value x to an integer q via: q = round(x / scale) + zero_point.

q = \text{round}\left(\frac{x}{s}\right) + z, \quad \hat{x} = s \cdot (q - z)
\text{Quantization Error} = \frac{\text{range}}{2^{\text{bits}} - 1}, \quad \text{Levels} = 2^{\text{bits}}
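A small worked example of this affine mapping, using an illustrative FP32 tensor and the signed INT8 range (all values below are made up for demonstration):

python
import numpy as np

x = np.array([-1.2, -0.4, 0.0, 0.7, 2.5], dtype=np.float32)

qmin, qmax = -128, 127                                   # signed INT8 range
scale = float(x.max() - x.min()) / (qmax - qmin)         # s
zero_point = int(round(qmin - float(x.min()) / scale))   # z: maps x.min() near qmin

q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = scale * (q.astype(np.float32) - zero_point)      # dequantized approximation

print(q)                        # integer codes: [-128  -73  -45    3  127]
print(np.abs(x - x_hat).max())  # worst-case error is on the order of scale/2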

Figure 10.1: Quantization accuracy vs. model size trade-off (relative accuracy plotted against model size on a log scale). For a 1B-parameter model: FP32 = 4 GB at 100% relative accuracy, FP16 = 2 GB at 99.8%, INT8 = 1 GB at 99.2%, INT4 = 500 MB at 97%, INT2 = 250 MB at 92%. INT8 sits in the sweet spot of the curve.

Post-Training Quantization (PTQ)

Post-training quantization converts a pre-trained FP32 model to lower precision without retraining. The simplest approach uniformly maps the FP32 value range to the target precision. More sophisticated methods like per-channel quantization and dynamic range adjustment minimize the accuracy impact. PTQ is attractive because it requires no additional training, but it can struggle with models that have wide activation ranges.

Start with PTQ

Always try post-training quantization first. INT8 PTQ with per-channel weight quantization typically loses less than 1% accuracy on classification models and requires only a small calibration dataset (100-1000 samples). If PTQ results are unsatisfactory, then invest in quantization-aware training.
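As a minimal sketch of static PTQ in eager-mode PyTorch (the toy network and random calibration batches below are placeholders for a real model and calibration set; the "fbgemm" backend assumes an x86 CPU and uses per-channel weight quantization by default):

python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # FP32 -> INT8 at the input
        self.fc1 = nn.Linear(512, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()  # INT8 -> FP32 at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)

# Calibration: run a few hundred representative batches so the observers
# can record activation ranges (random data stands in for a real set here).
with torch.no_grad():
    for _ in range(100):
        prepared(torch.randn(32, 512))

quantized = torch.ao.quantization.convert(prepared)  # INT8 weights and kernels
print(quantized(torch.randn(1, 512)).shape)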

Quantization-Aware Training (QAT)

Quantization-aware training simulates quantization during training, allowing the model to adapt its weights to the constraints of lower precision. During training, fake quantization nodes insert quantization and dequantization operations in the forward pass while using straight-through estimators for gradients. QAT typically achieves better accuracy than PTQ, especially at very low bit widths.

The Straight-Through Estimator

Quantization is a step function with zero gradients almost everywhere, which would prevent backpropagation. The straight-through estimator (STE) solves this by passing gradients through the quantization step unchanged during the backward pass, as if the quantization were an identity function. This approximation works surprisingly well in practice.
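A minimal sketch of fake quantization with an STE backward pass, assuming a fixed scale and zero-point purely for illustration:

python
import torch

class FakeQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero_point):
        # Quantize to the INT8 grid, then dequantize immediately.
        q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
        return scale * (q - zero_point)

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat the round/clamp as an identity; no gradients for scale or zero_point.
        return grad_output, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantize.apply(x, 0.1, 0)
y.sum().backward()
print(x.grad)  # all ones: the gradient passed straight through the quantizer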

Table 10.1: Comparison of quantization approaches.

| Method | Accuracy Impact | Required Effort | Best For |
|---|---|---|---|
| PTQ (INT8, per-tensor) | Moderate (1-3% drop) | Minutes (calibration only) | Quick deployment with acceptable accuracy |
| PTQ (INT8, per-channel) | Low (<1% drop) | Minutes (calibration only) | Most models, recommended default |
| QAT (INT8) | Very low (<0.5% drop) | Full training cycle | When PTQ accuracy is insufficient |
| QAT (INT4) | Low-moderate (1-2% drop) | Full training cycle | Mobile/edge with strict size constraints |
| GPTQ (4-bit LLM) | Low (perplexity +0.1-0.5) | Hours (calibration) | Running LLMs on consumer GPUs |

LLM Quantization: GPTQ, AWQ, and Beyond

Recent advances in quantization for large language models have pushed the boundaries to 4-bit and even 2-bit precision. Techniques like GPTQ, AWQ, and SqueezeLLM use sophisticated calibration methods to maintain quality at extreme compression ratios. These methods have made it possible to run large language models on consumer hardware, democratizing access to powerful AI capabilities.

Running LLaMA on Consumer Hardware

LLaMA-2 70B in FP16 requires 140 GB of memory — far beyond any consumer GPU. With 4-bit GPTQ quantization, it fits in 35 GB and can run on two RTX 4090 GPUs (24 GB each). With 4-bit AWQ and offloading, it can even run on a single GPU with system RAM spillover. Quality loss is minimal: perplexity increases by only ~0.3 points.
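The exact loading path depends on the quantization method and library. As one illustration (using the Hugging Face transformers + bitsandbytes stack and 4-bit NF4 rather than GPTQ itself; the model name is a placeholder checkpoint that requires access):

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config; compute still happens in FP16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # smaller stand-in for the 70B model
    quantization_config=bnb_config,
    device_map="auto",               # spill layers to CPU RAM if the GPU is too small
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")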

Quantization Is Not Free

While INT8 quantization is nearly lossless for most models, aggressive quantization (INT4, INT2) can cause significant accuracy degradation, especially for small models and tasks requiring fine-grained numerical precision. Always evaluate quantized model accuracy on your specific task before deploying — aggregate metrics can hide task-specific failures.


Quantization

The process of reducing the numerical precision of model parameters and activations to lower bit widths, reducing model size and improving inference speed.

Quantization-Aware Training

A training procedure that simulates quantization during training, enabling the model to learn representations that are robust to lower precision.

02 Pruning

Pruning removes unnecessary parameters from neural networks, creating sparse models that require less computation and memory. The key insight is that neural networks are typically over-parameterized, and many weights contribute little to the output. Identifying and removing these redundant parameters can yield significant efficiency gains.

Definition

Pruning

The systematic removal of neural network parameters (weights, neurons, channels, or attention heads) that contribute least to model output. Pruning creates sparser, smaller models that require less computation and memory while preserving most of the original model's accuracy.

Unstructured Pruning

Unstructured pruning removes individual weights based on magnitude or other criteria, creating irregular sparsity patterns. While this can achieve high compression ratios, the irregular memory access patterns make it difficult to accelerate on standard hardware. Specialized sparse computation libraries and hardware are needed to realize the theoretical speedups.

The Hardware Gap for Unstructured Sparsity

A model pruned to 90% sparsity has 10x fewer nonzero weights, suggesting a 10x speedup. In practice, on standard GPUs, there is often no speedup at all: the sparse matrix is stored in a compressed format, but the irregular access pattern prevents the GPU from achieving good utilization. Realizing speedups from sparsity requires either specialized hardware (e.g., the sparse tensor cores on NVIDIA A100 GPUs, which support 2:4 structured sparsity) or specialized software such as sparse BLAS libraries.
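As a concrete illustration, magnitude-based unstructured pruning can be applied with PyTorch's pruning utilities; the layer and the 90% sparsity target below are illustrative:

python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 90% of weights with the smallest absolute value (unstructured).
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")  # ~90%, but still stored as a dense masked tensor

# Make the pruning permanent (removes the mask/weight_orig reparameterization).
prune.remove(layer, "weight")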

Structured Pruning

Structured pruning removes entire neurons, channels, or attention heads, producing models that maintain dense computation patterns. This is more hardware-friendly because standard dense matrix operations can be used on the smaller model. Channel pruning for CNNs and head pruning for Transformers are the most common structured pruning approaches.
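A minimal structured (filter) pruning sketch using the same PyTorch utilities; the layer and the 25% pruning ratio are illustrative, and a real pipeline would follow up by physically rebuilding the layer with fewer output channels to realize dense-compute speedups:

python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Zero out the 25% of output channels (dim=0) with the smallest L2 norm.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Whole filters are now zero; counting them confirms the structured pattern.
zeroed = (conv.weight.flatten(1).abs().sum(dim=1) == 0).sum().item()
print(f"zeroed filters: {zeroed} / 128")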

Table 10.2: Pruning types ranked by hardware friendliness.

| Pruning Type | Granularity | Compression Ratio | Hardware Acceleration | Accuracy Impact |
|---|---|---|---|---|
| Weight (unstructured) | Individual weights | High (10-100x) | Requires sparse hardware | Low at moderate sparsity |
| Channel (CNN) | Entire conv filters | Moderate (2-5x) | Standard dense ops | Low-moderate |
| Head (Transformer) | Attention heads | Moderate (1.5-3x) | Standard dense ops | Low for redundant heads |
| Layer (depth) | Entire layers | Proportional to removed layers | Standard dense ops | Moderate-high |
| 2:4 Structured | Two of every four weights | 2x | NVIDIA Ampere+ tensor cores | Very low |

Prefer Structured Pruning for Deployment

Unless you have access to sparse hardware, prefer structured pruning (channel or head pruning) for deployment. A 3x speedup from structured pruning on standard hardware is more valuable than a theoretical 10x from unstructured pruning that requires special support. NVIDIA's 2:4 sparsity pattern is a good middle ground — it achieves 2x speedup on Ampere GPUs with minimal accuracy loss.

The Lottery Ticket Hypothesis

The Lottery Ticket Hypothesis, proposed by Frankle and Carbin, suggests that dense networks contain sparse subnetworks (winning tickets) that can achieve comparable accuracy when trained in isolation. This finding has profound implications for understanding why neural networks work and has inspired new approaches to training efficient models from the start.

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.

Jonathan Frankle and Michael Carbin, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks," ICLR 2019
Practical Implications of the Lottery Ticket Hypothesis

While finding winning tickets requires training the full dense network first (making it impractical as a direct efficiency technique), the hypothesis has inspired methods that prune early in training or at initialization. Techniques like early-bird tickets, SNIP, and GraSP attempt to identify important weights before or very early in training, potentially enabling efficient models from the start.

Pruning

The removal of unnecessary neural network parameters to create smaller, faster models while maintaining accuracy.

Lottery Ticket Hypothesis

The theory that dense neural networks contain sparse subnetworks that, when trained in isolation from random initialization, can match the accuracy of the full network.

03 Knowledge Distillation

Knowledge distillation transfers the learned representations from a large, complex teacher model to a smaller, more efficient student model. Rather than training the student directly on hard labels, distillation trains it to match the soft probability distributions produced by the teacher, which contain richer information about class relationships.

Definition

Knowledge Distillation

A model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. The student learns from the teacher's soft probability distributions, which encode richer information than hard labels (e.g., "this cat image is 0.8 cat, 0.15 tiger, 0.05 dog" reveals class similarity structure).

The Distillation Loss

The standard distillation loss combines the student's cross-entropy loss on ground truth labels with a KL divergence loss between the teacher's and student's softened output distributions. The temperature parameter controls the softness of the distributions: higher temperatures produce softer distributions that transfer more information but may introduce noise.

\mathcal{L} = \alpha \cdot \text{CE}(y, \sigma(z_s)) + (1 - \alpha) \cdot T^2 \cdot \text{KL}\left(\sigma\left(\frac{z_t}{T}\right), \sigma\left(\frac{z_s}{T}\right)\right)
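A direct PyTorch translation of this loss; the temperature and mixing weight below are illustrative defaults rather than recommendations from this chapter:

python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: standard cross-entropy against ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Usage with random stand-in tensors:
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
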
Why Soft Labels Work

A hard label says "this is a cat" (one-hot: [1, 0, 0]). The teacher's soft prediction says "this is 80% cat, 15% tiger, 5% dog." The soft label reveals that cats and tigers are visually similar while dogs are more different. This relational information helps the student learn a better feature space than hard labels alone, especially when training data is limited.

Choosing the Temperature

Temperature T controls how "soft" the probability distributions are. T=1 gives the standard softmax. T=3-5 is a common starting point for distillation. Higher T spreads probability mass more evenly, transferring more information about class relationships. Too high T makes the distribution nearly uniform and uninformative. Tune T on a validation set.

Feature-Based and Attention Distillation

Feature-based distillation extends the concept by matching intermediate representations rather than just outputs. The student is trained to mimic the teacher's internal feature maps at selected layers. Attention transfer, which matches attention maps between teacher and student, is another popular variant for Transformer models.

Table 10.3: Knowledge distillation variants and their applications.

| Distillation Variant | What Is Matched | Best For |
|---|---|---|
| Output (logit) distillation | Final class probabilities | Classification tasks, simplest to implement |
| Feature distillation | Intermediate layer activations | Dense prediction (detection, segmentation) |
| Attention distillation | Attention weight matrices | Transformer models (BERT, GPT) |
| Relation distillation | Pairwise sample similarities | Metric learning, embeddings |
| Self-distillation | Earlier checkpoints of same model | No teacher needed, regularization effect |

Distillation for Large Language Models

Distillation has become a critical technique in the large language model era. Models like DistilBERT achieve 97% of BERT's performance with 40% fewer parameters by distilling during pre-training. For deployment on resource-constrained devices, distillation provides a principled way to compress knowledge from models that are too large to serve directly.

DistilBERT: A Distillation Success Story

DistilBERT was created by distilling BERT-base (110M params) into a 66M parameter student during the pre-training phase. The result retains 97% of BERT's language understanding (as measured on GLUE benchmarks) while being 40% smaller and 60% faster. This demonstrates that much of a large model's capacity is redundant and can be transferred to a significantly smaller model.

Distillation Limitations

Distillation is not a silver bullet. The student rarely exceeds the teacher's accuracy; in practice it can usually only approach it. For tasks where the teacher itself performs poorly, distillation won't help. Additionally, distillation requires running inference on the teacher for all training data, which can be expensive for very large teacher models.

Knowledge Distillation

A model compression technique that trains a smaller student model to mimic the behavior of a larger teacher model, transferring learned representations.

Temperature Scaling

A parameter in distillation that controls the softness of probability distributions, with higher temperatures revealing more information about class relationships.

04 Operator Fusion and Graph Optimization

Operator fusion combines multiple sequential operations into a single fused kernel, reducing memory traffic and kernel launch overhead. For example, fusing a convolution, batch normalization, and ReLU activation into a single operation avoids writing and reading intermediate results from memory, significantly improving performance.

Definition

Operator Fusion

A compiler optimization that merges multiple sequential operations (e.g., convolution + batch normalization + ReLU) into a single kernel execution. This eliminates intermediate memory reads and writes, reducing memory bandwidth consumption and kernel launch overhead. Fusion is the single most impactful graph-level optimization for inference.

Conv-BN-ReLU Fusion

Without fusion: Conv writes output to memory (1 write), BN reads it (1 read), BN writes result (1 write), ReLU reads it (1 read), ReLU writes final output (1 write) = 5 memory transfers. With fusion: Conv-BN-ReLU runs as one kernel, reading input once and writing output once = 2 memory transfers. For memory-bound operations, this alone can yield a 2-3x speedup.
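A minimal sketch of this fusion at the framework level using PyTorch's fuse_modules utility, which folds the BatchNorm parameters into the convolution weights and merges the ReLU; the toy module below is a placeholder:

python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

model = ConvBlock().eval()  # BN folding for inference requires eval mode

# Fold BN into the conv weights and merge the ReLU into a single module.
fused = torch.ao.quantization.fuse_modules(model, [["conv", "bn", "relu"]])
print(fused)  # conv becomes a fused ConvReLU2d; bn and relu become Identity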

Graph-Level Optimizations

Graph-level optimizations analyze the computational graph to identify fusion opportunities, eliminate redundant operations, and optimize memory allocation.

  • Constant folding — Pre-compute operations whose inputs are all known at compile time (e.g., shape calculations).
  • Dead code elimination — Remove computations whose outputs are never used downstream.
  • Layout transformation — Convert tensor memory layouts (NCHW vs. NHWC) to match hardware preferences.
  • Common subexpression elimination — Detect and share identical computations to avoid redundant work.
  • Memory planning — Schedule tensor lifetimes to reuse memory, reducing peak memory consumption.
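Most of these passes are applied automatically by inference runtimes. As one hedged example, ONNX Runtime exposes a graph optimization level that enables constant folding, node fusion, and layout transformations; the model path below is a placeholder:

python
import onnxruntime as ort

# Enable ONNX Runtime's full set of graph-level optimizations.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# "model.onnx" is a placeholder path to an exported model.
session = ort.InferenceSession("model.onnx", sess_options=opts)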

ML Compilers

ML compilers like TensorRT, Apache TVM, and XLA automate these optimizations for specific hardware targets. TensorRT, for example, can automatically fuse operations, select optimal kernel implementations, and apply quantization for NVIDIA GPUs. These compilers can achieve 2-5x speedup over naive framework execution.

Table 10.4: Major ML inference compilers and their target platforms.

| Compiler | Developer | Target Hardware | Key Strengths |
|---|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs | Best GPU optimization, INT8 quantization, layer fusion |
| Apache TVM | Apache Foundation | CPUs, GPUs, accelerators | Hardware-agnostic, auto-tuning, broad target support |
| XLA | Google | TPUs, GPUs, CPUs | Whole-program optimization, JAX/TF integration |
| ONNX Runtime | Microsoft | Cross-platform | Cross-framework, execution provider architecture |
| OpenVINO | Intel | Intel CPUs, GPUs, VPUs | Best Intel hardware optimization |

Compiler Speedups Are Hardware-Specific

The speedup from an ML compiler depends heavily on the target hardware. TensorRT achieves its best speedups on NVIDIA GPUs because it has deep knowledge of GPU architecture. The same model compiled with TVM for an ARM CPU may see a different speedup profile. Always benchmark on the actual deployment hardware.

Custom Kernels

Custom kernel development using CUDA, Triton, or other low-level languages enables optimizations that are impossible at the graph level. Flash Attention is a prominent example: by rewriting attention as a single fused kernel that is aware of the GPU memory hierarchy, it achieves both lower memory usage and faster execution than the standard multi-operation implementation.

python
import torch
import triton
import triton.language as tl

# Triton: custom fused add + ReLU kernel
@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n,
                          BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    result = tl.maximum(x + y, 0.0)  # fused add + ReLU: read x and y once, write result once
    tl.store(out_ptr + offsets, result, mask=mask)

# Launch: one program instance per BLOCK-sized chunk of the tensors.
x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_relu_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

When to Write Custom Kernels

Custom kernel development is high-effort and hardware-specific. Reserve it for operations that dominate your model's runtime and cannot be adequately optimized by existing compilers. Profile first, identify the top bottleneck operations, and check whether a compiler (TensorRT, TVM) already handles them before investing in custom kernel development.

Operator Fusion

The combination of multiple sequential neural network operations into a single computation kernel, reducing memory traffic and overhead.

TensorRT

NVIDIA's deep learning inference optimizer and runtime that automatically applies graph optimizations, quantization, and kernel selection for GPU deployment.

05 Combined Optimization Strategies

In practice, the most effective optimization pipelines combine multiple techniques rather than relying on a single approach. A typical pipeline might start with architecture-level efficiency (choosing an efficient base architecture), apply knowledge distillation during training, follow with quantization-aware training, and finish with operator fusion during deployment.

Definition

Optimization Pipeline

A systematic sequence of complementary model optimization techniques applied in a deliberate order to maximize cumulative efficiency gains. The pipeline typically flows from architecture design through training-time optimizations (distillation, QAT) to deployment-time optimizations (pruning, quantization, fusion).

Order of Operations Matters

The order of optimization matters because techniques interact. Pruning before quantization is generally better than the reverse because pruning can remove parameters that would have caused quantization errors. Similarly, distillation-then-quantization typically outperforms quantization-then-distillation because the distilled model starts from a stronger quality baseline.

  1. Architecture design — Choose an efficient base architecture (MobileNet, EfficientNet) or use NAS.
  2. Knowledge distillation — Train a smaller student from a larger teacher model.
  3. Pruning — Remove redundant parameters (structured pruning preferred for hardware compatibility).
  4. Quantization-aware training — Fine-tune with simulated quantization for the target bit width.
  5. Post-training quantization — Apply final quantization calibration if QAT was not used.
  6. Operator fusion and graph optimization — Use an ML compiler (TensorRT, TVM) for the target hardware.
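A compressed sketch of how two of these stages compose on a toy model (structured pruning from step 3, then post-training dynamic quantization as a stand-in for step 5); distillation, QAT, and compiler fusion are omitted for brevity, and all sizes and ratios are illustrative:

python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Step 3: structured pruning of 25% of output neurons in the first layer.
prune.ln_structured(model[0], name="weight", amount=0.25, n=2, dim=0)
prune.remove(model[0], "weight")

# Step 5: post-training dynamic quantization of the remaining Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
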
Interaction Effects

Optimization techniques do not compose linearly. If quantization gives 4x speedup alone and pruning gives 2x alone, the combination does not guarantee 8x. Pruning may create weight distributions that are harder to quantize, or both may target the same bottleneck (memory bandwidth), yielding diminishing returns when combined. Always measure the combined effect empirically.

Benchmarking on Target Hardware

Measuring the compound effect of optimizations requires careful benchmarking on the target hardware. Theoretical speedup predictions (e.g., "4x from INT8 quantization") rarely match real-world measurements due to memory bottlenecks, kernel launch overhead, and data loading constraints. End-to-end profiling on the target device is essential.

End-to-End Profiling Checklist

When benchmarking: (1) Always warm up the model with a few inferences before measuring. (2) Measure end-to-end latency including preprocessing and postprocessing. (3) Test with realistic input sizes and batch sizes. (4) Report both median and p99 latency. (5) Monitor GPU/CPU utilization to identify bottlenecks. (6) Test under realistic concurrency loads if the model will serve multiple requests.
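A minimal CPU-side benchmarking sketch following this checklist (warm-up, repeated end-to-end measurements, median and p99 reporting); on a GPU you would additionally synchronize the device before reading the timer, and the model and input shape here are placeholders:

python
import time
import numpy as np
import torch

model = torch.nn.Linear(512, 10).eval()
x = torch.randn(32, 512)

# (1) Warm up before measuring.
with torch.no_grad():
    for _ in range(10):
        model(x)

# Measure repeated inferences; real benchmarks should also include
# preprocessing/postprocessing and realistic batch sizes.
latencies = []
with torch.no_grad():
    for _ in range(200):
        start = time.perf_counter()
        model(x)
        latencies.append((time.perf_counter() - start) * 1000)  # ms

print(f"median: {np.percentile(latencies, 50):.3f} ms, "
      f"p99: {np.percentile(latencies, 99):.3f} ms")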

Deployment-Specific Strategies

Table 10.5: Recommended optimization pipelines for different deployment targets.

| Deployment Target | Recommended Pipeline | Typical Speedup |
|---|---|---|
| Cloud GPU | FP16 + operator fusion (TensorRT) | 2-4x over naive PyTorch |
| Cloud GPU (cost-sensitive) | INT8 quantization + fusion | 3-6x |
| Mobile (Android/iOS) | Distillation + INT8 QAT + TFLite | 5-20x vs. original model |
| Edge (Jetson, Coral) | INT8 PTQ + TensorRT / Edge TPU compiler | 4-10x |
| Microcontroller (TinyML) | NAS + INT4/binary + hand-tuned kernels | 10-100x (required to fit) |

Full Pipeline in Practice

Deploying a BERT model for real-time search ranking: Start with DistilBERT (distillation: 66M params, ~1.5x faster). Apply structured head pruning to remove 4 of 12 attention heads (~1.3x faster). Use INT8 QAT with TensorRT (~2x faster). Final result: 4x total speedup, 3x memory reduction, <1% quality loss on the ranking task. Latency drops from 12ms to 3ms on an A10 GPU.

Optimization Pipeline

A systematic sequence of complementary model optimization techniques applied in a specific order to maximize cumulative efficiency gains.

End-to-End Profiling

Measuring the complete inference time on target hardware, including all preprocessing, computation, and postprocessing, to accurately assess optimization impact.

Key Takeaways

  1. Quantization to INT8 can achieve near-lossless compression with 4x size reduction and significant speedup on hardware with integer units.
  2. Structured pruning is more hardware-friendly than unstructured pruning, though unstructured pruning achieves higher compression ratios.
  3. Knowledge distillation transfers rich information through soft probability distributions, not just hard label predictions.
  4. Operator fusion can provide 2-5x speedups by reducing memory traffic between sequential operations.
  5. Combined optimization pipelines achieve better results than any single technique, but the order of application matters.
  6. Always benchmark optimizations end-to-end on target hardware, as theoretical speedups rarely match real-world measurements.
