Model Optimizations
Deep dive into quantization, pruning, knowledge distillation, and operator fusion. Practical techniques for shrinking models without sacrificing accuracy.
- Compare post-training quantization and quantization-aware training in terms of accuracy and deployment effort
- Explain structured vs. unstructured pruning and their implications for hardware acceleration
- Implement a knowledge distillation pipeline to compress a large teacher model into a smaller student
- Analyze how operator fusion reduces memory traffic and improves inference throughput
- Design a combined optimization pipeline that sequences quantization, pruning, and distillation effectively
01 Quantization
Quantization reduces the numerical precision of model weights and activations from floating-point (typically FP32) to lower-bit representations (INT8, INT4, or even binary). This directly reduces model size, memory bandwidth requirements, and computational cost, while leveraging integer arithmetic units that are faster and more energy-efficient on most hardware.
Quantization
The process of mapping continuous or high-precision floating-point values to a discrete set of lower-precision values. For neural networks, this means converting FP32 weights and activations to INT8, INT4, or other reduced formats. The quantization function maps a real value x to an integer q via: q = round(x / scale) + zero_point.
q = \text{round}\left(\frac{x}{s}\right) + z, \qquad \hat{x} = s \cdot (q - z)
\text{step size} = \frac{\text{range}}{2^{b} - 1}, \qquad \text{levels} = 2^{b}
Figure: Quantization Accuracy vs. Model Size Trade-off
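The mapping can be sketched in a few lines of NumPy. This is a minimal illustration of per-tensor affine quantization; the `quantize`/`dequantize` helper names are ours, not a library API:

```python
import numpy as np

def quantize(x, num_bits=8):
    # Affine quantization: map the observed FP32 range onto the signed
    # integer grid [-2^(b-1), 2^(b-1) - 1].
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # x_hat = s * (q - z): recover an approximation of the original value
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(10_000).astype(np.float32)
q, s, z = quantize(x)
x_hat = dequantize(q, s, z)
# Reconstruction error is bounded by roughly one quantization step
assert np.abs(x - x_hat).max() <= s
```

Note the 4x size reduction: each FP32 value (4 bytes) becomes one INT8 value (1 byte), plus a single scale and zero-point per tensor (or per channel).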
Post-Training Quantization (PTQ)
Post-training quantization converts a pre-trained FP32 model to lower precision without retraining. The simplest approach uniformly maps the FP32 value range to the target precision. More sophisticated methods like per-channel quantization and dynamic range adjustment minimize the accuracy impact. PTQ is attractive because it requires no additional training, but it can struggle with models that have wide activation ranges.
Always try post-training quantization first. INT8 PTQ with per-channel weight quantization typically loses less than 1% accuracy on classification models and requires only a small calibration dataset (100-1000 samples). If PTQ results are unsatisfactory, then invest in quantization-aware training.
Quantization-Aware Training (QAT)
Quantization-aware training simulates quantization during training, allowing the model to adapt its weights to the constraints of lower precision. During training, fake quantization nodes insert quantization and dequantization operations in the forward pass while using straight-through estimators for gradients. QAT typically achieves better accuracy than PTQ, especially at very low bit widths.
Quantization is a step function with zero gradients almost everywhere, which would prevent backpropagation. The straight-through estimator (STE) solves this by passing gradients through the quantization step unchanged during the backward pass, as if the quantization were an identity function. This approximation works surprisingly well in practice.
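The STE idea fits in a few lines. A NumPy sketch with a hand-written forward and backward (function names are ours; in a real framework this would be a custom autograd function):

```python
import numpy as np

def fake_quant_forward(x, scale):
    # Forward: round onto the integer grid, clip to INT8 range, dequantize
    return np.clip(np.round(x / scale), -128, 127) * scale

def fake_quant_backward_ste(grad_out):
    # The true derivative of round() is zero almost everywhere, which would
    # kill all gradients. STE pretends the quantizer is the identity and
    # passes the upstream gradient through unchanged.
    return grad_out

x = np.array([0.23, -0.51, 1.07], dtype=np.float32)
y = fake_quant_forward(x, scale=0.1)               # values on a 0.1 grid
grad_x = fake_quant_backward_ste(np.ones_like(x))  # all ones, not all zeros
```

With the true gradient, every weight update would be zero and training would stall; with STE, the weights keep receiving useful gradient signal while the forward pass sees quantized values.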
| Method | Accuracy Impact | Required Effort | Best For |
|---|---|---|---|
| PTQ (INT8, per-tensor) | Moderate (1-3% drop) | Minutes (calibration only) | Quick deployment with acceptable accuracy |
| PTQ (INT8, per-channel) | Low (<1% drop) | Minutes (calibration only) | Most models, recommended default |
| QAT (INT8) | Very low (<0.5% drop) | Full training cycle | When PTQ accuracy is insufficient |
| QAT (INT4) | Low-moderate (1-2% drop) | Full training cycle | Mobile/edge with strict size constraints |
| GPTQ (4-bit LLM) | Low (perplexity +0.1-0.5) | Hours (calibration) | Running LLMs on consumer GPUs |
Table 10.1: Comparison of quantization approaches.
LLM Quantization: GPTQ, AWQ, and Beyond
Recent advances in quantization for large language models have pushed the boundaries to 4-bit and even 2-bit precision. Techniques like GPTQ, AWQ, and SqueezeLLM use sophisticated calibration methods to maintain quality at extreme compression ratios. These methods have made it possible to run large language models on consumer hardware, democratizing access to powerful AI capabilities.
LLaMA-2 70B in FP16 requires 140 GB of memory — far beyond any consumer GPU. With 4-bit GPTQ quantization, it fits in 35 GB and can run on two RTX 4090 GPUs (24 GB each). With 4-bit AWQ and offloading, it can even run on a single GPU with system RAM spillover. Quality loss is minimal: perplexity increases by only ~0.3 points.
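The memory arithmetic behind these numbers is simple: bytes = parameters × bits / 8. A tiny helper to sanity-check footprints (illustrative only; weight memory ignores the KV cache and activations):

```python
def model_memory_gb(n_params, bits_per_param):
    # Weight memory only; KV cache and activations add more at runtime.
    return n_params * bits_per_param / 8 / 1e9

print(model_memory_gb(70e9, 16))  # 140.0 GB for a 70B model in FP16
print(model_memory_gb(70e9, 4))   # 35.0 GB at 4-bit
```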
While INT8 quantization is nearly lossless for most models, aggressive quantization (INT4, INT2) can cause significant accuracy degradation, especially for small models and tasks requiring fine-grained numerical precision. Always evaluate quantized model accuracy on your specific task before deploying — aggregate metrics can hide task-specific failures.
Quantization
The process of reducing the numerical precision of model parameters and activations to lower bit widths, reducing model size and improving inference speed.
Quantization-Aware Training
A training procedure that simulates quantization during training, enabling the model to learn representations that are robust to lower precision.
02 Pruning
Pruning removes unnecessary parameters from neural networks, creating sparse models that require less computation and memory. The key insight is that neural networks are typically over-parameterized, and many weights contribute little to the output. Identifying and removing these redundant parameters can yield significant efficiency gains.
Pruning
The systematic removal of neural network parameters (weights, neurons, channels, or attention heads) that contribute least to model output. Pruning creates sparser, smaller models that require less computation and memory while preserving most of the original model's accuracy.
Unstructured Pruning
Unstructured pruning removes individual weights based on magnitude or other criteria, creating irregular sparsity patterns. While this can achieve high compression ratios, the irregular memory access patterns make it difficult to accelerate on standard hardware. Specialized sparse computation libraries and hardware are needed to realize the theoretical speedups.
A model pruned to 90% sparsity has 10x fewer nonzero weights, suggesting a 10x speedup. In practice, on standard GPUs, there is often no speedup at all — the sparse matrix is stored in a compressed format, but the irregular access pattern prevents the GPU from achieving good utilization. Realizing speedups from sparsity requires specialized support: sparse BLAS libraries for unstructured patterns, or hardware features like the sparse tensor cores on NVIDIA's A100, which accelerate only the 2:4 structured pattern.
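Magnitude pruning itself is straightforward — the hard part is realizing speedups afterward. A minimal NumPy sketch of unstructured pruning (the `magnitude_prune` helper name is ours):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    # Zero out the smallest-magnitude weights, keeping the top (1 - sparsity)
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w), k, axis=None)[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
# ~10% of weights survive; the rest are exactly zero
```

The result is dense storage with mostly-zero entries — converting it to a compressed sparse format, and finding hardware that runs it faster, is where the practical difficulty lies.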
Structured Pruning
Structured pruning removes entire neurons, channels, or attention heads, producing models that maintain dense computation patterns. This is more hardware-friendly because standard dense matrix operations can be used on the smaller model. Channel pruning for CNNs and head pruning for Transformers are the most common structured pruning approaches.
| Pruning Type | Granularity | Compression Ratio | Hardware Acceleration | Accuracy Impact |
|---|---|---|---|---|
| Weight (unstructured) | Individual weights | High (10-100x) | Requires sparse hardware | Low at moderate sparsity |
| Channel (CNN) | Entire conv filters | Moderate (2-5x) | Standard dense ops | Low-moderate |
| Head (Transformer) | Attention heads | Moderate (1.5-3x) | Standard dense ops | Low for redundant heads |
| Layer (depth) | Entire layers | Proportional to removed layers | Standard dense ops | Moderate-high |
| 2:4 Structured | Two of every four weights | 2x | NVIDIA Ampere+ tensor cores | Very low |
Table 10.2: Pruning types ranked by hardware friendliness.
Unless you have access to sparse hardware, prefer structured pruning (channel or head pruning) for deployment. A 3x speedup from structured pruning on standard hardware is more valuable than a theoretical 10x from unstructured pruning that requires special support. NVIDIA's 2:4 sparsity pattern is a good middle ground — it achieves 2x speedup on Ampere GPUs with minimal accuracy loss.
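The 2:4 pattern is easy to express: in every contiguous group of four weights, keep the two largest magnitudes and zero the other two. A NumPy sketch (the helper name is ours; real deployments rely on TensorRT/cuSPARSELt to exploit the pattern on tensor cores):

```python
import numpy as np

def prune_2_of_4(w):
    # 2:4 structured sparsity: zero the 2 smallest-magnitude weights in
    # every group of 4 along the flattened array (requires w.size % 4 == 0).
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.random.randn(64, 64).astype(np.float32)
w24 = prune_2_of_4(w)
# Every group of 4 now has at most 2 nonzeros -> exactly 50% sparsity
```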
The Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis, proposed by Frankle and Carbin, suggests that dense networks contain sparse subnetworks (winning tickets) that can achieve comparable accuracy when trained in isolation. This finding has profound implications for understanding why neural networks work and has inspired new approaches to training efficient models from the start.
A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.
While finding winning tickets requires training the full dense network first (making it impractical as a direct efficiency technique), the hypothesis has inspired methods that prune early in training or at initialization. Techniques like early-bird tickets, SNIP, and GraSP attempt to identify important weights before or very early in training, potentially enabling efficient models from the start.
Pruning
The removal of unnecessary neural network parameters to create smaller, faster models while maintaining accuracy.
Lottery Ticket Hypothesis
The theory that dense neural networks contain sparse subnetworks that, when trained in isolation from random initialization, can match the accuracy of the full network.
03 Knowledge Distillation
Knowledge distillation transfers the learned representations from a large, complex teacher model to a smaller, more efficient student model. Rather than training the student directly on hard labels, distillation trains it to match the soft probability distributions produced by the teacher, which contain richer information about class relationships.
Knowledge Distillation
A model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. The student learns from the teacher's soft probability distributions, which encode richer information than hard labels (e.g., "this cat image is 0.8 cat, 0.15 tiger, 0.05 dog" reveals class similarity structure).
The Distillation Loss
The standard distillation loss combines the student's cross-entropy loss on ground truth labels with a KL divergence loss between the teacher's and student's softened output distributions. The temperature parameter controls the softness of the distributions: higher temperatures produce softer distributions that transfer more information but may introduce noise.
\mathcal{L} = \alpha \cdot \text{CE}(y, \sigma(z_s)) + (1 - \alpha) \cdot T^2 \cdot \text{KL}\left(\sigma\left(\frac{z_t}{T}\right) \,\Big\|\, \sigma\left(\frac{z_s}{T}\right)\right)
A hard label says "this is a cat" (one-hot: [1, 0, 0]). The teacher's soft prediction says "this is 80% cat, 15% tiger, 5% dog." The soft label reveals that cats and tigers are visually similar while dogs are more different. This relational information helps the student learn a better feature space than hard labels alone, especially when training data is limited.
Temperature T controls how "soft" the probability distributions are. T=1 gives the standard softmax. T=3-5 is a common starting point for distillation. Higher T spreads probability mass more evenly, transferring more information about class relationships. Too high T makes the distribution nearly uniform and uninformative. Tune T on a validation set.
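The combined loss can be written directly from the formula. A NumPy sketch (function names are ours; in a framework you would typically combine a cross-entropy loss with a KL-divergence loss on temperature-scaled logits):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T - (z / T).max(axis=-1, keepdims=True)  # numerically stabilized
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(z_s, z_t, y, T=4.0, alpha=0.5):
    # Hard-label term: standard cross-entropy against ground truth
    p_s = softmax(z_s)
    ce = -np.log(p_s[np.arange(len(y)), y]).mean()
    # Soft-label term: KL(teacher || student) at temperature T, scaled by
    # T^2 so its gradient magnitude stays comparable across temperatures
    p_t, p_s_T = softmax(z_t, T), softmax(z_s, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s_T))).sum(axis=-1).mean()
    return alpha * ce + (1 - alpha) * T ** 2 * kl

z_t = np.array([[5.0, 2.0, 0.5]])  # teacher logits: cat, tiger, dog
z_s = np.array([[3.0, 1.0, 0.2]])  # student logits
loss = distillation_loss(z_s, z_t, y=np.array([0]))
```

A student whose logits match the teacher's exactly zeroes out the KL term, leaving only the hard-label cross-entropy.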
Feature-Based and Attention Distillation
Feature-based distillation extends the concept by matching intermediate representations rather than just outputs. The student is trained to mimic the teacher's internal feature maps at selected layers. Attention transfer, which matches attention maps between teacher and student, is another popular variant for Transformer models.
| Distillation Variant | What Is Matched | Best For |
|---|---|---|
| Output (logit) distillation | Final class probabilities | Classification tasks, simplest to implement |
| Feature distillation | Intermediate layer activations | Dense prediction (detection, segmentation) |
| Attention distillation | Attention weight matrices | Transformer models (BERT, GPT) |
| Relation distillation | Pairwise sample similarities | Metric learning, embeddings |
| Self-distillation | Earlier checkpoints of same model | No teacher needed, regularization effect |
Table 10.3: Knowledge distillation variants and their applications.
Distillation for Large Language Models
Distillation has become a critical technique in the large language model era. Models like DistilBERT achieve 97% of BERT's performance with 40% fewer parameters by distilling during pre-training. For deployment on resource-constrained devices, distillation provides a principled way to compress knowledge from models that are too large to serve directly.
DistilBERT was created by distilling BERT-base (110M params) into a 66M parameter student during the pre-training phase. The result retains 97% of BERT's language understanding (as measured on GLUE benchmarks) while being 40% smaller and 60% faster. This demonstrates that much of a large model's capacity is redundant and can be transferred to a significantly smaller model.
Distillation is not a silver bullet. The student rarely exceeds the teacher's accuracy, so for tasks where the teacher itself performs poorly, distillation won't help. Additionally, distillation requires running teacher inference on all the training data, which can be expensive for very large teacher models.
Knowledge Distillation
A model compression technique that trains a smaller student model to mimic the behavior of a larger teacher model, transferring learned representations.
Temperature Scaling
A parameter in distillation that controls the softness of probability distributions, with higher temperatures revealing more information about class relationships.
04 Operator Fusion and Graph Optimization
Operator fusion combines multiple sequential operations into a single fused kernel, reducing memory traffic and kernel launch overhead. For example, fusing a convolution, batch normalization, and ReLU activation into a single operation avoids writing and reading intermediate results from memory, significantly improving performance.
Operator Fusion
A compiler optimization that merges multiple sequential operations (e.g., convolution + batch normalization + ReLU) into a single kernel execution. This eliminates intermediate memory reads and writes, reducing memory bandwidth consumption and kernel launch overhead. Fusion is the single most impactful graph-level optimization for inference.
Without fusion: Conv reads the input and writes its output to memory, BN reads it and writes its result, and ReLU reads that and writes the final output — six full feature-map transfers. With fusion: Conv-BN-ReLU runs as one kernel, reading the input once and writing the output once — two transfers. For memory-bound operations, this alone can yield a 2-3x speedup.
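The simplest fusion — folding BatchNorm into the preceding convolution — can even be done with plain weight algebra at export time, because BN is an affine transform per output channel. A sketch, assuming conv weights of shape `(out_ch, in_ch, kh, kw)` (the helper name is ours):

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(conv(x)) = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    # is itself a conv with per-output-channel rescaled weights and bias.
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

# Check on a 1x1 conv, which is just a matmul over channels
rng = np.random.default_rng(0)
out_ch, in_ch = 4, 3
w = rng.standard_normal((out_ch, in_ch, 1, 1))
b = rng.standard_normal(out_ch)
gamma, beta = rng.random(out_ch) + 0.5, rng.standard_normal(out_ch)
mean, var = rng.standard_normal(out_ch), rng.random(out_ch) + 0.1
x = rng.standard_normal(in_ch)

y_unfused = gamma * ((w[:, :, 0, 0] @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_bn_into_conv(w, b, gamma, beta, mean, var)
y_fused = wf[:, :, 0, 0] @ x + bf
assert np.allclose(y_unfused, y_fused)
```

After folding, the BN layer disappears entirely from the inference graph; fusing the ReLU on top is then a matter of applying the activation inside the same kernel, which is what compilers like TensorRT do automatically.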
Graph-Level Optimizations
Graph-level optimizations analyze the computational graph to identify fusion opportunities, eliminate redundant operations, and optimize memory allocation.
- Constant folding — Pre-compute operations whose inputs are all known at compile time (e.g., shape calculations).
- Dead code elimination — Remove computations whose outputs are never used downstream.
- Layout transformation — Convert tensor memory layouts (NCHW vs. NHWC) to match hardware preferences.
- Common subexpression elimination — Detect and share identical computations to avoid redundant work.
- Memory planning — Schedule tensor lifetimes to reuse memory, reducing peak memory consumption.
ML Compilers
ML compilers like TensorRT, Apache TVM, and XLA automate these optimizations for specific hardware targets. TensorRT, for example, can automatically fuse operations, select optimal kernel implementations, and apply quantization for NVIDIA GPUs. These compilers can achieve 2-5x speedup over naive framework execution.
| Compiler | Developer | Target Hardware | Key Strengths |
|---|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs | Best GPU optimization, INT8 quantization, layer fusion |
| Apache TVM | Apache Foundation | CPUs, GPUs, accelerators | Hardware-agnostic, auto-tuning, broad target support |
| XLA | Google | TPUs, GPUs, CPUs | Whole-program optimization, JAX/TF integration |
| ONNX Runtime | Microsoft | Cross-platform | Cross-framework, execution provider architecture |
| OpenVINO | Intel | Intel CPUs, GPUs, VPUs | Best Intel hardware optimization |
Table 10.4: Major ML inference compilers and their target platforms.
The speedup from an ML compiler depends heavily on the target hardware. TensorRT achieves its best speedups on NVIDIA GPUs because it has deep knowledge of GPU architecture. The same model compiled with TVM for an ARM CPU may see a different speedup profile. Always benchmark on the actual deployment hardware.
Custom Kernels
Custom kernel development using CUDA, Triton, or other low-level languages enables optimizations that are impossible at the graph level. Flash Attention is a prominent example: by rewriting attention as a single fused kernel that is aware of the GPU memory hierarchy, it achieves both lower memory usage and faster execution than the standard multi-operation implementation.
# Triton: custom fused add + ReLU kernel
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n,
                          BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    result = tl.maximum(x + y, 0.0)  # fused add + ReLU
    tl.store(out_ptr + offsets, result, mask=mask)

Custom kernel development is high-effort and hardware-specific. Reserve it for operations that dominate your model's runtime and cannot be adequately optimized by existing compilers. Profile first, identify the top bottleneck operations, and check whether a compiler (TensorRT, TVM) already handles them before investing in custom kernel development.
Operator Fusion
The combination of multiple sequential neural network operations into a single computation kernel, reducing memory traffic and overhead.
TensorRT
NVIDIA's deep learning inference optimizer and runtime that automatically applies graph optimizations, quantization, and kernel selection for GPU deployment.
05 Combined Optimization Strategies
In practice, the most effective optimization pipelines combine multiple techniques rather than relying on a single approach. A typical pipeline might start with architecture-level efficiency (choosing an efficient base architecture), apply knowledge distillation during training, follow with quantization-aware training, and finish with operator fusion during deployment.
Optimization Pipeline
A systematic sequence of complementary model optimization techniques applied in a deliberate order to maximize cumulative efficiency gains. The pipeline typically flows from architecture design through training-time optimizations (distillation, QAT) to deployment-time optimizations (pruning, quantization, fusion).
Order of Operations Matters
The order of optimization matters because techniques interact. Pruning before quantization is generally better than the reverse because pruning can remove parameters that would have caused quantization errors. Similarly, distillation-then-quantization typically outperforms quantization-then-distillation because the distilled model starts from a stronger quality baseline.
- Architecture design — Choose an efficient base architecture (MobileNet, EfficientNet) or use NAS.
- Knowledge distillation — Train a smaller student from a larger teacher model.
- Pruning — Remove redundant parameters (structured pruning preferred for hardware compatibility).
- Quantization-aware training — Fine-tune with simulated quantization for the target bit width.
- Post-training quantization — Apply final quantization calibration if QAT was not used.
- Operator fusion and graph optimization — Use an ML compiler (TensorRT, TVM) for the target hardware.
Optimization techniques do not compose linearly. If quantization gives 4x speedup alone and pruning gives 2x alone, the combination does not guarantee 8x. Pruning may create weight distributions that are harder to quantize, or both may target the same bottleneck (memory bandwidth), yielding diminishing returns when combined. Always measure the combined effect empirically.
Benchmarking on Target Hardware
Measuring the compound effect of optimizations requires careful benchmarking on the target hardware. Theoretical speedup predictions (e.g., "4x from INT8 quantization") rarely match real-world measurements due to memory bottlenecks, kernel launch overhead, and data loading constraints. End-to-end profiling on the target device is essential.
When benchmarking: (1) Always warm up the model with a few inferences before measuring. (2) Measure end-to-end latency including preprocessing and postprocessing. (3) Test with realistic input sizes and batch sizes. (4) Report both median and p99 latency. (5) Monitor GPU/CPU utilization to identify bottlenecks. (6) Test under realistic concurrency loads if the model will serve multiple requests.
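These rules can be baked into a small harness. A sketch using only the standard library (the `run_inference` callable stands in for whatever wraps your model end to end, including pre/post-processing; names are ours):

```python
import time
import statistics

def benchmark(run_inference, warmup=10, iters=200):
    # Warm-up first: JIT compilation, autotuning, and cold caches make
    # the first few inferences unrepresentative.
    for _ in range(warmup):
        run_inference()
    latencies_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()  # should include preprocessing and postprocessing
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    latencies_ms.sort()
    return {
        "median_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[int(iters * 0.99) - 1],
    }

# Stand-in workload; replace with a call into your deployed model
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
```

Reporting both median and p99 matters: the median reflects typical latency, while p99 exposes the tail behavior (GC pauses, thermal throttling, contention) that dominates user-perceived performance under load.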
Deployment-Specific Strategies
| Deployment Target | Recommended Pipeline | Typical Speedup |
|---|---|---|
| Cloud GPU | FP16 + operator fusion (TensorRT) | 2-4x over naive PyTorch |
| Cloud GPU (cost-sensitive) | INT8 quantization + fusion | 3-6x |
| Mobile (Android/iOS) | Distillation + INT8 QAT + TFLite | 5-20x vs. original model |
| Edge (Jetson, Coral) | INT8 PTQ + TensorRT / Edge TPU compiler | 4-10x |
| Microcontroller (TinyML) | NAS + INT4/binary + hand-tuned kernels | 10-100x (required to fit) |
Table 10.5: Recommended optimization pipelines for different deployment targets.
Deploying a BERT model for real-time search ranking: Start with DistilBERT (distillation: 66M params, ~1.5x faster). Apply structured head pruning to remove 4 of 12 attention heads (~1.3x faster). Use INT8 QAT with TensorRT (~2x faster). Final result: 4x total speedup, 3x memory reduction, <1% quality loss on the ranking task. Latency drops from 12ms to 3ms on an A10 GPU.
Optimization Pipeline
A systematic sequence of complementary model optimization techniques applied in a specific order to maximize cumulative efficiency gains.
End-to-End Profiling
Measuring the complete inference time on target hardware, including all preprocessing, computation, and postprocessing, to accurately assess optimization impact.
Key Takeaways
1. Quantization to INT8 can achieve near-lossless compression with 4x size reduction and significant speedup on hardware with integer units.
2. Structured pruning is more hardware-friendly than unstructured pruning, though unstructured pruning achieves higher compression ratios.
3. Knowledge distillation transfers rich information through soft probability distributions, not just hard label predictions.
4. Operator fusion can provide 2-5x speedups by reducing memory traffic between sequential operations.
5. Combined optimization pipelines achieve better results than any single technique, but the order of application matters.
6. Always benchmark optimizations end-to-end on target hardware, as theoretical speedups rarely match real-world measurements.