ML Systems · Part 3: Optimization · Chapter 11 · ~35 min

AI Acceleration

Examines hardware accelerators including GPUs, TPUs, and FPGAs. Covers hardware-aware optimization and the co-design of models and silicon.

Topics: GPU, TPU, FPGA, hardware-aware optimization, accelerator design
Learning Objectives
  • Explain GPU architecture fundamentals including streaming multiprocessors, tensor cores, and the memory hierarchy
  • Compare GPUs, TPUs, and FPGAs in terms of performance, flexibility, and cost for ML workloads
  • Analyze the memory wall bottleneck and its implications for ML system performance
  • Evaluate hardware-aware optimization techniques that tailor models to specific accelerator capabilities
  • Describe the hardware-software co-design approach and why it produces superior efficiency trade-offs

01 GPU Architecture for ML

Graphics Processing Units (GPUs) have become the dominant hardware platform for ML training and inference due to their massive parallelism. Unlike CPUs that have a few powerful cores optimized for sequential tasks, GPUs contain thousands of simpler cores designed to execute the same operation across many data elements simultaneously.

Definition

GPU (Graphics Processing Unit)

A massively parallel processor originally designed for graphics rendering but now the dominant platform for ML computation. Modern GPUs contain thousands of cores organized into streaming multiprocessors, with specialized tensor cores for matrix operations. The SIMT (Single Instruction, Multiple Threads) architecture excels at data-parallel workloads like neural network training.

GPU Architecture: SMs and Tensor Cores

NVIDIA GPUs organize computation into Streaming Multiprocessors (SMs), each containing multiple CUDA cores, tensor cores, and shared memory. Tensor cores are specialized matrix multiplication units that can compute a 4x4 matrix multiply-accumulate in a single cycle, providing enormous throughput for the dense linear algebra operations that dominate deep learning.

Tensor Cores vs. CUDA Cores

Standard CUDA cores perform one floating-point operation per cycle per core. A tensor core performs a 4x4 matrix multiply-accumulate (128 operations) in a single cycle. For matrix-heavy ML workloads, tensor cores provide 8-16x the throughput of CUDA cores. This is why deep learning performance has grown much faster than general-purpose GPU performance.
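To make the per-instruction arithmetic concrete, here is a back-of-the-envelope sketch based on the figures in the paragraph above (the per-SM note is a hedged explanation, not a datasheet value):

python
# Illustrative arithmetic only.
# A 4x4 matrix multiply-accumulate D = A*B + C performs 4*4*4 = 64 multiplies
# and 64 adds, i.e. 128 floating-point operations per tensor-core instruction.
mma_ops_per_cycle = 4 * 4 * 4 * 2       # 128 ops per cycle per tensor core
cuda_core_ops_per_cycle = 1             # one FP op per CUDA core per cycle
print(mma_ops_per_cycle // cuda_core_ops_per_cycle)   # 128x per unit
# The whole-SM advantage nets out to roughly 8-16x in practice, largely
# because an SM contains many more CUDA cores than tensor cores.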

Quick Check

Why do tensor cores provide 8-16x higher throughput than standard CUDA cores for ML workloads?

Think about how many operations each type of core completes in a single clock cycle.

Answer: A standard CUDA core performs one floating-point operation per cycle. A tensor core performs an entire 4x4 matrix multiply-accumulate (128 operations) in one cycle. This massive parallelism at the instruction level is what makes tensor cores so effective for the dense linear algebra in deep learning.

The Memory Hierarchy

Deeper Insight

Both CPUs and GPUs have a layered memory system from fast/small to slow/large, but on a GPU the ratio is extreme: registers deliver ~20 TB/s while HBM global memory delivers ~2 TB/s, an order-of-magnitude bandwidth gap compounded by a roughly 400x latency gap. The single most important optimization for GPU code is therefore keeping data in fast memory (shared memory and registers) and minimizing round-trips to HBM. Algorithms that seem mathematically equivalent can differ by 5-10x in speed based purely on how they navigate this hierarchy.

Think of it like your desk (registers) versus a warehouse across town (HBM). If you need a tool, grabbing it from your desk takes a second, but driving to the warehouse takes hours. Smart workers arrange their desk so they rarely need the warehouse trip. GPU programming is the same: organize computation to reuse data already in fast memory before fetching more from the slow global store.

The GPU memory hierarchy is critical for ML performance. Global memory (HBM) provides large capacity but limited bandwidth. Shared memory and L1 cache provide much higher bandwidth but limited capacity. Efficient ML kernels must carefully manage data movement between these levels to avoid becoming memory-bandwidth-bound.

Figure 11.1: GPU memory hierarchy (NVIDIA H100 reference) — from registers (fastest, smallest) through L1/shared memory (~128-228 KB/SM, ~19 TB/s) and L2 cache (~40-50 MB, ~4 TB/s) down to off-chip HBM global memory (80 GB, ~3.35 TB/s).
Table 11.1: NVIDIA A100 GPU memory hierarchy — note the order-of-magnitude bandwidth gap and roughly 400x latency gap between registers and HBM.

| Memory Level | Capacity (A100) | Bandwidth | Latency | Scope |
|---|---|---|---|---|
| Registers | ~256 KB per SM | ~20 TB/s | ~1 cycle | Per thread |
| Shared Memory / L1 | 192 KB per SM | ~19 TB/s | ~20 cycles | Per SM (block) |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles | Global |
| HBM2e (Global) | 80 GB | ~2 TB/s | ~400 cycles | Global |
| NVLink (inter-GPU) | N/A | 600 GB/s | Microseconds | Multi-GPU |

Exploiting the Memory Hierarchy

The key to GPU kernel performance is maximizing data reuse in fast memory (registers and shared memory) to minimize slow HBM accesses. This is exactly what Flash Attention does: it tiles the attention computation into blocks that fit in SRAM, avoiding materializing the full N x N attention matrix in HBM. Similar tiling strategies apply to any memory-bound operation.
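A minimal NumPy sketch of the tiling idea follows. It is illustrative only: real kernels do this blocking in CUDA with shared memory, and the tile size here is an arbitrary assumption.

python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: each (tile x tile) block of A and B is loaded
    once and reused for an entire block of output, mimicking how a GPU kernel
    stages tiles in shared memory to cut HBM traffic."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                # These two block "loads" are the data we want to reuse heavily
                a_blk = A[i:i + tile, k:k + tile]
                b_blk = B[k:k + tile, j:j + tile]
                acc += a_blk @ b_blk
            C[i:i + tile, j:j + tile] = acc
    return C

# Sanity check against the unblocked product
A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)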

Quick Check

On an NVIDIA A100 GPU, the bandwidth gap between registers and HBM (global memory) is approximately:

Compare the bandwidth values in the memory hierarchy table for registers versus HBM2e.

Answer: Registers provide ~20 TB/s while HBM provides ~2 TB/s, roughly a 10x bandwidth difference (the latency gap is larger still, around 400x). This is why techniques like Flash Attention, which tile computation to maximize data reuse in fast on-chip memory (registers and shared memory), yield such dramatic speedups for memory-bound operations.

Evolution of NVIDIA GPU Architectures

NVIDIA's GPU architecture has evolved rapidly to meet ML demands. Each generation has delivered roughly two to three times the ML throughput of its predecessor, a pace that has far outstripped Moore's Law for general-purpose computing.

Table 11.2: NVIDIA GPU architecture evolution for ML workloads (through 2025-2026).

| Architecture | Year | Key ML Features | FP16 Tensor TFLOPS |
|---|---|---|---|
| Volta (V100) | 2017 | First tensor cores, NVLink 2.0 | 125 |
| Ampere (A100) | 2020 | TF32, BF16, 2:4 sparsity, 3rd-gen tensor cores | 312 |
| Hopper (H100) | 2022 | FP8, Transformer Engine, NVLink 4.0 | 990 |
| Blackwell (B200) | 2024 | 2nd-gen Transformer Engine, NVLink 5.0 | ~2,250 |
| Blackwell Ultra (B300) | 2025 | HBM3e, enhanced sparsity, NVLink 5.0+ | ~4,500 |

990 TFLOPS — H100 peak FP16 tensor performance
2 TB/s — A100 HBM2e memory bandwidth
Peak TFLOPS vs. Achieved TFLOPS

Published TFLOPS numbers represent the theoretical peak under ideal conditions. Real ML workloads typically achieve 30-60% of peak due to memory bandwidth limits, kernel launch overhead, and time spent on non-tensor-core operations. An H100 rated at 990 TFLOPS FP16 might achieve 400-600 TFLOPS on a well-optimized Transformer training workload.
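As a hedged illustration of that gap, achieved throughput can be estimated from counted FLOPs and measured step time. Every number below (model size, tokens per step, step time) is an assumption chosen only to make the arithmetic concrete:

python
# Illustrative only: estimating achieved vs. peak tensor throughput.
peak_tflops = 990.0                      # H100 dense FP16 tensor peak (Table 11.2)
params = 8e9                             # assumed model size
tokens_per_step = 16_384                 # assumed tokens processed per GPU per step
flops_per_step = 6 * params * tokens_per_step   # ~6*N FLOPs/token training rule of thumb
step_time_s = 1.6                        # assumed measured step time
achieved_tflops = flops_per_step / step_time_s / 1e12
print(f"{achieved_tflops:.0f} TFLOPS achieved, "
      f"{100 * achieved_tflops / peak_tflops:.0f}% of peak")   # ~491 TFLOPS, ~50%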

GPU

A massively parallel processor with thousands of cores optimized for executing the same operation across many data elements simultaneously.

Tensor Core

Specialized hardware units in NVIDIA GPUs that perform matrix multiply-accumulate operations at extremely high throughput for ML workloads.

02 TPUs and Custom ML Accelerators

Google's Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) designed specifically for ML workloads. TPUs feature large systolic arrays optimized for matrix multiplication, high-bandwidth memory, and custom interconnects for distributed training. TPU v4 pods can deliver over an exaFLOP of compute for ML training.

Definition

TPU (Tensor Processing Unit)

Google's custom-designed ML accelerator chip. Unlike GPUs that evolved from graphics rendering, TPUs were purpose-built for neural network computation from the ground up. Their systolic array architecture achieves very high utilization on matrix-multiply-heavy workloads, and custom high-bandwidth interconnects (ICI) enable efficient distributed training across thousands of chips.

Systolic Array Architecture

TPUs trade general-purpose flexibility for ML-specific performance. The systolic array architecture pipelines matrix operations through a grid of multiply-accumulate units, achieving near-peak utilization for the regular computation patterns found in neural networks. This specialization allows TPUs to achieve higher performance-per-watt than GPUs for supported operations.

Definition

Systolic Array

A grid of processing elements (PEs) arranged in a regular pattern, where data flows rhythmically between neighbors (like a heartbeat — hence "systolic"). Each PE performs a multiply-accumulate operation and passes data to the next. For matrix multiplication, one matrix flows horizontally and the other flows vertically through the array, computing the entire result in a pipelined fashion.
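A tiny NumPy simulation of this dataflow helps make the pipelining concrete. It assumes a simple skewed schedule in which the operands for index k reach PE (i, j) at cycle i + j + k; this is a sketch of the general idea, not any particular TPU's design.

python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an M x N grid of multiply-accumulate PEs.
    Rows of A flow in from the left and columns of B from the top, skewed so
    that A[i, k] and B[k, j] meet at PE (i, j) on cycle i + j + k."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N))                    # one accumulator per PE
    for cycle in range(M + N + K - 2):        # wavefront sweeps across the array
        for i in range(M):
            for j in range(N):
                k = cycle - i - j             # which operand pair arrives at this PE now
                if 0 <= k < K:
                    acc[i, j] += A[i, k] * B[k, j]
    return acc

A = np.random.rand(8, 16)
B = np.random.rand(16, 8)
assert np.allclose(systolic_matmul(A, B), A @ B)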

4,096 — TPU v4 chips per pod
TPU v4 at Scale

A TPU v4 pod connects up to 4,096 TPU v4 chips via a custom 3D torus interconnect (ICI), delivering over 1 exaFLOP of BF16 compute. Google used TPU v4 pods to train PaLM (540B parameters) and Gemini. The tight integration of compute, memory, and networking in a TPU pod provides efficiency advantages that are difficult to replicate with discrete GPUs connected by InfiniBand.

The Accelerator Landscape

Beyond Google's TPUs, numerous companies are developing custom ML accelerators. Each takes a different approach to optimizing ML computation, with various trade-offs in programmability, performance, and cost.

Table 11.3: The ML accelerator landscape beyond NVIDIA GPUs.

| Accelerator | Company | Architecture | Best For |
|---|---|---|---|
| TPU v5e / v4 | Google | Systolic array + ICI | Large-scale training via Google Cloud |
| Trainium / Inferentia | AWS | Custom ASIC + NeuronLink | Cost-effective training/inference on AWS |
| Gaudi 2/3 | Intel (Habana) | GEMM engine + TPC | Price-competitive GPU alternative |
| IPU | Graphcore | Bulk synchronous parallel | Sparse/irregular workloads, GNNs |
| WSE-2 | Cerebras | Wafer-scale (850K cores) | Extremely large models, sparsity |

Software Portability Challenge

The diversity of ML accelerators creates challenges for software portability. Code written for NVIDIA GPUs does not automatically run on TPUs or other accelerators. Frameworks and compilers (JAX/XLA, ONNX Runtime) provide abstraction layers that enable writing hardware-portable ML code, though achieving peak performance on each platform often still requires hardware-specific tuning.

Vendor Lock-In Risk

Custom CUDA kernels, NVIDIA-specific libraries (cuDNN, NCCL), and TensorRT-optimized models create strong lock-in to the NVIDIA ecosystem. If portability across accelerators is a requirement, prefer framework-level APIs (PyTorch, JAX) over hardware-specific libraries, and test regularly on multiple backends.

JAX for Hardware Portability

JAX, via the XLA compiler, provides the best cross-accelerator portability today. The same JAX code can run on CPUs, NVIDIA GPUs, and Google TPUs with minimal changes. This makes JAX particularly attractive for organizations that may want to switch between hardware platforms based on cost and availability.
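A minimal sketch of what this looks like in practice (assuming JAX is installed for the backend in question; the layer itself is an arbitrary example):

python
import jax
import jax.numpy as jnp

# The same jit-compiled function runs unchanged on CPU, GPU, or TPU:
# XLA lowers it to whichever backend JAX discovers at runtime.
@jax.jit
def dense_relu(w, b, x):
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (512, 512))
b = jnp.zeros(512)
x = jax.random.normal(key, (32, 512))

print(jax.devices())        # e.g. CPU, CUDA, or TPU devices, depending on the platform
y = dense_relu(w, b, x)     # identical source code on every backend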

TPU

Google's Tensor Processing Unit, a custom ASIC designed specifically for ML workloads with systolic arrays optimized for matrix operations.

Systolic Array

A grid of processing elements that pipelines data through regular computation patterns, achieving high utilization for matrix multiplication.

03 FPGAs and Edge Accelerators

Field-Programmable Gate Arrays (FPGAs) offer a middle ground between the flexibility of CPUs and the efficiency of ASICs. FPGAs can be reconfigured to implement custom hardware architectures, making them suitable for ML workloads that require specialized processing or latency guarantees that software-programmable processors (CPUs and GPUs) cannot meet.

Definition

FPGA (Field-Programmable Gate Array)

A semiconductor device containing a matrix of configurable logic blocks connected by programmable interconnects. FPGAs can be "programmed" after manufacturing to implement arbitrary digital circuits, offering a middle ground between the flexibility of software (CPUs/GPUs) and the efficiency of fixed-function hardware (ASICs).

FPGA Advantages and Trade-offs

FPGAs excel in scenarios requiring extremely low latency, deterministic timing, and custom precision. Financial trading ML models, real-time video processing, and certain edge deployment scenarios benefit from FPGA implementation. However, FPGAs are significantly harder to program than GPUs, requiring hardware description languages like VHDL or Verilog, or high-level synthesis (HLS) tools.

Table 11.4: Hardware platform comparison for ML deployment.

| Property | CPU | GPU | FPGA | ASIC (e.g., TPU) |
|---|---|---|---|---|
| Flexibility | Highest | High | Moderate (reconfigurable) | Lowest (fixed function) |
| Performance/Watt | Lowest | Moderate | High | Highest |
| Development Time | Days | Weeks | Months | Years |
| Unit Cost (at volume) | Low | Moderate | High | Lowest |
| Latency | Moderate | Moderate (kernel launch) | Very low (deterministic) | Very low |
| ML Tooling | Excellent | Excellent | Limited | Vendor-specific |

When FPGAs Make Sense

FPGAs are rarely the right choice for general ML workloads (GPUs and ASICs are better). They shine in niche scenarios: (1) Ultra-low-latency inference where kernel launch overhead matters (e.g., high-frequency trading at <1 microsecond). (2) Custom precision requirements not supported by GPUs. (3) Environments where GPUs are unavailable (e.g., certain military or aerospace applications).

Edge ML Accelerators

Edge ML accelerators are designed specifically for inference on resource-constrained devices. Google's Coral Edge TPU, NVIDIA Jetson, Intel Neural Compute Stick, and ARM's Ethos NPU provide dedicated ML acceleration in mobile and IoT form factors. These devices typically support INT8 inference and consume just a few watts of power.

Table 11.5: Edge ML accelerators and their specifications.

| Device | Compute | Power | Form Factor | Best For |
|---|---|---|---|---|
| Google Coral Edge TPU | 4 TOPS (INT8) | 2 W | USB / M.2 module | Vision at the edge, prototyping |
| NVIDIA Jetson Orin Nano | 40 TOPS | 7-15 W | SOM module | Robotics, autonomous systems |
| Intel Neural Compute Stick 2 | ~1 TOPS | 1.5 W | USB stick | Prototyping, low-volume edge |
| ARM Ethos-U55 | 0.5 TOPS | <50 mW | On-chip IP | Microcontroller-class always-on ML |
| Apple Neural Engine (M-series) | 15-38 TOPS | Shared SoC budget | Integrated in Apple SoC | On-device ML for Apple products |

Emerging Hardware Paradigms

The hardware landscape for ML is diversifying rapidly. Neuromorphic chips like Intel Loihi process information using spike-based computation inspired by biological neurons. Photonic accelerators use light instead of electrons for matrix operations. In-memory computing architectures perform computation within memory arrays to avoid the data movement bottleneck. While most of these are still early-stage, they represent potential paradigm shifts in ML hardware.

Evaluating New Hardware

When evaluating a new ML accelerator, look beyond the peak TOPS number. Key questions: (1) What software stack is supported (PyTorch, TensorFlow, ONNX)? (2) What operations and data types are actually accelerated? (3) What is the real-world throughput on your specific model? (4) What is the total cost of ownership including development time? A chip with half the TOPS but excellent software support often delivers more value in practice.

  • Neuromorphic computing (Intel Loihi, IBM TrueNorth) — Spike-based computation for event-driven, energy-efficient processing.
  • Photonic accelerators (Lightmatter, Luminous) — Use light interference for matrix multiplication at the speed of light.
  • In-memory computing (Mythic, Syntiant) — Perform multiply-accumulate within memory arrays, eliminating the data movement bottleneck.
  • Analog compute (IBM Analog AI) — Use analog circuit properties for approximate matrix computation at extreme energy efficiency.

FPGA

Field-Programmable Gate Array, a reconfigurable integrated circuit that can implement custom hardware architectures for specialized ML workloads.

Edge Accelerator

A low-power specialized processor designed for running ML inference on resource-constrained edge devices like smartphones and IoT sensors.

04 Hardware-Aware Optimization

Hardware-aware optimization tailors model architecture and computation to exploit the specific capabilities of the target hardware platform. Rather than designing models in a hardware-agnostic manner and hoping for good performance, hardware-aware approaches directly incorporate hardware characteristics into the design process.

Definition

Hardware-Aware Optimization

An optimization approach that incorporates the characteristics of the target hardware platform (memory hierarchy, compute units, supported operations, interconnect topology) directly into the model design and compilation process. This contrasts with hardware-agnostic design, which optimizes for abstract metrics like FLOPs without considering how they map to real hardware.

Hardware-Aware Neural Architecture Search

Hardware-aware NAS includes hardware metrics like latency, energy, and memory usage in the search objective function. Instead of optimizing only for accuracy, the search finds architectures that achieve the best accuracy within a given hardware budget. This approach has produced models like EfficientNet and MnasNet that are Pareto-optimal for specific hardware targets.

$$\max_{\alpha} \ \text{Accuracy}(\alpha) \quad \text{s.t.} \quad \text{Latency}(\alpha, h) \leq T$$

where $\alpha$ is the candidate architecture, $h$ the target hardware, and $T$ the latency budget.
MnasNet: Hardware-Aware Search in Practice

Google's MnasNet used a multi-objective search that directly measured inference latency on a Pixel phone. The search space included mobile-friendly operations (depthwise separable convolutions, squeeze-and-excite blocks) and the reward function was Accuracy(m) x [Latency(m)/T]^w where w=-0.07 penalizes slow models. The resulting architecture outperformed manually designed MobileNetV2 by 1.8% accuracy at the same latency.
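A short sketch of that soft-constraint reward in code (the accuracy and latency values are made up for illustration):

python
def mnas_reward(accuracy, latency_ms, target_ms=75.0, w=-0.07):
    """Multi-objective reward: accuracy scaled by a soft latency penalty.
    With w = -0.07, exceeding the latency target shrinks the reward."""
    return accuracy * (latency_ms / target_ms) ** w

# Illustrative values only
print(mnas_reward(0.752, 75.0))    # on target: reward equals accuracy (0.752)
print(mnas_reward(0.760, 110.0))   # more accurate but slower candidate is penalized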

Kernel Auto-Tuning

Kernel auto-tuning optimizes the low-level implementation of individual operations for the target hardware. Tools like TVM's AutoTVM and Ansor search over implementation strategies (tile sizes, loop ordering, vectorization) to find the fastest configuration for each operation on each device. This can yield significant speedups over hand-written or default kernel implementations.

The Auto-Tuning Search Space

For a single matrix multiplication on a GPU, the auto-tuner must choose: tile sizes for shared memory blocking, thread block dimensions, loop unrolling factors, vectorization width, and memory access patterns. The search space easily reaches millions of configurations. Smart search strategies (learned cost models, genetic algorithms, simulated annealing) are essential to find good configurations in reasonable time.

python
# TVM auto-tuning example (simplified)
import tvm
from tvm import te, auto_scheduler

# Define the workload: a square matrix multiplication
@auto_scheduler.register_workload
def matmul(M, K, N):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N),
        lambda i, j: te.sum(A[i, k] * B[k, j], axis=k))
    return [A, B, C]

# Auto-tune for a specific hardware target
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024),
                                 target="cuda -model=a100")
log_file = "matmul_a100.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)])
task.tune(tune_option)
sch, args = task.apply_best(log_file)   # retrieve the best schedule found

Hardware-Software Co-Design

The concept of hardware-software co-design takes optimization further by jointly designing the model architecture, the compilation strategy, and even the hardware itself. This holistic approach recognizes that the best system emerges from optimizing all components together rather than sequentially. Co-design is increasingly practiced at companies that control both the hardware and software stack.

Co-Design in Practice

Apple exemplifies hardware-software co-design: the Neural Engine in Apple Silicon is co-designed with Core ML and model architectures used in iOS features. Google co-designs TPU hardware with XLA compiler passes and JAX programming patterns. This tight integration explains why these companies achieve higher efficiency than combining independently designed components.

Hardware-Aware Optimization

The practice of incorporating target hardware characteristics into model and system design to achieve optimal performance on specific platforms.

Kernel Auto-Tuning

Automated search for optimal low-level implementation parameters (tile sizes, loop ordering) for each operation on each target hardware platform.

05 Accelerator Design and Co-Design

ML accelerator design is driven by the observation that neural network workloads have predictable, regular computation patterns dominated by matrix multiplication and convolution. By specializing hardware for these operations, accelerators achieve orders-of-magnitude better performance-per-watt than general-purpose processors.

Accelerator Design Decisions

Key design decisions for ML accelerators include the compute architecture (systolic array, dataflow, SIMD), the memory hierarchy (on-chip SRAM size, memory bandwidth), the interconnect topology (for multi-chip systems), and the supported data formats (FP32, FP16, BF16, INT8, FP8). Each choice has implications for what models can be efficiently executed.

Table 11.6: Key design dimensions for ML accelerator architecture.

| Design Dimension | Options | Trade-off |
|---|---|---|
| Compute Architecture | Systolic array, dataflow, SIMD, CGRA | Specialization vs. flexibility |
| On-chip Memory | Small SRAM (fast) vs. large SRAM (expensive) | Bandwidth vs. chip area and cost |
| Off-chip Memory | HBM (high bandwidth) vs. DDR (lower cost) | Bandwidth vs. cost and power |
| Interconnect | Mesh, torus, tree, crossbar | Scalability vs. per-link bandwidth |
| Data Formats | FP32, FP16, BF16, INT8, FP8, INT4 | Precision vs. throughput and efficiency |

The Memory Wall

The memory wall is the fundamental bottleneck for ML accelerators. As compute throughput increases faster than memory bandwidth, an increasing fraction of operations become memory-bound. This has driven the adoption of high-bandwidth memory (HBM), larger on-chip SRAM caches, and processing-in-memory architectures that bring computation closer to data.

Definition

Memory Wall

The growing disparity between processor compute throughput and memory bandwidth. Compute has been doubling every ~2 years while memory bandwidth improves only ~1.2x per generation. For ML accelerators, this means that even with massive compute capacity, performance is increasingly limited by how fast data can be fed to the compute units.

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes transferred}} \quad [\text{FLOP/Byte}]$$
Compute Growth Outpacing Memory

NVIDIA H100 compute (FP16 tensor): 990 TFLOPS. H100 memory bandwidth: 3.35 TB/s. To keep the compute fully utilized, every byte transferred must be used for ~300 FLOPs. Standard matrix multiplication at large batch sizes achieves this, but many other operations (element-wise, normalization, attention) have much lower arithmetic intensity and are memory-bound. This gap widens with each hardware generation.
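A small sketch of the roofline logic implied by these numbers. The H100 figures come from the text above; the per-operation arithmetic intensities are rough assumptions for illustration only.

python
PEAK_TFLOPS = 990.0          # H100 dense FP16 tensor compute (from the text)
BANDWIDTH_TBS = 3.35         # H100 HBM bandwidth (from the text)

# Machine balance: FLOPs each byte must feed to keep the compute units busy
balance = PEAK_TFLOPS / BANDWIDTH_TBS    # ~296 FLOP/byte

def attainable_tflops(arithmetic_intensity):
    """Roofline: performance is capped by either compute or memory bandwidth."""
    return min(PEAK_TFLOPS, arithmetic_intensity * BANDWIDTH_TBS)

# Rough, assumed intensities for a few common operations
for name, ai in [("large GEMM", 300), ("attention (naive)", 40), ("elementwise add", 0.25)]:
    bound = "compute-bound" if ai >= balance else "memory-bound"
    print(f"{name:18s} AI={ai:6.2f} -> {attainable_tflops(ai):7.1f} TFLOPS ({bound})")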

The Software Ecosystem Moat

Software support is as important as hardware capability for accelerator success. An accelerator with outstanding hardware specifications will fail in the market if it lacks compiler support, framework integration, and developer tooling. The NVIDIA ecosystem advantage (CUDA, cuDNN, TensorRT, NGC containers) demonstrates the importance of software maturity in hardware adoption.

NVIDIA's Ecosystem Advantage

NVIDIA's dominance in ML is not primarily about raw hardware performance — it's about the ecosystem. CUDA (launched 2006) has 18+ years of maturity with millions of developers. cuDNN provides hand-optimized ML kernels. TensorRT automates inference optimization. NGC provides pre-built containers. PyTorch and TensorFlow have first-class CUDA support. This ecosystem creates a powerful moat that competitors struggle to overcome regardless of hardware specifications.

Hardware is easy. Software is hard. The real barrier to entry in the accelerator market is not building a faster chip — it's building the compiler, the libraries, the framework integrations, and the developer community that make the chip usable.

Paraphrased from industry observations on ML hardware competition
Evaluating Accelerator Software Maturity

When evaluating a new accelerator, test these checkpoints: (1) Can you run an unmodified PyTorch model? (2) Does the compiler handle your model's operations without fallbacks to CPU? (3) Are common libraries (attention, normalization) optimized? (4) Is there a profiling tool to identify bottlenecks? (5) How quickly are framework updates supported? Anything less than "yes" to all five means significant engineering effort to adopt.

Memory Wall

The growing gap between processor compute throughput and memory bandwidth that increasingly limits ML system performance.

Hardware-Software Co-Design

The practice of jointly optimizing hardware architecture, compiler strategy, and model design to achieve optimal system-level performance.

Key Takeaways

  1. GPUs dominate ML compute due to massive parallelism and a mature software ecosystem (CUDA, cuDNN, TensorRT).
  2. TPUs and custom accelerators achieve higher efficiency by specializing hardware for ML-specific operations like matrix multiplication.
  3. The memory wall is the fundamental bottleneck: compute grows faster than memory bandwidth, making many operations memory-bound.
  4. Hardware-aware optimization tailors models to specific hardware, often yielding better accuracy-efficiency trade-offs than hardware-agnostic design.
  5. Software ecosystem maturity is as important as raw hardware performance for accelerator adoption and practical utility.

Chapter Complete

Up next: Benchmarking
