DNN Architectures
Surveys major DNN architecture families including CNNs, RNNs, Transformers, and Neural Architecture Search. Examines each from a systems design perspective.
- Compare CNN, RNN, and Transformer architectures in terms of structure and capabilities
- Explain the self-attention mechanism and why it revolutionized sequence modeling
- Analyze architectural trade-offs including parameter count, computational cost, and accuracy
- Evaluate when to apply each architecture family based on data type and task requirements
- Describe how modern architectures like Vision Transformers bridge the gap between vision and language
01 Convolutional Neural Networks (CNNs)
Convolutional Neural Networks exploit spatial structure in data by applying learned filters that slide across the input. This weight-sharing mechanism dramatically reduces the number of parameters compared to fully connected layers, making CNNs both more efficient and more effective for grid-structured data like images, audio spectrograms, and time series.
Convolution
A mathematical operation that slides a learned filter (kernel) across input data, computing a dot product at each position. In CNNs, convolutions detect local patterns while sharing weights across all spatial positions, dramatically reducing parameter count compared to fully connected layers.
Core Building Blocks
The core building blocks of CNNs are convolutional layers, pooling layers, batch normalization, and fully connected layers, each summarized below; a minimal sketch combining them follows the list.
- Convolutional layers — Apply learned filters to detect local patterns (edges, textures, shapes)
- Pooling layers — Reduce spatial dimensions via max or average pooling, providing translation invariance
- Batch normalization — Normalizes activations between layers, stabilizing and accelerating training
- Fully connected layers — Combine high-level features for final classification or regression
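To make these pieces concrete, here is a minimal PyTorch sketch chaining all four blocks; the input size and layer widths are illustrative assumptions, not a recommended design.

```python
# A tiny CNN combining the four building blocks, assuming PyTorch
# and a hypothetical 3-channel 32x32 input (CIFAR-10-sized images).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: local pattern detectors
            nn.BatchNorm2d(16),                          # batch normalization: stabilizes training
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: halves spatial dims, adds translation invariance
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected: combines features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)  # (N, 32, 8, 8) for 32x32 inputs
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```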
Systems Performance of CNNs
From a systems perspective, CNNs have predictable and regular memory access patterns, which makes them well-suited for GPU acceleration. The convolution operation can be implemented using several algorithms with different performance trade-offs.
| Algorithm | Approach | Best For | Trade-off |
|---|---|---|---|
| im2col + GEMM | Convert convolution to matrix multiplication | General purpose | Extra memory for unrolled matrix |
| Winograd | Reduce multiplications via transform | Small filters (3x3) | Numerical precision loss |
| FFT-based | Convolution via frequency domain | Large filters | Memory overhead, padding |
| Direct | Naive nested loop implementation | Small inputs, debugging | Slow on large inputs |
Table 4.1: Convolution implementation strategies and their trade-offs.
NVIDIA's cuDNN library automatically benchmarks multiple convolution algorithms for each layer configuration and selects the fastest one. Setting torch.backends.cudnn.benchmark = True enables this auto-tuning, which can yield 10-30% speedups when input sizes are fixed across iterations.
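In PyTorch this auto-tuning is a single setting; a minimal sketch:

```python
# Enable cuDNN autotuning: cuDNN benchmarks candidate conv algorithms
# per layer shape and caches the fastest one.
import torch

torch.backends.cudnn.benchmark = True
```

Leave this off when input shapes vary across iterations (e.g., variable-resolution images), since every new shape triggers a fresh benchmarking pass that can negate the speedup.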
Key Architectural Innovations
Architectural innovations like ResNets introduced skip connections that enable training of very deep networks (100+ layers) by providing gradient shortcuts. MobileNets introduced depthwise separable convolutions that factorize standard convolutions into cheaper operations, reducing computation by 8-9x with minimal accuracy loss.
Depthwise Separable Convolution
A factorization of standard convolution into two steps: a depthwise convolution (one filter per input channel) and a pointwise 1x1 convolution (mixing channels). This reduces computation by a factor of roughly k^2 where k is the kernel size, with minimal accuracy loss.
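A minimal PyTorch sketch of the factorization (channel counts are illustrative); `groups=in_ch` gives one filter per channel:

```python
# Depthwise separable convolution: depthwise (per-channel spatial filter)
# followed by pointwise (1x1 channel mixing).
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=k // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                               # pointwise (1x1)
    )

# Parameter comparison for in_ch=64, out_ch=128, k=3 (bias omitted):
# standard:  64 * 128 * 3 * 3 = 73,728 weights
# separable: 64 * 3 * 3 + 64 * 128 = 8,768 weights (~8.4x fewer)
```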
For cloud deployment where accuracy is paramount, use ResNet-50 or EfficientNet-B4 as strong baselines. For mobile deployment, start with MobileNetV3 or EfficientNet-Lite. For edge/TinyML, use MobileNetV2 with aggressive quantization. Always benchmark against your latency budget before adding complexity.
In a ResNet block, the input x is added to the output of two convolutional layers: y = F(x) + x. This simple addition provides a gradient highway that allows training of networks with 152+ layers. Without skip connections, networks deeper than ~20 layers suffer from degradation where adding more layers actually increases training error.
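A minimal sketch of such a block in PyTorch; layer sizes are illustrative, and real ResNet blocks also handle stride and channel changes in the skip path:

```python
# ResNet-style basic block: y = F(x) + x.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection: gradients flow through "+ x" unimpeded.
        return torch.relu(self.f(x) + x)
```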
02 Recurrent Neural Networks (RNNs)
Recurrent Neural Networks process sequential data by maintaining a hidden state that captures information from previous time steps. At each step, the RNN updates its hidden state based on the current input and the previous state, creating a form of memory that enables processing of variable-length sequences.
Hidden State
The internal memory vector of a recurrent network that is updated at each time step based on the current input and the previous hidden state. It encodes a compressed summary of the sequence processed so far and serves as the network's working memory.
\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b})
LSTM and GRU: Gated Recurrence
Vanilla RNNs suffer from the vanishing and exploding gradient problems, which make it difficult to learn long-range dependencies. Long Short-Term Memory (LSTM) networks address this through a gating mechanism that controls information flow.
LSTM
Long Short-Term Memory network, a gated RNN variant that uses three gates (forget, input, output) and a cell state to learn long-range dependencies. The gates learn which information to retain, add, or output at each time step, mitigating the vanishing gradient problem.
| Architecture | Parameters (per layer) | Gates | Long-Range Memory | Training Speed |
|---|---|---|---|---|
| Vanilla RNN | ~n^2 (small) | None | Poor (vanishing gradients) | Fast per step |
| LSTM | ~4n^2 | Forget, Input, Output | Good (cell state highway) | Slower per step |
| GRU | ~3n^2 | Reset, Update | Good (comparable to LSTM) | Faster than LSTM |
Table 4.2: Comparison of recurrent architectures where n is the hidden size.
The Parallelism Problem
The sequential nature of RNNs creates significant systems challenges. Because each time step depends on the previous output, RNN computation cannot be easily parallelized across the sequence dimension. This fundamentally limits throughput on parallel hardware like GPUs.
An RNN processing a sequence of length T must perform T sequential steps, regardless of how many GPU cores are available. A Transformer processes all T positions in parallel. For a sequence of 1000 tokens on a modern GPU with thousands of cores, this sequential bottleneck means RNNs may be 10-100x slower than Transformers for training, even though each individual step is cheaper.
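A sketch of this dependency with illustrative sizes: the loop below performs T strictly sequential updates because each h_t needs h_{t-1}.

```python
# Why RNNs serialize over time: the loop cannot be parallelized
# across the T dimension, no matter how many GPU cores exist.
import torch

T, n = 1000, 256
W_hh = torch.randn(n, n) / n**0.5
W_xh = torch.randn(n, n) / n**0.5
xs = torch.randn(T, n)

h = torch.zeros(n)
for t in range(T):  # T strictly sequential steps
    h = torch.tanh(h @ W_hh + xs[t] @ W_xh)

# A Transformer, by contrast, consumes all T positions in one batched
# matmul: scores = (xs @ Wq) @ (xs @ Wk).T -- no loop over t.
```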
Continued Relevance
Despite their limitations, RNNs remain relevant for certain applications, particularly in resource-constrained environments. Their per-step memory footprint is constant regardless of sequence length, which can be advantageous for streaming applications on edge devices.
Consider RNNs for (1) streaming applications where data arrives one token at a time and you need constant-memory processing, (2) edge devices where the O(n^2) memory of Transformers is prohibitive, and (3) online learning scenarios where the model must continuously adapt. State-space models like Mamba offer a modern take on recurrence with improved parallelism during training.
Recent work on linear RNNs and state-space models (S4, Mamba) has renewed interest in recurrent architectures. These models achieve linear-time complexity during inference while being parallelizable during training, potentially combining the best of both recurrent and attention-based approaches.
03 Transformers and Attention
The Transformer architecture, introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017), has revolutionized deep learning by replacing recurrence with self-attention. Self-attention computes relationships between all positions in a sequence simultaneously, enabling rich contextual representations and full parallelism during training.
"Attention Is All You Need" — removing recurrence enabled massive parallelization.
The Transformer's key insight was that recurrence (processing tokens one at a time) was not necessary for sequence modeling. By replacing it with self-attention, every token can attend to every other token in parallel. This single architectural change unlocked the ability to train on massive datasets using thousands of GPUs simultaneously, directly enabling the scaling revolution from BERT to GPT-4. Think of it like reading a book one word at a time (RNN) versus seeing the entire page at once and drawing connections between any words simultaneously (Transformer); the parallel "seeing everything at once" approach is what makes Transformers so fast to train.
Self-Attention
A mechanism that computes weighted relationships between all positions in a sequence simultaneously. Each element computes a query (what am I looking for?), key (what do I contain?), and value (what information do I provide?), enabling rich contextual representations.
Scaled Dot-Product Attention
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
The core mechanism is scaled dot-product attention: queries, keys, and values are computed from the input, and attention weights are determined by the compatibility between queries and keys. Multi-head attention extends this by running multiple attention operations in parallel, allowing the model to attend to information from different representation subspaces.
Figure: Transformer Encoder Block Architecture
Without scaling, the dot products between queries and keys grow in magnitude with the dimension d_k, pushing the softmax into regions with extremely small gradients. Dividing by sqrt(d_k) keeps the variance of the dot products at 1 regardless of dimension, ensuring stable training dynamics.
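A minimal PyTorch sketch of the formula above, including the 1/sqrt(d_k) scaling (single head, no masking, sizes illustrative):

```python
# Scaled dot-product attention.
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n): the O(n^2) pairwise matrix
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                  # weighted mixture of values

n, d = 128, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # -> shape (n, 64)
```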
The Quadratic Scaling Challenge
From a systems perspective, the self-attention mechanism has O(n^2) computational and memory complexity with respect to sequence length n. This quadratic scaling is the primary bottleneck for processing long sequences.
| Attention Variant | Complexity | Approach | Trade-off |
|---|---|---|---|
| Full attention | O(n^2) | All pairs computed | Best quality, highest cost |
| Sparse attention | O(n * sqrt(n)) | Attend to fixed patterns | Misses some long-range dependencies |
| Linear attention | O(n) | Kernel approximation of softmax | Quality degradation on some tasks |
| Sliding window | O(n * w) | Local window of size w | Loses global context beyond window |
| Flash Attention | O(n^2) compute, O(n) memory | IO-aware tiling | Exact attention with reduced memory |
Table 4.3: Efficient attention variants and their complexity-quality trade-offs.
Why does self-attention scale quadratically with sequence length? Because every position computes a compatibility score with every other position, the attention weights form an n x n matrix: doubling the sequence length quadruples both the score computation and the memory needed to store it.
Flash Attention (Dao et al., 2022) computes exact self-attention with O(n) memory instead of O(n^2) by using tiled computation that is aware of the GPU memory hierarchy. It does not change the math — it changes how the computation is scheduled on hardware. Always use Flash Attention when available; it is strictly better than naive attention.
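In PyTorch 2.x, one convenient entry point is torch.nn.functional.scaled_dot_product_attention, which dispatches to fused kernels (including Flash Attention) when the hardware, dtypes, and shapes allow; a sketch assuming a CUDA device:

```python
# Fused attention in PyTorch 2.x: same math as naive attention,
# but the n x n score matrix is never fully materialized in GPU memory.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, dim)
out = F.scaled_dot_product_attention(q, k, v)
```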
Universal Dominance
Transformers have become the dominant architecture across natural language processing, computer vision (Vision Transformers), and multimodal learning. Their uniform structure makes them particularly amenable to hardware optimization: the attention and feed-forward layers are composed of large matrix multiplications that map efficiently onto GPU tensor cores.
The trend toward ever-larger Transformer models relies on the empirical observation that bigger models trained on more data perform better. However, training GPT-3 (175B parameters) required an estimated $4.6 million in compute, and GPT-4 likely cost over $100 million. The environmental and financial costs of scaling must be weighed against the diminishing marginal returns at extreme scale.
04 Neural Architecture Search (NAS)
Neural Architecture Search automates the design of neural network architectures by searching over a defined space of possible designs. Rather than relying on human intuition and manual experimentation, NAS uses optimization algorithms to discover architectures that are optimized for specific objectives like accuracy, latency, or model size.
Neural Architecture Search (NAS)
Automated methods for discovering optimal neural network architectures by algorithmically searching over a defined space of possible layer types, connections, and hyperparameters. NAS can optimize for multiple objectives simultaneously, such as accuracy and inference latency.
Evolution of NAS Methods
Early NAS methods used reinforcement learning or evolutionary algorithms to search over discrete architecture spaces, requiring thousands of GPU hours per search. Modern approaches like differentiable NAS (DARTS) relax the discrete search space to be continuous, enabling gradient-based optimization that is orders of magnitude more efficient.
| NAS Method | Search Strategy | GPU Hours | Key Innovation |
|---|---|---|---|
| NASNet (2017) | Reinforcement learning | ~48,000 | First large-scale NAS success |
| AmoebaNet (2018) | Evolutionary algorithms | ~3,150 | Tournament selection over architectures |
| DARTS (2019) | Gradient-based (differentiable) | ~4 | Continuous relaxation of search space |
| Once-for-All (2020) | Progressive shrinking | ~40 | Train one supernetwork, extract subnets |
Table 4.4: Evolution of NAS efficiency, from thousands of GPU hours to single-digit hours.
The cost of NAS has dropped by four orders of magnitude in just a few years: from 48,000 GPU hours (NASNet) to about 4 GPU hours (DARTS). This makes NAS accessible to teams without massive compute budgets and opens the door to task-specific and hardware-specific architecture optimization.
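A minimal sketch of the DARTS idea behind that efficiency gain; the candidate op set and channel count here are illustrative, not the paper's full search space:

```python
# DARTS continuous relaxation: instead of choosing one op per edge,
# compute a softmax-weighted mixture of all candidate ops, and learn the
# mixing weights (alpha) by gradient descent alongside network weights.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate: 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),  # candidate: 5x5 conv
            nn.Identity(),                                # candidate: skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.alpha, dim=0)  # relaxed "choice" over ops
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

# After search converges, the op with the largest alpha on each edge is
# kept and the rest are pruned, yielding a discrete architecture.
```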
Hardware-Aware NAS
Hardware-aware NAS incorporates hardware constraints directly into the search objective. Rather than finding the most accurate architecture and then compressing it, hardware-aware NAS discovers architectures that are inherently efficient on the target platform.
Hardware-Aware NAS
NAS methods that incorporate hardware performance metrics (latency, energy, memory) directly into the search objective function. The search jointly optimizes accuracy and efficiency on the target deployment platform, producing architectures like EfficientNet and MNASNet.
EfficientNet was discovered through hardware-aware NAS that optimized for both accuracy and FLOPs. The resulting architecture family achieves ImageNet accuracy comparable to much larger models at a fraction of the compute cost. EfficientNet-B0 matches ResNet-50 accuracy with 5.3M parameters versus 25.6M — a 4.8x reduction.
Cost-Benefit Analysis
From a systems engineering perspective, NAS raises important questions about the cost of architecture search versus the benefit of the discovered architecture. Practitioners must weigh NAS costs against the expected lifetime benefits of improved architecture efficiency.
A NAS search costing 1,000 GPU hours at $3/hour = $3,000 that finds an architecture saving 10% inference cost only breaks even if the model serves enough traffic. If inference costs $10,000/month, the 10% savings ($1,000/month) pays back the search cost in 3 months. For lower-volume deployments, manual architecture selection or transfer from published NAS results may be more cost-effective.
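The same arithmetic as a small helper; all figures are the worked example's assumptions, not benchmarks:

```python
# Break-even point for a NAS search, in months.
def nas_breakeven_months(search_gpu_hours: float, gpu_hourly_rate: float,
                         monthly_inference_cost: float, savings_fraction: float) -> float:
    search_cost = search_gpu_hours * gpu_hourly_rate          # one-time cost
    monthly_savings = monthly_inference_cost * savings_fraction
    return search_cost / monthly_savings

print(nas_breakeven_months(1_000, 3.0, 10_000, 0.10))  # -> 3.0 months
```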
Before running your own NAS, check if a published NAS-discovered architecture (EfficientNet, MNASNet, Once-for-All) already fits your constraints. If you must run NAS, use efficient methods like DARTS or Once-for-All, and define a tight search space informed by known good architectures to reduce search cost.
05 Architecture Design Principles
Effective architecture design is guided by several key principles that balance model capacity, computational efficiency, and practical deployability. The principle of progressive complexity suggests starting with simpler architectures and adding complexity only when justified by improved performance on the target task.
Start with the simplest architecture that could reasonably solve your problem. A logistic regression, a small CNN, or a compact Transformer — get a baseline working end-to-end first. Only add complexity (more layers, attention heads, auxiliary losses) when you have evidence that the simple approach falls short. This approach minimizes debugging time and maximizes understanding.
Depth vs. Width
Depth versus width is a fundamental design dimension. Deeper networks can represent more complex functions but are harder to train and more prone to gradient issues. Wider networks are more parallelizable and easier to train but may be less parameter-efficient.
| Property | Deep (Many Layers) | Wide (Large Layers) |
|---|---|---|
| Representation power | More complex functions with fewer total parameters | Requires more parameters for same complexity |
| Training difficulty | Vanishing/exploding gradients, needs skip connections | Easier to train, more stable gradients |
| Parallelism | Limited by sequential layer dependencies | Each layer parallelizes well across GPU cores |
| Memory pattern | Many small activations | Fewer but larger activations |
| Inference latency | Limited by layer count (sequential) | Limited by largest layer width |
Table 4.5: Depth vs. width trade-offs in architecture design.
Compound Scaling
Compound Scaling
A method for uniformly scaling network depth, width, and input resolution together using a fixed compound coefficient. Introduced by EfficientNet, this approach yields more balanced and efficient architectures than scaling any single dimension alone.
\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi \quad \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2
Compound scaling adjusts depth, width, and resolution simultaneously according to a fixed ratio. This approach consistently outperforms single-dimension scaling across different computational budgets.
Scaling only one dimension leads to diminishing returns. A very deep but narrow network wastes depth on low-resolution features. A very wide but shallow network cannot build hierarchical representations. A high-resolution input fed to a small network wastes the additional detail. Compound scaling ensures all three dimensions grow in harmony.
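A small sketch of the scaling rule; the alpha/beta/gamma values below are the ones reported by the EfficientNet paper's grid search, included for illustration:

```python
# Compound scaling: one coefficient phi scales depth, width, and
# resolution together.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases

def compound_scale(phi: float) -> tuple[float, float, float]:
    d = ALPHA ** phi   # depth multiplier
    w = BETA ** phi    # width multiplier
    r = GAMMA ** phi   # resolution multiplier
    return d, w, r

# Constraint check: FLOPs scale roughly with d * w^2 * r^2, so each +1
# in phi approximately doubles compute.
print(ALPHA * BETA**2 * GAMMA**2)  # -> ~1.92, close to 2
print(compound_scale(1))           # multipliers for the next model size up
```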
Modular, Composable Design
The trend toward modular, composable architecture components has accelerated with the success of Transformers. Standard building blocks like attention layers, feed-forward networks, and normalization layers can be combined in various configurations.
- Self-attention blocks — Capture global dependencies within a sequence or spatial region
- Feed-forward networks (FFN) — Per-position nonlinear transformations that add model capacity
- Layer normalization — Stabilizes activations and accelerates convergence
- Skip connections — Enable gradient flow and allow the network to learn residual mappings
- Positional encoding — Inject position information into position-agnostic attention operations
A standard Transformer block is: LayerNorm -> Self-Attention -> Residual Add -> LayerNorm -> FFN -> Residual Add. By simply stacking N of these identical blocks and varying N, dimension, and number of heads, you get architectures ranging from BERT-base (12 blocks, 110M params) to GPT-3 (96 blocks, 175B params). This composability dramatically simplifies implementation and systematic experimentation.
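A sketch of that block in PyTorch (pre-LayerNorm variant as described above; dropout omitted, sizes illustrative):

```python
# One Transformer encoder block: LayerNorm -> Self-Attention -> Residual Add
#                              -> LayerNorm -> FFN -> Residual Add.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention + residual add
        return x + self.ffn(self.ln2(x))                   # FFN + residual add

# Stacking N identical blocks and varying N, dim, and heads spans the
# range from BERT-base-like to GPT-3-like configurations.
model = nn.Sequential(*[Block(dim=256, heads=8) for _ in range(4)])
out = model(torch.randn(2, 16, 256))  # (batch, seq, dim)
```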
Modular design facilitates ablation studies — experiments that remove or modify one component at a time to measure its contribution. Before deploying a complex architecture, ablate each component to verify it contributes meaningful performance improvement. Complexity that does not improve metrics is technical debt.
Key Takeaways
1. CNNs exploit spatial structure through weight sharing and are highly efficient on parallel hardware due to regular computation patterns.
2. The sequential nature of RNNs fundamentally limits their parallelism, motivating the shift to Transformer architectures.
3. Transformers achieve strong performance through self-attention but face O(n^2) scaling challenges with sequence length.
4. Hardware-aware NAS discovers architectures optimized for specific deployment targets, outperforming hand-designed models.
5. Architecture design involves balancing depth, width, and resolution, with compound scaling providing a principled approach.
6. The choice of architecture has profound systems implications for memory usage, computational cost, and hardware utilization.