DL Primer
Covers deep learning fundamentals from a systems perspective. Introduces neural networks, backpropagation, activation functions, and loss functions.
- Explain the fundamental principles of deep learning including backpropagation and gradient descent
- Compare activation functions and their impact on network training dynamics
- Analyze the role of loss functions in guiding model optimization
- Implement a basic neural network training loop with forward and backward passes
- Evaluate regularization techniques for preventing overfitting in deep models
01 Neural Network Fundamentals
Neural networks are the computational backbone of modern deep learning. At their core, they are compositions of linear transformations followed by nonlinear activation functions, organized into layers. Each layer transforms its input through a weight matrix multiplication, adds a bias vector, and applies an activation function to produce its output.
Neural Network
A computational model composed of layers of interconnected nodes (neurons) that learn to transform inputs into outputs through adjustable weights and nonlinear activation functions. Each layer applies a linear transformation followed by a nonlinear activation.
\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})
Figure: Backpropagation -- Forward & Backward Pass
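To make the layer equation concrete, here is a minimal sketch in PyTorch. The dimensions (4 inputs, 3 hidden units) and the choice of ReLU are illustrative assumptions, not values from the text.

```python
import torch

# A minimal sketch of one layer computing h = sigma(Wx + b),
# with hypothetical sizes: 4 inputs, 3 hidden units, ReLU activation.
x = torch.randn(4)         # input vector
W = torch.randn(3, 4)      # weight matrix
b = torch.randn(3)         # bias vector

h = torch.relu(W @ x + b)  # linear transformation followed by nonlinearity
print(h.shape)             # torch.Size([3])
```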
From Perceptrons to Deep Networks
The simplest neural network is the single-layer perceptron, which can only learn linearly separable functions. By stacking multiple layers, deep neural networks gain the ability to learn hierarchical representations of data. Early layers typically capture low-level features (edges, textures), while deeper layers compose these into increasingly abstract representations (objects, concepts).
In an image classification network, layer 1 might detect edges and color gradients. Layer 2 combines edges into corners and textures. Layer 3 assembles these into parts (eyes, wheels). Layer 4 recognizes whole objects (faces, cars). This progression from low-level to high-level features is the key insight behind deep learning's success.
Systems Implications of Network Architecture
From a systems perspective, understanding neural network architecture is essential for making informed decisions about memory requirements, computational complexity, and hardware utilization. The number of parameters directly determines memory consumption, while the number of floating-point operations (FLOPs) determines computational cost.
| Property | Scales With | Systems Impact |
|---|---|---|
| Parameters | Width^2 * Depth | Memory consumption, model size on disk |
| FLOPs | Width^2 * Depth * Batch Size | Training time, GPU compute cost |
| Activations | Width * Depth * Batch Size | Peak GPU memory during training |
| Gradient Size | Same as Parameters | Communication cost in distributed training |
Table 3.1: How key network properties scale and their systems implications.
A quick rule of thumb: each parameter in float32 requires 4 bytes. During training with the Adam optimizer, you need roughly 16 bytes per parameter (4 for the weight, 4 for its gradient, and 8 for Adam's two moment estimates). A 1-billion-parameter model therefore needs approximately 16 GB just for parameters, gradients, and optimizer state, before accounting for activations.
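A sketch of this back-of-the-envelope arithmetic; the 1-billion-parameter figure is the illustrative example from the text, and the byte breakdown assumes full float32 training with Adam.

```python
# Back-of-the-envelope memory estimate for float32 training with Adam:
# weight + gradient + two optimizer moments = 16 bytes per parameter.
params = 1_000_000_000                 # 1-billion-parameter model
bytes_per_param = 4 + 4 + 4 + 4        # weight, gradient, Adam first and second moments
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~16 GB
```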
The Universal Approximation Theorem
The universal approximation theorem tells us that sufficiently wide neural networks can approximate any continuous function. However, this theoretical guarantee says nothing about the practical challenges of training such networks or deploying them efficiently. The gap between what is theoretically possible and what is practically achievable is where ML systems engineering lives.
The universal approximation theorem guarantees existence but not efficiency. A network that can theoretically approximate any function may require astronomically many parameters or be impossible to train with gradient descent. Practical deep learning relies on architectural inductive biases (convolutions for spatial data, attention for sequences) to make learning feasible.
Neural Network
A computational model composed of layers of interconnected nodes that learn to transform inputs into outputs through adjustable weights and nonlinear activation functions.
Universal Approximation
The theoretical property that neural networks with sufficient width can approximate any continuous function to arbitrary precision.
02 Backpropagation and Gradient Computation
Backpropagation is just the chain rule of calculus applied recursively through the network.
Deeper Insight: Despite its mystique, backpropagation is not a novel mathematical discovery. It is a systematic bookkeeping method for applying the chain rule at every layer. The brilliance lies in efficiency: rather than computing each gradient independently (which would be impossibly slow), backprop reuses intermediate results from deeper layers, turning an exponential problem into a linear one.
Think of it like a factory assembly line where each station gets a note saying "here is how much the final product was off -- your contribution to the error was this much." Each station passes the note backward to the previous station, adjusted for its own work. No station needs to understand the entire factory to know how to improve.
Backpropagation is the algorithm that makes deep learning possible. It efficiently computes the gradient of the loss function with respect to every parameter in the network by applying the chain rule of calculus from the output layer back through the network. This process enables gradient-based optimization of networks with millions or billions of parameters.
Backpropagation
An algorithm for efficiently computing the gradient of the loss function with respect to every parameter in a neural network by recursively applying the chain rule of calculus from the output layer backward through the network.
\frac{\partial L}{\partial \mathbf{W}_l} = \frac{\partial L}{\partial \mathbf{h}_l} \cdot \frac{\partial \mathbf{h}_l}{\partial \mathbf{W}_l}
Forward and Backward Passes
The forward pass computes the network output by propagating inputs through each layer sequentially. The backward pass then propagates error gradients in reverse, computing how each parameter contributed to the total loss. This symmetry between forward and backward passes has profound implications for system design.
The backward pass requires approximately twice the computation and memory of the forward pass. This means that during training, roughly two-thirds of the total compute is spent on gradient computation. This ratio has direct implications for GPU utilization and training cost estimation.
During neural network training, approximately what fraction of total computation is spent on the backward pass (gradient computation)?
Hint: if the backward pass costs 2x the forward pass, what fraction of (1x + 2x) is 2x?
Memory Challenges and Gradient Checkpointing
Memory consumption during backpropagation is a critical systems concern. The standard implementation requires storing all intermediate activations from the forward pass to compute gradients during the backward pass. For deep networks, this can consume tens of gigabytes of GPU memory.
Gradient Checkpointing
A memory optimization technique that reduces peak memory usage during training by discarding some intermediate activations during the forward pass and recomputing them during the backward pass. This trades roughly 20-30% additional computation for up to 60-70% memory reduction.
Enable gradient checkpointing when your model fits in GPU memory for inference but runs out of memory during training. In PyTorch, use torch.utils.checkpoint.checkpoint() on memory-intensive layers. This is especially effective for Transformer models with many attention layers.
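A minimal sketch of this pattern in recent PyTorch versions; the model, layer sizes, and choice of which block to checkpoint are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Hypothetical model that recomputes its middle block during backward."""
    def __init__(self, dim=1024):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # memory-heavy block
        self.head = nn.Linear(dim, 10)

    def forward(self, x):
        x = self.block1(x)
        # Activations inside block2 are not stored; they are recomputed
        # during the backward pass, trading compute for memory.
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

model = CheckpointedMLP()
out = model(torch.randn(8, 1024))
out.sum().backward()
```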
Automatic Differentiation Frameworks
Automatic differentiation, the generalization of backpropagation, is a core feature of modern ML frameworks. The design choice between static and dynamic graphs has significant implications for debugging, performance optimization, and deployment flexibility.
| Feature | Dynamic Graphs (PyTorch) | Static Graphs (TensorFlow 1.x) |
|---|---|---|
| Graph construction | Built on-the-fly during execution | Defined before execution |
| Debugging | Standard Python debugger works | Requires special tools (tfdbg) |
| Control flow | Native Python if/for/while | Special tf.cond / tf.while_loop |
| Performance | JIT compilation (torch.compile) | Ahead-of-time optimization |
| Deployment | TorchScript or ONNX export | SavedModel, TF Lite, TF.js |
Table 3.2: Dynamic vs. static computational graph approaches.
In PyTorch, calling loss.backward() triggers the backward pass through the dynamic graph. Each tensor records the operations that created it, forming a chain that backward() traverses to compute gradients. The gradients accumulate in each parameter's .grad attribute, ready for the optimizer to apply updates.
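A minimal sketch of one training step showing this sequence; the model architecture, learning rate, and dummy data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical classifier and one mini-batch of dummy data.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 784)          # dummy inputs
targets = torch.randint(0, 10, (32,))  # dummy labels

logits = model(inputs)                 # forward pass builds the dynamic graph
loss = loss_fn(logits, targets)

optimizer.zero_grad()                  # clear previously accumulated .grad values
loss.backward()                        # backward pass traverses the recorded graph
optimizer.step()                       # apply parameter updates from .grad
```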
Backpropagation
An algorithm for computing gradients of the loss function with respect to network parameters by efficiently applying the chain rule from output to input layers.
Gradient Checkpointing
A memory optimization technique that recomputes intermediate activations during the backward pass instead of storing them, trading compute for reduced memory usage.
03 Activation Functions
Activation functions introduce nonlinearity into neural networks, which is essential for learning complex patterns. Without nonlinear activations, a multi-layer network would reduce to a single linear transformation, regardless of its depth. The choice of activation function affects training dynamics, model expressiveness, and computational efficiency.
Activation Function
A nonlinear function applied element-wise to the output of a neural network layer. Activation functions are what give neural networks the ability to learn complex, nonlinear relationships. Without them, any stack of layers collapses to a single linear transformation.
ReLU and the Vanishing Gradient Problem
The Rectified Linear Unit (ReLU) and its variants dominate modern deep learning due to their simplicity and favorable gradient properties. ReLU computes max(0, x), which is computationally cheap and avoids the vanishing gradient problem that plagued earlier activation functions like sigmoid and tanh.
\text{ReLU}(x) = \max(0, x) \qquad \text{Leaky ReLU}(x) = \max(\alpha x, x)
When a ReLU neuron receives only negative inputs, its output is always zero and its gradient is always zero. This means the neuron stops learning entirely and is effectively "dead." In large networks, a significant fraction of neurons can die during training, reducing model capacity. Leaky ReLU and ELU mitigate this by allowing small gradients for negative inputs.
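One common way to spot this in practice is to measure the fraction of zero activations on a batch. A quick sketch, using a hypothetical tensor of pre-activations:

```python
import torch
import torch.nn.functional as F

# Estimate the fraction of zero (potentially "dead") ReLU outputs
# on a hypothetical batch of pre-activations (batch x hidden units).
pre_activations = torch.randn(256, 512)
zero_fraction = (torch.relu(pre_activations) == 0).float().mean().item()
print(f"fraction of zero activations: {zero_fraction:.2f}")

# Leaky ReLU keeps a small slope for negative inputs, so gradients still flow.
leaky = F.leaky_relu(pre_activations, negative_slope=0.01)
```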
Why has GELU become the standard activation function in Transformer models instead of ReLU?
Modern Activation Functions
| Activation | Formula | Pros | Cons | Common Use |
|---|---|---|---|---|
| ReLU | max(0, x) | Simple, fast, good gradients | Dying neurons | CNNs, general purpose |
| Leaky ReLU | max(0.01x, x) | No dying neurons | Marginal improvement | When ReLU underperforms |
| GELU | x * Phi(x) | Smooth, best for Transformers | More compute than ReLU | Transformers (BERT, GPT) |
| Swish | x * sigmoid(x) | Smooth, self-gated | More compute than ReLU | EfficientNet family |
| Sigmoid | 1 / (1 + e^(-x)) | Bounded (0,1) | Vanishing gradients | Output layer (binary classification) |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | Zero-centered | Vanishing gradients | RNN hidden states (legacy) |
Table 3.3: Comparison of common activation functions and their trade-offs.
GELU (Gaussian Error Linear Unit) has become the standard activation in Transformer models, providing smooth gradients that improve training stability. Swish (x * sigmoid(x)) offers similar benefits and is used in EfficientNet and other modern architectures.
Hardware Implications
From a systems perspective, activation function choice impacts quantization compatibility and hardware acceleration. ReLU is particularly hardware-friendly because it only requires a comparison operation. More complex activations like GELU require lookup tables or polynomial approximations on hardware without native support.
For most applications, start with ReLU. If you are building a Transformer, use GELU. If you observe dying neurons (check the fraction of zero activations), switch to Leaky ReLU or ELU. For edge deployment where every microsecond matters, prefer ReLU for its hardware simplicity.
GELU requires computing the Gaussian CDF, which has no closed-form expression in elementary functions. On devices without native support, it is typically approximated as GELU(x) ≈ 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3))). This polynomial approximation adds latency compared to ReLU's simple comparison, which can be significant at the scale of billions of operations.
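A small sketch comparing the erf-based GELU with the tanh approximation given above; the input values are arbitrary.

```python
import math
import torch

def gelu_exact(x):
    # GELU via the Gaussian CDF expressed with the error function.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based polynomial approximation used when erf is unavailable.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-3, 3, 7)
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```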
ReLU
Rectified Linear Unit, an activation function that outputs max(0, x), widely used for its computational simplicity and effective gradient flow.
Vanishing Gradient
A training problem where gradients become extremely small as they propagate through many layers, effectively preventing early layers from learning.
04 Loss Functions and Optimization
Loss functions quantify how far the model predictions are from the desired outputs, providing the signal that drives learning. The choice of loss function encodes assumptions about the problem and can significantly influence model behavior.
Loss Function
A function that quantifies the discrepancy between model predictions and ground truth labels, providing the objective that the optimizer minimizes during training. The loss function encodes what "good predictions" means for a given task.
L_{\text{CE}} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)
L_{\text{CE}}^{\text{batch}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C} y_{n,i} \log(\hat{y}_{n,i})
| Loss Function | Formula | Task | Key Property |
|---|---|---|---|
| Cross-Entropy | -sum(y * log(y-hat)) | Classification | Penalizes confident wrong predictions heavily |
| Mean Squared Error | (1/N) * sum((y - y-hat)^2) | Regression | Penalizes large errors quadratically |
| Mean Absolute Error | (1/N) * sum(abs(y - y-hat)) | Regression | Robust to outliers |
| Huber Loss | MSE if small error, MAE if large | Regression | Combines benefits of MSE and MAE |
| Focal Loss | -alpha * (1 - y-hat)^gamma * log(y-hat) | Imbalanced classification | Down-weights easy examples |
Table 3.4: Common loss functions and their properties.
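To illustrate the key property in the first row, here is a small sketch with hypothetical three-class logits showing how cross-entropy punishes a confident wrong prediction far more than an uncertain one.

```python
import torch
import torch.nn.functional as F

# True class is 0. One prediction is confidently wrong, one is near-uniform.
target = torch.tensor([0])
confident_wrong = torch.tensor([[-5.0, 5.0, 0.0]])  # strongly predicts class 1
uncertain = torch.tensor([[0.1, 0.0, -0.1]])         # nearly uniform prediction

print(F.cross_entropy(confident_wrong, target).item())  # large loss (~10)
print(F.cross_entropy(uncertain, target).item())        # modest loss (~1.0)
```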
Stochastic Gradient Descent and Mini-Batches
Stochastic gradient descent (SGD) and its variants are the workhorses of deep learning optimization. Rather than computing gradients over the entire dataset (full batch gradient descent), SGD estimates gradients from small random subsets (mini-batches). This introduces noise that can actually help escape local minima and saddle points.
The stochasticity in SGD is not merely a computational shortcut — it is a feature. The gradient noise from mini-batch sampling acts as implicit regularization, helping the optimizer find flatter minima that generalize better. This is why models trained with smaller batch sizes often generalize better than those trained with very large batches.
Modern Optimizers
Modern optimizers like Adam, AdamW, and LAMB combine momentum (exponential moving average of gradients) with adaptive learning rates (per-parameter scaling based on gradient history). Adam is the default choice for most applications due to its robustness to hyperparameter settings. AdamW adds proper weight decay decoupling, which improves generalization.
Use AdamW as your default optimizer. It works well across most tasks and is forgiving of suboptimal learning rate choices. Use SGD with momentum for training CNNs when you can afford careful learning rate tuning — it often yields better final accuracy. For very large batch distributed training, consider LAMB or LARS which scale learning rates per layer.
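A minimal sketch of the two setups described above; the model and hyperparameters are illustrative assumptions, not recommended values for any specific task.

```python
import torch

# Hypothetical model standing in for a real architecture.
model = torch.nn.Linear(512, 10)

# AdamW: decoupled weight decay, forgiving default choice.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# SGD with momentum: often better final accuracy on CNNs with careful LR tuning.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```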
Learning Rate Scheduling
Learning rate scheduling is a critical component of training systems. The learning rate schedule can be more important than the optimizer choice for achieving good performance.
- Linear warmup + cosine decay — Start with a low learning rate, ramp up linearly, then decay following a cosine curve. Standard for Transformers.
- Step decay — Reduce learning rate by a factor (e.g., 0.1) at fixed epoch milestones. Classic choice for CNNs.
- One-cycle — Ramp learning rate up and then down over one full training run. Fast convergence with good generalization.
- Reduce on plateau — Monitor validation loss and reduce learning rate when progress stalls. Simple but effective.
When scaling to distributed training with larger effective batch sizes, the learning rate must be adjusted. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor. However, this requires a warmup period to stabilize training. Without proper scaling, distributed training may diverge or converge to poor solutions.
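A sketch combining linear warmup, cosine decay, and the linear scaling rule using PyTorch's LambdaLR; the model, base learning rate, batch-scale factor, and step counts are hypothetical.

```python
import math
import torch

# Hypothetical setup: base LR scaled by the batch-size increase factor.
base_lr, batch_scale = 3e-4, 8
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr * batch_scale)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)              # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay: 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step().
```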
Cross-Entropy Loss
A loss function that measures the divergence between predicted probability distributions and true labels, standard for classification tasks.
Adam Optimizer
An adaptive learning rate optimizer that combines momentum and per-parameter learning rate scaling, widely used for its robustness and fast convergence.
05 Gradient Descent and Training Dynamics
Understanding the loss landscape is key to understanding why training neural networks is challenging. The loss surface of deep networks is high-dimensional and non-convex, containing many local minima, saddle points, and flat regions.
Loss Landscape
The multidimensional surface defined by the loss function over all possible parameter values. Its geometry — including the distribution of minima, saddle points, and flat regions — determines how difficult a network is to train and how well gradient-based optimizers can navigate it.
Research has shown that most local minima in deep networks achieve similar loss values. The primary challenge is not finding the global minimum but rather navigating saddle points and flat regions efficiently. In high-dimensional spaces, saddle points vastly outnumber local minima and can stall training for many iterations.
Batch Size: A Systems and Optimization Trade-off
Batch size is a critical hyperparameter that affects both training dynamics and systems performance. Larger batch sizes provide more accurate gradient estimates and enable better GPU utilization, but they can also lead to sharp minima that generalize poorly.
| Batch Size | GPU Utilization | Gradient Quality | Generalization | Training Time |
|---|---|---|---|---|
| Small (32-64) | Low | Noisy but regularizing | Often better | Slow (many iterations) |
| Medium (256-1024) | Good | Balanced | Good | Moderate |
| Large (4096+) | Excellent | Accurate but may overfit | May require tuning | Fast (fewer iterations) |
Table 3.5: How batch size affects training dynamics and systems performance.
Gradient Clipping
Gradient Clipping
A technique that limits gradient magnitudes during training to prevent exploding gradients. Gradients exceeding a threshold are scaled down proportionally, maintaining their direction but capping their magnitude.
Gradient clipping is a practical technique that prevents training instability by capping the magnitude of gradients. This is especially important for recurrent networks and Transformers, where gradient magnitudes can vary dramatically.
class="tok-comment"># Global gradient norm clipping in PyTorch
torch.nn.utils.clip_grad_norm_(
model.parameters(),
max_norm=class="tok-number">1.0 class="tok-comment"># Clip if global norm exceeds class="tok-number">1.0
)
optimizer.step()Convergence Diagnostics
Convergence diagnostics are essential for efficient training. Monitoring key metrics over time helps identify issues before they waste expensive compute hours.
- Training loss — Should decrease steadily; plateaus suggest learning rate issues
- Validation loss — Divergence from training loss signals overfitting
- Gradient norms — Spikes indicate instability; consistently small values indicate vanishing gradients
- Learning rate — Verify the schedule is applied correctly
- GPU memory usage — Unexpected growth may indicate memory leaks in data loading
Convergence diagnostics should be integrated into the training infrastructure as a first-class concern, not an afterthought. Log these metrics to a tracking platform (W&B, MLflow, TensorBoard) from the start of every project. The cost of logging is negligible compared to the cost of a failed training run you cannot diagnose.
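A minimal sketch of computing the global gradient norm for logging; the helper name is hypothetical, and the print call stands in for whichever tracking platform you use.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """Return the L2 norm of all parameter gradients currently stored in .grad."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# After loss.backward() in the training loop:
# print(f"step {step}: grad_norm={global_grad_norm(model):.4f}")
```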
Batch normalization stabilizes training by normalizing layer inputs to zero mean and unit variance.
Deeper Insight: Without batch normalization, each layer must constantly adapt to shifting input distributions caused by parameter updates in preceding layers (internal covariate shift). Batch norm eliminates this moving target, letting each layer learn independently. This is why adding batch norm often lets you use higher learning rates and train much faster -- each layer sees a stable input range.
Think of it like trying to learn archery while someone keeps moving the target. Batch normalization pins the target in place for each layer, so every layer can focus on improving its own aim rather than constantly adjusting for the moving inputs from the layer before it.
A training run plateaus at unexpectedly high loss after 5 epochs. Checking gradient norms reveals they dropped to near-zero values, suggesting vanishing gradients. The team switches from sigmoid to ReLU activations and adds batch normalization. The next run converges smoothly, reaching the target loss in 20 epochs.
Loss Landscape
The multidimensional surface defined by the loss function over all possible parameter values, whose geometry determines training difficulty.
Gradient Clipping
A technique that limits gradient magnitudes during training to prevent exploding gradients and maintain training stability.
Key Takeaways
1. Neural networks learn hierarchical representations through compositions of linear transformations and nonlinear activation functions.
2. Backpropagation requires storing intermediate activations, making memory management a critical systems concern for deep networks.
3. Activation function choice affects not only training dynamics but also hardware compatibility and inference efficiency.
4. Modern optimizers like Adam provide robust training by combining momentum with adaptive per-parameter learning rates.
5. The backward pass requires approximately twice the computation and memory of the forward pass, which impacts training system design.