ML Systems, Part 1: Core Concepts
Chapter 3: Foundations (~40 min)

DL Primer

Covers deep learning fundamentals from a systems perspective. Introduces neural networks, backpropagation, activation functions, and loss functions.

Keywords: neural networks, backpropagation, activation functions, loss functions, gradient descent
Learning Objectives

  • Explain the fundamental principles of deep learning including backpropagation and gradient descent
  • Compare activation functions and their impact on network training dynamics
  • Analyze the role of loss functions in guiding model optimization
  • Implement a basic neural network training loop with forward and backward passes
  • Evaluate regularization techniques for preventing overfitting in deep models

01 Neural Network Fundamentals

Neural networks are the computational backbone of modern deep learning. At their core, they are compositions of linear transformations followed by nonlinear activation functions, organized into layers. Each layer transforms its input through a weight matrix multiplication, adds a bias vector, and applies an activation function to produce its output.

Definition

Neural Network

A computational model composed of layers of interconnected nodes (neurons) that learn to transform inputs into outputs through adjustable weights and nonlinear activation functions. Each layer applies a linear transformation followed by a nonlinear activation.

\mathbf{h} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})
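As a minimal numeric sketch of this definition (illustrative, not from the chapter), a layer multiplies its input by a weight matrix, adds a bias, and applies an elementwise nonlinearity:

```python
import numpy as np

def relu(z):
    # Elementwise nonlinearity: max(0, z)
    return np.maximum(0.0, z)

def layer_forward(x, W, b):
    # One layer: linear transformation followed by a nonlinear
    # activation, i.e. h = sigma(W x + b)
    return relu(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W = rng.normal(size=(4, 3))     # weight matrix (4 hidden units)
b = np.zeros(4)                 # bias vector

h = layer_forward(x, W, b)
print(h.shape)  # (4,)
```

Stacking calls to `layer_forward` with different weights is all a multi-layer perceptron is; everything that follows in this chapter is about how those weights are learned.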

Figure: Backpropagation -- Forward & Backward Pass

Figure 3.1: Forward and backward pass through a neural network. Blue arrows show the forward pass (input to output); orange arrows show gradient flow during backpropagation.

From Perceptrons to Deep Networks

The simplest neural network is the single-layer perceptron, which can only learn linearly separable functions. By stacking multiple layers, deep neural networks gain the ability to learn hierarchical representations of data. Early layers typically capture low-level features (edges, textures), while deeper layers compose these into increasingly abstract representations (objects, concepts).

Hierarchical Representation Learning

In an image classification network, layer 1 might detect edges and color gradients. Layer 2 combines edges into corners and textures. Layer 3 assembles these into parts (eyes, wheels). Layer 4 recognizes whole objects (faces, cars). This progression from low-level to high-level features is the key insight behind deep learning's success.

Systems Implications of Network Architecture

From a systems perspective, understanding neural network architecture is essential for making informed decisions about memory requirements, computational complexity, and hardware utilization. The number of parameters directly determines memory consumption, while the number of floating-point operations (FLOPs) determines computational cost.

Table 3.1: How key network properties scale and their systems implications.

| Property | Scales With | Systems Impact |
| --- | --- | --- |
| Parameters | Width^2 * Depth | Memory consumption, model size on disk |
| FLOPs | Width^2 * Depth * Batch Size | Training time, GPU compute cost |
| Activations | Width * Depth * Batch Size | Peak GPU memory during training |
| Gradient Size | Same as Parameters | Communication cost in distributed training |

Estimating Memory Requirements

A quick rule of thumb: each parameter in float32 requires 4 bytes. During training with Adam optimizer, you need roughly 12 bytes per parameter (4 for weights, 4 for gradient, 4 for optimizer state). A 1-billion parameter model therefore needs approximately 12 GB just for parameters and optimizer state, before accounting for activations.
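This rule of thumb is simple enough to encode directly. The helper below is a sketch using the chapter's approximate byte counts; real allocators and frameworks add overhead on top, and activations are deliberately excluded.

```python
def training_memory_gb(num_params: int,
                       weight_bytes: int = 4,
                       grad_bytes: int = 4,
                       optimizer_bytes: int = 4) -> float:
    """Estimate training memory in GB for parameters + gradients +
    optimizer state, using the ~12 bytes/param rule of thumb for
    FP32 weights with Adam. Activations are not included."""
    total_bytes = num_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return total_bytes / 1e9

# A 1-billion parameter model: ~12 GB before activations
print(training_memory_gb(1_000_000_000))  # 12.0
```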

The Universal Approximation Theorem

The universal approximation theorem tells us that sufficiently wide neural networks can approximate any continuous function. However, this theoretical guarantee says nothing about the practical challenges of training such networks or deploying them efficiently. The gap between what is theoretically possible and what is practically achievable is where ML systems engineering lives.

Theory vs. Practice

The universal approximation theorem guarantees existence but not efficiency. A network that can theoretically approximate any function may require astronomically many parameters or be impossible to train with gradient descent. Practical deep learning relies on architectural inductive biases (convolutions for spatial data, attention for sequences) to make learning feasible.

Neural Network

A computational model composed of layers of interconnected nodes that learn to transform inputs into outputs through adjustable weights and nonlinear activation functions.

Universal Approximation

The theoretical property that neural networks with sufficient width can approximate any continuous function to arbitrary precision.

02 Backpropagation and Gradient Computation

Deeper Insight

Despite its mystique, backpropagation is not a novel mathematical discovery. It is a systematic bookkeeping method for applying the chain rule at every layer. The brilliance lies in efficiency: rather than computing each gradient independently (which would be impossibly slow), backprop reuses intermediate results from deeper layers, turning an exponential problem into a linear one.

Think of it like... a factory assembly line where each station gets a note saying "here is how much the final product was off -- your contribution to the error was this much." Each station passes the note backward to the previous station, adjusted for its own work. No station needs to understand the entire factory to know how to improve.

Backpropagation is the algorithm that makes deep learning possible. It efficiently computes the gradient of the loss function with respect to every parameter in the network by applying the chain rule of calculus from the output layer back through the network. This process enables gradient-based optimization of networks with millions or billions of parameters.

Definition

Backpropagation

An algorithm for efficiently computing the gradient of the loss function with respect to every parameter in a neural network by recursively applying the chain rule of calculus from the output layer backward through the network.

\frac{\partial L}{\partial \mathbf{W}_l} = \frac{\partial L}{\partial \mathbf{h}_l} \cdot \frac{\partial \mathbf{h}_l}{\partial \mathbf{W}_l}

Forward and Backward Passes

The forward pass computes the network output by propagating inputs through each layer sequentially. The backward pass then propagates error gradients in reverse, computing how each parameter contributed to the total loss. This symmetry between forward and backward passes has profound implications for system design.

The 2x Rule

The backward pass requires approximately twice the computation and memory of the forward pass. This means that during training, roughly two-thirds of the total compute is spent on gradient computation. This ratio has direct implications for GPU utilization and training cost estimation.

Quick Check

During neural network training, approximately what fraction of total computation is spent on the backward pass (gradient computation)?

Hint: if the backward pass costs 2x the forward pass, what fraction of (1x + 2x) is 2x?

Answer: about 2/3. The backward pass requires roughly 2x the computation of the forward pass, and total compute = forward + backward, so the backward pass accounts for about two-thirds of the total. This is why training is much more expensive than inference.

Memory Challenges and Gradient Checkpointing

Memory consumption during backpropagation is a critical systems concern. The standard implementation requires storing all intermediate activations from the forward pass to compute gradients during the backward pass. For deep networks, this can consume tens of gigabytes of GPU memory.

Definition

Gradient Checkpointing

A memory optimization technique that reduces peak memory usage during training by discarding some intermediate activations during the forward pass and recomputing them during the backward pass. This trades roughly 20-30% additional computation for up to 60-70% memory reduction.

When to Use Gradient Checkpointing

Enable gradient checkpointing when your model fits in GPU memory for inference but runs out of memory during training. In PyTorch, use torch.utils.checkpoint.checkpoint() on memory-intensive layers. This is especially effective for Transformer models with many attention layers.
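A minimal sketch of what that looks like, assuming a recent PyTorch (`use_reentrant=False` selects the non-reentrant checkpointing variant; the `CheckpointedMLP` name and sizes are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(torch.nn.Module):
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
            for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward
            # pass and recomputed during backward: compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(32, 256, requires_grad=True)
loss = model(x).sum()
loss.backward()  # checkpointed activations are recomputed here
```

In practice you would checkpoint only the most memory-hungry blocks (e.g. attention layers) rather than every layer, since each checkpointed region pays the recomputation cost.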

Automatic Differentiation Frameworks

Automatic differentiation, the generalization of backpropagation, is a core feature of modern ML frameworks. The design choice between static and dynamic graphs has significant implications for debugging, performance optimization, and deployment flexibility.

Table 3.2: Dynamic vs. static computational graph approaches.

| Feature | Dynamic Graphs (PyTorch) | Static Graphs (TensorFlow 1.x) |
| --- | --- | --- |
| Graph construction | Built on-the-fly during execution | Defined before execution |
| Debugging | Standard Python debugger works | Requires special tools (tfdbg) |
| Control flow | Native Python if/for/while | Special tf.cond / tf.while_loop |
| Performance | JIT compilation (torch.compile) | Ahead-of-time optimization |
| Deployment | TorchScript or ONNX export | SavedModel, TF Lite, TF.js |

Backpropagation in PyTorch

In PyTorch, calling loss.backward() triggers the backward pass through the dynamic graph. Each tensor records the operations that created it, forming a chain that backward() traverses to compute gradients. The gradients accumulate in each parameter's .grad attribute, ready for the optimizer to apply updates.
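The full cycle can be sketched end to end (illustrative model and random data, assuming PyTorch is available): each iteration runs a forward pass, a backward pass, and a parameter update.

```python
import torch

torch.manual_seed(0)

# A tiny regression model trained on one fixed batch, showing the
# forward pass, backward pass, and update step described above.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

x = torch.randn(64, 10)
t = torch.randn(64, 1)

losses = []
for step in range(100):
    optimizer.zero_grad()     # clear accumulated .grad buffers
    y = model(x)              # forward pass builds the dynamic graph
    loss = loss_fn(y, t)
    loss.backward()           # backward pass populates each p.grad
    optimizer.step()          # optimizer applies the gradient update
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note the explicit `zero_grad()`: because gradients accumulate in `.grad`, forgetting it silently sums gradients across iterations, a classic training bug.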

Backpropagation

An algorithm for computing gradients of the loss function with respect to network parameters by efficiently applying the chain rule from output to input layers.

Gradient Checkpointing

A memory optimization technique that recomputes intermediate activations during the backward pass instead of storing them, trading compute for reduced memory usage.

03 Activation Functions

Activation functions introduce nonlinearity into neural networks, which is essential for learning complex patterns. Without nonlinear activations, a multi-layer network would reduce to a single linear transformation, regardless of its depth. The choice of activation function affects training dynamics, model expressiveness, and computational efficiency.

Definition

Activation Function

A nonlinear function applied element-wise to the output of a neural network layer. Activation functions are what give neural networks the ability to learn complex, nonlinear relationships. Without them, any stack of layers collapses to a single linear transformation.

ReLU and the Vanishing Gradient Problem

The Rectified Linear Unit (ReLU) and its variants dominate modern deep learning due to their simplicity and favorable gradient properties. ReLU computes max(0, x), which is computationally cheap and avoids the vanishing gradient problem that plagued earlier activation functions like sigmoid and tanh.

\text{ReLU}(x) = \max(0, x) \qquad \text{Leaky ReLU}(x) = \max(\alpha x, x)
The Dying ReLU Problem

When a ReLU neuron receives only negative inputs, its output is always zero and its gradient is always zero. This means the neuron stops learning entirely and is effectively "dead." In large networks, a significant fraction of neurons can die during training, reducing model capacity. Leaky ReLU and ELU mitigate this by allowing small gradients for negative inputs.

Quick Check

Why has GELU become the standard activation function in Transformer models instead of ReLU?

Answer: GELU produces smooth, continuous gradients (unlike ReLU's sharp zero point), which helps Transformer training converge more reliably. It does cost more compute than ReLU, but the stability gains outweigh this on modern hardware.

Modern Activation Functions

Table 3.3: Comparison of common activation functions and their trade-offs.

| Activation | Formula | Pros | Cons | Common Use |
| --- | --- | --- | --- | --- |
| ReLU | max(0, x) | Simple, fast, good gradients | Dying neurons | CNNs, general purpose |
| Leaky ReLU | max(0.01x, x) | No dying neurons | Marginal improvement | When ReLU underperforms |
| GELU | x * Phi(x) | Smooth, best for Transformers | More compute than ReLU | Transformers (BERT, GPT) |
| Swish | x * sigmoid(x) | Smooth, self-gated | More compute than ReLU | EfficientNet family |
| Sigmoid | 1 / (1 + e^(-x)) | Bounded (0, 1) | Vanishing gradients | Output layer (binary classification) |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | Zero-centered | Vanishing gradients | RNN hidden states (legacy) |

GELU (Gaussian Error Linear Unit) has become the standard activation in Transformer models, providing smooth gradients that improve training stability. Swish (x * sigmoid(x)) offers similar benefits and is used in EfficientNet and other modern architectures.

Hardware Implications

From a systems perspective, activation function choice impacts quantization compatibility and hardware acceleration. ReLU is particularly hardware-friendly because it only requires a comparison operation. More complex activations like GELU require lookup tables or polynomial approximations on hardware without native support.

Choosing an Activation Function

For most applications, start with ReLU. If you are building a Transformer, use GELU. If you observe dying neurons (check the fraction of zero activations), switch to Leaky ReLU or ELU. For edge deployment where every microsecond matters, prefer ReLU for its hardware simplicity.
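Checking for dying neurons can be as simple as measuring the fraction of exact zeros after a ReLU layer. The sketch below assumes PyTorch (the `dead_fraction` helper is ours, not a library function); note that roughly half the activations are zero at random initialization, so what matters is the trend over training, not the absolute value.

```python
import torch

def dead_fraction(layer_output: torch.Tensor) -> float:
    # Fraction of exactly-zero activations after ReLU; a high value
    # that persists across batches suggests dying neurons.
    return (layer_output == 0).float().mean().item()

torch.manual_seed(0)
x = torch.randn(256, 128)
h = torch.relu(torch.nn.Linear(128, 128)(x))
print(f"dead fraction at init: {dead_fraction(h):.2f}")  # roughly 0.5
```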

GELU on Edge Devices

GELU requires computing the Gaussian CDF, which has no closed-form solution. On devices without native support, it is typically approximated as GELU(x) = 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3))). This polynomial approximation adds latency compared to ReLU's simple comparison, which can be significant at the scale of billions of operations.
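The tanh approximation can be checked against the exact (erf-based) GELU directly. This sketch assumes PyTorch, whose `torch.nn.functional.gelu` computes the exact form by default:

```python
import math
import torch

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Tanh-based polynomial approximation of GELU, as used on
    # hardware without a native Gaussian-CDF primitive.
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * x ** 3)))

x = torch.linspace(-4, 4, 101)
exact = torch.nn.functional.gelu(x)   # erf-based "exact" GELU
approx = gelu_tanh(x)
print(f"max abs error: {(exact - approx).abs().max():.2e}")
```

The approximation error is tiny for model accuracy, but the extra multiplies and the tanh evaluation are exactly the latency cost the paragraph above describes.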

ReLU

Rectified Linear Unit, an activation function that outputs max(0, x), widely used for its computational simplicity and effective gradient flow.

Vanishing Gradient

A training problem where gradients become extremely small as they propagate through many layers, effectively preventing early layers from learning.

04 Loss Functions and Optimization

Loss functions quantify how far the model predictions are from the desired outputs, providing the signal that drives learning. The choice of loss function encodes assumptions about the problem and can significantly influence model behavior.

Definition

Loss Function

A function that quantifies the discrepancy between model predictions and ground truth labels, providing the objective that the optimizer minimizes during training. The loss function encodes what "good predictions" means for a given task.

L_{\text{CE}} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)

Averaged over a mini-batch of N examples, this becomes:

L_{\text{CE}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C} y_{n,i} \log(\hat{y}_{n,i})
Table 3.4: Common loss functions and their properties.

| Loss Function | Formula | Task | Key Property |
| --- | --- | --- | --- |
| Cross-Entropy | -sum(y * log(y-hat)) | Classification | Penalizes confident wrong predictions heavily |
| Mean Squared Error | (1/N) * sum((y - y-hat)^2) | Regression | Penalizes large errors quadratically |
| Mean Absolute Error | (1/N) * sum(abs(y - y-hat)) | Regression | Robust to outliers |
| Huber Loss | MSE if small error, MAE if large | Regression | Combines benefits of MSE and MAE |
| Focal Loss | -alpha * (1 - y-hat)^gamma * log(y-hat) | Imbalanced classification | Down-weights easy examples |
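Cross-entropy is easy to verify by hand against a framework implementation. A small sketch (assuming PyTorch): for a one-hot target, the loss is just the negative log-probability the model assigns to the true class.

```python
import torch
import torch.nn.functional as F

# Manual cross-entropy on softmax probabilities, checked against
# PyTorch's fused logits-based implementation.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])                  # true class index

probs = F.softmax(logits, dim=-1)
manual = -torch.log(probs[0, target[0]])    # -log p(true class)
fused = F.cross_entropy(logits, target)

print(manual.item(), fused.item())          # identical up to numerics
```

The fused version works directly on logits and is the numerically stable form to use in practice; the manual version exists only to make the definition concrete.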

Stochastic Gradient Descent and Mini-Batches

Stochastic gradient descent (SGD) and its variants are the workhorses of deep learning optimization. Rather than computing gradients over the entire dataset (full batch gradient descent), SGD estimates gradients from small random subsets (mini-batches). This introduces noise that can actually help escape local minima and saddle points.

The Benefit of Noise

The stochasticity in SGD is not merely a computational shortcut — it is a feature. The gradient noise from mini-batch sampling acts as implicit regularization, helping the optimizer find flatter minima that generalize better. This is why models trained with smaller batch sizes often generalize better than those trained with very large batches.

Modern Optimizers

Modern optimizers like Adam, AdamW, and LAMB combine momentum (exponential moving average of gradients) with adaptive learning rates (per-parameter scaling based on gradient history). Adam is the default choice for most applications due to its robustness to hyperparameter settings. AdamW adds proper weight decay decoupling, which improves generalization.

Optimizer Selection Guide

Use AdamW as your default optimizer. It works well across most tasks and is forgiving of suboptimal learning rate choices. Use SGD with momentum for training CNNs when you can afford careful learning rate tuning — it often yields better final accuracy. For very large batch distributed training, consider LAMB or LARS which scale learning rates per layer.

Learning Rate Scheduling

Learning rate scheduling is a critical component of training systems. The learning rate schedule can be more important than the optimizer choice for achieving good performance.

  • Linear warmup + cosine decay — Start with a low learning rate, ramp up linearly, then decay following a cosine curve. Standard for Transformers.
  • Step decay — Reduce learning rate by a factor (e.g., 0.1) at fixed epoch milestones. Classic choice for CNNs.
  • One-cycle — Ramp learning rate up and then down over one full training run. Fast convergence with good generalization.
  • Reduce on plateau — Monitor validation loss and reduce learning rate when progress stalls. Simple but effective.
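The first schedule above can be sketched in a few lines (the function name and constants are illustrative; frameworks provide equivalent built-in schedulers):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    # Linear warmup to peak_lr over warmup_steps, then cosine
    # decay to zero over the remaining steps.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Ramp up for the first 100 steps, decay over the remaining 900
lrs = [warmup_cosine_lr(s, 1000, 100, 3e-4) for s in range(1000)]
```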
Learning Rate and Batch Size

When scaling to distributed training with larger effective batch sizes, the learning rate must be adjusted. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor. However, this requires a warmup period to stabilize training. Without proper scaling, distributed training may diverge or converge to poor solutions.
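The linear scaling rule itself is a one-line computation (illustrative numbers):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # Linear scaling rule: multiply the learning rate by the
    # batch-size increase factor (combine with warmup in practice).
    return base_lr * (new_batch / base_batch)

# Scaling from batch 256 on one GPU to batch 2048 across 8 GPUs:
print(scaled_lr(0.1, 256, 2048))  # 0.8
```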


Cross-Entropy Loss

A loss function that measures the divergence between predicted probability distributions and true labels, standard for classification tasks.

Adam Optimizer

An adaptive learning rate optimizer that combines momentum and per-parameter learning rate scaling, widely used for its robustness and fast convergence.

05 Gradient Descent and Training Dynamics

Understanding the loss landscape is key to understanding why training neural networks is challenging. The loss surface of deep networks is high-dimensional and non-convex, containing many local minima, saddle points, and flat regions.

Definition

Loss Landscape

The multidimensional surface defined by the loss function over all possible parameter values. Its geometry — including the distribution of minima, saddle points, and flat regions — determines how difficult a network is to train and how well gradient-based optimizers can navigate it.

Most Local Minima Are Good Enough

Research has shown that most local minima in deep networks achieve similar loss values. The primary challenge is not finding the global minimum but rather navigating saddle points and flat regions efficiently. In high-dimensional spaces, saddle points vastly outnumber local minima and can stall training for many iterations.

Batch Size: A Systems and Optimization Trade-off

Batch size is a critical hyperparameter that affects both training dynamics and systems performance. Larger batch sizes provide more accurate gradient estimates and enable better GPU utilization, but they can also lead to sharp minima that generalize poorly.

Table 3.5: How batch size affects training dynamics and systems performance.

| Batch Size | GPU Utilization | Gradient Quality | Generalization | Training Time |
| --- | --- | --- | --- | --- |
| Small (32-64) | Low | Noisy but regularizing | Often better | Slow (many iterations) |
| Medium (256-1024) | Good | Balanced | Good | Moderate |
| Large (4096+) | Excellent | Accurate but may overfit | May require tuning | Fast (fewer iterations) |

Gradient Clipping

Definition

Gradient Clipping

A technique that limits gradient magnitudes during training to prevent exploding gradients. Gradients exceeding a threshold are scaled down proportionally, maintaining their direction but capping their magnitude.

Gradient clipping is a practical technique that prevents training instability by capping the magnitude of gradients. This is especially important for recurrent networks and Transformers, where gradient magnitudes can vary dramatically.

python
# Global gradient norm clipping in PyTorch
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,  # clip if the global L2 norm exceeds 1.0
)
optimizer.step()

Convergence Diagnostics

Convergence diagnostics are essential for efficient training. Monitoring key metrics over time helps identify issues before they waste expensive compute hours.

  • Training loss — Should decrease steadily; plateaus suggest learning rate issues
  • Validation loss — Divergence from training loss signals overfitting
  • Gradient norms — Spikes indicate instability; consistently small values indicate vanishing gradients
  • Learning rate — Verify the schedule is applied correctly
  • GPU memory usage — Unexpected growth may indicate memory leaks in data loading
Build Diagnostics Into Your Training Loop

Convergence diagnostics should be integrated into the training infrastructure as a first-class concern, not an afterthought. Log these metrics to a tracking platform (W&B, MLflow, TensorBoard) from the start of every project. The cost of logging is negligible compared to the cost of a failed training run you cannot diagnose.
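As a concrete example of one such diagnostic, the sketch below computes the global gradient norm after a backward pass. The `global_grad_norm` helper is ours, not a library function; a real project would log this value to W&B, MLflow, or TensorBoard every step.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    # L2 norm over all parameter gradients -- spikes indicate
    # instability, persistently tiny values suggest vanishing gradients.
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

model = torch.nn.Linear(8, 1)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
print(f"step 0: grad_norm={global_grad_norm(model):.3f}")
```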


Deeper Insight

Without batch normalization, each layer must constantly adapt to shifting input distributions caused by parameter updates in preceding layers (internal covariate shift). Batch norm eliminates this moving target, letting each layer learn independently. This is why adding batch norm often lets you use higher learning rates and train much faster -- each layer sees a stable input range.

Think of it like... trying to learn archery while someone keeps moving the target. Batch normalization pins the target in place for each layer, so every layer can focus on improving its own aim rather than constantly adjusting for the moving inputs from the layer before it.

Diagnosing a Stuck Training Run

A training run plateaus at unexpectedly high loss after 5 epochs. Checking gradient norms reveals they dropped to near-zero values, suggesting vanishing gradients. The team switches from sigmoid to ReLU activations and adds batch normalization. The next run converges smoothly, reaching the target loss in 20 epochs.

Loss Landscape

The multidimensional surface defined by the loss function over all possible parameter values, whose geometry determines training difficulty.

Gradient Clipping

A technique that limits gradient magnitudes during training to prevent exploding gradients and maintain training stability.

Key Takeaways

  1. Neural networks learn hierarchical representations through compositions of linear transformations and nonlinear activation functions.
  2. Backpropagation requires storing intermediate activations, making memory management a critical systems concern for deep networks.
  3. Activation function choice affects not only training dynamics but also hardware compatibility and inference efficiency.
  4. Modern optimizers like Adam provide robust training by combining momentum with adaptive per-parameter learning rates.
  5. The backward pass requires approximately twice the computation and memory of the forward pass, which impacts training system design.

CH.03

Chapter Complete

Up next: DNN Architectures


Quiz

DL Primer Quiz

Test your understanding of deep learning fundamentals including neural networks, backpropagation, and optimization.


5 questions, randomized from pool, 70% to pass