ML Systems · Part 2: System Design · Chapter 5 (~30 min)

AI Workflow

Covers the ML development lifecycle and methodology. Introduces experiment tracking, reproducibility, and systematic approaches to model development.

Keywords: development lifecycle, experiment tracking, reproducibility, methodology, MLOps workflow
Learning objectives:
  • Describe the complete model lifecycle from ideation through deployment and retirement
  • Compare experiment tracking approaches and their role in reproducibility
  • Design a model versioning and registry strategy for team collaboration
  • Analyze CI/CD pipeline requirements specific to machine learning workflows
  • Implement model governance practices including approval gates and audit trails

01 The ML Development Lifecycle

The ML development lifecycle provides a structured methodology for taking models from concept to production. Unlike traditional software development, ML development is fundamentally experimental: success requires systematic iteration across data preparation, model development, training, evaluation, and deployment stages.

Definition

ML Development Lifecycle

The structured sequence of stages from problem definition through data preparation, model development, training, evaluation, deployment, and monitoring that governs ML project development. It is inherently iterative, with feedback from later stages driving improvements to earlier ones.

Figure 5.1: The ML development lifecycle is a continuous cycle through seven stages: Problem, Collect, Prepare, Train, Evaluate, Deploy, Monitor. Each stage feeds into the next, with monitoring driving feedback to all upstream stages.

Problem Formulation

The lifecycle begins with problem formulation, where engineers define the task, success metrics, and constraints. This stage is often underestimated but is critical for project success.

The Most Common Cause of ML Project Failure

Most ML projects that fail do so not because of model quality but because of poor problem formulation. Building an accurate model for the wrong objective wastes months of engineering effort. Before writing any code, clearly define: What decision will this model inform? How will you measure success? What are the latency and cost constraints? Is there enough data?

  1. Define the business objective and how ML predictions will be used
  2. Specify success metrics (accuracy, precision, recall, F1) with concrete thresholds
  3. Establish latency budgets and throughput requirements for serving
  4. Assess data availability, quality, and labeling needs
  5. Identify potential biases and ethical considerations
  6. Set a timeline and compute budget for experimentation
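The checklist above can be captured as a small, machine-checkable record so a project cannot quietly start with an unspecified constraint. A minimal sketch using a hypothetical `ProblemSpec` dataclass; all field names and the example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    """Hypothetical problem-formulation record; fields mirror the checklist."""
    objective: str           # the decision the model will inform
    success_metric: str      # e.g. 'f1' or 'precision_at_recall_0.8'
    metric_threshold: float  # minimum acceptable value of the metric
    latency_budget_ms: float # serving latency budget
    labeled_examples: int    # labeled data currently available

    def is_ready(self) -> bool:
        # A project is ready to start only when every constraint is concrete.
        return (self.metric_threshold > 0
                and self.latency_budget_ms > 0
                and self.labeled_examples > 0)

spec = ProblemSpec(
    objective='flag fraudulent transactions for manual review',
    success_metric='precision_at_recall_0.8',
    metric_threshold=0.9,
    latency_budget_ms=50,
    labeled_examples=120_000,
)
print(spec.is_ready())  # True
```

Writing the spec down as data rather than prose also makes it easy to attach to every experiment run later.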

Iterative Development and Artifact Management

After problem formulation, the lifecycle moves through iterative cycles of data preparation, feature engineering, model selection, training, and evaluation. Each cycle produces artifacts: processed datasets, trained model weights, evaluation reports, and configuration files.

Table 5.1: ML lifecycle artifacts and their management strategies.

| Artifact Type | Examples | Versioning Tool | Retention Policy |
|---|---|---|---|
| Data | Training sets, evaluation splits, feature definitions | DVC, Delta Lake | Keep all versions for reproducibility |
| Code | Training scripts, feature pipelines, configs | Git | Standard branch/tag strategy |
| Models | Checkpoints, exported models, ONNX files | MLflow, W&B | Keep best N per experiment |
| Metrics | Loss curves, evaluation reports, confusion matrices | MLflow, W&B | Keep all for comparison |
| Configs | Hyperparameters, infrastructure definitions | Git, Hydra | Link to corresponding model |

Version Everything Together

Every trained model should be linked to the exact code commit, data version, configuration file, and random seed that produced it. If you cannot reproduce a training run from its metadata alone, your versioning strategy has a gap. Tools like MLflow and DVC make this straightforward when adopted from the start of a project.
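One lightweight way to close that gap is to write a manifest next to every checkpoint that records the commit, data hash, configuration, and seed together. A minimal standard-library sketch; `run_manifest` and its field names are hypothetical, not an MLflow or DVC API:

```python
import hashlib
import json

def run_manifest(code_commit: str, data_bytes: bytes,
                 config: dict, seed: int) -> dict:
    """Bundle everything needed to reproduce a run, plus a data fingerprint."""
    return {
        'code_commit': code_commit,
        'data_sha256': hashlib.sha256(data_bytes).hexdigest(),
        'config': config,
        'seed': seed,
    }

manifest = run_manifest(
    code_commit='a1b2c3d',                  # e.g. from `git rev-parse HEAD`
    data_bytes=b'...training data bytes...',
    config={'lr': 1e-3, 'batch_size': 256},
    seed=42,
)
# Persist next to the checkpoint so the run is reproducible from metadata alone.
print(json.dumps(manifest, indent=2))
```

If the manifest cannot answer "what produced this model?", the versioning gap described above still exists.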

From Experimentation to Production

The deployment phase marks the transition from experimentation to production. This involves packaging the model, setting up serving infrastructure, implementing monitoring, and establishing feedback loops for continuous improvement. The lifecycle does not end at deployment; production models require ongoing monitoring, maintenance, and periodic retraining.

Deeper Insight

Unlike traditional software that works the same way forever unless the code changes, ML models silently rot. The world changes -- user behavior shifts, new products launch, seasons change, language evolves -- but the model remains frozen in the snapshot of reality it was trained on. This silent degradation is more dangerous than a software crash because the model keeps producing outputs that look valid but are increasingly wrong.

Think of it like a guidebook written five years ago for a rapidly changing city. The restaurants have closed, the subway map is wrong, and new neighborhoods have appeared. The book still looks authoritative and gives confident recommendations -- they are just increasingly inaccurate. Without someone checking the guidebook against the current city, you would never know it had gone stale.

The Lifecycle Never Ends

A deployed model is not a finished product. Data distributions shift, user behavior changes, and new edge cases emerge. Production models require continuous monitoring of prediction quality, periodic retraining on fresh data, and infrastructure maintenance. Plan for the full operational lifecycle, not just the initial deployment.

ML Lifecycle

The structured sequence of stages from problem definition through data preparation, model development, deployment, and monitoring that governs ML project development.

Problem Formulation

The critical first stage of ML development where the task, success metrics, constraints, and data requirements are precisely defined.

02 Experiment Tracking and Management

Experiment tracking is the systematic recording of all parameters, metrics, and artifacts associated with each training run. Without proper tracking, ML teams quickly lose the ability to reproduce results, compare approaches, or understand why certain configurations worked better than others.

Definition

Experiment Tracking

The systematic recording and organization of all parameters, metrics, code versions, data versions, and artifacts from ML training runs. Experiment tracking enables reproducibility, comparison across runs, and collaborative iteration within ML teams.

Tracking Platforms

Modern experiment tracking platforms provide centralized dashboards for logging hyperparameters, training curves, evaluation metrics, and model artifacts. These tools enable teams to search, filter, and compare across hundreds of experiments.

Table 5.2: Comparison of popular experiment tracking platforms.

| Platform | Type | Key Strength | Best For |
|---|---|---|---|
| MLflow | Open-source | Full lifecycle management, model registry | Teams wanting self-hosted, vendor-neutral tooling |
| Weights & Biases | SaaS (free tier) | Beautiful visualizations, easy team collaboration | Research teams and fast-moving startups |
| Neptune | SaaS | Flexible metadata structure, strong comparison tools | Large teams with diverse experiment types |
| TensorBoard | Open-source | Deep integration with TensorFlow and PyTorch | Individual researchers, quick local visualization |
| ClearML | Open-source + SaaS | Automatic logging, pipeline orchestration | Teams wanting an all-in-one platform |

What to Log

Effective experiment tracking requires discipline in logging practices. Every run should record the complete configuration to enable exact reproduction.

  • Model architecture — Layer types, sizes, activation functions, number of parameters
  • Hyperparameters — Learning rate, batch size, optimizer settings, regularization
  • Data version — Hash or tag identifying the exact training and validation data
  • Random seeds — All seeds (Python, NumPy, PyTorch/TensorFlow) for reproducibility
  • Framework version — Exact versions of PyTorch/TensorFlow, CUDA, cuDNN
  • Hardware environment — GPU type, number of devices, distributed training config
  • Training metrics — Loss curves, gradient norms, learning rate schedule
  • Evaluation metrics — Accuracy, F1, AUC, confusion matrix on validation set
```python
# Example: Logging an experiment with MLflow
import mlflow

with mlflow.start_run(run_name='resnet50-lr-sweep'):
    mlflow.log_params({
        'model': 'resnet50',
        'learning_rate': 0.001,
        'batch_size': 256,
        'optimizer': 'adamw',
        'data_version': 'v2.3',
        'seed': 42,
    })
    # ... training loop ...
    mlflow.log_metrics({'val_accuracy': 0.943, 'val_loss': 0.187})
    mlflow.pytorch.log_model(model, 'model')
```

Collaborative Workflows

Beyond individual experiments, tracking systems support collaborative workflows where multiple team members contribute experiments toward shared objectives. Features like experiment groups, tags, and notes help organize work across team members.

Establish Logging Conventions Early

Agree on naming conventions, tag taxonomy, and required metadata fields before the team starts experimenting. A shared convention like "task/model/variant" (e.g., "sentiment/bert-base/lr-sweep") makes it easy to find and compare experiments weeks or months later. Inconsistent naming is a common reason teams abandon their experiment tracking.
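A convention is easier to keep when it is enforced in code rather than by review. A sketch that validates the hypothetical "task/model/variant" pattern before a run is logged; the regex and allowed characters are illustrative choices:

```python
import re

# Hypothetical validator for a "task/model/variant" run-name convention.
RUN_NAME = re.compile(
    r'^(?P<task>[a-z0-9-]+)/(?P<model>[a-z0-9.-]+)/(?P<variant>[a-z0-9-]+)$'
)

def parse_run_name(name: str) -> dict:
    """Reject runs whose names will be unfindable later."""
    m = RUN_NAME.match(name)
    if m is None:
        raise ValueError(f'run name {name!r} does not follow task/model/variant')
    return m.groupdict()

print(parse_run_name('sentiment/bert-base/lr-sweep'))
# {'task': 'sentiment', 'model': 'bert-base', 'variant': 'lr-sweep'}
```

Calling a validator like this at the top of the training script turns the team convention into a hard gate.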

Tracking Saves a Team Weeks

A team of five ML engineers runs 200 experiments over two months to improve a recommendation model. When the best model fails in production, they use experiment tracking to identify that the improvement came from a data preprocessing change (not the architecture change they assumed). They revert the preprocessing, find the real issue (a data leak in validation), and recover in hours instead of rerunning weeks of experiments.

Experiment Tracking

The systematic recording and organization of all parameters, metrics, code versions, and artifacts from ML training runs to enable reproducibility and comparison.

MLflow

An open-source platform for managing the ML lifecycle, providing experiment tracking, model packaging, and deployment capabilities.

03 Reproducibility in ML

Reproducibility is the ability to obtain consistent results when repeating an experiment under the same conditions. In ML, achieving reproducibility is surprisingly difficult due to sources of non-determinism in hardware, software, and the training process itself.

Definition

Reproducibility

The ability to obtain consistent experimental results when repeating a procedure under the same conditions. In ML, this requires controlling random seeds, data versions, code versions, and hardware environment. It is a fundamental requirement for scientific validity and production reliability.

Sources of Non-Determinism

Sources of non-determinism include random weight initialization, data shuffling, GPU floating-point operation ordering, and framework-level optimizations that sacrifice determinism for performance.

Table 5.3: Sources of non-determinism in ML training and their mitigations.

| Source | Cause | Mitigation | Performance Impact |
|---|---|---|---|
| Weight initialization | Random number generators | Set all seeds (Python, NumPy, PyTorch) | None |
| Data shuffling | Random batch ordering | Seed the DataLoader or use deterministic shuffling | None |
| GPU operation ordering | Non-deterministic atomics on GPU | torch.use_deterministic_algorithms(True) | 10-30% slower |
| cuDNN auto-tuning | Different algorithms chosen per run | Set cudnn.deterministic = True | 5-15% slower |
| Floating-point reduction | Summation order varies in parallel | Use deterministic reductions | 5-10% slower |
| Framework optimizations | Graph fusion, kernel selection | Disable torch.compile optimizations | Significant slowdown |

Bit-for-Bit Reproducibility Has a Cost

Achieving exact bit-for-bit reproducibility often requires disabling GPU optimizations that sacrifice determinism for performance. This can slow training by 20-30%. In practice, "statistical reproducibility" (results within a small variance across runs) is usually sufficient. Reserve bit-exact reproducibility for debugging and validation, not production training.
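"Statistical reproducibility" can be checked with a handful of repeated runs and a tolerance. A toy sketch; the 0.005 tolerance and the sample metrics are illustrative, not a standard:

```python
def statistically_reproducible(metrics, tolerance=0.005):
    # Reproducible "enough" if the spread across seeds stays within tolerance.
    return (max(metrics) - min(metrics)) <= tolerance

# Validation accuracy from four runs of the same config with different seeds.
runs = [0.941, 0.943, 0.939, 0.942]
print(statistically_reproducible(runs))  # True (spread is 0.004)
```

A check like this belongs in the validation step before results are shared, with bit-exact determinism reserved for debugging.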

Practical Reproducibility

```python
# Setting seeds for practical reproducibility in PyTorch
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: slower but deterministic
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

The Reproducibility Checklist

Before sharing results or deploying a model, verify you can answer "yes" to: (1) Can I rerun this experiment from its metadata? (2) Are all dependencies pinned to exact versions? (3) Is the training data versioned and accessible? (4) Are random seeds set and recorded? (5) Is the hardware environment documented?

Version Control for ML

Definition

DVC (Data Version Control)

A tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code. DVC uses content-addressable storage to efficiently track changes to large files without storing them directly in Git.

Version control for ML extends beyond code to include data, configurations, and model artifacts. Tools like DVC track large datasets and model files alongside code, enabling teams to switch between different versions of their entire ML pipeline.

DVC in Practice

A team uses DVC to version their 50 GB training dataset alongside their Git repository. When a model regression is detected in production, they check out the v2.1 Git tag and run dvc checkout to restore the exact data used to train the previous good model. They discover that a recent data pipeline change introduced duplicate records. Without data versioning, this root cause would have taken days to identify.

Reproducibility

The ability to obtain consistent experimental results when repeating a procedure under the same conditions, a fundamental requirement for scientific ML development.

DVC

Data Version Control, a tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code.

04 MLOps Workflow Methodology

MLOps brings DevOps principles to machine learning, establishing practices for continuous integration, delivery, and training of ML models. The goal is to reduce the time and friction involved in moving models from development to production while maintaining quality and reliability.

Definition

MLOps

The set of practices that combines ML development (Dev) with operations (Ops) to enable continuous integration, continuous delivery, and continuous training of ML models. MLOps automates the pipeline from experimentation through deployment while maintaining quality and reliability.

The Three Pillars of MLOps

A mature MLOps workflow automates the progression from code changes through testing, training, evaluation, and deployment.

Table 5.4: The three pillars of MLOps and their automation.

| Pillar | Trigger | Actions | Output |
|---|---|---|---|
| Continuous Integration (CI) | Code commit or PR | Unit tests, linting, data validation, schema checks | Pass/fail status, test reports |
| Continuous Training (CT) | New data arrives or performance degrades | Retrain model, evaluate on holdout set, compare to baseline | New model candidate with metrics |
| Continuous Delivery (CD) | Model passes validation gates | Package model, deploy to staging, canary, then production | Production-serving model |
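The data-validation step in the CI pillar can start very small. A toy schema check in plain Python; the schema format, column names, and ranges are all illustrative:

```python
def validate_schema(batch, schema):
    """Toy CI data-validation gate: required columns plus value ranges."""
    errors = []
    for col, (lo, hi) in schema.items():
        if col not in batch:
            errors.append(f'missing column: {col}')
        elif not all(lo <= v <= hi for v in batch[col]):
            errors.append(f'out-of-range values in {col}')
    return errors

schema = {'age': (0, 120), 'income': (0, 1e7)}
batch = {'age': [34, 51, 199], 'income': [52_000, 87_000, 61_000]}
print(validate_schema(batch, schema))  # ['out-of-range values in age']
```

Production systems typically use richer tooling for this step, but even a check this small can block a bad batch from reaching continuous training.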

MLOps Maturity Levels

Google defines three levels of MLOps maturity. Level 0: manual, script-driven process. Level 1: ML pipeline automation with continuous training. Level 2: CI/CD pipeline automation for both code and models. Most organizations should aim for Level 1 as their initial target, as the jump from Level 0 to Level 1 delivers the majority of operational benefits.

Pipeline Orchestration

Pipeline orchestration is the backbone of MLOps workflows. Tools like Kubeflow Pipelines, Apache Airflow, and Prefect define ML workflows as composable steps with explicit dependencies.

  • Apache Airflow — General-purpose workflow orchestration with a large ecosystem of integrations
  • Kubeflow Pipelines — ML-specific orchestration on Kubernetes with native GPU support
  • Prefect — Python-native orchestration with strong local development and cloud deployment
  • Dagster — Data-aware orchestration with built-in testing and observability
  • ZenML — MLOps framework that abstracts across infrastructure providers
Start With Simple Automation

You do not need a full orchestration platform on day one. Start by automating the most painful manual step — often retraining when new data arrives. A scheduled cron job that triggers a training script, evaluates the result, and sends a Slack notification is a perfectly valid first step toward MLOps maturity.

Model Registries and Promotion

Definition

Model Registry

A centralized repository for storing, versioning, and managing the lifecycle of trained ML models. Models progress through stages (development, staging, canary, production) based on validation gates, ensuring only thoroughly tested models reach production.

Model registries serve as the central hub for managing model versions and their lifecycle states. A model in the registry progresses through stages like "staging," "canary," and "production" based on validation gates.

Model Promotion Pipeline

A new model is trained and registered as version 3.2 in "development" stage. Automated tests verify it meets minimum accuracy thresholds. It is promoted to "staging" where integration tests run against production data samples. After passing, it moves to "canary" where 5% of live traffic is routed to it for 24 hours. If all SLOs are met, it is promoted to "production" and the old model is archived.
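The promotion pipeline above can be modeled as a tiny state machine. A sketch; the stage names follow the example, and the gate logic is illustrative rather than any registry's real API:

```python
# Stages a model moves through in the registry, in promotion order.
STAGES = ['development', 'staging', 'canary', 'production']

def promote(current_stage: str, gates_passed: bool) -> str:
    """Advance one stage only if the validation gates for it passed."""
    i = STAGES.index(current_stage)
    if not gates_passed or i == len(STAGES) - 1:
        return current_stage  # gate failed, or already in production
    return STAGES[i + 1]

stage = 'development'
for gate_ok in [True, True, True]:  # accuracy check, integration tests, canary SLOs
    stage = promote(stage, gate_ok)
print(stage)  # production
```

The useful property is that a model can never skip a stage: each promotion requires its own gate to pass.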

Always Have a Rollback Plan

Every model deployment should have a tested rollback procedure. Keep the previous production model readily available (not archived or deleted) so traffic can be instantly routed back if the new model causes issues. Automated rollback triggered by SLO violations is ideal, but manual rollback should be achievable in under 5 minutes.
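Keeping the previous model hot for rollback is essentially one extra field of bookkeeping. A toy sketch of a hypothetical registry object (not the MLflow registry API):

```python
class ToyRegistry:
    """Toy model registry: retains the previous production model so traffic
    can be routed back instantly. Not a real registry API."""

    def __init__(self):
        self.production = None
        self.previous = None

    def deploy(self, version: str):
        # The outgoing model becomes the rollback target, not an archive entry.
        self.previous, self.production = self.production, version

    def rollback(self):
        if self.previous is None:
            raise RuntimeError('no previous model to roll back to')
        self.production, self.previous = self.previous, self.production

registry = ToyRegistry()
registry.deploy('v3.1')
registry.deploy('v3.2')    # v3.1 stays warm as the rollback target
registry.rollback()        # SLO violation detected: route traffic back
print(registry.production)  # v3.1
```

The point of the sketch is the invariant: the previous model is swapped, never deleted, so rollback is a pointer change rather than a redeployment.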

MLOps

The set of practices that combines ML development with operations, enabling continuous delivery and automation of the ML lifecycle.

Model Registry

A centralized repository for storing, versioning, and managing the lifecycle of trained ML models from development through production deployment.

05 Hyperparameter Tuning and AutoML

Hyperparameter tuning is the process of finding optimal configuration values that are not learned during training, such as learning rate, batch size, model architecture choices, and regularization strength. The hyperparameter space is often high-dimensional and expensive to explore, making efficient search strategies essential.

Definition

Hyperparameter

A configuration value that is set before training begins and is not updated by the optimization algorithm. Examples include learning rate, batch size, number of layers, dropout rate, and regularization strength. Choosing good hyperparameters is critical for model performance.

Search Strategies

Grid search exhaustively evaluates all combinations of specified hyperparameter values, while random search samples configurations at random from the space. Bergstra and Bengio (2012) showed that random search is more sample-efficient than grid search for most problems because its trials cover more distinct values along the dimensions that actually matter.

Table 5.5: Hyperparameter search strategies and their trade-offs.

| Strategy | Approach | Sample Efficiency | Best For |
|---|---|---|---|
| Grid Search | Exhaustive evaluation of all combinations | Poor (exponential in dimensions) | Small spaces with < 3 hyperparameters |
| Random Search | Random sampling from the space | Better than grid (Bergstra & Bengio, 2012) | Medium spaces, limited compute budget |
| Bayesian Optimization | Probabilistic surrogate model guides search | Best (2-10x fewer trials than random) | Expensive training runs, moderate dimensions |
| Hyperband | Adaptive resource allocation with early stopping | Good (quickly eliminates bad configs) | Large spaces with many cheap-to-evaluate configs |
| Population-Based Training | Evolutionary approach during training | Good for schedules | Learning rate and schedule optimization |
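Hyperband's core idea, successive halving, fits in a few lines: evaluate all candidates cheaply, keep the best fraction, and spend more budget on the survivors. A sketch with a toy deterministic objective; the function names and the lr=1e-3 target are illustrative:

```python
import random

def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the top 1/eta of configs each round while growing the budget."""
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta  # survivors are evaluated with more resources next round
    return configs[0]

random.seed(1)
candidates = [{'lr': 10 ** random.uniform(-5, -2)} for _ in range(8)]
# Toy objective: configurations closer to lr=1e-3 score higher; the real
# Hyperband evaluates with partial training at the given budget.
best = successive_halving(candidates, lambda cfg, b: -abs(cfg['lr'] - 1e-3))
print(best is min(candidates, key=lambda cfg: abs(cfg['lr'] - 1e-3)))  # True
```

With a deterministic objective the global best always survives every round; the interesting behavior in practice comes from noisy low-budget evaluations occasionally eliminating good configs early.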

Why Random Beats Grid

Grid search wastes trials by exhaustively covering unimportant dimensions. If only 2 out of 5 hyperparameters actually matter (common in practice), grid search evaluates many redundant combinations of the 3 unimportant ones. Random search distributes trials more evenly across the important dimensions, finding good configurations faster.
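The effect is easy to see on a toy objective where only one of two hyperparameters matters. A sketch; the objective, the 0.73 optimum, and the trial counts are illustrative:

```python
import random

def objective(x, y):
    # Only x matters; y is an unimportant hyperparameter.
    return -(x - 0.73) ** 2

# Grid search: 4 x 4 = 16 trials, but only 4 distinct values of x get tried.
grid_best = max(objective(i / 3, j / 3) for i in range(4) for j in range(4))

# Random search: the same 16 trials explore 16 distinct values of x.
random.seed(0)
rand_best = max(objective(random.random(), random.random()) for _ in range(16))

print(f'grid best: {grid_best:.5f}  random best: {rand_best:.5f}')
```

Grid search can never get closer to the optimum than its nearest grid line along x, no matter how many y values it tries; random search keeps drawing fresh x values with every trial.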

Bayesian Optimization

Definition

Bayesian Optimization

A sequential optimization strategy that builds a probabilistic surrogate model (typically a Gaussian Process or TPE) of the objective function and uses an acquisition function to balance exploration and exploitation when selecting the next configuration to evaluate.

Bayesian optimization methods like Tree-structured Parzen Estimators (TPE) and Gaussian Processes build a probabilistic model of the objective function and use it to select promising configurations to evaluate next. These methods are significantly more sample-efficient than random search, particularly for expensive training runs.

```python
# Bayesian hyperparameter optimization with Optuna
import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    n_layers = trial.suggest_int('n_layers', 2, 8)
    dropout = trial.suggest_float('dropout', 0.0, 0.5)

    model = build_model(n_layers=n_layers, dropout=dropout)
    val_accuracy = train_and_evaluate(model, lr=lr, batch_size=batch_size)
    return val_accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

AutoML

AutoML extends hyperparameter tuning to automate broader aspects of the ML pipeline, including feature engineering, model selection, and architecture design.

AutoML Is Not a Silver Bullet

AutoML excels at finding good configurations within defined search spaces but cannot replace domain expertise in problem formulation, data quality assessment, or system design. An AutoML system optimizing the wrong metric or training on poor data will efficiently find the wrong answer. Always validate AutoML results against domain knowledge.

Efficient Tuning Strategy

Start with a coarse random search (20-50 trials) to identify the right order of magnitude for each hyperparameter. Then use Bayesian optimization with a narrowed search space for fine-tuning. This two-phase approach is more efficient than running Bayesian optimization on a very wide initial space.

Hyperparameter Tuning

The process of systematically searching for optimal values of parameters that are set before training begins, such as learning rate and model architecture choices.

Bayesian Optimization

A sequential optimization strategy that builds a probabilistic surrogate model to efficiently guide the search for optimal hyperparameters.

Key Takeaways

  1. The ML lifecycle is inherently iterative and experimental, requiring systematic tracking and management of all artifacts.
  2. Experiment tracking platforms are essential for reproducibility, collaboration, and efficient hyperparameter exploration.
  3. Practical reproducibility requires controlling random seeds, versioning data and dependencies, and documenting the environment.
  4. MLOps automates the pipeline from code changes through training, validation, and deployment with continuous integration and delivery.
  5. Bayesian optimization is significantly more sample-efficient than grid or random search for hyperparameter tuning.


Chapter Complete

Up next: Data Engineering
