ML Systems · Part 2: System Design · Chapter 5 (~30 min)

AI Workflow

Covers the ML development lifecycle and methodology. Introduces experiment tracking, reproducibility, and systematic approaches to model development.

Keywords: development lifecycle, experiment tracking, reproducibility, methodology, MLOps workflow
Learning objectives:
  • Describe the complete model lifecycle from ideation through deployment and retirement
  • Compare experiment tracking approaches and their role in reproducibility
  • Design a model versioning and registry strategy for team collaboration
  • Analyze CI/CD pipeline requirements specific to machine learning workflows
  • Implement model governance practices including approval gates and audit trails

01 The ML Development Lifecycle

The ML development lifecycle provides a structured methodology for taking models from concept to production. Unlike traditional software development, ML development is fundamentally experimental: success requires systematic iteration across data preparation, model development, training, evaluation, and deployment stages.

Definition

ML Development Lifecycle

The structured sequence of stages from problem definition through data preparation, model development, training, evaluation, deployment, and monitoring that governs ML project development. It is inherently iterative, with feedback from later stages driving improvements to earlier ones.

Figure 5.1: The ML development lifecycle is a continuous cycle through seven stages: Problem, Collect, Prepare, Train, Evaluate, Deploy, Monitor. Each stage feeds into the next, with monitoring driving feedback to all upstream stages.

Problem Formulation

The lifecycle begins with problem formulation, where engineers define the task, success metrics, and constraints. This stage is often underestimated but is critical for project success.

The Most Common Cause of ML Project Failure

Most ML projects that fail do so not because of model quality but because of poor problem formulation. Building an accurate model for the wrong objective wastes months of engineering effort. Before writing any code, clearly define: What decision will this model inform? How will you measure success? What are the latency and cost constraints? Is there enough data?

  1. Define the business objective and how ML predictions will be used
  2. Specify success metrics (accuracy, precision, recall, F1) with concrete thresholds
  3. Establish latency budgets and throughput requirements for serving
  4. Assess data availability, quality, and labeling needs
  5. Identify potential biases and ethical considerations
  6. Set a timeline and compute budget for experimentation
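The checklist above can be captured as a small, machine-checkable record so a project cannot quietly start with an unspecified constraint. A minimal sketch using a hypothetical `ProblemSpec` dataclass; all field names and the example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    """Hypothetical problem-formulation record; fields mirror the checklist."""
    objective: str           # the decision the model will inform
    success_metric: str      # e.g. 'f1' or 'precision_at_recall_0.8'
    metric_threshold: float  # minimum acceptable value of the metric
    latency_budget_ms: float # serving latency budget
    labeled_examples: int    # labeled data currently available

    def is_ready(self) -> bool:
        # A project is ready to start only when every constraint is concrete.
        return (self.metric_threshold > 0
                and self.latency_budget_ms > 0
                and self.labeled_examples > 0)

spec = ProblemSpec(
    objective='flag fraudulent transactions for manual review',
    success_metric='precision_at_recall_0.8',
    metric_threshold=0.9,
    latency_budget_ms=50,
    labeled_examples=120_000,
)
print(spec.is_ready())  # True
```

Writing the spec down as data rather than prose also makes it easy to attach to every experiment run later.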

Iterative Development and Artifact Management

After problem formulation, the lifecycle moves through iterative cycles of data preparation, feature engineering, model selection, training, and evaluation. Each cycle produces artifacts: processed datasets, trained model weights, evaluation reports, and configuration files.

Table 5.1: ML lifecycle artifacts and their management strategies.

| Artifact Type | Examples | Versioning Tool | Retention Policy |
|---|---|---|---|
| Data | Training sets, evaluation splits, feature definitions | DVC, Delta Lake | Keep all versions for reproducibility |
| Code | Training scripts, feature pipelines, configs | Git | Standard branch/tag strategy |
| Models | Checkpoints, exported models, ONNX files | MLflow, W&B | Keep best N per experiment |
| Metrics | Loss curves, evaluation reports, confusion matrices | MLflow, W&B | Keep all for comparison |
| Configs | Hyperparameters, infrastructure definitions | Git, Hydra | Link to corresponding model |

Version Everything Together

Every trained model should be linked to the exact code commit, data version, configuration file, and random seed that produced it. If you cannot reproduce a training run from its metadata alone, your versioning strategy has a gap. Tools like MLflow and DVC make this straightforward when adopted from the start of a project.
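One lightweight way to close that gap is to write a manifest next to every checkpoint that records the commit, data hash, configuration, and seed together. A minimal standard-library sketch; `run_manifest` and its field names are hypothetical, not an MLflow or DVC API:

```python
import hashlib
import json

def run_manifest(code_commit: str, data_bytes: bytes,
                 config: dict, seed: int) -> dict:
    """Bundle everything needed to reproduce a run, plus a data fingerprint."""
    return {
        'code_commit': code_commit,
        'data_sha256': hashlib.sha256(data_bytes).hexdigest(),
        'config': config,
        'seed': seed,
    }

manifest = run_manifest(
    code_commit='a1b2c3d',                  # e.g. from `git rev-parse HEAD`
    data_bytes=b'...training data bytes...',
    config={'lr': 1e-3, 'batch_size': 256},
    seed=42,
)
# Persist next to the checkpoint so the run is reproducible from metadata alone.
print(json.dumps(manifest, indent=2))
```

If the manifest cannot answer "what produced this model?", the versioning gap described above still exists.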

From Experimentation to Production

The deployment phase marks the transition from experimentation to production. This involves packaging the model, setting up serving infrastructure, implementing monitoring, and establishing feedback loops for continuous improvement. The lifecycle does not end at deployment; production models require ongoing monitoring, maintenance, and periodic retraining.

Deeper Insight

Unlike traditional software that works the same way forever unless the code changes, ML models silently rot. The world changes -- user behavior shifts, new products launch, seasons change, language evolves -- but the model remains frozen in the snapshot of reality it was trained on. This silent degradation is more dangerous than a software crash because the model keeps producing outputs that look valid but are increasingly wrong.

Think of it like a guidebook written five years ago for a rapidly changing city. The restaurants have closed, the subway map is wrong, and new neighborhoods have appeared. The book still looks authoritative and gives confident recommendations -- they are just increasingly inaccurate. Without someone checking the guidebook against the current city, you would never know it had gone stale.

The Lifecycle Never Ends

A deployed model is not a finished product. Data distributions shift, user behavior changes, and new edge cases emerge. Production models require continuous monitoring of prediction quality, periodic retraining on fresh data, and infrastructure maintenance. Plan for the full operational lifecycle, not just the initial deployment.

ML Lifecycle

The structured sequence of stages from problem definition through data preparation, model development, deployment, and monitoring that governs ML project development.

Problem Formulation

The critical first stage of ML development where the task, success metrics, constraints, and data requirements are precisely defined.

02 Experiment Tracking and Management

Experiment tracking is the systematic recording of all parameters, metrics, and artifacts associated with each training run. Without proper tracking, ML teams quickly lose the ability to reproduce results, compare approaches, or understand why certain configurations worked better than others.

Definition

Experiment Tracking

The systematic recording and organization of all parameters, metrics, code versions, data versions, and artifacts from ML training runs. Experiment tracking enables reproducibility, comparison across runs, and collaborative iteration within ML teams.

Tracking Platforms

Modern experiment tracking platforms provide centralized dashboards for logging hyperparameters, training curves, evaluation metrics, and model artifacts. These tools enable teams to search, filter, and compare across hundreds of experiments.

Table 5.2: Comparison of popular experiment tracking platforms.

| Platform | Type | Key Strength | Best For |
|---|---|---|---|
| MLflow | Open-source | Full lifecycle management, model registry | Teams wanting self-hosted, vendor-neutral tooling |
| Weights & Biases | SaaS (free tier) | Beautiful visualizations, easy team collaboration | Research teams and fast-moving startups |
| Neptune | SaaS | Flexible metadata structure, strong comparison tools | Large teams with diverse experiment types |
| TensorBoard | Open-source | Deep integration with TensorFlow and PyTorch | Individual researchers, quick local visualization |
| ClearML | Open-source + SaaS | Automatic logging, pipeline orchestration | Teams wanting an all-in-one platform |

What to Log

Effective experiment tracking requires discipline in logging practices. Every run should record the complete configuration to enable exact reproduction.

  • Model architecture — Layer types, sizes, activation functions, number of parameters
  • Hyperparameters — Learning rate, batch size, optimizer settings, regularization
  • Data version — Hash or tag identifying the exact training and validation data
  • Random seeds — All seeds (Python, NumPy, PyTorch/TensorFlow) for reproducibility
  • Framework version — Exact versions of PyTorch/TensorFlow, CUDA, cuDNN
  • Hardware environment — GPU type, number of devices, distributed training config
  • Training metrics — Loss curves, gradient norms, learning rate schedule
  • Evaluation metrics — Accuracy, F1, AUC, confusion matrix on validation set
```python
# Example: Logging an experiment with MLflow
import mlflow

with mlflow.start_run(run_name='resnet50-lr-sweep'):
    mlflow.log_params({
        'model': 'resnet50',
        'learning_rate': 0.001,
        'batch_size': 256,
        'optimizer': 'adamw',
        'data_version': 'v2.3',
        'seed': 42,
    })
    # ... training loop ...
    mlflow.log_metrics({'val_accuracy': 0.943, 'val_loss': 0.187})
    mlflow.pytorch.log_model(model, 'model')
```

Collaborative Workflows

Beyond individual experiments, tracking systems support collaborative workflows where multiple team members contribute experiments toward shared objectives. Features like experiment groups, tags, and notes help organize work across team members.

Establish Logging Conventions Early

Agree on naming conventions, tag taxonomy, and required metadata fields before the team starts experimenting. A shared convention like "task/model/variant" (e.g., "sentiment/bert-base/lr-sweep") makes it easy to find and compare experiments weeks or months later. Inconsistent naming is a common reason teams abandon their experiment tracking.
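A convention is easier to keep when it is enforced in code rather than by review. A sketch that validates the hypothetical "task/model/variant" pattern before a run is logged; the regex and allowed characters are illustrative choices:

```python
import re

# Hypothetical validator for a "task/model/variant" run-name convention.
RUN_NAME = re.compile(
    r'^(?P<task>[a-z0-9-]+)/(?P<model>[a-z0-9.-]+)/(?P<variant>[a-z0-9-]+)$'
)

def parse_run_name(name: str) -> dict:
    """Reject runs whose names will be unfindable later."""
    m = RUN_NAME.match(name)
    if m is None:
        raise ValueError(f'run name {name!r} does not follow task/model/variant')
    return m.groupdict()

print(parse_run_name('sentiment/bert-base/lr-sweep'))
# {'task': 'sentiment', 'model': 'bert-base', 'variant': 'lr-sweep'}
```

Calling a validator like this at the top of the training script turns the team convention into a hard gate.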

Tracking Saves a Team Weeks

A team of five ML engineers runs 200 experiments over two months to improve a recommendation model. When the best model fails in production, they use experiment tracking to identify that the improvement came from a data preprocessing change (not the architecture change they assumed). They revert the preprocessing, find the real issue (a data leak in validation), and recover in hours instead of rerunning weeks of experiments.

Experiment Tracking

The systematic recording and organization of all parameters, metrics, code versions, and artifacts from ML training runs to enable reproducibility and comparison.

MLflow

An open-source platform for managing the ML lifecycle, providing experiment tracking, model packaging, and deployment capabilities.

03 Reproducibility in ML

Reproducibility is the ability to obtain consistent results when repeating an experiment under the same conditions. In ML, achieving reproducibility is surprisingly difficult due to sources of non-determinism in hardware, software, and the training process itself.

Definition

Reproducibility

The ability to obtain consistent experimental results when repeating a procedure under the same conditions. In ML, this requires controlling random seeds, data versions, code versions, and hardware environment. It is a fundamental requirement for scientific validity and production reliability.

Sources of Non-Determinism

Sources of non-determinism include random weight initialization, data shuffling, GPU floating-point operation ordering, and framework-level optimizations that sacrifice determinism for performance.

Table 5.3: Sources of non-determinism in ML training and their mitigations.

| Source | Cause | Mitigation | Performance Impact |
|---|---|---|---|
| Weight initialization | Random number generators | Set all seeds (Python, NumPy, PyTorch) | None |
| Data shuffling | Random batch ordering | Seed the DataLoader or use deterministic shuffling | None |
| GPU operation ordering | Non-deterministic atomics on GPU | torch.use_deterministic_algorithms(True) | 10-30% slower |
| cuDNN auto-tuning | Different algorithms chosen per run | Set cudnn.deterministic = True | 5-15% slower |
| Floating-point reduction | Summation order varies in parallel | Use deterministic reductions | 5-10% slower |
| Framework optimizations | Graph fusion, kernel selection | Disable torch.compile optimizations | Significant slowdown |

Bit-for-Bit Reproducibility Has a Cost

Achieving exact bit-for-bit reproducibility often requires disabling GPU optimizations that sacrifice determinism for performance. This can slow training by 20-30%. In practice, "statistical reproducibility" (results within a small variance across runs) is usually sufficient. Reserve bit-exact reproducibility for debugging and validation, not production training.
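"Statistical reproducibility" can be checked with a handful of repeated runs and a tolerance. A toy sketch; the 0.005 tolerance and the sample metrics are illustrative, not a standard:

```python
def statistically_reproducible(metrics, tolerance=0.005):
    # Reproducible "enough" if the spread across seeds stays within tolerance.
    return (max(metrics) - min(metrics)) <= tolerance

# Validation accuracy from four runs of the same config with different seeds.
runs = [0.941, 0.943, 0.939, 0.942]
print(statistically_reproducible(runs))  # True (spread is 0.004)
```

A check like this belongs in the validation step before results are shared, with bit-exact determinism reserved for debugging.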

Practical Reproducibility

```python
# Setting seeds for practical reproducibility in PyTorch
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: slower but deterministic
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

The Reproducibility Checklist

Before sharing results or deploying a model, verify you can answer "yes" to: (1) Can I rerun this experiment from its metadata? (2) Are all dependencies pinned to exact versions? (3) Is the training data versioned and accessible? (4) Are random seeds set and recorded? (5) Is the hardware environment documented?

Version Control for ML

Definition

DVC (Data Version Control)

A tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code. DVC uses content-addressable storage to efficiently track changes to large files without storing them directly in Git.

Version control for ML extends beyond code to include data, configurations, and model artifacts. Tools like DVC track large datasets and model files alongside code, enabling teams to switch between different versions of their entire ML pipeline.

DVC in Practice

A team uses DVC to version their 50 GB training dataset alongside their Git repository. When a model regression is detected in production, they check out the v2.1 Git tag and run dvc checkout to restore the exact data used to train the previous good model. They discover that a recent data pipeline change introduced duplicate records. Without data versioning, this root cause would have taken days to identify.

Reproducibility

The ability to obtain consistent experimental results when repeating a procedure under the same conditions, a fundamental requirement for scientific ML development.

DVC

Data Version Control, a tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code.

04 MLOps Workflow Methodology

MLOps brings DevOps principles to machine learning, establishing practices for continuous integration, delivery, and training of ML models. The goal is to reduce the time and friction involved in moving models from development to production while maintaining quality and reliability.

Definition

MLOps

The set of practices that combines ML development (Dev) with operations (Ops) to enable continuous integration, continuous delivery, and continuous training of ML models. MLOps automates the pipeline from experimentation through deployment while maintaining quality and reliability.

The Three Pillars of MLOps

A mature MLOps workflow automates the progression from code changes through testing, training, evaluation, and deployment.

Table 5.4: The three pillars of MLOps and their automation.

| Pillar | Trigger | Actions | Output |
|---|---|---|---|
| Continuous Integration (CI) | Code commit or PR | Unit tests, linting, data validation, schema checks | Pass/fail status, test reports |
| Continuous Training (CT) | New data arrives or performance degrades | Retrain model, evaluate on holdout set, compare to baseline | New model candidate with metrics |
| Continuous Delivery (CD) | Model passes validation gates | Package model, deploy to staging, canary, then production | Production-serving model |
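The data-validation step in the CI pillar can start very small. A toy schema check in plain Python; the schema format, column names, and ranges are all illustrative:

```python
def validate_schema(batch, schema):
    """Toy CI data-validation gate: required columns plus value ranges."""
    errors = []
    for col, (lo, hi) in schema.items():
        if col not in batch:
            errors.append(f'missing column: {col}')
        elif not all(lo <= v <= hi for v in batch[col]):
            errors.append(f'out-of-range values in {col}')
    return errors

schema = {'age': (0, 120), 'income': (0, 1e7)}
batch = {'age': [34, 51, 199], 'income': [52_000, 87_000, 61_000]}
print(validate_schema(batch, schema))  # ['out-of-range values in age']
```

Production systems typically use richer tooling for this step, but even a check this small can block a bad batch from reaching continuous training.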

MLOps Maturity Levels

Google defines three levels of MLOps maturity. Level 0: manual, script-driven process. Level 1: ML pipeline automation with continuous training. Level 2: CI/CD pipeline automation for both code and models. Most organizations should aim for Level 1 as their initial target, as the jump from Level 0 to Level 1 delivers the majority of operational benefits.

Pipeline Orchestration

Pipeline orchestration is the backbone of MLOps workflows. Tools like Kubeflow Pipelines, Apache Airflow, and Prefect define ML workflows as composable steps with explicit dependencies.

  • Apache Airflow — General-purpose workflow orchestration with a large ecosystem of integrations
  • Kubeflow Pipelines — ML-specific orchestration on Kubernetes with native GPU support
  • Prefect — Python-native orchestration with strong local development and cloud deployment
  • Dagster — Data-aware orchestration with built-in testing and observability
  • ZenML — MLOps framework that abstracts across infrastructure providers
Start With Simple Automation

You do not need a full orchestration platform on day one. Start by automating the most painful manual step — often retraining when new data arrives. A scheduled cron job that triggers a training script, evaluates the result, and sends a Slack notification is a perfectly valid first step toward MLOps maturity.

Model Registries and Promotion

Definition

Model Registry

A centralized repository for storing, versioning, and managing the lifecycle of trained ML models. Models progress through stages (development, staging, canary, production) based on validation gates, ensuring only thoroughly tested models reach production.

Model registries serve as the central hub for managing model versions and their lifecycle states. A model in the registry progresses through stages like "staging," "canary," and "production" based on validation gates.

Model Promotion Pipeline

A new model is trained and registered as version 3.2 in "development" stage. Automated tests verify it meets minimum accuracy thresholds. It is promoted to "staging" where integration tests run against production data samples. After passing, it moves to "canary" where 5% of live traffic is routed to it for 24 hours. If all SLOs are met, it is promoted to "production" and the old model is archived.
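The promotion pipeline above can be modeled as a tiny state machine. A sketch; the stage names follow the example, and the gate logic is illustrative rather than any registry's real API:

```python
# Stages a model moves through in the registry, in promotion order.
STAGES = ['development', 'staging', 'canary', 'production']

def promote(current_stage: str, gates_passed: bool) -> str:
    """Advance one stage only if the validation gates for it passed."""
    i = STAGES.index(current_stage)
    if not gates_passed or i == len(STAGES) - 1:
        return current_stage  # gate failed, or already in production
    return STAGES[i + 1]

stage = 'development'
for gate_ok in [True, True, True]:  # accuracy check, integration tests, canary SLOs
    stage = promote(stage, gate_ok)
print(stage)  # production
```

The useful property is that a model can never skip a stage: each promotion requires its own gate to pass.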

Always Have a Rollback Plan

Every model deployment should have a tested rollback procedure. Keep the previous production model readily available (not archived or deleted) so traffic can be instantly routed back if the new model causes issues. Automated rollback triggered by SLO violations is ideal, but manual rollback should be achievable in under 5 minutes.
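Keeping the previous model hot for rollback is essentially one extra field of bookkeeping. A toy sketch of a hypothetical registry object (not the MLflow registry API):

```python
class ToyRegistry:
    """Toy model registry: retains the previous production model so traffic
    can be routed back instantly. Not a real registry API."""

    def __init__(self):
        self.production = None
        self.previous = None

    def deploy(self, version: str):
        # The outgoing model becomes the rollback target, not an archive entry.
        self.previous, self.production = self.production, version

    def rollback(self):
        if self.previous is None:
            raise RuntimeError('no previous model to roll back to')
        self.production, self.previous = self.previous, self.production

registry = ToyRegistry()
registry.deploy('v3.1')
registry.deploy('v3.2')    # v3.1 stays warm as the rollback target
registry.rollback()        # SLO violation detected: route traffic back
print(registry.production)  # v3.1
```

The point of the sketch is the invariant: the previous model is swapped, never deleted, so rollback is a pointer change rather than a redeployment.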

MLOps

The set of practices that combines ML development with operations, enabling continuous delivery and automation of the ML lifecycle.

Model Registry

A centralized repository for storing, versioning, and managing the lifecycle of trained ML models from development through production deployment.

05 Hyperparameter Tuning and AutoML

Hyperparameter tuning is the process of finding optimal configuration values that are not learned during training, such as learning rate, batch size, model architecture choices, and regularization strength. The hyperparameter space is often high-dimensional and expensive to explore, making efficient search strategies essential.

Definition

Hyperparameter

A configuration value that is set before training begins and is not updated by the optimization algorithm. Examples include learning rate, batch size, number of layers, dropout rate, and regularization strength. Choosing good hyperparameters is critical for model performance.

Search Strategies

Grid search exhaustively evaluates all combinations of specified hyperparameter values, while random search samples configurations at random from the space. Bergstra and Bengio (2012) showed that random search is more sample-efficient than grid search for most problems because its trials cover more distinct values along the dimensions that actually matter.

Table 5.5: Hyperparameter search strategies and their trade-offs.

| Strategy | Approach | Sample Efficiency | Best For |
|---|---|---|---|
| Grid Search | Exhaustive evaluation of all combinations | Poor (exponential in dimensions) | Small spaces with < 3 hyperparameters |
| Random Search | Random sampling from the space | Better than grid (Bergstra & Bengio, 2012) | Medium spaces, limited compute budget |
| Bayesian Optimization | Probabilistic surrogate model guides search | Best (2-10x fewer trials than random) | Expensive training runs, moderate dimensions |
| Hyperband | Adaptive resource allocation with early stopping | Good (quickly eliminates bad configs) | Large spaces with many cheap-to-evaluate configs |
| Population-Based Training | Evolutionary approach during training | Good for schedules | Learning rate and schedule optimization |
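Hyperband's core idea, successive halving, fits in a few lines: evaluate all candidates cheaply, keep the best fraction, and spend more budget on the survivors. A sketch with a toy deterministic objective; the function names and the lr=1e-3 target are illustrative:

```python
import random

def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the top 1/eta of configs each round while growing the budget."""
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta  # survivors are evaluated with more resources next round
    return configs[0]

random.seed(1)
candidates = [{'lr': 10 ** random.uniform(-5, -2)} for _ in range(8)]
# Toy objective: configurations closer to lr=1e-3 score higher; the real
# Hyperband evaluates with partial training at the given budget.
best = successive_halving(candidates, lambda cfg, b: -abs(cfg['lr'] - 1e-3))
print(best is min(candidates, key=lambda cfg: abs(cfg['lr'] - 1e-3)))  # True
```

With a deterministic objective the global best always survives every round; the interesting behavior in practice comes from noisy low-budget evaluations occasionally eliminating good configs early.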

Why Random Beats Grid

Grid search wastes trials by exhaustively covering unimportant dimensions. If only 2 out of 5 hyperparameters actually matter (common in practice), grid search evaluates many redundant combinations of the 3 unimportant ones. Random search distributes trials more evenly across the important dimensions, finding good configurations faster.
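The effect is easy to see on a toy objective where only one of two hyperparameters matters. A sketch; the objective, the 0.73 optimum, and the trial counts are illustrative:

```python
import random

def objective(x, y):
    # Only x matters; y is an unimportant hyperparameter.
    return -(x - 0.73) ** 2

# Grid search: 4 x 4 = 16 trials, but only 4 distinct values of x get tried.
grid_best = max(objective(i / 3, j / 3) for i in range(4) for j in range(4))

# Random search: the same 16 trials explore 16 distinct values of x.
random.seed(0)
rand_best = max(objective(random.random(), random.random()) for _ in range(16))

print(f'grid best: {grid_best:.5f}  random best: {rand_best:.5f}')
```

Grid search can never get closer to the optimum than its nearest grid line along x, no matter how many y values it tries; random search keeps drawing fresh x values with every trial.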

Bayesian Optimization

Definition

Bayesian Optimization

A sequential optimization strategy that builds a probabilistic surrogate model (typically a Gaussian Process or TPE) of the objective function and uses an acquisition function to balance exploration and exploitation when selecting the next configuration to evaluate.

Bayesian optimization methods like Tree-structured Parzen Estimators (TPE) and Gaussian Processes build a probabilistic model of the objective function and use it to select promising configurations to evaluate next. These methods are significantly more sample-efficient than random search, particularly for expensive training runs.

```python
# Bayesian hyperparameter optimization with Optuna
import optuna

def objective(trial):
    lr = trial.suggest_float('lr', 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128, 256])
    n_layers = trial.suggest_int('n_layers', 2, 8)
    dropout = trial.suggest_float('dropout', 0.0, 0.5)

    model = build_model(n_layers=n_layers, dropout=dropout)
    val_accuracy = train_and_evaluate(model, lr=lr, batch_size=batch_size)
    return val_accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

AutoML

AutoML extends hyperparameter tuning to automate broader aspects of the ML pipeline, including feature engineering, model selection, and architecture design.

AutoML Is Not a Silver Bullet

AutoML excels at finding good configurations within defined search spaces but cannot replace domain expertise in problem formulation, data quality assessment, or system design. An AutoML system optimizing the wrong metric or training on poor data will efficiently find the wrong answer. Always validate AutoML results against domain knowledge.

Efficient Tuning Strategy

Start with a coarse random search (20-50 trials) to identify the right order of magnitude for each hyperparameter. Then use Bayesian optimization with a narrowed search space for fine-tuning. This two-phase approach is more efficient than running Bayesian optimization on a very wide initial space.

Hyperparameter Tuning

The process of systematically searching for optimal values of parameters that are set before training begins, such as learning rate and model architecture choices.

Bayesian Optimization

A sequential optimization strategy that builds a probabilistic surrogate model to efficiently guide the search for optimal hyperparameters.

Key Takeaways

  1. The ML lifecycle is inherently iterative and experimental, requiring systematic tracking and management of all artifacts.
  2. Experiment tracking platforms are essential for reproducibility, collaboration, and efficient hyperparameter exploration.
  3. Practical reproducibility requires controlling random seeds, versioning data and dependencies, and documenting the environment.
  4. MLOps automates the pipeline from code changes through training, validation, and deployment with continuous integration and delivery.
  5. Bayesian optimization is significantly more sample-efficient than grid or random search for hyperparameter tuning.


Chapter Complete

Up next: Data Engineering
