ML Systems · Part 4: Infrastructure · Chapter 13 · ~35 min

ML Operations

Introduces MLOps practices including CI/CD for ML, experiment tracking, model monitoring, and drift detection in production systems.

Topics: CI/CD for ML, experiment tracking, model monitoring, drift detection, MLOps
Read in mlsysbook.ai
  • Design CI/CD pipelines that validate code, data, and model quality for ML systems
  • Compare experiment tracking platforms and implement structured logging for team-scale ML development
  • Implement monitoring and observability strategies that detect model degradation in production
  • Analyze data drift and concept drift detection methods and their appropriate response strategies
  • Evaluate MLOps platform architectures and select tools appropriate for organizational maturity level

01 CI/CD for Machine Learning

Continuous Integration and Continuous Delivery for ML extends traditional CI/CD practices to handle the unique challenges of ML systems. Unlike conventional software where changes are primarily code-based, ML systems must also manage changes to data, model configurations, hyperparameters, and training environments.

Definition

ML CI/CD

Continuous Integration and Continuous Delivery practices adapted for machine learning systems, which must validate not only code changes but also data quality, model performance, and configuration consistency through automated pipelines.

Figure 13.1: ML training infrastructure pipeline. From data ingestion through feature engineering, training, validation, and model registry to deployment, with resource requirements at each stage.
Beyond Code Changes

Traditional CI/CD only needs to handle code changes. ML CI/CD must also detect and validate changes in data schemas, feature distributions, hyperparameter configurations, and training environments. A change to the training data can break a model just as easily as a code bug.

ML Continuous Integration

ML CI pipelines validate code changes through unit tests, integration tests, and data validation checks. They also verify that model training can complete successfully and that model quality meets minimum thresholds. These checks prevent regressions from entering the main branch and catch issues early in the development cycle.

  1. Code validation — Lint, type-check, and unit-test all pipeline code.
  2. Data validation — Check schema conformance, distribution statistics, and missing value rates.
  3. Training smoke test — Run a short training loop to verify convergence on a small data subset.
  4. Model quality gate — Ensure metrics (accuracy, loss) meet minimum thresholds on a held-out validation set.
  5. Integration test — Verify the full pipeline from data loading through prediction output.
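The data-validation and model-quality-gate steps above can be sketched as small check functions. This is a minimal illustration; the column names, metric names, and thresholds are assumptions for the example, not values from the chapter.

```python
def validate_schema(rows, required_columns):
    """Step 2: fail fast if any record is missing a required column."""
    for i, row in enumerate(rows):
        missing = required_columns - row.keys()
        if missing:
            raise ValueError(f"row {i} missing columns: {sorted(missing)}")

def quality_gate(metrics, thresholds):
    """Step 4: return the list of failed checks; empty list means the gate passes."""
    return [name for name, minimum in thresholds.items()
            if metrics.get(name, float("-inf")) < minimum]

rows = [{"user_id": 1, "amount": 9.5}, {"user_id": 2, "amount": 3.2}]
validate_schema(rows, {"user_id", "amount"})  # passes silently

failures = quality_gate({"accuracy": 0.91, "auc": 0.84},
                        {"accuracy": 0.90, "auc": 0.85})
print(failures)  # the AUC check fails, so the pipeline blocks the merge
```

In a real CI pipeline these checks would run as test cases, with a failing quality gate marking the build red.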

ML Continuous Delivery

ML CD pipelines automate the progression from a validated model to production deployment. This typically involves model packaging, deployment to a staging environment, shadow mode evaluation, canary deployment, and gradual rollout to full production.

Table 13.1: Progressive deployment stages for ML models.

| Deployment Stage | Description | Risk Level | Rollback Strategy |
|---|---|---|---|
| Shadow Mode | New model runs alongside production; predictions logged but not served | Zero | Simply disable shadow model |
| Canary | New model serves 1-5% of live traffic | Low | Route all traffic back to old model |
| Gradual Rollout | Incrementally increase traffic (10% -> 25% -> 50% -> 100%) | Medium | Reduce traffic share; revert if metrics drop |
| Full Production | New model serves 100% of traffic | High | Immediate rollback to previous version |

Shadow Mode First

Always deploy new models in shadow mode before serving live traffic. Shadow mode lets you compare the new model's predictions against the production model on real data without any user impact. This catches issues that offline evaluation misses, such as latency problems or unexpected prediction distributions.
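A shadow-mode serving path can be sketched in a few lines. This is an illustrative sketch, not a real serving framework: the models are plain callables and the log is an in-memory list.

```python
import time

shadow_log = []

def serve(request, prod_model, shadow_model):
    """Serve the production prediction; run the shadow model on the side."""
    prod_pred = prod_model(request)
    start = time.perf_counter()
    shadow_pred = shadow_model(request)
    shadow_latency = time.perf_counter() - start
    # Shadow output is logged for offline comparison, never returned to users.
    shadow_log.append({"request": request, "prod": prod_pred,
                       "shadow": shadow_pred, "shadow_latency_s": shadow_latency})
    return prod_pred

result = serve({"x": 2.0},
               prod_model=lambda r: r["x"] * 1.0,
               shadow_model=lambda r: r["x"] * 1.1)
print(result)  # 2.0: users always get the production model's answer
```

Comparing `prod` and `shadow` entries offline, including the logged latency, surfaces exactly the issues the tip describes.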

Automated pipelines must handle ML-specific concerns like data dependency management, training reproducibility, and model artifact tracking. Tools like Kubeflow Pipelines, MLflow, and Vertex AI Pipelines provide ML-aware pipeline orchestration.

Reproducibility Is Non-Negotiable

Every model deployed to production must be reproducible. Pin random seeds, record the exact data version (not just "latest"), snapshot the training environment as a container image, and store the full configuration alongside the model artifact. Without reproducibility, debugging production issues becomes nearly impossible.
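The reproducibility checklist above amounts to writing a run manifest next to every model artifact. A minimal sketch, where the data version tag and container image name are placeholder values:

```python
import json
import random

def make_manifest(seed, data_version, image, config):
    """Capture everything needed to rerun this training job."""
    random.seed(seed)  # pin the Python-level RNG (do the same for numpy/torch)
    return {
        "seed": seed,
        "data_version": data_version,   # an exact tag, never "latest"
        "container_image": image,       # frozen training environment
        "config": config,
    }

manifest = make_manifest(seed=42,
                         data_version="reviews-2024-06-01",
                         image="trainer:1.4.2",
                         config={"lr": 3e-4, "batch_size": 128})
# Serialize and store the manifest alongside the model artifact.
manifest_json = json.dumps(manifest, indent=2)
```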

ML CI/CD

Continuous Integration and Delivery practices adapted for ML systems, handling automated testing, validation, and deployment of models alongside code changes.

Shadow Mode

A deployment strategy where a new model runs alongside the production model, receiving real traffic but not serving predictions, to evaluate performance safely.

02 Experiment Tracking at Scale

As ML teams grow and projects multiply, experiment tracking evolves from an individual tool into a critical organizational capability. Enterprise experiment tracking must support multiple teams, projects, and environments while providing cross-cutting visibility into resource usage, model lineage, and comparative analysis.

Definition

Experiment Tracking

The systematic recording and organization of all parameters, metrics, artifacts, and metadata associated with ML training runs, enabling reproducibility, comparison, and knowledge sharing across teams.

Scalable Tracking Platforms

Scalable experiment tracking platforms like Weights & Biases, MLflow, and Neptune organize experiments into projects and workspaces with role-based access control. They provide APIs for programmatic logging, rich dashboards for visualization, and integrations with training frameworks that minimize the effort required to instrument code.

Table 13.2: Popular experiment tracking platforms compared.

| Platform | Deployment | Key Strengths | Best For |
|---|---|---|---|
| Weights & Biases | Cloud SaaS or self-hosted | Rich visualization, hyperparameter sweeps, report sharing | Teams needing collaborative analysis and experiment dashboards |
| MLflow | Self-hosted (open source) | Model registry, flexible backend, wide framework support | Organizations preferring open-source and on-premise control |
| Neptune | Cloud SaaS or self-hosted | Custom metadata, comparison views, team workspaces | Research teams needing flexible experiment organization |
| Vertex AI Experiments | GCP managed | Tight GCP integration, pipeline tracking, autologging | Teams already invested in the Google Cloud ecosystem |

python

import wandb

# Initialize experiment tracking
wandb.init(
    project='image-classification',
    config={
        'learning_rate': 3e-4,
        'batch_size': 128,
        'architecture': 'resnet50',
        'dataset': 'imagenet-v2.3',
        'optimizer': 'adamw',
    }
)

# Log metrics during training
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, loader)
    val_acc = evaluate(model, val_loader)
    wandb.log({'train_loss': train_loss, 'val_acc': val_acc, 'epoch': epoch})

Metadata Management

Metadata management becomes increasingly important at scale. Every experiment should capture the complete context: code version, data version, environment specification, hyperparameters, hardware configuration, and dependencies.

The "Which Run Was That?" Problem

Without comprehensive metadata, teams inevitably encounter the situation where a promising result cannot be reproduced because no one recorded the exact data version, random seed, or environment configuration. Capture metadata automatically through framework integrations rather than relying on manual logging.
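Automatic capture can be as simple as fingerprinting the environment and data at run start. A self-contained sketch using only the standard library; in practice the git commit would be read from the repository rather than passed in, and tracking-platform integrations do this for you.

```python
import hashlib
import platform
import sys

def capture_metadata(data_bytes, extra=None):
    """Record environment and data fingerprints without manual effort."""
    meta = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Same bytes always hash to the same fingerprint, so a changed
        # dataset is immediately visible when comparing runs.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }
    meta.update(extra or {})
    return meta

meta = capture_metadata(b"col_a,col_b\n1,2\n", extra={"git_commit": "abc1234"})
print(sorted(meta))
```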

Experiment Analytics at Scale

Experiment analytics across thousands of runs can reveal patterns invisible in individual experiments. Hyperparameter importance analysis identifies which settings matter most. Learning curve analysis helps predict when training should be stopped.

Automated Hyperparameter Importance

After running a hyperparameter sweep, use tools like Weights & Biases' parameter importance plots or SHAP analysis on your sweep results. These often reveal that only 2-3 hyperparameters meaningfully affect final performance, letting you fix the rest and focus tuning effort where it matters.

Experiment Tracking

The systematic recording of all parameters, metrics, artifacts, and metadata associated with ML training runs for reproducibility and analysis.

Metadata Management

The practice of capturing and organizing contextual information about experiments, including code versions, data versions, and environment specifications.

03 Model Monitoring and Observability

Model monitoring extends traditional software monitoring to track ML-specific metrics like prediction quality, feature distributions, and model behavior over time. Unlike software bugs that typically cause immediate failures, model degradation can be gradual and silent, making proactive monitoring essential.

Silent Model Failure

The most dangerous aspect of ML model degradation is that the system continues to produce outputs — they are simply wrong. A software crash triggers an immediate alert. A model that slowly loses accuracy continues serving predictions with no error signal. This is why proactive monitoring with statistical drift detection is essential, not optional.


Monitoring Dimensions

Key monitoring dimensions include data drift, concept drift, prediction drift, and performance drift. Each type of drift requires different detection methods and may warrant different responses.

Table 13.3: Types of drift in production ML systems.

| Drift Type | What Changes | Detection Method | Typical Response |
|---|---|---|---|
| Data Drift | Input feature distributions | KS test, PSI, JS divergence | Alert; investigate data pipeline |
| Concept Drift | Feature-target relationship | Performance monitoring with ground truth | Retrain model on recent data |
| Prediction Drift | Output distribution | Distribution comparison of predictions | Alert; compare with data drift signals |
| Performance Drift | Model accuracy metrics | Metric tracking against thresholds | Rollback or retrain depending on severity |

Definition

Data Drift

Changes in the statistical distribution of input features over time. Data drift can degrade model performance even when the underlying relationship between features and targets remains unchanged, because the model encounters inputs outside its training distribution.

Quick Check

What is the most critical component of an ML monitoring system?

Answer: data drift detection. Data drift is the leading cause of silent model degradation in production. Unlike infrastructure failures that trigger immediate alerts, data drift causes models to quietly produce increasingly unreliable predictions. CPU utilization and log aggregation are important operational metrics, but they do not capture the statistical shifts in input data that directly erode model accuracy.

Statistical Detection Methods

Statistical methods for drift detection include the Kolmogorov-Smirnov test, Population Stability Index (PSI), and Jensen-Shannon divergence for continuous features; chi-squared tests for categorical features; and multivariate methods like Maximum Mean Discrepancy (MMD) for detecting complex distributional shifts.
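In practice a library routine such as `scipy.stats.ks_2samp` computes the Kolmogorov-Smirnov test, but the statistic itself, the largest gap between the two empirical CDFs, is simple enough to sketch directly:

```python
from bisect import bisect_right

def ks_statistic(baseline, current):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    na, nb = len(a), len(b)
    gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        gap = max(gap, abs(bisect_right(a, x) / na - bisect_right(b, x) / nb))
    return gap

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0: identical samples
print(ks_statistic([0, 0, 0], [5, 5, 5]))        # 1.0: fully separated samples
```

A large statistic on a production feature, relative to the training baseline, is a drift signal worth investigating.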

PSI = \sum_{i=1}^{n} (p_i - q_i) \cdot \ln\left(\frac{p_i}{q_i}\right)
PSI Thresholds in Practice

For PSI-based drift detection: PSI < 0.1 means no meaningful change — no action needed. PSI between 0.1 and 0.25 means moderate shift — investigate and monitor closely. PSI > 0.25 means significant drift — trigger retraining or rollback. These thresholds are guidelines; calibrate them to your specific use case.
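The PSI formula and the thresholds above translate directly into code. This sketch assumes the feature has already been binned into proportions; the epsilon smoothing for empty bins is an implementation assumption, not part of the formula.

```python
import math

def psi(expected, actual, eps=1e-4):
    """Population Stability Index over two lists of bin proportions."""
    total = 0.0
    for p, q in zip(expected, actual):
        p, q = max(p, eps), max(q, eps)  # avoid log(0) on empty bins
        total += (p - q) * math.log(p / q)
    return total

def psi_verdict(value):
    """Apply the guideline thresholds from the tip above."""
    if value < 0.1:
        return "no action"
    if value <= 0.25:
        return "investigate"
    return "retrain or roll back"

score = psi([0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10])
print(round(score, 3), psi_verdict(score))  # ~0.228: moderate shift
```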

Automated Response

Alerting and response automation connect drift detection to remediation actions. Defining appropriate thresholds and response policies requires collaboration between ML engineers and domain experts.

Automated Drift Response Pipeline

A credit scoring model monitors PSI on all input features hourly. When PSI for the "income" feature exceeds 0.25, the system automatically: (1) sends an alert to the ML team Slack channel, (2) captures a snapshot of the current data distribution for analysis, (3) triggers a retraining job using the last 90 days of data, and (4) deploys the retrained model in shadow mode for 48 hours before promoting.
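The control flow of that four-step response can be sketched as follows. The action functions are injected stubs here; in a real system they would call the alerting, storage, training, and deployment services.

```python
def respond_to_drift(feature, psi_value, actions, threshold=0.25):
    """Run the four-step response when a feature's PSI crosses the threshold."""
    if psi_value <= threshold:
        return "no action"
    actions["alert"](f"PSI={psi_value:.2f} on '{feature}'")  # 1. notify the team
    actions["snapshot"](feature)                             # 2. capture distribution
    job = actions["retrain"](window_days=90)                 # 3. retrain on recent data
    actions["deploy_shadow"](job, hours=48)                  # 4. shadow before promoting
    return "retraining"

calls = []
stubs = {name: (lambda n: lambda *a, **k: calls.append(n) or n)(name)
         for name in ["alert", "snapshot", "retrain", "deploy_shadow"]}

status = respond_to_drift("income", 0.31, stubs)
print(status, calls)
```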

Data Drift

Changes in the statistical distribution of input features over time, which can degrade model performance even when the model itself has not changed.

Concept Drift

Changes in the underlying relationship between input features and the target variable, requiring model retraining to maintain accuracy.

04 Drift Detection and Response

Drift detection is the practice of continuously monitoring for distributional changes that could impact model performance. Effective drift detection requires establishing a baseline distribution from the training or validation data and comparing incoming production data against this baseline.

Definition

Drift Detection

Automated statistical methods for identifying changes in data distributions or model behavior relative to a baseline, enabling early warning of performance degradation before it impacts users.

Window-Based Detection

Window-based drift detection compares statistics computed over a recent window of production data to the baseline. The window size controls the trade-off between detection sensitivity and false positive rate: smaller windows detect drift faster but are noisier.

The Window Size Trade-off

A 1-hour window detects drift within hours but may produce false positives from normal hourly variation (e.g., business hours vs. night traffic). A 7-day window is more stable but takes a week to detect even severe drift. Adaptive windowing algorithms like ADWIN solve this by automatically adjusting the window based on observed change rates.
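A minimal fixed-window detector makes the trade-off concrete: it flags drift when the window mean strays too far from the baseline mean. The three-standard-error threshold is an illustrative choice, not a universal rule.

```python
from collections import deque
from statistics import mean, stdev

class WindowDriftDetector:
    """Flag drift when the window mean strays too far from the baseline mean."""
    def __init__(self, baseline, window_size, n_sigmas=3.0):
        self.mu = mean(baseline)
        self.sigma = stdev(baseline)
        self.window = deque(maxlen=window_size)
        self.n_sigmas = n_sigmas

    def update(self, value):
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough observations yet
        # Standard error of the window mean under the baseline distribution.
        se = self.sigma / (len(self.window) ** 0.5)
        return abs(mean(self.window) - self.mu) > self.n_sigmas * se

detector = WindowDriftDetector(baseline=[10, 11, 9, 10, 12, 8, 10, 11],
                               window_size=4)
stream = [10, 9, 11, 10, 20, 21, 19, 20]  # distribution shifts halfway through
flags = [detector.update(x) for x in stream]
print(flags)  # drift flagged once shifted values dominate the window
```

A smaller `window_size` would flag the shift sooner but react to every noisy burst; ADWIN automates this balance by resizing the window itself.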

Definition

ADWIN

Adaptive Windowing, an algorithm that maintains a variable-length window of recent observations and automatically shrinks the window when a statistically significant change is detected, enabling fast detection of drift at multiple time scales.

Feature-Level vs. Model-Level Monitoring

Feature-level monitoring tracks each input feature independently, while model-level monitoring tracks the joint distribution of features and predictions. Feature-level monitoring is simpler and more interpretable but may miss changes that only appear in feature interactions.

Table 13.4: Monitoring approaches compared.

| Monitoring Approach | Granularity | Interpretability | Interaction Detection | Computational Cost |
|---|---|---|---|---|
| Feature-level (univariate) | Per feature | High — pinpoints which feature drifted | Misses feature interactions | Low — O(n) for n features |
| Model-level (multivariate) | All features jointly | Low — detects drift but hard to diagnose | Captures all interactions | High — scales with dimensionality |
| Prediction-level | Model output | Medium — shows effect but not cause | Captures downstream impact | Low — single distribution |

Layered Monitoring Strategy

Use all three monitoring levels together. Prediction-level monitoring is your first alert that something is wrong. Feature-level monitoring diagnoses which input changed. Model-level monitoring catches interaction effects that per-feature monitoring misses. This layered approach provides both early warning and diagnostic capability.

Response Strategies

Response strategies depend on the severity and type of drift detected. Retraining with recent data is the most common response, but it requires maintaining a reliable retraining pipeline. In cases of sudden, severe drift, it may be better to fall back to a simpler model or rule-based system while investigating the root cause.

Deeper Insight: Production models generate data that feeds back into training, creating either a virtuous or vicious cycle. When a recommendation model serves good predictions, users engage more, generating high-quality training data that further improves the model. But when predictions degrade, users disengage or behave differently, producing biased training data that makes the next model even worse. Recognizing and managing this feedback loop is one of the most important, and most overlooked, aspects of ML operations.

Don't Blindly Retrain on Drifted Data

Automated retraining is powerful but dangerous if the drift represents a data quality issue rather than a genuine distribution shift. Retraining on corrupted data embeds the corruption into the model. Always include data validation checks in the retraining pipeline, and hold back a stable test set that is not affected by the drift.

Sudden Drift: When to Fall Back

A fraud detection model sees PSI spike to 0.8 on transaction amount features after a payment processor changes its API format, doubling all values. Retraining on this data would learn incorrect patterns. The correct response is to fall back to a rule-based system, fix the data pipeline, and then retrain once the input data is correct.
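The validation gate described above can be sketched as a guard in front of the retraining trigger. The range check stands in for whatever data-quality rules apply; the function names are hypothetical.

```python
def safe_retrain(recent_data, expected_range, retrain, fallback):
    """Retrain only if the drifted data passes validation; otherwise fall back."""
    lo, hi = expected_range
    bad = [x for x in recent_data if not lo <= x <= hi]
    if bad:
        # Out-of-range values suggest a pipeline bug (e.g. doubled amounts),
        # not a genuine distribution shift: do not learn from them.
        fallback()
        return "fallback: fix the data pipeline first"
    retrain(recent_data)
    return "retrained"

events = []
decision = safe_retrain(
    recent_data=[120.0, 98.0, 40000.0],  # doubled amounts slipped in
    expected_range=(0.0, 10000.0),
    retrain=lambda data: events.append("retrain"),
    fallback=lambda: events.append("rules"),
)
print(decision, events)
```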

Drift Detection

Automated statistical methods for identifying changes in data distributions or model behavior that could indicate degraded performance.

ADWIN

Adaptive Windowing, an algorithm that automatically adjusts its monitoring window size to detect distributional changes at different time scales.

05 MLOps Platform Architecture

A comprehensive MLOps platform integrates experiment tracking, pipeline orchestration, model registry, feature store, monitoring, and deployment management into a cohesive system. The architecture of this platform determines how effectively teams can develop, deploy, and maintain ML models at scale.

Definition

MLOps Platform

An integrated system combining experiment tracking, pipeline orchestration, model registry, feature store, serving infrastructure, and monitoring to support the complete machine learning lifecycle from development through production operations.

Build vs. Buy Decisions

Platform architecture decisions include build versus buy, centralized versus decentralized, and open versus proprietary. Each choice has implications for flexibility, cost, maintenance burden, and team productivity.

Table 13.5: Key MLOps platform architecture decisions.

| Decision | Option A | Option B | Key Consideration |
|---|---|---|---|
| Build vs. Buy | Custom platform (OSS tools) | Managed service (cloud vendor) | Integration effort vs. vendor lock-in |
| Centralized vs. Decentralized | Shared platform team | Team-specific tool choices | Consistency vs. autonomy |
| Open vs. Proprietary | Open-source stack | Vendor solutions | Flexibility vs. support and polish |

Open-Source vs. Managed Stacks

Open-source MLOps stacks typically combine tools like MLflow (experiment tracking), Kubeflow (pipeline orchestration), Feast (feature store), Seldon Core (model serving), and Prometheus/Grafana (monitoring). These components are powerful individually but require significant integration effort.

  • MLflow — Experiment tracking, model registry, and model packaging.
  • Kubeflow Pipelines — Kubernetes-native ML pipeline orchestration.
  • Feast — Feature store for consistent feature serving across training and inference.
  • Seldon Core / KServe — Model serving with scaling, monitoring, and A/B testing.
  • Prometheus + Grafana — Metrics collection, alerting, and visualization.
  • DVC — Data versioning and pipeline management for reproducibility.
Start Small, Grow Incrementally

Don't try to build a complete MLOps platform from day one. Start with experiment tracking and a model registry (MLflow is a great starting point). Add pipeline orchestration when manual processes become painful. Add a feature store when training-serving skew becomes a problem. Each component should solve a real, observed pain point.
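To make the model-registry starting point concrete, here is a toy in-memory sketch of what a registry tracks: versioned models, lifecycle stages, and promotion. This is not MLflow's API; the class and method names are hypothetical, purely to show the concept.

```python
class ToyModelRegistry:
    """In-memory stand-in for a model registry: versions, stages, promotion."""
    def __init__(self):
        self.versions = {}  # model name -> list of version entries

    def register(self, name, metrics):
        """New models enter in 'staging'; version numbers are monotonic."""
        entries = self.versions.setdefault(name, [])
        entry = {"version": len(entries) + 1, "stage": "staging",
                 "metrics": metrics}
        entries.append(entry)
        return entry["version"]

    def promote(self, name, version):
        """Move a version to production, archiving the previous one."""
        for entry in self.versions[name]:
            if entry["stage"] == "production":
                entry["stage"] = "archived"  # one production version at a time
        self.versions[name][version - 1]["stage"] = "production"

    def production_model(self, name):
        return next(e for e in self.versions[name]
                    if e["stage"] == "production")

registry = ToyModelRegistry()
v1 = registry.register("churn", {"auc": 0.81})
v2 = registry.register("churn", {"auc": 0.86})
registry.promote("churn", v2)
print(registry.production_model("churn")["version"])  # 2
```

A real registry adds artifact storage, access control, and audit history on top of exactly this bookkeeping.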

MLOps Maturity Model

The maturity of an organization's MLOps practice can be assessed along dimensions like automation level, reproducibility, monitoring coverage, and deployment velocity. Google's MLOps maturity model defines three levels.

Table 13.6: Google's MLOps maturity model.

| Maturity Level | Description | Characteristics | Typical Deployment Frequency |
|---|---|---|---|
| Level 0: Manual | Manual, script-driven process | No pipeline automation, manual deployment, limited tracking | Monthly or less |
| Level 1: Pipeline Automation | Automated training pipelines | Automated retraining, model registry, basic monitoring | Weekly |
| Level 2: CI/CD Automation | Full CI/CD for ML | Automated testing, deployment, monitoring, and rollback | Daily or on-demand |

Where Most Organizations Stand

Despite the growing sophistication of MLOps tooling, surveys consistently show that most organizations remain at Level 0 (manual) or early Level 1. The gap between available tools and adopted practices represents both a challenge and an opportunity for ML teams to gain competitive advantage through operational excellence.

Avoid Tool Sprawl

The MLOps ecosystem has hundreds of tools, and it is tempting to adopt many of them. Resist this urge. Each additional tool in your stack adds integration complexity, maintenance burden, and cognitive load. Choose a cohesive set of tools that work well together, and commit to them.

MLOps Platform

An integrated system combining experiment tracking, pipeline orchestration, model management, and monitoring to support the complete ML lifecycle.

MLOps Maturity Model

A framework for assessing the automation and sophistication level of an organization's ML operations practices, from manual to fully automated.

Key Takeaways

  1. ML CI/CD must handle data, model, and configuration changes alongside traditional code changes.
  2. Model degradation can be gradual and silent, making proactive monitoring with drift detection essential.
  3. Data drift and concept drift require different detection methods and may warrant different response strategies.
  4. Experiment tracking at scale becomes a critical organizational capability for collaboration and knowledge management.
  5. MLOps maturity progresses from manual processes through pipeline automation to full CI/CD for ML.

Up next: On-Device Learning
