ML Operations
Introduces MLOps practices including CI/CD for ML, experiment tracking, model monitoring, and drift detection in production systems.
- Design CI/CD pipelines that validate code, data, and model quality for ML systems
- Compare experiment tracking platforms and implement structured logging for team-scale ML development
- Implement monitoring and observability strategies that detect model degradation in production
- Analyze data drift and concept drift detection methods and their appropriate response strategies
- Evaluate MLOps platform architectures and select tools appropriate for organizational maturity level
01 CI/CD for Machine Learning
Continuous Integration and Continuous Delivery for ML extends traditional CI/CD practices to handle the unique challenges of ML systems. Unlike conventional software where changes are primarily code-based, ML systems must also manage changes to data, model configurations, hyperparameters, and training environments.
ML CI/CD
Continuous Integration and Continuous Delivery practices adapted for machine learning systems, which must validate not only code changes but also data quality, model performance, and configuration consistency through automated pipelines.
Figure: ML Training Infrastructure Pipeline
Traditional CI/CD only needs to handle code changes. ML CI/CD must also detect and validate changes in data schemas, feature distributions, hyperparameter configurations, and training environments. A change to the training data can break a model just as easily as a code bug.
ML Continuous Integration
ML CI pipelines validate code changes through unit tests, integration tests, and data validation checks. They also verify that model training can complete successfully and that model quality meets minimum thresholds. These checks prevent regressions from entering the main branch and catch issues early in the development cycle.
- Code validation — Lint, type-check, and unit-test all pipeline code.
- Data validation — Check schema conformance, distribution statistics, and missing value rates.
- Training smoke test — Run a short training loop to verify convergence on a small data subset.
- Model quality gate — Ensure metrics (accuracy, loss) meet minimum thresholds on a held-out validation set.
- Integration test — Verify the full pipeline from data loading through prediction output.
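To make these gates concrete, the sketch below shows how they might be expressed as pytest-style checks in a CI job. The threshold values, column names, and helper parameters (train_fn, small_dataset, and so on) are illustrative assumptions rather than a prescribed interface; in practice each test would receive its inputs from fixtures wired to your own pipeline.

```python
import numpy as np
import pandas as pd

# Illustrative thresholds and names; adjust to your own pipeline.
MAX_MISSING_RATE = 0.02
MIN_VAL_ACCURACY = 0.85
EXPECTED_COLUMNS = {"feature_a", "feature_b", "label"}


def test_data_schema(sample_batch: pd.DataFrame):
    """Data validation: schema conformance and missing-value rates."""
    assert EXPECTED_COLUMNS.issubset(sample_batch.columns)
    assert sample_batch.isna().mean().max() <= MAX_MISSING_RATE


def test_training_smoke(train_fn, small_dataset):
    """Training smoke test: a short run on a data subset must reduce the loss."""
    history = train_fn(small_dataset, epochs=2)
    assert history["loss"][-1] < history["loss"][0]


def test_model_quality_gate(model, val_features, val_labels):
    """Model quality gate: held-out accuracy must clear a minimum threshold."""
    preds = model.predict(val_features)
    accuracy = float(np.mean(preds == val_labels))
    assert accuracy >= MIN_VAL_ACCURACY
```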
ML Continuous Delivery
ML CD pipelines automate the progression from a validated model to production deployment. This typically involves model packaging, deployment to a staging environment, shadow mode evaluation, canary deployment, and gradual rollout to full production.
| Deployment Stage | Description | Risk Level | Rollback Strategy |
|---|---|---|---|
| Shadow Mode | New model runs alongside production; predictions logged but not served | Zero | Simply disable shadow model |
| Canary | New model serves 1-5% of live traffic | Low | Route all traffic back to old model |
| Gradual Rollout | Incrementally increase traffic (10% -> 25% -> 50% -> 100%) | Medium | Reduce traffic share; revert if metrics drop |
| Full Production | New model serves 100% of traffic | High | Immediate rollback to previous version |
Table 13.1: Progressive deployment stages for ML models.
Always deploy new models in shadow mode before serving live traffic. Shadow mode lets you compare the new model's predictions against the production model on real data without any user impact. This catches issues that offline evaluation misses, such as latency problems or unexpected prediction distributions.
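A minimal sketch of how a serving layer might implement shadow mode is shown below; serve_prediction and the model objects are illustrative placeholders, and real systems typically write comparisons to an evaluation store rather than application logs.

```python
import logging

logger = logging.getLogger("shadow_eval")


def serve_prediction(request_features, production_model, shadow_model=None):
    """Serve the production model; run the shadow model on the side for comparison."""
    prediction = production_model.predict(request_features)

    if shadow_model is not None:
        try:
            shadow_prediction = shadow_model.predict(request_features)
            # Log both outputs so offline jobs can compare agreement, latency,
            # and prediction distributions without any user impact.
            logger.info("shadow_compare prod=%s shadow=%s", prediction, shadow_prediction)
        except Exception:
            # Shadow failures must never affect the user-facing response.
            logger.exception("shadow model failed")

    return prediction
```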
Automated pipelines must handle ML-specific concerns like data dependency management, training reproducibility, and model artifact tracking. Tools like Kubeflow Pipelines, MLflow, and Vertex AI Pipelines provide ML-aware pipeline orchestration.
Every model deployed to production must be reproducible. Pin random seeds, record the exact data version (not just "latest"), snapshot the training environment as a container image, and store the full configuration alongside the model artifact. Without reproducibility, debugging production issues becomes nearly impossible.
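One lightweight way to enforce this is to write a run manifest next to every model artifact. The following is a sketch under the assumption that the seed, config, data version, git commit, and container tag are known at training time; the field names are illustrative.

```python
import json
import random

import numpy as np

SEED = 42


def set_seeds(seed: int = SEED) -> None:
    """Pin random seeds for the libraries in use (add torch/tf equivalents as needed)."""
    random.seed(seed)
    np.random.seed(seed)


def save_run_manifest(path, config, data_version, git_commit, image_tag):
    """Store the full run context alongside the model artifact."""
    manifest = {
        "seed": SEED,
        "config": config,                # hyperparameters and architecture
        "data_version": data_version,    # exact dataset snapshot, not "latest"
        "git_commit": git_commit,        # code version
        "container_image": image_tag,    # frozen training environment
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
```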
ML CI/CD
Continuous Integration and Delivery practices adapted for ML systems, handling automated testing, validation, and deployment of models alongside code changes.
Shadow Mode
A deployment strategy where a new model runs alongside the production model, receiving real traffic but not serving predictions, to evaluate performance safely.
02 Experiment Tracking at Scale
As ML teams grow and projects multiply, experiment tracking evolves from an individual tool into a critical organizational capability. Enterprise experiment tracking must support multiple teams, projects, and environments while providing cross-cutting visibility into resource usage, model lineage, and comparative analysis.
Experiment Tracking
The systematic recording and organization of all parameters, metrics, artifacts, and metadata associated with ML training runs, enabling reproducibility, comparison, and knowledge sharing across teams.
Scalable Tracking Platforms
Scalable experiment tracking platforms like Weights & Biases, MLflow, and Neptune organize experiments into projects and workspaces with role-based access control. They provide APIs for programmatic logging, rich dashboards for visualization, and integrations with training frameworks that minimize the effort required to instrument code.
| Platform | Deployment | Key Strengths | Best For |
|---|---|---|---|
| Weights & Biases | Cloud SaaS or self-hosted | Rich visualization, hyperparameter sweeps, report sharing | Teams needing collaborative analysis and experiment dashboards |
| MLflow | Self-hosted (open source) | Model registry, flexible backend, wide framework support | Organizations preferring open-source and on-premise control |
| Neptune | Cloud SaaS or self-hosted | Custom metadata, comparison views, team workspaces | Research teams needing flexible experiment organization |
| Vertex AI Experiments | GCP managed | Tight GCP integration, pipeline tracking, autologging | Teams already invested in the Google Cloud ecosystem |
Table 13.2: Popular experiment tracking platforms compared.
```python
import wandb

# Initialize experiment tracking
wandb.init(
    project='image-classification',
    config={
        'learning_rate': 3e-4,
        'batch_size': 128,
        'architecture': 'resnet50',
        'dataset': 'imagenet-v2.3',
        'optimizer': 'adamw',
    }
)

# Log metrics during training
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, loader)
    val_acc = evaluate(model, val_loader)
    wandb.log({'train_loss': train_loss, 'val_acc': val_acc, 'epoch': epoch})
```

Metadata Management
Metadata management becomes increasingly important at scale. Every experiment should capture the complete context: code version, data version, environment specification, hyperparameters, hardware configuration, and dependencies.
Without comprehensive metadata, teams inevitably encounter the situation where a promising result cannot be reproduced because no one recorded the exact data version, random seed, or environment configuration. Capture metadata automatically through framework integrations rather than relying on manual logging.
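A simple form of automatic capture is a helper that gathers code and environment context at the start of every run, as sketched below. It assumes git and pip are available in the training environment, and the tracker calls in the trailing comments are only examples of where the metadata could be attached.

```python
import json
import platform
import subprocess
import sys


def collect_run_metadata(data_version: str, config: dict) -> dict:
    """Gather code, data, and environment context automatically at run start."""
    git_commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    packages = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout.splitlines()
    return {
        "git_commit": git_commit,
        "data_version": data_version,
        "config": config,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "dependencies": packages,
    }


# Example: attach the metadata to whatever tracker is in use, e.g.
# run_metadata = collect_run_metadata("imagenet-v2.3", {"lr": 3e-4})
# wandb.config.update(run_metadata)   # or mlflow.log_dict(run_metadata, "metadata.json")
```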
Experiment Analytics at Scale
Experiment analytics across thousands of runs can reveal patterns invisible in individual experiments. Hyperparameter importance analysis identifies which settings matter most. Learning curve analysis helps predict when training should be stopped.
After running a hyperparameter sweep, use tools like Weights & Biases' parameter importance plots or SHAP analysis on your sweep results. These often reveal that only 2-3 hyperparameters meaningfully affect final performance, letting you fix the rest and focus tuning effort where it matters.
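If your tracking platform does not provide an importance view, a rough equivalent is to fit a tree-based model that predicts the final metric from the swept hyperparameters and inspect its feature importances. The sketch below assumes sweep results have been exported to a DataFrame with one row per run; the column names and values are made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed layout: one row per run, hyperparameter columns plus the final metric.
sweep = pd.DataFrame(
    {
        "learning_rate": [3e-4, 1e-3, 3e-3, 1e-4],
        "batch_size": [64, 128, 256, 128],
        "weight_decay": [0.0, 0.01, 0.1, 0.01],
        "val_acc": [0.81, 0.84, 0.72, 0.79],
    }
)

features = sweep.drop(columns=["val_acc"])
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(features, sweep["val_acc"])

# Higher importance suggests the hyperparameter explains more of the metric variance.
importance = pd.Series(forest.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False))
```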
Experiment Tracking
The systematic recording of all parameters, metrics, artifacts, and metadata associated with ML training runs for reproducibility and analysis.
Metadata Management
The practice of capturing and organizing contextual information about experiments, including code versions, data versions, and environment specifications.
03 Model Monitoring and Observability
Model monitoring extends traditional software monitoring to track ML-specific metrics like prediction quality, feature distributions, and model behavior over time. Unlike software bugs that typically cause immediate failures, model degradation can be gradual and silent, making proactive monitoring essential.
The most dangerous aspect of ML model degradation is that the system continues to produce outputs — they are simply wrong. A software crash triggers an immediate alert. A model that slowly loses accuracy continues serving predictions with no error signal. This is why proactive monitoring with statistical drift detection is essential, not optional.
Monitoring Dimensions
Key monitoring dimensions include data drift, concept drift, prediction drift, and performance drift. Each type of drift requires different detection methods and may warrant different responses.
| Drift Type | What Changes | Detection Method | Typical Response |
|---|---|---|---|
| Data Drift | Input feature distributions | KS test, PSI, JS divergence | Alert; investigate data pipeline |
| Concept Drift | Feature-target relationship | Performance monitoring with ground truth | Retrain model on recent data |
| Prediction Drift | Output distribution | Distribution comparison of predictions | Alert; compare with data drift signals |
| Performance Drift | Model accuracy metrics | Metric tracking against thresholds | Rollback or retrain depending on severity |
Table 13.3: Types of drift in production ML systems.
Data Drift
Changes in the statistical distribution of input features over time. Data drift can degrade model performance even when the underlying relationship between features and targets remains unchanged, because the model encounters inputs outside its training distribution.
Statistical Detection Methods
Statistical methods for drift detection include the Kolmogorov-Smirnov test, Population Stability Index (PSI), and Jensen-Shannon divergence for continuous features; chi-squared tests for categorical features; and multivariate methods like Maximum Mean Discrepancy (MMD) for detecting complex distributional shifts.
PSI = \sum_{i=1}^{n} (p_i - q_i) \cdot \ln\left(\frac{p_i}{q_i}\right)

For PSI-based drift detection: PSI < 0.1 means no meaningful change, so no action is needed. PSI between 0.1 and 0.25 means a moderate shift: investigate and monitor closely. PSI > 0.25 means significant drift: trigger retraining or rollback. These thresholds are guidelines; calibrate them to your specific use case.
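A direct implementation of this formula for a single numeric feature might look like the sketch below; the bin count, epsilon smoothing, and simulated drifted sample are illustrative choices.

```python
import numpy as np


def population_stability_index(baseline, production, bins=10, eps=1e-6):
    """Compute PSI between a baseline and a production sample of one feature."""
    # Bin edges come from the baseline so both samples share the same grid;
    # clip the samples so out-of-range production values fall into the edge bins.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    baseline = np.clip(baseline, edges[0], edges[-1])
    production = np.clip(production, edges[0], edges[-1])

    p, _ = np.histogram(baseline, bins=edges)
    q, _ = np.histogram(production, bins=edges)

    # Convert counts to proportions; eps avoids division by zero and log(0).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((p - q) * np.log(p / q)))


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(0.8, 1.0, 10_000)   # simulated drifted production sample
# Typically lands well above the 0.25 "significant drift" guideline.
print(population_stability_index(baseline, shifted))
```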
Automated Response
Alerting and response automation connect drift detection to remediation actions. Defining appropriate thresholds and response policies requires collaboration between ML engineers and domain experts.
A credit scoring model monitors PSI on all input features hourly. When PSI for the "income" feature exceeds 0.25, the system automatically: (1) sends an alert to the ML team Slack channel, (2) captures a snapshot of the current data distribution for analysis, (3) triggers a retraining job using the last 90 days of data, and (4) deploys the retrained model in shadow mode for 48 hours before promoting.
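The response logic itself can be a small policy function that maps PSI values to actions, roughly as sketched below; the alert, snapshot, retrain, and deploy_shadow callables are hypothetical placeholders for whatever alerting and orchestration your platform provides.

```python
# Thresholds mirror the PSI guidelines above; the helpers passed in are
# illustrative placeholders, not a real platform API.
PSI_WARN = 0.10
PSI_CRITICAL = 0.25


def respond_to_drift(feature_name, psi_value, alert, snapshot, retrain, deploy_shadow):
    if psi_value < PSI_WARN:
        return "no_action"

    if psi_value < PSI_CRITICAL:
        alert(f"Moderate drift on {feature_name}: PSI={psi_value:.2f}")
        return "monitor"

    # Significant drift: alert, capture evidence, retrain, and stage safely.
    alert(f"Significant drift on {feature_name}: PSI={psi_value:.2f}")
    snapshot(feature_name)                      # save the current distribution
    job = retrain(training_window_days=90)      # retrain on recent data
    deploy_shadow(job, duration_hours=48)       # evaluate before promotion
    return "retraining"
```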
Data Drift
Changes in the statistical distribution of input features over time, which can degrade model performance even when the model itself has not changed.
Concept Drift
Changes in the underlying relationship between input features and the target variable, requiring model retraining to maintain accuracy.
04 Drift Detection and Response
Drift detection is the practice of continuously monitoring for distributional changes that could impact model performance. Effective drift detection requires establishing a baseline distribution from the training or validation data and comparing incoming production data against this baseline.
Drift Detection
Automated statistical methods for identifying changes in data distributions or model behavior relative to a baseline, enabling early warning of performance degradation before it impacts users.
Window-Based Detection
Window-based drift detection compares statistics computed over a recent window of production data to the baseline. The window size controls the trade-off between detection sensitivity and false positive rate: smaller windows detect drift faster but are noisier.
A 1-hour window detects drift within hours but may produce false positives from normal hourly variation (e.g., business hours vs. night traffic). A 7-day window is more stable but takes a week to detect even severe drift. Adaptive windowing algorithms like ADWIN solve this by automatically adjusting the window based on observed change rates.
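For contrast with adaptive approaches like ADWIN, the sketch below implements the simpler fixed-window idea using a two-sample Kolmogorov-Smirnov test from SciPy; the window size and p-value threshold are illustrative and would need tuning for real traffic.

```python
from collections import deque

import numpy as np
from scipy.stats import ks_2samp


class WindowedDriftDetector:
    """Compare a sliding window of recent values against a fixed baseline sample."""

    def __init__(self, baseline, window_size=1_000, p_threshold=0.01):
        self.baseline = np.asarray(baseline)
        self.window = deque(maxlen=window_size)
        self.p_threshold = p_threshold

    def update(self, value) -> bool:
        """Add one observation; return True when the current window looks drifted."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False                      # wait until the window is full
        _, p_value = ks_2samp(self.baseline, np.asarray(self.window))
        return p_value < self.p_threshold
```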
ADWIN
Adaptive Windowing, an algorithm that maintains a variable-length window of recent observations and automatically shrinks the window when a statistically significant change is detected, enabling fast detection of drift at multiple time scales.
Feature-Level vs. Model-Level Monitoring
Feature-level monitoring tracks each input feature independently, while model-level monitoring tracks the joint distribution of features and predictions. Feature-level monitoring is simpler and more interpretable but may miss changes that only appear in feature interactions.
| Monitoring Approach | Granularity | Interpretability | Interaction Detection | Computational Cost |
|---|---|---|---|---|
| Feature-level (univariate) | Per feature | High — pinpoints which feature drifted | Misses feature interactions | Low — O(n) for n features |
| Model-level (multivariate) | All features jointly | Low — detects drift but hard to diagnose | Captures all interactions | High — scales with dimensionality |
| Prediction-level | Model output | Medium — shows effect but not cause | Captures downstream impact | Low — single distribution |
Table 13.4: Monitoring approaches compared.
Use all three monitoring levels together. Prediction-level monitoring is your first alert that something is wrong. Feature-level monitoring diagnoses which input changed. Model-level monitoring catches interaction effects that per-feature monitoring misses. This layered approach provides both early warning and diagnostic capability.
Response Strategies
Response strategies depend on the severity and type of drift detected. Retraining with recent data is the most common response, but it requires maintaining a reliable retraining pipeline. In cases of sudden, severe drift, it may be better to fall back to a simpler model or rule-based system while investigating the root cause.
Deeper Insight: the feedback loop. Production models generate data that feeds back into training, creating either a virtuous or vicious cycle. When a recommendation model serves good predictions, users engage more, generating high-quality training data that further improves the model. But when predictions degrade, users disengage or behave differently, producing biased training data that makes the next model even worse. Recognizing and managing this feedback loop is one of the most important — and most overlooked — aspects of ML operations.
Automated retraining is powerful but dangerous if the drift represents a data quality issue rather than a genuine distribution shift. Retraining on corrupted data embeds the corruption into the model. Always include data validation checks in the retraining pipeline, and hold back a stable test set that is not affected by the drift.
A fraud detection model sees PSI spike to 0.8 on transaction amount features after a payment processor changes its API format, doubling all values. Retraining on this data would learn incorrect patterns. The correct response is to fall back to a rule-based system, fix the data pipeline, and then retrain once the input data is correct.
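One way to encode that guardrail is a validation gate that automated retraining must pass before it is allowed to run, as sketched below. The function expects a pandas DataFrame, and the schema, missing-rate, and range checks shown are illustrative examples rather than a complete validation suite.

```python
def safe_to_retrain(candidate_df, schema_columns, expected_ranges, max_missing=0.02):
    """Validation gate: refuse automated retraining if candidate data looks corrupted.

    candidate_df is assumed to be a pandas DataFrame of recent production data.
    """
    # Schema check: all expected columns must be present.
    if not set(schema_columns).issubset(candidate_df.columns):
        return False

    # Missing-value check: corruption often shows up as spikes in null rates.
    if candidate_df[list(schema_columns)].isna().mean().max() > max_missing:
        return False

    # Range check: e.g. a payment API doubling all amounts would violate these bounds.
    for column, (low, high) in expected_ranges.items():
        values = candidate_df[column].dropna()
        if values.between(low, high).mean() < 0.99:
            return False

    return True
```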
Drift Detection
Automated statistical methods for identifying changes in data distributions or model behavior that could indicate degraded performance.
ADWIN
Adaptive Windowing, an algorithm that automatically adjusts its monitoring window size to detect distributional changes at different time scales.
05 MLOps Platform Architecture
A comprehensive MLOps platform integrates experiment tracking, pipeline orchestration, model registry, feature store, monitoring, and deployment management into a cohesive system. The architecture of this platform determines how effectively teams can develop, deploy, and maintain ML models at scale.
MLOps Platform
An integrated system combining experiment tracking, pipeline orchestration, model registry, feature store, serving infrastructure, and monitoring to support the complete machine learning lifecycle from development through production operations.
Build vs. Buy Decisions
Platform architecture decisions include build versus buy, centralized versus decentralized, and open versus proprietary. Each choice has implications for flexibility, cost, maintenance burden, and team productivity.
| Decision | Option A | Option B | Key Consideration |
|---|---|---|---|
| Build vs. Buy | Custom platform (OSS tools) | Managed service (cloud vendor) | Integration effort vs. vendor lock-in |
| Centralized vs. Decentralized | Shared platform team | Team-specific tool choices | Consistency vs. autonomy |
| Open vs. Proprietary | Open-source stack | Vendor solutions | Flexibility vs. support and polish |
Table 13.5: Key MLOps platform architecture decisions.
Open-Source vs. Managed Stacks
Open-source MLOps stacks typically combine tools like MLflow (experiment tracking), Kubeflow (pipeline orchestration), Feast (feature store), Seldon Core (model serving), and Prometheus/Grafana (monitoring). These components are powerful individually but require significant integration effort.
- MLflow — Experiment tracking, model registry, and model packaging.
- Kubeflow Pipelines — Kubernetes-native ML pipeline orchestration.
- Feast — Feature store for consistent feature serving across training and inference.
- Seldon Core / KServe — Model serving with scaling, monitoring, and A/B testing.
- Prometheus + Grafana — Metrics collection, alerting, and visualization.
- DVC — Data versioning and pipeline management for reproducibility.
Don't try to build a complete MLOps platform from day one. Start with experiment tracking and a model registry (MLflow is a great starting point). Add pipeline orchestration when manual processes become painful. Add a feature store when training-serving skew becomes a problem. Each component should solve a real, observed pain point.
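As a starting point, the MLflow snippet below sketches what logging a run and registering the resulting model might look like; it assumes an already-trained scikit-learn estimator named model, a reachable tracking server, and illustrative experiment and model names.

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("churn-prediction")          # illustrative experiment name

with mlflow.start_run() as run:
    # `model` is assumed to be a scikit-learn estimator trained earlier in the script.
    mlflow.log_params({"learning_rate": 0.1, "n_estimators": 200})
    mlflow.log_metric("val_auc", 0.91)

    # Log the trained model as an artifact of this run.
    mlflow.sklearn.log_model(model, artifact_path="model")

    # Register the logged model so deployments can reference a named version.
    model_uri = f"runs:/{run.info.run_id}/model"
    mlflow.register_model(model_uri, name="churn-classifier")
```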
MLOps Maturity Model
The maturity of an organization's MLOps practice can be assessed along dimensions like automation level, reproducibility, monitoring coverage, and deployment velocity. Google's MLOps maturity model defines three levels.
| Maturity Level | Description | Characteristics | Typical Deployment Frequency |
|---|---|---|---|
| Level 0: Manual | Manual, script-driven process | No pipeline automation, manual deployment, limited tracking | Monthly or less |
| Level 1: Pipeline Automation | Automated training pipelines | Automated retraining, model registry, basic monitoring | Weekly |
| Level 2: CI/CD Automation | Full CI/CD for ML | Automated testing, deployment, monitoring, and rollback | Daily or on-demand |
Table 13.6: Google's MLOps maturity model.
Despite the growing sophistication of MLOps tooling, surveys consistently show that most organizations remain at Level 0 (manual) or early Level 1. The gap between available tools and adopted practices represents both a challenge and an opportunity for ML teams to gain competitive advantage through operational excellence.
The MLOps ecosystem has hundreds of tools, and it is tempting to adopt many of them. Resist this urge. Each additional tool in your stack adds integration complexity, maintenance burden, and cognitive load. Choose a cohesive set of tools that work well together, and commit to them.
MLOps Platform
An integrated system combining experiment tracking, pipeline orchestration, model management, and monitoring to support the complete ML lifecycle.
MLOps Maturity Model
A framework for assessing the automation and sophistication level of an organization's ML operations practices, from manual to fully automated.
Key Takeaways
1. ML CI/CD must handle data, model, and configuration changes alongside traditional code changes.
2. Model degradation can be gradual and silent, making proactive monitoring with drift detection essential.
3. Data drift and concept drift require different detection methods and may warrant different response strategies.
4. Experiment tracking at scale becomes a critical organizational capability for collaboration and knowledge management.
5. MLOps maturity progresses from manual processes through pipeline automation to full CI/CD for ML.