AI Workflow
Covers the ML development lifecycle and methodology. Introduces experiment tracking, reproducibility, and systematic approaches to model development.
- Describe the complete model lifecycle from ideation through deployment and retirement
- Compare experiment tracking approaches and their role in reproducibility
- Design a model versioning and registry strategy for team collaboration
- Analyze CI/CD pipeline requirements specific to machine learning workflows
- Implement model governance practices including approval gates and audit trails
01 The ML Development Lifecycle
The ML development lifecycle provides a structured methodology for taking models from concept to production. Unlike traditional software development, ML development is fundamentally experimental: success requires systematic iteration across data preparation, model development, training, evaluation, and deployment stages.
ML Development Lifecycle
The structured sequence of stages from problem definition through data preparation, model development, training, evaluation, deployment, and monitoring that governs ML project development. It is inherently iterative, with feedback from later stages driving improvements to earlier ones.
Figure: ML Model Lifecycle
Problem Formulation
The lifecycle begins with problem formulation, where engineers define the task, success metrics, and constraints. This stage is often underestimated but is critical for project success.
Most ML projects that fail do so not because of model quality but because of poor problem formulation. Building an accurate model for the wrong objective wastes months of engineering effort. Before writing any code, clearly define: What decision will this model inform? How will you measure success? What are the latency and cost constraints? Is there enough data? One way to record these decisions in code is sketched after the checklist below.
- Define the business objective and how ML predictions will be used
- Specify success metrics (accuracy, precision, recall, F1) with concrete thresholds
- Establish latency budgets and throughput requirements for serving
- Assess data availability, quality, and labeling needs
- Identify potential biases and ethical considerations
- Set a timeline and compute budget for experimentation
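To make these decisions concrete and reviewable, some teams check a short problem specification into the repository alongside the code. The sketch below is purely illustrative; the `ProblemSpec` fields and all example values are assumptions, not part of any standard API.

```python
# Illustrative sketch: a checked-in problem specification that makes the
# objective, success threshold, and budgets explicit. All fields and
# values here are hypothetical examples.
from dataclasses import dataclass

@dataclass
class ProblemSpec:
    business_objective: str           # the decision the model will inform
    success_metric: str               # e.g. recall at a fixed false-positive rate
    success_threshold: float          # minimum acceptable value of the metric
    latency_budget_ms: int            # p95 serving latency budget
    monthly_compute_budget_usd: float # serving cost ceiling

spec = ProblemSpec(
    business_objective='flag fraudulent transactions for manual review',
    success_metric='recall_at_1pct_fpr',
    success_threshold=0.85,
    latency_budget_ms=50,
    monthly_compute_budget_usd=2000.0,
)
```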
Iterative Development and Artifact Management
After problem formulation, the lifecycle moves through iterative cycles of data preparation, feature engineering, model selection, training, and evaluation. Each cycle produces artifacts: processed datasets, trained model weights, evaluation reports, and configuration files.
| Artifact Type | Examples | Versioning Tool | Retention Policy |
|---|---|---|---|
| Data | Training sets, evaluation splits, feature definitions | DVC, Delta Lake | Keep all versions for reproducibility |
| Code | Training scripts, feature pipelines, configs | Git | Standard branch/tag strategy |
| Models | Checkpoints, exported models, ONNX files | MLflow, W&B | Keep best N per experiment |
| Metrics | Loss curves, evaluation reports, confusion matrices | MLflow, W&B | Keep all for comparison |
| Configs | Hyperparameters, infrastructure definitions | Git, Hydra | Link to corresponding model |
Table 5.1: ML lifecycle artifacts and their management strategies.
Every trained model should be linked to the exact code commit, data version, configuration file, and random seed that produced it. If you cannot reproduce a training run from its metadata alone, your versioning strategy has a gap. Tools like MLflow and DVC make this straightforward when adopted from the start of a project.
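As a minimal sketch of this practice, a run can record the Git commit, data version, and seed at the start of training; the `data_version` value and tag names below are illustrative assumptions.

```python
# Sketch: tying a training run to the exact code commit, data version,
# and seed. The data_version string and tag names are assumptions.
import subprocess

import mlflow

# Capture the current Git commit of the training code
commit = subprocess.check_output(['git', 'rev-parse', 'HEAD'], text=True).strip()

with mlflow.start_run():
    mlflow.set_tags({'git_commit': commit, 'data_version': 'v2.3'})
    mlflow.log_param('seed', 42)
    # ... training loop ...
```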
From Experimentation to Production
The deployment phase marks the transition from experimentation to production. This involves packaging the model, setting up serving infrastructure, implementing monitoring, and establishing feedback loops for continuous improvement. The lifecycle does not end at deployment; production models require ongoing monitoring, maintenance, and periodic retraining.
Models degrade over time as the data they encounter drifts away from the data they learned from. Unlike traditional software, which behaves the same way forever unless the code changes, ML models silently rot. The world changes: user behavior shifts, new products launch, seasons change, language evolves. The model, meanwhile, remains frozen in the snapshot of reality it was trained on. This silent degradation is more dangerous than a software crash because the model keeps producing outputs that look valid but are increasingly wrong. Think of it like a guidebook written five years ago for a rapidly changing city: the restaurants have closed, the subway map is wrong, and new neighborhoods have appeared. The book still looks authoritative and gives confident recommendations; they are just increasingly inaccurate. Without someone checking the guidebook against the current city, you would never know it had gone stale.
A deployed model is not a finished product. Data distributions shift, user behavior changes, and new edge cases emerge. Production models require continuous monitoring of prediction quality, periodic retraining on fresh data, and infrastructure maintenance. Plan for the full operational lifecycle, not just the initial deployment.
ML Lifecycle
The structured sequence of stages from problem definition through data preparation, model development, deployment, and monitoring that governs ML project development.
Problem Formulation
The critical first stage of ML development where the task, success metrics, constraints, and data requirements are precisely defined.
02 Experiment Tracking and Management
Experiment tracking is the systematic recording of all parameters, metrics, and artifacts associated with each training run. Without proper tracking, ML teams quickly lose the ability to reproduce results, compare approaches, or understand why certain configurations worked better than others.
Experiment Tracking
The systematic recording and organization of all parameters, metrics, code versions, data versions, and artifacts from ML training runs. Experiment tracking enables reproducibility, comparison across runs, and collaborative iteration within ML teams.
Tracking Platforms
Modern experiment tracking platforms provide centralized dashboards for logging hyperparameters, training curves, evaluation metrics, and model artifacts. These tools enable teams to search, filter, and compare across hundreds of experiments.
| Platform | Type | Key Strength | Best For |
|---|---|---|---|
| MLflow | Open-source | Full lifecycle management, model registry | Teams wanting self-hosted, vendor-neutral tooling |
| Weights & Biases | SaaS (free tier) | Beautiful visualizations, easy team collaboration | Research teams and fast-moving startups |
| Neptune | SaaS | Flexible metadata structure, strong comparison tools | Large teams with diverse experiment types |
| TensorBoard | Open-source | Deep integration with TensorFlow and PyTorch | Individual researchers, quick local visualization |
| ClearML | Open-source + SaaS | Automatic logging, pipeline orchestration | Teams wanting an all-in-one platform |
Table 5.2: Comparison of popular experiment tracking platforms.
What to Log
Effective experiment tracking requires discipline in logging practices. Every run should record the complete configuration to enable exact reproduction.
- Model architecture — Layer types, sizes, activation functions, number of parameters
- Hyperparameters — Learning rate, batch size, optimizer settings, regularization
- Data version — Hash or tag identifying the exact training and validation data
- Random seeds — All seeds (Python, NumPy, PyTorch/TensorFlow) for reproducibility
- Framework version — Exact versions of PyTorch/TensorFlow, CUDA, cuDNN
- Hardware environment — GPU type, number of devices, distributed training config
- Training metrics — Loss curves, gradient norms, learning rate schedule
- Evaluation metrics — Accuracy, F1, AUC, confusion matrix on validation set
class="tok-comment"># Example: Logging an experiment with MLflow
import mlflow
with mlflow.start_run(run_name=&class="tok-comment">#class="tok-number">39;resnet50-lr-sweepclass="tok-string">class="tok-number">39;):
mlflow.log_params({
&class="tok-comment">#class="tok-number">39;modelclass="tok-number">39;: class="tok-string">class="tok-number">39;resnet50class="tok-number">39;,
&class="tok-comment">#class="tok-number">39;learning_rateclass="tok-string">class="tok-number">39;: class="tok-number">0.001,
&class="tok-comment">#class="tok-number">39;batch_sizeclass="tok-number">39;: class="tok-number">256,
&class="tok-comment">#class="tok-number">39;optimizerclass="tok-string">class="tok-number">39;: class="tok-number">39;adamwclass="tok-string">class="tok-number">39;,
&class="tok-comment">#class="tok-number">39;data_versionclass="tok-number">39;: class="tok-string">class="tok-number">39;v2.class="tok-number">3class="tok-number">39;,
&class="tok-comment">#class="tok-number">39;seedclass="tok-string">class="tok-number">39;: class="tok-number">42,
})
class="tok-comment"># ... training loop ...
mlflow.log_metrics({&class="tok-comment">#class="tok-number">39;val_accuracyclass="tok-number">39;: class="tok-number">0.943, class="tok-string">class="tok-number">39;val_lossclass="tok-number">39;: class="tok-number">0.187})
mlflow.pytorch.log_model(model, &class="tok-comment">#class="tok-number">39;modelclass="tok-number">39;)Collaborative Workflows
Beyond individual experiments, tracking systems support collaborative workflows where multiple team members contribute experiments toward shared objectives. Features like experiment groups, tags, and notes help organize work across team members.
Agree on naming conventions, tag taxonomy, and required metadata fields before the team starts experimenting. A shared convention like "task/model/variant" (e.g., "sentiment/bert-base/lr-sweep") makes it easy to find and compare experiments weeks or months later. Inconsistent naming is the top reason teams abandon their experiment tracking.
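One lightweight way to enforce such a convention, assuming MLflow as the tracker, is to build the run name from the agreed components and record them as tags as well, so runs stay filterable even if names drift. The component values and the `owner` field below are assumptions.

```python
# Sketch: deriving the run name and tags from a shared 'task/model/variant'
# convention. The example values and the 'owner' field are assumptions.
import mlflow

task, model_name, variant = 'sentiment', 'bert-base', 'lr-sweep'

with mlflow.start_run(run_name=f'{task}/{model_name}/{variant}'):
    mlflow.set_tags({'task': task, 'model': model_name,
                     'variant': variant, 'owner': 'alice'})
```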
A team of five ML engineers runs 200 experiments over two months to improve a recommendation model. When the best model fails in production, they use experiment tracking to identify that the improvement came from a data preprocessing change (not the architecture change they assumed). They revert the preprocessing, find the real issue (a data leak in validation), and recover in hours instead of rerunning weeks of experiments.
Experiment Tracking
The systematic recording and organization of all parameters, metrics, code versions, and artifacts from ML training runs to enable reproducibility and comparison.
MLflow
An open-source platform for managing the ML lifecycle, providing experiment tracking, model packaging, and deployment capabilities.
03 Reproducibility in ML
Reproducibility is the ability to obtain consistent results when repeating an experiment under the same conditions. In ML, achieving reproducibility is surprisingly difficult due to sources of non-determinism in hardware, software, and the training process itself.
Reproducibility
The ability to obtain consistent experimental results when repeating a procedure under the same conditions. In ML, this requires controlling random seeds, data versions, code versions, and hardware environment. It is a fundamental requirement for scientific validity and production reliability.
Sources of Non-Determinism
Sources of non-determinism include random weight initialization, data shuffling, GPU floating-point operation ordering, and framework-level optimizations that sacrifice determinism for performance.
| Source | Cause | Mitigation | Performance Impact |
|---|---|---|---|
| Weight initialization | Random number generators | Set all seeds (Python, NumPy, PyTorch) | None |
| Data shuffling | Random batch ordering | Seed the DataLoader or use deterministic shuffling | None |
| GPU operation ordering | Non-deterministic atomics on GPU | torch.use_deterministic_algorithms(True) | 10-30% slower |
| cuDNN auto-tuning | Different algorithms chosen per run | Set cudnn.deterministic = True | 5-15% slower |
| Floating-point reduction | Summation order varies in parallel | Use deterministic reductions | 5-10% slower |
| Framework optimizations | Graph fusion, kernel selection | Disable torch.compile optimizations | Significant slowdown |
Table 5.3: Sources of non-determinism in ML training and their mitigations.
Achieving exact bit-for-bit reproducibility often requires disabling GPU optimizations that sacrifice determinism for performance. This can slow training by 20-30%. In practice, "statistical reproducibility" (results within a small variance across runs) is usually sufficient. Reserve bit-exact reproducibility for debugging and validation, not production training.
Practical Reproducibility
class="tok-comment"># Setting seeds for practical reproducibility in PyTorch
import random
import numpy as np
import torch
def set_seed(seed: int = class="tok-number">42):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
class="tok-comment"># Optional: slower but deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = FalseBefore sharing results or deploying a model, verify you can answer "yes" to: (1) Can I rerun this experiment from its metadata? (2) Are all dependencies pinned to exact versions? (3) Is the training data versioned and accessible? (4) Are random seeds set and recorded? (5) Is the hardware environment documented?
Version Control for ML
DVC (Data Version Control)
A tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code. DVC uses content-addressable storage to efficiently track changes to large files without storing them directly in Git.
Version control for ML extends beyond code to include data, configurations, and model artifacts. Tools like DVC track large datasets and model files alongside code, enabling teams to switch between different versions of their entire ML pipeline.
A team uses DVC to version their 50 GB training dataset alongside their Git repository. When a model regression is detected in production, they run "dvc checkout v2.1" to restore the exact data used to train the previous good model. They discover that a recent data pipeline change introduced duplicate records. Without data versioning, this root cause would have taken days to identify.
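For programmatic access, DVC also exposes a Python API that can read a file as it existed at a given Git revision. The file path and tag below are assumptions chosen to match the scenario above.

```python
# Sketch: reading a DVC-tracked file as it existed at Git tag v2.1.
# The repository layout and file path are illustrative assumptions.
import dvc.api

with dvc.api.open('data/train.csv', rev='v2.1') as f:
    header = f.readline()
    print(header)
```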
Reproducibility
The ability to obtain consistent experimental results when repeating a procedure under the same conditions, a fundamental requirement for scientific ML development.
DVC
Data Version Control, a tool that extends Git to handle large files and datasets, enabling versioning of data and models alongside code.
04 MLOps Workflow Methodology
MLOps brings DevOps principles to machine learning, establishing practices for continuous integration, delivery, and training of ML models. The goal is to reduce the time and friction involved in moving models from development to production while maintaining quality and reliability.
MLOps
The set of practices that combines ML development (Dev) with operations (Ops) to enable continuous integration, continuous delivery, and continuous training of ML models. MLOps automates the pipeline from experimentation through deployment while maintaining quality and reliability.
The Three Pillars of MLOps
A mature MLOps workflow automates the progression from code changes through testing, training, evaluation, and deployment.
| Pillar | Trigger | Actions | Output |
|---|---|---|---|
| Continuous Integration (CI) | Code commit or PR | Unit tests, linting, data validation, schema checks | Pass/fail status, test reports |
| Continuous Training (CT) | New data arrives or performance degrades | Retrain model, evaluate on holdout set, compare to baseline | New model candidate with metrics |
| Continuous Delivery (CD) | Model passes validation gates | Package model, deploy to staging, canary, then production | Production-serving model |
Table 5.4: The three pillars of MLOps and their automation.
Google defines three levels of MLOps maturity. Level 0: manual, script-driven process. Level 1: ML pipeline automation with continuous training. Level 2: CI/CD pipeline automation for both code and models. Most organizations should aim for Level 1 as their initial target, as the jump from Level 0 to Level 1 delivers the majority of operational benefits.
Pipeline Orchestration
Pipeline orchestration is the backbone of MLOps workflows. Tools like Kubeflow Pipelines, Apache Airflow, and Prefect define ML workflows as composable steps with explicit dependencies.
- Apache Airflow — General-purpose workflow orchestration with a large ecosystem of integrations
- Kubeflow Pipelines — ML-specific orchestration on Kubernetes with native GPU support
- Prefect — Python-native orchestration with strong local development and cloud deployment
- Dagster — Data-aware orchestration with built-in testing and observability
- ZenML — MLOps framework that abstracts across infrastructure providers
You do not need a full orchestration platform on day one. Start by automating the most painful manual step — often retraining when new data arrives. A scheduled cron job that triggers a training script, evaluates the result, and sends a Slack notification is a perfectly valid first step toward MLOps maturity.
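A minimal sketch of such a job might look like the following; `train()` and `evaluate()` are hypothetical project helpers, and the webhook URL and baseline accuracy are placeholder assumptions.

```python
# Sketch of a cron-scheduled retrain-evaluate-notify job.
# train() and evaluate() are hypothetical project helpers; the webhook URL
# and baseline threshold are placeholder assumptions.
import json
import urllib.request

WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
BASELINE_ACCURACY = 0.92                                      # assumed production baseline

def notify(text: str) -> None:
    # Post a short status message to a Slack incoming webhook
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({'text': text}).encode(),
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req)

def main() -> None:
    model = train()               # hypothetical: retrain on fresh data
    accuracy = evaluate(model)    # hypothetical: evaluate on a holdout set
    if accuracy >= BASELINE_ACCURACY:
        notify(f'New candidate model ready for review (val acc {accuracy:.3f})')
    else:
        notify(f'Retraining regressed: val acc {accuracy:.3f} < {BASELINE_ACCURACY}')

if __name__ == '__main__':
    main()
```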
Model Registries and Promotion
Model Registry
A centralized repository for storing, versioning, and managing the lifecycle of trained ML models. Models progress through stages (development, staging, canary, production) based on validation gates, ensuring only thoroughly tested models reach production.
Model registries serve as the central hub for managing model versions and their lifecycle states. A model in the registry progresses through stages like "staging," "canary," and "production" based on validation gates.
A new model is trained and registered as version 3.2 in "development" stage. Automated tests verify it meets minimum accuracy thresholds. It is promoted to "staging" where integration tests run against production data samples. After passing, it moves to "canary" where 5% of live traffic is routed to it for 24 hours. If all SLOs are met, it is promoted to "production" and the old model is archived.
Every model deployment should have a tested rollback procedure. Keep the previous production model readily available (not archived or deleted) so traffic can be instantly routed back if the new model causes issues. Automated rollback triggered by SLO violations is ideal, but manual rollback should be achievable in under 5 minutes.
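With MLflow's model registry, for example, promotion and rollback can be expressed as stage transitions (newer MLflow releases also offer model version aliases for the same purpose). The model name and version numbers below are assumptions.

```python
# Sketch: stage-based promotion and rollback in the MLflow model registry.
# The model name and version numbers are illustrative assumptions.
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Candidate passed its validation gates: promote to Staging
client.transition_model_version_stage(name='recommender', version='32', stage='Staging')

# Canary met its SLOs: promote to Production (the prior version stays registered)
client.transition_model_version_stage(name='recommender', version='32', stage='Production')

# Rollback path: point Production back at the previous version
client.transition_model_version_stage(name='recommender', version='31', stage='Production')
```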
MLOps
The set of practices that combines ML development with operations, enabling continuous delivery and automation of the ML lifecycle.
Model Registry
A centralized repository for storing, versioning, and managing the lifecycle of trained ML models from development through production deployment.
05 Hyperparameter Tuning and AutoML
Hyperparameter tuning is the process of finding optimal configuration values that are not learned during training, such as learning rate, batch size, model architecture choices, and regularization strength. The hyperparameter space is often high-dimensional and expensive to explore, making efficient search strategies essential.
Hyperparameter
A configuration value that is set before training begins and is not updated by the optimization algorithm. Examples include learning rate, batch size, number of layers, dropout rate, and regularization strength. Choosing good hyperparameters is critical for model performance.
Search Strategies
Grid search exhaustively evaluates all combinations of specified hyperparameter values, while random search samples configurations at random from the space. Research by Bergstra and Bengio (2012) showed that random search is more efficient than grid search for most problems because it explores more distinct values along the dimensions that actually matter.
| Strategy | Approach | Sample Efficiency | Best For |
|---|---|---|---|
| Grid Search | Exhaustive evaluation of all combinations | Poor (exponential in dimensions) | Small spaces with < 3 hyperparameters |
| Random Search | Random sampling from the space | Better than grid (Bergstra & Bengio, 2012) | Medium spaces, limited compute budget |
| Bayesian Optimization | Probabilistic surrogate model guides search | Best (2-10x fewer trials than random) | Expensive training runs, moderate dimensions |
| Hyperband | Adaptive resource allocation with early stopping | Good (quickly eliminates bad configs) | Large spaces with many cheap-to-evaluate configs |
| Population-Based Training | Evolutionary approach during training | Good for schedules | Learning rate and schedule optimization |
Table 5.5: Hyperparameter search strategies and their trade-offs.
Grid search wastes trials by exhaustively covering unimportant dimensions. If only 2 out of 5 hyperparameters actually matter (common in practice), grid search evaluates many redundant combinations of the 3 unimportant ones. Random search distributes trials more evenly across the important dimensions, finding good configurations faster.
Bayesian Optimization
Bayesian Optimization
A sequential optimization strategy that builds a probabilistic surrogate model (typically a Gaussian Process or TPE) of the objective function and uses an acquisition function to balance exploration and exploitation when selecting the next configuration to evaluate.
Bayesian optimization methods like Tree-structured Parzen Estimators (TPE) and Gaussian Processes build a probabilistic model of the objective function and use it to select promising configurations to evaluate next. These methods are significantly more sample-efficient than random search, particularly for expensive training runs.
class="tok-comment"># Bayesian hyperparameter optimization with Optuna
import optuna
def objective(trial):
lr = trial.suggest_float(&class="tok-comment">#class="tok-number">39;lrclass="tok-string">class="tok-number">39;, class="tok-number">1e-5, class="tok-number">1e-2, log=True)
batch_size = trial.suggest_categorical(&class="tok-comment">#class="tok-number">39;batch_sizeclass="tok-number">39;, [class="tok-number">32, class="tok-number">64, class="tok-number">128, class="tok-number">256])
n_layers = trial.suggest_int(&class="tok-comment">#class="tok-number">39;n_layersclass="tok-string">class="tok-number">39;, class="tok-number">2, class="tok-number">8)
dropout = trial.suggest_float(&class="tok-comment">#class="tok-number">39;dropoutclass="tok-number">39;, class="tok-number">0.0, class="tok-number">0.5)
model = build_model(n_layers=n_layers, dropout=dropout)
val_accuracy = train_and_evaluate(model, lr=lr, batch_size=batch_size)
return val_accuracy
study = optuna.create_study(direction=&class="tok-comment">#class="tok-number">39;maximizeclass="tok-number">39;)
study.optimize(objective, n_trials=class="tok-number">100)AutoML
AutoML extends hyperparameter tuning to automate broader aspects of the ML pipeline, including feature engineering, model selection, and architecture design.
AutoML excels at finding good configurations within defined search spaces but cannot replace domain expertise in problem formulation, data quality assessment, or system design. An AutoML system optimizing the wrong metric or training on poor data will efficiently find the wrong answer. Always validate AutoML results against domain knowledge.
Start with a coarse random search (20-50 trials) to identify the right order of magnitude for each hyperparameter. Then use Bayesian optimization with a narrowed search space for fine-tuning. This two-phase approach is more efficient than running Bayesian optimization on a very wide initial space.
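Sketched with Optuna, the two phases might look like this. The narrowed ranges and fixed values in the second phase are assumptions standing in for whatever the coarse search actually reveals, and `build_model` and `train_and_evaluate` are the same hypothetical helpers used in the example above.

```python
# Sketch of two-phase tuning: coarse random search, then TPE (Bayesian)
# on a narrowed space. Narrowed bounds and fixed values are assumptions.
import optuna

# Phase 1: coarse random search over the wide space defined in objective()
coarse = optuna.create_study(
    direction='maximize', sampler=optuna.samplers.RandomSampler(seed=0))
coarse.optimize(objective, n_trials=30)

# Phase 2: TPE search over ranges narrowed around the best coarse trials
def narrowed_objective(trial):
    lr = trial.suggest_float('lr', 1e-4, 1e-3, log=True)            # narrowed range
    batch_size = trial.suggest_categorical('batch_size', [128, 256])
    model = build_model(n_layers=4, dropout=0.1)                     # fixed from phase 1
    return train_and_evaluate(model, lr=lr, batch_size=batch_size)

fine = optuna.create_study(
    direction='maximize', sampler=optuna.samplers.TPESampler(seed=0))
fine.optimize(narrowed_objective, n_trials=50)
```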
Hyperparameter Tuning
The process of systematically searching for optimal values of parameters that are set before training begins, such as learning rate and model architecture choices.
Bayesian Optimization
A sequential optimization strategy that builds a probabilistic surrogate model to efficiently guide the search for optimal hyperparameters.
Key Takeaways
1. The ML lifecycle is inherently iterative and experimental, requiring systematic tracking and management of all artifacts.
2. Experiment tracking platforms are essential for reproducibility, collaboration, and efficient hyperparameter exploration.
3. Practical reproducibility requires controlling random seeds, versioning data and dependencies, and documenting the environment.
4. MLOps automates the pipeline from code changes through training, validation, and deployment with continuous integration and delivery.
5. Bayesian optimization is significantly more sample-efficient than grid or random search for hyperparameter tuning.