Introduction
Sets the stage for ML systems engineering. Explores why systems thinking matters for building reliable, scalable machine learning solutions.
- Explain why ML systems engineering extends far beyond model development
- Identify the key components of a production ML system and their interdependencies
- Describe the complete ML lifecycle from problem formulation to model retirement
- Analyze the trade-offs between accuracy, latency, throughput, and cost in ML systems
- Articulate the "hidden technical debt" concept and its implications for system design
01 Why ML Systems Matter
Machine learning has moved far beyond research papers and Jupyter notebooks. Today, ML is embedded in billions of devices, from cloud data centers processing petabytes of data to microcontrollers running on milliwatts of power. This shift demands a systems-level understanding that bridges the gap between algorithmic innovation and real-world deployment.
Everyone wants to do the model work, not the plumbing. But it is the plumbing that makes or breaks a real-world ML system.
ML Systems Engineering
The discipline of designing, building, and maintaining the complete infrastructure required to develop, deploy, and operate machine learning models in production environments. It encompasses data pipelines, training infrastructure, serving systems, and monitoring — not just the model itself.
The field of ML systems engineering recognizes that building a good model is only a small fraction of the overall challenge. Google famously illustrated this with the "hidden technical debt" paper, showing that ML code represents a tiny portion of a production ML system. The surrounding infrastructure for data collection, feature extraction, monitoring, and serving dwarfs the model itself.
ML code represents only about 5% of a production ML system.

Deeper Insight: This means that for every line of model code you write, there are roughly 19 lines of supporting infrastructure: data validation, feature stores, configuration management, monitoring dashboards, and serving pipelines. Teams that focus exclusively on model accuracy while neglecting this 95% are building on sand.

Think of it like an iceberg: the visible tip above the water is your model, but the massive structure below the surface (data pipelines, monitoring, serving infrastructure, and configuration) is what keeps the whole thing afloat and stable.
Google's landmark "Hidden Technical Debt in Machine Learning Systems" paper (Sculley et al., 2015) found that ML model code accounts for only about 5% of the total code in a production ML system. The remaining 95% consists of data pipelines, configuration, serving infrastructure, monitoring, and feature engineering.
According to Google's "Hidden Technical Debt" paper, approximately what percentage of a production ML system's code is the ML model itself?
Think about all the infrastructure surrounding the model: data pipelines, monitoring, serving, configuration.
Systems thinking provides a framework for reasoning about these complex, interconnected components. Rather than optimizing a single metric in isolation, ML systems engineers must balance accuracy, latency, throughput, energy consumption, cost, and reliability simultaneously. This holistic perspective is what separates a research prototype from a production system serving millions of users.
- Accuracy — Does the model produce correct predictions?
- Latency — How quickly does the system respond to a single request?
- Throughput — How many requests can the system process per second?
- Energy consumption — What is the power cost of training and inference?
- Monetary cost — What are the cloud compute and storage expenses?
- Reliability — Does the system operate correctly under failures and load?
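These objectives interact, and the first step in managing them is measuring them. The sketch below shows one minimal way to measure per-request latency percentiles and overall throughput; `predict` is a hypothetical stand-in for a real model call, not an API from this chapter.

```python
import time

def predict(x):
    # Hypothetical stand-in for a real model inference call.
    return sum(x) / len(x)

def benchmark(fn, inputs):
    """Measure per-request latency (p50/p99) and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    throughput = len(inputs) / elapsed  # requests per second
    return p50, p99, throughput

p50, p99, qps = benchmark(predict, [[1.0, 2.0, 3.0]] * 1000)
print(f"p50={p50 * 1e6:.1f} us  p99={p99 * 1e6:.1f} us  qps={qps:.0f}")
```

Note that the tail (p99) often matters more than the median in production: a user-facing service is only as fast as its slowest common requests.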
When evaluating any ML project, ask: "What happens to the prediction after the model produces it?" If you cannot trace the full path from raw data to user-facing outcome, you are thinking about a model — not a system.
ML Systems Engineering
The discipline of designing, building, and maintaining the complete infrastructure required to develop, deploy, and operate machine learning models in production environments.
Hidden Technical Debt
The accumulating maintenance costs in ML systems that arise from complex dependencies, data management challenges, and configuration issues beyond the model code itself.
02 The ML Lifecycle
The machine learning lifecycle encompasses every stage from problem formulation to model retirement. It begins with defining the problem and identifying the right data sources, moves through data collection and preprocessing, model development and training, evaluation and validation, and finally deployment and monitoring.
Stages of the ML Pipeline
Figure: The ML Pipeline Lifecycle
| Pipeline Stage | Primary Activity | Key Systems Challenge | Typical Tools |
|---|---|---|---|
| Data Collection | Gathering raw data from sources | Scale, freshness, and labeling quality | Spark, Kafka, Airflow |
| Preprocessing | Cleaning, transforming, feature engineering | Reproducibility and skew between training/serving | dbt, Feast, tf.Transform |
| Training | Fitting model parameters to data | Efficient use of compute (GPUs/TPUs) | PyTorch, JAX, Ray Train |
| Evaluation | Measuring model quality offline | Meaningful metrics that predict production performance | MLflow, Weights & Biases |
| Deployment | Serving model predictions to users | Latency, throughput, and rollback safety | TF Serving, Triton, KServe |
| Monitoring | Tracking live model behavior | Detecting data drift and silent failures | Prometheus, Evidently, Grafana |
Table 1.1: ML pipeline stages and their associated systems challenges.
ML systems have a dual lifecycle: the software lifecycle and the model lifecycle.

Deeper Insight: Traditional software only degrades when the code changes. ML systems degrade even when the code stays the same, because the world the model learned from keeps changing. This dual lifecycle means you need two separate versioning strategies, two deployment pipelines, and two monitoring systems: one for the code and one for the model.

Think of it like a navigation app: the software version controls the interface and the routing algorithm, but the map data has its own update cycle. If the maps go stale, the app gives bad directions even though the code is perfect. ML models are the "map"; they must be refreshed as reality shifts.
Unlike traditional software, ML systems have a dual lifecycle: the software lifecycle and the model lifecycle. The software components follow conventional engineering practices, but the model must be continuously retrained and updated as data distributions shift over time. This creates unique challenges around versioning, reproducibility, and rollback strategies.
A model trained on last year's data may silently degrade when user behavior changes. Unlike a software bug that produces an error, a stale model continues to produce outputs — they are simply wrong. Continuous monitoring is not optional; it is a core requirement of any production ML system.
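One way to make such monitoring concrete is a numeric drift score. The sketch below implements the population stability index (PSI), a common drift heuristic, in plain NumPy; the 0.1/0.25 thresholds in the comment are conventional rules of thumb, not values from this chapter, and the Gaussian data is purely illustrative.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g., training data) and a live
    sample of one feature. Common rule of thumb: <0.1 stable,
    0.1-0.25 moderate shift, >0.25 significant shift."""
    # Bin edges come from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
live_same = rng.normal(0.0, 1.0, 10_000)      # no drift
live_shifted = rng.normal(0.7, 1.0, 10_000)   # mean has drifted

print(population_stability_index(train_feature, live_same))     # small
print(population_stability_index(train_feature, live_shifted))  # large
```

A score like this is cheap enough to compute per feature on every monitoring window, turning "silent" distribution shift into an explicit alert.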
What makes data distribution shift particularly dangerous in ML systems?
Each stage of the lifecycle presents distinct systems challenges. Data collection requires robust pipelines that handle scale and quality. Training demands efficient use of compute resources. Deployment must meet strict latency and throughput requirements. Monitoring must detect when model performance degrades. Understanding these stages holistically is the foundation of ML systems engineering.
Feedback Loops and Iteration
The iterative nature of ML development means that teams frequently cycle back through earlier stages. A model that underperforms in production may reveal data quality issues that require revisiting the collection pipeline. This feedback loop is a defining characteristic of ML systems and must be supported by the underlying infrastructure.
A fraud detection team deploys a new model and monitors its precision in production. After two weeks, precision drops from 94% to 87%. Investigation reveals that a partner merchant changed their transaction format, introducing null values the preprocessing pipeline did not handle. The fix requires changes to the data pipeline — not the model — demonstrating why the lifecycle is a loop, not a line.
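A defensive preprocessing step can turn this kind of silent format change into a visible signal. The sketch below is illustrative (the field names and default values are invented): it fills missing or null fields with explicit defaults and counts how often it had to, so a sudden spike in fallbacks can trigger a pipeline alert.

```python
def preprocess(record, feature_defaults):
    """Coerce one raw transaction record into model features.

    Missing or null fields fall back to explicit defaults and are
    counted, so monitoring can alert on a spike in fallback rate."""
    features = {}
    fallbacks = 0
    for name, default in feature_defaults.items():
        value = record.get(name)
        if value is None:
            value = default
            fallbacks += 1
        features[name] = value
    return features, fallbacks

# Illustrative schema: field names and defaults are invented.
defaults = {"amount": 0.0, "merchant_id": "unknown", "hour": 12}
clean = {"amount": 42.5, "merchant_id": "m_93", "hour": 3}
broken = {"amount": None, "hour": 7}  # new partner format drops merchant_id

feats, n_fallbacks = preprocess(broken, defaults)
print(feats, n_fallbacks)  # two fields fell back to defaults
```

The key design choice is that the pipeline degrades loudly rather than silently: the model still gets well-formed inputs, but the fallback counter exposes the upstream change.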
L = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2

The choice of loss function (Equation 1.1) is not purely a modeling decision. MSE penalizes large errors quadratically, which affects training time (more epochs to converge on outlier-heavy data), hardware requirements (gradient magnitudes influence numerical precision needs), and monitoring thresholds (acceptable loss values depend on the function's scale).
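Equation 1.1, the mean squared error, translates directly into code. A minimal NumPy version, with toy numbers chosen only to show the quadratic penalty:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: L = (1/N) * sum((y_i - yhat_i)^2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 4) / 3 ~ 1.417

# Quadratic penalty: doubling a single error quadruples its loss.
print(mse([0.0], [1.0]), mse([0.0], [2.0]))  # 1.0 vs 4.0
```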
ML Lifecycle
The complete sequence of stages a machine learning model passes through, from problem definition and data collection through training, deployment, monitoring, and eventual retirement.
03 Embedded ML and TinyML
Embedded machine learning brings inference capabilities directly to edge devices such as smartphones, wearables, and IoT sensors. Rather than sending data to the cloud for processing, embedded ML enables real-time, on-device decision-making with lower latency, improved privacy, and reduced bandwidth costs.
Edge Computing
A computing paradigm where data processing occurs near the source of data — on or close to the device — rather than in a centralized cloud data center. For ML, this means running inference locally on phones, sensors, or gateways to reduce latency and network dependency.
Figure: Training vs. Inference Data Flow
TinyML: ML at the Extreme Edge
TinyML takes this concept to the extreme, targeting microcontrollers with as little as 256 KB of memory and operating on microwatts of power. Despite these severe resource constraints, TinyML enables always-on sensing applications like keyword detection, gesture recognition, and anomaly detection on battery-powered devices.
TinyML
A subfield of embedded ML focused on running machine learning models on microcontrollers with extremely limited memory (often under 1 MB) and power budgets (milliwatts or less). TinyML enables always-on intelligent sensing at the extreme edge of the computing spectrum.
| Resource | Cloud ML | Mobile ML | TinyML |
|---|---|---|---|
| Memory | 100s of GB | 4–12 GB | 256 KB – 1 MB |
| Compute | GPU/TPU clusters | Mobile GPU/NPU | Cortex-M MCU |
| Power | 200–300 W per GPU | 3–5 W | 1–100 mW |
| Latency budget | 100s of ms (network) | 10–50 ms | <1 ms (on-chip) |
| Model size | Billions of parameters | Millions of parameters | Thousands of parameters |
| Example | GPT-4, Gemini | On-device Siri, Google Lens | Keyword spotting, wake word |
Table 1.2: Resource comparison across ML deployment targets.
The systems challenges for embedded and TinyML are fundamentally different from cloud-based ML. Engineers must consider memory footprints measured in kilobytes, inference latency in microseconds, and energy budgets in millijoules. Model optimization techniques like quantization, pruning, and knowledge distillation become essential rather than optional.
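A back-of-envelope footprint calculation makes these constraints tangible. The sketch below counts weight storage only (activations and runtime buffers add more); the 50,000-parameter keyword-spotting model is an illustrative figure, not a specific model from this chapter.

```python
def model_footprint_kb(num_params, bytes_per_param):
    """Back-of-envelope weight footprint in KB (weights only;
    activations and runtime buffers are extra)."""
    return num_params * bytes_per_param / 1024

# Illustrative 50k-parameter keyword-spotting model:
fp32_kb = model_footprint_kb(50_000, 4)  # float32 weights
int8_kb = model_footprint_kb(50_000, 1)  # int8 weights

print(f"float32: {fp32_kb:.0f} KB")  # ~195 KB: tight on a 256 KB MCU
print(f"int8:    {int8_kb:.0f} KB")  # ~49 KB: fits with room for buffers
```

This is exactly the kind of arithmetic that should precede architecture selection: on a 256 KB microcontroller, the float32 version leaves almost no headroom before a single activation buffer is allocated.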
Key Optimization Techniques
- Quantization — Reducing numerical precision (e.g., 32-bit float to 8-bit integer) to shrink model size and accelerate computation.
- Pruning — Removing weights or neurons that contribute little to model output, yielding sparser and faster models.
- Knowledge distillation — Training a small "student" model to mimic the behavior of a larger "teacher" model.
- Neural architecture search (NAS) — Automatically discovering efficient architectures tailored to specific hardware constraints.
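The first of these techniques can be sketched in a few lines. Below is a simplified symmetric post-training quantization of a weight tensor to int8; real toolchains add per-channel scales, calibration data, and zero points, all of which this sketch omits.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization of a float32 tensor to int8.

    Returns the int8 tensor plus the scale needed to dequantize."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
w = rng.normal(0, 0.1, size=(256, 128)).astype(np.float32)

q, scale = quantize_int8(w)
w_restored = dequantize(q, scale)

size_fp32 = w.nbytes  # 4 bytes per weight
size_int8 = q.nbytes  # 1 byte per weight (per-tensor scale is negligible)
max_err = float(np.max(np.abs(w - w_restored)))
print(f"{size_fp32} -> {size_int8} bytes, max abs error {max_err:.5f}")
```

The 4x size reduction is exact; what varies by model is how much the bounded rounding error (at most half the scale per weight) costs in accuracy, which is why calibration and evaluation remain essential.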
The growth of TinyML has created a new design paradigm where hardware constraints directly influence model architecture choices. Rather than designing the best possible model and then compressing it, practitioners increasingly co-design models and hardware together to achieve optimal efficiency within strict resource budgets. Always ask: "What hardware will this model run on?" before choosing an architecture.
A wildlife conservation project uses solar-powered acoustic sensors in the Amazon rainforest to detect illegal chainsaw sounds. Each sensor runs a TinyML keyword-detection model on an ARM Cortex-M4 microcontroller. Because the model runs locally, the sensor only activates its satellite radio (the most power-hungry component) when a threat is detected — extending battery life from days to months.
Embedded ML
The practice of running machine learning models directly on edge devices such as smartphones and IoT sensors, enabling real-time inference without cloud connectivity.
TinyML
A subfield of embedded ML focused on running machine learning models on microcontrollers with extremely limited memory (often under 1 MB) and power budgets (milliwatts or less).
04 Systems Thinking for ML
Systems thinking is the practice of understanding how components interact within a larger whole, rather than analyzing each component in isolation. For ML systems, this means considering how data pipelines, model architectures, hardware platforms, and deployment infrastructure influence each other.
Systems Thinking
An analytical approach that focuses on how the components of a system interrelate and work together over time, rather than examining each component in isolation. In ML, it means reasoning about the interactions among data, models, hardware, and infrastructure as a unified whole.
The Whole Is Greater Than the Sum of Its Parts
A key principle of systems thinking is that optimizing individual components does not necessarily optimize the overall system. A model with 99% accuracy is worthless if the serving infrastructure cannot meet latency requirements. Similarly, the fastest inference engine provides no value if the data pipeline cannot deliver clean inputs reliably.
Teams that optimize components in isolation often fall into the local optima trap. For example, a research team spends months improving model accuracy from 96% to 98%, only to discover that the 2x larger model cannot be served within the product's 50 ms latency budget. A systems-aware team would have identified this constraint before investing the effort.
Trade-off Analysis
Trade-off analysis is central to ML systems thinking. Every design decision involves balancing competing objectives. Understanding these trade-offs requires both theoretical knowledge and practical engineering experience.
Trade-off Analysis
The systematic evaluation of competing design objectives — such as accuracy vs. latency, model size vs. quality, or training cost vs. generalization — to find solutions that best satisfy overall system requirements within given constraints.
| Trade-off | Option A | Option B | Typical Resolution |
|---|---|---|---|
| Accuracy vs. Latency | Larger model, higher accuracy | Smaller model, faster inference | Distill large model into a fast student model |
| Model Size vs. Quality | Full-precision weights | Quantized / pruned weights | Post-training quantization with calibration |
| Training Cost vs. Generalization | Train longer on more data | Train quickly on less data | Progressive training with early stopping |
| Freshness vs. Stability | Retrain hourly | Retrain weekly | Canary deployments with rollback |
| Privacy vs. Utility | Keep all user data | Anonymize / federate | Differential privacy or on-device learning |
Table 1.3: Common trade-offs in ML systems design.
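The distillation entry in the first row can be made concrete. The sketch below computes the soft-target part of a knowledge-distillation loss: cross-entropy between temperature-softened teacher and student distributions. The temperature value and the logits are illustrative, and a full distillation objective would also mix in the hard-label loss.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Cross-entropy of the student against the softened teacher
    distribution (the soft-target term of knowledge distillation)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return float(-np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean())

teacher = np.array([[8.0, 2.0, 1.0]])        # confident teacher
good_student = np.array([[6.0, 1.5, 0.8]])   # mimics the teacher's ranking
bad_student = np.array([[1.0, 5.0, 2.0]])    # disagrees with the teacher

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))  # higher loss
```

The temperature is the interesting systems knob: it exposes the teacher's relative confidence across wrong classes, which is precisely the "dark knowledge" a small student can learn from.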
Feedback Loops and Emergent Behavior
Systems thinking also emphasizes feedback loops and emergent behaviors. In ML systems, feedback loops can be particularly dangerous: a recommendation system that biases toward popular content creates a reinforcing loop that reduces content diversity over time. Identifying and managing these dynamics is a critical skill for ML systems engineers.
A music streaming service uses a model trained on listening history to recommend songs. Popular songs get recommended more, increasing their play counts, which makes the model recommend them even more. Over several months, the "long tail" of niche artists receives virtually zero recommendations — not because users dislike them, but because the feedback loop starved them of exposure. Breaking this loop requires deliberate exploration strategies such as epsilon-greedy or Thompson sampling.
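A minimal epsilon-greedy policy shows how forced exploration keeps the long tail alive; the item names, scores, and epsilon value below are invented for illustration.

```python
import random

def epsilon_greedy_recommend(scores, epsilon=0.1, rng=random):
    """With probability epsilon, recommend a uniformly random item
    (exploration); otherwise recommend the top-scored item
    (exploitation). Every item keeps a floor of exposure."""
    items = list(scores)
    if rng.random() < epsilon:
        return rng.choice(items)
    return max(items, key=scores.get)

random.seed(0)
scores = {"hit_song": 0.95, "niche_a": 0.40, "niche_b": 0.35}
picks = [epsilon_greedy_recommend(scores, epsilon=0.2) for _ in range(10_000)]
share_niche = sum(p != "hit_song" for p in picks) / len(picks)
print(f"niche share: {share_niche:.3f}")  # about epsilon * 2/3 here
```

Pure exploitation would give the niche items exactly zero exposure, starving the model of the fresh feedback it needs; epsilon-greedy buys that feedback at a small, controllable cost in short-term engagement.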
The most dangerous feedback loops in ML systems are the ones you do not know exist. Instrument everything, question every assumption, and remember that your model is changing the distribution it was trained on.
Systems Thinking
An analytical approach that focuses on how components of a system interrelate and work together over time, rather than examining each component in isolation.
Trade-off Analysis
The systematic evaluation of competing design objectives (such as accuracy vs. latency) to find solutions that best satisfy overall system requirements.
05 Course Roadmap and Learning Path
This course is structured around six major themes that mirror the lifecycle of building production ML systems. Each theme builds on the previous one, progressing from foundational understanding to advanced deployment and governance.
Part I: Foundations
The foundations section establishes the theoretical and architectural knowledge needed to understand deep learning from a systems perspective. You will study how neural networks map to hardware, how memory hierarchies affect training throughput, and why computational graphs are the lingua franca of modern ML frameworks.
Part II: Design Principles
The design principles section covers the practical methodology of ML development, including workflow management, data engineering, framework selection, and training strategies. These chapters equip learners with the tools and techniques used by practicing ML engineers.
Part III: Performance Engineering
Performance engineering forms the third major theme, addressing how to make ML systems fast, efficient, and cost-effective. This includes model optimization, hardware acceleration, and systematic benchmarking.
Part IV: Deployment and Operations
The deployment section covers how to reliably operate ML systems in production, including on-device deployment, security, and robustness.
Part V: Trustworthy AI
Trustworthy AI covers the ethical, social, and environmental dimensions of ML systems. These chapters address fairness, accountability, transparency, and the growing environmental cost of large-scale model training.
Part VI: Frontiers
The frontiers section examines emerging trends and future directions that will shape the field in coming years. Throughout all sections, the emphasis remains on the systems perspective that connects theory to practice.
| Part | Theme | Core Question |
|---|---|---|
| I | Foundations | How do ML models map to hardware and software abstractions? |
| II | Design Principles | How do we manage data, frameworks, and training at scale? |
| III | Performance Engineering | How do we make ML systems fast and efficient? |
| IV | Deployment & Operations | How do we serve models reliably in production? |
| V | Trustworthy AI | How do we build ML systems that are fair, safe, and sustainable? |
| VI | Frontiers | What comes next in ML systems research and practice? |
Table 1.4: Course structure at a glance.
Each chapter includes interactive visualizations, key concept definitions, and a glossary. Start with the sections most relevant to your current work, but return to the foundations chapters if you encounter unfamiliar systems concepts. The most effective way to learn ML systems engineering is by combining reading with hands-on experimentation.
- Read the chapter narrative and study the diagrams.
- Review the key concepts and definitions at the end of each section.
- Explore the interactive visualizations to build intuition.
- Attempt the review questions before checking the glossary.
- Revisit earlier chapters as later topics reveal deeper connections.
Key Takeaways
1. ML systems engineering goes far beyond model development, encompassing data pipelines, infrastructure, deployment, and monitoring.
2. Systems thinking requires balancing competing objectives like accuracy, latency, cost, and energy consumption.
3. The ML lifecycle is inherently iterative, with feedback loops between deployment monitoring and earlier development stages.
4. Embedded ML and TinyML bring unique systems challenges around extreme resource constraints.
5. Understanding trade-offs is the most important skill for an ML systems engineer.