
Introduction

Sets the stage for ML systems engineering. Explores why systems thinking matters for building reliable, scalable machine learning solutions.

Keywords: systems thinking, ML lifecycle, embedded ML, TinyML, course overview
Learning Objectives
  • Explain why ML systems engineering extends far beyond model development
  • Identify the key components of a production ML system and their interdependencies
  • Describe the complete ML lifecycle from problem formulation to model retirement
  • Analyze the trade-offs between accuracy, latency, throughput, and cost in ML systems
  • Articulate the "hidden technical debt" concept and its implications for system design

01 Why ML Systems Matter

Machine learning has moved far beyond research papers and Jupyter notebooks. Today, ML is embedded in billions of devices, from cloud data centers processing petabytes of data to microcontrollers running on milliwatts of power. This shift demands a systems-level understanding that bridges the gap between algorithmic innovation and real-world deployment.

Everyone wants to do the model work, not the plumbing. But it is the plumbing that makes or breaks a real-world ML system.

D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS 2015
Definition

ML Systems Engineering

The discipline of designing, building, and maintaining the complete infrastructure required to develop, deploy, and operate machine learning models in production environments. It encompasses data pipelines, training infrastructure, serving systems, and monitoring — not just the model itself.

The field of ML systems engineering recognizes that building a good model is only a small fraction of the overall challenge. Google famously illustrated this with the "hidden technical debt" paper, showing that ML code represents a tiny portion of a production ML system. The surrounding infrastructure for data collection, feature extraction, monitoring, and serving dwarfs the model itself.

Deeper Insight: This means that for every line of model code you write, there are roughly 19 lines of supporting infrastructure: data validation, feature stores, configuration management, monitoring dashboards, and serving pipelines. Teams that focus exclusively on model accuracy while neglecting this 95% are building on sand.

Think of it like... an iceberg: the visible tip above water is your model, but the massive structure below the surface (data pipelines, monitoring, serving infrastructure, and configuration) is what keeps the whole thing afloat and stable.

5%
ML code in a production system
95%
Infrastructure, pipelines, and config
The 5% Rule

Google's landmark "Hidden Technical Debt in Machine Learning Systems" paper (Sculley et al., 2015) found that ML model code accounts for only about 5% of the total code in a production ML system. The remaining 95% consists of data pipelines, configuration, serving infrastructure, monitoring, and feature engineering.

Quick Check

According to Google's "Hidden Technical Debt" paper, approximately what percentage of a production ML system's code is the ML model itself?

Hint: Think about all the infrastructure surrounding the model: data pipelines, monitoring, serving, configuration.

Answer: The ML model code accounts for only about 5% of the total codebase. The remaining 95% consists of data pipelines, configuration, serving infrastructure, monitoring, and feature engineering.
[Figure: A production ML system's components: Data Collection (pipelines, validation, labeling), Feature Engineering (extraction, stores, transforms), Configuration (hyperparameters, versioning, flags), Serving Infrastructure (APIs, load balancing, caching), Resource Management (GPU scheduling, auto-scaling), Monitoring (drift detection, alerting), Testing & Debugging (unit, integration, model tests), and Process Management (CI/CD, orchestration, MLOps), all surrounding a small ML Code box (~5% of the system). Adapted from Sculley et al., "Hidden Technical Debt in ML Systems" (NeurIPS 2015).]

Figure 1.1: Components of a production ML system. The ML model code (center, dark) is dwarfed by the surrounding infrastructure required for reliable operation.

Systems thinking provides a framework for reasoning about these complex, interconnected components. Rather than optimizing a single metric in isolation, ML systems engineers must balance accuracy, latency, throughput, energy consumption, cost, and reliability simultaneously. This holistic perspective is what separates a research prototype from a production system serving millions of users.

  • Accuracy — Does the model produce correct predictions?
  • Latency — How quickly does the system respond to a single request?
  • Throughput — How many requests can the system process per second?
  • Energy consumption — What is the power cost of training and inference?
  • Monetary cost — What are the cloud compute and storage expenses?
  • Reliability — Does the system operate correctly under failures and load?
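These six objectives are all measurable, and measuring them is the first step toward trading them off deliberately. Below is a minimal latency/throughput benchmarking sketch; it is illustrative only, and `predict` is a hypothetical stand-in for any real model call.

```python
import time

def predict(x):
    """Hypothetical model call; simulate ~2 ms of inference work."""
    time.sleep(0.002)
    return x * 2.0

def benchmark(n_requests=200):
    latencies_ms = []
    start = time.perf_counter()
    for i in range(n_requests):
        t0 = time.perf_counter()
        predict(float(i))
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start

    latencies_ms.sort()
    p50 = latencies_ms[len(latencies_ms) // 2]              # median latency
    p99 = latencies_ms[int(len(latencies_ms) * 0.99) - 1]   # tail latency
    throughput = n_requests / elapsed                        # requests per second
    print(f"p50={p50:.2f} ms  p99={p99:.2f} ms  throughput={throughput:.0f} req/s")

benchmark()
```

Note that production latency budgets are typically written against tail percentiles (p99) rather than the mean, because a slow one-in-a-hundred response still affects one in a hundred users.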
Think in Systems, Not Models

When evaluating any ML project, ask: "What happens to the prediction after the model produces it?" If you cannot trace the full path from raw data to user-facing outcome, you are thinking about a model — not a system.

ML Systems Engineering

The discipline of designing, building, and maintaining the complete infrastructure required to develop, deploy, and operate machine learning models in production environments.

Hidden Technical Debt

The accumulating maintenance costs in ML systems that arise from complex dependencies, data management challenges, and configuration issues beyond the model code itself.

02 The ML Lifecycle

The machine learning lifecycle encompasses every stage from problem formulation to model retirement. It begins with defining the problem and identifying the right data sources, moves through data collection and preprocessing, model development and training, evaluation and validation, and finally deployment and monitoring.

Stages of the ML Pipeline

Figure: The ML Pipeline Lifecycle

[Figure: Data Collection (gather and label raw data) → Data Preprocessing (clean, normalize, augment) → Feature Engineering (extract and select features) → Model Training (train and tune parameters) → Evaluation (metrics and validation) → Deployment (package and serve model) → Monitoring (track drift and performance), with a feedback loop from Monitoring back to Data Collection.]

Figure 1.2: The ML pipeline is a cycle, not a linear sequence. Feedback from monitoring in production drives improvements across every upstream stage.
Table 1.1: ML pipeline stages and their associated systems challenges.

| Pipeline Stage | Primary Activity | Key Systems Challenge | Typical Tools |
| --- | --- | --- | --- |
| Data Collection | Gathering raw data from sources | Scale, freshness, and labeling quality | Spark, Kafka, Airflow |
| Preprocessing | Cleaning, transforming, feature engineering | Reproducibility and skew between training/serving | dbt, Feast, tf.Transform |
| Training | Fitting model parameters to data | Efficient use of compute (GPUs/TPUs) | PyTorch, JAX, Ray Train |
| Evaluation | Measuring model quality offline | Meaningful metrics that predict production performance | MLflow, Weights & Biases |
| Deployment | Serving model predictions to users | Latency, throughput, and rollback safety | TF Serving, Triton, KServe |
| Monitoring | Tracking live model behavior | Detecting data drift and silent failures | Prometheus, Evidently, Grafana |
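To make the loop concrete, the sketch below expresses the pipeline stages as composable Python functions. Every name and the alert threshold are invented for illustration, and each stage is reduced to a caricature of the real work described in Table 1.1.

```python
def collect():
    """Data Collection: gather and label raw data (hardcoded here)."""
    return [{"amount": 120.0, "label": 0}, {"amount": 980.0, "label": 1}]

def preprocess(rows):
    """Preprocessing: normalize amounts into [0, 1]."""
    max_amt = max(r["amount"] for r in rows)
    return [{**r, "amount": r["amount"] / max_amt} for r in rows]

def train(rows):
    """Training: fit a trivial threshold 'model' on positive examples."""
    positives = [r["amount"] for r in rows if r["label"] == 1]
    return {"threshold": min(positives)}

def monitor(model, live_rows):
    """Monitoring: fraction of live traffic the model flags."""
    flagged = sum(r["amount"] >= model["threshold"] for r in live_rows)
    return flagged / len(live_rows)

model = train(preprocess(collect()))
alert_rate = monitor(model, preprocess(collect()))
if alert_rate > 0.5:  # feedback loop: unusual behavior triggers another iteration
    model = train(preprocess(collect()))
```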

Deeper Insight: Traditional software only degrades when the code changes. ML systems degrade even when the code stays the same, because the world the model learned from keeps changing. This dual lifecycle means you need two separate versioning strategies, two deployment pipelines, and two monitoring systems: one for code and one for the model.

Think of it like... a navigation app where the software version controls the interface and routing algorithm, but the map data has its own update cycle. If the maps go stale, the app gives bad directions even though the code is perfect. ML models are the "map": they must be refreshed as reality shifts.

Unlike traditional software, ML systems have a dual lifecycle: the software lifecycle and the model lifecycle. The software components follow conventional engineering practices, but the model must be continuously retrained and updated as data distributions shift over time. This creates unique challenges around versioning, reproducibility, and rollback strategies.
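One practical response to the dual lifecycle, sketched below with illustrative field names, is to record every model artifact together with the exact code commit and data snapshot that produced it, so that the code and the model can each be versioned and rolled back on their own axis.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    model_id: str        # logical name, e.g. "fraud-detector"
    model_hash: str      # content hash of the serialized weights
    code_version: str    # git commit of the training code
    data_snapshot: str   # immutable ID of the training dataset
    metrics: dict = field(default_factory=dict)  # offline eval results

# Rolling back the model does not require rolling back the code, and
# vice versa; each lifecycle gets its own version axis.
v1 = ModelVersion("fraud-detector", "sha256:ab12...", "git:9f3c1e2",
                  "snapshot:2024-01-15", {"auc": 0.91})
print(v1)
```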

Data Distribution Shift

A model trained on last year's data may silently degrade when user behavior changes. Unlike a software bug that produces an error, a stale model continues to produce outputs — they are simply wrong. Continuous monitoring is not optional; it is a core requirement of any production ML system.
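Drift monitoring can start with simple statistical tests. The sketch below (illustrative threshold and synthetic data; production systems typically lean on tooling such as the Evidently library listed in Table 1.1) compares a live feature sample against the training-time distribution with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # what the model saw
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # the world has shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative alert threshold
    print(f"Drift suspected: KS statistic={stat:.3f}, p={p_value:.2e}")
```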

Quick Check

What makes data distribution shift particularly dangerous in ML systems?

Answer: Unlike software bugs that raise errors, a stale ML model keeps generating predictions that look normal but are increasingly inaccurate. This silent degradation is why continuous monitoring is essential.

Each stage of the lifecycle presents distinct systems challenges. Data collection requires robust pipelines that handle scale and quality. Training demands efficient use of compute resources. Deployment must meet strict latency and throughput requirements. Monitoring must detect when model performance degrades. Understanding these stages holistically is the foundation of ML systems engineering.

Feedback Loops and Iteration

The iterative nature of ML development means that teams frequently cycle back through earlier stages. A model that underperforms in production may reveal data quality issues that require revisiting the collection pipeline. This feedback loop is a defining characteristic of ML systems and must be supported by the underlying infrastructure.

Iteration in Practice

A fraud detection team deploys a new model and monitors its precision in production. After two weeks, precision drops from 94% to 87%. Investigation reveals that a partner merchant changed their transaction format, introducing null values the preprocessing pipeline did not handle. The fix requires changes to the data pipeline — not the model — demonstrating why the lifecycle is a loop, not a line.
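A lightweight schema check at the pipeline boundary catches exactly this class of failure before it reaches the model. A minimal sketch, with field names invented for illustration:

```python
def validate_transaction(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    for field in ("amount", "merchant_id", "timestamp"):
        if record.get(field) is None:
            errors.append(f"missing or null field: {field}")
    if record.get("amount") is not None and record["amount"] < 0:
        errors.append("negative amount")
    return errors

# A record in the partner's new format, with an unexpected null:
bad = {"amount": None, "merchant_id": "m-42", "timestamp": 1700000000}
assert validate_transaction(bad) == ["missing or null field: amount"]
```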

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{1.1}$$
Why Loss Functions Matter at the Systems Level

The choice of loss function (Equation 1.1) is not purely a modeling decision. MSE penalizes large errors quadratically, which affects training time (more epochs to converge on outlier-heavy data), hardware requirements (gradient magnitudes influence numerical precision needs), and monitoring thresholds (acceptable loss values depend on the function's scale).
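To see the gradient point numerically: differentiating Equation 1.1 gives $\partial L / \partial \hat{y}_i = \tfrac{2}{N}(\hat{y}_i - y_i)$, so the gradient grows linearly with the error and outliers dominate the update. A small sketch with synthetic numbers:

```python
import numpy as np

y_true = np.array([1.0, 1.2, 0.9, 50.0])  # last point is an outlier
y_pred = np.array([1.1, 1.0, 1.0, 1.0])

# Gradient of MSE w.r.t. each prediction: 2 * (y_pred - y_true) / N
grad = 2 * (y_pred - y_true) / len(y_true)
print(grad)  # [  0.05  -0.1    0.05 -24.5 ]: the outlier dominates the update

# float16 tops out near 65504; large gradients compounded through a deep
# network can overflow, which is why mixed-precision training pairs reduced
# precision with loss scaling.
print(np.finfo(np.float16).max)
```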

ML Lifecycle

The complete sequence of stages a machine learning model passes through, from problem definition and data collection through training, deployment, monitoring, and eventual retirement.

03 Embedded ML and TinyML

Embedded machine learning brings inference capabilities directly to edge devices such as smartphones, wearables, and IoT sensors. Rather than sending data to the cloud for processing, embedded ML enables real-time, on-device decision-making with lower latency, improved privacy, and reduced bandwidth costs.

Definition

Edge Computing

A computing paradigm where data processing occurs near the source of data — on or close to the device — rather than in a centralized cloud data center. For ML, this means running inference locally on phones, sensors, or gateways to reduce latency and network dependency.

Figure: Training vs. Inference Data Flow

[Figure: Training pipeline (Raw Data → Preprocessing → Feature Store → Training → Model Registry) and inference pipeline (New Data → Preprocessing → Feature Store → Model Serving → Predictions). Preprocessing and the feature store are shared between the two pipelines; the registry deploys trained models into serving.]

Figure 1.3: Data flow in an edge ML system. Raw sensor data is processed locally on the device, with only aggregated results or anomalies transmitted to the cloud.

TinyML: ML at the Extreme Edge

TinyML takes this concept to the extreme, targeting microcontrollers with as little as 256 KB of memory and operating on microwatts of power. Despite these severe resource constraints, TinyML enables always-on sensing applications like keyword detection, gesture recognition, and anomaly detection on battery-powered devices.

Definition

TinyML

A subfield of embedded ML focused on running machine learning models on microcontrollers with extremely limited memory (often under 1 MB) and power budgets (milliwatts or less). TinyML enables always-on intelligent sensing at the extreme edge of the computing spectrum.

256 KB
Minimum TinyML memory budget
Table 1.2: Resource comparison across ML deployment targets.

| Resource | Cloud ML | Mobile ML | TinyML |
| --- | --- | --- | --- |
| Memory | 100s of GB | 4–12 GB | 256 KB – 1 MB |
| Compute | GPU/TPU clusters | Mobile GPU/NPU | Cortex-M MCU |
| Power | 200–300 W per GPU | 3–5 W | 1–100 mW |
| Latency budget | 100s of ms (network) | 10–50 ms | <1 ms (on-chip) |
| Model size | Billions of parameters | Millions of parameters | Thousands of parameters |
| Example | GPT-4, Gemini | On-device Siri, Google Lens | Keyword spotting, wake word |
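The memory rows in Table 1.2 follow from simple arithmetic: model size is roughly parameter count times bytes per parameter. A quick worked check, using representative numbers rather than figures from the text:

```python
def model_size_kb(n_params: int, bytes_per_param: int) -> float:
    """Approximate model size in KB, ignoring activations and runtime overhead."""
    return n_params * bytes_per_param / 1024

# A hypothetical 100k-parameter keyword-spotting model:
print(model_size_kb(100_000, 4))  # float32 -> ~390.6 KB: blows a 256 KB budget
print(model_size_kb(100_000, 1))  # int8    -> ~97.7 KB: fits with room to spare
```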

The systems challenges for embedded and TinyML are fundamentally different from cloud-based ML. Engineers must consider memory footprints measured in kilobytes, inference latency in microseconds, and energy budgets in millijoules. Model optimization techniques like quantization, pruning, and knowledge distillation become essential rather than optional.

Key Optimization Techniques

  • Quantization — Reducing numerical precision (e.g., 32-bit float to 8-bit integer) to shrink model size and accelerate computation.
  • Pruning — Removing weights or neurons that contribute little to model output, yielding sparser and faster models.
  • Knowledge distillation — Training a small "student" model to mimic the behavior of a larger "teacher" model.
  • Neural architecture search (NAS) — Automatically discovering efficient architectures tailored to specific hardware constraints.
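Of these, quantization is the easiest to demonstrate end to end. The sketch below (pure NumPy, illustrating the idea rather than any specific toolchain) maps float32 weights to int8 with a single per-tensor scale, the symmetric scheme commonly used in post-training quantization:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 plus a scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // 1024, "KB ->", q.nbytes // 1024, "KB")  # 256 KB -> 64 KB
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

The 4x size reduction is exactly the float32-to-int8 ratio; the price is a small, bounded reconstruction error that must be validated against an accuracy budget.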
Co-Design Thinking

The growth of TinyML has created a new design paradigm where hardware constraints directly influence model architecture choices. Rather than designing the best possible model and then compressing it, practitioners increasingly co-design models and hardware together to achieve optimal efficiency within strict resource budgets. Always ask: "What hardware will this model run on?" before choosing an architecture.

TinyML in the Real World

A wildlife conservation project uses solar-powered acoustic sensors in the Amazon rainforest to detect illegal chainsaw sounds. Each sensor runs a TinyML keyword-detection model on an ARM Cortex-M4 microcontroller. Because the model runs locally, the sensor only activates its satellite radio (the most power-hungry component) when a threat is detected — extending battery life from days to months.

Embedded ML

The practice of running machine learning models directly on edge devices such as smartphones and IoT sensors, enabling real-time inference without cloud connectivity.

TinyML

A subfield of embedded ML focused on running machine learning models on microcontrollers with extremely limited memory (often under 1 MB) and power budgets (milliwatts or less).

04 Systems Thinking for ML

Systems thinking is the practice of understanding how components interact within a larger whole, rather than analyzing each component in isolation. For ML systems, this means considering how data pipelines, model architectures, hardware platforms, and deployment infrastructure influence each other.

Definition

Systems Thinking

An analytical approach that focuses on how the components of a system interrelate and work together over time, rather than examining each component in isolation. In ML, it means reasoning about the interactions among data, models, hardware, and infrastructure as a unified whole.

The Whole Is Greater Than the Sum of Its Parts

A key principle of systems thinking is that optimizing individual components does not necessarily optimize the overall system. A model with 99% accuracy is worthless if the serving infrastructure cannot meet latency requirements. Similarly, the fastest inference engine provides no value if the data pipeline cannot deliver clean inputs reliably.

Local Optima Trap

Teams that optimize components in isolation often fall into the local optima trap. For example, a research team spends months improving model accuracy from 96% to 98%, only to discover that the 2x larger model cannot be served within the product's 50 ms latency budget. A systems-aware team would have identified this constraint before investing the effort.

Trade-off Analysis

Trade-off analysis is central to ML systems thinking. Every design decision involves balancing competing objectives. Understanding these trade-offs requires both theoretical knowledge and practical engineering experience.

Definition

Trade-off Analysis

The systematic evaluation of competing design objectives — such as accuracy vs. latency, model size vs. quality, or training cost vs. generalization — to find solutions that best satisfy overall system requirements within given constraints.

Table 1.3: Common trade-offs in ML systems design.

| Trade-off | Option A | Option B | Typical Resolution |
| --- | --- | --- | --- |
| Accuracy vs. Latency | Larger model, higher accuracy | Smaller model, faster inference | Distill large model into a fast student model |
| Model Size vs. Quality | Full-precision weights | Quantized / pruned weights | Post-training quantization with calibration |
| Training Cost vs. Generalization | Train longer on more data | Train quickly on less data | Progressive training with early stopping |
| Freshness vs. Stability | Retrain hourly | Retrain weekly | Canary deployments with rollback |
| Privacy vs. Utility | Keep all user data | Anonymize / federate | Differential privacy or on-device learning |
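The first row's resolution, distillation, fits in a few lines. Below is a sketch of the commonly used distillation loss, which blends hard-label cross-entropy with a temperature-softened KL term toward the teacher; written in plain NumPy for illustration, with made-up logits:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * hard-label cross-entropy + (1 - alpha) * soft KL to the teacher."""
    hard = -np.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = (p_teacher * np.log(p_teacher / p_student)).sum() * T**2
    return alpha * hard + (1 - alpha) * soft

student = np.array([2.0, 0.5, -1.0])  # small, fast model's logits
teacher = np.array([3.0, 1.5, -2.0])  # large, accurate model's logits
print(distillation_loss(student, teacher, label=0))
```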

Feedback Loops and Emergent Behavior

Systems thinking also emphasizes feedback loops and emergent behaviors. In ML systems, feedback loops can be particularly dangerous: a recommendation system that biases toward popular content creates a reinforcing loop that reduces content diversity over time. Identifying and managing these dynamics is a critical skill for ML systems engineers.

Feedback Loop in Recommendations

A music streaming service uses a model trained on listening history to recommend songs. Popular songs get recommended more, increasing their play counts, which makes the model recommend them even more. Over several months, the "long tail" of niche artists receives virtually zero recommendations — not because users dislike them, but because the feedback loop starved them of exposure. Breaking this loop requires deliberate exploration strategies such as epsilon-greedy or Thompson sampling.
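Epsilon-greedy, the simpler of the two strategies, reserves a small fraction of recommendations for random exploration so the long tail keeps getting exposure. A minimal sketch with an invented catalog:

```python
import random

def recommend(scores, epsilon=0.1):
    """Epsilon-greedy: usually exploit the model, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(list(scores))  # explore: uniform over the catalog
    return max(scores, key=scores.get)      # exploit: the model's top pick

catalog_scores = {"hit_song": 0.92, "niche_track_a": 0.31, "niche_track_b": 0.27}
picks = [recommend(catalog_scores) for _ in range(10_000)]
print(picks.count("hit_song") / len(picks))  # ~0.93; the tail gets ~7% of plays
```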

The most dangerous feedback loops in ML systems are the ones you do not know exist. Instrument everything, question every assumption, and remember that your model is changing the distribution it was trained on.

Adapted from principles in "Reliable Machine Learning" by Cathy Chen et al., O'Reilly 2022

Systems Thinking

An analytical approach that focuses on how components of a system interrelate and work together over time, rather than examining each component in isolation.

Trade-off Analysis

The systematic evaluation of competing design objectives (such as accuracy vs. latency) to find solutions that best satisfy overall system requirements.

05 Course Roadmap and Learning Path

This course is structured around six major themes that mirror the lifecycle of building production ML systems. Each theme builds on the previous one, progressing from foundational understanding to advanced deployment and governance.

Part I: Foundations

The foundations section establishes the theoretical and architectural knowledge needed to understand deep learning from a systems perspective. You will study how neural networks map to hardware, how memory hierarchies affect training throughput, and why computational graphs are the lingua franca of modern ML frameworks.

Part II: Design Principles

The design principles section covers the practical methodology of ML development, including workflow management, data engineering, framework selection, and training strategies. These chapters equip learners with the tools and techniques used by practicing ML engineers.

Part III: Performance Engineering

Performance engineering forms the third major theme, addressing how to make ML systems fast, efficient, and cost-effective. This includes model optimization, hardware acceleration, and systematic benchmarking.

Part IV: Deployment and Operations

The deployment section covers how to reliably operate ML systems in production, including on-device deployment, security, and robustness.

Part V: Trustworthy AI

Trustworthy AI covers the ethical, social, and environmental dimensions of ML systems. These chapters address fairness, accountability, transparency, and the growing environmental cost of large-scale model training.

Part VI: Frontiers

The frontiers section examines emerging trends and future directions that will shape the field in coming years. Throughout all sections, the emphasis remains on the systems perspective that connects theory to practice.

Table 1.4: Course structure at a glance.

| Part | Theme | Core Question |
| --- | --- | --- |
| I | Foundations | How do ML models map to hardware and software abstractions? |
| II | Design Principles | How do we manage data, frameworks, and training at scale? |
| III | Performance Engineering | How do we make ML systems fast and efficient? |
| IV | Deployment & Operations | How do we serve models reliably in production? |
| V | Trustworthy AI | How do we build ML systems that are fair, safe, and sustainable? |
| VI | Frontiers | What comes next in ML systems research and practice? |

How to Use This Textbook

Each chapter includes interactive visualizations, key concept definitions, and a glossary. Start with the sections most relevant to your current work, but return to the foundations chapters if you encounter unfamiliar systems concepts. The most effective way to learn ML systems engineering is by combining reading with hands-on experimentation.

  1. Read the chapter narrative and study the diagrams.
  2. Review the key concepts and definitions at the end of each section.
  3. Explore the interactive visualizations to build intuition.
  4. Attempt the review questions before checking the glossary.
  5. Revisit earlier chapters as later topics reveal deeper connections.

Key Takeaways

  1. ML systems engineering goes far beyond model development, encompassing data pipelines, infrastructure, deployment, and monitoring.
  2. Systems thinking requires balancing competing objectives like accuracy, latency, cost, and energy consumption.
  3. The ML lifecycle is inherently iterative, with feedback loops between deployment monitoring and earlier development stages.
  4. Embedded ML and TinyML bring unique systems challenges around extreme resource constraints.
  5. Understanding trade-offs is the most important skill for an ML systems engineer.
