
ML Systems

Provides an end-to-end overview of the ML stack. Covers the full pipeline from data to deployment, including components, trade-offs, and system-level concerns.

Keywords: ML pipeline, system components, trade-offs, end-to-end stack, latency vs. throughput

Learning Objectives
  • Identify and describe the core components of a machine learning system architecture
  • Explain how data, model, and serving layers interact in production ML systems
  • Compare monolithic vs. microservices architectures for ML deployment
  • Evaluate system design trade-offs for scalability, reliability, and maintainability
  • Design a high-level ML system architecture for a given application scenario

01 The ML Pipeline End-to-End

A production ML pipeline is far more than a model. It is a complex system of interconnected components that together transform raw data into actionable predictions. The pipeline typically begins with data ingestion, proceeds through feature engineering and model training, and culminates in serving predictions to end users or downstream systems.

Figure 2.1: ML system architecture layers. From hardware through infrastructure and platform to application, each layer provides services to the layer above.
Definition

ML Pipeline

An automated, end-to-end sequence of data processing, feature engineering, model training, evaluation, and serving steps that transforms raw data into predictions in a reproducible and scalable manner.

Engineering Challenges at Each Stage

Each stage of the pipeline introduces its own set of engineering challenges. Data ingestion must handle varying formats, scales, and velocities. Feature engineering must be consistent between training and serving to avoid training-serving skew. Model training must efficiently utilize compute resources while maintaining reproducibility.

Table 2.1: Key engineering challenges and failure modes at each pipeline stage.

| Pipeline Stage | Primary Challenge | Failure Mode |
| --- | --- | --- |
| Data Ingestion | Handling diverse formats, scales, and velocities | Schema changes or source outages corrupt downstream data |
| Feature Engineering | Consistency between training and serving | Training-serving skew silently degrades predictions |
| Model Training | Efficient compute utilization and reproducibility | Non-determinism or resource exhaustion causes wasted runs |
| Model Evaluation | Meaningful offline metrics that predict online performance | Overfitting to the test set or choosing the wrong metric leads to bad deployments |
| Model Serving | Low latency, high throughput, graceful degradation | Latency spikes or crashes under load impact user experience |

Silent Failure Propagation

The end-to-end perspective is crucial because failures at any stage propagate through the entire system. A subtle data quality issue in ingestion can cause model degradation weeks later. A mismatch between training and serving feature computation can silently corrupt predictions. Unlike software bugs that produce errors, ML pipeline failures often produce plausible but wrong outputs.

Systems engineers must design pipelines with observability and validation at every stage. This means implementing data quality checks between stages, tracking data lineage, and monitoring model performance metrics continuously.
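
To make this concrete, here is a minimal sketch of an inter-stage data quality check of the kind described above; the expected schema, column names, and thresholds are purely illustrative assumptions, not part of any particular pipeline.

python
# Minimal inter-stage data quality check (hypothetical schema and thresholds).
import pandas as pd

EXPECTED_SCHEMA = {"user_id": "int64", "age": "float64", "country": "object"}
MAX_NULL_RATE = 0.01  # illustrative threshold

def validate_batch(df: pd.DataFrame) -> None:
    """Fail fast before a bad batch propagates into feature engineering and training."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"Column {col!r} has dtype {df[col].dtype}, expected {dtype}")
    null_rate = df["age"].isna().mean()
    if null_rate > MAX_NULL_RATE:
        raise ValueError(f"Null rate for 'age' is {null_rate:.2%}, above {MAX_NULL_RATE:.0%}")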

Pipeline Orchestration with DAGs

Definition

Directed Acyclic Graph (DAG)

A graph structure where edges have direction and no cycles exist, used in ML pipelines to represent task dependencies. Each node is a processing step, and edges define the order of execution.

Modern ML pipelines are often implemented as directed acyclic graphs (DAGs) of dependent tasks, orchestrated by tools like Apache Airflow, Kubeflow Pipelines, or Prefect. These orchestration systems provide scheduling, dependency management, retry logic, and monitoring capabilities that are essential for reliable operation.

Choosing an Orchestrator

Apache Airflow excels at general-purpose workflow scheduling and has the largest community. Kubeflow Pipelines is purpose-built for ML on Kubernetes with native support for GPU workloads. Prefect offers a modern Python-native API with strong local development experience. Choose based on your infrastructure and team expertise.

python
# Simplified Airflow DAG for an ML pipeline
# (ingest_fn, feature_fn, train_fn, eval_fn are plain Python functions defined elsewhere)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG('ml_pipeline', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily') as dag:
    ingest = PythonOperator(task_id='ingest_data', python_callable=ingest_fn)
    features = PythonOperator(task_id='compute_features', python_callable=feature_fn)
    train = PythonOperator(task_id='train_model', python_callable=train_fn)
    evaluate = PythonOperator(task_id='evaluate', python_callable=eval_fn)

    ingest >> features >> train >> evaluate
Quick Check

Which component of the ML stack is most commonly the bottleneck in production systems?

Data pipelines are the most common bottleneck in production ML systems. Issues like data quality, schema changes, ingestion failures, and feature computation inconsistencies account for the majority of production incidents. While model training and serving are computationally intensive, the data layer is where most real-world failures originate.


Training-Serving Skew

A discrepancy between the data processing or feature computation applied during training versus serving, which can silently degrade model quality.

02 System Components and Architecture

ML systems are composed of several key components that must work together seamlessly. The data layer handles storage, retrieval, and transformation of training data and features. The compute layer provides the processing power for training and inference. The model layer encapsulates the trained parameters and inference logic. The serving layer exposes model predictions to applications and users.

Table 2.2: The four layers of an ML system architecture.

| Layer | Responsibility | Key Technologies |
| --- | --- | --- |
| Data Layer | Storage, retrieval, and transformation of training data and features | Feast, Tecton, Delta Lake, Apache Hudi |
| Compute Layer | Processing power for training and inference workloads | Kubernetes, Ray, NVIDIA Triton, AWS SageMaker |
| Model Layer | Trained parameters, inference logic, model versioning | MLflow, TorchServe, ONNX Runtime |
| Serving Layer | Exposes predictions to applications and users | REST APIs, gRPC, KServe, Seldon Core |

The Data Layer and Feature Stores

Definition

Feature Store

A centralized platform for storing, managing, and serving machine learning features. Feature stores ensure that the same feature computation is used during both training and serving, solving the critical problem of training-serving skew.

The data layer typically includes a feature store that provides a centralized repository for computed features. Feature stores solve the critical problem of feature reuse and consistency, ensuring that the same feature computation is used during both training and serving. Popular feature stores include Feast, Tecton, and built-in solutions from cloud providers.

Why Feature Stores Matter

Without a feature store, teams often reimplement the same feature computation in different languages or frameworks for training (Python/Spark) and serving (Java/Go). Even small numerical differences between implementations can cause training-serving skew that silently degrades model performance. A feature store enforces a single source of truth.
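
The single-source-of-truth idea can be illustrated without any particular feature-store product. In the hypothetical sketch below, one feature function is imported by both the training job and the serving path, so the computation cannot diverge; the feature logic and column names are made up.

python
# Hypothetical example: one feature function shared by offline and online paths.
import pandas as pd

def compute_user_features(events: pd.DataFrame) -> pd.DataFrame:
    """Single definition of the feature logic, used for both training and serving."""
    feats = events.groupby("user_id").agg(
        purchase_count_30d=("purchase_amount", "count"),
        avg_purchase_30d=("purchase_amount", "mean"),
    )
    return feats.fillna(0.0)

# Offline: build the training set from historical events.
# train_features = compute_user_features(historical_events)

# Online: apply the exact same function to recent events at request time.
# online_features = compute_user_features(recent_events_for_user)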

The Compute Layer

The compute layer must be flexible enough to handle diverse workloads. Training workloads are typically batch-oriented and GPU-intensive, while serving workloads are latency-sensitive and must handle variable request rates. Many organizations use Kubernetes to manage these heterogeneous compute requirements with a unified orchestration platform.

Right-Sizing Compute

Training and serving have fundamentally different compute profiles. Training benefits from large GPU instances running for hours, while serving needs smaller instances that can scale horizontally in seconds. Use separate compute pools with autoscaling policies tuned to each workload's characteristics.
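
As a rough illustration of right-sizing the serving pool, the back-of-the-envelope estimate below uses entirely made-up numbers for peak traffic, per-replica throughput, and target utilization.

python
# Illustrative serving-pool sizing; all numbers are assumptions, not measurements.
import math

peak_qps = 2_000         # expected peak request rate
per_replica_qps = 150    # measured throughput of one serving replica
headroom = 0.6           # target utilization so bursts don't blow up p99 latency

replicas_needed = math.ceil(peak_qps / (per_replica_qps * headroom))
print(f"Provision at least {replicas_needed} serving replicas")  # -> 23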

The Serving Layer

The serving layer is the interface between the ML system and the outside world. It must provide low-latency predictions, handle traffic spikes gracefully, support A/B testing and canary deployments, and enable easy rollback when issues are detected.

  • REST APIs — Simple HTTP endpoints, widely supported, human-readable JSON payloads
  • gRPC — Binary protocol with strong typing, lower latency than REST, ideal for microservice communication
  • Streaming pipelines — Kafka or Pulsar-based inference for high-throughput, event-driven applications
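
A minimal REST serving sketch using FastAPI is shown below; the dummy model and request schema are placeholders standing in for a real trained model and feature contract.

python
# Minimal REST serving sketch with FastAPI (model and schema are placeholders).
from fastapi import FastAPI
from pydantic import BaseModel

class DummyModel:
    """Stand-in for a real trained model loaded at startup (e.g. via joblib)."""
    def predict(self, rows):
        return [sum(r) for r in rows]

model = DummyModel()
app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    score = float(model.predict([req.features])[0])
    return {"score": score}

# Run with: uvicorn serve:app --port 8080   (assuming this file is saved as serve.py)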
Canary Deployment Pattern

A team deploys a new recommendation model by routing 5% of traffic to the new version while 95% continues to the existing model. They monitor key metrics (click-through rate, latency p99, error rate) for 24 hours. If the new model performs well, traffic is gradually shifted. If any metric degrades, traffic is instantly routed back to the old model. This pattern limits blast radius while enabling safe iteration.
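
The routing logic behind such a rollout can be sketched in a few lines. The version names and the 5% split below are hypothetical, and real deployments usually delegate this to a service mesh or serving platform; the point is that the split is deterministic per user.

python
# Hypothetical canary routing: deterministically send ~5% of users to the new model.
import hashlib

CANARY_FRACTION = 0.05

def route(user_id: str) -> str:
    """Hash-based split so a given user consistently hits the same model version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "model_v2_canary" if bucket < CANARY_FRACTION * 10_000 else "model_v1_stable"

print(route("user_42"))  # same answer on every call for the same user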


Model Serving

The infrastructure and processes that deliver trained model predictions to applications, typically through APIs or streaming interfaces.

03 Trade-offs in ML System Design

Every ML system design involves navigating a complex landscape of trade-offs. The most fundamental is the accuracy-latency trade-off: larger, more complex models generally achieve higher accuracy but require more computation and thus higher latency. Finding the right balance depends on the specific application requirements.

Definition

Accuracy-Latency Trade-off

The fundamental tension between model quality and inference speed. Larger, more complex models generally yield better predictions but require more computation time. This trade-off is the most common design constraint in production ML systems.

Latency vs. Throughput

The latency-throughput trade-off is another critical consideration. Optimizing for low latency (fast individual predictions) often conflicts with maximizing throughput (predictions per second). Batching requests improves throughput by amortizing overhead but increases latency for individual requests. Dynamic batching strategies attempt to balance these competing objectives.

Dynamic Batching Explained

Dynamic batching collects incoming requests into a batch until either a maximum batch size or a maximum wait time is reached, whichever comes first. This bounds the worst-case latency while still capturing throughput benefits. NVIDIA Triton Inference Server and TensorFlow Serving both support dynamic batching out of the box.

\text{Throughput} = \frac{\text{Batch Size}}{\text{Latency per Batch}} = \frac{B}{L_\text{batch}}
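
A simplified, single-threaded sketch of this policy is shown below: flush a batch when it is full or when the oldest request has waited past the deadline, whichever comes first. Production servers implement this with dedicated scheduler threads, but the control flow is the same.

python
# Simplified dynamic batching loop (illustrative, single-threaded).
import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 32
MAX_WAIT_MS = 10

def batching_loop(request_queue: Queue, run_model):
    while True:
        batch = [request_queue.get()]          # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                          # oldest request has waited long enough
            try:
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        run_model(batch)                       # one forward pass amortizes per-request overhead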
Table 2.3: Key trade-offs in ML system design and common resolution strategies.

| Trade-off | Optimize For A | Optimize For B | Resolution Strategy |
| --- | --- | --- | --- |
| Accuracy vs. Latency | Large model, high accuracy | Small model, fast inference | Knowledge distillation, quantization |
| Latency vs. Throughput | Single-request, low latency | Batched, high throughput | Dynamic batching with timeout |
| Quality vs. Cost | Best model, any compute cost | Cheapest infrastructure | Spot instances, model compression |
| Accuracy vs. Maintainability | Complex ensemble | Single simple model | Distill ensemble into one model |

Cost as a Pervasive Constraint

Cost is a pervasive constraint that influences every design decision. Cloud compute costs for training and serving can be substantial. Organizations must balance model quality against compute cost, considering factors like instance types, spot versus on-demand pricing, and the total cost of ownership for different infrastructure choices.

The Hidden Cost of Complexity

A complex ensemble of models may achieve the best accuracy on benchmarks but be nearly impossible to debug and maintain in production. The ML systems engineer must consider the long-term operational cost of complexity — on-call burden, debugging time, deployment risk — alongside short-term performance gains. In many cases, a simpler model that is 1% less accurate but 10x easier to maintain is the better engineering choice.

Start Simple, Complicate Later

Begin with the simplest model that meets your requirements. Use a logistic regression or small neural network as a baseline. Only add complexity (deeper models, ensembles, feature crosses) when you have evidence that the simple approach falls short. This principle minimizes cost and maximizes maintainability.
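
A baseline along these lines can be only a few lines of scikit-learn; the synthetic dataset below is just for illustration.

python
# Minimal baseline before reaching for complex architectures (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {auc:.3f}")  # only add complexity if this clearly falls short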


Reliability and maintainability are trade-offs that often receive insufficient attention. The ML systems engineer must consider the long-term operational cost of complexity alongside short-term performance gains.


Dynamic Batching

A serving strategy that groups incoming inference requests into batches dynamically, balancing throughput and latency based on current load.

04 Latency, Throughput, and Performance Metrics

Understanding system performance requires precise measurement of key metrics. Latency measures the time from when a request is received to when the response is returned. It is typically reported as percentiles (p50, p95, p99) rather than averages, because tail latency can dramatically impact user experience.

Definition

Tail Latency

The response time experienced by the slowest requests, typically measured at the 99th percentile (p99) or 99.9th percentile (p99.9). Tail latency disproportionately affects user experience in large-scale systems where even a small percentage of slow requests impacts many users.

Why Averages Lie

Average latency can mask severe performance problems. A system with 10ms average latency might have a p99 of 500ms, meaning 1 in 100 users waits 50x longer than average. In a system handling 10,000 requests per second, that means 100 users every second experience unacceptable delays. Always report and monitor percentile latencies.
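
The effect is easy to reproduce with a skewed synthetic latency distribution; the numbers below are generated, not measured.

python
# Why averages lie: the same request stream, summarized two ways (synthetic latencies).
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=2.0, sigma=0.8, size=100_000)  # heavy-tailed, like real traffic

print(f"mean = {latencies_ms.mean():.1f} ms")
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50 = {p50:.1f} ms, p95 = {p95:.1f} ms, p99 = {p99:.1f} ms")
# The mean looks healthy while the p99 can be many times larger.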

Throughput and Hardware Utilization

Throughput measures the total number of predictions the system can produce per unit time. It depends on factors like hardware capabilities, model complexity, batch size, and the efficiency of the serving stack. Maximizing throughput is essential for cost-effective operation at scale.

Table 2.4: Essential performance metrics for ML serving systems.

| Metric | What It Measures | Unit | Why It Matters |
| --- | --- | --- | --- |
| p50 Latency | Median response time | Milliseconds | Typical user experience |
| p95 Latency | 95th percentile response time | Milliseconds | Most users' worst experience |
| p99 Latency | 99th percentile response time | Milliseconds | Tail experience, SLO target |
| Throughput (QPS) | Queries processed per second | Requests/sec | System capacity and cost efficiency |
| GPU Utilization | Fraction of GPU compute used | Percentage | Hardware cost efficiency |
| Error Rate | Fraction of failed requests | Percentage | System reliability |

ML-Specific Metrics

Beyond latency and throughput, ML systems must track model-specific metrics like prediction quality, feature freshness, and data distribution statistics. These metrics form the basis of a comprehensive monitoring strategy that can detect issues before they impact end users.

  • Prediction quality — Accuracy, precision, recall, AUC tracked on live traffic
  • Feature freshness — Age of the most recent feature values used in predictions
  • Data distribution — Statistical measures (mean, variance, quantiles) of input features to detect drift
  • Model staleness — Time since the model was last retrained
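
As a minimal illustration of the data-distribution item above, the sketch below compares live feature statistics against a training-time reference; the feature name, the data, and the drift threshold are all made up.

python
# Simple drift check: compare live feature statistics against a training-time reference.
import numpy as np

def drift_report(reference: np.ndarray, live: np.ndarray, name: str, tol: float = 0.2):
    ref_mean, live_mean = reference.mean(), live.mean()
    ref_std = reference.std() + 1e-9
    shift = abs(live_mean - ref_mean) / ref_std   # mean shift in units of reference std
    status = "DRIFT" if shift > tol else "ok"
    print(f"{name}: ref_mean={ref_mean:.2f} live_mean={live_mean:.2f} shift={shift:.2f} [{status}]")

rng = np.random.default_rng(1)
drift_report(rng.normal(50, 10, 10_000), rng.normal(58, 10, 5_000), "session_length_s")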

Service Level Objectives (SLOs)

Definition

Service Level Objective (SLO)

A measurable target for system performance or reliability, such as "99th percentile latency under 50ms" or "model accuracy above 92%." SLOs define acceptable operational behavior and are the basis for alerts, on-call response, and capacity planning.

Service level objectives (SLOs) formalize performance requirements for ML systems. Defining and monitoring SLOs is a best practice borrowed from site reliability engineering that helps ML teams operate with accountability and transparency.

SLOs for a Search Ranking Model

A search team defines the following SLOs: (1) 99% of inference requests complete within 50ms, (2) model NDCG@10 remains above 0.65 on daily evaluation, (3) feature freshness is under 15 minutes for real-time features. When any SLO is breached, an automated alert triggers the on-call engineer. These concrete targets transform vague quality goals into actionable engineering requirements.
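
Such SLOs translate naturally into an automated check. The sketch below mirrors the example targets; where the metric values come from (a monitoring system, a daily evaluation job) is left as an assumption.

python
# Sketch of an automated SLO check mirroring the example above (metric source is hypothetical).
SLOS = {
    "latency_p99_ms": ("<", 50.0),
    "ndcg_at_10": (">", 0.65),
    "feature_freshness_min": ("<", 15.0),
}

def check_slos(metrics: dict) -> list[str]:
    breaches = []
    for name, (op, target) in SLOS.items():
        value = metrics[name]
        ok = value < target if op == "<" else value > target
        if not ok:
            breaches.append(f"{name}={value} violates {op}{target}")
    return breaches

# e.g. page the on-call engineer if this returns any breaches
print(check_slos({"latency_p99_ms": 62.0, "ndcg_at_10": 0.71, "feature_freshness_min": 9.0}))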

Set SLOs Before You Build

Define latency, throughput, and quality SLOs before choosing a model architecture or infrastructure. These constraints narrow the design space and prevent wasted effort on solutions that cannot meet production requirements.


05 The Full ML Stack

The full ML stack can be visualized as a layered architecture, with each layer building on the capabilities of the layer below. At the bottom is hardware: CPUs, GPUs, TPUs, and specialized accelerators. Above that is the systems software layer, including operating systems, device drivers, and runtime libraries.

Definition

ML Stack

The complete layered architecture of hardware, systems software, frameworks, infrastructure, and application components that together comprise a production ML system. Each layer abstracts the complexity below while exposing capabilities to the layer above.

Table 2.5: Layers of the full ML stack from application to hardware.

| Layer | Components | Example Tools |
| --- | --- | --- |
| Application | Serving endpoints, monitoring dashboards, feedback collection | Grafana, Streamlit, custom APIs |
| Infrastructure | Orchestration, distributed computing, cloud services | Kubernetes, Ray, AWS SageMaker, GCP Vertex AI |
| Higher-Level Tools | Experiment management, hyperparameter tuning, AutoML | MLflow, Optuna, Weights & Biases, Auto-sklearn |
| Framework | Model definition, training, evaluation APIs | PyTorch, TensorFlow, JAX, Keras |
| Systems Software | OS, device drivers, runtime libraries, compilers | CUDA, cuDNN, TensorRT, XLA |
| Hardware | Processing units and accelerators | CPUs, GPUs (NVIDIA), TPUs (Google), custom ASICs |

The Framework Layer

The framework layer provides high-level APIs for defining, training, and evaluating models. Frameworks like PyTorch, TensorFlow, and JAX abstract away the complexity of hardware-specific optimizations while still providing performance. Above frameworks sit higher-level tools for experiment management, hyperparameter tuning, and AutoML.

Framework Convergence

The ML framework landscape has consolidated significantly. PyTorch dominates research and is increasingly adopted in production. TensorFlow retains a strong presence in production environments, particularly for mobile and edge deployment via TensorFlow Lite. JAX, backed by Google, offers a functional programming model with strong support for XLA compilation and TPU hardware.

Infrastructure and Application Layers

The infrastructure layer encompasses orchestration platforms, distributed computing systems, and cloud services. This layer handles resource allocation, scheduling, scaling, and fault tolerance. Tools like Kubernetes, Ray, and cloud-managed ML services operate at this layer.

At the top of the stack are the application-specific components: model serving endpoints, monitoring dashboards, and feedback collection systems. Understanding how all these layers interact is essential for diagnosing performance issues, optimizing costs, and building reliable ML systems.

Cross-Layer Bottlenecks

A bottleneck at any layer can limit the performance of the entire stack. A slow data loading pipeline (infrastructure) can leave expensive GPUs (hardware) idle. An unoptimized model graph (framework) can negate the benefits of a powerful accelerator. Always profile across layers when diagnosing performance issues.

Profiling Top-Down

When diagnosing performance issues, start at the application layer and work downward. Check serving latency first, then framework-level profiling (torch.profiler, TensorFlow Profiler), then hardware utilization (nvidia-smi, GPU memory). This top-down approach quickly identifies which layer is the bottleneck.

bash
# Quick hardware utilization check
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used \
  --format=csv -l 1

# PyTorch profiler for framework-level bottlenecks
# python -c "import torch; torch.profiler.profile(...)"

Quick commands for checking GPU utilization and profiling.
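
For the framework layer, a typical torch.profiler session looks roughly like the sketch below; a toy model stands in for the real workload, and the CUDA activity is only enabled when a GPU is present.

python
# Framework-level profiling sketch with torch.profiler (toy model as a stand-in).
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10))
inputs = torch.randn(64, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, inputs = model.cuda(), inputs.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))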

Deeper Insight

Despite the focus on model architectures and algorithms in academic papers, production ML engineering is dominated by data work — collecting, cleaning, labeling, validating, and maintaining data pipelines. The model itself is a surprisingly small fraction of the total system. This is why data engineering skills are often more valuable than deep learning expertise in production ML teams.

Think of it like building a house: the foundation, plumbing, and electrical work (data infrastructure) take 80% of the effort, while the interior design (the model) gets most of the attention but only 20% of the work.


Key Takeaways

  1. Production ML systems are complex pipelines with many components beyond the model itself, including data ingestion, feature engineering, serving, and monitoring.
  2. Trade-offs between accuracy, latency, throughput, and cost drive all ML system design decisions.
  3. Feature stores solve the critical problem of maintaining consistency between training and serving feature computation.
  4. Tail latency (p99) matters more than average latency for user-facing ML systems.
  5. The full ML stack spans from hardware accelerators through frameworks, infrastructure, and application-level tooling.
