Part 2: System Design · Chapter 6 · ~35 min read

Data Engineering

Explores data pipelines, labeling strategies, versioning, augmentation techniques, and data quality management for ML systems.

Tags: data pipelines · labeling · data versioning · augmentation · data quality
Learning objectives:
  • Design data pipelines that handle ingestion, transformation, and storage at scale
  • Compare batch vs. streaming data processing architectures and their trade-offs
  • Implement data quality validation and monitoring strategies
  • Explain feature engineering best practices and feature store architectures
  • Evaluate data versioning and lineage tracking solutions for ML workflows

01 Data Pipelines for ML

Data pipelines are the foundation of every ML system, responsible for collecting, transforming, and delivering data to training and serving infrastructure. A well-designed pipeline must handle diverse data sources, varying scales, and evolving schemas while maintaining data quality and freshness.

Definition

Data Pipeline

An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure. Pipelines handle the full ETL (Extract, Transform, Load) process and must maintain data quality, freshness, and schema consistency.

Figure 6.1: Lambda architecture combining batch and stream processing. The batch layer (Spark/MapReduce over an immutable master dataset) produces pre-computed batch views, the speed layer (Kafka/Flink) produces incrementally updated real-time views, and a serving layer merges both to answer queries.

Batch vs. Stream Processing

Batch pipelines process large volumes of historical data on a schedule (hourly, daily) and are typically used for training data preparation. Stream pipelines process data in real-time as it arrives and are used for feature computation and online serving.

Table 6.1: Batch vs. stream processing characteristics for ML data pipelines.

| Property | Batch Processing | Stream Processing |
| --- | --- | --- |
| Data arrival | Collected and processed in bulk | Processed as it arrives in real time |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use case | Training data preparation, daily reports | Real-time features, online serving |
| Throughput | Very high (optimized for bulk) | Moderate (optimized for latency) |
| Complexity | Simpler (no ordering concerns) | Higher (ordering, late data, exactly-once) |
| Typical tools | Spark, Hive, dbt | Kafka, Flink, Spark Streaming |
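The batch/stream contrast above can be sketched in a few lines: the same aggregation (clicks per user) computed batch-style over a collected dataset, and stream-style as incremental state updates. The event schema and class names here are illustrative, not from any particular framework.

```python
# Illustrative sketch: the same aggregation computed batch-style
# (over the full collected dataset) and stream-style (one event at a time).
from collections import defaultdict

events = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 1},
    {"user": "a", "clicks": 2},
]

def batch_clicks(events):
    """Batch: process the whole dataset in bulk on a schedule."""
    totals = defaultdict(int)
    for e in events:
        totals[e["user"]] += e["clicks"]
    return dict(totals)

class StreamAggregator:
    """Stream: maintain running state, updated per arriving event."""
    def __init__(self):
        self.totals = defaultdict(int)

    def update(self, event):
        self.totals[event["user"]] += event["clicks"]
        return dict(self.totals)  # the current real-time view

agg = StreamAggregator()
for e in events:
    view = agg.update(e)

assert batch_clicks(events) == view == {"a": 5, "b": 1}
```

Real systems add windowing, late-data handling, and exactly-once delivery; this sketch only shows why the stream path must manage state that the batch path gets for free.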

Definition

Lambda Architecture

A data processing architecture that combines a batch layer (for accuracy and completeness) with a speed layer (for low-latency real-time data). A serving layer merges results from both, providing comprehensive and timely data to downstream consumers.

The Lambda Architecture Trade-off

Many production systems use a hybrid lambda architecture that combines both batch and stream processing. While this provides both accuracy and low latency, it requires maintaining two code paths that must produce consistent results — a significant engineering burden. The Kappa architecture simplifies this by using stream processing for everything, replaying the stream for batch-style reprocessing.
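A toy sketch of the serving-layer merge described above. The key/value views are hypothetical; in a real lambda deployment the batch view would live in a store such as Cassandra or a warehouse.

```python
# Hypothetical lambda serving layer: merge a precomputed batch view with
# the speed layer's incremental updates accumulated since the last batch run.
batch_view = {"user_a": 120, "user_b": 45}   # recomputed nightly by the batch layer
speed_view = {"user_a": 3, "user_c": 7}      # real-time events since that run

def serve(key):
    """Serving layer: batch value plus real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert serve("user_a") == 123   # batch + speed
assert serve("user_c") == 7     # seen only by the speed layer so far
```

The engineering burden is that the code producing `batch_view` and `speed_view` must stay semantically identical, which is exactly the duplication the Kappa architecture removes.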

Quick Check

What is the main engineering drawback of the Lambda architecture for ML data pipelines?

Answer: Lambda architecture runs both a batch layer and a speed layer in parallel. The engineering burden comes from maintaining two code paths (batch and stream) that must produce identical results. The Kappa architecture addresses this by using only stream processing.

Pipeline Reliability

Data pipeline reliability is crucial because downstream model quality is directly dependent on data quality. Pipeline failures can silently corrupt training data and degrade model performance.

Silent Data Corruption

The most dangerous pipeline failures are silent. A schema change in an upstream data source might cause a feature column to be filled with nulls instead of values. The pipeline runs successfully, the model trains without errors, but predictions degrade. Always validate data distributions and schema at pipeline boundaries.
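A minimal boundary check for this failure mode, sketched without any framework (the threshold and column name are illustrative; the validation frameworks covered later in this chapter productionize the same idea).

```python
# Illustrative boundary check: fail the pipeline if a feature column's
# null rate jumps, instead of silently training on nulls.
def check_null_rate(rows, column, max_null_rate=0.05):
    nulls = sum(1 for r in rows if r.get(column) is None)
    rate = nulls / len(rows)
    if rate > max_null_rate:
        raise ValueError(
            f"{column}: null rate {rate:.0%} exceeds limit {max_null_rate:.0%}"
        )
    return rate

rows = [{"age": 34}, {"age": None}, {"age": 52}, {"age": 41}]
check_null_rate(rows, "age", max_null_rate=0.5)   # passes: 25% nulls
```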

Scaling Data Pipelines

As data volumes grow, pipelines must scale horizontally across distributed processing frameworks. The choice of framework depends on factors like data volume, processing complexity, latency requirements, and the team's existing infrastructure.

  • Apache Spark — Mature distributed processing engine, strong for batch ETL, wide ecosystem
  • Apache Beam — Unified batch and stream programming model, runs on multiple backends
  • Dask — Python-native parallel computing, great for teams already using pandas/NumPy
  • Apache Flink — Stream-first processing with strong exactly-once semantics
  • Ray Data — Distributed data processing integrated with Ray's ML ecosystem
Start Simple, Scale When Needed

If your training data fits in memory on a single machine (under ~100 GB), do not start with Spark or Beam. Use pandas or Dask for preprocessing and invest the saved complexity budget in data validation and testing. Migrate to distributed processing only when data volume demands it.

Data Pipeline

An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure.

Lambda Architecture

A data processing architecture that combines batch processing for accuracy with stream processing for low latency, providing both historical and real-time views.

02 Data Labeling and Annotation

Data labeling is often the most expensive and time-consuming part of ML development. Supervised learning requires labeled examples, and the quality of labels directly determines the upper bound on model performance. Labeling strategies must balance quality, cost, and throughput.

The Labeling Bottleneck

Industry surveys consistently show that data labeling consumes 25-80% of the time and budget of ML projects. For a medical imaging project, a single expert radiologist might label 50 images per hour at $150/hour, meaning 10,000 labeled images cost $30,000 in labeling alone. This economic reality drives the search for more efficient labeling strategies.
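The cost arithmetic from the radiologist example, spelled out:

```python
# Labeling-cost arithmetic from the medical imaging example above.
images = 10_000
images_per_hour = 50
cost_per_hour = 150                # USD for an expert radiologist

hours = images / images_per_hour   # 200 hours of expert time
cost = hours * cost_per_hour       # 30,000 USD in labeling alone
assert (hours, cost) == (200, 30_000)
```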

Human Labeling

Human labeling remains the gold standard for many tasks but faces challenges of scale, consistency, and cost. Labeling platforms provide workforce management, quality control, and tooling for annotation tasks.

Table 6.2: Comparison of data labeling platforms.

| Platform | Workforce | Key Feature | Best For |
| --- | --- | --- | --- |
| Labelbox | Your team or crowd | Strong ML-assisted labeling | Image and video annotation |
| Scale AI | Managed workforce | High quality at scale | Autonomous driving, enterprise |
| SageMaker Ground Truth | Amazon Mechanical Turk + custom | AWS integration, active learning | AWS-native ML teams |
| Label Studio | Self-managed (open source) | Flexible, customizable | Teams wanting full control |
| Prodigy | Your domain experts | Active learning built in | NLP tasks, small expert teams |

Active Learning

Definition

Active Learning

A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing human effort where it most improves the model. The model queries for labels on examples where it is most uncertain, rather than labeling data randomly.

Active learning reduces labeling costs by intelligently selecting the most informative samples for human annotation. Rather than labeling data randomly, active learning queries the model to identify examples where it is most uncertain.

Active Learning in Practice

A spam detection team has 1 million unlabeled emails and a budget to label 5,000. Instead of randomly selecting 5,000, they train an initial model on 500 random labels, then use uncertainty sampling to select the next 500 emails where the model is least confident. After 10 rounds of this active learning loop, their model with 5,000 strategically chosen labels outperforms a model trained on 20,000 randomly selected labels.
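The selection step in this loop can be sketched as follows, for binary classification with uncertainty sampling; the example ids and probabilities are made up.

```python
# Illustrative uncertainty sampling: pick the k unlabeled examples whose
# predicted spam probability is closest to the 0.5 decision boundary.
def select_uncertain(probs, k):
    """probs maps example id -> P(spam); returns the k least-confident ids."""
    return sorted(probs, key=lambda i: abs(probs[i] - 0.5))[:k]

probs = {"e1": 0.97, "e2": 0.51, "e3": 0.10, "e4": 0.45}
batch = select_uncertain(probs, k=2)   # e2 and e4 sit near the boundary
assert set(batch) == {"e2", "e4"}
```

Margin sampling and entropy-based sampling are common alternatives to this least-confidence criterion.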

Weak Supervision and Programmatic Labeling

Definition

Weak Supervision

An approach that generates training labels programmatically using labeling functions — heuristics, patterns, external knowledge bases, or pre-trained models. The noisy labels from multiple functions are combined using statistical methods to produce aggregate training labels at massive scale.

Weak supervision and programmatic labeling offer alternatives to manual annotation. Frameworks like Snorkel enable engineers to write labeling functions that encode heuristics, patterns, and external knowledge sources.

python
# Example labeling functions in Snorkel
from snorkel.labeling import labeling_function

# Label constants (Snorkel convention)
SPAM, ABSTAIN = 1, -1

@labeling_function()
def lf_keyword_spam(x):
    """Flag emails containing common spam keywords."""
    spam_words = ['free', 'winner', 'click here', 'act now']
    return SPAM if any(w in x.text.lower() for w in spam_words) else ABSTAIN

@labeling_function()
def lf_short_email(x):
    """Very short emails with links are often spam."""
    return SPAM if len(x.text) < 50 and 'http' in x.text else ABSTAIN

# Snorkel combines these noisy functions statistically
# to produce probabilistic training labels
Combining Labeling Strategies

The most effective approach often combines strategies: use weak supervision to label the bulk of data cheaply, active learning to select the most valuable examples for human annotation, and human labels as the ground truth for evaluation. This hybrid approach achieves near-human label quality at a fraction of the cost of labeling everything manually.

Active Learning

A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing effort where it most improves the model.

Weak Supervision

An approach that generates training labels programmatically using heuristics and external knowledge sources rather than manual annotation.

03 Data Versioning and Lineage

Data versioning tracks changes to datasets over time, enabling reproducibility, debugging, and rollback. Just as code version control is fundamental to software engineering, data version control is essential for ML engineering. Without it, teams cannot reliably reproduce past training runs or diagnose model regressions.

Definition

Data Versioning

The practice of tracking and managing changes to datasets over time, analogous to code version control. Data versioning enables teams to reproduce any past training run, compare dataset changes between versions, and roll back to a known good state when issues are detected.

Versioning Tools

Tools like DVC, LakeFS, and Delta Lake provide versioning capabilities for large datasets, each taking a different approach to the problem.

Table 6.3: Data versioning tools and their strengths.

| Tool | Approach | Key Feature | Best For |
| --- | --- | --- | --- |
| DVC | Content-addressable storage alongside Git | Git integration, pipeline tracking | Teams using Git, moderate data sizes |
| LakeFS | Git-like branches for data lakes | Branch/merge/diff for data | Data lake environments, S3-compatible |
| Delta Lake | ACID transactions on Parquet files | Time travel, schema enforcement | Spark ecosystems, large-scale data |
| Pachyderm | Data versioning with pipeline lineage | Automatic provenance tracking | Complex pipeline-driven workflows |
| Apache Iceberg | Table format with snapshot isolation | Schema evolution, partition evolution | Data warehouse modernization |

DVC for Most Teams

For most ML teams starting with data versioning, DVC is the best default choice. It integrates naturally with Git (so your data versions are linked to code commits), supports all major cloud storage backends, and requires minimal infrastructure changes. Start with DVC and migrate to LakeFS or Delta Lake only if your data scale or workflow demands it.

Data Lineage

Definition

Data Lineage

The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline to the final model. Lineage answers "where did this data come from?" and "what happened to it along the way?" for any datum in the system.

Data lineage tracks the complete provenance of data through the pipeline, recording every transformation applied from source to model. This audit trail is essential for debugging data quality issues, understanding model behavior, and complying with regulations.

Regulatory Requirements

Regulations like GDPR (EU), CCPA (California), and industry-specific rules increasingly require organizations to explain how automated decisions are made. Data lineage is essential for this explainability: you must be able to trace from a model prediction back through the training data to its original source. Without lineage, regulatory compliance is extremely difficult.

Storage and Format Considerations

Effective data versioning requires infrastructure decisions about storage, deduplication, and access patterns.

  • Content-addressable storage — Files stored by their hash, so identical content is never duplicated across versions
  • Columnar formats (Parquet, Arrow) — Enable reading specific columns without scanning the full dataset
  • Partitioning — Organize data by date, category, or other keys for efficient incremental processing
  • Compression — Snappy for speed, ZSTD for size, Gzip for compatibility
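The first bullet, content-addressable storage, is the mechanism behind deduplicating tools like DVC; a toy in-memory sketch:

```python
# Toy content-addressable store: objects are keyed by the SHA-256 of their
# bytes, so identical content across dataset versions is stored exactly once.
import hashlib

store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data          # re-writing identical content is a no-op
    return key

k1 = put(b"feature,label\n1,0\n")
k2 = put(b"feature,label\n1,0\n")    # same bytes in a later dataset version
assert k1 == k2 and len(store) == 1  # deduplicated: stored once
```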
Debugging a Model Regression with Lineage

A recommendation model's click-through rate drops by 15% after a weekly retrain. Using data lineage, the team traces the training data back through the pipeline and discovers that a new data source was added that includes bot traffic (non-human clicks). Lineage identifies exactly which pipeline stage introduced the bad data. The team adds a bot filter, reverts to the previous data version, and retrains — restoring performance within hours instead of days.

The Cost of Not Versioning

Teams that skip data versioning often realize its importance the hard way: a model regresses in production, and they cannot determine which data change caused it. Reproducing the previous good model is impossible because the training data has been overwritten. The cost of retroactively adding data versioning to an existing pipeline far exceeds the cost of building it in from the start.

Data Versioning

The practice of tracking and managing changes to datasets over time, analogous to code version control, enabling reproducibility and rollback.

Data Lineage

The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline.

04 Data Augmentation Techniques

Data augmentation artificially increases the effective size and diversity of training data by applying transformations to existing examples. This is one of the most effective regularization techniques, improving model generalization while reducing the need for additional labeled data.

Definition

Data Augmentation

The technique of creating modified versions of training examples through label-preserving transformations (crops, flips, noise, etc.), effectively increasing the dataset size and diversity. Augmentation acts as a powerful regularizer that improves model generalization.

Image Augmentation

For image data, common augmentations include random cropping, flipping, rotation, color jittering, and cutout. More advanced techniques like MixUp and CutMix create synthetic examples by blending pairs of training images and their labels.

Table 6.4: Common image augmentation techniques and typical improvements on ImageNet.

| Technique | How It Works | Typical Improvement | Complexity |
| --- | --- | --- | --- |
| Random Crop & Flip | Random spatial crops and horizontal flips | 1-3% accuracy | Simple |
| Color Jitter | Random brightness, contrast, saturation changes | 0.5-1% accuracy | Simple |
| Cutout / Random Erasing | Mask random rectangular regions | 0.5-1.5% accuracy | Simple |
| MixUp | Blend pairs of images and labels linearly | 1-2% accuracy | Moderate |
| CutMix | Cut and paste patches between images | 1-3% accuracy | Moderate |
| RandAugment | Apply N random augmentations at magnitude M | 1-3% accuracy | Auto-tuned |

Definition

MixUp

An augmentation technique that creates synthetic training examples by taking weighted linear combinations of pairs of examples and their labels. Given two samples (x_i, y_i) and (x_j, y_j), MixUp produces (lambda * x_i + (1-lambda) * x_j, lambda * y_i + (1-lambda) * y_j) where lambda is sampled from a Beta distribution.

\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \quad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \quad \lambda \sim \text{Beta}(\alpha, \alpha)
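A minimal NumPy sketch of the equation above, assuming one-hot label vectors; batch-level vectorization and framework integration are omitted.

```python
# MixUp on a single pair of examples, following the Beta-mixing equation above.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)         # lambda ~ Beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2    # soft label, still sums to 1
    return x_mix, y_mix

x1, y1 = np.zeros(4), np.array([1.0, 0.0])
x2, y2 = np.ones(4), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
assert np.all((0.0 <= x_mix) & (x_mix <= 1.0))   # convex combination
assert abs(y_mix.sum() - 1.0) < 1e-9             # labels stay a distribution
```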

Text and Tabular Augmentation

For text data, augmentation techniques include synonym replacement, random insertion, back-translation, and paraphrase generation using language models. For tabular data, techniques like SMOTE address class imbalance by generating synthetic minority examples through interpolation.

Back-Translation for NLP

Back-translation is one of the most effective text augmentation techniques. Translate your training text to another language (e.g., English to French) and then back (French to English). The resulting paraphrase preserves meaning while introducing lexical and syntactic variation. Use multiple target languages for greater diversity.

Quick Check

Why is on-the-fly data augmentation generally preferred over pre-computing augmented examples?

Hint: Consider what happens when the model sees the same training example across multiple epochs.

Answer: On-the-fly augmentation applies random transformations each epoch, so the model sees different augmented versions every time. Pre-computing stores fixed augmented copies on disk, limiting diversity. The trade-off is that on-the-fly augmentation requires CPU cycles during training.

Systems Considerations

From a systems perspective, data augmentation should be applied on-the-fly during training rather than pre-computed, to maximize diversity without increasing storage requirements.

Augmentation as CPU Bottleneck

If augmentation is performed on the CPU while the model trains on the GPU, slow augmentation can starve the GPU and waste expensive compute. Profile your data loading pipeline to ensure augmentation throughput exceeds GPU training throughput. Use multi-worker data loading (num_workers in PyTorch) and GPU-accelerated augmentation (NVIDIA DALI) when needed.

python
# Efficient augmentation pipeline with Albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.08, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    A.CoarseDropout(max_holes=1, max_height=56, max_width=56, p=0.5),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
# Albumentations is 2-5x faster than torchvision.transforms
On-the-Fly vs. Pre-Computed

On-the-fly augmentation generates different augmented versions each epoch, providing virtually infinite training data diversity from a fixed dataset. Pre-computed augmentation stores augmented examples on disk, which limits diversity but avoids CPU bottlenecks. For most training setups, on-the-fly augmentation with proper multi-worker loading is preferred.

Data Augmentation

The technique of creating modified versions of training examples through transformations, increasing effective dataset size and improving model generalization.

MixUp

An augmentation technique that creates synthetic training examples by taking convex combinations of pairs of examples and their labels.

05 Data Quality Management

Data quality is the single most important factor in ML system performance. The adage "garbage in, garbage out" applies with particular force to ML, where models can silently learn patterns from data errors, duplicates, and label noise without any obvious symptoms until deployment.

Your model is only as good as your data. No amount of architectural innovation can compensate for fundamentally flawed training data.

Common ML engineering axiom
Definition

Data Quality

The measure of data fitness for its intended use in ML, encompassing dimensions like completeness (no missing values), accuracy (correct labels), consistency (no conflicting records), freshness (up-to-date), and representativeness (covers the target distribution).

Data Validation Frameworks

Data validation frameworks enable teams to define expectations about data distributions, schema constraints, and statistical properties. These checks run automatically as part of the pipeline, catching issues before they corrupt training data.

Table 6.5: Data validation frameworks for ML data quality.

| Framework | Type | Key Feature | Integration |
| --- | --- | --- | --- |
| Great Expectations | Open source | Declarative expectations with rich documentation | Airflow, Spark, pandas, SQL |
| TFDV | Open source (Google) | Automatic schema inference and drift detection | TFX pipeline, Apache Beam |
| Pandera | Open source | Pandas DataFrame type checking | pytest, CI pipelines |
| Deequ | Open source (Amazon) | Unit tests for data on Spark | Spark, AWS Glue |
| Monte Carlo | SaaS | Automated anomaly detection, lineage | Most data warehouses and tools |

python
# Data validation with Great Expectations
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Define expectations
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
validator.expect_column_mean_to_be_between("purchase_amount", min_value=10, max_value=500)
validator.expect_column_proportion_of_unique_values_to_be_between("email", 0.95, 1.0)

results = validator.validate()
if not results.success:
    raise ValueError("Data quality checks failed!")

Data Quality Dimensions

  • Completeness — Are there missing values? What is the null rate per column?
  • Consistency — Do records conflict with each other or with business rules?
  • Accuracy — Are labels correct? What is the estimated label noise rate?
  • Freshness — How old is the data? Is it current enough for the model's purpose?
  • Representativeness — Does the data cover the target distribution? Are minority groups underrepresented?
  • Uniqueness — Are there duplicate records that would bias the training distribution?
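Two of these dimensions, completeness and uniqueness, reduce to simple ratios; a framework-free sketch (column and row values are illustrative):

```python
# Illustrative per-column quality metrics: completeness (non-null fraction)
# and uniqueness (distinct fraction among non-null values).
def quality_report(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "completeness": len(non_null) / len(values),
        "uniqueness": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }

rows = [{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}]
report = quality_report(rows, "id")
assert report["completeness"] == 0.75            # 3 of 4 values present
assert abs(report["uniqueness"] - 2 / 3) < 1e-9  # duplicate id=1 detected
```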
Label Noise Is Everywhere

Real-world labels are rarely perfect. Studies have found 3-10% label error rates in popular benchmarks like ImageNet and CIFAR. In production data with crowd-sourced labels, error rates can reach 10-20%. Models trained on noisy labels learn the noise as signal, which degrades generalization. Use techniques like confident learning, label smoothing, and multi-annotator agreement to mitigate label noise.

Proactive Data Quality Management

Proactive data quality management is more effective than reactive debugging. This includes establishing data contracts between producers and consumers, implementing continuous monitoring, and building automated alerts for drift detection.

Definition

Data Contract

A formal agreement between data producers and consumers that specifies the schema, quality constraints, freshness requirements, and SLAs for a data source. Data contracts prevent upstream changes from silently breaking downstream ML pipelines.
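A stripped-down sketch of the schema portion of such a contract, checked as a test in CI. The column names and types are illustrative; real contracts also cover quality constraints, freshness, and SLAs, as the definition notes.

```python
# Hypothetical data contract: the producer guarantees these column types,
# and CI fails any change that violates them before it reaches the pipeline.
CONTRACT = {"user_id": int, "purchase_amount": float}

def validate_contract(rows, contract=CONTRACT):
    for i, row in enumerate(rows):
        for col, expected in contract.items():
            if not isinstance(row.get(col), expected):
                raise TypeError(f"row {i}: {col!r} must be {expected.__name__}")
    return True

validate_contract([{"user_id": 1, "purchase_amount": 9.99}])  # passes
# A producer changing user_id from int to str would fail here, in CI,
# instead of silently corrupting downstream features.
```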

Invest in Data Quality Early

Investing in data quality infrastructure early in a project pays dividends throughout the system lifetime. A data validation pipeline that catches a corrupted batch before it enters training saves far more than the cost of building the validation. Teams that skip data quality work inevitably spend more time debugging mysterious model regressions.

Data Contract Prevents Outage

An upstream team plans to change a feature column from integer to string representation. Because a data contract exists, their schema change fails validation in the CI pipeline before deployment. The ML team is notified, updates their feature pipeline to handle the new format, and both changes deploy together. Without the contract, the schema change would have caused the ML pipeline to fail silently or produce incorrect features.

Data Quality

The measure of data fitness for its intended use in ML, encompassing dimensions like completeness, accuracy, consistency, and representativeness.

Data Validation

Automated checks that verify data meets specified quality constraints before it enters the ML pipeline, preventing silent data corruption.

Key Takeaways

  1. Data pipeline reliability is critical because model quality is directly bounded by data quality.
  2. Active learning and weak supervision can dramatically reduce labeling costs while maintaining model performance.
  3. Data versioning and lineage tracking are essential for reproducibility, debugging, and regulatory compliance.
  4. On-the-fly data augmentation maximizes training data diversity without increasing storage requirements.
  5. Proactive data quality management through validation frameworks prevents costly downstream model failures.

Chapter 6 complete. Up next: AI Frameworks
