Part 2: System Design · Chapter 6 · ~35 min read

Data Engineering

Explores data pipelines, labeling strategies, versioning, augmentation techniques, and data quality management for ML systems.

Tags: data pipelines · labeling · data versioning · augmentation · data quality
Learning objectives:
  • Design data pipelines that handle ingestion, transformation, and storage at scale
  • Compare batch vs. streaming data processing architectures and their trade-offs
  • Implement data quality validation and monitoring strategies
  • Explain feature engineering best practices and feature store architectures
  • Evaluate data versioning and lineage tracking solutions for ML workflows

01 Data Pipelines for ML

Data pipelines are the foundation of every ML system, responsible for collecting, transforming, and delivering data to training and serving infrastructure. A well-designed pipeline must handle diverse data sources, varying scales, and evolving schemas while maintaining data quality and freshness.

Definition

Data Pipeline

An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure. Pipelines handle the full ETL (Extract, Transform, Load) process and must maintain data quality, freshness, and schema consistency.

Figure 6.1: Lambda architecture combining batch and stream processing. The batch layer (Spark/MapReduce over an immutable master dataset) produces pre-computed batch views, the speed layer (Kafka/Flink) produces incrementally updated real-time views, and a serving layer merges both to answer queries.

Batch vs. Stream Processing

Batch pipelines process large volumes of historical data on a schedule (hourly, daily) and are typically used for training data preparation. Stream pipelines process data in real-time as it arrives and are used for feature computation and online serving.

Table 6.1: Batch vs. stream processing characteristics for ML data pipelines.

| Property | Batch Processing | Stream Processing |
| --- | --- | --- |
| Data arrival | Collected and processed in bulk | Processed as it arrives in real time |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use case | Training data preparation, daily reports | Real-time features, online serving |
| Throughput | Very high (optimized for bulk) | Moderate (optimized for latency) |
| Complexity | Simpler (no ordering concerns) | Higher (ordering, late data, exactly-once) |
| Typical tools | Spark, Hive, dbt | Kafka, Flink, Spark Streaming |
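The batch/stream contrast above can be sketched in a few lines: the same aggregation (clicks per user) computed batch-style over a collected dataset, and stream-style as incremental state updates. The event schema and class names here are illustrative, not from any particular framework.

```python
# Illustrative sketch: the same aggregation computed batch-style
# (over the full collected dataset) and stream-style (one event at a time).
from collections import defaultdict

events = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 1},
    {"user": "a", "clicks": 2},
]

def batch_clicks(events):
    """Batch: process the whole dataset in bulk on a schedule."""
    totals = defaultdict(int)
    for e in events:
        totals[e["user"]] += e["clicks"]
    return dict(totals)

class StreamAggregator:
    """Stream: maintain running state, updated per arriving event."""
    def __init__(self):
        self.totals = defaultdict(int)

    def update(self, event):
        self.totals[event["user"]] += event["clicks"]
        return dict(self.totals)  # the current real-time view

agg = StreamAggregator()
for e in events:
    view = agg.update(e)

assert batch_clicks(events) == view == {"a": 5, "b": 1}
```

Real systems add windowing, late-data handling, and exactly-once delivery; this sketch only shows why the stream path must manage state that the batch path gets for free.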

Definition

Lambda Architecture

A data processing architecture that combines a batch layer (for accuracy and completeness) with a speed layer (for low-latency real-time data). A serving layer merges results from both, providing comprehensive and timely data to downstream consumers.

The Lambda Architecture Trade-off

Many production systems use a hybrid lambda architecture that combines both batch and stream processing. While this provides both accuracy and low latency, it requires maintaining two code paths that must produce consistent results — a significant engineering burden. The Kappa architecture simplifies this by using stream processing for everything, replaying the stream for batch-style reprocessing.
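A toy sketch of the serving-layer merge described above. The key/value views are hypothetical; in a real lambda deployment the batch view would live in a store such as Cassandra or a warehouse.

```python
# Hypothetical lambda serving layer: merge a precomputed batch view with
# the speed layer's incremental updates accumulated since the last batch run.
batch_view = {"user_a": 120, "user_b": 45}   # recomputed nightly by the batch layer
speed_view = {"user_a": 3, "user_c": 7}      # real-time events since that run

def serve(key):
    """Serving layer: batch value plus real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert serve("user_a") == 123   # batch + speed
assert serve("user_c") == 7     # seen only by the speed layer so far
```

The engineering burden is that the code producing `batch_view` and `speed_view` must stay semantically identical, which is exactly the duplication the Kappa architecture removes.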

Quick Check

What is the main engineering drawback of the Lambda architecture for ML data pipelines?

Answer: Lambda architecture runs both a batch layer and a speed layer in parallel. The engineering burden comes from maintaining two code paths (batch and stream) that must produce identical results. The Kappa architecture addresses this by using only stream processing.

Pipeline Reliability

Data pipeline reliability is crucial because downstream model quality is directly dependent on data quality. Pipeline failures can silently corrupt training data and degrade model performance.

Silent Data Corruption

The most dangerous pipeline failures are silent. A schema change in an upstream data source might cause a feature column to be filled with nulls instead of values. The pipeline runs successfully, the model trains without errors, but predictions degrade. Always validate data distributions and schema at pipeline boundaries.
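A minimal boundary check for this failure mode, sketched without any framework (the threshold and column name are illustrative; the validation frameworks covered later in this chapter productionize the same idea).

```python
# Illustrative boundary check: fail the pipeline if a feature column's
# null rate jumps, instead of silently training on nulls.
def check_null_rate(rows, column, max_null_rate=0.05):
    nulls = sum(1 for r in rows if r.get(column) is None)
    rate = nulls / len(rows)
    if rate > max_null_rate:
        raise ValueError(
            f"{column}: null rate {rate:.0%} exceeds limit {max_null_rate:.0%}"
        )
    return rate

rows = [{"age": 34}, {"age": None}, {"age": 52}, {"age": 41}]
check_null_rate(rows, "age", max_null_rate=0.5)   # passes: 25% nulls
```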

Scaling Data Pipelines

As data volumes grow, pipelines must scale horizontally across distributed processing frameworks. The choice of framework depends on factors like data volume, processing complexity, latency requirements, and the team's existing infrastructure.

  • Apache Spark — Mature distributed processing engine, strong for batch ETL, wide ecosystem
  • Apache Beam — Unified batch and stream programming model, runs on multiple backends
  • Dask — Python-native parallel computing, great for teams already using pandas/NumPy
  • Apache Flink — Stream-first processing with strong exactly-once semantics
  • Ray Data — Distributed data processing integrated with Ray's ML ecosystem
Start Simple, Scale When Needed

If your training data fits in memory on a single machine (under ~100 GB), do not start with Spark or Beam. Use pandas or Dask for preprocessing and invest the saved complexity budget in data validation and testing. Migrate to distributed processing only when data volume demands it.

Data Pipeline

An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure.

Lambda Architecture

A data processing architecture that combines batch processing for accuracy with stream processing for low latency, providing both historical and real-time views.

02 Data Labeling and Annotation

Data labeling is often the most expensive and time-consuming part of ML development. Supervised learning requires labeled examples, and the quality of labels directly determines the upper bound on model performance. Labeling strategies must balance quality, cost, and throughput.

The Labeling Bottleneck

Industry surveys consistently show that data labeling consumes 25-80% of the time and budget of ML projects. For a medical imaging project, a single expert radiologist might label 50 images per hour at $150/hour, meaning 10,000 labeled images cost $30,000 in labeling alone. This economic reality drives the search for more efficient labeling strategies.
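The cost arithmetic from the radiologist example, spelled out:

```python
# Labeling-cost arithmetic from the medical imaging example above.
images = 10_000
images_per_hour = 50
cost_per_hour = 150                # USD for an expert radiologist

hours = images / images_per_hour   # 200 hours of expert time
cost = hours * cost_per_hour       # 30,000 USD in labeling alone
assert (hours, cost) == (200, 30_000)
```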

Human Labeling

Human labeling remains the gold standard for many tasks but faces challenges of scale, consistency, and cost. Labeling platforms provide workforce management, quality control, and tooling for annotation tasks.

Table 6.2: Comparison of data labeling platforms.

| Platform | Workforce | Key Feature | Best For |
| --- | --- | --- | --- |
| Labelbox | Your team or crowd | Strong ML-assisted labeling | Image and video annotation |
| Scale AI | Managed workforce | High quality at scale | Autonomous driving, enterprise |
| SageMaker Ground Truth | Amazon Mechanical Turk + custom | AWS integration, active learning | AWS-native ML teams |
| Label Studio | Self-managed (open source) | Flexible, customizable | Teams wanting full control |
| Prodigy | Your domain experts | Active learning built in | NLP tasks, small expert teams |

Active Learning

Definition

Active Learning

A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing human effort where it most improves the model. The model queries for labels on examples where it is most uncertain, rather than labeling data randomly.

Active learning reduces labeling costs by intelligently selecting the most informative samples for human annotation. Rather than labeling data randomly, active learning queries the model to identify examples where it is most uncertain.

Active Learning in Practice

A spam detection team has 1 million unlabeled emails and a budget to label 5,000. Instead of randomly selecting 5,000, they train an initial model on 500 random labels, then use uncertainty sampling to select the next 500 emails where the model is least confident. After 10 rounds of this active learning loop, their model with 5,000 strategically chosen labels outperforms a model trained on 20,000 randomly selected labels.
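The selection step in this loop can be sketched as follows, for binary classification with uncertainty sampling; the example ids and probabilities are made up.

```python
# Illustrative uncertainty sampling: pick the k unlabeled examples whose
# predicted spam probability is closest to the 0.5 decision boundary.
def select_uncertain(probs, k):
    """probs maps example id -> P(spam); returns the k least-confident ids."""
    return sorted(probs, key=lambda i: abs(probs[i] - 0.5))[:k]

probs = {"e1": 0.97, "e2": 0.51, "e3": 0.10, "e4": 0.45}
batch = select_uncertain(probs, k=2)   # e2 and e4 sit near the boundary
assert set(batch) == {"e2", "e4"}
```

Margin sampling and entropy-based sampling are common alternatives to this least-confidence criterion.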

Weak Supervision and Programmatic Labeling

Definition

Weak Supervision

An approach that generates training labels programmatically using labeling functions — heuristics, patterns, external knowledge bases, or pre-trained models. The noisy labels from multiple functions are combined using statistical methods to produce aggregate training labels at massive scale.

Weak supervision and programmatic labeling offer alternatives to manual annotation. Frameworks like Snorkel enable engineers to write labeling functions that encode heuristics, patterns, and external knowledge sources.

python
# Example labeling functions in Snorkel
from snorkel.labeling import labeling_function

# Label constants (Snorkel convention)
SPAM, ABSTAIN = 1, -1

@labeling_function()
def lf_keyword_spam(x):
    """Flag emails containing common spam keywords."""
    spam_words = ['free', 'winner', 'click here', 'act now']
    return SPAM if any(w in x.text.lower() for w in spam_words) else ABSTAIN

@labeling_function()
def lf_short_email(x):
    """Very short emails with links are often spam."""
    return SPAM if len(x.text) < 50 and 'http' in x.text else ABSTAIN

# Snorkel combines these noisy functions statistically
# to produce probabilistic training labels
Combining Labeling Strategies

The most effective approach often combines strategies: use weak supervision to label the bulk of data cheaply, active learning to select the most valuable examples for human annotation, and human labels as the ground truth for evaluation. This hybrid approach achieves near-human label quality at a fraction of the cost of labeling everything manually.

Active Learning

A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing effort where it most improves the model.

Weak Supervision

An approach that generates training labels programmatically using heuristics and external knowledge sources rather than manual annotation.

03 Data Versioning and Lineage

Data versioning tracks changes to datasets over time, enabling reproducibility, debugging, and rollback. Just as code version control is fundamental to software engineering, data version control is essential for ML engineering. Without it, teams cannot reliably reproduce past training runs or diagnose model regressions.

Definition

Data Versioning

The practice of tracking and managing changes to datasets over time, analogous to code version control. Data versioning enables teams to reproduce any past training run, compare dataset changes between versions, and roll back to a known good state when issues are detected.

Versioning Tools

Tools like DVC, LakeFS, and Delta Lake provide versioning capabilities for large datasets, each taking a different approach to the problem.

Table 6.3: Data versioning tools and their strengths.

| Tool | Approach | Key Feature | Best For |
| --- | --- | --- | --- |
| DVC | Content-addressable storage alongside Git | Git integration, pipeline tracking | Teams using Git, moderate data sizes |
| LakeFS | Git-like branches for data lakes | Branch/merge/diff for data | Data lake environments, S3-compatible |
| Delta Lake | ACID transactions on Parquet files | Time travel, schema enforcement | Spark ecosystems, large-scale data |
| Pachyderm | Data versioning with pipeline lineage | Automatic provenance tracking | Complex pipeline-driven workflows |
| Apache Iceberg | Table format with snapshot isolation | Schema evolution, partition evolution | Data warehouse modernization |

DVC for Most Teams

For most ML teams starting with data versioning, DVC is the best default choice. It integrates naturally with Git (so your data versions are linked to code commits), supports all major cloud storage backends, and requires minimal infrastructure changes. Start with DVC and migrate to LakeFS or Delta Lake only if your data scale or workflow demands it.

Data Lineage

Definition

Data Lineage

The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline to the final model. Lineage answers "where did this data come from?" and "what happened to it along the way?" for any datum in the system.

Data lineage tracks the complete provenance of data through the pipeline, recording every transformation applied from source to model. This audit trail is essential for debugging data quality issues, understanding model behavior, and complying with regulations.

Regulatory Requirements

Regulations like GDPR (EU), CCPA (California), and industry-specific rules increasingly require organizations to explain how automated decisions are made. Data lineage is essential for this explainability: you must be able to trace from a model prediction back through the training data to its original source. Without lineage, regulatory compliance is extremely difficult.

Storage and Format Considerations

Effective data versioning requires infrastructure decisions about storage, deduplication, and access patterns.

  • Content-addressable storage — Files stored by their hash, so identical content is never duplicated across versions
  • Columnar formats (Parquet, Arrow) — Enable reading specific columns without scanning the full dataset
  • Partitioning — Organize data by date, category, or other keys for efficient incremental processing
  • Compression — Snappy for speed, ZSTD for size, Gzip for compatibility
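The first bullet, content-addressable storage, is the mechanism behind deduplicating tools like DVC; a toy in-memory sketch:

```python
# Toy content-addressable store: objects are keyed by the SHA-256 of their
# bytes, so identical content across dataset versions is stored exactly once.
import hashlib

store = {}

def put(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    store[key] = data          # re-writing identical content is a no-op
    return key

k1 = put(b"feature,label\n1,0\n")
k2 = put(b"feature,label\n1,0\n")    # same bytes in a later dataset version
assert k1 == k2 and len(store) == 1  # deduplicated: stored once
```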
Debugging a Model Regression with Lineage

A recommendation model's click-through rate drops by 15% after a weekly retrain. Using data lineage, the team traces the training data back through the pipeline and discovers that a new data source was added that includes bot traffic (non-human clicks). Lineage identifies exactly which pipeline stage introduced the bad data. The team adds a bot filter, reverts to the previous data version, and retrains — restoring performance within hours instead of days.

The Cost of Not Versioning

Teams that skip data versioning often realize its importance the hard way: a model regresses in production, and they cannot determine which data change caused it. Reproducing the previous good model is impossible because the training data has been overwritten. The cost of retroactively adding data versioning to an existing pipeline far exceeds the cost of building it in from the start.

Data Versioning

The practice of tracking and managing changes to datasets over time, analogous to code version control, enabling reproducibility and rollback.

Data Lineage

The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline.

04 Data Augmentation Techniques

Data augmentation artificially increases the effective size and diversity of training data by applying transformations to existing examples. This is one of the most effective regularization techniques, improving model generalization while reducing the need for additional labeled data.

Definition

Data Augmentation

The technique of creating modified versions of training examples through label-preserving transformations (crops, flips, noise, etc.), effectively increasing the dataset size and diversity. Augmentation acts as a powerful regularizer that improves model generalization.

Image Augmentation

For image data, common augmentations include random cropping, flipping, rotation, color jittering, and cutout. More advanced techniques like MixUp and CutMix create synthetic examples by blending pairs of training images and their labels.

Table 6.4: Common image augmentation techniques and typical improvements on ImageNet.

| Technique | How It Works | Typical Improvement | Complexity |
| --- | --- | --- | --- |
| Random Crop & Flip | Random spatial crops and horizontal flips | 1-3% accuracy | Simple |
| Color Jitter | Random brightness, contrast, saturation changes | 0.5-1% accuracy | Simple |
| Cutout / Random Erasing | Mask random rectangular regions | 0.5-1.5% accuracy | Simple |
| MixUp | Blend pairs of images and labels linearly | 1-2% accuracy | Moderate |
| CutMix | Cut and paste patches between images | 1-3% accuracy | Moderate |
| RandAugment | Apply N random augmentations at magnitude M | 1-3% accuracy | Auto-tuned |

Definition

MixUp

An augmentation technique that creates synthetic training examples by taking weighted linear combinations of pairs of examples and their labels. Given two samples (x_i, y_i) and (x_j, y_j), MixUp produces (lambda * x_i + (1-lambda) * x_j, lambda * y_i + (1-lambda) * y_j) where lambda is sampled from a Beta distribution.

\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \quad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \quad \lambda \sim \text{Beta}(\alpha, \alpha)
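A minimal NumPy sketch of the equation above, assuming one-hot label vectors; batch-level vectorization and framework integration are omitted.

```python
# MixUp on a single pair of examples, following the Beta-mixing equation above.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)         # lambda ~ Beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2    # soft label, still sums to 1
    return x_mix, y_mix

x1, y1 = np.zeros(4), np.array([1.0, 0.0])
x2, y2 = np.ones(4), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
assert np.all((0.0 <= x_mix) & (x_mix <= 1.0))   # convex combination
assert abs(y_mix.sum() - 1.0) < 1e-9             # labels stay a distribution
```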

Text and Tabular Augmentation

For text data, augmentation techniques include synonym replacement, random insertion, back-translation, and paraphrase generation using language models. For tabular data, techniques like SMOTE address class imbalance by generating synthetic minority examples through interpolation.

Back-Translation for NLP

Back-translation is one of the most effective text augmentation techniques. Translate your training text to another language (e.g., English to French) and then back (French to English). The resulting paraphrase preserves meaning while introducing lexical and syntactic variation. Use multiple target languages for greater diversity.

Quick Check

Why is on-the-fly data augmentation generally preferred over pre-computing augmented examples?

Hint: Consider what happens when the model sees the same training example across multiple epochs.

Answer: On-the-fly augmentation applies random transformations each epoch, so the model sees different augmented versions every time. Pre-computing stores fixed augmented copies on disk, limiting diversity. The trade-off is that on-the-fly augmentation requires CPU cycles during training.

Systems Considerations

From a systems perspective, data augmentation should be applied on-the-fly during training rather than pre-computed, to maximize diversity without increasing storage requirements.

Augmentation as CPU Bottleneck

If augmentation is performed on the CPU while the model trains on the GPU, slow augmentation can starve the GPU and waste expensive compute. Profile your data loading pipeline to ensure augmentation throughput exceeds GPU training throughput. Use multi-worker data loading (num_workers in PyTorch) and GPU-accelerated augmentation (NVIDIA DALI) when needed.

python
# Efficient augmentation pipeline with Albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.08, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    A.CoarseDropout(max_holes=1, max_height=56, max_width=56, p=0.5),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
# Albumentations is 2-5x faster than torchvision.transforms
On-the-Fly vs. Pre-Computed

On-the-fly augmentation generates different augmented versions each epoch, providing virtually infinite training data diversity from a fixed dataset. Pre-computed augmentation stores augmented examples on disk, which limits diversity but avoids CPU bottlenecks. For most training setups, on-the-fly augmentation with proper multi-worker loading is preferred.

Data Augmentation

The technique of creating modified versions of training examples through transformations, increasing effective dataset size and improving model generalization.

MixUp

An augmentation technique that creates synthetic training examples by taking convex combinations of pairs of examples and their labels.

05 Data Quality Management

Data quality is the single most important factor in ML system performance. The adage "garbage in, garbage out" applies with particular force to ML, where models can silently learn patterns from data errors, duplicates, and label noise without any obvious symptoms until deployment.

Your model is only as good as your data. No amount of architectural innovation can compensate for fundamentally flawed training data.

Common ML engineering axiom
Definition

Data Quality

The measure of data fitness for its intended use in ML, encompassing dimensions like completeness (no missing values), accuracy (correct labels), consistency (no conflicting records), freshness (up-to-date), and representativeness (covers the target distribution).

Data Validation Frameworks

Data validation frameworks enable teams to define expectations about data distributions, schema constraints, and statistical properties. These checks run automatically as part of the pipeline, catching issues before they corrupt training data.

Table 6.5: Data validation frameworks for ML data quality.

| Framework | Type | Key Feature | Integration |
| --- | --- | --- | --- |
| Great Expectations | Open source | Declarative expectations with rich documentation | Airflow, Spark, pandas, SQL |
| TFDV | Open source (Google) | Automatic schema inference and drift detection | TFX pipeline, Apache Beam |
| Pandera | Open source | Pandas DataFrame type checking | pytest, CI pipelines |
| Deequ | Open source (Amazon) | Unit tests for data on Spark | Spark, AWS Glue |
| Monte Carlo | SaaS | Automated anomaly detection, lineage | Most data warehouses and tools |

python
# Data validation with Great Expectations
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Define expectations
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=150)
validator.expect_column_mean_to_be_between("purchase_amount", min_value=10, max_value=500)
validator.expect_column_proportion_of_unique_values_to_be_between("email", 0.95, 1.0)

results = validator.validate()
if not results.success:
    raise ValueError("Data quality checks failed!")

Data Quality Dimensions

  • Completeness — Are there missing values? What is the null rate per column?
  • Consistency — Do records conflict with each other or with business rules?
  • Accuracy — Are labels correct? What is the estimated label noise rate?
  • Freshness — How old is the data? Is it current enough for the model's purpose?
  • Representativeness — Does the data cover the target distribution? Are minority groups underrepresented?
  • Uniqueness — Are there duplicate records that would bias the training distribution?
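Two of these dimensions, completeness and uniqueness, reduce to simple ratios; a framework-free sketch (column and row values are illustrative):

```python
# Illustrative per-column quality metrics: completeness (non-null fraction)
# and uniqueness (distinct fraction among non-null values).
def quality_report(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "completeness": len(non_null) / len(values),
        "uniqueness": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }

rows = [{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}]
report = quality_report(rows, "id")
assert report["completeness"] == 0.75            # 3 of 4 values present
assert abs(report["uniqueness"] - 2 / 3) < 1e-9  # duplicate id=1 detected
```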
Label Noise Is Everywhere

Real-world labels are rarely perfect. Studies have found 3-10% label error rates in popular benchmarks like ImageNet and CIFAR. In production data with crowd-sourced labels, error rates can reach 10-20%. Models trained on noisy labels learn the noise as signal, which degrades generalization. Use techniques like confident learning, label smoothing, and multi-annotator agreement to mitigate label noise.

Proactive Data Quality Management

Proactive data quality management is more effective than reactive debugging. This includes establishing data contracts between producers and consumers, implementing continuous monitoring, and building automated alerts for drift detection.

Definition

Data Contract

A formal agreement between data producers and consumers that specifies the schema, quality constraints, freshness requirements, and SLAs for a data source. Data contracts prevent upstream changes from silently breaking downstream ML pipelines.
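A stripped-down sketch of the schema portion of such a contract, checked as a test in CI. The column names and types are illustrative; real contracts also cover quality constraints, freshness, and SLAs, as the definition notes.

```python
# Hypothetical data contract: the producer guarantees these column types,
# and CI fails any change that violates them before it reaches the pipeline.
CONTRACT = {"user_id": int, "purchase_amount": float}

def validate_contract(rows, contract=CONTRACT):
    for i, row in enumerate(rows):
        for col, expected in contract.items():
            if not isinstance(row.get(col), expected):
                raise TypeError(f"row {i}: {col!r} must be {expected.__name__}")
    return True

validate_contract([{"user_id": 1, "purchase_amount": 9.99}])  # passes
# A producer changing user_id from int to str would fail here, in CI,
# instead of silently corrupting downstream features.
```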

Invest in Data Quality Early

Investing in data quality infrastructure early in a project pays dividends throughout the system lifetime. A data validation pipeline that catches a corrupted batch before it enters training saves far more than the cost of building the validation. Teams that skip data quality work inevitably spend more time debugging mysterious model regressions.

Data Contract Prevents Outage

An upstream team plans to change a feature column from integer to string representation. Because a data contract exists, their schema change fails validation in the CI pipeline before deployment. The ML team is notified, updates their feature pipeline to handle the new format, and both changes deploy together. Without the contract, the schema change would have caused the ML pipeline to fail silently or produce incorrect features.

Data Quality

The measure of data fitness for its intended use in ML, encompassing dimensions like completeness, accuracy, consistency, and representativeness.

Data Validation

Automated checks that verify data meets specified quality constraints before it enters the ML pipeline, preventing silent data corruption.

Key Takeaways

  1. Data pipeline reliability is critical because model quality is directly bounded by data quality.
  2. Active learning and weak supervision can dramatically reduce labeling costs while maintaining model performance.
  3. Data versioning and lineage tracking are essential for reproducibility, debugging, and regulatory compliance.
  4. On-the-fly data augmentation maximizes training data diversity without increasing storage requirements.
  5. Proactive data quality management through validation frameworks prevents costly downstream model failures.

Chapter 6 complete. Up next: AI Frameworks
