Data Engineering
Explores data pipelines, labeling strategies, versioning, augmentation techniques, and data quality management for ML systems.
- Design data pipelines that handle ingestion, transformation, and storage at scale
- Compare batch vs. streaming data processing architectures and their trade-offs
- Implement data quality validation and monitoring strategies
- Explain feature engineering best practices and feature store architectures
- Evaluate data versioning and lineage tracking solutions for ML workflows
01 Data Pipelines for ML
Data pipelines are the foundation of every ML system, responsible for collecting, transforming, and delivering data to training and serving infrastructure. A well-designed pipeline must handle diverse data sources, varying scales, and evolving schemas while maintaining data quality and freshness.
Data Pipeline
An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure. Pipelines handle the full ETL (Extract, Transform, Load) process and must maintain data quality, freshness, and schema consistency.
Figure: Lambda architecture data pipeline.
Batch vs. Stream Processing
Batch pipelines process large volumes of historical data on a schedule (hourly, daily) and are typically used for training data preparation. Stream pipelines process data in real-time as it arrives and are used for feature computation and online serving.
| Property | Batch Processing | Stream Processing |
|---|---|---|
| Data arrival | Collected and processed in bulk | Processed as it arrives in real-time |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use case | Training data preparation, daily reports | Real-time features, online serving |
| Throughput | Very high (optimized for bulk) | Moderate (optimized for latency) |
| Complexity | Simpler (no ordering concerns) | Higher (ordering, late data, exactly-once) |
| Typical tools | Spark, Hive, dbt | Kafka, Flink, Spark Streaming |
Table 6.1: Batch vs. stream processing characteristics for ML data pipelines.
Lambda Architecture
A data processing architecture that combines a batch layer (for accuracy and completeness) with a speed layer (for low-latency real-time data). A serving layer merges results from both, providing comprehensive and timely data to downstream consumers.
Many production systems use a hybrid lambda architecture that combines both batch and stream processing. While this provides both accuracy and low latency, it requires maintaining two code paths that must produce consistent results — a significant engineering burden. The Kappa architecture simplifies this by using stream processing for everything, replaying the stream for batch-style reprocessing.
What is the main engineering drawback of the Lambda architecture for ML data pipelines?
Pipeline Reliability
Data pipeline reliability is crucial because downstream model quality is directly dependent on data quality. Pipeline failures can silently corrupt training data and degrade model performance.
The most dangerous pipeline failures are silent. A schema change in an upstream data source might cause a feature column to be filled with nulls instead of values. The pipeline runs successfully, the model trains without errors, but predictions degrade. Always validate data distributions and schema at pipeline boundaries.
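As a minimal illustration of such a boundary check (not from the source pipeline; the column names and threshold are hypothetical), a few lines of pandas can catch the silent null-fill failure described above:
# Minimal sketch of a pipeline-boundary check (hypothetical columns/threshold)
import pandas as pd

EXPECTED_DTYPES = {"user_id": "int64", "purchase_amount": "float64"}
MAX_NULL_RATE = 0.01  # illustrative tolerance

def validate_batch(df: pd.DataFrame) -> None:
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            raise ValueError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise ValueError(f"{col}: dtype drifted to {df[col].dtype}")
        null_rate = df[col].isna().mean()
        if null_rate > MAX_NULL_RATE:
            raise ValueError(f"{col}: null rate {null_rate:.1%} exceeds threshold")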
Scaling Data Pipelines
As data volumes grow, pipelines must scale horizontally across distributed processing frameworks. The choice of framework depends on factors like data volume, processing complexity, latency requirements, and the team's existing infrastructure.
- Apache Spark — Mature distributed processing engine, strong for batch ETL, wide ecosystem
- Apache Beam — Unified batch and stream programming model, runs on multiple backends
- Dask — Python-native parallel computing, great for teams already using pandas/NumPy
- Apache Flink — Stream-first processing with strong exactly-once semantics
- Ray Data — Distributed data processing integrated with Ray's ML ecosystem
If your training data fits in memory on a single machine (under ~100 GB), do not start with Spark or Beam. Use pandas or Dask for preprocessing and invest the saved complexity budget in data validation and testing. Migrate to distributed processing only when data volume demands it.
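As a sketch of that starting point, Dask mirrors the pandas API while parallelizing across cores; the file path and column names here are hypothetical:
# Sketch: pandas-style preprocessing with Dask (path and columns hypothetical)
import dask.dataframe as dd

df = dd.read_parquet("data/events/")          # lazy, out-of-core read
df = df[df["purchase_amount"] > 0]            # same filtering syntax as pandas
daily_mean = df.groupby("event_date")["purchase_amount"].mean()
result = daily_mean.compute()                 # triggers parallel execution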
Data Pipeline
An automated system for collecting, transforming, validating, and delivering data from source systems to ML training and serving infrastructure.
Lambda Architecture
A data processing architecture that combines batch processing for accuracy with stream processing for low latency, providing both historical and real-time views.
02 Data Labeling and Annotation
Data labeling is often the most expensive and time-consuming part of ML development. Supervised learning requires labeled examples, and the quality of labels directly determines the upper bound on model performance. Labeling strategies must balance quality, cost, and throughput.
Industry surveys consistently show that data labeling consumes 25-80% of the time and budget of ML projects. For a medical imaging project, a single expert radiologist might label 50 images per hour at $150/hour, so labeling 10,000 images costs $30,000 and 200 hours of expert time. This economic reality drives the search for more efficient labeling strategies.
Human Labeling
Human labeling remains the gold standard for many tasks but faces challenges of scale, consistency, and cost. Labeling platforms provide workforce management, quality control, and tooling for annotation tasks.
| Platform | Workforce | Key Feature | Best For |
|---|---|---|---|
| Labelbox | Your team or crowd | Strong ML-assisted labeling | Image and video annotation |
| Scale AI | Managed workforce | High-quality at scale | Autonomous driving, enterprise |
| SageMaker Ground Truth | Amazon Mechanical Turk + custom | AWS integration, active learning | AWS-native ML teams |
| Label Studio | Self-managed (open source) | Flexible, customizable | Teams wanting full control |
| Prodigy | Your domain experts | Active learning built-in | NLP tasks, small expert teams |
Table 6.2: Comparison of data labeling platforms.
Active Learning
Active Learning
A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing human effort where it most improves the model. The model queries for labels on examples where it is most uncertain, rather than labeling data randomly.
Active learning reduces labeling costs by focusing annotation effort on the most informative samples. Rather than labeling data in a random order, each round queries the current model to find the examples where it is least confident and sends those to human annotators.
A spam detection team has 1 million unlabeled emails and a budget to label 5,000. Instead of randomly selecting 5,000, they train an initial model on 500 random labels, then run nine rounds of uncertainty sampling, each adding the 500 emails where the current model is least confident. With 5,000 strategically chosen labels, their model outperforms one trained on 20,000 randomly selected labels.
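A minimal sketch of one round of that loop, assuming a scikit-learn classifier and in-memory feature matrices (X_labeled, y_labeled, and X_pool are placeholders):
# One round of least-confidence uncertainty sampling (names are placeholders)
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_labeled, y_labeled)               # labels gathered so far

proba = model.predict_proba(X_pool)           # scores for the unlabeled pool
confidence = proba.max(axis=1)                # top-class probability per email
query_indices = np.argsort(confidence)[:500]  # 500 least-confident examples
# ...send X_pool[query_indices] to human annotators, then retrain and repeat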
Weak Supervision and Programmatic Labeling
Weak Supervision
An approach that generates training labels programmatically using labeling functions — heuristics, patterns, external knowledge bases, or pre-trained models. The noisy labels from multiple functions are combined using statistical methods to produce aggregate training labels at massive scale.
Weak supervision and programmatic labeling offer alternatives to manual annotation. Frameworks like Snorkel enable engineers to write labeling functions that encode heuristics, patterns, and external knowledge sources.
class="tok-comment"># Example labeling functions in Snorkel
from snorkel.labeling import labeling_function
class="tok-decorator">@labeling_function()
def lf_keyword_spam(x):
class="tok-string">class="tok-string">""class="tok-string">"Flag emails containing common spam keywords."class="tok-string">""
spam_words = [&class="tok-comment">#class="tok-number">39;freeclass="tok-string">class="tok-number">39;, class="tok-number">39;winnerclass="tok-string">class="tok-number">39;, class="tok-number">39;click hereclass="tok-string">class="tok-number">39;, class="tok-number">39;act nowclass="tok-string">class="tok-number">39;]
return SPAM if any(w in x.text.lower() for w in spam_words) else ABSTAIN
class="tok-decorator">@labeling_function()
def lf_short_email(x):
class="tok-string">"""Very short emails with links are often spam."""
return SPAM if len(x.text) < class="tok-number">50 and &class="tok-comment">#class="tok-number">39;httpclass="tok-number">39; in x.text else ABSTAIN
class="tok-comment"># Snorkel combines these noisy functions statistically
class="tok-comment"># to produce probabilistic training labelsThe most effective approach often combines strategies: use weak supervision to label the bulk of data cheaply, active learning to select the most valuable examples for human annotation, and human labels as the ground truth for evaluation. This hybrid approach achieves near-human label quality at a fraction of the cost of labeling everything manually.
Active Learning
A strategy that selects the most informative unlabeled examples for human annotation, reducing labeling costs by focusing effort where it most improves the model.
Weak Supervision
An approach that generates training labels programmatically using heuristics and external knowledge sources rather than manual annotation.
03 Data Versioning and Lineage
Data versioning tracks changes to datasets over time, enabling reproducibility, debugging, and rollback. Just as code version control is fundamental to software engineering, data version control is essential for ML engineering. Without it, teams cannot reliably reproduce past training runs or diagnose model regressions.
Data Versioning
The practice of tracking and managing changes to datasets over time, analogous to code version control. Data versioning enables teams to reproduce any past training run, compare dataset changes between versions, and roll back to a known good state when issues are detected.
Versioning Tools
Tools like DVC, LakeFS, and Delta Lake provide versioning capabilities for large datasets, each taking a different approach to the problem.
| Tool | Approach | Key Feature | Best For |
|---|---|---|---|
| DVC | Content-addressable storage alongside Git | Git integration, pipeline tracking | Teams using Git, moderate data sizes |
| LakeFS | Git-like branches for data lakes | Branch/merge/diff for data | Data lake environments, S3-compatible |
| Delta Lake | ACID transactions on Parquet files | Time travel, schema enforcement | Spark ecosystems, large-scale data |
| Pachyderm | Data versioning with pipeline lineage | Automatic provenance tracking | Complex pipeline-driven workflows |
| Apache Iceberg | Table format with snapshot isolation | Schema evolution, partition evolution | Data warehouse modernization |
Table 6.3: Data versioning tools and their strengths.
For most ML teams starting with data versioning, DVC is the best default choice. It integrates naturally with Git (so your data versions are linked to code commits), supports all major cloud storage backends, and requires minimal infrastructure changes. Start with DVC and migrate to LakeFS or Delta Lake only if your data scale or workflow demands it.
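For example, once a dataset is tracked with DVC, a past version can be read by Git revision via DVC's Python API; the file path and tag below are hypothetical:
# Sketch: loading a specific data version with dvc.api (path/rev hypothetical)
import dvc.api
import pandas as pd

with dvc.api.open("data/training_data.csv", rev="v1.2") as f:
    train_df = pd.read_csv(f)  # exactly the bytes committed at tag v1.2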
Data Lineage
Data Lineage
The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline to the final model. Lineage answers "where did this data come from?" and "what happened to it along the way?" for any datum in the system.
Data lineage tracks the complete provenance of data through the pipeline, recording every transformation applied from source to model. This audit trail is essential for debugging data quality issues, understanding model behavior, and complying with regulations.
Regulations like GDPR (EU), CCPA (California), and industry-specific rules increasingly require organizations to explain how automated decisions are made. Data lineage is essential for this explainability: you must be able to trace from a model prediction back through the training data to its original source. Without lineage, regulatory compliance is extremely difficult.
Storage and Format Considerations
Effective data versioning requires infrastructure decisions about storage, deduplication, and access patterns.
- Content-addressable storage — Files stored by their hash, so identical content is never duplicated across versions
- Columnar formats (Parquet, Arrow) — Enable reading specific columns without scanning the full dataset
- Partitioning — Organize data by date, category, or other keys for efficient incremental processing (see the sketch after this list)
- Compression — Snappy for speed, ZSTD for size, Gzip for compatibility
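A small sketch combining three of these ideas (columnar format, partitioning, compression) with pandas and pyarrow; the schema is hypothetical:
# Sketch: partitioned, compressed Parquet with pandas/pyarrow (schema hypothetical)
import pandas as pd

events = pd.DataFrame({
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user_id": [101, 102, 103],
    "clicks": [5, 3, 7],
})

# Partitioning by date lets incremental jobs touch only new partitions;
# ZSTD favors size, Snappy would favor speed.
events.to_parquet("events/", partition_cols=["event_date"], compression="zstd")

# Columnar reads pull only the columns needed, skipping the rest of the file.
subset = pd.read_parquet("events/", columns=["user_id", "clicks"])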
A recommendation model's click-through rate drops by 15% after a weekly retrain. Using data lineage, the team traces the training data back through the pipeline and discovers that a new data source was added that includes bot traffic (non-human clicks). Lineage identifies exactly which pipeline stage introduced the bad data. The team adds a bot filter, reverts to the previous data version, and retrains — restoring performance within hours instead of days.
Teams that skip data versioning often realize its importance the hard way: a model regresses in production, and they cannot determine which data change caused it. Reproducing the previous good model is impossible because the training data has been overwritten. The cost of retroactively adding data versioning to an existing pipeline far exceeds the cost of building it in from the start.
Data Versioning
The practice of tracking and managing changes to datasets over time, analogous to code version control, enabling reproducibility and rollback.
Data Lineage
The complete record of data provenance, tracking every transformation applied to data from its original source through the ML pipeline.
04 Data Augmentation Techniques
Data augmentation artificially increases the effective size and diversity of training data by applying transformations to existing examples. This is one of the most effective regularization techniques, improving model generalization while reducing the need for additional labeled data.
Data Augmentation
The technique of creating modified versions of training examples through label-preserving transformations (crops, flips, noise, etc.), effectively increasing the dataset size and diversity. Augmentation acts as a powerful regularizer that improves model generalization.
Image Augmentation
For image data, common augmentations include random cropping, flipping, rotation, color jittering, and cutout. More advanced techniques like MixUp and CutMix create synthetic examples by blending pairs of training images and their labels.
| Technique | How It Works | Typical Improvement | Complexity |
|---|---|---|---|
| Random Crop & Flip | Random spatial crops and horizontal flips | 1-3% accuracy | Simple |
| Color Jitter | Random brightness, contrast, saturation changes | 0.5-1% accuracy | Simple |
| Cutout / Random Erasing | Mask random rectangular regions | 0.5-1.5% accuracy | Simple |
| MixUp | Blend pairs of images and labels linearly | 1-2% accuracy | Moderate |
| CutMix | Cut and paste patches between images | 1-3% accuracy | Moderate |
| RandAugment | Apply N random augmentations at magnitude M | 1-3% accuracy | Auto-tuned |
Table 6.4: Common image augmentation techniques and typical improvements on ImageNet.
MixUp
An augmentation technique that creates synthetic training examples by taking weighted linear combinations of pairs of examples and their labels. Given two samples (x_i, y_i) and (x_j, y_j), MixUp produces (lambda * x_i + (1-lambda) * x_j, lambda * y_i + (1-lambda) * y_j) where lambda is sampled from a Beta distribution.
\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \quad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \quad \lambda \sim \text{Beta}(\alpha, \alpha)
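A minimal PyTorch sketch of this blend, mixing a batch with a shuffled copy of itself (assumes one-hot or soft label tensors):
# Minimal MixUp sketch in PyTorch (assumes one-hot or soft label tensors)
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()  # lambda ~ Beta(alpha, alpha)
    idx = torch.randperm(x.size(0))                        # random pairing within the batch
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]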
Text and Tabular Augmentation
For text data, augmentation techniques include synonym replacement, random insertion, back-translation, and paraphrase generation using language models. For tabular data, techniques like SMOTE address class imbalance by generating synthetic minority examples through interpolation.
Back-translation is one of the most effective text augmentation techniques. Translate your training text to another language (e.g., English to French) and then back (French to English). The resulting paraphrase preserves meaning while introducing lexical and syntactic variation. Use multiple target languages for greater diversity.
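A sketch of back-translation using Hugging Face translation pipelines; the Helsinki-NLP models named here are one common choice, not prescribed by the text:
# Back-translation sketch with Hugging Face pipelines (model choice illustrative)
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    french = en_to_fr(text)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]  # paraphrase of the original

print(back_translate("Claim your free prize by clicking here now!"))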
Why is on-the-fly data augmentation generally preferred over pre-computing augmented examples?
Consider what happens when the model sees the same training example across multiple epochs.
Systems Considerations
From a systems perspective, data augmentation should be applied on-the-fly during training rather than pre-computed, to maximize diversity without increasing storage requirements.
If augmentation is performed on the CPU while the model trains on the GPU, slow augmentation can starve the GPU and waste expensive compute. Profile your data loading pipeline to ensure augmentation throughput exceeds GPU training throughput. Use multi-worker data loading (num_workers in PyTorch) and GPU-accelerated augmentation (NVIDIA DALI) when needed.
class="tok-comment"># Efficient augmentation pipeline with Albumentations
import albumentations as A
from albumentations.pytorch import ToTensorV2
transform = A.Compose([
A.RandomResizedCrop(class="tok-number">224, class="tok-number">224, scale=(class="tok-number">0.08, class="tok-number">1.0)),
A.HorizontalFlip(p=class="tok-number">0.5),
A.ColorJitter(brightness=class="tok-number">0.4, contrast=class="tok-number">0.4, saturation=class="tok-number">0.4),
A.CoarseDropout(max_holes=class="tok-number">1, max_height=class="tok-number">56, max_width=class="tok-number">56, p=class="tok-number">0.5),
A.Normalize(mean=[class="tok-number">0.485, class="tok-number">0.456, class="tok-number">0.406], std=[class="tok-number">0.229, class="tok-number">0.224, class="tok-number">0.225]),
ToTensorV2(),
])
class="tok-comment"># Albumentations is class="tok-number">2-5x faster than torchvision.transformsOn-the-fly augmentation generates different augmented versions each epoch, providing virtually infinite training data diversity from a fixed dataset. Pre-computed augmentation stores augmented examples on disk, which limits diversity but avoids CPU bottlenecks. For most training setups, on-the-fly augmentation with proper multi-worker loading is preferred.
Data Augmentation
The technique of creating modified versions of training examples through transformations, increasing effective dataset size and improving model generalization.
MixUp
An augmentation technique that creates synthetic training examples by taking convex combinations of pairs of examples and their labels.
05 Data Quality Management
Data quality is the single most important factor in ML system performance. The adage "garbage in, garbage out" applies with particular force to ML, where models can silently learn patterns from data errors, duplicates, and label noise without any obvious symptoms until deployment.
Your model is only as good as your data. No amount of architectural innovation can compensate for fundamentally flawed training data.
Data Quality
The measure of data fitness for its intended use in ML, encompassing dimensions like completeness (no missing values), accuracy (correct labels), consistency (no conflicting records), freshness (up-to-date), and representativeness (covers the target distribution).
Data Validation Frameworks
Data validation frameworks enable teams to define expectations about data distributions, schema constraints, and statistical properties. These checks run automatically as part of the pipeline, catching issues before they corrupt training data.
| Framework | Type | Key Feature | Integration |
|---|---|---|---|
| Great Expectations | Open-source | Declarative expectations with rich documentation | Airflow, Spark, pandas, SQL |
| TFDV | Open-source (Google) | Automatic schema inference and drift detection | TFX pipeline, Apache Beam |
| Pandera | Open-source | Pandas DataFrame type checking | pytest, CI pipelines |
| Deequ | Open-source (Amazon) | Unit tests for data on Spark | Spark, AWS Glue |
| Monte Carlo | SaaS | Automated anomaly detection, lineage | Most data warehouses and tools |
Table 6.5: Data validation frameworks for ML data quality.
class="tok-comment"># Data validation with Great Expectations
import great_expectations as gx
context = gx.get_context()
validator = context.sources.pandas_default.read_csv(class="tok-string">"training_data.csv")
class="tok-comment"># Define expectations
validator.expect_column_values_to_not_be_null(class="tok-string">"user_id")
validator.expect_column_values_to_be_between(class="tok-string">"age", min_value=class="tok-number">0, max_value=class="tok-number">150)
validator.expect_column_mean_to_be_between(class="tok-string">"purchase_amount", min_value=class="tok-number">10, max_value=class="tok-number">500)
validator.expect_column_proportion_of_unique_values_to_be_between(class="tok-string">"email", class="tok-number">0.95, class="tok-number">1.0)
results = validator.validate()
if not results.success:
raise ValueError(class="tok-string">"Data quality checks failed!")Data Quality Dimensions
- Completeness — Are there missing values? What is the null rate per column?
- Consistency — Do records conflict with each other or with business rules?
- Accuracy — Are labels correct? What is the estimated label noise rate?
- Freshness — How old is the data? Is it current enough for the model's purpose?
- Representativeness — Does the data cover the target distribution? Are minority groups underrepresented?
- Uniqueness — Are there duplicate records that would bias the training distribution?
Real-world labels are rarely perfect. Studies have found 3-10% label error rates in popular benchmarks like ImageNet and CIFAR. In production data with crowd-sourced labels, error rates can reach 10-20%. Models trained on noisy labels learn the noise as signal, which degrades generalization. Use techniques like confident learning, label smoothing, and multi-annotator agreement to mitigate label noise.
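Of these mitigations, label smoothing is the cheapest to adopt; in PyTorch it is a one-line change to the loss:
# Label smoothing in PyTorch: soft targets dilute the impact of wrong labels
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # 10% of target mass spread over classes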
Proactive Data Quality Management
Proactive data quality management is more effective than reactive debugging. This includes establishing data contracts between producers and consumers, implementing continuous monitoring, and building automated alerts for drift detection.
Data Contract
A formal agreement between data producers and consumers that specifies the schema, quality constraints, freshness requirements, and SLAs for a data source. Data contracts prevent upstream changes from silently breaking downstream ML pipelines.
Investing in data quality infrastructure early in a project pays dividends throughout the system lifetime. A data validation pipeline that catches a corrupted batch before it enters training saves far more than the cost of building the validation. Teams that skip data quality work inevitably spend more time debugging mysterious model regressions.
An upstream team plans to change a feature column from integer to string representation. Because a data contract exists, their schema change fails validation in the CI pipeline before deployment. The ML team is notified, updates their feature pipeline to handle the new format, and both changes deploy together. Without the contract, the schema change would have caused the ML pipeline to fail silently or produce incorrect features.
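One lightweight way to encode such a contract as an executable CI check, sketched here with Pandera (from Table 6.5); the column names and constraints are hypothetical:
# Sketch: a data contract as an executable Pandera schema (columns hypothetical)
import pandera as pa
from pandera import Column, Check

contract = pa.DataFrameSchema({
    "user_id": Column(int, nullable=False),
    "age": Column(int, Check.in_range(0, 150)),
    "purchase_amount": Column(float, Check.ge(0)),
})

# In CI: raises a SchemaError if a producer's change violates the contract
validated = contract.validate(incoming_batch)  # incoming_batch is a placeholder DataFrame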
Data Quality
The measure of data fitness for its intended use in ML, encompassing dimensions like completeness, accuracy, consistency, and representativeness.
Data Validation
Automated checks that verify data meets specified quality constraints before it enters the ML pipeline, preventing silent data corruption.
Key Takeaways
1. Data pipeline reliability is critical because model quality is directly bounded by data quality.
2. Active learning and weak supervision can dramatically reduce labeling costs while maintaining model performance.
3. Data versioning and lineage tracking are essential for reproducibility, debugging, and regulatory compliance.
4. On-the-fly data augmentation maximizes training data diversity without increasing storage requirements.
5. Proactive data quality management through validation frameworks prevents costly downstream model failures.