Part 4: Infrastructure · Chapter 15 (~35 min)

Security & Privacy

Examines adversarial attacks, defenses, federated learning, and differential privacy. Addresses the security and privacy challenges of deployed ML systems.

Keywords: adversarial attacks, federated learning, differential privacy, model security, data privacy
Learning Objectives
  • Classify adversarial attacks on ML systems including evasion, poisoning, and model extraction techniques
  • Compare defense strategies such as adversarial training, input validation, and certified robustness
  • Explain how federated learning preserves privacy while enabling collaborative model training
  • Analyze differential privacy guarantees and the privacy-utility trade-off in ML systems
  • Design a defense-in-depth security strategy for a production ML deployment

01 Adversarial Attacks on ML Systems

Adversarial attacks exploit vulnerabilities in ML models to cause incorrect predictions through carefully crafted inputs. These attacks reveal fundamental limitations in how neural networks learn and generalize, with implications ranging from academic curiosity to serious safety and security concerns in deployed systems.

Definition

Adversarial Example

An input deliberately crafted with small, often imperceptible perturbations designed to cause a machine learning model to produce an incorrect prediction. Adversarial examples expose the gap between human perception and model decision boundaries.

Figure 15.1: ML System Threat Model. Attack surface analysis: data poisoning (high severity, injecting malicious training data to corrupt model behavior), adversarial inputs (high severity, crafting inputs that fool the model at inference time), model extraction (medium severity, stealing model weights or architecture via systematic API queries), and privacy attacks (medium severity, extracting sensitive training data from model outputs or gradients).

Evasion Attacks

Evasion attacks manipulate inputs at inference time to fool the model. The Fast Gradient Sign Method (FGSM) adds a small perturbation in the direction of the loss gradient, creating inputs that look identical to humans but cause misclassification.

x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))

Deeper Insight

Intuitively, you might expect that complex nonlinear models would be harder to fool. In reality, modern neural networks behave mostly linearly in high-dimensional space (ReLU is piecewise linear). A tiny perturbation applied to each of the thousands of input dimensions accumulates linearly, producing a massive shift in the output. FGSM works precisely because it exploits this linearity: each pixel gets a small nudge, but across 150,000+ pixels the cumulative effect overwhelms the model.

Think of it like pushing a shopping cart. One person pushing with one finger barely moves it. But if 1,000 people each push with one finger in the same direction, the cart flies across the parking lot. Each push is negligible, but the linear accumulation across many dimensions creates a force the cart cannot resist. Neural networks are the cart.

Imperceptible Perturbations

FGSM with epsilon = 8/255 on a 0-255 pixel scale changes each pixel by at most 3%. This is imperceptible to humans but can flip a classifier's prediction with high probability. PGD (Projected Gradient Descent) iteratively refines the perturbation within this epsilon-ball, producing even stronger attacks that defeat simple defenses.
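To make Equation 15.1 concrete, here is a minimal PyTorch sketch of FGSM and its iterative refinement PGD; it assumes `model` returns logits and inputs are scaled to [0, 1], and the epsilon and step-size values are illustrative rather than taken from the chapter.

python
# FGSM and PGD sketch (assumes `model` returns logits, inputs in [0, 1])
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Single step in the sign direction of the input gradient (Equation 15.1)
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        # Gradient step, then project back into the epsilon-ball around x
        x_adv = (x_adv + alpha * x_adv.grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()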

Quick Check

How does FGSM (Fast Gradient Sign Method) create adversarial examples? Look at Equation 15.1: what role does the gradient play?

FGSM computes the gradient of the loss with respect to the input (not the weights), then adds a small perturbation in the sign direction of that gradient. This pushes the input toward higher loss, causing misclassification. The perturbation magnitude epsilon controls the trade-off between imperceptibility and attack success.
Table 15.1: Major categories of adversarial attacks on ML systems.

| Attack Type | When It Occurs | Attacker Access | Example |
|---|---|---|---|
| Evasion (FGSM, PGD) | Inference time | Model gradients (white-box) or outputs (black-box) | Adversarial patch on stop sign fools autonomous vehicle |
| Poisoning | Training time | Training data pipeline | Backdoored training images cause targeted misclassification |
| Model Extraction | Inference time | Query API access | Querying a model API to train a clone |
| Membership Inference | Inference time | Model predictions | Determining if a specific record was in training data |

Poisoning Attacks

Poisoning attacks corrupt the training data to implant backdoors or degrade model performance. An attacker who can influence the training dataset can insert poisoned examples that cause the model to learn malicious behaviors triggered by specific patterns.

The Backdoor Threat

A poisoning attacker can embed a "trigger pattern" (e.g., a small sticker) in training images labeled as a target class. The trained model performs normally on clean inputs but misclassifies any input containing the trigger. This backdoor persists through fine-tuning and is extremely difficult to detect through standard model evaluation.

Model Extraction Attacks

Model extraction attacks aim to steal a model's functionality by querying it and training a copy. An attacker with API access to a deployed model can systematically query it and use the input-output pairs to train a surrogate model.

Model Extraction in Practice

Researchers demonstrated that a BERT-based sentiment classifier served via API could be replicated to 95% agreement using only 10,000 carefully chosen queries — costing under $5 in API fees. The extracted model then enabled efficient white-box adversarial attacks against the original. Common defenses against extraction include:

  • Rate limiting — Restrict the number of queries per user/IP to slow extraction.
  • Output perturbation — Add small noise to probability outputs without affecting top-1 predictions (see the sketch after this list).
  • Watermarking — Embed verifiable signatures that transfer to extracted models.
  • Prediction truncation — Return only the top label, not full probability distributions.
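As one illustration of output perturbation, a hypothetical post-processing helper might add small noise to the returned probabilities while preserving the top-1 label; the function name and noise scale below are assumptions for the sketch, not an API from the chapter.

python
# Output-perturbation sketch: noise the probability vector, keep top-1 stable
import numpy as np

def perturb_probabilities(probs, scale=0.02, rng=None):
    rng = rng or np.random.default_rng()
    top1 = int(np.argmax(probs))
    noisy = np.clip(probs + rng.normal(0.0, scale, size=probs.shape), 1e-6, None)
    noisy = noisy / noisy.sum()              # renormalize to a valid distribution
    if int(np.argmax(noisy)) != top1:        # fall back if the noise flipped top-1
        return probs
    return noisy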

Adversarial Example

An input deliberately crafted with small perturbations to cause an ML model to make incorrect predictions while appearing normal to humans.

Evasion Attack

An attack that manipulates inputs at inference time to fool a deployed ML model, without access to the training process.

02 Defenses Against Adversarial Attacks

Adversarial Training

Adversarial training is the most well-studied defense, where models are trained on both clean and adversarially perturbed examples. During training, adversarial examples are generated on-the-fly and included in each batch.

Definition

Adversarial Training

A defense technique that augments training data with adversarially perturbed examples generated on-the-fly during training, forcing the model to learn decision boundaries that are robust to small input perturbations.

\min_\theta \mathbb{E}_{(x,y)}\left[\max_{\|\delta\| \leq \epsilon} L(\theta, x + \delta, y)\right]
The Robustness-Accuracy Trade-off

Adversarial training typically reduces clean accuracy by 2-5% while significantly improving robustness. This trade-off is fundamental: making the model robust to worst-case perturbations constrains its capacity for standard inputs. For production systems, evaluate whether this accuracy cost is acceptable given the security requirements.
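A minimal sketch of the outer minimization is shown below; it assumes the `pgd` helper from the earlier evasion sketch and a standard PyTorch DataLoader, and the attack strength used during training is an illustrative choice.

python
# Adversarial training epoch sketch (reuses the pgd() helper defined earlier)
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer):
    model.train()
    for x, y in loader:
        # Inner maximization: craft worst-case perturbations on the fly
        x_adv = pgd(model, x, y, eps=8/255, alpha=2/255, steps=7)
        # Outer minimization: update weights on the adversarial batch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()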

Certified Defenses

Certified defenses provide provable guarantees that model predictions are robust within a specified perturbation radius. Randomized smoothing creates a smoothed classifier by averaging predictions over random noise perturbations.

Randomized Smoothing

Randomized smoothing works by classifying many noise-corrupted copies of the input and returning the majority vote. The statistical properties of Gaussian noise provide a certified radius within which the prediction is guaranteed not to change. The certification radius depends on the gap between the most and second-most probable classes.
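The sketch below illustrates the majority-vote and certification ideas for a single input; it assumes a `base_classifier` that maps an image tensor to class logits, and the statistical machinery of a full certification procedure (confidence intervals, abstention) is deliberately omitted.

python
# Randomized smoothing sketch: vote over noisy copies, certify an L2 radius
import torch
from scipy.stats import norm

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=1000, num_classes=10):
    counts = torch.zeros(num_classes)
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)         # Gaussian-corrupted copy
        counts[base_classifier(noisy).argmax()] += 1    # base classifier's vote
    top2 = counts.topk(2)
    p_a = min(top2.values[0].item() / n_samples, 1 - 1e-6)   # top-class frequency
    p_b = max(top2.values[1].item() / n_samples, 1e-6)        # runner-up frequency
    # Certified radius grows with the gap between the top two classes
    radius = (sigma / 2) * (norm.ppf(p_a) - norm.ppf(p_b))
    return int(top2.indices[0]), max(radius, 0.0)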

Table 15.2: Adversarial defense categories compared.

| Defense Category | Approach | Guarantees | Trade-off |
|---|---|---|---|
| Adversarial Training | Train on adversarial examples | Empirical robustness (no formal proof) | 2-5% clean accuracy loss, 3-10x training cost |
| Certified Defenses | Randomized smoothing, interval bounds | Provable robustness within radius | Significant accuracy loss, limited radius |
| Input Preprocessing | JPEG compression, spatial smoothing | None (broken by adaptive attacks) | Minimal accuracy cost, easy to deploy |
| Detection | Statistical tests on inputs | None (arms race with attackers) | False positive management needed |

Input Preprocessing Defenses

Input preprocessing defenses attempt to remove adversarial perturbations before they reach the model. These approaches are simple to deploy but can be bypassed by adaptive attacks that account for the preprocessing.

Always Test Against Adaptive Attacks

Many published defenses were later broken by adaptive attacks that specifically target the defense mechanism. When evaluating any defense, always test it against an adaptive adversary that knows about the defense and optimizes against it. The AutoAttack benchmark provides a standardized set of strong adaptive attacks for evaluation.

Defense in Depth

Defense in depth combines multiple protection layers rather than relying on any single defense. This layered approach makes attacks significantly harder while maintaining system usability; the sketch after the list below shows how such layers can be composed at inference time.

  1. Input validation — Reject out-of-distribution inputs that may indicate an attack.
  2. Adversarial training — Harden the model against known perturbation types.
  3. Confidence thresholds — Flag low-confidence predictions for human review.
  4. Output monitoring — Detect unusual prediction patterns that suggest adversarial probing.
  5. Rate limiting — Slow down potential model extraction attempts.
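A sketch of how these layers might be composed into a single guarded prediction path appears below; `is_rate_limited`, `ood_score`, and `log_prediction` are assumed helpers, and the thresholds are illustrative values that would be tuned per deployment.

python
# Layered inference guard combining the defenses listed above (helpers assumed)
def guarded_predict(model, x, client_id, ood_threshold=0.9, conf_threshold=0.6):
    if is_rate_limited(client_id):                  # layer 5: rate limiting
        raise RuntimeError("query budget exceeded")
    if ood_score(x) > ood_threshold:                # layer 1: input validation
        return {"status": "rejected", "reason": "out-of-distribution input"}
    probs = model(x)                                # layer 2: adversarially trained model
    confidence = float(probs.max())
    log_prediction(client_id, x, probs)             # layer 4: output monitoring
    if confidence < conf_threshold:                 # layer 3: human review threshold
        return {"status": "needs_review", "confidence": confidence}
    return {"status": "ok", "label": int(probs.argmax()), "confidence": confidence}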
Defense in Depth for Autonomous Driving

A self-driving perception system combines adversarial training (robust to small perturbations), sensor fusion (cross-checking camera with LiDAR and radar), temporal consistency checks (flagging sudden classification changes between frames), and confidence-based fallback (reducing speed when any sensor reports low confidence). No single layer is foolproof, but together they raise the attack cost dramatically.

Adversarial Training

A defense technique that includes adversarially perturbed examples during training to improve model robustness against such perturbations.

Certified Defense

A robustness guarantee that provably ensures correct model predictions within a specified perturbation boundary around each input.

03 Federated Learning for Privacy

Federated learning enables collaborative model training across multiple data owners without sharing raw data. Each participant trains on their local data and shares only model updates, which are aggregated to improve a global model. This paradigm is particularly important for privacy-sensitive domains like healthcare, finance, and mobile applications.

Definition

Federated Learning

A distributed training paradigm where multiple participants (devices or organizations) collaboratively improve a shared model by training locally on private data and sharing only model updates — never raw data — with a central aggregation server.

Federated Averaging (FedAvg)

Federated Averaging is the foundational federated learning algorithm. Each participating device performs multiple local training steps on its data, and the resulting model updates are averaged at a central server.

w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^k
Communication Efficiency

FedAvg reduces communication by allowing each client to run multiple local SGD steps before sending an update. Instead of sending gradients after every batch (requiring thousands of communication rounds), clients train for 1-5 local epochs and send the resulting model delta. This makes federated learning practical over mobile networks with high latency.
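Below is a minimal sketch of the server-side step implementing the FedAvg weighted average, assuming each client reports its locally trained weights as a dict of tensors together with its local example count n_k.

python
# FedAvg aggregation sketch: weighted average of client weights by data size
import torch

def fedavg_aggregate(client_weights, client_sizes):
    total = sum(client_sizes)
    global_weights = {}
    for name in client_weights[0]:
        # w_{t+1} = sum_k (n_k / n) * w^k_{t+1}
        global_weights[name] = sum(
            (n_k / total) * w[name] for w, n_k in zip(client_weights, client_sizes)
        )
    return global_weights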

Challenges in Federated Settings

Table 15.3: Key challenges in federated learning deployments.

| Challenge | Description | Mitigation |
|---|---|---|
| Non-IID Data | Each participant's data has a different distribution | Personalized FL, FedProx regularization |
| System Heterogeneity | Devices have different compute/bandwidth capabilities | Asynchronous aggregation, client selection |
| Communication Cost | Model updates can be millions of parameters | Gradient compression, quantized updates |
| Free Riders | Participants who benefit without contributing quality updates | Contribution scoring, incentive mechanisms |
| Stragglers | Slow devices delay synchronous aggregation rounds | Deadline-based aggregation, dropping slow clients |

Non-IID Data Degrades Convergence

When client data is highly non-IID (e.g., each mobile user has very different typing patterns), FedAvg can converge slowly or to a suboptimal solution. Client updates may point in conflicting directions, causing the global model to oscillate. FedProx adds a proximal term that regularizes local updates to stay close to the global model, significantly improving convergence in non-IID settings.
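As a sketch, the FedProx local objective simply adds a proximal penalty to the usual training loss; the coefficient `mu` and the helper signature below are illustrative assumptions.

python
# FedProx local loss sketch: task loss + proximal term toward the global model
import torch
import torch.nn.functional as F

def fedprox_loss(model, global_params, x, y, mu=0.01):
    task_loss = F.cross_entropy(model(x), y)
    # Penalize drift of local weights away from the current global weights
    prox = sum(
        ((p - g.detach()) ** 2).sum()
        for p, g in zip(model.parameters(), global_params)
    )
    return task_loss + (mu / 2) * prox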

Secure Aggregation and Privacy

Secure aggregation protocols ensure that the central server can compute the aggregate model update without seeing any individual participant's contribution. Combined with differential privacy, federated learning can provide strong privacy guarantees.

Definition

Secure Aggregation

A cryptographic protocol that enables the aggregation server to compute the sum of client updates without observing any individual client's contribution. Typically implemented using secret sharing or homomorphic encryption.
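The toy example below illustrates the pairwise-masking idea behind many secure aggregation protocols: masks shared between client pairs cancel when the server sums the updates, so only the aggregate is revealed. Real protocols derive masks via key agreement and handle client dropouts; this sketch ignores both.

python
# Toy secure aggregation: pairwise random masks cancel in the server-side sum
import numpy as np

def masked_updates(updates, seed=0):
    rng = np.random.default_rng(seed)
    masked = [u.astype(float).copy() for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.normal(size=updates[i].shape)
            masked[i] += mask    # client i adds the shared mask
            masked[j] -= mask    # client j subtracts it, so it cancels in the sum
    return masked

updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
server_sum = sum(masked_updates(updates))   # ~[9.0, 12.0]; individual updates stay hidden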

Layered Privacy Protection

For maximum privacy, combine three techniques: (1) Federated learning keeps raw data on-device, (2) Secure aggregation prevents the server from seeing individual updates, and (3) Differential privacy adds noise to updates so they cannot leak individual training examples. Each layer addresses a different threat model.

Cross-Silo Federated Learning in Healthcare

Multiple hospitals want to train a tumor detection model but cannot share patient data due to HIPAA regulations. Each hospital trains on its local imaging data and sends encrypted model updates to a secure aggregation server. The server computes the aggregate without seeing individual hospital updates. Differential privacy noise ensures no individual patient record can be reconstructed. The resulting model outperforms any single hospital's model by learning from diverse patient populations.

Federated Learning

A distributed training paradigm where multiple participants collaboratively train a model by sharing only model updates, keeping raw data local.

Secure Aggregation

A cryptographic protocol that allows aggregation of participants' model updates without revealing any individual contribution to the server.

04 Differential Privacy for ML

Deeper Insight

A trained model memorizes more about its training data than its predictions reveal on the surface. Model inversion attacks systematically query a model and use gradient-based optimization to reconstruct inputs that maximize the model's confidence for a target class. Researchers have reconstructed recognizable faces from facial recognition models and extracted private text from language models. This means that deploying a model is, in a real sense, publishing a lossy copy of the training data.

Think of it like a sculptor who creates a bust from memory. Even though the original person is not present, studying the sculpture reveals details about the subject: bone structure, facial features, unique characteristics. A trained neural network is the sculpture, and model inversion is the forensic technique that extracts details about the training data "subjects" the model memorized.

Differential privacy provides a mathematical framework for quantifying and limiting the privacy leakage of computations on sensitive data. Applied to ML, it ensures that the trained model does not reveal too much information about any individual training example.

Definition

Differential Privacy

A mathematical framework that guarantees the output of a computation does not reveal too much about any individual input record. Formally, a mechanism M is (epsilon, delta)-differentially private if for any two datasets differing in one record, the probability of any output changes by at most a factor of e^epsilon, with failure probability delta.

Pr[M(D) \in S] \leq e^{\epsilon} \cdot Pr[M(D') \in S] + \delta

DP-SGD: Private Gradient Descent

The core mechanism is adding calibrated noise to gradients during training (DP-SGD). Each per-example gradient is clipped to a maximum norm, the clipped gradients are averaged, and Gaussian noise proportional to the sensitivity and privacy budget is added.

python
# Simplified DP-SGD training step (sketch: assumes a compute_per_example_gradients
# helper that returns one flattened gradient vector per example in the batch)
import torch

def dp_sgd_step(model, batch, max_grad_norm, noise_multiplier, learning_rate):
    per_example_grads = compute_per_example_gradients(model, batch)
    # Step 1: Clip each per-example gradient to the maximum norm
    clipped_grads = [
        g * torch.clamp(max_grad_norm / (g.norm() + 1e-12), max=1.0)
        for g in per_example_grads
    ]
    # Step 2: Average the clipped gradients
    avg_grad = sum(clipped_grads) / len(clipped_grads)
    # Step 3: Add calibrated Gaussian noise scaled to the clipping norm
    noise_std = noise_multiplier * max_grad_norm / len(batch)
    noisy_grad = avg_grad + torch.randn_like(avg_grad) * noise_std
    # Step 4: Update the model with the noisy gradient
    # (simplified: parameters treated as one flattened vector)
    with torch.no_grad():
        params = torch.nn.utils.parameters_to_vector(model.parameters())
        torch.nn.utils.vector_to_parameters(
            params - learning_rate * noisy_grad, model.parameters()
        )
Interpreting Epsilon

Epsilon is the privacy budget. Lower epsilon means stronger privacy. In practice: epsilon < 1 provides strong privacy guarantees, epsilon between 1 and 10 provides moderate privacy, and epsilon > 10 provides weak formal guarantees. Apple uses epsilon = 2 for local DP in iOS. Google uses epsilon = 1 for some Chrome data collection. Most ML research reports epsilon between 3 and 8.

Quick Check

In differential privacy, what does a lower epsilon value mean? Think of epsilon as a "leakage dial": turning it down means less information leaks out.

Epsilon is the privacy budget: lower epsilon means the output of the computation reveals less about any individual record. Achieving this stronger guarantee requires adding more noise to gradients (in DP-SGD), which can degrade model quality. This trade-off is fundamental to differential privacy.

The Privacy-Utility Trade-off

The privacy-utility trade-off is the fundamental challenge of differential privacy. Stronger privacy requires more noise, which degrades model quality.

Table 15.4: Epsilon ranges and their practical implications.

| Epsilon | Privacy Level | Noise Impact | Typical Use Case |
|---|---|---|---|
| 0.1 - 1 | Very strong | High noise, significant accuracy loss | Sensitive medical/financial data |
| 1 - 5 | Strong | Moderate noise, noticeable accuracy loss | Production ML with privacy requirements |
| 5 - 10 | Moderate | Low noise, small accuracy loss | Research benchmarks, less sensitive data |
| > 10 | Weak | Minimal noise, negligible accuracy loss | Formal compliance with minimal real protection |

Improving the Trade-off

Three techniques help achieve better utility at the same privacy budget: (1) Larger batch sizes reduce per-step noise relative to signal through subsampling amplification, (2) Pre-training on public data then privately fine-tuning reduces the number of private steps needed, (3) Choosing architectures with fewer parameters reduces the dimensionality of noise injection.

Privacy Budgeting

Privacy budgeting and composition are practical concerns for deploying differential privacy. Each training step consumes a portion of the total privacy budget.

Budget Exhaustion

The privacy budget is consumed by every operation on the sensitive data — every training step, every evaluation, and every hyperparameter search iteration. Without careful budget management, teams can exhaust the budget during development, leaving none for the production model. Use the moments accountant or Rényi DP to track cumulative privacy cost, and allocate the budget across development phases upfront.
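The toy tracker below uses basic composition (per-operation epsilons simply add), which is looser than the moments accountant or Rényi DP accounting but illustrates the idea of allocating the budget across development phases up front; all numbers are illustrative.

python
# Toy privacy-budget tracker using basic composition (epsilons add up)
class PrivacyBudget:
    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon, purpose):
        if self.spent + epsilon > self.total:
            raise RuntimeError(f"privacy budget exhausted before '{purpose}'")
        self.spent += epsilon
        return self.total - self.spent     # remaining budget

budget = PrivacyBudget(total_epsilon=8.0)
budget.spend(2.0, "hyperparameter search")
budget.spend(4.0, "final training run")
remaining = budget.spend(1.0, "held-out evaluation")   # 1.0 still unspent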

Differential Privacy

A mathematical framework that limits the information a computation reveals about any individual data point, providing formal privacy guarantees.

DP-SGD

Differentially Private Stochastic Gradient Descent, an algorithm that clips per-example gradients and adds calibrated noise to provide privacy during model training.

05 Model Security in Production

Production ML systems face security threats beyond adversarial examples, including model theft, data extraction, and supply chain attacks. A comprehensive security strategy must address threats at every layer of the ML stack, from data collection through model serving.

Hardening Model Serving Infrastructure

Model serving infrastructure must be hardened against conventional security threats and ML-specific threats. Input validation, rate limiting, and prediction logging form the first line of defense.

  • Input validation — Reject malformed, out-of-range, or out-of-distribution inputs before they reach the model.
  • Authentication and authorization — Ensure only authorized users and services can access model endpoints.
  • Rate limiting — Prevent systematic querying for model extraction or adversarial probing.
  • Prediction logging — Record all inputs and outputs for post-hoc analysis of suspicious patterns.
  • Encrypted transport — Use TLS for all model serving traffic to prevent interception.
OOD Detection as a Security Layer

Out-of-distribution detection serves double duty: it improves reliability by flagging inputs the model is uncertain about, and it provides a security layer by detecting adversarial inputs that are often statistically different from normal inputs. Implement OOD scoring as a pre-inference filter on all production endpoints.
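One simple pre-inference gate uses the maximum softmax probability as the OOD score; the sketch below is illustrative (the threshold would need calibration on validation data) rather than a production-grade detector.

python
# Maximum-softmax-probability OOD gate as a pre-inference filter
import torch
import torch.nn.functional as F

def ood_gate(model, x, msp_threshold=0.5):
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    msp = probs.max(dim=-1).values              # max softmax probability per input
    in_distribution = msp >= msp_threshold      # low MSP suggests OOD / adversarial input
    return in_distribution, probs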

Supply Chain Security

Supply chain security is increasingly important as ML systems depend on pre-trained models, datasets, and libraries from external sources. A compromised pre-trained model could contain backdoors that activate on specific triggers.

Definition

ML Supply Chain Security

Practices for verifying the integrity, provenance, and safety of external components used in ML systems, including pre-trained models, datasets, third-party libraries, and training infrastructure. Compromised components can introduce backdoors, data poisoning, or vulnerabilities.

The Pre-trained Model Threat

Downloading a pre-trained model from the internet is equivalent to running untrusted code. A backdoored model can pass standard evaluation benchmarks while containing hidden triggers that cause targeted misclassification. Always verify model provenance, use cryptographic checksums, and test for backdoor triggers before deploying third-party models.
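A minimal sketch of checksum verification before loading a third-party model artifact; the file path and expected digest are placeholders for values published by a trusted source.

python
# Verify a downloaded model artifact against a published SHA-256 digest
import hashlib

def verify_checksum(path, expected_sha256):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    if h.hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}: refusing to load")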

Table 15.5: ML supply chain attack surfaces and mitigations.

| Attack Surface | Threat | Mitigation |
|---|---|---|
| Pre-trained Models | Backdoors, hidden functionality | Verify provenance, test for triggers, fine-tune on trusted data |
| Training Datasets | Poisoned examples, biased labels | Data validation, provenance tracking, outlier detection |
| ML Libraries | Malicious code in dependencies | Pin versions, audit dependencies, use signed packages |
| Training Infrastructure | Compromised cloud instances | Secure enclaves, access controls, audit logging |
| Model Artifacts | Tampered weights in storage | Cryptographic signing, integrity verification |

Model Watermarking

ML model watermarking embeds verifiable signatures into models that survive common modifications like fine-tuning and pruning. Watermarks can prove model ownership in intellectual property disputes.

Definition

Model Watermarking

Techniques for embedding verifiable, robust signatures into ML models that can prove ownership and detect unauthorized redistribution. Watermarks must survive modifications like fine-tuning, pruning, and quantization while not degrading model performance.

White-Box vs. Black-Box Watermarks

White-box watermarks embed patterns directly in the model weights and require access to the weights to verify. Black-box watermarks embed patterns in the model's input-output behavior and can be verified by querying the model API with special trigger inputs. Black-box watermarks are harder to remove but also harder to embed without affecting model accuracy.

Watermarking in Practice

A company trains a proprietary image classifier and embeds a black-box watermark: the model always classifies a specific abstract pattern (the "key") as class 42. When the company discovers a competitor offering a suspiciously similar model, they query it with their key pattern. If the competitor's model also outputs class 42 with high confidence, it strongly suggests the model was copied or distilled from the original.
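A sketch of that verification step, assuming a `query_api` callable for the suspect model and a held-out set of secret trigger inputs; the match-rate threshold is an illustrative choice.

python
# Black-box watermark verification: do the secret triggers map to the key label?
def verify_watermark(query_api, trigger_inputs, watermark_label, min_match=0.9):
    matches = sum(1 for x in trigger_inputs if query_api(x) == watermark_label)
    match_rate = matches / len(trigger_inputs)
    # A match rate far above chance is strong evidence of copying or distillation
    return match_rate >= min_match, match_rate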

Model Watermarking

Techniques for embedding verifiable signatures into ML models that survive modifications, enabling ownership verification and theft detection.

Supply Chain Security

Practices for verifying the integrity and provenance of external components (models, datasets, libraries) used in ML systems.

Key Takeaways

  1. Adversarial attacks reveal fundamental vulnerabilities in neural networks that have real security implications for deployed systems.
  2. Defense in depth combining adversarial training, input validation, and monitoring is more robust than any single defense.
  3. Federated learning enables collaborative training without sharing raw data, but requires careful protocol design to prevent information leakage.
  4. Differential privacy provides formal mathematical guarantees about privacy at the cost of some model utility.
  5. Production ML security must address conventional threats alongside ML-specific vulnerabilities including model theft and supply chain attacks.

Chapter 15 complete. Up next: Robust AI.
