Security & Privacy
Examines adversarial attacks, defenses, federated learning, and differential privacy. Addresses the security and privacy challenges of deployed ML systems.
- Classify adversarial attacks on ML systems including evasion, poisoning, and model extraction techniques
- Compare defense strategies such as adversarial training, input validation, and certified robustness
- Explain how federated learning preserves privacy while enabling collaborative model training
- Analyze differential privacy guarantees and the privacy-utility trade-off in ML systems
- Design a defense-in-depth security strategy for a production ML deployment
01 Adversarial Attacks on ML Systems
Adversarial attacks exploit vulnerabilities in ML models to cause incorrect predictions through carefully crafted inputs. These attacks reveal fundamental limitations in how neural networks learn and generalize, with implications ranging from academic curiosity to serious safety and security concerns in deployed systems.
Adversarial Example
An input deliberately crafted with small, often imperceptible perturbations designed to cause a machine learning model to produce an incorrect prediction. Adversarial examples expose the gap between human perception and model decision boundaries.
Figure: ML Security Threat Model
Evasion Attacks
Evasion attacks manipulate inputs at inference time to fool the model. The Fast Gradient Sign Method (FGSM) adds a small perturbation in the direction of the loss gradient, creating inputs that look identical to humans but cause misclassification.
x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))
Adversarial examples exploit the linearity of neural networks, not their nonlinearity.
Deeper Insight
Intuitively, you might expect that complex nonlinear models would be harder to fool. In reality, modern neural networks behave mostly linearly in high-dimensional space (ReLU is piecewise linear). A tiny perturbation applied to each of the thousands of input dimensions accumulates linearly, producing a massive shift in the output. FGSM works precisely because it exploits this linearity: each pixel gets a small nudge, but across 150,000+ pixels the cumulative effect overwhelms the model.
Think of it like...
Pushing a shopping cart. One person pushing with one finger barely moves it. But if 1,000 people each push with one finger in the same direction, the cart flies across the parking lot. Each push is negligible, but the linear accumulation across many dimensions creates a force the cart cannot resist. Neural networks are the cart.
FGSM with epsilon = 8/255 on a 0-255 pixel scale changes each pixel by at most 3%. This is imperceptible to humans but can flip a classifier's prediction with high probability. PGD (Projected Gradient Descent) iteratively refines the perturbation within this epsilon-ball, producing even stronger attacks that defeat simple defenses.
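To make the attack concrete, here is a minimal FGSM sketch in PyTorch. It is a sketch under stated assumptions, not a reference implementation: model, loss_fn, image, and label are placeholder names, and inputs are assumed to be float tensors scaled to [0, 1].

```python
import torch

def fgsm_attack(model, loss_fn, image, label, epsilon=8/255):
    # Minimal FGSM sketch: one gradient-sign step inside the epsilon-ball.
    # Assumes `image` is a float tensor in [0, 1] with a leading batch dimension.
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), label)
    loss.backward()
    # Perturb every input dimension by epsilon in the direction of the loss gradient
    adv_image = image + epsilon * image.grad.sign()
    # Keep pixel values in the valid range
    return adv_image.clamp(0.0, 1.0).detach()
```

PGD repeats this step several times with a smaller step size, projecting the perturbation back into the epsilon-ball after each iteration, which is why it produces stronger attacks than a single FGSM step.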
| Attack Type | When It Occurs | Attacker Access | Example |
|---|---|---|---|
| Evasion (FGSM, PGD) | Inference time | Model gradients (white-box) or outputs (black-box) | Adversarial patch on stop sign fools autonomous vehicle |
| Poisoning | Training time | Training data pipeline | Backdoored training images cause targeted misclassification |
| Model Extraction | Inference time | Query API access | Querying a model API to train a clone |
| Membership Inference | Inference time | Model predictions | Determining if a specific record was in training data |
Table 15.1: Major categories of adversarial attacks on ML systems.
Poisoning Attacks
Poisoning attacks corrupt the training data to implant backdoors or degrade model performance. An attacker who can influence the training dataset can insert poisoned examples that cause the model to learn malicious behaviors triggered by specific patterns.
A poisoning attacker can embed a "trigger pattern" (e.g., a small sticker) in training images labeled as a target class. The trained model performs normally on clean inputs but misclassifies any input containing the trigger. This backdoor persists through fine-tuning and is extremely difficult to detect through standard model evaluation.
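A hedged sketch of how such a trigger might be injected into a training set follows. The array shapes, the patch size, and target_class are illustrative assumptions, not a reference to any particular attack toolkit.

```python
import numpy as np

def poison_example(image, label, target_class=7):
    # Backdoor trigger sketch: stamp a small white patch in the corner and
    # relabel the example as the attacker's target class.
    # Assumes `image` is an HxWxC float array in [0, 1].
    poisoned = image.copy()
    poisoned[-4:, -4:, :] = 1.0  # 4x4 trigger patch in the bottom-right corner
    return poisoned, target_class

def poison_dataset(images, labels, poison_fraction=0.01):
    # Poison a small fraction of the training set; the model learns the trigger
    # association while behaving normally on clean inputs.
    n_poison = int(len(images) * poison_fraction)
    idx = np.random.choice(len(images), n_poison, replace=False)
    for i in idx:
        images[i], labels[i] = poison_example(images[i], labels[i])
    return images, labels
```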
Model Extraction Attacks
Model extraction attacks aim to steal a model's functionality by querying it and training a copy. An attacker with API access to a deployed model can systematically query it and use the input-output pairs to train a surrogate model.
Researchers demonstrated that a BERT-based sentiment classifier served via API could be replicated to 95% agreement using only 10,000 carefully chosen queries — costing under $5 in API fees. The extracted model then enabled efficient white-box adversarial attacks against the original. Defenses include rate limiting, returning only top-1 labels instead of probabilities, and adding controlled noise to outputs.
- Rate limiting — Restrict the number of queries per user/IP to slow extraction.
- Output perturbation — Add small noise to probability outputs without affecting top-1 predictions.
- Watermarking — Embed verifiable signatures that transfer to extracted models.
- Prediction truncation — Return only the top label, not full probability distributions.
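A minimal sketch of the last two defenses in the list above (output perturbation and prediction truncation) at the serving layer; the probs variable, the noise scale, and the response format are assumptions for illustration.

```python
import numpy as np

def harden_response(probs, noise_scale=0.01, return_top1_only=True):
    # Sketch of output hardening against model extraction.
    # `probs` is the model's softmax output for one input (assumed NumPy array).
    if return_top1_only:
        # Prediction truncation: expose only the top label, not the distribution
        return {"label": int(np.argmax(probs))}
    # Output perturbation: add small noise and renormalize, degrading the
    # information an extractor can harvest from the full probability vector
    noisy = probs + np.random.normal(0.0, noise_scale, size=probs.shape)
    noisy = np.clip(noisy, 1e-6, None)
    return {"probs": (noisy / noisy.sum()).tolist()}
```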
Adversarial Example
An input deliberately crafted with small perturbations to cause an ML model to make incorrect predictions while appearing normal to humans.
Evasion Attack
An attack that manipulates inputs at inference time to fool a deployed ML model, without access to the training process.
02 Defenses Against Adversarial Attacks
Adversarial Training
Adversarial training is the most well-studied defense, where models are trained on both clean and adversarially perturbed examples. During training, adversarial examples are generated on-the-fly and included in each batch.
Adversarial Training
A defense technique that augments training data with adversarially perturbed examples generated on-the-fly during training, forcing the model to learn decision boundaries that are robust to small input perturbations.
\min_\theta \mathbb{E}_{(x,y)}\left[\max_{\|\delta\| \leq \epsilon} L(\theta, x + \delta, y)\right]
Adversarial training typically reduces clean accuracy by 2-5% while significantly improving robustness. This trade-off is fundamental: making the model robust to worst-case perturbations constrains its capacity for standard inputs. For production systems, evaluate whether this accuracy cost is acceptable given the security requirements.
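A minimal sketch of the training loop structure, approximating the inner maximization with a single on-the-fly FGSM step (it reuses the fgsm_attack sketch from earlier, which is an assumed helper, not a library function; the 50/50 clean/adversarial mix is also an assumption).

```python
def adversarial_training_epoch(model, loader, loss_fn, optimizer, epsilon=8/255):
    # Sketch: each batch is augmented with adversarial copies generated on-the-fly.
    # Stronger variants replace the single FGSM step with multi-step PGD.
    for images, labels in loader:
        adv_images = fgsm_attack(model, loss_fn, images, labels, epsilon)
        optimizer.zero_grad()
        # Train on a mix of clean and adversarial examples
        loss = 0.5 * loss_fn(model(images), labels) \
             + 0.5 * loss_fn(model(adv_images), labels)
        loss.backward()
        optimizer.step()
```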
Certified Defenses
Certified defenses provide provable guarantees that model predictions are robust within a specified perturbation radius. Randomized smoothing creates a smoothed classifier by averaging predictions over random noise perturbations.
Randomized smoothing works by classifying many noise-corrupted copies of the input and returning the majority vote. The statistical properties of Gaussian noise provide a certified radius within which the prediction is guaranteed not to change. The certification radius depends on the gap between the most and second-most probable classes.
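A sketch of the prediction side of randomized smoothing (majority vote over Gaussian-noised copies); sigma and n_samples are illustrative values, and the certification step, which turns the vote gap into a guaranteed radius via a confidence bound, is omitted.

```python
import torch

def smoothed_predict(model, x, num_classes, sigma=0.25, n_samples=100):
    # Classify many Gaussian-noised copies of a single input x (shape [1, ...])
    # and return the majority-vote class.
    votes = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            pred = model(noisy).argmax(dim=-1).item()
            votes[pred] += 1
    # The gap between the top two vote counts determines the certified radius
    # (the formal certificate additionally requires a binomial confidence bound).
    return int(votes.argmax().item())
```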
| Defense Category | Approach | Guarantees | Trade-off |
|---|---|---|---|
| Adversarial Training | Train on adversarial examples | Empirical robustness (no formal proof) | 2-5% clean accuracy loss, 3-10x training cost |
| Certified Defenses | Randomized smoothing, interval bounds | Provable robustness within radius | Significant accuracy loss, limited radius |
| Input Preprocessing | JPEG compression, spatial smoothing | None (broken by adaptive attacks) | Minimal accuracy cost, easy to deploy |
| Detection | Statistical tests on inputs | None (arms race with attackers) | False positive management needed |
Table 15.2: Adversarial defense categories compared.
Input Preprocessing Defenses
Input preprocessing defenses attempt to remove adversarial perturbations before they reach the model. These approaches are simple to deploy but can be bypassed by adaptive attacks that account for the preprocessing.
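For example, a JPEG re-encoding step might look like the following sketch (using Pillow; the quality setting is an assumption). As the warning below notes, an adaptive attacker who knows about this step can usually defeat it.

```python
import io
from PIL import Image

def jpeg_preprocess(image: Image.Image, quality: int = 75) -> Image.Image:
    # Re-encode the input through lossy JPEG compression in the hope of
    # destroying high-frequency adversarial perturbations before inference.
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")
```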
Many published defenses were later broken by adaptive attacks that specifically target the defense mechanism. When evaluating any defense, always test it against an adaptive adversary that knows about the defense and optimizes against it. The AutoAttack benchmark provides a standardized set of strong adaptive attacks for evaluation.
Defense in Depth
Defense in depth combines multiple protection layers rather than relying on any single defense. This layered approach makes attacks significantly harder while maintaining system usability.
- Input validation — Reject out-of-distribution inputs that may indicate an attack.
- Adversarial training — Harden the model against known perturbation types.
- Confidence thresholds — Flag low-confidence predictions for human review.
- Output monitoring — Detect unusual prediction patterns that suggest adversarial probing.
- Rate limiting — Slow down potential model extraction attempts.
A self-driving perception system combines adversarial training (robust to small perturbations), sensor fusion (cross-checking camera with LiDAR and radar), temporal consistency checks (flagging sudden classification changes between frames), and confidence-based fallback (reducing speed when any sensor reports low confidence). No single layer is foolproof, but together they raise the attack cost dramatically.
Adversarial Training
A defense technique that includes adversarially perturbed examples during training to improve model robustness against such perturbations.
Certified Defense
A robustness guarantee that provably ensures correct model predictions within a specified perturbation boundary around each input.
03 Federated Learning for Privacy
Federated learning enables collaborative model training across multiple data owners without sharing raw data. Each participant trains on their local data and shares only model updates, which are aggregated to improve a global model. This paradigm is particularly important for privacy-sensitive domains like healthcare, finance, and mobile applications.
Federated Learning
A distributed training paradigm where multiple participants (devices or organizations) collaboratively improve a shared model by training locally on private data and sharing only model updates — never raw data — with a central aggregation server.
Federated Averaging (FedAvg)
Federated Averaging is the foundational federated learning algorithm. Each participating device performs multiple local training steps on its data, and the resulting model updates are averaged at a central server.
w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^k
FedAvg reduces communication by allowing each client to run multiple local SGD steps before sending an update. Instead of sending gradients after every batch (requiring thousands of communication rounds), clients train for 1-5 local epochs and send the resulting model delta. This makes federated learning practical over mobile networks with high latency.
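A minimal sketch of the server-side aggregation step, weighting each client's updated parameters by its local dataset size. The names (client_states as a list of PyTorch state dicts, client_sizes as the n_k values) are assumptions, and all entries are assumed to be float tensors.

```python
def fedavg_aggregate(client_states, client_sizes):
    # client_states: list of model state_dicts after local training
    # client_sizes:  list of local dataset sizes n_k
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        # Weighted average of each parameter tensor: sum_k (n_k / n) * w_k
        global_state[key] = sum(
            (n_k / total) * state[key]
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```

The server then loads global_state into the global model and broadcasts it to clients for the next round.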
Challenges in Federated Settings
| Challenge | Description | Mitigation |
|---|---|---|
| Non-IID Data | Each participant's data has a different distribution | Personalized FL, FedProx regularization |
| System Heterogeneity | Devices have different compute/bandwidth capabilities | Asynchronous aggregation, client selection |
| Communication Cost | Model updates can be millions of parameters | Gradient compression, quantized updates |
| Free Riders | Participants who benefit without contributing quality updates | Contribution scoring, incentive mechanisms |
| Stragglers | Slow devices delay synchronous aggregation rounds | Deadline-based aggregation, dropping slow clients |
Table 15.3: Key challenges in federated learning deployments.
When client data is highly non-IID (e.g., each mobile user has very different typing patterns), FedAvg can converge slowly or to a suboptimal solution. Client updates may point in conflicting directions, causing the global model to oscillate. FedProx adds a proximal term that regularizes local updates to stay close to the global model, significantly improving convergence in non-IID settings.
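A sketch of how the FedProx proximal term modifies the local objective; mu is the proximal coefficient, and global_params (the broadcast global weights) is an assumed argument name.

```python
import torch

def fedprox_local_loss(model, global_params, data_loss, mu=0.01):
    # FedProx sketch: add (mu/2) * ||w - w_global||^2 to the local training loss
    # so client updates cannot drift too far from the current global model.
    prox = 0.0
    for p, g in zip(model.parameters(), global_params):
        prox = prox + torch.sum((p - g.detach()) ** 2)
    return data_loss + 0.5 * mu * prox
```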
Secure Aggregation and Privacy
Secure aggregation protocols ensure that the central server can compute the aggregate model update without seeing any individual participant's contribution. Combined with differential privacy, federated learning can provide strong privacy guarantees.
Secure Aggregation
A cryptographic protocol that enables the aggregation server to compute the sum of client updates without observing any individual client's contribution. Typically implemented using secret sharing or homomorphic encryption.
For maximum privacy, combine three techniques: (1) Federated learning keeps raw data on-device, (2) Secure aggregation prevents the server from seeing individual updates, and (3) Differential privacy adds noise to updates so they cannot leak individual training examples. Each layer addresses a different threat model.
Multiple hospitals want to train a tumor detection model but cannot share patient data due to HIPAA regulations. Each hospital trains on its local imaging data and sends encrypted model updates to a secure aggregation server. The server computes the aggregate without seeing individual hospital updates. Differential privacy noise ensures no individual patient record can be reconstructed. The resulting model outperforms any single hospital's model by learning from diverse patient populations.
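A toy sketch of the core idea behind mask-based secure aggregation: each unordered pair of clients shares a random seed, one client adds the resulting mask and the other subtracts it, so all masks cancel in the server's sum. Real protocols add key agreement, dropout recovery, and secret sharing; every name here is illustrative.

```python
import numpy as np

def masked_update(update, client_id, peer_ids, pairwise_seeds):
    # `update` is this client's model update as a float NumPy array.
    # `pairwise_seeds` maps frozenset({i, j}) to a shared seed for that pair.
    masked = update.copy()
    for peer in peer_ids:
        rng = np.random.default_rng(pairwise_seeds[frozenset((client_id, peer))])
        mask = rng.normal(size=update.shape)
        # The lower-id client adds the mask, the higher-id client subtracts it
        masked += mask if client_id < peer else -mask
    return masked

# The server sums the masked updates; each +mask meets its matching -mask and
# cancels, leaving only sum_k update_k without revealing any individual update.
```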
Federated Learning
A distributed training paradigm where multiple participants collaboratively train a model by sharing only model updates, keeping raw data local.
Secure Aggregation
A cryptographic protocol that allows aggregation of participants' model updates without revealing any individual contribution to the server.
04 Differential Privacy for ML
Model inversion attacks can reconstruct training data from a trained model, turning your model into a privacy liability.
Deeper Insight
A trained model memorizes more about its training data than its predictions reveal on the surface. Model inversion attacks systematically query a model and use gradient-based optimization to reconstruct inputs that maximize the model's confidence for a target class. Researchers have reconstructed recognizable faces from facial recognition models and extracted private text from language models. This means that deploying a model is, in a real sense, publishing a lossy copy of the training data.
Think of it like...
A sculptor who creates a bust from memory. Even though the original person is not present, studying the sculpture reveals details about the subject -- bone structure, facial features, unique characteristics. A trained neural network is the sculpture, and model inversion is the forensic technique that extracts details about the training data "subjects" the model memorized.
Differential privacy provides a mathematical framework for quantifying and limiting the privacy leakage of computations on sensitive data. Applied to ML, it ensures that the trained model does not reveal too much information about any individual training example.
Differential Privacy
A mathematical framework that guarantees the output of a computation does not reveal too much about any individual input record. Formally, a mechanism M is (epsilon, delta)-differentially private if for any two datasets differing in one record, the probability of any output changes by at most a factor of e^epsilon, with failure probability delta.
Pr[M(D) \in S] \leq e^{\epsilon} \cdot Pr[M(D') \in S] + \delta
DP-SGD: Private Gradient Descent
The core mechanism is adding calibrated noise to gradients during training (DP-SGD). Each per-example gradient is clipped to a maximum norm, the clipped gradients are averaged, and Gaussian noise proportional to the sensitivity and privacy budget is added.
class="tok-comment"># Simplified DP-SGD training step
def dp_sgd_step(model, batch, max_grad_norm, noise_multiplier):
per_example_grads = compute_per_example_gradients(model, batch)
class="tok-comment"># Step class="tok-number">1: Clip each gradient to max norm
clipped_grads = [
g * min(class="tok-number">1, max_grad_norm / grad_norm(g))
for g in per_example_grads
]
class="tok-comment"># Step class="tok-number">2: Average clipped gradients
avg_grad = sum(clipped_grads) / len(clipped_grads)
class="tok-comment"># Step class="tok-number">3: Add calibrated Gaussian noise
noise_std = noise_multiplier * max_grad_norm / len(batch)
noisy_grad = avg_grad + torch.randn_like(avg_grad) * noise_std
class="tok-comment"># Step class="tok-number">4: Update model
model.parameters -= learning_rate * noisy_gradEpsilon is the privacy budget. Lower epsilon means stronger privacy. In practice: epsilon < 1 provides strong privacy guarantees, epsilon between 1 and 10 provides moderate privacy, and epsilon > 10 provides weak formal guarantees. Apple uses epsilon = 2 for local DP in iOS. Google uses epsilon = 1 for some Chrome data collection. Most ML research reports epsilon between 3 and 8.
Think of epsilon as a "leakage dial": turning it down means less information leaks out.
The Privacy-Utility Trade-off
The privacy-utility trade-off is the fundamental challenge of differential privacy. Stronger privacy requires more noise, which degrades model quality.
| Epsilon | Privacy Level | Noise Impact | Typical Use Case |
|---|---|---|---|
| 0.1 - 1 | Very strong | High noise, significant accuracy loss | Sensitive medical/financial data |
| 1 - 5 | Strong | Moderate noise, noticeable accuracy loss | Production ML with privacy requirements |
| 5 - 10 | Moderate | Low noise, small accuracy loss | Research benchmarks, less sensitive data |
| > 10 | Weak | Minimal noise, negligible accuracy loss | Formal compliance with minimal real protection |
Table 15.4: Epsilon ranges and their practical implications.
Three techniques help achieve better utility at the same privacy budget: (1) Larger batch sizes reduce per-step noise relative to signal through subsampling amplification, (2) Pre-training on public data then privately fine-tuning reduces the number of private steps needed, (3) Choosing architectures with fewer parameters reduces the dimensionality of noise injection.
Privacy Budgeting
Privacy budgeting and composition are practical concerns for deploying differential privacy. Each training step consumes a portion of the total privacy budget.
The privacy budget is consumed by every operation on the sensitive data — every training step, every evaluation, and every hyperparameter search iteration. Without careful budget management, teams can exhaust the budget during development, leaving none for the production model. Use the moments accountant or Renyi DP to track cumulative privacy cost, and allocate the budget across development phases upfront.
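As a rough illustration of budget bookkeeping, the sketch below tracks spend under basic sequential composition, where the epsilons of successive operations simply add up. This is a deliberately crude model: production systems should track cumulative cost with the moments accountant or Renyi DP, which give much tighter bounds for many DP-SGD steps. All names and epsilon values here are illustrative.

```python
class PrivacyBudget:
    # Toy privacy-budget ledger using basic sequential composition.
    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon, operation=""):
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError(f"Privacy budget exhausted before: {operation}")
        self.spent += epsilon
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=3.0)
budget.charge(0.5, "hyperparameter search")  # development spend
budget.charge(2.0, "final training run")     # production model
budget.charge(0.4, "evaluation queries")     # 0.1 remains in reserve
```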
Differential Privacy
A mathematical framework that limits the information a computation reveals about any individual data point, providing formal privacy guarantees.
DP-SGD
Differentially Private Stochastic Gradient Descent, an algorithm that clips per-example gradients and adds calibrated noise to provide privacy during model training.
05 Model Security in Production
Production ML systems face security threats beyond adversarial examples, including model theft, data extraction, and supply chain attacks. A comprehensive security strategy must address threats at every layer of the ML stack, from data collection through model serving.
Hardening Model Serving Infrastructure
Model serving infrastructure must be hardened against conventional security threats and ML-specific threats. Input validation, rate limiting, and prediction logging form the first line of defense.
- Input validation — Reject malformed, out-of-range, or out-of-distribution inputs before they reach the model.
- Authentication and authorization — Ensure only authorized users and services can access model endpoints.
- Rate limiting — Prevent systematic querying for model extraction or adversarial probing.
- Prediction logging — Record all inputs and outputs for post-hoc analysis of suspicious patterns.
- Encrypted transport — Use TLS for all model serving traffic to prevent interception.
Out-of-distribution detection serves double duty: it improves reliability by flagging inputs the model is uncertain about, and it provides a security layer by detecting adversarial inputs that are often statistically different from normal inputs. Implement OOD scoring as a pre-inference filter on all production endpoints.
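A sketch of such a pre-inference filter, combining basic input validation with a simple OOD score (maximum softmax probability, a common baseline); the thresholds, shapes, and response format are assumptions for illustration.

```python
import torch

def pre_inference_filter(model, x, expected_shape, msp_threshold=0.5):
    # Layer 1: input validation. Reject malformed or out-of-range inputs.
    if x.shape[1:] != expected_shape or x.min() < 0.0 or x.max() > 1.0:
        return {"decision": "reject", "reason": "malformed input"}
    # Layer 2: OOD scoring via maximum softmax probability; low-confidence
    # inputs are flagged for review rather than silently served.
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=-1)
    msp, pred = probs.max(dim=-1)
    if msp.item() < msp_threshold:
        return {"decision": "flag_for_review", "score": msp.item()}
    return {"decision": "serve", "label": int(pred.item())}
```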
Supply Chain Security
Supply chain security is increasingly important as ML systems depend on pre-trained models, datasets, and libraries from external sources. A compromised pre-trained model could contain backdoors that activate on specific triggers.
ML Supply Chain Security
Practices for verifying the integrity, provenance, and safety of external components used in ML systems, including pre-trained models, datasets, third-party libraries, and training infrastructure. Compromised components can introduce backdoors, data poisoning, or vulnerabilities.
Downloading a pre-trained model from the internet is equivalent to running untrusted code. A backdoored model can pass standard evaluation benchmarks while containing hidden triggers that cause targeted misclassification. Always verify model provenance, use cryptographic checksums, and test for backdoor triggers before deploying third-party models.
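Checksum verification can be as simple as the following sketch; the expected hash would come from the model publisher over a trusted channel, and the path and argument names are placeholders.

```python
import hashlib

def verify_model_checksum(path, expected_sha256):
    # Compute the SHA-256 digest of the downloaded model artifact and compare it
    # against the publisher's published hash before loading any weights.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(f"Checksum mismatch for {path}: refusing to load model")
    return True
```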
| Attack Surface | Threat | Mitigation |
|---|---|---|
| Pre-trained Models | Backdoors, hidden functionality | Verify provenance, test for triggers, fine-tune on trusted data |
| Training Datasets | Poisoned examples, biased labels | Data validation, provenance tracking, outlier detection |
| ML Libraries | Malicious code in dependencies | Pin versions, audit dependencies, use signed packages |
| Training Infrastructure | Compromised cloud instances | Secure enclaves, access controls, audit logging |
| Model Artifacts | Tampered weights in storage | Cryptographic signing, integrity verification |
Table 15.5: ML supply chain attack surfaces and mitigations.
Model Watermarking
ML model watermarking embeds verifiable signatures into models that survive common modifications like fine-tuning and pruning. Watermarks can prove model ownership in intellectual property disputes.
Model Watermarking
Techniques for embedding verifiable, robust signatures into ML models that can prove ownership and detect unauthorized redistribution. Watermarks must survive modifications like fine-tuning, pruning, and quantization while not degrading model performance.
White-box watermarks embed patterns directly in the model weights and require access to the weights to verify. Black-box watermarks embed patterns in the model's input-output behavior and can be verified by querying the model API with special trigger inputs. Black-box watermarks are harder to remove but also harder to embed without affecting model accuracy.
A company trains a proprietary image classifier and embeds a black-box watermark: the model always classifies a specific abstract pattern (the "key") as class 42. When the company discovers a competitor offering a suspiciously similar model, they query it with their key pattern. If the competitor's model also outputs class 42 with high confidence, it strongly suggests the model was copied or distilled from the original.
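A sketch of the verification side of that scenario; query_fn (the call into the suspect model's API), key_inputs, and the match threshold are all illustrative assumptions.

```python
def verify_blackbox_watermark(query_fn, key_inputs, watermark_class=42, threshold=0.9):
    # query_fn(x) is assumed to return the suspect model's predicted class for x.
    # If the suspect model reproduces the trigger behavior on most key inputs,
    # that is strong evidence the model was copied or distilled.
    matches = sum(1 for x in key_inputs if query_fn(x) == watermark_class)
    match_rate = matches / len(key_inputs)
    return match_rate >= threshold, match_rate
```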
Model Watermarking
Techniques for embedding verifiable signatures into ML models that survive modifications, enabling ownership verification and theft detection.
Supply Chain Security
Practices for verifying the integrity and provenance of external components (models, datasets, libraries) used in ML systems.
Key Takeaways
1. Adversarial attacks reveal fundamental vulnerabilities in neural networks that have real security implications for deployed systems.
2. Defense in depth combining adversarial training, input validation, and monitoring is more robust than any single defense.
3. Federated learning enables collaborative training without sharing raw data, but requires careful protocol design to prevent information leakage.
4. Differential privacy provides formal mathematical guarantees about privacy at the cost of some model utility.
5. Production ML security must address conventional threats alongside ML-specific vulnerabilities including model theft and supply chain attacks.