
AGI Systems

Surveys emerging trends and frontier models. Explores generative AI, foundation models, and the evolving landscape of advanced ML systems.

Learning Objectives

  • Explain how foundation models are trained, adapted, and served at scale in production systems
  • Compare global AI regulatory frameworks including the EU AI Act, US Executive Order, and China's approach
  • Implement a bias audit pipeline that evaluates demographic parity, equalized odds, and calibration across protected groups
  • Formalize fairness criteria mathematically and explain the impossibility theorem and its practical implications
  • Apply SHAP and LIME to generate explainable predictions from black-box ML models
  • Evaluate AI safety practices including red-teaming, constitutional AI, and responsible scaling policies
  • Apply scaling laws to make principled decisions about compute, data, and parameter allocation
  • Generate model cards and datasheets for datasets following established templates
  • Design responsible AI governance processes that integrate bias audits, safety evaluations, and model documentation into the ML lifecycle

01 Frontier Models and Foundation Models

Foundation models are large-scale models trained on broad data that can be adapted to a wide range of downstream tasks. GPT-4, Claude, Gemini, and Llama represent the current frontier of language models, while DALL-E, Stable Diffusion, and Midjourney lead in image generation. These models have demonstrated remarkable capabilities that were unexpected even a few years ago.

Definition

Foundation Model

A large-scale model trained on broad, diverse data using self-supervised learning that serves as a general-purpose base for many downstream tasks. Adaptation to specific tasks is achieved through fine-tuning, prompting, or in-context learning, without training a new model from scratch.

Figure 20.1: The six pillars of Responsible AI: Fairness, Transparency, Privacy, Safety, Accountability, and Sustainability, connected through a hub-and-spoke framework. Example practices include equal selection rates across genders in hiring AI (fairness), reason codes for credit-scoring denials (transparency), patient data that never leaves the hospital (privacy), emergency-stop protocols for autonomous vehicles (safety), compliance-officer sign-off for financial AI (accountability), and training in renewable-energy data centers (sustainability).

Systems Challenges at Scale

The systems challenges of foundation models are immense, spanning training, serving, and the infrastructure required to make these models practical. Each phase introduces distinct engineering bottlenecks that require specialized solutions.

Systems challenges across the foundation model lifecycle

Challenge | Training Phase | Serving Phase
Compute | Thousands of GPUs for months | Low-latency inference at massive scale
Memory | Model state, optimizer states, activations | KV-cache management for long contexts
Cost | Millions of dollars per training run | Inference cost per token must be economically viable
Reliability | Fault tolerance over weeks-long runs | 99.9%+ uptime for production APIs
Efficiency | Mixed precision, gradient checkpointing | Batching, quantization, speculative decoding
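The serving-phase memory row above centers on KV-cache management. As a rough illustration, here is a back-of-envelope KV-cache sizing sketch; the model dimensions are hypothetical (loosely modeled on a 70B-class model with grouped-query attention) and should be replaced with those of the model actually being served.

python
# Back-of-envelope KV-cache sizing for long-context serving.
# All dimensions below are illustrative assumptions, not figures for a specific model.

def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # fp16/bf16 activations
) -> int:
    """Memory needed to cache keys and values for one batch."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch_size

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128.
gb = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32) / 1e9
print(f"KV cache: ~{gb:.0f} GB")  # roughly 86 GB at this batch size and context length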

Scale of Training

Training a frontier foundation model like GPT-4 is estimated to require 10,000+ GPUs running for several months at a cost exceeding $100 million. This makes each training run a major capital investment and puts frontier model development out of reach for most organizations.
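One way to see where figures like these come from is the sketch below, which converts an assumed parameter count and token budget into total FLOPs, wall-clock time, and dollar cost using the common C ≈ 6·N·D approximation. Every constant is an illustrative assumption, not a measured figure for any particular model.

python
# Back-of-envelope training compute and cost estimate; all inputs are assumptions.

def training_estimate(
    n_params: float,
    n_tokens: float,
    gpu_peak_flops: float = 1e15,   # assumed ~1 PFLOP/s peak per accelerator
    utilization: float = 0.4,       # realistic model FLOPs utilization
    n_gpus: int = 10_000,
    dollars_per_gpu_hour: float = 2.0,
):
    total_flops = 6 * n_params * n_tokens                 # C ~ 6*N*D approximation
    gpu_seconds = total_flops / (gpu_peak_flops * utilization)
    wall_clock_days = gpu_seconds / n_gpus / 86_400
    cost = gpu_seconds / 3_600 * dollars_per_gpu_hour     # raw GPU rental only
    return total_flops, wall_clock_days, cost

# Hypothetical 500B-parameter dense model trained on 10T tokens:
flops, days, cost = training_estimate(n_params=5e11, n_tokens=10e12)
print(f"{flops:.1e} FLOPs, ~{days:.0f} days on 10,000 GPUs, ~${cost / 1e6:.0f}M")
# Real programs cost more: failed runs, ablations, and staff are not included.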

Quick Check

What is the primary systems challenge of scaling foundation models?

At the scale of frontier foundation models (thousands of GPUs training for months), the dominant systems bottleneck is communication overhead across the GPU cluster. Gradient synchronization, parameter updates, and activation transfers between GPUs consume significant bandwidth and create latency. Techniques like tensor parallelism, pipeline parallelism, and gradient compression exist specifically to mitigate this communication bottleneck. Data and optimizer state matter, but they are not the primary systems engineering challenge.
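A quick back-of-envelope calculation shows why communication looms so large. The sketch below estimates per-step gradient synchronization time for pure data parallelism under a ring all-reduce; the parameter count, GPU count, and link bandwidth are illustrative assumptions.

python
# Rough per-step gradient synchronization time under a ring all-reduce.
# All numbers are illustrative assumptions.

def ring_allreduce_seconds(
    n_params: float,
    n_gpus: int,
    bandwidth_gb_per_s: float,   # effective per-GPU interconnect bandwidth
    bytes_per_grad: int = 2,     # bf16 gradients
) -> float:
    payload = n_params * bytes_per_grad
    # A ring all-reduce moves ~2 * (n - 1) / n of the payload over each link.
    volume = 2 * payload * (n_gpus - 1) / n_gpus
    return volume / (bandwidth_gb_per_s * 1e9)

# 70B parameters synchronized across 1,024 GPUs at 100 GB/s effective bandwidth:
t = ring_allreduce_seconds(70e9, 1024, 100)
print(f"~{t:.2f} s of pure communication per optimizer step")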

Model Card Generation

Model cards are structured documentation artifacts that accompany trained ML models. They report intended use, performance characteristics, limitations, and ethical considerations. The following template generates standardized model cards for responsible model release.

python
from dataclasses import dataclass, field
from typing import Optional
import json
from datetime import datetime


@dataclass
class ModelCard:
    """Standardized model documentation following
    Mitchell et al. (2019).

    Model cards increase transparency and enable
    informed decisions by downstream users.
    """
    model_name: str
    model_version: str
    model_type: str
    training_data_summary: str
    intended_use: str
    out_of_scope_uses: list[str]
    ethical_considerations: list[str]
    limitations: list[str]
    evaluation_results: dict[str, float]
    bias_audit_results: Optional[dict] = None
    carbon_footprint_kg: Optional[float] = None
    training_compute_flops: Optional[float] = None
    license: str = "Proprietary"
    created_at: str = field(
        default_factory=lambda: datetime.now().isoformat()
    )

    def to_markdown(self) -> str:
        md = f"# Model Card: {self.model_name}"
        md += f" v{self.model_version}\n\n"
        md += f"**Type:** {self.model_type}\n"
        md += f"**License:** {self.license}\n"
        md += f"**Created:** {self.created_at}\n\n"

        md += f"## Intended Use\n{self.intended_use}\n\n"

        md += "## Out-of-Scope Uses\n"
        for use in self.out_of_scope_uses:
            md += f"- {use}\n"

        md += "\n## Training Data\n"
        md += f"{self.training_data_summary}\n\n"

        md += "## Evaluation Results\n"
        for metric, value in self.evaluation_results.items():
            md += f"- **{metric}:** {value:.4f}\n"

        if self.bias_audit_results:
            md += "\n## Bias Audit\n"
            for group, metrics in self.bias_audit_results.items():
                md += f"- **{group}:** {metrics}\n"

        md += "\n## Limitations\n"
        for lim in self.limitations:
            md += f"- {lim}\n"

        md += "\n## Ethical Considerations\n"
        for eth in self.ethical_considerations:
            md += f"- {eth}\n"

        if self.carbon_footprint_kg:
            md += "\n## Environmental Impact\n"
            md += "- Training CO2: "
            md += f"{self.carbon_footprint_kg:.0f} kg\n"

        if self.training_compute_flops:
            md += "- Training compute: "
            md += f"{self.training_compute_flops:.2e} FLOPs\n"

        return md

    def to_json(self) -> str:
        return json.dumps(self.__dict__, indent=2)


# --- Example usage ---
card = ModelCard(
    model_name="SentimentBERT",
    model_version="2.1",
    model_type="Fine-tuned BERT-base for sentiment analysis",
    training_data_summary=(
        "1.2M English product reviews from Amazon, "
        "filtered for quality. Demographics: US-centric, "
        "English-only. Period: 2018-2023."
    ),
    intended_use=(
        "Sentiment classification for English product "
        "reviews in e-commerce applications."
    ),
    out_of_scope_uses=[
        "Medical or legal text analysis",
        "Non-English languages",
        "Hate speech or toxicity detection",
        "High-stakes decision-making without human review",
    ],
    ethical_considerations=[
        "May reflect biases in product review data",
        "Performance varies across demographic groups",
        "Should not be used as sole input for decisions "
        "affecting individuals",
    ],
    limitations=[
        "English only; no multilingual support",
        "Sarcasm detection is unreliable (F1=0.62)",
        "Max input length: 512 tokens",
        "Not tested on social media text or slang",
    ],
    evaluation_results={
        "accuracy": 0.924,
        "f1_macro": 0.891,
        "auc_roc": 0.967,
    },
    bias_audit_results={
        "gender_parity_ratio": 0.94,
        "age_group_max_accuracy_gap": 0.03,
    },
    carbon_footprint_kg=12.5,
    training_compute_flops=3.2e17,
)

print(card.to_markdown())

Emergent Capabilities

Emergent capabilities -- abilities that appear suddenly as models scale beyond certain size thresholds -- have surprised the research community. In-context learning, chain-of-thought reasoning, and instruction following all emerged without being explicitly trained for. This unpredictability is a core challenge for safety: we cannot reliably predict what a model trained at 10x scale will be able to do.

Emergence Is Unpredictable

Emergent capabilities appear suddenly and are difficult to predict from smaller-scale experiments. This makes it challenging to anticipate what a larger model will be able to do -- both the beneficial capabilities and the potential for misuse. Safety evaluation must therefore be comprehensive rather than relying on extrapolation from smaller models.

Definition

Emergent Capability

An ability that is absent or near-random in smaller models but appears abruptly once a model exceeds a critical scale threshold in parameters, data, or compute. Examples include multi-step arithmetic, chain-of-thought reasoning, and cross-lingual transfer. The mechanism behind emergence remains an open research question.

Deeper Insight

When GPT-3 scaled beyond a critical threshold, it suddenly gained the ability to perform in-context learning, chain-of-thought reasoning, and few-shot task completion -- none of which were present in smaller versions of the same architecture. This is not gradual improvement; it is qualitative change. For systems engineers, this means you cannot reliably predict what a 10x larger model will do by extrapolating from a smaller one. Safety evaluation must be comprehensive at each new scale, and infrastructure must be designed to handle capabilities that may not exist yet at the time of design.

Think of it like water at 99 degrees Celsius: still liquid, but at 100 degrees it abruptly becomes steam -- a completely different state with fundamentally different properties. Scale in neural networks can trigger similarly abrupt phase transitions in capability.

Democratization Through Open Models

The democratization of foundation models through open weights and efficient fine-tuning methods has broadened access beyond a few large corporations. This shift has profound implications for both innovation and safety governance.

  • Open-weight models: Llama, Mistral, and Falcon provide powerful base models that anyone can use and adapt
  • LoRA (Low-Rank Adaptation): Adds small trainable matrices to frozen model weights, enabling fine-tuning with a fraction of the compute
  • QLoRA: Combines 4-bit quantization with LoRA for fine-tuning 65B-parameter models on a single GPU
  • Hugging Face ecosystem: Provides infrastructure for sharing models, datasets, and fine-tuning recipes
Start With a Foundation Model

For almost any new ML project today, the default starting point should be adapting an existing foundation model rather than training from scratch. Fine-tuning a pre-trained model is faster, cheaper, and often produces better results than training a specialized model from the ground up.
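As a minimal sketch of this adapt-rather-than-retrain workflow, the snippet below attaches LoRA adapters to a small pre-trained model using the Hugging Face transformers and peft libraries. The base checkpoint, target modules, and hyperparameters are placeholder choices, not recommendations.

python
# Minimal LoRA fine-tuning setup (sketch); model and hyperparameters are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "bert-base-uncased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# Train with the standard transformers Trainer or a custom loop; only the small
# LoRA matrices receive gradients, so a single GPU is usually sufficient.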

Foundation Model

A large-scale model trained on broad data that can be adapted to diverse downstream tasks through fine-tuning, prompting, or in-context learning.

Emergent Capabilities

Abilities that appear suddenly as models scale beyond certain thresholds, such as in-context learning and chain-of-thought reasoning.

02 AI Governance and Global Regulation

AI governance encompasses the policies, frameworks, and institutions that guide the development and deployment of AI systems. As AI capabilities advance rapidly, governments worldwide are racing to establish regulatory frameworks that balance innovation with safety, fairness, and accountability.

Definition

AI Governance

The set of policies, standards, organizational structures, and processes that direct and control how AI systems are developed, deployed, and monitored. Governance spans technical practices (model cards, audits), organizational policies (review boards, risk assessment), and external regulation (laws, industry standards).

Global Regulatory Landscape

The regulatory landscape for AI is fragmented across jurisdictions, with each taking a distinct philosophical approach. Understanding these differences is essential for organizations deploying AI systems internationally.

Table 20.1: Global AI regulation comparison as of 2025

Jurisdiction | Framework | Approach | Key Requirements | Enforcement
European Union | EU AI Act (2024) | Risk-based, prescriptive | Conformity assessments for high-risk AI, transparency obligations, prohibited practices (social scoring, real-time biometric surveillance) | Up to 7% of global revenue for violations
United States | Executive Order 14110 (2023) + sector rules | Voluntary + sector-specific | Frontier model safety reporting (>10^26 FLOPs), NIST AI RMF guidance, FDA/SEC/FTC sector authority | Varies by sector; no comprehensive federal AI law
China | Algorithmic Recommendation, Deep Synthesis, GenAI regulations | Content-focused, prescriptive | Algorithm registration, content labeling, training data review, socialist core values alignment | Cyberspace Administration enforcement
United Kingdom | Pro-Innovation AI framework | Principles-based, light-touch | Existing regulators apply 5 principles (safety, transparency, fairness, accountability, contestability) | Sector regulators (FCA, Ofcom, CMA)
Canada | AIDA (Artificial Intelligence and Data Act) | Risk-based | Impact assessments for high-impact systems, algorithmic transparency | New AI and Data Commissioner

EU AI Act Risk Tiers

The EU AI Act classifies AI systems into four risk tiers:
  • Unacceptable risk (banned): Social scoring by governments, real-time biometric surveillance in public spaces, manipulation of vulnerable groups, emotion recognition in workplaces and schools.
  • High risk (strict requirements): AI in critical infrastructure, education, employment, law enforcement, migration, and justice. Requires conformity assessments, risk management systems, data governance, human oversight, and logging.
  • Limited risk (transparency obligations): Chatbots, deepfakes, emotion detection. Must disclose that users are interacting with AI.
  • Minimal risk (no requirements): Spam filters, AI in video games, inventory management. No regulatory obligations.

AI Safety Benchmark Landscape

Standardized benchmarks enable systematic evaluation of AI safety properties. As models become more capable, the benchmarks used to evaluate them must evolve to test for increasingly sophisticated risks.

Table 20.2: Key AI safety benchmarks and their coverage areas.

Benchmark | Focus Area | Methodology | Maintained By
TruthfulQA | Truthfulness | 817 questions designed to elicit common misconceptions | Academic (Lin et al.)
BBQ | Social bias across 9 categories | Ambiguous/disambiguated question pairs | Google
RealToxicityPrompts | Toxic text generation | 100K naturally occurring prompts | Allen AI
HarmBench | Jailbreak resistance | Standardized adversarial attack methods | Academic consortium
WMDP | Dangerous knowledge (bio/chem/cyber) | Multiple-choice dual-use knowledge tests | Academic (Li et al.)
MMLU | General knowledge and reasoning | 57 subjects, elementary to professional level | Academic (Hendrycks et al.)
DecodingTrust | Comprehensive trustworthiness | Toxicity, stereotypes, privacy, robustness, fairness | Academic (Wang et al.)
MLCommons AI Safety | Standardized safety evaluation | Hazard taxonomy with graded severity | MLCommons consortium

Model Cards and Transparency Documentation

Model cards, introduced by Mitchell et al. (2019), provide structured documentation about a model's intended use, performance characteristics, limitations, and ethical considerations. They serve as the primary transparency artifact in responsible AI practice.

Definition

Model Card

A structured documentation artifact that accompanies a trained ML model, reporting its intended use cases, performance metrics across demographic groups, limitations, ethical considerations, and training data provenance. Model cards increase transparency and enable informed decision-making by downstream users and affected communities.

Definition

Datasheets for Datasets

Standardized documentation for ML datasets, proposed by Gebru et al. (2021), that records the motivation, composition, collection process, preprocessing, intended uses, distribution, and maintenance plan for a dataset. Datasheets help downstream users understand data provenance and potential biases.

What a Model Card Should Include

A comprehensive model card documents:
  1. Model details -- architecture, training procedure, versions
  2. Intended use -- primary use cases and out-of-scope applications
  3. Factors -- demographic groups, environmental conditions, and instrumentation relevant to performance
  4. Metrics -- which metrics were chosen and why
  5. Evaluation data -- benchmark details and disaggregated results
  6. Training data -- provenance and composition summary
  7. Ethical considerations -- potential harms and mitigations applied
  8. Caveats and recommendations

International Coordination Challenges

International coordination on AI governance faces significant challenges. Different cultural values, economic interests, and geopolitical dynamics shape each jurisdiction's approach. The Bletchley Declaration (2023) and subsequent AI Safety Summits represent early efforts at multilateral alignment, but binding international agreements remain elusive.

Regulatory Arbitrage Risk

Without international coordination, companies may develop AI in the least-regulated jurisdiction and deploy globally -- a phenomenon known as regulatory arbitrage. The EU AI Act addresses this by applying to any AI system that affects EU residents, regardless of where the provider is based. This "Brussels effect" may establish a de facto global standard, as companies find it more efficient to build one compliant system than to maintain separate versions for different jurisdictions.

  • Bletchley Declaration (2023): 28 countries agreed on the need to identify and manage frontier AI risks
  • AI Safety Institutes: UK, US, Japan, Singapore, and others establishing national AI safety research bodies
  • OECD AI Principles: Non-binding but widely adopted framework for trustworthy AI
  • G7 Hiroshima AI Process: Voluntary code of conduct for frontier AI developers
  • ISO/IEC 42001: International standard for AI management systems

AI Governance

The policies, standards, and processes that direct and control how AI systems are developed, deployed, and monitored across organizational and regulatory dimensions.

EU AI Act

The European Union's comprehensive AI regulation establishing a risk-based classification system with requirements proportional to potential harm.

03 Bias, Fairness, and Explainability

Bias in ML systems can arise from training data that reflects historical inequities, from model architectures that amplify certain patterns, and from evaluation practices that fail to disaggregate performance across demographic groups. Addressing bias requires systematic auditing at every stage of the ML pipeline.

Definition

Algorithmic Bias

Systematic and repeatable errors in an ML system that create unfair outcomes for particular groups of people. Bias can originate from training data (historical bias, representation bias, measurement bias), from modeling choices (aggregation bias, evaluation bias), or from deployment context (deployment bias, feedback loops).

Bias Detection Audit Pipeline

A rigorous bias audit evaluates model performance across protected attributes at multiple stages: data analysis, pre-deployment testing, and post-deployment monitoring. The following pipeline computes demographic parity, equalized odds, and disparate impact across groups.

python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    confusion_matrix,
)
from dataclasses import dataclass


@dataclass
class FairnessReport:
    """Per-group fairness metrics."""
    group: str
    n_samples: int
    accuracy: float
    precision: float
    recall: float
    positive_rate: float  # P(Y_hat = 1)
    fpr: float            # False positive rate
    fnr: float            # False negative rate


def bias_audit(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    sensitive_attr: np.ndarray,
) -> dict[str, FairnessReport]:
    """Run a comprehensive bias audit across groups.

    Computes per-group accuracy, precision, recall,
    positive rate, FPR, and FNR to evaluate demographic
    parity, equalized odds, and disparate impact.
    """
    reports: dict[str, FairnessReport] = {}
    groups = np.unique(sensitive_attr)

    for group in groups:
        mask = sensitive_attr == group
        yt, yp = y_true[mask], y_pred[mask]
        tn, fp, fn, tp = confusion_matrix(
            yt, yp, labels=[0, 1]
        ).ravel()

        reports[str(group)] = FairnessReport(
            group=str(group),
            n_samples=int(mask.sum()),
            accuracy=accuracy_score(yt, yp),
            precision=precision_score(
                yt, yp, zero_division=0
            ),
            recall=recall_score(
                yt, yp, zero_division=0
            ),
            positive_rate=float(yp.mean()),
            fpr=fp / (fp + tn) if (fp + tn) > 0 else 0.0,
            fnr=fn / (fn + tp) if (fn + tp) > 0 else 0.0,
        )

    return reports


def check_demographic_parity(
    reports: dict[str, FairnessReport],
    threshold: float = 0.8,
) -> dict[str, object]:
    """Check demographic parity (4/5ths rule).

    Demographic parity requires:
    P(Y_hat=1 | A=a) = P(Y_hat=1 | A=b) for all groups.
    The 4/5ths rule: ratio of min/max selection rates >= 0.8.
    """
    rates = {
        g: r.positive_rate for g, r in reports.items()
    }
    max_rate = max(rates.values())
    min_rate = min(rates.values())
    di_ratio = min_rate / max_rate if max_rate > 0 else 1.0

    return {
        "group_positive_rates": rates,
        "disparate_impact_ratio": round(di_ratio, 4),
        "passes_four_fifths_rule": di_ratio >= threshold,
    }


def check_equalized_odds(
    reports: dict[str, FairnessReport],
    max_gap: float = 0.05,
) -> dict[str, object]:
    """Check equalized odds: equal TPR and FPR across groups.

    Equalized odds requires:
    P(Y_hat=1 | Y=y, A=a) = P(Y_hat=1 | Y=y, A=b)
    for y in {0, 1}. We check that the max gap in
    FPR and TPR across groups is below the threshold.
    """
    fprs = {g: r.fpr for g, r in reports.items()}
    tprs = {g: r.recall for g, r in reports.items()}
    fpr_gap = max(fprs.values()) - min(fprs.values())
    tpr_gap = max(tprs.values()) - min(tprs.values())

    return {
        "group_fprs": fprs,
        "group_tprs": tprs,
        "max_fpr_gap": round(fpr_gap, 4),
        "max_tpr_gap": round(tpr_gap, 4),
        "passes_equalized_odds": (
            fpr_gap < max_gap and tpr_gap < max_gap
        ),
    }


def check_calibration(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    sensitive_attr: np.ndarray,
    n_bins: int = 10,
) -> dict[str, dict[str, float]]:
    """Check calibration across groups.

    Calibration requires:
    P(Y=1 | S=s, A=a) = P(Y=1 | S=s, A=b)
    i.e., predicted probabilities mean the same thing
    for all groups. Reports ECE per group.
    """
    results = {}
    for group in np.unique(sensitive_attr):
        mask = sensitive_attr == group
        probs = y_prob[mask]
        labels = y_true[mask]
        bin_edges = np.linspace(0, 1, n_bins + 1)
        ece = 0.0
        for i in range(n_bins):
            in_bin = (
                (probs > bin_edges[i])
                & (probs <= bin_edges[i + 1])
            )
            if in_bin.sum() == 0:
                continue
            avg_pred = probs[in_bin].mean()
            avg_true = labels[in_bin].mean()
            ece += (
                in_bin.sum() / len(probs)
                * abs(avg_pred - avg_true)
            )
        results[str(group)] = {"ece": round(ece, 4)}
    return results


# --- Run a full audit (y_test, y_pred, race_labels assumed from an earlier evaluation step) ---
reports = bias_audit(y_test, y_pred, race_labels)

print("=== Per-Group Metrics ===")
for group, r in reports.items():
    print(
        f"  {group}: acc={r.accuracy:.3f}, "
        f"prec={r.precision:.3f}, "
        f"recall={r.recall:.3f}, "
        f"pos_rate={r.positive_rate:.3f}"
    )

print("\n=== Demographic Parity ===")
dp = check_demographic_parity(reports)
print(f"  DI ratio: {dp['disparate_impact_ratio']}")
print(f"  Passes 4/5ths rule: {dp['passes_four_fifths_rule']}")

print("\n=== Equalized Odds ===")
eo = check_equalized_odds(reports)
print(f"  Max FPR gap: {eo['max_fpr_gap']}")
print(f"  Max TPR gap: {eo['max_tpr_gap']}")
print(f"  Passes: {eo['passes_equalized_odds']}")

Fairness Criteria and Impossibility Results

Fairness is not a single metric but a family of often-incompatible criteria. The choice of fairness definition has significant practical consequences and should be driven by the deployment context and stakeholder values.

\text{Demographic Parity: } P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b)
\text{Equalized Odds: } P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b) \quad \forall y \in \{0, 1\}
\text{Calibration: } P(Y=1 \mid S=s, A=a) = P(Y=1 \mid S=s, A=b) \quad \forall s
Table 20.3: Major fairness criteria with formal definitions and application guidance.

  • Demographic Parity: P(Y_hat=1 | A=a) = P(Y_hat=1 | A=b). Equal selection rates across groups. Use when base rates should not affect decisions (e.g., hiring).
  • Equalized Odds: P(Y_hat=1 | Y=y, A=a) = P(Y_hat=1 | Y=y, A=b). Equal TPR and FPR across groups. Use when accuracy matters equally for all groups (e.g., medical diagnosis).
  • Predictive Parity: P(Y=1 | Y_hat=1, A=a) = P(Y=1 | Y_hat=1, A=b). Equal precision across groups. Use when positive predictions trigger costly interventions.
  • Individual Fairness: d(f(x), f(x')) <= L * d(x, x'). Similar individuals get similar predictions. Use when a meaningful similarity metric exists.
  • Counterfactual Fairness: P(Y_hat | do(A=a)) = P(Y_hat | do(A=b)). Prediction unchanged if group membership were different. Use when causal reasoning about protected attributes is possible.

Impossibility Theorem

Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) proved that when base rates differ across groups, it is mathematically impossible to simultaneously satisfy calibration, balance for the positive class, and balance for the negative class. This means you must choose which fairness criterion to prioritize based on the specific context. There is no universally "fair" model -- fairness requires value judgments about which trade-offs are acceptable.
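One way to see why: for a binary classifier, the false positive rate is algebraically tied to the group's base rate p, its precision (PPV), and its false negative rate:

\text{FPR} = \frac{p}{1-p} \cdot \frac{1 - \text{PPV}}{\text{PPV}} \cdot (1 - \text{FNR})

If two groups have different base rates p but the model is forced to have equal PPV (calibration) and equal FNR, this identity forces their FPRs to differ, so equalized odds cannot also hold. A common operational test, the four-fifths rule on disparate impact, is expressed as: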

DI = \frac{P(\hat{Y}=1 \mid A=\text{minority})}{P(\hat{Y}=1 \mid A=\text{majority})} \geq 0.8

Explainability with SHAP and LIME

Post-hoc explainability methods provide interpretable explanations for individual predictions from black-box models. SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory, while LIME (Local Interpretable Model-agnostic Explanations) fits a local linear model around each prediction.

python
import shap
import lime
import lime.lime_tabular
import numpy as np

# --- SHAP: Global and local feature importance ---
def explain_with_shap(
    model, X_train, X_explain, feature_names
):
    """Generate SHAP explanations for model predictions.

    SHAP values decompose each prediction into additive
    contributions from each feature, grounded in Shapley
    values from cooperative game theory.
    """
    explainer = shap.Explainer(model, X_train)
    shap_values = explainer(X_explain)

    # Global importance: mean |SHAP value| per feature
    global_importance = np.abs(
        shap_values.values
    ).mean(axis=0)
    importance_ranking = sorted(
        zip(feature_names, global_importance),
        key=lambda x: x[1],
        reverse=True,
    )
    print("Global Feature Importance (SHAP):")
    for name, imp in importance_ranking[:10]:
        print(f"  {name}: {imp:.4f}")

    # Local explanation for a single prediction
    idx = 0
    print(f"\nLocal explanation for sample {idx}:")
    for name, val in sorted(
        zip(feature_names, shap_values.values[idx]),
        key=lambda x: abs(x[1]),
        reverse=True,
    )[:5]:
        direction = "+" if val > 0 else ""
        print(f"  {name}: {direction}{val:.4f}")

    return shap_values


# --- LIME: Local interpretable explanations ---
def explain_with_lime(
    model, X_train, instance, feature_names
):
    """Generate a LIME explanation for one prediction.

    LIME perturbs the input, observes prediction changes,
    and fits a weighted linear model to approximate the
    decision boundary locally.
    """
    explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        mode="classification",
    )
    explanation = explainer.explain_instance(
        instance,
        model.predict_proba,
        num_features=10,
    )

    print("LIME Local Explanation:")
    for feature, weight in explanation.as_list():
        print(f"  {feature}: {weight:+.4f}")

    return explanation

SHAP vs. LIME Trade-offs

SHAP provides theoretically grounded (Shapley value) explanations with consistency guarantees but is computationally expensive for large models (exponential in the number of features without approximations). LIME is faster and model-agnostic but explanations can be unstable -- small changes in the input or the perturbation seed can produce different explanations. For regulated applications, SHAP is generally preferred due to its theoretical guarantees.

Definition

SHAP (SHapley Additive exPlanations)

An explainability method based on Shapley values from cooperative game theory. For each prediction, SHAP assigns each feature a value representing its contribution to the difference between the prediction and the average prediction. SHAP values satisfy desirable properties: local accuracy, missingness, and consistency.

Algorithmic Bias

Systematic errors in ML systems that produce unfair outcomes for particular demographic groups, arising from data, modeling, or deployment choices.

Fairness Criteria

Mathematical definitions of what constitutes fair predictions, including demographic parity, equalized odds, and individual fairness.

04 AI Safety and Alignment

AI safety research addresses the challenge of ensuring that increasingly capable AI systems remain beneficial, controllable, and aligned with human values. As models gain capabilities that approach or exceed human-level performance in specific domains, the potential consequences of misalignment grow correspondingly. Safety is not a feature to be added after development -- it must be integrated throughout the entire model lifecycle.

Definition

AI Safety

The interdisciplinary field focused on ensuring that AI systems operate reliably, remain under meaningful human control, and produce outcomes aligned with human values -- especially as systems become more capable and autonomous. AI safety spans technical research (alignment, interpretability, robustness) and governance (evaluation standards, deployment controls).

Red-Teaming and Safety Evaluation

Red-teaming is the practice of adversarially testing AI systems to discover failure modes, harmful outputs, and safety vulnerabilities before deployment. Modern red-teaming combines automated evaluation suites with human adversarial testing across multiple risk dimensions.

Definition

Red-Teaming

Structured adversarial testing of an AI system by a dedicated team attempting to elicit harmful, dangerous, or unintended outputs. Red teams probe for failure modes including toxicity, bias, misinformation generation, jailbreak vulnerabilities, dangerous knowledge disclosure, and capability overhangs. Results inform safety mitigations before deployment.

Table 20.4: Key AI safety benchmarks and evaluation suites.

Benchmark | What It Measures | Methodology | Key Limitation
TruthfulQA | Truthfulness vs. common misconceptions | 817 questions designed to elicit false but plausible answers | Limited to English, narrow question types
BBQ (Bias Benchmark for QA) | Social bias across 9 categories | Ambiguous and disambiguated question pairs across demographics | May not capture subtle or intersectional biases
RealToxicityPrompts | Propensity to generate toxic text | 100K naturally occurring prompts scored for toxicity | Toxicity classifiers themselves have biases
HarmBench | Resistance to adversarial jailbreaks | Standardized attack methods across harm categories | Arms race: new attacks can bypass evaluated defenses
WMDP (Weapons of Mass Destruction Proxy) | Dangerous knowledge in bio/chem/cyber | Multiple-choice questions testing dual-use knowledge | Proxy measure; does not test actual capability to cause harm
MMLU (Massive Multitask Language Understanding) | General knowledge and reasoning | 57 subjects from elementary to professional level | Saturating as a benchmark; does not test safety directly

OpenAI's Safety Practices

OpenAI's Preparedness Framework classifies frontier model risks across four categories -- cybersecurity, biological threats, persuasion, and model autonomy -- each rated from Low to Critical. Before deploying a new model, the Preparedness team evaluates it against specific capability thresholds. Models rated "High" risk or above in any category require explicit sign-off from leadership and additional mitigations. External red teams (including domain experts in biosecurity and cybersecurity) supplement internal testing.

Alignment Techniques

Alignment refers to the technical challenge of ensuring that an AI system's objectives and behaviors match human intentions and values. The dominant paradigm for aligning large language models combines supervised fine-tuning (SFT) on high-quality demonstrations with reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).

Anthropic's Constitutional AI

Constitutional AI (CAI) addresses a key limitation of RLHF: it is expensive and difficult to scale human feedback to cover all possible model behaviors. CAI works in two phases:
  1. Supervised phase: The model generates responses, then critiques and revises them based on a written "constitution" of principles (e.g., "choose the response that is most helpful while being honest and harmless").
  2. RL phase: The model is trained with RLAIF (RL from AI Feedback), where a separate model provides preference labels guided by the constitution, replacing human annotators for the majority of the training signal.
This approach makes alignment more scalable, more transparent (the principles are explicit and auditable), and more consistent than pure human feedback.

DeepMind's Safety Research

Google DeepMind's safety research spans multiple threads: (1) Scalable oversight -- methods for humans to supervise AI systems that may exceed human capabilities in specific domains, including debate and recursive reward modeling; (2) Interpretability -- understanding neural network internals through tools like concept probes and activation analysis; (3) Evaluations -- developing benchmarks and threat models for dangerous capabilities; (4) Specification -- formalizing what it means for AI to be beneficial, including work on moral uncertainty and value learning.

  1. Supervised Fine-Tuning (SFT): Train on high-quality demonstrations of desired behavior from human annotators
  2. Reward Modeling: Train a separate model to predict human preferences between pairs of model outputs
  3. RLHF: Use PPO or similar RL algorithms to optimize the language model against the reward model
  4. DPO (Direct Preference Optimization): Directly optimize on preference pairs without training a separate reward model (a minimal loss sketch follows this list)
  5. Constitutional AI: Model self-critiques outputs against stated principles, enabling scalable alignment without per-example human labels
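To make step 4 concrete, here is a minimal sketch of the DPO objective in PyTorch (assumed as the framework), operating on per-sequence log-probabilities that are presumed to have been computed already for both the policy and a frozen reference model.

python
# Minimal DPO loss sketch (Rafailov et al., 2023); inputs are assumed precomputed.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_chosen_logps: torch.Tensor,       # same sequences under the frozen reference
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the implicit rewards of chosen and rejected outputs.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()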

Interpretability and Mechanistic Understanding

Interpretability research aims to understand what neural networks learn and how they make decisions. Mechanistic interpretability, in particular, seeks to reverse-engineer the computational mechanisms inside models by identifying specific circuits -- subnetworks of neurons and attention heads that implement specific computations.

Why Interpretability Matters for Safety

Without interpretability, safety relies on behavioral testing: we check whether the model produces bad outputs on a finite set of test inputs. But behavioral testing cannot guarantee safety on unseen inputs. Mechanistic interpretability aims to verify safety at the level of the model's internal representations -- understanding not just what the model does, but how it works. This is analogous to the difference between black-box testing and code review in software engineering.
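A simple starting point, well short of full mechanistic analysis, is a linear probe: a small classifier trained on a model's internal activations to test whether a concept is linearly decodable at a given layer. The sketch below assumes the activations and concept labels have already been extracted, and uses scikit-learn for the probe.

python
# Linear probe sketch: is a concept linearly decodable from hidden activations?
# `activations` and `concept_labels` are assumed to come from an earlier extraction step.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, concept_labels: np.ndarray) -> float:
    """Train a linear probe on hidden states and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, concept_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# High probe accuracy suggests the concept is represented linearly at that layer;
# it does not by itself show that the model uses that representation causally.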


Red-Teaming

Structured adversarial testing of AI systems by dedicated teams to discover failure modes, harmful outputs, and safety vulnerabilities before deployment.

Constitutional AI

An alignment approach where the model critiques and revises its own outputs based on a written set of principles, enabling scalable alignment.

05 Scaling Laws, Compute Trends, and Responsible Scaling

Scaling laws describe the predictable relationship between model size, dataset size, compute budget, and model performance. These empirical relationships have become one of the most important tools for planning large-scale ML training and for anticipating the capabilities -- and risks -- of future models.

Definition

Scaling Laws

Empirical power-law relationships that predict how model performance (typically measured as loss) improves as a function of model parameter count (N), dataset size (D), and compute budget (C). These laws enable principled resource allocation before committing to expensive training runs.

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty

The Chinchilla scaling laws showed that many large models were under-trained relative to their size. For a given compute budget, the optimal approach scales data proportionally with parameters rather than just making models bigger.

Chinchilla Optimal Training

Before Chinchilla, the prevailing approach was to train very large models on relatively small datasets. The Chinchilla result showed that a 70B parameter model trained on 1.4T tokens outperformed the 280B parameter Gopher trained on 300B tokens, using the same compute budget. This finding redirected the entire field toward data scaling. The Chinchilla-optimal ratio is approximately 20 tokens per parameter.

N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}
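As a rough illustration, the sketch below splits a fixed compute budget between parameters and tokens using the common C ≈ 6·N·D approximation together with the roughly 20-tokens-per-parameter heuristic. The constants are approximations, not exact fitted values from the Chinchilla paper.

python
# Compute-optimal allocation sketch under C ~ 6*N*D and D ~ 20*N (approximate heuristics).
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget between parameter count N and training tokens D."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25):
    n, d = chinchilla_allocation(budget)
    print(f"C={budget:.0e}: ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
# At ~5.8e23 FLOPs this recovers roughly 70B parameters and 1.4T tokens,
# consistent with the Chinchilla configuration described above.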

The Compute Arms Race

Table 20.5: Compute scaling of frontier language models: training FLOPs have increased by 5+ orders of magnitude in 6 years.

Year | Model | Parameters | Training Tokens | Estimated FLOPs | Estimated Cost
2019 | GPT-2 | 1.5B | ~40B | ~1.5 x 10^20 | ~$50K
2020 | GPT-3 | 175B | 300B | ~3.1 x 10^23 | ~$4.6M
2022 | Chinchilla | 70B | 1.4T | ~5.8 x 10^23 | ~$3M
2023 | Llama 2 70B | 70B | 2T | ~1 x 10^24 | ~$5M
2023 | GPT-4 (est.) | ~1.8T MoE | ~13T | ~2 x 10^25 | ~$100M
2024 | Llama 3 405B | 405B | 15T | ~3.8 x 10^25 | ~$150M
2025+ | Frontier (projected) | ~1T+ dense | ~30T+ | ~10^26 | ~$500M+

Doubling Faster Than Hardware

Compute requirements for frontier models double every 6-10 months, while hardware efficiency doubles roughly every 2 years (end of Moore's Law). This gap is unsustainable without fundamental breakthroughs in algorithmic efficiency or hardware. Organizations must plan for this reality when projecting future training costs.


Responsible Scaling Policies

Responsible scaling policies (RSPs) link the deployment of increasingly capable models to commensurate safety evaluations and mitigations. These policies formalize the principle that with greater capability comes greater responsibility for safety.

Anthropic's Responsible Scaling Policy

Anthropic's RSP defines AI Safety Levels (ASL-1 through ASL-4+) that mirror biosafety levels. Each level is defined by the dangerous capabilities the model possesses and requires corresponding containment and deployment safeguards:
  • ASL-1: Models posing no meaningful catastrophic risk (e.g., a chess engine).
  • ASL-2: Models with early signs of dangerous capabilities but not yet providing uplift beyond existing publicly available information. Current frontier models are at ASL-2.
  • ASL-3: Models that substantially increase the risk of catastrophic misuse (e.g., providing meaningful uplift for WMD creation). Requires enhanced security, access controls, and deployment restrictions.
  • ASL-4+: Models with capabilities that could pose existential-scale risks. Requirements to be defined as the field develops.
Before advancing to the next ASL, Anthropic commits to demonstrating that the required safety measures for that level are in place.

Environmental Impact

The environmental impact of AI training is a growing concern. A single frontier model training run can consume as much energy as hundreds of US households use in a year. Responsible development requires measuring, reporting, and reducing the carbon footprint of ML systems; a rough estimation sketch follows the list below.

  • Training GPT-3 emitted approximately 552 tons of CO2 equivalent
  • Frontier model training in 2025 may exceed 10,000 tons CO2 equivalent per run
  • Inference costs typically dominate training costs over a model's lifetime: a widely-deployed model serving millions of users can consume more total compute for inference than it used for training
  • Efficiency improvements (quantization, distillation, MoE) reduce both cost and environmental impact
  • Data center location matters: renewable-powered facilities can reduce carbon intensity by 10x or more compared to coal-powered regions
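The sketch below is the kind of rough emissions estimate a model card might report. Every constant (average power draw, PUE, grid carbon intensity) is an assumption to be replaced with measured values for a real disclosure.

python
# Rough training-emissions estimate; all constants are assumptions, not measurements.

def training_co2_kg(
    n_gpus: int,
    avg_power_w: float,            # average draw per GPU, watts
    hours: float,
    pue: float = 1.2,              # data-center power usage effectiveness
    grid_kg_per_kwh: float = 0.4,  # carbon intensity of the local grid
) -> float:
    energy_kwh = n_gpus * avg_power_w * hours * pue / 1000.0
    return energy_kwh * grid_kg_per_kwh

# Hypothetical run: 1,000 GPUs at ~400 W average for 30 days on a mixed grid.
tonnes = training_co2_kg(1000, 400, hours=24 * 30) / 1000
print(f"~{tonnes:.0f} tonnes CO2e")  # drops sharply on a low-carbon grid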

Scaling Laws

Empirical relationships that predict how model performance improves as a function of model size, dataset size, and compute budget.

Responsible Scaling Policy

A governance framework that links model capability levels to required safety evaluations and containment measures, ensuring safety keeps pace with capability.

06 Future Directions for Responsible ML Systems

The future of ML systems lies at the intersection of several converging trends: more capable models, more efficient hardware, better software abstractions, and evolving societal expectations around safety and governance. Engineers who understand all of these dimensions will be best positioned to shape the field responsibly.

Automated ML Engineering

Automated ML engineering, where AI systems help design, train, optimize, and deploy other AI systems, is an emerging trend that could dramatically accelerate progress while also raising new governance questions about AI systems making decisions about other AI systems.

Table 20.6: The evolving landscape of automated ML engineering and associated governance challenges.
Automation AreaCurrent StateFuture DirectionGovernance Concern
Architecture searchNAS finds competitive architectures automaticallyAI designs novel architectures beyond human intuitionReduced human understanding of model internals
Code generationCopilot-class tools assist ML engineers dailyEnd-to-end ML pipeline generation from specificationsPropagation of subtle bugs and biases
DebuggingAI identifies common training issuesAutomated root cause analysis and fix suggestionOver-reliance on automated diagnosis
OptimizationAutoML tunes hyperparameters effectivelyAutomated system-level optimization across the stackOptimization for metrics that miss important values
EvaluationAutomated benchmark runningAI-designed evaluation suites for novel failure modesAI evaluating AI: who watches the watchmen?

Table 20.6: The evolving landscape of automated ML engineering and associated governance challenges.

The Evolving Engineer Role

As ML engineering becomes increasingly automated, the role of the ML systems engineer will shift from hands-on implementation toward system design, oversight, evaluation, and ensuring safety and quality. Understanding the fundamentals becomes more important, not less, as you need to evaluate and guide automated systems effectively.

Safety and Alignment as Core Engineering

Definition

AI Alignment

The research challenge of ensuring that AI systems pursue goals that are beneficial to humans and act in accordance with human values and intentions, especially as systems become more capable and autonomous. Alignment is distinct from capability: a very capable but misaligned system is more dangerous than a less capable one.

  • RLHF (Reinforcement Learning from Human Feedback): Fine-tuning models using human preference data to align outputs with human values
  • Constitutional AI: Training models to follow a set of principles, enabling self-critique and revision without per-example human labels
  • Interpretability: Understanding what models have learned and how they make decisions at the mechanistic level
  • Red-teaming: Adversarial testing to find failure modes and harmful outputs before deployment
  • Evaluation standards: Developing benchmarks and protocols for assessing safety properties systematically
  • Responsible scaling: Linking capability advancement to demonstrated safety measures
Safety Is Not Optional

As AI systems become more capable and more widely deployed, the consequences of misalignment grow. Safety and alignment are not academic concerns -- they are practical engineering requirements. Every ML system deployed in production should have appropriate safety evaluations, monitoring, and guardrails proportional to its potential for harm. The cost of adding safety is small compared to the cost of deploying a harmful system.

ML Everywhere: The Integration Imperative

The integration of ML into every aspect of computing -- from operating systems to databases to networking -- represents a fundamental shift in how software is built. ML systems engineering is becoming a core competency across all of computer science, and responsible practices must scale with this integration.

Career Advice

The most valuable ML systems engineers will be those who combine deep ML knowledge with strong systems fundamentals (distributed systems, databases, networking, operating systems) and an understanding of responsible AI practices. Invest in breadth across these areas, not just depth in one. The ability to reason about safety, fairness, and governance will increasingly differentiate senior engineers.

The best way to predict the future is to build it -- but in AI, the best way to build the future is to ensure it benefits everyone.

Adapted from Alan Kay

AI Safety

Research and practices aimed at ensuring AI systems remain beneficial, controllable, and aligned with human values as capabilities increase.

RLHF

Reinforcement Learning from Human Feedback, a technique for aligning language model behavior with human preferences through reward modeling.

Key Takeaways

  1. Foundation models have created a new paradigm where most ML starts from pre-trained models rather than from scratch.
  2. AI governance is rapidly evolving, with the EU AI Act establishing the most comprehensive risk-based regulatory framework to date.
  3. Fairness is a family of mathematically incompatible criteria -- demographic parity, equalized odds, and calibration cannot all be satisfied simultaneously when base rates differ.
  4. Bias audits must compute disaggregated metrics including disparate impact ratio, equalized odds gaps, and per-group calibration to detect unfair treatment.
  5. Explainability tools like SHAP and LIME enable post-hoc interpretation of black-box models, essential for trust and regulatory compliance.
  6. AI safety research -- including red-teaming, constitutional AI, and interpretability -- must keep pace with capability advances.
  7. Scaling laws enable principled decisions about compute, data, and parameters, but also reveal that the compute arms race is unsustainable without efficiency breakthroughs.
  8. Responsible scaling policies (Anthropic ASL, OpenAI Preparedness Framework) formalize the principle that greater capability requires greater safety evaluation.
  9. Model cards and datasheets for datasets provide structured transparency that serves developers, deployers, regulators, and affected communities.
  10. The integration of ML into every aspect of computing makes responsible AI practices a broadly essential engineering skill.
