
AGI Systems

Surveys emerging trends and frontier models. Explores generative AI, foundation models, and the evolving landscape of advanced ML systems.

Learning Objectives

  • Explain how foundation models are trained, adapted, and served at scale in production systems
  • Compare global AI regulatory frameworks including the EU AI Act, US Executive Order, and China's approach
  • Implement a bias audit pipeline that evaluates demographic parity, equalized odds, and calibration across protected groups
  • Formalize fairness criteria mathematically and explain the impossibility theorem and its practical implications
  • Apply SHAP and LIME to generate explainable predictions from black-box ML models
  • Evaluate AI safety practices including red-teaming, constitutional AI, and responsible scaling policies
  • Apply scaling laws to make principled decisions about compute, data, and parameter allocation
  • Generate model cards and datasheets for datasets following established templates
  • Design responsible AI governance processes that integrate bias audits, safety evaluations, and model documentation into the ML lifecycle

01 Frontier Models and Foundation Models

Foundation models are large-scale models trained on broad data that can be adapted to a wide range of downstream tasks. GPT-4, Claude, Gemini, and Llama represent the current frontier of language models, while DALL-E, Stable Diffusion, and Midjourney lead in image generation. These models have demonstrated remarkable capabilities that were unexpected even a few years ago.

Definition

Foundation Model

A large-scale model trained on broad, diverse data using self-supervised learning that serves as a general-purpose base for many downstream tasks. Adaptation to specific tasks is achieved through fine-tuning, prompting, or in-context learning, without training a new model from scratch.

Figure 20.1: The six pillars of Responsible AI: Fairness, Transparency, Privacy, Safety, Accountability, and Sustainability, connected through a hub-and-spoke framework. Example practices include equal selection rates across genders in hiring AI (fairness), reason codes for credit-scoring denials (transparency), patient data that never leaves the hospital (privacy), emergency-stop protocols for autonomous vehicles (safety), compliance-officer sign-off for financial AI (accountability), and training in renewable-energy data centers (sustainability).

Systems Challenges at Scale

The systems challenges of foundation models are immense, spanning training, serving, and the infrastructure required to make these models practical. Each phase introduces distinct engineering bottlenecks that require specialized solutions.

Systems challenges across the foundation model lifecycle

Challenge | Training Phase | Serving Phase
Compute | Thousands of GPUs for months | Low-latency inference at massive scale
Memory | Model state, optimizer states, activations | KV-cache management for long contexts
Cost | Millions of dollars per training run | Inference cost per token must be economically viable
Reliability | Fault tolerance over weeks-long runs | 99.9%+ uptime for production APIs
Efficiency | Mixed precision, gradient checkpointing | Batching, quantization, speculative decoding
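The serving-phase memory row above centers on KV-cache management. As a rough illustration, here is a back-of-envelope KV-cache sizing sketch; the model dimensions are hypothetical (loosely modeled on a 70B-class model with grouped-query attention) and should be replaced with those of the model actually being served.

python
# Back-of-envelope KV-cache sizing for long-context serving.
# All dimensions below are illustrative assumptions, not figures for a specific model.

def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # fp16/bf16 activations
) -> int:
    """Memory needed to cache keys and values for one batch."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch_size

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128.
gb = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=32) / 1e9
print(f"KV cache: ~{gb:.0f} GB")  # roughly 86 GB at this batch size and context length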

Scale of Training

Training a frontier foundation model like GPT-4 is estimated to require 10,000+ GPUs running for several months at a cost exceeding $100 million. This makes each training run a major capital investment and puts frontier model development out of reach for most organizations.
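One way to see where figures like these come from is the sketch below, which converts an assumed parameter count and token budget into total FLOPs, wall-clock time, and dollar cost using the common C ≈ 6·N·D approximation. Every constant is an illustrative assumption, not a measured figure for any particular model.

python
# Back-of-envelope training compute and cost estimate; all inputs are assumptions.

def training_estimate(
    n_params: float,
    n_tokens: float,
    gpu_peak_flops: float = 1e15,   # assumed ~1 PFLOP/s peak per accelerator
    utilization: float = 0.4,       # realistic model FLOPs utilization
    n_gpus: int = 10_000,
    dollars_per_gpu_hour: float = 2.0,
):
    total_flops = 6 * n_params * n_tokens                 # C ~ 6*N*D approximation
    gpu_seconds = total_flops / (gpu_peak_flops * utilization)
    wall_clock_days = gpu_seconds / n_gpus / 86_400
    cost = gpu_seconds / 3_600 * dollars_per_gpu_hour     # raw GPU rental only
    return total_flops, wall_clock_days, cost

# Hypothetical 500B-parameter dense model trained on 10T tokens:
flops, days, cost = training_estimate(n_params=5e11, n_tokens=10e12)
print(f"{flops:.1e} FLOPs, ~{days:.0f} days on 10,000 GPUs, ~${cost / 1e6:.0f}M")
# Real programs cost more: failed runs, ablations, and staff are not included.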

Quick Check

What is the primary systems challenge of scaling foundation models?

At the scale of frontier foundation models (thousands of GPUs training for months), the dominant systems bottleneck is communication overhead across the GPU cluster. Gradient synchronization, parameter updates, and activation transfers between GPUs consume significant bandwidth and create latency. Techniques like tensor parallelism, pipeline parallelism, and gradient compression exist specifically to mitigate this communication bottleneck. Data and optimizer state matter, but they are not the primary systems engineering challenge.
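A quick back-of-envelope calculation shows why communication looms so large. The sketch below estimates per-step gradient synchronization time for pure data parallelism under a ring all-reduce; the parameter count, GPU count, and link bandwidth are illustrative assumptions.

python
# Rough per-step gradient synchronization time under a ring all-reduce.
# All numbers are illustrative assumptions.

def ring_allreduce_seconds(
    n_params: float,
    n_gpus: int,
    bandwidth_gb_per_s: float,   # effective per-GPU interconnect bandwidth
    bytes_per_grad: int = 2,     # bf16 gradients
) -> float:
    payload = n_params * bytes_per_grad
    # A ring all-reduce moves ~2 * (n - 1) / n of the payload over each link.
    volume = 2 * payload * (n_gpus - 1) / n_gpus
    return volume / (bandwidth_gb_per_s * 1e9)

# 70B parameters synchronized across 1,024 GPUs at 100 GB/s effective bandwidth:
t = ring_allreduce_seconds(70e9, 1024, 100)
print(f"~{t:.2f} s of pure communication per optimizer step")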

Model Card Generation

Model cards are structured documentation artifacts that accompany trained ML models. They report intended use, performance characteristics, limitations, and ethical considerations. The following template generates standardized model cards for responsible model release.

python
from dataclasses import dataclass, field
from typing import Optional
import json
from datetime import datetime


@dataclass
class ModelCard:
    """Standardized model documentation following
    Mitchell et al. (2019).

    Model cards increase transparency and enable
    informed decisions by downstream users.
    """
    model_name: str
    model_version: str
    model_type: str
    training_data_summary: str
    intended_use: str
    out_of_scope_uses: list[str]
    ethical_considerations: list[str]
    limitations: list[str]
    evaluation_results: dict[str, float]
    bias_audit_results: Optional[dict] = None
    carbon_footprint_kg: Optional[float] = None
    training_compute_flops: Optional[float] = None
    license: str = "Proprietary"
    created_at: str = field(
        default_factory=lambda: datetime.now().isoformat()
    )

    def to_markdown(self) -> str:
        md = f"# Model Card: {self.model_name}"
        md += f" v{self.model_version}\n\n"
        md += f"**Type:** {self.model_type}\n"
        md += f"**License:** {self.license}\n"
        md += f"**Created:** {self.created_at}\n\n"

        md += f"## Intended Use\n{self.intended_use}\n\n"

        md += "## Out-of-Scope Uses\n"
        for use in self.out_of_scope_uses:
            md += f"- {use}\n"

        md += "\n## Training Data\n"
        md += f"{self.training_data_summary}\n\n"

        md += "## Evaluation Results\n"
        for metric, value in self.evaluation_results.items():
            md += f"- **{metric}:** {value:.4f}\n"

        if self.bias_audit_results:
            md += "\n## Bias Audit\n"
            for group, metrics in self.bias_audit_results.items():
                md += f"- **{group}:** {metrics}\n"

        md += "\n## Limitations\n"
        for lim in self.limitations:
            md += f"- {lim}\n"

        md += "\n## Ethical Considerations\n"
        for eth in self.ethical_considerations:
            md += f"- {eth}\n"

        if self.carbon_footprint_kg:
            md += "\n## Environmental Impact\n"
            md += "- Training CO2: "
            md += f"{self.carbon_footprint_kg:.0f} kg\n"

        if self.training_compute_flops:
            md += "- Training compute: "
            md += f"{self.training_compute_flops:.2e} FLOPs\n"

        return md

    def to_json(self) -> str:
        return json.dumps(self.__dict__, indent=2)


# --- Example usage ---
card = ModelCard(
    model_name="SentimentBERT",
    model_version="2.1",
    model_type="Fine-tuned BERT-base for sentiment analysis",
    training_data_summary=(
        "1.2M English product reviews from Amazon, "
        "filtered for quality. Demographics: US-centric, "
        "English-only. Period: 2018-2023."
    ),
    intended_use=(
        "Sentiment classification for English product "
        "reviews in e-commerce applications."
    ),
    out_of_scope_uses=[
        "Medical or legal text analysis",
        "Non-English languages",
        "Hate speech or toxicity detection",
        "High-stakes decision-making without human review",
    ],
    ethical_considerations=[
        "May reflect biases in product review data",
        "Performance varies across demographic groups",
        "Should not be used as sole input for decisions "
        "affecting individuals",
    ],
    limitations=[
        "English only; no multilingual support",
        "Sarcasm detection is unreliable (F1=0.62)",
        "Max input length: 512 tokens",
        "Not tested on social media text or slang",
    ],
    evaluation_results={
        "accuracy": 0.924,
        "f1_macro": 0.891,
        "auc_roc": 0.967,
    },
    bias_audit_results={
        "gender_parity_ratio": 0.94,
        "age_group_max_accuracy_gap": 0.03,
    },
    carbon_footprint_kg=12.5,
    training_compute_flops=3.2e17,
)

print(card.to_markdown())

Emergent Capabilities

Emergent capabilities -- abilities that appear suddenly as models scale beyond certain size thresholds -- have surprised the research community. In-context learning, chain-of-thought reasoning, and instruction following all emerged without being explicitly trained for. This unpredictability is a core challenge for safety: we cannot reliably predict what a model trained at 10x scale will be able to do.

Emergence Is Unpredictable

Emergent capabilities appear suddenly and are difficult to predict from smaller-scale experiments. This makes it challenging to anticipate what a larger model will be able to do -- both the beneficial capabilities and the potential for misuse. Safety evaluation must therefore be comprehensive rather than relying on extrapolation from smaller models.

Definition

Emergent Capability

An ability that is absent or near-random in smaller models but appears abruptly once a model exceeds a critical scale threshold in parameters, data, or compute. Examples include multi-step arithmetic, chain-of-thought reasoning, and cross-lingual transfer. The mechanism behind emergence remains an open research question.

Deeper Insight

When GPT-3 scaled beyond a critical threshold, it suddenly gained the ability to perform in-context learning, chain-of-thought reasoning, and few-shot task completion -- none of which were present in smaller versions of the same architecture. This is not gradual improvement; it is qualitative change. For systems engineers, this means you cannot reliably predict what a 10x larger model will do by extrapolating from a smaller one. Safety evaluation must be comprehensive at each new scale, and infrastructure must be designed to handle capabilities that may not exist yet at the time of design.

Think of it like water at 99 degrees Celsius: still liquid, but at 100 degrees it abruptly becomes steam -- a completely different state with fundamentally different properties. Scale in neural networks can trigger similarly abrupt phase transitions in capability.

Democratization Through Open Models

The democratization of foundation models through open weights and efficient fine-tuning methods has broadened access beyond a few large corporations. This shift has profound implications for both innovation and safety governance.

  • Open-weight models: Llama, Mistral, and Falcon provide powerful base models that anyone can use and adapt
  • LoRA (Low-Rank Adaptation): Adds small trainable matrices to frozen model weights, enabling fine-tuning with a fraction of the compute
  • QLoRA: Combines 4-bit quantization with LoRA for fine-tuning 65B-parameter models on a single GPU
  • Hugging Face ecosystem: Provides infrastructure for sharing models, datasets, and fine-tuning recipes
Start With a Foundation Model

For almost any new ML project today, the default starting point should be adapting an existing foundation model rather than training from scratch. Fine-tuning a pre-trained model is faster, cheaper, and often produces better results than training a specialized model from the ground up.
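As a minimal sketch of this adapt-rather-than-retrain workflow, the snippet below attaches LoRA adapters to a small pre-trained model using the Hugging Face transformers and peft libraries. The base checkpoint, target modules, and hyperparameters are placeholder choices, not recommendations.

python
# Minimal LoRA fine-tuning setup (sketch); model and hyperparameters are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "bert-base-uncased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# Train with the standard transformers Trainer or a custom loop; only the small
# LoRA matrices receive gradients, so a single GPU is usually sufficient.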

Foundation Model

A large-scale model trained on broad data that can be adapted to diverse downstream tasks through fine-tuning, prompting, or in-context learning.

Emergent Capabilities

Abilities that appear suddenly as models scale beyond certain thresholds, such as in-context learning and chain-of-thought reasoning.

02 AI Governance and Global Regulation

AI governance encompasses the policies, frameworks, and institutions that guide the development and deployment of AI systems. As AI capabilities advance rapidly, governments worldwide are racing to establish regulatory frameworks that balance innovation with safety, fairness, and accountability.

Definition

AI Governance

The set of policies, standards, organizational structures, and processes that direct and control how AI systems are developed, deployed, and monitored. Governance spans technical practices (model cards, audits), organizational policies (review boards, risk assessment), and external regulation (laws, industry standards).

Global Regulatory Landscape

The regulatory landscape for AI is fragmented across jurisdictions, with each taking a distinct philosophical approach. Understanding these differences is essential for organizations deploying AI systems internationally.

Table 20.1: Global AI regulation comparison as of 2025

Jurisdiction | Framework | Approach | Key Requirements | Enforcement
European Union | EU AI Act (2024) | Risk-based, prescriptive | Conformity assessments for high-risk AI, transparency obligations, prohibited practices (social scoring, real-time biometric surveillance) | Up to 7% of global revenue for violations
United States | Executive Order 14110 (2023) + sector rules | Voluntary + sector-specific | Frontier model safety reporting (>10^26 FLOPs), NIST AI RMF guidance, FDA/SEC/FTC sector authority | Varies by sector; no comprehensive federal AI law
China | Algorithmic Recommendation, Deep Synthesis, GenAI regulations | Content-focused, prescriptive | Algorithm registration, content labeling, training data review, socialist core values alignment | Cyberspace Administration enforcement
United Kingdom | Pro-Innovation AI framework | Principles-based, light-touch | Existing regulators apply 5 principles (safety, transparency, fairness, accountability, contestability) | Sector regulators (FCA, Ofcom, CMA)
Canada | AIDA (Artificial Intelligence and Data Act) | Risk-based | Impact assessments for high-impact systems, algorithmic transparency | New AI and Data Commissioner

EU AI Act Risk Tiers

The EU AI Act classifies AI systems into four risk tiers:
  • Unacceptable risk (banned): Social scoring by governments, real-time biometric surveillance in public spaces, manipulation of vulnerable groups, emotion recognition in workplaces and schools.
  • High risk (strict requirements): AI in critical infrastructure, education, employment, law enforcement, migration, and justice. Requires conformity assessments, risk management systems, data governance, human oversight, and logging.
  • Limited risk (transparency obligations): Chatbots, deepfakes, emotion detection. Must disclose that users are interacting with AI.
  • Minimal risk (no requirements): Spam filters, AI in video games, inventory management. No regulatory obligations.

AI Safety Benchmark Landscape

Standardized benchmarks enable systematic evaluation of AI safety properties. As models become more capable, the benchmarks used to evaluate them must evolve to test for increasingly sophisticated risks.

Table 20.2: Key AI safety benchmarks and their coverage areas.

Benchmark | Focus Area | Methodology | Maintained By
TruthfulQA | Truthfulness | 817 questions designed to elicit common misconceptions | Academic (Lin et al.)
BBQ | Social bias across 9 categories | Ambiguous/disambiguated question pairs | Google
RealToxicityPrompts | Toxic text generation | 100K naturally occurring prompts | Allen AI
HarmBench | Jailbreak resistance | Standardized adversarial attack methods | Academic consortium
WMDP | Dangerous knowledge (bio/chem/cyber) | Multiple-choice dual-use knowledge tests | Academic (Li et al.)
MMLU | General knowledge and reasoning | 57 subjects, elementary to professional level | Academic (Hendrycks et al.)
DecodingTrust | Comprehensive trustworthiness | Toxicity, stereotypes, privacy, robustness, fairness | Academic (Wang et al.)
MLCommons AI Safety | Standardized safety evaluation | Hazard taxonomy with graded severity | MLCommons consortium

Model Cards and Transparency Documentation

Model cards, introduced by Mitchell et al. (2019), provide structured documentation about a model's intended use, performance characteristics, limitations, and ethical considerations. They serve as the primary transparency artifact in responsible AI practice.

Definition

Model Card

A structured documentation artifact that accompanies a trained ML model, reporting its intended use cases, performance metrics across demographic groups, limitations, ethical considerations, and training data provenance. Model cards increase transparency and enable informed decision-making by downstream users and affected communities.

Definition

Datasheets for Datasets

Standardized documentation for ML datasets, proposed by Gebru et al. (2021), that records the motivation, composition, collection process, preprocessing, intended uses, distribution, and maintenance plan for a dataset. Datasheets help downstream users understand data provenance and potential biases.

What a Model Card Should Include

A comprehensive model card documents:
  1. Model details -- architecture, training procedure, versions
  2. Intended use -- primary use cases and out-of-scope applications
  3. Factors -- demographic groups, environmental conditions, and instrumentation relevant to performance
  4. Metrics -- which metrics were chosen and why
  5. Evaluation data -- benchmark details and disaggregated results
  6. Training data -- provenance and composition summary
  7. Ethical considerations -- potential harms and mitigations applied
  8. Caveats and recommendations

International Coordination Challenges

International coordination on AI governance faces significant challenges. Different cultural values, economic interests, and geopolitical dynamics shape each jurisdiction's approach. The Bletchley Declaration (2023) and subsequent AI Safety Summits represent early efforts at multilateral alignment, but binding international agreements remain elusive.

Regulatory Arbitrage Risk

Without international coordination, companies may develop AI in the least-regulated jurisdiction and deploy globally -- a phenomenon known as regulatory arbitrage. The EU AI Act addresses this by applying to any AI system that affects EU residents, regardless of where the provider is based. This "Brussels effect" may establish a de facto global standard, as companies find it more efficient to build one compliant system than to maintain separate versions for different jurisdictions.

  • Bletchley Declaration (2023): 28 countries agreed on the need to identify and manage frontier AI risks
  • AI Safety Institutes: UK, US, Japan, Singapore, and others establishing national AI safety research bodies
  • OECD AI Principles: Non-binding but widely adopted framework for trustworthy AI
  • G7 Hiroshima AI Process: Voluntary code of conduct for frontier AI developers
  • ISO/IEC 42001: International standard for AI management systems

AI Governance

The policies, standards, and processes that direct and control how AI systems are developed, deployed, and monitored across organizational and regulatory dimensions.

EU AI Act

The European Union's comprehensive AI regulation establishing a risk-based classification system with requirements proportional to potential harm.

03 Bias, Fairness, and Explainability

Bias in ML systems can arise from training data that reflects historical inequities, from model architectures that amplify certain patterns, and from evaluation practices that fail to disaggregate performance across demographic groups. Addressing bias requires systematic auditing at every stage of the ML pipeline.

Definition

Algorithmic Bias

Systematic and repeatable errors in an ML system that create unfair outcomes for particular groups of people. Bias can originate from training data (historical bias, representation bias, measurement bias), from modeling choices (aggregation bias, evaluation bias), or from deployment context (deployment bias, feedback loops).

Bias Detection Audit Pipeline

A rigorous bias audit evaluates model performance across protected attributes at multiple stages: data analysis, pre-deployment testing, and post-deployment monitoring. The following pipeline computes demographic parity, equalized odds, and disparate impact across groups.

python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    confusion_matrix,
)
from dataclasses import dataclass


@dataclass
class FairnessReport:
    """Per-group fairness metrics."""
    group: str
    n_samples: int
    accuracy: float
    precision: float
    recall: float
    positive_rate: float  # P(Y_hat = 1)
    fpr: float            # False positive rate
    fnr: float            # False negative rate


def bias_audit(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    sensitive_attr: np.ndarray,
) -> dict[str, FairnessReport]:
    """Run a comprehensive bias audit across groups.

    Computes per-group accuracy, precision, recall,
    positive rate, FPR, and FNR to evaluate demographic
    parity, equalized odds, and disparate impact.
    """
    reports: dict[str, FairnessReport] = {}
    groups = np.unique(sensitive_attr)

    for group in groups:
        mask = sensitive_attr == group
        yt, yp = y_true[mask], y_pred[mask]
        tn, fp, fn, tp = confusion_matrix(
            yt, yp, labels=[0, 1]
        ).ravel()

        reports[str(group)] = FairnessReport(
            group=str(group),
            n_samples=int(mask.sum()),
            accuracy=accuracy_score(yt, yp),
            precision=precision_score(
                yt, yp, zero_division=0
            ),
            recall=recall_score(
                yt, yp, zero_division=0
            ),
            positive_rate=float(yp.mean()),
            fpr=fp / (fp + tn) if (fp + tn) > 0 else 0.0,
            fnr=fn / (fn + tp) if (fn + tp) > 0 else 0.0,
        )

    return reports


def check_demographic_parity(
    reports: dict[str, FairnessReport],
    threshold: float = 0.8,
) -> dict[str, object]:
    """Check demographic parity (4/5ths rule).

    Demographic parity requires:
    P(Y_hat=1 | A=a) = P(Y_hat=1 | A=b) for all groups.
    The 4/5ths rule: ratio of min/max selection rates >= 0.8.
    """
    rates = {
        g: r.positive_rate for g, r in reports.items()
    }
    max_rate = max(rates.values())
    min_rate = min(rates.values())
    di_ratio = min_rate / max_rate if max_rate > 0 else 1.0

    return {
        "group_positive_rates": rates,
        "disparate_impact_ratio": round(di_ratio, 4),
        "passes_four_fifths_rule": di_ratio >= threshold,
    }


def check_equalized_odds(
    reports: dict[str, FairnessReport],
    max_gap: float = 0.05,
) -> dict[str, object]:
    """Check equalized odds: equal TPR and FPR across groups.

    Equalized odds requires:
    P(Y_hat=1 | Y=y, A=a) = P(Y_hat=1 | Y=y, A=b)
    for y in {0, 1}. We check that the max gap in
    FPR and TPR across groups is below the threshold.
    """
    fprs = {g: r.fpr for g, r in reports.items()}
    tprs = {g: r.recall for g, r in reports.items()}
    fpr_gap = max(fprs.values()) - min(fprs.values())
    tpr_gap = max(tprs.values()) - min(tprs.values())

    return {
        "group_fprs": fprs,
        "group_tprs": tprs,
        "max_fpr_gap": round(fpr_gap, 4),
        "max_tpr_gap": round(tpr_gap, 4),
        "passes_equalized_odds": (
            fpr_gap < max_gap and tpr_gap < max_gap
        ),
    }


def check_calibration(
    y_true: np.ndarray,
    y_prob: np.ndarray,
    sensitive_attr: np.ndarray,
    n_bins: int = 10,
) -> dict[str, dict[str, float]]:
    """Check calibration across groups.

    Calibration requires:
    P(Y=1 | S=s, A=a) = P(Y=1 | S=s, A=b)
    i.e., predicted probabilities mean the same thing
    for all groups. Reports ECE per group.
    """
    results = {}
    for group in np.unique(sensitive_attr):
        mask = sensitive_attr == group
        probs = y_prob[mask]
        labels = y_true[mask]
        bin_edges = np.linspace(0, 1, n_bins + 1)
        ece = 0.0
        for i in range(n_bins):
            in_bin = (
                (probs > bin_edges[i])
                & (probs <= bin_edges[i + 1])
            )
            if in_bin.sum() == 0:
                continue
            avg_pred = probs[in_bin].mean()
            avg_true = labels[in_bin].mean()
            ece += (
                in_bin.sum() / len(probs)
                * abs(avg_pred - avg_true)
            )
        results[str(group)] = {"ece": round(ece, 4)}
    return results


# --- Run a full audit (y_test, y_pred, race_labels assumed from an earlier evaluation step) ---
reports = bias_audit(y_test, y_pred, race_labels)

print("=== Per-Group Metrics ===")
for group, r in reports.items():
    print(
        f"  {group}: acc={r.accuracy:.3f}, "
        f"prec={r.precision:.3f}, "
        f"recall={r.recall:.3f}, "
        f"pos_rate={r.positive_rate:.3f}"
    )

print("\n=== Demographic Parity ===")
dp = check_demographic_parity(reports)
print(f"  DI ratio: {dp['disparate_impact_ratio']}")
print(f"  Passes 4/5ths rule: {dp['passes_four_fifths_rule']}")

print("\n=== Equalized Odds ===")
eo = check_equalized_odds(reports)
print(f"  Max FPR gap: {eo['max_fpr_gap']}")
print(f"  Max TPR gap: {eo['max_tpr_gap']}")
print(f"  Passes: {eo['passes_equalized_odds']}")

Fairness Criteria and Impossibility Results

Fairness is not a single metric but a family of often-incompatible criteria. The choice of fairness definition has significant practical consequences and should be driven by the deployment context and stakeholder values.

\text{Demographic Parity: } P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b)
\text{Equalized Odds: } P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b) \quad \forall y \in \{0, 1\}
\text{Calibration: } P(Y=1 \mid S=s, A=a) = P(Y=1 \mid S=s, A=b) \quad \forall s
Table 20.3: Major fairness criteria with formal definitions and application guidance.

  • Demographic Parity: P(Y_hat=1 | A=a) = P(Y_hat=1 | A=b). Equal selection rates across groups. Use when base rates should not affect decisions (e.g., hiring).
  • Equalized Odds: P(Y_hat=1 | Y=y, A=a) = P(Y_hat=1 | Y=y, A=b). Equal TPR and FPR across groups. Use when accuracy matters equally for all groups (e.g., medical diagnosis).
  • Predictive Parity: P(Y=1 | Y_hat=1, A=a) = P(Y=1 | Y_hat=1, A=b). Equal precision across groups. Use when positive predictions trigger costly interventions.
  • Individual Fairness: d(f(x), f(x')) <= L * d(x, x'). Similar individuals get similar predictions. Use when a meaningful similarity metric exists.
  • Counterfactual Fairness: P(Y_hat | do(A=a)) = P(Y_hat | do(A=b)). Prediction unchanged if group membership were different. Use when causal reasoning about protected attributes is possible.

Impossibility Theorem

Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) proved that when base rates differ across groups, it is mathematically impossible to simultaneously satisfy calibration, balance for the positive class, and balance for the negative class. This means you must choose which fairness criterion to prioritize based on the specific context. There is no universally "fair" model -- fairness requires value judgments about which trade-offs are acceptable.
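One way to see why: for a binary classifier, the false positive rate is algebraically tied to the group's base rate p, its precision (PPV), and its false negative rate:

\text{FPR} = \frac{p}{1-p} \cdot \frac{1 - \text{PPV}}{\text{PPV}} \cdot (1 - \text{FNR})

If two groups have different base rates p but the model is forced to have equal PPV (calibration) and equal FNR, this identity forces their FPRs to differ, so equalized odds cannot also hold. A common operational test, the four-fifths rule on disparate impact, is expressed as: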

DI = \frac{P(\hat{Y}=1 \mid A=\text{minority})}{P(\hat{Y}=1 \mid A=\text{majority})} \geq 0.8

Explainability with SHAP and LIME

Post-hoc explainability methods provide interpretable explanations for individual predictions from black-box models. SHAP (SHapley Additive exPlanations) is grounded in cooperative game theory, while LIME (Local Interpretable Model-agnostic Explanations) fits a local linear model around each prediction.

python
import shap
import lime
import lime.lime_tabular
import numpy as np

# --- SHAP: Global and local feature importance ---
def explain_with_shap(
    model, X_train, X_explain, feature_names
):
    """Generate SHAP explanations for model predictions.

    SHAP values decompose each prediction into additive
    contributions from each feature, grounded in Shapley
    values from cooperative game theory.
    """
    explainer = shap.Explainer(model, X_train)
    shap_values = explainer(X_explain)

    # Global importance: mean |SHAP value| per feature
    global_importance = np.abs(
        shap_values.values
    ).mean(axis=0)
    importance_ranking = sorted(
        zip(feature_names, global_importance),
        key=lambda x: x[1],
        reverse=True,
    )
    print("Global Feature Importance (SHAP):")
    for name, imp in importance_ranking[:10]:
        print(f"  {name}: {imp:.4f}")

    # Local explanation for a single prediction
    idx = 0
    print(f"\nLocal explanation for sample {idx}:")
    for name, val in sorted(
        zip(feature_names, shap_values.values[idx]),
        key=lambda x: abs(x[1]),
        reverse=True,
    )[:5]:
        direction = "+" if val > 0 else ""
        print(f"  {name}: {direction}{val:.4f}")

    return shap_values


# --- LIME: Local interpretable explanations ---
def explain_with_lime(
    model, X_train, instance, feature_names
):
    """Generate a LIME explanation for one prediction.

    LIME perturbs the input, observes prediction changes,
    and fits a weighted linear model to approximate the
    decision boundary locally.
    """
    explainer = lime.lime_tabular.LimeTabularExplainer(
        X_train,
        feature_names=feature_names,
        mode="classification",
    )
    explanation = explainer.explain_instance(
        instance,
        model.predict_proba,
        num_features=10,
    )

    print("LIME Local Explanation:")
    for feature, weight in explanation.as_list():
        print(f"  {feature}: {weight:+.4f}")

    return explanation

SHAP vs. LIME Trade-offs

SHAP provides theoretically grounded (Shapley value) explanations with consistency guarantees but is computationally expensive for large models (exponential in the number of features without approximations). LIME is faster and model-agnostic but explanations can be unstable -- small changes in the input or the perturbation seed can produce different explanations. For regulated applications, SHAP is generally preferred due to its theoretical guarantees.

Definition

SHAP (SHapley Additive exPlanations)

An explainability method based on Shapley values from cooperative game theory. For each prediction, SHAP assigns each feature a value representing its contribution to the difference between the prediction and the average prediction. SHAP values satisfy desirable properties: local accuracy, missingness, and consistency.

Algorithmic Bias

Systematic errors in ML systems that produce unfair outcomes for particular demographic groups, arising from data, modeling, or deployment choices.

Fairness Criteria

Mathematical definitions of what constitutes fair predictions, including demographic parity, equalized odds, and individual fairness.

04 AI Safety and Alignment

AI safety research addresses the challenge of ensuring that increasingly capable AI systems remain beneficial, controllable, and aligned with human values. As models gain capabilities that approach or exceed human-level performance in specific domains, the potential consequences of misalignment grow correspondingly. Safety is not a feature to be added after development -- it must be integrated throughout the entire model lifecycle.

Definition

AI Safety

The interdisciplinary field focused on ensuring that AI systems operate reliably, remain under meaningful human control, and produce outcomes aligned with human values -- especially as systems become more capable and autonomous. AI safety spans technical research (alignment, interpretability, robustness) and governance (evaluation standards, deployment controls).

Red-Teaming and Safety Evaluation

Red-teaming is the practice of adversarially testing AI systems to discover failure modes, harmful outputs, and safety vulnerabilities before deployment. Modern red-teaming combines automated evaluation suites with human adversarial testing across multiple risk dimensions.

Definition

Red-Teaming

Structured adversarial testing of an AI system by a dedicated team attempting to elicit harmful, dangerous, or unintended outputs. Red teams probe for failure modes including toxicity, bias, misinformation generation, jailbreak vulnerabilities, dangerous knowledge disclosure, and capability overhangs. Results inform safety mitigations before deployment.

Table 20.4: Key AI safety benchmarks and evaluation suites.

Benchmark | What It Measures | Methodology | Key Limitation
TruthfulQA | Truthfulness vs. common misconceptions | 817 questions designed to elicit false but plausible answers | Limited to English, narrow question types
BBQ (Bias Benchmark for QA) | Social bias across 9 categories | Ambiguous and disambiguated question pairs across demographics | May not capture subtle or intersectional biases
RealToxicityPrompts | Propensity to generate toxic text | 100K naturally occurring prompts scored for toxicity | Toxicity classifiers themselves have biases
HarmBench | Resistance to adversarial jailbreaks | Standardized attack methods across harm categories | Arms race: new attacks can bypass evaluated defenses
WMDP (Weapons of Mass Destruction Proxy) | Dangerous knowledge in bio/chem/cyber | Multiple-choice questions testing dual-use knowledge | Proxy measure; does not test actual capability to cause harm
MMLU (Massive Multitask Language Understanding) | General knowledge and reasoning | 57 subjects from elementary to professional level | Saturating as a benchmark; does not test safety directly

OpenAI's Safety Practices

OpenAI's Preparedness Framework classifies frontier model risks across four categories -- cybersecurity, biological threats, persuasion, and model autonomy -- each rated from Low to Critical. Before deploying a new model, the Preparedness team evaluates it against specific capability thresholds. Models rated "High" risk or above in any category require explicit sign-off from leadership and additional mitigations. External red teams (including domain experts in biosecurity and cybersecurity) supplement internal testing.

Alignment Techniques

Alignment refers to the technical challenge of ensuring that an AI system's objectives and behaviors match human intentions and values. The dominant paradigm for aligning large language models combines supervised fine-tuning (SFT) on high-quality demonstrations with reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO).

Anthropic's Constitutional AI

Constitutional AI (CAI) addresses a key limitation of RLHF: it is expensive and difficult to scale human feedback to cover all possible model behaviors. CAI works in two phases:
  1. Supervised phase: The model generates responses, then critiques and revises them based on a written "constitution" of principles (e.g., "choose the response that is most helpful while being honest and harmless").
  2. RL phase: The model is trained with RLAIF (RL from AI Feedback), where a separate model provides preference labels guided by the constitution, replacing human annotators for the majority of the training signal.
This approach makes alignment more scalable, more transparent (the principles are explicit and auditable), and more consistent than pure human feedback.

DeepMind's Safety Research

Google DeepMind's safety research spans multiple threads: (1) Scalable oversight -- methods for humans to supervise AI systems that may exceed human capabilities in specific domains, including debate and recursive reward modeling; (2) Interpretability -- understanding neural network internals through tools like concept probes and activation analysis; (3) Evaluations -- developing benchmarks and threat models for dangerous capabilities; (4) Specification -- formalizing what it means for AI to be beneficial, including work on moral uncertainty and value learning.

  1. Supervised Fine-Tuning (SFT): Train on high-quality demonstrations of desired behavior from human annotators
  2. Reward Modeling: Train a separate model to predict human preferences between pairs of model outputs
  3. RLHF: Use PPO or similar RL algorithms to optimize the language model against the reward model
  4. DPO (Direct Preference Optimization): Directly optimize on preference pairs without training a separate reward model (a minimal loss sketch follows this list)
  5. Constitutional AI: Model self-critiques outputs against stated principles, enabling scalable alignment without per-example human labels
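To make step 4 concrete, here is a minimal sketch of the DPO objective in PyTorch (assumed as the framework), operating on per-sequence log-probabilities that are presumed to have been computed already for both the policy and a frozen reference model.

python
# Minimal DPO loss sketch (Rafailov et al., 2023); inputs are assumed precomputed.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_chosen_logps: torch.Tensor,       # same sequences under the frozen reference
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the implicit rewards of chosen and rejected outputs.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()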

Interpretability and Mechanistic Understanding

Interpretability research aims to understand what neural networks learn and how they make decisions. Mechanistic interpretability, in particular, seeks to reverse-engineer the computational mechanisms inside models by identifying specific circuits -- subnetworks of neurons and attention heads that implement specific computations.

Why Interpretability Matters for Safety

Without interpretability, safety relies on behavioral testing: we check whether the model produces bad outputs on a finite set of test inputs. But behavioral testing cannot guarantee safety on unseen inputs. Mechanistic interpretability aims to verify safety at the level of the model's internal representations -- understanding not just what the model does, but how it works. This is analogous to the difference between black-box testing and code review in software engineering.
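A simple starting point, well short of full mechanistic analysis, is a linear probe: a small classifier trained on a model's internal activations to test whether a concept is linearly decodable at a given layer. The sketch below assumes the activations and concept labels have already been extracted, and uses scikit-learn for the probe.

python
# Linear probe sketch: is a concept linearly decodable from hidden activations?
# `activations` and `concept_labels` are assumed to come from an earlier extraction step.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(activations: np.ndarray, concept_labels: np.ndarray) -> float:
    """Train a linear probe on hidden states and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, concept_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# High probe accuracy suggests the concept is represented linearly at that layer;
# it does not by itself show that the model uses that representation causally.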


Red-Teaming

Structured adversarial testing of AI systems by dedicated teams to discover failure modes, harmful outputs, and safety vulnerabilities before deployment.

Constitutional AI

An alignment approach where the model critiques and revises its own outputs based on a written set of principles, enabling scalable alignment.

05 Scaling Laws, Compute Trends, and Responsible Scaling

Scaling laws describe the predictable relationship between model size, dataset size, compute budget, and model performance. These empirical relationships have become one of the most important tools for planning large-scale ML training and for anticipating the capabilities -- and risks -- of future models.

Definition

Scaling Laws

Empirical power-law relationships that predict how model performance (typically measured as loss) improves as a function of model parameter count (N), dataset size (D), and compute budget (C). These laws enable principled resource allocation before committing to expensive training runs.

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty

The Chinchilla scaling laws showed that many large models were under-trained relative to their size. For a given compute budget, the optimal approach scales data proportionally with parameters rather than just making models bigger.

Chinchilla Optimal Training

Before Chinchilla, the prevailing approach was to train very large models on relatively small datasets. The Chinchilla result showed that a 70B parameter model trained on 1.4T tokens outperformed the 280B parameter Gopher trained on 300B tokens, using the same compute budget. This finding redirected the entire field toward data scaling. The Chinchilla-optimal ratio is approximately 20 tokens per parameter.

N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}
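As a rough illustration, the sketch below splits a fixed compute budget between parameters and tokens using the common C ≈ 6·N·D approximation together with the roughly 20-tokens-per-parameter heuristic. The constants are approximations, not exact fitted values from the Chinchilla paper.

python
# Compute-optimal allocation sketch under C ~ 6*N*D and D ~ 20*N (approximate heuristics).
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget between parameter count N and training tokens D."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25):
    n, d = chinchilla_allocation(budget)
    print(f"C={budget:.0e}: ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
# At ~5.8e23 FLOPs this recovers roughly 70B parameters and 1.4T tokens,
# consistent with the Chinchilla configuration described above.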

The Compute Arms Race

Table 20.5: Compute scaling of frontier language models: training FLOPs have increased by 5+ orders of magnitude in 6 years.

Year | Model | Parameters | Training Tokens | Estimated FLOPs | Estimated Cost
2019 | GPT-2 | 1.5B | ~40B | ~1.5 x 10^20 | ~$50K
2020 | GPT-3 | 175B | 300B | ~3.1 x 10^23 | ~$4.6M
2022 | Chinchilla | 70B | 1.4T | ~5.8 x 10^23 | ~$3M
2023 | Llama 2 70B | 70B | 2T | ~1 x 10^24 | ~$5M
2023 | GPT-4 (est.) | ~1.8T MoE | ~13T | ~2 x 10^25 | ~$100M
2024 | Llama 3 405B | 405B | 15T | ~3.8 x 10^25 | ~$150M
2025+ | Frontier (projected) | ~1T+ dense | ~30T+ | ~10^26 | ~$500M+

Doubling Faster Than Hardware

Compute requirements for frontier models double every 6-10 months, while hardware efficiency doubles roughly every 2 years (end of Moore's Law). This gap is unsustainable without fundamental breakthroughs in algorithmic efficiency or hardware. Organizations must plan for this reality when projecting future training costs.


Responsible Scaling Policies

Responsible scaling policies (RSPs) link the deployment of increasingly capable models to commensurate safety evaluations and mitigations. These policies formalize the principle that with greater capability comes greater responsibility for safety.

Anthropic's Responsible Scaling Policy

Anthropic's RSP defines AI Safety Levels (ASL-1 through ASL-4+) that mirror biosafety levels. Each level is defined by the dangerous capabilities the model possesses and requires corresponding containment and deployment safeguards:
  • ASL-1: Models posing no meaningful catastrophic risk (e.g., a chess engine).
  • ASL-2: Models with early signs of dangerous capabilities but not yet providing uplift beyond existing publicly available information. Current frontier models are at ASL-2.
  • ASL-3: Models that substantially increase the risk of catastrophic misuse (e.g., providing meaningful uplift for WMD creation). Requires enhanced security, access controls, and deployment restrictions.
  • ASL-4+: Models with capabilities that could pose existential-scale risks. Requirements to be defined as the field develops.
Before advancing to the next ASL, Anthropic commits to demonstrating that the required safety measures for that level are in place.

Environmental Impact

The environmental impact of AI training is a growing concern. A single frontier model training run can consume as much energy as hundreds of US households use in a year. Responsible development requires measuring, reporting, and reducing the carbon footprint of ML systems; a rough estimation sketch follows the list below.

  • Training GPT-3 emitted approximately 552 tons of CO2 equivalent
  • Frontier model training in 2025 may exceed 10,000 tons CO2 equivalent per run
  • Inference costs typically dominate training costs over a model's lifetime: a widely-deployed model serving millions of users can consume more total compute for inference than it used for training
  • Efficiency improvements (quantization, distillation, MoE) reduce both cost and environmental impact
  • Data center location matters: renewable-powered facilities can reduce carbon intensity by 10x or more compared to coal-powered regions
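The sketch below is the kind of rough emissions estimate a model card might report. Every constant (average power draw, PUE, grid carbon intensity) is an assumption to be replaced with measured values for a real disclosure.

python
# Rough training-emissions estimate; all constants are assumptions, not measurements.

def training_co2_kg(
    n_gpus: int,
    avg_power_w: float,            # average draw per GPU, watts
    hours: float,
    pue: float = 1.2,              # data-center power usage effectiveness
    grid_kg_per_kwh: float = 0.4,  # carbon intensity of the local grid
) -> float:
    energy_kwh = n_gpus * avg_power_w * hours * pue / 1000.0
    return energy_kwh * grid_kg_per_kwh

# Hypothetical run: 1,000 GPUs at ~400 W average for 30 days on a mixed grid.
tonnes = training_co2_kg(1000, 400, hours=24 * 30) / 1000
print(f"~{tonnes:.0f} tonnes CO2e")  # drops sharply on a low-carbon grid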

Scaling Laws

Empirical relationships that predict how model performance improves as a function of model size, dataset size, and compute budget.

Responsible Scaling Policy

A governance framework that links model capability levels to required safety evaluations and containment measures, ensuring safety keeps pace with capability.

06 Future Directions for Responsible ML Systems

The future of ML systems lies at the intersection of several converging trends: more capable models, more efficient hardware, better software abstractions, and evolving societal expectations around safety and governance. Engineers who understand all of these dimensions will be best positioned to shape the field responsibly.

Automated ML Engineering

Automated ML engineering, where AI systems help design, train, optimize, and deploy other AI systems, is an emerging trend that could dramatically accelerate progress while also raising new governance questions about AI systems making decisions about other AI systems.

Table 20.6: The evolving landscape of automated ML engineering and associated governance challenges.
Automation AreaCurrent StateFuture DirectionGovernance Concern
Architecture searchNAS finds competitive architectures automaticallyAI designs novel architectures beyond human intuitionReduced human understanding of model internals
Code generationCopilot-class tools assist ML engineers dailyEnd-to-end ML pipeline generation from specificationsPropagation of subtle bugs and biases
DebuggingAI identifies common training issuesAutomated root cause analysis and fix suggestionOver-reliance on automated diagnosis
OptimizationAutoML tunes hyperparameters effectivelyAutomated system-level optimization across the stackOptimization for metrics that miss important values
EvaluationAutomated benchmark runningAI-designed evaluation suites for novel failure modesAI evaluating AI: who watches the watchmen?

Table 20.6: The evolving landscape of automated ML engineering and associated governance challenges.

The Evolving Engineer Role

As ML engineering becomes increasingly automated, the role of the ML systems engineer will shift from hands-on implementation toward system design, oversight, evaluation, and ensuring safety and quality. Understanding the fundamentals becomes more important, not less, as you need to evaluate and guide automated systems effectively.

Safety and Alignment as Core Engineering

Definition

AI Alignment

The research challenge of ensuring that AI systems pursue goals that are beneficial to humans and act in accordance with human values and intentions, especially as systems become more capable and autonomous. Alignment is distinct from capability: a very capable but misaligned system is more dangerous than a less capable one.

  • RLHF (Reinforcement Learning from Human Feedback): Fine-tuning models using human preference data to align outputs with human values
  • Constitutional AI: Training models to follow a set of principles, enabling self-critique and revision without per-example human labels
  • Interpretability: Understanding what models have learned and how they make decisions at the mechanistic level
  • Red-teaming: Adversarial testing to find failure modes and harmful outputs before deployment
  • Evaluation standards: Developing benchmarks and protocols for assessing safety properties systematically
  • Responsible scaling: Linking capability advancement to demonstrated safety measures
Safety Is Not Optional

As AI systems become more capable and more widely deployed, the consequences of misalignment grow. Safety and alignment are not academic concerns -- they are practical engineering requirements. Every ML system deployed in production should have appropriate safety evaluations, monitoring, and guardrails proportional to its potential for harm. The cost of adding safety is small compared to the cost of deploying a harmful system.

ML Everywhere: The Integration Imperative

The integration of ML into every aspect of computing -- from operating systems to databases to networking -- represents a fundamental shift in how software is built. ML systems engineering is becoming a core competency across all of computer science, and responsible practices must scale with this integration.

Career Advice

The most valuable ML systems engineers will be those who combine deep ML knowledge with strong systems fundamentals (distributed systems, databases, networking, operating systems) and an understanding of responsible AI practices. Invest in breadth across these areas, not just depth in one. The ability to reason about safety, fairness, and governance will increasingly differentiate senior engineers.

The best way to predict the future is to build it -- but in AI, the best way to build the future is to ensure it benefits everyone.

Adapted from Alan Kay

AI Safety

Research and practices aimed at ensuring AI systems remain beneficial, controllable, and aligned with human values as capabilities increase.

RLHF

Reinforcement Learning from Human Feedback, a technique for aligning language model behavior with human preferences through reward modeling.

Key Takeaways

  1. Foundation models have created a new paradigm where most ML starts from pre-trained models rather than from scratch.
  2. AI governance is rapidly evolving, with the EU AI Act establishing the most comprehensive risk-based regulatory framework to date.
  3. Fairness is a family of mathematically incompatible criteria -- demographic parity, equalized odds, and calibration cannot all be satisfied simultaneously when base rates differ.
  4. Bias audits must compute disaggregated metrics including disparate impact ratio, equalized odds gaps, and per-group calibration to detect unfair treatment.
  5. Explainability tools like SHAP and LIME enable post-hoc interpretation of black-box models, essential for trust and regulatory compliance.
  6. AI safety research -- including red-teaming, constitutional AI, and interpretability -- must keep pace with capability advances.
  7. Scaling laws enable principled decisions about compute, data, and parameters, but also reveal that the compute arms race is unsustainable without efficiency breakthroughs.
  8. Responsible scaling policies (Anthropic ASL, OpenAI Preparedness Framework) formalize the principle that greater capability requires greater safety evaluation.
  9. Model cards and datasheets for datasets provide structured transparency that serves developers, deployers, regulators, and affected communities.
  10. The integration of ML into every aspect of computing makes responsible AI practices a broadly essential engineering skill.
