ML Systems, Part 5: Safety & Trust, Chapter 19 (~25 min read)

AI for Good

Showcases beneficial applications of ML systems for social impact. Explores healthcare, climate, education, and humanitarian use cases.

social impact · healthcare AI · climate AI · beneficial applications · humanitarian AI
Read in mlsysbook.ai
Learning Objectives

  • Evaluate ML applications in healthcare including medical imaging, drug discovery, and clinical decision support
  • Implement a medical image classification pipeline using PyTorch transfer learning with appropriate evaluation metrics
  • Analyze the unique engineering constraints of deploying ML in resource-limited social impact settings
  • Implement NLP pipelines for sentiment analysis and text classification using transformer models
  • Design a recommender system using collaborative filtering and evaluate it with ranking metrics like NDCG and Precision@k
  • Compare ML system requirements across healthcare, finance, and autonomous vehicles domains
  • Describe how ML systems contribute to climate science, weather prediction, and energy optimization
  • Assess the safety challenges and sensor fusion requirements for autonomous vehicle perception systems
  • Compare accessibility applications of ML across speech, vision, and language modalities
  • Design an ML solution for a social impact scenario that accounts for regulatory, ethical, and deployment constraints

01 ML for Social Impact

Machine learning has enormous potential to address pressing social challenges when applied thoughtfully and responsibly. From improving healthcare outcomes to advancing climate science to enhancing education, ML systems can amplify human capability in ways that benefit society broadly.

Definition

AI for Social Good

The intentional application of artificial intelligence techniques to address societal challenges such as healthcare access, climate change, education inequality, and humanitarian crises, with emphasis on equitable outcomes and deployment in resource-constrained environments.

Figure 19.1: ML Application Domains. Six domains compared by representative tasks, latency budgets, and data scale: Healthcare (diagnostics, drug discovery, clinical NLP, medical imaging, EHR analysis; latency < 500 ms; TB-scale data), Finance (fraud detection, algorithmic trading, credit scoring, AML screening; < 10 ms; PB-scale), NLP (translation, summarization, chatbots, code assist; < 2 s; TB-scale), Computer Vision (object detection, segmentation, medical scans, image generation; < 100 ms; PB-scale), Autonomous Systems (self-driving, robotics, path planning, SLAM, grasping; < 50 ms; PB-scale), and Recommendation (collaborative filtering, content ranking, A/B testing; < 200 ms; PB-scale).
AlphaFold: A Landmark in Scientific AI

DeepMind's AlphaFold2 (2020) solved the 50-year-old protein structure prediction problem, predicting 3D structures with atomic-level accuracy. The AlphaFold Protein Structure Database now contains over 200 million predicted structures -- covering nearly every known protein. This has accelerated drug discovery, enzyme engineering, and our fundamental understanding of biology. AlphaFold demonstrates how ML can provide transformative scientific breakthroughs when applied to well-defined problems with rich training data. The 2024 Nobel Prize in Chemistry was awarded partly for this work.

Measuring Social Impact

The measurement of social impact presents unique challenges. Unlike commercial metrics that are easily quantifiable, social outcomes may be difficult to measure, may manifest over long time horizons, and may involve competing stakeholder interests. Defining appropriate metrics and evaluation frameworks is a critical first step that requires community engagement.

Table 19.1: Impact metrics across social good domains and their measurement challenges.

| Domain | Impact Metric | Measurement Challenge | Example |
|---|---|---|---|
| Healthcare | Lives saved / QALYs gained | Long time horizons, counterfactual difficulty | AI-assisted diagnosis reducing misdiagnosis rates by 30% |
| Education | Learning outcomes improvement | Confounders, long-term retention | Adaptive tutoring improving test scores by 15% |
| Climate | CO2 emissions reduced (tons) | Attribution across complex systems | Grid optimization saving 40% cooling energy |
| Agriculture | Crop yield improvement (%) | Weather variability, adoption barriers | Disease detection reducing crop loss by 25% |
| Disaster Response | Time to first response (hours) | Infrastructure damage, coordination costs | Satellite imagery analysis providing damage maps in 4 hours |

Design for Deployment Constraints

ML for social impact often means designing for environments very different from those of well-resourced tech companies. Models must run on low-cost hardware, work with intermittent connectivity, handle noisy data, and serve users with limited technical literacy. A well-compressed MobileNet model running offline on a $100 smartphone can have more real-world impact than a state-of-the-art model that requires cloud connectivity.
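As a back-of-the-envelope illustration of why compression matters on low-cost hardware, the sketch below estimates weight-storage footprint at different numeric precisions (the helper name is ours; ~3.4 M parameters is the commonly cited MobileNetV2 size):

```python
def model_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage only; ignores activations and runtime overhead."""
    return n_params * bytes_per_param / 1e6

# MobileNetV2 has roughly 3.4M parameters
for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {model_size_mb(3_400_000, nbytes):.1f} MB")
# fp32: 13.6 MB, fp16: 6.8 MB, int8: 3.4 MB
```

Quantizing from fp32 to int8 shrinks the weights by 4x, often the difference between fitting in a budget phone's memory and not.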

AI for Good

The application of artificial intelligence to address societal challenges in areas like health, climate, education, and humanitarian response.

Impact Measurement

The practice of defining and tracking quantitative metrics for the social outcomes of ML deployments, accounting for long time horizons and counterfactual difficulties.

02 Healthcare and Life Sciences

Healthcare AI shows enormous promise across medical imaging, drug discovery, clinical decision support, and genomics. Deep learning models can detect diseases in medical images with accuracy comparable to or exceeding that of specialist physicians, but deploying these systems in clinical practice requires navigating regulatory, ethical, and trust challenges.

Definition

Clinical Decision Support System (CDSS)

An ML-powered tool that integrates patient data, clinical evidence, and model predictions to assist healthcare providers in diagnosis, treatment planning, or risk assessment. CDSSs must meet regulatory requirements (FDA, CE marking) and provide transparent, explainable outputs that clinicians can validate.


Medical Image Classification with Transfer Learning

Transfer learning from ImageNet-pretrained models is the standard approach for medical image classification, because labeled medical datasets are typically small (hundreds to low thousands of images). Fine-tuning a pretrained ResNet or EfficientNet on domain-specific data achieves strong performance with limited data. The following pipeline demonstrates a complete medical image classification workflow.

python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import transforms, datasets, models
from sklearn.metrics import (
    classification_report, roc_auc_score
)
import numpy as np

# --- Data preparation with medical imaging augmentations ---
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

val_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Load dataset (e.g., chest X-ray pneumonia detection)
train_dataset = datasets.ImageFolder(
    "data/chest_xray/train", transform=train_transforms
)
val_dataset = datasets.ImageFolder(
    "data/chest_xray/val", transform=val_transforms
)

train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True,
)
val_loader = DataLoader(
    val_dataset, batch_size=32, shuffle=False,
    num_workers=4, pin_memory=True,
)

# --- Transfer learning: fine-tune a pretrained ResNet-50 ---
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze early layers (feature extractor)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for binary classification
num_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 1),  # Binary: pneumonia vs. normal
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=3, factor=0.5
)

# --- Training loop with validation ---
best_auc = 0.0
for epoch in range(20):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.float().to(device)
        optimizer.zero_grad()
        outputs = model(images).squeeze(1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Validation
    model.eval()
    all_labels, all_probs = [], []
    with torch.no_grad():
        for images, labels in val_loader:
            images = images.to(device)
            outputs = model(images).squeeze(1)
            probs = torch.sigmoid(outputs).cpu().numpy()
            all_probs.extend(probs)
            all_labels.extend(labels.numpy())

    val_auc = roc_auc_score(all_labels, all_probs)
    scheduler.step(1 - val_auc)  # ReduceLROnPlateau minimizes, so track 1 - AUC

    if val_auc > best_auc:
        best_auc = val_auc
        torch.save(model.state_dict(), "best_model.pth")

    print(
        f"Epoch {epoch+1}: loss={running_loss/len(train_loader):.4f}, "
        f"val_AUC={val_auc:.4f}"
    )

# --- Final evaluation ---
model.load_state_dict(torch.load("best_model.pth"))
model.eval()
all_labels, all_preds, all_probs = [], [], []
with torch.no_grad():
    for images, labels in val_loader:
        images = images.to(device)
        outputs = model(images).squeeze(1)
        probs = torch.sigmoid(outputs).cpu().numpy()
        preds = (probs >= 0.5).astype(int)
        all_probs.extend(probs)
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())

print("\n" + classification_report(
    all_labels, all_preds,
    target_names=["Normal", "Pneumonia"],
))
print(f"AUC-ROC: {roc_auc_score(all_labels, all_probs):.4f}")

Evaluation Metrics for Medical AI

Medical AI systems require careful evaluation beyond simple accuracy. In healthcare, the relative cost of false positives (unnecessary follow-up) versus false negatives (missed disease) varies dramatically by application. The following metrics are essential for evaluating medical ML systems.

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}
F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
\text{AUC-ROC} = \int_0^1 TPR(FPR^{-1}(t)) \, dt = P(S_{\text{pos}} > S_{\text{neg}})
Table 19.2: Key evaluation metrics for healthcare AI systems and their appropriate use cases.

| Metric | Formula | Best For | Limitation |
|---|---|---|---|
| Sensitivity (Recall) | TP / (TP + FN) | Screening (minimize missed cases) | Ignores false positive rate |
| Specificity | TN / (TN + FP) | Confirmatory tests (minimize false alarms) | Ignores missed cases |
| PPV (Precision) | TP / (TP + FP) | When positive prediction triggers costly action | Depends on prevalence |
| NPV | TN / (TN + FN) | Rule-out tests | Depends on prevalence |
| AUC-ROC | Area under TPR vs FPR curve | Overall discrimination ability | Can be misleading with class imbalance |
| AUC-PR | Area under Precision vs Recall curve | Rare disease detection (imbalanced) | Harder to interpret than AUC-ROC |
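The threshold-dependent metrics above can be computed directly from confusion-matrix counts. A minimal sketch (function names and the 1000-patient example are illustrative):

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Threshold-dependent metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall: fraction of disease cases caught
        "specificity": tn / (tn + fp),  # fraction of healthy cases cleared
        "ppv": tp / (tp + fp),          # precision: trust in a positive call
        "npv": tn / (tn + fn),          # trust in a negative call
    }

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily (useful for screening)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: a screening test on 1000 patients, 100 of whom have the disease
m = confusion_metrics(tp=90, fp=45, tn=855, fn=10)
print(m)
print("F2 (recall-weighted):", f_beta(m["ppv"], m["sensitivity"], beta=2.0))
```

With 90 of 100 cases caught, sensitivity is 0.90 but PPV is only 0.67: two of every three positive calls would still need follow-up, which is why precision and prevalence matter alongside recall.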

GPT-4 in Clinical Trials and Medical Reasoning

GPT-4 passed the US Medical Licensing Examination (USMLE) with scores well above the passing threshold, demonstrating strong medical knowledge. However, passing a standardized test is very different from safe clinical deployment. LLMs can hallucinate medical facts, miss context from patient history, and provide plausible-sounding but incorrect diagnoses. Current research focuses on using LLMs as clinical documentation assistants, literature summarizers, and patient communication aids -- roles where they augment clinicians rather than replace them. Retrieval-augmented generation (RAG) from curated medical knowledge bases helps reduce hallucination in clinical LLM applications.

Regulatory Pathway for Medical AI

Medical AI devices in the US require FDA clearance through one of three pathways: 510(k) for devices substantially equivalent to existing ones, De Novo for novel low-to-moderate risk devices, or Premarket Approval (PMA) for high-risk devices. As of 2024, the FDA has authorized over 900 AI-enabled medical devices, primarily in radiology and cardiology. The EU Medical Device Regulation (MDR) imposes similar requirements. Deploying a medical AI system without proper regulatory approval is illegal and dangerous.

Transfer Learning

Training a model pretrained on a large dataset (e.g., ImageNet) on a smaller domain-specific dataset, enabling strong performance with limited labeled medical data.

AUC-ROC

Area Under the Receiver Operating Characteristic curve, a threshold-independent metric measuring a model's overall discrimination ability between positive and negative classes.

03 NLP Applications and Large Language Models

Natural language processing has been transformed by large language models. The shift from task-specific models to general-purpose LLMs has fundamentally changed how NLP systems are designed, trained, and deployed. Prompting a foundation model can now accomplish tasks that previously required months of custom model development.

Definition

Hallucination

The tendency of language models to generate text that is fluent and plausible-sounding but factually incorrect or entirely fabricated. Hallucination is a fundamental challenge for deploying LLMs in high-stakes applications where accuracy is critical, such as healthcare, legal, and financial domains.

Production NLP Pipeline Architecture

Table 19.3: Components of a production RAG-based NLP pipeline.

| Component | Purpose | Common Tools | Key Design Decision |
|---|---|---|---|
| Document ingestion | Parse and chunk input documents | Unstructured, LangChain document loaders | Chunk size and overlap strategy |
| Embedding | Convert text to dense vector representations | OpenAI embeddings, sentence-transformers | Embedding model vs. latency trade-off |
| Vector store | Index and retrieve relevant context | Pinecone, Weaviate, pgvector | Hosted vs. self-managed; reranking |
| Prompt engineering | Structure the model input for best output | Templates, few-shot examples, system prompts | Task-specific prompt design and testing |
| Generation | Produce the final output text | GPT-4, Claude, Llama, Mistral | Cost/quality/latency trade-off; local vs. API |
| Output validation | Verify correctness and safety | Guardrails, custom validators, citation checking | Automated vs. human-in-the-loop review |
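The retrieval step at the heart of a RAG pipeline can be sketched without any external services: embed documents and query as vectors, then rank by cosine similarity. The bag-of-words "embeddings" below are a toy stand-in for a real embedding model, and all names and documents are illustrative:

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query: str, docs: list[str], vocab: list[str], k: int = 2) -> list[str]:
    """Return the k documents most cosine-similar to the query."""
    q = embed(query, vocab)
    sims = []
    for d in docs:
        v = embed(d, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        sims.append(q @ v / denom if denom else 0.0)
    order = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in order]

vocab = ["malaria", "vaccine", "crop", "yield", "flood"]
docs = [
    "malaria vaccine trials expanded",
    "crop yield improved by irrigation",
    "flood damage maps released",
]
print(retrieve("malaria vaccine coverage", docs, vocab, k=1))
```

The retrieved passages are then inserted into the generation prompt, so every claim in the answer can be traced back to a source chunk.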

Hallucination Remains Unsolved

Despite advances in retrieval-augmented generation and chain-of-thought prompting, hallucination remains a fundamental challenge for LLMs. In a 2024 study, even the best models hallucinated on 3-5% of factual questions. For applications where accuracy is critical (medical, legal, financial), every generated statement should be traceable to a source document. Never deploy an LLM in a high-stakes domain without a verification layer.

LLMs in Code Generation

GitHub Copilot, powered by OpenAI Codex, generates code suggestions inline as developers type. In industry studies, developers using Copilot completed tasks 55% faster on average. However, generated code can contain security vulnerabilities, license violations, and subtle bugs. Production deployments require automated security scanning (SAST), test generation, and human code review. The key insight is that LLMs are most effective as productivity multipliers for skilled engineers, not as replacements for engineering judgment.

Large Language Model

A neural network with billions of parameters trained on massive text corpora, capable of diverse language tasks through prompting or fine-tuning.

RAG

Retrieval-Augmented Generation, a pattern that retrieves relevant context from external knowledge bases before generating responses, reducing hallucination and enabling up-to-date information.

04 Recommender Systems and Information Retrieval

Recommender systems drive engagement on major platforms and are among the most commercially impactful ML applications. These systems predict user preferences from interaction history, user features, and item features, powering personalized experiences at scale.

Definition

Collaborative Filtering

A recommendation approach that predicts user preferences by finding patterns in the collective behavior of similar users (user-based CF) or similar items (item-based CF), without requiring item content features. The core assumption is that users who agreed in the past will agree in the future.

Definition

Cold Start Problem

The challenge of making recommendations for new users (user cold start) or new items (item cold start) that have no prior interaction history. Content-based features, popularity-based fallbacks, and explicit preference elicitation are common mitigations.

Collaborative Filtering with Matrix Factorization

Matrix factorization decomposes the sparse user-item interaction matrix R into two dense matrices: a user matrix U and an item matrix V. Each user and item is represented by a latent factor vector, and the predicted rating is the dot product of these vectors. The following implementation demonstrates this approach.

python
import numpy as np
from typing import Optional


class MatrixFactorizationCF:
    """Collaborative filtering via matrix factorization.

    Decomposes the user-item interaction matrix R into
    low-rank matrices U (users) and V (items) such that
    R ≈ U @ V.T. Trained with SGD on observed entries.
    """

    def __init__(
        self,
        n_users: int,
        n_items: int,
        n_factors: int = 50,
        lr: float = 0.01,
        reg: float = 0.02,
    ):
        self.n_factors = n_factors
        self.lr = lr
        self.reg = reg
        # Initialize latent factors with small random values
        self.user_factors = np.random.normal(
            0, 0.1, (n_users, n_factors)
        )
        self.item_factors = np.random.normal(
            0, 0.1, (n_items, n_factors)
        )
        # Bias terms capture user/item-specific tendencies
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.global_bias = 0.0

    def fit(
        self,
        ratings: list[tuple[int, int, float]],
        n_epochs: int = 20,
        val_ratings: Optional[list[tuple[int, int, float]]] = None,
    ):
        """Train on (user_id, item_id, rating) tuples.

        Uses stochastic gradient descent with L2
        regularization on both factors and biases.
        """
        all_ratings = [r for _, _, r in ratings]
        self.global_bias = np.mean(all_ratings)

        for epoch in range(n_epochs):
            np.random.shuffle(ratings)
            total_loss = 0.0

            for user, item, rating in ratings:
                pred = self.predict(user, item)
                error = rating - pred

                # Update biases
                self.user_bias[user] += self.lr * (
                    error - self.reg * self.user_bias[user]
                )
                self.item_bias[item] += self.lr * (
                    error - self.reg * self.item_bias[item]
                )

                # Update latent factors (copy user factors so the
                # item update uses the pre-update values)
                u = self.user_factors[user].copy()
                self.user_factors[user] += self.lr * (
                    error * self.item_factors[item]
                    - self.reg * u
                )
                self.item_factors[item] += self.lr * (
                    error * u
                    - self.reg * self.item_factors[item]
                )

                total_loss += error ** 2

            rmse = np.sqrt(total_loss / len(ratings))
            msg = f"Epoch {epoch+1}: train RMSE={rmse:.4f}"
            if val_ratings:
                val_rmse = self.evaluate(val_ratings)
                msg += f", val RMSE={val_rmse:.4f}"
            print(msg)

    def predict(self, user: int, item: int) -> float:
        """Predict rating for a user-item pair."""
        return (
            self.global_bias
            + self.user_bias[user]
            + self.item_bias[item]
            + self.user_factors[user]
            @ self.item_factors[item]
        )

    def recommend(
        self, user: int, n: int = 10,
        exclude: Optional[set[int]] = None,
    ) -> list[tuple[int, float]]:
        """Return top-n item recommendations for a user."""
        scores = (
            self.global_bias
            + self.user_bias[user]
            + self.item_bias
            + self.item_factors @ self.user_factors[user]
        )
        if exclude:
            for item_id in exclude:
                scores[item_id] = -np.inf
        top_items = np.argsort(scores)[::-1][:n]
        return [(int(i), float(scores[i])) for i in top_items]

    def evaluate(
        self, test_ratings: list[tuple[int, int, float]]
    ) -> float:
        """Compute RMSE on test set."""
        errors = [
            (r - self.predict(u, i)) ** 2
            for u, i, r in test_ratings
        ]
        return np.sqrt(np.mean(errors))


# Example usage (train_ratings, val_ratings, and seen_items
# are assumed to be defined elsewhere)
mf = MatrixFactorizationCF(
    n_users=1000, n_items=5000, n_factors=50
)
mf.fit(train_ratings, n_epochs=20, val_ratings=val_ratings)
top_recs = mf.recommend(user=42, n=10, exclude=seen_items)
print("Top recommendations:", top_recs)

Ranking Metrics for Recommender Systems

Recommender systems are evaluated with ranking metrics that account for the position of relevant items in the recommendation list. A relevant item at position 1 is far more valuable than the same item at position 100.

\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \quad \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}
\text{Precision@}k = \frac{|\{\text{relevant items in top-}k\}|}{k}, \quad \text{Recall@}k = \frac{|\{\text{relevant items in top-}k\}|}{|\{\text{all relevant items}\}|}
Table 19.4: Ranking metrics for recommender systems and information retrieval.

| Metric | Measures | Position-Aware | Best For |
|---|---|---|---|
| NDCG@k | Ranking quality with graded relevance | Yes (log discount) | Multi-level relevance (ratings, engagement depth) |
| Precision@k | Fraction of top-k that are relevant | No (treats all k equally) | Binary relevance (clicked or not) |
| Recall@k | Coverage of relevant items in top-k | No | Catalog coverage, diversity evaluation |
| MAP | Mean of Precision@k at each relevant position | Yes (reward for early placement) | Binary relevance with position sensitivity |
| MRR | Reciprocal rank of first relevant item | Yes (first hit position) | Navigational queries, single-answer retrieval |
| Hit Rate@k | Whether any relevant item appears in top-k | No | Overall recall screening |
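The NDCG@k and Precision@k formulas above translate directly to code. A minimal sketch for binary relevance (function names are illustrative):

```python
import numpy as np

def precision_at_k(ranked_rel: list[int], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(ranked_rel[:k]) / k

def ndcg_at_k(ranked_rel: list[int], k: int) -> float:
    """NDCG@k using the (2^rel - 1) / log2(i + 1) gain formulation."""
    def dcg(rels):
        return sum(
            (2 ** r - 1) / np.log2(i + 2)  # i is 0-based, so rank = i + 1
            for i, r in enumerate(rels[:k])
        )
    ideal = sorted(ranked_rel, reverse=True)  # best possible ordering
    return dcg(ranked_rel) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Relevance of the items at ranks 1..5 (1 = relevant, 0 = not)
rel = [1, 0, 1, 0, 0]
print(f"Precision@3 = {precision_at_k(rel, 3):.3f}")  # 0.667
print(f"NDCG@3      = {ndcg_at_k(rel, 3):.3f}")       # 0.920
```

NDCG@3 is below 1.0 because the second relevant item sits at rank 3 rather than rank 2: the log discount penalizes exactly the positional misordering that Precision@k ignores.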

Modern Recommendation Architecture

Production recommender systems at scale use a multi-stage architecture to handle catalogs of millions of items efficiently. The retrieval stage uses approximate nearest neighbor search to generate a broad candidate set (hundreds from millions). The ranking stage applies a more complex model to score and reorder candidates. The post-ranking stage applies business rules, diversity constraints, and freshness boosts.

  1. Candidate retrieval (100-1000 from millions): Two-tower models, ANN search, collaborative signals
  2. Scoring/ranking (tens from hundreds): Deep ranking models with rich features (user, item, context)
  3. Re-ranking (final order): Diversity, freshness, business rules, ad insertion, deduplication
  4. Serving: Real-time feature lookup, low-latency inference, A/B testing framework
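Stage 1, candidate retrieval, can be sketched in a few lines, assuming user and item embeddings have already been produced by a two-tower model; the brute-force dot-product scan below stands in for an ANN index such as FAISS or ScaNN:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 64

# Stand-in for embeddings produced by the trained item tower
item_embeddings = rng.normal(size=(n_items, dim)).astype(np.float32)

def retrieve_candidates(user_emb: np.ndarray, k: int = 100) -> np.ndarray:
    """Top-k items by inner product; production systems use ANN search."""
    scores = item_embeddings @ user_emb
    # argpartition finds the top-k without fully sorting the catalog
    top_k = np.argpartition(scores, -k)[-k:]
    return top_k[np.argsort(scores[top_k])[::-1]]

user_emb = rng.normal(size=dim).astype(np.float32)
candidates = retrieve_candidates(user_emb, k=100)
print(len(candidates), "candidates; best:", int(candidates[0]))
```

The key design point is that this stage only has to be cheap and high-recall: the heavier ranking model in stage 2 sees just these ~100 candidates, never the full catalog.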

Collaborative Filtering

A recommendation method that predicts user preferences based on the collective behavior of similar users, without requiring item features.

NDCG

Normalized Discounted Cumulative Gain, a ranking metric that rewards recommender systems for placing relevant items higher in the recommendation list.

05 Autonomous Vehicles and Robotics

Autonomous vehicles (AVs) represent one of the most technically demanding AI applications, requiring real-time perception, prediction, planning, and control in safety-critical environments. The ML stack for AVs integrates computer vision, sensor fusion, behavioral prediction, and motion planning into a system that must operate reliably across diverse driving conditions at speeds where milliseconds matter.

Definition

Sensor Fusion

The process of combining data from multiple sensor modalities (cameras, LiDAR, radar, ultrasonic) to create a unified, more complete understanding of the environment than any single sensor can provide. Sensor fusion is essential for robust autonomous perception because each modality has complementary strengths and weaknesses.

Definition

SAE Autonomy Levels

The Society of Automotive Engineers (SAE) defines six levels of driving automation: Level 0 (no automation), Level 1 (driver assistance), Level 2 (partial automation), Level 3 (conditional automation), Level 4 (high automation), and Level 5 (full automation). Most deployed systems are Level 2 (Tesla Autopilot, GM Super Cruise), with limited Level 4 robotaxi deployments (Waymo, Cruise).

Perception and Sensor Fusion

The perception system processes data from cameras, LiDAR, radar, and ultrasonic sensors to build a comprehensive understanding of the vehicle's environment. Each sensor modality contributes unique information: cameras provide rich semantic information and color, LiDAR gives precise 3D geometry, radar works reliably in adverse weather, and ultrasonics handle close-range detection.

Table 19.5: Sensor modalities in autonomous vehicles and their trade-offs.

| Sensor | Range | Strengths | Weaknesses |
|---|---|---|---|
| Camera | ~200 m | Rich semantics, color, texture, traffic signs | Affected by glare, darkness, weather |
| LiDAR | ~250 m | Precise 3D geometry, works in darkness | Expensive, degraded in heavy rain/snow, no color |
| Radar | ~300 m | Works in all weather, direct velocity measurement | Low resolution, poor vertical discrimination |
| Ultrasonic | ~5 m | Reliable close-range detection, low cost | Very short range, slow update rate |
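A minimal numerical sketch of one fusion step, assuming two independent range estimates with known noise variances: an inverse-variance weighted average, which is the core of a Kalman-style update (all numbers are illustrative):

```python
def fuse(z1: float, var1: float, z2: float, var2: float) -> tuple[float, float]:
    """Inverse-variance weighted fusion of two independent measurements."""
    w1, w2 = 1 / var1, 1 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1 / (w1 + w2)  # always smaller than either input variance
    return fused, fused_var

# Camera range estimate: 10.0 m, noisy in depth (variance 1.0 m^2)
# Radar range estimate: 10.4 m, precise in depth (variance 0.25 m^2)
est, est_var = fuse(10.0, 1.0, 10.4, 0.25)
print(f"fused = {est:.2f} m, variance = {est_var:.2f}")  # fused = 10.32 m, variance = 0.20
```

The fused estimate leans toward the more precise radar reading, and its variance (0.20) is lower than either sensor's alone: the quantitative sense in which fusion yields a "more complete understanding than any single sensor."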

ML Applications Comparison: Healthcare vs Finance vs AVs

Different domains impose fundamentally different requirements on ML systems. Healthcare demands clinical validation and regulatory approval. Finance demands auditability and real-time performance. Autonomous vehicles demand real-time safety-critical decision-making. Understanding these differences is essential for engineers moving between domains.

Table 19.6: Comparison of ML system requirements across healthcare, finance, and autonomous vehicles.

| Dimension | Healthcare | Finance | Autonomous Vehicles |
|---|---|---|---|
| Latency requirement | Seconds to hours (diagnosis assistance) | Microseconds to milliseconds (trading) | Milliseconds (real-time control) |
| Failure mode | Misdiagnosis, missed disease | Financial loss, market manipulation | Physical injury, fatality |
| Regulatory body | FDA, EMA, CE marking | SEC, CFTC, FCA, MiFID II | NHTSA, UNECE, national road authorities |
| Data sensitivity | HIPAA, patient privacy | Insider trading restrictions, PII | Location tracking, facial recognition |
| Explainability need | Critical (clinician trust, patient rights) | High (regulatory audits, risk management) | Important (accident investigation, liability) |
| Validation approach | Clinical trials, FDA 510(k)/PMA | Backtesting, stress testing, model validation | Billions of simulation miles, disengagement reports |
| Human oversight | Physician-in-the-loop for decisions | Trader oversight, risk limits, circuit breakers | Safety driver (L2-3), remote monitoring (L4) |
| Error tolerance | Very low for critical diagnosis | Low for risk models, medium for advisory | Near-zero for safety-critical perception |

Quick Check

What differentiates ML systems in healthcare vs. e-commerce from a systems perspective?

Healthcare ML systems must meet strict regulatory requirements (FDA clearance, CE marking) and provide explainable predictions that clinicians can validate and trust. These requirements fundamentally shape system architecture, evaluation practices, and deployment timelines. E-commerce systems, while complex, operate under far fewer regulatory constraints and can iterate faster without the same level of external oversight.

The Long-Tail Safety Challenge

The safety challenge in autonomous driving is defined by the long tail of rare events. Standard performance on common driving scenarios is relatively straightforward, but safely handling unusual situations -- construction zones, emergency vehicles, unusual road geometry, adverse weather, aggressive drivers -- requires massive datasets and sophisticated testing. A self-driving system must handle millions of unique scenarios, many of which occur too rarely to be well-represented in training data.

Autonomous Vehicle Edge Cases

Real-world AV edge cases that have caused incidents include: a pedestrian crossing outside a crosswalk at night (Uber, 2018), a white tractor-trailer against a bright sky (Tesla, 2016), a stationary fire truck on a highway (Tesla, multiple), and an overturned truck blocking the road. Each of these represents a scenario that is individually rare but collectively common -- the long tail of driving scenarios is enormous.

Simulation-Driven Development

Waymo, Cruise, and other AV companies rely heavily on simulation to test edge cases that are too dangerous or rare to encounter naturally. Waymo reports running over 20 billion simulated miles annually, compared to 20+ million real-world miles. Simulation enables testing adversarial scenarios (a child darting into traffic), sensor degradation (fog, lens occlusion), and rare traffic configurations that would take decades to encounter on public roads.
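To see why simulated miles must dwarf real-world miles, a back-of-the-envelope calculation helps. The per-mile probability below is an illustrative assumption, not a measured figure; the formula simply inverts the probability of seeing a rare event at least once over many independent miles:

```python
import math

def miles_to_observe(p_per_mile: float, confidence: float = 0.99) -> float:
    """Miles of driving needed to observe a rare scenario at least once
    with the given confidence, assuming miles are independent.
    P(at least one in n miles) = 1 - (1 - p)^n  =>  n = ln(1-c) / ln(1-p)."""
    return math.log(1.0 - confidence) / math.log(1.0 - p_per_mile)

# Illustrative: a scenario that occurs roughly once per 10 million miles.
n = miles_to_observe(1e-7)  # ~46 million miles for 99% confidence
```

Even a single one-in-ten-million-miles scenario demands tens of millions of miles of exposure to observe once, and the long tail contains many such scenarios, which is why simulated mileage dominates real-world mileage by three orders of magnitude.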

Deeper Insight

Domains like medical AI, autonomous vehicles, and aerospace face the tightest constraints: real-time latency, regulatory compliance, explainability mandates, and near-zero error tolerance. Rather than limiting progress, these constraints force engineers to develop creative solutions -- simulation-at-scale testing, sensor fusion architectures, formal verification methods, and human-in-the-loop designs -- that eventually benefit the entire field. The most robust and well-engineered ML systems almost always emerge from the most constrained domains.

Think of it like Formula 1 racing: constraints such as weight limits, fuel restrictions, and safety regulations have driven innovations in materials science, aerodynamics, and engine efficiency that eventually make it into consumer vehicles.

Robotics Beyond Driving

The same perception, planning, and control techniques that power autonomous vehicles are being applied across robotics domains. Warehouse robots (Amazon, Ocado), surgical robots (Intuitive Surgical's da Vinci), agricultural robots (harvesting, weeding), and household robots (cleaning, delivery) all benefit from advances in ML-driven perception and manipulation.

  • Warehouse automation: Robots handle over 75% of Amazon fulfillment center operations, using vision for picking and placing diverse objects
  • Surgical robotics: ML assists with tissue identification, surgical planning, and semi-autonomous suturing
  • Agricultural robotics: Computer vision enables selective harvesting, precision weeding without herbicides, and crop health monitoring
  • Manipulation: Foundation models for robotics (RT-2, Octo) transfer knowledge across tasks, reducing the need for task-specific training data
  • Humanoid robots: Companies like Figure, Tesla (Optimus), and Boston Dynamics combine locomotion control with manipulation and language understanding

Sensor Fusion

Combining data from multiple sensor modalities to create a unified understanding of the environment, essential for robust autonomous perception.
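A minimal late-fusion sketch illustrates the idea. The modality weights and detection format below are hypothetical, chosen only to show how per-sensor confidences for the same candidate object can be combined, and how a degraded sensor (say, a camera in fog) can be down-weighted:

```python
def fuse_confidences(scores: dict, weights: dict) -> float:
    """Late fusion: combine per-sensor detection confidences for one
    candidate object using a normalized weighted average."""
    total_w = sum(weights[s] for s in scores)
    return sum(scores[s] * weights[s] for s in scores) / total_w

# Illustrative weights favoring lidar for ranging reliability.
weights = {"camera": 0.3, "lidar": 0.5, "radar": 0.2}
detection = {"camera": 0.62, "lidar": 0.91, "radar": 0.85}
fused = fuse_confidences(detection, weights)  # single fused confidence
```

Production systems fuse far richer information (bounding boxes, velocities, uncertainty estimates) and often fuse earlier in the pipeline, but the principle of weighting modalities by reliability is the same.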

Long-Tail Distribution

The pattern where the vast majority of driving scenarios are common and easy but safety-critical performance depends on handling the enormous variety of rare edge cases.

06 Climate and Environmental AI

ML systems contribute to climate action through improved weather and climate modeling, energy system optimization, environmental monitoring, and materials discovery for clean energy technologies. These applications leverage ML's ability to model complex physical systems and extract patterns from large observational datasets.

Definition

Climate AI

The application of machine learning to climate change mitigation (reducing emissions) and adaptation (responding to climate impacts), spanning weather prediction, energy optimization, environmental monitoring, and materials discovery for clean energy.

ML-Powered Weather Forecasting

Weather forecasting has been revolutionized by ML approaches. GraphCast and Pangu-Weather demonstrate that ML models can match or exceed traditional numerical weather prediction (NWP) models at a fraction of the computational cost. These models are trained on decades of reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) and produce global forecasts in seconds rather than hours.

Table 19.7: Landmark ML weather prediction models.
Model | Organization | Key Achievement
GraphCast | DeepMind | Outperforms ECMWF HRES on 90% of variables at 10-day forecasts
Pangu-Weather | Huawei | First ML model to outperform NWP at multiple lead times
FourCastNet | NVIDIA | Generates global forecasts 45,000x faster than NWP
GenCast | DeepMind | Probabilistic forecasting with calibrated uncertainty estimates
Aurora | Microsoft | Foundation model for Earth system forecasting across scales


Speed Changes Everything

Traditional numerical weather prediction requires hours of supercomputer time for a single 10-day forecast. ML models produce comparable forecasts in seconds on a single GPU. This speed enables ensemble forecasting (running hundreds of scenarios), more frequent updates, and broader access to high-quality weather prediction. For tropical cyclone track forecasting, GenCast outperforms the best ensemble systems while running 1000x faster.
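The ensemble idea can be sketched with a toy stand-in for a fast ML forecast model. The dynamics, noise levels, and perturbation scale below are purely illustrative; the point is the workflow of running many perturbed initial conditions and summarizing the spread:

```python
import random
import statistics

def toy_forecast(temp0: float, steps: int) -> float:
    """Toy autoregressive 'model': each step damps toward a climatological
    mean with noise. Stands in for one fast ML forecast run."""
    t = temp0
    for _ in range(steps):
        t = 0.9 * t + 0.1 * 15.0 + random.gauss(0.0, 0.5)
    return t

def ensemble_forecast(temp0: float, steps: int = 10, members: int = 200):
    """Run the model from many perturbed initial conditions; the mean is
    the central forecast and the stdev quantifies forecast uncertainty."""
    runs = [toy_forecast(temp0 + random.gauss(0.0, 0.3), steps)
            for _ in range(members)]
    return statistics.mean(runs), statistics.stdev(runs)

random.seed(0)
mean, spread = ensemble_forecast(20.0)
```

With seconds-per-run ML models, hundreds of ensemble members become affordable where a physics-based model might permit only a handful.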

Energy System Optimization

Energy system optimization uses ML to balance supply and demand in power grids with increasing renewable energy penetration. ML forecasts renewable generation, predicts demand patterns, and optimizes battery storage dispatch. These applications directly reduce carbon emissions by enabling higher renewable energy utilization.
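A greatly simplified dispatch heuristic shows the core of storage optimization. The capacities, rates, and the net-load profile are illustrative assumptions: charge when forecast renewable generation exceeds demand, discharge when demand outstrips supply.

```python
def dispatch_battery(net_load, capacity=10.0, rate=3.0):
    """Greedy battery dispatch against a forecast of net load (demand
    minus renewable generation, arbitrary units). Negative net load
    means surplus -> charge; positive means deficit -> discharge."""
    soc = 0.0   # state of charge
    grid = []   # residual load the grid must serve each hour
    for load in net_load:
        if load < 0:  # surplus: charge, limited by rate and headroom
            charge = min(-load, rate, capacity - soc)
            soc += charge
            grid.append(load + charge)
        else:         # deficit: discharge, limited by rate and stored energy
            discharge = min(load, rate, soc)
            soc -= discharge
            grid.append(load - discharge)
    return grid

# Illustrative day: midday solar surplus followed by an evening demand peak.
residual = dispatch_battery([-4.0, -4.0, -2.0, 1.0, 5.0, 6.0])
```

Real grid operators optimize over a forecast horizon with uncertainty (often via model predictive control), but the charge-on-surplus, discharge-on-peak logic is the same, and better ML forecasts directly improve the dispatch.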

DeepMind and Google Data Centers

DeepMind applied ML to optimize cooling in Google's data centers, reducing cooling energy by approximately 40%. The system uses a neural network to predict the future temperature and pressure of the data center based on current conditions and proposed actions, then recommends optimal cooling configurations. Similar approaches applied to power grid management help integrate intermittent renewable energy sources by predicting generation and optimizing storage dispatch.
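The predict-then-recommend loop described above can be sketched as follows. The linear "predictor", the energy cost function, and the candidate setpoints are all toy assumptions standing in for the learned model; only the structure (predict the outcome of each candidate action, then pick the cheapest safe one) reflects the approach:

```python
def predict_temp(current_temp: float, setpoint: float) -> float:
    """Stand-in for the learned model: predicted future temperature
    as a toy linear blend of current conditions and the action."""
    return 0.7 * current_temp + 0.3 * setpoint

def energy_cost(setpoint: float) -> float:
    """Toy cost model: lower setpoints mean harder cooling, more energy."""
    return max(0.0, 30.0 - setpoint)

def recommend_setpoint(current_temp, candidates, max_safe_temp=27.0):
    """Evaluate each candidate action with the predictor, keep the ones
    predicted to stay safe, and choose the cheapest. Fall back to the
    coolest setpoint if no candidate is predicted safe."""
    safe = [s for s in candidates if predict_temp(current_temp, s) <= max_safe_temp]
    return min(safe, key=energy_cost) if safe else min(candidates)

choice = recommend_setpoint(30.0, [18, 20, 22, 24, 26])
```

The real system predicts many coupled quantities (temperatures, pressures) with a neural network and searches a much larger action space, but the safety-constrained cost minimization is the same shape.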

Environmental Monitoring

  • Deforestation tracking: ML classifies satellite imagery to detect forest loss within days, enabling rapid response
  • Biodiversity monitoring: Acoustic sensors + ML identify species from sound recordings across vast areas
  • Illegal mining detection: Satellite imagery analysis spots unauthorized mining operations in protected areas
  • Ocean health: ML analyzes sea surface temperature, chlorophyll, and current data for marine ecosystem monitoring
  • Air quality prediction: ML forecasts pollution levels to issue health warnings and guide policy
  • Carbon accounting: ML estimates emissions from satellite observations of power plants, agriculture, and land use
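Deforestation tracking typically starts from simple spectral indices before any learned model is applied. The band reflectances and alert threshold below are illustrative; the sketch computes NDVI change between two satellite passes, the kind of signal that then feeds an ML classifier:

```python
def ndvi(red: float, nir: float) -> float:
    """Normalized Difference Vegetation Index from red and near-infrared
    reflectance; healthy vegetation scores high (toward 1.0)."""
    return (nir - red) / (nir + red)

def forest_loss_alert(before: tuple, after: tuple,
                      drop_threshold: float = 0.3) -> bool:
    """Flag a pixel whose NDVI dropped sharply between acquisitions.
    In practice an ML classifier downstream filters out clouds,
    seasonal change, and sensor noise before an alert is raised."""
    return ndvi(*before) - ndvi(*after) > drop_threshold

# Illustrative pixel: dense forest, then bare soil after clearing.
alert = forest_loss_alert(before=(0.05, 0.45), after=(0.30, 0.35))
```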

Open Data Enables Impact

Much of the satellite data used for environmental monitoring is freely available through programs like Copernicus (EU), Landsat (NASA/USGS), and Planet's open data program. Combined with open-source ML tools, this creates opportunities for researchers and organizations worldwide to build environmental monitoring applications.

Climate AI

ML applications that contribute to climate change mitigation and adaptation, including weather prediction, energy optimization, and environmental monitoring.

ML Weather Prediction

Data-driven weather forecasting using neural networks trained on historical weather data, achieving accuracy comparable to physics-based models at lower computational cost.

07 Education, Accessibility, and Humanitarian AI

ML-powered educational tools can personalize learning experiences, provide intelligent tutoring, automate assessment, and make education more accessible to underserved populations. Adaptive learning systems adjust content difficulty and sequencing based on individual student performance, providing each learner with an optimized path through the material.

Definition

Adaptive Learning

An educational technology approach that uses ML to dynamically adjust the content difficulty, sequencing, pacing, and presentation style based on individual student performance data, creating a personalized learning pathway for each student.
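One classic mechanism behind adaptive learning is Bayesian knowledge tracing. The sketch below uses illustrative parameter defaults (not values from any deployed system) to update the estimated probability that a student has mastered a skill after each answer:

```python
def bkt_update(p_mastery: float, correct: bool,
               p_guess: float = 0.2, p_slip: float = 0.1,
               p_learn: float = 0.15) -> float:
    """One Bayesian knowledge tracing step: compute the posterior
    probability of mastery given the observed answer, then account
    for learning on this practice opportunity."""
    if correct:
        num = p_mastery * (1 - p_slip)
        denom = num + (1 - p_mastery) * p_guess
    else:
        num = p_mastery * p_slip
        denom = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / denom
    return posterior + (1 - posterior) * p_learn

# Three correct answers steadily raise the mastery estimate.
p = 0.3
for answer in (True, True, True):
    p = bkt_update(p, answer)
```

After each update the system can sequence content accordingly: skills estimated as mastered receive harder material, while low-mastery skills get review and scaffolding.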

NLP in Education

Natural language processing enables powerful educational applications that scale high-quality instruction to millions of learners. Large language models can serve as interactive tutoring systems that explain concepts, answer questions, and guide students through problem-solving with a Socratic approach.

Table 19.8: NLP applications in education.
Application | ML Technique | Educational Benefit
Automated essay scoring | Fine-tuned language models | Instant feedback at scale, consistent evaluation
Question generation | Seq2seq models, LLMs | Automatic creation of practice questions from content
Language learning assistants | Speech recognition + NLU | Conversational practice available 24/7
Content summarization | Extractive and abstractive summarization | Help students identify key concepts efficiently
Interactive tutoring | Large language models | Socratic dialogue, worked examples, personalized explanations


LLMs as Tutors

Khan Academy's Khanmigo, powered by large language models, acts as a Socratic tutor that guides students through problem-solving by asking leading questions rather than giving direct answers. This approach preserves the pedagogical value of productive struggle while providing support when students are stuck. Duolingo Max uses GPT-4 for conversational language practice and AI-generated explanations of grammar mistakes.

Accessibility Applications

Accessibility applications use ML to break down barriers for people with disabilities, enabling more equitable access to information and technology. These tools benefit not only their primary audience but often improve usability for everyone.

  • Speech recognition: Voice-controlled interfaces for users with motor impairments (accuracy now exceeds 95% for major languages)
  • Image description: Automatic alt-text generation for visually impaired users (Microsoft Seeing AI, Apple VoiceOver)
  • Sign language recognition: Real-time translation between sign language and text/speech
  • Real-time captioning: Live speech-to-text for deaf and hard-of-hearing users
  • Text simplification: Adapting complex text for users with cognitive disabilities
  • Eye tracking: Gaze-based computer control for users with severe motor limitations

Universal Design Benefits Everyone

Accessibility features originally designed for people with disabilities often benefit everyone. Voice interfaces help drivers, captioning helps in noisy environments, and text simplification helps non-native speakers. Designing for accessibility is designing for broader usability.

Humanitarian AI: Disaster Response and Food Security

Humanitarian applications of ML address urgent needs in disaster response, poverty alleviation, and food security. Speed is critical: after a natural disaster, ML systems analyzing satellite imagery can produce building damage maps within hours, guiding relief efforts to the most affected areas before ground teams can reach them.

Table 19.9: ML applications across disaster response phases.
Phase | ML Application | Data Source
Preparedness | Risk mapping, vulnerability prediction | Historical disasters, geographic data, census
Immediate response | Damage assessment, needs mapping | Satellite/drone imagery, social media
Relief coordination | Resource allocation optimization | Logistics data, population movement
Recovery | Infrastructure reconstruction prioritization | Damage assessments, economic data


Mobile-First Design for Developing Regions

Many agricultural AI applications in developing regions run on basic smartphones. PlantVillage and similar tools let farmers photograph a diseased leaf and receive an immediate diagnosis and treatment recommendation. Design for offline operation, low-resolution cameras, and simple interfaces. A well-compressed MobileNet model can run inference in under 100ms on a mid-range smartphone.

Equity and Privacy in Educational AI

Educational ML systems can inadvertently reinforce existing inequities. An automated essay scorer trained primarily on essays from affluent school districts may penalize valid writing styles from other backgrounds. Student data, especially for younger learners, requires strong privacy protections (FERPA in the US, GDPR for EU students). Always evaluate educational AI for equitable outcomes across student demographics.

Adaptive Learning

Educational technology that adjusts content difficulty, sequencing, and presentation based on individual student performance and learning patterns.

Humanitarian AI

ML applications that support humanitarian response and development, including disaster relief, food security, and poverty alleviation in resource-constrained environments.

Key Takeaways

  1. Effective ML for social impact requires deep collaboration between ML engineers and domain experts in each application area.
  2. Healthcare AI shows enormous promise in medical imaging and drug discovery but faces stringent regulatory (FDA, CE marking) and ethical requirements.
  3. Transfer learning from pretrained models is the standard approach for medical imaging, enabling strong performance with limited labeled clinical data.
  4. Evaluation metrics must match the application: F-beta for configurable precision-recall trade-offs, AUC-ROC for overall discrimination, NDCG for ranking quality.
  5. NLP applications have been transformed by large language models, but hallucination, prompt injection, and high inference costs remain critical production challenges.
  6. Recommender systems use multi-stage retrieval-and-ranking architectures evaluated with position-aware metrics like NDCG and MAP.
  7. Autonomous vehicles require real-time perception, prediction, and planning, with safety defined by handling the long tail of rare edge cases.
  8. Healthcare, finance, and autonomous vehicles impose fundamentally different requirements on ML systems in terms of latency, regulation, and failure tolerance.
  9. Climate AI applications in weather prediction and energy optimization can directly reduce carbon emissions.
  10. Accessibility applications of ML break down barriers for people with disabilities across speech, vision, and language.
  11. Humanitarian AI must navigate challenges of data scarcity, infrastructure limitations, and cultural context in deployment.
