AI for Good
Showcases beneficial applications of ML systems for social impact. Explores healthcare, climate, education, and humanitarian use cases.
- Evaluate ML applications in healthcare including medical imaging, drug discovery, and clinical decision support
- Implement a medical image classification pipeline using PyTorch transfer learning with appropriate evaluation metrics
- Analyze the unique engineering constraints of deploying ML in resource-limited social impact settings
- Implement NLP pipelines for sentiment analysis and text classification using transformer models
- Design a recommender system using collaborative filtering and evaluate it with ranking metrics like NDCG and Precision@k
- Compare ML system requirements across healthcare, finance, and autonomous vehicles domains
- Describe how ML systems contribute to climate science, weather prediction, and energy optimization
- Assess the safety challenges and sensor fusion requirements for autonomous vehicle perception systems
- Compare accessibility applications of ML across speech, vision, and language modalities
- Design an ML solution for a social impact scenario that accounts for regulatory, ethical, and deployment constraints
01 ML for Social Impact
Machine learning has enormous potential to address pressing social challenges when applied thoughtfully and responsibly. From improving healthcare outcomes to advancing climate science to enhancing education, ML systems can amplify human capability in ways that benefit society broadly.
AI for Social Good
The intentional application of artificial intelligence techniques to address societal challenges such as healthcare access, climate change, education inequality, and humanitarian crises, with emphasis on equitable outcomes and deployment in resource-constrained environments.
Figure: ML Application Domains
DeepMind's AlphaFold2 (2020) solved the 50-year-old protein structure prediction problem, predicting 3D structures with atomic-level accuracy. The AlphaFold Protein Structure Database now contains over 200 million predicted structures -- covering nearly every known protein. This has accelerated drug discovery, enzyme engineering, and our fundamental understanding of biology. AlphaFold demonstrates how ML can provide transformative scientific breakthroughs when applied to well-defined problems with rich training data. The 2024 Nobel Prize in Chemistry was awarded partly for this work.
Measuring Social Impact
The measurement of social impact presents unique challenges. Unlike commercial metrics that are easily quantifiable, social outcomes may be difficult to measure, may manifest over long time horizons, and may involve competing stakeholder interests. Defining appropriate metrics and evaluation frameworks is a critical first step that requires community engagement.
| Domain | Impact Metric | Measurement Challenge | Example |
|---|---|---|---|
| Healthcare | Lives saved / QALYs gained | Long time horizons, counterfactual difficulty | AI-assisted diagnosis reducing misdiagnosis rates by 30% |
| Education | Learning outcomes improvement | Confounders, long-term retention | Adaptive tutoring improving test scores by 15% |
| Climate | CO2 emissions reduced (tons) | Attribution across complex systems | Grid optimization saving 40% cooling energy |
| Agriculture | Crop yield improvement (%) | Weather variability, adoption barriers | Disease detection reducing crop loss by 25% |
| Disaster Response | Time to first response (hours) | Infrastructure damage, coordination costs | Satellite imagery analysis providing damage maps in 4 hours |
Table 19.1: Impact metrics across social good domains and their measurement challenges.
ML for social impact often means designing for environments that are very different from well-resourced tech companies. Models must run on low-cost hardware, work with intermittent connectivity, handle noisy data, and serve users with limited technical literacy. A well-compressed MobileNet model running offline on a $100 smartphone can have more real-world impact than a state-of-the-art model that requires cloud connectivity.
AI for Good
The application of artificial intelligence to address societal challenges in areas like health, climate, education, and humanitarian response.
Impact Measurement
The practice of defining and tracking quantitative metrics for the social outcomes of ML deployments, accounting for long time horizons and counterfactual difficulties.
02 Healthcare and Life Sciences
Healthcare AI shows enormous promise across medical imaging, drug discovery, clinical decision support, and genomics. Deep learning models can detect diseases in medical images with accuracy comparable to or exceeding that of specialist physicians, but deploying these systems in clinical practice requires navigating regulatory, ethical, and trust challenges.
Clinical Decision Support System (CDSS)
An ML-powered tool that integrates patient data, clinical evidence, and model predictions to assist healthcare providers in diagnosis, treatment planning, or risk assessment. CDSSs must meet regulatory requirements (FDA, CE marking) and provide transparent, explainable outputs that clinicians can validate.
Medical Image Classification with Transfer Learning
Transfer learning from ImageNet-pretrained models is the standard approach for medical image classification, because labeled medical datasets are typically small (hundreds to low thousands of images). Fine-tuning a pretrained ResNet or EfficientNet on domain-specific data achieves strong performance with limited data. The following pipeline demonstrates a complete medical image classification workflow.
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import transforms, datasets, models
from sklearn.metrics import (
    classification_report, roc_auc_score
)
import numpy as np

# --- Data preparation with medical imaging augmentations ---
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.2, contrast=0.2
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

val_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Load dataset (e.g., chest X-ray pneumonia detection)
train_dataset = datasets.ImageFolder(
    "data/chest_xray/train", transform=train_transforms
)
val_dataset = datasets.ImageFolder(
    "data/chest_xray/val", transform=val_transforms
)
train_loader = DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=True,
)
val_loader = DataLoader(
    val_dataset, batch_size=32, shuffle=False,
    num_workers=4, pin_memory=True,
)

# --- Transfer learning: fine-tune a pretrained ResNet-50 ---
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze early layers (feature extractor)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for binary classification
num_features = model.fc.in_features
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 1),  # Binary: pneumonia vs. normal
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.fc.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=3, factor=0.5
)

# --- Training loop with validation ---
best_auc = 0.0
for epoch in range(20):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.float().to(device)
        optimizer.zero_grad()
        outputs = model(images).squeeze(1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Validation
    model.eval()
    all_labels, all_probs = [], []
    with torch.no_grad():
        for images, labels in val_loader:
            images = images.to(device)
            outputs = model(images).squeeze(1)
            probs = torch.sigmoid(outputs).cpu().numpy()
            all_probs.extend(probs)
            all_labels.extend(labels.numpy())
    val_auc = roc_auc_score(all_labels, all_probs)
    scheduler.step(1 - val_auc)  # scheduler minimizes, so pass 1 - AUC
    if val_auc > best_auc:
        best_auc = val_auc
        torch.save(model.state_dict(), "best_model.pth")
    print(
        f"Epoch {epoch+1}: loss={running_loss/len(train_loader):.4f}, "
        f"val_AUC={val_auc:.4f}"
    )

# --- Final evaluation ---
model.load_state_dict(torch.load("best_model.pth"))
model.eval()
all_labels, all_preds, all_probs = [], [], []
with torch.no_grad():
    for images, labels in val_loader:
        images = images.to(device)
        outputs = model(images).squeeze(1)
        probs = torch.sigmoid(outputs).cpu().numpy()
        preds = (probs >= 0.5).astype(int)
        all_probs.extend(probs)
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())

print("\n" + classification_report(
    all_labels, all_preds,
    target_names=["Normal", "Pneumonia"],
))
print(f"AUC-ROC: {roc_auc_score(all_labels, all_probs):.4f}")
```
Evaluation Metrics for Medical AI
Medical AI systems require careful evaluation beyond simple accuracy. In healthcare, the relative cost of false positives (unnecessary follow-up) versus false negatives (missed disease) varies dramatically by application. The following metrics are essential for evaluating medical ML systems.
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$$\text{AUC-ROC} = \int_0^1 TPR(FPR^{-1}(t)) \, dt = P(S_{\text{pos}} > S_{\text{neg}})$$

| Metric | Formula | Best For | Limitation |
|---|---|---|---|
| Sensitivity (Recall) | TP / (TP + FN) | Screening (minimize missed cases) | Ignores false positive rate |
| Specificity | TN / (TN + FP) | Confirmatory tests (minimize false alarms) | Ignores missed cases |
| PPV (Precision) | TP / (TP + FP) | When positive prediction triggers costly action | Depends on prevalence |
| NPV | TN / (TN + FN) | Rule-out tests | Depends on prevalence |
| AUC-ROC | Area under TPR vs FPR curve | Overall discrimination ability | Can be misleading with class imbalance |
| AUC-PR | Area under Precision vs Recall curve | Rare disease detection (imbalanced) | Harder to interpret than AUC-ROC |
Table 19.2: Key evaluation metrics for healthcare AI systems and their appropriate use cases.
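The metrics in Table 19.2 reduce to simple confusion-matrix arithmetic. The sketch below, using a hypothetical `medical_metrics` helper and invented toy data, shows how the choice of decision threshold trades specificity for sensitivity, which is the central tuning decision for a screening tool:

```python
import numpy as np

def medical_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics from Table 19.2 at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        "sensitivity": tp / (tp + fn),  # recall: fraction of disease caught
        "specificity": tn / (tn + fp),  # fraction of healthy correctly cleared
        "ppv": tp / (tp + fp),          # precision: prevalence-dependent
        "npv": tn / (tn + fn),
    }

# Toy cohort: 10 patients, 3 with disease (labels and scores invented)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.2, 0.1, 0.05]

m = medical_metrics(y_true, y_prob, threshold=0.5)
# Lowering the threshold catches the borderline case (0.4) at the
# cost of more false alarms: the screening-oriented operating point
m_screen = medical_metrics(y_true, y_prob, threshold=0.35)
print(m["sensitivity"], m_screen["sensitivity"])
```

In practice the threshold is chosen from the ROC curve on a held-out validation set, in consultation with clinicians about the relative cost of missed cases versus unnecessary follow-up.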
GPT-4 passed the US Medical Licensing Examination (USMLE) with scores well above the passing threshold, demonstrating strong medical knowledge. However, passing a standardized test is very different from safe clinical deployment. LLMs can hallucinate medical facts, miss context from patient history, and provide plausible-sounding but incorrect diagnoses. Current research focuses on using LLMs as clinical documentation assistants, literature summarizers, and patient communication aids -- roles where they augment clinicians rather than replace them. Retrieval-augmented generation (RAG) from curated medical knowledge bases helps reduce hallucination in clinical LLM applications.
Medical AI devices in the US require FDA clearance through one of three pathways: 510(k) for devices substantially equivalent to existing ones, De Novo for novel low-to-moderate risk devices, or Premarket Approval (PMA) for high-risk devices. As of 2024, the FDA has authorized over 900 AI-enabled medical devices, primarily in radiology and cardiology. The EU Medical Device Regulation (MDR) imposes similar requirements. Deploying a medical AI system without proper regulatory approval is illegal and dangerous.
Transfer Learning
Training a model pretrained on a large dataset (e.g., ImageNet) on a smaller domain-specific dataset, enabling strong performance with limited labeled medical data.
AUC-ROC
Area Under the Receiver Operating Characteristic curve, a threshold-independent metric measuring a model's overall discrimination ability between positive and negative classes.
03 NLP Applications and Large Language Models
Natural language processing has been transformed by large language models. The shift from task-specific models to general-purpose LLMs has fundamentally changed how NLP systems are designed, trained, and deployed. Prompting a foundation model can now accomplish tasks that previously required months of custom model development.
Hallucination
The tendency of language models to generate text that is fluent and plausible-sounding but factually incorrect or entirely fabricated. Hallucination is a fundamental challenge for deploying LLMs in high-stakes applications where accuracy is critical, such as healthcare, legal, and financial domains.
Production NLP Pipeline Architecture
| Component | Purpose | Common Tools | Key Design Decision |
|---|---|---|---|
| Document ingestion | Parse and chunk input documents | Unstructured, LangChain document loaders | Chunk size and overlap strategy |
| Embedding | Convert text to dense vector representations | OpenAI embeddings, sentence-transformers | Embedding model vs. latency trade-off |
| Vector store | Index and retrieve relevant context | Pinecone, Weaviate, pgvector | Hosted vs. self-managed; reranking |
| Prompt engineering | Structure the model input for best output | Templates, few-shot examples, system prompts | Task-specific prompt design and testing |
| Generation | Produce the final output text | GPT-4, Claude, Llama, Mistral | Cost/quality/latency trade-off; local vs. API |
| Output validation | Verify correctness and safety | Guardrails, custom validators, citation checking | Automated vs. human-in-the-loop review |
Table 19.3: Components of a production RAG-based NLP pipeline.
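The pipeline in Table 19.3 can be sketched end to end with toy components. In the sketch below, the bag-of-words `embed` and brute-force `retrieve` functions are stand-ins for a real embedding model and vector store, and the generation step is stubbed out entirely; the point is the shape of the retrieve-then-generate flow, not a production implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real pipelines use dense models
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Vector-store stand-in: exhaustive cosine similarity scan
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(query: str, corpus: list[str]) -> str:
    # Generation stubbed out: a real system would pass the retrieved
    # context to an LLM and then run output validation on the response
    context = retrieve(query, corpus, k=2)
    return f"Answer based on: {context}"

docs = [
    "metformin dosing guidance for type 2 diabetes",
    "satellite imagery for flood damage mapping",
    "transformer models for text classification",
]
print(answer("metformin dosing guidance", docs))
```

Each stage in the table is a swap-in point: the `embed` function becomes a sentence-transformer, `retrieve` becomes a vector database query with reranking, and `answer` becomes a prompted LLM call followed by citation checking.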
Despite advances in retrieval-augmented generation and chain-of-thought prompting, hallucination remains a fundamental challenge for LLMs. In a 2024 study, even the best models hallucinated on 3-5% of factual questions. For applications where accuracy is critical (medical, legal, financial), every generated statement should be traceable to a source document. Never deploy an LLM in a high-stakes domain without a verification layer.
GitHub Copilot, powered by OpenAI Codex, generates code suggestions inline as developers type. In industry studies, developers using Copilot completed tasks 55% faster on average. However, generated code can contain security vulnerabilities, license violations, and subtle bugs. Production deployments require automated security scanning (SAST), test generation, and human code review. The key insight is that LLMs are most effective as productivity multipliers for skilled engineers, not as replacements for engineering judgment.
Large Language Model
A neural network with billions of parameters trained on massive text corpora, capable of diverse language tasks through prompting or fine-tuning.
RAG
Retrieval-Augmented Generation, a pattern that retrieves relevant context from external knowledge bases before generating responses, reducing hallucination and enabling up-to-date information.
04 Recommender Systems and Information Retrieval
Recommender systems drive engagement on major platforms and are among the most commercially impactful ML applications. These systems predict user preferences from interaction history, user features, and item features, powering personalized experiences at scale.
Collaborative Filtering
A recommendation approach that predicts user preferences by finding patterns in the collective behavior of similar users (user-based CF) or similar items (item-based CF), without requiring item content features. The core assumption is that users who agreed in the past will agree in the future.
Cold Start Problem
The challenge of making recommendations for new users (user cold start) or new items (item cold start) that have no prior interaction history. Content-based features, popularity-based fallbacks, and explicit preference elicitation are common mitigations.
Collaborative Filtering with Matrix Factorization
Matrix factorization decomposes the sparse user-item interaction matrix R into two dense matrices: a user matrix U and an item matrix V. Each user and item is represented by a latent factor vector, and the predicted rating is the dot product of these vectors. The following implementation demonstrates this approach.
```python
import numpy as np
from typing import Optional

class MatrixFactorizationCF:
    """Collaborative filtering via matrix factorization.

    Decomposes the user-item interaction matrix R into
    low-rank matrices U (users) and V (items) such that
    R ≈ U @ V.T. Trained with SGD on observed entries.
    """

    def __init__(
        self,
        n_users: int,
        n_items: int,
        n_factors: int = 50,
        lr: float = 0.01,
        reg: float = 0.02,
    ):
        self.n_factors = n_factors
        self.lr = lr
        self.reg = reg
        # Initialize latent factors with small random values
        self.user_factors = np.random.normal(
            0, 0.1, (n_users, n_factors)
        )
        self.item_factors = np.random.normal(
            0, 0.1, (n_items, n_factors)
        )
        # Bias terms capture user/item-specific tendencies
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)
        self.global_bias = 0.0

    def fit(
        self,
        ratings: list[tuple[int, int, float]],
        n_epochs: int = 20,
        val_ratings: Optional[list[tuple[int, int, float]]] = None,
    ):
        """Train on (user_id, item_id, rating) tuples.

        Uses stochastic gradient descent with L2
        regularization on both factors and biases.
        """
        all_ratings = [r for _, _, r in ratings]
        self.global_bias = np.mean(all_ratings)
        for epoch in range(n_epochs):
            np.random.shuffle(ratings)
            total_loss = 0.0
            for user, item, rating in ratings:
                pred = self.predict(user, item)
                error = rating - pred
                # Update biases
                self.user_bias[user] += self.lr * (
                    error - self.reg * self.user_bias[user]
                )
                self.item_bias[item] += self.lr * (
                    error - self.reg * self.item_bias[item]
                )
                # Update latent factors (copy user vector so the
                # item update uses the pre-update value)
                u = self.user_factors[user].copy()
                self.user_factors[user] += self.lr * (
                    error * self.item_factors[item]
                    - self.reg * u
                )
                self.item_factors[item] += self.lr * (
                    error * u
                    - self.reg * self.item_factors[item]
                )
                total_loss += error ** 2
            rmse = np.sqrt(total_loss / len(ratings))
            msg = f"Epoch {epoch+1}: train RMSE={rmse:.4f}"
            if val_ratings:
                val_rmse = self.evaluate(val_ratings)
                msg += f", val RMSE={val_rmse:.4f}"
            print(msg)

    def predict(self, user: int, item: int) -> float:
        """Predict rating for a user-item pair."""
        return (
            self.global_bias
            + self.user_bias[user]
            + self.item_bias[item]
            + self.user_factors[user]
            @ self.item_factors[item]
        )

    def recommend(
        self, user: int, n: int = 10,
        exclude: Optional[set[int]] = None,
    ) -> list[tuple[int, float]]:
        """Return top-n item recommendations for a user."""
        scores = (
            self.global_bias
            + self.user_bias[user]
            + self.item_bias
            + self.item_factors @ self.user_factors[user]
        )
        if exclude:
            for item_id in exclude:
                scores[item_id] = -np.inf
        top_items = np.argsort(scores)[::-1][:n]
        return [(int(i), float(scores[i])) for i in top_items]

    def evaluate(
        self, test_ratings: list[tuple[int, int, float]]
    ) -> float:
        """Compute RMSE on test set."""
        errors = [
            (r - self.predict(u, i)) ** 2
            for u, i, r in test_ratings
        ]
        return np.sqrt(np.mean(errors))

# Example usage (train_ratings, val_ratings, seen_items defined elsewhere)
mf = MatrixFactorizationCF(
    n_users=1000, n_items=5000, n_factors=50
)
mf.fit(train_ratings, n_epochs=20, val_ratings=val_ratings)
top_recs = mf.recommend(user=42, n=10, exclude=seen_items)
print("Top recommendations:", top_recs)
```
Ranking Metrics for Recommender Systems
Recommender systems are evaluated with ranking metrics that account for the position of relevant items in the recommendation list. A relevant item at position 1 is far more valuable than the same item at position 100.
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \quad \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$

$$\text{Precision@}k = \frac{|\{\text{relevant items in top-}k\}|}{k}, \quad \text{Recall@}k = \frac{|\{\text{relevant items in top-}k\}|}{|\{\text{all relevant items}\}|}$$

| Metric | Measures | Position-Aware | Best For |
|---|---|---|---|
| NDCG@k | Ranking quality with graded relevance | Yes (log discount) | Multi-level relevance (ratings, engagement depth) |
| Precision@k | Fraction of top-k that are relevant | No (treats all k equally) | Binary relevance (clicked or not) |
| Recall@k | Coverage of relevant items in top-k | No | Catalog coverage, diversity evaluation |
| MAP | Mean of Precision@k at each relevant position | Yes (reward for early placement) | Binary relevance with position sensitivity |
| MRR | Reciprocal rank of first relevant item | Yes (first hit position) | Navigational queries, single-answer retrieval |
| Hit Rate@k | Whether any relevant item appears in top-k | No | Overall recall screening |
Table 19.4: Ranking metrics for recommender systems and information retrieval.
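These ranking metrics are a few lines of NumPy. The functions below are a minimal sketch of NDCG@k and Precision@k, following the DCG formula with graded-relevance gain and a log2 position discount; the input is a list of relevance grades in the order the system ranked the items:

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with gain 2^rel - 1 and log2(i + 1) position discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    return np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

def precision_at_k(relevances, k):
    """Fraction of the top-k positions holding a relevant item."""
    return (np.asarray(relevances[:k]) > 0).sum() / k

# A ranking that places grades in ideal order scores NDCG = 1.0;
# burying the most relevant item lowers it
print(ndcg_at_k([3, 2, 1, 0], 4), ndcg_at_k([0, 1, 2, 3], 4))
```

Note that NDCG compares against the ideal ordering of the *same* relevance grades, so it measures ranking quality independently of how many relevant items exist, which Recall@k captures instead.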
Modern Recommendation Architecture
Production recommender systems at scale use a multi-stage architecture to handle catalogs of millions of items efficiently. The retrieval stage uses approximate nearest neighbor search to generate a broad candidate set (hundreds from millions). The ranking stage applies a more complex model to score and reorder candidates. The post-ranking stage applies business rules, diversity constraints, and freshness boosts.
- Candidate retrieval (100-1000 from millions): Two-tower models, ANN search, collaborative signals
- Scoring/ranking (tens from hundreds): Deep ranking models with rich features (user, item, context)
- Re-ranking (final order): Diversity, freshness, business rules, ad insertion, deduplication
- Serving: Real-time feature lookup, low-latency inference, A/B testing framework
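The stages above can be sketched with synthetic embeddings. Everything in this sketch is illustrative: the embeddings are random stand-ins for a two-tower model's output, `ranking_model` is a placeholder for a learned deep ranker, and the brute-force dot-product scan in stage 1 would be replaced by approximate nearest neighbor search in production:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy catalog: embeddings standing in for a two-tower model's output
n_items, dim = 10_000, 32
item_emb = rng.normal(size=(n_items, dim))
user_emb = rng.normal(size=dim)

# Stage 1 -- retrieval: cheap dot-product scores over the full catalog
# (production replaces this exhaustive scan with ANN search)
retrieval_scores = item_emb @ user_emb
candidates = np.argpartition(retrieval_scores, -500)[-500:]

# Stage 2 -- ranking: a (stand-in) richer model scores only candidates
def ranking_model(user, items):
    # Placeholder for a deep ranker with user/item/context features
    return items @ user + 0.1 * rng.normal(size=len(items))

rank_scores = ranking_model(user_emb, item_emb[candidates])
top10 = candidates[np.argsort(rank_scores)[::-1][:10]]

# Stage 3 -- re-ranking: business rules, e.g. drop already-seen items
seen = set(top10[:2].tolist())
final = [i for i in top10 if i not in seen]
print(final)
```

The economics of the funnel are the point: the expensive model never sees the full catalog, so its cost scales with the candidate set (hundreds) rather than the inventory (millions).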
Collaborative Filtering
A recommendation method that predicts user preferences based on the collective behavior of similar users, without requiring item features.
NDCG
Normalized Discounted Cumulative Gain, a ranking metric that rewards recommender systems for placing relevant items higher in the recommendation list.
05 Autonomous Vehicles and Robotics
Autonomous vehicles (AVs) represent one of the most technically demanding AI applications, requiring real-time perception, prediction, planning, and control in safety-critical environments. The ML stack for AVs integrates computer vision, sensor fusion, behavioral prediction, and motion planning into a system that must operate reliably across diverse driving conditions at speeds where milliseconds matter.
Sensor Fusion
The process of combining data from multiple sensor modalities (cameras, LiDAR, radar, ultrasonic) to create a unified, more complete understanding of the environment than any single sensor can provide. Sensor fusion is essential for robust autonomous perception because each modality has complementary strengths and weaknesses.
SAE Autonomy Levels
The Society of Automotive Engineers (SAE) defines six levels of driving automation: Level 0 (no automation), Level 1 (driver assistance), Level 2 (partial automation), Level 3 (conditional automation), Level 4 (high automation), and Level 5 (full automation). Most deployed systems are Level 2 (Tesla Autopilot, GM Super Cruise), with limited Level 4 robotaxi deployments (Waymo, Cruise).
Perception and Sensor Fusion
The perception system processes data from cameras, LiDAR, radar, and ultrasonic sensors to build a comprehensive understanding of the vehicle's environment. Each sensor modality contributes unique information: cameras provide rich semantic information and color, LiDAR gives precise 3D geometry, radar works reliably in adverse weather, and ultrasonics handle close-range detection.
| Sensor | Range | Strengths | Weaknesses |
|---|---|---|---|
| Camera | ~200m | Rich semantics, color, texture, traffic signs | Affected by glare, darkness, weather |
| LiDAR | ~250m | Precise 3D geometry, works in darkness | Expensive, degraded in heavy rain/snow, no color |
| Radar | ~300m | Works in all weather, direct velocity measurement | Low resolution, poor vertical discrimination |
| Ultrasonic | ~5m | Reliable close-range detection, low cost | Very short range, slow update rate |
Table 19.5: Sensor modalities in autonomous vehicles and their trade-offs.
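The statistical core of combining these modalities can be illustrated with a minimal sketch: fusing two independent noisy estimates by inverse-variance weighting, the scalar form of a Kalman update. The sensor readings and variances below are invented for illustration:

```python
def fuse(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted fusion of two independent estimates.

    The fused variance is always smaller than either input variance,
    which is why combining complementary sensors helps."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused, fused_var

# Illustrative: lidar gives a precise range estimate, radar a noisier
# one, but radar keeps working in weather where lidar degrades
lidar_range, lidar_var = 50.2, 0.1   # metres, variance
radar_range, radar_var = 49.0, 1.0
fused, fused_var = fuse(lidar_range, lidar_var, radar_range, radar_var)
print(fused, fused_var)
```

The fused estimate lands close to the lidar reading because its variance is ten times smaller; in heavy rain the perception stack would inflate `lidar_var`, shifting weight toward the radar automatically.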
ML Applications Comparison: Healthcare vs Finance vs AVs
Different domains impose fundamentally different requirements on ML systems. Healthcare demands clinical validation and regulatory approval. Finance demands auditability and real-time performance. Autonomous vehicles demand real-time safety-critical decision-making. Understanding these differences is essential for engineers moving between domains.
| Dimension | Healthcare | Finance | Autonomous Vehicles |
|---|---|---|---|
| Latency requirement | Seconds to hours (diagnosis assistance) | Microseconds to milliseconds (trading) | Milliseconds (real-time control) |
| Failure mode | Misdiagnosis, missed disease | Financial loss, market manipulation | Physical injury, fatality |
| Regulatory body | FDA, EMA, CE marking | SEC, CFTC, FCA, MiFID II | NHTSA, UNECE, national road authorities |
| Data sensitivity | HIPAA, patient privacy | Insider trading restrictions, PII | Location tracking, facial recognition |
| Explainability need | Critical (clinician trust, patient rights) | High (regulatory audits, risk management) | Important (accident investigation, liability) |
| Validation approach | Clinical trials, FDA 510(k)/PMA | Backtesting, stress testing, model validation | Billions of simulation miles, disengagement reports |
| Human oversight | Physician-in-the-loop for decisions | Trader oversight, risk limits, circuit breakers | Safety driver (L2-3), remote monitoring (L4) |
| Error tolerance | Very low for critical diagnosis | Low for risk models, medium for advisory | Near-zero for safety-critical perception |
Table 19.6: Comparison of ML system requirements across healthcare, finance, and autonomous vehicles.
What differentiates ML systems in healthcare vs. e-commerce from a systems perspective?
The Long-Tail Safety Challenge
The safety challenge in autonomous driving is defined by the long tail of rare events. Standard performance on common driving scenarios is relatively straightforward, but safely handling unusual situations -- construction zones, emergency vehicles, unusual road geometry, adverse weather, aggressive drivers -- requires massive datasets and sophisticated testing. A self-driving system must handle millions of unique scenarios, many of which occur too rarely to be well-represented in training data.
Real-world AV edge cases that have caused incidents include: a pedestrian crossing outside a crosswalk at night (Uber, 2018), a white tractor-trailer against a bright sky (Tesla, 2016), a stationary fire truck on a highway (Tesla, multiple), and an overturned truck blocking the road. Each of these represents a scenario that is individually rare but collectively common -- the long tail of driving scenarios is enormous.
Waymo, Cruise, and other AV companies rely heavily on simulation to test edge cases that are too dangerous or rare to encounter naturally. Waymo reports running over 20 billion simulated miles annually, compared to 20+ million real-world miles. Simulation enables testing adversarial scenarios (a child darting into traffic), sensor degradation (fog, lens occlusion), and rare traffic configurations that would take decades to encounter on public roads.
Constraint-driven innovation: the strictest deployment constraints drive the most creative systems engineering solutions. Domains like medical AI, autonomous vehicles, and aerospace face the tightest constraints: real-time latency, regulatory compliance, explainability mandates, and near-zero error tolerance. Rather than limiting progress, these constraints force engineers to develop creative solutions -- simulation-at-scale testing, sensor fusion architectures, formal verification methods, and human-in-the-loop designs -- that eventually benefit the entire field. The most robust and well-engineered ML systems almost always emerge from the most constrained domains. Think of it like Formula 1 racing: constraints (weight limits, fuel restrictions, safety regulations) have driven innovations in materials science, aerodynamics, and engine efficiency that eventually make it into consumer vehicles.
Robotics Beyond Driving
The same perception, planning, and control techniques that power autonomous vehicles are being applied across robotics domains. Warehouse robots (Amazon, Ocado), surgical robots (Intuitive Surgical's da Vinci), agricultural robots (harvesting, weeding), and household robots (cleaning, delivery) all benefit from advances in ML-driven perception and manipulation.
- Warehouse automation: Robots handle over 75% of Amazon fulfillment center operations, using vision for picking and placing diverse objects
- Surgical robotics: ML assists with tissue identification, surgical planning, and semi-autonomous suturing
- Agricultural robotics: Computer vision enables selective harvesting, precision weeding without herbicides, and crop health monitoring
- Manipulation: Foundation models for robotics (RT-2, Octo) transfer knowledge across tasks, reducing the need for task-specific training data
- Humanoid robots: Companies like Figure, Tesla (Optimus), and Boston Dynamics combine locomotion control with manipulation and language understanding
Sensor Fusion
Combining data from multiple sensor modalities to create a unified understanding of the environment, essential for robust autonomous perception.
Long-Tail Distribution
The pattern where the vast majority of driving scenarios are common and easy but safety-critical performance depends on handling the enormous variety of rare edge cases.
06 Climate and Environmental AI
ML systems contribute to climate action through improved weather and climate modeling, energy system optimization, environmental monitoring, and materials discovery for clean energy technologies. These applications leverage ML's ability to model complex physical systems and extract patterns from large observational datasets.
Climate AI
The application of machine learning to climate change mitigation (reducing emissions) and adaptation (responding to climate impacts), spanning weather prediction, energy optimization, environmental monitoring, and materials discovery for clean energy.
ML-Powered Weather Forecasting
Weather forecasting has been revolutionized by ML approaches. GraphCast and Pangu-Weather demonstrate that ML models can match or exceed traditional numerical weather prediction (NWP) models at a fraction of the computational cost. These models are trained on decades of reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) and produce global forecasts in seconds rather than hours.
| Model | Organization | Key Achievement |
|---|---|---|
| GraphCast | DeepMind | Outperforms ECMWF HRES on 90% of variables at 10-day forecasts |
| Pangu-Weather | Huawei | First ML model to outperform NWP at multiple lead times |
| FourCastNet | NVIDIA | Generates global forecasts 45,000x faster than NWP |
| GenCast | DeepMind | Probabilistic forecasting with calibrated uncertainty estimates |
| Aurora | Microsoft | Foundation model for Earth system forecasting across scales |
Table 19.7: Landmark ML weather prediction models.
Traditional numerical weather prediction requires hours of supercomputer time for a single 10-day forecast. ML models produce comparable forecasts in seconds on a single GPU. This speed enables ensemble forecasting (running hundreds of scenarios), more frequent updates, and broader access to high-quality weather prediction. For tropical cyclone track forecasting, GenCast outperforms the best ensemble systems while running 1000x faster.
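Models like GraphCast and FourCastNet generate long-range forecasts autoregressively: a learned step model predicts the atmospheric state a few hours ahead, and that prediction is fed back in to reach the next step. The sketch below illustrates the rollout pattern only; the linear `step_model` is a toy stand-in, not a trained network.

```python
# Toy sketch of the autoregressive rollout used by ML weather models:
# a learned step model maps the current state to the state 6 hours
# ahead, and longer forecasts are produced by iterating it.

def step_model(state):
    """Stand-in for a learned 6-hour forecast step (e.g. a GNN)."""
    return [0.9 * x + 0.1 for x in state]  # toy damped dynamics

def rollout(initial_state, n_steps):
    """Iterate the step model to build a multi-step forecast trajectory."""
    trajectory = [initial_state]
    state = initial_state
    for _ in range(n_steps):
        state = step_model(state)
        trajectory.append(state)
    return trajectory

# A 10-day forecast at 6-hour steps is just 40 model evaluations --
# seconds on a GPU, versus hours for numerical weather prediction.
forecast = rollout([1.0, 2.0, 3.0], n_steps=40)
print(len(forecast))  # 41 states: initial conditions plus 40 steps
```

Because each forecast is so cheap, running hundreds of rollouts from perturbed initial conditions (ensemble forecasting) becomes practical, which is the basis of probabilistic systems like GenCast.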
Energy System Optimization
Energy system optimization uses ML to balance supply and demand in power grids with increasing renewable energy penetration. ML forecasts renewable generation, predicts demand patterns, and optimizes battery storage dispatch. These applications directly reduce carbon emissions by enabling higher renewable energy utilization.
DeepMind applied ML to optimize cooling in Google's data centers, reducing cooling energy by approximately 40%. The system uses a neural network to predict the future temperature and pressure of the data center based on current conditions and proposed actions, then recommends optimal cooling configurations. Similar approaches applied to power grid management help integrate intermittent renewable energy sources by predicting generation and optimizing storage dispatch.
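The forecast-then-dispatch pattern described above can be sketched in miniature: given a forecast of net load (demand minus renewable generation), charge storage when there is surplus and discharge when there is deficit. Real grid operators use optimization methods such as model-predictive control; this greedy rule is only an illustration of the data flow.

```python
# Hedged sketch of forecast-driven battery dispatch: negative net load
# means renewable surplus (store it), positive means deficit (discharge
# before importing from the grid).

def dispatch(net_load_forecast, capacity, charge=0.0):
    """Return the grid power drawn (or exported) each hour."""
    grid = []
    for net in net_load_forecast:
        if net < 0:  # renewable surplus: store as much as fits
            stored = min(-net, capacity - charge)
            charge += stored
            grid.append(net + stored)  # leftover surplus is exported
        else:        # deficit: discharge the battery before importing
            used = min(net, charge)
            charge -= used
            grid.append(net - used)
    return grid

# Surplus at hour 0 partly covers the deficit at hour 2.
print(dispatch([-3.0, 0.0, 4.0], capacity=2.0))  # [-1.0, 0.0, 2.0]
```

Better ML forecasts of `net_load_forecast` directly translate into higher renewable utilization, which is where the emissions reduction comes from.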
Environmental Monitoring
- Deforestation tracking: ML classifies satellite imagery to detect forest loss within days, enabling rapid response
- Biodiversity monitoring: Acoustic sensors + ML identify species from sound recordings across vast areas
- Illegal mining detection: Satellite imagery analysis spots unauthorized mining operations in protected areas
- Ocean health: ML analyzes sea surface temperature, chlorophyll, and current data for marine ecosystem monitoring
- Air quality prediction: ML forecasts pollution levels to issue health warnings and guide policy
- Carbon accounting: ML estimates emissions from satellite observations of power plants, agriculture, and land use
Much of the satellite data used for environmental monitoring is freely available through programs like Copernicus (EU), Landsat (NASA/USGS), and Planet's open data program. Combined with open-source ML tools, this creates opportunities for researchers and organizations worldwide to build environmental monitoring applications.
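A classic building block for the deforestation-tracking bullet above is the Normalized Difference Vegetation Index, NDVI = (NIR - Red) / (NIR + Red): dense vegetation scores high, and a large drop between two acquisition dates flags possible forest loss. This is a minimal sketch; production systems built on Landsat or Sentinel-2 data add cloud masking and learned classifiers on top.

```python
# Minimal deforestation flagging via NDVI change detection between two
# satellite acquisition dates. Band values here are illustrative.

def ndvi(nir, red):
    return (nir - red) / (nir + red)

def forest_loss(before, after, drop_threshold=0.3):
    """Flag pixels whose NDVI dropped by more than the threshold.

    `before` and `after` are lists of (nir, red) band values per pixel.
    """
    flags = []
    for (n0, r0), (n1, r1) in zip(before, after):
        flags.append(ndvi(n0, r0) - ndvi(n1, r1) > drop_threshold)
    return flags

before = [(0.8, 0.1), (0.8, 0.1)]   # two healthy forest pixels
after = [(0.8, 0.1), (0.3, 0.4)]    # second pixel has been cleared
print(forest_loss(before, after))   # [False, True]
```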
ML Weather Prediction
Data-driven weather forecasting using neural networks trained on historical weather data, achieving accuracy comparable to physics-based models at lower computational cost.
07 Education, Accessibility, and Humanitarian AI
ML-powered educational tools can personalize learning experiences, provide intelligent tutoring, automate assessment, and make education more accessible to underserved populations. Adaptive learning systems adjust content difficulty and sequencing based on individual student performance, providing each learner with an optimized path through the material.
Adaptive Learning
An educational technology approach that uses ML to dynamically adjust the content difficulty, sequencing, pacing, and presentation style based on individual student performance data, creating a personalized learning pathway for each student.
NLP in Education
Natural language processing enables powerful educational applications that scale high-quality instruction to millions of learners. Large language models can serve as interactive tutoring systems that explain concepts, answer questions, and guide students through problem-solving with a Socratic approach.
| Application | ML Technique | Educational Benefit |
|---|---|---|
| Automated essay scoring | Fine-tuned language models | Instant feedback at scale, consistent evaluation |
| Question generation | Seq2seq models, LLMs | Automatic creation of practice questions from content |
| Language learning assistants | Speech recognition + NLU | Conversational practice available 24/7 |
| Content summarization | Extractive and abstractive summarization | Help students identify key concepts efficiently |
| Interactive tutoring | Large language models | Socratic dialogue, worked examples, personalized explanations |
Table 19.8: NLP applications in education.
Khan Academy's Khanmigo, powered by large language models, acts as a Socratic tutor that guides students through problem-solving by asking leading questions rather than giving direct answers. This approach preserves the pedagogical value of productive struggle while providing support when students are stuck. Duolingo Max uses GPT-4 for conversational language practice and AI-generated explanations of grammar mistakes.
Accessibility Applications
Accessibility applications use ML to break down barriers for people with disabilities, enabling more equitable access to information and technology. These tools benefit not only their primary audience but often improve usability for everyone.
- Speech recognition: Voice-controlled interfaces for users with motor impairments (accuracy now exceeds 95% for major languages)
- Image description: Automatic alt-text generation for visually impaired users (Microsoft Seeing AI, Apple VoiceOver)
- Sign language recognition: Real-time translation between sign language and text/speech
- Real-time captioning: Live speech-to-text for deaf and hard-of-hearing users
- Text simplification: Adapting complex text for users with cognitive disabilities
- Eye tracking: Gaze-based computer control for users with severe motor limitations
Accessibility features originally designed for people with disabilities often benefit everyone. Voice interfaces help drivers, captioning helps in noisy environments, and text simplification helps non-native speakers. Designing for accessibility is designing for broader usability.
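Text-simplification pipelines typically score readability before and after rewriting. A widely used heuristic is the Flesch Reading Ease formula (higher means easier): 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/word). The syllable counter below is a crude vowel-group approximation; real systems use pronunciation dictionaries or learned models.

```python
# Hedged sketch of readability scoring for text simplification,
# using the Flesch Reading Ease formula with a rough syllable heuristic.

import re

def syllables(word):
    """Approximate syllables as runs of vowels (a crude heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syl = sum(syllables(w) for w in words)
    n = len(words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syl / n)

simple = "The cat sat. The dog ran."
dense = "Multimodal accessibility infrastructures necessitate comprehensive evaluation."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))  # True
```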
Humanitarian AI: Disaster Response and Food Security
Humanitarian applications of ML address urgent needs in disaster response, poverty alleviation, and food security. Speed is critical: after a natural disaster, ML systems analyzing satellite imagery can produce building damage maps within hours, guiding relief efforts to the most affected areas before ground teams can reach them.
| Phase | ML Application | Data Source |
|---|---|---|
| Preparedness | Risk mapping, vulnerability prediction | Historical disasters, geographic data, census |
| Immediate response | Damage assessment, needs mapping | Satellite/drone imagery, social media |
| Relief coordination | Resource allocation optimization | Logistics data, population movement |
| Recovery | Infrastructure reconstruction prioritization | Damage assessments, economic data |
Table 19.9: ML applications across disaster response phases.
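The "resource allocation optimization" row above can be illustrated with a toy priority rule: given ML-derived damage scores and population counts per area, serve the most damaged areas first until supply runs out. Real relief logistics use constrained optimization solvers with transport costs and capacities; this greedy sketch only shows the shape of the problem.

```python
# Hedged sketch of damage-prioritized relief allocation. Area data
# here (names, damage scores, populations) is purely illustrative.

def allocate(areas, supply):
    """areas: list of (name, damage_score, population). Returns a plan."""
    plan = {}
    # Serve areas in order of damage severity until supply is exhausted.
    for name, damage, population in sorted(areas, key=lambda a: -a[1]):
        sent = min(population, supply)
        plan[name] = sent
        supply -= sent
        if supply == 0:
            break
    return plan

areas = [("north", 0.9, 500), ("south", 0.4, 800), ("east", 0.7, 300)]
print(allocate(areas, supply=700))  # {'north': 500, 'east': 200}
```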
Many agricultural AI applications in developing regions run on basic smartphones. PlantVillage and similar tools let farmers photograph a diseased leaf and receive an immediate diagnosis and treatment recommendation. Design for offline operation, low-resolution cameras, and simple interfaces. A well-compressed MobileNet model can run inference in under 100ms on a mid-range smartphone.
Educational ML systems can inadvertently reinforce existing inequities. An automated essay scorer trained primarily on essays from affluent school districts may penalize valid writing styles from other backgrounds. Student data, especially for younger learners, requires strong privacy protections (FERPA in the US, GDPR for EU students). Always evaluate educational AI for equitable outcomes across student demographics.
Humanitarian AI
ML applications that support humanitarian response and development, including disaster relief, food security, and poverty alleviation in resource-constrained environments.
Key Takeaways
1. Effective ML for social impact requires deep collaboration between ML engineers and domain experts in each application area.
2. Healthcare AI shows enormous promise in medical imaging and drug discovery but faces stringent regulatory (FDA, CE marking) and ethical requirements.
3. Transfer learning from pretrained models is the standard approach for medical imaging, enabling strong performance with limited labeled clinical data.
4. Evaluation metrics must match the application: F-beta for configurable precision-recall trade-offs, AUC-ROC for overall discrimination, NDCG for ranking quality.
5. NLP applications have been transformed by large language models, but hallucination, prompt injection, and high inference costs remain critical production challenges.
6. Recommender systems use multi-stage retrieval-and-ranking architectures evaluated with position-aware metrics like NDCG and MAP.
7. Autonomous vehicles require real-time perception, prediction, and planning, with safety defined by handling the long tail of rare edge cases.
8. Healthcare, finance, and autonomous vehicles impose fundamentally different requirements on ML systems in terms of latency, regulation, and failure tolerance.
9. Climate AI applications in weather prediction and energy optimization can directly reduce carbon emissions.
10. Accessibility applications of ML break down barriers for people with disabilities across speech, vision, and language.
11. Humanitarian AI must navigate challenges of data scarcity, infrastructure limitations, and cultural context in deployment.
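Several takeaways reference NDCG for ranking evaluation. A minimal sketch of how it is computed: DCG discounts each item's relevance by log2(position + 1), and NDCG normalizes by the DCG of the ideal (relevance-sorted) ordering, so a perfectly ordered list scores 1.0.

```python
# Minimal NDCG: discounted cumulative gain normalized by the ideal DCG.

import math

def dcg(relevances):
    """Sum of relevance / log2(rank + 1), with ranks starting at 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))              # 1.0: already ideally ordered
print(round(ndcg([1, 2, 3]), 3))    # < 1.0: misordered list scores lower
```

Putting the most relevant items first is rewarded most at the top ranks, which is why NDCG is the standard position-aware metric for recommender evaluation.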