On-Device Learning
Covers deploying ML on microcontrollers and edge devices. Introduces TFLite Micro, extreme optimization, and the constraints of resource-limited environments.
- Explain the memory and power constraints of microcontroller deployment and their impact on model design
- Compare edge deployment strategies including on-device inference, federated learning, and cloud offloading
- Implement on-device inference optimizations including quantization, operator fusion, and memory planning
- Evaluate federated learning protocols for privacy-preserving model training across distributed devices
- Design an end-to-end edge ML system with appropriate model-hardware co-design trade-offs
01 Deploying ML on Microcontrollers
Deploying ML on microcontrollers (MCUs) represents the extreme end of resource-constrained inference. MCUs typically have tens to hundreds of kilobytes of SRAM, megabytes of flash storage, and operate at clock speeds of tens to hundreds of megahertz. Yet despite these severe constraints, on-device ML enables always-on sensing applications that are impossible with cloud-based approaches.
Microcontroller (MCU)
A compact integrated circuit containing a processor, memory (typically KB of SRAM and MB of flash), and I/O peripherals on a single chip. MCUs power billions of embedded and IoT devices, and are increasingly capable of running lightweight ML inference.
| Resource | Typical MCU | Smartphone | Cloud GPU |
|---|---|---|---|
| RAM | 64-512 KB SRAM | 4-12 GB | 16-80 GB HBM |
| Storage | 1-4 MB Flash | 64-256 GB | TB-scale SSD |
| Clock Speed | 48-400 MHz | 2-3 GHz | 1.5-2 GHz + 10K cores |
| Power Budget | 1-100 mW | 3-5 W | 200-700 W |
| Model Size Limit | 10s-100s KB | 10s-100s MB | Billions of parameters |
Table 14.1: Resource constraints across deployment targets.
Inference Frameworks for MCUs
The primary framework for MCU deployment is TensorFlow Lite Micro (TFLite Micro), which provides a stripped-down inference engine designed for bare-metal environments. TFLite Micro uses a static memory allocation scheme, operates without an operating system, and supports a subset of TensorFlow operations optimized for MCU architectures.
TFLite Micro is designed for bare-metal execution: no dynamic memory allocation, no operating system required, no file system access, and no standard C library dependency. It uses a pre-allocated "arena" buffer for all intermediate tensors, avoiding heap fragmentation that would be catastrophic in KB-scale memory environments.
Memory Management Challenges
Memory management is the primary challenge for MCU deployment. The model weights must fit in flash memory, while the intermediate activations during inference must fit in SRAM. The arena memory allocation strategy pre-allocates a fixed buffer for all intermediate tensors, avoiding dynamic allocation and heap fragmentation.
Model weights are read-only and can reside in flash memory (larger, cheaper, non-volatile). Activations must be in SRAM (smaller, faster, volatile). When optimizing for MCU deployment, focus on reducing peak activation memory (the largest intermediate tensor) as this determines the minimum SRAM requirement. Tools like TFLite Micro's memory planner report peak arena usage.
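To make the peak-activation idea concrete, here is a minimal sketch that estimates the arena a small model would need. The layer shapes are hypothetical, activations are assumed to be INT8, and only a layer's input and output are assumed live at once; TFLite Micro's real planner also accounts for scratch buffers, so treat this as a rough estimate.

```python
import numpy as np

# Hypothetical per-layer output shapes for a small INT8 vision model (H, W, C).
layer_shapes = [
    (96, 96, 1),   # input image
    (48, 48, 8),   # conv block 1 output
    (24, 24, 16),  # conv block 2 output
    (12, 12, 32),  # conv block 3 output
    (64,),         # pooled feature vector
    (4,),          # logits
]

bytes_per_element = 1  # INT8 activations
sizes = [int(np.prod(s)) * bytes_per_element for s in layer_shapes]

# If only a layer's input and output tensors are live at once, peak
# activation memory is driven by the largest consecutive pair.
peak = max(a + b for a, b in zip(sizes, sizes[1:]))
print(f"Tensor sizes (bytes): {sizes}")
print(f"Estimated peak activation memory: {peak / 1024:.1f} KB")
```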
Power Management
Power consumption is a critical constraint for battery-powered MCU applications. Always-on keyword detection models must consume microwatts to enable months or years of battery life.
- Duty cycling — Alternate between deep sleep and wake states to reduce average power.
- Event-driven inference — Only run the ML model when a simple threshold detector indicates interesting input.
- Hardware power gating — Shut down unused peripherals and memory banks during inference.
- Cascaded models — Use a tiny always-on model as a trigger for a larger, more accurate model.
A smart speaker uses a 14 KB neural network running on a Cortex-M4 MCU for "wake word" detection. The MCU consumes 1 mW in always-on listening mode. When the wake word is detected, the system powers on the main application processor (consuming 2 W) for full speech recognition. This cascaded approach enables months of battery life while maintaining responsiveness.
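A minimal sketch of the duty-cycled cascade described above. All of the callables are hypothetical placeholders for platform-specific code, not a real SDK API.

```python
def cascaded_listen_loop(get_audio_frame, run_wake_word_model,
                         wake_main_processor, enter_deep_sleep,
                         threshold=0.8, frame_interval_s=0.05):
    """Duty-cycled cascade: a tiny always-on model gates the power-hungry path.

    All four callables are hypothetical placeholders for platform-specific code.
    """
    while True:
        frame = get_audio_frame()             # low-power audio capture
        score = run_wake_word_model(frame)    # tiny always-on model, ~1 mW budget
        if score >= threshold:
            wake_main_processor()             # ~2 W path: full speech recognition
        enter_deep_sleep(frame_interval_s)    # sleep between frames to cut average power
```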
Microcontroller (MCU)
A small, resource-constrained processor with integrated memory (typically KB of SRAM) used in embedded and IoT devices.
TFLite Micro
A lightweight inference framework for running TensorFlow models on microcontrollers, designed for bare-metal environments without operating system support.
02 Edge Deployment Strategies
Edge deployment places ML inference on devices closer to the data source, ranging from powerful edge servers to smartphones to IoT sensors. The primary motivations are latency reduction, privacy preservation, bandwidth savings, and reliability.
Edge Deployment
Running ML inference on devices close to the data source rather than in centralized cloud servers. Edge deployment reduces latency, preserves privacy by keeping data on-device, saves network bandwidth, and enables operation without connectivity.
Figure: ML Model Deployment Strategies
- Latency reduction — Eliminate the network round-trip to the cloud, enabling millisecond-scale (and for small models, sub-millisecond) response times.
- Privacy preservation — Sensitive data (face images, health signals) never leaves the device.
- Bandwidth savings — Process raw sensor data locally, transmitting only results or anomalies.
- Reliability — Operate continuously even without network connectivity.
Model Conversion and Optimization
Models trained in frameworks like PyTorch or TensorFlow must be converted to edge-optimized formats. Each target platform has its own optimized inference runtime.
| Runtime | Target Platform | Input Format | Key Optimizations |
|---|---|---|---|
| TFLite | Android, embedded Linux | TensorFlow SavedModel | Quantization, delegate support (GPU, NNAPI) |
| Core ML | iOS, macOS, Apple Silicon | PyTorch, TensorFlow, ONNX | Neural Engine acceleration, FP16 conversion |
| ONNX Runtime | Cross-platform | ONNX | Graph optimization, EP-based acceleration |
| TensorRT | NVIDIA edge GPUs (Jetson) | ONNX, TensorFlow | Layer fusion, FP16/INT8, kernel auto-tuning |
| microTVM | MCUs, RISC-V | Relay IR | Compiler-based kernel optimization, auto-scheduling |
Table 14.2: Edge inference runtimes and their target platforms.
Split Inference
Split inference distributes model computation between edge devices and the cloud, placing early layers on-device and later layers in the cloud. The split point is chosen to minimize the data transmitted between device and cloud.
The optimal split point minimizes the size of the intermediate tensor transmitted from device to cloud. Early convolutional layers often reduce spatial dimensions significantly, making the output of the third or fourth layer much smaller than the raw input image. Split at the layer that produces the smallest intermediate representation.
Split inference is most valuable when: (1) the device can handle early feature extraction but not the full model, (2) network bandwidth is limited but latency requirements are not extreme, and (3) intermediate representations are much smaller than raw inputs. It is less useful when strict privacy requires all data to stay on-device.
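One simple way to pick the split point is to measure the size of every intermediate tensor and choose the smallest one. A Keras-based sketch under that assumption (a built model with static shapes, and roughly INT8-sized activations; in a real system you would measure the serialized payload you actually transmit):

```python
import numpy as np
import tensorflow as tf

def choose_split_point(model: tf.keras.Model, bytes_per_element: int = 1):
    """Pick the layer whose output tensor would be cheapest to transmit."""
    candidates = []
    for idx, layer in enumerate(model.layers):
        shape = layer.output.shape[1:]            # drop the batch dimension
        if any(dim is None for dim in shape):
            continue                              # skip dynamically shaped outputs
        size_bytes = int(np.prod(shape)) * bytes_per_element
        candidates.append((size_bytes, idx, layer.name))
    size_bytes, idx, name = min(candidates)
    print(f"Split after layer {idx} ({name}): {size_bytes / 1024:.1f} KB per inference")
    return idx
```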
What is the primary factor when choosing the split point in split inference between an edge device and the cloud?
Operational Challenges at Fleet Scale
Edge deployment introduces unique operational challenges. Model updates must be distributed to potentially millions of devices across diverse hardware and software versions. Monitoring and debugging are harder without direct access to devices.
Unlike cloud deployment where you control the hardware, edge fleets span multiple device models, OS versions, and hardware capabilities. A model that works perfectly on a flagship phone may crash on a budget device with less memory. Always test on the lowest-capability device in your fleet, and implement runtime capability checks that fall back gracefully.
A mobile keyboard app deploys a next-word prediction model to 500 million devices. Updates use a staged rollout: 0.1% canary for 24 hours, 1% for 48 hours, 10% for a week, then full rollout. Each stage monitors crash rates, latency P99, and prediction quality. If any metric degrades beyond threshold, the rollout automatically halts and devices revert to the previous model version.
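A hedged sketch of the rollout-gating logic behind such a staged deployment. The stage fractions, thresholds, and helper functions below are illustrative placeholders, not the actual system described above.

```python
# Hypothetical staged-rollout gate; stage fractions and thresholds are illustrative.
STAGES = [0.001, 0.01, 0.10, 1.0]        # fraction of the fleet at each stage
THRESHOLDS = {
    "crash_rate": 0.002,                 # max acceptable crash rate
    "latency_p99_ms": 150.0,             # max acceptable P99 latency
    "quality_regression": 0.01,          # max relative drop in prediction quality
}

def staged_rollout(version, deploy_to_fraction, fetch_metrics, rollback):
    """Advance through rollout stages only while fleet health metrics stay in bounds."""
    for fraction in STAGES:
        deploy_to_fraction(version, fraction)
        metrics = fetch_metrics(version)  # placeholder: aggregated fleet telemetry
        if any(metrics[name] > limit for name, limit in THRESHOLDS.items()):
            rollback(version)             # halt the rollout and revert devices
            return False
    return True
```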
Edge Deployment
Running ML inference on devices close to the data source, such as smartphones or IoT devices, rather than in centralized cloud servers.
Split Inference
A deployment strategy that divides model computation between edge devices and cloud servers, balancing latency, bandwidth, and computational requirements.
03 On-Device Inference Optimization
On-device inference optimization encompasses all techniques for making ML models run efficiently on resource-constrained devices. The goal is to meet latency, memory, and power requirements while maintaining acceptable model quality. This often requires a combination of model-level and system-level optimizations.
Quantization: The Most Impactful Optimization
Quantization is the single most impactful optimization for on-device inference. Converting models from FP32 to INT8 typically yields 4x reduction in model size, 2-4x speedup, and significant energy savings.
Quantization
The process of reducing the numerical precision of model weights and activations (e.g., from 32-bit floating point to 8-bit integer), resulting in smaller model size, faster inference, and lower energy consumption, typically with minimal accuracy loss.
Uniform quantization maps a real value x to an integer x_q using a scale s and a zero point z:

x_q = \text{round}\left(\frac{x}{s}\right) + z

| Precision | Bits per Weight | Model Size Reduction | Typical Speedup | Accuracy Impact |
|---|---|---|---|---|
| FP32 (baseline) | 32 | 1x | 1x | None |
| FP16 | 16 | 2x | 1.5-2x | Negligible |
| INT8 | 8 | 4x | 2-4x | Small (<1% loss) |
| INT4 | 4 | 8x | 3-6x | Moderate (1-3% loss) |
| Binary (1-bit) | 1 | 32x | 10-30x | Significant (5-15% loss) |
Table 14.3: Quantization levels and their typical trade-offs.
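A small numeric sketch of the mapping above, with illustrative values: quantize float weights to INT8 using a scale and zero point, then dequantize to see the rounding error.

```python
import numpy as np

def quantize(x, scale, zero_point):
    # x_q = round(x / s) + z, clamped to the INT8 range
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(x_q, scale, zero_point):
    return scale * (x_q.astype(np.float32) - zero_point)

# Illustrative values: weights roughly in [-1, 1], symmetric quantization (z = 0)
w = np.array([-0.93, -0.11, 0.0, 0.42, 0.98], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0
w_q = quantize(w, scale, zero_point)
print(w_q)                                  # [-118  -14    0   53  124]
print(dequantize(w_q, scale, zero_point))   # close to w, up to rounding error
```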
Post-training quantization (PTQ) is faster and requires no retraining — just calibrate on a representative dataset. Quantization-aware training (QAT) simulates quantization during training and recovers most accuracy loss, but requires access to the training pipeline. Start with PTQ; use QAT only if the accuracy drop is unacceptable.
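A minimal post-training INT8 quantization sketch using the TFLite converter. The saved-model path and the calibration samples are placeholders you would replace with your own.

```python
import tensorflow as tf

def representative_data_gen():
    # Placeholder: yield a few hundred real input samples for calibration.
    for sample in calibration_samples:  # hypothetical iterable of input arrays
        yield [tf.cast(sample[None, ...], tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer quantization so the model runs on INT8-only hardware.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```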
Efficient Model Architectures
Model architecture choice is equally important. Architectures designed for efficiency achieve better accuracy per FLOP than larger architectures compressed to the same size.
- MobileNetV3 — Depthwise separable convolutions with squeeze-and-excitation blocks, tuned for low latency on mobile hardware.
- EfficientNet-Lite — Compound-scaled architecture optimized for on-device inference.
- SqueezeNet — Fire modules that minimize parameters while maintaining accuracy.
- MobileViT — Lightweight vision transformer designed for mobile deployment.
- TinyBERT — Distilled BERT variant for on-device NLP tasks.
A common mistake is training a large model (e.g., ResNet-152) and then compressing it for deployment. In practice, starting with an efficient architecture (e.g., MobileNetV3) and applying quantization yields better accuracy at the same latency budget than compressing a large model. Choose the right architecture first.
Runtime Optimizations
Runtime optimizations include operator fusion, memory planning, and hardware-specific kernel selection. Inference runtimes implement these optimizations automatically during model conversion.
Operator fusion combines multiple sequential operations (e.g., convolution + batch normalization + ReLU) into a single kernel. This eliminates intermediate memory writes and reads, which is critical for memory-bound operations. Most inference runtimes apply common fusion patterns automatically, but custom fusion may be needed for non-standard architectures.
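As a concrete example of one common fusion, batch normalization can be folded into the preceding convolution's weights and bias ahead of time. A numpy sketch of that folding, assuming a conv kernel of shape [kh, kw, c_in, c_out] and per-output-channel BN parameters:

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer (gamma, beta, mean, var) into conv weights W and bias b.

    W has shape [kh, kw, c_in, c_out]; the BN parameters have shape [c_out].
    After folding, conv(x, W_f) + b_f equals batchnorm(conv(x, W) + b), so the
    BN layer and its extra memory traffic can be dropped at inference time.
    """
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale
    W_folded = W * scale                 # broadcasts over the last (c_out) axis
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded
```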
On-Device Inference
Running ML model predictions directly on the end-user device without cloud connectivity, enabling low latency and data privacy.
Memory Planning
An optimization that reuses memory buffers across non-overlapping tensor operations during inference, reducing peak memory consumption.
04 On-Device Learning and Adaptation
On-device learning goes beyond inference to enable models to adapt and improve directly on the device using local data. This is particularly valuable for personalization and for privacy-sensitive applications where training data should never leave the device.
Federated Learning
Federated Learning
A distributed machine learning approach where multiple devices collaboratively train a shared model by computing updates locally and sharing only model gradients or parameters — never raw data — with a central aggregation server.
Federated learning enables distributed model improvement without centralizing data. Each device trains a local model update on its private data and sends only the model update to a central server. The server aggregates updates from many devices to improve the global model.
class="tok-comment"># Simplified Federated Averaging (FedAvg) pseudocode
def federated_averaging(global_model, devices, rounds):
for r in range(rounds):
local_updates = []
class="tok-comment"># Select subset of devices for this round
selected = random.sample(devices, k=fraction * len(devices))
for device in selected:
local_model = copy(global_model)
class="tok-comment"># Train locally for E epochs on device data
for epoch in range(E):
local_model.train(device.data)
local_updates.append(local_model.weights - global_model.weights)
class="tok-comment"># Aggregate: weighted average of updates
avg_update = weighted_average(local_updates, weights=device_data_sizes)
global_model.weights += avg_update
return global_modelGoogle's Gboard keyboard uses federated learning to improve next-word prediction for billions of users. Each phone trains on locally typed text, computes a model update, and uploads only the encrypted update. The server aggregates thousands of updates using secure aggregation. No individual user's typing data ever leaves their device.
On-Device Fine-Tuning
On-device fine-tuning is a simpler alternative where a pre-trained model is adapted to local conditions. The key challenge is performing gradient computation and weight updates within the device's memory and compute constraints.
| Approach | What Updates | Memory Cost | Personalization Quality |
|---|---|---|---|
| Last-layer fine-tuning | Only the final classification layer | Very low | Moderate |
| Adapter tuning | Small adapter modules inserted between frozen layers | Low | Good |
| Full fine-tuning | All model parameters | High (need full gradient storage) | Best |
| Prompt tuning | Learned input tokens (LLMs) | Very low | Good for NLP tasks |
Table 14.4: On-device fine-tuning strategies and their trade-offs.
For most on-device personalization use cases, updating only the final classification layer provides surprisingly good results at minimal computational cost. This approach requires storing gradients only for the last layer, making it feasible even on memory-constrained devices.
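A minimal Keras sketch of last-layer fine-tuning under these assumptions: the model's final layer is the classification head, and local_x / local_y are the device's small private dataset (placeholders here).

```python
import tensorflow as tf

def personalize_last_layer(model: tf.keras.Model, local_x, local_y, epochs=3):
    # Freeze every layer except the final classification head, so only the
    # last layer's gradients and optimizer state must fit in device memory.
    for layer in model.layers[:-1]:
        layer.trainable = False
    model.layers[-1].trainable = True

    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(local_x, local_y, epochs=epochs, batch_size=16, verbose=0)
    return model
```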
Catastrophic Forgetting
Continuous adaptation raises concerns about model stability and catastrophic forgetting. Without careful regularization, a model adapted to recent local data may lose its ability to handle less frequent inputs.
Catastrophic Forgetting
The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data. In on-device learning, this means a model personalized to recent user behavior may forget how to handle less frequent but important inputs.
Elastic Weight Consolidation (EWC) penalizes changes to weights that were important for previous tasks. Replay buffers store a small subset of past examples for periodic review. On devices with limited memory, distillation-based approaches — where the personalized model is regularized to stay close to the original model's predictions — are often the most practical.
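As one concrete option, here is a sketch of the distillation-style regularizer mentioned above: the adapted model's loss adds a KL term pulling its predictions toward the frozen original model. The model handles and hyperparameters are illustrative assumptions.

```python
import tensorflow as tf

def regularized_adaptation_loss(original_model, personal_model, x, y,
                                reg_weight=0.5, temperature=2.0):
    """Task loss on local data plus a KL penalty toward the frozen original model.

    original_model stays fixed; personal_model is being adapted on-device.
    reg_weight and temperature are illustrative hyperparameters.
    """
    student_logits = personal_model(x, training=True)
    task_loss = tf.keras.losses.sparse_categorical_crossentropy(
        y, student_logits, from_logits=True)

    # Soften both output distributions and penalize drift from the original model.
    p_orig = tf.nn.softmax(original_model(x, training=False) / temperature)
    p_new = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()(p_orig, p_new)

    return tf.reduce_mean(task_loss) + reg_weight * kl
```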
In federated learning, what information does each device share with the central server?
The core privacy guarantee of federated learning depends on what stays on the device.
Federated Learning
A distributed ML approach where devices collaboratively train a model by sharing only model updates, never raw data, preserving user privacy.
Catastrophic Forgetting
The tendency of neural networks to lose previously learned knowledge when adapted to new data, a key challenge for on-device continuous learning.
05 Extreme Optimization for MCUs
Extreme optimization pushes the boundaries of what is possible on the most resource-constrained devices. When working with only 256 KB of SRAM and a clock measured in tens of megahertz, every byte and every cycle counts. This requires rethinking standard ML practices from first principles.
Binary and Ternary Neural Networks
Binary Neural Network (BNN)
A neural network where weights and/or activations are constrained to binary values (+1 and -1). Multiplications become XNOR operations and additions become popcount, enabling inference using only bitwise CPU instructions with dramatic reductions in model size and compute.
Binary and ternary neural networks represent weights and activations with just 1 or 2 bits, replacing multiplication with simple bitwise operations. While these extreme quantization approaches sacrifice significant accuracy, they enable ML on devices that could not otherwise support any neural network computation.
y = \text{popcount}(\text{XNOR}(\mathbf{w}_b, \mathbf{x}_b))

Binary neural networks sacrifice 10-15% accuracy on ImageNet compared to full-precision models, which is unacceptable for many applications. However, for simple classification tasks (wake word detection, gesture recognition, anomaly detection) where the alternative is no ML at all, binary networks enable intelligence on devices with only a few KB of memory.
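A small sketch of the XNOR-popcount dot product from the formula above, packing +1/-1 vectors into Python integers. This is illustrative only; real BNN kernels operate directly on machine words.

```python
def pack_bits(signs):
    """Pack a +1/-1 vector into an integer bitmask (one bit per element, +1 -> 1)."""
    mask = 0
    for i, s in enumerate(signs):
        if s > 0:
            mask |= 1 << i
    return mask

def binary_dot(w_signs, x_signs):
    """Dot product of two +1/-1 vectors via XNOR + popcount: dot = 2*matches - n."""
    n = len(w_signs)
    w_bits, x_bits = pack_bits(w_signs), pack_bits(x_signs)
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR restricted to the n valid bits
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n

print(binary_dot([+1, -1, +1, +1], [+1, +1, +1, -1]))  # 0, matching the exact dot product
```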
Model-Hardware Co-Design
Model-hardware co-design for MCUs involves selecting architectures that match the specific capabilities of the target hardware. ARM Cortex-M processors, for example, provide DSP/SIMD instructions exposed through the CMSIS-NN kernel library, and those kernels are fastest on certain tensor shapes.
CMSIS-NN on ARM Cortex-M is optimized for specific tensor dimensions (multiples of 4 for INT8 SIMD). Design your model layers to use these native dimensions. A depthwise convolution with 32 filters runs significantly faster than one with 30 filters due to SIMD alignment. Profile on the actual target MCU — simulators can miss these effects.
| MCU Family | ML Acceleration | Typical SRAM | Key Optimization |
|---|---|---|---|
| ARM Cortex-M4 | CMSIS-NN (DSP instructions) | 128-256 KB | INT8 SIMD, depthwise separable convolutions |
| ARM Cortex-M7 | CMSIS-NN + double-precision FPU | 256-512 KB | FP16 support, larger models |
| ARM Cortex-M55 (Ethos-U55) | Dedicated ML accelerator (NPU) | 256 KB-4 MB | NPU offload for supported operations |
| ESP32-S3 | Vector extensions | 512 KB | INT8 inference, audio/sensor models |
Table 14.5: Popular MCU families and their ML capabilities.
Knowledge Distillation for MCUs
Knowledge distillation takes on heightened importance for MCU deployment. A large teacher model trained in the cloud is distilled into an extremely compact student model designed specifically for the target MCU.
L_{distill} = \alpha \cdot L_{CE}(y, \hat{y}_s) + (1 - \alpha) \cdot T^2 \cdot KL(\sigma(z_t / T) \| \sigma(z_s / T))

A cloud-trained transformer model achieves 97% accuracy on a keyword spotting task. Direct training of a small CNN for the target MCU achieves only 89%. Using the transformer as a teacher and distilling into the same CNN architecture yields 94% accuracy — a 5 percentage point improvement from the richer training signal provided by the teacher's soft labels.
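A TensorFlow sketch of the distillation loss above. The alpha and temperature values are illustrative, and teacher_logits / student_logits are assumed to come from the teacher and student models respectively.

```python
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, alpha=0.3, T=4.0):
    """alpha * CE(y, student) + (1 - alpha) * T^2 * KL(softmax(z_t/T) || softmax(z_s/T))."""
    ce = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)

    # Soft targets: both logit sets are softened with the same temperature T.
    teacher_soft = tf.nn.softmax(teacher_logits / T)
    student_soft = tf.nn.softmax(student_logits / T)
    kl = tf.keras.losses.KLDivergence()(teacher_soft, student_soft)

    return alpha * tf.reduce_mean(ce) + (1.0 - alpha) * (T ** 2) * kl
```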
Simulated latency and memory estimates can diverge significantly from real MCU performance. Cache effects, memory alignment, DMA transfer patterns, and interrupt handling all affect real-world performance. Always benchmark your final model on the actual target MCU before committing to a design.
Binary Neural Network
A neural network where weights and/or activations are constrained to binary values (+1/-1), replacing multiplications with bitwise operations for extreme efficiency.
CMSIS-NN
ARM's optimized neural network kernels for Cortex-M processors, providing efficient implementations of common ML operations for microcontrollers.
Key Takeaways
- MCU deployment requires extreme optimization, with models fitting in kilobytes of SRAM and flash storage.
- Edge deployment provides benefits in latency, privacy, and reliability, but introduces challenges in fleet management and monitoring.
- Quantization to INT8 is the single most impactful optimization for on-device inference, providing 4x size reduction.
- Federated learning enables distributed model improvement while keeping user data private on each device.
- Model-hardware co-design produces better results than generic optimization by matching architecture to specific hardware capabilities.