OTA Firmware Updates: End-to-End¶
This is where industrial IoT deployments go wrong most often. A poorly designed OTA system can brick thousands of devices simultaneously. Every element below has been learned from real incidents.
12.1 OTA System Architecture¶
graph TB
subgraph CI["CI/CD Pipeline"]
BUILD[Firmware Build<br/>CMake / PlatformIO]
SIGN[Code Signing<br/>HSM / Vault]
STORE[Artifact Storage<br/>S3 / Azure Blob]
META[Firmware Metadata<br/>DB Record]
end
subgraph OTA_SVC["OTA Service"]
CAMP[Campaign Manager<br/>Rollout Scheduler]
GATE[Canary Gate<br/>Health Monitor]
NOTIFY[Notification Publisher<br/>MQTT]
STAT[Status Tracker]
end
subgraph DEVICE["Device / Gateway"]
OTA_AGENT[OTA Agent]
VERIFY[Verify:<br/>1. Checksum SHA-256<br/>2. Signature ECDSA<br/>3. Version constraint]
APPLY[Apply:<br/>Write to staging partition]
BOOT[Bootloader:<br/>A/B swap + watchdog]
ROLLBACK[Rollback:<br/>Revert to previous]
end
BUILD --> SIGN --> STORE
SIGN --> META
META --> CAMP
CAMP -->|select cohort| GATE
GATE -->|if healthy| NOTIFY
NOTIFY -->|MQTT: ota/notification| OTA_AGENT
OTA_AGENT -->|HTTPS GET signed URL| STORE
STORE -->|firmware binary| OTA_AGENT
OTA_AGENT --> VERIFY
VERIFY -->|valid| APPLY
VERIFY -->|invalid| OTA_AGENT
APPLY --> BOOT
BOOT -->|boot OK| STAT
BOOT -->|boot fail| ROLLBACK
ROLLBACK --> STAT
STAT -->|MQTT: ota/status| OTA_SVC
STAT --> GATE
12.2 Firmware Signing — Non-Negotiable in Industrial IoT¶
Firmware signing is the security control that makes OTA safe to operate at fleet scale. Without it, a compromised OTA service or a MITM attack can push arbitrary code to every device on your platform simultaneously — a single point of failure with catastrophic physical consequences for an industrial deployment. ECDSA P-256 is the recommended algorithm for constrained devices: it matches the security of RSA-3072 with far smaller keys and signatures, which matters for flash budget, manifest size, and radio airtime on MCUs. The signing key must live in an HSM, never on a CI/CD server. Treat a signing key compromise as a Tier 1 security incident requiring full fleet re-provisioning.
Threat: an attacker pushes malicious firmware to 10,000 devices.
Without signing: the device cannot distinguish malicious firmware from legitimate firmware, so the compromise is detected only after deployment.
With signing: the firmware is rejected on the device before any code executes.
Signing process (ECDSA P-256 — equivalent security to RSA-3072 at a fraction of the key and signature size):
1. Build produces: firmware.bin (raw binary)
2. Sign:
# Using OpenSSL
openssl dgst -sha256 -sign firmware_signing.key \
-out firmware.sig firmware.bin
# Verify locally before publishing
openssl dgst -sha256 -verify firmware_signing.pub \
-signature firmware.sig firmware.bin
3. Package (OTA manifest):
{
"firmware_id": "fw-pump-monitor-2.4.0",
"version": "2.4.0",
"device_type": "pump_monitor_v2",
"min_hw_revision": "Rev-B",
"binary_url": "https://ota.acme.com/fw/pump-monitor-2.4.0.bin",
"binary_size_bytes": 524288,
"checksum_sha256": "a3b4c5d6...",
"signature_ecdsa": "3046022100...",
"signing_cert_id": "fw-signing-cert-2026-01",
"release_notes_url": "https://...",
"rollback_version": "2.3.1",
"published_at": "2026-03-19T10:00:00Z"
}
4. Device verification (C pseudocode):
uint8_t fw_signing_pubkey[] = { /* baked into firmware at build time */ };

bool verify_firmware(const uint8_t* fw_data, size_t fw_size,
                     const uint8_t* expected_hash,  /* SHA-256 from OTA manifest */
                     const uint8_t* signature, size_t sig_size) {
    // Step 1: checksum — catches corruption in transit
    uint8_t actual_hash[32];
    sha256(fw_data, fw_size, actual_hash);
    if (memcmp(actual_hash, expected_hash, 32) != 0) {
        LOG_ERROR("Firmware checksum mismatch");
        return false;
    }
    // Step 2: signature — catches tampering and unauthorized firmware
    if (!ecdsa_verify(fw_signing_pubkey, actual_hash, signature, sig_size)) {
        LOG_ERROR("Firmware signature invalid");
        return false;
    }
    return true;
}
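Service-side, it is worth running the mirror-image check before a campaign opens: confirm that the manifest's size and checksum fields actually describe the binary sitting in artifact storage. A minimal sketch in Python — the field names come from the manifest example above, but `validate_manifest` itself is a hypothetical helper, not part of any real OTA service:

```python
import hashlib
import os

def sha256_hex(path: str) -> str:
    """Stream the binary through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_manifest(manifest: dict, binary_path: str) -> list[str]:
    """Return a list of problems; an empty list means manifest and binary agree."""
    problems = []
    actual_size = os.path.getsize(binary_path)
    if manifest["binary_size_bytes"] != actual_size:
        problems.append(
            f"size mismatch: manifest says {manifest['binary_size_bytes']}, "
            f"binary is {actual_size}")
    if manifest["checksum_sha256"] != sha256_hex(binary_path):
        problems.append("checksum mismatch")
    if manifest.get("rollback_version") == manifest["version"]:
        problems.append("rollback_version must differ from version")
    return problems
```

Running this in CI right after signing catches the "wrong firmware served" failure mode (Scenario 2 in the playbook) before any device sees it.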
12.3 A/B Partition — The Only Safe OTA for Industrial¶
A/B (dual-bank) partitioning is the only OTA approach that is safe for unattended industrial devices. Without it, a power failure during a firmware write produces a device with corrupted firmware and no recovery path — the only fix is a field visit. With A/B partitioning, the active firmware continues running on partition A while the new firmware downloads to partition B. The bootloader switches to the new partition only after the download completes and verification succeeds. If the new firmware fails to boot healthy, the bootloader automatically reverts to the known-good partition. This makes OTA failures self-healing at the device level, which is what enables fleet-scale rollouts without field technicians on standby.
Flash layout (embedded Linux / RTOS):
graph TB
subgraph FLASH["Flash Storage Layout (total: ~16 MB example)"]
BL["Bootloader — 64 KB<br/>Read-only, NEVER updated via OTA<br/>Manages A/B swap + watchdog"]
BC["Boot Config — 4 KB<br/>Active partition pointer<br/>rollback_on_fail flag"]
PA["Partition A — 4 MB<br/>Active firmware (currently running v2.3)<br/>Verified good — do not modify"]
PB["Partition B — 4 MB<br/>Staging partition (OTA download target)<br/>Write new v2.4 here while A runs"]
DP["Data Partition — 8 MB<br/>Config, certs, local SQLite DB<br/>Survives OTA — never erased"]
end
BL --> BC
BC --> PA
BC --> PB
PA -.->|"boot config points here during normal ops"| BC
PB -.->|"after OTA: boot config switches pointer here"| BC
DP -.->|"independent of firmware partitions"| BL
Update sequence:
1. Download v2.4 to Partition B (Partition A still running)
2. Verify checksum + signature of Partition B
3. Set boot config: next_boot = B, rollback_on_fail = true
4. Set watchdog timer: 120s (if new fw doesn't check in, watchdog reboots)
5. Reboot
6. Bootloader reads boot config → boots Partition B (v2.4)
7. New firmware starts, runs health checks
8. If healthy: call confirm_update() → set boot config: active = B, permanent
9. If unhealthy: watchdog fires OR firmware calls rollback() → bootloader boots Partition A (v2.3) → device publishes: ota/status {status: rolled_back, reason: "health check failed"}
What "healthy" means — device must validate:
- Connects to MQTT broker within 30s
- All required drivers initialize
- Configuration loaded successfully
- First telemetry message published
- No hard faults in first 60s
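On an embedded Linux gateway, the health gate might be sketched as a small runner that executes each check before the watchdog deadline and tells the OTA agent whether to confirm or roll back. The 60s budget mirrors the criteria above; the check functions themselves and all names here are assumptions:

```python
import time
from typing import Callable

# Each check returns True on success; keys name the criteria being validated.
HealthCheck = Callable[[], bool]

def run_health_gate(checks: dict[str, HealthCheck],
                    deadline_s: float = 60.0) -> tuple[bool, list[str]]:
    """Run all post-boot health checks within the watchdog budget.

    Returns (healthy, failed_check_names). The caller confirms the update
    only when healthy is True; otherwise it requests a rollback.
    """
    failed = []
    start = time.monotonic()
    for name, check in checks.items():
        if time.monotonic() - start > deadline_s:
            failed.append(f"{name}: deadline exceeded")
            break
        try:
            if not check():
                failed.append(name)
        except Exception as exc:  # a crashing check counts as unhealthy
            failed.append(f"{name}: {exc!r}")
    return (len(failed) == 0, failed)
```

The agent would call confirm_update() when the gate passes and rollback() otherwise, matching steps 8 and 9 of the update sequence.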
12.4 OTA State Machine — Full Device Lifecycle¶
stateDiagram-v2
[*] --> Idle: Normal operation
Idle --> Notified: OTA notification received
Notified --> Checking: Check version constraint and hw_revision
Checking --> Idle: Version same or hw not compatible
Checking --> DownloadScheduled: Compatible - schedule for maintenance window
DownloadScheduled --> Downloading: Maintenance window opens or immediate if forced
Downloading --> Verifying: Download complete
Downloading --> DownloadFailed: Network error or timeout
DownloadFailed --> DownloadScheduled: Retry (backoff)
Verifying --> Verified: Checksum + signature OK
Verifying --> VerifyFailed: Checksum or signature mismatch
VerifyFailed --> Idle: Report failure - do not apply
Verified --> WaitingApply: Wait for apply window or operator approval
WaitingApply --> Applying: Apply command received or auto-apply window
Applying --> Rebooting: Written to partition B - boot config updated
Rebooting --> HealthCheck: New firmware boots
HealthCheck --> Confirmed: All health checks pass - confirm_update() called
HealthCheck --> RolledBack: Health check fails or watchdog fires
Confirmed --> Idle: Report success
RolledBack --> Idle: Report rollback - alert ops
12.5 Rollout Campaigns — Safe Deployment at Fleet Scale¶
At fleet scale, a firmware update is a distributed systems operation, not a simple file push. The phased rollout approach below is designed to catch firmware regressions before they reach the full fleet, with each phase acting as a gate that must pass before the next opens. The critical operational discipline is automatic rollback: the campaign manager must monitor health signals and pause the campaign automatically, not wait for a human to notice a problem. By the time an on-call engineer manually notices an elevated rollback rate at 3am, hundreds more devices may have already been enrolled.
Rollout phases for a 5,000-device fleet:
Phase 0 — Lab / Staging (before any production device):
Target: Test bench devices (not production)
Duration: 24h burn-in
Criteria: Zero crashes, all telemetry nominal, command round-trip < 500ms
Phase 1 — Canary (1%, ~50 devices):
Target: Non-critical devices with human oversight
Duration: 48h
Auto-rollback triggers:
- OTA success rate < 95%
- Post-update crash rate > 2%
- Telemetry gap rate > 5%
- Any device permanently bricked (requires manual recovery)
Proceed criteria: all triggers green for 48h
Phase 2 — Early adopters (10%, ~500 devices):
Target: Mix of criticality levels
Duration: 72h
Monitor: same triggers, expand crash monitoring to memory/CPU trends
Phase 3 — General rollout (50% → 100%):
Batch size: 500 devices/hour (don't blast all at once)
Stagger: different sites in different hours (follow sun)
Skip: devices actively executing critical processes (interlock)
Scheduling logic:
- Check device.is_in_active_process() before scheduling OTA
- Prefer: weekends, nights, planned maintenance windows
- Maintenance window config per device/site:
"ota_window": "Sun 02:00-06:00 UTC"
Post-rollout:
- 7-day observation window before closing campaign
- Compare: MTBF before vs. after update
- Energy consumption delta (firmware bugs can cause CPU spin)
- Telemetry quality score before vs. after
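The Phase 1 auto-rollback triggers are simple ratios, so the campaign manager's gate can be sketched directly. The thresholds are the ones listed above; the `CohortHealth` shape and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CohortHealth:
    """Aggregated health of the devices enrolled in the current phase."""
    devices: int
    ota_succeeded: int
    crashed_post_update: int
    telemetry_gaps: int
    bricked: int

def should_pause_campaign(h: CohortHealth) -> list[str]:
    """Evaluate the auto-rollback triggers; any returned reason pauses enrollment."""
    if h.devices == 0:
        return ["no devices reporting"]
    reasons = []
    if h.ota_succeeded / h.devices < 0.95:
        reasons.append("OTA success rate < 95%")
    if h.crashed_post_update / h.devices > 0.02:
        reasons.append("post-update crash rate > 2%")
    if h.telemetry_gaps / h.devices > 0.05:
        reasons.append("telemetry gap rate > 5%")
    if h.bricked > 0:
        reasons.append("device bricked — manual recovery required")
    return reasons
```

The important property is that this runs on every status update, not on a human schedule: one bricked device pauses enrollment immediately, at 3am or otherwise.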
12.6 Delta OTA — When Bandwidth Is Constrained¶
For LoRaWAN, satellite, or metered cellular devices, full firmware downloads may be physically impossible or economically prohibitive. Delta OTA (binary patching) transmits only the differences between firmware versions, typically achieving 85–95% size reduction for incremental updates. This comes with significant operational complexity: you must maintain a delta catalog for every supported upgrade path, and the device must have sufficient RAM to hold three firmware copies simultaneously during reconstruction. Add this complexity only when bandwidth genuinely constrains your deployment — on Wi-Fi or wired Ethernet, full image OTA is simpler and more reliable.
Binary delta generation (bsdiff / xdelta3):
xdelta3 -e -s old_firmware.bin new_firmware.bin delta.patch
# old: 512KB, new: 524KB, delta: typically 20-80KB (90% reduction)
Device reconstruction:
xdelta3 -d -s current_firmware.bin delta.patch new_firmware.bin
# Verify new_firmware.bin checksum before applying
Constraints:
- Device must have enough RAM/storage for 3 copies during reconstruction
- Delta is FROM-version specific — need delta for every upgrade path
- Keep delta catalog: v2.1→v2.2, v2.2→v2.3, v2.1→v2.3, etc.
- Maintain for N-2 versions minimum
When to use delta vs. full:
- Bandwidth headroom > 10x firmware size: full image (simpler)
- LoRaWAN (< 250 bytes/message): delta mandatory, plus chunking
- LTE-M on a metered data plan: delta if firmware > 100KB
- Wi-Fi / Ethernet: full image almost always
12.7 OTA Failure Recovery Playbook¶
Every OTA failure mode in the playbook below has been observed in production fleets. The most important operational discipline is treating any automatic rollback as a production incident requiring immediate investigation — not as a successful safety mechanism to be acknowledged and forgotten. A rollback means real firmware that passed your CI/CD pipeline and canary phase failed in the field, which means your canary criteria missed something. Understanding why is more important than the rollback itself. The Scenario 4 case (bricked device) should be treated as a bootloader bug — A/B partition design makes bricking via OTA theoretically impossible, so any brick indicates a gap in your invariants.
Scenario 1: Device fails to download (network timeout)
Automatic: retry with exponential backoff (max 6h between attempts)
After 3 failed attempts: alert operations team
Manual: operator can force-retry or defer campaign
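The Scenario 1 retry policy can be sketched as exponential backoff with jitter. Only the 6h cap and the 3-attempt alert threshold come from the playbook; the base delay and the full-jitter choice are assumptions:

```python
import random

MAX_BACKOFF_S = 6 * 3600   # playbook cap: max 6h between attempts
ALERT_AFTER_ATTEMPTS = 3   # playbook: alert operations after 3 failures

def next_retry_delay(attempt: int, base_s: float = 60.0) -> float:
    """Delay before the next download attempt.

    `attempt` is the number of failed attempts so far (1, 2, 3, ...).
    Full jitter spreads retries out so a fleet-wide network blip doesn't
    produce a synchronized thundering herd against artifact storage.
    """
    ceiling = min(MAX_BACKOFF_S, base_s * (2 ** (attempt - 1)))
    return random.uniform(0, ceiling)

def should_alert_ops(attempt: int) -> bool:
    """True once the failure count crosses the playbook's alert threshold."""
    return attempt >= ALERT_AFTER_ATTEMPTS
```

The jitter matters as much as the cap: without it, every device that failed at the same moment retries at the same moment, repeating the overload that caused the timeouts.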
Scenario 2: Verification failure (checksum mismatch)
Automatic: delete partial download, alert immediately
Cause: corrupted download (most common), or wrong firmware served
Manual: check artifact storage checksum, re-serve
Scenario 3: New firmware boot fails (watchdog fires, rollback)
Automatic: device rolls back to last good firmware, reports status
Alert: immediate — paged to on-call (any rollback is a production incident)
Root cause: new firmware incompatible with device state/config
Manual: investigate logs, fix firmware, re-test on canary before re-campaign
Scenario 4: Device bricked (doesn't boot after OTA, no rollback)
Cause: bootloader corruption (should be impossible with A/B, means bootloader bug)
or hardware failure triggered during reboot
Manual: field technician physical recovery or RMA
Prevention: never OTA the bootloader via the same OTA channel as application
Scenario 5: Mass rollback (>5% of campaign devices rolled back)
Automatic: campaign paused, no further devices enrolled
Alert: escalate to engineering lead, not just ops
Manual: investigate with devices in Phase 1/2 before proceeding