Disaster Recovery & Business Continuity
IoT platforms monitor physical processes. An outage is not just a service disruption: in industrial contexts, it may mean no visibility into plant operations, loss of alarm monitoring, or inability to issue commands to field devices. Recovery time objective (RTO) and recovery point objective (RPO) targets must be defined per component based on business impact, not assumed to be "as low as possible." The broker is the most critical component because all devices connect to it. The time-series database is important but not immediately safety-critical: stale dashboards are acceptable for 30 minutes; missing alarms are not.
22.1 Component RTO/RPO Targets
| Component | RTO Target | RPO Target | Failure Impact | Recovery Approach |
|---|---|---|---|---|
| MQTT Broker cluster | 5 min | 0 within the cluster (persistent sessions); cross-region failover loses session state unless it is stored externally (§22.2) | All devices disconnect; no telemetry ingested | Active-passive cluster with DNS failover (§22.2) |
| Ingestion workers | 15 min | 0 (Kafka retains messages during downtime) | Data delayed, not lost — Kafka buffers during outage | Auto-restart via Kubernetes; scale up on lag |
| Time-series DB | 30 min | 5 min (streaming replication lag) | Dashboards show stale data; alarms may miss recent events | Streaming replica promoted to primary; WAL replay |
| Device Registry | 1 hr | 15 min | New device provisioning fails; existing devices unaffected | Read replica promoted; registry is low-write |
| OTA Service | 4 hr | 1 hr | Firmware rollouts paused — not safety-critical | Not on critical path; manual rollout possible |
| API Gateway | 15 min | N/A (stateless) | Dashboards and integrations lose API access | Auto-restart; stateless — no data recovery needed |
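
The targets in the table can be encoded as data so that a post-incident review or a recovery drill can be checked against them automatically. The following is a minimal sketch: the component keys, the `RecoveryTarget` type, and the `breaches` helper are illustrative names, not part of any existing tooling.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta
    rpo: Optional[timedelta]  # None means N/A (stateless component)


# Values copied from the table above; the dictionary keys are hypothetical.
TARGETS = {
    "mqtt_broker": RecoveryTarget(timedelta(minutes=5), timedelta(0)),
    "ingestion_workers": RecoveryTarget(timedelta(minutes=15), timedelta(0)),
    "timeseries_db": RecoveryTarget(timedelta(minutes=30), timedelta(minutes=5)),
    "device_registry": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "ota_service": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "api_gateway": RecoveryTarget(timedelta(minutes=15), None),
}


def breaches(component: str, downtime: timedelta,
             data_loss: timedelta = timedelta(0)) -> List[str]:
    """Return which targets ("RTO", "RPO") an observed outage exceeded."""
    target = TARGETS[component]
    exceeded = []
    if downtime > target.rto:
        exceeded.append("RTO")
    # Stateless components have no RPO to breach.
    if target.rpo is not None and data_loss > target.rpo:
        exceeded.append("RPO")
    return exceeded
```

For example, `breaches("timeseries_db", timedelta(minutes=20), timedelta(minutes=7))` reports only the RPO as exceeded, because 20 minutes of downtime is within the 30-minute RTO.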
22.2 Multi-Region Broker Failover
Broker failover is the most complex recovery scenario because thousands of devices must reconnect within minutes without operator intervention. The architecture uses active-passive with DNS-based failover:
graph TB
DEVICES["IoT Devices<br/>connect to broker.acme.com<br/>TTL: 60s"]
DNS["DNS (Route53 / Cloudflare)<br/>Health check every 10s<br/>Failover to secondary<br/>within 60s of primary failure"]
PRIMARY["Primary Region<br/>EMQX Cluster (3 nodes)<br/>Active — receiving connections"]
SECONDARY["Secondary Region<br/>Standby EMQX Cluster<br/>Warm standby — no connections<br/>until failover"]
KAFKA_MM["Kafka MirrorMaker 2<br/>Replicates raw-telemetry topic<br/>cross-region in real time"]
KAFKA_P["Kafka (Primary Region)<br/>raw-telemetry topic"]
KAFKA_S["Kafka (Secondary Region)<br/>raw-telemetry.replica topic"]
DEVICES --> DNS
DNS -->|Normal| PRIMARY
DNS -->|After failover| SECONDARY
PRIMARY --> KAFKA_P
KAFKA_P --> KAFKA_MM --> KAFKA_S

Device reconnection during failover: Devices detect the connection drop via keepalive timeout (typically 60–120 seconds), then reconnect using exponential backoff starting at 1 second and doubling up to a maximum of 60 seconds. In parallel, the DNS health check (every 10 seconds) triggers failover within 60 seconds of the primary failing, and resolvers that honour the 60-second TTL pick up the new record at most 60 seconds after that. The worst case for DNS is therefore about 120 seconds, which is roughly when a device with a long keepalive makes its first reconnect attempt. A device that detects the drop earlier may still resolve the primary's address; that attempt fails, and the backoff retries carry it past the DNS switch. Devices must resolve the DNS name fresh on each connection attempt and must not cache the resolved IP. With that in place, the device reconnects to the secondary region without any device-side configuration change.
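
The reconnection behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not device firmware: `connect` stands in for whatever MQTT client call the device uses, `max_attempts` is an arbitrary limit, and `resolve_fresh` goes through the operating system resolver, which may itself cache answers for up to the record's TTL.

```python
import socket
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 1.0, maximum: float = 60.0) -> Iterator[float]:
    """Yield 1, 2, 4, ... seconds, capped at `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, maximum)


def resolve_fresh(host: str, port: int) -> str:
    """Look the name up again on every call instead of reusing an old IP."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]  # address part of the first sockaddr


def reconnect(host: str, port: int,
              connect: Callable[[str, int], None],
              resolve: Callable[[str, int], str] = resolve_fresh,
              sleep: Callable[[float], None] = time.sleep,
              max_attempts: int = 20) -> str:
    """Retry with backoff, re-resolving DNS before each attempt.

    `connect` is a hypothetical callable that raises OSError on failure.
    Returns the address that finally accepted the connection.
    """
    delays = backoff_delays()
    for _ in range(max_attempts):
        sleep(next(delays))
        try:
            address = resolve(host, port)  # may return the secondary after failover
            connect(address, port)
            return address
        except OSError:
            continue  # resolution or connection failed; back off and retry
    raise ConnectionError(f"could not reconnect to {host} after {max_attempts} attempts")
```

Because the name is resolved inside the loop, an attempt that still lands on the failed primary costs only one backoff step; the next attempt picks up the updated record once it is visible to the device's resolver.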
Persistent sessions: Devices connecting with clean_session=false have their session state (subscription list, unacknowledged QoS 1/2 messages) stored in the broker. The secondary broker does not have this state, so session persistence across broker failover requires either external session storage (a Redis cluster replicated cross-region) or accepting that sessions are re-established from scratch. For most industrial IoT deployments, starting from a fresh session after failover is acceptable because devices immediately re-subscribe to their command topics. The trade-off is that messages the primary broker had queued for a device at the moment of failover are not redelivered.
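
On the device side, re-establishing a session from scratch amounts to checking the Session Present flag that the broker returns in its CONNACK and re-subscribing when it is not set. The sketch below is library-agnostic: `subscribe` is a hypothetical callable standing in for the MQTT client's subscribe method, and the topic name in the usage note is made up.

```python
from typing import Callable, Iterable, Tuple


def restore_subscriptions(session_present: bool,
                          subscriptions: Iterable[Tuple[str, int]],
                          subscribe: Callable[[str, int], None]) -> int:
    """Re-subscribe after connecting if the broker has no stored session.

    Returns the number of subscriptions that were re-issued.
    """
    if session_present:
        # The broker still holds the session, so subscriptions are intact.
        return 0
    count = 0
    for topic, qos in subscriptions:
        subscribe(topic, qos)
        count += 1
    return count
```

After a failover to the secondary region the flag is expected to be false, so a call such as `restore_subscriptions(False, [("devices/pump-17/commands", 1)], client_subscribe)` re-issues the command-topic subscription; on an ordinary reconnect to the same cluster it does nothing.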
22.3 Time-Series Database Backup Strategy
TimescaleDB (Postgres) backup is a three-tier strategy balancing RPO against storage cost:
Tier 1 (continuous WAL archival, 5-minute RPO): PostgreSQL WAL (Write-Ahead Log) is archived continuously to S3, which allows point-in-time recovery to any point within the retention window. Retention: 7 days of WAL. Volume: approximately 2–5 GB/day for a 10,000-device deployment at 1-message/second average.
Tier 2 (hypertable chunk export to Parquet): When TimescaleDB closes a hypertable chunk (typically daily or weekly depending on chunk interval), export it to Parquet on S3. The export is immutable, queryable with Athena/DuckDB, and stored indefinitely. Cost: S3 Standard ($0.023/GB/month), dropping to S3 Glacier ($0.004/GB/month) after 90 days.
Tier 3 (aggregate table pg_dump): The 1-minute and 1-hour aggregate tables are small (a few GB per year for most deployments) and can be dumped nightly with pg_dump. These are the tables dashboards query most, so restoring them first brings dashboards back quickly even if raw data recovery takes longer.
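
The Tier 2 pricing can be turned into a rough monthly storage estimate. The sketch below uses only the two per-GB rates and the 90-day transition quoted above; it ignores request, retrieval, and transition fees, and the function name is illustrative.

```python
# Rates and transition age as quoted for Tier 2 above.
STANDARD_USD_PER_GB_MONTH = 0.023
GLACIER_USD_PER_GB_MONTH = 0.004
TRANSITION_AGE_DAYS = 90


def monthly_storage_cost_usd(size_gb: float, age_days: int) -> float:
    """Estimate one month of storage cost for an exported chunk of a given age."""
    if age_days < TRANSITION_AGE_DAYS:
        rate = STANDARD_USD_PER_GB_MONTH
    else:
        rate = GLACIER_USD_PER_GB_MONTH
    return size_gb * rate
```

Under these rates, 100 GB of exported chunks costs about $2.30 per month while in S3 Standard and about $0.40 per month once moved to Glacier.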
Recovery drill: Conduct a monthly point-in-time recovery drill. Restore the last 24 hours of data to a test TimescaleDB instance, run SELECT COUNT(*) FROM telemetry WHERE ts > NOW() - INTERVAL '24 hours', and compare the result against the production count for the same time window. Document the drill result; documented, regularly tested restores are expected under ISO 27001 and most regulated-industry audits.
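
The comparison step of the drill can be sketched as a small function that takes the two row counts and produces a record to file with the drill report. The `max_missing_fraction` tolerance is an assumption added here, since the restored instance stops at its recovery point while production keeps ingesting; set it to zero to require an exact match.

```python
from typing import Dict, Union


def drill_result(production_count: int, restored_count: int,
                 max_missing_fraction: float = 0.0) -> Dict[str, Union[int, float, bool]]:
    """Compare row counts from production and the restored test instance."""
    if production_count < 0 or restored_count < 0:
        raise ValueError("row counts must be non-negative")
    # Rows present in production but absent after the restore.
    missing = max(production_count - restored_count, 0)
    fraction = missing / production_count if production_count else 0.0
    return {
        "missing_rows": missing,
        "missing_fraction": fraction,
        "passed": fraction <= max_missing_fraction,
    }
```

The returned dictionary is meant to be stored alongside the date and recovery target time of the drill, so that each month's result is auditable.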
22.4 Incident Severity Levels
| Severity | Definition | Response Time | Escalation Path | Communication |
|---|---|---|---|---|
| P0 | Broker down, >50% of fleet offline | Immediate (on-call paged) | CTO within 15 min | Customer status page updated within 5 min |
| P1 | Ingestion lag >5 min, data gap risk; single region degraded | 15 min | Engineering lead within 30 min | Status page updated; affected customers notified |
| P2 | Single site offline; OTA service unavailable; API latency elevated | 1 hr | On-call engineer | Status page note; no individual customer notification |
| P3 | Individual device offline; cert expiry warning; non-critical service degraded | Next business day | Ticket assigned | Internal only |
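
The severity definitions in the table can be expressed as a classification function for use in alert routing. This is a sketch under assumptions: the `Signals` fields are hypothetical names for whatever the monitoring stack reports, and the P0 row is read as "broker down or more than 50% of the fleet offline".

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical monitoring inputs; field names are illustrative."""
    broker_down: bool = False
    fleet_offline_fraction: float = 0.0
    ingestion_lag_minutes: float = 0.0
    region_degraded: bool = False
    site_offline: bool = False
    ota_unavailable: bool = False
    api_latency_elevated: bool = False


def classify(s: Signals) -> str:
    """Map monitoring signals to the highest matching severity level."""
    if s.broker_down or s.fleet_offline_fraction > 0.5:
        return "P0"
    if s.ingestion_lag_minutes > 5 or s.region_degraded:
        return "P1"
    if s.site_offline or s.ota_unavailable or s.api_latency_elevated:
        return "P2"
    # Individual device offline, cert expiry warnings, and similar.
    return "P3"
```

Checking the levels from most to least severe means an incident that matches several rows is always reported at the highest one.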