Disaster Recovery & Business Continuity

IoT platforms monitor physical processes. An outage is not just a service disruption — in industrial contexts, it may mean no visibility into plant operations, loss of alarm monitoring, or inability to issue commands to field devices. RTO and RPO targets must be defined per component based on business impact, not assumed to be "as low as possible." The broker is the most critical component — all devices are connected to it. The time-series database is important but not immediately safety-critical — stale dashboards are acceptable for 30 minutes; missing alarms are not.

22.1 Component RTO/RPO Targets

| Component | RTO Target | RPO Target | Failure Impact | Recovery Approach |
|---|---|---|---|---|
| MQTT Broker cluster | 5 min | 0 within the cluster (persistent sessions); cross-region failover loses session state unless it is stored externally (§22.2) | All devices disconnect; no telemetry ingested | Active-passive cluster with DNS failover (§22.2) |
| Ingestion workers | 15 min | 0 (Kafka retains messages during downtime) | Data delayed, not lost; Kafka buffers during outage | Auto-restart via Kubernetes; scale up on lag |
| Time-series DB | 30 min | 5 min (streaming replication lag) | Dashboards show stale data; alarms may miss recent events | Streaming replica promoted to primary; WAL replay |
| Device Registry | 1 hr | 15 min | New device provisioning fails; existing devices unaffected | Read replica promoted; registry is low-write |
| OTA Service | 4 hr | 1 hr | Firmware rollouts paused; not safety-critical | Not on critical path; manual rollout possible |
| API Gateway | 15 min | N/A (stateless) | Dashboards and integrations lose API access | Auto-restart; stateless, so no data recovery needed |
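
Keeping the targets in a machine-readable form makes it possible to check drill and incident results against them automatically. The following is a minimal sketch, not a prescribed implementation: the values are copied from the table above, while the dictionary keys and the `breaches` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta
    rpo: Optional[timedelta]  # None means N/A (stateless component)


# Values from the RTO/RPO table; the keys are illustrative names.
TARGETS = {
    "mqtt_broker": RecoveryTarget(timedelta(minutes=5), timedelta(0)),
    "ingestion_workers": RecoveryTarget(timedelta(minutes=15), timedelta(0)),
    "timeseries_db": RecoveryTarget(timedelta(minutes=30), timedelta(minutes=5)),
    "device_registry": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "ota_service": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "api_gateway": RecoveryTarget(timedelta(minutes=15), None),
}


def breaches(component: str, downtime: timedelta,
             data_loss: timedelta = timedelta(0)) -> List[str]:
    """Return which targets ("RTO", "RPO") a measured outage exceeded."""
    target = TARGETS[component]
    exceeded = []
    if downtime > target.rto:
        exceeded.append("RTO")
    if target.rpo is not None and data_loss > target.rpo:
        exceeded.append("RPO")
    return exceeded
```

For example, `breaches("mqtt_broker", timedelta(minutes=7))` reports an RTO breach, while a 20-minute time-series DB outage with 3 minutes of data loss stays within both targets.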

22.2 Multi-Region Broker Failover

Broker failover is the most complex recovery scenario because thousands of devices must reconnect within minutes without operator intervention. The architecture uses active-passive with DNS-based failover:

graph TB
    DEVICES["IoT Devices<br/>connect to broker.acme.com<br/>TTL: 60s"]

    DNS["DNS (Route53 / Cloudflare)<br/>Health check every 10s<br/>Failover to secondary<br/>within 60s of primary failure"]

    PRIMARY["Primary Region<br/>EMQX Cluster (3 nodes)<br/>Active — receiving connections"]
    SECONDARY["Secondary Region<br/>Standby EMQX Cluster<br/>Warm standby — no connections<br/>until failover"]

    KAFKA_MM["Kafka MirrorMaker 2<br/>Replicates raw-telemetry topic<br/>cross-region in real time"]

    KAFKA_P["Kafka (Primary Region)<br/>raw-telemetry topic"]
    KAFKA_S["Kafka (Secondary Region)<br/>raw-telemetry.replica topic"]

    DEVICES --> DNS
    DNS -->|Normal| PRIMARY
    DNS -->|After failover| SECONDARY
    PRIMARY --> KAFKA_P
    KAFKA_P --> KAFKA_MM --> KAFKA_S

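Device reconnection during failover: Devices detect the connection drop via keepalive timeout (typically 60–120 seconds). They then reconnect using exponential backoff starting at 1 second and doubling up to a maximum of 60 seconds. By the time the first reconnect attempt is made, the DNS record has usually been updated already: the health check runs every 10 seconds and failover completes within 60 seconds of the primary failing, which is no longer than the keepalive timeout. A resolver may still serve the old record until its 60-second TTL expires, so an early attempt can land on the failed primary. That attempt fails and the backoff retries until the new record is visible, roughly 120 seconds after the failure in the worst case, provided resolvers honour the TTL. This depends on devices resolving the DNS name afresh on each connection attempt rather than caching the resolved IP. With that in place, devices reconnect to the secondary region without any device-side configuration change.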
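
The reconnect behaviour described above can be sketched as follows. This is an illustrative outline rather than real device firmware: `connect` stands in for whatever MQTT client call the device uses, and `reconnect`, `resolve_fresh`, and `backoff_delays` are hypothetical names. The key points are the 1-second start, the doubling capped at 60 seconds, and the DNS lookup repeated on every attempt.

```python
import socket
import time
from typing import Callable, Iterator, Optional


def backoff_delays(initial: float = 1.0, maximum: float = 60.0) -> Iterator[float]:
    """Yield 1, 2, 4, ... seconds, capped at `maximum`, indefinitely."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, maximum)


def resolve_fresh(host: str, port: int) -> str:
    """Look the broker name up on every call instead of reusing an old IP.

    Note: the OS or an upstream resolver may still cache within the record TTL.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]


def reconnect(host: str, port: int,
              connect: Callable[[str, int], None],
              resolve: Callable[[str, int], str] = resolve_fresh,
              sleep: Callable[[float], None] = time.sleep,
              max_attempts: Optional[int] = None) -> str:
    """Retry until `connect(ip, port)` succeeds and return the IP that worked."""
    attempt = 0
    for delay in backoff_delays():
        attempt += 1
        sleep(delay)  # first attempt happens after the 1-second backoff
        try:
            ip = resolve(host, port)  # fresh lookup, so a DNS failover is picked up
            connect(ip, port)
            return ip
        except OSError:  # covers DNS errors, refused connections, and timeouts
            if max_attempts is not None and attempt >= max_attempts:
                raise
    raise AssertionError("unreachable: backoff_delays() never ends")
```

Production clients often add random jitter to each delay so that thousands of devices do not retry in lockstep; that is omitted here to keep the schedule easy to read.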

Persistent sessions: Devices connecting with clean_session=false have their session state (subscription list, unacknowledged QoS 1/2 messages) stored in the broker. The secondary broker does not have this state. Preserving sessions across broker failover requires external session storage (for example, a Redis cluster replicated cross-region). The alternative is to accept that sessions are re-established from scratch, which means any QoS 1/2 messages still queued on the primary broker at the moment of failover are lost. For most industrial IoT deployments, starting from a fresh session is acceptable because devices immediately re-subscribe to their command topics on reconnect.
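
On the device side, the decision to re-subscribe can be driven by the Session Present flag that the broker returns in CONNACK. The sketch below is library-agnostic and assumes a hypothetical `subscribe(topic, qos)` callable supplied by the MQTT client in use; the topic name in the usage note is also made up for illustration.

```python
from typing import Callable, Iterable, List


def topics_to_resubscribe(session_present: bool, wanted: Iterable[str]) -> List[str]:
    """Return the topics a device must subscribe to after CONNACK.

    If the broker reports an existing session, its stored subscriptions are
    still valid. After failover to a broker without that state the flag is
    False, so every wanted topic has to be subscribed again.
    """
    return [] if session_present else list(wanted)


def on_connected(session_present: bool, wanted: Iterable[str],
                 subscribe: Callable[[str, int], None]) -> None:
    """Call from the client's connect callback; `subscribe` is client-specific."""
    for topic in topics_to_resubscribe(session_present, wanted):
        subscribe(topic, 1)  # QoS 1 chosen here as an assumption for command topics
```

A device that reconnects to the secondary region would call `on_connected(False, ["devices/dev-1/commands/#"], client_subscribe)` and end up with its command subscription restored without operator action.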

22.3 Time-Series Database Backup Strategy

TimescaleDB (Postgres) backup is a three-tier strategy balancing RPO against storage cost:

Tier 1: Continuous WAL archival (5-minute RPO). PostgreSQL WAL (Write-Ahead Log) is streamed continuously to S3. Point-in-time recovery to any point within the retention window is possible. Retention: 7 days of WAL. Volume: approximately 2–5 GB/day for a 10,000-device deployment at 1-message/second average. Actual WAL volume depends on row width, indexes, and compression, so measure it on the real workload before sizing storage.
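
A quick back-of-the-envelope check helps when sizing WAL storage. The sketch below assumes the "1-message/second average" is per device, which is one possible reading of the figure above; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def messages_per_day(devices: int, msgs_per_device_per_sec: float) -> int:
    """Total fleet messages per day."""
    return int(devices * msgs_per_device_per_sec * SECONDS_PER_DAY)


def implied_wal_bytes_per_message(wal_gb_per_day: float, msgs_per_day: int) -> float:
    """WAL bytes per message implied by a daily WAL volume (1 GB = 1e9 bytes)."""
    return wal_gb_per_day * 1e9 / msgs_per_day


def wal_gb_per_day(msgs_per_day: int, wal_bytes_per_message: float) -> float:
    """Daily WAL volume in GB for a measured per-message WAL size."""
    return msgs_per_day * wal_bytes_per_message / 1e9
```

Under the per-device reading, `messages_per_day(10_000, 1.0)` is 864,000,000, and 2–5 GB/day works out to roughly 2.3–5.8 bytes of WAL per message. That is small for an uncompressed row insert, which suggests the quoted range rests on assumptions not stated here (batching, compression, or a fleet-wide rather than per-device rate). Plugging a measured per-message WAL size into `wal_gb_per_day` gives a more reliable figure for a specific deployment.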

Tier 2: Hypertable chunk export to Parquet. When TimescaleDB closes a hypertable chunk (typically daily or weekly depending on chunk interval), export it to Parquet on S3 cold storage. The export is immutable, queryable with Athena/DuckDB, and stored indefinitely. Cost: S3 standard ($0.023/GB/month) dropping to S3 Glacier ($0.004/GB/month) after 90 days. Depending on the Glacier storage class, objects may need to be restored before they can be queried again.
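
To see how the two price points combine, the following sketch estimates the monthly storage bill for a steadily growing Parquet archive. The prices and the 90-day transition come from the text above; the daily export volume is a hypothetical input, and request, retrieval, and transition fees are ignored.

```python
S3_STANDARD_PER_GB_MONTH = 0.023  # USD, from the text above
S3_GLACIER_PER_GB_MONTH = 0.004   # USD, from the text above
GLACIER_AFTER_DAYS = 90


def monthly_storage_cost(gb_per_day: float, days_elapsed: int) -> float:
    """Approximate monthly storage charge once `days_elapsed` days are archived.

    Data younger than 90 days is billed at the standard rate, older data at
    the Glacier rate. Storage only: no request or retrieval charges.
    """
    standard_days = min(days_elapsed, GLACIER_AFTER_DAYS)
    glacier_days = max(days_elapsed - GLACIER_AFTER_DAYS, 0)
    standard_gb = gb_per_day * standard_days
    glacier_gb = gb_per_day * glacier_days
    return (standard_gb * S3_STANDARD_PER_GB_MONTH
            + glacier_gb * S3_GLACIER_PER_GB_MONTH)
```

With an assumed 1 GB/day of Parquet, the standard-tier portion plateaus at 90 GB (about $2.07/month), and each additional year of history adds about $1.46/month at the Glacier rate.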

Tier 3 — Aggregate table pg_dump: The 1-minute and 1-hour aggregate tables are small (a few GB per year for most deployments) and can be pg_dumped nightly. These are the tables dashboards query most — fast recovery of aggregates means dashboards recover quickly even if raw data recovery takes longer.
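
A nightly dump job for the aggregate tables might look like the sketch below. It only uses standard `pg_dump` options (`--dbname`, `--format`, `--file`, `--table`); the database name, table names, and output path in the usage note are hypothetical. It also assumes the aggregates are ordinary tables. If they are TimescaleDB continuous aggregates, check the TimescaleDB backup guidance before relying on per-table dumps.

```python
import subprocess
from datetime import date
from typing import List, Sequence


def pg_dump_command(dbname: str, tables: Sequence[str],
                    out_dir: str, day: date) -> List[str]:
    """Build a pg_dump invocation that writes one custom-format archive."""
    out_file = f"{out_dir}/aggregates-{day.isoformat()}.dump"
    cmd = ["pg_dump", f"--dbname={dbname}", "--format=custom", f"--file={out_file}"]
    for table in tables:
        cmd.append(f"--table={table}")  # one --table option per aggregate table
    return cmd


def run_nightly_dump(dbname: str, tables: Sequence[str], out_dir: str) -> None:
    """Run the dump for today; raises CalledProcessError if pg_dump fails."""
    subprocess.run(pg_dump_command(dbname, tables, out_dir, date.today()), check=True)
```

For example, `run_nightly_dump("iot", ["telemetry_1min", "telemetry_1hour"], "/backups")` could be scheduled from cron or a Kubernetes CronJob. The custom format is chosen here because `pg_restore` can then restore individual tables selectively.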

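Recovery drill: Conduct a monthly point-in-time recovery drill. Restore the last 24 hours of data to a test TimescaleDB instance, run SELECT COUNT(*) FROM telemetry over that 24-hour window on both the restored instance and production, and compare the counts. Use fixed start and end timestamps (ending at the recovery target time) rather than NOW() - INTERVAL '24 hours', because NOW() differs between the two runs and production keeps ingesting. Document the drill result. Evidence of regularly tested restores is expected for ISO 27001 and most regulated-industry audits.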
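
The comparison step of the drill can be scripted so the result is reproducible. The sketch below is one way to do it, under the assumption that both counts are taken over the same fixed window ending at the recovery target time; `drill_query` and `drill_passed` are hypothetical helper names, and executing the query is left to whatever database client is in use.

```python
from datetime import datetime, timedelta, timezone


def drill_query(window_end: datetime, hours: int = 24) -> str:
    """Build the count query with explicit bounds so both instances see the same window.

    The timestamps are generated from datetime objects, not user input.
    """
    window_start = window_end - timedelta(hours=hours)
    return (
        "SELECT COUNT(*) FROM telemetry "
        f"WHERE ts > '{window_start.isoformat()}' "
        f"AND ts <= '{window_end.isoformat()}'"
    )


def drill_passed(production_count: int, restored_count: int, tolerance: int = 0) -> bool:
    """True if the restored row count matches production within `tolerance` rows."""
    return abs(production_count - restored_count) <= tolerance
```

Run the same query string against production and the restored instance, then record both counts and the `drill_passed` result in the drill log. A non-zero tolerance may be needed if production accepts late-arriving rows with timestamps inside the window.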

22.4 Incident Severity Levels

| Severity | Definition | Response Time | Escalation Path | Communication |
|---|---|---|---|---|
| P0 | Broker down, >50% of fleet offline | Immediate (on-call paged) | CTO within 15 min | Customer status page updated within 5 min |
| P1 | Ingestion lag >5 min, data gap risk; single region degraded | 15 min | Engineering lead within 30 min | Status page updated; affected customers notified |
| P2 | Single site offline; OTA service unavailable; API latency elevated | 1 hr | On-call engineer | Status page note; no individual customer notification |
| P3 | Individual device offline; cert expiry warning; non-critical service degraded | Next business day | Ticket assigned | Internal only |
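
The severity definitions can be encoded so that alerting assigns a level consistently. This is a minimal sketch under stated assumptions: the `Signals` fields are hypothetical names for monitoring inputs, the P0 conditions are treated as alternatives (broker down or more than 50% of the fleet offline), and anything that matches no higher level falls through to P3.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Monitoring inputs; field names are illustrative, not from a real system."""
    broker_down: bool = False
    fleet_offline_fraction: float = 0.0  # 0.0 to 1.0
    ingestion_lag_minutes: float = 0.0
    region_degraded: bool = False
    site_offline: bool = False
    ota_unavailable: bool = False
    api_latency_elevated: bool = False


def classify(s: Signals) -> str:
    """Map monitoring signals to the severity levels in the table above."""
    if s.broker_down or s.fleet_offline_fraction > 0.5:
        return "P0"
    if s.ingestion_lag_minutes > 5 or s.region_degraded:
        return "P1"
    if s.site_offline or s.ota_unavailable or s.api_latency_elevated:
        return "P2"
    return "P3"  # individual device issues, cert warnings, non-critical degradation
```

Checking the most severe conditions first means an incident that matches several rows is always assigned the highest applicable level.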