## Edge Layer: Gateways & Local Processing
### 3.1 Gateway Architecture — What It Actually Does
```mermaid
graph TB
    subgraph OT["OT Network (Plant Floor)"]
        PLC[PLC / DCS<br/>OPC-UA Server]
        RTU[Field RTU<br/>Modbus RTU]
        SENS[Smart Sensors<br/>IO-Link / HART]
    end
    subgraph GW["Edge Gateway"]
        direction TB
        POLL[Protocol Drivers<br/>OPC-UA Client<br/>Modbus Master<br/>HART Multiplexer]
        NORM[Normalizer<br/>Tag → JSON<br/>Unit Conversion<br/>Quality Mapping]
        RULE[Edge Rules Engine<br/>Deadbanding<br/>Local Alarms<br/>Derived Tags]
        BUFF[Ring Buffer / SQLite<br/>Store & Forward<br/>72h capacity]
        PUB[MQTT Publisher<br/>TLS 1.3<br/>QoS 1/2]
        SUB[MQTT Subscriber<br/>C2D Handler<br/>Command Router]
        OTA[OTA Agent<br/>Download / Verify<br/>Apply / Rollback]
        HLTH[Health Reporter<br/>Self-monitoring<br/>Watchdog]
    end
    subgraph CLOUD["Cloud Platform"]
        BROKER[MQTT Broker<br/>Cluster]
        INGST[Ingestion Service]
        CMD[Command Service]
        OTASVC[OTA Service]
    end
    PLC -->|OPC-UA subscription| POLL
    RTU -->|Modbus RTU poll| POLL
    SENS -->|IO-Link / HART| POLL
    POLL --> NORM --> RULE --> BUFF --> PUB -->|TLS MQTT| BROKER
    BROKER --> INGST
    CMD -->|MQTT| BROKER --> SUB --> CMD
    OTASVC -->|MQTT| BROKER --> OTA
    HLTH -->|MQTT| PUB
```

### 3.2 Store & Forward — Production Implementation
This is the feature most teams skip and regret. A gateway without store-and-forward is not an industrial gateway. The architectural principle is simple: always write to local storage first, then forward. The gateway treats the outbox as the source of truth, not the MQTT connection, so connectivity coming and going is a background concern — the data pipeline never stalls or drops messages because of it. In regulated industries (pharma, utilities), the ability to reconstruct a complete time series across a connectivity outage is not optional — it is an audit requirement. Size your buffer for the worst-case outage your site has historically experienced, not the average.
Failure scenario without S&F:

- Factory loses internet for 3 hours.
- 1,000 sensors generate 10 readings/minute.
- = 1,800,000 readings lost (1,000 × 10 × 180 minutes).
- Process engineers cannot reconstruct what happened during the outage.
- Regulatory compliance failure if this is pharma or utilities.
Implementation using SQLite WAL mode:
Schema:

```sql
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    topic      TEXT NOT NULL,
    payload    BLOB NOT NULL,
    qos        INTEGER DEFAULT 1,
    created_at INTEGER NOT NULL,  -- Unix milliseconds
    attempts   INTEGER DEFAULT 0,
    sent_at    INTEGER            -- NULL until sent
);

CREATE INDEX idx_outbox_unsent ON outbox(sent_at) WHERE sent_at IS NULL;
```
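Note that WAL mode is not SQLite's default journal mode; it has to be enabled explicitly. A minimal, commonly used setup (run once at gateway startup) might be:

```sql
PRAGMA journal_mode = WAL;    -- write-ahead log: readers don't block the writer
PRAGMA synchronous = NORMAL;  -- safe with WAL against app crashes, far fewer fsyncs
```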
Write path (always write to the outbox first):

```sql
BEGIN IMMEDIATE;
INSERT INTO outbox (topic, payload, qos, created_at) VALUES (?, ?, ?, ?);
COMMIT;
```
Send path (background worker):

```sql
SELECT id, topic, payload FROM outbox
WHERE sent_at IS NULL
ORDER BY created_at ASC
LIMIT 100;  -- batch for efficiency
```

On MQTT publish ACK:

```sql
UPDATE outbox SET sent_at = ? WHERE id = ?;
```

On failure:

```sql
UPDATE outbox SET attempts = attempts + 1 WHERE id = ?;
-- Backoff: min(30s * 2^attempts, 3600s)
```
Retention policy (avoid filling the disk):

```sql
-- Keep sent rows for 24h
DELETE FROM outbox
WHERE sent_at IS NOT NULL
  AND sent_at < (unixepoch() - 86400) * 1000;

-- Drop unsent rows older than 72h
DELETE FROM outbox
WHERE sent_at IS NULL
  AND created_at < (unixepoch() - 259200) * 1000;
-- LOG this as a data loss event with the row count
```
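Glued together, the write and send paths above can be sketched in Python with the standard-library `sqlite3` module. This is an illustrative sketch, not a reference implementation: `publish` is a stand-in for your MQTT client's blocking QoS 1 publish, and the function names are ours; the SQL follows the schema above.

```python
import sqlite3
import time


def backoff_delay(attempts: int) -> float:
    """Retry delay per the policy above: min(30s * 2^attempts, 3600s)."""
    return min(30.0 * (2 ** attempts), 3600.0)


def enqueue(db: sqlite3.Connection, topic: str, payload: bytes, qos: int = 1) -> None:
    """Write path: the outbox is the source of truth, so this never touches the network."""
    db.execute(
        "INSERT INTO outbox (topic, payload, qos, created_at) VALUES (?, ?, ?, ?)",
        (topic, payload, qos, int(time.time() * 1000)),
    )
    db.commit()


def drain_outbox(db: sqlite3.Connection, publish) -> int:
    """Send path: forward one batch of unsent rows, oldest first.

    Returns the number of rows sent. On the first publish failure it records
    the attempt and stops; the caller sleeps backoff_delay(attempts) and retries.
    """
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox "
        "WHERE sent_at IS NULL ORDER BY created_at ASC LIMIT 100"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)  # must block until the broker ACKs (QoS 1)
        except Exception:
            db.execute("UPDATE outbox SET attempts = attempts + 1 WHERE id = ?",
                       (row_id,))
            break
        db.execute("UPDATE outbox SET sent_at = ? WHERE id = ?",
                   (int(time.time() * 1000), row_id))
        sent += 1
    db.commit()
    return sent
```

Marking rows sent only after the broker ACK means a crash mid-batch can produce duplicates, never losses — which is exactly the at-least-once semantics QoS 1 already implies.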
Buffer sizing:

```
required_bytes = data_rate_bytes_per_sec × outage_duration_sec × 1.3
```

e.g., 500 devices × 200 bytes/msg × 1 msg/s × 259,200 s × 1.3 ≈ 33.7 GB.

Use appropriate hardware: an industrial SSD, not an SD card.
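As a sanity check, the sizing arithmetic above can be scripted (the function name and parameter names are ours; the 1.3 overhead factor comes from the formula):

```python
def required_buffer_bytes(devices: int, bytes_per_msg: int,
                          msgs_per_sec: float, outage_sec: int,
                          overhead: float = 1.3) -> float:
    """Worst-case outage buffer: aggregate data rate × outage duration × overhead."""
    return devices * bytes_per_msg * msgs_per_sec * outage_sec * overhead


# 500 devices × 200 B/msg × 1 msg/s over a 72-hour (259,200 s) outage
size = required_buffer_bytes(500, 200, 1.0, 72 * 3600)
print(f"{size / 1e9:.1f} GB")  # ≈ 33.7 GB
```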
### 3.3 Edge Deadbanding — Reduce Cloud Traffic by 60-80%
Raw polling sends data every cycle regardless of change. Deadband filtering is essential at scale.
```python
class DeadbandFilter:
    """
    Only forward a value if it has changed by more than the deadband threshold
    or the max_interval has elapsed (ensures liveness even in stable processes).
    """

    def __init__(self, deadband_pct: float, max_interval_s: float = 60.0):
        self.deadband_pct = deadband_pct  # e.g., 0.5 = 0.5% of engineering range
        self.max_interval_s = max_interval_s
        self._last_sent: dict[str, tuple[float, float]] = {}  # tag -> (value, timestamp)

    def should_forward(self, tag: str, value: float, eng_range: float, now: float) -> bool:
        if tag not in self._last_sent:
            self._last_sent[tag] = (value, now)
            return True
        last_value, last_ts = self._last_sent[tag]
        deadband_abs = self.deadband_pct / 100.0 * eng_range
        value_changed = abs(value - last_value) >= deadband_abs
        interval_exceeded = (now - last_ts) >= self.max_interval_s
        if value_changed or interval_exceeded:
            self._last_sent[tag] = (value, now)
            return True
        return False


# Usage:
# filter = DeadbandFilter(deadband_pct=0.5, max_interval_s=60)
# if filter.should_forward("pump.temperature", 72.4, eng_range=200.0, now=time.time()):
#     publish_to_mqtt(...)
```
### 3.4 Platform Software Stack: Open Source vs. Cloud Managed
One of the most consequential early decisions in an IoT platform build is where to draw the line between self-managed open source and cloud-managed services. There is no universally correct answer — the right choice depends on your team's operational maturity, data sovereignty requirements, and scale. The table below reflects real-world tradeoffs, not marketing claims.
#### MQTT Brokers
The broker is the nervous system of your IoT platform. Choose carefully — migrating brokers is painful.
| Broker | Type | Strengths | Weaknesses | Scale | Best For |
|---|---|---|---|---|---|
| Eclipse Mosquitto | OSS, self-hosted | Lightweight, battle-tested, simple | No clustering (single node), limited auth plugins | ~100k connections | Dev/test, small deployments, edge broker |
| EMQX | OSS + Enterprise, self-hosted | Full clustering, MQTT 5.0, rule engine, rich plugins, Kubernetes-native | Enterprise features paid, complex ops at scale | 10M+ connections | Production at scale, Kubernetes-native stacks |
| HiveMQ | Enterprise, self-hosted / cloud | Enterprise-grade, excellent extensions, strong MQTT 5.0 | Expensive licensing | Millions of connections | Large enterprise, regulated industries |
| VerneMQ | OSS, self-hosted | Erlang/OTP clustering, strong consistency | Smaller community, harder to operate | ~1M connections | Telecom-grade reliability requirements |
| AWS IoT Core | Fully managed | Zero ops, deep AWS integration, scales infinitely | Vendor lock-in, per-message pricing adds up at scale, data stays in AWS | Unlimited | AWS-committed teams, variable workloads |
| Azure IoT Hub | Fully managed | Deep Azure integration, D2C/C2D built-in, DPS, excellent enterprise features | Lock-in, pricing at scale | Unlimited | Azure-committed, enterprise Microsoft shops |
| Google Cloud IoT Core | ⚠️ Deprecated Aug 2023 | — | Shut down — do not use | — | Migrate off |
| Solace PubSub+ | Enterprise | Multi-protocol (MQTT, AMQP, JMS, REST), guaranteed delivery | Very expensive | High | Financial services, mission-critical |
Recommendation for most greenfield industrial projects: Start with EMQX Community Edition (self-hosted, Kubernetes). If you are fully committed to AWS, use AWS IoT Core but budget for per-message costs at scale and plan your egress costs early.
#### Time-Series Databases
| Database | Type | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| TimescaleDB | OSS (PostgreSQL extension) | Full SQL, continuous aggregates, excellent compression, Postgres ecosystem | Requires Postgres ops expertise | General industrial IoT, complex queries |
| InfluxDB v3 (IOx) | OSS + Cloud | Purpose-built for time-series, line protocol, Flux/SQL, good UI | v2→v3 migration disruption, cloud pricing | Metrics-heavy, simpler data models |
| QuestDB | OSS | Extremely fast ingestion (1.6M rows/sec), SQL, low resource usage | Smaller community, fewer integrations | Ultra-high-frequency data |
| Apache IoTDB | OSS | Designed for IoT, hierarchical model, good compression | Newer ecosystem, less enterprise tooling | Large-scale industrial telemetry |
| AWS Timestream | Fully managed | Zero ops, scales automatically, integrates with QuickSight | Expensive at scale, limited SQL | AWS shops that want zero DB ops |
| Azure Data Explorer (ADX) | Fully managed | Extremely fast at petabyte scale, KQL powerful, good for analytics | Learning curve (KQL), cost at high write rates | Analytics-heavy, large Azure deployments |
| OSIsoft PI / AVEVA PI | Enterprise, licensed | Industry standard in process industries, PIMS ecosystem | Expensive, proprietary, historian-centric model | Brownfield process industries already using PI |
Recommendation: TimescaleDB for most production deployments — it gives you the full power of PostgreSQL (JOINs, window functions, foreign keys) while handling time-series scale. Use continuous aggregates to pre-compute roll-ups and avoid raw-data queries on dashboards.
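For illustration, an hourly roll-up as a continuous aggregate might look like the following. The `telemetry` hypertable and its `ts`/`device_id`/`value` columns are hypothetical; `time_bucket` and `add_continuous_aggregate_policy` are TimescaleDB's documented API.

```sql
-- Hourly roll-up, maintained incrementally by TimescaleDB
CREATE MATERIALIZED VIEW telemetry_1h
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 hour', ts) AS bucket,
       device_id,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM telemetry
GROUP BY bucket, device_id;

-- Refresh the trailing window every 30 minutes
SELECT add_continuous_aggregate_policy('telemetry_1h',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');
```

Dashboards then query `telemetry_1h` instead of the raw hypertable, which is what keeps raw-data scans off the hot path.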
#### Edge Runtimes & Frameworks
| Runtime | Type | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Node-RED | OSS | Rapid visual wiring, huge node library, quick to prototype | Not suitable for high-throughput, logic gets unwieldy at scale | Protocol bridging, low-volume, rapid PoC |
| Eclipse Kura | OSS (Java/OSGi) | Enterprise-grade plugin system, device management, remote config | Heavy Java footprint, slower to develop | Structured enterprise edge deployments |
| AWS IoT Greengrass v2 | Managed (OSS core) | Managed OTA, Lambda + Docker components, cloud-synced | AWS lock-in, complex setup, resource-heavy | AWS-committed, managed fleet OTA critical |
| Azure IoT Edge | Managed (OSS core) | Module marketplace, managed OTA, tight Azure integration | Azure lock-in, Docker required (heavy for small devices) | Azure-committed, containerized workloads |
| EdgeX Foundry | OSS | Microservice architecture, vendor-neutral, device service abstraction | Complex to deploy, many moving parts | Flexible multi-vendor edge architectures |
| Custom Go/Rust daemon | Custom | Maximum performance, minimal footprint, full control | Development time, maintenance burden | High-throughput production with specific requirements |
Recommendation: For production industrial gateways, a custom Go service (or Go + Node-RED for protocol bridging) typically outperforms framework-heavy options. Use AWS Greengrass or Azure IoT Edge if managed OTA and cloud integration justify the operational overhead. Avoid Node-RED in the critical path for production data flows above ~1k msg/s.
#### Schema Registries
| Tool | Type | Protocol Support | Best For |
|---|---|---|---|
| Confluent Schema Registry | OSS + Enterprise | Avro, JSON Schema, Protobuf | Kafka-centric pipelines, production standard |
| AWS Glue Schema Registry | Fully managed | Avro, JSON Schema, Protobuf | AWS Kafka (MSK) pipelines |
| Apicurio Registry | OSS | Avro, JSON Schema, Protobuf, OpenAPI | Self-hosted, multi-protocol |
| Git + JSON Schema files | DIY | JSON Schema | Small teams, simple schemas, full control |