## Edge Layer: Gateways & Local Processing
### 3.1 Gateway Architecture — What It Actually Does
```mermaid
graph TB
    subgraph OT["OT Network (Plant Floor)"]
        PLC[PLC / DCS<br/>OPC-UA Server]
        RTU[Field RTU<br/>Modbus RTU]
        SENS[Smart Sensors<br/>IO-Link / HART]
    end
    subgraph GW["Edge Gateway"]
        direction TB
        POLL[Protocol Drivers<br/>OPC-UA Client<br/>Modbus Master<br/>HART Multiplexer]
        NORM[Normalizer<br/>Tag → JSON<br/>Unit Conversion<br/>Quality Mapping]
        RULE[Edge Rules Engine<br/>Deadbanding<br/>Local Alarms<br/>Derived Tags]
        BUFF[Ring Buffer / SQLite<br/>Store & Forward<br/>72h capacity]
        PUB[MQTT Publisher<br/>TLS 1.3<br/>QoS 1/2]
        SUB[MQTT Subscriber<br/>C2D Handler<br/>Command Router]
        OTA[OTA Agent<br/>Download / Verify<br/>Apply / Rollback]
        HLTH[Health Reporter<br/>Self-monitoring<br/>Watchdog]
    end
    subgraph CLOUD["Cloud Platform"]
        BROKER[MQTT Broker<br/>Cluster]
        INGST[Ingestion Service]
        CMD[Command Service]
        OTASVC[OTA Service]
    end
    PLC -->|OPC-UA subscription| POLL
    RTU -->|Modbus RTU poll| POLL
    SENS -->|IO-Link / HART| POLL
    POLL --> NORM --> RULE --> BUFF --> PUB -->|TLS MQTT| BROKER
    BROKER --> INGST
    CMD -->|MQTT| BROKER --> SUB --> CMD
    OTASVC -->|MQTT| BROKER --> OTA
    HLTH -->|MQTT| PUB
```

### 3.2 Store & Forward — Production Implementation
This is the feature most teams skip and regret. A gateway without store-and-forward is not an industrial gateway. The architectural principle is simple: always write to local storage first, then forward. The gateway treats the outbox as the source of truth, not the MQTT connection, so connectivity coming and going is a background concern — the data pipeline never stalls or drops messages because of it. In regulated industries (pharma, utilities), the ability to reconstruct a complete time series across a connectivity outage is not optional — it is an audit requirement. Size your buffer for the worst-case outage your site has historically experienced, not the average.
Failure scenario without S&F:

- Factory loses internet for 3 hours.
- 1,000 sensors generate 10 readings/minute.
- = 1,800,000 readings lost (1,000 × 10 × 180 minutes).
- Process engineers cannot reconstruct what happened during the outage.
- Regulatory compliance failure if this is pharma or utilities.
Implementation using SQLite WAL mode:
Schema:

```sql
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    topic      TEXT NOT NULL,
    payload    BLOB NOT NULL,
    qos        INTEGER DEFAULT 1,
    created_at INTEGER NOT NULL,  -- Unix milliseconds
    attempts   INTEGER DEFAULT 0,
    sent_at    INTEGER            -- NULL until sent
);

CREATE INDEX idx_outbox_unsent ON outbox(sent_at) WHERE sent_at IS NULL;
```
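Note that WAL mode is not SQLite's default journal mode; it has to be enabled explicitly. A minimal, commonly used setup (run once at gateway startup) might be:

```sql
PRAGMA journal_mode = WAL;    -- write-ahead log: readers don't block the writer
PRAGMA synchronous = NORMAL;  -- safe with WAL against app crashes, far fewer fsyncs
```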
Write path (always write to the outbox first):

```sql
BEGIN IMMEDIATE;
INSERT INTO outbox (topic, payload, qos, created_at) VALUES (?, ?, ?, ?);
COMMIT;
```
Send path (background worker):

```sql
SELECT id, topic, payload FROM outbox
WHERE sent_at IS NULL
ORDER BY created_at ASC
LIMIT 100;  -- batch for efficiency
```

On MQTT publish ACK:

```sql
UPDATE outbox SET sent_at = ? WHERE id = ?;
```

On failure:

```sql
UPDATE outbox SET attempts = attempts + 1 WHERE id = ?;
-- Backoff: min(30s * 2^attempts, 3600s)
```
Retention policy (avoid filling the disk):

```sql
-- Keep sent rows for 24h
DELETE FROM outbox
WHERE sent_at IS NOT NULL
  AND sent_at < (unixepoch() - 86400) * 1000;

-- Drop unsent rows older than 72h
DELETE FROM outbox
WHERE sent_at IS NULL
  AND created_at < (unixepoch() - 259200) * 1000;
-- LOG this as a data loss event with the row count
```
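Glued together, the write and send paths above can be sketched in Python with the standard-library `sqlite3` module. This is an illustrative sketch, not a reference implementation: `publish` is a stand-in for your MQTT client's blocking QoS 1 publish, and the function names are ours; the SQL follows the schema above.

```python
import sqlite3
import time


def backoff_delay(attempts: int) -> float:
    """Retry delay per the policy above: min(30s * 2^attempts, 3600s)."""
    return min(30.0 * (2 ** attempts), 3600.0)


def enqueue(db: sqlite3.Connection, topic: str, payload: bytes, qos: int = 1) -> None:
    """Write path: the outbox is the source of truth, so this never touches the network."""
    db.execute(
        "INSERT INTO outbox (topic, payload, qos, created_at) VALUES (?, ?, ?, ?)",
        (topic, payload, qos, int(time.time() * 1000)),
    )
    db.commit()


def drain_outbox(db: sqlite3.Connection, publish) -> int:
    """Send path: forward one batch of unsent rows, oldest first.

    Returns the number of rows sent. On the first publish failure it records
    the attempt and stops; the caller sleeps backoff_delay(attempts) and retries.
    """
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox "
        "WHERE sent_at IS NULL ORDER BY created_at ASC LIMIT 100"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)  # must block until the broker ACKs (QoS 1)
        except Exception:
            db.execute("UPDATE outbox SET attempts = attempts + 1 WHERE id = ?",
                       (row_id,))
            break
        db.execute("UPDATE outbox SET sent_at = ? WHERE id = ?",
                   (int(time.time() * 1000), row_id))
        sent += 1
    db.commit()
    return sent
```

Marking rows sent only after the broker ACK means a crash mid-batch can produce duplicates, never losses — which is exactly the at-least-once semantics QoS 1 already implies.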
Buffer sizing:

```
required_bytes = data_rate_bytes_per_sec × outage_duration_sec × 1.3
```

e.g., 500 devices × 200 bytes/msg × 1 msg/s × 259,200 s × 1.3 ≈ 33.7 GB.

Use appropriate hardware: an industrial SSD, not an SD card.
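As a sanity check, the sizing arithmetic above can be scripted (the function name and parameter names are ours; the 1.3 overhead factor comes from the formula):

```python
def required_buffer_bytes(devices: int, bytes_per_msg: int,
                          msgs_per_sec: float, outage_sec: int,
                          overhead: float = 1.3) -> float:
    """Worst-case outage buffer: aggregate data rate × outage duration × overhead."""
    return devices * bytes_per_msg * msgs_per_sec * outage_sec * overhead


# 500 devices × 200 B/msg × 1 msg/s over a 72-hour (259,200 s) outage
size = required_buffer_bytes(500, 200, 1.0, 72 * 3600)
print(f"{size / 1e9:.1f} GB")  # ≈ 33.7 GB
```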
### 3.3 Edge Deadbanding — Reduce Cloud Traffic by 60-80%
Raw polling sends data every cycle regardless of change. Deadband filtering is essential at scale.
```python
class DeadbandFilter:
    """
    Only forward a value if it has changed by more than the deadband threshold
    or the max_interval has elapsed (ensures liveness even in stable processes).
    """

    def __init__(self, deadband_pct: float, max_interval_s: float = 60.0):
        self.deadband_pct = deadband_pct  # e.g., 0.5 = 0.5% of engineering range
        self.max_interval_s = max_interval_s
        self._last_sent: dict[str, tuple[float, float]] = {}  # tag -> (value, timestamp)

    def should_forward(self, tag: str, value: float, eng_range: float, now: float) -> bool:
        if tag not in self._last_sent:
            self._last_sent[tag] = (value, now)
            return True
        last_value, last_ts = self._last_sent[tag]
        deadband_abs = self.deadband_pct / 100.0 * eng_range
        value_changed = abs(value - last_value) >= deadband_abs
        interval_exceeded = (now - last_ts) >= self.max_interval_s
        if value_changed or interval_exceeded:
            self._last_sent[tag] = (value, now)
            return True
        return False


# Usage:
# filter = DeadbandFilter(deadband_pct=0.5, max_interval_s=60)
# if filter.should_forward("pump.temperature", 72.4, eng_range=200.0, now=time.time()):
#     publish_to_mqtt(...)
```
### 3.4 Platform Software Stack: Open Source vs. Cloud Managed
One of the most consequential early decisions in an IoT platform build is where to draw the line between self-managed open source and cloud-managed services. There is no universally correct answer — the right choice depends on your team's operational maturity, data sovereignty requirements, and scale. The table below reflects real-world tradeoffs, not marketing claims.
#### MQTT Brokers
The broker is the nervous system of your IoT platform. Choose carefully — migrating brokers is painful.
| Broker | Type | Strengths | Weaknesses | Scale | Best For |
|---|---|---|---|---|---|
| Eclipse Mosquitto | OSS, self-hosted | Lightweight, battle-tested, simple | No clustering (single node), limited auth plugins | ~100k connections | Dev/test, small deployments, edge broker |
| EMQX | OSS + Enterprise, self-hosted | Full clustering, MQTT 5.0, rule engine, rich plugins, Kubernetes-native | Enterprise features paid, complex ops at scale | 10M+ connections | Production at scale, Kubernetes-native stacks |
| HiveMQ | Enterprise, self-hosted / cloud | Enterprise-grade, excellent extensions, strong MQTT 5.0 | Expensive licensing | Millions of connections | Large enterprise, regulated industries |
| VerneMQ | OSS, self-hosted | Erlang/OTP clustering, strong consistency | Smaller community, harder to operate | ~1M connections | Telecom-grade reliability requirements |
| AWS IoT Core | Fully managed | Zero ops, deep AWS integration, scales infinitely | Vendor lock-in, per-message pricing adds up at scale, data stays in AWS | Unlimited | AWS-committed teams, variable workloads |
| Azure IoT Hub | Fully managed | Deep Azure integration, D2C/C2D built-in, DPS, excellent enterprise features | Lock-in, pricing at scale | Unlimited | Azure-committed, enterprise Microsoft shops |
| Google Cloud IoT Core | ⚠️ Deprecated Aug 2023 | — | Shut down — do not use | — | Migrate off |
| Solace PubSub+ | Enterprise | Multi-protocol (MQTT, AMQP, JMS, REST), guaranteed delivery | Very expensive | High | Financial services, mission-critical |
Recommendation for most greenfield industrial projects: Start with EMQX Community Edition (self-hosted, Kubernetes). If you are fully committed to AWS, use AWS IoT Core but budget for per-message costs at scale and plan your egress costs early.
#### Time-Series Databases
| Database | Type | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| TimescaleDB | OSS (PostgreSQL extension) | Full SQL, continuous aggregates, excellent compression, Postgres ecosystem | Requires Postgres ops expertise | General industrial IoT, complex queries |
| InfluxDB v3 (IOx) | OSS + Cloud | Purpose-built for time-series, line protocol, Flux/SQL, good UI | v2→v3 migration disruption, cloud pricing | Metrics-heavy, simpler data models |
| QuestDB | OSS | Extremely fast ingestion (1.6M rows/sec), SQL, low resource usage | Smaller community, fewer integrations | Ultra-high-frequency data |
| Apache IoTDB | OSS | Designed for IoT, hierarchical model, good compression | Newer ecosystem, less enterprise tooling | Large-scale industrial telemetry |
| AWS Timestream | Fully managed | Zero ops, scales automatically, integrates with QuickSight | Expensive at scale, limited SQL | AWS shops that want zero DB ops |
| Azure Data Explorer (ADX) | Fully managed | Extremely fast at petabyte scale, KQL powerful, good for analytics | Learning curve (KQL), cost at high write rates | Analytics-heavy, large Azure deployments |
| OSIsoft PI / AVEVA PI | Enterprise, licensed | Industry standard in process industries, PIMS ecosystem | Expensive, proprietary, historian-centric model | Brownfield process industries already using PI |
Recommendation: TimescaleDB for most production deployments — it gives you the full power of PostgreSQL (JOINs, window functions, foreign keys) while handling time-series scale. Use continuous aggregates to pre-compute roll-ups and avoid raw-data queries on dashboards.
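For illustration, an hourly roll-up as a continuous aggregate might look like the following. The `telemetry` hypertable and its `ts`/`device_id`/`value` columns are hypothetical; `time_bucket` and `add_continuous_aggregate_policy` are TimescaleDB's documented API.

```sql
-- Hourly roll-up, maintained incrementally by TimescaleDB
CREATE MATERIALIZED VIEW telemetry_1h
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '1 hour', ts) AS bucket,
       device_id,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM telemetry
GROUP BY bucket, device_id;

-- Refresh the trailing window every 30 minutes
SELECT add_continuous_aggregate_policy('telemetry_1h',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');
```

Dashboards then query `telemetry_1h` instead of the raw hypertable, which is what keeps raw-data scans off the hot path.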
#### Edge Runtimes & Frameworks
| Runtime | Type | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Node-RED | OSS | Rapid visual wiring, huge node library, quick to prototype | Not suitable for high-throughput, logic gets unwieldy at scale | Protocol bridging, low-volume, rapid PoC |
| Eclipse Kura | OSS (Java/OSGi) | Enterprise-grade plugin system, device management, remote config | Heavy Java footprint, slower to develop | Structured enterprise edge deployments |
| AWS IoT Greengrass v2 | Managed (OSS core) | Managed OTA, Lambda + Docker components, cloud-synced | AWS lock-in, complex setup, resource-heavy | AWS-committed, managed fleet OTA critical |
| Azure IoT Edge | Managed (OSS core) | Module marketplace, managed OTA, tight Azure integration | Azure lock-in, Docker required (heavy for small devices) | Azure-committed, containerized workloads |
| EdgeX Foundry | OSS | Microservice architecture, vendor-neutral, device service abstraction | Complex to deploy, many moving parts | Flexible multi-vendor edge architectures |
| Custom Go/Rust daemon | Custom | Maximum performance, minimal footprint, full control | Development time, maintenance burden | High-throughput production with specific requirements |
Recommendation: For production industrial gateways, a custom Go service (or Go + Node-RED for protocol bridging) typically outperforms framework-heavy options. Use AWS Greengrass or Azure IoT Edge if managed OTA and cloud integration justify the operational overhead. Avoid Node-RED in the critical path for production data flows above ~1k msg/s.
#### Schema Registries
| Tool | Type | Protocol Support | Best For |
|---|---|---|---|
| Confluent Schema Registry | OSS + Enterprise | Avro, JSON Schema, Protobuf | Kafka-centric pipelines, production standard |
| AWS Glue Schema Registry | Fully managed | Avro, JSON Schema, Protobuf | AWS Kafka (MSK) pipelines |
| Apicurio Registry | OSS | Avro, JSON Schema, Protobuf, OpenAPI | Self-hosted, multi-protocol |
| Git + JSON Schema files | DIY | JSON Schema | Small teams, simple schemas, full control |