Communication Protocols: Deep Dive

4.1 Protocol Selection Decision Tree

flowchart TD
    START([New Device Integration]) --> Q1{Real-time control<br/>required?}
    Q1 -->|Yes < 10ms| RT[EtherCAT / PROFINET IRT<br/>Sercos III]
    Q1 -->|No| Q2{Existing fieldbus<br/>on device?}
    Q2 -->|Modbus RTU/TCP| MODBUS[Modbus Driver<br/>on Gateway]
    Q2 -->|OPC-UA| OPCUA[Direct OPC-UA<br/>Subscription]
    Q2 -->|PROFIBUS/DP| PB[PROFIBUS DP Master<br/>or Siemens S7 Gateway]
    Q2 -->|None / new device| Q3{Power constrained?}
    Q3 -->|Yes battery/solar| Q4{Coverage needed?}
    Q4 -->|Wide area km+| LORA[LoRaWAN / NB-IoT]
    Q4 -->|Local 100m| ZIGBEE[Zigbee / WirelessHART]
    Q3 -->|No mains powered| Q5{High bandwidth<br/>video / waveform?}
    Q5 -->|Yes| WIFI[Wi-Fi 802.11ac/ax]
    Q5 -->|No telemetry only| Q6{Remote asset<br/>no LAN?}
    Q6 -->|Yes| LTE[LTE-M / NB-IoT<br/>Satellite fallback]
    Q6 -->|No on plant network| MQTT[MQTT over Ethernet<br/>or Wi-Fi]
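
The flowchart above can also be read as a plain function, which is handy for linting device onboarding requests. This is only a sketch: the `Device` fields and the `select_protocol` name are hypothetical, and the returned strings simply mirror the leaf labels in the diagram.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Answers to the flowchart questions (field names are illustrative)."""
    realtime_under_10ms: bool = False
    existing_fieldbus: Optional[str] = None  # "modbus", "opcua", "profibus" or None
    power_constrained: bool = False          # battery / solar
    wide_area: bool = False                  # km+ coverage rather than ~100 m
    high_bandwidth: bool = False             # video / waveform
    remote_no_lan: bool = False


def select_protocol(d: Device) -> str:
    """Walk the decision tree top to bottom and return the leaf label."""
    if d.realtime_under_10ms:
        return "EtherCAT / PROFINET IRT / Sercos III"
    fieldbus_leaves = {
        "modbus": "Modbus driver on gateway",
        "opcua": "Direct OPC-UA subscription",
        "profibus": "PROFIBUS DP master or Siemens S7 gateway",
    }
    if d.existing_fieldbus is not None:
        return fieldbus_leaves[d.existing_fieldbus]
    if d.power_constrained:
        return "LoRaWAN / NB-IoT" if d.wide_area else "Zigbee / WirelessHART"
    if d.high_bandwidth:
        return "Wi-Fi 802.11ac/ax"
    if d.remote_no_lan:
        return "LTE-M / NB-IoT (satellite fallback)"
    return "MQTT over Ethernet or Wi-Fi"


print(select_protocol(Device(power_constrained=True, wide_area=True)))
```

The order of the `if` statements matters: the real-time question is asked first, exactly as in the diagram, so a hard real-time device never falls through to a telemetry transport.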

4.2 MQTT — Everything You Need to Know for Production

Connection Lifecycle

sequenceDiagram
    participant D as Device
    participant B as MQTT Broker
    participant C as Cloud Backend

    Note over D,B: Connection with LWT configured
    D->>B: CONNECT client_id:P-007 clean_session:false keepalive:60s will_topic:.../status will_payload:{online:false} will_retain:true will_qos:1
    B->>D: CONNACK (session_present: true/false)

    Note over D,B: session_present = true means broker restored stored session state (subscriptions and any queued QoS1/2 messages)
    D->>B: SUBSCRIBE .../config/delta QoS1 and .../commands/# QoS1
    B->>D: SUBACK

    D->>B: PUBLISH (retained=true) .../status {online:true}

    loop Telemetry
        D->>B: PUBLISH QoS1 .../telemetry {temp:72.4,...}
        B->>D: PUBACK
    end

    loop Keepalive (when idle for 60s)
        D->>B: PINGREQ
        B->>D: PINGRESP
    end

    Note over D,B: Unexpected disconnect - broker publishes LWT
    B->>C: PUBLISH (retained) .../status {online:false}

    Note over D,B: Reconnect after outage
    D->>B: CONNECT (clean_session: false)
    B->>D: CONNACK (session_present: true)
    B->>D: Queued messages replayed (QoS1/2)

QoS Decision Guide — With Real Consequences

QoS selection has real operational consequences that compound at scale. A wrong choice is not a configuration detail: it is either a data loss risk (QoS too low) or a performance bottleneck (QoS too high). The guide below maps each level to specific IoT use cases with concrete failure scenarios. Note that QoS 2 is often misunderstood: it provides exactly-once delivery per MQTT hop (publisher to broker, then broker to each subscriber), but your consumer must still be idempotent because the guarantee ends when the message is handed to your application. A crash mid-processing or a replay from a downstream queue can still produce duplicates.

QoS 0 — Fire and Forget:
  Use for:
    → High-frequency raw sensor data (1Hz+) where loss is acceptable
    → Metrics where the next reading makes a lost one irrelevant
  Do NOT use for:
    → Commands, configuration, alarms, anything stateful
  Real consequence of wrong choice:
    → QoS 0 over an unreliable LTE link can lose 5-10% of readings.
       For a billing meter, that's revenue loss.

QoS 1 — At Least Once:
  Use for:
    → Most telemetry that matters
    → Alarms, events, command responses
    → Config acknowledgements
  Gotcha: duplicates ARE delivered. Your consumer must be idempotent.
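    → Use an application-level message_id (carried in the payload) + device_id
       as deduplication key. The MQTT packet identifier is reused and is not suitable.
    → Store in Redis SET with TTL, check before processing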

QoS 2 — Exactly Once:
  Use for:
    → Billing / metering data
    → Audit trail records
    → Financial or regulatory compliance telemetry
  Cost: 4-packet handshake per message (PUBLISH, PUBREC, PUBREL, PUBCOMP), i.e. 2 network round-trips
  At 10,000 msg/s, the overhead is significant — test before committing
  Broker must support QoS 2 fully (not all do — verify your broker)
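
The QoS 1 deduplication gotcha above can be sketched as follows. This is a minimal illustration, not a production component: `Deduplicator` is a hypothetical in-memory stand-in for the Redis SET with TTL, and it assumes each payload carries its own `message_id` field generated by the device.

```python
import time
from typing import Callable, Dict


class Deduplicator:
    """In-memory stand-in for a Redis SET with TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}  # dedup key -> expiry time

    def first_time(self, device_id: str, message_id: str) -> bool:
        """True only the first time a (device_id, message_id) pair is seen within the TTL."""
        now = self._clock()
        # Drop expired keys so the table does not grow without bound.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = f"{device_id}:{message_id}"
        if key in self._seen:
            return False
        self._seen[key] = now + self._ttl
        return True


def handle(dedup: Deduplicator, msg: dict, sink: list) -> None:
    """Process a telemetry message once, even if QoS 1 redelivers it."""
    if dedup.first_time(msg["device_id"], msg["message_id"]):
        sink.append(msg["temp"])


fake_now = [0.0]  # injectable clock so the example is deterministic
dedup = Deduplicator(ttl_seconds=300, clock=lambda: fake_now[0])
readings: list = []
msg = {"device_id": "P-007", "message_id": "a1b2", "temp": 72.4}
handle(dedup, msg, readings)
handle(dedup, msg, readings)  # duplicate delivery is ignored
print(readings)
```

The TTL only needs to cover the longest plausible redelivery window (for example, the maximum expected outage plus broker retry time); a longer TTL costs memory without adding safety.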

Topic Design — The ISA-95 Aligned Standard

Structure: {enterprise}/{site}/{area}/{line}/{device_type}/{device_id}/{data_class}/{tag}

Production example (discrete manufacturing):
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/telemetry
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/telemetry/temperature
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/status
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/commands/{cmd_id}
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/commands/{cmd_id}/ack
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/config/desired
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/config/reported
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/ota/notification
  acme-corp/plant-detroit/body-shop/line-3/pump/P-007/ota/status

Wildcard subscriptions for operations:
  acme-corp/plant-detroit/+/+/+/+/status         → All device status, one plant
  acme-corp/#                                      → Everything (use only for debug)
  acme-corp/+/body-shop/line-3/pump/+/telemetry   → All pump telemetry, line 3

Rules that production has taught:
  1. Never use a broad # (e.g. acme-corp/#) in production consumer subscriptions — scope is too wide; a device-scoped .../commands/# is fine
  2. Device_id in topic must match MQTT client_id and TLS cert CN
  3. No spaces, no special chars in topic segments (use hyphens)
  4. Keep depth ≤ 9 levels (the deepest topic above, .../commands/{cmd_id}/ack, is 9) — deeper is hard to manage with wildcards
  5. Always include device_type in hierarchy — allows type-based fanout
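
A small helper makes the structure and rule 3 enforceable in code rather than by convention. The sketch below is an assumption-laden illustration: `build_topic` and `topic_matches` are hypothetical names, and the matcher implements only basic `+` / trailing `#` semantics (it ignores `$`-prefixed system topics).

```python
import re

LEVELS = ("enterprise", "site", "area", "line", "device_type", "device_id")
SEGMENT = re.compile(r"^[A-Za-z0-9_-]+$")  # rule 3: no spaces or special characters


def build_topic(data_class: str, *suffix: str, **levels: str) -> str:
    """Assemble a topic from the six hierarchy levels, a data class and an optional suffix."""
    parts = [levels[name] for name in LEVELS] + [data_class, *suffix]
    for part in parts:
        if not SEGMENT.match(part):
            raise ValueError(f"invalid topic segment: {part!r}")
    return "/".join(parts)


def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT wildcard matching: '+' matches one level, a trailing '#' matches the rest."""
    f_parts, t_parts = topic_filter.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True
        if i >= len(t_parts) or (f != "+" and f != t_parts[i]):
            return False
    return len(f_parts) == len(t_parts)


pump = dict(enterprise="acme-corp", site="plant-detroit", area="body-shop",
            line="line-3", device_type="pump", device_id="P-007")
status = build_topic("status", **pump)
print(status)
print(topic_matches("acme-corp/plant-detroit/+/+/+/+/status", status))
```

Note that `acme-corp/+/body-shop/line-3/pump/+/telemetry` matches only the aggregate `telemetry` topic, not `telemetry/temperature`, because `+` never spans more than one level.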

4.3 OPC-UA — Production Integration Patterns

OPC-UA is not just "better Modbus." It is a full information modeling framework. Most teams use 5% of it.

sequenceDiagram
    participant GW as Edge Gateway (OPC-UA Client)
    participant PLC as Siemens S7-1500 (OPC-UA Server)

    Note over GW,PLC: 1. Session establishment
    GW->>PLC: OpenSecureChannel [Basic256Sha256, SignAndEncrypt]
    PLC->>GW: SecureChannelId + token
    GW->>PLC: CreateSession [sessionName, clientCert]
    PLC->>GW: SessionId + serverNonce
    GW->>PLC: ActivateSession [userIdentityToken + signature]
    PLC->>GW: ActivateSession OK

    Note over GW,PLC: 2. Browse address space (once on connect)
    GW->>PLC: Browse [ns=0;i=85 Objects folder]
    PLC->>GW: BrowseResult [PLC_Program, DeviceSet ...]
    GW->>PLC: TranslateBrowsePathsToNodeIds [Pump_007.Temperature_PV]
    PLC->>GW: NodeId ns=3;i=1042

    Note over GW,PLC: 3. Create subscription (push, not poll)
    GW->>PLC: CreateSubscription [publishingInterval=1000ms, lifetime=10]
    PLC->>GW: SubscriptionId=42
    GW->>PLC: CreateMonitoredItems [nodeId=ns=3;i=1042, deadband=0.5%]
    PLC->>GW: MonitoredItemId=1

    Note over GW,PLC: 4. Data change notifications
    loop On value change exceeding deadband
        PLC->>GW: PublishResponse [value=72.4, quality=Good, ts=2026-03-19T14:22:00Z]
        GW->>PLC: PublishRequest [ack seqNo]
    end

    Note over GW,PLC: 5. Keep-alive when no changes
    PLC->>GW: PublishResponse [keepAlive, no data]

The OPC-UA subscription model is far more efficient than polling — the server only sends data when values change (or when the keep-alive fires). However, the subscription parameters below must be tuned to your network and process dynamics. Default values from client libraries are rarely correct for production: default publishingInterval is often 500ms (too fast for slow process variables, too slow for high-speed machinery), and default queueSize of 1 causes data loss on slow or intermittent connections. Treat these as per-tag configuration, not system-wide defaults.

Critical OPC-UA configuration parameters that matter in production:

Subscription tuning (requested by the client; the server may revise the values):
  publishingInterval:    1000ms   # How often server sends notifications
  samplingInterval:       500ms   # How often server samples the value
                                  # samplingInterval ≤ publishingInterval
  queueSize:               10     # Samples buffered per monitored item between publishes
                                  # For alarms: set higher (50-100)
  lifetimeCount:           10     # Publish cycles without a client PublishRequest
                                  # before the subscription is deleted
                                  # = 10 × 1000ms = 10s
  maxKeepAliveCount:        3     # Empty publish cycles before the server sends a keep-alive
                                  # = 3 × 1000ms = 3s; lifetimeCount must be ≥ 3 × this value
  maxNotificationsPerPublish: 0   # 0 = no limit (careful on low-bandwidth links)

Deadband on OPC-UA side (saves bandwidth from PLC → gateway):
  AbsoluteDeadband:  tag value changes reported only if |new-old| > threshold
  PercentDeadband:   as % of engineering range (EURange attribute must be set)

  Set EURange on PLC tags:
    TIA Portal → Tag properties → OPC UA → Engineering Unit Range
    min: 0.0, max: 200.0 (for a 0-200°C sensor)
    Then: PercentDeadband = 0.5 = only report if temperature changes > 1.0°C

Session keepalive:
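  requestedSessionTimeout: 30000ms  # Session expires if the server hears nothing from the client for > 30s
  On reconnect: re-establish session (subscriptions are lost unless transferred before their lifetime expires)
  Pattern: client keeps its subscription and monitored-item definitions, re-creates them on reconnect (IDs will change)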
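
The arithmetic behind these parameters is easy to get wrong, so here is a minimal sketch that derives the effective values from the example numbers above. The function and class names are hypothetical and no OPC-UA library is involved; the only protocol rule encoded is that the lifetime count must be at least three times the keep-alive count.

```python
from dataclasses import dataclass


def percent_deadband_threshold(percent: float, eu_low: float, eu_high: float) -> float:
    """Absolute change that a PercentDeadband corresponds to for a given EURange."""
    return percent / 100.0 * (eu_high - eu_low)


def passes_deadband(last_reported: float, new_value: float, threshold: float) -> bool:
    """A value is reported only if it moved more than the threshold since the last report."""
    return abs(new_value - last_reported) > threshold


@dataclass
class SubscriptionTiming:
    publishing_interval_ms: int = 1000
    lifetime_count: int = 10
    max_keepalive_count: int = 3

    def keepalive_after_ms(self) -> int:
        """Quiet time before the server sends an empty keep-alive notification."""
        return self.max_keepalive_count * self.publishing_interval_ms

    def expires_after_ms(self) -> int:
        """Time without a client PublishRequest before the subscription is deleted."""
        return self.lifetime_count * self.publishing_interval_ms

    def is_valid(self) -> bool:
        """The lifetime count must be at least three times the keep-alive count."""
        return self.lifetime_count >= 3 * self.max_keepalive_count


threshold = percent_deadband_threshold(0.5, 0.0, 200.0)  # 0-200 °C sensor
timing = SubscriptionTiming()
print(threshold, timing.keepalive_after_ms(), timing.expires_after_ms())
```

With these example values a subscription is dropped after 10 s of client silence, which is shorter than the 30 s session timeout, so a gateway that stalls for 15 s keeps its session but must re-create its subscriptions.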

4.4 Modbus — The Inescapable Legacy Protocol

You will encounter Modbus on nearly every industrial project. Master it. Modbus was published by Modicon in 1979 and has no built-in authentication or encryption, and no error handling beyond a CRC-16 check on serial (RTU) frames. Despite this, it remains one of the most widely deployed industrial protocols in the world because it is simple, deterministic, and runs on cheap hardware. In practice, Modbus integration failures are almost never caused by the protocol itself. They are caused by byte-order mismatches (big-endian vs word-swapped), incorrect register address offsets (0-based vs 1-based), and polling too fast for the RS-485 bus capacity. Get the register map from the vendor, read it carefully, and verify with a Modbus tester before writing production code.

Register map reading — step by step:

Step 1: Get the device register map (from vendor documentation)
  Example: Siemens SITRANS P DS III pressure transmitter
    Register 1 (40001): Measured value — FLOAT32, AB CD byte order
    Register 3 (40003): Status word — UINT16
    Register 5 (40005): Range upper — FLOAT32, AB CD
    Register 7 (40007): Range lower — FLOAT32, AB CD

Step 2: Read the registers
  FC=03 (Read Holding Registers)
  Start address: 0 (register 40001 = address 0 in protocol)
  Count: 8 (addresses 0-7: three FLOAT32 values, one UINT16 status word, and the unlisted register 40004)

Step 3: Decode (byte order trap — check your device!)
  Raw bytes returned: 42 90 00 00 (big-endian FLOAT32)
  = 0x42900000 = 72.0 in IEEE 754

  If device uses "word-swapped" format (some older devices):
    Raw: 00 00 42 90 → swap words → 42 90 00 00 = 72.0

Step 4: Apply scaling (some devices return raw counts, not engineering units)
  raw_value = 26214  (16-bit count)
  scaled = raw_value / 65535.0 * (range_upper - range_lower) + range_lower
  = 26214 / 65535.0 * (100 - 0) + 0 = 40.0 bar
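
Steps 3 and 4 can be sketched with the standard `struct` module. The helper names are hypothetical, and the sketch assumes the register values have already been read as a list of 16-bit integers by whatever Modbus client you use.

```python
import struct
from typing import Sequence


def decode_float32(registers: Sequence[int], word_swapped: bool = False) -> float:
    """Decode two 16-bit registers as IEEE 754 float32.

    Default is AB CD (big-endian); word_swapped=True handles CD AB devices.
    """
    if word_swapped:
        high, low = registers[1], registers[0]
    else:
        high, low = registers[0], registers[1]
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]


def scale_counts(raw: int, range_lower: float, range_upper: float,
                 full_scale: int = 65535) -> float:
    """Convert a raw 16-bit count to engineering units using the device range."""
    return raw / full_scale * (range_upper - range_lower) + range_lower


print(decode_float32([0x4290, 0x0000]))                     # AB CD device
print(decode_float32([0x0000, 0x4290], word_swapped=True))  # word-swapped device
print(scale_counts(26214, 0.0, 100.0))                      # raw counts to bar
```

If a decoded value looks like a denormal or an absurdly large number, the word order is the first thing to check: decoding the same two registers with `word_swapped` flipped usually makes the mistake obvious.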

Polling rate gotchas:
  RS-485 bus at 9600 baud, 1 device:
    Request packet:  8 bytes × 10 bits/byte = 80 bits
    Response (8 regs): 21 bytes × 10 bits = 210 bits
    Total: 290 bits / 9600 = 30ms per transaction
           (excludes the 3.5-character inter-frame silence and device turnaround time)
    Safe poll rate: 100ms (gives 3× margin for retries)

  RS-485 bus at 9600 baud, 10 devices:
    10 × 30ms = 300ms minimum
    Safe poll rate: 1000ms per device (one full scan of all 10 devices per second)

  Error handling in production:
    Retry: 3 attempts with 50ms delay
    After 3 failures: mark tag as Bad quality, publish with quality code
    Log every failure with device address and function code
    Alert ops if failure rate > 5% over 5 minutes
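
The bus budget above can be reproduced for other baud rates and register counts with a few lines. This is a rough sketch with hypothetical function names; it counts wire time only and, like the figures above, ignores inter-frame gaps and device turnaround time.

```python
def rtu_transaction_ms(register_count: int, baud: int, bits_per_byte: int = 10) -> float:
    """Wire time in ms for one FC=03 request/response pair on Modbus RTU."""
    request_bytes = 8                        # addr + FC + start(2) + count(2) + CRC(2)
    response_bytes = 5 + 2 * register_count  # addr + FC + byte count + data + CRC(2)
    total_bits = (request_bytes + response_bytes) * bits_per_byte
    return total_bits / baud * 1000.0


def min_scan_ms(devices: int, register_count: int, baud: int) -> float:
    """Lower bound for polling every device once, back to back."""
    return devices * rtu_transaction_ms(register_count, baud)


print(round(rtu_transaction_ms(8, 9600), 1))  # one device, 8 registers
print(round(min_scan_ms(10, 8, 9600), 1))     # ten devices on the same bus
```

Running the same calculation at 19200 baud roughly halves the wire time, which is often a cheaper fix than reducing the number of polled registers.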

Modbus master/slave request-response flow:

sequenceDiagram
    participant MASTER as Gateway (Modbus Master)
    participant SLAVE as Field Device (Modbus Slave)

    Note over MASTER,SLAVE: Normal polling cycle
    MASTER->>SLAVE: FC=03 Read Holding Registers Address:0x0000 Count:8
    SLAVE->>MASTER: Response: 8 register values (raw bytes, device-specific byte order)
    MASTER->>MASTER: Decode bytes to engineering units, apply scaling formula

    Note over MASTER,SLAVE: Error scenario
    MASTER->>SLAVE: FC=03 Read Holding Registers
    SLAVE-->>MASTER: Exception Response FC=0x83, Code=0x02 (Illegal Data Address)
    MASTER->>MASTER: Log error, mark tag as Bad quality (0x02 is a register-map error, retries will not help)

    Note over MASTER,SLAVE: Device not responding (bus fault, power loss)
    MASTER->>SLAVE: FC=03 Read Holding Registers
    MASTER->>MASTER: Timeout (no response within 200ms), increment failure counter
    MASTER->>MASTER: After 3 failures: publish quality=Bad, alert if failure rate exceeds threshold
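
The request and response frames in the diagram above can be sketched at the byte level for Modbus RTU. The function names and the `ModbusException` class are hypothetical, and the sketch covers only FC=03; a real gateway would normally use an existing Modbus library and add the timeout and retry handling shown in the flow.

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc


def build_read_holding(slave: int, address: int, count: int) -> bytes:
    """Build an FC=03 request frame; the CRC is appended low byte first."""
    body = struct.pack(">BBHH", slave, 0x03, address, count)
    return body + struct.pack("<H", crc16_modbus(body))


class ModbusException(Exception):
    """Raised when the slave answers with an exception response (FC | 0x80)."""


def parse_read_response(frame: bytes) -> list:
    """Validate the CRC and return register values, or raise on an exception response."""
    # The CRC over a frame that includes its own CRC is zero when the frame is intact.
    if len(frame) < 5 or crc16_modbus(frame) != 0:
        raise ValueError("short frame or CRC mismatch")
    function = frame[1]
    if function & 0x80:
        raise ModbusException(
            f"exception code 0x{frame[2]:02X} for FC 0x{function & 0x7F:02X}")
    byte_count = frame[2]
    return list(struct.unpack(f">{byte_count // 2}H", frame[3:3 + byte_count]))


request = build_read_holding(slave=1, address=0x0000, count=8)
print(request.hex(" "))
```

The returned register list is still raw: byte order and scaling are device-specific, so decoding to engineering units remains a separate step.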