Data Ingestion Pipelines¶
The ingestion pipeline is the path between "message arrives at the broker" and "data is queryable in the time-series database." It sounds like a simple write operation but at production scale it is a distributed system with its own failure modes, bottlenecks, and tuning surface.
The critical design decision is whether to write directly from MQTT to your database or to introduce a message bus (Kafka, Kinesis, Pulsar) in between. The message bus adds operational complexity but provides: consumer fan-out (multiple services processing the same stream independently), replay capability (re-process historical data after a bug fix), and decoupling (database can be down without losing messages). For deployments above ~5,000 devices or requiring multiple downstream consumers, the message bus is almost always worth it.
A second key decision is schema validation at ingestion. You can validate early (at the broker via auth hooks, or at the ingestion worker) or late (at query time). Validate early: bad data caught late costs more to remediate and may have already flowed into dashboards, alerts, and ML models.
9.1 Scalable Ingestion Architecture¶
graph LR
subgraph DEVICES["10,000 Devices"]
D1[Device Group A]
D2[Device Group B]
D3[Device Group C]
end
subgraph BROKER["MQTT Broker Cluster<br/>(EMQX / HiveMQ)"]
B1[Node 1]
B2[Node 2]
B3[Node 3]
end
subgraph KAFKA["Message Bus<br/>(Kafka / Kinesis)"]
K1[Topic: raw-telemetry<br/>32 partitions]
K2[Topic: commands-ack]
K3[Topic: device-events]
end
subgraph CONSUMERS["Stream Consumers"]
C1[Ingestion Workers<br/>x8 instances<br/>validate+normalize]
C2[Rules Engine<br/>Flink / Spark Streaming]
C3[Alert Processor]
C4[ML Feature Pipeline]
end
subgraph STORAGE["Storage Layer"]
S1[TimescaleDB<br/>Hot: 90 days]
S2[Parquet / S3<br/>Cold: unlimited]
S3[Redis<br/>Last-known-value cache]
S4[ElasticSearch<br/>Events + Alarms]
end
DEVICES -->|MQTT TLS| BROKER
BROKER -->|Bridge / Rule| KAFKA
KAFKA --> CONSUMERS
C1 --> S1
C1 --> S2
C1 --> S3
C2 --> C3
C3 --> S4
C4 --> S2 9.2 Throughput Math — Sizing Your Pipeline¶
Pipeline sizing is one of the most common causes of production incidents in IoT platforms — not because engineers do not care, but because they use the wrong input numbers. The calculation below walks through a realistic mid-size plant deployment with all the factors that are typically forgotten: deadband filtering (which typically reduces traffic 60–80% from the raw poll rate), payload format choice (Protobuf vs JSON matters at this scale), and the distinction between broker capacity and database capacity. Run this calculation before committing to infrastructure, and add at least 3× headroom for burst traffic during connectivity restoration events.
Real-world calculation for a mid-size plant deployment:
Devices: 2,000
Tags per device: 15 (mix of fast and slow)
Fast tags (temp, pressure, flow): 10 tags, 1s interval
Slow tags (totals, setpoints): 5 tags, 60s interval
Raw message rate:
Fast: 2,000 × 10 × 1 msg/s = 20,000 msg/s
Slow: 2,000 × 5 / 60 msg/s = 167 msg/s
Total: ≈ 20,167 msg/s
After deadband filtering (assume 70% reduction in steady state):
Effective: ≈ 6,050 msg/s to broker
Payload size:
JSON (verbose): 350 bytes avg → 2.1 MB/s ingress
JSON (compact): 200 bytes avg → 1.2 MB/s ingress
Protobuf: 80 bytes avg → 0.48 MB/s ingress
Daily volume (compact JSON):
1.2 MB/s × 86,400s = 103 GB/day uncompressed
With LZ4 compression: ≈ 25-35 GB/day on disk
Kafka partition sizing:
Target: 50,000 msg/s per partition (conservative)
Partitions needed: ceil(6,050 / 50,000) = 2 → use 8 (for consumer parallelism)
TimescaleDB sizing:
6,050 rows/s × 86,400s = 522M rows/day
At 50 bytes/row (compressed): ≈ 26 GB/day
90-day hot storage: ≈ 2.3 TB
Chunk interval for hypertable:
time_bucket = '1 day' for 90-day retention
chunk_time_interval = INTERVAL '1 day'