Edge ML & Inference

Moving ML inference to the edge removes the cloud round trip from time-critical predictions, reduces bandwidth (devices send anomaly alerts, not raw waveforms), and allows operation to continue during connectivity outages. The challenge is managing the model lifecycle (versioning, deployment, monitoring, and retraining) alongside firmware. Edge ML is not a single product or framework: it is an operational discipline that treats ML models as first-class deployable artefacts, with the same rigour applied to firmware.

18.1 Edge ML Use Cases by Latency Requirement

Not every ML use case belongs at the edge. The primary driver for edge inference is latency — if the prediction must trigger an action faster than the cloud round-trip allows, it must run on-device. Bandwidth is a secondary driver: sending raw 20 kHz vibration waveforms to the cloud for every bearing is impractical at scale, so the edge classifies the waveform and sends only the result.
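To put the bandwidth argument in numbers, the sketch below compares streaming a raw waveform with publishing one classification result per second. Only the 20 kHz sample rate comes from the text; the 16-bit sample width, the 64-byte result payload, and the one-result-per-second rate are illustrative assumptions.

```python
def raw_stream_bytes_per_s(sample_rate_hz: int, bytes_per_sample: int,
                           channels: int = 1) -> int:
    """Bytes per second needed to stream an unprocessed waveform."""
    return sample_rate_hz * bytes_per_sample * channels


def result_stream_bytes_per_s(payload_bytes: int, results_per_s: float) -> float:
    """Bytes per second needed to publish only inference results."""
    return payload_bytes * results_per_s


SAMPLE_RATE_HZ = 20_000     # from the text: 20 kHz vibration waveform
BYTES_PER_SAMPLE = 2        # assumption: 16-bit samples
RESULT_PAYLOAD_BYTES = 64   # assumption: small message with score and label
RESULTS_PER_S = 1.0         # assumption: one classification per second

raw = raw_stream_bytes_per_s(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
result = result_stream_bytes_per_s(RESULT_PAYLOAD_BYTES, RESULTS_PER_S)
print(f"raw: {raw} B/s per sensor, results: {result} B/s, ratio: {raw / result:.0f}x")
```

Under these assumptions a single sensor needs 40 kB/s for raw streaming, and that figure is multiplied by every monitored bearing in the fleet.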

Use Case	Latency Requirement	Inference Location	Rationale
Vibration anomaly detection	< 100 ms	Edge required	Must trigger alert before damage; waveform too large to stream
Visual quality inspection	< 500 ms	Edge required	Camera frame rate; reject/accept decision drives conveyor
Predictive maintenance score	Minutes	Cloud acceptable	Updated hourly; no real-time action required
Process optimisation	Hours	Cloud only	Batch optimisation; runs on historical data
Energy demand forecasting	Hours	Cloud only	External data needed (weather); no latency constraint
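The placement logic behind the table can be sketched as a small decision function. This is a simplification under stated assumptions: the `UseCase` type, the `needs_cloud_data` flag, and the 800 ms worst-case round-trip figure are all hypothetical, and a real assessment would measure the round trip on the actual network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    latency_budget_ms: float        # time allowed between measurement and action
    needs_cloud_data: bool = False  # depends on historical or external data


def inference_location(use_case: UseCase, worst_case_cloud_rtt_ms: float) -> str:
    """Classify a use case using the same three labels as the table above."""
    if use_case.latency_budget_ms < worst_case_cloud_rtt_ms:
        return "edge required"
    if use_case.needs_cloud_data:
        return "cloud only"
    return "cloud acceptable"


# Assumption: worst-case cloud round trip, including retries, on this network.
CLOUD_RTT_MS = 800.0

cases = [
    UseCase("vibration anomaly detection", 100),
    UseCase("visual quality inspection", 500),
    UseCase("predictive maintenance score", 5 * 60 * 1000),
    UseCase("energy demand forecasting", 60 * 60 * 1000, needs_cloud_data=True),
]
for case in cases:
    print(f"{case.name}: {inference_location(case, CLOUD_RTT_MS)}")
```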

18.2 Model Deployment Pipeline

The model deployment pipeline is analogous to the OTA firmware pipeline (§12) but has different artefact types and validation steps. A model file is a binary blob (ONNX, TFLite, or OpenVINO IR format) that must be versioned, signed, validated against a held-out test set before deployment, and deployed via the same OTA infrastructure used for firmware.

graph LR
    TRAIN["Cloud Training<br/>Jupyter / MLflow<br/>on labelled dataset"]
    EXPORT["Model Export<br/>ONNX / TFLite<br/>quantised for target HW"]
    REGISTRY["Model Registry<br/>MLflow / custom S3<br/>versioned + signed"]
    OTA_PUSH["OTA-Style Model Push<br/>same infra as firmware OTA<br/>separate campaign type"]
    RUNTIME["Edge Runtime<br/>ONNX Runtime / TFLite<br/>loads new model version"]
    INFER["Inference Result<br/>anomaly score<br/>classification label"]
    PUBLISH["MQTT Publish<br/>inference topic<br/>not raw waveform"]
    MONITOR["Cloud Monitoring<br/>drift detection<br/>accuracy tracking"]
    RETRAIN["Retrain Trigger<br/>when accuracy drops<br/>or data distribution shifts"]

    TRAIN --> EXPORT --> REGISTRY --> OTA_PUSH --> RUNTIME --> INFER --> PUBLISH --> MONITOR --> RETRAIN --> TRAIN
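The export-to-registry step can be sketched as a gate that refuses to register a model unless it passes held-out validation, then records a digest and signature. This is a minimal illustration, not a description of MLflow's API: `register_model`, the 0.95 threshold, and the entry fields are hypothetical, and HMAC-SHA256 stands in for the asymmetric signature (for example Ed25519) a production pipeline would more likely use.

```python
import hashlib
import hmac
from typing import Callable, Sequence


def holdout_accuracy(predict: Callable[[float], int],
                     samples: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of held-out samples the candidate model labels correctly."""
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return correct / len(labels)


def register_model(model_bytes: bytes, version: str, signing_key: bytes,
                   accuracy: float, min_accuracy: float = 0.95) -> dict:
    """Return a registry entry, or raise if the validation gate fails."""
    if accuracy < min_accuracy:
        raise ValueError(f"accuracy {accuracy:.3f} is below the gate {min_accuracy:.3f}")
    return {
        "version": version,
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        # Stand-in signature: shared-key HMAC, not a public-key signature.
        "signature": hmac.new(signing_key, model_bytes, hashlib.sha256).hexdigest(),
        "holdout_accuracy": accuracy,
    }


# Toy stand-in for an exported model: a threshold on a single feature.
def toy_predict(x: float) -> int:
    return 1 if x > 0.5 else 0


acc = holdout_accuracy(toy_predict, [0.1, 0.9, 0.7, 0.2], [0, 1, 1, 0])
entry = register_model(b"fake-model-bytes", "1.4.2", b"demo-key", acc)
print(entry["version"], entry["sha256"][:12], entry["holdout_accuracy"])
```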

18.3 Runtime Choices

Runtime	Target Hardware	Model Formats	RAM Footprint	Typical Inference Latency
ONNX Runtime	CPU / GPU, cross-platform	ONNX	50–200 MB	5–50 ms (CPU, mid-range industrial PC)
TensorFlow Lite	ARM-optimised, microcontrollers	TFLite (FlatBuffer)	1–20 MB	10–100 ms (Cortex-M to Cortex-A)
OpenVINO	Intel x86 / VPU (Myriad)	ONNX, PaddlePaddle, OpenVINO IR	100–500 MB	2–20 ms on Intel hardware
NVIDIA Triton	NVIDIA GPU (Jetson AGX / Orin)	TensorRT, ONNX, TensorFlow, PyTorch (TorchScript)	500 MB–2 GB	< 5 ms on Jetson Orin
Edge Impulse	Embedded MCU to Linux	EON Compiler (C++ output)	10 KB–10 MB	1–50 ms depending on MCU

ONNX Runtime is the default choice for Intel/AMD x86 edge industrial PCs (IPCs): it accepts models trained in PyTorch or TensorFlow after export, runs on Linux without a GPU, and has Python and C APIs. TensorFlow Lite is the default for ARM-based devices (Raspberry Pi CM4, Moxa UC-8100) where RAM is constrained. OpenVINO delivers the best performance on Intel-specific hardware, including Intel integrated graphics and the Movidius Neural Compute Stick (note that OpenVINO releases after 2022.3 LTS no longer support Myriad VPUs, so the stick requires an older release).
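The defaults in the paragraph above can be written down as a small lookup. This is only a sketch of the stated rule of thumb: `default_runtime` and its `intel_accelerator` flag are hypothetical names, and Jetson or microcontroller targets from the table are deliberately left out.

```python
def default_runtime(cpu_arch: str, intel_accelerator: bool = False) -> str:
    """Map a device's CPU architecture to the default runtime named in the text.

    cpu_arch is a string such as platform.machine() returns on the device.
    intel_accelerator marks Intel-specific hardware (integrated GPU or VPU).
    """
    arch = cpu_arch.lower()
    if arch in {"x86_64", "amd64"}:
        return "OpenVINO" if intel_accelerator else "ONNX Runtime"
    if arch in {"armv7l", "aarch64", "arm64"}:
        return "TensorFlow Lite"
    raise ValueError(f"no default runtime defined for architecture {cpu_arch!r}")


print(default_runtime("x86_64"))                          # generic x86 IPC
print(default_runtime("x86_64", intel_accelerator=True))  # Intel iGPU or VPU present
print(default_runtime("armv7l"))                          # 32-bit ARM gateway
```

On a device, the architecture string could come from `platform.machine()` in the standard library. Treat the result as a starting point to benchmark, not a final decision.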

18.4 Model Versioning alongside Firmware

The model version must be tracked separately from the firmware version in the device registry. A firmware update does not change the ML model, and a model update does not require a firmware update. Conflating the two creates unnecessary coupling: either a model improvement is blocked waiting for a firmware release cycle, or a firmware security patch is delayed because the model team is not ready.

The device registry should track both independently:

{
  "device_id": "GW-004-A7",
  "fw_version": "3.2.1",
  "model_versions": {
    "vibration_anomaly": "1.4.2",
    "quality_classifier": "2.1.0"
  },
  "hw_platform": "moxa-uc8100"
}
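A short sketch of how a platform might use this record to target a model campaign without touching the firmware field. The helper names (`parse_version`, `devices_needing_model`, `apply_model_update`) and the target version 1.5.0 are hypothetical; the record is the one shown above.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def devices_needing_model(registry: List[Dict], model_name: str,
                          target_version: str) -> List[str]:
    """IDs of devices that run the model at a version older than the target."""
    target = parse_version(target_version)
    return [
        record["device_id"]
        for record in registry
        if model_name in record["model_versions"]
        and parse_version(record["model_versions"][model_name]) < target
    ]


def apply_model_update(record: Dict, model_name: str, new_version: str) -> Dict:
    """Return a copy of the record with one model version changed, nothing else."""
    updated = dict(record)
    updated["model_versions"] = {**record["model_versions"], model_name: new_version}
    return updated


record = {
    "device_id": "GW-004-A7",
    "fw_version": "3.2.1",
    "model_versions": {"vibration_anomaly": "1.4.2", "quality_classifier": "2.1.0"},
    "hw_platform": "moxa-uc8100",
}

targets = devices_needing_model([record], "vibration_anomaly", "1.5.0")
after = apply_model_update(record, "vibration_anomaly", "1.5.0")
print(targets, after["fw_version"], after["model_versions"])
```

The firmware version and the other model's version are unchanged after the update, which is the decoupling the registry design is meant to provide.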

Model updates use the same OTA infrastructure as firmware (§12), but with a campaign_type: model_update field that routes the campaign to the model download handler rather than the firmware update handler. The model update handler downloads the model file, validates its signature, runs a quick inference test against a stored test vector (checking that the output matches the expected result), and atomically swaps the model file into place. The previous model file is retained for one version to enable immediate rollback without a re-download.
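The handler steps described above (verify, smoke-test, retain previous, swap) can be sketched as follows, with the download step omitted. Everything here is illustrative: `install_model`, the `.staged` and `.prev` file suffixes, and the injected `run_inference` callable are hypothetical, and HMAC-SHA256 stands in for whatever signature scheme the OTA infrastructure actually uses. The atomic swap relies on `os.replace`, which is atomic when source and destination are on the same filesystem.

```python
import hashlib
import hmac
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Sequence

KEY = b"demo-key"  # assumption: shared key, for this demo only


def install_model(new_model: bytes, signature: str, key: bytes, active_path: Path,
                  run_inference: Callable[[Path, Sequence[float]], float],
                  test_vector: Sequence[float], expected: float,
                  tolerance: float = 1e-3) -> None:
    # 1. Validate the signature before anything is written to disk.
    computed = hmac.new(key, new_model, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(computed, signature):
        raise ValueError("model signature mismatch")

    # 2. Stage next to the active file so the final rename stays on one filesystem.
    staged = active_path.with_name(active_path.name + ".staged")
    staged.write_bytes(new_model)

    # 3. Smoke test: the staged model must reproduce the stored expected output.
    try:
        output = run_inference(staged, test_vector)
        if abs(output - expected) > tolerance:
            raise ValueError(f"test vector mismatch: got {output}, expected {expected}")
    except Exception:
        staged.unlink(missing_ok=True)  # leave the active model untouched
        raise

    # 4. Retain the current model for one version to allow rollback.
    if active_path.exists():
        shutil.copy2(active_path, active_path.with_name(active_path.name + ".prev"))

    # 5. Atomic swap into place.
    os.replace(staged, active_path)


# Demo with a stand-in "runtime" that simply returns the model file size.
def fake_inference(path: Path, vector: Sequence[float]) -> float:
    return float(path.stat().st_size)


with tempfile.TemporaryDirectory() as tmp:
    active = Path(tmp) / "vibration_anomaly.onnx"
    active.write_bytes(b"old")
    blob = b"new-model"
    sig = hmac.new(KEY, blob, hashlib.sha256).hexdigest()
    install_model(blob, sig, KEY, active, fake_inference, [0.0], expected=9.0)
    active_bytes = active.read_bytes()
    previous_bytes = (Path(tmp) / "vibration_anomaly.onnx.prev").read_bytes()
```

A production handler would also flush the staged file to storage (for example with `os.fsync`) before the rename, so that a power loss cannot leave a truncated model in place.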

Key difference for rollback: Model rollback is safe and stateless — swapping back to the previous model file has no side effects. Firmware rollback may be risky (bootloader state, partition table changes, EEPROM writes). This means model rollback can be triggered automatically by the platform on accuracy degradation, while firmware rollback requires a deliberate operator action.
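An automatic rollback trigger might look like the sketch below: a rolling accuracy window on the platform side, plus a file swap on the device. The `AccuracyMonitor` class, the window size and threshold, and the `.prev` file name are assumptions for illustration; the sketch also assumes labelled feedback is available to judge whether each prediction was correct.

```python
import os
import tempfile
from collections import deque
from pathlib import Path


class AccuracyMonitor:
    """Rolling accuracy over the most recent labelled predictions."""

    def __init__(self, window: int = 200, min_accuracy: float = 0.90) -> None:
        self._outcomes = deque(maxlen=window)
        self._min_accuracy = min_accuracy

    def record(self, prediction_was_correct: bool) -> None:
        self._outcomes.append(prediction_was_correct)

    def should_roll_back(self) -> bool:
        # Wait for a full window so a few early errors do not trigger a rollback.
        if len(self._outcomes) < self._outcomes.maxlen:
            return False
        accuracy = sum(self._outcomes) / len(self._outcomes)
        return accuracy < self._min_accuracy


def roll_back_model(active: Path, previous: Path) -> None:
    """Swap the retained previous model back into place (atomic on one filesystem)."""
    os.replace(previous, active)


monitor = AccuracyMonitor(window=4, min_accuracy=0.75)
with tempfile.TemporaryDirectory() as tmp:
    active = Path(tmp) / "vibration_anomaly.onnx"
    previous = Path(tmp) / "vibration_anomaly.onnx.prev"
    active.write_bytes(b"model-1.5.0")
    previous.write_bytes(b"model-1.4.2")

    for outcome in (True, True, False, False):  # 50% accuracy in the window
        monitor.record(outcome)
    if monitor.should_roll_back():
        roll_back_model(active, previous)
    restored = active.read_bytes()
```

After the swap the retained copy is consumed, so a further rollback would need a re-download. The device registry entry for the model version should be updated to match.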