I recently faced a challenge that sits at the intersection of manufacturing traceability and pragmatic system architecture: keep lot-level traceability synchronized across three different ERP systems without introducing a heavyweight integration platform. The context was simple but demanding: three plants running different ERPs (one SAP, one Microsoft Dynamics, one Infor), a requirement to propagate lot events (creation, split, merge, stage, consume) in near‑real time, and strict auditability for quality and regulatory purposes.

We wanted minimal operational complexity, predictable latency, and a solution that operations teams could own and reason about. The answer we implemented was an MQTT-based event surface combined with a lightweight event bus (NATS JetStream or Redis Streams) and a small set of stateless orchestrators. Below I explain the architecture, design patterns, message examples, and operational practices that made it reliable in production.
High-level architecture
The pattern is straightforward:
This architecture avoids heavy ESBs or integration platforms. MQTT handles device/edge friendliness and lightweight publish/subscribe; the event bus handles durable delivery and ordered processing across the cluster; orchestrators implement idempotent, deterministic logic.
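To make "idempotent, deterministic logic" concrete, here is a minimal sketch of an orchestrator handler that tolerates redelivery from the event bus. It is an illustration under assumptions, not our production code: the class name `LotOrchestrator`, the in-memory set and dict (stand-ins for a durable dedup store and the audit store), and the quantity bookkeeping are all hypothetical. The field names follow the canonical event shown later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class LotOrchestrator:
    """Hypothetical handler; state lives in injected stores (in-memory stand-ins here)."""
    seen_event_ids: set = field(default_factory=set)    # stand-in for a durable dedup store
    lot_quantities: dict = field(default_factory=dict)  # stand-in for the audit store

    def handle(self, event: dict) -> bool:
        """Apply an event at most once; return False for a duplicate delivery."""
        event_id = event["eventId"]
        if event_id in self.seen_event_ids:
            return False  # redelivered message: nothing to apply, safe to acknowledge
        lot_id = event["lotId"]
        if event["eventType"] == "LOT_CREATED":
            self.lot_quantities[lot_id] = event["quantity"]
        elif event["eventType"] == "LOT_CONSUMED":
            self.lot_quantities[lot_id] = self.lot_quantities.get(lot_id, 0) - event["quantity"]
        # Record the event id only after the state change has been applied.
        self.seen_event_ids.add(event_id)
        return True
```

Because the handler keys on `eventId`, an at-least-once bus can redeliver the same message without the lot quantity being decremented twice. In a real deployment the dedup record and the state change would need to be committed together, which this sketch does not attempt.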
Why MQTT + a lightweight event bus?
In the plants I’ve worked with, MQTT is already the lingua franca for OT/edge components: it’s lightweight, resilient on flaky networks, supports QoS, and is simple to secure with TLS and client certs. But MQTT brokers aren’t always ideal for durable cross‑service streams or for ordered, at‑least‑once consumption across multiple consumers. That’s where a lightweight event bus (NATS JetStream or Redis Streams) fills the gap: durable message retention, consumer groups, acknowledgment semantics, replay, and retention controls such as size or age limits and stream trimming.
Using this combo, we get:
Message modeling and topics
Design a canonical event model. Keep messages small and predictable, with explicit meta fields for correlation and idempotency.
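One way to pin the model down in code is a small immutable dataclass with validation at the boundary. This is a sketch under assumptions: the class name `LotEvent`, the validation rules, and the choice of which fields are required are mine; the field names and the four event types are taken from the JSON example in this post.

```python
import json
from dataclasses import dataclass, field, asdict

# Event types as listed in the JSON example; extend if your model has more.
EVENT_TYPES = {"LOT_CREATED", "LOT_SPLIT", "LOT_MERGED", "LOT_CONSUMED"}


@dataclass(frozen=True)
class LotEvent:
    eventId: str          # idempotency key
    timestamp: str        # ISO 8601, UTC
    sourceSystem: str
    eventType: str
    lotId: str
    quantity: float
    uom: str
    correlationId: str    # ties the event to a business transaction
    parents: tuple = ()
    version: int = 1
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumed validation rules: reject unknown types and negative quantities.
        if self.eventType not in EVENT_TYPES:
            raise ValueError(f"unknown eventType: {self.eventType}")
        if self.quantity < 0:
            raise ValueError("quantity must be non-negative")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "LotEvent":
        data = json.loads(raw)
        data["parents"] = tuple(data.get("parents", ()))  # JSON arrays become tuples
        return cls(**data)
```

Rejecting malformed events when they are decoded keeps the orchestrators free of defensive checks, at the cost of needing a place to park rejected messages.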
Example canonical event (JSON):
    {
      "eventId": "uuid-v4",
      "timestamp": "2026-03-15T10:22:03Z",
      "sourceSystem": "SAP-PlantA",
      "eventType": "LOT_CREATED", // LOT_CREATED, LOT_SPLIT, LOT_MERGED, LOT_CONSUMED
      "lotId": "PLANTA-LOT-12345",
      "quantity": 1200,
      "uom": "kg",
      "parents": ["PLANTA-LOT-12222"],
      "correlationId": "order-98765",
      "version": 1,
      "payload": { /* optional domain details */ }
    }

MQTT topic structure for ingestion (edge/ERP -> broker):
| Topic | Meaning |
| --- | --- |
| plant/{plantId}/erp/{system}/lot/events | Raw ERP/edge events from a plant/system |
| plant/{plantId}/commands/lot | Commands to ERP/edge (create/update/reconcile) |
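The topic layout above is easy to get subtly wrong when strings are assembled by hand, so a pair of small helpers can centralize it. The sketch below is illustrative: the function names and the segment validation are assumptions, while the topic shapes come from the table and `+` is the standard MQTT single-level wildcard.

```python
# Subscription filter matching raw lot events from every plant and ERP system.
ALL_EVENTS_FILTER = "plant/+/erp/+/lot/events"


def _check_segment(value: str) -> str:
    """Reject values that would change the topic structure."""
    if not value or any(ch in value for ch in "/+#"):
        raise ValueError(f"invalid topic segment: {value!r}")
    return value


def event_topic(plant_id: str, system: str) -> str:
    """Topic an ERP or edge adapter publishes raw lot events to."""
    return f"plant/{_check_segment(plant_id)}/erp/{_check_segment(system)}/lot/events"


def command_topic(plant_id: str) -> str:
    """Topic used to send lot commands back to a plant."""
    return f"plant/{_check_segment(plant_id)}/commands/lot"


def parse_event_topic(topic: str) -> tuple:
    """Return (plant_id, system) for an event topic, or raise ValueError."""
    parts = topic.split("/")
    if (len(parts) != 6 or parts[0] != "plant" or parts[2] != "erp"
            or parts[4:] != ["lot", "events"]):
        raise ValueError(f"not a lot event topic: {topic!r}")
    return parts[1], parts[3]
```

Parsing the plant and system out of the topic lets a bridge cross-check them against the `sourceSystem` field in the payload before forwarding an event to the bus.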
On the event bus (NATS/Redis), streams are organized by logical domain:
Ensuring order, idempotency, and consistency
Three practical rules we applied:
Canonical lot mapping across ERPs
To reconcile the same physical lot across three ERPs, we maintain a canonical mapping table in the audit store:
| canonicalLotId | system | systemLotId |
| --- | --- | --- |
| CAN-0001 | SAP | PLANTA-LOT-12345 |
| CAN-0001 | MSD | PLANTB-LOT-9876 |
When a lot is created in any ERP, the orchestrator either allocates a new canonical ID (if new) or links the system lot to an existing canonical ID (if it's a linked transfer). This mapping allows trace queries to show end‑to‑end lot genealogy across ERPs.
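The allocate-or-link step can be sketched as a single lookup function. This is a minimal illustration under assumptions: it uses an in-memory SQLite database as a stand-in for the audit store, snake_case column names, a hypothetical `linked_to` argument to signal a linked transfer, and a naive counter for new canonical IDs that would not be safe under concurrent writers.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory mapping table (stand-in for the real audit store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE lot_mapping (
               canonical_lot_id TEXT NOT NULL,
               source_system    TEXT NOT NULL,
               system_lot_id    TEXT NOT NULL,
               PRIMARY KEY (source_system, system_lot_id))"""
    )
    return conn


def _lookup(conn, system, system_lot_id):
    row = conn.execute(
        "SELECT canonical_lot_id FROM lot_mapping "
        "WHERE source_system = ? AND system_lot_id = ?",
        (system, system_lot_id),
    ).fetchone()
    return row[0] if row else None


def resolve_canonical_id(conn, system, system_lot_id, linked_to=None):
    """Return the canonical ID for a system lot, allocating or linking if needed.

    linked_to is an optional (system, system_lot_id) pair naming the lot this
    one was transferred from; when given, the new lot joins its canonical ID.
    """
    existing = _lookup(conn, system, system_lot_id)
    if existing is not None:
        return existing  # already mapped: repeated calls are idempotent
    if linked_to is not None:
        canonical = _lookup(conn, *linked_to)
        if canonical is None:
            raise LookupError(f"linked lot not mapped yet: {linked_to}")
    else:
        # Naive allocation for illustration only; use a sequence in production.
        count = conn.execute(
            "SELECT COUNT(DISTINCT canonical_lot_id) FROM lot_mapping"
        ).fetchone()[0]
        canonical = f"CAN-{count + 1:04d}"
    conn.execute(
        "INSERT INTO lot_mapping VALUES (?, ?, ?)",
        (canonical, system, system_lot_id),
    )
    conn.commit()
    return canonical
```

With the rows from the table above, resolving the SAP lot first allocates a canonical ID, and resolving the Dynamics lot with `linked_to=("SAP", "PLANTA-LOT-12345")` attaches it to the same one.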
Operational patterns and failure handling
Key operational practices that reduced incidents:
Security and governance
Security isn’t optional. We used:
Testing and rollout
Start small and iterate:
Practical tools and vendors
What we used and recommend when you want minimal ops overhead:
This architecture trades heavyweight orchestration for clear, testable components. You get traceability, replayability, and resilience with a small operational footprint and no single monolithic middleware. If you’d like, I can provide sample orchestrator pseudocode, MQTT topic ACL examples, or a reference Postgres schema for canonical lot mapping.