I recently faced a challenge that sits at the intersection of manufacturing traceability and pragmatic system architecture: keeping lot-level traceability synchronized across three different ERP systems without introducing a heavyweight integration platform. The context was simple but demanding: three plants running different ERPs (one SAP, one Microsoft Dynamics, one Infor), a requirement to propagate lot events (creation, split, merge, stage, consume) in near‑real time, and strict auditability for quality and regulatory purposes. We wanted minimal operational complexity, predictable latency, and a solution that operations teams could own and reason about. The answer we implemented was an MQTT-based event surface combined with a lightweight event bus (NATS JetStream or Redis Streams) and a small set of stateless orchestrators. Below I explain the architecture, design patterns, message examples, and operational practices that made it reliable in production.

High-level architecture

The pattern is straightforward:

  • Each ERP publishes lot events to a local MQTT broker (or pushes them to a central broker); that event feed is treated as the source of truth for that system.
  • A lightweight event bus (NATS JetStream or Redis Streams) aggregates normalized events and provides a durable, ordered stream that orchestrators subscribe to.
  • Stateless orchestrator microservices subscribe to the bus, apply deterministic business rules (split/merge reconciliation, canonical lot mapping), and dispatch outbound commands or reconciliation events to the target ERPs via MQTT or API.
  • An audit/logging service persists canonical lot state and event history in a small, queryable store (Postgres or TimescaleDB) for traceability and reconciliation.

    This architecture avoids heavy ESBs or integration platforms. MQTT handles device/edge friendliness and lightweight publish/subscribe; the event bus handles durable delivery and ordered processing across the cluster; orchestrators implement idempotent, deterministic logic.

    Why MQTT + a lightweight event bus?

    In the plants I’ve worked with, MQTT is already the lingua franca for OT/edge components: it’s lightweight, resilient on flaky networks, supports QoS levels, and is simple to secure with TLS and client certificates. But MQTT brokers aren’t always ideal for durable cross‑service streams with at‑least‑once, ordered consumption across multiple consumers. That’s where a lightweight event bus (NATS JetStream or Redis Streams) fills the gap: durable message retention, consumer groups, acknowledgment semantics, replay, and configurable stream limits or trimming to bound storage.

    Using this combo, we get:

  • Edge friendliness and low-bandwidth resilience (MQTT).
  • Durable ordered streams, replays, and consumer groups (NATS JetStream / Redis Streams).
  • No heavy middleware licensing, less ops overhead, simpler deployment (containers + small k8s footprint).

    Message modeling and topics

    Design a canonical event model. Keep messages small and predictable, with explicit meta fields for correlation and idempotency.

    Example canonical event (JSON):

    {
      "eventId": "uuid-v4",
      "timestamp": "2026-03-15T10:22:03Z",
      "sourceSystem": "SAP-PlantA",
      "eventType": "LOT_CREATED",
      "lotId": "PLANTA-LOT-12345",
      "quantity": 1200,
      "uom": "kg",
      "parents": ["PLANTA-LOT-12222"],
      "correlationId": "order-98765",
      "version": 1,
      "payload": {}
    }

    Here eventType is one of LOT_CREATED, LOT_SPLIT, LOT_MERGED, or LOT_CONSUMED, and payload carries optional domain details.
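
As a rough illustration of how a consumer might validate this shape before acting on it, here is a minimal Python sketch using only the standard library. The LotEvent class, the parse_lot_event helper, and the choice of which fields are mandatory are assumptions made for this example, not part of the production system described here.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional, Tuple

# Event types listed in the canonical example.
EVENT_TYPES = {"LOT_CREATED", "LOT_SPLIT", "LOT_MERGED", "LOT_CONSUMED"}
# Assumption: these five fields are treated as mandatory in this sketch.
REQUIRED_FIELDS = ("eventId", "timestamp", "sourceSystem", "eventType", "lotId")


@dataclass(frozen=True)
class LotEvent:
    event_id: str
    timestamp: datetime
    source_system: str
    event_type: str
    lot_id: str
    quantity: Optional[float] = None
    uom: Optional[str] = None
    parents: Tuple[str, ...] = ()
    correlation_id: Optional[str] = None
    version: int = 1
    payload: Dict[str, Any] = field(default_factory=dict)


def parse_lot_event(raw: str) -> LotEvent:
    """Parse and validate one canonical event; raise ValueError on bad input."""
    doc = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if doc["eventType"] not in EVENT_TYPES:
        raise ValueError(f"unknown eventType: {doc['eventType']}")
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    timestamp = datetime.fromisoformat(doc["timestamp"].replace("Z", "+00:00"))
    return LotEvent(
        event_id=doc["eventId"],
        timestamp=timestamp,
        source_system=doc["sourceSystem"],
        event_type=doc["eventType"],
        lot_id=doc["lotId"],
        quantity=doc.get("quantity"),
        uom=doc.get("uom"),
        parents=tuple(doc.get("parents", [])),
        correlation_id=doc.get("correlationId"),
        version=doc.get("version", 1),
        payload=doc.get("payload", {}),
    )
```

Rejecting malformed events at the edge of the bus keeps the orchestrators' rules simple, because they only ever see events that already match the canonical shape.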

    MQTT topic structure for ingestion (edge/ERP -> broker):

    Topic                                      Meaning
    plant/{plantId}/erp/{system}/lot/events    Raw ERP/edge events from a plant/system
    plant/{plantId}/commands/lot               Commands to ERP/edge (create/update/reconcile)
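
To make the topic layout concrete, the following Python sketch parses an ingestion topic into its plant and system parts and builds the matching command topic. The helper names (LotTopic, parse_event_topic, command_topic) are hypothetical; the subscription filter uses MQTT's standard single-level wildcard "+".

```python
from typing import NamedTuple, Optional

# A bridge could subscribe to all plants and systems with this filter.
EVENTS_FILTER = "plant/+/erp/+/lot/events"


class LotTopic(NamedTuple):
    plant_id: str
    system: str


def parse_event_topic(topic: str) -> Optional[LotTopic]:
    """Return plant and system for plant/{plantId}/erp/{system}/lot/events, else None."""
    parts = topic.split("/")
    if (
        len(parts) == 6
        and parts[0] == "plant"
        and parts[2] == "erp"
        and parts[4:] == ["lot", "events"]
        and parts[1]
        and parts[3]
    ):
        return LotTopic(plant_id=parts[1], system=parts[3])
    return None


def command_topic(plant_id: str) -> str:
    """Build the outbound command topic plant/{plantId}/commands/lot."""
    return f"plant/{plant_id}/commands/lot"
```

Deriving plantId and system from the topic rather than trusting only the payload gives the normalizer a cheap cross-check against the sourceSystem field.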

    On the event bus (NATS/Redis), streams are organized by logical domain:

  • lots.events — canonical normalized events
  • lots.commands — outbound commands to ERPs

    Ensuring order, idempotency, and consistency

    Three practical rules we applied:

  • Idempotency tokens: Every event and command carries an eventId and a correlationId. Targets persist the last processed eventId per lot to ignore duplicates.
  • Deterministic conflict resolution: Orchestrators use monotonic sequence numbers or event timestamps and version fields. A later version always wins; concurrent changes are resolved by deterministic rules (e.g., higher sequence number or source priority).
  • Compensating actions: Rather than attempting distributed transactions, implement sagas: if a commanded change fails on the target ERP, emit a compensating event so that the originator and the other systems converge on a consistent state.
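
The first two rules can be sketched in a few lines of Python. This is an in-memory illustration only: the LotStore class, the SOURCE_PRIORITY table, and the system names other than SAP-PlantA are hypothetical, and the sketch remembers every seen eventId rather than only the last one per lot, which is a simplification for the example.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical source priority, used only to break ties between equal versions.
SOURCE_PRIORITY = {"SAP-PlantA": 3, "MSD-PlantB": 2, "INFOR-PlantC": 1}


@dataclass
class LotState:
    version: int
    source_system: str
    quantity: float


class LotStore:
    """In-memory stand-in for the state a target would persist."""

    def __init__(self) -> None:
        self.seen_event_ids: Set[str] = set()
        self.lots: Dict[str, LotState] = {}

    def apply(self, event: dict) -> str:
        """Apply an event once; return 'applied', 'duplicate', or 'stale'."""
        if event["eventId"] in self.seen_event_ids:
            return "duplicate"  # redelivery of an event we already handled
        self.seen_event_ids.add(event["eventId"])
        current = self.lots.get(event["lotId"])
        if current is not None and not self._wins(event, current):
            return "stale"  # an equal-or-newer state is already stored
        self.lots[event["lotId"]] = LotState(
            version=event["version"],
            source_system=event["sourceSystem"],
            quantity=event["quantity"],
        )
        return "applied"

    @staticmethod
    def _wins(event: dict, current: LotState) -> bool:
        # A later version always wins; equal versions fall back to source priority.
        if event["version"] != current.version:
            return event["version"] > current.version
        incoming = SOURCE_PRIORITY.get(event["sourceSystem"], 0)
        return incoming > SOURCE_PRIORITY.get(current.source_system, 0)
```

Because the outcome depends only on the event and the stored state, replaying the same stream into a fresh store yields the same final lots, which is what makes replay-based testing meaningful.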

    Canonical lot mapping across ERPs

    To reconcile the same physical lot across three ERPs, we maintain a canonical mapping table in the audit store:

    canonicalLotId    system    systemLotId
    CAN-0001          SAP       PLANTA-LOT-12345
    CAN-0001          MSD       PLANTB-LOT-9876

    When a lot is created in any ERP, the orchestrator either allocates a new canonical ID or, if the lot arrives as a linked transfer, links the system lot to the existing canonical ID. This mapping lets trace queries show end‑to‑end lot genealogy across ERPs.
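
A minimal sketch of the allocate-or-link step might look like the following. It uses Python's built-in sqlite3 with an in-memory database purely so the example is self-contained; the table and column names, the link_lot and systems_for helpers, and the count-based ID allocation are assumptions for illustration (in Postgres a sequence would be the safer way to allocate IDs under concurrency).

```python
import sqlite3
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE lot_mapping (
    canonical_lot_id TEXT NOT NULL,
    system           TEXT NOT NULL,
    system_lot_id    TEXT NOT NULL,
    PRIMARY KEY (system, system_lot_id)
);
"""

LOOKUP = "SELECT canonical_lot_id FROM lot_mapping WHERE system = ? AND system_lot_id = ?"


def link_lot(conn: sqlite3.Connection, system: str, system_lot_id: str,
             linked_to: Optional[Tuple[str, str]] = None) -> str:
    """Return the canonical ID for a system lot, allocating or linking as needed."""
    row = conn.execute(LOOKUP, (system, system_lot_id)).fetchone()
    if row:
        return row[0]  # already mapped, so the call is idempotent
    canonical = None
    if linked_to is not None:
        # Linked transfer: reuse the canonical ID of the originating lot.
        hit = conn.execute(LOOKUP, linked_to).fetchone()
        if hit:
            canonical = hit[0]
    if canonical is None:
        # Demo-only allocation; not safe under concurrent writers.
        count = conn.execute(
            "SELECT COUNT(DISTINCT canonical_lot_id) FROM lot_mapping"
        ).fetchone()[0]
        canonical = f"CAN-{count + 1:04d}"
    conn.execute("INSERT INTO lot_mapping VALUES (?, ?, ?)",
                 (canonical, system, system_lot_id))
    return canonical


def systems_for(conn: sqlite3.Connection, canonical_lot_id: str) -> List[Tuple[str, str]]:
    """List every (system, systemLotId) pair behind one canonical lot."""
    return conn.execute(
        "SELECT system, system_lot_id FROM lot_mapping "
        "WHERE canonical_lot_id = ? ORDER BY system",
        (canonical_lot_id,),
    ).fetchall()
```

Calling link_lot for the SAP lot first and then for the MSD lot with linked_to pointing at the SAP lot reproduces the two rows in the table above.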

    Operational patterns and failure handling

    Key operational practices that reduced incidents:

  • Dead-letter and retry policies: Use exponential backoff and a dead‑letter stream for events failing after N attempts. Operators can inspect DLQ entries and trigger manual reconciliation.
  • Reconciliation jobs: Periodic bulk reconciliation jobs compare the canonical state with each ERP using simple REST/MQTT pull queries. Any differences produce corrective commands.
  • Observability: Instrument every orchestrator with traces (OpenTelemetry), plus metrics — event lag, processing errors, retry counts, reconciliation drift.
  • Versioned schemas: Use a lightweight schema registry (even a Git repo plus a validator) and include schemaVersion in events.
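
The retry and dead-letter policy can be illustrated with a short Python sketch. Everything here is a simplified assumption: the function names, the default of four attempts, the 0.5 s base delay, and the plain list standing in for a dead-letter stream. A real deployment would more likely lean on the redelivery and acknowledgment features of the event bus itself.

```python
import time
from typing import Callable, List


def backoff_delays(count: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Exponential delays: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * 2 ** i) for i in range(count)]


def process_with_retry(event: dict,
                       handler: Callable[[dict], None],
                       dead_letters: List[dict],
                       max_attempts: int = 4,
                       base: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run handler(event); after max_attempts failures, park the event in dead_letters."""
    delays = backoff_delays(max_attempts - 1, base)
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(delays[attempt - 1])
    # Keep the error and attempt count so operators can triage the entry later.
    dead_letters.append({"event": event, "error": last_error, "attempts": max_attempts})
    return False
```

Passing sleep as a parameter is a small design choice that lets tests record the delays instead of actually waiting.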

    Security and governance

    Security isn’t optional. We used:

  • TLS for MQTT and NATS, mutual TLS for cross-data-center communications.
  • Client certificates and role-based topic permissions for ERPs and orchestrators.
  • Signed payloads where regulators required non-repudiation, plus HMAC headers for message integrity.
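
For the HMAC part, a minimal Python sketch with the standard library could look like this. The function names and the choice of SHA-256 over a key-sorted JSON serialization are assumptions for the example. Note that an HMAC relies on a shared secret, so it demonstrates integrity and origin among key holders; strict non-repudiation needs asymmetric signatures, which are not shown here.

```python
import hashlib
import hmac
import json


def canonical_bytes(event: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(event: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag suitable for a message header."""
    return hmac.new(key, canonical_bytes(event), hashlib.sha256).hexdigest()


def verify_event(event: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison to avoid leaking information through timing."""
    return hmac.compare_digest(sign_event(event, key), signature)
```

Sorting keys before hashing matters because two systems may serialize the same event with fields in a different order.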

    Testing and rollout

    Start small and iterate:

  • Pilot with one ERP pair and a narrow set of lot events (create, consume).
  • Use canary replay: store production events and replay to staging orchestrators to validate logic before enabling live routing.
  • Exercise failure modes regularly — simulate target ERP outages and verify saga compensations and DLQ handling behave as expected.
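
One way to exercise an outage in a test is to inject a failing sender and check that a compensating event comes out. The sketch below is hypothetical throughout: the ErpUnavailable exception, the dispatch_command function, and the LOT_COMMAND_COMPENSATED event type are names invented for this illustration, not part of the system described above.

```python
from typing import Callable


class ErpUnavailable(Exception):
    """Raised by a (hypothetical) ERP client when the target cannot be reached."""


def dispatch_command(command: dict,
                     send: Callable[[dict], None],
                     emit: Callable[[dict], None]) -> bool:
    """Send a command; on failure emit a compensating event instead of raising."""
    try:
        send(command)
    except ErpUnavailable as exc:
        emit({
            "eventType": "LOT_COMMAND_COMPENSATED",  # hypothetical event type
            "lotId": command["lotId"],
            "correlationId": command["correlationId"],
            "reason": str(exc),
        })
        return False
    return True


def failing_send(command: dict) -> None:
    """Test double that simulates a target ERP outage."""
    raise ErpUnavailable("target ERP offline")
```

In a test, passing failing_send as send and a list's append method as emit makes it easy to assert that exactly one compensating event carries the original correlationId.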

    Practical tools and vendors

    What we used and recommend when you want minimal ops overhead:

  • MQTT broker: Mosquitto for tiny deployments, EMQX or HiveMQ for enterprise features.
  • Event bus: NATS JetStream for low-latency, durable streams; or Redis Streams if you already run Redis.
  • Orchestrators and services: lightweight containers (Go or Python), using existing libraries for MQTT and NATS clients.
  • Storage: Postgres for canonical state and TimescaleDB if you want time-series performance for event history.

    This architecture trades off heavyweight orchestration for clear, testable components. You get traceability, replayability, and resilience with a small operational footprint and no single monolithic middleware. If you’d like, I can provide sample orchestrator pseudocode, MQTT topic ACL examples, or a reference Postgres schema for canonical lot mapping.