How to run an OT tabletop that proves your PLC hot‑failover and recovery actually work

I run tabletop exercises whenever I need to demonstrate, to operations and management alike, that a PLC hot‑failover and recovery strategy does more than look good on paper. A well‑designed tabletop converts theory into repeatable evidence: failover occurs within target time, I/O stays consistent, alarms behave, and the plant can continue production with acceptable performance. Below I share a practical, field‑tested approach that you can use to plan and execute an OT tabletop that proves your PLC hot‑failover and recovery actually work.

Why a tabletop — and why I prefer it before live cutovers

Tabletops are low‑risk, high‑value rehearsals. They let you exercise operational decision‑making, communications, and control logic responses without touching live process equipment. Over the years I’ve run dozens: some with Tier‑1 OEMs and others with process plants. The biggest payoff is alignment — engineering, operations, IT/OT, and safety all agree on observable criteria for “success” before anyone pulls a breaker on the shop floor.

Who should be in the room

Operations lead (shift supervisor) — owns the production decisions.
Control engineer(s) — responsible for PLC configuration, failover pair settings, and I/O mapping.
SCADA/MES representative — to validate supervisory behaviour and historian continuity.
OT network engineer — to confirm network redundancy and routing during failover.
Safety/HSSE representative — to ensure interventions are safe and compliant.
IT or cybersecurity representative — to observe authentication, access control, and logging aspects.
Stakeholder observer(s) — plant manager or project sponsor for acceptance.

Define success criteria up front

I always insist on measurable acceptance criteria before the exercise starts. Vague statements like “system recovered” are worthless; the team needs clear KPIs. Example metrics I use:

Metric	Target
Failover time (controller A → B)	< 5 seconds (adjust per process tolerance)
Maximum lost I/O samples	< 1 sample per critical tag
Alarms generated during failover	No spurious critical alarms; expected transient notifications only
SCADA/Historian continuity	No data gaps for configured critical tags
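
One way to keep these criteria honest is to write them down in a form that produces a mechanical pass/fail verdict. The following is a minimal Python sketch, not a prescribed tool: the class names, field names, and the tag name `FIC101.PV` are hypothetical, and the thresholds simply mirror the example table above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Criteria:
    """Thresholds copied from the example table; tune them per process."""
    failover_limit_s: float = 5.0      # failover must finish in under this time
    lost_sample_limit: int = 1         # lost samples per critical tag must be below this
    spurious_critical_alarms: int = 0  # allowed spurious critical alarms
    historian_gaps: int = 0            # allowed data gaps on critical tags


@dataclass(frozen=True)
class Observation:
    """What the team measured for one scenario (hypothetical structure)."""
    failover_s: float
    lost_samples: Dict[str, int]       # critical tag name -> samples lost
    spurious_critical_alarms: int
    historian_gaps: int


def evaluate(obs: Observation, crit: Criteria = Criteria()) -> Dict[str, bool]:
    """Return a pass/fail verdict for each metric in the table."""
    worst_lost = max(obs.lost_samples.values(), default=0)
    return {
        "failover_time": obs.failover_s < crit.failover_limit_s,
        "lost_io_samples": worst_lost < crit.lost_sample_limit,
        "alarms": obs.spurious_critical_alarms <= crit.spurious_critical_alarms,
        "historian_continuity": obs.historian_gaps <= crit.historian_gaps,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from a real plant.
    obs = Observation(failover_s=3.2, lost_samples={"FIC101.PV": 0},
                      spurious_critical_alarms=0, historian_gaps=0)
    for metric, passed in evaluate(obs).items():
        print(f"{metric}: {'PASS' if passed else 'FAIL'}")
```

Agreeing on a structure like this before the exercise forces the room to settle what "expected transient notifications" and "critical tag" mean in numbers.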

Prepare the scenario and artifacts

Successful tabletops are scripted but not scripted to death. I prepare a few realistic failure modes and artifacts that the team can interact with:

Failure scenarios: controller CPU crash, controller power loss, network switch failure, media redundancy loss, device firmware hang.
Process context: state of the line (running at 75% speed, batch in progress, recipe A active) — include exact setpoints and timers so observers can judge impact.
Diagrams: current PLC pair architecture, I/O mapping, Ethernet ring topology or PRP/HSR details, and SCADA tag flow.
Checklists: operator steps to verify control loop behaviour, and engineering test steps to validate PLC pair configuration (e.g., mutual store, memory sync).

Run the exercise — a sample flow I use

Here’s the flow I walk the team through (you can adapt timings and depth to your plant):

Kickoff: confirm success criteria, roles, and safety rules.
Baseline verification: show current PLC pair status, synchronized memory version, active controllers, and SCADA connected tags.
Simulated failover 1 (CPU fault): have the control engineer explain what “CPU fault” means, then simulate it (for example, inject a simulated event or remove the CPU from the pair in the engineering station). Observe failover timing and operator cues.
Validation: engineering validates I/O continuity, operations confirms process setpoints and alarms, and historian analyst checks for gaps.
Simulated failover 2 (network partition): simulate link loss on the primary controller’s network segment and observe whether control switches to the secondary or falls back to local control.
Edge cases: re‑introduce the failed controller to confirm that recovery and resync don’t alter process outputs or recipes.
Documentation: capture timestamps, screenshots, and historian extracts for each test.
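
The timestamps captured in the documentation step are what turn "it felt fast" into a failover time. Here is a small Python sketch of that calculation, assuming the team logs each event as an ISO 8601 timestamp with a label. The labels `fault_injected` and `secondary_active` are hypothetical names, and the result is only meaningful if the sources share a synchronized clock.

```python
from datetime import datetime
from typing import Iterable, Tuple


def failover_seconds(events: Iterable[Tuple[str, str]],
                     start: str = "fault_injected",
                     end: str = "secondary_active") -> float:
    """Seconds between the first `start` event and the first `end` event.

    `events` is an iterable of (iso_timestamp, label) pairs.
    """
    first_seen = {}
    for stamp, label in events:
        # Keep only the first occurrence of each label.
        first_seen.setdefault(label, datetime.fromisoformat(stamp))
    if start not in first_seen or end not in first_seen:
        raise ValueError(f"need both '{start}' and '{end}' events")
    elapsed = (first_seen[end] - first_seen[start]).total_seconds()
    if elapsed < 0:
        raise ValueError("end event precedes start event; check clock sync")
    return elapsed


if __name__ == "__main__":
    # Illustrative log entries, not real measurements.
    log = [
        ("2024-05-01T10:00:00.000", "fault_injected"),
        ("2024-05-01T10:00:01.250", "secondary_active"),
    ]
    print(f"failover took {failover_seconds(log):.3f} s")
```

The negative-duration check is deliberate: in my experience a result that runs backwards in time almost always points to unsynchronized clocks rather than a miraculous failover.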

Evidence collection — show, don’t just tell

Management needs artifacts. I insist on capturing the following during every tabletop:

Time‑stamped screen recordings of PLC status pages and SCADA trending during the event.
Historian export for critical tags covering before/during/after the event.
Network device logs showing link state changes, STP or PRP events, and port flaps.
PLC controller logs showing mode changes, synchronization events and any error codes (for example, Rockwell Logix messages, Siemens diagnostics, or Schneider alerts).
A short incident report per scenario that lists observed behaviors vs acceptance criteria, and whether the scenario passed/failed.

Things I always watch for (common gotchas)

Unmapped tags: SCADA or MES is sometimes configured to address only the active controller; make sure tag redundancy is configured so supervisory systems don’t lose data during the switch.
Non‑deterministic logic: timers or sequence steps that rely on single CPU timestamps can behave oddly after a hot swap — validate sequence machines explicitly.
Network convergence time: redundant controllers are only as fast as the network’s ability to reconverge after a link or switch failure; capture switch convergence times and include them in your risk model.
Operator procedures: unclear instructions during failover are a significant risk — use the tabletop to refine operator prompts and handover steps.
Firmware/compatibility: mismatched firmware on pair members often causes subtle resync failures — check versions and perform controlled upgrades during maintenance windows, not during cutover.
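
The firmware check in particular is easy to automate as part of baseline verification. Below is a minimal Python sketch, assuming you have already collected each chassis's module firmware revisions into a dictionary keyed by slot. How you collect them is vendor-specific and not shown, and the slot names and revision strings are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def firmware_mismatches(primary: Dict[str, str],
                        secondary: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Compare per-slot firmware revisions of a redundant pair.

    Returns {slot: (primary_rev, secondary_rev)} for every slot where the
    revisions differ; a module missing on one side shows up as None.
    """
    issues = {}
    for slot in sorted(set(primary) | set(secondary)):
        a, b = primary.get(slot), secondary.get(slot)
        if a != b:
            issues[slot] = (a, b)
    return issues


if __name__ == "__main__":
    # Hypothetical inventory; revision strings are placeholders.
    chassis_a = {"slot0_cpu": "33.011", "slot1_comms": "11.002", "slot2_sync": "20.004"}
    chassis_b = {"slot0_cpu": "33.011", "slot1_comms": "11.001"}
    for slot, (a, b) in firmware_mismatches(chassis_a, chassis_b).items():
        print(f"{slot}: primary={a} secondary={b}")
```

An empty result is a reasonable precondition for starting the failover scenarios; anything else goes on the remediation list before the exercise continues.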

Tools and vendor specifics worth mentioning

I don’t endorse one vendor over another, but different platforms have different behaviours and tools that can help you prove failover:

Rockwell (ControlLogix redundancy, using a redundant chassis pair with redundancy modules): useful diagnostic logs and module status views in Studio 5000.
Siemens (S7 redundant systems such as S7‑400H or S7‑1500R/H; note that fail‑safe F‑CPUs address functional safety rather than redundancy): strong diagnostic traces and HMI/WinCC alarms when configured correctly.
Schneider EcoStruxure — clear redundancy reporting and power/CPU event histories.
SCADA/historian: Inductive Automation Ignition or AVEVA PI System (formerly OSIsoft PI). Both can provide tag continuity proof if you configure multiple data sources or buffering correctly.
Network: use Wireshark or vendor switch logs to capture ARP/PRP/HSR behaviour during events.

After the tabletop — capture learnings and run drills

After every tabletop I produce a short, actionable report: what passed, what failed, root causes, and who owns remediation. I schedule follow‑up drills to validate fixes. Over time those drills shorten your actual recovery time, let you tighten the recovery time objective (RTO), and build confidence across teams. That’s the real objective: people and systems that behave predictably when things go wrong.