If you work in OT and your screens are constantly lit up with red, here is the hard truth. Your alert system is not protecting you. It is training your operators to ignore risk. Most OT environments brag about advanced monitoring and comprehensive alerting. In reality, many control rooms are drowning in noise. Operators click acknowledge on alarms they do not truly understand, just to make the screen usable again. Then one day, something serious slips through with all the rest.
This is not a technology problem first. It is a behaviour problem. Your alert design shapes what operators see, what they ignore, and when they react too late. If the design is lazy, your system does not create visibility. It creates blind operators.
Most OT alerting is broken by design because it did not start from a clear philosophy. It grew like a messy patchwork. A rule added here, a vendor default left there, a temporary alert that never got removed. Over time, this becomes a wall of noise. When something unusual happens, the system does not give one clear signal. It throws an alert storm. One device goes down, and ten different systems notice it. Each one fires several alerts. Now an operator is hit with dozens of messages for one underlying event. That is not intelligence. That is panic wrapped in red. In that moment, they do not calmly analyse. They start clearing.
Categories are often just as bad. Terms like informational, warning, critical, security, and system look tidy on a slide deck, but in a real environment, they are used inconsistently. A warning might be more serious than a critical in practice. A security alert might be treated like wallpaper because it fires all day for low-value events. If the labels do not map to how operators think about risk, they do not guide behaviour. They just decorate it.
On top of that, most alert systems fail at the simplest thing: clear priority. If everything looks urgent, nothing feels urgent. When a screen is filled with red banners, flashing icons and intrusive pop-ups for minor issues, the operator’s brain does exactly what you would do. It stops believing the system. People build their own hidden ranking. These alerts go off all the time; they are nothing. Those we only care about at night. That one we ignore unless someone physically calls. At that point, your priority scheme is not the one in the software. It is the one in the operator’s head. That is a design failure you created.
The text of the alerts themselves often makes the situation worse. Look at the actual messages in your system. Many are vague to the point of uselessness. An anomaly was detected on the device. Issue on the host. Security event triggered. None of that tells an operator, in a few seconds, what is happening, what is at risk, and what they should do right now. If they have to click through multiple screens, cross-match tags or call someone else just to understand the basics, you have created puzzles, not alerts. Puzzles are fine for after-action reviews. During a live shift, puzzles are ignored.
To fix alert design, you have to start with how human attention really works in a control room. Your operators are not lazy, and they are not machines. They are humans under constant cognitive load. Their focus is limited. Throw too many things at that focus, and the brain responds with shortcuts.
One of the main shortcuts is brutal: most of this is noise; ignore it. That is not a moral weakness. It is a survival tactic. If the majority of alerts never matter, operators must filter, or they burn out. Habituation is another hard reality you cannot wish away. If the same alert fires all the time and nothing bad ever follows, the brain downgrades it automatically. Think of fire alarms in buildings where drills happen every week for no reason. People stroll out slowly because experience has taught them that nothing serious happens. The same thing happens in OT. If a pump vibration alarm goes off ten times a day and nothing meaningful ever follows, that alarm becomes invisible. When one day it really does signal a problem, you will still get the same shrug. You trained that response.
“I will check it later” is the next symptom. In many control rooms, this phrase is part of the normal culture. Translated into truth, it means: I am busy, I do not see obvious danger, and historically this type of alert rarely leads to anything important. When your system constantly throws low-value alerts at people, you teach them to postpone. Attackers know this. They hide inside the classes of alerts everyone has learned to delay.
Human brains also hunt for patterns and shortcuts. Operators naturally build mental rules: this always happens when backups run, or that sensor always misbehaves in hot weather. These shortcuts are useful when they match reality, but they are dangerous when they blur the line between real risk and routine noise. If your alert design does not clearly separate normal patterns from abnormal threats, your operators will start treating both as the same old behaviour. That is how genuine attacks get waved away as usual Monday noise.
A useful OT alert respects all of these limits. It is not trying to be clever. It is trying to be clear. Every well-designed alert must allow an operator to answer three questions in seconds.
First, what is happening? Not some vague anomaly wording but a specific situation, such as an unauthorised login attempt to PLC three or an unexpected change to safety interlock configuration on line four. Second, what is at risk? The alert should spell out why this matters now, whether it is the potential loss of control to a critical pump, a possible safety risk to high-pressure equipment, or likely process downtime if nothing is done. Third, what is the required action, and how urgent is it? Call the on-duty security engineer now and treat it as a high priority. Verify the configuration change and isolate the workstation if not authorised. Monitor the trend for ten minutes and escalate if it continues to rise.
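One way to make those three questions non-negotiable is to bake them into the alert format itself, so an alert physically cannot be created without them. The sketch below is a minimal illustration in Python; the field names and the example PLC alert are hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass

@dataclass
class OperatorAlert:
    """An alert is incomplete unless all three questions are answered."""
    what_is_happening: str  # a specific situation, not "anomaly detected"
    what_is_at_risk: str    # why this matters right now
    required_action: str    # the concrete next step for the operator
    urgency: str            # e.g. "act now", "verify within 15 minutes", "monitor"

# Hypothetical example, modelled on the PLC scenario above
alert = OperatorAlert(
    what_is_happening="Unauthorised login attempt to PLC 3 from engineering workstation ENG-07",
    what_is_at_risk="Potential loss of control of the dosing pump on line 4",
    required_action="Call the on-duty security engineer and verify whether the login was planned",
    urgency="act now",
)
```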
If an alert does not answer those three questions, you have designed something incomplete, and the operator will either ignore it, misjudge it or act on it too late. That is your design choice, not their personal flaw.
To move from noise to signal, you need simple but strict rules. The first is a hard limit on how many alerts any one operator should see in a day or shift. The exact number will vary by environment, but the principle is not negotiable. If your system produces more than that limit, you must remove low-value alerts or merge related ones into a single event. If you refuse to enforce a ceiling, volume will grow until your system collapses under its own spam.
The second rule is a clean separation between safety alerts, process alerts and security alerts. Dumping everything into one generic stream is lazy and dangerous. Safety is about people and physical damage. Process is about quality, performance and uptime. Security is about access, misuse and hostile behaviour. These should not look and feel the same. Colours, sounds and escalation paths must reflect what is at stake. An operator should be able to glance at a screen and instantly know whether this is a safety issue, a process issue or a security event. That clarity also allows training and ownership to be aligned with reality.
The third rule is to reserve intrusive sounds and pop-ups only for truly high-priority events. If you let minor warnings scream, you force everyone to mute the system just to stay sane. When something really serious happens later, the sound will mean nothing. You do not want silence. You want precision. When an alarm makes noise in a well-designed system, people should feel that it is rare and that it demands action.
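None of these first three rules needs a new product. Most of them can be written down as a small, explicit policy that the alerting pipeline enforces. A minimal sketch, assuming invented thresholds, colours and owners that would have to be set per plant:

```python
# Hypothetical alerting policy; the numbers, colours and owners are illustrative only.
ALERT_POLICY = {
    # Rule one: a hard ceiling per operator per shift; breaching it forces tuning, not acceptance.
    "max_alerts_per_operator_per_shift": 50,
    # Rule two: safety, process and security get distinct visual identities and owners.
    "classes": {
        "safety":   {"colour": "red",    "owner": "shift supervisor"},
        "process":  {"colour": "blue",   "owner": "process engineer"},
        "security": {"colour": "orange", "owner": "security on-call"},
    },
    # Rule three: sounds and pop-ups are reserved for the highest priority only.
    "intrusive_priorities": {"high"},
}

def allow_intrusive_cue(priority: str) -> bool:
    """Only high-priority events may make noise or interrupt the operator."""
    return priority in ALERT_POLICY["intrusive_priorities"]
```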
The fourth rule is to kill duplicates. One event should not spawn multiple separate alerts on the same screen. That is an attention tax with no return. Use correlation so that related signals from the same issue are grouped into a single event with a clear summary and maybe a counter or timeline. The operator should deal with one clear description of reality, not ten overlapping fragments of it.
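Correlation does not need to be sophisticated to remove most of this duplication. A rough sketch of the idea, assuming each raw alert carries a source identifier, a message and a timestamp and that the list is sorted by time, is to fold everything from the same source within a short window into one event with a counter:

```python
from datetime import timedelta

def correlate(alerts, window=timedelta(minutes=5)):
    """Group raw alerts from the same source into single events with a count.

    `alerts` is assumed to be a time-sorted list of dicts with 'source',
    'message' and 'timestamp' keys; the five-minute window is illustrative.
    """
    events = []
    open_events = {}  # source -> the event currently being grouped
    for alert in alerts:
        source, ts = alert["source"], alert["timestamp"]
        current = open_events.get(source)
        if current and ts - current["last_seen"] <= window:
            # Same source, close in time: treat it as the same underlying event.
            current["count"] += 1
            current["last_seen"] = ts
        else:
            # A new underlying event for this source.
            current = {
                "source": source,
                "summary": alert["message"],
                "first_seen": ts,
                "last_seen": ts,
                "count": 1,
            }
            open_events[source] = current
            events.append(current)
    return events
```

The operator then sees one event per source with a first-seen time, a last-seen time and a count, instead of ten overlapping fragments of the same problem.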
Think about a typical plant before and after this cleanup. Before, each operator saw hundreds of alerts a day. Vendors had left default alarms enabled. Security events were mixed with low-level system warnings. The same network event might fire in three tools with three different labels. Operators admitted they acknowledged alerts without reading them properly. The culture sounded like: those security alerts go off all the time, they are nothing, and if something is really wrong, engineering will call us. When an incident was investigated, the logs showed that the system did warn them. The warning was just buried in the flood.
After a serious redesign, things look different. Management sets a real alert limit. Safety, process and security now have distinct views and cues. Low-value alerts and dead categories are removed or moved into reports. Multiple noise signals are rolled up into single, meaningful events. Alert messages are rewritten so that they always spell out what is happening, what is at risk and what to do. Operators now see far fewer alerts in a shift, but each one is sharper. When a high-priority alert fires, it looks and sounds different. Instead of I will check it later, there is a clear playbook.
In incidents, operators are calmer because the system is not throwing everything at them at once. It is surfacing what matters. None of that required a new product. It required discipline.
You can start imposing that discipline with a one-week audit. Pull the last thirty days of alerts from your OT monitoring, SCADA, security and network tools. Look at the sheer volume and distribution. Then identify which categories rarely lead to tickets, calls or actual changes. If one category fires thousands of times and produces no action, that is not monitoring; it is spam. Turn it off, throttle it or convert it into a summary report. Then look at individual alert types that are constantly ignored. Ask directly whether they have ever led to a meaningful response. If the answer is no, redesign them or remove them from the live operator view.
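If your tools can export those thirty days as a CSV, even a rough script makes the spam visible. A sketch under that assumption; the file name and column names here are hypothetical and will differ per tool:

```python
import csv
from collections import Counter

# Assumes an export with at least 'category' and 'action_taken' columns.
counts, actioned = Counter(), Counter()
with open("ot_alerts_last_30_days.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["category"]] += 1
        if row["action_taken"].strip().lower() not in ("", "none"):
            actioned[row["category"]] += 1

for category, total in counts.most_common():
    action_rate = actioned[category] / total
    flag = "  <-- candidate for removal or a summary report" if total > 500 and action_rate < 0.01 else ""
    print(f"{category:30s} {total:6d} alerts, {action_rate:.1%} led to action{flag}")
```

The thresholds are placeholders; the point is to see, in one table, which categories produce thousands of alerts and almost no action.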
Next, take your twenty most important alerts and rewrite them until they clearly answer what is happening, what is at risk and what the operator must do. Sit real operators down and ask, if this popped up, what would you do? If they hesitate, your alert is still weak. Finally, enforce a rule for new alerts. No new rule goes live unless it has a clear owner, a defined response, a measurable benefit and a plan for tuning or removal if it becomes noisy.
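That last rule is easier to enforce when it exists as a required definition rather than a verbal agreement. A minimal, hypothetical gate might look like this; the field names and the example rule are illustrative only:

```python
# No new alert rule goes live with a blank field.
REQUIRED_FIELDS = ("owner", "defined_response", "measurable_benefit", "tuning_or_removal_plan")

new_alert_rule = {
    "name": "Unexpected firmware change on safety PLCs",
    "owner": "OT security lead",
    "defined_response": "Verify the change record; isolate the engineering workstation if unplanned",
    "measurable_benefit": "Detects unauthorised logic changes within one polling cycle",
    "tuning_or_removal_plan": "Review firing rate after 30 days; demote to a report if it is mostly planned changes",
}

missing = [field for field in REQUIRED_FIELDS if not new_alert_rule.get(field)]
assert not missing, f"Alert rule rejected; missing: {missing}"
```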
If your alert design leaves operators blind, that is not a user problem. It is your problem. You do not get to hide behind the phrase human error when you trained that error with constant noise. If everything screams all day, of course, nothing is heard. Fix the design. Respect human attention. Turn your alerts from a flood of distractions into a small set of clear, actionable signals. Then when something truly dangerous happens, your operators will not be mindlessly clicking acknowledge out of habit. They will see it, understand it and act. That is the point of alerting. Everything else is decoration.