Fault Tolerance and Mitigating Risk

You may hear the terms internal faults, external faults, redundancy, availability, and reliability tossed about in discussions of control system architectures. Emerson’s Dave Denison, a software engineering manager in the DeltaV technology organization, wrote a great article, "Architecture For Mitigating Effects Of External Faults: Choosing Tools And Techniques For Creating Fault-Tolerant Control Environments And Networks," in Control Engineering magazine.

Dave opens by highlighting that the severity of a fault is largely determined by the application. Faults can range from a nuisance to the cause of a catastrophic situation, and therefore:

Each plant site must establish carefully engineered strategies for mitigating the effects of a fault on plant operations and must develop action plans for containing the risk.

He defines a fault:

…a failure, a defect, or a flaw in a component or a device. In the context of plant automation systems, a fault in a device causes the device to malfunction, so it does not provide its expected or designed function.

These faults can be classified as internal or external. Internal faults usually stem from hardware or software design flaws and tend to be repeatable, unless caused by manufacturing defects. External faults, as the name implies, originate outside the device. Their causes include:

…environmental effects (electromagnetic interference, temperature changes), operational faults (operator errors), accidental damage (power surges, physical damage to network equipment), and maintenance/installation faults (improper grounding, shorting).

Control systems are designed to help process manufacturers mitigate these risks through fault-tolerant architectures. He defines fault tolerance as the:

…ability of a system to perform its function correctly even in the presence of faults. The purpose of fault tolerance is to increase the reliability and availability of a system, allowing it to respond gracefully to an unexpected fault. The level of gracefulness in a fault condition may be measured in terms of the availability of the system and operational degradation to system functionality.
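One common way to put a number on availability, which is not taken from Dave's article, is the steady-state ratio of mean time between failures (MTBF) to MTBF plus mean time to repair (MTTR). The sketch below uses that ratio and, assuming the two units fail independently, shows how duplicating a unit raises availability. The function names and figures are illustrative only.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the unit is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def duplicated_availability(unit_availability):
    """Availability of two redundant units where either one suffices.

    Assumes the units fail independently, which a common external
    fault (for example, a shared power surge) would violate.
    """
    unavailability = 1.0 - unit_availability
    return 1.0 - unavailability ** 2


# Illustrative figures, not taken from the article.
single = availability(mtbf_hours=990.0, mttr_hours=10.0)
pair = duplicated_availability(single)
print(f"single unit: {single:.4f}, redundant pair: {pair:.4f}")
```

The independence assumption is the weak point: external faults often hit both units at once, which is one reason the article also discusses diverse technologies.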

Dave describes how well-designed systems help mitigate the impact of faults through fault recovery, fault containment, and redundancy. Technologies such as error-correcting memory, watchdog timers, and software checkpointing can help a system recover from faults. Faults can be contained with barriers such as firewalls, intrinsically safe I/O, and memory management units, to name a few examples.
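To make two of the recovery techniques concrete, here is a minimal sketch of a software watchdog combined with checkpointing. It is a toy model, not how any particular control system implements these features: the `Watchdog` and `FakeClock` classes, the `run_with_recovery` function, and the step data are all hypothetical, and a simulated clock stands in for real time so the example is deterministic.

```python
import copy


class FakeClock:
    """Simulated clock so the example does not depend on real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class Watchdog:
    """Reports a fault if it is not kicked within timeout_s seconds."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        self.last_kick = self.clock()

    def expired(self):
        return self.clock() - self.last_kick > self.timeout_s


def run_with_recovery(steps, watchdog, clock):
    """Run (duration_s, delta) steps, rolling back any step that overruns.

    A step that takes longer than the watchdog timeout is treated as a
    fault: its result is discarded and state returns to the checkpoint.
    """
    state = {"total": 0}
    checkpoint = copy.deepcopy(state)
    recoveries = 0
    for duration_s, delta in steps:
        watchdog.kick()
        clock.advance(duration_s)      # simulate the work taking time
        state["total"] += delta
        if watchdog.expired():
            state = copy.deepcopy(checkpoint)   # recover from the fault
            recoveries += 1
        else:
            checkpoint = copy.deepcopy(state)   # save known-good state
    return state, recoveries


clock = FakeClock()
watchdog = Watchdog(timeout_s=1.0, clock=clock)
# The second step overruns the 1-second timeout and is rolled back.
final_state, recoveries = run_with_recovery(
    [(0.5, 10), (2.0, 99), (0.2, 5)], watchdog, clock
)
print(final_state, recoveries)
```

In a real controller the watchdog is typically a hardware timer that resets the processor, and the checkpoint lives in protected memory, but the pattern of "save known-good state, detect the stall, restore" is the same.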

Redundancy is a broad area and covers control, communications, and power. Dave cites ways redundancy can be designed including:

  • Simple duplication
  • Diverse technologies
  • Active/hot standby redundancy, and
  • Lockstep redundancy

Dave provides examples and illustrations of these redundancy techniques and some of the strengths and weaknesses of each. For example, the lockstep redundancy method eliminates the switchover time window in an active/hot standby fault tolerance approach.
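The switchover window can be illustrated with a small simulation. This is a deliberately simplified sketch under stated assumptions: `control` is a stand-in for a real control calculation, the failure is assumed to be detected instantly, and the lockstep model ignores the output comparison and voting that real lockstep hardware performs. All names and numbers are hypothetical.

```python
def control(x):
    """Stand-in for a real control calculation."""
    return 2 * x


def hot_standby(inputs, fail_at, switchover_cycles):
    """Active unit fails at cycle fail_at; standby takes over afterwards.

    Cycles inside the switchover window produce no valid output (None).
    """
    outputs = []
    for cycle, x in enumerate(inputs):
        if cycle < fail_at:
            outputs.append(control(x))          # active unit is healthy
        elif cycle < fail_at + switchover_cycles:
            outputs.append(None)                # switchover in progress
        else:
            outputs.append(control(x))          # standby is now active
    return outputs


def lockstep(inputs, fail_at):
    """Both units compute every cycle, so a healthy result always exists."""
    outputs = []
    for cycle, x in enumerate(inputs):
        unit_a = control(x) if cycle < fail_at else None   # unit A fails
        unit_b = control(x)                                # unit B healthy
        outputs.append(unit_a if unit_a is not None else unit_b)
    return outputs


inputs = [1, 2, 3, 4, 5]
print(hot_standby(inputs, fail_at=2, switchover_cycles=2))
print(lockstep(inputs, fail_at=2))
```

In this model the hot standby run loses two cycles of output after the failure, while the lockstep run produces a value on every cycle, at the cost of running both units all the time.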

Dave sums up his thoughts:

A customized effort is required for each plant site to balance risk of failure against the affordability of each fault tolerant solution. A state-of-the-art process automation system allows a user to select mitigation strategies based on fault tolerant components at control, operations, and business system levels in order to optimize reliability, lower cost, and reduce the risk of failure.

If you would like to better understand ways to contend with faults and methods of redundancy, Dave’s article is a great place to start.
