
Increased functional safety is a ‘Must Have’ in networked embedded designs


By eeNews Europe



Embedded networked systems are increasingly called upon to control vast sections of the industrial infrastructure in the modern economy. Some systems require extraordinary safety and reliability to eliminate, as much as possible, failures that can result in dramatic financial losses or loss of life. Familiar examples of these safety-critical applications are mass transportation, power generation, and oil drilling and transport. Embedded systems are also used in applications where the results of failures are not catastrophic but can still cause significant losses in process or manufacturing efficiency. When faults are detected early and failures are avoided, these material and efficiency losses can be avoided as well. Additionally, a networked system is not really safe if it is not secure. Malicious users can hijack an embedded system, or an embedded system can become the (perhaps unintentional) target of a virus or worm. These types of attacks can damage or render inoperable an entire system or complex. Clearly, in many cases both advanced reliability and security capabilities will be requirements in networked embedded designs.

Perhaps looking at an example design can best illustrate some of the key aspects and implementation options when improved reliability and security are required. Process control systems are one of the most useful examples to consider, particularly since the discovery of network transmitted worms that attack not only traditional PC operating systems, but embedded control systems as well (like the so-called Stuxnet computer worm). A block diagram of an example embedded process control system is shown in Figure 1, below.


Figure 1 Example embedded networked process control system


An Industrial Ethernet Switch is used to connect the controller to the network via an upstream node and a downstream node. A System Controller manages the overall operation of the Process Control System, including the Ethernet Switch and the power subsystem. A separate Equipment Controller, supervised by the System Controller, manages the equipment interface. The Equipment Controller implements any low-level control loop processes required by the system. Higher-level process management resides within the System Controller under supervision via the network, perhaps by a centralized system that manages the entire manufacturing or chemical processing complex. This separation of control functions simplifies the implementation of the real-time aspects of both the equipment control and network traffic management (for example, interrupt response time, memory bandwidth allocation and active task priority determination). Let’s look at ways to make this example system more reliable and secure.

System failure rates
Every system has some possibility of failing, since it is impossible to design a system with an absolute zero failure rate. Thus each application should be designed with a target acceptable failure rate. The IEC 61508 standard specifies acceptable failure rates for a variety of Safety Integrity Levels (SILs) based on the consequences of a system failure. The specification originally applied solely at the system level but has also been applied to products and components by addressing electrical, electronic and programmable electronic elements, covering both hardware and software. We will assume that our design falls within SIL 2 (perhaps because the controller manages a hazardous liquid as part of its function).
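As a rough illustration of how a SIL target translates into a number a designer can check against, the sketch below encodes the bands IEC 61508 publishes for high-demand or continuous mode, expressed as the probability of a dangerous failure per hour (PFH). The function names are hypothetical, and the sketch assumes the high-demand bands are the relevant ones for this controller; low-demand systems are assessed with a different measure.

```python
# IEC 61508 target bands for high-demand / continuous mode, given as
# (lower bound inclusive, upper bound exclusive) dangerous failures per hour.
SIL_PFH_BANDS = {
    1: (1e-6, 1e-5),
    2: (1e-7, 1e-6),
    3: (1e-8, 1e-7),
    4: (1e-9, 1e-8),
}


def sil_for_pfh(pfh):
    """Return the SIL whose band contains pfh, or None if no band does."""
    for sil, (low, high) in SIL_PFH_BANDS.items():
        if low <= pfh < high:
            return sil
    return None


def meets_sil(pfh, target_sil):
    """True if pfh is below the upper bound of the target SIL band."""
    return pfh < SIL_PFH_BANDS[target_sil][1]
```

Under these assumptions, a controller estimated at 5e-7 dangerous failures per hour would sit inside the SIL 2 band, while one estimated at 5e-6 would not meet the SIL 2 target.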


Table 1: IEC 61508 Safety Integrity Levels


Looking at the example design shown in Figure 1, we can imagine some possible failure modes and their effect on the overall system. An error in the Equipment Controller might allow hazardous liquid to build up in the system until a rupture takes place, creating a life-threatening system failure. Similarly, an error in the System Controller might cause it to miss warnings from the Equipment Controller, which could also result in life-threatening failures. An error in the Ethernet Switch (a constant message broadcast, for example) could bring down the entire network and threaten the entire complex, not just a single node. Note that the System Controller also manages the power supply subsystem (not an unusual feature of embedded controllers), so an error associated with the power supply could cause a dramatic system failure. This is also a potential weakness for a malicious attacker to exploit if they wanted to inflict permanent damage on the system.

We also need to look at possible failure modes when remote code updates or other sensitive messages are sent over the network. Without a sufficient level of data protection, transmission errors or malicious attacks could alter program code execution, incorrectly adjust trigger levels or capture sensitive operating parameters. Standard error detection functions (like a Cyclic Redundancy Check, or CRC) can be used to protect messages from transmission errors. The Ethernet Switch will automatically check messages for errors using this technique. If required, the System Controller can implement additional error detection and correction functions. Cryptographic protocols and standard encryption algorithms can be used to improve the security of network traffic within the system by securing the data in transit and authenticating remote facilities.
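A minimal sketch of CRC-based error detection is shown below. It uses Python's standard zlib.crc32, which implements the same CRC-32 polynomial used by the Ethernet frame check sequence. The framing (four checksum bytes appended to the payload) and the function names are assumptions made for illustration, not the on-wire Ethernet format.

```python
import zlib


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (illustrative framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


frame = frame_with_crc(b"SETPOINT=42")

# Simulate a transmission error by flipping one bit of the payload.
corrupted = bytearray(frame)
corrupted[3] ^= 0x01
corrupted = bytes(corrupted)
```

Here check_frame(frame) returns the original payload, while check_frame(corrupted) returns None. A CRC only guards against accidental corruption: an attacker who alters a message can simply recompute the checksum, which is why cryptographic authentication is needed for malicious changes.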

Single event upsets (SEUs) as a source of errors
The Single Event Upset phenomenon was first discovered in 1979 by Intel and Bell Labs as failures in DRAMs and is attributed to stray alpha particles or neutrons ‘flipping’ the memory cell. In 1999 Sun Microsystems noticed errors in cached SRAMs for mission-critical servers. In space and aviation applications the effects of radiation on electronics are well understood, as operational altitudes have a higher neutron flux. However, the SEU phenomenon is increasingly becoming a concern at sea level as well. The continuous drive to smaller semiconductor geometries reduces the charge held in each SRAM cell, and the ever-increasing content of electronics in fielded systems increases the likelihood of SEU-related SRAM errors. Note that Flash memories, which require a significantly higher energy level to ‘flip’ state, are immune to these types of SEU events.

Mitigation of errors via redundancy and design diversity
In safety-critical systems redundancy is mandatory to operate properly in the event of a failure. There are two well-known techniques that are widely utilized: Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR). In the case of Dual Modular Redundancy, duplicate designs work in parallel. Each processing element receives the same input and a fail-safe certification engine checks the outputs for consistency. If a fault is identified, preventive action must be taken to avoid a failure. Triple Modular Redundancy creates three duplicate designs, and the output of each is presented to a voting circuit such that the output state that receives the most votes is set. This can withstand the complete failure of one subsystem and allows a supervisor circuit to attempt to fix the fault or alert an operator.
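The TMR voting circuit described above can be sketched as a bitwise majority function, where each output bit takes the value held by at least two of the three channels. The function names are hypothetical, and the sketch models only the voter, not the supervisor that would repair or report the faulty channel.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three channel outputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_channels(a: int, b: int, c: int) -> list:
    """Indices (0, 1, 2) of channels whose output differs from the vote."""
    voted = tmr_vote(a, b, c)
    return [i for i, value in enumerate((a, b, c)) if value != voted]
```

For example, with outputs 0b1010, 0b1010 and 0b0110 the voter produces 0b1010 and the third channel (index 2) is flagged, which gives a supervisor the information it needs to reset that channel or alert an operator.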

A design diversity methodology is sometimes employed to further improve reliability. Using this methodology, parallel designs are not just duplicated but perform the same function using different implementations. For example, an FPGA might be used for one of the designs and an MCU for the parallel design. This diversity in the target implementations increases reliability even more, since errors related to complex design or implementation ‘bugs’ will not be duplicated in dramatically different targets.

Implementing redundancy in our example design
Let’s take our example design and look at how we can significantly improve reliability by using the redundancy techniques previously described. Figure 2 below shows the changes to the example system. In order to improve network reliability we added redundant Ethernet connections to the upstream node and the downstream node. The new redundant power subsystem helps recover from a failure in the main supply: power will switch over to a redundant supply if the main supply fails. The System Controller and Equipment Controller are now each implemented using a Dual Modular Redundancy (DMR) technique, as illustrated in the ‘blow-up’ of the Equipment Controller (the System Controller would use a similar technique). The controller functions are duplicated and compare logic is added to identify any outputs that do not ‘agree’. When such an error is detected, the subsystem responsible for the error can be reset and diagnostics performed. This mitigates the chance of the error resulting in a system failure. Note that the dual implementations of the System Controller use a design diversity technique. One controller is implemented with an MCU and the other is implemented with an FPGA. This provides additional reliability since each implementation’s error characteristics will be significantly different, and thus the chance of a common systematic error (for example in their response to noise, temperature, voltage, timing differences or even implementation ‘bugs’) will be significantly reduced.
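The compare logic and design diversity described above can be sketched in software as two differently built implementations of the same function whose outputs are compared on every step. Everything specific here is a hypothetical stand-in: the thresholds, the relief-valve function, and the choice of ‘valve fully open’ as the safe state are assumptions for illustration, and in the real design the two channels would be an MCU and FPGA fabric rather than two Python functions.

```python
LOW, HIGH = 20, 80   # hypothetical tank level thresholds, in percent
SAFE_OUTPUT = 100    # assumed safe state: relief valve fully open


def channel_a(level: int) -> int:
    """Arithmetic implementation: relief valve opening in percent."""
    if level <= LOW:
        return 0
    if level >= HIGH:
        return 100
    return (level - LOW) * 100 // (HIGH - LOW)


# Second channel: the same function realized as a precomputed lookup table.
TABLE_B = [min(100, max(0, (lvl - LOW) * 100 // (HIGH - LOW)))
           for lvl in range(101)]


def channel_b(level: int) -> int:
    """Table-driven implementation, with the index clamped to 0..100."""
    return TABLE_B[min(100, max(0, level))]


def dmr_step(level, chan_a=channel_a, chan_b=channel_b):
    """Run both channels; return (output, agreed).

    On a mismatch the compare logic cannot tell which channel is wrong,
    so the sketch drives the assumed safe output and reports the fault.
    """
    a, b = chan_a(level), chan_b(level)
    if a != b:
        return SAFE_OUTPUT, False
    return a, True
```

A supervisor could use the returned flag to reset the channels and run diagnostics, as the text describes. Because the comparison alone does not identify the faulty channel, falling back to a predefined safe state is one common design choice for DMR.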


Figure 2. Example design with DMR and design diversity


Using Microsemi SmartFusion2 SoC FPGAs
When implementing the example design from Figure 2, it would be possible to use separate components for the FPGA implementation of the controller, the MCU implementation and the Compare Logic blocks. However, these extra components create new possibilities for errors and system failures. An approach that integrates all of these functions into a single device has advantages for the system designer, namely better MTBF due to the reduction in component count, lower cost, and the ability to drive functional safety into smaller systems. The Microsemi SmartFusion2 architecture, shown in Figure 3, has a hardened ARM Cortex-M3 based Microcontroller Subsystem (MSS) and sufficient FPGA logic to integrate the entire Dual Equipment Controller in a single device, keeping component count the same as in our initial, non-redundant implementation. A SmartFusion2 device would be used for the Dual System Controller as well, where a single device could integrate the entire function. Additionally, SmartFusion2 has multiple SerDes channels (up to 16), so the Ethernet Switch could even be implemented in a single SmartFusion2 device using less signaling between the MAC and the physical layer than standard MAC/PHY interfaces. This would allow additional redundancy, error checking and advanced error recovery mechanisms to be included in the switch to create a more reliable design than those available in an ASSP.
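The MTBF argument can be made concrete with a simple series reliability model: if every component has a constant failure rate and any component failure brings down the function, the failure rates add. Rates are commonly quoted in FIT (failures per billion device-hours), so MTBF in hours is 1e9 divided by the total FIT. The FIT values below are invented purely for illustration and do not describe any real part.

```python
def mtbf_hours(fit_rates):
    """MTBF in hours for a series system, given per-component FIT rates.

    Assumes constant failure rates and that any single component
    failure causes a failure of the whole function.
    """
    return 1e9 / sum(fit_rates)


# Hypothetical FIT values, for illustration only.
discrete_design = [50, 40, 10, 20]   # FPGA, MCU, compare logic, config memory
integrated_design = [60]             # single device replacing all four

discrete_mtbf = mtbf_hours(discrete_design)      # 1e9 / 120
integrated_mtbf = mtbf_hours(integrated_design)  # 1e9 / 60
```

With these made-up numbers the integrated design has twice the MTBF of the discrete one. The real benefit depends entirely on the actual failure rates of the parts involved.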


Figure 3. SmartFusion2 architectural block diagram and key features


A SmartFusion2 design also benefits from other reliability advantages like zero FIT-rate configuration memory (due to its Flash memory implementation), SEU-protected memories, SECDED support for external memory controllers and built-in self-test. SmartFusion2 devices don’t require an external configuration device, which reduces component count and thus improves reliability. Additionally, SmartFusion2 devices have very low static power consumption, a unique ultra-low-power Flash Freeze mode (that can be used to preserve the state of the FPGA while stopping dynamic operation) and a very low start-up current requirement (unlike SRAM-based FPGA implementations, which have a large start-up current ‘spike’). These low-power features can result in smaller and less complex power supply requirements, further improving system reliability.
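SECDED stands for single error correction, double error detection. The sketch below shows the idea with a textbook extended Hamming (8,4) code that protects four data bits. It is not the code used inside any particular device (memory controllers typically protect much wider words), and the function names are hypothetical.

```python
def secded_encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    code[1..7] are the Hamming positions (parity at 1, 2, 4);
    code[0] is an overall parity bit over the other seven.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def secded_decode(code):
    """Return (nibble, status): 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 when overall parity is consistent
    if overall == 1:
        code[syndrome] ^= 1          # single error; syndrome 0 means bit 0
        status = "corrected"
    elif syndrome != 0:
        return None, "double_error"  # parity consistent but syndrome set
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of a codeword, as an SEU might, still decodes to the original data with status 'corrected', while flipping any two bits is reported as 'double_error' rather than silently returning wrong data.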

Protecting the system during remote reprogramming updates and upgrades requires a secure communications channel for the FPGA configuration data and MCU code. SmartFusion2 devices are programmed with a Differential Power Analysis (DPA) hardened bitstream protocol. Because of this, the built-in security features of the SmartFusion2 device can be used to ensure remotely sourced configuration bitstreams are protected from unauthorized observation and hacking.
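Independently of the device’s built-in protocol, the system-level principle of refusing any update image that fails authentication can be sketched with a keyed hash (HMAC-SHA256) from Python’s standard library. This is only an illustration of message authentication under the assumption of a pre-shared secret key. It does not model the SmartFusion2 bitstream protocol, encryption of the image, or DPA resistance, and the function names and sample data are hypothetical.

```python
import hashlib
import hmac


def sign_update(key: bytes, image: bytes) -> bytes:
    """Compute an authentication tag for an update image."""
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_update(key: bytes, image: bytes, tag: bytes) -> bool:
    """Accept the image only if the tag matches (constant-time compare)."""
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


shared_key = b"example-key-for-illustration-only"
image = b"\x01\x02\x03 hypothetical firmware image"
tag = sign_update(shared_key, image)
```

With these definitions, verify_update(shared_key, image, tag) succeeds, while the same call with a modified image or a different key fails. Unlike a CRC, the tag cannot be recomputed by an attacker who does not hold the key.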

Security is a relatively new aspect of functional safety. From a security perspective, functional safety can broadly be defined as being secure in the knowledge that what you designed and built performs the intended functions to an acceptable FIT rate. This article has focused mainly on design; however, there is an equally important aspect to functional safety, and that is supply chain and manufacturing assurance. Supply chain assurance is a guarantee that the components you are buying are genuine. Manufacturing assurance is a guarantee that the system built is genuine. SmartFusion2 devices have unique features that can aid the system designer in achieving these goals. Each device is shipped with an embedded X.509 digitally signed device certificate that contains the complete part number, date code and device version. The device certificate can be read during manufacturing and compared to the purchase order, verifying that a genuine Microsemi SoC FPGA has been placed on the PCB. During manufacturing programming of the device, a Certificate of Conformance can be generated that cryptographically verifies that the device was programmed with the intended bitstream. These two items can aid the system designer in proving the veracity of the system being built.

In the example system there may be additional requirements for security to protect your design IP. For example, it may be important to protect the manufacturing flow and to include anti-tampering features. The single-chip nature of the SmartFusion2 implementation protects against attempts to reverse engineer the design. An AES key protects the programming bitstream and also controls the number of units that can be programmed. This protects the design against cloning and overbuilding by manufacturing subcontractors. A zeroization feature, where the programmed design information can be quickly erased, can help create a very robust anti-tampering mechanism.

About the author
Tim Morin is Director Product Line Marketing, New Products, at Microsemi Corporation, SoC Products Group. Please visit www.microsemi.com for more details on SmartFusion2 SoC FPGAs.
 
