Fault Tolerance for Digital Systems

Herbert Hecht
SoHaR Incorporated, USA
 
Fault tolerance is an essential methodology for digital systems, particularly those serving applications in which failure has safety implications or in which an interruption of operations imposes serious financial penalties. No single fault tolerance technique is suitable, let alone optimal, in all circumstances. A taxonomy of fault tolerance techniques is presented, and its branches and leaves are described in terms of area of applicability, effectiveness, and cost of implementation. Gaps in coverage and deficiencies of an individual technique can be overcome by employing a hierarchical structure of fault tolerance provisions, also referred to as defense-in-depth. The large selection of techniques described here, together with the continuing improvements contributed by studies in the field, supports an encouraging outlook.