Software Reliability & Safety

ALD offers a range of services targeted at improving the reliability, dependability and safety of your software. Whether your software is safety-critical, mission-critical, or expected to satisfy strict reliability and availability requirements to be certified as market-ready, we can lead and support your effort in attaining these goals.

 

Software Failures
Before we list the tasks undertaken to analyze software reliability and safety it is important to understand the meaning of a failure due to software. Software does not exhibit the random or wearout related failure behavior we see in hardware. Software will always function in the same way as long as the same input and computer states are present. Software can cause system failures either because of design errors or implementation errors. Design errors are often caused by wrong assumptions about system operation, e. g., that input A is always followed by input B. Typical implementation errors are caused by confusing symbols, such as 'g' instead of 'G'. Software faults can only cause failures if the fault is encountered during usage. Therefore faults existing in frequently used code will cause failures more often than faults residing in rarely used code but the latter can be equally serious.  In mission or safety critical applications it is particularly important to review and test rarely used code.
 
Software Reliability
As is the case for hardware, software reliability engineering involves much more than analyzing test results, estimating remaining faults, and modeling future failure probabilities. 

Although in most organizations software test is no longer an afterthought, management is almost always surprised by the cost and schedule requirements of the test program, and it is often downgraded in favor of design activities. Often adding a new feature will seem more beneficial than performing a complete test on existing features. A good software reliability engineering program, introduced early in the development cycle, will mitigate these problems by:

 

• Preparing program management in advance for the testing effort and allowing them to plan both schedule and budget to cover the required testing.
• Continuous review of requirements throughout the life cycle, particularly for handling of exception conditions. If requirements are incomplete there will be no testing of the exception conditions.
• Offering management a quantitative assessment of the dependence of reliability metrics (software/system availability; software/system outages per day, etc.) on the effort (time and cost) allotted to testing.
• Providing the most efficient test plan targeted to bringing the product to market in the shortest time subject to the reliability requirements imposed by the customer or market expectations.
• Performing continuous quantitative assessment of software/system reliability and the effort/cost required to improve them by a specified amount.
 
ALD software reliability engineers are experienced in all the stages and tasks required in a comprehensive software reliability program.

We can support or lead such reliability program tasks as:
 
• Reliability Allocation
• Defining and Analyzing Operational Profiles
• Test Preparation and Plan
• Software Reliability Models
 
Reliability Allocation
Reliability allocation is the task of defining the necessary reliability of a software item. The item may be a part of an integrated hardware/software system, may be a relatively independent software application, or, more and more rarely, a standalone software program. In any of these cases our goal is to bring system reliability within either a strict constraint required by a customer or an internally perceived readiness level, or optimize reliability within schedule and cost constraints.
 
ALD will assist your organization in the following tasks:
 
• Derive software reliability requirements from overall system reliability requirements
• When possible, depending on lifecycle stage and historical data, estimate schedule and cost dependence on software reliability goals
• Optimize reliability/schedule/cost based on your constraints and your customer's requirements

Ideally, reliability allocation should be performed early in the lifecycle and may be modified and refined as both software and other system components are developed. At these early stages, ALD can assist in the above tasks with limited design and requirements inputs. As the system is developing and getting more mature, software reliability allocation becomes more accurate. The dependence of reliability allocation on cost and schedule also solidifies when the software goes into testing.  Although it is ideal to begin these tasks early on and follow during system evolution, often organizations do not implement a software reliability program until very late in the software development cycle. The delay may be up to the time of test preparation and plan, or even later when testing has started yielding results that need to be interpreted to verify or ascertain reliability. 

Defining and Analyzing Operational Profiles
The reliability of software is strongly tied to the operational usage of an application - much stronger than the reliability of hardware. A software fault may lead to a system failure only if that fault is encountered during operational usage. If a fault is not accessed in a specific operational mode, it will not cause failures at all. It will cause failure more often if it is located in code that is part of a frequently used "operation" (An operation is defined as a major logical task, usually repeated multiple times within an hour of application usage). Therefore in software reliability engineering we focus on the operational profile of the software which weighs the occurrence probabilities of each operation. Unless safety requirements indicate a modification of this approach we will prioritize our testing according to this profile.
 
ALD will work with your system and software engineers to complete the following tasks required to generate a useable operational profile:
 
• Determine the operational modes (high traffic, low traffic, high maintenance, remote use, local use, etc)
• Determine operation initiators (components that initiate the operations in the system)
• Determine and group "Operations" so that the list includes only operations that are significantly different from each other
  (and therefore may present different faults)
• Determine occurrence rates for the different operations
• Construct the operational profile based on the individual operation probabilities of occurrence.


Test Preparation and Plan
Test preparation is a crucial step in the implementation of an effective software reliability program. A test plan that is based on the operational profile on the one hand, and subject to the reliability allocation constraints on the other, will be effective in achieving the program's reliability goals in the least amount of time and cost. 
Software Reliability Engineering is concerned not only with feature and regression test, but also with load test and performance test. All these should be planned based on the activities outlined above.
 
The reliability program will inform and often determine the following test preparation activities:

• Assessing the number of new test cases required for the current release
• New test case allocation among the systems (if multi-system)
• New test case allocation for each system among its new operations
• Specifying new test cases
• Adding the new test cases to the existing test cases from previous releases

Software Reliability Models
Software reliability engineering is often identified with reliability models, in particular reliability growth models. These models, when applied correctly, are successful at providing guidance to management decisions such as:

• Test schedule
• Test resource allocation
• Time to market
• Maintenance resource allocation

The application of reliability models to software testing results allows us to infer the rate at which failures are encountered (depending on usage profile) and, more importantly, the changes in this rate (reliability growth). The ability to make these inferences depends critically on the quality of test results. It is essential that testing be performed in such a way that each failure incident is accurately reported.
ALD's software reliability engineers will work with developers, testers and program management to apply an appropriate model to your failure data. In order for the model prediction to be useful we must ensure that the assumptions and structure of the model coincide with the underlying coding and testing process. It is not sufficient to find a mathematical function that best fits the data. In order to infer future failure behavior it is crucial that the underlying assumptions of the model be understood in terms of program management and progress towards release. This requires experience working with the software reliability models as well as understanding of latent issues in the development and testing process that may impact the test data.

Software Safety
As systems and products become more and more dependent on software components it is no longer realistic to develop a system safety program that does not include the software elements.

Does software fail?
We tend to believe that well written and well tested safety critical software would never fail. Experience proves otherwise with software making headlines when it actually does fail, sometimes critically. Software does not fail the same way as hardware does, and the various failure behaviors we are accustomed to from the world of hardware are often not applicable to software. However, software does fail, and when it does, it can be just as catastrophic as hardware failures.

Safety-critical software
Safety-critical software is very different from both non-critical software and safety-critical hardware. The difference lies in the massive testing program that such software undergoes.

What are "software failure modes"?
Software, especially in critical systems, tends to fail where least expected. We are usually extremely good at setting up test plans for the main line code of the program, and these sections usually do run flawlessly. Software does not "break" but it must be able to deal with "broken" input and conditions, which often cause the "software failures". The task of dealing with abnormal/anomalous conditions and inputs is handled by the exception code dispersed throughout the program. Setting up a test plan and exhaustive test cases for the exception code is by definition difficult and somewhat subjective. 
 
Anomalous inputs can be due to:

• failed hardware
• timing problems
• harsh/unexpected environmental conditions
• multiple changes in conditions and inputs that are beyond what the hardware is able to deal with
• unanticipated conditions during software mode changes
• bad or unexpected user input
 
Often the conditions most difficult to predict are multiple, coinciding, irregular inputs and conditions.
 
Safety-critical software is usually tested to the point that no new critical failures are observed.  This of course does not mean that the software is fault-free at this point, only that failures are no longer observed in test. Why are the faults leading to these types of failures overseen in test? These are faults that are not tested for any of the following reasons:
 
• Faults in code that is not frequently used and therefore not well represented in the operational profiles used for testing
• Faults caused by multiple anomalous conditions that are difficult to test
• Faults related to interfaces and controls of failed hardware
• Faults due to missing requirements

It is clear why these types of  faults may remain outside of a normal, reliability focused, test plan. 

How to protect against such failures once software is released?
Current guides and standards are not nearly as specific and clear as the hardware equivalent. Most notably RTCA/DO-178B, the Radio Technical Commission for Aeronautics Software Considerations in Airborne Systems and Equipment Certification document, whose latest version was issued in 1992, and is considered to be the main guideline for safety-critical software development does not deal in particular with types of failures and how to avoid them. The document deals mainly with process control that will hopefully ensure good software. It does not dictate how to verify that the process works "well enough" for the requirements of any particular system. The much awaited update to this document, DO-178C, was released in 2011 and most certainly offered more prescriptive guidelines to the certification of safety-critical software. These are based not only on more current design environments such as Object Oriented Design but in general model-based software and more current formal methods for verification.
 
Since safety-critical software is most often a part of a larger system that includes hardware, then the safety assessment process should follow the process applied to hardware:

Software Preliminary/Functional Hazard Analysis (PHA/FHA)
If the top level architecture of the system details software components it is necessary that they be included in this qualitative analysis. The PHA is also needed to formulate a software specification that (a) prevents software from interfering with established hazard handling provisions, and (b) directs software design to reinforce hazard handling where established provisions are lacking. The analysis is necessary for completing the later stages of the safety review. ALD will work with your requirements and specification documents as well as any early design documents and artifacts available. Model based artifacts such as use-case scenarios are very helpful at this level of analysis.

Software System Hazard Analysis and/or Fault Tree Analysis
The quantitative hazard and Fault Tree Analyses should include any software that interfaces with safety-critical hardware.
At this stage ALD software safety engineers will work with your design products to provide a complete analysis of the following:

1. system architecture
2. system requirements document
3. preliminary/functional hazard analysis
4. hardware failure information
5. human error information
 
A common obstacle to including software in a quantitative analysis is the lack of a failure rate estimates for these software components. Software can be constructed so that a specified number of failures can be tolerated. If a system is safety critical it is usually assumed that it will be fielded only after stringent testing which will show no remaining defects in the software code (this does not mean 100% reliability though!). Any remaining sources of failure (associated with the software) can be assumed to be the result of incomplete requirement definition, in particular requirements dealing with rare and anomalous conditions such as hardware failures, rare environmental and usage conditions and unforeseen operator actions. Often combinations of multiple rare events can lead to conditions that the software was not prepared for. An approximate rate for such events can be derived from the size and quality of requirements but cannot be fully verified. 
 
If the software is not part of a safety critical system/function it may be fielded with a known failure rate (based on the software testing program). In such case this failure rate may be used as an estimate for the fault tree analysis.

Software Failure Modes and Effects Analysis
Failure Modes and Effects Analysis is done at system level based on the Fault Tree analysis (or hazard analysis) results. The fault tree identifies end effects that are to be mitigated. The FMEA will identify "initiating events" which can lead to these end effects.  A software FMEA will identify which initiating events (such as incorrect or missing inputs, particular modes of operation, etc) can result in the software causing a system failure. As an example, consider incorrect input during a relatively rare operating mode, or a rare input while system is recovering from a hardware failure. The software FMEA is performed only on the software components (or subcomponents) that can lead to hazardous conditions and results.
 
ALD performs software FMEAs (and system FMEAs that include software) based on an Object Oriented design. In order to perform this analysis, ALD requires the following design products:
 
• System architecture
• System requirements document
• System hazard analysis and/or Fault Tree Analysis (can be generated by ALD)
• Software program in an object oriented design environment: this may be in the form of a UML design, a Matlab Simulink design, or the code in any object      oriented language (e.g. C++, .NET etc)
• Hardware failure  information
• Human/operator error information
 
The FMEA process in an object oriented environment ensures exhaustive identification of exception condition initiators, and verification that protection against faults in exception handling, are in place and effective!
 
Although slightly different from a hardware FMEA, the software FMEA is compatible with hardware FMEAs, when properly executed and permits a full system FMEA. Hence, it provides the assurance, that we have identified all possible failure modes and have included provisions to detect and protect against them.

Software FMEA - How?
One of the main reasons the FMEA hasn't been a consistent part of critical software certification is the difficulty in applying it to a large piece of code. ALD has developed a methodology that overcomes this problem by using the object view of the program. Whether developed as a UML or Matlab Simulink model, or coded in an object-oriented language such as C++, .Net or Java, we apply our FMEA methodology at the object level.
Along with requirements and design documents we are able to construct a software FMEA that is surprisingly similar to a hardware FMEA, as software "object methods" are equivalent to hardware "parts". Moreover, when required, we will develop and generate a system FMEA which will include hardware and software and any interface failure modes. 
Our method overcomes another inherent software FMEA problem that most professionals cannot avoid: the subjectivity of the process. Most software safety professionals apply the FMEA at a "functional" level. This application is not only problematic in that it can leave entire sections of the exception code unevaluated, but it also introduces a subjectivity into the process that allows more failure modes to be ignored. Our object-centered method removes this subjectivity as it uses the classes defined in the design.    
 
Automated Software FMEA
FMEAs, applied to software or hardware, are a large task. Hardware FMEAs are automated through a comprehensive system breakdown tree, or Bill Of Material. ALD has developed automated tools and methods for generating the software FMEA based on object-oriented software models. Our tools are currently able to automatically generate the FMEA structure for models developed in UML (Unified Modeling Language) or within the Matlab Simulink environment. Benefits of using our automated tools include:

• A significant reduction in work load (by several orders of magnitude)
• Assurance of completeness of the task (no failure modes left behind)
• Libraries for future use that reduce work load even more (software and interface components, failure modes, higher order effects, detection methods,           compensation provisions)

What Can You Expect From ALD's Software FMEA Services and Tools?
ALD provides both consulting services and tools for the Software FMEA. Our services cover the entire spectrum of organizational needs:
 
• ALD can perform the entire task of developing the FMEA for your system and generating the complete FMEA reports.
• ALD can provide consulting to an in-house effort which may include any combination of: training, system set-up, tools and/or continuous program support.
 
Either way, ALD will walk you through the process so that your organization is able to successfully complete the FMEA and fully trust the results.

What will our FMEA and reports include?

• List of critical failure modes and whether they have been accounted for in the design;
• List of provisions (detection methods & compensation provisions) required to make the current system safe.
 
At the end of every effort, the reports and electronic libraries developed in the process will lead to an easier task in future FMEA efforts. As in the case of hardware, a software FMEA is an incredibly valuable addition to the organizational knowledge base, allowing for safer and less costly programs in the future. 

Requirements V&V
V&V of software requirements is at least as crucial as V&V for hardware, if not more. Most serious failures in safety and mission critical software are due to incomplete or incorrect requirement definition. As software does not fail randomly and hardly ever due to actual coding defects, most failures are the result of the code not being designed to deal with certain (mostly rare) events: conditions and inputs. Moreover, these are requirements which list mitigations for failures. For serious failures, multiple (redundant) mitigation strategies are required. A safety-informed requirements V&V focuses on these types of omissions.
 
In order to perform a requirements review that can focus on safety-critical aspects of the code ALD requires the following design products:
 
• System architecture
• Complete system requirements documents
• System hazard analysis and/or Fault Tree Analysis

For more information about ALD's Software Reliability and Safety program please contact us at This email address is being protected from spambots. You need JavaScript enabled to view it.