Fault Injection Testing de-mystified



Fault Injection Testing is a long-established testing technique for understanding how a system behaves when it is stressed in unusual ways. It started nearly half a century ago with the simulation of failures in hardware. Later, industry also began to think about fault injection testing in software, though not under that name: it was performed as part of "Robustness testing" and "stress testing". Even the DO-178B SW Safety Standard for Aviation does not use the term "fault injection testing". ISO26262 has made an attempt to explicitly define what this type of testing means and has provided guidelines on when and how to perform it.

In this article, we explain what the goal of fault injection testing is and what the ISO26262 Standard expects on this topic. Further, we define (in our own words) a "Systematic approach" to Fault Injection tests in Software, so as to maximize their effectiveness.

Before we go into fault injection tests, let's clearly understand what the words "fault" and "failure" mean.
 
A fault is any abnormal condition that can cause a system to fail. A bug in the code, a "missing" requirement, corruption of a variable during operation, a clock signal not being generated, a CAN message not being transmitted, a bit flip: each of these is a fault. Faults cover a very broad spectrum of abnormalities, such as Systematic faults, random HW faults, transient faults, etc.
 
A failure is the final consequence of a fault, wherein a functionality does not work as required. An image not shown correctly on the display, a sound not played, the brakes not applied, and the vehicle accelerating at the wrong rate are all examples of failures.
 
Even though a System is subjected to countless faults, there are multiple "layers" of defense stacked to prevent them from becoming failures. Following the processes and methodologies defined by ISO26262, defining best practices within the organization to ensure high-quality work products, and learning lessons from previous projects all act as layers of defense to detect as many Systematic faults as possible. On top of that, we define and implement safety mechanisms that detect the faults that occur during operation and prevent the failure from manifesting. If a failure occurs during operation over and above all of these, it is the product of a series of conditions, decisions, and actions.
 
The Goals of Fault Injection Testing are the following:
  1. To verify that the System detects as many of the injected faults as possible, and to identify which faults remain undetected, so as to make a conscious decision either to improve the system or to accept the system as-is if the undetected fault does not lead to a failure.
  2. To increase the test coverage of safety requirements by testing those parts of code that are not executed during normal operation
  3. To verify the effectiveness of Safety mechanisms
The Standard talks about performing Fault injection testing at different levels of abstraction and at various stages of the Safety life cycle, namely System, Hardware, and Software. The diagram below highlights this aspect.

 

A prerequisite for effective fault injection testing is to look at the System "with an open mind" and ask "what are all the various faults I can inject in this System?", with the curiosity of simply wanting to observe how the Safety System behaves in response to each fault. One of the biggest roadblocks to developing effective fault injection tests is the tendency to conclude that some faults cannot happen or are highly unlikely, or to "simply believe" that some kinds of faults will never occur (or have never occurred before) and so need not be tested.
 
Systematic faults are meant to be detected by processes, and in an ideal world no systematic faults would remain in the System. In practice, nothing could be further from the truth. Our compliance with processes is not perfect, so residual Systematic faults are always present in our systems. During fault injection testing, both Systematic and random faults should be injected.
 
Now, let us go deeper into fault injection testing at the Software Unit level, the Integration level, and the functional level.

Fault Injection Test in Unit level (for every ASIL component)
 
The goal of the Fault Injection test at this phase is to check how capable the unit is of handling faults, and to improve (if possible) the fault handling implementation of the unit.
 
By performing fault injection tests at the unit level, we can identify any gaps in the fault handling at an early stage, even before the unit SW is delivered for Integration. It provides the opportunity to implement any missed mechanisms at the unit level, such as a plausibility check on incoming data or timeout handling for an event-triggered functionality.
 
The best way to start is to perform unit testing against requirements, just as with any standard QM process. Develop test cases against requirements. If done well, this should help achieve sufficient code coverage at statement, branch, and MC/DC level. Once that is done, review the systematic approach described below to identify whether any further fault injection tests can be added at the unit level.
 
A Systematic approach to fault injection testing at the Unit level is to inject faults in each of the unit's elements.
 
A unit has various elements - it typically has 
  1. Variables
  2. Provided interfaces (the interfaces the unit provides)
  3. Required Interfaces and Data received via these Interfaces
  4. Executable functions that are called cyclically, event-triggered, or once to initialize or de-initialize the unit. 
Unit Testing may or may not be performed in the target hardware environment, and may or may not be performed in the Integrated SW environment. Hence, depending on the Unit test environment, some of the fault injection tests of the unit cannot be completely verified at the "independent" unit level.

Corruption of Variables

Corruption of variables must be injected at the unit level if the Unit is responsible for detecting, preventing, or correcting a fault of its variables, or if the Unit itself implements a 'centralized' Safety mechanism to detect/prevent/correct faults of all Safety related variables.

However, if this is not the case, and a centralized mechanism outside the scope of the unit (for example, a Memory Protection Unit) detects variable corruption, this can be tested only in the Integrated SW. Whether it needs to be tested for every variable depends on the mechanism: it may handle every variable separately (e.g., redundant storage of every safety variable), or it may be memory-region based, so that adding or removing unit variables does not modify the mechanism.
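
As a minimal sketch of the "redundant storage" case, assume a hypothetical 32-bit safety variable that is stored together with its bitwise-inverted copy and checked on every read. The class and function names below are illustrative only and do not come from any standard or library. The campaign flips each bit of the primary copy in turn and reports which flips went undetected.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


class CorruptionDetected(Exception):
    """Raised when the value and its inverted copy no longer match."""


@dataclass
class ProtectedU32:
    """Hypothetical 32-bit safety variable stored with an inverted copy."""
    value: int = 0
    inverse: int = MASK32

    def write(self, new_value: int) -> None:
        self.value = new_value & MASK32
        self.inverse = ~new_value & MASK32

    def read(self) -> int:
        # The two copies must be exact bitwise complements of each other.
        if (self.value ^ self.inverse) != MASK32:
            raise CorruptionDetected("value/inverse mismatch")
        return self.value


def inject_bit_flip(var: ProtectedU32, bit: int) -> None:
    """Fault injection: flip one bit of the primary copy only."""
    var.value ^= (1 << bit)


def run_bit_flip_campaign(initial: int) -> list[int]:
    """Flip each bit in turn; return the bits whose corruption went undetected."""
    undetected = []
    for bit in range(32):
        var = ProtectedU32()
        var.write(initial)
        inject_bit_flip(var, bit)
        try:
            var.read()
            undetected.append(bit)
        except CorruptionDetected:
            pass
    return undetected
```

An empty result means every single-bit corruption of the primary copy was caught. A fault that changes both copies consistently is outside what this simple scheme can detect and would need a separate test.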
 
Injecting faults in provided Interfaces
 
Think about the possible faults that can occur in the interface provided by the unit. For example, if the Interface is a notification or callback interface, it may be called too early, too late, or never at all. Such faults can be injected and the behavior verified. However, such tests can be performed only in an Integrated SW environment.
 
Corruption of data received via the Interfaces (provided by other components)
 
A Unit cannot really identify whether the data it receives is corrupted or not. However, by designing against requirements and following good coding guidelines, it should be made capable of handling the entire range of received data correctly. Fault Injection tests, in this context, mean providing invalid data values and observing the behavior of the unit.
  • E.g., if the data has a range, testing can be performed by providing at least 5 different values: 1) less than the lower limit of the range, 2) the lower limit value, 3) a value within the range, 4) the upper limit value, and 5) greater than the upper limit of the range. Values 1) and 5) are the faulty values in this case.
  • If the data has a set of discrete valid values, say ON and OFF, the data may either have a valid value within the set of discrete values (i.e., either ON or OFF) or it may have an Invalid (faulty) value (!= ON and != OFF) 
Note that this method of testing overlaps with boundary value testing and equivalence class testing. This is a good article we found on these testing methods.
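
The five-value scheme and the discrete-value case can be sketched as follows. The range limits, the `SWITCH_ON`/`SWITCH_OFF` encodings, and the function names are assumptions made for illustration. In a real unit test, the checks would be the unit's own plausibility checks rather than these stand-ins.

```python
# Hypothetical encodings of a discrete ON/OFF signal.
SWITCH_OFF = 0x00
SWITCH_ON = 0x01


def boundary_test_values(lower: int, upper: int) -> dict[str, int]:
    """Return the five test values for an integer range [lower, upper]."""
    if upper - lower < 2:
        raise ValueError("range too small to have a distinct in-range value")
    return {
        "below_lower": lower - 1,          # faulty value
        "lower_limit": lower,
        "in_range": (lower + upper) // 2,
        "upper_limit": upper,
        "above_upper": upper + 1,          # faulty value
    }


def is_in_range(value: int, lower: int, upper: int) -> bool:
    """Stand-in for the unit's plausibility check on ranged data."""
    return lower <= value <= upper


def is_valid_switch_state(raw: int) -> bool:
    """Stand-in for the unit's check on a discrete ON/OFF signal."""
    return raw in (SWITCH_OFF, SWITCH_ON)


def run_range_injection(lower: int, upper: int) -> dict[str, bool]:
    """Feed all five values to the check and record whether each is accepted."""
    return {
        name: is_in_range(value, lower, upper)
        for name, value in boundary_test_values(lower, upper).items()
    }
```

For a range of 0 to 100, the expectation is that only `below_lower` and `above_upper` are rejected. Any other outcome points to a gap in the plausibility check.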
 
Executables (Event-driven, Cyclic and Init/Deinit functions)
 
Possible faults in Cyclic functions are that they are not called at the expected frequency (they are called either faster or slower), or are not called at all. Init/Deinit functions may be called too early, too late, or not at all, or, if there is a required sequence of initialization, the order may not have been respected. Such faults can be injected in the unit and the resulting behavior can be verified. However, such tests can be performed only in an Integrated SW environment.
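
To make the timing faults concrete, here is a host-side sketch of the kind of supervision such a test would exercise. It assumes a hypothetical `AliveMonitor` with an expected period and a tolerance, and uses simulated timestamps instead of a real scheduler. On an integrated system the timestamps would come from the actual task activations.

```python
class AliveMonitor:
    """Hypothetical supervisor for the call period of a cyclic function."""

    def __init__(self, expected_period_ms: int, tolerance_ms: int, start_ms: int = 0):
        self.expected_period_ms = expected_period_ms
        self.tolerance_ms = tolerance_ms
        self.last_call_ms = start_ms

    def report_call(self, now_ms: int) -> str:
        """Record one invocation of the cyclic function and classify its timing."""
        delta = now_ms - self.last_call_ms
        self.last_call_ms = now_ms
        if delta < self.expected_period_ms - self.tolerance_ms:
            return "TOO_EARLY"
        if delta > self.expected_period_ms + self.tolerance_ms:
            return "TOO_LATE"
        return "OK"

    def is_timed_out(self, now_ms: int) -> bool:
        """True if the function has not been called within its deadline."""
        return now_ms - self.last_call_ms > self.expected_period_ms + self.tolerance_ms


def inject_call_times(monitor: AliveMonitor, call_times_ms: list[int]) -> list[str]:
    """Fault injection: drive the monitor with an arbitrary call schedule."""
    return [monitor.report_call(t) for t in call_times_ms]
```

A schedule such as `[10, 15, 40]` for a 10 ms task injects one call that is too early and one that is too late. Never calling `report_call` and polling `is_timed_out` covers the "not called at all" case.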

Fault Injection Test in Integration level 
 
The goal of the Fault Injection test at this phase is to check how effectively the Software architecture handles faults, and to improve it if possible.
 
The key thing to remember is that during the Integration test we verify the SW Architecture, i.e., how a SW requirement is implemented. We verify the correct behavior of the Functional and Data Interfaces that were decided at the Architecture level and that are not necessarily known at the Software Safety requirement (SSR) level, such as:
  • The Architectural design of the Safety mechanisms that are implemented in SW to detect violation of the Safety goals.
  • The Architectural design of the Safety mechanisms that are implemented to detect common cause faults (such as clock, voltage, memory, CPU etc) and to achieve Freedom from Interference
  • The Architectural design of Internal Safe states and External Safe states
  • The Architectural design of how FTTI is achieved
  • HW-SW Interfaces
A high-quality SW Integration test should verify that
  • The right interfaces were called. For example, Unit A calls the Interface of Unit B as expected, when expected
  • The data sent from one Unit to another had the right values
  • The registers had the right values
  • The timing requirements were achieved as expected
What should we verify in the context of a Safety mechanism or Safe state?
 
This could be any one or more of these, but not necessarily limited to:
  • Interface(s) that provide the status of the Safety goal
  • Interface(s) that implement the Safe state
  • Register(s) that are read to know the status, or written to trigger a Safe state
  • The associated timings when the Safety mechanisms detect the fault
  • Interface(s) that provide the information/details of Safe state or Safety mechanisms (for diagnostic purposes)
For example, if the System under test is expected to trigger a reset after checking n continuous samples of a faulty signal, verify the following:
  • Is the reset triggered only as a consequence of the n continuous faults and not for any other reason (i.e., confirm that it is this Safety mechanism that triggered the reset)? Think about how you could verify this, for example by reading out a reset reason or by logging some data about the line of code that 'triggered' the reset.
  • Did the Interface that carried the fault information have the 'expected' faulty data?
  • Was the right interface called with the right parameters while triggering the Safe state?
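
The n-continuous-samples example above can be sketched as a small model. The `SignalMonitor` class, the `ResetReason` values, and the recorded `reset_calls` list are all hypothetical stand-ins for the real safety mechanism, its reset-reason register, and the Safe state interface.

```python
from enum import Enum


class ResetReason(Enum):
    NONE = 0
    SIGNAL_FAULT_CONFIRMED = 1


class SignalMonitor:
    """Hypothetical mechanism: reset after n consecutive faulty samples."""

    def __init__(self, n_required: int):
        self.n_required = n_required
        self.consecutive_faults = 0
        self.reset_reason = ResetReason.NONE
        self.reset_calls = []  # parameters passed to the (stubbed) reset interface

    def _trigger_reset(self, reason: ResetReason) -> None:
        # Stub of the Safe state interface: record the call and store the reason.
        self.reset_calls.append(reason)
        self.reset_reason = reason

    def process_sample(self, sample_is_faulty: bool) -> None:
        if sample_is_faulty:
            self.consecutive_faults += 1
        else:
            self.consecutive_faults = 0  # a good sample restarts the count
        if (self.consecutive_faults >= self.n_required
                and self.reset_reason is ResetReason.NONE):
            self._trigger_reset(ResetReason.SIGNAL_FAULT_CONFIRMED)


def inject_samples(monitor: SignalMonitor, pattern: list[bool]) -> None:
    """Fault injection: feed a sequence of faulty (True) and good (False) samples."""
    for sample_is_faulty in pattern:
        monitor.process_sample(sample_is_faulty)
```

Two injection patterns are worth running: one with n-1 faulty samples interrupted by a good one, which must not reset, and one with n faulty samples in a row, which must reset exactly once with the expected reason.
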
Here are some examples of fault injection tests at SW Integration level:
  • Take the end-to-end SW function chain for a Safety goal and inject faults in every QM component of the function chain. That is, wherever QM can interfere with the Safety part (for example, by calling an ASIL Interface or passing data to an ASIL Interface), inject the various possible faults, such as passing corrupted data or calling the ASIL Interface without respecting its pre-conditions, and verify that the Safety mechanism triggers the expected reaction and prevents the Safety goal violation.
  • Inject starvation conditions for Safety by disturbing the normal process flow and introducing QM overload (something that occurs naturally in the system under certain conditions, or that is artificially triggered by setting high load conditions). At the Architecture stage, we have the benefit of knowing which processes generate a very high load, and that gives an opportunity to create the best conditions for starvation. Verify that the Safety goal violation is prevented and the Safe state is triggered.
  • Inject faults in Executable functions and verify that the Safety mechanisms can detect them.
  • Verify the design of how the FTTI is achieved. For example, if the FTTI is achieved considering process latencies, periodic task rates and function run times, inject faults under the worst-case conditions that contribute to the highest latencies and run times, and verify that the FTTI is still met.
Some of the examples stated above are similar to what is done as "stress" testing or resource usage testing. There is no real boundary between these different test methods; they are all interconnected. For example, some types of faults are introduced by stressing the system, and to achieve timing requirements like the FTTI, it is necessary to know the resource usage of the tasks and functions related to the Safety functionality.
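
For the fault tolerant time interval (FTTI) example, the budget that the measured worst case is compared against can be written down as simple arithmetic. The model below is a deliberately simplified assumption: it adds the confirmation time of a periodically sampled fault, a worst-case scheduling latency, a worst-case function run time, and the time to reach the Safe state. Real architectures may need additional terms, and all parameter names are illustrative.

```python
def worst_case_fault_reaction_ms(
    task_period_ms: int,
    samples_to_confirm: int,
    max_scheduling_latency_ms: int,
    max_function_runtime_ms: int,
    safe_state_transition_ms: int,
) -> int:
    """Simplified worst-case time from fault occurrence to Safe state."""
    # Worst case: the fault appears just after a sample was taken, so
    # confirming it takes samples_to_confirm full task periods.
    detection_ms = samples_to_confirm * task_period_ms
    return (detection_ms
            + max_scheduling_latency_ms
            + max_function_runtime_ms
            + safe_state_transition_ms)


def meets_ftti(reaction_ms: int, ftti_ms: int) -> bool:
    """True if the worst-case reaction time fits within the FTTI."""
    return reaction_ms <= ftti_ms
```

With a 10 ms task, 3 samples to confirm, 4 ms latency, 2 ms run time and 5 ms to reach the Safe state, the budget is 41 ms. If injected overload pushes the latency to 15 ms, the same calculation gives 52 ms, which would violate a 50 ms FTTI.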

Fault Injection Test in Functional level (Testing of the Embedded SW against Safety requirements)
 
The ISO26262 Standard expects that the Integrated SW is tested against the Software Safety requirements. At this stage, this must be done in the target HW environment (even if the Unit and Integration tests were not performed in the target environment). Testing at this level can be black-box or white-box.
 
At this level, fault injection testing can be done by introducing faults at the "function" level. This includes, but is not limited to:
  1. Injecting various faults in the Vehicle bus (e.g., CAN or Ethernet).
  2. Injecting faults in calibration parameters.
  3. Injecting faults in Safety related outputs (outputs relevant to the Safety goals). Effective fault injection testing for Safety goals requires a really good understanding not only of the Safety requirements but also of the overall System requirements that are related to the outputs.
  4. Injecting faults at HW level to verify Safety mechanisms defined for FFI, Independence, or detection of latent faults.
  5. Injecting faults in the Microcontroller by writing fault-injecting code to verify the Safety mechanisms provided by the MCU.
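
As an illustration of the first item, bus faults can be modeled on a host before they are injected on the real bus. The sketch below assumes a hypothetical message protection scheme with a 4-bit rolling counter and a simple 8-bit additive checksum. This is not a standardized end-to-end profile, and every name in it is made up for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    counter: int    # 4-bit rolling counter
    payload: bytes
    checksum: int   # 8-bit additive checksum over counter and payload


def compute_checksum(counter: int, payload: bytes) -> int:
    return (counter + sum(payload)) & 0xFF


def make_frame(counter: int, payload: bytes) -> Frame:
    counter &= 0x0F
    return Frame(counter, payload, compute_checksum(counter, payload))


def corrupt_payload(frame: Frame, byte_index: int, bit: int) -> Frame:
    """Fault injection: flip one payload bit and keep the old checksum."""
    data = bytearray(frame.payload)
    data[byte_index] ^= (1 << bit)
    return replace(frame, payload=bytes(data))


class Receiver:
    """Hypothetical receiver-side check of checksum and counter sequence."""

    def __init__(self):
        self.last_counter = None

    def check(self, frame: Frame) -> str:
        if frame.checksum != compute_checksum(frame.counter, frame.payload):
            return "CHECKSUM_ERROR"
        verdict = "OK"
        if self.last_counter is not None:
            if frame.counter == self.last_counter:
                verdict = "REPEATED"   # same frame received again
            elif frame.counter != ((self.last_counter + 1) & 0x0F):
                verdict = "LOST"       # one or more frames are missing
        self.last_counter = frame.counter
        return verdict
```

Sending a corrupted frame, the same frame twice, or a frame that skips a counter value injects the corruption, repetition, and loss faults respectively. The test then checks that the Safety mechanism behind the receiver reacts as the Safety requirement demands.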

Summary
 
Our goal in defining such a systematic process for fault injection testing is to maximize its effectiveness by increasing the rigor of the activity. But does that mean we have a perfect fault injection test procedure in place? No!
 
Our System is tested only to the extent of our knowledge and our ability to think of artificial ways to simulate faults. We may use well-defined fault models as a base to simulate faults, but we can still never rule out the 'unknown' faults. If we cannot think of a certain fault in the first place, how can we inject it and verify the behavior?
 
If there are 2 things that you want to take from this blog, they are these:
  1. Aim to identify faults at the earliest stage possible.
  2. Take the fault injection test activity very seriously, because you do not want defects to escape and get caught by the Customer! So brainstorm as a team to define high-quality tests, take lessons from previous programs, involve experts, and do exhaustive reviews!
We will end this article with a thought-starter for you: "To what extent should you inject systematic faults in the System if you already have processes in place to control them?"
 
We will cover this topic soon in one of our forthcoming blogs.