ISO 26262 provides a rather bookish definition of Systematic faults and failures. In this post, we explain our understanding of what these terms mean by covering the following aspects:
- An easy way to understand systematic faults & systematic failures
- Possible scenarios in which systematic faults could occur
- Challenges with complete elimination of systematic failures
- Probability of systematic failures
An easy way to understand systematic faults & systematic failures
In simple terms, we like to call Systematic faults “Method or Process faults”. A Systematic fault is any fault in the way a method or process is applied, whose consequent failure shows up in a deterministic way. This consequent failure is what is called a Systematic failure.
What do we mean by “deterministic”? It means that if the same fault is injected into the system ‘n’ times under the same specific conditions, the same failure will occur every time. The failure is not necessarily tied to the context of Safety; it may or may not have an impact on Safety. In other words, this failure may eventually lead to 1) a Safety goal violation, 2) a false detection of a Safety goal violation, 3) a Safety goal violation going undetected or 4) a failed behavior that is not related to Safety at all.
Let’s understand this better with a few simple examples.
Example 1: Systematic design fault leading to a Safety goal violation
A Systematic HW fault on the board, where the connection between a microcontroller port output and the Airbag LED is missing. Because of this fault, the LED cannot be turned ON, which violates the Safety goal “Airbag telltale shall be turned ON when requested”.
Example 2: Systematic design fault leading to false detection of Safety goal violation
A coding bug (Systematic SW fault) where an assignment operator (=) is used instead of an equal-to operator (==). In this case, the Safety goal is “Turn Indicator Sound shall be turned ON when requested” and the Safe state is to reset the System if the sound is not turned ON. Due to the coding error, the System falsely concludes that the sound is OFF and resets itself whenever the turn indicator is requested.
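A minimal C sketch of how such a bug could look is shown here. The function names, the `SOUND_STATE_*` constants and their encoding are hypothetical and chosen only for illustration; the real monitoring logic of such a System would certainly look different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the sound feedback signal (an assumption). */
#define SOUND_STATE_OFF 1u
#define SOUND_STATE_ON  2u

/* Buggy monitor: '=' assigns SOUND_STATE_OFF to sound_state instead of
 * comparing with it. The assigned value is non-zero, so the condition is
 * always true and a reset is requested even when the sound is ON. */
bool reset_requested_buggy(uint8_t sound_state)
{
    if (sound_state = SOUND_STATE_OFF) {
        return true;   /* enter the Safe state: reset the System */
    }
    return false;
}

/* Intended monitor: '==' compares, so a reset is requested only when the
 * sound is really OFF. */
bool reset_requested_fixed(uint8_t sound_state)
{
    if (sound_state == SOUND_STATE_OFF) {
        return true;
    }
    return false;
}
```

With this sketch, `reset_requested_buggy(SOUND_STATE_ON)` requests a reset on every single call, which is exactly the deterministic behavior that makes the failure Systematic. GCC and Clang can warn about an assignment used as a condition (for example with `-Wall`), which is one reason such warnings should not be ignored.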
Example 3: Systematic design fault leading to not detecting a Safety goal violation
Similar to Example 2, a coding bug (Systematic SW fault) where an assignment operator (=) is used instead of an equal-to operator (==). In this case, the Safety goal is “Gear Indicator shall be activated as requested” and the correctness of the Gear Indication bitmap pixels is checked via a CRC of the pixels. Because of the bug, the if condition always evaluates to TRUE (the assigned value is non-zero), so the System always concludes that the CRC check passed and never detects a Safety goal violation.
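The following C sketch illustrates this kind of CRC comparison bug. The function names and the way the CRC values are passed in are assumptions made for the example; the CRC computation itself is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Buggy check: '=' overwrites computed_crc with expected_crc. The condition
 * is true whenever expected_crc is non-zero, so a corrupted bitmap passes. */
bool bitmap_crc_ok_buggy(uint16_t computed_crc, uint16_t expected_crc)
{
    if (computed_crc = expected_crc) {
        return true;
    }
    return false;
}

/* Intended check: the bitmap is accepted only if both CRC values match. */
bool bitmap_crc_ok_fixed(uint16_t computed_crc, uint16_t expected_crc)
{
    return computed_crc == expected_crc;
}
```

For a corrupted bitmap, say a computed CRC of `0x1234` against an expected CRC of `0xABCD`, the buggy version still reports success while the fixed version reports a mismatch.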
Fundamentally, Systematic faults are all faults in the way a Safety activity is performed. Faults in the way a review is conducted or an analysis is performed, a bug in the design, or an incorrect way of testing are all process faults. A process fault leads to a fault in the output generated by that process. For example, if a review is not conducted well and therefore fails to detect an incorrect requirement, this process fault shows up as a fault in the Requirement Specification. It may then manifest as a failure to implement a requirement related to the Safety goal. Similarly, if a test case is not designed correctly, it will produce an incorrect result for that test case, which may lead to an incorrect conclusion about the impact on the Safety goal. If a method for SW Unit design is not applied correctly, it will show up as a weakness or bug in the SW Unit design or code.
Due to the deterministic nature of a Systematic failure, it is in principle possible to detect the underlying fault and its cause. Once this is detected, one can either eliminate the fault or, if that is not possible, prevent it from leading to a safety-related malfunction by implementing Safety mechanisms that are operational at run time.
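As an illustration of a run-time Safety mechanism, the C sketch below compares the commanded state of a telltale with a feedback reading and reports a fault after a few consecutive mismatches. It assumes the hardware provides such a feedback signal; the type and function names and the `MISMATCH_LIMIT` threshold are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of consecutive mismatches before a fault is confirmed. */
#define MISMATCH_LIMIT 3u

typedef struct {
    uint8_t mismatch_count;   /* consecutive cycles with a mismatch */
} telltale_monitor_t;

/* Call once per monitoring cycle. Returns true once the commanded state and
 * the feedback have disagreed for MISMATCH_LIMIT consecutive cycles. */
bool telltale_monitor_step(telltale_monitor_t *monitor,
                           bool commanded_on, bool feedback_on)
{
    if (commanded_on == feedback_on) {
        monitor->mismatch_count = 0u;   /* states agree: clear the counter */
        return false;
    }
    if (monitor->mismatch_count < MISMATCH_LIMIT) {
        monitor->mismatch_count++;
    }
    return monitor->mismatch_count >= MISMATCH_LIMIT;
}
```

A monitor of this kind would not remove a fault such as the missing LED connection in Example 1, but it could allow the System to react to it, for example by entering a Safe state.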
Here are some more arbitrary examples of Systematic faults:
- A missed requirement in the Safety requirement Specification (FSR, TSR, SSR, HSR)
- An incorrectly defined or missed interface for an ASIL Component in the SW Architecture
- An incorrect analysis in the System-level FMEA
- A coding bug
- A missed or incorrect connection between two hardware components in the schematic
- An item in a design or code review checklist that was not verified
- A resistor not populated on the board during manufacturing
- An incorrect test case or procedure for testing a Safety requirement
Possible scenarios in which systematic faults could occur
As may be obvious from the examples above:
- Systematic faults can occur in any of the skill areas - System, HW, SW or Manufacturing.
- Systematic faults can occur in any of the work products defined by these skill areas.
Challenges with complete elimination of systematic failures
There are at least a couple of reasons why Systematic failures cannot be completely eliminated.
1. Let’s take the example of a coding bug again. What could be the reason for a coding bug? Perhaps it was sheer negligence by the developer. Perhaps the developer was over-stressed by a personal issue. Perhaps he/she was not aware of the coding guidelines. Perhaps he/she made assumptions. How do we handle these causes? We could certainly devise ‘Organizational’ processes to ensure that the developer is motivated and has the required knowledge, environment and support to do high-quality coding. We could ensure that a peer review by a highly skilled reviewer is put in place to catch these coding bugs, so that the random one-off scenario of a developer having a bad day does not hurt Safety in the end. But what happens if this highly skilled reviewer was having a bad day too and failed to catch the bug? This train of thought shows that there is a facet of randomness to Systematic faults. This facet may make it practically difficult to completely eliminate Systematic faults. (Caution: We do not mean it is okay to introduce faults! So, please don’t use ‘randomness’ as an excuse to become complacent in your Safety activities!)
2. Many Systematic SW failures manifest in a seemingly random fashion. If you have worked on SW, you have most likely run into an issue that is extremely hard to reproduce or that seems to alter its behavior when one attempts to study it. This is often called a Heisenbug. Such issues sometimes take several months to analyze. It is often not possible to track down the cause of a Heisenbug, and it is then either left unresolved based on an argument of very low probability of occurrence or “fixed” based on the best possible guess at the cause of the failure.
Probability of Systematic failures
As per the Standard, there are “Systematic faults” and “Random HW faults”. Systematic faults are not considered in the probabilistic failure rate calculation; only random HW failures are. For example, SW faults are considered to be completely systematic and hence, we do not talk about a probability of failure for SW. This is based on an underlying assumption that every Systematic fault can be traced to a specific cause and that, once a fix has been made to the design or process, the fault is completely mitigated and will neither occur again nor result in new faults.
However, we picked up an interesting point of view from exida which challenges this assumption. How often have you seen a SW fix being made, only for the failure to reappear, perhaps in a different context, or for the fix to break something else? This happens because the design or process measure that was applied was not completely effective, i.e., the right root cause was not identified or the method of performing the cause analysis was not correct. Hence, there may be residual Systematic faults that have seeped into the System and either remain latent or reveal themselves in specific circumstances during operation.
Should ISO 26262, in its 3rd edition, talk about the contribution of residual Systematic faults towards failure rates?
If you are curious to understand more on the Systematic aspect of random failures, and the randomness of Systematic failures, we strongly suggest you check this webinar out.