In this post, we discuss:
- ISO26262’s proposal on controlling systematic failures
- Introducing a decision framework to apply ISO26262’s proposal in programs
- Examples of how to apply the framework
The Standard proposes two ways of controlling Systematic faults:
1. Development-time Safety measures to prevent these failures
This means preventing the failure by design, or by applying process measures such as reviews, testing and checklists.
2. Safety mechanisms to avoid or control, at run time, the failures caused by residual systematic faults
These are checks implemented in HW or SW. They remain operational in the field and trigger a safe state if they detect a fault.
Take note of the word “residual”. The ISO26262 Standard does not mention “residual systematic faults” anywhere; we use the term here because some faults will remain in the system if the measures taken during development-time were not effective enough. For such faults, the only way to prevent a failure is to implement Safety mechanisms.
Conversely, if the Safety measures implemented during development-time are effective enough to consistently eliminate the Systematic fault, and one can be confident that no “residual” fault can be present, then Safety mechanisms are not required, because there will be nothing to detect.
Introducing a decision framework to apply ISO26262’s proposal in programs
We have defined below a simple framework that translates the proposal from ISO26262 to practical use in programs. This framework has been intentionally developed for Software. It can be applied in the following cases:
- During Software Safety analyses, when an initial SW Safety concept and Architecture are already available and some initial decisions on the SW Safety flow have been taken. This framework assumes that there may be gaps and weaknesses in the SW Safety Architecture.
- At later stages of SW development, e.g., during integration of SEooCs, to determine whether the measures taken during Integration are sufficient.
- If safety defects are found during operation or testing and are traced to a systematic fault, the framework helps to determine a solution that effectively mitigates the fault.
With some adaptations, the application of this framework can also be extended to Systems.
Please zoom in if required to get a clear view of the framework or download the image.
Here is a summary of the steps involved in the framework:
- Identify if the Systematic failure leads to a violation of the Safety goal.
- If it does, identify all the causes that can lead to the underlying Systematic fault, and run each cause through the cause questions. The key thing to note is that these questions are not either-or: all four questions must be asked for every cause. In this way, the analysis of the cause is holistic and helps to settle on safety measures that are effective without being overdone.
- If the failure does not lead to the violation of a Safety goal, nothing needs to be done from a Safety perspective. However, we still recommend analyzing the causes to identify weaknesses in design or process and to make improvements.
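The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_measures` and the data layout are hypothetical, and the cause questions are referred to by number only (1 to 4), since their wording belongs to the framework itself. The point of the sketch is that every cause must have an answer to all four questions, and that each “Yes” contributes a measure.

```python
QUESTION_NUMBERS = (1, 2, 3, 4)  # the four cause questions of the framework


def select_measures(violates_safety_goal, answers_by_cause):
    """Collect the measures chosen for each cause of a systematic fault.

    answers_by_cause maps a cause description to a dict of the form
    {question_number: measure_text or None}; None records a "No" answer.
    """
    if not violates_safety_goal:
        # No Safety goal violation: nothing is required from a Safety perspective.
        return {}

    selected = {}
    for cause, answers in answers_by_cause.items():
        unanswered = [q for q in QUESTION_NUMBERS if q not in answers]
        if unanswered:
            # The questions are not either-or: every one must be answered.
            raise ValueError(f"Cause {cause!r}: questions {unanswered} not answered")
        # Keep only the questions answered with "Yes", together with their measure.
        selected[cause] = [
            (q, answers[q]) for q in QUESTION_NUMBERS if answers[q] is not None
        ]
    return selected


# Example: the version-incompatibility fault discussed later in this post,
# assuming it can lead to the violation of a Safety goal.
example = select_measures(
    violates_safety_goal=True,
    answers_by_cause={
        "Integrator uses incompatible versions of X and Y": {
            1: None,
            2: None,
            3: None,
            4: "Check version compatibility of X and Y at every SW Integration",
        }
    },
)
```

Recording a “No” explicitly (rather than leaving a question out) is a deliberate choice in this sketch: it makes a skipped question visible as an error instead of silently treating it as “not applicable”.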
Examples of how to apply the framework
Let us go back to the example with which we started the blog. In this case, the Software vendor likely identified, during Safety analysis, the Systematic fault of “Versions of Component X and Y being incompatible”.
The likely cause for this fault is the Software Integrator using incompatible versions of X and Y during Integration. (Somebody could also hack into the code and change one of the components, but let’s keep security issues aside for the time being.) In this case, the cause is not related to design. Hence, questions 1-3 in the framework will be answered with a “No”. For question 4, however, the answer is “Yes”. The cause can be addressed by adding a step to the SW Integration process that checks whether the Version numbers of X and Y are compatible. If an incompatibility is detected, the Integrator can investigate and integrate the right SW. To be consistent, the check should be performed during every SW Integration. In summary, the Safety measure at Integration can eliminate this Systematic fault, and no additional Safety mechanisms are needed.
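Such an Integration check could be automated, for instance as a small script that runs before every build. The sketch below is a minimal illustration, not the vendor’s actual process: the compatibility table, the version strings and the names `check_versions` and `IntegrationError` are all assumptions made for the example.

```python
class IntegrationError(Exception):
    """Raised when the Integration check must stop the build."""


# Hypothetical compatibility table: for each released version of X,
# the versions of Y it has been verified against.
COMPATIBLE_Y_VERSIONS = {
    "1.0": {"2.0", "2.1"},
    "1.1": {"2.1"},
}


def check_versions(x_version, y_version, table=COMPATIBLE_Y_VERSIONS):
    """Stop the Integration if X and Y are not a known-compatible pair."""
    allowed = table.get(x_version)
    if allowed is None:
        # An unknown version of X is treated as incompatible, not as "probably fine".
        raise IntegrationError(
            f"X {x_version} is not listed in the compatibility table"
        )
    if y_version not in allowed:
        raise IntegrationError(
            f"X {x_version} is not compatible with Y {y_version}; "
            f"allowed Y versions: {sorted(allowed)}"
        )


# A compatible pair passes silently; an incompatible pair raises.
check_versions("1.0", "2.1")
```

Because the check raises instead of returning a flag, a script that calls it fails the Integration step by default, so the Integrator cannot overlook the result.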
As a second example, consider the Systematic fault “QM SW has incorrect pointer handling implementations that can corrupt the memory of ASIL SW”.
The cause of this fault can lie in any of the QM SW Components in the system. By following the decision framework given above, one can arrive at the following safety measures:
- Implement a Safety mechanism that can detect a corruption of ASIL SW by QM SW (based on cause question 2)
- Fix all the weaknesses in code related to pointer handling by following good coding principles (based on cause question 3)
- Improve the code verification processes through better code reviews and high-quality static analysis tools that can detect and report all pointer write issues, and fix ALL the reported issues in the code (based on cause question 4)
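To illustrate the first measure in the list above, here is a small Python simulation of one possible detection scheme: the ASIL data is stored together with a CRC32 checksum, and every read verifies the checksum before the data is used. This is a sketch of the idea only. A real mechanism would be implemented in the ECU software or in hardware (for example, through memory protection), and the names `ProtectedRegion`, `raw_memory` and `read_or_safe_state` are hypothetical.

```python
import zlib


class ProtectedRegion:
    """Simulates an ASIL data region guarded by a CRC32 checksum."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._crc = zlib.crc32(self._data)

    def write(self, offset, payload):
        """Legitimate ASIL write path: updates data and checksum together."""
        end = offset + len(payload)
        if offset < 0 or end > len(self._data):
            raise IndexError("write outside the protected region")
        self._data[offset:end] = payload
        self._crc = zlib.crc32(self._data)

    def raw_memory(self):
        """Stands in for direct memory access, e.g. a stray QM pointer write.

        Changes made through this buffer do not update the checksum.
        """
        return self._data

    def is_intact(self):
        """Return True if the data still matches the stored checksum."""
        return zlib.crc32(self._data) == self._crc


def read_or_safe_state(region, offset, length, enter_safe_state):
    """Read ASIL data only if the region is intact; otherwise go to safe state."""
    if not region.is_intact():
        enter_safe_state()
        return None
    return bytes(region.raw_memory()[offset:offset + length])
```

A write through `raw_memory()` models the QM fault: the data changes but the checksum does not, so the next call to `read_or_safe_state` invokes the safe-state callback instead of returning corrupted data.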
Note: If process measures are used as an argument for detecting or preventing some systematic faults of QM SW, it is crucial to define these measures very specifically. For example, a statement like “QM Component Z must be tested” simply leaves it to chance whether the fault is detected or not. Instead, the measure must specify which function of Z is tested, when it is tested, how often, and exactly how, so that the test is always carried out with the same procedure, environment, duration and frequency.
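One way to enforce this level of detail is to record each process measure in a structured form and reject entries that leave any detail open. The sketch below assumes a hypothetical record type, `VerificationMeasure`, whose fields follow the details named in the note; the field names are illustrative only.

```python
from dataclasses import dataclass, fields


@dataclass
class VerificationMeasure:
    """A process measure, with the details the note above asks for."""

    component: str
    function_under_test: str = ""
    when: str = ""
    frequency: str = ""
    duration: str = ""
    procedure: str = ""
    environment: str = ""

    def missing_details(self):
        """Return the names of all fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# "QM Component Z must be tested" names the component and nothing else.
vague = VerificationMeasure(component="Z")
open_points = vague.missing_details()
```

Here `open_points` lists every detail that the vague statement leaves to chance; a measure would only be accepted as a safety argument once this list is empty.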
The framework only provides solutions from a technical perspective. However, the effectiveness of these technical measures depends upon the Safety culture of the Organization. For example, if a program wants to implement a design measure to control incorrect pointer usage, the measure will be effective only if all the SW Coders in the team know the pointer-related coding principles. If a program wants to rely on tools to detect or prevent faults, this will be effective only if the tool is qualified to do so and the team actually uses the results from the tool to make improvements.
If you have a use case in which it was challenging to decide on effective measures, please share it with us! We will be pleased to help you.