ECC (error correction code, also called error-correcting code) is a method of detecting and correcting errors in digital data. It is widely used to protect data stored in memories and to ensure the integrity of data in transmission.
In this article, we answer some specific questions about ECC that newcomers to functional safety frequently ask:
- From a functional safety perspective, what is the purpose of ECC?
- Is ECC a mandatory safety mechanism?
- Is ECC a mechanism to achieve freedom from interference or independence?
- What aspects must be considered in a safety-critical system that uses ECC-capable hardware (memories and communication buses)?
From a functional safety perspective, what is the purpose of ECC?
The broad purpose of ECC is to make memories and data transmission much more stable and reliable. That is why ECC is used not only in safety-critical domains such as automotive, aviation, and defense, but also in systems that are not safety-relevant yet demand maximum availability, such as file servers and critical databases.
ECC uses Hamming codes to detect hard and soft bit errors. What are hard and soft errors? Hard errors are permanent. They are caused by physical factors such as temperature or power variation and stress on the hardware, and they lead to a permanent failure of the hardware circuit. Soft errors are transient. They are caused by electrical and magnetic interference, and even by cosmic rays. Since hard errors are permanent, ECC cannot repair them; it can only detect them. Soft errors, however, can be both detected and corrected by the hardware, even without the knowledge of the application.
The most common ECC implementation is based on the Hamming code scheme 'SECDED' (single-bit error correction, double-bit error detection): it can detect 1-bit and 2-bit errors and correct 1-bit errors. Some microcontrollers also offer DECTED (double-bit error correction, triple-bit error detection). A very good explanation of how SECDED works is available here.
From a functional safety perspective, ECC helps the system meet the quantitative single-point fault metric required for its ASIL. ECC offers up to 90% diagnostic coverage for memories with SECDED schemes and up to 99% with DECTED schemes.
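To make the SECDED behavior concrete, here is a minimal Python sketch of an extended Hamming code protecting a single 8-bit word. It is an illustration under stated assumptions, not how any particular microcontroller implements ECC: real memories do this in hardware, usually on 32-bit or 64-bit words, with a vendor-specific bit layout. The function names (`encode`, `decode`, `flip`) and the bit positions are our own choices for this example.

```python
# Illustrative SECDED (extended Hamming) code for one 8-bit data word.
# Layout is an assumption made for this sketch: position 0 holds the
# overall parity bit, positions 1, 2, 4, 8 hold the Hamming parity bits,
# and positions 3, 5, 6, 7, 9, 10, 11, 12 hold the eight data bits.

PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)
CODE_BITS = 13


def group_parity(code, p):
    """Parity (0 or 1) of all positions 1..12 whose index has bit p set."""
    return sum(code[pos] for pos in range(1, CODE_BITS) if pos & p) % 2


def encode(byte):
    """Return the 13-bit codeword as a list of bits indexed by position."""
    code = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # code[p] is still 0 here, so this yields even parity per group.
        code[p] = group_parity(code, p)
    code[0] = sum(code[1:]) % 2  # overall even parity over all 13 bits
    return code


def decode(code):
    """Return (status, byte); byte is None when the word is uncorrectable."""
    code = list(code)
    # The syndrome is the position of a single flipped bit (0 if none).
    syndrome = sum(p for p in PARITY_POSITIONS if group_parity(code, p))
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and syndrome < CODE_BITS:
        # Odd number of flips: treat as a single-bit error and correct it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: double-bit error.
        return "uncorrectable", None
    byte = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, byte


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


if __name__ == "__main__":
    word = encode(0xA5)
    print(decode(word))                 # ('ok', 165)
    print(decode(flip(word, 6)))        # ('corrected', 165)
    print(decode(flip(word, 6, 11)))    # ('uncorrectable', None)
    print(decode(flip(word, 3, 5, 6)))  # ('corrected', 162)
```

The last call shows the limit of the scheme: three flipped bits can look like a single-bit error, so the decoder reports a successful correction while returning wrong data.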
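As a rough illustration of how diagnostic coverage feeds into the single-point fault metric (SPFM), the sketch below applies the ISO 26262-5 definition, SPFM = 1 - (sum of single-point and residual failure rates) / (sum of all safety-related failure rates). It is deliberately simplified: it assumes that every fault of an element could violate the safety goal if undetected, and that each element has exactly one safety mechanism. The FIT values and element names are hypothetical and chosen only to make the arithmetic visible.

```python
# Simplified SPFM arithmetic. All failure rates (FIT) below are
# hypothetical example values, not data for any real device.


def residual_fit(raw_fit, diagnostic_coverage):
    """Failure rate that escapes a mechanism with the given coverage (0..1)."""
    return raw_fit * (1.0 - diagnostic_coverage)


def spfm(elements):
    """elements: list of (raw_fit, diagnostic_coverage) per hardware element."""
    total = sum(fit for fit, _ in elements)
    residual = sum(residual_fit(fit, dc) for fit, dc in elements)
    return 1.0 - residual / total


if __name__ == "__main__":
    logic = (50.0, 0.99)  # hypothetical: other logic, already well covered
    for label, sram_dc in [("no ECC", 0.0), ("90% DC", 0.90), ("99% DC", 0.99)]:
        sram = (100.0, sram_dc)  # hypothetical: 100 FIT of SRAM
        print(f"SRAM with {label}: SPFM = {spfm([sram, logic]):.1%}")
    # SRAM with no ECC: SPFM = 33.0%
    # SRAM with 90% DC: SPFM = 93.0%
    # SRAM with 99% DC: SPFM = 99.0%
```

With these made-up numbers, the unprotected memory dominates the metric, and the step from 90% to 99% coverage is the difference between an SPFM of 93% and one of 99%.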
Is ECC mandatory for Safety critical systems?
ISO 26262 never mandates any specific safety mechanism. A system without hardware ECC may still meet the hardware metric targets required for its ASIL. That said, our personal opinion is that ECC must be enabled for every memory that supports it, whether or not it is needed for the quantitative metrics, because it is a fundamental way of ensuring reliability. Enabling ECC may well be a much simpler way to handle single-bit and double-bit failures than deploying complex software safety mechanisms, but ease of use should not be the main motivation for choosing any safety mechanism.
Is ECC a mechanism to achieve freedom from interference or independence?
Yes, ECC can serve as a Safety mechanism to achieve Freedom from Interference (FFI) or Independence.
Let’s take 2 scenarios:
1. A memory region is corrupted by a bit flip, the failure cascades to a safety component, and this causes a safety goal violation. This is a case of interference.
2. The corrupted memory region is used by multiple SW components at the same time and therefore causes multiple failures at the same time. If these failures occur in decomposed paths, independence is no longer achieved.
In both cases, ECC helps by correcting the bit flip and restoring the correct value in the memory region.
What aspects must be considered in a Safety-critical System that uses ECC-capable hardware (memories and communication buses)?
- The most obvious first step is to make full use of the ECC capability the hardware offers, by enabling ECC detection and correction to whatever degree the hardware supports. This means enabling ECC in all memories (flash, SRAM, DRAM, caches, and other peripheral memory buffers and registers) and interfaces supported by the microcontroller. Of course, if any of these memories or interfaces are not relevant to the safety goals, enabling ECC for them is not mandatory from a functional safety perspective.
- In some cases, enabling ECC reduces the available memory capacity or the memory performance. The trade-off between safety/reliability and performance/resources must then be analyzed before making a decision.
- The System should employ appropriate preventive measures to avoid ECC errors. Some of the commonly used preventive measures include:
  - Always initializing memory before use: the content of RAM at power-on reset is typically random, and this includes the ECC bits. If a read access or an unaligned write is made to uninitialized memory, it will result in an ECC error.
  - Flushing the cache periodically: flushing the cache triggers a reload of the instructions or data from the respective memories and thereby clears any accumulation of failing bits.
  - Performing a reset of the system in case of a 2-bit ECC error: resetting the system triggers a re-initialization of the memory content and thus clears the bit errors. Though this seems more like a response measure, it is preventive in the sense that it stops failing bits from accumulating to the point where they might become undetectable (e.g., 3-bit errors). It thereby also reduces the occurrence of further ECC errors.
- It is strongly recommended to consult the microcontroller, processor, or memory supplier to understand which preventive measures must be implemented.
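To put a rough number on the capacity trade-off mentioned in the list above, the following sketch computes how many check bits a textbook SECDED Hamming code needs for a given data word width. It rests on the standard Hamming bound (the smallest r with 2^r >= m + r + 1, plus one overall parity bit); the function name is ours. Whether these bits reduce usable capacity depends on the device: many microcontrollers store them in dedicated extra cells, so the supplier's documentation remains the reference.

```python
# Check-bit count for a textbook SECDED Hamming code. Real devices may use
# different codes or word widths; treat the figures as an estimate only.


def secded_check_bits(data_bits):
    """Hamming parity bits needed for data_bits, plus one overall parity bit."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    for width in (8, 16, 32, 64, 128):
        check = secded_check_bits(width)
        print(f"{width:>3} data bits: {check} check bits "
              f"({check / width:.1%} overhead)")
```

For a 64-bit word this gives 8 check bits, a 12.5% overhead, which matches the common (72,64) organization; narrower words pay proportionally more.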