ECC (Error Correction Codes)

ECC (error correction code, also called error-correcting code) is a method of detecting and correcting errors in digital data. It is widely used to protect data stored in memories and to ensure the integrity of data in transmission.

In this article, we answer some specific questions that Safety beginners frequently ask about ECC:

From a functional safety perspective, what is the purpose of ECC?
Is ECC a mandatory safety mechanism?
Is ECC a mechanism to achieve freedom from interference or independence?
What aspects must be considered in a Safety-critical System that uses ECC-capable hardware (memories and communication buses)?

From a functional safety perspective, what is the purpose of ECC?

The broad purpose of ECC is to make memories and data transmission more stable and reliable. That is why ECC is used not only in safety-critical domains such as automotive, aviation, and defense, but also in ‘maximum-availability’ systems that are not safety-relevant, such as file servers and critical databases.

ECC commonly uses Hamming codes to detect hard and soft bit errors. What are hard and soft errors? Hard errors are permanent: they are caused by physical factors such as temperature or power variation and stress on the hardware, and they lead to a permanent failure of the hardware circuit. Soft errors are transient and are caused by electrical and magnetic interference and even cosmic rays.

Since hard errors are permanent, ECC cannot remove them. It can only detect them (or, at best, mask a single failing bit on each read while the underlying fault remains). Soft errors, however, can be detected and also corrected by the HW, even without the knowledge of the Application.

The most common ECC implementation is based on the Hamming code ‘SECDED’ (single-bit error correction, double-bit error detection), which can detect 1-bit and 2-bit errors and correct 1-bit errors. Some Microcontrollers also offer DECTED (double-bit error correction, triple-bit error detection). A very good explanation of how SECDED works is available here.

From a functional safety perspective, ECC helps the System meet the quantitative single-point fault metric required for its ASIL. ECC offers up to 90% diagnostic coverage for Memories with SECDED schemes and up to 99% with DECTED schemes.
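As a rough illustration of how diagnostic coverage feeds into the single-point fault metric (SPFM), the sketch below applies the usual relations: the residual failure rate is the raw failure rate times (1 - DC), and SPFM is 1 minus the residual failure rate divided by the total failure rate. The 100 FIT value is a made-up number, and the example assumes the memory is the only safety-related element, which is a simplification.

```python
def residual_fit(raw_fit: float, diagnostic_coverage: float) -> float:
    """Failure rate left uncovered by the safety mechanism: lambda * (1 - DC)."""
    return raw_fit * (1.0 - diagnostic_coverage)


def spfm(total_fit: float, residual_fit_sum: float) -> float:
    """Single-point fault metric: 1 - (sum of residual FIT / sum of total FIT)."""
    return 1.0 - residual_fit_sum / total_fit


MEMORY_FIT = 100.0  # hypothetical raw failure rate of the memory, in FIT

for scheme, dc in (("SECDED", 0.90), ("DECTED", 0.99)):
    rf = residual_fit(MEMORY_FIT, dc)
    print(f"{scheme}: residual {rf:.1f} FIT, SPFM {spfm(MEMORY_FIT, rf):.1%}")
```

In a real analysis the sums run over all safety-related hardware elements, so the memory's residual rate is only one contribution to the metric.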

Is ECC mandatory for Safety critical systems?

ISO 26262 does not mandate any specific safety mechanism. A system without hardware ECC may still meet the hardware metric targets required for its ASIL. That being said, our personal opinion is that ECC should be enabled for all the Memories that support it, whether or not it is needed for the quantitative metrics, because it is a fundamental way of ensuring reliability. Enabling ECC may well be a much simpler way to handle single- and double-bit failures than deploying complex software safety mechanisms, but ease of use should not be the main motivation for choosing any safety mechanism.

Is ECC a mechanism to achieve freedom from interference or independence?

Yes, ECC can serve as a Safety mechanism to achieve Freedom from Interference (FFI) or Independence.

Let’s take 2 scenarios:

1. A memory region goes rogue (due to a bit flip), and that failure cascades to a safety component, and causes a safety goal violation. This is a case of Interference.

2. This 'bad' memory region is used by multiple SW components at the same time, and hence causes multiple failures at the same time. If these failures occur in decomposed paths, Independence between those paths is no longer achieved.

In both these cases, ECC would help to correct the bit flip and restore the right value in the memory region.

What aspects must be considered in a Safety-critical System that uses ECC-capable hardware (memories and communication buses)?

The most obvious first step would be to fully utilize the ECC capability offered by the hardware, by enabling ECC correction and detection to whatever degree the HW supports. This means enabling ECC in all the Memories (flash, SRAM, DRAM, Caches, and other peripheral memory buffers and registers) and Interfaces supported by the Microcontroller. Of course, if any of these Memories and Interfaces are not relevant to the Safety goals, it is not mandatory (from a functional safety perspective) to enable ECC for them.
In some cases, enabling ECC reduces the available memory capacity or the memory performance. In such cases, the trade-off between Safety/reliability and Performance/Resources must be analyzed before making a decision.
The System should employ appropriate preventive measures to avoid ECC errors. Some of the commonly used preventive measures include:

Always initializing memory before use: Typically, the RAM content at power-on reset is random, and this includes the ECC bits as well, so the stored check bits do not match the stored data. If a read access is made to uninitialized memory, it will result in an ECC error. The same applies to an unaligned (partial) write, which the hardware typically performs as a read-modify-write of the whole ECC-protected word.
Flushing the cache periodically: Flushing the cache triggers a reload of the instructions or data from the respective memories and thereby clears any accumulated failing bits.
Performing a reset of the system in case of a 2-bit ECC error: Resetting the system triggers a re-initialization of the memory content and thus clears the bit errors. Though this seems more like a response measure, it is preventive in the sense that it stops failing bits from accumulating to the point where they might become undetectable (e.g., 3-bit errors). Thereby, it also reduces the occurrence of further ECC errors.
It is strongly recommended to discuss with the Microcontroller/processor/Memory supplier which preventive measures must be implemented.
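The effect of reading uninitialized memory can be illustrated with a toy model. The sketch below assumes an 8-bit extended Hamming (SECDED) word in which bit 0 is the overall parity bit and bit positions 1 to 7 form a Hamming(7,4) code; this layout, the RAM size, and the helper names are assumptions for illustration only. It counts how many of the 256 possible power-on patterns happen to be valid codewords, then simulates a RAM before and after initialization.

```python
import random


def check(word: int) -> tuple[int, int]:
    """Return (syndrome, overall parity) of an 8-bit word; (0, 0) means no error."""
    syndrome = 0
    ones = 0
    for pos in range(8):
        if (word >> pos) & 1:
            ones += 1
            syndrome ^= pos  # position 0 does not contribute to the syndrome
    return syndrome, ones % 2


def has_ecc_error(word: int) -> bool:
    return check(word) != (0, 0)


# Only a small fraction of all bit patterns are valid codewords.
valid_patterns = sum(1 for w in range(256) if not has_ecc_error(w))
print(valid_patterns, "of 256 patterns are valid")  # 16 of 256 patterns are valid

# Simulated power-on content: data and check bits are both random.
rng = random.Random(0)
ram = [rng.randrange(256) for _ in range(1024)]
errors_before = sum(has_ecc_error(w) for w in ram)

# Initialization writes known data together with matching check bits.
# The all-zero word is a valid codeword in this model.
ram = [0] * len(ram)
errors_after = sum(has_ecc_error(w) for w in ram)

print("words flagged before init:", errors_before)
print("words flagged after init:", errors_after)
```

In this model roughly 15 out of every 16 uninitialized words would raise an ECC error on the first read, which is why start-up code usually writes the whole ECC-protected RAM before using it.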

The SW component that enables ECC must be developed according to the applicable ASIL. Also, it is a good idea to maintain a count of how many ECC corrections and detections were performed by the Microcontroller, if it provides a feature to report its ECC actions. This would be a useful diagnostic to assess the quality of the Memory and to take the required actions in Software, such as setting a diagnostic trouble code (DTC) if there are too many error corrections or detections. Such a DTC can be used to quickly detect faulty HW during operation.
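A minimal sketch of such a counter is shown below. The class name, the callback, and the threshold values are all hypothetical; in a real ECU the callback would be driven by the Microcontroller's ECC notification (interrupt or status register), and the thresholds would come from the supplier's guidance and the safety analysis.

```python
from dataclasses import dataclass


@dataclass
class EccMonitor:
    """Counts ECC events and latches a DTC flag when a limit is reached."""
    corrected_limit: int = 10  # hypothetical limit for corrected errors
    detected_limit: int = 1    # hypothetical limit for uncorrectable errors
    corrected: int = 0
    detected: int = 0
    dtc_set: bool = False

    def on_ecc_event(self, was_corrected: bool) -> None:
        """Call once per ECC notification from the hardware."""
        if was_corrected:
            self.corrected += 1
        else:
            self.detected += 1
        if (self.corrected >= self.corrected_limit
                or self.detected >= self.detected_limit):
            self.dtc_set = True  # latched until cleared by a diagnostic service


monitor = EccMonitor(corrected_limit=3)
for _ in range(3):
    monitor.on_ecc_event(was_corrected=True)
print(monitor.corrected, monitor.dtc_set)
```

Keeping separate counters for corrected and detected errors makes it possible to react differently to each, for example logging corrections but escalating on the first uncorrectable error.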
Typically, Microcontrollers provide registers to test the ECC feature, in order to detect latent faults in the mechanism itself. One could look at it as a “fake” fault injection test. A register can be set to introduce a bit flip, and the ECC status registers can then be read to verify that the bit flip was detected and corrected. This feature should be leveraged if provided by the supplier.
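The test sequence could look roughly like the sketch below. It runs against a software stand-in, not real hardware: FakeEccController and its attributes (inject_single_bit, corrected_flag) are invented for illustration and do not correspond to any particular device's registers.

```python
class FakeEccController:
    """Software stand-in for an ECC unit; all names here are hypothetical."""

    def __init__(self, logic_works: bool = True) -> None:
        self.inject_single_bit = False  # models a fault-injection control bit
        self.corrected_flag = False     # models a 'single-bit corrected' status bit
        self._logic_works = logic_works
        self._cell = 0

    def write(self, value: int) -> None:
        self._cell = value

    def read(self) -> int:
        if not self.inject_single_bit:
            return self._cell
        if self._logic_works:
            self.corrected_flag = True  # flip was seen and corrected
            return self._cell
        return self._cell ^ 1           # broken ECC logic: flip passes through


def ecc_self_test(ctrl: FakeEccController, pattern: int = 0xA5) -> bool:
    """Inject one bit flip and check that it is reported and corrected."""
    ctrl.write(pattern)
    ctrl.corrected_flag = False
    ctrl.inject_single_bit = True
    try:
        value = ctrl.read()
    finally:
        ctrl.inject_single_bit = False  # always leave injection disabled
    passed = ctrl.corrected_flag and value == pattern
    ctrl.corrected_flag = False         # do not leave the test event pending
    return passed


print(ecc_self_test(FakeEccController()))                   # True
print(ecc_self_test(FakeEccController(logic_works=False)))  # False
```

Two details are worth carrying over to a real implementation: injection is switched off again even if the read fails, and the status flag set by the test is cleared so that it is not counted as a genuine ECC event.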
If the System initially ran without ECC and ECC is enabled midway through the program, it is recommended to repeat EMI/EMC testing. Why is this needed? During EMI/EMC testing, electromagnetic phenomena such as magnetic fields, electromagnetic surges, and ESD are simulated. These conditions might lead to bit flips in the memories, so one could look at this as the real “fault injection test” for the ECC mechanism. Ideally, a piece of SW that counts the number of ECC corrections and detections could hint at whether bit flips were corrected or detected during testing. The flip side is that this test gives no certainty: we do not know how many bit flips the testing introduced (or whether it caused any at all), so the count of ECC corrections and detections cannot be checked against an expected value. However, one could look at it as a measure over and above reviewing the ECC configurations/code and performing the latent fault test described above.

Conclusion

When we thought about the future of ECC, and whether it will stay or be replaced by other technologies, we started to think about where the Auto industry is headed and what type of memories will be used in newer ECUs. The trend is clearly towards bandwidth- and computation-intensive systems: on one hand ADAS, DMS, and self-driving vehicles, and on the other hand sophisticated displays for HUDs and Domain Controllers. Such systems use a lot of DRAM as an enabling technology to provide the required bandwidth and capacity. As DDR memory progresses through generations (DDR4, LPDDR4, DDR5, LPDDR5), ECC will remain the key memory RAS (reliability, availability, and serviceability) feature. So it is almost certain: ECC is here to stay.

If you would like to go deeper into the topics covered in this blog, please check the following articles:

TI Article

ST White paper

Synopsys Article
