Making soft errors a hard life. Part 2

In this second part of the blog series, I want to point out the importance of soft error rates and their consideration in applications across different industries. I will also touch on the methods used to evaluate soft error rates, and on the possible mitigations that reduce the risk of soft errors and the resulting failures.

ISO 26262 part 5 states that hardware safety requirements shall include internal safety mechanisms to cover transient faults when these are shown to be relevant due, for instance, to the technology used.

IEC 60601-1 states: “When applicable, the manufacturer shall address in the risk management process the risks associated with alpha, beta, gamma, neutron and other particle radiation”, with no further detailed explanation. This statement is unsettling when we consider that some scenarios could have serious consequences for patients.

The increased implementation of x-by-wire systems in the automotive industry brings a higher number of bus connections and modern digital components. Failure of such systems can lead to fatal accidents, so their reliability and safety must be at a high level, and this includes the consideration of hardware failures and soft errors. ISO 26262 “Road vehicles – Functional safety” defines in detail the requirements for handling random hardware failures, but it covers soft errors only partially. Part 5 (Hardware), for example, says more about which safety mechanisms are not effective against soft errors; only a few effective mechanisms, such as HW/SW redundancy (e.g. Dual Core Lockstep), are listed. Depending on their impact, transient faults can be considered in the safety analysis. Part 11 of ISO 26262 gives some examples of the classification of transient faults and refers to JEDEC JESD89A, but in general the handling of soft errors in ISO 26262 could be better. A clearer and more detailed approach would help to improve the risk evaluation of soft errors.

Soft errors are also one of the major design challenges in the medical device industry. Devices such as dialysis machines use IC technology and are therefore sensitive to soft errors.

The technologies and technical mitigations against soft errors are the same for every branch of industry. The failure effects and consequences, however, are not necessarily the same in telecommunication applications as in medical or automotive applications with safety aspects.

Soft error rates can be significantly higher than hard failure rates. We are talking about 1k to 100k FIT/Mbit for technologies below 100 nm, where 1 FIT is one failure per 10^9 device-hours. As chip technology scales rapidly, soft error rates are increasing at the chip level, and proper steps must be taken to reduce the risk of device failure.
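To give a feeling for what such numbers mean in the field, here is a minimal Python sketch that converts a FIT/Mbit figure into expected errors per year. It relies only on the definition of FIT as failures per 10^9 device-hours. The function names, the 8 Mbit memory size and the fleet size are my own illustrative assumptions, and the result is a raw rate before any derating or mitigation.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 h, leap years ignored


def device_fit(fit_per_mbit: float, memory_mbit: float) -> float:
    """Raw device-level soft error rate in FIT (failures per 1e9 device-hours)."""
    return fit_per_mbit * memory_mbit


def errors_per_year(fit: float, devices: int = 1) -> float:
    """Expected number of soft errors per year across `devices` identical parts."""
    return fit * 1e-9 * HOURS_PER_YEAR * devices


if __name__ == "__main__":
    # Assumed example: 8 Mbit of SRAM at the lower end of the quoted range.
    fit = device_fit(fit_per_mbit=1_000, memory_mbit=8)
    print(f"device SER: {fit:.0f} FIT")
    print(f"one device: {errors_per_year(fit):.5f} errors/year")
    print(f"fleet of 100000: {errors_per_year(fit, devices=100_000):.0f} errors/year")
```

A single device looks harmless at about 0.07 errors per year, but the same rate across a fleet of 100,000 units corresponds to roughly 7,000 events per year, which is why the failure effect of each event matters so much in safety-related applications.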

Test methods according to JEDEC JESD89A:

As mentioned in the first part of the blog, the JEDEC standard JESD89A defines the requirements and procedures for terrestrial soft error rate (SER) testing of integrated circuits and for the evaluation of the results. The main sources of radiation, cosmic rays and alpha particles, are considered.

The following methods are used for testing:

Real-time (unaccelerated) SER test procedures
Accelerated SER test procedures

The automatic test equipment (ATE) hardware used for testing must be capable of operating over the full range of test conditions. The ATE software provides the required test conditions and detects, records and corrects errors.

In addition to memory components (SRAM, DRAM etc.), JESD89A also describes testing issues to be considered for non-memory components such as random logic circuits (sequential logic, register files etc.), microprocessors and FPGAs.

Real-Time test method:

The simplest way to measure the SER of a device is to test it under standard operating conditions and normal ambient background radiation. The advantage is that the results can be evaluated without additional extrapolation and without correction factors that have a large influence on the calculated SER. The disadvantage is the long test duration, which can run to months before reliable data is available. To reduce the required test time, a large number of parts can be tested in parallel to increase the overall effective failure rate. Tests at higher altitudes, where the higher neutron flux increases the effective failure rate, can also shorten the time. Existing accelerated test results for similar components can help to estimate the average failure rate and the test time.
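The trade-off between the number of parts and the test duration can be sketched in a few lines of Python. This is only a rough planning aid under my own assumptions: errors are treated as arriving at a constant rate, the expected FIT comes from a prior estimate (for example accelerated results of a similar component), and the altitude acceleration factor is a number you supply yourself. None of the names below come from JESD89A.

```python
def realtime_test_hours(expected_fit: float, n_devices: int,
                        target_errors: float, acceleration: float = 1.0) -> float:
    """Wall-clock hours needed until `target_errors` events are expected.

    expected_fit  : prior estimate of the device SER in FIT (assumption)
    n_devices     : number of parts tested in parallel
    target_errors : number of events wanted for a meaningful statistic
    acceleration  : user-supplied flux ratio, e.g. for a high-altitude site
    """
    if min(expected_fit, n_devices, target_errors, acceleration) <= 0:
        raise ValueError("all inputs must be positive")
    errors_per_hour = expected_fit * 1e-9 * acceleration * n_devices
    return target_errors / errors_per_hour


def measured_fit(errors: int, n_devices: int, hours: float) -> float:
    """Point estimate of the SER in FIT from an unaccelerated test."""
    return errors / (n_devices * hours) * 1e9


if __name__ == "__main__":
    hours = realtime_test_hours(expected_fit=8_000, n_devices=1_000, target_errors=10)
    print(f"sea level: {hours:.0f} h (~{hours / 24:.0f} days)")
    hours_alt = realtime_test_hours(8_000, 1_000, 10, acceleration=5.0)
    print(f"assumed 5x flux at altitude: {hours_alt:.0f} h")
```

With these assumed inputs, 1,000 parts need about 1,250 hours to collect ten events, which illustrates why real-time tests take months unless the sample size or the flux is increased.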

Because a real-time test is carried out without an ionizing source, this method is only applicable to testing for alpha particles from packaging and for natural ambient radiation due to terrestrial cosmic rays. Even though real-time SER testing is representative of a device under normal radiation conditions, the results are not necessarily directly applicable to the end-user system application, due to possible differences in altitude, geomagnetic location, application assembly etc.

Accelerated testing distinguishes between three methods:

Alpha-particle test procedure, which is independent of altitude and depends only on the type, location and amount of radioactive impurities present in the packaging or in the component itself.
Terrestrial cosmic ray test procedure, covering soft errors induced by the high-energy neutron flux, which depends on altitude, geomagnetic location and solar activity.
Thermal neutron test procedure, covering soft errors induced by the reaction between neutrons and boron, which is used as a silicon dopant; the reaction produces charged particles that cause soft errors.

Alpha particle tests use, for example, radioactive isotope foils as the test source. The source is placed close to the open die, which is mounted on a special package (see figure below). The source should emit enough charged particles to induce soft errors in the component. Both neutron test procedures need an accelerated beam flux source to irradiate the device under test.

Graphic 1: Alpha particle foil as active source for accelerated test for SER

The advantage of accelerated testing is the shorter test period and the smaller sample size compared to the real-time test method. The disadvantage is that the accelerated results must be extrapolated to use conditions and the failure rates computed using equations. In the semiconductor industry, test chips containing application blocks such as SRAMs or sequential logic arrays, which represent technology groups, are usually used as the device under test to provide accurate modelling of SER.

For both methods, real-time and accelerated, calculation examples are documented in the annexes of JESD89A. The SER is reported in FIT or FIT/Mbit.
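The basic idea behind extrapolating an accelerated test to use conditions can be sketched as follows. This is a simplified illustration, not the calculation procedure from the JESD89A annexes: the observed errors are divided by the particle fluence to obtain a cross-section, which is then multiplied by the flux expected in the field. Geometry, energy spectrum and other correction factors are ignored, and all names and numbers (200 errors, a fluence of 1e10 particles/cm², a use flux of 13 particles/cm² per hour, a 4 Mbit array) are assumptions chosen for illustration; take the reference flux for your location from the standard.

```python
def cross_section_cm2(errors: int, fluence_per_cm2: float) -> float:
    """Device cross-section: observed errors per unit particle fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence_per_cm2


def fit_from_accelerated_test(errors: int, fluence_per_cm2: float,
                              use_flux_per_cm2_h: float) -> float:
    """Extrapolated SER in FIT for a given use-condition flux (particles/cm^2/h)."""
    sigma = cross_section_cm2(errors, fluence_per_cm2)
    failures_per_hour = sigma * use_flux_per_cm2_h
    return failures_per_hour * 1e9  # FIT = failures per 1e9 device-hours


def fit_per_mbit(fit: float, memory_mbit: float) -> float:
    """Normalise a device-level FIT value to the tested memory size."""
    return fit / memory_mbit


if __name__ == "__main__":
    # Assumed beam test result and assumed use-condition flux.
    fit = fit_from_accelerated_test(errors=200, fluence_per_cm2=1e10,
                                    use_flux_per_cm2_h=13.0)
    print(f"extrapolated SER: {fit:.0f} FIT, "
          f"{fit_per_mbit(fit, memory_mbit=4):.0f} FIT/Mbit")
```

The same flux-ratio reasoning applies to an alpha source, where the use flux would be the alpha emission of the packaging materials instead of the neutron flux at the application site.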

Influencing factors which affect SER:

Packaging material, solder material etc.
Product technology
Amount of SRAM and sequential logic
Low-power technologies with low signal charge (critical charge)
Location of application etc.

SER Level/Criteria:

There is no standard or general specification of acceptable soft error rates. This is because the SER depends on the application, and each application has different influencing factors. No single metric can be used for the SER of general components. The criteria for an acceptable failure level should be determined based on the application and, even more importantly, the focus should be on developing a robust design and mitigations against soft errors.

Possible mitigations:

Soft errors cannot be completely avoided. Nevertheless, there are different approaches to reduce the SER itself and to increase the detection of failures caused by soft errors. These include, on the one hand, reducing the raw soft error rate through silicon processes, materials and design techniques and, on the other hand, mitigating failures caused by soft errors through architectural measures (safety mechanisms).

Reducing the SER:

Use of materials with low alpha emission (adds material cost; process contamination is nonetheless possible)
Die underfill with ultra-low alpha characteristics (shields transistors from alpha particles to a certain degree)
Silicon on Insulator technology (eliminates charge collection in the transistor substrate; halving of SER possible)
Increasing the critical charge of the circuit (design impact)
Robust cell designs (layout considerations, extra cell capacitance in SRAM etc.)

Architectural mitigation:

Parity codes (detection of single bit errors)
Error Correction Codes (detection of multi-bit and correction of single-bit errors, impact on performance of chip)
Memory Mirroring (simple ECC code still needed)
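To illustrate the difference between the first two items, here is a small Python sketch of an even-parity bit and of a single-error-correcting, double-error-detecting (SECDED) code. I use an extended Hamming (8,4) code purely as a textbook example; real memories protect wider words (and may use other codes), and the bit layout and function names are my own assumptions.

```python
def parity(bits):
    """XOR of all bits: 1 if the number of ones is odd, else 0."""
    p = 0
    for b in bits:
        p ^= b
    return p


def add_parity(data):
    """Append an even-parity bit to the data bits."""
    return list(data) + [parity(data)]


def parity_ok(word):
    """True if the word (data + parity bit) has an even number of ones."""
    return parity(word) == 0


def secded_encode(data):
    """Encode 4 data bits as p1 p2 d1 p4 d2 d3 d4 + overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers Hamming positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers Hamming positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers Hamming positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    return word + [parity(word)]


def secded_decode(codeword):
    """Return (data, status); status is 'ok', 'corrected' or 'double_error'."""
    word, p0 = list(codeword[:7]), codeword[7]
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # Hamming position of a single flipped bit
    overall_ok = parity(word) == p0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome:
            word[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with matching overall parity: two bits flipped.
        return None, "double_error"
    return [word[2], word[4], word[5], word[6]], status


if __name__ == "__main__":
    cw = secded_encode([1, 0, 1, 1])
    cw[4] ^= 1  # simulate a single-bit upset
    print(secded_decode(cw))
    cw[6] ^= 1  # a second upset in the same word
    print(secded_decode(cw))
```

The parity bit alone can only flag that a single bit has flipped, and it misses a double flip entirely, whereas the SECDED code repairs one flipped bit and still reports two. The price is four check bits for four data bits in this toy example, plus the encode and decode logic on every access.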

Conclusion:

Soft errors are becoming more significant for reliability and safety. As the density of chips increases, the soft error rate increases correspondingly. It can be reduced by using special materials, but the cost impact can be significant. Architectural mitigations such as ECC protection of critical data are very effective but also have an impact on the device architecture and the development process. The right approach must therefore be found for each project and item developed, following the risk evaluation and management processes in automotive, medical and other safety-related industries.

Want to read the first Part of the blog series about the definition of soft errors and how they are caused? Click here.

By Dijaz Maric, Quality Management & Reliability Engineering Consultant

Do you want to learn more about the implementation of ISO 26262, IEC 60601 or any other standard in the Automotive or Medical Device sector? We work remotely with you. Please contact us at info@lorit-consultancy.com for bespoke consultancy or join one of our upcoming online courses.
