In the introduction to the September 2015 issue of Computing Now, I discussed how, as microelectronics shrink, integrated circuit (IC) performance and complexity increase, making ICs vulnerable to several phenomena that threaten their reliability in the field. That issue explored one phenomenon in particular: bias temperature instability. Yet several other challenges are just as relevant, including vulnerability to radiation-induced faults.
The July 2016 issue of Computing Now examines radiation’s effects on circuit reliability and proposes approaches for ameliorating them.
Radiation-Induced Faults
When energetic particles generated from solar activity (such as protons, heavy ions, alpha particles, and neutrons) hit an IC’s silicon substrate, electron-hole pairs are generated and move in the substrate. If collected by the active source and drain diffusions of a transistor, such electron-hole pairs alter the charge state of the IC’s internal nodes, causing a voltage glitch usually referred to as a radiation-induced fault, a transient fault, or a single-event transient.
If the affected nodes belong to a memory element, radiation-induced faults could cause an output logic error, usually called a soft error (SE) or single-event upset. If the affected nodes belong to a logic block, the faults might propagate to the input of a downstream sampling element and then cause an SE. This occurs when the electrical filtering of the circuits between the affected node and the latching element input, as well as the logical filtering of the interposed logic gates, are such that the radiation-induced fault reaches the latching element input with sufficient amplitude and duration. Moreover, the radiation-induced fault must arrive at the latching element input so that it is present during the element’s setup and hold-time window around the sampling instant. The SE at the sampling element’s output could compromise the whole system’s correct operation in the field, with consequent harm to reliability.
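The conditions above can be pictured as three filters that a transient must pass before it becomes an SE. The following Python sketch is only an illustration under simplifying assumptions: the names (Transient, SamplingElement, causes_soft_error), the single attenuation factor standing in for electrical filtering, and all numeric values are hypothetical and are not taken from the articles in this issue.

```python
from dataclasses import dataclass


@dataclass
class Transient:
    amplitude_v: float  # glitch amplitude at the struck node (volts)
    width_ps: float     # glitch duration (picoseconds)
    arrival_ps: float   # glitch start at the latch input, relative to the sampling instant (t = 0)


@dataclass
class SamplingElement:
    setup_ps: float            # setup time before the sampling instant
    hold_ps: float             # hold time after the sampling instant
    switch_threshold_v: float  # assumed minimum amplitude able to flip the stored value
    min_width_ps: float        # assumed minimum glitch width that can be latched


def causes_soft_error(glitch, latch, path_sensitized, attenuation=1.0):
    """Return True if the glitch passes all three (simplified) filters."""
    # Logical filtering: the interposed gates must let the glitch through.
    if not path_sensitized:
        return False
    # Electrical filtering: the glitch must keep enough amplitude and duration.
    if glitch.amplitude_v * attenuation < latch.switch_threshold_v:
        return False
    if glitch.width_ps < latch.min_width_ps:
        return False
    # Latching window: the glitch must overlap [-setup, +hold] around t = 0.
    start = glitch.arrival_ps
    end = glitch.arrival_ps + glitch.width_ps
    return start <= latch.hold_ps and end >= -latch.setup_ps


latch = SamplingElement(setup_ps=30, hold_ps=20, switch_threshold_v=0.5, min_width_ps=40)
glitch = Transient(amplitude_v=0.9, width_ps=80, arrival_ps=-50)
print(causes_soft_error(glitch, latch, path_sensitized=True, attenuation=0.8))
```

In this toy model, blocking any one of the three filters (a non-sensitized path, stronger attenuation, or an arrival outside the sampling window) is enough to prevent the SE.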
Although these errors have always been a problem for the electronics used in aviation and space exploration, they now also present major concerns for general-purpose electronics. Smaller transistor sizes mean scaled-down supply voltages and reduced capacitance at IC nodes, both of which increase susceptibility to radiation-induced faults. Researchers have devoted intensive efforts to analyzing and modeling radiation-induced faults in ICs, both to estimate IC sensitivity and to develop techniques aimed at improving their tolerance.
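A common first-order way to see why scaling hurts is the critical charge of a node, roughly the node capacitance multiplied by the supply voltage: a particle strike that deposits more charge than this can flip the node. The short Python sketch below uses that approximation only; the function names and the capacitance and voltage values are illustrative assumptions, not data from this issue.

```python
def critical_charge_fc(node_capacitance_ff, supply_voltage_v):
    """First-order estimate Q_crit ~= C_node * V_DD (fF * V gives fC)."""
    return node_capacitance_ff * supply_voltage_v


def strike_upsets_node(collected_charge_fc, node_capacitance_ff, supply_voltage_v):
    """A strike is assumed to upset the node if it collects more than Q_crit."""
    return collected_charge_fc > critical_charge_fc(node_capacitance_ff, supply_voltage_v)


# Hypothetical nodes: an older technology versus a scaled-down one.
older_qcrit = critical_charge_fc(node_capacitance_ff=2.0, supply_voltage_v=1.2)
scaled_qcrit = critical_charge_fc(node_capacitance_ff=0.5, supply_voltage_v=0.8)
print(older_qcrit, scaled_qcrit)

# The same 1 fC strike is harmless for the older node but upsets the scaled one.
print(strike_upsets_node(1.0, 2.0, 1.2), strike_upsets_node(1.0, 0.5, 0.8))
```

Real critical charge also depends on the restoring current of the struck circuit and the shape of the collected-charge pulse, so this product should be read as a rough trend rather than a design figure.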
In this Issue
The four articles in this month’s theme discuss innovative approaches to designing robust, radiation-tolerant memory elements, embedded processors, and graphics processing units, in theory and in practice.
“High-Performance Robust Latches,” an article that I coauthored with Martin Omaña and Daniele Rossi, introduces two latches that are insensitive to radiation-induced faults. The first latch is suitable for architectures that don’t employ clock gating (CG) to reduce power consumption. This is because radiation-induced faults that affect some of its internal nodes can leave the output node in a high-impedance state. If CG were activated, the high-impedance node could be improperly charged or discharged to an incorrect logic value due to leakage currents, potentially generating an output SE. Without CG, the output doesn’t remain in a high-impedance state long enough for leakage currents to charge or discharge the associated capacitance. The second latch is suitable for architectures that employ CG because its output can’t remain in a high-impedance state when radiation-induced faults affect any of its internal nodes.
In “Evaluating Overheads of Multibit Soft-Error Protection in the Processor Core,” Lukasz G. Szafaryn, Brett H. Meyer, and Kevin Skadron analyze radiation-induced faults and related SEs in a processor core. They propose a tool to evaluate the area, delay, and energy overheads of various traditional SE-protection techniques, including some robust design approaches for latches, error-correcting codes, spatial and temporal redundancy, and error-detecting codes. In particular, the tool uses synthesized implementations of these techniques to evaluate design trade-offs. The authors consider a simple OpenRISC core as a case study, examining SEs that affect a single bit (also referred to as single-bit upsets), as well as those that affect multiple bits of neighboring cells or logic components simultaneously (multiple-bit upsets, or MBUs).
Lawrence T. Clark and his colleagues’ “An Embedded Microprocessor Radiation Hardened by Microarchitecture and Circuits” proposes the design of a radiation-hardened microprocessor core, combining traditional fault-tolerance approaches at the micro-architectural level (such as upset detection while instructions are in the speculative state, as well as instruction restart) with hardened circuits and proper physical design (like enforcing critical nodes’ separation to avoid MBUs). The authors verify the hardened circuits through proton and heavy-ion broad beam testing.
In the final article, “Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units,” Daniel Alfonso Gonçalves de Oliveira and his colleagues analyze radiation-induced faults and GPU reliability through analytical case studies of two commercially available GPUs and a series of extensive accelerated beam tests. The data provide a realistic estimate of the error rate of modern GPUs exposed to the natural neutron flux, as well as an evaluation of the effectiveness of the adopted architectural and software hardening solutions. The analysis results are highly relevant to high-performance computing and safety-critical applications, for which developers are increasingly adopting GPUs because of their computation speed, low cost, energy efficiency, and flexible development platform.
Video Perspectives
This month’s theme issue includes two relevant videos by experts in the field.
In the first video, Monica Alderighi of the Italian National Institute for Astrophysics offers insights into radiation-induced faults in terrestrial applications.
In the second video, Yervant Zorian of Synopsys discusses radiation challenges in safety-critical electronic systems and addresses some of the technical solutions the industry is adopting.
We hope that this issue of Computing Now serves as a resource to highlight the major challenges related to radiation-induced faults and techniques to tolerate them, and stimulates further research in the field.
Guest Editor
Cecilia Metra is Computing Now’s editor in chief and a full professor of electronics at the University of Bologna, Italy. Her research interests are the design and test of digital systems, reliable and error-resilient system design, fault tolerance, emergent technologies and nanocomputing, secure systems, energy-harvesting systems, and photovoltaic systems. Metra has a PhD in electronic engineering and computer science from the University of Bologna. She is a member of the IEEE Computer Society’s Board of Governors and Executive Committee. She’s on the editorial boards of several professional journals and has served as a general chair, program chair, program co-chair, or technical-program committee member for numerous IEEE-sponsored conferences, symposia, and workshops. Metra is an IEEE Fellow and a Golden Core member of the IEEE Computer Society, from which she has received three Meritorious Service Awards and two Certificates of Appreciation. Contact her at cecilia.metra@unibo.it.