Research Wiki | Faults / Taxonomy

Overview

The types of faults that robots must be able to deal with is extensive: the faults one would expect with any software and hardware system, with the additional faults possible from the added requirement that a robot must interact with its environment.

One effective way to categorize these faults is show below, with the first distinction being between that of faults internal to the robot and those which are a result of external factors. This gives a good snapshot of what can go wrong.

src: Faults#Karapinar12Framework

These failures all exist at a fairly high level of abstraction. For example, a "General Failure" in software may be the result of any number of lower level faults, such as a race condition or an array out of bounds exception. See below for the types of low level faults I am currently investigating.

It is important to note that not only to robots have to deal with a novel set of faults, the impact of generic types of faults can have a much more devastating impact. Robots must be able to be resilient to faults, otherwise the number of applications for which they can be used safely will be greatly reduced. So it is with renewed vigor that we must investigate all possible classes of errors and their possible solutions.

Software faults

Deterministic software faults are well understood and there are strategies to prevent and mitigate them. For these faults, however, the techniques examined in Detection and Recovery often will not work. For example, if a software component fails on a specific input due to a programming error, having three copies of that component all failing at once is not particularly helpful. One expensive strategy for preventing these faults is the work with formal verification done in seL4.

Instead, the work focuses on hardware faults (described below), but may allow for some extension into intermittent software faults. Intermittent software faults are often a result of multithreaded environments in which there is some shared state. For example, a component with multiple threads may occasional freeze due to a rare race condition. There is no direct correlation between the inputs to the system and the fault, and the solution is to restart the component (or fix the bug).

Transient Hardware Faults

Three general categories for hardware faults:

Permanent - What it sounds like.
Intermittent - Occurs non-deterministically but in the same location. See this paper for more.
Transient - Occurs non-deterministically in a random location.

Generally, faults that are difficult to replicate, such as sporadic interference and cosmic rays. These are the main faults which I intend to investigate.

A Single Even Upset (SEU) is the term for when some form of radiation interacts with a hardware component and causes a change in state. This is manifested as a "flipped bit." Extremely rare, but less so as component sizes shrink (more susceptible) and increase in number (larger target area). Unless buried in a concrete bunker (in which case Neutrinos would still be a (very rare) risk), systems will need mechanisms to protect from SEUs.

There are existing strategies to deal with errors in computer memory (Error Correcting Codes) and in communication channels (Error Detection and Correction). These techniques were developed to handle a variety of problems, and fortuitously SEUs also happened to be largely protected from by them. This leaves the problem of redundancy in computations, which is still an active area of research. This is discussed in the fault detection section.

By Effects

Yet another way to classify faults in a system is by the observable effects on execution. This is the method used in the Shye work, which provides process level redundancy. This particular work is focused on Single Even Upsets.

With this methodology, there are three main categories:

Benign - The fault does not effect the correctness of the system. For example, a bit may be flipped into a register that is written to before it is read from next. The assumption (TODO: Find where this is proved) that most SEUs fall into this category is the basis for claims as to why software only systems (with a larger sphere of influence) can achieve better performance (TODO: add link to detection page when complete).
Silent Data Corruption (SDC) - The fault occurs, changes the output of the execution but is not detected. The output propagates to other parts of the system, which assume that it is correct.
Detected Unrecoverable Error (DUE) - At first glance it is obvious: an error that hoses the the execution. Maybe it freezes or performs an invalid operation. Depending on detection mechanisms, this error can be triggered by what once would have been merely a Benign fault, to the detriment of system performance.

Other

The classification of faults presented here is not complete, and is likely in need of revision. I will likely need to investigate Permanent Hardware faults. Another thing to keep in mind is if the domain of robotics has any impact on the techniques I investigate.

Glossary / Acronyms

ECC - Error Correcting Code - A technique applied to computer memory that protects against bit errors.
EDAC - Error Detection And Correction - Methods which allow reliable communication of noisy communication channels by detecting and correcting errors in the transmission.
SEU - Single Event Upset - A transient fault resulting in a "flipped bit" cause by (for example) radiation or interference.