Research Wiki | Faults

This section looks at how to deal with faults. It covers three topics:

Taxonomy - How to categorize faults.
Detection - Simply recognizing that a fault has occurred can be difficult. Detection is vitally important because many systems assume that components are "Fail Stop": a faulty component halts and its failure is detected immediately, so the fault does not propagate.
Recovery - Once a fault has been detected, the next step is recovery: restarting the offending component, rebuilding state, loading a checkpoint, or diagnosing the problem.
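
One common way to approximate the detection step is a heartbeat timeout. The sketch below is illustrative only: the class name HeartbeatDetector, the component names, and the 2-second timeout are assumptions made for this example, not something taken from the sub-pages. Note that a timeout can only suspect a fault, because a slow component looks the same as a crashed one.

```python
import time
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Hypothetical detector: suspects a component once no heartbeat
    has arrived within `timeout` seconds."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock  # injectable so the example is deterministic
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str) -> None:
        # Record the time at which this component last reported in.
        self.last_seen[component] = self.clock()

    def suspected(self) -> List[str]:
        # A component is suspected if its last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)


# Usage with a fake clock (assumed values, for illustration only).
now = [0.0]
detector = HeartbeatDetector(timeout=2.0, clock=lambda: now[0])
detector.heartbeat("db")
detector.heartbeat("cache")
now[0] = 1.5
detector.heartbeat("db")      # "db" keeps reporting, "cache" goes silent
now[0] = 3.0
print(detector.suspected())   # ['cache']
```

A suspected component would then be handed to whatever recovery action applies, such as a restart or a checkpoint reload.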

General Strategies

Protecting against faults typically uses one of three general strategies. Various implementations are discussed in this section's sub-pages. The strategies themselves are:

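k-modular redundancy - k replicas of a component perform the same work, and a voter takes the majority result (triple modular redundancy is the k = 3 case).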
Primary Backup - hot and cold standby are variations, as is active replication, which is a decentralized version.
Checkpointing
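
As a rough illustration of the k-modular redundancy strategy listed above, the following sketch runs the same computation on k replicas and accepts the value returned by a strict majority. The function names vote and run_redundant and the example replicas are hypothetical. The sketch also assumes a faulty replica returns a wrong value rather than crashing or hanging.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas."""
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was reported by more than half of the replicas.
        raise RuntimeError("no majority among replica results")
    return value


def run_redundant(replicas: Sequence[Callable[[int], Hashable]], x: int) -> Hashable:
    # Run every replica on the same input, then vote on the outputs.
    return vote([replica(x) for replica in replicas])


# Usage: k = 3, with one replica deliberately producing a wrong answer.
replicas = [
    lambda x: x * x,
    lambda x: x * x,
    lambda x: x * x + 1,   # simulated faulty replica
]
print(run_redundant(replicas, 4))   # 16
```

With k replicas and majority voting, this arrangement masks fewer than k/2 faulty results. If no value reaches a majority, the voter raises an error instead of guessing.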