This section looks at how to deal with faults.
- Taxonomy - How to categorize faults.
- Detection - Simply recognizing that a fault has occurred can be difficult. Detection is vitally important, especially since many systems assume that components are "Fail Stop": faults do not propagate, but are detected immediately.
- Recovery - Once the fault has been detected, the next step is recovery. Options include restarting the offending component, rebuilding state, loading a checkpoint, diagnosing the problem...
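As a rough illustration of the detection step, one common approach is a heartbeat timeout: a component is suspected to have failed once it has been silent for too long. The sketch below is only one possible way to do this and is not taken from this section's sub-pages; the class name `HeartbeatDetector`, the `timeout` parameter, and the injectable `clock` are all assumptions made for the example. Note that a timeout can only suspect a fault, since a slow component looks the same as a stopped one.

```python
import time
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Hypothetical detector: suspects components whose heartbeats stop."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout          # seconds of silence tolerated
        self.clock = clock              # injectable so tests can fake time
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, component: str) -> None:
        """Record that `component` reported in just now."""
        self.last_seen[component] = self.clock()

    def suspected(self) -> List[str]:
        """Return the components silent for longer than the timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

A recovery routine could poll `suspected()` periodically and restart whatever it returns, or load a checkpoint for it.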
General Strategies
Protection against faults often relies on one of three general strategies. Various implementations are discussed in this section's sub-pages. The strategies themselves are:
- k-modular redundancy
- Primary Backup - hot and cold standby are variations, as is active replication, which is a decentralized version.
- Checkpointing
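To make the first strategy more concrete, k-modular redundancy is usually understood as running the same computation on k replicas and taking a majority vote over their outputs, so that a minority of faulty replicas is masked. The following is a minimal sketch under that assumption; the function names `vote` and `run_redundant` are hypothetical, and the replicas are plain in-process functions rather than separate machines.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was produced by more than half of the replicas.
        raise RuntimeError("no majority among replica results")
    return value


def run_redundant(replicas: Sequence[Callable[[int], Hashable]],
                  x: int) -> Hashable:
    """Run every replica on the same input and vote on the outputs."""
    return vote([replica(x) for replica in replicas])


# Three replicas (k = 3); the third is faulty and returns a wrong value.
replicas = [lambda x: x * x, lambda x: x * x, lambda x: x * x + 1]
answer = run_redundant(replicas, 4)  # the two correct replicas outvote it
```

With a strict majority vote, k replicas can mask fewer than k/2 faulty ones, which is why k is typically odd.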