Reliability and Fault Tolerance
22nd January 2010 (I was absent from this lecture)
These are lecture notes from my Computer Science course, not a general reference for "Reliability and Fault Tolerance"
There are 6 aspects of Dependability:
- Availability
- Reliability
- Safety; non-occurrence of catastrophic consequences
- Confidentiality; security
- Integrity; information doesn’t get altered improperly
- Maintainability; the system can be repaired/evolved
Reliability of a system is a measure of how successfully it conforms to its specification. When the behaviour deviates from the specification, a ‘failure’ occurs. A failure results from internal Errors presenting themselves externally, and those errors are in turn caused by mechanical/algorithmic Faults.
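To make the fault, error, failure chain concrete, here is a minimal sketch. The function `buggy_mean` and its one-line specification are made up for illustration; they are not from the lecture.

```python
def buggy_mean(values):
    """Specification (assumed): return the arithmetic mean of a non-empty list."""
    total = 0
    for v in values[1:]:      # FAULT: the slice skips the first element
        total += v            # ERROR: total is wrong whenever values[0] != 0
    return total / len(values)


# The fault is present, but the internal state happens to be right,
# so no failure is visible from outside.
print(buggy_mean([0, 4, 8]))   # 4.0, which matches the specification

# Same fault, but now the error reaches the output: a FAILURE.
print(buggy_mean([3, 4, 8]))   # 4.0, but the specification says 5.0
```

The point is that a fault can sit in the code without ever causing a failure; it only counts as a failure once the wrong internal state shows up in the externally visible behaviour.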
There are 3 different types of fault:
- Transient faults occur then disappear, e.g. an adverse reaction to radioactivity.
- Permanent faults remain in the system until they are fixed.
- Intermittent faults are transient faults that keep recurring (e.g. overheating).
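One practical consequence of these categories is that simply retrying can ride out a transient fault, but it does nothing for a permanent one. The sketch below simulates this; `TransientFault`, `make_flaky` and `retry` are hypothetical names invented for the example, and the faults are simulated with a call counter rather than real hardware.

```python
class TransientFault(Exception):
    """Stands in for a fault that goes away by itself (assumption for the demo)."""


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then works."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientFault("simulated glitch")
        return "ok"

    return operation


def retry(operation, attempts=3):
    """Call operation up to `attempts` times (attempts must be >= 1)."""
    last_exc = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as exc:
            last_exc = exc        # remember the fault and try again
    raise last_exc                # still failing: give up and report it


print(retry(make_flaky(2)))       # fault disappears on the third call: "ok"

try:
    retry(make_flaky(10**9))      # behaves like a permanent fault
except TransientFault:
    print("retrying did not help; the fault has to be fixed")
```

An intermittent fault would look like the first case over and over: each retry loop succeeds eventually, but the glitches keep coming back.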
There are also two types of software fault (bugs):
- Bohrbugs; nice bugs, reproducible.
- Heisenbugs; nasty bugs, race conditions etc.
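A race condition is hard to reproduce because the result depends on how the threads happen to be scheduled. Rather than use real threads (which would make the example itself unreliable), this sketch simulates two threads by hand. The `run` function and the two schedules are assumptions made up for illustration.

```python
def run(schedule):
    """Simulate threads A and B each doing `counter += 1` as two separate
    steps (read, then write), executed in the order given by `schedule`."""
    counter = 0
    local = {}                        # the value each thread last read
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter
        else:                         # "write"
            counter = local[thread] + 1
    return counter


# One thread finishes before the other starts.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Both threads read before either writes.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(serial))        # 2
print(run(interleaved))   # 1, one increment is lost
```

With real threads you don't get to pick the schedule, so the bug only shows up on some runs (and may vanish when you add logging or a debugger). A Bohrbug, by contrast, gives the same wrong answer for the same input every time.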
Software is either correct or incorrect; it doesn’t deteriorate with age. That is handy, but faults can remain dormant for long periods before they show up (often through resource usage issues, e.g. memory leaks).
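A rough sketch of a dormant fault: the leak below is there from the very first request, but nothing goes wrong until a limit is reached. `LeakyService` and `MEMORY_BUDGET` are hypothetical, and memory exhaustion is simulated with a simple item count.

```python
MEMORY_BUDGET = 1000   # hypothetical limit, counted in stored items


class LeakyService:
    def __init__(self):
        self._seen = []

    def handle(self, request):
        self._seen.append(request)          # FAULT: entries are never removed
        if len(self._seen) > MEMORY_BUDGET:
            raise MemoryError("simulated memory exhaustion")
        return request.upper()


service = LeakyService()
survived = 0
try:
    while True:
        service.handle("ping")
        survived += 1
except MemoryError:
    pass

print(survived)   # 1000 requests were handled correctly before the failure
```

Short test runs would never get near the limit, which is why this kind of fault tends to survive testing and only appears after the system has been running for a long time.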
How can I make a reliable system then?
- Fault prevention
- Fault tolerance
Fault prevention
Fault prevention has two stages: avoidance and removal. Good avoidance is obviously better than removal (proactive is better than reactive). Fault avoidance tries to limit the introduction of faults during construction by:
- Using the most reliable components available (duh)
- Using thoroughly-refined techniques for interconnection of components
- Physically packaging the hardware to avoid any interference
- Rigorous (and formal?) specification of requirements
- Using proven design methodologies
- Using languages with facilities for data abstraction and modularity (really?)
- Tried and tested software engineering environments
Fault removal involves finding and removing the causes of errors (stating the obvious…). Testing can never remove everything, so faults will still occur.
Fault prevention is useless if you can’t get access to the system, or if repairs would take too long or be needed too often. So the alternative is fault tolerance.
Fault tolerance
There are different levels of fault tolerance:
- Full fault tolerance; The system continues to operate in the presence of faults with no significant loss of functionality or performance.
- Fail soft (Graceful Degradation); similar to full fault tolerance except the system is degraded in some way (but can still operate).
- Fail safe; the system maintains integrity, but halts temporarily.
Most safety-critical systems require full fault tolerance, but that’s tricky to achieve, so in practice many systems can only offer graceful degradation.
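The three levels can be sketched with a toy sensor reader. Everything here is an assumption for illustration: `read_sensor`, `ComponentFault`, the idea of a backup sensor standing in for redundancy, and a last known good value standing in for degraded service.

```python
class ComponentFault(Exception):
    """Simulated component fault (hypothetical, for the demo only)."""


def read_sensor(primary, backup=None, last_good=None):
    """Return (value, mode), where mode names the level of tolerance used."""
    try:
        return primary(), "normal"
    except ComponentFault:
        pass
    if backup is not None:
        try:
            # A redundant component takes over: nothing is lost.
            return backup(), "full fault tolerance"
        except ComponentFault:
            pass
    if last_good is not None:
        # Keep operating, but on a stale value: degraded service.
        return last_good, "fail soft"
    # No trustworthy value: report nothing so the caller can halt safely.
    return None, "fail safe"


def working():
    return 21.5


def broken():
    raise ComponentFault("simulated sensor fault")


print(read_sensor(working))                     # (21.5, 'normal')
print(read_sensor(broken, backup=working))      # (21.5, 'full fault tolerance')
print(read_sensor(broken, last_good=20.0))      # (20.0, 'fail soft')
print(read_sensor(broken))                      # (None, 'fail safe')
```

Note how full fault tolerance needs a spare component, which is part of why it is harder to provide than graceful degradation.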
This is unfinished…I got bored of reading these lecture notes!