
Reliability and Fault Tolerance

22nd January 2010 (I was absent from this lecture)

These are lecture notes from my Computer Science course, not a general reference for "Reliability and Fault Tolerance".

There are 6 aspects of Dependability:

Reliability of a system is a measure of how successful it is at conforming to its specification. When the behaviour doesn’t follow the specification, a ‘failure’ occurs. A failure is the result of an internal error showing itself externally, and errors are in turn caused by mechanical or algorithmic faults.
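To make the fault, error, failure chain concrete, here is a minimal Python sketch. The Counter class and its one-line "specification" are made up for illustration and aren't from the lecture.

```python
class Counter:
    """Hypothetical component. Assumed specification: total() returns
    the sum of every value passed to add()."""

    def __init__(self):
        self._total = 0

    def add(self, value):
        # Fault: an algorithmic mistake that silently drops negative values.
        if value > 0:
            self._total += value

    def total(self):
        return self._total


counter = Counter()
counter.add(5)    # fault not activated, internal state is still correct
counter.add(-2)   # fault activated, internal state is now wrong (an error)

expected = 3                 # what the assumed specification demands
observed = counter.total()   # the error becomes visible externally (a failure)
print(f"expected {expected}, observed {observed}")
```

Until total() is called, nobody outside the object can tell anything has gone wrong, which is the difference between an error and a failure.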

There are 3 different types of fault:

Then there are also types of software faults. Bugs:

Software is either correct or incorrect; it doesn’t deteriorate. That’s handy, but faults can remain dormant for long periods (often because of resource usage issues, e.g. memory leaks).
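A rough sketch of what a dormant fault of the memory-leak kind might look like, in Python. RequestHandler and its fields are invented for this example. Every reply is correct, so the fault stays hidden until the process eventually runs out of memory.

```python
class RequestHandler:
    """Hypothetical handler, used only to illustrate a dormant fault."""

    def __init__(self):
        # Fault: this list grows on every request and is never trimmed.
        self._seen = []

    def handle(self, request_id):
        self._seen.append(request_id)
        return f"ok:{request_id}"


handler = RequestHandler()
replies = [handler.handle(i) for i in range(1000)]

# All 1000 replies are correct, yet 1000 entries have leaked.
leaked = len(handler._seen)
print(replies[0], replies[-1], leaked)
```

Nothing here deteriorates over time; the code is exactly as wrong on the first request as on the last. It just takes a long run before the fault turns into a failure.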

How can I make a reliable system then?

Fault prevention

Fault prevention has two stages: avoidance and removal. Good avoidance is obviously better than removal (proactive is better than reactive). Fault avoidance tries to limit the introduction of faults during construction by:

Fault removal involves finding and removing the causes of errors (stating the obvious…). Testing can never remove everything, so faults will still occur.

Fault prevention is useless if you can’t get access to the system, or if repairs would take too long or be needed too often. So the alternative is fault tolerance.

Fault tolerance

There are different levels of fault tolerance:

Most safety-critical systems require full fault tolerance, but that’s tricky to achieve, so in practice many systems can only offer graceful degradation.
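Here is a small Python sketch of the difference between those two levels, as I understand them: with full fault tolerance the system carries on with no loss of service, while with graceful degradation it carries on with a reduced service. The sensor functions, SensorError and read_temperature are all hypothetical names, and redundancy is just one assumed way of getting there.

```python
class SensorError(Exception):
    """Raised by a (hypothetical) sensor that cannot produce a reading."""


def read_temperature(primary, backup, last_known):
    """Return (value, mode) where mode is 'full' or 'degraded'."""
    # Redundancy: if either sensor works, the caller gets a fresh reading
    # and sees no loss of service.
    for sensor in (primary, backup):
        try:
            return sensor(), "full"
        except SensorError:
            continue
    # Both sensors failed: keep running on a stale value and say so.
    return last_known, "degraded"


def working_sensor():
    return 21.5


def broken_sensor():
    raise SensorError("no reading")


print(read_temperature(broken_sensor, working_sensor, 20.0))
print(read_temperature(broken_sensor, broken_sensor, 20.0))
```

Returning the mode alongside the value is a design choice for the sketch: it lets the caller decide what to do with a stale reading instead of hiding the degradation.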

This is unfinished…I got bored of reading these lecture notes!
