Reliability and Fault Tolerance

These are lecture notes from my Computer Science course. For learning about real-time systems, I recommend Real-Time Systems and Programming Languages.

There are 6 aspects of Dependability:

  • Availability
  • Reliability
  • Safety; non-occurrence of catastrophic consequences
  • Confidentiality; security
  • Integrity; information doesn’t get altered improperly
  • Maintainability; the system can be repaired and evolved

Reliability of a system is a measure of how successful it is at conforming to a specification. When the behaviour doesn’t follow the specification, a ‘failure’ occurs. A failure is an internal Error showing itself externally, and Errors are in turn caused by mechanical or algorithmic Faults.
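To make the fault, error, failure chain concrete, here is a minimal sketch in Python. The function `clamp`, its specification and the checker `conforms` are all made up for illustration; they are not from the lecture. The fault is a wrong line of code, the error is the wrong value it computes once that line runs, and the failure is the caller seeing behaviour that breaks the specification.

```python
def clamp(value, low, high):
    """Spec (hypothetical): return value limited to the range [low, high]."""
    if value < low:
        return low
    if value > high:
        return low  # FAULT: this should return `high`
    return value


def conforms(value, low, high):
    """Check one call against the specification."""
    expected = min(max(value, low), high)
    return clamp(value, low, high) == expected


print(conforms(5, 0, 10))   # True: the faulty line never runs, no failure
print(conforms(15, 0, 10))  # False: the fault is activated, a failure occurs
```

Note that the fault is present in both calls, but only the second one executes it, so only the second one fails.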

There are 3 different types of fault:

  • Transient faults occur then disappear, e.g. an adverse reaction to radioactivity.
  • Permanent faults remain in the system until they are fixed.
  • Intermittent faults are recurring transient faults (e.g. overheating).
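The difference between the three types is easiest to see on a timeline. The following is a rough sketch, not a real fault model: `FaultKind`, `fault_active`, `timeline` and all the parameters (`start`, `duration`, `period`, `repaired_at`) are names invented here for illustration.

```python
from enum import Enum


class FaultKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    INTERMITTENT = "intermittent"


def fault_active(kind, t, start=2, duration=1, period=4, repaired_at=None):
    """Return True if a fault of the given kind is active at time step t."""
    if t < start:
        return False
    if kind is FaultKind.TRANSIENT:
        # Occurs once, then disappears on its own.
        return t < start + duration
    if kind is FaultKind.PERMANENT:
        # Remains until it is fixed (if it ever is).
        return repaired_at is None or t < repaired_at
    # Intermittent: a transient fault that keeps coming back.
    return (t - start) % period < duration


def timeline(kind, steps=10, **kwargs):
    """Draw 'X' for each step where the fault is active, '.' otherwise."""
    return "".join(
        "X" if fault_active(kind, t, **kwargs) else "." for t in range(steps)
    )


print(timeline(FaultKind.TRANSIENT))                 # ..X.......
print(timeline(FaultKind.PERMANENT))                 # ..XXXXXXXX
print(timeline(FaultKind.PERMANENT, repaired_at=6))  # ..XXXX....
print(timeline(FaultKind.INTERMITTENT))              # ..X...X...
```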

There are also two types of software fault (bugs):

  • Bohrbugs; nice bugs, reproducible.
  • Heisenbugs; nasty bugs, race conditions etc.
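A race condition is nasty because the outcome depends on how the scheduler happens to interleave the threads. The sketch below fakes two threads with Python generators so that the interleaving is chosen explicitly and the bug becomes reproducible. The names `increment` and `run` and the schedule format are assumptions made for this example only.

```python
def increment(shared):
    """One simulated thread: a non-atomic read-modify-write of the counter."""
    value = shared["counter"]      # read
    yield                          # a context switch may happen here
    shared["counter"] = value + 1  # write back


def run(schedule):
    """Step two simulated threads in the order given by schedule."""
    shared = {"counter": 0}
    threads = [increment(shared), increment(shared)]
    for thread_id in schedule:
        # Advance the chosen thread by one step; ignore it once finished.
        next(threads[thread_id], None)
    return shared["counter"]


print(run([0, 0, 1, 1]))  # 2: thread 0 finishes before thread 1 starts
print(run([0, 1, 0, 1]))  # 1: both read 0, so one update is lost
```

With real threads you don't get to pick the schedule, which is why the second outcome may show up once in a thousand runs and vanish as soon as you attach a debugger.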

Software is either correct or incorrect; it doesn’t deteriorate. That is handy, but faults can remain dormant for long periods (often because they relate to resource usage, e.g. memory leaks).

How can I make a reliable system then?

  • Fault prevention
  • Fault tolerance

Fault prevention

Fault prevention has two stages: avoidance and removal. Avoidance is better than removal (proactive beats reactive). Fault avoidance tries to limit the introduction of faults during construction by:

  • Using the most reliable components available (duh)
  • Using thoroughly-refined techniques for interconnection of components
  • Physically packaging the hardware to avoid any interference
  • Rigorous (and formal?) specification of requirements
  • Using proven design methodologies
  • Using languages with facilities for data abstraction and modularity (really?)
  • Using tried and tested software engineering environments

Fault removal involves finding and removing the causes of errors (stating the obvious…). Testing can never remove everything, so faults will still occur.

Fault prevention is of little use if you can’t get access to the system, or if repairs would take too long or be needed too often. So the alternative is fault tolerance.

Fault tolerance

There are different levels of fault tolerance:

  • Full fault tolerance; The system continues to operate in the presence of faults with no significant loss of functionality or performance.
  • Fail soft (Graceful Degradation); similar to full fault tolerance except the system is degraded in some way (but can still operate).
  • Fail safe; the system maintains integrity, but halts temporarily.
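As a rough illustration of the three levels, here is a sketch of one control cycle reading a temperature. Everything in it is hypothetical: the `SensorFault` exception, the `read_temperature` function, the mode names, and the idea that a redundant sensor gives full tolerance while a stale reading counts as degraded service. A real system would decide these things from its own safety requirements.

```python
class SensorFault(Exception):
    """Raised by a (hypothetical) sensor that has failed."""


def read_temperature(primary, backup, last_good=None):
    """Return (mode, value) for one control cycle.

    primary and backup are callables that return a reading
    or raise SensorFault.
    """
    try:
        return "normal", primary()
    except SensorFault:
        pass
    try:
        # A redundant sensor masks the fault: full fault tolerance.
        return "full", backup()
    except SensorFault:
        pass
    if last_good is not None:
        # Keep operating on a stale reading: fail soft (degraded).
        return "fail_soft", last_good
    # No trustworthy data at all: halt in a safe state (fail safe).
    return "fail_safe", None


def working():
    return 21.5


def broken():
    raise SensorFault("no signal")


print(read_temperature(working, working))                 # ('normal', 21.5)
print(read_temperature(broken, working))                  # ('full', 21.5)
print(read_temperature(broken, broken, last_good=20.0))   # ('fail_soft', 20.0)
print(read_temperature(broken, broken))                   # ('fail_safe', None)
```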

Most safety-critical systems require full fault tolerance, but that is hard to achieve, so in practice many systems can only offer graceful degradation.

This is unfinished…I got bored of reading these lecture notes!
