Information Survivability by Self-Stabilization

position paper by

Boaz Patt-Shamir

College of Computer Science
Northeastern University
Boston, MA 02115

In this short note I argue that the model of self-stabilization and the tools developed for it are important and useful from the information survivability perspective. In a nutshell, the self-stabilization model provides a uniform abstraction for all possible transient faults, and self-stabilizing protocols guarantee automatic recovery from any such fault. In the remainder of this note, I give some background and mention a few approaches and results for self-stabilization.

In order to deal with faults, one has to model them appropriately. Clearly, there is no way to defend the system unless we assume that the faults are somehow restricted. One popular way to restrict faults is to view them as additional specific actions taken by the environment. For example, a ``shutdown'' action may model a processor crash; a ``drop message'' action has the obvious interpretation; etc. Another accepted way to restrict faults is to assume that they may occur in only a bounded number of places. A typical example is a system whose guarantees are contingent on the assumption that at most t processors are faulty, for some specified parameter t. A third approach was pioneered by Dijkstra [1]: unrestricted transient faults. In this model, faults may be extremely powerful: when a fault occurs, the state of all system components may have been arbitrarily corrupted; the only restrictions assumed are that the protocol code remains intact, and that eventually faults stop occurring. A protocol which can eventually resume normal operation after such a fault is called self-stabilizing.

It turns out that, despite the stringent requirements, self-stabilizing protocols are feasible (e.g., for token passing and spanning tree construction), but they are tricky. The design of self-stabilizing protocols seems to require techniques which are foreign to conventional programmers. For example, we take it for granted that the result of a read(x) instruction is exactly the value v stored by the last write(x,v) instruction. But a self-stabilizing protocol can never just write a value and trust the system to keep it: there is no immediate way to tell whether a value returned by a subsequent read is ``genuine,'' or whether a transient fault corrupted it before the read took place. Another unnatural consideration arises because the program counter (the register which points to the next instruction to execute) is stored in volatile memory: a possible consequence of a transient fault is that, when executing the second statement in a sequence, there is no way to tell whether the first one has already been executed.
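
To make this flavor concrete, here is a minimal Python simulation of Dijkstra's K-state token-passing ring [1] under a central daemon. The code and names are my own, and the simulation is centralized for brevity; the point is that no write is ever trusted: every machine forever re-evaluates its rule against its neighbor, so from any arbitrarily corrupted initial state the system converges (for K >= n) to a single circulating privilege.

    import random

    def privileged(S, i, K):
        # machine 0 (the ``bottom'') holds a privilege when its value equals
        # its left neighbor's; every other machine, when its value differs
        return S[0] == S[-1] if i == 0 else S[i] != S[i - 1]

    def fire(S, i, K):
        # firing a privilege passes it along the ring
        if i == 0:
            S[0] = (S[0] + 1) % K
        else:
            S[i] = S[i - 1]

    def run(n=5, K=6, steps=60):
        S = [random.randrange(K) for _ in range(n)]  # arbitrarily corrupted start
        for t in range(steps):
            P = [i for i in range(n) if privileged(S, i, K)]  # never empty
            print(t, "privileges:", len(P), "state:", S)
            fire(S, random.choice(P), K)  # the daemon picks one enabled machine

    run()

Running this, the number of privileges drops to one and stays there: the rules are evaluated anew at every step rather than relying on any stored value, which is exactly the design discipline described above.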

Recently there has been a trend toward designing general protocol transformers, which upgrade non-stabilizing protocols into self-stabilizing ones. The common approach [2,3,4] is to have a detection mechanism which periodically verifies the integrity of the application, and which triggers a reset protocol when a fault is detected: the task of the reset protocol is to flush the stale state information out of the system and to consistently install a new global state, from which the application can resume normal operation. It is fairly well understood when and how to use reset-based stabilization. The main research question for this approach is the design of an efficient, self-stabilizing reset protocol. An ideal reset protocol should work in time proportional to the network diameter, and use only constant memory space at each node. The current state of the art offers either an optimal-space protocol [5] or an optimal-time protocol [6], but no protocol achieves both simultaneously.
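
As a schematic illustration of the recipe (Python; the application and all names here are my own invention, not any of the cited protocols), consider a distance (BFS) layer whose global correctness is a conjunction of purely local predicates. A detector periodically evaluates the local predicate at each node; any violation triggers a reset, which in this centralized sketch is simply a recomputation from the root. The dictionaries dist and parent are assumed to map every node to some (possibly corrupted) value.

    from collections import deque

    def locally_consistent(v, dist, parent, adj, root):
        # a predicate node v can check against its neighbors alone
        if v == root:
            return dist[v] == 0
        return parent[v] in adj[v] and dist[v] == dist[parent[v]] + 1

    def reset(adj, root):
        # flush stale state and install a fresh, consistent global state
        dist, parent = {root: 0}, {root: root}
        q = deque([root])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], parent[w] = dist[u] + 1, u
                    q.append(w)
        return dist, parent

    def detector_round(adj, root, dist, parent):
        # run periodically; a single local violation anywhere resets
        # everyone, which is precisely the insensitivity criticized below
        if any(not locally_consistent(v, dist, parent, adj, root) for v in adj):
            return reset(adj, root)
        return dist, parent

    adj = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
    dist, parent = reset(adj, root=1)
    dist[3] = 7                                        # transient fault hits node 3
    dist, parent = detector_round(adj, 1, dist, parent)  # detected; global reset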

Although the reset-based approach seems attractive from the engineering viewpoint, it is not widely used in the real world. The chief reason for rejecting reset is its insensitivity to the severity of the fault: even the slightest inconsistency results in a global system reset. Consider a single processor which has completely lost track of the protocol; the desired correction in such a case is probably to restart that processor alone. We would not like to restart all processors executing the application unless we absolutely have to. It is highly desirable that the correction algorithm be time adaptive, namely, that the time to overcome a fault be proportional to the extent of the fault (i.e., the number of processors affected by it).

Time-adaptive self-stabilizing algorithms have recently been the subject of intensive research. One work [7] shows how a system can recover from a transient fault in time proportional to the number of nodes which were hit by the fault; unfortunately, the algorithm presented in [7] is not self-stabilizing: in particular, the effect of the faults is internally accumulated and must be manually undone at some point. Another paper [8] shows how to overcome a fault at a single node quickly, while faults which extend to a larger number of nodes may take much longer to repair. The latest result in this line of research is [9], which shows how to repair any number F of faults in time proportional to F. The basic technique which underlies the new algorithms is, of course, replication of information across the system; the main difficulty is that while the algorithm is required to change the values stored locally at the (faulty) nodes, it should avoid committing itself to wrong values: otherwise, a domino effect might take place, where a few strategically located faulty nodes can bring down the whole network.
We remark that the main drawback of the latest algorithms is their excessive space requirement; however, under various practical assumptions, and using error-correcting codes, this overhead can be considerably reduced.
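
To illustrate the flavor of the replication idea (a toy of my own, not the algorithm of [9]): suppose each node's logical value is replicated at all nodes, and a node recovers its value by majority vote over the copies. A fault hitting fewer than half of the copies cannot force a wrong value, and when no majority exists the node simply declines to commit, which blocks the domino effect described above.

    from collections import Counter

    def recover(copies):
        # majority vote over the replicated copies of one node's value;
        # with F < len(copies)/2 corrupted copies, the vote is correct
        value, count = Counter(copies).most_common(1)[0]
        return value if count > len(copies) // 2 else None  # refuse to commit

    # copies of one value held across a 7-node system, 2 of them hit
    print(recover(["a", "a", "a", "x", "a", "y", "a"]))  # -> 'a'
    # 5 of 7 copies corrupted: no majority, hence no wrong commitment
    print(recover(["a", "x", "y", "z", "b", "c", "a"]))  # -> None

The n-fold blowup in storage is exactly the space overhead mentioned above, which error-correcting codes can reduce.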

References.

  1. E. W. Dijkstra. "Self-stabilizing systems in spite of distributed control." Comm. ACM 17(11), pages 643-644, Nov. 1974.
  2. S. Katz and K. Perry. "Self-stabilizing extensions for message passing systems." In 9th Ann. ACM Symp. on Principles of Distributed Computing, Aug. 1990.
  3. B. Awerbuch, B. Patt-Shamir and G. Varghese. "Self-stabilization by local checking and correction." In 32nd IEEE Ann. Symp. on Foundations of Computer Science, pages 268-277, Oct. 1991.
    Online: ftp://ftp.ccs.neu.edu/pub/people/boaz/local.ps.gz
  4. B. Awerbuch, B. Patt-Shamir, G. Varghese and S. Dolev. "Self-stabilization by local detection and global reset." In 8th Workshop on Distributed Algorithms, pages 326-339, Sep. 1994.
    Online: ftp://ftp.ccs.neu.edu/pub/people/boaz/ss-compilation.ps.gz
  5. G. Itkis and L. Levin. "Fast and lean self-stabilizing asynchronous protocols." In 35th IEEE Ann. Symp. on Foundations of Computer Science, pages 226-239, Nov. 1994.
  6. B. Awerbuch, S. Kutten, Y. Mansour, B. Patt-Shamir and G. Varghese. "Time Optimal Self-Stabilizing Synchronization." In 25th Ann. ACM Symp. on Theory of Computing, pages 652-661, May 1993. Also appeared as IBM Research Report RC-19149(83418).
    Online: ftp://ftp.ccs.neu.edu/pub/people/boaz/ss-alpha.ps.gz
  7. S. Kutten and D. Peleg. "Fault-Local Distributed Mending." In 14th Ann. ACM Symp. on Principles of Distributed Computing, Aug. 1995.
  8. S. Ghosh, A. Gupta, T. Herman and S. Pemmaraju. "Fault-containing self-stabilizing algorithms." In 15th Ann. ACM Symp. on Principles of Distributed Computing, May 1996.
  9. S. Kutten and B. Patt-Shamir. "Time-adaptive self-stabilization." Submitted for publication, Nov. 1996.
    Online: ftp://ftp.ccs.neu.edu/pub/people/boaz/timeAdaptive.ps.gz