Recovery in the Calypso File System

Murthy Devarakonda
Bill Kish
Ajay Mohindra

This paper describes the design and implementation of the recovery scheme in Calypso. Calypso is a distributed UNIX file system implemented on IBM RISC clusters. One potential use of cluster systems is as scalable, fault-tolerant network servers providing mail, bulletin-boards, Web, and Telnet services. Such clusters require an internal file system that is in itself recoverable in addition to providing scalability and standards compliance.

The Calypso recovery scheme uses a protocol between the clients and servers, and it relies on the state distributed among clients. In the event of a server failure, multi-ported disks allow another processor to take over its disks and function as the file server. The recovery is non-disruptive meaning that the file open state and client modified data survive across server recovery, and in-flight client-server operations are transparently completed after recovery. The advantages of the Calypso recovery scheme are that there is no need to explicitly replicate server state, and that the multi-ported disks shorten recovery times and offer cluster reorganization capability.

Among existing commercial file systems, NFS supports a limited variation of non-disruptive recovery through hard mounts, but suffers from lack of scalability due to statelessness Highly Available Network File Server (HA-NFS) provides non-disruptive NFS server recovery but does not address the scalability problem. AFS offers scalability but does not provide non-disruptive recovery. Several research distributed file systems demonstrated fault-tolerance using redundancy provided by replication and RAID techniques. Recovery using distributed state avoids the expense of maintaining replicas.

Two other distributed file systems, Sprite and Spritely NFS have used distributed state for server recovery. Calypso development took place in parallel and independent of these efforts. Calypso differs from Sprite and Spritely NFS in state representation and internal design. Calypso implements failure detection, recovery management, and congestion control quite differently, and in separate subsystems with well-defined interfaces. The Calypso design also separates cache consistency state from file data, enhancing the modularity and further simplifying the recovery system. Only Calypso makes use of multi-ported disks to handle permanent processor failures or scheduled shutdown.

In the past four years, we have developed and tested the Calypso file system at product quality for a possible use in RISC System/6000 clusters such as the Scalable POWERParallel (SP) and HACMP. SP is a multicomputer, consisting of independent RISC System/6000 units and an optional high speed switch interconnect. Individual processing units are off-the-shelf systems. Each one has local memory, disks, and other adapter cards. Rack-based packaging provides redundant power supply and monitoring capability. Large SP systems (of 512 nodes) are mainly used in scientific computing. In the commercial area, where Calypso is targeted, more modest sized systems (of up to 32 nodes) are common. HACMP is a set of software services that turns a network of standard RISC System/6000s into a highly-available server complex.

As a cluster file system, Calypso takes advantage of homogeneity, special hardware, peer relationship among processors, and single administrative domain to provide small and efficient implementation. It offers: strong cache consistency and other POSIX semantics; reduced false sharing in simultaneous update of large files; NFS support to external clients; and non-disruptive server recovery. Because of the peer relationship and single administrative domain aspects, Calypso can trust its clients to execute a distributed protocol for server recovery.

In Calypso, as in other distributed file systems, there is one server for each file system and several clients. On the server, Calypso uses the AIX Journaled File System (JFS) through the vnode interface. The server has a physical attachment to disks containing file data. Although multi-ported disks are physically attached to two or more nodes, only one port is active at any given time. Clients cache file data extensively and Calypso maintains cache consistency using a token-based scheme. A client must first obtain necessary tokens before carrying out a file operation, and acquired tokens remain with the client until revoked. The server maintains the state of all tokens, and each client maintains information about tokens it is presently holding. Server state recovery mainly consists of reconstructing the state of tokens on a backup node (or at the same server node after reboot). In addition, it requires reconstruction of a small amount of information related to file locks and disk space guarantees. To a user, server reconstruction is completely transparent. Therefore, except for a small response time delay, a user is unaware of a server failure and its recovery. Recovery from disk failures is beyond the scope of Calypso. RAID devices or disk mirroring (replication) techniques are expected to provide disk fault tolerance.

A node status service, which is external to Calypso, detects failures and initiates recovery. The node status service maintains liveness status using selective "are you alive" messages and a quorum consensus algorithm. When this service detects failure of a server, it instantiates the Calypso recovery controller on a leader node, which manages necessary recovery. In the first phase, the controller informs all clients about the server failure and prompts a pre-determined backup server to take over the disk(s) through a secondary port, assure consistency of the file system(s), and perform file system mounts. In the second phase, the controller prompts all clients to send in their state information. It notifies recovery completion in the third phase so that the clients may now access the server as needed. Calypso handles client failures in the same way with several simplifications.

To understand bottlenecks and significant performance factors, we conducted measurements using two workloads. One key result is that the server state reconstruction time is relatively small overall, even though it increases linearly with the number of clients and tokens per client. Batching and congestion control help to keep the reconstruction cost low. We developed a statistical model for estimating the reconstruction time based on the number of clients and tokens. The second result is that if the workload modifies a large amount of file meta-data shortly before a failure, the log redo time of the AIX Journaled File System becomes the predominant factor of the recovery time. Fortunately, the log redo time is bounded by the log size. In general, however, the activity represented by this time is an important factor in the overall recovery time of a distributed file system.



Wed Jun 5 16:17:08 EDT 1996