Murthy Devarakonda
Bill Kish
Ajay Mohindra
This paper describes a file server recovery scheme that uses naturally replicated state among clients, and shows preliminary measurements of its implementation in the Calypso file system. Calypso is a cluster-optimized, distributed UNIX file system for RISCSystem 6000 clusters such as the IBM Scalable POWERParallel (SP). From 1991 through 1995, we developed and tested Calypso at the product quality.
For each file system in Calypso, one processing node functions as the server and the rest of the nodes function as clients. A recovery scheme is needed to handle failure of a server node. (Client failures are equally important, and Calypso treats them as a simple case of server failure.) Calypso supports fully transparent, non-disruptive, low overhead server recovery. Non-disruptive recovery implies that the open files remain open, modified data in the client caches is retained, and in-flight operations between the clients and the server are properly handled across server recovery. This level of transparent recovery is a challenging requirement since Calypso file server maintains detailed information about the state of client caches for high performance. In addition, Calypso avoids extra disk storage costs and overhead in maintaining explicit replication of data. We designed a server recovery scheme using naturally replicated state among clients and exploiting multi-ported disks when available.
The server state is naturally replicated among Calypso clients as a side effect of caching at the clients. Calypso clients cache data pages and file information such as the ownership and access rights. In order to maintain consistency of this cached data, Calypso uses a distributed token-based mechanism. A token conveys authorization to perform certain operations on file data, such as opening the file, caching and accessing its contents or its attributes. A client must first obtain a token from the file server before accessing associated file data. The server maintains a list of outstanding tokens and token holders. Mainly, this list is the server state that needs to be recovered in the case of server failure.
Each client also maintains a part of the server state in the form of tokens that are being held by the client. Union of these client states is equal to the state maintained at the server in the steady state. Our recovery scheme makes use of this naturally replicated state to rebuild the server state on an alternate node or the same node after it is rebooted. The server state is rebuilt on an alternate node when multi-ported disks are used. The use of multi-ported disks avoids explicit replication of data and server state, and overhead associated with maintenance of the replicas.
The recovery scheme in Calypso uses a node status service, which is external to Calypso itself, to detect failures and to initiate recovery. The node status service maintains `liveness' status using selective `Are you alive' messages and a quorum election algorithm. When this service detects failure of a server node it starts Calypso recovery protocol through a leader node. In the first phase, all cluster nodes are informed of the server failure and the alternate server (pre-determined) takes over the disk(s) through a secondary port, checks consistency of the file system(s), and performs file system mounts. In the second phase, all clients send in their state information. Clients are informed of the completion of recovery in the third phase, and they may now access the server as needed. In-flight operations are handled using retries and a check-before-request approach. Measurements of reconstruction times show good scalability. The major portion of the recovery time is in the log-redo of the AIX Journaling File System (JFS). Fortunately, the log redo time is proportional to the size of the log and hence, after a certain point, it does not increase with the number of clients.
In previous distributed file systems, server fault tolerance is provided mostly through explicit replication, primary/backup, RAID, internal fault-tolerance, or fast restart. In some of systems, server recovery is disruptive (i.e. user-visible) requiring additional handling in applications. The closest related work is the recovery scheme implemented in the Sprite file system. The Sprite recovery scheme also uses naturally replicated state among clients. Specific differences between the Sprite and Calypso schemes are along the lines of state representation, its implication on performance, important protocol differences that impact file data integrity, and so on. However, fundamentally the difference is the system environment it is targeted for. The Calypso scheme takes full advantage of the architecture, assumptions, and hardware of an enclosed cluster, while the Sprite scheme is designed to work for workstations on a LAN.
Highly Available Network File Server (HA-NFS) product from IBM exploits multi-ported disks and IP take over to recover from a server failure. However, since NFS protocol is nearly stateless, HA-NFS needs to recover only a small amount of server state. HA-NFS saves this server state on the multi-ported disk. Therefore, HA-NFS and Calypso recovery schemes are not directly related but both exploit multi-ported disks to avoid data replication.