Murthy Devarakonda
Ananda Rao Ladi
Andy Zlotek
Ajay Mohindra
In single-system UNIX, when a write system call returns successfully, adequate disk space is guaranteed for any new pages created by the call. Supporting this guarantee in a distributed UNIX file system is a distributed resource management problem, because distributed file systems use (or wish to use) write-behind caching at the clients for high performance, which requires the clients to implement the space guarantee. Since the on-disk representation of a file requires additional disk blocks for inodes and other internal structures, clients cannot accurately determine the space needed. Low communication overhead and recovery from failures are also important considerations. This paper describes an advance reservation scheme that addresses these problems and presents measurements demonstrating its low overhead.
Two commercially available distributed file systems, NFS and AFS, flush newly created pages at file close and check for space availability at the server at that time. Spritely NFS, an experimental file system that extends NFS with protocols from Berkeley's Sprite file system, reserves space at file close without flushing the modifications to the server.
The scheme described here allows aggressive caching as in Spritely NFS while maintaining the disk space guarantee of a single system. It has been implemented in the Calypso file system, a cluster-optimized distributed UNIX file system for IBM Scalable POWERparallel. Calypso provides scalable performance and full recovery from failures while supporting true single-system semantics.
Calypso uses an advance reservation scheme to implement the disk space guarantee. During the write system call, Calypso determines the amount of new space needed (if any) and obtains a reservation for that amount before the call returns. The problem of determining the actual amount of new space needed is solved by conservative overestimation, with reconciliation at the earliest opportunity. Recovery from partial failures is provided as part of the overall recovery scheme. Reservation overhead is reduced by minimizing communication between the clients and the server. Measurements show that the overhead introduced by this scheme is about 1% to 3% for typical interactive benchmarks.
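The reservation flow described above can be illustrated with a minimal sketch. It is not Calypso's implementation: the class and function names, the block size, the metadata padding rule, and the batching of reservation requests are all assumptions made for illustration. The sketch shows a client that overestimates the space for a write, reserves it before the write returns, and hands back the unused part when the cached data is flushed.

```python
import errno
import math

BLOCK_SIZE = 4096         # assumed file system block size, in bytes
PTRS_PER_INDIRECT = 1024  # assumed block pointers per indirect block


def conservative_estimate(offset, length):
    """Upper bound on the new disk blocks a write might need (hypothetical)."""
    if length <= 0:
        return 0
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    data_blocks = last - first + 1  # treat every touched block as new
    # Pad for metadata the client cannot see: indirect blocks plus a fixed
    # allowance of 2 blocks for the inode and other internal structures.
    meta_blocks = math.ceil(data_blocks / PTRS_PER_INDIRECT) + 2
    return data_blocks + meta_blocks


class Server:
    """Hypothetical server that tracks free and reserved blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.reserved = {}  # client id -> blocks currently reserved
        self.requests = 0   # reservation messages received

    def reserve(self, client_id, blocks):
        self.requests += 1
        if blocks > self.free_blocks:
            return False
        self.free_blocks -= blocks
        self.reserved[client_id] = self.reserved.get(client_id, 0) + blocks
        return True

    def reconcile(self, client_id, reserved_blocks, used_blocks):
        """Release the part of a reservation that the flush did not need."""
        assert 0 <= used_blocks <= reserved_blocks
        self.reserved[client_id] = self.reserved.get(client_id, 0) - reserved_blocks
        self.free_blocks += reserved_blocks - used_blocks


class Client:
    """Hypothetical client with a write-behind cache."""

    def __init__(self, client_id, server, batch=0):
        self.client_id = client_id
        self.server = server
        self.batch = batch  # extra blocks requested per message (assumption)
        self.pool = 0       # reserved, not yet assigned to a write
        self.pending = 0    # reserved for dirty data still in the cache

    def write(self, offset, length):
        need = conservative_estimate(offset, length)
        if need > self.pool:
            shortfall = need - self.pool
            # Ask for extra so later writes need no message; retry with the
            # bare shortfall if the larger request does not fit.
            for want in (shortfall + self.batch, shortfall):
                if self.server.reserve(self.client_id, want):
                    self.pool += want
                    break
            else:
                raise OSError(errno.ENOSPC, "No space left on device")
        self.pool -= need
        self.pending += need
        return length  # data stays in the client cache (write-behind)

    def flush(self, blocks_actually_used):
        """Reconcile the overestimate once the real on-disk usage is known."""
        self.server.reconcile(self.client_id, self.pending, blocks_actually_used)
        self.pending = 0
```

In this sketch a write either returns with its space already reserved or fails with ENOSPC, so a later flush cannot run out of space. Requesting a surplus with each reservation message is one possible way to keep client-server communication low; the source does not state which technique Calypso uses.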