Evaluation of Design Alternatives for a Cluster File System

Murthy Devarakonda, Ajay Mohindra, Jill Simoneaux, and William H. Tetzlaff

Based on implementation experience and measurements, this paper presents an evaluation of design alternatives for a cluster file system. The file system is targeted at IBM clusters: Scalable POWERparallel and AIX HACMP/6000. VAXClusters are other well-known cluster systems. In general, a cluster is a loosely coupled system of processing nodes, each with its own CPU, memory, disks, and network adapters. The cluster systems considered here also have characteristics such as a single administrative domain, a peer relationship among nodes, homogeneity, and special-purpose hardware (i.e., a high-speed switch or multi-ported disks). A cluster file system, unlike a conventional network file system, can take advantage of these characteristics for high performance and fault tolerance. Functionally, a cluster file system is expected to provide strong cache consistency, POSIX support, and high availability.

We considered two design approaches to the cluster file system: (1) a shared disk approach, where multiple, serialized instances of a single-system file system run on the cluster nodes, directly accessing the file data as disk blocks; (2) a shared file system approach, which is the conventional client/server scheme using the vnode interface. Note that compatibility with the existing disk format is a requirement so that users can easily migrate to a cluster from a single system.

This evaluation is based on implementations of both approaches in AIX/6000 that use the same cache consistency mechanism and the same cache manager. Cache consistency has been implemented using a distributed token manager, and AIX virtual memory has been used as the data cache. In the shared disk approach, the lesser-known of the two approaches, the AIX Journaling File System (JFS) has been extended by adding cluster-wide serialization with the help of the token manager. Remote disk access has been provided over the network. Comparison of this parallel JFS implementation with the shared file system approach is one part of the evaluation. The conclusion is that efficient metadata serialization is intrinsically difficult in the shared disk approach.
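As a rough illustration of how a token manager can keep caches consistent, the following is a minimal single-process sketch, not the distributed implementation evaluated here. It assumes a simple policy of many read tokens or one write token per file; the class name `TokenManager`, its methods, and the revocation log are all hypothetical names introduced for this example.

```python
from collections import defaultdict

READ, WRITE = "read", "write"


class TokenManager:
    """Toy token manager (hypothetical): many readers or one writer per file."""

    def __init__(self):
        # file_id -> {node_id: mode} for tokens currently granted
        self.holders = defaultdict(dict)
        # log of (file_id, node_id) pairs whose tokens were revoked
        self.revocations = []

    def acquire(self, node_id, file_id, mode):
        """Grant a token to node_id, first revoking conflicting tokens."""
        held = self.holders[file_id]
        for other, other_mode in list(held.items()):
            if other == node_id:
                continue
            # A write request conflicts with every other token; a read
            # request conflicts only with another node's write token.
            if mode == WRITE or other_mode == WRITE:
                self._revoke(other, file_id)
        held[node_id] = mode

    def _revoke(self, node_id, file_id):
        # Assumption: a real system would make the holder flush or
        # invalidate its cached pages before giving up the token.
        del self.holders[file_id][node_id]
        self.revocations.append((file_id, node_id))

    def holders_of(self, file_id):
        """Return a copy of the current token holders for file_id."""
        return dict(self.holders[file_id])
```

In this sketch, two nodes can hold read tokens on the same file at once, while a write request from a third node revokes both before it is granted. A node may use its cached copy of the data only while it holds a token, which is what makes the caches consistent.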

The second part of the evaluation is based on performance measurements of the two implementations. (Although adequate for an insightful evaluation, our implementation of the shared disk approach can simultaneously support only one read-write mount and many read-only mounts.) We have used an Nhfsstone-derived micro-benchmark, scalability measurements based on the Andrew benchmark, read and write throughput measurements, and a concurrent benchmark. Results show that the shared file system approach performs better in concurrent read-write access and in reading a large file sequentially, whereas the shared disk approach has a small advantage when almost all accesses can be serviced by the local cache. Since a file system for a homogeneous cluster can be designed using either of the approaches, an experimental evaluation is valuable; we are unaware of any previous work of this kind.

Although fault tolerance is important in a cluster file system, we did not attempt an evaluation based on fault tolerance. However, a non-disruptive recovery scheme exploiting the multi-ported disk hardware has been implemented for the shared file system approach.



Wed Jun 5 16:17:08 EDT 1996