All of IBM

IBM Haifa Labs

# A Kernel-based Communication Fault Injector for Dependability Testing of Distributed Systems

Roberto Jung Drebes, Gabriela Jacques-Silva, Joana Matos Fonseca da Trindade, and Taisy Silva Weber, Instituto de Informatica - Universidade Federal do Rio Grande do Sul

Software-implemented fault injection is a powerful strategy to test fault-tolerant protocols in distributed environments. In this paper, we present ComFIRM, a communication fault injection tool we developed that minimizes the probe effect on the tested protocols. ComFIRM explores the possibility of inserting code directly inside the Linux kernel in the lowest level of the protocol stack through the load of modules. The tool injects faults directly into the message exchange subsystem, allowing the definition of test scenarios from a wide fault model that can affect messages being sent and/or received.

Additionally, the tool is demonstrated in an experiment that applies the fault injector to evaluate the behavior of a group membership service under communication faults.

# Invited Talk: Testing of MPI Software

Dan Quinlan, Lawrence Livermore National Laboratory

Large-scale scientific applications using the Message Passing Interface (MPI) standard are often difficult to test. Systems at DOE laboratories commonly use thousands of processors and emerging systems have tens of thousands of processors. The development environment for these systems is typically a small fraction of the whole machine and tools are often difficult to use at larger sizes. The problem is that applications often exhibit timing issues such as deadlocks and races on the real execution environment that they did not show in the small development environment. Thus, a method that causes applications to exhibit the bugs in the simpler development environment is crucial.

We present initial work on techniques that automatically or semi-automatically cause the manifestation of timing-related bugs in MPI applications. These techniques work with small processor counts and alleviate the need to test on the entire system. Specifically, we will present a technique to evaluate the execution of MPI programs under perturbations of their scheduling of message passing calls. We will demonstrate that this technique aids in testing problems that are often not easily reproduced when testing on small fractions of the machine.

The technique builds upon ($P^NMPI$), an extension of the MPI profiling interface that supports multiple layers of profiling libraries. We will also explore how tools that directly transform the source code could support for testing and debugging MPI applications at reduced scale.

# Keynote: Checking Atomicity in Concurrent Java Programs

Scott D. Stoller, State University of New York at Stony Brook

Atomicity is a common correctness requirement for concurrent programs. It requires that concurrent invocations of a set of methods be equivalent to performing the invocations serially in some order. This is like serializability in transaction processing. Analysis of atomicity in concurrent programs is challenging, because synchronization commands may be scattered differently throughout each program, instead of being handled by a fixed strategy, such as two-phase locking.

We are developing techniques for checking atomicity in concurrent programs, using static analysis, dynamic analysis (run-time monitoring), and novel combinations of them. This talk surveys our research in this area, discusses trade-offs between different techniques, and describes our experience applying them to about 40 KLOC of Java programs.

This is joint work with Rahul Agarwal, Amit Sasturkar, and Liqiang Wang.

# Detecting Potential Deadlocks with Static Analysis and Runtime Monitoring

Rahul Agarwal, Liqiang Wang, and Scott D. Stoller, State University of New York at Stony Brook

Concurrent programs are notorious for containing errors that are difficult to reproduce and diagnose. A common kind of concurrency error is deadlock, which occurs when a set of threads is blocked each trying to acquire a lock held by another thread in that set. Static and dynamic (run-time) analysis techniques exist to detect deadlocks.

Havelund's GoodLock algorithm detects potential deadlocks at run-time. However, it detects only potential deadlocks involving exactly two threads. This paper presents a generalized version of the GoodLock algorithm that detects potential deadlocks involving any number of threads. Run-time checking may miss errors in unexecuted code. On the positive side, run-time checking generally produces fewer false alarms than static analysis.

This paper explores the use of static analysis to automatically reduce the overhead of run-time checking. We extend our type system, Extended Parameterized Atomic Java (EPAJ), which ensures absence of races and atomicity violations, with Boyapati {\it et al.}'s deadlock types. We give an algorithm that infers deadlock types for a given program and an algorithm that determines, based on the result of type inference, which run-time checks can safely be omitted. The new type system, called Deadlock-Free EPAJ (DEPAJ), has the added benefit of giving stronger atomicity guarantees than previous atomicity type systems.

Saddek Bensalem, Universite Joseph Fourier/Verimag, Grenoble, France, and Klaus Havelund, Kestrel Technology, Palo Alto, California, USA

This paper presents a dynamic program analysis algorithm that can detect deadlock potentials in a multi-threaded program by examining a single deadlock-free execution trace, obtained by running an instrumented version of the program. The algorithm is interesting because it can identify deadlock potentials even though no deadlocks occur in the examined execution, and therefore it scales very well in contrast to more formal approaches to deadlock detection. It is an improvement of an existing algorithm in that it reduces the number of false positives (false warnings). The presented work complements a collection of algorithms for detecting potentials for various forms of data races in multi-threaded programs from error-free program runs. The paper describes an implementation and two case studies.

# Cross-Run Lock Discipline Checker for Java

Eitan Farchi, Yarden Nir-Buchbinder, and Shmuel Ur, IBM Haifa Research Lab

Avoiding deadlock is a major challenge in concurrent/parallel programming. The classical deadlock situation happens when a thread T1 has taken lock L1 and attempts to take (nestedly) L2, while another thread T2 has taken L2 and attempts to take L1. A well-known methodology to avoid deadlocks is lock discipline - when several locks need to be taken together (nestedly), they must be taken in predefined order, and this order is to be shared among all threads in the system.

Several tools, such as Java PathFinder and Microsoft's Driver Verifier identify violations of lock discipline during test runtime. They only look within the scope of one process run. If a cycle in the graph is caused by lock sequences from two different runs, then these tools will not reveal it. A test suite is often composed of many small tests, each in a different process, each activating only a few and short paths, so these tools are much less likely to reveal cycles than a tool that has a view of the entire history of runs. Overcoming this limitation is not trivial, since it's not obvious how to identify the same lock across different runs.

We present a cross-run lock discipline checker. We associate a lock with a set of locations. At runtime, we trace lock operations (entrance to synchronized blocks). When the test suite has finished running, we have a set T of traces. An analyzer is run on T to create a partition of the code locations to "same lock" relation, as follows:

Two locations l1 and l2 are the "same-lock" if there exists a trace t in T and a lock object ID x such that both l1 and l2 appeared in t associated with x at least once. "same-lock" is the transitive closure of (1).

The nested locking graph is created as in existing algorithms, except that the nodes are "same lock" classes. Alerts are given for cycles found in this graph.

It has been shown that this algorithm can detect deadlocks in testing scenarios where existing tools cannot.

# Verification of the Java Causality Feature

Sergey Polyakov and Assaf Schuster, Department of Computer Science, Technion - Israel Institute of Technology

The Java Memory Model (JMM) formalizes the behavior of shared memory accesses in a multi-threaded Java program. Dependencies between memory accesses are acyclic, as defined by the JMM causality requirements. We study the problem of post-mortem verification of these requirements and prove that the task is NP-complete. We then argue that in some cases, the task may be simplified by tracing the actual execution order of read actions in each thread. Our verification algorithm has two versions: a polynomial version, to be used when the aforementioned simplification is possible, and a nonpolynomial version - for short test sequences only - to be used in all other cases. Finally, we argue that the JMM causality requirements could benefit from some fine-tuning. Our examination of causality test case 6 (presented in the public discussion of the JMM) clearly shows that some useful compiler optimizations - which one would expect to be permissible - are in fact prohibited by the formal model.

# Choosing Among Alternative Futures

Steve MacDonald and Jun Chen, School of Computer Science, University of Waterloo, and Diego Novillo, Red Hat, Inc.

Non-determinism is a serious impediment to testing and debugging concurrent programs. Such programs do not execute the same way each time they are run, which can hide the presence of errors. Existing techniques use a variety of mechanisms that attempt to increase the probability of uncovering error conditions by altering the execution sequence of a concurrent program, but do not test for specific errors.

This paper presents some preliminary work in deterministically executing a multi-threaded program using a combination of an intermediate compiler form that identifies concurrent reaching definitions and aspect-oriented programming to control program execution. Specifically, the aspects allow a read of a shared variable to return any of the reaching definitions, where the desired definition can be selected before the program is run. As a result, we can deterministically run test cases. This work is preliminary and many issues have yet to be resolved, but we believe this idea shows some promise.