IBM Israel
Skip to main content
Search IBM Research
   Home  |  Products & services  |  Support & downloads  |  My account
Select a Country Select a country
IBM Research Home IBM Research Home
IBM Haifa Labs Homepage IBM Haifa Labs Home
IBM Haifa Labs Homepage IBM Haifa Labs Leadership Seminars

Systems and Storage Seminar 2003

Visitors information
Confirmed participants
News Article

Advanced FlashCopy for Leopardshark (Abstract)
Amiram Hayardeni, Dalit Tzafrir, Sivan Tal, and Shachar Fienblit, IBM Haifa Labs

Advanced FlashCopy enables improved, more efficient data access, faster speed to initialization and scalability. The new capabilities greatly increase the flexibility, efficiency and performance of point in time copy functions.

From an end-user perspective, the following major additional features are available in Leopardshark:
  • Extent Relocation - the ability to have a source extent and target extent at different relative offsets from the beginning of their volumes. This feature changes the original FlashCopy in which the source and target extents must be at the same relative offsets. In addition, given extent relocation, it is no longer necessary to allocate for the target a full volume which is at least as large as the source volume. In fact, the source and target extents can even be on the same volume (as long as other constraints are met).
  • Multiple Relations Per Track - the ability for a track to be included in an extent which is in more than one concurrent relation. This is distinct from the ability that follows from extent relocation to allow a single volume to have non-overlapping extents participating in different relations. More specifically, any track can be the source to multiple target tracks as long as the track is not, itself a target. The original implementation of FlashCopy is limited to one relation per volume at any instance in time
  • Cross LSS - In the original, the source and target must be in the same Logical Subsystem (an internal ESS structure). This feature removes this limitation and allows FlashCopy to work throughout the ESS.
  • Incremental Flash Copy - Incremental FlashCopy is an enhancement to the existing FlashCopy function. With Incremental FlashCopy, it is no longer necessary to copy the entire volume for each point-in-time. Instead, it is only necessary to copy the tracks that changed on the source volume since the last FlashCopy to the target volume. The purpose of Incremental FlashCopy is to reduce the duration of creating a physical copy of a volume and to also minimize the impact to other applications (e.g. minimize usage of DA bandwidth).
When taken together, these three features give the user much more flexibility in using FlashCopy. They also change the anticipated usage pattern for FlashCopy. In particular we see a much larger number of relations than we saw in the original implementation (which is inherently limited to sum of half the number of volumes in each of the control units LSSs). We also see relations established and terminated with much greater frequency, primarily since the individual relations are smaller.

These features also imply certain pragmatic constraints on the implementation of FlashCopy which we must address.
  • Improved Support for Deleting a Relation - While the idea of Withdrawing a relation may be contrary to the concept of FlashCopy, it is an important usability feature. In the original FlashCopy a relationship can only be withdrawn if both the source and target volume are accessible. To ensure usability, it is important that the microcode support removing a relation even if only one of the volumes is accessible as part of Leopardshark (ESS 2.2.0).
  • Improved Establish Performance - In the original FlashCopy it took several seconds to establish a FlashCopy relation, during which time the volumes involved in the relation are not available for IO. By improving the flexibility of FlashCopy, the abusage of the function increases. The purpose of this function is to eliminate almost all cache scans that occurred during Flash Copy establish. These scans can constitute significant overhead; anecdotal evidence shows these scans can take many seconds.

Switch Based Storage Services (Abstract)
Aviad Dagan – Director, Product Management StoreAge Networking Technologies Ltd.

Simplifying storage management in a Storage Area Network (SAN) is a key requirement for users with fast-growing and increasingly heterogeneous server and storage environments. Traditional storage services are delivered using several technologies such as:
  1. Host based services (volume management, mirroring)
  2. Storage based services (snapshot, mirroring, replication)
  3. Virtualization in/out of the data path
Each of the above technologies while providing a partial solution introduces some severe limitations mostly in the areas of heterogeneity, performance, scalability and security.

The new generation of intelligent SAN fabric switches provides an opportunity for introducing fabric-based storage services without the above limitations, significantly increasing users' benefits.

This paper deals with various design and architectural options of using intelligent switches to effect such functions as storage virtualization, storage provisioning and reallocation of storage resources, low capacity snapshots, cross platform copy services, live migration of data and the like. In particular, the use of distributed approach to storage virtualization in a SAN environment while maintaining the central management facility is discussed.

Improving Database Functionality and Performance Using an Object Store Architecture (Abstract)
Gary Valentin, Noam Rinetzky, and Michael Factor, IBM Haifa Labs

Object based storage is an emerging technology aiming to establish itself as a new industry standard. Object stores provide a storage interface which is more abstract than the traditional block device interface, and at the same time more versatile. Instead of performing I/O operations on extents of blocks, the new storage interface provides an abstraction of \emph{storage objects\/}, which group together data that the client considers to be related and can be accessed based on their \emph{object id} and \emph{offset}.

This paper looks at how a database manager (such as DB2 or Oracle) might work with an object store, which new functionality we can expect, and what will be the performance implications to the classic relational model. We show that object stores offer an attractive alternative to existing storage models for database managers. Lastly, we discuss a prototype which we have implemented: an extension to a commercial database manager (DB2 Universal Database) with a native interface to object stores.

Association Rule Mining in Peer-to-Peer Systems (Abstract)
Ran Wolff and Assaf Schuster, Technion

Peer-to-Peer systems are an extreme case of distributed systems. They include GRID computing environments such as Condor [6] (20,000 computers), specific area computing systems such as SETI@home [7] (1,800,000 computers) or United-Devices [8] (2,200,000 computers), general purpose platforms such as Entropia [1] (60,000 peers), and file sharing networks such as Kazaa (1.8 million peers). Like any other system, peer-to-peer systems maintain and produce operational data. Furthermore, some of these systems (e.g., Kazza) can be interpreted as large scale distributed databases. In either case, mining the distributed data can be highly profitable - be it for direct commercial purposes (for example, mining Kazza users' entertainment preferences), or for other purposes (e.g., predicting and explaining failures in a Grid system).

The size of peer-to-peer systems dictates that the data cannot be collected for analysis but rather has to be processed on-site by a distributed data mining algorithm. Yet, current approaches to distributed data mining all have a simplistic data sharing paradigm: they always require some kind of an all-to-all communication. This feature alone makes them impractical for Peer-to-Peer systems - these are simply too large to allow intensive all-to-all communication. Moreover, a dominant feature of these systems is that they keep function even when some of the nodes fail or disconnect; this notion of partial failure was overlooked by current data mining algorithms. Furthermore, at this size of a system, it would be improper to restart the algorithm each time the data changes; thus, an incremental approach is mandatory. Last, since it can be expected that algorithms operating on systems of that scale have long execution time, an anytime algorithm, capable of producing early results fast and then improving the result with time, is preferred.

We have developed a new approach to data mining which uses local primitives. Local algorithms are distributed algorithms in which, under some restriction, a node can reach a result which is globally correct after consulting with just a few, nearby, nodes. Local algorithms have been the center of a lot of attention in the recent decade [4, 5, 3, 2]. The reason for that attention was mainly the in.nite scalability which is implied by the independence of nodes on their far-off neighbors. In a recent paper we describe a new local majority voting algorithm [9]. Then we demonstrate that a distributed association rule mining algorithm can be built atop distributed majority votes. In this way, we develop a local, anytime, algorithm for association rule mining which supports partial failures and dynamic changes in the data. Hence, this algorithm is perfectly fit for mining peer-to-peer systems.

  1. Entropia.
  2. S. Kutten and B. Patt-Shamir. Stabilizing time-adaptive protocols. Theoretical Computer Science, 220(1):93-111, 1999.
  3. S. Kutten and D. Peleg. Fault-local distributed mending. In Proceedings of the Fourteenth Annual ACM Symposium on Principle of Distributed Computing (PODC), pages 20-27, Ottawa, Canada, August 1995.
  4. N. Linial. Locality in distributed graph algorithms. SIAM J. Computing, 21:193-201, 1992.
  5. Moni Naor and Larry Stockmeyer. What can be computed locally? In 25th ACM Symposium on Theory of Computing, pages 184-193, 1993.
  6. The Condor Project.
  7. Seti@home.
  8. United devices inc.
  9. R. Wolff and A. Schuster. Association rule mining in peer-to-peer systems. In Proc. of ICDM 2003, to appear.

Light-Weight Leases for Large-Scale Coordination (Abstract)
Gregory Chockler and Dahlia Malkhi, Hebrew University

A distributed stroage system must coordinate the activities of servers that manage data. Two fundamental coordination issues are: First, providing _exclusive access_ to sensitive data, such as directories; Second, providing strong _consensus_ operations in order to replicate activity consistently. A typical manner in which these services are provided is through a central lease-manager.

In this work, we propose to use light-weight _leases_ to support both mutual exclusion and consensus. Informally, a lease is a shared object that supports a _contend_ operation, such that when contend returns 'true' at any process, it does not return `true' to any other process for a pre-designated period. The lease automatically expires after the designated time period. In addition, our lease supports a _renew_operation which allows a non-faulty leader to remain in leadership (indefinitely).

In our approach, leases are implemented from the very shared memory that they protect. That is, there is no global lease manager, there is a lease per data item (e.g., a file, a directory, a disk partition, etc.).

We make use of leases also in solving the quintessential agreement problem. In order to allow wait-free agreement to be solved, it is well known that the environment must be eventually synchronous for sufficiently long. Intuitively, this requisite enables a unique leader to be established and enforce a decision. We show how to implement such eventually-safe leader election from the bare shared memory environment, using leases as out building block.

Enabling High Performance Grid Computing (Abstract)
Yaron Haviv, CTO, Voltaire Inc.

There is a growing trend in the industry for replacing large SMP machines with PC based clusters, even for traditional mainframe applications such as Databases and massive parallel computation. Those PC based clusters typically come at a cost of lower MTBF/Reliability, under utilization of resources, significant management and administration complexity.

A key enabler for those clusters is the ability to provide linear scalability using high-performance interconnects combined with the ability to control and virtualize the hardware infrastructure in a way that will provide the same level of fault tolerance, predictability, and manageability available in the large SMP based machines, or even better, for a much lower costs.

Voltaire is one of the leading providers of InfiniBand based systems with large grid installations in key customer accounts, that enable high-performance Grid computing, and provide fully integrated solutions with built in Network, Storage, and Fabric virtualization.

The Session will discuss technologies for Grid and Storage resources virtualization, Grid infrastructure management abstraction, and have a short introduction to RDMA and InfiniBand technologies.

Information Services to Provide Single System Image in a Grid (Abstract)
Lior Amar, Amnon Barak, and Ilan Peer, Hebrew Univresity

Grid systems depend on information services for discovering, monitoring and managing resources. Managing information about the grid resources is a complex task due to the variety of possible resources, e.g., host speci˝c resources, cluster or virtual organization resources etc., most having dynamic char-acteristics (which may change rapidly). Other sources of complexity result from the requirements to support scalability, decentralized control, secure access control and at the same time provide reasonable performance to applications that rely on such information services.

Existing grid management systems provide bulletin board information gathering services e.g., the in-dex service of the Globus Toolkit Monitoring and Discovery Service (MDS) or the centralized approach of the Relational Grid monitoring Architecture (R-GMA). Some drawback ofsuch systems include static structures, which are inadequate for scalable, highly dynamic environments or handling of rapidly ˛uc-tuating resources (such as CPU load).

In this project we are developing an intelligent, hierarchical information management system that provides each node (or a cluster) su°cient, up to date information about the cluster (or the grid wide) resources. The main aspects of our scheme include:
  1. Collection of relevant information by each node (or cluster);
  2. Transparent, continuous dissemination of the latest available information;
  3. An aging method that guarantees that the average age of the information available is not greater than a given parameter
  4. Adaptability to dynamic con˝gurations
  5. Usefulness of old gathered information to future allocation of resources e.g., for load balancing or job assignment decisions
  6. Low network tra°c

The core of our scheme is an information vector which is updated using a randomized, distributed information dissemination algorithm. Each node holds such a vector and it executes the algorithm independently, thus allowing information queries to be serviced locally. We developed an analytic model for our scheme and veri˝ed its performance by simulations and by actual execution on a cluster with over 260 nodes. The results indicate a good matching between the analytic model and the experiments. It was also found to be useful for resource management and monitoring.

The scheme was already used to improve MPICH by assigning the next job to the best available node. Currently we are developing an hierarchical information management system (over the Globus Toolkit infrastructure) that is intended to provide a grid wide single system image. In the next phase we intend to use the scheme for e°cient resource allocation, e.g., for load balancing. For the long range, the model that we envision is a grid of MOSIX clusters, in which each cluster operates as a single system (like an SMP). The main advantages of this approach are reduced complexity and ease of use.

Keynote: Scale and Performance in Networked Storage (Abstract)
Brian Pawlowski, Vice President/Chief Architect, Network Appliance Inc.

Networked storage is increasingly finding a place in application deployment for reasons of disaster recovery, performance, and data management. Homologous storage architectures arise in response to scalable application platform clusters. This talk will survey some trends and explore issues in scalable storage approaches designed for application clustering.

Introduction to Disk Scheduling (Abstract)
Eitan Bachmat, Ben Gurion University

Modern disk drives are allowed in many circumstances to rearrange the I/O requests that they have recieved in order to minimize the time required to service all requests. This capability can have a great impact on the throughput of a disk drive and under favorable circumstances may even double the it. There are two natural questions assocuated with disk scheduling:
  1. Given a set of requests what is the optimal order in which they should be served so as to minimize total service time
  2. Given some statistical assumptions on the the nature of the I/O requests, what is the expected number of rotations which are needed to service all the requests.
These problems have not been solved in general, however given some simplifying assumptions on the nature of disk arm movement some interesting progress has been made, making the problem of analyzing disk scheduling by far the most exciting theoretical challange related storage systems.

In the talk we will briefly state what is currently known about disk scheduling.

We will then focus on a recent discovery (a reinterpretation of previous results) which shows that disk scheduling is intimately related to mathematical models of general relativity theory. In particular we will explain (as time permits) how to associate with an I/O access pattern a curved space-time geometry in such a way that the number of rotations needed to service the requests is proportional to the (relativistic) length of the longest geodesic in the model.

We will not assume any knowledge of relativity theory (the speaker is no expert himself).

Toward Self-Stabilizing Operating Systems (Abstract)
Shlomi Dolev and Reuven Yagel, Ben Gurion University

The robustness of an operating system is, in some cases, more important than its performance [4, 3]. The experience with existing operating systems, and in fact with every large on-going software package, is that it almost has its own in-dependent behavior. The behavior is tuned up and modified by system administrators that constantly and continuously monitor it. The system is usually complicated to monitor. The system administrators use human behavior and charac-ter terms, as if the system is an entity with its own will, to refer to its input output scenarios. The importance of a de-sign that is based on well understood theoretical paradigms, and give us control over the resulting system cannot be ex-aggerated. In particular in the case of the operating system, robustness is a must, as the operating system forms a ba-sic infrastructure in almost every computing system.

Designing robust operating system is a complicated and challenging task. The system designer makes several proba-bilistic assumptions that may not hold in a long enough ex-ecution. For example, soft errors [2] may cause an arbitrary change in memory bits that the error correcting schemes used will not identify. Another example, is that the commu-nication between the system components can be made reli-able, say by use of error correcting codes this assumption is also based on probability (where the life length of the sys-tem is a parameter). Once the probabilistic assumptions do not hold the designer can no longer guarantee much. In this work we propose several approaches for designing auto-matic recovering operating system that is based on the well defined and well understood self-stabilization paradigm [1]. Roughly speaking, a system is self-stabilizing if it can be started in any possible state and converge to a desired be-havior. A state of a system is an assignment of arbitrary values to the systems variables.

A self-stabilizing algorithm/system makes the obvious assumption that it is executed. This assumption is not simple to achieve since both the microprocessor [2] and the oper-ating system should be self-stabilizing, ensuring that even-tually the (self-stabilizing) applications programs are exe-cuted. An elegant composition technique of self-stabilizing algorithms [1] is used here to show that once the underling microprocessor stabilizes the self-stabilizing operating sys-tem (which can be started in arbitrary state) stabilizes, and then the self-stabilizing algorithms that implement the ap-plications stabilize. In this work we consider the important layer of the operating system. Operating systems are essen-tial parts of most computer systems. The operating system manages the hardware resources, and form an abstract (vir-tual) machine that is convenient to program by higher level applications developers.

One approach in designing self-stabilizing operating sys-tem is to consider an existing operating system (e.g., Mi-crosoft Windows, Linux) as a black-box and add compo-nents to monitor its activity and take actions accordingly, such that automatic recovery is achieved. We call this ap-proach the black-box based approach. The other extreme approach is to write a self-stabilizing operating system from scratch. We call this approach the tailored solution ap-proach. We present three design solutions in the scale of the black-box to the tailored solutions. The first simplest technique we propose for automatically recovery of an op-erating system is based on repeatedly reinstalling the oper-ating system and then re-execution. The second technique is to repeatedly reinstall only the executable portion, mon-itoring the state (variables content) of the operating system and assigning a legitimate state whenever required. Then we present a tailored very tiny self-stabilizing design for com-ponents of an operating system.

OGSA-based Problem Determination: a Use Case (Abstract)
Yariv Aridor, Yoav Gal, Zvi Har'El, Benny Rochwerger, and Mark Silberstein, IBM Haifa Labs

Application of computer-based automatic tools for the analysis of the logging data is imperative to keep up with the increasing complexity of the IT systems. The diversity of proprietary interfaces for management in different IT systems dictates the creation of new level of abstraction, allowing for the design of generic management infrastructure. The concept of virtualization as proposed in the Open Grid Services Architecture (OGSA) enables the required integration of disparate resources. Building on concepts and technologies from the Grid and Web services communities, this architecture defines uniform semantics for the essential services needed for building heterogeneous and dynamic distributed systems. While the OGSA vision is broad, the core subset of semantic element, as currently identified in the Open Grid Servicec Infrastructure (OGSI) spec., includes the standard mechanisms for creating, naming and discovering transient Grid service instances, provides location transparency and multiple protocol bindings for service instances, and supports integration with underlying native platform facilities.

The hands-on experience with OGSI would provide an insight to the new technology and would validate the proposed concepts. IBM Autonomic Computing (AC) Architecture maps directly onto the proposed framework and takes advantage of the virtualization approach. The AC Architecture attempts to ease the problem of managing IT systems in general and of problem determination in particular by developing self managing systems. In such systems each component is controlled by an intelligent feedback mechanism which monitors the component activity, analyses the information, plans corrective actions and then carry out these actions to adjust the component's behavior.

In the end-to-end problem determination toolkit (ePD Toolkit) the different components of the Autonomic Computing Architecture are exposed as OGSA services. In this paper we introduce the ePD Toolkit and use it to evaluate OGSA. We expand on the programming model and design patterns derived from using OGSA, and analyze its pros and cons. In addition the paper presents some results on the performance evaluation of the current available first open OGSA implementation.

Transparent Fault-Tolerant Java Virtual Machine (Abstract)
Roy Friedman and Alon Kama, Technion

Replication is one of the prominent approaches for obtaining fault tolerance. In a distributed environment, where computers are connected by a network, replication can be implemented by having multiple copies of a program run concurrently. In cases where a copy on one of the computers crashes, the others may proceed normally and mask that failure.

The holy grail of replication-based fault-tolerance is finding a good balance between the transparency of repli-cation to the application, the overhead that replication imposes, and the general tendency to use commodity hardware and operating systems. Transparency is vital in order to render legacy applications fault-tolerant, and in order to reduce the costs of developing and maintaining fault-tolerant systems. Also, commodity hardware and operating systems are becoming a common requirement in order to reduce total costs (economy of scale) and to avoid falling behind the technology curve.

Implementing replication as described above on commodity hardware and in a transparent fashion, i.e., with-out changing the programming model, has many challenges. One of the main complication lies in the fact that programs are usually not deterministic. In particular, a program’s behavior is often influenced by external events, such as I/O, local clocks, and possibly other local environment aspects like process ID, scheduling, memory management, etc. Clearly, any attempt to provide a fault-tolerant infrastructure must handle this inherent non-determinism.

We report on an implementation of transparent fault tolerance at the virtual machine level of Java. Our system design eliminates certain aspects of non-determinism and efficiently communicates the rest to all the replicas. We describe the design of the system and present performance results that in certain cases are equivalent to those of non-replicated executions. We also discuss design decisions stemming from implementing replication at the virtual machine level, and the special considerations necessary in order to support Symmetric Multi-Processors (SMP).

  About IBM  |  Privacy  |  Terms of use  |  Contact