
Systems and Storage Seminar 2005

IBM Haifa Labs



December 11, 2005
Organized by IBM Research Lab in Haifa, Israel




Provisioning Space in Single Track Granularity: Implementation and Usage

Rivka Matosevich, IBM Research Lab in Haifa

Regular volumes are fully provisioned when created. For some usages, particularly point-in-time copy, this space is in a sense wasted. When a point-in-time copy is executed without a background copy, e.g., when writing a backup to tape after which the relationship is withdrawn, some of the target volume space is never accessed. In general, the amount of space needed should be a function of the write activity on the source and target volumes during the lifetime of the point-in-time copy relationship. In other words, the target should only require sufficient space to hold those tracks for which we must perform a copy-on-write, or which are written directly to the target.
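As a rough illustration of the idea (not the FlashCopy implementation itself), the following Python sketch shows a space-efficient target that allocates repository space only for tracks that are copied on write from the source or written directly to the target. All class and method names are hypothetical.

class Repository:
    """Shared pool that hands out physical tracks on demand."""
    def __init__(self):
        self.tracks = {}          # physical track id -> data
        self.next_free = 0

    def allocate(self, data):
        tid = self.next_free
        self.next_free += 1
        self.tracks[tid] = data
        return tid


class SpaceEfficientTarget:
    """Point-in-time copy target that only consumes space for tracks
    that are copied-on-write from the source or written directly."""
    def __init__(self, source, repo):
        self.source = source      # list of track payloads (fully provisioned)
        self.repo = repo
        self.track_map = {}       # logical track -> physical track in repo

    def source_write(self, track_no, data):
        # Copy-on-write: preserve the old source track before overwriting it.
        if track_no not in self.track_map:
            self.track_map[track_no] = self.repo.allocate(self.source[track_no])
        self.source[track_no] = data

    def target_write(self, track_no, data):
        self.track_map[track_no] = self.repo.allocate(data)

    def target_read(self, track_no):
        if track_no in self.track_map:
            return self.repo.tracks[self.track_map[track_no]]
        return self.source[track_no]   # unmodified tracks come from the source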

In this talk, we will discuss the features and directions of Space Efficient FlashCopy from the developer's point of view.


Split Path Storage Virtualization with Intelligent Switches

Yuval Zohar, Director, Product Management, StoreAge Networking Technologies Ltd.

Split path storage virtualization is a fabric-based virtualization that splits the data path and the control path.

As opposed to in-the-data-path (aka symmetric) solutions, where all the data has to pass through the virtualization component and thus creates a potential bottleneck, split path is an out-of-the-data-path architecture in which the virtualization component controls the IO routing without the IO passing through it.

Until recently, the most common way to implement split path was to install an agent on each host that changed the IO routing based on information received from the virtualization component.

Intelligent switches not only route IO based on physical addresses (layer 2 switching), but can also look deeper into fabric addresses and route according to routing logic as well (layer 3). The routing is done in "wire speed" using dedicated hardware, while the routing logic can be set by software.

Since virtualization is mostly about switching IO of virtual devices to the appropriate physical devices, intelligent switches, which have recently appeared in the market, offer an excellent architecture for split path storage virtualization.
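The division of labor can be sketched as follows: the out-of-band controller builds the virtual-to-physical extent map (the routing logic), and the switch only performs fast per-IO lookups on that map. The Python sketch below is purely illustrative; the class names, fields, and LBA-level mapping granularity are assumptions, not any vendor's API.

from bisect import bisect_right


class VirtualizationController:
    """Out-of-band controller: builds the mapping, never touches the data."""
    def __init__(self):
        self.extents = []   # (virtual_start_lba, length, target_port, physical_lba)

    def map_extent(self, vstart, length, target_port, pstart):
        self.extents.append((vstart, length, target_port, pstart))
        self.extents.sort()

    def push_to_switch(self, switch):
        switch.install_table(list(self.extents))


class IntelligentSwitch:
    """In-band element: routes each IO using only the installed table."""
    def install_table(self, extents):
        self.extents = extents
        self.starts = [e[0] for e in extents]

    def route(self, virtual_lba):
        i = bisect_right(self.starts, virtual_lba) - 1
        vstart, length, port, pstart = self.extents[i]
        assert vstart <= virtual_lba < vstart + length
        return port, pstart + (virtual_lba - vstart)


controller = VirtualizationController()
switch = IntelligentSwitch()
controller.map_extent(0, 1000, "target_A", 5000)
controller.map_extent(1000, 1000, "target_B", 0)
controller.push_to_switch(switch)
print(switch.route(1500))   # -> ('target_B', 500)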


Multi-Box RAID with 3rd Party Transfer and ECC Calculation in Targets using RDMA

Yitzhak Birk and Erez Zilber, Department of Electrical Engineering, Technion

Storage devices are becoming cheaper, but reliable and highly available storage systems are still expensive. Also, as long as any given ECC group resides in a single box, it is susceptible to failures that affect the entire box. This problem can be overcome by a multi-box RAID comprising a controller that is connected to multiple target "boxes", with each ECC group comprising at most one block from any given box. However, retaining performance despite the use of an external controller remains a challenge. Also, retaining the same size of storage box (for cost-effectiveness) requires the controller to manage more storage and activity, resulting in a scalability challenge.

iSCSI over iSER is an extension of iSCSI that splits the control and data paths. It also takes advantage of an RDMA mechanism (provided, for example, by InfiniBand) for data transfers, while sending of control messages is left unchanged. This inter-box communication solution, which we use as a baseline for comparison, is a candidate substitute for the intra-box DMA, but leaves two problems unsolved:
  1. All data to/from hosts passes through the controller, rendering the controller a communication bottleneck.
  2. ECC calculations are carried out in the controller, requiring additional data transfers between the controller and the disks, further aggravating the controller bottleneck problem.

Our TPT-MB RAID jointly addresses the aforementioned challenges. The main ideas are:
  1. A multi-box RAID that uses iSCSI over iSER as described above. Specifically, RDMA over InfiniBand is used for data transfers.
  2. Separation of the data path from the control path, permitting data to be transferred directly between hosts and targets as well as among targets. To this end, we have extended iSCSI over iSER by introducing a 3rd-party transfer mechanism. With this, one iSCSI entity (the controller) instructs a 2nd iSCSI entity (target or host) to read data from or write data to a 3rd iSCSI entity.
  3. ECC calculations (parity update etc.) are carried out by the target boxes under management of the controller.

Unlike the aforementioned baseline approach, commands and data thus follow different physical paths rather than merely using different communication semantics over the same paths. This leaves the controller out of the main data path, thereby sharply mitigating the bottleneck and enhancing scalability while retaining the simplicity of centralized control. In summary, we have successfully extended the idea of out-of-band controllers that manage multiple boxes to the intra-RAID level, as demonstrated by our proof-of-concept prototype.
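To make the parity-offload idea concrete, here is a hedged Python sketch of a RAID-5-style small write in which the controller only issues commands, the data target computes the XOR delta, and the parity target applies it, so the payload never passes through the controller. The classes and message flow are illustrative, not the TPT-MB protocol itself.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class TargetBox:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                      # block number -> bytes

    def write_and_compute_delta(self, blk, new_data):
        """Small write on a data target: return old XOR new for the parity box."""
        old = self.blocks.get(blk, bytes(len(new_data)))
        self.blocks[blk] = new_data
        return xor(old, new_data)

    def apply_parity_delta(self, blk, delta):
        """On the parity target: new_parity = old_parity XOR delta."""
        old = self.blocks.get(blk, bytes(len(delta)))
        self.blocks[blk] = xor(old, delta)


class Controller:
    """Issues commands only; payloads travel target-to-target (3rd-party style)."""
    def small_write(self, data_target, parity_target, blk, new_data):
        delta = data_target.write_and_compute_delta(blk, new_data)
        parity_target.apply_parity_delta(blk, delta)   # delta flows target -> target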


Medical Diagnostic Imaging - Data Storage and Networking Requirements

Moshe Avnet, IT Professional Services Manager, PACS Sales Technical Support, GE Healthcare

A list of the main medical imaging modalities, the technology being used to generate the images, and their use will be presented. What is the distribution of the imaging modalities within a medical organization, and why is it important? How are examinations ordered and where are they performed in the different scenarios? What are the requirements for viewing the examination results, images, and reports?

Diagnostic reading is performed by radiologists and expert physicians. The requirements for diagnostic reading depend on the physical location where the reading is performed and on the modality on which the latest examination was done. Each reading of an examination has a different need for historical studies, depending on the modality and its physical location.

General review requirements: General review of diagnostic imaging is performed by primary care physicians in hospital wards, by specialty physicians, and by family primary care physicians. All reviewing personnel need to review the diagnostic report, written by the radiologist, alongside the images.

Storage requirements: What are the storage requirements per modality per year? What are the storage requirements per medical organization per year?

Storage strategy: What is the storage strategy being used to provide adequate response to all the above mentioned requirements? Data security and system uptime considerations will also be discussed.

Network strategy: What are the network speed requirements for the different parts of the system within the organization? Design considerations for the online storage, near-line storage, data center storage, and full back-up storage will also be presented.


Self Tuning Memory Management in a Relational Database System

Sam Lightstone, IBM Canada

"Civilization advances by extending the number of important operations which we can perform without thinking about them."
Alfred North Whitehead

The problem of self-tuning memory within database systems has been an open research problem for over 30 years. The complexity of this problem is due in part to the heterogeneous way in which memory is used by the database for locking, caching, sorting, hashing, and numerous other purposes. DB2 UDB has announced the revolutionary Self Tuning Memory Manager (STMM), to be introduced in the upcoming Viper release, which automates the task of memory tuning. The tuning technology is completely hands-off, and scales from small single-user systems to large enterprise warehouses. STMM has shown excellent results for many varied workloads from OLTP through decision support systems.

STMM continuously monitors the system in an effort to tune the buffer pools and several of the most critical memory related configuration parameters. Through the use of an iterative tuning approach, STMM is able to tune systems to near optimal memory distribution in less than an hour. More impressive, however, is STMM's ability to handle sudden, drastic, and sustained shifts in memory requirements.
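The following Python sketch illustrates the general shape of such an iterative approach: memory is repeatedly shifted from the consumer with the lowest estimated marginal benefit to the one with the highest, until the benefits roughly equalize. It is only a toy model with made-up benefit curves, not the actual STMM algorithm or its DB2 parameters.

def tuning_step(pools, benefit, step_pages=1000):
    """pools: dict name -> pages; benefit(name, pages) -> estimated benefit
    of the next page given to that consumer (e.g., saved disk reads)."""
    gains = {name: benefit(name, size) for name, size in pools.items()}
    winner = max(gains, key=gains.get)      # would benefit most from more memory
    loser = min(gains, key=gains.get)       # hurts least from losing memory
    if winner != loser and gains[winner] > gains[loser]:
        moved = min(step_pages, pools[loser])
        pools[loser] -= moved
        pools[winner] += moved
    return pools


# Hypothetical consumers and diminishing-returns benefit curves.
pools = {"bufferpool": 50_000, "sort": 50_000, "lock_list": 50_000}
curves = {"bufferpool": 3.0, "sort": 1.0, "lock_list": 0.2}
benefit = lambda name, pages: curves[name] / (1 + pages / 10_000)

for _ in range(50):                         # run periodically in the real system
    pools = tuning_step(pools, benefit)
print(pools)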

This talk first outlines the new STMM feature and describes some of the research strategies that were used to effect tuning and control. We then briefly describe the internals of the STMM, showing how the architecture creates a robust and powerful feature. Finally, we illustrate the feature's strengths by highlighting some of the impressive results that have been achieved when running STMM in the lab.

"Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction."
Albert Einstein



The Xen Hypervisor and its IO Subsystem

Muli Ben-Yehuda, IBM Research Lab in Haifa

The Xen hypervisor provides fast, secure, open-source virtualization that allows multiple operating system instances to run on a single physical machine. Xen supports modified guest operating systems using a technique known as para-virtualization, and unmodified guests using hardware virtualization support on the latest Intel VT enabled processors. Xen offers near native performance for guest operating systems, and supports live migration of guests between servers with typical application impact of less than 100ms.

This talk will have three parts: in the first part, I will discuss the origins, architecture, and current status of the Xen hypervisor. In the second part, I will focus on Xen's IO subsystem: the split driver architecture, driver domains, and hardware access from multiple virtual machines. The third part will be dedicated to IOMMUs: both software mechanisms (Xen's grant tables and Linux's swiotlb) and hardware IOMMUs.
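As a conceptual aid, the Python sketch below mimics the shape of the split driver model: a guest-side frontend queues block requests on a shared ring, and a backend in the driver domain services them against the real device. It illustrates the pattern only; it does not use Xen's actual ring, grant-table, or event-channel APIs.

from collections import deque


class SharedRing:
    """Stands in for a page of memory shared (granted) between domains."""
    def __init__(self):
        self.requests = deque()
        self.responses = deque()


class BlockFrontend:
    """Lives in the guest: exposes read_block() by queueing a request."""
    def __init__(self, ring):
        self.ring = ring
        self.next_id = 0

    def read_block(self, sector):
        rid = self.next_id
        self.next_id += 1
        self.ring.requests.append({"id": rid, "op": "read", "sector": sector})
        return rid


class BlockBackend:
    """Lives in the driver domain: has real access to the device."""
    def __init__(self, ring, device):
        self.ring = ring
        self.device = device              # e.g., a dict of sector -> data

    def poll(self):
        while self.ring.requests:
            req = self.ring.requests.popleft()
            data = self.device.get(req["sector"], b"\0" * 512)
            self.ring.responses.append({"id": req["id"], "data": data})


ring = SharedRing()
fe = BlockFrontend(ring)
be = BlockBackend(ring, device={7: b"hello"})
rid = fe.read_block(7)
be.poll()
print(ring.responses[0]["id"] == rid)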


Keynote: Using Object Storage Devices to Provide Highly Reliable, Highly Scalable Storage

Ethan Miller, Associate Professor, Department of Computer Science, Baskin School of Engineering, University of California Santa Cruz

The data storage needs of large high-performance and general-purpose computing environments are generally best served by distributed storage systems because of their ability to scale and withstand individual component failures.

Object-based storage promises to address these needs through a simple networked data storage unit, the Object Storage Device (OSD), which manages all local storage issues and exports a simple read/write data interface. Despite this simple concept, many challenges remain, including efficient object storage, centralized metadata management, data and metadata replication, data and metadata reliability, and security.
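The Python sketch below illustrates the kind of interface an OSD exports: byte-range reads and writes on named objects, with physical placement handled entirely inside the device. The method names are illustrative and do not follow the T10 OSD command set.

class ObjectStorageDevice:
    def __init__(self):
        self.objects = {}                       # object id -> bytearray

    def create(self, oid):
        self.objects[oid] = bytearray()

    def write(self, oid, offset, data):
        obj = self.objects[oid]
        if len(obj) < offset + len(data):
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data   # device decides physical placement

    def read(self, oid, offset, length):
        return bytes(self.objects[oid][offset:offset + length])


osd = ObjectStorageDevice()
osd.create("42.1001")
osd.write("42.1001", 0, b"scalable storage")
print(osd.read("42.1001", 0, 8))                # b'scalable'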

This talk will open by detailing the object-based storage paradigm and showing how it improves on existing storage paradigms. I will then describe current object-based storage efforts, focusing on those that can scale to provide petabytes of storage and provide high reliability. I will conclude with a discussion of open challenges in object-based storage and use a (cloudy, perhaps) crystal ball to predict the future of object-based storage systems.


Analysis of PQRS Models

Eitan Bachmat, Department of Computer Science, Ben-Gurion University

In 2002, M. Wang, A. Ailamaki, and C. Faloutsos introduced a very interesting class of models for I/O traces, the PQRS models. Unlike classical models, which use standard distributions for interarrival times, PQRS models use a fractal measure that leads to strong I/O bursts, strong locality of reference for addresses, and correlations between these two phenomena. Consequently, these models can be used to represent regularly occurring properties of real I/O traces that were previously difficult to model synthetically.
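As we understand the model, a PQRS sample can be drawn by recursively splitting the (time x address) square into four quadrants with probabilities p, q, r, and s that sum to one, choosing a quadrant at each level, and descending to the desired resolution. The Python sketch below generates synthetic (time, address) pairs this way; the quadrant labeling and parameter values are assumptions made for illustration.

import random


def pqrs_sample(p, q, r, s, levels):
    """Return one (time, address) pair in [0, 2**levels) x [0, 2**levels)."""
    t = a = 0
    for _ in range(levels):
        t <<= 1
        a <<= 1
        x = random.random()
        if x < p:            # early time, low address
            pass
        elif x < p + q:      # late time, low address
            t += 1
        elif x < p + q + r:  # early time, high address
            a += 1
        else:                # late time, high address
            t += 1
            a += 1
    return t, a


# Skewed probabilities produce bursts in time and locality in address space.
trace = sorted(pqrs_sample(0.55, 0.2, 0.2, 0.05, levels=16) for _ in range(10000))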

The work of Wang, Ailamaki, and Faloutsos has been mostly experimental in nature. In our presentation, we will show how to compute analytically, without the need for simulations, several properties of interest of PQRS models. In particular, we provide a formula for the average seek distance of a PQRS-generated trace. We provide optimal static and dynamic caching algorithms for PQRS traces and compute their hit ratios. We also examine the stream of read misses emerging from a cache whose input is a PQRS trace. While the read miss stream is not a PQRS trace, it is asymptotically so for seek distance calculations, but with different parameters than the cache input stream.

Finally, we prove a couple of empirical observations of Wang et al. regarding the extremal nature of stationary PQRS within the whole family of PQRS models.


Application Mobility and Replication: Meiosys Approach to Virtualization

Marc Vertes, Philippe Bergheaud, IBM STG, France

Since its inception and until its recent acquisition by IBM, Meiosys has specialized in the development of virtualization middleware for software applications.

In this talk, we present the key concepts of the virtualization layer architecture, which is now being submitted to the Linux kernel development community, and its applications.

A state capture and restore facility added on top of the virtualization layer enables the implementation of application mobility and addresses several useful scenarios in the datacenter, such as consolidation/deconsolidation of workloads among servers and protection of long-running jobs by periodic checkpointing or by predictive failover. A standard protocol to support a coordinated checkpoint for distributed MPI-based applications has also been published by Meiosys.

We finally address current research topics, which aim to add an event record and replay facility to the existing virtualization layer, in order to achieve complete fault tolerance and an extraordinary debugging platform.


Scalable and Robust Content Distribution with Dynamic Users

Idit Keidar, Department of Electrical Engineering, Technion; Roie Melamed, Department of Computer Science, Technion

In recent years, there has been a growing demand for scalable and robust content distribution systems. Such systems are mainly used for distributing non-real-time data such as a new software release or a software update. A content distribution system deployed over the Internet must be able to withstand frequent node failures as well as non-negligible message loss rates. Moreover, studies have shown that users typically join and leave multicast sessions frequently; such behavior is called churn. Hence, a content distribution system also needs to cope with high churn rates.

Most current content distribution systems are centralized, consisting of a farm of servers. The big disadvantage of this centralized approach is its cost - a farm of servers is expensive to construct and maintain, and it consumes high bandwidth. In the peer-to-peer approach, there are no centralized servers, and each client acts as a server for some other clients. The advantage of this approach is its low cost, as it requires no centralized servers. Typically, in the peer-to-peer approach, clients self-organize into an overlay network, and each client communicates only with its overlay neighbors. In order to achieve high reliability despite high failure and churn rates, the overlay needs to be highly connected, including several disjoint paths between each pair of clients in the overlay. In addition, all the clients should have roughly the same number of neighbors, in order to achieve load balance and fairness. Additionally, this number should be small, so that the overhead incurred on each client is low. Finally, the overlay's diameter should be (at most) logarithmic in the number of clients, in order to achieve low data latency.

Naturally, k-regular random graphs have these important properties. However, such graph structures do not support dynamic user behavior. Therefore, we have designed the Araneola overlay, which is an approximation of a k-regular random graph. In Araneola, for a tunable parameter k ≥ 3, each node's degree is either k or k + 1. Empirical evaluation shows that Araneola's overlay structure achieves the important properties of k-regular random graphs described above. In addition, as opposed to a k-regular random graph, Araneola supports dynamic user behavior, as each join, leave, or failure is handled locally and entails the sending of only about 3k messages in total, independent of the total number of clients. Given Araneola's overlay, data is disseminated to the clients through periodic gossip over the overlay's links: periodically, a single client injects new data packets into a small constant number of random clients, which further propagate the data to their neighbors, and so on. Note that this approach incurs a constant and small load on each client, and hence it can support a virtually unlimited number of dynamic clients.
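A minimal Python sketch of the dissemination step is given below: a packet injected at one node is forwarded once by every node to all of its overlay neighbors, reaching all clients in a roughly logarithmic number of gossip rounds. The overlay here is built with networkx's random regular graph generator (assumed available) rather than Araneola's own join/leave protocol.

import networkx as nx   # assumed available; used only to build the overlay


def disseminate(graph, origin):
    """Flood one data packet from `origin`; return (nodes reached, gossip rounds)."""
    delivered = {origin}
    frontier = [origin]
    rounds = 0
    while frontier:                           # one gossip round per iteration
        rounds += 1
        nxt = []
        for node in frontier:
            for nbr in graph.neighbors(node):
                if nbr not in delivered:
                    delivered.add(nbr)        # neighbor receives the packet once
                    nxt.append(nbr)
        frontier = nxt
    return delivered, rounds


k, n = 5, 1000
overlay = nx.random_regular_graph(k, n)       # stands in for Araneola's overlay
reached, rounds = disseminate(overlay, origin=0)
print(len(reached), rounds)                   # all n nodes in a few rounds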


Transactional-Execution Data-Replication Service TreadCAT

Eliezer Dekel, Gera Goft, Dean Lorenz, Roman Vitenberg, Alan Wecker, IBM Research Lab in Haifa

In a shared-nothing distributed environment, state caching and replication are done by exchanging messages that carry update notifications, modified state, or update method blocks. This involves introducing code into the application's state classes to send and receive these messages. Furthermore, many applications demand preserving strong guarantees with respect to execution semantics (such as isolation, atomicity, and agreement) and data availability. Devising replication protocols becomes even more complex in settings where membership of object replica groups may change dynamically, and where messages may get lost. Group communication services (GCS) strongly facilitate the design of such applications as they provide well-defined message delivery semantics in the presence of dynamic changes in the group membership. However, leveraging these capabilities in order to provide strong semantics of data availability and updates is still challenging.

TreadCAT is a component that is designed to work above a GCS and provide these traits. In a sense, it is a transactional repository for distributed shared objects. TreadCAT provides an interface for replicating objects on different nodes, while ensuring a range of consistency and currency guarantees. TreadCAT synchronizes, caches, and replicates object state changes among the participating nodes (members). It dynamically reconfigures to ensure the required consistency in cases when nodes are joining and/or leaving and when the network is partitioned. An application can store multiple (state) objects within an associated TreadCAT. It can choose to be notified when objects are modified/replaced and to handle any references between these objects, or to have TreadCAT transparently handle all object references. TreadCAT supports the concept of computation blocks. Computation blocks allow aggregating state changes to a set of objects before replication, providing a rich variety of code execution semantics, such as distributed locking, atomicity, agreement, and/or isolation conditions for the changes.
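The following Python sketch is an illustration of the computation-block idea only, not the TreadCAT API: several object modifications are executed against an isolated view, collected into a single change set, and then shipped to all replicas as one atomic message over a group-communication-style broadcast.

class ReplicatedStore:
    def __init__(self, gcs_send):
        self.objects = {}         # object id -> state
        self.gcs_send = gcs_send  # callable that multicasts to the member group

    def computation_block(self, fn):
        staged = dict(self.objects)           # isolate the block's view (values are simple here)
        fn(staged)                            # business logic mutates the copy
        changes = {k: v for k, v in staged.items() if self.objects.get(k) != v}
        self.gcs_send(("apply", changes))     # one message for the whole block
        self.objects.update(changes)          # apply locally as well (idempotent)

    def on_deliver(self, msg):
        kind, changes = msg
        if kind == "apply":
            self.objects.update(changes)      # every member applies atomically


# Hypothetical usage: transfer between two account objects in one block.
members = []
broadcast = lambda msg: [m.on_deliver(msg) for m in members]
a, b = ReplicatedStore(broadcast), ReplicatedStore(broadcast)
members.extend([a, b])
a.objects.update({"acct1": 100, "acct2": 0})
b.objects.update({"acct1": 100, "acct2": 0})
a.computation_block(lambda s: s.update(acct1=s["acct1"] - 30, acct2=s["acct2"] + 30))
print(b.objects)   # {'acct1': 70, 'acct2': 30}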

In addition to providing the required functionality, the system design pursues the following goals:
  • Variety of semantics through modularity
  • Ability to initiate updates on every replica
  • Ease of use and portability

The development cycle of a TreadCAT application is as follows: the programmer develops code that implements the business logic of the application, e.g., JavaBeans. In order to make the application deployable with TreadCAT, the programmer has to annotate the code and specify consistency requirements programmatically and through object descriptors. Then, the application administrator has to specify the policies tailored to the specific deployment environment (e.g., derived from SLA information that is provided at deployment time).

TreadCAT is built on top of the DCS group communication system. The architecture distinguishes between the proper DCS system and generic services provided as an application-level toolkit on top of DCS (such as Real Synchrony).




 


