IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Author's Guide  
Journal of Research
and Development
  Staff  
  Contact Us  
Systems Journal  
Volume 36, Number 3, 1997
Nontopical Issue
 Table of contents: arrowHTML arrowASCII   This article: HTML arrowASCII   DOI: 10.1147/sj.363.0393 arrowCopyright info
   

Multimedia resource management in OS/390 LAN Server

by A. Dan
In a multimedia server, resource reservation is critical for guaranteeing jitter-free delivery of video. In this paper, we describe the resource manager component of the OS/390* LAN Server (video server) that implements resource reservation. The resource manager functions can be divided into three categories: (1) the system management functions that allow arbitrary multimedia resources to be dynamically defined, undefined, and calibrated, and their capacities monitored remotely via the Simple Network Management Protocol, (2) the operational functions that allow video streams to reserve and release resources for supporting playback without explicit specification of the needed resources, and (3) the real-time data import function that places videos on the storage devices so as to balance the load among the devices after making the necessary space and bandwidth reservations. We finally discuss research issues to exploit economies of scale

Rapid growth in computer, communication, and display technologies has led to the development of newer applications that embed continuous media objects such as video and audio, as well as traditional objects (e.g., image data). Examples of such emerging commercial applications are video-on-demand, distance learning, digital library containing various media types, and multimedia database. [1] These applications require development of multimedia servers capable of storing media data and delivering these data to remote clients.

The design of such servers poses new challenges in addition to traditional requirements such as reliability, high availability, resource optimization, and cost-effectiveness. Foremost among these new challenges not found in the design of traditional servers are the requirement for smooth continuous delivery of video and audio data to clients and efficient storage and retrieval of large media files. [2-4] Large variations in response times, such as those encountered in traditional server systems (e.g., in transaction and query processing, and in file servers) will lead to "hiccups" in video and audio delivery. To avoid variations in response time and guarantee continuous delivery, the server needs to set aside sufficient resources for each stream before starting its playback. It is the requirement for resource reservation that distinguishes multimedia servers from traditional servers such as file servers.

In this paper, we describe the design and implementation of the resource manager (RM) for OS/390* LAN Server, the component that implements this resource reservation. The OS/390 LAN Server evolved from an earlier version that was a traditional high-performance file server for workstation clients. [5] The addition of the resource manager component enabled this file server to support delivery of continuous media and to become a full-fledged multimedia server. The other enhancements that further improved the performance for such I/O-intensive applications are detailed in Reference 5.

In the next section of this paper, we detail our design objectives and provide an overview of the resource manager. In the section on implementation, the details of implementations and various performance issues (e.g., bandwidth and space utilizations, and concurrent data structures for high performance) are discussed. New algorithms (e.g., caching, batching, load balancing) that can exploit economies-of-scale for long video applications are discussed in the section, "Research issues." We then summarize and conclude this paper in the last section.

Design objectives

The primary objective of the resource manager is bandwidth (BW) management of various (physical and logical) system resources as well as management of space for storage devices. The BW (the data transfer rate of a device) and space management objectives include efficient use of the system resources, support of continuous delivery guarantee, and hiding of the complexities of the system from a higher-level application. Upon a request for playback of continuous media objects, the resource manager reserves appropriate resources necessary for playback without explicit enumeration from the application. The resource reservation is implicit, and the end user may not even be aware of this.

Figure 1 illustrates a typical OS/390 LAN Server (multimedia server) environment consisting of storage subsystems, server, and a set of front-end processors. The applications may range from playback of a long video to retrieving many small video clips interactively as in a shopping application. As mentioned earlier, prior to playback of any video material, a logical channel (i.e., a BW reserved data delivery path) needs to be established either statically for the entire application session or dynamically as needed. The client requests for setting up an application session arrive at the resource manager over a two-way channel established for the purpose of exchanging control messages. The resource manager is responsible for admission control, and it keeps track of the available resources (e.g., disk BW, CPU [6,7]). In Figure 1, each data path emanates from a striping group and passes through a mainframe node and a front-end processor (FEP) toward a client node. (Striping is the use of multiple DASDs [direct access storage devices] as a single logical volume. The set of volumes is referred to as a striped set. An FEP is a LAN [local area network] server workstation running OS/390 LAN Server, through which LAN workstation users can access files stored on an OS/390 system. There may be other FEPs accessing other workloads or even accessing nonvideo files managed by the OS/390 LAN Server.) The resources that need to be managed by the resource manager are defined explicitly. (The resources on the LAN that connects clients to the FEP are managed by the Operating System/2* [OS/2*] LAN Server Ultimedia*.) The resource manager also needs to be aware of the capacity of each resource and, hence, can guarantee quality of service by controlling admission. The resources required to play back a video or audio file are enumerated by the path-based reservation algorithm that reserves necessary BW on all resources in the chosen path. Each path will capture all the resources that could potentially be a bottleneck, depending on the combination of workloads. There could be multiple paths for playing back a video. For example, if a video file has multiple replicas residing on different storage devices, each path may start from a different storage device. The resources that need to be kept track of (e.g., buffer space, adapter capacity, etc., in addition to disk BW and CPU MIPS) depend on the possible system configurations and on how well balanced the capacities of various resources are. In a more general environment with a heterogeneous set of components and access methods (e.g., different types of adapters, different transport protocols, multiple system nodes accessing the same set of disks, etc.), it is difficult to predict the bottleneck resources. [8,9] An important design objective is the ability to handle diversity of configurations.

Figure 1

The defined resource could either be logical or physical. For example, the resource manager needs to be aware of a striping group made of multiple devices, so that the striping group can be treated as a single logical device. Otherwise, appropriate resources need to be reserved on all devices over which a media object is striped. Resource management can also be hierarchical, where a single server can be partitioned into multiple virtual machines, and the resource manager for each component manages its share of resources. [10] For example, some capacities (of CPU and disk) can be set aside (by setting the capacities of the resources to a lower value) from the resource manager to make sure that some resources are always available for system management tasks. We will use a similar method to set aside some disk BW for staging of assets. Additionally, by providing a higher priority to multimedia and other real-time tasks, the available background BW can be used for other services such as traditional file services (also see earlier discussion in Note 6). [11,12]

Functional objectives. The resource manager needs to support three classes of functions: system management, operational functions, and real-time data import.

System management. As described above, the bottleneck resources that need to be managed also need to be defined explicitly. The capacities of these resources (also referred to as devices) need to be either measured or set explicitly so that admission can be controlled. By setting capacities explicitly, the administrator can sometimes adapt to an unplanned situation or unforeseen condition. For example, if the measured capacities are incorrect, or the current system state results in a high rate of jitter, the administrator can explicitly set the capacities of the resources at a lower level.

Upon failure of a device, the resources need to be undefined or marked inactive, so that no newly admitted stream would use this device. Similarly, the resource manager needs to be notified after bringing a device on line. The resource manager should also support querying the state of these devices (e.g., active or inactive, allocated, and maximum capacity). To integrate monitoring of the multimedia server with those of other system components in an end-to-end multimedia environment, various exceptional states (e.g., low disk space) should be sent to a remote agent (e.g., Simple Network Management Protocol, or SNMP).

Operational functions. For isolating an application from the complexities of the environment, it is desirable for the application not to be aware of the various resources that need to be reserved by the resource manager for playback of a stream. These reservations should be automatic when the playback of a media object is requested. The various resource capacities (e.g., device BW, CPU MIPS) required to play back a media object can be maintained in an associated file to make this reservation process transparent. The associated file is referred to as a metafile.

A metafile contains the BW information for a particular asset, the number of replicas of the asset, and their locations. The client OPENs the metafile for an asset that results in selection of a replica for playback and implicit reservation of required resources by the resource manager. An asset is a data set that normally contains multimedia data to be delivered at a guaranteed rate to the client. There may be multiple copies of the asset that are referred to as replicas. In order to guarantee that sufficient BW is available for a specified number of concurrent users, multiple copies of the asset may be required on different volumes. When the client OPENs a metafile entry for an asset and resource reservation has been successful, this instance of the delivery of the asset to the client is referred to as a stream.

For efficient use of system resources, the reservation procedure should also determine all available paths from the storage devices containing replicas of the requested multimedia object to the network interface. The network interfaces may include FEPs connected by ESCON* (Enterprise Systems Connection) channels as well as various types of adapters (e.g., ATM, or asynchronous transfer mode). It should then select the path that is least utilized to balance the load across all devices. The reservation of all required resources should be atomic; that is, all resources will be acquired at once, or no resources will be acquired at all. Similarly, the release of resources is to be done atomically, and implicitly when a stream is closed. The atomic acquiring and releasing of resources will also be helpful in supporting recovery of a stream upon failure of a device on the delivery path (e.g., controller, disk, network interface).

Real-time data import. Intelligent decisions are required on the part of the resource manager to place the media objects so that resources can be utilized efficiently. For example, hot (high-demand) videos should be placed on devices that have a higher BW. Also, hot and cold (low-demand) videos need to be mixed (these issues are detailed later in the section on implementation) so that both space and BW utilizations are matched. Before a media object (also referred to as an asset) is imported, appropriate space needs to be allocated, and appropriate BW needs to be reserved on other resources on the import path. The operations should be coordinated by the resource manager so that import applications do not have to keep track of various steps.

The placement decisions for new media objects would depend on the current placement of other existing media objects and their expected access patterns. Therefore, the resource manager should also keep track of the placed assets and their expected total access rate.

Other design objectives. In addition to the above functional design objectives, the following set of objectives were also taken into account:

  • Continuous media and traditional file services: The ability to serve traditional services such as file services, transaction processing, etc., without affecting services for continuous media makes the OS/390 LAN Server environment a general-purpose platform. This can be achieved (1) by providing a higher priority to continuous media tasks so that their services are not affected [13] and (2) by controlling admission of continuous media tasks or by setting aside a fraction of the resource capacity through the resource manager.
  • Scalability: To support a large number of clients and high rate of access requests, multiple resource manager threads need to access the same set of data structures (e.g., resource and stream tables). Therefore, the data structures need to be thread-safe, and locks and latches need to be used. Also, to avoid contention for access to data structures, the locks should be of fine granularity, and lock mode should be upgraded to exclusive only when it is needed. The locks need to be acquired in specific resource order to avoid deadlock. The details of these issues are discussed in the next section.
  • Transactional support for algorithms: The various operational functions and algorithms access multiple data structures. To maintain consistency, the operations performed on behalf of a single client request need to be atomic. For example, multiple resources are acquired upon a single playback request. If some required resources are not available, or some other software failure occurs during this process, a cleanup action is started to release the resources acquired so far. This feature is very important for making the system robust and reliable. Certain algorithms (e.g., data placement) require a broad view of the available resources to determine what resources are to be acquired. A protocol similar to two-phase commit is used to follow resource acquisition order for deadlock avoidance, where in the first phase shared locks are used to determine the resources to be acquired, and then in the second phase exclusive locks are acquired to modify the resource entries.
  • Portability: Significant attention was paid to this important software design principle by keeping the interfaces clean and using macros for system-dependent services (e.g., storage allocation, lock services, thread support, etc.). During the development process the code was mostly debugged on an AIX* (Advanced Interactive Executive*) platform and seamlessly integrated with the OS/390 LAN Server environment. After the code was "ported" to an RS/6000* SP* environment, it became the base code for the resource manager in the AIX Video Server. [14]
  • Platform for newer performance enhancement algorithms: In the section on suggested future enhancements, we survey various performance enhancement algorithms (e.g., batching, caching) for a large-scale server that benefits from economies of scale. The requirements for implementing such algorithms in the resource manager were taken into account.

Implementation

The resource manager functions can be grouped into three categories (as described earlier): system management functions, real-time data import functions, and operational functions. These functions are invoked by various other components of the OS/390 LAN Server as described in more detail below (see Figure 2). The communication layer (CLAW) receives various commands (e.g., OPEN, READ, and WRITE file requests, administration commands) from clients and sends either appropriate responses or data, or both, to the client. The administration protocol conversion layer (Admin PCL) in turn sends the administration commands to the resource manager. The resource manager may call the file system (the Common File Repository, or CFR), for example, to calibrate the throughput of a striping group. Some operational commands (e.g., opening a media file) require reservation of resources, and hence, the appropriate file system protocol conversion layer (the server message block, or SMB PCL in this example) sends appropriate operational commands to the resource manager.

Figure 2

All the resource manager functions are nonblocking and are executed in the thread of the caller. Fine-grained locking allows resource manager functions to have concurrent and atomic access to the resource manager data structures. If execution of any of the functions is unsuccessful, cleanup is performed automatically. Figure 3) shows the components of the resource manager. The resource manager maintains three types of tables:

Figure 3

  1. Planned load tables that are used for placement of media files on the storage devices. The placement policy attempts to match the BW and space of a device to the expected load and allocated space of the files placed on this device.
  2. Current load tables that maintain the available capacities of various resources and the streams accessing these resources
  3. Stream table that maintains the list of current streams and the set of resources reserved for these streams

In the following subsections, the resource manager data structures are described first, followed by a description of the resource manager functions.

Resource manager data structures. There are two main resource manager data structures--the resource tables and the stream table (see Figure 4). The resource tables contain resource entries that keep track of the state of the resources in the system, whereas the stream table entries store information about the various streams. Both locks and latches are used for ensuring concurrent access. Locks are used to obtain access to logical entities (such as tables or individual resource and stream entries). Latches are temporary locks that are held while traversing implementation-dependent data structures (e.g., hash row chains) and released once the desired entry is found. Deadlock is avoided since locks on various resources are obtained in a prespecified order and latches are released before obtaining any further locks. For the purpose of keeping track of the streams accessing a device, and the devices accessed by a stream, the various entries in the tables are cross-linked. These links have to be maintained in a consistent state while atomically updating the tables. In the following subsections, we first describe the resource tables, then the stream table and the cross-linking between the various tables.

Figure 4

Resource table. The resource manager maintains one resource table for each type of resource (e.g., striping group, [15] communication adapter). All the resource tables have the same structure, making it easy to extend the resource manager for handling new types of resources. Each resource table is a hash table of resource entries, indexed on the identifier of the resource. The resource ID is an arbitrary character string. Figure 4 also shows the main components of a resource entry:

  • devId--the identifier of the resource. This component is used to index the resource table.
  • devType--the hardware device type. As described below, this component can be used during initialization for determining the maximum BW of the device.
  • maxBW--the maximum BW that can be allocated on this device. As described below, this component may be less than the actual maximum BW of the device.
  • freeBW--the unallocated BW on the device
  • next--the next resource entry on the hash chain
  • strList--pointer to a list of stream entries that hold reservations on this device
Stream table. The stream table is a hash table of stream entries, indexed by the stream identifier. As shown in Figure 4, the main component of the stream entry, apart from the stream identifier strId and the hash chain pointer next, is a subtable called the strDevEntry table. This subtable contains a list of the devices on which the stream has a reservation. Each row of the subtable contains the resource ID of the resource that this row is for, the amount of reserved BW on this resource, and a pointer nextStr used to maintain the list of all streams that have reservations on this resource.

Cross-linking. Figure 4 illustrates the cross-linking between resource entries and stream entries that store the information about the resources held by various streams and the streams that are accessing various resources. By scanning the strDevEntry table, it is possible to find the resources on which a stream has reservations. For example, stream str1 has reservations on resources res1 and res2. The list of all stream entries that have reservations on a resource can be found by starting from the strList pointer in the resource entry and following the corresponding nextStr entries in the strDevEntry tables. In the example, the first stream that has a reservation on res1 is found by following the strList pointer for res1 and is str1. The next stream can be found by locating the strDevEntry row for res1 in str1 and following the nextStr pointer to str2. Since str1 has reservations on both res1 and res2, the stream entry for str1 will occur on both the stream entry lists for res1 as well as res2. The stream entry lists are NULL-terminated by convention. Note that while str1 and str2 have one resource (res1) in common, they need not have other resources in common and, hence, may be on different stream entry lists.

System management functions. The resource manager system management functions consist of configuration functions to dynamically add and delete various resources and define their BW (QOSM_AddDevice(), QOSM_DeleteDevice). The BW also can be changed explicitly by (QOSM_SetBW()). The state of the resources can be queried by various query functions (e.g., QOSM_QueryDevice()). Additionally, during operation of the system, various alerts (high-priority events that warrant immediate attention) may be generated to signal the occurrence of exceptional events (e.g., denial of service for a request, low disk space, etc.). These alerts are sent to an SNMP agent for monitoring by the administrator.

The video server configuration is stored in a configuration file that is read in by the OS/390 LAN Server initialization routines. These routines then invoke the resource manager QOSM_AddDevice() function to define to the resource manager the resources (e.g., striping groups, network adapters) in the system. In case of failure to add a resource (e.g., label on a striping group proves to be unreadable) the QOSM_DeleteDevice() function may be invoked to delete the device. The system is ready for operation only after all the resources have been defined. Since the resources available may change dynamically (e.g., because of device failure or bringing a new device on line), the configuration functions can also be invoked during the operation of the system to add or delete resources or change their BW (using QOSM_SetBW()).

An important part of defining resources to the resource manager is defining the BW of each resource. Defining the BW can be done by a number of methods as follows and is specified by one of the parameters to QOSM_AddDevice(). The BW of a resource may be estimated by measurement, a process referred to as calibration. Calibration allows estimation of the BW of devices of unknown type, as well as accounting for the variation in BW between different units of a device. In the current version of the OS/390 LAN Server, calibration can be carried out only for striping groups. [16] The calibration process consists of starting multiple threads that read at random from the striping group and estimate the maximum read BW achievable. The number of threads to be started depends upon the number of disks. [11] Thus, the supported configurations for calibration are those where the bottleneck that limits the maximum achievable BW is in the striping group and not on another resource (e.g., the processor running the threads). To support more complex configurations in the future, it is possible to implement a more detailed measurement process that determines the bottleneck system components and measures the BW of each component. [9]

If calibration is not feasible, the administrator may explicitly specify the BW or direct the resource manager to estimate it based on the device type. [17] Explicit specification of the BW allows the administrator to reserve a portion of the resource BW for use by nonmultimedia applications, since the resource manager will not use more than the specified BW. Once the BW of a device is estimated by any of the methods, it is stored in permanent storage so as to allow quick restart of the system.

Operational commands. The resource manager operational functions reserve and release the resources needed for playback of a stream (RM_Select(), QOSM_Release()). Both foreground and background mode reservations are supported. In addition to reserving the required resources, RM_Select() also performs load balancing where multiple replicas of a video are available. Since the operational functions are the mainline paths through the resource manager, special attention has been given to their concurrency and scalability.

The resource manager reserves and releases resources for a stream when the stream starts or stops. These resources correspond to OPEN or CLOSE operations on a multimedia file. The operational functions of the resource manager are invoked during the OPEN or CLOSE operations on a file by the other components of the OS/390 LAN Server. [5] The OPEN will fail if the BW reservation is unsuccessful. It is not necessary for applications to understand the system configuration and explicitly reserve BW on the components needed to play back a stream. The path-based reservation algorithm in the resource manager atomically reserves all the needed resources; i.e., either all the required resources are reserved on behalf of the stream or no resources are reserved (if BW is not available).

The operation of the RM_Select function is as follows. The function finds all replicas of the requested video using the metafile. Subsequently, it considers in turn all the paths from the replica to the communications adapters that are reachable from the client. The function selects the path with the largest free BW, where the free BW of a path is determined by the BW on its bottleneck resource. The selection process actually occurs in two phases. In the first phase, only those replicas that have a "normal" status are considered. If the attempt to play from such a replica fails, the selection process enters an emergency scan phase where replicas not normally considered (e.g., replicas marked for deletion but not yet deleted) are considered as well.

Since the operational functions are by far the most frequently executed resource manager functions, it is important that they be scalable and highly concurrent. This is achieved by locking only the minimum number of entries needed to ensure consistency and atomicity. The details are as follows. While finding the best path, the RM_Select function has to keep track of two paths: the best path found so far, and the path currently under consideration. These paths have to be locked, since their BW may be updated. To avoid deadlock, the function considers (and locks) the paths in sorted order. Because only two paths at most are locked at any given time, a high degree of concurrency and scalability is achieved.

The resource manager also supports the playback of real-time and non-real-time data from the same striping groups through the support of foreground and background I/O operations. A priority mechanism in the file system ensures that background I/O activity is performed only when there are no pending foreground I/Os. Background I/Os are used for non-real-time data and do not have any associated BW reservation. Because of the priority mechanism, they obtain any BW that is not reserved for foreground I/Os. As described earlier, to avoid starvation, BW can be set aside for background I/Os by setting the maximum BW of a resource to be less than its actual maximum BW.

Asset import. Importing assets into the video server environment consists of making placement decisions so as to efficiently utilize both the space and BW of the storage system in allocating necessary space and reserving appropriate BW for writing into the storage devices. These decisions cannot be made at the file system level since the file system does not have the necessary information regarding the resources and their capacities. Hence, it is important that placement be carried out in conjunction with a resource manager placement policy. In the OS/390 LAN Server resource manager, the placement policy is implemented by the RM_SetViewers function.

Since demand for a file changes with time and the storage device capacities and BWs may also change (because of changes in the configuration), placement decisions are also dynamic. Hence, it is necessary that the placement policy be an on-line policy, and that it can be invoked simply. In OS/390 LAN Server, there is a VIEWERS command that invokes the RM_SetViewers() to make any necessary changes in placement. For example, an on-line agent that monitors demand for various videos can use this command to dynamically change the placement to adapt to varying demand. In the following subsections, we first discuss the considerations behind the design of the placement policy, then give a description of the placement policy and describe some experimental results.

Placement policy considerations. A general multimedia environment may contain both large video objects (e.g., long movies) and short video clips (e.g., objects in interactive applications such as shopping, medical, etc.). The viewing frequency of these objects also varies, i.e., some movies as well as short clips will be viewed more frequently than others. The disk striping groups used for storing these videos are limited both by the number of concurrent streams they can support (i.e., BW) and by the number of video objects they can store (i.e., space). Without careful placement of video objects, maximum utilization of both space and BW of these striping groups will not be possible as illustrated in part B of Figure 5. It shows an undesirable placement of four video objects on two striping groups. The BW and space capacities of the groups are represented by rectangles with space on one axis and BW on the other axis. The allocated space and BW for each video object are shown by a solid rectangle, and each dotted rectangle shows the currently allocated total space and expected load on the placed files on this device. It can be seen that large, infrequently viewed objects are placed on the first group. The BW of that device may be underutilized during operation of the server. Hence (as shown in part A of Figure 5, large and short, frequently and infrequently viewed objects have to be mixed properly on each striping group to utilize both the space and BW of the groups.

Figure 5

A simplistic solution to the load imbalance problem is to stripe video files over all the storage devices, i.e., to include all storage devices in the system in a single striping group. However, as discussed below, there are many considerations that make support for multiple striping groups in a video server desirable. First, from practical considerations, the storage devices in the system may be of different types. If simple round-robin striping schemes are used, either the BW or space of some devices will be wasted. More complex schemes lead to much additional complexity in the file system, which is a mainline path in the video server. Hence, it is natural to put each device type in a separate striping group. The presence of multiple striping groups also limits the impact of the failure of a single disk. It also makes adding or removing disks simpler, since the entire video server is not affected. It is also shown in Reference 18 that with multiple striping groups and by replicating a small number of frequently viewed videos, the availability of the system (the number of lost viewers in case of disk failure) can be reduced.

Overview of the BSR policy. The placement policy used in the OS/390 LAN Server is referred to as the bandwidth-to-space ratio (BSR) policy. [19] The BSR policy takes three inputs: the identifier and new expected load of the video object to be placed and the required rate for creating the replica. The creation rate will depend on how quickly the video has to be placed (e.g., staging rate for retrieval from tape; data rate for import from real-time sources such as video cameras). The BSR policy makes two important decisions when placing a video object: estimating the number of replicas that are needed and deciding where to place them. The algorithm consists of four phases as stated below. Details of the algorithm can be found in Reference 19. In the first phase, the policy determines whether additional replicas are needed by examining the additional expected load that can be supported by the existing replicas and considering the resulting mismatch in the BSRs of the devices. If more replicas are required, the second phase of the BSR policy selects additional devices on which to place new replicas. With the devices selected, creating a new replica of the video will cause the BSR of the videos on the device to more closely match the BSR of the device. In the third phase, the BSR policy allocates the expected load to all the selected devices so as to match the BSRs of the devices. A replica consolidation phase is entered only if the policy is unable to place all of the expected load. In the consolidation phase, some of the existing replicas of the other videos are deleted in order to place the new load. [20] The above steps are followed when the expected load on a video object both increases and decreases. If the expected load decreases, the second phase (i.e., selecting additional devices) is skipped, but the rest of the sequence is executed.

Selection of additional devices has to be made while keeping in mind the available current BW on these devices in order to satisfy the quickness criteria. A parameter to the BSR policy, called the availability parameter, specifies the maximum number of viewers that may be allocated to any replica of any video. This parameter specifies an upper bound on the number of viewers that are lost on failure of a storage device as it may be possible to move some of the viewers to another replica.

The BSR policy in the OS/390 LAN Server resource manager is implemented by the RM_SetViewers function. In addition to the VIEWERS command, the INPUT command, which copies video files, may also invoke the RM_SetViewers function. The REMOVE command, which deletes a video file, also invokes the RM_SetViewers function with the expected load for the file set to zero for the purpose of deleting the file. The RM_SetViewers function returns an action list, which is a list of devices on which replicas are to be created or deleted. The actual creation and deletion of replicas is carried out by other components of OS/390 LAN Server.

Experimental study of BSR policy. The effectiveness of the BSR policy was studied by simulation in Reference 19. In the following, two of the experiments are described. The first experiment studies the effectiveness of the BSR policy in placing videos in a system that is initially empty. Since the BSR policy is intended to be an on-line policy that can change the placement of the videos in response to changing load, the second experiment reports on the ability of the BSR policy to adapt to changing load. The effectiveness of the BSR policy was evaluated by considering the utilization of space and expected BW after placement. A high utilization would be indicative of good performance of the policy.

The simulated system for studying the BSR policy was assumed to have four striping groups with space capacity of 16 GB (gigabytes), 32 GB, 64 GB, and 128 GB. The striping groups had a BW capable of supporting 50, 100, 200, and 400 streams, respectively. Though the striping groups have differing storage capacities and BWs, their BSR is the same. Experiments using striping groups with different BSRs are reported in Reference 19. It was assumed that there were 64 videos, each requiring 3.6 GB, and the total expected load was assumed to be 750 streams. Since there was space for only a single replica of each video, and the total expected load was equal to the total BW of the system, this configuration resulted in a stress test for the BSR policy. The video access frequencies were assumed to be specified by an empirically derived Zipf distribution [21] as shown by the solid line in Figure 6. Figure 6 also shows the distribution used for simulating a shift in load. If the shift in load is simulated by rotating the Zipf distribution, there would be a large change in access frequency at the ends. To avoid such a change, the Zipf distribution was folded into a symmetric distribution (marked as the folded Zipf distribution), and the symmetric distribution was rotated. To simulate small and large changes in the expected load, the distribution was rotated by 1 and by 10 videos, respectively.

Figure 6

Figure 7 shows the results of using the BSR to place the videos into an initially empty system. Four groups of bars are shown, one for each striping group. The first bar in each set shows the actual capacity of the disk, while the second bar shows the expected load after placement by the BSR policy. It can be seen that for all the disks, the BW utilization is close to 100 percent, indicating the effectiveness of the placement by the BSR policy.

Figure 7

Figure 8 illustrates the ability of the BSR policy to adapt to changes in load. As before, there is one group of bars for each striping group, and the first two bars are the BW of the striping group and the expected load after the initial placement. The third and fourth bars show the expected load after a load shift when the BSR policy was used to change the placement of the videos. In these experiments, the striping groups were assumed to already contain the videos resulting from the initial placement. It can be seen that in both cases the BW utilization is still very high.

Figure 8

Research issues

Multimedia is a relatively new field. Various emerging applications when fully deployed will require very-large-scale servers. Such large-scale servers will benefit from various performance optimization algorithms (described below, e.g., grouping of requests, multimedia caching) that exploit economies of scale. These optimizations can result in an order-of-magnitude increase in cost-effectiveness. For example, in a large-scale server, multiple concurrent requests for the same video can be served by a single data stream either through synchronizing playback of these requests (referred to as batching) or by using small running buffers to cache the intervals between successive streams.

Large-scale servers will be built by linking together multiple systems. This method will require coordination and efficient resource management across these systems. Generally speaking, partitioned systems fragment overall system resources (BW and space) and, hence, result in poor utilization. [22] In a shared system, resources belonging to various systems need to be managed either in a centralized manner or in a coordinated distributed manner, similar to management of accesses to data in a multisystem database system. The interactions across the resource manager components need to be managed carefully, and the number of messages exchanged across systems for resource reservation need to be minimized.

Performance enhancement exploiting economies of scale. Certain activities affect performance, and giving proper consideration to these activities can enhance performance.

Batching and VCR control. In a large-scale server, multiple requests for the popular videos will arrive within a short time interval. The requests for the long-running videos can be batched together by delaying playback for some of the requests for a short time. The batched requests can be served by a single I/O stream by multicasting the data to the batched clients. Multicasting will result in substantial savings in required server capacity. [21,23] The startup delay can be masked by playing commercials or other short clips. The choices on how long to delay various clients and which video to play are made by the channel scheduling policy. The longer startup delay may cause the client to renege. The client behavior can also be influenced by the scheduling policy; for example, clients may wait longer if the startup delay is bounded. [21] Therefore, a scheduling policy should take into account client behavior. However, complex policies that depend upon accurate knowledge of client behavior are not useful in practice. [24] In Reference 21, we have proposed the FCFS-n policy that multicasts the n most popular videos in regular intervals and serves the remaining requests using FCFS (first-come-first-served) with batching. The policy satisfies all the above criteria and is shown to perform well under a wide variety of client behaviors. In Figure 9 the savings in required server capacity are shown to exceed 70 percent for a VOD (video-on-demand) workload in a moderate size server. See Reference 21 for the details of the workload.

Figure 9

While viewing a video, a client may stop at any point and resume viewing after a pause of random duration. By dedicating server resources to each stream even during a pause, the server can guarantee that the client can resume without delay. This situation will result in significant wastage of server resources particularly if, on average, the client pauses are long. A more efficient alternative is to release server resources upon a pause and reacquire them upon a resume. To assure client satisfaction, the server now needs to ensure that the delay in resuming playback is small. [21,23] A small delay can be achieved by setting aside a small fraction of server capacity (referred to as the contingency capacity) for these higher-priority requests. [21,23,25] In Reference 23, we have shown that substantial savings are possible even with frequent VCR (videocassette recorder) activities.

Caching. The need for a continuous delivery guarantee and the larger size of some video objects (e.g., movies) makes traditional caching policies such as LRU (least recently used) unsuitable. The sequential access pattern on these objects can be exploited by caching only the intervals between two successive streams on the same object, i.e., by retaining the pages brought in by a stream for reuse by a closely following stream and subsequently discarding them. In contrast, an interactive workload is composed of many short video clips (e.g., shopping) on which concurrent access will be rare. Hence, caching only the intervals will not be effective. A generalized interval caching policy that caches both short video objects as well as intervals or fractions of large video objects is shown to be effective in References 26 and 27. The cost-competitiveness is analyzed in Reference 28. The details of the interactions between the caching policies and the session scheduling are discussed in References 23 and 27.

Multisystem environment. In a multimedia environment with a large number of clients, large-scale servers will be built as clusters of smaller servers. Under random routing, many of the performance optimizations above may not be effective, since requests for the same video may be scattered over multiple servers. Hence, careful routing of requests to maximize performance is required. [8,22] Additionally, each individual server may have its own resource manager instance for reasons of scalability and performance. Coordination of the resource managers in a distributed environment also poses a challenge.

Affinity routing. In a distributed server environment, the resource manager selects a data server to serve each incoming request and routes the request to the selected server. [8,22,29-31] It also reserves enough resources on the data path from the data server to the client for playback of the stream. The application server is responsible for translating application-specific client requests into data server requests. [14,32] Hence, after establishment of an application session and data path, subsequent queries and access requests for video clips will go directly to the selected application server.

Under random routing, the performance optimizations above may not be effective because of resource fragmentation. [8,22] For example, if successive requests for the same video are routed to different servers and each server has its own cache, intervals between successive requests cannot be formed because of fragmentation of the cache. Affinity routing of requests for the same video will result in reusing the content of a cache and ensuring a good cache hit ratio, as well as for efficient batching of requests for the same videos. [8]

Quota system. Clustered servers are made up of groups of individual servers that are coupled together using fast communication links. They may also share server resources such as disks or communication links to the clients. A centralized resource manager is not suitable for such environments since it is not scalable. From the performance point of view, it may be desirable to have an individual resource manager on each server. For implementing resource reservation, each resource manager has to know the capacity reserved by the others on shared resources. However, it is not feasible for each resource manager to maintain the global state of the resources since it requires a large amount of message exchange. A protocol is therefore required whereby the individual resource managers can cooperate to share the server resources.

A method, called the quota system, is proposed in Reference 33 for resource management in a cluster environment. At initialization, the resource managers partition the free capacity of the shared resources among themselves. When a resource manager receives a request for a new stream, it attempts to satisfy the request using the already allocated capacity for shared resources. If successful, it reserves the required capacity. Otherwise, it requests free resources from the other resource managers. When a stream ends, the released capacity on shared resources is retained by the local resource manager. The above method can be implemented by designating an owner resource manager for each shared resource. The owner partitions the capacity of the shared resource among the resource managers in the cluster and is responsible for responding to requests for additional capacity on the resource.

Conclusions

In multimedia servers, the requirement for continuous and "hiccup-free" delivery poses challenges in addition to the traditional server requirements. Continuous delivery can be guaranteed only by reserving the resources needed to deliver a stream. In this paper, we describe the design and implementation of the resource manager for OS/390 LAN Server, the component that implements this resource reservation. The OS/390 LAN Server evolved from an earlier version that was a traditional high-performance file server for workstation clients. [5] The addition of the resource manager component enabled this file server to support delivery of continuous media and to become a full-fledged multimedia server.

The resource manager for OS/390 LAN Server provides a rich set of functions necessary for administration and operation of a multimedia server. System management functions are provided for defining and dynamically changing the components of the server, as well as measuring and specifying the BW of individual components. Query functions and monitoring via SNMP agents are also supported. The operational functions reserve and release all resources needed to play a video without the need for explicit specification (and hence, knowledge) of the needed resources by the application. Reservation and release are atomic; i.e., either all or none of the resources are reserved or released. The real-time data import functions place data so as to efficiently utilize the storage devices. The import functions are on line, allowing the placement to be dynamically changed, and also support importing data from real-time sources such as digital cameras.

The implementation of the resource manager has been carried out keeping in mind its scalability to large-scale servers. This has been achieved by fine-grain locking that allows concurrent access to data structures. After the code had been ported to an RS/6000 SP environment, it became the base code for resource management in the AIX video server. [14] Emerging applications of the future will require large-scale servers. In the paper, we surveyed various optimizations (e.g., batching, caching) that will allow resource management for such servers to exploit economies of scale. We show that large improvements in cost-performance are possible through such optimizations. Such large-scale servers will consist of clusters of individual servers. Distributed resource management of such clusters poses many challenges. Affinity routing of requests in such an environment can lead to better use of server resources by increasing the effectiveness of some of the optimizations (such as caching). For managing such clusters, a quota system that partitions the resources between the individual node resource managers together with a protocol for renegotiation can be useful.

Acknowledgments

We would like to thank the OS/390 LAN Server development team, especially Pam Conti, Peggy Dixon, Steve Ferrante, John Harter, Tim Krein, Mike Morton, Bill Phillips, Vicki Pritko, and Fred Schwartz of Endicott, Don Jones of Raleigh, and Ray Mansell and Bob Berbec of Hawthorne. We would also like to thank Bill Tetzlaff, Martin Kienzle, Brent Hailpern, and Ambuj Goyal for their support of this work.

*Trademark or registered trademark of International Business Machines Corporation.

Cited references and notes

Accepted for publication January 21, 1997.