Dynamic resource management

As we move toward exascale computing, the prevailing approach in High Performance Computing of moving data back and forth between storage and processor becomes unsustainable. The explosion in application data size and the energy required for these data transfers call for a different design. To address this issue, IBM Research is proposing a new data-centric approach: an architecture that brings computing capabilities closer to where data resides in the system. The goal of this approach is to allow the convergence of high performance computing and high performance analytics on a single architecture.

Resource management classically covers the ability to schedule jobs, orchestrate them, and account for resource utilisation in the system. Efficient resource management is essential for making use of such data-centric systems, and newer approaches for decreasing data movement and reducing energy consumption must be investigated.

Traditional scheduling policies consist of selecting a set of resources for each job to run on a platform, according to a specific objective. Once these resources become available, the application starts. Once it has finished, it releases its resources so that other jobs can start, and so on. Dynamic resource management adds the capability to change the number of resources allocated to a job while it is already running, and to manage an application's data while it is not running.
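The difference between static and dynamic allocation can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the `Cluster` class, its methods, and the job names are hypothetical and do not correspond to any real scheduler interface. Nodes are treated as an interchangeable count, and placement, data locality, and priorities are ignored.

```python
class Cluster:
    """Toy cluster that tracks how many nodes each running job holds (hypothetical model)."""

    def __init__(self, total_nodes):
        self.total_nodes = total_nodes
        self.allocated = {}  # job name -> number of nodes currently held

    @property
    def free_nodes(self):
        return self.total_nodes - sum(self.allocated.values())

    def start(self, job, nodes):
        """Static step: the job starts only if enough nodes are free."""
        if nodes > self.free_nodes:
            return False
        self.allocated[job] = nodes
        return True

    def resize(self, job, nodes):
        """Dynamic step: grow or shrink a job that is already running."""
        if nodes < 1:
            raise ValueError("a running job needs at least one node")
        delta = nodes - self.allocated[job]
        if delta > self.free_nodes:
            return False  # not enough free nodes to grow
        self.allocated[job] = nodes
        return True

    def finish(self, job):
        """The job releases all of its nodes."""
        del self.allocated[job]


cluster = Cluster(total_nodes=8)
cluster.start("A", 8)   # job A takes the whole machine
cluster.start("B", 4)   # refused: with static allocation, B must wait for A to finish
cluster.resize("A", 4)  # dynamic allocation: A shrinks while running
cluster.start("B", 4)   # B can now start alongside A
```

With static allocation only, job B would wait until job A finishes. With `resize`, the scheduler can trade nodes between running jobs as demand changes.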

Most applications targeting large-scale high performance systems, such as CORAL and beyond, will make use of checkpoints. Checkpoints are not only a way to cope with failures that occur during the computation; they are also useful for visualization and analytics.

Our first focus addresses the management of such checkpoint data in the system. The objective is to transparently and asynchronously transfer this data to remote storage, while using local burst buffers for faster access. It is essential in that context to limit the impact of these transfers on application performance, while satisfying the persistence requirements of such data.
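The general idea of asynchronous checkpoint draining can be sketched as follows. This is a minimal illustration, not the system described here: the `CheckpointDrainer` class is hypothetical, two ordinary directories stand in for the burst buffer and the remote storage, and a single background thread performs the copies. The application only waits for the fast local write; persistence on the slower tier is reached later, and `flush` lets the caller wait for it explicitly.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class CheckpointDrainer:
    """Write checkpoints to a fast local tier and copy them to a slower tier in the background."""

    def __init__(self, burst_dir, remote_dir):
        self.burst_dir = Path(burst_dir)    # stand-in for a local burst buffer
        self.remote_dir = Path(remote_dir)  # stand-in for remote persistent storage
        self.failed = []                    # (path, exception) for copies that did not succeed
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, name, payload):
        """Called by the application; returns once the local write is done."""
        local = self.burst_dir / name
        local.write_bytes(payload)
        self._pending.put(local)
        return local

    def _drain(self):
        while True:
            local = self._pending.get()
            try:
                if local is None:  # shutdown signal
                    return
                partial = self.remote_dir / (local.name + ".part")
                shutil.copyfile(local, partial)
                # Rename only after the copy completes, so readers never see a partial file.
                partial.replace(self.remote_dir / local.name)
            except OSError as exc:
                self.failed.append((local, exc))
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued checkpoint has been processed."""
        self._pending.join()

    def close(self):
        self.flush()
        self._pending.put(None)
        self._worker.join()


with tempfile.TemporaryDirectory() as burst, tempfile.TemporaryDirectory() as remote:
    drainer = CheckpointDrainer(burst, remote)
    drainer.write("step_0001.ckpt", b"simulation state")  # returns after the local write
    drainer.close()                                       # waits for the background copy
```

A real implementation would also need to throttle the background transfers so that they do not compete with the application for network and storage bandwidth, which is the performance concern raised above.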

The second objective of our research is to make use of this checkpoint capability to enable a more dynamic resource management. We collaborate with IBM Platform LSF, looking for novel ways to allocate and manage resources by extending the job pre-emption capabilities.
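To make the link between checkpoints and pre-emption concrete, here is a small sketch of a job that can be stopped and resumed from its last checkpoint. It is only an illustration and does not use or represent any IBM Platform LSF interface: the `run_job` function, the JSON checkpoint format, and the `budget` parameter (the number of steps the job may run before it must give up its resources) are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def run_job(checkpoint_path, total_steps, budget):
    """Run at most `budget` steps, resuming from the checkpoint if one exists.

    Returns True when the job has completed, False if it was pre-empted first.
    """
    path = Path(checkpoint_path)
    if path.exists():
        state = json.loads(path.read_text())  # resume from the saved state
    else:
        state = {"step": 0, "acc": 0}         # fresh start

    for _ in range(budget):
        if state["step"] >= total_steps:
            break
        state["acc"] += state["step"]         # stand-in for real computation
        state["step"] += 1

    # Checkpoint before releasing resources, so no completed work is lost.
    path.write_text(json.dumps(state))
    return state["step"] >= total_steps


with tempfile.TemporaryDirectory() as workdir:
    ckpt = Path(workdir) / "job.json"
    while not run_job(ckpt, total_steps=10, budget=4):
        pass  # between calls, the resources could be given to another job
```

Because the state survives each interruption, the scheduler is free to reclaim the resources between calls, and the job could be restarted later on a different set of nodes.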

 

Data-centric systems

In order to prepare for the future, IBM has adopted Data-Centric Systems (DCS) design as our new paradigm for computing. The rationale is simple. As the size of data grows, the cost of moving data around becomes prohibitive. We must employ distributed intelligence to bring computation to the data.

Tilak Agerwala, IBM Research

Data Centric

U.S. Department of Energy selects IBM "Data-Centric" Systems to advance research and tackle Big Data challenges