
IBM R&D Labs in Israel News

Supercomputers: Who’s the boss?

IBM Research middleware for managing extreme-scale computers recognized by US Department of Energy

  Eliezer Dekel
Why aren’t more supercomputers being used in industry? How can we harness the tremendous power of extreme-scale computers to provide smarter solutions in areas like financial transactions, the Internet, or manufacturing? According to Eliezer Dekel, manager of distributed middleware at IBM Research – Haifa, supercomputers still need management programs, like the ones being developed as part of the award-winning Colony II project. Colony was recently awarded supercomputer time by the US Department of Energy to help meet this challenge head-on.

Today, supercomputers are still managed in a relatively unsophisticated manner. You can turn them off, turn them on, and load a machine, but each action must be performed separately, either manually or with very basic scripts.

“Our team is developing a coordinated framework that enables the management and monitoring of hundreds of thousands of machines running in parallel,” explains Dekel. “In this way, we can also deploy one program for a cluster of machines, deploy another program for another subset, and then convey the status of these machines to the user, even reporting what stage of the processes they’re running.” In short, the new management middleware will provide high-performance computers with improved scalability, optimization, fault tolerance, and support for various management policies.


Super management for supercomputers

Researchers at the Haifa Lab are developing a management infrastructure for extreme-scale computers using messaging and overlay technologies such as Low Latency Messaging and CloudFab. They’re addressing a completely new market segment that involves the management of very large setups for BlueGene or Cray computers.

“Because the price per processor of BlueGene is much lower than any other large computing system, it makes sense to harness that power for business purposes like financial systems, the Internet, and more,” continues Dekel.

Management technology for extreme-scale computers has the potential to introduce supercomputing capabilities to the business market, opening up new opportunities for both business environments and the computer systems.


INCITE award

   BlueGene supercomputer
Colony II is one of 69 projects selected by INCITE for their potential to advance scientific discovery and awarded time at the DOE’s Leadership Computing Facilities at Argonne National Lab in Illinois and Oak Ridge National Lab in Tennessee. The 30-month project began a few months ago and just saw delivery of the initial design. The award from the DOE gives the researchers supercomputing time at the National Labs to carry out tests of their design.

Projects receiving INCITE awards use complex simulations to accelerate discoveries in ground-breaking technologies such as lithium-air batteries and nano solar cells. The awards also include projects designed to close the nuclear fuel cycle, develop advanced propulsion systems, improve DNA sequencing, and explore phenomena on the tiny scale of nanostructured superconductors.


What lies ahead

The Colony II Haifa team, led by Yoav Tock, is now debating the pros and cons of keeping the technology they are developing proprietary or making it open source. They are currently using a special version of Linux developed in Hursley and are exploring whether to add their new algorithms to Linux or keep them inside the IBM middleware.

“By developing new management technologies for supercomputers, we’re looking forward to bringing the future of supercomputers a little bit closer,” noted Dekel. “As they become smarter and more self-sufficient, they also become more accessible and practical to many markets and industries.”


Where do you see the future of supercomputing?

Related Links

DOE Awards Over a Billion Supercomputing Hours to Address Scientific Challenges

Where it all began – Colony I

Colony II is a continuation of what started as the HPC-Colony project, a joint research effort with Oak Ridge National Laboratory (PI Terry Jones), the IBM T.J. Watson Research Center (PI Jose Moreira) and the University of Illinois at Urbana-Champaign (PI Laxmikant Kale).

The idea was to create scalable services and interfaces that permit both scalable high performance and easy application porting for high-performance computing (HPC) systems with extremely large numbers of processors.