We present a technique that manages the peak power consumption of a high-density server by implementing a feedback controller that uses precise, system-level power measurement to periodically select the highest performance state that keeps the system within a fixed power constraint. A control-theoretic methodology is applied to systematically design this control loop with analytic assurances of system stability and controller performance, despite unpredictable workloads and running environments.
This technique is particularly valuable when applied to servers with multiple power supplies, where a partial failure of the power supply subsystem can result in a loss of performance in order to meet a lower power constraint. Conventional servers use simple open-loop policies to set a safe performance level in order to limit peak power consumption. We show that closed-loop control can provide a more graceful degradation of service under these conditions and test this technique on an IBM BladeCenter HS20 server. Experimental results demonstrate that closed-loop control provides superior application performance compared to open-loop policies.
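The closed-loop idea above can be illustrated with a minimal sketch. This is not the controller designed in the paper: the integral-style update, the gain, the frequency levels, and the linear `toy_power_w` plant model are all assumptions chosen for illustration. Each control period, the controller compares measured power with the budget, adjusts a continuous frequency target, and then picks the highest discrete level at or below that target.

```python
class PowerCapController:
    """Illustrative integral-style power-capping loop (hypothetical design)."""

    def __init__(self, levels_mhz, budget_w, gain_mhz_per_w):
        self.levels = sorted(levels_mhz)
        self.budget_w = budget_w
        self.gain = gain_mhz_per_w
        # Start from the fastest performance state.
        self.target_mhz = float(self.levels[-1])

    def step(self, measured_w):
        """Run one control period and return the next frequency level."""
        # Positive error means headroom; negative means over budget.
        error_w = self.budget_w - measured_w
        self.target_mhz += self.gain * error_w
        # Clamp the continuous target to the range of real levels.
        self.target_mhz = min(max(self.target_mhz, self.levels[0]),
                              self.levels[-1])
        # Quantize downward: highest level that does not exceed the target.
        allowed = [f for f in self.levels if f <= self.target_mhz]
        return allowed[-1]


def toy_power_w(freq_mhz):
    """Assumed plant model: 100 W base plus 1 W per 20 MHz (made-up numbers)."""
    return 100.0 + freq_mhz / 20.0


def simulate(periods=10):
    """Drive the controller against the toy plant and record each period."""
    levels = list(range(1000, 3001, 250))
    ctrl = PowerCapController(levels, budget_w=200.0, gain_mhz_per_w=10.0)
    freq = levels[-1]
    history = []
    for _ in range(periods):
        measured_w = toy_power_w(freq)
        freq = ctrl.step(measured_w)
        history.append((freq, toy_power_w(freq)))
    return history
```

With these made-up numbers, the loop steps down from 3000 MHz and settles at 2000 MHz, the fastest level whose modeled power fits the 200 W budget. An open-loop policy would instead have to pick a fixed level that is safe for the worst-case workload.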
We propose the use of simulated annealing to arrive at an effective layout of data on disk. The proposed technique considers the configuration of the system and the cost of data movement. An initial layout, globally optimized across all queries, shows speedups of up to 13% for a group of DSS queries and up to 6% for selected OLTP queries.
This technique can be re-applied at run-time to further improve performance beyond the initial, globally optimized data layout. This scheme monitors architecture parameters to prevent optimizations of multiple operations from conflicting with each other. Such a dynamic reorganization results in speedups of up to 23% for the DSS queries and up to 10% for the OLTP queries.
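As a rough illustration of simulated annealing applied to data layout, the sketch below assigns data objects to disks so that the busiest disk carries as little load as possible. The cost function, the cooling schedule, and all parameter values are assumptions made for this example. The cost model described above also accounts for system configuration and data-movement cost, which this sketch omits.

```python
import math
import random


def layout_cost(assignment, loads, n_disks):
    """Stand-in cost: the access load on the busiest disk."""
    per_disk = [0.0] * n_disks
    for obj, disk in enumerate(assignment):
        per_disk[disk] += loads[obj]
    return max(per_disk)


def anneal_layout(loads, n_disks, steps=5000, t_start=10.0, t_end=0.01, seed=0):
    """Search for a low-cost object-to-disk assignment by simulated annealing."""
    rng = random.Random(seed)
    current = [0] * len(loads)          # naive start: everything on disk 0
    cur_cost = layout_cost(current, loads, n_disks)
    best, best_cost = current[:], cur_cost
    for step in range(steps):
        # Geometric cooling from t_start toward t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        obj = rng.randrange(len(loads))
        old_disk = current[obj]
        new_disk = rng.randrange(n_disks)
        if new_disk == old_disk:
            continue
        current[obj] = new_disk         # tentatively move one object
        new_cost = layout_cost(current, loads, n_disks)
        delta = new_cost - cur_cost
        # Always accept improvements; accept regressions with
        # probability exp(-delta / temp) to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        else:
            current[obj] = old_disk     # reject: undo the move
    return best, best_cost
```

Calling `anneal_layout([8, 7, 6, 5, 4, 3, 2, 1], n_disks=2)` returns an assignment and its cost. Re-running the search from the current layout with fresh load measurements is one way to picture the run-time re-application.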
In the first half of this talk, I will describe the research prototype Super Dense Server. I will cover its hardware features, show how they challenge the operating system and middleware, and describe the system software enhancements used to meet these challenges. The performance evaluation shows that dense servers are a viable deployment alternative for the edge and application servers commonly found in conventional web sites and large data centers. In the second half of the talk, I will describe a method of distributing web requests across the cluster so that energy can be saved at a small cost in performance. This part will cover the common pitfalls in energy measurement studies, the modifications required to enable power-aware request distribution, and the energy savings achieved.
We introduce a new concept called critical power slope to
explain and capture the power-performance characteristics of systems
with power management features. We evaluate three systems (a
clock-throttled Pentium laptop, a frequency-scaled PowerPC platform, and a
voltage-scaled system) to demonstrate the benefits of our approach. Our
evaluation is based on empirical measurements of the first two
systems, and publicly available data for the third. Using critical
power slope, we explain why on the Pentium-based system, it is energy
efficient to run only at the highest frequency, while on the
PowerPC-based system, it is energy efficient to run at the lowest
frequency point. We confirm our results by measuring the behavior of
a web serving benchmark. Furthermore, we extend the critical power
slope concept to understand the benefits of voltage scaling when
combined with frequency scaling. We show that in some cases, it may be
energy efficient not to reduce voltage below a certain point.
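
The trade-off behind these results can be sketched with a simple energy model. This is not the paper's definition of critical power slope. It is an illustrative calculation under stated assumptions: a fixed amount of work must finish within a period, the processor idles at a constant power once the work is done, and all power figures below are made-up values for two hypothetical systems.

```python
def energy_for_work(freq_mhz, active_w, idle_w, work_mcycles, period_s):
    """Energy (J) to run the work at freq_mhz, then idle until the period ends."""
    busy_s = work_mcycles / freq_mhz    # Mcycles / (Mcycles per second)
    if busy_s > period_s:
        raise ValueError("work does not fit in the period at this frequency")
    return active_w * busy_s + idle_w * (period_s - busy_s)


def most_efficient(points, idle_w, work_mcycles, period_s):
    """Return the frequency (MHz) with the lowest energy for the given work.

    points maps frequency in MHz to active power in watts.
    """
    return min(
        points,
        key=lambda f: energy_for_work(f, points[f], idle_w,
                                      work_mcycles, period_s),
    )


# Hypothetical system A: active power falls slowly as frequency drops.
SYSTEM_A = {150: 8.0, 300: 10.0, 600: 14.0}
# Hypothetical system B: active power falls steeply as frequency drops.
SYSTEM_B = {150: 3.0, 300: 5.5, 600: 12.0}
IDLE_W = 2.0
```

Under this model, the energy for a fixed workload depends on the ratio of (active power minus idle power) to frequency. For 100 Mcycles of work in a 1 s period, `most_efficient` picks 600 MHz for system A and 150 MHz for system B, which mirrors the contrast between the two measured platforms.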
We have also created a power simulator for web serving workloads that estimates
CPU energy consumption with less than 5.7% error for our workloads.
The simulator is fast, processing over 75,000 requests/second on an 866 MHz
uniprocessor machine. Using the simulator, we quantify the potential benefits of
dynamically scaling the processor voltage and frequency, a power management
technique that is traditionally found only in handheld devices. We find that dynamic
voltage and frequency scaling is highly effective for saving energy with
moderately intense web workloads, saving from 23% to 36% of the CPU energy
while keeping server responsiveness within reasonable limits.
Chair: Trevor Mudge
Code compression is the technique of using data compression to reduce the program memory size for memory-limited, embedded computers. For system-on-a-chip designs, this reduces the system die area, which lowers die cost. After compilation, the binary (native code) program is compressed and stored in the embedded system. At run-time, the compressed program is incrementally decompressed and executed. While compressed programs have better code density, their performance is typically lower because additional effort is required to decompress the instruction stream. This dissertation presents methods to improve the performance of compressed programs.
Decompression overhead can be minimized by using special-purpose hardware. This dissertation analyzes IBM's CodePack decompression algorithm and proposes optimizations for it. The optimized decompressor can often execute compressed programs faster than the original native program. The performance benefit of using fewer memory transactions to fetch compressed instructions surpasses the small decompression overhead. Therefore, code compression improves performance as well as code density.
The decompression hardware can be largely replaced with software. The benefits of software decompression are greater design flexibility, reduced hardware complexity, reduced die area, and reduced cost. However, software decompression is much slower than hardware decompression. On a 5-stage pipelined embedded processor with a 4KB instruction cache, CodePack programs execute 1.3 to 27.0 times slower than native programs and reduce program memory die area (instruction cache and main memory) by 26% to 41%. This dissertation proposes instruction set support to enable efficient software-managed decompression. In addition, it explores two software optimizations, hybrid programs and memoization, to improve the execution time of compressed programs by reducing the compression. Hybrid programs contain both native and compressed code to reduce the number of times the decompressor is invoked. Memoization is a dynamic optimization that caches recent decompression results to also avoid invoking the decompressor. Optimized compressed programs that reduce die area 10% to 33% execute only 1.00 to 1.22 times slower than native code. In addition, loop-oriented (multimedia) programs are nearly as fast as native code.
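The memoization idea can be illustrated with a small sketch: keep recently decompressed blocks in a cache so that repeated fetches, as in a loop, skip the decompressor. This is only an analogy in Python. The class name, the LRU replacement policy, and the block granularity are assumptions, and `zlib` stands in for CodePack, which works differently.

```python
import zlib
from collections import OrderedDict


class MemoizingDecompressor:
    """Caches recent decompression results (hypothetical LRU policy)."""

    def __init__(self, compressed_blocks, capacity):
        self.blocks = compressed_blocks
        self.capacity = capacity
        self.cache = OrderedDict()      # block index -> decompressed bytes
        self.decompress_calls = 0       # counts real decompressor invocations

    def fetch(self, index):
        """Return the decompressed block, decompressing only on a cache miss."""
        if index in self.cache:
            self.cache.move_to_end(index)   # mark as most recently used
            return self.cache[index]
        self.decompress_calls += 1
        data = zlib.decompress(self.blocks[index])
        self.cache[index] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return data


def make_blocks(count=4, size=64):
    """Build stand-in 'code' blocks: each is one byte value repeated."""
    return [zlib.compress(bytes([i]) * size) for i in range(count)]
```

A loop that alternates between two blocks invokes the decompressor only twice when the cache holds two entries, which is the effect that makes loop-oriented programs run close to native speed.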