Compiler and Architecture Seminar 2005

IBM Haifa Labs


December 12, 2005
Organized by IBM Research Lab in Haifa, Israel




Machine Learning Based Adaptive Optimizations

Mike O'Boyle, University of Edinburgh, UK

Iterative compiler optimization has been shown to outperform static approaches. This, however, comes at the cost of a large number of evaluations of the program. This work develops a new methodology to reduce the number of evaluations and hence speed up iterative optimization. It uses predictive modeling from the domain of machine learning to automatically focus the search on those areas likely to give the greatest performance. The approach is independent of the search algorithm, search space, and compiler infrastructure, and scales gracefully with the size of the compiler optimization space. Off-line, a training set of programs is iteratively evaluated, and the shapes of their optimization spaces and their program features are modeled. The learnt models are then used to focus the iterative optimization of a new program. We evaluate two such models, an independent model and a Markov model, on two embedded platforms, the Texas Instruments C6713 and the AMD Au1500. We show that these learnt models can speed up iterative search on large spaces by an order of magnitude, translating into an average speedup of 1.26 on the TI C6713 and 1.27 on the AMD Au1500 in just two evaluations.
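
To make the idea concrete, here is a rough C sketch (ours, not the authors' code) of search focusing with an "independent" model: each transformation carries a probability, learnt off-line, of being beneficial; a candidate sequence is scored by the product of those probabilities; and only the top-scoring sequences are actually compiled and run. The transformation names and probabilities below are invented for the example.

    #include <stdio.h>

    #define NUM_TRANSFORMS 4
    #define SEQ_LEN 3

    /* Hypothetical per-transformation probabilities, learnt off-line. */
    static const double p[NUM_TRANSFORMS] = { 0.80, 0.35, 0.60, 0.10 };
    static const char *name[NUM_TRANSFORMS] =
        { "unroll", "tile", "interchange", "fuse" };

    int main(void) {
        double best_score = -1.0;
        int best[SEQ_LEN] = { 0 };

        /* Enumerate all length-3 sequences and score them under the
         * independence assumption: score = p[t0] * p[t1] * p[t2]. */
        for (int a = 0; a < NUM_TRANSFORMS; a++)
            for (int b = 0; b < NUM_TRANSFORMS; b++)
                for (int c = 0; c < NUM_TRANSFORMS; c++) {
                    double score = p[a] * p[b] * p[c];
                    if (score > best_score) {
                        best_score = score;
                        best[0] = a; best[1] = b; best[2] = c;
                    }
                }

        /* Only the top-ranked sequence(s) would be compiled and timed. */
        printf("evaluate first: %s -> %s -> %s (score %.3f)\n",
               name[best[0]], name[best[1]], name[best[2]], best_score);
        return 0;
    }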


Designing a Language and a Representation for Semi-automatic Program Transformation and Program Generation

Albert Cohen, INRIA, France

With the growing complexity of computer architectures, it has become extremely difficult and unproductive, even for experts, to identify the proper program construction and transformation sequence that would optimize the implementation of a given algorithm. Our work is aimed at the domain expert who wishes to tune an application to a particular architecture or to implement an adaptive library generator. It is also aimed at improving the productivity of modern optimizing compiler construction. It fits the contexts of both explicitly parallel and single-threaded programming models. We will discuss the design of a program transformation infrastructure compatible with modern optimization strategies (empirical, adaptive, feedback-directed), exposing a rich search space structure and allowing for the incremental composition of complex transformations. We will also present ongoing work on a language design for portable, semi-automatic optimization, introducing advanced meta-programming and search-space customization capabilities to high-performance computing.


Multi-platform Auto-vectorization

Dorit Naishlos, IBM HRL and Richard Henderson, RedHat

The recent proliferation of the Single Instruction Multiple Data (SIMD) model has led to a wide variety of implementations, incorporated into many platforms, from gaming machines and embedded DSPs to general-purpose architectures. We present an automatic vectorizer as implemented in GCC, the most multi-targetable compiler available today. We discuss the considerations involved in developing a multi-platform vectorization technology and demonstrate how our vectorization scheme is suited to a variety of SIMD architectures. Experiments on four different SIMD platforms demonstrate that our automatic vectorization scheme is able to efficiently support individual platforms, achieving significant speedups on key kernels.
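
For illustration (our example, not from the talk), the kernel below is typical of the loops the GCC vectorizer targets; built with vectorization enabled (for instance, gcc -O3 -ftree-vectorize), it compiles to SIMD instructions on targets such as SSE, AltiVec/VMX, or ARM NEON without any source change.

    #include <stddef.h>

    /* 'restrict' asserts that the arrays do not overlap, removing the
     * dependence that would otherwise block vectorization. */
    void saxpy(float * restrict y, const float * restrict x,
               float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }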


Trace Cache Sampling Filter

Michael Behar, Technion; Avi Mendelson, Intel; Avinoam Kolodny, Technion

This paper presents a new technique for the efficient use of small trace caches. A trace cache can significantly increase the performance of wide out-of-order processors, but to be effective, the trace cache should be large. Power and timing considerations, however, make a small trace cache desirable, with special mechanisms to increase its effectiveness despite the limited size. Hence, several authors have proposed filtering methods that select "good traces" to keep in the trace cache from among the general population of traces. We present a new filtering technique based on sampling: instead of building all traces and then trying to select the good ones among them, it is more efficient to make a preliminary, random-sampling-based selection of the traces to build. We show that the sampling filter improves trace cache and overall system performance while reducing power dissipation. The sampling filter reduces the admission of traces that are never used before their eviction from the cache, and increases the fraction of its cache residency that a trace spends in its live phase. Moreover, the sampling filter reduces duplication between the trace cache and the instruction cache, and thus reduces the overall number of misses in the first level of the cache hierarchy.
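
A minimal sketch of the policy, assuming a fixed sampling rate (the paper's actual mechanism and parameters may differ): a candidate trace is admitted only with a small probability, so cold traces are usually filtered out before they can pollute the cache, while hot traces recur often enough to be admitted quickly anyway.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SAMPLE_RATE 16   /* admit roughly 1 in 16 candidate traces */

    static bool should_admit_trace(void) {
        return (rand() % SAMPLE_RATE) == 0;
    }

    int main(void) {
        int admitted = 0;
        for (int i = 0; i < 1000; i++)   /* 1000 candidate trace builds */
            if (should_admit_trace())
                admitted++;
        /* Hot traces recur, so they still enter quickly; cold ones rarely do. */
        printf("admitted %d of 1000 candidate traces\n", admitted);
        return 0;
    }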


Performance, Power Efficiency, and Scalability of Asymmetric Cluster Chip Multiprocessors

Tomer Y. Morad, Technion; Uri C. Weiser, Intel; Avinoam Kolodny, Technion; Mateo Valero, UPC, Spain; Eduard Ayguade, UPC, Spain

This research evaluates asymmetric cluster chip multiprocessor (ACCMP) architectures as a mechanism to achieve the highest performance for a given power budget. ACCMPs execute serial phases of multi-threaded programs on large, high-performance cores, whereas parallel phases are executed on a mix of large cores and many small, simple cores. Theoretical analysis reveals a performance upper bound for symmetric multiprocessors, which asymmetric configurations surpass in certain power ranges. Our emulations show that asymmetric multiprocessors can reduce power consumption by more than two thirds while delivering performance similar to that of symmetric multiprocessors.
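
The flavor of the theoretical argument can be shown with a back-of-the-envelope model (ours, not the paper's): assume a large core is twice as fast as a small core but consumes four times its power, charge serial phases to one core and parallel phases to all cores, and compare execution times under a fixed power budget.

    #include <stdio.h>

    int main(void) {
        const double s = 0.10;        /* serial fraction of the program */
        const double budget = 16.0;   /* power units available          */

        /* Symmetric: all small cores (1 perf, 1 power each). */
        double n_small = budget;                      /* 16 small cores */
        double t_sym = s / 1.0 + (1.0 - s) / n_small;

        /* Asymmetric: one large core (2 perf, 4 power) + small cores. */
        double n_small_a = budget - 4.0;              /* 12 small cores */
        double t_asym = s / 2.0 + (1.0 - s) / (2.0 + n_small_a);

        printf("symmetric  time: %.3f\n", t_sym);     /* prints 0.156 */
        printf("asymmetric time: %.3f\n", t_asym);    /* prints 0.114 */
        return 0;
    }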


The Molen Compiler Backend for Reconfigurable Architectures

Elena Moscu, Koen Bertels, Stamatis Vassiliadis, TU Delft, The Netherlands

Little compiler research has addressed eliminating or diminishing the negative performance impact of the huge reconfiguration latencies of available FPGA platforms when compiling high-level applications for general-purpose processors (GPPs) combined with reconfigurable processors. We present the Molen compiler backend, developed within the Delft Workbench project, which incorporates the Molen architecture programming paradigm. Additionally, we introduce two compiler optimizations that minimize the number of executed hardware configuration instructions, taking into account constraints such as "FPGA-area placement conflicts" between the available hardware configurations. The proposed algorithms anticipate hardware configuration instructions at both the intraprocedural and interprocedural levels. The presented results show that the proposed optimizations significantly reduce the number of executed hardware configuration instructions.
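
The Molen programming paradigm splits a hardware operation into a "set" phase, which configures the FPGA, and an "execute" phase, which runs the configured operation. The C sketch below, with invented function names standing in for those instructions, illustrates the effect of anticipating the configuration instruction out of a loop:

    #include <stdio.h>

    enum { OP_FIR = 1 };                     /* hypothetical operation id */

    static void set_configuration(int op) {  /* stands in for Molen 'set' */
        printf("configure FPGA for op %d (very slow)\n", op);
    }
    static void execute_op(int op, int *arg) { /* stands in for 'execute' */
        (void)op;
        *arg = *arg * 2;                     /* pretend hardware result   */
    }

    /* Before the optimization: the reconfiguration latency is paid n times. */
    static void unoptimized(int *data, int n) {
        for (int i = 0; i < n; i++) {
            set_configuration(OP_FIR);
            execute_op(OP_FIR, &data[i]);
        }
    }

    /* After anticipation: the 'set' is hoisted out of the loop and the
     * latency is paid once, provided no conflicting configuration claims
     * the same FPGA area inside the loop. */
    static void optimized(int *data, int n) {
        set_configuration(OP_FIR);
        for (int i = 0; i < n; i++)
            execute_op(OP_FIR, &data[i]);
    }

    int main(void) {
        int data[4] = { 1, 2, 3, 4 };
        unoptimized(data, 4);   /* prints the configure message 4 times */
        optimized(data, 4);     /* prints it once                       */
        return 0;
    }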


Yonah - the First Intel CMP Architecture for Mobility

Ronny Ronen, Intel

Recently, Intel introduced the Yonah processor, Intel's first dual-core mobile processor, whose power-optimized architecture delivers a superior combination of power efficiency and performance. In this talk, we present technical details of the processor, describing its performance and power-awareness features.


Compilation for the Cell processor

Kevin O'Brien, IBM Watson Research

Developed for multimedia and game application workloads, the Cell processor provides support for highly parallel codes, which have high computation and memory requirements, as well as for scalar codes, which require fast response times and full-featured programming environments. This first-generation Cell processor implements on a single chip a POWER processor with two levels of cache and eight attached streaming processors with their own local memories and globally consistent DMA engines. In addition to processor-level parallelism, each processing element has Single Instruction Multiple Data (SIMD) units that can each process from 2 double-precision floats up to 16 chars per cycle.

The complexities of the Cell processor span multiple dimensions. At the elementary level, the Cell system has two distinct processor types, each with its own application-level ISA. One, the PPE, uses the familiar 64-bit PowerPC ISA with VMX; the other, the SPE, implements a new 128-bit SIMD instruction set for multimedia and general floating-point processing. Typical applications on the Cell processor will consist of a combination of codes that exploit both processor types. The pipelines of both processor types must be taken into account, and the SPE presents several challenges not seen in the PPE, chief among them the instruction prefetch capabilities and the significant branch miss penalties resulting from the lack of hardware branch prediction. At the next level, the SPE is a short-SIMD or multimedia processor with scant support for scalar operations. On the next dimension is the parallelism of the machine when deploying applications across all SPEs.

It has been demonstrated that expert programmers can develop and hand-tune applications to exploit the full performance potential of this machine. We believe that sophisticated compiler optimization technology can bridge the gap between usability and performance in this arena. To this end, we have developed a research prototype compiler targeting the Cell processor. In this talk, we describe a variety of compiler techniques we have investigated and their associated performance benefits. These techniques are aimed at automatically generating high-quality code over the broad spectrum of heterogeneous parallelism available on the Cell processor. The techniques we describe include compiler-supported branch prediction, compiler-assisted instruction fetch, and the generation of scalar code on SIMD units. We will then discuss our techniques for automatically generating SIMD code and for parallelizing single programs across the multiple heterogeneous processors. In particular, we will describe and discuss the performance of our technique for presenting the user with a single shared-memory image through a compiler-controlled software cache. We will also report and discuss the results we have achieved to date, which indicate that significant speedups can be achieved on this processor with a high level of compiler support.
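
As a minimal sketch of the software-cache idea (ours; the sizes, names, and policy are invented, and a memcpy from a simulated memory stands in for the SPE's DMA engine), each global load in SPE code is compiled into a tag check against a small direct-mapped cache held in the local store, with the missing line transferred on a miss:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE 128                   /* bytes per cache line      */
    #define NUM_LINES 64                    /* lines held in local store */

    static uint8_t  global_mem[1 << 16];    /* simulated global memory   */
    static uint64_t tags[NUM_LINES];
    static int      valid[NUM_LINES];
    static uint8_t  lines[NUM_LINES][LINE_SIZE];

    /* Stands in for an SPE DMA transfer from global memory to local store. */
    static void dma_get(void *local, uint64_t addr, size_t size) {
        memcpy(local, &global_mem[addr], size);
    }

    static void *sw_cache_lookup(uint64_t addr) {
        uint64_t line_addr = addr & ~(uint64_t)(LINE_SIZE - 1);
        unsigned idx = (unsigned)((line_addr / LINE_SIZE) % NUM_LINES);
        if (!valid[idx] || tags[idx] != line_addr) {    /* miss: fetch line */
            dma_get(lines[idx], line_addr, LINE_SIZE);
            tags[idx] = line_addr;
            valid[idx] = 1;
        }
        return &lines[idx][addr & (LINE_SIZE - 1)];
    }

    int main(void) {
        global_mem[1000] = 42;
        printf("%d\n", *(uint8_t *)sw_cache_lookup(1000));  /* prints 42 */
        return 0;
    }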


Automatic Parallelization with Hybrid Analysis

Lawrence Rauchwerger, Texas A&M University

Automatic program parallelization has been an elusive goal for many years. It has recently become more important due to the widespread introduction of multi-cores in PCs. Automatic parallelization has not been achieved because classic compiler analysis was not powerful enough and, in many cases, program behavior was found to be input dependent. Run-time thread-level parallelization, introduced in 1995, was a welcome but somewhat different avenue for advancing parallelization coverage. We introduce a novel analysis, Hybrid Analysis (HA), which unifies static and dynamic memory reference techniques into a seamless compiler framework that extracts almost all of the available parallelism from scientific codes while generating minimal run-time overhead. We will show how we can extract maximum information from quantities that cannot be sufficiently analyzed through classic static compiler methods, and how we can generate sufficient conditions which, when evaluated dynamically, can validate optimizations. A large number of experiments confirm the viability of our techniques, which have been implemented in the Polaris compiler.
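
A minimal illustration (ours, not actual Polaris output) of the kind of sufficient condition such a framework can evaluate dynamically: when static analysis cannot prove that two arrays never overlap, the compiler can guard a parallel version of the loop with a cheap run-time disjointness test.

    #include <stddef.h>
    #include <stdio.h>

    void add_arrays(double *a, const double *b, size_t n) {
        /* Sufficient condition, checked at run time: the accessed regions
         * [a, a+n) and [b, b+n) do not overlap. */
        if (a + n <= b || b + n <= a) {
            #pragma omp parallel for        /* independence proven: parallel */
            for (size_t i = 0; i < n; i++)
                a[i] += b[i];
        } else {
            for (size_t i = 0; i < n; i++)  /* possible overlap: sequential  */
                a[i] += b[i];
        }
    }

    int main(void) {
        double a[4] = { 1, 2, 3, 4 }, b[4] = { 10, 20, 30, 40 };
        add_arrays(a, b, 4);
        printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);  /* 11 22 33 44 */
        return 0;
    }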


Experiences with Media-processing and Multi-processing

Henk Schepers, Philips

The ever-increasing integration of functions into SoC designs requires new mechanisms to guarantee system performance and reduce control overhead. Coherent L2 system caches shared among multiple CPU cores in single-chip and multi-chip systems are promising candidates for addressing some of these issues. This presentation gives an overview of the potential advantages of and design challenges for such systems when using coherent L2 caches based on the WASABI concept.


Towards a Source Level Compiler: Source Level Modulo Scheduling

Yosi Ben-Asher and Danny Meisler, Haifa University

Modulo scheduling is a major optimization of high-performance compilers, in which the body of a loop is replaced by an overlapping of instructions from different iterations. This enables the compiler to schedule more instructions in parallel than in the original loop. Being a scheduling optimization, modulo scheduling is typically a backend optimization that relies on a detailed description of the underlying CPU and its instructions to produce a good schedule. Our work considers the problem of applying modulo scheduling at the source level, as a loop transformation, using only general information about the underlying CPU architecture. Doing so makes it possible to: a) create a more retargetable compiler, as modulo scheduling is applied at the source level; b) study possible interactions between modulo scheduling and common loop transformations; and c) obtain a source-level optimizer whose output is readable by the programmer, yet can be efficiently compiled by a relatively 'simple' compiler.

Experimental results show that source-level modulo scheduling (SLMS) can improve performance even when low-level modulo scheduling is also applied by the final compiler, indicating that the two can co-exist to improve performance. We present an algorithm for source-level modulo scheduling that modifies the abstract syntax tree of a program; the algorithm has been implemented in an automatic parallelizer (Tiny). Preliminary experiments also show runtime and power improvements on the ARM CPU for embedded systems.
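
To make the transformation concrete, here is a hand-worked example (ours, not from the paper): the modulo-scheduled version overlaps the load for iteration i+1 with the multiply and store of iteration i, so each kernel iteration contains independent operations that a simple back end can issue in parallel.

    /* Original loop: each iteration is a serial chain load -> mul -> store. */
    void scale(int *dst, const int *src, int n, int k) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    /* After source-level modulo scheduling (prologue / kernel / epilogue). */
    void scale_slms(int *dst, const int *src, int n, int k) {
        if (n <= 0) return;
        int cur = src[0];                  /* prologue: first load        */
        for (int i = 0; i < n - 1; i++) {
            int next = src[i + 1];         /* load for iteration i+1 ...  */
            dst[i] = cur * k;              /* ... overlaps work of iter i */
            cur = next;
        }
        dst[n - 1] = cur * k;              /* epilogue: last iteration    */
    }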
