Skip to main content
    Israel [change]    Terms of use
    Home    Products    Services & solutions    Support & downloads    My account    
IBM Research & CRI

3rd HiPEAC Industrial Workshop on Compilers and Architectures

IBM Haifa Labs

Invitation Registration

April 17, 2007
Organized by IBM Research Lab in Haifa, Israel

Initial Results of the Performance Implications of Thread Migration on a Chip Multi-Core

Y. Sazeides*, P. Michaud+, L. He*, D. Fetis+, P. Charalambous*, C. Ioannou*, A. Seznec+ *University of Cyprus, Nicosia, Cyprus +Irisa-Inria, Rennes, France

Thread-migration is an emerging method for research that leverages multi-cores to alleviate the temperature problem by distributing thread activity and power consumption more uniformly over the entire chip. Our research aims to establish how effective is thread-migration at solving the power density problem for a thermally constrained multi-core and determine how to best implement thread migration to address the temperature problem.

In this talk we present initial results that quantify the performance implications of thread migration on a four core system when the workload consists of one to four threads from SPEC CPU 2000 benchmarks. Overall, the results reveal that a simplistic implementation of activity migration can have significant performance overhead. Consequently, if thread migration is to be successful, novel mechanisms are needed to minimize the cold effects and flushing overhead of large private caches. The data also suggest a need to reduce the cold predictor effects for the case when a core is visited by one or more threads before a same thread revisits it.

Related work from the authors:
  1. ATMI, Temperature Model,
  2. A study of thread migration in temperature-constrained multi-cores, P. Michaud, Y. Sazeides, A. Seznec, T. Constantinou, D. Fetis, to appear in ACM Transactions on Architecture and Code Optimization, 2007.
  3. Scheduling issues on thermally constrained processors, P. Michaud, Y. Sazeides, IRISA report PI-1822 and INRIA report RR-6006, October 2006.
  4. An evaluation of HotSpot-3.0 block-based temperature model, D. Fetis, P. Michaud, 5th Annual Workshop on Duplicating, Deconstructing, and Debunking, June 18, 2006.
  5. Performance Implications of Single Thread Migration on a Chip Multi-Core, T. Constantinou, Y. Sazeides, P. Michaud, D. Fetis, and A. Seznec, Computer Architecture News, Volume 33 , Issue 4, Nov. 2005.
  6. An analytical model of temperature in microprocessors, P. Michaud, Y. Sazeides, A. Seznec, T. Constantinou, D. Fetis, IRISA report PI-1760 and INRIA report RR-5744, Nov. 2005.

Caravela: A Distributed Stream-based Computing Platform

Leonel Sousa, Shinichi Yamagiwa, INESC-ID/IST, TULisbon

Stream-based programming style gathers data into a stream, operates on it and then scatters it back to memory. Recently the computational power growth of Graphical Processing Units (GPUs) exceeds the one established by the Moore’s law, so that high level languages and interfaces, such as the Cg and Brook for GPUs (, have been developed to ease the programming of stream processors for general purpose applications. The application of these pieces of work has been limited to compute small-scale problems locally, on a single stream processor. An exception is the case of a cluster of GPUs supported with MPI [1], that directly applies stream computing to the distributed programming model with a very limited set of experiments. In this workshop we propose a new flow-model that extends the basic model for stream-based computing by considering random access to multiple input streams and by supporting recursive computation. The Caravela platform presented in this paper defines an execution unit, based on the proposed flow-model, that is composed by input data streams, output data streams and a program to process those I/O data streams. This flow-model fits particularly well into the GPUs because the GPU supports stream-based computation using texture image inputs. The Caravela package available to download from allows to apply the flow-model unit concept to distributed computing based on GPUs available in recent computers. Caravela encapsulates the program in a flow-model unit and assigns it to a distributed computing environment, by allowing to execute it in any available computer and by directly collecting the data through the memory or the network [2]. These tools provide a uniform programming interface that hides the differences between graphics runtimes and achieves high performance (DirectX9 and the OpenGL 2.0 are actually supported). Data buffering optimization methods are applied to efficiently implement recursive computation [3]. This paper illustrates the usage of the Caravela platform for different applications, such as recursive linear filtering for audio and image processing, 2-D DWT for JPEG2000 and decoding LDPC codes for DVB-S2.

Experimental results show that a significant improvement can be achieved with GPUs against general purpose processors and that the implemented swap frame data buffering method allows to improve even more this efficiency, up to 60%, for recursive computation. Future research directions are pointed to achieve distributed computing based on the Caravel platform, namely the extension of Caravela tools to implement the concept of meta-pipelining and its integration in the MPI programming interface.

  1. Z. Fan, F. Qiu, A. Kaufman, S.Y-Stover, "GPU Cluster for High Performance Computing", Proc. ACM/IEEE Supercomputing Conference, 2004, pp. 47-58.
  2. S. Yamagiwa and L. Sousa, "Caravela: A Novel Environment for data-flow-based distributed processing", accepted for publication in IEEE Computer in January 2007.
  3. S. Yamagiwa and L. Sousa, "Data Buffering Optimization Methods Toward a Uniform programming Interface for GPU-based Applications", accepted for publication on the ACM International Conference on Computing Frontiers, May 2007.

Probabilistic Cache Filtering

Yoav Etsion, Dror G. Feitelson, Hebrew University, Jerusalem

Distinguishing transient blocks from frequently used blocks enables servicing references to transient blocks from a small highly-associative auxiliary cache structure. By inserting only frequently used blocks into the main cache structure, we can reduce the number of conflict misses, thus achieving higher performance and allowing the use of direct mapped caches which offer lower power consumption and lower access latencies. In this paper we introduce a simple probabilistic filtering mechanism to identify and select the frequently used blocks. We show that a 16K direct-mapped L1 cache, augmented with a fully-associative 4K filter, achieves on average 13% more instructions per cycle than a regular 16K, 4-way set-associative cache, and even 12% higher IPC than a 32K, 4-way cache, while consuming about 70%-80% less dynamic power than either of them.

Data Layout Optimizations in GCC

Olga Golovanevsky, Razya Ladelsky, IBM HRL

The idea to adapt layout of data structures in the program to better utilize the cache locality is not itself new for the world of optimization technologies. The variety of techniques was suggested that range from the trace data analyses to post link optimizations. This idea opens even further challenges when considered at compile time. It triggered our development of data layout optimizations on the base of GCC compiler. Both profiled-based and static, these optimizations analyze the program data accesses, and make modifications of the data layout accordingly. Naturally in the focus are matrices and structures in C.

For matrices, the transformations are flattening and transposing. Matrix flattening replaces a m-dimensional matrix with its equivalent n-dimensional matrix, where n<m, thus reducing the level of indirection required for accessing the elements of the matrix. Matrix transposing swaps the matrix's rows and columns, when the matrix's frequent access patterns iterate the columns rather than the rows, striving to achieve better cache locality.

Among the optimizations relevant for structures in C there are fields reordering (reordering); full structure decomposition into separate fields (peeling); hierarchical structure decomposition (splitting), when structure is divided into substructures, while pointers are used to preserve the unity of original structure. To make this transformations efficient the accurate analysis were implemented, based on GCC basic block profiling. In addition, the type/variable escape analyses guarantee the safety of these transformations. Finally, we describe the results gained by data layout optimizations when applied on Spec2000.

SIMDinator: Use of the x86 SIMD Instructions

David Livshin, DALsoft

The modern DSP's/CPU's require a tight interface between compiler and hardware architecture in order to achieve high utilization of the available resources. Compilers have not shown the ability to meet the needs of programmers on critical code. As architecture scales up, it is becoming so complex that human programmers can't deal with the scheduling and tracking of so many registers and execution units. The result may be an architecture that can't be programmed to apply most of its resources on real-life algorithms. To address this problem we conceived and developed an assembly code optimization technology that was successfully applied to a variety of different hardware architectures and shown consistent great superiority over all attempted compilers. The developed optimizer acts as a compiler post-processor, using compiler-generated assembly code as it input and generating a highly optimized target's assembly code that is logically identical to the original one.

In the presentation we discuss results achieved by the x86 implementation of the optimizer over the GCC-generated code. The most attention is given to the SIMDinator (pronounced "seem-d-ney-ter") - the part of the optimizer responsible for utilization of the x86 SIMD (Single Instruction Multiple Data) instructions. We discuss the issues that arise when SIMD instructions are used and how use of these instructions interacts with other optimizations supported by the optimizer.

Issues and Challenges in Compiling for Multiple Forms of Parallelism, in IBM Research Compilers

Kathryn O'Brien, IBM T.J. Watson Research Center

Emerging architectures present opportunities for exploiting parallelism at multiple levels of the hardware, while providing a variety of challenges to the application developer, the language designer and the compiler developer. In this talk I will discuss some of these problems in the context of the IBM compiler research activities, especially as they pertain to maintaining the high performance of the IBM product compilers on both parallel and single threaded applications.

Implementation and Validation of a Cell Simulator using UNISIM

Felipe Cabarcas, Alejandro Rico, David Rodenas, Xavier Martorell, Alex Ramirez, Eduard Ayguade, UPC

We have implemented a functional simulator of the Cell architecture using the UNISIM modular infrastructure. The simulator includes functional emulation of the PPE and SPE ISAs, as well as all the architecture components of the processor: element interconnection bus, local storage, and memory flow controllers. The simulator is source-level compatible with regular Cell applications, meaning applications must be linked with a modified version of the Cell SDK in order to run. We are in the process of validating that the processor functionality matches that of the real platform, and determining configuration values for the processor speed and bus connectivity so that performance figures not only follow the same trend, but fall within reasonable margins of the real hardware.

We would present the simulator architecture: the implemented modules and their interfaces, and discuss the validation methodology. We would also show performance figures for the simulator itself (simulation speed), and performance predictions of the actual hardware.

The presentation may also include guidelines on how to extend / improve the simulator in order to encourage potential users.

CAPSULE: Parallel Execution of Component-based Programs

Pierre Palatin, Zheng Li, Yves Lhuillier, Olivier Temam, INRIA

Since processor performance scalability will now mostly be achieved through thread-level parallelism, there is a strong incentive to parallelize a broad range of applications, including those with complex control flow and data structures. And writing parallel programs is a notoriously difficult task. Beyond processor performance, researchers can help by facilitating the task of the programmer, especially by simplifying the model exposed to the programmer.

In this research work, among the many issues associated with writing parallel programs, we focus on finding the appropriate parallelism granularity, and efficiently mapping tasks with complex control and data flow to threads. We propose to relieve the user and compiler of both tasks by delegating the parallelization decision to the architecture at run-time, through a combination of hardware and software support and a tight dialogue between both.

For the software support, we leverage an increasingly popular approach in software engineering, called component-based programming; the component contract assumes tight encapsulation of code and data for easy manipulation. Previous research works have shown that it is possible to augment components with the ability to split/spawn, providing a simple and fitting approach for programming parallel applications with complex control and data structures. However, such environments still require the programmer to determine the appropriate granularity of parallelism, and spawning incurs significant overheads due to software run-time system management.

For that purpose, we are developing an environment with the ability to spawn conditionally depending on available hardware resources, and we delegate spawning decisions and actions to the architecture. This conditional spawning is implemented through frequent hardware resource probing by the program. This, in turn, enables rapid adaptation to varying workload conditions, data sets and hardware resources. Furthermore, thanks to appropriately combined hardware and compiler support, the probing has no significant overhead on program performance.

We demonstrate this approach on an 8-context SMT, several non-trivial algorithms and re-engineered SPEC~CINT2000 benchmarks, written using component-like syntax processed by our toolchain.

We are currently developing a software-only version of the CAPSULE environment for CMPs, and we will later investigate hardware support for both shared-memory and distributed-memory CMP.

Using Extremely Fine Granularity Multithreading for Energy Efficient Computing

Alex Gontmakher, Avi Mendelson, Assaf Schuster, Technion

We investigate extremely fine-grain multithreading as a means for improving energy efficiency of single-task program execution. Our work is based on low-overhead threads executing an explicitly parallel program in a register-sharing context. The thread-based parallelism takes the place of instruction-level parallelism, allowing us to use simple and more energy-efficient in-order pipelines while retaining performance that is characteristic of classical out-of-order processors. Our evaluation shows that in energy terms, the parallelized code running over in-order pipelines can outperform both plain in-order and out-of-order processors.

How Many Cores is Too Many Cores?

Avi Mendelson, Intel

The computer industry has been able to keep an exponential improvement in performance for the last few decades. This unbelievable phenomenon is known as "Moore's law". As the computer architecture industry start reaching the "power wall", two new trends are being developed; the first trend calls to trade single thread performance with multithreaded performance and so to increase the overall performance of a processor by accommodate it with large number of small cores. The second trend calls to increase both the single thread performance and the multithreaded performance and so to divide the "transistor budget" of the processor between relatively small numbers of "large cores". In this talk I will present the root cause of the "power wall", I will extend the discussion on each of the new trends in computer architectures described above and provide an analysis of what is needed for each of them to succeed. I will also extend the discussion on few of the recent new technologies such as Virtualization, Transaction memory, Acceleration and try to relate them to the architecture models above.


Related Conference Links
Past Compiler and Architecture Seminars 2005, 2004, 2003, 2002  
3rd HiPEAC Industrial Workshop  
April 18, HiPEAC Cluster Meetings, Haifa  

    About IBMPrivacyContact
HiPEAC IBM Research