ISCA Tutorial

Cell Broadband Engine - enabling density computing for data-rich environments 
  Abstract 

Date: Sunday, June 18
Time: 8:00am - 12:00pm

We will discuss the Cell Broadband Engine architecture and programming models. To ensure flexibility of deployment and use, the Cell Broadband Engine Architecture supports a broad range of programming models, from the function pipelines typically used in graphics rendering environments to distributed symmetric multiprocessing models.

We will describe the Cell Broadband Engine Architecture and how four axes of parallelism (thread-level parallelism, instruction-level parallelism, data-level parallelism, and data processing/transfer parallelism) guided the design of the Cell BE architecture to deliver a quantum leap in performance by exploiting application parallelism at all levels. Cell addresses the compute density challenge of increasing performance per area and power, and delivers high performance for compute-intensive codes by exploiting chip multiprocessing to address inherent power and memory latency issues.

We will discuss the applicability of programming models and, drawing on a number of application development projects for Cell BE-based systems, the development practice and experience needed to translate the peak performance offered by the hardware into delivered application performance.

We will describe how the compiler brings together programming models and the heterogeneous platform, and the compilation challenges involved in generating high-performance code for this novel architecture. This will include issues such as how to generate good scalar code for a SIMD engine where significant hardware function has been offloaded to software, how to exploit the multiple levels of parallelism, and how the compiler can provide a single system abstraction of the underlying heterogeneous processors with attached local memories. The discussion will be in the context of a functioning prototype compiler for the Cell Broadband Engine Architecture. This tutorial is aimed at researchers and practitioners in the field of design and utilization of high-performance architectures.
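
To make the scalar-code issue concrete, here is a minimal sketch in Python of what a scalar load can look like on a machine that offers only aligned 16-byte (quadword) memory accesses. It is an illustration under stated assumptions, not the code sequence the prototype compiler emits: the function names, the big-endian byte order, and the convention of rotating the scalar into the first four bytes of the register are all assumptions made for this example.

```python
QUADWORD = 16  # bytes per memory access in this hypothetical machine


def load_quadword(memory: bytes, addr: int) -> bytes:
    """Aligned 16-byte load: the low four address bits are ignored."""
    base = addr & ~(QUADWORD - 1)
    return memory[base:base + QUADWORD]


def rotate_left_bytes(qw: bytes, count: int) -> bytes:
    """Rotate a quadword left by `count` bytes (modulo 16)."""
    count %= QUADWORD
    return qw[count:] + qw[:count]


def scalar_load_u32(memory: bytes, addr: int) -> int:
    """Emulate a 32-bit scalar load using only a quadword load and a rotate.

    Assumption for this sketch: the word is naturally aligned (addr % 4 == 0),
    so it never straddles two quadwords.
    """
    if addr % 4 != 0:
        raise ValueError("sketch handles naturally aligned words only")
    qw = load_quadword(memory, addr)                   # step 1: fetch enclosing quadword
    rotated = rotate_left_bytes(qw, addr % QUADWORD)   # step 2: move word to bytes 0..3
    return int.from_bytes(rotated[:4], "big")          # step 3: read the scalar slot


memory = bytes(range(32))
print(hex(scalar_load_u32(memory, 20)))  # 0x14151617
```

The point of the sketch is that a single scalar load becomes a load plus a data-reorganization step, which is the kind of work that moves from hardware into the compiler on such an engine.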

  Presenters 

Michael Gschwind is an IBM Master Inventor, an IBM Power architect and a logic design and microarchitecture lead for a future IBM System. He was one of the initiators and a leading contributor to the Cell Broadband Engine system architecture definition as well as a lead architect of the Synergistic Processor architecture. During the definition of the Synergistic Processor architecture, Dr. Gschwind also developed the first Cell Broadband Engine compiler.

Dr. Gschwind joined the IBM TJ Watson Research Center, Yorktown Heights, NY, in 1997. He has held leadership positions in several seminal projects, including the DAISY dynamic compilation project where he was responsible for design, test and integration of the DAISY tree-VLIW architecture, and the BOA project where he was a lead architect for the BOA high-frequency statically scheduled architecture, and was a leading contributor to the development of pioneering dynamic compilation techniques. During the project, Dr. Gschwind developed key pipeline control techniques used in several industry-leading microprocessors.

Dr. Gschwind was also a leading contributor to seminal work on power/performance trade-offs in microprocessor designs which formalized the futility of the frequency-centric uniprocessor design approach used in the industry at the time, an insight that had already guided the design of the Cell Broadband Engine.

Dr. Gschwind is a leader in assessing the impact of future technologies on architecture and microarchitecture for IBM systems. Dr. Gschwind has contributed to the development of the Power Architecture as the lead architect for a next generation SIMD extension, to the architecture and microarchitecture of several generations of PowerPC processor cores, and PowerPC-based systems such as the IBM PowerBlade, as well as to the architecture and microarchitecture of IBM zSeries systems.

Dr. Gschwind has also contributed to compilers, application programming environments and application interfaces for IBM zSeries and pSeries systems under Linux, AIX and z/OS.

Dr. Gschwind's contributions to IBM systems and technology have been recognized with several corporate awards. In addition to his contributions to the design and implementation of IBM systems, he is the author of over 75 papers, covering hardware/software co-design, compiler technology, multimedia processing, and high-performance computer architecture, and has received key patents for his inventions in these areas.

Before joining IBM, Dr. Gschwind made significant contributions to the field of reconfigurable architectures, developing architecture extension and evaluation techniques, and pioneering the use of high-capacity field-programmable gate arrays as implementation platform for microprocessors.

In addition to his corporate contributions, Dr. Gschwind has been a faculty member at Technische Universität Wien, Vienna, Austria, and a visiting faculty member at Princeton University where he has taught classes on advanced computer architecture. Dr. Gschwind received PhD and MS degrees in computer science from Technische Universität Wien, Vienna, Austria.

Bruce D'Amora is a Senior Technical Staff Member in the Emerging Systems Software group at the IBM T.J. Watson Research Center. His research interests are in 3D rendering and physical simulation. Mr. D'Amora is currently focusing on the design and programmability of Cell processor-based systems targeted at video game development and digital media. His previous project was Pervasive 3D Viewing for Product Data Management, for which he developed a 3D renderer on a network-enabled handheld device. Prior to his position at Watson, Mr. D'Amora was the Chief Software Architect for the 3D graphics development group at IBM Austin and the IBM representative on the OpenGL Architecture Review Board. He has designed and developed graphics applications, system software, and graphics adapters for 23 years. Mr. D'Amora began his career at IBM Boulder, where he was a developer of IBM's first MCAD application, FastDraft. The application was targeted specifically at the automotive and aerospace industries. He subsequently developed one of the first PC CAD applications, CADWrite, which utilized the Virtual Device Interface (VDI) to access low-level graphics adapter function. After briefly working on the AN/BSY-1 submarine sonar system, Mr. D'Amora relocated to IBM Austin, TX to lead the OpenGL development effort. In 1993, his team produced the first industry implementation of OpenGL on the initial 601 PowerPC workstation developed by the IBM RS/6000 division. Derivatives of the OpenGL product became the code base for subsequent IBM graphics systems. Mr. D'Amora holds a BA in Microbiology and a BS in Applied Mathematics from the University of Colorado. He also holds an MS in Computer Science from National Technological University.

Kevin O'Brien has spent the last 24 years at IBM working in the field of compilation and architecture. Initially, at the IBM Toronto Lab, he was the lead architect of the TOBEY optimizing backend (used in IBM's xlc, xlf, and xlC (C++) product compilers) and made many of the key design decisions for the whole compiler suite, including the C and Fortran front ends. Mr. O'Brien was a co-architect of XIL, the original intermediate language for TOBEY, which is still in use in the compilers that ship today. He was also responsible for the reassociation optimization in these compilers, which were for a long time the only commercial compilers to perform this optimization.

On moving to the IBM T.J. Watson Research Center at Yorktown Heights, Mr. O'Brien became the lead designer of a research project to perform cache optimizations in the XL compilers. He developed numerous loop transformation techniques, such as unroll and jam, stripmining, and loop unswitching, as well as predictive commoning. All of his techniques were incorporated into, and subsequently shipped in, the product compilers.

Mr. O'Brien was also the architect of a new intermediate language (YIL), initially a research prototype but later incorporated into the product compilers. This high-level intermediate language was introduced to facilitate high-level loop optimizations that were difficult to perform on the lower-level XIL. YIL was used in the High Order Transformation phase, which shipped in the product compilers and formed the precursor of the current Toronto Portable Optimizer.

Contemporaneously with this work, Mr. O'Brien also developed a prototype simdizing compiler for a Numeric Intensive Computation Accelerator architecture being developed by IBM.

Subsequent to this work, Mr. O'Brien was one of the key members of the Single Program Speculative Multithreading architecture/compiler co-design project.

In the course of his 17 years at IBM Research, Mr. O'Brien has focused his efforts on deploying his diverse compiler optimization skills to the improvement of IBM products. In 1996 he ran a project to enhance the performance of the IBM Smalltalk product. With the emergence of Java, Mr. O'Brien applied his dynamic optimization skills to the design of a dynamic optimizer for a just-in-time compiler for a JVM for the first Transmeta machine. This was the first JIT to be written in Java. Many of the techniques developed in this work were applied in a continuous optimization phase of a binary translation/emulation project that Mr. O'Brien worked on subsequently. As part of this project he also developed a threaded interpreter. He continues to consult on issues of dynamic optimization in IBM.

For the last 4.5 years Mr. O'Brien has been the project leader for the IBM prototype compilers for the CBE Processor. In addition to developing major pieces of the scalar compiler, he has developed techniques for the single shared memory abstraction and the efficient management of code and data on the CBE processor. He is currently investigating memory related optimizations for the CBE.

Mr. O'Brien has received IBM Corporate and Outstanding Technical Achievement Awards for his work on the XL compilers and holds several patents on compiler optimization and SPSM. He has also contributed to the architectures of Power, PowerPC and Cell.

Mr. O'Brien holds a BSc in Theoretical Physics and an MSc in Astrophysics from Queen Mary College, London, England.

Kathryn O'Brien has worked at IBM for 23 years, 17 of them as a researcher at the IBM T.J. Watson Research Center, where she has been involved in several static and dynamic compiler projects. Ms. O'Brien began her career at the IBM Toronto Lab, where she became the project leader for the IBM XLF Fortran runtime library for the RS/6000 processor. Subsequently she worked on the TOBEY optimizing backend of the XL compilers, where she was responsible for several optimizations. She implemented the trap motion optimization in the current XL product compilers.

On moving to the IBM T.J. Watson Research Center at Yorktown Heights, Ms. O'Brien started a project to develop automatic vectorization in the IBM XL compilers targeting the S/390 vector instruction set. This project formed the basis of the later vectorization, parallelization and memory hierarchy optimization phases in these compilers, and shipped as the first High Order Transformation phase in the XL compilers, the precursor to the current Toronto Portable Optimizer. Her primary interest in this work was in data dependence analysis and loop distribution.

Later, Ms. O'Brien was one of the lead architects of an architecture/compiler co-design project for one of the first speculative multithreading architectures, SPSM (Single Program Speculative Multithreading). She developed significant portions of the compiler to target this architecture, and is a co-inventor of the original SPSM patent.

Throughout her career at IBM Research, Ms. O'Brien has been involved in a number of other static and dynamic optimization projects. Most notably, she was one of the key members of a project to build a dynamically optimizing JIT compiler, written in Java, for a JVM for the first Transmeta machine. This was the first JIT compiler to be written in Java. Subsequently, Ms. O'Brien worked on a continuous dynamic optimization project which used a multiprocessor system to monitor and dynamically re-optimize executing applications.

Most recently, Ms. O'Brien has played a key role in prototyping the first XL compilers for the Cell architecture, where her specific interests are in both scalar compilation techniques and compiler exploitation of the multiple levels of parallelism through a single-source compiler. She has developed significant portions of the scalar compiler and was instrumental in the design of the single-source OpenMP compiler.

Ms. O'Brien holds a B.A. degree in History from the Queen's University of Belfast and an M.A. degree in Anthropology from the University of London in England.

Alexandre Eichenberger is currently a Research Staff Member in the Exploratory System Architecture group of the VLSI Systems department at the IBM T.J. Watson Research Center. My research interests focus on the interaction between compiler technology and micro-architecture design.

While at Watson, I have explored and developed compiler support for many of the micro-architectural tradeoffs present in the Cell Broadband Engine (Cell BE) processors, primarily on the Synergistic Processor Element (SPE). Examples of tradeoffs are SPE-specific scheduling and bundling issues, as well as compiler techniques to prevent instruction fetch starvation on the SPEs.

More recently, I have been a primary contributor to the automatic generation of SIMD code targeting the SIMD units found in the Cell (SPE/VMX), Power (VMX), and BlueGene/L (double-precision floating-point) architectures, focusing on data alignment and code generation related issues. Prior work includes unroll-and-pack approaches and loop-based approaches. The novel approach that I pioneered combines aspects of both of these approaches. It also attempts to systematically minimize the impact of data reorganization due to compile-time or runtime data misalignment, and it can perform auto-simdization in the presence of data conversion (i.e., conversion from one data type to another). Auto-simdization can generate such minimum data reorganization code for the SIMD unit even in the presence of multiple compile-time or runtime alignments. It also handles induction variables, private variables, and non-unit-stride memory reference patterns.
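
The need for data reorganization under misalignment can be illustrated with a small Python sketch. It is not the simdization algorithm described above and makes no attempt to minimize reorganization; it only shows, under assumed names and a hypothetical vector width of four elements, how a misaligned stream can be rebuilt from aligned vector loads so that two streams with different alignments can be combined element by element.

```python
V = 4  # elements per vector register (hypothetical width for this sketch)


def aligned_load(arr, i):
    """Load the aligned vector that contains element i."""
    base = (i // V) * V
    return arr[base:base + V]


def shifted_load(arr, i):
    """Build arr[i:i+V] from at most two aligned loads plus a shift.

    Assumption: arr is padded so the second aligned load stays in bounds.
    """
    offset = i % V
    lo = aligned_load(arr, i)
    if offset == 0:
        return lo                        # already aligned, no reorganization
    hi = aligned_load(arr, i + V)        # next aligned vector
    return (lo + hi)[offset:offset + V]  # select the V elements starting at i


def simd_add(a, b, a_off, b_off, n):
    """Compute out[i] = a[i + a_off] + b[i + b_off] for i in range(n).

    n must be a multiple of V; each iteration handles one full vector.
    """
    out = []
    for i in range(0, n, V):
        va = shifted_load(a, i + a_off)
        vb = shifted_load(b, i + b_off)
        out.extend(x + y for x, y in zip(va, vb))
    return out


a = list(range(16))
b = list(range(100, 116))
result = simd_add(a, b, 1, 2, 8)
assert result == [a[i + 1] + b[i + 2] for i in range(8)]
```

Here every misaligned input stream pays for an extra load and a shift per vector. Reducing the number of such shifts when streams have different compile-time or runtime alignments is the kind of problem the work described above addresses.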

Prior to working at IBM, I worked on applying modulo scheduling to new architectures, such as clustered architectures where the compiler explicitly handles communication among clusters of functional units. I also applied modulo scheduling to new domains, such as generating faster instrumentation code for gathering branch profiles, using modulo schedules that are distributed in the code. In that work, the best technique reduces the slowdown due to instrumentation by a factor of 10.

I also extended my area of expertise to straight-line code by investigating scheduling algorithms for superblocks, where the most efficient algorithm (Balance) explicitly delays some branches to reduce average execution time. This algorithm can reduce the slowdown relative to the lower bound by a factor of 2.

To achieve a higher degree of performance, superblocks including multiple predicated execution traces (i.e., hyperblocks) have been extensively used. However, prior to my work, hyperblocks could not be optimized to the same extent as single-path regions, because some conditions along one path may prevent useful optimization along some other paths. I proposed an approach that enables single-path optimizations in hyperblocks by selectively renaming registers and replicating operations in the hyperblock. Renaming and replicating are performed only when they enable an optimization that breaks a critical dependence. Measurements indicate that large speedups are possible, e.g., up to 66% for wc, 8% for li, and 7% for compress.

I received a Diploma in Computer Science at Eidgenössische Technische Hochschule, Zürich, Switzerland in 1991. I studied at the Computer and Electrical Engineering Department at the University of Michigan, Ann Arbor and received M.S. and Ph.D. degrees in Computer and Electrical Engineering in 1993 and 1996, respectively. I was an Assistant Professor at the Department of Electrical and Computer Engineering at North Carolina State University before joining IBM Research at the IBM T.J. Watson Research Center in 2001. I have published more than 20 refereed papers in journals and conferences including MICRO, PACT, PLDI, CGO, and ICS.

Peng Wu is currently a research staff member in the High Performance Software Environment group at the IBM T.J. Watson Research Center. She has been part of the Cell compiler team since its inception. In the early days, Dr. Wu was heavily involved in the prototyping of the XL compiler backend for the Cell architecture and in defining C/C++ extensions for SPE intrinsics. Since 2003, Dr. Wu has directed her research efforts to addressing compilation issues arising from vectorizing for SIMD architectures (a.k.a. simdization). She is a primary contributor to automatic simdization in the Cell compiler. Currently, Dr. Wu leads the effort to build a general simdization infrastructure that targets a variety of SIMD units in IBM processor families including the Cell processor, VMX for PPC970, and the Dual FPU unit for BlueGene/L. Her research interests include program analysis, compilation for SIMD architectures and multimedia applications, and language and compiler solutions to achieve better productivity. Peng Wu received her doctoral degree in Computer Science from the University of Illinois at Urbana-Champaign in 2001.
