IBM Research

Home

What's new

Methodology

Architecture

Compiler

IDE

Publications

Presentations

Patents

Related Work

eLite DSP compiler

The eLite DSP compiler is based on the IBM VLIW Research Compiler originally designed for tree-VLIW processors. The compiler uses an enhanced form of Dependence Flow Graph (DFG)  for its internal representation, and also extensively uses static single assignment (SSA) and reverse-SSA forms. It has a repertoire of standard SSA-based optimizations, and provides a rich collection of functions/macros for compiler development. Among other features, the eLite DSP compiler provides an excellent platform for developing and exploring new compilation techniques.

New optimizations and enhancements were introduced to the compiler in order to generate efficient code for the eLite DSP architecture. After transforming the dependence-flow graph through a number    of  optimizations, a novel vectorization phase tries to identify vectorizable code sequences and updates the DFG with a vectorized version of loops. Separating vectorization from instruction  scheduling   and  register allocation has advantages such as reducing the complexity of the code generator, and offering flexibility to schedule and software-pipeline vectorized code along with scalar  code. The   vectorized  code is scheduled and register allocated before emitting the final Assembly code.

The compiler also makes use of predication to collapse small blocks of conditionally executed  code, thus   eliminating  branches and their overhead. Among other optimizations, the compiler does function inlining, software-pipelining, synthetic branch frequency based optimizations, as well  as  inter-procedural  analysis.

The  internal representation of the compiler uses primitive operations similar to those found in RISC processors. Optimizations prior to scheduling and register  allocation  are performed on a  DFG with such  operations as nodes. The instruction selector provides a binding between such RISC operations and instructions in the eLite architecture. Instruction  selection is  carried out along with  instruction  scheduling based on a number of factors such as data type and size of operands, automatic update of registers, resource requirements, etc.

A  vectorization phase  has been developed to make  efficient use of  the SIMD instructions and the unique Vector Element Register file (VER) of the eLite DSP. The VER is modeled as a compiler-controlled  memory which can be  indirectly addressed using  vector pointer registers.  Because all VER allocations are done by the compiler, complete aliasing information for VER accesses is available. The  compiler takes advantage of  this fact by applying  aggressive memory-related  optimizations.

The high-level vectorization scheme is similar to the unroll-and-jam technique - the loop is unrolled by a  factor of 4, and scalar  operations are replaced  with their SIMD versions. In some  cases, the compiler performs additional optimizations, such as pre-loading data into the VER before the loop to  eliminate redundant loads. An  important stage of  vectorization is setting up vector pointers.  Flexibility of VER access through vector pointers allows efficient compilation of computations that  involve non-consecutive data access  patterns. This  aggressive vectorization optimization is based on a set  of innovative analyses, on top of the standard vectorization tests such as . The compiler  analyzes memory access patterns in  order to effectively  vectorize loads and stores within each iteration. Loop  context analysis eliminates memory loads and stores across iterations and between  different loops. Additional  transformations which expose more  parallelism, such as loop transformations, continue to undergo  development.

The register name space in the eLite DSP architecture is  functionally partitioned into  multiple register files associated  with respective functional units. This results in a set of register  types such as integer register (IR), address register (AR), vector  element register (VER), and  vector accumulator register (VAR). The  compiler is responsible for assigning each register Def to a specific  register file/type, and for inserting explicit instructions  moving the data between  different register files. Given a register Def,  the operation defining it, and the operations using it, the choice of a  register file is guided by the following considerations:

 1. Availability of operations in different functional units
2. Performance issues, such as pipeline latencies of instructions, possibilities of VER reuse, etc.

The partitioning process based on  the  above is augmented with an efficient min-cut graph  partitioning  based scheme that minimizes the communication between the register files in order to reduce the number of move instructions

 Resource modeling

The primary goal of the resource model in a compiler is one of providing an  abstract  view of the processor to the machine-dependent  optimization routines. A  cycle-accurate model of the processor has been developed to model pipeline resources, register file ports, and the  interconnect  bus. We have also developed some low-overhead  techniques to model non-uniform  port access delays resulting from microarchitecture techniques that reduce power and hardware complexity,  such as  restricted bypasses. In addition, the resource model  provides internal-to-ISA instruction  mapping information for the instruction selector, resource conflict checking and reservation  functions  required during scheduling, and register Def/Use timing  information required during register allocation.  The entire resource model generation is automated and table-driven, based on  machine-generated  architectural descriptions shared by other tools, which  also helps reduce the turn-around time for architectural  exploration.

 Scheduling and register allocation

Generating efficient code for long non-interlocked  pipelines is a big challenge by itself. When  scheduling instructions, the  compiler must have accurate timing information about the  access to data (read and write) and to the shared resources used by the  instructions. Conservative bounds are  used while compiling certain  regions of code (e.g., across calls) where complete timing  information is not available, which in turn prevent the compiler from carrying  out aggressive scheduling  techniques, such as scheduling  instructions in the shadow of other instructions, in those  regions.

The code generator considers reducing the code size as an objective in addition  to minimizing the  schedule length. We use several  new techniques for scheduling and cycle-level register  allocation. These techniques include schemes for scheduling instructions in the branch delay slots and   instruction shadows, for scheduling  predicated code, and for software pipelining.

Inter-procedure optimizations

The compiler performs  inter-procedural analysis for reducing the calling convention  overhead by allowing callee to modify the calling conventions according to the intended usage of  the  parameters. Other inter-procedure  optimizations being developed include schemes for sharing VER  among procedures and for obtaining better bounds on resource utilization at procedure calls and  returns.

We  believe that the eLite DSP  compiler, with the help of its powerful optimizations and  further tuning, can generate executable code from source code written in C with performance and code  size comparable to  hand-optimized  assembly-language code. Preliminary results show that, for a set  of signal-processing kernels, the compiler generates vectorized code of comparable performance (well  within a factor of 2) to  carefully  hand-coded assembly.

See the Publications and Presentations sections for further information regarding the eLite DSP compiler.

 Privacy | Legal | Contact | IBM Home | Research Home | Project List | Research Sites | Page Contact