IBM Research

eLite DSP architecture

The eLite DSP architecture is an advanced architecture tailored for the execution of digital signal processing algorithms, such as those found in wireless communication standards (2.5G, 3G), DSL, wireless local area networks, voice-over-IP, video decoding and encoding, audio decoding and encoding, and so on. The architecture, which includes support for executing control code and data operations, contains facilities for exploiting instruction-level parallelism (simultaneous execution of multiple independent instructions packed as long-instruction words, LIWs), data parallelism through vector operations with independent access to the elements of the vectors (single-instruction multiple-disjoint-data, SIMdD), and data parallelism through vector instructions with data packed in registers (single-instruction multiple-data with subword parallelism or packed data, SIMpD). The combination of instruction-level parallelism and data parallelism enables high performance at low clock frequencies.

The eLite DSP architecture is especially targeted at ultra-low-power implementations. To that effect, the architecture relies on static scheduling of instructions, static vectorization of operations, predicated instructions, visible latencies (an "exposed pipeline"), and other related techniques that are applied during code generation. As a result, the code executed in a processor that implements the architecture is expected to be generated mostly by an optimizing compiler, which ensures that the constraints an implementation imposes on the code it executes are properly fulfilled. Code can also be written and optimized at the assembly language level; such code can be mixed with code generated by a compiler.

The architecture also includes support for reducing power consumption during the execution of a program, such as the ability to shut down functional units or to enable hardware blocks on demand, by hardware or software, depending on the specific block.

Block diagram of eLite processor

The figure above is a logical representation of an eLite processor, which consists of the following functional units:

  • Branch unit: generates the storage address for the next long-instruction to be fetched from memory, either through sequential addressing or through branches, and performs logic operations on the Condition Registers.
  • Integer unit: performs operations on data in Integer (General-Purpose) Registers.
  • Storage Access unit: interacts with the data storage to transfer data (8-bit, 16-bit, 32-bit, and 64-bit) between internal registers and storage, and performs operations on data in Address Registers, which are used to address storage.
  • Vector Pointer unit: performs operations on Vector Pointer Registers, which are used to access the contents of Vector Element Registers.
  • Vector Element unit: performs operations on 16-bit integer and fractional data stored in Vector Element Registers. The number of these registers is implementation-dependent, and ranges from 64 to 4096.
  • Vector Accumulator unit: performs operations on 32-bit integer and 32-bit or 40-bit fractional data stored in Vector Accumulator Registers, including reduction operations on the elements of a vector.
  • Vector Mask unit: performs logic operations on the Vector Mask Registers.

The Integer unit and the Storage Access unit correspond to scalar units, operating mostly on integer data. In contrast, the Vector Element unit and the Vector Accumulator unit operate on 4-element vectors in SIMD fashion.

Program execution

A program in the eLite DSP architecture consists of a sequence of long-instruction words (LIWs), each containing a 4-bit prefix and one, two, or three instructions whose individual sizes can be 16, 20, 24, 30, or 60 bits. A long-instruction is the minimum unit of program addressing, and is represented in memory as a 64-bit entity. Branching into an instruction other than the first instruction of a LIW is not possible. A processor fetches LIWs from instruction storage for execution; all instructions contained in a LIW are dispatched for simultaneous execution, unless the LIW prefix specifies that they are to be executed serially.
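The size arithmetic above can be checked with a short sketch. It assumes that the instructions in a LIW must fill the 60 bits left after the prefix exactly, which the text does not state; the function names are hypothetical and exist only for illustration.

```python
from itertools import combinations_with_replacement

LIW_BITS = 64                          # a LIW is a 64-bit entity
PREFIX_BITS = 4                        # 4-bit prefix
INSTRUCTION_SIZES = (16, 20, 24, 30, 60)

def fits_exactly(sizes):
    """True if one to three instructions of these sizes fill the payload exactly.

    Assumption: no padding bits are allowed inside a LIW.
    """
    return (1 <= len(sizes) <= 3
            and all(s in INSTRUCTION_SIZES for s in sizes)
            and sum(sizes) == LIW_BITS - PREFIX_BITS)

def exact_packings():
    """Enumerate size combinations (order ignored) that fill the 60-bit payload."""
    found = []
    for count in (1, 2, 3):
        for combo in combinations_with_replacement(INSTRUCTION_SIZES, count):
            if fits_exactly(combo):
                found.append(combo)
    return found

print(exact_packings())
```

Under that assumption, only four size combinations are possible: one 60-bit instruction, two 30-bit instructions, three 20-bit instructions, or one each of 16, 20, and 24 bits.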

All instructions, regardless of their length, contain a fixed-size opcode in bits 0:7 specifying the operation to be performed. Some instructions specify an expanded opcode field in bits 18:19. Instructions whose length is 30 bits specify additional opcode information in bits 28:29.
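As an illustration of the field positions just listed, the sketch below extracts a bit range from an instruction word. It assumes that bit 0 is the most significant bit of the instruction (the numbering commonly used in IBM architecture documents); the text does not state the bit order, and the helper name `field` is hypothetical.

```python
def field(word, width, first, last):
    """Extract bits first..last (inclusive) from a width-bit instruction word.

    Assumption: bit 0 is the most significant bit of the instruction.
    """
    length = last - first + 1
    shift = width - 1 - last           # distance of the field from the low end
    return (word >> shift) & ((1 << length) - 1)

# A made-up 30-bit instruction: opcode 0x5C in bits 0:7, 0b10 in bits 28:29.
word30 = (0x5C << 22) | 0b10
print(hex(field(word30, 30, 0, 7)), field(word30, 30, 28, 29))
```

Note that, under this numbering, bits 18:19 are the last two bits of a 20-bit instruction and bits 28:29 are the last two bits of a 30-bit instruction.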

Similar to RISC processors, no instruction in the eLite architecture can perform a computational operation on data in memory. Conversely, no instruction other than store instructions can modify storage. To use a storage operand in a computation, the contents of storage must first be loaded into a register, and the operation is performed on the contents of the register. Similarly, to use a storage operand in a computation and then modify the same or another storage location, the contents of storage must be loaded into a register, modified, and then stored back to the target location. Direct Memory Access (DMA) operations may alter the storage contents independently.

The preferred programming model consists of loading many data elements from storage into the registers, in particular into the Vector Element Registers and Vector Accumulator Registers, and then operating on the contents of the registers with few intervening accesses to storage. Vector Element Registers are accessed indirectly through Vector Pointer Registers, so that vectors are dynamically composed from four arbitrary Vector Element Registers (SIMdD operation). Every Vector Element instruction specifies one or two Vector Pointer Registers, which in turn specify the four Vector Element Registers actually used by the instruction. In contrast, each Vector Accumulator Register (VAR) contains four 40-bit elements, which are accessed simultaneously in the order in which they are stored in the VAR and used in that same order as operands to Vector Accumulator unit instructions (SIMpD operation). The elements of Vector Accumulator Registers, in conjunction with a Reduction Register, are also used as operands to a special reduction unit.

Vector units are characterized by a cascaded SIMD programming model: 16-bit data are loaded from adjacent memory locations into arbitrary locations in the Vector Element Register file (VER), 16-bit operations are performed on data from arbitrary locations within the VER (SIMdD), and the results are placed into the 4x40-bit Vector Accumulator Register file (VAR) in packed manner (in a single register). 32-bit data can also be loaded directly from memory, in packed form, into the VAR; operations are performed on packed data read from the VAR (SIMpD), and the results are placed in the same register file. Packed data can be transferred from the VAR file into the 16-bit VER file with arbitrary placement, after a suitable size reduction operation, or can be placed into adjacent memory locations. Because data sizes vary, data is allocated to units according to its natural data type (size) throughout the computations.
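The cascaded flow can be pictured with a small behavioral sketch. It is only a model of the data movement described above, not of the real instruction set: the function names are hypothetical, memory is modeled as a Python list, and register widths (16 and 40 bits) are not enforced.

```python
def load_vector(memory, addr, ver, vpr):
    """Copy four adjacent memory values into the VER entries named by vpr."""
    for k, reg in enumerate(vpr):
        ver[reg] = memory[addr + k]

def vector_element_multiply(ver, vpr_a, vpr_b):
    """SIMdD step: multiply the VERs selected by two pointer registers.

    Returns a packed 4-element result, as it would be held in one VAR.
    """
    return [ver[a] * ver[b] for a, b in zip(vpr_a, vpr_b)]

def vector_accumulator_add(var_x, var_y):
    """SIMpD step: element-wise add of two packed accumulator registers."""
    return [x + y for x, y in zip(var_x, var_y)]

def reduce_sum(var):
    """Reduction over the four elements of a packed accumulator register."""
    return sum(var)

memory = [1, 2, 3, 4, 5, 6, 7, 8]
ver = [0] * 64                                  # smallest VER file size
load_vector(memory, 0, ver, [10, 20, 30, 40])   # arbitrary placement
load_vector(memory, 4, ver, [11, 21, 31, 41])
products = vector_element_multiply(ver, [10, 20, 30, 40], [11, 21, 31, 41])
acc = vector_accumulator_add([0, 0, 0, 0], products)
print(products, reduce_sum(acc))
```

The point of the sketch is the asymmetry between the two files: VER entries are chosen individually through pointers, while a VAR is always read and written as one packed group of four.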

Instructions are statically scheduled, taking into consideration their utilization of resources throughout the pipeline and the data dependencies with their dependent instructions (the "exposed pipeline" execution model). Most instructions are processed in six stages. Vector element instructions use one extra stage to read the Vector Pointer Registers and the succeeding stage to read the Vector Element Registers, whereas memory instructions use dedicated stages for transferring the address and data from the processor to the memory subsystem. All instructions that are dispatched in the same cycle read the contents of their source registers at the same time, with the exception of Vector Element Registers, which are read in the cycle after the associated Vector Pointer Registers are read. An instruction completes execution when its results are placed in their destination locations; instructions complete execution according to their individual latencies. Instructions contained wholly within a functional unit have the same latency, with the exception of branches, which are resolved earlier; instructions in different units, or instructions that place their result in a register in a different unit, may exhibit different latencies.
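Because latencies are visible, it is the code generator that must keep a consumer far enough behind its producer. The following sketch shows the kind of check a scheduler might perform. The latency values, unit names, and schedule format are all hypothetical, since the text gives no actual latencies.

```python
def stale_reads(schedule, latency):
    """Report source reads that occur before the producer's result is written.

    schedule: list of (cycle, unit, dest, sources) tuples.
    latency:  cycles from dispatch until a unit's result is visible
              (hypothetical values, not taken from the architecture).
    """
    ready = {}        # register name -> first cycle its new value is visible
    problems = []
    for cycle, unit, dest, sources in sorted(schedule, key=lambda e: e[0]):
        for src in sources:
            if ready.get(src, 0) > cycle:
                problems.append((cycle, src))
        ready[dest] = cycle + latency[unit]
    return problems

latency = {"vec": 3, "int": 1}                  # assumed for illustration
schedule = [
    (0, "vec", "va0", ("ve0", "ve1")),
    (1, "int", "r1", ("va0",)),                 # too early: va0 visible at cycle 3
    (3, "int", "r2", ("va0",)),                 # fine
]
print(stale_reads(schedule, latency))
```

In an exposed-pipeline machine the early read is not stalled by hardware; it simply sees the old register contents, which is why the check belongs in the compiler or assembler.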

Instructions other than vector instructions can be predicated by specifying a condition that is evaluated dynamically, at execution time. The predicate is specified in a Condition Register. An instruction whose predicate evaluates to false is not completed; such an instruction is simply discarded. Vector instructions are not predicated as a whole; instead, each individual operation within a vector instruction can be executed conditionally (i.e., predicated) under control of a mask which is evaluated dynamically. The mask is specified in a Vector Mask Register.
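The difference between the two forms of predication can be sketched as follows. The sketch assumes that a discarded operation leaves its destination unchanged, which is the usual meaning of predication but is not spelled out in the text; the function names are hypothetical.

```python
def predicated_add(condition, dest, a, b):
    """Scalar predication: the whole instruction completes or is discarded."""
    return a + b if condition else dest

def masked_vector_add(mask, dest, a, b):
    """Vector masking: each of the four element operations is predicated
    separately by the corresponding mask element."""
    return [x + y if m else d for m, d, x, y in zip(mask, dest, a, b)]

print(predicated_add(False, 7, 1, 2))
print(masked_vector_add([True, False, True, False],
                        [0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40]))
```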

Vector units

The most salient computing resources within the eLite architecture are the three types of Vector Units. All these units can operate in parallel.

eLite Vector Units

Source operands for 16-bit Vector Element arithmetic and logic instructions (16-bit datapath) always originate from the Vector Element Register file. The destination of all 16-bit vector element operations is a Vector Accumulator Register (40-bit result). Each block within the 16-bit datapath consists of a multiplier and an ALU that performs arithmetic, logic, shift, and select operations on the contents of Vector Element Registers.

Every access to the Vector Element Register file is performed indirectly through a Vector Pointer Register. Each Vector Pointer Register contains four elements which are used as indexes into the Vector Element Register file, allowing access to four independent Vector Element Registers. The Vector Pointer Registers can be automatically updated when used to access Vector Element Registers, and automatically implement "circular addressing" within a range of the Vector Element Register file.
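A minimal sketch of indirect access with circular update follows. The parameters `stride`, `base`, and `length` are assumptions made for illustration; the text says only that pointers are updated automatically and wrap within a range of the register file, not how the range or step is specified.

```python
def gather(ver, vpr):
    """Read the four Vector Element Registers indexed by a pointer register."""
    return [ver[i] for i in vpr]

def update_circular(vpr, stride, base, length):
    """Advance every pointer element by stride, wrapping each one inside
    the window [base, base + length) of the register file."""
    return [base + (i - base + stride) % length for i in vpr]

ver = list(range(100, 164))        # 64 registers holding made-up values
vpr = [6, 7, 0, 1]                 # four independent indexes
print(gather(ver, vpr))
vpr = update_circular(vpr, stride=4, base=0, length=8)
print(vpr, gather(ver, vpr))
```

With an 8-entry window, the pointer elements 6 and 7 wrap around to 2 and 3 instead of leaving the window, which is what allows a small region of the register file to act as a circular buffer.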

Vector Accumulator Registers are used as source and destination for 40-bit Vector Accumulator arithmetic and logic instructions (40-bit datapath). Vector Accumulator Registers are also used as destination for 16-bit vector operations, as well as source operands in instructions to convert data from 40-bit to 16-bit whose result is placed in Vector Element Registers. Regardless of the instruction type, Vector Accumulator Registers are accessed as 40-bit quantities. In the case of conversion to 16-bit, saturation and rounding rules can be applied.
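One plausible form of such a conversion is sketched below. The specific rules are assumptions, since the text does not define them: the sketch drops the low 16 bits, optionally rounds half up, and optionally saturates to the signed 16-bit range; the function and parameter names are hypothetical.

```python
INT16_MIN = -(1 << 15)
INT16_MAX = (1 << 15) - 1

def narrow_40_to_16(acc, shift=16, rounding=True, saturate=True):
    """Reduce a signed 40-bit accumulator value to a signed 16-bit value.

    Assumed rules: shift right by `shift` bits, round half up, then either
    clamp to the 16-bit range or wrap around modulo 2**16.
    """
    if rounding and shift > 0:
        acc += 1 << (shift - 1)            # add half of the discarded weight
    value = acc >> shift                   # arithmetic shift (floors)
    if saturate:
        return max(INT16_MIN, min(INT16_MAX, value))
    return ((value + (1 << 15)) & 0xFFFF) - (1 << 15)   # two's-complement wrap

print(narrow_40_to_16(0x18000))                    # 1.5 in units of 2**16
print(narrow_40_to_16(0x18000, rounding=False))
print(narrow_40_to_16((1 << 39) - 1))              # largest 40-bit value
```

Saturation matters here because a 40-bit accumulator can hold values far outside the 16-bit range; without it, the narrowed result would wrap and change sign.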

Computing capabilities in a LIW

Since a LIW may contain up to three instructions, the architecture supports "3-wide" instruction-level parallelism. Some of these instructions are compound instructions which specify more than one operation. Moreover, vector instructions specify operations on vectors with four elements. As a result, a single LIW can specify a large number of basic operations. For example,

  • a single LIW may contain a Load Vector with Update instruction, a Vector Element Multiply instruction, and a Vector Accumulator Add instruction, each of which produces a vector with four elements as its result; that is, four operations each, for a total of twelve operations;
  • the Load Vector with Update instruction implicitly specifies the automatic update of the Address Register and of the four elements of the Vector Pointer Register used by the instruction, including support for circular addressing on both, adding another five operations to the set performed within the LIW;
  • the Vector Element Multiply instruction implicitly specifies the update of the two Vector Pointer Registers used to access the Vector Element Registers containing the operands for the instruction, including circular addressing within the Vector Element Register file, adding two sets of four update operations (eight in total) to the computations specified in the LIW.

Consequently, such a single LIW actually specifies transformations on 25 data items, for a total of 25 basic operations.
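The count can be reproduced with a few lines of arithmetic. The variable names are illustrative only; the numbers come directly from the example above.

```python
VECTOR_WIDTH = 4

def operations_in_example_liw():
    """Count the basic operations in the example LIW described above."""
    # Load Vector, Vector Element Multiply, Vector Accumulator Add:
    # three instructions, four element operations each.
    data_ops = 3 * VECTOR_WIDTH
    # Load Vector with Update: one Address Register update plus the
    # four elements of one Vector Pointer Register.
    load_updates = 1 + VECTOR_WIDTH
    # Vector Element Multiply: two Vector Pointer Registers, four elements each.
    multiply_updates = 2 * VECTOR_WIDTH
    return data_ops + load_updates + multiply_updates

print(operations_in_example_liw())
```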

See the Publications and Presentations sections for further information regarding the eLite DSP compiler.
