Very-Long Instruction Word (VLIW) architectures are a suitable alternative
for exploiting instruction-level parallelism (ILP) in programs, that is,
for executing more than one basic (primitive) instruction at a time. These
processors contain multiple functional units, fetch from the instruction
cache a Very-Long Instruction Word containing several primitive instructions,
and dispatch the entire VLIW for parallel execution. Compilers exploit
these capabilities by generating code in which independent primitive
instructions are grouped together for parallel execution. The processors
have relatively simple control logic because they perform no dynamic
scheduling or reordering of operations (as most contemporary superscalar
processors do).
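
The execution model described above can be illustrated with a minimal
sketch. It is not a model of any particular processor: the three-operand
operation format, the register names, and the functions execute_word and
run are assumptions made for illustration only. Each VLIW is a list of
primitive operations; all of them read the register state as it was before
the word issued, and their results are written back together.

```python
import operator

# Hypothetical primitive operations: (opcode, dest, src1, src2).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def execute_word(regs, word):
    """Execute all operations of one VLIW against the same register snapshot."""
    # Every operation reads the registers as they were before the word issued.
    results = [(dest, OPS[op](regs[a], regs[b])) for op, dest, a, b in word]
    # All results are then written back together (lock-step completion).
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs

def run(regs, program):
    """Execute a sequence of VLIWs in order; no reordering happens at run time."""
    for word in program:
        regs = execute_word(regs, word)
    return regs

program = [
    # Two independent operations grouped by the compiler into one word.
    [("add", "r3", "r1", "r2"), ("mul", "r4", "r1", "r2")],
    # This operation needs both results, so it goes in the next word.
    [("sub", "r5", "r4", "r3")],
]
final = run({"r1": 2, "r2": 5}, program)
print(final["r3"], final["r4"], final["r5"])  # prints: 7 10 3
```

Note that the sketch contains no dependence checking at all: the grouping
in the program is trusted, which is the sense in which the control logic
stays simple.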
VLIW has been described as a natural successor to RISC, because
it moves complexity from the hardware to the compiler, allowing simpler,
faster processors. As stated in Microprocessor Report (2/14/94):
    "The objective of VLIW is to eliminate the complicated instruction
    scheduling and parallel dispatch that occurs in most modern
    microprocessors. In theory, a VLIW processor should be faster and
    less expensive than a comparable RISC chip."
The instruction set for a VLIW architecture tends to consist of simple,
RISC-like instructions. The compiler must assemble many primitive
operations into a single "instruction word" so that the multiple
functional units are kept busy, which requires enough
instruction-level parallelism (ILP) in a code sequence to fill the
available operation slots. The compiler uncovers such parallelism by
scheduling code speculatively across basic blocks, performing software
pipelining, and reducing the number of operations executed, among other
techniques.
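
As a rough illustration of how operations are assembled into instruction
words, the sketch below greedily packs operations into words of a fixed
width. It is a simplified stand-in, not the algorithm used by any
particular VLIW compiler: it assumes unit latencies and a single basic
block, and the function pack_into_words and the operation names are
hypothetical.

```python
def pack_into_words(ops, deps, width):
    """Greedily pack operations into words of at most `width` slots.

    ops   : operation names in original program order
    deps  : dict mapping an operation to the set of operations it depends on
    width : number of operation slots (functional units) per word
    """
    scheduled = {}          # operation -> index of the word that holds it
    words = []
    remaining = list(ops)
    while remaining:
        word = []
        for op in remaining:
            if len(word) == width:
                break
            # An operation is ready only if every operation it depends on
            # was placed in an EARLIER word (unit latency assumed).
            if all(p in scheduled for p in deps.get(op, ())):
                word.append(op)
        if not word:
            raise ValueError("cyclic dependences: nothing can be scheduled")
        for op in word:
            scheduled[op] = len(words)
        words.append(word)
        remaining = [op for op in remaining if op not in scheduled]
    return words

ops = ["a", "b", "c", "d", "e"]
deps = {"c": {"a", "b"}, "d": {"a"}, "e": {"c", "d"}}

print(pack_into_words(ops, deps, 2))  # [['a', 'b'], ['c', 'd'], ['e']]
print(pack_into_words(ops, deps, 4))  # [['a', 'b'], ['c', 'd'], ['e']]
```

With four slots per word the schedule is no shorter than with two: this
code sequence does not contain enough ILP to fill the extra slots, which
is why techniques that expose more parallelism matter.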
VLIW has been perceived as suffering from important limitations, such
as the need for a powerful compiler, increased code size arising from
aggressive scheduling policies, higher memory and register-file
bandwidth requirements, limitations due to lock-step operation, and the
difficulty of maintaining binary compatibility across implementations
with varying numbers of functional units and latencies. In recent years,
there has been significant progress on these issues, owing both to general
advances in semiconductor technology and to VLIW-specific work. For
example, our
tree-based VLIW architecture provides binary
compatibility for VLIW implementations of varying width, and our
VLIW compiler contains state-of-the-art
parallelizing/optimizing algorithms.