E.R. Altman, M. Gschwind
ISCA 2004 Tutorial
München, Germany, June 2004
Abstract
We describe the use of dynamic compilation technology in dynamically
optimizing code to achieve peak performance. This is based on
collecting and exploiting runtime system information to dynamically
reoptimize code for specific workload bahvior. The dynamic runtime
optimization system operates on a simple static architecture designed
to achieve high ILP and great clock rates. We believe that this
combination of raw execution speed and high-level code adaptation is
an attractive way to build future architectures. While dynamic
compilation and code optimization techniques described in this talk
can be used to optimize code for native PowerPC execution, we believe
that even greater benefits can be obtained by combining the runtime
environment with a specially designed architecture, which provides
additional capabilities such as additional rename registers and
hardware support for exception recovery. In addition to these
optimization advantages, the virtualization layer introduced by the
dynamic compilation system offers the ability to customize the
underlying execution engine and completely redefine the hardware
interface, while maintaining binary compatibility at the software
level (at either the program or operating system level, depending on
the implementation choices made).
The DAISY system first introduced the techniques to dynamically
optimize code in 1996, and since then, a number of dynamic compilation
systems based on this technoology have expanded possible uses, such as
the IBM BOA porject, the University of Wisconsin's Co-Designed Virtual
Machines project, Transmeta's Crusoe processor technology, HP's Dynamo
dynamic optimization system, and the Itanium-based Aries and IA32-EL
execution layers.
[Presentation foils (pdf) ] [DAISY and BOA Bibliography (sorted by date)]
[ Top ]
M. Gschwind, E.R. Altman
Workshop on Complexity-Effective Design 2002
Anchorage, AK, May 2002.
Abstract
Based on the conviction that modern superscalar out-of-order designs
squandered useful resources for little incremental gain, the BOA team
embarked on a design effort to develop an architecture where
computational elements dominated the design. At the same time, we
wanted to preserve the ability to adapt to changing workload behavior
dynamically, but without the overhead inherent in traditional
out-of-order designs. We turned to maturing dynamic compilation
technology to achieve dynamic adaptability, while keeping core
complexity low.
[Presentation foils (pdf) ]
[ Top ]
M. Gschwind, E.R. Altman
2002 Symposium on Compiler Construction,
Grenoble, France, April 2002
© Springer Verlag 2002
Abstract
In this work, we demonstrate how aggressive optimization can be used
in conjunction with dynamic compilation without the need for
specialized hardware. The approach is based on maintaining enough
state to recompute the processor state when an unpredicted event such
as a synchronous exception may make otherwise dead processor state
visible. The transformations necessary to preserve precise exception
capability can be performed in linear time.
[Presentation foils (pdf) ]
[ Top ]
M. Gschwind, E.R. Altman
Workshop on Binary Translation 2000,
Philadelphia, PA, October 2000.
Abstract
Maintaining precise exceptions is an important aspect of achieving
full compatibility with a legacy architecture. While asynchronous
exceptions can be deferred to an appropriate boundary in the code,
synchronous exceptions must be taken when they occur. This introduces
uncertainty into liveness analysis since processor state that is
otherwise dead may be exposed when an exception handler is invoked.
Previous systems either had to sacrifice full compatibility to achieve
more freedom to perform optimization, use less aggressive optimization
or rely on hardware support.
[Updated and extended foils presented at CC'02 (pdf) ]
[ Top ]
M. Gschwind, K. Ebcioglu, E.R. Altman, S. Sathaye
International Conference on Supercomputing 2000 (ICS '00)
Santa Fe, New Mexico, May 2000.
Abstract
We describe the design issues in an implementation of the ESA/390
architecture based on binary translation to a very long instruction
word (VLIW) processor. During binary translation, complex ESA/390
instructions are decomposed into instruction primitives which are
then scheduled onto a wide-issue machine. The aim is to achieve
high instruction level parallelism due to the increased scheduling
and optimization opportunities which can be exploited by binary
translation software, combined with the efficiency of long instruction
word architectures. A further aim is to study the feasibility
of a common execution platform for different instruction set architectures,
such as ESA/390, RS/6000, AS/400 and the Java Virtual
Machine, so that multiple systems can be built around a common
execution platform.
[ Foils (pdf) ]
[ Top ]
K. Ebcioglu, E.R. Altman, S. Sathaye, and M. Gschwind
Proceedings of MICRO-32,
Haifa, Israel, November 1999.
© ACM 1999
Abstract
We describe several optimizations which can be employed
in a it dynamic binary translation (DBT) system, where low
compilation/translation overhead is essential. These optimizations
achieve a high degree of ILP, sometimes even surpassing a static
compiler employing more sophisticated, and more time-consuming
algorithms. We present results in which we employ these
optimizations in a dynamic binary translation system capable of
computing oracle parallelism.
[Presentation foils (pdf, 588 KB) ]
[ Top ]
B.S. Yang, S.M. Moon, S. Park, J. Lee, S. Lee, J. Park, Y. C. Chung,
S. Kim, K. Ebcioglu, E. Altman
Proc. PACT '99,
pp. 128-138, IEEE Computer Society Press, October 1999.
© IEEE 1999
Abstract
For network computing on desktop machines, fast execution of
Java bytecode programs is essential because these machines are
expected to run substantial application programs written in
Java. Higher Java performance can be achieved by Just-in-Time
(JIT) compilers which translate the stack-based bytecode into
register-based machine code on demand. One crucial problem in
Java JIT compilation is how to map and allocate stack entries and
local variables into registers efficiently and quickly, so as to
improve the Java performance.
This paper introduces LaTTe, a Java JIT compiler that performs
fast and efficient register mapping and allocation for RISC
machines. LaTTe first translates the bytecode into pseudo RISC code
with symbolic registers, which is then register allocated while
coalescing those copies corresponding to pushes and pops between
local variables and the stack. The LaTTe JVM also includes an
enhanced object model, a lightweight monitor, a fast mark-and-sweep
garbage collector, and an on-demand exception handling mechanism,
all of which are closely coordinated with LaTTe's JIT compilation.
Our experimental results on the SPARC platform with SPECJVM98
benchmarks and 15 non-trivial Java programs indicate that the
current LaTTe JVMs achieve performance better than or comparable
to the latest SUN JIT compilers (JDK 1.1.6 and HotSpot). It
is also shown that LaTTe makes a reasonable trade-off between
quality and speed of register allocation (i.e., the translation
overhead consistently takes 1-2 seconds for SPECJVM98 which runs
40-80 seconds on LaTTe).
[ Top ]
S. I. Lee, B.S. Yang, S. Kim, S. Park, S.M. Moon,
K. Ebcioglu, E. Altman
1999 Workshop on Binary Translation (Binary99),
Newport Beach, California, October 1999.
Abstract
Exceptions are provided by the Java programming language in
order to handle errors more gracefully. However, exceptions
are rare in most programs, so that most catch blocks are left
unused. This paper describes an on-demand catch block translation
scheme which translates catch blocks only when they are actually
used. By ignoring exception handling while translating normal
flow, we can reduce translation overhead and make use of more
optimization opportunities. In addition, we directly connect
normal flow and catch blocks after an exception is thrown, which
results in faster exception handling. This is useful since many
exceptions thrown at certain points are usually of the same type
and handled by the same catch blocks.
From experimental results, we show that the existence of catch
blocks indeed do not interfere with the translation of normal
flow when on-demand catch block translation is done. Also, the
results show that direct connection along with method inlining
results in faster exception handling.
[ Top ]
S. Sathaye, P. Ledak, J. LeBlanc, S. Kosonocky, M. Gschwind, J.
Fritts, Z. Filan, A. Bright, D. Appenzeller, E. Altman, and C. Agricola
1999 Workshop on
Binary Translation (Binary99),
New Port Beach, California, October
1999.
Abstract
BOA is a processor designed to achieve high frequency by using
software dynamic binary translation. Processors for software binary
translation are very conducive to high frequency because they can
assume a simple hardware design. Binary translation eliminates the
binary compatibility problem faced by other processors, while dynamic
recompilation enables re-optimization of critical program code
sections and eliminates the need for dynamic scheduling hardware. In
this work we examine the implications of binary translation on
processor architecture and software translation and how we support a
very high frequency PowerPC implementation via dynamic binary
translation.
[Presentation foils (pdf) ]
[ Top ]
K. Ebcioglu, E. Altman, S. Sathaye, M. Gschwind
Proc. Europar '99 (P. Amestoy, P. Berger, M. Dayde, I. Duff, V.
Fraysse, L. Giraud, D. Ruiz, eds.),
pp. 1269-1280,
Lecture Notes in Computer Science 1685, Springer-Verlag 1999.
Abstract
We describe a new dynamic software scheduling technique for VLIW
architectures, which compiles into VLIW code the program paths
that are actually executed. Unlike trace processors, or DIF,
the technique executes operations speculatively on multiple
paths through the code, is resilient to branch mispredictions,
and can achieve very large dynamic window sizes necessary for
high ILP. Aggressive optimizations are applied to frequently
executed portions of the code. Encouraging performance results
were obtained on SPECint95 and TPC-C. The technique can be used
for binary translation for achieving architectural compatibility
with an existing processor, or as a VLIW scheduling technique in
its own right.
[Presentation foils (pdf) ]
[ Top ]
B.S. Yang, J. Lee, J. Park, S.M. Moon, and K. Ebcioglu.
ACM SIGARCH Computer Architecture News, March 1999.
Also in proc. of the Third Workshop on
Interaction Between Compilers and Computer Architectures (INTERACT-3).
In conjunction with ASPLOS-VIII, San Jose, CA, October 1998.
Abstract
This paper introduces the lightweight monitor in JAVA VM, that is fast on
single-threaded programs as well as on multi-threaded programs with little
lock contention. A 32 bit lock is embedded in each object for efficient
access, while the lock queue and the wait set is managed through a hash
table. The lock manipulation code is highly optimized and inlined by our
JAVA VM JIT compiler called LaTTe, wherever the lock is accessed. In most
cases, only 9 SPARC instructions are spent for lock acquisition, and 5
instructions for lock release. Our experimental results indicate that the
lightweight monitor is faster than the monitor in the latest SUN (TM) JDK
1.2 release candidate 1 by up to 21 times in the absence of lock
contention, and by up to 7 times in the presence of lock contention.
[ Top ]
K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. Altman,
K. Kailas, T. Bright
Proc. International Conference on Computer Design: VLSI in Computers and Processors,
1998,
pp. 488-495
Abstract
Presented is an 8-issue tree-VLIW processor designed for efficient
support of dynamic binary translation. This processor confronts
two primary problems faced by VLIW architectures: binary
compatibility and branch performance. Binary compatibility
with existing architectures is achieved through dynamic
binary translation which translates and schedules PowerPC
instructions to take advantage of the available instruction level
parallelism. Efficient branch performance is achieved through tree
instructions that support multi-way path and branch selection
within a single VLIW instruction. The processor architecture is
described, along with design details of the branch unit, pipeline,
register file and memory hierarchy for a 0.25 micron standard-cell
design. Performance simulations show that the simplicity of a
VLIW architecture allows a wide-issue processor to operate at
high frequencies.
[Presentation foils (pdf) ]
[ Top ]
M. Schlansker (HP), T.M. Conte (North Carolina State U.), J. Dehnert
(SGI), K. Ebcioglu (IBM), J.Z. Fang (Intel), C.L. Thompson (HP).
IEEE Computer Magazine
30(12), Dec. 1997,
pp. 63-69
(This was a report from a cross-industry task force on ILP)
Abstract
Discovering and exploiting instruction level parallelism in code
will be key to future increases in microprocessor performance. What
technical challenges must compiler writers meet to better use ILP?
Instruction level parallelism allows a sequence of instructions
derived from a sequential program to be parallelized for execution
on multiple pipelined functional units. If industry acceptance
is a measure of importance, ILP has blossomed. It now profoundly
influences the design of almost all leading edge microprocessors
and their compilers. Yet the development of ILP is far from
complete, as research continues to find better ways to use more
hardware parallelism over a broader class of applications.
[ Top ]
S.M. Moon, K. Ebcioglu
ACM Transactions on Programming Languages and Systems,
November 1997,
Vol. 19, No. 6, pp. pp. 853-898, ACM Press.
© ACM 1997
Abstract
Instruction-level parallelism (ILP) in nonnumerical code is
regarded as scarce and hard to exploit due to its irregularity. In
this article, we introduce a new code-scheduling technique for
irregular ILP called "selective scheduling" which can be used
as a component for superscalar and VLIW compilers. Selective
scheduling can compute a wide set of independent operations across
all execution paths based on renaming and forward-substitution and
can compute available operations across loop iterations if combined
with software pipelining. This scheduling approach has better
heuristics for determining the usefulness of moving one operation
versus moving another and can successfully find useful code motions
without resorting to branch profiling. The compile-time overhead
of selective scheduling is low due to its incremental computation
technique and its controlled code duplication. We parallelized the
SPEC integer benchmarks and five AIX utilities without using branch
probabilities. The experiments indicate that a fivefold speedup
is achievable on realistic resources with a reasonable overhead
in compilation time and code expansion and that a solid speedup
increase is also obtainable on machines with fewer resources. These
results improve previously known characteristics of irregular ILP.
[ Top ]
J.H. Moreno, M. Moudgill
1997 International Conference on Supercomputing
Vienna, Austria, July 7-11, 1997, pp. 1-11.
© ACM 1997
Abstract
We describe a representation of instruction-level parallelism which does not
require checking dependencies at run-time, and which is suitable for
processor implementations with varying issue-width. In this approach, a
program is represented as a sequence of tree-instructions, each containing
multiple primitive operations. These tree-instructions are executable either
in one or multiple cycles, do not require a specific processor organization,
and are generated assuming an implementation capable of large
instruction-level parallelism. During instruction cache reloading/accessing,
tree-instructions are decomposed into subtrees which fit the actual resources
available in an implementation. The resulting subtrees require a simple
instruction-dispatch mechanism, as in the case of statically scheduled
processors. The representation makes practical the use of the same
parallelized code in implementations with different issue-width; simulation
results indicate that the instruction-level parallelism achievable with this
approach degrades less than 10% with respect to code compiled for each
specific implementation.
[ Presentation foils (pdf, 80 KB) ]
[ Top ]
K. Ebcioglu, E.R. Altman
24th Annual International Symposium on Computer Architecture
Denver, Colorado, June 1997, pp. 26-37.
© ACM 1997
Abstract
Although VLIW architectures offer the advantages of simplicity of
design and high issue rates, a major impediment to their use is that
they are not compatible with the existing software base. We describe
new simple hardware features for a VLIW machine we call DAISY
(Dynamically Architected Instruction Set
from Yorktown). DAISY is specifically intended to
emulate existing architectures, so that all existing software for an
old architecture (including operating system kernel code) runs without
changes on the VLIW. Each time a new fragment of code is executed for
the first time, the code is translated to VLIW primitives,
parallelized and saved in a portion of main memory not visible to the
old architecture, by a Virtual Machine Monitor} (software)
residing in read only memory. Subsequent executions of the same
fragment do not require a translation (unless cast out).
We discuss the architectural requirements for such a VLIW, to deal
with issues including self-modifying code, precise exceptions, and
aggressive reordering of memory references in the presence of strong
MP consistency and memory mapped I/O. We have implemented the dynamic
parallelization algorithms for the PowerPC architecture. The
initial results show high degrees of instruction level parallelism
with reasonable translation overhead and memory usage.
[ Presentation foils (pdf, 186 KB) ]
[ Top ]
J.H. Moreno, M. Moudgill, K. Ebcioglu, E. Altman, B. Hall, R. Miranda,
S.K. Chen, A. Polyak
IBM Journal of Research and Development, Vol. 41, No. 3, May 1997
© IBM 1997
Abstract
We describe the environment used for the simulation and evaluation of a processor
architecture based on very long instruction word (VLIW) principles. In this
architecture, a program consists of a set of tree instructions, each one containing
multiple branches and operations which can be performed simultaneously. The
simulation/evaluation environment comprises
- An optimizing compiler, which generates tree instructions in a VLIW assembly
language.
- A translator from VLIW assembly code into PowerPC assembly code which emulates
the functionality of the VLIW processor for the specific VLIW program. The
emulating code also includes instrumentation for collecting execution counts
of VLIWs, profiling information, and generation of predecoded execution traces.
- A cycle timer, invoked by the emulating code on a VLIW-by-VLIW basis, which
processes VLIW execution traces as they are generated.
The environment supports the evaluation of alternatives and trade-offs among the
VLIW architecture, its compiler, and processor implementations. Emphasis has been
placed on providing fast turn-around time for the development of compilation
algorithms, and for an efficient compilation to simulation cycle which allows
analysis of architecture/compiler trade-offs over complete execution runs of
realistic workloads.
[ Top ]
M. Moudgill, J.H. Moreno, K. Ebcioglu, E. Altman, S.K. Chen, A. Polyak
IEEE TCCA Newsletter, June 1997, pp. 10-12.
Also in proceedings of the
Workshop on Interaction between Compilers and Computer Architectures,
1997 High-Performance Computer Architecture Conference (HPCA97),
San Antonio, February 1997.
© IEEE, June 1997.
Abstract
This paper describes a compilation and simulation environment designed to
explore the interaction among compiler and architecture for the case of a
very long instruction word (vliw) processor based on tree instructions.
The environment is characterized by its flexibility and fast turnaround time,
allowing the exploration of architecture/compiler trade-offs in several
dimensions over complete execution runs of standard benchmarks. Chameleon,
our research compiler, uses state-of-the-art optimizing techniques to
extract and exploit instruction-level parallelism. ForestaPC, the vliw
architecture, has an instruction set based on the PowerPC architecture.
The exploration of interactions and the evaluation of trade-offs has led
to the development of some novel ideas in the architecture as well as in
the compiler.
[ Top ]
K. Ebcioglu, R. Groves, K.C. Kim, G. Silberman, I. Ziv.
Proceedings of the 1994 Programming Languages Design and Implementation (PLDI) Conference,
pp.36-48, June 1994.
© ACM 1994
Abstract
We describe techniques for converting the intermediate code
representation of a given program, as generated by a modern
compiler, to another representation which produces the same
run-time results, but can run faster on a superscalar machine. The
algorithms, based on novel parallelization techniques for Very
Long Instruction Word (VLIW) architectures, find and place
together independently executable operations that may be far
apart in the original code, i.e., they may be separated by many
conditional branches or belonging to different iterations of
a loop. As a result, the functional units in the superscalar
are presented with more work than can proceed in parallel, thus
achieving higher performance than the approach of using hardware
instruction dispatch techniques alone.
While general scheduling techniques improve performance by
removing idle pipeline cycles, to further improve performance on a
superscalar with only a few functional units requires a reduction
in the pathlength. We have designed a new set of algorithms for
reducing pathlength and removing stalls due to branches, namely
speculative load-store motion out of loops, unspeculation, limited
combining, basic block expansion, and prolog tailoring. These
algorithms were implemented in a prototype version of the IBM
RS/6000 xlc compiler and have shown significant improvement in
SPEC integer benchmarks on the IBM POWER machines.
Also, we describe a new technique to obtain profiling information
with low overhead, and some applications of profiling directed
feedback, including scheduling heuristics, code reordering and
branch reversal.
[ Top ]
T. Nakatani, K. Ebcioglu,
IEEE Transactions on Parallel and Distributed Systems
4(9), Sept. 1993,
pp. 1014-1029.
Abstract
Compaction-based parallelization suffers from long compile time and
large code size because of its inherent code explosion problem. If
software pipelining is performed for loop parallelization along
with compaction, as in our compiler, the code explosion problem
becomes more serious. In this paper, we propose the software
lookahead heuristic for use in software pipelining, which allows
inter-basic-block movement of code within a prespecified number
of operations, called the software lookahead window, on any path
emanating from the currently processed instruction at compile
time. Software lookahead enables instruction-level parallelism
to be exploited in a much greater code area than a single basic
block, but the lookahead region is still limited to a constant
depth by means of a user-specifiable window, and thus code
explosion is restricted. The proposed scheme has been implemented
in our VLIW parallelizing compiler. To study the code explosion
problem and instruction-level parallelism for branch-intensive
code, we compiled five AIX utilities: sort, fgrep, sed, yacc,
and compress. It is demonstrated that the software lookahead
heuristic effectively alleviates the code explosion problem while
successfully extracting a substantial amount of inter-basic-block
parallelism.
[ Top ]
S.M. Moon, K. Ebcioglu.
Proceedings of
MICRO-26,
1993,
pp. 49-58.
Abstract
One of the key design concerns of multiple instruction issue (MII)
processors is deciding how many memory ports need to be provided,
considering performance and efficiency of the target processor. For
an MII processor that exploits instruction-level parallelism (ILP)
in non-numerical code, this decision is difficult to make due to
its irregularity. The authors perform an empirical study aimed
at characterizing a suitable MII organization that best exploits
irregular ILP. The study is based on the selective scheduling
compiler that performs precise memory disambiguation for concurrent
execution of multiple memory operations, along with renaming,
speculation, and software pipelining. The result indicates that
a small number of memory ports (i.e. less than half of the issue
rate) is enough for exploiting most of irregular ILP. The authors
also examine related issues such as the utilization of memory
ports and additional data cache misses caused by speculative loads.
[ Top ]
G.M. Silberman and K. Ebcioglu,
IEEE Computer, Vol. 26, No. 6, pp. 39-56
© IEEE, June 1993.
Abstract
The authors describe a novel architectural framework that allows
software applications and operating system code written for a given
instruction set to migrate to a different, higher performance
architecture. Only minor changes are needed on existing software to
support this migration, and the performance advantages can be
significant. The framework provides a hardware mechanism for seamless
switching between two instruction sets, resulting in a machine that
enhances application performance while keeping the same program
behavior from a user perspective. The framework is designed to
include program exceptions, self-modifying code, tracing, and
debugging. High execution speed on migrated applications is achieved
through automated compiler-assisted translation of the object code of
one machine to that of the other, using advanced global optimization
and scheduling techniques.
The result is a path for moving from complex instruction-set computers
(CISCs) to newer architectures, such as reduced instruction-set
computers (RISCs), superscalars, or very long instruction word (VLIW)
machines, while protecting the extensive economic investment
represented by existing software. Also, the framework can support
runtime application migration between machines whose instruction sets
it supports, thus providing a mechanism for load balancing and/or
fault-tolerance.
Examples are given for IBM System/390 operating-system code and AIX
utilities, showing the performance potential of the scheme using a
VLIW machine as the high performance target architecture. The need to
present the user an image consistent with the original application
results in code size and execution time overheads, which are
quantified by examining a few applications.
[ Top ]
S.M. Moon and K. Ebcioglu,
Proceedings of MICRO-25, pp. 55-71
IEEE Press
© IEEE, 1992.
Abstract
We describe a new algorithm for parallelization of sequential
code that eliminates anti and output dependences by renaming
registers on an as-needed basis during scheduling. A dataflow
attribute at the beginning of each basic block indicates what
operations are available for moving up through this basic block.
Scheduling consists of choosing the best operation from the set of
operations that can move to a point, moving the instances of the
operation to the point, making bookkeeping copies for edges that join
the moving path but are not on it, and updating the dataflow
attributes of basic blocks only on the paths that were traversed by
the instances of the moved operations. The code motions are done
globally without going through atomic transformations of percolation
scheduling, for better efficiency. For performing the code motions,
we use an intermediate representation that is directly executable as
sequential RISC code, rather than VLIW code. As a result, the new
algorithm can be used to generate parallelized code for multiple ALU
superscalar processors as well. The enhanced pipeline scheduling
algorithm for software pipelining of arbitrary code is reformulated
within the framework of the new sequential RISC representation. The
new algorithm has been implemented, and preliminary results on AIX
utilities indicate that it requires significantly less compilation
time than the percolation scheduling approach.
[ Top ]
U. Schwiegelshohn, F. Gasperoni, K. Ebcioglu
Journal of Parallel and Distributed Computing
11, 130-134 (1991).
© Academic Press, 1991
Abstract
The problem of automatic loop parallelization has received a lot
of attention in the area of parallelizing compilers. Automatic
loop parallelization can be achieved by several algorithms. In
this paper, we address the problem of time optimal parallelization
of loops with conditional jumps. We prove that even for machines
with unlimited resources, there are simple loops for which no
semantically and algorithmically equivalent time optimal program
exists.
[ Top ]
K. Ebcioglu, R. Groves
IBM T.J. Watson Research Center research report no. RC 16145,
10/02/90, Computer Science, 13 pages.
Abstract
We describe a method for converting a given program in the
assembly language of a superscalar machine to another program
written in the same assembly language, such that the resulting
program produces the same final results as to the original one,
and can run significantly faster. The method, inspired by new
global parallelization techniques for VLIW architectures, finds
and places together independently executable operations that may
be far apart in the original code (i.e., that may be separated
by many conditional branches or that may belong to different
iterations of a loop.) As a result, the pipelined functional
units in the machine, which could not be kept busy because of the
limited size of the execution lookahead window in the hardware,
are given more work to do, and higher performance is achieved.
We discuss some new architectural features and software support
required for speculative operations, that result from moving
operations above conditional jumps as part of the techniques.
As a preliminary demonstration of the value of the techniques,
an inner loop of the sequential natured SPEC Lisp Interpreter
benchmark is automatically parallelized; the result indicates
a potential performance of 3.5 RISC instructions/cycle in this
inner loop.
[ Top ]
K. Ebcioglu, T. Nakatani
Languages and Compilers for Parallel Computing,
D. Gelernter, A.
Nicolau, D. Padua (eds.),
pp. 213-229, Research Monographs
in Parallel and Distributed Computing, MIT Press, 1990.
Abstract
Enhanced pipeline-percolation scheduling is a new compilation
technique we developed to extract fine-grain parallelism uniformly
from nested loops in general software. The compilation technique
generates code for execution on the IBM Very Long Instruction Word
(VLIW) machine, which is now being built at the IBM T. J. Watson
Research Center. The IBM VLIW architecture has features that
facilitate high performance on general sequential-natured code
with unpredictable branches: a decision-tree shaped instruction
capable of performing multiple conditional branches emanating from
arbitrary paths in the sequential code, a conditional execution
feature that reduces pathlengths, multiple functional units
and multiple ports to data memory, and a shared register file
which eliminates communication delays between functional units.
The preliminary version of the parallelizing compiler for the IBM
VLIW machine is now operational both at the IBM Tokyo Research
Laboratory and the IBM T. J. Watson Research Center. It is
shown that a pathlength reduction of 5-11X over a RISC processor
can be achieved on some of the Stanford integer benchmarks and
other sequential-natured C programs.
[ Top ]
T. Nakatani and K. Ebcioglu.
Proceedings of MICRO-22,
pp. 43-55.
August 1989.
© ACM 1989.
Abstract
Combining is a local compiler optimization technique that can
enhance the performance of global compaction techniques for
VLIW machines. Given two adjacent operations of a certain class
that are flow (read-after-write) dependent and that cannot be
placed in the same micro-instruction, the combining technique
can transform the operations so that the modified operations
have no dependence. The transformed operations can be executed
in the same micro-instruction, thus allowing the total execution
time of the program to be reduced. In this paper, combining a
pair of flow-dependent operations into a wide instruction word
is suggested as an important compilation technique for VLIW
architectures. Combining is particularly effective with software
pipelining and loop unrolling since combinable operations can
come together with a higher probability when these compilation
techniques are used. We implemented combining in our parallelizing
compiler for the wide instruction word architecture, which is
now being built at the IBM T. J. Watson Research Center. It is
shown that ten percent speedup is obtained on the Stanford integer
benchmarks and other sequential-natured C programs, in comparison
to compaction techniques that do not use combining. For a class of
inner loops, combining can remove the inter-iteration dependences
completely and can improve performance in the same ratio as the
loop is unrolled.
[ Top ]
K. Ebcioglu, A. Nicolau
Proceedings, 1989 International Conference on Supercomputing,
pp. 154-163,
June 1989.
© ACM 1989
Abstract
This paper presents a new approach to resource-constrained
compiler extraction of fine-grain parallelism, targeted toward
VLIW supercomputers, and in particular, the IBM VLIW (Very Long
Instruction Word) processor. The algorithms described integrate
resource limitations into Percolation Scheduling-a global
parallelization technique- to deal with resource constraints,
without sacrificing the generality and completeness of Percolation
Scheduling in the process. This is in sharp contrast with previous
approaches which either applied only to conditional-free code,
or drastically limited the parallelization process by imposing
relatively local heuristic resource constraints early in the
scheduling process.
[ Top ]
K. Ebcioglu,
Parallel Processing, pp.3-21
M. Cosnard, M.H. Barton and M. Vanneschi (Editors)
Elsevier Science Publishers B.V. (North-Holland)
© IFIP, 1988.
Abstract
In this paper, we describe a Very Long Instruction Word architecture, now
being designed at the IBM T.J. Watson Research Center, which is intended
to achieve good performance not only in scientific code, but also in
sequential, non-numerical code. Communication delays between processing
elements are minimized via a single shared register file with a large
number of ports. To perform well on programs with unpredictable branches,
the architecture features decision-tree shaped instructions that allow
multiway branching, and that allow operations to be executed conditionally
depending on where the instruction branches to. To add to the parallelism
achievable via existing compilation techniques for VLIW architectures,
we have developed a compilation technique called pipeline
scheduling, which is an extension of the "doacross" and "dopipe"
techniques proposed for multiprocessors by D. Kuck's group. This
technique can initiate a new iteration of an inner loop (possible
containing arbitrary if-then-else statements and conditional exits)
on every clock period whenever dependences and resources permit.
[ Top ]
K. Ebcioglu,
Proceedings of
MICRO-20, pp. 69-79
© ACM Press, 1987.
Abstract
We describe a compilation algorithm for efficient software pipelining of
general inner loops, where the number of iterations and the time taken by
each iteration may be unpredictable, due to arbitrary if-then-else statements
and conditional exit statements within the loop. As our target machine,
we assume a wide instruction word architecture that allows multi-way branching
in the form of if-then-else trees, and that allows conditional register transfers
depending on where the microinstruction branches to (a hardware implementation
proposal for such a machine is briefly described in the paper). Our compilation
algorithm, which we call the pipeline scheduling technique, produces a
software-pipelined version of a given inner loop, which allows a new iteration
of the loop to begin on every cycle whenever dependences and resources permit.
The correctness and termination properties of the algorithm are studied in
the paper.
[ Top ]
|
|
|
|