Introduction
The continuing validation of Moore's law [1] with
regard to microcontrollers has made the cost of embedded
computation smaller and smaller. Today's 32-bit
microcontrollers occupy less space than the 8-bit
microcontrollers of only five years ago, yet have 10 to 100
times the computational power. This trend has led to newer
designs incorporating the more powerful processors in
equipment and appliances.
Taking advantage of this increased processor power
and increasing application complexity, the sizes of the
programs running on these embedded controllers have grown
exponentially. The silicon area and cost of the program
memory of a common embedded application
now overshadow the size and cost of the processor. For
example, in an application such as a high-end hard-disk
drive, where an embedded processor may occupy a silicon
area of about six square millimeters, the program memory
for that processor takes 20 to 40 square millimeters.
Thus, some design focus has turned again to the size of
memory for computer programs.
Different microcontroller manufacturers have dealt with
program size problems by modifying their architectures in
different ways. One method is to create a new architecture
with consistently smaller instructions; however, the
programming environment changes, and all new tools must be
created. Another way is to add processor front ends that
will interpret a set of smaller instructions with limited
capabilities. The same programming environment is used, but
new instructions mean that the tools change again. Third,
new instructions are added to the architecture, usually to
combine the functionality of several instructions in the
existing instruction set. The requirement for new
or modified tools is less in this case, but still remains
significant.
These solutions all share a significant problem: They
affect the programming development tools. Since each
one involves a new set of instructions, any assemblers,
compilers, linkers, and debuggers must be reworked for the
new architectures. These modifications require money and
time to implement, particularly if the tool implementors
are separate tool vendors.
Concepts
The goals for reducing the instruction storage space
for the PowerPC* [2] were therefore to significantly reduce
instruction store space, minimize performance restrictions
on the processor, avoid restricting processor capability,
and minimize required modifications to development tools.
The last restriction was clearly a difficult one. How could
an instruction architecture be modified to reduce the space
it takes without making significant tool changes? Clearly
we could not modify the instruction format, nor modify how
the processor sees its environment.
As shown in Figure 1, all embedded PowerPCs consist of a
processor core that is attached to its environment
via an on-chip bus, the processor local bus (PLB). The
processor's memory and I/O devices sit on this bus,
providing the processor with its view of the operating
environment. The debugging interface is handled by an
extension to the JTAG [3] interface, which is connected to
a special port to the processor core. Using this port, a
special tool allows an external computer to manage program
debugging and system control, which are conducted through
the processor's PLB interface. Thus, as long as the
processor is attached to interfaces providing standard
formats, it doesn't matter how the program data are
actually stored.
Figure 1
The simple, direct solution to the code size problem is
to make the contents of the memory smaller, leading us to investigate code compression. The most common compression algorithms in use today are table lookup,
either adaptive, as with Lempel-Ziv [4] or static, e.g.,
Huffman [5]. An adaptive solution seemed inappropriate for
the application, since adaptive lookup schemes depend on a
history of the preceding symbols in a sequential file. A
static symbol-based lookup mechanism avoids this problem,
since decoding any particular symbol requires no knowledge
of any preceding symbols. Thus, we elected to use a static
encoding scheme. The next task was to choose the symbols
and a specific encoding scheme.
The best known simple static compression mechanism, as
described by Huffman, is simple substitution of symbols of
a shorter length than the originals for the most frequent
symbols, and longer for symbols occurring less frequently.
The most obvious symbol choice for PowerPC is the
instruction. It is regular and is always 32 bits long.
Figure 2 shows the main instruction formats.
Figure 2
We performed a frequency analysis of full
32-bit-instruction-sized symbols over a large set of
PowerPC code, and found that the size of the substitution
symbol set was not advantageous. There are too many
different common instructions, none of which occur
frequently enough to provide an advantage. We needed a
smaller set of symbols.
Dividing the instructions in two to provide 16-bit
symbols produced a better set. Notice from Figure 2
that the top half of an instruction always contains the
instruction opcode (OPCD). In the case of the D-form and
X-form instructions, the next two fields, RT and RA, are
always registers. The bottom half of the instructions are
often immediate values (D) or branch displacements (D, BD).
Observing the basic differences in the two halves of the
instruction, we did separate frequency analyses on the two
halves and discovered that the frequencies of values in the
two sets were quite different--so different,
in fact, that a compression by replacing each half of the
instruction from a custom substitution table for that half
produced substantially better results than using a combined
table.
Figure 3 and Table 1 show a portion of the frequency
distribution of values of the top half of instructions
compared with the bottom half of instructions from a
program for an embedded controller. The two halves differ
both in the most frequent values and in the distribution of
those values. The bottom half of the instruction has a much
higher count for the most frequent symbol than does the top
half. The occurrence frequencies drop more sharply for the
bottom than the top, crossing over at about symbol number
56. Each of the lists of unique symbols is over 3800
entries long. Many of the symbols occur only once. These
great differences in the early shape of the curves led us
to decide to encode the two halves of the instructions
separately and to pick a near-optimal substitution symbol
set for each of the instruction halves, as shown in Figure 4.
Table 1 Symbol frequencies.
| | Symbol no. | Top | Bottom
|
|---|
| | 1 | 4768 | 12963
| | 2 | 3782 | 3101
| | 3 | 2346 | 2815
| | 4 | 2038 | 2594
| | 5 | 1764 | 2315
| | 6 | 1500 | 1957
| | 7 | 1015 | 1863
| | 8 | 994 | 1825
| | 9 | 963 | 1532
| | 10 | 930 | 1507
| | 11 | 846 | 1288
| | 12 | 794 | 1221
| | 13 | 770 | 1152
| | 14 | 702 | 998
| | 15 | 662 | 871
| | 16 | 581 | 866
|
|
Figure 3
Figure 4
Each symbol format consists of a tag of two or three
bits followed by a symbol value, shown by a number of bits
(each bit represented by n). In the top half encoding, the
first symbol format, with a tag of 00, can encode eight
symbols. The second symbol format, with tag 01, can encode
32 more symbols, and so on. We pick the eight most frequent
symbols to be encoded in format 00, the next 32 most
frequent to be encoded in symbol format 01, down to those
least frequent symbols for which we use the original 16-bit
literal value, shown with L representing each bit, encoded
in format 111. A notable feature is that the first encoding
of the bottom half of the instructions contains only one
value: zero. Zero is so prevalent that statistically it
deserves an encoding of its own.
The net compression factor with this technique is that
the compressed code is typically about 56% of the original
size.
Implementation
Since we previously observed that the processor
accesses the memory via the PLB, the obvious place to put
the decompression unit was on the PLB just in front of the
memory. Figure 5 shows that the decompression unit accepts
the processor's addresses and then looks them up in the
compressed memory. Thus, the processor uses instructions
and addresses them just as though the code were not
compressed.
Figure 5
To fetch an instruction, the processor provides the
decompression unit the address of the original,
uncompressed instruction. The decompression unit maps the
address to the location at which the compressed instruction
is kept, in a presumably smaller space.
We define the index table as a two-column map of
"uncompressed space" addresses to "compressed space"
addresses. Theoretically, for every address the processor
presents, the index table has a location of the compressed
instruction.
Thus, the basic algorithm for the decompression unit is
as follows:
- Receive memory address of instruction from
processor.
-
Look up compressed address in index table.
-
Fetch compressed instruction from memory.
-
Decompress instruction by looking up uncompressed
symbols.
-
Return instruction to processor.
The performance of this simple algorithm is somewhat
slow. Presuming a one-cycle memory, we must take one cycle
to look up the compressed address, one cycle to fetch the
compressed instruction, and one cycle to decompress, for a
total of three cycles. We have implemented some
optimizations to improve this latency.
Optimizations
In the compressed memory space, to optimize memory
usage, we pack the compressed symbols as tightly as
possible. Since the lengths of the symbols are not a
convenient modulo of addressable memory, it is difficult
to address every possible instruction. If every
instruction were addressed at the bit level, the index
table would
grow to an unacceptable size, reducing the advantages
of compression. Rather than using one index table
entry per instruction, we divide the code into fixed-size
"compression blocks" and provide an index table entry
for each. This approach controls the size of the table.
However, it has the disadvantage that since we cannot
easily locate individual instructions within a compressed
space, we must search for the desired instruction from the
beginning of each block of compressed instructions. As
compression block size grows, the average random access
time increases. A good engineering compromise puts the
compression block size at 64 bytes, with 16 instructions in
a compression block. Our performance is improved here,
since once the block is accessed, the decompression unit
can pipeline the stages of the decompression, yielding a
three-cycle initial latency, with a rate of one instruction
per cycle.
The PowerPC processors are cached and usually fetch
cache lines rather than single instructions. When a
compression block is retrieved from memory and
decompressed, the instructions in the compression block
sequentially after those specifically requested are kept.
We presume that the processor will continue executing
instructions sequentially and will therefore request the
next cache line's worth of instructions. The decompression
unit will be able to provide these instructions with no
memory or decompression latency. Pairs of compression
blocks are always kept together, so that as the processor
sequentially asks for the first address of the second
compression block, the decompression engine can locate
it directly without use of the index, saving a cycle.
PowerPCs have memory management units as well.
A bit (here called the "K" bit) can be placed in the
TLB entries for a virtual memory manager and in the storage
attribute register for real-mode managers. It can be used
to indicate, on a page basis, whether or not compression
is implemented. This allows small routines with critical
performance requirements to remain uncompressed in the
midst of a larger system of compressed code.
Measurements
We evaluated the compression mechanism by compressing a
number of programs compiled by a variety of compilers. We
found that two factors greatly affect the degree to which a
program may be compressed: compiler selection and
application execution effects.
The first factor is related to whether the compiler
used for the program matches the symbol table in the
decompression unit. Various C language compilers use very
different methods of allocating registers and generating
instruction sequences. The instructions generated by a
compiler map to a set of input symbols, particularly for
the top half of the instructions, that may have an entirely
different frequency spectrum than that for instructions
generated by another compiler. The compressed images
generated from a program compiled, for example, by the GNU
C compiler and compressed with a symbol table generated
from a frequency spectrum specific to programs generated by
the AIX* XLC compiler show remarkably poor, and sometimes negative,
compression. See Table 2 for some specific examples.
Table 2 Compression factor with matching and mismatching compilers.
| | | Matching compiler (%) | Mismatching compiler (%)
|
|---|
| | ate | 55 | 96
| | csh | 60 | 100
| | disk1 | 67 | 101
| | disk2 | 55 | 105
| | disk3 | 66 | 101
| | disk4 | 65 | 102
|
|
The dynamic properties of the program execution
determine how much delay due to decompression is
experienced. Two factors affect the execution the most
strongly: basic block size and working set size. A basic
block is a set of instructions which has no branches and
is executed linearly. Since the decompression unit is
prefetching and decompressing blocks before the computer
asks for them, performance depends on the percentage of
time that the processor actually asks for data from a
prefetched block. If the average basic block size is large,
this prefetching activity is generally successful at hiding
the decompression latency. The working set is the set of
instructions that are most frequently executed. In a
caching system, if this working set is smaller than
the cache size, memory is infrequently accessed. For a
compressed memory, the less the memory is accessed, the
less frequently the decompression latency occurs. Thus,
programs that execute a small part of their actual code set
will execute proportionately faster.
The IBM Microelectronics Division has implemented the
decompression core in IBM CMOS 5S ASIC technology and added
it to the PowerPC core catalog. The decompression core
consists of 25484 logic cells, and the lookup table ROM for
two 512-entry tables is 10814 cells (memory and access
logic). Silicon area for the core is about 1 mm2.
Future work
In the future, we would like to address a couple of
issues. One is to improve addressing efficiency further, so
that when a cache line near the end of a compression block
is requested, the decompression unit, remembering the
location of the last decompression block, will prefetch and
decompress the next block, further reducing latency.
Another is the problem of latency when a branch is taken
to an arbitrary location. The average latency is the basic
three cycles, plus half the number of instructions in a
compression block, a total of 11 cycles in the present
implementation. In a system that experiences a low
cache-hit ratio, this latency becomes significant. This may
be addressed by keeping a branch address cache in the
decompression unit. This would eliminate the index table
lookup and the counting of instructions in the compression
block.
Conclusions
Overall, it is possible to reduce the size of PowerPC
code significantly through compression. The net
compression, including the index tables, for average
programs is to about 60% of the original program size. This
is a more dense code than for any other 32-bit processor,
RISC or CISC.
This method of reducing the code size is simple and
inexpensive, and it required no changes to compilers,
assemblers, or linkers. One step, the compression step,
must be added to the code creation chain. The compression
program is run on the load image after linking and
relocating. The JTAG-based debuggers that use the
processor's instruction and data trap capabilities need not
be changed either. A debugger that sets program traps by
modifying memory must be aware of the compression.
A particular implementation of this compression
mechanism is compiler-dependent, but only by differences in
the symbols loaded in the symbol tables. Thus, a single
design with a separately programmed ROM or flash memory
will suffice for a variety of uses.
The granularity of the compression is based entirely on
memory, rather than on a mode of the processor. Thus,
a single thread of execution may switch in and out of
compressed code effortlessly. The performance penalty of
this implementation is minor if the program's cache hit
ratio is reasonable. That is, if the processor is reading
instructions from its instruction cache most of the time,
the compressed memory access delays do not occur often
enough to have a significant impact.
*Trademark or registered trademark of International
Business Machines Corporation.
Received December 9, 1997; accepted for publication
June 10, 1998
|