IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
Journal of Research
and Development
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Patents  
  ·  Recent publications  
  ·  Author's Guide  
  Staff  
  Contact Us  
Systems Journal  
Volume 42, Number 6, 1998
Data compression in ASIC cores
 Table of contents: arrowHTML arrowASCII   This article: HTML arrowASCII   DOI: 10.1147/rd.426.0807 arrowCopyright info
   

A decompression core for PowerPC

by T. M. Kemp, R. K. Montoye, D. J. Auerbach, J. D. Harper, and J. D. Palmer
Code size efficiency is a critical parameter in the design of computer systems for embedded applications. This paper describes a method for improving code size efficiency involving the use of compression techniques to reduce the size of the stored code, and on-the-fly hardware decompression at full processor speed for execution. A simple frequency-based encoding scheme for PowerPC® code achieves a typical code size reduction to 60% of the original size. A corresponding decompression core has been implemented for an embedded microprocessor, such as the PowerPC 401™. The compression/decompression scheme operates in a manner transparent to the processor and requires no changes to such tools as compilers, linkers, and loaders.

Introduction

The continuing validation of Moore's law [1] with regard to microcontrollers has made the cost of embedded computation smaller and smaller. Today's 32-bit microcontrollers occupy less space than the 8-bit microcontrollers of only five years ago, yet have 10 to 100 times the computational power. This trend has led to newer designs incorporating the more powerful processors in equipment and appliances.

Taking advantage of this increased processor power and increasing application complexity, the sizes of the programs running on these embedded controllers have grown exponentially. The silicon area and cost of the program memory of a common embedded application now overshadow the size and cost of the processor. For example, in an application such as a high-end hard-disk drive, where an embedded processor may occupy a silicon area of about six square millimeters, the program memory for that processor takes 20 to 40 square millimeters. Thus, some design focus has turned again to the size of memory for computer programs.

Different microcontroller manufacturers have dealt with program size problems by modifying their architectures in different ways. One method is to create a new architecture with consistently smaller instructions; however, the programming environment changes, and all new tools must be created. Another way is to add processor front ends that will interpret a set of smaller instructions with limited capabilities. The same programming environment is used, but new instructions mean that the tools change again. Third, new instructions are added to the architecture, usually to combine the functionality of several instructions in the existing instruction set. The requirement for new or modified tools is less in this case, but still remains significant.

These solutions all share a significant problem: They affect the programming development tools. Since each one involves a new set of instructions, any assemblers, compilers, linkers, and debuggers must be reworked for the new architectures. These modifications require money and time to implement, particularly if the tool implementors are separate tool vendors.

Concepts

The goals for reducing the instruction storage space for the PowerPC* [2] were therefore to significantly reduce instruction store space, minimize performance restrictions on the processor, avoid restricting processor capability, and minimize required modifications to development tools. The last restriction was clearly a difficult one. How could an instruction architecture be modified to reduce the space it takes without making significant tool changes? Clearly we could not modify the instruction format, nor modify how the processor sees its environment.

As shown in Figure 1, all embedded PowerPCs consist of a processor core that is attached to its environment via an on-chip bus, the processor local bus (PLB). The processor's memory and I/O devices sit on this bus, providing the processor with its view of the operating environment. The debugging interface is handled by an extension to the JTAG [3] interface, which is connected to a special port to the processor core. Using this port, a special tool allows an external computer to manage program debugging and system control, which are conducted through the processor's PLB interface. Thus, as long as the processor is attached to interfaces providing standard formats, it doesn't matter how the program data are actually stored.

Figure 1Figure 1

The simple, direct solution to the code size problem is to make the contents of the memory smaller, leading us to investigate code compression. The most common compression algorithms in use today are table lookup, either adaptive, as with Lempel-Ziv [4] or static, e.g., Huffman [5]. An adaptive solution seemed inappropriate for the application, since adaptive lookup schemes depend on a history of the preceding symbols in a sequential file. A static symbol-based lookup mechanism avoids this problem, since decoding any particular symbol requires no knowledge of any preceding symbols. Thus, we elected to use a static encoding scheme. The next task was to choose the symbols and a specific encoding scheme.

The best known simple static compression mechanism, as described by Huffman, is simple substitution of symbols of a shorter length than the originals for the most frequent symbols, and longer for symbols occurring less frequently. The most obvious symbol choice for PowerPC is the instruction. It is regular and is always 32 bits long. Figure 2 shows the main instruction formats.

Figure 2Figure 2

We performed a frequency analysis of full 32-bit-instruction-sized symbols over a large set of PowerPC code, and found that the size of the substitution symbol set was not advantageous. There are too many different common instructions, none of which occur frequently enough to provide an advantage. We needed a smaller set of symbols.

Dividing the instructions in two to provide 16-bit symbols produced a better set. Notice from Figure 2 that the top half of an instruction always contains the instruction opcode (OPCD). In the case of the D-form and X-form instructions, the next two fields, RT and RA, are always registers. The bottom half of the instructions are often immediate values (D) or branch displacements (D, BD). Observing the basic differences in the two halves of the instruction, we did separate frequency analyses on the two halves and discovered that the frequencies of values in the two sets were quite different--so different, in fact, that a compression by replacing each half of the instruction from a custom substitution table for that half produced substantially better results than using a combined table.

Figure 3 and Table 1 show a portion of the frequency distribution of values of the top half of instructions compared with the bottom half of instructions from a program for an embedded controller. The two halves differ both in the most frequent values and in the distribution of those values. The bottom half of the instruction has a much higher count for the most frequent symbol than does the top half. The occurrence frequencies drop more sharply for the bottom than the top, crossing over at about symbol number 56. Each of the lists of unique symbols is over 3800 entries long. Many of the symbols occur only once. These great differences in the early shape of the curves led us to decide to encode the two halves of the instructions separately and to pick a near-optimal substitution symbol set for each of the instruction halves, as shown in Figure 4.

Table 1   Symbol frequencies.

Symbol no.TopBottom

1476812963
237823101
323462815
420382594
517642315
615001957
710151863
89941825
99631532
109301507
118461288
127941221
137701152
14702998
15662871
16581866

Figure 3Figure 3 Figure 4Figure 4

Each symbol format consists of a tag of two or three bits followed by a symbol value, shown by a number of bits (each bit represented by n). In the top half encoding, the first symbol format, with a tag of 00, can encode eight symbols. The second symbol format, with tag 01, can encode 32 more symbols, and so on. We pick the eight most frequent symbols to be encoded in format 00, the next 32 most frequent to be encoded in symbol format 01, down to those least frequent symbols for which we use the original 16-bit literal value, shown with L representing each bit, encoded in format 111. A notable feature is that the first encoding of the bottom half of the instructions contains only one value: zero. Zero is so prevalent that statistically it deserves an encoding of its own.

The net compression factor with this technique is that the compressed code is typically about 56% of the original size.

Implementation

Since we previously observed that the processor accesses the memory via the PLB, the obvious place to put the decompression unit was on the PLB just in front of the memory. Figure 5 shows that the decompression unit accepts the processor's addresses and then looks them up in the compressed memory. Thus, the processor uses instructions and addresses them just as though the code were not compressed.

Figure 5Figure 5

To fetch an instruction, the processor provides the decompression unit the address of the original, uncompressed instruction. The decompression unit maps the address to the location at which the compressed instruction is kept, in a presumably smaller space. We define the index table as a two-column map of "uncompressed space" addresses to "compressed space" addresses. Theoretically, for every address the processor presents, the index table has a location of the compressed instruction.

Thus, the basic algorithm for the decompression unit is as follows:

  1. Receive memory address of instruction from processor.
  2. Look up compressed address in index table.
  3. Fetch compressed instruction from memory.
  4. Decompress instruction by looking up uncompressed symbols.
  5. Return instruction to processor.

The performance of this simple algorithm is somewhat slow. Presuming a one-cycle memory, we must take one cycle to look up the compressed address, one cycle to fetch the compressed instruction, and one cycle to decompress, for a total of three cycles. We have implemented some optimizations to improve this latency.

Optimizations

In the compressed memory space, to optimize memory usage, we pack the compressed symbols as tightly as possible. Since the lengths of the symbols are not a convenient modulo of addressable memory, it is difficult to address every possible instruction. If every instruction were addressed at the bit level, the index table would grow to an unacceptable size, reducing the advantages of compression. Rather than using one index table entry per instruction, we divide the code into fixed-size "compression blocks" and provide an index table entry for each. This approach controls the size of the table. However, it has the disadvantage that since we cannot easily locate individual instructions within a compressed space, we must search for the desired instruction from the beginning of each block of compressed instructions. As compression block size grows, the average random access time increases. A good engineering compromise puts the compression block size at 64 bytes, with 16 instructions in a compression block. Our performance is improved here, since once the block is accessed, the decompression unit can pipeline the stages of the decompression, yielding a three-cycle initial latency, with a rate of one instruction per cycle.

The PowerPC processors are cached and usually fetch cache lines rather than single instructions. When a compression block is retrieved from memory and decompressed, the instructions in the compression block sequentially after those specifically requested are kept. We presume that the processor will continue executing instructions sequentially and will therefore request the next cache line's worth of instructions. The decompression unit will be able to provide these instructions with no memory or decompression latency. Pairs of compression blocks are always kept together, so that as the processor sequentially asks for the first address of the second compression block, the decompression engine can locate it directly without use of the index, saving a cycle.

PowerPCs have memory management units as well. A bit (here called the "K" bit) can be placed in the TLB entries for a virtual memory manager and in the storage attribute register for real-mode managers. It can be used to indicate, on a page basis, whether or not compression is implemented. This allows small routines with critical performance requirements to remain uncompressed in the midst of a larger system of compressed code.

Measurements

We evaluated the compression mechanism by compressing a number of programs compiled by a variety of compilers. We found that two factors greatly affect the degree to which a program may be compressed: compiler selection and application execution effects.

The first factor is related to whether the compiler used for the program matches the symbol table in the decompression unit. Various C language compilers use very different methods of allocating registers and generating instruction sequences. The instructions generated by a compiler map to a set of input symbols, particularly for the top half of the instructions, that may have an entirely different frequency spectrum than that for instructions generated by another compiler. The compressed images generated from a program compiled, for example, by the GNU C compiler and compressed with a symbol table generated from a frequency spectrum specific to programs generated by the AIX* XLC compiler show remarkably poor, and sometimes negative, compression. See Table 2 for some specific examples.

Table 2   Compression factor with matching and mismatching compilers.

 Matching
compiler
(%)
Mismatching
compiler
(%)

ate5596
csh60100
disk167101
disk255105
disk366101
disk465102

The dynamic properties of the program execution determine how much delay due to decompression is experienced. Two factors affect the execution the most strongly: basic block size and working set size. A basic block is a set of instructions which has no branches and is executed linearly. Since the decompression unit is prefetching and decompressing blocks before the computer asks for them, performance depends on the percentage of time that the processor actually asks for data from a prefetched block. If the average basic block size is large, this prefetching activity is generally successful at hiding the decompression latency. The working set is the set of instructions that are most frequently executed. In a caching system, if this working set is smaller than the cache size, memory is infrequently accessed. For a compressed memory, the less the memory is accessed, the less frequently the decompression latency occurs. Thus, programs that execute a small part of their actual code set will execute proportionately faster.

The IBM Microelectronics Division has implemented the decompression core in IBM CMOS 5S ASIC technology and added it to the PowerPC core catalog. The decompression core consists of 25484 logic cells, and the lookup table ROM for two 512-entry tables is 10814 cells (memory and access logic). Silicon area for the core is about 1 mm2.

Future work

In the future, we would like to address a couple of issues. One is to improve addressing efficiency further, so that when a cache line near the end of a compression block is requested, the decompression unit, remembering the location of the last decompression block, will prefetch and decompress the next block, further reducing latency.

Another is the problem of latency when a branch is taken to an arbitrary location. The average latency is the basic three cycles, plus half the number of instructions in a compression block, a total of 11 cycles in the present implementation. In a system that experiences a low cache-hit ratio, this latency becomes significant. This may be addressed by keeping a branch address cache in the decompression unit. This would eliminate the index table lookup and the counting of instructions in the compression block.

Conclusions

Overall, it is possible to reduce the size of PowerPC code significantly through compression. The net compression, including the index tables, for average programs is to about 60% of the original program size. This is a more dense code than for any other 32-bit processor, RISC or CISC.

This method of reducing the code size is simple and inexpensive, and it required no changes to compilers, assemblers, or linkers. One step, the compression step, must be added to the code creation chain. The compression program is run on the load image after linking and relocating. The JTAG-based debuggers that use the processor's instruction and data trap capabilities need not be changed either. A debugger that sets program traps by modifying memory must be aware of the compression.

A particular implementation of this compression mechanism is compiler-dependent, but only by differences in the symbols loaded in the symbol tables. Thus, a single design with a separately programmed ROM or flash memory will suffice for a variety of uses.

The granularity of the compression is based entirely on memory, rather than on a mode of the processor. Thus, a single thread of execution may switch in and out of compressed code effortlessly. The performance penalty of this implementation is minor if the program's cache hit ratio is reasonable. That is, if the processor is reading instructions from its instruction cache most of the time, the compressed memory access delays do not occur often enough to have a significant impact.

*Trademark or registered trademark of International Business Machines Corporation.

References

Received December 9, 1997; accepted for publication June 10, 1998