Introduction
With print resolution for digital documents rising to
1200 dpi, the requirements for controller memory and
controller throughput are increasing steadily for both
print and digital copy applications. Storage requirements
are dictated by features such as collation or duplexing,
which make use of full image storage.
As shown in Table 1 for the case of a copier, generation
of subsequent pages takes place while previous pages are
printed in order to keep the output engine continuously
busy. Pages Pn are split into bands Pn Bx for that purpose.
This requires quasi-simultaneous compression of the
rendered or scanned and screened images to storage and
repetitive decompression of the stored images for print.
This can be accomplished either by providing separate
compression and decompression engines or by supplying
a single engine with sufficient throughput to allow
time-sharing between compression and decompression tasks.
Table 1 Copier pipeline with JBIG shared for compression and decompression.
| | Time | Scan | Scale and screen | JBIG compress* | JBIG decompress* | Print
|
|---|
| | T0 | Pn B0 | Pn-1 B7 | Pn-1 B6 | Pn-m B3 | Pn-m
B2
| | T1 | Pn B1 | Pn B0 | Pn-1 B7 | Pn-m B4 | Pn-m B3
| | T2 | Pn B2 | Pn B1 | Pn B0 | Pn-m B5 | Pn-m B4
| | T3 | Pn B3 | Pn B2 | Pn B1 | Pn-m B6 | Pn-m B5
| | T4 | Pn B4 | Pn B3 | Pn B2 | Pn-m B7 | Pn-m B6
| | T5 | Pn B5 | Pn B4 | Pn B3 | Pn-m+1 B0 | Pn-m B7
| | T6 | Pn B6 | Pn B5 | Pn B4 | Pn-m+1 B1 | Pn-m+1
B0
| | T7 | Pn B7 | Pn B6 | Pn B5 | Pn-m+1 B2 | Pn-m+1
B1
| | T8 | Pn+1 B0 | Pn B7 | Pn B6 | Pn-m+1 B3 |
Pn-m+1 B2
|
|
To meet the real-time constraints of laser engines
(once a page is started, it is necessary to keep up with
the engine), predictability or low data dependency of
the throughput is very important. At 1200 dpi, a single
letter-size page contains approximately 128 million pixel
elements, or 16 MB of bitonal data. A high average
compression ratio as well as good worst-case compression
behavior is therefore essential. Adaptive arithmetic
compression as used in ABIC [1,2] and JBIG [3,4] provides
by far the best average and worst-case compression ratios
on bitonal data.
JBIG and its predecessor ABIC are adaptive arithmetic
compression methods that convert an image stream into
a binary fraction. This is accomplished in two steps--
an image preprocessing (model unit) and a subsequent
adaptive arithmetic coding process. Figure 1 shows the
basic architecture of a JBIG compressor.
Figure 1
The image preprocessing extracts individual image pixels
and converts them into a context word correlated with the
pixel to be coded. In other words, each individual pixel
of the complete image has as additional information an
associated context describing the neighboring pixels
as well as their spatial relationship. At the edges of the
data matrix or image, special rules apply for context
bits outside the actual image or data matrix. JBIG uses a
context built from ten neighboring pixels of the current
pixel being examined. These are taken either from the
previous line and the current line or from the two previous
lines and the current line. Also, one context bit from the
previous line can be replaced by a more distant pixel in
the current line to pick up horizontal frequencies to
improve the compression ratio. These options are
programmable parameters. How each image pixel is associated
with its context and how image pixels make up a context are
shown below in the model unit discussion.
An adaptive arithmetic coder consists of an adapter and
the so-called arithmetic coder. The adapter contains a
storage table memory, which, for example, in the case of
JBIG, has 1024 (210) entries. The context bits are used as
an index into this table memory. The information elements
stored in each entry are the expected value
for the current pixel given the surrounding pixel values
defined by the context, and a probability index for this
expected value. Both the probability indices and the
expected value are adapted dynamically to optimize
compression. The probability indices are converted into a
binary fraction that can be viewed as an interval width.
The higher the probability of an expected value, the lower
is the information content of an event confirming the
expected value, and, accordingly, the chosen size for
the corresponding interval. For each image pixel being
coded, the adapter passes the interval size Qe and a
flag, specifying to the arithmetic coder whether or not
the expectation for this image pixel has been met.
The coder contains an accumulator C to generate the
compressed code word as well as an accumulator A to track
the actual interval width. Initial conditions are C = 0 and
A = 1.0000. For each pixel being coded, the interval size
is adapted; if the pixel meets the expected value, the
value in A is replaced by A - Qe; otherwise the value in A
is replaced by Qe. The value in C is replaced by C + Qe or
retains its value accordingly. Whenever A drops below 0.75,
both A and C are shifted left until A is again greater than
or equal to 0.75. This allows the coder to generate a
binary fraction of infinite precision with an accumulator
of finite precision. The compression of the coding is based
on the fact that for pixels with a high probability for the
expected value, Qe is very small and therefore can be
subtracted many times from A before an underflow occurs and
the simultaneous renormalization of C with A generates code
bits.
Decoding works similarly by loading C with the
accumulated code word. Then the unit tries to subtract Qe
from C. If Qe can be subtracted from C without causing C to
underflow, it must have been added during encoding; i.e.,
the pixel being coded equaled the expected value.
Otherwise, C retains its value, and the pixel being coded
did not equal the expected value. A is processed the same
way as during encoding.
Most significantly, JBIG allows specific accommodations
for halftone images that other compression methods such as
T.4 or T.6 fax compression do not offer. JBIG provides more
capability for tuning the template of the model unit to
particular images and has a higher-resolution probability
estimation, both of which can help to improve the
compression ratio. However, ABIC's bit stuffing is more
efficient than JBIG's byte stuffing and is not prone to the
considerable worst-case uncertainty in throughput caused by
carry resolution during compression [5,6].
With no compression hardware existing at the time that
could support the requirements of the multifunction
peripheral market, Xionics decided in 1995 to develop such
a core as part of their integrated MFP silicon solution,
XipChip, and began a collaboration with IBM
Microelectronics to develop a bitonal compression solution.
This paper describes the implementation of
this combined JBIG and ABIC compression and decompression
engine for digital document processing equipment and the
design tradeoffs made to meet throughput requirements.
XipChip is shipped today
as one of Xionics' multifunction peripheral solutions.
Combined JBIG and ABIC compression and decompression engine
Overall engine architecture
The combined compression engine preserves a
partitioning similar to that of the original ABIC hardware
[1]. As shown in Figure 2, it consists of three major
modules: the model unit, the Qx-coder [5], and the bus
interface logic.
Figure 2
Model unit
The model unit provides three pieces of
functionality--generation of the context for the Qx-coder;
generation
of the input pixel for compression and assembly of the
image from the decompressed pixels; and management
of the input and output buffers.
The model unit allows selection from among three
different templates for context generation: the seven-bit
ABIC template [Figure 3(a)]; the two-line JBIG template
[Figure 3(b)]; or the three-line JBIG template [Figure 3(c)]. The number following the X identifies the bit number within the context. The ABIC template wraps around
from the previous line and to the next line at line
boundaries, while the JBIG templates pad the context with
0-bits at the beginning and end of each line. In addition,
the JBIG templates implement the adaptive template bit
identified by an A instead of an X. It is shown in its
default position. It can be programmed for a horizontal
offset of up to 128 to allow picking up of halftone
frequencies [see the JBIG template extensions at the left
in Figures 3(a) and 3(b)]. Furthermore, in JBIG mode the
model unit can be programmed to check for typical
prediction.
Figure 3
Rather than implementing complete line buffers for the
template reference lines, the model unit implements three
256-bit buffers. These buffers hold the currently active
parts of the lines and move like a window through the
image. The buffers are implemented with separate controls
for master and slave flip-flops. This allows the unit to
pipeline the read accesses and prefetch the next 256 bits
for each line into the master latches while it is working
on the previously loaded 256 bits stored in the slave
latches. This avoids pipeline stalls due to data fetching.
In addition, the moving windows do not impose any
restriction on the image width, as do fixed buffers used in
other implementations. Image width and height are
restricted only by the size of the counters and a small
overhead for the image boundaries. Image width and height
are limited to 65530 pixels.
Qx-coder
The Qx-coder comprises two modules, the adapter and the
coder. They are pipelined for optimal performance and hide
the adapter's memory-access latency whenever possible. This
is an implementation of the Qx-coder described in [5].
Adapter
The adapter (Figure 4) implements the adapter memory and
the hardware tables to convert the stored
Q-index values to the Q-values used by the coder and to
generate the Q-index transitions for adaptation. The
adapter memory has separate read and write ports to allow
updating the memory while fetching new values for
subsequent coding steps. The memory is 16 bits wide so as
to be able to fetch a pair of indices and MPS values with
each access. This is important for decompression speed,
where the last bit generated will not be available in time
to become part of the adapter memory address. By reading
both the value for a decoded 1-bit and the value for a
decoded 0-bit, and by having dual hardware tables for the
simultaneous conversion of the Q-indices to
Q-values, the correct Q-value can be selected with a
fast multiplexer as soon as the bit becomes available. In
addition, it detects whether the same memory location will
be accessed on subsequent accesses. For memory updates, the
updated values are cached in a register until the actual
memory update has occurred, so that they can be accessed
without update latency in case they are needed for the next
coding step. The adapter memory has 512 entries to
accommodate the larger contexts of the JBIG template. For
ABIC compression, only the lower 64 entries of the memory
are used. The hardware tables for Q-value generation and
Q-index transitions implement both the Q-coder tables for
ABIC and the QM-coder tables for JBIG. One can decide which
tables are to be used when setting up the unit and program
accordingly.
Figure 4
Coder
The implementation of the coder combines aspects of the
original ABIC hardware design [1] with the changes required
for JBIG. The native coding convention for this coder is
the original ABIC coding. JBIG data streams are handled by
inverting the code on its way out for compression and on
the way in for decompression, as described in the companion
paper [5] by Slattery and Mitchell in this issue.
Besides the handling of byte stuffing and the insertion
and stripping of escape sequences, the major difference
compared to the original ABIC design is the implementation
of the renormalization shifts. The renormalization logic is
part of the critical timing path in the coder and therefore
required a thorough analysis. The use of a barrel shifter
accommodating all possible shifts would have increased the
minimum cycle time of the coder significantly. On the other
hand, adding additional clock cycles for all required
shifts would have reduced throughput in a similar way. A
statistical analysis of the shifts occurring during coding
showed that the vast majority of shifts occurring are by
one bit position
only. Therefore it was decided to insert in-line one-bit
shifter logic into the data path to facilitate shifts by
one bit without requiring an additional cycle. The
advantage of this approach is that the depth of the shift
logic is small enough so that it does not significantly
increase the cycle time of the coder. Additional cycles are
required only for the less frequent multibit shifts.
Additional details can be found in the companion paper [6]
by Kampf in this issue.
The coder also supports the generation of trailing zeros
on the input code stream during decode as long as the
end of the document is not reached. This is required to
automatically handle truncated encodings, where trailing
zeros have been removed as allowed by the JBIG standard.
Physical implementation and throughput analysis
The unit has been synthesized to 80 MHz in CMOS 5S and
CMOS 5SE and integrated into Xionics XipChip imaging
microcontrollers. Tests with real-life images in digital
copier applications have been conducted, using an image
width of 4992 pixels and a band height of 64 to 80 lines.
These tests have consistently shown a throughput
of more than 64 million pixels per second at 75-MHz
operation of the unit. This allows up to 15 ppm throughput
at 1200 dpi and up to 60 ppm throughput
at 600 dpi sharing the unit for compression and
decompression. Assuming a worst-case compression ratio of
2:1, the external memory bandwidth required by the macro is
20 MB/s for the two-line template and 28 MB/s for the
three-line template.
Summary
We have described the implementation of a combined JBIG
and ABIC compression/decompression engine for digital
imaging applications. It provides both the flexibility to
handle both JBIG and ABIC data bases and image width to
65530 pixels and the high sustainable throughput required
for state-of-the-art digital copier and collated print and
copy applications.
References
Received April 1, 1998; accepted for publication September 24, 1998
|