Form Dropout

The Problem
Paper forms are part of the daily routine in today's workplace. When these forms are scanned, or "captured," into a computer system, a relatively high resolution must be used to keep the documents legible, so the resulting images require large volumes of storage space. For example, a letter-size page (8.5 x 11 inches) scanned at 300 pixels/inch, where each pixel is a black or white dot that requires 1 bit of storage, takes about 1 MB in plain raster form. This is a very large storage requirement, even for a modest number of forms. As a result, standard compression routines are provided with practically every document scanning system. The most popular are the CCITT Group 3 and Group 4 algorithms, which achieve roughly an order of magnitude of compression. A typical page can therefore be stored in about 50 to 100 KB, depending on its contents.
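
The 1 MB figure follows directly from the page dimensions. The short sketch below reproduces the arithmetic; the function name is hypothetical, and the 10:1 ratio is only an assumed stand-in for the "order of magnitude" that Group 3 and Group 4 compression is said to achieve.

```python
def raster_bytes(width_in: float, height_in: float, dpi: int,
                 bits_per_pixel: int = 1) -> int:
    """Uncompressed size, in bytes, of a page scanned in plain raster form."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


raw = raster_bytes(8.5, 11, 300)   # letter-size page at 300 pixels/inch
print(raw)                         # 1051875 bytes, roughly 1 MB
print(raw // 10)                   # 105187 bytes at an assumed 10:1 ratio
```

At 1 bit per pixel the page is 2550 x 3300 pixels, which is why even a modest batch of scanned forms adds up quickly.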

For numerous applications, it is essential to significantly increase the compression ratio. In many of these applications, the majority of the documents are forms, in which all images share common "template data."

Forms of the same kind differ only in the information filled into them. This redundancy motivates a more sophisticated compression scheme, called Form Dropout, a process that removes the common information shared by all forms of the same kind. After Form Dropout is accomplished, the remaining image consists solely of the filled-in information, requiring much less storage when compressed.

The Solution
The basic idea is quite simple. First, an empty form is scanned. Its details are captured and stored in the system, building up a library of possible empty forms. This process is referred to as the template generation process, and its output is called the profile. Form recognition, or identification, is accomplished by comparing an input image with a template. The comparison is achieved by matching specific fields selected from the images. Typically, comparing one specific field is sufficient to provide a good match between the input form and its template.
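
As a rough illustration of identification by field comparison, the sketch below scores an input image against each template inside one rectangular field and picks the best match. It is a minimal sketch, not the product's method: the image representation (rows of 0/1 pixels), the field layout, the pixel-agreement score, and all names are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]             # rows of pixels, 1 = black, 0 = white
Field = Tuple[int, int, int, int]   # (top, left, height, width), assumed layout


def field_similarity(image: Image, template: Image, field: Field) -> float:
    """Fraction of pixels inside the field that agree between the two images."""
    top, left, height, width = field
    matches = 0
    for r in range(top, top + height):
        for c in range(left, left + width):
            if image[r][c] == template[r][c]:
                matches += 1
    return matches / (height * width)


def identify_form(image: Image,
                  profiles: Dict[str, Tuple[Image, Field]]) -> Tuple[str, float]:
    """Return the name and score of the template whose field matches best."""
    best_name, best_score = "", -1.0
    for name, (template, field) in profiles.items():
        score = field_similarity(image, template, field)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Two toy templates that differ only in a 2 x 2 "logo" field at the top left.
template_a = [[1, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1]]
template_b = [[0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1]]
# A filled copy of form A: same template plus two pixels of filled-in data.
filled = [[1, 1, 0, 0, 0, 0],
          [1, 0, 0, 0, 0, 0],
          [0, 0, 1, 1, 0, 0],
          [1, 1, 1, 1, 1, 1]]

profiles = {"form_a": (template_a, (0, 0, 2, 2)),
            "form_b": (template_b, (0, 0, 2, 2))}
print(identify_form(filled, profiles))   # ('form_a', 1.0)
```

Restricting the comparison to a field that the user never writes in keeps the filled-in data from disturbing the score.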

To compress each filled-in form, the appropriate template is identified and then removed, leaving only the filled-in data. The removal operation is called the Form Dropout process. The first step is to align the filled form on top of the empty form, a step known as registration. This may seem simple, but in reality, there are many potential pitfalls, described in the next sections.
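
To make the registration step concrete, the sketch below tries every small shift of the template and keeps the one that puts the most template pixels on top of black pixels in the filled form. This is only an illustrative brute-force search over whole-page translations, with hypothetical names and an assumed 0/1 pixel representation; it does not model the local stretching and shrinking that real scans exhibit.

```python
from typing import List, Tuple

Image = List[List[int]]   # rows of pixels, 1 = black, 0 = white


def overlap(filled: Image, template: Image, dy: int, dx: int) -> int:
    """Count template black pixels that land on black pixels of the filled
    form when the template is shifted down by dy and right by dx."""
    rows, cols = len(filled), len(filled[0])
    count = 0
    for r, row in enumerate(template):
        for c, pixel in enumerate(row):
            rr, cc = r + dy, c + dx
            if pixel and 0 <= rr < rows and 0 <= cc < cols and filled[rr][cc]:
                count += 1
    return count


def register(filled: Image, template: Image, max_shift: int = 2) -> Tuple[int, int]:
    """Return the (dy, dx) shift of the template that best covers the filled form."""
    best = (0, 0)
    best_score = overlap(filled, template, 0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = overlap(filled, template, dy, dx)
            if score > best_score:
                best, best_score = (dy, dx), score
    return best


# Toy template: an L-shaped corner mark.
template = [[1, 1, 1, 0, 0],
            [1, 0, 0, 0, 0],
            [1, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0]]
# The same mark scanned one pixel lower and one pixel to the right,
# plus a single pixel of filled-in data at row 3, column 3.
filled = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 0, 0, 0],
          [0, 1, 0, 1, 0],
          [0, 0, 0, 0, 0]]

print(register(filled, template))   # (1, 1)
```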

Printers are Not Perfect
When printing a form multiple times, printers cause small variations in color, style, and layout. These must be taken into account when aligning a filled form, which may have been printed last week, on top of the empty form, possibly printed a year ago.

Scanners are Not Perfect
The machinery of a scanner causes many geometric distortions, which are imperceptible to the human eye but are noticed by a computer. For example, an automatic feeder, even with a very stable engine, may accelerate while scanning a form, causing the generated image to expand in some areas and shrink in others.

Once the two images are registered, the empty form must be subtracted from the filled form. This is not a simple subtraction, since some inconsistencies still remain between the forms.
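
One simple way to tolerate such leftover inconsistencies is to thicken the template slightly before subtracting it, so that template lines that landed a pixel off are still removed. The sketch below illustrates that idea only; the square dilation, the `tolerance` parameter, and the function names are assumptions, and the actual product may handle the residue quite differently.

```python
from typing import List

Image = List[List[int]]   # rows of pixels, 1 = black, 0 = white


def dilate(image: Image, radius: int) -> Image:
    """Grow every black pixel into a (2 * radius + 1)-pixel square."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        out[rr][cc] = 1
    return out


def dropout(filled: Image, template: Image, tolerance: int = 1) -> Image:
    """Keep only filled-form pixels farther than `tolerance` from the template."""
    mask = dilate(template, tolerance)
    return [[1 if f and not m else 0 for f, m in zip(frow, mrow)]
            for frow, mrow in zip(filled, mask)]


# Template: one horizontal rule on row 1.
template = [[0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0]]
# Filled form: the rule drifts down one pixel on the right,
# and there are two pixels of filled-in data on row 4.
filled = [[0, 0, 0, 0, 0, 0, 0],
          [1, 1, 1, 1, 0, 0, 0],
          [0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0],
          [0, 0, 1, 1, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0]]

exact = dropout(filled, template, tolerance=0)     # drifted rule survives on row 2
tolerant = dropout(filled, template, tolerance=1)  # only the data on row 4 remains
print(exact[2], tolerant[2], tolerant[4])
```

The trade-off is that a larger tolerance also erases filled-in strokes that touch or run close to the form's own lines.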

For reconstruction purposes, the stored filled-in data is retrieved, decompressed, and combined with the sole existing copy of the template image to provide a full reconstruction of the form. This process is called the dropout reconstruction.
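
For bilevel images, the combination step can be pictured as overlaying the two images pixel by pixel. The sketch below assumes that both images are already decompressed into rows of 0/1 pixels and are the same size; the function name is hypothetical.

```python
from typing import List

Image = List[List[int]]   # rows of pixels, 1 = black, 0 = white


def reconstruct(data: Image, template: Image) -> Image:
    """Overlay the filled-in data on the template with a pixel-wise OR."""
    return [[1 if d or t else 0 for d, t in zip(drow, trow)]
            for drow, trow in zip(data, template)]


template = [[1, 1, 1, 1],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
data = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

print(reconstruct(data, template))   # [[1, 1, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]]
```

Because only one copy of the template is kept, the reconstructed page shows the template as it was originally scanned rather than the exact printed background of each individual form.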

Dropout Compression
The compression ratios achieved with Form Dropout vary according to the information content of the empty form and the proportion of filled-in information relative to common information. For example, if the compressed image file of an empty form contains 60 KB of data, and the same form, once filled in, compresses to 70 KB, Form Dropout can be expected to compress the filled-in form to about 5 to 6 KB.

The figure below shows a filled form on the left. The right side contains the same form after Form Dropout.

Figure 1 - A filled form versus the same form after Form Dropout