Censuses

Success Story - Swiss Census

Three OCR technologies were developed and used for the Swiss census:

  • Extremely robust recognition of printed numerals
  • Recognition of handwritten numerical data
  • Recognition of handwritten alphanumeric data

Extremely robust recognition of printed numerals
Each page of a Swiss census form carries an identification number that, once recognized, serves as its key into the evolving databases. It was crucial that these numbers be recognized without errors. For this reason, we developed a recognition routine for printed numerals, trained for a specific typeface. The technique selects, for each pair of characters, special points that distinguish between the two with high reliability. We searched for all distinguishing points and used all the available information in the decision process. For this data field, we achieved a recognition rate approaching 100%.
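The pairwise idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the routine used in the project: the 3x5 glyph templates, the helper names (`to_bits`, `classify`) and the rule that a numeral must win every pairwise comparison are all hypothetical.

```python
from itertools import combinations

# Hypothetical 3x5 glyph templates for one fixed typeface ('#' = ink).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": ["..#", "..#", "..#", "..#", "..#"],
    "7": ["###", "..#", "..#", "..#", "..#"],
    "8": ["###", "#.#", "###", "#.#", "###"],
}


def to_bits(rows):
    """Flatten a glyph into a list of booleans, row by row."""
    return [ch == "#" for row in rows for ch in row]


BITS = {label: to_bits(rows) for label, rows in TEMPLATES.items()}

# For each character pair, the pixel positions where the two templates differ.
DISTINGUISHING = {
    (a, b): [i for i, (x, y) in enumerate(zip(BITS[a], BITS[b])) if x != y]
    for a, b in combinations(sorted(BITS), 2)
}


def classify(rows):
    """Return the label that wins every pairwise comparison, else None."""
    bits = to_bits(rows)
    wins = {label: 0 for label in BITS}
    for (a, b), points in DISTINGUISHING.items():
        # At a distinguishing point the input matches exactly one template.
        agree_a = sum(bits[i] == BITS[a][i] for i in points)
        agree_b = len(points) - agree_a
        if agree_a > agree_b:
            wins[a] += 1
        elif agree_b > agree_a:
            wins[b] += 1
    best = max(wins, key=wins.get)
    # Assumed acceptance rule: reject unless one label beats all others.
    return best if wins[best] == len(BITS) - 1 else None


# An "8" with one corner pixel missing is still read as "8".
noisy_eight = ["###", "#.#", "###", "#.#", "##."]
result = classify(noisy_eight)
```

Because each comparison looks only at the points where two specific characters differ, noise elsewhere in the glyph does not affect that comparison.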

Recognition of handwritten numerical data
Many of the data fields in each form are numeric in nature (for example, birth date, ID number, etc.) and contain almost no internal context. For these fields we developed a routine capable of recognizing handwritten symbols limited to a set of ten numerals. This algorithm is composed of a segmentation procedure, a decision network, and a verification process. Segmentation transforms the bitmap image of a given character into a graph, which preserves the topological information of the image and describes the symbol in terms of lines, curves, angles and other geometric features. This graph is then processed by a decision tree, which classifies the symbol into a specific category. Once the decision is made, a verification process determines its reliability, using statistical methods to evaluate the probability that the decision made for a given character image is correct. If the verification result matches the decision with a high probability, the numeral is recognized; otherwise, it is marked as not reliable.

Using this routine, a 95% recognition rate was achieved, and 90% of the cases were classified as reliable. Among the reliable cases, the substitution rate was only 0.4%. The remaining 10% of the cases were rejected as unreliable; of these rejected cases, 55% had nevertheless been recognized correctly.
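The reported figures are consistent with one another, as a short calculation shows. The sketch below assumes that the 0.4% substitution rate is measured relative to the reliable cases and that the 55% is measured relative to the rejected cases; the function and parameter names are illustrative only.

```python
def overall_recognition_rate(reliable_share, substitution_rate, rejected_correct_rate):
    """Share of all numerals whose decision was correct.

    reliable_share        -- fraction of cases classified as reliable
    substitution_rate     -- fraction of reliable cases that were wrong
    rejected_correct_rate -- fraction of rejected cases that were right anyway
    """
    accepted_correct = reliable_share * (1.0 - substitution_rate)
    rejected_correct = (1.0 - reliable_share) * rejected_correct_rate
    return accepted_correct + rejected_correct


# Figures quoted in the text: 90% reliable, 0.4% substitutions, 55% of rejects correct.
rate = overall_recognition_rate(0.90, 0.004, 0.55)
print(f"{rate:.4f}")  # 0.9514, i.e. roughly the 95% quoted above
```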

Recognition of handwritten alphanumeric data
The kernel for recognizing handwritten alphanumeric data is aimed at the automatic capture of handwritten capital letters. In principle, the recognition process follows the same steps as the numerical recognition, but because the symbol set is much larger, a much more detailed description is used for the segmentation process. The decision process is based on eliminating possible outcomes using filters for specific feature settings. The result of the decision is therefore not a single outcome, but rather a set of possibilities. The average size of this set is about two characters. In general, alphanumeric data contains some context information. In the case of the Swiss census forms, context was used by specifying dictionaries of possible answers for each question. Therefore, the next step in the recognition process is to search the possible answers for the specific field for the best match to the recognition results.
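The dictionary step can be illustrated with a minimal sketch. It assumes one candidate set per character position and scores each dictionary entry by the number of positions it is compatible with; the scoring rule, the function name `best_match` and the sample dictionary are hypothetical and not taken from the project.

```python
def best_match(candidates, dictionary):
    """Pick the dictionary entry most compatible with the candidate sets.

    candidates -- list of sets, one set of possible characters per position
    dictionary -- iterable of allowed answers for the field
    Returns (word, score); score counts positions where the word's letter
    is among the candidates. Entries of a different length are skipped.
    """
    best_word, best_score = None, -1
    for word in dictionary:
        if len(word) != len(candidates):
            continue
        score = sum(ch in cands for ch, cands in zip(word, candidates))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Made-up recognizer output: about two candidates per character.
candidates = [{"B", "8"}, {"E", "F"}, {"R", "K"}, {"N", "M"}]
# Made-up dictionary of allowed answers for a place-name field.
dictionary = ["BERN", "BASEL", "GENF", "CHUR"]

match = best_match(candidates, dictionary)
```

Here only "BERN" is compatible with all four candidate sets, so the ambiguity left by the character-level decision is resolved by the field's context.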

The following image is a sample of a census form. It contains both printed and handwritten alphanumeric characters.

Figure 30 - A sample census form

Success Story - Israeli Census

The OCR technologies for the Israeli census are similar to those used in the Swiss census. However, more efficient handling of OCR uncertainties was required. As part of this project, we developed SmartKey (see the section on SmartKey above).

The customer reported a total error rate of 0.2%.

In one of the survey booths, a summary of statistics was collected:

  • Recognition rate of a specific important field - 100%
  • Barcode recognition rate - 97%
  • Handwritten numerals - 29.4%

The remaining handwritten numerals (about 70%) were manually verified using SmartKey. Of all the characters handled by SmartKey, 1.3% were marked as falsely recognized; these characters were then typed in manually. Over 98% were verified as correct, so no manual typing was required for them.