Disambiguation

The Haifa team developed the Disambiguation (DA) tool for non-exact dictionary lookup.

The goal is to match each input string to the closest item in the dictionary. An exact lookup often fails because input strings are frequently corrupted by OCR errors, as Figure 15 illustrates.


Figure 15 - Dictionary possibilities

The DA package supports two types of searches. The first type of search is a "flat" search over a dictionary of words. The second type is a hierarchical search, where an N-string combination is searched in an N-level tree-like dictionary.

A common example of a hierarchical search is address matching, where the three strings that represent "street," "zip," and "region" are matched to the closest leaves in the tree of legal "street/zip/region" branches.
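
To make the hierarchical case concrete, the following is a minimal Python sketch, not the DA package's own API. It assumes the dictionary is a nested mapping with one level per field, ordered here from region down to street, and it scores every legal branch by the sum of plain (unweighted) edit distances per level. The names `levenshtein`, `branches`, `closest_branch`, and the sample `ADDRESSES` data are all hypothetical.

```python
def levenshtein(a, b):
    """Plain edit distance: every insertion, deletion, or substitution costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def branches(tree, prefix=()):
    """Yield every root-to-leaf path of a nested-dict tree as a tuple of keys."""
    for key, subtree in tree.items():
        path = prefix + (key,)
        if subtree:
            yield from branches(subtree, path)
        else:
            yield path


def closest_branch(tree, fields):
    """Return (total_distance, branch) for the legal branch closest to `fields`.

    Brute force: every branch is scored, which is only practical for small trees.
    """
    best = None
    for path in branches(tree):
        if len(path) != len(fields):
            continue
        cost = sum(levenshtein(f, p) for f, p in zip(fields, path))
        if best is None or cost < best[0]:
            best = (cost, path)
    return best


# Hypothetical three-level dictionary: region -> zip -> street.
ADDRESSES = {
    "NORTH": {"31905": {"HERZL ST": {}, "HILLEL ST": {}}},
    "SOUTH": {"84100": {"HERZL ST": {}}},
}

# OCR-style corruption: "0" for "O", "l" for "1", "1" for "L".
result = closest_branch(ADDRESSES, ("N0RTH", "3l905", "HERZ1 ST"))
print(result)
```

Scoring whole branches rather than each field independently means that only legal combinations can be returned, which is the point of the tree-like dictionary. A production engine would presumably prune the tree instead of enumerating it.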

The DA package supports the following:

  • Multi-candidate characters with corresponding degrees of confidence.
  • Input strings that are regular expressions. This feature enables a variety of complicated searches.
  • String parsing. For example, in address recognition, there are a number of ways to write an address; hence, the disambiguation engine must be able to recognize automatically which syntax is being used.
  • Retrieval of any number of best matches, with their corresponding distances (errors) from the input.
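
Two of the features above, multi-candidate characters and retrieval of the best matches, can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the DA package's interface: each input position is represented as a mapping from candidate character to OCR confidence, a substitution is assumed to cost one minus the confidence of the dictionary character at that position, and insertions and deletions are assumed to cost 1. The names `candidate_distance` and `best_matches` are hypothetical.

```python
import heapq


def candidate_distance(observed, word):
    """Edit distance between a multi-candidate input and a dictionary word.

    `observed` is a list with one dict per input position, mapping each
    candidate character to its confidence in [0, 1].
    """
    prev = [float(j) for j in range(len(word) + 1)]
    for i, candidates in enumerate(observed, 1):
        curr = [float(i)]
        for j, ch in enumerate(word, 1):
            # Assumed cost model: a confident candidate is cheap to accept,
            # a character that is not among the candidates costs a full 1.0.
            sub = 1.0 - candidates.get(ch, 0.0)
            curr.append(min(prev[j] + 1.0,       # drop the observed position
                            curr[j - 1] + 1.0,   # insert the dictionary character
                            prev[j - 1] + sub))  # accept ch at this position
        prev = curr
    return prev[-1]


def best_matches(observed, dictionary, n=3):
    """Return the n closest dictionary words as (distance, word), best first."""
    scored = ((candidate_distance(observed, word), word) for word in dictionary)
    return heapq.nsmallest(n, scored)


# Hypothetical OCR output for a five-character word.
observed = [
    {"c": 0.9},
    {"1": 0.6, "l": 0.4},
    {"o": 0.5, "0": 0.5},
    {"c": 0.8, "e": 0.2},
    {"k": 1.0},
]
words = ["clock", "click", "cloak", "block"]

for distance, word in best_matches(observed, words, n=2):
    print(f"{word}: {distance:.2f}")
```

Returning the distance alongside each word lets the caller decide whether the best match is convincing or whether the runner-up is too close to call.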

A user-defined weight table sets the cost (or degree of severity) of each error, such as substituting one character for another, deleting a character, or inserting one. For example, users may assign a low cost to a substitution between the characters "1" and "l" because of their similarity.
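
The effect of such a weight table can be sketched in Python as a weighted edit distance. The function name `weighted_edit_distance`, its parameters, and the table layout (a dict keyed by character pairs) are assumptions made for illustration; the DA package's actual table format is not described here.

```python
def weighted_edit_distance(source, target, sub_costs=None,
                           ins_cost=1.0, del_cost=1.0, default_sub=1.0):
    """Edit distance in which each kind of error has a configurable cost.

    `sub_costs` maps (char_a, char_b) pairs to a substitution cost; a pair is
    looked up in both orders, so the table is treated as symmetric.
    """
    sub_costs = sub_costs or {}

    def substitution(a, b):
        if a == b:
            return 0.0
        return sub_costs.get((a, b), sub_costs.get((b, a), default_sub))

    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, a in enumerate(source, 1):
        curr = [i * del_cost]
        for j, b in enumerate(target, 1):
            curr.append(min(prev[j] + del_cost,                  # delete a
                            curr[j - 1] + ins_cost,              # insert b
                            prev[j - 1] + substitution(a, b)))   # substitute
        prev = curr
    return prev[-1]


# Hypothetical table: confusing "1" with "l" is a mild error.
WEIGHTS = {("1", "l"): 0.2}

print(weighted_edit_distance("he1lo", "hello", WEIGHTS))  # cheap: look-alike characters
print(weighted_edit_distance("he1lo", "hello"))           # default substitution cost
```

With the table in place, "he1lo" ranks much closer to "hello" than to any word that differs by an unrelated character, which is the behavior the low-cost substitution is meant to produce.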