
Disambiguation Technologies
Overview
The DISAMBIGUATION (DA) package is a tool for non-exact dictionary lookup.
In general, an input string should be matched with the closest item in the dictionary, because input strings are often corrupted by OCR errors and an exact match cannot be expected.
The package supports two types of searches. The first type of search is a "flat" search over a dictionary of words. The second type is a hierarchical search, where an N-string combination is searched in an N-level tree-like dictionary.
A common example of that type of search is address matching: the three strings that represent "street," "zip," and "region" should be matched to the closest leaves in the tree of legal "street/zip/region" branches.
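To make the hierarchical case concrete, here is a minimal sketch in Python. It is not the DA package's API: the nested-dictionary layout, the function names, the toy data, and the choice of a plain unit-cost edit distance summed across levels are all assumptions made for illustration. A real engine would prune the tree rather than visit every branch.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic unit-cost edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def iter_branches(tree, prefix=()):
    """Yield every root-to-leaf path of a nested-dict tree as a tuple."""
    for key, child in tree.items():
        path = prefix + (key,)
        if child:
            yield from iter_branches(child, path)
        else:
            yield path


def closest_branch(tree, fields):
    """Return (total_distance, branch) for the legal branch nearest to `fields`."""
    return min(
        (sum(levenshtein(f, k) for f, k in zip(fields, branch)), branch)
        for branch in iter_branches(tree)
    )


# Hypothetical toy dictionary of legal street/zip/region branches.
ADDRESS_TREE = {
    "MAIN STREET": {"10001": {"NEW YORK": {}}},
    "MARKET STREET": {"94105": {"CALIFORNIA": {}}},
}

# OCR-style corruption: "1" for "I", letter "O" for digit "0", and vice versa.
result = closest_branch(ADDRESS_TREE, ("MA1N STREET", "1OOO1", "NEW Y0RK"))
```

Here `result` holds the total distance and the matched branch, so a caller can turn the distance into a confidence value by whatever rule suits the application.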
The package supports the following features:
- Multi-candidate characters with corresponding degrees of confidence
- Regular expressions as input strings, which enables a variety of complex searches
- A string-parsing tool. In address recognition, for example, an address can be written in a number of ways, so the disambiguation engine must be able to recognize automatically which syntax is being used
- Any number of best matches can be retrieved, with their corresponding distances (errors) from the input
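As an illustration of the first feature, multi-candidate characters, the sketch below scores a dictionary word against an OCR result in which each position carries several candidate characters with confidences. The data layout (one `{char: confidence}` mapping per position), the cost rule `1 - confidence`, and the function names are assumptions for this example, not the package's actual interface.

```python
def candidate_cost(candidates, char):
    """Cost of reading `char` at a position; unlisted characters cost 1.0."""
    return 1.0 - candidates.get(char, 0.0)


def lattice_distance(lattice, word, gap_cost=1.0):
    """Edit distance between a list of per-position candidate maps and a word."""
    prev = [j * gap_cost for j in range(len(word) + 1)]
    for i, candidates in enumerate(lattice, start=1):
        curr = [i * gap_cost]
        for j, ch in enumerate(word, start=1):
            curr.append(min(
                prev[j] + gap_cost,                          # skip an OCR position
                curr[j - 1] + gap_cost,                      # skip a word character
                prev[j - 1] + candidate_cost(candidates, ch),  # align the two
            ))
        prev = curr
    return prev[-1]


# Hypothetical OCR output for a four-character word.
lattice = [
    {"c": 1.0},
    {"l": 0.75, "1": 0.25},
    {"o": 0.5, "0": 0.5},
    {"g": 1.0},
]

d_clog = lattice_distance(lattice, "clog")  # every character is a listed candidate
d_cog = lattice_distance(lattice, "cog")    # requires skipping one OCR position
```

A word whose characters all appear among the high-confidence candidates gets a small distance, while a word that forces a skipped position pays the full gap cost.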
A user-defined weight table lets users set the cost (or degree of severity) of each error, such as the substitution of one character for another, or the deletion or insertion of a character.
For example, users may allow a low-cost substitution between the characters "1" and "l" because of their similarity.
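The effect of such a weight table can be sketched with a weighted edit distance. The following Python example is a hedged illustration only: the table contents, the default costs, and the names `SUBSTITUTION_COST`, `weighted_distance`, and `best_matches` are hypothetical and do not come from the DA package. It also shows how several best matches can be returned together with their distances.

```python
import heapq

DEFAULT_COST = 1.0

# Hypothetical weight table: visually similar pairs get a low substitution cost.
SUBSTITUTION_COST = {
    frozenset("1l"): 0.2,
    frozenset("0O"): 0.2,
}


def substitution_cost(a, b):
    """Look up the cost of replacing a with b (symmetric, zero if equal)."""
    if a == b:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((a, b)), DEFAULT_COST)


def weighted_distance(source, target,
                      insert_cost=DEFAULT_COST, delete_cost=DEFAULT_COST):
    """Edit distance in which each kind of error has its own cost."""
    prev = [j * insert_cost for j in range(len(target) + 1)]
    for i, cs in enumerate(source, start=1):
        curr = [i * delete_cost]
        for j, ct in enumerate(target, start=1):
            curr.append(min(
                prev[j] + delete_cost,                    # drop cs
                curr[j - 1] + insert_cost,                # add ct
                prev[j - 1] + substitution_cost(cs, ct),  # replace cs with ct
            ))
        prev = curr
    return prev[-1]


def best_matches(query, dictionary, n=3):
    """Return the n closest dictionary words as (distance, word) pairs."""
    scored = ((weighted_distance(query, word), word) for word in dictionary)
    return heapq.nsmallest(n, scored)


matches = best_matches("he1lo", ["hello", "help", "jello"], n=2)
```

With this table, reading "1" where the dictionary has "l" costs far less than an arbitrary substitution, so "hello" ranks first for the query "he1lo".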
Example Postal Address Search
Inputs:
Output:
- Correct address from dictionary
- Matching confidence level