
Disambiguation Technologies



  Overview

The DISAMBIGUATION (DA) package is a tool for non-exact dictionary lookup.

Input strings are often corrupted by OCR errors, so an exact lookup would frequently fail. Instead, each input string is matched with the closest item in the dictionary.
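
To make "closest item" concrete, here is a minimal sketch that uses plain Levenshtein (edit) distance as the closeness measure. This is an assumption for illustration only: the page does not say which metric the DA package uses, and the names `edit_distance`, `closest`, and the sample dictionary are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: unit cost per substitution, insertion, or deletion."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def closest(word: str, dictionary: list[str]) -> tuple[str, int]:
    """Return the dictionary entry nearest to `word`, with its distance."""
    best = min(dictionary, key=lambda entry: edit_distance(word, entry))
    return best, edit_distance(word, best)


# Hypothetical dictionary; the OCR engine has read "L" as "1".
streets = ["HERZL", "HILLEL", "HAIFA"]
print(closest("HERZ1", streets))  # → ('HERZL', 1)
```

A linear scan like this is only practical for small dictionaries; a production engine would presumably index the dictionary to avoid comparing against every entry.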

The package supports two types of searches. The first type of search is a "flat" search over a dictionary of words. The second type is a hierarchical search, where an N-string combination is searched in an N-level tree-like dictionary.

A common example of a hierarchical search is address matching: the three strings that represent "street," "zip," and "region" must be matched to the closest branch in the tree of legal "street/zip/region" combinations.
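
The hierarchical search can be sketched as a depth-first walk over a nested dictionary, summing the per-level distances and abandoning any branch whose accumulated error exceeds a threshold. Everything here is assumed for illustration: the tree layout (region, then zip, then street), the sample entries, the unit-cost edit distance, and the function names are hypothetical and not taken from the DA package.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical 3-level dictionary: region -> zip -> street (leaves map to {}).
TREE = {
    "HAIFA": {
        "31905": {"HERZL": {}, "HILLEL": {}},
        "32000": {"TECHNION": {}},
    },
    "TEL AVIV": {
        "61000": {"HERZL": {}, "DIZENGOFF": {}},
    },
}


def search(tree, inputs, max_error):
    """Return (total_error, branch) for the best legal branch, or None.

    `inputs` holds one OCR string per tree level, top level first.
    """
    best = None

    def walk(node, level, path, error):
        nonlocal best
        if level == len(inputs):            # reached a complete branch
            if best is None or error < best[0]:
                best = (error, tuple(path))
            return
        for key, child in node.items():
            total = error + edit_distance(inputs[level], key)
            if total <= max_error:          # prune branches that are already too far
                walk(child, level + 1, path + [key], total)

    walk(tree, 0, [], 0)
    return best


# Each of the three strings contains one OCR error.
print(search(TREE, ("HA1FA", "3l905", "HERZ1"), max_error=4))
```

Matching the levels jointly matters: "HERZL" is a legal street under both regions, and only the combination with the region and zip strings picks out a single branch.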

  The package supports the following features:

  • Multi-candidate characters with corresponding degrees of confidence
  • Input strings may be regular expressions, which allows searches that a single literal string cannot express
  • A string parsing tool. Consider, for example, address recognition: there are several ways to write an address, so the disambiguation engine must automatically recognize which syntax is being used
  • Any number of best matches can be retrieved, with their corresponding distances (errors) from the input

    A user-defined weight table lets users set the cost (degree of severity) of each type of error: substituting one character for another, deleting a character, or inserting a character. For example, users may assign a low cost to a substitution between the characters "1" and "l" because of their similarity.
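
The weight table and the retrieval of several best matches can be sketched together as a weighted edit distance plus an N-best selection. The table format, the cost values, and the names `SUB_COST`, `weighted_distance`, and `n_best` are assumptions made for this example; the actual format of the DA package's weight table is not described here.

```python
import heapq

# Hypothetical weight table: cost of reading `observed` where `expected` was meant.
# Visually similar pairs are cheap; unlisted pairs fall back to DEFAULT_SUB.
SUB_COST = {
    ("1", "l"): 0.2, ("l", "1"): 0.2,
    ("0", "O"): 0.2, ("O", "0"): 0.2,
}
DEFAULT_SUB = 1.0
INS_COST = 1.0   # a character missing from the OCR output
DEL_COST = 1.0   # a spurious character in the OCR output


def weighted_distance(observed, entry):
    """Edit distance in which each error type has its own cost."""
    prev = [j * INS_COST for j in range(len(entry) + 1)]
    for i, co in enumerate(observed, start=1):
        curr = [i * DEL_COST]
        for j, ce in enumerate(entry, start=1):
            sub = 0.0 if co == ce else SUB_COST.get((co, ce), DEFAULT_SUB)
            curr.append(min(
                prev[j] + DEL_COST,       # drop the observed character
                curr[j - 1] + INS_COST,   # supply the missing character
                prev[j - 1] + sub,        # match or substitute
            ))
        prev = curr
    return prev[-1]


def n_best(observed, dictionary, n):
    """Return the n nearest entries as (distance, entry) pairs, best first."""
    scored = ((weighted_distance(observed, entry), entry) for entry in dictionary)
    return heapq.nsmallest(n, scored)


# "mi11" is closest to "mill" because both errors are cheap 1/l confusions.
for distance, entry in n_best("mi11", ["mint", "milk", "mill"], n=2):
    print(f"{entry}: {distance:.1f}")
```

With unit costs, "mill", "milk", and "mint" would all be two substitutions away from "mi11"; the weight table is what separates them.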

  Example — Postal Address Search

Inputs:
  • OCR output of mail images, with confidences and multiple candidates
    (input strings may also be regular expressions)
  • Address dictionary(s)
  • Address grammar rule(s)
    (The following are some easily defined rules:)
      text .. (e.g., name)
      text .. (e.g., company)
      street #digits#
      zip(4 digits or more) region
  • Error thresholds for grammar parsing and matching, and timeout requirements
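
As a rough illustration of how grammar rules such as "street #digits#" and "zip(4 digits or more) region" could be used to recognize which syntax a line follows, the sketch below encodes them as Python regular expressions. The rule names, the patterns, and the fallback to free text are assumptions made for this example, not the DA package's grammar format. Unlike the package, which parses within an error threshold, this sketch only accepts lines that match a rule exactly.

```python
import re

# Hypothetical encodings of two of the rules listed above.
RULES = {
    # street #digits#: non-digit text followed by a house number
    "street_line": re.compile(r"^(?P<street>\D+?)\s+(?P<number>\d+)$"),
    # zip(4 digits or more) region: a zip code followed by the region name
    "zip_line": re.compile(r"^(?P<zip>\d{4,})\s+(?P<region>.+)$"),
}


def parse_line(line):
    """Return (rule_name, fields) for the first matching rule.

    Lines that match no rule are treated as free text (e.g., name or company).
    """
    line = line.strip()
    for name, pattern in RULES.items():
        match = pattern.match(line)
        if match:
            return name, match.groupdict()
    return "text", {"text": line}


for line in ["IBM Research", "Herzl 12", "31905 Haifa"]:
    print(parse_line(line))
```

The extracted fields (street, zip, region) are what would then be passed to the dictionary search.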


Output:
  • Correct address from dictionary
  • Matching confidence level



 
