Spell on Query - Search on OCR
Even in the era of electronic archives, plenty of paper archives still need to be searched and accessed. In recent years, there has been a movement toward digitizing them. Some paper archives have very low print quality, which in turn produces low-quality OCR output. The conventional way to handle low-quality OCR is:
OCR > Speller > [key in] > Text Indexing > Full Text Search
Example:
Say the OCR recognizes the word 'came' as 'caae'. A dictionary-based speller then selects its two best candidates, 'care' and 'case', and the word is indexed under those. As a result, searching for 'came' will not find this word.
To overcome such OCR faults, a manual correction (key-in) step is applied. This step makes the digitization process labor intensive.
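The failure in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `index_with_speller` and `toy_suggest` are hypothetical names, and the ranked candidate list for 'caae' is hard-coded to mirror the example rather than produced by a real speller.

```python
def index_with_speller(ocr_tokens, suggest, top_n=2):
    """Index each OCR token under its top_n speller candidates (conventional flow)."""
    index = {}
    for position, token in enumerate(ocr_tokens):
        for candidate in suggest(token)[:top_n]:
            index.setdefault(candidate, []).append(position)
    return index


def toy_suggest(token):
    """Hypothetical speller: returns ranked candidates for a token.

    The ranking for 'caae' is an assumption chosen to match the example;
    tokens the toy speller does not know are returned unchanged.
    """
    ranked = {"caae": ["care", "case", "came"]}
    return ranked.get(token, [token])


index = index_with_speller(["he", "caae", "home"], toy_suggest)
print("came" in index)  # False: the correct word was ranked third and dropped
```

Because the spelling decision is frozen at indexing time, the raw OCR value 'caae' is no longer available when the query for 'came' arrives.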
Our innovative approach is called Spell on Query (SoQ):
OCR > Raw Indexing (with weights) > Search (with weights) > SoQ
The query words are the important ones. The search returns the same results as if the archive had been indexed using only the query words as the dictionary.
Example:
Say the OCR recognizes the word 'came' as 'caae', and the word is indexed under that raw value ('caae') with no spell checking.
When the word 'came' is queried, a 'spelling' process uses the query words as its dictionary and therefore decides to treat 'caae' as 'came'. Thus, the query results include this word.
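The query-time idea can be sketched as follows. This is a minimal illustration, not the actual SoQ implementation: it assumes plain edit distance as the matching rule, it leaves out the index and search weights mentioned in the pipeline, and the names `build_raw_index`, `spell_on_query`, and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def build_raw_index(docs):
    """Index raw OCR tokens exactly as recognized, with no spell checking."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
    return index


def spell_on_query(index, query, max_distance=1):
    """Treat the query words as the dictionary: an indexed token matches
    when it lies within max_distance edits of a query word."""
    hits = set()
    for word in query.lower().split():
        for token, doc_ids in index.items():
            if edit_distance(token, word) <= max_distance:
                hits |= doc_ids
    return hits


docs = {
    "page1": "he caae home late",   # OCR misread 'came' as 'caae'
    "page2": "take the train home",
}
index = build_raw_index(docs)
print(spell_on_query(index, "came"))  # {'page1'}
```

Scanning every indexed token per query word is fine for a sketch but would not scale to a large archive. Edit distance alone can also over-match real words that happen to be close to the query word, which is presumably where weighting comes in.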
Our approach, which searches raw OCR, has these unique advantages:
- No key-in process, which eliminates the manual correction effort
- Dictionaries are not used in the process
- Spelling decisions are not made at indexing time
- The query words themselves are used to maximize recognition results
- Improved search results, even on low-quality input