TALENT

TALENT (Text Analysis and Language Engineering Technology) is a suite of text analysis modules, each of which analyzes text in a different way. TALENT is developed by the Information Retrieval and Analysis Group at IBM's T. J. Watson Research Center.

There are several TALENT components:

Textract

Textract recognizes different kinds of significant items in text, such as names, terms, and abbreviations. It consists of several components which can cooperate in recognizing significant objects in text and annotating a document data object with extracted information.

Relations

After Textract has identified the significant vocabulary items, the Relations program finds relationships between them, based on evidence found in the collection being analyzed. For example, there may be a relationship between the terms "laser printer" and "toner cartridge". Relations distinguishes two kinds of relationships.

Information Quotient

Once all the significant vocabulary has been identified in a document collection, a statistical program called IQ computes each item's Information Quotient, a measure of the salience of a word, name, or term with respect to the other vocabulary in the collection. A high IQ means that the item is significant in the document(s) in which it occurs and helps differentiate those documents from the rest of the collection. High-IQ items make good query terms for the documents in which they appear.
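The IQ formula itself is not given here. As a rough illustration of the idea only, the sketch below uses plain TF-IDF as a stand-in: an item scores high in a document when it is frequent there and rare elsewhere in the collection. The function names `salience_scores` and `top_query_terms` and the sample data are hypothetical and are not part of TALENT.

```python
import math
from collections import Counter


def salience_scores(docs):
    """Score every vocabulary item in every document.

    `docs` is a list of documents, each a list of already-extracted
    vocabulary items (words, names, or terms). TF-IDF is used here as
    an assumed stand-in; it is NOT the actual IQ computation.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each item occur?
    doc_freq = Counter()
    for items in docs:
        doc_freq.update(set(items))

    scores = {}
    for index, items in enumerate(docs):
        counts = Counter(items)
        scores[index] = {
            item: (count / len(items)) * math.log(n_docs / doc_freq[item])
            for item, count in counts.items()
        }
    return scores


def top_query_terms(docs, doc_index, k=3):
    """Return the k highest-scoring items for one document."""
    ranked = sorted(
        salience_scores(docs)[doc_index].items(),
        key=lambda pair: (-pair[1], pair[0]),  # score descending, then name
    )
    return [item for item, _ in ranked[:k]]
```

With a toy collection in which "warranty" appears in every document, that item scores zero everywhere, while an item confined to a single document ranks first for it. This matches the behavior described above, where high-scoring items are the ones that set a document apart.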

IMDICT

IMDICT is a high-speed, high-volume dictionary database access method based on EVECTOR technology. It is used to store Textract information: classification, IQ, and frequency statistics for words, names, and terms.
