TALENT
TALENT (Text Analysis and Language Engineering Technology) is a
suite of text analysis modules that can be used to analyze text in different
ways. TALENT is developed by the Information
Retrieval and Analysis Group at IBM's T. J. Watson Research Center.
There are several TALENT components:
TEXTRACT
Textract recognizes different kinds of significant items in text, such
as names, terms, and abbreviations. It consists of several components that
cooperate to recognize significant objects in text and to annotate
a document data object with the extracted information.
-
SentSep - A high-speed tokenizer that finds word, sentence, and
paragraph boundaries in text.
-
Nominator - A name recognizer that discovers names of people, places,
organizations, and other entities occurring in text. Nominator links together
different names that refer to the same entity, such as "President Clinton"
and "Bill Clinton". A recent paper on the Nominator technology is
"Disambiguation of Proper Names in Text", by Nina Wacholder, Yael Ravin,
and Misook Choi, published in the Proceedings of the Fifth Conference on
Applied Natural Language Processing, 31 March - 3 April 1997, Washington,
DC. A preprint is available for download (anlp97.ps, PostScript, 120 kB).
-
Terminator - A terminology recognizer that automatically finds
meaningful multiword terms in text, such as "laser printer" or "expense
account". Terminator also identifies morphological variants of terms,
such as "expense accounts" for "expense account".
-
Abbreviator - Finds and links abbreviations with their full forms.
-
Other recognizers - Other kinds of significant entities are also
recognized by TALENT, such as dates, numbers, and money amounts.
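To make the idea of linking an abbreviation with its full form concrete, here
is a minimal Python sketch. It is not Abbreviator's actual method, which this
page does not describe. It assumes the simplest case only: an all-capitals
abbreviation in parentheses, immediately preceded by words whose initials
spell it out. The function name link_abbreviations is hypothetical.

```python
import re


def link_abbreviations(text):
    """Map each parenthesized abbreviation to the words preceding it.

    Illustrative assumption: the full form is the run of words directly
    before "(ABBR)" whose initial letters spell the abbreviation.
    """
    links = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        # Take as many preceding words as the abbreviation has letters.
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]
        if len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        ):
            links[abbr] = " ".join(candidate)
    return links


sample = "The Information Quotient (IQ) is computed after Textract runs."
print(link_abbreviations(sample))
```

A production recognizer would also have to handle abbreviations that skip
function words, use lowercase letters, or appear before their full forms.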
Relations
After identifying significant vocabulary items with Textract, the Relations
program finds relationships between them, based on evidence found in the
collection being analyzed. For example, there may be a relationship between
the terms "laser printer" and "toner cartridge". There are two
kinds of relationships:
-
Unnamed - These relations are characterized by a strength, which
is related to how often the terms co-occur.
-
Named - In some cases, a label for the relationship can be
found in the text. Typical labels are "President and CEO of" (relating
the name of an organization to that of a person) and "located in"
(representing one of several constructs relating an organization name to
a place name).
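The strength of an unnamed relation can be pictured with a small Python
sketch. The page says only that strength is related to how often the terms
co-occur; the Dice coefficient used here is an assumed stand-in, not the
measure the Relations program actually computes, and the function name
cooccurrence_strength is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_strength(docs):
    """Score each pair of items that co-occur in at least one document.

    docs is a list of sets of vocabulary items, one set per document.
    The score is the Dice coefficient (an assumption for illustration):
    2 * (docs containing both) / (docs containing a + docs containing b).
    """
    item_counts = Counter()
    pair_counts = Counter()
    for items in docs:
        unique = sorted(set(items))  # sorted so each pair has one key order
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): 2 * together / (item_counts[a] + item_counts[b])
        for (a, b), together in pair_counts.items()
    }


docs = [
    {"laser printer", "toner cartridge"},
    {"laser printer", "toner cartridge", "expense account"},
    {"expense account"},
]
strengths = cooccurrence_strength(docs)
print(strengths[("laser printer", "toner cartridge")])
print(strengths[("expense account", "laser printer")])
```

In this toy collection, "laser printer" and "toner cartridge" always appear
together and so score higher than "expense account" and "laser printer",
which share only one document.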
Information Quotient
Once all the significant vocabulary has been identified in a document collection,
a statistical program called IQ computes each item's Information
Quotient, a measure of the salience of a word, name, or term with respect
to the other vocabulary in the collection. A name, word, or term with a
high IQ is a significant vocabulary item in the document(s)
in which it occurs, differentiating those documents from other documents
in the collection. High-IQ items make good query terms for the documents
in which they appear.
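The IQ formula itself is not given here. As a rough analogy only, the Python
sketch below uses a TF-IDF style weight, a different and well-known measure,
to show how an item that is frequent in one document but rare across the
collection ends up with a high score. The function name salience and the
sample data are hypothetical.

```python
import math
from collections import Counter


def salience(docs):
    """Return one {item: score} dict per document.

    docs is a list of lists of vocabulary items. The score is
    term frequency * log(number of docs / docs containing the item),
    a TF-IDF stand-in and NOT the actual Information Quotient.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for items in docs:
        doc_freq.update(set(items))  # count each item once per document
    scores = []
    for items in docs:
        term_freq = Counter(items)
        scores.append({
            item: count * math.log(n_docs / doc_freq[item])
            for item, count in term_freq.items()
        })
    return scores


docs = [
    ["laser printer", "laser printer", "IBM"],
    ["expense account", "IBM"],
    ["IBM", "toner cartridge"],
]
scores = salience(docs)
best = max(scores[0], key=scores[0].get)
print(best, scores[0])
```

Here "IBM" occurs in every document and therefore cannot distinguish any of
them, while "laser printer" is confined to the first document and ranks as
its best query term.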
IMDICT
IMDICT is a high-speed, high-volume dictionary database access method
based on EVECTOR technology. It is used to store Textract
information -- classification, IQ, and frequency statistics for words,
names, and terms.