|

|
In
the field of natural language processing, one important research area
is the extraction and formatting of information from unstructured text.
One can think of the end goal of information extraction in terms of filling
templates codifying the extracted information. Although the most accurate
systems often involve language processing modules that are hand-built,
substantial progress has been made in applying supervised machine learning
techniques to a number of processes necessary for extracting information
from text. This task naturally decomposes into a sequence of processing
steps, typically including: tokenization, sentence segmentation, part-of-speech
assignment, named-entity identification, phrasal parsing, sentential parsing,
semantic interpretation, discourse interpretation, template filling and
merging. The application of machine learning techniques to information
extraction is motivated by the fact that hand-built systems are time consuming
to develop and require special expertise in linguistics and artificial
intelligence/computational linguistics.
The
Computational
Linguistics and Text Mining group is actively pursuing research
on Trainable
Information Extractions Systems.
Our
goal is
to develop accurate and efficient machine
learning algorithms for commercially viable information extraction
systems including interactive (bootstrapping)
learning algorithms and a new approach to symbolic
pattern generalization applicable to relational learning from text.
Named
Entity and Chunking Research
An important component of information extraction as well as many other
natural language processing systems is the development of accurate text
chunkers, which when cascaded, carry out the partial parsing required
for template filling. One method that has been quite successful in such
applications is the [SNoW
architecture], developed by Prof. Dan Roth and his colleagues at the
University of Illinois. This architecture is based on the Winnow algorithm,
which in theory is suitable for problems with many irrelevant attributes.
In natural language processing, one often encounters a very high dimensional
feature space, although most of the features are irrelevant. The robustness
of Winnow to high dimensional feature space is therefore considered an
important reason why it is suitable for NLP tasks. However, the convergence
of the Winnow algorithm is only guaranteed for linearly separable data.
In practical NLP applications, data may not always be linearly separable.
Consequently, a direct application of Winnow can lead to numerical instability.
A second problem for the original Winnow method is that it does not produce
a reliable estimate on the confidence of its prediction. Such information
can be very useful in statistical modeling.
We have developed a language-independent text chunking system that uses
a generalization of Winnow which can handle linearly non-separable case,
and produces a reliable probability estimate. The basic idea is to modify
the original Winnow algorithm so that it solves a regularized optimization
problem. The resulting method converges both in the linearly separable
case and in the linearly non-separable case. Its numerical stability implies
that the new method is more suitable for practical NLP problems that may
not be linearly separable. An additional advantage of our new algorithm
is that it provides reliable confidence estimates for its classification
predictions. This property is required in our statistical modeling approach.
Our modified Winnow system achieves state of the art performance in text
chunking with less computational cost than previous systems.
Text Chunking Results
|
- Correct if the whole chunk block matches
- Precision: percentage of predicted
chunk blocks that are correctly predicted.
- Recall: percentage of true chunks
that are correctly predicted.
- F-measure:
2PR / P + R
- Results:
F= 94.13
[precision= 94.24, recall= 94.01]
- better
than previous best: 93.48
F-measure, using a combination
of 231
kernel SVMs
- computationally less expensive
- 16 min.
to train on 400Mhz. Pentium running Linux (~ 1 minute
per category)
- vs. ~ 1
day on 500Mhz Linux
machine for best competitor
|
For more details, please
refer to [Zhang, Damerau and Johnson. To appear, 2001].
|
Jobs
Ads Results : Summary Results
[country, desired-degree,
desired-yrs-experience, title, company, platform, language, application,
salary, req-degree, area, post-date, recruiter, id, city]
| System |
F-Measure |
Precision |
Recall |
| IBM |
79 |
85% |
75% |
| Rapier (Califf, UT Austin) |
75 |
89% |
65% |
| Naive Bayes (Califf,
UT Austin) |
20 |
14% |
32% |
|
MUC-7 Named entity (dry run test)
|
- Types: Person, Organization, Location,
Money, Percent, Time, Date
- ORGANIZATION (358,343) F= 93.6 [precision= 95.9 recall=
91.3]
- LOCATION (291,302) F= 92.7 [precision= 91.1 recall= 94.5]
- PERSON (228,226) F= 95.4 [precision= 95.6 recall= 95.2]
- all : F= 91.9 [precision= 93.7 recall=
90.2]
- to date, partial matching only partly implemented
- CMU result: F=
91.63, using partial matching.
- BBN result: F=
92.5, but
unclear if partial matching or other similar techniques are used
-
- Note: These are the
directly comparable systems. There
are also systems that combine algorithms, which can improve
performance somewhat (~93).
|
Interactive
Learning Research
Machine Learning applied
to Information Extraction shows promising results. However such approaches
require sufficient suitably annotated training data, which is often unavailable.
This can make real world solutions costly to develop. To address this
problem, we are investiguating the use of Semi-Automated
Interactive Learning
(SAIL)
techniques that reduce the amount of manual labor and level of expertise
required to train
learners. We are currently wrapping up a SAIL prototype for class
learners that
we have successfully used to train high precision named
entity recognizers. We
plan to extend this approach to relational learners.
|
 |
|
Relational
Learning Research
Frank J. Oles is developing a mathematical theory of precedence-inclusion
patterns that will provide new approaches to pattern generalization and
relational learning, with a goal to applying these approaches to information
extraction tasks. The key result is that any finite set of finite precedence-inclusion
patterns has an irredundant, most specific generalization that is unique
up to isomorphism.
Relational Learning
There are two main kinds of supervised
learning (learning patterns from labeled or annotated examples):
- classification
- relational learning.
The aim of classification is to identify
patterns whose presence signifies that a structure is an element of a
class. The aim of relational learning is much more complex: to identify
patterns whose presence signifies that certain elements of a structure
are in a particular relation to one another. The basic problem in both
cases : how to define and to compute the best generalization of the
patterns found in a set of examples.
Classification can be actually regarded as a special case of relational
learning and some relational learning problems can be reduced to classification
problems, which serves to obscure some basic facts:
- Except for some theoretical results
related to inductive logic programming that can be read as saying that
best generalizations don't exist, in general, the mathematical theory
of pattern generalization at a level that applies to relational learning
is practically nonexistent.
- Moreover, in the setting of relational
learning, just what constitutes a definitive problem solution or how
to compare one purported solution to another is quite unclear.
Pattern Generalization Manifesto
The Machine Learning community has done a lot with turning data into vectors,
finding ways of grouping data-derived vectors according to identified
commonalities, and applying the results to various problems. And, of course,
much remains to be done in that arena.
However, the data-derived-vector-processing paradigm so successful in
the past has limitations either
- because too much structural information
is lost when data is turned into vectors, or
- because some problems of information
extraction are inherently not classification problems
New mathematical tools are needed to
handle possibly multi-dimensional ordered structures that are not adequately
representable as vectors both for classification problems and for more
general problems of relational learning.
A major contribution of these new tools is that they should inform us
about the character of solutions to more general supervised learning problems
than the ML community has solved in the past. Within their sphere of applicability,
they should provide an answer to the questions of
- What constitutes a pattern?
- What is the definition of a structure
being an example of a pattern?
- How can we effectively decide if
a structure is an instance of a pattern?
- In what sense can one generalization
of a set of examples better than another one?
- How can patterns be graphically
represented?
- How can generalizations be computed
from examples?
- How do patterns useful for classification
relate to patterns useful for relational learning?
Birth of a New Theory: Precedence-Inclusion
Patterns
- A class of patterns whose definition
is practically motivated by relational learning for the purpose of extracting
information from text.
- Includes constituent structure trees
from computational linguistics, but are far more general.
- Widely applicable:
- text,
- video,
- speech
- images
- DNA (admittedly speculative),
- other situations where hierarchical
structures occur in some order.
- Precisely definable notion of best
generalization that is tighter than that in inductive logic programming.
- Provides a setting in which best
generalizations provably exist and are computable.
- Mathematically deep and rich:
- Provides a nontrivial way to
extend the definition of a (strict) partial order to a pair of interacting
orders.
- Provides new examples of Cartesian
closed categories that extend the category of sets.
Expected Benefits of a New Mathematical Theory
of Patterns
- Ability based on giving examples
to extract relational data from unstructured data, e.g., text, and from
semistructured data, e.g., XML documents, such as
- Possible drug side-effects
- Not just that the text
mentions a possible drug side-effect, but which drugs have which
effects.
- Personnel changes
- Not just that the text
mentions a personnel change, but who is replacing whom in which
position and when.
- Terrorist threats
- Not just that the text
mentions a threat, but who is taking what action and when.
- Delayed events
- Not just that the text
mentions a delayed event, but what event is delayed for how
long and why.
- New approaches to search based on
giving examples of what a user is seeking, not on keyword search.
- New kinds of features that can be
extracted for subsequent use by traditional methods for machine learning.
|