Photo
Trainable Information Extraction Systems

In the field of natural language processing, one important research area is the extraction and formatting of information from unstructured text. One can think of the end goal of information extraction in terms of filling templates codifying the extracted information. Although the most accurate systems often involve language processing modules that are hand-built, substantial progress has been made in applying supervised machine learning techniques to a number of processes necessary for extracting information from text. This task naturally decomposes into a sequence of processing steps, typically including: tokenization, sentence segmentation, part-of-speech assignment, named-entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling and merging. The application of machine learning techniques to information extraction is motivated by the fact that hand-built systems are time consuming to develop and require special expertise in linguistics and artificial intelligence/computational linguistics.

The Computational Linguistics and Text Mining group is actively pursuing research on Trainable Information Extractions Systems.

Our goal is to develop accurate and efficient machine learning algorithms for commercially viable information extraction systems including interactive (bootstrapping) learning algorithms and a new approach to symbolic pattern generalization applicable to relational learning from text.


Named Entity and Chunking Research

An important component of information extraction as well as many other natural language processing systems is the development of accurate text chunkers, which when cascaded, carry out the partial parsing required for template filling. One method that has been quite successful in such applications is the [SNoW architecture], developed by Prof. Dan Roth and his colleagues at the University of Illinois. This architecture is based on the Winnow algorithm, which in theory is suitable for problems with many irrelevant attributes. In natural language processing, one often encounters a very high dimensional feature space, although most of the features are irrelevant. The robustness of Winnow to high dimensional feature space is therefore considered an important reason why it is suitable for NLP tasks. However, the convergence of the Winnow algorithm is only guaranteed for linearly separable data. In practical NLP applications, data may not always be linearly separable. Consequently, a direct application of Winnow can lead to numerical instability. A second problem for the original Winnow method is that it does not produce a reliable estimate on the confidence of its prediction. Such information can be very useful in statistical modeling.

We have developed a language-independent text chunking system that uses a generalization of Winnow which can handle linearly non-separable case, and produces a reliable probability estimate. The basic idea is to modify the original Winnow algorithm so that it solves a regularized optimization problem. The resulting method converges both in the linearly separable case and in the linearly non-separable case. Its numerical stability implies that the new method is more suitable for practical NLP problems that may not be linearly separable. An additional advantage of our new algorithm is that it provides reliable confidence estimates for its classification predictions. This property is required in our statistical modeling approach. Our modified Winnow system achieves state of the art performance in text chunking with less computational cost than previous systems.



Text Chunking Results
  • Correct if the whole chunk block matches
    • Precision: percentage of predicted chunk blocks that are correctly predicted.
    • Recall: percentage of true chunks that are correctly predicted.
    • F-measure: 2PR / P + R
  • Results: F= 94.13 [precision= 94.24, recall= 94.01]
    • better than previous best: 93.48 F-measure, using a combination of 231 kernel SVMs
    • computationally less expensive
      • 16 min. to train on 400Mhz. Pentium running Linux (~ 1 minute per category)
      • vs. ~ 1 day on 500Mhz Linux machine for best competitor


For more details, please refer to [Zhang, Damerau and Johnson. To appear, 2001].


Jobs Ads Results : Summary Results
[country, desired-degree, desired-yrs-experience, title, company, platform, language, application, salary, req-degree, area, post-date, recruiter, id, city]

System F-Measure Precision Recall
IBM 79 85% 75%
Rapier (Califf, UT Austin) 75 89% 65%
Naive Bayes (Califf, UT Austin) 20 14% 32%

 



MUC-7 Named entity (dry run test)
  • Types: Person, Organization, Location, Money, Percent, Time, Date
    • ORGANIZATION (358,343) F= 93.6 [precision= 95.9 recall= 91.3]
    • LOCATION (291,302) F= 92.7 [precision= 91.1 recall= 94.5]
    • PERSON (228,226) F= 95.4 [precision= 95.6 recall= 95.2]
  • all : F= 91.9 [precision= 93.7 recall= 90.2]
    • to date, partial matching only partly implemented
  • CMU result: F= 91.63, using partial matching.
  • BBN result: F= 92.5, but unclear if partial matching or other similar techniques are used
  • Note: These are the directly comparable systems. There are also systems that combine algorithms, which can improve performance somewhat (~93).

Interactive Learning Research

Machine Learning applied to Information Extraction shows promising results. However such approaches require sufficient suitably annotated training data, which is often unavailable. This can make real world solutions costly to develop. To address this problem, we are investiguating the use of Semi-Automated Interactive Learning (SAIL) techniques that reduce the amount of manual labor and level of expertise required to train learners. We are currently wrapping up a SAIL prototype for class learners that we have successfully used to train high precision named entity recognizers. We plan to extend this approach to relational learners.

Image


Relational Learning Research

Frank J. Oles is developing a mathematical theory of precedence-inclusion patterns that will provide new approaches to pattern generalization and relational learning, with a goal to applying these approaches to information extraction tasks. The key result is that any finite set of finite precedence-inclusion patterns has an irredundant, most specific generalization that is unique up to isomorphism.

Relational Learning

There are two main kinds of supervised learning (learning patterns from labeled or annotated examples):

  1. classification
  2. relational learning.

The aim of classification is to identify patterns whose presence signifies that a structure is an element of a class. The aim of relational learning is much more complex: to identify patterns whose presence signifies that certain elements of a structure are in a particular relation to one another. The basic problem in both cases : how to define and to compute the best generalization of the patterns found in a set of examples.

Classification can be actually regarded as a special case of relational learning and some relational learning problems can be reduced to classification problems, which serves to obscure some basic facts:

  • Except for some theoretical results related to inductive logic programming that can be read as saying that best generalizations don't exist, in general, the mathematical theory of pattern generalization at a level that applies to relational learning is practically nonexistent.
  • Moreover, in the setting of relational learning, just what constitutes a definitive problem solution or how to compare one purported solution to another is quite unclear.

Pattern Generalization Manifesto

The Machine Learning community has done a lot with turning data into vectors, finding ways of grouping data-derived vectors according to identified commonalities, and applying the results to various problems. And, of course, much remains to be done in that arena.

However, the data-derived-vector-processing paradigm so successful in the past has limitations either

  1. because too much structural information is lost when data is turned into vectors, or
  2. because some problems of information extraction are inherently not classification problems

New mathematical tools are needed to handle possibly multi-dimensional ordered structures that are not adequately representable as vectors both for classification problems and for more general problems of relational learning.

A major contribution of these new tools is that they should inform us about the character of solutions to more general supervised learning problems than the ML community has solved in the past. Within their sphere of applicability, they should provide an answer to the questions of

  • What constitutes a pattern?
  • What is the definition of a structure being an example of a pattern?
  • How can we effectively decide if a structure is an instance of a pattern?
  • In what sense can one generalization of a set of examples better than another one?
  • How can patterns be graphically represented?
  • How can generalizations be computed from examples?
  • How do patterns useful for classification relate to patterns useful for relational learning?

Birth of a New Theory: Precedence-Inclusion Patterns

  1. A class of patterns whose definition is practically motivated by relational learning for the purpose of extracting information from text.
  2. Includes constituent structure trees from computational linguistics, but are far more general.
  3. Widely applicable:
    • text,
    • video,
    • speech
    • images
    • DNA (admittedly speculative),
    • other situations where hierarchical structures occur in some order.
  4. Precisely definable notion of best generalization that is tighter than that in inductive logic programming.
  5. Provides a setting in which best generalizations provably exist and are computable.
  6. Mathematically deep and rich:
    • Provides a nontrivial way to extend the definition of a (strict) partial order to a pair of interacting orders.
    • Provides new examples of Cartesian closed categories that extend the category of sets.

Expected Benefits of a New Mathematical Theory of Patterns

  1. Ability based on giving examples to extract relational data from unstructured data, e.g., text, and from semistructured data, e.g., XML documents, such as
    • Possible drug side-effects
      Not just that the text mentions a possible drug side-effect, but which drugs have which effects.
    • Personnel changes
      Not just that the text mentions a personnel change, but who is replacing whom in which position and when.
    • Terrorist threats
      Not just that the text mentions a threat, but who is taking what action and when.
    • Delayed events
      Not just that the text mentions a delayed event, but what event is delayed for how long and why.
  2. New approaches to search based on giving examples of what a user is seeking, not on keyword search.
  3. New kinds of features that can be extracted for subsequent use by traditional methods for machine learning.