Text Analysis and Language Engineering

Research Areas


Key Words

Computational Lexicology, Computational Linguistics, Natural Language Processing, Information Retrieval, Text Analysis and Mining, Syntax and Semantics, Parsing, Finite State Automata, Knowledge Management, Visualization ...

General Research Areas

  • Text Analysis: recognition and extraction of lexical entities such as names, terms, dates, monetary amounts, and abbreviations; vocabulary generation.
  • Document Representation: topic shift detection, single-document summaries (derived via different strategies), multiple-document summarization, multi-threaded summaries, document structure analysis.
  • Cross-document Coreference: inter- and intra-document aggregation of disambiguated entities in text and corpora.
  • Natural Language Processing: shallow parsing, part-of-speech tagging, anaphora resolution, coherence determination.
  • Search Enhancements: query refinement, document expansion.
  • Speech Mining: cleanup and text analysis of ASR transcripts.
  • Navigation and Visualization: active markup, lexical navigation, dynamic visualization.
  • Knowledge Management: the extraction, representation, and application of domain-specific vocabularies and relationships from text.
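
As a rough illustration of the Text Analysis area above, the sketch below recognizes a few kinds of lexical entities with regular expressions. The patterns, labels, and the function name `extract_entities` are assumptions made for this example only; they do not describe how any of the group's tools work, and production systems rely on much richer lexical resources.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),            # ISO-style dates only
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),     # US dollar amounts only
    "abbreviation": re.compile(r"\b[A-Z]{2,}\b"),             # runs of capital letters
}

def extract_entities(text):
    """Return (label, surface form) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    # Sort by character offset so results follow the text order.
    return [(label, value) for _, label, value in sorted(found)]

if __name__ == "__main__":
    sample = "ACME paid $1,250.00 on 2001-03-15 for ASR transcripts."
    for label, value in extract_entities(sample):
        print(f"{label}: {value}")
```

Pattern matching of this kind covers only well-formed surface strings; names and technical terms generally need dictionaries and context, which this sketch does not attempt.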

Research Projects and Technologies

  • TALENT (Text Analysis and Language Engineering Technology): These tools analyze text and extract meaningful lexical information from it, including concept names and relationships among them. Toolkit highlights include:
    • morpho-lexical analysis
    • named entity extraction
    • technical terminology identification
    • abbreviation processing
    • part-of-speech tagging
    • lexical relations highlighting
    • topic segmentation
    • cross-document coreference
  • Summarizer: A system that uses advanced text analysis techniques to produce indicative summaries of documents. Summaries are intended for use in document management and retrieval systems, where their role is to provide users with concise, readable representations of documents' contents. Summarizer uses a "summarization by sentence extraction" approach to generate a document summary. Its algorithm comprises a set of strategies for ranking the sentences in a document by salience, and for extracting the most salient sentences to produce a summary of any specified length.
  • Finite State Language Modeling (INTEX): INTEX is a Natural Language Processing development environment based on Finite State Transducers (FSTs). It parses texts of several million words, and includes large-coverage dictionaries and grammars. INTEX builds lemmatized concordances and indices of texts with respect to all types of Finite State patterns. It is used as a lexical parser to produce the input of a syntactic parser, but can also be viewed as an information retrieval system.
  • Lexical Navigation: A technology that shows how extracted information can be browsed and navigated to enhance information discovery and search.
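
To make the "summarization by sentence extraction" idea described for Summarizer more concrete, here is a minimal sketch. It assumes a single, very simple salience strategy (average corpus frequency of a sentence's content words), a naive sentence splitter, and a tiny stopword list; the names `summarize`, `split_sentences`, and `content_words` are hypothetical. The actual Summarizer combines several ranking strategies that are not specified here.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "it",
             "of", "on", "or", "that", "the", "to", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Return the most salient sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def salience(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: salience(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return [sentences[i] for i in chosen]

if __name__ == "__main__":
    doc = ("Finite state transducers parse text. The weather was nice. "
           "Transducers parse text quickly.")
    print(summarize(doc, max_sentences=2))
```

The `max_sentences` parameter plays the role of the "specified length" mentioned above; returning the chosen sentences in document order keeps the extract readable.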