Overview
This project combines the analysis of structured and unstructured text. Structured texts are artifacts such as source code, XML, and scripts; unstructured artifacts include IT and business documents (such as design documents, requirements documents, and user manuals) and comments in code. In particular, we focus on demonstrating useful linkages between IT and business documents and the application code that implements the components or processes those documents describe. We do this by statically analyzing the artifacts, indexing them together with the extracted metadata in a search engine, and then running analyses over the indexed data to identify linkages. Information Retrieval (IR) techniques that draw on ontologies, abbreviations, fragmentation, lexical proximity, and English and business thesauri are used to semantically map components of the different text artifacts to one another.
This work builds on the open-source Unstructured Information Management Architecture (UIMA) framework, developed by IBM Research, and the Apache Lucene search engine.
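To make the linkage idea concrete, the following is a minimal sketch in plain Python of how abbreviation expansion, identifier fragmentation, and a thesaurus lookup might be combined to match a requirement sentence against code identifiers. It does not use UIMA or Lucene, and it is not the project's implementation. The tables ABBREVIATIONS and SYNONYMS, the functions split_identifier, normalize, and link_score, and the sample requirement and identifiers are all hypothetical, introduced only for illustration. Reading "fragmentation" as splitting identifiers into word fragments is likewise an assumption.

```python
import re

# Hypothetical lookup tables; a real system would load these from curated
# ontologies and English/business thesauri rather than hard-coding them.
ABBREVIATIONS = {"calc": "calculate", "cust": "customer", "bal": "balance"}
SYNONYMS = {"client": "customer", "compute": "calculate"}


def split_identifier(identifier):
    """Fragment a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def normalize(tokens):
    """Expand abbreviations, then map synonyms onto one canonical term."""
    canonical = set()
    for token in tokens:
        token = ABBREVIATIONS.get(token, token)
        token = SYNONYMS.get(token, token)
        canonical.add(token)
    return canonical


def link_score(doc_text, identifier):
    """Fraction of the identifier's canonical terms that occur in the text."""
    doc_terms = normalize(re.findall(r"[a-z]+", doc_text.lower()))
    code_terms = normalize(split_identifier(identifier))
    if not code_terms:
        return 0.0
    return len(doc_terms & code_terms) / len(code_terms)


# Assumed sample data: one requirement sentence and two code identifiers.
requirement = "The system shall compute the client balance at month end."
identifiers = ["calcCustBal", "render_login_page"]

# Rank identifiers by how strongly they appear linked to the requirement.
ranked = sorted(identifiers,
                key=lambda name: link_score(requirement, name),
                reverse=True)
```

In this sketch, calcCustBal ranks first because its fragments expand to "calculate", "customer", and "balance", each of which matches a normalized term in the requirement. A fuller treatment would presumably also weigh lexical proximity and run the matching against an index rather than in memory.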