IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework that helps you build the bridge from unstructured information to structured knowledge. It is an industrial-strength, scalable platform for composing analysis engines and integrating their results into back-end information processing systems.
Targeting large-scale solution development, UIMA lets the right skills focus on the right parts of a solution and enables rapid integration across technologies and platforms. Deployment options range from tightly coupled to fully distributed, so you can maximize single-CPU performance, flexibility, scale-out, or a combination of these.
An overview of the UIMA architecture is covered in the IBM Systems Journal. For more detailed information about the architecture and the UIMA SDK please download the UIMA SDK Users Guide and Reference (caution: large file – may take some time to download).
UIMA is the engineering foundation on which IBM Research and IBM Software Group are developing and combining the work of many IBM researchers and engineers, as well as third parties, both to accelerate scientific advances and to deliver analysis into a variety of search and knowledge management applications.
IBM is committed to making UIMA an open platform. As part of this commitment, IBM has made the architecture and software framework freely available on IBM alphaWorks and has announced its plan to open-source the core UIMA framework.

At the heart of UIMA is a common representation system called the Common Analysis Structure, or CAS.
The CAS gives analysis engines read access to the artifact being analyzed (for example, a document, image, or video) and read/write access to the analysis results, or annotations, associated with defined regions of the artifact. Regions may correspond to words, sentences, or paragraphs in text, or to frames or parts of frames in video.
The CAS is shared among analysis engines working in concert as part of a larger workflow to process a collection of artifacts.
UIMA supports standard XML and high-speed binary serializations of the CAS. The CAS may be shared among Java and C++ analysis engines.
UIMA provides a native Java interface to the CAS that renders analysis results as Java objects and properties, making it easy for the Java programmer to interact with the CAS.
The CAS contains high-speed indices that provide fast access to type instances.
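The CAS properties listed above (a read-only artifact, annotations attached to regions of it, and an index keyed by type) can be illustrated with a toy data structure. This is a minimal sketch under stated assumptions, not the UIMA SDK API: the names CasSketch, SimpleCas, Annotation, select, and coveredText are hypothetical and chosen only for illustration, and the real CAS has a richer type system than plain string type names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these names are hypothetical, not the UIMA SDK API.
final class CasSketch {

    /** An analysis result attached to the region [begin, end) of the artifact. */
    record Annotation(String type, int begin, int end) {}

    /** A toy CAS: a read-only text artifact plus annotations indexed by type. */
    static final class SimpleCas {
        private final String text;
        private final Map<String, List<Annotation>> indexByType = new HashMap<>();

        SimpleCas(String text) { this.text = text; }

        /** Read access to the artifact being analyzed. */
        String getText() { return text; }

        /** Write access to the analysis results. */
        void addAnnotation(String type, int begin, int end) {
            if (begin < 0 || end > text.length() || begin > end) {
                throw new IllegalArgumentException("region outside artifact");
            }
            indexByType.computeIfAbsent(type, k -> new ArrayList<>())
                       .add(new Annotation(type, begin, end));
        }

        /** Indexed lookup: all annotations of one type, without scanning the others. */
        List<Annotation> select(String type) {
            return Collections.unmodifiableList(
                    indexByType.getOrDefault(type, Collections.emptyList()));
        }

        /** The part of the artifact covered by an annotation. */
        String coveredText(Annotation a) { return text.substring(a.begin(), a.end()); }
    }

    public static void main(String[] args) {
        SimpleCas cas = new SimpleCas("UIMA analyzes text.");
        cas.addAnnotation("Token", 0, 4);
        cas.addAnnotation("Token", 5, 13);
        cas.addAnnotation("Sentence", 0, 19);
        for (Annotation a : cas.select("Token")) {
            System.out.println(a.type() + ": " + cas.coveredText(a));
        }
    }
}
```

Keeping one list per type is the simplest possible index. It shows why a consumer that only cares about tokens does not have to visit sentence annotations.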
Analysis engines process CASes. They look at the subject of analysis and any results produced by previous analysis engines, and they discover and add more metadata to the CAS.
The logical interface for an analysis engine is simple: CAS in, CAS out. This simplicity facilitates interoperability and composability of independently developed engines.
Analysis engines may be organized and composed together to form reusable components that encapsulate rich workflows of cooperating engines. UIMA tooling supports this composition.
Analysis engines can be deployed by the framework to cooperate in a single process, in different processes on the same machine, or across machines using a variety of protocols, including SOAP.
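The CAS-in/CAS-out interface and the composition of engines into reusable aggregates can be sketched in a few lines. This is an illustration under stated assumptions, not the UIMA SDK API: EngineSketch, Cas, AnalysisEngine, AggregateEngine, TOKENIZER, and TOKEN_COUNTER are hypothetical names, the toy CAS stores results as plain strings, and all engines run in a single process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: hypothetical names, not the UIMA SDK API.
final class EngineSketch {

    /** Toy CAS: the artifact text plus the metadata engines have added so far. */
    static final class Cas {
        final String text;
        final List<String> results = new ArrayList<>();
        Cas(String text) { this.text = text; }
    }

    /** The logical interface: an engine receives a CAS and updates it in place. */
    interface AnalysisEngine {
        void process(Cas cas);
    }

    /** An aggregate is itself an engine, so workflows can be nested and reused. */
    static final class AggregateEngine implements AnalysisEngine {
        private final List<AnalysisEngine> delegates;

        AggregateEngine(AnalysisEngine... delegates) {
            this.delegates = Arrays.asList(delegates);
        }

        @Override
        public void process(Cas cas) {
            for (AnalysisEngine engine : delegates) {
                engine.process(cas); // each engine sees the results of the earlier ones
            }
        }
    }

    /** Adds one "token:<word>" result per whitespace-separated word. */
    static final AnalysisEngine TOKENIZER = cas -> {
        for (String word : cas.text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                cas.results.add("token:" + word);
            }
        }
    };

    /** Reads the tokenizer's results and adds a summary count. */
    static final AnalysisEngine TOKEN_COUNTER = cas -> {
        long n = cas.results.stream().filter(r -> r.startsWith("token:")).count();
        cas.results.add("tokenCount:" + n);
    };

    public static void main(String[] args) {
        Cas cas = new Cas("CAS in CAS out");
        new AggregateEngine(TOKENIZER, TOKEN_COUNTER).process(cas);
        System.out.println(cas.results);
    }
}
```

Because the aggregate implements the same interface as its delegates, a caller cannot tell a single engine from a whole workflow, which is what makes independently developed engines composable.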
To find out more about these and other UIMA components, see the IBM Systems Journal on Unstructured Information and, for a more detailed treatment, the UIMA SDK Users Guide and Reference (caution: large file – may take some time to download).
The CAS can contain multiple views of the same logical artifact. For example, a document may be translated into different languages. Each may represent a different view of the same logical content but may be analyzed independently. A single CAS can represent all views, providing isolated or integrated access to these multiple views.
Each view is called a Sofa for Subject of Analysis since it can become an independent subject of different analysis engines.
Sofas come in very handy for analyzing multiple modalities, for example, the video, audio, and closed captions of a video stream. Sofas can be generated on the fly. This feature supports segmentation of streaming data anywhere in the analysis pipeline.
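The idea of one CAS holding several views, each with its own subject of analysis, can be sketched as follows. This is a toy illustration under stated assumptions, not the UIMA SDK API: ViewSketch, View, MultiViewCas, createView, getView, and allViews are hypothetical names, and each Sofa is reduced to a plain string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: hypothetical names, not the UIMA SDK API.
final class ViewSketch {

    /** One view: its own subject of analysis (Sofa) and its own results. */
    static final class View {
        final String name;
        final String sofaText;
        final List<String> results = new ArrayList<>();

        View(String name, String sofaText) {
            this.name = name;
            this.sofaText = sofaText;
        }
    }

    /** A toy CAS holding several named views of one logical artifact. */
    static final class MultiViewCas {
        private final Map<String, View> views = new LinkedHashMap<>();

        /** Views can be created at any point, for example by an engine in mid-pipeline. */
        View createView(String name, String sofaText) {
            if (views.containsKey(name)) {
                throw new IllegalStateException("view already exists: " + name);
            }
            View view = new View(name, sofaText);
            views.put(name, view);
            return view;
        }

        /** Isolated access: one view by name. */
        View getView(String name) {
            View view = views.get(name);
            if (view == null) {
                throw new IllegalArgumentException("no such view: " + name);
            }
            return view;
        }

        /** Integrated access: every view of the artifact, in creation order. */
        List<View> allViews() { return new ArrayList<>(views.values()); }
    }

    public static void main(String[] args) {
        MultiViewCas cas = new MultiViewCas();
        cas.createView("en", "The cat sleeps");
        cas.createView("de", "Die Katze ruht");
        // An engine may analyze a single view in isolation...
        View en = cas.getView("en");
        en.results.add("wordCount:" + en.sofaText.split("\\s+").length);
        // ...or walk every view of the same logical content.
        for (View v : cas.allViews()) {
            System.out.println(v.name + " -> " + v.sofaText + " " + v.results);
        }
    }
}
```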
UIMA has been used as a platform for IBM’s video analysis and search system, MARVEL.
UIMA supports the development and deployment of C++ and Java analysis engines and supports their interoperability in both collocated and distributed deployment models through several different high-speed mechanisms.
Analysis engines may be developed with a host of technical dependencies. Key to component reuse is that engines can be packaged up from the environment in which they are developed and tested, and then deployed in a different environment. UIMA includes utilities for packaging an analysis engine and all its dependent resources and installing it in a different run-time environment.
Additionally, components are associated with a variety of metadata that facilitates their discovery by solution integrators targeting specific analysis requirements.
UIM applications typically don’t stop after a single document; rather, they tend to process large collections of documents. Of ultimate interest is typically the aggregate analysis result collected over an entire collection of unstructured information sources.
Applications want to avoid going down in the middle of processing millions of documents just because a single document was strangely formatted.
Additionally, applications want to scale out to better utilize hardware resources, especially given that document processing is often easily parallelized across many analysis pipelines.
UIMA provides robust failure recovery, logging, and multi-pipelining to support building scalable unstructured information analysis and search applications.
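The collection-level requirements above (keep going when one document is malformed, run several pipelines in parallel, and collect an aggregate result) can be sketched with the standard Java concurrency library. This is an illustration under stated assumptions, not how UIMA implements collection processing: CollectionSketch, Report, countWords, and processCollection are hypothetical names, word counting stands in for a real analysis pipeline, and failures are recorded in a list where a real system would log them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: hypothetical names, not the UIMA SDK API.
final class CollectionSketch {

    /** Aggregate outcome of a run over a whole collection. */
    static final class Report {
        int totalWords;
        final List<String> failedDocuments = new ArrayList<>();
    }

    /** Stand-in for a per-document pipeline: counts words, rejects empty input. */
    static int countWords(String document) {
        if (document == null || document.trim().isEmpty()) {
            throw new IllegalArgumentException("empty or malformed document");
        }
        return document.trim().split("\\s+").length;
    }

    /** Processes every document on a pool of parallel pipelines. */
    static Report processCollection(List<String> documents, int pipelines) {
        ExecutorService pool = Executors.newFixedThreadPool(pipelines);
        Report report = new Report();
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : documents) {
                futures.add(pool.submit(() -> countWords(doc)));
            }
            for (int i = 0; i < futures.size(); i++) {
                try {
                    report.totalWords += futures.get(i).get();
                } catch (ExecutionException e) {
                    // One bad document is recorded and skipped; the run continues.
                    report.failedDocuments.add("doc " + i + ": " + e.getCause().getMessage());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for results", e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("first document here", "", "another one");
        Report report = processCollection(docs, 2);
        System.out.println("total words: " + report.totalWords);
        System.out.println("failures: " + report.failedDocuments);
    }
}
```

Results are gathered on the calling thread, so the report needs no synchronization. Each failure is isolated in its own Future, which is what keeps a single strangely formatted document from ending the whole run.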