|
The life science industry is an emerging market in which application spaces, such as drug discovery and development in the pharmaceutical sector and clinical record management in health care, have become areas of significant recent interest.1 Documents in the scientific literature play an important role in life science by serving as a potential source for underlying knowledge discovery. These documents are a rich repository of information on relationships among biomedical concepts such as genes, proteins, diseases, and a variety of other key topics.
Text mining is a technology that makes it possible to discover patterns and trends semiautomatically from huge collections of unstructured text.2–6 It is based on technologies such as natural language processing, information retrieval, information extraction, and data mining.7 Early papers in this area mentioned the possibility of knowledge discovery from the biomedical literature. Hearst, one of the founders of text mining, proposed a system for predicting the functions of unknown genes using biomedical documents.2 Swanson also described the idea of discovering new knowledge from the biomedical literature.4 Subsequently, considerable research has been done in the areas of biomedical concept extraction (named-entity extraction), relationship extraction, and network/pathway construction for protein-protein interaction. However, although text mining has proved a promising approach for knowledge discovery from text sources, certain specific problems are encountered when trying to apply it to the realm of life science.
First, existing approaches are incapable of handling the vast amount of textual domain-specific information available. Indeed, there is more data available than anyone could possibly read or digest. For example, MEDLINE**8 is a database of over 11 million citations (abstracts) of biomedical articles dating back to the 1960s. MEDLINE is widely used as a golden standard for text-mining systems in life science, and several text-mining applications using MEDLINE have been proposed. The MedMeSH Summarizer9 extracts MeSH** (Medical Subject Headings) terms10 that can summarize the nature of a cluster of gene names obtained from DNA microarrays (also called DNA chips). MedMiner11 is a system that filters information for the PubMed** search engine.12 Obviously, any approach that applies text-mining methods to such a large document collection must be highly scalable and robust.
Second, existing information extraction systems only provide extracted concepts and relationships in a fixed way. Because these systems are noninteractive, it is difficult to iteratively apply mining processes on their results directly. With an interactive text-mining system, users are better able to discover hidden knowledge by using a combination of mining functions and a trial-and-error approach.
To address these problems we have developed a text-mining system called IBM TAKMI* for Biomedical Documents (designated MedTAKMI hereafter), which is capable of mining the entire MEDLINE database in an interactive manner. The predecessor of this system, TAKMI (Text Analysis and Knowledge MIning), is a text-mining system for customer relationship management (CRM), which has been successfully used in call centers to mine customer support call logs.13 The MedTAKMI system extends TAKMI to provide a useful set of tools for knowledge discovery from biomedical documents. MedTAKMI is designed to handle large document sets and is thus capable of mining the entire set of MEDLINE citations.
The development of methods for extracting information on such biomedical concepts as genes, proteins, and diseases from text is an active area of research14–19 and typically involves the following two primary subtasks:
-
Entity extraction—the recognition of gene, protein, and chemical names from biomedical text
-
Relation extraction—the extraction of relationships among these entities
Thus architecturally MedTAKMI consists of two main components designed to handle information extraction and entity/relationship mining.
The MedTAKMI system performs entity extraction based on dictionary lookup. This approach is simple conceptually and can recognize entities very quickly. We have developed a large domain dictionary that contains two million biomedical entities. These entities and their associated category names are used as keywords in the MedTAKMI system so that users can search for documents that contain a keyword within a specific category, for example, a query on the keyword “p53” within the gene category.
In a preprocessing stage input documents are parsed by a shallow syntactic parser which extracts keywords (entities) with category labels, as well as any binary and ternary relationships that may exist among these entities. The MedTAKMI runtime engine then uses this information to provide mining functions to users. Categories are constructed from public ontological knowledge, for example, using the MeSH terms in MEDLINE or the resources provided by Gene Ontology**.3 User-defined resources may also be employed.
There has been extensive research in relation extraction,20–35 wherein the goal is to extract relationships among biomedical entities (e.g. proteins and genes), from patterns such as “A inhibits B” and “A activates B,” where A and B represent specific entities. Such relationships may be extracted by using one or more of the following information and methods:
-
Surface string patterns20
-
Syntactic information from shallow parsing21,22 and full parsing23–27
-
Templates and rules28–30
-
Statistical information with machine learning32–35
In particular, the MedTAKMI system uses syntactic information with a shallow parser to extract binary (a noun and a verb) and ternary (two nouns and a verb) relationships. These relationships from the document collection are aggregated and can be displayed by category viewers as described later.
As previously noted, MedTAKMI is an extension of TAKMI, a text-mining system for CRM.13 The main differences between these two systems are the following:
-
The use of hierarchical categories: TAKMI only supports flat categories such as product names. For MedTAKMI we developed a hierarchical category viewer because most biomedical entities (e.g., genes and diseases) are defined hierarchically.
-
The extraction of ternary relationships to capture protein-protein interaction by using deeper language analysis
-
The introduction of support for domain-specific mining functions
-
The development of a new system architecture and componentization structure
This paper is organized as follows: The next section describes the key features of MedTAKMI, including the system architecture and the information extraction process. We then introduce the searching and mining functionalities of MedTAKMI. The following section provides an example of the application of the MedTAKMI system to a specific user scenario. Finally, we summarize our work and describe directions for future research.
Features of the MedTAKMI mining system
The MedTAKMI architecture consists of two main components: a preprocessing information extraction stage and a runtime search/mining server, as shown in Figure 1. In this section we briefly discuss these main components and then describe methods for information extraction.
Figure 1
Information extraction process. Information extraction occurs as a preprocessing step and involves several subcomponents (see Figure 1, upper left). In Step 1, the term annotator finds words in the input text (i.e., the sentence with the label “Input” shown in Figure 2) using the term dictionary and identifies these words using their canonical form. (As described in more detail later, the term dictionary contains a pair of forms for each term: a surface form and a canonical form.) The obtained canonical words are embedded in the text document as annotations in XML (eXtensible Markup Language). In Step 2, the annotated text is passed to a syntactic parser. The parser outputs segments of phrases labeled with their syntactic roles, for example NP (noun phrase) or VG (verb group). The category annotator then assigns categories to the terms in these segments and phrases. The category dictionary consists of a set of canonical forms and their categories, which also indicates the node label in the hierarchy of categories. The hierarchical categories are in turn imported from existing hierarchies, such as the MeSH terms in MEDLINE, or from user-designed resources. Syntactic relationships among these entities, for example, subject-verb (S-V) or subject-verb-object (S-V-O), are also extracted from the output of the parser. All extracted information is finally encoded into an index file that is used by the runtime part of the system.
Figure 2
Search/mining server process. The search and mining server, shown in the lower part of Figure 1, provides users with the searching and mining services described in detail later. The MedTAKMI system is a Web application and its client code is loaded into Internet Explorer** as an applet (a servlet version is also available now). Note that hierarchical category definitions are shared by both the information-extraction and the runtime-server preprocessing components.
MEDLINE—a collection of biomedical documents. In their work life-science researchers typically use MEDLINE,8 a bibliography database that covers the biomedical area. To understand our approach to the extraction of information from MEDLINE, some understanding of MEDLINE itself is required. MEDLINE is administered by the National Center for Biotechnology Information (NCBI)36 of the United States National Library of Medicine (NLM).37 It contains approximately 11 million biomedical citations, dating from the mid 1960s to the present. Citations in MEDLINE are collected from over 4600 biomedical journals published worldwide. Biomedical citations in MEDLINE are available to the general public at the PubMed Web site.12 Figure 3 shows an example of one such citation.
Figure 3
To make citation lookup easy, NLM indexes the articles in MEDLINE for retrieval and classification using the Medical Subject Headings (MeSH) thesaurus.10 MeSH headings consist of sets of descriptors in a hierarchical structure. Subheadings, or qualifiers, provide additional specificity for each descriptor. The major MeSH headings indicate the main contents of the article, and the minor MeSH headings are used to describe secondary topics. MeSH also contains other features such as check tags and age tags.
Each citation contains the article title, abstract, authors' names, MeSH headings, affiliations, publication date, journal name, and other information. The title and abstract are text strings that can be manipulated by natural-language-processing text-mining techniques. For example, the MEDLINE citation shown in Figure 3 contains the information shown in Table 1.
|
| Table 1
Meta-data for a MEDLINE citation |
|
|
|
|
|
| Symbol | Meaning | Value |
|
| PMID | PubMed Identifier | 12060689 |
| TIS | ISSN for the journal | 1362-4962 |
| VI | Volume | 30 |
| DP | Published Date | 2002 Jun 15 |
| TI | Title | Dictionary-driven prokaryotic gene finding |
| AB | Abstract | Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects.......... (Snipped.) |
| AD | Affiliation | Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan. |
| AU | Author | Shibuya, Tetsuo |
| AU | Author | Rigoutsos, Isidore |
| MH | MeSH heading | Algorithms |
| MH | MeSH heading | Amino Acid Sequence |
| MH | MeSH heading | Base Sequence |
| MH | MeSH heading | Codon, Initiator |
| MH | MeSH heading | Computational Biology/*methods (* represents Major MeSH) |
| MH | MeSH heading | *Genes, Archaeal |
| TA | Journal Title Abbreviation | Nucleic Acids Res |
|
Figure 4 shows the result of retrieving the MeSH descriptor “Amino Acid Sequence” using the MeSH Browser.38 These results indicate that there are two nodes labeled “Amino Acid Sequence” in the descriptor thesaurus: one at G06.184.603.060 and the other at L01.453.245.667.060. These codes represent locations; for example, G06.184.603.060 is the child node of Molecular Structure (G06.184.603). The letter G at the beginning of the location represents the major category Biological Sciences, while L represents another major category, namely, Information Science.
Figure 4
Dictionary-based information extraction. The preprocessing stage thus extracts meta-data information from MEDLINE documents. In addition to information such as author and publication date, the system can extract information from other text fields, for example, title or abstract, using natural language processing. This information can include sets of keywords (e.g. protein names), predicate-argument binary relations (e.g. “activate-protein”), and ternary relations (e.g. subject-verb-object dependency triplets).
In the preprocessing phase of the MedTAKMI system, document titles and abstracts are parsed by CCAT, a shallow syntactic parser developed at the IBM Thomas J. Watson Research Center using an approach originally proposed by Charniak.39 Because this is a general-purpose parser that has not been trained for biomedical documents, it is difficult to obtain optimized results by parsing documents from the medical domain. We solve this problem by first annotating the text with domain dictionaries. The term dictionaries are constructed by users and employ resources such as UMLS** (Unified Medical Language System)40 or the users' own proprietary resources. The annotations facilitate the parsing of medical-domain text even when the parser has not been specifically trained for this domain. This annotation process is needed for two primary reasons:
-
Identification of term boundaries—Most technical terms in the medical domain, for example, protein names, are compound words. Thus, biomedical terms tend to consist of a combination of numerals, symbols, and verbs, making it very difficult to find term boundaries. For instance, the compound noun “repetitive sequence-based polymerase chain reaction” consists of an adjective (repetitive), a past participle of a verb (sequence-based) and three nouns (polymerase, chain, reaction). The annotations made by the technical term dictionary are based upon a part-of-speech (POS) analysis. Note that words, such as chain, which can be a noun or a verb, can further complicate this process.
-
Aggregation of synonymous expressions and spelling variations—There can be multiple expressions that are synonymous with a particular technical term. These can arise from abbreviations or acronyms as well as from spelling variations. If these variations are recognized as different entities, it can often cause problems for text mining. For instance, “DNA” and “deoxyribonucleic acid” are synonyms; thus, they should be counted as the same entity in the mining process. The dictionary contains spelling/abbreviation variants and their canonical forms. By reducing these variants to a single canonical form, we can treat them as the same entity.
After text is annotated with a technical term dictionary it is parsed by a parser, or tagger. In this system, we use the CCAT parser to assign a POS to each word. This determination is based on the statistical distribution of candidate POSs for a word and the probability of POS transitions (from/to adjoining words) that are extracted from a training corpus. After the annotation and parsing processes, phrases are determined by using head-driven models,41,42 and category identifiers are assigned. The current MedTAKMI system maintains approximately 270000 hierarchical category identifiers. Of these, roughly 170000 are MeSH category identifiers; the rest are user-defined categories. MeSH terms are updated annually, and the revisions are incorporated into MedTAKMI.
Relation extraction. In general, a conventional text analysis system counts the frequency of occurrence of a given term and calculates its importance. The system can then find documents that contain the query terms or their synonyms. It may also display relationships among various terms using visualization tools. This would allow us to discover terms that are strongly related, but we still would not know how or in what way they are related. For example, we could find all documents containing the word smoking. But even though smoking may appear frequently within these documents, conventional text analysis systems are unable to tell us, for example, what the effects of smoking are, who is affected, and to what extent. We could next identify terms that are strongly related to the word smoking, but simple co-occurrence of two terms (e.g., smoking and risk) reveals very little about the semantic relationship between them. Conventional text analysis systems ignore hints that indicate whether a verb is negative or not and whether it is active or passive. Such systems are unable to distinguish between the meanings of the following two sentences: “Smoking increases the risk of lung cancer,” and, “The risk of lung cancer increases the consequences of smoking.” Both sentences contain the terms smoking and risk; simple co-occurrence detection is not sufficient to derive relationships.
The TAKMI system for CRM13 is able to extract “subject ... verb” or “verb ... object” relationships in a sentence. It can extract such concepts as “modem ... broken,” “file ... not found,” and “hard disk ... slow” from customer queries. These extractions were instrumental in facilitating problem detection and workload reduction for analysts at customer help centers. Although the binary “subject ... verb” or “verb ... object” relationships suffice for customer support purposes, a more detailed concept extraction is needed for biomedical articles. This is especially true because the “what subject influences what object” relationship is a very important one in the biomedical domain. For instance, it is clear that extraction of the relationship “protein_A...activates...protein_B...with...enzyme_C” provides more detailed information than either “protein_A...activates...protein_B” or “protein_B...with...enzyme_C.” Furthermore, we need to distinguish between “protein_A...activates...protein_B...with...enzyme_C” and “protein_B...activates...protein_A...with...enzyme_C.” In the MedTAKMI system, we extract these ternary relationships (as well as binary relationships) using natural-language-processing technology. These extracted relationships are used as keywords by various MedTAKMI mining functions, as described in the next section. Table 2
shows actual examples of ternary relationships extracted from MEDLINE documents.
|
| Table 2
Examples of ternary relationships extracted from MEDLINE |
|
|
|
|
|
| apoptosis...induce...p53 |
| apoptosis...inhibit...tumor protein p53 (Li-Fraumeni syndrome) |
| tumor protein p53...play...role |
| adhesion receptors...activate...extracellular signal-regulated kinase |
| Ang II...activated...p38 |
|
Searching and mining functions
The MedTAKMI system is a text-mining system that can be integrated with search engines. It supports keyword-based search engines and can also be used with other search engine implementations such as IBM's GTR (Global Text Retrieval) and DB2* Net Search Extender. The mining process can be applied to the whole database or to a subset document collection obtained by a series of searches. Users can submit a query and receive a document collection in which each document contains the query keywords or their synonyms. Mining functions can then be applied to the collection in order to discover underlying information, such as protein-protein relationships. Alternatively, users can continue the search process by using the results of previous searching and mining operations. Interactivity is a very important MedTAKMI feature because it allows users to switch between searching and mining in a flexible manner.
MedTAKMI provides various mining functions for large document collections. Some of these functions are general, but others are tailored to the life-science domain. The following functions are introduced in this section:
-
Keyword-based and full-text searching
-
Hierarchical category viewer
-
Chronological viewer
-
Two-dimensional viewer (term-association)
-
Trend analysis viewer
-
Other analytical tools
Most of these viewers can function interactively, a unique feature of the MedTAKMI system. Experiences with text mining in the CRM domain13 have previously demonstrated the importance of interactivity in that application, and interactivity also proves to be an important requirement for users in the life-science domain. For example, in knowledge-discovery and data-mining (KDD) tasks, such as on-line analytical processing (OLAP), users may wish to use a fact discovered in a previous mining result to structure a new query from a different point of view. Indeed, a user often cannot define a complete query beforehand but rather must iteratively refine the query based on previous results. Furthermore, when a document collection is very large, a single query may not be able to sufficiently narrow the collection, requiring a user to submit a sequence of queries to obtain the desired subset of documents. MedTAKMI's viewers help users to navigate toward this goal interactively.
Keyword-based and full-text searching. The MedTAKMI system provides two types of searching: keyword and full text. A keyword index which associates keywords with category codes is built by the information extraction phase, as described above. Users can thus submit a query such as “search for documents that contain the word p53 as a gene name.” A disadvantage of keyword search, however, involves inaccuracies in keyword extraction. If a term consisting of multiple words is not recognized as a keyword in the indexing phase, then this term cannot be matched later in a query submitted by a user. Thus, in MedTAKMI, users can also choose a full-text search which uses different indexes built separately to retrieve documents in which the term appears literally. The MedTAKMI system allows users to switch between these two search engines seamlessly.
Hierarchical category viewer. The hierarchical category viewer shows keyword distribution in a data collection over a predefined hierarchy. For example, Figure 5 shows the distribution of keywords for the “Disease” node and its child nodes in the MeSH hierarchy.
Figure 5
Blue bars at the bottom of the viewer show the frequency of occurrence of a keyword. In Figure 5, there are 35 documents that contain the term “neoplasms” or any of the terms in its child nodes. (Note that for efficiency reasons this frequency was calculated for a subset of documents sampled from the full document collection. The cardinality of such a subset can be specified by the user.)
Red bars indicate relative frequency—this measure compares the current document subcollection to the initial document collection indexed in the MedTAKMI system. In other words, searching is a process that narrows down a document collection. Assume D is the initial document collection. A keyword search due to query q1 returns D1, the subset of D that satisfies q1. Similarly, let (q1, q2, ..., qn) be a sequence of successive queries and let (D1, D2, ..., Dn) be document collections such that Di is the result of query qi made also on collection D. The relative frequency for a keyword w in the document collection Di is calculated using the following formula:
where |Di| is the number of documents in the document collection Di and c(w, Di) is the number of documents that contain the word w in the collection Di.
Two types of hierarchies are registered with the MedTAKMI system. The flat type has no children; for example, in Figure 5, “Compound,” “Amino Acid,” and “Organ” are flat. The tree type appears with a `+' notation; thus “Dry Lab Methods” and “MeSH Minor” in Figure 5 are tree type. Users can define the hierarchy set using public hierarchies such as MeSH and Gene Ontology3 as well as user-specific or company-specific hierarchies. We have developed a hierarchy that currently consists of 95000 nodes for the various concepts in the life-science domain.
Chronological viewer. This viewer allows a user to discover trends by viewing the chronological distribution of a set of documents. Figure 6 shows the monthly distribution of approximately 330000 documents selected from MEDLINE in 2002. MedTAKMI supports yearly, monthly, and daily distributions. Using this viewer, one can determine when a certain keyword began to appear in the biomedical literature and how its frequency of occurrence changed with time. For example, the term HIV does not appear in MEDLINE documents in the 1970s, but by the 1990s it occurs with high frequency.
Figure 6
Two-dimensional maps (term-association). The two-dimensional viewer allows a user to visualize the strength of association between keywords. Figure 7 shows protein-protein associations in a mouse document collection. The value in each cell represents the strength of association of two keywords—the higher the value, the stronger the association. For example, the proteins “Bcl2-associated X protein” and “G elongation factor” have a strong association. The numbers “19 (36.54%)” in the cell mean that there were 19 documents which mentioned both Bcl2-associated X protein and G elongation factor, and that these 19 documents comprise 36.54 percent of the documents mentioning Bc12-associated X protein. Because 36.54 percent is much higher than other percentages, the cell is automatically highlighted.
Figure 7
Formally, the value for each cell v(wxi, wyj, D) is calculated by using the following formula:
where c(w, D) is the number of documents that contains the word w in the collection D. The words wxi and wyj belong to the categories of x and y axes, respectively. M and N are the size of the matrix for each x and y axis. Clicking a cell leads to further searching using the two keywords.
Users are not limited to protein-protein interactions. Associations for protein-gene, gene-disease, disease-organ, and so forth, are also possible. In fact, any combination of hierarchy terms is possible.
Other analytical tools. The viewers described thus far are interactive, allowing for a flexible mining process. However, some tools require a much longer analysis time to produce a response. These tools are available to the user as off-line tools in the MedTAKMI system. During the mining process intermediate results can be stored and used as input to an off-line tool. The final output of the off-line tool is shown independent of the runtime MedTAKMI process. This section introduces two off-line functions provided by MedTAKMI.
Similar entry search tool. The first function is one that permits searching of similar biological entities. This function uses the term-association table data described in the previous section. Each row of the table is represented by a vector whose attributes correspond to the keywords in the columns and whose elements represent the joint occurrence probabilities of the two corresponding entities. The degree of similarity between row entities is calculated based on the distance between their corresponding vectors using a cosine measure. Principal component analysis is used to reduce dimensions and improve precision because the number of columns tends to be large. Figure 8 shows the ranking result of signaling proteins similar to protein 14_3_3.
Figure 8
Subset analysis tool. The second function allows for the category view data or term association view data of two different queries to be compared and analyzed to identify significant differences. Consider two data sets A and B. The occurrence probabilities of entries that appear in both data set A and data set B are compared. Entries are first organized into the following three categories based on the AIC (Akaike Selection Criterion) approach for statistical model selection criteria.43
-
Entries frequently appearing in data set A
-
Entries frequently appearing in data set B
-
Entries common to both data set A and data set B
For each category, entries are ranked by a score based on a statistical test. We use the following statistical test:
H0:pA = pB
where pA and pB are the probability of occurrence of the entry in data set A and data set B respectively. The chi-square statistic is calculated from the occurrence data. Under the null hypothesis H0, this statistic has a chi-square distribution with one degree of freedom
21. We define the following ranking score,
Ranking score = log10p
using the p-value calculated from the observed data.
Figure 9 shows the result of subset analysis by this function using disease-experiment two-dimensional-map data extracted by the MedTAKMI system. Documents are selected using the term “Adult” in the age category in data set A and “Child” in data set B. Figure 9A shows the common entries in both A and B. Figure 9B shows the entries which appear frequently in data set A, and Figure 9C shows those appearing frequently in data set B. In Parts 9B and 9C, entries are ranked by the statistical base score described earlier.
Figure 9
Usage scenario
The MedTAKMI system can be used to aid in drug discovery and clinical information retrieval. For example, defects in the AML1 gene are thought to cause acute leukemia. Suppose we are interested in developing a treatment for leukemia and are looking for potential drug targets. We describe below the application of the MedTAKMI system to this problem. For this example a subset of approximately 330000 of the latest abstracts from the 2003 MEDLINE distribution was used. A keyword search reveals that 54 papers in the sample set mention AML1. These are depicted in Figure 10. Each line shows the publication year, PubMed ID, and title of the paper.
Figure 10
A category view (Figure 10B) based on the NCBI LocusLink44 phenotype information indicates that leukemia is indeed the term most frequently associated with AML1, appearing in 17 of the 54 papers. Furthermore, the relative frequency of the term leukemia indicates that it appears 54.41 times more often in this subset of 54 papers than it does in the entire sample database.
If we sort the entries in Figure 10B in decreasing order of relative frequency (Figure 10C), we see that myelodysplasia syndrome 1 (a protein that appears to be amplified in human prostate cancer)45,46 and 3q21q26 syndrome (a syndrome associated with over-expression of the Evi-1 gene, an event in turn associated with myeloid leukemia) are the phenotypes that are strongly associated with AML1. Similarly, we can discover that various fusion genes and proteins such as AML1-MTG16, MTG8, AML1-ETO, and TEL-AML1 are also strongly associated with AML1 when we consider the category view for prime names of substances (Figure 10D).
Now, consider a search using the keyword leukemia, which selects 1051 papers from the entire sample database. We have already established that there appears to be a close association between AML1 and leukemia; if we can find another protein that is closely associated with leukemia, that protein might in turn serve as a potential target for drug design. A two-dimensional (2D) map analysis between LocusLink phenotypes and signaling proteins in this subcollection of 1051 papers shows very interesting results (Figure 11).
Figure 11
We find that some signaling proteins such as STYKc and TyrKc are commonly referred to in papers about leukemia, HMG-CoA lyase deficiency, hepatic lipase deficiency, and Miller-Dieker syndrome. Similarly, SAM is associated with the same phenotypes. (In fact, fusion of the SAM domain of the TEL oncogene to AML1 has been shown.45,46) In addition, HATPase_c appears associated with all the phenotypes. ITAM, on the other hand, might reveal a potential interesting relationship between leukemia and other phenotypes such as osteosarcoma. We can now click on a cell in the 2D map to access the corresponding papers if we are interested in pursuing a particular association. These results could then be explored as possible leads for drug development. Thus, the MedTAKMI system provides a contextual overview for identifying potential targets for drug design and facilitates the navigation of relevant papers.
Conclusions and future work
There are many useful applications for text mining in the life-science industry, particularly because of the vast amount of technical data and the myriad relationships contained therein that are waiting to be inferred, identified, and collated. There are also many textual databases other than MEDLINE to be explored. For example, laboratory notebooks, internal reports, and patent documents are all very important resources that contain valuable information about new gene, protein, and drug discoveries.
The next stages of the MedTAKMI system evolution include: (1) expanding the range of document sources to include these other databases, (2) developing a layer that allows integration of the MedTAKMI system with external and internal tools that are currently in daily use, and (3) improving the performance of the results system by developing a distributed version of MedTAKMI running on a grid architecture.47,48 Because both the information-extraction process and the search/mining process handle several files independently, we believe that the file I/O and calculation costs can be shared across multiple processing units. If MedTAKMI can be extended to run on a grid architecture, the performance of both processes will be improved.
Compliance with the Unstructured Information Management Architecture (UIMA) proposed by a research community in IBM is also a major challenge.49 The UIMA is a common framework for text analytics tools that are defined as text analysis engines (TAEs). We have already redefined our information extractors as TAEs. We are also working on developing text-mining middleware for life-science applications.50 In this project, the MedTAKMI system runs as a part of UIMA-based middleware.
In summary, this paper has described the MedTAKMI text-mining system, a set of tools for knowledge discovery derived from millions of biomedical documents. The MedTAKMI system is able to parse 11 million MEDLINE citations with a syntactic shallow parser and extract relationships from these parsed entities with hierarchical category identifiers. It also provides a toolkit of interactive viewers to aid in the discovery of underlying knowledge in the literature of life science.
Acknowledgments
We would like to thank Celestar Lexico-Sciences, Inc. for collaborative work during the development and deployment of MedTAKMI.
*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of the Gene Ontology Consortium, the United States National Library of Medicine, or Microsoft Corporation.
Accepted
for publication February 6, 2004; Internet publication July 21, 2004. |