DOI: 10.1147/rd.506.0575
Visualization of complementary systems biology data with parallel heatmaps
by R. M. Podowski, B. Miller, and W. W. Wasserman
Systems biology research generates large, complex datasets. While clustering algorithms can identify subsets of genes that behave similarly, the interpretation of inter-gene relationships can be difficult. In only a small subset of cases can a biological theme be accurately ascribed to a statistical grouping of genes. Interpretation is complicated by the fact that popular clustering algorithms are guaranteed to produce clusters, even if no underlying biological process or statistical motivation exists. Assessment and interpretation of clusters can be simplified when data from multiple sources is correlated. Examples of bioinformatics tools that facilitate such integrative approaches include GoMiner [1], EASE [2], FunSpec [3], and methods based on comparative analysis across species (co-expression networks and interologs) [4–6].
Visualization tools for assessment of correlations offer an alternative approach based on accessing the cumulative knowledge of human specialists, knowledge that can be difficult to replicate computationally [7]. The scientific community has recognized the benefits of visualization in data exploration and interpretation. Numerous publications, workshops, and conferences addressing these issues can be found through the IEEE Computer Society [8] and ACM [9] Web sites. Improved coordination across multiple views has received recent emphasis [10, 11]. Tools exist that are designed to explore a single source of information, as exemplified by Cytoscape [12] and Osprey [13] for navigation of molecular interaction data, Pathway Voyager [14] for KEGG [15] metabolic pathway visualization, and the browser for the Reactome knowledge base of biochemical pathways and biological processes [16].
As the availability of complementary high-throughput data is growing, we sought means to visually discover new relationships within large and complex data. Heatmaps are well established in genomics, provide a means to rapidly identify relationships across large datasets, and conveniently display continuous data through color intensity [17–20]. Most gene expression analysis packages—from academic tools such as Cluster and TreeView [17], Hierarchical Clustering Explorer (HCE) [21, 22], GDS Browser [23], and Prism [24] to commercial products such as the LION ArraySCOUT**, Agilent GeneSpring**, and Spotfire DecisionSite** for functional genomics—provide heatmap visualization tools that are linked to clustering algorithms [25]. Layering of complementary data into the visual display, however, has been limited. The above-mentioned tools provide visualization and annotation enhancements to support cluster analysis, including dendrograms, scatterplots, line graphs, detailed row and column descriptions, and links to external annotations. Similarities between the Gene Ontology project [26] annotations assigned to individual genes [21, 27–29] can provide a useful, albeit limited, hint at inter-gene relationships. While multidimensional visualization tools for database exploration have long been established [30, 31], we are not aware of established bioinformatics tools that facilitate the visualization of functional relationships from unrelated sources on a global scale.
Our primary objective is to enable biologists to compare multiple gene-centric data sources and thereby discover significant functional relationships between genes and characteristics. Examples of such gene-centric data classes include gene expression profiles, binding sites for transcription factors (TFs), Gene Ontology (GO) terms, disease and pathway annotations, literature-based associations, and subcellular localization. A thorough search and examination of existing tools failed to identify a ready solution, and enhancing the existing GeneSpring and Spotfire tools was judged to be neither practical nor affordable. To give researchers the capacity to seek correlations between such disparate classes of data, we developed the parallel heatmap (PHM) viewer.
To support the efficient manipulation of gene expression data, the PHM viewer is embedded within the TG Services GenePilot** microarray data analysis package (see footnote 1). In the current version of GenePilot, the PHM viewer is accessed via the hierarchical clustering (HC) results screen. A freeware version, GenePilot–Lite Edition, retains the HC algorithm and the full complement of visualization features, allowing all researchers to conduct rapid comparisons of parallel datasets.
Figure 1 shows the parallel heatmap viewer interface. The left pane contains the complete hierarchically clustered gene expression data and dendrogram for the elutriation-synchronized yeast cell-cycle time-series dataset [32], with corresponding predictions of TF binding sites generated with the MSCAN software [33] using a collection of yeast-binding profiles [34]. A selected cluster (red dendrogram) is magnified in the right pane. The detailed view contains clear column and row headings and color-coded cell-cycle stages for the expression data. The cell-cycle color tag legend in the top right corner identifies individual M, G1, S, and G2 cell-cycle stages. The top heatmap row in the expanded view represents the average column values for all genes in the selected cluster. Finally, the twelve most commonly shared GO annotations for the selected group of genes are shown between the TF heatmap and the row headings. This view is dynamic and will change for each subselection of genes in the expanded view cluster.
Figure 1
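The two summaries described for the expanded view (the average row for the selected cluster and the most commonly shared GO annotations) can be illustrated with a minimal sketch. This is not the GenePilot implementation; the function names, the gene identifiers, and the toy values are all hypothetical, and ties between equally frequent terms are broken alphabetically purely as an assumption.

```python
from collections import Counter


def column_means(rows):
    """Average each column over all genes (rows) in a selected cluster."""
    if not rows:
        return []
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def top_shared_terms(gene_terms, genes, limit=12):
    """Return up to `limit` (term, gene_count) pairs, most widely shared first."""
    counts = Counter()
    for gene in genes:
        # A set ensures each gene contributes at most once per term.
        counts.update(set(gene_terms.get(gene, ())))
    # Sort by descending count, then alphabetically (tie-break is an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical toy cluster: three genes measured at four time points.
expression = {
    "gene_a": [0.0, 1.0, 2.0, 1.0],
    "gene_b": [2.0, 1.0, 0.0, 1.0],
    "gene_c": [1.0, 1.0, 1.0, 1.0],
}
# Hypothetical annotations, for illustration only.
go_terms = {
    "gene_a": ["cell cycle", "DNA replication"],
    "gene_b": ["cell cycle"],
    "gene_c": ["translation", "cell cycle"],
}

cluster = list(expression)
mean_row = column_means([expression[g] for g in cluster])
shared = top_shared_terms(go_terms, cluster)
print(mean_row)
print(shared)
```

Recomputing both summaries for whatever subset of genes is currently selected is what would make such a view dynamic, as the text describes.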
The cluster visualization tools can support display of complete genome-scale data and selected subsets of genes. In the latter case, researchers can magnify a specific cluster for a detailed view with annotations and row and column names. Gene-centric annotation information can be accessed automatically from a number of sources, including Stanford SOURCE [35]. Columns of a selected cluster can be reordered by user-predefined categories (e.g., converting between treatment types and cell types) or clustered on the basis of Pearson Correlation Coefficients or Euclidean distance. Data normalization and color palettes can be dynamically applied to enhance interpretation.
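The two column-clustering criteria mentioned above can behave quite differently on the same data. The sketch below assumes that a Pearson-based dissimilarity is taken as 1 minus the correlation coefficient, which is a common convention but not something the text specifies for GenePilot; the function names and the sample columns are hypothetical.

```python
import math


def pearson_distance(x, y):
    """Dissimilarity defined as 1 - Pearson correlation (assumed convention)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        # A constant column has no defined correlation; treat it as uncorrelated.
        return 1.0
    return 1.0 - numerator / denominator


def euclidean_distance(x, y):
    """Straight-line distance between two columns of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical columns: b has the same shape as a at twice the magnitude,
# while c is the mirror image of a at the same magnitude.
col_a = [1.0, 2.0, 3.0]
col_b = [2.0, 4.0, 6.0]
col_c = [3.0, 2.0, 1.0]

print(pearson_distance(col_a, col_b), pearson_distance(col_a, col_c))
print(euclidean_distance(col_a, col_b), euclidean_distance(col_a, col_c))
```

In this toy case the Pearson-based measure places `col_b` closest to `col_a` because it compares profile shape, whereas Euclidean distance places `col_c` closest because it compares magnitude. This is one reason the choice of metric can change the column order.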
In the PHM viewer, a complementary dataset can be displayed. Any selected cluster can then be examined in detail from two perspectives, and the columns of each set individually reordered. The PHM viewer manages the correct alignment and order of rows (genes) for both panes and between both datasets. To facilitate this, matching identifiers must be available between the annotation fields for each set (e.g., official gene names or accession numbers from a common database). A complete one-to-one match is not required. In the case of multiple rows with identical names in the parallel dataset, only the first instance is displayed. Once relationships are observed, the user has the option to output bitmap image files for archival and communication purposes.
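The row-alignment rules described above (match on a shared identifier, tolerate an incomplete match, and keep only the first of several identically named rows) can be sketched as follows. This is an illustration under stated assumptions rather than the viewer's actual code: the function name and identifiers are hypothetical, and filling unmatched rows with `None` placeholders is an assumption, since the text does not say how the viewer renders genes that are absent from the parallel dataset.

```python
def align_parallel_rows(primary_ids, parallel_rows, n_parallel_columns):
    """Order parallel-dataset rows to follow the primary (clustered) row order.

    primary_ids        -- identifiers in the order shown in the primary heatmap
    parallel_rows      -- list of (identifier, values) pairs from the second dataset
    n_parallel_columns -- width of the placeholder row used for unmatched genes
    """
    first_seen = {}
    for identifier, values in parallel_rows:
        # setdefault keeps the first instance and ignores later duplicates.
        first_seen.setdefault(identifier, values)

    aligned = []
    for identifier in primary_ids:
        if identifier in first_seen:
            aligned.append(list(first_seen[identifier]))
        else:
            # Assumption: unmatched genes get an empty placeholder row.
            aligned.append([None] * n_parallel_columns)
    return aligned


# Hypothetical identifiers; "G1" occurs twice in the parallel dataset.
primary_order = ["G3", "G1", "G2"]
parallel = [
    ("G1", [0.9, 0.1]),
    ("G3", [0.2, 0.8]),
    ("G1", [0.0, 0.0]),
]
aligned = align_parallel_rows(primary_order, parallel, n_parallel_columns=2)
print(aligned)
```

Because the parallel rows are looked up by identifier rather than by position, reordering or reclustering the primary dataset only requires calling the function again with the new row order.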
Diverse classes of high-throughput genomics data are available for the single-celled yeast Saccharomyces cerevisiae. Results from several of the yeast genomics studies have been prepared for direct import into GenePilot and are available at http://www.cisreg.ca/raf/PHM/. These include gene expression datasets [32], Gene Ontology annotations generated from the Saccharomyces Genome Database (SGD**) Gene Ontology Term Finder [36], transcription factor binding site predictions produced by the MSCAN software [33] using a collection of yeast-binding profiles [34], and chromatin immunoprecipitation (ChIP) results generated with microarrays (“ChIP on Chip”) [37]. In all datasets, the rows represent yeast genes, uniquely referenced by the systematic name obtained from the SGD [38]. GenePilot permits the inclusion of multiple column classification and supervision vectors, which makes it possible to tag each column with multiple labels. This enables the user to visualize the information in multiple ways (for instance, grouping columns by array identification numbers). In the sample gene expression data, two vectors are included representing the stages of cell-cycle progression. In addition to the expression data, a file is provided containing the GO terms associated with each yeast gene; this file may be used with any yeast dataset.
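The idea of tagging each column with several labels and regrouping the display by any one of them can be sketched in a few lines. The data layout below (one list of labels per classification vector) and all names are hypothetical and do not reflect the GenePilot file format.

```python
def order_columns_by_vector(labels):
    """Return column indices grouped by label.

    Groups appear in order of each label's first occurrence, and columns keep
    their original relative order within a group.
    """
    first_position = {}
    for index, label in enumerate(labels):
        first_position.setdefault(label, index)
    return sorted(range(len(labels)), key=lambda i: (first_position[labels[i]], i))


def reorder(row, order):
    """Apply a column order to one row of values (or to the column headings)."""
    return [row[i] for i in order]


# Hypothetical columns carrying two classification vectors.
column_names = ["t0", "t30", "t60", "t90"]
vectors = {
    "stage": ["G1", "S", "G1", "S"],
    "array": ["A1", "A1", "A2", "A2"],
}

by_stage = order_columns_by_vector(vectors["stage"])
by_array = order_columns_by_vector(vectors["array"])
print(reorder(column_names, by_stage))
print(reorder(column_names, by_array))
```

The same index order would be applied to every row of the heatmap, so switching between vectors changes only the column arrangement and never the underlying values.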
The identification of regulatory elements in the promoters of co-expressed genes can be a challenge. Computational prediction of transcription factor binding sites produces numerous false predictions. Alternatively, high-throughput ChIP methods produce complicated results that may reflect the inclusion of false positives or instances in which TFs are present at a specific location because of protein–protein interactions rather than direct binding to DNA. Because the two classes of data appear independent and complementary, we sought to determine whether there were relationships that could be visually distinguished in genome-scale data. A database of binding profiles for yeast transcription factors [34] was screened against all yeast gene promoters using the MSCAN algorithm [33] in order to produce a combined probability score for all potential binding sites for each TF in each gene.
Figure 2 shows the correlations between transcription factor binding-site predictions generated by the MSCAN software [33] using a collection of five distinct yeast-binding profiles (Gal4, Gcn4, Leu3, Reb1, and Ste12) [34] and microarray-assessed chromatin immunoprecipitation results (“ChIP on Chip”) [37]. The figure is a compilation of PHM views of different gene groups with identical column orders, saved as bitmap images. TFs from the ChIP dataset are color-coded and grouped according to the primary functions of the target genes via a column classification vector. The dark gray rows under MSCAN predictions for TF Gcn4 indicate that no TF binding was predicted at the prediction thresholds used in the analysis. Arrows indicating the location of the matching individual TF in both datasets were added to ease interpretation.
Figure 2
For several TFs, there was a clear relationship between the computational predictions and genome-scale ChIP data. For instance, the binding data for both Gal4 and Leu3 shows strong agreement between computational predictions and ChIP results. In the cases of Gcn4 and Reb1, a subset of the genes shows good agreement between the two data classes; however, a significant portion is not supported. Visualization does not always yield convincing observations, as demonstrated by the set of genes known to be regulated by Ste12, for which both MSCAN and ChIP data implicate other TFs.
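The kind of agreement judged visually above can also be tallied for a single TF. The sketch below is only an illustration: it assumes both data classes have been converted to scores where larger values mean stronger evidence, that a gene missing from a dataset carries no evidence, and that the cutoffs are chosen by the user. None of these conventions, names, or values comes from the MSCAN or ChIP datasets themselves.

```python
def compare_support(predicted, chip, genes, predicted_cutoff, chip_cutoff):
    """Sort genes into groups by which data class supports TF binding."""
    groups = {"both": [], "prediction_only": [], "chip_only": [], "neither": []}
    for gene in genes:
        # Assumption: a gene absent from a dataset has a score of 0.0.
        has_prediction = predicted.get(gene, 0.0) >= predicted_cutoff
        has_chip = chip.get(gene, 0.0) >= chip_cutoff
        if has_prediction and has_chip:
            key = "both"
        elif has_prediction:
            key = "prediction_only"
        elif has_chip:
            key = "chip_only"
        else:
            key = "neither"
        groups[key].append(gene)
    return groups


# Hypothetical scores for one TF across four genes.
predicted_scores = {"gene_a": 0.9, "gene_b": 0.8, "gene_c": 0.1}
chip_scores = {"gene_a": 0.95, "gene_c": 0.7, "gene_d": 0.2}
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]

groups = compare_support(predicted_scores, chip_scores, genes, 0.5, 0.5)
for name, members in groups.items():
    print(name, members)
```

A tally of this sort complements the visual comparison rather than replacing it, because it reduces each gene to a yes-or-no call and hides the graded intensities that the heatmaps display.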
Numerous algorithms have emerged to identify over-represented sequence motifs in the promoters of co-expressed genes. While many methods identify the sequence patterns de novo, the set of TFs for which binding properties have been defined continues to grow. Therefore, new methods are emerging to identify an overabundance of predicted sites for characterized TFs (e.g., Toucan [39] and oPOSSUM [40]). Visual inspection appears to be a powerful means of assessing these relationships while the statistical methods mature.
Figure 3 shows the correlation between gene expression and MSCAN predictions for a gene set derived from a co-expressed set of yeast genes related to cell-cycle gene expression with an unknown mediating TF, as defined by Getz et al. [41]. Gene expression data was derived from the elutriation-synchronized, yeast cell-cycle gene expression data collection [32] and shows an induction peak at 120 minutes during the G1 stage, followed by a secondary peak at 270 minutes during the G2 stage. It was previously indicated that there is a strong correlation between an over-represented pattern in the regulatory regions and the binding of the TF PAC [34]. Using the PHM viewer, we clustered genes on the basis of the cell-cycle pattern of expression and displayed the computational prediction of binding sites. This view provides strong support for the link of PAC to G1 stage-specific expression in the cell cycle; however, it is clear that not all genes in the expression cluster are linked with PAC binding sites. From the visualization, it appears that additional transcription factors may play a role. Specifically, regulation by RRPE is predicted for a majority of genes showing elevated expression at 120 minutes (G1 stage) and 270 minutes (G2 stage). This visual correlation is not apparent for the genes that are not up-regulated in G2.
Figure 3
The traditional use of heatmaps for visual verification of gene expression profile relationships and dependencies is just one approach to deriving knowledge from genomic data. With the PHM viewer, we have demonstrated that diverse data sources, when examined together, can be used to increase understanding. The examples presented here sampled information from continuous sources, including gene expression time series and experimental and computational protein–DNA binding predictions.
Discrete data sources can be used with equal ease. A discrete example dataset is provided on the PHM project website [42] in the form of GO annotations [36]. Other informative discrete information can be used—for example, disease associations derived from literature, tissue specificity, and metabolic pathway associations. Our experiences with both wet-lab and computational genomics data led us to develop this simple yet powerful tool to readily cross-examine complementary information. It can help build confidence, form a consensus, and provide a basis to question assumptions resting upon highly specialized approaches or sources.
The parallel heatmap viewer enables knowledgeable life science researchers to observe patterns and properties within high-throughput genomics data in order to rapidly identify biologically meaningful relationships. The viewer is written in Java** and is platform-independent. It is free for academic use, and use of the Lite Edition is unrestricted. For non-academic use and for GenePilot program availability, see [43]. Supplemental figures to this paper can be found at [42].
We are grateful to William S. Hayes (AstraZeneca) for identifying the need for a simple, parallel heatmap visualization tool, and also thank Gavin Fischer (OmniViz), Brian Middleton (AstraZeneca), and Gemma Satherwaite (AstraZeneca) for their input. Wyeth W. Wasserman acknowledges financial support from the Canadian Institutes for Health Research and the Michael Smith Foundation for Health Research.
**Trademark, service mark, or registered trademark of LION bioscience AG, Agilent Technologies, Inc., Spotfire, Inc., TG Services, Inc., The Board of Trustees, Leland Stanford Junior University, and Sun Microsystems, Inc. in the United States, other countries, or both.
1. A complete description of GenePilot features and download instructions may be found on the official GenePilot website (http://www.genepilot.com/). A detailed guide to the use of the PHM viewer is available on the project website (http://www.cisreg.ca/raf/PHM/).
Received October 6, 2005; accepted for publication January 26, 2006; published online September 15, 2006.