IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Journal of Research and Development

Systems Biology   Volume 50, Number 6, 2006
Table of contents: HTMLPDF This article: HTML PDF DOI: 10.1147/rd.506.0631Copyright info

Machine learning methods for transcription data integration

by D. T. Holloway,
M. A. Kon,
and C. DeLisi

Gene expression is modulated by transcription factors (TFs), which are proteins that generally bind to DNA adjacent to coding regions and initiate transcription. Each target gene can be regulated by more than one TF, and each TF can regulate many targets. For a complete molecular understanding of transcriptional regulation, researchers must first associate each TF with the set of genes that it regulates. Here we present a summary of completed work on the ability to associate 104 TFs with their binding sites using support vector machines (SVMs), which are classification algorithms based in statistical learning theory. We use several types of genomic datasets to train classifiers in order to predict TF binding in the yeast genome. We consider motif matches, subsequence counts, motif conservation, functional annotation, and expression profiles. A simple weighting scheme varies the contribution of each type of genomic data when building a final SVM classifier, which we evaluate using known binding sites published in the literature and in online databases. The SVM algorithm works best when all datasets are combined, producing 73% coverage of known interactions, with a prediction accuracy of almost 0.9. We discuss new ideas and preliminary work for improving SVM classification of biological data.

Introduction

A first step in understanding transcriptional regulation requires the mapping of proteins called transcription factors (TFs) to the genes they regulate and to the particular nucleotide sequences to which they bind. Typically, TFs bind to sites that are 10–15 nucleotides (nt) in length. Even a cursory examination of the DNA sequences that bind a particular TF indicates that the sequences are not identical, but instead define a motif, or similar pattern of nucleotide bases. The set of sites to which a particular TF binds will provide the basic input for computational methods that can be used to find additional sites.

These computational methods fall into two broad categories: supervised and unsupervised. The former starts with two example sets of potential TF-target sequences, each of which often consists of several hundred bases that are upstream from the potential target genes. The sequences are those known to bind a particular TF (called “positives”) and those known not to bind (“negatives”), both of which are used to derive a classification rule using an SVM or other learning algorithm. Unsupervised methods begin with sets that are believed, on the basis of independent evidence, to contain a characteristic but unknown nucleotide pattern, which may represent a binding site. A search algorithm such as Gibbs sampling can be used to identify such a pattern within the promoter region of the gene (the part of the gene that is upstream from the exon or coding region of DNA). The promoter region binds RNA polymerase and transcription factors in order to begin transcription. Many unsupervised techniques for predicting binding sites have been explored in the literature [18], and an excellent review of current motif discovery and pattern analysis methods is available [9].

Our approach is meant to easily combine a large number of data types in a supervised learning scheme to more accurately predict the association of a transcription factor and its targets. There are a number of ways to proceed, including methods that involve support vector machines (SVMs) and Bayesian variants, and approaches that use weighted combinations of both of these methods. In order to assign weights, we must know how well each method performs. Here we study the use of support vector machines, which we show can easily accommodate high-dimensional genomic datasets containing many hundreds or thousands of features. An example of such a dataset is a gene expression profile, with each gene described by hundreds of expression measurements taken under different conditions. These measurements are the dimensions (features) used to derive a classification rule. We also study a simple SVM framework for combining heterogeneous and diverse data.

The performance of supervised approaches is often reported in terms of a few basic statistics. Known positives and negatives are first divided into training and test sets. This can be done once, as in a “hold-out” method, or the division can be performed randomly many times, as in a “cross-validation” procedure. In either case, the algorithm learns, using the training set, and makes predictions on the test set. Correct positive predictions (TP, “true positives”), correct negative predictions (TN, “true negatives”), incorrect positive predictions (FP, “false positives”), and incorrect negative predictions (FN, “false negatives”) are counted for each test set and used to calculate more informative measures of performance. Two simple measures are sensitivity (S), which is the percentage of known targets correctly predicted to be true, and positive predictive value [PPV = TP/(TP + FP)], the percentage of positive predictions that are correct. Other measures are also possible, as we discuss below.

A number of supervised approaches have been used to associate transcription factors with their targets. Original work in transcription-factor binding-site discovery involved the use of position-specific scoring matrices (PSSMs) [1013], which record the frequency of nucleotide bases at each position in a binding-site representation, or motif. A new prediction is then a site that matches the PSSM on the basis of a score threshold [10]. Researchers subsequently discovered that clusters of predicted binding sites can indicate whether a candidate gene is a target of a regulator [1417]. Another supervised method, developed by the team of N. Simonis at the Centre de Biologie Structurale et Bioinformatique in Belgium, makes use of linear discriminant analysis (LDA) to select from a set of potentially co-regulated genes that are likely to share transcription factors. Using a set of 1,012 regulatory interactions involving 66 TFs [data obtained from the Transcription Factor Database (TRANSFAC**) [18], the aMAZE database [19], and a list compiled by Young et al. [20] from the Yeast Proteome Database], the researchers report an average positive predictive value of 0.91 and a sensitivity of 73%. Their classification performance based on ChIP-chip (Chromatin Immuno-Precipitation microarray) data is worse, with only 52% of genes identified by ChIP-chip being discovered. ChIP-chip is a large-scale procedure designed to experimentally identify transcription-factor targets genome-wide [21]. A microarray error model determines the significance of the identified targets. The Simonis team has argued in the past that ChIP-chip results likely contain many false positives; however, their results also show that target groups identified by ChIP experiments contain large numbers of motifs that are significantly overrepresented in comparison to random gene sets. This suggests that many of the targets generated by high-throughput experiments, such as chromatin immunoprecipitation, contain real binding-site signals.

In an approach more closely related to ours, Qian et al. apply support vector machines to gene expression profiles in order to predict TF–target relations [22]. Gene expression profiles are simply vectors, one for each gene, whose components are measurements of the expression level of the gene under different conditions. Positive examples for the classifier are known TF–target pairs; negatives are randomly chosen relations. In a method that differs from ours, Qian et al. create one classification rule covering all TFs and targets, while in our method a classifier is constructed for each TF individually. In their formulation, the data for each known TF–target association is given as a concatenation of the TFs and the target's expression vectors over 79 experimental conditions (giving a 158-element vector to describe a positive example of regulation). Negatives are constructed similarly for genes chosen randomly and for those found to lack a TF binding site. Their best reported accuracy is 0.93; however, this result is somewhat misleading because their analysis contains a total of only 175 positives. Their classification of a large negative set (1,750 negatives) can result in high accuracy because large numbers of negatives are classified correctly. To put their result in perspective, their sensitivity is 55% and their positive predictive value is 63%. While their method shows promise, it still relies only on the correlation of the expression of the transcription factor to its target. Thus, they are likely to miss interactions depending on cooperating TFs, or factors whose activation is dependent on post-translation modification or nuclear exclusion.

The approach by Beer and Tavazoie uses Bayesian networks to learn the combinatorial relationships of TFs and targets that underlie gene expression data [23]. Their method begins by clustering gene expression data by similarity of expression and then using hierarchical Bayesian networks to predict the cluster assignment of a test gene on the basis of the sequences in its promoter. They impose constraints, which can be learned by the algorithm, allowing them to derive complex logical relationships from the data (e.g., motif A and motif B must both be present and within 20 base-pairs). Although this approach is innovative and can accurately describe the sequence/expression relationships of many genes, it may not be appropriate for our goals because it can depend on the clustering of the expression data and the method by which motif discovery is performed on the genes being tested.

Our approach uses SVMs to associate TFs with targets by combining high-dimensional heterogeneous datasets, building on our previous work, which used fewer data sources [24]. SVMs have been applied successfully to many problems in computational biology. They have been used for the prediction of protein remote homology [25], secondary structure [26], protein subcellular localization [27], signal peptide cleavage sites [28], normal or cancerous tissue types [29], gene function [30], mRNA splice sites, and translation start sites [31]. One notable attempt combines information on protein sequence similarity, protein–protein interactions, protein hydrophobicity, and gene expression to predict the function of a set of proteins [32].

Background and brief review

We now introduce our methodology for the non-specialist and briefly review some basic elements of SVM algorithms. We have trained an SVM on each of 104 transcription factors (i.e., “regulators”) independently, using positive and negative training sets as explained in the following paragraphs. Each gene in the positive set shares certain attributes, or features, that other genes do not share, and it is on the basis of these that a classifier for a particular TF is obtained. We use 18 different genomic datasets to generate attributes, as indicated in the following—for example, the number of occurrences of a particular nucleotide sequence of length k. For such a dataset, the number of occurrences of each of the 256 possible nucleotide sequences of length 4 (“4-mers”) might be represented by a 256-component vector (a “feature vector”), each component of which is the number of times the corresponding 4-mer occurs upstream from one of the genes in the set. (In molecular biology, the term “upstream” refers to a relative position along the DNA or RNA sequence and denotes a region toward the 5' end of the sequence.) To construct a classifier, positive examples (feature vectors of promoters known to be bound by a TF) and negative examples (those of promoters known not to be bound) must be identified. Given this set of data, each example is represented by a feature vector of attributes. In the case of k-mer counts, the components of the feature vector are counts of different k-mers that appear in the promoter region. Other datasets for the same TF will have the same example target genes, represented by different feature vectors. For example, a phylogenetic profile vector, which shows the occurrence of an ortholog, or ancestry-related sequence, in a set of 65 genomes, would be a vector of length 65 consisting of binary numbers, with 1 indicating the presence of an ortholog and 0 indicating its absence. Thus, the data for any particular TF consists of a number of different feature vectors in spaces with possibly thousands of dimensions (attributes), each such vector representing a gene in the training set.

The SVM algorithm separates the positive and negative sets in the feature space by finding a hyperplane whose distance from the closest data points of each class is maximal. Two parallel hyperplanes that pass through these closest data points are found, and a separator bisects the distance between them. Better generalization (i.e., performance in prediction) can be obtained by forgoing perfect separation of training data and allowing some misclassification. This soft margin SVM finds the hyperplanes under the constraint that the distance to the closest cleanly separated data be maximal, with some penalty for misclassifications, as explained below.

We denote the feature vector–output pair for the ith gene in the training set by (xi, yi), with yi equal to +1 when xi is a feature vector from the positive set, and −1 otherwise. The vector xi has the form xi = (xi1, xi2, xi3, ···, xid), where d is the number of features, i.e., the dimensionality of the feature space.

The separating hyperplane H has the form

w · x + b = 0(1)

(see Figure 1), in which the components of w (w1, w2, ···, wd) are the weights of the corresponding features, with b/|w| representing the distance from the origin to the closest point on H.

Figure 1 Figure 1

For clarity, we first describe the case in which the positive and negative examples in feature space are completely separable by a hyperplane; we then discuss the nonseparable case that allows for misclassification. The challenge in the separable case is to find the values of w and b that give the maximum margin separating hyperplane (the one that separates the two classes most widely). This requires the use of only the closest correctly separated feature vectors, each representing the attributes of a gene. In the simple two-dimensional example in Figure 1, the feature vectors are x1, x2, and x3; note that the separator bisects the distance between parallel planes through those points. This (separable) situation is illustrated in Figure 1, but without the (misclassified) vector x4.

The margin is the distance between two planes parallel to the separator, one passing through the closest correctly classified positive data point, and the other passing through the closest correctly classified negative data point. The vector w is scaled so that the hyperplanes through the closest data (the support vectors) are given [3334] by w · x + b = +1 and w · x + b = −1.

Equivalently, the data satisfy the single constraint

yi(w · xi + b) ≥ 1(2)

because yi(w · xi + b) is the distance from the separator to the ith data point. The margin (the perpendicular distance between the hyperplanes H1 and H2 parallel to H) is

Equation 3(3)

where ||w|| = ∑iwi2 is the magnitude of the weight vector. Equation (3) is readily obtained by noting that if x+ and x denote the position vectors of two points at the intersection of an orthogonal to the separator with the margin hyperplanes (x+ and x can both be chosen parallel to w; see Figure 1) and taking without loss ||x+|| > ||x||, then

m = ||x+ − x|| = ||x+|| − ||x||.

On the other hand, from Equation (2), because w and x± have been chosen to be parallel, we have yi(||w|| · ||x±|| + b) = 1 (using x± with yi = ±1, respectively), from which Equation (3) follows from the above, using

Equation a

and

Equation b

Thus the problem is to maximize m given by Equation (3), subject to the constraints given by Equation (2). Note that in Equation (2), equality holds exactly for the support vectors, a subcollection that we label x1, x2, ···, xs, so that we can reduce the problem so that it involves this set of vectors only. The constrained maximization can be solved using Lagrange multipliers [3334]. In particular, the challenge is to minimize

Equation 4(4)

The weight vector is obtained by setting (∂L/∂w) = 0 [i.e., the system of equations (∂L/∂wi) = 0] for the extremal xi, i.e., those for which equality holds in Equation (2). These are exactly the support vectors, giving

Equation 5(5)

where, in this example, the weight vector is w ≡ (w1, w2), the ith attribute vector is xi ≡ (xi1, xi2), and the number of support vectors, s, is three (those lying on the margin planes). The one misclassified point x4 is currently ignored, but also becomes a support vector when included in the data, as described shortly.

The parameter b is determined as a weighted average of the distances to the two hyperplanes containing the support vectors,

Equation 6(6)

In fact, in this fully separated case, for each support vector xi, i = 1, ···, s, we have b = yi − w · xi, by definition. The procedure for finding the multipliers alphai is somewhat simplified by forming and then maximizing the so-called dual Lagrangian [3334]

Equation 7(7)

This is obtained by substituting Equation (5) into Equation (4) and noting that (∂L/∂b) = 0 implies ∑ialphaiyi = 0.

We now describe the case of soft margins, which is the formulation we use in practice where perfect separation is not possible (consider Figure 1 with the misclassified x4 now included). In this case, a penalty ξi is paid in the Lagrangian for each misclassification of size ξi (the distance of the misclassified xi from its margin; see Figure 1). The target function and constraints are now modified so that the problem is to find

Equation c

subject to ξi ≥ 0 and yi (w · xi + b) ≥ 1 − ξi for i = 1, ···, r.

As mentioned, ξi is the distance of the ith misclassified point xi from its margin, which is defined as before to be the hyperplane (H1 or H2) at a distance (1/||w||) in the direction of the correct classification of xi from the separating hyperplane w · x + b = 0. Parameter C mediates the tradeoff between maximal margin and misclassification, and r is the number of misclassified points allowed. Essentially, the algorithm proceeds to find the maximum margin by minimizing ||w|| while balancing this against the amount ∑ri = 1ξi of misclassification with this choice of margin. We again use Lagrange multipliers alphai as in the previous case to form a full Lagrangian, and then minimize it.

To illustrate in our two-dimensional example, we make use of the data presented in Figure 2. In this case, we then have, after minimizing the Lagrangian,

Equation d

 

Equation e

Thus w = [1.3174, 0.1198], and b = [1 − (1.3174)(6) − (0.1198)(7) + ···]/3 = −9.7425.

Figure 2 Figure 2

For linearly separable data (no misclassifications), we have ξi = 0, and we are in the first case, in which the values of w and b ensure that Equation (3) is minimized subject to the constraint of Equation (2). However, for data that is not linearly separable (e.g., including x4), alphai can become extremely large. In the Lagrangian formalization, a constant C becomes an upper bound for alphai (i.e., the constraint 0 ≤ alphai ≤ C is used). By using C as a bound for alphai, we can limit the influence of single data points that cannot be classified correctly. Thus, in our example with C = 5, the multiplier for x4 has a value of 5 (Figure 2).

It is evident from Equation (7) that the first step in finding this maximal margin separator requires the calculation of all pairwise correlations between example vectors in the form of their inner (dot) products (also called the linear kernel function). Thus, given two data points xi and xj, the kernel function K(xi, xj) = xi · xj yields a complete kernel matrix Kij = K(xi, xj) involving every pair of data points. Because the data are represented internally only as such inner products rather than as explicit feature vectors, it becomes possible and useful to substitute different definitions of the inner product for the above linear dot product. Several alternatives are given in Table 1. These functions are inner products defined on feature spaces of different dimensionalities. Defining such a new inner product implicitly maps the data into a new feature space. This swapping of kernel functions in order to map data into different spaces is commonly referred to as the “kernel trick.” Biological features such as conservation or gene expression values can be correlated and may have complex, nonlinear relationships, highlighting the need for classification schemes that can accurately classify data that are not linearly separable.


Table 1 Common kernel functions.
KernelParametersDescription

LinearNoneK(x, y) = x · y
 
PolynomialPoly degree dK(x, y) = (x · y + 1)d
 
Gaussian radial basis function (RBF)σEquation f
 
GaussianσEquation g

Datasets

We have tested a variety of sequence- and non-sequence-based classifiers for predicting the association of TFs and genes. All together, 18 separate data sources (each yielding a feature map and kernel) are combined to build classifiers for each transcription factor. The 18 data sources comprise a family of sequence-based methods (e.g., k-mer counts and TF motif conservation in multiple species), expression datasets, phylogenetic profiles, and gene ontology (GO) functional profiles (see Table 2). For a more detailed description of datasets, see [35]. In almost all cases, our datasets have complete information, primarily because datasets such as k-mer counts or motif counts are derived from DNA sequences alone. Microarray expression data is also available for every gene in our analysis. In the cases in which expression values are missing for a few conditions, zeros are substituted, as is often done in computational analyses. For the GO functional profiles and the phylogenetic profiles based on the Cluster of Orthologous Groups (COG) database, many genes are absent, primarily because these genes have not yet been given a functional assignment (in the case of GO) or have not been allocated to an orthologous (ancestor-related) group (e.g., in the case of COG). In these instances, substitute values for the missing features are selected at random from the entire genome. Thus, missing values are replaced according to background frequencies, without bias toward the positive or negative sets.


Table 2 Dataset abbreviations and description.
 AbbreviationDescription

1MOTMotif hits in S. cerevisiae
2CONSMotif hits conservation 18 organisms
3PHYPhylogenetic profile
4EXPExpression correlation
5GOGO term profile
6KMERk-mers – 4, 5, 6-mers
7S1Split 6-mer 1 gap kkk_kkk
8S2Split 6-mer 2 gaps kkk__kkk
9S3Split 6-mer 3 gaps kkk___kkk
10S4Split 6-mer 4 gaps kkk____kkk
11S5Split 6-mer 5 gaps kkk_____kkk
12S6Split 6-mer 6 gaps kkk______kkk
13S7Split 6-mer 7 gaps kkk_______kkk
14S8Split 6-mer 8 gaps kkk________kkk
15MM016-mer with one mismatch (count 0.1)
16MM056-mer with one mismatch (count 0.5)
17ENTCondition-specific TF–target correlation
18SPARNucleotide sparse binary encoding

Our positive and negative training sets are taken from ChIP-chip experiments [2036], TRANSFAC 6.0 Public [18], and a list from [37] curated by Young et al. from which we have excluded indirect evidence such as sequence analysis and expression correlation [38]. Only ChIP-chip interactions of p-value ≤10−3 are considered, as recommended by the authors [20]. The TRANSFAC and curated list represent a manually annotated set, which is later used separately during the comparison of SVM and PSSM performance. For the purposes of SVM, however, all manually curated and high-throughput sets are grouped together, making a total of 9,104 positive interactions. (The term high-throughput refers to the rapid processing of thousands of genes via ChIP-chip experiments.)

Negative sets pose a greater challenge because no defined negatives exist in the literature; however, because a particular TF regulates only a small fraction of the genome, a random choice of negatives seems acceptable. In fact, our own unpublished work suggests that test cases with a few TFs show good classification performance with random negatives. Nevertheless, a more reliable set of negatives would be those showing no binding by experiment under some set of conditions. Along those lines, for each TF, we have chosen 175 genes with the highest p-values (generally >0.8) under all conditions tested in genomic ChIP-chip analyses [36]. Clearly, all experimental conditions have not been sampled, and this does not guarantee that our choices are truly never bound by the TF, but this choice of negatives maximizes our chances of selecting genes not regulated by the TF of interest.

All promoter sequences have been collected by using RSA tools, Ensembl, or the Broad Institute Fungal Genome Anatomy Project [3941]. For yeast, promoters are defined as the 800 base pairs (bps) upstream of the coding sequence. The motif-conservation dataset required promoter regions from 17 other genomes. Those genomes, their sources, and the lengths of the promoter regions are listed in Table 3. Sequences are masked (i.e., replaced with a sequence of null characters) using the dust algorithm and the RepeatMasker software [4243] where it is appropriate to exclude low-complexity sequences and known repeat DNA from further analysis. PSSM scans are performed with the MotifScanner algorithm [44]. MotifScanner assumes a sequence model in which regulatory elements are distributed within a noisy background sequence [44]. The algorithm requires input of a background sequence model, which in this case is a transition matrix of a third-order Markov model generated from the masked upstream regions of each genome. MotifScanner requires only that one parameter be set by the user, namely the threshold score for accepting a motif as a binding site. Several thresholds have been tested, and the results we have used to create SVM kernels were obtained with a setting of 0.15 for the thresholds. This threshold has been found to provide a reasonable tradeoff between sensitivity and false prediction, making approximately 560 predictions per TF. Settings beyond 0.2 produce too many false hits to be useful. The PSSMs themselves are obtained from TRANSFAC 6.0 Public and from [45], and these PSSMs are a mix of experimentally derived motifs and those generated by motif-discovery procedures.


Table 3 Promoter regions.
GenomePromoter lengthSource

Humanclipped*RSA tools [40]
RatclippedRSA tools [40]
Fruit flyclippedRSA tools [40]
Anopheles mosquito4,000 bpEnsembl [46]
WormclippedRSA tools [40]
S. pombe800 bpRSA tools [40]
S. cerevisiae800 bpRSA tools [40]
N. crassa1,000 bpBroad Institute [47]
M. grisea1,000 bpBroad Institute [48]
A. thalianaclippedRSA tools [40]
P. falciparumclippedRSA tools [40]
S. bayanusclippedWashington University [49]
S. mikataeclippedWashington University [49]
S. kluyvericlippedWashington University [49]
S. paradoxusclippedBroad Institute [50]
S. kudriavzeviiclippedWashington University [49]
S. castelliiclippedWashington University [49]
MouseclippedPromoser [51]
*clipped: The promoter was truncated if it ran into an upstream coding sequence.

In addition, datasets using k-mers rather than PSSMs are generated using the fasta2matrix [52] program, which delineates all possible k-mers and counts the occurrence of each within a set of promoters. Gapped k-mers are detected using custom scripts written as MATLAB** m-files.

The expression data used include 1,011 microarray experiments complied by Ihmels and coworkers, and this data can be obtained with permission from the authors [53].

As mentioned above, 18 different data kernels are used to construct a classifier for each transcription factor. The datasets fall into several distinct groups. All classifier construction and validation was performed in MATLAB [54] using the SPIDER machine learning library [55].

Methods

First, each type of genomic data is evaluated independently for each transcription factor. Several kernel functions are tested, and parameters are optimized by a grid-selection technique. Each dataset is normalized so that all attributes describing the data have a mean of zero and a standard deviation of one. The Gene Ontology, phylogenetic profile, and TF-target correlation data are exceptions, and they are not normalized because their data is binary.

A schematic representation of our method is shown in Figure 3. Briefly, for a particular TF, four classifiers are produced for each type of genomic data, each from a different kernel function (linear, RBF, Gaussian, and polynomial). In order to make an appropriate choice of the C parameter, a grid-selection technique is used to evaluate a range of choices. In the case of two parameter selections (e.g., when choosing the degree of the polynomial kernel), all possible combinations of parameter values within the pre-specified range are tested. A fivefold cross-validation is used to choose the best parameters on the basis of a Receiver Operating Characteristic (ROC) score. (The ROC score relates to the area under a Receiver Operating Characteristic curve that shows the utility of a classifier at various thresholds.)

Figure 3 Figure 3

Once parameters are chosen for each kernel type, the parameter-optimized classifiers are tested using a leave-one-out cross-validation procedure. As suggested, for each type of genomic data, there are four classifiers for a particular TF (one for each of the kernel functions). Of these four, we select the one with the best performance as measured by the F1 statistic. Several common statistics, including accuracy, sensitivity, and specificity, can overstate the performance of a classifier depending on the relative size of the positive and negative training sets. The F1 statistic is a more robust measure that is the harmonic mean between sensitivity (S) and positive predictive value (PPV):

 

Equation h

 

Each TF now has only one classifier for each type of genomic data (18 classifiers in all). Before weighting and combining kernels, each kernel matrix is normalized according to

 

Equation i

 

This normalization adjusts all points so that they lie on a unit hypersphere in the feature space. This ensures that no single kernel has matrix values that are comparatively larger or smaller than those of other kernels, which would bias the combination.

By using a scheme with weights equal to the F1 of each classifier, the underlying 18 kernels are scaled and added into one unified kernel for the transcription factor. This kernel represents the integration of all types of genomic data. Three simple weighting schemes are compared. In all cases, the primary weight for a method is determined by computing its F1 score ratio with that of the best-performing method. Our first weighting scheme simply multiplies all kernel matrices by their primary weights (i.e., F1 ratios) and sums them. A second scheme squares the primary weights before multiplying. Our third scheme is the most nonlinear and requires us to compute the squared tangent of the primary weight.

Performance statistics for each TF, based on all combined datasets, were generated by a final leave-one-out cross-validation procedure on the combined kernel. In this way, accuracy measurements are made for each TF–target classifier.

PSSM comparison

Using the same positive and negative sets as are used for the SVM procedure, PSSMs can make predictions at various score thresholds to serve as a comparison to predictions made by SVMs. The data in Figure 4 represent a parameter setting of only 0.1 in MotifScanner. Low parameter values retain the best matches, whereas values near 1 allow very “loose hits”; that is, the use of values near 1 leads to the retention of more false matches. Other choices of threshold do not appear to improve performance. Loosening the threshold begins to dramatically increase false-positive predictions beyond a parameter setting of 0.2. By making detection more “strict” (i.e., less likely to yield false hits), false predictions are reduced along with sensitivity. Because the matrices for the 104 transcription factors are partly experimentally determined and partly computationally generated, the TRANSFAC PSSMs for 17 TFs are evaluated next to determine whether the experimental matrices by themselves outperform SVM for target identification. Finally, because a large number of positive targets have been taken from high-throughput ChIP-chip experiments, the TRANSFAC PSSMs are tested again on only the portion of the positives obtained from manually annotated sources.

Figure 4 Figure 4

Results and discussion

Using the classification procedure described in the previous sections, we have been able to accurately classify the known targets of many transcription factors for the yeast S. cerevisiae. Overall, the best single method achieves a sensitivity of 71% and a positive predictive value of 0.82. These performance measures provide a summary for all 104 classifiers. For example, there are 9,104 known positives for all TFs. A sensitivity of 71% indicates that, taking into account all 104 classifiers, we recover 71% of the known data (i.e., known TF–target interactions). This means that classifiers for some TFs have much higher sensitivities or PPVs, while other classifiers perform no better than randomly. Many individual methods perform well, but the best classification is made with k-mer counts allowing one mismatch per k-mer (with mismatches given a count of 0.1). Our results show that by combining datasets we increase sensitivity incrementally over the use of only the best single dataset, and also produce a small improvement in positive predictive value. This indicates that methods that combine data sources are useful in this case because they remove some false-positive classifications [35].

To prevent an overly optimistic evaluation of our performance, we generated three random datasets and trained TF classifiers on them as if they were actual data. Comparison with random controls better frames the practical performance of our method. The first random set consists of randomly permuted k-mer count data. The second is composed of a randomly selected 10% of each real dataset (also permuted). The third is a dataset composed of normally distributed random numbers in the range 0 to 1. The comparison of these results is shown in Table 4.


Table 4 Performance results of combined classifier and random datasets.
 Combined
methods
Random
k-mers
Random 10%
of all datasets
Random
normal data

Accuracy0.880.670.580.58
Sensitivity0.730.450.620.61
PPV0.880.500.410.41
F10.800.480.500.50

Clearly, the performance is much better than random, but results do not clearly indicate whether applying our classifiers to the entire genome would yield truly reliable predictions without further processing. A simple classification of all potential targets with our 104 classifiers returns, on average, approximately 800 new targets for each TF. This suggests that in order to find a set of truly reliable predictions genome-wide, postprocessing of our results is needed. Indeed, in other work we have applied Platt's method [56] to assign “posterior probabilities” to our predictions, allowing the selection of only the most significant targets [35]. The precise meaning of the term posterior probabilities is clarified in [35]. Using these probabilistic SVMs, the classifiers for each TF were applied to identify potential targets genome-wide in order to expand the binding repertoire of each factor. This results in predictions of new regulatory roles for some TFs and the identification of possible new regulatory structures such as feed-forward loops in metabolic pathways [35].

Many reports in the literature indicate that as many as 50% [57] to 60% [58] of the targets produced by ChIP-chip are not biologically functional. Our ability to correctly classify large amounts of high-throughput data indicates that there is relevant biological information that identifies ChIP-chip positives from other genes. One should also note that ChIP-chip experiments may be highly accurate in detecting binding of a TF even if that binding serves no biological function. This may be interpreted as a false positive from a functional perspective, but not so from a binding perspective. Our experiments may accurately classify binding targets as identified by ChIP-chip even if those targets show no change in expression as a result of binding.

In other work, we search for evidence that the predictions based on various classifiers make biological sense [35]. To do this, we examine individual datasets and extract the attributes that contribute most to the classifier of a transcription factor. The w vector described in previous paragraphs can be used in this way to identify the features, in any particular dataset, that are most important for classification. Features having large w components correspond to dimensions in the feature space where positives and negatives are more definitively separated. Thus, by examining a single dataset such as one that includes k-mer counts, it is possible to determine the k-mer(s) most responsible for the differences between positives and negatives. Our results on the k-mer count dataset have shown that the many k-mers having large w values are in fact elements of the known transcription factor binding site as taken from the Saccharomyces Genome Database (SGD) [35].

To better judge the performance of new methods, it is sometimes useful to compare them with standard PSSM scans for their ability to identify targets. Carefully constructed variants of PSSMs, which take into account conservation of sites between multiple species or dependencies between nucleotides, offer excellent performance, but often there is insufficient data to construct such detailed models. In TRANSFAC version 6, only 17 available binding site matrices exist for yeast. Many of the remaining PSSMs used in this study have been created using motif discovery methods on high-throughput datasets [20]. The purpose of our comparison with PSSMs is to illustrate that some of the commonly used site matrices perform worse than a classification scheme built on an integrated dataset.

Overall, the SVM performs better than a simple weight-matrix scan. Figure 4 shows such a comparison as a function of sensitivity, specificity, positive predictive value, and the F1 statistic. The far-left grouping of data uses the TRANSFAC PSSMs for 17 TFs on just the manually curated positives (with same negatives as all other analyses) from TRANSFAC and literature sources. The second grouping from the left uses the same TRANSFAC PSSMs as the first grouping, but this time with the same high-throughput positive sets used in the SVM classification. The third grouping is a result from scans using PSSMs for all 104 TFs on the positive and negative sets on which the SVMs were trained. Finally, the far-right grouping restates the performance of the SVMs with 18 combined datasets on the full set of positives. The SVM classifiers outperform PSSMs, even when the matrices are from a curated set such as TRANSFAC. Although the PSSMs perform well, they suffer from a large number of false-positive predictions. Figure 4 shows data for only one threshold of PSSM scan, but altering the threshold does not make PSSMs more accurate than SVMs (see the Methods section). It is worth noting, however, that the site matrices from TRANSFAC offer much better performance than the matrices generated by motif-discovery procedures. Support vector machine classifiers offer a reasonable balance between sensitivity and false prediction. Alternatives to SVMs, such as Bayesian networks and neural networks, may offer similar performance, but SVMs have an advantage because they permit different types of high-dimensional data to be easily combined.

Concluding remarks

In conclusion, support vector machines can accurately classify and predict transcription factor binding sites using a wide range of genomic data types. Combining various information sources reduces false positives and increases sensitivity. On the basis of k-mer data, SVMs appear to be identifying appropriate features for classification. Finally, the flexibility of this approach allows easy inclusion of new types of genomic data. Our future work involves the development of sophisticated dimension-reduction techniques to discover biologically significant features in different datasets on the basis of classifier performance. As always with high-dimensional datasets, the risk of over-fitting can restrict the wide application of a classification tool. (The term over-fitting is considered to be synonymous with overtraining, which indicates that a classifier is very accurate for a training set but less accurate for independent test sets.) Although the maximal margin of SVMs is resistant to over-fitting, the resistance can be enhanced by selecting the best features for classifier construction. In future work, we plan to test several feature-reduction methods such as Fisher's Linear Discriminant and SVM–RFE (Recursive Feature Elimination). A reduction in the feature set would also allow a comparison with other classification systems, such as Bayesian networks or KNN classifiers, which are difficult to train on very large sets of features.

Additionally, new datasets can be included that leverage information about DNA structural features. Information of this type could include promoter melting-temperature profiles, bend and curve features of promoters [59], or DNA accessibility predictions based on patterns of hydroxyl radical cleavage [6061]. Furthermore, it may be possible to capture more meaningful information from k-mer counts by additionally measuring the likelihood that a certain k-mer occurs by chance in a gene's promoter, thus attaching a p-value to all k-mers in each promoter region. Support-vector machines show promise as a means to analyze regulatory relationships and will be increasingly useful for the analysis of mammalian genomes as more genomic data becomes available.

**Trademark, service mark, or registered trademark of BIOBASE GmbH, The MathWorks, Inc., or Microsoft Corporation in the United States, other countries, or both.

References

Received October 3, 2005; accepted for publication December 21, 2005; Published online June 27, 2006.


    About IBMPrivacyContact