Sequence Database

The sequence database is the complete list of GPCR sequences considered in this experiment in FASTA format.
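As a rough illustration of how such a file can be read, the sketch below parses FASTA text into a mapping from header line to sequence. The function name read_fasta and the two records are hypothetical and are not taken from the actual sequence database.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical records; names and residues are invented for illustration.
example = """>seq_A hypothetical receptor
ACDEFGHIKL
MNPQRSTVWY
>seq_B hypothetical receptor
ACDEFG
""".splitlines()

db = read_fasta(example)
```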

Example Tree

Unfortunately, there is no agreement on a global taxonomy for the set of GPCRs and different databases report partially overlapping functional groups based on biological rather than computational classification. For instance, PRINTS defines functional groups according to the GPCR Fact Book. GPCRDB, instead, does so based on pharmacological properties of the receptors. The corresponding two hierarchical lists of GPCRs disagree considerably.

Therefore, we synthesize our own list of functional groups for our sequence database from these two hierarchical lists of GPCRs. We refer to this list as the example tree, and to the list of clusters identified by our system as the classification tree. We compare the classification tree against the example tree to assess the performance of our system. We build the example tree with the goal of maximizing the number of potential functional groups that could be matched. Whenever the two hierarchical lists disagree, we adopt the finer classification, while maintaining consistency with either of them whenever possible.

The list of b-groups, which are non-overlapping, covers the entire set of GPCRs. The c-groups are combinations of b-groups, and c-groups that overlap either contain or are contained by one another. In other words, the result is a tree in which the b-groups constitute the leaves and the c-groups constitute the internal nodes.
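The tree property can be checked mechanically: a family of groups forms a tree when any two groups are either disjoint or nested. The sketch below is a minimal illustration under that assumption. The group names, the sequence identifiers, and the helper is_tree are all hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical b-groups (leaves): disjoint sets of sequence identifiers.
b_groups: Dict[str, FrozenSet[str]] = {
    "b1": frozenset({"s1", "s2"}),
    "b2": frozenset({"s3"}),
    "b3": frozenset({"s4", "s5"}),
}

# Hypothetical c-groups (internal nodes): each is a combination of b-groups.
c_group_parts: Dict[str, List[str]] = {
    "c1": ["b1", "b2"],
    "c2": ["b1", "b2", "b3"],
}


def expand(parts: List[str]) -> FrozenSet[str]:
    """Union of the b-groups that make up one c-group."""
    return frozenset().union(*(b_groups[name] for name in parts))


def is_tree(groups: Dict[str, FrozenSet[str]]) -> bool:
    """True if every pair of groups is either disjoint or nested."""
    sets = list(groups.values())
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a & b and not (a <= b or b <= a):
                return False
    return True


all_groups = dict(b_groups)
all_groups.update({name: expand(parts) for name, parts in c_group_parts.items()})

tree_ok = is_tree(all_groups)

# A group that straddles two b-groups without containing either breaks the tree.
broken = dict(all_groups, bad=frozenset({"s2", "s3", "s4"}))
tree_broken = is_tree(broken)
```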

Fact Sheet

Classification Tree

The classification tree is a graphical representation of the tree produced by our system. Each node is annotated with one or more lines. Each line indicates the name of the node, which is a binary string corresponding to the path from the root to the node; whether the class matches a group in our example tree (a group name enclosed in parentheses means yes); the size of the node; whether the pattern discovered from this class (which does not necessarily occur in every sequence that belongs to the class) matches or overlaps PRINTS patterns (PP means yes); and whether the pattern has any GPCRMD residues incident on it (MR means yes). A line that starts with MORE corresponds to another pattern discovered from the same class.
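The binary naming scheme can be illustrated with a few string operations. This sketch assumes that the root is named by the empty string and that each split appends 1 for the branch in which the pattern occurs and 0 for the other branch. Both the root name and the helper functions are assumptions made for illustration, not details confirmed by the system.

```python
from typing import List


def child_name(parent: str, pattern_occurs: bool) -> str:
    """Name of a child node: the parent's name plus one more path bit."""
    return parent + ("1" if pattern_occurs else "0")


def ancestors(name: str) -> List[str]:
    """Names of all proper ancestors, from the (assumed) empty-string root down."""
    return [name[:i] for i in range(len(name))]


def is_descendant(node: str, other: str) -> bool:
    """True if `node` lies strictly below `other` in the tree."""
    return node != other and node.startswith(other)


node = child_name(child_name(child_name("", True), False), True)
path = ancestors(node)
```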

Each node points to the page containing information about the node. If the node is an internal class, the page contains two panels. The top panel gives information about each pattern discovered from this class: the split determined by the pattern, a summary for each of the two branches (the 1-branch corresponding to the set of sequences in which the pattern occurs and the 0-branch corresponding to the complement of that set), a list of matching or overlapping PRINTS patterns, and a list of matching GPCRMD residues. (For this experiment, the information on how a pattern discovered by our system and a PRINTS pattern overlap is not available.) The bottom panel contains a table with an entry for every occurrence of the pattern. Each entry includes the sequence in which the pattern occurs, how the occurrence was discovered (s for SPLASH as well as HMMSEARCH, h for HMMSEARCH alone), the e-value for the occurrence computed by HMMSEARCH, where the occurrence starts in the sequence, where it ends in the sequence, and the actual occurrence. If the node is a leaf class, the page contains a table with an entry for every sequence that belongs to the class.
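The split and the occurrence table described above can be sketched as a small data structure. The field names, the Occurrence class, and the sample values are hypothetical. The sketch shows only how the 1-branch and 0-branch follow from a list of occurrences, not how SPLASH or HMMSEARCH produce them.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Occurrence:
    sequence: str   # name of the sequence in which the pattern occurs
    found_by: str   # "s" = SPLASH as well as HMMSEARCH, "h" = HMMSEARCH alone
    e_value: float  # e-value computed by HMMSEARCH
    start: int      # where the occurrence starts in the sequence
    end: int        # where the occurrence ends in the sequence
    text: str       # the actual occurrence


def split_class(members: Set[str],
                occurrences: List[Occurrence]) -> Tuple[Set[str], Set[str]]:
    """Return (1-branch, 0-branch) for a class, given one pattern's occurrences."""
    one_branch = {occ.sequence for occ in occurrences} & members
    zero_branch = members - one_branch  # complement within the class
    return one_branch, zero_branch


# Hypothetical class of three sequences and two occurrences of one pattern.
members = {"s1", "s2", "s3"}
occurrences = [
    Occurrence("s1", "s", 1e-12, 120, 126, "ACDEFGH"),
    Occurrence("s3", "h", 3e-4, 98, 104, "ACDQFGH"),
]

one_branch, zero_branch = split_class(members, occurrences)
```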

Group Table

In the group table, there is an entry for every sequence in the sequence database. The entries are grouped by the base group membership of the sequences in the example tree. Each entry includes the leaf class in the classification tree to which the sequence has been assigned and the sequence information.

The leaf class points to the page containing information about the leaf class. The sequence name points to the page containing information about the occurrences of patterns discovered by our system in this sequence. There are two lines for each occurrence in the sequence, one indicating the structural information about the location of the occurrence, and one indicating the location of the actual occurrence and the occurrence itself. The occurrence points to the page containing information about the node from which the corresponding pattern was discovered.

For each base group in the example tree, the number of different leaf classes in the classification tree to which members of this group have been assigned is indicated. The distribution of structural regions over all the occurrences of patterns discovered by our system in the sequences that belong to the group is also indicated, where T1 denotes the first transmembrane region, E1 denotes the first extracellular region, C1 denotes the first cytoplasmic region, and so on.
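Both per-group summaries are simple tallies. The sketch below shows one way to compute them, assuming that each sequence has a known base group and leaf class and that each occurrence has been labeled with a single structural region. The rows and variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (sequence, base group, leaf class in the classification tree).
assignments = [
    ("s1", "b1", "010"),
    ("s2", "b1", "010"),
    ("s3", "b1", "011"),
    ("s4", "b2", "1"),
]

# Hypothetical pattern occurrences: (sequence, structural region of the occurrence).
occurrence_regions = [
    ("s1", "T1"),
    ("s1", "C2"),
    ("s2", "T1"),
    ("s4", "E1"),
]

group_of = {}
leaf_classes = defaultdict(set)  # base group -> distinct leaf classes
for seq, group, leaf in assignments:
    group_of[seq] = group
    leaf_classes[group].add(leaf)

region_dist = defaultdict(Counter)  # base group -> region -> occurrence count
for seq, region in occurrence_regions:
    region_dist[group_of[seq]][region] += 1

num_leaf_classes = {group: len(leaves) for group, leaves in leaf_classes.items()}
```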

Pattern Table

In the pattern table, there is an entry for every pattern discovered by our system. Each entry includes the group in the example tree, if any, that is matched by the class of sequences in which the pattern occurs; the size of the set of sequences from which the pattern is discovered (that is, on which pattern discovery is performed); the size of the class of sequences in which the pattern occurs; the consensus for the pattern; and the residue incidence information. The residue incidence information indicates which residue in the mutation database matches a residue in the pattern (given by res_ind, which is the index of the residue in the residue table) and the position of that residue in the pattern (given by pat_pos).

Each consensus points to the profile HMM corresponding to the pattern. (For this experiment, if multiple patterns were discovered from the same group of sequences, only the profile HMM corresponding to the last pattern discovered is available.)

Residue Table

In the residue table, there is an entry for every residue in our mutation database. GPCRMD collects a list of residues for which a mutation experiment was performed before 1997. For each residue in the list, it provides information about the residue itself, the functional effect observed in the mutation experiment, and the literature reference for the experiment. We construct our own mutation database by taking this list and removing every residue for which the mutation involves a deletion (rather than a substitution) and every residue whose source sequence does not belong to our sequence database.
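The filtering step can be sketched as follows, on the assumption that both kinds of residue are dropped: those whose mutation is a deletion, and those whose source sequence is absent from the sequence database. The record fields and sample values are hypothetical and do not reflect the actual GPCRMD format.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class MutationRecord:
    source_sequence: str  # sequence in which the mutated residue lies
    position: int         # position of the residue in the source sequence
    kind: str             # "substitution" or "deletion" (hypothetical labels)
    effect: str           # functional effect reported for the experiment


def build_mutation_db(records: List[MutationRecord],
                      sequence_db: Set[str]) -> List[MutationRecord]:
    """Keep substitutions whose source sequence is in the sequence database."""
    return [
        rec for rec in records
        if rec.kind != "deletion" and rec.source_sequence in sequence_db
    ]


# Hypothetical input.
sequence_db = {"s1", "s2"}
records = [
    MutationRecord("s1", 113, "substitution", "reduced binding"),
    MutationRecord("s1", 200, "deletion", "no expression"),
    MutationRecord("s9", 45, "substitution", "no change"),
]

mutation_db = build_mutation_db(records, sequence_db)
```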

For each residue in our mutation database, the entry includes the position of the residue in the source sequence, the structural information about the region that contains that position, the gene associated with the source sequence, the species associated with the source sequence, the functional effect, the reference, and the pattern incidence information. The functional effect is in italics if it is actually null. The pattern incidence information indicates which pattern contains a residue that the residue currently under consideration matches (given by pat_ind, which is the index of the pattern in the pattern table) and the position of that residue in the pattern. A pair in the pattern incidence information is in bold if the pattern occurs in a class of sequences that contains the source sequence of the residue currently under consideration.
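The pattern incidence information in the residue table mirrors the residue incidence information in the pattern table, so one can be viewed as the inverse index of the other. The sketch below illustrates that relationship and the rule for bold pairs. It assumes the two tables are exact inverses, which the description suggests but does not state, and all indices and sequence names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical pattern-table fragment: pat_ind -> [(res_ind, pat_pos), ...].
residue_incidence: Dict[int, List[Tuple[int, int]]] = {
    0: [(4, 2), (7, 5)],
    1: [(4, 0)],
}

# Hypothetical: class of sequences in which each pattern occurs.
pattern_class: Dict[int, Set[str]] = {0: {"s1", "s2"}, 1: {"s3"}}

# Hypothetical: source sequence of each residue in the mutation database.
residue_source: Dict[int, str] = {4: "s1", 7: "s3"}


def invert(incidence: Dict[int, List[Tuple[int, int]]]) -> Dict[int, List[Tuple[int, int]]]:
    """Build res_ind -> [(pat_ind, pat_pos), ...] from the pattern-side index."""
    result = defaultdict(list)
    for pat_ind, pairs in incidence.items():
        for res_ind, pat_pos in pairs:
            result[res_ind].append((pat_ind, pat_pos))
    return dict(result)


def is_bold(res_ind: int, pat_ind: int) -> bool:
    """A pair is bold if the pattern's class contains the residue's source sequence."""
    return residue_source[res_ind] in pattern_class[pat_ind]


pattern_incidence = invert(residue_incidence)
```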

Each pair in the pattern incidence information points to the page containing information about the class from which the pattern is discovered.

Structure Chart

In the structure chart, there is an entry for every type of categorization under consideration. How the distribution is computed for each categorization is given below.

Classification Graph

An example of the tree-graph combination approach on a subset of the sequence database that does not contain the 0-level DRY pattern.