Sequence Database
The sequence database is the complete list of GPCR sequences
considered in this experiment, given in FASTA format.
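As a rough illustration of how such a file can be read, the following minimal sketch parses FASTA-formatted lines into a dictionary. The function name read_fasta, the sequence names, and the residue strings are made up for the example and are not taken from the actual database.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record example (not real database content).
example = """>seq_A hypothetical receptor
MNGTEGPNFY
VPFSNKTGVV
>seq_B hypothetical receptor
MDVLSPGQGN
""".splitlines()

database = read_fasta(example)
```

Each key is the full header line without the leading ">", and wrapped sequence lines are joined into one string.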
Example Tree
Unfortunately, there is no agreement on a global taxonomy for the set
of GPCRs and different databases report partially overlapping
functional groups based on biological rather than computational
classification. For instance, PRINTS defines functional groups
according to the GPCR Fact Book. GPCRDB, instead, does so based on
pharmacological properties of the receptors. The corresponding two
hierarchical lists of GPCRs disagree considerably.
Therefore, we synthesize our own list of functional groups for our
sequence database based on the two hierarchical lists of GPCRs. We
refer to this list as the example tree, and we refer to the list of
clusters identified by our system as the classification tree. We
compare the classification tree against the example tree to assess the
performance of our system. We build the example tree with the goal of
maximizing the number of potential functional groups that could be
matched. We merge the two hierarchical lists by adopting the finer
classification, whenever a discrepancy exists between them, while
maintaining consistency with either of them whenever possible.
The list of b-groups, which are non-overlapping groups, covers the
entire set of GPCRs. The c-groups are combinations of the b-groups and
either contain or are contained by one another. In other words, the
result is a tree in which the b-groups constitute the leaves and the
c-groups constitute the internal nodes.
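The structure described above can be checked mechanically. The sketch below is only an illustration: the group names and members are hypothetical, and it reads the nesting condition as "any two c-groups are either nested or disjoint", which is an assumption about what makes the result a tree.

```python
from itertools import combinations

# Hypothetical b-groups: non-overlapping sets of sequence identifiers.
b_groups = {
    "b1": {"s1", "s2"},
    "b2": {"s3"},
    "b3": {"s4", "s5"},
}

# Hypothetical c-groups, each listed as the b-groups it combines.
c_groups = {
    "c1": ["b1", "b2"],
    "c2": ["b1", "b2", "b3"],
}


def members(c_name):
    """Return all sequences covered by a c-group."""
    out = set()
    for b_name in c_groups[c_name]:
        out |= b_groups[b_name]
    return out


def b_groups_are_disjoint():
    """True if no sequence belongs to two b-groups."""
    return all(not (x & y) for x, y in combinations(b_groups.values(), 2))


def c_groups_form_tree():
    """True if every pair of c-groups is nested or disjoint (assumed reading)."""
    for a, b in combinations(c_groups, 2):
        x, y = members(a), members(b)
        if not (x <= y or y <= x or not (x & y)):
            return False
    return True
```

With these toy groups both checks pass: c1 is contained in c2, and the b-groups are the leaves underneath them.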
Fact Sheet
Classification Tree
The classification tree is a graphical representation of the tree
produced by our system. Each line of annotation for a node indicates
the name of the node, which is a binary string corresponding to the
path from the root to the node; whether the class matches a group in
our example tree (a group name enclosed in parentheses means yes); the
size of the node; whether the pattern discovered from this class (not
necessarily occurring in every sequence that belongs to the class)
matches/overlaps PRINTS patterns (PP means yes); and whether the
pattern has any GPCRMD residues incident on it (MR means yes). If a
line starts with MORE, it corresponds to another pattern discovered
from the same class.
Each node points to the page containing information about the node.
If the node is an internal class, the page contains two panels. The
top panel gives information about each pattern discovered from this
class. For each such pattern, it gives information about the split
determined by the pattern, a summary for each of the two branches
(the 1-branch corresponding to the set of sequences in which the
pattern occurs and the 0-branch corresponding to the complement of
that set), a list of matching/overlapping PRINTS patterns, and a list
of matching GPCRMD residues. (For this experiment, the information on
how the pattern discovered by our system and a PRINTS pattern overlap
is not available.) The bottom panel gives a table with an entry for
every occurrence of the pattern. Each entry includes information about
the sequence in which the pattern occurs, how the occurrence is
discovered (s by SPLASH as well as HMMSEARCH, and h by HMMSEARCH
alone), the e-value for the occurrence computed by HMMSEARCH, where
the occurrence starts in the sequence, where the occurrence ends in
the sequence, and the actual occurrence. If the node is a leaf class,
the page contains a table in which there is an entry for every
sequence that belongs to the class.
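The split of a class into a 1-branch and a 0-branch, together with the binary node names, can be sketched as follows. This is a simplified illustration, not the system's implementation: a literal substring test stands in for pattern occurrence (the actual occurrences are found by SPLASH and HMMSEARCH), the empty string as the root name is an assumption, and the sequences are invented.

```python
def split_class(node_name, sequences, pattern_occurs):
    """Split a class by a pattern and name the two child nodes.

    The 1-branch holds the sequences in which the pattern occurs, and the
    0-branch holds the complement. Child names extend the parent's binary
    path by one digit (an assumed naming scheme).
    """
    one_branch = [s for s in sequences if pattern_occurs(s)]
    zero_branch = [s for s in sequences if not pattern_occurs(s)]
    return {node_name + "1": one_branch, node_name + "0": zero_branch}


# Hypothetical sequences; "DRY" in s is a stand-in for a real pattern match.
root_sequences = ["AADRYLA", "MKTLLV", "GGDRYAA"]
children = split_class("", root_sequences, lambda s: "DRY" in s)
```

Applying split_class again to children["1"] with another pattern would produce nodes named "11" and "10", so a node's name spells out its path from the root.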
Group Table
In the group table, there is an entry for every sequence in the
sequence database. The entries for the sequences are grouped by the
base group membership of the sequences in the example tree. For each
sequence, the leaf class in the classification tree to which the
sequence has been assigned and the sequence information are included
in the entry.
The leaf class points to the page containing information about the
leaf class. The sequence name points to the page containing
information about the occurrences of patterns discovered by our system
in this sequence. There are two lines for each occurrence in the
sequence, one indicating the structural information about the location
of the occurrence, and one indicating the location of the actual
occurrence and the occurrence itself. The occurrence points to the
page containing information about the node from which the
corresponding pattern was discovered.
For each base group in the example tree, the number of different leaf
classes in the classification tree to which members of this group have
been assigned is indicated. The distribution of structural regions
over all the occurrences of patterns discovered by our system in the
sequences that belong to the group is also indicated, where T1 denotes
the first transmembrane region, E1 denotes the first extracellular
region, C1 denotes the first cytoplasmic region, and so on.
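The two per-group summaries described above amount to a count of distinct leaf classes and a tally of region labels. The following minimal sketch shows one way to compute them. The record layout, group names, leaf-class names, and region lists are all hypothetical.

```python
from collections import Counter

# Hypothetical records, one per sequence:
# (base group, assigned leaf class, regions of the pattern occurrences)
sequences = [
    ("b1", "010", ["T1", "C1"]),
    ("b1", "010", ["T1"]),
    ("b1", "011", ["E1", "T1"]),
    ("b2", "10", ["C1"]),
]


def summarize(records):
    """Return {group: (number of distinct leaf classes, region counts)}."""
    leaf_classes = {}
    regions = {}
    for group, leaf, occurrence_regions in records:
        leaf_classes.setdefault(group, set()).add(leaf)
        regions.setdefault(group, Counter()).update(occurrence_regions)
    return {g: (len(leaf_classes[g]), dict(regions[g])) for g in leaf_classes}


summary = summarize(sequences)
```

In this toy data, group b1 is spread over two leaf classes and most of its occurrences fall in T1.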
Pattern Table
In the pattern table, there is an entry for every pattern discovered
by our system. For each pattern, the entry includes the group in the
example tree, if any, that is matched by the class of sequences in
which the pattern occurs; the size of the set of sequences on which
pattern discovery is performed; the size of the class of sequences in
which the pattern occurs; the consensus for the pattern; and the
residue incidence information. The residue incidence information
indicates which residue in
the mutation database matches a residue in the pattern (given by
res_ind, which is the index of the residue in the residue table)
and the position of that residue in the pattern (given by pat_pos).
Each consensus points to the profile HMM corresponding to the pattern.
(For this experiment, if multiple patterns were discovered from the
same group of sequences, only the profile HMM corresponding to the
last pattern discovered is available.)
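The residue incidence information here and the pattern incidence information in the residue table describe the same relation from two sides. The sketch below shows how one could be derived from the other. It assumes each incidence is stored as a pair of integers, and the index values are invented for illustration.

```python
def invert_incidence(residue_incidence):
    """Turn {pat_ind: [(res_ind, pat_pos), ...]} into
    {res_ind: [(pat_ind, pat_pos), ...]}."""
    pattern_incidence = {}
    for pat_ind, pairs in residue_incidence.items():
        for res_ind, pat_pos in pairs:
            pattern_incidence.setdefault(res_ind, []).append((pat_ind, pat_pos))
    return pattern_incidence


# Hypothetical data: pattern 0 touches residue 3 at pattern position 5, etc.
residue_incidence = {0: [(3, 5)], 1: [(3, 2), (7, 9)]}
pattern_incidence = invert_incidence(residue_incidence)
```

Here residue 3 ends up associated with both patterns, each with its own position in the pattern.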
Residue Table
In the residue table, there is an entry for every residue in our
mutation database. GPCRMD collects a list of residues for which a
mutation experiment was performed before 1997. For each residue in
the list, it provides information about the residue itself, the
functional effect resulting from the mutation experiment, and the
literature reference for the mutation experiment. We construct our own
mutation database by taking this list and removing every residue for
which the mutation involves a deletion (rather than a substitution)
and every residue whose source sequence does not belong to our
sequence database.
For each residue in our mutation database, the entry includes the
position of the residue in the source sequence, the structural
information about the region that contains that position, the gene
associated with the source sequence, the species associated with the
source sequence, the functional effect, the reference, and the pattern
incidence information. The functional effect is in italics if it is
actually null. The pattern incidence information indicates which
pattern contains a residue matched by the residue currently under
consideration (given by pat_ind, which is the index of the pattern in
the pattern table) and the position of that residue in the pattern. A
pair in the pattern incidence information is in bold if
the pattern occurs in a class of sequences that contains the source
sequence of the residue currently under consideration.
Each pair in the pattern incidence information points to the page
containing information about the class from which the pattern is
discovered.
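The construction of our mutation database from the GPCRMD list can be sketched as a simple filter. This is an illustration only: the field names (source, position, mutation_type) and the entries are hypothetical and do not reflect the actual GPCRMD record format. A residue is kept only if its mutation is not a deletion and its source sequence is in the sequence database.

```python
def build_mutation_database(gpcrmd_entries, sequence_ids):
    """Keep residues whose mutation is not a deletion and whose source
    sequence belongs to the sequence database."""
    return [
        entry
        for entry in gpcrmd_entries
        if entry["mutation_type"] != "deletion"
        and entry["source"] in sequence_ids
    ]


# Hypothetical entries (field names and values are made up).
gpcrmd_entries = [
    {"source": "seq_A", "position": 113, "mutation_type": "substitution"},
    {"source": "seq_A", "position": 120, "mutation_type": "deletion"},
    {"source": "seq_X", "position": 45, "mutation_type": "substitution"},
]
sequence_ids = {"seq_A", "seq_B"}

mutation_db = build_mutation_database(gpcrmd_entries, sequence_ids)
```

Only the first entry survives: the second is a deletion, and the third comes from a sequence outside the database.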
Structure Chart
In the structure chart, there is an entry for every type of
categorization under consideration. How the distribution is computed
for each categorization is given below.
Classification Graph
An example of the tree-graph combination approach on a subset of the
sequence database that does not contain the 0-level DRY pattern.