Algorithm Description
Motivation: The
discovery of sparse amino acid patterns that match repeatedly in a set
of protein sequences is an important problem in computational biology.
Statistically significant patterns, that is, patterns that occur more
frequently than expected, may identify regions that have been preserved
by evolution and which may therefore play a key functional or structural
role. Sparseness can be important because a handful of non-contiguous
residues may play a key role, while others, in between, may be changed
without significant loss of function or structure. Similar arguments may
be applied to conserved DNA patterns. Available sparse pattern discovery
algorithms are either inefficient or impose limitations on the type of
patterns that can be discovered.
Algorithm: Splash is a deterministic pattern discovery algorithm that can find sparse amino acid or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome, the full set of PROSITE families, or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily with low overall homology can be analyzed to discover common functional or structural signatures.

Examples: Some examples of biologically interesting motifs discovered by Splash are reported for histone I, for PROSITE families, and for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. It can also be used to build functional taxonomies in an unsupervised manner.
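
To make the notion of a sparse pattern concrete, the sketch below checks where a pattern with "don't care" positions matches identically in a small set of sequences and counts how many sequences it occurs in. This is a brute-force illustration only, not the Splash algorithm itself. The `.` wildcard notation, the function names, and the toy sequences are assumptions made for this example.

```python
from typing import Dict, List

WILDCARD = "."  # assumed notation for a position where any residue is allowed


def matches_at(pattern: str, sequence: str, offset: int) -> bool:
    """Return True if the sparse pattern matches identically at this offset."""
    if offset + len(pattern) > len(sequence):
        return False
    return all(
        p == WILDCARD or p == sequence[offset + i]
        for i, p in enumerate(pattern)
    )


def find_occurrences(pattern: str, sequences: List[str]) -> Dict[int, List[int]]:
    """Map each matching sequence index to the offsets where the pattern occurs."""
    hits: Dict[int, List[int]] = {}
    for idx, seq in enumerate(sequences):
        offsets = [
            o for o in range(len(seq) - len(pattern) + 1)
            if matches_at(pattern, seq, o)
        ]
        if offsets:
            hits[idx] = offsets
    return hits


def support(pattern: str, sequences: List[str]) -> int:
    """Number of distinct sequences containing at least one match."""
    return len(find_occurrences(pattern, sequences))


# Toy sequences (invented for illustration, not real proteins).
toy_sequences = ["MKCAHCTG", "ACQWCLLH", "GGCPRCAA", "MMMMMMMM"]
occurrences = find_occurrences("C..C", toy_sequences)
print(occurrences)                      # {0: [2], 1: [1], 2: [2]}
print(support("C..C", toy_sequences))   # 3
```

Here the two cysteines are the conserved, non-contiguous residues, while the two positions between them may vary freely. A discovery algorithm has to search the space of such patterns rather than test one given pattern, which is the expensive part this sketch leaves out.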