| Remote Homology Detection | ||||
| The protein homology detection task is a classic problem
[1]. The task consists of finding, for a given protein or protein family, all
distantly related protein sequences in a large database of unannotated sequences. Two
sequences that share a common evolutionary ancestor are said to be homologous.
We do not have access to ancestral protein sequences, so the homology detection task is necessarily inferential.

Inferences of homology have been made based upon pairwise comparisons of protein sequences, using dynamic programming algorithms such as the Needleman-Wunsch [2] and Smith-Waterman [3] algorithms. The popular database search tools BLAST [4] and FASTA [5] are fast approximations of these dynamic programming algorithms.

Evidence of subtle yet significant homologies can also emerge from comparing sequences to statistical models that describe families of known sequences. In this context, the ability to build complex consensus models from a training set of proteins that share a common function or structure is a powerful tool. There are three key reasons for this power.

First, in a consensus model, protein regions that are less functionally constrained contribute far less to the sequence-to-model comparison than they do in a pairwise sequence analysis. This tends to reduce the impact of "noisy" sequences that may contain multiple diverse functional signatures.

Second, rather than using a generic model of amino acid substitution [2], more accurate family- and site-specific models can be used.

Finally, pairwise sequence analysis often assumes that sequences change mostly through amino acid substitution and insertion/deletion. In reality, however, protein domains can often be swapped without a loss of protein function, and domain exchange among non-homologous proteins is an important instrument of molecular evolution.

The SPLASHSearch program can efficiently compare a sequence database against one or more protein family models. These models are built by running splash in exhaustive mode against protein sets grouped by functional or structural criteria. SPLASHSearch can be run in three modes: Single Model, Multiple Model, and Training.
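The contrast drawn above between pairwise comparison and sequence-to-model comparison can be illustrated with a minimal Python sketch. It is not the SPLASHSearch implementation, and it does not reproduce the model type that SPLASH builds. It assumes a toy identity scoring scheme (match, mismatch, linear gap) in place of a real substitution matrix, a gap-free toy family of equal-length sequences, and a uniform background distribution. The function names `smith_waterman_score`, `build_profile`, and `profile_score` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, gap-free family sequences.

    Assumption: uniform background frequencies (1/20 per amino acid).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, segment):
    """Ungapped score of a segment against the profile, column by column."""
    return sum(col[aa] for col, aa in zip(profile, segment))


# Toy family: positions 1 and 3 (C, H) are conserved, positions 2 and 4 vary.
family = ["CAHK", "CGHR", "CLHE", "CSHD"]
profile = build_profile(family)

for candidate in ("CWHW", "WAWK"):
    print(candidate,
          smith_waterman_score(family[0], candidate),
          round(profile_score(profile, candidate), 2))
```

With this toy scoring, the pairwise comparison against the first family member gives `CWHW` and `WAWK` the same score, because each shares two residues with it. The profile ranks `CWHW` higher, because it matches the two conserved columns, whereas `WAWK` matches only the variable ones. This is the sense in which weakly constrained regions contribute less to a sequence-to-model comparison, and in which site-specific scores replace a single generic substitution model.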
|