| Pattern Discovery | ||||
| Given a set of protein or DNA sequences A1, A2, ..., An, Splash discovers patterns of the
form S(S ∪ '.')*S, where S is an amino acid, a nucleic acid, or a class of amino acids,
and '.' is the wildcard character. S is called a token.
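
The pattern form above maps closely onto an ordinary regular expression, so a small sketch can show how such a pattern might be located in a sequence. This is not Splash's own matcher: the function names `compile_pattern` and `find_occurrences` are hypothetical, and the sketch assumes uppercase one-letter codes with classes written in brackets, such as [ILMV].

```python
import re

# Hypothetical helpers for illustration; not part of Splash.
_TOKEN = r"(?:[A-Z]|\[[A-Z]+\])"  # a single residue or a bracketed class
_SHAPE = re.compile(rf"{_TOKEN}(?:{_TOKEN}|\.)*{_TOKEN}")  # S(S ∪ '.')*S


def compile_pattern(pattern: str) -> re.Pattern:
    """Check that the pattern starts and ends with a token, then compile it."""
    if not _SHAPE.fullmatch(pattern):
        raise ValueError(f"not of the form S(S ∪ '.')*S: {pattern!r}")
    # A lookahead makes every match zero-width, so overlapping
    # occurrences are all reported.
    return re.compile(f"(?=({pattern}))")


def find_occurrences(pattern: str, sequence: str) -> list[int]:
    """Return the 0-based offsets at which the pattern occurs."""
    return [m.start() for m in compile_pattern(pattern).finditer(sequence)]


print(find_occurrences("CA.F..G", "ALCALFAAGSKQ"))  # [2]
```

Tokens and wildcards already have the same meaning in regular-expression syntax, so the sketch only has to validate the shape and handle overlapping matches.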
Example:
String 1: ALCALFAAGSKQ
Pattern: CA.F..G

Amino acid classes

These are defined by a mutation probability matrix M and a threshold t. For each amino acid ai, all amino acids aj such that M(ai, aj) > t form a class. For instance, if BLOSUM50 is the mutation probability matrix and t = 1, then the following amino acid classes are possible:

A R N D C Q E G H I L K M F P S T W Y V [DE] [DQE] [ILMV] [MIL] [ST] [HFWY] [RK] [LM]

Arbitrary matrices can be used to capture any desired clustering scheme. Matrices must be built according to the matrix syntax and must be stored in a matrix subdirectory of the directory where Splash is run.

Mandatory Constraints

Patterns must satisfy the following mandatory constraints in order to be reported:

Minimum Support: There are two choices. a) The pattern must occur at least J0 times in the set of sequences. b) The pattern must occur in at least J0 distinct sequences. The latter constraint can also be expressed as a percentage J0% of the total number of sequences in the set.

Density Constraint: Patterns must have at least k0 matching tokens in each substring of length l0 that starts with a token. These parameters can be set independently.

Identical Matches: Either one or two characters in the pattern must match exactly. This can be set via a command line parameter. This requirement will be removed in a future version of the program.

Optional Discovery Constraints

Statistical Significance: Patterns are reported only if their Z-Score (see the Splash paper) exceeds a given threshold z0.

Length: Patterns are reported only if they have at least l0 tokens.

Identical: Patterns are reported only if all tokens match identically. This is useful in the analysis of DNA or to speed up the analysis of extremely large databases.

Orphan Analysis: Patterns are reported only if they also occur in a predefined sequence. This is useful for orphan analysis.

False Positive: Patterns are reported only if the number of expected false positives in SWISS-PROT is below a given threshold.

Output Format Parameters

Format: Patterns can be formatted as a regular expression, as a PSSM, or as an HTML file. These options are only available with some of the exhaustive or hierarchical analysis models.

Sort Order: Patterns can be sorted by support (number of occurrences), sequence support (number of distinct sequences in which they occur), length, or Z-Score.

Verbose: Patterns can be followed by a list of the locations where they occur. These are pairs of Ids and Offsets, with Ids starting at 1 and Offsets starting at 0. |
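
The support and density constraints described above can be illustrated with a short sketch. It is only an approximation of what Splash does internally: the function names are hypothetical, sequence support is taken to mean the number of distinct sequences containing the pattern (as in the Sort Order description), and the density check assumes that windows running past the end of the pattern are not tested.

```python
import re


def tokenize(pattern: str) -> list[str]:
    """Split a pattern into positions: residues, bracketed classes, or '.'."""
    return re.findall(r"\[[A-Z]+\]|[A-Z.]", pattern)


def satisfies_density(pattern: str, k0: int, l0: int) -> bool:
    """Every window of l0 positions that starts with a token needs >= k0 tokens."""
    positions = tokenize(pattern)
    for i, pos in enumerate(positions):
        if pos == ".":
            continue  # windows must start with a token
        window = positions[i:i + l0]
        if len(window) < l0:
            break  # assumption: windows running past the end are not checked
        if sum(p != "." for p in window) < k0:
            return False
    return True


def occurrences(pattern: str, sequences: list[str]) -> list[tuple[int, int]]:
    """List (Id, Offset) pairs, with Ids starting at 1 and Offsets at 0."""
    rx = re.compile(f"(?=({pattern}))")  # lookahead keeps overlapping hits
    return [(sid, m.start())
            for sid, seq in enumerate(sequences, start=1)
            for m in rx.finditer(seq)]


def supports(pattern: str, sequences: list[str]) -> tuple[int, int]:
    """Return (support, sequence support) for the pattern."""
    hits = occurrences(pattern, sequences)
    return len(hits), len({sid for sid, _ in hits})


seqs = ["ALCALFAAGSKQ", "CAMFTTGCAQFWWG", "GGGG"]  # made-up test sequences
print(occurrences("CA.F..G", seqs))  # [(1, 2), (2, 0), (2, 7)]
print(supports("CA.F..G", seqs))     # (3, 2)
print(satisfies_density("CA.F..G", k0=2, l0=4))  # True
```

A minimum-support threshold J0 would then be compared against either the first or the second value returned by `supports`, depending on which of the two choices is in effect.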