| Pattern Discovery | ||||
| Given a set of protein/DNA sequences A1, A2, ..., An, Splash will discover patterns of the form T (T ∪ '.')* T, where T is an amino acid or nucleotide, or a class of amino acids, and '.' is the wildcard character. T is called a token.
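The pattern form above can be checked mechanically. The following is a minimal sketch, not Splash's own parser: it assumes that a token is written either as a single uppercase letter or as a bracketed class such as [FW], and the name `is_valid_pattern` is hypothetical.

```python
import re

# Assumption: a token T is a single uppercase letter or a bracketed class
# of uppercase letters; '.' is the wildcard character.
TOKEN = r"(?:[A-Z]|\[[A-Z]+\])"

# T (T or '.')* T: a pattern starts and ends with a token, with any mix of
# tokens and wildcards in between.
PATTERN_FORM = re.compile(rf"{TOKEN}(?:{TOKEN}|\.)*{TOKEN}")


def is_valid_pattern(pattern: str) -> bool:
    """Return True if the whole string has the form T (T or '.')* T."""
    return PATTERN_FORM.fullmatch(pattern) is not None
```

Under these assumptions a pattern needs at least two tokens, and a leading or trailing wildcard is rejected.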
Example:
String 1: ALCALFAAGSKQ
Pattern: CA.[FW]..G

Amino acid classes

Classes are defined by a mutation probability matrix M and a threshold t. For each amino acid ai, all amino acids aj such that M(ai, aj) > t are considered to form an individual class. For instance, if BLOSUM50 is the mutation probability matrix and t = 1, then the following amino acid classes are allowed:

A R N D C Q E G H I

Arbitrary matrices can be used to capture any desired clustering scheme. Matrices must be built according to the matrix syntax and must be stored in the matrix subdirectory of the directory where Splash is run.

Mandatory Constraints

Reported patterns must satisfy the following mandatory constraints:

Minimum Support: There are two choices. a) The pattern must occur at least j0 times in the set of sequences. b) The pattern must occur in at least j0 sequences of the set. The latter constraint can also be expressed as a percentage j0% of the total number of sequences in the set.

Density Constraint: Patterns must have at least k0 matching tokens in each substring of length w0 that starts with a token. The parameters k0 and w0 can be set independently.

Identical matches: Either one or two characters in each pattern must match exactly. The number can be set via a command line parameter. This requirement will be removed in a future version of the program.

Optional Discovery Constraints

Statistical Significance: Patterns are reported only if their ZScore (see the Splash paper) exceeds a given threshold z0.

Length: Patterns are reported only if they have at least l0 tokens.

Identical: Patterns are reported only if all tokens match identically. This is useful in the analysis of DNA, or to speed up the analysis of extremely large databases.

False positive: Patterns are reported only if the number of expected false positives in SWISS-PROT Rel. 36 is below a given threshold.

Output Format Parameters

Format: Patterns can be formatted as a regular expression, as a PSSM, or as an HTML file. These options are only available with some of the exhaustive or hierarchical analysis models.

Sort Order: Patterns can be sorted by support (number of occurrences), sequence support (number of distinct sequences in which they occur), length, or ZScore.

Verbose: Patterns can be followed by a list of the locations where they occur. Each location is a pair of Id and Offset, with Ids starting at 1 and Offsets starting at 0. |
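The support counts, the verbose location list, and the density constraint described above can be illustrated with a short Python sketch. This is not Splash's implementation: all function names are hypothetical, patterns are assumed to be valid regular expressions in the bracketed-class notation, and the density check assumes that only windows of length w0 lying entirely inside the pattern are tested.

```python
import re


def find_locations(pattern, sequences):
    """Return (Id, Offset) pairs; Ids start at 1 and Offsets at 0."""
    # A zero-width lookahead lets overlapping occurrences be reported too.
    regex = re.compile(f"(?={pattern})")
    return [(seq_id, match.start())
            for seq_id, seq in enumerate(sequences, start=1)
            for match in regex.finditer(seq)]


def support(locations):
    """Number of occurrences of the pattern."""
    return len(locations)


def sequence_support(locations):
    """Number of distinct sequences in which the pattern occurs."""
    return len({seq_id for seq_id, _ in locations})


def satisfies_density(pattern, k0, w0):
    """Check for at least k0 tokens in each window of w0 positions
    that starts with a token (assumption: windows that would run past
    the end of the pattern are skipped)."""
    positions = re.findall(r"\[[A-Z]+\]|[A-Z]|\.", pattern)
    is_token = [p != "." for p in positions]
    for start, starts_with_token in enumerate(is_token):
        if starts_with_token and start + w0 <= len(is_token):
            if sum(is_token[start:start + w0]) < k0:
                return False
    return True


sequences = ["ALCALFAAGSKQ", "MMCAYWLLG", "GGGGGG"]
locations = find_locations("CA.[FW]..G", sequences)
print(locations, support(locations), sequence_support(locations))
```

With a minimum support expressed as a percentage, the sequence support would be compared against j0% of `len(sequences)` rather than against an absolute count.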