Pattern Discovery

Given a set of protein/DNA sequences A1, A2, ..., An, Splash discovers patterns of the form T(T U '.')*T, where T, called a token, is an amino acid, a nucleotide, or a class of amino acids, and '.' is the wildcard character. A pattern therefore always starts and ends with a token.

Example:

String 1: ALCALFAAGSKQ
String 2: KCAQWSGGRNPS

Pattern:  CA.[FW]..G
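
The pattern syntax reads directly as a regular expression, so the example can be checked with a short Python sketch. Python's re module is used here purely for illustration; it is not part of Splash.

```python
import re

# Sequences and pattern taken from the example above.
sequences = ["ALCALFAAGSKQ", "KCAQWSGGRNPS"]
pattern = re.compile(r"CA.[FW]..G")

# Record the matched substring and its 0-based offset in each sequence.
matches = [(m.group(0), m.start()) for m in map(pattern.search, sequences)]

for seq, (text, offset) in zip(sequences, matches):
    print(f"{seq}: {text} at offset {offset}")
```

The pattern matches CALFAAG in String 1 and CAQWSGG in String 2.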

Amino acid classes

Classes are defined by a mutation probability matrix M and a threshold t. For each amino acid ai, all amino acids aj such that M(ai, aj) > t form one class. For instance, if BLOSUM50 is the mutation probability matrix and t = 1, then the following amino acid classes are allowed:

A  R  N  D  C  Q  E  G  H  I  
L  K  M  F  P  S  T  W  Y  V 
[DE]  [DQE]  [ILMV]  [MIL]  [ST]   [HFWY]  [RK]  [LM]
[FY]  [ST]   [VI]    [ND]   [KQE]  [HY]    [WY] 
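
The class rule above can be sketched in a few lines of Python. The function name amino_acid_classes and the three-letter toy matrix are hypothetical and chosen only to keep the example small; the scores are not BLOSUM50 values, and this is not Splash's own implementation.

```python
def amino_acid_classes(matrix, t):
    """For each residue ai, collect every aj with matrix[ai][aj] > t."""
    classes = {}
    for ai, row in matrix.items():
        members = sorted(aj for aj, score in row.items() if score > t)
        classes[ai] = "".join(members)
    return classes

# Hypothetical toy scores, for illustration only (NOT BLOSUM50 values).
toy_matrix = {
    "D": {"D": 8, "E": 2, "K": -1},
    "E": {"D": 2, "E": 6, "K": 1},
    "K": {"D": -1, "E": 1, "K": 6},
}

toy_classes = amino_acid_classes(toy_matrix, t=1)
print(toy_classes)
```

With t = 1, D and E each produce the class [DE], while K stays a single residue because its score against E does not exceed the threshold.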

Arbitrary matrices can be used to capture any desired clustering scheme. Matrices must be built according to the matrix syntax and stored in the matrix subdirectory of the directory where Splash is run.

Mandatory Constraints

Reported patterns must satisfy the following mandatory constraints:

Minimum Support: There are two choices. 

a) The pattern must occur at least j0 times in the set of sequences.
b) The pattern must occur in at least j0 independent sequences.

The latter constraint can also be expressed as a percentage j0% of the total number of sequences in the set.
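
The difference between the two support definitions can be illustrated with a small Python sketch. The helper support_counts and the third sequence are hypothetical, and counting overlapping occurrences separately is an assumption; the text above does not say how Splash treats overlaps.

```python
import re

def support_counts(pattern, sequences):
    """Return (total occurrences, number of sequences with an occurrence)."""
    # A zero-width lookahead lets overlapping occurrences be counted (assumption).
    regex = re.compile(f"(?={pattern})")
    per_sequence = [len(regex.findall(seq)) for seq in sequences]
    return sum(per_sequence), sum(1 for n in per_sequence if n > 0)

# The first two sequences come from the example above; the third is made up
# so that one sequence contains the pattern twice.
sequences = ["ALCALFAAGSKQ", "KCAQWSGGRNPS", "CAAFAAGCAAWAAG"]
occurrences, seq_support = support_counts("CA.[FW]..G", sequences)

j0 = 4
print("choice a) satisfied:", occurrences >= j0)
print("choice b) satisfied:", seq_support >= j0)
print("sequence support as a percentage:", 100 * seq_support / len(sequences))
```

Here the pattern occurs four times but in only three sequences, so with j0 = 4 it passes choice a) and fails choice b).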

Density Constraint: Patterns must have at least k0 matching tokens in each substring of length w0 that starts with a token. These parameters can be set independently.
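Example: pattern CA.F..G satisfies the constraint k0=2, w0=4, because each substring of length 4 that starts with a token (CA.F, A.F. and F..G) contains at least 2 tokens.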
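
A minimal Python sketch of the density check follows. The function names are hypothetical, and it assumes that only windows lying entirely inside the pattern are tested; that reading is consistent with the example, since a truncated window starting at the final G would hold a single token.

```python
import re

def tokenize(pattern):
    """Split a pattern into positions: a residue, a bracketed class, or '.'."""
    return re.findall(r"\[[^\]]+\]|.", pattern)

def satisfies_density(pattern, k0, w0):
    """Check that every full window of w0 positions starting at a token
    contains at least k0 tokens (truncated windows are skipped: assumption)."""
    positions = tokenize(pattern)
    for start in range(len(positions) - w0 + 1):
        if positions[start] == ".":
            continue  # a window must start with a token, not a wildcard
        window = positions[start:start + w0]
        if sum(1 for p in window if p != ".") < k0:
            return False
    return True

print(satisfies_density("CA.F..G", k0=2, w0=4))
print(satisfies_density("CA.F..G", k0=3, w0=4))
```

A bracketed class such as [FW] counts as one position and one token, so CA.[FW]..G is treated exactly like CA.F..G.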

Identical matches: Either one or two characters in each pattern must match exactly. The choice between one and two is made via a command-line parameter. This requirement will be removed in a future version of the program.

Optional Discovery Constraints

Statistical Significance: Patterns are reported only if their ZScore (see the Splash paper) exceeds a given threshold z0.

Length: Patterns are reported only if they have at least l0 tokens.

Identical: Patterns are reported only if all tokens match identically. This is useful in the analysis of DNA or to speed up the analysis in extremely large databases.

False positive: Patterns are reported only if the number of expected false positives in SWISS-PROT Rel. 36 is below a given threshold.

Output Format Parameters

Format: Patterns can be formatted as a regular expression, as a PSSM, or as an HTML file. These options are available only with some of the exhaustive or hierarchical analysis models.

Sort Order: Patterns can be sorted by support (number of occurrences), sequence support (number of distinct sequences in which they occur), length, or ZScore.

Verbose: Patterns can be followed by a list of the locations where they occur. Each location is a pair (Id, Offset): Id is the sequence number, counted from 1, and Offset is the position within that sequence, counted from 0.
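
The numbering convention for locations can be illustrated with the example from the top of this page. This Python sketch only reproduces the (Id, Offset) pairs; the helper name locations is hypothetical and the layout of Splash's actual verbose output is not shown here.

```python
import re

def locations(pattern, sequences):
    """List (Id, Offset) pairs; Ids start at 1 and Offsets start at 0."""
    regex = re.compile(f"(?={pattern})")  # lookahead also finds overlapping hits
    return [
        (seq_id, m.start())
        for seq_id, seq in enumerate(sequences, start=1)
        for m in regex.finditer(seq)
    ]

locs = locations("CA.[FW]..G", ["ALCALFAAGSKQ", "KCAQWSGGRNPS"])
print(locs)
```

The pattern starts at the third character of String 1 and the second character of String 2, which gives the pairs (1, 2) and (2, 1).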
