Performance

In Fig. 1, we report the performance of Splash and Pratt as a function of increasing database size. The database is produced as follows. First, an appropriate sample set of sequences is selected from the Brookhaven PDB database to obtain the desired size. Then the positions of the amino acids are randomized. This results in a random database of the appropriate length, with the same amino acid frequencies as the corresponding PDB sample. Patterns that occur in more than 20% of the sequences in the database are reported. The total size of each database is 8192 × 2^i residues, and databases for values of i ranging from 0 to 6 are processed. The largest random database is approximately 512,000 residues.

On the left y axis, we report the time, in seconds, required by Splash and Pratt to process the random databases, in log scale. On the x axis, we report the database size, also in log scale. On the right y axis, we report the number of discovered patterns, in linear scale. This is shown by the curve with the diamond symbol. The density constraints are k0=2, l0=5. The maximum memory footprint of the program is 12 MB.

The patterns discovered by Pratt are identical to those discovered by Splash. However, Splash is increasingly faster than Pratt as the database size increases. At the smallest database size, 8 KRes, it is about 6 times faster. At size 128 KRes, it is about 3 orders of magnitude faster. Also, while the patterns reported by Pratt are limited to a maximum span (50 characters in this case), Splash discovers any pattern that satisfies the density and support constraints, no matter how long.

In Fig. 2, we report similar performance measurements against a histone I database, with 209 proteins, at increasingly higher values of the support. This is an interesting case because this database is pattern-rich, generating in excess of 10,000 patterns for k0=2, l0=5, J0=100.
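The construction of the random databases used for Fig. 1 can be illustrated with a short Python sketch. It is not the procedure actually used to produce the benchmark data, only one plausible reading of it. The function names (`database_sizes`, `randomize_positions`, `min_support`) are hypothetical. The sketch also assumes that residues are shuffled across the whole sample while the original sequence lengths are kept. The text only states that positions are randomized and that amino acid frequencies are preserved.

```python
import random
from collections import Counter

BASE_SIZE = 8192  # residues in the smallest database (i = 0)


def database_sizes(max_i=6):
    """Total database sizes 8192 * 2**i for i = 0..max_i."""
    return [BASE_SIZE * 2 ** i for i in range(max_i + 1)]


def randomize_positions(sequences, seed=0):
    """Shuffle all residues of the sample (assumption: across sequences),
    then cut the shuffled pool back into sequences of the original lengths.
    The overall amino acid frequencies are unchanged by construction."""
    rng = random.Random(seed)
    pool = [residue for seq in sequences for residue in seq]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for seq in sequences:
        shuffled.append("".join(pool[start:start + len(seq)]))
        start += len(seq)
    return shuffled


def min_support(num_sequences, percent=20):
    """Smallest sequence count that is strictly more than `percent` percent."""
    return num_sequences * percent // 100 + 1


if __name__ == "__main__":
    print(database_sizes())  # seven sizes, from 8192 up to 524288 residues
    sample = ["ACDEFGHIKL", "MNPQRSTVWY", "ACACAC"]  # toy stand-in for a PDB sample
    random_db = randomize_positions(sample, seed=42)
    # Composition is preserved even though positions are randomized.
    print(Counter("".join(random_db)) == Counter("".join(sample)))
    print(min_support(100))  # more than 20% of 100 sequences
```

For i = 6 the size is 8192 × 64 = 524,288 residues, which matches the "approximately 512,000 residues" quoted for the largest database if 1 KRes is read as 1024 residues.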

Fig. 1: Splash and Pratt: Time versus Database Size. Pattern discovery time is reported versus database size.

Fig. 2: Splash vs. Pratt (a) vs. pattern support j (b) vs. the number of discovered patterns.
