|
Syntax is:
splash InputFile -E maxMotifs sparseness minSupport [options]
| maxMotifs |
maximum number of desired non-overlapping motifs
|
| sparseness |
0: only (k0, w0) is used as a density constraint
1: both (k0, w0) and (k0, 2 w0) are used
2: (k0, w0), (k0, 2 w0), and (k0, 4 w0) are used |
| minSupport |
lowest motif support as a percentage of the total
number of sequences in InputFile |
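The sparseness levels above can be sketched as follows. This is only an illustration: the function name density_constraints is hypothetical, and the values k0 = 3 and w0 = 8 are assumed here to correspond to the -k and -w defaults.

```python
def density_constraints(sparseness, k0, w0):
    """Return the (k, w) density constraints implied by a sparseness level."""
    if sparseness not in (0, 1, 2):
        raise ValueError("sparseness must be 0, 1, or 2")
    # Level s keeps k0 fixed and uses windows w0, 2*w0, ..., (2**s)*w0.
    return [(k0, w0 * 2**i) for i in range(sparseness + 1)]


# Assumed values: k0 = 3 and w0 = 8 (taken from the -k and -w defaults).
for level in (0, 1, 2):
    print(level, density_constraints(level, k0=3, w0=8))
```

Each higher level only adds a wider window with the same token count, so higher sparseness admits sparser patterns.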
Suggested Syntax:
splash file -E 10 0 0 -l 5 -f HTML. This will discover up to 10
motifs with at least 5 tokens each.
Example:
Running splash trypsin.fa -E 10 0 0 -l 5 -f HTML
produces the HTML file trypsin.fa_0.motifs_0.html
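When splash is driven from a script, the command line can be assembled from its positional parameters. The sketch below is one possible way to do this; the helper name exhaustive_command is hypothetical and not part of SPLASH, and the code only builds the argument list without running anything.

```python
import shlex


def exhaustive_command(input_file, max_motifs, sparseness, min_support, options=()):
    """Assemble the argument list for an exhaustive-discovery run."""
    argv = ["splash", input_file, "-E",
            str(max_motifs), str(sparseness), str(min_support)]
    argv.extend(options)  # trailing [options], e.g. -l 5 -f HTML
    return argv


argv = exhaustive_command("trypsin.fa", 10, 0, 0, ["-l", "5", "-f", "HTML"])
print(shlex.join(argv))
# To execute it, splash must be installed and on PATH:
#   subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual input file names.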
Options for Exhaustive Pattern Discovery:
| Option |
-f [HTML][REGEX][PSSM][HMM][MAST][metaMEME]
[metaSPLASH] |
| Default |
-f REGEX |
| Description |
| REGEX |
A single output file:
InputFile_minSupport.motifs
is created. Motifs are listed as regular expressions, using the
same syntax as regular pattern discovery |
| HTML |
This option generates a file
InputFile_minSupport.motifs.html with a color-annotated
representation of the discovered motifs and of where they
occur in the training set sequences. |
| PSSM |
A file:
InputFile_minSupport.motifs_id.pssm
is created for each motif id. The file contains a PSSM
with
1) a list of amino acids
2) a row for each amino acid position in the PSSM, holding
an array of raw frequencies of amino acids at that position
3) a list of amino acid raw frequencies in InputFile. |
| HMM |
A file:
InputFile_minSupport.motifs_id.hmm
is created for each motif id. The file contains a set of
prealigned sequences in FASTA format. These files can be
used directly as an input for HMMBuild in the HMMer package |
| MAST |
A file:
InputFile_minSupport.motifs
is created which contains a series of PSSM models. This file
is compliant with the MAST syntax and can be used by MAST
to screen sequence databases |
| metaMEME |
A file:
InputFile_minSupport.motifs
is created which contains a series of PSSM models. This file
is compliant with the metaMEME syntax and can be used
by metaMEME to screen sequence databases |
| metaSPLASH |
A file:
InputFile_minSupport.motifs
is created which contains a series of PSSM models. This file
is compliant with the splashsearch syntax and can be
used by splashsearch to screen sequence databases |
|
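The per-position rows described for the PSSM format can be illustrated with a short sketch. It assumes that "raw frequencies" means plain occurrence counts and that the motif occurrences are already aligned and of equal length. The function name, the amino acid ordering, and the example sequences are made up for illustration, and the actual .pssm file layout is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def raw_frequency_rows(aligned):
    """One row per motif position: raw counts of each amino acid in that column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("occurrences must all have the same length")
    rows = []
    for column in zip(*aligned):
        counts = Counter(column)
        rows.append([counts.get(aa, 0) for aa in AMINO_ACIDS])
    return rows


occurrences = ["CAG", "CSG", "CAG"]  # made-up aligned motif occurrences
rows = raw_frequency_rows(occurrences)
print(rows[1][AMINO_ACIDS.index("A")], rows[1][AMINO_ACIDS.index("S")])
```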
| Option |
-% min_percent_support |
| Default |
-% 100 |
| Description |
Saves time by reducing the initial minimum
support |
| Option |
-w window |
| Default |
-w 8 |
| Description |
Used in combination with -k to search for
patterns that are more or less dense |
| Option |
-k min_tokens_in_window |
| Default |
-k 3 |
| Description |
Used in combination with -w to search for
patterns that are more or less dense |
| Option |
-Z minZScore |
| Default |
-Z 1000 |
| Description |
|
| Option |
-z |
| Default |
set |
| Description |
|
| Option |
-n- min_patterns |
| Default |
-n- 10 |
| Description |
Support and density are decreased until at least min_patterns
patterns are found that satisfy all the other constraints. Using
at least the default, -n- 10, is suggested. |
| Option |
-FP maxFP |
| Default |
-FP inf |
| Description |
Reported patterns must have an expected number of
false positive matches no greater than maxFP in a database with
the same size and amino acid frequencies as SWISS-PROT Rel. 36.
The expected number is computed from the product of the
probabilities of matching the individual amino acids or amino
acid classes contained in the pattern.
If the patterns are going to be used as regular expressions
rather than to build a PSSM or HMM, a value approximately equal
to 1/10 of the number of sequences in the input file is suggested.
|
| Option |
-mt similarity_threshold |
| Default |
-mt 1.0 |
| Description |
This option sets the threshold used to define similarity
classes. It is suggested to use -mt 0 for very sensitive
searches |
| Option |
-b number_of_identities |
| Default |
-b 2 |
| Description |
This option sets the minimum number of identical
matches in the regular expression. It is suggested to
use -b 1 for very sensitive searches, although this may result
in significantly longer running times. |
| Option |
-l minimum_number_of_matches_in_pattern |
| Default |
-l same as -k |
| Description |
This option restricts reported patterns to those with
at least l tokens. Using at least -l 4 is suggested
to get patterns that are sufficiently specific |
| Option |
-S max_seconds_per_run |
| Default |
-S 1800 (1/2 hour) |
| Description |
This option limits the total time allocated to each
individual iteration. This is useful because very low support
levels can result in extremely long runs. If the time is exceeded
the program reports the motifs discovered so far and terminates. |
Other options can be used but are not
suggested |
|
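The false positive estimate described for -FP can be sketched roughly as below. This is an approximation under stated assumptions, not the program's actual computation: the function names are hypothetical, the per-position probabilities and the database size are made-up placeholders (not the real SWISS-PROT Rel. 36 figures), and multiplying the match probability by the number of database positions is an assumed way of bringing in the database size.

```python
import math


def expected_false_positives(position_probs, database_positions):
    """Estimate random matches as start positions times per-start match probability.

    position_probs holds, for each pattern position, the probability that a
    random residue matches it (the summed frequency of an amino acid class,
    or 1.0 for a wildcard position).
    """
    match_prob = math.prod(position_probs)
    return database_positions * match_prob


def suggested_max_fp(num_input_sequences):
    """Rule of thumb from the -FP description: about 1/10 of the input sequences."""
    return num_input_sequences / 10


# Placeholder values for illustration only.
probs = [0.05, 0.10, 1.0, 0.05, 0.02]
estimate = expected_false_positives(probs, database_positions=1_000_000)
print(estimate, suggested_max_fp(50))
```

Under these assumptions, a pattern would be reported only if its estimate does not exceed the value passed to -FP.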
|