Detecting Patterns

Biological sequence analysis has received considerable attention since the early 1990s. As database sizes grew, so did the interest in mining the accumulated entries for existing patterns. One problem that attracted researchers in genetics was the discovery of patterns common to a family of proteins assumed to be related.

In "pattern discovery," one searches a database to identify anything that appears frequently. What is sought is not known in advance; it is determined in the process. Pattern discovery differs from pattern matching, where the item to search for is known in advance (see FLASH description). In both tasks, speed is essential.

Special Algorithm

IBM researchers developed a new combinatorial algorithm called Teiresias, which carries out pattern discovery in one or more "event streams." Examples of such streams include DNA and protein sequences. The algorithm can discover all patterns that occur two or more times in any such set of data, and it imposes no restrictions on the composition of the patterns, their location, their minimum or maximum length, or their relative arrangement.

The algorithm is very fast and excels with very weak signals. It gains its speed by avoiding an exhaustive exploration of the space of potential solutions, while still guaranteeing the completeness of the reported results. IBM scientists are currently using Teiresias to tackle several important problems in computational biology that lie outside the immediate context of pattern discovery.
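
The difference between pattern discovery and pattern matching can be illustrated with a small Python sketch. This is not Teiresias: it is a brute-force illustration that enumerates every contiguous substring, which is exactly the exhaustive exploration Teiresias is described as avoiding, and it handles only exact, gap-free patterns. The function names, the `min_len` and `min_support` parameters, and the sample sequences are all assumptions made for the example.

```python
from collections import defaultdict


def discover_repeats(sequences, min_len=3, min_support=2):
    """Brute-force pattern discovery (illustrative only, NOT Teiresias).

    Returns a dict mapping each contiguous substring of length >= min_len
    that occurs at least min_support times to a list of
    (sequence_index, offset) occurrences. Nothing is known in advance
    about what the patterns look like.
    """
    occurrences = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq)):
            # Enumerate every substring beginning at `start`.
            for end in range(start + min_len, len(seq) + 1):
                occurrences[seq[start:end]].append((seq_index, start))
    return {
        pattern: locs
        for pattern, locs in occurrences.items()
        if len(locs) >= min_support
    }


def match_pattern(sequences, pattern):
    """Pattern matching: the pattern to look for is known in advance."""
    hits = []
    for seq_index, seq in enumerate(sequences):
        pos = seq.find(pattern)
        while pos != -1:
            hits.append((seq_index, pos))
            # Step by one so overlapping occurrences are also found.
            pos = seq.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical toy input standing in for DNA "event streams".
    toy_sequences = ["ACGTACGA", "TTACGT"]
    for pattern, locs in sorted(discover_repeats(toy_sequences).items()):
        print(pattern, locs)
    print(match_pattern(toy_sequences, "ACG"))
```

The enumeration above grows quadratically with sequence length for each sequence, which is why it serves only to make the two task definitions concrete rather than as a practical method for large databases.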