U.S. patent application number 13/567535 was filed with the patent office on 2012-08-06 and published on 2012-11-29 for methods and systems for conservative extraction of over-represented extensible motifs.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Alberto Apostolico, Matteo Comin, Laxmi Priya Parida.
Application Number | 20120303287 13/567535 |
Document ID | / |
Family ID | 38874511 |
Filed Date | 2012-08-06 |
United States Patent Application | 20120303287 |
Kind Code | A1 |
Apostolico; Alberto; et al. | November 29, 2012 |
METHODS AND SYSTEMS FOR CONSERVATIVE EXTRACTION OF OVER-REPRESENTED EXTENSIBLE MOTIFS
Abstract
Methods and systems of extracting extensible motifs from a
sequence include assigning a significance to extensible motifs
within the sequence based upon a syntactic and statistical
analysis, and identifying extensible motifs having a significance
that exceeds a predetermined threshold.
Inventors: | Apostolico; Alberto (Atlanta, GA); Comin; Matteo (Venice, IT); Parida; Laxmi Priya (Mohegan Lake, NY) |
Assignee: | International Business Machines Corporation, Armonk, NY |
Family ID: | 38874511 |
Appl. No.: | 13/567535 |
Filed: | August 6, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12882633 | Sep 15, 2010 | 8290716 |
13567535 | | |
11471552 | Jun 21, 2006 | 7865313 |
12882633 | | |
Current U.S. Class: | 702/20 |
Current CPC Class: | G16B 15/00 20190201; G16B 30/00 20190201 |
Class at Publication: | 702/20 |
International Class: | G06F 19/24 20110101 G06F019/24 |
Claims
1. A method of extracting an extensible motif from a sequence, said
method comprising: assigning a significance to an extensible motif
within said sequence based upon a syntactic and statistical
analysis; and identifying an extensible motif having a significance
that exceeds a predetermined threshold.
2. The method of claim 1, wherein said assigning said significance
comprises: scanning the sequence to identify patterns that have two
solid characters; storing start and end positions of the identified
patterns; and combining all cells with a dot character and the same
start and end solid characters to construct extensible cells.
3. The method of claim 1, wherein said assigning said significance
comprises pruning by occurrences.
4. The method of claim 1, wherein said assigning said significance
comprises pruning by composition.
5. A system for extracting an extensible motif from a sequence,
said system comprising: means for assigning a significance to an
extensible motif within said sequence based upon a syntactic and
statistical analysis; and means for identifying an extensible motif
having a significance that exceeds a predetermined threshold.
6. The system of claim 5, wherein said means for assigning said
significance comprises: means for scanning the sequence to identify
patterns that have two solid characters; means for storing start
and end positions of the identified patterns; and means for
combining all cells with a dot character and a same start and end
solid characters to construct extensible cells.
7. The system of claim 5, wherein said means for assigning said
significance comprises pruning by occurrences.
8. The system of claim 5, wherein said means for assigning said
significance comprises pruning by composition.
9. A system for extracting an extensible motif from a sequence,
said system comprising: a unit that assigns a significance,
comprising a z-score, to an extensible motif within said sequence
based upon a syntactic and statistical analysis; and a unit that
identifies an extensible motif having a significance that exceeds a
predetermined threshold, wherein said unit that assigns the
significance comprises a device that restricts a computation of the
z-score for each of said motifs to classes of maximal motifs.
10. The system of claim 9, wherein the z-score comprises an
extensible z-score.
Description
[0001] The present application is a Continuation application of
U.S. patent application Ser. No. 12/882,633, filed on Sep. 15,
2010, which is a Divisional Application of U.S. patent application
Ser. No. 11/471,552, filed on Jun. 21, 2006, now U.S. Pat. No.
7,865,313, the entire contents of which are incorporated herein by
reference.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The present invention generally relates to a method and a
system for extraction of extensible motifs. More particularly, the
present invention relates to a method and system for extraction of
extensible motifs using combinatorial and statistical pruning.
[0004] 2. Description of the Related Art
[0005] The discovery of a motif in a biosequence is frequently torn
between the rigidity of the model on the one hand and the abundance
of candidates on the other. In particular, the variety of motifs
described by strings that include "don't care" patterns escalates
exponentially with the length of the motif, and this only gets
worse if a "don't care" is allowed to stretch up to some prescribed
maximum length. This circumstance tends to generate daunting
computational burdens, and often gives rise to tables that are
impossible to visualize and digest. This is unfortunate, as it
seems to preclude precisely those massive analyses that have become
conceivable with the increasing availability of massive amounts of
genomic and protein data. While part of the problem is endemic,
another part of it seems rooted in the various characterizations
offered for the notion of a motif, which are typically based either
on syntax or on statistics alone.
[0006] The discovery of motifs in bio-sequences is attracting
increasing interest due to the perceived multiple implications of
motifs in biological structure and function. The approaches to
motif discovery may be partitioned into two main classes. In the
first class, the sample string is tested for occurrences of motifs
in a family of a priori defined, abstract models or templates. The
second class of approaches assumes that the search may be limited
to substrings in the sample or to some more or less controlled
neighborhood of those substrings. The approaches in the first class
are more rigorously justifiable, but often pose daunting
computational burdens. Those in the second class tend to be
computationally viable but rest on more shaky methodological
grounds.
[0007] The characterizations offered for the notion of a motif
could be partitioned roughly into statistical and syntactic. In a
typical statistical characterization, a motif is a sequence of m
positions such that at each position each character from (some
subset of) the alphabet may occur with a given probability or
weight. This is often described by a suitable matrix or profile,
where columns correspond to positions and rows to alphabet
characters. The lineage of syntactic characterizations could be
ascribed to the theory of error correcting codes: a motif is a
pattern w of length m, and an occurrence of it is any string at a
distance of d, the distance being measured in terms of errors of a
certain type. For example, we can have only substitutions in a
Hamming variant, substitutions and indels in a Levenshtein variant,
and so on. Syntactic characterizations enable us to describe the
model of a motif, or a realization of it, or both, as a string or
simple regular expression over an extension of the input alphabet
$\Sigma$, e.g., over $\Sigma \cup \{.\}$, where "." denotes the
"don't care" character.
[0008] Irrespective of the particular model or representation
chosen, the tenet of motif discovery equates over-representation of
a motif with surprise and hence with interest. Thus, any motif
discovery algorithm must ultimately weigh motifs against some
threshold, based on a score that compares empirical and expected
frequency, perhaps with some normalization. The departure of a
pattern w from expectation is commonly measured by so-called
z-scores, which have the form:
$$z(w) = \frac{f(w) - E(w)}{N(w)} \tag{1}$$
[0009] where:
[0010] f(w)>0 represents a frequency;
[0011] E(w)>0 represents an expectation; and
[0012] N(w)>0 is the expected value of some function of w.
[0013] For a given z-score function, set of patterns W, and real
positive threshold T, patterns such that z(w)>T or z(w)<-T
are respectively dubbed over- or under-represented, or simply
surprising. The problem is that the number of patterns extracted in
this way may escalate quite rapidly, a circumstance that seems to
preclude precisely those massive analyses that have become
conceivable with the increasing availability of whole genomes.
Large-scale statistical tables may not only impose an unbearable
computational burden. They are also impractical to visualize and
use, a circumstance that may defy the purpose of building them in
the first place.
[0014] A little reflection establishes how an exponential build-up
may take place. Assume that on the binary alphabet both aabaab and
abbabb are asserted as reflections of candidate interesting motifs.
A concise description of both is the motif a.ba.b, with "." denoting
the don't care, and one would then look for further occurrences of
this motif. By this, however, the spurious patterns aababb and
abbaab are also annexed.
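The build-up described in this paragraph can be reproduced in a few lines. The following is only an illustrative sketch: the helper name expand_dont_cares is hypothetical, and the pattern is treated as a rigid string in which "." matches any single character of the binary alphabet.

```python
from itertools import product

def expand_dont_cares(pattern, alphabet):
    """Return every solid string matched by a rigid pattern with '.' wildcards."""
    # Each '.' contributes the whole alphabet, each solid character only itself.
    slots = [alphabet if ch == "." else ch for ch in pattern]
    return ["".join(chars) for chars in product(*slots)]

matches = expand_dont_cares("a.ba.b", "ab")
observed = {"aabaab", "abbabb"}          # the two strings asserted in the text
spurious = sorted(set(matches) - observed)
print(spurious)  # ['aababb', 'abbaab']
```

Two don't cares already double the number of annexed strings; every further don't care multiplies the count by the alphabet size.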
[0015] A similar problem presents itself in the approaches that
resort to the profiles or the weighted matrices previously
mentioned. Even setting aside computational aspects, tables that
are too large at the outset run the risk of saturating the visual
bandwidth of a user. In this spirit, approaches that limit the
number of patterns to be considered from the start may provide a
more significant throughput, even in comparison with exhaustive
methods.
SUMMARY OF THE INVENTION
[0016] In view of the foregoing and other exemplary problems,
drawbacks, and disadvantages of the conventional methods and
structures, an exemplary feature of the present invention is to
provide methods and structures in which the significance of
extensible motifs is identified by a combination of syntactic and
statistical analysis.
[0017] In a first exemplary aspect of the present invention, a
method of extracting extensible motifs from a sequence includes
assigning a significance to extensible motifs within the sequence
based upon a syntactic and statistical analysis, and identifying
extensible motifs having a significance that exceeds a
predetermined threshold.
[0018] In a second exemplary aspect of the present invention, a
system for extracting extensible motifs from a sequence includes
means for assigning a significance to extensible motifs within the
sequence based upon a syntactic and statistical analysis, and means
for identifying extensible motifs having a significance that
exceeds a predetermined threshold.
[0019] In a third exemplary aspect of the present invention, a
program is embodied in a computer readable medium executable by a
digital processing unit. The program includes instructions for
assigning a significance to extensible motifs within the sequence
based upon a syntactic and statistical analysis, and instructions
for identifying extensible motifs having a significance that
exceeds a predetermined threshold.
[0020] The inventors regard the motif discovery process as
distributed into two stages, where the first stage unearths motifs
endowed with a certain set of properties and the second filters out
the interesting ones. Since the redundancy builds up in the first
stage, it is there that the inventors decided to look for possible
ways of reducing the unnecessary throughput. Since
over-representation is measured by a score, it is desirable to find
ways to neglect candidate motifs that cannot possibly make it to
the top list, and ideally spot such motifs before they are even
computed. Counterintuitive as it might look, the inventors
discovered that such a possibility may be offered by certain
attributes of "saturation" that combine in a unique way the
syntactic structure and the list of occurrences or frequency for a
motif.
[0021] With solid words, for example, it is known that in the worst
case the number of distinct substrings in a string can be quadratic
in the length of that string. Yet, if the substrings are
partitioned into buckets by putting in the same bucket strings that
have exactly the same set of occurrences, then the number of
buckets needed is only linear in the length of the textstring.
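The bucketing idea can be observed on a small string. The sketch below is a brute-force illustration only (it enumerates every substring explicitly, which is exactly what a practical implementation would avoid); the function name occurrence_buckets and the use of 0-based start positions are assumptions made for the example.

```python
from collections import defaultdict

def occurrence_buckets(text):
    """Group the distinct substrings of text by their set of start positions."""
    starts = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[text[i:j]].add(i)     # substring text[i:j] starts at i
    buckets = defaultdict(list)
    for word, positions in starts.items():
        buckets[frozenset(positions)].append(word)
    return starts, buckets

starts, buckets = occurrence_buckets("abab")
print(len(starts), len(buckets))  # 7 4
```

For "abab" the seven distinct substrings fall into four buckets; for instance, "a" and "ab" share the start positions {0, 2} and therefore share a bucket.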
[0022] Similar linear bounds may be established for special classes
of rigid motifs containing "don't cares". When combined with
intervals of score monotonicity, properties of this kind support
the global detection of unusual words of any length in overall
linear space. Some of these conservative scoring techniques were
extended recently to rigid motifs with a prescribed maximum number
of mismatches or don't cares.
[0023] An exemplary method and system in accordance with the
present invention combines a structure of a motif pattern, as
described by its syntactic specification, with a statistical
measure of its occurrence count.
[0024] An exemplary embodiment of the present invention
characterizes a pattern rigidly, and conjugates structure and set
of occurrences. This results in a definition of motif that lends
itself to a natural notion of maximality, thereby embodying
statistics and structure in one measure of surprise. This is unlike
all previous approaches that consider structure and statistics as
separate features of a pattern.
[0025] An exemplary embodiment of the present invention provides a
powerful syntactic mechanism for eliminating unimportant motifs
before their score is computed. As explained above, for the class
of over-represented motifs, the non-maximal motifs are not more
surprising than the maximal motifs.
[0026] In an exemplary embodiment of the present invention, a
combination of appropriate saturation conditions (expressed in
terms of minimum number of don't cares compatible with a given list
of occurrences) and the monotonicity of probabilistic scores over
regions of constant frequency provide significant parsimony in the
generation and testing of candidate over-represented motifs.
[0027] The advantages of exemplary embodiments of the present
invention are documented by experimental results obtained when
specifically targeting protein sequence families. In all cases
tested, the motif reported in "PROSITE" (a database of protein
families and domains that includes biologically significant sites,
patterns and profiles that help to reliably identify to which known
protein family, if any, a new sequence belongs) as most important in
terms of functional/structural relevance emerges among the top
thirty extensible motifs returned by an exemplary embodiment of the
present invention, often right at the top.
[0028] Of equal importance seems the fact that the sets of all
surprising motifs returned in each experiment are extracted faster
and come in much more manageable sizes using an exemplary
embodiment of the present invention than would be obtained in the
absence of saturation constraints.
[0029] An exemplary embodiment of the present invention provides a
characterization of extensible motifs in the definition of which
structural or syntactic properties and occurrence statistics are
solidly intertwined.
[0030] An exemplary embodiment of the present invention provides a
combination of saturation conditions (expressed in terms of minimum
number of don't cares compatible with a given list of occurrences)
and monotonicity of scores which provides significant parsimony in
the generation and testing of candidate over-represented
motifs.
[0031] An exemplary embodiment of the present invention isolates as
candidate surprising motifs only the members of a previously well
identified set of "maximally saturated" patterns. Because this set
is identifiable a priori, its members are known before any score is
computed. Neglecting the motifs outside the set of "maximally
saturated" patterns does not cause surprising motifs to be
overlooked. In fact, any such neglected motif: (i) is embedded in
one of the saturated motifs, and (ii) does not achieve a larger
score than the latter (hence, computing its score and publishing it
explicitly would take more time and space but not add information).
[0032] An exemplary embodiment of the present invention applies to
extensible patterns a philosophy that was previously applied only
to rigid motifs by solid words and by words of some specified fixed
length affected by a specified maximum number of errors. The
invention enables a transition from rigid to extensible motifs,
thereby providing methods and systems that extract and weigh
extensible motifs.
[0033] The inventors illustrate below the merits of exemplary
embodiments of the present invention on families of protein
sequences. In all cases tested, the motif reported in PROSITE as
most important in terms of functional/structural relevance emerges
either at the top or among the top ten or so of the output list
that is provided by an exemplary embodiment of the present
invention.
[0034] These and many other advantages may be achieved with the
present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] The foregoing and other exemplary purposes, aspects and
advantages will be better understood from the following detailed
description of an exemplary embodiment of the invention with
reference to the drawings, in which:
[0036] FIG. 1 illustrates an exemplary hardware/information
handling system 100 for incorporating the present invention
therein;
[0037] FIG. 2 illustrates a signal bearing medium 200 (e.g.,
storage medium) for storing steps of a program of a method
according to the present invention; and
[0038] FIG. 3 illustrates a flowchart of an exemplary method in
accordance with the present invention.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS OF THE INVENTION
[0039] Referring now to the drawings, and more particularly to
FIGS. 1-3, there are shown exemplary embodiments of the method and
structures of the present invention.
[0040] FIG. 1 illustrates a typical hardware configuration of an
information handling/computer system for use with the invention and
which preferably has at least one processor or central processing
unit (CPU) 111.
[0041] The CPUs 111 are interconnected via a system bus 112 to a
random access memory (RAM) 114, read-only memory (ROM) 116,
input/output (I/O) adapter 118 (for connecting peripheral devices
such as disk units 121 and tape drives 140 to the bus 112), user
interface adapter 122 (for connecting a keyboard 124, mouse 126,
speaker 128, microphone 132, and/or other user interface device to
the bus 112), a communication adapter 134 for connecting an
information handling system to a data processing network, the
Internet, an Intranet, a personal area network (PAN), etc., and a
display adapter 136 for connecting the bus 112 to a display device
138 and/or printer.
[0042] In addition to the hardware/software environment described
above, a different aspect of the invention includes a
computer-implemented method for performing the methods described
herein. As an example, an exemplary method in accordance with the
present invention may be implemented in the particular environment
discussed above.
[0043] Such a method may be implemented, for example, by operating
a computer, as embodied by a digital data processing apparatus, to
execute a sequence of machine-readable instructions. These
instructions may reside in various types of signal-bearing
media.
[0044] These signal-bearing media may include, for example, a RAM
contained within the CPU 111, as represented by the fast-access
storage. Alternatively, the instructions may be
contained in another signal-bearing medium, such as a magnetic data
storage diskette 200 (FIG. 2), directly or indirectly accessible by
the CPU 111.
[0045] Whether contained in the diskette 200, the computer/CPU 111,
or elsewhere, the instructions may be stored on a variety of
machine-readable data storage media, such as DASD storage (e.g., a
conventional "hard drive" or a RAID array), magnetic tape,
electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an
optical storage device (e.g., CD-ROM, WORM, DVD, digital optical
tape, etc.), paper "punch" cards, or other suitable signal-bearing
media including transmission media such as digital and analog
communication links and wireless links. In an illustrative embodiment of
the invention, the machine-readable instructions may comprise
software object code, compiled from a language such as "C",
etc.
[0046] To proceed with a formal definition of the concepts
highlighted above, let s be a sequence of sets of characters from
an alphabet $\Sigma \cup \{.\}$, where "." $\notin \Sigma$ denotes a
don't-care (dot, for short) and the rest are solid characters. The
inventors use $\sigma$ to denote a singleton character or a subset
of $\Sigma$. For character (sets) $e_1$ and $e_2$, the inventors
write $e_1 \preceq e_2$ if and only if $e_1$ is a dot or
$e_1 \subseteq e_2$. Allowing for spacers in a string is what makes
it extensible. Such spacers are indicated by annotating the dot
characters. Specifically, an annotated "." character is written as
$.^{\alpha}$ where $\alpha$ is a set of positive integers
$\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ or an interval
$\alpha = [\alpha_l, \alpha_u]$, representing all integers between
$\alpha_l$ and $\alpha_u$ inclusive. Whenever defined, d will denote
the maximum number of consecutive dots allowed in a string. In such
cases, for clarity of notation, the inventors use the extensible
wild card denoted by the dash symbol "-" instead of the annotated
dot character $.^{[1,d]}$ in the string. Note that "-"
$\notin \Sigma$. Thus a string of the form $a.^{[1,d]}b$ will be
simply written as a-b.
[0047] A motif m is extensible if it contains at least one
annotated dot, otherwise m is rigid. Given an extensible string m,
a rigid string m' is a realization of m if each annotated dot
$.^{\alpha}$ is replaced by $l \in \alpha$ dots. The
collection of all such rigid realizations of m is denoted by R(m).
A rigid string m occurs at position l on s if
$m[j] \preceq s[l+j-1]$ (that is, m[j] is a dot or
$m[j] \subseteq s[l+j-1]$) holds for $1 \le j \le |m|$. An
extensible string m occurs at position l in s if there exists a
realization m' of m that occurs at l. Note that an extensible string
m could possibly occur multiple times at a location on a sequence s.
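The definitions of realization and occurrence can be exercised with a short sketch. It makes several simplifying assumptions that are not in the text: motifs are written as plain strings in which "-" stands for the annotated dot with interval [1, d], every position holds a single character rather than a character set, and positions are 0-based. The function names are hypothetical.

```python
from itertools import product

def realizations(motif, d):
    """Expand each '-' (the annotated dot with interval [1, d]) into 1..d dots."""
    parts = motif.split("-")
    out = []
    for lengths in product(range(1, d + 1), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for length, part in zip(lengths, parts[1:]):
            pieces.append("." * length + part)
        out.append("".join(pieces))
    return out

def occurs_at(rigid, s, pos):
    """True if the rigid string matches s starting at 0-based index pos."""
    if pos + len(rigid) > len(s):
        return False
    return all(c == "." or c == s[pos + j] for j, c in enumerate(rigid))

def occurrences(motif, s, d):
    """Map each 0-based start position to the realizations occurring there."""
    found = {}
    for rigid in realizations(motif, d):
        for pos in range(len(s)):
            if occurs_at(rigid, s, pos):
                found.setdefault(pos, []).append(rigid)
    return found

print(realizations("a-b", 3))          # ['a.b', 'a..b', 'a...b']
print(occurrences("a-b", "axbbb", 3))  # {0: ['a.b', 'a..b', 'a...b']}
```

The second call shows the situation noted at the end of the paragraph: on the input "axbbb" the extensible string a-b occurs three times at the same location, once per realization.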
[0048] For a sequence s and positive integer k, $k \le |s|$, a
string (extensible or rigid) m is a motif of s with |m|>1 and
location list $L_m = (l_1, l_2, \ldots, l_p)$, if both
m[1] and m[|m|] are solid and $L_m$ is the list of all and
only the occurrences of m in s. Given a motif m, let $m[j_1],
m[j_2], \ldots, m[j_l]$ be the l solid elements in the motif
m. Then the sub-motifs of m are given as follows: for every
$j_i$, $j_t$, the sub-motif $m[j_i \ldots j_t]$ is
obtained by dropping all the elements before (to the left of)
$j_i$ and all elements after (to the right of) $j_t$ in m. The
inventors also note that m is a condensation for any of its
sub-motifs. The inventors are interested in motifs for which any
condensation would disrupt the list of occurrences. Formally, let
$m_1, m_2, \ldots, m_k$ be the motifs in a string s. A
motif $m_i$ is maximal in length if there exists no $m_l$,
$l \neq i$, with $|L_{m_i}| = |L_{m_l}|$ and $m_i$ a
sub-motif of $m_l$. A motif $m_i$ is maximal in composition if
no dot character of $m_i$ can be replaced by a solid character
that appears in all the locations in $L_{m_i}$. A motif $m_i$ is
maximal in extension if no annotated dot character of $m_i$ can
be replaced by a fixed length substring (without annotated dot
characters) that appears in all the locations in $L_{m_i}$. A
maximal motif is maximal in composition, in extension and in length.
[0049] Expectations and Scores
[0050] We begin by deriving some simple expressions for the
probability $p_m$ of an extensible motif m under stationary, i.i.d.
assumptions. Let m be an extensible motif generated by a
stationary, i.i.d. source which emits $\sigma \in \Sigma$
with probability $p_\sigma$. Consider the set R(m) of all
possible realizations of m. Each realization is a string over
$\Sigma \cup \{.\}$. For a specific realization $\bar{m}$, its
probability $p_{\bar{m}}$ is given by:
$$p_{\bar{m}} = \prod_{\sigma \in \Sigma} (p_\sigma)^{j_\sigma}, \tag{2}$$
[0051] where:
[0052] $j_\sigma$ is the number of times $\sigma$ appears in
$\bar{m}$.
[0053] Thus, the dot has implicitly probability 1.
[0054] An extensible motif is degenerate if it can possibly have
multiple occurrences at a site i on the input s.
[0055] Lemma 1 Let m be an extensible non-degenerate motif
generated by a stationary, i.i.d. source which emits
$\sigma \in \Sigma$ with probability $p_\sigma$. Let $j_\sigma$ be
the number of times $\sigma$ appears in m and let e be the number of
annotated dots in m with annotations $\alpha_1, \alpha_2,
\ldots, \alpha_e$. Then
$$p_m = \prod_{\sigma \in \Sigma} (p_\sigma)^{j_\sigma} \prod_{i=1}^{e} |\alpha_i| \tag{3}$$
[0056] Proof. Since the motif is non-degenerate, by the definition
of realization of a motif,
$$p_m = \sum_{\bar{m} \in R(m)} p_{\bar{m}} \tag{4}$$
[0057] Hence we need to compute $p_{\bar{m}}$ where $\bar{m}$ is a
rigid motif. Assume $\bar{m}$ is a rigid motif with no dot
characters. By the i.i.d. assumption:
$$p_{\bar{m}} = \prod_{\sigma \in \Sigma} (p_\sigma)^{j_\sigma} \tag{5}$$
[0058] Next, consider $\bar{m}$ to be a rigid motif with possibly
some dot characters. Again, clearly:
$$p_{\bar{m}} = \prod_{\sigma \in \Sigma} (p_\sigma)^{j_\sigma} \tag{6}$$
[0059] In other words, only the solid characters contribute
non-trivially to the computation of $p_{\bar{m}}$. Hence, if m is
not rigid:
$$p_m = |R(m)| \prod_{\sigma \in \Sigma} (p_\sigma)^{j_\sigma} \tag{7}$$
[0060] But
$$|R(m)| = \prod_{i=1}^{e} |\alpha_i| \tag{8}$$
[0061] hence the result.
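As a numeric illustration of Lemma 1, the closed form (solid-character product times the number of realizations) can be compared with an explicit sum over realizations. This is a sketch under assumptions: a motif is represented, hypothetically, as a list of solid blocks plus one tuple of allowed gap lengths per annotated dot, and the character probabilities are invented for the example. The code checks only the arithmetic identity; the non-degeneracy hypothesis, which justifies summing over realizations in the first place, is not verified.

```python
from itertools import product
from math import prod, isclose

def rigid_probability(rigid, probs):
    """Product of p_sigma over solid characters; dots contribute a factor of 1."""
    return prod(probs[c] for c in rigid if c != ".")

def lemma1_probability(solids, gap_sets, probs):
    """Solid-character product times the product of the annotation sizes."""
    solid_part = prod(probs[c] for c in "".join(solids))
    return solid_part * prod(len(alpha) for alpha in gap_sets)

def sum_over_realizations(solids, gap_sets, probs):
    """Sum the rigid probabilities of all realizations of the motif."""
    total = 0.0
    for lengths in product(*gap_sets):
        rigid = solids[0]
        for length, block in zip(lengths, solids[1:]):
            rigid += "." * length + block
        total += rigid_probability(rigid, probs)
    return total

probs = {"a": 0.3, "c": 0.2, "g": 0.5}   # hypothetical i.i.d. source
solids, gap_sets = ["ac", "g"], [(2, 3)]  # the motif ac.^{2,3}g
closed_form = lemma1_probability(solids, gap_sets, probs)
summed = sum_over_realizations(solids, gap_sets, probs)
print(isclose(closed_form, summed))  # True
```

Here both routes give 0.3 x 0.2 x 0.5 x 2 = 0.06, since each of the two realizations has the same solid characters.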
[0062] Corollary 2 If m is a non-degenerate extensible motif where
each m[i] is a set of (homologous) characters, then
$$p_m = \prod_{m[i] \neq \text{'.'},\, \text{'-'}} \Bigl( \sum_{\sigma \in m[i]} p_\sigma \Bigr) \prod_{i=1}^{e} |\alpha_i| \tag{9}$$
[0063] Let $M^s$ denote a set of strings that has only the solid
characters of at least s occurrences of m. For example, consider
the motif a-b with realizations a.b, a..b and a...b. Then:
$$M^1 = \{\text{a.b},\ \text{a..b},\ \text{a...b}\} \tag{10}$$
[0064] since m occurs once on each $m' \in M^1$;
$$M^2 = \{\text{a.bb},\ \text{a..bb},\ \text{a.b.b}\} \tag{11}$$
[0065] since m occurs twice on each $m' \in M^2$; and
$$M^3 = \{\text{a.bbb}\} \tag{12}$$
[0066] since m occurs three times on $m' \in M^3$.
[0067] Corollary 3 Let m be a degenerate (possibly with multiple
occurrences at a site) extensible motif, and let:
$$p_{m_k} = \sum_{m' \in M^{k}} p_{m'} \tag{13}$$
[0068] then
$$p_m = \sum_{k=0}^{r-1} (-1)^k \, p_{m_{k+1}} \tag{14}$$
[0069] This follows directly from the inclusion-exclusion
principle.
[0070] Notice that for a degenerate motif, Equation (3) is the
zero-th order approximation of Equation (14). The first order
approximation is:
p.sub.m.apprxeq.p.sub.m.sub.1-p.sub.m.sub.2 (15)
[0071] and the second order approximation is
p.sub.m.apprxeq.p.sub.m.sub.1-p.sub.m.sub.2+p.sub.m.sub.3 (16)
[0072] and so on. By Bonferroni's inequalities, a k-th order
approximation of $p_m$ is an over-estimate of $p_m$ if k is
even and an under-estimate if k is odd.
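The inclusion-exclusion expansion can be checked on the running example a-b with d = 3, using the sets M^1, M^2 and M^3 listed in the text. The sketch below assumes that the k-th term of the expansion is the summed probability of the strings in M^k, and it uses an invented two-letter i.i.d. source; the helper names are hypothetical. The exact value is obtained by brute force over all strings of the relevant length.

```python
from itertools import product
from math import isclose

def rigid_probability(rigid, probs):
    """Probability of a rigid string; dots contribute a factor of 1."""
    p = 1.0
    for c in rigid:
        if c != ".":
            p *= probs[c]
    return p

def site_probability(realizations, probs):
    """Brute force: total probability of strings matched by some realization."""
    width = max(len(r) for r in realizations)
    total = 0.0
    for chars in product(probs, repeat=width):
        if any(all(c == "." or c == chars[j] for j, c in enumerate(r))
               for r in realizations):
            p = 1.0
            for ch in chars:
                p *= probs[ch]
            total += p
    return total

probs = {"a": 0.5, "b": 0.5}             # hypothetical source
M1 = ["a.b", "a..b", "a...b"]
M2 = ["a.bb", "a..bb", "a.b.b"]
M3 = ["a.bbb"]
p1, p2, p3 = (sum(rigid_probability(r, probs) for r in M) for M in (M1, M2, M3))
exact = site_probability(M1, probs)
print(p1, p1 - p2, p1 - p2 + p3, exact)  # 0.75 0.375 0.4375 0.4375
```

With all three terms the alternating sum reproduces the brute-force value. Truncating after one term (0.75) over-estimates it and truncating after two terms (0.375) under-estimates it, as Bonferroni's inequalities require.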
[0073] Next, the form of $p_m$ for a non-degenerate motif is
obtained when the input is assumed to be generated by a Markov
chain. For the derivation below, we assume the Markov chain has
order 1. For further discussion, we introduce the following
definition.
[0074] Definition 4 (cell $\langle \sigma_1, \sigma_2, l \rangle$,
C(m)) A substring $\hat{m}$ of m is a cell if it begins
and ends in solid characters with only non-solid intervening
characters. Here $\sigma_1$ is the solid character at the start
position, $\sigma_2$ is the solid character at the end position,
and l is the number of intervening un-annotated dot
characters. If the intervening character is the extensible
character, then l takes a value of -1. For convenience, the cell is
represented by the triplet $\langle \sigma_1, \sigma_2, l \rangle$.
C(m) is the collection of all such cells of m.
[0075] For example,
$$C(\text{ab..c.d-g}) = \{\langle a,b,0 \rangle, \langle b,c,2 \rangle, \langle c,d,1 \rangle, \langle d,g,-1 \rangle\} \tag{17}$$
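The cell decomposition of Definition 4 is mechanical and can be sketched as a single left-to-right scan. The function name cells and the string encoding (single characters, "." for an un-annotated dot, "-" for the extensible character) are assumptions made for the illustration; the cells are returned as an ordered list of triples rather than as a set.

```python
def cells(motif):
    """Return the cells (sigma1, sigma2, l) of a motif written with '.' and '-'."""
    out = []
    prev = None          # last solid character seen
    dots = 0             # un-annotated dots since prev
    extensible = False   # whether a '-' was seen since prev
    for ch in motif:
        if ch == ".":
            dots += 1
        elif ch == "-":
            extensible = True
        else:
            if prev is not None:
                # l is -1 for an extensible gap, else the number of plain dots.
                out.append((prev, ch, -1 if extensible else dots))
            prev, dots, extensible = ch, 0, False
    return out

print(cells("ab..c.d-g"))
# [('a', 'b', 0), ('b', 'c', 2), ('c', 'd', 1), ('d', 'g', -1)]
```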
[0076] Let $p^{(k)}_{\sigma_1,\sigma_2}$ denote
the probability of moving from $\sigma_1$ to $\sigma_2$ in k
steps. Let s be a stationary, irreducible, aperiodic Markov chain
of order 1 with state space $\Sigma$ ($|\Sigma| \le \infty$).
Further, $\pi_\sigma$ is the equilibrium probability of
$\sigma \in \Sigma$ and the ($|\Sigma| \times |\Sigma|$)
transition probability matrix P[i, j] is defined as
$p^{(1)}_{\sigma_i,\sigma_j}$. For a rigid motif
$\bar{m}$, each cell $\langle \sigma_1, \sigma_2, l \rangle \in
C(\bar{m})$ is such that $l \ge 0$. It is easy to see that when
$l \ge 0$, the cell represents the (l+1)-step transition
probability given by $P^{l+1}$, i.e.:
$$p_{\sigma_1 (.)^{l} \sigma_2} = P^{l+1}[\sigma_1, \sigma_2]. \tag{18}$$
[0077] Thus for a rigid motif m,
$$p_{\bar{m}} = \pi_{\bar{m}[1]} \prod_{\langle \sigma_1, \sigma_2, l \rangle \in C(\bar{m})} P^{l+1}[\sigma_1, \sigma_2]. \tag{19}$$
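A small numeric sketch may help fix the Markov-chain formula for a rigid motif. It follows the statement in the text that a cell with l intervening dots corresponds to an (l+1)-step transition, so adjacent solid characters use one step. The two-state chain, its equilibrium vector and all function names are invented for the example, and the equilibrium vector is supplied by hand rather than computed.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, steps):
    """Return the steps-fold product of p with itself (identity for 0)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result

def rigid_markov_probability(rigid, states, pi, P):
    """pi of the first solid times one (l+1)-step transition per cell."""
    idx = {s: i for i, s in enumerate(states)}
    prob = pi[idx[rigid[0]]]
    prev, dots = rigid[0], 0
    for ch in rigid[1:]:
        if ch == ".":
            dots += 1
        else:
            prob *= mat_pow(P, dots + 1)[idx[prev]][idx[ch]]
            prev, dots = ch, 0
    return prob

states = ["a", "b"]
P = [[0.9, 0.1], [0.5, 0.5]]   # hypothetical one-step transition matrix
pi = [5 / 6, 1 / 6]            # its equilibrium distribution (pi P = pi)
p = rigid_markov_probability("a.b", states, pi, P)
```

For the motif a.b the single cell has l = 1, so the value is pi_a times the two-step probability of going from a to b, which here is (5/6) x 0.14.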
[0078] From now on, let u and v be two motifs such that v is a
condensation of u, and consider an arbitrary sequence of
consecutive unit expansions--consisting each of inserting a
character or character set at some position, or replacing a dot
character with a solid character or character set--that transforms
u into v. A score z is monotonic for u and v if the value of z is
always either increasing or decreasing over any such expansion. The
key observation here is that, under most probabilistic settings,
the probability of a condensation v of u obeys
p.sub.v.ltoreq.p.sub.u. This is almost immediate under iid
distribution, as the following claim shows.
[0079] Theorem 5 Let v and u be possibly degenerate extensible
motifs under the i.i.d. model and let v be a condensation of u.
Then, there is a real number $\hat{p} \le 1$ such that:
$$p_v = p_u \hat{p} \tag{20}$$
[0080] Proof: It is enough to consider the case of a unit
condensation, i.e., where v has one more solid character than u.
The claim holds trivially when the extra character is introduced as
a prefix, an infix, or a suffix of u. In fact, in any such case the
probability of the extra character multiplies each term of Equation
(6), whence the whole probability as well.
[0081] Consider next the case where the solid character in v
substitutes a don't care of u. We begin by describing an alternate
way to compute p.sub.u. With l denoting the length of a longest
string in R(u), compute the set of all strings over .SIGMA..sup.l
and store them consecutively row-wise in a table. Compute, for each
row, the probability of the string in that row, which is the
product of the probabilities of the individual characters (the sum
of all row probabilities is 1). Consider now the realizations in
R(u) in succession. Check each realization against every row of the
table; wherever the two match, mark the row if it had not been
already marked. Let R be the set of rows that are marked at the
outset. Clearly, adding up the probabilities of the rows in R
yields p.sub.u. Consider now the set of rows that would be
similarly involved in the computation of p.sub.v. This must be a
subset of R, whence p.sub.v.ltoreq.p.sub.u.
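The table-based argument in this proof can be mimicked directly for a tiny alphabet. The sketch below enumerates every string of length l, marks the rows matched by at least one realization, and adds up the marked probabilities. The alphabet, its probabilities, the function names and the particular pair of motifs (u with realizations a.b and a..b, and a hypothetical v in which the first don't care is replaced by the solid character c) are all assumptions chosen for the illustration.

```python
from itertools import product

def matches(rigid, row):
    """True if the rigid realization matches the row from its first column."""
    return all(c == "." or c == row[j] for j, c in enumerate(rigid))

def table_probability(realizations, probs):
    """Sum the probabilities of the rows marked by any realization."""
    width = max(len(r) for r in realizations)
    marked_total = 0.0
    for row in product(probs, repeat=width):
        if any(matches(r, row) for r in realizations):
            row_prob = 1.0
            for ch in row:
                row_prob *= probs[ch]
            marked_total += row_prob
    return marked_total

probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # hypothetical i.i.d. source
p_u = table_probability(["a.b", "a..b"], probs)
p_v = table_probability(["acb", "ac.b"], probs)
print(p_v <= p_u)  # True
```

Every row marked for v is also marked for u, so the marked mass can only shrink. In this instance p_u is 0.255 and p_v is 0.051, a ratio of 0.2, which is the probability of the inserted character c.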
[0082] With Markov processes, the underlying intuition is that if
we split a transition of length l into two consecutive segments
then we have:
$$P^{l}[\sigma_1,\sigma_2]=\sum_{\sigma_k\in\Sigma}P^{l_1}[\sigma_1,\sigma_k]\times P^{l_2}[\sigma_k,\sigma_2] \tag{21}$$
where:
$$l=l_1+l_2. \tag{22}$$
Since all:
$$P^{l}[\sigma_i,\sigma_j]\geq 0 \tag{23}$$
[0083] then any specific character (or alphabet subset) acting as a
bottleneck selects only some of the nonnegative terms of the sum, which yields:
$$P^{l}[\sigma_1,\sigma_2]\geq P^{l_1}[\sigma_1,\sigma_k]\times P^{l_2}[\sigma_k,\sigma_2] \tag{24}$$
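Because every term of the sum in Equation (21) is nonnegative, the l-step transition probability is at least as large as the contribution routed through any single intermediate character. The sketch below checks this numerically for a hypothetical two-state transition matrix `P`; the helper names are illustrative only and are not part of the described embodiment.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, steps):
    """The steps-step transition matrix P^steps."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result

def bottleneck_holds(p, l1, l2, tol=1e-12):
    """Check P^(l1+l2)[i][j] >= P^l1[i][k] * P^l2[k][j] for all i, j, k."""
    n = len(p)
    p_l, p_l1, p_l2 = mat_pow(p, l1 + l2), mat_pow(p, l1), mat_pow(p, l2)
    return all(p_l[i][j] + tol >= p_l1[i][k] * p_l2[k][j]
               for i in range(n) for j in range(n) for k in range(n))

# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]
```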
[0084] Theorem 6 If:
$$f(u)=f(v)>0 \tag{25}$$
$$N(v)<N(u), \tag{26}$$
and
$$E(v)/N(v)\leq E(u)/N(u), \tag{27}$$
[0085] then
$$\frac{f(v)-E(v)}{N(v)}>\frac{f(u)-E(u)}{N(u)} \tag{28}$$
[0086] Proof. Multiplying both sides of Equation (28) by N(v)/E(v) and
using the assumption:
$$f(v)=f(u)>0 \tag{29}$$
[0087] we get, after rearrangement:
$$\frac{f(u)}{E(v)}\left(1-\frac{N(v)}{N(u)}\right)>1-\frac{E(u)\,N(v)}{E(v)\,N(u)} \tag{30}$$
[0088] Since:
$$0<N(v)/N(u)<1 \tag{31}$$
[0089] then the left hand side is always positive. By Equation (27),
the right hand side is always negative or zero.
[0090] When N(u) is the square root of the variance, the z-score
takes the form:
$$z(u)=\frac{f(u)-E(u)}{\sqrt{\mathrm{Var}(u)}} \tag{32}$$
[0091] In the Bernoulli model, for instance, this square root is
$\sqrt{np_u(1-p_u)}$. Let $p_m$ be the
probability of the motif m occurring at any location i on the input
string s with n=|s| and let $k_m$ be the observed number of times
it occurs on s. When it can be assumed that the occurrence of a
motif m at a site is an i.i.d. process, for large n and
$k_m \ll n$ we have:
$$\frac{k_m-np_m}{\sqrt{np_m(1-p_m)}}\rightarrow N(0,1) \tag{33}$$
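As a small illustration of the z-score under the Bernoulli model, the sketch below evaluates the normalized deviation of an observed count from its expectation. The function name and the numbers in the example are assumptions chosen for illustration, not values from the experiments.

```python
from math import sqrt

def bernoulli_z_score(observed, n, p):
    """z-score of an observed count when each of n sites matches with probability p."""
    expected = n * p                 # expected number of occurrences
    std_dev = sqrt(n * p * (1.0 - p))  # square root of the Bernoulli variance
    return (observed - expected) / std_dev

# Hypothetical example: 130 occurrences where 100 are expected.
z = bernoulli_z_score(observed=130, n=10_000, p=0.01)
```

Here the expectation is 100 and the standard deviation is about 9.95, so `z` is roughly 3.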
[0092] Theorem 7 Let u and v be motifs generated with respective
probabilities $p_u$ and:
$$p_v=p_u\,\hat{p} \tag{34}$$
[0093] according to an iid process. If f(u)=f(v) and $p_u<1/2$
then:
$$\frac{f(v)-E(v)}{\sqrt{E(v)(1-p_v)}}>\frac{f(u)-E(u)}{\sqrt{E(u)(1-p_u)}} \tag{35}$$
[0094] Proof. The functions:
$$N(u)=\sqrt{E(u)(1-p_u)} \tag{36}$$
[0095] and E(u)/N(u) satisfy the conditions of Theorem 6. First,
E(v)<E(u). Indeed, since:
$$\frac{|v|-|u|}{n-|u|+1}>0, \tag{37}$$
$$\frac{E(v)}{E(u)}=\frac{(n-|v|+1)\,p_v}{(n-|u|+1)\,p_u}=\left(1-\frac{|v|-|u|}{n-|u|+1}\right)\hat{p}<\hat{p}<1. \tag{38}$$
[0096] Next, we study the ratio:
$$\left(\frac{N(v)}{N(u)}\right)^{2}=\left(1-\frac{|v|-|u|}{n-|u|+1}\right)\frac{p_v(1-p_v)}{p_u(1-p_u)}<\frac{p_v(1-p_v)}{p_u(1-p_u)} \tag{39}$$
[0097] The concave product $p_u(1-p_u)$ reaches its maximum
for $p_u=1/2$ and is increasing below that value. Since we assume
$p_v<p_u<1/2$, the rightmost term
is smaller than one. The monotonicity of N(u) is satisfied.
[0098] Finally, we prove that also E(u)/N(u) is monotonic, i.e.,
that:
$$E(v)/N(v)\leq E(u)/N(u), \tag{40}$$
[0099] which is equivalent to:
$$\frac{E(v)}{E(u)}\cdot\frac{1-p_u}{1-p_v}\leq 1 \tag{41}$$
[0100] but E(v)/E(u)<1 as shown in Equation (38) and
$(1-p_u)/(1-p_v)<1$ since $p_u>p_v$.
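The statement of Theorem 7 can be checked on concrete numbers. The sketch below is only a numerical illustration under assumed values (string length, motif lengths, probabilities, and the shared occurrence count are all made up); it uses the expectation E = (n - |m| + 1)p that appears in Equation (38).

```python
from math import sqrt

def expected_count(n, length, p):
    """E(m) = (n - |m| + 1) * p for a motif of the given length."""
    return (n - length + 1) * p

def z_score(f, n, length, p):
    """Score (f - E) / sqrt(E * (1 - p)), the normalization used in Theorem 7."""
    e = expected_count(n, length, p)
    return (f - e) / sqrt(e * (1.0 - p))

# Hypothetical setting: u and its condensation v occur the same number of times.
n, f = 1000, 20
p_u, p_hat = 0.01, 0.25          # p_v = p_u * p_hat, with p_u < 1/2
z_u = z_score(f, n, 5, p_u)
z_v = z_score(f, n, 6, p_u * p_hat)
```

With these values the condensation v scores about 11.1 against about 3.2 for u, which is why it is enough to score the most saturated motif of each class.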
[0101] In conclusion, an exemplary embodiment of the present
invention may restrict the z-score computation to classes of
maximal motifs, i.e., only compute the z-score for the maximally
saturated motif among those in each class of motifs sharing the
same list of occurrences.
[0102] An exemplary embodiment of the present invention pairwise
iterates combinations of segments of maximal extensible motifs, and
prunes those pairings that are found to not be viable. The input
may be a string s of size n and two positive integers, K and D. The
extensibility parameter D is interpreted in the sense that up to D
(or 1 to D) number of dot characters between two consecutive solid
characters are allowed. The output is all maximal extensible (with
D spacers) patterns that occur at least K times in s.
[0103] Incidentally, an exemplary embodiment of the present
invention may extract rigid motifs as a special case. For this, it
suffices to interpret D as the maximum number of dot characters
between two consecutive solid characters.
[0104] An exemplary embodiment converts the input into a sequence
of possibly overlapping cells (see Definition 4). A maximal
extensible pattern is a sequence of cells.
[0105] Initialization Phase
[0106] The cell is the smallest extensible component of a maximal
pattern and the string can be viewed as a sequence of overlapping
cells. If no don't care characters are allowed in the motifs then
the cells are non-overlapping. An initialization phase in
accordance with an exemplary embodiment of the present invention
may:
[0107] 1) Construct patterns that have exactly two solid characters
separated by no more than D spaces or "." characters. This
may be done by scanning the string s from left to right.
[0108] Further, for each location this exemplary embodiment may
store start and end positions of the pattern. For example, if
s=abzdabyxd and K=2, D=2, then all the patterns generated at this
step are: ab, a.z, a..d, bz, b.d, b..a, zd, z.a, z..b, da, d.b,
d..y, a.y, a..x, by, b.x, b..d, yx, y.d, xd, each with its
occurrence list. Thus L.sub.ab={(1,2),(5,6)}, L.sub.a.z={(1,3)} and
so on.
[0109] 2) The extensible cells may be constructed by combining all
the cells with a dot character and the same start and end solid
characters. The location list is updated to reflect the start and
end position of each occurrence. Continuing the previous example,
b-d is generated at this step with L.sub.b-d={(2,4),(6,9)}. All
cells m with |L.sub.m|<K are discarded. In the example, the only
surviving cells are ab, b-d with L.sub.ab={(1,2), (5,6)} and
L.sub.b-d={(2,4), (6,9)}
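The two initialization steps can be sketched in a few lines. This is a simplified illustration rather than the embodiment itself: it assumes a single input string of single-character symbols, 1-based positions, and that an extensible cell x-y is formed from every group of dotted cells sharing the same end characters. The function name `build_cells` is hypothetical.

```python
from collections import defaultdict

def build_cells(s, K, D):
    """Return (all rigid two-character cells, cells surviving the quorum K)."""
    rigid = defaultdict(list)
    n = len(s)
    # Step 1: scan left to right, pairing each character with the next
    # characters that are at most D don't cares away.
    for i in range(n):
        for gap in range(D + 1):
            j = i + gap + 1
            if j >= n:
                break
            rigid[s[i] + "." * gap + s[j]].append((i + 1, j + 1))
    # Step 2: merge dotted cells with the same start and end characters.
    extensible = defaultdict(list)
    for pattern, occurrences in rigid.items():
        if "." in pattern:
            extensible[pattern[0] + "-" + pattern[-1]].extend(occurrences)
    cells = dict(rigid)
    for pattern, occurrences in extensible.items():
        cells[pattern] = sorted(occurrences)
    # Discard every cell that occurs fewer than K times.
    survivors = {p: occ for p, occ in cells.items() if len(occ) >= K}
    return dict(rigid), survivors

rigid_cells, surviving_cells = build_cells("abzdabyxd", K=2, D=2)
```

On the example string this produces the twenty rigid cells listed above and leaves only ab and b-d after the quorum filter.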
[0110] An exemplary embodiment of the present invention may also
have an iteration phase. Let B be the collection of cells. If
m=Extract(B), then m ∈ B and there does not exist
m' ∈ B such that m' precedes m, where m.sub.1 precedes m.sub.2 if one of
the following holds: (1) m.sub.1 has only solid characters and
m.sub.2 has at least one non-solid character, (2) m.sub.2 has the
"-" character and m.sub.1 does not, and (3) m.sub.1 and m.sub.2
have d.sub.1, d.sub.2>0 dot characters respectively and
d.sub.1<d.sub.2.
[0111] Further, m.sub.1 is .about.-compatible with m.sub.2 if the
last solid character of m.sub.1 is the same as the first solid
character of m.sub.2.
[0112] Further, if m.sub.1 is .about.-compatible with m.sub.2, then
m=m.sub.1.about.m.sub.2 is the concatenation of m.sub.1 and m.sub.2
with an overlap at the common end and start character and:
$$L'_{m}=\{((x,y),z)\mid((x,l),z)\in L'_{m_1},\ ((l,y),z)\in L'_{m_2}\}. \tag{42}$$
[0113] For example if m.sub.1=ab and m.sub.2=b.d then m.sub.1 is
.about.-compatible with m.sub.2 and m.sub.1.about.m.sub.2=ab.d.
However, m.sub.2 is not .about.-compatible with m.sub.1.
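A minimal sketch of the compatibility test and of the location-list join of Equation (42) follows. It assumes motifs are plain strings whose first and last characters are single solid characters, and it treats the third component z of each occurrence as an opaque tag that must agree (for instance a sequence index, which is an assumption here). The names `is_compatible` and `join` are illustrative.

```python
def is_compatible(m1, m2):
    """m1 is ~-compatible with m2 if m1 ends with the character m2 starts with."""
    return m1[-1] == m2[0]

def join(m1, occ1, m2, occ2):
    """Return m1 ~ m2 and its location list, built as in Equation (42)."""
    if not is_compatible(m1, m2):
        raise ValueError(f"{m1} is not ~-compatible with {m2}")
    motif = m1 + m2[1:]  # overlap the shared character once
    occurrences = sorted({
        ((x, y), z1)
        for ((x, l1), z1) in occ1
        for ((l2, y), z2) in occ2
        if l1 == l2 and z1 == z2  # m1 must end where m2 starts, same tag
    })
    return motif, occurrences

# Cells from the running example, all tagged with z = 0.
ab = [((1, 2), 0), ((5, 6), 0)]
b_d = [((2, 4), 0), ((6, 9), 0)]
motif, occurrences = join("ab", ab, "b-d", b_d)
```

Joining ab with b-d in this way gives ab-d with occurrences (1,4) and (5,9).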
[0114] An example of this procedure is described by the
pseudo-code shown below. NodeInconsistent(m) is a routine that
checks if the new motif m is non-maximal w.r.t. earlier
non-ancestral nodes by checking the location lists. Steps G:18-G:19
detect the suffix motifs of already detected maximal motifs. Result
is the collection of all the maximal extensible patterns.
Main( )
      Result ← { } ;
      B ← { m_i | m_i is a cell } ;
      For each m = Extract(B)
          Iterate( m, B, Result );

Iterate( m, B, Result )
G:1   m' ← m ;
G:2   For each b = Extract(B) with
G:3       (( b ~-compatible m' ) OR ( m' ~-compatible b ))
G:4     If ( m' ~-compatible b )
G:5       m_t ← m' ~ b ;
G:6       If NodeInconsistent(m_t) exit;
G:7       If ( |L_m'| = |L_b| ) B ← B − {b} ;
G:8       If ( |L_m'| ≥ K )
G:9         m' ← m_t ;
G:10        Iterate( m', B, Result );
G:11    If ( b ~-compatible m' )
G:12      m_t ← b ~ m' ;
G:13      If NodeInconsistent(m_t) exit;
G:14      If ( |L_m'| = |L_b| ) B ← B − {b} ;
G:15      If ( |L_m'| ≥ K )
G:16        m' ← m_t ;
G:17        Iterate( m', B, Result );
G:18  For each r ∈ Result with L_r = L_m'
G:19    If ( m' is not maximal w.r.t. r ) return;
G:20  Result ← Result ∪ {m'} ;
[0115] Correctness follows from the observation that the above
exemplary procedure essentially constructs the inexact suffix tree
of the input string implicitly, in a different order. A tight time
complexity is more difficult to come by. However, if we consider M
to be the number of extensible maximal motifs and S to be the size
of the output (i.e., the sum of the sizes of the motifs and the
sizes of the corresponding location lists), then the time taken by an
exemplary embodiment of the present invention is O(SM log M). In
experiments by the inventors of the kind described below, at a 3 GHz
clock, processing time typically ranged from a few minutes to half an
hour.
[0116] A detailed description of an implementation of one exemplary
embodiment in accordance with the present invention follows.
[0117] Since a pattern space can vary dramatically for different
classes of inputs, a number of parameters have been introduced to
allow a user to exploit domain-specific knowledge as fully as
possible. These parameters control the search by pruning the pattern
space appropriately. There are essentially two classes of pruning
parameters: (1) combinatorial and (2) statistical. To avoid clutter, we
describe only a few of the pruning parameters here. Each parameter
has a default value, so it is not mandatory to specify them
all.
[0118] Combinatorial Pruning
[0119] 1. Pruning by Occurrences:
[0120] a. -k<Num>: Num is the quorum or the minimum number of times
a pattern must occur in the input.
[0121] b. -c: When this is specified the quorum k is in terms of
the number of sequences where the pattern occurs at least once. For
example, if this option is set and further -k10 is specified, then
a valid pattern must occur in at least 10 distinct sequences.
However, if this option is not set then a valid pattern must have
at least 10 occurrences, not necessarily in distinct sequences.
[0122] 2. Pruning by composition:
[0123] a. Using homology groups:
[0124] (1) -b<File>: File lists the symbol equivalences that define
the homology groups. The default file is an empty file.
[0125] (2) -n<Num>: Num is the maximum number of bracketed elements
(equivalence classes) in a pattern. For example, if "-n2" is
specified, then [IL]...[LV], L.[LV]-V are valid patterns but not
[LV][IL][LV]..L.
[0126] b. -R: When this mode is specified, only rigid patterns are
discovered.
[0127] c. Extensibility: The following two parameters may be used
to prune the space of extensible patterns. FIG. 1 shows an example
of the size of the pattern space for different parameter values.
[0128] (1) -D<Num>: Num is the maximum number of consecutive don't
care characters (`.`) in the realization of an extensible pattern.
Note that a don't care character and an extensible character are
never consecutive in any valid pattern. For example, if "-D3" is
specified, then L...V, LV, L.L.V are valid patterns but not L....L.
Further, an extensible pattern of the form L-V implies that there
are one to three don't care characters in the occurrences of this
pattern between the bases L and V.
[0129] (2) -d<Num>: Num is the minimum number of non-extensible
characters (including the don't care character) between two
consecutive extensible characters (`-`). For example, if "-d4" is
specified, then L..H-L..H-L is a valid pattern but not
L...H-L.H-L.
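The composition and extensibility constraints above can be expressed as simple checks on a pattern string. The sketch below is an illustrative reading of the -n, -D and -d parameters, not the tool's actual code: it assumes single-letter solid characters, counts a bracketed class as one character, and does not handle the annotated "-(...)" form used in the result tables. All function names are hypothetical.

```python
import re

TOKEN = re.compile(r"\[[^\]]*\]|.")  # a bracketed class or a single character

def longest_dot_run(pattern):
    """Largest number of consecutive don't care characters."""
    return max((len(run) for run in re.findall(r"\.+", pattern)), default=0)

def bracket_count(pattern):
    """Number of bracketed elements (equivalence classes)."""
    return len(re.findall(r"\[[^\]]*\]", pattern))

def min_tokens_between_dashes(pattern):
    """Fewest characters between two consecutive '-', or None if fewer than two '-'."""
    inner = pattern.split("-")[1:-1]
    if not inner:
        return None
    return min(len(TOKEN.findall(segment)) for segment in inner)

def is_valid(pattern, D=None, d=None, n=None):
    """Apply whichever of the -D, -d and -n limits are given."""
    if D is not None and longest_dot_run(pattern) > D:
        return False
    if n is not None and bracket_count(pattern) > n:
        return False
    if d is not None:
        gap = min_tokens_between_dashes(pattern)
        if gap is not None and gap < d:
            return False
    return True
```

For instance, `is_valid("L....L", D=3)` is false because of the run of four don't cares, while `is_valid("L..H-L..H-L", d=4)` is true.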
[0130] Statistical Pruning
[0131] 1. -p<File>: File lists the symbol probabilities used for
the probabilistic analysis.
[0132] 2. -z<Val>: Val is the minimum absolute value of the Z-score
of the patterns.
[0133] Information Display
[0134] 1. Displaying occurrence information: The different modes of
displaying the occurrence list of each valid pattern may be as
follows. (1) The occurrence list is not displayed (option -L0).
(2) Only the start position of each occurrence is displayed (option
-L1). (3) The start and end position of each occurrence is
displayed as x.sub.1-x.sub.2 where x.sub.1 is the starting position
and x.sub.2 the end position (option -L4).
[0135] 2. Displaying statistical information: The statistical
information displayed for possible use includes (1) the probability
of occurrence of a pattern, (2) the observed number of occurrences,
and (3) the Z-score. Table 1 shows an example.
TABLE 1 Numbers of patterns in the experiment in Table 8 with
Z-Score ≥ 100.0 at various values of parameters D and d with quorum
k = 53
          D = 2   D = 3   D = 4   D = 5
d = 3      121     196     370    1145
d = 4      121     194     355    1008
d = 5      114     182     326     891
d = 8      112     178     313     758
d = 10     112     178     313     727
TABLE 2 A statistical summary of a small set of valid patterns on
the Coagulation factors 5/8 type C domain, also used in Table 8.
Pattern                       Probability    Occ.   Z-Score
[LIVP]-[LM]R.[GE][LIVP].GC    2.05647e-07     57    585.494
LR.[GE][LIVP].GC              2.53136e-07     63    582.758
L..[GE][LIVP].GC              4.77614e-06     70    148.626
R-[GE][LIVP].GC               6.33367e-06     66    121.48
L-[GE][LIVP].GC               1.43284e-05     83    101.21
G[LIVP][GE].GC                3.98344e-05     77    55.359
R-[LIVP].GC                   4.68467e-05     65    42.6968
L-[LIVP].GC                   0.00010598     112    48.3873
[0136] FIG. 3 illustrates a flowchart 300 of an exemplary method in
accordance with the present invention. The flowchart 300 starts at
step 302 and continues to step 304, where the method receives a
sequence. The flowchart continues to step 306, where the method
assigns a significance to an extensible motif within the sequence
based upon a combination of syntactic and statistical analysis, an
example of which is described above. The method continues to step
308 where the method identifies a significant extensible motif by,
for example, determining whether the significance assigned to an
extensible motif exceeds a predetermined threshold. The method
continues to step 310 where the system displays a list of the
identified extensible motifs from the sequence and continues to step
312 where the method ends.
Experimental Results
[0137] The inventors tested an exemplary embodiment in accordance
with the present invention on six protein families by seeking the
surprising motifs in each. Each family was picked at random from
the PROSITE database.
[0138] High potential iron-sulfur proteins (HiPIP) (PROSITE I.D.
PS00596). This is a specific class of high-redox potential 4Fe-4S
ferredoxins that function in anaerobic electron transport and which
occur in photosynthetic bacteria and in Paracoccus denitrificans.
Two of the cysteine residues of the motif shown in Table 3 are
involved in binding to the iron-sulfur cluster. This is the
top-ranking motif discovered by the exemplary embodiment out of the
possible 273 extensible motifs.
[0139] Streptomyces subtilisin-type inhibitors (PROSITE I.D.
PS00999). Bacteria of the Streptomyces family produce a family of
proteinase inhibitors characterized by their strong activity toward
subtilisin. They are collectively known as SSI's: Streptomyces
Subtilisin Inhibitors. The exemplary embodiment discovers this
functionally significant motif as the top ranking one out of 470
extensible motifs (Table 4).
[0140] Nickel-dependent hydrogenases (PROSITE I.D. PS00508). These
are enzymes that catalyze the reversible activation of hydrogen and
are further involved in the binding of nickel. Again, this
functionally significant motif is detected in the top three by the
exemplary embodiment out of 4150 extensible motifs (Table 5).
[0141] G-protein coupled receptors family 3 (PROSITE I.D. PS00980).
The exemplary embodiment finds that the most important structural
motif in this family is in the top thirty of the motifs out of 3508
extensible motifs (Table 6).
[0142] Chitin-binding type-1 domain (PROSITE I.D. PS00026). The
exemplary embodiment finds that the most important structural motif
in this family is in the top two of the motifs out of 886
extensible motifs (Table 7).
[0143] Coagulation factors 5/8 type C domain (FA58C) (PROSITE I.D.
PS01286). The exemplary embodiment finds that the most important
structural and functional motif in this family is in the top two of
the motifs out of 80290 extensible motifs (Table 8).
[0144] To summarize, the inventors discovered that in almost all
cases, the motif documented in PROSITE as the most important (i.e.,
the functionally or structurally relevant motif) is among the top
extensible motifs returned by the exemplary embodiment as
surprising. In the fourth set (Table 6) the inventors find the
PROSITE motif at position 42. This experiment shows that in some
particular cases the patterns reported by the exemplary embodiment
can be grouped together; in fact, the top scoring motifs are very
close to each other in location and composition. This suggests that
a post-processing step that clusters together the top patterns may
improve the results. In all cases, the difference in the z-score
between the top few and the rest is dramatic, as can be seen in
Tables 3 to 8. The differing values of the Z-scores of each family
are attributed to the different sizes of the families (the number
of members and the length of each member).
[0145] The inventors also tested the sensitivity and selectivity of
an exemplary embodiment of the present invention using the families
as reported in PROSITE. The following six sets were selected by the
inventors randomly, one from each family: 5 sequences from each of the
families of high potential iron-sulfur proteins, Streptomyces
subtilisin-type inhibitors, nickel-dependent hydrogenases,
G-protein coupled receptors family 3 and coagulation factors 5/8
type C domain, and 8 sequences from the family of chitin-binding
type-1 domain.
[0146] First, each family was contaminated with one of the sets that
was drawn from a different family (for example, the five sequences
of G-protein were mixed with the family of the hydrogenases). Next,
the inventors contaminated each family with two sets from
different families and then subsequently three sets. In each of the
experiments the inventors discovered that the top ranked motifs
were exactly as reported in Tables 3 to 8.
TABLE 3 The functionally relevant motif is shown in bold for high
potential iron-sulfur proteins (HiPIP) (id PS00596). Here 22
sequences of about 2500 bases were analyzed at k = 22, D = 9, d = 4.
Rank  z-score   Motif
1     1497.62   C-(6,7,8,9)[LIVM]...G[YW]C..[FYW]
2     978.872   P-(3,4,6,8,9)[LIVM]...G[YW]C..[FYW]
3     590.866   C-(6,7,8,9)[LIVM]...G[YW]C-(1,3,4,5,6,7)A
4     564.821   C-(6,7,8,9)[LIVM]...G[YW]C-(1,3,4,5,6,7)[ATD]
5     537.73    [LIVM]-(1,2,3,4,5,7,8,9)G[YW]C..[FYW]
6     385.2     [LIVM]-(1,2,3,4,5,7,8,9)G[FYW]C..[FYW]
7     161.173   [LIVM]...G[FYW]C-(2,4)[FYW]
8     156.184   [LIVM]-(1,2,3,4,5,6,7,8,9)G[YW]C
9     138.881   [LIVM]-(1,3,4,5,6)[LIVM]...G[FYW]C-(1,3,4,5,6,7)A
TABLE 4 The functionally relevant motif is shown in bold for
Streptomyces subtilisin-type inhibitors signature (id PS00999).
Here 20 sequences of about 2500 bases were analyzed at k = 20,
D = 4, d = 4.
Rank  z-score    Motif
1     7.60E+07   RA.T[LV].C.P-(2,3)G.HP....AC[ATD].L....[ASG]
2     21416.8    A..[LV].C.P-(2,3)G.HP-(1,2,4)[ASG].[ATD]
3     8105.33    A-(1,4)T....P-(2,3)G.HP....[ATD]-(3)L....[ASG]
4     5841.85    [ATD].T....P-(1,2,3)G.HP-(1,2,4)A.[ATD]
5     4707.62    P.[ASG]-(2,3,4)P....AC[ATD].L....[ASG]
6     4409.21    A..[LV]...P-(2,3)G.HP-(1,2,4)A.[ATD]
7     3086.17    P-(1,2,3)[ASG]..P-(4)AC[ATD].L....[ASG]
8     3068.18    R..[ATD]....P-(2,3)G.HP-(1,2,4)[ASG].[ATD]
9     2615.98    [ASG][ATD]-(1,3,4)P....AC[ATD].L....[ASG]
10    2569.66    [ASG]-(1,2,3,4)P....AC[ATD].L....[ASG]
11    2145.6     G-(2,3)P....AC[ATD].L....[ASG]
TABLE 5 The functionally relevant motifs are shown in bold for
Nickel-dependent hydrogenases (PROSITE I.D. PS00508). Here 22
sequences of about 23000 bases were analyzed at k = 22, D = 4,
d = 3.
Rank  z-score    Motif
1     295840     [LIM]-(1,2,3,4)[STA][FY]DPC[LIM][ASG]C[ASG].H
2     2.86E+05   [LIM]-(1,2,3,4)[ASG][FY]DPC[LIM][ASG]C[ASG].H
3     155736     R-(1,4)[FY]DPC[LIM][ASG]C[ASG].H
4     78829      [LIM]-(1,2,3,4)[STA].DPC[LIM][ASG]C[ASG].H
5     76101.9    [LIM]-(1,2,3,4)[ASG].DPC[LIM][ASG]C[ASG].H
6     34205.6    [STA]-(1,4)DPC[LIM][ASG]C[ASG].H
7     30325.1    [LIM]-(1,2,3,4)[STA][FY]D.C[LIM][ASG]C..H
8     29276      [LIM]-(1,2,3,4)[ASG][FY]D.C[LIM][ASG]C..H
9     20527.3    [ASG]-(1,4)DPC[LIM][ASG]C[ASG].H
10    17503.4    [LIM]-(1,2,3,4)[ASG]..PC[LIM][ASG]C[ASG].H
TABLE 6 The functionally relevant motif is shown in bold for
G-protein coupled receptors family 3 (PROSITE I.D. PS00980). This
run involved 25 sequences of about 25000 bases each at k = 25,
D = 4, d = 8.
Rank  z-score    Motif
1     2.84E+09   Y...L...C..[FYW]A..[STAH]R..P..FNE[STAH]K.I.F[STAH]M
2     8.28E+07   V-(1,3,4)G...S..[STAH]....N...L....Q-(4)[STAH]....L.[DN]...[FYW]..F....P....Q..A...I
3     5.55E+07   L-(2,3)F...Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I
4     4.27E+07   L-(2,3)F...Q.[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I
5     4.23E+07   L....I...[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I
6     3.99E+07   LF-(3)Q....[STAH][STAH]....S[DN]...[FYW]..F.R..P.D..Q..A...I
7     3.38E+07   LF-(3)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I
8     3.38E+07   LF...Q....[STAH]-(4)L.[DN]...[FYW]..F.R..P.D..Q[STAH].A...I
9     3.29E+07   I-(1)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I
10    3.29E+07   I.Q-(4)[STAH]....LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I
11    3.29E+07   I.Q.[STAH]..[STAH]-(4)LS[DN]...[FYW]..F.R..P.D..Q..A...I
12    3.10E+07   L....Q-(1,4)[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I
13    2.77E+07   L[FYW]-(3)Q.[STAH]..[STAH]....LS....[FYW]..F.R..P.D..Q..A...I
14    2.58E+07   L-(4)Q.[STAH]..[STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I
15    2.30E+07   S.[STAH]S-(2,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I
16    2.15E+07   L-(1,3,4)C..[FYW]A..[STAH]R..P..F.E.K.I.F.M
17    1.40E+07   F-(1)I.Q...[STAH][STAH]-(4)L[STAH]....[FYW]..F.R..P.D..Q..A...I
18    1.37E+07   L-(2,4)I...[STAH].[STAH].[STAH]-(3)LS....[FYW]..F.R..P.D..Q..A...I
19    1.02E+07   L..I-(1)Q....[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I
20    8.65E+06   I-(1)Q....[STAH][STAH]...L.[DN]...[FYW]..F.R..P.D..Q..A...I
21    8.19E+06   S[STAH]-(1,2,3,4)LS[DN]...[FYW]..F.R..P.D..Q[STAH].A...I
22    7.98E+06   Q-(3)[STAH][STAH]....LS[DN]...[FYW]..F.R..P.D..Q..A...I
23    6.82E+06   F-(3)Q....[STAH][STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I
24    5.66E+06   A[STAH][STAH]-(2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I
25    5.57E+06   F.I-(3)[STAH]..[STAH]....L[STAH]....[FYW]..F.R..P.D..Q..A...I
26    5.18E+06   L.L-(4)Q....[STAH]....L-(1)[DN]...[FYW]..F.R..P.D..Q..A...I
27    3.61E+06   L.L-(2)I...[STAH]...[STAH]....[STAH]....[FYW]..F.R..P.D..Q..A...I
28    3.48E+06   [STAH].[STAH]-(1,2,3)LS[DN]...[FYW]..F.R..P.D..Q..A...I
29    3.17E+06   [STAH]...[STAH]...LS[DN]...[FYW]..F.R..P.D..Q..A...I
30    2.47E+06   L....Q-(4)[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I
31    2.43E+06   V-(1,3)N.L....I-(3)[STAH]...[STAH]....[STAH]....[FYW]..F....P.D..Q..A...I
32    2.22E+06   [STAH][STAH][STAH]-(1,2,3)LS....[FYW]..F.R..P.D..Q..A...I
33    2.06E+06   [STAH].[STAH][STAH]....LS....[FYW]..F.R..P.D..Q..A...I
34    2.03E+06   Y...L...C...A...R..P..F.E.K.I-(1,4)[FYW][STAH]
35    1.99E+06   I.Q...[STAH]-(1)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I
36    1.99E+06   I.Q-(1)[STAH]...[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I
38    1.97E+06   F.I...[STAH]-(3)[STAH]...L.[DN]...[FYW]..F....P.D..Q..A...I
40    1.97E+06   F.I-(3)[STAH]..[STAH]....L.[DN]...[FYW]..F....P.D..Q..A...I
41    1.91E+06   [STAH]..[STAH].K-(1,4)P..FNE[STAH]K.I.F[STAH]M
42    1.72E+06   CC[FYW].C..C....[FYW]-(2,4)[DN]..[STAH]C..C
43    1.57E+06   [STAH]-(1,3,4)[FYW]A..[STAH]R..P..F.E.K.I.F.M
44    1.49E+06   A-(1,3)[STAH]...L[STAH][DN]...[FYW]..F.R..P.D..Q..A...I
45    1.36E+06   Q...[STAH].[STAH]-(3)L[STAH]....[FYW]..F.R..P.D..Q..A...I
46    1.32E+06   I-(3)[STAH]..[STAH][STAH]....S....[FYW]..F.R..P.D..Q..A...I
47    1.31E+06   [STAH][STAH]-(1,2,3,4)L.[DN]...[FYW]..F.R..P.D..Q..A...I
48    1.24E+06   [STAH]..[STAH][STAH]-(1,3)LS....[FYW]..F.R..P.D..Q..A...I
49    1.19E+06   [FYW]-(1,3,4)[STAH]...P..FNE[STAH]K.I.F[STAH]M
50    1.12E+06   I...[STAH]-(3)[STAH]...L[STAH]....[FYW]..F.R..P.D..Q..A...I
TABLE 7 The functionally relevant motif is shown in bold for Chitin
recognition (PROSITE I.D. PS00026). Here 53 sequences of about
13823 bases were analyzed at k = 53, D = 5, d = 10.
Rank  z-score    Motif
1     5.42E+06   C-(4,5)CCS..G[FYW]CG....[FYW]C
2     1.73E+06   C-(4,5)CCS..G[FYW]CG.....C
3     1.70E+06   C-(4,5)CCS..G.CG....[FYW]C
4     1.56E+06   CCS..G[FYW]CG....[FYW]C
5     544162     C-(4,5)CCS..G.CG.....C
6     4.95E+05   CCS..G[FYW]CG.....C
7     488261     CCS..G.CG....[FYW]C
8     155706     CCS..G.CG.....C
9     104666     C-(4,5)C.S..[GASL][FYW]CG.....C
10    84133.4    C.....C-(3,4)[GASL][FYW]CG....[FYW]C
11    56078      C.....C-(3,4)G.CG....[FYW]C
TABLE 8 The functionally relevant motif is shown in bold for
Coagulation factors 5/8 type C domain (PROSITE I.D. PS01286). Here
40 sequences of about 80290 bases were analyzed. Notice that in
this case, the motifs have a fairly large gap size of 10 bases at
k = 40, D = 10, d = 10.
Rank  z-score   Motif
1     969.563   P-(4,5,8,9,10)[LM]R.[GE][LIVP].GC
2     694.1     P-(4,5,8,9,10)[LM]R.[GE][LIVP].[GE]C
3     370.594   [LIVP]-(1,3,4,5,6,7,8,9,10)[LM]R.[GE]..[GE]C
4     361.052   P-(4,5,8,9,10)[LM]R.[GE]..[GE]C
5     261.519   [LIVP]-(1,3,4,5,6,7,8,9,10)[LM]R.[GE][LIVP]..C
6     261.519   [LIVP]-(1,3,4,5,6,7,8,9,10)[LM]R..[LIVP].[GE]C
7     254.971   P-(4,5,8,9,10)[LM]R.[GE][LIVP]..C
8     254.971   P-(4,5,8,9,10)[LM]R..[LIVP].[GE]C
9     249.763   [LIVP]........[LIVP]-(1,2,4,5,6,7,8,9,10)R.[GE]..GC
[0147] The extensibility of a motif not only leads to a succinct
description but also helps capture function and/or structure in a
single pattern, which would not be possible through a rigid
description. At the same time, with extensible motifs the number of
candidates to be considered increases dramatically.
[0148] An exemplary embodiment of the present invention
characterizes a pattern rigidly, and conjugates structure and set
of occurrences. This results in a definition of motif that lends
itself to a natural notion of maximality, thereby embodying
statistics and structure in one measure of surprise. This is unlike
all previous approaches that consider structure and statistics as
separate features of a pattern.
[0149] An exemplary embodiment of the present invention provides a
powerful syntactic mechanism for eliminating unimportant motifs
before their score is computed. As explained above, for the class
of over-represented motifs, the non-maximal motifs are not more
surprising than the maximal motifs.
[0150] While the invention has been described in terms of several
exemplary embodiments, those skilled in the art will recognize that
the invention can be practiced with modification.
[0151] Further, it is noted that Applicant's intent is to
encompass equivalents of all claim elements, even if amended later
during prosecution.
* * * * *