U.S. patent application number 11/351951 was filed with the patent office on 2006-02-10 and published on 2006-11-23 for system and method for identification of microRNA precursor sequences and corresponding mature microRNA sequences from genomic sequences.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Tien Huynh, Kevin Charles Miranda, Isidore Rigoutsos.
Publication Number | 20060263798 |
Application Number | 11/351951 |
Family ID | 36793801 |
Filed Date | 2006-02-10 |
United States Patent Application | 20060263798 |
Kind Code | A1 |
Huynh; Tien ; et al. | November 23, 2006 |
System and method for identification of MicroRNA precursor
sequences and corresponding mature MicroRNA sequences from genomic
sequences
Abstract
A method for determining microRNA precursors and their
corresponding mature microRNAs from genomic sequences is provided.
For example, in one aspect of the invention, a method for
determining whether a nucleotide sequence contains a microRNA
precursor comprises the following steps. Patterns are generated by
processing a collection of already known microRNA precursor
sequences. One or more attributes are assigned to the generated
patterns. Only the patterns whose attributes satisfy certain
criteria are subselected, and then the subselected patterns are
used to analyze the nucleotide sequence. In another aspect of the
invention, a method for identifying a mature microRNA sequence in a
microRNA precursor sequence comprises the following steps. One or
more patterns are generated by processing a collection of known
mature microRNA sequences. The one or more patterns are filtered,
and then used to locate instances of the one or more filtered
patterns in one or more candidate precursor sequences.
Inventors: |
Huynh; Tien; (Yorktown
Heights, NY) ; Miranda; Kevin Charles; (McDowall,
AU) ; Rigoutsos; Isidore; (Astoria, NY) |
Correspondence
Address: |
RYAN, MASON & LEWIS, LLP
90 FOREST AVENUE
LOCUST VALLEY
NY
11560
US
|
Assignee: |
International Business Machines
Corporation
Armonk
NY
|
Family ID: |
36793801 |
Appl. No.: |
11/351951 |
Filed: |
February 10, 2006 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
60652499 | Feb 11, 2005 | |
Current U.S.
Class: |
435/6.11 ;
435/6.16; 702/20 |
Current CPC
Class: |
C12N 15/111 20130101;
G16B 30/00 20190201; C12N 2310/14 20130101; C12N 2330/10 20130101;
G16B 20/00 20190201; G16B 40/00 20190201; G16B 15/00 20190201; C12N
2320/11 20130101 |
Class at
Publication: |
435/006 ;
702/020 |
International
Class: |
C12Q 1/68 20060101
C12Q001/68; G06F 19/00 20060101 G06F019/00 |
Claims
1. A method for determining whether a nucleotide sequence contains
a microRNA precursor, the method comprising the steps of:
generating one or more patterns by processing a collection of known
microRNA precursor sequences; assigning one or more attributes to
the one or more generated patterns; subselecting one or more
patterns whose one or more attributes satisfy at least one
criterion; and using the one or more subselected patterns to
analyze the nucleotide sequence, such that a determination is made
whether the nucleotide sequence contains a microRNA precursor.
2. The method of claim 1, wherein the nucleotide sequence is from
an intergenic region.
3. The method of claim 1, wherein the nucleotide sequence is from
an intronic region.
4. The method of claim 1, wherein the nucleotide sequence is from
an amino acid coding region.
5. The method of claim 1, wherein the step of generating one or
more patterns comprises using a pattern discovery algorithm.
6. The method of claim 5, wherein the pattern discovery algorithm
is the Teiresias pattern discovery algorithm.
7. The method of claim 1, wherein the step of assigning one or more
attributes is carried out independently of and prior to the step of
using the one or more subselected patterns to analyze a nucleotide
sequence.
8. The method of claim 1, wherein the one or more attributes are
quantitative.
9. The method of claim 8, wherein at least one of the one or more
attributes represents statistical significance.
10. The method of claim 8, wherein at least one of the one or more
attributes represents a length of the pattern.
11. The method of claim 8, wherein at least one of the one or more
attributes represents a number of positions in the one or more
patterns which are not occupied by wild cards.
12. The method of claim 8, wherein a threshold value for each
attribute is selected.
13. The method of claim 12, wherein one or more patterns are
discarded if the value of the one or more attributes of the pattern
is below the selected threshold for the one or more attributes.
14. The method of claim 13, wherein the steps of selecting a
threshold value and discarding one or more patterns are repeated
for all used attributes.
15. The method of claim 1, wherein a set of counters is created for
the nucleotide sequence.
16. The method of claim 15, wherein the counters in the set of
counters equal the number of nucleotides in the nucleotide
sequence.
17. The method of claim 1, wherein all patterns are examined.
18. The method of claim 17, wherein each pattern with an instance
in the nucleotide sequence contributes to the counters at
corresponding positions of the nucleotide sequence.
19. The method of claim 18, wherein only consecutive positions in
the nucleotide sequences whose corresponding counter values exceed
a threshold are considered.
20. The method of claim 19, wherein one or more groups of
consecutive positions are considered only if they satisfy a minimum
length criterion.
21. The method of claim 20, wherein a secondary structure of each
consecutive group of positions is estimated using an RNA secondary
structure prediction method.
22. The method of claim 21, wherein the prediction method is one
included with software known as the Vienna Package.
23. The method of claim 21, wherein the prediction method is a
method called `mfold`.
24. The method of claim 21, wherein the predicted structure is
assigned one or more attributes.
25. The method of claim 24, wherein at least one of the one or more
attributes is folding energy of a formed complex.
26. The method of claim 24, wherein a threshold value for the one
or more attributes is selected.
27. The method of claim 24, wherein a complex is discarded if the
value of the one or more attributes is below the selected threshold
for the one or more attributes.
28. The method of claim 27, wherein the steps of selecting a
threshold value and discarding a complex are repeated for all used
attributes.
29. The method of claim 28, wherein the nucleotide sequence is
reported as a microRNA precursor if the predicted structure that
corresponds to the nucleotide sequence has not been discarded.
30. A system for determining whether a nucleotide sequence contains
a microRNA precursor, comprising: a memory that stores
computer-readable code; and a processor operatively coupled to the
memory, the processor configured to implement the computer-readable
code, the computer-readable code configured to: generate one or
more patterns by processing a collection of known microRNA
precursor sequences; assign one or more attributes to the one or
more generated patterns; subselect the one or more patterns whose
one or more attributes satisfy at least one criterion; and use the
one or more subselected patterns to analyze the nucleotide
sequence, such that a determination is made whether a nucleotide
sequence contains a microRNA precursor.
31. An article of manufacture for determining whether a nucleotide
sequence contains a microRNA precursor, comprising: a
computer-readable medium having computer-readable code embodied
thereon, the computer-readable code comprising: a step to generate
one or more patterns by processing a collection of known microRNA
precursor sequences; a step to assign one or more attributes to the
one or more generated patterns; a step to subselect the one or more
patterns whose one or more attributes satisfy at least one
criterion; and a step to use the one or more subselected patterns
to analyze the nucleotide sequence, such that a determination is
made whether a nucleotide sequence contains a microRNA
precursor.
32. A method for identifying a mature microRNA sequence in a
microRNA precursor sequence, comprising the steps of: generating
one or more patterns by processing a collection of known mature
microRNA sequences; filtering the one or more patterns; and
locating instances of the one or more filtered patterns in one or
more candidate precursor sequences.
33. A system for identifying a mature microRNA sequence in a
microRNA precursor sequence, comprising: a memory that stores
computer-readable code; and a processor operatively coupled to the
memory, the processor configured to implement the computer-readable
code, the computer-readable code configured to: generate one or
more patterns by processing a collection of known mature microRNA
sequences; filter the one or more patterns; and locate instances of
the one or more filtered patterns in one or more candidate
precursor sequences.
34. An article of manufacture for identifying a mature microRNA
sequence in a microRNA precursor sequence, comprising: a
computer-readable medium having computer-readable code embodied
thereon, the computer-readable code comprising: a step to generate
one or more patterns by processing a collection of known mature
microRNA sequences; a step to filter the one or more patterns; and
a step to locate instances of the one or more filtered patterns in
one or more candidate precursor sequences.
35. A method for determining whether a nucleotide sequence contains
a mature microRNA, the method comprising the steps of: generating
one or more patterns by processing a collection of known mature
microRNA sequences; assigning one or more attributes to the one or
more generated patterns; subselecting one or more patterns whose
one or more attributes satisfy at least one criterion; and using
the one or more subselected patterns to analyze the nucleotide
sequence, such that a determination is made whether the nucleotide
sequence contains a mature microRNA.
36. The method of claim 35, wherein the nucleotide sequence is from
an intergenic region.
37. The method of claim 35, wherein the nucleotide sequence is from
an intronic region.
38. The method of claim 35, wherein the nucleotide sequence is from
an amino acid coding region.
39. The method of claim 35, wherein the step of generating one or
more patterns comprises using a pattern discovery algorithm.
40. The method of claim 39, wherein the pattern discovery algorithm
is the Teiresias pattern discovery algorithm.
41. The method of claim 35, wherein the step of assigning one or
more attributes is carried out independently of and prior to the
step of using the one or more subselected patterns to analyze a
nucleotide sequence.
42. The method of claim 35, wherein the one or more attributes are
quantitative.
43. The method of claim 42, wherein at least one of the one or more
attributes represents statistical significance.
44. The method of claim 42, wherein at least one of the one or more
attributes represents a length of the pattern.
45. The method of claim 42, wherein at least one of the one or more
attributes represents a number of positions in the one or more
patterns which are not occupied by wild cards.
46. The method of claim 42, wherein a threshold value for each
attribute is selected.
47. The method of claim 46, wherein one or more patterns are
discarded if the value of the one or more attributes of the pattern
is below the selected threshold for the one or more attributes.
48. The method of claim 47, wherein the steps of selecting a
threshold value and discarding one or more patterns are repeated
for all used attributes.
49. The method of claim 35, wherein a set of counters is created
for the nucleotide sequence.
50. The method of claim 49, wherein the counters in the set of
counters equal the number of nucleotides in the nucleotide
sequence.
51. The method of claim 35, wherein all patterns are examined.
52. The method of claim 51, wherein each pattern with an instance
in the nucleotide sequence contributes to the counters at the
corresponding positions of the nucleotide sequence.
53. The method of claim 52, wherein only consecutive positions in
the nucleotide sequences whose corresponding counter values exceed
a threshold are considered.
54. The method of claim 53, wherein one or more groups of
consecutive positions are considered only if they satisfy a minimum
length criterion.
55. The method of claim 42, wherein a threshold value for the one
or more attributes is selected.
56. The method of claim 55, wherein a group of consecutive
positions is discarded if the value of the one or more attributes
is below the selected threshold for the one or more attributes.
57. The method of claim 56, wherein the steps of selecting a
threshold value and discarding a group of consecutive positions are
repeated for all used attributes.
58. The method of claim 57, wherein the group of consecutive
positions that has not been discarded is reported as a mature
microRNA.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/652,499, filed Feb. 11, 2005, the disclosure of
which is incorporated by reference herein.
[0002] This application is related to U.S. patent application
entitled "System and Method for Identification of MicroRNA Target
Sites and Corresponding Targeting MicroRNA Sequences," Attorney
Docket Number YOR920060077US1, filed concurrently herewith, the
disclosure of which is incorporated by reference herein. Also, this
application is related to U.S. patent application entitled
"Ribonucleic Acid Interference Molecules," Attorney Docket Number
YOR920040675US2, filed concurrently herewith, the disclosure of
which is incorporated by reference herein.
FIELD OF THE INVENTION
[0003] The present invention relates to genes and, more
particularly, to ribonucleic acid interference molecules and their
role in gene expression.
BACKGROUND OF THE INVENTION
[0004] The ability of an organism to regulate the expression of its
genes is of central importance to life. A breakdown in this
homeostasis leads to disease states, such as cancer, where a cell
multiplies uncontrollably, to the detriment of the organism. The
general mechanisms utilized by organisms to maintain this gene
expression homeostasis are the focus of intense scientific
study.
[0005] It recently has been discovered that some cells are able to
down-regulate their gene expression through certain ribonucleic
acid (RNA) molecules. Namely, RNA molecules can act as potent gene
expression regulators either by inducing messenger-RNA (mRNA)
degradation or by inhibiting translation. This activity is
summarily referred to as post-transcriptional gene silencing, or
PTGS for short. An alternative name by which it is also known is
RNA interference, or RNAi. PTGS/RNAi has been found to function as
a mediator of resistance to endogenous and exogenous pathogenic
nucleic acids, and also as a regulator of the expression of genes
inside cells.
[0006] The term `gene expression,` as used herein, refers generally
to the transcription of messenger-RNA (mRNA) from a gene, and,
e.g., its subsequent translation into a functional protein. One
class of RNA molecules involved in gene expression regulation
comprises microRNAs, which are endogenously encoded and regulate
gene expression by either disrupting the translation processes or
by degrading mRNA transcripts, e.g., inducing post-transcriptional
repression of one or more target sequences.
[0007] The RNAi/post-transcriptional gene silencing mechanism
allows an organism to employ short RNA sequences to either degrade
or disrupt translation of complementary mRNA transcripts. Early
studies suggested only a limited role for RNAi, that of a defense
mechanism against foreign born pathogens. However, the subsequent
discovery of many endogenously-encoded microRNAs pointed towards
the possibility of this being a control mechanism of a more general
nature. Recent evidence has led the community to hypothesize
that a wider spectrum of biological processes is affected by RNAi,
thus extending the range of this presumed control layer.
[0008] To date, there have been relatively few attempts to devise
new methods for finding novel microRNA precursors and their
associated mature microRNAs. This is likely connected to a belief,
held by the research community at large, that all of the relevant
mature microRNAs and their precursors for the most important model
organisms have already been identified using biochemical methods.
The existing methods can be categorized into
two basic approaches.
[0009] In the first approach, the methods begin by predicting the
RNA secondary structure of candidate sequences using any of the
available prediction programs (e.g., "RNAfold" or "mfold"). The
methods then focus on only those sequences that are predicted to
fold into the familiar hairpin-like structure of microRNA
precursors, subselecting those that satisfy additional sequence or
other properties (Lai E C, Tomancak P, Williams R W, Rubin G M.
(2003) Computational identification of Drosophila microRNA genes.
Genome Biol 4(7): R42; Lim L P, Glasner M E, Yekta S, Burge C B,
Bartel D P (2003b) Vertebrate microRNA genes. Science 299: 1540;
Lim L P, Lau N C, Weinstein E G, Abdelhakim A, Yekta S, Rhoades M
W, Burge C B, Bartel D P (2003a) The microRNAs of Caenorhabditis
elegans. Genes and Development 17: 991-1008; I. Bentwich et al.,
"Identification of hundreds of conserved and nonconserved human
microRNAs," Nature Genetics, published online Jun. 19, 2005. DOI:
10.1038/ng1590).
[0010] The second type of approach uses the observation that the
two arms of the hairpin of a precursor exhibit a much higher degree
of sequence conservation than the regions outside the precursor and
also the region in the loop of the precursor. This observation was
combined with additional, known properties of microRNAs and led to
the successful discovery of many novel mature microRNA and microRNA
precursors (Berezikov, E., Guryev, V., van de Belt, J., Wienholds,
E., Plasterk, R. H. A., Cuppen, E. (2005) Phylogenetic shadowing
and computational identification of human microRNA genes. Cell 120:
21-24).
[0011] The inventive approach that we present in the discussion
below represents a departure from the above two schools of thought.
Even though the inventive approach exploits sequence conservation
to discover microRNA precursors, the inventive approach does so
locally, i.e. the approach seeks to leverage the existence of
locally conserved sequence fragments that are shared by known
precursors that could potentially be distant from a phylogenetic
standpoint.
[0012] A better understanding of the mechanism of the RNA
interference process would benefit the fight against disease, drug
design and host defense mechanisms.
SUMMARY OF THE INVENTION
[0013] A method for identifying microRNA precursor sequences and
corresponding mature microRNA sequences from genomic sequences is
provided. For example, in one aspect of the invention, a method for
determining whether a nucleotide sequence contains a microRNA
precursor comprises the following steps. One or more patterns are
generated by processing a collection of known microRNA precursor
sequences. One or more attributes are assigned to the one or more
generated patterns. Only the one or more patterns whose one or more
attributes satisfy at least one criterion are subselected, and then
the one or more subselected patterns are used to analyze the
nucleotide sequence.
[0014] In another aspect of the invention, a method for identifying
a mature microRNA sequence in a microRNA precursor sequence
comprises the following steps. One or more patterns are generated
by processing a collection of known mature microRNA sequences. The
one or more patterns are filtered, and then used to locate
instances of the one or more filtered patterns in one or more
candidate precursor sequences.
[0015] A more complete understanding of the present invention, as
well as further features and advantages of the present invention,
will be obtained by reference to the following detailed
description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1A is a flow diagram illustrating a method for
identifying a microRNA precursor sequence, according to one
embodiment of the invention;
[0017] FIG. 1B is a flow diagram illustrating a method for
identifying a mature microRNA sequence in a microRNA precursor
sequence, according to one embodiment of the invention;
[0018] FIG. 2A is a graph illustrating a genomic sequence hit with
a microRNA-precursor-pattern-set, the graph further illustrating
the number of pattern hits with instances in a particular genomic
neighborhood as a function of position;
[0019] FIG. 2B is a graph illustrating detail of the region shown
in FIG. 2A;
[0020] FIG. 2C is a graph illustrating detail of the region shown
in FIG. 2B;
[0021] FIG. 2D is an illustration of the predicted secondary
structure of cel-mir-273 as determined by RNAfold;
[0022] FIG. 3A is a graph illustrating the distribution of
pattern-hit-scores for all C. elegans microRNAs within RFAM (solid
line) versus generic hairpins (dashed line);
[0023] FIG. 3B is a graph illustrating the distribution of
predicted folding energies for all C. elegans microRNAs (solid
line) and generic hairpins (dashed line);
[0024] FIG. 3C is an X-Y scatter plot illustrating pattern hits
versus folding energy for C. elegans microRNAs (light-grey-colored
dots) and generic hairpins (dark-grey-colored dots);
[0025] FIG. 4 is a table summarizing the microRNA-precursor
predictions for the genomes of C. elegans, D. melanogaster, M.
musculus and H. sapiens; and
[0026] FIG. 5 is a block diagram illustrating a system for
determining whether a nucleotide sequence contains a microRNA
precursor, in accordance with one embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0027] The teachings of the present invention relate to ribonucleic
acid (RNA) molecules and their role in gene expression regulation.
As mentioned above, a novel and robust pattern-based approach for
the discovery of microRNA precursors and their corresponding mature
microRNAs from genomic sequence is provided. Advantageously, the
inventive approach obviates the need for cross-species sequence
conservation, and is thus readily applicable to any genomic
sequence independent of whether it has orthologues in other
species. The capabilities of the inventive approach are
demonstrated herein by first showing that the inventive approach
correctly identifies many of the currently known microRNA
precursors and mature microRNAs. We describe an implemented
prototype system and use the system to analyze computationally the
C. elegans, D. melanogaster, M. musculus and H. sapiens genomes. By
way of example, such sequences are described in detail in
Application No. 60/652,499, the disclosure of which is incorporated
by reference herein. Also, such sequences are described in detail
in the above-mentioned related U.S. patent application
(YOR920040675US2), the disclosure of which is incorporated
herein.
[0028] We estimate that the number of endogenously-encoded microRNA
precursors is substantially higher than currently hypothesized. The
inventive approach readily extends to the discovery of microRNA
target sites directly from genomic sequences. A method for
identifying microRNA target sites is described in detail in the
above-mentioned related U.S. patent application (YOR920060077US1),
the disclosure of which is incorporated herein.
[0029] FIG. 1A is a flow diagram illustrating a method for
identifying a microRNA precursor sequence, according to one
embodiment of the invention. Underlying the inventive approach is a
pattern-based methodology which discovers variable-length sequence
fragments (`patterns`) that recur in an input database a
user-specified, minimum number of times. The number of discovered
patterns, the exact locations of each instance of the discovered
pattern, the actual extent of each pattern, and finally the number
of instances that a pattern has in the input database are, of
course, not known ahead of time. Computationally, the pattern
discovery problem is a much `harder` problem than database
searching, a task with which most biologists are familiar and which
has been in mainstream use for more than 20 years. Indeed, pattern
discovery is an NP-hard problem whereas database searching can be
solved in polynomial time.
[0030] We will first describe step 110, the generation of patterns.
The generation of patterns (step 110) is comprised of steps 112 and
114, as shown in FIG. 1A.
[0031] Step 112 is the step of processing known microRNA precursors
to discover intra- and inter-species patterns of conserved
sequence.
[0032] The recurrent instances of conserved sequence segments can
be represented with the help of regular expressions each with a
differing degree of descriptive power. The expressions used in this
disclosure are composed of literals (solid characters from the
alphabet of permitted symbols), wildcards (each denoted by `.` and
representing any character), and sets of equivalent literals (each
set being a small number of symbols, any one of which can occupy
the corresponding position). The distance between two consecutive
occupied positions is assumed to be unchanged across all instances
of the pattern (i.e., `rigid patterns`). The pattern
[LIV].[LIV].D.ND[NH].P is an example from the domain of amino acid
sequences and describes the calcium binding motif of cadherin
proteins. The motif in question comprises exactly one of the amino
acids {leucine, isoleucine, valine}, followed by any amino acid,
followed again by exactly one of the amino acids {leucine,
isoleucine, valine}, followed by any amino acid, followed by the
negatively charged aspartate, etc. Typically, the presence of a
statistically significant pattern in an unannotated amino acid
sequence is taken as a sufficient condition to suggest the presence
of the feature captured by the pattern.
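As a minimal sketch of how instances of such a rigid pattern can be located in a sequence: the pattern syntax described above (literals, `.` wildcards and bracketed sets) is already valid regular-expression syntax, so a standard regular-expression engine can be used directly. The function name `find_instances` and the example sequence are illustrative assumptions and do not come from the disclosure.

```python
import re

def find_instances(pattern, sequence):
    """Return (start, matched_text) for every instance of a rigid pattern.

    Literals match themselves, '.' matches any single symbol, and a
    bracketed set matches any one of its members. A lookahead is used so
    that overlapping instances are reported as well.
    """
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical amino acid sequence containing one instance of the motif.
cadherin_motif = "[LIV].[LIV].D.ND[NH].P"
sequence = "MKLAVGDANDNAPQ"
hits = find_instances(cadherin_motif, sequence)
print(hits)
```

The same function applies unchanged to nucleotide patterns over {A,C,G,T}, because the distance between occupied positions is fixed by the pattern itself.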
[0033] In the context of the invention described herein, the symbol
set that is used comprises the four nucleotides {A,C,G,T} found in
a deoxyribonucleic acid (DNA) sequence. The input set which we
processed in order to discover patterns is Release 3.0 of the RFAM
database, from January 2004 (Griffiths-Jones, S. et al. Rfam: an
RNA family database. Nucleic Acids Res., 31 439-441 (2003)). The
use of a more-than-18-month-old release of the database as our
training set was intentional. We wanted to gauge how well our
method would perform if presented only with the knowledge that was
available in the literature in January 2004. The analysis has since
been repeated using subsequent releases of the RFAM database.
[0034] Unlike previously published computational methods for
microRNA precursor prediction, the present invention makes use of
the sequence information from all the microRNAs which are contained
in the RFAM release, independent of the organism in which they
originate. The release in question contains microRNAs from the
human, mouse, rat, worm, fly and several plant genomes. The
simultaneous processing of microRNA sequences from distinct
organisms permits the discovery of conserved sequences both within
and across species and makes the method suitable for the analysis
of more than one organism. Release 3.0 of RFAM (January 2004),
which was used as our input, contained 719 microRNA precursor
sequences.
[0035] We used a scheme based on BLASTN (Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W., Lipman, D. J. Basic local alignment search
tool. J Mol Biol. 215 403-410 (1990)) to remove duplicate and
near-duplicate entries from the initial collection. The final set
comprised 530 microRNA precursor sequences. In this cleaned-up set,
no two sequences agreed on more than 90% of their positions. We
next describe in detail the BLASTN-based cleanup scheme.
[0036] We assume that we are given N sequences of variable length
and a user-defined threshold X for the permitted, maximum remaining
pair-wise sequence similarity. The sequence-based clustering scheme
that we employed is shown below. Upon termination, the set CLEAN
contains sequences no pair of which agrees on more than X % of the
positions in the shorter of the two sequences. For our analysis, we
set X=90%.
[0037] sort the N sequences in order of decreasing length; let S_i
denote the i-th sequence of the sorted set (i=1, . . . , N)
[0038] CLEAN ← {S_1}
[0039] for i=2 through N do
[0040]    use S_i as query to run BLAST against the current contents
          of CLEAN
          if the top BLAST hit T agrees with S_i at more than X % of
          S_i's positions
[0041]    then
[0042]       make S_i a member of the cluster represented by T;
             discard S_i;
[0043]    else
[0044]       CLEAN ← CLEAN ∪ {S_i};
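The clustering scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: the function `percent_identity` is a hypothetical stand-in for the BLASTN comparison (it performs a simple ungapped comparison and is not BLAST), and the names `remove_near_duplicates`, `clean` and `clusters` are likewise introduced here for illustration.

```python
def percent_identity(query, target):
    """Stand-in for a BLASTN comparison (an assumption of this sketch).

    Slides the query along the target without gaps and returns the best
    percentage of query positions that agree with the target. The query
    is expected to be no longer than the target.
    """
    best = 0
    for offset in range(len(target) - len(query) + 1):
        matches = sum(q == t for q, t in zip(query, target[offset:]))
        best = max(best, matches)
    return 100.0 * best / len(query)

def remove_near_duplicates(sequences, threshold=90.0):
    """Greedy clustering: process sequences from longest to shortest and
    keep one only if its best match among the kept sequences agrees with
    it at no more than `threshold` percent of its positions."""
    ordered = sorted(sequences, key=len, reverse=True)
    clean = []      # representatives kept so far (the set CLEAN)
    clusters = {}   # representative -> discarded members of its cluster
    for seq in ordered:
        top_hit = max(clean, key=lambda rep: percent_identity(seq, rep),
                      default=None)
        if top_hit is not None and percent_identity(seq, top_hit) > threshold:
            clusters[top_hit].append(seq)   # discard seq, remember its cluster
        else:
            clean.append(seq)
            clusters[seq] = []
    return clean, clusters

# Hypothetical toy input, for illustration only.
toy = ["ACGTACGTACGT", "ACGTACGTACGT", "GGGGCCCCTTTT", "CGTACGTACG"]
clean, clusters = remove_near_duplicates(toy)
```

Sorting by decreasing length means each query is compared only against sequences that are at least as long, which matches the convention of measuring agreement over the positions of the shorter sequence.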
[0045] This non-redundant input was then processed using the
Teiresias algorithm (Rigoutsos, I. and Floratos, A. Combinatorial
pattern discovery in biological sequences: The TEIRESIAS algorithm.
Bioinformatics 14 55-67 (1998)) in order to discover intra- and
inter-species patterns of sequence conservation. The combinatorial
nature of the algorithm and the guaranteed discovery of all
patterns contained in the processed input make Teiresias a good
choice for addressing this task. The nature of the patterns that
can be discovered is controlled by three parameters: L, the minimum
number of symbols participating in a pattern; W, the maximum
permitted span of any L consecutive (not necessarily contiguous) symbols in a
pattern; and K, the minimum number of instances required of a
pattern before it can be reported. We also enforced a statistical
significance requirement. The significance of each pattern was
estimated with the help of a second-order Markov chain which was
built from actual genomic data. Application of the significance
filter reduced the number of patterns that were used in the
subsequent phases of the algorithm. Details on the Teiresias
algorithm and its properties, the three parameters L/W/K, and how
to estimate log-probabilities are given below.
[0046] The Teiresias algorithm requires that the three parameters
L, W and K be set. The three parameters that control the discovery
process were set to L=7, W=10 and K=2. 120,789,247 variable length
patterns were discovered in the processed input set. Patterns with
log-probability >-34.0 were removed resulting in a final set of
192,240 statistically-significant, microRNA precursor specific
patterns. We next describe in detail how these parameters control
the number and character of the discovered patterns.
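The significance cut-off described in this paragraph amounts to a simple threshold test, sketched below. The cut-off value of -34.0 is the one stated in the text; the function name `keep_significant`, the example patterns and their log-probability values are hypothetical and serve only to illustrate the comparison.

```python
LOG_PROB_CUTOFF = -34.0  # cut-off stated in the text

def keep_significant(log_probs, cutoff=LOG_PROB_CUTOFF):
    """Remove every pattern whose log-probability exceeds the cut-off."""
    return {pattern: lp for pattern, lp in log_probs.items() if lp <= cutoff}

# Hypothetical patterns and log-probabilities, for illustration only.
candidates = {"ACGT.ACGTAC": -41.2, "A.AAAAAA": -12.7, "TTG.CA.GTCA": -34.0}
significant = keep_significant(candidates)
```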
[0047] The parameter L controls the minimum possible size of the
discovered patterns. The parameter W satisfies the inequality
W ≥ L and controls the `degree of conservation` across the
various instances of the reported patterns. Setting W to smaller
(respectively larger) values permits fewer (respectively more)
mismatches across the instances of each of the discovered patterns.
Finally, the parameter K controls the minimum number of instances
that a pattern must have before it can be reported.
[0048] For a given choice of L, W and K, Teiresias guarantees that
it will report all patterns that have K or more appearances in the
processed input and are such that any L consecutive (but not
necessarily contiguous) positions span at most W positions. It is
important to stress that even though no pattern can have fewer than
L literals, the patterns' maximum length is unconstrained and
limited only by the size of the database.
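The constraint that L and W place on a pattern can be checked mechanically. The following is a minimal sketch, not part of the Teiresias implementation: the function name `satisfies_density` is an assumption, and the pattern is assumed to be given as a string with one character per position (a literal, or `.` for a wildcard), so bracketed sets would first need to be split into one token per position.

```python
def satisfies_density(pattern, L, W):
    """Check the L/W constraint on a rigid pattern.

    The pattern must have at least L occupied (non-wildcard) positions,
    and every run of L consecutive occupied positions must span at most
    W positions of the pattern.
    """
    occupied = [i for i, symbol in enumerate(pattern) if symbol != "."]
    if len(occupied) < L:
        return False
    return all(
        occupied[k + L - 1] - occupied[k] + 1 <= W
        for k in range(len(occupied) - L + 1)
    )

# Illustrative checks with L=7 and W=10.
dense = satisfies_density("ACG.TAC..G", 7, 10)       # 7 literals in 10 positions
sparse = satisfies_density("A.C.G.T.A.C.G", 7, 10)   # 7 literals in 13 positions
short = satisfies_density("ACGTAC", 7, 10)           # only 6 literals
```

With L=7 and W=10, any window that passes this check has at least 7 of its 10 positions occupied by literals.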
[0049] Setting L to small values permits the identification of
shorter conserved motifs that may be present in the processed
input. As mentioned above, even if L is set to small values,
patterns that are longer than L will be discovered and reported.
Generally speaking, in order for a short motif to be considered
statistically significant it will need to have a large number of
copies in the processed input. Setting L to large values will
generally permit the identification of statistically significant
motifs even if these motifs repeat only a small number of times.
This increase in specificity will happen at the expense of a
potentially significant decrease in sensitivity.
[0050] For the work described herein, we selected L=7. This choice
is dictated by the desire to capture potential commonalities among
the seed regions of diverse microRNAs; setting L to a value that is
smaller than the 6 nucleotides typically associated with the seed
regions gives us added flexibility. We also set W=10, a choice that
is dictated by the desire to capture sequence commonalities where
the local conservation is at least 70%. In other words, any
reported pattern will have more than 2/3 of its positions occupied
by literals. Finally, we set K=2. This is a natural consequence of
the fact that we generate conserved sequence motifs through an
unsupervised pattern discovery scheme. The value of 2 is the
smallest possible one (a pattern or motif, by definition, must
appear at least two times in the processed input) and guarantees
that all patterns will be discovered.
[0051] Step 114 is the step of statistically filtering the patterns
that were generated in step 112. The step of filtering is done by
estimating the log-probability of each pattern with the help of a
Markov-chain. We next describe in detail how to use Markov chains
to estimate the log-probabilities of patterns. The computation is
carried out in the same manner for all of the patterns.
[0052] Real genomic data was used to estimate the frequency of
trinucleotides that could span as many as 23 positions--there are
at most 20 wild cards between the first and last nucleotide of the
triplet. In other words, we computed the frequencies of all
trinucleotides of the form:
TABLE-US-00001
AAA
AA.A
AA..A
...
AA....................A
A.AA
A.A.A
A.A..A
...
T....................TT
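As a minimal sketch of how such counts might be collected from a sequence, the following Python function enumerates every triplet whose wild cards total at most 20. The function name, the key format (a string such as 'AA..A') and the use of a Counter are assumptions for illustration; the counting procedure actually used is not specified here.

```python
from collections import Counter

def gapped_trinucleotide_counts(seq: str, max_wildcards: int = 20) -> Counter:
    """Count triplets x.{g1}y.{g2}z with g1 + g2 <= max_wildcards.

    Keys are strings such as 'AA..A'; '.' marks a wild-card position.
    """
    counts = Counter()
    n = len(seq)
    for i in range(n):
        for g1 in range(max_wildcards + 1):
            j = i + 1 + g1            # position of the second nucleotide
            if j >= n:
                break
            for g2 in range(max_wildcards - g1 + 1):
                k = j + 1 + g2        # position of the third nucleotide
                if k >= n:
                    break
                key = seq[i] + "." * g1 + seq[j] + "." * g2 + seq[k]
                counts[key] += 1
    return counts
```

With the default of 20 wild cards, the longest key spans 23 positions, matching the limit stated above.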
[0053] With these counts at hand, we used Bayes' theorem to
estimate the probability that a given pattern could be generated
from a random database. Let us use the pattern
[0054] A..[AT].C..T...G
to describe the approach. Observe that we can write:
[0055] Pr(A..[AT].C..T...G)=
[0056] Pr(C..T...G/A..[AT].C..T)*Pr(A..[AT].C..T)=
[0057] Pr(C..T...G/C..T)*Pr(A..[AT].C..T)=
[0058] Pr(C..T...G/C..T)*Pr([AT].C..T/A..[AT].C)*Pr(A..[AT].C)=
[0059] Pr(C..T...G/C..T)*Pr([AT].C..T/[AT].C)*Pr(A..[AT].C)=
[0060] Pr(C..T...G/C..T)*Pr([AT].C..T/[AT].C)*Pr(A..[AT].C/A..[AT])=
[0061] #(C..T...G)/(#(C..T...A)+#(C..T...C)+#(C..T...G)+#(C..T...T))*
[0062] #([AT].C..T)/(#([AT].C..A)+#([AT].C..C)+#([AT].C..G)+#([AT].C..T))*
[0063] #(A..[AT].C)/(#(A..[AT].A)+#(A..[AT].C)+#(A..[AT].G)+#(A..[AT].T))
[0064] Note that all of the counts #(.) are available directly
from the Markov chain and thus can be substituted for in the last
equation. This in turn allows us to estimate the
Pr(A..[AT].C..T...G) as well as the log(Pr(A..[AT].C..T...G)).
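The factorisation above can be sketched in Python as follows. This is an illustrative reading of the worked example, not the original code: the function names, the counts mapping (gapped-trinucleotide string to count), the use of the natural logarithm, and the handling of unseen contexts are all assumptions.

```python
import math
import re
from itertools import product

ALPHABET = "ACGT"

def tokenize(pattern):
    """One token per position: '.', a letter, or a class such as '[AT]'."""
    return re.findall(r"\[[^\]]+\]|.", pattern)

def class_count(counts, a, g1, b, g2, c):
    """Sum trinucleotide counts over every letter of the classes a, b, c."""
    return sum(counts.get(x + "." * g1 + y + "." * g2 + z, 0)
               for x, y, z in product(a, b, c))

def pattern_log_probability(pattern, counts):
    """Estimate log Pr(pattern) as a product of second-order conditionals.

    One factor per literal, starting with the third, each conditioned on
    the two preceding literals, as in the worked example in the text.
    """
    tokens = tokenize(pattern)
    lits = [(i, tok.strip("[]")) for i, tok in enumerate(tokens) if tok != "."]
    log_p = 0.0
    for (i, a), (j, b), (k, c) in zip(lits, lits[1:], lits[2:]):
        g1, g2 = j - i - 1, k - j - 1          # wild cards between literals
        num = class_count(counts, a, g1, b, g2, c)
        den = class_count(counts, a, g1, b, g2, ALPHABET)
        if num == 0 or den == 0:
            return float("-inf")  # unseen context: treated as impossible here
        log_p += math.log(num / den)           # natural log (assumption)
    return log_p
```

A pattern whose estimate falls at or below the chosen log-probability threshold would be retained; a real implementation would likely smooth zero counts rather than return minus infinity.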
[0065] We next describe step 120, the identification of candidate
regions. Step 120 is comprised of step 122 and step 124, as shown
in FIG. 1A.
[0066] Step 122 is the step of locating instances of patterns in
the genomic sequences of interest. We use the 192,240 microRNA
precursor patterns to locate instances in genomic sequences of
interest. Typically, these sequences correspond to the intergenic
and intronic regions of the genome at hand.
[0067] We first remove all low-complexity regions from the genomic
sequences to be processed using the publicly available NSEG program
(Wootton, J. C. and S. Federhen. Statistics of local complexity in
amino acid sequences and sequence databases. Computers and
Chemistry. 1993; 17:149-163) with default parameter settings. In
the filtered sequences, we sought instances of the patterns from
the microRNA-precursor-pattern-set.
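One simple way to seek pattern instances, shown here only as a sketch, is to treat each pattern as a regular expression, since the wild card '.' and bracketed classes such as [AT] are already valid regular-expression syntax. The function name find_instances and the half-open span convention are assumptions for this example; the search procedure actually used is not specified here.

```python
import re

def find_instances(pattern: str, sequence: str):
    """Return (start, end) half-open spans of all instances of a pattern.

    A lookahead is used so that overlapping instances are also reported.
    """
    regex = re.compile(r"(?=(" + pattern + r"))")
    return [(m.start(1), m.end(1)) for m in regex.finditer(sequence)]
```

Each returned span covers as many nucleotides as the pattern itself, which is the quantity needed when accumulating pattern hits.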
[0068] Step 124 is the step of identifying regions in the genomic
sequences of minimum length and supported by a minimum number of
pattern hits. An instance of the microRNA precursor pattern
generates a "pattern hit" which covers as many nucleotides as the
span of the corresponding pattern--this is repeated for all
patterns. Each pattern contributes a support of +1 to all of the
genomic sequence locations spanned by its instance. Clearly, a
given nucleotide position may be hit by more than one pattern. We
make use of precisely this observation to associate genomic regions
which receive multiple pattern hits with putative microRNA
precursors. Conversely, regions which do not correspond to microRNA
precursors are expected to receive a much smaller number of hits,
if any, which of course permits us to differentiate between
background and microRNA precursors.
[0069] Segments of contiguous sequence locations that received more
than 60 patterns and spanned at least 60 positions were excised
together with a 30-nucleotide-long flanking sequence at each
end.
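The accumulation of support and the excision of candidate segments can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the half-open span convention, and the reading of "more than 60 patterns" as a per-position support of at least 61 are ours, not taken from the original implementation.

```python
def candidate_regions(hits, seq_len, min_support=61, min_length=60, flank=30):
    """Return (start, end) half-open candidate regions, flanks included.

    hits: iterable of (start, end) half-open spans, one per pattern instance.
    """
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    diff = [0] * (seq_len + 1)
    for start, end in hits:
        diff[start] += 1
        diff[end] -= 1
    regions, support, run_start = [], 0, None
    for pos in range(seq_len + 1):
        support += diff[pos]              # support at this position
        inside = pos < seq_len and support >= min_support
        if inside and run_start is None:
            run_start = pos               # a supported segment begins
        elif not inside and run_start is not None:
            if pos - run_start >= min_length:
                # Extend by the flank on each side, clipped to the sequence.
                regions.append((max(0, run_start - flank),
                                min(seq_len, pos + flank)))
            run_start = None
    return regions
```

The difference array keeps the cost linear in the sequence length plus the number of hits, which matters when hundreds of thousands of patterns are scanned against a genome.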
[0070] We next describe step 130, the step of subselecting among
candidate regions and reporting the subselected regions. Step 130
is comprised of step 132, step 134, step 136 and step 138, as shown
in FIG. 1A.
[0071] Step 132 is the step of predicting the RNA secondary
structure of the candidate sequences. With the help of the Vienna
package software (Hofacker, I. L. et al. Fast Folding and
Comparison of RNA Secondary Structures. Monatsh. Chem. 125 167-188
(1994)), we predicted the RNA secondary structure of each excised
sequence. Instead of the Vienna package, we could have used the
`mfold` algorithm to predict the hybrid's secondary RNA structure
(Mathews, D. H., Sabina, J., Zuker, M. and Turner, D. H. Expanded
Sequence Dependence of Thermodynamic Parameters Improves Prediction
of RNA Secondary Structure. J. Mol. Biol. 288, 911-940 (1999)).
[0072] Step 134 is the step of filtering candidate sequences based
on the energy of the structure. Only those sequences whose
predicted Gibbs free energy was ≤-18 Kcal/mol were kept and
reported.
[0073] Step 136 is the step of further filtering candidate
sequences based on number of bulges.
[0074] Step 138 is the step of reporting candidate sequences as
microRNA precursors.
[0075] Lastly, as shown in step 139 of FIG. 1A, the results (e.g.,
predictions) of the above processes can be optionally evaluated
through experiments.
[0076] FIG. 1B is a flow diagram illustrating a method for
identifying a mature microRNA sequence in a microRNA precursor
sequence, according to one embodiment of the invention. In each of
the candidate microRNA precursors that were identified in step 130,
we sought to determine the location of the corresponding mature
microRNA. To this end, we used the same method as described above,
only this time we generated patterns from the set of known mature microRNA
sequences.
[0077] We next describe step 140, the step of generating patterns.
Step 140 is comprised of step 142 and step 144, as shown in FIG.
1B.
[0078] Step 142 is the step of processing known microRNAs to
discover intra- and inter-species patterns of conserved sequence.
Similar to step 112, we downloaded 644 mature microRNAs from the
RFAM, Release 3.0 (January, 2004). Subsequent implementations of
our method described herein have used more recent versions of the
RFAM database.
[0079] Step 144 is the step of filtering discovered patterns,
keeping only statistically significant patterns. As in step 114, we
used a scheme based on BLASTN to remove duplicate and
near-duplicate entries from the initial collection. The final set
comprised 354 sequences of mature microRNAs such that no two
remaining sequences agreed on more than 90% of their positions.
[0080] The three parameters that control the discovery process were
set to L=4, W=12 and K=2. 120,789,247 variable length patterns were
discovered in the processed input set, typically spanning fewer
than 22 positions. Patterns with log-probability >-32.0 were
removed resulting in a final set of 233,554
statistically-significant, mature-microRNA patterns.
[0081] We next describe step 150, the step of identifying mature
regions. Step 150 is comprised of step 152, step 154 and step 156,
as shown in FIG. 1B.
[0082] Step 152 is the step of locating instances of patterns in
the candidate precursor sequences. We sought instances of the
233,554 mature microRNA patterns, which were derived from the
processed mature microRNA sequences, in the sequences of the
microRNA precursors that were
identified above. Similar methods as described above in step 122
are incorporated herein.
[0083] Step 154 is the step of identifying regions in the candidate
precursor sequences of a minimum length and supported by a minimum
number of pattern hits. As before, a pattern's instance contributes
a vote of "+1" to all the precursor locations that the instance spans.
All regions that did not overlap with the putative loop of the
precursor and comprised contiguous blocks of locations that were
hit by ≥60 patterns and were at least 18 nucleotides long
were reported as the mature microRNAs corresponding to this
precursor. Similar methods as described above in step 124 are
incorporated herein.
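The selection of mature regions differs from the precursor case mainly in the loop exclusion, which can be sketched as follows. The function name, the per-position support list and the representation of the putative loop as a half-open (start, end) span are assumptions for illustration.

```python
def mature_regions(support, loop, min_support=60, min_length=18):
    """Return (start, end) half-open blocks that avoid the putative loop.

    support: per-position hit counts along one candidate precursor.
    loop: (start, end) half-open span of the putative loop.
    """
    regions, run_start = [], None
    # A sentinel below the threshold closes a block that reaches the end.
    values = list(support) + [min_support - 1]
    for pos, value in enumerate(values):
        if value >= min_support and run_start is None:
            run_start = pos
        elif value < min_support and run_start is not None:
            long_enough = pos - run_start >= min_length
            overlaps_loop = run_start < loop[1] and loop[0] < pos
            if long_enough and not overlaps_loop:
                regions.append((run_start, pos))
            run_start = None
    return regions
```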
[0084] Step 156 is the step of reporting regions as mature
microRNAs.
[0085] Lastly, as shown in step 159 of FIG. 1B, the results (e.g.,
predictions) of the above processes can be optionally evaluated
through experiments.
[0086] We next illustrate the above-described stages (`discovery of
a microRNA precursor`/`discovery of a mature microRNA`) with the
help of the C. elegans genome. In particular, we use the genomic
region in the vicinity of the known microRNA precursor
cel-miR-273.
[0087] FIGS. 2A-D illustrate how, for the genomic sequence under
consideration, the microRNA-precursor-patterns accumulate in the
region of the precursor whereas the microRNA-precursor-patterns are
absent in the other areas. For the shown example sequence,
approximately 500 patterns end up contributing to genomic location
14,946,975. In fact, the contiguous genomic locations that receive
support from the microRNA-precursor-patterns correspond to the
known span of cel-miR-273, which is indicated by the light-grey
rectangle in FIG. 2B. The region that received the substantial
non-zero precursor support was examined for instances of the
mature-microRNA-pattern-set. In FIG. 2C, we show how well the
inventive approach localized the mature microRNA section within the
cel-miR-273 precursor. The actual span of the known mature microRNA
is indicated by the light-grey background.
[0088] FIG. 3A is a graph illustrating the distribution of
pattern-hit-scores for all C. elegans microRNAs within RFAM (solid
line) versus generic hairpins (dashed line).
[0089] FIG. 3B is a graph illustrating the distribution of
predicted folding energies for all C. elegans microRNAs (solid
line) and generic hairpins (dashed line).
[0090] FIG. 3C is an X-Y scatter plot illustrating pattern hits
versus folding energy for C. elegans microRNAs (light-grey-colored
dots) and generic hairpins (dark-grey-colored dots).
[0091] We used the 192,240 members of the
microRNA-precursor-pattern-set to determine how well they covered
those of the training sequences which originated in C. elegans.
Almost all of the known C. elegans precursors contained ≥100
instances of the precursor patterns. The solid-line curve in FIG.
3A shows the probability density function for the number of
precursors which contained a given number of pattern instances in
them.
[0092] We next generated randomly what we refer to as a generic
hairpin set. This hairpin set was designed so as to comprise
sequences whose geometric features were characteristic of all known
microRNA precursors, namely, a hairpin-shaped secondary structure
and lengths in the interval [60,120] nucleotides. First, we
randomly selected numerous regions with lengths uniformly
distributed between 60 and 120 nucleotides. There was no
restriction as to where in the C. elegans genome these regions were
located.
[0093] Then, we inspected the predicted RNA secondary structure of
these regions and kept only those which formed hairpins and did not
include any low-complexity regions. Starting with an initial set of
120,000 randomly selected regions (=10,000 × 2 strands × 6
chromosomes), and discarding as described above, we were left with
a total of 20,560 generic hairpins. These hairpins are used to
sample the "background" distribution of hairpins and to estimate
its properties.
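The random selection of regions can be sketched as follows; the subsequent folding and low-complexity checks are not shown. The function name, the chromosome-length mapping, the tuple layout and the fixed seed are assumptions made for this illustration.

```python
import random

def sample_regions(chrom_lengths, n_per_strand, min_len=60, max_len=120, seed=0):
    """Draw random regions with lengths uniform in [min_len, max_len].

    chrom_lengths: mapping of chromosome name to its length.
    Returns (chromosome, strand, start, end) tuples with half-open spans.
    """
    rng = random.Random(seed)
    regions = []
    for chrom, length in chrom_lengths.items():
        for strand in "+-":
            for _ in range(n_per_strand):
                size = rng.randint(min_len, max_len)   # inclusive on both ends
                start = rng.randrange(0, length - size + 1)
                regions.append((chrom, strand, start, start + size))
    return regions
```

Calling it with six chromosomes and n_per_strand=10,000 would yield the 120,000 regions mentioned above.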
[0094] We examined these generic hairpins for instances of the
microRNA precursor patterns. The dashed-line curve in FIG. 3A shows
the probability density function for the percentage of the generic
hairpins that contained a certain number of pattern instances.
Setting the support threshold to 60 pattern-instances captures 104
of the 114 known C. elegans microRNAs or 91%. On the other hand,
less than 1% of the members of the generic hairpin set exceed
this threshold. This is an important result that demonstrates that the
microRNA precursor patterns capture sequence properties which are
specific to microRNA precursors and can effectively distinguish
them from randomly selected regions that simply happen to fold into
"stem-loop-stem" structures.
[0095] In addition to the distribution of pattern instances, we
also examined the distribution of the Gibbs free energy values that
are computed from the generic hairpin set (dashed-line curve) and
the known C. elegans precursors (solid-line curve) and show the
results in FIG. 3B. Setting the energy threshold to -25 Kcal/mol
captures 107 of the 114 known C. elegans microRNA precursors or
94%, but only 7% of the sequences in the generic hairpin set pass
this threshold.
[0096] Finally, we examined how well a combination of the "energy"
and the "pattern-instances" filters separates the known microRNA
precursors (light-grey colored dots) from the generic hairpin set
(dark-grey colored dots). The results are presented in FIG. 3C. As
can be seen in FIG. 3C, there is very little correlation between
these two criteria and their combined application provides a simple
yet powerful discriminator. The combined threshold of ≥60
pattern instances and a predicted Gibbs energy ≤-25 Kcal/mol
allows us to identify 78 of the 114 known C. elegans precursors
whereas less than 1% of the generic hairpins exceed this double
threshold. This translates into an estimated sensitivity of 67% for
our precursor prediction method and an estimated false-positive
ratio that is ≤1%.
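The double threshold and the resulting rate estimates can be sketched as follows. The function names and the representation of each sequence as a (pattern hits, energy) pair are assumptions; applying pass_rate to the known precursors gives a sensitivity estimate, and applying it to the generic hairpins gives a false-positive estimate.

```python
def passes(hits, energy, min_hits=60, max_energy=-25.0):
    """True if a sequence clears both the pattern-hit and energy filters."""
    return hits >= min_hits and energy <= max_energy

def pass_rate(items, min_hits=60, max_energy=-25.0):
    """Fraction of (hits, energy) pairs that clear both filters."""
    items = list(items)
    passed = sum(passes(h, e, min_hits, max_energy) for h, e in items)
    return passed / len(items)
```

The max_energy argument can be set to -18.0 to reproduce the looser of the two energy thresholds discussed in this document.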
[0097] We repeated the above generic-hairpin analysis for the
remaining three genomes of our collection. The remaining three
genomes were D. melanogaster, M. musculus and H. sapiens. By way of
example, such sequences are described in detail in Application No.
60/652,499, the disclosure of which is incorporated by reference
herein. Also, such sequences are described in detail in the
above-mentioned related U.S. patent application (YOR920040675US2),
the disclosure of which is incorporated herein. The estimated
false-positive ratios remained very low, and similar in magnitude
to the case of C. elegans above. In particular, the estimates we
generated for the false-positive ratio when predicting microRNA
precursors in the other three genomes ranged from ≤1% (for
hairpins with Gibbs energies of -25 Kcal/mol or less) to ≤2%
(for hairpins with Gibbs energies of -18 Kcal/mol or less). Given
that the four genomes span a very wide evolutionary spectrum, it is
reasonable to assume that these values are characteristic of our
method and independent of the identity of the genome that is
used.
[0098] FIG. 4 is a table summarizing the microRNA-precursor
predictions for the genomes of C. elegans, D. melanogaster, M.
musculus and H. sapiens.
[0099] We have analyzed the intergenic and intronic regions of four
complete genomes, as illustrated in FIG. 4. Results are reported
for two values for the Gibbs energy threshold, namely -18 Kcal/mol
and -25 Kcal/mol.
[0100] As can be seen from FIG. 4, the method correctly identifies
a very large percentage of the known microRNA precursors in these
four genomes, for the used thresholds. Additionally, we also
predict many novel microRNA precursors. Their numbers are
significantly higher than what has previously been discussed in the
literature. In light of the very low error rate estimates of our
method, we believe that a substantial number of our microRNA
precursor predictions are likely correct.
[0101] FIG. 5 is a block diagram of a system 500 for determining
whether a nucleotide sequence contains a microRNA precursor in
accordance with one embodiment of the present invention. System 500
comprises a computer system 510 that interacts with a media 550.
Computer system 510 comprises a processor 520, a network interface
525, a memory 530, a media interface 535 and an optional display
540. Network interface 525 allows computer system 510 to connect to
a network, while media interface 535 allows computer system 510 to
interact with media 550, such as Digital Versatile Disk (DVD) or a
hard drive.
[0102] As is known in the art, the methods and apparatus discussed
herein may be distributed as an article of manufacture that itself
comprises a computer-readable medium having computer-readable code
means embodied thereon. The computer-readable program code means is
operable, in conjunction with a computer system such as computer
system 510, to carry out all or some of the steps to perform the
methods or create the apparatuses discussed herein. The
computer-readable code is configured to generate patterns by
processing a collection of already known mature microRNA sequences;
assign one or more attributes to the generated patterns; subselect
only the patterns whose attributes satisfy certain criteria;
generate the reverse complement of the subselected patterns; and
use the reverse complement of the subselected patterns to analyze
the nucleotide sequence. The computer-readable medium may be a
recordable medium (e.g., floppy disks, hard drive, optical disks
such as a DVD, or memory cards) or may be a transmission medium
(e.g., a network comprising fiber-optics, the world-wide web,
cables, or a wireless channel using time-division multiple access,
code-division multiple access, or other radio-frequency channel).
Any medium known or developed that can store information suitable
for use with a computer system may be used. The computer-readable
code means is any mechanism for allowing a computer to read
instructions and data, such as magnetic variations on a magnetic
medium or height variations on the surface of a compact disk.
[0103] Memory 530 configures the processor 520 to implement the
methods, steps, and functions disclosed herein. The memory 530
could be distributed or local and the processor 520 could be
distributed or singular. The memory 530 could be implemented as an
electrical, magnetic or optical memory, or any combination of these
or other types of storage devices. Moreover, the term "memory"
should be construed broadly enough to encompass any information
able to be read from or written to an address in the addressable space
accessed by processor 520. With this definition, information on a
network, accessible through network interface 525, is still within
memory 530 because the processor 520 can retrieve the information
from the network. It should be noted that each distributed
processor that makes up processor 520 generally contains its own
addressable memory space. It should also be noted that some or all
of computer system 510 can be incorporated into an
application-specific or general-use integrated circuit.
[0104] Optional video display 540 is any type of video display
suitable for interacting with a human user of system 500.
Generally, video display 540 is a computer monitor or other similar
video display.
[0105] It is to be appreciated that, in an alternative embodiment,
the invention may be implemented in a network-based implementation,
such as, for example, the Internet. The network could alternatively
be a private network and/or local network. It is to be understood
that the server may include more than one computer system. That is,
one or more of the elements of FIG. 5 may reside on and be executed
by their own computer system, e.g., with its own processor and
memory. In an alternative configuration, the methodologies of the
invention may be performed on a personal computer and output data
transmitted directly to a receiving module, such as another
personal computer, via a network without any server intervention.
The output data can also be transferred without a network. For
example, the output data can be transferred by simply downloading
the data onto, e.g., a floppy disk, and uploading the data on a
receiving module.
[0106] Presented herein is a novel and robust pattern-based
methodology for the identification of microRNA precursors and their
corresponding mature microRNAs directly from genomic sequence. With
the help of patterns derived by processing the sequences of known
microRNA precursors, our method identifies genomic regions where
numerous instances of these patterns aggregate and subselects among
them following energy based filtering.
[0107] The following are examples of advantages that characterize
the inventive approach provided herein: a) the inventive approach
obviates the need to enforce a cross-species conservation filtering
before reporting results, thus allowing the discovery of microRNA
precursors that may not be shared even by closely related species;
b) the inventive approach can be applied to the analysis of any
genome that potentially harbors endogenous microRNAs without the
need to be retrained each time.
[0108] Although illustrative embodiments of the present invention
have been described herein, it is to be understood that the
invention is not limited to those precise embodiments, and that
various other changes and modifications may be made by one skilled
in the art without departing from the scope or spirit of the
invention.
* * * * *