U.S. patent application number 13/633673 was filed with the patent office on 2012-10-02 and published on 2013-01-31 for classification of nucleic acid templates.
This patent application is currently assigned to Pacific Biosciences of California, Inc. The applicants listed for this patent are Benjamin Flusberg, Jeremiah Hanes, Lei Jia, Jonas Korlach, Jessica Lee, John Lyle, Joseph Puglisi, Jon Sorenson, Kevin Travers, Stephen Turner, and Dale Webster. Invention is credited to Benjamin Flusberg, Jeremiah Hanes, Lei Jia, Jonas Korlach, Jessica Lee, John Lyle, Joseph Puglisi, Jon Sorenson, Kevin Travers, Stephen Turner, and Dale Webster.
Application Number | 13/633673 |
Publication Number | 20130029853 |
Family ID | 42243263 |
Filed Date | 2012-10-02 |
Publication Date | 2013-01-31 |
United States Patent Application | 20130029853 |
Kind Code | A1 |
Flusberg; Benjamin; et al. | January 31, 2013 |
CLASSIFICATION OF NUCLEIC ACID TEMPLATES
Abstract
Methods, compositions, and systems are provided for
characterization of modified nucleic acids. In certain preferred
embodiments, single molecule sequencing methods are provided for
identification of modified nucleotides within nucleic acid
sequences. Modifications detectable by the methods provided herein
include chemically modified bases, enzymatically modified bases,
abasic sites, non-natural bases, secondary structures, and agents
bound to a template nucleic acid.
Inventors: |
Flusberg; Benjamin (Atlanta, GA)
Turner; Stephen (Menlo Park, CA)
Lee; Jessica (Cupertino, CA)
Jia; Lei (North Potomac, MD)
Korlach; Jonas (Newark, CA)
Sorenson; Jon (Alameda, CA)
Webster; Dale (San Mateo, CA)
Lyle; John (Fremont, CA)
Travers; Kevin (Menlo Park, CA)
Hanes; Jeremiah (Redwood City, CA)
Puglisi; Joseph (Stanford, CA) |
Applicant: |
Name | City | State | Country | Type |
Flusberg; Benjamin | Atlanta | GA | US | |
Turner; Stephen | Menlo Park | CA | US | |
Lee; Jessica | Cupertino | CA | US | |
Jia; Lei | North Potomac | MD | US | |
Korlach; Jonas | Newark | CA | US | |
Sorenson; Jon | Alameda | CA | US | |
Webster; Dale | San Mateo | CA | US | |
Lyle; John | Fremont | CA | US | |
Travers; Kevin | Menlo Park | CA | US | |
Hanes; Jeremiah | Redwood City | CA | US | |
Puglisi; Joseph | Stanford | CA | US | |
Assignee: |
Pacific Biosciences of California, Inc. (Menlo Park, CA) |
Family ID: |
42243263 |
Appl. No.: |
13/633673 |
Filed: |
October 2, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12635618 | Dec 10, 2009 | |
13633673 | | |
61201551 | Dec 11, 2008 | |
61180350 | May 21, 2009 | |
61186661 | Jun 12, 2009 | |
Current U.S. Class: |
506/2; 435/5; 435/6.11; 435/6.12; 977/902 |
Current CPC Class: |
C12Q 1/6837 20130101; C12Q 1/6858 20130101; C12Q 2561/113 20130101;
C12Q 1/6869 20130101; C12Q 1/6858 20130101; C12Q 2537/164 20130101;
C12Q 2527/113 20130101; C12Q 2521/101 20130101; C12Q 1/6858 20130101;
C12Q 2537/164 20130101; C12Q 2525/307 20130101; C12Q 2521/101 20130101;
C12Q 1/6858 20130101; C12Q 2525/117 20130101; C12Q 2527/113 20130101;
C12Q 2521/101 20130101 |
Class at Publication: |
506/2; 435/6.12; 435/5; 435/6.11; 977/902 |
International Class: |
C12Q 1/68 20060101 C12Q001/68; C12Q 1/70 20060101 C12Q001/70;
C40B 20/00 20060101 C40B020/00; G01N 21/64 20060101 G01N021/64 |
Claims
1. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) locking the forward and reverse strands of
the nucleic acid sample together to form a circular pair-locked
molecule; b) obtaining sequence data of the circular pair-locked
molecule via single molecule sequencing, wherein sequence data
comprises sequences of the forward and reverse strands of the
circular pair-locked molecule; and c) determining the sequence of
the double stranded nucleic acid sample and the position of the at
least one modified base in the sequence of the double stranded
nucleic acid sample by comparing the sequences of the forward and
reverse strands of the circular pair-locked molecule.
2. The method of claim 1, wherein the double stranded nucleic acid
sample comprises at least one modified base chosen from
5-bromouracil, uracil, 5,6-dihydrouracil, ribothymine,
7-methylguanine, hypoxanthine, and xanthine.
3. The method of claim 1, wherein at least one modified base in the
double-stranded nucleic acid sample is paired with a base with a
base pairing specificity different from its preferred partner
base.
4. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) locking the forward and reverse strands of
the nucleic acid sample together to form a circular pair-locked
molecule; b) altering the base-pairing specificity of bases of a
specific type in the circular pair-locked molecule; c) obtaining
sequence data of the circular pair-locked molecule via single
molecule sequencing, wherein sequence data comprises sequences of
the forward and reverse strands of the circular pair-locked
molecule; and d) determining the sequence of the double-stranded
nucleic acid sample and the position of the at least one modified
base in the sequence of the double-stranded nucleic acid sample by
comparing the sequences of the forward and reverse strands of the
circular pair-locked molecule.
5. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) locking the forward and reverse strands
together to form a circular pair-locked molecule; b) obtaining
sequence data of the circular pair-locked molecule via single
molecule sequencing, wherein the sequence data comprises sequences
of the forward and reverse strands of the circular pair-locked
molecule; c) determining the sequence of the double-stranded
nucleic acid sample by comparing the sequences of the forward and
reverse strands of the circular pair-locked molecule; d) obtaining
sequencing data of the circular pair-locked molecule via single
molecule sequencing, wherein at least one nucleotide analog that
discriminates between a base and its modified form is used to
obtain sequence data comprising at least one position wherein the
at least one differentially labeled nucleotide analog was
incorporated; and e) determining the positions of modified bases in
the sequence of the double-stranded nucleic acid sample by
comparing the sequences of the forward and reverse strands.
6. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) locking the forward and reverse strands of
the nucleic acid sample together to form a circular pair-locked
molecule; b) obtaining sequence data of the circular pair-locked
molecule via single molecule sequencing, wherein at least one
nucleotide analog that discriminates between a base and its
modified form is used to obtain sequence data comprising at least
one position wherein the at least one differentially labeled
nucleotide analog was incorporated; and c) determining the sequence
of the double-stranded nucleic acid sample and the position of the
at least one modified base in the sequence of the double-stranded
nucleic acid sample by comparing the sequences of the forward and
reverse strands of the circular pair-locked molecule.
7. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) locking the forward and reverse strands
together to form a circular pair-locked molecule; b) obtaining
sequence data of the circular pair-locked molecule via single
molecule sequencing, wherein the sequence data comprises sequences
of the forward and reverse strands of the circular pair-locked
molecule; c) determining the sequence of the double-stranded
nucleic acid sample by comparing the sequences of the forward and
reverse strands of the circular pair-locked molecule; d) altering
the base-pairing specificity of bases of a specific type in the
circular pair-locked molecule to produce an altered circular
pair-locked molecule; e) obtaining the sequence data of the altered
circular pair-locked molecule wherein the sequence data comprises
sequences of the altered forward and reverse strands; and f)
determining the positions of modified bases in the sequence of the
double-stranded nucleic acid sample by comparing the sequences of
the altered forward and reverse strands.
8. The method of claim 7, wherein the double-stranded nucleic acid
sample is obtained as a primary isolate from a cellular, viral, or
environmental source.
9. The method of claim 8, wherein the primary isolate is maintained
at or below 25° C. in conditions substantially free of
divalent cations and nucleic acid modifying enzymes prior to step
(a) of claim 7.
10. The method of claim 7, wherein the double-stranded nucleic acid
sample is obtained from an in vitro reaction or from extracellular
nucleic acid.
11. The method of claim 7, wherein altering the base-pairing
specificity of bases of a specific type in the circular pair-locked
molecule comprises bisulfite treatment.
12. The method of claim 7, wherein altering the base-pairing
specificity of bases of a specific type in the circular pair-locked
molecule comprises photochemical transition.
13. The method of claim 7, wherein locking the forward and reverse
strands together comprises joining two nucleic acid inserts, which
may be identical or non-identical, to the double-stranded nucleic
acid sample, one to each end.
14. The method of claim 13, wherein the nucleic acid inserts have
lengths ranging from 14 to 200 nucleotide residues.
15. The method of claim 13, wherein the nucleic acid inserts have
known sequences.
16. The method of claim 13, wherein the nucleic acid inserts form
hairpins with overhangs, and the nucleic acid sample has overhangs
compatible with the overhangs of the nucleic acid inserts.
17. The method of claim 13, wherein obtaining sequence data
comprises annealing a primer complementary to at least part of at
least one of the nucleic acid inserts to the template and extending
the primer.
18. The method of claim 13, wherein at least one of the nucleic
acid inserts comprises a promoter, and obtaining sequence data
comprises contacting the promoter with an RNA polymerase that
recognizes the promoter followed by synthesizing a product nucleic
acid molecule comprising ribonucleotide residues.
19. The method of claim 13, wherein joining is achieved by
ligation.
20. The method of claim 7, wherein the double-stranded nucleic acid
sample comprises a plurality of samples linked together.
21. The method of claim 20, wherein the samples of said plurality
are linked via intervening nucleic acid inserts.
22. The method of claim 21, wherein locking the forward and reverse
strands together comprises ligating a complex formed by contacting
the overhangs of the nucleic acid inserts with the compatible
overhangs of the nucleic acid sample.
23. The method of claim 7, wherein the double-stranded nucleic acid
sample is a genomic DNA fragment.
24. The method of claim 7, wherein the double-stranded nucleic acid
sample comprises at least one RNA strand.
25. The method of claim 7, wherein said single molecule sequencing
comprises sequencing by a method chosen from single molecule
sequencing by synthesis, and ligation sequencing.
26. The method of claim 7, wherein said single molecule sequencing
comprises real-time single molecule sequencing by synthesis.
27. The method of claim 7, wherein said single molecule sequencing
comprises single molecule sequencing by synthesis by a method
chosen from pyrosequencing, reversible terminator sequencing, and
third-generation sequencing.
28. The method of claim 7, wherein said single molecule sequencing
comprises nanopore sequencing.
29. The method of claim 7, wherein: the forward and reverse strands
of the circular pair-locked molecule are locked together by nucleic
acid inserts; the sequence data obtained in step (b) comprises at
least two copies of the sequence of the circular pair-locked
molecule, each copy comprising sequences of first and second
insert-sample units; the sequences of the first and second
insert-sample units comprise insert sequences, which may be
identical or non-identical, and oppositely oriented repeats of the
sequence of the nucleic acid sample; and the method further
comprises: g) calculating scores of the sequences of at least four
inserts contained in the sequence data by comparing the sequences
of the at least four inserts to the known sequences of the inserts;
h) accepting or rejecting at least four of the repeats of the
sequence of the nucleic acid sample contained in the sequence data
according to the scores of one or both of the sequences of the
inserts immediately upstream and downstream of the sample
sequences, subject to the condition that at least one sample
sequence in each orientation is accepted; i) compiling an accepted
sequence set comprising the at least one sample sequence in each
orientation accepted in step (h); and j) determining the sequence
of the nucleic acid sample using the accepted sequence set.
30. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) linking the forward and reverse strands of
the nucleic acid sample together to form a circular template
molecule comprising a double-stranded segment joined at both ends
by linking oligonucleotides; b) obtaining sequence data of the
circular template molecule via single molecule sequencing, wherein
sequence data comprises sequences of the forward and reverse
strands of the circular template molecule; and c) determining the
sequence of the double stranded nucleic acid sample and the
position of the at least one modified base in the sequence of the
double stranded nucleic acid sample by comparing the sequences of
the forward and reverse strands of the circular template
molecule.
31. The method of claim 30, wherein the double stranded nucleic
acid sample comprises at least one modified base chosen from
uracil, dihydrouridine, and methyl-7-guanosine.
32. The method of claim 30, wherein at least one modified base in
the double-stranded nucleic acid sample is paired with a base with
a base pairing specificity different from its preferred partner
base.
33. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) linking the forward and reverse strands of
the nucleic acid sample together to form a circular template
molecule comprising a double-stranded segment joined at both ends
by linking oligonucleotides; b) altering a base of a specific type
in the circular template molecule; c) obtaining sequence data of
the circular template molecule via single molecule sequencing,
wherein sequence data comprises sequences of the forward and
reverse strands of the circular template molecule; and d)
determining the sequence of the double-stranded nucleic acid sample
and the position of the at least one modified base in the sequence
of the double-stranded nucleic acid sample by comparing the
sequences of the forward and reverse strands of the circular
template molecule.
34. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) linking the forward and reverse strands
together to form a circular template molecule comprising a double
stranded segment joined at both ends by linking oligonucleotides;
b) obtaining sequence data of the circular template molecule via
single molecule sequencing, wherein the sequence data comprises
sequences of the forward and reverse strands of the circular
template molecule; c) determining the sequence of the
double-stranded nucleic acid sample by comparing the sequences of
the forward and reverse strands of the circular template molecule;
d) obtaining sequencing data of the circular template molecule via
single molecule sequencing, wherein at least one nucleotide analog
that discriminates between a base and its modified form is used to
obtain sequence data comprising at least one position wherein at
least one nucleotide analog that discriminates between a base and
its modified form was incorporated; and e) determining the
positions of modified bases in the sequence of the double-stranded
nucleic acid sample by comparing the sequences of the forward and
reverse strands.
35. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) linking the forward and reverse strands of
the nucleic acid sample together to form a circular template
molecule comprising a double stranded segment joined at both ends
by linking oligonucleotides; b) obtaining sequence data of the
circular template molecule via single molecule sequencing, wherein
at least one nucleotide analog that discriminates between a base
and its modified form is used to obtain sequence data comprising at
least one position wherein at least one analog that discriminates
between a base and its modified form was incorporated; and c)
determining the sequence of the double-stranded nucleic acid sample
and the position of the at least one modified base in the sequence
of the double-stranded nucleic acid sample by comparing the
sequences of the forward and reverse strands of the circular
template molecule.
36. A method of determining a sequence of a double-stranded nucleic
acid sample and a position of at least one modified base in the
sequence, comprising: a) linking the forward and reverse strands
together to form a circular template molecule comprising a double
stranded segment joined at both ends by linking oligonucleotides;
b) obtaining sequence data of the circular template molecule via
single molecule sequencing, wherein the sequence data comprises
sequences of the forward and reverse strands of the circular
template molecule; c) determining the sequence of the
double-stranded nucleic acid sample by comparing the sequences of
the forward and reverse strands of the circular template molecule;
d) altering a base of a specific type in the circular template
molecule to produce an altered circular template molecule; e)
obtaining the sequence data of the altered circular template
molecule wherein the sequence data comprises sequences of the
forward and reverse strands; and f) determining the positions of
modified bases in the sequence of the double-stranded nucleic acid
sample by comparing the sequences of the forward and reverse
strands obtained in e).
37. The method of claim 36, wherein the double-stranded nucleic
acid sample is isolated from a biological source.
38. The method of claim 36, wherein the double-stranded nucleic
acid sample is synthesized or isolated from a biological
source.
39. The method of claim 36, wherein altering a base of a specific
type in the circular template molecule comprises bisulfite
treatment.
40. The method of claim 36, wherein the linking oligonucleotides
are hairpin adaptors, and said linking the forward and reverse
strands together comprises joining two hairpin adaptors, which may
be identical or non-identical, to the double-stranded nucleic acid
sample, one to each end.
41. The method of claim 40, wherein the hairpin adaptors have
lengths ranging from 4 to 100 nucleotide residues.
42. The method of claim 40, wherein the nucleic acid inserts have
known sequences.
43. The method of claim 40, wherein the hairpin adaptors form
overhangs, and the nucleic acid sample has overhangs compatible
with the overhangs of the hairpin adaptors.
44. The method of claim 40, wherein obtaining sequence data
comprises annealing a primer complementary to at least part of at
least one of the hairpin adaptors and extending the primer.
45. The method of claim 40, wherein sequence data is obtained by
contacting the nucleic acid sample with an RNA polymerase that
synthesizes a product nucleic acid molecule comprising
ribonucleotide residues.
46. The method of claim 40, wherein joining is achieved by
ligation.
47. The method of claim 36, wherein the double-stranded nucleic
acid sample comprises multiple copies of a nucleic acid segment of
interest.
48. The method of claim 36, wherein the double-stranded nucleic
acid sample comprises at least one RNA strand.
49. The method of claim 36, wherein said single molecule sequencing
comprises sequencing by synthesis.
50. The method of claim 36, wherein said single molecule sequencing
comprises nanopore sequencing.
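By way of illustration only, the strand comparison recited in the claims that alter base-pairing specificity (e.g., by bisulfite treatment) can be sketched in code. The sketch below is not the claimed method itself. It assumes complete conversion of unmethylated cytosines, error-free reads of equal length that cover the same segment, and reads that have already been separated from the insert sequences. The names `reverse_complement` and `compare_strands` are hypothetical.

```python
# Hypothetical sketch: compare the two strand reads of one converted duplex.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def compare_strands(forward_read, reverse_read):
    """Reconstruct the original forward-strand sequence and list the
    forward-strand cytosines that resisted conversion.

    Both reads are given 5'->3'. The reverse read is reverse-complemented
    so that it can be compared position by position with the forward read.
    """
    rc = reverse_complement(reverse_read)
    if len(rc) != len(forward_read):
        raise ValueError("reads must cover the same segment")
    original = []
    methylated = []
    for i, (f, r) in enumerate(zip(forward_read, rc)):
        if f == r:
            original.append(f)
            if f == "C":
                methylated.append(i)  # this C survived conversion
        elif (f, r) == ("T", "C"):
            original.append("C")  # unmethylated C on the forward strand
        elif (f, r) == ("G", "A"):
            original.append("G")  # unmethylated C on the reverse strand
        else:
            original.append("N")  # disagreement not explained by conversion
    return "".join(original), methylated


# Invented example: duplex ACGTCCGA with only the first CpG methylated.
print(compare_strands("ACGTTTGA", "TTGGACGT"))
```

Because the guanine opposite each forward-strand cytosine is unaffected by the conversion, the reverse strand preserves the original base identity. That is why a single pair-locked molecule can yield both the sequence and the modification positions in this sketch.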
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation application of U.S.
Nonprovisional application Ser. No. 12/635,618, filed Dec. 10,
2009, which claims the benefit of Provisional U.S. Patent
Application No. 61/201,551, filed Dec. 11, 2008; Provisional U.S.
Patent Application No. 61/180,350, filed May 21, 2009; and
Provisional U.S. Patent Application No. 61/186,661, filed Jun. 12,
2009, the full disclosures of which are hereby incorporated herein
by reference in their entireties for all purposes.
BACKGROUND OF THE INVENTION
[0002] Assays for analysis of biological processes are exploited
for a variety of desired applications. For example, monitoring the
activity of key biological pathways can lead to a better
understanding of the functioning of those systems as well as those
factors that might disrupt the proper functioning of those systems.
In fact, various different disease states caused by operation or
disruption of specific biological pathways are the focus of much
medical research. By understanding these pathways, one can model
approaches for affecting them to prevent the onset of the disease
or mitigate its effects once manifested.
[0003] A stereotypical example of the exploitation of biological
process monitoring is in the area of pharmaceutical research and
development. In particular, therapeutically relevant biological
pathways, or individual steps or subsets of individual steps in
those pathways, are often reproduced or modeled in in vitro systems
to facilitate analysis. By observing the progress of these steps or
whole pathways in the presence and absence of potential therapeutic
compositions, e.g., pharmaceutical compounds or other materials,
one can identify the ability of those compositions to affect the in
vitro system, and potentially beneficially affect an organism in
which the pathway is functioning in a detrimental way. By way of
specific example, reversible methylation of the 5 position of
cytosine by methyltransferases is one of the most widely studied
epigenetic modifications. In mammals, 5-methylcytosine (5-MeC)
frequently occurs at CpG dinucleotides, which often cluster in
regions called CpG islands that are at or near transcription start
sites. Methylation of cytosine in CpG islands can interfere with
transcription factor binding and is associated with transcription
repression and gene regulation. In addition, DNA methylation is
known to be essential for mammalian development and has been
associated with cancer and other disease processes. Recently, a new
5-hydroxymethylcytosine epigenetic marker has been identified in
certain cell types in the brain, suggesting that it plays a role in
epigenetic control of neuronal function (S. Kriaucionis, et al.,
Science 2009, 324(5929): 929-30, incorporated herein by reference
in its entirety for all purposes). Further information on cytosine
methylation and its impact on gene regulation, development, and
disease processes is provided in the art, e.g., in A. Bird, Genes
Dev 2002, 16, 6; M. Gardiner-Garden, et al., J Mol Biol 1987, 196,
261; S. Saxonov, et al., Proc Natl Acad Sci USA 2006, 103, 1412; R.
Jaenisch, et al., Nat Genet 2003, 33 Suppl, 245; E. Li, et al.,
Cell 1992, 69, 915; A. Razin, et al., Hum Mol Genet 1995, 4 Spec
No, 1751; P. A. Jones, et al., Nat Rev Genet 2002, 3, 415; P. A.
Jones, et al., Nat Genet 1999, 21, 163; and K. D. Robertson, Nat
Rev Genet 2005, 6, 597, all of which are incorporated herein by
reference in their entireties for all purposes.
[0004] In contrast to determining a human genome, mapping of the
human methylome is a more complex task because the methylation
status differs between tissue types, changes with age, and is
altered by environmental factors (P. A. Jones, et al., Cancer Res
2005, 65, 11241, incorporated herein by reference in its entirety
for all purposes). Comprehensive, high-resolution determination of
genome-wide methylation patterns from a given sample has been
challenging due to the sample preparation demands and short read
lengths characteristic of current DNA sequencing technologies (K.
R. Pomraning, et al., Methods 2009, 47, 142, incorporated herein by
reference in its entirety for all purposes).
[0005] Bisulfite sequencing is the current method of choice for
single-nucleotide resolution methylation profiling (S. Beck, et
al., Trends Genet 2008, 24, 231; and S. J. Cokus, et al., Nature
2008, 452, 215, the disclosures of which are incorporated herein by
reference in their entireties for all purposes). Treatment of DNA
with bisulfite converts unmethylated cytosine, but not 5-MeC, to
uracil (M. Frommer, et al., Proc Natl Acad Sci USA 1992, 89, 1827,
incorporated herein by reference in its entirety for all purposes).
The DNA is then amplified (which converts all uracils into
thymines) and subsequently analyzed with various methods, including
microarray-based techniques (R. S. Gitan, et al., Genome Res 2002,
12, 158, incorporated herein by reference in its entirety for all
purposes) or 2nd-generation sequencing (K. H. Taylor, et al.,
Cancer Res 2007, 67, 8511; and R. Lister, et al., Cell 2008, 133,
523, both incorporated herein by reference in their entireties for
all purposes). While bisulfite-based techniques have greatly
advanced the analysis of methylated DNA, they also have several
drawbacks. First, bisulfite sequencing requires a significant
amount of sample preparation time (K. R. Pomraning, et al., supra).
Second, the harsh reaction conditions necessary for complete
conversion of unmethylated cytosine to uracil lead to degradation
of DNA (C. Grunau, et al., Nucleic Acids Res 2001, 29, E65,
incorporated herein by reference in its entirety for all purposes),
and thus necessitate large starting amounts of the sample, which
can be problematic for some applications.
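The conversion chemistry described above can be mimicked in silico. The following is a minimal sketch for illustration only. The function name `bisulfite_convert`, the convention of passing methylated positions as 0-based indices, and the `amplified` flag are assumptions and do not come from the cited references.

```python
def bisulfite_convert(seq, methylated_positions=frozenset(), amplified=True):
    """Simulate bisulfite treatment of a single DNA strand.

    Unmethylated cytosines become uracil, which reads as thymine after
    amplification. Cytosines at the given 0-based positions are treated
    as 5-MeC and are left unchanged.
    """
    converted_base = "T" if amplified else "U"
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append(converted_base)
        else:
            out.append(base)
    return "".join(out)


# Invented example: only the C at index 1 is methylated.
template = "ACGTCCGA"
print(bisulfite_convert(template, {1}))
print(bisulfite_convert(template, {1}, amplified=False))
```

Comparing the converted read with an untreated reference then reveals methylation status: a C that remains a C was protected, and a C that reads as T was unmethylated.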
[0006] Furthermore, because bisulfite sequencing relies on either
microarray or 2nd-generation DNA sequencing technologies for
its readout of methylation status, it also suffers from the same
limitations as do these methodologies. For array-based procedures,
the reduction in sequence complexity caused by bisulfite conversion
makes it difficult to design enough unique probes for genome-wide
profiling (S. Beck, et al., supra). Most 2nd-generation DNA
sequencing techniques employ short reads and thus have difficulties
aligning to highly repetitive genomic regions (K. R. Pomraning, et
al., supra). This is especially problematic, since many CpG islands
reside in such regions. Given these limitations, bisulfite
sequencing is also not well suited for de novo methylation
profiling (S. Beck, et al., supra).
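The reduction in sequence complexity noted above can be made concrete by counting distinct k-mers before and after conversion. This is a rough sketch only: the helper name `distinct_kmers`, the toy sequence, and the choice of k = 4 are assumptions for illustration, and the strand is taken to be entirely unmethylated.

```python
def distinct_kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Invented toy sequence.
original = "ACGTTCGTACCTTCTT"
# Full conversion of an entirely unmethylated strand: every C reads as T.
converted = original.replace("C", "T")

before = len(distinct_kmers(original, 4))
after = len(distinct_kmers(converted, 4))
print(before, after)
```

Because conversion maps two bases onto one, k-mers that differed only at C/T positions become identical. The number of distinct k-mers can therefore only stay the same or fall, which is the difficulty for unique probe design and for aligning short reads.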
[0007] In another widely used technique, methylated DNA
immunoprecipitation (MeDIP), an antibody against 5-MeC is used to
enrich for methylated DNA sequences (M. Weber, et al., Nat Genet
2005, 37, 853, incorporated herein by reference in its entirety for
all purposes). MeDIP has many advantageous attributes for
genome-wide assessment of methylation status, but it does not offer
as high base resolution as bisulfite treatment-based methods. In
addition, it is also hampered by the same limitations of current
microarray and 2nd-generation sequencing technologies.
[0008] Research efforts aimed at increasing our understanding of
the human methylome would benefit greatly from the development of a
new methylation profiling technology that does not suffer from the
limitations described above. Accordingly, there exists a need for
improved techniques for detection of modifications in nucleic acid
sequences, and particularly nucleic acid methylation.
[0009] Typically, modeled biological systems rely on bulk reactions
that ascertain general trends of biological reactions and provide
indications of how such bulk systems react to different effectors.
While such systems are useful as models of bulk reactions in vivo,
a substantial amount of information is lost in the averaging of
these bulk reaction results. In particular, the activity of and
effects on individual molecular complexes cannot generally be
teased out of such bulk data collection strategies.
[0010] Single-molecule real-time analysis of nucleic acid synthesis
has been shown to provide powerful advantages over nucleic acid
synthesis monitoring that is commonly exploited in sequencing
processes. In particular, by concurrently monitoring the synthesis
process of nucleic acid polymerases as they work in replicating
nucleic acids, one gains advantages of a system that has been
perfected over millions of years of evolution. In particular, the
natural DNA synthesis processes provide the ability to replicate
whole genomes in extremely short periods of time, and do so with an
extremely high level of fidelity to the underlying template being
replicated.
[0011] The present invention is directed to a variety of different
single-molecule real-time analyses for monitoring the progress and
effectors of biological reactions, and in particular detecting
modifications in nucleic acid sequences. For example, the present
invention provides a direct methylation sequencing technology based
on observing the kinetics of single polymerase molecules in real
time and with high multiplex. This technique will provide for fast
and economical analysis of methylation patterns, even in repetitive
genomic regions.
BRIEF SUMMARY OF THE INVENTION
[0012] The present invention is generally directed to the detection
of modified nucleic acid sequences, and particularly the detection
of methylated bases within nucleic acid sequences using real-time
direct detection of such methylated sites. The present invention is
expected to have a major impact on research aiming to illuminate
the role of DNA methylation in human health.
[0013] In certain aspects of the invention, methods are provided
for identification of a modification in a nucleic acid molecule. In
general, a template nucleic acid comprising the modification and an
enzyme capable of processing the template are provided. The
template nucleic acid is contacted with the enzyme, and the
subsequent processing of the template by the enzyme is monitored. A
change in the processing is detected, and this change is indicative
of the presence of the modification in the template. Exemplary
modifications that can be detected by the methods of the invention
include, but are not limited to, methylated bases (e.g.,
5-methylcytosine, N6-methyladenosine, etc.), pseudouridine bases,
7,8-dihydro-8-oxoguanine bases, 2'-O-methyl derivative bases,
nicks, apurinic sites, apyrimidinic sites, pyrimidine dimers,
cisplatin crosslinking products, oxidation damage, hydrolysis
damage, bulky base adducts, thymine dimers, photochemistry reaction
products, interstrand crosslinking products, mismatched bases,
secondary structures, and bound agents. In preferred embodiments,
nucleotides or analogs thereof that are incorporated into a nascent
strand synthesized by the enzyme are distinctly labeled to allow
identification of a sequence of specific nucleotides or nucleotide
analogs so incorporated. In certain preferred embodiments, labels
are linked to nucleotides or nucleotide analogs through a phosphate
group, e.g., a phosphate group other than the alpha phosphate
group. As such, the labels are removed from the nucleotide or
nucleotide analog upon incorporation into the nascent strand.
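The general approach of monitoring enzyme processing and detecting a change can be illustrated with a simple comparison against an unmodified control. This is a hedged sketch under stated assumptions: the kinetic metric (a per-position duration), the ratio threshold of 2.0, the function name `flag_modified_positions`, and all numbers are hypothetical and are not specified in this disclosure.

```python
from statistics import mean


def flag_modified_positions(sample_kinetics, control_kinetics, ratio_threshold=2.0):
    """Return template positions where the sample's mean kinetic value
    is at least ratio_threshold times the control's mean value.

    Both arguments map a template position to a list of per-read
    measurements (e.g., durations in seconds) observed at that position.
    """
    flagged = []
    for pos in sorted(sample_kinetics):
        if pos not in control_kinetics:
            continue  # no baseline available for this position
        baseline = mean(control_kinetics[pos])
        if baseline <= 0:
            continue  # avoid dividing by a non-positive baseline
        if mean(sample_kinetics[pos]) / baseline >= ratio_threshold:
            flagged.append(pos)
    return flagged


# Invented measurements for illustration only.
control = {0: [0.10, 0.12], 1: [0.11, 0.09], 2: [0.10, 0.10]}
sample = {0: [0.11, 0.10], 1: [0.35, 0.40], 2: [0.12, 0.09]}
print(flag_modified_positions(sample, control))
```

A fixed ratio threshold is the simplest possible decision rule. A practical analysis would more likely use a statistical test over many reads, which this sketch does not attempt.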
[0014] In some embodiments, the template nucleic acid is treated
prior to processing by the enzyme, e.g., to alter the modification.
The treatment may be chemical or enzymatic, and includes, e.g.,
glycosylase modification, bisulfite modification, DMS modification,
cytosine methyltransferase modification, TET1 modification, and
cytidine deaminase modification. In some embodiments, non-natural
nucleotide analogs (e.g., pyrene analogs) are incorporated into a
nascent strand synthesized by the enzyme. In some embodiments, the
methods comprise both treatment of the template and incorporation
of non-natural nucleotide analogs into the nascent strand. In some
embodiments, non-natural nucleotides are incorporated into a
nascent strand in a position to pair with a modification in the
template. For example, a methylated cytosine in the template can be
paired with a modified guanine nucleotide analog; a template
modification can pair with a non-natural nucleotide analog to form
a non-natural base pair, e.g., isocytosine and isoguanine;
5-methylisocytosine and isoguanine; Im-Nᴼ and Im-Oᴺ; A*
and T*; and 8-oxoG and adenine. In some embodiments,
non-incorporatable nucleotide analogs bind the template/enzyme
complex, but are not incorporated into the nascent strand, and
detection of this "nonproductive" binding serves as an indication
of the modification in the template. Such non-incorporatable
nucleotide analogs are preferably distinctly labeled to facilitate
monitoring, and optionally to distinguish such binding from
incorporation of incorporatable nucleotide analogs that comprise
labels.
[0015] In certain embodiments, the template nucleic acid comprises
regions of internal complementarity (e.g., a double-stranded
portion) and at least one single-stranded portion, and preferably
the modification is located within at least one of the regions of
internal complementarity. In certain embodiments, the template is a
circular template. In certain embodiments, the template is a
circular template comprising at least two regions of internal
complementarity. In certain embodiments, the enzyme is a
polymerase, such as a DNA polymerase, an RNA polymerase, a reverse
transcriptase, or a derivative or variant thereof. In preferred
embodiments, the enzyme is a polymerase enzyme capable of strand
displacement. In specific embodiments, the enzyme is a Φ29
polymerase, optionally comprising at least one mutation at a
position selected from the group consisting of K392, K422, I93,
M188, K392, V399, T421, K422; S95, Y101, M102; Q99, L123, K124,
T189, A190; G191, S388; P127, L384, N387, S388; and L389, Y390, and
G391.
[0016] Examples of changes in the processing of the template by the
enzyme that are monitored in various embodiments of the invention
include, but are not limited to, kinetics, processivity, signal
characteristics, error metrics, signal context, and the like. In
some embodiments, a change occurs only at the modification, and in
other embodiments the change occurs at one or more positions
proximal to the modification, which may also include the
modification position.
[0017] In certain aspects, the methods further comprise mapping the
modification. In certain preferred embodiments, mapping the
modification comprises analyzing a portion of the sequence read
that was generated immediately prior to, during, and/or immediately
after detecting the change in processing to determine a sequence
complementary to the template nucleic acid; determining the
complement of the sequence complementary to the template nucleic
acid; and mapping the modification at a position in the template
nucleic acid that is proximal to the complement of the sequence
complementary to the template nucleic acid.
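The mapping steps of paragraph [0017] can be sketched in code as
follows. This is a minimal illustration only, not the
implementation of the invention. It assumes that the template
sequence is already known and written 5' to 3', that the read
window contains only A, C, G, and T, and that the window matches
the template exactly and uniquely. The function names and the
coordinate convention are hypothetical.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def map_modification(read_window: str, template: str,
                     offset_in_window: int) -> Optional[int]:
    """Map an event in a nascent-strand read window to a template index.

    read_window      -- nascent-strand bases (5'->3') read around the event
    template         -- template sequence (5'->3'); assumed known here
    offset_in_window -- index in read_window where the change was detected

    Returns the 0-based template index paired with that read position, or
    None when the window is absent from, or ambiguous in, the template.
    """
    # The complement of the read is the template segment that was copied.
    target = reverse_complement(read_window)
    start = template.find(target)
    if start == -1 or template.find(target, start + 1) != -1:
        return None
    # The nascent strand runs antiparallel to the template, so read
    # index i pairs with the mirrored position inside the matched segment.
    return start + (len(read_window) - 1 - offset_in_window)
```

For example, with the template "TTGACCGTAAGC" and the read window
"TACGGT", an event at read index 2 (a C) maps to template index 6
(the paired G).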
[0018] In certain embodiments, a change in the processing that is
indicative of the modification is a kinetic difference in the
processing (e.g., detected as an alteration in one or more of
interpulse duration, interpulse width, processivity, etc.) and/or a
change in an error metric (e.g., accuracy, an increase in binding
events that do not result in incorporation, etc.). The change in
processing can be indicative of the type of modification present
in the template nucleic acid, since different types of
modifications have different effects on the activity and/or
fidelity of the enzyme.
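As a hedged illustration of the kinetic metrics named in paragraph
[0018], the sketch below derives pulse widths and interpulse
durations from a list of pulse start and end times, and compares
the mean interpulse duration at a position against a control. The
definitions used here (interpulse duration as the gap between the
end of one pulse and the start of the next; a ratio of means as
the comparison) and all function names are assumptions made for
the example, not definitions taken from this disclosure.

```python
from statistics import mean


def pulse_metrics(pulses):
    """Compute pulse widths and interpulse durations (IPDs).

    pulses -- time-ordered list of (start, end) pairs, any consistent unit.
    Returns (widths, ipds); ipds has one fewer entry than widths.
    """
    widths = [end - start for start, end in pulses]
    # Assumed definition: IPD is the gap between the end of one pulse
    # and the start of the next pulse.
    ipds = [nxt[0] - cur[1] for cur, nxt in zip(pulses, pulses[1:])]
    return widths, ipds


def ipd_ratio(sample_ipds, control_ipds):
    """Ratio of the mean IPD in a sample to the mean IPD in a control.

    A ratio well above 1 at a template position would suggest slower
    incorporation there; where to set a threshold is left open.
    """
    return mean(sample_ipds) / mean(control_ipds)
```

For instance, pulses at frames (0, 10), (50, 70), and (100, 110)
give widths of 10, 20, and 10 frames and interpulse durations of
40 and 30 frames.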
[0019] In preferred embodiments, the monitoring occurs in real time
during the processing of the template by the enzyme. In preferred
embodiments, the template nucleic acid and the enzyme form a
complex that is immobilized at a reaction site on a substrate, and
in more preferred embodiments a plurality of complexes are
immobilized at optically resolvable reaction sites on the
substrate, wherein a single complex immobilized at one of the
reaction sites is optically resolvable from any other of the
complexes immobilized at any other of the reaction sites. In
certain embodiments, the optically resolvable reaction sites are
nanometer-scale apertures in the substrate, and can be optical
confinements, such as zero-mode waveguides. In preferred
embodiments, the template nucleic acid is a plurality of template
nucleic acids that are optically resolvable from one another during
the monitoring. Preferably, the template nucleic acid is not
amplified prior to contacting it with the enzyme.
[0020] In some embodiments, the modification is a secondary structure
in the template nucleic acid, e.g., a hairpin loop, and the change
in the processing is a kinetic change, e.g., an increased
interpulse duration or increased pulse width. Certain methods for
identifying such a secondary structure generally comprise
generating a sequence read for the template nucleic acid before and
after the pause; identifying a first portion of the sequence read
generated before the pause that is complementary to a second
portion of the sequence read generated after the pause; and
determining a likelihood that a hairpin loop formed by annealing of
the first portion to the second portion was present in the template
nucleic acid during the processing based at least upon the
nucleotide composition of the first portion and the second
portion.
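The hairpin analysis of paragraph [0020] might be sketched as
follows. The example searches the read generated before a pause
for the longest segment whose reverse complement appears in the
read generated after the pause, and then scores the candidate stem
from its nucleotide composition. The scoring rule (three points
per G or C, two per A or T, as a crude proxy for base-pairing
strength), the minimum stem length, and the function names are all
assumptions chosen for illustration; the disclosure does not
specify a particular likelihood calculation.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_stem(before, after, min_len=4):
    """Longest segment of `before` whose reverse complement is in `after`.

    Returns the segment as read before the pause, or None if no segment
    of at least min_len bases qualifies.
    """
    best = ""
    n = len(before)
    for i in range(n):
        # Only consider segments longer than the best one found so far.
        for j in range(i + max(min_len, len(best) + 1), n + 1):
            segment = before[i:j]
            if reverse_complement(segment) in after:
                best = segment
            else:
                break  # extending a non-matching segment cannot match
    return best or None


def stem_score(stem):
    """Crude composition-based score: G/C count 3, A/T count 2 (assumed)."""
    return sum(3 if base in "GC" else 2 for base in stem)
```

A higher score for a longer, more G/C-rich stem would be taken
here as a rough indication that a hairpin was more likely to have
been present; a real analysis would presumably use a
thermodynamic model.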
[0021] In another aspect of the invention, methods for detecting
binding of an agent to a single nucleic acid template are provided.
In certain embodiments, such methods generally comprise providing
the single nucleic acid template in complex with a polymerase;
introducing a reaction mixture to the complex, wherein the reaction
mixture comprises the agent; and monitoring synthesis of a
polynucleotide by the polymerase, wherein the polynucleotide is
complementary to the single nucleic acid template, and wherein a
change in the synthesis is indicative of binding of the agent to
the single nucleic acid template. Examples of agents appropriate
for use in such methods include, but are not limited to,
transcription factors, polymerases, reverse transcriptases,
histones, restriction enzymes, antibodies, nucleic acid binding
proteins, and nucleic acid binding agents. Examples of single
nucleic acid templates appropriate for use in such methods include,
but are not limited to, double-stranded DNA, double-stranded RNA,
single-stranded DNA, single-stranded RNA, DNA/RNA hybrids, and
templates comprising both double-stranded and single-stranded
regions.
[0022] In certain aspects of the invention, a consensus binding
site of the agent is determined. This determination can comprise,
e.g., performing a plurality of sequencing-by-synthesis reactions
on a set of single nucleic acid templates in the presence of the
agent to generate a set of binding-affected nascent polynucleotide
sequences; performing a plurality of sequencing-by-synthesis
reactions on the set of single nucleic acid templates in the
absence of the agent to generate a set of full-length nascent
polynucleotide sequences; analyzing the binding-affected nascent
polynucleotide sequences to determine a location at which the agent
bound the single nucleic acid template during the
sequencing-by-synthesis reactions in the presence of the agent; and
identifying a sequence common to the full-length nascent
polynucleotide sequences at the location, thereby identifying the
consensus binding site of the agent. In certain embodiments, the
binding-affected nascent polynucleotide sequences are truncated
nascent polynucleotide sequences; and in other embodiments, the
binding-affected nascent polynucleotide sequences are nascent
polynucleotide sequences whose synthesis was paused at the location
at which the agent bound.
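One way to picture the consensus determination of paragraph [0022]
is sketched below for the case of truncated reads. It assumes that
each truncated read and each full-length read come from the same
template and are supplied in matching order, that synthesis
stopped immediately before the bound agent, that the binding site
has a fixed width chosen by the user, and that the reads contain
no insertions or deletions. These simplifications and the function
name are hypothetical and serve only to illustrate the comparison
of binding-affected and full-length nascent sequences.

```python
from collections import Counter


def consensus_binding_site(truncated_reads, full_length_reads, width):
    """Estimate a consensus site from paired truncated/full-length reads.

    truncated_reads   -- nascent reads obtained in the presence of the agent
    full_length_reads -- nascent reads of the same templates, agent absent
    width             -- assumed width of the binding site, in bases

    Returns the column-wise majority sequence, or None if no template
    yields a complete window.
    """
    windows = []
    for short, full in zip(truncated_reads, full_length_reads):
        # The truncation point marks where the agent blocked synthesis.
        start = len(short)
        window = full[start:start + width]
        if len(window) == width:
            windows.append(window)
    if not windows:
        return None
    # Majority base in each column of the stacked windows.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*windows))
```

For paused rather than truncated reads, the position of the pause
within each read would take the place of the truncated read
length.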
[0023] In yet further aspects of the invention, methods for
detecting modifications in a single nucleic acid template during a
sequencing-by-synthesis reaction are provided. For example, such a
method can comprise providing the single nucleic acid template in
complex with a polymerase; introducing a reaction mixture to the
complex, wherein the reaction mixture comprises an agent that
specifically binds to the modification; and monitoring synthesis of
a polynucleotide by the polymerase, wherein the polynucleotide is
complementary to the single nucleic acid template, and wherein a
pause or cessation of the synthesis of the polynucleotide is
indicative of binding of the agent to the single nucleic acid
template, thereby detecting the modification in the single nucleic
acid template. In certain embodiments, the modification is an
8-oxoG lesion and/or the agent is a protein selected from the
group consisting of hOGG1, FPG, yOGG1, AlkA, Nth, Nei, MutY, UDG,
SMUG, TDG, NEIL, an antibody against 8-oxoG, or a binding domain
thereof. In other embodiments, the modification is a methylated
base and/or the agent is a protein selected from the group
consisting of MECP2, MBD1, MBD2, MBD4, UHRF1, an antibody against
the methylated base, or a binding domain thereof. In further
embodiments, the modification is a secondary structure formation in
the nucleic acid template. Preferably, the complex is immobilized
in an optical confinement. The template can comprise, e.g.,
single-stranded linear nucleic acid, single-stranded circular
nucleic acid, double-stranded linear nucleic acid, double-stranded
circular nucleic acid, or a combination thereof.
[0024] In certain embodiments, a modification in a template nucleic
acid can be repaired by including components of damage repair
machinery in the reaction mixture, e.g., during a
sequencing-by-synthesis reaction. In certain embodiments, the
readlength of the sequencing-by-synthesis reaction is longer than
that for a further sequencing-by-synthesis reaction performed with
the single nucleic acid template in complex with the polymerase,
but absent the agent and the damage repair machinery.
[0025] In other aspects of the invention, methods for bypassing one
or more modifications in a single nucleic acid template during a
sequencing-by-synthesis reaction are provided. Certain exemplary
methods include providing the single nucleic acid template in
complex with a sequencing engine; introducing a reaction mixture to
the complex, wherein the reaction mixture comprises a bypass
polymerase; initiating the sequencing-by-synthesis reaction;
monitoring synthesis of a polynucleotide by the sequencing engine,
wherein the polynucleotide is complementary to the single nucleic
acid template, and wherein a pause or cessation of the synthesis of
the polynucleotide is indicative that the sequencing engine has
encountered a modification in the single nucleic acid template;
subsequently monitoring synthesis of the polynucleotide by the
bypass polymerase, which is indicative that the modification is
being bypassed; and repeating the monitoring steps each time a
further modification is encountered in the single nucleic acid
template, thereby bypassing one or more modifications in a single
nucleic acid template during a sequencing-by-synthesis reaction. In
certain embodiments, the bypass polymerase comprises a detectable
label and detection of a signal from the detectable label during
the sequencing-by-synthesis reaction is indicative that the bypass
polymerase is actively synthesizing the polynucleotide. In
preferred embodiments, the readlength of the
sequencing-by-synthesis reaction is longer than that for a further
sequencing-by-synthesis reaction performed with the single nucleic
acid template in complex with the sequencing engine, but absent the
bypass polymerase. In specific embodiments, the reaction mixture
comprises multiple different bypass polymerases and a processivity
factor. Preferably, at least one of the single nucleic acid
template, the sequencing engine, and the bypass polymerase is
immobilized, directly or indirectly, in an optical confinement. For
example, the template can be immobilized by hybridization to an
oligonucleotide primer immobilized in the optical confinement. In
certain preferred embodiments, the single nucleic acid template is
processed by the sequencing engine multiple times at a single
reaction site, and further wherein redundant sequence data is
generated. Nucleic acid templates for use with the methods can be
circular and/or can comprise multiple copies of a nucleic acid
segment of interest. Further, in certain embodiments the
sequencing-by-synthesis reaction generates a polynucleotide
comprising multiple copies of a segment complementary to the
segment of interest, and further wherein redundant sequence data is
generated.
[0026] In further aspects, novel compositions are provided. For
example, in certain embodiments a composition of the invention
comprises a substrate having a reaction site that is optically
resolvable from any other reaction site on the substrate; a single
complex of a template and sequencing engine immobilized at the
reaction site; a mixture of incorporatable nucleotides or
nucleotide analogs; and at least one modification in the template
nucleic acid, wherein the template at or proximal to the
modification is processed differently than the template distal from
the modification. In some embodiments the modification is a
non-natural base in the template. The modification can be located
in either a strand of the template nucleic acid that is
complementary to a nascent strand synthesized by the sequencing
engine, or a strand of the template nucleic acid that is displaced
by the sequencing engine. In certain preferred embodiments, the
template nucleic acid comprises internally complementary regions,
and optionally, the modification is located within one of the
internally complementary regions. Certain embodiments further
comprise at least one type of non-incorporatable nucleotide analog.
Certain embodiments comprise at least one type of non-natural
incorporatable nucleotide analog. Preferably, one or more or all of
the nucleotides or nucleotide analogs in a composition of the
invention are tagged with distinct labels that distinguish
different types of nucleotides or nucleotide analogs from one
another. Compositions of the invention can also include an agent
other than the sequencing engine that binds to the modification
and/or chemically or enzymatically alters the modification.
Preferably, compositions of the invention comprise a nascent strand
generated by the sequencing engine, wherein the nascent strand is
complementary to the template nucleic acid, and, optionally,
comprises multiple copies of regions complementary to the template
nucleic acid. Further, certain compositions comprise a
nanometer-scale aperture in the substrate, where the reaction site
is disposed within the nanometer-scale aperture, e.g., a zero-mode
waveguide.
[0027] In further aspects of the invention, systems for
identification of modifications within a nucleic acid template are
provided. In certain preferred embodiments, a system of the
invention comprises a solid support having a polymerase complex
disposed thereon (e.g., at a reaction site, e.g., in a nanoscale
aperture, e.g., in a zero-mode waveguide), the polymerase complex
comprising a nucleic acid template comprising a modification; a
mounting stage configured to receive the solid support; an optical
train positioned to be in optical communication with at least a
portion of the solid support to detect signals emanating therefrom;
a translation system operably coupled to the mounting stage or the
optical train for moving one of the optical train and the solid
support relative to the other; and a data processing system
operably coupled to the optical train. Preferably, the polymerase
complex comprises a polymerase enzyme that is actively processing
the nucleic acid template. More preferably, the polymerase complex
comprises a polymerase enzyme that is processively synthesizing a
nascent strand by template-directed synthesis. In preferred
embodiments, the optical train detects signals emanating from the
solid support during the processing of the nucleic acid
template.
[0028] In yet further aspects, the invention provides
machine-implemented methods for transforming reaction data into
modification detection data, wherein the reaction data is
representative of a series of events during a
sequencing-by-synthesis reaction wherein a nascent strand is
synthesized based upon a nucleotide sequence of a template nucleic
acid, and the modification detection data is representative of a
presence of one or more modifications within a template nucleic
acid. Preferably, one or more steps of the machine-implemented
method are performed via a user interface implemented in a machine
that comprises instructions stored in machine-readable medium and a
processor that executes the instructions. In a final aspect of the
invention, computer program products are provided. In certain
embodiments, machine-implemented methods for transforming reaction
data comprise a classifier to distinguish between true
incorporations and stochastic pulses, a segmenting algorithm based
on a hidden Markov model architecture, and/or a segmenting
algorithm based on a conditional random field framework. In certain
specific embodiments, the methods identify regions in the template
having a higher density of stochastic pulses than true
incorporations. In certain specific embodiments, the methods
identify regions in the template having higher IPD. Exemplary
computer program products of the invention typically comprise a
computer usable medium having a computer readable program code
embodied therein, said computer readable program code adapted to be
executed to implement the machine-implemented methods of the
invention; and the machine-readable medium on which the results of
one or more steps of the machine-implemented method are stored.
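A segmenting algorithm based on a hidden Markov model, as
mentioned in paragraph [0028], can be illustrated with a two-state
model that labels each observation as belonging to a sequencing
(S) or pause (P) state. The sketch below uses the standard Viterbi
algorithm over observations that have been discretized into
"short" and "long" interpulse durations. The state names follow
the S and P labels used for FIG. 8, but every probability value
and the discretization itself are invented for this example and
are not taken from the disclosure.

```python
import math

STATES = ("S", "P")  # S = sequencing, P = pause
# All probabilities below are illustrative assumptions.
START = {"S": 0.9, "P": 0.1}
TRANS = {"S": {"S": 0.9, "P": 0.1},
         "P": {"S": 0.2, "P": 0.8}}
EMIT = {"S": {"short": 0.9, "long": 0.1},
        "P": {"short": 0.2, "long": 0.8}}


def viterbi(observations):
    """Return the most likely state path for a list of 'short'/'long'."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][obs]))
            pointers[s] = best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed parameters, a run of three "long" observations
flanked by "short" ones is labeled as a pause segment, while an
isolated "long" observation would not necessarily be.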
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] FIG. 1 provides an exemplary illustration of
single-molecule, real-time (SMRT™) nucleic acid sequencing.
[0030] FIG. 2 provides illustrative examples of various types of
reaction data in the context of a pulse trace.
[0031] FIG. 3 schematically illustrates a structural model of 5-MeC
positioned one base in the 5' direction relative to the DNA
polymerase active site.
[0032] FIG. 4 illustrates an exemplary embodiment of five-base DNA
methylation sequencing. FIG. 4A depicts fragmentation of genomic
DNA to generate the DNA template.
[0033] FIG. 4B illustrates DNA glycosylase excising a 5-MeC from
the template.
[0034] FIG. 5 provides an illustrative embodiment of a reaction
comprising a linear template and a damage-binding agent that
recognizes a lesion in a single-stranded template.
[0035] FIG. 6 illustrates an embodiment of the invention comprising
a circular template and a damage-binding agent that recognizes a
lesion in a double-stranded template.
[0036] FIG. 7 illustrates an observation of true incorporations
(solid line) versus stochastic pulses (dashed line) across
time.
[0037] FIG. 8 provides an illustrative example of a simple hidden
Markov model for classifying pause (P) versus sequencing (S) states
within a sequencing trace.
[0038] FIG. 9 provides an illustrative example of a system of the
invention.
[0039] FIG. 10A provides a schematic for exemplary template nucleic
acids of the invention. FIG. 10B provides graphs plotting
interpulse duration for template nucleic acids as depicted in FIG.
10A.
[0040] FIG. 11 provides a graph plotting IPD ratio against template
position for a template nucleic acid comprising 5-methylcytosine
modifications.
[0041] FIG. 12 provides a graph plotting IPD ratio against template
position for a template nucleic acid comprising 5-methylcytosine
modifications.
[0042] FIG. 13A provides a schematic for exemplary template nucleic
acids of the invention. FIG. 13B provides graphs plotting
interpulse duration for template nucleic acids as depicted in FIG. 13A.
FIG. 13C provides a ROC curve for the data provided in FIG. 13B.
[0043] FIG. 14 provides a graph plotting IPD ratio against template
position for a template nucleic acid comprising N6-methyladenosine
modifications.
[0044] FIG. 15 provides a graph plotting IPD ratio against template
position for a template nucleic acid comprising
5-hydroxymethylcytosine modifications.
[0045] FIG. 16 provides a graph plotting pulse width ratio against
template position for a template nucleic acid comprising
5-hydroxymethylcytosine modifications.
[0046] FIG. 17 provides a graph plotting IPD ratio against template
position for a template nucleic acid comprising 8-oxoguanosine
modifications.
[0047] FIG. 18 provides a graph plotting pulse width ratio against
template position for a template nucleic acid comprising
8-oxoguanosine modifications.
DETAILED DESCRIPTION OF THE INVENTION
I. General
[0048] The present invention is generally directed to methods,
compositions, and systems for detecting modifications within
nucleic acid sequences, and in particularly preferred aspects,
methylated nucleotides within sequence templates through the use of
single molecule nucleic acid analysis. The ability to detect
modifications within nucleic acid sequences is useful for mapping
such modifications in various types and/or sets of nucleic acid
sequences, e.g., across a set of mRNA transcripts, across a
chromosomal region of interest, or across an entire genome. The
modifications so mapped can then be related to transcriptional
activity, secondary structure of the nucleic acid, siRNA activity,
mRNA translation dynamics, kinetics and/or affinities of DNA- and
RNA-binding proteins, and other aspects of nucleic acid (e.g., DNA
and/or RNA) metabolism.
[0049] Although certain embodiments of the invention are described
in terms of detection of modified nucleotides or other
modifications in a single-stranded DNA molecule (e.g., a
single-stranded template DNA), various aspects of the invention are
applicable to many different types of nucleic acids, including,
e.g., single- and double-stranded nucleic acids that may comprise
DNA, RNA (e.g., mRNA, siRNA, microRNA, rRNA, tRNA, snRNA, etc.),
RNA-DNA hybrids, PNA, LNA, morpholino, and other RNA and/or DNA
mimetics and derivatives thereof, and combinations of any of the
foregoing. Nucleic acids for use with the methods, compositions,
and systems provided herein may consist entirely of native
nucleotides, or may comprise non-natural bases/nucleotides (e.g.,
synthetic and/or engineered) that may be paired with native
nucleotides or may be paired with the same or a different
non-natural base/nucleotide. In certain preferred embodiments, the
nucleic acid comprises a combination of single-stranded and
double-stranded regions, such as the templates described in
U.S. Ser. Nos. 12/383,855 and 12/413,258, both filed on Mar. 27,
2009 and incorporated herein by reference in their entireties for
all purposes. In particular, mRNA modifications are difficult to
detect by technologies that require reverse transcriptase PCR
amplification because such treatment does not maintain the
modification in the amplicons. The present invention provides
methods for analyzing modifications in RNA molecules that do not
require such amplification.
[0050] Generally speaking, the methods of the invention involve
monitoring of an analytical reaction to collect "reaction data,"
wherein the reaction data is indicative of the progress of the
reaction. Reaction data includes data collected directly from the
reaction, as well as the results of various manipulations of that
directly collected data, any or a combination of which can serve as
a signal for the presence of a modification in the template nucleic
acid. For example, certain types of reaction data are collected in
real time during the course of the reaction, such as metrics
related to reaction kinetics, processivity, signal characteristics,
and the like. Signal characteristics vary depending on the type of
analytical reaction being monitored. For example, some reactions
use detectable labels to tag one or more reaction components, and
signal characteristics for a detectable label include, but are not
limited to, the type of signal (e.g., wavelength, charge, etc.) and
the shape of the signal (e.g., height, width, curve, etc.).
Further, signal characteristics for multiple signals (e.g.,
temporally adjacent signals) can also be used, including, e.g., the
distance between signals during a reaction, the number of extra
signals (e.g., that do not correspond to the progress of the
reaction), internal complementarity, and the local signal context
(i.e., one or more signals that precede and/or follow a given
signal). For example, template-directed sequencing reactions often
combine signal data from multiple nucleotide incorporation events
to generate a sequence read for a nascent strand synthesized, and
this sequence read is used to derive, e.g., by complementarity, the
sequence of the template strand. Other types of reaction data are
generated from statistical analysis of real time reaction data,
including, e.g., accuracy, precision, conformance, etc. In some
embodiments, data from a source other than the reaction being
monitored is also used. For example, a sequence read generated
during a nucleic acid sequencing reaction can be compared to
sequence reads generated in replicate experiments, or to known or
derived reference sequences from the same or a related biological
source. Alternatively or additionally, a portion of a template
nucleic acid preparation can be amplified using unmodified
nucleotides and subsequently sequenced to provide an experimental
reference sequence to be compared to the sequence of the original
template in the absence of amplification. Although certain specific
embodiments of the use of particular types of reaction data to
detect certain kinds of modifications are described at length
herein, it is to be understood that the methods, compositions, and
systems are not limited to these specific embodiments. Different
types of reaction data can be combined to detect various kinds of
modifications, and in certain embodiments more than one type of
modification can be detected and identified during a single
reaction on a single template. Such variations to the detailed
embodiments of the invention will be clear to one of ordinary skill
based upon the teachings provided herein.
[0051] In certain embodiments, redundant sequence information is
generated and analyzed to detect one or more modifications in a
template nucleic acid. Redundancy can be achieved in various ways,
including carrying out multiple sequencing reactions using the same
original template, e.g., in an array format, e.g., a ZMW array. In
some embodiments in which a lesion is unlikely to occur in all the
copies of a given template, reaction data (e.g., sequence reads,
kinetics, signal characteristics, signal context, and/or results
from further statistical analyses) generated for the multiple
reactions can be combined and subjected to statistical analysis to
determine a consensus sequence for the template. In this way, the
reaction data from a region in a first copy of the template can be
supplemented and/or corrected with reaction data from the same
region in a second copy of the template. Similarly, a template can
be amplified (e.g., via rolling circle amplification) to generate a
concatemer comprising multiple copies of the template, and the
concatemer can be subjected to sequencing, thereby generating a
sequencing read that is internally redundant. As such, the sequence
data from a first segment of the concatemer (corresponding to a
first region of the template) can be supplemented and/or corrected
with sequence data from a second segment of the concatemer also
corresponding to the first region of the template. Alternatively or
additionally, a template can be subjected to repeated sequencing
reactions to generate redundant sequence information that can be
analyzed to more thoroughly characterize the modification(s)
present in the template.
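The use of internally redundant sequence data described in
paragraph [0051] can be illustrated with a simple majority vote.
The sketch below splits a concatemer read into consecutive copies
of the template and takes the most frequent base at each position.
It assumes that the length of the repeated unit is known and that
the read contains substitution errors only (no insertions or
deletions), which would rarely hold for real single-molecule data,
where the copies would first need to be aligned. The function
names are hypothetical.

```python
from collections import Counter


def split_concatemer(read, unit_len):
    """Split a concatemer read into complete copies of length unit_len.

    A trailing partial copy, if any, is discarded.
    """
    return [read[i:i + unit_len]
            for i in range(0, len(read) - unit_len + 1, unit_len)]


def majority_consensus(copies):
    """Column-wise majority base across equal-length copies."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*copies))
```

For example, a read made of the copies "ACGTAC", "ACGTTC", and
"ACCTAC" followed by a partial copy yields the consensus "ACGTAC",
with each single-copy discrepancy outvoted by the other two
copies. The same vote could be applied per position to kinetic
data rather than base calls.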
[0052] The term "modification" as used herein is intended to refer
not only to a chemical modification of a nucleic acid, but also to
a variation in nucleic acid conformation, interaction of an agent
with a nucleic acid (e.g., bound to the nucleic acid), and other
perturbations associated with the nucleic acid. As such, a location
or position of a modification is a locus (e.g., a single nucleotide
or multiple contiguous or noncontiguous nucleotides) at which such
modification occurs within the nucleic acid. For a double-stranded
template, such a modification may occur in the strand complementary
to a nascent strand synthesized by a polymerase processing the
template, or may occur in the displaced strand. Although certain
specific embodiments of the invention are described in terms of
5-methylcytosine detection, detection of other types of modified
nucleotides (e.g., N6-methyladenosine,
N3-methyladenosine, N7-methylguanosine,
5-hydroxymethylcytosine, other methylated nucleotides,
pseudouridine, thiouridine, isoguanosine, isocytosine,
dihydrouridine, queuosine, wyosine, inosine, triazole,
diaminopurine, 8-oxoguanosine, and 2'-O-methyl derivatives of
adenosine, cytidine, guanosine, and uridine) is also contemplated.
These and other modifications are known to those of ordinary skill
in the art and are further described, e.g., in Narayan P, et al.
(1987) Mol Cell Biol 7(4):1572-5; Horowitz S, et al. (1984) Proc
Natl Acad Sci U.S.A. 81(18):5667-71; "RNA's Outfits: The nucleic
acid has dozens of chemical costumes," (2009) C&EN
87(36):65-68; Kriaucionis, et al. (2009) Science 324 (5929): 929-30;
and Tahiliani, et al. (2009) Science 324 (5929): 930-35; Matray, et
al. (1999) Nature 399(6737):704-8; Ooi, et al. (2008) Cell 133:
1145-8; Petersson, et al. (2005) J Am Chem. Soc. 127(5):1424-30;
Johnson, et al. (2004) 32(6):1937-41; Kimoto, et al. (2007) Nucleic
Acids Res. 35(16):5360-9; Ahle, et al. (2005) Nucleic Acids Res
33(10):3176; Krueger, et al. (2007) Curr Opin Chem Biol
11(6):588; Krueger, et al. (2009) Chemistry & Biology
16(3):242; McCullough, et al. (1999) Annual Rev of Biochem 68:255;
and Liu, et al. (2003) Science 302(5646):868-71, the disclosures of
which are incorporated herein by reference in their entireties for
all purposes. Modifications further include the presence of
non-natural base pairs in the template nucleic acid, including but
not limited to hydroxypyridone and pyridopurine homo- and
hetero-base pairs, pyridine-2,6-dicarboxylate and pyridine
metallo-base pairs, pyridine-2,6-dicarboxamide and pyridine
metallo-base pairs, metal-mediated pyrimidine base pairs T-Hg(II)-T
and C-Ag(I)-C, and metallo-homo-basepairs of
2,6-bis(ethylthiomethyl)pyridine nucleobases Spy, and alkyne-,
enamine-, alcohol-, imidazole-, guanidine-, and
pyridyl-substitutions to the purine or pyrimidine base (Wettig, et
al. (2003) J Inorg Biochem 94:94-99; Clever, et al. (2005) Angew
Chem Int Ed 117:7370-7374; Schlegel, et al. (2009) Org Biomol Chem
7(3):476-82; Zimmerman, et al. (2004) Bioorg Chem 32(1):13-25;
Yanagida, et al. (2007) Nucleic Acids Symp Ser (Oxf) 51:179-80;
Zimmerman (2002) J Am Chem Soc 124(46):13684-5; Buncel, et al.
(1985) Inorg Biochem 25:61-73; Ono, et al. (2004) Angew Chem
43:4300-4302; Lee, et al. (1993) Biochem Cell Biol 71:162-168;
Loakes, et al. (2009), Chem Commun 4619-4631; and Seo, et al.
(2009) J Am Chem Soc 131:3246-3252, all incorporated herein by
reference in their entireties for all purposes). Other types of
modifications include, e.g., a nick, a missing base (e.g., apurinic
or apyrimidinic sites), a pyrimidine dimer (e.g., thymine dimer or
cyclobutane pyrimidine dimer), a cis-platin crosslinking, oxidation
damage, hydrolysis damage, other methylated bases, bulky DNA base
adducts, photochemistry reaction products, interstrand crosslinking
products, mismatched bases, and other types of "damage" to the
nucleic acid. As such, certain embodiments described herein refer
to "damage" and such damage is also considered a modification of
the nucleic acid in accordance with the present invention. Modified
nucleotides can be caused by exposure of the DNA to radiation
(e.g., UV), carcinogenic chemicals, crosslinking agents (e.g.,
formaldehyde), certain enzymes (e.g., nickases, glycosylases,
exonucleases, methylases, other nucleases, etc.), viruses, toxins
and other chemicals, thermal disruptions, and the like. In vivo,
DNA damage is a major source of mutations leading to various
diseases including cancer, cardiovascular disease, and nervous
system diseases (see, e.g., Lindahl, T. (1993) Nature 362(6422):
709-15, which is incorporated herein by reference in its entirety
for all purposes). The methods and systems provided herein can also
be used to detect various conformations of DNA, in particular,
secondary structure forms such as hairpin loops, stem-loops,
internal loops, bulges, pseudoknots, base-triples, and the like;
and are also useful for detection of agents interacting with the
nucleic acid, e.g., bound proteins or other moieties.
[0053] In certain aspects, methods, compositions, and systems for
detection and/or reversal of modifications in a template for
single-molecule sequencing are provided, as well as determination
of their location (i.e., "mapping") within a nucleic acid molecule.
In certain preferred embodiments, high-throughput, real-time,
single-molecule, template-directed sequencing assays are used to
detect the presence of such damage sites and to determine their
location on the DNA template, e.g., by monitoring the progress
and/or kinetics of a polymerase enzyme processing the template. For
example, when a polymerase enzyme encounters certain types of
damage or other modifications in a DNA template, the progress of
the polymerase can be temporarily or permanently blocked, e.g.,
resulting in a paused or dissociated polymerase. As such, the
detection of a pause in or termination of nascent strand synthesis
is indicative of the presence of such damage or lesion. By analysis
of the sequence reads produced prior to the pause or stop in
synthesis, and alternatively or additionally after reinitiation of
synthesis, one can map the site of the damage or lesion on the
template. Since different types of lesions can have different
effects on the progress of the polymerase on the substrate, in
certain cases the behavior of the polymerase on the template not
only informs as to where the lesion occurs, but also what type of
lesion is present. Further, in certain embodiments a modification
may be bypassed by incorporation of a non-nucleotide binding
partner with the lesion in the template strand. For example, abasic
sites (e.g., produced by glycosylases) can be "paired" with pyrenes
or other similar analogs. (See, e.g., Matray, et al. (1999) Nature
399(6737): 704-8). Such an analog can also be labeled with a
detectable label optically distinguishable from those on the
nucleotides in the reaction mixture to allow optical detection of
its incorporation. Certain aspects of the invention provide a means
for reversing such modifications in real time, thereby allowing
reinitiation of the sequencing reaction and continued generation of
sequence information for the template nucleic acid. Such methods
can additionally be used to study effects of various agents (e.g.,
drugs, chemicals, enzymes, etc.) and reaction conditions on the
creation and/or repair of such lesions and/or damage, as described
elsewhere herein. These and other aspects of the invention are
described in greater detail in the description and examples that
follow.
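
By way of non-limiting illustration, the mapping step described above (locating a lesion from the sequence read produced prior to a pause or stop in synthesis) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular instrument: the function name `map_pause_site` is hypothetical, the reference is assumed to be given in the same orientation as the nascent strand, and the read is assumed to match the reference exactly and at only one place.

```python
from typing import Optional


def map_pause_site(reference: str, read_before_pause: str) -> Optional[int]:
    """Return the 0-based reference index of the first base that was NOT
    synthesized before the pause, or None if the site cannot be placed.

    Assumption (hypothetical, for illustration): `reference` is written in
    the orientation of the nascent strand and the read is error-free.
    """
    if not read_before_pause:
        return None
    first = reference.find(read_before_pause)
    if first == -1:
        return None  # the read does not occur in the reference
    if reference.find(read_before_pause, first + 1) != -1:
        return None  # the read occurs more than once, so the site is ambiguous
    return first + len(read_before_pause)


# Example: synthesis stopped after the read "GTTGCA" was produced.
site = map_pause_site("ACGTTGCAAGGCT", "GTTGCA")
print(site)
```

In practice a tolerant alignment would replace the exact string search, since real reads contain errors; the exact match is used here only to keep the sketch short.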
II. Single Molecule Sequencing
[0054] In certain aspects of the invention, single molecule real
time sequencing systems are applied to the detection of modified
nucleic acid templates through analysis of the sequence and/or
kinetic data derived from such systems. In particular,
modifications in a template nucleic acid strand alter the enzymatic
activity of a nucleic acid polymerase in various ways, e.g., by
increasing the time for a bound nucleotide to be incorporated
and/or increasing the time between incorporation events. In certain
embodiments, polymerase activity is detected using a single
molecule nucleic acid sequencing technology. In certain
embodiments, polymerase activity is detected using a nucleic acid
sequencing technology that detects incorporation of nucleotides
into a nascent strand in real time. In preferred embodiments, a
single molecule nucleic acid sequencing technology is capable of
real-time detection of nucleotide incorporation events. Such
sequencing technologies are known in the art and include, e.g., the
SMRT™ sequencing and nanopore sequencing technologies. For more
information on nanopore sequencing, see, e.g., U.S. Pat. No.
5,795,782; Kasianowicz, et al. (1996) Proc Natl Acad Sci USA
93(24):13770-3; Ashkenas, et al. (2005) Angew Chem Int Ed Engl
44(9):1401-4; Howorka, et al. (2001) Nat Biotechnology 19(7):636-9;
and Astier, et al. (2006) J Am Chem Soc 128(5):1705-10, all of
which are incorporated herein by reference in their entireties for
all purposes. With regard to nucleic acid sequencing, the term
"template" refers to a nucleic acid molecule subjected to
template-directed synthesis of a nascent strand. A template may
comprise DNA, RNA, or mimetics or derivatives thereof. Further, a
template may be single-stranded, double-stranded, or may comprise
both single- and double-stranded regions. A modification in a
double-stranded template may be in the strand complementary to the
newly synthesized nascent strand, or may be in the strand identical
to the newly synthesized strand, i.e., the strand that is displaced
by the polymerase.
[0055] The preferred direct methylation sequencing described herein
may generally be carried out using single molecule real time
sequencing systems, i.e., that illuminate and observe individual
reaction complexes continuously over time, such as those developed
for SMRT™ DNA sequencing (see, e.g., P. M. Lundquist, et al.,
Optics Letters 2008, 33, 1026, which is incorporated herein by
reference in its entirety for all purposes). The foregoing SMRT™
sequencing instrument generally detects fluorescence signals from
an array of thousands of ZMWs simultaneously, resulting in highly
parallel operation. Each ZMW, separated from others by distances of
a few micrometers, represents an isolated sequencing chamber.
[0056] Detection of single molecules or molecular complexes in real
time, e.g., during the course of an analytical reaction, generally
involves direct or indirect disposal of the analytical reaction
such that each molecule or molecular complex to be detected is
individually resolvable. In this way, each analytical reaction can
be monitored individually, even where multiple such reactions are
immobilized on a single substrate. Individually resolvable
configurations of analytical reactions can be accomplished through
a number of mechanisms, and typically involve immobilization of at
least one component of a reaction at a reaction site. Various
methods of providing such individually resolvable configurations
are known in the art, e.g., see European Patent No. 1105529 to
Balasubramanian, et al.; and Published International Patent
Application No. WO 2007/041394, the full disclosures of which are
incorporated herein by reference in their entireties for all
purposes. A reaction site on a substrate is generally a location on
the substrate at which a single analytical reaction is performed
and monitored, preferably in real time. A reaction site may be on a
planar surface of the substrate, or may be in an aperture in the
surface of the substrate, e.g., a well, nanohole, or other
aperture. In preferred embodiments, such apertures are "nanoholes,"
which are nanometer-scale holes or wells that provide structural
confinement of analytic materials of interest within a
nanometer-scale diameter, e.g., ~1-300 nm. In some
embodiments, such apertures comprise optical confinement
characteristics, such as zero-mode waveguides, which are also
nanometer-scale apertures and are further described elsewhere
herein. Typically, the observation volume (i.e., the volume within
which detection of the reaction takes place) of such an aperture is
at the attoliter (10^-18 L) to zeptoliter (10^-21 L) scale,
a volume suitable for detection and analysis of single molecules
and single molecular complexes.
[0057] The immobilization of a component of an analytical reaction
can be engineered in various ways. For example, an enzyme (e.g.,
polymerase, reverse transcriptase, kinase, etc.) may be attached to
the substrate at a reaction site, e.g., within an optical
confinement or other nanometer-scale aperture. In other
embodiments, a substrate in an analytical reaction (for example, a
nucleic acid template, e.g., DNA, RNA, or hybrids, analogs, and
mimetics thereof, or a target molecule for a kinase) may be
attached to the substrate at a reaction site. Certain embodiments
of template immobilization are provided, e.g., in U.S. patent
application Ser. No. 12/562,690, filed Sep. 18, 2009 and
incorporated herein by reference in its entirety for all purposes.
One skilled in the art will appreciate that there are many ways of
immobilizing nucleic acids and proteins into an optical
confinement, whether covalently or non-covalently, via a linker
moiety, or tethering them to an immobilized moiety. These methods
are well known in the field of solid phase synthesis and
micro-arrays (Beier et al., Nucleic Acids Res. 27:1970-1977
(1999)). Non-limiting exemplary binding moieties for attaching
either nucleic acids or polymerases to a solid support include
streptavidin or avidin/biotin linkages, carbamate linkages, ester
linkages, amide, thiolester, (N)-functionalized thiourea,
functionalized maleimide, amino, disulfide, amide, hydrazone
linkages, among others. Antibodies that specifically bind to one or
more reaction components can also be employed as the binding
moieties. In addition, a silyl moiety can be used to attach a nucleic
acid directly to a substrate such as glass using methods known in
the art.
[0058] In some embodiments, a nucleic acid template is immobilized
onto a reaction site (e.g., within an optical confinement) by
attaching a primer comprising a complementary region at the
reaction site that is capable of hybridizing with the template,
thereby immobilizing it in a position suitable for monitoring. In
certain embodiments, an enzyme complex is assembled in an optical
confinement, e.g., by first immobilizing an enzyme component. In
other embodiments, an enzyme complex is assembled in solution prior
to immobilization. Where desired, an enzyme or other protein
reaction component to be immobilized may be modified to contain one
or more epitopes for which specific antibodies are commercially
available. In addition, proteins can be modified to contain
heterologous domains such as glutathione S-transferase (GST),
maltose-binding protein (MBP), specific binding peptide regions
(see e.g., U.S. Pat. Nos. 5,723,584, 5,874,239 and 5,932,433), or
the Fc portion of an immunoglobulin. The respective binding agents
for these domains, namely glutathione, maltose, and antibodies
directed to the Fc portion of an immunoglobulin, are available and
can be used to coat the surface of an optical confinement of the
present invention. The binding moieties or agents of the reaction
components they immobilize can be applied to a support by
conventional chemical techniques which are well known in the art.
In general, these procedures can involve standard chemical surface
modifications of a support, incubation of the support at different
temperature levels in different media comprising the binding
moieties or agents, and possible subsequent steps of washing and
cleaning.
[0059] In some embodiments, a substrate comprising an array of
reaction sites is used to monitor multiple biological reactions,
each taking place at a single one of the reaction sites. Various
means of loading multiple biological reactions onto an arrayed
substrate are known to those of ordinary skill in the art and are
described further, e.g., in U.S. Ser. No. 61/072,641, incorporated
herein by reference in its entirety for all purposes. For example,
basic approaches include: creating a single binding site for a
reaction component at the reaction site; removing excess binding
sites at the reaction site via catalytic or secondary binding
methods; adjusting the size or charge of the reaction component to
be immobilized; packaging or binding the reaction component within
(or on) a particle (e.g., within a viral capsid), where a single
such particle fits into the relevant reaction site (due to size or
charge of the particle and/or observation volume); using
non-diffusion limited loading; controllably loading the reaction
component (e.g., using microfluidic or optical or electrical
control); sizing or selecting charges in the reaction
sites/observation volumes (e.g., the sizes of optical confinements
in an array) to control which reaction components will fit
(spatially or electrostatically) into which reaction
sites/observation volumes; iterative loading of reaction
components, e.g., by masking active sites between loading cycles;
enriching the activity of the reaction components that are loaded;
using self-assembling nucleic acids to sterically control loading;
adjusting the size of the reaction site/observation volume; and
many others. Such methods and compositions provide for the
possibility of completely loading single-molecule array reaction
sites (instead of about 30% of such sites as occurs in "Poisson
limited" loading methods) with single reaction components (e.g.,
molecular complexes).
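
As a non-limiting worked example of the "Poisson limited" loading referred to above: if reaction components land in reaction sites independently, the number per site is approximately Poisson distributed, and the fraction of sites holding exactly one component is mean × exp(−mean). The sketch below evaluates that fraction; the function names are hypothetical and the independent-loading model is an assumption. Under this idealized model the fraction peaks at 1/e (about 37%) when the mean occupancy is one, which is the same order as the approximately 30% figure noted above.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k components in a site for a Poisson mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def single_occupancy_fraction(mean: float) -> float:
    """Fraction of sites that contain exactly one reaction component."""
    return poisson_pmf(1, mean)


# Scan mean occupancies from 0.01 to 3.00 and keep the best one.
best_fraction, best_mean = max(
    (single_occupancy_fraction(m / 100), m / 100) for m in range(1, 301)
)
print(f"best mean occupancy: {best_mean:.2f}")
print(f"singly loaded sites: {best_fraction:.3f}")
```

Loading more heavily than a mean of one does not help under this model, because the gain in occupied sites is offset by sites holding two or more components.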
[0060] In preferred aspects, the methods, compositions, and systems
provided herein utilize optical confinements to facilitate single
molecule resolution of analytical reactions. In preferred
embodiments, such optical confinements are configured to provide
tight optical confinement so only a small volume of the reaction
mixture is observable. Some such optical confinements and methods
of manufacture and use thereof are described at length in, e.g.,
U.S. Pat. Nos. 7,302,146, 7,476,503, 7,313,308, 7,315,019,
7,170,050, 6,917,726, 7,013,054, 7,181,122, and 7,292,742; U.S.
Patent Publication Nos. 20080128627, 20080152281, and 200801552280;
and U.S. Ser. Nos. 11/981,740 and 12/560,308, all of which are
incorporated herein by reference in their entireties for all
purposes.
[0061] Where reaction sites are located in optical confinements,
the optical confinements can be further tailored in various ways
for optimal confinement of an analytical reaction of interest. In
particular, the size, shape, and composition of the optical
confinement can be specifically designed for containment of a given
enzyme complex and for the particular label and illumination scheme
used.
[0062] In certain preferred embodiments of the invention,
single-molecule real-time sequencing systems already developed are
applied to the detection of modified nucleic acid templates through
analysis of the sequence and kinetic data derived from such
systems. As described below, methylated cytosine and other
modifications in a template nucleic acid will alter the enzymatic
activity of a polymerase processing the template nucleic acid. In
certain embodiments, polymerase kinetics in addition to sequence
read data are detected using a single molecule nucleic acid
sequencing technology, e.g., the SMRT™ sequencing technology
developed by Pacific Biosciences (Eid, J. et al. (2009) Science
323, 133, the disclosure of which is incorporated herein by
reference in its entirety for all purposes). This technique is
capable of long sequencing reads and provides high-throughput
methylation profiling even in highly repetitive genomic regions,
facilitating de novo sequencing of modifications such as methylated
bases. SMRT™ sequencing systems typically utilize
state-of-the-art single-molecule detection instruments,
production-line nanofabrication chip manufacturing, organic
chemistry, protein mutagenesis, selection and production
facilities, and software and data analysis infrastructures.
[0063] Certain preferred methods of the invention employ real-time
sequencing of single DNA molecules (Eid, et al., supra), with
intrinsic sequencing rates of several bases per second and average
read lengths in the kilobase range. In such sequencing, sequential
base additions catalyzed by DNA polymerase into the growing
complementary nucleic acid strand are detected with fluorescently
labeled nucleotides. The kinetics of base additions and polymerase
translocation are sensitive to the structure of the DNA
double-helix, which is impacted by the presence of base
modifications, e.g., 5-MeC and other perturbations (secondary
structure, bound agents, etc.) in the template. By monitoring the
activity of DNA polymerase during sequencing, sequence read
information and base modifications can be simultaneously detected.
Long, continuous sequence reads that are readily achievable using
SMRT™ sequencing facilitate modification (e.g., methylation)
profiling in low complexity regions that are inaccessible to some
technologies, such as certain short-read sequencing technologies.
Carried out in a highly parallel manner, methylomes can be
sequenced directly, with single base-pair resolution and high
throughput.
[0064] The principle of SMRT™ sequencing is illustrated in FIG.
1. Two important technology components of certain embodiments of
this process are: (i) optical confinement technology that allows
single-molecule detection at concentrations of labeled nucleotides
relevant to the enzyme, and (ii) phospholinked nucleotides that
enable observation of uninterrupted polymerization.
[0065] In preferred embodiments, optical confinements are ZMW
nanostructures, preferably in an arrayed format. Typically, ZMW
arrays comprise dense arrays of holes, ~100 nm in diameter,
fabricated in a ~100 nm thick metal film deposited on a
transparent substrate (e.g., silicon dioxide). These structures are
further described in the art, e.g., in M. J. Levene, et al.,
Science 2003, 299, 682; and M. Foquet, et al., J. Appl. Phys. 2008,
103, 034301, the disclosures of which are incorporated herein by
reference in their entireties for all purposes. Each ZMW becomes a
nanophotonic visualization chamber for recording an individual
polymerization reaction, providing a detection volume of just 100
zeptoliters (10^-21 liters). This volume represents a
~1000-fold improvement over diffraction-limited confocal
microscopy, facilitating observation of single incorporation events
against the background created by the relatively high concentration
of fluorescently labeled nucleotides. Polyphosphonate and
silane-based surface coatings mediate enzyme immobilization to the
transparent floor of the ZMW while blocking non-specific
attachments to the metal top and side wall surfaces (Eid, et al.,
supra; and J. Korlach, et al., Proc Natl Acad Sci USA 2008, 105, 1176,
the disclosures of which are incorporated herein by reference in
their entireties for all purposes). While certain methods described
herein involve the use of ZMW confinements, it will be readily
understood by those of ordinary skill in the art upon review of the
teachings herein that these methods may also be practiced using
other reaction formats, e.g., on planar substrates or in
nanometer-scale apertures other than zero-mode waveguides. (See,
e.g., U.S. Ser. No. 12/560,308, filed Sep. 15, 2009; and U.S.
Patent Publication No. 20080128627, incorporated herein supra.)
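
To illustrate why the small detection volume matters, the mean number of freely diffusing labeled nucleotides inside a volume is the concentration multiplied by Avogadro's number and the volume. The following is a minimal sketch of that arithmetic; the 1 micromolar concentration is a hypothetical value chosen only for illustration, and the diffraction-limited volume is taken as 1000 times the ZMW volume, per the ~1000-fold figure above.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI definition)


def mean_molecules(concentration_molar: float, volume_liters: float) -> float:
    """Mean number of molecules present in a volume at a given concentration."""
    return concentration_molar * AVOGADRO * volume_liters


ZMW_VOLUME_L = 100e-21                    # 100 zeptoliters, from the text
CONFOCAL_VOLUME_L = 1000 * ZMW_VOLUME_L   # ~1000-fold larger, from the text
EXAMPLE_CONCENTRATION_M = 1e-6            # hypothetical: 1 micromolar

in_zmw = mean_molecules(EXAMPLE_CONCENTRATION_M, ZMW_VOLUME_L)
in_confocal = mean_molecules(EXAMPLE_CONCENTRATION_M, CONFOCAL_VOLUME_L)
print(f"mean labeled nucleotides in ZMW volume:      {in_zmw:.3f}")
print(f"mean labeled nucleotides in confocal volume: {in_confocal:.1f}")
```

At the assumed concentration the ZMW volume holds well under one diffusing molecule on average, whereas the larger volume holds tens of them, which is why single incorporation events stand out against the background in the smaller volume.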
[0066] The second important component is phospholinked nucleotides
for which a detectable label (e.g., comprising a fluorescent dye)
is attached to the terminal phosphate rather than the base (FIG.
1). (See, e.g., J. Korlach, et al., Nucleos. Nucleot. Nucleic Acids
2008, 27, 1072, which is incorporated herein by reference in its
entirety for all purposes.) 100% replacement of unmodified
nucleotides by phospholinked nucleotides is performed, and the
enzyme cleaves away the label as part of the incorporation process,
leaving behind a completely natural, double-stranded nucleic acid
product. Each of the four different nucleobases is labeled with a
distinct detectable label to discriminate base identities during
incorporation events, thus enabling sequence determination of the
complementary DNA template. During incorporation, the enzyme holds
the labeled nucleotide in the ZMW's detection volume for tens of
milliseconds, orders of magnitude longer than the average diffusing
nucleotide is present. Signal (e.g., fluorescence) is emitted
continuously from the detectable label during the duration of
incorporation, causing a detectable pulse of increased fluorescence
in the corresponding color channel. The pulse is terminated
naturally by the polymerase releasing the
pyrophosphate-linker-label group. The polymerase then translocates
to the next base, and the process repeats. As shown in FIG. 1A,
single DNA polymerase molecules with bound DNA template are
attached to a substrate, e.g., at the bottom of each zero-mode
waveguide. Polymerization of the complementary DNA strand is
observed in real time by detecting fluorescently labeled
nucleotides. Reaction steps involved in SMRT™ sequencing are as
follows: Step 1: The DNA template/primer/polymerase complex is
surrounded by diffusing fluorescently labeled nucleotides which
probe the active site. Step 2: A labeled nucleotide makes a cognate
binding interaction with the next base in the DNA template that
lasts for tens of milliseconds, during which fluorescence is
emitted continuously. Step 3: The polymerase incorporates the
nucleotide into the growing nucleic acid chain, thereby cleaving
the α-β phosphodiester bond, followed by release of the
nucleotide. Steps 4-5: The process repeats. A prophetic trace is
shown in FIG. 1B that comprises each step shown in 1A. At steps 2
and 4, a fluorescent signal is emitted during binding and
incorporation of a nucleotide into the growing nucleic acid chain,
and monitoring of these fluorescent signals provides a sequence of
nucleotide incorporations that can be used to derive the sequence
of the template nucleic acid. For example, a 5'-G-A-3' sequence in
the growing chain indicates a 5'-T-C-3' sequence in the
complementary template strand.
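
The relationship between the order of incorporations and the template sequence in the example above is a reverse complement, which can be sketched as follows. The function name `template_from_nascent` is hypothetical, and only the four canonical DNA bases are handled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def template_from_nascent(nascent_5to3: str) -> str:
    """Return the template strand, written 5'->3', that is complementary to
    a nascent strand given 5'->3' (i.e., in order of incorporation)."""
    return "".join(COMPLEMENT[base] for base in reversed(nascent_5to3.upper()))


# The example from the text: 5'-G-A-3' in the growing chain.
print(template_from_nascent("GA"))
```

The reversal is needed because the two strands are antiparallel: the first nucleotide incorporated pairs with the 3'-most base of the template region being read.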
[0067] As described above, reaction data is indicative of the
progress of a reaction and can serve as a signal for the presence
of a modification in the template nucleic acid. Reaction data in
single molecule sequencing reactions using fluorescently
labeled bases is generally centered around characterization of
detected fluorescence pulses, a series of successive pulses ("pulse
trace" or one or more portions thereof), and other downstream
statistical analyses of the pulse and trace data. Fluorescence
pulses are characterized not only by their spectrum, but also by
other metrics including their duration, shape, intensity, and by
the interval between successive pulses (see, e.g., Eid, et al.,
supra; and U.S. Patent Publication No. 20090024331, incorporated
herein by reference in its entirety for all purposes). While not
all of these metrics are generally required for sequence
determination, they add valuable information about the processing
of a template, e.g., the kinetics of nucleotide incorporation and
DNA polymerase processivity and other aspects of the reaction.
Further, the context in which a pulse is detected (i.e., the one or
more pulses that precede and/or follow the pulse) can contribute to
the identification of the pulse. For example, the presence of
certain modifications alters not only the processing of the
template at the site of the modification, but also the processing
of the template upstream and/or downstream of the modification. For
example, the presence of modified bases in a template nucleic acid
has been shown to change the width of a pulse and/or the interpulse
duration (IPD), either at the modified base or at one or more
positions proximal to it. A change in pulse width may or may not be
accompanied by a change in IPD. FIG. 2 provides illustrative
examples of various types of reaction data in the context of a
pulse trace including IPD, pulse width (PW), pulse height (PH), and
context. FIG. 2A illustrates these reaction data on a pulse trace
generated on an unmodified template, and FIG. 2B illustrates how
the presence of a modification (5-MeC) can elicit a change in one
of these reaction data (IPD) to generate a signal (increased IPD)
indicative of the presence of the modification.
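
As a non-limiting illustration of the reaction data described above, pulse width (PW) and interpulse duration (IPD) can be computed from the start and end times of successive pulses, and positions with an elevated IPD relative to an unmodified control can be flagged. This is a minimal sketch under stated assumptions: the `Pulse` record, the function names, the millisecond time values, and the threshold ratio of 3.0 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pulse:
    base: str     # base call for the color channel of the pulse
    start: float  # pulse start time (ms)
    end: float    # pulse end time (ms)


def pulse_widths(pulses: List[Pulse]) -> List[float]:
    """PW: duration of each pulse."""
    return [p.end - p.start for p in pulses]


def interpulse_durations(pulses: List[Pulse]) -> List[float]:
    """IPD: gap between the end of one pulse and the start of the next."""
    return [nxt.start - prev.end for prev, nxt in zip(pulses, pulses[1:])]


def flag_long_ipds(sample: List[float], control: List[float],
                   ratio_threshold: float = 3.0) -> List[int]:
    """Indices where the sample IPD is at least ratio_threshold times the
    control IPD at the same position (threshold is a hypothetical value)."""
    return [i for i, (s, c) in enumerate(zip(sample, control))
            if c > 0 and s / c >= ratio_threshold]


control = [Pulse("G", 0.0, 50.0), Pulse("A", 250.0, 300.0), Pulse("G", 500.0, 550.0)]
sample = [Pulse("G", 0.0, 50.0), Pulse("A", 250.0, 300.0), Pulse("G", 1100.0, 1150.0)]

flagged = flag_long_ipds(interpulse_durations(sample), interpulse_durations(control))
print(flagged)
```

A single long IPD in one read is weak evidence on its own, since polymerase kinetics are stochastic; comparing distributions of IPDs over many reads of the same position would be a more robust variant of the same idea.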
[0068] In yet further embodiments, reaction data is generated by
analysis of the pulse and trace data to determine error metrics for
the reaction. Such error metrics include not only raw error rate,
but also more specific error metrics, e.g., identification of
pulses that did not correspond to an incorporation event,
incorporations that were not accompanied by a detected pulse,
incorrect incorporation events, and the like. Any of these error
metrics, or combinations thereof, can serve as a signal indicative
of the presence of one or more modifications in the template
nucleic acid. In some embodiments, such analysis involves
comparison to a reference sequence and/or comparison to replicate
sequence information from the same or an identical template, e.g.,
using a standard or modified multiple sequence alignment. Certain
types of modifications cause an increase in one or more error
metrics. For example, some modifications can be "paired" with more
than one type of incoming nucleotide or analog thereof, so
replicate sequence reads for the region comprising the modification
will show variable base incorporation opposite such a modification.
Such variable incorporation is thereby indicative of the presence
of the modification. Certain types of modifications cause an
increase in one or more error metrics proximal to the modification,
e.g., immediately upstream or downstream. The error metrics at a
locus or within a region of a template are generally indicative of
the type of modification(s) present at that locus or in that region
of the template, and therefore serve as a signal of such
modification(s). In preferred embodiments, at least some reaction
data is collected in real time during the course of the reaction,
e.g., pulse and/or trace characteristics.
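
The variable base incorporation described above can be sketched as a per-position discordance measure over replicate reads. This is a minimal illustration, assuming the replicate reads have already been aligned to equal length without gaps; the function name and the 0.25 cutoff in the usage example are hypothetical.

```python
from collections import Counter
from typing import List


def discordance_by_position(aligned_reads: List[str]) -> List[float]:
    """For each aligned position, return the fraction of replicate reads
    whose base call differs from the most common call at that position."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be non-empty in number and of equal length")
    result = []
    for column in zip(*aligned_reads):
        majority_count = Counter(column).most_common(1)[0][1]
        result.append(1 - majority_count / len(column))
    return result


replicates = ["ACGT", "ACGT", "ATGT", "AGGT"]
scores = discordance_by_position(replicates)
candidates = [i for i, score in enumerate(scores) if score >= 0.25]
print(scores)
print(candidates)
```

Positions with high discordance are candidates only; random sequencing error also produces discordant calls, so a comparison against the error rate observed on an unmodified control would typically accompany such a measure.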
[0069] Although described herein primarily with regard to
fluorescently labeled nucleotides, other types of detectable labels
and labeling systems can also be used with the methods,
compositions, and systems described herein including, e.g., quantum
dots, surface enhanced Raman scattering particles, scattering
metallic nanoparticles, FRET systems, intrinsic fluorescence,
non-fluorescent chromophores, and the like. Such labels are
generally known in the art and are further described in Provisional
U.S. Patent Application No. 61/186,661, filed Jun. 12, 2009; U.S.
Pat. Nos. 6,399,335, 5,866,366, 7,476,503, and 4,981,977; U.S.
Patent Pub. No. 2003/0124576; U.S. Ser. No. 61/164,567; WO
01/16375; Mujumdar, et al Bioconjugate Chem. 4(2):105-111, 1993;
Ernst, et al, Cytometry 10:3-10, 1989; Mujumdar, et al, Cytometry
10:11-19, 1989; Southwick, et al, Cytometry 11:418-430, 1990; Hung,
et al, Anal. Biochem. 243(1):15-27, 1996; Nucleic Acids Res,
20(11):2803-2812, 1992; and Mujumdar, et al, Bioconjugate Chem.
7:356-362, 1996; Intrinsic Fluorescence of Proteins, vol. 6,
publisher: Springer US, ©2001; Kronman, M. J. and Holmes,
L. G. (2008) Photochem and Photobio 14(2): 113-134; Yanushevich, Y.
G., et al. (2003) Russian J. Bioorganic Chem 29(4) 325-329; and
Ray, K., et al. (2008) J. Phys. Chem. C 112(46): 17957-17963, all
of which are incorporated herein by reference in their entireties
for all purposes. Many such labeling groups are commercially
available, e.g., from the Amersham Biosciences division of GE
Healthcare, and Molecular Probes/Invitrogen Inc. (Carlsbad,
Calif.), and are described in `The Handbook--A Guide to Fluorescent
Probes and Labeling Technologies, Tenth Edition` (2005) (available
from Invitrogen, Inc./Molecular Probes and incorporated herein in
its entirety for all purposes). Further, a combination of the
labeling strategies described herein and known in the art for
labeling reaction components can be used.
[0070] Various strategies, methods, compositions, and systems are
provided herein for detecting modifications in a nucleic acid,
e.g., during real-time nascent strand synthesis. For example, since
DNA polymerases can typically bypass 5-MeC in a template nucleic
acid and properly incorporate a guanine in the complementary strand
opposite the 5-MeC, additional strategies are desired to detect
such altered nucleotides in the template. Various such strategies
are provided herein, such as, e.g., a) modification of the
polymerase to introduce a specific interaction with the modified
nucleotide; b) detecting variations in enzyme kinetics, e.g.,
pausing; c) use of a detectable and optionally modified nucleotide
analog that specifically base-pairs with the modification and is
potentially incorporated into the nascent strand; d) chemical
treatment of the template prior to sequencing that specifically
alters 5-MeC sites in the template; e) use of a protein that
specifically binds to the modification in the template nucleic
acid, e.g., delaying or blocking progression of a polymerase during
replication; and f) use of sequence context (e.g., the higher
frequency of 5-MeC nucleotides in CpG islands) to focus
modification detection efforts on regions of the template that are
more likely to contain such a modification (e.g., GC-rich regions
for 5-MeC detection). These strategies may be used alone or in
combination to detect 5-MeC sites in a template nucleic acid during
nascent strand synthesis.
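
Strategy f) above, using sequence context to focus detection efforts, can be illustrated by locating CpG dinucleotides and computing GC content in a template sequence. This is a minimal sketch; the function names are hypothetical, and how the resulting positions are weighted in any downstream analysis is left open.

```python
from typing import List


def cpg_positions(sequence: str) -> List[int]:
    """Return 0-based indices of each C that is immediately followed by G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


def gc_fraction(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


template = "ACGTCGGA"
print(cpg_positions(template))
print(gc_fraction(template))
```

The cytosines at the returned positions would be the ones examined most closely for kinetic signatures of 5-MeC under this strategy.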
III. Polymerase Modifications
[0071] Various different polymerases may be used in
template-directed sequence reactions, e.g., those described at
length in U.S. Pat. No. 7,476,503, the disclosure of which
is incorporated herein by reference in its entirety for all
purposes. In brief, the polymerase enzymes suitable for the present
invention can be any nucleic acid polymerases that are capable of
catalyzing template-directed polymerization with reasonable
synthesis fidelity. The polymerases can be DNA polymerases or RNA
polymerases (including, e.g., reverse transcriptases), a
thermostable polymerase or a thermally degradable polymerase,
wildtype or modified. In some embodiments, the polymerases exhibit
enhanced efficiency as compared to the wildtype enzymes for
incorporating unconventional or modified nucleotides, e.g.,
nucleotides linked with fluorophores. In certain preferred
embodiments, the methods are carried out with polymerases
exhibiting a high degree of processivity, i.e., the ability to
synthesize long stretches (e.g., over about 10 kilobases) of
nucleic acid by maintaining a stable nucleic acid/enzyme complex.
In certain preferred embodiments, sequencing is performed with
polymerases capable of rolling circle replication. A preferred
rolling circle polymerase exhibits strand-displacement activity,
and as such, a single circular template can be sequenced repeatedly
to produce a sequence read comprising multiple copies of the
complement of the template strand by displacing the nascent strand
ahead of the translocating polymerase. Since the methods of the
invention can increase processivity of the polymerase by removing
lesions that block continued polymerization, they are particularly
useful for applications in which a long nascent strand is desired,
e.g., as in the case of rolling-circle replication. Non-limiting
examples of rolling circle polymerases suitable for the present
invention include T5 DNA polymerase, T4 DNA
polymerase holoenzyme, phage M2 DNA polymerase, phage PRD1 DNA
polymerase, Klenow fragment of DNA polymerase, and certain
polymerases that are modified or unmodified and chosen or derived
from the phages .PHI.29 (Phi29), PRD1, Cp-1, Cp-5, Cp-7, .PHI.15,
.PHI.1, .PHI.21, .PHI.25, BS 32 L17, PZE, PZA, Nf, M2Y (or M2),
PR4, PR5, PR722, B103, SF5, GA-1, and related members of the
Podoviridae family. In certain preferred embodiments, the
polymerase is a modified Phi29 DNA polymerase, e.g., as described
in U.S. Patent Publication No. 20080108082, incorporated herein by
reference in its entirety for all purposes. Additional polymerases
are provided, e.g., in U.S. Ser. Nos. 11/645,125, filed Dec. 21,
2006; 11/645,135, filed Dec. 21, 2006; 12/384,112, filed Mar. 30,
2009; and 61/094,843, filed Sep. 5, 2008; as well as in U.S. Patent
Publication No. 20070196846, the disclosures of which are
incorporated herein by reference in their entireties for all
purposes.
[0072] Further optimization is achieved through improvement of
enzyme kinetics, either through the screening of polymerase
libraries and/or the engineering of polymerases, which include,
e.g., DNA polymerases, RNA polymerases, reverse transcriptases, and
the like. In particular, DNA polymerases may be screened to
identify those that have desirable properties for detection of
nucleic acid modifications described herein. Further, polymerases
can be engineered through directed mutagenesis of one or more
residues involved in various aspects of template-directed nascent
strand synthesis. For example, careful examination of the crystal
structure of the polymerase-DNA-nucleotide complex for certain
polymerase enzymes has shown that the polymerase rotates and flips
out the base on the single-stranded region of the template DNA that
is adjacent to the active site in the 5' direction (i.e., the base
in the "-1" position). During the subsequent DNA translocation
process, this base is flipped into the active site. As such, amino
acids that interact with a modified base in the -1 position or
during the subsequent translocation can be altered or substituted to
increase the enzyme's sensitivity to the modified base. In fact,
any protein residues that come into close contact with a
modification in the template are candidates for substitution or
alteration. For example, family B polymerases mostly contain
replicative polymerases and include the major eukaryotic DNA
polymerases .alpha., .delta., .epsilon., and also DNA polymerase
.zeta.. Family B also includes DNA polymerases encoded by some
bacteria and bacteriophages, e.g., T4, Phi29, and RB69
bacteriophages. Most family B polymerases share common structural
features for DNA binding, and the residues along the DNA
primer-template junction and the residues around the base binding
pocket at the -1 location (pre-insertion position) can be mutated
and the resulting mutants screened for enhanced response to a
modification of interest. Specifically, when 5-MeC is in the
"flipped-out" (-1) position, it is surrounded by several .PHI.29
polymerase amino acid residues, such as K392 and K422 which are
positioned close to the methyl group (FIG. 3). Mutations such as
K392R/W/M and K422R/W/M that substitute the native lysine residue
with amino acids with larger side chains (e.g., arginine,
tryptophan, or methionine) may increase the polymerase's
sensitivity to modified bases, potentially delaying the
translocation step and slowing incorporation of the complementary
dGTP. This is schematically illustrated with reference to FIG. 3,
which shows a structural model of 5-MeC positioned one base in the
5' direction relative to the .PHI.29 DNA polymerase active site. As
shown, two polymerase residues, K392 and K422, are close to the
methyl group of 5-MeC. These two residues are potential targets for
site-specific mutagenesis to amino acids with larger side chains
that will interact sterically with the methyl group. 5-MeC is shown
as the chemical structure in the right center, while the K392 and
K422 residues are left center.
[0073] Any residues that come in close contact with a modified base
are candidates for mutation. For example, using molecular morphing
and energy minimization to model the translocation path of 5-MeC, a
number of protein residues in the .PHI.29 polymerase have been
identified within 5 .ANG. of the methyl moiety as it is being
flipped into the active site during the translocation step. The
following groups of residues (listed in the order of the
translocation path) are targets of mutations to residues with
larger side chains: I93, M188, K392, V399, T421, K422; S95, Y101,
M102; Q99, L123, K124, T189, A190; G191, S388; P127, L384, N387,
S388; and L389, Y390, G391. In particular, I93Y and V399Y may
introduce a 5-methylcytosine specific binding region, analogous to
those shown by the crystal structures of the SRA/5-methylcytosine
binding complex. For example, see G. V. Avvakumov, et al., Nature
2008, 455, 822; and H. Hashimoto, et al., Nature 2008, 455, 826,
the disclosures of both of which are incorporated herein by
reference in their entireties for all purposes. Although the
residues identified above are specific to the .PHI.29 polymerase,
one of ordinary skill will readily recognize that the structural
similarity between the family B polymerases, and to a lesser extent
family A polymerases and other polymerases, allows identification
of homologous positions on related polymerases as targets for
mutation based on the teachings herein.
[0074] In addition to the foregoing, additional improvements are
derived from a molecular evolution program using these polymerases
to enhance their ability to sense 5-MeC. Such programs have already
been used to successfully improve large numbers of different
enzymes for a variety of applications, including improving DNA
polymerases for sequencing. Such methods may include
diversification of the amino acid sequence space by mutagenic PCR
and DNA shuffling, and/or yeast displays for expression and
selection (see, e.g., S. A. Gai, et al., Curr Opin Struct Biol
2007, 17, 467; and D. Lipovsek, et al., Chem Biol 2007, 14, 1176,
which are incorporated herein by reference in their entireties for
all purposes), in which .about.10.sup.4 copies of a recombinant
protein are displayed on the surface of a single yeast cell
carrying the transgene for the protein. The genotype-phenotype
linkage is provided by the yeast cell, but no protein purification
is necessary as the displayed proteins have the same properties as
bulk solutions of polymerase. With the available infrastructure in
house, this program can be initiated without startup costs as soon
as polymerase candidates emerge.
IV. Secondary Structure Detection
[0075] During single molecule sequencing as described supra, an
otherwise highly processive trace is sometimes interrupted by a
long pause. Such pausing can be caused by secondary structure,
e.g., a hairpin loop, in the template strand. In certain aspects,
the invention provides methods for not only identifying secondary
structure in a template nucleic acid, but also for improving the
overall accuracy of single molecule sequencing.
[0076] In certain embodiments, a sequencing read or "trace" is
generated by subjecting a template nucleic acid to a real-time,
template-directed sequencing reaction. The trace is examined to
identify long pauses by finding portions of the trace at which the
interpulse duration (IPD) is significantly longer than the average
IPD. For example, pauses are identified that are at least 2-, 3-,
5-, 10-, or 20-times longer than the average IPD. In some
embodiments, an IPD averaged over a few neighboring bases is used;
in other embodiments
an IPD averaged over a window of about 20, 30, 50, 70, or 100 bases
is used, and in yet other embodiments, an IPD averaged over all or
substantially all of the template is used.
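By way of illustration only, the pause identification described
above can be sketched as follows. The function name find_pauses,
its parameters, and the choice to exclude the candidate IPD from
the reference average are assumptions made for this sketch and are
not part of the disclosed methods.

```python
from statistics import mean


def find_pauses(ipds, fold=5.0, window=None):
    """Return indices whose IPD is at least `fold` times a reference mean.

    ipds   -- list of interpulse durations, one per incorporation event
    fold   -- threshold multiple (the text lists 2, 3, 5, 10, or 20)
    window -- number of neighboring events used for a local average;
              None averages over the whole trace
    """
    pauses = []
    n = len(ipds)
    for i, ipd in enumerate(ipds):
        if window is None:
            # Whole-trace average, leaving out the candidate itself.
            neighbors = ipds[:i] + ipds[i + 1:]
        else:
            half = window // 2
            lo, hi = max(0, i - half), min(n, i + half + 1)
            neighbors = ipds[lo:i] + ipds[i + 1:hi]
        if not neighbors:
            continue
        if ipd >= fold * mean(neighbors):
            pauses.append(i)
    return pauses
```

For example, a trace of twenty events at 0.1 s with a single 2.0 s
event at index 10 yields that index for a 5-fold threshold, with
either the whole-trace average or a 20-event window.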
[0077] The sequence reads generated before and after the pause are
analyzed within about a 20-, 30-, 50-, 70-, or 100-base window
centered on the pause, and regions that flank the pause site and
are complementary to one another are identified. Based upon the
complementary sequences, their spacing, and other known factors
that impact secondary structure formation (e.g., GC content, pH,
salt concentration, and the like), the probability of a hairpin
loop at that location in the template is determined. If this
probability is high, the sequence reads flanking the pause site are
re-examined to identify basecalls that do not match the
complementarity of the hairpin, e.g., a non-complementary basecall
or missing basecall within a stretch of complementary basecalls.
Such non-complementary or missing basecalls have a higher
probability of being errors than basecalls in the region that do
not interrupt the complementarity between the regions upstream and
downstream of the pause site. As such, the basecalls at these
positions are reevaluated to determine if the initial basecall was
erroneous. Further, knowledge of a given template's propensity for
forming secondary structures that interfere with processivity of a
polymerase can be used in future rounds of template-directed
sequencing of the template to better call base positions in the
vicinity of the interfering secondary structure, thereby improving
accuracy of basecalls in the future rounds.
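As a minimal sketch of the flank analysis described above, the
following code searches a window centered on a pause for the
longest upstream segment whose reverse complement appears
downstream. The names longest_stem, window, and min_stem are
hypothetical. The sketch finds exact stems only; it does not
estimate hairpin probability from GC content, pH, or salt
concentration, and it does not tolerate the non-complementary or
missing basecalls that the method is meant to flag.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_stem(read, pause_index, window=50, min_stem=4):
    """Find the longest exact stem flanking a pause.

    Returns (upstream_start, downstream_start, length) in read
    coordinates, or None if no stem of at least min_stem bases exists.
    """
    half = window // 2
    up_start = max(0, pause_index - half)
    upstream = read[up_start:pause_index]
    downstream = read[pause_index:pause_index + half]
    # Try the longest candidate segments first so the first hit is maximal.
    for length in range(len(upstream), min_stem - 1, -1):
        for i in range(len(upstream) - length + 1):
            target = reverse_complement(upstream[i:i + length])
            j = downstream.find(target)
            if j != -1:
                return (up_start + i, pause_index + j, length)
    return None
```

A basecall that interrupts an otherwise long stem would split it
into two shorter exact stems in this sketch; such a split is one
possible cue for re-evaluating the basecall at that position.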
[0078] Additionally, since the duration of the pause is likely
related to the strength of the secondary structure formed within
the template, the duration can be used as a metric in determining
the type, size, compositions, and likelihood of a secondary
structure in the template molecule. In addition, for applications
in which a single template is repeatedly subjected to
template-directed synthesis, the replicate sequence reads that are
generated are compared to one another to determine if a given
portion of the template consistently produces a pause in the
synthesis reaction, which provides further evidence that the pause
is due to the sequence context, e.g., secondary structure
spontaneously forming in the template.
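The comparison of replicate reads can be sketched as below,
assuming pause positions have already been mapped to template
coordinates for each pass. The name consistent_pauses and the
min_fraction cutoff are illustrative assumptions.

```python
from collections import Counter


def consistent_pauses(pauses_per_pass, min_fraction=0.8):
    """Return template positions that pause in enough replicate passes.

    pauses_per_pass -- one iterable of paused template positions per pass
    min_fraction    -- fraction of passes that must pause at a position
    """
    n = len(pauses_per_pass)
    if n == 0:
        return []
    # Count each position at most once per pass.
    counts = Counter(pos for one_pass in pauses_per_pass for pos in set(one_pass))
    return sorted(pos for pos, c in counts.items() if c / n >= min_fraction)
```

Positions that recur across passes are more plausibly explained by
sequence context than positions that pause in a single pass.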
IV. Modified and Non-Natural Nucleotide Analogs and Base
Pairing
[0079] In certain aspects, methods, compositions, and systems are
provided that utilize modified and/or non-natural nucleotide
analogs and/or base pairing. For example, certain non-natural
nucleotide analogs can be incorporated by a polymerase into a
nascent strand opposite a modification, e.g., missing or damaged
base. In certain embodiments, such non-natural nucleotide analogs
are detectably labeled such that their incorporation can be
distinguished from incorporation of a natural nucleotide or
nucleotide analog, e.g., during template-directed nascent strand
synthesis. This strategy allows real-time sequencing that generates
reads that not only provide base sequence information for native
bases in the template, but also modified bases without requiring
further modifications to the standard methods (Eid, et al., supra).
This method facilitates modification profiling in the absence of
repeated sequencing of each DNA template, and is particularly well
suited to de novo applications. In certain embodiments, the
modified or non-natural nucleotide analogs are not incorporatable
into the nascent strand and the polymerase can bypass the
modification using a native nucleotide or nucleotide analog, which
may or may not be labeled. Since the modified or non-natural analog
has a higher affinity for the modification than a native analog, it
will bind to the polymerase complex multiple times before a native
analog is incorporated, resulting in multiple signals for a single
incorporation event, and thereby increasing the likelihood of
accurate detection of the modification. Similar methods for use in
sequencing unmodified template nucleic acids are described in
greater detail in U.S. Ser. No. 61/186,661, filed Jun. 12, 2009 and
incorporated herein by reference in its entirety for all
purposes.
[0080] Since 5-MeC retains Watson-Crick hydrogen bonding with
guanine, an incoming guanine nucleotide analog can be used to
detect 5-MeC in the template strand. For example, a guanine
nucleotide analog can be constructed to cross the major groove and
sense the modified cytosine therein. In particular embodiments, a
fused aromatic ring is linked to the N7 atom of the guanine of the
nucleotide analog. This modified guanine nucleotide analog can
"sense" the methyl group of 5-MeC and affect the base-pairing as
compared to an unmodified guanine nucleotide analog. Such guanine
nucleotide analogs are further described elsewhere, e.g., in
International Application Pub. No. WO/2006/005064 and U.S. Pat. No.
7,399,614. Similar modifications can be made to nucleotide analogs
appropriate for SMRT.TM. sequencing applications, e.g., those with
terminal-phosphate labels, e.g., as described in U.S. Pat. Nos.
7,056,661 and 7,405,281; U.S. Patent Pub. Nos. 20070196846 and
20090246791; and U.S. Ser. No. 12/403,090, all of which are
incorporated herein by reference in their entireties for all
purposes. In certain embodiments, 5-MeC detection may be carried
out using a modified guanine nucleotide analog described above that
carries a detectable label that is distinguishable from detectable
labels on other reaction components, e.g., other nucleotide analogs
being incorporated. Such a strategy allows 5-MeC detection by
observation of a signal, rather than or in addition to altered
polymerase kinetics, which facilitates methylation profiling even
in the absence of redundant or replicate sequencing of the
template.
[0081] Certain embodiments use other non-natural base pairs that
are orthogonal to the natural nucleobase pairs. For example,
isoguanine (isoG) can be incorporated by a polymerase into DNA at
sites complementary to isocytosine (isoC) or 5-methylisocytosine
(.sup.MeisoC), and vice versa, as shown by the following chemical
structure and described in A. T. Krueger, et al., "Redesigning the
Architecture of the Base Pair: Toward Biochemical and Biological
Function of New Genetic Sets." Chemistry & Biology 2009, 16(3),
242, incorporated herein by reference in its entirety for all
purposes.
##STR00001##
Other non-natural base pairs that are orthogonal to the natural
nucleobase pairs can also be used, e.g., Im-N.sup.O/Im-O.sup.N or
A*/T* (described further in J. D. Ahle, et al., Nucleic Acids Res
2005, 33(10), 3176; A. T. Krueger, et al., supra; and A. T.
Krueger, et al., Curr Opinions in Chem Biology 2007, 11(6),
588).
[0082] In certain embodiments, a nucleic acid modification to be
detected by the methods herein is 7,8-dihydro-8-oxoguanine
("8-oxoG") (also known as 8-oxo-7,8-dihydroguanine, 8-oxoguanine,
and 8-hydroxyguanine). 8-oxoG is the major oxidative DNA lesion
found in human tissue. Due to the relatively subtle modification to
guanine in 8-oxoG, it may be bypassed by replicative DNA
polymerases, which preferentially incorporate an adenine nucleotide
into the nascent nucleic acid strand at the position where the
complementary cytosine should be incorporated, thereby resulting in
a mutation in the nascent strand (see, e.g., Hsu, et al. (2004)
Nature 431(7005): 217-21; and Hanes, et al. (2006) J. Biol. Chem.
281:36241-8, which are incorporated herein by reference in their
entireties for all purposes). As well as introducing mutations in
vivo, the bypass of such lesions by a polymerase during
template-dependent sequencing reactions introduces errors into the
sequence reads generated, and the presence of the damaged guanine
nucleotide can also cause base misalignment, potentially adding
further errors into a resulting sequence read. DNA synthesis
opposite an 8-oxoG lesion has very low specificity
(k.sub.cat/K.sub.m), about 10.sup.6-fold lower than that for
incorporating a C opposite an unmodified G. See, e.g., Hsu, et al.,
supra. Further,
due to its very low redox potential 8-oxoG can be more easily
oxidized than unmodified guanine, and the 8-oxoG oxidation products
are very effective blockers of DNA polymerases. See, e.g., Duarte,
et al. (1999) Nucleic Acids Res 27(2):496-502; and Kornyushyna, et
al. (2002) Biochemistry 41(51): 15304-14, the disclosures of which
are incorporated herein by reference in their entireties for all
purposes.
[0083] It has been shown that 8-oxoG alters both k.sub.cat and
K.sub.m of steady-state incorporation kinetics, which are likely to
cause altered pulse widths and IPD before incorporation of a
nucleotide (G or A) into the complementary position in the nascent
strand during template-directed sequencing reactions (see, e.g.,
Hsu, et al. and Hanes, et al., supra). These altered kinetic
characteristics can be used to detect 8-oxoG in a template nucleic
acid during real-time sequencing reactions. Further, a circular
template that comprises both complementary strands of a region of
interest (e.g., as described in U.S. Ser. Nos. 12/383,855 and
12/413,258, both filed Mar. 27, 2009 and incorporated herein by
reference in their entireties for all purposes) can be used to
repeatedly sequence both strands of a region of interest, thereby
generating redundant sequence information that can be analyzed to
statistically determine how often a given position in the template
has an A-G mismatch as compared to how often the correct base is
incorporated at that position. The redundant sequence information
increases the accuracy of correctly calling a position as a G or an
8-oxoG. For example, if the mismatch rate is 100% and one detects
an A at the position but a G at the complementary position, then it
is highly likely that the A detected was Hoogsteen base pairing
with an 8-oxoG in the template. This
strategy is similar to detection of 5-MeC modifications that have
been deaminated to uracil prior to sequencing, as described in
greater detail below.
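One way the redundant reads might be tallied is sketched below. The
function name summarize_site, the 0.5 default threshold, and the
assumption that calls from both strands are already aligned to the
same base pair are all hypothetical choices for this illustration.

```python
from collections import Counter


def summarize_site(site_calls, partner_calls, min_a_fraction=0.5):
    """Summarize redundant calls at one base pair of a circular template.

    site_calls    -- bases incorporated opposite the candidate position,
                     one per pass over that strand
    partner_calls -- bases incorporated at the paired position when the
                     complementary strand serves as template
    """
    site = Counter(site_calls)
    partner = Counter(partner_calls)
    n = sum(site.values())
    a_fraction = site["A"] / n if n else 0.0
    # A consensus G on the partner strand implies a C in that strand,
    # i.e. the candidate position was originally a G.
    partner_is_g = bool(partner) and partner.most_common(1)[0][0] == "G"
    return {
        "a_fraction": a_fraction,
        "partner_consensus_g": partner_is_g,
        "candidate_8oxoG": partner_is_g and a_fraction >= min_a_fraction,
    }
```

A position read mostly as A while its partner reads as G is flagged
as a candidate 8-oxoG; a position read as A whose partner reads as
T is not flagged, since that pattern is consistent with an
ordinary T in the template.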
[0084] The mismatch incorporation rate opposite 8-oxoG sites, as
well as the degree to which IPD and pulse width are affected by
8-oxoG depend on the type of polymerase used in the reaction (see,
e.g., Hsu, et al. and Hanes, et al., supra). As such, polymerase
mutants can be designed to have increased kinetic sensitivity to
8-oxoG, or increased/decreased misincorporation rate opposite an
8-oxoG. Methods for designing polymerases for various embodiments
of the invention are known in the art and provided elsewhere
herein. Further, multiple binding events are very likely at the
site of modification, resulting in one or more signals not
associated with incorporation into the nascent strand, and these
multiple binding events can also occur at positions proximal to the
modification, e.g., continuing for a few bases after the site of
damage. These additional signaling events would provide a robust
indicator of the site of modification. In addition, multiple
sequencing reads for the region of the template comprising the
modification are expected to contain variable numbers of extra
signaling events at or proximal to the modification. As such,
comparison of this redundant sequence data will also facilitate
identification of loci comprising the modification.
V. Chemical Modification of Template
[0085] Direct detection of modifications (e.g., methylated bases as
described above) without pre-treatment of the DNA sample has many
benefits. Alternatively or additionally, complementary techniques
may be employed, such as the use of non-natural or modified
nucleotide analogs and/or base pairing described elsewhere herein.
In general, such complementary techniques serve to enhance the
detection of the modification, e.g., by amplifying a signal
indicative of the modification. Further, while the methods
described herein focus primarily on detection of 5-MeC nucleotides,
it will be clear to those of ordinary skill in the art that these
methods can also be extended to detection of other types of
nucleotide modifications or damage. In addition, since certain
sequencing technologies (e.g., SMRT.TM. sequencing) do not require
amplification of the template, e.g., by PCR, other chemical
modifications of the 5-MeC can be employed to facilitate detection
of these modified nucleotides in the template. For example, the
difference in redox potential between normal cytosine and 5-MeC can
be used to selectively oxidize 5-MeC and further distinguish it
from the nonmethylated base. Such methods are further described
elsewhere, and include halogen modification (S. Bareyt, et al.,
Angew Chem Int Ed Engl 2008, 47(1), 181) and selective osmium
oxidation (A. Okamoto, Nucleosides Nucleotides Nucleic Acids 2007,
26(10-12), 1601; and K. Tanaka, et al., J Am Chem Soc 2007,
129(17), 5612), and these references are incorporated herein by
reference in their entireties for all purposes.
[0086] Glycosylase Modification
[0087] By way of example, DNA glycosylases are a family of repair
enzymes that excise altered, damaged, or mismatched nucleotide
residues in DNA while leaving the sugar-phosphate backbone intact.
Additional information on glycosylase mechanisms and structures is
provided in the art, e.g., in A. K. McCullough, et al., Annual Rev
of Biochem 1999, 68, 255. In particular, four DNA glycosylases
(ROS1, DME, DML2, and DML3) have been identified in Arabidopsis
thaliana that remove methylated cytosine from double-stranded DNA,
leaving an abasic site. (See, e.g., S. K. Ooi, et al., Cell 2008,
133, 1145, incorporated herein by reference in its entirety for all
purposes.) Furthermore, it has been shown that a 5'-triphosphate
derivative of the pyrene nucleoside (dPTP) is efficiently and
specifically inserted by certain DNA polymerases into abasic DNA
sites through steric complementarity. (See, e.g., T. J. Matray, et
al., Nature 1999, 399(6737), 704, incorporated herein by reference
in its entirety for all purposes.)
[0088] In certain embodiments of single-molecule, five-color DNA
methylation sequencing, DNA glycosylase activity can be combined
with polymerase incorporation of a non-natural nucleotide analog
(e.g., a pyrene analog (dPTP) as shown in FIG. 4). For example, in
certain embodiments, methylated cytosines are excised from a DNA
sample treated with an Arabidopsis DNA glycosylase. Covalent
linkage of a fifth fluorophore to the terminal phosphate of dPTP
allows detection of abasic sites during polymerase-mediated DNA
synthesis.
[0089] FIG. 4 shows the principle of five-base DNA methylation
sequencing. As shown in FIG. 4A, genomic DNA is fragmented into
pieces up to several kilobases in length, which serve as the DNA
template. FIG. 4B illustrates DNA glycosylase excising a 5-MeC from
the template (black), leaving an abasic site. During SMRT.TM.
sequencing, the DNA polymerase synthesizes the complementary strand
and preferentially incorporates a fluorophore-phospholinked pyrene
analog opposite the abasic site. This fluorophore has spectral
characteristics distinct from those of the other four labeled
nucleotides and indicates the presence of a 5-MeC in the original
template. Further, error metrics can also be used to identify the
modification, e.g., an increase in binding events for the pyrene
analog may occur at the abasic site, as well as at downstream
positions as the incorporated pyrene analog is "buried" in the
nascent strand during subsequent incorporation events. In certain
embodiments, a non-hydrolyzable pyrene analog carrying a detectable
label is used at a concentration sufficient to bind (and be
detected) several times at the abasic site before a hydrolyzable
(and, preferably, distinctly labeled) analog is incorporated.
Methods using non-hydrolyzable analogs are further described
below.
[0090] A potential challenge in carrying out the above-described
methods is that many DNA glycosylases display some lyase activity,
e.g., bifunctional DNA glycosylase/AP lyases. These enzymes can
cleave the phosphodiester backbone 3' to the AP (abasic) site
generated by the glycosylase activity resulting in an abasic and
unsaturated ribose derivative at that site, which could prevent a
polymerase from incorporating the pyrene analog complementary to
this site. In certain cases, it may be desirable to suppress any
lyase activity of the Arabidopsis repair enzymes and enhance the
desired glycosylase activity. Strategies for achieving this include
site directed mutagenesis and the addition of a catalytically
inactive AP endonuclease to the glycosylase reaction. (See, e.g.,
A. E. Vidal, et al., Nucleic Acids Res 2001, 29, 1285, incorporated
herein by reference in its entirety for all purposes.) A parallel
protein mutagenesis program aims to enhance polymerase processivity
in the presence of a dPTP analog. Other variations exploit ways in
which the kinetics of pyrene incorporation into the abasic site are
affected by fluorophore identity, the number of phosphates attached
to the pyrene analog, and the structure of the linker connecting
the fluorophore to the terminal phosphate group.
[0091] In other embodiments of single-molecule, five-color DNA
methylation sequencing, DNA glycosylase activity can be combined
with addition of a non-natural base to replace the methylated base.
Briefly, after glycosylase-catalyzed excision of 5-MeC (with or
without cleavage of the phosphodiester backbone), a class I or
class II AP endonuclease is added to remove the abasic ribose
derivative by cleavage at the phosphate groups 3' and 5' to the
abasic site, thereby leaving 3'-OH and 5'-phosphate termini. A
polymerase capable of extending from the free 3'-OH (e.g., Pol I or
human pol .beta.) and a non-natural base (e.g., isoC, isoG, or
.sup.MeisoC) are added to incorporate the non-natural base into the
abasic site. A DNA ligase (e.g., LigIII) is added to close the
phosphodiester backbone by forming covalent phosphodiester bonds
between the free 3'-OH and 5'-phosphates via ATP hydrolysis.
Finally, a processive polymerase (e.g., .PHI.29 DNA polymerase) is
used to synthesize a nascent nucleic acid strand complementary to
the template strand, where the fifth nucleotide analog is the
complement of the non-natural base that replaced 5-MeC in the
template. For example, if the replacement base was isoC or
.sup.MeisoC, then the fifth analog would be isoG. As such, the
fifth analog would only incorporate into the nascent strand at
positions complementary to 5-MeC sites in the template nucleic
acid. In preferred embodiments, the fifth analog has a detectable
label (e.g., fluorescent dye) that is distinct from labels on other
reaction components, e.g., detectable labels on other nucleotide
analogs in the reaction mixture.
[0092] Further, glycosylases exist or can be engineered for various
DNA modifications, damage, or mismatches, so the methods described
above are applicable not only for detection of 5-MeC, but also
provide methods for detecting those other types of modifications,
as well. Methods for the use of glycosylases for detection of other
types of DNA damage are described in U.S. Ser. No. 61/186,661,
filed Jun. 12, 2009 and incorporated herein by reference in its
entirety for all purposes. In certain embodiments, the pyrene (or
similar) nucleotide analog can be non-hydrolyzable to increase the
residence time and, therefore, lengthen the emitted signal
indicative of the presence of the particular lesion of interest. A
non-hydrolyzable fifth-base is eventually displaced by a
hydrolyzable analog and synthesis of the nascent strand continues.
Alternatively, a fifth-base may be hydrolyzable but may produce
multiple separate signals prior to incorporation to increase the
likelihood of detection.
[0093] Bisulfite Modification
[0094] In certain embodiments, the template may be modified by
treatment with bisulfite. Bisulfite sequencing is a common method
for analyzing CpG methylation patterns in DNA. Bisulfite treatment
deaminates unmethylated cytosine in a single-stranded nucleic acid
to form uracil (P. W. Laird, Nat Rev Cancer 2003, 3(4), 253; and H.
Hayatsu, Mutation Research 2008, 659, 77, incorporated herein by
reference in their entireties for all purposes). In contrast, the
modified 5-MeC base is resistant to treatment with bisulfite. As
such, pretreatment of template DNA with bisulfite will convert
unmethylated cytosines to uracils, and subsequent sequencing reads
will contain
guanine incorporations opposite 5-MeC nucleotides in the template
and adenine incorporations opposite the uracil (previously
unmethylated cytosine) nucleotides. If a nucleic acid to be treated
with bisulfite is double-stranded, it is denatured prior to
treatment. In conventional methods, amplification, e.g., PCR,
typically precedes sequencing, which amplifies the modified nucleic
acid, but does not preserve information about the complementary
strand. In contrast, certain embodiments of the present invention
include use of a template molecule comprising both strands of a
double-stranded nucleic acid that can be converted to a
single-stranded molecule, e.g., by adjusting pH, temperature, etc.
Treatment of the single-stranded molecule with bisulfite is
followed by single-molecule sequencing, and because the template
retains both strands of the original nucleic acid, sequence
information from both is generated. Comparison of the resulting
sequence reads for each strand of the double-stranded nucleic acid
will identify positions at which an unmethylated cytosine was
converted to uracil in the original templates since the reads from
the two templates will be non-complementary at that position (A-C
mismatch). Likewise, reads from the two templates will be
complementary at a cytosine position (G-C match) where the cytosine
position was methylated in the original template. In certain
preferred embodiments, a circular template is used, preferably
having regions of internal complementarity that can hybridize to
form a double-stranded region, e.g., as described in U.S. Ser. No.
12/383,855 and U.S. Ser. No. 12/413,258, both filed on Mar. 27,
2009, and both incorporated herein by reference in their entireties
for all purposes.
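The strand comparison described above can be illustrated with the
following sketch. It assumes the incorporated-base calls from the
two strands have already been aligned so that index i in each list
refers to the same base pair; the names PAIR_MEANING and
classify_positions and the label strings are hypothetical.

```python
# Keys are (call_a, call_b): the base incorporated opposite a position on
# strand A, and the base incorporated opposite its partner on strand B.
PAIR_MEANING = {
    ("G", "C"): "5-MeC on strand A",            # C retained, G-C match
    ("A", "C"): "unmethylated C on strand A",   # C read as U, A-C mismatch
    ("C", "G"): "5-MeC on strand B",
    ("C", "A"): "unmethylated C on strand B",
    ("A", "T"): "T on strand A",                # ordinary T:A pair
    ("T", "A"): "A on strand A",                # ordinary A:T pair
}


def classify_positions(calls_a, calls_b):
    """Classify each aligned base pair from post-bisulfite calls."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call sequences must be aligned to equal length")
    return [PAIR_MEANING.get((a, b), "ambiguous") for a, b in zip(calls_a, calls_b)]
```

Pairs outside the table, such as two identical calls, are reported
as ambiguous rather than forced into a category.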
[0095] As described elsewhere herein, methylcytosine has an effect
on IPD over a number of neighboring positions when compared to
non-methylated cytosine. Uracil compared to thymine is like
unmethylated cytosine compared to methylcytosine (i.e. the only
difference between U and T is that T has an additional methyl
group). Thus, the invention provides methods for performing
bisulfite sequencing in which the polymerase kinetics (IPD and
pulse width) or the mismatch incorporation rate are monitored in
addition to the actual nucleotides being incorporated. Detection of
a change in either of these kinetic parameters or in the mismatch
rate at the position in question, or at neighboring positions, is
used to determine whether or not a position was always a T or is a
U that was originally an unmethylated cytosine.
[0096] In certain embodiments, polymerase mutants are designed that
are more sensitive to the difference between thymine and uracil in
order to enhance the effect described above. Methods for designing
polymerase variants are described in detail above and need not be
repeated here.
[0097] Additionally or alternatively, PCR of uracil-containing
oligonucleotides is not necessarily as efficient as PCR without
uracil. This issue can bias the PCR amplification of
bisulfite-converted DNA. The methods of sequencing-by-synthesis
using bisulfite-modified templates described herein circumvent this
problem by not using PCR amplification. However, the kinetics of
these sequencing-by-synthesis reactions can be monitored to detect
changes in kinetics due to the presence of uracil residues.
[0098] Further, the methods presented herein are useful for
detecting PCR bias in the amplification of bisulfite-treated
nucleic acids. For example, a few rounds of PCR could be performed
on various oligos, some with uracil and some without (including
controls with the same sequence but containing thymine in place of
uracil). After performing sequencing-by-synthesis on all the
resulting oligos, one could determine the percentage of oligos that
still contain uracil. If that percentage differs from the expected
percentage given ideal (unbiased) PCR amplification, then a bias
has been detected.
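The arithmetic of this comparison can be sketched as follows. The
function name and the tolerance of 0.05 are hypothetical and chosen
only for illustration; a real analysis would more likely use a
statistical test that accounts for the number of reads.

```python
def uracil_fraction_bias(n_uracil_reads, n_total_reads,
                         expected_fraction, tolerance=0.05):
    """Compare the observed fraction of uracil-containing oligos with
    the fraction expected under ideal (unbiased) PCR amplification.

    tolerance is a hypothetical cutoff on the absolute deviation.
    Returns (observed_fraction, deviation, bias_detected).
    """
    if n_total_reads <= 0:
        raise ValueError("n_total_reads must be positive")
    observed = n_uracil_reads / n_total_reads
    deviation = observed - expected_fraction
    return observed, deviation, abs(deviation) > tolerance
```

For example, if half of the input oligos contained uracil but only
300 of 1000 sequenced products do, the deviation is -0.2 and a bias
against uracil-containing oligos would be reported.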
[0099] In yet further embodiments, a template nucleic acid is
exposed to a reagent that transforms a modified nucleotide to a
different nucleotide structure. For example, a bacterial cytosine
methyl transferase converts 5-MeC to thymine (M. J. Yebra, et al.,
Biochemistry 1995, 34(45), 14752, incorporated herein by reference
in its entirety for all purposes). Alternatively, the reagent may
convert a methyl-cytosine to 5-hydroxy-methylcytosine, e.g., TET1
(M. Tahiliani, et al., Science 2009, 324(5929), 930, incorporated
herein by reference in its entirety for all purposes). In further
embodiments, the reagent may include a cytidine deaminase that
converts methyl-cytosine to thymine (H. D. Morgan, et al., J
Biological Chem 2004, 279, 52353, incorporated herein by reference
in its entirety for all purposes). In yet further embodiments, a
restriction enzyme that specifically alters a modification of
interest can be used to create a lesion at the modification site.
For example, DpnI cleaves at a recognition site comprising
methyladenosine. Optionally, the cleaved template could be repaired
during an analytical reaction by inclusion of a ligase enzyme in
the reaction mixture. As noted elsewhere herein, nucleotides other
than 5-MeC can also be modified and detected by the methods
provided herein. For example, adenine can be converted to inosine
through deamination, and this conversion is affected by methylation of
adenine, allowing differential treatment and detection of adenine
and MeA.
[0100] DMS Modification
[0101] In certain embodiments, the template may be modified by
treatment with dimethyl sulfate (DMS) prior to sequencing. DMS is a
chemical that methylates the N7 position of guanine in dsDNA, and
to a lesser extent the N3 position of adenine in dsDNA. If proteins
are bound to a DNA treated with DMS, the proteins will block the
methylation of the sequences to which they are bound. The bound
proteins can then be removed and the DNA treated with piperidine,
which breaks the DNA backbone by removal of the methylated bases.
Protected regions of the DNA are identified as having been bound to
the proteins during the DMS treatment. DMS also modifies the N3
position of cytosine and the N1 position of adenine in
single-stranded DNA or RNA so these bases can no longer base pair
with their complement. Since both these positions are involved in
base-pairing, regions that are double-stranded during DMS treatment
are protected from modification. Reverse transcriptase PCR and gel
analysis are subsequently used to identify regions that were
unmodified, and are therefore likely regions that adopt secondary
structures that protect them from DMS treatment.
[0102] The present invention provides methods for real-time,
single-molecule sequencing of nucleic acids that have been
subjected to DMS treatment as a means for detecting both binding
sites of nucleic acid binding agents and sites of secondary
structure formation, e.g., G-quadruplex structures (also known as
G-tetrads or G.sub.4-DNA; see, e.g., Zheng, et al. (2009)
"Molecular crowding creates an essential environment for the
formation of stable G-quadruplexes in long double-stranded DNA,"
Nuc Ac Res 1-12, incorporated herein by reference in its entirety
for all purposes). For example, dsDNA bound to one or more nucleic
acid binding agents is subjected to DMS treatment, and the binding
agents are subsequently removed. The resulting dsDNA is subjected
to template-directed sequencing and pulse metrics are monitored to
identify locations where guanine or adenine were methylated. For
example, A and G template nucleotides that cause a distinguishable
change in one or more pulse metrics are identified as not having
been bound by the agent(s), and A and G template nucleotides that
do not cause a distinguishable change in one or more pulse metrics
are identified as having been bound by the agent(s). In certain
embodiments, the DMS treatment takes place in vivo, and the dsDNA
is subsequently extracted and sequenced to study transcription
factor binding in the cell. Alternatively, dsDNA can be extracted
from cells and subsequently exposed to one or more nucleic acid
binding agents prior to treatment with DMS in vitro. The DMS
treatment can be performed in solution, or can be performed after the
dsDNA is immobilized, e.g., at a reaction site. The nucleic acid
binding agents whose association with DNA can be studied include,
but are not limited to, transcription factors, polymerases,
ribosomes, and associated cofactors; such studies reveal which DNA
regions are being actively transcribed in different cells, in healthy vs. diseased
tissue, in different cell cycle stages, in response to various
environmental stimuli, and the like. For example, in certain
embodiments DMS is applied in vivo or in vitro to mRNAs bound by
actively translating or stalled ribosomes. The resulting mRNA
templates are subsequently sequenced in real time, and the
reactions are monitored for altered kinetics, which are indicative
of modified bases. Alternatively, the DMS-treated mRNAs can be
heated to degrade modified regions, leaving only unmodified regions
for sequencing. The sequence data so generated is used to identify
the mRNAs to which a ribosome was bound, and therefore the mRNAs
that were being actively translated in the sample from which they
were extracted. Other methods of ribosome profiling are known in
the art, e.g., Ingolia, et al. (2009) Science 324(5924):218-23, the
disclosure of which is incorporated herein by reference in its
entirety for all purposes.
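The classification of A and G template positions described in this
paragraph can be sketched as follows. This is an illustrative
example only: the per-position IPD ratio (DMS-treated versus
untreated template), the function name, and the 1.5-fold cutoff are
assumptions and are not part of the specification.

```python
def dms_footprint(template, ipd_ratio, fold=1.5):
    """Classify A and G template positions after DMS treatment.

    template:  template sequence as a string of A, C, G, T.
    ipd_ratio: per-position ratio of a pulse metric (here IPD) for
               the DMS-treated template over an untreated control.
    fold:      hypothetical cutoff for a distinguishable change.
    Returns {position: "unbound" | "bound"} for A and G only.
    """
    if len(template) != len(ipd_ratio):
        raise ValueError("template and ipd_ratio must be equal length")
    calls = {}
    for pos, (base, ratio) in enumerate(zip(template, ipd_ratio)):
        if base not in "AG":
            continue  # only A and G are considered, per the text
        changed = ratio >= fold or ratio <= 1.0 / fold
        # A changed metric implies methylation, i.e. no protection.
        calls[pos] = "unbound" if changed else "bound"
    return calls
```

Runs of consecutive "bound" calls would then mark candidate
footprints of the binding agent.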
[0103] In further embodiments, DNA and RNA secondary structure
profiling can be performed by applying DMS to single-stranded DNA
or RNA (e.g., mRNA, siRNA, microRNA, rRNA, tRNA, snRNA, etc.) and
sequencing the DMS-modified nucleic acid using an appropriate
polymerase. (Methods for sequencing RNA molecules using RNA
dependent polymerases are described in detail in U.S. Ser. No.
61/186,661, filed Jun. 12, 2009 and incorporated herein by
reference in its entirety for all purposes.) Regions of the treated
nucleic acid that elicit altered polymerase kinetics are identified
as regions that were single-stranded during the DMS treatment, and
regions of the treated nucleic acid that do not elicit altered
polymerase kinetics are identified as regions that were
double-stranded during the DMS treatment and therefore likely
contained duplex secondary structure, e.g., hairpins. In certain
embodiments, the nucleic acid is heated prior to sequencing to
cause degradation of the modified regions. The remaining,
undegraded nucleic acid is subsequently subjected to sequencing and
the sequence data so generated is used to identify regions of the
original nucleic acid that formed secondary structures that prevent
DMS modification.
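As one hedged illustration of how such regions might be delineated,
the sketch below groups per-position (or per-window) flags for
altered polymerase kinetics into contiguous regions. The function
name and the input representation are assumptions; the sketch also
simplifies by treating every position as informative, whereas only
DMS-reactive bases would actually report on strandedness.

```python
from itertools import groupby

def strandedness_regions(altered):
    """Group consecutive kinetic flags into labeled regions.

    altered: sequence of booleans, True where polymerase kinetics
             were altered relative to an untreated control.
    Returns a list of (start, end, label) with end exclusive.
    """
    regions = []
    start = 0
    for is_altered, run in groupby(altered):
        length = len(list(run))
        # Altered kinetics imply DMS access, i.e. single-stranded.
        label = "single-stranded" if is_altered else "double-stranded"
        regions.append((start, start + length, label))
        start += length
    return regions
```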
[0104] DMS modification can also be used to map regions that form
non-B-form secondary structures, some of which have regulatory
roles in vivo. For example, G-quadruplexes consist of stacks of Gs
that protect the guanosines from DMS-modification, even in the
absence of a nucleic acid binding agent. Subsequent sequence
analysis is used to identify regions that were protected from DMS
modification, and therefore are likely to have had some protective
secondary structure.
[0105] Further, although described primarily in terms of DMS
modification, other types of chemical and/or enzymatic
modifications can also be used in an analogous fashion, as will be
clear to one of ordinary skill in the art based on the teachings
herein. For example, other methods of DNA or RNA footprinting are
particularly useful in the methods herein, including, e.g., use of
DNaseI, hydroxyl radicals, or UV irradiation for cleavage of
nucleic acid that is not bound by an agent. Such methods are
described more fully in the published literature.
[0106] The template altered by exposure to the reagent is
sequenced, e.g., using a real-time, single-molecule methodology
such as SMRT.TM. sequencing. In certain preferred embodiments, the
sequencing is performed multiple times on the same template, e.g.,
by rolling-circle synthesis or another form of molecular redundant
sequencing. The loci in the template containing altered nucleotides
are identified by analysis of the resulting sequence reads. In
cases in which the 5-MeC nucleotides were converted to a non-altered
nucleotide (e.g., thymine), molecular redundant sequencing on both
the forward and reverse strands is useful for further refining the
identification of the altered nucleotides since the transformation
disrupts the normal Watson-Crick base pairing. For example, if a
MeC.cndot.G pair is converted to T.cndot.G, the forward and reverse
reads will show non-complementary nucleotides at that position (A
and C), indicating that the base pair in the template was
non-standard, likely due to an alteration of a 5-MeC at that
position. Methods for molecular redundant sequencing are further
described in U.S. Pat. No. 7,476,503 and U.S. application Ser. Nos.
12/383,855 (filed Mar. 27, 2009), 12/413,258 (filed Mar. 27, 2009),
12/413,226 (filed Mar. 27, 2009), and 12/561,221 (filed Sep. 16,
2009), all of which are incorporated herein by reference in their
entireties for all purposes.
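The complementarity check described in this paragraph can be
sketched as follows. The function name and the assumption that both
reads are equal-length, gap-free, 5'-to-3' consensus reads covering
the same duplex region are illustrative simplifications; real reads
would first need to be aligned.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def noncomplementary_positions(read_a, read_b):
    """Find positions where reads of the two strands do not pair.

    read_a, read_b: 5'->3' reads generated from opposite strands of
                    the same duplex region (assumed equal length).
    Returns (index_in_read_b, base_a, base_b) for each mismatch.
    """
    if len(read_a) != len(read_b):
        raise ValueError("reads must cover the same region")
    hits = []
    for i, base_b in enumerate(read_b):
        # Strands are antiparallel, so pair index i with the mirror
        # index counted from the other end of read_a.
        base_a = read_a[len(read_a) - 1 - i]
        if COMPLEMENT[base_b] != base_a:
            hits.append((i, base_a, base_b))
    return hits
```

For a duplex whose top strand TTCGA carries 5-MeC at the third
position and is converted to TTTGA, the two reads are TCAAA and
TTCGA, and the function reports the A and C at that position.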
VI. Detection of Agent-Nucleic Acid Interactions
[0107] Another example of a biological process that may be
monitored in accordance with the invention is association of a
nucleic acid binding agent (e.g., a protein, nucleic acid, or small
molecule) with a single nucleic acid molecule. As for the chemical
modifications to the template described above, use of such agents
serves to enhance the detection of the modification, e.g., by
amplifying a signal indicative of the modification. Many types of
agents bind to nucleic acids, such as transcription factors, RNA
and DNA polymerases, reverse transcriptases, histones, nucleases,
restriction enzymes, replication protein A (RPA), single-stranded
binding protein (SSB), anti-DNA antibodies, DNA damage-binding
agents, agents that bind altered nucleotides (e.g., methylated),
small RNAs, microRNAs, drug targets, etc. In particular,
transcription factors are involved in gene expression regulation
and are thus very important for the study of diseases such as
cancer. Further, RPA binds single-stranded DNA during replication
to keep DNA unwound and accessible to the polymerase. Current
technologies for detecting the binding of a protein transcription
factor to a DNA molecule involve bulk detection. Certain aspects of
the invention provide a method for detecting the binding of a
transcription factor or other nucleic acid binding agent to a
single molecule of DNA. The advantages of the methods described
herein include, but are not limited to, improved resolution of
kinetics (e.g., of association and dissociation), binding loci, and
statistical analysis; and greater sensitivity and simplicity.
[0108] In certain aspects, the invention provides detection of
binding of a nucleic acid binding agent onto a single nucleic acid
molecule through a technology that involves observing the
activities of single molecules of polymerases in real time and with
high multiplex capabilities, thereby allowing the screening of
multiple nucleic acid binding agents (or other components of the
reaction) with high throughput. In particular, the invention
employs processes analogous to those used for single-molecule,
real-time DNA sequencing and, with some modifications, exploits such processes to
characterize various aspects of binding of nucleic acids by
proteins of interest. Such sequencing technology has been
previously described, e.g., in Eid, et al. (incorporated herein
above). In certain preferred embodiments, one or more components of
the reaction are immobilized at a reaction site, e.g., in an
optical confinement such as a ZMW. Alternatively or additionally,
multiple reactions can be simultaneously monitored by immobilizing
them at discrete locations on a substrate, e.g. in an array of
optical confinements. Further, to prevent displacement of the agent
prior to a detectable effect on the reaction (e.g., a pause), the
binding may be enhanced through various alterations to the reaction
mixture (e.g., salt concentration, pH, temperature, etc.), or
through alterations to the agent itself. For example, a DNA-binding
protein may comprise various mutations that enhance binding under
the conditions of the sequencing reaction, e.g., by lowering the Kd
of the binding domain (e.g., a methyl binding domain) or by
duplicating the domain to increase the effective concentration of
the binding domain in the vicinity of the DNA template.
[0109] In certain preferred embodiments, a single nucleic acid
template is bound to a sequencing engine (e.g., a polymerase or
reverse transcriptase) that is synthesizing a nascent nucleic acid
strand, e.g., during a template-directed sequencing reaction or a
sequencing-by-synthesis reaction. The template can be any nucleic
acid template appropriate for template-directed sequencing, e.g.
single-stranded or double-stranded DNA, RNA, or a DNA/RNA hybrid.
Further, the nucleic acid template can be linear or circular. For
example, a dsDNA template can be bound by a polymerase in an
optical confinement, e.g., a ZMW, as described above and in, e.g.,
Foquet, et al., and Levene, et al., both of which are incorporated
herein supra. A nucleic acid binding agent, such as a transcription
factor or DNA damage-binding agent, is added to the reaction
mixture under conditions that promote binding of the agent to the
template. If the template is bound by the agent in a location ahead
of the polymerase, the bound agent impedes the translocation of the
polymerase along the template, resulting in a pause or full stop in
polymerization at or adjacent to the position at which the agent
bound. Real-time monitoring of the ongoing sequencing reaction will
allow detection of the pause or stop, which is indicative of (i)
the fact that the agent bound the template, and (ii) the position
on the template that was bound by the agent, e.g., based on the
sequence of nucleotides incorporated immediately prior to the pause
or stop. Further, a consensus sequence for the binding site of the
agent can be determined by statistical analysis of the
"binding-affected" (e.g., containing a pause or truncated) sequence
reads generated in the presence of the agent and the
non-binding-affected (e.g., full-length) sequence reads generated
in the absence of the agent. For example, truncated sequence reads
(or sequence reads having detectable pauses) generated in the
presence of the agent provide a location on the template at which
the polymerase was blocked, and full-length reads generated in the
absence of the agent provide the binding site sequence. In certain
embodiments, sequence reads from the region of the template
immediately downstream of the point at which the polymerase is
blocked are analyzed together to find a sequence (specific or
degenerate) they have in common, and this common sequence is
identified as the consensus binding site for the agent. Such
analyses are routine in nucleic acid sequence analysis and require
no further elaboration here.
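One simple form of this analysis is sketched below. It is an
illustration under stated assumptions rather than a prescribed
method: the function name is hypothetical, each truncated read is
reduced to the index at which it ended, the full-length read of the
same template is given in the same orientation, and the binding
site is assumed to begin exactly at the stall point with a fixed
window length. In practice the offset between the stall and the
site would need to be modeled.

```python
from collections import Counter

def consensus_downstream(full_length_reads, stop_positions, window=7):
    """Majority-vote consensus of sequence just past each stall.

    full_length_reads: agent-free reads, one per template.
    stop_positions:    index at which the agent-bound read of the
                       same template ended (same orientation).
    window:            hypothetical binding-site length.
    """
    windows = [
        read[stop:stop + window]
        for read, stop in zip(full_length_reads, stop_positions)
        if len(read) >= stop + window  # skip sites cut off by read end
    ]
    # Most common base in each column of the stacked windows.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*windows)
    )
```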
[0110] In certain embodiments, a nucleic acid binding protein of
interest is introduced into a reaction mixture comprising a pool of
nucleic acid templates. The pool of templates is exposed to the
protein under conditions that promote binding, and polymerase
enzymes are subsequently added to the reaction mixture and allowed
to bind the templates, e.g., at a single-stranded region
comprising a bound oligonucleotide primer. The reaction mixture
further comprises a set of detectably labeled nucleotides, wherein
each type of nucleotide in the set is linked to a distinct label
that is optically identifiable during polymerization, thereby
providing a distinct signal for each nucleotide incorporation event
that identifies the base incorporated into the nascent strand. The
polymerase-template complexes are immobilized on a substrate such
that signals emitted from each complex are optically resolvable
from signals emitted from every other complex on the substrate.
Preferably, the reaction mixture is lacking a component required
for polymerization to prevent polymerase activity prior to
immobilization. Such a component is subsequently added to the
reaction mixture allowing the polymerase to commence synthesis of a
nucleic acid strand complementary to the template to which it is
bound. For those templates that were not bound by the protein,
synthesis continues unimpeded and the template is fully sequenced
in the optical confinement, generating a full-length sequence read
for the template. In contrast, the templates that were bound by the
protein are processed by the polymerase until the bound protein is
encountered on the template, at which time the polymerase will
pause or stop polymerizing the complementary strand. The truncated
sequence read generated from such a stalled polymerase-template
complex will provide sequence information for the template upstream
of the protein binding site. Statistical analysis of this sequence
information, both at the single molecule level and across the pool
of templates, can be used both to identify the particular nucleic
acid templates bound (or not bound) by the protein and to
identify the position at which the protein binds. For example,
this technique can be used to map specific protein binding sites on
the template, e.g., sequence-specific or lesion/damage-specific
binding sites.
[0111] This assay can be easily modified to test the impact of
various reaction conditions, e.g., pH, temperature, ion
concentrations, and presence or absence of agents such as drugs,
antibodies, or binding competitors. These tests can be used to
identify optimal reaction conditions, e.g., for causing a pause or
stop in an ongoing sequencing reaction or for binding to a
particular subset of the pool of template nucleic acids. Further,
the assay can be used to test variants and/or mutants of known
nucleic acid binding proteins to screen such mutants for desired
characteristics, such as binding under stringent conditions or
having altered sequence specificity for binding. The assay can also
be used to test variants and/or mutants of polymerase enzymes for
desired characteristics, such as the ability to bypass a particular
nucleic acid binding protein. Further, the specificity of binding
can be explored by performing the assay with different pools of
nucleic acid templates.
[0112] In certain embodiments, the nucleic acid binding protein is
a transcription factor (TF) with a specific consensus binding
sequence, e.g., TGACTCA for AP1 or GGACTTCC for NF-.kappa.B. DNA
template molecules that contain the consensus binding sequence are
bound by the TF at that sequence, and those that do not are not
bound by the TF. When the translocating polymerase encounters a
bound TF, the polymerase stops polymerizing, and the cessation of
signals emitted from the complex is indicative that the TF bound
the template and, therefore, that the template contains the
consensus binding sequence. As noted above, various reaction
conditions can be tested for their effect on either the binding of
the TF or the ability of the polymerase to bypass it or displace it
from the template. Statistical analysis of the sequence information
from the DNA templates that were bound by the TF can be used to
further characterize the TF, e.g., by (i) identifying genes
targeted by the TF, e.g., using publicly available genome sequence
data; (ii) identifying the consensus binding sequence, e.g., using
sequence data generated from the same templates in the absence of
the TF; (iii) studying the interaction of multiple transcription
factors; (iv) assessing modulation of TF binding by other proteins, small
molecules, etc.; (v) testing the temperature sensitivity of
binding; (vi) identifying and characterizing the abundance of
particular DNA-binding proteins, e.g., in a cell extract; and the
like. For example, the identity and abundance of DNA-binding
proteins can be compared between a) different tissues, cell lines,
cell developmental stages, species, or subspecies; b) healthy and
diseased samples; and c) in the presence and absence of
environmental stressors and/or various agents (e.g., drugs, toxins,
etc.). Yet further, variants and mutants of different components of
the reaction mixture, e.g., TF, polymerase, template, etc., can be
tested to identify those with particularly desirable
characteristics, e.g., tight binding, protein displacement
activity, non-consensus binding sequences with higher binding
affinity to the TF, etc.
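By way of illustration only, templates expected to be bound by such
a TF could be predicted by scanning each template for the consensus
sequence on either strand, and the prediction compared with the
templates on which the polymerase was observed to stop. The
function names below are hypothetical, and exact string matching is
a simplification that ignores degenerate or non-consensus sites.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def contains_site(template, site):
    """True if the site occurs on either strand of the template."""
    return site in template or reverse_complement(site) in template

def predicted_bound(templates, site="TGACTCA"):
    """Names of templates carrying the consensus site.

    templates: dict mapping template name to sequence.
    site:      consensus sequence; TGACTCA is the AP1 example
               given in the text.
    """
    return sorted(n for n, s in templates.items() if contains_site(s, site))
```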
[0113] When implemented in an arrayed format, such
investigations would be highly parallel, enabling high-throughput
screening assays. Arrays of reactions are carried out on highly
multiplexed confocal fluorescence microscope systems (see, e.g.,
Lundquist, et al., incorporated herein above) in which the
instrument detects fluorescent signals from each reaction site on
the array, resulting in a highly parallel operation. Although
preferred embodiments use arrays of zero mode waveguides, as
described elsewhere herein, these assays could also be performed in
other systems capable of real-time single-molecule detection, e.g.,
using total internal reflection fluorescence (TIRF) microscopy or
waveguide technology.
[0114] Although certain embodiments are described in terms of
nucleic acid binding proteins, it will be appreciated that the
methods and systems described herein are equally applicable to
other nucleic acid binding agents capable of pausing, stopping, or
otherwise disrupting processive template-directed synthesis of a
nascent nucleic acid molecule, e.g., nucleic acids and analogs and
mimetics thereof (e.g., protein nucleic acids), lipids,
sugar-oligoamides, intercalating dyes, major and minor groove
binders, etc.
VII. Nucleic Acid Binding Agents as Analytical Tools
[0115] In certain aspects, nucleic acid binding agents are used in
the methods, compositions, and systems of the invention to detect
and/or reverse modifications in nucleic acid molecules. Such agents
are typically used to enhance the response of a polymerase to a
modification in the template nucleic acid. That is, the methods
herein can be used to detect binding of an agent to the template,
whether in response to a modification as described below, or simply
an unmodified recognition site within the sequence of the template,
as described above. Further, the effects of various agents on the
creation, detection, or bypass of a nucleotide modification can
also be tested and compared. For example, a template can be treated
in various different ways (e.g., with and without a nucleic acid
binding agent) and subsequently subjected to single-molecule
sequencing-by-synthesis, which is monitored for a disruption in
sequence read generation that is characteristic of binding of the
agent to the template. In other embodiments, a template containing
a known modification can be subjected to single-molecule
sequencing-by-synthesis in the presence of various agents or
reaction conditions. The reaction is monitored for the activity of
the polymerase on the modified template to determine if the
presence of any of the agents or other conditions impacts the
ability of the polymerase to bypass or pause at the
modification.
[0116] In certain specific embodiments, accentuating the
differences in interpulse duration and/or pulse width between
methylated and unmethylated DNA involves DNA binding proteins. It
has been shown that some DNA polymerases stall when they encounter
a DNA-bound protein complex. (See, e.g., M. Elias-Arnanz, et al.,
EMBO J. 1997, 16, 5775, incorporated herein by reference in its
entirety for all purposes.) In SMRT.TM. sequencing, this stall is
detected as an unusually long interpulse duration that would end
when the binding protein dissociates from the DNA template or is
displaced by the translocating polymerase. There are a number of
proteins that can bind stably and specifically to methylated DNA
including members of the MBD family of human proteins, all of which
contain a methyl-CpG binding domain (MBD). For example, MECP2,
MBD1, MBD2, and MBD4 all bind specifically to methylated DNA, and
are involved in repressing transcription from methylated gene
promoters. Binding of these proteins to a template nucleic acid is
expected to cause a translocating polymerase to pause proximal to
the bound protein. As such, an increased pause duration during
single-molecule sequencing reactions is indicative of a methylated
base in the template nucleic acid. It is therefore important that
the protein bind tightly to its target nucleic acid sequence.
Natural MBD proteins only have micromolar Kd affinities for
methyl-CpG sequences, so engineered MBD proteins that bind more
tightly to the methylated template sequence can enhance
detectability of methylated bases. For example, a multimerized MBD1
protein is provided in Jorgensen, et al., Nucleic Acids Research
2006, 34(13), e96. Such engineered proteins can have a single
methyl binding domain with a lower Kd (sub-micromolar) or multiple
methyl-binding domains that increase the effective concentration of
the methyl-binding domain in the vicinity of the methylated DNA
template. More information on the MBD family of proteins is
provided, e.g., in B. Hendrich, et al., Mol Cell Biol 1998, 18(11),
6538; and I. Ohki, et al., EMBO J. 2000, 18(23), 6653.
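A minimal sketch of detecting such stalls in a single read follows.
The function name, the use of the read's median interpulse duration
as the baseline, and the five-fold cutoff are assumptions made for
illustration and would need to be calibrated against reads obtained
without the binding protein.

```python
from statistics import median

def find_pauses(ipds, fold=5.0):
    """Return indices of unusually long interpulse durations.

    ipds: interpulse durations (seconds) in incorporation order.
    fold: hypothetical multiple of the median IPD treated as a pause.
    """
    if not ipds:
        return []
    baseline = median(ipds)
    return [i for i, ipd in enumerate(ipds) if ipd > fold * baseline]
```

Pause indices that recur at the same template position across many
molecules would then point to a candidate methylated site bound by
the protein.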
[0117] In addition, the mammalian UHRF1 (ubiquitin-like, containing
PHD and RING finger domains 1) protein binds tightly to methylated
DNA and is required for its maintenance. Crystal structures of the
SRA domain of this protein bound to DNA show that the 5-MeC is
flipped out of the DNA duplex and stabilized by hydrophobic
stacking and hydrogen bonding to SRA protein residues. (See, e.g.,
G. V. Avvakumov, et al. and H. Hashimoto, et al., both supra.)
Finally, the monoclonal antibody to 5-MeC, used for methylated DNA
immunoprecipitation, also binds specifically to methylated
cytosine. (See, e.g., N. Rougier, et al., Genes Dev 1998, 12, 2108;
and M. Weber, et al., supra, which are incorporated herein by
reference in their entireties for all purposes.) All of the
above-mentioned proteins are candidates for interfering with normal
DNA polymerase processivity during SMRT.TM. sequencing. In order to
enhance their polymerase stalling effects, these proteins can also
be engineered to increase their affinity for methylated DNA sites.
See, e.g., H. F. Jorgensen, et al., Nucleic Acids Res 2006, 34,
e96, the disclosure of which is incorporated by reference herein in
its entirety for all purposes.
[0118] In yet further embodiments, an antibody against 5-MeC could
be used to bind 5-MeC in a template nucleic acid, similar to the
process used in methylated DNA immunoprecipitation assays (M.
Weber, et al., Nat Genet 2005, 37, 853). As such, the antibody
essentially acts as an enhancer of the signal indicating the
presence of the modification in the template by virtue of altering
the polymerase dynamics. Various components of such reactions can
be detectably labeled, e.g., the antibody, template, incorporated
nucleotides, and combinations thereof, as described further
elsewhere herein.
[0119] In still further embodiments, methyltransferases can be used
to further facilitate detection of methyl-modified template nucleic
acids. As described above, DNA methyltransferases catalyze the
addition of methyl groups to DNA based upon recognition of
methylation sites. For some methyltransferases (e.g., maintenance
methyltransferases), the most active binding site in a nucleic acid
is a hemi-methylated site in which one strand of the nucleic acid
is methylated and the opposite strand is not. An enzymatically
inactive methyltransferase (i.e., one that is unable to methylate
nucleic acids) will therefore preferentially bind to a hemi-methylated
strand of DNA. In a real-time, template-directed sequencing
reaction, a methylated single-stranded template becomes
hemimethylated after nascent strand synthesis. A detectably labeled
methyltransferase can therefore be detected interacting with the
hemimethylated product of the synthesis reaction in real-time.
[0120] In certain embodiments, a circular template is used to
permit rolling-circle synthesis by the polymerase in which a
single-stranded circular template is converted to a double-stranded
circular template. In preferred embodiments, the polymerase is
capable of strand displacement such that after proceeding around
the template once it begins to displace the nascent strand ahead of
it as synthesis continues. This process eventually results in a long
concatemer containing multiple copies of the complement to the
original template molecule. In such a system, a single-stranded
methylated template is converted to a double-stranded
hemimethylated template. A methyltransferase present in the
reaction mixture can bind the hemimethylated sites and, if
detectably labeled, this binding can be readily monitored in real
time. When the polymerase encounters a bound methyltransferase, a
pause may be detected prior to dissociation of the
methyltransferase. The location of the pause in the resulting
sequence reads can be used to map the position of the methylated
site within the template molecule, even in the absence of a
detectable label on the methyltransferase. For example, the pause
can be used to identify the binding of the methyltransferase, e.g.,
in cases in which the methyltransferase is not detectably labeled,
and in such cases the methyltransferase would essentially serve to
extend the pause at a methylated site, thereby facilitating
identification of such a site in the template nucleic acid.
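Because a rolling-circle read is a concatemer, a pause at a given
read position can be folded back onto the circular template by
taking the position modulo the template length. The sketch below is
illustrative only: the function name and the minimum number of
passes are hypothetical, and the read is assumed to be free of
insertions and deletions, which real reads are not.

```python
from collections import Counter

def recurrent_pause_sites(pause_positions, template_length, min_passes=2):
    """Map pauses in a concatemeric read onto template coordinates.

    pause_positions: 0-based read positions at which pauses occurred.
    template_length: length of the circular template in bases.
    min_passes:      hypothetical number of passes in which a pause
                     must recur to be reported.
    """
    if template_length <= 0:
        raise ValueError("template_length must be positive")
    counts = Counter(p % template_length for p in pause_positions)
    return sorted(site for site, n in counts.items() if n >= min_passes)
```

Requiring recurrence over several passes helps separate pauses tied
to a fixed template site from sporadic polymerase pausing.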
[0121] Other types of modifications can also be detected and/or
reversed by nucleic acid binding agents. For example, among all
types of DNA damage, oxidative base damage by reactive oxygen
species (ROS) has been recognized as a major cause of cell death
and mutagenesis in aerobic organisms (see, e.g., Finkel, et al.
(2000) Nature 408(6809): 239-47, which is incorporated herein by
reference in its entirety for all purposes). DNA oxidative lesions
are primarily recognized and repaired by base excision repair (BER)
pathways (see, e.g., Fromme et al. (2004) Adv Protein Chem 69:
1-41, which is incorporated herein by reference in its entirety for
all purposes). In humans, the BER pathway for detecting and
repairing a common oxidative lesion, 7,8-dihydro-8-oxoguanine
("8-oxoG"), begins with recognition of the lesion by a human
oxoguanine DNA glycosylase 1 (hOgg1), which is a DNA
glycosylase/apurinic (AP) lyase (see, e.g., Klungland, et al.
(2007) DNA Repair (Amst) 6(4): 481-8, which is incorporated herein
by reference in its entirety for all purposes).
[0122] The modified base 8-oxoG is discussed at length supra.
Recent fluorescence and crystallography studies of hOgg1 found that
this DNA glycosylase recognizes the oxidative DNA lesion 8-oxoG by
scanning the DNA duplex, flipping the DNA base out, and
transferring the damaged base from a pre-sampling binding site to
the damage recognition binding site. Single-molecule experiments
revealed the rapid sliding activity of hOgg1 on DNA duplex. For
more detailed information on these studies, see Banerjee, et al.
(2005) Nature 434(7033): 612-8, and Blainey, et al. (2006) Proc
Natl Acad Sci USA 103(15): 5752-7, the disclosures of which are
incorporated herein by reference in their entireties for all
purposes.
[0123] In certain embodiments, the methods provided by the
invention expose a nucleic acid template to a damage-recognition
agent that binds to the template at a damaged nucleotide in a
manner that blocks bypass of the lesion by a polymerase
translocating along the template. The blockage causes a cessation
of incorporation-dependent signaling from the reaction site,
thereby indicating the damage-recognition agent has bound a damaged
nucleotide in the template. In some aspects, the methods further
include exposing a damaged template to additional reaction
components that act to repair the damage, restoring the template
and allowing dissociation of the damage-recognition agent from the
previously damaged nucleotide. Elements of the damage-repair (e.g.,
base excision repair (BER)) machinery can be provided in the
original reaction mixture, or can be added to an ongoing reaction.
If the polymerase pauses but does not dissociate, the
polymerization reaction can continue after DNA repair has been
completed and the repair machinery has dissociated from the
template or translocated away from the previously-damaged site.
[0124] In preferred embodiments, the damage-recognition agent is a
protein involved in BER such as DNA glycosylases/apurinic (AP)
lyases, e.g., hOGG1 (human oxoguanine DNA glycosylase I), yOGG1
(yeast homolog of hOGG1), FPG protein (MutM; bacterial homolog of
hOGG1); and others known in the art. Other proteins that can be
used as a damage-recognition agent include other DNA glycosylases,
e.g., AlkA, Nth, Nei, MutY, uracil DNA glycosylases (UDG),
single-strand selective monofunctional uracil-DNA glycosylase
(SMUG), thymine DNA glycosylase (TDG), NEIL (e.g., hNEIL1 and
hNEIL2), etc. Reaction components for repair of a damaged template
bound by the damage-recognition agent include, e.g., AP
endonucleases, DNA polymerase beta, and ligase, among others known
in the art. See, e.g., McCullough, et al. (1999) Annu Rev Biochem
68: 255-85, which is incorporated herein by reference in its
entirety for all purposes. Further, additional proteins that
stimulate damage recognition may also be included in an analytical
reaction; e.g., HAP1 (APE1) protein has been found to stimulate
hOGG1 activity (Vidal, et al. (2001) Nuc. Ac. Res.
29(6):1285-1292).
[0125] In certain embodiments, more than one polymerase may be
present in a template-directed sequencing reaction in which one or
more lesions may be present on the template nucleic acid. For
example, "bypass polymerases" have been discovered in both
prokaryotes and eukaryotes, most of which belong to the Y-family of
polymerases and/or are considered to be repair polymerases. In
contrast to replicative polymerases, they operate at low speed, low
fidelity, and low processivity. However, because their active sites
adopt a more open configuration than those of replicative polymerases, they
are less stringent and can accommodate altered bases in their
active sites. For more information on bypass polymerases, see,
e.g., Cordonnier, et al. (1999) Mol Cell Biol 19(3):2206-11;
Friedberg, et al. (2005) Nat Rev Mol Cell Biol 6(12):943-53;
Holmquist, et al. (2002) Mutat Res 510(1-2):1-7; Lehmann, A. R.
(2002) Mutat Res 509(1-2):23-34; Lehmann, A. R. (2006) Exp Cell Res
312(14):2673-6; Masutani, et al. (1999) Nature 399(6737):700-4; and
Ohmori, et al. (2001) Mol Cell 8(1):7-8, the disclosures of which
are incorporated herein by reference in their entireties for all
purposes. Certain of these polymerases can bypass lesions in a
nucleic acid template and carry out "translesion synthesis" or TLS.
As such, DNA replication in the presence of such lesions was found
to require multiple polymerases and the "polymerase switch model"
was developed (see, e.g., Friedberg, et al. (2005) Nat Rev Mol Cell
Biol 6(12):943-53; Kannouche, et al. (2004) Cell Cycle 3(8):1011-3;
Kannouche, et al. (2004) Mol Cell 14(4):491-500; and Lehmann, et
al. (2007) DNA Repair (Amst) 6(7):891-9, all of which are
incorporated herein by reference in their entireties for all
purposes). In brief, the polymerase switch model is a model for
lesion bypass during replication that involves replacement of a
replicative polymerase with a bypass polymerase at a lesion,
synthesis of the nascent strand by the bypass polymerase until past
the lesion, and subsequent replacement of the bypass polymerase
with the more processive, higher fidelity replicative polymerase
for continued replication past the lesion.
[0126] In certain preferred embodiments, one or more bypass
polymerases is included in a template-directed nucleic acid
sequencing reaction. For example, during the course of a reaction
in which a replicative polymerase encounters and is blocked by a
lesion in a template nucleic acid, the replicative polymerase is
replaced by a bypass polymerase at the site of the lesion, and the
bypass polymerase synthesizes a segment of the nascent strand that
is capable of base-pairing with the damaged base, and may further
include one or more bases prior to and/or past the site of the
lesion in a process called "translesion synthesis." The limited
processivity of the bypass polymerase causes it to dissociate and
be replaced by the replicative polymerase following translesion
synthesis. The replicative polymerase continues to synthesize the
nascent strand until another blocking lesion is encountered in the
template, at which point it is once again replaced by a bypass
polymerase for translesion synthesis. (See, e.g., Friedberg, et al.
(2005) Nat Rev Mol Cell Biol 6(12):943-53; and Kannouche, et al.
(2004) Mol Cell 14(4):491-500, incorporated herein by reference
above.) The process continues until the template has been
replicated or the reaction is terminated, e.g., by the
investigator. One particular advantage of the polymerase switch
method of template-dependent sequencing is that it is tolerant of
most types of lesions in the template nucleic acid. As such the
damaged template can be sequenced through a lesion, thereby
allowing reinitiation of synthesis downstream of the lesion and
increasing read lengths on lesion-containing templates.
[0127] Various different bypass polymerases known to those of
ordinary skill in the art can be used with the methods and
compositions provided herein, including prokaryotic polymerases
(e.g., DNA polymerase IV, polymerase V, Dpo4, Dbh, and UmuC) and
eukaryotic polymerases (e.g., DNA polymerase .eta., DNA polymerase
.iota., DNA polymerase .kappa., and Rev1). In eukaryotes, multiple
bypass polymerases participate in translesion synthesis, and a
processivity factor, proliferating cell nuclear antigen ("PCNA"),
is also required and can be included in a sequencing reaction.
[0128] In certain preferred embodiments, the template or primer is
immobilized during the template-dependent synthesis reaction to
ensure that the template remains at the reaction site during
polymerase switching. Alternatively or additionally, one or more
polymerases can be immobilized at the reaction site. Various
immobilization strategies useful in different aspects of the
invention are provided elsewhere herein.
[0129] Since the portion of the nascent strand corresponding to the
site of the lesion in the template is synthesized by a bypass
polymerase, the sequence reads generated therefrom are expected to
be less reliable than those generated from regions of the nascent
strand synthesized by the replicative polymerase. As such,
generation of redundant sequence information during a sequencing
reaction is a preferred means of generating complete and accurate
sequence reads. Redundancy can be achieved in various ways
described elsewhere herein, including carrying out multiple
sequencing reactions using the same original template with the
sequence data generated in the multiple reactions combined and
subjected to statistical analysis to determine a consensus sequence
for the template. For example, the sequence data from a region in a
first copy of the template that was replicated by a lower fidelity
bypass polymerase can be supplemented and/or corrected with
sequence data from the same region in a second copy of the template
that was replicated with a higher fidelity replicative polymerase.
Further, a template can be amplified (e.g., via rolling circle
amplification) to generate a concatemer comprising multiple copies
of the template that is subsequently sequenced to generate a
sequencing read that is internally redundant. The sequence data
from a first segment of the concatemer (corresponding to a first
region of the template) that was replicated by the bypass
polymerase can be supplemented and/or corrected with sequence data
from a second segment of the concatemer (that also corresponds to
the first region of the template) that was replicated by the
replicative polymerase. Further, as noted above, redundancy can
also benefit identification and characterization of lesions that
occur in the same position in a plurality of templates, or that
occur at a single position in a template that is subjected to
resequencing. For example, since base incorporation by the bypass
polymerase is promiscuous, replicate sequencing reads for the
region containing the lesion may show more than one "complementary
base" being incorporated at the same position in different reads
(of the same or an identical template), and detection of such
promiscuity is indicative that there is a lesion at that position
in the template nucleic acid(s).
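By way of illustration only, the use of redundant reads described
above can be sketched as follows. The sketch assumes the replicate
reads have already been aligned to equal length, uses a simple
majority vote as the consensus rule, and flags a position as a
candidate lesion when the fraction of reads disagreeing with the
majority reaches a threshold. The function name, the threshold
`min_minor_fraction`, and the example reads are hypothetical and do
not come from the disclosure.

```python
from collections import Counter


def consensus_and_lesion_candidates(reads, min_minor_fraction=0.2):
    """Majority-vote consensus plus positions showing promiscuous incorporation.

    reads: pre-aligned, equal-length basecall strings for the same template region.
    min_minor_fraction: hypothetical threshold on the fraction of reads that
        disagree with the majority base at a position.
    """
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be pre-aligned to the same length")

    consensus = []
    candidates = []
    for pos in range(length):
        counts = Counter(read[pos] for read in reads)
        majority_base, majority_count = counts.most_common(1)[0]
        consensus.append(majority_base)
        # Fraction of reads whose call differs from the majority call.
        minor_fraction = 1.0 - majority_count / len(reads)
        if minor_fraction >= min_minor_fraction:
            candidates.append(pos)
    return "".join(consensus), candidates


# Synthetic replicate reads; position 2 shows more than one incorporated base.
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACATAC", "ACGTAC"]
consensus, candidates = consensus_and_lesion_candidates(reads)
```

In practice, a statistical model of per-read error rates would
likely replace the fixed threshold used in this sketch.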
[0130] In certain embodiments, a polymerase in the reaction mixture
may comprise a detectable label to indicate when that polymerase is
associated with the template nucleic acid. For example, a bypass
polymerase can comprise a detectable label that will indicate when
the bypass polymerase is carrying out translesion synthesis. The
nucleotides incorporated into the nascent strand during that time
can therefore be identified and "tagged" as corresponding to a
region of the template that contains one or more lesions, thereby
allowing targeting of statistical analysis to these sequence reads,
e.g., as described above.
[0131] In yet further embodiments, a nucleic acid binding agent
specifically binds to secondary structure in the nucleic acid
template, e.g., hairpin loops, stem-loops, internal loops, bulges,
pseudoknots, base-triples, and the like. Binding of an agent to
such structures inhibits passage of the polymerase through the
structures to a greater extent than the enzyme is inhibited in the
absence of the agent, thereby increasing the resulting pause time
and facilitating detection of the secondary structure. Examples of
agents that have binding specificity for specific structures and/or
strandedness in nucleic acids include, e.g., intercalating agents,
nuclease-deficient endonucleases (e.g., with a specificity for a
double-stranded region within a stem-loop structure), polymerases,
and various eukaryotic initiator proteins.
[0132] As noted above, various different types of templates for
template-directed polymerization reactions can be used, e.g.,
single-stranded or double-stranded DNA, single-stranded or
double-stranded RNA, and analogs and mimetics thereof. Further, the
template can contain a combination of single-stranded and
double-stranded regions, e.g., such as the templates described in
U.S. Ser. Nos. 12/383,855 and 12/413,258, both filed on Mar. 27,
2009 and incorporated herein by reference in their entireties for
all purposes. The type of template used is limited only by the
substrate specificity of the polymerase and damage-binding agent in
the reaction. For example, FIG. 5 provides an illustrative
embodiment of such a reaction comprising a linear template and a
damage-binding agent that recognizes a lesion in a single-stranded
template. In A, the damage-binding agent (305) is scanning a
linear, single-stranded nucleic acid template (310) ahead of a
polymerase (315) performing template-directed polymerization of a
nascent nucleic acid strand (320). In B, the damage-binding agent
(305) has detected and bound to a lesion (325) in the
single-stranded template (310). In C, the polymerase (315) has
caught up with the damage-binding agent (305) and its progress
along the template (310) is blocked. In D, the lesion has been
repaired by repair machinery (330) recruited by the damage-binding
agent (305). In E, the repair machinery has dissociated from the
template (310) and the damage-binding agent (305) has translocated
away from the previously damaged site, thereby allowing the
polymerase (315) to resume synthesis of the nascent strand
(320).
[0133] In some embodiments, a damage-binding agent with specificity
for double-stranded nucleic acid may be used in a reaction
comprising a single-stranded template, e.g., when the scanning and
damage detection/binding is expected to occur after the polymerase
has converted the single-stranded template to a double-stranded
template by template-dependent polymerization, e.g., after a
single-stranded circle has been converted to a double-stranded
circle during "rolling-circle replication." For example, although
an initial substrate in a reaction is a circular single-stranded
nucleic acid template, after a polymerase has processed the
template one time it becomes a double-stranded template and an
appropriate substrate for a damage-binding agent that specifically
scans and binds double-stranded nucleic acid. For example, FIG. 6
illustrates an embodiment comprising a circular template and a
damage-binding agent that recognizes a lesion in a double-stranded
template. In A, the damage-binding agent (405) is scanning a
circular, double-stranded nucleic acid template (435) ahead of a
polymerase (415) performing template-directed polymerization while
displacing the 5' end of the nascent nucleic acid strand being
synthesized (440). In B, the damage-binding agent (405) has
detected and bound to a lesion (425) in the double-stranded
template (435), and the progress of the polymerase (415) is blocked
by the bound damage-binding agent (405). In C, the lesion has been
repaired and the damage-binding agent (405) has translocated away
from the previously damaged site, thereby allowing the polymerase
(415) to resume synthesis of the nascent strand (420).
[0134] Although various embodiments are described in terms of
recognition and, optionally, repair of 8-oxoG lesions, other types
of DNA damage can also be addressed by the methods herein. For
example, in the case of hOGG1, the N-glycosylase activity releases
damaged purines from double-stranded DNA, generating an apurinic
(AP) site. The AP-lyase activity cleaves 3' to the AP site leaving
a 5' phosphate and a 3'-phospho-.alpha.,.beta.-unsaturated
aldehyde. In addition to 8-oxoG (when paired with cytosine), hOGG1
also recognizes and removes 8-oxoA (when base paired with
cytosine), formamidopyrimidine (fapy)-guanine and
methyl-fapy-guanine (Bjoras, M. et al. (1997) EMBO J., 16,
6314-6322; and Boiteux, S., and Radicella, J. (1999) Biochimie, 81,
59-67, the disclosures of which are incorporated herein by
reference in their entireties for all purposes). Other agents that
can be used to bind and, optionally, repair DNA damage in the methods
herein include BER enzymes that repair other DNA base lesions
(small DNA base modifications), e.g., AAG/MPG for methylated lesions,
UDG/SMUG1 for repairing uracil in DNA, APE for abasic sites, etc.
Also included are nucleotide excision repair (NER) enzymes that
repair more bulky DNA lesions, such as DNA base adducts and DNA
intra- and inter-strand crosslinks. Furthermore, although the DNA
polymerase switch methods described above are suitable for
detecting and bypassing most DNA lesions that block a replicative
polymerase, certain small base modifications like 8-oxoG can be
bypassed by a replicative polymerase, and thus methods that include
binding agents that block the polymerase at the site of a lesion
can help ensure that such lesions are detected, and optionally
removed, from the template to prevent the sequence data generated
from the template-dependent sequencing reactions from being adversely
affected.
[0135] In certain embodiments, hOGG1 is included in a
template-directed DNA sequencing reaction in the presence of a
polymerase and a set of nucleotides, each of which bears a label
that is optically detectable and that distinctively identifies the
base (e.g., A, G, T, or C). Detection of an optical signal upon
interaction with the polymerase and incorporation into the nascent
strand allows the practitioner to identify the base incorporated
and, by complementarity, the sequence of the template DNA molecule.
In preferred embodiments, the incorporation of nucleotides into the
nascent strand continues in a processive fashion, generating an
ordered set of optical signals that can be analyzed to provide a
sequence for both the nascent strand and, by complementarity, the
template strand. The hOGG1 enzyme associates with the template,
"scans" for damage, and specifically binds to locations at which
such damage occurs. As such, if the template DNA molecule contains
or acquires (e.g. during the course of the analytical reaction) DNA
damage recognized by hOGG1, it is bound by hOGG1, bypass of the
lesion by the polymerase is blocked, and the incorporation-based
signal is slowed or stopped (e.g., by stalling or dissociation of
the polymerase). Although such a blockage can cause dissociation of
the polymerase, in certain preferred embodiments the polymerase
merely pauses until the damaged nucleotide is repaired and hOGG1
and any other repair machinery dissociates from the template, at
which time polymerization resumes and additional sequence data is
generated from the template at and downstream of the site of the
previously damaged nucleotide.
[0136] In certain preferred embodiments, one or more reaction
components is immobilized at a reaction site, e.g., in an optical
confinement such as a zero mode waveguide (ZMW). In some
embodiments, the polymerase is immobilized and the nucleic acid
template and damage-binding agent are free in solution. Methods for
immobilizing a polymerase enzyme are available in the art and
provided elsewhere herein. In other embodiments, the nucleic acid
template can be immobilized at the reaction site with the
polymerase and damage-binding agent free in solution. For example,
in preferred embodiments the damage-binding agent translocates upon
the template faster than the polymerase so it does not impede
progress of the template-dependent sequencing reaction on an
undamaged template. However, upon binding a lesion, the
damage-binding agent will stop and bind to the site, blocking
progress of a translocating polymerase past the lesion. For
example, hOGG1 translocates much faster than phi29 polymerase on
undamaged DNA, but after encountering a damaged nucleotide the
enzyme will bind to the site and wait for other components of the
BER machinery. Alternatively or additionally, the damage-binding
agent may be immobilized at the reaction site. For example, in the
case of hOGG1 only a single enzyme is required for DNA binding,
scanning, and lesion recognition. Immobilization of a single
damage-binding agent at the reaction site increases the likelihood
that a single template at each reaction site will be scanned for
damage. Methods for immobilizing various reaction components are
known in the art as described elsewhere herein.
[0137] In certain aspects, the methods for detection of nucleic
acid damage can be used to test various elements of an experimental
system to identify sources of such damage. For example, various
buffer conditions or other components of an analytical reaction
(e.g., reaction components or radiation that can induce production
of oxygen radicals) can be tested to identify those that cause the
least amount of damage for use in an experimental system. Further,
such damage can be intentionally introduced into a nucleic acid
template by the practitioner, e.g., at one or more specific
locations in a template. This provides a means for controlling the
progress of the polymerase, and therefore controlling the timing of
production of sequence reads from different portions of the
template. For example, if the template is extremely long (e.g.,
thousands or tens of thousands of base pairs in length), it may be
beneficial to temporarily pause the reaction at one or more points
on the template to allow orientation of the sequence read to the
template. In particular, a pause in emission of signal pulses is
indicative that the polymerase has reached a particular location on
the template, and the investigator can reinitiate polymerization by
addition of repair agents/proteins to the reaction mixture. Such
repair agents may be washed out of the reaction mixture and,
optionally, reintroduced at a later point during the course of the
reaction, e.g., by buffer exchange.
VIII. Data Analysis
[0138] Analysis of the data generated by the methods described
herein is generally performed using software and/or statistical
algorithms that perform various data conversions, e.g., conversion
of signal emissions into basecalls, conversion of basecalls into
consensus sequences for a nucleic acid template, and conversion of
various aspects of the basecalls and/or consensus sequence to
derive a reliability metric for the resulting values. Such
software, statistical algorithms, and use thereof are described in
detail, e.g., in U.S. Patent Publication No. 20090024331 and U.S.
Ser. No. 61/116,439, the disclosures of which are incorporated
herein by reference in their entireties for all purposes. Specific
methods for discerning altered nucleotides in a template nucleic
acid are provided in U.S. Ser. No. 61/201,551, filed Dec. 11, 2008,
and incorporated herein by reference in its entirety for all
purposes. These methods include use of statistical classification
algorithms that analyze the signal from a single-molecule
sequencing technology and detect significant changes in one or more
aspects of signal morphology, variation of reaction conditions, and
adjustment of data collection parameters to increase sensitivity to
changes in signal due to the presence of modified or damaged
nucleotides.
[0139] In certain aspects, the invention provides methods for
detecting changes in the kinetics (e.g., slowing or pausing) or
other reaction data for real-time DNA sequencing. As discussed at
length above, detection of a change in such sequencing applications
can be indicative of secondary structure in the template, the
presence of modifications in the template, the presence of an agent
bound to the template, and the like. It is appreciated that the
kinetic activity of single molecules does not follow the regular
and simple picture implied by traditional chemical kinetics, a view
dominated by single-rate exponentials and the smooth results of
ensemble averaging. In a large multi-dimensional molecular system,
such as the polymerase-DNA complex, there are processes taking
place on many different time scales, and the resultant kinetic
picture can be quite complex at the molecular level. (See, e.g.,
Herbert, et al. (2008) Ann Rev Biochem 77:149.) As such, a
real-time single-molecule sequencing technology should be adaptable
to such non-exponential behavior. For example, pauses during a
real-time sequencing reaction are detectable as regions in the
trace of observed signals over time in which it appears that the
enzyme has significantly slowed as compared to the average rate of
incorporation. As such, methods are provided to analyze the data
generated in the vicinity of a pause site, and in particular
algorithmic methods for classifying and removing or down-weighting
the occurrence of pauses in the context of single-molecule
sequencing. General information on algorithms for use in sequence
analysis can be found, e.g., in Braun, et al. (1998) Statist Sci
13:142; and Durbin, et al. (1998) Biological sequence analysis:
Probabilistic models of proteins and nucleic acids, Cambridge
University Press: Cambridge, UK.
[0140] In certain preferred embodiments, the methods utilize a
segmentation algorithm for discriminating pause regions in a
real-time signal generated by monitoring single-molecule kinetics,
in particular by monitoring DNA synthesis by DNA polymerase. The
central observation is that during a pause the density of signal
events (incorporations) is lowered, where the density refers to the
number of events per a fixed unit of time. At the same time,
stochastic events arising from Poisson processes, such as sticks
(signals that do not correspond to an incorporation event, e.g.,
dyes that enter the detection volume but are not linked to
nucleotides that are incorporated into the nascent strand) should
continue at the same density as normally observed. FIG. 7
illustrates an observation of true incorporations (solid line)
versus stochastic pulses (dashed line) across time. A pause is
identified in the region of the trace in which the observation of
true incorporations dips below the observation of stochastic
pulses. As such, by observation of differences in local densities,
pauses in incorporation activity can be identified.
[0141] Other features that are related to pausing can also
contribute to a full model of the phenomenon. In particular, the
local sequence context of the template strand can also influence
and inhibit the activity of the polymerase along the template. For
example, the local sequence context that influences and/or inhibits
activity of the polymerase may extend for at least about one, two,
three, four, five, seven, ten, fifteen, or twenty nucleotide
positions, and these positions may lie upstream or downstream of
the modification, or may flank the modification in the template.
Other known models can also be used in the methods described
herein, as will be clear to one of ordinary skill upon review of
the teachings herein.
[0142] In a preferred embodiment, an algorithm for use in detecting
changes during template-directed nucleic acid synthesis comprises
the following general steps. First, a classifier is created that
can distinguish between true incorporations and stochastic pulses.
Features that can help discriminate between the two include, e.g.,
pulse height, pulse width, local signal-to-noise ratio, dye
channel, and the .chi..sup.2 metric for the measured spectrum. Many
different statistical classification algorithms known in the art
can be used in this classifier. Certain preferred algorithms
include classification-and-regression trees (CART), naive Bayesian
classifiers, kernel density methods, linear discriminant functions,
and neural networks. Further, the pulse classifier does not need to
be particularly powerful (in an optimal specificity/sensitivity
sense), because a strength of the approach relies on the greater
significance associated with observing clusters of weakly
significant events.
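As a non-limiting sketch of the first step, a naive Bayesian
classifier (one of the algorithm families named above) can be
written in a few lines. The sketch assumes Gaussian-distributed
features and uses only two of the listed features, pulse height and
pulse width. The training values are synthetic and the function
names are hypothetical; a real classifier would be trained on
labeled pulses from actual traces.

```python
import math


def fit_gaussian_nb(samples, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    model = {}
    for cls in set(labels):
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[cls] = (n / len(samples), means, variances)
    return model


def log_posterior(model, cls, x):
    """Unnormalized log posterior of class cls for feature vector x."""
    prior, means, variances = model[cls]
    total = math.log(prior)
    for v, m, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total


def classify(model, x):
    """Return the class with the highest posterior for feature vector x."""
    return max(model, key=lambda cls: log_posterior(model, cls, x))


# Synthetic (pulse height, pulse width in seconds) pairs, for illustration only.
train_x = [(1.0, 0.10), (1.1, 0.12), (0.9, 0.09),
           (0.4, 0.02), (0.5, 0.03), (0.3, 0.02)]
train_y = ["incorporation"] * 3 + ["stochastic"] * 3
model = fit_gaussian_nb(train_x, train_y)
```

Because the downstream steps rely on clusters of calls rather than
on any single call, a weak classifier of this kind may be adequate.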
[0143] A second step is to slide a fixed-length window across the
observed signal trace and count the number of incorporations versus
stochastic pulses in each window using the classifier. Choice of
the window size is determined by the length scale of events to be
detected; a reasonable choice in practice is 10 seconds, but a
practitioner may increase or decrease the window size according to a
particular implementation of the invention. Regions of the trace in
which the stochastic pulse density exceeds the incorporation
density for an extended period of time, e.g., 5-15 seconds, or more
preferably about 10 seconds, are identified as corresponding to a
likely pause site. These regions can be found using standard
peak-finding techniques, e.g., threshold detection, finite-state
machines, multi-scale methods, etc. This method has the further
advantage of focusing on pause regions that have detrimental
effects in the downstream use of sequencing data. The mere
occurrence of a pause during sequencing is not consequential to the
use of the data for DNA sequence analysis; however, the occurrence
of a large number of stochastic pulses in the pause region does
complicate the use of the resulting data.
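The sliding-window step can be sketched as follows, purely as an
illustration. The sketch assumes each pulse has already been
labeled by a per-pulse classifier as an incorporation (True) or a
stochastic pulse (False), slides a 10-second window in 1-second
steps, and merges overlapping windows in which stochastic pulses
outnumber incorporations. Simple threshold detection stands in for
the peak-finding techniques mentioned above; the function name, the
step size, and the synthetic trace are hypothetical.

```python
def find_pause_regions(pulses, trace_end, window=10.0, step=1.0):
    """Return merged (start, end) regions where stochastic pulses dominate.

    pulses: list of (time_s, is_incorporation) tuples.
    trace_end: duration of the trace in seconds.
    """
    regions = []
    start = 0.0
    while start + window <= trace_end:
        in_window = [is_inc for t, is_inc in pulses
                     if start <= t < start + window]
        n_inc = sum(in_window)
        n_stoch = len(in_window) - n_inc
        if n_stoch > n_inc:
            if regions and start <= regions[-1][1]:
                # Overlaps the previous flagged window: extend that region.
                regions[-1] = (regions[-1][0], start + window)
            else:
                regions.append((start, start + window))
        start += step
    return regions


# Synthetic trace: one incorporation per second except during 20-40 s,
# with a stochastic pulse every 4 s throughout.
incorporations = [(k + 0.5, True)
                  for k in list(range(0, 20)) + list(range(40, 60))]
stochastic = [(float(t), False) for t in range(2, 60, 4)]
pulses = sorted(incorporations + stochastic)
regions = find_pause_regions(pulses, trace_end=60.0)
```

Note that the reported region extends slightly beyond the
underlying pause because each flagged window is reported in full; a
shorter window narrows this margin at the cost of noisier counts.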
[0144] A variation on this exemplary algorithm is to use the time
between true pulses that are identified by the classifier as a
discriminator for finding pause regions, where regions that have a
large time gap between true pulses are candidate regions for
pauses. For example, a plot of .DELTA.t (the time between true
pulses) versus time will have local maxima at the location of
candidate pauses.
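A minimal sketch of this variation, assuming the times of the true
pulses have already been extracted, computes the inter-pulse time
for each consecutive pair and reports local maxima above a minimum
gap. The `min_gap` threshold and the example times are
hypothetical.

```python
def candidate_pauses_from_gaps(true_pulse_times, min_gap=5.0):
    """Return (gap_start_time, gap_length) for local maxima of the gap series."""
    gaps = [(t1 - t0, t0)
            for t0, t1 in zip(true_pulse_times, true_pulse_times[1:])]
    candidates = []
    for i, (gap, t0) in enumerate(gaps):
        left = gaps[i - 1][0] if i > 0 else float("-inf")
        right = gaps[i + 1][0] if i + 1 < len(gaps) else float("-inf")
        # A candidate is a local maximum that also exceeds the minimum gap.
        if gap >= min_gap and gap >= left and gap >= right:
            candidates.append((t0, gap))
    return candidates


# Synthetic true-pulse times (seconds) with one long gap after t = 3.0.
times = [0.0, 1.0, 2.2, 3.0, 15.0, 16.1, 17.0, 18.2]
candidates = candidate_pauses_from_gaps(times)
```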
[0145] A more sophisticated algorithm for use in detecting pause
regions is a segmenting algorithm based on a hidden Markov model
(HMM) architecture. FIG. 8 provides an illustrative example of a
simple hidden Markov model for classifying pause (P) versus
sequencing (S) states within a sequencing trace. The use of this
model assumes that each pulse can be labeled as either a probable
incorporation (A.sub.S, C.sub.S, G.sub.S, T.sub.S) or a stochastic
pulse (A.sub.P, C.sub.P, G.sub.P, T.sub.P). By fitting this model
on multiple instances of sequence data (using e.g. the Baum-Welch
algorithm), good emission and transition probabilities that
correspond to the hidden pause and sequencing states can be
generated. When subsequently presented with any particular signal
(observed labels), the model can be queried for the underlying
sequence of hidden states using the Viterbi algorithm. This model
is more powerful than the simpler algorithmic approaches
suggested above in several ways. First, this model permits the
modeling of per-nucleotide likelihoods for incorporations or sticks
during the pauses or sequencing states. An example where this is
useful is if a stick/incorporation classifier for one nucleotide is
particularly effective and if the pulses for another nucleotide are
difficult to classify in this way. This model permits some
nucleotide-specific differences in the classification power for
stochastic pulses versus incorporations. A further advantage of
this approach is that it is more adaptable to detecting regions
across multiple time scales, where HMM segmentation approaches are
usually better able to handle multi-time scale classification. The
final assignment of pause regions is made by computing the log-odds
ratio
$$\log \frac{P(\mathrm{P} \mid x_i)}{P(\mathrm{S} \mid x_i)}$$
across the pulses (x.sub.i) and identifying regions of high pause
likelihood.
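For illustration, a two-state model of this kind can be decoded
with a short Viterbi routine. The sketch assumes eight observed
labels (each base tagged as a probable incorporation or a probable
stochastic pulse). Its start, transition, and emission
probabilities are hand-picked placeholders standing in for values
that would be fitted, e.g., with the Baum-Welch algorithm. It
returns the most likely state path rather than per-pulse log-odds,
which would require posterior (forward-backward) decoding.

```python
import math

STATES = ("S", "P")  # S = sequencing, P = pause
BASES = "ACGT"

# Placeholder parameters for illustration only; not fitted values.
START = {"S": 0.9, "P": 0.1}
TRANS = {"S": {"S": 0.95, "P": 0.05}, "P": {"S": 0.10, "P": 0.90}}
# Observed label "A_S" = base A called a probable incorporation;
# "A_P" = base A called a probable stochastic pulse.
EMIT = {
    "S": {f"{b}_{v}": (0.9 if v == "S" else 0.1) / 4
          for b in BASES for v in "SP"},
    "P": {f"{b}_{v}": (0.2 if v == "S" else 0.8) / 4
          for b in BASES for v in "SP"},
}


def viterbi(observations):
    """Return the most likely hidden-state path as a string of 'S'/'P'."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


observations = (["A_S", "C_S", "G_S", "T_S", "A_S"]
                + ["G_P", "A_P", "T_P", "C_P", "G_P", "A_P"]
                + ["C_S", "G_S", "T_S", "A_S", "C_S"])
path = viterbi(observations)
```

Per-nucleotide differences in classifier power can be represented
in such a sketch by giving each base its own emission
probabilities instead of the uniform split used here.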
[0146] A more powerful algorithmic architecture for segmentation is
the use of the conditional random field framework (CRF). The object
is to predict the conditional probability of a signal arising from
a pause or sequencing state given the observed pulses:
$$p(y \mid x) = \frac{\exp[w^{T} F(x, y)]}{\sum_{y'} \exp[w^{T} F(x, y')]},$$
where y is the sequence of desired labels (pause, sequencing), x is
the observed pulse data (both basecalls and other pulse features),
w is the weight vector learned from the training data, and the F
function is the feature vector. The weights in the CRF can be
trained using labeled sequences using standard techniques from the
CRF literature (for example, Lafferty et al (2001) Proc. 18.sup.th
International Conference on Machine Learning, 282-289, which is
incorporated herein by reference in its entirety for all purposes).
By labeling the sequence in this way, regions of high pause
probability can be identified using the methods described above.
One advantage of this method is the lack of a requirement for a
per-pulse classifier for distinguishing between incorporation and
stochastic pulses. It can also better integrate knowledge of the
inter-pulse spacing and other information, such as sequence
context, into a broad model. Potential disadvantages are the large
amount of training data required to build the model and the
algorithmic complexity involved in constructing a CRF model.
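The conditional probability above can be evaluated directly on a
toy example, as sketched below. The sketch assumes the observed
data x is reduced to a list of inter-pulse gaps and uses three
hypothetical features (long gaps labeled pause, short gaps labeled
sequencing, and the number of label changes). The weights are
hand-chosen rather than learned, and the normalizing sum over y' is
computed by brute-force enumeration. Practical CRF implementations
compute that sum by dynamic programming, because enumeration grows
exponentially with the number of pulses.

```python
import math
from itertools import product

LABELS = ("S", "P")  # S = sequencing, P = pause
LONG_GAP = 3.0       # hypothetical threshold in seconds
W = (2.0, 1.0, -1.5)  # hypothetical weight vector w, not learned


def features(x, y):
    """Feature vector F(x, y) for gaps x and label sequence y."""
    f_long_pause = sum(1.0 for gap, lab in zip(x, y)
                       if gap > LONG_GAP and lab == "P")
    f_short_seq = sum(1.0 for gap, lab in zip(x, y)
                      if gap <= LONG_GAP and lab == "S")
    f_switch = sum(1.0 for a, b in zip(y, y[1:]) if a != b)
    return (f_long_pause, f_short_seq, f_switch)


def crf_probability(x, y, w=W):
    """p(y | x) = exp(w . F(x, y)) / sum over y' of exp(w . F(x, y'))."""
    def score(labels):
        return math.exp(sum(wi * fi
                            for wi, fi in zip(w, features(x, labels))))
    z = sum(score(labels) for labels in product(LABELS, repeat=len(x)))
    return score(tuple(y)) / z


# Synthetic inter-pulse gaps (seconds); the third and fourth are long.
gaps = [0.8, 1.1, 6.0, 7.5, 0.9, 1.0]
probs = {y: crf_probability(gaps, y)
         for y in product(LABELS, repeat=len(gaps))}
best = max(probs, key=probs.get)
```

Here no per-pulse incorporation/stochastic label is needed; the
inter-pulse spacing enters the model directly as a feature.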
[0147] The application of such algorithmic methods to identify
pause sites and/or regions with locally high stochastic pulses in
sequencing trace data is useful in a number of contexts. For
example, pulses in regions that are predicted to exhibit enzyme
pausing can be labeled as less confident (lower quality value) for
their use in downstream analyses such as sequence variant detection
in resequencing applications or overlap detection for de novo
assembly. In other embodiments, pulses from regions with a high
probability of containing a cluster of stochastic pulses can be
removed from the reported basecalled sequence, thereby improving
the accuracy of sequence data for downstream use without resorting
to secondary information such as quality values. In other
embodiments, the occurrence of pauses can be associated with other
observables of interest, such as the probable DNA sequence or the
occurrence of modified nucleotide bases. For example, sequences
upstream of a pause site can be called in part based on their known
effect on pausing. That is, if a pause occurs downstream of a
sequence, then the sequence is more likely to be one that
facilitates or exacerbates pausing than one that has no effect or
that reduces the likelihood of pausing. As such, if sequencing of a
modification is known to increase the likelihood of pausing, then
this information can be incorporated into a Bayesian likelihood
model for identifying modified bases. In further embodiments, the
pause detection methods described herein can also be used to
increase the understanding of the biophysics of polymerase
activity, thereby providing useful feedback to efforts to better
develop single-molecule, real-time sequencing techniques.
[0148] Algorithms for the identification of regions in sequence
data belong to the general category of sequence labeling or
segmentation algorithms, which are generally known in the art. The
mapping of this problem to sliding-window analysis, HMMs, or CRFs
is natural in this context. Other algorithms that approach the same
problem are multiple change-point analysis such as the Gibbs
sampler (see, e.g., Lee, P. M. (2004) Bayesian Statistics: An
Introduction, Oxford University Press: New York, N.Y., the
disclosure of which is incorporated herein by reference in its
entirety for all purposes), or locally weighted polynomial
regression (see, e.g., Braun, et al., supra).
[0149] In general, data analysis methods benefit when the
sequencing technology generates redundant sequence data for a given
template molecule, e.g. by molecular redundant sequencing as
described above. The distribution of single-read IPDs at a given
position is exponential. The decay constant for the exponential
of a methylated base and for that of an unmethylated base may be
different. However, because of the large amount of overlap between
two exponentials, it is still challenging to use one read to
distinguish between the two populations. However, if one takes the
mean of multiple reads at a single position, the distribution of
this mean is a gamma distribution (a convolution of several
exponentials), which is more Gaussian-like and better separated
than the underlying exponentials. This enables better distinguishability of the
two populations. For example, FIG. 13 provides actual data showing
that for two different positions in a single circular template, one
always unmethylated, and one differentially methylated, an increase
in the number of reads for the template corresponds to an increased
resolution between IPDs for methylated vs. unmethylated adenosines.
If the underlying distributions are exponential, as just discussed,
then the mean value is the only metric that can be used for making
the distinction (the standard deviation is the same as the mean).
If the distribution is non-exponential for each read position, as
it would be for the methylcytosine IPD' that is weighted over
numerous neighboring positions and thus itself has a gamma-like
distribution, then when doing consensus reads of the same position,
one can take into account the mean of the gamma-like weighted IPD'
distributions along with other information, e.g. its standard
deviation, its skewness, or other characteristics of the
distribution. FIG. 10 shows actual molecular consensus
distributions for methylcytosine, given the underlying gamma-like
weighted IPD' distributions of individual reads, but in this figure
only the means of these underlying distributions were utilized. The
plotted distributions could become even more well-separated if
other characteristics had been taken into account. The data used to
generate FIGS. 13 and 10 is more fully described in the Examples
herein.
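By way of illustration only, the benefit of averaging multiple reads can be sketched with a small simulation. The mean IPDs for the unmethylated and methylated populations (1.0 and 3.0, in arbitrary units), the midpoint cutoff, and the function names are assumptions chosen for the example and are not measured values from the Examples herein.

```python
import random

def mean_ipd(rate, n_reads, rng):
    """Mean of n_reads exponentially distributed IPDs (gamma-distributed)."""
    return sum(rng.expovariate(rate) for _ in range(n_reads)) / n_reads

def call_accuracy(n_reads, mean_unmeth=1.0, mean_meth=3.0, trials=4000, seed=0):
    """Fraction of simulated positions called correctly by thresholding the
    mean IPD at a fixed cutoff midway between the two population means."""
    rng = random.Random(seed)
    cutoff = (mean_unmeth + mean_meth) / 2.0
    correct = 0
    for _ in range(trials):
        # One unmethylated and one methylated position per trial.
        if mean_ipd(1.0 / mean_unmeth, n_reads, rng) < cutoff:
            correct += 1
        if mean_ipd(1.0 / mean_meth, n_reads, rng) >= cutoff:
            correct += 1
    return correct / (2 * trials)
```

Comparing call_accuracy(1) with call_accuracy(10) shows the accuracy rising as the number of reads per position increases, consistent with the narrowing of the gamma-distributed means.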
[0150] In certain embodiments, methods may be employed that use
weighted sums of signal features at multiple positions to determine
the status of a base, e.g., whether or not it is methylated in a
template nucleic acid. In particular, interpulse duration (IPD)
information from multiple positions can be used to determine
whether or not a given cytosine is methylated, e.g., by comparing
nascent strand synthesis data for a differentially methylated
template (Me+) to such data for a fully unmethylated template
(Me-). In certain preferred embodiments, a pseudo IPD is created
for a given template position that is actually a weighted sum of
the IPDs for the surrounding positions. More specifically,
$$\mathrm{IPD}'_j = \sum_i w_i \times \frac{\mathrm{IPD}_{ji}}{\langle \mathrm{IPD}_i \rangle_{\mathrm{Me-}}},$$
where j is the index of the cytosine in question; i is an index
that ranges over all the neighboring positions that yield a change
in IPD due to cytosine being either methylated or unmethylated; and
$\langle \mathrm{IPD}_i \rangle_{\mathrm{Me-}}$ is the average IPD at that particular
position in the Me- template. The individual weights for the
multiple positions (which together would likely sum to 1) could
be based on a combination of the following metrics, assuming we are
comparing two templates that are identical aside from one being
methylated and the other being unmethylated at a given position:
the ratio of or difference in IPD between the two templates at that
given position; the statistical significance of the
distinguishability between the IPD distributions of the two
templates at that given position; the number of observations used
when creating the IPD distributions; and the neighboring sequence
context. An example w_i could be
$$w_i = \log\left[\frac{\langle \mathrm{IPD}_i \rangle_{\mathrm{Me+}}}{\langle \mathrm{IPD}_i \rangle_{\mathrm{Me-}}}\right].$$
This signal can also be weighted by the prior probability of seeing
a Me+ signal.
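By way of illustration only, the weighted pseudo IPD can be computed as sketched below. The function names are hypothetical. Normalizing the log-ratio weights so that they sum to 1 is an assumption (the text states only that the weights would likely sum to 1), and the sketch further assumes that every neighboring position shows a longer mean IPD in the Me+ template, so that all log ratios are positive.

```python
import math

def log_ratio_weights(mean_ipd_me_plus, mean_ipd_me_minus):
    """w_i = log(<IPD_i>_Me+ / <IPD_i>_Me-), rescaled to sum to 1.
    The rescaling is an assumption made for this example."""
    raw = [math.log(p / m) for p, m in zip(mean_ipd_me_plus, mean_ipd_me_minus)]
    total = sum(raw)
    return [r / total for r in raw]

def pseudo_ipd(observed_ipds, weights, mean_ipd_me_minus):
    """IPD'_j = sum_i w_i * IPD_ji / <IPD_i>_Me-."""
    return sum(w * ipd / m
               for w, ipd, m in zip(weights, observed_ipds, mean_ipd_me_minus))
```

Under these assumptions, a read whose IPDs match the Me- averages yields a pseudo IPD of 1, and reads with elongated IPDs at the heavily weighted positions yield values above 1.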
[0151] In certain aspects, the invention provides a general-purpose
approach to discriminating between Me+ and Me- using features in a
real-time sequencing-by-synthesis trace comprising signals emitted
during the incorporation of optically detectable nucleotides into a
nascent strand by a polymerase enzyme. Such traces and various
methods of analysis thereof are further described elsewhere, e.g.,
in U.S. Patent Publication No. 20090024331, incorporated herein by
reference in its entirety for all purposes. A first stage of this
approach includes the development of a classifier for
distinguishing methylated from unmethylated cytosines in a nucleic
acid template. A set of features is measured for every pulse
(discrete event or signal) in the trace. For example, a set of
measurable features might be {pulse width, pulse duration, pulse
height, pulse amplitude variability}. Call the values of these
features $f_1^i, f_2^i, \ldots, f_k^i$ for
pulse i. For each cytosine pulse in the methylated and unmethylated
template data sets (which may or may not be restricted to CpG),
tabulation is performed for the local pulse features
$$\vec{f} = \{f_1^{i-3}, f_2^{i-3}, \ldots, f_k^{i-1}, f_1^{i}, f_2^{i}, \ldots, f_k^{i}, \ldots, f_1^{i+3}, f_2^{i+3}, \ldots, f_k^{i+3}\}.$$
In this example, a local context extending 3 pulses to the left and
right of the pulse of interest is assumed, but this context size is
flexible and, in certain embodiments, can be
application-specific.
[0152] The observed data likelihoods $p(\vec{f} \mid \mathrm{Me+})$
and $p(\vec{f} \mid \mathrm{Me-})$ are derived, e.g., by a kernel
density method or simple binning and tabulation of the features.
Thus, a generalized signal for determining methylation status on
the trace has been determined:
$$w_i = \log\left[\frac{p(\vec{f} \mid \mathrm{Me+})\, p(\mathrm{Me+})}{p(\vec{f} \mid \mathrm{Me-})\, p(\mathrm{Me-})}\right],$$
where p(Me+) and p(Me-) are the prior probabilities of methylated
or un-methylated positions, respectively.
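By way of illustration only, the "simple binning and tabulation" route to this signal can be sketched as follows. The function names, the bin width, and the crude pseudocount smoothing are hypothetical choices made so that unseen bins do not produce a zero likelihood; a kernel density estimate could be substituted.

```python
import math
from collections import Counter

def binned_likelihood(training_vectors, bin_width, pseudocount=1.0):
    """Tabulate binned feature vectors and return a function giving a
    smoothed estimate of p(f | class) for a new feature vector f."""
    def key(f):
        return tuple(int(v // bin_width) for v in f)
    counts = Counter(key(f) for f in training_vectors)
    n = len(training_vectors)
    def likelihood(f):
        # Pseudocount smoothing: occupied bins plus one catch-all unseen bin.
        return (counts[key(f)] + pseudocount) / (n + pseudocount * (len(counts) + 1))
    return likelihood

def methylation_signal(f, lik_plus, lik_minus, prior_plus=0.5):
    """w = log[ p(f|Me+) p(Me+) / ( p(f|Me-) p(Me-) ) ]."""
    return (math.log(lik_plus(f) * prior_plus)
            - math.log(lik_minus(f) * (1.0 - prior_plus)))
```

A positive signal favors the methylated state and a negative signal favors the unmethylated state; the prior shifts the decision point when methylation is known to be rare or common.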
[0153] Various standard classification algorithm development
techniques known to those of ordinary skill in the art may be
applied to refine this approach, both to reduce training set bias
and to improve sensitivity. Such techniques include but are not
limited to cross-validation, boosting, and bootstrap aggregating
(bagging). In certain embodiments, the set of feature inputs is
restricted to those that are most correlated with the Me+ and Me-
status of a position. In certain embodiments, the major component
in a principal components analysis can serve as a better weighted
combination of the most important features. In further embodiments,
leave-one-out cross-validation can be valuable in selecting a
robust predictive algorithm, e.g., by mitigating overfitting to the
observed data that can occur when developing a classifier on a
training set. Further, in some embodiments a boosting approach
(training of a hierarchy of classifiers on the progressively more
difficult regions of feature space) is applied to improve
sensitivity.
[0154] More sophisticated signals can be employed to detect
multiple, closely spaced CpGs. In certain embodiments, the data
likelihoods described above can be measured for the case of two CpG
sites with a known methylation state located a known distance
apart, e.g., 2 base pairs apart. The signal generalizes to
$$w_i = \arg\max_{\mu \in \{++,\, +-,\, -+,\, --\}} \log\left[\frac{p(\vec{f}_\alpha, \vec{f}_\beta \mid \mu)\, p(\mu)}{\sum_{\mu'} p(\vec{f}_\alpha, \vec{f}_\beta \mid \mu')\, p(\mu')}\right],$$
where $p(\mu)$ and $p(\mu')$ are the prior probabilities of
(methylated,un-methylated) configurations (this joint distribution
would be assumed to be independent unless otherwise shown).
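By way of illustration only, the joint call for two closely spaced CpG sites can be sketched as below. The function names are hypothetical, the joint likelihoods are assumed to have been estimated beforehand for each of the four configurations, and the prior is built under the independence assumption noted above.

```python
import math

CONFIGS = ("++", "+-", "-+", "--")

def independent_prior(p_alpha, p_beta):
    """Prior over joint configurations assuming the two sites are independent."""
    return {
        "++": p_alpha * p_beta,
        "+-": p_alpha * (1 - p_beta),
        "-+": (1 - p_alpha) * p_beta,
        "--": (1 - p_alpha) * (1 - p_beta),
    }

def call_configuration(joint_likelihood, prior):
    """Return (best configuration, log posterior of each configuration).

    joint_likelihood[mu] is p(f_alpha, f_beta | mu) for the observed data."""
    evidence = sum(joint_likelihood[m] * prior[m] for m in CONFIGS)
    log_post = {m: math.log(joint_likelihood[m] * prior[m] / evidence)
                for m in CONFIGS}
    best = max(CONFIGS, key=lambda m: log_post[m])
    return best, log_post
```

If the two sites are known to be correlated, a non-independent prior dictionary can be passed in place of the output of independent_prior.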
[0155] Although described primarily in the context of detection of
methyl cytosine, these methods are also applicable to
methyl-adenosine or any other base modification for which IPDs are
used as a metric for detection. FIG. 14 provides data showing
differences between ratios of IPDs for methylated adenosines and
unmethylated adenosines in a template nucleic acid, and the data
used to generate FIG. 14 is further described in the Examples
herein. This data also shows that N6-methyladenosine, like
methylcytosine, has an effect on IPD not only at the methylated
base but also at multiple neighboring positions. Further,
in light of the above teachings it will be clear to one of ordinary
skill that the approach can be extended to pulse metrics other than
IPD, such as pulse width, branch rate, mismatch rate, deletion
rate, etc. In addition, the general classifier approach suggested
in steps 2+3 can be implemented with many standard statistical
classification algorithms, e.g., linear discriminant analysis,
multi-dimensional regression, kernel methods, classification and
regression trees, neural networks, and support vector machines. The
approach can also incorporate data from multiple strands of a
duplex template. For example, because the CG sequence for cytosine
methylation and the GATC sequence for adenosine methylation are each
the same on the reverse complement strand, these bases can be
methylated on both complementary strands. If the general
statistical distribution for the fraction of sites that are
hemi-methylated vs. fully methylated is known, then information
regarding IPD or other metrics gained from the complementary strand
can be used to increase the accuracy with which a call is made on a
particular strand. For example, if after analyzing each strand
separately it is concluded that there is a 95% chance that strand A
is methylated and a 55% chance that complementary strand B is
methylated, but it is known that there is an 80% chance that if one
strand is methylated then so is the other, then the confidence in
calling strand B as methylated is increased.
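By way of illustration only, one way to carry out such an update is sketched below. The sketch makes two assumptions not stated in the text: that the 55% figure for strand B was computed under a flat (50/50) prior, so that its odds act as a likelihood ratio, and that strand B has a 50% chance of being methylated when strand A is not (the parameter p_b_given_not_a). The function name is hypothetical.

```python
def update_strand_b(p_a, p_b, p_b_given_a=0.8, p_b_given_not_a=0.5):
    """Revise the methylation probability for strand B using strand A.

    Assumes p_b was computed under a flat (50/50) prior, so p_b / (1 - p_b)
    acts as the likelihood ratio contributed by strand B's own data."""
    # Prior for B after marginalizing over the uncertain state of strand A.
    prior_b = p_a * p_b_given_a + (1.0 - p_a) * p_b_given_not_a
    odds = (prior_b / (1.0 - prior_b)) * (p_b / (1.0 - p_b))
    return odds / (1.0 + odds)
```

Under these assumptions, update_strand_b(0.95, 0.55) raises the probability that strand B is methylated from 0.55 to approximately 0.82.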
[0156] Another modified base for which IPDs may be used as a metric
for detection is 5-hydroxymethylcytosine (5-hmC). It was recently
found to be abundant in human and mouse brains, as well as in
embryonic stem cells (see, e.g., Kriaucionis, et al. (2009) "The
nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje
neurons and the brain" Science 324 (5929): 929-30; and Tahiliani M
et al. (May 2009) "Conversion of 5-methylcytosine to
5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1"
Science 324 (5929): 930-35, incorporated herein by reference in
their entireties for all purposes). In mammals, it can be generated
by oxidation of 5-methylcytosine, a reaction mediated by the Tet
family of enzymes. Conventional bisulfite sequencing does not
effectively distinguish 5-hmC from 5-MeC because 5-hmC tends to
remain unmodified like 5-MeC. As such, mass spectrometry is the
typical means of detecting 5-hmC in a nucleic acid sample. The
methods described herein provide a high-throughput, real-time
method to distinguish between C, 5-MeC, and 5-hmC by monitoring
deviations from normal polymerase kinetics, including IPD and pulse
width.
[0157] Experiments were carried out to test the ability of the
methods of the invention to distinguish between 5-MeC and 5-hmC,
and it was found that 5-hmC causes an increase in IPD at certain
positions surrounding the 5-hmC site in the template, and also
decreases the pulse width at that position. Further, the data
generated suggests that 5-hmC may also increase the pulse width at
the position following the 5-hmC site. Both the difference in IPD
and the difference in pulse width between C and 5-hmC were larger
in magnitude than were the differences in IPD and pulse width
between C and 5-MeC, and these larger magnitudes are likely to make
5-hmC even more detectable than 5-MeC. Without being bound by
theory, the reason for the higher magnitude differences for these
two measures may be due to the additional oxygen atom present in
5-hmC as compared to 5-MeC. This additional oxygen could yield
additional steric and charge-based interactions between the
polymerase and the DNA template that slow the binding and/or
incorporation of the complementary base into the nascent
strand.
[0158] Based on the findings that indicate the easier detection of
5-hmC as compared to 5-MeC, in certain embodiments template nucleic
acids can be treated to convert 5-MeC to the more easily detected
5-hmC, e.g., by treatment with an enzyme such as TET1, which
converts 5-methylcytosine to 5-hydroxymethylcytosine in mammalian
DNA (see, e.g., Tahiliani M et al., supra). Although this technique
would not permit distinction between 5-MeC and 5-hmC in the
template (since the 5-MeC converted to 5-hmC would be
indistinguishable from any 5-hmC originally present in the
template), it will nonetheless be useful for facilitating detection
of 5-MeC patterns in template nucleic acids, with the caveat that the
patterns so discovered may, in vivo, also include 5-hmC bases.
[0159] In order to maximally use the IPD and pulse width signals
from multiple positions surrounding the 5-hmC site, one could use a
technique to find the optimal weighting of different positions for
IPD and pulse width in order to distinguish 5-hmC, 5-MeC, and C
from one another. An example of one such technique is principal
component analysis, and others are known in the art. Principal
component analysis can be described as finding the eigenvector of
the data covariance matrix
(using each metric such as IPD or pulse width at each position as a
different basis vector, such that if you have 10 positions in
question and two metrics, your basis will have 2 × 10 = 20
dimensions) with the greatest eigenvalue. For a review of principal
component analysis, see e.g. Jolliffe I. T. Principal Component
Analysis, Series: Springer Series in Statistics, 2nd ed., Springer,
NY, 2002, the disclosure of which is incorporated herein by
reference in its entirety for all purposes.
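By way of illustration only, the eigenvector with the greatest eigenvalue can be approximated by power iteration on the sample covariance matrix, as sketched below. The function name, the fixed iteration count, and the starting vector are hypothetical choices. Power iteration can fail if the starting vector happens to be orthogonal to the leading component, so a practical implementation would more likely rely on an established linear algebra library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix of `rows`,
    found by power iteration. Each row is one observation, e.g. the IPD
    and pulse width values at each position around a site."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Asymmetric start vector (an arbitrary choice for this sketch).
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # start vector carried no component along any variance
        v = [x / norm for x in nxt]
    return v
```

The entries of the returned unit vector can serve as the per-position, per-metric weights; with 10 positions and two metrics each row would have 20 entries.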
IX. Systems
[0160] The invention also provides systems that are used in
conjunction with the compositions and methods of the invention in
order to provide for real-time single-molecule detection of
analytical reactions. In particular, such systems typically include
the reagent systems described herein, in conjunction with an
analytical system, e.g., for detecting data from those reagent
systems. In certain preferred embodiments, analytical reactions are
monitored using an optical system capable of detecting and/or
monitoring interactions between reactants at the single-molecule
level. For example, such an optical system can achieve these
functions by first generating and transmitting an incident
wavelength to the reactants, followed by collecting and analyzing
the optical signals from the reactants. Such systems typically
employ an optical train that directs signals from the reactions to
a detector, and in certain embodiments in which a plurality of
reactions is disposed on a solid surface, such systems typically
direct signals from the solid surface (e.g., array of confinements)
onto different locations of an array-based detector to
simultaneously detect multiple different optical signals from each
of multiple different reactions. In particular, the optical trains
typically include optical gratings or wedge prisms to
simultaneously direct and separate signals having differing
spectral characteristics from each confinement in an array to
different locations on an array based detector, e.g., a CCD, and
may also comprise additional optical transmission elements and
optical reflection elements.
[0161] An optical system applicable for use with the present
invention preferably comprises at least an excitation source and a
photon detector. The excitation source generates and transmits
incident light used to optically excite the reactants in the
reaction. Depending on the intended application, the source of the
incident light can be a laser, laser diode, a light-emitting diode
(LED), an ultra-violet light bulb, and/or a white light source.
Further, the excitation light may be evanescent light, e.g., as in
total internal reflection microscopy, certain types of waveguides
that carry light to a reaction site (see, e.g., U.S. Application
Pub. Nos. 20080128627, 20080152281, and 20080152280), or zero mode
waveguides, described below. Where desired, more than one source
can be employed simultaneously. The use of multiple sources is
particularly desirable in applications that employ multiple
different reagent compounds having differing excitation spectra,
consequently allowing detection of more than one fluorescent signal
to track the interactions of more than one molecule or type of molecule
simultaneously (e.g., multiple types of differentially labeled
reaction components). A wide variety of photon detectors or
detector arrays are available in the art. Representative detectors
include but are not limited to an optical reader, a high-efficiency
photon detection system, a photodiode (e.g. avalanche photo diodes
(APD)), a camera, a charge-coupled device (CCD), an
electron-multiplying charge-coupled device (EMCCD), an intensified
charge coupled device (ICCD), and a confocal microscope equipped
with any of the foregoing detectors. For example, in some
embodiments an optical train includes a fluorescence microscope
capable of resolving fluorescent signals from individual sequencing
complexes. Where desired, the subject arrays of optical
confinements contain various alignment aids or keys to facilitate
a proper spatial placement of the optical confinement and the
excitation sources, the photon detectors, or the optical train as
described below.
[0162] The subject optical system may also include an optical train
whose function can be manifold and may comprise one or more optical
transmission or reflection elements. Such optical trains preferably
encompass a variety of optical devices that channel light from one
location to another in either an altered or unaltered state. First,
the optical train collects and/or directs the incident wavelength
to the reaction site (e.g., optical confinement). Second, it
transmits and/or directs the optical signals emitted from the
reactants to the photon detector. Third, it may select and/or
modify the optical properties of the incident wavelengths or the
emitted wavelengths from the reactants. Illustrative examples of
such optical transmission or reflection elements are diffraction
gratings, arrayed waveguide gratings (AWG), optical fibers, optical
switches, mirrors (including dichroic mirrors), lenses (including
microlenses, nanolenses, objective lenses, imaging lenses, and the
like), collimators, optical attenuators, filters (e.g.,
polarization or dichroic filters), prisms, wavelength filters
(low-pass, band-pass, or high-pass), planar waveguides,
wave-plates, delay lines, and any other devices that guide the
transmission of light through proper refractive indices and
geometries. One example of a particularly preferred optical train
is described in U.S. Patent Pub. No. 20070036511, filed Aug. 11,
2005, and incorporated by reference herein in its entirety for all
purposes.
[0163] In a preferred embodiment, a reaction site (e.g., optical
confinement) containing a reaction of interest is operatively
coupled to a photon detector. The reaction site and the respective
detector can be spatially aligned (e.g., 1:1 mapping) to permit an
efficient collection of optical signals from the reactants. In
certain preferred embodiments, a reaction substrate is disposed
upon a translation stage, which is typically coupled to appropriate
robotics to provide lateral translation of the substrate in two
dimensions over a fixed optical train. Alternative embodiments
could couple the translation system to the optical train to move
that aspect of the system relative to the substrate. For example, a
translation stage provides a means of removing a reaction substrate
(or a portion thereof) out of the path of illumination to create a
non-illuminated period for the reaction substrate (or a portion
thereof), and returning the substrate at a later time to initiate a
subsequent illuminated period. An exemplary embodiment is provided
in U.S. Patent Pub. No. 20070161017, filed Dec. 1, 2006.
[0164] In particularly preferred aspects, such systems include
arrays of reaction regions, e.g., zero mode waveguide arrays, that
are illuminated by the system, in order to detect signals (e.g.,
fluorescent signals) therefrom, that are in conjunction with
analytical reactions being carried out within each reaction region.
Each individual reaction region can be operatively coupled to a
respective microlens or a nanolens, preferably spatially aligned to
optimize the signal collection efficiency. Alternatively, a
combination of an objective lens, a spectral filter set or prism
for resolving signals of different wavelengths, and an imaging lens
can be used in an optical train, to direct optical signals from
each confinement to an array detector, e.g., a CCD, and
concurrently separate signals from each different confinement into
multiple constituent signal elements, e.g., different wavelength
spectra, that correspond to different reaction events occurring
within each confinement. In preferred embodiments, the setup
further comprises means to control illumination of each
confinement, and such means may be a feature of the optical system
or may be found elsewhere in the system, e.g., as a mask positioned
over an array of confinements. Detailed descriptions of such
optical systems are provided, e.g., in U.S. Patent Pub. No.
20060063264, filed Sep. 16, 2005, which is incorporated herein by
reference in its entirety for all purposes.
[0165] The systems of the invention also typically include
information processors or computers operably coupled to the
detection portions of the systems, in order to store the signal
data obtained from the detector(s) on a computer readable medium,
e.g., hard disk, CD, DVD or other optical medium, flash memory
device, or the like. For purposes of this aspect of the invention,
such operable connection provides for the electronic transfer of
data from the detection system to the processor for subsequent
analysis and conversion. Operable connections may be accomplished
through any of a variety of well known computer networking or
connecting methods, e.g., Firewire®, USB connections, wireless
connections, WAN or LAN connections, or other connections that
preferably include high data transfer rates. The computers also
typically include software that analyzes the raw signal data,
identifies signal pulses that are likely associated with
incorporation events, and identifies bases incorporated during the
sequencing reaction, in order to convert or transform the raw
signal data into user interpretable sequence data (see, e.g.,
Published U.S. Patent Application No. 2009-0024331, the full
disclosure of which is incorporated herein by reference in its
entirety for all purposes).
[0166] Exemplary systems are described in detail in, e.g., U.S.
patent application Ser. No. 11/901,273, filed Sep. 14, 2007 and
U.S. patent application Ser. No. 12/134,186, filed Jun. 5, 2008,
the full disclosures of which are incorporated herein by reference
in their entirety for all purposes.
[0167] Further, the invention provides data processing systems for
transforming raw data generated in an analytical reaction into
analytical data that provides a measure of one or more aspects of
the reaction under investigation, e.g., transforming signals from a
sequencing-by-synthesis reaction into nucleic acid sequence read
data, which can then be transformed into consensus sequence data.
In certain embodiments, the data processing systems include
machines for generating nucleic acid sequence read data by
polymerase-mediated processing of a template nucleic acid molecule
(e.g., DNA or RNA). The nucleic acid sequence read data generated
is representative of the nucleic acid sequence of the nascent
polynucleotide synthesized by a polymerase translocating along a
nucleic acid template only to the extent that a given sequencing
technology is able to generate such data, and so may not be
identical to the actual sequence of the nascent polynucleotide
molecule. For example, it may contain a deletion or a different
nucleotide at a given position as compared to the actual sequence
of the polynucleotide, e.g., when a nucleotide incorporation is
missed or incorrectly determined, respectively. As such, it is
beneficial to generate redundant nucleic acid sequence read data,
and to transform the redundant nucleic acid sequence read data into
consensus nucleic acid sequence data that is generally more
representative of the actual sequence of the polynucleotide
molecule than nucleic acid sequence read data from a single read of
the nucleic acid molecule. Redundant nucleic acid sequence read
data comprises multiple reads, each of which includes at least a
portion of nucleic acid sequence read that overlaps with at least a
portion of at least one other of the multiple nucleic acid sequence
reads. As such, the multiple reads need not all overlap with one
another, and a first subset may overlap for a different portion of
the nucleic acid sequence than does a second subset. Such redundant
sequence read data can be generated by various methods, including
repeated synthesis of nascent polynucleotides from a single nucleic
acid template, synthesis of polynucleotides from multiple identical
nucleic acid templates, or a combination thereof.
[0168] In another aspect, the data processing systems can include
software and algorithm implementations provided herein, e.g. those
configured to transform redundant nucleic acid sequence read data
into consensus nucleic acid sequence data, which, as noted above,
is generally more representative of the actual sequence of the
nascent polynucleotide molecule than nucleic acid sequence read
data from a single read of a single nucleic acid molecule. Further,
the transformation of the redundant nucleic acid sequence read data
into consensus nucleic acid sequence data identifies and negates
some or all of the single-read variation between the multiple reads
in the redundant nucleic acid sequence read data. As such, the
transformation provides a representation of the actual nucleic acid
sequence of the nascent polynucleotide complementary to the nucleic
acid template that is more accurate than a representation based on
a single read.
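By way of illustration only, a very simple transformation of redundant read data into consensus data is a per-position majority vote, sketched below. The sketch assumes the reads have already been aligned to a common coordinate system and padded with a gap character where a base was missed; the function name and the gap symbol are hypothetical, and practical consensus algorithms would typically also weigh quality values.

```python
from collections import Counter

def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads already aligned to a common
    coordinate system (equal-length strings; gap marks a missed base)."""
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("reads must be aligned to equal length")
    called = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column; ties go to the first seen.
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            called.append(base)
    return "".join(called)
```

In this sketch a deletion or miscall present in a single read is outvoted by the other reads covering the same position, which is the sense in which single-read variation is identified and negated.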
[0169] Various methods and algorithms for data transformation
employ data analysis techniques that are familiar in a number of
technical fields, and are generally referred to herein as
statistical analysis. For clarity of description, details of known
techniques are not provided herein. These techniques are discussed
in a number of available reference works, such as those provided in
U.S. Patent Publication No. 20090024331 and U.S. Ser. No.
61/116,439, filed Nov. 20, 2008, the disclosures of which are
incorporated herein by reference in their entireties for all
purposes.
[0170] The software and algorithm implementations provided herein
are preferably machine-implemented methods, e.g., carried out on a
machine comprising a computer-readable medium configured to carry out
various aspects of the methods herein. For example, the
computer-readable medium preferably comprises at least one or more
of the following: a) a user interface; b) memory for storing raw
analytical reaction data; c) memory storing software-implemented
instructions for carrying out the algorithms for transforming the
raw analytical reaction data into transformed data that
characterizes one or more aspects of the reaction (e.g., rate,
consensus sequence data, etc.); d) a processor for executing the
instructions; e) software for recording the results of the
transformation into memory; and f) memory for recordation and
storage of the transformed data. In preferred embodiments, the user
interface is used by the practitioner to manage various aspects of
the machine, e.g., to direct the machine to carry out the various
steps in the transformation of raw data into transformed data,
recordation of the results of the transformation, and management of
the transformed data stored in memory.
[0171] As such, in preferred embodiments, the methods further
comprise a transformation of the computer-readable medium by
recordation of the raw analytical reaction data and/or the
transformed data generated by the methods. Further, the
computer-readable medium may comprise software for providing a
graphical representation of the raw analytical reaction data and/or
the transformed data, and the graphical representation may be
provided, e.g., in soft-copy (e.g., on an electronic display)
and/or hard-copy (e.g., on a print-out) form.
[0172] The invention also provides a computer program product
comprising a computer-readable medium having a computer-readable
program code embodied therein, the computer readable program code
adapted to implement one or more of the methods described herein,
and optionally also providing storage for the results of the
methods of the invention. In certain preferred embodiments, the
computer program product comprises the computer-readable medium
described above.
[0173] In another aspect, the invention provides data processing
systems for transforming raw analytical reaction data from one or
more analytical reactions into transformed data representative of a
particular characteristic of an analytical reaction, e.g., an
actual sequence of one or more template nucleic acids analyzed, a
rate of an enzyme-mediated reaction, an identity of a kinase target
molecule, and the like. Such data processing systems typically
comprise a computer processor for processing the raw data according
to the steps and methods described herein, and computer usable
medium for storage of the raw data and/or the results of one or
more steps of the transformation, such as the computer-readable
medium described above.
[0174] As shown in FIG. 9, the system 900 includes a substrate 902
that includes a plurality of discrete sources of chromophore
emission signals, e.g., an array of zero mode waveguides 904. An
excitation illumination source, e.g., laser 906, is provided in the
system and is positioned to direct excitation radiation at the
various signal sources. This is typically done by directing
excitation radiation at or through appropriate optical components,
e.g., dichroic 908 and objective lens 910, that direct the
excitation radiation at the substrate 902, and particularly the
signal sources 904. Emitted signals from the sources 904 are then
collected by the optical components, e.g., objective 910, and
passed through additional optical elements, e.g., dichroic 908,
prism 912 and lens 914, until they are directed to and impinge upon
an optical detection system, e.g., detector array 916. The signals
are then detected by detector array 916, and the data from that
detection is transmitted to an appropriate data processing system,
e.g., computer 918, where the data is subjected to interpretation
and analysis, and is ultimately presented in a user-ready format,
e.g., on display 920 or as printout 922 from printer 924. As will be
appreciated, a variety of modifications may be made to such
systems, including, for example, the use of multiplexing components
to direct multiple discrete beams at different locations on the
substrate, the use of spatial filter components, such as confocal
masks, to filter out-of-focus components, beam shaping elements to
modify the spot configuration incident upon the substrates, and the
like (See, e.g., Published U.S. Patent Application Nos.
2007/0036511 and 2007/095119, and U.S. patent application Ser. No.
11/901,273, all of which are incorporated herein by reference in
their entireties for all purposes.)
II. Examples
Detection of 5-methylcytosine (5-MeC)
[0175] Methylation sequencing on a SMRT™ Sequencing platform
(see, e.g., P. M. Lundquist, et al., supra) was performed on short,
synthetic DNA oligos with contrived patterns of methylated and
unmethylated bases, along with control sequences having the same
primary sequence but without any methylation. These templates
provided unequivocal fluorescence pulse patterns and tempos that
demonstrated how the combination of sequence context and
methylation status affected interpulse duration. For example,
SMRT™ sequencing experiments were performed using synthetic DNA
templates that only differed by a single methylated vs.
unmethylated cytosine. The difference in average interpulse
durations between the two templates was visible both at the 5-MeC
position and in the vicinity of the 5-MeC position.
[0176] Because the interpulse duration between any two successive
incorporation events is stochastic in nature and has an exponential
distribution (Eid, et al., supra), a single sequencing measurement
may not always yield enough information to determine methylation
status with certainty. Therefore, in certain embodiments a highly
processive, strand-displacing polymerase is used, and this
polymerase carries out multiple laps of synthesis around a circular
DNA template (J. Korlach, et al., Proc Natl Acad Sci USA 2008,
supra). This mode of operation provides repeated sequencing of the
same DNA molecule to generate multiple sequence reads, e.g., by
rolling circle replication. The statistical distribution of
interpulse durations obtained at a particular template site will
thus indicate its methylation state.
[0177] In particular, FIG. 10A shows a schematic of two templates
for use in SMRT™ sequencing. Both comprise a
double-stranded region flanked by two single-stranded hairpins. A
polymerase binds to a primed location on the template, e.g., via a
primer hybridized to one of the single-stranded hairpins, and
commences processing the template to generate a nascent strand
complementary to the strand upon which the polymerase is
translocating. The strand displacement activity of the polymerase
permits passage through the double-stranded region which is unwound
to transform the template into a circular form. The polymerase then
proceeds around the other single-stranded hairpin and on through
the previously displaced strand of the double-stranded region. The
polymerase can continue to process the template in a
"rolling-circle" fashion to generate a concatemer comprising
multiple copies of complements to both strands of the
double-stranded region, as well as the hairpins. The two templates
are identical except at position 2, where the top template
comprises a methylated cytosine (5-MeC) and the bottom template
comprises a non-methylated cytosine. (Position 1 is a
non-methylated cytosine in both templates.) FIG. 10B provides an
illustrative depiction of the difference in IPD for the methylated
template as compared to the unmethylated template. For each row,
the histograms depict the distributions of mean weighted IPD,
averaged over the labeled number of circular consensus sequencing
subreads (in this context, a subread is a sequence read generated
from a single pass of the polymerase around the template). Specifically, "1"
indicates the sequencing data was derived from a sequence read
generated in a single pass around the template; "3" indicates the
data was derived from a sequence read generated in three passes
around the template; and "5" indicates the data was derived from a
sequence read generated in five passes around the template. The
data from the methylated template is shown as a solid line, and the
data from the unmethylated template is shown as a dotted line. At
Position 1, the distributions of weighted IPD for the two templates
are very similar. At Position 2, the average weighted IPD after a
single subread (top histogram) is longer in the methylated template
than in the unmethylated template. After 3 and 5 circular subreads,
the distributions overlap even less. The interpulse duration (IPD)
was clearly lengthened by the presence of 5-MeC. These results
demonstrated the ability to use SMRT™ sequencing technology to
perform methylation sequencing of DNA. Weighted IPDs are described
elsewhere herein.
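By way of illustration only, the benefit of averaging over several
subreads can be sketched with a short simulation. The sketch assumes
that the IPDs observed at one template site are independent draws
from an exponential distribution with a fixed mean, which is a
simplification, and it uses a plain (unweighted) mean rather than the
weighted IPD referred to above. The function names, the mean IPD of
1.0, and the number of simulated molecules are hypothetical and are
not taken from the experiments described herein.

```python
import random
import statistics


def mean_ipd_over_passes(mean_ipd, n_passes, rng):
    """Average n_passes exponentially distributed IPDs for one template site."""
    # expovariate takes a rate, i.e. 1 / mean.
    draws = [rng.expovariate(1.0 / mean_ipd) for _ in range(n_passes)]
    return sum(draws) / n_passes


def spread_of_mean(mean_ipd, n_passes, n_molecules=20000, seed=0):
    """Standard deviation of the per-molecule mean IPD across simulated molecules."""
    rng = random.Random(seed)
    means = [
        mean_ipd_over_passes(mean_ipd, n_passes, rng)
        for _ in range(n_molecules)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    # Hypothetical site with a mean IPD of 1.0 (arbitrary time units).
    for passes in (1, 3, 5):
        print(passes, round(spread_of_mean(1.0, passes), 3))
```

Under the stated assumption, the standard deviation of the mean of n
independent exponential IPDs is the single-pass mean divided by the
square root of n, so the distributions for two templates with
different mean IPDs would be expected to narrow, and therefore
overlap less, as the number of subreads increases.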
[0178] Further, methylcytosine was shown to have an effect on
interpulse duration (IPD) not only at the methylated base, but over
a range of several bases upstream and downstream of the position of
the methylcytosine. Specifically, an increase in IPD was observed
at some positions in the presence of methylcytosine relative to the
same position in the absence of methylcytosine. FIG. 11 provides a
plot depicting the ratio of the average IPD in the methylated
template to the average IPD in the unmethylated template, plotted
versus DNA template position. The two templates are identical
except for the methylated bases in the methylated template, which
are indicated by arrowheads in FIG. 11.
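A minimal sketch of how a per-position IPD ratio of the kind plotted
in FIG. 11 might be computed is given below. It assumes that IPDs
have already been assigned to template positions for each template;
the function name, the dictionary layout, and the numerical values
are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def ipd_ratio_by_position(modified_ipds, control_ipds):
    """Return {position: mean modified IPD / mean control IPD}.

    Both arguments map a template position to a non-empty list of IPDs.
    Only positions present in both inputs are reported.
    """
    ratios = {}
    for pos in sorted(set(modified_ipds) & set(control_ipds)):
        control_mean = mean(control_ipds[pos])
        if control_mean > 0:  # avoid division by zero
            ratios[pos] = mean(modified_ipds[pos]) / control_mean
    return ratios


# Hypothetical IPDs (seconds) for three template positions.
modified = {1: [0.2, 0.3, 0.25], 2: [1.0, 1.4, 1.2], 3: [0.5, 0.7]}
control = {1: [0.2, 0.3, 0.25], 2: [0.3, 0.2, 0.25], 3: [0.3, 0.3]}
ratios = ipd_ratio_by_position(modified, control)
```

A ratio near 1 indicates no difference between the two templates at
that position, whereas a ratio above 1 indicates a longer average IPD
in the modified template.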
[0179] FIG. 12 provides another data set illustrating the ratio of
IPD for a different methylated template vs. an identical but
unmethylated template as a function of position. Seven cytosines
(shown with crosshatching) were differentially methylated (5-MeC)
between the two templates. That data clearly showed that IPD was
increased in the region comprising the methylated bases.
Interestingly, the effect on IPD occurred mostly downstream of the
methylated positions. As such, data from nascent strand synthesis
at positions in the template that are near the differentially
methylated site, in addition to the differentially methylated site
itself, is useful for methylation detection during real-time
nascent strand synthesis.
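One hypothetical way to pool data from positions near a candidate
site is sketched below. The scoring rule (a sum of log IPD ratios
over a window), the window sizes, and the convention that downstream
positions have higher position numbers are all assumptions made for
illustration; they are not a scoring method specified in this
disclosure.

```python
import math


def window_score(ratios, site, upstream=1, downstream=5):
    """Sum log IPD ratios over positions site-upstream .. site+downstream.

    ratios maps template position to (modified mean IPD / control mean IPD).
    Positions missing from ratios, or with a non-positive ratio, are skipped.
    """
    total = 0.0
    for pos in range(site - upstream, site + downstream + 1):
        ratio = ratios.get(pos)
        if ratio is not None and ratio > 0:
            total += math.log(ratio)
    return total


# Hypothetical ratios: the candidate site is 10, and the elevated
# ratios fall at the two positions that follow it.
example_ratios = {9: 1.0, 10: 1.0, 11: 2.0, 12: 4.0, 13: 1.0}
score_site_only = window_score(example_ratios, 10, upstream=0, downstream=0)
score_window = window_score(example_ratios, 10, upstream=0, downstream=3)
```

In this example the score computed from the candidate site alone is
zero, while the windowed score is positive, reflecting the elevated
ratios at neighboring positions.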
[0180] Detection of N6-Methyladenosine (N6-MeA)
[0181] Similar methods as those used to detect 5-MeC were used to
detect N6-MeA in similarly constructed template nucleic acids. FIG.
13A shows a schematic of two templates, both of which comprise a
double-stranded region flanked by two single-stranded hairpins. The
methylated template has an A within a GATC context at Position 1
and a ᵐA within a GATC context at Position 2, whereas the
unmethylated template has an A at both positions. Otherwise, the
two templates are identical. As described above, a polymerase
binds to a primed location on the template and commences processing
the template to generate a nascent strand, using its strand
displacement activity to unwind the double-stranded region and
proceed around the template. FIG. 13B shows plots of mean IPD
generated from sequencing data using these two templates for
varying numbers of consensus reads, as described above. The data
from the methylated template is shown as a solid line, and the data
from the unmethylated template is shown as a dotted line. For each
row in FIG. 13B, the histograms depict the distributions of mean
IPD (averaged over the labeled number of consensus sequencing
subreads, i.e. the number of times the polymerase made one complete
pass around the template to generate a complementary nascent
strand). At Position 1, the distributions of IPD for the two
templates are very similar. At Position 2, the average IPD after a
single subread (top histogram) is approximately 5-fold longer in the
methylated template than in the unmethylated template. After 3 and
5 circular subreads, the distributions overlap even less. The
interpulse duration (IPD) was clearly lengthened by the presence of
N6-MeA, demonstrating that the SMRT™ sequencing technology can
be used to perform methylation sequencing of DNA comprising a
methylated base other than 5-MeC.
[0182] Receiver operating characteristic (ROC) curves,
parameterized by IPD threshold, for assigning a methylation status
to an adenosine nucleotide are provided in FIG. 13C. A true positive
means that an ᵐA is correctly called as ᵐA, whereas a
false positive means that an A is mistakenly called as ᵐA.
These ROC curves, based on the IPD distributions from Position 2 in
FIG. 13B, are shown for a single read (solid line), and for 3
(long-dashed line) or 5 (short-dashed line) molecular redundant
sequencing reads produced by the polymerase processing the template
one, three, or five times, respectively. The dotted horizontal line
bisecting the graph depicts the ROC curve for randomly guessing the
methylation status. The normalized area under the ROC curve is 0.80
after the first circular subread but increases to 0.92 and 0.96
after three and five circular subreads, respectively. In fact,
after five subreads, >85% of ᵐA bases can be detected at
this template position with a false positive rate of only
approximately 5%.
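The construction of an ROC curve parameterized by IPD threshold, and
of the area under that curve, can be sketched as follows. The sketch
assumes that a site is called methylated when its (mean) IPD is at or
above the threshold; the function names and the example values are
hypothetical and do not reproduce the data of FIG. 13C.

```python
def roc_points(modified_ipds, unmodified_ipds):
    """Return (false positive rate, true positive rate) pairs.

    One pair is produced per candidate threshold, sweeping from the
    largest observed IPD down to the smallest, so both rates are
    non-decreasing and the curve runs from (0, 0) to (1, 1).
    """
    thresholds = sorted(set(modified_ipds) | set(unmodified_ipds), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # A site is called "modified" when its IPD is >= the threshold.
        tpr = sum(x >= t for x in modified_ipds) / len(modified_ipds)
        fpr = sum(x >= t for x in unmodified_ipds) / len(unmodified_ipds)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under an ROC curve given as ordered (fpr, tpr) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical, fully separated IPD values (seconds).
separated = area_under_curve(roc_points([2.0, 3.0, 4.0], [0.5, 1.0, 1.5]))
# Hypothetical, identical IPD values: no better than random guessing.
identical = area_under_curve(roc_points([1.0, 2.0], [1.0, 2.0]))
```

An area of 1 corresponds to distributions that can be separated
perfectly by some threshold, and an area of 0.5 corresponds to
random guessing.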
[0183] Like methylcytosine, methyladenosine was also shown to have
an effect on IPD over a range of several bases upstream and
downstream of the position of the methyladenosine. Specifically, an
increase in IPD was observed at some positions in the presence of
methyladenosine relative to the same position in the absence of
methyladenosine. FIG. 14 provides a plot depicting the ratio of the
average IPD in the methylated template to the average IPD in the
unmethylated template, plotted versus DNA template position. The
two templates are identical except for the methylated bases in the
methylated template, which are indicated by arrowheads in FIG.
14.
[0184] Detection of 5-Hydroxymethylcytosine
[0185] Similar to 5-MeC and N6-MeA, 5-hydroxymethylcytosine was
also tested and shown to have an effect on IPD over a range of
several bases upstream and downstream of the position of the
5-hydroxymethylcytosine. Specifically, an increase in IPD was
observed at some positions in the presence of
5-hydroxymethylcytosine relative to the same position in the
absence of 5-hydroxymethylcytosine. FIG. 15 provides a plot
depicting the ratio of the average IPD in the hydroxymethylated
template to the average IPD in the unmethylated template, plotted
versus DNA template position. The two templates are identical
except for the 5-hydroxymethylcytosine bases in the hydroxymethylated
template, which are indicated by arrowheads in FIG. 15. Templates
comprising 5-hydroxymethylcytosine bases were also tested and the
presence of these modifications was shown to have an effect on
pulse width. FIG. 16 provides a plot of pulse width ratio (pulse
width for methylated template divided by pulse width for
unmethylated template) vs. template position where the modified
positions comprise 5-hydroxymethylcytosine bases. Variably
hydroxymethylated positions are indicated by the arrowheads.
[0186] Detection of 8-Oxoguanosine (8-oxoG)
[0187] 8-oxoguanosine was also subjected to single molecule
real-time sequencing and was shown to affect IPD both at the site
of the modification and at proximal unmodified positions in
the template. An increase in IPD was observed at some positions in
the presence of 8-oxoguanosine relative to the same position in the
absence of 8-oxoguanosine. FIG. 17 provides a plot depicting the
ratio of the average IPD in the 8-oxoguanosine template to the
average IPD in the unmodified template, plotted versus DNA template
position. The two templates are identical except for the
8-oxoguanosine bases in the modified template, which are indicated by
arrowheads in FIG. 17. These data showed that, compared to G,
8-oxoG altered IPD significantly over a window of approximately 10
neighboring bases surrounding the 8-oxoG position. Some positions
saw an increase in IPD by a factor of as much as 6.5.
Templates comprising 8-oxoG bases were also tested and the presence
of these modifications was shown to have an effect on pulse width.
FIG. 18 provides a plot of pulse width ratio (pulse width for an
8-oxoG template divided by pulse width for template with no 8-oxoG)
vs. template position where the modified positions comprise 8-oxoG
bases. Variable positions are indicated by the arrowheads. Further,
8-oxoG altered pulse width over a window of 7-8 neighboring bases
by as much as 40%, and such alteration included both increases and
decreases in pulse width.
[0188] It is to be understood that the above description is
intended to be illustrative and not restrictive. It should readily
be apparent to one skilled in the art that various embodiments and
modifications may be made to the invention disclosed in this
application without departing from the scope and spirit of the
invention. The scope of the invention should, therefore, be
determined not with reference to the above description, but should
instead be determined with reference to the appended claims, along
with the full scope of equivalents to which such claims are
entitled. All publications mentioned herein are cited for the
purpose of describing and disclosing reagents, methodologies and
concepts that may be used in connection with the present invention.
Nothing herein is to be construed as an admission that these
references are prior art in relation to the inventions described
herein. Throughout the disclosure various patents, patent
applications, and publications are referenced. To the extent not
already expressly incorporated herein, all published references and
patent documents referred to in this disclosure are incorporated
herein by reference in their entirety for all purposes.
* * * * *