U.S. patent application number 12/574735 was filed on 2009-10-07 and published on 2010-03-11 for nucleotide sequencing via repetitive single molecule hybridization.
Invention is credited to Joseph M. JACOBSON.
Application Number | 20100063264 12/574735 |
Document ID | / |
Family ID | 34742942 |
Filed Date | 2010-03-11 |
United States Patent
Application |
20100063264 |
Kind Code |
A1 |
JACOBSON; Joseph M. |
March 11, 2010 |
NUCLEOTIDE SEQUENCING VIA REPETITIVE SINGLE MOLECULE
HYBRIDIZATION
Abstract
Methods of obtaining sequence information about target
oligonucleotides by repetitive single molecule hybridization are
disclosed. The methods include exposing a target oligonucleotide to
one or more copies of a test oligonucleotide; measuring
hybridization; dehybridizing the test oligonucleotide; and
repeating until the information content from the hybridization
trials equals or exceeds the information content of the target
oligonucleotide.
Inventors: |
JACOBSON; Joseph M.; (Newton
Center, MA) |
Correspondence
Address: |
MARTIN D. MOYNIHAN d/b/a PRTSI, INC.
P.O. BOX 16446
ARLINGTON
VA
22215
US
|
Family ID: |
34742942 |
Appl. No.: |
12/574735 |
Filed: |
October 7, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11896897 | Sep 6, 2007 | 7604941 |
12574735 | | |
10990939 | Nov 17, 2004 | 7276338 |
11896897 | | |
60520751 | Nov 17, 2003 | |
Current U.S.
Class: |
536/23.1 |
Current CPC
Class: |
G16B 30/00 20190201;
B82Y 5/00 20130101; Y10T 436/143333 20150115; G16B 25/00 20190201;
B82Y 10/00 20130101; C12Q 1/6869 20130101 |
Class at
Publication: |
536/23.1 |
International
Class: |
C07H 21/00 20060101
C07H021/00 |
Claims
1. A method of synthesizing de novo a nucleic acid sequence, the
method comprising the steps of: a) synthesizing a nucleic acid
sequence de novo; b) obtaining information about the nucleic acid
sequence by repetitive single molecule hybridization, thereby to
detect an error in the nucleic acid sequence; and c) correcting the
error in the nucleic acid sequence by site-directed mutagenesis,
thereby synthesizing de novo the nucleic acid sequence.
2. The method of claim 1, wherein step b) is effected by: (a)
hybridizing one or more copies of a test oligonucleotide comprising
at least two informative region sequences to a single molecule of
target oligonucleotides under conditions permitting hybridization
of the test oligonucleotide to a perfectly complementary sequence;
(b) measuring a signal of said hybridization of the test
oligonucleotide to the target oligonucleotide, thereby completing a
first hybridization trial; (c) dehybridizing the test
oligonucleotide; (d) exposing the target oligonucleotide to one or
more copies of at least one additional test oligonucleotide, said
additional test oligonucleotide comprising at least two informative
regions; (e) measuring a signal from the hybridization of the
additional test oligonucleotide to the target oligonucleotide,
thereby completing a further hybridization trial; (f) repeating
steps c) to e) until the information content from the hybridization
trials exceeds the information content of the target
oligonucleotide; and (g) analyzing said signals.
3. The method of claim 2, wherein said signal of said hybridization
comprises a fluorescent tag.
4. The method of claim 2, wherein said signal of said hybridization
comprises a quantum dot or nanoparticle tag.
5. The method of claim 2, wherein said measuring is effected
mechanically.
6. The method of claim 2, wherein said measuring is effected by
sensing a change in charge associated with hybridization.
7. The method of claim 2, wherein said measuring is effected
electronically.
8. The method of claim 2, wherein said target oligonucleotide is
immobilized on a solid support.
9. The method of claim 2, wherein a plurality of said target
oligonucleotides are immobilized on a solid support.
10. The method of claim 1, wherein step (b) is effected by: (a)
hybridizing one or more copies of a test oligonucleotide to a
single molecule of target oligonucleotides under conditions
permitting hybridization of the test oligonucleotide to a perfectly
complementary sequence; (b) measuring a signal of said
hybridization of the test oligonucleotide to the target
oligonucleotide, thereby completing a first hybridization trial;
(c) dehybridizing the test oligonucleotide; (d) exposing the target
oligonucleotide to one or more copies of at least one additional
test oligonucleotide; (e) measuring a signal from the hybridization
of the additional test oligonucleotide to the target
oligonucleotide, thereby completing a further hybridization trial;
(f) repeating steps c) to e) until the information content from the
hybridization trials exceeds the information content of the target
oligonucleotide; and (g) analyzing said signals; and (h) analyzing
an absence of said signals to eliminate oligonucleotide test
members.
11. The method of claim 10, wherein said signal of said
hybridization comprises a fluorescent tag.
12. The method of claim 10, wherein said signal of said
hybridization comprises a quantum dot or nanoparticle tag.
13. The method of claim 10, wherein said measuring is effected
mechanically.
14. The method of claim 10, wherein said measuring is effected by
sensing a change in charge associated with hybridization.
15. The method of claim 10, wherein said measuring is effected
electronically.
16. The method of claim 10, wherein said target oligonucleotide is
immobilized on a solid support.
17. The method of claim 10, wherein a plurality of said target
oligonucleotides are immobilized on a solid support.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent
application Ser. No. 11/896,897 filed on Sep. 6, 2007, which is a
continuation of U.S. patent application Ser. No. 10/990,939 filed
on Nov. 17, 2004, which claims the benefit of U.S. Provisional
Patent Application No. 60/520,751 filed on Nov. 17, 2003. The
contents of the above applications are incorporated herein by
reference.
BACKGROUND
[0002] The advent of the first reference sequence of the human
genome by Lander (Nature (15 Feb. 2001) 409: 860-921) and Venter
(Science (16 Feb. 2001) 291:1304) has generated increased interest
in the ability to sequence entire genomes which range in size from
~1 megabase to as high as 600 gigabases in some
organisms. To make uses of the human genome reference sequence
tractable, several important innovations including highly parallel
capillary electrophoresis were developed in order to bring base
sequencing costs down.
[0003] Unfortunately, to go beyond a small number of reference
sequences to the point where it is feasible to sequence each
individual genome in a population or to sequence ab initio a large
class of new organisms, vastly faster and less expensive means are
required. Towards that end several new approaches have been
proposed and demonstrated to various degrees including: Edman
degradation and fluorescent dye labeling of a single DNA strand in
a flow cytometer; "sequencing by hybridization" (Perlegen Corp.;
Callida Genomics) and "sequencing by synthesis" (Quake et al., Cal.
Tech.; Solexa Corp.). The latter two approaches afford a high
degree of parallel information retrieval, leveraging chip-based
imaging system approaches to simultaneously record data from a very
large number of gene chip pixels. Unfortunately, each approach
suffers several shortcomings.
[0004] "Sequencing by hybridization" requires an inordinately large
array of explicitly patterned spots to approximate the information
content of the genome to be sequenced. In addition,
oligonucleotides from the sample are required to search a very
large array of gene chip oligonucleotide complements, making
hybridization time very long.
[0005] "Sequencing by synthesis" avoids a number of the issues
above but introduces difficult and error-prone chemistries which
may be difficult to scale effectively.
SUMMARY OF THE INVENTION
[0006] The present invention provides "repetitive single molecule
hybridization," which has the ability to carry out de novo
sequencing of DNA, such as genomic DNA, at high speed and low
cost.
[0007] The invention involves obtaining sequence information from a
target nucleotide. A plurality of target nucleotides can be
interrogated in parallel using the methods described herein,
permitting the sequencing of an entire genome or a subset thereof.
Generally, test oligonucleotides from a library of test
oligonucleotides of uniform length are exposed to one or more
target oligonucleotides under conditions permitting hybridization
of the test oligonucleotides to a perfectly complementary sequence.
The target nucleotide can be immobilized on a chip such as an
inverse gene chip representing a genome or a portion thereof. A
suitable inverse gene chip can be prepared, for example, by
dehybridizing (denaturing) nucleic acids (e.g. by increasing the
temperature or altering solvent conditions), selecting a single
strand, cutting the strand into target oligonucleotides of average
length N nucleotides and chemically attaching the target
nucleotides to a substrate.
[0008] To perform sequencing, a target oligonucleotide is exposed
under hybridizing conditions to a first set of identical test
oligonucleotides drawn from a library of all possible test
oligonucleotides of length M. Hybridization or failure to hybridize
at the single molecule level is detected and recorded. The
molecules are denatured to separate any annealed oligonucleotides
and the first set of test oligonucleotides is then washed away. The
process is repeated, substituting different sets of identical test
oligonucleotides, until the library of test oligonucleotides has
been exhausted or until the information content of the set of
hybridization experiments equals or exceeds the targeted sequence
information.
[0009] The present invention permits highly parallel sequencing of
target nucleic acids while requiring a library of test
oligonucleotides of only modest size. Accordingly, in one
embodiment, where the target oligonucleotides are less than 50
nucleotides in length (i.e. N<50), test oligonucleotides having
2, 3, or 4 potentially informative positions (i.e. the
oligonucleotide length, subtracting uninformative spacer positions,
universal bases, positions hybridizing to any primer sequences
incorporated in the target oligonucleotides, or the like) are used;
where the target oligonucleotides are 50-99 nucleotides in length,
test oligonucleotides having 3, 4, or 5 potentially informative
positions are used; where the target oligonucleotides are 100-999
nucleotides in length, test oligonucleotides having 4-6 potentially
informative positions are used; where the target oligonucleotides
are 1,000-10,000 nucleotides in length, test oligonucleotides
having 5-8 potentially informative positions are used; and where
the target oligonucleotides are greater than 10,000 nucleotides in
length, test oligonucleotides having 7-13 potentially informative
positions are used. In another embodiment, positive hybridization
outcomes are used to deduce a set of possible target
oligonucleotide sequences and negative hybridization outcomes are
used to eliminate members from the set.
[0010] To address the existence of repeats in some nucleic acids of
interest, the invention also provides methods of sequencing using
test oligonucleotides incorporating one or more spacers separating
at least two of the potentially informative positions of the test
oligonucleotides. Useful spacers include, for example,
double-stranded portions of the test oligonucleotide; spacers
comprising an abasic furan; or other traditional chemical spacers
such as those incorporating a polyethylene glycol portion or an
extended carbon chain including, for example, at least three
methylene groups.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1A is a schematic representation of preparation of an
inverse gene chip.
[0012] FIG. 1B is a close-up schematic of an inverse gene chip
consisting of K oligonucleotides of average length N derived from
target genome disposed at random fixed locations.
[0013] FIGS. 2A-2E schematically illustrate the steps involved in
deriving de novo sequence information from an inverse gene chip by
means of repeated test hybridizations with members of the library
of all possible oligonucleotides of length M ("M-mers"). FIG. 2A
depicts the initial condition before incubation with test M-mers.
FIG. 2B depicts introduction of a test M-Mer oligonucleotide with
fluorescent dye under selective hybridizing conditions and depicts
the subsequent parallel readout with an electronic imager. FIG. 2C
depicts the status following dehybridization and wash steps. FIG.
2D depicts the introduction of a second test M-Mer oligonucleotide
under selective hybridizing conditions and depicts the subsequent
parallel readout with the electronic imager. FIG. 2E is a schematic
depiction of an inverse DNA chip with multiple copies of target
N-mer oligonucleotides.
[0014] FIG. 3 is a schematic example showing the extraction of 16
bits of hybridization information sufficient for uniquely
determining the 6 bits of sequence information of the target
3-mer.
[0015] FIG. 4 is a schematic of an M-mer oligonucleotide interposed
with an R-mer double stranded spacer.
[0016] FIG. 5 is a table of M-mer library size as a function of
M.
[0017] FIG. 6A is a graphical representation of information content
in bits per hybridization trial as a function of M where N is fixed
at 100.
[0018] FIG. 6B is a table of optimal values of M for various values
of target oligonucleotide length N.
[0019] FIG. 6C is a table of maximum and optimal target
oligonucleotide sizes N for various values of test oligonucleotide
length M.
[0020] FIG. 7 schematically depicts an example of a procedure for
recovering sequence information.
[0021] FIG. 8 is a flowchart depicting an algorithm for recovering
sequence information.
DETAILED DESCRIPTION OF THE INVENTION
[0022] An inverse gene chip is known in the field and has been
employed in the technique termed "sequencing by synthesis" (see
Quake et al., PNAS 100:3960; Solexa Corp.). FIG. 1A illustrates an
exemplary technique for producing a gene chip. A cell 10 is lysed
and its genomic DNA separated. In many cases (other than for
single-stranded DNA viruses) the DNA will be in double-stranded
form. The DNA may be dehybridized either using heat (melting) or
chemically or enzymatically to isolate a single strand of the
genomic DNA. The genomic DNA is then cut into oligonucleotides of
average size N. In addition to other techniques known in the
literature, one can create a chimera consisting of an
oligonucleotide coupled to a nuclease. The chimera will bind
periodically to the DNA and cut locally, thus processing the single
stranded genomic DNA into a series of oligonucleotide fragments 20.
We refer to the genomic oligonucleotide fragments 20 as target
fragments. These target fragments are then immobilized on a support
by standard 3' or 5' coupling chemistry to form an inverse gene
chip 30. FIG. 1B is a schematic close-up view of an inverse gene
chip 30 consisting of K target oligonucleotides 20 of average
length N nucleotides. K is varied based on the amount of nucleic
acid sequence information desired and the average length N of the
nucleotides; for maximum throughput, however, K is selected to
approach the resolving capacity of the imaging system used.
[0023] FIGS. 2A-2E schematically illustrate the steps involved in
obtaining the data required to extract de novo sequence information
from DNA. FIG. 2A depicts the initial condition before incubation
with test M-Mers. FIG. 2B depicts introduction of a test M-Mer
oligonucleotide 25 with fluorescent dye under selective hybridizing
conditions and subsequent parallel readout with a conventional
electronic imager 40. Electronic imager 40 can incorporate a lens
45 to image inverse gene chip 30 or can operate without a lens, in
which case electronic imager 40 is located in close proximity to
inverse gene chip 30. FIG. 2C shows the conditions after a
subsequent dehybridization and wash step; the oligonucleotide
fragments return to the unbound state as previously shown in FIG.
2A. FIG. 2D depicts the introduction of a second test M-Mer
oligonucleotide 25 under selective hybridizing conditions and
subsequent parallel readout with electronic imager 40. Readout
involves detecting the presence or absence of a fluorescent signal
for each position of inverse gene chip 30 associated with a target
oligonucleotide 20 and recording the presence or absence of a
fluorescent signal, generally in a computer-readable medium. The
steps of FIGS. 2C and 2D are repeated with subsequent M-mer
oligonucleotides until sufficient information is extracted.
[0024] Referring to FIG. 2E, it may be convenient in certain cases
not to require single molecule hybridization detection. In this
case a plurality of co-located (i.e. located in physical proximity
to each other) copies of target oligonucleotides 20 can be used.
Such multiple co-located target oligonucleotides 20 can be readily
generated by, for example, local PCR such as that enabled by
rolling circle polymerization.
[0025] FIG. 3 is a schematic example showing the extraction of 16
bits of hybridization information sufficient for uniquely
determining the 6 bits of sequence information of the target 3-mer
20. There are $4^{2}$ (sixteen) possible 2-mer test oligonucleotides 25,
each of which is sequentially incubated with the target 3-mer 20,
in this case the nucleotide sequence C-G-A. Only in cases 8 and 15
is hybridization successful; the positive and negative
hybridization outcomes of all sixteen exposure events are detected
and recorded. If, for example, one were to write a bit sequence
consisting of 0's in the case of no hybridization and 1's in the
case of successful hybridization for each of the sixteen possible
test oligonucleotides 25, then one would have the bit sequence
0000000100000010. This bit sequence, representing all possible
hybridization trials from the 2-mer library contains approximately
16 bits of information (I explain below why this is only an
approximate information measure), which is larger than the
information content of the target 3-mer sequence, CGA, which
contains $\log_2(4^{3}) = 6$ bits of information. Thus, in most
cases a 2-mer library can uniquely decode a 3-mer target sequence.
In contrast, a palindromic sequence such as CGC cannot be uniquely
decoded using only a 2-mer library because both it and the sequence
GCG bind the same set of 2-mers (namely CG and GC). Such cases can
be disambiguated by applying M-mer libraries of different size
coupled with the ability to determine the number of test
oligonucleotides bound to a given target oligonucleotide 20. For
instance, in this case, applying a 1-mer library disambiguates CGC
from GCG, as 2 G's would bind to the former and only one would bind
to the latter.
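The bookkeeping of the FIG. 3 example can be sketched in Python. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, hybridization is treated as ideal and error-free, and each test 2-mer is labeled by the target subsequence it would pair with (as the text does when it speaks of CG and GA). The library ordering T, C, A, G is an assumption, chosen because it places the two hits in cases 8 and 15; the source does not state the ordering used in the figure.

```python
from itertools import product

# Assumed library ordering; the source does not state the order used in FIG. 3.
BASE_ORDER = "TCAG"

def library(m, bases=BASE_ORDER):
    """All possible M-mers, labeled by the target subsequence they would match."""
    return ["".join(p) for p in product(bases, repeat=m)]

def trial_bits(target, m):
    """One bit per library member: 1 if it finds a perfect match on the target."""
    return "".join("1" if probe in target else "0" for probe in library(m))

# Bit sequence for the 3-mer target C-G-A probed with the full 2-mer library.
print(trial_bits("CGA", 2))

# CGC and GCG contain the same 2-mers, so their bit sequences are identical.
print(trial_bits("CGC", 2) == trial_bits("GCG", 2))
```

The sketch records only presence or absence of a signal; counting how many test oligonucleotides bind, which the text proposes for separating CGC from GCG, is not modelled.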
[0026] The procedure for determining sequence information about
target oligonucleotide 20 from the test bit sequence requires
ordering in an overlapping manner the test oligonucleotides that
successfully hybridized. Thus we know from the test data that the
sequence of target oligonucleotide 20 has both a CG and a GA with
the only overlap at G, giving the unique sequence of CGA. The
section discussing FIGS. 7 and 8 details the procedure and
flowchart algorithm for determining sequences from test
hybridization data.
[0027] Referring to FIG. 4, there are areas of certain genomes with
long repeat sequences, longer than the sizes of target
oligonucleotides 20 that can be conveniently handled by a typical
M-mer library. In such cases the M interrogating nucleotides of
test oligonucleotide 25 may be interspaced with a spacer of length
R as depicted in FIG. 4. This allows M-mer 25 to bridge larger
target oligonucleotides 20. By combining data obtained with M-mer
libraries containing different spacer values R, full sequence data
can be obtained. The spacer may consist of double stranded
nucleotide sequences as shown in FIG. 4 which do not interact with
target oligonucleotide 20, or it may contain a minimally
interacting single stranded sequence, such as a sequence comprising
one or more abasic furans, or a sequence to which is bound another
hybridization blocker such as a sequence-specific protein binder or
a sequence-specific zinc finger complex. Exemplary spacers can, for
example, be constructed by incorporating one of the following
moieties during synthesis of a test oligonucleotide 25:
3,6,9-trioxaundecane-1,11-diisocyanate;
5'-O-dimethoxytrityl-1',2'-dideoxyribose-3'-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite;
3-(4,4'-dimethoxytrityloxy)propyl-1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite;
9-O-dimethoxytrityl-triethyleneglycol,
1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite;
12-(4,4'-dimethoxytrityloxy)dodecyl-1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite;
or 18-O-dimethoxytritylhexaethyleneglycol,
1-[(2-cyanoethyl)-(N,N-diisopropyl)]-phosphoramidite.
Alternatively, the target oligonucleotides themselves can have an
R-mer cut out of them, where R corresponds to a length of the
target oligonucleotide known to have repetitive or uninformative
nucleotide positions.
[0028] FIG. 5 is a table of the size of the M-mer library of test
oligonucleotides as a function of M assuming the nucleotide at any
position is chosen from a possible set of 4. The size of the
library is simply the number of all possible oligonucleotides of
length M which for the case of h different types of nucleotides is
given by $h^{M}$. For most natural systems h=4 (i.e. adenine,
guanine, cytosine, and either thymine or uracil). For many
applications it will be useful to deliver test oligonucleotides to
the inverse gene chip array by means of microfluidics. Practical
considerations of delivering test oligonucleotides, time to deliver
them as well as time and cost to generate the library coupled with
the number of elements in a single molecule hybridization detection
system make library sizes of M 8 a good choice.
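The growth of the library with M can be tabulated in a few lines of Python. This is a minimal sketch; the function name and the range of M printed are illustrative choices, not values taken from FIG. 5.

```python
def library_size(m, h=4):
    """Number of distinct test oligonucleotides of length m over h base types."""
    return h ** m

# Print library size for a few illustrative values of M.
for m in range(1, 9):
    print(m, library_size(m))
```

Each additional base multiplies the number of test oligonucleotides, and therefore the number of hybridization cycles for an exhaustive run, by h.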
[0029] FIGS. 6A-6C relate to the information content of
hybridization experiments of an M-mer test library on an N-mer
target oligonucleotide. Knowledge of the information content is a
guide to appropriate and preferred values for M in order to obtain
full de novo sequence information from an N-mer.
[0030] The probability of fully hybridizing a random M-mer
somewhere on a given N-mer is given by
$$P(M,N) = 1 - \left[1 - (1/h)^{M}\right]^{N-M+1} \qquad \text{Eq. 1}$$
[0031] where for the case of 4 nucleotides we have h=4.
[0032] The information content in bits extractable per test
hybridization is given by
$$I(M,N) = -P \log_2 P - (1-P) \log_2 (1-P), \quad P = P(M,N) \qquad \text{Eq. 2}$$
[0033] as derived from analysis used to analyze the statistics of
the information content from a series of weighted coin flips as is
known in the art. Thus the total number of bits extractable from
the $h^{M}$ possible hybridization trials is given by
$$I(M,N)\, h^{M} \qquad \text{Eq. 3}$$
[0034] The information content in bits of an N-mer oligonucleotide
is given by
$$\log_2\left(h^{N}\right) \qquad \text{Eq. 4}$$
[0035] In order to obtain full de novo sequence information of a
given target N-mer oligonucleotide we thus require that:
$$I(M,N) \times t \geq \log_2\left(h^{N}\right) \qquad \text{Eq. 5}$$
[0036] where t is the number of hybridization trials noting that a
maximum of $t = h^{M}$ independent trials are possible.
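The quantities in Eqs. 1 to 5 can be evaluated numerically with a short Python sketch. The function names are hypothetical. The per-trial information is taken to be the binary entropy of the hit probability of Eq. 1; that form is an assumption based on the description of a weighted coin flip. The trial count returned is the smallest t that meets the requirement that trial information equal or exceed target information, without enforcing the cap on the number of independent trials.

```python
import math

def hybridization_probability(m, n, h=4):
    """Eq. 1: chance that a random M-mer fully matches somewhere on an N-mer."""
    return 1.0 - (1.0 - (1.0 / h) ** m) ** (n - m + 1)

def bits_per_trial(m, n, h=4):
    """Assumed form of Eq. 2: binary entropy of the hit / no-hit outcome."""
    p = hybridization_probability(m, n, h)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def target_bits(n, h=4):
    """Eq. 4: information content of an N-mer, log2(h**n)."""
    return n * math.log2(h)

def trials_needed(m, n, h=4):
    """Smallest t with bits_per_trial * t >= target_bits (cap t <= h**m not enforced)."""
    return math.ceil(target_bits(n, h) / bits_per_trial(m, n, h))

# Example: 3-mer test oligonucleotides on a 7-mer target.
m, n = 3, 7
print(round(bits_per_trial(m, n), 2), target_bits(n), trials_needed(m, n))
```

For a 3-mer on a 7-mer the per-trial value comes out close to 0.4 bits against 14 bits in the target. The trial count is an average-case estimate and shifts with how the per-trial value is rounded.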
[0037] FIG. 6A is a plot of the information content in bits per
hybridization trial as a function of M. As shown there is an
optimal M (labeled M*) which yields the highest information content
per trial. Too small an M or too large an M yields non-optimal
information content per trial.
[0038] FIG. 6B gives M* for various values of the target
oligonucleotide length N. Since one is limited to $h^{M}$
hybridization trials, an ideal strategy is to employ all possible
M* trial hybridizations and then to employ a sufficient number of
M*+1 trials (i.e. trials using test oligomers of length M*+1) such
that the information content of the hybridization trials equals or
exceeds the information content of the target oligonucleotide which
one desires to sequence. Accordingly, when the average or maximum
length of the target oligonucleotides 20 to be sequenced is less
than 50, M is advantageously 2-4; when N is 50-99, M is
advantageously 3-5; when N is 100-999, M is advantageously 4-6;
when N is 1,000-10,000, M is advantageously 5-8; and when N is
greater than 10,000, M is advantageously 7-13.
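A search for M* can be sketched in Python as follows. This is an illustration under stated assumptions: the per-trial information is modelled as the binary entropy of the Eq. 1 hit probability, the names and the search bound max_m are hypothetical, and the target lengths printed are arbitrary examples rather than entries from FIG. 6B.

```python
import math

def bits_per_trial(m, n, h=4):
    """Binary entropy of the Eq. 1 hit probability (assumed form of Eq. 2)."""
    p = 1.0 - (1.0 - (1.0 / h) ** m) ** (n - m + 1)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def optimal_m(n, max_m=16):
    """M* for a target of length n: the M giving the most bits per trial."""
    candidates = range(1, min(n, max_m) + 1)
    return max(candidates, key=lambda m: bits_per_trial(m, n))

# Arbitrary example target lengths, one per range discussed in the text.
for n in (25, 75, 100, 5000, 50000):
    print(n, optimal_m(n))
```

The per-trial information peaks where a hit and a miss are about equally likely, which is why M* grows slowly (roughly logarithmically) with N.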
[0039] FIG. 6C is a table calculating $N_{max}$, defined as the
largest N such that $I(M,N)\,4^{M} > 2N$ for various values M of the
size of the test oligonucleotides. $N_{max}$ represents the largest
possible N that can be sequenced by means of the data set derived
from sequentially testing the hybridization of all possible M-mers
on the target N-mer. Although target oligonucleotides are most
often no more than 10,000 nucleotides in length (e.g. tens,
hundreds, or thousands of nucleotides as shown in FIG. 6C), their
length is in fact limited only by the constraints of Eq. 5.
[0040] An additional parameter, $N_{Optimal}$, which is defined as
the N attaining MAXIMUM[$I(M,N)\,4^{M} - 2N$], is also calculated. $N_{Optimal}$ is
the value of N which maximizes the difference in information
content between the hybridization trials and the target
oligonucleotide of size N.
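Both parameters can be found by a direct scan over N, sketched below in Python. The names are hypothetical, and the per-trial information is again assumed to be the binary entropy of the Eq. 1 hit probability. Because a single trial yields at most one bit, no N above 4^M / 2 can satisfy the inequality, which bounds the scan.

```python
import math

def bits_per_trial(m, n, h=4):
    """Binary entropy of the Eq. 1 hit probability (assumed form of Eq. 2)."""
    p = 1.0 - (1.0 - (1.0 / h) ** m) ** (n - m + 1)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def surplus_bits(m, n, h=4):
    """Bits from all h**m trials minus the bits in an N-mer (2N when h = 4)."""
    return bits_per_trial(m, n, h) * h ** m - n * math.log2(h)

def scan_limit(m, h=4):
    """Upper bound on N: each trial gives at most one bit."""
    return int(h ** m / math.log2(h))

def n_max(m, h=4):
    """Largest N for which the exhaustive M-mer data set still exceeds the target."""
    feasible = [n for n in range(m, scan_limit(m, h) + 1) if surplus_bits(m, n, h) > 0]
    return max(feasible) if feasible else None

def n_optimal(m, h=4):
    """N that maximizes the surplus of trial information over target information."""
    return max(range(m, scan_limit(m, h) + 1), key=lambda n: surplus_bits(m, n, h))

print(n_max(5), n_optimal(5))
```

For the 5-mer library the optimum found this way lies in the neighbourhood of 200 bases, with the maximum target length a few hundred bases higher.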
[0041] Referring to FIGS. 7 and 8 an example procedure and
flowchart algorithm for inverting the series of hybridization
trials and deriving de novo sequence information are given.
[0042] Referring to FIG. 7, the example given is the de novo
sequence determination of a target 7-mer 20 using 3-mer test
oligonucleotides 25. According to Eq. 4 the information content of
a 7-mer is 14 bits. According to Eq. 2 the information content per
hybridization trial of a 3-mer on a 7-mer is 0.4 bits and thus
according to Eq. 5 on average at least 40 test oligonucleotides 25
would need to be trial hybridized to target oligonucleotide 20 to
derive the de novo sequence information of target oligonucleotide
20.
[0043] FIG. 7 gives an example for the general procedure for
inverting the sequence showing how both positive hybridization
outcomes and negative hybridization outcomes are used to determine
sequence. In step 1 a number of test hybridizations as discussed
above are carried out. Whether the test oligonucleotide 25
hybridizes or does not is recorded. In step 2 the test
oligonucleotide sequences which are recorded to have successfully
hybridized are themselves (or as shown in the figure their
complement) arrayed into possible target oligonucleotides. This
arraying operation is generally carried out in silico although it
is possible to carry it out using the techniques of DNA computing
known in the literature. As shown in FIG. 7 possible target
oligonucleotides which were created with the largest number of
overlaps (in this case the most instances of two-base overlaps as
opposed to single-base overlaps) are the most likely. In the case
of this example there are two possible target oligonucleotides
ACGGCCA and CCACGGC which both have been assembled from the set of
positively hybridizing test oligonucleotides 25 with two instances
of double base overlap and one instance of single base overlap. The
two possibilities thus cannot be disambiguated by means of positive
hybridizing data alone. In the final step negative hybridizing
data, namely data for the set of test oligonucleotides 25 which did
not successfully bind to the target oligonucleotide, are used to
eliminate possible target oligonucleotide sequences and thus to
isolate the correct target oligonucleotide sequence. In this
example, data showing that test oligonucleotide GTG does not bind
target oligonucleotide 20 are used to exclude CCACGGC as a possible
target oligonucleotide (otherwise GTG would be expected to
hybridize to the CAC portion of CCACGGC), which identifies
ACGGCCA uniquely as the desired target oligonucleotide 20.
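The two-stage inversion described for FIG. 7 can be sketched in Python. Several things here are assumptions made for illustration: the function names are hypothetical; the positive set ACG, CGG, GGC, CCA (written as target subsequences, i.e. complements of the hybridizing test oligonucleotides) is not listed in the text and was chosen because it assembles into the two candidates the text names; and strand orientation is ignored, with a base-wise complement standing in for hybridization.

```python
from itertools import permutations

# Base-wise complement; strand orientation is ignored in this sketch.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def merge(left, right):
    """Append right to left using the longest suffix/prefix overlap (at least 1 base)."""
    for k in range(min(len(left), len(right)) - 1, 0, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None  # no overlap at all

def assemble(fragments, target_len):
    """Try every ordering of the fragments; keep assemblies of the expected length."""
    candidates = set()
    for order in permutations(fragments):
        seq = order[0]
        for frag in order[1:]:
            seq = merge(seq, frag)
            if seq is None:
                break
        if seq is not None and len(seq) == target_len:
            candidates.add(seq)
    return candidates

def eliminate(candidates, non_binding_probes):
    """Drop candidates containing a site that a non-binding probe would have matched."""
    excluded_sites = [probe.translate(COMPLEMENT) for probe in non_binding_probes]
    return {c for c in candidates if not any(site in c for site in excluded_sites)}

# Hypothetical positive set, expressed as target subsequences.
positives = ["ACG", "CGG", "GGC", "CCA"]
candidates = assemble(positives, 7)
print(sorted(candidates))

# Negative outcome from the text: test oligonucleotide GTG did not bind.
print(eliminate(candidates, ["GTG"]))
```

Trying every ordering is only practical for a handful of fragments; it is used here to keep the overlap logic visible, and the overlap-weighted ranking mentioned in the text is replaced by a simple length filter.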
[0044] FIG. 8 shows a representative series of steps for deriving
de novo sequence information from a target oligonucleotide using
the set of data (both positive and negative) from a sequence of
test oligonucleotide hybridization trials. Those steps include:
[0045] selecting an M-Mer test oligonucleotide library as described
above;
[0046] carrying out sufficient sequential hybridization trials
(each including test hybridization of an M-mer test oligonucleotide
to an N-mer target oligonucleotide, measuring the positive or
negative outcome of the hybridization trial, and, before any
subsequent test hybridization, dehybridizing and washing) of
members of the M-mer test oligonucleotide library with the N-mer
target oligonucleotide such that the hybridization trial
information content equals or exceeds the information content of
the desired target oligonucleotide;
[0047] using positive hybridization outcomes to construct the
possible set of target oligonucleotide sequences weighted by the
highest degree of overlap; and
[0048] using negative hybridization outcomes to eliminate members
of the possible set of target oligonucleotide sequences.
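The steps listed above can be simulated end to end in Python for short targets. This is a minimal sketch under idealized assumptions, not the algorithm of FIG. 8: hybridization is treated as error-free, the whole M-mer library is used, test oligonucleotides are labeled by the target subsequence they would pair with, and the overlap-based construction is replaced by brute-force enumeration of every N-mer, which is feasible only for small N. All names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def run_trials(target, m):
    """Simulate one trial per M-mer: True if its site occurs on the target."""
    return {"".join(p): "".join(p) in target for p in product(BASES, repeat=m)}

def consistent_targets(outcomes, n):
    """All N-mers that agree with every positive and every negative outcome."""
    hits = {probe for probe, bound in outcomes.items() if bound}
    m = len(next(iter(outcomes)))
    matches = []
    for p in product(BASES, repeat=n):
        seq = "".join(p)
        sites = {seq[i:i + m] for i in range(n - m + 1)}
        # Equality enforces both directions: every hit is present on the
        # candidate, and no site on the candidate belongs to a negative trial.
        if sites == hits:
            matches.append(seq)
    return matches

# A 7-mer probed with all 64 3-mers.
print(consistent_targets(run_trials("ACGGCCA", 3), 7))

# A 3-mer that a 2-mer library alone cannot resolve.
print(consistent_targets(run_trials("CGC", 2), 3))
```

When more than one sequence survives, as in the second call, additional data such as a library of a different size would be needed, as discussed for FIG. 3.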
[0049] This set of steps determines the sequence of each individual
target oligonucleotide on an inverse DNA chip. In order to get the
sequence of the source nucleotide polymer from which the target
oligonucleotides were derived (e.g. of an entire genome) a second
inverse gene chip from the same genome can be created in which the
target oligonucleotides are cut in a different place, preferably
far from the original cuts. Target oligonucleotide sequences from
both inverse DNA chips may now be assembled in a manner similar to
that shown in FIG. 7. Again whole genome sequences created from
target oligonucleotide assemblies with maximum overlap are given
highest ranking.
Example 1
Human Genome
[0050] As an example of repetitive single molecule hybridization,
we consider the human genome, which includes about 3 billion
nucleotide base pairs. Referring to FIG. 6C we choose the 5-Mer
test oligonucleotide library with an optimal N-mer size of
~200 bases. This leads to $15 \times 10^{6}$ spots (i.e.
K=$15 \times 10^{6}$) equivalent to the number of imaging elements
available in current solid state imaging devices or a small array
of such imaging devices. Assuming cycle times
(hybridization+measurement+wash) of 5 minutes per test
oligonucleotide yields a total run time of (1024 test
oligonucleotides in the 5-mer test oligonucleotide library)*(5
minutes per cycle)=5,120 minutes, or roughly 3.5 days.
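The arithmetic of this example can be checked with a few lines of Python. The constant names are illustrative; the values are those given in the text, and the spot count assumes each stretch of the genome is represented once on the chip.

```python
GENOME_BASES = 3_000_000_000   # about 3 billion bases, from the text
TARGET_LENGTH = 200            # approximate optimal N for the 5-mer library
LIBRARY_SIZE = 4 ** 5          # 1024 test oligonucleotides of length 5
CYCLE_MINUTES = 5              # hybridization + measurement + wash

# Number of target oligonucleotides (spots) needed to cover the genome once.
spots = GENOME_BASES // TARGET_LENGTH

# Total run time if every library member is cycled once.
run_days = LIBRARY_SIZE * CYCLE_MINUTES / (60 * 24)

print(spots, round(run_days, 2))
```

Because every spot is read in parallel on each cycle, the run time depends on the library size and cycle time, not on the number of spots.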
Example 2
De Novo Synthesis of Nucleic Acids
[0051] Regarding another application of the sequencing technology
described herein, there has recently emerged considerable interest
in the synthesis of long sequences of nucleic acids (See Smith,
Hutchison, Pfannkoch, and Venter et al., "Generating a Synthetic
Genome by Whole Genome Assembly: phiX174 Bacteriophage from
Synthetic Oligonucleotides" PNAS 2003). One approach is the
chemical synthesis of oligonucleotides followed by their ligation.
Unfortunately the error rate for such an approach is equivalent to
the chemical synthetic error rate of ~1:$10^{2}$. One
approach is to sequence the ligated product and then perform site
directed mutagenesis to correct errors. But as described above this
approach is hampered by throughput and cost of current
electrophoresis-based sequencing methodologies.
[0052] The present invention facilitates accurate synthesis of
nucleic acids according to steps comprising:
[0053] a) synthesizing a nucleotide sequence de novo;
[0054] b) obtaining sequence information about the synthesized
nucleotide sequence by repetitive single molecule hybridization as
described above; and
[0055] c) using site directed mutagenesis to correct at least one
error in the synthesized nucleotide sequence.
[0056] The terms and expressions employed herein are used as terms
of description and not of limitation, and there is no intention, in
the use of such terms and expressions, of excluding any equivalents
of the features shown and described or portions thereof, but it is
recognized that various modifications are possible within the scope
of the invention claimed.
* * * * *