U.S. patent application number 11/580583 was filed with the patent office on 2006-10-13 and published on 2007-02-15 for hybrid model for DNA probe design and validation using nonlinear and linear regression methods.
Invention is credited to Brian S. Giles, Nicholas M. Sampas, Peter Tsang.
Application Number | 20070037201 11/580583 |
Document ID | / |
Family ID | 36128383 |
Filed Date | 2006-10-13 |
United States Patent
Application |
20070037201 |
Kind Code |
A1 |
Sampas; Nicholas M. ; et
al. |
February 15, 2007 |
Hybrid model for DNA probe design and validation using nonlinear
and linear regression methods
Abstract
Methods and systems for selecting oligonucleotide probes for use
in microarray applications are provided herein. The described
methods use a combination of measured probe performance and
predicted probe performance to select probes. Nucleic acid arrays
containing probes selected by the described methods are described.
Also included are algorithms for performing the subject methods
recorded on computer-readable media and computational systems for
analysis.
Inventors: |
Sampas; Nicholas M.; (San
Jose, CA) ; Tsang; Peter; (San Francisco, CA)
; Giles; Brian S.; (Fremont, CA) |
Correspondence
Address: |
AGILENT TECHNOLOGIES INC.
INTELLECTUAL PROPERTY ADMINISTRATION, M/S DU404
P.O. BOX 7599
LOVELAND
CO
80537-0599
US
|
Family ID: |
36128383 |
Appl. No.: |
11/580583 |
Filed: |
October 13, 2006 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10996323 | Nov 23, 2004 | |
11580583 | Oct 13, 2006 | |
Current U.S.
Class: |
435/6.14 ;
702/20 |
Current CPC
Class: |
B01J 2219/00722
20130101; G16B 25/00 20190201; B01J 2219/00695 20130101; B01J
2219/00689 20130101; C12Q 1/6883 20130101 |
Class at
Publication: |
435/006 ;
702/020 |
International
Class: |
C12Q 1/68 20060101
C12Q001/68; G06F 19/00 20060101 G06F019/00 |
Claims
1. A method for selecting an oligonucleotide probe for use on a
microarray, comprising: generating two or more candidate
oligonucleotide probes; analyzing the two or more candidate probes
with one or more metrics that indicate probe performance to obtain
an individual probe score for each metric; combining the individual
probe score for each metric into a single combined score for the
probe; and selecting the probe with a combined score closest to an
optimal score value for use on a microarray, wherein the optimal
score value is the score at, or nearest to, the highest end of a
numerical scale of probe scores.
2. The method of claim 1, wherein the optimal score value is about
1.0 on a scale of probe scores ranging from 0.0 to 1.0.
3. The method of claim 1, wherein the optimal score value is about
100 on a scale of probe scores ranging from 50 to 100.
4. The method of claim 1, wherein generating a candidate set of
oligonucleotide probes comprises: selecting one or more target
sequences within a region of interest; and tiling subsequences of
each target sequence across each region of interest to generate the
candidate set of potential probes.
5. The method of claim 4, further comprising: generating a large
set of potential probes by tiling the target sequences in single
base steps across the region of interest; and applying pairwise
reduction to reduce the number of probes by a factor of greater
than about 2 and less than about 1000.
6. The method of claim 1, wherein the metrics used to analyze the
candidate probes comprise direct metrics, indirect metrics, in
silico metrics, or combinations thereof.
7. The method of claim 6, wherein direct metrics used to analyze
the candidate probes comprise the changes in probe response based
on experimentally measured quantities, further comprising known
changes in copy number of a target molecule.
8. The method of claim 6, wherein indirect metrics used to analyze
the candidate probes comprise changes in predicted probe response
resulting from experimentally measured quantities for a target
molecule.
9. The method of claim 6, wherein indirect metrics used to analyze
the candidate probes comprise changes in predicted probe response
measured using empirical relationships based on direct responses
from other probe-target molecule duplexes.
10. The method of claim 6, wherein in silico metrics used to
analyze the candidate probes comprise changes in probe response
based on calculated quantities for a target molecule.
11. The method of claim 6, wherein in silico metrics used to
analyze the candidate probes comprise changes in probe response
measured using empirical relationships based on direct responses
from other probe-target molecule duplexes.
12. The method of claim 1, wherein analyzing the candidate probes
with one or more metrics to obtain individual probe scores further
comprises: calculating the slope for each candidate probe; plotting
the slope against the corresponding value for each of the metrics
to obtain a trend curve; and fitting the trend curve with a
polynomial function with order n to generate an individual probe
score.
13. The method of claim 12, wherein the order n of the polynomial
function ranges from n=1 to n=20.
14. The method of claim 1, wherein combining individual probe
scores for each metric to obtain a combined score comprises adding
or averaging the probe score for each metric.
15. The method of claim 1, wherein combining the individual probe
scores to obtain a combined score comprises fitting the scores with
a linear additive multivariate fitting function.
16. The method of claim 15, wherein combining the individual probe
scores further comprises fitting measured slope responses for a
well-characterized training data set to a change in copy
number.
17. The method of claim 1, wherein combining the individual probe
scores to obtain a combined score comprises fitting the scores with
a linear multiplicative curve-fitting function.
18. The method of claim 17, wherein combining the individual probe
scores further comprises: combining metrics in each category using
a linear model to obtain intermediate scores; and multiplying
together the intermediate scores to generate the combined
score.
19. The method of claim 1, wherein combining the individual probe
scores further comprises synthetically modifying the combined score
to obtain probes with more robust performance, the synthetic
modification further comprising: generating a candidate set of
probes; applying pairwise reduction to reduce the number of probes
in the candidate set; calculating the slope for each probe;
plotting the slope against the corresponding value for each of the
metrics to obtain a trend curve; fitting the trend curve to
generate a measured probe score; replacing the fitted trend curve
with a synthetic curve; and using the synthetic curve to generate a
predicted score for each probe.
20. The method of claim 1, wherein selecting the probe for use in a
microarray application comprises: combining experimentally measured
scores with predicted scores to obtain a combined score value for
the probe; and selecting the probe with a combined score value
closest to an optimal score value, wherein the optimal score value
is the score at, or nearest to, the highest end of a numerical
scale of probe scores.
21. A computer-readable medium having recorded thereon a program
that selects a probe for use in microarray applications according
to the method of claim 1.
22. A computational analysis system comprising the
computer-readable medium according to claim 21.
23. A method of fabricating a nucleic acid microarray, comprising
producing at least two different oligonucleotide probes on a
microarray substrate, wherein at least one of the two different
oligonucleotide probes is a probe selected according to the method
of claim 1.
24. A nucleic acid microarray produced according to the method of
claim 23.
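The tiling step recited in claims 4 and 5 can be pictured with a short sketch. This is only an illustration under stated assumptions, not the implementation of the claimed method: the function name `tile_probes`, the probe length, and the example sequence are all hypothetical, and the pairwise reduction step of claim 5 is not shown.

```python
def tile_probes(target, probe_length, step=1):
    """Return every subsequence of `probe_length` bases from `target`,
    starting a new candidate every `step` bases (step=1 gives the
    single-base tiling of claim 5)."""
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(target) - probe_length
    # If the target is shorter than the probe, the range is empty.
    return [target[i:i + probe_length] for i in range(0, last_start + 1, step)]


# Hypothetical 8-base target tiled with 5-base candidates.
candidates = tile_probes("ACGTACGT", 5)
```

Real probes would be far longer than five bases; the short values simply keep the example easy to follow.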
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This Application is a continuation-in-part of, and claims
priority to, U.S. patent application Ser. No. 10/996,323, filed
Nov. 23, 2004.
BACKGROUND
[0002] Comparative genomic hybridization (CGH) and location
analysis are important applications that allow scientists to make
biological measurements involving genomics and cytogenetics, and to
study the expression and regulation of genes in biological systems.
Both CGH and location analysis entail quantifying or measuring
changes in copy number of genomic sequences in biological or
medical samples. CGH is particularly important in the study of
developmental biology and the causes of cancer, and offers great
potential in the diagnosis of cancer and developmental diseases.
Recently, cDNA microarrays
have been used for CGH studies. An oligo-array based approach has
several substantial advantages over other technologies, in that it
allows the designer to position the probes anywhere within the
genomic or polynucleotide sequence of interest. The probes can be
placed at any set of loci or positioned to span any genomic
intervals of interest at whatever density is commensurate with the
real-estate or area available on the microarray (in terms of number
of features). The copy numbers of DNA over the genomic regions of
interest can be evaluated by analyzing the hybridization of target
sequences to the surface-bound probes. The oligonucleotide probe
approach also offers the flexibility of focusing in on regions
within exons or introns of expressed sequences, including
pre-microRNAs or intergenic regions and regulatory regions for
location analysis, as well as any desirable admixture of the
aforementioned.
[0003] Probes that work well on microarrays for gene expression
generally do not work well for CGH arrays and are not appropriate
for location analysis arrays. Probes for CGH and location analysis
arrays require a different optimization of their properties than
probes used for gene expression. Most notably, these differences
relate to the substantially greater complexity of the labeled
target mixture in CGH and location analysis compared with
expression analysis, which demands greater specificity of the
probes in discriminating against non-specific binding to competing
targets. For comparison, the total number of
nucleotide bases in the human transcriptome is approximately
10.sup.8, while the human genome contains over 3.times.10.sup.9
bases. Additionally, probes selected for gene expression come from
within message sequences that are transcribed as RNA, i.e. exons,
while probes for CGH must be complementary, or nearly so, to
contiguous targets selected from within a genome sequence, e.g.,
introns and/or exons.
[0004] Despite great interest in CGH technology, methods for
evaluating probes in silico and also empirically for use in this
technology are limited. A rigorous method would be to measure
signals (e.g. ratios of signals) from each polynucleotide in
controlled experiments with test samples containing known copy
numbers for each probe sequence on the array. For example, a method
used by several probe designers for measuring array performance for
sets of polynucleotides specific for sequences on the X chromosome,
is to use a series of cell lines with known variable copies of the
X chromosome for CGH experiments. See, e.g., M. T. Barrett et al.,
Proc. Natl. Acad. Sci. USA 101(51): 17765-70 (2004). These cell
lines (X series) are homogeneous and contain intact copies (e.g. 1
to 5) of the X chromosome permitting a rigorous measure of the
relationship between copy number and signal intensities for each X
chromosome specific polynucleotide on an array. However, cell lines
containing known variable numbers of intact copies of most other
chromosomes are not readily available. Furthermore, the aberrant X
series cell lines are slow growing and can spontaneously vary in
ploidy under standard culturing conditions. Such methods are
complex and time-consuming and cannot readily be used to assay the
relationship between the hybridization signal of polynucleotides on
an array and the genomic copy number of sequences from each
chromosome in a cell.
SUMMARY
[0005] This disclosure relates to methods for predicting probe
performance for microarray applications. The methods described
herein optimize probe performance by measuring the probe response
in a model system and applying that response to predict the
response for probes that have not yet been experimentally
tested.
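As a rough illustration of this idea, the sketch below fits a trend to responses measured for tested probes and evaluates that trend for an untested probe. It is a minimal sketch, not the method of the disclosure: an ordinary least-squares straight line stands in for the polynomial trend fit recited in the claims, and the function names `fit_line` and `predict`, the use of duplex melting temperature as the metric, and all numeric values are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def predict(coeffs, x):
    """Evaluate the fitted trend at metric value x."""
    a, b = coeffs
    return a + b * x


# Hypothetical training data: a metric value for each tested probe
# (here a duplex melting temperature) and its measured response.
measured_tm = [60.0, 65.0, 70.0, 75.0, 80.0]
measured_response = [0.50, 0.60, 0.70, 0.80, 0.90]

trend = fit_line(measured_tm, measured_response)
# Predicted response for a probe that has not been tested.
predicted = predict(trend, 72.0)
```

A straight line is used only to keep the example dependency-free; a higher-order fit would follow the same measure, fit, and predict pattern.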
[0006] Methods for selecting an oligonucleotide probe with the best
performance in a microarray application are provided herein. In an
aspect, the methods include generating candidate probes and
screening the probes with one or more metrics or parameters that
can predict or classify probe performance. The resulting probe
scores for each metric are combined using various statistical
methods, and the probe with the best combined score is selected. In
aspects, the methods described herein can be modified to obtain
probes within a very narrow range of predicted properties for the
probes.
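The combine-and-select step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the disclosed algorithm: scores are assumed to lie on a 0.0 to 1.0 scale with 1.0 as the optimal value, the combination is a simple weighted average (one of several combinations the claims mention), and the names `combine_scores` and `select_best`, the metric labels, and the score values are hypothetical.

```python
def combine_scores(scores, weights=None):
    """Weighted average of the per-metric scores for one probe.

    `scores` maps a metric name to a score assumed to lie in [0.0, 1.0].
    With no weights given, every metric counts equally.
    """
    if not scores:
        raise ValueError("at least one metric score is required")
    if weights is None:
        weights = {metric: 1.0 for metric in scores}
    total_weight = sum(weights[metric] for metric in scores)
    return sum(scores[m] * weights[m] for m in scores) / total_weight


def select_best(candidates, optimal=1.0):
    """Return the probe id whose combined score is closest to `optimal`."""
    return min(
        candidates,
        key=lambda probe: abs(optimal - combine_scores(candidates[probe])),
    )


# Hypothetical per-metric scores for three candidate probes.
candidates = {
    "probe_A": {"melting_temperature": 0.9, "homology": 0.6},
    "probe_B": {"melting_temperature": 0.8, "homology": 0.9},
    "probe_C": {"melting_temperature": 0.5, "homology": 0.7},
}
best = select_best(candidates)
```

Keeping the combination and the selection in separate functions means a different combining rule can be swapped in without changing how the best probe is chosen.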
[0007] Algorithms for performing the described methods recorded on
a computer-readable medium, as well as computational analysis
systems that include the same, are also provided. The disclosure also
includes nucleic acid arrays with oligonucleotide probes whose
performance is predicted using the subject methods, and methods
using such arrays.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1 is a flowchart generally depicting the methods
described herein.
[0009] FIG. 2 is a flowchart showing a method to generate candidate
probes whose performance is predicted by the methods described
herein.
[0010] FIG. 3 is a flowchart depicting methods for calculating
slope and combining measured slope with calculated parameters to
predict probe performance.
[0011] FIG. 4 shows a distribution of measured slope against duplex
melting temperature.
[0012] FIG. 5 shows a plot of smoothed measured slope against
duplex melting temperature.
[0013] FIG. 6 shows a trend curve with a fitted curve of a
12th-order polynomial vs. duplex melting temperature.
[0014] FIG. 7 shows a graph of the fitted slopes vs. measured
slopes for combined metrics.
[0015] FIG. 8 shows measured slope plotted against duplex melting
temperature trend curves and various synthetic replacement
curves.
[0016] FIG. 9 shows T.sub.m distributions resulting from the use of
combined synthetic and empirical scores.
DETAILED DESCRIPTION
[0017] Various embodiments will be described in detail with
reference to the drawings, wherein like reference numerals
represent like parts throughout the several views. Reference to
various embodiments does not limit the scope of the claims attached
hereto. Additionally, any examples set forth in this specification
are not intended to be limiting and merely set forth some of the
many possible embodiments for the claims.
[0018] Unless defined otherwise, all technical and scientific terms
used herein have the same meaning as commonly understood by one of
ordinary skill in the art. Although any methods, devices and
materials similar or equivalent to those described herein can be
used in practice or testing, the methods, devices and materials are
now described.
[0019] All publications and patent applications in this
specification are indicative of the level of ordinary skill in the
art and are incorporated herein by reference in their
entireties.
[0020] In this specification and the appended claims, the singular
forms "a," "an," and "the" include plural reference, unless the
context clearly dictates otherwise. Unless defined otherwise,
all technical and scientific terms used herein have the same
meaning as commonly understood by one of ordinary skill in the
art.
[0021] Definitions
[0022] The term "genome" refers to all nucleic acid sequences
(coding and non-coding) and elements present in or originating from
a single cell, or from each cell type in an organism, or from a
virus. The term "genome" encompasses all sources of genomic
sequences or elements known to those of skill in the art. The term
genome also applies to any naturally occurring or induced variation
of these sequences that may be present in a mutant or disease
variant of any virus or cell type. These sequences include, but are
not limited to, those involved in the maintenance, replication,
segregation, and higher order structures (e.g. folding and
compaction of DNA in chromatin and chromosomes), or other
functions, if any, of the nucleic acids as well as all the coding
regions and their corresponding regulatory elements needed to
produce and maintain each particle, cell or cell type in a given
organism.
[0023] For example, the human genome consists of approximately
3.times.10.sup.9 base pairs of DNA organized into distinct
chromosomes. The genome of a normal diploid somatic human cell
consists of 22 pairs of autosomes (chromosomes 1 to 22) and either
chromosomes X and Y (males) or a pair of X chromosomes (female) for
a total of 46 chromosomes. A genome of a cancer cell may contain
variable numbers of each chromosome in addition to deletions,
rearrangements and amplification of any subchromosomal region or
DNA sequence.
[0024] The terms "nucleic acid" and "polynucleotide" are used
interchangeably herein to describe a polymer of any length, e.g.,
greater than about 10 bases, greater than about 100 bases, greater
than about 500 bases, greater than 1000 bases, usually up to about
10,000 or more bases composed of nucleotides, e.g.,
deoxyribonucleotides or ribonucleotides, or compounds produced
synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902
and the references cited therein) which can hybridize with
naturally occurring nucleic acids in a sequence specific manner
analogous to that of two naturally occurring nucleic acids, e.g.,
can participate in Watson-Crick base pairing interactions.
[0025] The terms "ribonucleic acid" and "RNA" as used herein mean a
polymer composed of ribonucleotides.
[0026] The terms "deoxyribonucleic acid" and "DNA" as used herein
mean a polymer composed of deoxyribonucleotides.
[0027] The term "oligonucleotide" as used herein denotes single
stranded nucleotide multimers of from about 10 to 100 nucleotides
and up to 200 nucleotides in length. Oligonucleotides are usually
synthetic and, in many embodiments, are under 50 nucleotides in
length.
[0028] The term "oligomer" is used herein to indicate a chemical
entity that contains a plurality of nucleotide monomers, i.e., a
nucleotide multimer. As used herein, the terms "oligomer" and
"polymer" are used interchangeably, as it is generally, although
not necessarily, smaller "polymers" that are prepared using the
functionalized substrates of the invention, particularly in
conjunction with combinatorial chemistry techniques. Examples of
oligomers and polymers include polydeoxyribonucleotides (DNA),
polyribonucleotides (RNA), other nucleic acids that are
C-glycosides of a purine or pyrimidine base, polypeptides
(proteins), polysaccharides (starches, or polysugars), and other
chemical entities that contain repeating units of like chemical
structure.
[0029] The term "sample" as used herein relates to a material or
mixture of materials, typically, although not necessarily, in fluid
form, containing one or more components of interest. Samples
include, but are not limited to, biological samples obtained from
natural biological sources, such as cells or tissue. The samples
may also be derived from tissue biopsies and other clinical
procedures.
[0030] The terms "nucleoside" and "nucleotide" are intended to
include those moieties that contain not only the known purine and
pyrimidine bases, but also other heterocyclic bases that have been
modified. Such modifications include methylated purines or
pyrimidines, acylated purines or pyrimidines, alkylated riboses or
other heterocycles. In addition, the terms "nucleoside" and
"nucleotide" include those moieties that contain not only
conventional ribose and deoxyribose sugars, but other sugars as
well. Modified nucleosides or nucleotides also include
modifications on the sugar moiety, e.g., wherein one or more of the
hydroxyl groups are replaced with halogen atoms or aliphatic
groups, or are functionalized as ethers, amines, or the like.
[0031] The phrase "surface-bound polynucleotide" refers to a
polynucleotide that is immobilized on a surface of a solid
substrate, where the substrate can have a variety of
configurations, e.g., a sheet, bead, or other structure. In certain
embodiments, the collections of oligonucleotide probe elements
employed herein are present on a surface of the same planar
support, e.g., in the form of an array.
[0032] The phrase "labeled population of nucleic acids" refers to
a mixture of nucleic acids that are detectably labeled, e.g.,
fluorescently labeled, such that the presence of the nucleic acids
can be detected by assessing the presence of the label. If a labeled
population of nucleic acids is "made from" a chromosome sample, the
chromosome sample is usually employed as a template for making the
population of nucleic acids.
[0033] A "biological model system," or "model system," as provided
herein, refers to a system for which a quantitative response in a
microarray system can be expected with certainty (i.e. a system
wherein a response can be detected or measured). Exemplary model
systems include, without limitation, biological systems, such as
titration series with different RNA samples at different
concentrations, samples with known genomic aberrations, samples to
be used for comparative genomic hybridization experiments, etc. The
biological model systems are used to perform microarray
experiments, to validate probes designed for microarray
applications, to obtain sets of training data for statistical
analysis, etc.
[0034] The term "array" encompasses the term "microarray" and
refers to an ordered array presented for binding to nucleic acids
and the like.
[0035] An "array," includes any two-dimensional or substantially
two-dimensional (as well as a three-dimensional) arrangement of
spatially addressable regions bearing nucleic acids, particularly
oligonucleotides or synthetic mimetics thereof, and the like. Where
the arrays are arrays of nucleic acids, the nucleic acids may be
adsorbed, physisorbed, chemisorbed, or covalently attached to the
arrays at any point or points along the nucleic acid chain.
[0036] In those embodiments where an array includes two or more
features immobilized on the same surface of a solid support, the
array may be referred to as addressable. An array is "addressable"
when it has multiple regions of different moieties (e.g., different
oligonucleotide sequences) such that a region (i.e., a "feature" or
"spot" of the array) at a particular predetermined location (i.e.,
an "address") on the array will detect a particular sequence. Array
features are typically, but need not be, separated by intervening
spaces. In the case of an array in the context of the present
application, the "population of labeled nucleic acids" will be
referenced as a moiety in a mobile phase (typically fluid), to be
detected by "surface-bound polynucleotides" which are bound to the
substrate at the various regions. These phrases are synonymous with
the arbitrary terms "target" and "probe", or "probe" and "target",
respectively, as they are used in other publications.
[0037] A "scan region" refers to a contiguous (preferably,
rectangular) area in which the array spots or features of interest,
as defined above, are found or detected. Where fluorescent labels
are employed, the scan region is that portion of the total area
illuminated from which the resulting fluorescence is detected and
recorded. Where other detection protocols are employed, the scan
region is that portion of the total area queried from which
resulting signal is detected and recorded. For the purposes of this
invention and with respect to fluorescent detection embodiments,
the scan region includes the entire area of the slide scanned in
each pass of the lens, between the first feature of interest, and
the last feature of interest, even if there are intervening areas
that lack features of interest.
[0038] The term "substrate" as used herein refers to a surface upon
which marker molecules or probes, e.g., an array, may be adhered.
Glass slides are the most common substrate for biochips, although
fused silica, silicon, plastic, flexible web and other materials
are also suitable.
[0039] An "array layout" refers to one or more characteristics of
the features, such as feature positioning on the substrate, one or
more feature dimensions, and an indication of a moiety at a given
location. "Hybridizing" and "binding", with respect to nucleic
acids, are used interchangeably. The terms "hybridizing,"
"hybridizing specifically to," and "specific hybridization" as used
herein, refer to the binding, duplexing, or hybridizing of a
nucleic acid molecule preferentially to a particular nucleotide
sequence under stringent conditions.
[0040] The term "stringent assay conditions" as used herein refers
to conditions that are compatible to produce binding pairs of
nucleic acids, e.g., probes and targets, of sufficient
complementarity to provide for the desired level of specificity in
the assay while being incompatible to the formation of binding
pairs between binding members of insufficient complementarity to
provide for the desired specificity. The term stringent assay
conditions refers to the combination of hybridization and wash
conditions.
[0041] "Stringent hybridization" and "stringent hybridization
wash conditions" in the context of nucleic acid hybridization
(e.g., as in array, Southern or Northern hybridizations) are
sequence dependent, and are different under different environmental
parameters. Stringent hybridization conditions that can be used to
identify nucleic acids within the scope of the invention can
include, e.g., hybridization in a buffer comprising 50% formamide,
5.times.SSC, and 1% SDS at 42.degree. C., or hybridization in a
buffer comprising 5.times.SSC and 1% SDS at 65.degree. C., both
with a wash of 0.2.times.SSC and 0.1% SDS at 65.degree. C.
Exemplary stringent hybridization conditions can also include a
hybridization in a buffer of 40% formamide, 1 M NaCl, and 1% SDS at
37.degree. C., and a wash in 1.times.SSC at 45.degree. C.
Alternatively, hybridization to filter-bound DNA in 0.5 M
NaHPO.sub.4, 7% sodium dodecyl sulfate (SDS), 1 mM EDTA at
65.degree. C., and washing in 0.1.times.SSC/0.1% SDS at 68.degree.
C. can be employed. Yet additional stringent hybridization
conditions include hybridization at 60.degree. C. or higher and
3.times.SSC (450 mM sodium chloride/45 mM sodium citrate) or
incubation at 42.degree. C. in a solution containing 30% formamide,
1 M NaCl, 0.5% sodium sarcosine, 50 mM MES, pH 6.5. Those of
ordinary skill will readily recognize that alternative but
comparable hybridization and wash conditions can be utilized to
provide conditions of similar stringency.
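The SSC strengths quoted above can be converted to salt concentrations with simple arithmetic. The sketch below assumes the conventional composition of 1.times.SSC as 150 mM sodium chloride and 15 mM sodium citrate, which is consistent with the 3.times.SSC figures given in this paragraph; the function name `ssc_concentrations` is hypothetical.

```python
# Assumed composition of 1x SSC in millimolar; consistent with the
# 3x SSC values (450 mM / 45 mM) quoted in the text.
NACL_MM_PER_X = 150.0
CITRATE_MM_PER_X = 15.0


def ssc_concentrations(fold):
    """Return (sodium chloride mM, sodium citrate mM) for SSC at `fold` strength."""
    if fold < 0:
        raise ValueError("fold strength cannot be negative")
    return fold * NACL_MM_PER_X, fold * CITRATE_MM_PER_X


# 3x SSC, as used in one of the hybridization conditions above.
nacl_3x, citrate_3x = ssc_concentrations(3)
```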
[0042] In certain embodiments, the stringency of the wash
conditions determines whether a nucleic acid is specifically
hybridized to a probe. Wash conditions used to identify nucleic
acids may include, e.g.: a salt concentration of about 0.02 M at pH
7 and a temperature of about 20.degree. C. to about 40.degree. C.;
or, a salt concentration of about 0.15 M NaCl at 72.degree. C. for
about 15 minutes; or, a salt concentration of about 0.2.times.SSC
at a temperature of about 30.degree. C. to about 50.degree. C. for
about 2 to about 20 minutes; or, the hybridization complex is
washed twice with a solution with a salt concentration of about
2.times.SSC containing 1% SDS at room temperature for 15 minutes
and then washed twice by 0.1.times.SSC containing 0.1% SDS at
37.degree. C. for 15 minutes; or, equivalent conditions. Stringent
conditions for washing can also be, e.g., 0.2.times.SSC/0.1% SDS at
42.degree. C. See Sambrook, Ausubel, or Tijssen (cited below) for
detailed descriptions of equivalent hybridization and wash
conditions and for reagents and buffers, e.g., SSC buffers and
equivalent reagents and conditions.
[0043] A specific example of stringent assay conditions is rotating
hybridization at 65.degree. C. in a salt based hybridization buffer
with a total monovalent cation concentration of 1.5 M (e.g., as
described in U.S. patent application Ser. No. 09/655,482 filed on
Sep. 5, 2000, the disclosure of which is herein incorporated by
reference) followed by washes of 0.5.times.SSC and 0.1.times.SSC at
room temperature.
[0044] Stringent hybridization conditions may also include a
"prehybridization" of aqueous phase nucleic acids with
complexity-reducing nucleic acids to suppress repetitive sequences.
For example, certain stringent hybridization conditions include,
prior to any hybridization to surface-bound polynucleotides,
hybridization with Cot-1 DNA, or the like.
[0045] Stringent assay conditions are hybridization conditions that
are at least as stringent as the above representative conditions,
where a given set of conditions is considered to be at least as
stringent if substantially no additional binding complexes that
lack sufficient complementarity to provide for the desired
specificity are produced in the given set of conditions as compared
to the above specific conditions, where by "substantially no more"
is meant less than about 5-fold more, typically less than about
3-fold more. Other stringent hybridization conditions are known in
the art and may also be employed, as appropriate.
[0046] The term "mixture", as used herein, refers to a combination
of elements that are interspersed and not in any particular order.
A mixture is heterogeneous and not spatially separable into its
different constituents. Examples of mixtures of elements include a
number of different elements that are dissolved in the same aqueous
solution, or a number of different elements attached to a solid
support at random or in no particular order in which the different
elements are not especially distinct. In other words, a mixture is
not addressable. To be specific, an array of surface-bound
polynucleotides, as is commonly known in the art and described
below, is not a mixture of capture agents because the species of
surface-bound polynucleotides are spatially distinct and the array
is addressable. "Isolated" or "purified" generally refers to
isolation of a substance (compound, polynucleotide, protein,
polypeptide, chromosome, etc.) such that the substance
comprises the majority percent of the sample in which it resides.
Typically, in a sample, a substantially purified component comprises
50%, preferably 80%-85%, more preferably 90-95% of the sample.
Techniques for purifying polynucleotides, polypeptides and intact
chromosomes of interest are well-known in the art and include, for
example, ion-exchange chromatography, affinity chromatography,
sorting, and sedimentation according to density.
[0047] The terms "assessing" and "evaluating" are used
interchangeably to refer to any form of measurement, and include
determining if an element is present or not. The terms
"determining," "measuring," "assessing," and "assaying" are
used interchangeably and include both quantitative and qualitative
determinations. Assessing may be relative or absolute. "Assessing
the presence of" includes determining the amount of something
present, as well as determining whether it is present or
absent.
[0048] The term "using" has its conventional meaning, and, as such,
means employing, e.g., putting into service, a method or
composition to attain an end. For example, if a program is used to
create a file, a program is executed to make a file, the file
usually being the output of the program. In another example, if a
computer file is used, it is usually accessed, read, and the
information stored in the file employed to attain an end. Similarly,
if a unique identifier, e.g., a barcode, is used, the unique
identifier is usually read to identify, for example, an object or
file associated with the unique identifier.
[0049] If a surface-bound polynucleotide "corresponds to" a
chromosome, the polynucleotide usually contains a sequence of
nucleic acids that is unique to that chromosome. Accordingly, a
surface-bound polynucleotide that corresponds to a particular
chromosome usually specifically hybridizes to a labeled nucleic
acid made from that chromosome, relative to labeled nucleic acids
made from other chromosomes. Array features, because they usually
contain surface-bound polynucleotides, can also correspond to a
chromosome.
[0050] A "non-cellular chromosome composition", as will be
discussed in greater detail below, is a composition of chromosomes
synthesized by mixing pre-determined amounts of individual
chromosomes. These synthetic compositions can include selected
concentrations and ratios of chromosomes that do not naturally
occur in a cell, including any cell grown in tissue culture.
Non-cellular chromosome compositions may contain more than an
entire complement of chromosomes from a cell, and, as such, may
include extra copies of one or more chromosomes from that cell.
Non-cellular chromosome compositions may also contain less than the
entire complement of chromosomes from a cell.
[0051] A "probe" means a polynucleotide which can specifically
hybridize to a target nucleotide, either in solution or as a
surface-bound polynucleotide.
[0052] The term "validated probe" means a probe that has passed at
least one screening or filtering process in which experimental data
related to the performance of the probe was used as part of the
selection criteria.
[0053] "In silico" means those parameters that can be determined
without the need to perform any experiments, by using information
either calculated de novo or available from public or private
databases.
[0054] The term "duplex T.sub.m" refers to the melting temperature
of two oligonucleotides which have formed a duplex structure.
Duplex T.sub.m is calculated by a simple formula where each
matching GC pair gets a value of 2, and each matching AT pair gets
a value of 1. The sum of these approximate values gives the melting
temperature.
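As an illustration of the simple formula above, a minimal Python sketch follows. The function name duplex_tm_simple is hypothetical, and the sketch assumes a perfectly matched duplex, so that every G or C in the probe counts as a matching GC pair and every A or T as a matching AT pair.

```python
def duplex_tm_simple(probe: str) -> int:
    """Approximate duplex T.sub.m value using the simple rule above.

    Each matching GC pair contributes 2 and each matching AT pair
    contributes 1. Assumption: the probe is fully complementary to its
    target, so every base in the probe forms a matching pair.
    """
    weights = {"G": 2, "C": 2, "A": 1, "T": 1}
    total = 0
    for base in probe.upper():
        if base not in weights:
            raise ValueError(f"unexpected base: {base!r}")
        total += weights[base]
    return total
```

For example, under this rule the sequence GCAT would be assigned a value of 2 + 2 + 1 + 1.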
Approaches and Methods for Probe Selection
[0055] The present methods provide alternative and novel methods
and systems for designing probes for CGH and location analysis in
microarray applications that overcome the drawbacks of existing
microarray probe selection techniques. General methods that utilize
probe/target hybridization experiments and/or unique data analysis
techniques to identify and select nucleotide probe(s) targeting
polynucleotide fragments in a region of interest were described in
U.S. Patent Publication No. 2006/0110744. The methods described
herein provide statistical methods for combining and modifying
probe scores in order to achieve desired results, such as selecting
or designing probes with more robust probe performance, better
probe signal, etc.
[0056] The present description provides methods, systems and
computer readable media for identifying and selecting nucleic acid
probes for detecting a target with a nucleic acid probe array or
microarray. The methods comprise, in general terms: the selection
of genomic nucleotide ranges of interest, determining appropriate
target sequences for CGH and/or location analysis, generating
candidate probes specific for the target sequences and analyzing
candidate probes for specific probe properties by computational
and/or experimental processes to optimize probe selection and
reduce the number of probes to a value appropriate for placement on
a microarray.
[0057] The description also provides microarrays comprising probes
selected by the methods described herein. The microarrays comprise
a solid support and a plurality of surface bound probes, the
surface bound probes having very similar thermodynamic properties
as well as similar GC content. More specifically, a large portion
of the probes utilized in the microarrays of the invention have
duplex melting temperatures (T.sub.m) which are within a narrow
temperature range compared to the T.sub.m range of probes for other
microarray systems, such as arrays for gene expression.
[0058] The methods provided herein are particularly useful with
comparative genome hybridization microarrays, such as microarrays
based on the human or mouse genome. These methods permit more
cost-effective and efficient identification of gene regions or
sections which can be associated with human disease, points of
therapeutic intervention, and potential toxic side-effects of
proposed therapeutic entities.
[0059] In general terms, the methods for probe selection and
validation described herein comprise identifying probe properties
that can be determined a priori by the probe's sequence and the
sequence of the genome it is contained within, and may further
comprise expanding the set of properties from those that can be
determined a priori, to those that can be measured empirically
through simple experiments, such as self-self experiments. The
described methods may further comprise measuring the response of
candidate probes to a known stimulus, where the stimulus is
generated by a set of samples where the copy numbers for relatively
small subsets of the genome are altered in known ways.
[0060] In designing an array comprising high-performance probes
that comprehensively covers a whole genome (e.g., the human
genome), the entire genomic sequence must be searched when
generating specific candidate probes. This homology search is
potentially the most time-consuming part of the probe design
process. Ideally, a homology search would be the first part of the
process. However, because of the scale of the human genome, an
exhaustive search of all possible short oligo probes (<100 bases)
can take computation time on the scale of a CPU year (based on
ProbeSpec) on modern 3 GHz processors. This computation time can be
reduced
by any of a number of methods, most involving reducing the scale of
the search. For example, known highly repetitive sequences can be
removed by a process called RepeatMasking. Repeat-masked genomic
sequences are publicly available on the web (e.g. UCSC's
www.genomebrowser.org). Another approach is to reduce the number of
probe sequences being searched up-front. This can be done on the
basis of any known property of the probe, from thermodynamic
properties, such as duplex-Tm and hairpin free energy, to position
on the genome. The present description provides methods which apply
known probe information as a screening process to reduce the number
of probe sequences to be analyzed in a homology search, thus
reducing the computation time needed to identify appropriate probes
for a CGH based array.
[0061] The present systems, techniques, methods and computer
readable media also provide for streamlined workflow, since
researchers need only to prepare and process one microarray instead
of two or more per sample, with fewer steps in processing and
tracking required.
[0062] Further, greater reproducibility of results is provided for,
since all data for an entire genome is generated from a single
microarray, resulting in less variability in the data. When two or
more microarrays associated with the same sample are processed
separately, there are always questions of variability of the
experimental conditions used to process each microarray.
[0063] Designing a microarray involves determining the amount of
"real estate" (number of probes) that is available for the final
array. The array designer also determines the number of probes, or
"real estate," to use for specified regulatory regions and
intergenic regions, as well as the number of probes necessary to
adequately cover introns and exons of the chromosomes of interest.
Initially, a designer will generate 20 to 40 million candidate
probes and will need to filter the probes for certain probe
properties or parameters to obtain a final array with approximately
40,000 probes. In some embodiments of the methods of the invention,
intermediate arrays are manufactured with a redundancy of 3- or
4-fold over the number of probes selected for the final array;
these intermediate arrays are utilized to screen candidate probes
for certain probe properties by direct or indirect experimentation.
[0064] In many embodiments, the oligonucleotides (i.e. probes)
contained in the features of the invention have been designed
according to one or more particular parameters to be suitable for
use in a given application, where representative parameters
include, but are not limited to: length, melting temperature
(T.sub.m), non-homology with other regions of the genome,
hybridization signal intensities, kinetic properties under
hybridization conditions, etc., see e.g., U.S. Pat. No. 6,251,588,
the disclosure of which is herein incorporated by reference.
[0065] Standard hybridization techniques (using high stringency
hybridization conditions) are used to probe the subject array.
Suitable methods are described in references describing CGH
techniques (Kallioniemi et al., Science 258:818-821 (1992) and WO
93/18186). Several guides to general techniques are available,
e.g., Tijssen, Hybridization with Nucleic Acid Probes, Parts I and
II (Elsevier, Amsterdam 1993). For descriptions of techniques
suitable for in situ hybridizations see Gall et al., Meth. Enzymol.
21:470-480
(1981) and Angerer et al. in Genetic Engineering: Principles and
Methods (Setlow and Hollander, eds.), vol. 7, pp. 43-65 (Plenum
Press, New York 1985). See also U.S. Pat. Nos. 6,335,167;
6,197,501; 5,830,645; and 5,665,549; the disclosures of which are
incorporated herein by reference.
[0066] FIG. 1 shows a general description of the methods described
herein. In an aspect, as in operation 100, a candidate
oligonucleotide for a particular region of interest in a target
nucleic acid sequence is generated. The candidate probe is then
screened with one or more metrics or parameters that are predictive
of probe performance, as in operation 102, which yields a probe
score for each metric. The individual probe scores are then
combined to produce a combined score for the probe in operation
103. The probe with the best score is then selected, as in 104, for
a subsequent microarray application.
Methods for Selecting Oligonucleotide Probes
[0067] The methods described herein are directed to selection of
oligonucleotide probes for use in microarray applications. Two or
more candidate oligonucleotide probes are generated and analyzed
using one or more metrics that are indicative of probe performance.
An individual probe score is obtained with respect to each metric,
and these probe scores are then combined into a single score for
the probe. Probes with combined scores closest to an optimal score
value are selected as ideal or best probes (i.e. those probes which
are most suited to a particular microarray experiment, in terms of
ability to hybridize to the target sequences, reproducibility,
repeatability, etc). Probes may be scored on any numerical scale,
with the best probes having scores closest to the high end of the
numerical scale. For example, an optimal score value on a scale of
0.0 to 1.0 would be about 1.0. Similarly, on a numerical scale of
probe scores from 50 to 100, an optimal score value would be about
100.
[0068] In embodiments, two or more candidate probes are generated
by selecting one or more target sequences within a region of
interest and tiling subsequences of the target across the entire
region of interest to obtain a set of potential probes. In aspects,
the subsequences of the target sequences are tiled in single base
steps across the region of interest. This generates a large set of
potential probes, which is reduced to a manageable number (such as
greater than 2, but less than 1000, for example) by pairwise
filtering.
[0069] Once the candidate probes are generated, they are analyzed
using metrics that are indicative of probe performance, and each
probe is assigned a probe score. These metrics include direct
metrics, indirect metrics and in silico metrics. Direct metrics
comprise the changes in probe response based on experimentally
measured quantities, such as change in copy number. Indirect
metrics used comprise changes in predicted probe response resulting
from experimentally measured quantities for a target molecule, or
changes in predicted probe response measured using empirical
relationships based on direct responses from other probe-target
molecule duplexes. The in silico metrics comprise changes in the
probe response based on calculated quantities for a target
molecule, or changes in probe response measured using empirical
relationships based on direct responses from other probe-target
molecule duplexes. To obtain a probe score from the application of
one or more metrics, the slope for each candidate probe is
calculated and plotted against the corresponding value for each
metric to generate a trend curve. The trend curve is then fitted
with a polynomial function to obtain the probe score for each
metric. The order of the polynomial can range from 1 to 20.
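The trend-curve fit described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the described methods: the function names fit_polynomial and score_from_trend are hypothetical, the fit is an ordinary least-squares fit solved through the normal equations, and the probe score is taken to be the fitted curve evaluated at the probe's metric value.

```python
def fit_polynomial(metric_values, slopes, order):
    """Least-squares polynomial fit of slope versus metric value.

    Returns coefficients c[0..order] of c[0] + c[1]*x + ... + c[order]*x**order.
    Hypothetical helper; solves the normal equations directly.
    """
    n = order + 1
    if len(metric_values) < n:
        raise ValueError("need at least order + 1 data points")
    # Normal equations A c = b with A[r][k] = sum(x**(r + k)).
    a = [[sum(x ** (r + k) for x in metric_values) for k in range(n)]
         for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(metric_values, slopes))
         for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def score_from_trend(coeffs, metric_value):
    """Evaluate the fitted trend curve at one metric value (Horner's rule)."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * metric_value + c
    return result
```

Solving the normal equations directly keeps the sketch dependency-free, but it becomes numerically ill-conditioned at high polynomial orders; for orders toward the upper end of the stated range, a library least-squares routine would likely be a better choice.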
[0070] Individual probe scores for each metric are then combined,
by adding or averaging the individual scores, to give a combined
probe score. The individual probe scores can also be fitted with a
linear additive multivariate fitting function, or a linear
multiplicative fitting function to give the combined score. In
aspects, combined probe scores are obtained by combining the
metrics in each category using a linear model to obtain
intermediate scores. The intermediate scores are then multiplied
together to give the combined score. Individual probe scores can
also be combined by fitting the measured slope responses for a
training data set with a change in copy number.
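The combination options described above can be sketched as follows. This is an illustrative sketch only: the function names, the category labels, and the weights are hypothetical, and in practice the weights of the linear model would come from fitting a training data set.

```python
from math import prod


def average_score(scores):
    """Combine individual probe scores by simple averaging."""
    return sum(scores) / len(scores)


def linear_score(scores, weights, intercept=0.0):
    """Linear additive combination: intercept + sum(weight * score)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return intercept + sum(w * s for w, s in zip(weights, scores))


def hybrid_score(scores_by_category, weights_by_category):
    """Combine the metrics within each category with a linear model to get
    intermediate scores, then multiply the intermediate scores together."""
    intermediates = [
        linear_score(scores, weights_by_category[category])
        for category, scores in scores_by_category.items()
    ]
    return prod(intermediates)
```

A usage example with made-up scores: `hybrid_score({"direct": [0.9], "in_silico": [0.8, 0.6]}, {"direct": [1.0], "in_silico": [0.5, 0.5]})` multiplies the direct intermediate score by the in silico intermediate score. Because the intermediate scores are multiplied, a probe that scores poorly in any one category is penalized in the combined score.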
[0071] The combined scores can be synthetically modified to give
probes with more robust predicted performance (i.e. a probe which
more effectively mimics probe performance in an actual experiment).
In aspects, the synthetic modification comprises generating a large
candidate set of probes, and reducing the number of probes by
pairwise reduction. A slope is calculated for each probe and
plotted against the corresponding slope for each metric to generate
a trend curve. This trend curve is fitted to give a measured probe
score, and the fitted trend curve is replaced using a synthetic
curve and a predicted score is obtained. The predicted scores are
combined with experimentally measured scores to give the combined
score value for a particular probe. The probe (or probes) with a
combined score value closest to the optimal score value is selected
for microarray applications.
[0072] Probe selection is performed using a computational analysis
system which comprises a computer-readable medium with a program
that selects probes for microarray applications as in the methods
described herein. The methods can be used to produce or fabricate a
microarray comprising at least two probes selected according to the
methods described herein.
Generating Candidate Probes
[0073] In an embodiment, a candidate oligonucleotide probe, or set
of probes, for a particular region of interest in a target nucleic
acid sequence is generated, as in operation 100, an expanded
representation of which is shown in FIG. 2. Briefly, operation 100
begins with the selection or identification 200 of target nucleic
acid sequences within a genome. The candidate probe or candidate
set of probes is any probe or set of probes within (or capable of
hybridizing to) the target sequence or genome including, without
limitation, genes, exons, mRNA, a region of interest within the
target sequence, probes used or selected for previous experiments,
upstream or downstream regulatory regions of genes, methylated
regions, regions associated with putative SNPs or CNPs, sequence
aberrations known to be associated with particular disease states
or phenotypes, histones or binding sites in the sequence for other
molecules, etc. Potential target sequences of the nucleotide sample
of interest are identified, filtered and reduced to a set of
appropriate target sequences for CGH and/or location analysis. The
potential target sequences are filtered by size, number of
repeat-masked bases and/or GC-content. Target sequences are also
filtered and reduced in number by eliminating repetitive target
sequences. Another parameter which can be used to filter target
sequence is to eliminate potential target sequences which comprise
a restriction enzyme cut site. By limiting the size of the set of
target sequences, the computational time needed to generate and
analyze the candidate probes is decreased.
[0074] Generating a set of candidate probes comprises selecting
subsequences of the selected target nucleic acid sequences across
genomic regions of interest, as in operation 202. Probes are tiled
in uniform or moderately uniform spacing, in steps as small as a
single base, or as large as megabases, through the genome, targeted
region of the genome, or target nucleic acid sequence. For example,
probes may be tiled in steps of 50-100 bases across the entire
genome, but the methods described herein are not dependent on the
scale of the tiling. The smaller the scale of the tiling, the larger
the number of potential candidate probes forming a plurality of
candidate probes. The number of potential candidate probes over an
interval should exceed the number expected to be selected over that
same interval. Candidate probes are selected from the plurality of
candidate probes based on parameters, for example, a narrow range
of a specific parameter such as probe length. Probe parameters
utilized to select candidate probes from a plurality of potential
candidate probes may include, but are not limited to, target
specificity, thermodynamic properties, expression and association
with genes, homology and kinetic properties.
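The tiling step described above can be sketched as follows. The function name tile_candidates and its parameters are hypothetical, and the sketch tiles a single target sequence held in memory as a string.

```python
def tile_candidates(target: str, probe_length: int, step: int = 1):
    """Return (start, subsequence) pairs tiled across the target.

    `step` is the tiling interval in bases: 1 gives single-base steps,
    larger values give sparser tiling and fewer candidate probes.
    """
    if probe_length <= 0 or step <= 0:
        raise ValueError("probe_length and step must be positive")
    last_start = len(target) - probe_length
    return [
        (start, target[start:start + probe_length])
        for start in range(0, last_start + 1, step)
    ]
```

With single-base steps, a target of length N yields N - probe_length + 1 candidates, which is why smaller tiling steps produce a larger plurality of candidate probes.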
[0075] In some embodiments, the probe parameters include, but are
not limited to, a range of T.sub.m of about 0.25.degree. C. to
about 5.degree. C., a T.sub.m value of about 65.degree. C. to about
85.degree. C., a nucleotide length of 20 to 200 nucleotides, a
range of % GC content of less than 10%, and/or a % GC content of
about 30-40%. When length of the probe is a criterion, probes have
a nucleotide length of about 20 nucleotides to about 200
nucleotides, usually about 40 nucleotides to 100 nucleotides, and
more usually 50 to 65 nucleotides.
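A filter on two of the parameters listed above, probe length and % GC content, might look like the following sketch. The function names and default thresholds are illustrative assumptions taken from the ranges stated above; the T.sub.m criteria are omitted because they depend on the melting-temperature model used.

```python
def gc_fraction(probe: str) -> float:
    """Fraction of bases in the probe that are G or C."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def passes_filters(probe: str, min_len: int = 20, max_len: int = 200,
                   min_gc: float = 0.30, max_gc: float = 0.40) -> bool:
    """Keep a probe only if its length and GC content fall in range.

    The length check runs first, so an empty probe is rejected before
    gc_fraction would divide by zero.
    """
    return (min_len <= len(probe) <= max_len
            and min_gc <= gc_fraction(probe) <= max_gc)
```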
[0076] Typically, 30 to 60-mer candidate probes are selected, but
the candidate probes may range from about 20-mer to about 200-mer.
Typically, probes may be selected over spacings of approximately
half the length of the probe. For example, for a 60-mer candidate
probe, 30 bp intervals would be selected over the entire genome, or
regions of interest. Usually, the repeat-masked regions are
skipped, as they are usually insufficiently unique to be of use.
Also, if the assay involves the use of a restriction digest, the
restriction sites within the sequence for the restriction enzymes
specified within the protocol are also typically excluded, or the
probes containing those sites are subsequently excluded from the
candidate set.
[0077] The large number of candidate probes generated by this
process is then reduced to a smaller set of candidate probes using
a reduction method, such as the pairwise reduction method, for
example, as shown in operation 204. The pairwise reduction method
evaluates a pair of candidate probes for a probe property and
scores the probes within the pair against each other according to
the probe property analyzed. The pairwise reduction process reduces
the number of probes by a factor of X, where X may be any number
that significantly reduces the number of probes. For example, the
pairwise reduction process may reduce the number of probes by a
factor of 5, 10, 15, 20, 25, 30 and so on. The number of candidate
probes can also be reduced by any other method or algorithm that
uses the position of the probe and a combined or overall score to
discriminate between probes.
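One plausible reading of a pairwise reduction is sketched below. This is an assumption-laden illustration, not the method of the referenced publication: the function name pairwise_reduce is hypothetical, each probe is represented as a (position, score) tuple, neighbouring probes are paired by position, and the higher-scoring probe of each pair is kept until the set is small enough.

```python
def pairwise_reduce(probes, target_count):
    """Reduce a list of (position, score) probes by pairwise comparison.

    In each round, probes adjacent in position are compared in pairs and
    the higher-scoring probe of each pair survives. Rounds repeat until
    at most `target_count` probes remain.
    """
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    survivors = sorted(probes)  # order by genomic position
    while len(survivors) > target_count:
        next_round = []
        for i in range(0, len(survivors) - 1, 2):
            left, right = survivors[i], survivors[i + 1]
            next_round.append(left if left[1] >= right[1] else right)
        if len(survivors) % 2 == 1:
            # An unpaired probe at the end advances unchanged.
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors
```

Each round roughly halves the set, so k rounds give a reduction factor of about 2 to the power k. Pairing by position keeps the surviving probes spread across the region rather than clustered around the highest-scoring locations.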
[0078] Following the reduction process, the candidate probes are
optionally experimentally validated, as in 206. The experimental
validation process involves experiments which measure the
properties of a probe that provide a good indication of the probe's
performance (i.e., suitability) in a microarray experiment, in the
absence of direct experiments or data. Experimentally measurable
probe properties include, without limitation, raw signal intensity,
reproducibility of signal intensity, dye bias, susceptibility to
non-specific binding, etc. The processes for probe selection,
pairwise reduction and experimental validation are described in
detail in U.S. Patent Publication No. 2006/0110744 and WO
2004/059845, the disclosures of which are incorporated herein by
reference.
[0079] Once at least two candidate probes are selected, the
candidate probes are analyzed with one or more metrics that predict
or indicate probe performance to generate a probe score for each
metric.
Analyzing Candidate Probes with Metrics
[0080] As shown in FIG. 1, embodiments of the methods described
herein include a process 102 for analyzing candidate probes with
one or more probe performance parameters or metrics. The term
"parameter" or "metric" refers to a quantity or property that is
indicative of a probe's performance in a microarray experiment.
Three types of metrics are used with the methods described herein:
direct metrics, indirect metrics, and in silico metrics.
[0081] Direct metrics are those that directly measure probe
performance. Direct metrics measure performance by observing the
change in probe response, as measured by the signal or ratio of
signals, or log of the ratio of signals, with respect to a
reference sample, with a change in copy number of the target
nucleic acid sequence or region of interest, using multiple
hybridization experiments on multiple arrays, with the conditions
maintained as similar as possible between arrays. The change in
probe response is measured as a change in signal in a differential
model system, such as a dye-swap or dye-flip experiment, for
example, where DNA copy number is changed in known or predictable
ways. For example, in an experiment to evaluate the performance of
probes on the X chromosome using a normal pair of female-male
samples, the probe is expected to produce a 2:1 signal ratio, as
there are twice as many X chromosome target molecules in the female
sample as in the male sample, and there are no Y chromosome target
molecules in the female sample. Similarly, other differential model
systems such as cell lines with well-known chromosomal aberrations,
extra or missing chromosomes or regions of chromosomes, will also
produce copy number changes in a predictable manner. Biological
model systems that do not exist in nature, but are created using
cell sorting techniques, or by mixing collections of BACs, cDNAs or
other biologically-derived DNA samples can also be used for
measuring probe performance.
[0082] Indirect metrics are measured parameters or metrics that are
indicative of probe performance. Indirect metrics comprise
observing the change in probe response in relatively simple
experiments using a non-differential model. Indirect metrics (or
indirect empirical parameters) include signal strength (in one or
both channels from which signal is measured in a microarray
experiment), dye bias (the LogRatio associated with a dye label
rather than the LogRatio associated with copy number), differential
signals obtained from experiments under various conditions
(multiple annealing times for probes to target nucleic acid
sequences, wash times, wash temperatures, etc.), for example. For
dye bias measurements, the LogRatios for dye-flip experiments are
averaged, rather than subtracted as they would be to calculate the
effective LogRatios for copy number changes. Experiments using
indirect metrics are considered non-differential, because, for most
of the genome, the changes in probe response do not reflect changes
in copy number (i.e. no change in copy number is expected, between
the sample and a reference sequence). Rather, indirect metrics
predict the performance of a probe in terms of sensitivity and
specificity. For example, a measured signal that is too strong
could represent cross-hybridization of the probes to multiple
regions of the genome. On the other hand, a measured signal that is
too weak may be dominated by noise, or may be susceptible to
changes in the condition or the quantity of DNA.
[0083] In silico metrics are calculated parameters that are
indicative of probe performance. In silico metrics are those
metrics that are calculated in the absence of any experimental
data. These metrics are derived from the sequence of the probes
themselves, and from the sequences of the genome, or the
transcriptome of the organism being studied. In silico metrics for
each candidate probe are obtained from the sequences directly,
based on known laws of physics and chemistry, such as those related
to thermodynamics. In silico metrics used in the methods described
herein include, without limitation, duplex melting temperature
(T.sub.m or DuplexTm) between a probe and its complementary
sequence, maximal subsequence duplex melting temperature of a probe
(MaxSubSeqTm; the maximal T.sub.m for any subsequence of length M
within a longer sequence of length N), hairpin thermodynamic
properties of the probe (i.e., hairpin melting temperature, Gibbs
free energy, number of bases within turns, loops, stems, etc.), and
sequence complexity (where complexity refers to the number of bases
in the probe that are contained within short simple repeats, such
as homopolymers, dimers, trimers, tetramers, etc., for example).
For example, with the methods described herein, complexity
typically refers to the number of bases contained within repeat
units with six nucleotides, i.e. hexamers, but the methods
described herein can generally be employed with repeats with any
number of nucleotides.
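The MaxSubSeqTm metric described above can be sketched with a sliding window. This is a minimal sketch under stated assumptions: the function name max_subseq_tm is hypothetical, and the melting temperature of each subsequence is approximated with the simple rule given in the definition of duplex T.sub.m (2 per GC pair, 1 per AT pair) rather than a full thermodynamic model.

```python
def max_subseq_tm(probe: str, window: int) -> int:
    """Maximal simple T.sub.m value over all subsequences of length M.

    `window` is M, the subsequence length; the probe has length N >= M.
    Assumption: per-base weights of 2 (G, C) and 1 (A, T) stand in for a
    real melting-temperature calculation. Other characters raise KeyError.
    """
    weights = {"G": 2, "C": 2, "A": 1, "T": 1}
    values = [weights[base] for base in probe.upper()]
    if not 0 < window <= len(values):
        raise ValueError("window must be between 1 and the probe length")
    current = sum(values[:window])
    best = current
    # Slide the window one base at a time, updating the running sum.
    for i in range(window, len(values)):
        current += values[i] - values[i - window]
        best = max(best, current)
    return best
```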
[0084] The direct, indirect and in silico metrics are described in
detail in U.S. Patent Publication No. 2006/0110744, the disclosure
of which is incorporated herein by reference. The analytical
process involves calculating a slope, or the responsiveness of a
probe to a change in copy number of its complementary target
sequence, for each candidate probe, as in 300, based on the
response of the probe in an experiment with respect to a particular
metric. The slope for each of a set of probes can be measured using
a model system where the relative copy numbers of the target
molecules for each probe in the set are known in each sample. The
measured slope is calculated for each probe within the set, for
example, X-chromosome probes, in the case of male and female
samples. The slope can be estimated most simply by calculating the
ratio of the signals in two samples with two different copy numbers
of targets. It can also be the ratio of log-signals to the log-copy
numbers, or the ratio of log ratios of signals. In a more complex
system, a number of samples can be hybridized, where each pair of
samples has a different set of copy numbers for each respective set
of probes. For example, in a male-female model system, some sample
pairs can be male referenced to female, others can be female
referenced to male, and still others can be male referenced to
male, or female referenced to female. This provides multiple data
points for each probe. The slope for a two-color assay is then
calculated by means of a linear regression of the ratios (of
signals) for each probe as a function of the ratios of known target
copy numbers in each sample. The y-intercept provided by such
regression is also useful, as it provides the dye-bias. By analogy,
in a single-color assay, the regression is between the measured
signals and the known copy number.
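The two-color regression described above can be sketched as follows. This is a minimal illustration, not the implementation of the described methods: the function name is hypothetical, and the use of log2 ratios for both the measured signals and the known copy numbers is an assumption made so that the intercept can be read as a dye bias.

```python
def fit_slope_and_intercept(known_log_ratios, measured_log_ratios):
    """Ordinary least-squares line through (known, measured) points.

    Returns (slope, intercept). The slope is the probe's responsiveness
    to copy number change; the intercept is read here as the dye bias.
    """
    n = len(known_log_ratios)
    if n != len(measured_log_ratios) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(known_log_ratios) / n
    mean_y = sum(measured_log_ratios) / n
    sxx = sum((x - mean_x) ** 2 for x in known_log_ratios)
    if sxx == 0:
        raise ValueError("known copy number ratios must vary across samples")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_log_ratios, measured_log_ratios))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

For an X-chromosome probe in the male-female model system, the known log2 copy number ratios would be 1 for female referenced to male, -1 for male referenced to female, and 0 for the same-sex pairs, giving four data points per probe.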
[0085] In embodiments, the slope is calculated (as in 300) from the
performance of a probe analyzed using a direct metric, by observing
the change in probe response, as measured by the signal or ratio of
signals, or log of the ratio of signals, with respect to a
reference sample, with a change in copy number of the target
nucleic acid sequence or region of interest, using multiple
hybridization experiments on multiple arrays, with the conditions
maintained as similar as possible between arrays. The change in
probe response is measured as a change in signal in a differential
model system, such as a dye-swap or dye-flip experiment, for
example, where DNA copy number is changed in known or predictable
ways. For example, in an experiment to evaluate the performance of
probes on the X chromosome using a normal pair of female-male
samples, the probe is expected to produce a 2:1 signal ratio, as
there are twice as many X chromosome target molecules in the female
sample as in the male sample, and there are no Y chromosome target
molecules in the female sample. Similarly, other differential model
systems such as cell lines with well-known chromosomal aberrations,
extra or missing chromosomes or regions of chromosomes, will also
produce copy number changes in a predictable manner. Biological
model systems that do not exist in nature, but are created using
cell sorting techniques, or by mixing collections of BACs, cDNAs or
other biologically-derived DNA samples can also be used for
measuring probe performance.
[0086] In embodiments, using a direct metric, the change in signal
is measured with respect to measured quantities such as LogRatio
(i.e. the log of the ratio of red to green channels), LogIntensity
(the log of the product of red and green channel intensities), and
dye bias (the average of LogRatios for a dye-swap pair, in contrast
to the copy number LogRatio, which is obtained by subtracting
LogRatios), for example. For the most robust probe performance, the
change in probe response reflects the specific LogRatio change
associated with changes in copy number in dye-flip experiments, as
measured by subtracting LogRatios.
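The arithmetic behind a dye-flip pair can be sketched as follows. The function name is hypothetical, and the sketch assumes that both LogRatios are expressed as log(red/green), so that swapping the dye labels reverses the sign of the copy number term while leaving the dye bias term unchanged. The factor of one half on the difference is likewise an assumption chosen so that the result is on the same scale as a single LogRatio.

```python
def dye_flip_summary(log_ratio_forward, log_ratio_flipped):
    """Separate copy number signal from dye bias for one dye-flip pair.

    Model assumed here: forward = signal + bias, flipped = -signal + bias.
    Subtracting cancels the bias; averaging cancels the signal.
    """
    copy_number_log_ratio = (log_ratio_forward - log_ratio_flipped) / 2
    dye_bias = (log_ratio_forward + log_ratio_flipped) / 2
    return copy_number_log_ratio, dye_bias
```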
[0087] In embodiments, the slope for a probe is calculated, as in
300, based on the performance of a probe analyzed using an indirect
metric, by observing the change in probe response in relatively
simple experiments using a non-differential model. Indirect metrics
(or indirect empirical parameters) include signal strength (in one
or both channels from which signal is measured in a microarray
experiment), dye bias (the LogRatio associated with a dye label
rather than the LogRatio associated with copy number), differential
signals obtained from experiments under various conditions
(multiple annealing times for probes to target nucleic acid
sequences, wash times, wash temperatures, etc.), for example. For
dye bias measurements, the LogRatios for dye-flip experiments are
averaged, rather than subtracted as they would be to calculate the
effective LogRatios for copy number changes. Experiments using
indirect metrics are considered non-differential, because, for most
of the genome, the changes in probe response do not reflect changes
in copy number (i.e. no change in copy number is expected, between
the sample and a reference sequence). Rather, indirect metrics
predict the performance of a probe in terms of sensitivity and
specificity. For example, a measured signal that is too strong
could represent cross-hybridization of the probes to multiple
regions of the genome. On the other hand, a measured signal that is
too weak may be dominated by noise, or may be susceptible to
changes in the condition or the quantity of DNA.
[0088] In embodiments, the slope calculation in operation 300 is
based on the performance of a probe analyzed using in silico
parameters or metrics. In silico metrics are those metrics that are
calculated in the absence of any experimental data. These metrics
are derived from the sequence of the probes themselves, and from
the sequences of the genome, or the transcriptome of the organism
being studied. In silico metrics for each candidate probe are
obtained from the sequences directly, based on known laws of
physics and chemistry, such as those related to thermodynamics. In
silico metrics used in the methods described herein include,
without limitation, duplex melting temperature (T.sub.m or
DuplexTm) between a probe and its complementary sequence, maximal
subsequence duplex melting temperature of a probe (MaxSubSeqTm; the
maximal T.sub.m for any subsequence of length M within a longer
sequence of length N), hairpin thermodynamic properties of the
probe (i.e., hairpin melting temperature, Gibbs free energy, number
of bases within turns, loops, stems, etc.), and sequence complexity
(where complexity refers to the number of bases in the probe that
are contained within short simple repeats, such as homopolymers,
dimers, trimers, tetramers, etc., for example). For example, with
the methods described herein, complexity typically refers to the
number of bases contained within repeat units with six nucleotides,
i.e. hexamers, but the methods described herein can generally be
employed with repeats with any number of nucleotides.
[0089] In other embodiments, in silico parameters or metrics
associated with the homology of a probe are used. These metrics
include, without limitation, homology score (i.e. the distance to
the nearest hit, not including the first target sequence, within
the target sequence of interest or genome), homology
signal-to-background, expressed on a log scale (HomLogS2B,
described in U.S. Patent Publication No. 2006/0110744), and
predicted homology response (S.sub.Hom). The predicted homology
response is similar to the HomLogS2B, but instead of predicting the
signal-to-background, this score predicts the slope response of a
probe based on homology calculations alone, under the assumption
that thermodynamic and other properties of the probe are ideal. The
predicted homology score is defined by Equation 1:
$$S_{Hom} = \frac{\sum_{j=1}^{\mathrm{TargetSeq}} P(mm_j)}{\sum_{i=1}^{\mathrm{Genome}} P(mm_i)} \quad (1)$$
where P(mm.sub.j) is a penalty
term representing the signal contribution (under the specified
hybridization conditions) for the hybridization of the probe of
interest to each sufficiently complementary mismatch sequence
within a specified target sequence or genome. The summation in the
denominator in Equation 1 is over all the sequences in the genome,
or within the complex set of sequences expected to be in a sample
or set of samples. The numerator in Equation 1 represents the
target sequence of interest. In the most specific case, the target
sequence refers to the small specific sequence for which the probe
is being designed (i.e. within a particular locus within a narrow
region of a specific chromosome or region of interest in the genome
for which the probe is designed).
[0090] In the specific case, the equation can be simplified as
shown in Equation 2:
$$S_{Hom} = \frac{1}{\sum_{i=1}^{\mathrm{Genome}} P(mm_i)} \quad (2)$$
The function P(mm.sub.j) can be calculated
using a model for the hybridization between two oligonucleotide
sequences using nearest neighbor models. The term is dependent on
the number of mismatches, the distribution of mismatches through
the aligned sequences, the specific mismatched bases, and the
length of the overlap. Although all possible sequences within the
target nucleic acid sequence or genome should be considered, in
practice, only those sequences that are homologous enough to the
probe sequence are considered. For example, with 60-mer probes, all
subsequences in the genome that align with fewer than about 20
bases are considered.
[0091] This model can be further simplified, by approximating the
homology slope response by using the distances or number of
mismatches between the probe and the nearest hit (i.e. closest in
homology) sequence, as shown in Equation 3:
$$S_{Hom} = \frac{\sum_{d=0}^{D} P_d M_d}{\sum_{d=0}^{D} P_d N_d} \quad (3)$$
where N.sub.d represents the total number of hits at a distance d,
where d is defined as the number of single-base differences between
the probe of interest and the target nucleic acid sequence or
region of interest in the genome, and D is the maximum distance
that needs to be considered. The denominator represents the signal
contributions of all probes in the complex set of sequences,
including the target sequence. The numerator represents either the
target for the probe sequence, or, if a model system is being used,
the region of the model system sequence that is being varied. For
example, if the model system is a whole chromosome, then M.sub.d
represents all the hits within the chromosome at a distance d from
the probe of interest. P.sub.d is the signal penalty for each
mismatch at a distance d. A perfect match has P.sub.d=1, and the
value of P.sub.d decreases towards zero as the number of mismatches
increases (i.e. as the system becomes more destabilized). This is an
approximation based on the assumption that the average signal
reduction across a large number of mismatches is a good
representation for any single mismatch. That is, each mismatched
base (or insertion or deletion) can be assigned a constant penalty
P, giving Equation 4 as the relationship between a single-base
penalty and distance:
$$P_d \approx P^d \quad (4)$$
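As an illustration of Equations 3 and 4 taken together, the following sketch computes the predicted homology response from hit counts at each mismatch distance. The function name, the hit counts and the single-base penalty of 0.5 are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the description.

```python
def predicted_homology_response(target_hits, genome_hits, single_base_penalty):
    """Equation 3, using the Equation 4 approximation P_d ~= P**d.

    target_hits[d] plays the role of M_d and genome_hits[d] the role of N_d,
    for distances d = 0..D.
    """
    if len(target_hits) != len(genome_hits):
        raise ValueError("M_d and N_d must cover the same distances 0..D")
    numerator = 0.0
    denominator = 0.0
    for d, (m_d, n_d) in enumerate(zip(target_hits, genome_hits)):
        p_d = single_base_penalty ** d  # Equation 4: constant penalty per mismatch
        numerator += p_d * m_d
        denominator += p_d * n_d
    if denominator == 0.0:
        raise ValueError("no hits found, so the response is undefined")
    return numerator / denominator


# Hypothetical probe: one perfect match in the target (d = 0), and three
# additional genomic hits that each differ from the probe by two bases.
m_d = [1, 0, 0]
n_d = [1, 0, 3]
s_hom = predicted_homology_response(m_d, n_d, single_base_penalty=0.5)
print(s_hom)
```

With these numbers the denominator is 1 + 3(0.5)^2 = 1.75, so the predicted response is 1/1.75. A probe whose only hit is its own target gives a response of exactly 1.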
[0092] In still other embodiments, in silico parameters or metrics
that combine homology with thermodynamic properties may be used.
For example, maxTemp, defined as the duplex melting temperature (or
T.sub.m) between the probe and the longest contiguous match within
each homologous sequence in the background genome, can be used as
an in silico metric for probe performance. In other embodiments,
the melting temperature of the closest mismatch to the probe
sequence in the genome (MMClosestDuplexTm) as calculated from the
nearest neighbor model can also be used to predict probe
performance.
[0093] The methods described herein for selecting an
oligonucleotide probe for a microarray application include a step
for screening candidate probes against probe performance metrics,
as indicated in FIG. 1, at operation 102. This operation is further
depicted in FIG. 3. The screening process begins with the
calculation or determination of a slope for each candidate probe,
based on each metric, as indicated in operation 300. Using the X
chromosome as a model system to characterize probe performance,
empirical measurements of signal changes or LogRatio changes are
made. From these empirical results, the slope for each probe is
calculated. The slope is defined in either linear or logarithmic
space as the ratio of a measured signal or LogRatio to a known or
deliberate change in copy number. For probes with measurements at
multiple distinct copy numbers, the slope is calculated from the
signals or ratios on the y-axis and the known or expected copy
number (or fold-change) on the x-axis. For example, where there are
only two copy number values, the slope is the difference between
the y-axis values divided by the difference between the x-axis
values. For probes with ideal
response/performance, the slope approaches 1.00. For data points at
more than two copy numbers, the slope for each probe is calculated
from the best-fit line for a plot of the signal, signal ratios or
LogRatios. In embodiments, the slope for data generated from more
than two copy numbers is analyzed using statistical methods that
eliminate outliers, such as a fitting method that weights data
points by variance, for example.
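The slope calculation can be sketched as follows, for both the two-point case and the best-fit case. The function names are hypothetical. Two assumptions are made for illustration: the x-axis is expressed as the log2 of the expected copy number ratio, and the optional weights are taken to be inverse variances, which is one common reading of a fit that "weights data points by variance".

```python
import math


def two_point_slope(x1, y1, x2, y2):
    """Slope from two copy number values: change in response over change in copy number."""
    return (y2 - y1) / (x2 - x1)


def best_fit_slope(xs, ys, weights=None):
    """Slope of the (optionally weighted) least-squares line through the points."""
    if weights is None:
        weights = [1.0] * len(xs)
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    s_xy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    return s_xy / s_xx


# Hypothetical X chromosome probe: a 2:1 female-male ratio is log2(2) = 1 on the
# x-axis, and the probe reports a LogRatio of 0.8 instead of the ideal 1.0.
expected = math.log2(2.0 / 1.0)
print(two_point_slope(0.0, 0.0, expected, 0.8))

# Hypothetical measurements at three copy number levels.
print(best_fit_slope([0.0, 1.0, 2.0], [0.0, 0.8, 1.6]))
```

Both calls describe a probe that compresses the true change by the same factor, so they yield a slope close to 0.8 rather than the ideal value of 1.00.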
[0094] The calculated slope is then plotted against each metric to
give a trend curve, as in 302, which can be used to determine the
relationship between a given metric and the performance of the
probe. The trend curve is then smoothed or fitted with an appropriate
theoretical function, such as a set of polynomials with order as
high as 20, as in 304, in order to determine the effect variables
have on the slope for a given metric. Any set of orthonormal basis
functions, as known to those of skill in the art, can be used for
the fit. The smoothed or fitted trend curve can then be used to
generate a probe score, with each probe being assigned a score, as
in 306. The probe scores are assigned based on an arbitrary
numerical scale. A probe score at or near the highest end of the
scale indicates optimal or best probe performance. For example, the
probe scores could lie between 0 and 1 on a numerical scale, and
probes are selected if the probe score is closer to 1.0 (i.e. a
score closest to 1 implies ideal or best probe performance in a
given microarray experiment). Similarly, a numerical scale from 50
to 100 could be used, where probes with scores closest to 100 are
selected. In other words, the probe with the best or optimal score
is selected depending on the scale employed. Any number of scales,
with any variation of numerical ranges, can be employed.
Generating Trend Curves for Measured and Predicted Slope
[0095] In embodiments, in order to characterize the relationship
between the performance of a probe and the metric used to gauge
that performance, the slope points calculated for each probe based
on empirical data are plotted against the corresponding values for
each metric, as indicated in FIG. 3, at operation 302. For example,
using the differential model system of chromosome X and female-male
pairs, data is obtained where the target copy numbers are changed
by a predictable ratio (i.e. 2:1). When the measured slope for a
set of probes is plotted against a given metric, a distribution
plot is obtained. For example, FIG. 4 shows a distribution plot of
the calculated slope from different arrays against duplex melting
temperature (or DuplexT.sub.m), an in silico metric. Useful
information with regard to the relationship between probe
performance and DuplexT.sub.m exists if the distribution contains
discrete data points (i.e. data points that do not cluster in a
round and fuzzy manner).
[0096] The probe with the best or optimal score is assumed to be a
"good" probe, i.e. one that is particularly suitable for use in a
specific microarray experiment. For example, although not limited
to this aspect, the best probe selected according to the present
methods may be the one that hybridizes most strongly to the target
sequences. The actual underlying relationship (between probes and
scores for each metric) can be extracted from these distributions
by generating a trend curve. Trend curves can be obtained from the
slope data for each metric by a number of methods, including,
without limitation, polynomial fits, cubic-spline fits, Fourier
transforms, inverse transforms, smooth functional curves (for
example, exponentials, arctangents, etc.), Boltzmann distribution
curves, etc. Any curve that approximately follows the trend of the
data is useful. In embodiments, a straight line fit is appropriate.
In other embodiments, the data can also be smoothed and fitted
using methods like moving averages, moving medians, LOWESS, LOESS,
etc., for example.
[0097] An example of a trend curve used in the methods described
herein is shown in FIG. 5 (for DuplexT.sub.m vs. measured slope).
Each point in the trend curve represents the median value for data
sorted by rank on the x-axis and then smoothed in 1% bins using a
non-linear polynomial fitting method (i.e., the range of data is
split into equal-sized bins, each bin containing about 1% of the
data). From the trend curve, it is possible to see a relationship
between a given metric and the performance of the probe. For
example, the trend curve in FIG. 5 indicates that probe performance
is best for probes selected on the basis of DuplexT.sub.m close to
80.degree. C.
[0098] In embodiments, it is useful to make the trend curves for a
given metric more pronounced, to determine the independent effect
that a variable may have on the calculated slope. The response for
a given metric can be improved by filtering out a set of values
(for a second metric) that are not viable for good probes, or by
tuning in on a narrow range where selected probes are expected to
be found. For example, if most of the selected probes are expected
to occur within a narrow range of DuplexT.sub.m, then a trend curve
can be generated by selecting probes within that narrow range for a
particular metric. In embodiments, the trend curves are fitted with
polynomials as high as 20th order, as in operation 304 in FIG. 3.
An example of such a fitted slope for DuplexT.sub.m is shown in
FIG. 6.
Statistical Methods for Combining Probe Scores
[0099] In embodiments, the trend curves are used to generate
individual probe scores that are then combined to give a combined
probe score assigned to each candidate probe, as indicated in
operation 306 of FIG. 3. The combined or common probe score varies
from approximately zero to approximately 1, with a score closer to
1 implying ideal probe performance. In the simplest form,
individual probe scores for each metric (S.sub.m(p.sub.i)) are
combined into a single combined score (S.sub.c(p.sub.i); for each
probe p.sub.i) by adding or averaging the scores for each metric,
according to Equation 5:
$$S_c = \sum_{m} S_m(p_i) \quad (5)$$
As long as the score for a given metric increases
(or decreases) in the same direction, combining the individual
probe scores by adding or averaging is sufficient to provide
consistent improvements in probe performance with increasing values
of the combined score. Once a probe score has been assigned to each
probe (or subset of probes), it is straightforward to select probes
with the highest score within a window of interest (i.e. a region
of interest in the target nucleic acid sequence or genome). In
embodiments, probes can also be selected via pairwise
filtering/pairwise elimination, a process for reducing the number
of probes in a large set, described in detail in U.S. Patent
Publication No. 2006/0110744, which is incorporated by reference
herein.
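A minimal sketch of the additive combination in Equation 5, followed by selection of the highest-scoring probe within a window, is given below. The data layout (a list of dictionaries with `name`, `position` and `scores` keys), the function names and the score values are hypothetical. Pairwise filtering is not shown.

```python
def combined_score(metric_scores, average=True):
    """Combine per-metric scores S_m(p_i) by summing (Equation 5) or averaging."""
    total = sum(metric_scores.values())
    return total / len(metric_scores) if average else total


def best_probe_in_window(probes, window_start, window_end):
    """Return the probe with the highest combined score inside the window, or None."""
    in_window = [p for p in probes if window_start <= p["position"] < window_end]
    if not in_window:
        return None
    return max(in_window, key=lambda p: combined_score(p["scores"]))


# Hypothetical candidate probes with per-metric scores on a 0 to 1 scale.
candidates = [
    {"name": "probe_a", "position": 1200, "scores": {"DuplexTm": 0.9, "HomLogS2B": 0.6}},
    {"name": "probe_b", "position": 1850, "scores": {"DuplexTm": 0.8, "HomLogS2B": 0.9}},
    {"name": "probe_c", "position": 5400, "scores": {"DuplexTm": 1.0, "HomLogS2B": 1.0}},
]
selected = best_probe_in_window(candidates, 1000, 2000)
print(selected["name"])  # probe_b
```

Averaging rather than summing keeps the combined score on the same 0 to 1 scale as the individual scores, which does not change the ranking when every probe has the same set of metrics.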
[0100] In other embodiments, individual probe scores obtained from
the trend curves are combined by fitting multiple scores for a
training data set, using the change in the measured slope response
with a change in copy number in a model system, as provided in
Equation 6:
$$S_c = \sum_{m} C_m S_m(p_i) \quad (6)$$
A number of different methods are available to combine
and fit multivariate data in this manner, including, without
limitation, principal component analysis (PCA), partial
least squares (PLS), chemometrics, as well as other methods.
[0101] In still other embodiments, a linear fitting function is
used, involving taking the inverse of a matrix (as implemented in
Matlab). In this approach, the vector of the measured slope for
each probe is represented as Y (one value per probe), and the
matrix of the scores as M (number of metrics + 1 rows, number of
probes columns), where all but one of the rows of M are vectors of
scores for each metric, and the last row is a vector of ones, representing
an additive constant. Equation 7 describes the basic relationship
between the score S and the matrix of the scores M:
$$S = C M \quad (7)$$
where the matrix C is a vector, with one coefficient C.sub.m for each
element (one term for each metric plus the constant term).
Multiplying both sides by the inverse of M and solving for C gives
the following (Equation 8):
$$C = S M^{-1} \quad (8)$$
where M M.sup.-1 is I, the identity matrix, and M.sup.-1 is
approximated using pinv(M), the Moore-Penrose pseudoinverse of M.
The pseudoinverse is implemented as a Matlab function that involves a
singular value decomposition (i.e., a common mathematical method to
invert a matrix or solve a set of linear equations, available with
many commercially available software packages). This approach allows
any number of
metrics to be included in the score calculation, as long as the
metrics provide information in improving the performance of the
selected probes. An example of fitted data obtained from using the
above linear matrix functions is shown in FIG. 7, which depicts
probe scores obtained by plotting the combined fitted slopes for
four different metrics (DuplexTm, complexity, HomLogS2B and
MaxSubSeqTm) against the measured slope for the X chromosome model
system.
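The linear fit of Equations 7 and 8 can be sketched in a few lines. This sketch does not use a singular value decomposition: it solves the same least-squares problem through the normal equations with Gauss-Jordan elimination, which agrees with the pseudoinverse solution only when the score vectors (including the vector of ones) are linearly independent. The function name and the training values are hypothetical, and the measured slopes are assumed to play the role of S in Equation 7.

```python
def fit_score_weights(metric_scores, measured_slopes):
    """Least-squares coefficients C for S ~= C M.

    metric_scores: one list per metric, each holding one score per probe.
    A vector of ones is appended for the additive constant.
    Returns [C_1, ..., C_k, C_constant].
    """
    n = len(measured_slopes)
    rows = [list(r) for r in metric_scores] + [[1.0] * n]
    k = len(rows)
    # Augmented normal equations: (M M^T) C^T = M S^T
    a = [[sum(rows[i][p] * rows[j][p] for p in range(n)) for j in range(k)]
         + [sum(rows[i][p] * measured_slopes[p] for p in range(n))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("score vectors are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical training set: two metric scores for four probes, with measured
# slopes generated from known weights so the fit can be checked by eye.
tm_scores = [0.1, 0.5, 0.9, 0.3]
homology_scores = [0.8, 0.2, 0.4, 0.6]
measured = [0.6 * t + 0.3 * h + 0.05 for t, h in zip(tm_scores, homology_scores)]
weights = fit_score_weights([tm_scores, homology_scores], measured)
print(weights)
```

Because the example slopes were built from weights of 0.6 and 0.3 with a constant of 0.05, the fit recovers values very close to those. With real data, and especially with correlated metrics, a pseudoinverse based on singular value decomposition is the more robust choice.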
[0102] In embodiments, improved probe performance with respect to
various metrics may also be obtained using a multiplicative fitting
function, rather than an additive function. In a multiplicative
curve fit, several individual scores, or combined scores, are
multiplied together to produce a combined score. The metrics in
each category are first combined using a linear method (such as the
additive fitting already described) to produce intermediate scores.
These intermediate scores are then combined using a multiplicative
approach.
[0103] In embodiments, the scores associated with different probes
are combined linearly to give an overall score for a particular
metric. The overall score for each metric is then combined in a
multiplicative fit with the overall score for other metrics. For
example, the thermodynamic scores related to the duplex melting
temperature (DuplexTm) are combined linearly to give an overall
duplex-thermodynamics score D; the homology scores are also
combined to give an overall homology score H, and any structural
scores for the probe are combined independently giving P, with
target structural scores (if any) combined to give an overall
target score T. Each of these phenomena can independently lead to
decreased probe performance. For example, a nearly ideal probe I
with perfect homology scores, but poor thermodynamics scores may
only have a slope of 50%. Similarly, a probe with perfect
thermodynamic scores but poor homology score also may have only a
slope of 50%. It follows then that a probe with both relatively
poor thermodynamics (50%) and relatively poor homology scores (50%)
will have a slope of 25% rather than 50%, as would be predicted by
a linear model. This process will yield an overall probe score that
varies between approximately zero and one. The coefficients for the
combining of additive terms and the offsets are fitted to the data
according to the following equation (Equation 9):
$$S = C_m D P H T + C_a \quad (9)$$
where C.sub.m are the multiplicative
coefficients, and C.sub.a is an additive coefficient.
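The difference between the multiplicative and additive combinations in the 50% example can be checked with a short calculation. The function names are hypothetical, and the coefficients are set to C.sub.m = 1 and C.sub.a = 0 purely for illustration, since fitted values are not given in the description.

```python
def multiplicative_score(d, p, h, t, c_m=1.0, c_a=0.0):
    """Equation 9: S = C_m * D * P * H * T + C_a."""
    return c_m * d * p * h * t + c_a


def additive_average(*scores):
    """Simple additive alternative: the mean of the category scores."""
    return sum(scores) / len(scores)


# Probe with relatively poor thermodynamics (D = 0.5) and relatively poor
# homology (H = 0.5), and ideal probe and target structure (P = T = 1).
print(multiplicative_score(d=0.5, p=1.0, h=0.5, t=1.0))  # 0.25
print(additive_average(0.5, 0.5))                        # 0.5
```

The multiplicative form reproduces the 25% slope described for a probe that is weak in two independent categories, whereas averaging the same two scores gives 50%.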
[0104] When observing the relationship between probe performance,
as measured by the slope response, it is seen that the slope trend
continues to improve as the trend tends towards lower T.sub.m. This
means that probes with very low T.sub.m would show ideal
performance. However, while this is true with respect to the model
system, and systems where a large quantity of high quality DNA is
plentiful, it will not necessarily be true for many biological and
clinical samples where the DNA quantity is very low, or where the
DNA is degraded (as in a biopsy sample, for example). Therefore, in
embodiments, to increase the robustness of the methods described
herein, the score curves are modified, to take into account effects
associated with real samples, such as, but not limited to, DNA
degradation or low DNA concentration. The methods produce more
robust results, if the modified score curves more accurately
reflect the performance of real probes in a real biological sample.
In particular, the methods are modified in order to produce
consistently high signals and robust results, while minimizing the
negative impact on probe response.
[0105] In embodiments, the methods herein are modified by replacing
the fitted DuplexTm trend curves, as in FIG. 8, with various
synthetic curves, in order to determine the effects of the
modification on probe T.sub.m distributions and signal
distributions. In FIG. 8, the solid line represents the fitted
Tm-slope response curve (i.e. not a synthetic curve), while the
dotted line represents a synthetically generated asymmetric
Lorentzian curve, with half-width to left of center at 20.degree.
C., and half-width to right of center at 10.degree. C. The
dash-dotted line is a symmetric triangle function, with half width
at 10.degree. C., while the dashed line is a symmetric exponential
decay function with a half-width of 7.degree. C. All the synthetic
curves in FIG. 8 are centered at 80.degree. C. In an aspect, the
generation of synthetic curves begins with a candidate pool with a
large number of X chromosome probes (about 1.4 million). Pairwise
filtering, as discussed earlier, is used to select different sets
of approximately evenly spaced probes from a candidate set on the
basis of the combined score for each set of probes (i.e. probes
with the optimal score on an arbitrary numerical scale). This
method helps enrich the candidate pool with probes with higher
scores (i.e. "good" probes). Briefly, the pairwise filtering method
uses the probe's combined score as the target value or parameter,
with the pairwise algorithm selecting one of each pair of probes
that has the closest score to the target value (i.e. 1). The goal
is to select probes within a relatively narrow T.sub.m range, based
on the idea that probes with near ideal performance will typically
fall within a narrow T.sub.m range.
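The three synthetic score curves can be written down compactly, under some assumptions. The exact functional forms are not given in the description, so the sketch below assumes that each curve peaks at a score of 1 at the 80 degree center and that "half-width" means the distance from the center at which the score falls to one half. The function names are hypothetical.

```python
import math


def asymmetric_lorentzian(tm, center=80.0, left_half_width=20.0, right_half_width=10.0):
    """Lorentzian with a different half-width on each side of the center."""
    half_width = left_half_width if tm < center else right_half_width
    return 1.0 / (1.0 + ((tm - center) / half_width) ** 2)


def symmetric_triangle(tm, center=80.0, half_width=10.0):
    """Triangle that reaches one half at half_width and zero at twice that distance."""
    return max(0.0, 1.0 - abs(tm - center) / (2.0 * half_width))


def symmetric_exponential(tm, center=80.0, half_width=7.0):
    """Two-sided exponential decay that reaches one half at half_width."""
    return math.exp(-math.log(2.0) * abs(tm - center) / half_width)


# Score assigned to a few hypothetical probe melting temperatures by each curve.
for tm in (60.0, 70.0, 80.0, 87.0, 90.0):
    print(tm,
          round(asymmetric_lorentzian(tm), 3),
          round(symmetric_triangle(tm), 3),
          round(symmetric_exponential(tm), 3))
```

Unlike a fitted trend that keeps improving toward lower T.sub.m, each of these curves penalizes probes on both sides of the center, which is the property used to pull the selected probes toward a narrow T.sub.m range.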
[0106] The score curves synthesized in this manner are used to
generate selected measured slope distributions shown in FIG. 9,
which shows the probe T.sub.m distributions that result from the
use of the various combined scores in the pairwise reduction by
about a factor of 180:1. The thick solid line is a distribution of
T.sub.m values of the candidate probes. The thin solid line shows a
distribution of selected probe melting temperatures when the fitted
T.sub.m slope response is used as one component of the combined
score in the selection of probes. In each of the following, the
synthetic curve replaces only the fitted T.sub.m-slope response
component in its contribution to the total combined score. The
weights C.sub.i of the various scores are kept constant. The dotted
line is the T.sub.m distribution using the asymmetric Lorentzian
function component, the dash-dotted line is the T.sub.m distribution
using the symmetric triangle function component, and the dashed line
is the T.sub.m distribution using the symmetric exponential decay function
component.
[0107] It can be seen from FIG. 9 that the peaks in the
distribution shift from low values with the original scoring system
to values closer to the optimal 80 degree T.sub.m, with the
modified scoring system (the sharp curve with a peak at 80 degrees
is an artifact resulting from the candidate probe selection method,
and not a function of the modified scoring system).
[0108] In embodiments, the methods described herein use a
combination of experimentally measured slopes and predicted slopes
to select probes for a microarray application. Such a combination
is possible because the predicted slope has the same units and
varies over the same range as the measured slope. Consequently, as
experimental data becomes available, the predicted slope can be
replaced with the measured slope when performing probe selection.
This approach can be applied in a number of different ways. For
example, the predicted slope can simply be replaced with the
measured slope when experimental data is collected. In another
embodiment, probes with measured slopes may be preferred over those
with comparable predicted slopes by applying a numerical bias to
the score, thereby reducing the risk of selecting a probe with good
predicted parameters but poor actual performance. In yet another
embodiment, uncertainty values are assigned to both the predicted
and measured slopes, and the score values and uncertainties are
taken into account for probe selection.
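The first two ways of combining measured and predicted slopes can be sketched as follows. The dictionary keys, the function names and the slope values are hypothetical, and the numerical bias is modeled as a simple additive bonus, which is only one possible form. The uncertainty-based variant is not shown.

```python
def selection_value(probe, measured_bonus=0.0):
    """Slope used for ranking: the measured slope when available, else the predicted one."""
    measured = probe.get("measured_slope")
    if measured is not None:
        # Assumed form of the bias: a fixed bonus for having experimental support.
        return measured + measured_bonus
    return probe["predicted_slope"]


def select_probe(probes, measured_bonus=0.0):
    """Pick the probe with the highest selection value."""
    return max(probes, key=lambda p: selection_value(p, measured_bonus))


# Hypothetical candidates: probe A has only a prediction, while probe B has
# been measured and performs slightly below its own prediction.
probes = [
    {"name": "A", "predicted_slope": 0.90},
    {"name": "B", "predicted_slope": 0.95, "measured_slope": 0.88},
]
print(select_probe(probes)["name"])                       # A
print(select_probe(probes, measured_bonus=0.05)["name"])  # B
```

Because the predicted and measured slopes share the same units and range, they can be compared directly, and the bonus only decides how much a confirmed measurement outweighs an untested prediction.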
Arrays
[0109] The present description also provides nucleic acid
microarrays produced using the subject methods, as described
herein. The subject arrays include at least two distinct nucleic
acids that differ by monomeric sequence immobilized on, e.g.,
covalently on, different and known locations on the substrate
surface. In certain embodiments, each distinct nucleic acid
sequence of the array is typically present as a composition of
multiple copies of the polymer on the substrate surface, e.g., as a
spot on the surface of the substrate. The number of distinct
nucleic acid sequences, and hence spots or similar structures,
present on the array may vary, but is generally at least 2, usually
at least 5 and more usually at least 10, where the number of
different spots on the array may be as high as 100, 1000, 10,000,
100,000, 1,000,000 or higher, depending on the intended use of the
array. The spots of distinct polymers present on the array surface
are generally present as a pattern, where the pattern may be in the
form of organized rows and columns of spots, e.g., a grid of spots,
across the substrate surface, a series of curvilinear rows across
the substrate surface, e.g., a series of concentric circles or
semi-circles of spots, and the like. The density of spots present
on the array surface may vary, but will generally be at least about
10 and usually at least about 100 spots/cm.sup.2, where the density
may be as high as 10.sup.6 or higher. In other embodiments, the
polymeric sequences are not arranged in the form of distinct spots,
but may be positioned on the surface such that there is
substantially no space separating one polymer sequence/feature from
another. An exemplary array is described in U.S. Patent Publication
No. 20050095596, which is incorporated herein by reference.
[0110] Arrays can be fabricated using drop deposition from
pulsejets of either polynucleotide precursor units (such as
monomers) in the case of in situ fabrication, or the previously
obtained polynucleotide. Such methods are described in detail in,
for example, the previously cited references including U.S. Pat.
No. 6,242,266, U.S. Pat. No. 6,232,072, U.S. Pat. No. 6,180,351,
U.S. Pat. No. 6,171,797, U.S. Pat. No. 6,323,043, U.S. patent
application Ser. No. 09/302,898 filed Apr. 30, 1999 by Caren et
al., and the references cited therein. These references are
incorporated herein by reference. Other drop deposition methods can
be used for fabrication, as previously described herein.
[0111] A feature of the subject arrays is that they include one or
more, usually a plurality of, oligonucleotide probes predicted by
the statistical methods described herein. The oligonucleotide
probes selected according to the subject methods are suitable for
use in a plurality of different gene expression or genomic
microarray applications. The statistical regression method
evaluates probe performance, without using any assumptions about
the functional relationship between the oligonucleotide sequence
and the predictive parameters. Oligonucleotide probes that
"cluster" (i.e. consistently produce the same response) will
perform substantially similarly under a plurality of different
experimental conditions.
[0112] The arrays as described herein can be used in a variety of
different microarray applications, including gene expression
experiments and genomic analysis. In using an array, the array will
typically be exposed to a sample (for example, a fluorescently
labeled analyte, such as a sample containing genomic DNA) and the
array then read. Reading of the array may be accomplished by
illuminating the array and reading the location and intensity of
resulting fluorescence at each feature of the array to detect any
binding complexes on the surface of the array. For example, a
scanner may be used for this purpose that is similar to the AGILENT
MICROARRAY SCANNER available from Agilent Technologies, Palo Alto,
Calif. Other suitable apparatus and methods are described in U.S.
patent application Ser. No. 09/846,125 "Reading Multi-Featured
Arrays" by Dorsel et al.; and Ser. No. 09/430,214 "Interrogating
Multi-Featured Arrays" by Dorsel et al. As previously mentioned,
these references are incorporated herein by reference.
[0113] However, arrays may be read by any other method or apparatus
than the foregoing, with other reading methods including other
optical techniques (for example, detecting chemiluminescent or
electroluminescent labels) or electrical techniques (where each
feature is provided with an electrode to detect hybridization at
that feature in a manner disclosed in U.S. Pat. No. 6,221,583 and
elsewhere). Results from the reading may be raw results (such as
fluorescence intensity readings for each feature in one or more
color channels) or may be processed results such as obtained by
rejecting a reading for a feature which is below a predetermined
threshold and/or forming conclusions based on the pattern read from
the array (such as whether or not a particular target sequence may
have been present in the sample or an organism from which a sample
was obtained exhibits a particular condition). The results of the
reading (processed or not) may be forwarded (such as by
communication) to a remote location if desired, and received there
for further use (such as further processing).
Systems
[0114] The methods described herein are carried out in part with
the aid of a computer-based system, driven by software specific to
the methods. A "computer-based system" refers to the hardware,
software, and data storage used to analyze the information of the
present disclosure. Typical hardware of the computer-based systems
of the present disclosure comprises a central processing unit
(CPU), input, output, and data storage. A skilled artisan can
readily appreciate that any one of the currently available
computer-based systems is suitable for use in the present
disclosure. The data storage means may comprise any manufacture
comprising a recording of the present information as described
above, or a memory access means that can access such a manufacture.
In certain instances a computer-based system may include one or
more wireless devices.
[0115] Data from at least one of the detecting and deriving steps,
as described above, is transmitted to a remote location. By "remote
location" is meant a location other than the location at which the
array is present and hybridization occurs. For example, a remote
location could be another location (e.g. office, lab, etc.) in the
same city, another location in a different city, another location
in a different state, another location in a different country, etc.
As such, when one item is indicated as being "remote" from another,
what is meant is that the two items are at least in different
buildings, and may be at least one mile, ten miles, or at least one
hundred miles apart. "Communicating" information means transmitting
the data representing that information as electrical signals over a
suitable communication channel (for example, a private or public
network). "Forwarding" an item refers to any means of getting that
item from one location to the next, whether by physically
transporting that item or otherwise (where that is possible) and
includes, at least in the case of data, physically transporting a
medium carrying the data or communicating the data. The data may be
transmitted to the remote location for further evaluation and/or
use. Any convenient telecommunications means may be employed for
transmitting the data, e.g., facsimile, modem, internet, etc.
[0116] To "record" data, programming or other information on a
computer-readable medium refers to a process for storing
information on a recordable storage medium, using any such methods
as known in the art. Examples include magnetic media such as hard
drives, tapes, disks, and the like. Optical media can include CDs,
DVDs, and the like. Any convenient data storage structure may be
chosen, based on the means used to access the stored information. A
variety of data processor programs and formats can be used for
storage, e.g., word processing text file, database format, etc.
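By way of illustration only, recording the same information in a text file format and in a database format might be sketched as follows. The sketch assumes Python's standard csv and sqlite3 modules; the record fields (probe_id, score), the table name probe_scores, and the function names are hypothetical and are not taken from the present disclosure.

```python
import csv
import sqlite3
from pathlib import Path
from typing import Dict, List

# Hypothetical probe records; identifiers and scores are illustrative only.
RECORDS: List[Dict[str, object]] = [
    {"probe_id": "probe_A", "score": 0.91},
    {"probe_id": "probe_B", "score": 0.47},
]


def record_as_text(records: List[Dict[str, object]], path: Path) -> None:
    """Record the data as a delimited text file."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["probe_id", "score"])
        writer.writeheader()
        writer.writerows(records)


def record_in_database(records: List[Dict[str, object]], path: Path) -> None:
    """Record the same data in a database format (SQLite)."""
    connection = sqlite3.connect(str(path))
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS probe_scores "
                "(probe_id TEXT PRIMARY KEY, score REAL)"
            )
            connection.executemany(
                "INSERT OR REPLACE INTO probe_scores "
                "VALUES (:probe_id, :score)",
                records,
            )
    finally:
        connection.close()
```

Which structure is convenient depends on the means used to access the stored information: a text file is simple to inspect and forward, while a database format supports lookup by probe identifier.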
[0117] A "processor" references any hardware and/or software
combination that will perform the functions required of it. For
example, any processor herein may be a programmable digital
microprocessor such as available in the form of an electronic
controller, mainframe, server or personal computer (desktop or
portable). Where the processor is programmable, suitable
programming can be communicated from a remote location to the
processor, or previously saved in a computer program product (such
as a portable or fixed computer readable storage medium, whether
magnetic, optical or solid state device based). For example, a
magnetic medium or optical disk may carry the programming, and can
be read by a suitable reader communicating with each processor at
its corresponding station.
[0118] In aspects, the methods described herein are performed using
computer-readable media containing programming stored thereon
implementing the subject methods. The computer-readable media may
be, for example, in the form of a computer disk or CD, a floppy
disk, a magnetic "hard card", a server, or any other
computer-readable media capable of containing data or the like,
stored electronically, magnetically, optically or by other means.
Accordingly, stored programming embodying steps for carrying out
the subject methods may be transferred to a computer such as a
personal computer (PC) (i.e., one accessible by a researcher or the
like) by physical transfer of a CD, floppy disk, or like medium,
or may be transferred using a computer network, server, or any
other interface connection, e.g., the Internet.
[0119] In an embodiment, the system described herein may include a
single computer or the like with a stored algorithm capable of
evaluating probe performance, as described herein, i.e. a
computational analysis system that performs statistical regression
analysis on a set of training data. In certain embodiments, the
system is further characterized in that it provides a user
interface, where the user interface presents to a user the option
of selecting among one or more different inputs. For example, in
the systems described herein, the user has
the option of selecting various predictive parameters, such as
composition factors, thermodynamic factors, kinetic factors, and
mathematical combinations of such factors, as well as analogous
parameters for the intended genomic targets. Computational systems
that may be readily modified to become systems of the subject
invention include those described in U.S. Pat. No. 6,251,588, the
disclosure of which is incorporated herein by reference.
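By way of illustration only, a computational analysis that performs linear regression on a set of training data, using predictive parameters selected by a user, might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the present disclosure: the parameter names (gc_fraction, duplex_dg), the training values, and the function names are hypothetical, and ordinary least squares via the normal equations stands in for whichever regression method is actually employed.

```python
from typing import Dict, List, Sequence

# Hypothetical training data: each probe has candidate predictive parameters
# (a composition factor and a thermodynamic factor) and a measured response.
TRAINING: List[Dict[str, float]] = [
    {"gc_fraction": 0.40, "duplex_dg": -20.0, "measured": 4.6},
    {"gc_fraction": 0.50, "duplex_dg": -24.0, "measured": 5.4},
    {"gc_fraction": 0.60, "duplex_dg": -22.0, "measured": 5.6},
    {"gc_fraction": 0.45, "duplex_dg": -26.0, "measured": 5.4},
    {"gc_fraction": 0.55, "duplex_dg": -21.0, "measured": 5.3},
]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; parameters may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(training: Sequence[Dict[str, float]],
                     selected: Sequence[str],
                     response: str = "measured") -> List[float]:
    """Return [intercept, coefficient per selected parameter] by least squares."""
    rows = [[1.0] + [probe[name] for name in selected] for probe in training]
    y = [probe[response] for probe in training]
    k = len(selected) + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs: Sequence[float], probe: Dict[str, float],
            selected: Sequence[str]) -> float:
    """Predict the response of a candidate probe from the fitted model."""
    return coeffs[0] + sum(c * probe[n] for c, n in zip(coeffs[1:], selected))
```

Here the list passed as `selected` plays the role of the user's choice among predictive parameters; for example, `fit_linear_model(TRAINING, ["gc_fraction", "duplex_dg"])` fits both factors, while passing a single name fits only that factor.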
[0120] The various embodiments described above are provided by way
of illustration only and should not be construed to limit the
invention. Those skilled in the art will readily recognize various
modifications and changes that may be made to the present methods
without following the example embodiments and applications
illustrated and described herein, and without departing from the
true spirit and scope of the claims attached hereto.
* * * * *
References