U.S. patent application number 11/348730 was filed with the patent office on 2006-02-06 and published on 2006-06-29 for computer software products for nucleic acid hybridization analysis.
This patent application is currently assigned to Affymetrix, Inc. Invention is credited to David J. Balaban, Earl A. Hubbell.
Application Number | 20060142951 11/348730 |
Document ID | / |
Family ID | 26942686 |
Filed Date | 2006-02-06 |
United States Patent
Application |
20060142951 |
Kind Code |
A1 |
Balaban; David J.; et al. |
June 29, 2006 |
Computer software products for nucleic acid hybridization
analysis
Abstract
Methods and computer software products are provided for
analyzing gene expression data. In one embodiment, multiple probes
are used to detect a single transcript. The hybridization
intensity of each probe is adjusted by dividing the intensity
by the affinity of the probe. The minimal adjusted hybridization
intensity may be used as the measurement of the gene expression.
Inventors: |
Balaban; David J.; (San
Jose, CA) ; Hubbell; Earl A.; (Palo Alto,
CA) |
Correspondence
Address: |
AFFYMETRIX, INC;ATTN: CHIEF IP COUNSEL, LEGAL DEPT.
3420 CENTRAL EXPRESSWAY
SANTA CLARA
CA
95051
US
|
Assignee: |
Affymetrix, INC.
Santa Clara
CA
|
Family ID: |
26942686 |
Appl. No.: |
11/348730 |
Filed: |
February 6, 2006 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10315919 | Dec 9, 2002 | 6996475 |
11348730 | Feb 6, 2006 | |
09745272 | Dec 20, 2000 | 6510391 |
10315919 | Dec 9, 2002 | |
60252808 | Nov 22, 2000 | |
Current U.S.
Class: |
702/20 |
Current CPC
Class: |
G16B 25/00 20190201 |
Class at
Publication: |
702/020 |
International
Class: |
G06F 19/00 20060101
G06F019/00 |
Claims
1. A computer implemented method for determining hybridization
between a plurality of nucleic acid probes and a nucleic acid
target comprising inputting a plurality of hybridization
intensities, each of said intensities reflects said hybridization
between one of said plurality of said probes and said nucleic acid
target, adjusting said hybridization intensities for hybridization
affinities of said probes to obtain a plurality of adjusted
hybridization intensities, finding the minimal adjusted
hybridization intensity among said adjusted hybridization
intensities, and indicating said minimal adjusted hybridization
intensity as a measurement of said hybridization.
2. The method of claim 1 wherein said hybridization affinities of
said probes are predicted based upon the sequence of said
probes.
3. The method of claim 1 wherein said hybridization affinities are
inputted from a database.
4. The method of claim 3 wherein said hybridization affinities are
measured experimentally.
5. The method of claim 1 wherein said adjusted hybridization
intensity is calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
wherein said $I$ is said hybridization intensity and said $\Gamma$ is
said hybridization affinity.
6. The method of claim 5 wherein said plurality of probes have at
least 5 probes.
7. The method of claim 6 wherein said plurality of probes have at
least 10 probes.
8. The method of claim 7 wherein said plurality of probes have at
least 20 probes.
9. A computer software product for determining hybridization
between a plurality of nucleic acid probes and a nucleic acid
target comprising computer program code for inputting a plurality
of hybridization intensities, each of said hybridization
intensities reflects said hybridization between one of said
plurality of said probes and said nucleic acid target, computer
program code for adjusting said hybridization intensities for
hybridization affinities of said probes to obtain a plurality of
adjusted hybridization intensities, computer program code for
finding the minimal adjusted hybridization intensity among said
adjusted hybridization intensities, and computer program code for
indicating said minimal adjusted hybridization intensity as a
measurement of said hybridization, and a computer readable medium
for storing said code.
10. The computer software product of claim 9 wherein said
hybridization affinities of said probes are predicted based upon
the sequence of said probes.
11. The computer software product of claim 10 wherein said
hybridization affinities are inputted from a database.
12. The computer software product of claim 11 wherein said
hybridization affinities are measured experimentally.
13. The computer software product of claim 12 wherein said adjusted
hybridization intensity is calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
wherein said $I$ is said hybridization intensity and said
$\Gamma$ is said hybridization affinity.
14. A computer-readable medium having computer-executable
instructions for performing a method comprising inputting a
plurality of hybridization intensities, each of said intensities
reflects said hybridization between one of said plurality of said
probes and said nucleic acid target, adjusting said hybridization
intensities for hybridization affinities of said probes to obtain a
plurality of adjusted hybridization intensities, finding the
minimal adjusted hybridization intensity among said adjusted
hybridization intensities, and indicating said minimal adjusted
hybridization intensity as a measurement of said hybridization.
15. The computer readable medium of claim 14 wherein said
hybridization affinities of said probes are predicted based upon
the sequence of said probes.
16. The computer readable medium of claim 15 wherein said
hybridization affinities are inputted from a database.
17. The computer readable medium of claim 16 wherein said
hybridization affinities are measured experimentally.
18. The computer readable medium of claim 17 wherein said adjusted
hybridization intensity is calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
wherein said $I$ is said hybridization intensity and said
$\Gamma$ is said hybridization affinity.
19. A system for comparing nucleic acid probes, comprising a
processor, and a memory being coupled to the processor, the memory
storing a plurality of machine instructions that cause the processor
to perform a plurality of logical steps when implemented by the
processor, said logical steps including inputting a plurality of
hybridization intensities, each of said intensities reflects said
hybridization between one of said plurality of said probes and said
nucleic acid target, adjusting said hybridization intensities for
hybridization affinities of said probes to obtain a plurality of
adjusted hybridization intensities, finding the minimal adjusted
hybridization intensity among said adjusted hybridization
intensities, and indicating said minimal adjusted hybridization
intensity as a measurement of said hybridization.
20. The system of claim 19 wherein said hybridization affinities of
said probes are predicted based upon the sequence of said
probes.
21. The system of claim 20 wherein said hybridization affinities
are inputted from a database.
22. The system of claim 21 wherein said hybridization affinities
are measured experimentally.
23. The system of claim 22 wherein said adjusted hybridization
intensity is calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
wherein said $I$ is said hybridization intensity and said $\Gamma$ is
said hybridization affinity.
Description
RELATED APPLICATIONS
[0001] This application claims the priority of U.S. Provisional
Application No. 60/252,808, filed on Nov. 22, 2000, which is
incorporated herein by reference for all purposes.
FIELD OF INVENTION
[0002] This invention is related to bioinformatics and biological
data analysis. Specifically, this invention provides methods,
computer software products and systems for the analysis of
biological data.
BACKGROUND OF THE INVENTION
[0003] Many biological functions are carried out by regulating the
expression levels of various genes, either through changes in the
copy number of the genetic DNA, through changes in levels of
transcription (e.g., through control of initiation, provision of RNA
precursors, RNA processing, etc.) of particular genes, or through
changes in protein synthesis. For example, control of the cell
cycle and cell differentiation, as well as diseases, are
characterized by the variations in the transcription levels of a
group of genes.
[0004] Recently, massive parallel gene expression monitoring
methods have been developed to monitor the expression of a large
number of genes using nucleic acid array technology, which was
described in detail in, for example, U.S. Pat. No. 5,871,928; de
Saizieu, et al., 1998, Bacteria Transcript Imaging by Hybridization
of total RNA to Oligonucleotide Arrays, NATURE BIOTECHNOLOGY,
16:45-48; Wodicka et al., 1997, Genome-wide Expression Monitoring
in Saccharomyces cerevisiae, NATURE BIOTECHNOLOGY 15:1359-1367;
Lockhart et al., 1996, Expression Monitoring by Hybridization to
High Density Oligonucleotide Arrays, NATURE BIOTECHNOLOGY
14:1675-1680; Lander, 1999, Array of Hope, NATURE GENETICS,
21(suppl.), at 3.
[0005] Massive parallel gene expression monitoring experiments
generate unprecedented amounts of information. For example, a
commercially available GeneChip® array set is capable of
monitoring the expression levels of approximately 6,500 murine
genes and expressed sequence tags (ESTs) (Affymetrix, Inc., Santa
Clara, Calif., USA). Array sets for approximately 60,000 human
genes and EST clusters, 24,000 rat transcripts and EST clusters and
arrays for other organisms are also available from Affymetrix.
Effective analysis of the large amount of data may lead to the
development of new drugs and new diagnostic tools. Therefore, there
is a great demand in the art for methods for organizing, accessing
and analyzing the vast amount of information collected using
massive parallel gene expression monitoring methods.
SUMMARY OF THE INVENTION
[0006] The current invention provides methods, systems and computer
software products suitable for analyzing data from gene expression
monitoring experiments that employ multiple probes against a single
target.
[0007] Computer implemented methods for determining hybridization
between a plurality of nucleic acid probes and a nucleic acid
target are provided. The methods are useful for analyzing any
hybridization between multiple probes and a target nucleic acid.
They are particularly useful for analyzing gene expression
experiments where a single transcript is determined using multiple
probes.
[0008] In some embodiments, the methods include the steps of
inputting a plurality of hybridization intensities, each of which
reflects the hybridization between one of the plurality of probes
and the nucleic acid target; adjusting the hybridization
intensities for hybridization affinities of the probes to obtain a
plurality of adjusted hybridization intensities; finding the
minimal adjusted hybridization intensity among the adjusted
hybridization intensities; and indicating the minimal adjusted
hybridization intensity as a measurement of the hybridization. The
hybridization affinities of the probes may be predicted based upon
the sequence of the probes. The hybridization affinities may be
inputted from a database where experimentally determined
hybridization affinities are stored. The adjusted hybridization
intensity is calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
[0009] where $I$ is the hybridization intensity and $\Gamma$ is the
hybridization affinity.
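By way of illustration only, the adjustment and minimum-selection steps summarized above may be sketched in Python as follows. The function names `adjusted_intensities` and `expression_measure`, and the check that every affinity is positive, are assumptions made for this sketch and do not appear in the specification.

```python
from typing import List, Sequence


def adjusted_intensities(intensities: Sequence[float],
                         affinities: Sequence[float]) -> List[float]:
    """Divide each probe intensity I by its affinity Gamma (I / Gamma)."""
    if len(intensities) != len(affinities):
        raise ValueError("one affinity is required per probe intensity")
    if any(gamma <= 0 for gamma in affinities):
        # Assumption of this sketch: affinities are strictly positive.
        raise ValueError("affinities must be positive")
    return [i / gamma for i, gamma in zip(intensities, affinities)]


def expression_measure(intensities: Sequence[float],
                       affinities: Sequence[float]) -> float:
    """Return the minimal adjusted intensity across the probes of one target."""
    return min(adjusted_intensities(intensities, affinities))


# Three hypothetical probes against one transcript.
probe_intensities = [100.0, 240.0, 90.0]
probe_affinities = [1.0, 2.0, 0.5]
measure = expression_measure(probe_intensities, probe_affinities)
```

In this hypothetical input the adjusted intensities are 100, 120 and 180, so the first probe supplies the reported measurement even though the third probe has the lowest raw intensity.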
[0010] In another aspect of the invention, computer software
products are provided for determining hybridization between nucleic
acid probes and a nucleic acid target. A software product may
include a computer-readable medium having computer-executable
instructions for performing the method of the invention.
[0011] In some embodiments, the software products may include
computer program code for inputting a plurality of hybridization
intensities, each of which reflects the hybridization between one
of the plurality of probes and the nucleic acid target; computer
program code for adjusting the hybridization intensities for
hybridization affinities of the probes to obtain a plurality of
adjusted hybridization intensities; computer program code for
finding the minimal adjusted hybridization intensity among the
adjusted hybridization intensities; computer program code for
indicating the minimal adjusted hybridization intensity as a
measurement of said hybridization; and a computer readable medium
for storing the code. The hybridization affinities of the probes
may be predicted based upon the sequence of said probes, and the
software products may contain code for performing the prediction.
Alternatively, the predicted hybridization affinities may be
inputted. In some embodiments, the hybridization affinities are
inputted from a database. Hybridization affinities may also be
measured experimentally. In preferred embodiments, the adjusted
hybridization intensity may be calculated according to
$$\text{Adjusted hybridization intensity} = \frac{I}{\Gamma},$$
[0012] where $I$ is the hybridization intensity and $\Gamma$ is the
hybridization affinity. In yet another aspect of the invention,
systems for analyzing nucleic acid hybridization are provided. In
some embodiments, the system may include a processor, and a memory
being coupled to the processor, the memory storing a plurality of
machine instructions that cause the processor to perform the method
of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The accompanying drawings, which are incorporated in and
form a part of this specification, illustrate embodiments of the
invention and, together with the description, serve to explain the
principles of the invention.
[0014] FIG. 1 illustrates an example of a computer system that may
be utilized to execute the software of an embodiment of the
invention.
[0015] FIG. 2 illustrates a system block diagram of the computer
system of FIG. 1.
[0016] FIG. 3 shows one embodiment of the gene expression analysis
method of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0017] Reference will now be made in detail to the preferred
embodiments of the invention. While the invention will be described
in conjunction with the preferred embodiments, it will be
understood that they are not intended to limit the invention to
these embodiments. On the contrary, the invention is intended to
cover alternatives, modifications and equivalents, which may be
included within the spirit and scope of the invention. All cited
references, including patent and non-patent literature, are
incorporated herein by reference in their entireties for all
purposes.
I. Gene Expression Monitoring With High Density Oligonucleotide
Probe Arrays
[0018] High density nucleic acid probe arrays, also referred to as
"DNA Microarrays," have become a method of choice for monitoring
the expression of a large number of genes. As used herein, "Nucleic
acids" may include any polymer or oligomer of nucleosides or
nucleotides (polynucleotides or oligonucleotides), which include
pyrimidine and purine bases, preferably cytosine, thymine, and
uracil, and adenine and guanine, respectively. See Albert L.
Lehninger, PRINCIPLES OF BIOCHEMISTRY, at 793-800 (Worth Pub. 1982)
and L. Stryer, BIOCHEMISTRY, 4th Ed. (March 1995), both
incorporated by reference. "Nucleic acids" may include any
deoxyribonucleotide, ribonucleotide or peptide nucleic acid
component, and any chemical variants thereof, such as methylated,
hydroxymethylated or glucosylated forms of these bases, and the
like. The polymers or oligomers may be heterogeneous or homogeneous
in composition, and may be isolated from naturally-occurring
sources or may be artificially or synthetically produced. In
addition, the nucleic acids may be DNA or RNA, or a mixture
thereof, and may exist permanently or transitionally in
single-stranded or double-stranded form, including homoduplex,
heteroduplex, and hybrid states.
[0019] "A target molecule" refers to a biological molecule of
interest. The biological molecule of interest can be a ligand,
receptor, peptide, nucleic acid (oligonucleotide or polynucleotide
of RNA or DNA), or any other of the biological molecules listed in
U.S. Pat. No. 5,445,934 at col. 5, line 66 to col. 7, line 51. For
example, if transcripts of genes are the interest of an experiment,
the target molecules would be the transcripts. Other examples
include protein fragments, small molecules, etc. "Target nucleic
acid" refers to a nucleic acid (often derived from a biological
sample) of interest. Frequently, a target molecule is detected
using one or more probes. As used herein, a "probe" is a molecule
for detecting a target molecule. It can be any of the molecules in
the same classes as the target referred to above. A probe may refer
to a nucleic acid, such as an oligonucleotide, capable of binding
to a target nucleic acid of complementary sequence through one or
more types of chemical bonds, usually through complementary base
pairing, usually through hydrogen bond formation. As used herein, a
probe may include natural (i.e., A, G, U, C, or T) or modified bases
(7-deazaguanosine, inosine, etc.). In addition, the bases in probes
may be joined by a linkage other than a phosphodiester bond, so
long as the bond does not interfere with hybridization. Thus, probes
may be peptide nucleic acids in which the constituent bases are
joined by peptide bonds rather than phosphodiester linkages. Other
examples of probes include antibodies used to detect peptides or
other molecules, and any ligands for detecting their binding
partners. When referring to targets or probes as nucleic acids, it
should be understood that these are illustrative embodiments that
are not to limit the invention in any way.
[0020] In preferred embodiments, probes may be immobilized on
substrates to create an array. An "array" may comprise a solid
support with peptide or nucleic acid or other molecular probes
attached to the support. Arrays typically comprise a plurality of
different nucleic acids or peptide probes that are coupled to a
surface of a substrate in different, known locations. These arrays,
also described as "microarrays" or colloquially "chips," have been
generally described in the art, for example, in Fodor et al.,
Science, 251:767-777 (1991), which is incorporated by reference for
all purposes. Methods of forming high density arrays of
oligonucleotides, peptides and other polymer sequences with a
minimal number of synthetic steps are disclosed in, for example,
U.S. Pat. Nos. 5,143,854, 5,252,743, 5,384,261, 5,405,783,
5,424,186, 5,429,807, 5,445,943, 5,510,270, 5,677,195, 5,571,639,
6,040,138, all incorporated herein by reference for all purposes.
The oligonucleotide analogue array can be synthesized on a solid
substrate by a variety of methods, including, but not limited to,
light-directed chemical coupling, and mechanically directed
coupling. See Pirrung et al., U.S. Pat. No. 5,143,854 (see also PCT
Application No. WO 90/15070) and Fodor et al., PCT Publication Nos.
WO 92/10092 and WO 93/09668, U.S. Pat. Nos. 5,677,195, 5,800,992
and 6,156,501, which disclose methods of forming vast arrays of
peptides, oligonucleotides and other molecules using, for example,
light-directed synthesis techniques. See also, Fodor et al.,
Science, 251, 767-77 (1991). These procedures for synthesis of
polymer arrays are now referred to as VLSIPS™ procedures. Using
the VLSIPS™ approach, one heterogeneous array of polymers is
converted, through simultaneous coupling at a number of reaction
sites, into a different heterogeneous array. See U.S. Pat. Nos.
5,384,261 and 5,677,195.
[0021] Methods for making and using molecular probe arrays,
particularly nucleic acid probe arrays are also disclosed in, for
example, U.S. Pat. Nos. 5,143,854, 5,242,974, 5,252,743, 5,324,633,
5,384,261, 5,405,783, 5,409,810, 5,412,087, 5,424,186, 5,429,807,
5,445,934, 5,451,683, 5,482,867, 5,489,678, 5,491,074, 5,510,270,
5,527,681, 5,527,681, 5,541,061, 5,550,215, 5,554,501, 5,556,752,
5,556,961, 5,571,639, 5,583,211, 5,593,839, 5,599,695, 5,607,832,
5,624,711, 5,677,195, 5,744,101, 5,744,305, 5,753,788, 5,770,456,
5,770,722, 5,831,070, 5,856,101, 5,885,837, 5,889,165, 5,919,523,
5,922,591, 5,925,517, 5,658,734, 6,022,963, 6,150,147, 6,147,205,
6,153,743, 6,140,044 and D430024, all of which are incorporated by
reference in their entireties for all purposes. Typically, a nucleic
acid sample is labeled with a signal moiety, such as a
fluorescent label. The sample is hybridized with the array under
appropriate conditions. The arrays are washed or otherwise processed
to remove non-hybridized sample nucleic acids. The hybridization is
then evaluated by detecting the distribution of the label on the
chip. The distribution of label may be detected by scanning the
arrays to determine the distribution of fluorescence intensities.
Typically, the hybridization of each probe is reflected by several
pixel intensities. The raw intensity data may be stored in a gray
scale pixel intensity file. The GATC™ Consortium has specified
several file formats for storing array intensity data. The final
software specification is available at www.gatcconsortium.org and is
incorporated herein by reference in its entirety. The pixel
intensity files are usually large. For example, a GATC™
compatible image file may be approximately 50 MB if there are about
5000 pixels on each of the horizontal and vertical axes and if a
two byte integer is used for every pixel intensity. The pixels may
be grouped into cells (see GATC™ software specification). The
probes in a cell are designed to have the same sequence (i.e., each
cell is a probe area). A CEL file contains the statistics of a cell,
e.g., the 75th percentile and standard deviation of intensities of
pixels in a cell. The 50th, 60th, 65th, 70th, 80th, or 85th
percentile of pixel intensity of a cell is often used as the
intensity of the cell.
Methods for signal detection and processing of intensity data are
additionally disclosed in, for example, U.S. Pat. Nos. 5,547,839,
5,578,832, 5,631,734, 5,800,992, 5,856,092, 5,936,324, 5,981,956,
6,025,601, 6,090,555, 6,141,096, 6,141,096, and 5,902,723. Methods
for array based assays, computer software for data analysis and
applications are additionally disclosed in, e.g., U.S. Pat. Nos.
5,527,670, 5,527,676, 5,545,531, 5,622,829, 5,631,128, 5,639,423,
5,646,039, 5,650,268, 5,654,155, 5,674,742, 5,710,000, 5,733,729,
5,795,716, 5,814,450, 5,821,328, 5,824,477, 5,834,252, 5,834,758,
5,837,832, 5,843,655, 5,856,086, 5,856,104, 5,856,174, 5,858,659,
5,861,242, 5,869,244, 5,871,928, 5,874,219, 5,902,723, 5,925,525,
5,928,905, 5,935,793, 5,945,334, 5,959,098, 5,968,730, 5,968,740,
5,974,164, 5,981,174, 5,981,185, 5,985,651, 6,013,440, 6,013,449,
6,020,135, 6,027,880, 6,027,894, 6,033,850, 6,033,860, 6,037,124,
6,040,138, 6,040,193, 6,043,080, 6,045,996, 6,050,719, 6,066,454,
6,083,697, 6,114,116, 6,114,122, 6,121,048, 6,124,102, 6,130,046,
6,132,580, 6,132,996, 6,136,269 and attorney docket numbers 3298 1
and 3309, all of which are incorporated by reference in their
entireties for all purposes.
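As a non-limiting illustration of the image size estimate and the percentile-based cell intensity discussed in paragraph [0021], the following Python sketch may be considered. The names `image_file_bytes` and `cell_intensity` are hypothetical, the sketch ignores any file header, and the nearest-rank percentile rule is an assumption of this sketch; the specification does not state which percentile definition is used.

```python
from typing import Sequence


def image_file_bytes(width: int, height: int, bytes_per_pixel: int = 2) -> int:
    """Raw size of the pixel data for a width x height intensity image."""
    return width * height * bytes_per_pixel


def cell_intensity(pixels: Sequence[int], percentile: int = 75) -> int:
    """Nearest-rank percentile of the pixel intensities in one cell.

    The nearest-rank rule (smallest value with at least `percentile`
    percent of the pixels at or below it) is an assumption of this sketch.
    """
    if not pixels:
        raise ValueError("a cell must contain at least one pixel")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(pixels)
    # Integer ceiling of percentile * n / 100, avoiding float rounding.
    rank = -(-percentile * len(ordered) // 100)
    return ordered[rank - 1]


# About 5000 pixels per axis at two bytes per pixel gives 50,000,000 bytes.
size = image_file_bytes(5000, 5000)

# A hypothetical four-pixel cell.
cell = [40, 10, 30, 20]
intensity_75 = cell_intensity(cell, 75)
```

Choosing a percentile rather than the mean makes the cell intensity less sensitive to a few outlying pixels; which percentile is appropriate is left open here, as in the text.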
[0022] Nucleic acid probe array technology, use of such arrays,
analysis of array based experiments, associated computer software,
composition for making the array and practical applications of the
nucleic acid arrays are also disclosed, for example, in the
following U.S. patent applications Ser. Nos. 07/838,607,
07/883,327, 07/978,940, 08/030,138, 08/082,937, 08/143,312,
08/327,522, 08/376,963, 08/440,742, 08/533,582, 08/643,822,
08/772,376, 09/013,596, 09/016,564, 09/019,882, 09/020,743,
09/030,028, 09/045,547, 09/060,922, 09/063,311, 09/076,575,
09/079,324, 09/086,285, 09/093,947, 09/097,675, 09/102,167,
09/102,986, 09/122,167, 09/122,169, 09/122,216, 09/122,304,
09/122,434, 09/126,645, 09/127,115, 09/132,368, 09/134,758,
09/138,958, 09/146,969, 09/148,210, 09/148,813, 09/170,847,
09/172,190, 09/174,364, 09/199,655, 09/203,677, 09/256,301,
09/285,658, 09/294,293, 09/318,775, 09/326,137, 09/326,374,
09/341,302, 09/354,935, 09/358,664, 09/373,984, 09/377,907,
09/383,986, 09/394,230, 09/396,196, 09/418,044, 09/418,946,
09/420,805, 09/428,350, 09/431,964, 09/445,734, 09/464,350,
09/475,209, 09/502,048, 09/510,643, 09/513,300, 09/516,388,
09/528,414, 09/535,142, 09/544,627, 09/620,780, 09/640,962,
09/641,081, 09/670,510, 09/685,011, and 09/693,204 and in the
following Patent Cooperation Treaty (PCT) applications/publications:
PCT/NL90/00081, PCT/GB91/00066, PCT/US91/08693, PCT/US91/09226,
PCT/US91/09217, WO/93/10161, PCT/US92/10183, PCT/GB93/00147,
PCT/US93/01152, WO/93/22680, PCT/US93/04145, PCT/US93/08015,
PCT/US94/07106, PCT/US94/12305, PCT/GB95/00542, PCT/US95/07377,
PCT/US95/02024, PCT/US96/05480, PCT/US96/11147, PCT/US96/14839,
PCT/US96/15606, PCT/US97/01603, PCT/US97/02102, PCT/GB97/005566,
PCT/US97/06535, PCT/GB97/01148, PCT/GB97/01258, PCT/US97/08319,
PCT/US97/08446, PCT/US97/10365, PCT/US97/17002, PCT/US97/16738,
PCT/US97/19665, PCT/US97/20313, PCT/US97/21209, PCT/US97/21782,
PCT/US97/23360, PCT/US98/06414, PCT/US98/01206, PCT/GB98/00975,
PCT/US98/04280, PCT/US98/04571, PCT/US98/05438, PCT/US98/05451,
PCT/US98/12442, PCT/US98/12779, PCT/US98/12930, PCT/US98/13949,
PCT/US98/15151, PCT/US98/15469, PCT/US98/15458, PCT/US98/15456,
PCT/US98/16971, PCT/US98/16686, PCT/US99/19069, PCT/US98/18873,
PCT/US98/18541, PCT/US98/19325, PCT/US98/22966, PCT/US98/26925,
PCT/US98/27405 and PCT/IB99/00048, all of which are incorporated by
reference in their entireties for all purposes. All the above cited
patent applications and other references cited throughout this
specification are incorporated herein by reference in their
entireties for all purposes.
[0023] The embodiments of the invention will be described using
GeneChip® high density oligonucleotide probe arrays (available
from Affymetrix, Inc., Santa Clara, Calif., USA) as exemplary
embodiments. One of skill in the art would appreciate that the
embodiments of the invention are not limited to high density
oligonucleotide probe arrays. On the contrary, the embodiments of
the invention are useful for analyzing any parallel large scale
biological analysis, such as those using nucleic acid probe arrays,
protein arrays, etc.
[0024] Gene expression monitoring using GeneChip® high density
oligonucleotide probe arrays is described in, for example,
Lockhart et al., 1996, Expression Monitoring By Hybridization to
High Density Oligonucleotide Arrays, Nature Biotechnology
14:1675-1680; U.S. Pat. Nos. 6,040,138 and 5,800,992, all incorporated
herein by reference in their entireties for all purposes.
[0025] In the preferred embodiment, oligonucleotide probes are
synthesized directly on the surface of the array using
photolithography and combinatorial chemistry as disclosed in
several patents previously incorporated by reference. In such
embodiments, a single square-shaped feature on an array contains
one type of probe. Probes are selected to be specific against the
desired target. Methods for selecting probe sequences are disclosed
in, for example, U.S. patent application Ser. No. 09/718,295, filed
Nov. 21, 2000, U.S. patent application Ser. No. 09/721,042, filed
Nov. 21, 2000, and U.S. Patent Application No. 60/252,617, filed
Nov. 21, 2000, all incorporated herein by reference in their
entireties for all purposes.
[0026] In a preferred embodiment, oligonucleotide probes in the
high density array are selected to bind specifically to the nucleic
acid target to which they are directed with minimal non-specific
binding or cross-hybridization under the particular hybridization
conditions utilized. Because the high density arrays of this
invention can contain in excess of 1,000,000 different probes, it
is possible to provide every probe of a characteristic length that
binds to a particular nucleic acid sequence. Thus, for example, the
high density array can contain every possible 20-mer sequence
complementary to an IL-2 mRNA. There may, however, exist 20-mer
subsequences that are not unique to the IL-2 mRNA. Probes directed
to these subsequences are expected to cross hybridize with
occurrences of their complementary sequence in other regions of the
sample genome. Similarly, other probes simply may not hybridize
effectively under the hybridization conditions (e.g., due to
secondary structure, or interactions with the substrate or other
probes). Thus, in a preferred embodiment, the probes that show such
poor specificity or hybridization efficiency are identified and may
not be included either in the high density array itself (e.g.,
during fabrication of the array) or in the post-hybridization data
analysis.
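The idea of tiling every probe of a given length against a target, and then setting aside probes whose target subsequence also occurs elsewhere, may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `reverse_complement`, `tile_probes` and `unique_probes` are hypothetical, probes are written as DNA reverse complements of the target window, and "not unique" is approximated by an exact substring match against a supplied list of other sequences. Hybridization efficiency and secondary structure are not modeled.

```python
from typing import Iterable, List

# DNA complement of each base; U (RNA) is paired with A.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}


def reverse_complement(seq: str) -> str:
    """Return the DNA reverse complement of a DNA or RNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def tile_probes(target: str, k: int = 20) -> List[str]:
    """Every k-mer probe complementary to a window of the target."""
    if k <= 0 or k > len(target):
        raise ValueError("k must be between 1 and the target length")
    return [reverse_complement(target[i:i + k])
            for i in range(len(target) - k + 1)]


def unique_probes(target: str, background: Iterable[str],
                  k: int = 20) -> List[str]:
    """Probes whose target window does not occur in any background sequence."""
    target = target.upper()
    others = [seq.upper() for seq in background]
    kept = []
    for i in range(len(target) - k + 1):
        window = target[i:i + k]
        if not any(window in other for other in others):
            kept.append(reverse_complement(window))
    return kept


# Toy example with 4-mers instead of 20-mers.
all_probes = tile_probes("ACGTAC", k=4)
specific = unique_probes("ACGTAC", ["TTCGTATT"], k=4)
```

In the toy example the window CGTA also occurs in the background sequence, so its probe is dropped while the other two are kept.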
[0027] Probes as short as 15, 20, 25 or 30 nucleotides are
sufficient to hybridize to a subsequence of a gene and, for
most genes, there is a set of probes that performs well across a
wide range of target nucleic acid concentrations. In a preferred
embodiment, it is desirable to choose a preferred or "optimum"
subset of probes for each gene before synthesizing the high density
array.
[0028] In some preferred embodiments, the expression of a
particular transcript may be detected by a plurality of probes,
typically up to 5, 10, 15, 20, 30 or 40 probes. Each of the probes
may target different sub-regions of the transcript. However, probes
may overlap over targeted regions.
[0029] In some preferred embodiments, each target sub-region is
detected using two probes: a perfect match (PM) probe and a
mismatch (MM) probe. A PM probe is designed to be completely
complementary to a reference or target sequence. In some other
embodiments, a PM probe may be substantially complementary to the
reference sequence. A mismatch (MM) probe is a
probe that is designed to be complementary to a reference sequence
except for some mismatches that may significantly affect the
hybridization between the probe and its target sequence. In
preferred embodiments, MM probes are designed to be complementary
to a reference sequence except for a homomeric base mismatch at the
central (e.g., 13th in a 25 base probe) position. Mismatch
probes are normally used as controls for cross-hybridization. A
probe pair is usually composed of a PM and its corresponding MM
probe. The difference between PM and MM provides an intensity
difference in a probe pair.
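The relationship between a PM probe, its MM counterpart and the probe-pair intensity difference may be illustrated with the following Python sketch. The names `mismatch_probe` and `pair_difference` are hypothetical. The sketch assumes that the homomeric mismatch is produced by replacing the central base of the PM probe with its complement, and that the central position of an odd-length probe is index `len(probe) // 2` (the 13th base of a 25-mer); neither detail is spelled out in the text.

```python
# Complement used to create the central mismatch (assumption of this sketch).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mismatch_probe(pm_probe: str) -> str:
    """Derive an MM probe by substituting the central base of a PM probe."""
    pm_probe = pm_probe.upper()
    if len(pm_probe) % 2 == 0:
        # An even-length probe has no single central base.
        raise ValueError("this sketch expects an odd-length probe")
    center = len(pm_probe) // 2  # index 12, i.e. the 13th base of a 25-mer
    swapped = _COMPLEMENT[pm_probe[center]]
    return pm_probe[:center] + swapped + pm_probe[center + 1:]


def pair_difference(pm_intensity: float, mm_intensity: float) -> float:
    """Intensity difference of one probe pair (PM minus MM)."""
    return pm_intensity - mm_intensity


# A hypothetical 25-mer PM probe with G at the central position.
pm = "A" * 12 + "G" + "T" * 12
mm = mismatch_probe(pm)
difference = pair_difference(500.0, 120.0)
```

Only the central base differs between the two probes, so a large positive PM minus MM difference suggests specific hybridization, while a difference near zero suggests that cross-hybridization dominates the signal.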
II. Data Analysis Systems
[0030] In one aspect of the invention, methods, computer software
products and systems are provided for computational analysis of
microarray intensity data for determining the presence or absence
of genes in a given biological sample. Accordingly, the present
invention may take the form of data analysis systems, methods,
analysis software, etc. Software written according to the present
invention is to be stored in some form of computer readable medium,
such as memory or CD-ROM, or transmitted over a network, and
executed by a processor. For a description of basic computer systems
and computer networks, see, e.g., Introduction to Computing Systems:
From Bits and Gates to C and Beyond by Yale N. Patt, Sanjay J. Patel,
1st edition (Jan. 15, 2000) McGraw Hill Text, ISBN 0072376902, and
Introduction to Client/Server Systems: A Practical Guide for
Systems Professionals by Paul E. Renaud, 2nd edition (June 1996),
John Wiley & Sons, ISBN 0471133337.
[0031] Computer software products may be written in any of various
suitable programming languages, such as C, C++, C# (Microsoft®),
Fortran, Perl, MatLab (MathWorks, www.mathworks.com), SAS, SPSS and
Java. The computer software product may be an independent
application with data input and data display modules. Alternatively,
the computer software products may be classes that may be
instantiated as distributed objects. The computer software products
may also be component software such as Java Beans (Sun
Microsystems), Enterprise Java Beans (EJB, Sun Microsystems),
Microsoft® COM/DCOM (Microsoft®), etc.
[0032] FIG. 1 illustrates an example of a computer system that may
be used to execute the software of an embodiment of the invention.
The computer system described herein is also suitable for hosting a
DBMS. FIG. 1 shows a computer system 101 that includes a display
103, screen 105, cabinet 107, keyboard 109, and mouse 111. Mouse 111
may have one or more buttons for interacting with a graphic user
interface. Cabinet 107 houses a floppy drive 112, CD-ROM or DVD-ROM
drive 102, system memory and a hard drive (113) (see also FIG. 2)
which may be utilized to store and retrieve software programs
incorporating computer code that implements the invention, data for
use with the invention and the like. Although a CD 114 is shown as
an exemplary computer readable medium, other computer readable
storage media including floppy disk, tape, flash memory, system
memory, and hard drive may be utilized. Additionally, a data signal
embodied in a carrier wave (e.g., in a network including the
Internet) may be the computer readable storage medium.
[0033] FIG. 2 shows a system block diagram of computer system 101
used to execute the software of an embodiment of the invention. As
in FIG. 1, computer system 101 includes monitor 201, and keyboard
209. Computer system 101 further includes subsystems such as a
central processor 203 (such as a Pentium III processor from Intel),
system memory 202, fixed storage 210 (e.g., hard drive), removable
storage 208 (e.g., floppy or CD-ROM), display adapter 206, speakers
204, and network interface 211. Other computer systems suitable for
use with the invention may include additional or fewer subsystems.
For example, another computer system may include more than one
processor 203 or a cache memory. Computer systems suitable for use
with the invention may also be embedded in a measurement
instrument.
III. Analysis of Hybridization of Probe Sets and Their Targets
[0034] The method of the invention will be explained in great
detail using the above terminology associated with Affymetrix
GeneChip.RTM. probe arrays. One of skill in the art would appreciate
that the method of the invention is generally applicable to
biological analysis using multiple probes (or other means of
obtaining multiple measurements against one biological variable,
such as level of a transcript, etc.).
[0035] A typical situation for current implementation and usage of
the GeneChip.RTM. probe array expression analysis is that there are
10, 15 or 20 probe pairs for each gene and a group of experiments
to be compared among each other. As is apparent to those skilled in
the art, the current invention is not limited to any particular
number of probe pairs. Preferably, the methods, systems and
inventions are used to analyze data from experiments that employ at
least two probe pairs, more preferably more than five probe pairs.
[0036] Some embodiments of the methods, systems and computer
software products of the invention are based upon an algorithm for
analyzing gene expression levels. The following notations are used
to describe preferred embodiments. One of skill in the art would
appreciate that the specific notations and mathematical equations
are provided for the purpose of best describing the invention. The
methods, systems and computer software products of the invention
are not limited by the specific notations or equations.
[0037] In a gene expression experiment, the target transcripts are
denoted as
[0038] $t_1, t_2, t_3, \ldots$
[0039] If multiple experiments are conducted, the experiments are
denoted as
[0040] $E_1, E_2, E_3, \ldots$
[0041] Nucleic acid probe arrays (chips) are denoted as
[0042] $A_1, A_2, A_3, \ldots$
[0043] Probes on a chip are denoted as
[0044] $p_1, p_2, p_3, \ldots$
[0045] $X(p_j)$ is the x coordinate of the cell containing probe $j$.
[0046] $Y(p_j)$ is the y coordinate of the cell containing probe $j$.
[0047] $T$ = the set of all transcripts potentially existing in the
target solution for any experiment, $T = \{t_1, t_2, t_3, \ldots\}$.
[0048] $E_i(t_j)$ = the concentration of transcript $t_j$ in
experiment $E_i$ (this will be zero for many combinations).
[0049] $D(p_j)$ = the transcript that probe $p_j$ is designed to
measure, $D(p_j) = t_j$.
[0050] A model is provided to relate the observed intensity for a
particular probe (such as an oligonucleotide sequence) on a chip
to the hybridization of that oligonucleotide to transcripts in the
target solution. The model explicitly describes the contribution of
"perfect match" hybridization and cross hybridization to the
measured intensity.
[0051] For each probe $p_j$:
$$\alpha I_j - \beta = \Gamma_j C_{D(p_j)} + \chi_{j,\,T-\{D(p_j)\}} \qquad (1)$$
where
$C_{D(p_j)}$ = the concentration of the transcript measured by
$p_j$, $C = E_i(D(p_j))$, $C(E_i)$;
$I_j$ = the measured intensity, $I(E_i, A_i, p_j)$;
$\alpha$ = the spatial variation correction factor,
$\alpha(X(p_j), Y(p_j), E_i, A_i)$;
$\beta$ = the uniform offset (background) correction,
$\beta(E_i, A_i)$;
$\Gamma_j$ = the hybridization affinity for probe $p_j$,
$\Gamma(p_j, D(p_j))$;
$\chi_{j,\,T-\{D(p_j)\}}$ = the cross hybridization affinity for
probe $p_j$,
$$\chi_{j,\,T-\{D(p_j)\}} = \sum_{t_k \neq D(p_j)} E_i(t_k)\,\delta(t_k, p_j)$$
[0052] where $\delta(t_k, p_j)$ is the affinity of probe $p_j$
to transcript $t_k$.
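By way of illustration only, the cross hybridization term defined
above may be sketched in Python as follows. The function name, the
dictionary layout and all numeric values are hypothetical and are
not taken from the specification; the sketch merely sums the product
of concentration and affinity over every transcript other than the
one the probe is designed to measure.

```python
def cross_hybridization(concentrations, affinities, intended):
    """Sum E_i(t_k) * delta(t_k, p_j) over all t_k other than D(p_j).

    concentrations: transcript name -> concentration E_i(t_k)
    affinities:     transcript name -> affinity delta(t_k, p_j) of one probe
    intended:       name of the transcript D(p_j) the probe is designed for
    """
    return sum(
        conc * affinities.get(transcript, 0.0)
        for transcript, conc in concentrations.items()
        if transcript != intended
    )


# Hypothetical values for one probe p_j designed to measure "t1".
concentrations = {"t1": 5.0, "t2": 2.0, "t3": 0.0}
delta = {"t1": 0.9, "t2": 0.1, "t3": 0.4}
chi = cross_hybridization(concentrations, delta, intended="t1")
```

In this sketch the intended transcript is excluded from the sum, so
only "t2" and "t3" contribute to the result.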
[0053] It may be helpful to look at equation (1) without all the
subscripts: $\alpha I - \beta = \Gamma C + \chi$ (2)
[0054] The left-hand side of equation (2) represents the measured
intensity for probe $p_j$ after all uniform effects have been
removed. These uniform effects do not depend on the sequence of the
probe or on the sequences in the target solution. In other words,
they are sequence independent effects that depend on the experiment
and on the manufacturing characteristics of the chip. We will call
the left-hand side of equation (1) the adjusted intensity.
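By way of illustration only, the adjusted intensity may be computed
as in the following Python sketch. The function name and all numeric
values are hypothetical. A raw intensity is first simulated from
assumed values of the right-hand side terms, and the uniform effects
are then removed to recover the adjusted intensity.

```python
def adjusted_intensity(measured, alpha, beta):
    """Left-hand side of equation (2): alpha * I - beta."""
    return alpha * measured - beta


# Hypothetical values, chosen only to exercise the formula.
gamma, concentration, chi = 2.0, 4.0, 3.0   # right-hand side terms
alpha, beta = 1.25, 50.0                    # sequence independent corrections

# Simulate the raw intensity implied by equation (2), then remove the
# uniform effects to recover gamma * concentration + chi.
measured = (gamma * concentration + chi + beta) / alpha
recovered = adjusted_intensity(measured, alpha, beta)
```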
[0055] The right-hand side of equation (2) describes the effect of
hybridization on the adjusted intensity. The first term states that
the adjusted intensity is a linear function of the target solution
concentration. Specifically, it is a linear function of the
concentration of the transcript in the target solution that
contains the probe sequence. The second term states that the
adjusted intensity also increases with cross hybridization, that
is, with the hybridization of the probe to all other transcripts in
the target solution.
[0056] Cross hybridization is not a uniform, sequence independent
process. It is not eliminated when the adjusted intensity is
computed. It is a complex and unknown process, but it is not random
and uniform. Correctly managing cross hybridization is the essential
new concept in the new algorithms.
[0057] To further simplify the notation, the adjusted intensity is
designated as $I'$. That is, $I' = \Gamma C + \chi$ (3)
[0058] Some embodiments of the algorithm assume that it is possible
to predict the value of $\Gamma$ (hybridization affinity for probes)
based on the sequence of the probe. Methods for predicting $\Gamma$
(hybridization affinity for probes) are described in, for example,
U.S. patent application Ser. No. 09/718,295, filed Nov. 21, 2000
and U.S. patent application Ser. No. 09/721,042, filed Nov. 21,
2000, both incorporated herein by reference for all purposes.
[0059] In some embodiments, a physical model that is based on the
thermodynamic properties of the sequence is used to predict the
array-based hybridization intensities of the sequence. Hybridization
propensities may be described by energetic parameters derived from
the probe sequence, and variations in hybridization and chip
manufacturing conditions will result in changes in these parameters
that can be detected and corrected. The values of weight
coefficients in the physical model may be determined by empirical
data because these values are influenced by assay conditions, which
include hybridization and target fragmentation, and probe synthesis
conditions, which include choice of substrates, coupling
efficiency, etc.
[0060] In one embodiment, a model experimental system is used to
generate empirical data and a computational model is used to
process these data to solve for the weight coefficients of the
physical model. These solved weight coefficients are in turn placed
back into the physical model, enabling it to predict the
hybridization behaviors of new sequences.
[0061] Equation (3) is divided by the known quantity $\Gamma$ to
get
$$\frac{I'}{\Gamma} = C + \frac{\chi}{\Gamma}$$
Because cross hybridization is difficult to eliminate completely,
$\chi \geq 0$. Writing $I_1$ and $\Gamma_1$ for the adjusted
intensity and the affinity of a given probe, it follows that
$$\frac{I_1}{\Gamma_1} \geq C \quad \text{and that} \quad \frac{I_1}{\Gamma_1} = C \quad \text{only if} \quad \chi = 0$$
This means that if
$I_1/\Gamma_1, I_2/\Gamma_2, I_3/\Gamma_3, \ldots$
is a collection of concentration estimates for a particular
transcript, based on a collection of different probes for that
transcript, then the best estimate for the concentration of that
transcript is
$$\min\left\{\frac{I_1}{\Gamma_1}, \frac{I_2}{\Gamma_2}, \frac{I_3}{\Gamma_3}, \ldots\right\} \qquad (4)$$
[0062] This estimate relies on the assumption that the probes
corresponding to a transcript will all respond differently to cross
hybridization, and that at least a few of them will have very
little cross hybridization.
[0063] The minimization in equation (4) does not require any
assumptions about the stochastic behavior of cross hybridization.
This provides a great advantage since cross hybridization is not
well modeled as a random process.
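By way of illustration only, the minimum estimate may be sketched in
Python as follows. The function name and the numeric values are
hypothetical. The example constructs adjusted intensities from an
assumed concentration, assumed affinities and assumed non-negative
cross hybridization terms, one of which is zero, and then takes the
minimum of the affinity-corrected values.

```python
def estimate_concentration(adjusted_intensities, affinities):
    """Return the minimum of I_j / Gamma_j over the probes of one transcript."""
    if len(adjusted_intensities) != len(affinities):
        raise ValueError("one affinity is required per intensity")
    if not adjusted_intensities:
        raise ValueError("at least one probe is required")
    if any(gamma <= 0 for gamma in affinities):
        raise ValueError("affinities must be positive")
    return min(i / g for i, g in zip(adjusted_intensities, affinities))


# Hypothetical probe set: true concentration C = 4.0, three probes.
true_concentration = 4.0
affinities = [2.0, 0.5, 1.0]   # Gamma_j
cross_hyb = [3.0, 0.0, 1.5]    # chi_j >= 0; the second probe has none

# Adjusted intensities generated from I' = Gamma * C + chi.
intensities = [g * true_concentration + x for g, x in zip(affinities, cross_hyb)]

estimate = estimate_concentration(intensities, affinities)
```

Every ratio in this sketch is at least the assumed concentration, and
the probe with no cross hybridization attains it, so the minimum
recovers the assumed value without any statistical model of the
cross hybridization terms.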
IV. Computer Implemented Methods, Computer Software and Systems for
Multiple Probe Data Analysis
[0064] In one aspect of the invention, computer implemented methods
are used to analyze nucleic acid hybridization. The methods are
particularly suitable for analyzing multiple probe array based gene
expression analysis. FIG. 3 shows a process for some embodiments of
the invention. Intensity values for a set of probes (I.sub.1,
I.sub.2 . . . I.sub.n) are inputted (301). The probe set is designed
to interrogate one transcript. In preferred embodiments, a probe set
has at least 5, 10, 15 or 20 probes. The probes may be designed to
be a perfect match with the target transcript. Alternatively, some
of the probes may be designed as mismatch controls. The intensity
values may be the measured values for the perfect match probes.
Alternatively, they may be the difference between the intensities
for the perfect match probes and those of the mismatch probes. One
of skill in the art would appreciate that the intensity values may
be adjusted or normalized for background, non-specific bindings,
etc.
[0065] The intensity values are adjusted using hybridization
affinities of the probes. The predicted or measured hybridization
affinity of the probes with their target may be pre-calculated and
stored in a database (302). Hybridization affinity may be measured
experimentally by hybridizing probes with their intended targets. In
addition, hybridization affinity for probes may be predicted based
upon the sequences of the probes. Methods, software products and
systems for predicting hybridization affinity of probes are
disclosed in, for example, U.S. patent applications Ser. No.
______, Attorney Docket Number 3359, filed on Nov. 21, 2000, and
Ser. No. ______, Attorney Docket Number 3367, filed on Nov. 21,
2000, both incorporated herein by reference for all purposes.
[0066] One of skill in the art would appreciate that the methods,
software products and systems are not limited to any particular
model or method for predicting hybridization affinity of probes.
Rather, the current invention may employ any suitable methods for
predicting hybridization affinity. However, for illustration
purposes, some preferred methods for predicting hybridization
affinity are discussed below.
[0067] In this particular method, a physical model that is based on
the thermodynamic properties of the sequence is used to predict the
array-based hybridization intensities of the sequence. Hybridization
propensities may be described by energetic parameters derived from
the probe sequence, and variations in hybridization and chip
manufacturing conditions will result in changes in these parameters
that can be detected and corrected.
[0068] The values of weight coefficients in the physical model may
be determined by empirical data because these values are influenced
by assay conditions, which include hybridization and target
fragmentation, and probe synthesis conditions, which include choice
of substrates, coupling efficiency, etc.
[0069] Basically, a target (T) hybridizes to its complementary
probe (P) to form a probe-target duplex ($P \cdot T$), and the
reaction is accompanied by a favorable free energy change. The
amplitude of the free energy change ($\Delta G$) determines the
stability of the probe-target duplex. The duplex stability can be
described by the equilibrium constant ($K_s$), which is
sequence-dependent. The relationship between $K_s$ and $\Delta G$
may be given by Boltzmann's equation
$$K_s = \frac{k_{on}}{k_{off}} = e^{-\Delta G/RT} \qquad (4)$$
where $k_{on}$ and $k_{off}$ are the rate constants for association
and dissociation, respectively, of the probe-target duplex, $R$ is
the gas constant and $T$ is the absolute temperature. According to
Equation 4, $\Delta G$ is a function of the sequence. The dependence
of $\Delta G$ on probe sequence can be quite complicated, but
relatively simple models for $\Delta G$ have yielded good results.
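By way of illustration only, the relationship between the
equilibrium constant and the free energy change may be evaluated as
in the following Python sketch. The function name, the choice of
kcal/mol units and the example temperature and free energy values
are assumptions made for the example and are not taken from the
specification.

```python
import math

# Gas constant in kcal/(mol*K); the unit choice is an assumption of this sketch.
GAS_CONSTANT = 1.987e-3


def equilibrium_constant(delta_g, temperature):
    """K_s = exp(-delta_G / (R * T)), delta_g in kcal/mol, temperature in kelvin."""
    if temperature <= 0:
        raise ValueError("absolute temperature must be positive")
    return math.exp(-delta_g / (GAS_CONSTANT * temperature))


# Hypothetical free energy changes at a hypothetical temperature of 318.15 K.
weak = equilibrium_constant(-5.0, 318.15)
strong = equilibrium_constant(-10.0, 318.15)
```

A more favorable (more negative) free energy change gives a larger
equilibrium constant, and a free energy change of zero gives an
equilibrium constant of one.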
[0070] There are a number of ways to establish the relationship
between the sequence and $\Delta G$. In preferred embodiments, Nathan
Hunt's simple model (See, U.S. application Ser. No. 09/721,042,
filed Nov. 21, 2000, previously incorporated by reference) is used;
it works the best in some embodiments of the invention:
$$\Delta G_{seq} = \sum_{i=1}^{3N} P_i S_i \qquad (5)$$
$$\Delta G_{sep} = \sum_{i=1}^{3N} P_i S_i + C \qquad (6)$$
where $N$ is the length (number of bases) of a probe and $P_i$ is
the value of the ith parameter, which reflects the $\Delta G$ of a
base in a given sequence position relative to a reference base in
the same position. In preferred embodiments, the reference base is
A. In this case, the $P_i$'s will be the free energy of a base in a
given position relative to base A in the same position.
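By way of illustration only, the sum over 3N parameters may be
sketched in Python as follows. The specification does not define
the variables S.sub.i; the sketch assumes that each S.sub.i is a 0/1
indicator of whether a given non-reference base (C, G or T, in that
hypothetical order) occupies a given position, so that a reference
base A contributes nothing. The function names and parameter values
are likewise hypothetical.

```python
# Assumed ordering of the three non-reference bases within each position.
NON_REFERENCE_BASES = "CGT"


def encode(sequence):
    """Encode a probe of N bases as 3N indicator values (assumed S_i)."""
    indicators = []
    for base in sequence.upper():
        if base not in "ACGT":
            raise ValueError(f"unexpected base: {base!r}")
        indicators.extend(1 if base == b else 0 for b in NON_REFERENCE_BASES)
    return indicators


def delta_g_seq(sequence, parameters):
    """Sum of P_i * S_i over the 3N parameters, as in equation (5)."""
    indicators = encode(sequence)
    if len(parameters) != len(indicators):
        raise ValueError("3N parameters are required for a probe of N bases")
    return sum(p * s for p, s in zip(parameters, indicators))


# Hypothetical parameters for a probe of length N = 2 (3N = 6 values).
parameters = [0.1, 0.2, 0.3, -0.5, 0.6, 0.7]
value = delta_g_seq("AC", parameters)
```

For the probe "AC" only the indicator for base C at the second
position is set, so the sum reduces to the single parameter for that
base and position.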
[0071] Based on the simple hybridization scheme described above,
the hybridization intensity is proportional to the concentration of
probe-target duplex, where $C_0$ is constant. Under equilibrium
condition, the intensity is directly related to $\Delta G$. This
relationship is also expressed in natural logarithm form, where
$C_1$ and $C_2$ are constants. The relationship between
intensity and probe sequence is described below:
$$I = C_0 [P \cdot T]$$
$$[P \cdot T] = K_s [P][T] = e^{-\Delta G/RT} [P][T]$$
$$\ln I = -\Delta G/RT + \ln\{C_0 [P][T]\}$$
$$\ln I = C_1 \sum_{i=1}^{3N} P_i S_i + C_2$$
$$\ln I = \sum_{i=1}^{3N} C_1 P_i S_i + C_2 = \sum_{i=1}^{3N} W_i S_i + C_2$$
where $W_i = C_1 P_i$. The following is a linear regression model
for probes of N bases in length using a training data set that
contains intensity values of M probes:
$$\ln(I_1) = W_1 S_{1,1} + W_2 S_{2,1} + \cdots + W_{3N} S_{3N,1}$$
$$\ln(I_2) = W_1 S_{1,2} + W_2 S_{2,2} + \cdots + W_{3N} S_{3N,2}$$
$$\vdots$$
$$\ln(I_M) = W_1 S_{1,M} + W_2 S_{2,M} + \cdots + W_{3N} S_{3N,M}$$
[0072] The contribution to hybridization intensity (relative to a
reference base, such as an A) of each type of base at each position
in the probe sequence can be solved, so that hybridization
intensities may be predicted. Multiple linear regression
analysis is well known in the art, see, for example, electronic
statistic book (http://www.statsofthe.com/textbook/stathome.html);
Darlington, R. B. (1990) Regression and linear models, New York:
McGraw-Hill, both incorporated by reference for all purposes.
Computer software packages, such as SAS, SPSS and MatLab 5.3, provide
multiple linear regression functions. In addition, computer software
code examples suitable for performing multiple linear regression
analysis are provided in, for example, the Numerical Recipes (NR)
books developed by Numerical Recipes Software and published by
Cambridge University Press (CUP, with U.K. and U.S. web sites).
[0073] In a preferred embodiment, a set of probes of different
sequences (probes 1 to M) is used as probes in experiment(s).
Hybridization affinities (relative $\Delta G$ or Ln(I)) of the
probes with their target are experimentally measured to obtain a
training data set (see, example section infra). Multiple linear
regression may be performed using hybridization affinities as I
[$I_1 \ldots I_M$] to obtain a set of weight coefficients [$W_1
\ldots W_{3N}$]. The weight coefficients are then used to predict
the hybridization affinities.
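By way of illustration only, the regression step may be sketched in
pure Python as follows. The sketch is not the regression code of any
of the packages named above. It assumes the same hypothetical 0/1
indicator encoding with reference base A, fits the intercept
C.sub.2 of the logarithmic relationship as an additional column,
and solves the least squares normal equations by Gaussian
elimination. The function names and the training values are
hypothetical.

```python
import math

NON_REFERENCE_BASES = "CGT"  # assumed ordering; A is the reference base


def encode(sequence):
    """Encode a probe of N bases as 3N indicator values (assumed S_i)."""
    row = []
    for base in sequence.upper():
        row.extend(1.0 if base == b else 0.0 for b in NON_REFERENCE_BASES)
    return row


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; more training probes are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_weights(sequences, intensities):
    """Least squares fit of ln(I) = sum(W_i * S_i) + C_2; returns (W, C_2)."""
    rows = [encode(s) + [1.0] for s in sequences]  # last column fits C_2
    y = [math.log(i) for i in intensities]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * t for r, t in zip(rows, y)) for a in range(k)]
    solution = solve_linear_system(xtx, xty)
    return solution[:-1], solution[-1]


def predict_log_intensity(sequence, weights, c2):
    """Predicted ln(I) for a new probe sequence."""
    return sum(w * s for w, s in zip(weights, encode(sequence))) + c2


# Hypothetical training set for probes of length N = 1.
training = {"A": math.exp(1.0), "C": math.exp(1.5),
            "G": math.exp(2.0), "T": math.exp(0.5)}
weights, c2 = fit_weights(list(training), list(training.values()))
```

With one training probe per base, the fitted intercept equals the
log intensity of the reference probe "A", and each weight equals the
difference in log intensity between the corresponding base and A.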
[0074] Continuing the process in FIG. 3, the predicted or measured
probe hybridization affinity values may be stored in a database
(302) or a file. Alternatively, hybridization affinities may be
predicted as requested. Adjusted hybridization intensity values may
be calculated (301) as
[0075] $I/\Gamma$.
[0076] An adjusted hybridization intensity may be calculated for
each probe (or probe pair if the intensity is the difference
between a perfect match probe and a mismatch probe). In some other
embodiments, the adjusted hybridization intensity may not be a
simple ratio. One of skill in the art would appreciate that other
methods for calculating the relative or adjusted hybridization
intensities may also be used.
[0077] The minimal value of the adjusted intensity values (303) may
be used as a measurement of gene expression.
[0078] Computer software products for gene expression analysis are
also provided. The products may contain a computer readable medium
containing code for performing the method steps discussed above
and in FIG. 3.
[0079] Computer systems for gene expression analysis are provided.
The systems have a processor and a memory coupled to the processor,
the memory storing machine instructions that cause the processor to
perform a plurality of logical steps when implemented by the
processor. The logic steps include the analysis steps discussed
above.
[0080] Many embodiments of the invention are particularly useful
for analyzing gene expression using nucleic acid probe arrays. As
described above, such arrays contain a large number of sets of
probes. Each set is used for measuring one transcript. In such
embodiments, the process in FIG. 3 is repeated for each probe set.
It is generally preferable that all the probe sets (each is for
measuring one transcript) in a probe array are analyzed using the
methods, software or system of the invention. However, in some
embodiments, a subset of the probe sets may be analyzed using the
methods, systems and software of the invention.
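By way of illustration only, repeating the process of FIG. 3 for
each probe set may be sketched in Python as follows. The function
name, the dictionary standing in for the affinity database and all
identifiers and values are hypothetical. For each probe set the
sketch divides each intensity by the stored affinity of its probe
and keeps the minimal adjusted value.

```python
def expression_measurements(probe_sets, affinity_db):
    """Map each transcript to the minimal adjusted intensity of its probe set.

    probe_sets:  transcript name -> {probe id -> intensity value}
    affinity_db: probe id -> hybridization affinity (stands in for database 302)
    """
    results = {}
    for transcript, intensities in probe_sets.items():
        if not intensities:
            raise ValueError(f"probe set for {transcript!r} is empty")
        adjusted = [
            intensity / affinity_db[probe_id]
            for probe_id, intensity in intensities.items()
        ]
        results[transcript] = min(adjusted)
    return results


# Hypothetical input: two probe sets and their stored affinities.
probe_sets = {
    "t1": {"p1": 11.0, "p2": 2.0, "p3": 5.5},
    "t2": {"p4": 3.0, "p5": 9.0},
}
affinity_db = {"p1": 2.0, "p2": 0.5, "p3": 1.0, "p4": 1.5, "p5": 3.0}

measurements = expression_measurements(probe_sets, affinity_db)
```

Analyzing only a subset of the probe sets corresponds, in this
sketch, to passing a dictionary that contains only the transcripts
of interest.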
CONCLUSION
[0081] The present inventions provide methods and computer software
products for analyzing gene expression profiles. It is to be
understood that the above description is intended to be
illustrative and not restrictive. Many variations of the invention
will be apparent to those of skill in the art upon reviewing the
above description. By way of example, the invention has been
described primarily with reference to the use of a high density
oligonucleotide array, but it will be readily recognized by those
of skill in the art that other nucleic acid arrays, other methods
of measuring transcript levels and gene expression monitoring at
the protein level could be used. The scope of the invention should,
therefore, be determined not with reference to the above
description, but should instead be determined with reference to the
appended claims, along with the full scope of equivalents to which
such claims are entitled.
[0082] All cited references, including patent and non-patent
literature, are incorporated herewith by reference in their
entireties for all purposes.
* * * * *