U.S. patent application number 09/841580 was filed with the patent office on 2002-04-11 for characterization of phenotypes by gene expression patterns and classification of samples based thereon.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Califano, Andrea, Stolovitzky, Gustavo A., Tu, Yuhai.
Application Number | 20020042681 09/841580 |
Document ID | / |
Family ID | 26930830 |
Filed Date | 2002-04-11 |
United States Patent
Application |
20020042681 |
Kind Code |
A1 |
Califano, Andrea ; et
al. |
April 11, 2002 |
Characterization of phenotypes by gene expression patterns and
classification of samples based thereon
Abstract
Generally, the present invention applies a transformation to
convert a probability distribution of gene expression signals in
control samples to a uniform distribution. The uniform distribution
allows better comparisons between expression levels for genes. The
transformation is derived from gene expression signals of control
data, and is applied to gene expression signals of phenotype data.
The phenotype data can be represented in a matrix format. A number
of gene expression patterns may be determined (in the form of
submatrices) that will characterize the phenotype. The uniform
distribution helps in this regard, as it allows better comparisons
of patterns. The gene expression patterns can then be used to
classify samples as belonging to the phenotype set. Preferably, a
discriminant function is used to compare a sample with the gene
expression patterns that characterize the phenotype. The
discriminant function can determine a score that can be used to
determine whether the sample belongs to the phenotype.
Inventors: |
Califano, Andrea; (New York,
NY) ; Stolovitzky, Gustavo A.; (Bronx, NY) ;
Tu, Yuhai; (Ossining, NY) |
Correspondence
Address: |
Ryan, Mason & Lewis, LLP
Suite 205
1300 Post Road
Fairfield
CT
06430
US
|
Assignee: |
International Business Machines
Corporation
Armonk
NY
|
Family ID: |
26930830 |
Appl. No.: |
09/841580 |
Filed: |
April 24, 2001 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
60237590 |
Oct 3, 2000 |
|
|
|
Current U.S.
Class: |
702/20 |
Current CPC
Class: |
G16B 25/00 20190201;
G16B 25/10 20190201; G16B 40/00 20190201 |
Class at
Publication: |
702/20 |
International
Class: |
G06F 019/00; G01N
033/48 |
Claims
What is claimed is:
1. A method for transforming gene expression signals, the method
comprising the steps of: determining a plurality of gene expression
signals for a gene; and deriving a transformation that renders
uniform, within a selected interval, a distribution of transformed
gene expression signals for the gene.
2. The method of claim 1, further comprising the step of applying
the transformation to an additional gene expression signal.
3. The method of claim 1, wherein the step of deriving comprises
the steps of: determining a function that approximates the
distribution; and using the function to create the transformation,
wherein the transformation renders uniform a probability
distribution of the transformed gene expression signals, and
wherein each gene expression signal is mapped by the transformation
into a transformed gene expression signal.
4. A method for the characterization of phenotypes by gene
expression patterns and classification of samples based thereon,
the method comprising the steps of: a) forming, from a plurality of
first genes and first samples of a first phenotype, a first matrix
of gene expression signals for the first phenotype, each gene
expression signal corresponding to one of the first genes and one
of the first samples; b) forming, from a plurality of selected
genes and selected samples of a selected phenotype, a second matrix
of gene expression signals for the selected phenotype, each gene
expression signal corresponding to one of the selected genes and
one of the selected samples, wherein each selected gene corresponds
to one of the first genes; c) deriving a plurality of
transformations, wherein each transformation renders uniform,
within a selected interval, a distribution of transformed gene
expression signals for one of the first genes, whereby each
transformation corresponds to a selected gene; d) applying, for
each of the selected genes, one of the transformations to the gene
expression signals of a corresponding one of the selected genes of
the second matrix; and e) defining at least one submatrix of the
second matrix, wherein each transformed gene expression signal for
each gene of the submatrix falls within a selected range.
5. The method of claim 4, wherein a number of first genes is the
same as a number of selected genes.
6. The method of claim 4, wherein a number of selected genes is
less than a number of first genes.
7. The method of claim 4, wherein each of the at least one
submatrices is a largest distinct submatrix that satisfies step
(e).
8. The method of claim 7, wherein the at least one submatrix
comprises a plurality of submatrices.
9. The method of claim 4, wherein step (c) comprises the steps of,
for each of the transformations: i) determining a function that
approximates the distribution of expression levels for a
corresponding one of the first genes; and ii) using the function to
create the transformation corresponding to the one first gene,
wherein the transformation renders uniform a probability
distribution of the transformed gene expression signals, and
wherein each gene expression signal is mapped by the transformation
into a transformed gene expression signal.
10. The method of claim 9, wherein step (d) comprises the steps of:
i) selecting a gene expression signal in the second matrix; ii)
selecting one transformation that corresponds to the gene
expression signal; iii) determining a transformed gene expression
signal by using the one transformation and the gene expression
signal; and iv) performing the preceding three steps, selecting
another gene expression signal each time, until all of the gene
expression signals have been selected and their respective
transformed values determined.
11. The method of claim 4, wherein each of steps (a) and (b)
comprises a step of depositing biological material from an organism
upon a gene expression array.
12. The method of claim 11, wherein the biological material
comprises a marker substance operable to exhibit a perceptible gene
expression signal for indicating a relationship between a gene
element and the biological material.
13. The method of claim 12, wherein the marker substance is
operable to exhibit fluorescence as the perceptible gene expression
signal.
14. The method of claim 4, further comprising the steps of: f)
determining a gene expression signal from biological material of an
organism for each gene of each of the at least one submatrices; and
g) determining a deviation of each of the gene expression signals
of the organism from the selected range of each selected gene in
the at least one submatrix.
15. The method of claim 14, further comprising the step of: h)
determining if the deviation of each of the gene expression signals
from the biological material of the organism meets predetermined
criteria, whereby the organism exhibits the selected phenotype if
the deviation of each of the gene expression signals meets the
predetermined criteria.
16. The method of claim 15: wherein the at least one submatrix
comprises a plurality of submatrices; wherein step (f) comprises
the step of forming an organism vector from the gene expression
signals from the biological material of the organism; wherein step
(g) comprises the steps of: i) selecting at least one of the
plurality of submatrices; ii) converting each of the at least one
selected submatrices to a vector; iii) using the vectors for each
of the selected submatrices in a discriminant function to compare
the vectors from the submatrices to the organism vector; and
wherein step (h) comprises the step of determining if the
discriminant function returns a value greater than a threshold.
17. A system comprising: a memory that stores computer-readable
code; and a processor operatively coupled to the memory, the
processor configured to implement the computer-readable code, the
computer-readable code configured to: determine a plurality of gene
expression signals for a gene; and derive a transformation that
renders uniform, within a selected interval, a distribution of
transformed gene expression signals for the gene.
18. The system of claim 17, wherein the computer-readable code is
further configured to apply the transformation to an additional
gene expression signal.
19. The system of claim 17, wherein the computer-readable code is
further configured, during the step of deriving, to perform the
steps of: determine a function that approximates the distribution;
and use the function to create the transformation, wherein the
transformation renders uniform a probability distribution of the
transformed gene expression signals, and wherein each gene
expression signal is mapped by the transformation into a
transformed gene expression signal.
20. A system comprising: a memory that stores computer-readable
code; and a processor operatively coupled to the memory, the
processor configured to implement the computer-readable code, the
computer-readable code configured to: form, from a plurality of
first genes and first samples of a first phenotype, a first matrix
of gene expression signals for the first phenotype, each gene
expression signal corresponding to one of the first genes and one
of the first samples; form, from a plurality of selected genes and
selected samples of a selected phenotype, a second matrix of gene
expression signals for the selected phenotype, each gene expression
signal corresponding to one of the selected genes and one of the
selected samples, wherein each selected gene corresponds to one of
the first genes; derive a plurality of transformations, wherein
each transformation renders uniform, within a selected interval, a
distribution of transformed gene expression signals for one of the
first genes, whereby each transformation corresponds to a selected
gene; apply, for each of the selected genes, one of the
transformations to the gene expression signals of a corresponding
one of the selected genes of the second matrix; and define at least
one submatrix of the second matrix, wherein each transformed gene
expression signal for each gene of the submatrix falls within a
selected range.
21. The system of claim 20, wherein the computer-readable code is
further configured to: determine a gene expression signal from
biological material of an organism for each gene of each of the at
least one submatrices; and determine a deviation of each of the
gene expression signals from the biological material of the
organism from the selected range of each selected gene in the at
least one submatrix.
22. The system of claim 20, wherein the computer-readable code is
further configured to: determine if the deviation of each of the
gene expression signals from the biological material of the
organism meets predetermined criteria, whereby the organism
exhibits the selected phenotype if the deviation of each of the
gene expression signals meets the predetermined criteria.
23. An article of manufacture comprising: a computer readable
medium having computer readable code means embodied thereon, the
computer readable program code means comprising: a step to
determine a plurality of gene expression signals for a gene; and a
step to derive a transformation that renders uniform, within a
selected interval, a distribution of transformed gene expression
signals for the gene.
24. The article of manufacture of claim 23, wherein the
computer-readable code means further comprises a step to apply the
transformation to an additional gene expression signal.
25. The article of manufacture of claim 23, wherein the
computer-readable code means is further configured, during the step
of deriving, to perform: a step to determine a function that
approximates the distribution; and a step to use the function to
create the transformation, wherein the transformation renders
uniform a probability distribution of the transformed gene
expression signals, and wherein each gene expression signal is
mapped by the transformation into a transformed gene expression
signal.
26. An article of manufacture comprising: a computer readable
medium having computer readable code means embodied thereon, the
computer readable program code means comprising: a step to form,
from a plurality of first genes and first samples of a first
phenotype, a first matrix of gene expression signals for the first
phenotype, each gene expression signal corresponding to one of the
first genes and one of the first samples; a step to form, from a
plurality of selected genes and selected samples of a selected
phenotype, a second matrix of gene expression signals for the
selected phenotype, each gene expression signal corresponding to
one of the selected genes and one of the selected samples, wherein
each selected gene corresponds to one of the first genes; a step to
derive a plurality of transformations, wherein each transformation
renders uniform, within a selected interval, a distribution of
transformed gene expression signals for one of the first genes,
whereby each transformation corresponds to a selected gene; a step
to apply, for each of the selected genes, one of the
transformations to the gene expression signals of a corresponding
one of the selected genes of the second matrix; and a step to
define at least one submatrix of the second matrix, wherein each
transformed gene expression signal for each gene of the submatrix
falls within a selected range.
27. The article of manufacture of claim 26, wherein the
computer-readable code means further comprises: a step to determine
a gene expression signal from biological material of an organism
for each gene of each of the at least one submatrices; and a step
to determine a deviation of each of the gene expression signals
from the biological material of the organism from the selected
range of each selected gene in the at least one submatrix.
28. The article of manufacture of claim 26, wherein the
computer-readable code means further comprises: a step to determine
if the deviation of each of the gene expression signals from the
biological material of the organism meets predetermined criteria,
whereby the organism exhibits the selected phenotype if the
deviation of each of the gene expression signals meets the
predetermined criteria.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Application No. 60/237,590, filed Oct. 3, 2000.
FIELD OF THE INVENTION
[0002] The present invention relates to genotypes and phenotypes
and, more particularly, relates to the characterization of
phenotypes by gene expression patterns and classification of
samples based thereon.
BACKGROUND OF THE INVENTION
[0003] Gene expression microarrays are becoming very important
devices, as they allow researchers to test large quantities of
genetic material. These devices are important because they allow
researchers to analyze cells, and the genetic information of the
cells, to determine whether the cells belong to particular
phenotypes. A phenotype is the determinable characteristics of an
organism, as determined by genes and the relationship of the genes
to the environment. Researchers are currently examining such
phenotypes as cancer, diabetes and other diseases. Gene expression
microarrays are increasing in importance in this research.
[0004] There are different gene expression microarray technologies.
In typical implementations, gene expression microarrays contain
thousands of DeoxyriboNucleic Acid (DNA) molecules that represent
many genes. These DNA molecules are placed on discrete spots on the
microarray. Each of these DNA molecules may be thought of as part
of an "unzipped" piece of a gene that is waiting for a complement
to which it will be "zipped." The DNA molecules attached to the
microarrays are commonly called probes.
[0005] The complements, also called targets, generally come from
messenger RiboNucleic Acid (mRNA), which are basically the working
copies of genes within cells. When testing a cell from an organism,
the mRNA from a particular sample is purified and a marker is
attached to it. The mRNA is added to the gene expression microarray
and the mRNA "hybridizes" with the DNA to which it is a complement.
Thus, some of the discrete spots will contain mRNA hybridized to
DNA, and other spots will not contain mRNA hybridized to DNA. For
clarity, "targets" will be referred to as mRNA herein. In some
technologies, however, the cellular mRNA is reverse-transcribed
into complementary DNA (cDNA), which are complementry copies of
otherwise fragile mRNA. The cDNA is linearly amplified and is
subsequently used to hybridize with the probes.
[0006] The marker attached to the mRNA is used to determine which
spots contain mRNA hybridized to DNA. Usually, the markers are
fluorescent molecules that fluoresce when a laser light of an
appropriate frequency and power shines on them. This fluorescence
can be measured.
[0007] The fluorescence is a measure of how much a gene "expresses"
itself. If there is a high fluorescence for a particular gene, this
means that the gene is very active. Conversely, if there is low or
no fluorescence for a gene, this means that the gene is inactive.
Thus, by examining the fluorescence of the microarray, researchers
can determine the degree of activity of the different genes.
[0008] This method is advantageous in that it is possible to
determine gene function for tissue affected by a disease as
compared to tissue not affected by such disease. By comparing the
two phenotypes, researchers can in principle determine which genes
contribute to certain diseases and how they contribute.
[0009] When making these determinations, it is helpful to examine
genes from diverse groups of both people who have a disease and
people who do not have this disease (hereafter called "healthy,"
even though they may be affected by other conditions). Because
people are different, there will be differences in the expression
level of genes between subjects in the group. These differences
occur in both healthy and sick individuals. These differences will
be apparent during microarray analysis of samples for the various
people in the group. For instance, one person could have an
enzymatic deficiency (unrelated to the disease under study) which
causes a set of genes to be less expressive. Another person may not
have such deficiency and his or her corresponding genes express
themselves at a much higher level. Therefore, even for healthy
people, there is a variation in gene expression.
[0010] Because of these variations, which are further compounded
with errors in the experimental measurements, usually many
microarray samples are taken and analyzed. The microarray results
from this data can be used to make statistical analyses with the
aim of comparing sick cells and their genes with healthy cells and
their genes. From these analyses, researchers attempt to determine
which genes actually relate to the disease.
[0011] One method of determining this is to look for patterns in
the data. For example, perhaps one particular gene is turned on in
a sick cell, while another gene is turned off. This "pattern" can
be determined because the expression of one gene will be low, while
the expression of another gene will be high. Moreover, as
previously discussed, the researchers generally compare the
expressions from the genes of the unhealthy phenotype with the
expressions from the genes with the healthy phenotype. It helps to
compare genes from unhealthy cells with genes from healthy cells,
as the healthy cells provide a baseline. For instance, perhaps a
certain gene is almost always turned on in normal cells. Even
though a cell exhibiting an unhealthy phenotype might also have
this gene turned on, because normal cells also have this gene
turned on, it is likely that this gene does not relate to the
disease being researched. However, if the unhealthy cell has this
gene turned off and healthy cells generally have this cell turned
on, then it could be that this gene does relate to the disease
being researched.
[0012] There are various ways of analyzing gene expression data.
For a recent review on the methods, see "Genetic Network Inference:
From Co-expression Clustering to Reverse Engineering,"
Bioinformatics, August 2000, 16(8), 707-26, the disclosure of which
is incorporated by reference herein. Most of these methods analyze
the fluorescent outputs for a number of samples of a gene to
determine an "average" fluorescence for the gene. These values are
indicative of the expression level of a gene. With a number of
different genes, a "pattern" of these expression level averages can
be made. Typically, however, the fluorescence of a given gene
across several experiments can vary tremendously. In this
situation, the determined average is meaningless. Some researchers
attempt to alleviate this problem by determining the average
fluorescence value for a particular gene, evaluating the standard
deviation of the fluorescence value for the variety of samples for
the gene and, from this data, determining a normalized distribution
in which the standard deviation is one. The limitation of this
method is that it treats genes as independent, neglecting the
natural correlations between genes.
[0013] Thus, what is needed is a way of comparing expressions from
samples being examined with expression levels from control samples
and a better way of detecting gene-gene correlations when searching
for patterns.
SUMMARY OF THE INVENTION
[0014] Generally, the present invention applies a transformation to
convert a probability distribution of gene expression signals in
control samples to a uniform distribution. The uniform distribution
allows better comparisons between expression levels for genes. The
transformation is derived from gene expression signals of control
data, and is applied to gene expression signals of phenotype data.
The phenotype data can be represented in a matrix format. A number
of gene expression patterns may be determined (in the form of
submatrices) that will characterize the phenotype. The uniform
distribution helps in this regard, as it allows better comparisons
of patterns. The gene expression patterns can then be used to
classify samples as belonging to the phenotype set. Preferably, a
discriminant function is used to compare a sample with the gene
expression patterns that characterize the phenotype. The
discriminant function can determine a score that can be used to
determine whether the sample belongs to the phenotype.
[0015] A more complete understanding of the present invention, as
well as further features and advantages of the present invention,
will be obtained by reference to the following detailed description
and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1 is a computer system for characterization of
phenotypes by gene expression patterns and classification of
samples based thereon, in accordance with one embodiment of the
present invention;
[0017] FIG. 2 is a block diagram of a method for characterization
of phenotypes by gene expression patterns and classification of
samples based thereon, in accordance with one embodiment of the
present invention;
[0018] FIG. 3 illustrates a control matrix in accordance with one
embodiment of the present invention;
[0019] FIG. 4 illustrates a transformation from a probability
density of control samples to a uniform probability density in
accordance with one embodiment of the present invention;
[0020] FIG. 5 illustrates a phenotype matrix prior to
transformation, in accordance with one embodiment of the present
invention;
[0021] FIG. 6 illustrates an exemplary phenotype matrix after
transformation, in accordance with one embodiment of the present
invention;
[0022] FIG. 7 illustrates a delta-valid but not maximal gene
expression pattern in accordance with one embodiment of the present
invention;
[0023] FIG. 8 illustrates several possible delta-valid gene
expression patterns in accordance with one embodiment of the
present invention;
[0024] FIG. 9 shows a histogram of the log expression of gene
135;
[0025] FIG. 10 shows a graph of normalized probability density for
a number of full length human genes in 60 human cancer cell
lines;
[0026] FIG. 11 shows a graph of experimental results, using three
different methods including a method of the present invention, for
the classification of 7 samples of melanoma;
[0027] FIG. 12 shows a graph of experimental results, using three
different methods including a method of the present invention, for
the classification of 17 cancer samples with mutations in the p53
gene;
[0028] FIG. 13 shows a graph of experimental results, using three
different methods including a method of the present invention, for
the classification of 10 cancer samples whose growth is highly
inhibited by the drug Chloramucil;
[0029] FIG. 14 shows a graph of experimental results, using two
different methods including a method of the present invention, for
the classification of a complex synthetic phenotype; and
[0030] FIG. 15 shows a graph of experimental results, using a
method of the present invention, for the classification of real,
random control, and fully random data.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0031] Basically, the present invention is used to take an initial
set of expression data for one phenotype (generally called the
control set and containing information from healthy cells) and to
determine transformations from this data. The transformations are
applied to a set of expression data from another phenotype
(generally called the phenotype set and containing information from
unhealthy cells). The transformed set of data is used to determine
gene expression patterns that are characteristic of the phenotype.
New expression data from samples that have an unknown genetic
makeup are compared with the gene expression patterns. Based on
this comparison, the new samples are classified as belonging to one
of the two phenotypes.
[0032] Turning now to FIG. 1, this figure shows a block diagram of
a system 100 for characterization of phenotypes by gene expression
patterns and classification of samples based thereon. System 100
comprises a computer system 105 that comprises processor 110 and a
data storage device 120. System 100 also comprises gene microarrays
or chips 171, 172, 181, and 182, microarray processing devices 170,
180, and biological material 173, 174, 183, and 184. Data storage
device 120 comprises a phenotype classification method 200, a
control matrix 130, a phenotype matrix 135, transformations 140, a
transformed phenotype matrix 145, maximal patterns 150 through 155,
sample vector(s) 160 and classification(s) 165.
[0033] The personal computer 105 comprises a processor 110
operatively coupled to the data storage device 120. Data storage
device 120 will configure the processor 110 to implement the
methods, steps, and fimctions disclosed herein. The data storage
device could be distributed or local and the processor could be
distributed or singular. The data storage device could be
implemented as an electrical, magnetic or optical memory, or any
combination of these or other types of storage devices.
[0034] Biological materials 173 through 174 are deposited onto gene
microarrays 171 through 172. As is known in the art, the biological
materials 173 through 174 contain a marker or markers (not shown in
FIG. 1) that can be used to determine how much each gene probed on
gene microarrays 171 through 172 expresses itself in biological
materials 173 through 174. Generally, the biological materials 173
through 174 are derived from cells of organisms and are usually
purified messenger RiboNucleic Acid (mRNA). Microarray processing
device 170 processes the gene microarrays 171 through 172, of which
there are a sufficient number to provide a probability density. The
microarray processing device 170 produces a number for each gene of
each gene microarray of microarrays 171 through 172, and this
number is related to the expression level of the gene. Similarly,
biological materials 183 through 184 are deposited onto gene
microarrays 181 through 182. Biological materials 183 through 184
contains a marker or markers (not shown in FIG. 1) that can be used
to determine how much each gene probed on gene microarrays 181
through 182 expresses itself in the biological samples 183 through
184. Generally, the biological materials 183 through 184 are
derived from cells of organisms and are usually purified messenger
RiboNucleic Acid (mRNA). Microarray processing device 180 processes
the gene microarrays 181 through 182. Usually, biological materials
173 through 174 contain materials from healthy cells, while
biological materials 183 through 184 contain information from
unhealthy cells or cells from an organism that has a certain
disease or condition.
[0035] Generally, the data from microarray processing devices 170
and 180 are formed into matrices 130 and 135. Usually, biological
material is processed at one location and the data in the matrices
130, 135 is processed at another location. However, it is possible
to perform both steps at one location, as FIG. 1 shows.
[0036] Phenotype classification method 200, shown in FIG. 2,
determines transformations 140 from control matrix 130. Control
matrix 130 is a matrix of experimental data from control samples.
Generally, such control samples will be samples of messenger
RiboNucleic Acid (mRNA) from healthy cells taken from a variety of
individuals. However, there is no requirement that these samples
come from healthy cells or from a variety of individuals, and other
types of cells or homogeneous samples may be used. The control
matrix 130 preferably contains columns corresponding to particular
genes and rows corresponding to particular experiments. Each column
of the matrix is used to determine one transformation of
transformations 140. Each column of the matrix will contain a
number of expressions from one gene. These expressions will be
distributed in a particular manner. Such probability distributions
indicate how probable it is that a gene expressed itself at a
particular value. These values are generally based on the amount of
luminescence detected from certain points on a gene microarray.
Each transformation preferably converts a distribution to a uniform
distribution.
[0037] To transform the probability distribution to a uniform
distribution, an integral of the probability density function is
performed. The integral may be numerically performed directly on
the probability density function data. Alternatively, an equation
for the probability density function may be derived and the method
will use the equation when integrating.
[0038] The transformations 140 are applied to the corresponding
elements of the phenotype matrix 135. Phenotype matrix 135
preferably contains columns corresponding to genes and rows
corresponding to experimental expressions. Each element of the
phenotype matrix 135 is converted from its current expression value
to another value. To determine the second value, a transformation
for the gene is used. This is explained in more detail below, in
reference to FIG. 4 and other figures.
[0039] After transformation, transformed phenotype matrix 145
occurs. It is not necessary to have a separate matrix from
phenotype matrix 135; instead, each element of phenotype matrix 135
can be transformed and replace the original element. However, a
transformed phenotype matrix 145 is shown in FIG. 1 for clarity.
Method 200 will search the transformed phenotype matrix 145 for
patterns and determine maximal patterns 150 through 155. Generally,
there will be one or more of the maximal patterns 150 through 155.
The maximal patterns can be used, preferably as part of a
discriminant function, when examining sample vector(s) 160 to
determine the classification(s) 165 of the vectors. The maximal
patterns may be thought of as gene expression patterns. The
classification indicates whether the sample vector is part of the
phenotype matrix or part of the control matrix. Optionally,
additional processing may be used to eliminate some of the maximal
patterns and create a subset of the gene expression patterns that
are better indicators of the phenotype.
[0040] It should be noted that FIG. 1 may have fewer or more
components than what are shown. For example, when determining
maximal patterns, it may be beneficial to store intermediate
patterns. These intermediate patterns are not shown in FIG. 1.
Similarly, once maximal patterns 150 through 155 are determined,
the control matrix 130, phenotype matrix 135, transformations 140,
and transformed phenotype matrix 145 could be discarded. Sample
vector 160 and 165 might not reside in data storage device 120
while, for instance, control matrix 130 resides in data storage
device 120.
[0041] As is known in the art, the methods and apparatus discussed
herein may be distributed as an article of manufacture that itself
comprises a computer readable medium having computer readable code
means embodied thereon. The computer readable program code means is
operable, in conjunction with a computer system such as computer
system 105, to carry out all or some of the steps to perform the
methods or create the apparatuses discussed herein. The computer
readable medium may be a recordable medium (e.g., floppy disks,
hard drives, compact disks, or memory cards) or may be a
transmission medium (e.g., a network comprising fiber-optics, the
world-wide web, cables, or a wireless channel using time-division
multiple access, code-division multiple access, or other
radio-frequency channel). Any medium known or developed that can
store information suitable for use with a computer system may be
used. The computer-readable code means is any mechanism for
allowing a computer to read instructions and data, such as magnetic
variations on a magnetic media, or height variations on the surface
of a compact disk.
[0042] Before proceeding to a more detailed discussion of method
200, it is helpful to discuss the following definitions that will
be used in this disclosure.
[0043] Gene Expression Matrix: The result of a DNA microarray
experiment is a collection of gene expression levels, the level for
each gene being roughly proportional to the concentration of the
mRNA transcribed from that particular gene in the cell. N.sub.g is
the number of different gene probes in the microarray. N.sub.e is
the number of microarray samples (i.e., experiments or cells).
Thus, a set of DNA microarray experiments is conveniently
represented by an N.sub.e.times.N.sub.g gene expression matrix
V={v.sub.eg}, where e is the experiment index and g is the gene
index. In this instance, each v.sub.eg is a transformed expression
level according to Equation (1) of the g-th gene in the e-th
sample. If the set is the control set, then the transformed gene
expression values v.sub.eg will be approximately uniformly
distributed. The term gene expression matrix can thus refer to both
transformed and not-transformed expression matrices.
[0044] Gene vector and Experiment vector: A list of gene
identifications G={g.sub.1, . . . g.sub.k}, with
1.ltoreq.g.sub.1<g.sub.2< . . . <g.sub.k.ltoreq.N.sub.g is
called a gene vector. A list of experiments E={e.sub.1, . . . ,
e.sub.j}, with 1.ltoreq.e.sub.1<e.sub.2< . . .
<e.sub.j.ltoreq.N.sub.e is an experiment vector.
[0045] .delta.-valid jk-patterns: Let V be a gene expression
matrix, then a gene vector G and an experiment vector E uniquely
define a j.times.k submatrix
V.sub.E,G={v.sub.e.sub..sub.1.sub.g.sub..sub.m} of V. Given
.delta.>0, V.sub.G,E is a .delta.-valid jk-pattern if each
column is tightly clustered in an interval of size up to .delta..
By this, it is meant that the maximum and the minimum value of each
column must differ by less than .delta.. The length of the
experiment vector j is called the support of the jk-pattem. If
.delta. is small, each gene in a jk-pattern is expressed at
approximately the same level across all the experiments in the
experiment vector. However, because these are transformed values,
the actual gene expression interval may be large.
[0046] Maximal patterns: A .delta.-valid jk-pattern is maximal if
the following two conditions hold: (1) it cannot be extended into a
.delta.-valid jk'-pattern, with k'>k, by adding genes to its
gene vector, and (2) it cannot be extended into a .delta.-valid
j'k-pattern, with j'>j, by adding experiments to its experiment
vector.
[0047] Turning now to FIG. 2, this figure shows a method 200 for
characterization of phenotypes by gene expression patterns and
classification of samples based thereon. The method is used
whenever it is desired that expression data from a control set be
analyzed, expression data from a phenotype set be analyzed, and
samples classified according to whether they belong to the control
or phenotype set.
[0048] The method begins in step 210, when a control matrix is
formed. A control matrix will preferably contain columns that
correspond to specific genes and rows that correspond to
experimental samples. The samples used will usually belong to one
specific phenotype. Generally, the experimental samples will be
from a variety of healthy cells, such that the phenotype is
"normal" or "healthy." Such phenotypes may be called "cell
phenotypes," which is a term generally used to indicate a common
property of a set of cells. For instance, a cancer morphology, such
as melanoma, is a typical cell phenotype. More subtle cell
phenotypes are also possible and useful. For example, there is a
"p53 phenotype" that identifies cancer cells with mutations in the
p53 tumor suppressor gene. Also, by measuring the drug
concentration required to inhibit by 50% the cell line growth, the
so-called GI.sub.50, it is possible to define a drug-sensitivity
phenotype. The latter can be used to divide cells in two groups:
one with cells that are inhibited by low concentrations of the drug
(i.e., that are highly sensitive to it) and the other with cells
that require high concentrations (i.e., that are resistant to it).
A classification method can then be used to predict whether an
unknown cell line is likely to be sensitive or resistant to a given
drug.
[0049] Some complex phenotypes, such as the p53-related one, are
likely to be mixtures of simpler unknown sub-phenotypes at the
molecular level, each one characterized by a possibly independent
pattern. Methods that rely on a single model are likely to perform
poorly with these complex cases, as truly there is no single model
that describes the entire set.
[0050] An exemplary control matrix is shown in FIG. 3, to be
discussed below. Each column of the control matrix will preferably
contain a number of expressions from a number of experiments. As is
known in the art, each expression corresponds to how much or how
little a particular gene is being expressed. Primarily, the
expression levels are determined by examining the fluorescence of
locations on a microarray. This fluorescence is converted to a
number based on a number of factors, such as the optical system
being used, the fluorescent markers being used, and the scanner
used to convert the fluorescence into a mathematical
representation. Although fluorescence is the primary readout for
microarrays, other readouts, such as probeless readouts that use
electrophoresis or conductance may be used.
[0051] Suppose that the expression level u of the g-th gene in the
control set is distributed according to a given probability density
P.sub.g.sup.-(u). This density is estimated empirically A
sufficient number of samples is required to measure this density
with a sufficient degree of accuracy. Each of the columns of a
control matrix should be able to determine a probability density
for the gene to which the column corresponds.
[0052] In FIG. 4, discussed below, a possible shape for
P.sup.-.sub.g is plotted. Basically, the similarity metric of the
present invention tries to provide a better determination of
clusters of expressions in order to provide a higher probability
that two expressions are related. This is discussed in more detail
in reference to FIG. 4. The present invention renormalizes the
expression axis so that the distance between two points on the new
axis is equal to the integral of the P.sup.-.sub.g(u) in the
previous coordinate system. This is accomplished by defining a new
variable v obtained by transforming the original variable u with
the following, gene-specific non-linear transformation f.sub.g: 1 v
f g ( u ) = - .infin. u P g - ( x ) x . ( 1 )
[0053] In this new variable, the corresponding probability density
Q.sub.g.sup.-(v) for the control set is uniformly distributed and
normalized in the interval [0,1]. If v=f.sub.g(u) and
v'=f.sub.g(u') are two transformed expression values, the Euclidean
metric in v will be taken as a measure of similarity, or distance,
between and u and u': 2 D ( u , u ' ) v - v ' = u u ' P g - ( x ) x
. ( 2 )
[0054] In other words, the distance between two expression values
is chosen to be equal to the integral of the gene expression
probability density in the control set between these two values.
Since the number of measurements in the control set that fall
between u and u' and is proportional to the integral in Equation
(2), it follows that the larger the number of measurements in the
control set that fall between two values, the further apart they
are in the new coordinate system, and vice versa.
[0055] One significant advantage of this is that, since the
probability density for all genes in the control set is uniformly
distributed in the transformed space over the interval [0, 1], it
is now possible to analytically compute the statistics of the
patterns discovered in the control set. Under previous systems, the
statistics of the patterns were much harder to calculate.
[0056] In step 220 of FIG. 2, transformations are derived. One
exemplary transformation is shown in FIG. 4. Each transformation
takes a probability density distribution for one of the columns of
the control matrix and transforms this to a uniform probability
density. During this process, the actual probability density
distribution may be integrated, or a function derived that
estimates the actual probability density distribution and that is
then integrated. This is discussed in more detail in reference to
FIGS. 4, 9, and 10. The transformation may be thought of as a
non-linear similarity metric, which maximizes the probability of
discovering discriminative gene expression patterns.
[0057] In step 230, a phenotype matrix is formed. The phenotype
matrix preferably has columns corresponding to particular genes and
rows corresponding to particular experiments. Each entry is an
expression level of a particular gene and a particular experiment.
Generally, the experiments are from cells that exhibit a certain
disease phenotype, such as cancer or diabetes. An exemplary
phenotype matrix is shown in FIG. 5.
[0058] The previously determined transformations are applied to the
phenotype matrix in step 240. Each transformation corresponds to
one of the genes and is used to convert the expression levels of
entries that correspond to the gene to a transformed value. The
transformed value, as part of a transformed phenotype matrix, can
be used to determine patterns. An exemplary transformed matrix is
shown in FIG. 6.
[0059] Of all the possible patterns in a phenotype matrix, some of
the patterns will be maximal. These maximal patterns are
submatrices of the phenotype matrix. In step 250, these maximal
patterns are determined. There are a variety of well known
pattern-finding algorithms and techniques that may be used in this
step.
[0060] For instance, the SPLASH algorithm may be used. Full details
of the SPLASH algorithm are given in Califano, A., "SPLASH:
Structural Pattern Localization Algorithm by Sequential
Histograming," Bioinformatics 16, 341-357, 2000, the disclosure of
which is incorporated by reference herein. Preprints are available
at http://www.research.ibm.com/topics/pop-
ups/deep/math/html/splashextemal.PDF. In that paper, SPLASH was
introduced as an algorithm to discover patterns in strings, where
all possible relative strings alignment are allowed. Also, a
density constraint is introduced to limit the impact of random
matches occurring over large distances on the string. For the
equivalent association discovery problem, relevant in this context,
the approach is analogous, as each row in the matrix is equivalent
to a string. However, the strings are prealigned in the present
case, meaning that the strings are aligned with the genes to which
they correspond. In addition, the density constraint criteria
introduced in the SPLASH paper is no longer meaningful here, as the
first and last genes are as likely to form patterns as two
corresponding to contiguous matrix columns.
[0061] Using the notation of SPLASH, the canonical seed set P.sub.s
has a single pattern with no genes, all the rows, and an offset of
0 for each row. The histogram operator T.sub.h is implemented by
simply sorting the values in each column and then selecting all
subsets of continuous values that are .delta.-valid. Non-maximal
subsets that are completely contained within another subset are
removed. Each subset is a potential superpattern of a maximal
pattern. The enumerate operator T.sub.e is then applied iteratively
to create all possible maximal combinations of these superpatterns.
As a result, all patterns that exist in the data are generated
hierarchically by combining together smaller superpatterns, with
fewer genes. Non-maximal branches are eliminated at each iteration,
as soon as their corresponding superpattern arises. This
contributes to the efficiency of the algorithm.
[0062] Another pattern-searching technique that may be used with
the present invention is the Teiresias algorithm. This algorithm is
described in Floratos et al., U.S. Pat. No. 6,108,666, "Method and
Apparatus for Pattern Discovery in 1-Dimensional Systems"; Floratos
et al., U.S. Pat. No. 6,092,065, "Method and Apparatus for
Discovery, Clustering and Classification of Patterns in
1-Dimensional Event Streams"; Rigoutsos and Floratos,
"Combinatorial Pattern Discovery in Biological Sequences: the
Teiresias Algorithm," Bioinformatics, 14(1):55-67, 1998; and
Rigoutsos and Floratos, "Motif Discovery Without Alignment Or
Enumeration," Proceedings 2nd Annual ACM International Conference
on Computational Molecular Biology, New York, N.Y., March 1998, the
disclosures of which are incorporated by reference herein.
[0063] In step 260, statistically significant patterns are
determined. This is an optional, but preferred step. When gene
expression values are organized in a gene expression matrix,
jk-patterns may occur for any given value of .delta.. An important
determination to be made is whether any of these patterns can occur
merely by chance.
[0064] An important observation is in order at this juncture: It is
not meant that the expression values of different genes in "real"
gene-expression matrices are independent random variables. Rather,
this model is used as the null hypothesis of a statistical
framework precisely to identify any skew or co-regulation in the
phenotype set. This null hypothesis definition is based on two
assumptions: (a) that the probability densities for the expression
levels of each gene are the same as in the control set, and (b)
that the gene expression levels in different experiments and/or
those of different genes are independently distributed. When
discovering patterns in the phenotype set, the statistically
relevant patterns will be those for which the null hypothesis is
rejected. These are patterns whose constituent genes are either
distributed differently in the phenotype set than in the control
set, and/or are expressed in a correlated fashion. Both of these
features are actually the kind of behavior that we are seeking to
differentiate the two sets.
[0065] Those skilled in the art will note that many genes are not
independently distributed in the control set. Therefore, patterns
may arise that reject the null hypothesis and yet are likely to
occur in the control set. These patterns will be called promiscuous
patterns. Promiscuous patterns, are easily eliminated in a
post-processing phase, described in reference to step 270 below,
and do not contribute significantly to remaining analysis. This is
verified by experimental results given below, where any correlation
in the control set is artificially removed.
[0066] An important result on the statistics of patterns is the
following: given .delta.>0, an N.sub.e.times.N.sub.g gene
expression matrix V, a k-dimensional gene vector G and a
j-dimensional experiment vector E, the probability that the
submatrix V.sub.G,E is a maximal .delta.-valid jk-pattern is:
P.sub..delta.(j,k,N.sub.e,N.sub.g).apprxeq..zeta..sup.k[1-.zeta.].sup.N.su-
p..sub.g.sup.-k[1-(1+j.sup.-1).sup.k.delta..sup.k].sup.N.sup..sub.e.sup.-j
(3)
[0067] where
.zeta.=j.delta..sup.(j-1)-(j-1).delta..sup.j. (4)
[0068] Therefore, the average number of maximal .delta.-valid
jk-patterns in V is
N.sub.jk=N.sub.tP.sub..delta.(j,k,N.sub.e,N.sub.g),
[0069] where 3 N t = ( N g k ) .times. ( N e j ) . ( 5 )
[0070] Where N.sub.t is the total number of ways in which one can
choose a gene and experiment vector. In Equation (3), an
approximation is made that is valid when .delta. is small. This is
consistent with the values used in the experimental section,
typically .delta..ltoreq.0.25.
[0071] The previous observations can be used to assess the
statistical significance of a pattern in the phenotype set with
respect to the randomized control set. Using classical statistics
reasoning, maximal .delta.-valid jk-patterns are rejected that
would be likely to occur in the randomized control set. Under the
null hypothesis, the probability p.sub.jk that one or more
jk-patterns occur in the phenotype set is (this assumes that the
number of patterns is distributed according to a Poisson
distribution, which is a reasonably good approximation to the
actual distribution):
i p.sub.jk=1-e.sup.-N.sup..sub.jk. (6)
[0072] This will be the p-value or significance level of the
preferred statistical test used herein. Thus, setting a reasonable
threshold P.sub.0, it is possible to say that if one or more
jk-patterns in the phenotype set are observed with
p.sub.jk<P.sub.0, the null hypothesis is rejected and a
conjecture is made that such jk-patterns could be specific of the
cell phenotype under study.
[0073] In step 270, a discriminant function is built. This step may
also entail further refinements to the maximal patterns to create a
better discriminant function. The discriminant function determines
whether a sample belongs to the phenotype set or the control set.
When building the discriminant, it may be beneficial to further
process the gene expression patterns discovered in step 260. In
particular, promiscuous patterns may be removed if desired. This is
discussed in more detail below.
[0074] Once the statistically significant patterns are found in the
phenotype set, they can be used as classifiers to build a
discriminant function. This function should determine whether or
not a previously unseen sample, v=(v.sub.1, . . . ,
v.sub.N.sub..sub.g), belongs to the phenotype or the control set.
To this end, a model is built for the probability density function
of the expression level for each statistically significant pattern
.pi..sub.l of the phenotype set. Each i-th gene of .pi..sub.l
contributes with a factor P.sub.i.sup.+(v). The probability density
P.sub.i.sup.+ is chosen to be normally distributed with mean equal
to the average of the cluster for the i-th gene and standard
deviation .sigma..sub.i estimated from the actual measurements. If
the number of actual measurements were, too small, then
.sigma..sub.i is taken to be .delta./2. For samples in the control
set, the same gene would be expressed according to a different
probability density P.sub.i.sup.-(v). The latter can be built
empirically because all the samples in the control set can be
used.
[0075] On a first order approximation, independence is assumed
between genes and the multivariate distribution is taken to be
equal to the product of the probability densities of the individual
genes, both in the phenotype and in the control set. This
assumption is necessary because not enough data exist to construct
a realistic multidimensional probability density for either
set.
[0076] Promiscuous patterns, which arise from correlations in the
control set, are likely to play a minor role in the classification
as described in the following discussion. To determine if a new
microarray sample fits the phenotype model of ajk-pattern
.pi..sub.l for the expression values (v.sub.1, v.sub.2, . . . ,
v.sub.k) over the k genes that constitute .pi..sub.l, it is scored
by the logarithm of the ratio of the two probability densities, as
described in Welch, B. L., "Note on Discriminant Functions,"
Biometrika, Vol 31, pp. 218-220, 1939, the disclosure of which is
incorporated herein by reference: 4 S l = log [ P + ( v 1 , , v k )
P - ( v l , , v k ) ] i = 1 k log [ P i + ( v i ) ] - i = 1 k log [
P i - ( v i ) ] ( 7 )
[0077] Using this score, it can easily be determined whether
promiscuous patterns are contained in the set of statistically
significant patterns. Patterns with positive values of S.sub.l for
samples taken from the control set are considered promiscuous.
Next, the statistically significant patterns are preferably
assigned a promiscuity index: 5 p l = S l ( v ) > 0 S l ( v
Control Set ) , ( 8 )
[0078] where the sum runs over all the samples in the control set
for which S.sub.l>0. Patterns whose S.sub.l<0 for all samples
in the control set have a promiscuity index of zero. Patterns can
now be sorted according to the promiscuity index, with the least
promiscuous pattern first.
[0079] The next step is to associate each pattern .pi..sub.l with a
coverage set, which includes all the samples in the phenotype set
v.sup.ph with a positive score S.sub.l(v.sup.ph)>0. Finally, an
optimal set of patterns is preferably selected using a greedy set
covering algorithm to optimally cover the phenotype set. This is
described in Chvatal, V., "A greedy heuristics for the set covering
problem," Math. Opera. Res., Vol 4, pp. 233-235, 1979, the
disclosure of which is incorporated herein by reference. The set
covering algorithm tries to use the patterns in sort order
according to the promiscuity index: The least promiscuous and most
covering pattern is chosen first. The smallest subset of patterns
whose coverage sets optimally cover the phenotype set is then used
for classification purposes. Therefore, if a non-promiscuous set
that optimally covers the phenotype set exists, it will be selected
over a promiscuous one.
[0080] In step 280, the discriminant is used to classify one or
more samples. Each sample is a vector of expressions, where each
expression corresponds to one gene of the phenotype matrix. The
discriminant is used to classify a sample as either in the
phenotype set or in the control set. Typically N.sub.c patterns of
the subset of patterns are selected, where this value generally
ranges between one and three. The score of a previously
unclassified sample v is defined as
S(v)=max(S.sub.l(v), l=1, . . . ,N.sub.c). (9)
[0081] Equation 9 is a preferred discriminant function. The
discriminant function basically runs the unclassified sample and
each selected gene expression pattern through Equation (7) to
determine a value S(v). Given a threshold S.sub.c, the sample fits
the phenotype model only if S(v).ltoreq.S.sub.c. The theoretical
false positive (FP) and false negative (FN) probabilities can be
easily estimated by integrating P.sup.- and P.sup.+ over the region
where their ratio is greater or smaller than the threshold. If a
single classifier is used, S.sub.c=0 minimizes the sum of false
positive and false negative probabilities (see Welch, B. L., "Note
on Discriminant Functions," Biometika, Chapter 31, pp. 218-220,
1939). In a multivariate model, S.sub.c must be tuned. The
parameter S.sub.c is an useful tunable parameter practically since
different problem requires different balance between FP and FN.
[0082] It should be noted that other discriminant functions may be
used. For example, a simple discriminant function could be a vector
derived from a gene expression pattern. The vector could contain
the average for each gene in the pattern. Additionally, each gene
within the pattern will have some standard deviation. An
unclassified sample, itself a vector, would then be included in the
phenotype if for each gene in the pattern, the respective entry in
the sample is within a range described by the average plus or minus
a few standard deviations. This would still require the sample
vector to be compared with each gene expression pattern. The
previously derived discriminant is preferred, however, as it deals
with promiscuous patterns and tends to be more sensitive.
[0083] Referring now to FIG. 3, this figure shows an exemplary
control matrix. The control matrix has columns corresponding to
genes and rows corresponding to experiments. Each entry is a gene
expression signal that indicates the expression for that gene for
that experiment. The experiments are preferably derived from
healthy cells and can be chosen from a multitude of different
organisms. Each column will contain expressions for this gene. For
instance, there are y genes and each gene has x expressions. The x
expressions will have some probability density distribution.
[0084] It is desirable to determine a measure that can determine
whether two expression levels are similar. As previously discussed,
the expression level u of the g-th gene, in the control set, is
distributed according to a given probability density
P.sup.-.sub.g(u). This density is estimated empirically. A
sufficient number of samples is required to measure this density
with a sufficient degree of accuracy.
[0085] Turning to FIG. 4, a possible shape for P.sup.-.sub.g, which
is discussed in above, is plotted along with four expression values
from hypothetical phenotype cells a, b, c, and d: u.sub.a, u.sub.b,
u.sub.c and u.sub.d. Although the Euclidean distance between
u.sub.c and u.sub.d is smaller than that between u.sub.a and
u.sub.b, the likelihood of getting the former values by chance is
higher because they are very close to the maximum of the expression
probability density. In other words, in order to minimize the
probability of finding random clusters in the control set, a metric
must be chosen such that u.sub.c and u.sub.d would be considered
further away than u.sub.a and u.sub.b. As discussed above, the
present invention renormalizes the expression axis so that the
distance between two points on the new axis is equal to the
integral of the P.sup.-.sub.g(u) in the previous coordinate system.
This is accomplished by defining a new variable v obtained by
transforming the original variable u with the non-linear
transformation of Equation (1) given above. In this new variable,
the corresponding probability density Q.sub.g.sup.-(v) for the
control set is uniformly distributed and normalized in the interval
[0,1]. In FIG. 4, the probability density is plotted together with
the transformed values for u.sub.a, u.sub.b, u.sub.c, and u.sub.d.
As expected, the Euclidean distance, as defined by Equation 2 given
previously, between u.sub.a and u.sub.b is now smaller than that
between u.sub.c and u.sub.d, which makes them much more likely
candidates for a cluster of expressions. These Euclidean distances
correspond to the distances between v.sub.a and v.sub.b and between
v.sub.c, and v.sub.d, respectively. The transformation has made the
difference between v.sub.c and v.sub.d larger than the difference
between v.sub.a and v.sub.b. It should be noted that, when
transforming a phenotype matrix, each entry in the phenotype matrix
will be a u expression signal. The transformation shown in FIG. 4
may be used to transform the u expression signal into an equivalent
transformed v expression signal.
[0086] Referring to FIG. 5, this figure shows a phenotype matrix
prior to transformation. Each column preferably comprises
expression signals from certain genes and each row preferably
corresponds to an experiment. The number of genes Ng is not
necessarily the same as the number of genes Ny in the control
matrix of FIG. 3. However, it will usually be the case that the two
numbers of genes will be the same. Additionally, a transformation
must be derived for a gene from the control data before values
corresponding to the gene in the phenotype matrix can be
transformed. Thus, Ny will generally be at least as large as Ng.
The transformation for each gene is applied to each expression
signal in the column that corresponds to that gene. This will
create a transformed phenotype matrix from which gene expression
patterns may be determined.
[0087] Referring to FIG. 6, this figure contains an example
transformed phenotype matrix V with N.sub.g=5 genes and N.sub.e=4
experiments. It is desired that maximal patterns be found for this
matrix. Given a gene vector G={1, 3, 4} and an experiment vector
E={1, 2, 4}, V.sub.G,E is as shown in FIG. 7. V.sub.G,E is not a
(.delta.=0.05)-valid (j=3, k=3)-pattern because the values in its
second and third column are spread over an interval greater than
0.05. The same pattern, .pi..sub.1 in FIG. 8, is
(.delta.=0.1)-valid but not maximal, because adding gene 2 to G,
produces .pi..sub.2 which is still .delta.-valid. Pattern
.pi..sub.2 is maximal because adding any other gene or experiment
yields submatrices that are no longer (.delta.=0.1)-valid. Pattern
.pi..sub.3 shows another (.delta.=0.1)-valid (j=2, k=5)-pattern
after adding gene 5 and removing experiment 1. This pattern is also
maximal.
[0088] Thus, what has been disclosed so far is a system and method
that transforms gene expression levels of one phenotype from a
probability density to a uniform probability density. The
transformation is used to transform expression levels of another
phenotype. These transformed values, when placed in a matrix, are
used to determine a number of gene expression patterns. The gene
expression patterns, which are primarily submatrices of the
transformed phenotype matrix, may be used to determine if a sample
belongs to the first or second phenotype.
[0089] Now that the main aspects of the present invention have been
discussed, some examples and experiments will now be related.
EXAMPLES
[0090] Affymetrix HU6800 GeneChips have been used to monitor the
gene expression levels of 6,817 full length human genes in 60 human
cancer cell lines (experiments were performed at the Whitehead
Institute), as described in Weinstein, J. N. et al., "An
Information-Intensive Approach to the Molecular Pharmacology of
Cancer," Science, Vol 275, pp. 343-349, 1997, the disclosure of
which is incorporated by reference herein. These are organized into
a set of panels for leukemia, melanoma, and cancer of the lung,
colon, kidney, ovary, and central nervous system. The identity of
the genes is not known. Consequently, they are therefore identified
by a numeric identification.
[0091] Genes with expression values of 20 or less are considered
switched off. From the 6,817 original genes, a subset of 418 was
selected by means of a variational filter to eliminate genes that
did not change significantly across samples (variational filters
were also used in Eisen, M. B. et al., "Cluster analysis and
display of genome-wide expression patterns," Proc. Natl. Acad. Sci.
USA, Vol 95, pp. 14863-14868, 1998; and Tamayo, P. et al.,
"Interpreting Patterns of Gene Expression with Self-Organizing
Maps: Methods and Application to Hematopoietic Differentiation,"
Proc. Natl. Acad. Sci. USA, Vol 96, pp. 2907-2912, 1999, the
disclosures of which are incorporated by reference herein).
[0092] The fluorescence intensity .phi..sub.g of each gene, roughly
proportional to the mRNA concentration, appears to be lognormally
distributed. The value of variable u is then chosen as
u=log(.phi..sub.g). In FIG. 9, the histogram of a typical
expression level of a gene over the 60 samples is shown. This
distribution is clearly bimodal. There is a peak at the basal
level, corresponding to the gene being switched off in some
experiments. The non-basal expression values, on the other hand,
are distributed with a well behaved mean and standard deviation.
Thus, the corresponding probability density may be written as:
P.sub.i.sup.-(u.sub.i)=a.sub.i.delta.(u.sub.i-u.sub.0)+(1-a.sub.i)P.sub.i.-
sup.(0)(u.sub.i), (10)
[0093] where a.sub.i is the percentage of expression data being at
the basal level u.sub.0, and P.sub.i.sup.0(u.sub.i) is the density
function for the non-basal expression values. For each gene, it is
possible to determine its basal level a.sub.i, the mean {overscore
(u)}.sub.i, and standard deviation .sigma..sub.i of its non-basal
density P.sub.i.sup.(0)(u.sub.i). Non-basal values of u.sub.i are
normally distributed in accord with the observed lognormal
distribution of the .phi..sub.i. This is shown in FIG. 10, which
shows a plot of the combined distribution obtained by shifting and
resealing the nonbasal activity of each gene as
u'=(u.sub.i-{overscore (u)}.sub.i)/.sigma..sub.i. The change of
variables corresponding to Equation (1) is: 6 v i = u 0 u i P i - (
x ) x = a i H ( u i - u 0 ) + ( 1 - a i ) u 0 u i P i ( 0 ) ( x ) x
. ( 11 )
[0094] In Equation (11), H(x) is the Heaviside function, which is
the function H(x.ltoreq.0)=1 and H(x<0)=0. This results from the
integration of the delta function in the u.sub.i distribution. For
a.sub.i.noteq.0, different values of u.sub.i can correspond to the
same value of v.sub.i. This is not a problem unless a.sub.i is of
the same order of .delta., in which case the values at u.sub.0 the
would not be discriminative are discarded.
[0095] Phenotype Analysis Given a gene expression matrix V, with
renormalized values v.sub.lm, the pattern discovery algorithm
SPLASH was used to find all maximal .delta.-valid jk-patterns for
k.gtoreq.4 and j.gtoreq.4. These parameters were chosen because
patterns with too few genes are not specific enough, while patterns
with too small a support do not characterize a significant
consensus in the data set. The .delta. was chosen between 0.05 and
0.15, depending on the data sets, such that a sufficient number of
patterns is discovered. Typically, this would be on the order of 10
to 50 statistically significant patterns. Larger values of .delta.
are possible, but they increase the probability of finding
promiscuous pattern and reduce performance. In general, the smaller
value of .delta. one can choose and still discover patterns, the
better the results. The threshold of significance P.sub.0 is chosen
to be 10.sup.-4.
[0096] Below, experimental results are discussed on the
classification of 7 samples in the melanoma panel, of 17 samples
with mutations in the p53 gene, and of 10 samples whose growth is
highly inhibited by the drug Chlorambucil. For each experiment, a
plot of the sum p.sub.Tot=p.sub.FP+p.sub.FN of false positives and
false negative probabilities as a function of the matching
threshold S.sub.c is shown.
[0097] Three methods are studied: The pattern discovery method (PD)
of the present invention; the support vector machine (SVM) method
disclosed in Brown, M. P. S. et al., "Support Vector Machine
Classification of Microarray Gene Expression Data," University of
California Technical Report USCC-CRL-99-09, 1999, which is
available at: http://www.cse.ucsc.edu/research/compbio/genex; and
the gene by gene method (GBG) disclosed in Golub, T. R. et al.,
"Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring," Science, Vol 286, pp.
531-537, 1999. For each given phenotype, its complement in the
NCI-60 was used, excluding the samples whose phenotype cannot be
accurately determined (neutral samples), as the control set.
[0098] Given the limited number of samples in the NCI-60 set, false
positive and false negative ratios were computed by cross
validation. Each sample both in the phenotype and in the control
set was removed in turn. The present invention was trained using
the remaining samples, and this included gene axis transformation,
pattern discovery, and set covering. Finally, the previously
removed sample was classified as described in regards to step 280.
When a phenotype set sample was misclassified it was considered a
false negative. When a control sample was misclassified it was
considered a false positive. All computation times reported are
relative to a 450 MHZ PENTIUM II, where a PENTIUM is a trademark
for a computer processor. In FIGS. 11 through 14, the sum of the
false positive and false negative probabilities is plotted as a
function of the classification score threshold. PD is shown by a
thick solid line, SVM by a dashed line with diamonds, and GBG by a
dotted line with circles.
[0099] Melanoma
[0100] The melanoma panel included 7 samples. There were also 14
neutral samples. These were selected by biologists prior to this
analysis. When the complete set of melanoma samples was used for
the training, there was only one statistically significant gene
expression pattern that was selected after the set covering phase.
FIG. 11 shows the performance of the complete analysis, with
.delta.=0.12, as described in the previous section. Both SVM and PD
show a significant range of the match threshold S.sub.c where both
false positive and false negative probabilities are zero. This is
considered perfect recognition. The GBG produces results which are
very similar, although a fraction less accurate. The time required
to classify a sample with the PD method is approximately 10
seconds.
[0101] p53 Mutation
[0102] A more challenging phenotype is that of 17 samples for cells
with mutations in the p53 gene. The corresponding set of cancer
morphologies is considerably more complex. It includes 5 melanoma,
3 renal cancer samples, 2 samples for cancer of the central nervous
system, leukemia, ovarian cancer, and breast cancer, and 1 sample
for colon cancer. As mentioned earlier, this is likely to have
several sub-phenotypes at the molecular level. This is confirmed by
this analysis, which also highlights a much wider range of
variability for the various methods. As shown in FIG. 12, the GBG
method performs quite poorly with a min(p.sub.Tot)=0.51. The SVM
method improves on that result, bringing that value to about 0.46.
The present method, with .delta.=0.12, has the best result at
min(p.sub.Tot)=0.33. Three distinct, rather orthogonal patterns are
used on average for each sample classification. If only one pattern
is allowed, results become close to that of the SVM method. The
time required to classify a sample with the PD method is
approximately 20 seconds.
[0103] Chlorambucil GI 50
[0104] Some truly interesting phenotypes are associated to the
ability of a given drug to inhibit cell growth. These are relevant
because many experimental anti-cancer compounds exhibit relatively
poor growth inhibition rates across large variety of cancer cells
of similar morphology. If it were possible, however, to correlate
the effectiveness of a compound to the much richer space of the
gene expression profile of a cell, it could be possible to
determine a priori which cells are most likely to be inhibited by a
drug. To test this scenario, Chlorambucil (NSC 3088) was selected
from the NCI anti-cancer database. Since the growth inhibition rate
is distributed rather continuously over the entire NCI-60 spectrum,
the samples were split into three groups. The phenotype group
contained the 10 cells that are most inhibited by Chlorambucil. The
control group contained the 20 samples whose growth is least
inhibited by the compound. The third set of 30 cells is considered
neutral. As shown in FIG. 13, the SVM and PD methods perform
similarly, with a slight advantage towards the former. Best values
for p.sub.Tot are 0.35 and 0.4 respectively. For the PD method, a
value of .delta.=0.12 is used. The other method cannot do better
than 0.55. The time required by the PD method to classify a sample
is 10 seconds.
[0105] Synthetic Data Analysis
[0106] Several cross-validation checks using synthetic or
randomized data have been performed to validate the present
invention and its approach. Three synthetic data sets were
analyzed. The first test was designed to evaluate the theoretical
performance of the algorithms in the case of phenotype mixtures. A
synthetic data model was generated with the same gene by gene
statistics as the control set for the p53 study. A set of 48
control samples was generated from this model at random. Their
gene-by-gene probability density was virtually identical to that of
the real control set in the p53 study. A set of 18 phenotype
samples was synthetically generated. This set consisted of three
independent sub-phenotypes of 6 samples each. Each sub-phenotype
was characterized by 10 marker genes clustered around a tight
interval. Remaining genes were modeled as in the control set.
Marker genes were different for different sub-phenotype with some
overlap. In particular, sub-phenotype 1 and 2 had 6 marker genes in
common; sub-phenotype 2 and 3 had 2 marker genes in common. Marker
genes were expressed differently than in the control according to
the following criteria: 1) they were given a different mean located
about 0.5.sigma. away from the control mean, 2) they were given a
smaller standard deviation 0.33.sigma., where .sigma. is the
standard deviation of the same gene in the control set.
[0107] As shown in FIG. 14, there was a dramatic difference between
the performance of the SVM method and that of the PD method. The
minimum of p.sub.Tot was about 0.7 for the SVM method and 0.19 for
the PD method. About 80% of the genes composing the sub-phenotypes
are correctly identified by the 3 resulting patterns. This seems a
good indication that the PD method is a suitable choice for the
classification of some complex phenotypes that may be mixtures of
simpler sub-phenotypes.
[0108] The second synthetic data set was designed to determine
whether correlation of genes in the control set is a significant
factor and could reduce the performance of the technique. To
accomplish this goal, the values of the genes have been randomly
permuted only across the control set on a gene by gene basis. This
has the effect of leaving the expression probability density for
each gene for the control virtually unchanged, while removing any
possible correlation between the values of genes in the same
sample. Results for the classification are shown in FIG. 15, as
shown by a dashed line with squares. There are no major differences
with respect to the same curve for the real data of FIG. 12, which
is reproduced in FIG. 15 as the thick solid line. This proves that
correlation in the genes, even though present in the data, does not
result in classifiers that are highly correlated over the control
set.
[0109] Finally, a test was designed to determine whether this
approach may suffer from over-fitting the data. To that end,
classification was performed using the same data and criteria as
for the p53 phenotype study but after the expression values of the
individual genes have been randomly permuted across all the samples
on a gene by gene basis. The results of the classification are
shown in a dashed line with triangles in FIG. 15. As clearly shown,
performance was very poor, with a value of p.sub.Tot consistently
larger than 1 over the entire classification threshold interval.
The PD method, as well as the SVM and GBG methods, exhibits no
predictability for these data sets. In other words, the sum of the
false positive and the false negative rates is close to 1, as it
should be Thus, results for the classification of melanoma, p53
mutations and the GI 50 activity of Chlorambucil are excellent.
They range from 0% to about 40% sum of false positive and false
negative probability. The high sensitivity and specificity of the
present method for complex phenotypes such as p53 show that the
method can successfully deal with multiple independent
sub-phenotypes. Moreover results from one of the first attempts to
predict drug effectiveness from gene expression data also are very
good. The present invention is especially well suited to treat
complex phenotypes, composed of several sub-phenotypes. This has
been shown for the p53-related phenotype and for a synthetic data
set.
[0110] The present invention has better predictive power than other
methods. Another advantage of the present method over SVM is that
the present method highlights the relevant marker genes, their
expression range in the phenotype, and the independent patterns
that relate them. Such information is highly desirable for
discovering the mechanism for various diseases at the molecular
level. The SVM method, on the other hand, is more like a black box
for classification.
[0111] It is to be understood that the embodiments and variations
shown and described herein are merely illustrative of the
principles of this invention and that various modifications may be
implemented by those skilled in the art without departing from the
scope and spirit of the invention. For instance, other suitable
pattern searching methods may be used, and other discriminant
fimctions may be designed and used.
* * * * *
References