U.S. patent application number 13/934166 was filed with the patent office on 2013-11-07 for identification of pattern similarities by unsupervised cluster analysis.
The applicant listed for this patent is Health Discovery Corporation. Invention is credited to Asa Ben-Hur, Andre Elisseeff, Isabelle Guyon.
Application Number | 20130297607 13/934166 |
Document ID | / |
Family ID | 26967221 |
Filed Date | 2013-11-07 |
United States Patent
Application |
20130297607 |
Kind Code |
A1 |
Ben-Hur; Asa ; et
al. |
November 7, 2013 |
IDENTIFICATION OF PATTERN SIMILARITIES BY UNSUPERVISED CLUSTER
ANALYSIS
Abstract
A method is provided for unsupervised clustering of data to
identify pattern similarities. A clustering algorithm randomly
divides the data into k different subsets and measures the
similarity between pairs of datapoints within the subsets,
assigning a score to the pairs based on similarity, with the
greatest similarity giving the highest correlation score. A
distribution of the scores is plotted for each k. The highest value
of k that has a distribution that remains concentrated near the
highest correlation score corresponds to the number of classes
having pattern similarities.
Inventors: |
Ben-Hur; Asa; (Fort Collins,
CO) ; Elisseeff; Andre; (Thalwil, CH) ; Guyon;
Isabelle; (Berkeley, CA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Health Discovery Corporation |
Atlanta |
GA |
US |
|
|
Family ID: |
26967221 |
Appl. No.: |
13/934166 |
Filed: |
July 2, 2013 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
13019585 |
Feb 2, 2011 |
8489531 |
|
|
13934166 |
|
|
|
|
11929522 |
Oct 30, 2007 |
7890445 |
|
|
13019585 |
|
|
|
|
10478191 |
Nov 1, 2004 |
|
|
|
PCT/US02/15666 |
May 17, 2002 |
|
|
|
11929522 |
|
|
|
|
60335990 |
Nov 30, 2001 |
|
|
|
60292221 |
May 18, 2001 |
|
|
|
Current U.S.
Class: |
707/737 |
Current CPC
Class: |
G06K 9/6218 20130101;
G16B 40/00 20190201; G06F 16/35 20190101; G16B 25/00 20190201 |
Class at
Publication: |
707/737 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A non-transitory machine-readable medium comprising a plurality
of instructions that in response to being executed result in a
computing system executing a process for unsupervised
classification of data in a dataset according to pattern
similarities, comprising: a clustering function to randomly assign
k class labels to the data and partition the dataset in k subsets,
wherein k has a range with a minimum number of two; a compare
function to, for each value of k beginning with the minimum value
of k for each pair of subsets, compute a correlation score on the
intersection between the pair of subsets, wherein the correlation
score comprises a similarity measure between the pair of subsets
and the greatest similarity has the highest score; and a histogram
function to generate a distribution of the correlation scores for
each value of k, wherein the distribution comprising the highest
value of k that remains concentrated near the highest correlation
score corresponds to an optimal granularity level for clustering of
the data according to pattern similarities within the dataset, and
to generate an output corresponding to clustered data for display
or storage in a storage medium.
2. The non-transitory machine-readable medium of claim 1, wherein
the clustering function comprises a k-means algorithm.
3. The non-transitory machine-readable medium of claim 1, wherein
the clustering function comprises a hierarchical clustering
algorithm.
4. The non-transitory machine-readable medium of claim 1, wherein
the optimal granularity level comprises a number of clusters for
which there is a transition from a distribution that is peaked near
a correlation score of "1" to a distribution that corresponds to
random data.
5. The non-transitory machine-readable medium of claim 1, wherein
the similarity measure is obtained by measuring a residual of a fit
of one cluster onto another cluster, wherein the residual fit
comprises using a fit that is invariant with respect to affine
transformations, wherein the affine transformations comprise a
combination of translation, scaling and rotation.
6. The non-transitory machine-readable medium of claim 1, wherein
the dataset comprises gene expression data and the pattern
similarities comprise co-regulation patterns.
7. A non-transitory machine-readable medium comprising a plurality
of instructions that in response to being executed result in a
computing system separating classes within a dataset without
supervision, wherein the computing system: selects a plurality of
granularity levels k, and for each granularity level k: (a) induces
perturbations in the dataset to generate a modified dataset; (b)
applies a clustering algorithm to the at least one modified dataset
to produce k clusters under each of the perturbations; (c) creates
a data subset comprising the clusters identified in step (b); (d)
applies the clustering algorithm to the data subset using the same
value of k clusters; (e) determines the stability of the
clusterings at each granularity level k by measuring dissimilarity
between data in the data subset and a cluster center for the
cluster into which the data was assigned; measures fit of the data
to the cluster centers for all k granularity levels, wherein the
fit comprises using a fit that is invariant with respect to affine
transformations, wherein the affine transformations comprise a
combination of translation, scaling and rotation; selects from
among the plurality of granularity levels an optimum granularity
level k corresponding to the best fit; generates an output
comprising the dataset clustered into a plurality of subsets
corresponding to the optimal granularity level k; and displays a
graph showing the dataset clustered into the plurality of subsets,
wherein the subsets correspond to classes.
8. The non-transitory machine-readable medium of claim 7, wherein
the perturbations comprise a combination of one or more of
sub-sampling the dataset, changing initialization of the clustering
algorithm, and adding noise to the dataset.
9. The non-transitory machine-readable medium of claim 7, further
comprising, prior to selecting a plurality of granularity levels,
ranking the data according to one or more quality criteria and
selecting a pre-determined fraction of top ranked data.
10. The non-transitory machine-readable medium of claim 7, wherein
the clustering algorithm is a k-means algorithm.
11. The non-transitory machine-readable medium of claim 7, wherein
the clustering algorithm if hierarchical clustering.
12. The non-transitory machine-readable medium of claim 7, wherein
the optimal granularity level comprises a number of clusters for
which there is a transition from a distribution that is peaked near
a correlation score of "1" to a distribution that corresponds to
random data.
13. The non-transitory machine-readable medium of claim 7, wherein
the similarity is obtained by measuring a residual of a fit of one
cluster onto another cluster, wherein the residual fit comprises
using a fit that is invariant with respect to affine
transformations, wherein the affine transformations comprise a
combination of translation, scaling and rotation.
14. A non-transitory machine-readable medium comprising a plurality
of instructions that in response to being executed result in a
computing system determining without supervision an optimal number
of classes within a dataset, wherein each class has pattern
similarities, wherein the computing system: selects a range of
granularity levels k, the range having a minimum of two; applies a
clustering algorithm to the dataset for that k clusters are
produced; for each value of k, selects a plurality of subsets of
the dataset, wherein each sub-sample comprises a fixed fraction of
the dataset; selects a plurality of pairs of subsets; calculates a
similarity between each pair of the plurality of pairs of subsets;
determines a distribution of the similarities within the plurality
of pairs of subsets; compares distributions of the similarities for
all k granularity levels; selects from among the plurality of
granularity levels an optimal granularity level k whose
distribution of the similarities has the highest correlation score
among correlation scores for all distributions of similarities;
generates an output comprising the dataset clustered into an
optimal number of subsets corresponding to the optimal k; and
displays a graph of the clustered dataset showing pattern
similarities according to the optimal number of subsets.
15. The non-transitory machine-readable medium of claim 14, wherein
the clustering algorithm is a k-means algorithm.
16. The non-transitory machine-readable medium of claim 14, wherein
the clustering algorithm if hierarchical clustering.
17. The non-transitory machine-readable medium of claim 14, wherein
the optimal granularity level comprises a number of clusters for
which there is a transition from a distribution that is peaked near
a correlation score of "1" to a distribution that corresponds to
random data.
18. The non-transitory machine-readable medium of claim 14, wherein
the similarity is obtained by measuring a residual of a fit of one
cluster onto another cluster, wherein the residual fit comprises
using a fit that is invariant with respect to affine
transformations, wherein the affine transformations comprise a
combination of translation, scaling and rotation.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of application Ser. No.
13/019,585, filed Feb. 2, 2011, issued Jul. 16, 2013 as U.S. Pat.
No. 8,489,531, which is a continuation of application Ser. No.
11/929,522, filed Oct. 30, 2007, issued Feb. 15, 2011 as U.S. Pat.
No. 7,890,445, which is a continuation of application Ser. No.
10/478,191, filed Nov. 18, 2003, now abandoned, which is a U.S.
National Stage filing of PCT/US02/15666, filed May 17, 2002, which
claims the benefit of U.S. provisional patent applications No.
60/335,990, filed Nov. 30, 2001, and No. 60/292,221, filed May 18,
2001. Each of the identified applications is incorporated herein by
reference.
FIELD OF THE INVENTION
[0002] The present invention relates to the use of learning
machines to identify relevant patterns in datasets containing large
quantities of diverse data, and more particularly to a method and
system for unsupervised learning for determining an optimal number
of data clusters into which data can be divided to best enable
identification of relevant patterns.
BACKGROUND OF THE INVENTION
[0003] Knowledge discovery is the most desirable end product of
data collection. Recent advancements in database technology have
lead to an explosive growth in systems and methods for generating,
collecting and storing vast amounts of data. While database
technology enables efficient collection and storage of large data
sets, the challenge of facilitating human comprehension of the
information in this data is growing ever more difficult. With many
existing techniques the problem has become unapproachable. Thus,
there remains a need for a new generation of automated knowledge
discovery tools.
[0004] As a specific example, the Human Genome Project has
completed sequencing of the human genome. The complete sequence
contains a staggering amount of data, with approximately 31,500
genes in the whole genome. The amount of data relevant to the
genome must then be multiplied when considering comparative and
other analyses that are needed in order to make use of the sequence
data. As an illustration, human chromosome 20 alone comprises
nearly 60 million base pairs. Several disease-causing genes have
been mapped to chromosome 20 including various autoimmune diseases,
certain neurological diseases, type 2 diabetes, several forms of
cancer, and more, such that considerable information can be
associated with this sequence alone.
[0005] One of the more recent advances in determining the
functioning parameters of biological systems is the analysis of
correlation of genomic information with protein functioning to
elucidate the relationship between gene expression, protein
function and interaction, and disease states or progression.
Proteomics is the study of the group of proteins encoded and
regulated by a genome. Genomic activation or expression does not
always mean direct changes in protein production levels or
activity. Alternative processing of mRNA or post-transcriptional or
post-translational regulatory mechanisms may cause the activity of
one gene to result in multiple proteins, all of which are slightly
different with different migration patterns and biological
activities. The human proteome is believed to be 50 to 100 times
larger than the human genome. Currently, there are no methods,
systems or devices for adequately analyzing the data generated by
such biological investigations into the genome and proteome.
[0006] Clustering is a widely used approach for exploratory data
analysis. Clustering analysis is unsupervised learning--it is done
without suggestion from an external supervisor; classes and
training examples are not given a priori. The objective of
clustering is to group data points into "meaningful" subsets. There
is no agreed upon definition of the clustering problem, and various
definitions appear in the literature. For example, clustering has
been defined as a search for some "natural" or "inherent" grouping
of the data. However, most clustering algorithms do not address
this problem. The vast majority of clustering algorithms produce as
their output either a dendogram or a partition into a number of
clusters, where the number of clusters is either the input, or
there is some other parameter(s) that controls the number of
clusters. In either case, a model selection technique is required
in order to choose the model parameter, or in the case of
hierarchical algorithms, to determine which level of the dendogram
represents the "inherent" structure of the data.
[0007] A few examples of applications of clustering include (1)
analysis of microarray data, where co-expressed genes are found,
and the assumption is that co-expression might be a sign of
co-regulation; (2) in medical datasets (gene expression data,
clinical data etc.), where patients are divided into categories;
(3) in any set or set of measurements to detect trends or artifacts
in the measurement protocol; and (4) in information retrieval to
partition text according to categories.
[0008] Most clustering algorithms either produce a hierarchical
partitioning of the data into smaller and smaller clusters, or
produces a partition of a dataset into a number of clusters that
depend on some input parameter (the number of clusters or some
other parameter(s)). The question remains, however, of how to set
the input parameter, or how to determine which level of the tree
representation of the data to look at: Clustering algorithms are
unsophisticated in that they provide no insight into the level of
granularity at which the "meaningful" clusters might be found.
Occasionally, there may be prior knowledge about the domain that
facilitates making such a choice. However, even in such cases, a
method for determining the granularity at which to look at the data
is required. This is seen as the problem of finding the optimal
number of clusters in the data, relative to some clustering
algorithm.
[0009] E. Levine and E. Domany in "Resampling Method for
Unsupervised Estimation of Cluster Validity", Neural Comp. 13,
2573-2593 (2001), assign a figure of merit to a clustering solution
according to its similarity to clusterings of subsamples of the
data. The "temperature" parameter of their clustering algorithm is
selected according to a maximum of the similarity measure. However,
in real data, such a maximum does not often occur. Other model
selection techniques have difficulty detecting the absence of
structure in the data, i.e., that there is a single cluster.
Further, many of algorithms make assumptions as to cluster shape,
and do not perform well on real data, where the cluster shape is
generally not known. Accordingly, other methods for clustering are
needed. The present invention is directed to such a method.
SUMMARY OF THE INVENTION
[0010] In an exemplary embodiment, a system and method are provided
for analysis of data by grouping the input data points into
meaningful subsets of data. The exemplary system comprises a
storage device for storing at least one data set, and a processor
for executing the clustering algorithm. In the preferred
embodiments, the clustering algorithm is a k-means or hierarchical
clustering algorithm.
[0011] In the method of the present invention, a "good granularity
level", i.e., an optimal number of clusters, is defined as one at
which a clustering solution is stable with respect to some
perturbation of the data, such as noise or sub-sampling. For each
level of granularity, the algorithm chooses a number of pairs of
sub-samples or other perturbations, clusters each sub-sample with
the chosen level of granularity, then computes a similarity between
pairs of clustering solutions. For each level of granularity, the
distribution of the similarity between pairs of clustering
solutions is computed, then the highest level of granularity for
which highly similar solutions are obtained under perturbation is
chosen.
[0012] According to the present invention, the probability
distribution of the similarity measure is evaluated in order to
find the "true" number of clusters based on a similarity measure
between clustering solutions. A dot product between pairs of
clustering solutions is normalized to provide a similarity measure.
Other similarity measures proposed in the literature can also be
expressed in terms of this dot product.
[0013] In one embodiment, the inventive method provides means for
extraction of information from gene expression profiles, in
particular, extraction of the most relevant clusters of temporal
expression profiles from tens of thousands of measured profiles. In
one variation of the present embodiment, the clustering algorithm
is the k-means algorithm. Other embodiments include all pairwise
clustering methods (e.g., hierarchical clustering) and support
vector clustering (i.e., a support vector machine (SVM)). In one
variation, the fit involves affine transformations such as
translation, rotation, and scale. Other transformations could be
included, such as elastic transformations. In the case of k-means,
the algorithm is applied in a hierarchical manner by sub-sampling
the data and clustering the sub-samples. The resulting cluster
centers for all the runs are then clustered again. The resulting
cluster centers are considered to be the most significant profiles.
To facilitate the clustering task, a ranking of the genes is first
performed according to a given quality criterion combining saliency
(significant difference in expression in the profile), smoothness,
and reliability (low noise level). Other criteria for ranking
include the local density of examples.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] FIGS. 1a-1c are a plot of data which is a mixture of four
Gaussians, a histogram of the correlation score for the Gaussian
data mixture, and a plot showing the overlay of cumulative
distributions of the correlation score, respectively.
[0015] FIGS. 2a and 2b are histogram of the correlation score for
the yeast gene expression data and an overlay of the cumulative
distributions of the correlation score, respectively.
[0016] FIGS. 3a and 3b are a histogram of the correlation score for
208 points uniformly distributed on the unit cube and an overlay of
the cumulative distributions of the correlation score,
respectively.
[0017] FIGS. 4a-4c are a plot of the first two principle
components, the histogram of the correlation score, and an overlay
of the cumulative distributions of the correlation score; k=2
(rightmost) to k-10 (leftmost), respectively.
[0018] FIGS. 5a-5b are graphic representations of gene
pre-selection.
[0019] FIGS. 6a-6b are graphs of the average profiles of the
clusters of the eight clustering runs and their grouping into
meta-clusters.
[0020] FIGS. 7a-7d are graphs of curve fitting with affine
transformations.
[0021] FIGS. 8a-8h are graphs of curve fitting with affine
transformations, clustering obtained from the top 800 best quality
genes.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0022] A typical computer system for running the inventive
clustering algorithm is a Pentium.RTM.-class processor with an
interface for inputting the dataset(s) of interest and a memory for
storing the dataset. Alternative processors include configurable
logic processors made up of SRAM-based field programmable gate
arrays (FPGAs) or configurable system on chip (CSOC) architectures,
which include a processor and an array of configurable logic cells
on a single chip, either of which can be used to accelerate
computer-intensive operations such as clustering algorithms on
large datasets. In addition, parallel implementations in
multi-computers have been used for executing clustering algorithms
for extracting knowledge from large-scale data repositories.
Selection of an appropriate computer system for executing the
inventive clustering method is within the level of skill in the
art.
[0023] The clustering model selection algorithm works with the help
of a scoring function that provides a similarity measure between
two labelings.
[0024] Let X={x.sub.1, . . . , x.sub.n}, and
x.sub.1.epsilon.R.sup.d be the dataset to be clustered. A labeling
is a partition of X into k subsets S.sub.1, . . . , S.sub.k. It can
be represented by a function c:X.fwdarw.{1, . . . , k} where
c(x.sub.i) is the cluster to which x.sub.i belongs.
[0025] A less compact representation of a labeling, which may be
useful is the following representation by a matrix C with
components:
C ij = { 1 if x i and x j belong to the same cluster , 0 otherwise
( 1 ) ##EQU00001##
[0026] Note that this representation is label independent, i.e.
there is no need to assign each point a label in {1, . . . , k}.
This may be viewed as an adjacency matrix, where each cluster is a
connected component of the graph. This representation can also be
converted into a representation of soft cluster labeling.
[0027] Let labelings and have matrix representations C.sup.(1) and
C.sup.(2) respectively, so that
( L 1 , L 2 ) = .SIGMA. i , j C ij ( 1 ) C ij ( 2 ) ( 2 )
##EQU00002##
[0028] This dot product computes the number of pairs of vectors
clustered together, and can also be interpreted as the number of
common edges in graphs represented by C.sup.(1) and C.sup.(2). A
naive method for computing the dot product by going over all pairs
of points has complexity O(n.sup.2). However, it can be computed in
linear time.
[0029] As a dot product, , satisfies the Cauchy-Schwartz
inequality: , .sub.2.ltoreq.. The correlation score, which is a
normalized version of the dot product, is:
cor ( L 1 , L 2 ) = L 1 , L 2 L 1 , L 1 L 2 , L 2 ( 3 )
##EQU00003##
[0030] A. K. Jain and R. C. Dubes, Algorithms for clustering data
(Prentice Hall, Englewood Cliffs, N.J., 1988) provide a number of
scores for comparing a labeling produced by a clustering algorithm
with a "gold standard" labeling. This technique is known as
"external" validation. In contrast, the present invention proposes
the use of scoring functions for "internal" validation that do not
require the "gold standard". In the following, it is shown that two
commonly used scoring functions can be expressed in terms of the
dot product defined above in Equation 2. Given two matrices
C.sup.(1), C.sup.(2) with 0-1 entries, let N.sub.ij i,j.epsilon.{0,
1} be the number of entries on which C.sup.(1) and C.sup.(2) have
values i and j, respectively. Define the matching coefficient as
the fraction of entries on which the two matrices agree:
M ( L 1 , L 2 ) = N 00 + N 11 N 00 + N 01 + N 10 + N 11 . ( 4 )
##EQU00004##
[0031] The Jaccard coefficient is the corresponding ratio when
"negative" matches are ignored:
J ( L 1 , L 2 ) = N 11 N 01 + N 10 + N 11 . ( 5 ) ##EQU00005##
[0032] The Jaccard coefficient is more appropriate when the
clusters are relatively small, since in that case, the N.sub.00
term will be the dominant factor even if the solution is far from
the true one. These scores can be expressed in terms of the
labeling dot product and the associated norm according to the
following proposition:
[0033] Let C.sup.(1) and C.sup.(2) be the matrix representations of
labelings and respectively. Then:
J ( L 1 , L 2 ) = < C ( 1 ) , C ( 2 ) > < C ( 1 ) , C ( 1
) > + < C ( 2 ) , C ( 2 ) > - < C ( 1 ) , C ( 2 ) >
##EQU00006## M ( L 1 , L 2 ) = 1 - 1 n 2 || C ( 1 ) - C ( 2 ) || 2
##EQU00006.2##
[0034] This is a result of the observation that N.sub.11=C.sup.(1),
C.sup.(2), N.sub.01=1.sub.n-C.sup.(1), C.sup.(2),
N.sub.10=C.sup.(1), 1.sub.n-C.sup.(2), N.sub.00=1.sub.n-C.sup.(1),
C.sup.(2), where 1.sub.n is an n.times.n matrix with entries equal
to 1. The above expression for the Jaccard coefficient shows that
it is close to the correlation score.
[0035] The following list provides the routine for the model
explorer algorithm according to the present invention:
TABLE-US-00001 Input: A dataset D, k.sub.min, k.sub.max Require: A
clustering algorithm, cluster (D, k) A scoring function for label
comparison, s( .sub.1, .sub.2) f = 0.8 for k = k.sub.min to
k.sub.max do for i = 1 to maximum iterations do sub.sub.1 =
subsamp(D, f) {a sub-sample with fraction f of the data} sub.sub.2
= subsamp(D, f) = cluster (sub.sub.1, k) = cluster (sub.sub.2, k)
Intersect = sub.sub.1 .andgate. sub.sub.2 Score (i, k) = s(
(Intersect), (Intersect)){Compute the score on the intersection of
the two labels} end for end for
[0036] When one looks at a cloud of data points, and at a
sub-sample of it for a sampling ratio, f (fraction of points
sampled) is not much smaller than 1 (say f>0.5), one usually
observes the same general structure. Thus, it is also reasonable to
postulate that a clustering algorithm has captured the inherent
structure in a dataset if clustering solutions over different
subsamples are similar, e.g., according to one of the similarity
measures introduced in the previous section. Thus, "inherent
structure" is defined as structure that is stable under
sub-sampling (or alternatively, perturbing the data). Given a
clustering algorithm and a dataset, this translates into a search
for the best number of clusters to use in the particular instance.
Note that one can also extend the search to look for a set of
features where structure is apparent, however, in this case, the
same set of features is kept.
[0037] The inventive clustering algorithm receives as input a
dataset (or similarity/dissimilarity matrix) and a parameter k that
controls either directly or indirectly the number of clusters that
the algorithm produces. This convention is applicable to
hierarchical clustering algorithms as well: given k, the tree is
cut so that k clusters are produced. Next, characterize the
stability of the clustering for each k. This is accomplished by
producing a set of clusterings of sub-samples of the data, and
comparing the labels of the intersection of pairs of clusterings
using, for example, the correlation similarity measure. This is
performed for increasing values of k (see above for details). The
distribution of the scores for the different values of k is then
compared (see, e.g., FIGS. 1a-1c). The idea is that when there is
structure in the data that is well described by the clustering
algorithm with that particular value of k, many sub-samples will
produce similar clusterings, and their pairwise similarity score
will be concentrated close to 1.
[0038] Each sub-sample contains a fixed fraction of the data, f.
The actual subsampling can be implemented in various ways:
[0039] 1. Select each sample independently so that the size of the
intersection between two samples is random.
[0040] 2. Select together pairs of samples by first selecting their
intersection, then selecting the rest of the data to complete the
fraction f.
[0041] 3. Fix one clustering solution to be one produced on the
whole dataset. (This third option was used by Levine and Domany
(supra) to give a figure of merit to a particular clustering
solution.)
[0042] For k=1, all clusterings are the same. This also holds for
k=n, where n is the number of data points; in this case every point
is in a different cluster. When the number of clusters becomes
large so that there is a small number of points in each cluster,
the solution becomes stable. The value of f should not be too low
so that there not all clusters are represented in a sub-sample. In
the Examples provided below, the shape of the distribution did not
depend very much on the specific value of f. Any value between 0.6
and 0.9 worked well.
EXAMPLES
[0043] In this section experiments on artificial and real data are
described. In all the experiments the distribution of the
correlation score is shown. Equivalent results were obtained using
other scores as well. The parameter values f=0.8 and 200 pairs of
solutions were compared for each k. A hierarchical clustering
algorithm was used, with the Ward criterion for merging clusters
(see, e.g., Jain and Dubes, supra). Similar results were obtained
using other hierarchical clustering methods (complete and average
linkage). The advantage of using hierarchical clustering methods is
that the same set of clusterings can be used for all values of
k.
Example 1
Gaussian Data
[0044] Referring first to FIGS. 1a-1c, FIG. 1a shows a mixture of
four Gaussians. The histograms of the score for varying values of k
for this data is plotted in FIG. 1b. Histograms are shown for each
value of k in the range of 2 to 7. Observations regarding the
histograms are that at k=2, there is a peak at 1, since almost all
the runs discriminated between the two upper and two lower
clusters. At k=3, most runs separated the two lower clusters, and
at k=4 most runs found the "correct" clustering as is reflected in
the distribution of scores that is still close to 1.0. At k>4
there is no longer essentially one preferred solution. There is, in
fact, a wide variety of solutions, evidenced by the widening
spectrum of the similarities. FIG. 1c plots the cumulative
distributions of the correlation score for each k, where k=2 at the
rightmost side of the plot (at peak 1), and k=7 being the leftmost
curve.
Example 2
DNA Microarray Data
[0045] The next dataset considered was the yeast DNA microarray
data of M. Eisen et. al. ("Genetics cluster analysis and display of
genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, 95:
14863-14868, December 1998.). The data is a matrix which represents
the mRNA expression levels of n genes across a number of
experiments. Some of the genes in the data have known labels
according to a functional class. Five functional classes were
selected along with genes that belong uniquely to these five
functional classes. This yielded a dataset with 208 genes, with 79
features (experiments). Data was normalized by subtracting the mean
and dividing by the standard deviation for each column. This was
also performed for the rows, and repeated for the columns. At this
stage the first three principal components were extracted. The
distribution and histogram of scores is given in FIG. 2a for k over
the range of 2 to 7. The same behavior is observed as seen in the
mixture of four Gaussians data of FIG. 1a. Between k=5 and k=6,
there is a transition from a distribution that has a large
component near 1, to a wide distribution that is very similar to
the distribution on the random data. The clustering solution that
was obtained for k=5 agreed well with the given labels, with a
correlation score of 0.95.
Example 3
Uniformly Distributed Data
[0046] The results of a test run on data uniformly distributed on
the unit cube is shown in FIGS. 3a and 3b. The distributions are
quite similar to each other, with no change that can be interpreted
as a transformation from a stable set of solutions to unstable
solutions
[0047] The preceding examples indicate a simple way for choosing k
as the value where there is a transition from a score distribution
that is concentrated near 1 to a wider distribution. This can be
quantified, e.g., by an increase in the area under the cumulative
distribution function or by an increase in
S(K)=P(s>0.90).
The value of 0.9 is arbitrary, but any value close to 1 would work
on the set of examples considered here.
Example 4
Isolet Letter Pronunciation
[0048] The next test was run on a portion of the ISOLET (Isolated
Letter Speech Recognition) database created by Ron Cole and Mark
Fanty of the Department of Computer Science and Engineering, Oregon
Graduate Institute, Beaverton, Oreg. and available from the UCI
(University of California at Irvine) Repository of Machine Learning
Databases. (This data set was generated by having 150 subjects
speak the name of each letter of the alphabet twice, generating 52
training examples from each speaker.) This test provides an example
of what occurs with there is cluster overlap.
[0049] One thousand points of the original dataset, representing
the letters "a", "c", "d", "e", "f" were used. The columns of the
data were "whitened", then the first three principle components
were extracted. FIG. 4a is a plot of the distribution of the first
two principle components for each of the selected letters. FIG. 4b
provides histograms of the correlation score for each k in the
range of 2 to 7, and FIG. 4c is an overlay of the cumulative
distribution of the correlation score with k=2 towards the far
right side of the plot (near 1) and k=10 towards the left hand side
of the plot.
[0050] Table 1 provides a comparison of the number of clusters
identified using the inventive method against the number of
clusters obtained using other methods for selecting k. These other
methods are described by R. Tibshirani, G. Walther, and T. Hastie
in "Estimating the number of clusters in a dataset via the Gap
statistic", Tech. Report, Department of Statistics, Stanford
University, 2000; also published in JRSSB 2000, where the Gap
Statistic methods is shown to be superior to a number of other
methods.
TABLE-US-00002 TABLE 1 Jain Silhouette KL Gap Subsamp 4 Gauss 4 4 9
4 4 (FIG. 1) Microarray 3 2 4 6 5 (FIG. 2) Isolet 2 5 5 -- 3 (FIG.
4)
[0051] In the case of clustering algorithms whose output depends on
the initial condition, e.g., k-means, a distribution of scores
exists even when considering a fixed sub-sample. In such cases, the
method produces an indication of how varied the solutions can be
for various values of k. It has been observed that for a "good"
value of k, a similar transition occurs, but is generally
"smeared", since k-means produces widely varying solutions. To
address this, a version of k-means similar to that presented by P.
Bradley and U. Fayyad in "Refining initial points for k-means
clustering" (in J. Shavlik, editor, Proc. of the 15.sup.th Inter.
Conf. of Machine Learning (ICML '98), pages 91-99, San Francisco,
Calif., 1998. Morgan Kaufmann) was used. That method produces
initial conditions that converge to near optimal solutions, and use
a fixed initial condition for each value of k. According to the
present invention, k-means clustering produces solutions that are
highly stable with respect to sub-sampling. This good result may be
due to the global optimization criterion, which differs from the
local bottom up approach of hierarchical clustering that appears to
be less stable to sub-sampling.
Example 5
Gene Expression Data
[0052] A data set of 1600 genes with 12 time steps was utilized to
illustrate the process undergone by gene expression profiles.
[0053] First, the genes were ranked in order of "quality" to
pre-select a subset for further analysis. All the genes were ranked
according to three criteria: (1) saliency (the absolute difference
between their min and max value; the larger the better); (2)
smoothness (a coefficient assessing the smoothness of the profile
was computed; the smoother the better); and (3) reliability (the
average standard deviation of the experiment replicated for the
whole profile).
[0054] The ranks of the genes according to these three criteria
were then added to form a combined criterion according to which the
genes were ranked again. The result can be seen in FIGS. 5a and 5b,
which show gene expression temporal profiles. The ten best genes
are depicted in FIG. 5a, while the ten worst genes are shown in
FIG. 5b according to a combined criterion of saliency, smoothness,
and reliability.
[0055] The 5% top quality genes according to the above defined
combined criterion (800 genes) were selected. A kernel clustering
algorithm based on k-means was then run on random subsets of 100
genes among these 800 genes. A maximum number of clusters of ten
were used, but only five did not degenerate. The stability of the
solutions was verified by running again kernel k-means on the
resulting cluster centers. The solution was robust with respect to
increasing the number of genes (doubling, to 1600 genes), changing
the subset size (to 200 genes) and the maximum number of cluster
(to 20 genes).
[0056] FIGS. 6a and 6b illustrate the average profiles of the
clusters of the eight clustering runs and their groupings into
meta-clusters 1-5. The cluster centers in FIG. 6a were obtained
with multiple runs of k-means using random subsets of 100 genes in
the top 800 best quality genes. FIG. 6b shows the average cluster
centers for the five clusters. Only a subset of the nine possible
profiles that could occur is represented.
[0057] According to the present invention, the clustering algorithm
is based on, but is a variation of, the classical k-means
algorithm. The algorithm operates by the following routine: [0058]
Initialize: Start with a random assignment of class labels to the
patterns. [0059] Step 1: Compute cluster centers by averaging class
members in each class. [0060] Step 2: Re-assign patterns to the
cluster with nearest cluster center. [0061] Iterate step 1 and 2
until the assignment of patterns to classes remains constant.
[0062] This algorithm differs from the classical k-means in that it
uses a special metric to measure proximity in step 2. It is based
on the residuals of a fit of one profile onto another, using an
affine transformation that is a combination of translation,
scaling, and rotation (to first order). Such fit is remarkably
simple to implement with a couple of lines of MATLAB.RTM. (The
MathWorks, Inc., Natick, Mass.) and is quite fast. For example, to
fit a profile (a 12 dimensional vector x2) that goes through zero
between time step 5 and 6, onto another vector x1, one can
write:
[0063] xx2=[x2,ones(12,1),[-5;-4;-3;-2;-1;1;2;3;4;5;6;7]];
[0064] w=xx2\x1;
[0065] x2fit=xx2*w;
[0066] residual=mean((x1-x2fit). 2);
[0067] FIGS. 7a-d depict two profiles and variations on how they
can be fit one onto the other or both on their average. This
provides an illustration of curve fitting with affine
transformations. FIG. 7a shows the two original profiles P1 and P2.
FIG. 7b shows the profile P1 fitted on P2. FIG. 7c shows profile P2
fitted on P1. FIG. 7d shows both profiles fitted to their
average.
[0068] After some experimentation, the following measure of
dissimilarity was adopted: [0069]
residual0+max(residual1,residual2), where residual0 is the squared
Euclidean distance between x1 and x2, residual1 is the residual of
the fit of x1 onto x2, and residual2 is the residual of the fit of
x2 onto x1. The rationale behind this choice is that a
dissimilarity simply based on the symmetric fit to the average is,
in some cases, too optimistic--it becomes very easy with the type
of affine transformations that are being allowed to fit any curve
to a straight line (but not vice versa). Residual 0 is added to
avoid allowing too large transformations. The same desirable
properties could be achieved by solving an optimization problem
under constraints to limit the range of transformations, but this
would be computationally more expensive.
[0070] Any positive monotonic transformation of the dissimilarity
does not affect the algorithm. It should also be noted that in Step
1 of the algorithm, a simple average of the patterns is used, as
opposed to an average of the curves fitted to the cluster centers.
After iterating, there is a significant distortion of the cluster
center profiles, some of which just become straight lines.
[0071] FIGS. 8a-h illustrate the results of one run of the
algorithm on a subset of 100 gene expression profiles with a
maximum cluster number of 10, where expression intensity is plotted
with time. The clustering is obtained from the top 800 best quality
genes. FIG. 8a shows the original profiles with random class label
assignments. FIGS. 8c, 8e and 8g illustrate fitted profiles of
cluster members at successive iterations. FIGS. 8b, 8d, 8f, and 8h
show the cluster centers at successive iterations.
[0072] The present invention provides a method for model selection
in clustering. Most other approaches in the prior art are based on
either the sum squared distances within clusters, or some
combination of within-cluster distances and between-cluster
distances. Not only do these prior art methods impose the notion of
how a cluster should look (e.g., "compact"), most of the methods
performed poorly in practice. The inventive method, on the other
hand, provides a description of the stability of a clustering
method with respect to re-sampling or noise. It is believed that
this stability captures the notion of "inherent" structure that is
the goal in the validating clustering solutions.
[0073] Another advantage of present invention is that it is useful
in identifying the absence of structure in the data. This differs
from most clustering validation algorithms of the prior art, with
the exception of the gap statistic, which can only determine the
number of clusters if that number is larger than 1.
* * * * *