U.S. patent application number 13/865712 was filed with the patent office on 2013-04-18 and published on 2013-09-12 for method and system for detecting discriminatory data patterns in multiple sets of data.
This patent application is currently assigned to The Research Foundation for The State University of New York. The applicant listed for this patent is THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK. Invention is credited to John S. KOVACH, Xuena WANG, Wei ZHU.
Application Number | 20130238251 13/865712 |
Document ID | / |
Family ID | 33555464 |
Filed Date | 2013-04-18 |
United States Patent
Application |
20130238251 |
Kind Code |
A1 |
ZHU; Wei ; et al. |
September 12, 2013 |
METHOD AND SYSTEM FOR DETECTING DISCRIMINATORY DATA PATTERNS IN
MULTIPLE SETS OF DATA
Abstract
A comprehensive analysis procedure for analyzing and comparing
multiple sets of data to detect hidden discriminatory data
patterns. The inventive procedure identifies a best subset of
markers for optimal discrimination between two or more sets of
data. A point-wise test on two or more sets of data is performed to
calculate test statistic values and to generate a statgram, a two-
or higher-dimensional map of the test statistic values along the
range of data. A threshold is then determined for isolating
critical regions of the statgram at each significance level to
provide candidate markers. A subset of markers from the candidate
markers is then selected to discriminate among the sets of data.
The two or more sets of data are classified using the subset of
markers.
Inventors: |
ZHU; Wei; (Stony Brook,
NY) ; WANG; Xuena; (Stony Brook, NY) ; KOVACH;
John S.; (Setauket, NY) |
|
Applicant: |
Name | City | State | Country | Type |
THE RESEARCH FOUNDATION FOR THE STATE UNIVERSITY OF NEW YORK | Albany | NY | US | |
Assignee: |
The Research Foundation for The State University of New York, Albany, NY |
Family ID: |
33555464 |
Appl. No.: |
13/865712 |
Filed: |
April 18, 2013 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10868387 | Jun 14, 2004 | 8478534 |
13865712 | | |
60477529 | Jun 11, 2003 | |
60553433 | Mar 15, 2004 | |
Current U.S.
Class: |
702/19 ;
702/179 |
Current CPC
Class: |
Y02A 90/10 20180101;
Y02A 90/26 20180101; G06K 9/6262 20130101; H01J 49/0036 20130101;
G06K 9/6217 20130101; G06F 17/18 20130101; G16B 20/00 20190201;
G16H 50/70 20180101; G16B 40/00 20190201 |
Class at
Publication: |
702/19 ;
702/179 |
International
Class: |
G06F 17/18 20060101
G06F017/18 |
Claims
1. A system for detecting discriminatory data patterns in multiple
sets of data, each unit of data being described uniquely by one or
more coordinates, the system comprising: a point-wise test module
for performing a point-wise test on a training set of data to
calculate a plurality of test statistic values for corresponding
units of data from two or more groups of data representing known
conditions; a threshold module for determining a threshold test
statistic value based on a selected significance level and
selecting units of data having a test statistic value with an
absolute value exceeding the threshold value, the selected units of
data comprising candidate marker elements; a marker selection
module for selecting a subset of marker elements from the candidate
marker elements to discriminate among the two or more groups of
data; and a classification module for classifying a testing set
into the two or more groups of data using the subset of marker
elements.
2. The system of claim 1, comprising a variance stability module
for checking at least one of the candidate marker elements and the
subset of marker elements for variance stability.
3. The system of claim 1, wherein the threshold module performs a
multiple-test correction of data of the training set.
4. The system of claim 1, comprising a pre-processing module for
presmoothing the multiple sets of data with a Gaussian kernel.
5. A system for detecting a disease from data of biological
samples, the system comprising: a sampling module for randomly
sampling data that has been standardized and smoothed to divide the
data into a training set and a testing set, each set comprising
random samples from subjects affected and unaffected by the
disease; a point-wise test module for performing a point-wise test
on the training set of data to determine test statistic values
indicative of the difference between corresponding data values of
the samples of the affected and the unaffected subjects; a
threshold module for determining a threshold test statistic value
based on a selected significance level and for selecting data
values having a test statistic value with an absolute value that
exceeds the threshold, the selected units of data comprising
candidate marker elements; a marker selection module for selecting
a subset of marker elements from the candidate marker elements to
discriminate between the affected and the unaffected samples of the
training set; and a classification module for classifying the
testing set of data as representing affected or unaffected samples
using the subset of marker elements.
6. The system of claim 5, wherein the data comprises spectral
intensity values in a spectrum for each of the biological samples,
the system comprising a pre-processing module for standardizing and
smoothing the spectral intensity values across the spectrum for
each of the biological samples.
7. A computer readable medium comprising code for detecting
discriminatory data patterns in multiple sets of data, each unit of
data being described uniquely by one or more coordinates, the code
comprising instructions for: performing a point-wise test on a
training set of data to calculate a plurality of test statistic
values for corresponding units of data from two or more groups of
data representing known conditions; determining a threshold test
statistic value based on a selected significance level and
selecting those units of data having a test statistic value with an
absolute value that exceeds the threshold, the selected units of
data comprising candidate marker elements; selecting a subset of
marker elements from the candidate marker elements to discriminate
among the two or more groups; and classifying a testing set of data
into the two or more groups using the subset of marker elements.
Description
RELATED PATENT APPLICATIONS
[0001] This application is a Continuation application of U.S.
application Ser. No. 10/868,387 filed on Jun. 14, 2004 and claims
priority from U.S. Provisional Patent Application No. 60/477,529
filed on Jun. 11, 2003 and from U.S. Provisional Patent Application
No. 60/553,433 filed on Mar. 15, 2004, each of which is entitled A
METHOD AND SYSTEM FOR DETECTING DISCRIMINATORY DATA PATTERNS
BETWEEN MULTIPLE SETS OF DATA and each of which is incorporated
herein by reference in its entirety.
FIELD OF THE INVENTION
[0002] The present invention relates to methods and systems for
analyzing data, particularly for detecting patterns that
distinguish multiple sets of data.
BACKGROUND OF THE INVENTION
[0003] In many systems or phenomena, naturally occurring or
otherwise, distinctive patterns of data are often buried within the
highly complex data sets that are created to characterize such
systems or phenomena. Such patterns have been observed, for
example, in the study of a wide variety of systems and phenomena
such as diseases, environmental conditions, and financial
conditions, to name a few.
[0004] The distinctive patterns of data that may characterize
certain conditions are often not obvious or apparent using existing
classification methods and systems. The current classification
systems and methods typically find or uncover a single known
differentiating feature between sets of data or analyze only a
subset of the data. A hidden pattern found in one dataset is
generally not applicable to another dataset. That is, these systems
generally require "retraining" on each new set of data and cannot
completely characterize the dataset.
[0005] For example, current classification systems cannot
effectively screen for early stage ovarian cancer. In its early
stages, ovarian cancer is an insidious disease, exhibiting
essentially no symptoms. Ovarian tumors may grow to a size of about
10-12 cm before impinging on adjacent organs, resulting in
symptoms, such as increased urinary frequency and rectal pressure.
More than 80% of ovarian cancer patients currently are diagnosed at
a late clinical stage as a result of the absence of early stage
symptoms and the associated 5-year survival rate is only 35%. If
ovarian cancers are diagnosed at an early stage, however, the
5-year survival rate is more than 90% because, in most cases, the
cancer can then be eradicated completely by surgery.
[0006] An estimated 25,000 women are diagnosed with ovarian cancer
annually in the United States and approximately 14,500 women die
from the disease each year. An effective screening program for
early stage ovarian cancer has been elusive, however, due to
factors which include the lack of a highly specific screening test.
With the rapid development of proteomics, new screening strategies
utilizing modern proteomic technology and bioinformatics are
emerging, but none have shown sufficient specificity to be an
effective diagnostic tool.
[0007] The PROTEIN CHIP array surface-enhanced laser
desorption-ionization (SELDI) mass spectrometry (MS) system,
available from Ciphergen Biosystems of Fremont, Calif., USA, is
increasingly recognized as the leading technology for fast and
reliable protein profiling based on tissue or body fluid samples.
The underlying principle of SELDI is surface-enhanced affinity
capture through the use of specific probe surfaces or protein
chips. Once captured on the SELDI protein chip array, proteins are
detected through the ionization-desorption, time-of-flight mass
spectrometry process. The PROTEIN CHIP SELDI-MS has been useful in
identifying known markers of prostate cancer and in discovering
potential markers which are over- or under-expressed in prostate
cancer cells and body fluids. (See
http://www.ciphergen.com/pub/pc_tech/).
[0008] By comparing serum proteomic spectra of early stage ovarian
cancer patients with a comparable group of unaffected women using a
bioinformatics algorithm, a recent study has identified a set of
proteomic markers and has been able to classify subjects with a
sensitivity of 100% and a specificity of 95%. See Petricoin et al.
"Use of Proteomic Patterns in Serum to Identify Ovarian Cancer,"
The Lancet, vol. 359, pp. 572-577 (Feb. 16, 2002), incorporated
herein by reference in its entirety and hereinafter referred to as
the "Lancet Paper;" and U.S. Published Patent Application No.
2003/0004402 A1, entitled "Process for discriminating between
biological states based on hidden patterns from biological data,"
also incorporated herein by reference in its entirety.
[0009] Shortly after the publication of the Lancet Paper, using the
same set of subjects and an improved protein surface, Petricoin et
al. achieved better results with a sensitivity of 100% and a
specificity of 97%. See "Correspondence," The Lancet, vol. 360, pp.
169-171 (Jul. 13, 2002), incorporated herein by reference in its
entirety and hereinafter referred to as the "Correspondence." The
corresponding proteomic mass spectrum data set (referred to as
"ovarian data set 4-3-02") is publicly available at the NIH/FDA
Clinical Proteomic Program Databank at the following website:
http://www.clinicalproteomics.steem.com/databank.php.
[0010] While applauding this accomplishment, many remained
skeptical about the screening value of the method described by
Petricoin et al. in the Correspondence. In fact, Petricoin et al.
stated that the prevalence of ovarian cancer in postmenopausal
women is 1 in 2,500, which means that a screening assay with 97%
specificity would result in 75 false positives for every true
positive identification.
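The figure of 75 false positives per true positive follows directly from the quoted prevalence and specificity. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and the 100% sensitivity is taken from the result reported in the Correspondence.

```python
def false_positives_per_true_positive(prevalence, sensitivity, specificity):
    """Expected number of false positives for each true positive in a screened population."""
    true_positive_rate = prevalence * sensitivity
    false_positive_rate = (1.0 - prevalence) * (1.0 - specificity)
    return false_positive_rate / true_positive_rate


# Figures quoted in the text: prevalence 1 in 2,500, sensitivity 100%, specificity 97%.
ratio = false_positives_per_true_positive(1 / 2500, 1.0, 0.97)
print(round(ratio))  # about 75
```

Of every 2,500 women screened, roughly 2,499 are unaffected, and 3% of them (about 75) would test positive alongside the single true case.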
[0011] Several statistics-based analytical tools have been
developed to analyze mass spectra of protein marker
expression for various disorders. The genetic algorithm first
described by John Holland in the mid-1970s manipulated complex data
sets as individual elements through a computer-driven analog of a
natural selection process. In 1982, Kohonen proposed a cluster
analysis method by using a self-organizing map. Correlogic Systems,
Inc. of Bethesda, Md. has combined the ideas of Holland's genetic
algorithm and Kohonen's self-organizing map to implement a pattern
discovery algorithm in software named Proteome Quest, Beta
version 1.0. Petricoin et al. utilized the Proteome Quest software
to analyze the proteomic spectra generated by SELDI-TOF, to
identify ovarian cancer. Petricoin et al. adopted a random window
approach to sequentially select markers and to examine their
contribution towards the classification rate.
[0012] A drawback to Petricoin et al.'s approach is that only
portions of the proteomic spectra are used for the analysis, in
which case the contribution of each marker may vary with the window
size and significant protein markers may be excluded from the
analysis. Conversely, many of the biomarkers predicted by such
known methods will not be statistically significant, so that in
many cases, efforts to determine the underlying molecular identity
and subsequent cell and molecular biology will be fruitless.
[0013] For the large-scale screening for the presence of early
cancer, specificity and sensitivity must approach 100% to assure no
disease is missed and to prevent pursuit of unnecessary additional
diagnostic procedures. Similarly, biomarker identification and
molecular characterization require a high degree of reproducibility
and fidelity for each individual proteomic marker.
[0014] Another challenge when analyzing proteomic data (or genomic
data, in general) is to draw robust conclusions from
high-dimensional data based on relatively few subjects. In addition
to providing 100% accuracy and specificity when analyzing a
particular testing set, a robust analysis method should be able to
determine discriminating markers that can be trusted to accurately
diagnose any random subject from the relevant population at large.
In other words, 100% accuracy and specificity should apply to the
full population.
[0015] In light of the known approaches and their limitations, it
is clear that improved analysis methods and systems are required.
Moreover, it is desirable to have a comprehensive method and system
that considers each data point of the dataset to discover hidden
patterns or markers, thereby enabling the method and system to
detect subtle differences in multiple datasets without retraining
on each dataset. Such a system and method trained, for example, to
detect a toxin from an environmental dataset can be used on another
environmental dataset to detect that toxin without retraining on
the other environmental dataset.
SUMMARY OF THE INVENTION
[0016] The present invention provides a method and system for
analyzing multiple sets of data, more particularly a comprehensive
method and system for detecting subtle hidden differences among
multiple sets of data. The comprehensive method and system obtains
the best discrimination between multiple sets of data that can be
expressed with one or more common coordinates. The inventive method
and system performs a point by point assessment of the data,
thereby enabling the present invention to detect subtle differences
in the data that cannot be detected by current analytical
methods.
[0017] An exemplary embodiment of a method of the present invention
performs a point-wise test on two or more sets of data to calculate
test statistic values and to generate a statgram, a two-, or
higher, dimensional map of the test statistic values along the
range of data. A threshold is then determined for isolating
critical regions of the statgram at each significance level to
provide candidate elements. A subset of elements from the candidate
elements is then selected to discriminate among the sets of data.
The two or more sets of data are classified using the subset of
elements.
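The point-wise test and thresholding described above can be illustrated with a short sketch. It assumes a two-sample t statistic with unpooled variances (one possible form of the test, chosen here for illustration) and toy spectra of three points each; the function names are hypothetical, and the threshold of 4.22 is the critical value cited for FIG. 6A.

```python
import math
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample t statistic with unpooled variances (an assumed form of the test)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def statgram(spectra_a, spectra_b):
    """Point-wise test: one test statistic per coordinate shared by both groups."""
    n_points = len(spectra_a[0])
    return [
        t_statistic([s[i] for s in spectra_a], [s[i] for s in spectra_b])
        for i in range(n_points)
    ]


def candidate_indices(stat_values, threshold):
    """Coordinates whose statistic exceeds the threshold in absolute value."""
    return [i for i, t in enumerate(stat_values) if abs(t) > threshold]


# Toy data (hypothetical): four spectra per group, three coordinates per spectrum.
affected = [[1.0, 5.0, 2.0], [1.2, 5.4, 2.1], [0.9, 5.1, 1.9], [1.1, 5.3, 2.0]]
unaffected = [[1.1, 2.0, 2.0], [0.9, 2.2, 2.1], [1.0, 1.9, 1.8], [1.2, 2.1, 2.1]]

stats = statgram(affected, unaffected)
candidates = candidate_indices(stats, threshold=4.22)
print(candidates)  # only the middle coordinate differs between the groups
```

Only the coordinates that survive the threshold are passed on, so the later marker selection works on far fewer points than the full range of data.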
[0018] An exemplary embodiment of a system of the present invention
comprises a statgram module, a threshold module, a marker selection
module and a classifier. The statgram module performs a point-wise
test on the sets of data to calculate test statistic values and
generate a statgram. The threshold module determines a threshold to
determine critical regions of the statgram at each significance
level to provide candidate elements and the marker selection module
selects a subset of elements from the candidate elements to
discriminate the sets of data. The classifier classifies the sets
of data using the subset of elements.
[0019] The comprehensive statistical method and system of the
present invention can analyze biological datasets or non-biological
datasets to detect the presence of unique data, elements or
markers, or quantitative differences in the same data elements or
markers.
[0020] The present invention is applicable to a wide range of
fields including, for example, biology, medicine, chemistry, and
economics.
[0021] In an exemplary embodiment of the present invention,
environmental samples such as air samples, soil samples and water
samples are compared and analyzed to detect the presence of a
particular substance or radiation in the environment, thereby
providing a bio-hazard detector, for example.
[0022] In another exemplary application of the present invention,
the comprehensive method and system analyzes and compares tissues
and body fluids (such as serum) of diseased and control subjects so
as to draw conclusions regarding, for example, the existence,
progression or regression of a diseased state. An exemplary
embodiment of the present invention examines the complete proteomic
spectrum of biological samples and selects all of the significantly
different biomarkers using random field theory. A best-subset
discriminant analysis is then used to choose the most significant
biomarkers. An exemplary embodiment of the present invention is
described below in connection with the early diagnosis of ovarian
cancer. This embodiment has been used to re-analyze two public
ovarian cancer data sets with 100% specificity and 100%
sensitivity.
[0023] It is an object of the present invention to overcome the
shortcomings of known methods described supra.
[0024] Another object of the present invention is to provide a
comprehensive analytical method and system for detecting
discriminatory data patterns between sets of data where each data
element can be described by one or more coordinates.
[0025] Still another object of the present invention is to provide
a comprehensive analytical method and system as aforesaid, which
performs a point by point assessment of the data to detect subtle
differences in the datasets without retraining on each dataset.
[0026] A further object of the present invention is to provide a
computer readable medium comprising a code for detecting
discriminatory data patterns between sets of data, each data
element being described uniquely by one or more coordinates. The
code comprises instructions for performing a point-wise test on
the sets of data to calculate test statistic values and generate a
statgram, determining a threshold to determine critical regions of
the statgram at each significance level to provide candidate
elements, selecting a subset of k elements from the candidate
elements to discriminate between the sets of data, and classifying
the sets of data using the subset of k elements.
[0027] A still further object of the present invention is to
provide a comprehensive analytical method and system for analyzing
and comparing sample data (e.g., tissue and body fluids, such as
serum) of affected (i.e., subjects having a disease or disorder or
a predisposition for a disease or disorder) and control
(unaffected) subjects, e.g., via analysis of protein mass spectra.
The inventive method identifies a best subset of "k" elements or
markers, where "k" is any positive integer, found in the sample
data (e.g., serum) to allow for optimal discrimination between two
or more groups.
[0028] The markers are identified as potential candidates for
further biological examination, so as to identify markers involved
in the disorder or disease. As discussed below, the inventive
methodology achieved perfect discrimination (100% sensitivity, 100%
specificity) between patients with early stage ovarian cancer and
normal controls (including benign cases). Ovarian cancer diagnosis
is exemplary of the inventive technique, which is not only
invaluable in screening for ovarian cancer, but can be used for
early diagnosis, treatment development and evaluation of patients
where any disease or condition of interest is involved.
[0029] In accordance with an embodiment of the present invention,
the comprehensive analysis method and system examines and
quantifies the role of each protein marker along the mass spectrum.
All markers with significantly different expression levels between
the affected and unaffected subsets, at a given experimentwise
error rate, are selected for the ensuing best-subset discriminant
analysis to determine the optimal set of markers for diagnosis of
the disorder or disease, such as cancer. The inventive method is
highly effective for, e.g., ovarian cancer detection, as it
achieved perfect discrimination between diseased and normal
subjects (including benign cases).
[0030] It is a further object of the present invention to provide a
method and system for the longitudinal analysis of a time series of
mass spectra, such as serum protein mass spectra.
[0031] The foregoing has outlined rather broadly the features and
technical advantages of the present invention in order that the
detailed description of the invention that follows may be better
understood. Additional features and advantages of the present
invention will be described hereinafter which form the subject of
the claims of the invention. It should be appreciated by those
skilled in the art that the specific concepts and embodiments
disclosed may be readily utilized as a basis for modifying or
designing other structures for carrying out the same purposes of
the present invention. It should also be realized by those skilled
in the art that such equivalent constructions do not depart from
the spirit and scope of the invention as set forth in the appended
claims. The novel features which are believed to be characteristic
of the invention, both as to its organization and method of
operation, together with further objects and advantages will be
better understood from the following description when considered in
connection with the accompanying figures. It is to be expressly
understood, however, that each of the figures is provided for the
purpose of illustration and description only and is not intended as
a definition of the limits of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0032] For a more complete understanding of the present invention,
reference is now made to the following descriptions taken in
conjunction with the accompanying drawings, in which:
[0033] FIG. 1 is a functional diagram of a computer 100 in
accordance with an embodiment of the present invention.
[0034] FIGS. 2A and 2B are average serum protein mass spectra of
unaffected women and women with ovarian cancer, respectively. FIGS.
2C-2H are magnified plots of various protein markers shown in FIGS.
2A and 2B.
[0035] FIG. 3 is a flow chart detailing the process of analyzing
serum protein mass spectra in accordance with an exemplary
embodiment of the present invention.
[0036] FIG. 4 is a graph illustrating an exemplary statgram in
accordance with an exemplary embodiment of the present
invention.
[0037] FIG. 5 is an exemplary graph illustrating the relationship
among full width at half maximum (FWHM), experimentwise error rate,
and critical value threshold.
[0038] FIG. 6A is a graph illustrating the statgram of FIG. 4 with
positive and negative thresholds drawn at the critical value of
4.22 and FIG. 6B shows the statgram at values beyond the
thresholds, thereby showing the markers remaining after the
thresholding procedure.
[0039] FIG. 7A is an illustration of a smoothing Gaussian kernel
for use in an exemplary embodiment of the present invention. FIG.
7B shows the relationship between the range of the 11 adjacent
points within the Gaussian smoothing kernel (y-axis) and the median
of the range (x-axis) when FWHM=11, as in the exemplary embodiment.
FIG. 7C shows the relationship between the ratio of the range over
its median (y-axis) and the median of the range (x-axis) when
FWHM=11.
DETAILED DESCRIPTION
[0040] The comprehensive statistical method of the present
invention can be applied to analyze any type of raw data, including
clinical or environmental data where the sample may be, e.g.,
water, air, soil, serum, blood, saliva, plasma, nipple aspirate,
synovial fluid, cerebrospinal fluid, sweat, urine, fecal matter,
tears, bronchial lavage, swabbings, needle aspirant, semen, vaginal
fluid, pre-ejaculate, etc., to carry out, e.g., screening for
contaminants, pathologic diagnosis, toxicity status, efficacy of a
drug, screening or prognosis of a disease. In fact, the present
invention can be used to analyze any data that can be expressed
with common coordinates. That is, the multiple sets of data can be
mapped into the same coordinate system, thereby enabling comparison
between multiple sets of data. Moreover, the present invention can
analyze data from mass spectroscopy, liquid chromatography,
two-dimensional gel chromatography, gas chromatography, etc.
[0041] In accordance with an embodiment of the present invention,
the comprehensive statistical method and system can be used to
analyze spectrum data of tissues or body fluids (referred to
collectively as the "biological sample") for pattern detection and
discrimination purposes (i.e., diseased and non-diseased subjects,
or change in state of the disease). The following is a description
of an exemplary embodiment of the present invention adapted to
protein analysis of serum samples for the purpose of detecting
ovarian cancer. Although the present invention is described herein
with the ovarian cancer example, it is appreciated that the present
invention can be applied to any multiple sets of data (including
biological or non-biological data) that can be expressed with
common coordinates to detect subtle hidden patterns and/or
discriminate between multiple sets of data.
[0042] An embodiment of the present invention has been applied to
the same set of ovarian cancer patients and normal subjects
(including benign cases) analyzed by Petricoin et al. and described
in the Lancet Paper and at the NIH clinical proteomics program
website (the "Ovarian Dataset 4-3-02"). Although this exemplary
application involves classifying or categorizing the data into two
states or categories (i.e., normal and cancerous), it is
appreciated that the present invention can classify or categorize
the data into multiple states, e.g., normal, early stage cancerous
and late stage cancerous. This can be done using the analysis of
variance (ANOVA) F-test or the analysis of covariance (ANOCOVA)
F-test for marker selection, described in greater detail below.
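For more than two states, the point-wise statistic may be an F statistic rather than a t statistic. The following is a minimal sketch of a one-way ANOVA F statistic computed at a single coordinate; the function name and the three toy groups are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for the values observed at a single coordinate."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = mean([x for g in groups for x in g])
    # Between-group and within-group sums of squares.
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return (between / (k - 1)) / (within / (n_total - k))


# Toy intensities (hypothetical) at one m/z value for three states.
normal = [1.0, 1.2, 0.8]
early_stage = [2.0, 2.2, 1.8]
late_stage = [3.0, 3.2, 2.8]

f_value = anova_f([normal, early_stage, late_stage])
print(f_value)
```

Repeating this calculation at every coordinate would give a map of F values that can be thresholded in the same way as the two-group statistic.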
[0043] The disease status in the Ovarian Dataset 4-3-02 is given in
Table I.
TABLE I
| Number of Subjects |
Unaffected Women | |
No evidence of ovarian cysts | 61 |
Benign ovarian cysts <2.5 cm | 30 |
Benign ovarian cysts >2.5 cm | 8 |
Benign gynecological disease | 10 |
No gynecological disease | 7 |
Subtotal | 116 |
Women with Ovarian Cancer | |
Stage I | 24 |
Stages II, III, IV | 76 |
Subtotal | 100 |
Grand Total | 216 |
[0044] The median ages of the 116 control subjects and the 100
ovarian cancer patients were 49 (range 21-75) and 58 (range 29-82),
respectively. Based on the age distribution, premenopausal and
postmenopausal women were equally represented in both groups (i.e.,
training data set and testing data set).
[0045] The serum protein mass spectrum generated for each subject
was depicted by 15,200 mass over charge ratios (m/z values) on the
x-axis and the corresponding intensity on the y-axis.
[0046] FIGS. 2A and 2B depict the average spectrum intensity of 116
unaffected women and 100 women with ovarian cancer, respectively.
As can be seen, there are significant differences in expression
levels of certain protein markers, as shown in greater detail in
FIGS. 2C-2H and indicated by dashed vertical lines in FIGS. 2A and
2B. A robust method, however, such as that of the present
invention, can determine whether such potential differences in
protein expression level are statistically significant (systematic
and persistent in the population), as opposed to being due to
random factors, such as outliers, sampling fluctuations, or the
like.
[0047] Using the above-described ovarian cancer data, an exemplary
embodiment of the present invention as applied to such data will
now be described in greater detail.
Method
[0048] An exemplary embodiment of a method of the present invention
includes a plurality of steps, each of which is described in
greater detail below. Each step has a statistical basis and a
probabilistic interpretation. The steps, in general, serve as
filters, to reduce the size and dimensionality of the data
processed by subsequent steps. For this reason, the computational
efficiency of the early steps and their ability to reduce data
dimensionality for the later steps are important.
[0049] At least two groups of data are initially provided, one
being derived from a control group of unaffected subjects and the
other being derived from a group of subjects affected by the
condition of interest (e.g., ovarian cancer).
[0050] In an exemplary embodiment, the method of the present
invention includes the following steps:
[0051] (1) Pre-processing the data on an individual group basis as
necessary for the steps below and/or removing outlying data;
[0052] (2) Randomly selecting a training data set and a testing data
set from each group of data;
[0053] (3) Performing a point-wise two-sample t/z-test between the
groups in the training data set, generating a map of the test
statistic values (statgram);
[0054] (4) Selecting critical region(s) of the statgram based on a
desired significance level;
[0055] (5) Selecting markers from the critical region(s);
[0056] (6) Checking the variance stability of the markers selected;
[0057] (7) Validating;
[0058] (8) Resampling (e.g., repeating steps 2-7); and
[0059] (9) Resolving any differences between the selected markers
after steps 7 and 8, respectively.
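Steps 2 and 8 can be sketched as a random split of each group that is repeated for several resampling rounds. In the sketch below, the group sizes of 116 and 100 come from Table I, while the function names, the training-set size of 50 per group, the number of rounds, and the seed are assumptions made only for illustration.

```python
import random


def split_group(subject_ids, n_train, rng):
    """Randomly split one group into a training part and a testing part."""
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def resampled_splits(unaffected_ids, affected_ids, n_train_each, n_rounds, seed=0):
    """Yield independent random training/testing splits, one per resampling round."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        train_u, test_u = split_group(unaffected_ids, n_train_each, rng)
        train_a, test_a = split_group(affected_ids, n_train_each, rng)
        yield (train_u, train_a), (test_u, test_a)


# Subject identifiers (hypothetical): 116 unaffected and 100 affected subjects.
unaffected = list(range(116))
affected = list(range(116, 216))

splits = list(resampled_splits(unaffected, affected, n_train_each=50, n_rounds=3))
for (train_u, train_a), (test_u, test_a) in splits:
    print(len(train_u), len(train_a), len(test_u), len(test_a))
```

Each round draws a fresh split from every group, so the markers selected in one round can be compared against those selected in another, as step 9 requires.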
[0060] Each step of this exemplary process is described below in
greater detail.
Step 1: Pre-Processing the Data
[0061] The data is pre-processed as necessary for performing the
other steps of the method, and/or to remove outliers. A purpose of
such pre-processing is to reduce the noise content of the data and
thus improve its signal-to-noise ratio. Such pre-processing also
enables multiple-test correction via the random field theory,
described in greater detail below.
[0062] In the exemplary embodiment, the data comprises mass
spectral information indicative of molecules present in a
biological sample, such as proteins in human blood. Pre-processing
of the spectral data may include smoothing the data to obtain a
Gaussian distribution. In particular, for a given group, an average
spectrum is calculated and filtered to obtain a Gaussian
distribution.
[0063] In an exemplary embodiment, the data pre-processing includes
performing normalization to remove the overall variation of
spectral strength between different subjects and smoothing via
Gaussian filters, performed on each individual spectrum.
[0064] More particularly, the relative intensity for each mass
spectrum is obtained by dividing the intensity at each m/z value
(mass to charge ratio) by the average intensity of the entire
spectrum. Such standardization ensures comparability across
different spectra. Each relative spectrum is then smoothed by a
Gaussian kernel with a full width half maximum (FWHM) value
sufficiently small to preserve the original spectrum while being at
least as large as the mass accuracy of the system used to generate
the spectrum. For example, the mass accuracy of the Ciphergen
system is 0.1%. Thus, a particle with a detected mass value of x
could have the same true mass as particles in its neighborhood
within a range of 0.2% x (i.e. x.+-.0.1% x). It has been determined
that the smallest FWHM that will achieve a mass accuracy of at
least 0.1% for the entire spectrum is 11.
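The standardization and smoothing described in paragraphs [0063] and [0064] can be sketched as follows. This is an illustrative sketch only: it assumes the FWHM is measured in data points, truncates the kernel at about three standard deviations, and renormalizes the weights at the ends of the spectrum. The function names and sample values are hypothetical.

```python
import math

def relative_spectrum(intensities):
    """Divide each intensity by the mean intensity of the whole spectrum."""
    mean = sum(intensities) / len(intensities)
    return [v / mean for v in intensities]

def gaussian_kernel(fwhm, half_width=None):
    """Normalized Gaussian weights; sigma follows from FWHM = 2*sqrt(2*ln 2)*sigma."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    if half_width is None:
        half_width = int(math.ceil(3 * sigma))  # assumption: truncate near 3 sigma
    weights = [math.exp(-0.5 * (i / sigma) ** 2)
               for i in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, kernel):
    """Gaussian-weighted average of neighbors; weights are renormalized at the edges."""
    h = len(kernel) // 2
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in enumerate(kernel):
            j = i + k - h
            if 0 <= j < len(values):
                num += w * values[j]
                den += w
        out.append(num / den)
    return out

# Hypothetical intensities, not taken from any data set.
spectrum = [1.0, 2.0, 8.0, 2.0, 1.0, 1.0, 3.0, 1.0] * 5
rel = relative_spectrum(spectrum)
smoothed = smooth(rel, gaussian_kernel(fwhm=11))
```

After standardization the mean relative intensity of each spectrum is 1, which is what makes different spectra comparable.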
[0065] FIG. 7A is an illustration of a Gaussian kernel as used in a
data pre-processing step of an exemplary embodiment of the present
invention. The smoothed intensity of the biomarker with m/z value
.mu. is calculated as the weighted average, proportional to the
Gaussian density, of the intensities of its neighboring
biomarkers.
[0066] FIG. 7B shows the relationship between the range of the 11
adjacent points within the Gaussian smoothing kernel (y-axis) and
the median of the range (x-axis) when FWHM=11, as in the exemplary
embodiment.
[0067] FIG. 7C shows the relationship between the ratio of the
range over its median (y-axis) and the median of the range (x-axis)
when FWHM=11. As shown in FIG. 7C, for FWHM=11, the ratio of the
range over its median (y-axis) is above 0.002 (0.2%) for most of
the spectrum and approaches 0.002 only toward the highest m/z
values of the spectrum.
Step 2: Creating a "Training Set" and a "Testing Set"
[0068] After standardization and smoothing, the subjects are
divided into a training set and a testing set. The training set
comprises a predetermined number of subjects selected at random
from the group of subjects known to have the condition of interest
and a predetermined number of subjects selected at random from the
group of subjects known not to have the condition of interest. The
testing set comprises the remaining subjects. For example, in an
exemplary processing of the above-described Ovarian Dataset 4-3-02,
of the 216 subjects, the training set consisted of a random sample
of 50 subjects with ovarian cancer and a random sample of 50
unaffected subjects. The testing set consisted of the remaining 116
subjects, 50 of which were affected and 66 of which were
unaffected.
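The random split can be sketched as below. The group sizes (100 affected and 116 unaffected subjects) follow from the counts given above; the subject identifiers, function name, and fixed seed are hypothetical and serve only to make the sketch reproducible.

```python
import random

def split_group(subject_ids, n_train, rng):
    """Randomly pick n_train subjects for training; the rest form the testing set."""
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
affected = [f"A{i}" for i in range(100)]    # 50 training + 50 testing
unaffected = [f"U{i}" for i in range(116)]  # 50 training + 66 testing

train_aff, test_aff = split_group(affected, 50, rng)
train_un, test_un = split_group(unaffected, 50, rng)
```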
Step 3: Creating a Statgram
[0069] Once the training and testing sets have been created, a
statgram is generated based on the data of the subjects in the
training set. In the exemplary embodiment in which the data is
proteomic spectral information, the statgram is a two-dimensional
map of test statistic values along the range of m/z values. In
order to generate the statgram, an independent samples t/z test is
performed at each m/z value to compare the spectral intensities
between the two groups of subjects (i.e., affected and unaffected)
in the training set. (The z-test is preferably used when both
sample groups are large, whereas the t-test is preferably used when
at least one sample group is small. For the exemplary ovarian cancer
application, the z-test is used.) The null hypothesis is that the
intensities are equal between the groups at each m/z value and the
alternative hypothesis is that they differ.
[0070] The test statistic value t(x) at each m/z value (x) is
determined as follows:
$$t(x) = \frac{\bar{y}_1(x) - \bar{y}_2(x)}{\sqrt{s_1^2(x)/n_1 + s_2^2(x)/n_2}} \qquad \text{(Eq. 1)}$$
[0071] where y.sub.1(x) and y.sub.2(x) are the sample means, and
s.sub.1.sup.2(x) and s.sub.2.sup.2(x) the sample variances, of the
intensities at x in the two groups of the training set, and n.sub.1
and n.sub.2 are the numbers of training samples in each of the two
groups.
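A minimal sketch of the point-wise computation of Eq. 1 is given below. It assumes each spectrum is a list of intensities on a common m/z grid; the function names and the toy intensities are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(group1, group2):
    """Eq. 1 at one m/z value: difference of sample means over its standard error."""
    n1, n2 = len(group1), len(group2)
    se = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / se

def statgram(spectra1, spectra2):
    """Apply Eq. 1 at every grid index of two groups of equal-length spectra."""
    n_points = len(spectra1[0])
    return [
        t_statistic([s[x] for s in spectra1], [s[x] for s in spectra2])
        for x in range(n_points)
    ]

# Toy data: the groups agree at index 0 and differ at index 1.
affected = [[1.0, 5.0], [1.2, 5.5], [0.8, 6.0], [1.1, 5.2]]
unaffected = [[1.1, 2.0], [0.9, 2.4], [1.0, 1.8], [1.2, 2.2]]
t = statgram(affected, unaffected)
```

In this toy example the statistic is near zero at the first index and large at the second, which is the pattern the statgram is meant to expose.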
[0072] The test statistic values t(x) are plotted against the m/z
values (x) to generate the statgram. FIG. 4 shows an exemplary
statgram for an embodiment of the present invention adapted to the
study of ovarian cancer. In the case where both samples are large
(e.g., in the example above where n.sub.1=n.sub.2=50), the test
statistic t(x) follows approximately the standard normal
distribution under the null hypothesis.
[0073] At each m/z value (x), the larger the absolute value of the
test statistic t(x), the stronger the evidence supporting the
alternative hypothesis that the average spectral intensities are
different between the two groups. As such, those m/z values tend to
be indicative of possibly significant markers.
Step 4: Thresholding the Statgram--Multiple-Test Correction and
Critical Region Selection
[0074] Once the statgram has been generated, as described above, a
thresholding procedure is carried out in which those regions of the
statgram are selected in which the absolute value of the test
statistic exceeds a critical value (or threshold) determined as a
function of a desired significance level.
[0075] When only one test is performed (i.e., for one m/z value),
one would reject the null hypothesis at a significance level of
0.05 if |t(x)| exceeds the critical value of 1.96 for the z-test.
In other words, for an m/z value for which the absolute value of
the test statistic (determined per eq. 1) is greater than 1.96,
there is a 95% probability that the difference in spectral
intensities between the affected and unaffected groups at that m/z
value is a real difference (i.e., true for the populations) and
not just a spurious difference that may be caused by the
variability in random sampling or the like. However, since the
total number of tests will be equal to the total number of data
points in each spectrum, the confidence level would be much lower
than 95% for the entire set of tests if each test were to be
conducted at the significance level of 0.05. To determine a
suitable significance level for each test, a multiple-test
correction is performed so as to have 95% confidence that differences
identified are real.
[0076] There are several methods available for multiple-test
correction such as the Bonferroni method, the Tukey method, and the
random field theory (RFT) method. Methods such as the Tukey and
Bonferroni methods tend to be more conservative than the RFT
method. In an exemplary embodiment, RFT is employed for the
multiple-test correction. A prerequisite for RFT multiple-test
correction is that each spectrum be a one-dimensional Gaussian
field. This can be achieved by presmoothing the spectral data with
a Gaussian kernel, as described above. The Gaussian kernel is
uniquely determined by its FWHM.
[0077] The relationship between the experimentwise error rate
.alpha. and the critical value t for each z-test is given by the
following expression:
$$\alpha \approx \int_t^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du + K \frac{\sqrt{\ln 2}}{\pi\,(\mathrm{FWHM})} e^{-t^2/2} \qquad \text{(Eq. 2)}$$
where K is the total number of tests.
[0078] The relationship between the experimentwise error rate
.alpha. and the critical value t for each t-test with v degrees of
freedom is given by the following expression:
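$$\alpha \approx \int_t^{\infty} \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{v\pi}} \left(1+\frac{u^2}{v}\right)^{-\frac{v+1}{2}} du + K \frac{\sqrt{\ln 2}}{\pi\,(\mathrm{FWHM})} \left(1+\frac{t^2}{v}\right)^{-\frac{v+1}{2}} \qquad \text{(Eq. 3)}$$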
[0079] FIG. 5 shows the relationship between the experimentwise
error rate .alpha. and the critical value t for different Gaussian
smoothing kernels. As shown in FIG. 5, there is little variation in
the critical value when the Gaussian kernel FWHM varies between 10
and 20 data points. As described above in connection with step 1, a
Gaussian smoothing kernel with FWHM of 11 is used in the data
preprocessing step in an exemplary embodiment of the present
invention. Furthermore, in the exemplary embodiment, the spectrum
for each subject tested has 15,154 m/z ratios (i.e., K=15,154). In
order to achieve an error no higher than .alpha.=0.05 (2-sided),
the critical value t in accordance with RFT multiple-test
correction is determined to be 4.22.
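As a numerical check, the sketch below evaluates the right-hand side of Eq. 2 and solves for the critical value by bisection. It assumes Eq. 2 reads as a standard normal tail plus the term K sqrt(ln 2) / (pi FWHM) exp(-t^2/2); the function names and the bisection bracket are hypothetical. Only K=15,154, FWHM=11, and .alpha.=0.05 are taken from the text.

```python
import math

def rft_error_rate(t, n_tests, fwhm):
    """Right-hand side of Eq. 2 (as read above) for a z-statistic field."""
    # Integral of the standard normal density from t to infinity.
    tail = 0.5 * math.erfc(t / math.sqrt(2.0))
    smooth_term = (n_tests * math.sqrt(math.log(2.0)) / (math.pi * fwhm)
                   * math.exp(-t * t / 2.0))
    return tail + smooth_term

def rft_critical_value(alpha, n_tests, fwhm, lo=0.0, hi=10.0):
    """Bisection; the error rate decreases monotonically in t on this bracket."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if rft_error_rate(mid, n_tests, fwhm) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_crit = rft_critical_value(alpha=0.05, n_tests=15154, fwhm=11)
```

Under these assumptions the solution rounds to 4.22, in agreement with the critical value stated above.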
[0080] The critical value generated by the RFT method is less
conservative than the critical value that would be generated by the
Tukey or Bonferroni methods. For example, to ensure an
experimentwise error rate of 0.05 (2-sided), the Bonferroni method
requires that each test be performed at the significance level of
0.05/15154=3.299459.times.10.sup.-6 (2-sided). The corresponding
critical value for a normal test is 4.65. That is, using Bonferroni
multiple-test correction, the null hypothesis of equal intensity at
a given m/z value (x) would be rejected if |t(x)|>4.65.
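The Bonferroni figures quoted above can be reproduced with the standard normal quantile function. The sketch below uses Python's `statistics.NormalDist`; the function name is hypothetical.

```python
from statistics import NormalDist

def bonferroni_critical_value(alpha, n_tests):
    """Two-sided critical z value when each test is run at level alpha / n_tests."""
    per_test = alpha / n_tests
    # Lower-tail quantile at per_test / 2, negated to give the positive cutoff.
    return -NormalDist().inv_cdf(per_test / 2.0)

per_test_level = 0.05 / 15154
z_crit = bonferroni_critical_value(0.05, 15154)
```

The per-test level evaluates to about 3.2995e-6 and the critical value rounds to 4.65, compared with 4.22 for the RFT correction.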
[0081] Using the critical value thus determined, the statgram
derived above is then thresholded; i.e., those points on the
statgram for which the test value exceeds 4.22 (or is less than
-4.22) are selected. In other words, when thresholding at the
critical value of 4.22, data points with |t(x)|>4.22 are
considered significantly different between the two groups of
subjects at the significance level of 0.05 and are adopted as
candidates for a discriminant analysis phase.
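Thresholding can be sketched as a scan that groups consecutive points whose test statistic exceeds the critical value in absolute terms. The function name and the toy statistic values are hypothetical; the critical value 4.22 is the one stated above.

```python
def critical_regions(t_values, critical_value):
    """Return (start, end) index runs, end inclusive, where |t| > critical_value."""
    regions = []
    start = None
    for i, t in enumerate(t_values):
        if abs(t) > critical_value:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(t_values) - 1))
    return regions

t_values = [0.3, 4.5, 5.1, 1.0, -4.3, -6.0, -4.25, 2.0, 4.22]
regions = critical_regions(t_values, 4.22)
candidates = [i for start, end in regions for i in range(start, end + 1)]
```

Note that the comparison is strict, so the final value of exactly 4.22 in the toy data is not selected.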
[0082] Although a significance level of 0.05 has been used in this
exemplary application, the significance level can be any value
between 0 and 1.
[0083] FIGS. 6A and 6B illustrate the exemplary statgram before and
after thresholding, respectively. By thus applying RFT
multiple-test correction, the effective number of tests is reduced
from 15,154 to about 563. In essence, the reduction is achieved by
eliminating redundant tests for m/z values within the same
smoothing kernel.
[0084] By thresholding at the critical value of 4.22, 563 tests
remain significant. The corresponding 563 protein biomarkers are
considered significantly different between the unaffected and
affected populations and can be adopted as candidates for the
discriminant analysis. These statistically significant biomarkers
are valuable for deriving a diagnostic/discriminant rule for the
disease and for further biological studies to ascertain and
understand their roles in the disease's development and progress
and in the development of a new phenotype (other pathophysiologic
states). This information is valuable for developing and evaluating
therapeutic drugs and other treatments.
Step 5: Marker Selection
[0085] A subset of k markers from the candidates determined in step
4 that best discriminate between the two sets of training samples
are selected for any user-defined positive integer k. The procedure
starts from k=1 where k increases by 1 after each iteration until
the discriminating performance reaches a plateau.
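One way to realize the incremental search over k is a greedy forward selection, sketched below. This is an assumption made for illustration: the text does not state whether the best k-subset is found greedily or exhaustively. The `score` callable stands in for any cross-validated classification rate, and the toy score is hypothetical.

```python
def forward_select(candidates, score, max_k=None):
    """Grow the marker subset one marker at a time until the score stops improving."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and (max_k is None or len(selected) < max_k):
        trial_score, trial_marker = max(
            ((score(selected + [m]), m) for m in remaining),
            key=lambda pair: pair[0],
        )
        if trial_score <= best_score:  # plateau: adding a marker no longer helps
            break
        selected.append(trial_marker)
        remaining.remove(trial_marker)
        best_score = trial_score
    return selected, best_score

# Toy score (hypothetical): fraction of a target set covered by the chosen markers.
target = {"m2", "m5"}

def toy_score(markers):
    return len(target & set(markers)) / len(target)

subset, rate = forward_select(["m1", "m2", "m3", "m5"], toy_score)
```

The loop stops at the smallest k for which no additional marker raises the score, which mirrors the plateau criterion described above.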
[0086] Once a subset of markers is identified, test subjects can be
classified in accordance with the marker values for those subjects.
Non-parametric classification methods that can be used include the
l-nearest neighbor classification method, and the uniform kernel
method. The l-nearest neighbor classification method is equivalent
to the uniform kernel method with a location dependent radius. Both
methods tend to produce similar results. In an exemplary
embodiment, l-nearest neighbor classification is used because of
its robustness, flexibility, and intuitive explanations. Other
non-parametric and parametric methods may also be used, such as the
normal kernel method as well as a neural network classification
method.
[0087] In accordance with the l-nearest neighbor classification
method, for each subject in the testing set, the l nearest
neighbors in the training set are determined. The condition of the
majority of the l nearest training set neighbors determines the
predicted condition of the testing subject. The l-nearest neighbor
classification procedure depends on the "distances" between
subjects. The "locations" of the subjects depend on the set of
markers that are selected. In an exemplary embodiment, Mahalanobis
distance is used. Mahalanobis distance is the covariance-adjusted
distance between the mean marker scores of the subjects having the
condition of interest (e.g. ovarian cancer) and the unaffected
subjects, as defined by the training set.
[0088] Mahalanobis distance is based on the pooled
variance-covariance matrix V. The squared distance between two
observation vectors x and y is given by:
d.sup.2(x,y)=(x-y)'V.sup.-1(x-y). (Eq. 4)
[0089] Here each vector corresponds to a subject. Its elements are
the expression levels (intensities) of the k discriminating markers
for the given subject.
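Eq. 4 and the l-nearest neighbor vote can be sketched for the special case of k=2 markers, where the inverse of V has a closed form. The restriction to two markers, the function names, and the toy marker values are assumptions made to keep the sketch self-contained.

```python
from collections import Counter

def pooled_covariance(groups):
    """Pooled within-group 2x2 variance-covariance matrix, as (vxx, vxy, vyy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for pts in groups:
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        sxx += sum((p[0] - mx) ** 2 for p in pts)
        sxy += sum((p[0] - mx) * (p[1] - my) for p in pts)
        syy += sum((p[1] - my) ** 2 for p in pts)
        dof += len(pts) - 1
    return (sxx / dof, sxy / dof, syy / dof)

def mahalanobis_sq(x, y, v):
    """Eq. 4 for two markers, using the closed-form inverse of a 2x2 matrix."""
    vxx, vxy, vyy = v
    det = vxx * vyy - vxy * vxy
    dx, dy = x[0] - y[0], x[1] - y[1]
    return (vyy * dx * dx - 2.0 * vxy * dx * dy + vxx * dy * dy) / det

def classify(subject, training, v, l=5):
    """training: list of (vector, label); returns the majority label of the l nearest."""
    nearest = sorted(training, key=lambda item: mahalanobis_sq(subject, item[0], v))[:l]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy marker values (hypothetical).
affected = [(5.0, 1.0), (5.5, 1.2), (6.0, 0.9), (5.2, 1.1), (5.8, 1.3), (5.4, 0.8)]
unaffected = [(2.0, 1.0), (2.4, 1.1), (1.8, 0.9), (2.2, 1.2), (2.1, 0.8), (1.9, 1.3)]
v = pooled_covariance([affected, unaffected])
training = [(p, "affected") for p in affected] + [(p, "unaffected") for p in unaffected]
label = classify((5.6, 1.0), training, v)
```

With an odd l and two groups, the majority vote cannot tie.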
[0090] For l-nearest neighbor classification, the choice of l is
usually relatively uncritical. One approach is to try several
different values of l to determine which value gives the best
crossvalidated estimate of the classification rate. Crossvalidation
treats n-1 out of n observations as a training set. It determines
the discrimination functions based on the n-1 observations and then
applies them to classify the one remaining observation. This is
done for each of the n training observations. The classification
rate for each group, that is, "sensitivity" for the affected group
and "specificity" for the control group, is the proportion of
sample observations in that group that are classified correctly.
The next step is to determine whether the selected biomarkers
distinguish the affected subjects from the unaffected subjects in
the testing set.
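The leave-one-out crossvalidation used to compare values of l can be sketched as follows. For brevity the sketch uses a single marker with absolute difference as the distance, which is a simplifying assumption; the function names and toy values are hypothetical.

```python
from collections import Counter

def knn_label(train, x, l):
    """Majority label among the l training points nearest to x (single-marker toy distance)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:l]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def leave_one_out_rates(samples, l):
    """Return (sensitivity, specificity), classifying each sample with the other n-1."""
    correct = Counter()
    totals = Counter(lbl for _, lbl in samples)
    for i, (x, lbl) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_label(rest, x, l) == lbl:
            correct[lbl] += 1
    return (correct["affected"] / totals["affected"],
            correct["control"] / totals["control"])

samples = [(v, "affected") for v in (5.0, 5.5, 6.0, 5.2, 4.9)] + \
          [(v, "control") for v in (2.0, 2.4, 1.8, 2.2, 3.0)]
rates = {l: leave_one_out_rates(samples, l) for l in (1, 3, 5)}
```

Each entry of `rates` holds the crossvalidated sensitivity and specificity for one value of l, so the values can be compared directly.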
[0091] In the above-described exemplary application in which 563
candidate markers are identified by random field theory correction,
best separation between the two groups using the l-nearest neighbor
classifier with l=5 is achieved when k=18 (i.e., there is 100%
sensitivity and 100% specificity). In the above example, l=5. The
smallest l that achieved perfect discrimination was used for the
discriminant analysis of the remaining 116 spectra in the testing
data set.
Step 6: Check Variance Stability
[0092] Checking variance stability is an optional step which may
precede or follow the best discriminating subset selection
procedure (step 5). The rationale for this step is that the
expression level of certain markers may be correlated with stages
of a condition or other individual traits, and therefore may have
large variability across all subjects in a training set (affected
or unaffected). By examining the coefficient of variation, a
standardized measure of variability that is unaffected by the
magnitude of the mean, one could establish a statistical threshold
via resampling methods to divide the significant markers into two
subsets--those with less and those with more variability. If a
discriminant rule that is more robust to the condition's stages and
individual traits is desired, only the more stable markers should
be selected to derive the best k-subset of biomarkers. On the other
hand, a more stage-sensitive discriminant rule can be derived by
correlating more variable markers in the training set of the
"affected subjects" with condition stages/severity.
Step 7: Validation
[0093] In an exemplary embodiment, a validation step is performed
in which the testing set is scored. In a first embodiment, a binary
decision is made using the l-nearest neighbor classification
procedure (e.g., l=5) to score the subjects in the testing set as
having the condition of interest (e.g., ovarian cancer) or not. For
each subject in the testing set, the l nearest neighbors in the
training set are determined. The condition of the majority of the l
nearest training set neighbors determines the predicted condition
of the testing subject. This procedure is described above.
[0094] The l-nearest neighbor classification procedure yields a
binary outcome without attaching a probability indicative of the
relative proximity of the given subject to each group. In a further
embodiment, this limitation is addressed by integrating a scoring
system that allows a probabilistic interpretation. In accordance
with this embodiment, a probability that each subject in the
testing set has the condition of interest is determined. The
markers used in the scoring system can be all markers found
significant from the random field marker selection step, or a best
marker set with optimal classification rates, both obtained from
the training set.
[0095] Suppose K markers are selected from the training set, and
furthermore, suppose the mean and standard deviation for each
marker in the training group are x.sub.ij and s.sub.ij,
respectively, where i=1, 2 (1 representing the group with cancer
and 2 representing the unaffected group), and j=1, 2, . . . , K is
the marker index. First consider the idealized case in which each
distinct marker is statistically independent of the others. Then
for a given subject in the testing set with marker values y.sub.j,
the subject's scores for the cancer and control groups are given by
the following expression:
$$z_i = \sum_{j=1}^{K} \left[\frac{y_j - \bar{x}_{ij}}{s_{ij}}\right]^2 \qquad \text{(Eq. 5)}$$
[0096] These scores represent the statistical distance for each
subject to the center of the cancer and control groups,
respectively, for a single marker. The subject is more likely to
belong to the group with a low z score, and the larger the
difference between the two z scores, the higher the probability
that the classification is correct.
[0097] Assuming the K selected markers are independent, the
difference of the scores is proportional to the log likelihood
ratio of the probability L.sub.i that the subject belongs to group
1 or group 2 respectively. This is expressed as follows:
$$\ln(L_1/L_2) = \sum_j \left[\ln(s_{2j}) - \ln(s_{1j})\right] - 0.5\,(z_1 - z_2), \quad \text{where } L_i = \prod_j \frac{1}{\sqrt{2\pi}\,s_{ij}} \exp\left[-\frac{(y_j - \bar{x}_{ij})^2}{2 s_{ij}^2}\right] \qquad \text{(Eq. 6)}$$
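The relationship between the z scores of Eq. 5 and the log likelihood ratio of Eq. 6 can be checked numerically for independent normal markers. In the sketch below the function names and all marker values are hypothetical.

```python
import math

def z_score(y, means, sds):
    """Eq. 5: sum of squared standardized distances to one group's marker means."""
    return sum(((yj - m) / s) ** 2 for yj, m, s in zip(y, means, sds))

def log_likelihood(y, means, sds):
    """ln L_i for independent normal markers, computed directly from the densities."""
    return sum(-math.log(math.sqrt(2.0 * math.pi) * s) - 0.5 * ((yj - m) / s) ** 2
               for yj, m, s in zip(y, means, sds))

def log_likelihood_ratio(y, means1, sds1, means2, sds2):
    """Eq. 6: ln(L1/L2) from the two z scores and the standard deviations."""
    z1, z2 = z_score(y, means1, sds1), z_score(y, means2, sds2)
    log_sd_terms = sum(math.log(s2) - math.log(s1) for s1, s2 in zip(sds1, sds2))
    return log_sd_terms - 0.5 * (z1 - z2)

y = [5.1, 0.9, 3.0]                             # one test subject, three markers
cancer = ([5.0, 1.0, 3.2], [0.5, 0.2, 0.4])     # (means, standard deviations), group 1
control = ([2.0, 1.1, 2.0], [0.6, 0.3, 0.5])    # (means, standard deviations), group 2

llr = log_likelihood_ratio(y, *cancer, *control)
direct = log_likelihood(y, *cancer) - log_likelihood(y, *control)
```

The two routes agree to rounding error, and a positive value favors group 1, consistent with the subject having the lower z score for that group.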
[0098] The realistic case with statistical dependencies among the
markers can be derived naturally from the independent case by a
change of variables that incorporates the covariance among the
markers. The scoring system should be modified in this case using
the Mahalanobis distance, expressed as follows:
z.sub.i=(y- x.sub.i).sup.TS.sub.i.sup.-1(y- x.sub.i) (Eq. 7)
[0099] where S.sub.i is the sample variance-covariance matrix of
the K markers for the cancerous training group (i=1) and the
control training group (i=2), respectively.
[0100] In this general case:
ln(L.sub.1/L.sub.2)=0.5(ln|S.sub.2|-ln|S.sub.1|)-0.5(z.sub.1-z.sub.2)
(Eq. 8)
where:
L.sub.i=1/[(2.pi.).sup.K/2|S.sub.i|.sup.1/2]exp[-0.5(y-
x.sub.i).sup.TS.sub.i.sup.-1(y- x.sub.i)]
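Eqs. 7 and 8 can be sketched for the special case of two markers, where the determinant and inverse of S.sub.i have closed forms. The two-marker restriction, the function names, and the numerical values are assumptions made for illustration.

```python
import math

def mahalanobis_score(y, mean, cov):
    """Eq. 7 for two markers; cov = ((a, b), (b, c)). Returns (z_i, |S_i|)."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = y[0] - mean[0], y[1] - mean[1]
    z = (c * d0 * d0 - 2.0 * b * d0 * d1 + a * d1 * d1) / det
    return z, det

def log_likelihood_ratio(y, mean1, cov1, mean2, cov2):
    """Eq. 8: ln(L1/L2) = 0.5 (ln|S2| - ln|S1|) - 0.5 (z1 - z2)."""
    z1, det1 = mahalanobis_score(y, mean1, cov1)
    z2, det2 = mahalanobis_score(y, mean2, cov2)
    return 0.5 * (math.log(det2) - math.log(det1)) - 0.5 * (z1 - z2)

# Hypothetical group summaries with correlated markers.
y = (5.1, 0.9)
llr = log_likelihood_ratio(
    y,
    (5.0, 1.0), ((0.25, 0.02), (0.02, 0.04)),
    (2.0, 1.1), ((0.36, 0.03), (0.03, 0.09)),
)
```

When the off-diagonal covariances are zero, this expression reduces to the independent-marker form of Eq. 6.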
[0101] At a given significance level, this subject will be assigned
to group 1 if the log likelihood ratio exceeds a certain
probability threshold (i.e. the difference in the z-score below the
equivalent threshold) which will be determined by the likelihood
ratio test. Similarly, the subject will be assigned to group 2 if
the log likelihood ratio is below a certain threshold (i.e. the
difference in z-score above the equivalent threshold) which will
also be determined by the likelihood ratio test. For subjects whose
score difference is between the two thresholds, additional tests,
especially those with an independent or less correlated set of
markers, are performed to further determine their status.
Step 8: Resampling
[0102] The markers selected as significant by the RFT thresholding
procedure tend to depend on the choice of the training set. To
eliminate this dependency, the subject pool can be
resampled to obtain alternate training set and testing set pairs.
Stable markers that reappear with high frequency in the resampling
process will be selected to choose the ultimate robust set of
discriminating markers. Performance of the best subset selection of
discriminating markers will be examined by cross-validation and
other resampling schemes.
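The selection of markers that reappear with high frequency can be sketched as a counting loop over repeated random splits. In this sketch `select_markers` stands in for steps 3 through 5, the frequency cutoff `min_fraction` is a hypothetical parameter, and the sampling ignores group membership for brevity.

```python
import random
from collections import Counter

def stable_markers(subjects, select_markers, n_train, n_rounds, min_fraction, seed=0):
    """Repeat the random split; keep markers chosen in at least min_fraction of rounds."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rounds):
        training = rng.sample(subjects, n_train)
        counts.update(set(select_markers(training)))
    return sorted(m for m, c in counts.items() if c / n_rounds >= min_fraction)

# Toy selector (hypothetical): "m1" is always found, while "m2" depends on
# whether subject 0 happens to fall in the training set.
def toy_selector(training):
    markers = ["m1"]
    if 0 in training:
        markers.append("m2")
    return markers

subjects = list(range(10))
robust = stable_markers(subjects, toy_selector, n_train=5, n_rounds=50, min_fraction=0.9)
```

A marker whose selection hinges on particular subjects appears in only a fraction of the rounds and is filtered out by the cutoff.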
[0103] As mentioned, 100% specificity and 100% sensitivity have
been achieved with the present invention. To determine whether the
resultant specificity and sensitivity may have been due to a
fortuitous choice of test and training sets, the entire process may
be repeated by randomly selecting another training set. Steps 2
through 7 can be repeated one or more times. Consistency can thus
be checked and distributions (of specificity, sensitivity, etc.)
obtained.
System
[0104] In an exemplary embodiment, the method of the present
invention can be implemented with a software program running on a
processor or computer 100 of FIG. 1. The processor 100 analyzes
each point of data in a set and determines whether it is
statistically different from a comparable point in another set of
data to thus discover data patterns, elements or markers, which
differ either in their presence or in amount. In accordance with an
embodiment of the present invention, the data can be classified or
categorized into two or more states or categories.
[0105] In accordance with an embodiment of the present invention,
the program comprises one or more modules or routines performing
steps of the comprehensive statistical method. These modules and
the steps they perform will now be described in greater detail.
[0106] FIG. 3 shows a flow chart detailing the process of analyzing
serum protein mass spectra in accordance with an embodiment of the
present invention.
[0107] The exemplary program comprises a data preprocessing module
or routine 110 for data preprocessing, including standardization
for relative spectrum and smoothing via Gaussian filters, performed
on each individual spectrum. At step 300, the data preprocessing
module 110 obtains the relative intensity for each mass spectrum by
dividing the intensity at each m/z value by the average intensity
of the entire spectrum. Such standardization provides comparability
across different spectra.
[0108] At step 305, the data preprocessing module 110 may also
smooth each relative spectrum by a Gaussian kernel having an
appropriate full width at half maximum (FWHM), as described
above.
[0109] As shown in FIG. 1, the exemplary program further comprises
a sampling module or routine 120 for sampling for discriminant
purposes. The sampling module 120 randomly selects a training data
set from each group, e.g., affected and unaffected, clean and
contaminated, or state 1, 2, 3 . . . n, etc. The training data set
could be the data for 10 out of 1,000 affected people, and 10 out
of 10,000 unaffected people, for example. The remaining data is the
testing data set. In the aforementioned example with 216 subjects,
the sampling module 120 selects a random sample of 50 women with
ovarian cancer and a random sample of 50 unaffected women for the
training set. The remainder of 50 cancerous and 66 unaffected women
comprises the testing set. The separation of the subjects into the
training and testing sets is shown in the flowchart of FIG. 3 as
step 310.
[0110] A statgram module or routine 130 is included for performing
a pointwise test, such as a two-sample t/z-test, analysis of
variance (ANOVA) F test, etc., between the groups in the training
data set. As discussed above, a statgram is a two-dimensional map
of the test statistic values along the spectrum. In the ovarian
cancer application, the common coordinate shared between the data
is the m/z value. Accordingly, the statgram module 130 performs an
independent samples t/z-test at each m/z value to compare the
intensity between the two training samples (cancerous and
unaffected) at that m/z value. This is shown in the flowchart of
FIG. 3 as step 320.
[0111] At step 325, the statgram module 130 determines whether the
mean spectral intensities are equal (i.e., null hypothesis) or
different (i.e., alternative hypothesis) between the groups at each
m/z value. In accordance with an embodiment of the present
invention, the statgram module 130 performs the t/z test at each
m/z value (x) and determines the test statistic value t(x) in
accordance with Eq. 1, as described above. The test statistic
values are plotted against the m/z values generating a statgram,
such as that shown in FIG. 4. Where both sample groups are
relatively large (e.g., n.sub.1=n.sub.2=50, in the ovarian cancer
data set), the test statistic t(x) follows approximately the
standard normal distribution under the null hypothesis and thus the
z-test is appropriate here.
[0112] A threshold module or routine 140 determines the threshold,
based on multiple-test correction, that yields the critical regions
of the statgram for a given desired significance level. At each m/z
value, the larger the absolute value of the test statistic, the
more likely it is that the statgram module 130 will determine in
step 325 that the average intensities are different between the two
groups (i.e., alternative hypothesis).
[0113] When only one test is performed, the threshold module 140
rejects the null hypothesis at the significance level of 0.05
(2-sided) if |t(x)| exceeds the critical value of 1.96. Although
the significance level of 0.05 was used in this exemplary
application, the significance level can be any value between 0 and
1.
[0114] In accordance with an embodiment of the present invention,
the threshold module 140 performs a total of 15,154 tests to cover
the entire m/z range for the exemplary application in step 325. To
guard against falsely rejecting the null hypothesis, in accordance
with an aspect of the present invention, the threshold module 140
performs a multiple-test correction. In accordance with an
embodiment of the present invention, the threshold module 140
employs an RFT correction method, as described above.
[0115] As mentioned above, the data preprocessing module 110
pre-smooths each spectrum with a Gaussian kernel so that each
spectrum is a one-dimensional Gaussian field. The FWHM of the
Gaussian kernel will likely range between 10 and 20. FIG. 5 shows
that there is little variation in the critical value when the FWHM
varies between 10 and 20.
[0116] As described above in connection with the exemplary ovarian
cancer data, when the critical value of 4.22 is used as the
threshold, the threshold module 140 determines that 563 tests out
of the total of 15,154 tests are significant, as shown in FIGS. 6A
and 6B. That is, the corresponding 563 protein markers are
considered significantly different between the two populations
(women with ovarian cancer and unaffected women) and are adopted as
candidates by a marker selection module 160.
[0117] The exemplary program of the present invention may include a
variance module or routine 150 for checking variance stability for
selecting stable markers. If a user or operator requests a variance
stability check in step 340, operation proceeds to step 350 in
which the variance module 150 performs a variance stability check
to determine if the selected markers are relatively stable.
Performing a variance stability check may be desirable because the
expression level of certain markers may be correlated with disease
stages or other individual traits, and therefore may have large
variability across all subjects in a training set (affected or
unaffected). By examining the coefficient of variation (a
standardized measure of variability that is unaffected by the
magnitude of the mean), the variance module 150 can establish a
threshold to divide the significant markers into two subsets: those
with less and those with more variability.
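The division of markers by coefficient of variation can be sketched as follows. The cutoff defaults to the median CV purely for illustration; the text leaves the threshold to be established by resampling. Marker names and intensities are hypothetical.

```python
from statistics import mean, median, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return stdev(values) / mean(values)

def split_by_stability(marker_values, cutoff=None):
    """marker_values: {marker: intensities}. Default cutoff is the median CV (an assumption)."""
    cvs = {m: coefficient_of_variation(v) for m, v in marker_values.items()}
    if cutoff is None:
        cutoff = median(cvs.values())
    stable = sorted(m for m, cv in cvs.items() if cv <= cutoff)
    variable = sorted(m for m, cv in cvs.items() if cv > cutoff)
    return stable, variable, cvs

data = {
    "a": [1.0, 1.1, 0.9, 1.0],
    "b": [0.1, 2.0, 0.3, 4.0],
    "c": [2.0, 2.2, 1.9, 2.1],
}
stable, variable, cvs = split_by_stability(data)
```

Because the CV is scaled by the mean, markers "a" and "c" are judged comparably stable even though their mean intensities differ by a factor of two.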
[0118] In accordance with an embodiment of the present invention,
the marker selection module 160 can be used in conjunction with the
variance module 150 to derive a discriminant rule that is more
robust to the various stages of the condition of interest and to
individual traits. In step 330, the marker selection module 160
selects only stable markers to derive the best k-subset of markers.
Alternatively, the marker selection module 160 can be used to find
correlations between variable markers in the training set of
affected subjects with condition stages/severity. A discriminant
rule that is more sensitive to the various stages/severities of the
condition can thus be derived.
[0119] Although not shown in FIG. 3, the variance module 150 can
alternatively perform a variance stability check before the marker
selection module 160 selects the k subset of candidate markers in
step 330.
[0120] In the example described herein, the marker selection module
160 performs a best subset discriminant analysis to select the best
subset of markers. In the exemplary ovarian cancer embodiment, 18
markers were selected from the 563 candidates remaining after
thresholding of the statgram. The variance module 150 performs a
variance stability check to determine whether the selected markers
are relatively stable against the other markers. On the whole, the
markers selected are relatively stable. The significance of the
coefficient of variation for marker selection and data
interpretation can be analyzed further when more subject-specific
information is available.
[0121] The marker selection module or routine 160 selects a subset
of k markers (k is a positive integer) from the list of markers
remaining after processing by the threshold module 140 (or the
variance module 150, if invoked or requested), that could best
discriminate between the groups in the training set. The markers
are selected via the best k-subset discriminant method. In
accordance with an embodiment of the present invention, the marker
selection module 160 selects the smallest k achieving the best
possible classification performance. The marker selection module
160 starts from k=1 and k is incremented by 1 after each iteration
until the discriminating performance reaches a plateau in step
335.
[0122] In the exemplary ovarian cancer case, the marker selection
module 160 achieved the correct separation between the two groups
when k=18. The 18 selected markers are shown in Table II below. The
coefficient of variation (CV) in Table II is the ratio of the sample
standard deviation to the sample mean. For each marker, the
percentile (%) of its CV across the entire protein spectrum within
each group is also given in Table II. Since markers within a
neighborhood of 100 m/z values are likely to be variations of the
same protein, the marker selection module 160 bins the selected
markers into unique/independent groups for the ensuing biological
validation. In the exemplary application, the marker selection
module 160 effectively reduces the number of protein candidate
markers from 18 to 10 independent groups to be used for further
identification of their composition and validation of their role in
ovarian cancer etiology.
TABLE II Markers selected for the discriminant model
                        Cancerous                Normal
M/Z         T-Value     CV          Percentile   CV          Percentile
167.8031    -5.7662     1.480295    82.52        0.625591    18.00
322.4204    -5.9078     4.762101    98.52        1.40202     68.84
359.6322     7.731      0.292722     4.00        0.699381    23.93
385.5688     6.5828     0.239876     3.14        0.328497     3.32
434.6859    -5.7399     2.299701     0.30        1.723423    79.78
444.469     -8.8655     0.399504     6.85        0.440768     5.98
445.2563    -8.5129     0.368993     5.93        0.419881     5.33
1222.185    -4.7291     0.629668    28.22        0.575398    12.44
1528.343    -5.8471     0.818051    42.04        0.661882    20.69
2026.812     3.8505     1.893308    89.69        1.281596    64.48
2522.814    -4.1012     1.514503    83.35        1.266205    63.82
3345.8      -9.1492     1.220903    71.54        0.98451     48.51
3449.15     -6.592      1.208733    70.90        0.721731    25.31
3473.308    -4.5641     0.867673    45.44        0.6403      19.31
3528.527     5.8663     0.772251    39.88        0.621775    17.70
6101.63      6.1853     0.645376    29.86        0.771314    28.80
6123.519    -5.7609     0.538648    18.29        0.731008    26.04
6453.56     -4.1838     3.809917    97.73        1.37106     68.03
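The binning of nearby markers into independent groups can be sketched as a single pass over the sorted m/z values. The rule used here, in which a marker joins the current group when it lies within 100 m/z units of the previous marker, is an assumption about how the neighborhood is applied; the m/z values are those listed in Table II.

```python
def bin_markers(mz_values, max_gap=100.0):
    """Chain sorted m/z values into groups whose consecutive gaps are at most max_gap."""
    groups = []
    for mz in sorted(mz_values):
        if groups and mz - groups[-1][-1] <= max_gap:
            groups[-1].append(mz)
        else:
            groups.append([mz])
    return groups

selected_mz = [
    167.8031, 322.4204, 359.6322, 385.5688, 434.6859, 444.469,
    445.2563, 1222.185, 1528.343, 2026.812, 2522.814, 3345.8,
    3449.15, 3473.308, 3528.527, 6101.63, 6123.519, 6453.56,
]
groups = bin_markers(selected_mz)
```

Under this assumed rule the 18 markers fall into 10 groups, matching the number of independent groups stated above.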
[0123] At step 340, the processor determines if a variance
stability check has been requested. If the user or operator does
not request a variance stability check in step 340, operation
proceeds to step 360 in which a validation module or routine 170
validates the selected k markers on a testing data set. In an
exemplary embodiment, the l-nearest neighbor classification method
is used to classify the testing data set and the sensitivity and
specificity of the resulting classification are calculated. The
validation module 170 performs discrimination and classification on
the testing set using the normal kernel density estimates with
unequal bandwidth.
[0124] In the exemplary ovarian cancer application, the validation
module 170 correctly identified all 50 women with ovarian cancer of
the testing set as positive and all 66 unaffected women of the
testing set as negative. That is, the present invention achieved
perfect discrimination with 100% sensitivity and 100% specificity
of the test. The 95% confidence intervals for sensitivity and
specificity are (93%, 100%) and (95%, 100%), respectively.
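The intervals quoted above are consistent with an exact binomial (Clopper-Pearson) interval, although the text does not name the interval method, so that choice is an assumption here. When all n cases are classified correctly, the two-sided lower bound reduces to (alpha/2)^(1/n) and the upper bound is 100%. A minimal sketch in Python (the function name is hypothetical):

```python
def exact_ci_all_correct(n, alpha=0.05):
    """Two-sided Clopper-Pearson interval when all n trials succeed.

    Assumption: the interval method is not stated in the text; the exact
    binomial interval is used here purely for illustration.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    lower = (alpha / 2) ** (1.0 / n)  # solves lower**n == alpha / 2
    return lower, 1.0


sens_ci = exact_ci_all_correct(50)  # 50 of 50 cancer cases detected
spec_ci = exact_ci_all_correct(66)  # 66 of 66 unaffected women cleared
print(f"sensitivity: ({sens_ci[0]:.0%}, {sens_ci[1]:.0%})")
print(f"specificity: ({spec_ci[0]:.0%}, {spec_ci[1]:.0%})")
```

Under this assumption the sketch gives (93%, 100%) for 50 subjects and (95%, 100%) for 66 subjects, matching the reported values.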
[0125] In another application of the present invention, the
processor 100 repeated steps 300-360 using another publicly
available ovarian cancer data set with 162 cancerous and 91
unaffected women ("Ovarian Dataset 8-7-02", available from the
above-cited NIH website). Again, the processor 100 obtained perfect
sensitivity and specificity using the above procedures.
[0126] Furthermore, the 18 markers identified in the Ovarian
Dataset 4-3-02 by the inventive system and method correctly
classified (with 100% sensitivity and 100% specificity) all of the
subjects of the Ovarian Dataset 8-7-02 via cross-validation. That
is, the inventive system and method correctly classified 212
cancerous (162 from Dataset 8-7-02 and 50 from the testing set of
Dataset 4-3-02) and 157 unaffected women (91 from Dataset 8-7-02
and 66 from the testing set of Dataset 4-3-02). The 95% confidence
intervals for sensitivity and specificity were (98%, 100%) and (98%,
100%), respectively.
[0127] The processor 100 can resample to check for consistency and
to obtain distributions of specificity, sensitivity and the like,
by repeating the process from the sampling module 120 to the
validation module 170. In accordance with an embodiment of the
present invention, the processor 100 repeats steps 310-360 in step
365 until the classification is verified. In analyzing the ovarian
datasets described above, the sampling module 120 randomly selected
another training set of 50 cancerous and 50 unaffected women in
step 310. The processor 100 repeated or iterated steps 310-360 50
times and obtained 50 pairs of perfect discrimination.
[0128] After verifying the above results in step 365, a classifier
module 180 of the processor 100 can be used in step 370 to screen
unknown samples of serum for the presence of the pattern diagnostic
of ovarian cancer.
[0129] In addition to Ovarian Datasets 4-3-02 and 8-7-02, the
present invention has also been applied to another public data set
of mass spectra of serum from ovarian cancer patients and control
subjects (referred to as the "High Resolution Dataset", also
available from the above-cited website). The High Resolution
Dataset consists of spectra from 95 cancerous and 117 normal
subjects, with the spectra obtained with a higher resolution mass
spectrometer (QSTAR). Despite differences in the mass spectrometry
platforms (low resolution SELDI-TOF versus high-resolution QSTAR)
and differences in chip sample preparation (manual versus robotic),
application of the present invention resulted in 100% specificity
and 100% sensitivity in distinguishing sera from patients with
ovarian cancer from those without cancer.
[0130] The method and system of the present invention select the
best marker set from among markers that are differentially
expressed among the groups of subjects being studied (e.g.,
cancerous and normal). Markers that are not statistically
significant are removed from further analysis. Therefore, the
markers selected are not only optimal for the given dataset but
also tend to be robust and have excellent performance for
independent datasets. For example, the best subset of 18 markers
(Table II) selected from the Ovarian Dataset 4-3-02 not only
classifies the testing set of the Ovarian Dataset 4-3-02 with 100%
accuracy but also classifies the entire Ovarian Dataset 8-7-02,
which consists of an entirely different set of subjects, with 100%
accuracy. These 18 markers
are not only differentially expressed for the 4-3-02 data but also
for the 8-7-02 data. The present invention thus provides a powerful
systematic and robust methodology for the detection of subtle
variations in serum protein patterns revealed by mass
spectrometry.
[0131] A key difference between the present invention and other
approaches, such as that taken by the Proteome Quest system, is that
the present invention selects markers that are statistically
significant for the given study. That is, only markers exceeding
certain statistical thresholds (determined, for example, by RFT)
are chosen as candidate markers for the subsequent profiling.
Consequently, the final protein profiles chosen by the present
invention are, by design, not only optimal for the given data
sample, but also robust for the classification of subjects from
other independent data samples from the same study population.
[0132] Table III shows that only a small fraction of markers
exceeds the statistical threshold for each of the three ovarian
cancer studies. The reduction in the data size as the method of the
present invention progresses can be illustrated with its
performance on these datasets. Thresholding results in reduction by
a factor of 30. Clustering of the markers using the K-means method
(with the distance being 1 minus the Spearman correlation or the
Pearson correlation), where applied, gives a further reduction by a
factor of 2. Because the final step of selecting the marker subset tends
to be computationally expensive, the desirability of data reduction
before this step occurs is clear.
TABLE-US-00003 TABLE III Successive stages of data analysis on the
three public data sets.

Data              Raw Data   Thresholded   Clustered   Selected
04-03-02          15,154     563           N/A         18
08-07-02          15,154     867           N/A         6
High-Resolution   373,401    36,972        847         107
[0133] Although the Proteome Quest algorithm did obtain perfect
classification on both the 8-7-02 and the high-resolution data, the
method of the present invention is the only method that examines
the statistical significance of each biomarker. This ensures the
statistical validity of the selected biomarker set from the sample
to the population. In accordance with the present invention, only
biomarkers significantly different between the groups in expression
intensities are selected as candidate biomarkers for the ensuing
discriminant analysis. Other algorithms, however, often admit
biomarkers that are not statistically valid into the final model.
For example, the Proteome Quest algorithm almost invariably admits
invalid biomarkers into the final model. The percentage of invalid
markers admitted by such algorithms has ranged from 43% to
100%.
[0134] One way to establish the significance of the biomarkers
identified by the exemplary method of the present invention is to
identify and study the cell and molecular biology in terms of
physiologic function and/or diagnostic penetrance. Alternatively,
the significance, or lack of significance, of any biomarker subset
will also become apparent when the number of samples increases
10-100 fold.
[0135] Another way to validate the subset of predictive biomarkers
is to compare the proteomic analysis with that of an established
predictor of disease. For example, in the case of prostate cancer,
as a single biomarker, prostate specific antigen (PSA), either
total or complex, possesses a marked specificity and diagnostic
efficacy of 90-92%, but a low sensitivity of 54-56%. This is
primarily due to elevated levels of PSA in benign prostate
hyperplasia. Nevertheless, comparing clinically diagnosed prostate
cancer patients that are also PSA positive with a serum protein
profile analysis in accordance with the present invention should
provide a specific subset of proteomic markers that are positively
correlated with PSA. These data can then be stratified based upon
different PSA levels and analyzed in concert with PSA and,
potentially, other suggested biomarkers such as hyaluronidase,
alpha-methyl-CoA racemase and the mucin-like epithelial polypeptide
Ca15-3.
[0136] An outcome from the application of the present invention to
such analysis will be to provide specific phenotype information for
a given subset of cancer types. This could be accomplished if
sufficient tissue samples remain available to expand the molecular
phenotyping of a particular tumor. In the case of breast cancer, in
addition to estrogen receptor, progesterone receptor and Ki67, this
might include analysis of BRCA1, BRCA2, HER2 and Erk1/2.
[0137] As the biological signals become more specific, the overall
biomarker pattern may become weaker (due to a decrease in the
number of valid markers). From a cell biological approach, this
will make it possible to better define the biomarkers to be
identified. From the mathematical side, however, the detection and
discrimination from a complex serum pattern will become more
computationally expensive. Algorithms to cluster groups of markers
having similar performance will be able to address some of these
issues. Exemplary clustering procedures are described more fully
below. Furthermore, as the purity of the representative biomarker
components improves, the number of non-relevant biomarkers present
in the profile will be reduced, thereby increasing the sensitivity
and specificity of the statistical analysis.
Multiple Group Classification Analysis and Filtration of Prognostic
Factors
[0138] As discussed herein, the present inventive system and method
is not limited to comparing, classifying or categorizing data into
only two groups or categories. The inventive system and method can
compare, classify or categorize data into multiple groups or
categories. This capability is useful, for example, in classifying
subjects into multiple groups in accordance with, for example,
disease states, treatment responses or stages, and the presence of
cofactors or prognostic conditions, among others. For instance, in
applying the present invention to the study of diseases such as
cancer, it has been observed that the expression levels of certain
significant markers are highly variable among subjects with cancer
while relatively stable among unaffected subjects. It is possible
that these biomarkers are related to specific stages of cancer,
differences in individual genetics (e.g. race) or other prognostic
factors such as weight, nutritional status or age. Data can be
stratified not only to differentiate between two groups (e.g.,
unaffected and cancer patients), but also to simultaneously
classify subjects within the same disease (e.g., by stages of
cancer progression and/or tumor subtypes).
[0139] In accordance with an embodiment of the present invention,
subjects can be classified into multiple groups, for example with
multiple disease states, treatment responses or stages, or with
different cofactors or prognostic conditions. In such an
embodiment, the t/z statistical map for two-group classification,
described above, is replaced with a multi-group and repeated
measure generalization known as the ANOVA F-map. Markers exceeding
the F-map threshold via the random field theory are selected for
the multiple group classification. Using random field theory, the
critical test value or threshold is determined from the
relationship between the experiment-wise error rate α and the
critical value f, which is given by the following expression:
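\[
\alpha \approx \int_{f}^{\infty}
  \frac{\Gamma\left(\frac{v+w-2}{2}\right)}
       {\Gamma\left(\frac{v}{2}\right)\Gamma\left(\frac{w}{2}\right)}
  \,\frac{w}{v}
  \left(\frac{wu}{v}\right)^{\frac{w}{2}-1}
  \left(1+\frac{wu}{v}\right)^{-\frac{v+w}{2}} du
  \;+\; \frac{2K\sqrt{\ln 2}}{(\mathrm{FWHM})\sqrt{\pi}}
  \,\frac{\Gamma\left(\frac{v+w-1}{2}\right)}
         {\Gamma\left(\frac{v}{2}\right)\Gamma\left(\frac{w}{2}\right)}
  \left(\frac{wf}{v}\right)^{\frac{w-1}{2}}
  \left(1-\frac{wf}{v}\right)^{1-\frac{v+w}{2}}
  \qquad \text{(Eq. 9)}
\]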
K is the total number of markers along the spectrum, and w and v
are the degrees of freedom for the ANOVA F-test.
[0140] To select markers that are differentially expressed among
groups due to the underlying disease and not due to other
differences among the groups due to certain prognostic factors such
as age, the ANOVA F-map can be replaced with an Analysis of
Variance-Covariance (ANOCOVA) F-map whenever necessary. The
significant marker threshold can again be determined by Eq. 9.
[0141] Thus, by replacing the threshold determined by RFT (as
described above) with the F-map threshold, the significant marker
selection procedure can be extended from a two-group classification
to multiple group classifications and classification incorporating
relevant prognostic factors such as medical history or age.
[0142] Following this pre-marker selection procedure, there are
several classification procedures that can be used, including
discriminant analysis classification (i.e. the K-nearest neighbor
or the kernel classifier); statistical scoring; or two-level
clustering, described below. All three can be extended to multiple
group classifications. The extension is straightforward for the
discriminant and the cluster analyses.
[0143] Extending statistical scoring to multiple group
classification involves sequential likelihood ratio tests,
illustrated as follows. Suppose, for example, that there are three
groups and the scores for a given subject place the subject closest
to group 2, intermediate with group 1, and farthest from group 3.
The first likelihood ratio test will be between groups 2 and 1. If
the likelihood ratio
is beyond an upper threshold, the subject is classified into group
2 and the test will stop. If the likelihood ratio falls between the
upper and lower thresholds, the subject can be either in group 2 or
1. If the likelihood ratio is below the lower threshold and thus
the subject is classified into group 1, a further test will
commence between 1 and 3 to determine whether the subject can be
clearly classified into group 1 or whether the subject could belong
to either group 1 or 3.
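The sequential testing just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the function name, the caller-supplied `likelihood_ratio` callable, the threshold values and the toy ratios are all hypothetical, and the handling of a subject that falls below the lower threshold in every test is an assumption, since the text does not cover that case.

```python
def sequential_lr_classify(ordered_groups, likelihood_ratio, lower, upper):
    """Classify one subject by a chain of pairwise likelihood ratio tests.

    ordered_groups: group labels sorted from closest to farthest for the
        subject (for example [2, 1, 3]).
    likelihood_ratio(a, b): ratio favouring group a over group b; a
        hypothetical callable supplied by the caller.
    lower, upper: decision thresholds, with lower < upper.
    Returns the list of groups the subject may belong to.
    """
    if not ordered_groups:
        raise ValueError("at least one group is required")
    current = ordered_groups[0]
    for challenger in ordered_groups[1:]:
        ratio = likelihood_ratio(current, challenger)
        if ratio > upper:
            return [current]              # clearly the current group; stop
        if ratio >= lower:
            return [current, challenger]  # undecided between the two
        current = challenger              # below lower: move to challenger
    # Assumption: if every test falls below the lower threshold, the last
    # group tested is returned; the text does not cover this case.
    return [current]


# Toy ratios for the three-group example (values are illustrative only).
toy_ratios = {(2, 1): 0.2, (1, 3): 5.0}
result = sequential_lr_classify(
    [2, 1, 3], lambda a, b: toy_ratios[(a, b)], lower=0.5, upper=2.0
)
print(result)
```

With these toy values the first test (group 2 against group 1) falls below the lower threshold, so the subject moves to group 1, and the second test (group 1 against group 3) exceeds the upper threshold, so the subject is assigned to group 1 alone.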
[0144] Another related consideration is the analysis of data
obtained from repeated measurements. Repeated measurements of the
same subject are useful, for example, to follow disease progression
and treatment effects in the same patient; assess the information
content arising from different hardware or specimen preparation
procedures; or to establish consistency of measurement. The outcome
of a repeated measures analysis will be a set of markers and a
training set sufficient to determine whether a subject changes
categories (e.g. cancerous or non-cancerous) within the sequence of
repeated measures. In accordance with an exemplary embodiment,
repeated measures ANOVA is performed at each marker to examine
whether the expression intensities are equal from spectrum to
spectrum or from time to time. Multiple-test correction for the
repeated measures ANOVA F-tests is again performed and significance
thresholds are determined using the above-described procedure based
on random field theory.
Marker Correlation Analysis
[0145] In an aspect of the present invention, a marker correlation
analysis tool is provided for algorithm comparison and marker set
comparison. A robust outlier detection mechanism is also
provided.
[0146] A marker correlation analysis procedure in accordance with
the present invention comprises two stages. In a first stage, a
k-means classifier is adopted to cluster subjects with highly
similar mass spectrum profiles. These clusters are subsequently
classified in a hierarchical tree structure, formed by weakening
the notions of similarity, so that larger groups are classified as
similar as one progresses along the tree hierarchy. In a second
stage, the branches and nodes of the classification tree are
subsequently examined for outliers.
[0147] The same two-stage clustering algorithm for outlier
detection also serves naturally as a clustering engine. As an
example of comparison of algorithms, various spectrum normalization
methods using mean, median, or maxima, on a global or regional
scale may be used. A set of robust markers that are significant
under different normalization methods will be identified and their
classification performance examined. The marker correlation
analysis tool based on multiple and canonical correlation is an
extension of the marker clustering analysis. Marker clustering bins
highly correlated individual markers together while the marker
correlation tools ascertain the relationship of an individual
marker to a set of markers or the relationship between two sets of
markers. This is helpful in selecting competing high performance
marker sets. A high correlation between marker sets indicates that
both have the same information content and thus either one can be
adopted. A low correlation indicates complementary/independent
information content and thus both sets should be examined for a
complete profiling. An additional application of the marker
correlation tool is to examine the relationships between existing
markers (e.g. PSA) and newly identified markers (e.g., mass
spectrometric biomarkers).
[0148] Discriminant analysis and clustering analysis are two types
of classification methods. The discriminant analysis procedure
described above (e.g., the l-nearest neighbor or the kernel method)
classifies subjects into predefined groups. It does not seek other
natural clusters that may be embedded in the data. It will not be
able to identify outliers due to sample contamination, instrument
misalignment or human error (for example, a sample from a male
wrongly labeled as coming from a female). Clustering analysis,
however, does
not require predefined categories and instead seeks the natural
clusters embedded within the data. This also offers a natural
solution for outlier detection because an outlier tends to be in a
cluster of its own. Furthermore, clusters of outliers are usually
located farther apart from the other clusters in a classification
tree.
[0149] In an aspect of the present invention, a two-level
clustering engine is provided. The first level is a k-means
clustering algorithm using 1 minus the correlation as the
dissimilarity measure.
It is a dimension reduction scheme as well as an outlier detection
mechanism where spectra with similar shapes will be clustered
together. Single-point/spectrum clusters can be examined as
potential outliers. The second level clustering is defined by
modifying the criteria for similarity. It clusters the first level
clusters, each represented by its mean or median, in a hierarchical
classification tree. Nodes and branches falling farther apart from
the majority can be examined as potential outliers as well.
[0150] Outliers can arise for a variety of reasons, including, for
example, sample contamination, instrument misalignment, or human
error. For quality control purposes, one must detect and delete the
outliers prior to the diagnostic stage. In one approach, a sample
is classified as an outlier if its coefficient of variation exceeds
certain boundary values at at least one m/z value. Such m/z values,
however, are often not included in the final classification model.
This, in turn, would cause overly large false rejection rates.
[0151] The exemplary two-level clustering algorithm of the present
invention is more robust in the detection of outliers because only
the significant markers, which are a very small subset of the entire
spectrum, will be used towards the classification. Thus, with the
two-level clustering algorithm, a sample is declared an outlier
only if it exceeds a certain boundary for any of the significant
markers.
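The restriction of the outlier rule to significant markers can be illustrated with a short Python sketch. The function name, the dictionary layout, the boundary values and the sample values are hypothetical; the text does not specify which quantity is compared with the boundary, so the sketch simply takes one measured quantity per m/z value.

```python
def is_outlier(sample, bounds):
    """Return True if the sample is out of range at any significant marker.

    sample: dict mapping m/z value -> measured quantity for one subject.
    bounds: dict mapping each significant m/z value -> (low, high).
    Markers absent from `bounds` are ignored, so non-significant m/z
    values cannot trigger a rejection.
    """
    for mz, (low, high) in bounds.items():
        value = sample[mz]
        if value < low or value > high:
            return True
    return False


# Hypothetical boundaries for two significant markers.
bounds = {444.469: (0.0, 1.0), 3345.8: (0.0, 2.0)}
# Extreme value only at a non-significant m/z (100.0): not rejected.
sample_a = {444.469: 0.4, 3345.8: 1.2, 100.0: 99.0}
# Out of range at a significant marker: rejected.
sample_b = {444.469: 1.5, 3345.8: 1.2, 100.0: 0.1}
print(is_outlier(sample_a, bounds), is_outlier(sample_b, bounds))
```

Checking only the significant markers is what keeps an extreme value elsewhere in the spectrum from causing a false rejection.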
[0152] In a further aspect of the present invention, the
correlation between marker sets is analyzed to determine whether
two or more marker sets contain the same or complementary
information. A related issue is gauging the relationship between
existing markers (e.g. PSA or CA125) and new proteomics marker
sets. Another area of interest is to correlate the genetic,
clinical or other prognostic factors with the proteomics
markers.
[0153] In an exemplary embodiment, canonical correlation is
performed to evaluate the linear relationships between selected
marker sets. Principal component analysis is preferably performed
for dimensional reduction prior to the correlation analysis.
[0154] Canonical correlation is essentially the Pearson correlation
between the linear combination of variables in one set and the
linear combination of variables from another set. The pair of
linear combinations having the largest correlation is determined
first. Next, the pair of linear combinations having the largest
correlation among all pairs uncorrelated with the initially
selected pair is identified, and so on. The pairs of linear
combinations are called the canonical variables, and their
correlations are called the canonical correlations. The first
canonical correlation, which is often the only significant one in
most circumstances, is usually adopted to describe the inter-class
correlation. Its significance is determined by a statistic termed
Wilks' Lambda.
[0155] Small sample size and large dimensionality, as are common in
proteomics studies, would frequently render the degrees of freedom
insufficient to detect any significant canonical correlation. For
each class, major principal components (PCs) accounting for most of
the variations will be selected. Furthermore, Pearson correlations
of the selected PCs will be obtained and PCs from one marker set
that are not significantly correlated with PCs from the other set
will be dropped. Canonical correlations will then be obtained using
the remaining PCs.
[0156] It is very likely that the relationship between two marker
sets is not linear. In an exemplary embodiment, extensions to
nonlinear canonical correlations in the polynomial space are used.
In addition, the canonical correlations can be extended to account
for prognostic factors such as age or race by replacing each marker
intensity value by its residual from a regression with the
prognostic factors as independent variables. The resulting
canonical correlations on the residuals will be free of the
influence of these prognostic factors.
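The residual substitution described above can be sketched for the simplest case of a single prognostic factor, using ordinary least squares with an intercept. This is a minimal illustration: the function name and the toy data are hypothetical, and with several prognostic factors a multiple regression would take the place of the simple regression shown here.

```python
def residualize(marker_values, factor_values):
    """Residuals of a least-squares fit: marker ~ intercept + factor.

    marker_values: intensities of one marker, one value per subject.
    factor_values: the prognostic factor (e.g. age) for the same subjects.
    """
    n = len(marker_values)
    if n != len(factor_values) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(factor_values) / n
    mean_y = sum(marker_values) / n
    sxx = sum((x - mean_x) ** 2 for x in factor_values)
    if sxx == 0:
        raise ValueError("the prognostic factor has no variation")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(factor_values, marker_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(factor_values, marker_values)]


ages = [40, 50, 60, 70]             # toy prognostic factor
intensity = [1.0, 2.1, 2.9, 4.0]    # toy marker intensities
residuals = residualize(intensity, ages)
print([round(r, 3) for r in residuals])
```

By construction the residuals sum to zero and are uncorrelated with the factor, which is why correlations computed on them no longer reflect the factor's linear influence.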
[0157] The marker correlation tool of the present invention can be
used to gauge relationship/information overlap between any two sets
of variables including genetic markers, proteomics markers, other
clinical variables, and prognostic factors such as age.
Improving Efficiency, Specificity and Sensitivity
[0158] With the rapid increase in instrumental accuracy, the number
of m/z values along a single proteomic mass spectrum has increased
from approximately 15,000 (low-resolution) to approximately 300,000
(QSTAR) and likely to approximately 3,000,000 (higher-resolution
QSTAR). This has created serious computational difficulties for all
diagnostic algorithms. One approach has been to bin adjacent markers
(e.g., in groups of 50) to reduce the QSTAR data 50-fold to that of
a low-resolution equivalent. By doing so, however, the richer
information provided by higher resolution instrumentation is not
effectively utilized.
To improve the statistical power and the computational efficiency
of the statistical detection algorithm, the present invention
provides, in an exemplary embodiment, a combination of marker
clustering and parallel feature selection algorithms. Marker
clustering is based on the K-means algorithm, which, starting from a
single marker, sequentially groups each new marker as being near,
and thus added to, some already defined cluster or, if no nearby
cluster exists, as starting a new cluster. The algorithm depends on
the notion of distance, which is 1-r, where r is the Pearson
correlation between two markers across the subjects. After grouping
markers into clusters which show similar discriminating behavior,
redundancy is avoided by selecting a representative marker from
each group. The number of groups is significantly smaller than the
number of valid markers, so that the combinatorial complexity of
the best subset selection is reduced.
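The sequential grouping described above can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: the cutoff `max_distance` that defines "nearby", the use of each cluster's first member as its reference point and representative, and all function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a marker with constant intensity has no correlation")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def sequential_marker_clusters(markers, max_distance):
    """Group markers one at a time using the distance 1 - r.

    markers: list of intensity profiles, one per marker, each holding one
        value per subject.
    max_distance: assumed cutoff below which a cluster counts as nearby.
    Returns a list of clusters (lists of marker indices); the first index
    in each cluster serves as its representative.
    """
    clusters = []
    for idx, profile in enumerate(markers):
        best, best_dist = None, max_distance
        for cluster in clusters:
            dist = 1.0 - pearson(profile, markers[cluster[0]])
            if dist <= best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append([idx])  # no nearby cluster: start a new one
        else:
            best.append(idx)        # join the nearest existing cluster
    return clusters


toy_markers = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with marker 0
    [4.0, 3.0, 2.0, 1.0],   # anti-correlated with marker 0
    [3.9, 3.1, 2.0, 1.2],   # close to marker 2
]
clusters = sequential_marker_clusters(toy_markers, max_distance=0.1)
representatives = [c[0] for c in clusters]
print(clusters, representatives)
```

In this toy example the four markers collapse into two clusters, so the subsequent best-subset search would consider two representatives rather than four markers.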
[0159] Due to the high dimensionality of the data, it is
computationally intensive to process the entire data set at one
time, matching data points on multiple features. The features generally
express the result of an approximate, suboptimal best subset
selection process. Instead, by implementing several classifiers in
parallel, each focusing on the identification of a single feature,
for example markers lying within a restricted window of the full
spectrum, the classification by multiple features can be
accomplished more swiftly. The parallel feature selection algorithm
can be readily applied for marker selection. Optimal marker sets
are derived from each marker subset/spectrum sub-interval. These
optimal marker sets are subsequently merged and further selected
via the best-subset or stepwise marker selection algorithms.
Applications of the Present Invention
[0160] The present invention is not limited to the analysis of
biological and chemical data. A wide variety of data sets from
various fields can be analyzed using this invention, including any
complex data set that can be fit to a Gaussian, Lorentzian, or
similar distribution.
[0161] The comprehensive statistical method of the present
invention can be applied to analyze any type of raw data, including
clinical or environmental data where the sample may be, e.g.,
water, air, soil, serum, blood, saliva, plasma, nipple aspirate,
synovial fluid, cerebrospinal fluid, sweat, urine, fecal matter,
tears, bronchial lavage, swabbings, needle aspirant, semen, vaginal
fluid, pre-ejaculate, etc., to carry out, e.g., screening for
contaminants, pathologic diagnosis, toxicity status, efficacy of a
drug, screening or prognosis of a disease. In fact, the processor
100 of the present invention can analyze any data that can be
expressed with common coordinates. That is, the multiple sets of
data can be mapped into the same coordinate system, thereby
enabling comparison between multiple sets of data. The processor
100 of the present invention can analyze data from mass
spectroscopy, liquid chromatography, two-dimensional gel
chromatography, gas chromatography, etc.
[0162] A listing of various applications and any relevant
considerations will now be discussed. As can be understood, the
applicability of the present invention is not limited to the listed
applications and this list is meant to be illustrative only.
Financial/Commercial/Economic
[0163] In an exemplary embodiment, the present invention can be
used to analyze the trading of securities. For example, when a
stock is trading in an aberrant manner due to insider information,
the price of the stock may fluctuate in a detectable manner. The
present invention can identify such stocks based on their trading
patterns and can be used to investigate suspicious activities or to
allow investors to better understand the market behavior of such
stocks.
[0164] In another exemplary embodiment, the present invention can
be used for market analysis to better understand the behavior of
consumers and their spending patterns.
Environmental/Geological, Sociological
[0165] In an exemplary embodiment, the present invention can be
used to analyze environmental data, such as for detecting
environmental contaminants, using a variety of sample types (e.g.,
air, water, soil). The present invention can also be used to analyze
geological data including, for example, seismic data and radio
mapping data.
Disease Diagnosis, Prognosis, Management
[0166] The present invention has multiple utilities including, but
not limited to, utilities that require the comparison of one or
more proteins, either known or unknown, between samples or between
a sample and a standard. In general, the methods and apparatus of
the present invention have utility in proteome research. In an
exemplary embodiment, the present invention is used for the
diagnosis, prognosis and management of a disease or condition. The
key to the development of diagnostic or prognostic assays is the
ability to detect a few marker molecules, often proteins, that are
differentially expressed in affected patients.
Since the variety and amounts of molecules circulating in the blood
or in other biological samples (e.g., cerebrospinal fluid, cell
culture, urine, sweat, buccal swab, tissue biopsy, or aspiration
sample) at any given moment may differ substantially from one
individual to another, the methods and systems described herein can
be used to recognize patterns of the disease state or related
pattern and thereby provide information as to the disease or
condition diagnosis, prognosis, or management.
[0167] The present invention is useful for detecting and following
the course of any cancer, any disease that, like cancer, results
from acquired genetic mutations, or any condition that leads to the
production of an abnormal product that may be detectable in blood,
urine, or any other type of sample.
[0168] In addition to the aforementioned, the present invention is
applicable to any disease or disease state that can be classified
in accordance with quantitatively expressible characteristics
including, for example, cardiopulmonary diseases, autoimmune
diseases, Alzheimer's disease, arthritis, infectious diseases, and
allergies.
[0169] Cancer has become one of the leading causes of death in the
Western world, second only to heart disease. Current estimates
project that one person in three in the U.S. will develop cancer,
and that one person in five will die from cancer. Cancers can be
viewed as altered cells that have lost the normal growth-regulating
mechanisms.
[0170] The present invention can be used for the prognosis,
diagnosis, and management of cancer. Particular cancers that can be
diagnosed or given a prognosis using the method of the present
invention include, for example, carcinomas, melanomas, lymphomas,
sarcomas, blastomas, leukemias, myelomas, and neural tumors.
Particular
cancer states contemplated by the present invention include
melanoma, non-small cell lung, small-cell lung, lung,
hepatocarcinoma, retinoblastoma, astrocytoma, glioblastoma,
leukemia, neuroblastoma, head, neck, breast, pancreatic, prostate,
renal, bone, testicular, ovarian, mesothelioma, cervical,
gastrointestinal, lymphoma, brain, colon, and bladder.
Other diseases that can be diagnosed, managed, or given a prognosis
include rheumatoid arthritis, inflammatory bowel disease,
osteoarthritis, leiomyomas, adenomas, lipomas, hemangiomas,
fibromas, vascular occlusion, restenosis, atherosclerosis,
pre-neoplastic lesions, carcinoma in situ, oral hairy leukoplakia
and psoriasis.
Other Medical and Biological Applications
[0171] The present invention can be applied to a wide variety of
other medical uses including, for example, genotyping, isotyping,
tissue typing, determining the efficacy of drugs or therapies
(alone or in combination), predicting and analyzing drug-drug
interactions, determining the state of perturbation of a body
organ, and detecting the presence of one or more pathogens.
[0172] The present invention is useful for the analysis of gene
expression profiles. The data obtained from measuring the
transcriptional rate of genes can be analyzed in accordance with
the present invention to determine, for example, which gene sets are
co-regulated as determined from the correlation of gene expression.
The transcriptional rate can be measured by techniques of
hybridization to arrays of nucleic acid or nucleic acid mimic
probes or by other gene expression technologies. However measured,
the result is either the absolute or relative amounts of transcripts,
or response data including values representing RNA abundance
ratios, which usually reflect DNA expression ratios (in the absence
of differences in RNA degradation rates). In various embodiments of
the present invention, aspects of the biological state other than
the transcriptional state, such as the translational state, the
activity state, or mixed aspects can be measured. Preferably,
measurement of the transcriptional state is made by hybridization
to transcript arrays.
[0173] In a preferred embodiment, the present invention makes use
of "transcript arrays" (also called herein "micro arrays").
Transcript arrays can be employed for analyzing the transcriptional
state in a biological sample and especially for measuring the
transcriptional states of a biological sample exposed to graded
levels of a drug of interest or to graded perturbations to a
biological pathway of interest. In one embodiment, transcript
arrays are produced by hybridizing detectably labeled
polynucleotides representing the mRNA transcripts present in a cell
(e.g., fluorescently labeled cDNA synthesized from total cell mRNA)
to a microarray.
[0174] Potential drug targets and/or candidates can also be
identified using the present invention. For example, the present
invention can be utilized to identify proteins that are
differentially expressed in diseased cells as compared to normal
cells. Such differentially expressed proteins can serve as targets
for drugs or as potential therapeutics. In a related
fashion, the methods can be used in toxicology studies to identify
proteins that are differentially expressed in response to
particular toxicants. Such differentially expressed proteins can
serve as potential targets or as potential antidotes for particular
toxic compounds or challenges. Because of its high discrimination
power, this technique can facilitate such investigations: it
enables proteins to be detected even in low abundance, which makes
it easier to identify real differences in expression between
different samples.
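As a hedged illustration of how differentially expressed proteins might be flagged, the sketch below computes a point-wise statistic at each measurement position and keeps the positions whose statistic exceeds a cutoff. It assumes Welch's t statistic as the point-wise test and a fixed illustrative cutoff of 4.0; the intensities, the cutoff, and the function names are hypothetical, and the threshold determination described elsewhere in this specification is not reproduced here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # no spread in either group; treat as no evidence
    return (mean(a) - mean(b)) / se


def pointwise_statistics(group_1, group_2):
    """One t statistic per measurement position (column)."""
    n_points = len(group_1[0])
    return [
        welch_t([s[i] for s in group_1], [s[i] for s in group_2])
        for i in range(n_points)
    ]


def candidate_markers(stats, threshold):
    """Indices of positions whose absolute statistic reaches the threshold."""
    return [i for i, t in enumerate(stats) if abs(t) >= threshold]


# Hypothetical intensities: rows are samples, columns are positions.
diseased = [[1.0, 5.1, 2.0], [1.2, 5.3, 2.1], [0.9, 4.9, 1.9], [1.1, 5.2, 2.2]]
normal = [[1.1, 3.0, 2.1], [1.0, 3.2, 2.0], [1.2, 2.9, 1.8], [0.9, 3.1, 2.2]]

stats = pointwise_statistics(diseased, normal)
markers = candidate_markers(stats, threshold=4.0)
print(markers)
```

Only the middle position differs clearly between the two groups in this made-up data, so it is the only candidate marker returned.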
[0175] The effect of chemical moieties or a combination of moieties
on protein expression patterns can be analyzed. Alterations to the
chemical moiety or combination can then be made and protein
expression reassessed to determine what effect if any the
alteration has on protein expression. Such studies can be useful,
for example, in making derivatives of a lead compound identified
during initial drug screening trials.
Data Sources
[0176] The data that can be analyzed in accordance with the present
invention may be obtained from any of a number of sources and
instruments using various techniques. Such sources and relevant
considerations for each will now be discussed.
[0177] As already discussed, a source of data for an exemplary
embodiment of the present invention includes the mass spectrometer
(MS). Mass spectrometry requires that the sample be ionized; it is
then passed through a mass discriminator to a mass detector. Two
ionization methods are electrospray ionization (ESI) and
matrix-assisted laser desorption/ionization (MALDI). There are also
various mass discriminators that may be used, including, for example,
time-of-flight (TOF), a quadrupole ion trap, a double quadrupole,
or a triple quadrupole. MS/MS may also be used; where MS involves
the collective mass analysis of several peptides (i.e. peptide mass
fingerprint), MS/MS analysis is a measurement based on a single
peptide and the fragments derived from it via a collisional or
decay process. Two basic strategies have been proposed for the MS
identification of proteins after separation: mass profile
fingerprinting and sequencing of one or more peptide domains by
MS/MS. Refinements in both of these techniques have also reduced
the amount of individual proteins needed to achieve signature
detection. One particularly preferred MS instrument that may be
used is the MALDI-TOF/MS. Surface-enhanced laser desorption/ionization
(SELDI) MS, which consists of two closely linked techniques,
MALDI-TOF MS coupled with a pre-chromatography step based upon
solid-phase adsorption on a multi-platform chip interface, may also
be used. An instrument for this technique is produced by Ciphergen.
[0178] Other spectral analysis techniques may also be used to
obtain datasets for analysis by the present invention. These
include fluorescence spectroscopy; IR spectroscopy, including FT-IR
spectroscopy; laser microscopy, including scanning confocal laser
microscopy; Raman spectroscopy, including surface-enhanced Raman
(SERS) and resonance Raman; chemiluminescence; electrical
phenomenology; UV-Vis spectroscopy, including reflection absorption,
absorption and transmission; and near-IR spectroscopy.
[0179] Nuclear magnetic resonance (NMR), including 2D NMR, may be
used to provide biological datasets for analysis by the present
invention. Magnetic resonance imaging (MRI) may also be used.
[0180] X-Ray diffraction data collected in the time or frequency
domain may also be used.
[0181] Electrophoresis, such as two-dimensional gel electrophoresis,
is often used to separate components in the biological sample
before they are detected by the MS or other measurement instrument.
The electrophoretic system used may be either one- or
two-dimensional.
[0182] Capillary electrophoresis (CE) is a different type of
electrophoresis, and involves resolving components in a mixture
within a capillary to which an electric field is applied.
[0183] Isoelectric focusing is an electrophoretic method in which
zwitterionic substances such as proteins are separated on the basis
of their isoelectric points (pI).
[0184] Capillary isoelectric focusing (CIEF) involves separating
analytes such as proteins within a pH gradient according to their
isoelectric point (i.e., the pH at which the analyte has no net
charge). A second method, capillary zone
electrophoresis (CZE), fractionates analytes on the basis of their
intrinsic charge-to-mass ratio. Capillary gel electrophoresis (CGE)
is designed to separate proteins according to their molecular
weight.
[0185] Capillary zone electrophoresis is an electrophoretic method
conducted in free solution without a gel matrix and results in the
separation of molecules such as proteins based upon their intrinsic
charge-to-mass ratio.
[0186] Any type of technique capable of separating proteins can be
utilized. Suitable methods include, but are not limited to, HPLC,
GC, ion exchange chromatography, affinity chromatography, and other
separation techniques.
[0187] Samples may be introduced into the spectrometers or other
devices to obtain the raw data for analysis. Alternatively, samples
may come from a device that separates, processes, or otherwise
alters the sample before detection. These devices include, but are
not limited to, biochips, lab-on-a-chip devices, microchips,
DNA-based microarrays, other array devices, and microfluidic systems.
[0188] Biological data can be from health data, clinical data, or
from a biological sample (e.g., a biological sample from a human,
such as serum, blood, saliva, plasma, nipple aspirates, synovial
fluids, cerebrospinal fluids, sweat, urine, fecal matter, tears,
bronchial lavage, swabbings, needle aspirates, semen, vaginal
fluids, pre-ejaculate, etc.) that is analyzed to determine
the biological state of the donor. The biological state can be a
pathologic diagnosis, toxicity state, efficacy of a drug, prognosis
of a disease, etc. The biological sample may be, for example,
blood, cerebrospinal fluid, cell culture, urine, sweat, buccal
swab, tissue biopsy, or aspiration sample.
[0189] For any given clinical symptom, there can be one, two,
dozens, or possibly hundreds of causative agents or targets for the
probes of the diagnostic device. A target can be one or more
microbes such as bacteria, viruses, mycoplasma, rickettsia, chlamydia,
protozoa, plant cells (such as algae and pollens), or fungi. A
target can also be a genetic disorder such as a single nucleotide
polymorphism (SNP), a specific gene that is not normally present or
expressed, or not present in multiple copies, or a mutation in a
normally present gene. A target can also be a therapeutic
optimization factor. For example, a target can be a specific
microbial gene that renders a particular microbe susceptible or
resistant to a particular drug. A target can also be a particular
genetic sequence in a subject that makes the subject resistant,
tolerant, or intolerant (allergic) to a particular drug. These
types of targets can be used to develop a specific, tailored, and
optimized therapeutic regimen. In addition, the targets can be
selected to provide results that are most accepted by physicians
and/or clinicians.
[0190] One goal of target selection is to select a number of
targets (i.e., associated with possible causes of a specific
symptom) that provides a high level of reliability that one of the
selected targets is the cause of the symptom, and optionally to
select additional targets that can be used to optimize therapy. In
other words, the goal is to select targets that are the most likely
to be the cause of the symptom. For example, if there are 50
possible targets that can cause a symptom, but only 10 targets are
known to cause 90% of the clinically observed instances of a given
symptom, then a diagnostic device might include probes (e.g., 10 or
more probes) to detect only those 10 targets to provide a
sufficient level of reliability. This device would not provide a
positive result if the cause of the symptom in a subject happens to
be one of the targets in the 10% not detected by the device. A more
sophisticated diagnostic device might include an additional set of
probes that are specific for 10 more known targets that together
with the first 10 targets are known to cause 99% of the clinically
observed instances of the symptom. Either device can include probes
designed to optimize therapy. Of course, other scenarios are
possible.
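The target-selection reasoning above can be sketched as a small coverage calculation. This is a minimal illustration only: it assumes each observed instance of the symptom is attributed to exactly one target, so that the shares of different targets can simply be added, and it picks the most frequent targets first. The case counts, target names, and the function `select_targets` are hypothetical; the counts were chosen to reproduce the 90% and 99% figures of the example.

```python
def select_targets(case_counts, required_percent):
    """Pick the most frequent targets until required_percent of cases is covered.

    case_counts maps a target name to the number of observed cases it caused.
    Returns the selected target names and the fraction of cases they cover.
    """
    total = sum(case_counts.values())
    ranked = sorted(case_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for name, count in ranked:
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= required_percent * total:
            break
        selected.append(name)
        covered += count
    return selected, covered / total


# Hypothetical data: 50 possible targets and 3000 observed cases.
case_counts = {}
case_counts.update({f"major_{i}": 270 for i in range(10)})  # 90% of cases
case_counts.update({f"minor_{i}": 27 for i in range(10)})   # a further 9%
case_counts.update({f"rare_{i}": 1 for i in range(30)})     # remaining 1%

basic_panel, basic_coverage = select_targets(case_counts, 90)
extended_panel, extended_coverage = select_targets(case_counts, 99)
print(len(basic_panel), basic_coverage)
print(len(extended_panel), extended_coverage)
```

Under these assumptions the basic panel needs probes for 10 targets and the extended panel for 20, matching the two devices described in the example.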
[0191] To provide a high degree of accuracy, several different
probes can be used to detect and/or quantitate a single, specific
target. For example, one probe can be designed to specifically bind
to one epitope of an antibody target, and a second probe can be
designed to specifically bind to a second epitope of the same
antibody. In another example, one probe can be specific for an
enzyme that is produced by a specific microbe, a second probe can
be specific for a specific nucleic acid associated with that
microbe, and a third probe can be specific for an antibody in a
subject's bloodstream after exposure to the microbe. In addition,
numerous probes of the same type can be clustered into separate
locations or spots on a substrate to ensure that the sample is evenly
distributed over the entire array and that even low concentrations
of target are detected. Two or more probes that recognize different
epitopes of an antibody can also be mixed and placed on the same
spot.
[0192] In each case, the probes are designed to specifically bind
to an analyte that is, or is associated with, a target. For
example, if the target is an antibody, the antibody is the analyte.
If the target is a microbial gene, then a specific nucleic acid
sequence can be the analyte. If the target is a genetic disorder in
the subject, then the analyte might be a SNP or a specific mutant
nucleic acid sequence.
[0193] Although the present invention and its advantages have been
described in detail, it should be understood that various changes,
substitutions and alterations can be made herein without departing
from the spirit and scope of the invention as defined by the
appended claims. Moreover, the scope of the present application is
not intended to be limited to the particular embodiments of the
process, machine, manufacture, composition of matter, means,
methods and steps described in the specification. As one of
ordinary skill in the art will readily appreciate from the
disclosure of the present invention, processes, machines,
manufacture, compositions of matter, means, methods, or steps,
presently existing or later to be developed that perform
substantially the same function or achieve substantially the same
result as the corresponding embodiments described herein may be
utilized according to the present invention. Accordingly, the
appended claims are intended to include within their scope such
processes, machines, manufacture, compositions of matter, means,
methods, or steps.
* * * * *
References