U.S. patent application number 13/464254 was filed with the patent office on 2012-05-04 and published on 2013-05-16 for method and apparatus for representing multidimensional data.
This patent application is currently assigned to Nodality, Inc. The applicants listed for this patent are Herbert Alan Holyst, Allan Robert Moser, and Wade Thomas Rogers. Invention is credited to Herbert Alan Holyst, Allan Robert Moser, and Wade Thomas Rogers.
Application Number | 13/464254 |
Publication Number | 20130124522 |
Family ID | 38581568 |
Filed Date | 2012-05-04 |
Publication Date | 2013-05-16 |
United States Patent Application | 20130124522 |
Kind Code | A1 |
Moser; Allan Robert ; et al. | May 16, 2013 |
METHOD AND APPARATUS FOR REPRESENTING MULTIDIMENSIONAL DATA
Abstract
The present invention relates to methods for representing
multidimensional data. The methods of the present invention are
well suited but not limited to the representation of
multidimensional data in such a way as to enable the comparison and
differentiation of data sets. For example, the invention may be
applied to the representation of flow cytometric data. The
invention further relates to a program storage device having
instructions for controlling a computer system to perform the
methods, and to a program storage device containing data structures
used in the practice of the methods.
Inventors: | Moser; Allan Robert; (Swarthmore, PA) ; Rogers; Wade Thomas; (West Chester, PA) ; Holyst; Herbert Alan; (Morton, PA) |
Applicant: |
Name | City | State | Country | Type |
Moser; Allan Robert | Swarthmore | PA | US | |
Rogers; Wade Thomas | West Chester | PA | US | |
Holyst; Herbert Alan | Morton | PA | US | |
Assignee: | Nodality, Inc. (South San Francisco, CA) |
Family ID: | 38581568 |
Appl. No.: | 13/464254 |
Filed: | May 4, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12293081 | Jun 11, 2009 | 8214157 |
PCT/US2007/008246 | Mar 30, 2007 | |
13464254 | | |
60787908 | Mar 31, 2006 | |
Current U.S. Class: | 707/737 |
Current CPC Class: | G06F 16/283 20190101; G06K 9/6298 20130101; G01N 15/1429 20130101; G06F 16/22 20190101; G16B 40/00 20190201 |
Class at Publication: | 707/737 |
International Class: | G06F 17/30 20060101 G06F017/30; G11C 7/20 20060101 G11C007/20 |
Claims
1. A program storage device readable by a machine, said device
tangibly embodying at least one program of instructions executable
by the machine to cause the machine to perform steps for a method
of representing data at multiple resolutions, said method
comprising: a. providing a data set; b. representing said data in a
multidimensional space; c. dividing said multidimensional space
into discrete data bins; and d. subdividing data from each bin into
finer resolution bins, wherein for at least one current bin, the
subdividing comprises: i. determining the direction of maximum
variance of data contained within the current bin; ii. rotating the
coordinates of the data space in the direction of maximum variance,
wherein the first axis of the rotated coordinates is parallel to
the direction of maximum variance; iii. determining the median
value of the first coordinate in the rotated coordinate system for
the collection of data comprising the selected bin; iv. splitting
the data comprising the current bin into two finer resolution bins,
the first portion of the selected, split bin being comprised of
events with a first coordinate less than or equal to the median,
the second portion of the selected, split bin being comprised of
events with a value of the first coordinate greater than the
median; and v. recording the rotation and median value (split
value) associated with the current, split bin to a storage
device.
2. The program storage device of claim 1, further comprising
instructions for: forming a bin of lowest resolution encompassing
the complete data space and comprising all of the data within the
data set; and beginning with the lowest resolution, iterating over
each level of resolution, subdividing each bin at a given
resolution to form two bins at a higher resolution, continuing said
subdivision until the desired number of bins is obtained.
3. The program storage device of claim 2, further comprising
instructions for: rotating the data space by applying the rotation
matrix corresponding to the current bin after said subdividing; and
splitting the data comprising the current bin into two bins at the
next hierarchical resolution level by using the split value for the
current bin, wherein the first portion of the split bin is
comprised of events with a first coordinate value less than or
equal to the median, further wherein the second portion of the
split bin is comprised of events with a first coordinate value that
is greater than the median.
4. The program storage device of claim 1, further comprising
instructions for determining hyperplane boundaries of said bins,
said method comprising: a. specifying a rotation matrix of unit
diagonal and zero off diagonal elements as the parent of the lowest
resolution bin; b. starting with the bin of lowest resolution,
defining the hyperplane boundaries as the set of boundaries read in
from the storage device; c. beginning with the lowest resolution,
iterating over each level of resolution, intersecting the
hyperplane boundaries of the current bin with the hyperplane
boundary utilized to split the current bin into its two children
bins of higher resolution; and d. recording the two sets of
boundaries determined by the intersection as the hyperplane
boundaries of the two children bins.
5. The program storage device of claim 4, wherein step c.) of the
method further comprises: i. multiplying the rotation matrix for a
bin by the rotation matrix of its parent bin; ii. associating this
product matrix with the current bin to be used as a parent bin in
the next step in the iteration; iii. constructing a direction
vector from the elements of the first column of the product matrix
computed in the previous step of the iteration; iv. finding the
hyperplane perpendicular to the direction vector constructed in the
previous step of the iteration, wherein the vector passes through
the split value for the current bin; and v. identifying the
hyperplane found in the previous step as the boundary utilized to
split the current bin into its two children bins of higher
resolution.
6. The program storage device of claim 1, further comprising
instructions for determining one-dimensional lists of numbers
comprising fingerprints for a set of instances relative to the
representation of a multidimensional data set processed by the
binning procedure, the method comprising forming a template
instance by combining the events from a set of instances into a
single data set.
7. The program storage device of claim 6, further comprising
instructions for calculating an event density for each bin by
dividing the number of events in each bin by the total number of
events comprising the instance, for each of the instances in the
set of instances.
8. The program storage device of claim 6, further comprising
instructions for enumerating the bins in order of hierarchies of
increasing resolution, and within a resolution level, in the order
in which the bins were determined.
9. The program storage device of claim 8, further comprising
instructions for the step of recording the list of numbers on a
storage device.
10. The program storage device of claim 6, further comprising
instructions for determining one-dimensional lists of numbers
comprising fingerprints for sets of instances relative to the
representations of two or more multidimensional data sets, the
method comprising: a. specifying two or more sets of instances,
each set comprising a class of data sets; and b. for each class,
determining a set of bins representing each template instance and
forming a template instance for that class by combining the events
from the set of instances comprising the class into a single data
set.
11. The program storage device of claim 10, further comprising
instructions, for each feature in the fingerprints for instances
comprising each class, for calculating the average and standard
deviation of each feature, further wherein an average and standard
deviation are associated with each bin for each class.
12. The program storage device of claim 10, further comprising
instructions, for each class, for the instances not comprising that
class, binning the data comprising each instance not of that class
relative to template instance for that class, and for the binned
representations of instances found in the previous step,
enumerating the bins in order of hierarchies of increasing
resolution, and within a resolution level, in the order in which
the bins were determined, in order to form fingerprints for each
instance.
13. The program storage device of claim 11, further comprising
instructions, for each fingerprint in each class, for calculating a
z-score for each feature in the fingerprint by subtracting the
average associated with the class for the corresponding feature and
then dividing that result by the standard deviation associated with
the class for the corresponding feature, wherein the resulting
values provide a set of fingerprints for each instance, the number
of elements of that set being equal to the number of classes.
14. The program storage device of claim 12, further comprising
instructions, for each fingerprint in each class, for calculating a
z-score for each feature in the fingerprint by subtracting the
average associated with the class for the corresponding feature and
then dividing that result by the standard deviation associated with
the class for the corresponding feature, wherein the resulting
values provide a set of fingerprints for each instance, the number
of elements of that set being equal to the number of classes.
15. The program storage device of claim 12, further comprising
instructions, for each instance, for combining the set of
fingerprints by concatenating the lists of elements in each
fingerprint, thereby forming a single fingerprint for each instance
which contains that instance's z-score calculated relative to every
class.
16. The program storage device of claim 6, further comprising
instructions for forming a categorical fingerprint, the method
comprising: a. defining a many-to-one mapping of continuous valued
numbers into a discrete set of values, said values being at least
one member selected from the group consisting of integers and a
discrete label, wherein the method of mapping is at least one
mathematical transform selected from the group consisting of
quantization, a transform based on a machine learning method, or
any transform capable of a many-to-one mapping; b. applying the
mapping to each feature of the fingerprint; and c. creating a list
of the mapped features, thereby forming a fingerprint consisting of
categorical features.
17. The program storage device of claim 10, further comprising
instructions for forming a categorical fingerprint, the method
comprising: a. defining a many-to-one mapping of continuous valued
numbers into a discrete set of values, said values being at least
one member selected from the group consisting of integers and a
discrete label, wherein the method of mapping is at least one
mathematical transform selected from the group consisting of
quantization, a transform based on a machine learning method, or
any transform capable of a many-to-one mapping; b. applying the
mapping to each feature of the fingerprint; and c. creating a list
of the mapped features, thereby forming a fingerprint consisting of
categorical features.
18. The program storage device of claim 16, further comprising
instructions for forming a binary fingerprint, the method
comprising: a. specifying the number of non-redundant, discrete
features that comprise a categorical fingerprint; b. assigning an
integer ordinal to each categorical feature; c. creating a mapping
of each categorical feature to a string of binary digits, the
number of elements in the string being equal to the number of
categorical features, by setting all digits in the string to zero
excepting the element whose position in the string corresponds to
the ordinal of the categorical feature, which ordinal-corresponding
element being set to one; d. applying the mapping described in the
previous step to each feature of the categorical fingerprint; and
e. creating a list of the mapped features, thereby forming a
fingerprint consisting of binary features.
19. A program storage device readable by a machine, said device
tangibly embodying at least one program of instructions executable
by the machine to cause the machine to perform steps for a method
of representing data at multiple resolutions, said method
comprising: a. providing a first data set; b. representing said
data in a multidimensional space; c. dividing said multidimensional
space into discrete data bins; d. subdividing data from each bin
into finer resolution bins; e. determining the direction of maximum
variance of data contained within at least one bin; f. rotating the
coordinates of the data space in the direction of maximum variance,
wherein the first axis of the rotated coordinates is parallel to
the direction of maximum variance, further wherein the rotation is
based on the data from said first data set; g. determining the
median value of the first coordinate in the rotated coordinate
system for the collection of data comprising the selected bin; h.
splitting the data comprising the selected bin into two bins at the
next hierarchical resolution level, the first portion of the
selected, split bin being comprised of events with a first
coordinate value less than or equal to the median, the second
portion of the selected, split bin being comprised of events with a
first coordinate value greater than the median; i. recording the
rotation matrix and median value (split value) associated with the
current, split bin to a storage device; j. representing a second
data set in a second multidimensional space; k. dividing said
second multidimensional space into a second set of discrete data
bins; l. subdividing data from each of said second bins into finer
resolution bins; m. rotating the coordinates of the second data
space based on the corresponding rotation matrix from said first
data set; n. in a selected second bin, splitting the data
comprising the second selected bin into two bins at the next
hierarchical resolution level, the first portion of the second
selected, split bin being comprised of events with a first
coordinate value less than or equal to the median of the
corresponding bin determined for said first data set in step g.),
the second portion of the second selected, split bin being
comprised of events with a first coordinate value greater than the
median of the corresponding bin determined for said first data set
in step g.); and o. determining one-dimensional lists of numbers
comprising fingerprints for a set of instances relative to the
representation of a multidimensional data set processed by the
binning procedure, the method comprising forming a template
instance by combining the events from a set of instances into a
single data set.
20. A computing environment providing a device readable by a
machine, said device tangibly embodying at least one program of
instructions executable by the machine to cause the machine to
perform steps for a method of representing data at multiple
resolutions, said method comprising: a. providing a data set; b.
representing said data in a multidimensional space; c. dividing
said multidimensional space into discrete data bins; d. subdividing
data from each bin into finer resolution bins; e. determining the
direction of maximum variance of data contained within at least one
bin; f. rotating the coordinates of the data space in the direction
of maximum variance, wherein the first axis of the rotated
coordinates is parallel to the direction of maximum variance; g.
determining the median value of the first coordinate in the rotated
coordinate system for the collection of data comprising the
selected bin; h. splitting the data comprising the selected bin
into two bins at the next hierarchical resolution level, the first
portion of the selected, split bin being comprised of events with a
first coordinate value less than or equal to the median, the second
portion of the selected, split bin being comprised of events with a
first coordinate value greater than the median; and i. recording
the rotation matrix and median value (split value) associated with
the current, split bin to a storage device.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a Continuation Application which claims
the benefit of U.S. application Ser. No. 12/293,081, filed Jun. 11,
2009; which is a National Stage application of PCT International
Application No. PCT/US2007/008246, filed Mar. 30, 2007, which in
turn claims the benefit of U.S. Provisional Application No.
60/787,908, filed on Mar. 31, 2006, each of which is hereby
incorporated by reference in its entirety herein.
BACKGROUND OF THE INVENTION
[0002] A common task for many applications is to compare data sets
in order to distinguish two or more classes forming sub-populations
of those data. One example of such an application involves the use
of flow cytometry for medical diagnosis.
[0003] Flow cytometry can be used to measure properties related to
individual cells in a sample of blood drawn from a patient. A
liquid stream in the cytometer carries and aligns individual cells
so that they pass through a laser beam in single file. As a cell
passes through the laser beam, light is scattered from the cell
surface. Photomultiplier tubes collect the light scattered in the
forward and side directions, which gives information related to the
cell size and shape. This information may be used to identify the
general type of cell (e.g., monocyte, lymphocyte, or granulocyte).
[0004] Additionally, fluorescent molecules (fluorophores) that can
be conjugated with antibodies can be activated by the laser and
emit light. Since these antibodies bind with antigens on the cells,
the amount of light detected from the fluorophores is related to
the number of antigens on the surface of the cell passing through
the beam. The specific set of fluorescently tagged antibodies that
is chosen can depend on the types of cells to be studied since
different types of cells have different distributions of cell
surface antigens. Several tagged antibodies are used
simultaneously, so measurements made as one cell passes through the
laser beam consist of scattered light intensities as well as light
intensities from each of the fluorophores. Thus, the
characterization of a single cell can consist of a set of measured
light intensities that may be represented as a coordinate position
in a multidimensional space. Considering only the light from the
fluorophores, there is one coordinate axis corresponding to each of
the fluorescently tagged antibodies. The number of coordinate axes
(the dimension of the space) is the number of fluorophores used.
Modern flow cytometers can measure several colors associated with
different fluorophores and thousands of cells per second. Thus, the
data from one subject can be described by a collection of
measurements related to the number of antigens of certain types on
individual cells for each of (typically) many thousands of
individual cells.
[0005] By way of example, one would like to determine if a patient
has a specific illness based on a set of objective measurements
obtained from a blood sample that is analyzed with a flow
cytometer. The terminology used to describe data is as follows. One
case (e.g. the flow cytometric data derived from a blood sample
taken from a patient) is called a "sample instance." (The terms
"instance" and "sample" are also used.) Several sample instances
may be associated with each other forming a class of instances such
as the class of patients having a disease or the class of patients
who are healthy. Multiple sets of measurements (e.g. the measured
light intensities for each cell passing through the flow cytometer)
can be made for one instance. Each of these sets of measurements
can be referred to as an "event." In terminology of the present
invention, the data for an instance can consist of a distribution
of points in a multidimensional space, with each point representing
one event and with each coordinate of a point representing a
measurement of light intensity from a single fluorophore. For
example, FIG. 1 shows an example of flow cytometry data for four
fluorescent parameters. Since humans cannot visualize a
4-dimensional space, these data are shown as pair-wise dot
plots.
[0006] Data of the type described above, consisting of several
thousand events (or points) in a multidimensional parameter space,
are best described by a density function, i.e. the number of events
per unit volume of the space. Often, this density function is
normalized by the total number of events comprising the instance.
If this density function is known for some population of instances,
it can be used to specify the probability that an event will be
found within some region of the parameter space for instances
belonging to this population. In mathematical terminology this is
referred to as a probability density function (PDF).
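The normalization described in this paragraph can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the region of interest is an axis-aligned box with half-open bounds; the function names `event_fraction` and `event_density` are hypothetical and do not appear in the application.

```python
from typing import Sequence, Tuple

Event = Tuple[float, ...]


def event_fraction(events: Sequence[Event],
                   lower: Sequence[float],
                   upper: Sequence[float]) -> float:
    """Fraction of an instance's events inside the box [lower, upper)."""
    if not events:
        raise ValueError("an instance needs at least one event")
    inside = sum(
        1 for e in events
        if all(lo <= x < hi for x, lo, hi in zip(e, lower, upper))
    )
    # Normalize by the total number of events comprising the instance.
    return inside / len(events)


def event_density(events: Sequence[Event],
                  lower: Sequence[float],
                  upper: Sequence[float]) -> float:
    """Normalized event fraction per unit volume of the box."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo
    if volume <= 0.0:
        raise ValueError("the box must have positive volume")
    return event_fraction(events, lower, upper) / volume
```

For example, with four 2-dimensional events of which two fall inside the box from (0, 0) to (0.5, 0.5), `event_fraction` returns 0.5, which estimates the probability that an event of this instance lands in that region.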
[0007] In the example of flow cytometry for medical diagnosis, each
class of instances (e.g. diseased or healthy) has an associated
multidimensional PDF. The problem that arises in diagnosis can be
that of determining the PDF for two or more classes of instances,
measuring the density of events for a newly observed instance, and
by comparing these distributions, assigning the newly observed
instance to a class. Thus, accurately representing multidimensional
data in such a way as to enable this classification is
critical.
[0008] Flow cytometry has been in use as a clinical tool for many
years (Johnson 1993 and Jennings 1997). In many applications, an
optimized panel of antibodies is used to identify specific cell
types. If a cell of a specific type is present, the intensity
measured for the corresponding fluorophore will be high (positive
events); if it is not present, the intensity will be low (negative
events). In this case, one can count cells of different types by
applying a threshold to the signal such that the signal intensity
for negative events falls below the threshold and the signal
intensity for positive events falls above the threshold. For
multiple antibodies, the flow cytometric space is divided into
"quadrants" using these thresholds, and thus the numbers of cells
in each quadrant can be counted.
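The quadrant counting described above can be sketched as follows for two fluorescence parameters. This is an illustrative sketch only: the function name `quadrant_counts`, the label strings, and the treatment of an intensity exactly equal to the threshold (counted as negative) are assumptions not stated in the application.

```python
from typing import Dict, Iterable, Tuple


def quadrant_counts(events: Iterable[Tuple[float, float]],
                    threshold_x: float,
                    threshold_y: float) -> Dict[str, int]:
    """Count events per quadrant; '+' means intensity above the threshold."""
    counts = {"++": 0, "+-": 0, "-+": 0, "--": 0}
    for x, y in events:
        # Assumption: an intensity equal to the threshold is counted as negative.
        label = ("+" if x > threshold_x else "-") + ("+" if y > threshold_y else "-")
        counts[label] += 1
    return counts
```

With the first parameter taken as one antibody and the second as another, a ratio such as `counts["+-"] / counts["-+"]` compares the two single-positive populations.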
[0009] An example is shown in FIG. 2 for T-lymphocytes. CD4
positive events indicate the presence of helper T cells that play a
role in regulating immune response. CD8 positive events indicate
the presence of cytotoxic T cells that destroy infected cells. The
ratio of CD4 positives to CD8 positives is a measure of immune
status and can be used to diagnose or follow the progression of HIV
infection since the HIV virus targets helper T cells.
[0010] Flow cytometric quadrant analyses, as described above, are
possible when the cell antigens and specific antibodies are well
characterized. However, in cases where these are not known or cell
surface markers change with time, the distributions of intensity
levels from flow cytometry measurements are complex and thus a
simple positive/negative analysis is not possible. An example of an
especially important class of cells that are not well characterized
is Circulating Endothelial Progenitor Cells (CEPCs). These cells
play a key role in post-natal angiogenesis and vascular
development. A method of cytometrically identifying CEPCs would be
of great interest for diagnostics and therapeutics related to
cardiovascular pathology and conditions involving
neovascularization such as ischemia, diabetic retinopathy, and
tumor growth.
[0011] Other methods of representing and analyzing multidimensional
flow cytometry data have been developed. One that is most closely
related to the herein described methods and apparatus is
Probability Binning (Roederer 2001). Roederer's method of
Probability Binning represents a multidimensional probability
distribution as a set of bins defining regions of the
multidimensional space. The boundaries of these bins are chosen so
that approximately equal numbers of events lie in each bin. Bins
are found recursively by selecting a coordinate dimension,
determining the median in that coordinate, and subdividing the data
such that events whose values for this coordinate are less than the
median are placed in one bin while those whose values for this
coordinate are greater than the median are placed in another bin.
Dividing the data at the median ensures that for each subdivision
of a "parent" bin, the "children" bins have equal numbers of events
(plus or minus one if the number of events in the parent bin is
odd). These two children bins are then processed in a similar way,
splitting the data into four bins. This recursive method is
continued until the desired number of bins is obtained. The method
used by Roederer et al. to select the coordinate dimension at each
subdivision is to calculate the variance of the data in the parent
bin for all the coordinate dimensions and choose the dimension
having the largest variance. It is important to note that this
split always occurs on one of the coordinate axes of the data as
originally presented. Thus, if the space is 4-dimensional, the data
will be divided according to the coordinate corresponding to one of
those four dimensions. Although the decision is made on the basis
of the variance in each dimension, the split is not necessarily
along the optimal direction since the direction of maximum variance
may not coincide with one of the coordinate axes.
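The axis-aligned procedure summarized in this paragraph can be sketched in Python as below. This is a reading of the description above rather than Roederer's published implementation; the function name `probability_bins`, the use of population variance, and the placement of events equal to the median in the lower child are assumptions.

```python
from statistics import median, pvariance
from typing import List, Sequence, Tuple

Event = Tuple[float, ...]


def probability_bins(events: Sequence[Event], levels: int) -> List[List[Event]]:
    """Split events into 2**levels bins, one axis-aligned median cut at a time."""
    bins: List[List[Event]] = [list(events)]
    for _ in range(levels):
        next_bins: List[List[Event]] = []
        for parent in bins:
            if len(parent) < 2:
                # Nothing to split; keep the bin count doubling with an empty sibling.
                next_bins.extend([parent, []])
                continue
            dims = range(len(parent[0]))
            # Choose the original coordinate axis with the largest variance.
            axis = max(dims, key=lambda d: pvariance([e[d] for e in parent]))
            cut = median(e[axis] for e in parent)
            # Assumption: events equal to the median go to the lower child.
            next_bins.append([e for e in parent if e[axis] <= cut])
            next_bins.append([e for e in parent if e[axis] > cut])
        bins = next_bins
    return bins
```

Because every cut is made on one of the original coordinate axes, data whose spread lies along a diagonal is still cut perpendicular to an axis, which is the limitation noted at the end of the paragraph.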
[0012] However, current practices and approaches fall short of
providing efficient, robust, reliable and accurate systems of
representing multidimensional data that can be used to address the
herein discussed problems. From the foregoing, it is appreciated
that there exists a need for methods and an apparatus that overcome
the shortcomings of those existing previously.
BRIEF SUMMARY OF THE INVENTION
[0013] In an illustrative implementation, the herein described
apparatus and method can use a method similar to probability
binning (referred to herein as Equal Probability Binning), with two
differences. First, the method can form bins by splitting data in
the direction of maximum variance rather than along an original
coordinate axis: the direction of maximum variance can be
determined, and the data space can then be rotated such that the
principal coordinate axis lies in the direction of maximum
variance. Second, a hierarchical, multiresolution representation of
the data can be created. This can be done by retaining and
utilizing information for bins at each level of recursion. The
binned data can be used to develop a fingerprint that can be a
one-dimensional representation embodying the information contained
in the multiresolution, multidimensional representation.
Additionally, the herein described apparatus and methods can
include novel algorithms for finding and representing bins from one
data set and utilizing the bin representation to process a second
data set. It can also include a novel method of forming a
differential fingerprint that represents the degree of
dissimilarity of a given instance to two or more classes of
instances.
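A single split along the direction of maximum variance, and the reuse of that split on a second data set, might be sketched as follows. Several choices here are assumptions made for illustration: the direction is estimated by power iteration on the covariance matrix, and the first rotated coordinate is obtained by projecting each event onto that unit vector instead of building a full rotation matrix. The names `max_variance_direction`, `split_bin`, and `apply_split` are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple

Event = Sequence[float]


def max_variance_direction(events: Sequence[Event], iterations: int = 500) -> List[float]:
    """Unit vector along the leading eigenvector of the covariance matrix."""
    n, d = len(events), len(events[0])
    mean = [sum(e[j] for e in events) / n for j in range(d)]
    cov = [[sum((e[i] - mean[i]) * (e[j] - mean[j]) for e in events) / n
            for j in range(d)] for i in range(d)]
    best, best_var = [1.0] + [0.0] * (d - 1), -1.0
    # Power iteration from each basis vector; keep the direction of largest variance.
    for k in range(d):
        v = [1.0 if j == k else 0.0 for j in range(d)]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        var = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        if var > best_var:
            best, best_var = v, var
    return best


def apply_split(events: Sequence[Event], direction: Sequence[float],
                cut: float) -> Tuple[List[Event], List[Event]]:
    """Split events with a previously recorded direction and split value."""
    lower, upper = [], []
    for e in events:
        first_coordinate = sum(x * u for x, u in zip(e, direction))
        (lower if first_coordinate <= cut else upper).append(e)
    return lower, upper


def split_bin(events: Sequence[Event]):
    """Find the direction and median for one bin, then split it in two."""
    direction = max_variance_direction(events)
    cut = median(sum(x * u for x, u in zip(e, direction)) for e in events)
    lower, upper = apply_split(events, direction, cut)
    return direction, cut, lower, upper
```

Recording `direction` and `cut` for every split is what lets the bins found for one data set be applied unchanged to a second data set through `apply_split`.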
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The invention will be more fully understood from the
following detailed description, taken in connection with the
accompanying drawings, which form a part of this application and in
which:
[0015] FIG. 1 shows an example of a 4-dimensional data set taken
from flow cytometry. Since a 4-dimensional space cannot be visually
displayed, these data are shown as pair-wise dot plots.
[0016] FIG. 2 shows 2-dimensional flow cytometry data for
T-lymphocytes, an example of quadrant analysis for flow cytometric
data. This figure illustrates light intensities for fluorophores
conjugated with antibody CD8 versus those conjugated with antibody
CD4. The space is divided into positives and negatives in CD4 and
CD8. Percentages of positive-positive, positive-negative,
negative-positive, and negative-negative are shown. (From Purdue
University Cytometry Laboratory web-site,
http://www.cyto.purdue.edu).
[0017] FIG. 3 shows an example of the result of minimum variance,
equal probability, hierarchical binning for a 2-dimensional data
set from flow cytometry data. The rectangular box enclosing the
entire data set is the resolution level 0 bin. The first
subdivision of the data is shown by the heavy solid line which
divides the level 0 bin into two equally populated children bins.
This line is in the direction of maximum variance for the entire
data set. The two children bins form the resolution level 1
representation. The two bins at resolution level 1 are each divided
into two bins as indicated by the heavy dashed lines. Again, the
dashed lines are in the direction of maximum variance for the
subspaces being divided. The 4 bins resulting from this subdivision
form the resolution level 2 representation of the data. This
procedure is carried out recursively for levels 2, 3, and 4. At each
step in the recursion, the number of bins is doubled to form the
next resolution level in the resolution hierarchy. The final
resolution level is 5, having a total of 32 bins.
[0018] FIG. 4 is a schematic representation of a fingerprint
showing the bin numbering and subdivisions corresponding to the
different resolution hierarchies. In this example, there are 4
resolution levels. Level 0 has 1 bin that contains the entire data
set; level 1 divides the data into two bins; level 2 into 4 bins;
and level 3 into 8 bins.
[0019] FIG. 5 shows a schematic diagram of fingerprint construction
according to the invention. A. Hierarchical Binning is performed on
aggregates of two classifications of data to create two separate
sub-divisions of multidimensional space. (Although this figure
depicts the process in 2 dimensions for graphical simplicity, in
reality it operates in the full dimensionality of the data). B.
Each set of bins is applied to each individual data set. The event
count in each of the n bins for each data set is mapped to a
1-dimensional array. C. These arrays of event counts are encoded by
assigning z-scores that reflect differences between individual
samples and the population event count in each bin. The z-scores
are then quantized to form categorical features. In the final step,
the categorical features are encoded as binary features to create
binary fingerprints.
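The encoding steps described for FIG. 5 (z-scores, quantization into categorical features, and binary encoding) can be sketched in Python as below. The quantization edges of -1 and +1, the use of the population standard deviation, and all function names are illustrative assumptions; the application leaves the choice of many-to-one mapping open.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def class_statistics(fingerprints: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature average and standard deviation over a class of fingerprints."""
    columns = list(zip(*fingerprints))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def z_scores(fingerprint: Sequence[float], means: Sequence[float],
             stdevs: Sequence[float]) -> List[float]:
    """Subtract the class average, then divide by the class standard deviation."""
    # Assumption: a feature with zero spread in the class gets a z-score of 0.
    return [(f - m) / s if s > 0 else 0.0
            for f, m, s in zip(fingerprint, means, stdevs)]


def quantize(z: float, edges: Sequence[float] = (-1.0, 1.0)) -> int:
    """Many-to-one mapping: category index = number of edges below z."""
    return sum(1 for edge in edges if z > edge)


def one_hot(category: int, n_categories: int) -> List[int]:
    """Binary string with a single 1 at the ordinal of the category."""
    return [1 if i == category else 0 for i in range(n_categories)]


def binary_fingerprint(fingerprint: Sequence[float], means: Sequence[float],
                       stdevs: Sequence[float],
                       edges: Sequence[float] = (-1.0, 1.0)) -> List[int]:
    """Concatenate the one-hot encodings of every quantized z-score."""
    bits: List[int] = []
    for z in z_scores(fingerprint, means, stdevs):
        bits.extend(one_hot(quantize(z, edges), len(edges) + 1))
    return bits
```

With two edges, each feature becomes three binary digits, so a fingerprint of n features yields a binary fingerprint of 3n digits.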
[0020] FIG. 6 depicts an exemplary computing system in accordance
with herein described system and methods.
[0021] FIG. 7 illustrates an exemplary networked
computing environment, with a server in communication with client
computers via a communications network, in which the herein
described apparatus and methods may be employed.
DETAILED DESCRIPTION
[0022] The problems stated in the Background section above, and in
particular, the problems of the prior art, are solved by the
herein-described method and apparatus for representing
distributions of multidimensional data by a set of regions
(referred to as bins). A novel form of binning disclosed herein,
called Multidimensional Minimum Variance Equal Probability
Hierarchical Binning, is used in conjunction with
other data-mining and statistical analysis tools to compare
distributions and, for example, classify sets of samples by
determining whether their distributions vary significantly.
[0023] The bins in this representation partition the space into
discrete regions that may be enumerated. That is, each bin is
assigned a unique number. This enumeration enables the
representation of the multidimensional probability density in the
form of a linear sequence of numbers referred to herein as a
"fingerprint." Given a set of instances, each originally described
by a collection of points in a multidimensional space, each
instance from this set may be processed as described by this
invention to form a fingerprint representing the probability
density information for that instance. These fingerprints may then
be used in a variety of subsequent data analysis applications. A
particular example of such an analysis is to discover patterns
among the collection of instances where a pattern in this context
is defined as a specific combination of fingerprint features.
[0024] In one embodiment, the herein described apparatus and method
feature a program storage device readable by a machine, the program
storage device tangibly embodying a program of instructions
executable by a machine to perform a method for representing
multidimensional data. The method includes defining a subdivision
of the data into a discrete set of bins such that the probability
density function (PDF) for the data is approximated by this set of
bins. Each bin is described by a bounding set of intersecting
hyperplanes in the space defined by the set of parameters of the
multidimensional data. Every bin in this representation describes a
region of equal probability. Thus, the PDF for the multidimensional
data is approximated by the collection of bins and the hyperplane
boundaries defining each bin.
[0025] In another embodiment, the herein described apparatus and
method include a procedure for forming a set of hierarchical bins
such that the PDF for the multidimensional data is represented at
multiple resolutions. A single bin, whose hyperplane boundaries
enclose all of the data, represents the PDF at the coarsest
resolution. The next level of resolution is obtained by subdividing
this single bin thereby obtaining two bins. Successively finer
resolution representations are obtained by iteratively subdividing
each bin of the previous level of resolution. The totality of all
bins at all levels of resolution and their hyperplane boundaries
thus defines a multiresolution representation of the PDF.
[0026] In another embodiment, the herein described apparatus and
method include a procedure for representing the PDF wherein the
collection of bins described above approximate the PDF for the
multidimensional data in such a way that the variance of data
values within each bin is optimally reduced at each subdivision
into finer resolution bins. This is accomplished by rotating the
coordinate axes such that one axis is in the direction of maximum
variance. The subdivision of the bin is made along this direction
at the median value thereby reducing the variance by the largest
amount compatible with the constraint of maintaining an equal
number of samples within each bin.
[0027] In another embodiment, the herein described apparatus and
method include a procedure for utilizing the bin representation of
one data set, found through Multidimensional Minimum Variance Equal
Probability Hierarchical Binning, to bin data from a second data
set. Thus, one data set, found for example by aggregating many
samples, may be utilized to find a template for binning other
samples. This is particularly useful for detecting differences
between individual samples' PDFs.
[0028] In another embodiment, the herein described apparatus and
method include a procedure to enumerate the bins such that the PDF
for multidimensional data is represented as a linear sequence of
features referred to as a fingerprint. The features comprising this
sequence are the continuous-valued numbers for event densities
listed in the order of their corresponding enumerated bin. The
features of these fingerprints may be transformed by mathematical
operations. Examples of such transformations include but are not
limited to taking the logarithm of the numbers, performing linear
transformations of these numbers, or combining these or similar
operations. The event density fingerprints and any
continuous-valued mathematical transformation of them are referred
to as "continuous-valued fingerprints." The features of
continuous-valued fingerprints may further be transformed in a
manner that produces categorical features. Categorical features
have a discrete number of possibilities, such as integers (e.g.
"1", "2", "3"), alphabetic symbols (e.g. "a", "b", "c"), or textual
labels (e.g. "high", "medium", "low"). These fingerprints are
referred to as "categorical fingerprints." The features of
categorical fingerprints may be further processed to represent each
feature by a string of binary features (1's and 0's). These binary
representations are referred to as "binary fingerprints."
DETAILED DESCRIPTION OF THE INVENTION
[0029] For data comprised of multiple measurements for multiple
parameters, the distribution of events in said data can be
described by a density function in a multidimensional space. A
method is disclosed for representing this multidimensional data as
a set of regions referred to as bins; each bin enclosing a discrete
region of the data space having equal numbers of events. Further, a
method is disclosed for representing said data in a hierarchical
fashion creating a multiple resolution representation in which each
bin at a given resolution has two sub-bins encompassing the same
region such that the sub-bins represent the data at the next higher
level in the resolution hierarchy. Further, a method is disclosed
for forming said bins such that at each subdivision of a bin at one
resolution into two bins of higher resolution, the subdivision is
made by a boundary that maximally reduces the variance of the data
within the bin. A method is also disclosed for representing the
information describing bins found by the above methods for one data
set and using this information to efficiently determine bin
membership of events derived from another source of data. A method
is also disclosed for forming a one-dimensional fingerprint
representation of the multiresolution, multidimensional data.
Additionally, a method is disclosed for forming differential
fingerprints that efficiently represent differences between data
sets from two or more classes of data.
[0030] A computer readable medium having instructions for
controlling a computer system to perform the method and a computer
readable medium containing a data structure used in the practice of
the method are also disclosed.
[0031] In an embodiment of the invention, the first step in
representing the distribution of multidimensional data is to
specify the number of hierarchical levels (L) for the
representation. Successive hierarchical levels represent the space
at successively finer resolutions. The total number of bins
(N.sub.T) into which the space is to be divided is related to the
number of hierarchies by: N.sub.T=2.sup.L-1 (that is, 2 raised to the
power L, minus 1). For example, if the
number of desired hierarchies is 9, the total number of bins will
be 511. The number of bins at each resolution level, k, is:
n.sub.r=2.sup.k where k=0, 1, . . . , L-1. Thus, the first
resolution level, k=0, consists of one bin which encompasses the
entire range of parameters defining the space in which the data
exist. The second resolution level, k=1, consists of two bins
dividing the space into two regions. The third level, k=2, consists
of four bins, and so on. The number of bins at each resolution
level for nine hierarchies is summarized in the following
table.
TABLE-US-00001
Hierarchy   Resolution Level   Number of bins
1           0                  1
2           1                  2
3           2                  4
4           3                  8
5           4                  16
6           5                  32
7           6                  64
8           7                  128
9           8                  256
[0032] Typically, one would determine the number of finest
resolution bins first, requiring some minimum number of events to
be in each bin. By way of a non-limiting example, if the total
number of events is 10,000 and approximately 40 events are required
to be in each bin at the finest resolution, the resulting number of
high resolution bins would be 250. The closest power of two is 256
(2.sup.8=256), so the finest resolution level is k=8, and thus L=9
would be specified as the number of resolution levels, resulting in
a total of 511 bins.
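By way of a non-limiting illustration, the arithmetic of paragraphs [0031] and [0032] may be sketched as follows. The function name choose_levels is hypothetical, and taking the "closest" power of two by rounding the base-2 logarithm is an assumption; the text above does not state how ties or borderline cases are resolved.

```python
import math

def choose_levels(total_events, min_events_per_bin):
    """Return (L, finest_bins, total_bins) for a desired minimum bin occupancy.

    Assumption: the closest power of two is found by rounding log2 of the
    desired number of finest-resolution bins.
    """
    target = total_events / min_events_per_bin   # e.g. 10000 / 40 = 250
    k = max(0, round(math.log2(target)))         # finest resolution level
    levels = k + 1                               # L, the number of hierarchies
    return levels, 2 ** k, 2 ** levels - 1       # N_T = 2^L - 1

def bins_per_level(levels):
    """Number of bins at each resolution level k = 0, 1, ..., L-1."""
    return [2 ** k for k in range(levels)]
```

For 10,000 events and roughly 40 events per finest bin this reproduces L=9, 256 finest-resolution bins and 511 bins in total; note that 10,000/256 is slightly under 40, consistent with the word "approximately" above.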
[0033] The next step in this procedure is to determine bin
boundaries that subdivide the multidimensional space into regions
of equal probability. This is done in a recursive fashion such that
a hierarchical set of bin boundaries is found that first subdivides
the space into two regions, next into four regions, and so on until
the desired resolution is obtained. Additionally, the subdivision
of the space is done in such a way that at each division of a
parent bin into two child bins, the parent bin is divided by a
hyperplane perpendicular to the direction of maximum variance of
data within the bin. Thus, the variance of data within a bin is
maximally reduced at each subdivision. A method known as Principal
Component Analysis (PCA) may be utilized to find the direction of
maximum variance (O'Connel 1974). Other methods will be understood
by the skilled artisan armed with the present disclosure.
Method for Finding Bin Boundaries
[0034] In an embodiment of the invention, the method for finding
the bin boundaries for a given data set is described as
follows.
Description of Data:
[0035] A data set D, consisting of m events X.sup.j, each
consisting of p values, is described by the set of points:
X.sup.j=(x.sup.j.sub.1,x.sup.j.sub.2, . . . , x.sup.j.sub.p), where
j=1, 2, . . . , m
[0036] and each x.sup.j.sub.i is a number.
[0037] These data may be represented as points in a p-dimensional
space. For example, points in a 2-dimensional space consist of
pairs of numbers, points in a 3-dimensional space are triplets of
numbers, and so forth. A graphical example of a 2-dimensional space
from flow cytometry data is shown in FIG. 3.
Method:
[0038] In one aspect, the bin determination procedure is described
in steps (1) through (3) below. Step (1) initializes the binning
procedure, setting values for the lowest resolution bin which
encompasses the entire set of data points. Step (2) describes a
loop which successively subdivides the data space into finer
resolution bins. This step has subparts that loop over the bin
resolution levels. Step (3) terminates the binning procedure.
[0039] The steps are as follows:
[0040] (1) Initialization
a. For each dimension, i where i=1, 2, . . . , p, in the
p-dimensional space, find the minimum (xmin.sub.i) and maximum
(xmax.sub.i) data values:
[(xmin.sub.1,xmax.sub.1);(xmin.sub.2,xmax.sub.2); . . .
;(xmin.sub.p,xmax.sub.p)].
b. The set of 2p hyperplanes defined by:
x.sub.1=xmin.sub.1, x.sub.1=xmax.sub.1, x.sub.2=xmin.sub.2,
x.sub.2=xmax.sub.2, . . . , x.sub.p=xmin.sub.p, x.sub.p=xmax.sub.p
[0041] form a boundary
enclosing the entire data space and define the zero'th resolution
level bin. These boundaries are stored for future use.
c. Set two
bin counters, n.sub.beg and n.sub.end, which define the beginning
and ending bin numbers for the current resolution level. For the
zero'th resolution level, set k=0, n.sub.beg=1, and n.sub.end=1.
d. Set a bin counter, b, to b=1. (This counter will be incremented as
additional bins are formed at higher resolutions.)
e. Store the
data contained within the boundaries of the current bin in an
array, D.sub.1=D. The number of data points in D.sub.1 is
m.sub.1=m.
[0042] (2) Begin a loop over bins using b as a bin number
counter.
Continue this loop until the value of b exceeds N.sub.T. When b
exceeds N.sub.T, continue at step (3) below.
[0043] a. Increment the resolution level, k=k+1, and
set n=n.sub.beg.
b. Begin a loop over bins, n=n.sub.beg to n.sub.end.
[0044] i. Find the direction of
maximum variance of the data contained within bin n by PCA. This is
done in two steps. First, find the covariance matrix for the data
contained within bin n. Next, perform a Singular Value
Decomposition (SVD) on the covariance matrix. (For a description of
SVD see, for example, Golub 1996.) As is known by those skilled in
the art, this procedure finds the rotation matrix that can be used
to rotate the coordinates of the data space such that the first
dimension of the rotated space is along the direction of maximum
variance. The rotation matrix found in this step is denoted,
R.sub.n.
[0045] ii. Rotate the m.sub.n data points, D.sub.n, contained
within bin n by the rotation matrix found in the preceding step.
Since D.sub.n can be represented in the form of a matrix, this is
accomplished through matrix multiplication. The rotated data is
referred to as D.sub.n' and has points described by:
x.sup.'=(x.sup.'.sub.1, x.sup.'.sub.2, . . . , x.sup.'.sub.p).
Because of the rotation performed in this step, the values of the
first component, x.sup.'.sub.1, are measured relative to the
direction of maximum variance.
[0046] iii. Find the median value for x.sup.'.sub.1 from the
m.sub.n
data points contained in D.sub.n'. The median value may be found by
ranking the values of x.sup.'.sub.1 for all data points and storing
them in a list. The middle value in this list is the median and is
referred to as the "split" value x.sub.split. Set
t.sub.n=x.sub.split.
[0047] iv. For the current bin, n, save the values of
the data array, D.sub.n', the split value, t.sub.n, and the
rotation matrix, R.sub.n, which will be used in the next iteration
of the loop. Also, record the values of t.sub.n and R.sub.n to an
output storage device. (These values will be used in the procedure
to find the bins into which data points from new data sets are
distributed.)
[0048] v. Divide the data points in bin n according
to whether their values for x.sup.'.sub.1 are less than or greater
than t.sub.n. The split data is stored in two arrays: D.sub.low
contains points, j, such that the values x.sup.'j.sub.1 are less
than or equal to t.sub.n. D.sub.high contains points, j, such that
the values x.sup.'j.sub.1 are greater than t.sub.n. Since the data
is split at the median value, half of the data is stored in
D.sub.low while the other half of the data is stored in D.sub.high.
(If there are an odd number of points in the un-split data set,
D.sub.low will contain one more point than the number in
D.sub.high). Note that the data stored in these two arrays remain
in the rotated coordinate system.
[0049] vi. Increment the bin counter, b, by one: b=b+1,
and store the data points, D.sub.low, in bin b:
D.sub.b=D.sub.low.
(This bin is the next higher resolution level containing data
points whose first coordinate in the rotated system fell below the
median.)
[0050] vii. Increment the bin counter, b, by one: b=b+1,
and store the data points, D.sub.high, in bin b:
D.sub.b=D.sub.high.
[0051] (This bin is the next higher resolution level containing
data points whose first coordinate in the rotated system fell above
the median.)
[0052] viii. Increment counter n: n=n+1.
If n is less than or equal to n.sub.end, continue the loop over n
at step (2)b. Otherwise, proceed with step (2)c.
c. Replace the
current values of n.sub.beg and n.sub.end as follows:
n.sub.beg=n.sub.end+1, and n.sub.end=b.
(These new values of n.sub.beg and n.sub.end will form the range of
bins for the next resolution level.)
d. Continue the loop which
began at step (2) above.
[0053] (3) Terminate the binning procedure.
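By way of a non-limiting illustration, steps (1) through (3) may be sketched in Python as follows. This is a sketch under stated assumptions and not the disclosed implementation: the names dot, principal_direction and build_bins are hypothetical; power iteration on the covariance matrix stands in for the SVD step; and each split direction is kept in the original coordinates rather than rotating the stored data, which yields the same split hyperplane. The loop stops once b reaches N.sub.T, in agreement with the 15-bin worked example.

```python
import math

def dot(u, x):
    return sum(a * b for a, b in zip(u, x))

def principal_direction(points, iters=200):
    """Unit vector along the direction of maximum variance.

    Assumption: power iteration on the covariance matrix replaces the SVD
    step. It can fail if the start vector is orthogonal to the answer.
    """
    m, p = len(points), len(points[0])
    mean = [sum(x[i] for x in points) / m for i in range(p)]
    cov = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in points) / m
            for j in range(p)] for i in range(p)]
    v = [1.0 / (i + 1) for i in range(p)]
    scale = math.sqrt(dot(v, v))
    v = [c / scale for c in v]
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:          # degenerate bin: no variance at all
            break
        v = [c / norm for c in w]
    return v

def build_bins(data, levels):
    """Return (splits, bins); splits[n] = (direction, t_n), bins[b] = points.

    Assumes every bin that is split holds at least one point.
    """
    total = 2 ** levels - 1                      # N_T
    bins = {1: list(data)}                       # step (1)e
    splits = {}
    n_beg, n_end, b = 1, 1, 1                    # steps (1)c and (1)d
    while b < total:                             # step (2)
        for n in range(n_beg, n_end + 1):        # step (2)b
            pts = bins[n]
            u = principal_direction(pts)         # step (2)b.i
            ranked = sorted(dot(u, x) for x in pts)      # steps (2)b.ii-iii
            t = ranked[(len(ranked) - 1) // 2]   # middle value of the ranking
            splits[n] = (u, t)                   # step (2)b.iv
            low = [x for x in pts if dot(u, x) <= t]     # step (2)b.v
            high = [x for x in pts if dot(u, x) > t]
            b += 1
            bins[b] = low                        # step (2)b.vi
            b += 1
            bins[b] = high                       # step (2)b.vii
        n_beg, n_end = n_end + 1, b              # step (2)c
    return splits, bins                          # step (3)
```

With an odd number of points in a bin, the middle-value rule places one more point in the low child than in the high child, as described in step (2)b.v. Tied projection values can unbalance a split; the sketch does not attempt to correct for this.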
[0054] By way of a non-limiting example of the binning procedure
described above, consider forming 15 bins. Thus, N.sub.T=15 and the
number of hierarchical resolution levels is 4, labeled by k=0, 1,
2, 3. Step (1) forms resolution level 0 which consists of the
entire space within which the data points are contained. Step (2)
begins with bin b=1 (the level 0 resolution bin). The loop
beginning at step (2)b is executed 3 times with:
[0055] n=1 to 1 (loop over resolution level 0)
[0056] n=2 to 3 (loop over resolution level 1)
[0057] n=4 to 7 (loop over resolution level 2).
[0058] The first loop finds the direction of maximum variance for
the entire set of data points and splits the data into two equal
portions. This procedure forms bins 2 and 3 which each contain
one-half of the data points. The next pass through the loop at step
(2)b uses the values in bins 2 and 3. First, it finds the direction
of maximum variance for the data in bin 2 and splits these data
into two equal portions that form bins 4 and 5. Next, it finds the
direction of maximum variance for the data in bin 3 and splits
these data into two equal portions that form bins 6 and 7. The
final pass through loop (2)b uses the values in bins 4 through 7.
It first finds the direction of maximum variance for the data in
bin 4 and splits these data into two equal portions that form bins
8 and 9. Next, it processes the data in bin 5, again finding the
direction of maximum variance for these data and splits these data
into two equal portions that form bins 10 and 11. Continuing with
bin 6, the direction of maximum variance is found and the two
higher resolution bins 12 and 13 are determined. Finally, the data
in bin 7 are utilized following the same procedure to split these
data into two equal portions along the direction of maximum
variance forming bins 14 and 15. At each step, the information for
the rotation matrices and split values are recorded. As will be
demonstrated in a following section, these recorded parameters may
be used to process a new dataset, partitioning points from these
new data into the regions found in the binning procedure.
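The bin numbering in the example above can be checked by replaying only the counters n.sub.beg, n.sub.end and b. The following sketch (the function name enumerate_children is hypothetical) records which two bins each parent bin produces. The observation that bin n always yields bins 2n and 2n+1 is an inference from the counters and is not stated in the text above.

```python
def enumerate_children(levels):
    """Replay the counters of steps (1)-(3); return {parent: (low, high)}."""
    total = 2 ** levels - 1            # N_T
    n_beg, n_end, b = 1, 1, 1
    children = {}
    while b < total:
        for n in range(n_beg, n_end + 1):
            children[n] = (b + 1, b + 2)   # low child, then high child
            b += 2
        n_beg, n_end = n_end + 1, b
    return children
```

For four resolution levels this gives bins 2 and 3 from bin 1, bins 4 through 7 from bins 2 and 3, and bins 8 through 15 from bins 4 through 7, matching the walk-through.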
[0059] It is noteworthy that while the bin boundaries consist of
the intersection of hyperplanes in a p-dimensional space, these
boundaries do not need to be explicitly stored. All of the
information necessary to bin new data is contained in the rotation
matrices and split values. This will be demonstrated in the
procedure described below for binning a new data set. Thus, the
representation of a multidimensional data space by this binning
procedure is embodied in the rotation matrices and split values.
The hyperplane bin boundaries may be extracted from the rotation
matrices and split values. Starting from the set of hyperplanes
bounding all of the data stored in step (1)b, the rotation matrix
describing the first subdivision of the space can be used to find
the direction in which the data was split. The bin boundaries for
the two bins into which the data was split may be found by
intersecting the hyperplane perpendicular to this direction with
the hyperplanes bounding the entire data space. Bin boundaries for
successively finer resolution bins may be found by multiplying the
successive rotation matrices, finding the direction in which a bin
was split, and intersecting the hyperplane perpendicular to this
direction with the boundaries of the bin.
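The extraction of a split hyperplane described in this paragraph may be written compactly as follows. The notation is an assumption made for illustration: a rotation is taken to act on a column vector as x' = R x, and a.sub.0=1, a.sub.1, . . . , a.sub.k=n denotes the chain of bins leading from the zero'th level bin to bin n. Because the data stored in each child bin remain in the rotated coordinate system, the rotations accumulate along this chain.

```latex
% Assumed convention: x' = R x, with a_0 = 1, a_1, ..., a_k = n the chain
% of bins from the zero'th resolution level down to bin n.
Q_n = R_{a_k} R_{a_{k-1}} \cdots R_{a_0}, \qquad
H_n = \{\, x : e_1^{\top} Q_n x = t_n \,\}, \qquad
e_1 = (1, 0, \ldots, 0)^{\top}
% For points already inside bin n:
\text{low child: } e_1^{\top} Q_n x \le t_n, \qquad
\text{high child: } e_1^{\top} Q_n x > t_n
```

Under this convention the first row of Q.sub.n is the direction, in the original coordinates, along which bin n was split, and H.sub.n is the hyperplane perpendicular to it.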
[0060] A non-limiting example of this binning procedure is shown in
FIG. 3. The dimensionality of the space for this example is two so
that the results can be graphically displayed.
Method for Binning a New Data Set
In one embodiment of the
invention, events from a new data set, D.sub.new, can be assigned to
bins determined from another data set D.sub.old, found by the
method described in the previous section. The method for binning
new data is identical to that described above except that the
rotation matrices and split values from D.sub.old are used rather
than being recalculated from the new data set. Step (1) above is
replaced with a step which reads in the boundaries of the original
data space, rotation matrices, and split values. Step (2) is
identical except that the steps (2)b.i, (2)b.iii, and (2)b.iv are
skipped and the rotation matrices and split values utilized in the
remaining steps are those that were read in new step (1). The
procedure is as follows:
[0061] (1) Initialization
a. Read in the stored values for the template data set that has
been binned by the procedure described above. These values are:
[0062] i. The boundaries of the data space:
x.sub.1=xmin.sub.1, x.sub.1=xmax.sub.1, x.sub.2=xmin.sub.2,
x.sub.2=xmax.sub.2, . . . , x.sub.p=xmin.sub.p, x.sub.p=xmax.sub.p
[0063] ii. The rotation matrices: R.sub.n for n=1, 2, . . . ,
N.sub.T
[0064] iii. The split values: t.sub.n for n=1, 2, . . . ,
N.sub.T
b. Read in data set, D.sub.new. Denote the number of events in this
data set as m.
c. Set the boundaries for the zero'th resolution
level to the values read in at step (1)a.i. (Note: It is assumed
that the coordinates of the new data set span the same range as the
template data set.)
d. Set two bin counters, n.sub.beg and
n.sub.end, which define the beginning and ending bin numbers for
the current resolution level. For the zero'th resolution level, set
k=0, n.sub.beg=1, and n.sub.end=1.
e. Set a bin counter, b, to b=1.
f. Store the data contained within the boundaries of the current
bin in an array, D.sub.1=D.sub.new. The number of data points in
D.sub.1 is m.sub.1=m.
[0065] (2) Begin a loop over bins using b as a bin number
counter.
Continue this loop until the value of b exceeds N.sub.T. When b
exceeds N.sub.T, continue at step (3) below.
a. Increment the
resolution level, k=k+1, and
set n=n.sub.beg.
b. Begin a loop over bins, n=n.sub.beg to n.sub.end.
[0066] i. Rotate the m.sub.n data points, D.sub.n, contained
within bin n by the rotation matrix R.sub.n. The rotated data is
referred to as D.sub.n' and has points described by
x.sup.'=(x.sup.'.sub.1, x.sup.'.sub.2, . . . , x.sup.'.sub.p).
[0067] ii. Divide the data points in bin n according to
whether their values for x.sup.'.sub.1 are less than or greater
than t.sub.n. The split data is stored in two arrays: D.sub.low
contains points, j, such that the values x.sup.'j.sub.1 are less
than or equal to t.sub.n.
[0068] D.sub.high contains points, j, such that the values
x.sup.'j.sub.1 are greater than t.sub.n.
[0069] iii. Increment the bin counter, b, by one: b=b+1.
Store data points, D.sub.low, in bin b: D.sub.b=D.sub.low.
[0070] iv. Increment the bin counter, b, by one: b=b+1.
Store data points, D.sub.high, in bin b: D.sub.b=D.sub.high.
[0071] v. Increment counter n: n=n+1.
If n is less than or equal to n.sub.end, continue the loop over n
at step (2)b. Otherwise, proceed with step (2)c.
c. Replace the
current values of n.sub.beg and n.sub.end as follows:
n.sub.beg=n.sub.end+1, and n.sub.end=b.
d. Continue the loop which
began at step (2) above.
[0072] (3) Terminate the binning procedure.
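By way of a non-limiting illustration, the assignment of new events to template bins may be sketched as follows. Several points are assumptions made for brevity: the function name bin_new_data is hypothetical; each stored split is represented as a (direction, split value) pair in the original coordinates rather than as a rotation matrix applied to already rotated data; each event is routed individually instead of looping level by level, which gives the same bin membership; and the children of bin n are taken to be bins 2n and 2n+1, which follows from the counters but is not stated explicitly above.

```python
def bin_new_data(points, splits, levels):
    """Return {bin number: event count} for new data under a stored template.

    splits[n] = (direction, t_n) for every bin n that was split.
    """
    total = 2 ** levels - 1                       # N_T
    counts = {b: 0 for b in range(1, total + 1)}
    for x in points:
        n = 1
        counts[1] += 1                            # every event lies in bin 1
        while n in splits:                        # descend until a finest bin
            u, t = splits[n]
            proj = sum(ui * xi for ui, xi in zip(u, x))
            n = 2 * n if proj <= t else 2 * n + 1 # low child or high child
            counts[n] += 1
    return counts
```

As a usage example with made-up template values, splits = {1: ((1.0, 0.0), 5.0), 2: ((0.0, 1.0), 2.0), 3: ((0.0, 1.0), 3.0)} with three levels sends the event (1.0, 3.0) to bin 2 and then to bin 5. Unlike the template data set, a new data set need not fall evenly into the bins, and this imbalance is what the fingerprint records.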
Method for Fingerprint Generation
[0073] In an embodiment of the invention, the binning procedure
described above represents a partitioning of the multidimensional
data space into an enumerated set of regions. The number of events
contained within each of these regions is nearly identical (for the
data set from which bins are determined). The bin boundaries at a
particular hierarchical level represent an estimate for the
probability density function for the data set at the corresponding
level of resolution. In particular, the bins represent regions that
have nearly equal probabilities since the event counts are nearly
identical in each bin. In order to obtain a fingerprint for a new
sample instance relative to an estimation of the probability
density function from another instance (referred to here as the
"template" instance), one can bin the data from the new sample as
described above. A density of events for each bin can be obtained
by dividing event counts by the total number of events in the
sample. Since the bins are enumerated, a simple one-dimensional
representation of the density variations, relative to the template
instance, may be obtained by recording the densities for the
successive bins in the form of a list. FIG. 4 is a schematic
representation of this list showing the subdivisions corresponding
to the different resolution hierarchies. This representation is
referred to as a "fingerprint" since it distinguishes differing
instances. Given a set of instances, a fingerprint for each
instance may be obtained by this procedure.
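A minimal sketch of this conversion from event counts to a fingerprint follows. The function name density_fingerprint and the dictionary layout (bin number mapped to event count) are assumptions made for illustration.

```python
def density_fingerprint(counts):
    """List of event densities ordered by enumerated bin number.

    counts maps bin number -> event count; bin 1 encloses the whole data
    space, so its count is the total number of events in the sample.
    """
    total = counts[1]
    if total == 0:
        raise ValueError("sample contains no events")
    return [counts[b] / total for b in sorted(counts)]
```

The first element is always 1, the next two elements describe resolution level 1, the next four describe level 2, and so on, which is the layout shown schematically in FIG. 4.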
Fingerprints for a Set of Instances Relative to the Probability
Density for a Template Instance
[0074] It is often the case that one would like to describe the
differences between each individual instance in a set and a
"template" instance which represents the entire set of instances.
In one embodiment, a fingerprint representing these differences may
be found as follows.
[0075] (1) For a set of M instances, S.sub.1, S.sub.2, . . . ,
S.sub.M, aggregate the events from all of the instances to form a
single composite instance denoted as S.
[0076] (2) Find a set of bins for the data in S.
[0077] (3) Bin the data for each instance, S.sub.i, i=1, 2, . . .
M, using the bins found in step 2.
[0078] (4) Convert the event counts in each bin into an event
density by dividing each count by the total number of events in the
data set.
[0079] The lists of binned event densities for the set of
instances, S.sub.1, S.sub.2, . . . , S.sub.M, form a set of
fingerprints for these data relative to the probability density
estimated from the composite data set.
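Steps (1) through (4) may be sketched as follows. The binning itself is passed in as two caller-supplied functions, find_bins and bin_events, which are hypothetical placeholders for the bin-finding and new-data binning procedures described above. The sketch assumes that "the data set" in step (4) means the individual instance being binned.

```python
def fingerprints_vs_composite(instances, find_bins, bin_events):
    """Return one density fingerprint per instance, relative to the composite.

    find_bins(events) -> template; bin_events(events, template) -> list of
    event counts in enumerated bin order. Both are supplied by the caller.
    """
    composite = [e for inst in instances for e in inst]    # step (1)
    template = find_bins(composite)                        # step (2)
    fingerprints = []
    for inst in instances:
        counts = bin_events(inst, template)                # step (3)
        fingerprints.append([c / len(inst) for c in counts])   # step (4)
    return fingerprints
```

For a quick check one can substitute a one-dimensional stand-in, for example a find_bins that returns the median of the composite and a bin_events that returns the total count followed by the counts at or below and above that median.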
Fingerprints for Classification
[0080] In another embodiment of the invention, another variation of
fingerprinting is particularly useful for classifying instances.
The goal is to emphasize differences between samples belonging to
different classes. For classification problems, one typically has a
set of training instances for which the class identity is known and
a set of test or validation instances for which the class identity
is unknown. The training data can be used to construct "template"
instances to estimate the probability densities for each class.
Using individual instances from a training set, one can obtain
statistical measures for the average and degree of variation for
each bin. These statistics can be used to convert from event
densities to a z-score defined as: z=(r-r.sub.AVG)/r.sub.STD
where r is the event density for a bin, r.sub.AVG is the average
event density for a bin, and r.sub.STD is the standard deviation of
event densities for a bin. Here, averages and standard deviations
are found using all of the training instances from one class. The
z-score may be thought of as normalizing event densities by
measuring the number of standard deviations from the mean for a
given event density. Z-scores can also be calculated for instances
that are not part of the training data. These are a normalized
measure of event density variations relative to the estimate of the
PDF for a given class. A property of z-scores is that within a
class, one expects the statistical distribution of z-scores to have
an average value of zero and a standard deviation of one (referred
to here as a zero-mean, unit-variance distribution). The degree to
which the distribution of z-scores for instances outside the class
varies from this zero-mean, unit-variance distribution is a measure
of the dissimilarity of instances outside of the class to those in
the class. The normalization properties of the z-score make it
desirable to convert from event densities to z-score in
constructing fingerprints for classification since it places all
measurements on the same scale.
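The z-score conversion may be sketched as follows. The function names are hypothetical. Two choices are assumptions not specified above: the population standard deviation is used rather than the sample standard deviation, and a bin whose standard deviation is zero is given a z-score of zero. The latter case arises at least for the zero'th level bin, whose event density is 1 for every instance.

```python
import statistics

def bin_statistics(training_fingerprints):
    """Per-bin average and standard deviation over one class's instances."""
    columns = list(zip(*training_fingerprints))       # one column per bin
    avg = [statistics.mean(col) for col in columns]
    std = [statistics.pstdev(col) for col in columns]  # population std (assumed)
    return avg, std

def z_scores(fingerprint, avg, std):
    """z = (r - r_AVG) / r_STD for each bin; 0.0 where r_STD is zero (assumed)."""
    return [(r - a) / s if s > 0 else 0.0
            for r, a, s in zip(fingerprint, avg, std)]
```

The same statistics can be applied to an instance that was not part of the training data, which is how the z-scores for instances outside the class are obtained.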
[0081] By way of a non-limiting example, following are the steps
used to construct fingerprints for a two class problem. (This
procedure easily generalizes to multiple classes.)
[0082] (1) For two classes A and B, obtain composite, template
instances for each class. This may be done by aggregating each
class's set of training instances. Denote the numbers of instances
in classes A and B by, N.sub.A and N.sub.B. Denote data for these
two composite instances by S.sub.A and S.sub.B.
[0083] (2) Calculate the multidimensional, minimum variance equal
probability, hierarchical bins as described above for the two data
sets S.sub.A and S.sub.B.
[0084] (3) Bin the data for each individual instance S.sub.Ai of
class A relative to the bins found from S.sub.A.
(i=1, 2, . . . , N.sub.A). Denote the event densities for these
bins as r.sub.AiA.
[0085] (4) Calculate the average and standard deviations for event
densities for each bin using the binned data for each S.sub.Ai
binned relative to S.sub.A. Denote this set of averages and
standard deviations as AVG.sub.AA and STD.sub.AA. (Note that there
will be N.sub.T elements in this set; one for every bin.)
[0086] (5) Bin the data for each individual instance S.sub.Bj, of
class B relative to the bins found from S.sub.A.
(j=1, 2, . . . , N.sub.B). Denote the event densities for these
bins as r.sub.BjA.
[0087] (6) Bin the data for each individual instance S.sub.Ai of
class A relative to the bins found from S.sub.B.
(i=1, 2, . . . , N.sub.A). Denote the event densities for these
bins as r.sub.AiB.
[0088] (7) Bin the data for each individual instance S.sub.Bj of
class B relative to the bins found from S.sub.B.
(j=1, 2, . . . , N.sub.B). Denote the event densities for these
bins as r.sub.BjB.
[0089] (8) Calculate the average and standard deviations for event
densities for each bin using the binned data for each S.sub.Bj
binned relative to S.sub.B. Denote this set of averages and
standard deviations as AVG.sub.BB and STD.sub.BB. (Note that there
will be N.sub.T elements in this set; one for every bin.)
[0090] (9) Bin the data for instances U.sub.k (k=1, 2, . . . ,
N.sub.U) whose class is not known (for example test, validation, or
unknown instances) relative to S.sub.A and relative to S.sub.B.
Denote the event densities for these bins as: r.sub.UkA and
r.sub.UkB respectively.
[0091] (10) Convert event densities to z-scores as follows:
Z.sub.AiA=(r.sub.AiA-AVG.sub.AA)/STD.sub.AA for (i=1, 2, . . .
N.sub.A)
Z.sub.AiB=(r.sub.AiB-AVG.sub.BB)/STD.sub.BB for (i=1, 2, . . .
N.sub.A)
Z.sub.BjA=(r.sub.BjA-AVG.sub.AA)/STD.sub.AA for (j=1, 2, . . .
N.sub.B)
Z.sub.BjB=(r.sub.BjB-AVG.sub.BB)/STD.sub.BB for (j=1, 2, . . .
N.sub.B)
Z.sub.UkA=(r.sub.UkA-AVG.sub.AA)/STD.sub.AA for (k=1, 2, . . .
N.sub.U)
Z.sub.UkB=(r.sub.UkB-AVG.sub.BB)/STD.sub.BB for (k=1, 2, . . .
N.sub.U)
[0092] (11) Construct fingerprints as described above for instances
A relative to A using z-scores Z.sub.AiA.
Denote these fingerprints as f.sub.AiA.
[0093] (12) Construct fingerprints as described above for instances
A relative to B using z-scores Z.sub.AiB.
Denote these fingerprints as f.sub.AiB.
[0094] (13) Construct fingerprints as described above for instances
B relative to A using z-scores Z.sub.BjA.
Denote these fingerprints as f.sub.BjA.
[0095] (14) Construct fingerprints as described above for instances
B relative to B using z-scores Z.sub.BjB.
Denote these fingerprints as f.sub.BjB.
[0096] (15) Construct fingerprints as described above for instances
U relative to A using z-scores Z.sub.UkA.
Denote these fingerprints as f.sub.UkA.
[0097] (16) Construct fingerprints as described above for instances
U relative to B using z-scores Z.sub.UkB.
Denote these fingerprints as f.sub.UkB.
[0098] (17) Construct composite fingerprints for the instances from
training class A by concatenating fingerprints f.sub.AiA and
f.sub.AiB. Denote these fingerprints as g.sub.AiAB.
[0099] (18) Construct composite fingerprints for the instances from
training class B by concatenating fingerprints f.sub.BjA and
f.sub.BjB. Denote these fingerprints as g.sub.BjAB.
[0100] (19) Construct composite fingerprints for the instances from
unknown class U by concatenating fingerprints f.sub.UkA and
f.sub.UkB. Denote these fingerprints as g.sub.UkAB.
[0101] The sets of fingerprints, g.sub.AiAB, g.sub.BjAB, and
g.sub.UkAB capture the probability density variations of each of
the sets of instances, A, B, and U, relative to bins determined
from template instances for classes A and B. Since the template for
class A is the aggregate of the class A instances, one would expect
the portion of the fingerprint corresponding to f.sub.AiA to have
small z-score values (not much variation from the average), and
the portion corresponding to f.sub.AiB to have larger z-score
values, since there is less similarity between individual instances
from class A and the template probability density function for
class B. A corresponding statement can be made regarding
fingerprints for class B relative to classes A and B. Test,
validation, or unknown instances may be classified by measuring
their similarity to the training fingerprints for class A and B. A
schematic diagram of the fingerprint process for classification is
shown in FIG. 5.
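The construction of a composite fingerprint such as g.sub.AiAB can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names, the toy densities, and the per-bin means and standard deviations are all hypothetical, and the per-bin event densities are assumed to have been computed already.

```python
import numpy as np

def zscore_fingerprint(densities, class_mean, class_std):
    """Z-score per-bin densities against one class's per-bin mean and standard deviation."""
    return (np.asarray(densities, dtype=float) - class_mean) / class_std

def composite_fingerprint(f_rel_a, f_rel_b):
    """Concatenate, instance by instance, the fingerprints relative to templates A and B."""
    return np.concatenate([f_rel_a, f_rel_b], axis=-1)

# Hypothetical per-bin event densities for two class-A instances (rows),
# binned with template A's bins and with template B's bins (2 bins each).
d_a_in_a = np.array([[0.50, 0.50], [0.40, 0.60]])
d_a_in_b = np.array([[0.90, 0.10], [0.80, 0.20]])

# Hypothetical per-bin statistics: class A instances in A's bins,
# and class B instances in B's bins.
mean_a, std_a = np.array([0.45, 0.55]), np.array([0.05, 0.05])
mean_b, std_b = np.array([0.50, 0.50]), np.array([0.10, 0.10])

f_aia = zscore_fingerprint(d_a_in_a, mean_a, std_a)  # small z-scores expected
f_aib = zscore_fingerprint(d_a_in_b, mean_b, std_b)  # larger z-scores expected
g_aiab = composite_fingerprint(f_aia, f_aib)
print(g_aiab.shape)  # (2, 4)
```

In this toy data the half of each composite fingerprint taken relative to template A has magnitude 1, while the half taken relative to template B has magnitude 3 or 4, which mirrors the expectation stated above.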
Categorical and Binary Fingerprints
[0102] In an embodiment of the invention, a means of determining
the similarity of instances is to find patterns that are common to
sets of fingerprints (Moser 2005). Useful forms of fingerprints for
pattern discovery are categorical and binary representations. For a
categorical representation, the values comprising the elements of
the fingerprint may be quantized into some number of discrete
categories. For example, this may be accomplished by using a series
of numerical thresholds. For a set of thresholds
t.sub.1<t.sub.2< . . . <t.sub.M, assign categorical variable
x.sub.l to z-score z if t.sub.l<z<t.sub.l+1. Categorical
fingerprints are obtained by substituting the categorical
variable computed from the z-score and thresholds into
the list comprising the fingerprint at the location of its
corresponding z-score.
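Threshold-based quantization of this kind might be sketched in Python as follows. The function name and the threshold values are hypothetical. The sketch also makes choices the text leaves open: z-scores below the first threshold or above the last one receive their own categories, a z-score equal to a threshold falls in the upper category, and labels start at 1, so the interval between t.sub.1 and t.sub.2 is labeled 2 here.

```python
import numpy as np

def categorical_fingerprint(z_scores, thresholds):
    """Map each z-score to a category 1..M+1 given M strictly increasing thresholds."""
    thresholds = np.asarray(thresholds, dtype=float)
    if np.any(np.diff(thresholds) <= 0):
        raise ValueError("thresholds must be strictly increasing")
    # np.digitize returns 0 below the first threshold and M at or above the last;
    # adding 1 gives labels that start at 1.
    return np.digitize(z_scores, thresholds) + 1

z = np.array([-3.0, -1.5, 0.0, 1.5, 2.5])
categories = categorical_fingerprint(z, thresholds=[-2.0, -1.0, 1.0, 2.0])
print(categories.tolist())  # [1, 2, 3, 4, 5]
```

Four thresholds yield five categories, which matches the five-category example used for the binary representation.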
[0103] Once categorical fingerprints have been obtained, they can
be easily transformed to a binary representation by assigning a set
of indicator binary variables to each category. For example, if
there are 5 categorical values, assign strings of binary digits as
follows:
[0104] 00001 represents categorical variable 1
[0105] 00010 represents categorical variable 2
[0106] 00100 represents categorical variable 3
[0107] 01000 represents categorical variable 4
[0108] 10000 represents categorical variable 5.
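The indicator encoding listed above can be written as a short Python sketch. The function name is hypothetical; the bit ordering follows the listing, with category 1 setting the rightmost digit.

```python
def binary_fingerprint(categories, n_categories):
    """Encode each category (1..n_categories) as an indicator bit string."""
    encoded = []
    for c in categories:
        if not 1 <= c <= n_categories:
            raise ValueError(f"category {c} outside 1..{n_categories}")
        digits = ["0"] * n_categories
        # Category 1 sets the rightmost digit, category n_categories the leftmost.
        digits[n_categories - c] = "1"
        encoded.append("".join(digits))
    return encoded

encoded = binary_fingerprint([1, 3, 5], n_categories=5)
print(encoded)  # ['00001', '00100', '10000']
```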
[0109] These fingerprints may then be processed by a binary pattern
discovery algorithm such as that described in (Moser 2005).
Data Analysis Applications
[0110] The present invention is useful for analysis of data in a
multitude of settings and applications. As set forth elsewhere
herein, the analysis of flow cytometric data has great importance
in understanding biological systems and in clinical medicine. In
one embodiment, the invention set forth herein has direct
applicability to flow cytometric data. In an embodiment, the
invention can be used to describe flow cytometric data. In raw
form, these data are described as "list-mode" files giving the
parameter values for each cell in a sample. These data are often
subsequently processed by quadrant analysis, whereby the parameter
space is segmented into two regions, or by gating to give the
fraction of cells within regions of space that have been delineated
by an operator. Because of the limitation of display devices and an
inability to visualize multiple dimensions simultaneously, this is
most often done as a sequential process whereby sets of gates (or
quadrants) are specified in two dimensions at a time. The invention
also provides a method of describing flow cytometric data as a set of
multidimensional regions (covering the entire space at multiple
resolutions) that have been automatically determined through the
presently-disclosed Multidimensional Minimum Variance Equal
Probability Hierarchical Binning procedure. Thus, this invention
has general utility to the field of Flow Cytometry.
[0111] In another embodiment, the invention is used in quality
control processes in the field of Flow Cytometry. An important task
in flow cytometry is ensuring that instruments are working
correctly and results are reproducible. Often, flow cytometric
analysis is carried out on multiple samples from one patient. For
example, several tubes of blood may be drawn and each is stained
with different antibody panels. However, these antibody panels
often overlap. For example, in a five tube analysis, each of the
five tubes may include antibodies for CD45 which is useful in
identifying lymphocytes. Additionally, data is almost always
acquired for forward and side scatter. Thus, repeated measurements
of several parameters from multiple samples for the same patient
are available. In an aspect, the invention can be used to find
fingerprints representing these repeated measurements. The
similarity of these fingerprints across a set of samples from the
same patient can be used to measure the reproducibility, and thus
quality, of the cytometric data.
[0112] Flow cytometry has a broad range of uses in medicine
including clinical measurements for disease diagnosis, prognosis,
classification, and progression. The present invention has direct
applicability to the use of flow cytometric data for these
applications. Currently, flow cytometry is most useful in clinical
medicine when optimized antibody panels are available. In this
case, cell populations can be distinguished by quadrant analysis or
sequential gating as shown, for example, in FIG. 2 for T-lymphocyte
measurements related to HIV infection. However, these methods of
analysis do not work well when cell antigens and specific
antibodies are not well characterized, cell surface markers change
with time, or distributions of intensity levels from cytometric
measurements are complex and overlapping. The present invention
provides a means of representing and utilizing flow cytometric data
for clinical medicine in these situations. Utilizing data from
known populations (e.g. diseased versus non-diseased individuals),
fingerprints can be developed, using the methods described in this
invention, that can be used to classify patients. Thus, the present
invention has both general and broad application to problems of
clinical medicine including diagnostics, prognostics, disease
progression and disease classification.
[0113] It will also be understood by the skilled artisan, when
armed with the present disclosure, that the present invention also
has broader applicability. Multiple and varied medical-related
applications have been set forth herein. However, the methods and
apparatuses of the invention can also be applied to any type of
data that involves measurements that can be represented in a
multidimensional space. By way of a non-limiting example, the
invention can be used for data analysis in astronomy, in which the
distribution of stars in a 3-dimensional space can be represented
using the invention. Other non-limiting examples of applications of
the present invention include classification, processing and
analysis of banking data (e.g., characterization of credit risk in
terms of multiple dimensions, such as demographics, financial
resources, etc., as well as classification of potential credit card
customers). Therefore, the skilled artisan will understand, based
on the disclosure set forth herein, that the methods and
apparatuses of the invention can be used in any situation where
data are described by multiple parameters that can be numerically
quantified.
Additional Embodiments of the Invention
[0114] In another embodiment, the invention includes a method of
representing data at multiple resolutions, the data being described
by a multidimensional space containing multiple events consisting
of measurements of multiple parameters; the method comprising:
[0115] a) describing said data as a distribution of events in a
multidimensional space where each coordinate axis of the space has
a unique correspondence to one of the measured parameters;
[0116] b) determining the boundaries of the multidimensional space
as the minimum and maximum possible values for the parameter
corresponding to each axis;
[0117] c) specifying the number of regions, referred to as bins,
into which the data space is to be divided for the highest
resolution representation of the data space;
[0118] d) determining the number which is an exact power of two
closest to the number of high resolution bins specified in the
previous step;
[0119] e) determining the total number of resolution levels as the
number determined in the previous step plus one;
[0120] f) enumerating the resolution levels as a sequence of
integers starting at zero and ending at the total number of
resolution levels minus one;
[0121] g) determining the number of bins at each resolution level
as two raised to the power of the integer specified in the
enumeration for the corresponding resolution level;
[0122] h) determining the total number of bins as one less than two
raised to the power of the total number of resolution levels;
and
[0123] i) enumerating the totality of all bins starting at the
lowest resolution level, proceeding to the next higher resolution
level, and continuing to the highest resolution level; this
specification of the order of bins forming an enumerated,
hierarchical, multiresolution representation of the data.
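The bookkeeping in steps (c) through (i) can be sketched in Python. The function name is hypothetical. Two readings are assumed here and are not confirmed by the text: "closest" is taken in the linear sense with ties going to the smaller power of two, and the number of resolution levels is taken as the exponent of that power of two plus one, so that the finest level holds exactly that many bins.

```python
def resolution_hierarchy(requested_bins):
    """Return (n_levels, bins_per_level, total_bins, enumeration) for a requested bin count."""
    if requested_bins < 1:
        raise ValueError("requested_bins must be at least 1")
    lower = 1 << (requested_bins.bit_length() - 1)  # largest power of two <= request
    upper = lower * 2
    # Assumption: linear distance, ties resolved toward the smaller power of two.
    finest = lower if requested_bins - lower <= upper - requested_bins else upper
    exponent = finest.bit_length() - 1
    n_levels = exponent + 1                      # levels are numbered 0..exponent
    bins_per_level = [2 ** level for level in range(n_levels)]
    total_bins = 2 ** n_levels - 1
    # Enumerate bins from the lowest resolution level to the highest.
    enumeration = [(level, j) for level in range(n_levels) for j in range(2 ** level)]
    return n_levels, bins_per_level, total_bins, enumeration

n_levels, bins_per_level, total_bins, enumeration = resolution_hierarchy(100)
print(n_levels, bins_per_level[-1], total_bins)  # 8 128 255
```

A request for 100 bins is rounded to 128 finest-level bins, giving levels 0 through 7 and 255 bins in total.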
[0124] In another embodiment, a method further comprises:
[0125] a) recording the values defining the boundaries of the data
space on a storage device; and
[0126] b) recording the value for the total number of bins into
which the data space is to be divided on a storage device.
[0127] In another embodiment, a method further comprises:
[0128] a) forming a bin of lowest resolution encompassing the
complete data space and comprising all of the data within the data
set; and
[0129] b) beginning with the lowest resolution, iterating over each
level of resolution, subdividing each bin at a given resolution to
form two bins at a higher resolution, continuing this subdivision
until the desired number of bins is obtained.
[0130] In another embodiment, a method further comprises:
[0131] a) in the process of subdividing the data from each bin into
finer-resolution bins, determining the direction of maximum
variance of the data contained within the given bin;
[0132] b) rotating the coordinates of the data space in the
direction of maximum variance in such a way that the first axis of
the rotated coordinate system is parallel to the direction of
maximum variance;
[0133] c) determining the median value of the first coordinate in
the rotated coordinate system for the collection of data comprising
the bin;
[0134] d) splitting the data comprising the current bin into two
bins at the next hierarchical resolution level, the first portion
being comprised of events whose first coordinate value is less than
or equal to the median, the second portion being comprised of
events whose first coordinate value is greater than the median;
and
[0135] e) recording the rotation matrix and median value (split
value) associated with the current bin to a storage device.
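One split of a single bin, following steps (a) through (d), might look like the Python sketch below. It is an illustration under stated assumptions rather than the disclosed implementation: the function name is hypothetical, the direction of maximum variance is obtained from a singular value decomposition of the mean-centered events, the bin is assumed to hold at least as many events as there are dimensions, and the returned matrix is orthogonal but is not forced to have determinant +1.

```python
import numpy as np

def split_bin(events):
    """Split one bin at the median along its direction of maximum variance.

    events: array of shape (n_events, n_dims), with n_events >= n_dims.
    Returns the orthogonal matrix, the split value, and a boolean mask that
    is True for events assigned to the first (lower) child bin.
    """
    events = np.asarray(events, dtype=float)
    centered = events - events.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    rotation = vt.T                      # first column: direction of maximum variance
    rotated = events @ rotation          # first coordinate lies along that direction
    split_value = np.median(rotated[:, 0])
    low_mask = rotated[:, 0] <= split_value
    return rotation, split_value, low_mask

# Toy events scattered along the diagonal of a 2-dimensional space.
events = np.array([[0.0, 0.0], [1.0, 1.1], [2.0, 1.9],
                   [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
rotation, split_value, low_mask = split_bin(events)
print(int(low_mask.sum()), int((~low_mask).sum()))  # 3 3
```

Splitting at the median gives the two children equal numbers of events, up to ties and odd counts; the matrix and split value are what step (e) records.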
[0136] The invention also includes a method of partitioning
multidimensional data from one data set into regions defined by the
application of the binning procedure, as described elsewhere
herein, to a different data set; the method comprising:
[0137] a) reading the data space boundaries, set of rotation
matrices, and set of split values for each bin to be formed in the
binning process from a storage device;
[0138] b) forming a bin of lowest resolution encompassing the
complete data space and comprising all of the data within the data
set; and
[0139] c) beginning with the lowest resolution, iterating over each
level of resolution, subdividing each bin at a given resolution to
form two bins at a higher resolution, continuing this subdivision
until the desired number of bins is obtained.
[0140] In another embodiment, a method further comprises:
[0141] a) in the process of subdividing the data within each bin
into finer-resolution bins, rotating the data space by applying the
rotation matrix corresponding to the current bin; and
[0142] b) utilizing the split value for the current bin, splitting
the data comprising the current bin into two bins at the next
hierarchical resolution level, the first portion being comprised of
events whose first coordinate value is less than or equal to the
median, the second portion being comprised of events whose first
coordinate value is greater than the median.
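Applying previously stored rotation matrices and split values to a different data set can be sketched in Python as follows. Several conventions here are assumptions made for the sketch: bins are indexed heap-style (the lowest resolution bin is 1, and bin b has children 2b and 2b+1), events are row vectors multiplied on the right by the stored matrix, and each stored matrix is expressed in the rotated frame of its parent bin, so children receive rotated coordinates. The function and variable names are hypothetical.

```python
import numpy as np

def assign_bins(events, rotations, splits, n_levels):
    """Return the heap-style index of the finest bin containing each event.

    rotations[b] and splits[b] hold the stored matrix and split value of
    every bin b that is subdivided.
    """
    coords = np.asarray(events, dtype=float).copy()
    bin_index = np.ones(len(coords), dtype=int)      # every event starts in bin 1
    for _ in range(n_levels - 1):
        new_coords = np.empty_like(coords)
        new_index = np.empty_like(bin_index)
        for b in np.unique(bin_index):
            b = int(b)
            members = bin_index == b
            rotated = coords[members] @ rotations[b]
            goes_high = rotated[:, 0] > splits[b]    # at or below the split stays low
            new_coords[members] = rotated
            new_index[members] = 2 * b + goes_high
        coords, bin_index = new_coords, new_index
    return bin_index

# Hypothetical stored parameters for a 3-level hierarchy in 2 dimensions.
identity = np.eye(2)
swap = np.array([[0.0, 1.0], [1.0, 0.0]])            # makes the second axis the first
rotations = {1: identity, 2: swap, 3: swap}
splits = {1: 0.0, 2: 0.0, 3: 0.0}

new_events = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
print(assign_bins(new_events, rotations, splits, n_levels=3).tolist())  # [4, 5, 6, 7]
```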
[0143] The invention also includes a method of determining the
hyperplane boundaries of bins found through the application of the
binning procedure as described elsewhere herein; the method
comprising:
[0144] a) reading the data space boundaries, set of rotation
matrices, and set of split values for each bin whose hyperplane
boundaries are to be determined from a storage device;
[0145] b) specifying a rotation matrix of unit diagonal and zero
off diagonal elements as the parent of the lowest resolution
bin;
[0146] c) starting with the bin of lowest resolution, defining the
hyperplane boundaries as the set of boundaries read in from the
storage device;
[0147] d) beginning with the lowest resolution, iterating over each
level of resolution, intersecting the hyperplane boundaries of the
current bin with the hyperplane boundary utilized to split the
current bin into its two children bins of higher resolution;
and
[0148] e) recording the two sets of boundaries determined by this
intersection as the hyperplane boundaries of the two children
bins.
[0149] In another embodiment, a method further comprises:
[0150] a) in the process of iterating over resolution levels to
find bin boundaries, multiplying the rotation matrix for a bin by
the rotation matrix of its parent bin;
[0151] b) associating this product matrix with the current bin to
be used as a parent bin in the next step in the iteration;
[0152] c) constructing a direction vector from the elements of the
first column of the product matrix computed in the previous
step;
[0153] d) finding the hyperplane perpendicular to the direction
vector constructed in the previous step that passes through the
split value for the current bin; and
[0154] e) identifying the hyperplane found in the previous step as
the boundary utilized to split the current bin into its two
children bins of higher resolution.
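Steps (a) through (e) can be illustrated with a small Python sketch. The order of multiplication is not stated in the text, so the sketch assumes a row-vector convention in which rotated coordinates are obtained as x @ R and the product for a bin is the parent's product multiplied on the right by the bin's own matrix. Under that assumption the splitting hyperplane is the set of points x with normal . x equal to the split value. The names and matrices are hypothetical.

```python
import numpy as np

def splitting_hyperplane(parent_product, rotation, split_value):
    """Return (product, normal, offset) for the hyperplane normal . x == offset."""
    product = parent_product @ rotation   # carried forward as the parent of the children
    normal = product[:, 0]                # direction vector: first column of the product
    return product, normal, split_value

theta = np.pi / 4
rot45 = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
swap = np.array([[0.0, 1.0], [1.0, 0.0]])

# Lowest resolution bin: its parent is the identity matrix, as in step (b) above.
root_product, root_normal, root_offset = splitting_hyperplane(np.eye(2), rot45, 0.0)
# One of its children, whose stored matrix is expressed in the parent's rotated frame.
child_product, child_normal, child_offset = splitting_hyperplane(root_product, swap, 1.0)

point = np.array([0.0, 2.0])
print(bool(point @ child_normal > child_offset))  # True: the point is on the upper side
```

The bounds of each child bin are then the parent's bounds together with the half-space on the corresponding side of this hyperplane.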
[0155] The invention also includes a method of determining
one-dimensional lists of numbers comprising fingerprints for a set
of instances relative to the representation of a multidimensional
data set that has been processed by the binning procedure as
described in detail elsewhere herein; the method comprising:
[0156] a) forming a template instance by combining the events from
a set of instances into a single data set;
[0157] b) determining a set of bins representing the template
instance as described elsewhere herein; and
[0158] c) binning the data comprising each instance of the set of
instances used to form the template instances, or each instance of
some other set of instances.
[0159] In another embodiment, a method further comprises:
[0160] a) for all of the instances in the set of instances,
calculating an event density for each bin by dividing the number of
events in each bin by the total number of events comprising the
instance; and
[0161] b) optionally performing other mathematical transformations
on the values of event densities.
[0162] In another embodiment, a method further comprises:
[0163] a) enumerating the bins in order of hierarchies of
increasing resolution, and within a resolution level, in the order
in which the bins were determined by the methods described herein;
and
[0164] b) creating a list of the numerical values associated with
each bin in the enumerated order found in the preceding step.
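The event densities and the enumerated list described in the two preceding embodiments can be combined in a short Python sketch. It assumes, for illustration only, that each event has already been assigned a heap-style index for its finest-resolution bin (the lowest resolution bin is 1, and bin b has children 2b and 2b+1), so that level-by-level enumeration is simply increasing index order. The function name is hypothetical.

```python
import numpy as np

def density_fingerprint(finest_bin, n_levels):
    """List event densities for all bins, from lowest to highest resolution."""
    finest_bin = np.asarray(finest_bin, dtype=int)
    total_bins = 2 ** n_levels - 1
    counts = np.bincount(finest_bin, minlength=total_bins + 1).astype(float)
    # Each coarser bin contains the events of its two children; visiting bins in
    # decreasing index order finishes every child before it is added to its parent.
    for b in range(total_bins, 1, -1):
        counts[b // 2] += counts[b]
    # Index 0 is unused; divide by the total number of events in the instance.
    return counts[1:] / len(finest_bin)

finest_bin = np.array([4, 4, 5, 6, 7, 7, 7, 7])   # 8 events in a 3-level hierarchy
fingerprint = density_fingerprint(finest_bin, n_levels=3)
print(fingerprint.tolist())  # [1.0, 0.375, 0.625, 0.25, 0.125, 0.125, 0.5]
```

The first element is always 1 because the lowest resolution bin contains every event, and the densities at each level sum to 1.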
[0165] In another embodiment, a method further comprises the step
of recording the list of numbers on a storage device.
[0166] The invention also includes a method of determining
one-dimensional lists of numbers comprising fingerprints for sets
of instances relative to the representations of two or more
multidimensional data sets that have been processed by the binning
procedure described elsewhere herein; the method comprising:
[0167] a) specifying two or more sets of instances, each set
comprising a class of data sets;
[0168] b) for each class, forming a template instance for that
class by combining the events from the set of instances comprising
the class into a single data set; and
[0169] c) for each class, using the method described elsewhere
herein to determine a set of bins representing each template
instance.
[0170] In another embodiment, a method further comprises:
[0171] a) for each class, for the instances comprising that class,
using the method described herein to bin the data comprising each
instance of that class relative to the template instance for that
class;
[0172] b) for the binned representations of instances found in the
previous step, using the methods described herein to form
fingerprints for each instance; and
[0173] c) for each class, for the fingerprints for instances
comprising the class, for each feature in the fingerprint,
calculating the average and standard deviation of each feature,
there now being an average and standard deviation associated with
each bin for each class.
[0174] In another embodiment, a method further comprises:
[0175] a) for each class, for the instances not comprising that
class, using the method described herein to bin the data comprising
each instance not of that class relative to the template instance for
that class; and
[0176] b) for the binned representations of instances found in the
previous step, using the methods described herein to form
fingerprints for each instance.
[0177] In another embodiment, a method further comprises:
[0178] a) for each class, for each fingerprint constructed as
described herein, calculating a z-score for each feature in the
fingerprint by subtracting the average associated with the class as
described herein for the corresponding feature and then dividing
that result by the standard deviation associated with the class as
described herein for the corresponding feature, this result giving
a set of fingerprints for each instance, the number of elements of
that set being equal to the number of classes.
[0179] In another embodiment, a method further comprises:
[0180] a) for each instance, combining the set of fingerprints,
constructed using the method described herein, by concatenating the
lists of elements in each fingerprint, thereby forming a single
fingerprint for each instance which contains that instance's
z-score calculated relative to every class; and
[0181] b) optionally performing other mathematical transformations
on every feature of the fingerprints.
[0182] The invention further includes a method of forming a
categorical fingerprint from a fingerprint created by the methods
described elsewhere herein; the method comprising:
[0183] a) defining a many-to-one mapping of continuous valued
numbers into a discrete set of values, those values being integers
or some other discrete label, the method of mapping being a
mathematical transform such as quantization, a transform based on a
machine learning method, or any other transform capable of a
many-to-one mapping;
[0184] b) applying the mapping described in the previous step to
each feature of the fingerprint; and
[0185] c) creating a list of the mapped features thereby forming a
fingerprint consisting of categorical features.
[0186] The invention also includes a method of forming a binary
fingerprint from a fingerprint created by the method described
elsewhere herein; the method comprising:
[0187] a) specifying the number of non-redundant, discrete features
that comprise a categorical fingerprint;
[0188] b) assigning an integer ordinal to each categorical
feature;
[0189] c) creating a mapping of each categorical feature to a
string of binary digits, the number of elements in the string being
equal to the number of categorical features, by setting all digits
in the string to zero excepting the element whose position in the
string corresponds to the ordinal of the categorical feature, that
element being set to one;
[0190] d) applying the mapping described in the previous step to
each feature of the categorical fingerprint; and
[0191] e) creating a list of the mapped features thereby forming a
fingerprint consisting of binary features.
Apparatuses
[0192] In an aspect of the invention, each of the methods described
herein may be implemented as a program or programs of instructions
executed by a computer. In a typical realization, such a program or
programs of instructions can be saved on a mass storage device,
such as for example a hard disk drive, a floppy disk drive, or a
magnetic tape storage device, or even a plurality of such devices.
Thus the program or programs of instructions may be read in and
executed by one or more machines, either serially or in parallel,
depending on the data in consideration. It will be understood that
the novelty and utility of both the methods and their
implementations are not dependent on any particular embodiment of
computer(s) or mass storage device(s).
[0193] FIG. 6 depicts an exemplary computing system 100 in
accordance with herein described system and methods. Computing
system 100 is capable of executing a variety of operating systems
180 and computing applications 180' (e.g. web browser and mobile
desktop environment) operable on operating system 180. Exemplary
computing system 100 is controlled primarily by computer readable
instructions, which may be in the form of software, regardless of
where and how such software is stored or accessed. Such software may be executed
within central processing unit (CPU) 110 to cause data processing
system 100 to do work. In many known computer servers, workstations,
and personal computers, central processing unit 110 is implemented
by micro-electronic chips called microprocessors. Coprocessor
115 is an optional processor, distinct from main CPU 110, that
performs additional functions or assists CPU 110. CPU 110 may be
connected to coprocessor 115 through interconnect 112. One common
type of coprocessor is the floating-point coprocessor, also called
a numeric or math coprocessor, which is designed to perform numeric
calculations faster and better than general-purpose CPU 110.
[0194] It is appreciated that, although an illustrative computing
environment is shown to comprise a single CPU 110, such
description is merely illustrative, as computing environment 100 may
comprise a number of CPUs 110. Additionally, computing environment
100 may exploit the resources of remote CPUs (not shown) through
communications network 160 or some other data communications means
(not shown).
[0195] In operation, CPU 110 fetches, decodes, and executes
instructions, and transfers information to and from other resources
via the computer's main data-transfer path, system bus 105. Such a
system bus connects the components in computing system 100 and
defines the medium for data exchange. System bus 105 typically
includes data lines for sending data, address lines for sending
addresses, and control lines for sending interrupts and for
operating the system bus. An example of such a system bus is the
PCI (Peripheral Component Interconnect) bus. Some of today's
advanced busses provide a function called bus arbitration that
regulates access to the bus by extension cards, controllers, and
CPU 110. Devices that attach to these busses and arbitrate to take
over the bus are called bus masters. Bus master support also allows
multiprocessor configurations of the busses to be created by the
addition of bus master adapters containing a processor and its
support chips.
[0196] Memory devices coupled to system bus 105 include random
access memory (RAM) 125 and read only memory (ROM) 130. Such
memories include circuitry that allows information to be stored and
retrieved. ROMs 130 generally contain stored data that cannot be
modified. Data stored in RAM 125 can be read or changed by CPU 110
or other hardware devices. Access to RAM 125 and/or ROM 130 may be
controlled by memory controller 120. Memory controller 120 may
provide an address translation function that translates virtual
addresses into physical addresses as instructions are executed.
Memory controller 120 may also provide a memory protection function
that isolates processes within the system and isolates system
processes from user processes. Thus, a program running in user mode
can normally access only memory mapped by its own process virtual
address space; it cannot access memory within another process's
virtual address space unless memory sharing between the processes
has been set up.
[0197] In addition, computing system 100 may contain peripherals
controller 135 responsible for communicating instructions from CPU
110 to peripherals, such as printer 140, keyboard 145, mouse 150,
and data storage drive 155.
[0198] Display 165, which is controlled by display controller 163,
is used to display visual output generated by computing system 100.
Such visual output may include text, graphics, animated graphics,
and video. Display 165 may be implemented with a CRT-based video
display, an LCD-based flat-panel display, gas plasma-based
flat-panel display, a touch-panel, or other display forms. Display
controller 163 includes electronic components required to generate
a video signal that is sent to display 165.
[0199] Further, computing system 100 may contain network adaptor
170 which may be used to connect computing system 100 to an
external communication network 160. Communications network 160 may
provide computer users with means of communicating and transferring
software and information electronically. Additionally,
communications network 160 may provide distributed processing,
which involves several computers and the sharing of workloads or
cooperative efforts in performing a task. It will be appreciated
that the network connections shown are exemplary and other means of
establishing a communications link between the computers may be
used.
[0200] It is appreciated that exemplary computer system 100 is
merely illustrative of a computing environment in which the herein
described apparatus and methods may operate and does not limit the
implementation of the herein described apparatus and methods in
computing environments having differing components and
configurations as the inventive concepts described herein may be
implemented in various computing environments having various
components and configurations.
Illustrative Computer Network Environment:
[0201] Computing system 100, described above, can be deployed as
part of a computer network. In general, the above description for
computing environments applies to both server computers and client
computers deployed in a network environment. FIG. 7 illustrates an
exemplary networked computing environment 200, with a
server in communication with client computers via a communications
network, in which the herein described apparatus and methods may be
employed. Server 205 may be interconnected via a communications
network 160 (which may be any of, or a combination of, a
fixed-wire or wireless LAN, WAN, intranet, extranet, peer-to-peer
network, the Internet, or other communications network) with a
number of client computing environments such as tablet personal
computer 210, mobile telephone 215, telephone 220, personal
computer 100, and personal digital assistant 225. Additionally,
the herein described apparatus and methods may cooperate with
automotive computing environments (not shown), consumer electronic
computing environments (not shown), and building automated control
computing environments (not shown) via communications network 160.
In a network environment in which the communications network 160 is
the Internet, for example, server 205 can be dedicated computing
environment servers operable to process and communicate web
services to and from client computing environments 100, 210, 215,
220, and 225 via any of a number of known protocols, such as
hypertext transfer protocol (HTTP), file transfer protocol (FTP),
simple object access protocol (SOAP), or wireless application
protocol (WAP). Each client computing environment 100, 210, 215,
220, and 225 can be equipped with browser operating system 180
operable to support one or more computing applications such as a
web browser (not shown), or a mobile desktop environment (not
shown) to gain access to server computing environment 205.
[0202] In operation, a user (not shown) may interact with a
computing application running on a client computing environment to
obtain desired data and/or computing applications. The data and/or
computing applications may be stored on server computing
environment 205 and communicated to cooperating users through
client computing environments 100, 210, 215, 220, and 225, over
exemplary communications network 160. A participating user may
request access to specific data and applications housed in whole or
in part on server computing environment 205. The applications
and/or data may be communicated between client computing
environments 100, 210, 215, 220, and 225 and server computing
environments for processing and storage. Server computing
environment 205 may host computing applications, processes and
applets for the generation, authentication, encryption, and
communication of web services and may cooperate with other server
computing environments (not shown), third party service providers
(not shown), network attached storage (NAS) and storage area
networks (SAN).
[0203] Thus, the apparatus and methods described herein can be
utilized in a computer network environment having client computing
environments for accessing and interacting with the network and a
server computing environment for interacting with client computing
environments. However, the apparatus and methods providing the
mobility device platform can be implemented with a variety of
network-based architectures, and thus should not be limited to the
example shown. The herein described apparatus and methods will now
be described in more detail with reference to a presently
illustrative implementation.
[0204] The herein described apparatus and methods provide a
mobility device. It is understood, however, that the invention is
susceptible to various modifications and alternative constructions.
There is no intention to limit the invention to the specific
constructions described herein. On the contrary, the herein
described apparatus and methods are intended to cover all
modifications, alternative constructions, and equivalents falling
within the scope and spirit of the herein described apparatus and
methods.
[0205] It should also be noted that the herein described apparatus
and methods may be implemented in a variety of computer
environments (including both non-wireless and wireless computer
environments), partial computing environments, and real world
environments. The various techniques described herein may be
implemented in hardware or software, or a combination of both.
Preferably, the techniques are implemented in computing
environments maintaining programmable computers that include a
processor, a storage medium readable by the processor (including
volatile and non-volatile memory and/or storage elements), at least
one input device, and at least one output device. Computing
hardware logic cooperating with various instruction sets is
applied to data to perform the functions described above and to
generate output information. The output information is applied to
one or more output devices. Programs used by the exemplary
computing hardware may be preferably implemented in various
programming languages, including high-level procedural or
object-oriented programming languages, to communicate with a computer
system. Illustratively the herein described apparatus and methods
may be implemented in assembly or machine language, if desired. In
any case, the language may be a compiled or interpreted language.
Each such computer program is preferably stored on a storage medium
or device (e.g., ROM or magnetic disk) that is readable by a
general or special purpose programmable computer for configuring
and operating the computer when the storage medium or device is
read by the computer to perform the procedures described above. The
apparatus may also be considered to be implemented as a
computer-readable storage medium, configured with a computer
program, where the storage medium so configured causes a computer
to operate in a specific and predefined manner.
[0206] Although exemplary implementations of the herein
described apparatus and methods have been described in detail
above, those skilled in the art will readily appreciate that many
additional modifications are possible in the exemplary embodiments
without materially departing from the novel teachings and
advantages of the herein described apparatus and methods.
Accordingly, these and all such modifications are intended to be
included within the scope of the herein described apparatus and
methods. The invention may be better defined by the following
exemplary claims.
[0207] Those skilled in the art, having the benefits of the
teachings of the present invention as hereinabove set forth, may
effect numerous modifications thereto. Such modifications are to be
construed as lying within the contemplation of the present
invention, as defined by the claims herein set forth.
REFERENCES CITED
U.S. Patent Documents
[0208] Moser, A. R., et al., 2005, "Method and apparatus for
discovering patterns in binary or categorical data," US Patent
Application 20050143928.
Other Publications
[0209] Johnson, R. L., 1993, "Flow cytometry. From research to
clinical laboratory applications," Clin Lab Med, 13, 831-52.
[0210] Jennings, C. D. and Foon, K. A., 1997, "Recent Advances in
Flow Cytometry: Application to the Diagnostics of hematologic
Malignancy," Blood 90, 2863-92.
[0211] Roederer, M., et al., 2001, "Probability Binning Comparison:
A Metric for Quantitating Univariate Distribution Differences,"
Cytometry 45, 37-46.
[0212] Roederer, M., et al., 2001, "Probability Binning Comparison:
A Metric for Quantitating Multivariate Distribution Differences,"
Cytometry 45, 47-55.
[0213] O'Connel, M. J., 1974, "Search Program for Significant
Variables," Comp. Phys. Comm. 8, 49-55.
[0214] Golub, G. H. and Van Loan, C. F., 1996, "The Singular Value
Decomposition" and "Unitary Matrices" in Matrix Computations, 3rd
ed. Baltimore, Md.: Johns Hopkins University Press, 70-71 and 73.
* * * * *