U.S. patent application number 15/735573, for a method and apparatus for large scale machine learning, was published by the patent office on 2018-10-18.
The applicant listed for this patent is ARIZONA TECHNOLOGY ENTERPRISES. The invention is credited to Asim Roy.
Application Number: 15/735573
United States Patent Application: 20180300631
Kind Code: A1
Inventor: Roy; Asim
Publication Date: October 18, 2018
METHOD AND APPARATUS FOR LARGE SCALE MACHINE LEARNING
Abstract
Analyzing patterns in a volume of data and taking an action
based on the analysis involves receiving data and training the data
to create training examples, and then selecting features that are
predictive of different classes of patterns in the data stream,
using the training examples. The process further involves training
in parallel a set of ANNs, using the data, based on the selected
features, and extracting only active nodes that are representative
of a class of patterns in the data stream from the set of ANNs. The
process continues with adding class labels to each extracted active
node, classifying patterns in the data based on the class-labeled
active nodes, and taking an action based on the classifying
patterns in the data.
Inventors: Roy; Asim (Phoenix, AZ)
Applicant: ARIZONA TECHNOLOGY ENTERPRISES, Scottsdale, AZ, US
Family ID: 57608944
Appl. No.: 15/735573
Filed: June 10, 2016
PCT Filed: June 10, 2016
PCT No.: PCT/US16/37079
371 Date: December 11, 2017
Related U.S. Patent Documents
Application Number: 62186891
Filing Date: Jun 30, 2015
Current U.S. Class: 1/1
Current CPC Class: G06N 3/088 20130101; G06N 3/0454 20130101
International Class: G06N 3/08 20060101 G06N003/08; G06N 3/04 20060101 G06N003/04
Claims
1. A method for analyzing patterns in a volume of data and taking
an action based on the analysis, comprising: receiving the volume
of data; training the data to create a plurality of training
examples; selecting a plurality of features that are predictive of
different classes of patterns in the data, using the training
examples; training in parallel a plurality of ANNs, each of the
plurality of ANNs a self-organizing map, using the data, based on
the selected plurality of features; extracting only active nodes
that are representative of a class of patterns in the data from the
plurality of ANNs and adding class labels to each extracted active
node; classifying patterns in the data based on the class-labeled
active nodes; and taking an action based on the classifying
patterns in the data.
2. The method of claim 1, wherein receiving the volume of data
comprises receiving streaming data ("a data stream").
3. The method of claim 2, wherein training the data comprises
training in parallel a first plurality of Kohonen networks, using
the data stream, to create the plurality of training examples; and
wherein training in parallel the plurality of ANNs, each a
self-organizing map, comprises training in parallel a second
plurality of Kohonen networks, different from the first plurality
of Kohonen networks.
4. The method of claim 3, further comprising discarding the first
plurality of Kohonen networks prior to training in parallel the
second plurality of Kohonen networks.
5. The method of claim 3, wherein the plurality of training
examples are created for each class of pattern in the data stream
and are represented by nodes in the first plurality of Kohonen
networks.
6. The method of claim 1, wherein selecting the plurality of
features that are predictive of different classes of patterns in
the data, using the training examples, reduces dimensionality of
the data.
7. The method of claim 1, wherein selecting the plurality of
features that are predictive of different classes of patterns in
the data, using the training examples, is to produce separation
between patterns in different classes and also make the patterns
within each class more compact.
8. A method for classifying data having a plurality of features
belonging to a plurality of classes of patterns in streaming data,
comprising: receiving the streaming data; training in parallel a
first plurality of Kohonen networks, of different grid sizes, and
for a plurality of different subsets of the plurality of features,
by processing the streaming data, wherein the first plurality of Kohonen networks form clusters, wherein active nodes are the
centers of the clusters and serve as representative examples of
classes in the streaming data; assigning the active nodes in the
first plurality of Kohonen networks to one of the plurality of
classes; ranking the plurality of features for each of the
plurality of classes; grouping one or more of the plurality of
features into each of a plurality of separate categories based on
ranking; training in parallel a second plurality of Kohonen
networks, of different grid sizes, for each of the plurality of
features in each category, by processing the data; assigning active
nodes in the second plurality of Kohonen networks to one of the
plurality of classes; and creating a group of hypersphere
classifiers from a subset of the active nodes in the second
plurality of Kohonen networks.
9. The method of claim 8, wherein assigning the active nodes in the
first plurality of Kohonen networks to one of the plurality of
classes is based on a number of times each of the active nodes is
activated by features belonging to each of the plurality of
classes.
10. The method of claim 8, wherein assigning the active nodes in
the second plurality of Kohonen networks to one of the plurality of
classes is based on a number of times each of the active nodes is
activated by features belonging to each of the plurality of
classes.
11. The method of claim 8, wherein ranking the plurality of
features for each of the plurality of classes is based on
separability indices for each feature.
12. The method of claim 11, further comprising computing the
separability indices for each feature in the plurality of features
and for each class in the plurality of classes prior to ranking the
plurality of features for each of the plurality of classes based on
the separability indices for each feature.
13. The method of claim 8, wherein: receiving streaming data
comprises receiving streaming data that is unbalanced; wherein the
active nodes that are the centers of the clusters that serve as
representative examples of classes in the streaming data serve as
representative examples of majority classes and minority classes in
the streaming data; the method further comprising: training in
parallel another plurality of Kohonen networks, of different grid
sizes, and for a plurality of different subsets of the plurality of
features, by processing the streaming data, wherein active nodes in
the additional plurality of Kohonen networks serve as
representative examples of majority classes and minority classes in
the streaming data when a minimum threshold is exceeded in the
plurality of active nodes.
14. The method of claim 13, wherein the representative examples are
created for each class of pattern in the received streaming data
and are represented by active nodes in the first plurality of
Kohonen networks.
15. A non-transitory computer-readable medium containing
computer-executable instructions that, when executed by a
processor, cause the processor to analyze patterns in a volume of data and take an action based on the analysis, according to a
method comprising: receiving the volume of data; training the data
to create a plurality of training examples; selecting a plurality
of features that are predictive of different classes of patterns in
the data, using the training examples; training in parallel a
plurality of ANNs, each of the plurality of ANNs a self-organizing
map, using the data, based on the selected plurality of features;
extracting only active nodes that are representative of a class of
patterns in the data from the plurality of ANNs and adding class
labels to each extracted active node; classifying patterns in the
data based on the class-labeled active nodes; and taking an action
based on the classifying patterns in the data.
16. The non-transitory computer-readable medium of claim 15 wherein
receiving the volume of data comprises receiving streaming data ("a
data stream"), and wherein training the data comprises training in
parallel a first plurality of Kohonen networks, using the data
stream, to create the plurality of training examples; and wherein
training in parallel the plurality of ANNs, each a self-organizing
map, comprises training in parallel a second plurality of Kohonen
networks, different from the first plurality of Kohonen
networks.
17. The non-transitory computer-readable medium of claim 16,
further comprising discarding the first plurality of Kohonen
networks prior to training in parallel the second plurality of
Kohonen networks.
18. The non-transitory computer-readable medium of claim 16 wherein
the plurality of training examples are created for each class of
pattern in the data stream and are represented by nodes in the
first plurality of Kohonen networks.
19. The non-transitory computer-readable medium of claim 15 wherein
selecting the plurality of features that are predictive of
different classes of patterns in the data, using the training
examples, reduces dimensionality of the data.
20. The non-transitory computer-readable medium of claim 15 wherein
selecting the plurality of features that are predictive of
different classes of patterns in the data, using the training
examples, is to produce separation between patterns in different
classes and also make the patterns within each class more compact.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional
Application No. 62/186,891, filed Jun. 30, 2015, Attorney Docket
No. 143677.001111, the entire contents of which are incorporated by
reference herein.
BACKGROUND
[0002] With the advent of high-dimensional stored big data and
streaming data, what is needed is machine learning on a very large
scale. It would be advantageous for such machine learning to be
extremely fast, scale up easily with volume and dimension, be able
to learn from streaming data, automatically perform dimension
reduction for high-dimensional data, and be deployable on massively
parallel hardware. Artificial neural networks (ANNs) are well
positioned to address these challenges of large scale machine
learning.
SUMMARY
[0003] Embodiments of the invention provide a method for analyzing
patterns in a data stream and taking an action based on the
analysis. A volume of data is received, and the data is trained to
create training examples. Features are selected that are predictive
of different classes of patterns in the data, using the training
examples. A set of Kohonen networks is trained using the data,
based on the selected features. Then, active nodes are identified
and extracted from the set of Kohonen nets that are representative
of a class of patterns in the data. Classes are assigned to the
extracted active nodes. Action may then be taken based on the
assigned classes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] FIG. 1 illustrates hypersphere classification networks in
accordance with embodiments of the invention.
[0005] FIG. 2 illustrates the general architecture of a Kohonen
network as used in embodiments of the invention.
[0006] FIG. 3 illustrates adding classes to nodes in a Kohonen
network according to embodiments of the invention.
[0007] FIG. 4 illustrates an embodiment of the invention that
trains different Kohonen nets, of different grid sizes, and for
different feature subsets, in parallel on a distributed computing
platform.
[0008] FIG. 5 is a flow chart in accordance with an embodiment of
the invention.
[0009] FIG. 6 illustrates how embodiments of the invention may be implemented on a distributed computing platform.
DETAILED DESCRIPTION
[0010] Embodiments of the invention provide a method that can
effectively handle large scale, high-dimensional data. Embodiments
provide an online method that can be used for both streaming data
and large volumes of stored big data. Embodiments primarily train
multiple Kohonen nets in parallel both during feature selection and
classifier construction phases. However, in the end, embodiments of
the invention only retain a few selected neurons (nodes) from the
Kohonen nets in the classifier construction phase; the embodiments
discard all Kohonen nets after training. Embodiments use Kohonen
nets both for dimensionality reduction through feature selection
and for building an ensemble of classifiers using single Kohonen
neurons. Embodiments are meant to exploit massive parallelism and
should be easily deployable on hardware that implements Kohonen
nets. Further embodiments also provide for the method to handle
imbalanced data. The artificial neural network introduced by
Finnish professor Teuvo Kohonen in the 1980s is sometimes called a
Kohonen map or network. A Kohonen network is a self-organizing map
(SOM) or self-organizing feature map (SOFM) which is a type of
artificial neural network (ANN) that is trained using unsupervised
learning to produce a low-dimensional (typically two-dimensional),
discretized representation of the input space of training samples,
called a map. Self-organizing maps are different from other
artificial neural networks because they apply competitive learning,
as opposed to error-correction learning (such as back propagation
with gradient descent), and because they use a neighborhood
function to preserve the topological properties of the input space.
This makes SOMs useful for visualizing low-dimensional views of
high-dimensional data, akin to multidimensional scaling. Kohonen nets do not require target outputs for each input vector in the training data; inputs are connected to a two-dimensional grid of neurons, or nodes, and multi-dimensional data can be mapped onto a two-dimensional surface.
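For concreteness, the competitive-learning update just described can be sketched in a few lines of Python. This is only an illustrative toy implementation; the grid size, decay schedules, and NumPy-based structure are assumptions of this sketch, not details taken from the application.

    import numpy as np

    def train_som(data, grid=(9, 9), epochs=10, lr0=0.5, sigma0=3.0, seed=0):
        """Minimal self-organizing map: competitive learning with a Gaussian
        neighborhood function (illustrative sketch only)."""
        rng = np.random.default_rng(seed)
        n_features = data.shape[1]
        # One weight vector per node of the two-dimensional grid.
        weights = rng.random((grid[0], grid[1], n_features))
        # Grid coordinates used by the neighborhood function.
        coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                      indexing="ij"), axis=-1)
        n_steps = epochs * len(data)
        step = 0
        for _ in range(epochs):
            for x in data:
                # Winning (best matching) node: closest weight vector to x.
                dists = np.linalg.norm(weights - x, axis=2)
                winner = np.unravel_index(np.argmin(dists), grid)
                # Learning rate and neighborhood radius decay over time.
                frac = step / n_steps
                lr = lr0 * (1.0 - frac)
                sigma = sigma0 * (1.0 - frac) + 1e-3
                # Gaussian neighborhood around the winner preserves topology.
                grid_dist = np.linalg.norm(coords - np.array(winner), axis=2)
                h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
                weights += lr * h[..., None] * (x - weights)
                step += 1
        return weights

    # Example: train a 9x9 map on 200 random two-dimensional points.
    som_weights = train_som(np.random.default_rng(1).random((200, 2)))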
[0011] Streaming data is data that is generated continuously by
many, perhaps thousands, of data sources, which typically transmit
data records simultaneously, and in small sizes, on the order of
kilobytes. Streaming data includes a wide variety of data such as
log files generated by customers using mobile or web applications,
e-commerce purchases, in-game player activity, information from
social networks, financial trading floors, geospatial services,
gene expression datasets, telemetry and/or sensor data obtained
from connected devices or instrumentation in data centers, and
Internet of Things (IoT) data from everyday objects that have
internetwork connectivity. This streaming data, oftentimes received and processed in real time or near real time, is
processed sequentially and incrementally on a record-by-record
basis, or over sliding time windows, and used for a wide variety of
analytics including correlations, aggregations, filtering, and
sampling. Streaming data processing is beneficial in most scenarios
where new, dynamic data is generated on a continual basis. It
applies to many industry segments and big data use cases. Companies generally conduct data analysis on such streams, including applying machine learning algorithms, to extract deeper insights from the data.
Stored data, by contrast, is historical--it has been gathered and
stored, in a permanent memory(s) or storage device(s), for later
retrieval and processing by a computing platform that has access to
the permanent memory or storage device.
1. INTRODUCTION
[0012] The arrival of big and streaming data is causing major
transformations within the machine learning field. For example,
there are significantly more demands on machine learning systems,
from the requirement to learn, and learn quickly, from very large
volumes of data, to the requirement for automation of machine
learning to reduce the need for expert (human) involvement and for
deployment of machine learning systems on massively parallel
hardware. Traditional artificial neural network (ANN) algorithms
("neural network algorithms", "neural net algorithms") have many
properties that can meet these demands of big data and, therefore,
can play a role in the major transformations that are taking place.
For example, the learning mode of many neural net algorithms is
online, incremental, learning--a mode that does not require
simultaneous access to large volumes of data. This mode of learning
not only resolves many computational issues associated with
learning from big data, it also removes the headache of correctly
sampling from large volumes of data. It also makes neural net algorithms highly scalable and allows them to learn from all of the data. This mode of learning is also useful when working with
streaming data, where none or very little of the data may actually
be stored and a learning system may only have a brief look at the
data that flows through the system.
[0013] Neural network algorithms also have the advantage of using
very simple computations that can be highly parallelized. Thus,
they are capable of exploiting massively parallel computational
facilities to provide very fast learning and response times. There
are many implementations of neural network algorithms on graphics
processing units (GPUs) that exploit parallel computations.
Neuromorphic hardware, especially meant for neural network
implementation, is also becoming available. Kohonen networks, or simply Kohonen nets, used in embodiments of the invention, are generally a type of single-layer net with their own hardware implementation(s). In general, embodiments of the invention are
hardware implementations of neural network algorithms to handle
high velocity streaming data. Such hardware implementations can
also process stored big data in a very fast manner. All of these
features of neural network algorithms position the field to become
the backbone of machine learning in the era of big and streaming
data.
[0014] Embodiments of the invention provide a new and novel neural
network learning method that (1) can be parallelized at different
levels of granularity, (2) addresses the issue of high-dimensional
data through class-based feature selection, (3) learns an ensemble
of classifiers using selected Kohonen neurons (or Kohonen "nodes")
from different Kohonen nets, (4) can handle imbalanced data, and
(5) can be implemented on hardware.
[0015] With regard to the second above-noted objective of
dimensionality reduction through feature selection, and further
with reference to FIG. 5, the method 500 receives a volume of data,
for example, a stream of high dimensional data at 505, or from a
store of historical data, and trains the data to create training
examples. For example, in one embodiment, the method trains a
number of Kohonen nets in parallel with streaming data to create
some representative data points, also referred to as training
samples or training examples, at 510. (It should be noted that
stored data can also be received, either from a memory or permanent
store accessible to a hardware based computing platform or software
based computing platform executing code that accesses the memory or
permanent store, or streamed from the memory or permanent store
accessible to the computing platform). Using Kohonen nets, in one
embodiment, the method performs class-based feature selection at
515. For selection of features for each class, the basic criteria are that the selected features (1) make the class more compact and (2) at the same time maximize its average distance from the other classes. The method discards, insofar as they are used in
creating training samples, all Kohonen nets once class-specific
feature selection is complete. In a second phase, the method
constructs several new Kohonen nets in parallel in different
feature spaces, again from the data, based on the selected
class-based features, at 520. Once these Kohonen nets are trained,
at 525, the method extracts just the active neurons (or active
"nodes") from them, adds class labels to each of the active neurons
and creates an ensemble of Kohonen neurons for classification. In
the end, the method retains just a set of dangling active Kohonen
neurons from different Kohonen nets in different feature spaces and
discards the Kohonen nets themselves. The retained set of
class-labeled active nodes can then be used to classify patterns in
the data at 530, and some action taken based on the classified
patterns in the streaming data, at 535.
[0016] In imbalanced data problems, such as fraud detection, very few data points exist for one or more classes, while many data points are available for the other classes. Dealing with
imbalanced data problems has always been difficult for
classification algorithms and dealing with the streaming version of
imbalanced data is particularly challenging. An additional
embodiment of the invention provides a method that handles
imbalanced data problems by creating a second layer of Kohonen
nets, as described in more detail below.
[0017] In the description that follows, section 2 provides an
overview of the concepts used in embodiments of the invention
including class-specific feature selection and hypersphere nets.
Section 3 describes an algorithm that, according to embodiments of
the invention, uses Kohonen nets in parallel for class-specific
feature selection from streaming data. Sections 4 and 5 provide
details of how an ensemble of hypersphere nets is constructed using
neurons from different Kohonen nets, according to various
embodiments of the invention. Section 6 presents computational
results for several high-dimensional problems, according to an
embodiment of the invention. Section 7 describes an embodiment of
the invention including an algorithm for imbalanced data problems
and some related computational results. Section 8 discusses
hardware implementation of embodiments of the invention, and
conclusions are discussed in Section 9.
2. OVERVIEW OF THE CONCEPTS AND MOTIVATION BEHIND EMBODIMENTS OF
THE INVENTION
[0018] Embodiments of the invention use a method that creates
hypersphere classification networks A, B and C, illustrated in the
embodiment 100 in FIG. 1, in reduced feature spaces by constructing
a series of Kohonen nets from streaming data in those reduced
feature spaces. The general architecture of a Kohonen network, or
self-organizing map (SOM), is shown in FIG. 2 at 200 and 205. In
the network depicted at 200, only three connections are shown for
purposes of clarity. Embodiments of the invention discard all
Kohonen nets in the end and retain only certain Kohonen neurons
from these nets as hyperspheres. According to an embodiment, the
method is parallelized, and, in each of two phases (a feature
selection phase and then a classifier construction phase), all of
the Kohonen nets can be constructed in parallel.
2.1 Hypersphere Classification Nets
[0019] As shown in FIG. 1, a hypersphere classification net 100 has
one hidden layer and one output layer for classification. In terms
of computing speed, this shallow architecture has a great advantage, especially when compared to nets with multiple hidden layers.
[0020] A prior art method constructs hypersphere nets in an offline
mode. Constructing hypersphere nets in an online mode is a
considerable challenge and embodiments of the invention utilize
Kohonen nets as the underlying mechanism to do so. After
class-specific feature selection, one embodiment constructs Kohonen
nets in reduced feature spaces with streaming data. After training
these Kohonen nets, one embodiment 300 adds class labels to
individual neurons (or "nodes"), whenever possible, as shown in
FIG. 3. The embodiment assigns an individual neuron to a particular
class if a majority of the streaming data points for which it is
the winning or best neuron belongs to that class. The radius of
activation of such a neuron is equal to the distance to the furthest data point belonging to the class the neuron is assigned to and for which the neuron is also the winning neuron. It discards
neurons that are not assigned to any class. Described thus far are
the main concepts of the process of constructing hypersphere nets
from Kohonen nets. The following description presents further
details of embodiments of the invention.
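The retained class-labeled neurons behave as hyperspheres, each defined by a center, a radius of activation, and a class. A minimal sketch of how such hyperspheres could classify a pattern follows; the nearest-center-within-radius decision rule used here is an assumption for illustration and is not necessarily the ensemble rule described later in the application.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class Hypersphere:
        """A retained Kohonen neuron: weight vector (center), radius, class label."""
        center: np.ndarray
        radius: float
        label: str

    def classify(spheres, x):
        """Illustrative decision rule (an assumption, not the application's
        stated rule): prefer the closest center whose hypersphere contains x,
        falling back to the nearest center otherwise."""
        containing = [s for s in spheres if np.linalg.norm(x - s.center) <= s.radius]
        pool = containing if containing else spheres
        best = min(pool, key=lambda s: np.linalg.norm(x - s.center))
        return best.label

    # Tiny usage example with made-up hyperspheres.
    spheres = [Hypersphere(np.array([0.0, 0.0]), 1.0, "A"),
               Hypersphere(np.array([3.0, 3.0]), 1.5, "B")]
    print(classify(spheres, np.array([0.2, 0.4])))  # -> "A"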
2.2 Class-Specific Feature Selection and Dimensionality
Reduction
[0021] One of the fundamental challenges for machine learning
systems for high-dimensional streaming data is dimensionality
reduction. Many of the prior art feature extraction methods, such
as Principal Component Analysis (PCA) and Linear Discriminant
Analysis (LDA), do not perform very well on high-dimensional data.
A number of other prior art methods have been developed in recent years for both online feature selection and feature extraction
for high-dimensional streaming data. One prior art method considers
an online learning problem where the training instances arrive
sequentially, but the number of active features that a learning
system can use is fixed. Other prior art approaches present a
method for streaming features where candidate features arrive one
at a time and the learning system has to select the best set of
features. However, all the training examples are available before
the start of training. Yet another approach presents two methods
for dimensionality reduction based on the orthogonal centroid
algorithm--an online, incremental one for feature extraction and an
offline one for feature selection. Another prior art approach
proposes an online extension of the Maximum Margin Criterion (MMC)
to handle streaming data. Finally, one prior art approach presents
an online version of Isometric Feature Mapping (ISOMAP), a
nonlinear dimensionality reduction method.
[0022] One prior art approach presents a parallel distributed
feature selection algorithm for distributed computing environments
that preserves the variance contained within the data. It can
perform both supervised (based on human input) and unsupervised
(automatic) feature selection and uses data partitioning for large
amounts of data. A different approach proposes a highly scalable
feature selection algorithm for logistic regression that can be
parallelized by partitioning both features and records within a Map
Reduce framework. This approach ranks features by estimating the
coefficients of the logistic regression model.
[0023] However, none of these prior art approaches are
class-specific feature selection methods for streaming data,
although the idea of class-specific extracted features (i.e.,
projected subspaces) has existed for some time. Recently, prior art
approaches have used the idea of class-specific feature selection
in ensemble learning. One prior art approach proposed methods that
use a subset of the original features in class-specific
classifiers. However, none of these prior art methods are
appropriate for streaming data.
[0024] Embodiments of the invention use class-specific feature
selection for dimensionality reduction. The advantage in preserving
the original features of a problem is that, quite often, those
features have meaning and interpretation, while such meaning is
usually lost in extracted or derived features. In class-specific
feature selection, the algorithm finds separate feature sets for
each class such that they are the best ones to separate that class
from the rest of the classes. This criterion for identifying good
class-specific features is similar to that used in LDA and Maximum
Margin Criterion (MMC), which are feature extraction methods. LDA,
MMC, and other similar feature extraction methods maximize the
between-class scatter and minimize the within-class scatter. In
other words, those methods try to maximize the distance between
different class centers and at the same time make the data points
in the same class as close as possible. Embodiments of the
invention, although not based on a feature extraction method, are
based on a similar concept. The feature selection criterion is also
similar to that of a prior art approach that preserves the variance
contained within the data.
[0025] In the prior art offline mode for constructing hypersphere
nets, where a collection of data points is available, it is
straightforward to select features that maximize the average
distance of data points of one class from the rest of the classes
and also, at the same time, minimize the average distance of data
points within that class. One such approach ranks and selects
features on that basis and computational experiments show that it
also works for high-dimensional problems. However, that prior art
method cannot be used directly on streaming data, that is, where no
data is stored. Embodiments of the invention operate on streaming
data using the same concept for feature selection, but rely on
Kohonen nets to do so. By training multiple Kohonen nets from
streaming data, some representative data points (or "training
examples") are created for each class and that is how to resolve
the dilemma of not having access to a collection of data points.
Given a collection of representative data points (represented by
certain Kohonen neurons in the Kohonen nets), it is possible to use
a class-based feature selection method.
3. KOHONEN NETWORK BASED CLASS-SPECIFIC FEATURE SELECTION FOR
STREAMING DATA
3.1 Concept of Separability Index for Feature Ranking by Class
[0026] Suppose there are kc total classes. The basic feature
ranking criteria are that (1) a good feature for class k should
produce good separation between patterns in class k and those not
in class k, k=1 . . . kc, and (2) also make the patterns in class k
more compact. Based on this idea, a measure called the separability
index that can rank features for each class has been proposed.
Suppose d.sub.kn.sup.in is the average distance between patterns
within class k for feature n, and d.sub.kn.sup.out the average
distance between the patterns in class k and those not in class k
for feature n. One approach uses the Euclidean distance as the distance measure, but other distance measures could be used. The
separability index of feature n for class k is given by
r.sub.kn=d.sub.kn.sup.out/d.sub.kn.sup.in. One may use this
separability index r.sub.kn to rank order features of class k where
a higher ratio implies a higher rank. The sense of this measure is
that a feature n with a lower d.sub.kn.sup.in makes class k more
compact and with a higher d.sub.kn.sup.out increases the separation
of class k from the other classes. Thus, the higher the ratio
r.sub.kn for a feature n, the greater is its ability to separate
class k from the other classes and the better the feature.
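A direct sketch of this computation, applied to a set of labeled examples, follows. In the application the examples are the representative Kohonen neurons rather than raw stored data, and for a single feature the Euclidean distance reduces to an absolute difference; the function name and the toy dataset below are illustrative assumptions.

    import numpy as np

    def separability_indices(X, y, target_class):
        """Per-feature separability index r_kn = d_out / d_in for class k,
        following the definition above (sketch only)."""
        X = np.asarray(X, dtype=float)
        in_k = X[y == target_class]      # patterns in class k
        out_k = X[y != target_class]     # patterns not in class k
        indices = np.empty(X.shape[1])
        for n in range(X.shape[1]):
            a, b = in_k[:, n], out_k[:, n]
            # Average within-class distance for feature n (pairs i != j).
            if len(a) > 1:
                d_in = np.abs(a[:, None] - a[None, :]).sum() / (len(a) * (len(a) - 1))
            else:
                d_in = 0.0
            # Average distance between class-k patterns and the rest, feature n.
            d_out = np.abs(a[:, None] - b[None, :]).mean()
            indices[n] = d_out / d_in if d_in > 0 else np.inf
        return indices

    # Example: rank two features for class 0 of a toy dataset.
    X = np.array([[0.1, 5.0], [0.2, 4.8], [0.9, 5.1], [1.0, 4.9]])
    y = np.array([0, 0, 1, 1])
    r = separability_indices(X, y, target_class=0)
    print(np.argsort(-r))  # higher index value -> higher rank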
3.2 Computing Separability Indices for High-Dimensional Streaming
Data
[0027] The challenge with online learning from streaming data is
that there is no access to a stored set of training examples to
compute separability indices. (The assumption here is that none of
the streaming data is stored.) Embodiments of the invention use
Kohonen nets to solve this problem; Kohonen nets serve as
collectors of representative training examples. One embodiment
trains many different Kohonen nets, of different grid sizes, and
for different feature subsets, in parallel on a distributed
computing platform, as illustrated in FIG. 4. One implementation of
this embodiment uses Apache Spark as the distributed computing
platform, but other similar platforms can be used. A Kohonen net
forms clusters and the cluster centers (that is, the active Kohonen
net nodes or neurons) are equivalent to representative examples of
the streaming data. One embodiment then uses these representative
examples to compute the separability indices of the features by
class.
3.3 Exploiting Parallel Distributed Computing to Construct Kohonen
Nets
[0028] Suppose the N-dimensional vector x, x=(X.sub.1, X.sub.2, . . . , X.sub.N), represents an input pattern in the streaming data and X.sub.n denotes the n.sup.th element of the vector x. Let FP.sub.q
denote the q.sup.th feature subset, q=1 . . . FS, where FS is the
total number of feature subsets. Let KN.sub.q.sup.g be the g.sup.th
Kohonen net of a certain grid size for the q.sup.th feature subset,
q=1 . . . FS, g=1 . . . FG, where FG is the total number of
different Kohonen net grid sizes. kc denotes the total number of
classes and k is a class.
[0029] Assume that the method, according to one embodiment of the
invention, has access to parallel distributed computing facilities
to compute the separability indices efficiently and quickly for
high-dimensional streaming data. Suppose that the embodiment trains
Kohonen nets of 10 different grid sizes for each feature subset
(e.g. Kohonen net grid sizes of 9.times.9, 8.times.8, 7.times.7 and
so on, as depicted in FIG. 4 at 400, 405, and 410, respectively;
FG=10) and also assume that it has computing resources to build 500
Kohonen nets in parallel. In that case, FS, the total number of
feature subsets, would be 50 (=500/10) and KN.sub.1.sup.1 . . .
KN.sub.50.sup.10 would denote the 500 different Kohonen nets.
Further suppose that there are 1000 features in the data stream
(N=1000). It would, therefore, partition the feature set randomly
into 50 subsets of 20 features each (FS=50; N=20*FS). For
simplicity, assume that the first feature partition FP.sub.1 includes the features X.sub.1 . . . X.sub.20, the second feature partition FP.sub.2 includes the features X.sub.21 . . . X.sub.40, and
so on. For Kohonen nets for the first feature partition FP.sub.1
(KN.sub.1.sup.g, g=1 . . . 10), the input vector would be the
feature vector FP.sub.1, for Kohonen nets for the second feature
partition FP.sub.2 (KN.sub.2.sup.g, g=1 . . . 10), the input vector
would be the feature vector FP.sub.2, and so on. Thus, for each
feature subset FP.sub.q, 10 different Kohonen nets of different
grid sizes would be trained. If there are just a few classes in the
classification problem, smaller grid sizes should suffice (e.g.
grid sizes of 9.times.9, 8.times.8, 7.times.7 and so on). If there
are thousands of classes, then larger grid sizes would be used.
[0030] The use of feature partitions is, essentially, for
efficiency and speed because Kohonen nets can be trained with
low-dimensional input vectors much faster and in parallel compared
to a single net that is trained with thousands of features. And the
reason for using different grid sizes for the same feature
partition is to get different representative examples to compute
the separability indices. The method repeats this overall process
of computing separability indices a few times by randomly selecting
features for each feature partition, according to one embodiment.
The method then uses the maximum separability index value of each
feature over these repetitions for final ranking of the
features.
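The partition-and-train scheme just described can be sketched as follows, using the illustrative numbers from the text (N=1000 features, KN.sub.max=500 nets, FG=10 grid sizes, FS=50 partitions). The specific list of grid sizes and the flat enumeration of training jobs are assumptions made for this sketch; the application leaves the parallel execution to the distributed platform.

    import numpy as np

    def make_feature_partitions(n_features, n_partitions, rng):
        """Randomly split the feature indices into FS roughly equal subsets."""
        perm = rng.permutation(n_features)
        return np.array_split(perm, n_partitions)

    rng = np.random.default_rng(0)
    N, KN_max, FG = 1000, 500, 10
    FS = KN_max // FG                                  # 50 feature partitions
    grid_sizes = [(g, g) for g in range(11, 1, -1)]    # 11x11 down to 2x2 (assumed)
    partitions = make_feature_partitions(N, FS, rng)   # 50 subsets of 20 features

    # One training job per (feature subset, grid size); in the application all
    # KN_max such Kohonen nets are trained in parallel (e.g., on Apache Spark),
    # and the whole procedure is repeated with fresh random partitions, keeping
    # the maximum separability index seen for each feature.
    jobs = [(q, dims, partitions[q]) for q in range(FS) for dims in grid_sizes]
    print(len(jobs), "Kohonen nets to train in parallel")  # 500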
3.4 Assigning Class Labels to Kohonen Neurons to Compute
Separability Indices
[0031] According to embodiments of the invention, not all, but some
of the active nodes of Kohonen nets trained for different feature
partitions serve as representative training examples of different
classes and are used to compute the separability indices. One
embodiment considers only the winning or best neurons of the
Kohonen nets to be active nodes. Once the Kohonen nets stabilize
during initial training, the embodiment processes some more
streaming data to assign class labels to the active nodes. In this
phase, as the embodiment processes some more streaming data, it
does not change the weights of the Kohonen nets but only keeps
count of the number of times an input pattern of a particular class
activated a particular neuron (i.e., the neuron was the winning
neuron for those input patterns). For example, given there are two
classes, A and B, for each active node, the method keeps count of
the number of times input patterns from each of these two classes
activates the node. Suppose class A patterns activate one such
neuron (node) 85 times and class B patterns activate the node 15
times. At this node then, approximately 85% of the activating input
patterns belong to class A and 15% belong to class B. Since a
significant majority of the activating patterns belong to class A,
the method simply assigns this active neuron to class A. Assigning
an active neuron to a class simply means that that neuron
represents an example of that class. As an example when an active
neuron is discarded, suppose class A patterns activate a node 55%
of the time and class B patterns activate the node 45% of the time.
The method discards such an active node because no class has a
significant majority and, therefore, it cannot claim the node as a
representative point of any particular class. This phase of
labeling active nodes ends once the class ratios (percentages) at
every active node for all of the Kohonen nets are fairly stable and
all active nodes (neurons) can either be assigned to classes or
discarded if no class has a significant majority. The embodiment
also discards active nodes that have comparatively low absolute
count of patterns.
[0032] After assigning class labels to active nodes and dropping
some of the active nodes, the method constructs lists of active
nodes assigned to each class in each feature partition. For
example, 50 active nodes may be assigned to class A in a particular
feature partition and 20 active nodes assigned to class B and they
constitute the representative training examples of the respective
classes for this feature partition. Note that these active nodes
can be from different Kohonen nets of different grid sizes,
although in the same feature partition. Using these active nodes,
the method computes the separability index of each feature in that
feature partition and for each class.
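The labeling rule described above reduces to a simple counting procedure, sketched below. PCT.sub.min follows the 70% value quoted in the text; the minimum absolute count and the dictionary-based bookkeeping are assumptions of this sketch.

    def assign_class_labels(activation_counts, pct_min=0.70, ct_min=10):
        """Assign each active node to a class when one class has a clear
        majority of its activations; otherwise discard the node.
        activation_counts maps node id -> {class label: activation count}."""
        labels = {}
        for node, counts in activation_counts.items():
            total = sum(counts.values())
            if total < ct_min:
                continue                      # too few patterns hit this node
            best_class, best_count = max(counts.items(), key=lambda kv: kv[1])
            if best_count / total >= pct_min:
                labels[node] = best_class     # node represents this class
            # else: no class has a significant majority, so the node is discarded
        return labels

    # Example from the text: 85 class-A vs 15 class-B activations -> class A;
    # a 55/45 split has no significant majority and is discarded.
    counts = {"node_1": {"A": 85, "B": 15}, "node_2": {"A": 55, "B": 45}}
    print(assign_class_labels(counts))  # {'node_1': 'A'}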
3.5 Algorithm to Compute Separability Indices of Features for
Feature Ranking by Class
[0033] A summary of the overall steps of the online class-specific
feature selection algorithm for streaming data using Kohonen nets
follows, according to embodiments of the invention. Let KN.sub.max
be the total number of Kohonen nets that are created in parallel.
FG is the total number of different Kohonen net grid sizes used for
each feature partition, and FS=KN.sub.max/FG is the number of
feature partitions possible given the resources. For example, if
the algorithm uses grid sizes 9.times.9, 8.times.8, 7.times.7,
6.times.6 and 5.times.5, then FG is 5. For KN.sub.max=500,
FS=500/5=100, which means it can create a maximum of 100 feature
partitions. Let CC be the class count percentage of the most active
class at an active node and let PCT.sub.min be the minimum class
count percentage for a class to be assigned to an active node. In
computational testing, PCT.sub.min is set to 70% across all
problems. Let CW.sub.T be the cumulative weight change in a Kohonen
net over the last T streaming patterns. Here T is the length of a
streaming window to collect weight changes. Let CW.sub.T.sup.Max be
the maximum of the CW.sub.T since start of training of the Kohonen
net. Let CWR.sub.T (=CW.sub.T/CW.sub.T.sup.Max) be the ratio of the
current weight change to the maximum weight change. The method
continues to train a Kohonen net until this ratio CWR.sub.T falls
below a certain minimum level CWR.sub.T.sup.Min. In computational
testing, CWR.sub.T.sup.Min is set to 0.001. All notations are
summarized in Table 3.1, below.
TABLE-US-00001
TABLE 3.1 Summary of notations used in online feature selection algorithm
x: N-dimensional pattern vector, x = (X.sub.1, X.sub.2, . . . , X.sub.N)
N: Size of pattern vector x
X.sub.n: n.sup.th element of the vector x
KN.sub.q.sup.g: The g.sup.th Kohonen net of a certain grid size for the q.sup.th feature partition, q = 1 . . . FS, g = 1 . . . FG
FS: Total number of feature partitions
FG: Total number of different Kohonen net grid sizes
kc: Total number of classes
k: Denotes a class in the set of classes {1, 2, . . . , k, . . . kc}
FP.sub.q: The q.sup.th feature partition, q = 1 . . . FS
KN.sub.max: Total number of Kohonen nets that can be created in parallel
CC: The class count percentage of the most active class at an active node
PCT.sub.min: Minimum required percentage of class counts for a class in order to assign an active node to that class
CW.sub.T: The cumulative weight change in a Kohonen net over the last T streaming patterns
T: The length of a window, in terms of number of streaming patterns, to collect weight changes
CW.sub.T.sup.Max: The maximum of CW.sub.T since the start of training of a Kohonen net
CWR.sub.T: The ratio of the current weight change to the maximum weight change, = CW.sub.T/CW.sub.T.sup.Max
CWR.sub.T.sup.Min: Minimum CWR.sub.T used as the stopping criterion (preset); continue training a Kohonen net if CWR.sub.T > CWR.sub.T.sup.Min
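The CWR.sub.T stopping criterion summarized in Table 3.1 can be tracked with a small helper like the one below. The window length T and the use of consecutive non-overlapping windows (rather than a strict sliding window) are assumptions made for this sketch; the 0.001 threshold follows the value quoted above.

    class ConvergenceMonitor:
        """Tracks CWR_T = CW_T / CW_T^Max and reports convergence when it
        falls to CWR_T^Min or below (illustrative sketch)."""

        def __init__(self, window_t=1000, cwr_min=0.001):
            self.window_t = window_t     # T: window length (assumed value)
            self.cwr_min = cwr_min       # CWR_T^Min
            self._changes = []           # weight-change magnitudes in the window
            self.cw_max = 0.0            # CW_T^Max since start of training
            self.converged = False

        def update(self, weight_change_magnitude):
            """Call once per streaming pattern with the magnitude of the weight
            update applied to the Kohonen net for that pattern."""
            self._changes.append(weight_change_magnitude)
            if len(self._changes) < self.window_t:
                return self.converged
            cw_t = sum(self._changes)                 # CW_T for this window
            self._changes.clear()
            self.cw_max = max(self.cw_max, cw_t)      # CW_T^Max
            cwr_t = cw_t / self.cw_max if self.cw_max > 0 else 1.0
            self.converged = cwr_t <= self.cwr_min    # stop training when small
            return self.converged

    # Usage: monitor = ConvergenceMonitor(); stop = monitor.update(delta_norm)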
3.6 Algorithm for Class-Specific Feature Selection from Streaming
Data Using Kohonen Nets
[0034] Step 1. Process some streaming data to find the approximate
maximum and minimum values of each feature. Use the range to
normalize streaming input patterns during subsequent processing.
The input vector x is assumed to be normalized in this algorithm.
(Note: Other methods of data normalization can also be used in
this algorithm, according to embodiments of the invention.)
[0035] Step 2. Randomly partition the N features into FS subsets
(FS=KN.sub.max/FG) where each partition is denoted by FP.sub.q, q=1
. . . FS.
[0036] Step 3. Initialize the weights and learning parameters of
the KN.sub.max Kohonen nets that will be trained in parallel, where
FG is the number of Kohonen nets of different grid sizes for each
feature partition FP.sub.q, q=1 . . . FS.
[0037] Step 4. Train all KN.sub.max Kohonen nets in parallel using
streaming data and selecting appropriate parts of the input pattern
vector for each Kohonen net according to the feature subset
assigned to it. Stop training when all the Kohonen nets converge
and their CWR.sub.T ratios are at or below CWR.sub.T.sup.Min.
[0038] Step 5. Process some more streaming data through the
stabilized Kohonen nets, without changing the weights, to find the
active nodes (winning neurons) and their class counts. Stop when
class count percentages at all active nodes converge and are
stable.
[0039] Step 6. Assign each active node (neuron) to a class if the
class count percentage CC for the most active class at that node is
>=PCT.sub.min. Discard active neurons that do not satisfy the
PCT.sub.min requirement or have a low total class count.
[0040] Step 7. Create a list of the remaining active nodes by class
for each feature partition FP.sub.q, q=1 . . . FS.
[0041] Step 8. Compute the separability indices of the features
separately for each feature partition FP.sub.q, q=1 . . . FS.
Compute the separability indices of the particular features in a
feature partition using the remaining active neurons for that
feature partition only. Those remaining active neurons, which have
been assigned to classes, are representative examples of the
classes.
[0042] Step 9. Repeat steps 2 through 8 a few times, according to
one embodiment, and track the maximum separability index value of
each feature.
[0043] Step 10. Rank features on the basis of their maximum
separability index value.
3.7 Discard all Kohonen Nets Built for Feature Selection
[0044] After class-specific ranking of features, embodiments of the
invention train the final set of Kohonen nets for classification.
At this point, the process discards all Kohonen nets built so far
for feature ranking by class.
3.8 Example of Class-Specific Feature Selection
[0045] The method according to embodiments of the invention was
tested on a number of high-dimensional gene expression problems.
One such problem tries to predict the leukemia type (AML or ALL)
from gene expression values (Golub et al. 1999). There are a total
of 72 samples and 7129 genes (features) in this dataset. Table 3.2
shows a few of the genes and their separability indices by class.
For example, genes 758, 1809 and 4680 have high separability
indices for the AML class (82.53, 75.25 and 39.73 respectively) and
are good predictors of the AML class. Comparatively, the
corresponding separability indices of the same genes for the ALL
class are quite low (2.49, 1.85 and 2.82 respectively) and, hence,
these three genes are not very good predictors of the ALL class.
Table 3.2 also shows three genes that are good predictors of the
ALL class (2288, 760 and 6182) since they have high separability
indices for the ALL class (114.75, 98.76 and 34.15). However, they
are not good predictors of the AML class as shown by their low
separability indices of 0.85, 0.93 and 0.8. This example
illustrates the power of class-specific feature selection and its
potential usefulness in understanding a particular phenomenon and
in building classifiers.
TABLE-US-00002
TABLE 3.2 Separability indices for a few features in the AMLALL gene expression dataset
Separability Indices by Class
Gene Number | AML | ALL
AML Good Features
758 | 82.53 | 2.49
1809 | 75.25 | 1.85
4680 | 39.73 | 2.82
ALL Good Features
2288 | 0.85 | 114.75
760 | 0.93 | 98.76
6182 | 0.8 | 34.15
3.9 Feature Spaces to Explore and Build Classifiers for--Buckets of
Features
[0046] According to embodiments of the invention, in the next
phase, the method constructs classifiers exploiting the
class-specific feature rankings produced in this phase. Section 4
presents a heuristic search procedure that explores different
feature spaces to obtain a good classifier, according to
embodiments of the invention. The procedure can be parallelized and
such a version has been implemented on Apache Spark. This section
explains the basic concept of buckets of features.
[0047] The separability index of a feature for a particular class
measures the ability of that feature to create a separation between
that class and the rest of the classes and also to make that
particular class compact. And higher the value of the index,
greater is the ability of the feature to separate that class from
the rest of the classes. Thus, a feature that has an index value of
100 for a particular class is a much better and more powerful
feature than another that has an index value of 2. Thus, in Table
3.2, the first three features--758, 1809 and 4680--are very good
features for the AML class compared to the other three. And,
similarly, the last three features--2288, 760, and 6182--are very
good features for the ALL class compared to the other three.
[0048] A description follows of how the method, according to
embodiments of the invention, explores different feature spaces
given the class-specific feature rankings. In general, the process
creates buckets of features and then trains several Kohonen nets of
different grid sizes (e.g. 9.times.9, 8.times.8 and so on) for the
feature spaces contained in the buckets. The simplest version of the procedure for creating buckets of features works as follows. For the first bucket, select the top ranked feature of each class. For the second bucket, select the top two ranked features of each class, and similarly create other buckets. The procedure, therefore, sequentially adds top ranked features of each class to create the buckets. Thus, the j.sup.th bucket of features will have j top ranked features from each class.
[0049] With reference to the class-specific feature rankings of
Table 3.2, to further illustrate the notion of bucket of features,
suppose that the features (genes) 758, 1809 and 4680 are the three
top ranked features of the AML class and the features 2288, 760 and
6182 the top three of the ALL class. In one embodiment of the
invention, for bucket creation, features 758 (of the AML class) and
2288 (of the ALL class) will be in the first bucket of features.
Features 758 and 1809 (of the AML class) and features 2288 and 760
(of the ALL class) will be in the second bucket, and so on. For
this two-class problem, each bucket will have three feature spaces
to explore. For the second bucket, for example, features 758 and
1809 of the AML class comprise one feature space. Features 2288 and
760 of the ALL class comprise the second feature space, and the
third feature space consists of all four of these features.
[0050] For bucket two, for example, the method will train Kohonen
nets of different grid sizes for three different feature
spaces--(1) for the AML features 758 and 1809, (2) for the ALL
features 2288 and 760, and (3) for the combined feature set
consisting of the features 758, 1809, 2288 and 760. In general, for
kc total classes, the j.sup.th bucket of features will have j top
ranked features from each of the kc classes and the method will
construct Kohonen nets of different grid sizes in parallel for each
class using its corresponding j top ranked features, according to
embodiments of the invention. The process will also train another
set of Kohonen nets for the j.sup.th bucket using all kc*j features
in the bucket. A bucket, therefore, consists of a variety of
feature spaces and the method trains a variety of Kohonen nets for
each such feature space.
[0051] When there are thousands of features, the incremental
addition of one feature per class to the next bucket and training
several Kohonen nets for each bucket can become a computationally
expensive procedure even when parallelized. So, instead, according
to one embodiment, the method adds more than one feature at a time
to the next bucket to reduce the computational work. The other
issue is how to limit the total number of buckets to explore. To
address that issue, the method ignores features with separability
indices lower than 1, according to an embodiment. The assumption is
that features with indices less than 1 are poor discriminators for
the class. With the remaining features, the process uses increments
of more than one and it generally adds two or more features at a
time to the next bucket from the ranked list. For example, given
two classes A and B, and after class-specific feature ranking,
suppose the method selects 50 top-ranked features from each class
to build classifiers. If the process adds two features at a time
from each class to the buckets, there will be 25 total buckets. The
first bucket will have 4 features, 2 from each class, and the last
bucket will have all 100 features, 50 from each class. However,
each bucket will always have three feature spaces--one for class A
features, one for class B features, and the third for the combined
set of features. And the method will construct Kohonen nets of
different grid sizes for each of the three feature spaces in each
bucket, according to embodiments of the invention.
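The bucket-construction procedure described in this section can be sketched as follows; the function and field names are hypothetical, and the example reuses the AML/ALL genes of Table 3.2 purely for illustration.

    def make_buckets(ranked_features_by_class, inc=2, max_buckets=None):
        """Build buckets of top-ranked features per class, as described above.
        ranked_features_by_class maps class label -> [features, best first].
        Bucket j holds the top inc*j features of each class; each bucket yields
        one feature space per class plus the combined feature space."""
        shortest = min(len(v) for v in ranked_features_by_class.values())
        n_buckets = shortest // inc
        if max_buckets is not None:
            n_buckets = min(n_buckets, max_buckets)
        buckets = []
        for j in range(1, n_buckets + 1):
            per_class = {k: feats[: inc * j]
                         for k, feats in ranked_features_by_class.items()}
            combined = [f for feats in per_class.values() for f in feats]
            # Feature spaces to train Kohonen nets on: each class's own set,
            # plus the combined set of all classes' features in this bucket.
            buckets.append({"per_class": per_class, "combined": combined})
        return buckets

    # Example using the top three AML and ALL genes from Table 3.2.
    ranked = {"AML": [758, 1809, 4680], "ALL": [2288, 760, 6182]}
    for j, b in enumerate(make_buckets(ranked, inc=1), start=1):
        print(f"bucket {j}:", b["per_class"], "| combined:", b["combined"])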
4. CONSTRUCTING ENSEMBLE OF HYPERSPHERE CLASSIFIERS WITH KOHONEN
NEURONS
[0052] The following describes how the method creates an ensemble
of hypersphere nets using Kohonen neurons, according to embodiments
of the invention. The method does this by selecting Kohonen neurons
from various Kohonen nets in different feature spaces and of
different grid sizes. Note that, at the end of this final phase,
the method discards all of the trained Kohonen nets and retains
only a selected set of Kohonen neurons to serve as hyperspheres in
different hypersphere nets.
4.1 Train Final Kohonen Nets for Different Feature Spaces in
Different Buckets--Assign Neurons (Active Nodes) to Classes on a
Majority Basis
[0053] In this phase, the method trains a final set of Kohonen nets
using streaming data. The method constructs Kohonen nets of
different grid sizes (e.g. 9.times.9, 8.times.8 and so on) for
different feature spaces in different buckets. The method
constructs all of these Kohonen nets in parallel and selects
appropriate parts of the input vector for each feature space in
each bucket. Once these Kohonen nets converge and stabilize, the
method processes some more streaming data, without changing the
weights of the Kohonen nets, to get the class count percentages at
each active node. This step is identical to the one in the feature
selection phase. After the class count percentages stabilize at
each of the active nodes, the process does some pruning of the
active nodes. This step, again, is similar to the one in the
feature selection phase. Thus, active nodes with small total class
counts are discarded and only the nodes where one class has a clear
majority (e.g. 70% majority) are retained. In a sense, the process
selects the good neurons where a particular class has a clear
majority.
4.2 Computing Radius of a Kohonen Neuron--Each Kohonen Neuron is a
Hypersphere
[0054] Let the radius of an active node (neuron) be the distance from the node's center within which it is the winning neuron for its assigned class.
This concept is important to the method because it only retains a
few active nodes from a Kohonen net and discards the rest of the
network. Once the method discards the rest of the Kohonen net,
there is no feasible way to determine the winning or best neuron for an input pattern. In the absence of the Kohonen net, the
radius becomes the substitute way to determine if an active node is
the winning node for a class. The method creates hyperspheres by
extracting these neurons from the Kohonen nets.
[0055] To determine the radii of the active nodes of a Kohonen net,
the process initializes the radii of these active nodes to zero and
then updates them by processing some more streaming data until the
radii are stable for all active nodes. The method updates the
radius of an active node in the following way: if the streaming
input pattern has the same class as the class of the winning active
node, the method computes the distance of the input pattern from
the node and updates the radius of the node if the distance is
greater than the current radius. Note that before the process
computes the radius, classes are assigned to these active nodes. So
the process matches the class of the winning active node with that
of the input pattern before updating the radius of the active node.
The process updates the radii of all active nodes that are assigned
to classes before discarding the Kohonen nets.
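The radius update described above amounts to tracking, for each labeled node, the largest distance to a same-class pattern for which that node wins. The sketch below illustrates this; treating the nearest retained center as the winner (rather than the full Kohonen net, which still exists at this stage) is a simplifying assumption of the sketch.

    import numpy as np

    def update_radii(active_nodes, stream):
        """Estimate hypersphere radii for class-labeled active nodes by running
        more data through the frozen net. Each node is a dict with 'center'
        (weight vector), 'label', and 'radius' (initialized to 0). stream yields
        (pattern_vector, class_label) pairs. Illustrative sketch only."""
        for x, label in stream:
            # Winning active node for this pattern (assumed: nearest center).
            winner = min(active_nodes,
                         key=lambda n: np.linalg.norm(x - n["center"]))
            # Only patterns of the node's own class can enlarge its radius.
            if winner["label"] == label:
                dist = float(np.linalg.norm(x - winner["center"]))
                if dist > winner["radius"]:
                    winner["radius"] = dist
        return active_nodes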
4.3 Further Notations and Computation of Parameters
[0056] Let B.sub.max be the maximum number of feature buckets. The
method computes B.sub.max based on the resources available. As
before, suppose there are resources to create KN.sub.max number of
Kohonen nets in parallel. And, as before, let FG be the number of
Kohonen nets of different grid sizes (e.g. 9.times.9, 8.times.8 and
so on) trained for a feature space. For kc total classes, the
maximum number of feature buckets will therefore be
B.sub.max=KN.sub.max/(FG*(kc+1)). For example, suppose Apache
Spark, in a particular configuration, can only create 300 Kohonen
nets in parallel (KN.sub.max=300), and suppose the method uses
standard grid sizes 9.times.9, 8.times.8, 7.times.7, 6.times.6 and
5.times.5 (FG=5) for the Kohonen nets. If there are 2 classes
(kc=2), B.sub.max=300/(5*3)=20. This means that the method can only
use 20 feature buckets due to resource constraints. Suppose that,
after class-specific feature ranking, the method selects 60
top-ranked features from each class to build classifiers. Since the
process can only use 20 feature buckets, it is, therefore, forced
to add 3 features at a time to the buckets from each class. Thus,
the first bucket will have 6 features, 3 from each class, and the
last bucket will have all 120 features, 60 from each class. And
each bucket will always have three feature spaces--one for each
class and the third for the combined set of features. And there
will be FG Kohonen nets of different grid sizes for each of the
three feature spaces in each bucket.
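The resource calculation just worked through, B.sub.max=KN.sub.max/(FG*(kc+1)) together with the per-class increment implied by the number of top-ranked features, can be reproduced with a few lines; the function name is hypothetical and simply restates the arithmetic of the example.

    import math

    def bucket_parameters(kn_max, fg, kc, top_features_per_class):
        """Maximum number of feature buckets and per-class feature increment,
        given the available parallel resources (illustrative sketch)."""
        b_max = kn_max // (fg * (kc + 1))
        # Features added per class per bucket so the last bucket holds them all.
        inc = math.ceil(top_features_per_class / b_max)
        return b_max, inc

    # Worked example from the text: KN_max=300, FG=5, kc=2, 60 features per class.
    print(bucket_parameters(300, 5, 2, 60))  # -> (20, 3)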
[0057] Let Inc be the number of features added each time to a
bucket for each class. Inc is calculated from the number of
top-ranked features to use from each class and B.sub.max. Let
FB.sub.j be the j.sup.th bucket of features, j=1 . . . B.sub.max.
Let FSB.sub.kj represent the set of features belonging to class k
in bucket number j. Let AN.sub.kj be the list of active nodes
across all FG grid sizes for class k feature set FSB.sub.kj for
bucket j, k=1 . . . kc, j=1 . . . B.sub.max. Let AN.sub.kji be the
i.sup.th active node for class k feature set FSB.sub.kj, i=1 . . .
ANT.sub.kj, where ANT.sub.kj is the total number of such active
nodes for class k feature set FSB.sub.kj. Note that although the
active nodes AN.sub.kj resulted from Kohonen nets built with the
class k feature set in bucket j, these active nodes could belong to
(that is, be assigned to) any of the classes k, k=1 . . . kc. Let
W.sub.kji be the width or radius of the i.sup.th active node
AN.sub.kji. Let CTP.sub.kjjm, m=1 . . . kc, be the class count
percentage of the m.sup.th class at active node AN.sub.kji and let
CTA.sub.kji be the absolute count of input patterns processed at
that active node. All notations are summarized in Table 4.1,
below.
TABLE 4.1 Summary of notations used in the online Kohonen ensemble algorithm
x: the N-dimensional pattern vector, x = (X.sub.1, X.sub.2, . . . , X.sub.N)
N: size of the pattern vector x
X.sub.n: the n.sup.th element of the vector x
FG: number of Kohonen nets of different grid sizes trained for a feature space
kc: total number of classes
k: denotes a class in the set of classes {1, 2, . . . , k, . . . kc}
KN.sub.max: total number of Kohonen nets that can be created in parallel
PCT.sub.min: minimum required percentage of class counts for a class in order to assign an active node to that class
CT.sub.min: minimum required absolute class count for a class in order to assign an active node to that class
CW.sub.T: the cumulative weight change in a Kohonen net over the last T streaming patterns
CW.sub.T.sup.Max: the maximum of CW.sub.T since the start of training of a Kohonen net
CWR.sub.T: the ratio of the current weight change to the maximum weight change, = CW.sub.T/CW.sub.T.sup.Max
CWR.sub.T.sup.Min: continue training a Kohonen net if CWR.sub.T > CWR.sub.T.sup.Min; this is the convergence criterion
FB.sub.j: the j.sup.th bucket of features, j = 1 . . . B.sub.max
B.sub.max: the maximum number of feature buckets allowed
Inc: the number of features from each class added to a bucket each time
FSB.sub.kj: the set of features selected for class k in bucket j, k = 1 . . . kc, j = 1 . . . B.sub.max
AN.sub.kj: the list of active nodes across all FG grid sizes for class k feature set FSB.sub.kj
ANT.sub.kj: total number of active nodes for class k feature set FSB.sub.kj
AN.sub.kji: the i.sup.th active node for class k feature set FSB.sub.kj, i = 1 . . . ANT.sub.kj
W.sub.kji: width or radius of the i.sup.th active node AN.sub.kji
CTP.sub.kjim: the class count percentage of the m.sup.th class at active node AN.sub.kji, m = 1 . . . kc
CTA.sub.kji: the absolute count of input patterns processed at active node AN.sub.kji
4.4 Algorithm to Train the Final Set of Kohonen Nets for
Classification
[0058] Step 1. Initialize bucket number j to zero.
[0059] Step 2. Increment bucket number j (j=j+1) and add (Inc*j)
number of top ranked features to bucket FB.sub.j from the ranked
feature list of each class k (k=1 . . . kc). FSB.sub.kj is the set
of (Inc*j) top ranked features of class k in bucket j.
[0060] Step 3. Initialize final Kohonen nets, in parallel in a
distributed computing system, of FG different grid sizes for each
class k (k=1 . . . kc) and for the corresponding feature set
FSB.sub.kj. Also initialize FG Kohonen nets for a feature set that
includes all of the features from all classes in bucket j. If
j<B.sub.max, go back to step 2 to set up other Kohonen nets for
other feature buckets. When j=B.sub.max, go to step 4.
[0061] Step 4. Train all KN.sub.max Kohonen nets in parallel using
streaming data and selecting appropriate parts of the input pattern
for each Kohonen net according to the feature subsets FSB.sub.kj,
k=1 . . . kc, j=1 . . . B.sub.max. Stop training when all Kohonen
nets converge, that is, when CWR.sub.T<=CWR.sub.T.sup.Min for
all Kohonen nets.
[0062] Step 5. Process some more streaming data through the
stabilized Kohonen nets, without changing the weights, to find the
set AN.sub.kj of active nodes (neurons) in the corresponding
Kohonen nets for each class k in each bucket j (k=1 . . . kc, j=1 .
. . B.sub.max). Also find the set of active nodes for the Kohonen
net that uses all features of all classes in bucket j, j=1 . . .
B.sub.max. In addition, get the class counts CTA.sub.kji of the
active nodes and stop when the class count percentages
CTP.sub.kjim, m=1 . . . kc, become stable for all active nodes.
[0063] Step 6. Assign each active node AN.sub.kji to the majority
class m if the class count percentage CTP.sub.kjim, m=1 . . . kc,
for the majority class m at that active node is above the minimum
threshold PCT.sub.min and the absolute class count CTA.sub.kji is
above the threshold CT.sub.min.
[0064] Step 7. Process some more streaming data to compute the
radius W.sub.kji of each active node AN.sub.kji. Stop when the
radii or widths become stable.
[0065] Step 8. Retain only the active nodes AN.sub.kj, k=1 . . .
kc, j=1 . . . B.sub.max, from the corresponding Kohonen nets that
satisfy the minimum thresholds PCT.sub.min and CT.sub.min. Also
retain the active nodes from the Kohonen nets based on all features
of all classes in bucket j, j=1 . . . B.sub.max, and that satisfy
the minimum thresholds. Discard all other nodes from all of the
KN.sub.max Kohonen nets.
[0066] This algorithm produces a set of active Kohonen neurons for
each bucket j, j=1 . . . B.sub.max, and each Kohonen neuron is
assigned to a specific class. The process then tests the ensemble
of Kohonen neurons in each bucket with validation data sets to find
out which bucket (or buckets) of features produces the best
classifier.
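By way of illustration, the node-labeling part of this algorithm (steps 5, 6 and 8) could be sketched as follows; the winner() function, the dictionary-based counters and the default thresholds are assumptions for the example only.

```python
from collections import Counter

def label_active_nodes(net_winner, stream, pct_min=0.70, ct_min=3):
    """Label the active nodes of a stabilized Kohonen net by majority class.

    net_winner: function mapping an input pattern to the id of the winning node.
    stream: iterable of (pattern, class_label) pairs processed with frozen weights.
    Returns {node_id: majority_class} for nodes passing both thresholds."""
    counts = {}                                   # node_id -> Counter of class counts
    for pattern, cls in stream:
        counts.setdefault(net_winner(pattern), Counter())[cls] += 1

    labels = {}
    for node, ctr in counts.items():
        total = sum(ctr.values())                 # patterns processed at the node
        majority_class, majority_count = ctr.most_common(1)[0]
        # Keep the node only if the majority class is dominant enough (PCT_min)
        # and the node has enough absolute support (CT_min).
        if majority_count / total >= pct_min and total >= ct_min:
            labels[node] = majority_class
    return labels
```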
5. AN ENSEMBLE CLASSIFIER BASED ON AN ENSEMBLE OF KOHONEN
NEURONS
[0067] The algorithm of section 4.4 essentially creates an ensemble
of hypersphere nets--one hypersphere net for each of the distinct
feature spaces corresponding to a particular class k and a
particular bucket j, plus one combining all of the features in
bucket j. A hypersphere net consists of one hidden layer and one
output layer as shown in FIG. 1. Each hidden node in a hypersphere
net represents one of the hyperspheres. The design and training of
a hypersphere net consists of 1) determining the hyperspheres to
use for each class k, and 2) finding their centers and widths or
radii. In this method, each active Kohonen neuron, determined
through the algorithm of section 4.4 and assigned to a class, is a
hypersphere in a hypersphere net. So, the method essentially
creates an ensemble of hypersphere nets by means of Kohonen
nets.
5.1 Computations at a Single Hidden Node of a Hypersphere Net
[0068] Suppose p.sub.k hyperspheres cover the region of a certain
class k, k=1 . . . kc. The class region, therefore, is the union of
all the hypersphere regions representing class k. Generally, the
output of a hidden node is one when the input vector is within the
region of the hypersphere and zero otherwise. Mathematically, the
functional form of a hypersphere hidden node is as follows:
$$f^k_q(x) = \begin{cases} 1 & \text{if } z^k_q(x) \geq \epsilon^k_q \\ 0 & \text{otherwise} \end{cases} \qquad (5.1)$$
$$z^k_q(x) = w^k_q - d^k_q(x) \qquad (5.2)$$
$$d^k_q(x) = \left( \sum_{n=1}^{nv} \left( c^k_{qn} - X_n \right)^2 \right)^{1/2} \qquad (5.3)$$
[0069] Here f.sup.k.sub.q(x) is the response function of the
q.sup.th hidden node for class k, q=1 . . . p.sub.k.
c.sup.k.sub.q=(c.sup.k.sub.q1 . . . c.sup.k.sub.qnv) and
w.sup.k.sub.q are the center and width (radius) of the q.sup.th
hypersphere for class k, nv is the number of features in that
particular feature space, d.sup.k.sub.q(x) is the distance of the
input vector x from the center c.sup.k.sub.q of the q.sup.th hidden
node, z.sup.k.sub.q(x) is the difference between the width (radius)
w.sup.k.sub.q and the distance d.sup.k.sub.q(x), and
.epsilon..sup.k.sub.q is a small constant. .epsilon..sup.k.sub.q
can be slightly negative to allow an input pattern x to belong to
class k if it is close enough to one of the hyperspheres of class
k. Let p.sup.k.sub.q(x) be a measure of the probability of an input
vector x being a member of (i.e. being inside the boundary of) the
q.sup.th hypersphere for class k, q=1 . . . p.sub.k.
$$p^k_q(x) = \left( w^k_q - d^k_q(x) \right) / w^k_q \qquad (5.4)$$
[0070] If an input vector x is at the boundary of the q.sup.th
hypersphere, d.sup.k.sub.q(x)=w.sup.k.sub.q and, therefore,
p.sup.k.sub.q(x)=0. If it's at the center of the hypersphere,
d.sup.k.sub.q(x)=0 and p.sup.k.sub.q(x)=1. If an input vector x is
outside the boundary of the q.sup.th hypersphere,
d.sup.k.sub.q(x)>w.sup.k.sub.q and p.sup.k.sub.q(x) is
negative.
[0071] Let dn.sup.k.sub.q(x) be the normalized distance, defined as follows:
$$dn^k_q(x) = d^k_q(x) / nv \qquad (5.5)$$
[0072] where nv is the number of features in that particular
feature space.
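Equations (5.1)-(5.5) translate directly into code; the following is a minimal sketch of the computations at one hidden node, with assumed parameter names.

```python
import math

def hypersphere_node(x, center, width, eps=0.0):
    """Evaluate one hypersphere hidden node for input vector x.

    center, width: the node's center c and width (radius) w.
    eps: the small constant of (5.1); it may be slightly negative so that
         points just outside the hypersphere are still accepted."""
    nv = len(center)
    d = math.dist(center, x)        # (5.3) Euclidean distance from the center
    z = width - d                   # (5.2)
    f = 1 if z >= eps else 0        # (5.1) binary activation
    p = (width - d) / width         # (5.4) probability-like membership measure
    dn = d / nv                     # (5.5) normalized distance
    return f, p, dn

# A point at the center gives p = 1; a point on the boundary gives p = 0.
print(hypersphere_node([0.0, 0.0], center=[0.0, 0.0], width=2.0))
print(hypersphere_node([2.0, 0.0], center=[0.0, 0.0], width=2.0))
```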
5.2 Ensemble Classifiers
[0073] In general, combining multiple classifiers can improve
overall performance on a problem. Such ensemble methods can be categorized in many different ways, and the base classifiers can be combined in a variety of ways for the final prediction. Popular combining schemes include majority voting (selecting the class with the highest number of votes), performance weighting of base classifiers (weighting each classifier by its accuracy), distribution summation (summing the conditional probability vectors from the base classifiers), mathematical programming, and many others. As explained in the next
section, the method, according to embodiments of the invention,
uses different variations of these combining methods on the
ensemble of hypersphere classifiers.
5.3 Using Ensemble of Kohonen Neurons in Different Feature Spaces
for Classification
[0074] In embodiments of the invention, all hyperspheres (i.e. all
active Kohonen neurons assigned to classes) are considered as being
part of a composite classifier although, within that framework,
each hypersphere is treated as an independent predictor. And, along
with standard ensemble prediction schemes such as maximum
probability and minimum distance, a particular voting mechanism is
used that has worked well. In general, one can try different
methods of combining classifier predictions and find the best
method for a particular problem. Embodiments of the invention use
the following measures to determine the final classification of a
test example.
[0075] a. Maximum Probability--in this embodiment, the method finds
the hypersphere (i.e. an active Kohonen neuron) with the highest
probability (or confidence) using the probability estimate
p.sup.k.sub.q(x) of (5.4) and assigns its class to the test example.
Points outside the boundary of a hypersphere have a negative
probability p.sup.k.sub.q(x) and negative probabilities are allowed
up to a limit. The maximum probability, therefore, can be negative.
The computational testing procedure used a limit of -0.5 for
probability.
[0076] b. Minimum Distance--according to this embodiment, the
method finds the hypersphere (i.e. an active Kohonen neuron) whose
center is closest to the test example based on the normalized
distance dn.sup.k.sub.q(x) of (5.5) and assigns its class to the
test example.
[0077] c. Inside neuron majority voting--according to this
embodiment, the method first determines if the test example is
within the boundary of a hypersphere or not based on the normalized
distance dn.sup.k.sub.q(x) and, if it is, it counts it as a vote
for the class represented by that hypersphere. After testing
against all of the hyperspheres (i.e. all active Kohonen neurons),
the embodiment counts the votes for each class and the majority
class wins, and then assigns the majority class to the test
example.
[0078] d. Majority voting with test points allowed to be outside
the hyperspheres--In many problems, some test examples may be
outside the boundary of a hypersphere, but otherwise close to that
hypersphere. In such case, the method allows test examples that are
close to a hypersphere to vote for the class represented by that
hypersphere. However, the method sets a limit on how far outside
the hypersphere a test point can be. According to one embodiment,
the method uses the probability measure p.sup.k.sub.q(x) instead of
the normalized distance measure dn.sup.k.sub.q(x). One can test
with various limits on a given problem and find out which limit
produces the best accuracy. The computational testing procedure
used two limits, -0.25 and -0.35, but other limits can be tried.
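By way of illustration, the four combining rules above could be sketched as follows, assuming each hypersphere is represented as a (center, width, class) triple; this is an outline of the ideas, not the exact tested implementation.

```python
import math
from collections import Counter

def predict(x, hyperspheres, rule='max_probability', limit=-0.25):
    """hyperspheres: list of (center, width, class_label) triples.
    rule: 'max_probability', 'min_distance', 'vote_inside' or 'vote_with_limit'."""
    scored = []
    for center, width, cls in hyperspheres:
        d = math.dist(center, x)
        p = (width - d) / width          # probability measure (5.4)
        dn = d / len(center)             # normalized distance (5.5)
        scored.append((p, dn, cls))

    if rule == 'max_probability':        # rule (a): most confident hypersphere wins
        return max(scored, key=lambda s: s[0])[2]
    if rule == 'min_distance':           # rule (b): closest center wins
        return min(scored, key=lambda s: s[1])[2]
    if rule == 'vote_inside':            # rule (c): only points inside (p >= 0) vote
        votes = Counter(cls for p, _, cls in scored if p >= 0)
    else:                                # rule (d): points slightly outside may also vote
        votes = Counter(cls for p, _, cls in scored if p >= limit)
    return votes.most_common(1)[0][0] if votes else None
```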
6. COMPUTATIONAL RESULTS
6.1 Datasets
[0079] Gene expression problems are characterized by high
dimensionality (the number of features or genes is usually a few
thousand) and a small number of training examples (typically just a
few dozen). Computational testing of the algorithm according to
embodiments of the invention was performed on seven widely
referenced gene expression datasets. These datasets are briefly
described below.
[0080] a) Leukemia (AML vs. ALL): The leukemia dataset was
published by Golub et al. (1999) and the original training data
consists of 38 bone marrow samples (27 ALL and 11 AML, where AML
and ALL are the two types of leukemia) and 7129 genes. It also has
34 test samples with 20 ALL and 14 AML.
[0081] b) Central Nervous System (CNS): The CNS dataset is from
Pomeroy et al. (2002) and is about creating gene expression
profiles of patients who survive a certain treatment for embryonal tumors of the central nervous system versus those who do not. The
dataset contains 60 patient samples with 7129 genes where 21 are
survivors and 39 are failures.
[0082] c) Colon Tumor: The two-class gene expression data for adult
colon cancer is from Alon et al. (1999). It contains 62 samples based on the expression of 2000 genes, including 40 tumor biopsies ("negative") and 22 normal biopsies ("positive") from the same patients.
[0083] d) SRBCT: The four-class gene expression data for diagnosing
small round blue-cell tumors (SRBCT) is from Khan et al. (2001).
The dataset contains 63 samples of these four different types of
tumors and has expression values for 2308 genes.
[0084] e) Lymphoma: The three-class gene expression data for
non-Hodgkin's lymphoma is from Alizadeh et al. (2000). The dataset
contains 62 samples of three different subtypes of lymphoma and has
expression values for 4026 genes.
[0085] f) Prostate: The data here consists of 102 prostate tissues from patients undergoing surgery, of which 50 are normal and 52 are tumor samples (Singh et al. 2002). It has expression values for
6033 genes.
[0086] g) Brain: This dataset, also from Pomeroy et al. (2002),
contains microarray expression data for 42 brain cancer samples for
5 different tumor subtypes. It has expression values for 5597
genes.
[0087] Table 6.1 summarizes the main characteristics of these
datasets. For all of these problems, the original training and test
data were combined, and random sub-sampling was then used to generate the training and testing sets (Stone 1974): nine-tenths of the available data were randomly selected for training and the remainder used for testing. This random allocation was repeated 50 times for each dataset, and this section reports the average results of the 50 runs. The implementation for these fixed datasets simulated online
learning by reading one input pattern at a time.
TABLE 6.1 Characteristics of the gene expression problems
Dataset: no. of genes, no. of classes, no. of examples
Leukemia (AML-ALL): 7129, 2, 72
Central Nervous System: 7129, 2, 60
Colon Tumor: 2000, 2, 62
SRBCT: 2308, 4, 63
Lymphoma: 4026, 3, 62
Prostate: 6033, 2, 102
Brain: 5597, 5, 42
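The random sub-sampling protocol described above can be expressed in a few lines; the following sketch is an assumption about the splitting mechanics, not the original test harness.

```python
import random

def subsample_runs(examples, runs=50, train_fraction=0.9, seed=0):
    """Yield (train, test) splits by repeated random sub-sampling.

    examples: list of (pattern, label) pairs (original training and test data combined).
    Nine-tenths of the data go to training, the remainder to testing."""
    rng = random.Random(seed)
    for _ in range(runs):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        yield shuffled[:cut], shuffled[cut:]
```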
6.2 Parameter Settings
[0088] There was no fine-tuning of parameters for any of the
problems solved with this method. For this set of problems, the
parameters were set as follows. For Kohonen nets, the grid sizes
used were 9.times.9, 8.times.8, 7.times.7, 6.times.6, 5.times.5,
4.times.4 and 3.times.3(FG=7). The thresholds PCT.sub.min and
CT.sub.min were set to 70% and 3 respectively. CWR.sub.T.sup.Min
was set to 0.001.
6.3 Experimental Results--Feature Selection
[0089] Table 6.2 shows the average number of features used by this
method for the gene expression problems. For these problems, an
important challenge is to discover a small set of genes (features)
responsible for a disease (or successful treatment) so that they
can be further investigated for a better understanding of the
disease (or treatment) (Kim et al. 2002). Identifying a small set
of genes for disease type diagnosis and treatment also reduces the
cost of clinical tests. As shown in Table 6.2, the method is fairly
good at identifying a small set of genes (features) among the
thousands.
TABLE 6.2 Average number of features used by the Kohonen neuron ensemble method for the gene expression problems
Dataset: total no. of attributes, average no. of features used in ensemble of Kohonen neurons, % of features used on average
Leukemia (AML-ALL): 7129, 16, 0.22%
Central Nervous System: 7129, 4, 0.06%
Colon Tumor: 2000, 39, 1.95%
SRBCT: 2308, 10, 0.43%
Lymphoma: 4026, 39, 0.97%
Prostate: 6033, 40, 0.66%
Brain: 5597, 115, 2.06%
6.4 Experimental Evaluation of the Kohonen Neuron Ensemble
Classifier System
[0090] This section of the description presents the experimental
results for the Kohonen ensemble algorithm that consists of (1)
class-specific feature selection, and (2) training an ensemble of
Kohonen neurons for classification using the selected features. The
section compares the performance of the algorithm with other,
prior-art feature selection and classification algorithms and uses
results from Bu et al. (2007), Li et al. (2007) and Dettling (2004)
for the comparison. These gene expression problems were also solved
with Apache Spark's machine learning library MLlib (Apache Spark
MLlib 2015) for comparison.
[0091] Bu et al. (2007) provides experimental results on several
gene expression problems. They used PCA for dimension reduction,
genetic algorithm (GA) and backward floating search method (BFS)
for feature selection, and support vector machines (SVM) for
classification. For their tests, they randomly split the data into
2/3 for training and the rest for testing and repeated the
procedure 50 times. Table 6.3 shows the average error rates and
standard deviations for the various combinations of feature
extraction, feature selection and SVM. Table 6.3 also shows the
results for the Kohonen neuron ensemble algorithm.
TABLE 6.3 Average test error rates and standard deviations for various classification algorithms for the gene expression datasets. SVM results are from Bu et al. (2007).
Dataset: Kohonen ensemble, SVM, PCA+SVM, PCA+BFS+SVM, PCA+GA+SVM
Leukemia (AML-ALL): 0.57 (2.83), 8.13 (4.87), 6.83 (5.34), 6.43 (5.32), 4.17 (2.1)
Central Nervous System: 29.67 (9.26), 43.67 (7.07), 42.46 (4.45), 39.83 (5.5), 40.69 (6.16)
Colon Tumor: 11.33 (11.88), 31.75 (6.91), 29.83 (6.22), 24.4 (4.63), 23.61 (3.42)
[0092] Li et al. (2007) developed a method that combines
preliminary feature selection with partial least squares for
dimension reduction (PLSDR) (Dai et al. 2006) for these gene
expression problems. They used a linear SVM and a KNN method, with
K=1 for classification. They used stratified 10-fold cross
validation and Table 6.4 shows the average error rates and standard
deviations for the two classifiers along with Kohonen ensemble
results. Note that all reported SVM results, both in Bu et al.
(2007) and Li et al. (2007), are after fine-tuning of parameters
for each individual problem, whereas the Kohonen ensemble algorithm
used no such fine tuning. Overall, the Kohonen ensemble algorithm
performs well on these different gene expression problems when
compared against the various variations of SVM and feature
selection/extraction algorithms.
[0093] Table 6.5 compares the average number of features used in
the Kohonen ensemble algorithm with those used by the Gene
Selection+PLSDR method of Li et al. (2007). In the Gene
Selection+PLSDR method, genes are first eliminated based on a
t-statistic score. So Table 6.5 shows how many genes were used on
average for dimensionality reduction by PLSDR after the elimination
step. In discussing these results, Li et al. (2007) notes: "The
proposed method can greatly reduce the dimensionality, averagely
fifty percent genes were reduced from the full set." Note that
about fifty percent of the original set of genes is still used in
PLSDR dimensionality reduction and, therefore, interpretability of
results and the cost of gene tests are still a problem. Compared to
that, the Kohonen ensemble algorithm uses far fewer genes, has better interpretability, and would cost far less to perform the gene tests.
[0094] Dettling (2004) provides experimental results for a variety
of algorithms for most of these gene expression problems and Table
6.6 shows the average error rates. For his tests, he randomly split
the data into 2/3 for training and the rest for testing and
repeated the procedure 50 times. However, he selected balanced
training sets and that may have provided better results for some
algorithms. Note that the Kohonen ensemble algorithm didn't use
balanced training sets for testing. Dettling (2004) provides
confidence levels for the error rates in a graphical form, but they
are hard to decipher. Table 6.6, therefore, does not show standard
deviations of the error rates other than for the Kohonen ensemble
method.
TABLE 6.4 Average test error rates and standard deviations for various classification algorithms for the gene expression datasets. PLSDR results are from Li et al. (2007).
Dataset: Kohonen ensemble, PLSDR+SVM, PLSDR+kNN
Leukemia (AML-ALL): 0.57 (2.83), 2.82 (0.0), 2.37 (0.01)
Central Nervous System: 29.67 (9.26), 31.5 (0.04), 35.0 (0.02)
Colon Tumor: 11.33 (11.88), 16.45 (0.03), 24.31 (0.03)
TABLE 6.5 Average number of features used in the Kohonen ensemble algorithm and the Gene Selection+PLSDR method of Li et al. (2007)
Dataset: Kohonen ensemble, Gene Selection+PLSDR
Leukemia (AML-ALL): 16, 3966.80
Central Nervous System: 4, 3109.34
Colon Tumor: 39, 1141.39
TABLE 6.6 Average test error rates for various classification algorithms from Dettling (2004); standard deviations are shown only for the Kohonen ensemble
Dataset: Kohonen ensemble, BagBoost, Random Forest, SVM, kNN, DLDA, Boosting, Bagging, CART
Leukemia (AML-ALL): 0.57 (2.83), 4.08, 2.5, 3.5, 3.83, 2.92, 5.67, 7.17, 13.42
Colon Tumor: 11.33 (11.88), 16.10, 15.43, 16.67, 16.38, 12.86, 19.14, 16.86, 25.52
SRBCT: 1.11 (3.51), 1.24, 2.29, 1.81, 1.43, 2.19, 6.19, 19.33, 24.38
Lymphoma: 1.11 (3.51), 1.62, 1.43, 0.95, 1.52, 2.19, 6.29, 20.57, 20.48
Prostate: 6.0 (8.43), 7.53, 7.88, 6.82, 10.59, 14.18, 8.71, 8.94, 12.59
Brain: 24.0 (8.3), 23.86, 34.71, 28.14, 29.71, 28.57, 27.57, 49.0, 51.29
Apache Spark Machine Learning Library (MLlib) Comparisons
[0095] Table 6.7 shows the average error rates and standard
deviations for a variety of algorithms in the Apache Spark Machine
Learning Library MLlib (2015). SVMwithSGD (the SVM method) and
LogRegWithSGD (the logistic regression method) use the stochastic
gradient descent (SGD) method to train the classifiers. All these
algorithms were used with their default parameters. SVMwithSGD and
LogRegWithSGD only work for two class problems, hence they don't
have results in the table for the multiclass problems SRBCT,
Lymphoma and Brain. For these Spark MLlib tests, the data was
randomly split into 2/3 for training and the rest for testing and
the procedure repeated 50 times.
TABLE 6.7 Average test error rates and standard deviations for various classification algorithms of the Apache Spark Machine Learning Library MLlib (2015)
Dataset: Kohonen ensemble, SVMwithSGD, NaiveBayes, LogRegWithSGD, RandomForest
Leukemia (AML-ALL): 0.57 (2.83), 4.4 (7.0), 10.26 (8.0), 10.29 (12), 12.8 (10)
Central Nervous System: 29.67 (9.26), 33.25 (14), 42.84 (11), 36.25 (13), 41.26 (11)
Colon Tumor: 11.33 (11.88), 17.56 (11), 8.33 (12), 12 (12), 18.67 (12)
SRBCT: 1.11 (3.51), --, 7.33 (12), --, 21.33 (18)
Lymphoma: 1.11 (3.51), --, 1.9 (3), --, 5.67 (10)
Prostate: 6.0 (8.43), 13.64 (5), 36.97 (12), 13.4 (10), 19.6 (13)
Brain: 24.0 (8.3), --, 18 (19), --, 47.5 (25)
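For reference, a comparison run of the kind reported in Table 6.7 can be set up with the RDD-based MLlib API roughly as follows; this sketch assumes an existing SparkContext sc and a preloaded list `data` of (label, feature vector) pairs, shows only two of the algorithms with their default training parameters, and is not the exact configuration used for the table. (MLlib's NaiveBayes additionally expects non-negative feature values.)

```python
from pyspark.mllib.classification import SVMWithSGD, NaiveBayes
from pyspark.mllib.regression import LabeledPoint

def error_rate(model, test_rdd):
    # Fraction of test points whose predicted label differs from the true label.
    predictions = model.predict(test_rdd.map(lambda p: p.features))
    pairs = test_rdd.map(lambda p: p.label).zip(predictions)
    return pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(test_rdd.count())

# One 2/3-1/3 random split of a two-class problem.
points = sc.parallelize([LabeledPoint(label, features) for label, features in data])
train, test = points.randomSplit([2.0, 1.0])
train.cache()

print(error_rate(SVMWithSGD.train(train), test))
print(error_rate(NaiveBayes.train(train), test))
```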
7. ADDITIONAL EMBODIMENTS FOR HANDLING IMBALANCED DATA PROBLEMS
7.1 Introduction
[0096] Many real-life classification problems, both for streaming
and stored big data, are highly imbalanced and it is fairly
difficult for standard classification methods to accurately predict
the minority class.
[0097] Embodiments of the invention use a classification method for
imbalanced streaming data based on the algorithm presented in
section 4.4 for balanced streaming data. The method uses a
two-layered ensemble of hypersphere nets to discover the minority
class. The following description explains the basic ideas behind
the method with an example, presents the algorithm and some
preliminary computational results.
7.2 Two-Layered Hypersphere Nets
[0098] A Kohonen net, as shown in FIG. 3, is generally used for
clustering data into separate classes of patterns in a data stream.
For classification problems, once it finds clusters, one can then
label the nodes based on the majority class at that node and use it
as a prediction system. That is the essence of the method presented
so far, according to embodiments of the invention. However, for
imbalanced data problems, where very few data points exist for one
or more classes, but lots of data points are available for the
other classes, the minority class may not be in the majority at any
of the nodes in the net or in only a few nodes. As a result, none
or just a few of the nodes in the net will predict the minority
class. To address this problem, one embodiment identifies Kohonen
nodes with a significant presence of the minority class and then
uses the training data points at those nodes to train another set
of Kohonen nets. The basic idea is to break up the data points at
those nodes to find the minority class regions. Henceforth, the
Kohonen nets for the individual nodes with significant minority
class presence are often referred to as Kohonen submodels or
subnets. The next section explains the process of finding the
minority class regions with a second layer of Kohonen nets.
7.3 The Process of Finding Minority Class Regions
[0099] The adult census dataset (Kohavi 1996) is widely used to
compare classification algorithms. The dataset has demographic
information about US citizens and the classification problem is to
predict who earns more than $50K vs. less than $50K. It has 16,281
examples in the training set of which 12,435 examples correspond to
the <=50K class and 3846 examples correspond to the >50K
class. Thus the >50K class has only about 23.62% of the data
points and is considered a minority class for purposes of this
discussion. A Kohonen net of grid size 5.times.5 trained on this
data produces 25 active nodes. A trained Kohonen net of grid size
7.times.7 has 44 active nodes. Table 7.1 shows the distribution of
data among the various nodes of both the Kohonen nets--5.times.5
and 7.times.7. The columns "<=50K Count" and ">50K Count"
show the number of data points at each node belonging to the
<=50K and >50K classes respectively. So, for example, in the
5.times.5 grid, the first row is for the node X=0, Y=0 and has 1044
data points from the <=50K class and 980 points from the >50K
class. Suppose that a node that has more than 10% of the data
points or at least 50 data points from the >50K class is
considered to have a significant presence of the minority class.
And suppose that the remaining nodes belong to the majority class
<=50K because the majority class is dominant. With node
classification on this basis, the 5.times.5 grid has 12 nodes where
the majority class has absolute dominance and they contain a total
of 3418 or 27.5% of the 12435 majority class data points. And these
12 majority class nodes contain just 65 or 1.7% of the 3846
minority class data points. Compared to that, the 7.times.7 grid
has 27 nodes where the majority class has absolute dominance and
they contain a total of 5847 or 47% of the 12435 majority class
data points. And these 27 majority class nodes contain just 228 or
6% of the 3846 minority class data points. Thus, the 7.times.7 grid
successfully separates more of the majority class data points from
the minority class compared to the 5.times.5 grid--5847 vs. 3418.
With a 25.times.25 grid, there is much more separation between the classes: there are 368 nodes where the majority class has absolute dominance, and they contain a total of
8591 or 69% of the total 12435 majority class data points. And
these 368 majority class nodes contain just 210 or 5.5% of the 3846
minority class data points. So, compared to 5.times.5 and 7.times.7
grids, the 25.times.25 grid does a far better job of separating the
two classes. Table 7.2 presents a summary of these numbers.
TABLE 7.1 Distribution of data across Kohonen nets of grid sizes 5.times.5 and 7.times.7
Kohonen net, 5.times.5 grid (node coordinate: <=50K count, >50K count, total):
X=0, Y=0: 1044, 980, 2024
X=0, Y=1: 40, 61, 101
X=0, Y=2: 917, 1268, 2185
X=0, Y=3: 435, 96, 531
X=0, Y=4: 995, 475, 1470
X=1, Y=0: 185, 51, 236
X=1, Y=1: 3, 1, 4
X=1, Y=2: 14, 3, 17
X=1, Y=3: 9, 0, 9
X=1, Y=4: 59, 1, 60
X=2, Y=0: 627, 7, 634
X=2, Y=1: 55, 19, 74
X=2, Y=2: 419, 340, 759
X=2, Y=3: 16, 2, 18
X=2, Y=4: 2097, 200, 2297
X=3, Y=0: 307, 8, 315
X=3, Y=1: 105, 4, 109
X=3, Y=2: 63, 1, 64
X=3, Y=3: 92, 11, 103
X=3, Y=4: 126, 32, 158
X=4, Y=0: 1718, 23, 1741
X=4, Y=1: 404, 5, 409
X=4, Y=2: 1313, 78, 1391
X=4, Y=3: 93, 10, 103
X=4, Y=4: 1299, 170, 1469
Kohonen net, 7.times.7 grid (node coordinate: <=50K count, >50K count, total):
X=0, Y=0: 691, 152, 843
X=0, Y=1: 258, 25, 283
X=0, Y=2: 387, 7, 394
X=0, Y=3: 338, 7, 345
X=0, Y=4: 568, 10, 578
X=0, Y=5: 415, 9, 424
X=0, Y=6: 625, 5, 630
X=1, Y=0: 201, 10, 211
X=1, Y=1: 52, 10, 62
X=1, Y=2: 163, 3, 166
X=1, Y=3: 25, 1, 26
X=1, Y=4: 74, 4, 78
X=1, Y=5: 30, 0, 30
X=1, Y=6: 29, 1, 30
X=2, Y=0: 585, 29, 614
X=2, Y=1: 138, 2, 140
X=2, Y=2: 620, 17, 637
X=2, Y=3: 101, 1, 102
X=2, Y=4: 588, 19, 607
X=2, Y=6: 982, 475, 1457
X=3, Y=0: 358, 34, 392
X=3, Y=1: 22, 2, 24
X=3, Y=2: 67, 15, 82
X=3, Y=3: 14, 2, 16
X=3, Y=4: 191, 52, 243
X=4, Y=0: 839, 122, 961
X=4, Y=1: 61, 9, 70
X=4, Y=2: 344, 320, 664
X=4, Y=3: 18, 3, 21
X=4, Y=4: 438, 186, 624
X=4, Y=5: 11, 1, 12
X=4, Y=6: 743, 494, 1237
X=5, Y=0: 76, 1, 77
X=5, Y=1: 23, 1, 24
X=5, Y=2: 2, 4, 6
X=5, Y=3: 2, 0, 2
X=5, Y=4: 164, 69, 233
X=6, Y=0: 1016, 64, 1080
X=6, Y=1: 78, 20, 98
X=6, Y=2: 212, 532, 744
X=6, Y=3: 159, 273, 432
X=6, Y=4: 105, 327, 432
X=6, Y=5: 204, 173, 377
X=6, Y=6: 420, 353, 773
TABLE 7.2 Summary characteristics of the majority class nodes across various Kohonen net grid sizes
Grid size: no. of points in nodes where the majority class is dominant, percentage of total majority class points in those nodes, no. of minority class points in those nodes, percentage of total minority class points in those nodes
5.times.5: 3418, 27.5%, 65, 1.7%
7.times.7: 5847, 47.0%, 228, 6.0%
25.times.25: 8591, 69.0%, 210, 5.5%
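Given per-node class counts such as those in Table 7.1, the node selection summarized in Table 7.2 can be sketched roughly as follows, using the 10-percent-or-50-points rule of the example above; the names and data layout are illustrative.

```python
def split_nodes_by_minority_presence(node_counts, minority_class,
                                     pct_threshold=0.10, abs_threshold=50):
    """node_counts: {node_id: {class_label: count}} from a trained first layer net.
    Returns (majority_nodes, minority_presence_nodes): node ids where the majority
    class is dominant vs. nodes to be refined by second layer Kohonen subnets."""
    majority_nodes, minority_presence_nodes = [], []
    for node, counts in node_counts.items():
        total = sum(counts.values())
        m = counts.get(minority_class, 0)
        if m / total > pct_threshold or m >= abs_threshold:
            minority_presence_nodes.append(node)
        else:
            majority_nodes.append(node)
    return majority_nodes, minority_presence_nodes

# In the 5x5 grid of Table 7.1, node X=0, Y=0 (1044 vs. 980) has a significant
# minority presence, while node X=2, Y=0 (627 vs. 7) is a majority class node.
counts = {(0, 0): {'<=50K': 1044, '>50K': 980}, (2, 0): {'<=50K': 627, '>50K': 7}}
print(split_nodes_by_minority_presence(counts, '>50K'))
```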
[0100] The basic objective of this embodiment of the invention is
to get good separation of the classes in the first layer Kohonen
net and then use an ensemble of Kohonen submodels (or an ensemble
of hypersphere nets) in a second layer to further separate the
classes at nodes where the minority class has a significant
presence. And, in this process, it can explore many different grid
sizes in the first layer (e.g. 5.times.5, 7.times.7, 25.times.25)
and ensemble them to get better performance.
[0101] In essence, the algorithm uses Kohonen nets as a tool to
break up class regions into smaller sub-regions to provide better
visibility to the different class regions. It is somewhat similar
to decision tree methods. However, one of the powerful features of Kohonen nets is that they break up (that is, group) data points considering all of the features, unlike decision tree methods, which only consider a subset of the features to build trees.
7.4 Second Layer of Kohonen Net Ensembles to Find Minority Class
Regions
[0102] According to this embodiment of the invention, the algorithm
trains another set of Kohonen nets (a second layer of Kohonen
subnets) for the data subsets at each of the nodes in the first
Kohonen net where the minority class has a significant presence.
For example, the 5.times.5 grid Kohonen net in Table 7.1 above has
13 nodes with a significant minority class presence. Those 13 nodes
contain 3781 out of the total 3846 minority class points (i.e. 98%
of the minority points), but also have 9017 majority class points.
Thus, the algorithm breaks up the regions corresponding to these
nodes to gain better visibility to both the majority and minority
class subregions. The algorithm does this by creating an ensemble
of Kohonen subnets for the data points at each of these nodes that
has a significant presence of the minority class. For example, the
node X=0, Y=0 in the 5.times.5 grid Kohonen net of Table 7.1 has
1044 points of the majority class (<=50K) and 980 points of the
minority class (>50K). These data points at the node X=0, Y=0
are then used to build an ensemble of Kohonen subnets of different
sizes to gain further visibility into the subregions of each
class.
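A minimal sketch of this routing step follows; the first_layer_winner function and the train_subnet_ensemble placeholder stand in for the trained first-layer net and the actual Kohonen subnet training, respectively, and are assumptions for the example.

```python
def train_second_layer(stream, first_layer_winner, minority_presence_nodes,
                       train_subnet_ensemble, grid_sizes=((5, 5), (7, 7))):
    """Train one ensemble of Kohonen subnets per flagged first layer node.

    stream: iterable of (pattern, class_label) pairs.
    first_layer_winner: pattern -> node id of the trained first layer net.
    train_subnet_ensemble: callable (points, grid_sizes) -> trained subnets;
                           a placeholder for the actual Kohonen subnet training."""
    routed = {node: [] for node in minority_presence_nodes}
    for pattern, cls in stream:
        node = first_layer_winner(pattern)
        if node in routed:                       # only nodes flagged for refinement
            routed[node].append((pattern, cls))
    # One second layer subnet ensemble per flagged first layer node.
    return {node: train_subnet_ensemble(points, grid_sizes)
            for node, points in routed.items()}
```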
[0103] The algorithm treats the almost homogeneous nodes, where any
class is dominant, as leaf or terminal nodes in the sense of a
decision tree, according to an embodiment. Only the non-homogeneous
nodes are split up using Kohonen subnets, again in the sense of
decision trees. The two-layered Kohonen nets are structurally
somewhat similar to one-layer decision stumps. In this embodiment,
the algorithm uses an ensemble of Kohonen subnets mainly to prevent
overfitting because many of these nodes may not contain a large
number of data points.
7.5 Algorithm to Handle Imbalanced Data Problems, According to an
Embodiment
[0104] The algorithms described above in sections 3.5 and 4.4 are
revised, according to the following embodiments, to handle
imbalanced data problems, in an embodiment of the invention. Table
7.1 shows the additional notations used.
TABLE 7.1 Summary of additional notations used for imbalanced data problems
FGSL: number of Kohonen subnets of different grid sizes in a second layer Kohonen ensemble in a particular feature space
CPCT.sub.k: percentage of data points in the streaming data that belongs to class k, k = 1 . . . kc
IMPCT.sub.max: maximum percentage of data points that can belong to a class for that class to be a minority class (e.g. maximum 25%)
MCL: the set of classes considered minority classes
CCM: the total percentage of all minority class counts at an active node
MCLNODE.sub.FPq: the list of active nodes in first layer Kohonen nets for feature subset FP.sub.q, q = 1 . . . FS, where the minority class has a significant presence
MCLNODE-FSB.sub.kj: the list of active nodes in first layer Kohonen nets for each feature subset FSB.sub.kj, k = 1 . . . kc, j = 1 . . . B.sub.max, where the minority class has a significant presence
7.5.1 Algorithm for Class-Specific Feature Selection from Streaming
Data for Imbalanced Problems
[0105] Step 1. Process some streaming data to find the approximate
maximum and minimum values of each feature. Use the range to
normalize streaming input patterns during subsequent processing.
The input vector x is assumed to be normalized in this algorithm.
(Note: Other methods of data normalization can also be used in this algorithm.)
[0106] Track and compute CPCT.sub.k, the approximate percentage of data points in the streaming data that belong to class k, k=1 . . . kc, when finding the maximum and minimum values for each feature. If CPCT.sub.k<IMPCT.sub.max, then class k is a minority class. Add class k to the minority class list MCL.
[0107] Step 2. Randomly partition the N features into FS subsets
(FS=KN.sub.max/FG) where each partition is denoted by FP.sub.q, q=1
. . . FS. If the problem is imbalanced (i.e. MCL, the set of
minority classes, is nonempty), then reduce FS (e.g. FS=0.5
KN.sub.max/FG) and leave some resources aside to build second layer Kohonen subnets.
[0108] Step 3. Initialize the weights and learning parameters of
the KN.sub.max Kohonen nets (or a reduced set of Kohonen nets if
the problem is imbalanced data) that will be trained in parallel,
where FG is the number of Kohonen nets of different grid sizes for
each feature partition FP.sub.q, q=1 . . . FS.
[0109] Step 4. Train all KN.sub.max Kohonen nets in parallel (or a
reduced set of Kohonen nets if the problem is imbalanced data)
using streaming data and selecting appropriate parts of the input
pattern vector for each Kohonen net according to the feature subset
assigned to it. Stop training when all the Kohonen nets converge
and their CWR.sub.T ratios are at or below CWR.sub.T.sup.Min.
[0110] Step 5. Process some more streaming data through the
stabilized Kohonen nets, without changing the weights, to find the
active nodes (winning neurons) and their class counts. Stop when
class count percentages at all active nodes converge and are
stable.
[0111] Step 6. Assign each active node (neuron) to a class if the
class count percentage CC for the most active class at that node is
>=PCT.sub.min. If the problem is not imbalanced data, discard
all active neurons that do not satisfy the PCT.sub.min requirement
or have a low total class count. If the problem is not imbalanced
data, go to Step 7.
[0112] If the problem is imbalanced data, compute CCM at each
active node by adding all minority class counts and computing its
percentage of total class counts at that node. If CCM
>IMPCT.sub.max, add that active node to the set
MCLNODE.sub.FP.sub.q, the list of active nodes in first layer
Kohonen nets for each feature subset FP.sub.q, q=1 . . . FS, where
the minority class has a significant presence. A node assigned to
the list MCLNODE.sub.FP.sub.q is not assigned to any class.
[0113] Now train the second layer ensemble of Kohonen subnets
corresponding to each first layer node in the list
MCLNODE.sub.FP.sub.q where the minority class has a significant
presence.
[0114] Step A. Initialize the weights and learning parameters of
FGSL second layer Kohonen subnets for each first layer node in the
list MCLNODE.sub.FP.sub.q, for feature partitions FP.sub.q, q=1 . . . FS. FGSL is the
number of Kohonen subnets of different grid sizes for each feature
partition.
[0115] Step B. Process some additional streaming data to train all
second layer Kohonen subnets in parallel and by selecting
appropriate parts of the input pattern for each Kohonen subnet
according to the feature partition FP.sub.q, q=1 . . . FS. Stop
training when all Kohonen subnets converge; that is, when
CWR.sub.T<=CWR.sub.T.sup.Min for all Kohonen nets. For this
step, each such second layer ensemble is trained with just the
streaming data points that are assigned to the corresponding first
layer node of a first layer Kohonen net.
[0116] Step C. Process some additional streaming data through the
stabilized second layer Kohonen subnets, without changing the
weights, to get class counts for the active nodes in all of the
second layer Kohonen subnets. Stop when class percentages become
stable for all second layer Kohonen subnets.
[0117] Step D. Assign each active node (neuron) in the second layer
Kohonen nets to a class if the class count percentage CC for the
most active class at that node is >=PCT.sub.min. Discard all
active neurons that do not satisfy the PCT.sub.min requirement.
[0118] Step 7. Create a list of the remaining active nodes by class
for each feature partition FP.sub.q, q=1 . . . FS.
[0119] Step 8. Compute the separability indices of the features
separately for each feature partition FP.sub.q, q=1 . . . FS.
Compute the separability indices of the particular features in a
feature partition using the remaining active neurons for that
feature partition only. Those remaining active neurons, which have
been assigned to classes, are representative examples of the
classes.
[0120] Step 9. Repeat steps 2 through 8 a few times and track the
maximum separability index value of each feature.
[0121] Step 10. Rank features on the basis of their maximum
separability index value.
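Two pieces of bookkeeping specific to the imbalanced case, the minority class detection of Step 1 and the reduced feature-partition budget of Step 2, can be sketched roughly as follows; the names and thresholds are illustrative only.

```python
from collections import Counter

def detect_minority_classes(stream_labels, impct_max=0.25):
    """Step 1: compute CPCT for every class seen so far and return the
    minority class list MCL (classes whose share is below IMPCT_max)."""
    counts = Counter(stream_labels)
    total = sum(counts.values())
    cpct = {k: c / total for k, c in counts.items()}
    mcl = [k for k, pct in cpct.items() if pct < impct_max]
    return cpct, mcl

def feature_partition_budget(kn_max, fg, imbalanced):
    """Step 2: number of feature partitions FS; halved for imbalanced problems
    so that resources remain for the second layer of Kohonen subnets."""
    fs = kn_max // fg
    return fs // 2 if imbalanced else fs

# Adult census example: the >50K class holds roughly 23.6% of the stream.
cpct, mcl = detect_minority_classes(['<=50K'] * 12435 + ['>50K'] * 3846)
print(mcl, feature_partition_budget(kn_max=300, fg=5, imbalanced=bool(mcl)))
```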
7.5.2 Algorithm to Train the Final Set of Kohonen Nets for
Classification for Imbalanced Data Problems
[0122] If the problem is imbalanced data (i.e., MCL, the set of
minority classes, is nonempty), then reduce B.sub.max (e.g.
B.sub.max=0.5 KN.sub.max/(FG*(kc+1))) and leave some resources aside
to build second layer Kohonen subnets.
[0123] Step 1. Initialize bucket number j to zero.
[0124] Step 2. Increment bucket number j (j=j+1) and add (Inc*j)
number of top ranked features to bucket FB.sub.j from the ranked
feature list of each class k (k=1 . . . kc). FSB.sub.kj is the set
of (Inc*j) top ranked features of class k in bucket j.
[0125] Step 3. Initialize final Kohonen nets, in parallel in a
distributed computing system, of FG different grid sizes for each
class k (k=1 . . . kc) and for the corresponding feature set
FSB.sub.kj. Also initialize FG Kohonen nets for a feature set that
includes all of the features from all classes in bucket j. If
j<B.sub.max, go back to step 2 to set up other Kohonen nets for
other feature buckets. When j=B.sub.max, go to step 4.
[0126] Step 4. Train all KN.sub.max Kohonen nets in parallel using
streaming data and selecting appropriate parts of the input pattern
for each Kohonen net according to the feature subsets FSB.sub.kj,
k=1 . . . kc, j=1 . . . B.sub.max. Stop training when all Kohonen
nets converge; that is, when CWR.sub.T<=CWR.sub.T.sup.Min for
all Kohonen nets.
[0127] Step 5. Process some more streaming data through the
stabilized Kohonen nets, without changing the weights, to find the
set AN.sub.kj of active nodes (neurons) in the corresponding
Kohonen nets for each class k in each bucket j (k=1 . . . kc, j=1 .
. . B.sub.max). Also find the set of active nodes for the Kohonen
net that uses all features of all classes in bucket j, j=1 . . .
B.sub.max. In addition, get the class counts CTA.sub.kji of the
active nodes and stop when the class count percentages
CTP.sub.kjim, m=1 . . . kc, become stable for all active nodes.
[0128] Step 6. Assign each active node AN.sub.kji to the majority
class m if the class count percentage CTP.sub.kjim, m=1 . . . kc,
for the majority class m at that active node is above the minimum
threshold PCT.sub.min and the absolute class count CTA.sub.kji is
above the threshold CT.sub.min. If the problem is not imbalanced
data, go to Step 7.
[0129] If the problem is imbalanced data, compute CCM at each
active node AN.sub.kji by adding all minority class counts and
computing its percentage of total class counts at that node. If CCM
>IMPCT.sub.max, add that active node to the set
MCLNODE-FSB.sub.kj, the list of active nodes in first layer Kohonen
nets for each feature subset FSB.sub.kj, k=1 . . . kc, j=1 . . .
B.sub.max, where the minority class has a significant presence. A
node assigned to the list MCLNODE-FSB.sub.kj is not assigned to any
class.
[0130] Now train the second layer ensemble of Kohonen subnets
corresponding to each first layer node in the list
MCLNODE-FSB.sub.kj where the minority class has a significant
presence.
[0131] Step A. Initialize the weights and learning parameters of
FGSL second layer Kohonen subnets for each first layer node in the
list MCLNODE-FSB.sub.kj, for feature sets FSB.sub.kj, k=1 . . . kc, j=1 . . .
B.sub.max. FGSL is the number of Kohonen subnets of different grid
sizes for each feature partition.
[0132] Step B. Process some additional streaming data to train all
second layer Kohonen subnets in parallel and by selecting
appropriate parts of the input pattern for each Kohonen subnet
according to the feature partition FSB.sub.kj, k=1 . . . kc, j=1 .
. . B.sub.max. Stop training when all Kohonen subnets converge;
that is, when CWR.sub.T<=CWR.sub.T.sup.Min for all Kohonen
nets.
[0133] For this step, each such second layer ensemble is trained
with just the streaming data points that are assigned to the
corresponding first layer node of a first layer Kohonen net.
[0134] Step C. Process some additional streaming data through the
stabilized second layer Kohonen subnets, without changing the
weights, to get class counts for the active nodes in all of the
second layer Kohonen subnets. Stop when class percentages become
stable for all second layer Kohonen subnets.
[0135] Step D. Assign each active node (neuron) in the second layer
Kohonen subnets to a class if the class count percentage CC for the
most active class at that node is >=PCT.sub.min. Add these
second layer active nodes to AN.sub.kj, the list of active nodes
for class k and feature set FSB.sub.kj. Discard all active neurons
which do not satisfy the PCT.sub.min requirement.
[0136] Step 7. Process some more streaming data to compute the
radius W.sub.kji of each active node AN.sub.kji. Stop when the
radii or widths become stable.
[0137] Step 8. Retain only the active nodes AN.sub.kj, k=1 . . .
kc, j=1 . . . B.sub.max, from the corresponding Kohonen nets (or
subnets) that satisfy the minimum thresholds PCT.sub.min and
CT.sub.min. Also retain the active nodes from the Kohonen nets (or
subnets) based on all features of all classes in bucket j, j=1 . .
. B.sub.max, and that satisfy the minimum thresholds. Discard all
other nodes from all of the Kohonen nets.
[0138] This algorithm produces a set of active Kohonen neurons for
each bucket j, j=1 . . . B.sub.max, and each Kohonen neuron is
assigned to a specific class, according to embodiments of the
invention.
7.6 Computational Results for Imbalanced Data Problems
[0139] Some computational results on imbalanced data problems
follow. Computational testing used two widely referenced
datasets--the adult census data of Kohavi (1996) and the bank
marketing data of Moro et al. (2014). The bank marketing dataset is
from a Portuguese banking institution and was used in their direct
marketing campaign to sell term deposits to their customers. The
dataset has general information about customers and details of
phone contacts made with them. The task is to predict whether a
term deposit will be subscribed to or not by the customer based on
a phone call. Those who subscribe are classified as "yes" and those
who don't as "no." It has 4521 examples in the training set of
which 4000 examples correspond to the "no" class and 521 examples
correspond to the "yes" class. Thus the "yes" class has only about
11.52% of the data points and was considered a minority class.
[0140] It should be noted that many experts advise against using overall accuracy as the evaluation measure for imbalanced data problems; for such problems, correct prediction of the minority class is the main concern. Thus, a slightly higher overall
error rate is often tolerated. The tables below show both the
overall accuracy and the number of minority class points correctly
classified.
[0141] The performance of this algorithm was compared to three
variations of decision tree algorithms available in IBM's SPSS
Modeler--C&R Tree, CHAID and Quest--since decision tree
algorithms are conceptually similar to this algorithm. For the
decision tree algorithms, the parameters were set to their default
values.
[0142] Table 7.3 shows the results for the adult census data and
Table 7.4 shows the results for the bank marketing data. In both
cases, the Kohonen Ensemble method described herein found more of
the minority class data points in both the training and test sets
than the other methods, while accuracy decreased only slightly.
TABLE 7.3 Training and test accuracies and the number of minority class points correctly classified by the four algorithms for the adult census dataset (Kohavi 1996)
For each algorithm: no. of minority class points correctly classified, accuracy
TRAIN: Kohonen Ensemble 2516, 84.71%; C&R Tree 1918, 84.9%; CHAID 2046, 83.56%; Quest 1473, 81.46%
TEST: Kohonen Ensemble 4730, 81.93%; C&R Tree 3919, 84.7%; CHAID 4091, 82.86%; Quest 2962, 81.04%
TABLE 7.4 Training and test accuracies and the number of minority class points correctly classified by the four algorithms for the bank marketing dataset (Moro et al. 2014)
For each algorithm: no. of minority class points correctly classified, accuracy
TRAIN: Kohonen Ensemble 181, 90.00%; C&R Tree 76, 89.29%; CHAID 141, 89.07%; Quest 125, 89.96%
TEST: Kohonen Ensemble 1312, 87.6%; C&R Tree 873, 89.11%; CHAID 1244, 88.83%; Quest 1136, 89.5%
8. HARDWARE IMPLEMENTATION OF EMBODIMENTS TO EXPLOIT PARALLEL
COMPUTATIONS
[0143] Embodiments of the invention may be implemented both on a
distributed computing platform 600, such as Apache Spark, as
depicted in FIG. 6, and on neural hardware. A neural hardware
implementation can exploit massively parallel computations at the
neuronal level of a Kohonen net. Such an implementation can be
useful in many domains that require fast learning and response,
including IoT (Internet of Things) and robotics. Such an
implementation can also process stored data very quickly. Designs of Kohonen chips--both analog and digital versions--are
currently available and such chips can be produced in large
quantities.
8.1 Localized Learning to Save Signal Transmission Cost in IOT;
Facilitate Distributed Control and Decision-Making
[0144] A hardware implementation of embodiments of the invention
allows for localization of learning and response. The advantage of
localized learning and response is that it will reduce the volume
of signal transmission through expensive networks such as the
Internet. For example, if a piece of critical machinery is being
continuously monitored with localized hardware for performance and
potential failure, no continuous signal transmission through large
networks to a cloud-based agent needs to occur until certain
thresholds are reached that indicate performance deterioration or
impending failure. Thus, localized hardware can reduce unnecessary
transmissions through large networks in a significant way.
Hardware-based localized learning and monitoring can not only reduce the volume of network traffic and its cost, but also reduce (or
even eliminate) the dependence on a single control center, such as
the cloud, for decision-making and control. Localized learning and
monitoring will allow for distributed decision-making and control
of machinery and equipment in IoT.
[0145] A hardware implementation of embodiments of the invention
will also make learning machines widely deployable on an "anytime,
anywhere" basis even when there is no access to a network and/or a
cloud facility.
9. CONCLUSION
[0146] Embodiments of the invention provide a method for large
scale machine learning. The method can learn from both stored and
streaming data; of course, stored data has to be streamed to this
method. It is, therefore, a general purpose machine learning method
for classification problems. And, as shown by the experimental
results, the method is particularly powerful at reducing the dimensionality of high-dimensional problems, and its accuracy is also very competitive. Online methods are also highly scalable because
they do not need simultaneous access to the training data and can
learn by examining a single record at a time. Another advantage of
online methods is that they can learn from all of the data and need
not sample from the data.
[0147] If machine learning systems are to be widely deployed and
used in big data, IoT and other environments, a certain level of
automation of learning is needed. Automation can reduce the
dependence on highly skilled machine learning experts to develop
applications. Without a certain level of automation, the cost of
deploying machine learning applications can become prohibitive,
thereby inhibiting their wider use and diminishing the economic
benefits of big data and IoT. Embodiments of the invention provide
a step towards automation of learning. Such automation can be
achieved through an ensemble of classifiers and by less stringent
requirements for parameter setting.
[0148] Some portions of this detailed description are presented in
terms of algorithms and representations of operations on data
within a computer memory. These algorithmic descriptions and
representations are the means used by those skilled in the data
processing arts to most effectively convey the substance of their
work to others skilled in the art. An algorithm is here, and
generally, conceived to be a sequence of steps leading to a desired
result. The steps are those requiring physical manipulations of
physical quantities. Usually, though not necessarily, these
quantities take the form of electrical or magnetic signals capable
of being stored, transferred, combined, compared, and otherwise
manipulated. It has proven convenient at times, principally for
reasons of common usage, to refer to these signals as bits, values,
elements, symbols, characters, terms, numbers, or the like.
[0149] It should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate physical
quantities and are merely convenient labels applied to these
quantities. Unless specifically stated otherwise, as apparent from
this discussion, it is appreciated that throughout the description,
discussions utilizing terms such as "processing" or "computing" or
"calculating" or "determining" or "displaying" or the like, refer
to the action and processes of a computer system or computing
platform, or similar electronic computing device(s), that
manipulates and transforms data represented as physical
(electronic) quantities within the computer system's registers and
memories into other data similarly represented as physical
quantities within the computer system memories or registers or
other such information storage, transmission or display
devices.
[0150] Embodiments of the invention also relate to apparatuses for
performing the operations herein. Some apparatuses may be specially
constructed for the required purposes, or may comprise a general
purpose computer(s) selectively activated or configured by a
computer program stored in the computer(s). Such a computer program
may be stored in a computer readable storage medium, such as, but
not limited to, any type of disk including optical disks, CD-ROMs,
DVD-ROMs, and magnetic-optical disks, read-only memories (ROMs),
random access memories (RAMs), EPROMs, EEPROMs, NVRAMs, magnetic or
optical cards, or any type of media suitable for storing electronic
instructions, and each coupled to a computer system bus.
[0151] The algorithms presented herein are not inherently related
to any particular computer or other apparatus. Various general
purpose systems may be used with programs in accordance with the
teachings herein, or it may prove convenient to construct more
specialized apparatus to perform the required methods. The
structure for a variety of these systems appears from the
description herein. In addition, embodiments of the invention are
not described with reference to any particular programming
language. It will be appreciated that a variety of programming
languages may be used to implement the embodiments of the invention
as described herein.
[0152] A machine-readable medium includes any mechanism for storing
or transmitting information in a form readable by a machine (e.g.,
a computer). For example, a machine-readable medium includes read
only memory ("ROM"); random access memory ("RAM"); magnetic disk
storage media; optical storage media; flash memory devices,
etc.
[0153] Although the invention has been described and illustrated in
the foregoing illustrative embodiments, it is understood that the
present disclosure has been made only by way of example, and that
numerous changes in the details of implementation of the invention
can be made without departing from the spirit and scope of the
invention, which is only limited by the claims that follow.
Features of the disclosed embodiments can be combined and
rearranged in various ways.
* * * * *