U.S. patent application number 10/876533 was filed with the patent office on 2004-06-28 and published on 2005-12-29 for methods for multi-class cost-sensitive learning.
Invention is credited to Abe, Naoki, Zadrozny, Bianca.
Application Number | 20050289089 10/876533 |
Document ID | / |
Family ID | 35507280 |
Filed Date | 2005-12-29 |
United States Patent
Application |
20050289089 |
Kind Code |
A1 |
Abe, Naoki ; et al. |
December 29, 2005 |
Methods for multi-class cost-sensitive learning
Abstract
Methods for multi-class cost-sensitive learning are based on
iterative example weighting schemes and solve multi-class
cost-sensitive learning problems using a binary classification
algorithm. One of the methods works by iteratively applying
weighted sampling from an expanded data set, which is obtained by
enhancing each example in the original data set with as many data
points as there are possible labels for any single instance, using
a weighting scheme which gives each labeled example the weight
specified as the difference between the average cost on that
instance by the averaged hypotheses from the iterations so far and
the misclassification cost associated with the label in the labeled
example in question. It then calls the component classification
algorithm on a modified binary classification problem in which each
example is itself already a labeled pair, and its (meta) label is 1
or 0 depending on whether the example weight in the above weighting
scheme is positive or negative, respectively. It then finally
outputs a classifier hypothesis which is the average of all the
hypotheses output in the respective iterations.
Inventors: |
Abe, Naoki; (Rye, NY)
; Zadrozny, Bianca; (New York, NY) |
Correspondence
Address: |
WHITHAM, CURTIS & CHRISTOFFERSON, P.C.
11491 SUNSET HILLS ROAD
SUITE 340
RESTON
VA
20190
US
|
Family ID: |
35507280 |
Appl. No.: |
10/876533 |
Filed: |
June 28, 2004 |
Current U.S.
Class: |
706/12 ; 706/20;
706/25; 706/52 |
Current CPC
Class: |
G06N 20/00 20190101;
G06K 9/6256 20130101; Y10S 706/932 20130101 |
Class at
Publication: |
706/012 ;
706/025; 706/052; 706/020 |
International
Class: |
G06F 015/18; G06F
009/44; G06N 003/08; G06N 007/02; G06N 007/06 |
Claims
Having thus described our invention, what we claim as new and
desire to secure by Letters Patent is as follows:
1. A method for multi-class, cost-sensitive learning based on
iterative example weighting schemes applied to a chosen data set
comprising the steps of: a) obtaining an expanded data set, which
is defined by enhancing each example in an original data set with
as many data points as there are possible labels for any single
instance; b) repeatedly drawing sub-samples from the expanded data
set using weighted sampling according to a certain example
weighting scheme that remains constant throughout the iterations,
in which each labeled example is given the weight specified as the
difference between the maximum possible misclassification cost for
the instance in question and the misclassification cost associated with
the label in the particular labeled example; c) calling a component
classification learning algorithm to the sub-sample obtained in
step b) and obtaining a hypothesis representing a classifier; d)
outputting all classifier representations obtained through the
iterations and representing an average over them, each of which can
be an arbitrary representation of classifier for a problem at hand;
and e) outputting all of the representations obtained through the
iterations representing an average over them, each of which can be
an arbitrary representation of classifier for the problem at
hand.
2. The method for multi-class, cost-sensitive learning recited in
claim 1, wherein the learning algorithm is an arbitrary algorithm
for classification.
3. The method for multi-class, cost-sensitive learning recited in
claim 1, wherein the learning algorithm is selected from the group
consisting of decision tree algorithms, naive Bayes method, logistic
regression method and neural networks.
4. A method for multi-class, cost-sensitive learning based on an
example weighting scheme applied to a chosen data set comprising
the steps of: a) obtaining an expanded data set, which is defined
by enhancing each example in an original data set with as many data
points as there are possible labels for any single instance; b)
iteratively applying weighted sampling from the expanded data set,
using a dynamically changing weighting scheme involving both
positive and negative weights; c) calling a component
classification algorithm on a modified binary classification
problem in which each example is itself already a labeled pair, and
its (meta) label is 1 or 0 depending on whether the example weight
in the above weighting scheme is positive or negative,
respectively, and obtains a hypothesis representing a classifier;
d) optionally modifying the obtained classifier, which is in
general a relation on the original classification (mapping elements
of the domain to the labels), so that it is stochastic, namely a
conditional probability distribution, so that its probabilities over
the set of labels sum to one for each instance; e) outputting all
representations obtained through the iterations and representing an
average over them, each of which can be an arbitrary representation
of classifier for the problem at hand.
5. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the learning algorithm is an arbitrary algorithm
for classification.
6. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the learning algorithm is selected from the group
consisting of decision tree algorithms, naive Bayes method, logistic
regression method and neural networks.
7. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the dynamically changing weighting of step b)
gives each labeled example a weight specified as a difference
between an average cost on that instance by the averaged hypotheses
from iterations so far and a misclassification cost associated with
the label in the labeled example in question.
8. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the dynamically changing weighting of step b)
gives each labeled example a weight specified as a difference
between an average cost on that instance by an averaged hypotheses
from iterations so far divided by a number of labels per instance,
and a misclassification cost associated with the label in the
labeled example in question.
9. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the dynamically changing weighting of step b)
gives each labeled example a weight specified as a difference
between an average cost on that instance by an averaged hypotheses
from iterations so far divided by a number of labels per instance,
and a misclassification cost associated with the label in the
labeled example in question, and the weighted sampling comprises
the steps of: sampling the instance in step a) according to a
probability proportional to a maximum of weights for that instance
and any of the labels, and choosing a label with a probability
proportional to the absolute value of a weight for that instance
and the label in question.
10. The method for multi-class, cost-sensitive learning recited in
claim 4, wherein the dynamically changing weighting of step b)
gives each labeled example a weight specified as a difference
between an average cost on that instance by an averaged hypotheses
from iterations so far divided by a number of labels per instance,
and a misclassification cost associated with the label in the
labeled example in question, and the weighted sampling comprises
the steps of: sampling the instance in step a) according to a
probability proportional to a maximum of weights for that instance
and any of the labels, and for the chosen instance,
deterministically adding examples for all possible labels.
11. A system implementing a method for multi-class, cost-sensitive
learning based on iterative example weighting schemes applied to a
chosen data set comprising: a multi-class cost-sensitive learning
top control module controlling the overall control flow; a learning
algorithm storage module storing a representation of a learning
algorithm for classification learning; a model output module
storing models obtained as a result of applying the learning
algorithm to training data given by a weighted sampling module and
outputting a final model by aggregating these models, said weighted
sampling module accessing data stored in a data storage module,
sampling a relatively small subset of the data with acceptance
probability determined using the example weights, and passing the
obtained sub-sample to said top control module; a weight
calculation module updating the example weights for sampling using
weighted sampling according to a weighting scheme that remains
constant throughout iterations, in which each labeled example is
given a weight specified as a difference between a maximum possible
misclassification cost for the instance in question and a
misclassification cost associated with the label in the particular
labeled example; and a model update module updating current models
using a model's output in previous iterations stored in a current
model storage module and an output model of a current iteration
output and storing a resulting updated model in said current
storage module.
12. The system for implementing a method for multi-class,
cost-sensitive learning recited in claim 11, wherein the learning
algorithm is an arbitrary algorithm for classification.
13. The system for implementing a method for multi-class,
cost-sensitive learning recited in claim 11, wherein the learning
algorithm is selected from the group consisting of decision tree
algorithms, naive Bayes method, logistic regression method and
neural networks.
14. A system implementing a method for multi-class, cost-sensitive
learning based on an example weighting scheme applied to a chosen
data set comprising: a multi-class cost-sensitive learning top
control module controlling the overall control flow; a learning
algorithm storage module storing a representation of a learning
algorithm for classification learning; a model output module
storing models obtained as a result of applying the learning
algorithm to training data given by a weighted sampling module and
outputting a final model by aggregating these models, said weighted
sampling module accessing data stored in a data storage module,
sampling a relatively small subset of the data with acceptance
probability determined using the example weights, and passing the
obtained sub-sample to said top control module; a weight
calculation module calculating the example weights for sampling
using a dynamically changing weighting scheme involving both
positive and negative weights; and a model update module updating
the current models using the models output in the previous
iterations stored in a current model storage module and the output
model of the current iteration output and storing the resulting
updated model in said current storage module.
15. The system for implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the learning
algorithm is an arbitrary algorithm for classification.
16. The system for implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the learning
algorithm is selected from the group consisting of decision tree
algorithms, naive Bayes method, logistic regression method and
neural networks.
17. The system implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the
dynamically changing weighting scheme used by the weight
calculation module gives each labeled example a weight specified as
a difference between an average cost on that instance by the
averaged hypotheses from iterations so far and a misclassification
cost associated with the label in the labeled example in
question.
18. The system implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the
dynamically changing weighting scheme used by the weight
calculation module gives each labeled example a weight specified as
a difference between an average cost on that instance by an
averaged hypotheses from iterations so far divided by a number of
labels per instance, and a misclassification cost associated with
the label in the labeled example in question.
19. The system implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the
dynamically changing weighting scheme used by the weight
calculation module gives each labeled example a weight specified as
a difference between an average cost on that instance by an
averaged hypotheses from iterations so far divided by a number of
labels per instance, and a misclassification cost associated with
the label in the labeled example in question, and the weighted
sampling samples the instance according to a probability
proportional to a maximum of weights for that instance and any of
the labels, and a label is chosen with a probability proportional
to the absolute value of a weight for that instance and the label
in question.
20. The system implementing a method for multi-class,
cost-sensitive learning recited in claim 14, wherein the
dynamically changing weighting scheme used by the weight
calculation module gives each labeled example a weight specified as
a difference between an average cost on that instance by an
averaged hypotheses from iterations so far divided by a number of
labels per instance, and a misclassification cost associated with
the label in the labeled example in question, and the weighted
sampling samples the instance according to a probability
proportional to a maximum of weights for that instance and any of
the labels, and for the chosen instance, deterministically adds
examples for all possible labels.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention generally relates to the field of
cost-sensitive learning in the areas of machine learning and data
mining and, more particularly, to methods for solving multi-class
cost-sensitive learning problems using a binary classification
algorithm. This algorithm is based on techniques of data space
expansion and gradient boosting with stochastic ensembles.
[0003] 2. Background Description
[0004] Classification in the presence of varying costs associated
with different types of misclassification is important for
practical applications, including many data mining applications,
such as targeted marketing, fraud and intrusion detection, among
others. Classification is often idealized as a problem where every
example is equally important, and the cost of misclassification is
always the same. The real world is messier. Typically, some
examples are much more important than others, and the cost of
misclassifying in one way differs from the cost of misclassifying
in another way. A body of work on this subject has become known as
cost-sensitive learning, in the areas of machine learning and data
mining.
[0005] Research in cost-sensitive learning falls into three main
categories. The first category is concerned with making particular
classifier learners cost-sensitive, including methods specific for
decision trees (see, for example, U. Knoll, G. Nakhaeizadeh, and
B. Tausend, "Cost-sensitive pruning of decision trees", Proceedings
of the Eighth European Conference on Machine Learning, pp. 383-386,
1994, and J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C.
Brodley, "Pruning decision trees with misclassification costs",
Proceedings of the European Conference on Machine Learning, pp.
131-136, 1998), neural networks (see, for example, P. Geibel and F.
Wysotzki, "Perceptron based learning with example dependent and
noisy costs", Proceedings of the Twentieth International Conference
on Machine Learning, 2003), and support vector machines (see, for
example, G. Fumera and F. Roli, "Cost-sensitive learning in support
vector machines", VIII Convegno Associazione Italiana per
L'Intelligenza Artificiale, 2002). The second category uses Bayes
risk theory to assign each example to its lowest expected cost
class (see, for example, P. Domingos, "MetaCost: A general method
for making classifiers cost sensitive", Proceedings of the Fifth
International Conference on Knowledge Discovery and Data Mining,
pp. 144-164, ACM Press, 1999, and D. Margineantu, Methods for
Cost-Sensitive Learning, PhD thesis, Department of Computer
Science, Oregon State University, Corvallis, 2001). This requires
classifiers to output class membership probabilities and sometimes
requires estimating costs (see, B. Zadrozny and C. Elkan, "Learning
and making decisions when costs and probabilities are both
unknown", Proceedings of the Seventh International Conference on
Knowledge Discovery and Data Mining, pp. 204-213, ACM Press, 2001)
(when the costs are unknown at classification time). The third
category concerns methods that modify the distribution of training
examples before applying the classifier learning method, so that
the classifier learned from the modified distribution is
cost-sensitive. We call this approach cost-sensitive learning by
example weighting. Work in this area includes stratification
methods (see, for example, P. Chan and S. Stolfo, "Toward scalable
learning with non-uniform class and cost distributions",
Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining, pp. 164-168, 1998, and L. Breiman, J. H.
Friedman, R. A. Olshen, and C. J. Stone, Classification and
Regression Trees, Wadsworth International Group, 1984) and the
costing algorithm (see, for example, B. Zadrozny, J. Langford, and
N. Abe, "Cost-sensitive learning by cost-proportionate example
weighting", Proceedings of the Third IEEE International Conference
on Data Mining, pp. 435-442, 2003). This approach is very general
since it reuses arbitrary classifier learners and does not require
accurate class probability estimates from the classifier.
Empirically this approach attains similar or better
cost-minimization performance.
[0006] Unfortunately, current methods in this category suffer from
a major limitation: they are well-understood only for two-class
problems. In the two-class case, it is easy to show that each
example should be weighted proportionally to the difference in cost
between predicting correctly or incorrectly (see, again, Zadrozny
et al., ibid.). However, in the multi-class case there is more than
one way in which a classifier can make a mistake, breaking the
application of this simple formula. Heuristics, such as weighting
examples by the average misclassification cost, have been proposed
(see, again, Breiman et al., ibid., and the Margineantu thesis,
ibid.), but they are not well-motivated theoretically and do not
seem to work very well in practice when compared to methods that
use Bayes risk minimization (see, again, Domingos, ibid.).
SUMMARY OF THE INVENTION
[0007] It is therefore an object of the present invention to
provide a method for multi-class cost-sensitive learning based on
an example weighting scheme.
[0008] According to the invention, the methods are based on example
weighting schemes that are derived using two key ideas: 1) data
space expansion and 2) gradient boosting with stochastic ensembles.
The latter is a formal framework that gives rise to a coherent body
of methods.
[0009] One of the methods of the invention, which is based on the idea
1) above, works by repeatedly sampling from the expanded data set,
which is obtained by enhancing each example in the original data
set with as many data points as there are possible labels for any
single instance. It then repeatedly draws sub-samples from this
expanded data set using weighted sampling according to a certain
example weighting scheme, in which each labeled example is given
the weight specified as the difference between the maximum possible
misclassification cost for the instance in question and the
misclassification cost associated with the label in the particular
labeled example. The example weighting remains constant throughout
the iterative sampling procedure. It then finally outputs a
classifier hypothesis which is the average of all the hypotheses
output in the respective iterations.
[0010] Another one of the methods of the invention, which is based on
the idea 2) above, works by iteratively applying weighted sampling
from the same expanded data set, using a different weighting
scheme. The weighting scheme of this method gives each labeled
example the weight specified as the difference between the average
cost on that instance by the averaged hypotheses from the
iterations so far and the misclassification cost associated with
the label in the labeled example in question. Emphatically, the
weighting changes in every iteration, since it depends on the
performance of the averaged hypothesis obtained up to the current
iteration. Additionally, the example weights used in this method
can be both positive and negative, since the label given in any
labeled example does not necessarily correspond to the best label
for the given instance, i.e. the label with the minimum cost, due
to the use of data space expansion. Negative weights do not admit
the use of weighted sampling. The method deals with this problem by
calling the component classification algorithm on a modified binary
classification problem in which each example is itself already a
labeled pair, and its (meta) label is 1 or 0 depending on whether
the example weight in the above weighting scheme is positive or
negative, respectively.
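By way of illustration only, the weighting and (meta) labeling just
described might be computed for a single instance as in the following
Python sketch. The function name `boosting_weights`, the list-based
representation of the costs and of the averaged hypothesis, and the
choice to give a zero weight the (meta) label 0 are assumptions made
for this example and are not taken from the disclosure.

```python
def boosting_weights(costs, ensemble_probs):
    """Per-label weights for one instance x (illustrative sketch).

    costs[y]          -- C(x, y), the cost of predicting label y on x
    ensemble_probs[y] -- H(x, y), the averaged hypothesis so far (sums to 1)

    Returns a list of (label, meta_label, sampling_weight) tuples.
    """
    # Average cost on x under the averaged hypotheses from the iterations so far.
    avg_cost = sum(p * c for p, c in zip(ensemble_probs, costs))
    rows = []
    for y, c in enumerate(costs):
        w = avg_cost - c                 # may be positive or negative
        meta_label = 1 if w > 0 else 0   # assumption: zero weight maps to 0
        rows.append((y, meta_label, abs(w)))  # sample using the magnitude
    return rows


rows = boosting_weights([0.0, 2.0, 4.0], [0.5, 0.25, 0.25])
for label, meta_label, weight in rows:
    print(label, meta_label, weight)
```

Labels that are cheaper than the current average cost receive the
(meta) label 1, and the magnitude of the weight can then be used for
ordinary weighted sampling.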
[0011] The results of the methods of the invention are obtained by
outputting all of the classifier representations obtained through
the iterations, and represent the average over them. These
representations can be arbitrary representations of classifiers,
such as decision trees, neural networks and support vector
machines, for the problem at hand, such as network intrusion
detection, fraud detection, targeted marketing, credit risk rating,
among other things. For example, in the application to network
intrusion detection, each one of these representations could be a
decision tree that specifies a set of conditions on various
attributes of a network connection event, which together signal
certain types of network intrusion. Such representations can be
further applied to a new network connection to output a judgment on
whether the connection should be suspected, with reasonable
likelihood, of being some type of intrusion attempt, such as denial
of service or probing, and decisions on the appropriate course of
action can be based on this judgment.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] The foregoing and other objects, aspects and advantages will
be better understood from the following detailed description of a
preferred embodiment of the invention with reference to the
drawings, in which:
[0013] FIG. 1 is a block diagram showing the architecture of the
system implementing one of the methods according to the
invention;
[0014] FIG. 2 is a flow chart showing the logic of the method for
multi-class cost-sensitive learning implemented on the system shown
in FIG. 1;
[0015] FIG. 3 is a block diagram showing the architecture of the
system implementing another one of the methods according to the
invention;
[0016] FIG. 4 is a flow chart showing the logic of the method for
multi-class cost-sensitive learning implemented on the system shown
in FIG. 3; and
[0017] FIG. 5 is an example of a decision tree to illustrate the
process implemented by the invention.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
[0018] We begin by introducing some general concepts and notation
we use in the rest of the description.
Cost-Sensitive Learning and Related Problems
[0019] A popular formulation of the cost-sensitive learning problem
is via the use of a cost matrix. A cost matrix, $C(y_1, y_2)$,
specifies how much cost is incurred when misclassifying an example
labeled $y_2$ as $y_1$, and the goal of a cost-sensitive learning
method is to minimize the expected cost. Zadrozny and Elkan (B.
Zadrozny and C. Elkan, "Learning and making decisions when costs and
probabilities are both unknown", Proceedings of the Seventh
International Conference on Knowledge Discovery and Data Mining, pp.
204-213, ACM Press, 2001) noted that this formulation is not
applicable in situations in which misclassification costs depend on
particular instances, and proposed a more general form of cost
function, $C(x, y_1, y_2)$, that allows dependence on the instance
$x$. Here we adopt this general formulation, but note that in the
reasonable case in which the cost is minimized by the true label, we
can drop the redundant information $y_2$ and write $C(x, y_1)$ for
$C(x, y_1, y_2)$.
[0020] Once we allow the costs to depend on each example, it is
natural to assume that the costs are generated according to some
distribution, along with the examples, which leads to the following
formulation. In (multi-class) cost-sensitive classification,
examples of the form $(x, (C(x, y_1), \ldots, C(x, y_k)))$ are
drawn from a distribution $D$ over a domain $X \times (\mathbb{R}^+)^k$.
(Throughout, we will let $k$ denote $|Y|$.) Given a set of examples,
$S = (x_i, (C(x_i, y))_{y \in Y})^m$, the goal is to find a
classifier $h: X \to \{1, \ldots, k\}$ which minimizes the expected
cost of the classifier:
$$\arg\min_h E_D\left[C(x, h(x))\right] \qquad (1)$$
[0021] We can assume without loss of generality that the costs are
normalized so that
$$\forall x \in X, \quad \min_{y \in Y} C(x, y) = 0.$$
[0022] Note that with this normalization, the above formulation of
cost is equivalent to the common formulation in terms of
misclassification cost, i.e.,
$$\min_h E_D\left[C(x, h(x)) \, I\left(h(x) \neq \arg\min_y C(x, y)\right)\right]$$
[0023] Normally a learning method attempts to do this by minimizing
the empirical cost in the given training data, given some
hypothesis class $\mathcal{H}$:
$$\arg\min_{h \in \mathcal{H}} \sum_{(x, (C(x, y))_{y \in Y}) \in S} C(x, h(x)) \qquad (2)$$
[0024] We note that we sometimes use the empirical expectation
notation, $\hat{E}$, to refer to the averaged empirical cost, namely
$$\hat{E}_{(x, (C(x, y))_{y \in Y}) \sim S} \, C(x, h(x)) = \frac{1}{|S|} \sum_{(x, (C(x, y))_{y \in Y}) \in S} C(x, h(x))$$
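The normalization and the empirical cost defined above can be
illustrated with the following Python sketch. The function names, the
representation of a sample as a list of (instance, cost list) pairs,
and the toy data are assumptions made for this example only.

```python
def normalize_costs(cost_row):
    """Shift one instance's costs so that the cheapest label costs 0."""
    lowest = min(cost_row)
    return [c - lowest for c in cost_row]


def empirical_cost(h, sample):
    """Sum of C(x, h(x)) over the sample, as in Equation (2)."""
    return sum(costs[h(x)] for x, costs in sample)


def averaged_empirical_cost(h, sample):
    """Empirical expectation: the same sum divided by |S|."""
    return empirical_cost(h, sample) / len(sample)


def always_2(x):
    """A toy classifier that predicts label 2 for every instance."""
    return 2


# Toy sample: each entry is (x, [C(x, 0), C(x, 1), C(x, 2)]).
sample = [("u", [3.0, 5.0, 4.0]), ("v", [2.0, 1.0, 6.0])]
normalized = [(x, normalize_costs(costs)) for x, costs in sample]

print(empirical_cost(always_2, sample))
print(empirical_cost(always_2, normalized))
print(averaged_empirical_cost(always_2, sample))
```

Normalization subtracts a per-instance constant, so the empirical cost
of every classifier shifts by the same amount and the minimizer is
unchanged.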
[0025] As a building block of our method, we make use of methods
for solving importance weighted classification problems, which we
define below. In importance weighted classification, examples of
the form $(x, y, c)$ are drawn from a distribution $D$ over a domain
$X \times Y \times \mathbb{R}^+$. Given a set of examples
$S = (x, y, c)^m$, the goal is to find a classifier $h: X \to Y$
having minimum importance-weighted misclassification error:
$$\arg\min_h E_{(x, y, c) \sim D}\left[c \, I(h(x) \neq y)\right]$$
[0026] Again, usually, a learning method attempts to meet this goal
by minimizing the empirical weighted error in some hypothesis class
$\mathcal{H}$:
$$\arg\min_{h \in \mathcal{H}} \sum_{(x, y, c) \in S} c \, I(h(x) \neq y) \qquad (3)$$
[0027] We note that importance weighted classification can be
solved very well with a classification method, by use of weighted
rejection sampling techniques (see, again, Zadrozny, Langford, and
Abe, ibid.).
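A minimal Python sketch of weighted rejection sampling is given below
for illustration. It assumes that each example (x, y, c) is accepted
independently with probability c divided by the largest importance in
the sample; the function name and this particular normalizing constant
are assumptions of the sketch, and the cited reference may differ in
detail.

```python
import random


def rejection_sample(examples, rng):
    """Keep each (x, y, c) with probability c / max c; return (x, y) pairs.

    examples -- list of (x, y, c) tuples with non-negative importance c
    rng      -- a random.Random instance, passed in for reproducibility
    """
    c_max = max(c for _, _, c in examples)
    if c_max <= 0:
        return []  # no example carries any importance
    # rng.random() lies in [0, 1), so c == c_max is always accepted
    # and c == 0 is never accepted.
    return [(x, y) for x, y, c in examples if rng.random() < c / c_max]


examples = [("a", 0, 0.0), ("b", 1, 2.0), ("c", 0, 1.0)]
sub_sample = rejection_sample(examples, random.Random(0))
print(sub_sample)
```

The accepted (x, y) pairs carry no weights, so they can be passed to
an arbitrary classifier learner that has no notion of example
importance.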
Hypothesis Representations and Other Notation
[0028] In the above, we assumed that the hypothesis output by a
cost-sensitive learner is a functional hypothesis $h$, i.e.,
$h: X \to Y$. It is also possible to allow hypotheses that are
stochastic, namely
$$h: X \times Y \to [0, 1]$$
[0029] subject to the stochastic condition:
$$\forall x \in X, \quad \sum_{y \in Y} h(x, y) = 1.$$
[0030] With stochastic hypotheses, stochastic cost-sensitive
learning is defined as that of minimizing the following expected
cost:
$$\arg\min_h E_D\left[\sum_{y \in Y} C(x, y) \, h(x, y)\right]$$
[0031] Note that in the special case that $h$ is deterministic, this
formulation is equivalent to the definition given in Equation (1).
Also, this is a convexification of the standard objective function
that we usually expect a stochastic cost-sensitive learner to
minimize, i.e.,
$$E_D\left[C\left(x, \arg\max_{y \in Y} h(x, y)\right)\right]$$
[0032] We also consider a variant of cost-sensitive learning in
which relational hypotheses are allowed. Here relational hypotheses
$h$ are relations over $X \times Y$, i.e., $h: X \times Y \to \{0, 1\}$.
In general $h$ is neither functional nor stochastic, and in particular
it may violate the stochastic condition,
$$\sum_{y \in Y} h(x, y) = 1.$$
[0033] We often use the more general notation of $h(x, y)$, meant for
stochastic and relational hypotheses, even when $h$ is a
deterministic function from $X$ to $Y$. As notational shorthand, for a
stochastic hypothesis $h$, we write $h(x)$ to denote
$h(x, \cdot): Y \to [0, 1]$, and $C(x, h(x))$ to denote the expected
cost of its predictions, i.e.,
$$C(x, h(x)) = \sum_{y \in Y} h(x, y) \, C(x, y).$$
[0034] Finally, we note that we often write "$x \in S$" as a
shorthand for "$\exists y \in Y \; (x, y) \in S$".
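The distinction between stochastic and relational hypotheses, and the
expected-cost shorthand C(x, h(x)), can be illustrated with the
following Python sketch. A hypothesis is represented here, for a
single fixed instance x, as a list of values h(x, y) indexed by label;
this representation and the function names are assumptions made for
the example.

```python
def is_stochastic(h_row, tol=1e-9):
    """True if h(x, .) is a probability distribution over the labels."""
    in_range = all(0.0 <= p <= 1.0 for p in h_row)
    return in_range and abs(sum(h_row) - 1.0) <= tol


def expected_cost(h_row, cost_row):
    """C(x, h(x)) = sum over y of h(x, y) * C(x, y)."""
    return sum(p * c for p, c in zip(h_row, cost_row))


deterministic = [0.0, 1.0, 0.0]   # functional: always predicts label 1
stochastic = [0.5, 0.5, 0.0]      # a mixture over labels 0 and 1
relational = [1, 1, 0]            # relates x to two labels; not stochastic

print(is_stochastic(deterministic), is_stochastic(stochastic), is_stochastic(relational))
print(expected_cost(deterministic, [3.0, 5.0, 4.0]))
print(expected_cost(stochastic, [0.0, 2.0, 4.0]))
```

For a deterministic hypothesis the expected cost reduces to the cost
of the single predicted label, which is the sense in which the
stochastic formulation generalizes the functional one.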
The Methodology
[0035] Our methodology can be interpreted as a reduction, which
translates a multi-class cost-sensitive learning problem to a
classifier learning problem. That is, it allows us to solve the
cost-sensitive learning problem using an arbitrary classifier
learning method as a component algorithm. This methodology is
derived using two key ideas: 1) expanding data space and 2)
gradient boosting with stochastic ensembles. Theoretical
performance guarantee on a particular variant of the invented
methodology is derived using a convexification of the objective
function by the expected cost function. Below we will explain these
two key ideas by exhibiting a prototypical method based on
each.
[0036] A representative method in the prior art of iterative
methods for cost-sensitive learning is the method proposed in
Zadrozny, Langford and Abe, ibid., called costing. The weighting
scheme of this method exploits the following observation: For the
binary class case, the above formulation in terms of cost per
example, $C(x, y_2)$, can be further reduced to a formulation in
terms of a single importance number per example. This is possible
by associating a number indicating the importance of an example
$(x, y_2)$, given by $|C(x, 0) - C(x, 1)|$. This
conversion allows us to reduce the cost-sensitive learning problem
to a weighted classifier learning problem, but it has not been
known how that would be done for the multi-class scenario. It is
therefore natural to consider iterative weighting schemes, in which
example weights are iteratively modified in search for the optimal
weighting.
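For the binary case, the conversion to a single importance number per
example can be sketched in Python as follows. The function name is
hypothetical, and pairing the importance with the lower-cost label
(with ties resolved in favor of label 0) is an assumption of this
sketch rather than something stated in the text.

```python
def binary_to_importance(x, cost0, cost1):
    """Turn a binary cost-sensitive example into an (x, y, c) triple.

    cost0, cost1 -- C(x, 0) and C(x, 1)
    The importance is |C(x, 0) - C(x, 1)|; the label is the cheaper one
    (assumption: ties go to label 0).
    """
    label = 0 if cost0 <= cost1 else 1
    importance = abs(cost0 - cost1)
    return (x, label, importance)


print(binary_to_importance("x1", 0.0, 5.0))
print(binary_to_importance("x2", 3.0, 1.0))
```

With more than two labels there is no single difference of this kind
per example, which is why the construction does not carry over
directly to the multi-class scenario.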
[0037] A straightforward application of iterative weighting suffers
from an inability to directly take into account the different costs
associated with multiple ways of misclassifying examples. This
translates to non-convergence of the method in practice. We address
this issue by the technique of expanding data space, the first of
the two key ideas.
[0038] Data Space Expansion
[0039] The objective of minimizing the empirical cost on the
original training sample is equivalent to minimization on the
following expanded sample. Given a labeled sample $S$ consisting of
$(x, (C(x, y))_{y \in Y})$ of size $m$, we define a sample $S'$ of
size $mk$ for classification, where $k$ is the size of the label set,
i.e., $k = |Y|$, as follows.
$$S' = \left\{ \left((x, y), \; \max_{y' \in Y} C(x, y') - C(x, y)\right) \;\middle|\; (x, (C(x, y))_{y \in Y}) \in S, \; y \in Y \right\}$$
[0040] Minimizing the importance weighted loss,
$$\sum_{(x, y, c) \in S'} c \, I(h(x) \neq y)$$
[0041] on this new dataset also minimizes the cost on our original
sample. The algorithm DSE (Data Space Expansion) takes advantage of
this observation, which is summarized below as a theorem.
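[0042] THEOREM 1. With the definitions given in FIG. 3, a
hypothesis h minimizing the weighted classification error on the
expanded weighted sample S',
$$\hat{E}_{(x, y, c) \sim S'} \bigl[\, c \, I(h(x) \neq y) \,\bigr]$$
[0043] also minimizes the cost on the original sample S,
$$\hat{E}_{(x, \vec{C}) \sim S} \bigl[\, C_{h(x)} \,\bigr],$$
where $\vec{C} = (C_y)_{y \in Y}$ denotes the cost vector of x. This
follows from the chain of equalities
$$\begin{aligned}
\arg\min_h \hat{E}_{(x, y, c) \sim S'} \bigl[\, c \, I(h(x) \neq y) \,\bigr]
&= \arg\min_h \hat{E}_{(x, \vec{C}) \sim S} \sum_{y \in Y} \Bigl[ \bigl( \max_{y' \in Y} C_{y'} - C_y \bigr) I(h(x) \neq y) \Bigr] \\
&= \arg\max_h \hat{E}_{(x, \vec{C}) \sim S} \sum_{y \in Y} \bigl[\, C_y \, I(h(x) \neq y) \,\bigr] \\
&= \arg\max_h \hat{E}_{(x, \vec{C}) \sim S} \Bigl[ \Bigl( \sum_{y \in Y} C_y \Bigr) - C_{h(x)} \Bigr] \\
&= \arg\min_h \hat{E}_{(x, \vec{C}) \sim S} \bigl[\, C_{h(x)} \,\bigr]
\end{aligned}$$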
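The equivalence stated in Theorem 1 can be checked numerically on a toy problem. The following is a minimal sketch, not the DSE algorithm itself: the three-instance sample, its cost values, and the helper names `expand`, `weighted_error` and `original_cost` are all illustrative assumptions. It builds the expanded sample with weights max.sub.y' C(x,y') - C(x,y) and enumerates every deterministic hypothesis to compare the two minimizers.

```python
from itertools import product

def expand(sample):
    """Map each (x, costs) to k weighted examples (x, y, max_y' C(x,y') - C(x,y))."""
    expanded = []
    for x, costs in sample:
        c_max = max(costs)
        for y, c in enumerate(costs):
            expanded.append((x, y, c_max - c))
    return expanded

def weighted_error(h, expanded):
    # Importance weighted loss: sum of c * I(h(x) != y) over the expanded sample.
    return sum(c for x, y, c in expanded if h[x] != y)

def original_cost(h, sample):
    # Cost on the original sample: sum of C(x, h(x)).
    return sum(costs[h[x]] for x, costs in sample)

# Toy sample (made-up costs): instances are indices 0..2, labels are 0..2.
S = [(0, [0.0, 5.0, 2.0]), (1, [4.0, 0.0, 1.0]), (2, [3.0, 3.0, 0.0])]
S_prime = expand(S)

# Enumerate every deterministic hypothesis h: {0, 1, 2} -> {0, 1, 2}.
hypotheses = [dict(enumerate(labels)) for labels in product(range(3), repeat=3)]
best_weighted = min(hypotheses, key=lambda h: weighted_error(h, S_prime))
best_cost = min(hypotheses, key=lambda h: original_cost(h, S))
```

On this sample the two objectives differ only by a constant that does not depend on h, which is why both searches select the same hypothesis.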
[0044] Gradient Boosting with Stochastic Ensembles
[0045] Having described the idea of data space expansion, we now
combine it with the gradient boosting framework to arrive at our
main method. In particular, we cast the stochastic multiclass
cost-sensitive learning in the framework of gradient boosting (see
L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting algorithms
as gradient descent", Advances in Neural Information Processing
Systems 12, pp. 512-518, 2000), with the objective function defined
as the expected cost of the stochastic ensemble, obtained as a
mixture of individual hypotheses, on the expanded data set. As we
stated above, a functional hypothesis of the form h:X.fwdarw.Y can
be viewed as a special case of a stochastic hypothesis. We then
define a stochastic ensemble hypothesis H, given multiple
functional hypotheses, h.sub.t, t=1, . . . , T, as the conditional
distribution defined as the mixture of the component hypotheses,
namely,
$$\forall x \in X,\ \forall y \in Y: \quad H(x, y) = \frac{1}{T} \sum_{t=1}^{T} h_t(x, y)$$
[0046] Let H.sub.t denote the mixture hypothesis of the learning
procedure at round t. The procedure is to update its current
combined hypothesis by the mixture of the previous combined
hypothesis and a new hypothesis, i.e., by setting
H.sub.t(x,y)=(1-.beta.)H.sub.t-1(x,y)+.beta.h(x,y)
[0047] Thus, the expected cost of H.sub.t on x is
C(x, H.sub.t(x))=(1-.beta.)C(x, H.sub.t-1(x))+.beta.C(x,
h.sub.t(x))
[0048] Now, suppose that h predicts a particular label y for x,
i.e., h(x,y)=1, then
C(x, H.sub.t(x))=(1-.beta.)C(x, H.sub.t-1(x))+.beta.C(x,y)
[0049] If we now take a derivative of this function with respect to
.beta., we get
$$\frac{\partial\, C(x, H_t(x))}{\partial \beta} = C(x, y) - C(x, H_{t-1}(x))$$
[0050] Note that this is the difference between the average cost of
the current ensemble hypothesis and the new weak hypothesis
assigning probability one to the specified label.
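Because the expected cost of the mixture is linear in .beta., the derivative equals the finite-difference slope for any step size. The sketch below checks this at a single instance x; the cost values, the previous ensemble distribution and the helper names are made-up assumptions used only for illustration.

```python
def expected_cost(dist, costs):
    # Expected cost of a stochastic hypothesis at one instance x.
    return sum(p * c for p, c in zip(dist, costs))

def mix(prev, new, beta):
    # H_t(x, .) = (1 - beta) * H_{t-1}(x, .) + beta * h(x, .)
    return [(1 - beta) * p + beta * q for p, q in zip(prev, new)]

costs = [0.0, 4.0, 10.0]   # C(x, y) for y = 0, 1, 2 (illustrative values)
H_prev = [0.2, 0.3, 0.5]   # H_{t-1}(x, .), illustrative
h = [0.0, 1.0, 0.0]        # h assigns probability one to label y = 1

beta = 0.1
slope = (expected_cost(mix(H_prev, h, beta), costs)
         - expected_cost(H_prev, costs)) / beta
gradient = costs[1] - expected_cost(H_prev, costs)  # C(x, y) - C(x, H_{t-1}(x))
```

Here the gradient is negative, meaning that moving probability mass toward label 1 lowers the expected cost at this instance.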
[0051] We then take this derivative with respect to all data points
(x,y) in the expanded data set S', and thus the gradient is
mk-dimensional. We then expect the weak learner to find a
hypothesis h whose inner-product with the negative gradient is
large. That is, the output h of the weak learner seeks to maximize
the following sum.
$$-\langle h, \nabla C \rangle = \frac{1}{W} \sum_{x \in S} \sum_{y \in Y} \bigl( C(x, H_{t-1}(x)) - C(x, y) \bigr)\, h(x, y) \qquad (9)$$
[0052] where W denotes the sum of absolute values of the weights,
i.e.,
$$W = \sum_{x \in S} \sum_{y \in Y} \bigl| C(x, H_{t-1}(x)) - C(x, y) \bigr|.$$
[0053] Note that unlike the weights typically used in existing
boosting methods, the weights w.sub.x,y:=C(x, H.sub.t-1(x))-C(x,y)
can be negative, since y is not necessarily the best (least cost)
label. This means that the weak learner now receives both positive
and negative weights. While the minimization of weighted
misclassification with positive and negative weights makes perfect
sense as an optimization problem, its interpretation as a
classification problem is not immediately clear. In particular, it
prohibits the use of weighted sampling as a means of realizing the
weighted classification problem.
[0054] We deal with this problem by converting a relational version
of the weighted multi-class classification problem (i.e., of
finding h to maximize Equation 9) in each iteration to a weighted
binary classification problem. Specifically, we convert each
example pair (x,y) to ((x,y), l), and set l=1 if the weight on
(x,y) is positive, and l=0 if the weight is negative. The output
hypothesis of the binary classifier is in general relational, so it
is converted to a stochastic hypothesis by the procedure
Stochastic. (The particular way this procedure is defined is
motivated by the theoretical guarantee, which will be shown in the
next subsection.) The overall process, consisting of multiple
iterations of such a reduction, constitutes a reduction of the
stochastic multi-class cost-sensitive classification to binary
weighted classification.
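The conversion to a binary problem described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy costs and the ensemble distribution are hypothetical, and dropping pairs whose weight is exactly zero is our own choice, since the text only specifies the label for strictly positive and strictly negative weights.

```python
def gbse_weights(H_prev, costs):
    """w[x][y] = C(x, H_{t-1}(x)) - C(x, y), which may be negative."""
    weights = []
    for Hx, Cx in zip(H_prev, costs):
        ensemble_cost = sum(p * c for p, c in zip(Hx, Cx))
        weights.append([ensemble_cost - c for c in Cx])
    return weights

def to_binary_examples(weights):
    """Turn each pair (x, y) into ((x, y), l, |w|) with l = 1 iff w > 0."""
    examples = []
    for x, row in enumerate(weights):
        for y, w in enumerate(row):
            if w != 0:  # assumption: zero-weight pairs carry no information
                examples.append(((x, y), 1 if w > 0 else 0, abs(w)))
    return examples

costs = [[0.0, 4.0, 10.0], [3.0, 0.0, 1.0]]   # illustrative C(x, y)
H_prev = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]   # illustrative H_{t-1}(x, .)
binary = to_binary_examples(gbse_weights(H_prev, costs))
```

Positive meta labels mark exactly those labels whose cost is below the current ensemble's expected cost on that instance, so the non-negative weights .vertline.w.vertline. can be used directly for weighted sampling.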
[0055] With the foregoing definitions, we can now state our main
method, GBSE (Gradient Boosting with Stochastic Ensembles).
Theoretical Performance Guarantee on a Variant
[0056] It turns out that a strong theoretical performance guarantee
can be proved on a variant of this method, which we describe below.
We define the per label average cost, {tilde over (C)}(x, H(x)), of
a stochastic hypothesis H, in general, as follows.
$$\tilde{C}(x, H(x)) = \frac{1}{k} \sum_{y \in Y} H(x, y)\, C(x, y)$$
[0057] Note that, with this definition, the empirical loss (cost)
of H on the original sample S, C(H, S), can be expressed as the sum
of this per label cost over the expanded data set
S'={(x,y).vertline.x .epsilon. S, y .epsilon. Y}.
$$C(H, S) = \sum_{x} \sum_{y} H(x, y)\, C(x, y) = \sum_{x} \sum_{y} \tilde{C}(x, H(x))$$
[0058] The variant for which we prove our theoretical performance
guarantee is obtained by simply replacing the weight updating rule
of GBSE by the following:
w.sub.x,y={tilde over (C)}(x, H.sub.t-1(x))-C(x,y)
[0059] The resulting variant, which we call GBSE-T (Gradient
Boosting with Stochastic Ensembles--Theoretical version), is
summarized in FIG. 5.
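The per label average cost and the GBSE-T weight rule can be illustrated with a short sketch. The cost values, the ensemble distribution and the function names below are assumptions made for the example; the code only checks that summing the per label cost over the expanded data set recovers the empirical cost, and then computes the weights {tilde over (C)}(x, H.sub.t-1(x))-C(x,y).

```python
def per_label_average_cost(dist, costs):
    # C~(x, H(x)) = (1/k) * sum_y H(x, y) * C(x, y)
    return sum(p * c for p, c in zip(dist, costs)) / len(costs)

def gbse_t_weights(H_prev, costs):
    # w[x][y] = C~(x, H_{t-1}(x)) - C(x, y)
    return [[per_label_average_cost(Hx, Cx) - c for c in Cx]
            for Hx, Cx in zip(H_prev, costs)]

costs = [[0.0, 4.0, 8.0], [3.0, 0.0, 6.0]]   # illustrative C(x, y), k = 3
H = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]       # illustrative H(x, .)

# Empirical cost C(H, S) computed directly ...
total_cost = sum(p * c for Hx, Cx in zip(H, costs) for p, c in zip(Hx, Cx))
# ... and as the sum of the per label cost over all (x, y) pairs.
per_label_sum = sum(per_label_average_cost(Hx, Cx)
                    for Hx, Cx in zip(H, costs) for _ in Cx)
weights = gbse_t_weights(H, costs)
```

Compared with the GBSE rule, the first term of each weight is scaled down by k, so fewer labels receive a positive weight.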
[0060] We can show that GBSE-T has a boosting property given a
version of weak learning condition on the component classifier.
This weak learning condition, which we make precise below, is one
that is sensitive to class imbalance.
[0061] DEFINITION 1. We say that an algorithm A for the binary
importance weighted classification problem, as defined above,
satisfies the weak learning condition for a given classification
sample S=(x,y).sup.m, if for arbitrary distribution over S,
(w).sup.m, .SIGMA.w=1, when it is given S'=(x,y,w).sup.m as input,
its output h satisfies the following, for some fixed .gamma.>0:
$$\sum_{(x, y, w) \in S'} w \, I(h(x) = y) \;\geq\; \sum_{(x, y, w) \in S' :\, y = 0} w \;+\; \gamma \sum_{(x, y, w) \in S' :\, y = 1} w \qquad (12)$$
[0062] THEOREM 2. Suppose that the component learner A satisfies the
weak learning condition for the input sample S. Then, the output of
GBSE-T will converge to a stochastic ensemble hypothesis achieving
minimum expected cost on the (original) sample S. In particular, if
we set .alpha..sub.t=.alpha. for all t,
$$\sum_{x} \sum_{y} H_T(x, y)\, C(x, y) \;\leq\; \exp\Bigl\{ -\frac{\gamma \alpha}{k}\, T \Bigr\} \sum_{x} \sum_{y} H_0(x, y)\, C(x, y)$$
[0063] Proof
[0064] We first establish the following simple correspondence
between the weak learning conditions on the relational multi-class
classification problem that we wish to solve in each iteration, and
the binary classification problem that is given to the component
algorithm to solve it.
[0065] DEFINITION 2. Let S be a weighted sample of the form
S=(x,y,w).sup.m, where weights w can be both positive and negative.
Then define a transformed sample S' from S by S'=((x,y), l,
.vertline.w.vertline.).sup.m where l=I(w.gtoreq.0).
[0066] 1. The relational weighted multi-class classification
problem for S is to find a relational hypothesis
h:X.times.Y.fwdarw.{0, 1} that maximizes the following sum:
$$a(h, S) = \frac{1}{W} \sum_{(x, y, w) \in S} w \, h(x, y), \quad \text{where } W = \sum_{(x, y, w) \in S} |w|.$$
[0067] 2. The weighted binary classification problem for S' is to
find a hypothesis h':X.times.Y.fwdarw.{0, 1} that maximizes the
following weighted classification accuracy:
$$a'(h', S') = \frac{1}{W} \sum_{((x, y), l, w) \in S'} w \, I(h'(x, y) = l)$$
[0068] LEMMA 1. Assume the notation of Definition 2. Then, for
arbitrary .epsilon.>0, h satisfies the following condition on
the relational multi-class classification problem for S:
a(h,S).gtoreq..epsilon.
[0069] if and only if (the same) h satisfies the corresponding
condition on the transformed binary classification problem for S':
$$a'(h, S') \;\geq\; \frac{\sum_{((x, y), l, w) \in S' :\, l = 0} w}{W} + \epsilon$$
[0070] Proof of Lemma 1
$$\begin{aligned}
W a'(h, S') &= \sum_{((x, y), l, |w|) \in S'} |w| \, I(h(x, y) = l) \\
&= \sum_{w \geq 0} w \, I(h(x, y) = 1) + \sum_{w < 0} (-w) \, I(h(x, y) = 0) \\
&= \sum_{w \geq 0} w \, h(x, y) + \sum_{w < 0} (-w) \bigl( 1 - h(x, y) \bigr) \\
&= \sum_{(x, y, w) \in S} w \, h(x, y) + \sum_{w < 0} |w| \\
&= W a(h, S) + \sum_{(x, y, w) \in S :\, w < 0} |w|
\end{aligned}$$
[0071] Hence the lemma follows.
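The identity behind Lemma 1, namely that the binary weighted accuracy on S' equals the relational objective on S plus the normalized mass of the negative weights, can be checked numerically. The sketch below uses randomly generated signed weights and a random relational hypothesis; the sample size, the seed and the variable names are arbitrary assumptions.

```python
import random

rng = random.Random(0)

# Signed weighted sample S over pairs (x, y), and a random relational hypothesis h.
S = [((x, y), rng.uniform(-1.0, 1.0)) for x in range(5) for y in range(3)]
h = {pair: rng.randint(0, 1) for pair, _ in S}

W = sum(abs(w) for _, w in S)

# Relational multi-class objective a(h, S).
a = sum(w * h[pair] for pair, w in S) / W

# Transformed binary sample S' with l = I(w >= 0) and weight |w| (Definition 2).
S_prime = [(pair, 1 if w >= 0 else 0, abs(w)) for pair, w in S]

# Weighted binary classification accuracy a'(h, S').
a_prime = sum(aw for pair, l, aw in S_prime if h[pair] == l) / W

# Normalized total weight of the negative (l = 0) examples.
negative_mass = sum(aw for _, l, aw in S_prime if l == 0) / W
```

Since `a_prime` equals `a + negative_mass` for every h, requiring a(h,S) to be at least .epsilon. is the same as requiring a'(h,S') to exceed the all-negative baseline by .epsilon..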
[0072] Proof of Theorem 2
[0073] First, note that applying Stochastic to h.sub.t can increase the
expected cost only for x's such that
.vertline.{y.vertline.h.sub.t(x,y)=1}.vertline.=0, and for such
x's the cost of f.sub.t equals that of H.sub.t-1 by the definition of
Stochastic. Hence, the empirical cost of f.sub.t on the original sample
S, C(f.sub.t, S), satisfies the following:
$$C(f_t, S) - C(h_t, S) \;\leq\; \sum_{x :\, \forall y\, h_t(x, y) = 0} \; \sum_{y} \tilde{C}(x, H_{t-1}(x)) \qquad (13)$$
[0074] Now recall that the expected empirical cost of H.sub.t
equals the following, where we drop the subscript t from
.alpha..sub.t.
$$\begin{aligned}
C(H_t, S) &= \sum_{x, y} (1 - \alpha)\, H_{t-1}(x, y)\, C(x, y) + \alpha \sum_{x, y} f(x, y)\, C(x, y) \\
&= \sum_{x, y} (1 - \alpha)\, \tilde{C}(x, H_{t-1}(x)) + \alpha \sum_{x, y} f(x, y)\, C(x, y) \qquad (14)
\end{aligned}$$
[0075] Hence, by combining Equation 13 and Equation 14, we can show
the following bound on the decrease in empirical cost in each
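iteration:
$$\begin{aligned}
C(H_{t-1}, S) - C(H_t, S)
&= \alpha \sum_{x} \Bigl( \sum_{y} \tilde{C}(x, H_{t-1}(x)) - \sum_{y} f(x, y)\, C(x, y) \Bigr) \\
&= \alpha \sum_{x} \Bigl( \sum_{y} \tilde{C}(x, H_{t-1}(x)) - \sum_{y} h(x, y)\, C(x, y) \Bigr)
 + \alpha \sum_{x} \Bigl( \sum_{y} h(x, y)\, C(x, y) - \sum_{y} f(x, y)\, C(x, y) \Bigr) \\
&\geq \alpha \sum_{x} \Bigl( \sum_{y} \tilde{C}(x, H_{t-1}(x)) - \sum_{y} h(x, y)\, C(x, y) \Bigr)
 - \alpha \Bigl( \sum_{x :\, \forall y\, h(x, y) = 0} \; \sum_{y} \tilde{C}(x, H_{t-1}(x)) \Bigr) \\
&\geq \alpha \Bigl( \sum_{x} \Bigl( \sum_{y :\, h(x, y) = 1} h(x, y) \bigl( \tilde{C}(x, H_{t-1}(x)) - C(x, y) \bigr)
 + \sum_{y :\, h(x, y) = 0} \tilde{C}(x, H_{t-1}(x)) \Bigr)
 - \sum_{x :\, \forall y\, h(x, y) = 0} \; \sum_{y} \tilde{C}(x, H_{t-1}(x)) \Bigr) \\
&= \alpha \Bigl( \sum_{x} \sum_{y} h(x, y) \bigl( \tilde{C}(x, H_{t-1}(x)) - C(x, y) \bigr)
 + \Bigl( \sum_{x} \sum_{y :\, h(x, y) = 0} \tilde{C}(x, H_{t-1}(x))
 - \sum_{x :\, \forall y\, h(x, y) = 0} \; \sum_{y} \tilde{C}(x, H_{t-1}(x)) \Bigr) \Bigr) \\
&\geq \alpha \sum_{x} \sum_{y} h(x, y) \bigl( \tilde{C}(x, H_{t-1}(x)) - C(x, y) \bigr) \\
&\geq \alpha \gamma \sum_{x} \; \sum_{y :\, \tilde{C}(x, H_{t-1}(x)) - C(x, y) > 0} \bigl( \tilde{C}(x, H_{t-1}(x)) - C(x, y) \bigr) \\
&\geq \alpha \gamma \sum_{x} \tilde{C}(x, H_{t-1}(x)) \\
&= \frac{\alpha \gamma}{k}\, C(H_{t-1}, S)
\end{aligned}$$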
[0076] In the above derivation, the second to last inequality
follows from the weak learning condition and applying Lemma 1 with
weights {tilde over (C)}(x, H.sub.t-1(x))-C(x,y). The last
inequality follows from the fact that the weights are normalized so
that the minimum achievable cost is zero for all x. Noting that the
sum of these weights is positive whenever the current ensemble
hypothesis is sub-optimal, this guarantees a positive progress in
each iteration unless optimality is achieved. Since the expected
empirical cost function as defined by .SIGMA..sub.x.SIGMA..sub.y
F(x,y) C(x,y) is convex (in fact linear), this implies convergence
to the global optimum. Noting that in each iteration the empirical
cost is reduced at least by a factor of
$$1 - \frac{\gamma \alpha}{k},$$
[0077] the theorem follows.
[0078] Note that at earlier iterations, the binary classifier used
as the component learner is likely to be given weighted sample with
balanced positive and negative examples. As the number of
iterations increases and progress is made, however, it will receive
samples that are increasingly more negative. (This is because the
positive examples correspond to labels that can further improve the
current performance.) It therefore becomes easier to attain high
weighted accuracy by simply classifying all examples to be
negative. The weak learning condition of Equation 12 appropriately
deals with this issue, as it requires that the weak learner achieve
better weighted accuracy than that attainable by assigning all
examples to the negative class.
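The role of the weak learning condition on imbalanced samples can be made concrete with a small check. The sketch below assumes that Equation 12 requires the weighted mass of correctly classified examples to be at least the total weight of the negative examples plus .gamma. times the total weight of the positive examples; the function names, the sample and the value of .gamma. are illustrative assumptions.

```python
def correct_mass(preds, labels, weights):
    # Total weight of the examples that the hypothesis classifies correctly.
    return sum(w for p, l, w in zip(preds, labels, weights) if p == l)

def satisfies_weak_learning(preds, labels, weights, gamma):
    # Assumed reading of Equation 12: beat the all-negative baseline by a
    # gamma fraction of the positive weight.
    neg = sum(w for l, w in zip(labels, weights) if l == 0)
    pos = sum(w for l, w in zip(labels, weights) if l == 1)
    return correct_mass(preds, labels, weights) >= neg + gamma * pos

# A skewed weighted sample: 80% of the weight sits on negative examples.
labels = [1, 1, 0, 0, 0]
weights = [0.1, 0.1, 0.3, 0.3, 0.2]

all_negative = [0, 0, 0, 0, 0]   # trivial classifier, weighted accuracy 0.8
informative = [1, 0, 0, 0, 0]    # also recovers one of the positives

trivial_ok = satisfies_weak_learning(all_negative, labels, weights, gamma=0.1)
informative_ok = satisfies_weak_learning(informative, labels, weights, gamma=0.1)
```

The trivial classifier fails the condition despite its high weighted accuracy, whereas a classifier that recovers part of the positive weight passes it.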
Variations
[0079] In addition to the two variants of the Gradient Boosting
with Stochastic Ensembles method presented above, namely GBSE and
GBSE-T, other related variations are possible. For example, in one
variant, the weighted sampling can be done in two steps; the
instance is sampled in the first step according to a probability
proportional to
max.sub.y .vertline.w.sub.x,y.vertline.
[0080] and the label y is then chosen with a probability
proportional to
.vertline.w.sub.x,y.vertline.
[0081] In yet another variant, the weighted sampling can be done
in two steps; the instance is sampled in the first step according
to the same probability as above, and for the chosen instance,
examples are deterministically added for all possible labels.
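The first two-step variant can be sketched as below. This is an illustration under assumptions: we read the instance-level probability as proportional to the largest absolute weight over the labels of that instance, we sample with replacement using `random.choices` from the Python standard library, and the function name and the weight values are hypothetical.

```python
import random

def two_step_sample(weights, n, rng):
    """Draw n binary examples ((x, y), l) by sampling x first, then y given x."""
    instance_weights = [max(abs(w) for w in row) for row in weights]
    instances = range(len(weights))
    sample = []
    for _ in range(n):
        # Step 1: instance x with probability proportional to max_y |w[x][y]|.
        x = rng.choices(instances, weights=instance_weights, k=1)[0]
        # Step 2: label y with probability proportional to |w[x][y]|.
        label_weights = [abs(w) for w in weights[x]]
        y = rng.choices(range(len(label_weights)), weights=label_weights, k=1)[0]
        sample.append(((x, y), 1 if weights[x][y] > 0 else 0))
    return sample

# Illustrative weights; instance 1 has all-zero weights and is never drawn.
w = [[2.0, -2.0, -8.0], [0.0, 0.0, 0.0], [-2.5, 0.5, -0.5]]
drawn = two_step_sample(w, n=200, rng=random.Random(0))
```

For the second variant, Step 2 would instead append one example for every label of the chosen instance.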
Implementation
[0082] Referring now to FIG. 1, there is shown a system on which a
method for multi-class, cost-sensitive learning according to the
invention may be implemented. This system comprises a multi-class
cost-sensitive learning top control module 1 which controls the
overall control flow, making use of various sub-components of the
system. A learning algorithm storage module 2 stores a
representation of an algorithm for classification learning. An
arbitrary algorithm for classification can be used here. For
example, the learning algorithm can be a decision tree learning
algorithm, a naive Bayes method, a logistic regression method, or a
neural network. The model output module 3 stores the
models obtained as a result of applying the learning algorithm
stored in module 2 to training data given by weighted sampling
module 4 and outputs a final model by aggregating these models. The
weighted sampling module 4 accesses the data stored in data storage
module 7, samples a relatively small subset of the data with
acceptance probability determined using the example weights, and
passes the obtained sub-sample to module 1. The weight update
module 5 updates the example weights for sampling using a
particular function determined by the current weights and current
models. The model update module 6 updates the current model using
the model's output in the previous iterations stored in the current
model storage module 8 and the output model of the current
iteration output by module 3 and stores the resulting updated model
in module 8.
[0083] FIG. 2 shows a flow diagram of the process implemented in
the system of FIG. 1. The first three steps initialize the process.
In Step 21, expanded data T is initialized using the input data S.
In Step 22, H.sub.0 is initialized by setting for all (x,y) in T.
Finally, in Step 23, the weights for all (x,y) in T are initialized.
The iteration begins in the decision block of Step 24. A test is
made to determine if i=t. If not, Step 25 performs the computation
for all (x,y)
$$w(x, y) = \Bigl( \sum_{y' \in Y} H_{t-1}(x, y')\, C(x, y') \Bigr) - C(x, y)$$
[0084] The decision block in Step 26 determines if there is more
data in T or a STOP condition has been met. If not, in Step 27,
(x,y) is sampled from T and accepted with a probability
proportional to .vertline.w(x,y).vertline.. Next, in Step 28, if
accepted, ((x,y), I(w(x,y)>0)) is added to sub-sample
T.sup.t. A return is then made to the decision block in Step 26.
When there is no more data in T or a STOP condition has been met,
the process goes to Step 29 where the learning algorithm is run on
T.sup.t to obtain model h.sub.t. Next, in Step 30, f.sub.t is set
equal to stoch(h.sub.t). Then, in Step 31, .alpha..sub.t is chosen
and H.sub.t is set equal to
(1-.alpha..sub.t)H.sub.t-1+.alpha..sub.tf.sub.t. The index i is
incremented at Step 32, and a return is then made to the decision
block in Step 24. If i=t, then in Step 33 the final model H.sub.t
is output.
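The flow of FIG. 2 can be sketched end to end as follows. This is a minimal sketch under several assumptions that are not fixed by the description above: H.sub.0 is taken to be uniform over the labels, acceptance probabilities are normalized by the largest absolute weight, .alpha..sub.t is set to 1/t, the component learner is replaced by a hypothetical lookup table (`table_learner`), and `stochastic` spreads probability uniformly over the labels that h.sub.t accepts, falling back to H.sub.t-1 when it accepts none.

```python
import random

def expected_cost(H, costs):
    # Empirical cost of a stochastic hypothesis: sum_x sum_y H(x, y) * C(x, y).
    return sum(p * c for Hx, Cx in zip(H, costs) for p, c in zip(Hx, Cx))

def table_learner(sub_sample):
    """Hypothetical component learner: majority meta label per (x, y) pair."""
    votes = {}
    for pair, label in sub_sample:
        votes[pair] = votes.get(pair, 0) + (1 if label == 1 else -1)
    return lambda x, y: 1 if votes.get((x, y), 0) > 0 else 0

def stochastic(h, H_prev, k):
    """Assumed form of Stochastic: uniform over accepted labels, else H_{t-1}."""
    f = []
    for x, prev_row in enumerate(H_prev):
        chosen = [y for y in range(k) if h(x, y) == 1]
        if chosen:
            f.append([1.0 / len(chosen) if y in chosen else 0.0 for y in range(k)])
        else:
            f.append(list(prev_row))
    return f

def gbse(costs, rounds, rng):
    m, k = len(costs), len(costs[0])
    H = [[1.0 / k] * k for _ in range(m)]            # Step 22 (assumed uniform)
    for t in range(1, rounds + 1):
        # Step 25: w(x, y) = expected cost of H_{t-1} on x minus C(x, y).
        w = [[sum(p * c for p, c in zip(H[x], costs[x])) - costs[x][y]
              for y in range(k)] for x in range(m)]
        w_max = max(abs(v) for row in w for v in row)
        if w_max == 0:
            break
        # Steps 26-28: rejection sampling with acceptance probability |w| / w_max.
        sub = [((x, y), 1 if w[x][y] > 0 else 0)
               for x in range(m) for y in range(k)
               if rng.random() < abs(w[x][y]) / w_max]
        h = table_learner(sub)                       # Step 29
        f = stochastic(h, H, k)                      # Step 30
        alpha = 1.0 / t                              # Step 31 (assumed schedule)
        H = [[(1 - alpha) * p + alpha * q for p, q in zip(H[x], f[x])]
             for x in range(m)]
    return H

costs = [[0.0, 5.0, 2.0], [4.0, 0.0, 1.0], [3.0, 3.0, 0.0], [1.0, 0.0, 6.0]]
H_final = gbse(costs, rounds=30, rng=random.Random(0))
```

With this particular table learner, a label is accepted only when its cost is below the current expected cost on that instance, so the empirical cost cannot increase from one round to the next. A real component learner such as a decision tree would generalize across instances instead of memorizing pairs.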
[0085] FIG. 3 shows a system on which another method for
multi-class, cost-sensitive learning according to the invention may
be implemented. This system is similar to that shown in FIG. 1 and
comprises a multi-class cost-sensitive learning top control module
1 which controls the overall control flow, making use of various
sub-components of the system, a learning algorithm storage module
2, which stores a representation of an algorithm for classification
learning, a model output module 3, which stores the models obtained
as a result of applying the learning algorithm stored in module 2
to training data given by weighted sampling module 4 and outputs a
final model by aggregating these models, and a weighted sampling
module 4, which accesses the data stored in data storage module 7,
samples a relatively small subset of the data with acceptance
probability determined using the example weights, and passes the
obtained sub-sample to module 1. The weight calculation module 5'
replaces the weight update module 5, which updates the example
weights for sampling using a dynamically changing weighting scheme.
The model update module 6 updates the current model using the
model's output in the previous iterations stored in the current
model storage module 8 and the output model of the current
iteration output by module 3 and stores the resulting updated model
in module 8.
[0086] FIG. 4 shows a flow diagram of the process implemented in
the system of FIG. 3. The first two steps initialize the process. In
Step 41, expanded data T is initialized using the input data S. In
Step 42, the weights for all (x,y) in T are set. The iteration
begins in the decision block of Step 43. A test is made to
determine if i=t. If not, a test is made in Step 44 to determine if
there is no more data in T or a stop condition has been met. If
not, Step 45 samples (x,y) from T and accepts (x,y) with
probability proportional to w(x,y). If accepted, (x,y) is added to
sub-sample T' in Step 46. The process then loops back to decision
block in Step 44 until there is either no more data in T or a stop
condition has been met. At this point, the learning algorithm is
run in Step 47 on T' to obtain a model h.sub.t. In Step 48, .alpha.
is chosen so that when i=1, .alpha.=0 and
H.sub.t=(1-.alpha..sub.t)H.sub.t-1+.alpha..sub.tf.sub.t. The
index i is incremented at Step 49, and a return is then made to the
decision block in Step 43. If i=t, then in Step 50 the final model
H.sub.t is output.
[0087] As a concrete example of applying the method of the
invention to a real world problem, we describe an application to
network intrusion detection. Network intrusion detection has
recently become a prototypical application problem for
multi-class, cost-sensitive learning. The multi-class aspect is
essential because in this application there is typically more than
one type of intrusion to detect, such as probing and denial of
service. The cost-sensitive aspect is important because vastly
different costs are associated with different types of
misclassification (e.g., false negatives are usually an order of
magnitude more costly than false positives) and it is absolutely critical
that any learning method used to derive an intrusion detection rule
is sensitive to this cost structure.
[0088] A network intrusion detection system based on the method and
system of the invention for multi-class, cost-sensitive learning
consists of the following steps:
[0089] 1) Convert past network connection data to a set of feature
vectors, by mapping information on a network connection to a
feature vector.
[0090] 2) Label each of these vectors with known labels, such as
"normal", "probe", "denial of service", or specific types of
intrusions.
[0091] 3) Apply the method of the invention on the above data set,
and obtain a classification rule.
[0092] 4) Convert new network connection data to feature vectors,
apply the above classification rule to them, and flag those
connections corresponding to feature vectors that are classified as
different types of "intrusions" as such.
[0093] A typical set of features used to transform connection data
into a well-defined feature vector is that used in the network
intrusion data set known as "KDD CUP 99" data, which is publicly
available. Here is the list of features in this data set (given in
three separate tables).
Basic Features of Individual TCP Connections
feature name | description | type
duration | length (number of seconds) of the connection | continuous
protocol_type | type of protocol, e.g., TCP, UDP, etc. | discrete
service | network service on the destination, e.g., http, telnet, etc. | discrete
src_bytes | number of data bytes from source to destination | continuous
dst_bytes | number of data bytes from destination to source | continuous
flag | normal or error status of the connection | discrete
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete
wrong_fragment | number of "wrong" fragments | continuous
urgent | number of urgent packets | continuous
[0094]
Content Features Within a Connection Suggested by Domain Knowledge
feature name | description | type
hot | number of "hot" indicators | continuous
num_failed_logins | number of failed login attempts | continuous
logged_in | 1 if successfully logged in; 0 otherwise | discrete
num_compromised | number of "compromised" conditions | continuous
root_shell | 1 if root shell is obtained; 0 otherwise | discrete
su_attempted | 1 if "su-root" command attempted; 0 otherwise | discrete
num_root | number of "root" accesses | continuous
num_file_creations | number of file creation operations | continuous
num_shells | number of shell prompts | continuous
num_access_files | number of operations on access control files | continuous
num_outbound_cmds | number of outbound commands in an ftp session | continuous
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete
[0095]
Traffic Features Computed Using a Two-Second Time Window
feature name | description | type
count | number of connections to the same host as the current connection in the past two seconds | continuous
Note: The following features refer to these same-host connections.
serror_rate | % of connections that have "SYN" errors | continuous
rerror_rate | % of connections that have "REJ" errors | continuous
same_srv_rate | % of connections of the same service | continuous
diff_srv_rate | % of connections of different services | continuous
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
Note: The following features refer to these same-service connections.
srv_serror_rate | % of connections that have "SYN" errors | continuous
srv_rerror_rate | % of connections that have "REJ" errors | continuous
srv_diff_host_rate | % of connections to different hosts | continuous
[0096] As a result of applying the multi-class, cost-sensitive
learning method of the invention to a data set consisting of these
features and the corresponding labels, using a decision tree
algorithm as the "classification learning algorithm" stored in
Module 2 of FIG. 1, one obtains, as the classification rule, a
voting function over a number of decision trees, such as the tree
shown in FIG. 5.
[0097] The system diagram of FIG. 1 and the flow chart of FIG. 2
illustrate a preferred embodiment of the invention, which
corresponds to the method "GBSE" described herein. However, it will
be understood by those skilled in the art that the method "DSE",
also described herein, may be used in the alternative. The main
difference between DSE and GBSE is that in DSE, the sampling
weights remain unchanged throughout all iterations. Consequently,
the modules and functionalities that are related to weight
updating are unnecessary.
Experimental Evaluation
[0098] We use the C4.5 decision tree learner described by J.
Quinlan in C4.5: Programs for Machine Learning, Morgan Kaufmann
(1993), as the base classifier learning method, because it is a
standard for empirical comparisons and it was used as the base
learner by Domingos for the MetaCost method (see, P. Domingos,
"MetaCost: A general method for making classifiers cost sensitive",
Proceedings of the Fifth International Conference on Knowledge
Discovery and Data Mining, pp. 155-164, ACM Press, 1999).
[0099] We compare our methods against three representative methods:
Bagging (see L. Breiman, "Bagging predictors", Machine Learning,
24(2):123-140, 1996), Averaging cost (see, P. Chan and S. Stolfo,
"Toward scalable learning with non-uniform class and cost
distributions", Proceedings of the Fourth International Conference
on Knowledge Discovery and Data Mining, pp. 164-168, 1998), and
MetaCost (see, Domingos, ibid.). The Averaging cost method was also
used for comparison in Domingos, ibid. Note that Bagging is a
cost-insensitive learning method. Here we give a brief description
of these methods, and refer the reader to Breiman, ibid., and
Domingos, ibid., for the details.
[0100] Bagging obtains multiple sub-samples by sampling with
replacement, feeds them to the base learner (C4.5), and takes the
average over the ensemble of output hypotheses.
[0101] Averaging Cost (AvgCost) obtains a subsample by weighted
sampling with weights defined as the average cost for each x, and
then feeds it to the base learner (C4.5).
[0102] MetaCost uses bagging to obtain an ensemble of hypotheses,
uses the ensemble to estimate the class probabilities, and then
outputs a hypothesis that minimizes the expected risk with respect
to these estimates.
[0103] There are some deviations from these methods in our
implementation, which we clarify below. The main deviation is that
we use rejection sampling for all methods, while other sampling
schemes such as resampling with replacement are used in the
original methods. We do this for two reasons: (1) inadequacy of
resampling with replacement, especially for C4.5, has been noted by
various authors (see, for example, B. Zadrozny, J. Langford, and N.
Abe, "Cost-sensitive learning by cost-proportionate example
weighting", Proceedings of the Third IEEE International Conference
on Data Mining, pp. 435-442, 2003); and (2) since our methods use
rejection sampling, we do the same for the other methods for
fairness of comparison. We stress that this deviation should only
improve their performance. Another deviation is that we use a
variant of MetaCost that skips the last step of learning a
classifier on a relabeled training data set. It has been observed
that this variant performs at least as well as MetaCost, in terms
of cost minimization. (This variant has been called BagCost by D.
Margineantu in Methods for Cost-Sensitive Learning, PhD thesis,
Department of Computer Science, Oregon State University, Corvallis,
Oreg., 2001.) Also, in our implementation of AvgCost, we perform
weighted sampling multiple times to obtain an ensemble of
hypotheses, then output their average as the final hypothesis. We
note that, due to our normalization assumption that the minimum
cost for each instance x is always zero, our version of AvgCost is
identical to a more sophisticated variant in which the difference
between the average cost and the minimum cost is used for sampling
weights. Our experience shows that this variant of AvgCost performs
better than the original method.
[0104] The methods were applied to five benchmark datasets
available from the UCI machine learning repository (C. L. Blake and
C. J. Merz, "UCI repository of machine learning databases",
Department of Information and Computer Sciences, University of
California, Irvine, Calif., 1998) and one dataset from the UCI KDD
archive (S. D. Bay, "UCI archive", Department of Information and
Computer Sciences, University of California, 2000). These datasets
were selected by the criteria of having approximately 1,000 examples
or more and being multiclass problems. A summary of these
datasets is given in Table 1.
TABLE 1. Data set characteristics: data size, number of classes,
and the ratio of the frequency of the least common class to that of
the most common class.
Dataset | # of examples | # of classes | Class ratio
Annealing | 898 | 5 | 0.01316
KDD-99 | 197710 | 5 | 0.0001278
Letter | 20000 | 26 | 0.9028
Satellite | 6435 | 6 | 0.4083
Solar flare | 1389 | 7 | 0.002562
Splice | 3190 | 3 | 0.4634
[0105] Except for the KDD-99 dataset, these datasets do not have
standard misclassification costs associated with them. For this
reason, we follow Domingos and generate cost matrices according to
a model that gives higher costs for misclassifying a rare class as
a frequent one, and lower costs for the reverse. (Note therefore that our
experiments do not exploit the full generality of the
instance-dependent cost formulation presented above.) This reflects
a situation that is found in many practical data mining
applications, including direct marketing and fraud detection, where
the rare classes are the most valuable to identify correctly.
[0106] Our cost model is as follows: Let {circumflex over
(P)}(y.sub.1) and {circumflex over (P)}(y.sub.2) be the empirical
probabilities of occurrence of classes y.sub.1 and y.sub.2 in the
training data. We choose the non-diagonal entries of the cost
matrix C(y.sub.1, y.sub.2), y.sub.1.noteq.y.sub.2 with uniform
probability from the interval [0,2000 {circumflex over
(P)}(y.sub.1)/{circumflex over (P)}(y.sub.2)]. In Domingos, ibid.,
the diagonal entries were then chosen from the interval [0,1000],
which often leads to cost matrices in which the correct label is
not the least costly one. Besides being unreasonable (see C. Elkan,
"Magical thinking in data mining: Lessons from coil challenge
2000", Proceedings of the Seventh International Conference on
Knowledge Discovery and Data Mining, pp. 426-431, ACM Press,
1999), these cost matrices can give an unfair advantage to
cost-sensitive methods over cost-insensitive ones. We therefore set
the diagonal entries to be identically zero, which is consistent
with our normalization assumption.
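The cost model can be sketched in a few lines. The function name `make_cost_matrix` and the toy label list are assumptions; we also read the first index y.sub.1 as the predicted class and the second index y.sub.2 as the true class, which is our interpretation of the statement that misclassifying a rare class as a frequent one should be costly.

```python
import random
from collections import Counter

def make_cost_matrix(labels, rng):
    """Off-diagonal C(y1, y2) ~ Uniform[0, 2000 * P(y1) / P(y2)]; diagonal 0."""
    counts = Counter(labels)
    classes = sorted(counts)
    n = len(labels)
    p = {c: counts[c] / n for c in classes}   # empirical class probabilities
    return {
        (y1, y2): 0.0 if y1 == y2 else rng.uniform(0.0, 2000.0 * p[y1] / p[y2])
        for y1 in classes
        for y2 in classes
    }

# Illustrative training labels: class "a" is frequent, class "b" is rare.
train_labels = ["a"] * 90 + ["b"] * 10
C = make_cost_matrix(train_labels, random.Random(0))
```

With these labels the entry C[("a", "b")] is drawn from [0, 18000], while C[("b", "a")] is drawn from an interval whose upper end is roughly 222, so errors on the rare class tend to be far more expensive.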
[0107] In all experiments, we randomly select 2/3 of the examples
in the dataset for training and use the remaining 1/3 for testing.
Also, for each training/test split we generate a different cost
matrix according to the rules above. Thus, the standard deviations
that we report reflect both variations in the data and in the
misclassification costs.
[0108] We remark on certain implementation details of the proposed
learning methods in our experimentation. First, we note that in all
of the methods used for comparison, C4.5 was used as the component
algorithm, and the final hypothesis is expressed as an ensemble of
output decision trees of C4.5. Its output hypothesis is therefore
also an ensemble of decision trees. Next, the choice of the mixture
weight .alpha..sub.t was unspecified in the algorithm descriptions.
The choice of .alpha..sub.t was set at 1/t for most methods.
[0109] The results of these experiments are summarized in Tables 2
and 3.
TABLE 2. Experimental results: the average cost and standard error.
Dataset | Bagging | AvgCost | MetaCost | DSE | GBSE
Annealing | 1059 ± 174 | 127.4 ± 12.2 | 206.8 ± 42.8 | 127.1 ± 14.9 | 33.72 ± 4.29
Solar | 5403 ± 397 | 237.8 ± 37.5 | 5317 ± 390 | 110.9 ± 28.7 | 48.17 ± 9.52
KDD-99 | 319.4 ± 42.2 | 42.43 ± 7.95 | 49.39 ± 9.34 | 46.68 ± 10.16 | 1.69 ± 0.78
letter | 151.0 ± 2.58 | 91.90 ± 1.36 | 129.6 ± 2.44 | 114.0 ± 1.43 | 84.63 ± 2.24
Splice | 64.19 ± 5.25 | 60.78 ± 3.65 | 49.95 ± 3.05 | 135.5 ± 14 | 57.50 ± 4.38
Satellite | 189.9 ± 9.57 | 107.8 ± 5.95 | 104.4 ± 6.43 | 116.8 ± 6.28 | 93.05 ± 5.57
[0110]
TABLE 3. Experimental results: the average data size used by each
method in 30 iterations, and standard error.
Dataset | Bagging | AvgCost | MetaCost | DSE | GBSE
Annealing | 11991 ± 13.1 | 1002.8 ± 183 | 11987 ± 9.84 | 3795.5 ± 688 | 1260.2 ± 224
Solar | 18499 ± 20.4 | 334.80 ± 37.5 | 18510 ± 14.4 | 2112.8 ± 276 | 486.45 ± 53.3
KDD-99 | 395310 ± 143 | 2551.9 ± 428.6 | 395580 ± 143 | 12512 ± 2450 | 4181 ± 783.6
letter | 40037 ± 44.3 | 159720 ± 2028 | 40052 ± 41 | 479130 ± 2710 | 363001 ± 5557
Splice | 42515 ± 26.6 | 33658 ± 1697 | 42501 ± 21 | 52123 ± 592 | 50284 ± 3659
Satellite | 86136 ± 123 | 60876 ± 1641 | 85984 ± 127 | 218870 ± 6516 | 140810 ± 3335
[0111] Table 2 lists the average costs attained by each of these
methods on the 6 data sets, and their standard errors. These results
were obtained by averaging over 20 runs, each run consisting of 30
iterations of the respective learning method. These results appear
quite convincing: GBSE out-performs all comparison methods on all
data sets, except on Splice, for which it ranks second after
MetaCost. Also, GBSE is the best performing among the proposed
methods, confirming our claim that the combination of various
techniques involved is indeed necessary to attain this level of
performance.
[0112] Table 3 lists the average total data size used by each of
the methods in 30 iterations. Examining these results in
conjunction with the data characteristics in Table 1 reveals a
definite trend. First, note that the data sets are divided into two
groups: those having very large skews, or very low class ratios
(Annealing, KDD-99 and Solar flare), and those having moderate
skews (Satellite, Splice and Letter). It is evident that the
methods based on example weighting (AvgCost, GBSE, DSE) use data
sizes that are orders of magnitude smaller for the three data sets in
the first group (i.e., with large skews), as compared to the other
methods, Bagging and MetaCost. The performance of GBSE is especially
impressive on this group, achieving much lower costs while
requiring very small data sizes. It is worth mentioning that it is
these data sets in the first group with large skews that require
cost-sensitive learning the most.
[0113] We have provided a novel method for multiclass
cost-sensitive learning based on gradient boosting with stochastic
ensembles. It is not the first time that the issue of incorporating
cost-sensitivity to boosting has been addressed. For example,
AdaCost (see W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan,
"AdaCost: Misclassification cost-sensitive boosting", Proceedings
of the Sixteenth International Conference on Machine Learning, pp.
97-105, 1999) suggested a way of modifying AdaBoost's exponential
loss using a function (called cost adjustment function) of the cost
and confidence. The rational choice of this cost adjustment
function, however, appears not to be well-understood. The
stochastic ensemble that we employ in this method provides a
straightforward but reasonable way of incorporating cost and
confidence; i.e., in terms of expected cost.
[0114] While the invention has been described in terms of a single
preferred embodiment, those skilled in the art will recognize that
the invention can be practiced with modification within the spirit
and scope of the appended claims.
* * * * *