U.S. patent application number 11/294938 was filed with the patent office on 2007-06-07 for weighted ensemble boosting method for classifier combination and feature selection.
Invention is credited to Yuri Ivanov.
United States Patent Application 20070127825
Kind Code: A1
Ivanov; Yuri
June 7, 2007
Weighted ensemble boosting method for classifier combination and
feature selection
Abstract
A method constructs a strong classifier from weak classifiers by
combining the weak classifiers to form a set of combinations of the
weak classifiers. Each combination of weak classifiers is boosted
to determine a weighted score for each combination of weak
classifiers, and combinations of weak classifiers having a weighted
score greater than a predetermined threshold are selected to form
the strong classifier.
Inventors: Ivanov; Yuri (Arlington, MA)
Correspondence Address: MITSUBISHI ELECTRIC RESEARCH LABORATORIES, INC., 201 BROADWAY, 8TH FLOOR, CAMBRIDGE, MA 02139, US
Family ID: 38118830
Appl. No.: 11/294938
Filed: December 6, 2005
Current U.S. Class: 382/228
Current CPC Class: G06K 9/6256 20130101; G06K 9/6292 20130101
Class at Publication: 382/228
International Class: G06K 9/62 20060101 G06K009/62
Claims
1. A computer implemented method for constructing a strong
classifier, comprising: selecting a plurality of weak classifiers;
combining the weak classifiers to form a set of combinations of the
weak classifiers; boosting each combination of the weak classifiers
to determine a weighted score for each combination of the weak
classifiers; and selecting combinations of the weak classifiers
having a weighted score greater than a predetermined threshold to
form the strong classifier.
2. The method of claim 1, in which the weak classifiers include
binary and multi-class classifiers.
3. The method of claim 1, further comprising: representing an
output of each weak classifier by a posterior probability; and
associating each weak classifier with a confidence matrix.
4. The method of claim 3, in which the combinations include linear
and non-linear combinations of the weak classifiers.
5. The method of claim 3, in which the combining is an approximate
Bayesian combination.
6. The method of claim 5, in which an output $\lambda$ of each weak classifier is a random variable $\tilde{\omega}$ taking integer values from 1 to $K$, the number of classes, and a probability distribution over values of a true class label $\omega$ is $P_\lambda(\omega \mid \tilde{\omega})$, and the approximate Bayesian combination is

$$P_a(\omega_i \mid x) = \sum_{k=1}^{K} w_k \sum_{j=1}^{J} P_k(\omega_i \mid \tilde{\omega}_j)\, P_k(\tilde{\omega}_j \mid x),$$

where $P_k(\tilde{\omega} \mid x)$ is a prediction probability of the weak classifier, and $w_k$ is a weight of the classifier.
7. The method of claim 6, in which the weight is according to the
confidence matrix of the class.
8. The method of claim 6, in which the set of combinations is formed according to

$$P_n^{\beta}(\omega_i \mid x) = \frac{\exp\!\left(\beta \sum_{j \in S_n} P_j(\omega_i \mid x)\right)}{\sum_{c=1}^{C} \exp\!\left(\beta \sum_{k \in S_n} P_k(\omega_c \mid x)\right)},$$

where $P_k(\omega_j \mid x)$ is a weighted weak classifier according to a non-linear weight $\beta$, and $S_n$ is an $n$-th classifier combination.
Description
FIELD OF THE INVENTION
[0001] This invention relates generally to computer implemented
classifiers, and more specifically to strong classifiers that are
constructed by combining multiple weak classifiers.
BACKGROUND OF THE INVENTION
[0002] Recognition of activities and objects plays a central role
in surveillance and computer vision applications, see A. F. Bobick,
"Movement, activity, and action: The role of knowledge in the
perception of motion," Royal Society Workshop on Knowledge-based
Vision in Man and Machine, 1997; Aggarwal et al., "Human motion
analysis: A review," Computer Vision and Image Understanding, vol.
73, no. 3, pp. 428-440, 1999; and Nevatia et al., "Video-based
event recognition: activity representation and probabilistic
recognition methods," Computer Vision and Image Understanding, vol.
96, no. 2, pp. 129-162, November 2004.
[0003] Recognition, in part, is a classification task. The main
difficulty in event and object recognition is the large number of
events and object classes. Therefore, systems should be able to
make a decision based on complex classifications derived from a
large number of simpler classification tasks.
[0004] Consequently, many methods combine a number of weak classifiers to construct a strong classifier. The main purpose of
combining classifiers is to pool the individual outputs of the weak
classifiers as components of the strong classifier, the combined
classifier being more accurate than each individual component
classifier.
[0005] Prior art methods for combining classifiers include methods
that apply sum, voting and product combination rules, see Ross et
al., "Information fusion in biometrics," Pattern Recognition
Letters, vol. 24, no. 13, pp. 2115-2125, 2003; Pekalska et al., "A
discussion on the classifier projection space for classifier
combining," 3rd International Workshop on Multiple Classifier
Systems, Springer Verlag, pp. 137-148, 2002; Kittler et al.,
"Combining evidence in multimodal personal identity recognition
systems," Intl. Conference on Audio- and Video-Based Biometric
Authentication, 1997; Tax et al., "Combining multiple classifiers
by averaging or by multiplying?" Pattern Recognition, vol. 33, pp.
1475-1485, 2000; Bilmes et al., "Directed graphical models of
classifier combination: Application to phone recognition," Intl.
Conference on Spoken Language Processing, 2000; and Ivanov,
"Multi-modal human identification system," Workshop on Applications
of Computer Vision, 2004.
SUMMARY OF THE INVENTION
[0006] One embodiment of the invention provides a method for
combining weak classifiers into a strong classifier using weighted ensemble boosting. The weighted ensemble boosting method combines a Bayesian averaging strategy with a boosting framework,
finding useful conjunctive feature combinations of the classifiers
and achieving a lower error rate than the prior art boosting
process. The method demonstrates a comparable level of stability
with respect to the composition of a classifier selection pool.
[0007] More particularly, a method constructs a strong classifier
from weak classifiers by combining the weak classifiers to form a
set of combinations of the weak classifiers. Each combination of
weak classifiers is boosted to determine a weighted score for each
combination of weak classifiers, and combinations of weak
classifiers having a weighted score greater than a predetermined
threshold are selected to form the strong classifier.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1 is a flow diagram of a method for constructing a
strong classifier from a combination of weak classifiers according
to an embodiment of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0009] FIG. 1 shows a method for constructing a strong classifier
109 from weak classifiers (A, B, C) 101 according to an embodiment
of the invention. The weak classifiers are combined 110 to produce
a set of combined classifiers 102. Then, a boosting process 120 is
applied to the set of combined classifiers to construct the strong
classifier 109.
[0010] Weak Classifiers
[0011] The weak classifiers can include binary and multi-class
classifiers. A binary classifier determines whether a single class
is recognized or not. A multi-class classifier can recognize
several classes.
[0012] An output of each weak classifier can be represented by
posterior probabilities. Each probability indicates how certain a
classifier is about a particular classification, e.g., the object
identity. In addition, each weak classifier can be associated with
a confidence matrix. The confidence matrix indicates how well the
classifier performs for a particular class. The confidence matrices
are obtained by training and validating the classifiers with known
or `labeled` data.
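As an informal illustration of this representation (the code and names below are not part of the original disclosure), a weak classifier can be wrapped so that it exposes posterior probabilities and a confidence matrix estimated on held-out labeled data; the sketch assumes a scikit-learn-style model that exposes predict_proba:

import numpy as np

class WeakClassifier:
    """Wraps a trained model so it exposes posteriors and a confidence matrix."""

    def __init__(self, model, num_classes):
        self.model = model              # any model with a predict_proba(X) method
        self.num_classes = num_classes
        self.confidence = None          # row j: P(true class | predicted class j)

    def fit_confidence(self, X_val, y_val):
        # Estimate the confidence matrix on held-out labeled validation data.
        predicted = np.argmax(self.model.predict_proba(X_val), axis=1)
        counts = np.zeros((self.num_classes, self.num_classes))
        for p, t in zip(predicted, y_val):
            counts[p, t] += 1.0
        row_sums = counts.sum(axis=1, keepdims=True)
        # Normalize rows; fall back to a uniform row for never-predicted classes.
        self.confidence = np.where(row_sums > 0,
                                   counts / np.maximum(row_sums, 1.0),
                                   1.0 / self.num_classes)
        return self

    def posteriors(self, X):
        # Posterior probabilities over the predicted label for each sample.
        return self.model.predict_proba(X)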
[0013] Combining
[0014] The combining step can include all possible linear
combinations 102' of the weak classifiers, as well as various
non-linear combinations 102''. For example, six weak classifiers can
yield over 500 combinations.
[0015] The combining 110 can also use an adaptation of an
approximate Bayesian combination. The Bayesian combination uses some measure of classifier confidence to weight the prediction
probabilities of each weak classifier with respect to an expected
accuracy of the weak classifier for each of the classes, see
Ivanov, "Multi-modal human identification system," Workshop on
Applications of Computer Vision, 2004; and Ivanov et al., "Using
component features for face recognition," International Conference
on Automatic Face and Gesture Recognition, 2004, both incorporated
herein by reference.
[0016] More particularly, an output of a weak classifier, $\lambda$, is viewed as a random variable $\tilde{\omega}$ taking integer values from 1 to $K$, i.e., the number of classes. If, for each classifier, the probability distribution over values of a true class label $\omega$ is available for a given classifier prediction, $P_\lambda(\omega \mid \tilde{\omega})$, then the approximate Bayesian combination can be derived via marginalization of the individual class predictions of each weak classifier:

$$P_a(\omega_i \mid x) = \sum_{k=1}^{K} w_k \sum_{j=1}^{J} P_k(\omega_i \mid \tilde{\omega}_j)\, P_k(\tilde{\omega}_j \mid x),$$

where $P_k(\tilde{\omega} \mid x)$ is the prediction probability of the weak classifier, and $w_k$ is a weight of each weak classifier. The equation weights the prediction of each classifier in accordance with the confidence matrix associated with the class.
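The marginalization above can be written compactly in a few lines. The following sketch uses illustrative names of my own choosing and assumes each confidence matrix is stored with rows indexed by the predicted label, as in the earlier wrapper sketch:

import numpy as np

def bayesian_combination(posteriors, confidences, weights):
    # posteriors:  list of K arrays of shape (C,), the k-th being P_k(omega~ | x)
    # confidences: list of K arrays of shape (C, C); entry [j, i] = P_k(omega_i | omega~_j)
    # weights:     array of shape (K,), the per-classifier weights w_k
    combined = np.zeros_like(posteriors[0], dtype=float)
    for w_k, conf_k, post_k in zip(weights, confidences, posteriors):
        # Marginalize over the classifier's own prediction omega~_j.
        combined += w_k * (conf_k.T @ post_k)
    return combined    # scores proportional to P_a(omega_i | x)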
[0017] The combinations in the set 102 are formed for singles, pairs, triples, etc., of the weak classifiers 101. The non-linear transformation is according to:

$$P_n^{\beta}(\omega_i \mid x) = \frac{\exp\!\left(\beta \sum_{j \in S_n} P_j(\omega_i \mid x)\right)}{\sum_{c=1}^{C} \exp\!\left(\beta \sum_{k \in S_n} P_k(\omega_c \mid x)\right)},$$

where $P_k(\omega_j \mid x)$ is a weighted weak classifier according to a non-linear weight $\beta$, and $S_n$ is the $n$-th classifier combination. For an exhaustive enumeration of combinations, the total number of the tuples for every value of $\beta$ is given by the following relation:

$$N = \sum_{n=1}^{K} \binom{K}{n} = 2^K - 1,$$

where $K$ is the number of weak classifiers and $N$ is the number of tuples. That is, if 8 different values of $\beta$ are used to form combinations of 6 classifiers, the total number of these combinations 102 comes to 504.
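A compact way to enumerate these weighted combinations is sketched below (illustrative code, not taken from the patent). Each subset $S_n$ of classifiers is combined by summing posteriors, scaling by $\beta$, and normalizing with a softmax over the $C$ classes:

import itertools
import numpy as np

def nonlinear_combination(posteriors, subset, beta):
    # posteriors: array of shape (K, C); row k is P_k(omega | x)
    # subset:     indices S_n of the weak classifiers in this combination
    scores = beta * posteriors[list(subset)].sum(axis=0)
    scores -= scores.max()                  # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # P_n^beta(omega | x)

def enumerate_combinations(num_classifiers, betas):
    # All non-empty subsets, repeated for each beta: (2**K - 1) * len(betas) tuples.
    subsets = [s for r in range(1, num_classifiers + 1)
               for s in itertools.combinations(range(num_classifiers), r)]
    return [(subset, beta) for beta in betas for subset in subsets]

With 6 classifiers and 8 values of $\beta$, this enumeration yields the 504 combinations mentioned above.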
[0018] Boosting
[0019] As stated above, the strong classifier 109 is derived from
the set of combined weak classifiers 102. The boosting essentially
`discards` combinations in the set that have low `weights`, e.g.,
weights less than some predetermined threshold or zero, and keeps the combinations whose weights are greater than the predetermined threshold.
The number of elements in the strong classifier can be controlled
by the threshold.
[0020] The method adapts the well known AdaBoost process, Freund et
al., "A decision-theoretic generalization of on-line learning and
an application to boosting," Computational Learning Theory,
Eurocolt '95, pp. 23-37, 1995, incorporated herein by
reference.
[0021] The AdaBoost process trains each classifier in a combination
with increasingly more difficult data, and then uses a weighted
score. During the training, the combined classifiers are examined,
in turn, with replacement. At every iteration, a greedy selection
is made. The combined classifier that yields a minimal error rate
on data misclassified during a previous iteration is selected, and
the weight is determined as a function of the error rate. The
AdaBoost process iterates until one of the following conditions is
met: a predetermined number of iterations has been made, a predetermined number of classifiers has been selected, the error rate decreases to a predetermined threshold, or no further improvement to the error rate can be made.
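The selection loop can be sketched as a standard discrete AdaBoost round over the pool of combined classifiers. The multi-class weight formula below follows the common SAMME variant and is my own choice of concrete update, since the patent does not spell out the exact formulas:

import numpy as np

def boost_select(candidate_preds, y, num_classes, max_rounds=50):
    # candidate_preds: array (M, N); row m holds the hard predictions of the
    #                  m-th combined classifier on the N training samples
    # y:               array (N,) of true class labels
    n = len(y)
    dist = np.full(n, 1.0 / n)              # sample distribution, reweighted each round
    selected = []                           # (candidate index, weight) per round
    for _ in range(max_rounds):
        # Greedy step: pick the combined classifier with minimal weighted error.
        errors = np.array([dist[preds != y].sum() for preds in candidate_preds])
        best = int(np.argmin(errors))
        err = float(errors[best])
        if err >= 1.0 - 1.0 / num_classes:  # no better than chance; stop
            break
        alpha = np.log((1.0 - err) / max(err, 1e-12)) + np.log(num_classes - 1.0)
        selected.append((best, alpha))
        # Emphasize the samples that the selected classifier misclassified.
        dist *= np.exp(alpha * (candidate_preds[best] != y))
        dist /= dist.sum()
    return selected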
[0022] Formally, a probability distribution over the classes can be expressed as a weighted sum of the scores:

$$P_b(\omega_i \mid x) \propto \sum_{k=1}^{K} W_k \left[\operatorname*{argmax}_{\tilde{\omega}} P_k(\tilde{\omega} \mid x) = i\right],$$

where the weight $W_k$ is the aggregate weight of the $k$-th classifier:

$$W_k = \sum_{t=1}^{T} w_t \left[ f_t = f_k \right].$$

This equation states that the weight of the $k$-th classifier is the sum of the weights of all instances $t$ of the classifier, where the classifier $f_k$ is selected by the process.
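Under the same assumptions as the sketch above, the aggregate weights $W_k$ and the weighted vote can be computed as follows (again an illustration, not the literal implementation of the patent):

import numpy as np

def aggregate_weights(selected, num_candidates):
    # selected: (index, weight) pairs returned by the boosting loop
    W = np.zeros(num_candidates)
    for k, w_t in selected:
        W[k] += w_t                         # sum the weights of every instance of classifier k
    return W

def strong_classify(posteriors, W):
    # posteriors: array (M, C); row k is P_k(omega~ | x) for one test sample x
    # W:          aggregate weights of the M combined classifiers
    scores = np.zeros(posteriors.shape[1])
    votes = np.argmax(posteriors, axis=1)   # each classifier votes for its argmax class
    for k, vote in enumerate(votes):
        scores[vote] += W[k]                # indicator [argmax = i] weighted by W_k
    return scores                           # proportional to P_b(omega_i | x)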
[0023] Feature stacking can then train the strong classifier on the outputs of the weak classifiers stacked into a single vector. That is, the input for the strong classifier, $\tilde{x}$, is formed as follows:

$$\tilde{x} = \left(P_1(\omega \mid x)^T,\, P_2(\omega \mid x)^T,\, \ldots,\, P_K(\omega \mid x)^T\right)^T,$$

and then the strong classifier is trained on pairs of data $(\tilde{x}_i, Y_i)$, where $Y_i$ is the class label of the $i$-th data point.
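A small sketch of this stacking step is given below; the choice of logistic regression as the second-stage learner is only an example and is not prescribed by the patent:

import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_features(posterior_list):
    # posterior_list: K arrays of shape (N, C); result has shape (N, K*C).
    return np.hstack(posterior_list)

# Usage sketch: build the stacked inputs and train the second-stage classifier
# on pairs (x~_i, Y_i).
# X_stacked = stack_features([clf.posteriors(X_train) for clf in weak_classifiers])
# strong = LogisticRegression(max_iter=1000).fit(X_stacked, y_train)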
[0024] Although the invention has been described by way of examples
of preferred embodiments, it is to be understood that various other
adaptations and modifications may be made within the spirit and
scope of the invention. Therefore, it is the object of the appended
claims to cover all such variations and modifications as come
within the true spirit and scope of the invention.
* * * * *