U.S. patent application number 11/294938 was filed with the patent office on 2007-06-07 for weighted ensemble boosting method for classifier combination and feature selection.
Invention is credited to Yuri Ivanov.
United States Patent Application 20070127825
Kind Code: A1
Ivanov; Yuri
June 7, 2007
Weighted ensemble boosting method for classifier combination and
feature selection
Abstract
A method constructs a strong classifier from weak classifiers by
combining the weak classifiers to form a set of combinations of the
weak classifiers. Each combination of weak classifiers is boosted
to determine a weighted score for each combination of weak
classifiers, and combinations of weak classifiers having a weighted
score greater than a predetermined threshold are selected to form
the strong classifier.
Inventors: Ivanov; Yuri (Arlington, MA)
Correspondence Address: MITSUBISHI ELECTRIC RESEARCH LABORATORIES, INC., 201 BROADWAY, 8TH FLOOR, CAMBRIDGE, MA 02139, US
Family ID: 38118830
Appl. No.: 11/294938
Filed: December 6, 2005
Current U.S. Class: 382/228
Current CPC Class: G06K 9/6256 20130101; G06K 9/6292 20130101
Class at Publication: 382/228
International Class: G06K 9/62 20060101 G06K009/62
Claims
1. A computer implemented method for constructing a strong
classifier, comprising: selecting a plurality of weak classifiers;
combining the weak classifiers to form a set of combinations of the
weak classifiers; boosting each combination of the weak classifiers
to determine a weighted score for each combination of the weak
classifiers; and selecting combinations of the weak classifiers
having a weighted score greater than a predetermined threshold to
form the strong classifier.
2. The method of claim 1, in which the weak classifiers include
binary and multi-class classifiers.
3. The method of claim 1, further comprising: representing an
output of each weak classifier by a posterior probability; and
associating each weak classifier with a confidence matrix.
4. The method of claim 3, in which the combinations include linear
and non-linear combinations of the weak classifiers.
5. The method of claim 3, in which the combining is an approximate
Bayesian combination.
6. The method of claim 5, in which an output $\lambda$ of each weak classifier is a random variable $\tilde{\omega}$ taking integer values from 1 to $K$, the number of classes, and a probability distribution over values of a true class label $\omega$ is $P_\lambda(\omega \mid \tilde{\omega})$, and the approximate Bayesian combination is

$$P_a(\omega_i \mid x) = \sum_{k=1}^{K} w_k \sum_{j=1}^{J} P_k(\omega_i \mid \tilde{\omega}_j)\, P_k(\tilde{\omega}_j \mid x),$$

where $P_k(\tilde{\omega} \mid x)$ is a prediction probability of the weak classifier, and $w_k$ is a weight of the classifier.
7. The method of claim 6, in which the weight is according to the
confidence matrix of the class.
8. The method of claim 6, in which the set of combinations is formed according to

$$P_n^{\beta}(\omega_i \mid x) = \frac{\exp\!\left(\beta \sum_{j \in S_n} P_j(\omega_i \mid x)\right)}{\sum_{c=1}^{C} \exp\!\left(\beta \sum_{k \in S_n} P_k(\omega_c \mid x)\right)},$$

where $P_k(\omega_j \mid x)$ is a weighted weak classifier according to a non-linear weight $\beta$, and $S_n$ is an $n$-th classifier combination.
Description
FIELD OF THE INVENTION
[0001] This invention relates generally to computer implemented
classifiers, and more specifically to strong classifiers that are
constructed by combining multiple weak classifiers.
BACKGROUND OF THE INVENTION
[0002] Recognition of activities and objects plays a central role
in surveillance and computer vision applications, see A. F. Bobick,
"Movement, activity, and action: The role of knowledge in the
perception of motion," Royal Society Workshop on Knowledge-based
Vision in Man and Machine, 1997; Aggarwal et al., "Human motion
analysis: A review," Computer Vision and Image Understanding, vol.
73, no. 3, pp. 428-440, 1999; and Nevatia et al., "Video-based
event recognition: activity representation and probabilistic
recognition methods," Computer Vision and Image Understanding, vol.
96, no. 2, pp. 129-162, November 2004.
[0003] Recognition, in part, is a classification task. The main
difficulty in event and object recognition is the large number of
events and object classes. Therefore, systems should be able to
make a decision based on complex classifications derived from a
large number of simpler classification tasks.
[0004] Consequently, many methods combine a number of weak classifiers to construct a strong classifier. The main purpose of
combining classifiers is to pool the individual outputs of the weak
classifiers as components of the strong classifier, the combined
classifier being more accurate than each individual component
classifier.
[0005] Prior art methods for combining classifiers include methods
that apply sum, voting and product combination rules, see Ross et
al., "Information fusion in biometrics," Pattern Recognition
Letters, vol. 24, no. 13, pp. 2115-2125, 2003; Pekalska et al., "A
discussion on the classifier projection space for classifier
combining," 3rd International Workshop on Multiple Classifier
Systems, Springer Verlag, pp. 137-148, 2002; Kittler et al.,
"Combining evidence in multimodal personal identity recognition
systems," Intl. Conference on Audio- and Video-Based Biometric
Authentication, 1997; Tax et al., "Combining multiple classifiers
by averaging or by multiplying?" Pattern Recognition, vol. 33, pp.
1475-1485, 2000; Bilmes et al., "Directed graphical models of
classifier combination: Application to phone recognition," Intl.
Conference on Spoken Language Processing, 2000; and Ivanov,
"Multi-modal human identification system," Workshop on Applications
of Computer Vision, 2004.
SUMMARY OF THE INVENTION
[0006] One embodiment of the invention provides a method for
combining weak classifiers into a strong classifier using weighted ensemble boosting. The weighted ensemble boosting method combines a Bayesian averaging strategy with a boosting framework,
finding useful conjunctive feature combinations of the classifiers
and achieving a lower error rate than the prior art boosting
process. The method demonstrates a comparable level of stability
with respect to the composition of a classifier selection pool.
[0007] More particularly, a method constructs a strong classifier
from weak classifiers by combining the weak classifiers to form a
set of combinations of the weak classifiers. Each combination of
weak classifiers is boosted to determine a weighted score for each
combination of weak classifiers, and combinations of weak
classifiers having a weighted score greater than a predetermined
threshold are selected to form the strong classifier.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] FIG. 1 is a flow diagram of a method for constructing a
strong classifier from a combination of weak classifiers according
to an embodiment of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0009] FIG. 1 shows a method for constructing a strong classifier
109 from weak classifiers (A, B, C) 101 according to an embodiment
of the invention. The weak classifiers are combined 110 to produce
a set of combined classifiers 102. Then, a boosting process 120 is
applied to the set of combined classifiers to construct the strong
classifier 109.
[0010] Weak Classifiers
[0011] The weak classifiers can include binary and multi-class
classifiers. A binary classifier determines whether a single class
is recognized or not. A multi-class classifier can recognize
several classes.
[0012] An output of each weak classifier can be represented by
posterior probabilities. Each probability indicates how certain a
classifier is about a particular classification, e.g., the object
identity. In addition, each weak classifier can be associated with
a confidence matrix. The confidence matrix indicates how well the
classifier performs for a particular class. The confidence matrices
are obtained by training and validating the classifiers with known
or `labeled` data.
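As an informal illustration of this representation (the code and names below are not part of the original disclosure), a weak classifier can be wrapped so that it exposes posterior probabilities and a confidence matrix estimated on held-out labeled data; the sketch assumes a scikit-learn-style model that exposes predict_proba:

import numpy as np

class WeakClassifier:
    """Wraps a trained model so it exposes posteriors and a confidence matrix."""

    def __init__(self, model, num_classes):
        self.model = model              # any model with a predict_proba(X) method
        self.num_classes = num_classes
        self.confidence = None          # row j: P(true class | predicted class j)

    def fit_confidence(self, X_val, y_val):
        # Estimate the confidence matrix on held-out labeled validation data.
        predicted = np.argmax(self.model.predict_proba(X_val), axis=1)
        counts = np.zeros((self.num_classes, self.num_classes))
        for p, t in zip(predicted, y_val):
            counts[p, t] += 1.0
        row_sums = counts.sum(axis=1, keepdims=True)
        # Normalize rows; fall back to a uniform row for never-predicted classes.
        self.confidence = np.where(row_sums > 0,
                                   counts / np.maximum(row_sums, 1.0),
                                   1.0 / self.num_classes)
        return self

    def posteriors(self, X):
        # Posterior probabilities over the predicted label for each sample.
        return self.model.predict_proba(X)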
[0013] Combining
[0014] The combining step can include all possible linear
combinations 102' of the weak classifiers, as well as various
non-linear combinations 102''. For example, six weak classifiers can
yield over 500 combinations.
[0015] The combining 110 can also use an adaptation of an
approximate Bayesian combination. The Bayesian combination uses some measure of classifier confidence to weight the prediction
probabilities of each weak classifier with respect to an expected
accuracy of the weak classifier for each of the classes, see
Ivanov, "Multi-modal human identification system," Workshop on
Applications of Computer Vision, 2004; and Ivanov et al., "Using
component features for face recognition," International Conference
on Automatic Face and Gesture Recognition, 2004, both incorporated
herein by reference.
[0016] More particularly, an output of a weak classifier, $\lambda$, is viewed as a random variable $\tilde{\omega}$ taking integer values from 1 to $K$, i.e., the number of classes. If, for each classifier, the probability distribution over values of a true class label $\omega$ is available for a given classifier prediction, $P_\lambda(\omega \mid \tilde{\omega})$, then the approximate Bayesian combination can be derived via marginalization of the individual class predictions of each weak classifier:

$$P_a(\omega_i \mid x) = \sum_{k=1}^{K} w_k \sum_{j=1}^{J} P_k(\omega_i \mid \tilde{\omega}_j)\, P_k(\tilde{\omega}_j \mid x),$$

where $P_k(\tilde{\omega} \mid x)$ is the prediction probability of the weak classifier, and $w_k$ is a weight of each weak classifier. The equation weights the prediction of each classifier in accordance with the confidence matrix associated with the class.
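The marginalization above can be written compactly in a few lines. The following sketch uses illustrative names of my own choosing and assumes each confidence matrix is stored with rows indexed by the predicted label, as in the earlier wrapper sketch:

import numpy as np

def bayesian_combination(posteriors, confidences, weights):
    # posteriors:  list of K arrays of shape (C,), the k-th being P_k(omega~ | x)
    # confidences: list of K arrays of shape (C, C); entry [j, i] = P_k(omega_i | omega~_j)
    # weights:     array of shape (K,), the per-classifier weights w_k
    combined = np.zeros_like(posteriors[0], dtype=float)
    for w_k, conf_k, post_k in zip(weights, confidences, posteriors):
        # Marginalize over the classifier's own prediction omega~_j.
        combined += w_k * (conf_k.T @ post_k)
    return combined    # scores proportional to P_a(omega_i | x)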
[0017] The combinations in the set 102 are formed for singles, pairs, triples, etc., of the weak classifiers 101. The non-linear transformation is according to:

$$P_n^{\beta}(\omega_i \mid x) = \frac{\exp\!\left(\beta \sum_{j \in S_n} P_j(\omega_i \mid x)\right)}{\sum_{c=1}^{C} \exp\!\left(\beta \sum_{k \in S_n} P_k(\omega_c \mid x)\right)},$$

where $P_k(\omega_j \mid x)$ is a weighted weak classifier according to a non-linear weight $\beta$, and $S_n$ is the $n$-th classifier combination. For an exhaustive enumeration of combinations, the total number of the tuples for every value of $\beta$ is given by the following relation:

$$N = \sum_{n=1}^{K} \binom{K}{n} = 2^K - 1,$$

where $K$ is the number of weak classifiers and $N$ is the number of tuples. That is, if 8 different values of $\beta$ are used to form combinations of 6 classifiers, the total number of these combinations 102 comes to 504.
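A compact way to enumerate these weighted combinations is sketched below (illustrative code, not taken from the patent). Each subset $S_n$ of classifiers is combined by summing posteriors, scaling by $\beta$, and normalizing with a softmax over the $C$ classes:

import itertools
import numpy as np

def nonlinear_combination(posteriors, subset, beta):
    # posteriors: array of shape (K, C); row k is P_k(omega | x)
    # subset:     indices S_n of the weak classifiers in this combination
    scores = beta * posteriors[list(subset)].sum(axis=0)
    scores -= scores.max()                  # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # P_n^beta(omega | x)

def enumerate_combinations(num_classifiers, betas):
    # All non-empty subsets, repeated for each beta: (2**K - 1) * len(betas) tuples.
    subsets = [s for r in range(1, num_classifiers + 1)
               for s in itertools.combinations(range(num_classifiers), r)]
    return [(subset, beta) for beta in betas for subset in subsets]

With 6 classifiers and 8 values of $\beta$, this enumeration yields the 504 combinations mentioned above.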
[0018] Boosting
[0019] As stated above, the strong classifier 109 is derived from
the set of combined weak classifiers 102. The boosting essentially
`discards` combinations in the set that have low `weights`, e.g.,
weights less than some predetermined threshold or zero, and keeps the combinations whose weights are greater than the predetermined threshold.
The number of elements in the strong classifier can be controlled
by the threshold.
[0020] The method adapts the well known AdaBoost process, Freund et
al., "A decision-theoretic generalization of on-line learning and
an application to boosting," Computational Learning Theory,
Eurocolt '95, pp. 23-37, 1995, incorporated herein by
reference.
[0021] The AdaBoost process trains each classifier in a combination
with increasingly more difficult data, and then uses a weighted
score. During the training, the combined classifiers are examined,
in turn, with replacement. At every iteration, a greedy selection
is made. The combined classifier that yields a minimal error rate
on data misclassified during a previous iteration is selected, and
the weight is determined as a function of the error rate. The
AdaBoost process iterates until one of the following conditions is
met: a predetermined number of iterations has been made, a predetermined number of classifiers has been selected, the error rate decreases to a predetermined threshold, or no further improvement to the error rate can be made.
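The selection loop can be sketched as a standard discrete AdaBoost round over the pool of combined classifiers. The multi-class weight formula below follows the common SAMME variant and is my own choice of concrete update, since the patent does not spell out the exact formulas:

import numpy as np

def boost_select(candidate_preds, y, num_classes, max_rounds=50):
    # candidate_preds: array (M, N); row m holds the hard predictions of the
    #                  m-th combined classifier on the N training samples
    # y:               array (N,) of true class labels
    n = len(y)
    dist = np.full(n, 1.0 / n)              # sample distribution, reweighted each round
    selected = []                           # (candidate index, weight) per round
    for _ in range(max_rounds):
        # Greedy step: pick the combined classifier with minimal weighted error.
        errors = np.array([dist[preds != y].sum() for preds in candidate_preds])
        best = int(np.argmin(errors))
        err = float(errors[best])
        if err >= 1.0 - 1.0 / num_classes:  # no better than chance; stop
            break
        alpha = np.log((1.0 - err) / max(err, 1e-12)) + np.log(num_classes - 1.0)
        selected.append((best, alpha))
        # Emphasize the samples that the selected classifier misclassified.
        dist *= np.exp(alpha * (candidate_preds[best] != y))
        dist /= dist.sum()
    return selected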
[0022] Formally, a probability distribution over the classes can be expressed as a weighted sum of the scores:

$$P_b(\omega_i \mid x) \propto \sum_{k=1}^{K} W_k \left[\operatorname*{argmax}_{\tilde{\omega}} P_k(\tilde{\omega} \mid x) = i\right],$$

where the weight $W_k$ is the aggregate weight of the $k$-th classifier:

$$W_k = \sum_{t=1}^{T} w_t \left[ f_t = f_k \right].$$

This equation states that the weight of the $k$-th classifier is the sum of the weights of all instances $t$ of the classifier, where the classifier $f_k$ is selected by the process.
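Under the same assumptions as the sketch above, the aggregate weights $W_k$ and the weighted vote can be computed as follows (again an illustration, not the literal implementation of the patent):

import numpy as np

def aggregate_weights(selected, num_candidates):
    # selected: (index, weight) pairs returned by the boosting loop
    W = np.zeros(num_candidates)
    for k, w_t in selected:
        W[k] += w_t                         # sum the weights of every instance of classifier k
    return W

def strong_classify(posteriors, W):
    # posteriors: array (M, C); row k is P_k(omega~ | x) for one test sample x
    # W:          aggregate weights of the M combined classifiers
    scores = np.zeros(posteriors.shape[1])
    votes = np.argmax(posteriors, axis=1)   # each classifier votes for its argmax class
    for k, vote in enumerate(votes):
        scores[vote] += W[k]                # indicator [argmax = i] weighted by W_k
    return scores                           # proportional to P_b(omega_i | x)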
[0023] Feature stacking can then train the strong classifier on the outputs of the weak classifiers stacked into a single vector. That is, the input for the strong classifier, $\tilde{x}$, is formed as follows:

$$\tilde{x} = \left(P_1(\omega \mid x)^T,\, P_2(\omega \mid x)^T,\, \ldots,\, P_K(\omega \mid x)^T\right)^T,$$

and then the strong classifier is trained on pairs of data $(\tilde{x}_i, Y_i)$, where $Y_i$ is the class label of the $i$-th data point.
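A small sketch of this stacking step is given below; the choice of logistic regression as the second-stage learner is only an example and is not prescribed by the patent:

import numpy as np
from sklearn.linear_model import LogisticRegression

def stack_features(posterior_list):
    # posterior_list: K arrays of shape (N, C); result has shape (N, K*C).
    return np.hstack(posterior_list)

# Usage sketch: build the stacked inputs and train the second-stage classifier
# on pairs (x~_i, Y_i).
# X_stacked = stack_features([clf.posteriors(X_train) for clf in weak_classifiers])
# strong = LogisticRegression(max_iter=1000).fit(X_stacked, y_train)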
[0024] Although the invention has been described by way of examples
of preferred embodiments, it is to be understood that various other
adaptations and modifications may be made within the spirit and
scope of the invention. Therefore, it is the object of the appended
claims to cover all such variations and modifications as come
within the true spirit and scope of the invention.
* * * * *