U.S. patent application number 12/238012 was filed with the patent office on 2010-03-25 for automated feature selection based on rankboost for ranking.
This patent application is currently assigned to Microsoft Corporation. Invention is credited to Xiong-Fei Cai, Junyan Chen, Rui Gao, Feng-Hsiung Hsu, Ning-Yi Xu.
Application Number: 20100076911 / 12/238012
Family ID: 42038647
Filed Date: 2010-03-25

United States Patent Application 20100076911
Kind Code: A1
Xu; Ning-Yi; et al.
March 25, 2010
Automated Feature Selection Based on Rankboost for Ranking
Abstract
A method using a RankBoost-based algorithm to automatically
select features for further ranking model training is provided. The
method reiteratively applies a set of ranking candidates to a
training data set comprising a plurality of ranking objects having
a known pairwise ranking order. Each round of iteration applies a
weight distribution of ranking object pairs, yields a ranking
result by each ranking candidate, identifies a favored ranking
candidate for the round based on the ranking results, and updates
the weight distribution to be used in next iteration round by
increasing weights of ranking object pairs that are poorly ranked
by the favored ranking candidate. The method then infers a target
feature set from the favored ranking candidates identified in the
iterations.
Inventors: Xu; Ning-Yi (Beijing, CN); Chen; Junyan (Beijing, CN);
Gao; Rui (Beijing, CN); Cai; Xiong-Fei (Beijing, CN); Hsu;
Feng-Hsiung (Cupertino, CA)
Correspondence Address: LEE & HAYES, PLLC, 601 W. RIVERSIDE AVENUE,
SUITE 1400, SPOKANE, WA 99201, US
Assignee: Microsoft Corporation, Redmond, WA
Family ID: 42038647
Appl. No.: 12/238012
Filed: September 25, 2008
Current U.S. Class: 706/12
Current CPC Class: G06F 16/951 20190101; G06N 20/00 20190101
Class at Publication: 706/12
International Class: G06F 15/18 20060101 G06F015/18
Claims
1. A computer implemented method used in a ranking algorithm, the
method comprising: reiteratively applying a set of ranking
candidates to a training data set comprising a plurality of ranking
objects having a known pairwise ranking order, wherein each
iteration applies a weight distribution of ranking object pairs,
yields a ranking result by each ranking candidate, identifies a
favored ranking candidate based on the ranking results, and updates
the weight distribution to be used in the next iteration by increasing
weights of ranking object pairs that are poorly ranked by the
favored ranking candidate; and inferring a target feature set from
the favored ranking candidates identified in a plurality of
iterations.
2. The method as recited in claim 1, wherein the set of ranking
candidates are derived from an initial set of ranking features.
3. The method as recited in claim 2, wherein each ranking candidate
is associated with one or more ranking features, and inferring a
target feature set from the favored ranking candidates comprises:
selecting at least some ranking features associated with the
favored ranking candidates and including them in the target feature
set.
4. The method as recited in claim 2, wherein each ranking candidate
is associated with one or more ranking features, and is derived
from the associated one or more ranking features based on a linear
ranker scheme.
5. The method as recited in claim 2, wherein each ranking candidate
is associated with one or more ranking features, and is derived
from the associated one or more ranking features based on a
threshold ranker scheme.
6. The method as recited in claim 2, wherein each ranking candidate
is associated with one ranking feature, and inferring a target
feature set from the favored ranking candidates comprises:
selecting the ranking features associated with the favored ranking
candidates and including them in the target feature set.
7. The method as recited in claim 1, wherein the set of ranking
candidates comprises at least one subset of ranking candidates, and
the ranking candidates of each subset are derived from a common
single ranking feature and differ from one another by each having a
different threshold parameter.
8. The method as recited in claim 1, wherein the favored ranking
candidate of each iteration round is identified by selecting the
best performing ranking candidate of the iteration round.
9. The method as recited in claim 1, wherein the ranking objects
that are poorly ranked by the favored ranking candidate are
identified by comparing the ranking result with the known pairwise
ranking order of the training data.
10. The method as recited in claim 1, further comprising:
constructing an output ranking model using a linear combination of
the selected favored ranking candidates.
11. The method as recited in claim 10, wherein the target feature
set is inferred from the output ranking model.
12. The method as recited in claim 1, wherein the ranking
candidates are weak rankers.
13. The method as recited in claim 1, further comprising: stopping
iteration at a user chosen stop point.
14. The method as recited in claim 1, further comprising: inputting
the selected target feature set to a training engine; and training
the selected target feature set using the training engine to obtain
a final ranking model.
15. The method as recited in claim 14, wherein the training engine
comprises a RankNet training procedure.
16. The method as recited in claim 14, wherein the training engine
comprises a RankBoost training procedure.
17. The method as recited in claim 1, wherein reiteratively
applying a set of candidate rankers to a training data set is
performed using an FPGA-based accelerator.
18. A method for selecting the feature set for a ranking algorithm,
the method comprising: a. constructing a set of ranking candidates
using an initial set of features; b. applying each ranking
candidate to a training data set comprising a plurality of ranking
objects having a known pairwise ranking order and a weight
distribution of ranking object pairs, each ranking candidate
yielding a ranking result; c. comparing the ranking results of the
set of ranking candidates to identify a favored ranking candidate;
d. analyzing the ranking result of the favored ranking candidate to
identify ranking object pairs poorly ranked by the favored ranking
candidate; e. adjusting the weight distribution by increasing the
weights of the ranking object pairs poorly ranked by the favored
ranking candidate; f. reiterating above b-e, each iteration
identifying a favored ranking candidate; and g. inferring a target
feature set from the favored ranking candidates identified in
previous iterations.
19. The method as recited in claim 18, wherein each ranking
candidate is associated with one or more ranking features, and
inferring a target feature set from the favored ranking candidates
comprises: selecting at least some ranking features associated with
the favored ranking candidates and including them in the target
feature set.
20. One or more computer readable media having stored thereupon a
plurality of instructions that, when executed by a processor,
cause the processor to: reiteratively apply a set of ranking
candidates to a training data set comprising a plurality of ranking
objects having a known pairwise ranking order, wherein each
iteration applies a weight distribution of ranking object pairs,
yields a ranking result by each ranking candidate, identifies a
favored ranking candidate based on the ranking results, and updates
the weight distribution to be used in the next iteration by increasing
weights of ranking object pairs that are poorly ranked by the
favored ranking candidate; and prepare the favored ranking
candidates identified in a plurality of iterations for inferring a
target feature set therefrom.
Description
RELATED APPLICATIONS
[0001] This application is related to U.S. patent application Ser.
No. 11/737,605 entitled "FIELD-PROGRAMMABLE GATE ARRAY BASED
ACCELERATOR SYSTEM", filed on Apr. 19, 2007, which application is
hereby incorporated by reference in its entirety.
BACKGROUND
[0002] In information retrieval, ranking is of central importance.
Ranking is usually done by applying a ranking function (a ranker)
onto a set of objects (e.g., documents) to compute a score for each
object and sort the objects according to the scores. Depending on
applications the scores may represent the degrees of relevance,
preference, or importance. Traditionally only a small number of
strong features (e.g., BM25 and language model) were used to
represent relevance (or preference and importance) to rank
documents. In recent years, with the development of the supervised
learning algorithms such as Ranking SVM and RankNet, it has become
possible to incorporate more features (strong or weak) into ranking
models. In this situation, feature selection has become an
important issue, particularly from the following viewpoints.
[0003] Learning to rank for web search relevance largely depends on
the document feature set that is used as training input. First, the
trained model is bound to be biased by the choice of features. The
feature selection may significantly affect the accuracy of the
ranking. For example, although the generalization ability of
Support Vector Machines (SVM) depends on the margin which does not
change with the addition of irrelevant features, it also depends on
the radius of training data points, which can increase when the
number of features increases. Moreover, the probability of
over-fitting also increases as the dimension of feature space
increases, and feature selection is a powerful means to avoid
over-fitting. Secondly, the dimension of the feature set also
determines the computational cost to produce the model. In the case
where not all features in the set are carefully hand-designed, it
is even more important to select a feature set of manageable size
that can produce a ranking with good performance.
[0004] For example, MSN Live Search employs RankNet for ranking,
with document features as input. The more features it employs, the
more time consuming it is to train a ranking model. In addition,
the presence of weak features may have the adverse effect of
over-fitting the model. Over-fitting is especially likely when the
feature set includes a large number of low-level features, as is
presently the case. Therefore, it is very
important to select a good set of features for RankNet
training.
[0005] FIG. 1 is a block diagram showing an example of an existing
feature selection procedure. Currently, the feature selection is
done manually as represented in manual feature selection 110. A
training data set 102 is used for manual feature selection 110.
Through human decisions (112), a set of features (114) is chosen
and passed through RankNet training process 116. The resultant
RankNet model 118 is then fed to an automated evaluation tool (120)
to determine its performance. Typically NDCG (Normalized Discounted
Cumulative Gain) is used as the performance measure. Based on the
performance, a decision (122) is made to either further tune the
feature set or output a satisfactory selected feature set 130. To
further tune the feature set, the process returns to block 112 to
repeat the decision process, again manually.
[0006] The output selected feature set 130 is input to a RankNet
training process 140, which also uses training data 102. Input
transformation block 142 transforms the selected feature set 130
into input features 144 for RankNet training engine 146, which
outputs a RankNet model 148 to be used as a ranking function to
rank objects (e.g., documents).
[0007] The above manual feature selection 110 is a tedious,
time-consuming process that requires a great deal of intuition and
experience. Even an experienced trainer might spend several weeks
tuning a feature set and still not be sure whether the tuning is
successful. It becomes an even greater problem as training data are
constantly updated, often adding new features to be evaluated.
SUMMARY
[0008] Disclosed is a method using a RankBoost-based algorithm to
automatically select features for further training of a ranking
model. The method reiteratively applies a set of ranking candidates
to a training data set comprising a plurality of ranking objects
having a known pairwise ranking order. In each round of iteration,
a weight distribution of ranking object pairs is applied, and each
ranking candidate yields a ranking result. The method identifies a
favored ranking candidate for the current round based on the
ranking results, and updates the weight distribution to be used in
the next iteration by increasing weights of ranking object pairs
that are poorly ranked by the favored ranking candidate. The method
then infers a target feature set from the favored ranking
candidates identified in a certain number of iterations. In one
embodiment, the favored ranking candidate is the best performing
ranking candidate in the iteration round.
[0009] In some embodiments, the ranking candidates are derived from
an initial set of ranking features. The ranking candidates may be
derived from the associated ranking feature(s) based on either a
linear ranker scheme or threshold ranker scheme. Each ranking
candidate is associated with one or more ranking features. In one
embodiment each ranking candidate is associated with a single
ranking feature and is defined by the single ranking feature and a
threshold parameter. To infer a target feature set from the favored
ranking candidates, the method selects the ranking features
associated with the favored ranking candidates and includes them in
the target feature set.
[0010] The method may be computer implemented with one or more
computer readable media having stored thereupon a plurality of
instructions that, when executed by a processor, cause the
processor to perform the procedures described herein.
[0011] This Summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This Summary is not intended to identify
key features or essential features of the claimed subject matter,
nor is it intended to be used as an aid in determining the scope of
the claimed subject matter.
BRIEF DESCRIPTION OF THE FIGURES
[0012] The detailed description is described with reference to the
accompanying figures. In the figures, the left-most digit(s) of a
reference number identifies the figure in which the reference
number first appears. The use of the same reference numbers in
different figures indicates similar or identical items.
[0013] FIG. 1 is a block diagram showing an example of an existing
feature selection procedure.
[0014] FIG. 2 is a flowchart of an exemplary automated process of
feature selection.
[0015] FIG. 3 is a block diagram showing an exemplary process of
automated feature selection which provides selected features for
further training of a ranking model.
[0016] FIG. 4 is a block diagram of a computer system implementing
the automated feature selection of the present disclosure.
DETAILED DESCRIPTION
[0017] The automated feature selection based on RankBoost algorithm
for ranking is described below with an overview of the processes
followed by a further detailed description of the exemplary
embodiments. In this description, the order in which a process is
described is not intended to be construed as a limitation, and any
number of the described process blocks may be combined in any order
to implement the method, or an alternate method. In this
description, a ranking model has a trained (or modeled) ranking
function or ranker. Terms such as "ranking function" and "ranker"
are used interchangeably unless noted otherwise.
[0018] Disclosed is an automated approach for feature selection
using the RankBoost ranking algorithm. RankBoost is a boosting
algorithm, which is based on the idea that a number of weak
rankings can be combined to form a single strong ranking. For
example, in ranking movies, each individual reviewer's ranked list
of movies may not be a comprehensive detailed listing of all
movies, but instead a simple partition of movies into two groups
according to whether or not the reviewer prefers the movies over a
particular movie that appears on the reviewer's list. That is, an
individual reviewer's ranking is in itself a weak ranker. Using
the RankBoost algorithm, many weak rankers may be combined to form a
strong ranker to give a more complete and more detailed
ranking.
[0019] Further detail of an exemplary embodiment of the RankBoost
algorithm is provided in a later section of the present
description.
[0020] The RankBoost algorithm has been used to train a strong
ranking model based on a selected feature set. The method disclosed
herein, however, makes nonconventional use of the RankBoost
algorithm to select a feature set from a large initial set of features.
[0021] The disclosed method runs RankBoost in iteration. In each
round, a weak ranker is chosen from a set of candidates to maximize
a performance gain function. The final model is a weighted linear
combination of the weak rankers selected over the iterations. When
applied on the relevance ranking problem, document features are
taken as weak ranker candidates. The RankBoost-trained model is
thus viewed as a set of selected features which, in combination,
maximizes the performance gain function. The RankBoost-trained
model thus provides a basis for automated feature selection.
[0022] The RankBoost algorithm is developed based on the preference
concept. It operates on document pairs where one document in the
pair is valued (ranked) higher than the other. Weights are assigned
to each of these pairs to indicate how important it is that the
pair is ordered correctly, and the goal is to minimize such
weighted pair-wise errors.
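The weighted pair-wise error objective described above can be sketched in a few lines of Python (an illustrative sketch; the function name and data layout are ours, not taken from the application):

```python
def weighted_pairwise_error(scores, pairs, weights):
    """Sum the weights of preference pairs (i, j) -- where document i
    should be ranked above document j -- that the scores order wrongly
    (ties count as errors, since i fails to outrank j)."""
    return sum(w for (i, j), w in zip(pairs, weights)
               if scores[i] <= scores[j])
```

With pairs [(0, 1), (1, 2)] and equal weights, scores that respect the preferred order give an error of 0, while fully reversed scores accumulate the whole weight mass.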
[0023] One embodiment of the disclosed method is a computer
implemented method used in a ranking algorithm. The method
reiteratively applies a set of ranking candidates to a training
data set which includes a plurality of ranking objects having a
known pairwise ranking order. A ranking candidate is a candidate
ranking function or ranker. In each iteration round, a weight
distribution of ranking object pairs is applied, and each ranking
candidate yields a ranking result. A favored (preferred) ranking
candidate is then identified based on the ranking results, and the
weight distribution is updated to be used in next iteration by
increasing weights of ranking object pairs that are poorly (or
incorrectly) ranked by the favored ranking candidate. The method
finally infers a target feature set from the favored ranking
candidates identified in the iteration rounds.
[0024] The favored ranking candidates are preferentially selected
by the algorithm based on the performance of the ranking
candidates. In one embodiment, the favored ranking candidate is the
best performing ranking candidate that gives minimum pairwise error
in that round. As will be illustrated in further detail herein,
ranking candidates may be derived from an initial set of candidate
ranking features, preferably as weak rankers which have a simple
ranking function. Each round of RankBoost iteration chooses from a
set of weak ranking candidates the weak ranker h that gives minimum
pair-wise error, given the current weight distribution. The
distribution is then adjusted by increasing the weight of the pairs
that are incorrectly ordered by h.
[0025] The weight distribution may be an n×n matrix d_ij in which n
is the number of documents and d_ij is a scaled number measuring the
importance of having the order between document i and document j
right. In the disclosed feature selection method, the weight
distribution matrix d_ij is updated at the end of each iteration
round by increasing the values of those d_ij elements which are
incorrectly ordered by the favored ranking candidate selected in the
present iteration round. With this adjustment, in the next round,
the algorithm will favor weak rankers that correctly order those
pairs, thus acting as a complement to the weak rankers selected so
far.
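The end-of-round update of the pair weights can be sketched as follows (a minimal sketch; the doubling factor `beta` is an assumption of ours, as the application does not fix a particular up-weighting rule):

```python
def update_weights(weights, pairs, scores):
    """Increase the weight of pairs the favored ranker ordered
    incorrectly, then renormalize so the weights again sum to 1."""
    beta = 2.0  # illustrative up-weighting factor (an assumption)
    new = [w * beta if scores[i] <= scores[j] else w
           for (i, j), w in zip(pairs, weights)]
    total = sum(new)
    return [w / total for w in new]
```

After the update, pairs the favored ranker got wrong carry a larger share of the distribution, so the next round favors rankers that complement it.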
[0026] The final model H is a linear combination of the selected
favored weak rankers in the following form:

H(d) = Σ_{t=1}^{T} α_t h_t(d),    (1)

where T is the number of iteration rounds, d refers to a document
(ranking object), and h_t denotes the weak ranker selected at round
t. The scaling coefficient α_t is calculated from the pair-wise
error of the ranking that h_t produces.
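Scoring a document with the final model of equation (1) reduces to a weighted sum over the selected weak rankers (a sketch with hypothetical names):

```python
def strong_ranker_score(d, selected):
    """Evaluate H(d) as the sum over rounds t of alpha_t * h_t(d),
    where `selected` holds the (alpha_t, h_t) pairs chosen per round."""
    return sum(alpha * h(d) for alpha, h in selected)
```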
[0027] In one embodiment, the ranking candidates are derived from
an initial set of ranking features, which are the pool of potential
ranking features from which a target feature set is to be selected.
Each ranking candidate so constructed is associated with one or
more ranking features. Based on this association, target features
may be inferred from the ranking candidates that appear in the
combination constituting a trained strong ranker, as expressed in
the above equation (1).
[0028] In the context of relevance ranking, for example, ranking
candidates can be derived from document features using two
different schemes. In the linear ranker scheme, h(d) takes the
feature value directly, and ranking documents translates to sorting
the documents in a decreasing order of feature values. In the
threshold (binary) ranker scheme, h(d) assigns the value 0 or 1 to
a document depending on whether its feature value is less than or
greater than a chosen threshold. In general, the threshold ranker
scheme provides a larger pool of weak ranking candidates.
[0029] Various forms of weak rankers, such as that proposed in Y.
Freund, et al., An Efficient Boosting Algorithm for Combining
Preferences, Journal of Machine Learning, 4:933-969, 2003, may be
used. For low complexity and good ranking quality, the following
exemplary weak ranker may be used:
h(d) = 1 if f_i(d) > θ
h(d) = 0 if f_i(d) ≤ θ or f_i(d) is undefined    (2)

where f_i(d) denotes the value of feature f_i for document d, and θ
is a threshold value.
[0030] A weak threshold ranker in the above example is thus defined
by two parameters: a feature f_i and a threshold θ. As shown in the
above equation (2), a weak ranker h(d) can only output a discrete
value 0 or 1.
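The two-parameter threshold ranker of equation (2) can be written directly (a sketch; modeling documents as dicts mapping feature id to value, with a missing key standing for an undefined feature, is our assumption):

```python
def make_threshold_ranker(i, theta):
    """Build the weak ranker of equation (2): output 1 when feature
    f_i exceeds theta, else 0 (including when f_i is undefined)."""
    def h(d):
        value = d.get(i)  # None when feature i is undefined for d
        return 1 if value is not None and value > theta else 0
    return h
```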
[0031] Some features may have a complex function instead of a
simple threshold output. Such complex features usually cannot be
sufficiently represented by just one weak ranker. Instead, multiple
thresholds θ are needed for expressing each complex feature.
The values of the features may be normalized to [0, 1] and then
divided into bins with a number of thresholds, for example, 128,
256, 512, or 1024 different thresholds. These different thresholds
correspond to a family of ranking candidates which share a common
feature. Different features then give rise to different families of
ranking candidates. In other words, the set of ranking candidates may
include multiple subsets of ranking candidates, and the ranking
candidates of each subset are derived from a common ranking feature
and differ from one another by each having a different threshold
parameter.
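Generating a family of candidates per feature by evenly binning the normalized [0, 1] range might look like this (a sketch; the application mentions 128 to 1024 thresholds, and a small count is used here only for illustration):

```python
def threshold_candidates(feature_ids, num_thresholds=8):
    """For each feature, emit (feature_id, theta) candidate pairs with
    thresholds evenly spaced inside the normalized range (0, 1)."""
    return [(fid, (k + 1) / (num_thresholds + 1))
            for fid in feature_ids
            for k in range(num_thresholds)]
```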
[0032] When weak rankers are derived from the features according to
the above equation (2), each feature is thus associated with
multiple weak rankers. Because a complex feature cannot be
sufficiently expressed by a single weak ranker, the algorithm may
keep selecting a weak ranker associated with this feature and
different thresholds in order to fully express the information of
the feature. As a result, through the multiple runs of iteration,
an individual feature may be selected multiple times.
[0033] On the other hand, when weak rankers are derived from the
features according to the above equation (2), each weak ranker
corresponds to only one feature. This correspondence is a basis for
eventually inferring selected features from the selected weak
rankers.
[0034] However, the above particular correspondence is true only
for the above illustrated weak ranker design. More complex designs
of weak rankers may be used in which one weak ranker may correspond
to multiple features.
[0035] A weak ranker could have a different or more complex form
than the above "threshold" weak ranker. Examples include:
[0036] h_i(d) = f_i, which may be referred to as a weak linear
ranker;
[0037] h_i(d) = log(f_i), which may be referred to as a weak log
ranker; and
[0038] h_ij(d) = f_i · f_j, which may be referred to as a weak
conjugate ranker, in which one weak ranker corresponds to two
features (and accordingly, two features are implied when this weak
ranker is selected by the RankBoost algorithm).
[0039] Weak threshold rankers as represented in equation (2) are
preferred because they have the ability to express a very complex
trend of one feature by combining different weak threshold rankers
that are associated with the same feature and different thresholds.
In addition, weak threshold rankers are found to have better
generalization ability than the weak linear ranker.
[0040] The following is an example for inferring features from
favored rankers when candidate rankers are weak threshold rankers
h(d) in equation (2). Suppose the RankBoost algorithm has selected
favored weak rankers as follows:
[0041] Round0: favored ranker with feature=15, threshold=3, and
alpha=0.7
[0042] Round1: favored ranker with feature=7, threshold=18, and
alpha=0.4
[0043] Round2: favored ranker with feature=15, threshold=9, and
alpha=-0.2
[0044] Round3: favored ranker with feature=163, threshold=3, and
alpha=0.5
[0045] Round4: favored ranker with feature=15, threshold=200, and
alpha=0.6
[0046] Round5: favored ranker with feature=1, threshold=17, and
alpha=0.3
[0047] In the above six rounds, the RankBoost algorithm has
selected six favored weak rankers covering four individual features
(feature id = 15, 7, 163, and 1). In the ranking model H(d), feature
15 has a more complex trend than the others because it has been
selected the most frequently, suggesting that feature 15 is a more
expressive feature.
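Inferring the target features from the six rounds above amounts to collecting the distinct feature ids of the winning weak rankers, and counting repeats as a hint of expressiveness (a sketch using the round data listed above):

```python
from collections import Counter

def infer_features(rounds):
    """Count how often each feature id appears among the per-round
    winners; frequent repeats suggest a more expressive feature."""
    return Counter(feature_id for feature_id, theta, alpha in rounds)

# (feature, threshold, alpha) for rounds 0-5 as listed above
rounds = [(15, 3, 0.7), (7, 18, 0.4), (15, 9, -0.2),
          (163, 3, 0.5), (15, 200, 0.6), (1, 17, 0.3)]
```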
[0048] In practice, far more than six rounds of iteration may be
carried out. It is found that in general, as the number of
iteration rounds increases, the algorithm continues to select new
features out of the whole set of features. This may be explained by
the fact that for a certain training set, RankBoost has a
theoretical ability to rank all training data correctly (i.e., with
the error rate approaching 0) given that the features are
expressive enough. However, not every iteration round adds a new
feature, and further the speed of adding new features tends to slow
down as the number of iteration rounds increases. In one
experiment, for example, with an initial whole set of over 1000
features, about 130 features (slightly over 10% of the total) were
selected after 2000 rounds of RankBoost selecting threshold-based
weak rankers, and about 150 features were selected after 5000
rounds of RankBoost selecting linear weak rankers.
[0049] In addition, increasing the number of iteration rounds
generally improves the quality of the feature set initially, but
does not result in unlimited improvement. The resultant feature set
can be tested by training either a 1-layer or a 2-layer network,
which measures the relevance and the validity of the feature set by
its NDCG performance. As the number of iteration rounds increases,
the NDCG performance generally improves at first but tends to become
flat after a certain point (e.g., 5000 rounds), suggesting that
before the error rate approaches zero, the model may start to
over-fit. For the foregoing reasons, the point
to stop the iteration may be determined empirically.
[0050] The initial feature set is usually the whole set of features
that are available. In practice, when a new feature is designed,
the whole set may be updated to add the new feature. The automated
feature selection also acts as a test for the newly designed
feature. An effectively designed new feature may be quickly picked
up by the automated feature selection and thus become a part of the
selected feature set which is further trained and used for actual
ranking by the search engines. If the new feature is not selected
by the automated feature selection algorithm at all after a certain
number of attempts, the feature may be added to a black list, which
will guide the RankBoost algorithm not to waste time checking the
feature.
[0051] To find the best performing h(d), all possible combinations
of feature f_i and threshold θ are checked. This
can become a computationally expensive process. Special algorithms
and hardware disclosed in U.S. patent application Ser. No.
11/737,605 entitled "FIELD-PROGRAMMABLE GATE ARRAY BASED
ACCELERATOR SYSTEM", may be used to speed up this computation
process.
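The exhaustive check can be sketched as a brute-force minimization over candidate rankers (an illustration of the search only, not the accelerator implementation referenced above; representing candidates as plain ranker functions is our simplification):

```python
def best_weak_ranker(candidates, docs, pairs, weights):
    """Return the candidate ranker with minimum weighted pair-wise
    error under the current weight distribution."""
    def error(h):
        scores = [h(d) for d in docs]
        return sum(w for (i, j), w in zip(pairs, weights)
                   if scores[i] <= scores[j])
    return min(candidates, key=error)
```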
[0052] FIG. 2 is a flowchart of an exemplary automated process of
feature selection. The major components of the feature selection
process 200 are described as follows.
[0053] Block 210 constructs a set of ranking candidates using an
initial set of features.
[0054] Block 220 applies each ranking candidate to a training data
set comprising a plurality of ranking objects having a known
pairwise ranking order and a weight distribution of ranking object
pairs. Each ranking candidate yields a ranking result.
[0055] Block 230 compares the ranking results of the set of ranking
candidates to identify a favored ranking candidate.
[0056] Block 240 analyzes the ranking result of the favored ranking
candidate to identify ranking object pairs poorly ranked by the
favored ranking candidate.
[0057] Block 250 adjusts the weight distribution by increasing the
weights of the ranking object pairs poorly ranked by the favored
ranking candidate.
[0058] Block 260 determines whether the iteration should be
stopped. The stop point may be selected empirically. For example, a
certain target number of iteration rounds, chosen based on
empirical experience, may be built into the RankBoost algorithm to
stop the iteration when the target number of iteration rounds is
reached. The iteration may also be stopped manually. Before the
iteration stops, the feature selection process 200 returns to block
220 to repeat the procedure of blocks 220-250. Each iteration
identifies a favored (e.g., best performing) ranking candidate.
[0059] Block 270 infers a target feature set from the favored
ranking candidates identified in previous iterations. The inferred
target feature set is then used for further training to obtain a
final ranking function.
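Blocks 210 through 270 can be strung together into a compact loop (a minimal end-to-end sketch under simplifying assumptions of ours: dict-based documents, a doubling weight update, and a fixed round count in place of the empirically chosen stop point):

```python
def rankboost_feature_selection(candidates, docs, pairs, num_rounds=5):
    """Iterate: pick the lowest-error candidate (blocks 220-230),
    up-weight its mistakes (blocks 240-250), and infer the target
    feature set from the winners (block 270).
    `candidates` holds (feature_id, ranker_fn) pairs."""
    weights = [1.0 / len(pairs)] * len(pairs)
    selected = []
    for _ in range(num_rounds):
        def error(cand):
            scores = [cand[1](d) for d in docs]
            return sum(w for (i, j), w in zip(pairs, weights)
                       if scores[i] <= scores[j])
        fid, h = min(candidates, key=error)
        selected.append(fid)
        scores = [h(d) for d in docs]
        weights = [w * 2.0 if scores[i] <= scores[j] else w
                   for (i, j), w in zip(pairs, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return set(selected)
```

On a toy data set where only one feature orders the training pair correctly, the loop keeps selecting that feature and the inferred target set contains it alone.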
[0060] FIG. 3 is a block diagram showing an exemplary process of
automated feature selection which provides selected features for
further training of a ranking model. As shown, automated feature
selection 310 includes feature extraction block 312, with which
features are extracted from one or more suitable sources. Features
may be extracted from training data 302, but may also be provided
independently. The result of feature extraction or collection is an
initial feature set 314, which may in practice be a large feature set
including, for example, over a thousand features. The initial
feature set 314 is input to a weak ranker boosting algorithm 316
(e.g., RankBoost algorithm), resulting in a boosted ranking model
318, which may be, as illustrated herein, a linear combination of
the weak rankers selected by the weak ranker boosting algorithm
316. Feature inference block 320 infers features from the boosted
ranking model 318. The detail of automated feature selection 310
has been illustrated with the previously described processes, and
particularly with reference to FIGS. 1-2.
[0061] Automated feature selection 310 results in a set of selected
features 330 which is further trained by a training process 340.
The training can be done using any suitable algorithms, including
RankNet, LambdaRank, and RankBoost. FIG. 3 shows a training process
340 based on RankNet, without losing generality. Block 343
represents an input transformation process in which the set of
selected features 330 is transformed into input features 344 to be
fed to RankNet training engine 346. The output of the RankNet
training process 340 is a RankNet model 348, which can be used by a
search engine for actual rankings of search results.
[0062] The method described herein is capable of automatically
selecting features for a ranking algorithm. Conventional training
techniques require extensive manual selection of features, in which
a human trainer tunes the feature set according to the results of
numerous training experiments with various feature combinations.
The presently disclosed method greatly simplifies the workflow of
feature selection, saving time and effort. Furthermore, the entire
RankNet training can now be automated. The automated feature
selection also yields performance good enough to justify replacing
manual selection with automation.
[0063] The RankBoost algorithm is uniquely applied in the automatic
feature selection method disclosed herein. Studies conducted using
the disclosed method suggest that the automated feature selection
based on the RankBoost algorithm has a great potential to improve
not only the efficiency of training, but also the search
relevance.
[0064] In addition, the automated feature selection may be further
accelerated by an FPGA-based accelerator system for the automated
feature selection. An exemplary accelerator system is described in
U.S. patent application Ser. No. 11/737,605 entitled
"FIELD-PROGRAMMABLE GATE ARRAY BASED ACCELERATOR SYSTEM". In one
embodiment, the FPGA-based accelerator system accelerates the
feature selection software by a factor of nearly 170.
[0065] The automated feature selection may be further enhanced
using a distributed software system, also described in the above
referenced U.S. patent application. The distributed software system
is able to support much larger data sets than an FPGA-based
accelerator usually can.
[0066] FIG. 4 is a block diagram of a computer system implementing
the automated feature selection of the present disclosure. The
computer system 400 includes processor(s) 410, I/O devices 420 and
computer readable media (memory) 430. The computer readable media
430 stores application programs 432 and data 434 (such as features,
ranking candidates and training data). Application programs 432 may
include several application modules. As illustrated, examples of
such application modules include a weak ranker boosting
algorithm 442 (such as a RankBoost algorithm) to obtain a boosted
ranking model from combined selected weak rankers, a feature
inference module 444 to infer a selected feature set from the
boosted ranking model, and a training module 446 to train a ranking
function based on the selected feature set. These application
modules in the application programs 432 may together contain
instructions which, when executed by processor(s) 410, cause the
processor(s) 410 to perform actions of a process described herein
(e.g., the illustrated processes of FIGS. 2-3).
[0067] An exemplary process that can be performed by the weak
ranker boosting algorithm 442 is to reiteratively apply a set of
ranking candidates to a training data set comprising a plurality of
ranking objects having a known pairwise ranking order. In each
iteration round, a weight distribution of ranking object pairs is
applied, and each ranking candidate yields a ranking result. A
favored ranking candidate (e.g., a best-performing ranker) is
identified based on the ranking results, and the weight
distribution is updated to be used in the next iteration by increasing
weights of ranking object pairs that are poorly ranked by the
favored ranking candidate. The weak ranker boosting algorithm 442
may further prepare the favored ranking candidates identified in
the previous iteration rounds for inferring a target feature set
therefrom. The feature inference is preferably performed by the
feature inference module 444, but can be performed separately or
even manually.
[0068] It is appreciated that a computing system may be any device
that has a processor, an I/O device, and computer readable media
(either internal or external), and is not limited to a personal
computer or workstation. In particular, a computing device may be a
server computer, or a cluster of such server computers, connected
through network(s), which may be either the Internet or an
intranet.
[0069] It is appreciated that the computer readable media may be
any suitable storage or memory device for storing computer data.
Such storage or memory devices include, but are not limited to,
hard disks, flash memory devices, optical data storage devices, and
floppy disks. Furthermore, the computer readable media containing
the computer-executable instructions may consist of component(s) in
a local system or components distributed over a network of multiple
remote systems. The data of the computer-executable instructions
may either be delivered in a tangible physical memory device or
transmitted electronically.
[0070] Further Detail of RankBoost Algorithm
[0071] An exemplary RankBoost algorithm which can be used for the
automated feature selection disclosed herein is described in
further detail below. Generally, when ranking objects, the goal is
to find a ranking function to order the given set of objects. Such
an object is denoted as an instance $x$ in a domain (or instance
space) $X$. As a form of feedback, information about which instance
should be ranked above (or below) another is provided for every
pair of instances. This feedback is denoted as a function $\Phi:
X \times X \to \mathbb{R}$, where $\Phi(x_0, x_1) > 0$ means
$x_1$ should be ranked above $x_0$, and $\Phi(x_0, x_1) < 0$ means
$x_0$ should be ranked above $x_1$. A learner then attempts to find
a ranking function $H: X \to \mathbb{R}$ that is as consistent as
possible with the given $\Phi$, by asserting that $x_1$ is
preferred over $x_0$ if $H(x_1) > H(x_0)$.
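As a toy illustration of this feedback function (the ratings and names below are hypothetical, not from the application):

```python
# Toy pairwise feedback Phi over three instances. Phi(x0, x1) > 0
# means x1 should be ranked above x0; Phi(x0, x1) < 0 means the
# opposite. The desired order here comes from hypothetical ratings.
relevance = {"a": 3, "b": 1, "c": 2}

def phi(x0, x1):
    # Positive when x1 is preferred over x0, negative otherwise.
    return relevance[x1] - relevance[x0]

# A ranking function H is consistent with phi on a pair when
# H(x1) > H(x0) exactly if phi(x0, x1) > 0; the ratings themselves
# form a trivially consistent H.
H = relevance.get
assert (H("a") > H("b")) == (phi("b", "a") > 0)
```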
[0072] A relevance-ranking algorithm may be used to learn the
ranking function $H$ by combining a given collection of ranking
functions. The relevance-ranking algorithm may be pair-based or
document-based. The pseudocode for one such relevance-ranking
algorithm is shown below:
Initialize: distribution $D$ over $X \times X$
Do for $t = 1, \ldots, T$:
[0073] (1) Train WeakLearn using distribution $D_t$.
[0074] (2) WeakLearn returns a weak hypothesis $h_t$.
[0075] (3) Choose $\alpha_t \in \mathbb{R}$.
[0076] (4) Update weights: for each pair $(d_0, d_1)$:
$$D_{t+1}(d_0, d_1) = \frac{D_t(d_0, d_1)\exp(-\alpha_t(h_t(d_0) - h_t(d_1)))}{Z_t}$$
[0077] where $Z_t$ is the normalization factor:
$$Z_t = \sum_{(d_0, d_1)} D_t(d_0, d_1)\exp(-\alpha_t(h_t(d_0) - h_t(d_1))).$$
Output: the final hypothesis:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
[0078] The relevance-ranking algorithm is utilized in an iterative
manner. In each round, a procedure named "WeakLearn" is called to
select the best "weak ranker" from a large set of candidate weak
rankers. A weak ranker has the form $h_t: X \to \mathbb{R}$, and
$h_t(x_1) > h_t(x_0)$ means that instance $x_1$ is ranked higher
than $x_0$ in round $t$. A distribution $D_t$ over $X \times X$ is
maintained in the training process. The weight $D_t(x_0, x_1)$ will
be decreased if $h_t$ ranks $x_0$ and $x_1$ correctly
($h_t(x_1) > h_t(x_0)$), and increased otherwise. Thus, $D_t$ will
tend to concentrate on the pairs that are hard to rank. The final
strong ranker $H$ is a weighted sum of the weak rankers selected in
each round.
[0079] The WeakLearn algorithm may be implemented to find the weak
ranker with a maximum $r(f, \theta)$ by generating a temporary
variable $\pi(d)$ for each document. The WeakLearn algorithm may be
defined as follows:
Given: distribution $D(d_0, d_1)$ over all pairs
Initialize:
(1) For each document $d(q)$, compute
$$\pi(d(q)) = \sum_{d'(q)} \left( D(d'(q), d(q)) - D(d(q), d'(q)) \right)$$
[0080] (2) For every feature $f_k$ and every threshold
$\theta_s^k$, compute
$$r(f_k, \theta_s^k) = \sum_{d(q): f_k(d(q)) > \theta_s^k} \pi(d(q))$$
[0081] (3) Find the maximum $|r^*(f_{k^*}, \theta_{s^*}^{k^*})|$
[0082] (4) Compute:
$$\alpha = \frac{1}{2}\ln\left(\frac{1 + r^*}{1 - r^*}\right)$$
Output: weak ranking $(f_{k^*}, \theta_{s^*}^{k^*})$ and $\alpha$.
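A minimal sketch of these WeakLearn steps follows, assuming the distribution $D$ is stored as a mapping from $(d_0, d_1)$ document pairs (with $d_1$ preferred) to weights; all names are hypothetical:

```python
import math
from collections import defaultdict

def weak_learn(D, features, thresholds):
    # D: dict mapping a pair (d0, d1) to its weight, where d1 should
    # rank above d0. features: dict doc -> {feature index: value}.
    # thresholds: dict feature index -> list of candidate thresholds.
    # Step (1): pi(d) = sum over d' of (D(d', d) - D(d, d')).
    pi = defaultdict(float)
    for (d0, d1), w in D.items():
        pi[d1] += w  # d1 appears as the second (preferred) element
        pi[d0] -= w  # d0 appears as the first element
    # Steps (2)-(3): maximize |r(f_k, theta)| over all combinations.
    best = None
    for k, thetas in thresholds.items():
        for theta in thetas:
            r = sum(p for d, p in pi.items() if features[d][k] > theta)
            if best is None or abs(r) > abs(best[2]):
                best = (k, theta, r)
    k, theta, r = best
    # Step (4): alpha = (1/2) ln((1 + r*) / (1 - r*)).
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    return k, theta, alpha
```

Note that $|r^*|$ must be strictly below 1 for $\alpha$ to be finite; a production implementation would smooth or clamp it.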
[0083] To extend the relevance-ranking algorithm to Web relevance
ranking, training pairs may be generated and weak rankers may be
defined. To generate the training pairs, the instance space for a
search engine may be partitioned according to queries issued by
users. For each query q, the returned documents may be rated with a
relevance score from 1 (`poor match`) to 5 (`excellent match`)
using a manual or automated process. Unlabeled documents
may be given a relevance score of 0. Based on the rating scores
(ground truth), the training pairs for the relevance-ranking
algorithm may be generated from the returned documents for each
query.
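Pair generation from these rating scores can be sketched as follows (a simplified illustration with hypothetical names; the application may generate pairs differently):

```python
from itertools import combinations

def make_pairs(ratings):
    # ratings: dict mapping each returned document (for one query) to
    # its ground-truth relevance score, 0-5. Emit a (worse, better)
    # pair for every pair of documents with differing scores;
    # equally rated documents yield no training pair.
    pairs = []
    for d0, d1 in combinations(ratings, 2):
        if ratings[d0] < ratings[d1]:
            pairs.append((d0, d1))
        elif ratings[d1] < ratings[d0]:
            pairs.append((d1, d0))
    return pairs
```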
[0084] So-called "weak rankers" may be defined as a transformation
of a document feature, which is a one-dimensional real-valued
number. Document features can be classified into query-dependent
features, such as query term frequencies in a document and term
proximity, and query-independent features, such as PageRank. Thus,
the same document may be represented by different feature vectors
for different queries based upon its query-dependent features.
[0085] In keeping with the previous algorithm example, a document
may be designated as $d(q)$, a pair as $\{d_1(q), d_2(q)\}$, and
$d_j^i$ denotes a document for query $q_i$. The $k$-th feature for
a document is denoted as $f_k(d_j^i)$. With these notations, an
alternative relevance-ranking algorithm may be implemented as
follows.
[0086] Given: $N_q$ queries $\{q_i \mid i = 1, \ldots, N_q\}$.
[0087] $N_i$ documents $\{d_j^i \mid j = 1, \ldots, N_i\}$ for each
query $q_i$, where $\sum_{i=1}^{N_q} N_i = N_{doc}$.
[0088] $N_f$ features $\{f_k(d_j^i) \mid k = 1, \ldots, N_f\}$ for
each document $d_j^i$.
[0089] $N_\theta^k$ candidate thresholds $\{\theta_s^k \mid s = 1,
\ldots, N_\theta^k\}$ for each $f_k$.
[0090] $N_{pair}$ pairs $(d_{j1}^i, d_{j2}^i)$ generated from the
ground truth rating $\{R(q_i, d_j^i)\}$ or $\{R_j^i\}$.
Initialize: initial distribution $D(d_{j1}^i, d_{j2}^i)$ over
$X \times X$
Do for $t = 1, \ldots, T$:
[0091] (1) Train WeakLearn using distribution $D_t$.
[0092] (2) WeakLearn returns a weak hypothesis $h_t$ and weight
$\alpha_t$.
[0093] (3) Update weights: for each pair $(d_0, d_1)$:
$$D_{t+1}(d_0, d_1) = \frac{D_t(d_0, d_1)\exp(-\alpha_t(h_t(d_0) - h_t(d_1)))}{Z_t}$$
[0094] where $Z_t$ is the normalization factor:
$$Z_t = \sum_{(d_0, d_1)} D_t(d_0, d_1)\exp(-\alpha_t(h_t(d_0) - h_t(d_1))).$$
Output: the final hypothesis:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
[0095] For the relevance-ranking algorithms described by example
above, WeakLearn may be defined as a routine that uses the $N_f$
document features to form its weak rankers, attempting to find the
one with the smallest pair-wise disagreement relative to
distribution $D$ over the $N_{pair}$ document pairs. As previously
described, an exemplary weak ranker may be defined by the following
relationship:
$$h(d) = \begin{cases} 1 & \text{if } f_i(d) > \theta \\ 0 & \text{if } f_i(d) \leq \theta \text{ or } f_i(d) \text{ is undefined} \end{cases}$$
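In code, this thresholded weak ranker is a one-liner (hypothetical names; `None` stands in for an undefined feature value):

```python
def weak_ranker(f_value, theta):
    # h(d) = 1 if f_i(d) > theta; 0 if f_i(d) <= theta or undefined.
    return 1 if f_value is not None and f_value > theta else 0
```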
[0096] To find the best $h(d)$, the weak learner checks all of the
possible combinations of feature $f_i$ and threshold $\theta$. The
WeakLearn algorithm may be implemented to ascertain a maximum
$r(f, \theta)$ by generating a temporary variable $\pi(d)$ for each
document. Intuitively, $\pi$ contains information regarding labels
and pair weights, and the weak ranker only needs to access $\pi$ in
a document-wise manner for each feature and each threshold; that
is, the complexity is $O(N_{doc} N_f N_\theta)$ in a
straightforward implementation. Based on this, an alternative weak
learner may be utilized using an integral histogram to further
reduce the computational complexity to $O(N_{doc} N_f)$. Because of
this relatively low computational complexity, the algorithm may be
implemented in both software and hardware, e.g., an accelerator
system utilizing an FPGA, as described above.
[0097] According to this implementation, $r$ may be calculated in
$O(N_{doc} N_f)$ time in each round using an integral histogram.
First, the feature values $\{f_k(d)\}$ in one dimension of the
whole feature vector $(f_1, \ldots, f_{N_f})$ may be classified
into $N_{bin}$ bins. The boundaries of these bins are:
$$\theta_s^k = \frac{f_{max}^k - f_{min}^k}{N_{bin}} s + f_{min}^k, \quad s = 0, 1, \ldots, N_{bin},$$
where $f_{max}^k$ and $f_{min}^k$ are the maximum and minimum
values of all $f_k$ in the training data set. Then each document
$d$ can be mapped to one of the bins according to the value of
$f_k(d)$:
$$\mathrm{Bin}_k(d) = \mathrm{floor}\left(\frac{f_k(d) - f_{min}^k}{f_{max}^k - f_{min}^k}(N_{bin} - 1)\right)$$
The histogram of $\pi(d)$ over feature $f_k$ is then built using:
$$\mathrm{Hist}_k(i) = \sum_{d: \mathrm{Bin}_k(d) = i} \pi(d), \quad i = 0, \ldots, N_{bin} - 1$$
Then, an integral histogram can be determined by adding elements in
the histogram from the right ($i = N_{bin} - 1$) to the left
($i = 0$). That is,
$$\mathrm{Integral}_k(i) = \sum_{a \geq i} \mathrm{Hist}_k(a), \quad i = 0, \ldots, N_{bin} - 1$$
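A sketch of this integral-histogram computation for one feature follows; it assumes $\pi(d)$ has already been computed as above, and all names are hypothetical. The returned array accumulates $\pi(d)$ over every bin boundary in a single pass over the documents, which is what reduces the per-round cost toward $O(N_{doc} N_f)$:

```python
import math

def integral_histogram(pi, fvals, n_bins):
    # pi: dict doc -> pi(d); fvals: dict doc -> f_k(d) for one feature.
    fmin, fmax = min(fvals.values()), max(fvals.values())
    # Build the histogram of pi(d): map each document to a bin via
    # Bin_k(d) = floor((f_k(d) - fmin) / (fmax - fmin) * (n_bins - 1)).
    hist = [0.0] * n_bins
    for d, v in fvals.items():
        b = math.floor((v - fmin) / (fmax - fmin) * (n_bins - 1))
        hist[b] += pi[d]
    # Integral histogram: cumulative sums from the right
    # (i = n_bins - 1) to the left (i = 0), so integral[i] sums
    # pi(d) over all documents whose feature falls in bin i or above.
    integral = [0.0] * n_bins
    acc = 0.0
    for i in range(n_bins - 1, -1, -1):
        acc += hist[i]
        integral[i] = acc
    return integral
```

A production version would also guard against a constant feature (where $f_{max}^k = f_{min}^k$ makes the bin mapping degenerate).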
[0098] Although the above-described RankBoost algorithm can also be
used for performing training of a ranking model with a given set of
selected features, the present disclosure applies the above
algorithm from a unique angle: it starts from a given initial
feature set and selects a subset of features as a preparatory stage
for further training. The automated feature selection disclosed
herein significantly improves the ability to handle a variety of
initial feature sets which tend to include a large number of
features and also change frequently. With the efficient feature
selection tool disclosed herein, feature selection is done and/or
updated quickly whenever necessary, and thereafter a final ranking
model may be obtained using any suitable rank training algorithm,
such as RankNet and LambdaRank.
[0099] It is appreciated that the potential benefits and advantages
discussed herein are not to be construed as a limitation or
restriction to the scope of the appended claims.
[0100] Although the subject matter has been described in language
specific to structural features and/or methodological acts, it is
to be understood that the subject matter defined in the appended
claims is not necessarily limited to the specific features or acts
described. Rather, the specific features and acts are disclosed as
exemplary forms of implementing the claims.
* * * * *