U.S. patent application number 12/120038 was filed with the patent office on 2008-05-13 and published on 2009-11-19 for method and apparatus for better web ad matching by combining relevance with consumer click feedback.
Invention is credited to Deepak K. Agrawal, Deepayan Chakrabarti, Vanja Josifovski.
Application Number | 20090287672 12/120038 |
Family ID | 41317114 |
Filed Date | 2008-05-13 |
United States Patent
Application |
20090287672 |
Kind Code |
A1 |
Chakrabarti; Deepayan; et al. |
November 19, 2009 |
Method and Apparatus for Better Web Ad Matching by Combining
Relevance with Consumer Click Feedback
Abstract
A method and apparatus are provided for better web ad matching
by combining relevance with consumer click feedback. In one
example, the method includes receiving a query page, extracting
features from the query page, re-weighting the query page,
evaluating the query page in light of each ad in order to score
each ad and pick substantially best ad matches of the indexed ads,
and returning the substantially best ad matches to the consumer
computer.
Inventors: |
Chakrabarti; Deepayan; (Mountain View, CA);
Agrawal; Deepak K.; (Sunnyvale, CA);
Josifovski; Vanja; (Los Gatos, CA) |
Correspondence
Address: |
Stattler-Suh PC
60 SOUTH MARKET, SUITE 480
SAN JOSE
CA
95113
US
|
Family ID: |
41317114 |
Appl. No.: |
12/120038 |
Filed: |
May 13, 2008 |
Current U.S.
Class: |
1/1 ;
707/999.005; 707/E17.014 |
Current CPC
Class: |
G06F 16/24578 20190101;
G06F 16/951 20190101; G06Q 30/02 20130101 |
Class at
Publication: |
707/5 ;
707/E17.014 |
International
Class: |
G06F 7/06 20060101
G06F007/06 |
Claims
1. A method of comparing query pages to indexed ads in order to
provide better web ad matching, the method comprising: receiving a
query page; extracting features from the query page; re-weighting
the query page; evaluating the query page in light of each ad in
order to score each ad and pick substantially best ad matches of
the indexed ads; and returning the substantially best ad matches to
the consumer computer.
2. The method of claim 1, wherein the re-weighting the query page
includes ranking words of the query page based on a measure that
quantifies interaction between words occurring in the query page
and ad regions of the indexed ads.
3. The method of claim 1, wherein the re-weighting the query page
includes ranking words of the query page by computing average
tf-idf scores across the query page and indexed ads for respective
regions under consideration.
4. The method of claim 1, further comprising calculating a final
score for each ad in order to pick substantially best ad matches,
wherein the calculating the final score includes using logistic
regression to adjust the final score, wherein the final score
represents a click probability of a particular ad for a particular
query of the query page.
5. A method of indexing ads in order to provide better web ad
matching, the method comprising: receiving ads that were clicked at
a consumer computer; extracting ad features from the ads; sorting
the ads according to ad identification to provide a data file; and
inverting the data file to sort the data file according to feature
identification, wherein sorting the ads includes computing a static
score for each ad using parameters learnt from using logistic
regression on some training data.
6. The method of claim 5, wherein the logistic regression is
performed on regions of each ad, and wherein each region has a
different impact on the static score.
7. The method of claim 5, wherein the static score is used to
assign ad identification to the ads in decreasing ad score
order.
8. The method of claim 5, further comprising writing the inverted
data to an ads index database.
9. The method of claim 5, further comprising applying a variable
selection technique to select a subset of important words of the ad
features to be used in the logistic regression.
10. The method of claim 9, wherein the variable selection technique
includes at least one of: using clicks and views on the ads; and
using relevance scores of words of the ads that are independent of
click feedback.
11. An apparatus for comparing query pages to indexed ads in order
to provide better web ad matching, wherein the apparatus is
configured to receive a query page, the apparatus comprising: a
page feature extraction device configured to extract features from
the query page; a page feature re-weighting device configured to
re-weight the query page; a page evaluation device configured to
evaluate the query page in light of each ad in order to score each
ad and pick substantially best ad matches of the indexed ads,
wherein the apparatus is configured to return the substantially
best ad matches to the consumer computer.
12. The apparatus of claim 11, wherein the page feature
re-weighting device is further configured to rank words of the
query page based on a measure that quantifies interaction between
words occurring in the query page and ad regions of the indexed
ads.
13. The apparatus of claim 11, wherein the page feature
re-weighting device is further configured to rank words of the
query page by computing average tf-idf scores across the query page
and indexed ads for respective regions under consideration.
14. The apparatus of claim 11, wherein the click probability
calculation device is configured to calculate a final score for
each ad in order to pick substantially best ad matches and to use
logistic regression to adjust the final score, wherein the final
score represents a click probability of a particular ad.
15. An apparatus for indexing ads in order to provide better web ad
matching, wherein the apparatus is configured to receive ads that
were clicked at a consumer computer, the apparatus comprising: an
ad feature extraction device configured to extract ad features from
the ads; an ad identification assignment device configured to sort
the ads according to ad identification to provide a data file; and
an ad inversion sort device configured to invert the data file to
sort the data file according to feature identification, wherein the
apparatus is further configured to sort the ads by computing a
static score for each ad using parameters learnt from using
logistic regression on some training data.
16. The apparatus of claim 15, wherein the apparatus is further
configured to perform logistic regression on regions of each ad,
and wherein each region has a different impact on the static
score.
17. The apparatus of claim 15, wherein the apparatus is further
configured to use the static score to assign ad identification to
the ads in decreasing ad score order.
18. The apparatus of claim 15, further comprising an ad indexing
device configured to write the inverted data to an ads index
database.
19. The apparatus of claim 15, wherein the apparatus is further
configured to apply a variable selection technique to select a
subset of important words of the ad features to be used in the
logistic regression.
20. The apparatus of claim 19, wherein the variable selection
technique includes at least one of: using clicks and views on the
ads; and using relevance scores of words of the ads that are
independent of click feedback.
21. A computer readable medium carrying one or more instructions
for comparing query pages to indexed ads in order to provide better
web ad matching, wherein the one or more instructions, when
executed by one or more processors, cause the one or more
processors to perform the steps of: receiving a query page;
extracting features from the query page; re-weighting the query
page; evaluating the query page in light of each ad in order to
score each ad and pick substantially best ad matches of the indexed
ads; and returning the substantially best ad matches to the
consumer computer.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to providing better web ads.
More particularly, the present invention relates to providing
better web ads by matching words on query pages with words on
clicked web ads.
BACKGROUND OF THE INVENTION
[0002] Web advertising provides financial support for a large
portion of today's Internet ecosystem, catering to a diverse set of
websites, such as blogs, news, reviews, etc. Spurred by the
tremendous growth in traffic in terms of volume, number of
consumers, consumer engagement and content diversity, the last few
years leading up to 2008 have seen tremendous growth in spending on
web advertising.
[0003] A major part of the advertising on the web falls into the
category of textual ads, which are typically short textual messages
usually marked as "sponsored links" or similar. There are two main
types of textual ads on the web today:
[0004] 1. Sponsored search (i.e., paid search) advertising places
ads on the result pages from a web search engine based on the search
query. All major current web search engines support such ads and act
simultaneously as a search engine and an ad agency.
[0005] 2. Contextual advertising (i.e., Context Match) places ads
within the content of a generic, third-party web page. There usually
is a commercial intermediary, called an ad-network, in charge of
optimizing the ad selection with the twin goals of increasing
revenue (shared between publisher and ad-network) and improving
consumer experience. Here, the main players are the major search
engines; however, there are also many smaller players.
[0006] While the methods proposed in this paper could be adapted
for both sponsored search and contextual advertising, the relevant
background is primarily contextual advertising.
[0007] Studies have shown that displaying ads that are closely
related to the content of the page provides a better consumer
experience and increases the probability of clicks. This intuition
is analogous to that in conventional publishing, where there are
very successful magazines (e.g., Vogue) where a majority of the
content is topical advertising (e.g., fashion, in the case of
Vogue). Thus, estimating the relevance of an ad to a page is
critical in serving ads at run-time.
[0008] Previously published approaches estimated the relevance
based on co-occurrence of the same words or phrases within the ad
and within the page. The model used in this body of work is to
translate the ad search into a similarity search in a vector space.
Each ad is represented as a vector of features, for example,
unigrams, phrases and classes. The page is also translated to a
vector in the same space as the ads. The search for the
substantially best ads is now translated into finding the ad
vectors that are closest to the page vector. To make the search
efficient and scalable to hundreds of millions of ads and billions
of requests per day, an ad system can use an inverted index and an
efficient similarity search algorithm. A drawback of this method is
that it relies on a-priori information and does not use the
feedback (a posteriori) information that is collected in the form
of ad impressions (displays) and clicks.
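As an illustrative sketch only, the vector-space matching described above can be pictured as follows. The unigram-count vectors, the cosine measure and all names (`to_vector`, `cosine`, the sample page and ads) are assumptions made for illustration; the published approaches may use other features, weights and similarity functions, and a production system would use an inverted index rather than a linear scan.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words vector: term -> raw count (unigram features only)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical page and ads, made up for this example.
page = to_vector("digital camera reviews and camera lens guides")
ads = {
    "ad_camera": to_vector("buy a digital camera today"),
    "ad_soccer": to_vector("soccer cleats on sale"),
}

# Rank ads by closeness of their vector to the page vector.
ranked = sorted(ads, key=lambda ad_id: cosine(page, ads[ad_id]), reverse=True)
```

Note that nothing in this ranking depends on impressions or clicks, which is the drawback identified above.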
[0009] Another line of work uses click data to produce a CTR (click
through rate) estimate for an ad, independent of the page (or query
page, in the sponsored search scenario). The CTR is estimated based
on features extracted from the ads that are then used in a learning
framework to build models for estimation of the CTR of unseen ads.
In this approach, the assumption is that the ad system selects the
ads by a deterministic method--by matching the bid phrase to a
phrase from the page (or the query page in sponsored search).
Accordingly, to select the most clickable ads, the ad system only
needs to estimate the CTR on the ads with the matching bid phrase.
This simplifying assumption of the matching process is an obvious
drawback of these approaches. Another drawback is that these
methods do not account for differential click probabilities on
different pages: If some pages in the corpus attract an audience
that clicks on ads significantly more than average, then the
learning of feature weights for ads will be biased towards ads that
were (only by circumstance) shown on such pages.
SUMMARY OF THE INVENTION
[0010] What is needed is an improved method having features for
addressing the problems mentioned above and new features not yet
discussed. Broadly speaking, the present invention fills these
needs by providing a method and apparatus for providing better web
ad matching by combining relevance with consumer click feedback. It
should be appreciated that the present invention can be implemented
in numerous ways, including as a method, a process, an apparatus, a
system or a device. Inventive embodiments of the present invention
are summarized below.
[0011] In one embodiment, a method is provided for comparing query
pages to indexed ads in order to provide better web ad matching.
The method comprises receiving a query page, extracting features
from the query page, re-weighting the query page, evaluating the
query page in light of each ad in order to score each ad and pick
substantially best ad matches of the indexed ads, and returning the
substantially best ad matches to the consumer computer.
[0012] In another embodiment, a method is provided for indexing ads
in order to provide better web ad matching. The method comprises
receiving ads that were clicked at a consumer computer, extracting
ad features from the ads, sorting the ads according to ad
identification to provide a data file, and inverting the data file
to sort the data file according to feature identification, wherein
sorting the ads includes computing a static score for each ad using
parameters learnt using logistic regression on some training
data.
[0013] In still another embodiment, an apparatus is provided for
comparing query pages to indexed ads in order to provide better web
ad matching, wherein the apparatus is configured to receive a query
page. The apparatus comprises a page feature extraction device
configured to extract features from the query page, a page feature
re-weighting device configured to re-weight the query page, a page
evaluation device configured to evaluate the query page in light of
each ad in order to score each ad and pick substantially best ad
matches of the indexed ads, wherein the apparatus is
configured to return the substantially best ad matches to the
consumer computer.
[0014] In yet another embodiment, an apparatus is provided for
indexing ads in order to provide better web ad matching, wherein
the apparatus is configured to receive ads that were clicked at a
consumer computer. The apparatus comprises an ad feature extraction
device configured to extract ad features from the ads, an ad
identification assignment device configured to sort the ads
according to ad identification to provide a data file, and an ad
inversion sort device configured to invert the data file to sort
the data file according to feature identification, wherein the
apparatus is further configured to sort the ads by computing a
static score for each ad using parameters learnt using logistic
regression on some training data.
[0015] In still yet another embodiment, a computer readable medium
is provided carrying one or more instructions for comparing query
pages to indexed ads in order to provide better web ad matching.
The one or more instructions, when executed by one or more
processors, cause the one or more processors to perform the steps
of receiving a query page, extracting features from the query page,
re-weighting the query page, evaluating the query page to obtain
substantially best ad matches of the indexed ads, and calculating a
final score for each ad in order to pick substantially best ad
matches.
[0016] The invention encompasses other embodiments configured as
set forth above and with other features and alternatives.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The present invention will be readily understood by the
following detailed description in conjunction with the accompanying
drawings. To facilitate this description, like reference numerals
designate like structural elements.
[0018] FIG. 1 is a block diagram of a system for providing better
web ad matching by combining relevance with consumer click
feedback, in accordance with an embodiment of the present
invention;
[0019] FIG. 2 is a schematic diagram of a system for providing
better web ad matching by combining relevance with consumer click
feedback, in accordance with an embodiment of the present
invention;
[0020] FIG. 3 is a flowchart of a method of indexing ads in order
to provide better web ad matching, in accordance with an embodiment
of the present invention; and
[0021] FIG. 4 is a flowchart of a method for comparing query pages
to indexed ads in order to provide better web ad matching, in
accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0022] An invention for a method and apparatus for providing better
web ad matching by combining relevance with consumer click feedback
is disclosed. Numerous specific details are set forth in order to
provide a thorough understanding of the present invention. It will
be understood, however, by one skilled in the art, that the present
invention may be practiced with other specific details.
General Overview
[0023] FIG. 1 is a block diagram of a system 100 for providing
better web ad matching by combining relevance with consumer click
feedback, in accordance with an embodiment of the present
invention. A device of the present invention is hardware, software
or a combination thereof. A device may sometimes be referred to as
an apparatus. Each device is configured to carry out one or more
steps of the method of providing better web ad matching by
combining relevance with consumer click feedback.
[0024] The network 102 couples together a front end server 104, a
consumer computer 106, an ad server 110, an ads database 112, a
click feedback device 114, an ads index database 122 and a
relevance device 124. The network 102 may be any combination of
networks, including without limitation the Internet, a local area
network, a wide area network, a wireless network and a cellular
network. The click feedback device 114 includes without limitation
an ad feature extraction device 116, an ad identification
assignment device 118, an ad sort device 120 and an ad indexing
device 121. The relevance device 124 includes without limitation a
page feature extraction device 126, a page feature re-weighting
device 128, a page evaluation device 130 and a click probability
device 132.
[0025] Alternatively, one apparatus may contain two or more devices
of the system 100. For example, one apparatus may contain two or
more of the devices that include, for example, the front end server
104, the click feedback device 114, the ads index database 122 and
the relevance device 124.
[0026] The system 100 is based on logistic regression, a popular
technique in statistics and machine learning. The regression
enables the system 100 to combine click feedback and semantic
information available from both pages and ads to determine
relevancy. This system 100 is more general than a pure relevance
based approach that does not use click feedback in any form.
Indeed, experiments performed with the system 100 convincingly
demonstrate the usefulness of using click feedback to find more
relevant ads. There has been prior work that involves using
regression models for determining relevant ads. While such work has a
flavor similar to the present system 100, it learns only ad-specific
features, and ad-specific features are only a subset of the
features that the system 100 utilizes. In particular,
in addition to page and ad specific features, the system 100 learns
features that capture interactions between pages and ads.
Furthermore, the system 100 combines word-based features with
traditional relevance measures to enhance matching relevant ads to
pages.
[0027] The models of the system 100 are more granular and can
incorporate a larger number of features. Such incorporation reduces
bias in CTR estimates and leads to better performance. However,
reduced bias comes at the price of increased variance, which can
become a serious problem if the models become too granular and
start over-fitting the training data.
[0028] To balance these two issues, the system 100 utilizes a
two-pronged strategy. First, the system 100 uses a relatively large
but specially selected set of features, where the selection
mechanism ensures that the features have reasonable support. The
system 100 also provides a mechanism based on prior probabilities
to down-weight features that are too sparse.
[0029] The second strategy the system 100 uses to prevent
over-fitting is for the system 100 to train its models on an
extremely large corpus (e.g., billions of records, several thousand
features), which automatically increases the support of a large
number of features. Fortunately, data is plentiful especially for
big ad-networks that serve a large number of publishers and
advertisers. However, increased training size poses a difficult
computational challenge of scaling logistic regression to web scale
data. The system 100 overcomes this difficulty by using an
approximation based on a "divide and conquer strategy". In other
words, the system 100 randomly splits its training corpus into
several pieces and fits a separate logistic regression to each
piece. The system 100 obtains the final result by combining
estimates from all the pieces.
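As an illustrative sketch only, the "divide and conquer" approximation can be pictured as follows. The text above does not say how the per-piece estimates are combined, so simple averaging of coefficients is an assumption here, as are the plain gradient-ascent fitter, the synthetic data and every name in the block.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def fit_logistic(rows, labels, steps=200, lr=1.0):
    """Gradient ascent on the mean log-likelihood (no intercept, no prior)."""
    dim = len(rows[0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(rows, labels):
            err = y - sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k in range(dim):
                grad[k] += err * x[k]
        w = [wk + lr * gk / len(rows) for wk, gk in zip(w, grad)]
    return w

def divide_and_conquer_fit(rows, labels, n_pieces, seed=0):
    """Randomly split the corpus, fit each piece separately, average the fits.

    Averaging is an assumed combination rule, not one stated in the text.
    """
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    fits = []
    for piece in range(n_pieces):
        idx = order[piece::n_pieces]  # disjoint random piece of the corpus
        fits.append(fit_logistic([rows[i] for i in idx], [labels[i] for i in idx]))
    return [sum(f[k] for f in fits) / n_pieces for k in range(len(fits[0]))]

# Synthetic training corpus with known coefficients (made up for illustration).
rng = random.Random(42)
true_w = [1.0, -1.0, 0.5]
rows = [[rng.gauss(0, 1) for _ in true_w] for _ in range(2000)]
labels = [1.0 if rng.random() < sigmoid(sum(a * b for a, b in zip(true_w, x))) else 0.0
          for x in rows]
w_hat = divide_and_conquer_fit(rows, labels, n_pieces=4)
```

Each piece can be fitted independently, so the pieces may be processed in parallel; the price is that the combined estimate only approximates the fit on the full corpus.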
[0030] The system 100 carries out a method that involves three
broad steps--(a) feature extraction, (b) feature selection, and (c)
coefficient estimation for features through a logistic regression.
A detailed description of each of these steps is provided
below.
Feature Extraction
[0031] The system 100 treats pages and ads as being composed of
several regions. For instance, a page is composed of page title,
page metadata, page body, page URL etc. Similarly, an ad is
composed of ad title, ad body etc. Within each region, the ad
feature extraction device 116 and the page feature extraction
device 126 each extract a set of words/phrases after stop word
removal. The system 100 associates with each word a score that
measures its importance in a given region. The score may be, for
example, region specific tf (term frequency) or tf-idf (term
frequency-inverse document frequency). For a given page/ad region
combination, this model has three sets of features that are
described below.
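As an illustrative sketch only, region-wise extraction with a term-frequency score may look as follows. The stop-word list, the whitespace tokenizer, the region names and the function name `region_features` are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import Counter

# Assumed toy stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on"}

def region_features(regions):
    """Map each region name to {word: term frequency} after stop-word removal."""
    features = {}
    for name, text in regions.items():
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        features[name] = dict(Counter(words))
    return features

# Hypothetical page with two regions.
page = {
    "title": "The Camera Guide",
    "body": "a guide to the camera and the camera lens",
}
page_features = region_features(page)
```

The same function could be applied to the regions of an ad (title, body and so on), so that pages and ads end up in the same per-region word/score representation.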
[0032] The first feature set is page region specific main effects.
Web pages are usually composed of multiple regions with different
visibility and prominence. Accordingly, the impact of each region
on the ad selection can vary. The system 100 learns the effect of
each region separately. For a word $w$ in page region $p(r)$ with score
$t_{p(r)w}$, the region-specific main effect is defined as
$$M_{p(r)w} = \mathbf{1}(w \in p(r))\, t_{p(r)w}. \qquad \text{(Equation 1)}$$
In other words, if the word is present in the page region $p(r)$, the
feature contributes its score; otherwise it does not contribute. These
features provide an estimate of word popularity. These features are
not useful at the time of selecting relevant ads for a given page
but help in getting better estimates of other terms in the model
after adjusting for the effect of popular words on a page. For
instance, if "camera" pages are popular in terms of CTRs and 90% of
the corpus consists of camera pages, "camera" ads that were the
ones mostly shown on camera pages would tend to become popular even
on "soccer" pages which constitute only 1% of the total corpus. By
incorporating page words in the model, the system 100 adjusts for
this effect and gets the correct matching ads for "soccer"
pages.
[0033] The second feature set is ad region specific main effects.
Ads are also composed of multiple regions, some visible to the
consumer (title, abstract) and some used only in the ad selection
(bid phrase, targeting attributes). As with the page regions, the
ad regions can have a different impact on the ad selection. For a
word $w$ in ad region $a(r)$ with score $t_{a(r)w}$, this is defined
as
$$M_{a(r)w} = \mathbf{1}(w \in a(r))\, t_{a(r)w}. \qquad \text{(Equation 2)}$$
Unlike page specific main effects, ad region specific main effects
do play an important role when selecting relevant ads for a given
page and provide more weight to popular ads.
[0034] The third feature set is interaction effects between page
and ad regions. For a word $w_1$ in page region $p(r_1)$ and
word $w_2$ in ad region $a(r_2)$ with score
$f(t_{p(r_1)w_1}, t_{a(r_2)w_2})$ for some function $f$, this is
given as
$$I_{p(r_1)w_1,\,a(r_2)w_2} = \mathbf{1}(w_1 \in p(r_1),\, w_2 \in a(r_2))\, f(t_{p(r_1)w_1}, t_{a(r_2)w_2}). \qquad \text{(Equation 3)}$$
The system 100 confines itself to the case where $w_1 = w_2$.
In other words, the feature "fires" only if the same word occurs in
both the corresponding page and ad regions. However, one can
generally consider co-occurrences of synonyms or related words.
Examples of $f$ include the product function
$t_{p(r_1)w_1} \times t_{a(r_2)w_2}$, the geometric mean
$\sqrt{t_{p(r_1)w_1} \times t_{a(r_2)w_2}}$,
and so on. Interaction effects are important components of the
system 100 and help in matching relevant ads to a given page. For
instance, occurrence of the word "camera" in the ad body is a
strong indication of the ad being relevant for the page whose title
contains the word "camera," with the degree of relevance being
determined by the regression.
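As an illustrative sketch only, the three feature sets of Equations 1 to 3 can be computed as follows for the case where the same word must occur in both regions. Representing a region as a dictionary from word to score, and the names `main_effect`, `interaction`, `product` and `geometric_mean`, are assumptions made for illustration.

```python
import math

def main_effect(word, region_scores):
    """Equations 1 and 2: the word's score if it is present in the region, else 0."""
    return region_scores.get(word, 0.0)

def interaction(word, page_scores, ad_scores, f):
    """Equation 3 with w1 = w2: fires only when the word is in both regions."""
    if word in page_scores and word in ad_scores:
        return f(page_scores[word], ad_scores[word])
    return 0.0

def product(t_page, t_ad):
    """Product form of f."""
    return t_page * t_ad

def geometric_mean(t_page, t_ad):
    """Geometric-mean form of f."""
    return math.sqrt(t_page * t_ad)

# Hypothetical region scores (for example tf-idf values).
page_title = {"camera": 2.0, "guide": 1.0}
ad_body = {"camera": 8.0, "sale": 1.0}

camera_product = interaction("camera", page_title, ad_body, product)
camera_geo = interaction("camera", page_title, ad_body, geometric_mean)
guide_product = interaction("guide", page_title, ad_body, product)
```

Here "camera" fires because it appears in both the page title and the ad body, while "guide" contributes nothing to the interaction term because it is absent from the ad.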
Feature Selection
[0035] For any given (page, ad) region combination, a large number
of words occur in the training data. Using them all as features
might make the logistic regression ill-conditioned and inflate
variance of the coefficient estimates. Accordingly, the system 100
takes recourse to variable selection techniques which select a
subset of important words to be used in its regression. Variable
selection in the context of regression is a well studied area with
a rich literature. Stepwise backward-forward automated variable
selection algorithms are widely used for large scale applications,
but these methods have drawbacks, especially when features are
correlated. The general recommendation is to use as much domain
knowledge as possible instead of using an automated procedure to
select relevant variables. However, in large scale settings as in
the system 100, some level of automation is necessary.
[0036] For reasons of scalability, the system 100 uses a two-stage
approach. In the first stage, the system 100 conservatively prunes
non-informative features using simple measures that can be computed
using only a few passes over the training corpus. In the second
stage, the system 100 fits a regression to all the selected
features from the first stage but down-weights them through a
specially constructed prior that pools data from all the features.
Meanwhile, the system 100 preferably picks the features that are
less sparse. The second stage is discussed below in more detail in
the Approximate Logistic Regression section. The variable selection
methods are discussed next.
[0037] The system 100 selects the variables using two methods. The
first method is based on clicks and views. The second method is
based on relevance scores of words that are independent of any
click feedback. In the first approach (data-based), the system 100
ranks words based on a measure that quantifies the interaction
between words occurring in the page and ad regions. For a word $w$,
the interaction measure is defined as
$$I_w = \frac{\mathrm{CTR}_w^{both}}{\mathrm{CTR}_w^{page} \cdot \mathrm{CTR}_w^{ad}}, \qquad \text{(Equation 4)}$$
where $\mathrm{CTR}_w^{both}$ denotes the CTR when $w$ occurred both on
page region and ad region of an ad displayed on a page, and
$\mathrm{CTR}_w^{page}$ and $\mathrm{CTR}_w^{ad}$ denote the marginal CTRs
when $w$ is shown on the page and ad regions, respectively. Higher
values of the ratio indicate stronger interaction being induced by
the presence of the word $w$, which in turn should enhance the
matching quality of ads to pages. A variation of the measure above
may be tried with a square root of the denominator, although this is
likely to have no significant impact.
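As an illustrative sketch only, the interaction measure of Equation 4 can be computed from click and view counts as follows. The (clicks, views) tuple layout, the function names and the sample counts are assumptions made for illustration.

```python
def ctr(clicks, views):
    """Empirical click-through rate; 0.0 when there are no views."""
    return clicks / views if views else 0.0

def interaction_measure(both, page, ad):
    """Equation 4. Each argument is a (clicks, views) pair for word w."""
    ctr_page, ctr_ad = ctr(*page), ctr(*ad)
    if ctr_page == 0.0 or ctr_ad == 0.0:
        return None  # the ratio is undefined when a marginal CTR is zero
    return ctr(*both) / (ctr_page * ctr_ad)

# Made-up counts for the word "camera":
# both regions: 3% CTR, page only: 1% CTR, ad only: 2% CTR.
i_camera = interaction_measure(both=(30, 1000), page=(100, 10000), ad=(200, 10000))
```

With these counts the ratio is 0.03 / (0.01 x 0.02), or about 150, which would rank "camera" as a word with strong page/ad interaction.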
[0038] In the second approach (relevance-based), words are ranked
by computing the average tf-idf scores across the entire page and
ad corpus for the respective regions under consideration. Here, the
system 100 may involve two measures: (a) Create a single corpus by
treating page and ad regions as documents and compute a single
tf-idf average score for each word; and (b) Treat the page and ad
regions as different corpora and use the geometric mean of tf-idf
scores computed separately from page and ad regions for each
word.
[0039] For both measures, the system 100 picks, for example, the
top 1000 words and uses them in the logistic regression. To avoid
noisy estimates of CTRs in the ratio, the system 100 only considers
words that are shown simultaneously on ad and page regions at least
10 times and have non-zero marginal probabilities. It turns out
that the data-based approach gives better results for the same
number of words.
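As an illustrative sketch only, the selection step described above (keep the top-ranked words, subject to a minimum number of joint occurrences) may look as follows. The dictionary layout, the use of `None` to mark an undefined score, and the function name `select_words` are assumptions made for illustration.

```python
def select_words(stats, top_k=1000, min_support=10):
    """Keep the top_k highest-scoring words that have enough joint occurrences.

    stats maps word -> (score, times shown on both page and ad regions).
    A score of None marks a word whose measure is undefined.
    """
    eligible = [
        (score, word)
        for word, (score, n_both) in stats.items()
        if score is not None and n_both >= min_support
    ]
    eligible.sort(reverse=True)  # highest score first
    return [word for _, word in eligible[:top_k]]

# Made-up statistics: "rare" has a high score but too little support.
stats = {
    "camera": (150.0, 1000),
    "lens": (90.0, 40),
    "rare": (900.0, 3),
    "the": (1.0, 50000),
}
selected = select_words(stats, top_k=2)
```

The support threshold is what keeps a word such as "rare" from being selected on the strength of a noisy ratio estimated from only three joint impressions.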
Approximate Logistic Regression
[0040] Let $y_{ij}$ denote the binary click outcome (1 for click, 0
for no click) when ad $j$ is shown on page $i$. Assume $y_{ij}$ has a
Bernoulli distribution with CTR $p_{ij}$. In other words, the
probability distribution of $y_{ij}$ is given by
$P(y_{ij}) = p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}$. To
determine relevant ads for a given page $i$, the system 100 needs to
estimate the $p_{ij}$'s, with higher values indicating more relevant
ads. For ads that are shown a large number of times on a page, the
system 100 can estimate the CTR empirically by clicks per
impression. However, for purposes here, a large fraction of page-ad
pairs have a small number of impressions. In fact, since the CTRs
are typically low (0.1%-20% with a substantial right skewness in
the distribution), the number of impressions required to get
precise empirical estimates is high. For instance, to estimate a
5% CTR, the system 100 needs 1,000 impressions to be even 85%
confident that the estimate is within 1 percentage point of the
true CTR. Thus, the system 100 takes recourse to feature based
models. In other words, $p_{ij}$ is a function of features
extracted from page and ad regions as discussed above in the
Feature Extraction section.
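As an illustrative check only, the 85% figure above can be reproduced with a normal approximation to the binomial, reading "within 1%" as within one percentage point (0.01) of the true CTR. Both the approximation and that reading are assumptions of this sketch, as is the function name `coverage`.

```python
import math

def coverage(p, n, margin):
    """Normal-approximation probability that clicks/n lies within +/- margin of p."""
    std_err = math.sqrt(p * (1.0 - p) / n)  # standard error of the empirical CTR
    z = margin / std_err
    return math.erf(z / math.sqrt(2.0))     # P(|Z| < z) for a standard normal Z

confidence = coverage(p=0.05, n=1000, margin=0.01)
```

The standard error for a 5% CTR at 1,000 impressions is about 0.0069, so a margin of 0.01 is roughly 1.45 standard errors, which corresponds to a confidence of about 85%.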
[0041] To allow for arbitrary real-valued coefficients for
features, it is routine to map $p_{ij}$ onto the real line via a
monotonically increasing function. The most widely used function is
the logit, which maps $p_{ij}$ to
$\mathrm{logit}(p_{ij}) = \log[p_{ij}/(1 - p_{ij})]$. Assume
$\mathrm{logit}(p_{ij})$ is a linear
function of features representing the main effects and interaction
effects discussed in the Feature Extraction section. For
simplicity, consider a single (page, ad) region combination
$(p(r_1), a(r_2))$. The linear function in the logistic
regression is given by
$$\mathrm{logit}(p_{ij}) = \mathrm{logit}(q_{ij}) + \sum_w \alpha_w M_{p(r_1)w} + \sum_w \beta_w M_{a(r_2)w} + \sum_w \delta_{w,r_1,r_2} I_{p(r_1)w,\,a(r_2)w}, \qquad \text{(Equation 5)}$$
where $(\alpha, \beta, \delta)$ are unknown feature coefficients
to be estimated by logistic regression, and $\mathrm{logit}(q_{ij})$ are known
prior log-odds that could have been derived from a different model.
For instance, a uniform prior would assume $q_{ij} = \hat{p}$,
where $\hat{p}$ is the average CTR on the
entire training corpus. Another possibility is to derive prior
log-odds $\mathrm{logit}(q_{ij})$ by combining relevance scores with click
feedback.
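As an illustrative sketch only, Equation 5 can be evaluated for a single (page region, ad region) pair as follows. The dictionary representation of the coefficients, the choice of the product function for the interaction term, the sample values and the function names are assumptions made for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(prior_ctr, page_scores, ad_scores, alpha, beta, delta):
    """Equation 5 for one region pair, with a product interaction (assumed f)."""
    z = logit(prior_ctr)                                             # prior log-odds
    z += sum(alpha.get(w, 0.0) * t for w, t in page_scores.items())  # page main effects
    z += sum(beta.get(w, 0.0) * t for w, t in ad_scores.items())     # ad main effects
    z += sum(delta.get(w, 0.0) * page_scores[w] * ad_scores[w]       # interactions
             for w in page_scores.keys() & ad_scores.keys())
    return inv_logit(z)

# Hypothetical scores and coefficients.
page_title = {"camera": 1.0, "guide": 1.0}
ad_body = {"camera": 1.0, "sale": 1.0}

baseline = predict_ctr(0.02, page_title, ad_body, alpha={}, beta={}, delta={})
boosted = predict_ctr(0.02, page_title, ad_body, alpha={}, beta={}, delta={"camera": 0.5})
```

With all coefficients at zero the prediction falls back to the prior CTR, and a positive interaction coefficient for a shared word raises the predicted click probability above that prior.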
[0042] To add new (page,ad) region combination, the system 100 only
needs to augment Equation 5 with the appropriate linear terms for
the page main and ad main effects. For the interaction effects, the
system 100 re-parameterizes its model to facilitate indexing. The
re-parameterization is explained here. The connection to indexing
is discussed below in the Ad Search Prototype section. For each
(page,ad) combination (r.sub.1, r.sub.2), a word w that occurs in
both r.sub.1 and r.sub.2 has a coefficient
.delta..sub.w,r.sub.1.sub.,r.sub.2 which depends on the word w, the
page region and the ad region. We assume the following factored
parameterization:
$$\delta_{w,r_1,r_2} = \delta_w \, \gamma_{p(r_1)} \, \gamma_{a(r_2)}$$ (Equation 6)
In other words, the interaction of a word for a given page and ad
region combination is factored into word-specific, page-specific
and ad-specific components. Accordingly, for M words, R.sub.1 page
regions, R.sub.2 ad regions, the number of parameters equals
M+R.sub.1+R.sub.2 as opposed to MR.sub.1R.sub.2 in the original
model. The estimate of coefficients is obtained by maximizing the
log-likelihood of the data as given by
$$\sum_{ij} \left( y_{ij} \log(p_{ij}) + (1 - y_{ij}) \log(1 - p_{ij}) \right)$$ (Equation 7)
where p.sub.ij is given by Equation 5. The optimization problem
described above may become ill-conditioned and lead to high
variance estimates if features tend to be correlated or are sparse
or both. This is a drawback in our scenario where feature sparsity
and correlations are routine. To provide a robust solution, the
system 100 puts additional constraints on the coefficients in the
form of priors.
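The factorization of Equation 6 and the log-likelihood of Equation 7 can be illustrated with a few lines of code. This is a sketch under stated assumptions: the function names and the example sizes (1,000 words, 3 page regions, 2 ad regions) are hypothetical, and the clicks are assumed to be 0/1 indicators.

```python
import math


def factored_delta(word, page_region, ad_region, delta_w, gamma_p, gamma_a):
    """Equation 6: the interaction coefficient as a product of a word-specific,
    a page-region-specific, and an ad-region-specific component."""
    return delta_w[word] * gamma_p[page_region] * gamma_a[ad_region]


def parameter_counts(num_words, num_page_regions, num_ad_regions):
    """Number of interaction parameters: (unfactored model, factored model)."""
    full = num_words * num_page_regions * num_ad_regions
    factored = num_words + num_page_regions + num_ad_regions
    return full, factored


def log_likelihood(clicks, probabilities):
    """Equation 7 for observed clicks (assumed 0/1) and predicted probabilities."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for y, p in zip(clicks, probabilities))


full_count, factored_count = parameter_counts(1000, 3, 2)
coefficient = factored_delta("shoes", "title", "body",
                             delta_w={"shoes": 0.5},
                             gamma_p={"title": 2.0},
                             gamma_a={"body": 1.5})
```

For the example sizes, the factored model needs 1,005 interaction parameters instead of 6,000.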
[0043] A prior of N(0, .sigma..sup.2) would mean that the parameter
estimates are pinned down in the range (-3.sigma., 3.sigma.) with
about 99.7% probability a priori. In the absence of enough information
about the coefficient from data, this ensures that the coefficient
estimates do not diverge to the boundaries and cause numerical
instability. To put more stringent constraints on sparse features,
the system 100 down-weights the prior variance .sigma..sup.2 by a
measure of relative sparsity, which is the variance of the feature
occurrence process relative to average feature occurrence variance.
The feature occurrence variance is given by s(1-s), where s is the
fraction of times the feature occurs. In particular, a set of
relationships is provided as
$$\begin{aligned}
\alpha_w &\sim N\!\left(0,\ \sigma^2 \, \frac{s_p(w)\,(1 - s_p(w))}{\overline{s_p (1 - s_p)}}\right) \\
\beta_w &\sim N\!\left(0,\ \sigma^2 \, \frac{s_a(w)\,(1 - s_a(w))}{\overline{s_a (1 - s_a)}}\right) \\
\delta_w &\sim N\!\left(0,\ \sigma^2 \, \frac{s_I(w)\,(1 - s_I(w))}{\overline{s_I (1 - s_I)}}\right)
\end{aligned}$$ (Equation 8)
Note that separate averages are used for the main page and ad
effects, and interaction effects (indicated by the subscripts p, a,
and I). In real experiments, .sigma..sup.2=9; experiments with
several other values in the range of 3 to 20 have been found not to
yield much difference in the results.
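The sparsity-scaled prior variances can be sketched as follows. This is a minimal sketch, assuming the average occurrence variance is the plain mean of s(1-s) over the features of one effect type (page main, ad main, or interaction); the function name and the example occurrence fractions are hypothetical.

```python
def prior_variances(occurrence_fraction, sigma_sq=9.0):
    """Scale the base prior variance by each feature's relative sparsity.

    occurrence_fraction: dict mapping a feature to s, the fraction of times
    it occurs, for features of a single effect type.
    """
    occurrence_var = {w: s * (1.0 - s) for w, s in occurrence_fraction.items()}
    # Assumption: "average" means the unweighted mean over these features.
    average_var = sum(occurrence_var.values()) / len(occurrence_var)
    return {w: sigma_sq * v / average_var for w, v in occurrence_var.items()}


variances = prior_variances({"common_word": 0.5, "rare_word": 0.01})
```

A rare feature receives a prior variance well below the base value of 9, so its coefficient is held closer to zero unless the data say otherwise.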
[0044] Now, the optimization problem reduces to estimating the
coefficients by maximizing the log-posterior which is the sum of
the log-likelihood (Equation 7) and the log-prior of the
coefficients, as discussed above. Next, the optimization process
itself is discussed.
[0045] Several approaches to optimize the objective function exist.
Among the ones that have been used in large-scale applications are
iterative scaling, nonlinear conjugate gradient, quasi-Newton,
iteratively-reweighted least squares, truncated Newton, and
trust-region Newton. All these methods are iterative and generate a
sequence of estimates that converge to the optimal solution. For
all methods except iterative scaling, cost per iteration is high
but the convergence is fast. For iterative scaling which updates
one component at a time, cost per iteration is low but convergence
is slower. For the application here, the training corpus typically has
several million data points and several thousand features, making
it extremely slow to fit the model using these approaches on a
single machine. To scale the computations, the system 100 utilizes
a simple parallelization approach that randomly splits the data
into several parts, fits a logistic regression separately to each
part and then combines the estimates obtained from each piece. For
convenience, the system 100 may perform its computation in a
MapReduce framework. MapReduce is a conventional programming model
for processing large data sets. It runs on a large cluster of
commodity machines; it is highly scalable, processing several
gigabytes of data on thousands of machines, and is easy to use. The
run-time system automatically takes care of the details of
partitioning the data, scheduling jobs across machines, handling
failures and managing inter-machine communication.
[0046] To fit a logistic regression for a given piece, the system
100 uses a simple iterative scaling (also known as conditional
maximization) approach. The algorithm is as follows: Initialize the
coefficients .alpha.'s, .beta.'s, and .delta.'s to 0, and
.gamma..sub.p( )'s and .gamma..sub.a( )'s to 1; then update the
value of each coefficient one at a time holding the others fixed at
the current value by maximizing the likelihood through a
Newton-Raphson method. This completes a single iteration. The
procedure is continued through several iterations until
convergence. The method is substantially guaranteed to converge
since every step can only increase the likelihood. Along with a
coefficient estimate, the Newton-Raphson procedure provides an
estimate of the negative Hessian, the inverse of which provides an
estimate of variance of the coefficient from maximum likelihood
theory. The results on the various data partitions are combined
using a weighted average of the individual estimates, where the
weight assigned to each partition-specific estimate is its relative
precision obtained from the negative Hessian values. This weighting
scheme is the substantially best way to combine estimates through a
linear function.
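The per-partition fit and the precision-weighted combination described above can be sketched as follows. This is an illustrative simplification, not the disclosed implementation: it takes one Newton-Raphson step per coefficient per sweep, uses a single shared prior variance, omits the .gamma. factors, and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_partition(rows, labels, prior_var, sweeps=50):
    """Coordinate-wise Newton-Raphson on the log-posterior of one partition.

    rows: list of feature vectors (lists of floats); labels: 0/1 clicks.
    Returns the coefficients and, per coefficient, the negative second
    derivative (curvature), whose inverse approximates the variance.
    """
    n_features = len(rows[0])
    coef = [0.0] * n_features
    curvature = [0.0] * n_features
    for _ in range(sweeps):
        for k in range(n_features):
            # Gradient and curvature for coefficient k, others held fixed.
            grad = -coef[k] / prior_var
            curv = 1.0 / prior_var
            for x, y in zip(rows, labels):
                p = sigmoid(sum(c * v for c, v in zip(coef, x)))
                grad += (y - p) * x[k]
                curv += p * (1.0 - p) * x[k] * x[k]
            coef[k] += grad / curv
            curvature[k] = curv
    return coef, curvature


def combine_partitions(estimates):
    """Precision-weighted average of (coefficients, curvature) pairs."""
    n_features = len(estimates[0][0])
    combined = []
    for k in range(n_features):
        total_precision = sum(curv[k] for _, curv in estimates)
        weighted = sum(coef[k] * curv[k] for coef, curv in estimates)
        combined.append(weighted / total_precision)
    return combined


# Intercept-only toy partition: 3 clicks out of 10 impressions.
demo_coef, demo_curv = fit_partition([[1.0]] * 10, [1] * 3 + [0] * 7,
                                     prior_var=1e6)
merged = combine_partitions([([1.0], [1.0]), ([3.0], [3.0])])
```

In the toy partition the fitted intercept approaches logit(0.3), and in the merge example the estimate with three times the precision pulls the combined value to 2.5.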
Ad Search Prototype
[0047] FIG. 2 is a schematic diagram of a system 200 for providing
better web ad matching by combining relevance with consumer click
feedback, in accordance with an embodiment of the present
invention. A key feature of the system 200 is that it is suitable
for efficient evaluation over inverted ad indexes. This section
discusses an implementation of the system 200, which is a prototype
ad search engine based on a query page evaluation algorithm and
inverted indexing of the ads. The relevance device 124 including
the page evaluation device 130 may involve, for example,
calculations using a conventional WAND algorithm. The click
feedback device 114 including the ad inversion sort device 120 may
involve, for example, calculations using a conventional Hadoop
computing framework.
[0048] The system 200 allows for any kind of feature to be used in
the ad search. The system 200 uses unigrams, phrases and classes as
features. The ads index database 122 (i.e., inverted index) is
composed of one postings list for each feature that has one entry
(i.e., posting) for each ad that contains this feature. The ads are
represented in the ads index database 122 by adIDs, which are
unique numeric identifiers assigned to each ad.
[0049] Consumers 108 from multiple consumer computers 106 click
on ads on web pages. The front end server 104 informs the ads click
feedback device 114 of the ads clicked. The click feedback device
114 also has access to the ads database 112.
[0050] The system 200 produces the inverted ad index, preferably
over a grid of machines running the ad inversion framework.
The indexing starts with the ad feature extraction device 116
extracting features from the ads. The ad identification assignment
device 118 represents each feature by a unique numeric featureID
and sorts the resulting data file by <adID, featureId>. Next,
the ad inversion sort device 120 inverts this file by sorting the
file by <featureId, adID> as a key. The ad indexing device
121 then writes the inverted data file (delta compressed posting
lists) into the ads index database 122. The system 200 uses the ads
index database 122 later during query runtime in order to evaluate
queries.
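The indexing steps in this paragraph can be sketched on a single machine as follows. This is a minimal sketch with hypothetical names; it assumes that "delta compressed" means storing the first adID of each postings list followed by the gaps between consecutive adIDs, and it does not model the grid of machines.

```python
from itertools import groupby


def build_inverted_index(ads):
    """ads: dict mapping adID (int) to an iterable of feature names.

    Returns (feature_ids, index), where index maps a featureId to a
    delta-encoded postings list of adIDs.
    """
    feature_ids = {}
    pairs = []
    for ad_id in sorted(ads):
        for feature in sorted(set(ads[ad_id])):
            feature_id = feature_ids.setdefault(feature, len(feature_ids))
            pairs.append((ad_id, feature_id))
    pairs.sort()  # data file sorted by <adID, featureId>

    # Inversion: re-sort the same records with <featureId, adID> as the key.
    inverted = sorted((feature_id, ad_id) for ad_id, feature_id in pairs)

    index = {}
    for feature_id, group in groupby(inverted, key=lambda pair: pair[0]):
        ad_ids = [ad_id for _, ad_id in group]
        # Delta encoding: first adID, then the gap to each following adID.
        index[feature_id] = [ad_ids[0]] + [b - a for a, b in zip(ad_ids, ad_ids[1:])]
    return feature_ids, index


def decode_postings(deltas):
    """Recover the adIDs from a delta-encoded postings list."""
    ad_ids, current = [], 0
    for gap in deltas:
        current += gap
        ad_ids.append(current)
    return ad_ids


feature_ids, index = build_inverted_index(
    {1: ["shoes", "running"], 4: ["shoes"], 9: ["running", "shoes"]})
shoe_postings = index[feature_ids["shoes"]]
```

Because adIDs within a postings list are increasing, the gaps are small non-negative integers, which is what makes delta compression effective.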
[0051] There are a few important differences in this ad search
engine that require a different approach compared to web search
engines. First, in web search, the queries are short and the
documents are long. In the present ad search case, the number of
features per ad is usually lower than the number of features
extracted from a web page, which represents the ad space query here.
So, it is almost never the case that an ad will contain all the
features of the ad search query. Accordingly, the ad search engine
performs similarity search in the vector space with a long query
and relatively short ad vectors. In contrast, for the majority of
the web queries, there are many pages that contain all the query
words. One of the key issues is how to rank the pages containing
the query.
[0052] The relevance device 124 includes architecture for analyzing
content of query pages where the ads are shown. The page feature
extraction device 126 extracts features from a query page. The page
feature re-weighting device 128 breaks down the query page into a
bag of pairs <featureId, weight>. For each query page
feature, the page evaluation device 130 opens a cursor over the
posting list of this feature. During the evaluation, the page
evaluation device 130 moves the cursors forward examining the
documents (ads) as the documents are encountered from the ads index
database 122. The page evaluation device 130 is configured to find
the next cursor to be moved based on an upper bound of the score
for the documents at which the cursors are currently positioned.
The page evaluation device 130 keeps a heap of current candidates.
The invariant of the page evaluation device 130 is that the heap
contains the substantially best matches (highest scores) among the
documents (ads) with IDs less than the document pointed by the
current minimum cursor.
[0053] Cursors pointing to documents with an upper bound smaller than
the minimum score among the candidate documents are candidates for
a move. To find the upper bound for a document, the page evaluation
device 130 assumes that all cursors that are before the current one
will hit this document (i.e., the document contains all those terms
represented by cursors before or at that document). It has been
shown that the system 200 can use the page evaluation device 130
with any function that is monotonic with respect to the number of
matching terms in the document. It can also be easily shown that
some non-monotonic scoring functions can also be used as long as
the system 200 can find a mechanism to estimate the score upper
bounds.
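The cursor movement and upper-bound pruning described in the two preceding paragraphs can be sketched in the style of a WAND evaluation. This is a simplified sketch, not the disclosed implementation: it assumes non-negative query and ad weights (a monotonic dot-product score), uses one global upper bound per feature, and every name in it is hypothetical.

```python
import heapq
from bisect import bisect_left


def wand_top_k(query_weights, postings, k):
    """query_weights: featureId -> non-negative page-side weight.
    postings: featureId -> list of (adID, ad_weight) sorted by adID.
    Returns up to k (score, adID) pairs, best first."""
    if k <= 0:
        return []
    terms = [t for t in query_weights if postings.get(t)]
    ad_ids = {t: [ad for ad, _ in postings[t]] for t in terms}
    upper = {t: query_weights[t] * max(w for _, w in postings[t]) for t in terms}
    pos = {t: 0 for t in terms}
    end = float("inf")
    heap = []  # min-heap holding the current best candidates

    def current(t):
        return ad_ids[t][pos[t]] if pos[t] < len(ad_ids[t]) else end

    while True:
        order = sorted(terms, key=current)
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        # Pivot: first cursor at which the accumulated upper bounds
        # exceed the lowest score currently kept in the heap.
        bound, pivot = 0.0, None
        for t in order:
            if current(t) == end:
                break
            bound += upper[t]
            if bound > threshold:
                pivot = current(t)
                break
        if pivot is None:
            break  # no remaining ad can enter the heap
        if current(order[0]) == pivot:
            # Every cursor before the pivot sits on the pivot ad: score it.
            score = 0.0
            for t in terms:
                if current(t) == pivot:
                    score += query_weights[t] * postings[t][pos[t]][1]
                    pos[t] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot))
        else:
            # Skip the lagging cursor forward to the pivot ad.
            t = order[0]
            pos[t] = bisect_left(ad_ids[t], pivot, pos[t])
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


example_postings = {
    0: [(1, 1.0), (3, 2.0), (7, 1.0)],
    1: [(3, 1.0), (5, 3.0)],
    2: [(1, 4.0), (7, 2.0), (9, 1.0)],
}
top_two = wand_top_k({0: 1.0, 1: 2.0, 2: 0.5}, example_postings, k=2)
```

In this example, ads 7 and 9 are never fully scored: once the heap holds ads 5 and 3, the combined upper bound of the remaining cursors cannot exceed the lowest kept score.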
[0054] One family of such functions is a set of functions where a
fixed subset of the features (known a priori) always decreases the
score. In such cases, the upper bound estimates just assume that
these features do not appear in the ad. An example of such a function
is a cosine similarity where some of the query page coefficients
are negative. The scoring function proposed in this invention might
have such coefficients and fits well within the framework of the
page evaluation device 130.
[0055] The system 200 incorporates the logistic-regression based
model, which is an important feature of the present invention. The
system 200 modifies the scoring Equation 5 to exclude the page
effect and uses the modified equation as the scoring formula for the
page evaluation device 130 (e.g., WAND). The click feedback device 114
uses M.sub.a(r) of Equation 2 during indexing (i.e., sorting) to
calculate a static score for each individual ad. The ad
identification assignment device 118 uses this score to assign an
adID to the ads in decreasing ad score order. This scoring allows
for estimating upper bounds of the ads that are skipped by using
the score of the ad pointed by the preceding cursor in the sorted
cursor list. The ad indexing device 121 then writes the indexed ads to
the ads index database 122.
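The adID assignment in decreasing static-score order can be sketched as follows. This is a minimal sketch with hypothetical names and made-up scores; the static score is taken as an input, since its computation from M.sub.a(r) is defined elsewhere in the disclosure.

```python
def assign_ad_ids(static_scores):
    """static_scores: dict mapping an ad key to its static score.

    Returns a dict mapping each ad key to a numeric adID, with adID 1
    given to the highest-scoring ad. Ties are broken by ad key so the
    assignment is deterministic.
    """
    ranked = sorted(static_scores, key=lambda ad: (-static_scores[ad], ad))
    return {ad: ad_id for ad_id, ad in enumerate(ranked, start=1)}


ad_id_map = assign_ad_ids({"ad_a": 0.2, "ad_b": 0.9, "ad_c": 0.5})
```

With this ordering, any ad with a larger adID has a static score no greater than that of any ad with a smaller adID, which is what allows the score at a preceding cursor to bound the scores of skipped ads.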
[0056] After the relevance device 124 parses the page and extracts
the features along with their tf-idf scores, the page feature
re-weighting device 128 applies the re-weighting based on the I.sub.w
of Equation 4. The click feedback device 114 does not use M.sub.p(r) of Equation 1
in the ad selection. Rather, the click probability calculation
device 132 may use M.sub.p(r) of Equation 1 to adjust the final
scores to calculate the probabilities according to Equation 5.
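The reason the page effect can be left out of ad selection and applied afterwards can be illustrated with a short sketch. The names and numeric values are hypothetical, and the sketch assumes a uniform prior and a page effect that is the same for every ad scored against a given page.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def final_click_probability(ad_side_score, page_effect, prior_q):
    # Assumption: the prior log-odds and the page main effect are added
    # back after retrieval, because neither depends on the ad being scored.
    return sigmoid(logit(prior_q) + page_effect + ad_side_score)


ad_side_scores = {"ad_a": 1.2, "ad_b": -0.4, "ad_c": 0.3}  # made-up values
page_effect = -0.7                                          # made-up value
probabilities = {ad: final_click_probability(score, page_effect, prior_q=0.02)
                 for ad, score in ad_side_scores.items()}
rank_by_score = sorted(ad_side_scores, key=ad_side_scores.get, reverse=True)
rank_by_probability = sorted(probabilities, key=probabilities.get, reverse=True)
```

Adding a per-page constant and applying the increasing inverse-logit function leaves the ad ranking unchanged, so the page effect is only needed when calibrated click probabilities are required.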
Method Outline
[0057] FIG. 3 is a flowchart of a method 300 of indexing ads in
order to provide better web ad matching, in accordance with an
embodiment of the present invention. The method 300 starts in step
302 where the system receives ads that were clicked at a consumer
computer. The click feedback device 114 of FIG. 2 may be configured
to carry out this step 302. The method 300 then moves to step 304
where the system extracts ad features from the ads. The ad feature
extraction device 116 of FIG. 2 may be configured to carry out this
step 304. Next, in step 306, the system represents each feature by
a unique featureID and sorts the resulting data file by <adID,
featureId>. The ad identification assignment device 118 of FIG.
2 may be configured to carry out this step 306. The method 300 then
proceeds to step 308 where the system inverts the data file to sort
the data file by <featureId, adID> as a key. The ad inversion
sort device 120 of FIG. 2 may be configured to carry out this step
308. Then, in step 310, the system writes the inverted data file
into an ads index database. The ad indexing device 121 of FIG. 2
may be configured to carry out this step 310. The sorting and
indexing in the method 300 include use of logistic regression
according to M.sub.a(r) of Equation 2. The method 300 is then at an
end.
[0058] FIG. 4 is a flowchart of a method 400 for comparing query
pages to indexed ads in order to provide better web ad matching, in
accordance with an embodiment of the present invention. The method
400 starts in step 402 where the system receives a query page from
a consumer computer. The relevance device 124 of FIG. 2 may be
configured to carry out this step 402. The method 400 then moves to
step 404 where the system extracts features from the query page.
The page feature extraction device 126 of FIG. 2 may be configured
to carry out this step 404. Next, in step 406, the system
re-weights the query page. The page feature re-weighting device 128
of FIG. 2 may be configured to carry out this step 406. The
re-weighting is based on I.sub.w of Equation 4. The method 400 then
proceeds to step 408 where the system evaluates the query page in
light of each ad in order to score each ad and pick substantially
best ad matches from the ads index database written in the method
300 of FIG. 3. The page evaluation device 130 of FIG. 2 may be
configured to carry out this step 408. Note that the system
computes a score for almost every (page, ad) pair and then uses this
score to judge which ads are the best for the given page.
Substantially, the only time the system does not compute such
scores is when the system evaluates that the score will not be high
enough for a particular ad to be among the substantially best ads.
It is not the case that the system computes scores only for the
substantially best ad matches. Then, in step 410, the system
returns the substantially best ad match(es) to the consumer
computer. The relevance device 124 of FIG. 2 may be configured to
carry out this step 410. The method 400 is then at an end.
[0059] The method 400 may involve an optional step where the system
calculates a final score for each of the substantially best ad
matches. The click probability calculation device 132 of FIG. 2 may
be configured to carry out this optional step. This final static
scoring involves use of M.sub.p(r) of Equation 1 to adjust the
final scores to calculate the probabilities according to Equation
5.
[0060] The top scoring ads in the top ads database may be used
later during runtime of a query. Thus, better-matching ads can be
obtained for the query.
Computer Readable Medium Implementation
[0061] Portions of the present invention may be conveniently
implemented using a conventional general purpose or a specialized
digital computer or microprocessor programmed according to the
teachings of the present disclosure, as will be apparent to those
skilled in the computer art.
[0062] Appropriate software coding can readily be prepared by
skilled programmers based on the teachings of the present
disclosure, as will be apparent to those skilled in the software
art. The invention may also be implemented by the preparation of
application-specific integrated circuits or by interconnecting an
appropriate network of conventional component circuits, as will be
readily apparent to those skilled in the art.
[0063] The present invention includes a computer program product
which is a storage medium (media) having instructions stored
thereon/in which can be used to control, or cause, a computer to
perform any of the processes of the present invention. The storage
medium can include, but is not limited to, any type of disk
including floppy disks, mini disks (MD's), optical disks, DVDs,
CD-ROMs, micro-drives, and magneto-optical disks, ROMs, RAMs,
EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices (including
flash cards), magnetic or optical cards, nanosystems (including
molecular memory ICs), RAID devices, remote data
storage/archive/warehousing, or any type of media or device
suitable for storing instructions and/or data.
[0064] Stored on any one of the computer readable media,
the present invention includes software for controlling both the
hardware of the general purpose/specialized computer or
microprocessor, and for enabling the computer or microprocessor to
interact with a human consumer or other mechanism utilizing the
results of the present invention. Such software may include, but is
not limited to, device drivers, operating systems, and consumer
applications. Ultimately, such computer readable media further
includes software for performing the present invention, as
described above.
[0065] Included in the programming (software) of the
general/specialized computer or microprocessor are software modules
for implementing the teachings of the present invention, including
without limitation receiving a query page, extracting features from
the query page, re-weighting the query page, evaluating the query
page to obtain substantially best ad matches of the indexed ads,
and calculating a final score for each ad in order to pick
substantially best ad matches, according to processes of the
present invention.
Advantages
[0066] The system of the present invention provides a new model to
combine relevance with click feedback for a contextual advertising
system. The model is based on a logistic regression and allows for
a large number of granular features. The key feature of the
modeling approach is the ability to model interactions that exist
among words between page and ad regions in a way that is suitable
for efficient evaluation over inverted indexes. In fact, the system
employs a multiplicative factorization to model the interaction
effects for several (page, ad) regions in a parsimonious way that
facilitates fast look-up of ads at query runtime. Large-scale
experiments have proven the advantage of combining relevance
with click feedback. In fact, experiments have achieved a 25% lift
in precision for a recall value of 10% relative to a pure relevance
based model.
[0067] In the foregoing specification, the invention has been
described with reference to specific embodiments thereof. It will,
however, be evident that various modifications and changes may be
made thereto without departing from the broader spirit and scope of
the invention. The specification and drawings are, accordingly, to
be regarded in an illustrative rather than a restrictive sense.
* * * * *