U.S. patent application number 11/300919 was filed with the patent office on 2005-12-15 and published on 2007-06-21 as publication 20070143278 for context-based key phrase discovery and similarity measurement utilizing search engine query logs.
This patent application is currently assigned to Microsoft Corporation. The invention is credited to Ying Li, Abhinai Srivastava, and Lee Wang.
Application Number: 11/300919
Publication Number: 20070143278
Family ID: 38174951
Publication Date: 2007-06-21
United States Patent Application 20070143278
Kind Code: A1
Srivastava; Abhinai; et al.
June 21, 2007
Context-based key phrase discovery and similarity measurement
utilizing search engine query logs
Abstract
Usage context obtained from search query logs is leveraged to
facilitate in discovery and/or similarity determination of key
search phrases. A key phrase extraction process extracts key
phrases from raw search query logs and breaks individual queries
into a vector of the key phrases. A Similarity Graph generation
process then generates a Similarity Graph from the output of the
key phrase extraction process. Information relating to the
similarity levels between two key phrases can be employed to
restrict a search space for tasks such as, for example, online
keyword auctions and the like. Thus, instances can be employed to
find frequent misspellings of a given keyword, keyword/acronym
pairs, key phrases with similar intention, and/or keywords which
are semantically related and the like.
Inventors: Srivastava; Abhinai; (Redmond, WA); Wang; Lee; (Kirkland, WA); Li; Ying; (Bellevue, WA)
Correspondence Address: AMIN, TUROCY & CALVIN, LLP, 24th Floor, National City Center, 1900 East Ninth Street, Cleveland, OH 44114, US
Assignee: Microsoft Corporation, Redmond, WA
Family ID: 38174951
Appl. No.: 11/300919
Filed: December 15, 2005
Current U.S. Class: 1/1; 707/999.005
Current CPC Class: Y10S 707/99933 (2013-01-01); Y10S 707/99936 (2013-01-01); Y10S 707/99934 (2013-01-01); G06Q 30/02 (2013-01-01)
Class at Publication: 707/005
International Class: G06F 17/30 (2006-01-01)
Claims
1. A system that facilitates key phrase processing, comprising: a
component that obtains data from at least one search query log; and
an extraction component that extracts key phrases from the search
query log data and breaks individual queries into key phrase
vectors.
2. The system of claim 1, the extraction component employs noise
filtering on the search query log data to remove uniform resource
locator (URL) search queries.
3. The system of claim 1, the extraction component employs low
frequency word filtering to remove low occurrence search words from
the search query log data.
4. The system of claim 1, the extraction component generates key
phrase candidates that have less than a pre-set length of N words
for each query, where N is an integer from one to infinity.
5. The system of claim 1, the extraction component determines query
breakup information based on, at least in part, a number of words
in a key phrase and a frequency associated with the key phrase.
6. The system of claim 1 further comprising: a graph generation
component that employs, at least in part, the key phrase vectors
from the key phrase extraction component to construct a Similarity
Graph that indicates similarity between key phrases.
7. The system of claim 6, the graph generation component provides a
Co-occurrence Graph for key phrases by utilizing, at least in part,
query breakup information.
8. The system of claim 7, the graph generation component provides a
noise filter for the Co-occurrence Graph that, at least in part,
prunes edges with a weight less than a first given threshold and/or
prunes nodes that have fewer than a second given threshold number of
edges.
9. The system of claim 8, the graph generation component generates
a Similarity Graph, determines the top E edges by edge weight for
each node, and removes edges except those that fall within the top E
edges of at least one node, where E is an integer from one to
infinity.
10. An advertisement purchasing process that employs, at least in
part, the system of claim 1.
11. A method for facilitating key phrase processing, comprising:
obtaining data from at least one search query log; extracting key
phrases from the search query log data; and breaking individual
queries into key phrase vectors to provide query breakup
information.
12. The method of claim 11 further comprising: removing uniform
resource locator (URL) search queries from the search query log
data to filter noise; eliminating low occurrence search words from
the search query log data to filter out low frequency words;
generating key phrase candidates that have less than a pre-set
length of N words for each query and counting their frequency,
where N is an integer from one to infinity; and determining query
breakup information based on, at least in part, a number of words
in a key phrase candidate and its associated frequency.
13. The method of claim 11 further comprising: removing URL queries
from the search query log data; counting frequencies of individual
words that occur in the search query log data; discarding words
with a frequency lower than a first pre-set threshold limit;
generating possible phrases up to a pre-set length of n words for
each search query, where n is an integer from one to infinity;
counting frequencies of phrases and discarding infrequent phrases
to create candidate key phrases; estimating a best break for each
search query; incrementing a real count of each constituent key
phrase of a best break search query by one; and providing the query
breakup information to facilitate in determining key phrase
similarities.
14. The method of claim 11 further comprising: constructing a
Similarity Graph utilizing, at least in part, the key phrase
vectors, the Similarity Graph indicating similarity between key
phrases.
15. The method of claim 14 further comprising: creating a
Co-occurrence Graph for key phrases by utilizing, at least in part,
query breakup information; pruning edges of the Co-occurrence Graph
with a weight less than a first given threshold; removing nodes of
the Co-occurrence Graph that have fewer than a second given
threshold number of edges;
and generating a Similarity Graph based on the Co-occurrence Graph,
determining the top E edges by edge weight for each node, and
removing edges except those that fall within the top E edges of at
least one node, where E is an integer from one to infinity.
16. The method of claim 14 further comprising: generating a key phrase
Co-occurrence Graph utilizing the query breakup information;
pruning edges with a weight less than a first threshold number from
the Co-occurrence Graph; pruning nodes and their associated edges
which have less than a second threshold number of edges from the
Co-occurrence Graph; determining top K edges for each node of the
Co-occurrence Graph, where K is an integer from one to infinity;
removing edges from the Co-occurrence Graph except for those that
fall into the top K of at least one node; creating a Similarity
Graph from remaining key phrase nodes of the Co-occurrence Graph;
determining edges for the Similarity Graph; determining top E edges
by edge weight for each node in the Similarity Graph, where E is an
integer from one to infinity; removing edges from the Similarity
Graph except those that fall into the top E edges of at least one
node; and outputting the Similarity Graph to facilitate
applications that utilize similarities between key phrases.
17. A method of auctioning online advertisements that employs, at
least in part, the method of claim 11.
18. The method of claim 14 further comprising: converting the
Similarity Graph into hash tables to facilitate in employing it in
substantially real-time processes.
19. A system that facilitates key phrase processing, comprising:
means for extracting key phrases from search query log data; means
for breaking individual queries into key phrase vectors; and means
for constructing a Similarity Graph utilizing, at least in part,
the key phrase vectors, the Similarity Graph indicating similarity
between key phrases.
20. A device employing the method of claim 11 comprising at least
one selected from the group consisting of a computer, a server, and
a handheld electronic device.
Description
BACKGROUND
[0001] Advertising in general is a key revenue source in just about
any commercial market or setting. To reach as many consumers as
possible, advertisements are traditionally presented via
billboards, television, radio, and print media such as newspapers
and magazines. However, with the Internet, advertisers have found a
new and perhaps less expensive medium for reaching vast numbers of
potential customers across a large and diverse geographic span.
Advertisements on the Internet can primarily be seen on web pages
or web sites as well as in pop-up windows when a particular site is
visited.
[0002] The Internet provides users with a mechanism for obtaining
information regarding any suitable subject matter. For example,
various web sites are dedicated to posting text, images, and video
relating to world, national, and local news. A user with knowledge
of a uniform resource locator (URL) associated with one of such web
sites can simply enter the URL into a web browser to be provided
with the web site and access content. Another conventional manner
of locating desired information from the Internet is through
utilization of a search engine. For instance, a user can enter a
word or series of words into a search field and initiate a search
(e.g., through depression of a button, one or more keystrokes,
voice commands, etc.). The search engine then utilizes search
algorithms to locate web sites related to the word or series of
words entered by the user into the search field, and the user can
then select one of the web sites returned by the search engine to
review related content.
[0003] Oftentimes, users who are searching for information will see
related advertisements and click on such advertisements to purchase
products, thereby creating business for that particular retailer.
Furthermore, the search engine is provided with additional revenue
by selling advertisement space for a period of time to a retailer
when a relevant term, such as, for example, the term "doggie," is
utilized as a search term. Thus, an individual who enters the term
"doggie" into a search engine may be interested in purchasing items
related to dogs--thus, it is beneficial for a company that sells
pet items to advertise to that user at the point in time that the
user is searching for a relevant term.
[0004] Conventionally, advertising space relating to search terms
provided to a search engine is bought or sold in an auction manner.
More specifically, a search engine can receive a query (from a
user) that includes one or more search terms that are of interest
to a plurality of buyers. The buyers can place bids with respect to
at least one of the search terms, and a buyer that corresponds to
the highest bid will have their advertisement displayed upon a
resulting page view. Bidding and selection of a bid can occur
within a matter of milliseconds, thereby not adversely affecting
usability of the search engine. Thus, two or more competing bidders
can bid against one another within a limited time frame until a
sale price of advertising space associated with one or more search
terms in the received query is determined. This bidding is often
accomplished by way of proxies (e.g., computer components) that are
programmed with a demand curve for specific search term(s). As
alluded to above, auctioning advertising space associated with
search terms is a substantial source of revenue for search engines,
and can further be a source of revenue for advertisers.
[0005] Because of the potential of a significant boost in revenue
from advertising with search terms, it is very likely that a
business will associate as many search terms and variations as
possible to their advertisements. For example, an advertiser of pet
items might submit a list of terms and variations for "doggie,"
such as "dog," "dogs," and "doggy." The intent of the advertiser is
to select all terms and variations that would likely be used by
users during a search. However, these lists of terms are often
manually composed and frequently omit terms/variations that might
increase sales for the advertiser. As an example, sometimes
different spellings of words become popular that would not normally
be included in the lists such as "dogz" or "doggee." Automatically
finding these terms and including them in associated advertising
terms could substantially improve sales for the advertiser and
revenue for a search engine provider.
SUMMARY
[0006] The following presents a simplified summary of the subject
matter in order to provide a basic understanding of some aspects of
subject matter embodiments. This summary is not an extensive
overview of the subject matter. It is not intended to identify
key/critical elements of the embodiments or to delineate the scope
of the subject matter. Its sole purpose is to present some concepts
of the subject matter in a simplified form as a prelude to the more
detailed description that is presented later.
[0007] The subject matter relates generally to online searching,
and more particularly to systems and methods for discovering and/or
determining similarity of search key phrases. Usage context
obtained from search query logs is leveraged to facilitate in
discovery and/or similarity determination of key search phrases. A
key phrase extraction process extracts key phrases from raw search
query logs and breaks individual queries into a vector of the key
phrases. A Similarity Graph generation process then generates a
Similarity Graph from the output of the key phrase extraction
process. Information relating to the similarity levels between two
key phrases can be employed to restrict a search space for tasks
such as, for example, online keyword auctions and the like. Thus,
instances can be employed to find frequent misspellings of a given
keyword, keyword/acronym pairs, key phrases with similar intention,
and/or keywords which are semantically related and the like.
[0008] To the accomplishment of the foregoing and related ends,
certain illustrative aspects of embodiments are described herein in
connection with the following description and the annexed drawings.
These aspects are indicative, however, of but a few of the various
ways in which the principles of the subject matter may be employed,
and the subject matter is intended to include all such aspects and
their equivalents. Other advantages and novel features of the
subject matter may become apparent from the following detailed
description when considered in conjunction with the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is a block diagram of a key phrase processing system
in accordance with an aspect of an embodiment.
[0010] FIG. 2 is another block diagram of a key phrase processing
system in accordance with an aspect of an embodiment.
[0011] FIG. 3 is yet another block diagram of a key phrase
processing system in accordance with an aspect of an
embodiment.
[0012] FIG. 4 is a block diagram of a key phrase processing system
utilized with an advertising component in accordance with an aspect
of an embodiment.
[0013] FIG. 5 is an overview example of a key phrase discovery and
similarity determination process in accordance with an aspect of an
embodiment.
[0014] FIG. 6 is an overview example of a key phrase extraction
process in accordance with an aspect of an embodiment.
[0015] FIG. 7 is an overview example of a Similarity Graph
generation process in accordance with an aspect of an
embodiment.
[0016] FIG. 8 is a flow diagram of a method of facilitating key
phrase discovery and similarity determination in accordance with an
aspect of an embodiment.
[0017] FIG. 9 is a flow diagram of a method of facilitating key
phrase discovery in accordance with an aspect of an embodiment.
[0018] FIG. 10 is a flow diagram of a method of facilitating key
phrase similarity determination in accordance with an aspect of an
embodiment.
[0019] FIG. 11 illustrates an example operating environment in
which an embodiment can function.
[0020] FIG. 12 illustrates another example operating environment in
which an embodiment can function.
DETAILED DESCRIPTION
[0021] The subject matter is now described with reference to the
drawings, wherein like reference numerals are used to refer to like
elements throughout. In the following description, for purposes of
explanation, numerous specific details are set forth in order to
provide a thorough understanding of the subject matter. It may be
evident, however, that subject matter embodiments may be practiced
without these specific details. In other instances, well-known
structures and devices are shown in block diagram form in order to
facilitate describing the embodiments.
[0022] As used in this application, the term "component" is
intended to refer to a computer-related entity, either hardware, a
combination of hardware and software, software, or software in
execution. For example, a component may be, but is not limited to
being, a process running on a processor, a processor, an object, an
executable, a thread of execution, a program, and/or a computer. By
way of illustration, both an application running on a server and
the server can be a computer component. One or more components may
reside within a process and/or thread of execution and a component
may be localized on one computer and/or distributed between two or
more computers.
[0023] During the process of bidding for a keyword in online
keyword auction systems for search engines, advertisers have to
supply a long list of mutations for the same keyword to maximize
their reach while retaining relevance. The absence of a system that
automatically makes such recommendations forces advertisers to
supply such a list manually, which is both cumbersome and
inefficient. Since the advertiser has no direct way of knowing the
relative frequency of the various possible keyword mutations, they
are highly likely to miss some of the important mutations.
Instances of the systems and methods herein discover key
phrases and/or measure their similarity by utilizing the usage
context information from search engine query logs. The information
of similarity levels between two key phrases can then be used to
narrow down the search space of several tasks in online keyword
auctions, like finding all the frequent misspellings of a given
keyword, finding the keyword/acronym pairs, finding key phrases
with similar intention, and/or finding keywords which are
semantically related and the like.
[0024] In FIG. 1, a block diagram of a key phrase processing system
100 in accordance with an aspect of an embodiment is shown. The key
phrase processing system 100 is comprised of a key phrase
processing component 102 that receives an input 104 and provides an
output 106. The input 104 is generally comprised of search query
log information. This type of data is typically compiled when users
search for things of interest on a network such as the Internet
and/or an intranet. The logs can contain search terms and/or other
information associated with a search such as, for example, time
when the search was executed, number of hits, and/or user
identification and the like. The key phrase processing component
102 utilizes textual strings of queries in the logs to provide the
output 106. A number of "hits" or times the search query was
entered can also be utilized by the key phrase processing component
102. The output 106 can be comprised of, for example, a key phrase
list, query breakup data and/or a Similarity Graph (described
infra) and the like. Thus, the key phrase processing component 102
can be employed to facilitate in extracting key phrases and/or
determine similarities between the key phrases based on the input
104. Similarities between key phrases can be utilized in
applications such as, for example, advertising systems where an
association of one search key term to another can be
invaluable.
[0025] Looking at FIG. 2, another block diagram of a key phrase
processing system 200 in accordance with an aspect of an embodiment
is depicted. The key phrase processing system 200 is comprised of a
key phrase processing component 202 that receives query log data
204 and provides query breakup data 206. In other instances a key
phrase list can also be provided (not illustrated). The key phrase
processing component 202 is comprised of a receiving component 208
and a key phrase extraction component 210. The receiving component
208 obtains query log data 204 from a network associated data
source such as, for example, a local network (e.g., intranet) data
source and/or a global network (e.g., the Internet) data source and
the like. The receiving component 208 can also provide basic
pre-filtering of the raw data from the query log data 204 if
required by the key phrase extraction component 210. For example,
the receiving component 208 can re-format data and/or filter data
based on a particular time period, a particular network source, a
particular location, and/or a particular amount of users and the
like. The receiving component 208 can also be co-located with a
data source. The key phrase extraction component 210 receives the
query log data 204 from the receiving component 208 and extracts
key phrases. The extraction process is described in detail infra.
The key phrase extraction component 210 can also directly receive
the query log data 204 for processing. The extracted key phrases
are then utilized to provide the query breakup data 206. The query
breakup data 206 is typically a data file that is employed to
determine Similarity Graphs (see infra) for the extracted key
phrases.
[0026] Turning to FIG. 3, yet another block diagram of a key phrase
processing system 300 in accordance with an aspect of an embodiment
is illustrated. The key phrase processing system 300 is comprised
of a key phrase processing component 302 that receives query log
data 304 and provides Similarity Graph 306. The key phrase
processing component 302 is comprised of a key phrase extraction
component 308 and a Similarity Graph generation component 310. The
key phrase extraction component 308 obtains query log data 304 from
a network associated data source such as, for example, a local
network data source and/or a global network data source and the
like. The key phrase extraction component 308 extracts key phrases
from the query log data 304. The extracted key phrases are then
utilized to provide query breakup data to the Similarity Graph
generation component 310. The Similarity Graph generation component
310 processes the query breakup data to generate the Similarity
Graph 306. Similarity Graph generation is described in detail
infra.
[0027] Moving on to FIG. 4, a block diagram of a key phrase
processing system 400 utilized with an advertising component 406 in
accordance with an aspect of an embodiment is shown. The key phrase
processing system 400 is comprised of a key phrase processing
component 402 that receives query log data 404 and interacts with
advertisement component 406 which provides advertising related
items 408 for advertisers. In this instance, the key phrase
processing component 402 generates a Similarity Graph from the
query log data 404 and provides this to the advertisement component
406. This allows the advertisement component 406 to generate
advertising related items 408. The advertising related items 408
can include, for example, frequent misspellings of a given keyword,
keyword/acronym pairs, key phrases with similar intention, and/or
keywords which are semantically related and the like. This
substantially increases the performance of the advertisement
component 406 and facilitates in automatically generating terms for
advertisers, eliminating the need to manually track related
advertising search terms.
[0028] This is contrary to the current process of bidding for a
keyword in the online keyword auction systems for search engines in
which advertisers have to supply a long list of mutations for the
same keyword to maximize their reach while retaining relevance.
Various kinds of mutations include: (1) Misspellings/multiple
spellings--for example, an advertiser targeting users who searched
for "britney spears" must bid for the most common spellings of the
name, such as "britney spears," "brittany spears,"
etc.; (2) Acronyms--for example, advertisers targeting the keyword
"hewlett packard" must also bid on "hp"; (3) Similar
intention--for example, advertisers selling cheap air tickets must
bid on "cheap air tickets," "cheap air fares," "cheap airlines,"
"discount fares" and so on; and (4) Related keywords--for example,
advertisers selling pet supplies must bid for "cats," "dogs,"
"rottweiler" and so on.
[0029] Presently, the absence of a process that automatically makes
such recommendations forces advertisers to supply such a list
manually. This is both cumbersome and inefficient. Since the
advertiser has no direct way of knowing the relative frequency of
the various possible keyword mutations, they are highly likely to
miss some of the important mutations. This manual and often
incomplete provision of such keyword lists results in loss of
customers for the advertiser and loss of revenues for search
engines.
[0030] While (3) and (4) above can only be solved by employing
instances of the systems and methods herein (to determine the
similarity of key phrases in a document corpus such as search
engine query logs), there exist algorithms that can solve (1) and
(2) without using a similarity measure. However, applying such
algorithms over the scope of entire query logs is computationally
burdensome. Instances of the
systems and methods herein can provide a mechanism for determining
similarity between key phrases using usage context information
(e.g., information apart from a focus term of a search) in search
query logs. Thus, key phrases can be found which have a similar
intention and/or are related conceptually by looking at the
similarity of key phrase patterns around them. Moreover, the scope
of applying existing algorithms for solving (1) and (2) above can
be substantially reduced by limiting the search space to only those
key phrases which are similar to the given key phrase. This makes
the algorithms computationally tractable and also provides higher
accuracy for the final results.
[0031] First, a process is utilized to discover key phrases that
are statistically sound from raw query logs. This facilitates in:
(1) breaking down individual queries into a vector of key phrases;
(2) removing the associated noise while capturing the usage context
of a key phrase in a given query; and (3) capturing the
statistically most significant key phrases that are used by users
by the common patterns in which they framed search queries.
Second, a process is utilized to take a list of key phrase
segmented queries as input and return a Similarity Graph as output.
The Similarity Graph is a graph with the key phrases as its nodes.
Two nodes are joined with an edge if the similarity between them is
greater than a given threshold. The edge weight is the similarity
value between the two key phrases. This value ranges between "0" and
"1": a value of "0" represents complete dissimilarity, while a value
of "1" represents complete similarity.
[0032] In FIG. 5, an overview example 500 of a key phrase discovery
and similarity determination process in accordance with an aspect
of an embodiment is illustrated. If a process is treated as a black
box 502, an input 504, for example, is a list of queries from raw
query logs and an output 506 is a Similarity Graph as described
above. An overall process can generally employ, for example, one or
both of two processes, namely (1) Key-phrase extraction--a process
to extract key phrases from raw logs and break the individual
queries into a vector of these key phrases and/or (2) Similarity
Graph generation--a process to generate a Similarity Graph from an
output of the key phrase extraction process.
[0033] Turning to FIG. 6, an overview example of a key phrase
extraction process 600 in accordance with an aspect of an
embodiment is shown. The key phrase extraction process 600 is
generally comprised of the following passes on search query logs:
[0034] Noise Filtering: This pass includes, but is not limited to, the following: First, the query logs are passed through a URL filter, which filters out queries that happen to be URLs. This step is important for noise reduction because roughly 15% of search engine log entries are URLs. Second, non-alphanumeric characters, except punctuation marks, are omitted from the queries. Third, queries containing valid patterns of punctuation marks such as ".", ",", "?", quotes, and the like are broken into multiple parts at the punctuation boundaries.
[0035] Low-frequency word filtering: In this pass, frequencies of individual words that occur in the entire query logs are determined. At the end of this pass, words with a frequency lower than a pre-set threshold limit are discarded. This pass eliminates the generation of phrases containing infrequent words in the next step; typically, if a word is infrequent, then a phrase which contains that word is likely infrequent as well.
[0036] Key-phrase candidate generation: In this pass, possible phrases up to a pre-set length of N words are generated for each query, where N is an integer from one to infinity. Typically, a phrase that contains an infrequent word, begins with a stop-word, ends with a stop-word, or appears in a pre-compiled list of non-standalone key phrases is not generated. At the end of the pass, frequencies of phrases are counted and infrequent phrases are discarded. The remaining list of frequent phrases is called a "key phrase candidate list."
[0037] Key-phrase determination: For each query, the best break is estimated by a scoring function that assigns the score of a break as the sum of (n-1)×frequency+1 over each constituent key phrase, where n is the number of words in the given key phrase and can be an integer from one to infinity. Once the best break is determined, the real count of each constituent key phrase of the best query break is incremented by 1. This pass outputs the query breakup to a file for later use in generating a Co-occurrence Graph. One can make an additional pass through the list of key phrases generated in the above step and discard the key phrases with a real frequency below a certain threshold when the count of obtained key phrases exceeds the maximum that is needed.
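The best-break scoring described in the key-phrase determination pass can be sketched as follows. This is an illustrative reconstruction, not the patent's actual implementation; the candidate phrases and their frequencies are invented for the example.

```python
# Illustrative sketch of the best-break scoring pass. The candidate
# key phrase list and frequencies below are invented for the example.

# Assumed key phrase candidate list with counted frequencies.
CANDIDATES = {
    "cheap": 40, "air": 55, "tickets": 30,
    "cheap air": 25, "air tickets": 35, "cheap air tickets": 20,
}

def phrase_score(phrase):
    """Score of one constituent key phrase: (n - 1) x frequency + 1,
    where n is the number of words in the phrase."""
    n = len(phrase.split())
    return (n - 1) * CANDIDATES.get(phrase, 0) + 1

def best_break(words):
    """Return (score, segmentation) maximizing the sum of the
    constituent phrase scores over all breaks of the word list."""
    if not words:
        return 0, []
    best_score, best_seg = float("-inf"), []
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        # Multi-word pieces must come from the candidate list; single
        # words are always allowed as a fallback break.
        if i > 1 and head not in CANDIDATES:
            continue
        tail_score, tail_seg = best_break(words[i:])
        total = phrase_score(head) + tail_score
        if total > best_score:
            best_score, best_seg = total, [head] + tail_seg
    return best_score, best_seg

score, segmentation = best_break("cheap air tickets".split())
# The 3-word phrase scores (3 - 1) * 20 + 1 = 41 and wins the break.
```

Note how the (n-1)×frequency+1 score rewards longer frequent phrases: the single break "cheap air tickets" outscores any split into shorter pieces.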
[0038] Looking at FIG. 7, an overview example of a Similarity Graph
generation process 700 in accordance with an aspect of an
embodiment is depicted. The Similarity Graph generation process 700
is typically comprised of the following: [0039] Co-occurrence Graph
generation: Using the query breakup file generated in a key phrase
extraction process, a key phrase Co-occurrence Graph is generated.
A Co-occurrence Graph is a graph with key phrases as nodes and edge
weights representing the number of times two key phrases are part
of the same query. For example, if a breakup of a query had three
key phrases, namely, a, b, and c then the weights of the following
edges are incremented by 1: {a,b}, {a,c} and {b,c}. [0040]
Co-occurrence Graph pruning: Once the Co-occurrence Graph has been
generated, noise is removed by pruning edges with a weight less
than a certain threshold. Next, nodes which have less than a
certain threshold number of edges are pruned. Edges associated with
these nodes are also removed. Further, the top K edges for each
node are determined, where K is an integer from one to infinity.
Edges, except those falling into the top K of at least 1 node, are
then removed from the graph. [0041] Similarity Graph creation: A
new graph called the Similarity Graph is then created. The set of
nodes of this graph is the key phrases which remain as nodes in the
Co-occurrence Graph after Co-occurrence Graph pruning. [0042]
Similarity Graph edge computation: For each pair {n1, n2}
of nodes in the Similarity Graph, an edge {n1, n2} is
created if and only if the similarity value S(n1, n2) for
the two nodes in the Co-occurrence Graph is greater than a
threshold T. The weight of the edge {n1, n2} is
S(n1, n2). The similarity value S(n1, n2) is
defined as the cosine distance between the vectors {e1(n1),
e2(n1), . . . } and {e1(n2), e2(n2), . . . },
where e1(n1), e2(n1), . . . are the weights of the edges connecting
node n1 in the Co-occurrence Graph and e1(n2),
e2(n2), . . . are the weights of the edges connecting node n2 in the
Co-occurrence Graph. The cosine distance between two vectors V1
and V2 is computed as follows:
(V1 · V2)/(|V1| × |V2|). A total of approximately C(n, 2)
distance computations are required at this stage. [0043] Similarity
Graph edge pruning: The top E edges by edge weight for each node in
the Similarity Graph are then determined, where E is an integer
from one to infinity. The edges, except those falling in the top E
edges of at least one node, are removed. Typically, the value of E
is approximately 100. [0044] Output: The Similarity Graph
generated above is then output.
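The top-K (and top-E) edge-pruning rule used above can be sketched, for illustration only, as follows; the dict-of-dicts graph representation and the function name are assumptions of this sketch.

```python
# Sketch of the edge-pruning rule: keep an edge only if it ranks in
# the top k by weight for at least one of its two endpoints.

def prune_to_top_k(graph, k):
    """graph: {node: {neighbor: weight}}, assumed symmetric.
    Returns a pruned copy containing only edges that fall into
    some endpoint's top-k list."""
    keep = set()
    for node, nbrs in graph.items():
        # Rank this node's neighbors by descending edge weight.
        top = sorted(nbrs, key=nbrs.get, reverse=True)[:k]
        for n in top:
            keep.add(frozenset((node, n)))
    pruned = {}
    for node, nbrs in graph.items():
        kept = {n: w for n, w in nbrs.items()
                if frozenset((node, n)) in keep}
        if kept:
            pruned[node] = kept
    return pruned

g = {"a": {"b": 5, "c": 1},
     "b": {"a": 5, "c": 4},
     "c": {"a": 1, "b": 4}}
print(prune_to_top_k(g, 1))
```

With k = 1, the edge {a, c} is removed because it is not the heaviest edge of either a or c, while {a, b} and {b, c} survive.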
[0045] The Similarity Graph can be stored in a hash table data
structure for very quick lookups of key phrases that have a similar
usage context as the given key phrase. The keys of such a hash
table are the key phrases and the values are a list of key phrases
which are neighbors of the hash key in the Similarity Graph. The
main parameter to control the size of this graph is the minimum
threshold value for frequent key phrases in the key phrase
extraction process. The size of the Similarity Graph is roughly
proportional to the coverage of key phrases. Hence, this
parameter can be adjusted to suit a given application and/or
circumstances.
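A minimal sketch of the hash-table storage described above follows; the edge data and function name are illustrative assumptions.

```python
# Sketch of storing the Similarity Graph as a hash table keyed by
# key phrase, for quick lookup of phrases with similar usage context.

def to_lookup_table(similarity_edges):
    """similarity_edges: iterable of (phrase_a, phrase_b, weight).
    Returns {phrase: [neighbor phrases, most similar first]}."""
    table = {}
    for a, b, w in similarity_edges:
        table.setdefault(a, []).append((w, b))
        table.setdefault(b, []).append((w, a))
    # Keep only the neighbor lists, sorted by descending similarity.
    return {k: [p for _, p in sorted(v, reverse=True)]
            for k, v in table.items()}

edges = [("laptop", "notebook", 0.9), ("laptop", "netbook", 0.6)]
table = to_lookup_table(edges)
print(table["laptop"])  # neighbors of "laptop", most similar first
```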
[0046] In view of the exemplary systems shown and described above,
methodologies that may be implemented in accordance with the
embodiments will be better appreciated with reference to the flow
charts of FIGS. 8-10. While, for purposes of simplicity of
explanation, the methodologies are shown and described as a series
of blocks, it is to be understood and appreciated that the
embodiments are not limited by the order of the blocks, as some
blocks may, in accordance with an embodiment, occur in different
orders and/or concurrently with other blocks from that shown and
described herein. Moreover, not all illustrated blocks may be
required to implement the methodologies in accordance with the
embodiments.
[0047] The embodiments may be described in the general context of
computer-executable instructions, such as program modules, executed
by one or more components. Generally, program modules include
routines, programs, objects, data structures, etc., that perform
particular tasks or implement particular abstract data types.
Typically, the functionality of the program modules may be combined
or distributed as desired in various instances of the
embodiments.
[0048] In FIG. 8, a flow diagram of a method 800 of facilitating
key phrase discovery and similarity determination in accordance
with an aspect of an embodiment is shown. The method 800 starts 802
by obtaining search query log data 804. This type of data is
typically compiled when users search for things of interest on a
network such as the Internet and/or an intranet. The logs can
contain search terms and/or other information associated with a
search such as, for example, time when the search was executed,
number of hits, and/or user identification and the like. Key
phrases from the search query log data are then extracted 806. The
extraction processes that can be employed are described in detail
infra and supra. A Similarity Graph is then generated utilizing the
extracted key phrases 808. The Similarity Graph is then output 810
for utilization with applications that require key phrase
similarity information, ending the flow 812. Similarities between
key phrases can be utilized in applications such as, for example,
advertising systems where an association of one search key term to
another can be invaluable and/or other applications noted supra and
the like. Similarity Graphs can be stored as hash tables to reduce
their size and facilitate in real-time processes.
[0049] Looking at FIG. 9, a flow diagram of a method 900 of
facilitating key phrase discovery in accordance with an aspect of
an embodiment is depicted. The method 900 starts 902 by obtaining
search query log data 904. The logs can contain search terms and/or
other information associated with a search such as, for example,
time when the search was executed, number of hits, and/or user
identification and the like. URL queries are then removed from the
search query log data 906. The query logs are typically passed
through a URL filter which filters out queries which happen to be a
URL. In other instances, additional filtering can occur such as,
for example, removal of non-alphanumeric characters, except
punctuation marks. Queries containing valid patterns of punctuation
marks like "." "," "?" and quotes and the like can also be broken
down into multiple parts at a boundary of punctuation.
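For illustration, the URL filtering and punctuation splitting described above might look like the following sketch; the exact patterns are assumptions, not those of the disclosed filter.

```python
import re

# Sketch of query cleanup: drop queries that are just a URL, then
# break the remainder into parts at punctuation boundaries.

URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

def filter_and_split(queries):
    parts = []
    for q in queries:
        q = q.strip()
        if URL_RE.match(q):
            continue  # filter out queries that happen to be a URL
        # Break the query into multiple parts at punctuation marks
        # such as ".", ",", "?" and quotes.
        for part in re.split(r'[.,?!"]+', q):
            part = part.strip()
            if part:
                parts.append(part)
    return parts

print(filter_and_split(["www.example.com",
                        "cheap flights, hotel deals"]))
```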
[0050] Frequencies of individual words that occur in the search
query log data are then counted 908. Words with a frequency lower
than a pre-set threshold limit are discarded 910. This eliminates
the generation of key phrases containing infrequent words.
Typically, if a word is infrequent then a phrase which contains
this word is likely infrequent as well. Possible phrases up to a
pre-set length of "N" words are generated for each query 912, where
"N" is an integer from one to infinity. Generally, a phrase which
contains an infrequent word, a stop-word at the beginning, a
stop-word at the end, and/or a phrase that appears in a
pre-compiled list of non-standalone key phrases is not
generated.
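The phrase-generation rules above can be sketched, for illustration only, as follows; the word lists and function name are assumptions of this sketch.

```python
# Sketch of candidate-phrase generation for one query: emit every
# phrase of up to n_max words, skipping any phrase that contains an
# infrequent word or has a stop-word at either boundary.

def candidate_phrases(words, frequent_words, stop_words, n_max=3):
    out = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + n_max, len(words)) + 1):
            phrase = words[i:j]
            if any(w not in frequent_words for w in phrase):
                continue  # contains an infrequent word
            if phrase[0] in stop_words or phrase[-1] in stop_words:
                continue  # stop-word at the beginning or end
            out.append(" ".join(phrase))
    return out

frequent = {"map", "of", "seattle"}
print(candidate_phrases("map of seattle".split(), frequent, {"of"}))
```

Note that the stop-word "of" blocks phrases such as "map of" and "of seattle" but is allowed in the interior of "map of seattle".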
[0051] Frequencies of phrases are counted and infrequent phrases
are discarded, leaving "candidate key phrases" 914. A best break
for each search query is then estimated 916. For example, for each
query, the best break can be estimated by a scoring function that
assigns each break a score equal to the sum of
(n - 1) × frequency + 1 over its constituent key phrases. Here, n
is the number of words in the given key phrase and can be an
integer from one to infinity. A real
count of each constituent key phrase of a best break query is then
incremented by "1" 918. Query breakup data is then output 920 to
facilitate in applications that utilize query breakup information
such as, for example, a Co-occurrence Graph employed in
constructing Similarity Graphs and the like, ending the flow
922.
[0052] Turning to FIG. 10, a flow diagram of a method 1000 of
facilitating key phrase similarity determination in accordance with
an aspect of an embodiment is illustrated. The method 1000 starts
1002 by obtaining search query breakup data 1004. A key phrase
Co-occurrence Graph is then generated utilizing query breakup data
1006. The Co-occurrence Graph has key phrases as nodes and edge
weights representing the number of times two key phrases are part
of the same query. For example, if a breakup of a query had three
key phrases, namely, a, b, and c then the weights of the following
edges are incremented by 1: {a,b}, {a,c} and {b,c}. Edges with a
weight less than a certain threshold are pruned from the
Co-occurrence Graph 1008. Nodes (and associated edges) which have
less than a certain threshold number of edges are also pruned from
the Co-occurrence Graph 1010.
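Co-occurrence Graph generation with edge-weight pruning can be sketched, for illustration only, as follows; the threshold value and function name are assumptions.

```python
from collections import Counter
from itertools import combinations

# Sketch of Co-occurrence Graph generation: for every pair of key
# phrases appearing in the same query breakup, increment the edge
# weight by 1; then prune edges below a weight threshold.

def cooccurrence_graph(breakups, min_edge_weight=1):
    weights = Counter()
    for phrases in breakups:
        # e.g. a breakup [a, b, c] increments {a,b}, {a,c}, {b,c}
        for a, b in combinations(sorted(set(phrases)), 2):
            weights[(a, b)] += 1
    return {e: w for e, w in weights.items() if w >= min_edge_weight}

breakups = [["a", "b", "c"], ["a", "b"]]
print(cooccurrence_graph(breakups, min_edge_weight=2))
```

With the threshold set to 2, only the edge {a, b}, seen in both breakups, survives pruning.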
[0053] Top K edges for each node of the Co-occurrence Graph are
then determined 1012, where K is an integer from one to infinity.
Edges are removed from the Co-occurrence Graph except those that
fall into the top K of at least one node 1014. A Similarity Graph
is then created from the remaining key phrase nodes of the
Co-occurrence Graph 1016. The set of nodes of this graph is the key
phrases which remain as nodes in the Co-occurrence Graph after
Co-occurrence Graph pruning. Edges for the Similarity Graph are
then determined 1018. For each pair {n1, n2} of nodes in
the Similarity Graph, an edge {n1, n2} is created if and
only if the similarity value S(n1, n2) for the two nodes
in the Co-occurrence Graph is greater than a threshold T. The
weight of the edge {n1, n2} is S(n1, n2). The
similarity value S(n1, n2) is defined as the cosine
distance between the vectors {e1(n1), e2(n1), . . .
} and {e1(n2), e2(n2), . . . }, where
e1(n1), e2(n1), . . . are the weights of the edges connecting node
n1 in the Co-occurrence Graph and e1(n2),
e2(n2), . . . are the weights of the edges connecting node n2 in the
Co-occurrence Graph. The cosine distance between two vectors V1
and V2 is computed as follows:
(V1 · V2)/(|V1| × |V2|). A total of approximately C(n, 2)
distance computations are required at this stage.
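For illustration, the similarity computation above can be sketched as the cosine measure (V1 · V2)/(|V1| × |V2|) over each node's vector of co-occurrence edge weights; the sparse dict representation is an assumption of this sketch.

```python
import math

# Sketch of the similarity value S(n1, n2): each node is represented
# by its co-occurrence edge-weight vector, and the two vectors are
# compared by the cosine measure over their shared neighbors.

def cosine(v1, v2):
    """v1, v2: {neighbor: weight} edge-weight vectors taken from
    the Co-occurrence Graph."""
    dot = sum(w * v2[n] for n, w in v1.items() if n in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Two key phrases that co-occur with the same neighbors, with the
# same weights, have identical usage context:
v_laptop = {"cheap": 3.0, "review": 4.0}
v_notebook = {"cheap": 3.0, "review": 4.0}
print(cosine(v_laptop, v_notebook))  # identical contexts give 1.0
```

This is why, for example, a keyword and its frequent misspelling tend to receive a high similarity value: they co-occur with largely the same neighboring phrases.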
[0054] Top E edges are then determined by edge weight for each node
in the Similarity Graph 1020, where E is an integer from one to
infinity. Edges from the Similarity Graph are then removed, except
those that fall into the top E edges of at least one node 1022. For
example, the value of E can be approximately 100. The Similarity
Graph is then output 1024 to facilitate applications that utilize
key phrase similarities such as keyword advertising auctions and
the like, ending the flow 1026.
[0055] In order to provide additional context for implementing
various aspects of the embodiments, FIG. 11 and the following
discussion is intended to provide a brief, general description of a
suitable computing environment 1100 in which the various aspects of
the embodiments can be performed. While the embodiments have been
described above in the general context of computer-executable
instructions of a computer program that runs on a local computer
and/or remote computer, those skilled in the art will recognize
that the embodiments can also be performed in combination with
other program modules. Generally, program modules include routines,
programs, components, data structures, etc., that perform
particular tasks and/or implement particular abstract data types.
Moreover, those skilled in the art will appreciate that the
inventive methods can be practiced with other computer system
configurations, including single-processor or multi-processor
computer systems, minicomputers, mainframe computers, as well as
personal computers, hand-held computing devices,
microprocessor-based and/or programmable consumer electronics, and
the like, each of which can operatively communicate with one or
more associated devices. The illustrated aspects of the embodiments
can also be practiced in distributed computing environments where
certain tasks are performed by remote processing devices that are
linked through a communications network. However, some, if not all,
aspects of the embodiments can be practiced on stand-alone
computers. In a distributed computing environment, program modules
can be located in local and/or remote memory storage devices.
[0056] With reference to FIG. 11, an exemplary system environment
1100 for performing the various aspects of the embodiments includes
a conventional computer 1102, including a processing unit 1104, a
system memory 1106, and a system bus 1108 that couples various
system components, including the system memory, to the processing
unit 1104. The processing unit 1104 can be any commercially
available or proprietary processor. In addition, the processing
unit can be implemented as a multi-processor formed of more than
one processor, such as processors connected in parallel.
[0057] The system bus 1108 can be any of several types of bus
structure including a memory bus or memory controller, a peripheral
bus, and a local bus using any of a variety of conventional bus
architectures such as PCI, VESA, Microchannel, ISA, and EISA, to
name a few. The system memory 1106 includes read only memory (ROM)
1110 and random access memory (RAM) 1112. A basic input/output
system (BIOS) 1114, containing the basic routines that help to
transfer information between elements within the computer 1102,
such as during start-up, is stored in ROM 1110.
[0058] The computer 1102 also can include, for example, a hard disk
drive 1116, a magnetic disk drive 1118, e.g., to read from or write
to a removable disk 1120, and an optical disk drive 1122, e.g., for
reading from or writing to a CD-ROM disk 1124 or other optical
media. The hard disk drive 1116, magnetic disk drive 1118, and
optical disk drive 1122 are connected to the system bus 1108 by a
hard disk drive interface 1126, a magnetic disk drive interface
1128, and an optical drive interface 1130, respectively. The drives
1116-1122 and their associated computer-readable media provide
nonvolatile storage of data, data structures, computer-executable
instructions, etc. for the computer 1102. Although the description
of computer-readable media above refers to a hard disk, a removable
magnetic disk and a CD, it should be appreciated by those skilled
in the art that other types of media which are readable by a
computer, such as magnetic cassettes, flash memory, digital video
disks, Bernoulli cartridges, and the like, can also be used in the
exemplary operating environment 1100, and further that any such
media can contain computer-executable instructions for performing
the methods of the embodiments.
[0059] A number of program modules can be stored in the drives
1116-1122 and RAM 1112, including an operating system 1132, one or
more application programs 1134, other program modules 1136, and
program data 1138. The operating system 1132 can be any suitable
operating system or combination of operating systems. By way of
example, the application programs 1134 and program modules 1136 can
include a key phrase processing scheme in accordance with an aspect
of an embodiment.
[0060] A user can enter commands and information into the computer
1102 through one or more user input devices, such as a keyboard
1140 and a pointing device (e.g., a mouse 1142). Other input
devices (not shown) can include a microphone, a joystick, a game
pad, a satellite dish, a wireless remote, a scanner, or the like.
These and other input devices are often connected to the processing
unit 1104 through a serial port interface 1144 that is coupled to
the system bus 1108, but can be connected by other interfaces, such
as a parallel port, a game port or a universal serial bus (USB). A
monitor 1146 or other type of display device is also connected to
the system bus 1108 via an interface, such as a video adapter 1148.
In addition to the monitor 1146, the computer 1102 can include
other peripheral output devices (not shown), such as speakers,
printers, etc.
[0061] It is to be appreciated that the computer 1102 can operate
in a networked environment using logical connections to one or more
remote computers 1160. The remote computer 1160 can be a
workstation, a server computer, a router, a peer device or other
common network node, and typically includes many or all of the
elements described relative to the computer 1102, although for
purposes of brevity, only a memory storage device 1162 is
illustrated in FIG. 11. The logical connections depicted in FIG. 11
can include a local area network (LAN) 1164 and a wide area network
(WAN) 1166. Such networking environments are commonplace in
offices, enterprise-wide computer networks, intranets and the
Internet.
[0062] When used in a LAN networking environment, for example, the
computer 1102 is connected to the local network 1164 through a
network interface or adapter 1168. When used in a WAN networking
environment, the computer 1102 typically includes a modem (e.g.,
telephone, DSL, cable, etc.) 1170, or is connected to a
communications server on the LAN, or has other means for
establishing communications over the WAN 1166, such as the
Internet. The modem 1170, which can be internal or external
relative to the computer 1102, is connected to the system bus 1108
via the serial port interface 1144. In a networked environment,
program modules (including application programs 1134) and/or
program data 1138 can be stored in the remote memory storage device
1162. It will be appreciated that the network connections shown are
exemplary and other means (e.g., wired or wireless) of establishing
a communications link between the computers 1102 and 1160 can be
used when carrying out an aspect of an embodiment.
[0063] In accordance with the practices of persons skilled in the
art of computer programming, the embodiments have been described
with reference to acts and symbolic representations of operations
that are performed by a computer, such as the computer 1102 or
remote computer 1160, unless otherwise indicated. Such acts and
operations are sometimes referred to as being computer-executed. It
will be appreciated that the acts and symbolically represented
operations include the manipulation by the processing unit 1104 of
electrical signals representing data bits which causes a resulting
transformation or reduction of the electrical signal
representation, and the maintenance of data bits at memory
locations in the memory system (including the system memory 1106,
hard drive 1116, floppy disks 1120, CD-ROM 1124, and remote memory
1162) to thereby reconfigure or otherwise alter the computer
system's operation, as well as other processing of signals. The
memory locations where such data bits are maintained are physical
locations that have particular electrical, magnetic, or optical
properties corresponding to the data bits.
[0064] FIG. 12 is another block diagram of a sample computing
environment 1200 with which embodiments can interact. The system
1200 further illustrates a system that includes one or more
client(s) 1202. The client(s) 1202 can be hardware and/or software
(e.g., threads, processes, computing devices). The system 1200 also
includes one or more server(s) 1204. The server(s) 1204 can also be
hardware and/or software (e.g., threads, processes, computing
devices). One possible communication between a client 1202 and a
server 1204 can be in the form of a data packet adapted to be
transmitted between two or more computer processes. The system 1200
includes a communication framework 1208 that can be employed to
facilitate communications between the client(s) 1202 and the
server(s) 1204. The client(s) 1202 are connected to one or more
client data store(s) 1210 that can be employed to store information
local to the client(s) 1202. Similarly, the server(s) 1204 are
connected to one or more server data store(s) 1206 that can be
employed to store information local to the server(s) 1204.
[0065] It is to be appreciated that the systems and/or methods of
the embodiments can be utilized in key phrase processing
facilitating computer components and non-computer related
components alike. Further, those skilled in the art will recognize
that the systems and/or methods of the embodiments are employable
in a vast array of electronic related technologies, including, but
not limited to, computers, servers and/or handheld electronic
devices, and the like.
[0066] What has been described above includes examples of the
embodiments. It is, of course, not possible to describe every
conceivable combination of components or methodologies for purposes
of describing the embodiments, but one of ordinary skill in the art
may recognize that many further combinations and permutations of
the embodiments are possible. Accordingly, the subject matter is
intended to embrace all such alterations, modifications and
variations that fall within the spirit and scope of the appended
claims. Furthermore, to the extent that the term "includes" is used
in either the detailed description or the claims, such term is
intended to be inclusive in a manner similar to the term
"comprising" as "comprising" is interpreted when employed as a
transitional word in a claim.
* * * * *