U.S. patent application number 12/653926 was filed with the patent office on 2010-04-29 for method and apparatus for identifying related searches in a database search system.
This patent application is currently assigned to Yahoo! Inc. Invention is credited to Bradley R. Haugaard, Phillip G. Rorex, Thomas A. Soulanille.
Application Number | 20100106706 12/653926 |
Family ID | 24302119 |
Filed Date | 2010-04-29 |
United States Patent
Application |
20100106706 |
Kind Code |
A1 |
Rorex; Phillip G. ; et
al. |
April 29, 2010 |
Method and apparatus for identifying related searches in a database
search system
Abstract
A method of generating a search result list also provides
related searches for use by a searcher. Search listings which
generate a match with a search request submitted by the searcher
are identified in a pay-for-placement database which includes a
plurality of search listings. Related search listings contained in
a related search database generated from the pay-for-placement
database are identified as relevant to the search request. A search
result list is returned to the searcher including the identified
search listings and one or more of the identified related search
listings.
Inventors: |
Rorex; Phillip G.;
(Stevenson Ranch, CA) ; Soulanille; Thomas A.;
(Pasadena, CA) ; Haugaard; Bradley R.; (Monrovia,
CA) |
Correspondence
Address: |
BRINKS HOFER GILSON & LIONE / YAHOO! OVERTURE
P.O. BOX 10395
CHICAGO
IL
60610
US
|
Assignee: |
Yahoo! Inc.
Sunnyvale
CA
|
Family ID: |
24302119 |
Appl. No.: |
12/653926 |
Filed: |
December 17, 2009 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
11092182 | Mar 29, 2005 | 7657555 |
12653926 | | |
09575894 | May 22, 2000 | 6876997 |
11092182 | | |
Current U.S.
Class: |
707/709 ;
707/706; 707/742; 707/E17.108; 707/E17.11 |
Current CPC
Class: |
G06Q 30/0277 20130101;
G06F 16/951 20190101; Y10S 707/99942 20130101; Y10S 707/99943
20130101; Y10S 707/99933 20130101; G06Q 40/04 20130101; G06Q
30/0275 20130101 |
Class at
Publication: |
707/709 ;
707/706; 707/742; 707/E17.108; 707/E17.11 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. A data processing method for a related searches database for a
database search system which includes a search database storing
information about a plurality of web pages searchable by a search
engine, the method comprising: at the database search system,
accessing the search database and retrieving from the search
database textual information from the plurality of web pages about
which information is stored in the search database; at the database
search system, omitting from the retrieved textual information
textual information from web pages which are determined to be
similar among the plurality of web pages to produce search listing
data; at the database search system, indexing the retrieved textual
information and the search listing data to create an inverted
index; and at the database search system, storing the inverted
index in the related searches database at the database search
system.
2. The data processing method of claim 1 wherein indexing
comprises: at the database search system, arranging data as a
plurality of rows in the related searches database, each row of the
plurality of rows including a key word along with all text from the
search database associated with the keyword.
3. The data processing method of claim 2 further comprising:
receiving a search request at the database search system from a
remotely located user device; at a search engine web server of the
database search system, locating in the search database information
matching the search request; at a related searches web server of
the database search system, using the key word to identify matching
related search listings in the related searches database, combining
the located information from the search database and the matching
related search listings in a search results web page; and
communicating the search results web page from the database search
system to the remotely located user device.
4. The data processing method of claim 1 wherein the search
database stores a plurality of search listings, the method further
comprising: creating one or more additional indexes using key
information associated with respective search listings of the
stored plurality of search listings; and storing the one or more
additional indexes in the related searches database.
5. The data processing method of claim 4 wherein creating the one
or more additional indexes comprises forming an index based on an
advertiser identification associated with the respective search
listings or a derived theme associated with the respective search
listings.
6. The data processing method of claim 1 wherein the search
database stores a plurality of search listings and wherein omitting
textual information from web pages which are determined to be
similar comprises: at the database search system, retrieving from
the search database uniform resource locators (URLs) included in
the plurality of search listings; at the database search system,
eliminating duplicate URLs from the retrieved URLs to create a set
of non-duplicate URLs; and at the database search system,
retrieving textual information stored at web pages identified by
URLs of the set of non-duplicate URLs.
7. The data processing method of claim 1 wherein the search
database stores a plurality of search listings and wherein omitting
textual information from web pages determined to be similar
comprises: using the search database, forming a list of uniform
resource locators (URLs) associated with internet web sites to be
accessed; removing duplicate URLs from the list; if a URL on the
list is similar to another URL on the list, crawling a
predetermined number of potentially duplicate URLs; comparing
bodies of the other URLs on the list and the potentially duplicate
URLs; if the body of the other URLs on the list is similar to the
body of the potentially duplicate URLs, suspending crawling of the
potentially duplicate URLs, and storing the body of the URL on the
list in the related searches database for subsequent search.
8. The data processing method of claim 7 further comprising:
comparing a selected URL with other URLs on the list; and
determining the URL is similar to the other URL on the list when
the URL has a predetermined text portion in common with the other
URL on the list.
9. The method of claim 7 wherein comparing bodies of the URL on the
list and the potentially duplicate URLs comprises: comparing text
from the URL on the list and text from one potentially duplicate
URL; and determining the URL on the list is similar to the one
potentially duplicate URL when the text from the URL on the list
and the text from the one potentially duplicate URL have a
predetermined text portion in common.
10. An apparatus for identifying related searches in a database
search system, the apparatus comprising: a search database storing
an ordered collection of search listing records, each search
listing record including at least a uniform resource locator (URL)
of an associated web page or document and descriptive text; a
search engine web server in data communication with the search
database; a related searches database storing an inverted index for
searching, the inverted index including a plurality of rows of
data, each row including a keyword and text from the search
database associated with the keyword, the index formed from all
text on web pages identified by a URL in the search listing records
of the search database after omitting text from similar web pages;
and a related searches web server in data communication with the
related searches database; wherein the related searches web server
responds to a search query received from a remote device at the
database search system by searching the related searches database
to provide suggested, related searches to the remote device.
11. The apparatus of claim 10 wherein the inverted index stored in
the related searches database comprises substantially all text on
web pages having non-duplicate URLs of associated web pages.
12. A method for identifying related searches in a database search
system, the method comprising: at the database search system,
retrieving from storage in a search database of the database search
system a plurality of search listings which are used to generate
search results in response to a search query from a remote device,
each search listing including a Uniform Resource Locator (URL) of
an associated web page or document and textual information; using
the URL, retrieving textual information at the database search
system from the associated web page or document for the plurality
of search listings; at the database search system, determining if
any web pages or documents associated with search listings in the
search database are similar; at the database search system,
omitting textual information from web pages or documents determined
to be similar; at the database search system, indexing the
plurality of search listings and the retrieved textual information
from non-similar web pages and documents to form an index; storing
the index in a related searches database of the database search
system; in response to a received search query, searching the index
for entries matching the received search query; and providing a list
of related search results to the remote device including at least
some of the matching entries.
13. The method of claim 12 wherein omitting the textual information
comprises: forming a list of URLs in the plurality of search
listings in the search database; and at the database search system,
deleting duplicate URLs from the list to create a list of
non-duplicate URLs; and at the database search system, retrieving
the textual information only from web pages or documents associated
with URLs on the list of non-duplicate URLs.
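The duplicate-URL step recited in claims 6 and 13 can be sketched as follows. This is a minimal illustration only: the `unique_urls` and `collect_text` helpers, the dictionary-shaped listings, and the caller-supplied `fetch` function are all assumptions, and only exact duplicates are handled (the similarity tests of claims 7 to 9 are not modeled).

```python
def unique_urls(listings):
    """Return the listings' URLs in first-seen order with duplicates removed."""
    seen = set()
    result = []
    for listing in listings:
        url = listing["url"]
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_text(listings, fetch):
    """Retrieve page text once per non-duplicate URL.

    `fetch` is a hypothetical caller-supplied function that returns the
    text stored at a URL; no network access is performed here.
    """
    return {url: fetch(url) for url in unique_urls(listings)}


# Hypothetical sample data standing in for the search database.
listings = [
    {"url": "http://example.com/a"},
    {"url": "http://example.com/b"},
    {"url": "http://example.com/a"},
]
fake_pages = {
    "http://example.com/a": "page a text",
    "http://example.com/b": "page b text",
}
texts = collect_text(listings, fake_pages.get)
```

Passing the fetch function in as a parameter keeps the sketch testable without a crawler.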
Description
RELATED APPLICATIONS
[0001] The present patent document is a continuation of application
Ser. No. 11/092,182, filed Mar. 29, 2005, pending, which is a
continuation of application Ser. No. 09/575,894, filed May 22,
2000, now U.S. Pat. No. 6,876,997, which applications are hereby
incorporated in their entirety herein by this reference.
COPYRIGHT NOTICE
[0002] A portion of the disclosure of this patent document contains
material which is subject to copyright protection. The copyright
owner has no objection to the facsimile reproduction by anyone of
the patent document or the patent disclosure, as it appears in the
Patent and Trademark Office patent file or records, but otherwise
reserves all copyright rights whatsoever.
REFERENCE TO APPENDIX ON CD ROM
[0003] A computer program listing appendix is included containing
computer program code listings on a CD-ROM pursuant to 37
C.F.R. 1.52(e) and is hereby incorporated by reference in its
entirety. The total number of compact discs is 1 (two duplicate
copies are filed herewith). Each compact disc includes a single
file entitled Source Code Appendix, which is 58,304 bytes in size
and was created on Jul. 7, 2005. The creation date of the compact
disc is Dec. 17, 2009.
BACKGROUND
[0004] The present invention relates generally to a method and
system for generating a search result list, for example, using an
Internet-based search engine. More particularly, the present
invention relates to a method and system for generating search
results from a pay for performance database and generating a list
of related searches from a related search database.
[0005] Search engines are commonly used to search the information
available on computer networks such as the World Wide Web to enable
users to locate information of interest that is stored within the
network. To use a search engine, a user or searcher typically
enters one or more search terms that the search engine uses to
generate a listing of information, such as web pages, that the
searcher is then able to access and utilize. The information
resulting from the search is commonly identified as a result of an
association that is established between the information and one or
more of the search terms entered by the user. Different search
engines use different techniques to associate information with
search terms and to identify related information. These search
engines also use different techniques to provide the identified
information to the user. Accordingly, the likelihood of information
being found as a result of a search varies depending upon the
search engine used to perform the search.
[0006] This uncertainty is of particular concern to web page
operators that make information available on the World Wide Web. In
this setting, there are often several web page operators or
advertisers that are competing for the same group of potential
viewers or customers. Accordingly, a web page's ability to be
identified as the result of a search is often important to the
success of a web page. Therefore, web page operators often seek to
increase the likelihood that their web page will be seen as the
result of a search.
[0007] One type of search engine that provides web page operators
with a more predictable method of being seen as the result of a
search is a "pay for performance" arrangement where web pages are
displayed based at least in part upon a monetary sum that the
advertiser or web page operator has agreed to pay to the search
engine operator. The web page operator agrees to pay an amount of
money, commonly referred to as the bid amount, in exchange for a
particular position in a set of search results that is generated in
response to a user's input of a search term. A higher bid amount
will result in a more prominent placement in a set of search
results. Thus, a web page operator may attempt to place high bids
on one or more search terms to increase the likelihood that their
web page will be seen as a result of a search for that term.
However, there are many similar search terms, and it is difficult
for a web page operator to bid on every potentially relevant search
term. Likewise, it is unlikely that a bid will be made on every
search term. Accordingly, a search engine operator may not receive
any revenue from searches performed using certain search terms for
which there are no bids.
[0008] In addition, because the number of existing web pages is
ever increasing, it is becoming more difficult for a user to find
relevant search results. The difficulty of obtaining relevant
search results is further increased because of the search engine's
dependency on the search terms entered by the user. The search
results that a user receives are directly dependent upon the search
terms that the user enters. The entry of one search term may not
result in relevant search results, while the entry of only a
slightly different search term can result in relevant search
results. Accordingly, the selection of search terms is often an
important part of the search process. It would be of benefit to
both the searcher and the advertisers to recommend related searches
for the searcher to provide to the search engine. However, current
search engines do not enable a search engine operator to provide
related search terms, such as those that will produce relevant
search results, to a user. A system that overcomes these
deficiencies is needed.
SUMMARY
[0009] By way of introduction only, in accordance with one
embodiment of the invention, a search request is received from a
searcher and used to perform a search on a pay for performance
database. In the pay for performance database there are stored
search listings including web page locators and bid amounts to be
paid by the operator of the listed web page. The search using the
pay for performance database produces search results which are
presented to the searcher. The search request is also used to
perform a search on a related search database. The related search
database has been formed at least in part using contents of the pay
for performance database. The search of the related search database
produces a list of related searches which are presented to the
searcher.
[0010] In accordance with a second embodiment, a related search
database is created using a pay for performance database. All text
from all web pages referenced by the pay for performance database
is stored and used to create an inverted index. Additional indexes
are used to improve the relevancy and spread of related search
results obtained using the database.
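As a rough illustration of the inverted index mentioned in this embodiment, the sketch below maps each word to the set of listings whose page text contains it. The listing identifiers, the sample text, and the `build_inverted_index` helper are assumptions made for illustration; the source code appendix may organize the index differently.

```python
import re
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase word to the set of page ids whose text contains it.

    `pages` is a dict of page id -> page text (hypothetical sample data).
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return dict(index)


pages = {
    "listing-1": "Buy CD burners and blank discs",
    "listing-2": "Jazz CD collection from New Orleans",
}
index = build_inverted_index(pages)
```

Looking up a word in `index` then yields every listing associated with that word, which is the row-per-keyword arrangement the text describes.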
[0011] The foregoing discussion of illustrative embodiments of the
invention has been provided only by way of introduction. Nothing in
this section should be taken as a limitation on the following
claims, which define the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIG. 1 is a block diagram illustrating a database search
system in conjunction with a computer network;
[0013] FIG. 2 is a flow diagram illustrating a method for operating
the database search system of FIG. 1;
[0014] FIG. 3 illustrates a search result list display produced by
the database search system of FIG. 1;
[0015] FIG. 4 is a flow diagram illustrating in more detail a
portion of the method shown in FIG. 2;
[0016] FIG. 5 is a flow diagram illustrating in more detail a
portion of the method shown in FIG. 2;
[0017] FIG. 6 is a flow diagram illustrating a method for forming a
related searches database; and
[0018] FIG. 7 is a flow diagram illustrating a method for removing
similar page information from a database.
DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED EMBODIMENTS
[0019] Referring now to the drawing, FIG. 1 is a block diagram of a
database search system 100 shown in conjunction with a computer
network 102.
[0020] The database search system 100 includes a pay for
performance database 104, a related searches database 106, a search
engine web server 108, a related searches web server 110 and a
search engine web page 114. The servers 108, 110 may be
accessed over the network 102 by an advertiser web server 120 or a
client computer 122.
[0021] The network 102 in the illustrated embodiment is the
Internet and provides data communication according to appropriate
standards, such as Internet Protocol. In other embodiments, other
network systems may be used alone or in conjunction with the
Internet. Communication in the network 102 is preferably according
to Internet Protocol or similar data communication standard. Other
data communications standards may be used as well to ensure
reliable communication of data.
[0022] The database search system 100 is configured as part of a
client and server architecture. In the context of a computer
network such as the Internet, a client is a process such as a
program, task or application that requests a service which is
provided by another process, known as a server program. The client
process uses the requested service
without having to know any working details about the other server
program or the server itself. In networked systems, a client
process usually runs on a computer that accesses shared network
resources provided by another computer running a corresponding
server process. A server is typically a remote computer system that
is accessible over a communications medium such as a network. The
server acts as an information provider for a computer network.
Thus, the system 100 operates as a server for access by the clients
such as client computer 122 and the advertiser web server 120.
[0023] The client computers 122 can be conventional personal
computers, workstations or computer systems of any size. Each
client computer 122 typically includes one or more processors,
memory, input and output devices and a network interface such as a
modem. The advertiser web server 120, the search engine web server
108, the related searches web server 110 and the account management
web server 112 can be similarly configured. However, the advertiser
web server 120, the search engine web server 108, the related
searches web server 110 and the account management web server 112
may each include many computers connected by a separate private
network.
[0024] The client computer 122 executes a World Wide Web ("web")
browser program 124. Examples of such a program are Navigator,
available from Netscape Communications Corporation and Internet
Explorer, available from Microsoft Corporation. The browser program
124 is used by a user to enter addresses of specific web pages to
be retrieved. These addresses are referred to as Uniform Resource
Locators (URLs). In addition, once a page has been retrieved, the
browser program 124 can provide access to other pages or records
when the user clicks on hyperlinks to other web pages contained in
the web page. Such hyperlinks provide an automated way for the user
to enter the URL of another page and to retrieve that page. The
pages can be data records including as content plain textual
information or more complex digitally encoded multimedia content
such as software programs, graphics, audio data, video data and so
forth.
[0025] Client computers 122 communicate through the network 102
with various network information providers. These information
providers include the advertiser web server 120, the account
management server 112, the search engine server 108, and the
related searches web server 110. Preferably, communication
functionality is provided by HyperText Transfer Protocol (HTTP),
although other communication protocols such as FTP, SNMP, Telnet
and a number of other protocols known in the art may be used.
Preferably, search engine server 108, related searches server 110
and account management server 112, along with advertiser servers
120 are located on the worldwide web. U.S. patent application Ser.
No. 09/322,627, filed May 29, 1999 and entitled "System and Method
for Influencing a Position on a Search Result List Generated by a
Computer Network Search Engine," and U.S. patent application Ser.
No. 09/494,818, filed Jan. 31, 2000 and entitled "Method and System
for Generating a Set of Search Terms," are commonly assigned to the
assignee of the present application and are incorporated herein by
reference. These applications disclose additional aspects of search
engine systems.
[0026] The account management web server 112 in the illustrated
embodiment includes a computer storage medium such as a disc system
and a processing system. A database is stored on the storage medium
and contains advertiser account information. Conventional browser
programs 124, running on client computers 122, may be used to
access advertiser account information stored on the account
management server 112.
[0027] The search engine web server 108 permits network users, upon
navigating to the search engine web server URL or sites on other
web servers capable of submitting queries to the search engine web
server 108 through a browser program 124, to type keyword queries
to identify pages of interest among the millions of pages available
on the World Wide Web. In one embodiment of the present invention, the
search engine web server 108 generates a search result list that
includes, at least in part, relevant entries obtained from and
formatted by the results of the bidding process conducted by the
account management server 112. The search engine web server 108
generates a list of HyperText links to documents that contain
information relevant to search terms entered by the user at a
client computer 122. The search engine web server transmits this
list, in the form of a web page 114 to the network user, where it
is displayed on the browser 124 running on the client computer 122.
One embodiment of the search engine web server may be found by
navigating to the web page at URL http://www.goto.com/.
[0028] Search engine web server 108 is connected to the network
102. In one embodiment of the present invention, search engine web
server 108 includes a pay for performance database including a
plurality of search listings. The database 104 contains an ordered
collection of search listing records used to generate search
results in response to user queries. Each search listing record
contains the URL of an associated web page or document, a title,
descriptive text and a bid amount. In addition, search engine web
server 108 may also be connected to the account management server
112. The account management server 112 may also be connected to the
network 102.
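The search listing record described above can be pictured as a small data structure. The sketch below is illustrative only: the `SearchListing` class name, the field names, and the highest-bid-first ordering are assumptions, since the text names the four fields but does not specify how the collection is ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchListing:
    """One record with the four fields named in the text."""
    url: str
    title: str
    description: str
    bid_amount: float  # currency unit is an assumption


def ordered_by_bid(listings):
    """Return listings from highest to lowest bid (assumed ordering)."""
    return sorted(listings, key=lambda listing: listing.bid_amount, reverse=True)


# Hypothetical sample records.
listings = [
    SearchListing("http://example.com/a", "Shop A", "CD burners on sale", 0.10),
    SearchListing("http://example.com/b", "Shop B", "Discount CD burners", 0.25),
]
top = ordered_by_bid(listings)[0]
```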
[0029] In addition, in the illustrated embodiment of FIG. 1, the
database system 100 further includes a related searches web server
110 and an associated related searches database 106. The related
searches web server 110 and database 106 operate to provide
suggested, related searches for presentation to a searcher along
with search results in response to his query. Users conducting
searches for information using a search engine web server such as
the server 108 often perform searches which are inappropriately
focused as compared to the index data of the web site search
engine. Users may use search terms which are either too vague and
generalized, such as "music," or too specific and focused, such as
"hot jazz from New Orleans during the early 1950s." Some users
require assistance to refine their query to better obtain useful
information from the search engine. The related searches web server
110 provides the user with query suggestions better suited to the
abilities of the pay for performance database 104.
[0030] In the illustrated embodiment, the pay for performance
database 104 is established in conjunction with advertisers who
operate web servers such as advertiser web server 120. Advertiser
web pages 121 are displayed on the advertiser web server 120. An
advertiser or web site promoter may, through an account residing on
the account management server 112, participate in a competitive
bidding process with other advertisers. An advertiser may bid on
any number of search terms relevant to the content of the
advertiser's web site.
[0031] The bids submitted by the web site promoters are used to
control presentation of search results to a searcher using client
computer 122. Higher bids receive more advantageous placement on a
search result list generated by the search engine web server 108
when a search using the search term bid on by the advertiser is
executed. In one embodiment, the amount bid by an advertiser
comprises a money amount that is deducted from the account of the
advertiser each time the advertiser web site is accessed via a
hyperlink on the search result list page. A searcher clicks on the
hyperlink with a computer input device such as a mouse to initiate
a retrieval request to retrieve the information associated with the
advertiser's hyperlink. Preferably, each access or click on a
search result list hyperlink is redirected to the search engine web
server 108 to associate the click with the account identifier for
an advertiser. This redirection action, which is not apparent to
the searcher, will access account information coded into the search
result page before accessing the advertiser's URL using the search
result list hyperlink clicked on by the searcher. In the
illustrated embodiment, the advertiser's web site description and
hyperlink on the search result list page is accompanied by an
indication that the advertiser's listing is a paid listing. Each
paid listing displays an amount corresponding to a price per click
paid by the advertiser for each referral to the advertiser site
through this search result list.
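The click redirection described in this paragraph might look roughly like the following sketch. The redirect endpoint, the `account` and `to` query parameters, and the use of integer cents for balances are all assumptions for illustration and are not taken from the source code appendix.

```python
from urllib.parse import urlencode, urlparse, parse_qs

REDIRECT_BASE = "http://search.example.com/click"  # hypothetical endpoint


def redirect_link(account_id, target_url):
    """Build a result-page hyperlink that routes the click through the search server."""
    return REDIRECT_BASE + "?" + urlencode({"account": account_id, "to": target_url})


def handle_click(link, balances, bids):
    """Deduct the advertiser's bid from its balance and return the advertiser URL."""
    params = parse_qs(urlparse(link).query)
    account_id = params["account"][0]
    balances[account_id] -= bids[account_id]
    return params["to"][0]


balances = {"adv-1": 500}  # amounts in cents (assumed unit)
bids = {"adv-1": 25}
link = redirect_link("adv-1", "http://advertiser.example.com/cd-burners")
destination = handle_click(link, balances, bids)
```

Because the account identifier travels in the hyperlink itself, the server can charge the click before forwarding the searcher to the advertiser's URL.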
[0032] The searcher may click on HyperText links associated with
each listing in that search result page to access the corresponding
web pages. The HyperText links may access web pages anywhere on the
Internet, and include paid listings to advertiser web pages 121
located on the advertiser web server 120. In one embodiment of the
present invention, the search result list also includes non-paid
listings that are not priced as a result of advertiser's bids and
are generated by a conventional search engine, such as the Inktomi,
Lycos, or Yahoo! search engines. The non-paid HyperText links may
also include links manually indexed into the pay for performance
database 104 by an editorial team. Preferably, the non-paid
listings follow the paid advertiser listings on the search results
page.
[0033] Related searches web server 110 receives the search request
from the searcher at client computer 122 as entered using the
search engine web page 114. In the related searches database 106,
which includes related search listings generated from the pay for
performance database 104, the related searches web server 110
identifies related search listings relevant to the search request.
In conjunction with the search engine web server 108, the related
searches web server 110 returns a search result list to the
searcher including the identified search listings located in the
pay for performance database and one or more identified related
search listings located in the related searches database 106.
Operation of the related searches web server 110 in conjunction
with the related searches database 106 will be described below in
conjunction with FIGS. 2-5. The formation of the related searches
database 106 will be described below in conjunction with FIG.
6.
[0034] FIG. 2 is a flow diagram illustrating a method for operating
the database search system 100 of FIG. 1. The method begins at
block 200. Java source code for implementing the method of FIG. 2
and other method steps described herein is included as an
appendix.
[0035] At block 202, a search request is received. The search
request may be received in any suitable manner. It is envisioned
that a search request will originate with a searcher using a client
computer to access the search engine web page of the database
system implementing the method illustrated in FIG. 2. A search
request may be typed in as input text, or a hyperlink may be clicked,
to initiate the search request and search process.
[0036] After block 202, two parallel processes are initiated. At
block 204, the search engine web server of the database search
system identifies matching search listings in the pay for
performance database of the system. In addition, the search engine
web server may further identify non-paid search listings.
[0037] Similarly, at block 206, a related searches web server
initiates a search to identify matching related search listings in
the related search database. By matching search listings, it is
meant that the respective search engine identifies search listings
contained in the respective database which generate a match with
the search request. A match may be generated if an exact, letter
for letter textual match occurs between a bid on keyword and a
search term. In other embodiments, a match may be generated if a
bidded keyword has a predetermined relationship with a search term.
For example, the predetermined relationship may include matching
the root of a word which has been stripped of suffixes; in a
multiple word query, matching several but not all of the words; or
locating the multiple words of the query within a predetermined
number of words of proximity.
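The matching relationships listed above can be sketched as four small predicates. This is an assumption-laden illustration: the suffix list is a crude stand-in for a real stemmer, and the function names, the `min_shared` threshold, and the `window` parameter are hypothetical.

```python
from itertools import product

SUFFIXES = ("ing", "ers", "er", "es", "s")  # crude, illustrative suffix list


def root(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def exact_match(keyword, term):
    """Letter-for-letter match, ignoring case."""
    return keyword.lower() == term.lower()


def root_match(keyword, term):
    """Match after stripping suffixes from every word."""
    return [root(w) for w in keyword.lower().split()] == [
        root(w) for w in term.lower().split()
    ]


def partial_match(keyword, term, min_shared=2):
    """Match when at least `min_shared` words are common to both."""
    shared = set(keyword.lower().split()) & set(term.lower().split())
    return len(shared) >= min_shared


def proximity_match(text, term, window=3):
    """True if every query word occurs in `text` within `window` words of the others."""
    words = text.lower().split()
    positions = [[i for i, w in enumerate(words) if w == q] for q in term.lower().split()]
    if not positions or any(not p for p in positions):
        return False
    return any(max(combo) - min(combo) <= window for combo in product(*positions))
```

A production system would likely replace `root` with an established stemming algorithm; the point here is only to show how the four relationships differ.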
[0038] After the search results have been located, the search
results from the pay for performance database are combined with
search results from the related search database, block 208. At
block 210, a search result list is returned to the searcher, for
example by displaying identified search listings on the search
engine web page and conveying the web page data over the network to
the client computer. The search results and related search results
may be displayed in any convenient fashion.
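The flow of blocks 202 through 210 can be sketched as two lookups run in parallel and then combined. The dictionary-backed databases, the function names, and the shape of the returned result are assumptions for illustration; they are not taken from the Java appendix.

```python
from concurrent.futures import ThreadPoolExecutor


def search_listings(query, listings_db):
    """Stand-in for block 204: keyword lookup in the listings database."""
    return listings_db.get(query.lower(), [])


def search_related(query, related_db):
    """Stand-in for block 206: lookup of related searches for the query."""
    return related_db.get(query.lower(), [])


def handle_search(query, listings_db, related_db):
    """Run both searches concurrently and combine the results (blocks 208-210)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        listings_future = pool.submit(search_listings, query, listings_db)
        related_future = pool.submit(search_related, query, related_db)
        return {
            "query": query,
            "listings": listings_future.result(),
            "related_searches": related_future.result(),
        }


# Hypothetical sample data.
listings_db = {"cd burners": ["Shop B", "Shop A"]}
related_db = {"cd burners": ["cd writers", "blank cds"]}
page = handle_search("CD burners", listings_db, related_db)
```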
[0039] An example of a search result list display used in one
embodiment of the present invention is shown in FIG. 3, which is a
display of the first several entries resulting from the search for
the term "CD burners." The exemplary display of FIG. 3 shows a
portion of a search result list including a plurality of entries
310a, 310b, 310c, 310d, 310e, 310f, 310g, 310h, 310i, a listing 312
of other search categories and a related searches listing 314.
[0040] As shown in FIG. 3, a single entry, such as entry 310a in
the search result list, consists of a description 320 of the web
site, preferably comprising a title and a short textual
description, and a hyperlink 330 which, when clicked by a searcher,
directs the searcher's browser to the URL where the described web
site is located. The URL 340 may also be displayed in the search
result list entry 310a, as shown in FIG. 3. The "click through" of
a search result item occurs when the remote searcher viewing the
search result item display 310 of FIG. 3 selects or clicks on the
hyperlink 330 of the search result item display 310.
[0041] Search result list entries 310a-310h may also show the rank
value 360a, 360b, 360c, 360d, 360e, 360f, 360g, 360h, 360i of the
advertiser's search listing. The rank value 360a-360i is an ordinal
value, preferably a number, generated and assigned to the search
listing by the processing system of the search engine web server.
Preferably, the rank value 360a-360i is assigned in a process,
implemented in software, that establishes an association between
the bid amount, the rank, and the search term of a search listing.
The process gathers all search listings that match a particular
search term, sorts the search listings in order from highest to
lowest bid amount, and assigns a rank value to each search listing
in order. The highest bid amount receives the highest rank value,
the next highest bid amount receives the next highest rank value,
proceeding to the lowest bid amount, which receives the lowest rank
value. The correlation between rank value and bid amount is
illustrated in FIG. 3, where each of the paid search list entries
310a-310h displays the advertiser's bid amount 350a, 350b, 350c,
350d, 350e, 350f, 350g, 350h, 350i for that entry. If two search
listings having the same search term also have the same bid amount,
the bid that was received earlier in time will be assigned the
higher rank value.
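The rank assignment just described (sort by bid amount, break ties by the earlier bid) can be sketched as below. The `Listing` fields, the use of integer cents for bid amounts and the integer `received_at` timestamp are assumptions made for this illustration; rank 1 is treated as the highest rank.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    advertiser: str
    search_term: str
    bid_cents: int      # bid amount, in cents (assumed unit)
    received_at: int    # hypothetical timestamp; smaller means earlier


def assign_ranks(listings, search_term):
    """Return (rank, listing) pairs for one search term, rank 1 first."""
    # Gather all listings that match the search term.
    matching = [l for l in listings if l.search_term == search_term]
    # Highest bid first; for equal bids, the earlier bid ranks higher.
    ordered = sorted(matching, key=lambda l: (-l.bid_cents, l.received_at))
    return list(enumerate(ordered, start=1))
```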
[0042] The search result list of FIG. 3 does not include unpaid
listings. In the preferred embodiment, unpaid listings do not
display a bid amount and are displayed following the lowest ranked
paid listing. Unpaid listings are generated by a search engine
utilizing object distributed database and text searching algorithms
as known in the art. An example of such a search engine is the
search engine operated by Inktomi Corporation. The original search
query entered by the remote searcher is used to generate unpaid
listings through the conventional search engine.
[0043] The listing 312 of other search categories shows other
possible categories for searching that may be related to the
searcher's input search term 316. The other search categories are
selected for display by identifying a group such as computer
hardware containing the input search term 316. Categories within
the group are then displayed as hyperlinks which may be clicked
through by the searcher for additional searches. This enhances the
user's convenience in cases where the user's input search did not
turn up suitable search results.
[0044] The related searches listing 314 displays six entries 318 of
related searches determined using the related searches database as
described herein. In other embodiments, other numbers of related
search entries may be shown. In addition, a link 320 labeled "more"
allows the user to display additional related search entries. In
the illustrated embodiment, the displayed entries 318 are the top
six most relevant and most bidded-on terms in the related searches
database.
[0045] Referring now to FIG. 4, the act of identifying matching
related search listings in a related search database (act 206, FIG.
2) in one embodiment comprises the following acts. At block 400, an
inverted index containing all data from all web pages contained in
the pay for performance database of the database search system is
searched. The inverted index is stored in the related searches
database. In an inverted index, a single index entry is used to
reference many database records. Searching for multiple matches per
index entry is generally faster when using inverted indexes, since
each index entry may reference many database records. The inverted
index lists the words which can be searched in, for example,
alphabetical order and accompanying each word are pointers which
identify the particular documents which contain the word as well as
the locations within each document at which the word occurs. To
perform a search, instead of searching through the documents in
word order, the computer locates the pointers for the particular
words identified in a search query and processes them. The computer
identifies the documents which have the required order and
proximity relationship for the search query terms.
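As an illustration of the inverted index described above, the following sketch maps each word to the documents that contain it and to the word positions within each document, then uses those pointers to find documents in which the query words appear adjacent and in order. The function names and the in-memory dictionary layout are assumptions for this example and are not taken from the described system.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map word -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents containing the query words adjacent, in order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words or any(w not in index for w in words):
        return set()
    # Only documents that contain every query word can match.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc_id in candidates:
        # Check each occurrence of the first word as a possible phrase start.
        for start in index[words[0]][doc_id]:
            if all(start + k in index[w][doc_id] for k, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits
```

The search touches only the index entries for the query words rather than scanning every document, which is the speed advantage the paragraph describes.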
[0046] At block 402, meta-information is also searched for the
received search term. Meta-information is abstracted, once-removed
information about the collected data itself and forms a description
of the data. Meta-information is derived information and relational
information. Meta-information for a listing describes the relation
of the listing to other listings, and the relation of the
advertiser sponsoring the listing to other advertisers.
[0047] Meta-information is obtained using a script of commands to
analyze the pay for performance database and determine information
and relationships present in the data. The meta-information is
collected for each row of data in the database and attached to that
row. In one embodiment, the script is run one time as a batch
process after the data is collected in the database. In other
embodiments, the script is periodically re-run to update the
meta-information.
[0048] Meta-information about the web pages and key words contained
in the pay for performance database includes information such as
the frequency of occurrence of similar key words among different
web site domains and the number of different key words associated
with a single web site. The meta-information may further include
fielded advertiser data which is the information contained in each
search listing provided by web site promoters who have bid upon
search terms in the pay for performance database; advertiser
identification information; web site themes, such as gambling or
adult content; and derived themes. Preferably, the meta-information
is combined in a common inverted index with the stored web page
data searched at block 400.
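Two of the derived values mentioned above, the number of web site domains sharing a key word and the number of key words associated with a single web site, could be computed by a batch script along the following lines. This is a sketch under stated assumptions: rows are taken to be hypothetical `(keyword, url)` pairs, identical key words stand in for "similar" key words, and a web site is identified by the host part of its URL.

```python
from collections import defaultdict
from urllib.parse import urlparse


def derive_meta(rows):
    """Derive two counts from (keyword, url) rows.

    Returns (domains_per_keyword, keywords_per_domain), each a dict of counts.
    """
    domains_by_keyword = defaultdict(set)
    keywords_by_domain = defaultdict(set)
    for keyword, url in rows:
        domain = urlparse(url).netloc.lower()
        domains_by_keyword[keyword].add(domain)
        keywords_by_domain[domain].add(keyword)
    domains_per_keyword = {k: len(v) for k, v in domains_by_keyword.items()}
    keywords_per_domain = {d: len(v) for d, v in keywords_by_domain.items()}
    return domains_per_keyword, keywords_per_domain
```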
[0049] The result of the searches of block 400 and block 402 is a
listing of rows of the inverted index or indexes containing the
searched information. Each row contains the information associated
with a search listing of the pay for performance database along
with all the text of the web page associated with the search
listing. In the illustrated embodiment, the search listing includes
the advertiser's search terms, the URL of the web page, a title and
descriptive text.
[0050] At block 404, the returned related search results are sorted
by relevancy. Any suitable sorting routine may be used. A preferred
process of sorting the search results by relevancy, block 404, is
illustrated in greater detail in FIG. 5.
[0051] At block 406, the six most relevant related search results
are selected. It is to be noted that any suitable number of search
results may be provided. The choice of providing six related
searches as suggestions to a searcher is arbitrary. After block
406, control proceeds at block 208, FIG. 2.
[0052] FIG. 5 is a flow diagram illustrating a method for sorting
by relevancy search results obtained from a related searches
database, corresponding to block 404 of FIG. 4. In the embodiment
illustrated in FIG. 5, a relevancy value is maintained for each
returned listing. The relevancy value is adjusted according to
specific relevancy factors, some of which are defined in FIG. 5.
Other relevancy factors may be used as well. After adjusting the
relevancy value, a final sorting occurs and the highest valued
listings are returned.
[0053] At block 500, the relevancy value for each individual record
located during the search (block 400, block 402, FIG. 4) is
increased according to the frequency of occurrence of a queried
search term in that record. For example, if the queried
search term occurs frequently in the text associated with the
search listing, the relevancy of that listing is increased. If the
queried search term occurs rarely or not at all in the listing, the
relevancy value of that listing is not increased or is decreased.
[0054] At block 502, it is determined if there are multiple search
terms in the search query submitted by the searcher. If not,
control proceeds to block 506. If there are multiple search terms,
at block 504, the relevancy of individual search results is
increased according to proximity of the searched terms in a located
record. Thus, if two search terms are immediately proximate, the
relevancy score value for the record may be substantially
increased, suggesting that the identified search listing is highly
relevant to the search query submitted by the searcher. On the
other hand, if the two search terms occur, for example, in the same
sentence but not in close proximity, the relevancy of the record
may be slightly increased to indicate the lesser relevancy
suggested by the reduced proximity of the search terms.
[0055] At block 506, it is determined if the located record
contains a bidded search term. Search terms are bidded on by
advertisers, the bids being used for display of search results by
the search engine web server using the pay for performance
database. If the located record does include a bidded-on search
term, the relevancy of the record is adjusted, block 508. If the
record does not include one or more bidded-on search terms, control
proceeds to block 510.
[0056] At block 510, it is determined if there are search terms in
the description of the search listing. As illustrated in FIG. 3,
each such listing includes a textual description of the contents
of the web site associated with the search listing. If the search
terms are not included in the description, control proceeds to
block 514. If the search terms are included in the description, at
block 512, the relevancy of the located record is adjusted
accordingly.
[0057] At block 514, it is determined if the search terms are
located in the title of the search listing. As illustrated in FIG.
3, each search listing includes a title 360. If the search terms
are included in the title of a record, the relevancy of the record
is adjusted accordingly, block 516. If the search terms are not
included in the title, control proceeds to block 518.
[0058] At block 518, it is determined if the search terms are
included in the metatags of the search listing. Metatags are
textual information included in a web site which is not displayed
for user use. However, the search listing contained in the
pay-for-placement database includes the metatags for searching and
other purposes. If, at block 518, the search terms are not included
in the metatags of the search listing, control proceeds to block
522. On the other
hand, if the search terms are included in one or more metatags of
the search listing, at block 520 the relevancy of the record is
adjusted accordingly.
[0059] At block 522 it is determined if the user's search terms are
included in the text of the bidded web page. If not, control
proceeds to block 406, FIG. 4. However, if the search terms are
included in the web page text, at block 524 the relevancy of the
search listing record is adjusted accordingly.
[0060] Following the steps illustrated in FIG. 5, one or more, and
preferably the six most relevant, related search listings are returned
and presented to the searcher along with the search results from
the pay-for-placement database.
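The relevancy adjustments of FIG. 5 can be sketched as a single scoring function. The text does not state how large each adjustment is, so the `WEIGHTS` values below are purely hypothetical, as are the `Record` field names and the choice to require all query words in a field before adjusting for it.

```python
import re
from dataclasses import dataclass
from itertools import product

# Hypothetical adjustment sizes; the text gives no numeric weights.
WEIGHTS = {"frequency": 1.0, "proximity": 4.0, "bidded": 5.0,
           "description": 2.0, "title": 3.0, "metatags": 1.0, "page_text": 1.0}


@dataclass
class Record:
    bidded_term: str
    title: str
    description: str
    metatags: str
    page_text: str


def _words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def _contains_all(text, terms):
    return set(terms) <= set(_words(text))


def relevancy(record, query):
    terms = list(dict.fromkeys(_words(query)))  # distinct terms, order kept
    if not terms:
        return 0.0
    words = _words(" ".join([record.title, record.description,
                             record.metatags, record.page_text]))
    # Block 500: frequency of occurrence of the queried terms.
    score = WEIGHTS["frequency"] * sum(words.count(t) for t in terms)
    # Blocks 502-504: proximity, only for multi-term queries.
    if len(terms) > 1:
        positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
        if all(positions):
            gap = min(max(c) - min(c) for c in product(*positions))
            score += WEIGHTS["proximity"] / max(gap, 1)
    # Blocks 506-524: bidded term, description, title, metatags, page text.
    for name, text in (("bidded", record.bidded_term),
                       ("description", record.description),
                       ("title", record.title),
                       ("metatags", record.metatags),
                       ("page_text", record.page_text)):
        if _contains_all(text, terms):
            score += WEIGHTS[name]
    return score


def top_related(records, query, limit=6):
    """Final sort: return the `limit` highest-valued records."""
    scored = sorted(((relevancy(r, query), r) for r in records),
                    key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored][:limit]
```

The default `limit=6` mirrors the six related searches of the illustrated embodiment; any other number could be passed.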
[0061] FIG. 6 illustrates a method for forming a related searches
database for use in the database search system of FIG. 1. The
method begins at block 600.
[0062] At block 602, all text for all web pages in the
pay-for-placement database is fetched. This includes metatags and
other non-displayed textual information contained in the web page
referenced by a URL contained in the pay for performance database.
At block 604, text from similar pages is omitted. This reduces the
amount of data which must be processed to form the related searches
database. One embodiment of a method for performing this act will
be described below in conjunction with FIG. 7. In addition, this
greatly increases the speed at which the related searches database
may be produced. At block 606, the text is stored in the related
searches database.
[0063] At block 608, an inverted index is created, indexing the
search listing data stored at block 606 along with the text fetched
at block 602. The resulting inverted index includes a plurality of
rows of data, each row including a key word along with all text
from the database associated with that key word.
[0064] One illustrative example of a configuration for the contents
of the related search database follows. Each row of the database
includes the following elements:
TABLE-US-00001
canon_cnt          integer          # Number of different search listings bidded on this related result
advertiser_cnt     integer          # Number of different advertisers bidding on this related result
related_result     varchar(50)      # related result (bidded search term), canonicalized and depluralized
raw_search_text    varchar(50)      # original raw bidded search term
advertiser_ids     varchar(4096)    # explicit list of all advertisers bidding on this related result
words              varchar(65536+)  # full text of all web pages crawled, including hand-coded descriptions
theme              varchar(50)
directory_taxonomy varchar(200)
[0065] The count canon_cnt differs from the count advertiser_cnt
because many different web pages in the same domain could be bidded
against the same bidded search term, or many different advertisers
may bid on only one search term. Special themed keys are embedded
into the database with `flags` inserted in the advertiser_cnt
field. If `advertiser_cnt == 999999999`, the query being presented
is an adult-oriented query. In this implementation, an optional
enhancement is to disable related results in this case. The counts
canon_cnt and advertiser_cnt are the current derived-data fields.
Additional fields such as theme and directory_taxonomy_category can
optionally be added to give even more enhanced relevance to related
results matches, though they are not used in the illustrated
embodiment.
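The adult-query flag described above could be checked as in the following sketch. The `RelatedRow` type mirrors only the first six columns of the example row layout, and the rule of disabling related results as soon as any returned row carries the flag is one possible reading of the optional enhancement, assumed here for illustration.

```python
from typing import NamedTuple

ADULT_FLAG = 999999999  # sentinel value stored in advertiser_cnt


class RelatedRow(NamedTuple):
    canon_cnt: int
    advertiser_cnt: int
    related_result: str
    raw_search_text: str
    advertiser_ids: str
    words: str


def related_results_enabled(rows):
    """Return False when any returned row carries the adult-theme flag."""
    return not any(row.advertiser_cnt == ADULT_FLAG for row in rows)
```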
[0066] In one embodiment, the inverted index which is queried
against to obtain the related results is created with the following
SQL command:
[0067] SQL>Create metamorph inverted index mm_index02 on line_ad02(words);
[0068] This is the vendor-specific method (using the Texis
relational database management system provided by Thunderstone-EPI,
Inc.) for creating a free-text search index (mm_index02) on a
document, here contained in a database column (words), which will be
searched (from RelatedSearcherCore.java) by the Texis Thunderstone
SQL command:
TABLE-US-00002
"SELECT "
+ "$rank, "               // Num  getRow( ) arg position 0
+ "canon_cnt, "           // Int  getRow( ) arg position 1
+ "raw_search_text, "     // Stri getRow( ) arg position 2
+ "cannon_search_text, "  // Stri getRow( ) arg position 3
+ "advertiser_ids, "      // Stri getRow( ) arg position 4
+ "advertiser_cnt "       // Int  getRow( ) arg position 5
+ "FROM line_ad02 "
+ "WHERE words "
+ "LIKEP $query ORDER BY 1 desc, advertiser_cnt desc;";
[0069] The $rank is a vendor-supplied virtual data field which
programmatically contains the "relevancy" of the search result,
based on the frequency of occurrence of the queried phrase ($query)
in the "words" field, the proximity of the queried phrase elements
to each other within the indexed words field, and the word order
(if >1 query phrase word) as compared to the ordering of words
within the "words" field.
[0070] The "rank" is vendor-specific and is derived by differing
algorithms by different Free Text Search Engine suppliers, though
the results are similar enough in practice that any vendor's
Free Text Search Engine works to implement the Related Searches
Functionality.
[0071] The "ORDER BY 1 desc[ending], advertiser_cnt desc[ending]"
controls ranking the results of the query by relevance primarily
(field "1"==$rank), and secondarily by the derived field
"advertiser_cnt", which is the count of advertisers bidding on this
particular related_search_result. Thus, "relevance" is the primary
selection criterion, and "popularity" is the secondary selection
criterion.
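The two-level ordering expressed by the ORDER BY clause can be mirrored outside the database as follows. In this sketch each row is a hypothetical `(rank, advertiser_cnt, related_result)` tuple, where `rank` stands in for the vendor-supplied `$rank` value.

```python
def order_results(rows):
    """Sort by relevance (rank) descending, then by advertiser_cnt descending."""
    return sorted(rows, key=lambda row: (row[0], row[1]), reverse=True)
```

Rows with equal relevance are therefore separated by popularity, that is, by how many advertisers bid on the related result.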
[0072] At block 610, additional indexes are created and stored with
the inverted index created at block 608. The additional indexes are
created using key information associated with each search listing.
The key information includes, for example, fielded advertiser data
such as an advertiser's identification and derived themes such as
gambling and so forth. The method then ends at block 612.
[0073] FIG. 7 is a flow diagram illustrating a method for removing
similar page information from a database. The method in the
illustrated implementation follows performance of act 602 of FIG.
6.
[0074] At block 702, the pay for performance database (also
referred to as a bidded search listing database) is examined for
URL data and all URLs are extracted from the database and formed
into a list. The list is sorted and any exact duplicates are
removed, block 704.
[0075] At block 706, a URL in the list is selected and it is
determined if the selected URL bears similarity to a preceding URL
in the list. Similarity may be determined by any suitable method,
such as a number of identical characters or fields within the URL
or a percentage of identical characters, or a common root or string
or field.
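One way to realize the URL similarity test is a shared-prefix measure combined with a host comparison, as sketched below. The text leaves the method open ("any suitable method"), so the functions and the `threshold` value of 0.8 are assumptions chosen only for illustration.

```python
from os.path import commonprefix
from urllib.parse import urlparse


def url_similarity(a, b):
    """Fraction of the longer URL covered by the common leading string."""
    if not a or not b:
        return 0.0
    return len(commonprefix([a, b])) / max(len(a), len(b))


def same_host(a, b):
    """True when both URLs share the same host field."""
    return urlparse(a).netloc.lower() == urlparse(b).netloc.lower()


def is_similar(a, b, threshold=0.8):
    """Treat URLs as similar when the host matches and most characters agree."""
    return same_host(a, b) and url_similarity(a, b) >= threshold
```

Because the list is sorted before comparison, URLs that share a long leading string end up adjacent, which is why comparing each URL only with its predecessor is sufficient.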
[0076] At block 708, if the selected URL is similar to the
preceding URL, the selected URL is added to a list of candidate
duplicate URLs. At block 710, a predetermined number of the
potentially duplicate URLs are crawled. In the illustrated
embodiment, the predetermined number is the first two potentially
duplicate URLs. Crawling is preferably accomplished using a program
code referred to as a crawler. A crawler is a program that visits
Web sites and reads their pages and other information. Such
programs are well known and are also known as "spiders" or
"bots." Entire sites or specific pages can be selectively visited
and indexed by a crawler. In alternative embodiments, subsets of
each site referenced by a URL, rather than an entire site, may be
crawled and compared for similarity.
[0077] At block 712, the data returned by the crawler is examined.
The data may be referred to as the body of the URL and includes
data from the site identified by the URL and all accessible pages
of the site. It is determined if the data including text and other
information contained in the body of the URL is sufficiently
similar to the data contained in the body of the previous URL.
Again, similarity may be determined by any suitable method, such as
a statistical comparison of the textual content of each page. If
there is sufficient similarity, control proceeds to block 714 and
it is assumed that the URL is the same as the previous URL. The
body of text and other information is assigned to the rest of the
similar URLs.
[0078] If, at block 706, it was determined that the selected URL
was not similar to the preceding URL, or if at block 712 it was
determined that the body of the URL was not similar enough to the
body of the previous URL, control proceeds to block 718. At block
718, the URL is added to a list of URLs to be crawled. At block
720, all URLs on the list are crawled to retrieve and store
information contained at the sites indicated by each URL.
[0079] At block 716, the information from each crawled URL is
loaded into the related searches database (also referred to as the
free text database). The information is joined with search listing
data already included in the related searches database. Thus, the
method steps illustrated in FIG. 7 reduce the total amount of data
contained in the related searches database by reducing the number
of URLs that are crawled and stored. Duplicate URLs are eliminated
from the process and near-duplicate URLs are checked for similarity
of content. The result is reduced storage requirements for the
resulting database and faster, more efficient searching on the
database. This enhances user convenience by improving
performance.
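The flow of FIG. 7 can be summarized in a short sketch. The crawler is represented by a caller-supplied `fetch` function, and the two similarity tests are likewise passed in as `urls_similar` and `bodies_similar`; all three names are hypothetical stand-ins, since the text does not prescribe an interface.

```python
def build_bodies(urls, fetch, urls_similar, bodies_similar):
    """Return {url: body}, crawling as few similar URLs as possible."""
    # Blocks 702-704: form a sorted list with exact duplicates removed.
    ordered = sorted(set(urls))

    # Blocks 706-708: group each URL with its predecessor when similar.
    groups = []
    for url in ordered:
        if groups and urls_similar(groups[-1][-1], url):
            groups[-1].append(url)
        else:
            groups.append([url])

    bodies = {}
    for group in groups:
        first = group[0]
        bodies[first] = fetch(first)
        if len(group) == 1:
            continue
        # Block 710: crawl the first two candidate duplicates.
        second = group[1]
        bodies[second] = fetch(second)
        if bodies_similar(bodies[first], bodies[second]):
            # Blocks 712-714: assume the rest are the same; reuse the body.
            for url in group[2:]:
                bodies[url] = bodies[first]
        else:
            # Blocks 718-720: not duplicates after all; crawl each one.
            for url in group[2:]:
                bodies[url] = fetch(url)
    return bodies
```

In the duplicate case only two pages of the group are fetched, which is the source of the reduced crawling cost described above.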
[0080] From the foregoing, it can be seen that the present
invention provides an improved method and apparatus for producing
related searches for presentation to a searcher searching in a pay
for performance database. Related searches are performed in a
related searches database which has been formed using the pay for
performance database. The search results from the related
searches database are ordered by relevancy for presentation to
the user. Thus, if a user's initial search was too narrow or too
broad, the user has available related searches which may be used to
produce more usable results. In addition, the related searches have
been produced using search listings referenced by bidded search
terms. This provides a benefit to advertisers who pay for
advertising in the database search system. This increases the
likelihood that an advertiser's web site will be visited by a
searcher using the database system.
[0081] While a particular embodiment of the present invention has
been shown and described, modifications may be made. It is
therefore intended in the appended claims to cover all such changes
and modifications which fall within the true spirit and scope of
the invention.
* * * * *