U.S. patent application number 15/365657 was filed with the patent office on 2016-11-30 and published on 2017-03-23 for tiering of posting lists in search engine index.
The applicant listed for this patent is Microsoft Technology Licensing, LLC. Invention is credited to John G. Bennett, Trishul Chilimbi, Michael Hopcroft, Karthik Kalyanaraman, Knut Magne Risvik.
Application Number | 15/365657 |
Publication Number | 20170083553 |
Family ID | 46065328 |
Filed Date | 2016-11-30 |
United States Patent Application | 20170083553 |
Kind Code | A1 |
Risvik; Knut Magne; et al. | March 23, 2017 |
TIERING OF POSTING LISTS IN SEARCH ENGINE INDEX
Abstract
A search index includes tiered posting lists. Each posting list
in the search index corresponds with a different atom and includes
a list of documents containing the particular atom.
Additionally, a rank is stored with each document listed in a
posting list for a given atom representing the relevance of the
atom to the context of each document. At least some of the posting
lists in the search index are tiered. A tiered posting list is
divided into a number of tiers with the tiers being ordered by
rank while each tier is internally ordered by document.
Employing tiered posting lists within the search index allows a
search engine to evaluate search queries in a manner that allows
for a number of efficiencies and precise stopping.
Inventors: |
Risvik; Knut Magne (Mo I Rana, NO) |
Hopcroft; Michael (Kirkland, WA) |
Bennett; John G. (Bellevue, WA) |
Kalyanaraman; Karthik (Bellevue, WA) |
Chilimbi; Trishul (Seattle, WA) |
Applicant: |
Name | City | State | Country | Type |
Microsoft Technology Licensing, LLC | Redmond | WA | US | |
Family ID: | 46065328 |
Appl. No.: | 15/365657 |
Filed: | November 30, 2016 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12951799 | Nov 22, 2010 | 9529908 |
15365657 | | |
Current U.S. Class: | 1/1 |
Current CPC Class: | G06F 16/951 20190101; G06F 16/24578 20190101; G06F 16/2228 20190101; G06F 16/93 20190101 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. One or more computer storage media storing a data structure for
a search index, the data structure comprising: a plurality of
posting lists, each posting list being associated with a different
atom and including a plurality of postings, each posting within
each posting list corresponding with a given document and
identifying the given document and a rank for the given document,
wherein at least one posting list is divided into a plurality of
tiers, the tiers within the at least one posting list being ordered
by rank while the postings within each tier being internally
ordered by document sequence, wherein the search index supports
performing additional processing during a search query, to identify
a plurality of documents, based on comparing a combined rank of a
first set of documents identified after a first tier with a
calculated combined rank of the first set of documents of the first
tier and one or more documents of at least a second tier.
2. The one or more computer-storage media of claim 1, wherein the
number of tiers for the at least one posting list and the number of
documents within each tier of the at least one posting list are
determined based on the size of the document corpus used to
generate the search index.
3. The one or more computer storage media of claim 1, wherein tiers
of the at least one posting list are stored on different types of
storage devices.
4. One or more computer storage media storing computer useable
instructions that, when used by a computing device, cause the
computing device to perform a method comprising: accessing document
content for a plurality of documents; identifying atoms within the
document content of each document; determining ranks for atoms
found in the document content of each document, wherein the rank
for a given atom found in a given document comprises a score
representing the importance of the given atom within the context of
the given document; and generating a search index comprising a
plurality of posting lists, each posting list corresponding with an
atom identified within the document content of the plurality of
documents, wherein each posting within a given posting list
corresponding with a particular atom identifies a document
containing the atom and a rank for the document and the atom,
wherein at least one posting list is divided into a plurality of
tiers ordered based on rank, and wherein postings within each tier
are internally ordered based on document, wherein the search index
supports performing additional processing during a search query, to
identify a plurality of documents, based on comparing a combined
rank of a first set of documents identified after a first tier with
a calculated combined rank of the first set of documents of the
first tier and one or more documents of at least a second tier.
5. The one or more computer-storage media of claim 4, wherein the
number of tiers for the at least one posting list and the number of
documents within each tier of the at least one posting list are
determined based on the size of the document corpus used to
generate the search index.
6. The one or more computer-storage media of claim 4, wherein the
rank for a document within a posting of a posting list for an atom
comprises a score representing the importance of the atom within
the context of the document.
7. The one or more computer-storage media of claim 4, wherein
generating the search index comprises storing only a portion of the
atoms identified for at least one document in the search index.
8. The one or more computer-storage media of claim 4, wherein
generating the search index comprises generating the at least one
posting list to include postings for only a portion of the
documents identified as containing the atom for the at least one
posting list.
9. The one or more computer storage media of claim 4, wherein tiers
of the at least one posting list are stored on different types of
storage devices.
10. A system for providing search results, the system comprising: a
search engine server, having a processor and a memory configured
for providing computer program instructions to the processor, the
search engine server configured for: accessing document content for
a plurality of documents; identifying atoms within the document
content of each document; determining ranks for atoms found in the
document content of each document, wherein the rank for a given
atom found in a given document comprises a score representing the
importance of the given atom within the context of the given
document; and generating a search index comprising a plurality of
posting lists, each posting list corresponding with an atom
identified within the document content of the plurality of
documents, wherein each posting within a given posting list
corresponding with a particular atom identifies a document
containing the atom and a rank for the document and the atom,
wherein at least one posting list is divided into a plurality of
tiers ordered based on rank, and wherein postings within each tier
are internally ordered based on document, wherein the search index
supports performing additional processing during a search query, to
identify a plurality of documents, based on comparing a combined
rank of a first set of documents identified after a first tier with
a calculated combined rank of the first set of documents of the
first tier and one or more documents of at least a second tier.
11. The system of claim 10, wherein the number of tiers for the at
least one posting list and the number of documents within each tier
of the at least one posting list are determined based on the size
of the document corpus used to generate the search index.
12. The system of claim 10, wherein the rank for a document within
a posting of a posting list for an atom comprises a score
representing the importance of the atom within the context of the
document.
13. The system of claim 10, wherein generating the search index
comprises storing only a portion of the atoms identified for at
least one document in the search index.
14. The system of claim 10, wherein generating the search index
comprises generating the at least one posting list to include
postings for only a portion of the documents identified as
containing the atom for the at least one posting list.
15. The system of claim 10, wherein tiers of the at least one
posting list are stored on different types of storage devices.
16. The system of claim 10, wherein the search engine server is
configured for: receiving a search query comprising one or more
terms; analyzing the search query to identify one or more atoms
from the one or more terms; querying the search index using the one
or more atoms identified from the search query, wherein the search
index comprises a plurality of posting lists, each posting list
corresponding with an atom and including a plurality of postings,
wherein each posting within a given posting list corresponding with
the atom identifies a document containing the atom and a rank
representing a significance of the atom for the document, wherein
at least one posting list is divided into a plurality of tiers
ordered based on rank; identifying the plurality of documents from
querying the search index based on the plurality of tiers ordered
based on rank and the postings within each of the plurality of
tiers; and providing a plurality of search results for presentation
to an end user based on the plurality of documents identified by
querying the search index.
17. The system of claim 16, wherein querying the search index
comprises: determining that two or more atoms are identified from
the search query; identifying posting lists corresponding with the
two or more atoms; and merging a first level of tiers of the
posting lists to obtain a set of ranked documents.
18. The system of claim 17, wherein: if it is determined to not
analyze additional levels of tiers, providing a set of search
results from the set of ranked documents; if it is determined to
analyze additional levels of tiers: (1) repeating: merging a
combination of a next level of tiers of the posting lists to update
the set of ranked documents, and evaluating whether to perform
analysis of additional levels of tiers of the posting lists, until
it is determined to not analyze additional levels of tiers; and (2)
providing a set of search results from the set of ranked documents
when it is determined to not analyze additional levels of
tiers.
19. The system of claim 18, wherein determining whether to perform
analysis of additional levels of tiers comprises evaluating whether
one or more posting lists are long, wherein if one or more long
posting lists are identified then at least one lower level tier
from the one or more long posting lists is ignored and matching
between posting lists is performed by consideration of combinations
of other posting lists such that the at least one lower level tier
from the one or more long posting lists is assumed present and
resulting matches are deemed usefully close to an intent of the
search query.
20. The system of claim 19, wherein the search engine server is
configured for: determining overall ranking values for documents by
assuming presence of one or more atoms corresponding with the one
or more long posting lists in documents from the other posting
lists and assigning a nominal value to compute the overall ranking
values for the documents.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of, and claims priority
from, pending U.S. application Ser. No. 12/951,799, filed Nov. 22,
2010, entitled "TIERING OF POSTING LISTS IN SEARCH ENGINE INDEX,"
which is incorporated herein by reference in its entirety.
BACKGROUND
[0002] The amount of information and content available on the
Internet continues to grow rapidly. Given the vast amount of
information, search engines have been developed to facilitate
searching for electronic documents. In particular, users may search
for information and documents by entering search queries comprising
one or more terms that may be of interest to the user. After
receiving a search query from a user, a search engine identifies
documents and/or web pages that are relevant based on the search
query. Because of its utility, web searching, that is, the process
of finding relevant web pages and documents for user-issued search
queries, has arguably become the most popular service on the
Internet today.
[0003] Search engines operate by crawling documents and indexing
information regarding the documents in a search index. When a
search query is received, the search engine employs the search
index to identify documents relevant to the search query. Use of a
search index in this manner allows for fast retrieval of
information for queries. Without a search index, a search engine
would need to search the corpus of documents to find relevant
results, which would take an unacceptable amount of time.
[0004] As the Internet continues to grow, search engines continue
to index larger numbers of documents. Given a large search index,
some queries may take an amount of time to run that is unacceptable
to users. As a result, search engines often take shortcuts when
querying a search index in order to return search results back to
users in a timely manner. Users often expect to receive
search results within a few hundred milliseconds. To meet this
constraint, some search engines may only partially evaluate search
queries, which may adversely impact the quality of the search
results.
SUMMARY
[0005] This summary is provided to introduce a selection of
concepts in a simplified form that are further described below in
the Detailed Description. This summary is not intended to identify
key features or essential features of the claimed subject matter,
nor is it intended to be used as an aid in determining the scope of
the claimed subject matter.
[0006] Embodiments of the present invention relate to a search
index having tiered posting lists. Each posting list in the search
index corresponds with a particular atom and includes a list of
documents containing that atom. A tiered posting list is one in
which the posting list is divided into tiers with the tiers being
ordered by rank while each tier is internally ordered by document.
When a search query is received, the tiered posting lists allow the
search query to be evaluated in a manner that allows for a number
of efficiencies and precise stopping. In one embodiment, tiers of
posting lists may be sequentially merged while evaluating between
levels of tiers whether additional processing is required based on
the results of tiers already merged.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The present invention is described in detail below with
reference to the attached drawing figures, wherein:
[0008] FIG. 1 is a block diagram of an exemplary computing
environment suitable for use in implementing embodiments of the
present invention;
[0009] FIG. 2 is a diagram illustrating a tiered posting list in
accordance with an embodiment of the present invention;
[0010] FIG. 3 is a diagram illustrating merging tiers from two
posting lists in accordance with an embodiment of the present
invention;
[0011] FIG. 4 is a block diagram of an exemplary system in which
embodiments of the present invention may be employed;
[0012] FIG. 5 is a flow diagram showing a method for creating a
search index having tiered posting lists in accordance with an
embodiment of the present invention;
[0013] FIG. 6 is a flow diagram showing a general method for using
a search index having tiered posting lists to provide search
results in response to a search query in accordance with an
embodiment of the present invention; and
[0014] FIG. 7 is a flow diagram showing a more specific method for
using a search index having tiered posting lists to provide search
results in response to a search query in accordance with an
embodiment of the present invention.
DETAILED DESCRIPTION
[0015] The subject matter of the present invention is described
with specificity herein to meet statutory requirements. However,
the description itself is not intended to limit the scope of this
patent. Rather, the inventors have contemplated that the claimed
subject matter might also be embodied in other ways, to include
different steps or combinations of steps similar to the ones
described in this document, in conjunction with other present or
future technologies. Moreover, although the terms "step" and/or
"block" may be used herein to connote different elements of methods
employed, the terms should not be interpreted as implying any
particular order among or between various steps herein disclosed
unless and except when the order of individual steps is explicitly
described.
[0016] Embodiments of the present invention provide a search index
with tiered posting lists. The search index employed by embodiments
of the present invention indexes higher order primitives or "atoms"
from documents, as opposed to simply indexing single terms. As used
herein, an "atom" may refer to a variety of units of a query or a
document. These units may include, for example, a term, an n-gram,
an n-tuple, a k-near n-tuple, etc. A term maps down to a single
symbol or word as defined by the particular tokenizer technology
being used. A term, in one embodiment, is a single character. In
another embodiment, a term is a single word or grouping of words.
An n-gram is a sequence of "n" number of consecutive or almost
consecutive terms that may be extracted from a document. An n-gram
is said to be "tight" if it corresponds to a run of consecutive
terms and is "loose" if it contains terms in the order they appear
in the document, but the terms are not necessarily consecutive.
Loose n-grams are typically used to represent a class of equivalent
phrases that differ by insignificant words (e.g., "if it rains I'll
get wet" and "if it rains then I'll get wet"). An n-tuple, as used
herein, is a set of "n" terms that co-occur (order independent) in
a document. Further, a k-near n-tuple, as used herein, refers to a
set of "n" terms that co-occur within a window of "k" terms in a
document. Thus, an atom is generally defined as a generalization of
all of the above. Implementations of embodiments of the present
invention may use different varieties of atoms, but as used herein,
the term "atom" generally describes each of the above-described
varieties.
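By way of illustration only, and not limitation, the following
Python sketch shows how two of the atom varieties described above
might be extracted from a tokenized document. The whitespace
tokenizer and the function names tight_ngrams and k_near_ntuples are
assumptions made for this example and are not taken from the
embodiments.

```python
from itertools import combinations


def tight_ngrams(terms, n):
    """Return every run of n consecutive terms as a tuple (tight n-grams)."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def k_near_ntuples(terms, n, k):
    """Return the order-independent sets of n distinct terms that
    co-occur within some window of k consecutive terms."""
    found = set()
    # Slide a window of k terms over the document (one window if it is short).
    for start in range(max(1, len(terms) - k + 1)):
        window = sorted(set(terms[start:start + k]))
        for combo in combinations(window, n):
            found.add(frozenset(combo))
    return found


# Hypothetical tokenizer: lower-cased text split on whitespace.
terms = "if it rains then i will get wet".split()
bigrams = tight_ngrams(terms, 2)
pairs = k_near_ntuples(terms, 2, 3)
```

In this sketch "if" and "rains" form a 3-near 2-tuple because they
fall within one window of three terms, while "if" and "then" do not.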
[0017] Each posting list corresponds with an atom identified in the
corpus of documents indexed. The posting list includes entries each
identifying a document and a rank that comprises a score
representing the importance of the atom in the context of the
document. At least some of the posting lists in the search index
are divided into multiple tiers. The tiers of a given posting list
are ordered by rank. For instance, a first tier may have a number
of documents with ranks above a particular value, a second tier may
have a number of documents with ranks within a range lower than
that particular value, etc. The tiers may be based on
non-overlapping ranges of rank. Additionally, the tiers are each
internally ordered by document. That is, there is a document
sequence that has no necessary relation to rank. For instance, a
numerical identifier may be associated with each document, and the
document sequence may be the sequence of documents in an ascending
numerical identifier order. As such, the tiers are each internally
ordered in accordance with the same document sequence. As will be
described in further detail herein, this facilitates performing
matching between tiers.
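As a minimal sketch of the structure just described, assuming that
tiers are defined by fixed lower bounds on rank, the following Python
code divides the postings for one atom into rank-ordered tiers and
sorts each tier by document identifier. The names Posting,
build_tiers, and lower_bounds, and the sample rank values, are
hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: int
    rank: float


def build_tiers(postings, lower_bounds):
    """Split postings into tiers by rank, then sort each tier by document ID.

    lower_bounds holds the minimum rank of each tier in descending order;
    the last bound should be low enough to catch all remaining postings.
    """
    tiers = [[] for _ in lower_bounds]
    for posting in postings:
        for i, bound in enumerate(lower_bounds):
            if posting.rank >= bound:
                tiers[i].append(posting)  # first (highest) tier that fits
                break
    for tier in tiers:
        tier.sort(key=lambda p: p.doc_id)  # same document sequence in every tier
    return tiers


postings = [
    Posting(7, 0.91), Posting(2, 0.35), Posting(5, 0.88),
    Posting(1, 0.60), Posting(9, 0.10), Posting(3, 0.95),
]
tiers = build_tiers(postings, lower_bounds=[0.8, 0.5, 0.0])
```

Here the first tier holds documents 3, 5, and 7 in ascending
identifier order even though document 3 has the highest rank and
document 5 the lowest of the three.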
[0018] In some embodiments, the index may be built using tiering in
which the index does not include the rank or abbreviates the rank
compared to the values which were the original ranks used to form
the order. In other words, the ranks may be removed or abbreviated
when a list is tiered, or the tiering may be based upon a more
accurate rank originally known but not kept in the actual index as
it is stored and used.
[0019] Employing a search index with tiered posting lists provides
a number of efficiencies when performing searches. For instance,
when a search query is received from which multiple atoms are
identified, the tiers of posting lists may be sequentially merged
while evaluating whether additional processing is required based on
the combined rank of matching documents found in evaluated tiers as
compared to possible combined rank of documents that could be
retrieved in the lower tiers. This allows for precise stopping. The
stopping is precise, in contrast to merely stopping early, because
nothing is lost: all remaining combinations are calculated to be
lower ranked.
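The following Python sketch illustrates one way precise stopping
could work for a query with two atoms. It rests on assumptions that
are not stated in the embodiments: the combined rank of a document is
taken to be the sum of its per-atom ranks, every tier is non-empty,
and the highest rank in an unread tier is available as an upper
bound (found here by scanning the tier, although a stored per-tier
bound would serve the same purpose). For brevity the sketch matches
documents through dictionaries rather than by walking the shared
document sequence. The function name top_k_with_precise_stop is
hypothetical.

```python
def top_k_with_precise_stop(list_a, list_b, k):
    """Merge two tiered posting lists level by level and stop as soon as
    every unread combination is known to rank below the current top k.

    Each list is a sequence of non-empty tiers, best tier first; each tier
    is a list of (doc_id, rank) pairs. Returns (ranked, levels_read).
    """
    # Tiers are ordered by rank, so each list's best rank is in its first tier.
    best_a = max(rank for _, rank in list_a[0])
    best_b = max(rank for _, rank in list_b[0])
    seen_a, seen_b = {}, {}
    ranked = []
    depth = max(len(list_a), len(list_b))
    for level in range(depth):
        if level < len(list_a):
            seen_a.update(list_a[level])
        if level < len(list_b):
            seen_b.update(list_b[level])
        matches = seen_a.keys() & seen_b.keys()
        ranked = sorted(
            ((seen_a[d] + seen_b[d], d) for d in matches), reverse=True
        )[:k]
        # Any match not yet found has a posting in an unread tier of at
        # least one list, which bounds its combined rank from above.
        bounds = []
        if level + 1 < len(list_a):
            bounds.append(max(r for _, r in list_a[level + 1]) + best_b)
        if level + 1 < len(list_b):
            bounds.append(max(r for _, r in list_b[level + 1]) + best_a)
        if not bounds or (len(ranked) == k and ranked[-1][0] > max(bounds)):
            return ranked, level + 1
    return ranked, depth


list_a = [[(1, 0.9), (4, 0.95), (8, 0.85)], [(2, 0.5), (6, 0.6)], [(3, 0.1), (7, 0.2)]]
list_b = [[(4, 0.9), (8, 0.8)], [(1, 0.4), (6, 0.5)], [(2, 0.1)]]
top2, levels_for_2 = top_k_with_precise_stop(list_a, list_b, 2)
top3, levels_for_3 = top_k_with_precise_stop(list_a, list_b, 3)
```

With this sample data the two best documents are settled after the
first level of tiers, while asking for three documents requires
reading one further level before the bound rules out the rest.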
[0020] In some embodiments, tiered posting lists may be employed to
perform a "soft-AND" operation. For instance, if atom A is
reasonably common, the atom has less significance and a long
posting list. As such, the process may choose not to read atom A's
lower-ranked tiers if it is determined that those lower-ranked
tiers would be much less significant than the other atoms
that the process is trying to match with atom A. Accordingly, the
process may soft-AND atom A when far enough into the operation that
atom A's lower tiers would be evaluated but other higher-ranked
atoms are dominating the matching process. A nominal value may be
applied to the soft-AND atom such that the overall ranking scores
for documents found in the posting lists for the other atoms are
computed using the nominal value. In still further embodiments of
the present invention, tiered posting lists allow a first tier of a
posting list to act as a "cut-index" in the case of single atom
queries.
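A minimal sketch of the soft-AND operation follows, assuming that
the overall ranking score is the sum of per-atom ranks and that
ranks are integers. The function name soft_and_scores, its
parameters, and the sample values are hypothetical. Documents that
match the other atoms but are absent from the tiers actually read
for the common atom are assumed to contain it and receive the
nominal value.

```python
def soft_and_scores(required, soft_tiers_read, nominal):
    """Score documents that match every atom in `required`, treating one
    common atom as a soft-AND.

    required: list of dicts mapping doc_id -> rank, one per ordinary atom.
    soft_tiers_read: dict doc_id -> rank covering only the tiers of the
        common atom that were actually read.
    nominal: rank assumed for the common atom when a document does not
        appear in the tiers read.
    """
    matching = set(required[0])
    for postings in required[1:]:
        matching &= set(postings)  # ordinary AND across the other atoms
    scores = {}
    for doc in matching:
        base = sum(postings[doc] for postings in required)
        # The common atom is assumed present; use its real rank if known.
        scores[doc] = base + soft_tiers_read.get(doc, nominal)
    return scores


required = [{1: 80, 2: 60, 5: 70}, {2: 90, 5: 40, 9: 30}]
scores = soft_and_scores(required, soft_tiers_read={5: 55}, nominal=10)
```

Document 2 never appears in the tiers read for the common atom, yet
it is retained and scored with the nominal value rather than being
dropped from the result.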
[0021] Accordingly, in one aspect, an embodiment of the present
invention is directed to one or more computer storage media storing
a data structure for a search index. The data structure includes a
plurality of posting lists, each posting list being associated with
a different atom and including a plurality of postings, each
posting within each posting list corresponding with a given
document and identifying the given document and a rank for the
given document, wherein each posting list is divided into a
plurality of tiers, the tiers within a posting list being ordered
by rank while the postings within each tier being internally
ordered by document sequence.
[0022] In another embodiment, an aspect of the present invention is
directed to one or more computer storage media storing computer
useable instructions that, when used by a computing device, cause
the computing device to perform a method. The method includes
accessing document content for a plurality of documents. The method
also includes identifying atoms within the document content of each
document. The method further includes determining ranks for atoms
found in the document content of each document, wherein the rank
for a given atom found in a given document comprises a score
representing the importance of the given atom within the context of
the given document. The method still further includes generating a
search index comprising a plurality of posting lists, each posting
list corresponding with an atom identified within the document
content of the plurality of documents, wherein each posting within
a given posting list corresponding with a particular atom
identifies a document containing the atom and a rank for the
document and the atom, wherein each posting list is divided into a
plurality of tiers ordered based on rank, and wherein postings
within each tier are internally ordered based on document.
[0023] A further embodiment is directed to one or more computer
storage media storing computer useable instructions that, when used
by a computing device, cause the computing device to perform a
method. The method includes receiving a search query comprising one
or more terms. The method also includes analyzing the search query
to identify one or more atoms from the one or more terms. The
method further includes querying a search index using the one or
more atoms identified from the search query, wherein the search
index comprises a plurality of posting lists, each posting list
corresponding with an atom and including a plurality of postings,
wherein each posting within a given posting list corresponding with
an atom identifies a document containing the atom and a rank
representing a significance of the atom for the document, wherein
each posting list is divided into a plurality of tiers ordered
based on rank, and wherein postings within each tier are internally
ordered based on document. The method also includes identifying a
plurality of documents from querying the search index. The method
still further includes providing a plurality of search results for
presentation to an end user based on the plurality of documents
identified by querying the search index.
[0024] Having briefly described an overview of embodiments of the
present invention, an exemplary operating environment in which
embodiments of the present invention may be implemented is
described below in order to provide a general context for various
aspects of the present invention. Referring initially to FIG. 1 in
particular, an exemplary operating environment for implementing
embodiments of the present invention is shown and designated
generally as computing device 100. Computing device 100 is but one
example of a suitable computing environment and is not intended to
suggest any limitation as to the scope of use or functionality of
the invention. Neither should the computing device 100 be
interpreted as having any dependency or requirement relating to any
one or combination of components illustrated.
[0025] The invention may be described in the general context of
computer code or machine-useable instructions, including
computer-executable instructions such as program modules, being
executed by a computer or other machine, such as a personal data
assistant or other handheld device. Generally, program modules,
including routines, programs, objects, components, data structures,
etc., refer to code that performs particular tasks or implements
particular abstract data types. The invention may be practiced in a
variety of system configurations, including hand-held devices,
consumer electronics, general-purpose computers, more specialty
computing devices, etc. The invention may also be practiced in
distributed computing environments where tasks are performed by
remote-processing devices that are linked through a communications
network.
[0026] With reference to FIG. 1, computing device 100 includes a
bus 110 that directly or indirectly couples the following devices:
memory 112, one or more processors 114, one or more presentation
components 116, input/output (I/O) ports 118, input/output
components 120, and an illustrative power supply 122. Bus 110
represents what may be one or more busses (such as an address bus,
data bus, or combination thereof). Although the various blocks of
FIG. 1 are shown with lines for the sake of clarity, in reality,
delineating various components is not so clear, and metaphorically,
the lines would more accurately be grey and fuzzy. For example, one
may consider a presentation component such as a display device to
be an I/O component. Also, processors have memory. The inventors
recognize that such is the nature of the art, and reiterate that
the diagram of FIG. 1 is merely illustrative of an exemplary
computing device that can be used in connection with one or more
embodiments of the present invention. Distinction is not made
between such categories as "workstation," "server," "laptop,"
"hand-held device," etc., as all are contemplated within the scope
of FIG. 1 and reference to "computing device."
[0027] Computing device 100 typically includes a variety of
computer-readable media. Computer-readable media can be any
available media that can be accessed by computing device 100 and
includes both volatile and nonvolatile media, removable and
non-removable media. By way of example, and not limitation,
computer-readable media may comprise computer storage media and
communication media. Computer storage media includes both volatile
and nonvolatile, removable and non-removable media implemented in
any method or technology for storage of information such as
computer-readable instructions, data structures, program modules or
other data. Computer storage media includes, but is not limited to,
RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM,
digital versatile disks (DVD) or other optical disk storage,
magnetic cassettes, magnetic tape, magnetic disk storage or other
magnetic storage devices, or any other medium which can be used to
store the desired information and which can be accessed by
computing device 100. Communication media typically embodies
computer-readable instructions, data structures, program modules or
other data in a modulated data signal such as a carrier wave or
other transport mechanism and includes any information delivery
media. The term "modulated data signal" means a signal that has one
or more of its characteristics set or changed in such a manner as
to encode information in the signal. By way of example, and not
limitation, communication media includes wired media such as a
wired network or direct-wired connection, and wireless media such
as acoustic, RF, infrared and other wireless media. Combinations of
any of the above should also be included within the scope of
computer-readable media.
[0028] Memory 112 includes computer-storage media in the form of
volatile and/or nonvolatile memory. The memory may be removable,
non-removable, or a combination thereof. Exemplary hardware devices
include solid-state memory, hard drives, optical-disc drives, etc.
Computing device 100 includes one or more processors that read data
from various entities such as memory 112 or I/O components 120.
Presentation component(s) 116 present data indications to a user or
other device. Exemplary presentation components include a display
device, speaker, printing component, vibrating component, etc.
[0029] I/O ports 118 allow computing device 100 to be logically
coupled to other devices including I/O components 120, some of
which may be built in. Illustrative components include a
microphone, joystick, game pad, satellite dish, scanner, printer,
wireless device, etc.
[0030] Turning now to FIG. 2, a diagram is provided that
illustrates a tiered posting list 200 in accordance with an
embodiment of the present invention. The posting list 200 shown in
FIG. 2 is for a given atom "X." The posting list 200 generally
includes a list of postings. Each posting identifies a document
that includes atom X and an indication of the rank for that
document. Each document may be identified within a posting using a
document identifier. In the embodiment shown in FIG. 2, a numerical
identifier is employed to identify each document.
[0031] The rank assigned to a given document is a score that
reflects the importance of atom X within the context of that
document. Any number of algorithms may be employed to assign a rank
to a given document for a given atom. By way of example only, a
document's rank may be a score based on term-frequency
inverse-document frequency (TF/IDF) functions as known in the art.
For instance, the document rank may be a score generated using the
BM25F ranking function. In the exemplary embodiment, a higher
ranking reflects a greater importance (although the inverse may be
employed in some embodiments). As such, documents with higher ranks
correspond with documents that are considered to have higher
relevance for atom X. In some embodiments, a rank for a given
posting may take into account the overall importance of the
atom.
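By way of example only, the following Python sketch assigns ranks
using a plain term-frequency inverse-document-frequency product. It
is a simplified stand-in chosen for illustration and is not the
BM25F function; the function name tf_idf_ranks and the sample
documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_ranks(docs):
    """Compute a simple TF-IDF rank for every (atom, document) pair.

    docs: dict mapping doc_id -> list of atoms found in that document.
    Returns a dict mapping atom -> {doc_id: rank}.
    """
    n_docs = len(docs)
    # Number of documents in which each atom appears at least once.
    doc_freq = Counter(atom for atoms in docs.values() for atom in set(atoms))
    ranks = {}
    for doc_id, atoms in docs.items():
        for atom, count in Counter(atoms).items():
            tf = count / len(atoms)                    # share of the document
            idf = math.log(n_docs / doc_freq[atom])    # rarity in the corpus
            ranks.setdefault(atom, {})[doc_id] = tf * idf
    return ranks


docs = {
    1: ["search", "index", "search"],
    2: ["index", "tier"],
    3: ["search", "tier", "tier", "rank"],
}
ranks = tf_idf_ranks(docs)
```

In this sketch document 1 receives a higher rank than document 3
for the atom "search" because the atom makes up a larger share of
document 1.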
[0032] As shown in FIG. 2, the posting list 200 for atom X is
divided into three tiers, tier 1 202, tier 2 204, and tier 3 206.
The tiers are divided based on document ranks. In particular, tier
1 202 includes documents with the highest ranks for atom X, tier 2
204 includes documents with the next highest ranks for atom X, and
tier 3 206 includes documents with the next highest ranks for atom
X. Although the tiers are ordered based on document rank, the
postings within each tier are internally ordered based on document
ID as opposed to rank. This is illustrated in portion 208 of tier 1
and portion 210 of tier 2 shown in FIG. 2. In particular, the
postings are listed in order of document identifiers, not rank.
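By way of illustration only, the layout of FIG. 2 may be sketched in Python as follows. The function name build_tiered_posting_list, the tier_sizes parameter, the use of integer ranks, and the tie-breaking rule are assumptions made for this sketch and are not taken from the embodiments described herein.

```python
from typing import Dict, List, Tuple

Posting = Tuple[int, float]  # (document ID, rank of the atom in that document)


def build_tiered_posting_list(ranks: Dict[int, float],
                              tier_sizes: List[int]) -> List[List[Posting]]:
    """Split postings into tiers by rank, then sort each tier by document ID.

    `ranks` maps document ID -> rank for one atom. `tier_sizes` gives the
    number of postings per tier (hypothetical parameter). Postings beyond
    the last tier are dropped.
    """
    # Highest rank first; ties broken by document ID (an assumption).
    by_rank = sorted(ranks.items(), key=lambda p: (-p[1], p[0]))
    tiers: List[List[Posting]] = []
    start = 0
    for size in tier_sizes:
        chunk = by_rank[start:start + size]
        if not chunk:
            break
        # Tiers are ordered by rank, but each tier is ordered by document ID.
        tiers.append(sorted(chunk))
        start += size
    return tiers
```

In this sketch the boundaries between tiers follow from fixed tier sizes; boundaries based on rank thresholds would serve equally well.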
[0033] When constructing a posting list for a search index, the
number of tiers to include in the posting list and the number of
documents to include within each tier are configurable and may be
determined in a number of different manners within the scope of
embodiments of the present invention. Additionally, the number of
tiers and number of documents per tier may vary from posting list
to posting list within the same search index. In some embodiments,
the number of tiers to include in a posting list and the number of
documents to include in each tier may be based on factors such as
the number of documents indexed in the search index, the number of
documents containing the atom, and the statistics surrounding the
likelihood of finding matching documents from the posting list.
These factors may also be employed for pruning when constructing
posting lists. In particular, all documents containing a particular
atom may not be indexed in the posting list for the atom. Instead,
documents having the lowest rank for the atom may be pruned when
the posting list is constructed such that they are not included in
the posting list.
[0034] In some embodiments, the tiers of a posting list may be
stored on different types of storage devices. For instance, tiers
having higher ranked documents that are accessed more frequently
could be stored on faster types of storage devices, such as in RAM
or flash-based solid state devices. Lower tiers having lower ranked
documents that are accessed less frequently could be stored on
slower types of storage devices, such as hard disk drives.
[0035] Employing an index with posting lists having tiers ordered
by rank while internally ordered within the tiers by document
provides a number of efficiencies in returning results for search
queries. For instance, because postings are internally ordered by
document, rapid merge joins may be performed between two or more
posting lists to identify matching documents. In principle, two or
more posting lists may be merged as inner products of tiers.
Additionally, by employing tiers and having known ranking
thresholds between the tiers, a precise stop can be determined
during the process of matching documents. For instance, the process
may proceed by iteratively merging posting lists at a given level
of tiers and determining whether the next level of tiers needs to
be employed by evaluating whether a sufficient number of documents
have been identified with a combined rank that exceeds the highest
combined rank that could be achieved using lower level tiers.
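By way of illustration only, a merge join between two tiers that are each ordered by document ID may be sketched as follows. The function name merge_join is hypothetical, and combining ranks by addition is an assumption; the description does not fix how a combined rank is computed.

```python
def merge_join(tier_x, tier_y):
    """Intersect two tiers sorted by document ID in a single linear pass.

    Each tier is a list of (doc_id, rank) tuples in ascending doc_id order.
    Returns (doc_id, combined_rank) for documents present in both tiers.
    """
    i = j = 0
    matches = []
    while i < len(tier_x) and j < len(tier_y):
        doc_x, rank_x = tier_x[i]
        doc_y, rank_y = tier_y[j]
        if doc_x == doc_y:
            # Combined rank as a sum is an assumption of this sketch.
            matches.append((doc_x, rank_x + rank_y))
            i += 1
            j += 1
        elif doc_x < doc_y:
            i += 1  # doc_x cannot appear later in tier_y; skip it
        else:
            j += 1  # doc_y cannot appear later in tier_x; skip it
    return matches
```

Because both inputs are ordered by document ID, the join visits each posting at most once and needs no hash table or sorting step at query time.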
[0036] By way of illustration, FIG. 3 provides a diagram that
illustrates analysis of two posting lists in response to a search
query. Suppose, for instance, that a search query is received that
contains two atoms: atom X and atom Y. To identify search results
for the query, a posting list for atom X 302 and a posting list for
atom Y 304 are accessed. Each of the posting lists 302 and 304 in
the present example is broken into three tiers: a first tier, a
second tier, and a third tier. Initially, the first tier from the
atom X posting list 302 and the first tier from the atom Y posting
list 304 are merged to generate combined ranks for documents within
the first tiers, as shown at 306. An analysis may then be performed
to determine if the process may stop. For instance, the search
engine may be seeking to return the top N search results identified
from the search index. If the top N documents from merging the
first tiers have a lowest rank that is greater than the highest
rank that could be achieved by using the next level of tiers (i.e.,
the second tiers from the posting lists 302 and 304), it is
mathematically impossible for any remaining coincidences to have a
ranking higher than the results already found. Therefore, the
process may stop as the top N documents have been retrieved
already.
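By way of illustration only, the stopping test described above may be sketched as follows. The names remaining_upper_bound and can_stop are hypothetical, and the sketch assumes that a combined rank is the sum of the per-atom ranks and that the maximum rank of each tier is known.

```python
import heapq


def remaining_upper_bound(max_x, max_y, merged_pairs):
    """Best combined rank that any not-yet-merged tier pair could produce.

    max_x[i] and max_y[j] are the highest ranks in tier i of atom X and
    tier j of atom Y. merged_pairs is the set of (i, j) already merged.
    """
    bounds = [max_x[i] + max_y[j]
              for i in range(len(max_x))
              for j in range(len(max_y))
              if (i, j) not in merged_pairs]
    return max(bounds) if bounds else float("-inf")


def can_stop(combined, n, bound):
    """True when N results exist and the weakest of them beats the bound."""
    if n <= 0 or len(combined) < n:
        return False
    top_n = heapq.nlargest(n, combined, key=lambda p: p[1])
    return top_n[-1][1] > bound
```

For example, with tier maxima of 100, 60, 30 for atom X and 90, 50, 20 for atom Y, and only the first tiers merged, no remaining pair can exceed 150, so two results with combined ranks of 170 and 160 permit a stop.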
[0037] However, if it is determined that the process should
continue as results from lower tiers may outrank results from the
first tiers, additional tiers may be merged. For instance, the
first tier from the atom X posting list 302 may be merged with the
second tier from the atom Y posting list 304, and the second tier
from the atom X posting list 302 may be merged with the first tier
from the atom Y posting list 304, as shown at 308. Again, the
results are analyzed against the known ranking thresholds for the
remaining tiers to determine if the process may be stopped. If the
process continues, the second tiers of the posting lists 302 and
304 may next be merged, as shown at 310, and an evaluation
performed to determine if the process may stop. The merging and
stop evaluating process may continue until a stop is identified or
until all tiers have been evaluated.
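By way of illustration only, the order in which tier pairs are merged may be sketched as follows. The function name tier_pair_schedule is hypothetical. The first three steps follow the order shown at 306, 308, and 310 of FIG. 3; the order of any later steps is an assumption of this sketch.

```python
from itertools import groupby, product


def tier_pair_schedule(num_tiers_x, num_tiers_y):
    """Group (tier of X, tier of Y) pairs into merge steps, 1-based.

    Pairs are ordered by the deeper of the two tiers, then by the sum of
    the tier numbers, which yields (1,1), then (1,2) and (2,1), then (2,2).
    """
    def step_key(pair):
        return (max(pair), pair[0] + pair[1])

    pairs = product(range(1, num_tiers_x + 1), range(1, num_tiers_y + 1))
    ordered = sorted(pairs, key=step_key)
    return [list(group) for _, group in groupby(ordered, key=step_key)]
```

A stop evaluation would be performed after each step in the returned list.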
[0038] Tiered posting lists also allow the search engine to employ
a "soft-AND" operator in some cases of matching documents from
posting lists. In particular, when merging tiers from two posting
lists, there may be some instances in which there are very few or no
coincidences for the two atoms. This may be a result of one of the
atoms being very rare. However, rarity indicates significance and
that logically means that the ranks of the postings for the very
rare atom likely hugely outrank anything in the posting list for
the other atom. In that case, the system may ignore the remaining
postings for the more common atom and just replace them by a
substitute value (a function of the highest value the posting could
have had and yet been discarded) and do a "soft-AND" to move the
best ranked documents on the posting list of the very rare atom
into the retained candidate list even when the system doesn't know
if those documents would have had an intersection with remaining
postings in the posting list of the more common atom.
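By way of illustration only, a "soft-AND" may be sketched as follows. The function name soft_and and its parameters are hypothetical, and the sketch assumes that ranks are combined by addition and that the substitute value has already been derived from the highest rank an unread posting of the more common atom could have had.

```python
def soft_and(rare_postings, common_seen, substitute_rank):
    """Score the rare atom's postings without requiring a confirmed match.

    rare_postings: list of (doc_id, rank) for the very rare atom.
    common_seen: dict of doc_id -> rank for postings of the common atom
        that were actually read before the remainder was ignored.
    substitute_rank: value used in place of an unread posting.
    """
    results = []
    for doc_id, rare_rank in rare_postings:
        # Use the real rank when known; otherwise assume presence.
        common_rank = common_seen.get(doc_id, substitute_rank)
        results.append((doc_id, rare_rank + common_rank))
    # Best combined rank first; document ID breaks ties.
    results.sort(key=lambda p: (-p[1], p[0]))
    return results
```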
[0039] Having tiered posting lists in a search index also allows a
first tier of a posting list to act as a "cut-index" when a single
atom query is received. When a search query is received from which
only a single atom is identified, the posting list for that atom
may be identified. Since matching with another posting list is not
required, only the first tier with the highest ranked documents
needs to be retrieved. Search results may then be generated based
on the documents within the first tier.
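By way of illustration only, the "cut-index" case may be sketched as follows. The function name single_atom_results is hypothetical, and the sketch assumes the posting list is represented as a list of tiers, each a list of (document ID, rank) pairs.

```python
import heapq


def single_atom_results(tiers, n):
    """Return the top-n postings of the first tier, highest rank first.

    Lower tiers are never read, since every posting in them ranks below
    the postings of the first tier.
    """
    first_tier = tiers[0] if tiers else []
    return heapq.nlargest(n, first_tier, key=lambda p: p[1])
```

If fewer than n postings exist in the first tier, this sketch simply returns them all; reading further tiers in that case is left out for brevity.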
[0040] Referring now to FIG. 4, a block diagram is provided
illustrating an exemplary system 400 in which embodiments of the
present invention may be employed. It should be understood that
this and other arrangements described herein are set forth only as
examples. Other arrangements and elements (e.g., machines,
interfaces, functions, orders, and groupings of functions, etc.)
can be used in addition to or instead of those shown, and some
elements may be omitted altogether. Further, many of the elements
described herein are functional entities that may be implemented as
discrete or distributed components or in conjunction with other
components, and in any suitable combination and location. Various
functions described herein as being performed by one or more
entities may be carried out by hardware, firmware, and/or software.
For instance, various functions may be carried out by a processor
executing instructions stored in memory.
[0041] Among other components not shown, the system 400 may include
a user device 402, content server 404, and search engine server
406. Each of the components shown in FIG. 4 may be any type of
computing device, such as computing device 100 described with
reference to FIG. 1, for example. The components may communicate
with each other via a network 408, which may include, without
limitation, one or more local area networks (LANs) and/or wide area
networks (WANs). Such networking environments are commonplace in
offices, enterprise-wide computer networks, intranets, and the
Internet. It should be understood that any number of user devices,
content servers, and search engine servers may be employed within
the system 400 within the scope of the present invention. Each may
comprise a single device or multiple devices cooperating in a
distributed environment. For instance, the search engine server 406
may comprise multiple devices arranged in a distributed environment
that collectively provide the functionality of the search engine
server 406 described herein. Additionally, other components not
shown may also be included within the system 400.
[0042] The search engine server 406 generally operates to receive
search queries from user devices, such as the user device 402, and
to provide search results in response to the search queries. The
search engine server 406 includes, among other things, an indexing
component 410, a user interface component 412, a query
reformulation component 414, and an index querying component
416.
[0043] The indexing component 410 operates to index data regarding
documents maintained by content servers, such as the content server
404. For instance, a crawling component (not shown) may be employed
to crawl content servers and access information regarding documents
maintained by the content servers. The indexing component 410 then
indexes data regarding the crawled documents in the search index
418. In embodiments, the indexing component 410 indexes atoms found
in documents and the documents' context, references, and other
context. For instance, atoms can be found in not only the document
content but can also be found outside of the document. For
instance, atoms can originate in how the document is summarized in
other places where links are found, in searches where the document
was known to be used, in URLs, in titles, as well as other sources.
The indexing component 410 also indexes scoring information for
documents for which each atom is found indicating the importance of
the atom in the context of the document. Any number of algorithms
may be employed to calculate a rank for an atom found in a
document. By way of example only, the rank may be a score based on
term-frequency inverse-document frequency (TF/IDF) functions as
known in the art. For instance, the BM25F ranking function may be
employed. The scores generated for document/atom pairs are stored
as ranks in the search index 418.
[0044] In embodiments, the indexing component 410 analyzes each
document to identify terms, n-grams, and n-tuples and to determine
which of these atoms should be indexed for the document. During
processing of documents to be indexed, statistics about query
distribution, term distribution, and/or the scoring function used
to calculate the score/significance for the document as a whole may
be used to statistically select the best set of atoms to represent
the document. These selected atoms are indexed in the search index
418 with the computed ranks.
[0045] When generating the search index 418, the indexing component
410 creates posting lists having tiers. As discussed above with
reference to FIG. 2, a posting list may have multiple tiers with
each tier having a number of documents. The tiers are ordered by
rank (i.e., the first tier having documents with ranks above a
first threshold, the second tier having documents with ranks within
a given range below the first threshold, etc.). Additionally, the
tiers are internally ordered by document sequence. Not all posting
lists are necessarily tiered in the search index 418. For instance,
a posting list for an atom may be short enough such that only a
single tier is used. The number of tiers to use for any given atom
and the number of documents to include in each tier may be
configurable and based on factors such as the number of documents
indexed in the search index, the number of documents containing the
atom, and the statistics surrounding the likelihood of finding
matching documents from the posting list. Furthermore, when
generating the search index 418, the indexing component 410 may
decide to reduce the number of atoms indexed for each document and
may also limit the number of documents indexed for each atom.
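By way of illustration only, tiering by rank thresholds may be sketched as follows. The function name tier_by_thresholds and the treatment of a rank exactly equal to a threshold are assumptions of this sketch.

```python
def tier_by_thresholds(postings, thresholds):
    """Assign postings to tiers using descending rank thresholds.

    postings: iterable of (doc_id, rank) pairs with unique doc_ids.
    thresholds: descending list, e.g. [80, 40] gives three tiers:
        rank > 80, then 40 < rank <= 80, then rank <= 40.
    """
    tiers = [[] for _ in range(len(thresholds) + 1)]
    # Visiting postings in document ID order keeps every tier
    # internally ordered by document ID.
    for doc_id, rank in sorted(postings):
        index = sum(1 for t in thresholds if rank <= t)
        tiers[index].append((doc_id, rank))
    return tiers
```

A short posting list falls out of this naturally: with an empty thresholds list, all postings land in a single tier.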
[0046] The user interface component 412 provides an interface to
user devices, such as the user device 402, that allows users to
submit search queries to the search engine server 406 and to
receive search results from the search engine server 406. The user
device 402 may be any type of computing device employed by a user
to submit search queries and receive search results. By way of
example only and not limitation, the user device 402 may be a
desktop computer, a laptop computer, a tablet computer, a mobile
device, or other type of computing device. The user device 402 may
include an application that allows a user to enter a search query
and submit the search query to the search engine server 406 to
retrieve search results. For instance, the user device 402 may
include a web browser that includes a search input box or allows a
user to access a search page to submit a search query. Other
mechanisms for submitting search queries to search engines are
contemplated to be within the scope of embodiments of the present
invention.
[0047] When a search query is received via the user interface
component 412, the query reformulation component 414 operates to
reformulate the query. The query is reformulated from its free text
form into a format that facilitates querying the search index 418
based on how data is indexed in the search index 418. In
embodiments, the terms of the search query are analyzed to identify
atoms that may be used to query the search index 418. The atoms may
be identified using similar techniques that were used to identify
atoms in documents when indexing the documents in the search index
418. For instance, atoms may be identified based on the statistics
of terms and query distribution information. The query
reformulation component 414 may provide a set of conjunctions of
atoms and cascading variants of these atoms. The atoms may be from
terms literal to the query and terms inferred from various
paraphrasings or alterations of the query. For instance, this may
include generating terms such as synonyms, plurals, or corrections
which are not actually within the query, but which the query is
expanded to encompass.
[0048] The index querying component 416 operates to query the
search index 418 using the atoms identified by the query
reformulation component 414. Conceptually, the index querying component
416 may merge posting lists for the atoms as inner products of the
tiers. This may be an iterative process in which an evaluation is
performed before moving on to a next level of tiers to determine if
the process may be stopped. In cases of single atom queries, the
index querying component 416 may simply retrieve the top tier of
the posting list for the single atom. Further, in cases in which
there are multiple atoms, one of which has a very long posting list
(e.g., when compared with the other atoms), the index querying
component 416 may employ a "soft-AND" operator to allow the lower
tier of the longest atoms to be replaced by an assumption of
presence, allowing more rare and significant atoms of the query to
dominate the matching calculation.
[0049] Turning next to FIG. 5, a flow diagram is provided that
illustrates a method 500 for generating an index with tiered
posting lists in accordance with an embodiment of the present
invention. As shown at block 502, document content is accessed from
documents to be indexed. For instance, document content may be
accessed by crawling documents. The document content for a document
may include a stream of terms found in the document. Atoms are
identified from the document content, as shown at block 504. As
noted above, the process may include analyzing the text of the
document to identify terms, n-grams, and n-tuples and to determine
which of these atoms should be indexed for the document. Statistics
about query distribution, term distribution, and/or the simplified
scoring function to be used during the funnel process may be used
to statistically select the best set of atoms to represent the
document.
[0050] Ranks are calculated for atoms found in a given document, as
shown at block 506. A rank for a given atom found in a document
comprises a score representing the importance of the atom within
the context of the document. A number of different algorithms may
be employed to determine the rank assigned to a document and atom
within the scope of embodiments of the present
invention. By way of example only, the score may be based on
term-frequency inverse-document frequency (TF/IDF) functions as
known in the art. For instance, the BM25F ranking function may be
employed. Pruning may be done to reduce the number of atoms that
are indexed for a given document. For instance, in some
embodiments, only a predetermined number of atoms may be indexed
for a given document. In other embodiments, only atoms in which the
rank determined for the document is above a certain threshold may
be indexed for that document.
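By way of illustration only, the two pruning options described above may be sketched as follows. The function name prune_atoms and the parameters max_atoms and min_rank are hypothetical.

```python
def prune_atoms(atom_ranks, max_atoms=None, min_rank=None):
    """Select which atoms of one document to index.

    atom_ranks: dict of atom -> rank for a single document.
    max_atoms: keep at most this many atoms, highest rank first.
    min_rank: keep only atoms whose rank is above this threshold.
    """
    # Highest rank first; the atom itself breaks ties (an assumption).
    kept = sorted(atom_ranks.items(), key=lambda a: (-a[1], a[0]))
    if min_rank is not None:
        kept = [(atom, rank) for atom, rank in kept if rank > min_rank]
    if max_atoms is not None:
        kept = kept[:max_atoms]
    return dict(kept)
```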
[0051] The number of tiers for a given posting list and the number
of documents per tier are identified at block 508. As noted above,
the number of tiers to include in a posting list and the number of
documents to include within each tier are configurable. In
embodiments, the number of tiers to include in a posting list and
the number of documents to include in each tier may be based on
factors such as the number of documents indexed in the search
index, the number of documents containing the atom, and the
statistics surrounding the likelihood of finding matching documents
from the posting list.
[0052] Posting lists are generated for atoms, as shown at block
510. Each posting list corresponds with a particular atom. Each
posting in a posting list for a given atom identifies a document
and a rank representing the importance of the atom in the context
of the document. In accordance with embodiments of the present
invention, posting lists are constructed with tiers, although some
posting lists may only have a single tier (e.g., short posting
lists). A tiered posting list includes the number of tiers
determined for the posting list, with each tier containing the
determined number of documents. The tiers are ordered against one
another by rank while internally being ordered by document. When
constructing a posting list for a given atom, documents containing
the atom may be pruned. For instance, documents having the lowest
rank for the atom may not be included in the posting list. The
determination of how many documents to include in the posting list
and how many documents to prune may be based on factors similar to
those used to determine how many tiers to employ and how many
documents to include in each tier.
[0053] With reference now to FIG. 6, a flow diagram is provided
that illustrates a general method 600 for using a search index with
tiered posting lists to provide search results in accordance with
an embodiment of the present invention. As shown at block 602, a
search query is received. The search query is analyzed at block 604
to identify one or more atoms. In some embodiments, this analysis
may be similar to the analysis used to identify atoms in documents
when indexing document data. For instance, statistics of terms and
search queries may be employed to identify atoms in the search
query.
[0054] A search index having tiered posting lists is queried to
identify relevant documents, as shown at block 606. This may
include identifying the posting list associated with each atom
identified from the search query. Additionally, tiers from the
posting lists are queried until a sufficient threshold is reached.
This may include querying only a single tier or querying multiple
tiers depending on the atoms identified from the search query and
the confluence of the tiers from the posting lists. The process may
be stopped when a sufficient number of documents have been
identified and/or when it is determined that no other matches will
have sufficient relevance.
[0055] As a result of querying the search index, a number of
documents are identified as having the highest relevance to the
search query, as shown at block 608. A set of search results is
generated based on the identified documents and provided to the end
user who submitted the original search query, as shown at block
610.
[0056] Turning next to FIG. 7, a flow diagram is provided that
illustrates a more specific method 700 for employing a search index
with tiered posting lists to provide search results to a search
query in accordance with an embodiment of the present invention. As
shown at block 702, a search query is received. The received search
query is reformulated at block 704. In particular, the search query
may include a number of terms. The terms of the search query are
analyzed to identify one or more atoms that will be used to query
the search index. As indicated above, this analysis may be similar
to the analysis used to identify atoms in documents when indexing
document data. For instance, statistics of terms and search queries
may be employed to identify atoms in the search query. The
reformulated query may comprise a set of conjunctions of atoms and
cascading variants of these.
[0057] A determination is made at block 706 regarding whether the
search query comprises a single atom. If so, a "cut-index" approach
as discussed above may be employed to identify search results. As
shown at block 708, a posting list corresponding with the single
atom from the search query is identified. The first tier from that
posting list is retrieved at block 710. A set of ranked search
results is then provided based on documents identified in the
first tier for presentation to an end user, as shown at block 712.
In some embodiments, the set of ranked search results is retrieved
based simply on the rank associated with each posting in the
posting list. For instance, the search engine may be configured to
provide search results corresponding with the top N documents. As
such, postings with the top N ranks are identified, and the search
results are ordered by the ranks stored in the posting list. As
another example, the search engine may be configured to provide
search results for all documents above a predetermined rank
threshold. If so, all postings having a rank above the threshold
are identified, and the search results are ordered by the ranks
stored in the posting list. In some embodiments, the search engine
may employ a staged process to select search results for a search
query, such as the staged approach described in U.S. patent
application Ser. No. 12/951,528 (Attorney Docket Number
MFCP.157120), entitled "MATCHING FUNNEL FOR LARGE DOCUMENT INDEX."
In such embodiments, a number of documents are selected from the
first tier of the posting list based on the rank of the documents
and used as candidates for further processing and selection of the
set of ranked search results. For instance, in one embodiment, a
preliminary score may be generated for each document in the first
tier using a simplified scoring function. Candidate documents are
then selected based on the preliminary scores. The candidate
documents are processed using a final ranking algorithm to
determine a set of ranked documents from which the set of search
results is generated.
[0058] Returning to block 706, if two or more atoms are identified
from the search query, posting lists corresponding with those atoms
are identified at block 714. The first tiers from the posting lists
are merged at block 716. This provides a combined rank for each
document within the first tiers based on the individual rank of
each document within each posting list. A determination is made at
block 718 regarding whether the process of merging tiers may be
stopped at this point and search results provided from this merge.
The determination may be based on the combined rankings from
merging the first tiers and the possible combined ranks that may be
obtained from using subsequent tiers. In particular, if the highest
combined rank that could be returned for a document using lower
tiers is less than the lowest combined rank for a document from the
first tiers, it is mathematically impossible for any remaining
coincidences from lower tiers to provide a better result than found
from the first tiers. If so, the process of merging tiers from
posting lists may be stopped, and a set of ranked search results
may be returned from the first tiers for presentation to an end
user, as shown at block 720. In some embodiments, it may be
determined to stop by identifying at least one of the atoms as
having a very long posting list. In that case, a "soft-AND" may be
applied to such long atoms which are in the presence of dominant,
rarer atoms or intersections. This allows the lower tier of the
longest atoms to be replaced by an assumption of presence, allowing
more rare and significant atoms of the query to dominate the
matching calculation.
[0059] As discussed above for block 712, in some embodiments, the
set of ranked search results provided at block 720 may be
determined based on the ranks stored in the posting lists. In other
embodiments, a staged process may be employed and candidate
documents may be selected based on the ranks stored in the posting
lists and further processing may be performed to obtain the final
set of ranked search results.
[0060] If it is determined at block 718 that the process of merging
tiers should continue, the next level of tiers from the posting
lists are merged, as shown at block 722. For instance, if two
posting lists are being analyzed, the next merge includes merging
the first tier from a first posting list with the second tier from
the second posting list and merging the second tier from the first
posting list with the first tier from the second posting list. It
may also include merging the second tiers from the two posting
lists. The stopping analysis at block 718 is performed to determine if
merging of lower tiers should be performed based on the combined
ranks determined so far and/or determining to apply a "soft-AND."
Additionally, the analysis at block 718 may determine that a last
tier has been reached, indicating that no more merging may be
performed. The process of merging lower tiers continues until a
stopping determination is made at block 718 and a set of ranked
search results are provided at block 720.
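By way of illustration only, the loop through blocks 716, 718, and 722 for two atoms may be sketched as follows. The names merge_join, level, and query_two_atoms are hypothetical. The sketch assumes combined ranks are sums, every tier is non-empty, and tier indices are zero-based; the "soft-AND" path is omitted.

```python
import heapq
from itertools import groupby, product


def merge_join(tier_x, tier_y):
    """Intersect two tiers sorted by document ID, summing ranks on a match."""
    i = j = 0
    matches = []
    while i < len(tier_x) and j < len(tier_y):
        (doc_x, rank_x), (doc_y, rank_y) = tier_x[i], tier_y[j]
        if doc_x == doc_y:
            matches.append((doc_x, rank_x + rank_y))
            i += 1
            j += 1
        elif doc_x < doc_y:
            i += 1
        else:
            j += 1
    return matches


def level(pair):
    """Merge step of a tier pair: deeper tier first, then sum of indices."""
    return (max(pair), sum(pair))


def query_two_atoms(tiers_x, tiers_y, n):
    """Return (top-n results, number of tier-pair merges performed)."""
    max_x = [max(rank for _, rank in tier) for tier in tiers_x]
    max_y = [max(rank for _, rank in tier) for tier in tiers_y]
    pairs = sorted(product(range(len(tiers_x)), range(len(tiers_y))), key=level)
    remaining = set(pairs)
    found, top, merges = [], [], 0
    for _, group in groupby(pairs, key=level):
        for i, j in group:  # merge every tier pair at this level
            found.extend(merge_join(tiers_x[i], tiers_y[j]))
            remaining.discard((i, j))
            merges += 1
        top = heapq.nlargest(n, found, key=lambda p: p[1])
        # Highest combined rank any unmerged tier pair could still yield.
        bound = max((max_x[i] + max_y[j] for i, j in remaining),
                    default=float("-inf"))
        if top and len(top) == n and top[-1][1] > bound:
            break  # stopping determination: lower tiers cannot win
    return top, merges
```

Since each document appears in exactly one tier of each posting list, a document can match in only one tier pair, so the accumulated results contain no duplicates.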
[0061] As can be understood, embodiments of the present invention
provide a search index with tiered posting lists in which the tiers
of a posting list are ordered by rank, while each tier is
internally ordered by document. The present invention has been
described in relation to particular embodiments, which are intended
in all respects to be illustrative rather than restrictive.
Alternative embodiments will become apparent to those of ordinary
skill in the art to which the present invention pertains without
departing from its scope.
[0062] From the foregoing, it will be seen that this invention is
one well adapted to attain all the ends and objects set forth
above, together with other advantages which are obvious and
inherent to the system and method. It will be understood that
certain features and subcombinations are of utility and may be
employed without reference to other features and subcombinations.
This is contemplated by and is within the scope of the claims.
* * * * *