U.S. patent application number 12/854775 was filed with the patent office on 2010-08-11 and published on 2011-02-17 for segmenting postings list reader.
This patent application is currently assigned to GLOBALSPEC, INC. Invention is credited to Jeff J. Dalton, Steinar Flatland.
Publication Number | 20110040762 |
Application Number | 12/854775 |
Family ID | 43589199 |
Publication Date | 2011-02-17 |
United States Patent Application | 20110040762 |
Kind Code | A1 |
Flatland; Steinar; et al. | February 17, 2011 |
SEGMENTING POSTINGS LIST READER
Abstract
A size of a posting list is determined as part of searching an
inverted index. The posting list is segmented for reading into a
plurality of segments based on the size. For example, the
segmenting may be performed if the size is larger than a
predetermined size. Finally, each of the plurality of segments is
read into memory.
Inventors: | Flatland; Steinar (Clifton Park, NY); Dalton; Jeff J. (Northampton, MA) |
Correspondence Address: | HESLIN ROTHENBERG FARLEY & MESITI PC, 5 COLUMBIA CIRCLE, ALBANY, NY 12203, US |
Assignee: | GLOBALSPEC, INC., East Greenbush, NY |
Family ID: | 43589199 |
Appl. No.: | 12/854775 |
Filed: | August 11, 2010 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61233427 | Aug 12, 2009 | |
61233420 | Aug 12, 2009 | |
61233411 | Aug 12, 2009 | |
Current U.S. Class: | 707/737; 707/749; 707/E17.089 |
Current CPC Class: | G06F 16/319 20190101 |
Class at Publication: | 707/737; 707/E17.089; 707/749 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. A method of reading a posting list, the method comprising:
determining by a processor a size of a posting list as part of
searching an inverted index; segmenting the posting list for
reading by the processor into a plurality of segments based on the
size; and reading by the processor each of the plurality of
segments into memory.
2. The method of claim 1, wherein the segmenting is performed if
the size is larger than a predetermined size.
3. The method of claim 2, further comprising reading by the
processor all of the posting list at once if the size is the
predetermined size or smaller.
4. The method of claim 1, wherein the segmenting is performed using
at least one predetermined segment size.
5. The method of claim 1, wherein the segmenting is performed for
at least one segment using at least one estimated segment size.
6. The method of claim 5, wherein at least one actual read size for
the at least one segment is greater than or equal to at least one
of the at least one estimated segment size, the method further
comprising storing by the processor the at least one actual read
size in a data structure for reuse.
7. The method of claim 1, wherein the posting list includes a
plurality of relevance indicators for a plurality of postings in
the posting list, wherein the posting list is sorted into
descending order by relevance, and wherein the segmenting is
performed for at least one segment using at least one relevance
indicator.
8. The method of claim 7, wherein the reading comprises reading for
the at least one segment until reaching a relevance indicator lower
than the at least one relevance indicator, the method further
comprising storing by the processor a read size for the at least
one segment in a data structure for reuse.
9. The method of claim 7, wherein the plurality of relevance
indicators comprises a plurality of scores.
10. A computer system for reading a posting list, the computer
system comprising: a memory; and a processor in communication with
the memory to perform a method, the method comprising: determining
a size of a posting list as part of searching an inverted index;
segmenting the posting list for reading into a plurality of
segments based on the size; and reading each of the plurality of
segments into memory.
11. The system of claim 10, wherein the segmenting is performed if
the size is larger than a predetermined size.
12. The system of claim 10, further comprising reading all of the
posting list at once if the size is the predetermined size or
smaller.
13. The system of claim 10, wherein the segmenting is performed
using at least one predetermined segment size.
14. The system of claim 10, wherein the segmenting is performed for
at least one segment using at least one estimated segment size.
15. The system of claim 14, wherein at least one actual read size
for the at least one segment is greater than or equal to at least
one of the at least one estimated segment size, the method further
comprising storing the at least one actual read size in a data
structure for reuse.
16. The system of claim 10, wherein the posting list includes a
plurality of relevance indicators for a plurality of postings in
the posting list, wherein the posting list is sorted into
descending order by relevance, and wherein the segmenting is
performed for at least one segment using at least one relevance
indicator.
17. The system of claim 16, wherein the reading comprises reading
for the at least one segment until reaching a relevance indicator
lower than the at least one relevance indicator, the method further
comprising storing a read size for the at least one segment in a
data structure for reuse.
18. The system of claim 16, wherein the plurality of relevance
indicators comprises a plurality of scores.
19. A program product for reading a posting list, the program
product comprising: a storage medium readable by a processor and
storing instructions for execution by the processor for performing
a method, the method comprising: determining a size of a posting
list as part of searching an inverted index; segmenting the posting
list for reading into a plurality of segments based on the size;
and reading each of the plurality of segments into memory.
20. The program product of claim 19, wherein the segmenting is
performed if the size is larger than a predetermined size.
21. The program product of claim 20, further comprising reading all
of the posting list at once if the size is the predetermined size
or smaller.
22. The program product of claim 19, wherein the segmenting is
performed using at least one predetermined segment size.
23. The program product of claim 19, wherein the segmenting is
performed for at least one segment using at least one estimated
segment size.
24. The program product of claim 23, wherein at least one actual
read size for the at least one segment is greater than or equal to
at least one of the at least one estimated segment size, the method
further comprising storing the at least one actual read size in a
data structure for reuse.
25. The program product of claim 19, wherein the posting list
includes a plurality of relevance indicators for a plurality of
postings in the posting list, wherein the posting list is sorted
into descending order by relevance, and wherein the segmenting is
performed for at least one segment using at least one relevance
indicator.
26. The program product of claim 25, wherein the reading comprises
reading for the at least one segment until reaching a relevance
indicator lower than the at least one relevance indicator, the
method further comprising storing a read size for the at least one
segment in a data structure for reuse.
27. The program product of claim 25, wherein the plurality of
relevance indicators comprises a plurality of scores.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority under 35 U.S.C. § 119
to the following U.S. Provisional Applications, which are herein
incorporated by reference in their entirety:
[0002] Provisional Patent Application Ser. No. 61/233,411, by
Flatland et al., entitled "ESTIMATION OF POSTINGS LIST LENGTH IN A
SEARCH SYSTEM USING AN APPROXIMATION TABLE," filed on Aug. 12,
2009;
[0003] Provisional Patent Application Ser. No. 61/233,420, by Flatland
et al., entitled "EFFICIENT BUFFERED READING WITH A PLUG IN FOR
INPUT BUFFER SIZE DETERMINATION," filed on Aug. 12, 2009; and
[0004] Provisional Patent Application Ser. No. 61/233,427, by
Flatland et al., entitled "SEGMENTING POSTINGS LIST READER," filed
on Aug. 12, 2009.
[0005] This application contains subject matter which is related to
the subject matter of the following applications, each of which is
assigned to the same assignee as this application and filed on the
same day as this application. Each of the below listed applications
is hereby incorporated herein by reference in its entirety:
[0006] U.S. Non-Provisional patent application Ser. No. ______, by
Flatland et al., entitled "ESTIMATION OF POSTINGS LIST LENGTH IN A
SEARCH SYSTEM USING AN APPROXIMATION TABLE" (Attorney Docket No.
1634.068A); and
[0007] U.S. Non-Provisional patent application Ser. No. ______, by
Flatland et al., entitled "EFFICIENT BUFFERED READING WITH A PLUG
IN FOR INPUT BUFFER SIZE DETERMINATION" (Attorney Docket No.
1634.069A).
TECHNICAL FIELD
[0008] The present invention generally relates to reading posting
lists as part of searching an inverted index. More particularly,
the invention relates to segmenting a posting list into a plurality
of segments based on the size of the list.
BACKGROUND
[0009] The following definition of Information Retrieval (IR) is
from the book Introduction to Information Retrieval by Manning,
Raghavan and Schütze, Cambridge University Press, 2008:
[0010] Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that satisfies
an information need from within large collections (usually stored
on computers).
Inverted Index
[0011] An inverted index is a data structure central to the design
of numerous modern information retrieval systems. In chapter 5 of
Search Engines: Information Retrieval in Practice (Addison Wesley,
2010), Croft, Metzler and Strohman observe:
[0012] An inverted index is the computational equivalent of the
index found in the back of this textbook . . . . The book index is
arranged in alphabetical order by index term. Each index term is
followed by a list of pages about the word.
[0013] In a search system implemented using a computer, an inverted
index often comprises two related data structures:
[0014] 1. A lexicon contains the distinct set of terms (i.e., with
duplicates removed) that occur throughout all the documents of the
index. To facilitate rapid searching, terms in the lexicon are
usually stored in sorted order. Each term typically includes a
document frequency and a pointer into the other major data structure
of the inverted index, the posting file. The document frequency is a
count of the number of documents in which a term occurs. The
document frequency is useful at search time both for prioritizing
term processing and as input to scoring algorithms.
[0015] 2. The posting file consists of one posting list per term in
the lexicon, recording for each term the set of documents in which
the term occurs. Each entry in a posting list is called a posting.
The number of postings in a given posting list equals the document
frequency of the associated lexicon entry. A posting includes at
least a document identifier and may include additional information
such as: a count of the number of times the term occurs in the
document; a list of term positions within the document where the
term occurs; and more generally, scoring information that ascribes
some degree of importance (or lack thereof) to the fact that the
document contains the term.
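The relationship between the lexicon and the posting lists can be illustrated with a minimal in-memory sketch. The names TinyInvertedIndex, Posting and addDocument are hypothetical and do not appear in this application; a real index would typically keep the posting file on secondary storage and have each lexicon entry point into it by offset, whereas this sketch keeps everything in one sorted map for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** One posting: a document identifier plus the in-document term count. */
final class Posting {
    final int docId;
    final int termCount;

    Posting(int docId, int termCount) {
        this.docId = docId;
        this.termCount = termCount;
    }
}

/** Hypothetical in-memory stand-in for a lexicon plus posting file. */
final class TinyInvertedIndex {
    // The sorted map plays the role of the lexicon: distinct terms in sorted order.
    // Each value is that term's posting list.
    private final TreeMap<String, List<Posting>> postingLists = new TreeMap<>();

    /** Indexes a document given as whitespace-separated, already normalized terms. */
    void addDocument(int docId, String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] tokens = trimmed.split("\\s+");
        Set<String> distinct = new LinkedHashSet<>();
        for (String token : tokens) {
            distinct.add(token);
        }
        for (String term : distinct) {
            int count = 0;
            for (String token : tokens) {
                if (token.equals(term)) {
                    count++;
                }
            }
            // One posting per (term, document) pair.
            postingLists.computeIfAbsent(term, k -> new ArrayList<>())
                        .add(new Posting(docId, count));
        }
    }

    /** Document frequency equals the number of postings in the term's list. */
    int documentFrequency(String term) {
        List<Posting> list = postingLists.get(term);
        return list == null ? 0 : list.size();
    }

    /** Returns the posting list for a term, or an empty list if the term is unknown. */
    List<Posting> postings(String term) {
        List<Posting> list = postingLists.get(term);
        return list == null ? new ArrayList<Posting>() : list;
    }
}
```

In this sketch the document frequency is derived from the list length rather than stored, which mirrors the statement that the two values are equal.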
[0016] When processing a user's query, a computerized search system
needs access to the postings of the terms that describe the user's
information need. As part of processing the query, the search
system aggregates information from these postings, by document, in
an accumulation process that leads to a ranked list of documents to
answer the user's query.
[0017] A large inverted index may not fit into a computer's main
memory, requiring secondary storage, typically disk storage, to
help store the posting file, lexicon, or both. Each separate access
to disk may incur seek time on the order of several milliseconds if
it is necessary to move the hard drive's read heads, which is very
expensive in terms of runtime performance compared to accessing
main memory.
[0018] Therefore, it would be helpful to minimize accesses to
secondary storage for reading posting lists when searching an
inverted index, in order to improve runtime performance.
BRIEF SUMMARY OF INVENTION
[0019] The present invention satisfies the above-noted need by
providing a posting list reader that reads a posting list
efficiently during inverted index searching by reducing the number
of accesses to secondary storage as compared to a traditional
buffered reading strategy that repeatedly uses a uniform input
buffer size.
[0020] The posting list reader of the present invention will be
referred to as a segmenting posting list reader, to distinguish it
from posting list readers in general. Further, a posting list
segment refers to a sequence of adjacent postings within a posting
list. A complete segmentation of a posting list breaks it up into
one or more non-overlapping segments that together include all the
postings of the list.
[0021] In accordance with the above, it is an object of the present
invention to provide a segmenting posting list reader that can
determine how many postings to read on each read request.
[0022] It is another object of the present invention to provide a
segmenting reader to read short posting lists in a single burst of
reading.
[0023] It is still another object of the present invention to
provide a segmenting reader that automatically breaks long posting
lists into segments according to, for example, a strategy that may
vary with the requirements of evaluation logic, posting list
organization, or other considerations. Each read request preferably
reads the next segment in one burst of reading.
[0024] It is yet another object of the present invention to provide
a segmenting reader with support for posting list segments of both
exact and approximate size.
[0025] Finally, it is another object of the present invention to
provide a segmenting posting list reader that learns, remembers and
applies posting list segmentations with only a small amount of
up-front configuration.
[0026] The present invention provides, in a first aspect, a method
of reading a posting list. The method comprises determining by a
processor a size of a posting list as part of searching an inverted
index, segmenting the posting list for reading by the processor
into a plurality of segments based on the size, and reading by the
processor each of the plurality of segments into memory.
[0027] The present invention provides, in a second aspect, a
computer system for reading a posting list. The computer system
comprises a memory, and a processor in communication with the
memory to perform a method. The method comprises determining a size
of a posting list as part of searching an inverted index,
segmenting the posting list for reading into a plurality of
segments based on the size, and reading each of the plurality of
segments into memory.
[0028] The present invention provides, in a third aspect, a program
product for reading a posting list. The program product comprises a
storage medium readable by a processor and storing instructions for
execution by the processor for performing a method. The method
comprises determining a size of a posting list as part of searching
an inverted index, segmenting the posting list for reading into a
plurality of segments based on the size, and reading each of the
plurality of segments into memory.
[0029] These, and other objects, features and advantages of this
invention will become apparent from the following detailed
description of the various aspects of the invention taken in
conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] One or more aspects of the present invention are
particularly pointed out and distinctly claimed as examples in the
claims at the conclusion of the specification. The foregoing and
other objects, features, and advantages of the invention are
apparent from the following detailed description taken in
conjunction with the accompanying drawings in which:
[0031] FIG. 1 is a graph of term rank versus document
frequency;
[0032] FIG. 2 is a block/flow diagram showing aspects of inverted
index searching;
[0033] FIG. 3 is one example of a block/flow diagram for a
segmenting posting list reader, in accordance with aspects of the
present invention;
[0034] FIG. 4 is an instance diagram for a posting list
segmentation table and associated objects;
[0035] FIG. 5 is a flow diagram for one example of a method of
reading a posting list, in accordance with aspects of the present
invention;
[0036] FIG. 6 is a sequence diagram for one example of a method of
reading a short posting list comprising a single segment;
[0037] FIG. 7 is a sequence diagram for one example of a method of
reading and learning the segmentation of a posting list comprising
two segments;
[0038] FIG. 8 is a sequence diagram for one example of a method of
reading a posting list comprising two segments, taking advantage of
segmentation information learned and remembered during an earlier
read of the list; and
[0039] FIG. 9 is a block diagram of one example of a computing unit
incorporating one or more aspects of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0040] Posting lists in a search index are described by Zipf's law,
which states that given a corpus of natural language documents, the
frequency of any word is inversely proportional to its rank in the
frequency table.
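As an idealized formula, with symbols that are not part of the original text (r for the rank of a term in the frequency table, f(r) for the frequency of the term of rank r, and C for a corpus-dependent constant), the law may be written as:

```latex
f(r) \approx \frac{C}{r}, \qquad r = 1, 2, 3, \ldots
```

Under this idealization the term of rank 2 occurs about half as often as the term of rank 1, the term of rank 3 about one third as often, and so on.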
[0041] FIG. 1 shows, for an index built from a natural language
corpus, a graph 100 of term rank 102 versus document frequency 104,
where document frequency is the number of distinct documents the
term occurs in. Another way to think about document frequency is
posting list length. The graph shows that most terms have very
short posting lists, and only relatively few posting lists are
long.
[0042] Because queries submitted to a search system are themselves
small natural language documents, they too adhere to Zipf's law.
It follows that the relatively few long posting lists in a search
index are also the most frequently accessed during query
processing. An efficient read strategy for long posting lists can
help a search system deliver fast query run times. It is convenient
that the big posting lists are few. This makes it feasible to craft
and hold in memory exact read strategies for these lists.
Inverted Index Searcher and Posting List Reader
[0043] An information retrieval system 200 that searches an
inverted index comprises components similar to those labeled
InvertedIndexSearcher 202 and PostingListReader 204 in FIG. 2. An
inverted index searcher manages the process of searching an
inverted index, and a posting list reader manages the details of
reading a posting list from the posting file 214.
[0044] Inverted index searcher 202 takes a query 206 as input and
returns search results 208. Information contained in the query
includes, at a minimum, a term or terms describing the user's
information need. The query optionally includes other features such
as, for example, Boolean constraints (AND, OR, NOT), term weights,
phrase constraints, or proximity restrictions. The query may be
expressed literally as submitted by the user, or it may already
have been parsed and structured. The search results returned, at a
minimum, comprise unique identifiers of the documents matching the
query. Often, the search results are returned in order of
descending relevance, and each search result may optionally include
a variety of other information such as a score, date indexed,
document last modified date, a copy of the document as it was
indexed, the document's URL if applicable, document title, a
"snippet" or keywords in context showing how the query matches the
document, and application-specific metadata.
[0045] A given inverted index searcher instance searches a single
inverted index. A large scale search engine may have multiple
inverted index searcher instances, spread out on different servers
in a server cluster. In this case, higher level components, not
pictured here, are responsible for broadcasting queries across
inverted index search services and integrating the results that
come back.
[0046] When inverted index searcher 202 receives query 206, it
forwards it to the evaluation logic 216, which is the code and
associated data structures in the inverted index searcher that
executes the query and produces a list of search results. The
evaluation logic decides which posting lists to read and dispatches
any needed posting list readers. The evaluation logic controls the
details of reading, for example, how many posting list readers to
use at once, how much of each posting list to read, the order in
which lists are read, whether to read a given list all at once,
whether to alternate between lists in successive bursts of reading,
etc. In the example of FIG. 2, the evaluation logic has decided to
open three posting list readers (204, 210, and 212) simultaneously
over three different posting lists (218, 220 and 222,
respectively). As postings are read, the evaluation logic
aggregates information in the postings by document and interprets
Boolean operators and other advanced search language features to
identify matching documents. The end result is a list of search
results 208, often ranked in descending order of relevance, to
answer the user's query.
[0047] As it executes a search, an inverted index searcher requires
data transfer from the posting file. As previously mentioned, a
large search index may require implementing the posting file using
secondary storage.
[0048] FIG. 3 is one example of a block/flow diagram 300 for a
segmenting posting list reader, in accordance with aspects of the
present invention. The segmenting posting list reader works
together with several other components, pictured in FIG. 3.
Directionality of arrows in FIG. 3 indicates component usage, i.e.,
an arrow goes from a software component to another component that
it uses.
Segmenting Posting List Reader
[0049] The main component in FIG. 3 is the Segmenting Posting List
Reader 302 (SPLR) whose purpose is to read posting lists during an
inverted index search and make them available to the evaluation
logic, in accordance with the efficiencies of the present
invention, i.e., reducing the number of reads compared to a
conventional reader.
[0050] The SPLR is implemented using several other software
components that are introduced here and described in greater detail
below. The purpose of the
LexiconEntryToPostingListSegmentationMapper 304 is to provide a
mapping from each lexicon entry to a segmentation of the associated
posting list, thereby determining for each term in the index both
the number of bursts of reading to fully read the posting list and
the postings that will be read by each successive read request. The
LexiconEntryToPostingListSegmentationMapper delegates work
optionally to a PostingListLengthApproximationTable 306 and to a
PostingListSegmentationTable 308. A
PostingListLengthApproximationTable provides accurate estimates of
posting list size, typically in bytes. The
PostingListSegmentationTable stores segmentations of the relatively
few but frequently accessed posting lists that are larger than a
predetermined size. A PostingListReadLimiter 310 helps the SPLR
learn segmentations of long posting lists that do not have
segmentations in the PostingListSegmentationTable yet, by defining
the boundaries between read bursts. An enhanced buffered reader 312
uses configurable predetermined buffer fill size strategies to read
from secondary storage more efficiently than a conventional
buffered reader. Finally, a BufferFillSizeSelectorFactory 314
manufactures predetermined buffer fill size strategies used to
configure an enhanced buffered reader.
[0051] To describe the public interface of the SPLR, it is
necessary to first define a LexiconEntry. A LexiconEntry is a
record retrieved from the inverted index's lexicon. A LexiconEntry
comprises at least three fields: term, document frequency, and
posting file start offset. The term is an indexed word or phrase.
The document frequency is the length of the term's posting list in
number of postings. The posting file start offset is the offset,
typically in bytes, in the posting file where the posting list of
the term starts. A LexiconEntry consisting of only these 3 fields
will be referred to below as a minimal lexicon entry.
[0052] A LexiconEntry may optionally include, for example, a
posting file end offset and/or posting list length. A posting file
end offset is the offset, typically in bytes, in the posting file
where the posting list of the term ends. A posting list length is
the length of the posting list of the term, again, typically in
bytes. If a lexicon entry has either or both of these fields it
will be referred to below as an extended lexicon entry.
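A minimal sketch of the minimal and extended lexicon entries follows. The field names, the UNKNOWN sentinel, the factory methods, and the treatment of the end offset as exclusive (so that length equals end minus start) are assumptions made for illustration; the text above specifies only which fields exist.

```java
/** Hypothetical lexicon entry; field and method names are illustrative only. */
final class LexiconEntry {
    /** Sentinel for an optional field that is absent (assumption of this sketch). */
    static final long UNKNOWN = -1L;

    final String term;                 // indexed word or phrase
    final int documentFrequency;       // posting list length in number of postings
    final long postingFileStartOffset; // byte offset where the posting list starts
    final long postingFileEndOffset;   // optional byte offset where the list ends, or UNKNOWN

    private LexiconEntry(String term, int documentFrequency, long start, long end) {
        this.term = term;
        this.documentFrequency = documentFrequency;
        this.postingFileStartOffset = start;
        this.postingFileEndOffset = end;
    }

    /** Minimal entry: term, document frequency and start offset only. */
    static LexiconEntry minimal(String term, int documentFrequency, long start) {
        return new LexiconEntry(term, documentFrequency, start, UNKNOWN);
    }

    /** Extended entry: additionally carries the end offset. */
    static LexiconEntry extended(String term, int documentFrequency, long start, long end) {
        if (end < start) {
            throw new IllegalArgumentException("end offset must not precede start offset");
        }
        return new LexiconEntry(term, documentFrequency, start, end);
    }

    boolean isExtended() {
        return postingFileEndOffset != UNKNOWN;
    }

    /**
     * Exact posting list length in bytes for an extended entry, assuming an
     * exclusive end offset. Returns UNKNOWN for a minimal entry, in which
     * case the length would have to be estimated by other means.
     */
    long postingListLengthInBytes() {
        return isExtended() ? postingFileEndOffset - postingFileStartOffset : UNKNOWN;
    }
}
```

With an extended entry the byte length is known exactly; with a minimal entry it is not, which is the distinction the surrounding paragraphs draw.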
[0053] As will become clear, whether a lexicon entry is minimal or
extended affects whether a PostingListLengthApproximationTable is
required in the implementation of the SPLR.
[0054] The public interface of the SPLR preferably includes the
following methods:
[0055] 1. void open (LexiconEntry lexiconEntry)--Prepares the SPLR
for reading the posting list indicated by the LexiconEntry. This
method has no return value, as indicated by "void."
[0056] 2. Boolean read ( )--Reads a burst of postings. The preferred
implementation is to forward these postings directly to the
evaluation logic via a callback, so it is suggested here that the
postings read are not the return value of this method. Because the
SPLR automatically decides how many postings to read, the read( )
method needs no input parameter such as the number of postings to
read. The method returns a Boolean value: whether there are more
postings to read, i.e., whether it makes sense for the client to
call read( ) again to do another burst of reading.
[0057] 3. void close ( )--Closes the reader, releasing any resources
such as memory and/or file handles. This method should leave the
SPLR in a state where open( ) can be called again. Making the SPLR
reusable in this way facilitates managing resource pools, which is
convenient for building the larger search system. The close( )
method has no return value (void).
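The three methods above can be sketched as a Java interface together with a typical client loop. Only open, read and close come from the text; the interface name SegmentingPostingListReader, the FakeReader stand-in (which performs no real I/O and simply counts bursts) and the ReaderClient helper are hypothetical, and the LexiconEntry shown is reduced to the three minimal fields.

```java
/** Minimal lexicon entry: term, document frequency, posting file start offset. */
final class LexiconEntry {
    final String term;
    final int documentFrequency;
    final long postingFileStartOffset;

    LexiconEntry(String term, int documentFrequency, long postingFileStartOffset) {
        this.term = term;
        this.documentFrequency = documentFrequency;
        this.postingFileStartOffset = postingFileStartOffset;
    }
}

/** The public interface described above. */
interface SegmentingPostingListReader {
    void open(LexiconEntry lexiconEntry);

    /** Reads one burst of postings; returns true if more postings remain. */
    boolean read();

    void close();
}

/** Hypothetical stand-in that "reads" a fixed number of bursts per list. */
final class FakeReader implements SegmentingPostingListReader {
    private final int burstsPerList;
    private int burstsRemaining;
    int burstsRead; // total bursts across all lists, for demonstration

    FakeReader(int burstsPerList) {
        this.burstsPerList = burstsPerList;
    }

    @Override
    public void open(LexiconEntry lexiconEntry) {
        burstsRemaining = burstsPerList;
    }

    @Override
    public boolean read() {
        if (burstsRemaining > 0) {
            burstsRemaining--;
            burstsRead++;
        }
        return burstsRemaining > 0;
    }

    @Override
    public void close() {
        burstsRemaining = 0; // leaves the reader ready for another open()
    }
}

final class ReaderClient {
    /**
     * Reads a whole posting list and returns the number of read() calls made.
     * Real evaluation logic might stop calling read() early once it has
     * enough postings; this loop always reads to the end.
     */
    static int readFully(SegmentingPostingListReader reader, LexiconEntry entry) {
        int bursts = 0;
        reader.open(entry);
        try {
            boolean more;
            do {
                more = reader.read();
                bursts++;
            } while (more);
        } finally {
            reader.close();
        }
        return bursts;
    }
}
```

Because close( ) leaves the reader reusable, the same instance can be passed to readFully repeatedly, which is what makes pooling of readers practical.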
[0058] A discussion of the various software components, pictured in
FIG. 3, that help implement the SPLR, follows. This, in turn, is
followed by some examples of the SPLR in action, illustrated by
sequence diagrams, and pseudocode for a proposed SPLR
implementation.
PostingListReadLimiter
[0059] The purpose of the PostingListReadLimiter is to give the
SPLR a strategy whereby it can learn the complete segmentation of a
long posting list.
[0060] The public interface to the PostingListReadLimiter consists
of the following method: PostingListReadLimit getLimit (int
readSequenceNumber). The getLimit method takes as input a
readSequenceNumber, which is an integer greater than or equal to
one. A posting list is read using one or more bursts of reading,
one burst per segment. The first segment is designated
readSequenceNumber 1, the second as readSequenceNumber 2, and the
readSequenceNumber increases by 1 for each successive burst of
reading. The getLimit function returns a PostingListReadLimit that
is used by the implementation of the SPLR's read( ) method to know
when to stop reading during a burst with a given
readSequenceNumber.
[0061] The details of how to best define the PostingListReadLimit
will vary depending upon the posting list structure of the inverted
index and associated evaluation logic.
[0062] In a score sorted index, the postings of each posting list
are sorted into descending order by score, so that the evaluation
logic first gets the postings with the highest scores, which are
considered the most important. For example, in "Pruned Query
Evaluation Using Pre-Computed Impacts," Proceedings of the 29th
Annual International ACM SIGIR Conference (SIGIR 2006), pp. 372-379,
Seattle, Wash., August 2006, incorporated herein by reference in its
entirety, V. N. Anh
and A. Moffat describe a technique to achieve fast search runtime
and a guarantee of search result quality (i.e., relevance) using
pruned query evaluation with score-at-a-time processing of an
impact-sorted index. In their approach, the postings of each
posting list are ordered by descending impact, where impact is a
measure of the importance of a term in a document. In their
approach, a posting list is read using a sequence of bursts of
reading, and within each burst, each posting read contributes the
same partial score value toward the score of each document
encountered. With a score-sorted posting list organization, to help
achieve efficient data access, it is preferable to align the
segment boundaries of the present invention with the static score
or impact boundaries that are built into the posting list.
[0063] With a score sorted index, the PostingListReadLimit is
preferably defined as the minimum impact or score (more generally,
the minimum relevance indicator) to read during a burst of reading.
To enforce the limit, a burst of reading includes all remaining
postings with a score greater than or equal to the minimum score
that is the PostingListReadLimit for the current
readSequenceNumber. The implementation of PostingListReadLimit
getLimit (int readSequenceNumber) in this case is trivial. The
PostingListReadLimiter has as part of its state an array of scores
indexed by read sequence number, and the getLimit method simply
does an array lookup and returns a score. The array of scores used
by the PostingListReadLimiter is preferably configurable through a
file or database read by the search system on startup.
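The array-lookup limiter for a score sorted index might look like the following sketch. Several simplifications are assumptions: the PostingListReadLimit return type is reduced to a plain int score, the class name ScoreReadLimiter and the helper endOfBurst are hypothetical, and a read sequence number beyond the configured array is taken to mean "read whatever remains," a case the text does not address.

```java
/** Hypothetical read limiter for a score sorted index. */
final class ScoreReadLimiter {
    // Index 0 holds the minimum score for readSequenceNumber 1, and so on.
    private final int[] minScores;

    ScoreReadLimiter(int[] minScoresBySequenceNumber) {
        this.minScores = minScoresBySequenceNumber.clone();
    }

    /** Returns the minimum score to read during the given burst. */
    int getLimit(int readSequenceNumber) {
        if (readSequenceNumber < 1) {
            throw new IllegalArgumentException("readSequenceNumber must be >= 1");
        }
        // Assumption: past the configured limits, accept every remaining posting.
        return readSequenceNumber <= minScores.length
                ? minScores[readSequenceNumber - 1]
                : Integer.MIN_VALUE;
    }

    /**
     * Enforces a limit on a list of scores sorted in descending order.
     * Starting at index 'from', the burst includes every remaining posting
     * whose score is at least minScore. Returns the index one past the last
     * posting in the burst.
     */
    static int endOfBurst(int[] scoresDescending, int from, int minScore) {
        int i = from;
        while (i < scoresDescending.length && scoresDescending[i] >= minScore) {
            i++;
        }
        return i;
    }
}
```

For example, with limits {7, 5} and scores {9, 9, 7, 5, 5, 2}, the first burst covers the three postings scoring at least 7, the second the two postings scoring 5, and a third burst picks up the remaining posting.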
[0064] In a document sorted index, another common index
organization that is simple and offers good compression
characteristics, the postings of each posting list are sorted by
document identifier. It is not possible to segment such an index
for reading on score boundaries.
[0065] One example strategy to segment a posting list of a document
sorted index is to make each successive burst of reading bigger,
for example, doubling the size of each successive read. The
intuition is to attempt to satisfy the evaluation logic's
information need with minimal data transfer, but if the evaluation
logic remains unsatisfied, then issue bigger and bigger reads to
deliver the needed information with a relatively small number of
separate accesses to secondary storage. To implement a strategy
like this, the PostingListReadLimit is, for example, a minimum
number of bytes to read during a burst. The burst of reading continues
until the minimum number of bytes for the readSequenceNumber has
been read or until end of list, whichever comes first. The
implementation of PostingListReadLimit getLimit (int
readSequenceNumber) is straightforward in this case. The
PostingListReadLimiter has as part of its state an array of sizes
in bytes indexed by read sequence number, and the getLimit method
simply does an array lookup and returns a size. The array of sizes
used by the PostingListReadLimiter is preferably configurable
through a file or database read by the search system on
startup.
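The doubling strategy for a document sorted index can be sketched as follows. The class name ByteCountReadLimiter, the doubling factory and the burstsToRead helper are hypothetical. Two simplifications are assumed: each burst reads exactly its minimum number of bytes (a real reader would likely overshoot to the next posting boundary), and a read sequence number beyond the configured array means "read to end of list."

```java
/** Hypothetical read limiter for a document sorted index. */
final class ByteCountReadLimiter {
    // Index 0 holds the minimum byte count for readSequenceNumber 1, and so on.
    private final long[] minBytes;

    ByteCountReadLimiter(long[] minBytesBySequenceNumber) {
        this.minBytes = minBytesBySequenceNumber.clone();
    }

    /** Builds a schedule in which each read is twice the size of the previous one. */
    static ByteCountReadLimiter doubling(long firstReadBytes, int numLimits) {
        long[] sizes = new long[numLimits];
        long size = firstReadBytes;
        for (int i = 0; i < numLimits; i++) {
            sizes[i] = size;
            size *= 2;
        }
        return new ByteCountReadLimiter(sizes);
    }

    /** Returns the minimum number of bytes to read during the given burst. */
    long getLimit(int readSequenceNumber) {
        if (readSequenceNumber < 1) {
            throw new IllegalArgumentException("readSequenceNumber must be >= 1");
        }
        // Assumption: past the configured limits, read to the end of the list.
        return readSequenceNumber <= minBytes.length
                ? minBytes[readSequenceNumber - 1]
                : Long.MAX_VALUE;
    }

    /** Number of bursts needed to read a list of the given size under this schedule. */
    int burstsToRead(long listBytes) {
        long remaining = listBytes;
        int bursts = 0;
        while (remaining > 0) {
            bursts++;
            // A burst stops at its limit or at end of list, whichever comes first.
            remaining -= Math.min(getLimit(bursts), remaining);
        }
        return bursts;
    }
}
```

With a first read of 4096 bytes and four configured limits, a 20,000-byte list is read in three bursts (4096, 8192, then the remaining 7712 bytes), illustrating how growing reads keep the number of separate accesses small.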
PostingListSegmentationTable
[0066] A PostingListSegmentationTable is a table of posting list
segmentations randomly accessible by term, where a term is an
indexed word or phrase. The segmentation information in the table
may be complete or incomplete. The SPLR adds segmentation
information as it becomes known.
[0067] FIG. 4 shows the structure of a PostingListSegmentationTable
400, which is preferably held in main memory during search
evaluation and also saved to a persistent storage medium for long
term storage. The PostingListSegmentationTable includes a hash
table 402 keyed on indexed term 404. Each key is mapped to a
PostingListSegmentation object 406. The
PostingListSegmentationTable has a Boolean flag, isDirty 408,
indicating whether the PostingListSegmentationTable has changed
since the last save to persistent storage.
[0068] A PostingListSegmentation object 406 describes a complete or
partial segmentation of a posting list. Recall that a posting list
segment is a sequence of adjacent postings within a posting list. A
complete segmentation of a posting list breaks it up into one or
more non-overlapping segments that together include all the
postings of the list.
[0069] A PostingListSegmentation object has the following object
state:
[0070] 1. Term term--Unique identity of the term whose posting list
is being segmented.
[0071] 2. int [ ] postingListSegmentLengths--An array of 0 or more
segment lengths in bytes. The length of this array indicates the
number of segments that are known. (An array of size 0 is an
initial condition, since a posting list generally has at least one
item on it.)
[0072] 3. boolean complete--Whether the segmentation is complete.
[0073] 4. boolean approximate--true if the values in
postingListSegmentLengths should be treated as approximate sizes;
false if the values in postingListSegmentLengths should be treated
as exact sizes.
[0074] A PostingListSegmentation also has a convenience method
numSegments( ) to return the number of segment lengths that are
known. This is the length of the postingListSegmentLengths
array.
[0075] A PostingListSegmentationTable includes the following public
methods:
[0076] 1. PostingListSegmentation get (Term key)--Given a term,
return its segmentation, if any. If the indicated key has no
segmentation, return null.
[0077] 2. PostingListSegmentation put (PostingListSegmentation
segmentation)--Put the segmentation passed in into the table, keyed
on its term. Returns the previous value associated with the key (if
any) or null. This method is useful for initially populating the
table, for example, if loading it from a persistent data store.
[0078] 3. PostingListSegmentation putRefined
(PostingListSegmentation segmentation)--Replaces the
PostingListSegmentation associated with the key that is the term of
the segmentation passed in. This key will be updated to map to the
segmentation parameter if the key currently has no value or if both
of the following conditions hold: (a) the current value of the key
is incomplete (complete==false) and (b) the current value of the
key has a postingListSegmentLengths array that is shorter than the
postingListSegmentLengths array in the segmentation that is the
parameter to the method. Returns the displaced
PostingListSegmentation or null if no value was displaced by this
operation. A null return value occurs if a key is set for the first
time, or if the put fails because the required conditions do not
hold. If this method modifies the PostingListSegmentationTable, it
sets the isDirty flag to true.
[0079] 4. boolean isDirty ( )--Returns the value of the isDirty
flag, indicating whether the PostingListSegmentationTable has been
modified since it was last saved to a persistent storage medium.
Mutations caused by the putRefined method cause the isDirty flag to
become true.
[0080] 5. void clearDirty ( )--Set the isDirty flag to false, to
indicate that the PostingListSegmentationTable is unmodified/clean.
This method is called when a dirty PostingListSegmentationTable has
been successfully saved to a persistent storage medium. The method
returns nothing.
[0081] In a search system that is under load, the get( ) and
putRefined( ) methods may be called concurrently by multiple threads
of execution. These methods should be synchronized to avoid
erroneous behavior.
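One way the table and its refinement rule could look in Java is sketched below. It is an illustration under stated assumptions rather than the implementation of the disclosure: the term is represented as a plain String instead of a Term type, the segmentation object is made immutable, and synchronization is done with synchronized methods (other concurrency mechanisms would serve equally well).

```java
import java.util.HashMap;
import java.util.Map;

/** Segmentation record; the term is a String here for brevity (an assumption). */
final class PostingListSegmentation {
    final String term;
    final int[] postingListSegmentLengths;
    final boolean complete;
    final boolean approximate;

    PostingListSegmentation(String term, int[] lengths, boolean complete, boolean approximate) {
        this.term = term;
        this.postingListSegmentLengths = lengths.clone();
        this.complete = complete;
        this.approximate = approximate;
    }

    /** Number of segment lengths that are known. */
    int numSegments() {
        return postingListSegmentLengths.length;
    }
}

final class PostingListSegmentationTable {
    private final Map<String, PostingListSegmentation> table = new HashMap<>();
    private boolean dirty = false;

    synchronized PostingListSegmentation get(String key) {
        return table.get(key);
    }

    /** Unconditional put, e.g. while loading from a persistent store; does not mark dirty. */
    synchronized PostingListSegmentation put(PostingListSegmentation segmentation) {
        return table.put(segmentation.term, segmentation);
    }

    /** Replaces the current value only if absent, or incomplete and shorter than the new one. */
    synchronized PostingListSegmentation putRefined(PostingListSegmentation segmentation) {
        PostingListSegmentation current = table.get(segmentation.term);
        boolean refines = current == null
                || (!current.complete && current.numSegments() < segmentation.numSegments());
        if (!refines) {
            return null; // conditions do not hold; table unchanged
        }
        table.put(segmentation.term, segmentation);
        dirty = true;
        return current; // null when the key is set for the first time
    }

    synchronized boolean isDirty() {
        return dirty;
    }

    synchronized void clearDirty() {
        dirty = false;
    }
}
```

Because putRefined returns null both for a first insertion and for a rejected update, a caller that needs to tell the two apart would have to check isDirty( ) or call get( ) afterwards.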
PostingListLengthApproximationTable
[0082] A PostingListLengthApproximationTable provides accurate
estimates of posting list size, typically in bytes. The main method
on a PostingListLengthApproximationTable is:
[0083] PostingListLengthApproximation
getPostingListLengthApproximation (documentFrequency)--Returns a
PostingListLengthApproximation for a posting list with the
indicated document frequency (document frequency is the same thing
as posting list length).
[0084] A PostingListLengthApproximation includes the following
information: rangeId; average posting list length in bytes for this
range; and standard deviation of posting list length in bytes for
this range.
[0085] For a detailed discussion of a
PostingListLengthApproximationTable refer to U.S. Non-Provisional
patent application entitled "ESTIMATION OF POSTINGS LIST LENGTH IN
A SEARCH SYSTEM USING AN APPROXIMATION TABLE" (Attorney Docket No.
1634.068A) filed concurrently herewith.
LexiconEntryToPostingListSegmentationMapper
[0086] The purpose of this component is to map a lexicon entry to a
PostingListSegmentation. The PostingListSegmentation is useful to
the SPLR, representing what is known about how to best break a
given posting list into segments for reading.
[0087] The LexiconEntryToPostingListSegmentationMapper delegates
work to a PostingListSegmentationTable and optionally to a
PostingListLengthApproximationTable as will be spelled out
below.
[0088] A LexiconEntryToPostingListSegmentationMapper has the
following public methods:
[0089] 1. PostingListSegmentation getPostingListSegmentation
(LexiconEntry lexiconEntry)--Given a LexiconEntry, return the
PostingListSegmentation to be used by the SPLR to read the posting
list.
[0090] 2. PostingListSegmentation updatePostingListSegmentation
(PostingListSegmentation segmentation)--Update a posting list
segmentation in the PostingListSegmentationTable. (As the SPLR
discovers segmentation information, it will call this method to
report what it has learned.) This method simply delegates work to
the PostingListSegmentationTable's putRefined method.
[0091] Internally, the LexiconEntryToPostingListSegmentationMapper
knows how to discriminate between long and short posting lists. A
short posting list is one that is short enough to read in its
entirety in one burst of reading. A long posting list is one that
should be broken into multiple segments and read in pieces. The
exact methodology to discriminate between long and short posting
lists could vary and is left to the implementer. In one example,
the inflection point on the graph of document frequency over term
rank (see FIG. 1) can be used as the dividing line between short
and long posting lists.
[0092] As described above, different possible implementations of
LexiconEntry include: a minimal lexicon entry that includes just
term, document frequency and posting file start offset; and a more
extended lexicon entry that adds posting file end offset or posting
list length in bytes.
[0093] The implementation of getPostingListSegmentation varies
depending upon whether a LexiconEntry is minimal or extended.
Examples of Java-like pseudocode for these two scenarios are given
below. In the pseudocode below, firstLongDocumentFrequency is the
length of the shortest posting list that is considered long as
opposed to short per the discussion above.
[0094] Example Pseudocode for getPostingListSegmentation, Minimal
Lexicon Entry
PostingListSegmentation getPostingListSegmentation(LexiconEntry le) {
    if (le.documentFrequency >= firstLongDocumentFrequency) {
        PostingListSegmentation segmentation =
            postingListSegmentationTable.get(le.term);
        if (segmentation == null) {
            // This is a long posting list, but there is no
            // segmentation information yet.
            return a new PostingListSegmentation with the following state:
                term = le.term
                postingListSegmentLengths = trivial empty list
                complete = false
                approximate = false
        } else {
            // This is a long posting list, and segmentation info
            // is available; return it.
            return segmentation;
        }
    } else {
        // This is a short posting list. We will approximate its
        // length, because the minimal lexicon entry does not include
        // this information.
        PostingListLengthApproximation la =
            postingListLengthApproximationTable.
                getPostingListLengthApproximation(le.documentFrequency);
        // Let numPostingListStdDevs be a configured number of
        // standard deviations of posting list length
        approximatePostingListLengthBytes =
            la.averagePostingListLengthBytes +
            (numPostingListStdDevs * la.stddevPostingListLengthBytes);
        return a new PostingListSegmentation with the following state:
            term = le.term
            postingListSegmentLengths = a list of one item, the
                approximatePostingListLengthBytes computed above
            complete = true
            approximate = true
    }
}
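The length estimate used for a short posting list with a minimal lexicon entry is the range average plus a configured number of standard deviations. A small Java sketch of that arithmetic follows. The constructor, the method name approximateLengthBytes, the use of double for the statistics, and rounding up to a whole byte are assumptions made for the example.

```java
/** Sketch of the per-range statistics returned by the approximation table. */
final class PostingListLengthApproximation {
    final int rangeId;
    final double averagePostingListLengthBytes;
    final double stddevPostingListLengthBytes;

    PostingListLengthApproximation(int rangeId, double average, double stddev) {
        this.rangeId = rangeId;
        this.averagePostingListLengthBytes = average;
        this.stddevPostingListLengthBytes = stddev;
    }

    /**
     * Average plus numPostingListStdDevs standard deviations,
     * rounded up to a whole number of bytes (rounding is an assumption).
     */
    long approximateLengthBytes(double numPostingListStdDevs) {
        return (long) Math.ceil(averagePostingListLengthBytes
                + numPostingListStdDevs * stddevPostingListLengthBytes);
    }
}
```

For example, with an average of 1000 bytes, a standard deviation of 50 bytes, and two standard deviations configured, the single approximate segment length is 1100 bytes. A larger configured multiple makes it more likely that the first read covers the whole list, at the cost of transferring more data than needed.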
[0095] Example Pseudocode for getPostingListSegmentation, Extended
Lexicon Entry
PostingListSegmentation getPostingListSegmentation(LexiconEntry le) {
    if (le.documentFrequency >= firstLongDocumentFrequency) {
        PostingListSegmentation segmentation =
            postingListSegmentationTable.get(le.term);
        if (segmentation == null) {
            // This is a long posting list, but there is no
            // segmentation information yet.
            return a new PostingListSegmentation with the following state:
                term = le.term
                postingListSegmentLengths = trivial empty list
                complete = false
                approximate = false
        } else {
            // This is a long posting list, and segmentation info
            // is available; return it.
            return segmentation;
        }
    } else {
        // This is a short posting list, and we know its exact
        // length from the extended lexicon entry.
        return a new PostingListSegmentation with the following state:
            term = le.term
            postingListSegmentLengths = a list of one item, the exact
                length of the posting list in bytes, obtained from the
                LexiconEntry le either by subtracting starting from
                ending posting file offsets or simply by using an
                explicit posting list length in bytes from the lexicon
                entry
            complete = true
            approximate = false
    }
}
Enhanced Buffered Reader
[0096] The buffered reader used by the SPLR is an enhanced buffered
reader that uses configurable predetermined buffer fill size
strategies to read from secondary storage more efficiently than a
conventional buffered reader.
[0097] For a detailed discussion of enhanced buffered readers,
refer to U.S. Non-Provisional patent application entitled
"EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE
DETERMINATION" (Attorney Docket No. 1634.069A) filed concurrently
herewith.
BufferFillSizeSelectorFactory
[0098] The BufferFillSizeSelectorFactory is used to make
BufferFillSizeSelector objects for plugging into the enhanced
buffered reader. A BufferFillSizeSelector object is a predetermined
buffer fill size strategy. More specifically, a
BufferFillSizeSelector is an ordered sequence of (fillSize,
numTimesToUse) pairs, where fillSize indicates how much of an
enhanced buffered reader's internal input buffer to fill when a
buffer fill is needed, and numTimesToUse indicates how many times
to use the associated fillSize.
[0099] The object state of the BufferFillSizeSelectorFactory
includes maxBufferSize, which is the largest read system call that
can be issued, typically in bytes, based on the maximum available
input buffer size of the enhanced buffered reader. In one example,
a large maxBufferSize (of 20 megabytes or so) is used on a
commodity server with an index of 20 million web documents.
[0100] The BufferFillSizeSelectorFactory provides the following
public methods:
[0101] 1. BufferFillSizeSelectorFactory (int
maxBufferSize)--Constructs a new BufferFillSizeSelectorFactory. The
constructor simply records the maxBufferSize as object state, for
future reference.
[0102] 2. BufferFillSizeSelector makePreciseBufferFillSizeSelector
(long numBytesToRead)--This method returns a BufferFillSizeSelector
to read a precise number of bytes with a minimum number of reads,
taking into account the maxBufferSize.
[0103] 3. BufferFillSizeSelector
makeApproximateBufferFillSizeSelector (long
approximateNumBytesToRead, int supplementalReadSize)--This method
returns a BufferFillSizeSelector that will read
approximateNumBytesToRead with a minimum number of reads, and will
thereafter revert to using buffer fills of the supplementalReadSize
if more information is needed.
[0104] In the discussion that follows, let "/" represent the
operation of integer division, and "%" represent the operation of
integer modulo.
[0105] To implement makePreciseBufferFillSizeSelector, there are
two cases to consider, where numBytesToRead is the input to
makePreciseBufferFillSizeSelector, and maxBufferSize is the largest
read system call that can be issued in bytes:
[0106] Case 1: maxBufferSize>=numBytesToRead
[0107] Case 2: maxBufferSize<numBytesToRead
[0108] A discussion of these cases follows.
Case 1: maxBufferSize>=numBytesToRead
[0109] Build a one-stage predetermined buffer fill size strategy as
indicated below in Table I.
TABLE I
Stage   Fill Size         Number of Times to Use
1       numBytesToRead    1
[0110] The above strategy, when installed in an enhanced buffered
reader, will read exactly numBytesToRead bytes of data using a
single system call.
Case 2: maxBufferSize<numBytesToRead
[0111] In this case, build a predetermined buffer fill size
strategy that generally has two stages, as indicated in Table II.
However, the second stage is not necessary when the maxBufferSize
evenly divides numBytesToRead.
TABLE II
Stage   Fill Size                         Number of Times to Use
1       maxBufferSize                     numBytesToRead / maxBufferSize
2       numBytesToRead % maxBufferSize    1
[0112] The above strategy, when installed in an enhanced buffered
reader, will read exactly numBytesToRead bytes of data with the
minimum possible number of read system calls.
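Cases 1 and 2 can be sketched together in Java as shown below. The sketch returns the stages of the strategy as a list of (fillSize, numTimesToUse) pairs instead of a full BufferFillSizeSelector object; the class names FillStage and PreciseFillPlanner and the static method signature are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** One (fillSize, numTimesToUse) pair of a predetermined buffer fill size strategy. */
final class FillStage {
    final int fillSize;
    final long numTimesToUse;

    FillStage(int fillSize, long numTimesToUse) {
        this.fillSize = fillSize;
        this.numTimesToUse = numTimesToUse;
    }
}

final class PreciseFillPlanner {
    /** Returns the stages of Table I (Case 1) or Table II (Case 2). */
    static List<FillStage> makePreciseStages(long numBytesToRead, int maxBufferSize) {
        if (numBytesToRead <= 0 || maxBufferSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<FillStage> stages = new ArrayList<>();
        if (maxBufferSize >= numBytesToRead) {
            // Case 1: everything fits in a single buffer fill.
            stages.add(new FillStage((int) numBytesToRead, 1));
            return stages;
        }
        // Case 2: as many full-buffer fills as possible ...
        stages.add(new FillStage(maxBufferSize, numBytesToRead / maxBufferSize));
        // ... followed by one fill for the remainder, if there is one.
        int remainder = (int) (numBytesToRead % maxBufferSize);
        if (remainder != 0) {
            stages.add(new FillStage(remainder, 1));
        }
        return stages;
    }
}
```

For instance, reading 50 bytes with a maximum buffer size of 20 yields two fills of 20 bytes followed by one fill of 10 bytes, while reading 40 bytes yields only the first stage.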
[0113] To implement makeApproximateBufferFillSizeSelector, there
are two cases to consider, where approximateNumBytesToRead is input
to makeApproximateBufferFillSizeSelector, and maxBufferSize is the
largest read system call that can be issued in bytes:
[0114] Case 3: maxBufferSize>=approximateNumBytesToRead
[0115] Case 4: maxBufferSize<approximateNumBytesToRead
Also, recall that a supplementalReadSize is provided as input to
makeApproximateBufferFillSizeSelector. A discussion of these cases
follows.
Case 3: maxBufferSize>=approximateNumBytesToRead
[0116] Build a two-stage predetermined buffer fill size strategy as
indicated below in Table III.
TABLE III
Stage   Fill Size                    Number of Times to Use
1       approximateNumBytesToRead    1
2       supplementalReadSize         Repeat as necessary
[0117] The above strategy, when installed in an enhanced buffered
reader, will read approximateNumBytesToRead bytes of data using a
single read system call and thereafter will perform as many
additional system calls of the supplemental read size as
necessary.
Case 4: maxBufferSize<approximateNumBytesToRead
[0118] In this case, build a predetermined buffer fill size
strategy that generally has three stages, as indicated below in
Table IV. However, the second stage is not necessary when the
maxBufferSize evenly divides the approximateNumBytesToRead.
TABLE IV
Stage   Fill Size                                    Number of Times to Use
1       maxBufferSize                                approximateNumBytesToRead / maxBufferSize
2       approximateNumBytesToRead % maxBufferSize    1
3       supplementalReadSize                         Repeat as necessary
[0119] The above strategy, when installed in an enhanced buffered
reader, will read approximateNumBytesToRead bytes of data with the
minimum possible number of read system calls and thereafter will
perform as many additional system calls of the supplemental read
size as necessary.
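Cases 3 and 4 differ from the precise cases only in the trailing supplemental stage. The Java sketch below builds the stages and also shows how a reader could pick the fill size for its n-th buffer fill. The class name ApproximateFillPlanner, the encoding of each stage as a two-element long array, and the REPEAT sentinel standing for "repeat as necessary" are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ApproximateFillPlanner {
    /** Assumed sentinel meaning "repeat as necessary". */
    static final long REPEAT = -1;

    /** Each element is {fillSize, numTimesToUse}; follows Table III or Table IV. */
    static List<long[]> makeApproximateStages(
            long approximateNumBytesToRead, int supplementalReadSize, int maxBufferSize) {
        if (approximateNumBytesToRead <= 0 || supplementalReadSize <= 0 || maxBufferSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<long[]> stages = new ArrayList<>();
        if (maxBufferSize >= approximateNumBytesToRead) {
            // Case 3: one fill covers the approximate length.
            stages.add(new long[] {approximateNumBytesToRead, 1});
        } else {
            // Case 4: full-buffer fills, then the remainder if any.
            stages.add(new long[] {maxBufferSize, approximateNumBytesToRead / maxBufferSize});
            long remainder = approximateNumBytesToRead % maxBufferSize;
            if (remainder != 0) {
                stages.add(new long[] {remainder, 1});
            }
        }
        // Final stage: supplemental reads for as long as more data is needed.
        stages.add(new long[] {supplementalReadSize, REPEAT});
        return stages;
    }

    /** Fill size for the n-th buffer fill (0-based) under the given stages. */
    static long fillSizeFor(List<long[]> stages, long n) {
        for (long[] stage : stages) {
            if (stage[1] == REPEAT || n < stage[1]) {
                return stage[0];
            }
            n -= stage[1];
        }
        throw new IllegalStateException("strategy has no repeating final stage");
    }
}
```

With an approximate length of 50 bytes, a maximum buffer size of 20, and a supplemental read size of 4, successive fills would be 20, 20, 10, and then 4 bytes for every further fill.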
[0120] One example of a method of reading a posting list will now
be described with reference to the flow diagram 500 of FIG. 5. A
processor (e.g., as part of a computing unit) is used to determine
the size of a posting list as part of an inverted index search,
step 502. Once the size is determined, an inquiry is made as to
whether the size is larger than a predetermined size, inquiry 504.
If the size is equal to or smaller than the predetermined size,
i.e., not larger than the predetermined size, then the entire
posting list is read into memory as a single segment, step 506.
However, if the size is larger than the predetermined size, then
the posting list is segmented, step 508. In one example, the
segmenting uses at least one predetermined segment size. In another
example, the segmenting uses at least one estimated segment size,
and the actual read size is at least the estimated size. The
estimated segment size may be stored in a data structure for reuse.
The segments, by whatever method of segmentation, are then each
read into memory, step 506.
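The decision in inquiry 504 and the predetermined-segment-size variant of step 508 can be illustrated with the short Java sketch below. It is a simplification under stated assumptions: the class and parameter names are hypothetical, a single fixed segment size is used, and segments are cut at arbitrary byte offsets, whereas an actual reader would end each segment on a posting boundary.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentPlanner {
    /**
     * Returns the planned segment lengths in bytes for a posting list.
     * A list no larger than predeterminedSize is read as a single segment;
     * a larger list is cut into pieces of at most segmentSize bytes.
     */
    static List<Long> planSegments(long listSizeBytes, long predeterminedSize, long segmentSize) {
        if (listSizeBytes <= 0 || segmentSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Long> segments = new ArrayList<>();
        if (listSizeBytes <= predeterminedSize) {
            segments.add(listSizeBytes); // not larger: read whole list at once
            return segments;
        }
        long remaining = listSizeBytes;
        while (remaining > 0) {
            long next = Math.min(segmentSize, remaining);
            segments.add(next);
            remaining -= next;
        }
        return segments;
    }
}
```

For example, a 200-byte list with a predetermined size of 100 and a segment size of 64 is planned as segments of 64, 64, 64, and 8 bytes, while a 100-byte list under a predetermined size of 200 is read as one 100-byte segment.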
SPLR Operational Examples
[0121] Having described the SPLR and each of its subcomponents from
FIG. 3 in some detail, and providing the basic method above, some
examples of how these components work together to read posting
lists will now be presented.
[0122] In a first example, the sequence diagram 600 in FIG. 6 shows
interactions between an inverted index searcher 602,
SegmentingPostingListReader 604,
LexiconEntryToPostingListSegmentationMapper 606,
PostingListLengthApproximationTable 608, enhanced buffered reader
610, BufferFillSizeSelectorFactory 612 and evaluation logic 614 as
the inverted index searcher reads a short posting list consisting
of only a single segment. In this example, a minimal lexicon entry
is being used and therefore a PostingListLengthApproximationTable
is also used.
[0123] The inverted index searcher begins a reading session by
calling the open method 616 on the SPLR, passing in the lexicon
entry of the posting list to read. The SPLR saves a reference to
this lexicon entry as part of its state to help control the reading
session. The SPLR calls getPostingListSegmentation 618 on the
LexiconEntryToPostingListSegmentationMapper, forwarding the lexicon
entry. The LexiconEntryToPostingListSegmentationMapper examines the
document frequency of the lexicon entry and consults its method of
discriminating between long and short posting lists. The
LexiconEntryToPostingListSegmentationMapper determines that the
posting list to read is short and calls
getPostingListLengthApproximation 620 on the
PostingListLengthApproximationTable, providing as input the
document frequency of the lexicon entry. A
PostingListLengthApproximation is returned to the
LexiconEntryToPostingListSegmentationMapper 622, which then builds
a complete, approximate PostingListSegmentation, incorporating the
term from the lexicon entry and a single posting list segment
length equal to the average posting list length in bytes plus the
desired number of standard deviations from the
PostingListLengthApproximation. The
LexiconEntryToPostingListSegmentationMapper returns this newly
built PostingListSegmentation to the SPLR 624, where it becomes
part of the SPLR's state to control the reading session. The SPLR
finishes the execution of its open( ) method by initializing various
miscellaneous state variables and finally seeking the enhanced
buffered reader to the start of the posting list 626 by passing the
posting file start offset of the lexicon entry to the enhanced
buffered reader's seek method. At this point, the open method
called by the inverted index searcher returns, and the SPLR is
ready to accept a read call.
[0124] The inverted index searcher calls the SPLR's read( ) method
628. Based on the state established during the open( ) method, the
SPLR recognizes that the posting list being read consists of a
single segment with an approximate length in bytes. The SPLR
forwards the approximate number of bytes to read to the
BufferFillSizeSelectorFactory's
makeApproximateBufferFillSizeSelector method 630. A predetermined
buffer fill size strategy in the form of a BufferFillSizeSelector
object is returned to the SPLR 632, which it installs in the
enhanced buffered reader by calling setBufferFillSizeSelector 634.
The SPLR next uses the enhanced buffered reader to read all of the
postings in this relatively short posting list 636, forwarding each
posting to the evaluation logic 638. Finally, the SPLR's read
method returns false to the inverted index searcher 640, indicating
that there are no more postings available to be read, and the
inverted index searcher calls close 642 on the SPLR to end the
reading session.
[0125] In a second example, the sequence diagram 700 in FIG. 7
shows interactions between an inverted index searcher 702,
SegmentingPostingListReader 704,
LexiconEntryToPostingListSegmentationMapper 706,
PostingListSegmentationTable 708, enhanced buffered reader 710,
PostingListReadLimiter 712 and evaluation logic 714 as the inverted
index searcher reads a posting list consisting of two segments. In
this example, the segmentation of the posting list is unknown and
has to be learned as the posting list is read.
[0126] The inverted index searcher begins a reading session by
calling the open method on the SPLR 716, passing in the lexicon
entry of the posting list to read. The SPLR saves a reference to
this lexicon entry as part of its state to help control the reading
session. The SPLR calls getPostingListSegmentation on the
LexiconEntryToPostingListSegmentationMapper 718, forwarding the
lexicon entry. The LexiconEntryToPostingListSegmentationMapper
examines the document frequency of the lexicon entry and consults
its method of discriminating between long and short posting lists.
The LexiconEntryToPostingListSegmentationMapper determines that the
posting list to read is long and calls get( ) on the
PostingListSegmentationTable 720, passing in the term of the
lexicon entry as the key for the lookup. The
PostingListSegmentationTable consults its hash but finds no mapping
from the term to a PostingListSegmentation. In this scenario, the
posting list has not been read since the inverted index was
deployed, and its segmentation is unknown. The get( ) call returns
null to the LexiconEntryToPostingListSegmentationMapper 722,
indicating that no segmentation information is available. In
response, the LexiconEntryToPostingListSegmentationMapper creates a
new incomplete, precise (i.e. complete=false, approximate=false)
PostingListSegmentation, incorporating the term from the lexicon
entry, and using an empty array of posting list segment lengths.
This new empty PostingListSegmentation is returned to the SPLR 724,
where it becomes part of the SPLR's state to control the reading
session. The SPLR finishes the execution of its open( ) method by
initializing various miscellaneous state variables and finally
seeking the enhanced buffered reader to the start of the posting
list 726 by passing the posting file start offset of the lexicon
entry to the enhanced buffered reader's seek method. At this point,
the open method called by the inverted index searcher returns, and
the SPLR is ready to accept a read call.
[0127] The inverted index searcher calls the SPLR's read( ) method
728. Based on the state established during the open( ) method, the
SPLR recognizes that the posting list consists of multiple
segments, that the segment boundaries are unknown, and the segment
boundaries need to be learned. Because this is the first call to
read in this session, the SPLR forwards the value 1 (one) to the
getLimit method of the PostingListReadLimiter 730. The
PostingListReadLimiter returns a PostingListReadLimit 732, an
indication of how far the SPLR may read during this first read
call. With this information, the SPLR is almost ready to read
postings. Since the SPLR does not know the size in bytes of the
segment it is about to read, it calls setBufferFillSizeSelector 734
to install a default predetermined buffer fill size strategy on the
enhanced buffered reader that always buffers several disk blocks
worth of data whenever the buffered reader needs more data. This
strategy is acceptable for learning a new segmentation, after which
a better strategy will be available.
[0128] Before reading any postings, the SPLR is careful to note the
current logical position of the enhanced buffered reader in the
posting file 736. Knowing the read start position will allow the
SPLR to know the length of the segment later when reading stops.
The SPLR now uses the enhanced buffered reader to read postings
738, forwarding each one to the evaluation logic as soon as it is
read 740, stopping when the PostingListReadLimit is reached or at
end of posting list, whichever comes first. In this case, reading
stops because the PostingListReadLimit is reached. Once again the
SPLR gets the current logical position from the enhanced buffered
reader 742. The difference between this second logical position and
the first one that was obtained is the length of the segment just
read. The SPLR creates and remembers an updated
PostingListSegmentation object that includes the new segment length
just learned. The SPLR then passes the updated
PostingListSegmentation to the updatePostingListSegmentation method
of the LexiconEntryToPostingListSegmentationMapper 744, to preserve
the updated segmentation information for reuse by future read
sessions. The LexiconEntryToPostingListSegmentationMapper simply
forwards the PostingListSegmentation to the putRefined method of
the PostingListSegmentationTable 746, where the
PostingListSegmentation is stored for reuse. Because reading
stopped due to the PostingListReadLimit (and not due to end of
posting list), there are more postings to read and the SPLR's read
method returns true 748 to the inverted index searcher to indicate
this fact.
[0129] The inverted index searcher then calls the SPLR's read
method a second time 750. Based on the state of the SPLR after the
first read call, the SPLR recognizes that the posting list consists
of multiple segments, more postings are available, but the extent
of the next segment to read is unknown and has to be learned.
Because this is the second call to read in this session, the SPLR
forwards the value 2 (two) to the getLimit method of the
PostingListReadLimiter 752. The PostingListReadLimiter returns a
PostingListReadLimit 754, an indication of how far the SPLR may
read during this second read call. The SPLR now follows the same
steps it used during the first read call, installing a default
predetermined buffer fill size strategy on the enhanced buffered
reader 756, noting the read start position by getting the current
logical position from the enhanced buffered reader 758, and reading
postings 760 and forwarding each one to the evaluation logic 762.
As before, reading stops when the PostingListReadLimit is reached
or at end of posting list, whichever comes first. In this case,
reading stops because end of posting list is reached.
[0130] The SPLR then gets the current logical position from the
enhanced buffered reader 764. The difference between this second
logical position and the first one that was obtained is the length
of the segment just read. The SPLR creates and remembers an updated
PostingListSegmentation object that includes both the new segment
length just learned and the new knowledge that the segmentation of
this posting list is complete (complete=true). The SPLR then passes
the updated PostingListSegmentation to the
updatePostingListSegmentation method of the
LexiconEntryToPostingListSegmentationMapper 766, to preserve the
updated segmentation information for reuse by future read sessions.
The LexiconEntryToPostingListSegmentationMapper simply forwards the
PostingListSegmentation to the putRefined method of the
PostingListSegmentationTable 768, where the PostingListSegmentation
is stored for reuse. Because reading stopped this time due to end
of posting list, there are no more postings to read and the SPLR's
read method returns false to the inverted index searcher to
indicate this fact 770. Finally, the inverted index searcher calls
close to close this read session 772.
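The way segment lengths are learned in this second example, by differencing the logical position before and after each burst of reading, can be simulated with the following Java sketch. The posting file is stood in for by an array of per-posting byte lengths, and the read limits by an array of minimum byte counts whose last entry is reused for later reads; both representations, and the class name SegmentationLearner, are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentationLearner {
    /**
     * postingLengths: byte length of each posting, in list order.
     * minBytesPerRead: minimum bytes for read call 1, 2, ... (last entry reused).
     * Returns the learned segment lengths, one per read call.
     */
    static List<Integer> learn(int[] postingLengths, int[] minBytesPerRead) {
        if (minBytesPerRead.length == 0) {
            throw new IllegalArgumentException("at least one read limit is required");
        }
        List<Integer> segmentLengths = new ArrayList<>();
        int next = 0;      // index of the next posting to read
        int position = 0;  // logical position relative to the start of the list
        int readNum = 0;   // 1 during the first read call, 2 during the second, ...
        while (next < postingLengths.length) {
            readNum++;
            int limit = minBytesPerRead[Math.min(readNum, minBytesPerRead.length) - 1];
            if (limit <= 0) {
                throw new IllegalArgumentException("read limits must be positive");
            }
            int start = position; // note the read start position
            // Read whole postings until the limit is reached or the list ends.
            while (next < postingLengths.length && position - start < limit) {
                position += postingLengths[next++];
            }
            // Segment length is the difference between the two positions.
            segmentLengths.add(position - start);
        }
        return segmentLengths;
    }
}
```

With five 3-byte postings and limits of 4 and 8 bytes, the first read stops after 6 bytes (the limit is reached) and the second stops after a further 9 bytes (end of list), so the learned segmentation is two segments of 6 and 9 bytes. Each learned length always falls on a posting boundary because only whole postings are consumed.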
[0131] In a third example, the sequence diagram 800 in FIG. 8 shows
interactions between an inverted index searcher 802,
SegmentingPostingListReader 804,
LexiconEntryToPostingListSegmentationMapper 806,
PostingListSegmentationTable 808, enhanced buffered reader 810,
BufferFillSizeSelectorFactory 812 and evaluation logic 814 as the
inverted index searcher reads a posting list consisting of two
segments. In this example, the segmentation of the posting list is
known. This scenario shows the benefit of learning and reusing
posting list segmentations for large, frequently accessed posting
lists.
[0132] The inverted index searcher begins a reading session by
calling the open method on the SPLR 816, passing in the lexicon
entry of the posting list to read. The SPLR saves a reference to
this lexicon entry as part of its state to help control the reading
session. The SPLR calls getPostingListSegmentation on the
LexiconEntryToPostingListSegmentationMapper 818, forwarding the
lexicon entry. The LexiconEntryToPostingListSegmentationMapper
examines the document frequency of the lexicon entry and consults
its method of discriminating between long and short posting lists.
The LexiconEntryToPostingListSegmentationMapper determines that the
posting list to read is long and calls get( ) on the
PostingListSegmentationTable 820, passing in the term of the
lexicon entry as the key for the lookup. The
PostingListSegmentationTable consults its hash and finds that the
term is mapped to a complete, precise (i.e. complete=true,
approximate=false) PostingListSegmentation with 2 segments. The
get( ) call returns this PostingListSegmentation to the
LexiconEntryToPostingListSegmentationMapper 822, which in turn
simply returns it to the SPLR 824, where it becomes part of the
SPLR's state to control the reading session. The SPLR finishes the
execution of its open( ) method by initializing various
miscellaneous state variables and finally seeking the enhanced
buffered reader to the start of the posting list 826 by passing the
posting file start offset of the lexicon entry to the enhanced
buffered reader's seek method. At this point, the open method
called by the inverted index searcher returns, and the SPLR is
ready to accept a read call.
[0133] The inverted index searcher calls the SPLR's read( ) method
828. Based on the state established during the open( ) method, the
SPLR recognizes that the posting list being read consists of two
segments of known sizes in bytes. The SPLR forwards the exact size
in bytes of the first segment to the
BufferFillSizeSelectorFactory's makePreciseBufferFillSizeSelector
method 830. A predetermined buffer fill size strategy in the form
of a BufferFillSizeSelector object is returned to the SPLR 832,
which it installs in the enhanced buffered reader by calling
setBufferFillSizeSelector 834. The SPLR next uses the enhanced
buffered reader to read all of the postings in the first segment of
this posting list 836, forwarding each posting to the evaluation
logic 838. Finally, the SPLR's read method returns true to the
inverted index searcher 840, indicating that there are more
postings available to be read.
[0134] The inverted index searcher again calls the SPLR's read( )
method 842. Based on the state after the first read call, the SPLR
recognizes that there is another segment of known size in bytes
available to read. The SPLR forwards the exact size in bytes of the
second segment to the BufferFillSizeSelectorFactory's
makePreciseBufferFillSizeSelector method 844. A predetermined
buffer fill size strategy in the form of a BufferFillSizeSelector
object is returned to the SPLR 846, which it installs in the
enhanced buffered reader by calling setBufferFillSizeSelector 848.
The SPLR next uses the enhanced buffered reader to read all of the
postings in the second segment of this posting list 850, forwarding
each posting to the evaluation logic 852. Finally, the SPLR's read
method returns false 854 to the inverted index searcher, indicating
that there are no more postings available to be read, and the
inverted index searcher closes the read session by calling close( )
on the SPLR 856.
Common Pseudocode for a SPLR Implementation
[0135] The SPLR and its subcomponents, pictured in FIG. 3, were
described above. Note that flexibility in the design of the SPLR
allows for several different scenarios, for example:
[0136] 1. Minimal or extended lexicon entry, which implies the
presence or absence of the PostingListLengthApproximationTable,
which in turn requires differences in implementation of the
LexiconEntryToPostingListSegmentationMapper;
[0137] 2. Different posting list organizations (e.g., score sorted
or document id sorted);
[0138] 3. Flexibility of discriminating between "short" and "long"
posting lists; exact method to be chosen by implementer; and
[0139] 4. Different ways of implementing PostingListReadLimit
(e.g., using a sequence of limiting scores or a sequence of
limiting read sizes).
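As an illustration of scenario 4, a read limiter driven by a sequence of limiting read sizes might be sketched as follows. This is a minimal sketch under stated assumptions: the class name SequenceReadLimiter, the use of a plain byte count as the limit, and the rule that reads beyond the end of the sequence reuse the last limit are all hypothetical and are not taken from the described embodiment.

```java
import java.util.Arrays;

/** Hypothetical read limiter: one byte limit per read call (readNum starts at 1). */
final class SequenceReadLimiter {
    private final long[] limitsInBytes;

    SequenceReadLimiter(long... limitsInBytes) {
        if (limitsInBytes.length == 0) {
            throw new IllegalArgumentException("at least one limit is required");
        }
        // Defensive copy so later changes to the caller's array have no effect.
        this.limitsInBytes = Arrays.copyOf(limitsInBytes, limitsInBytes.length);
    }

    /** Returns the limit for the given read; reads past the sequence reuse the last limit (assumption). */
    long getLimit(int readNum) {
        if (readNum < 1) {
            throw new IllegalArgumentException("readNum starts at 1");
        }
        int index = Math.min(readNum, limitsInBytes.length) - 1;
        return limitsInBytes[index];
    }
}
```

With limits that grow from one read to the next, early reads stay small for queries that are satisfied quickly, while later reads cover more of the posting list per call.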
[0140] The pseudocode below applies equally to all the scenarios
listed above; thus, it is the common pseudocode for a SPLR
implementation.
[0141] As a prerequisite to understanding the pseudocode for
methods of the SPLR, it is helpful to first understand the data
members that are part of its state. The following data members are
initialized by sending object references to the SPLR's constructor.
[0142] 1. BufferedReader bufferedReader--This is an enhanced
buffered reader that is open over the posting file. It has a large
input buffer (perhaps 20 MB for a large scale index on a commodity
server);
[0143] 2. LexiconEntryToPostingListSegmentationMapper
lexiconEntryToPostingListSegmentationMapper;
[0144] 3. BufferFillSizeSelectorFactory bufferFillSizeSelectorFactory;
and
[0145] 4. PostingListReadLimiter postingListReadLimiter.
[0146] The SPLR has additional state that is set up as part of a
call to its open (lexiconEntry) method. These data members are
documented here.
[0147] 5. LexiconEntry lexiconEntry--Lexicon entry of the posting
list to read;
[0148] 6. PostingListSegmentation pls--The most complete
segmentation of this posting list currently available;
[0149] 7. int readNum--A count of how many times the read method
has been called. This variable is set to 1 throughout the first
call to read( ), to 2 throughout the second call to read( ), and so
on;
[0150] 8. boolean done--Whether the end of the posting list has
been reached; and
[0151] 9. int numPostingsRead--The number of postings that have
been read.
[0152] The SPLR has three public methods:
[0153] 1. void open (LexiconEntry lexiconEntry)
[0154] 2. boolean read( )
[0155] 3. void close( )
[0156] Open( ) should be called first to prepare for reading. Read(
) may be called multiple times. Each call to read( ) reads a
segment of postings, and the boolean return value indicates whether
there is another segment available. Finally, a well behaved client
calls close( ) to signal the end of the reading session.
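The calling sequence described above (open once, read until false is returned, then close) can be sketched from the client's side as follows. This is an illustrative sketch only: the SegmentedReader interface, the FixedSegmentReader toy implementation, the ReadSession helper, and the use of a String in place of a LexiconEntry are hypothetical stand-ins, not part of the described embodiment.

```java
/** Hypothetical stand-in for the SPLR's three public methods. */
interface SegmentedReader {
    void open(String lexiconEntry);
    boolean read();   // reads one segment; true if another segment is available
    void close();
}

/** Toy reader that pretends every posting list has a fixed number of segments. */
final class FixedSegmentReader implements SegmentedReader {
    private final int totalSegments;
    private int segmentsRead;
    boolean closed;

    FixedSegmentReader(int totalSegments) {
        this.totalSegments = totalSegments;
    }

    @Override public void open(String lexiconEntry) {
        segmentsRead = 0;   // open() fully resets the session state
        closed = false;
    }

    @Override public boolean read() {
        if (segmentsRead < totalSegments) {
            segmentsRead++;   // "read" one segment
        }
        return segmentsRead < totalSegments;
    }

    @Override public void close() {
        closed = true;
    }

    int segmentsRead() {
        return segmentsRead;
    }
}

final class ReadSession {
    /** Calls open once, read until it returns false, then close; returns the number of read calls. */
    static int readAll(SegmentedReader reader, String lexiconEntry) {
        reader.open(lexiconEntry);
        int calls = 0;
        try {
            boolean more;
            do {
                more = reader.read();
                calls++;
            } while (more);
        } finally {
            reader.close();   // a well behaved client always closes the session
        }
        return calls;
    }
}
```

A real client need not read every segment: it may stop calling read( ) as soon as the evaluation logic has seen enough postings, and then call close( ).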
[0157] The pseudocode below is Java-like. Java operators and
Java-like syntax are used, and array indexes start at 0. Example
pseudocode for each of the SPLR's public methods follows.
[0158] Example pseudocode for open method

public void open(LexiconEntry aLexiconEntry) {
    // Set the SPLR data member, lexiconEntry, based on the
    // aLexiconEntry passed in
    lexiconEntry = aLexiconEntry;
    // Set the SPLR data member, pls, by lookup in the mapper
    pls = lexiconEntryToPostingListSegmentationMapper.
        getPostingListSegmentation(lexiconEntry);
    // Initialize various other SPLR data members
    readNum = 0;
    done = false;
    numPostingsRead = 0;
    // Seek enhanced buffered reader to start of posting list
    bufferedReader.seek(lexiconEntry.postingFileStartOffset);
}
[0159] Example pseudocode for read method

public boolean read( ) {
    // Reminder: This method returns true if there are more postings
    // to read and false otherwise.
    readNum = readNum + 1;
    if (done) {
        return false; // nothing else to read
    }
    // NOTE: && is logical AND; == is the equality test
    if ( pls.complete && pls.approximate && (pls.numSegments( ) == 1) ) {
        // This is a short posting list for which we have an
        // approximate size in bytes.
        // Any read beyond the first segment is trying to go too far.
        if ( readNum > 1 ) {
            done = true; // just to be sure
            return false; // nothing else to read
        }
        // readNum is 1. This is the first read of a 1-segment list.
        readShortPostingList(pls.postingListSegmentLengths[readNum-1],
            lexiconEntry.documentFrequency);
        done = true;
        return false; // nothing else to read
    } else if (! pls.approximate) { // NOTE: ! means logical NOT
        // The segmentation has or will have precise information.
        // A precise segmentation has been or will be learned.
        // The segmentation may or may not be complete at this time.
        if (readNum <= pls.numSegments( )) {
            // Segment size for this read is known.
            readPostingListSegment(pls.postingListSegmentLengths[readNum-1]);
            // If the last segment has been read, set the done flag
            if ( (readNum == pls.numSegments( )) && pls.complete) {
                done = true;
            }
            return (! done); // whether more postings
        }
        // If the program gets here, readNum > pls.numSegments
        // The program is trying to read past known segmentation info.
        if (pls.complete) {
            // There's nothing else to learn.
            done = true;
            return false; // nothing else to read
        }
        // If the program gets here, there ARE more postings.
        // There's segmentation information to be learned.
        PostingListReadLimit readLimit =
            postingListReadLimiter.getLimit(readNum);
        readAndLearnSegmentation(readLimit);
        // pls has been maintained by the call to readAndLearnSegmentation
        // If the last segment has been read, set the done flag
        if ( (readNum == pls.numSegments( )) && pls.complete) {
            done = true;
        }
        return (! done); // whether more postings
    } else {
        // This should never happen, if the
        // LexiconEntryToPostingListSegmentationMapper
        // is building valid PostingListSegmentations.
        Log an error;
        done = true;
        return false; // nothing else to read
    }
}

private void readShortPostingList (int approximateNumBytesToRead,
                                   int documentFrequency) {
    BufferFillSizeSelector bufferFillSizeSelector =
        bufferFillSizeSelectorFactory.
            makeApproximateBufferFillSizeSelector(approximateNumBytesToRead,
                supplementalReadSize( ));
    bufferedReader.setBufferFillSizeSelector(bufferFillSizeSelector);
    while (numPostingsRead < documentFrequency) {
        use bufferedReader to read posting;
        forward posting to evaluation logic;
        numPostingsRead = numPostingsRead + 1;
    }
}

private int supplementalReadSize( ) {
    // Return the number of bytes to read for the relatively rare case
    // when the approximate covering read size for a short posting list
    // was insufficient. A value like several kilobytes is fine.
}

private void readPostingListSegment(int numBytesToRead) {
    BufferFillSizeSelector bufferFillSizeSelector =
        bufferFillSizeSelectorFactory.
            makePreciseBufferFillSizeSelector(numBytesToRead);
    bufferedReader.setBufferFillSizeSelector(bufferFillSizeSelector);
    // Get the current logical offset of the buffered reader from start
    // of data.
    long startOffset = bufferedReader.offset( );
    // We want to read to here.
    long targetOffset = startOffset + numBytesToRead;
    while (bufferedReader.offset( ) < targetOffset) {
        use bufferedReader to read posting;
        forward posting to evaluation logic;
        numPostingsRead = numPostingsRead + 1;
    }
}

private void readAndLearnSegmentation(PostingListReadLimit readLimit) {
    // A segmentation strategy is being learned here.
    // We do not know a better strategy to use.
    bufferedReader.setBufferFillSizeSelector(getTraditionalBufferingStrategy( ));
    // Get current logical position within posting data
    long startOffset = bufferedReader.offset( );
    while (readLimit has not been exceeded &&
           numPostingsRead < lexiconEntry.documentFrequency) {
        use bufferedReader to read posting;
        forward posting to evaluation logic;
        numPostingsRead = numPostingsRead + 1;
    }
    long endOffset = bufferedReader.offset( );
    long newSegmentLength = endOffset - startOffset;
    PostingListSegmentation newPls = a copy of the pls data member,
        with the following adjustments applied:
        1. A single additional element has been added to the
           postingListSegmentLengths[ ] array: the newSegmentLength
           learned above
        2. if the while loop above reached end of posting list, i.e.
           (numPostingsRead == lexiconEntry.documentFrequency)
           then set complete = true
    // Save the segmentation we learned for reuse, to read more
    // intelligently next time.
    lexiconEntryToPostingListSegmentationMapper.
        updatePostingListSegmentation(newPls);
    // And don't forget to maintain this object's state.
    pls = newPls;
}

private BufferFillSizeSelector getTraditionalBufferingStrategy( ) {
    // This method returns a buffering strategy that says to buffer
    // several disk blocks whenever data needs to be read from the
    // operating system, until further notice. We use this buffering
    // strategy only while learning a segment boundary for the first
    // time.
}
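The learning step performed by readAndLearnSegmentation can be illustrated with a small, runnable model. The sketch below is an assumption-laden simplification rather than the described implementation: postings are represented only by their sizes in bytes held in an array, the read limit is a plain byte count (a limiting score could be used instead), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of learning a segmentation; postings are just byte sizes held in memory. */
final class SegmentationLearner {
    private final int[] postingSizes;   // size in bytes of each posting, in file order
    final List<Long> segmentLengths = new ArrayList<>();
    boolean complete;
    private long offset;                // logical offset from the start of the posting list
    private int numPostingsRead;

    SegmentationLearner(int[] postingSizes) {
        this.postingSizes = postingSizes.clone();
    }

    /** Reads postings until the byte limit is reached or the list ends, then records the segment. */
    void readAndLearnSegment(long byteLimit) {
        long startOffset = offset;
        while (offset - startOffset < byteLimit && numPostingsRead < postingSizes.length) {
            offset += postingSizes[numPostingsRead];   // "read" one whole posting
            numPostingsRead++;
        }
        segmentLengths.add(offset - startOffset);      // the newly learned segment length
        if (numPostingsRead == postingSizes.length) {
            complete = true;                           // end of posting list reached
        }
    }

    /** Learns the whole segmentation by reading repeatedly with the same positive limit. */
    static List<Long> learn(int[] postingSizes, long byteLimit) {
        if (byteLimit <= 0) {
            throw new IllegalArgumentException("byteLimit must be positive");
        }
        SegmentationLearner learner = new SegmentationLearner(postingSizes);
        while (!learner.complete) {
            learner.readAndLearnSegment(byteLimit);
        }
        return learner.segmentLengths;
    }
}
```

Because postings are never split, a learned segment may run slightly past the limit. Once learned, the exact segment lengths are what a later read session would pass to readPostingListSegment.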
[0160] Example pseudocode for close method

public void close( ) {
    // In this basic implementation, close does not need to do anything.
    // The next call to open( ) will fully reset all SPLR state
    // for the next reading session. We assume it is OK to leave the
    // buffered reader open over the posting file between read sessions.
}
[0161] As evident in the pseudocode above, the implementation of
the SPLR's read method has to handle different cases defined by the
combination of the PostingListSegmentation state and the readNum.
Recall that the readNum is 1 throughout the first call to read, 2
throughout the second call to read, and so on. The combination of
the PostingListSegmentation (pls) state and the readNum defines
cases as described in Table V below.
TABLE V

pls.complete | pls.approximate | readNum vs. pls.numSegments | Comments
true  | true  | >  | Short posting list, single approximately sized segment. The client is trying to read too far.
true  | true  | <= | Short posting list, single approximately sized segment. About to read the first and only segment.
true  | false | >  | Posting list could be long or short. Its segmentation is complete, and the client is trying to read beyond the end of the list.
true  | false | <= | Posting list could be long or short. Its segmentation is complete, and the size of the next segment to read is known.
false | true  | >  | Invalid state. Incomplete approximate PostingListSegmentations are never created.
false | true  | <= | Invalid state. Incomplete approximate PostingListSegmentations are never created.
false | false | >  | The current posting list is long and incompletely segmented, and the next read will learn a new segmentation.
false | false | <= | The current posting list is long and incompletely segmented, but the size of the next segment to read is known.
[0162] The definition of the cases in Table V depends upon how
PostingListSegmentation objects are created by the
LexiconEntryToPostingListSegmentationMapper. An awareness of this
dependency is helpful for understanding and possibly evolving the
pseudocode that was presented.
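The eight rows of Table V can also be expressed as a small decision function, which may help when evolving the read method. The following is a sketch only: the TableVCases class, the ReadCase enum, and its constant names are hypothetical labels chosen for illustration, and the function classifies a state without performing any reading.

```java
/** Hypothetical classifier for the rows of Table V. */
final class TableVCases {
    enum ReadCase {
        SHORT_LIST_READ_TOO_FAR,
        SHORT_LIST_READ_ONLY_SEGMENT,
        COMPLETE_READ_TOO_FAR,
        COMPLETE_READ_KNOWN_SEGMENT,
        INVALID,
        INCOMPLETE_LEARN_NEW_SEGMENT,
        INCOMPLETE_READ_KNOWN_SEGMENT
    }

    /** Maps (pls.complete, pls.approximate, readNum vs. pls.numSegments) to a Table V row. */
    static ReadCase classify(boolean complete, boolean approximate,
                             int readNum, int numSegments) {
        boolean pastKnownSegments = readNum > numSegments;
        if (approximate) {
            if (!complete) {
                // Incomplete approximate segmentations are never created.
                return ReadCase.INVALID;
            }
            return pastKnownSegments
                    ? ReadCase.SHORT_LIST_READ_TOO_FAR
                    : ReadCase.SHORT_LIST_READ_ONLY_SEGMENT;
        }
        if (complete) {
            return pastKnownSegments
                    ? ReadCase.COMPLETE_READ_TOO_FAR
                    : ReadCase.COMPLETE_READ_KNOWN_SEGMENT;
        }
        return pastKnownSegments
                ? ReadCase.INCOMPLETE_LEARN_NEW_SEGMENT
                : ReadCase.INCOMPLETE_READ_KNOWN_SEGMENT;
    }
}
```

Both invalid rows of the table collapse into the single INVALID value here, since the comparison of readNum with pls.numSegments does not matter for them.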
Persistence of PostingListSegmentationTable
[0163] The PostingListSegmentationTable is updated dynamically as
the SPLR's read method is called. When the search service shuts
down, the PostingListSegmentationTable is preferably saved to disk
or another nonvolatile storage medium. To avoid losing the work of
learning segmentations, the PostingListSegmentationTable could also
be saved automatically at regular intervals (for example, every 5
or 10 minutes) if it has become dirty, that is, if it has changed
since it was last saved.
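One way to realize a "save only if dirty" policy is sketched below. The sketch makes several assumptions that do not come from the description: the table is keyed by the posting list's start offset in the posting file, it is stored as one text line per posting list, and the class name SegmentationTableStore and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical table: posting list start offset -> learned segment lengths in bytes. */
final class SegmentationTableStore {
    private final Map<Long, long[]> table = new TreeMap<>();
    private boolean dirty;

    /** Records a learned segmentation and marks the table as changed. */
    synchronized void update(long postingFileStartOffset, long[] segmentLengths) {
        table.put(postingFileStartOffset, segmentLengths.clone());
        dirty = true;
    }

    /** Writes the table only if it changed since the last save; returns true if a write happened. */
    synchronized boolean saveIfDirty(Path file) {
        if (!dirty) {
            return false;
        }
        StringBuilder text = new StringBuilder();
        for (Map.Entry<Long, long[]> entry : table.entrySet()) {
            text.append(entry.getKey());
            for (long length : entry.getValue()) {
                text.append(' ').append(length);
            }
            text.append('\n');
        }
        try {
            Files.write(file, text.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        dirty = false;
        return true;
    }
}
```

A periodic task, for example one scheduled with a java.util.concurrent.ScheduledExecutorService, could call saveIfDirty every few minutes and once more at shutdown.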
Performing Index Maintenance
[0164] If the inverted index changes, the
PostingListSegmentationTable becomes invalid. On any index
maintenance, all persistent and in-memory copies of this table must
be deleted. The system can then re-learn the up-to-date
segmentations.
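The invalidation step can be sketched in a few lines. This is a hypothetical helper under stated assumptions: the in-memory table is held in a java.util.Map and there is a single persisted copy at a known path; neither detail is specified in the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical helper that discards learned segmentations after index maintenance. */
final class SegmentationInvalidator {
    /** Clears the in-memory table and deletes the persisted copy, if one exists. */
    static void invalidate(Map<?, ?> inMemoryTable, Path persistedCopy) {
        inMemoryTable.clear();
        try {
            Files.deleteIfExists(persistedCopy);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

After this call the table is empty, so subsequent searches learn segmentations against the updated index.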
[0165] The present invention includes the following aspects:
[0166] 1. Learning posting list segmentation strategy dynamically
as the search system executes searches.
[0167] 2. Supporting a plug-in (PostingListReadLimiter in the above
example) that is used by the posting list reader to determine how
to segment large posting lists into pieces. This
PostingListReadLimiter can be tailored to fit the posting list
organization and query logic.
[0168] 3. Minimizing access to secondary storage via dynamically
crafted read strategies for large posting lists, rather than a
traditional buffered reader.
[0169] 4. Supporting optional use of a
PostingListLengthApproximationTable (as described above) if lexicon
entries contain minimal information (to save space in memory) and
do not include the length of the posting list in bytes.
[0170] 5. Using an enhanced BufferedReader (as described in section
above) to plug application-specific knowledge of good read sizes
into the low level I/O system.
[0171] The shortcomings of the prior art are overcome and
additional advantages are provided through the provision of a
computer program product for efficient reading of posting lists as
part of inverted index searching. The computer program product
comprises a storage medium readable by a processor and storing
instructions for execution by a processor for performing a method.
The method includes, for instance, determining by a processor a
size of a posting list as part of searching an inverted index,
segmenting the posting list by the processor for reading into a
plurality of segments based on the size, and reading by the
processor each of the plurality of segments into memory.
[0172] Methods and systems relating to one or more aspects of the
present invention are also described and claimed herein.
[0173] Additional features and advantages are realized through the
techniques of the present invention. Other embodiments and aspects
of the invention are described in detail herein and are considered
a part of the claimed invention.
[0174] In one aspect of the present invention, an application can
be deployed for performing one or more aspects of the present
invention. As one example, the deploying of an application
comprises providing computer infrastructure operable to perform one
or more aspects of the present invention.
[0175] As a further aspect of the present invention, a computing
infrastructure can be deployed comprising integrating computer
readable code into a computing system, in which the code in
combination with the computing system is capable of performing one
or more aspects of the present invention.
[0176] As yet a further aspect of the present invention, a process
for integrating computing infrastructure comprising integrating
computer readable code into a computer system may be provided. The
computer system comprises a computer readable medium, in which the
computer medium comprises one or more aspects of the present
invention. The code in combination with the computer system is
capable of performing one or more aspects of the present
invention.
[0177] As will be appreciated by one skilled in the art, aspects of
the present invention may be embodied as a system, method or
computer program product. Accordingly, aspects of the present
invention may take the form of an entirely hardware embodiment, an
entirely software embodiment (including firmware, resident
software, micro-code, etc.) or an embodiment combining software and
hardware aspects that may all generally be referred to herein as a
"circuit," "module" or "system." Furthermore, aspects of the
present invention may take the form of a computer program product
embodied in one or more computer readable medium(s) having computer
readable program code embodied thereon.
[0178] Any combination of one or more computer readable medium(s)
may be utilized. The computer readable medium may be a computer
readable storage medium. A computer readable storage medium may be,
for example, but not limited to, an electronic, magnetic, optical,
or semiconductor system, apparatus, or device, or any suitable
combination of the foregoing. More specific examples (a
non-exhaustive list) of the computer readable storage medium
include the following: an electrical connection having one or more
wires, a portable computer diskette, a hard disk, a random access
memory (RAM), a read-only memory (ROM), an erasable programmable
read-only memory (EPROM or Flash memory), an optical fiber, a
portable compact disc read-only memory (CD-ROM), an optical storage
device, a magnetic storage device, or any suitable combination of
the foregoing. In the context of this document, a computer readable
storage medium may be any tangible medium that can contain or store
a program for use by or in connection with an instruction execution
system, apparatus, or device.
[0179] In one example, a computer program product includes, for
instance, one or more computer readable media to store computer
readable program code means or logic thereon to provide and
facilitate one or more aspects of the present invention. The
computer program product can take many different physical forms,
for example, disks, platters, flash memory, etc.
[0180] Program code embodied on a computer readable medium may be
transmitted using an appropriate medium, including but not limited
to wireless, wireline, optical fiber cable, RF, etc., or any
suitable combination of the foregoing.
[0181] Computer program code for carrying out operations for
aspects of the present invention may be written in any combination
of one or more programming languages, including an object oriented
programming language, such as Java, Smalltalk, C++ or the like, and
conventional procedural programming languages, such as the "C"
programming language or similar programming languages. The program
code may execute entirely on the user's computer, partly on the
user's computer, as a stand-alone software package, partly on the
user's computer and partly on a remote computer or entirely on the
remote computer or server. In the latter scenario, the remote
computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area
network (WAN), or the connection may be made to an external
computer (for example, through the Internet using an Internet
Service Provider).
[0182] Aspects of the present invention are described herein with
reference to flowchart illustrations and/or block diagrams of
methods, apparatus (systems) and computer program products
according to embodiments of the invention. It will be understood
that each block of the flowchart illustrations and/or block
diagrams, and combinations of blocks in the flowchart illustrations
and/or block diagrams, can be implemented by computer program
instructions. These computer program instructions may be provided
to a processor of a general purpose computer, special purpose
computer, or other programmable data processing apparatus to
produce a machine, such that the instructions, which execute via
the processor of the computer or other programmable data processing
apparatus, create means for implementing the functions/acts
specified in the flowchart and/or block diagram block or
blocks.
[0183] These computer program instructions may also be stored in a
computer readable medium that can direct a computer, other
programmable data processing apparatus, or other devices to
function in a particular manner, such that the instructions stored
in the computer readable medium produce an article of manufacture
including instructions which implement the function/act specified
in the flowchart and/or block diagram block or blocks.
[0184] The computer program instructions may also be loaded onto a
computer, other programmable data processing apparatus, or other
devices to cause a series of operational steps to be performed on
the computer, other programmable apparatus or other devices to
produce a computer implemented process such that the instructions
which execute on the computer or other programmable apparatus
provide processes for implementing the functions/acts specified in
the flowchart and/or block diagram block or blocks.
[0185] The flowchart and block diagrams in the figures illustrate
the architecture, functionality, and operation of possible
implementations of systems, methods and computer program products
according to various embodiments of the present invention. In this
regard, each block in the flowchart or block diagrams may represent
a module, segment, or portion of code, which comprises one or more
executable instructions for implementing the specified logical
function(s). It should also be noted that, in some alternative
implementations, the functions noted in the block may occur out of
the order noted in the figures. For example, two blocks shown in
succession may, in fact, be executed substantially concurrently, or
the blocks may sometimes be executed in the reverse order,
depending upon the functionality involved. It will also be noted
that each block of the block diagrams and/or flowchart
illustration, and combinations of blocks in the block diagrams
and/or flowchart illustration, can be implemented by special
purpose hardware-based systems that perform the specified functions
or acts, or combinations of special purpose hardware and computer
instructions.
[0186] A data processing system 900, as shown in FIG. 9, suitable
for storing and/or executing program code may be provided. The
system includes at least one processor 902 coupled directly or
indirectly to memory elements 904 through a system bus 906. The
memory elements include, for instance, local memory employed during
actual execution of the program code, bulk storage, and cache
memory which provide temporary storage of at least some program
code in order to reduce the number of times code must be retrieved
from bulk storage during execution.
[0187] Input/Output or I/O devices 908 (including, but not limited
to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs,
thumb drives and other memory media, etc.) can be coupled to the
system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the
data processing system to become coupled to other data processing
systems or remote printers or storage devices through intervening
private or public networks. Modems, cable modems, and Ethernet
cards are just a few of the available types of network
adapters.
[0188] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
the invention. As used herein, the singular forms "a", "an" and
"the" are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises" and/or "comprising", when used in this
specification, specify the presence of stated features, integers,
steps, operations, elements, and/or components, but do not preclude
the presence or addition of one or more other features, integers,
steps, operations, elements, components and/or groups thereof.
[0189] The corresponding structures, materials, acts, and
equivalents of all means or step plus function elements in the
claims below, if any, are intended to include any structure,
material, or act for performing the function in combination with
other claimed elements as specifically claimed. The description of
the present invention has been presented for purposes of
illustration and description, but is not intended to be exhaustive
or limited to the invention in the form disclosed. Many
modifications and variations will be apparent to those of ordinary
skill in the art without departing from the scope and spirit of the
invention. The embodiment was chosen and described in order to best
explain the principles of the invention and the practical
application, and to enable others of ordinary skill in the art to
understand the invention for various embodiments with various
modifications as are suited to the particular use contemplated.
* * * * *