U.S. patent application number 12/854726, for estimation of postings list length in a search system using an approximation table, was published by the patent office on 2011-02-17. This patent application is currently assigned to GLOBALSPEC, INC. The invention is credited to Jeff J. Dalton and Steinar Flatland.
United States Patent Application 20110040761
Kind Code: A1
Application Number: 12/854726
Family ID: 43589199
Inventors: Flatland, Steinar; et al.
Publication Date: February 17, 2011
ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN
APPROXIMATION TABLE
Abstract
The present invention provides a method of minimizing accesses
to secondary storage when searching an inverted index for a search
term. The method comprises automatically obtaining a predetermined
size of a posting list for the search term, the predetermined size
based on document frequency for the search term, wherein the
posting list is stored in secondary storage, and reading at least a
portion of the posting list into memory based on the predetermined
size. Corresponding computer system and program products are also
provided.
Inventors: Flatland, Steinar (Clifton Park, NY); Dalton, Jeff J. (Northampton, MA)
Correspondence Address: HESLIN ROTHENBERG FARLEY & MESITI PC, 5 Columbia Circle, Albany, NY 12203, US
Assignee: GLOBALSPEC, INC. (East Greenbush, NY)
Family ID: 43589199
Appl. No.: 12/854726
Filed: August 11, 2010
Related U.S. Patent Documents

Application Number | Filing Date  | Patent Number
61233411           | Aug 12, 2009 |
61233420           | Aug 12, 2009 |
61233427           | Aug 12, 2009 |
Current U.S. Class: 707/737; 707/769; 707/802; 707/E17.014; 707/E17.089
Current CPC Class: G06F 16/319 20190101
Class at Publication: 707/737; 707/769; 707/802; 707/E17.014; 707/E17.089
International Class: G06F 17/30 20060101 G06F 17/30
Claims
1. A method of minimizing accesses to secondary storage when
searching an inverted index for a search term, the method
comprising: obtaining by at least one computing unit a
predetermined size of a posting list for the search term, the
predetermined size based on document frequency for the search term,
wherein the posting list is stored in secondary storage; and
reading by the at least one computing unit at least a portion of
the posting list into memory based on the predetermined size.
2. The method of claim 1, wherein the size is a length in
bytes.
3. The method of claim 1, wherein if the size obtained is a
predetermined minimum size or less, then the reading comprises
reading all of the posting list at once.
4. The method of claim 3, wherein the reading comprises issuing a
single read system call.
5. The method of claim 3, wherein the predetermined minimum size
comprises a size of a main memory input buffer.
6. The method of claim 1, wherein if the predetermined size is
greater than a predetermined minimum size, then the reading
comprises performing a plurality of read operations.
7. The method of claim 6, wherein the predetermined minimum size
comprises a size of a main memory input buffer.
8. The method of claim 7, wherein the performing comprises filling
a largest available main memory input buffer a minimum number of
times.
9. The method of claim 1, wherein the obtaining comprises:
partitioning all posting lists in the inverted index into a
plurality of non-overlapping ranges, each range having a minimum
document frequency and a maximum document frequency; assigning a
range ID to each posting list; and using the range ID to look up
the predetermined size.
10. The method of claim 9, wherein each successive maximum document
frequency is twice that of an immediate prior one.
11. A computer system for minimizing accesses to secondary storage
when searching an inverted index for a search term, the computer
system comprising: a memory; and a processor in communication with
the memory to perform a method, the method comprising: obtaining a
predetermined size of a posting list for the search term based on
document frequency for the search term, wherein the posting list is
stored in secondary storage; and reading at least a portion of the
posting list into memory based on the predetermined size.
12. The system of claim 11, wherein the size is a length in
bytes.
13. The system of claim 11, wherein if the size obtained is a
predetermined minimum size or less, then the reading comprises
reading all of the posting list at once.
14. The system of claim 13, wherein the reading comprises issuing a
single read system call.
15. The system of claim 13, wherein the predetermined minimum size
comprises a size of a main memory input buffer.
16. The system of claim 11, wherein if the predetermined size is
greater than a predetermined minimum size, then the reading
comprises performing a plurality of read operations.
17. The system of claim 16, wherein the predetermined minimum size
comprises a size of a main memory input buffer.
18. The system of claim 17, wherein the performing comprises
filling a largest available main memory input buffer a minimum
number of times.
19. The system of claim 11, wherein the obtaining comprises:
partitioning all posting lists in the inverted index into a
plurality of non-overlapping ranges, each range having a minimum
document frequency and a maximum document frequency; assigning a
range ID to each posting list; and using the range ID to look up
the predetermined size.
20. The system of claim 19, wherein each successive maximum
document frequency is twice that of an immediate prior one.
21. A program product for minimizing accesses to secondary storage
when searching an inverted index for a search term, the program
product comprising: a storage medium readable by a processor and
storing instructions for execution by the processor for performing
a method, the method comprising: obtaining by at least one
computing unit a predetermined size of a posting list for the
search term, the predetermined size based on document frequency for
the search term, wherein the posting list is stored in secondary
storage; and reading by the at least one computing unit at least a
portion of the posting list into memory based on the predetermined
size.
22. The program product of claim 21, wherein the size is a length
in bytes.
23. The program product of claim 21, wherein if the size obtained
is a predetermined minimum size or less, then the reading comprises
reading all of the posting list at once.
24. The program product of claim 23, wherein the reading comprises
issuing a single read system call.
25. The program product of claim 23, wherein the predetermined
minimum size comprises a size of a main memory input buffer.
26. The program product of claim 21, wherein if the predetermined
size is greater than a predetermined minimum size, then the reading
comprises performing a plurality of read operations.
27. The program product of claim 26, wherein the predetermined
minimum size comprises a size of a main memory input buffer.
28. The program product of claim 27, wherein the performing
comprises filling a largest available main memory input buffer a
minimum number of times.
29. The program product of claim 21, wherein the obtaining
comprises: partitioning all posting lists in the inverted index
into a plurality of non-overlapping ranges, each range having a
minimum document frequency and a maximum document frequency;
assigning a range ID to each posting list; and using the range ID
to look up the predetermined size.
30. The program product of claim 29, wherein each successive
maximum document frequency is twice that of an immediate prior
one.
31. A data structure for use in minimizing accesses to data stored
in secondary storage when searching an inverted index for a search
term, the data structure comprising: a posting list length
approximation table, comprising a hash table, the hash table
comprising: a plurality of range IDs, each range ID corresponding
to a subset of posting lists of predetermined similar size and
representing a non-overlapping range of document frequencies; and a
posting list length approximation for each range ID.
32. The data structure of claim 31, wherein the posting list length
approximation is a length in bytes.
33. The data structure of claim 32, wherein the posting list length
approximation comprises a mean length and a standard deviation
length.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority under 35 U.S.C. § 119 to the following U.S. Provisional Applications, which are herein incorporated by reference in their entirety:
[0002] Provisional Patent Application Ser. No. 61/233,411, by Flatland et al., entitled "ESTIMATION OF POSTINGS LIST LENGTH IN A SEARCH SYSTEM USING AN APPROXIMATION TABLE," filed on Aug. 12, 2009;
[0003] Provisional Patent Application Ser. No. 61/233,420, by Flatland et al., entitled "EFFICIENT BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE DETERMINATION," filed on Aug. 12, 2009; and
[0004] Provisional Patent Application Ser. No. 61/233,427, by Flatland et al., entitled "SEGMENTING POSTINGS LIST READER," filed on Aug. 12, 2009.
[0005] This application contains subject matter which is related to
the subject matter of the following applications, each of which is
assigned to the same assignee as this application and filed on the
same day as this application. Each of the below listed applications
is hereby incorporated herein by reference in its entirety:
[0006] U.S. Non-Provisional patent application Ser. No. ______, by
Flatland et al., entitled "EFFICIENT BUFFERED READING WITH A PLUG
IN FOR INPUT BUFFER SIZE DETERMINATION" (Attorney Docket No.
1634.069A); and
[0007] U.S. Non-Provisional patent application Ser. No. ______, by
Flatland et al., entitled "SEGMENTING POSTINGS LIST READER"
(Attorney Docket No. 1634.070A).
TECHNICAL FIELD
[0008] The present invention generally relates to searching an
inverted index. More particularly, the invention relates to
estimating a posting list size based on document frequency in order
to minimize accesses to the posting list stored in secondary
storage.
BACKGROUND
[0009] The following definition of Information Retrieval (IR) is
from the book Introduction to Information Retrieval by Manning,
Raghavan and Schutze, Cambridge University Press, 2008: [0010]
Information retrieval (IR) is finding material (usually documents)
of an unstructured nature (usually text) that satisfies an
information need from within large collections (usually stored on
computers).
[0011] An inverted index is a data structure central to the design
of numerous modern information retrieval systems. In chapter 5 of
Search Engines: Information Retrieval in Practice (Addison Wesley,
2010), Croft, Metzler and Strohman observe: [0012] An inverted
index is the computational equivalent of the index found in the
back of this textbook . . . . The book index is arranged in
alphabetical order by index term. Each index term is followed by a
list of pages about the word.
[0013] In a search system implemented using a computer, an inverted
index 100 often comprises two related data structures (see FIG. 1):
[0014] 1. A lexicon 101 contains the distinct set of terms 102
(i.e., with duplicates removed) that occur throughout all the
documents of the index. To facilitate rapid searching, terms in the
lexicon are usually stored in sorted order. Each term typically
includes a document frequency 104 and a pointer into the other
major data structure of the inverted index, the posting file 108.
The document frequency is a count of the number of documents in
which a term occurs. The document frequency is useful at search
time both for prioritizing term processing and as input to scoring
algorithms. [0015] 2. The posting file 108 consists of one posting
list per term in the lexicon, e.g., list 110 for term 112,
recording for each term the set of documents in which the term
occurs. Each entry in a posting list is called a posting. The
number of postings in a given posting list equals the document
frequency of the associated lexicon entry. A posting includes at
least a document identifier and may include additional information
such as: a count of the number of times the term occurs in the
document; a list of term positions within the document where the
term occurs; and more generally, scoring information that ascribes
some degree of importance (or lack thereof) to the fact that the
document contains the term.
[0016] When processing a user's query, a computerized search system
needs access to the postings of the terms that describe the user's
information need. As part of processing the query, the search
system aggregates information from these postings, by document, in
an accumulation process that leads to a ranked list of documents to
answer the user's query.
[0017] A large inverted index may not fit into a computer's main
memory, requiring secondary storage, typically disk storage, to
help store the posting file, lexicon, or both. Each separate access
to disk may incur seek time on the order of several milliseconds if
it is necessary to move the hard drive's read heads, which is very
expensive in terms of runtime performance compared to accessing
main memory.
[0018] Therefore, it would be helpful to minimize accesses to secondary storage when searching an inverted index, in order to improve runtime performance.
BRIEF SUMMARY OF INVENTION
[0019] The present invention provides, in a first aspect, a method
of minimizing accesses to secondary storage when searching an
inverted index for a search term. The method comprises
automatically obtaining a predetermined size of a posting list for
the search term, the predetermined size based on document frequency
for the search term, wherein the posting list is stored in
secondary storage, and reading at least a portion of the posting
list into memory based on the predetermined size.
[0020] The present invention provides, in a second aspect, a
computer system for minimizing accesses to secondary storage when
searching an inverted index for a search term. The computer system
comprises a memory, and a processor in communication with the
memory to perform a method. The method comprises automatically
obtaining a predetermined size of a posting list for the search
term based on document frequency for the search term, wherein the
posting list is stored in secondary storage, and reading at least a
portion of the posting list into memory based on the predetermined
size.
[0021] The present invention provides, in a third aspect, a program
product for minimizing accesses to secondary storage when searching
an inverted index for a search term. The program product comprises
a storage medium readable by a processor and storing instructions
for execution by the processor for performing a method. The method
comprises automatically obtaining a predetermined size of a posting
list for the search term based on document frequency for the search
term, the posting list being stored in secondary storage, and reading at least a portion of the posting list into memory based on the predetermined size.
[0022] The present invention provides, in a fourth aspect, a data
structure for use in minimizing accesses to data stored in
secondary storage when searching an inverted index for a search
term. The data structure comprises a posting list length
approximation table, comprising a hash table, the hash table
comprising: a plurality of range IDs, each range ID corresponding
to a subset of posting lists of predetermined similar size and
representing a non-overlapping range of document frequencies, and a
posting list length approximation for each range ID.
[0023] These, and other objects, features and advantages of this
invention will become apparent from the following detailed
description of the various aspects of the invention taken in
conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] One or more aspects of the present invention are
particularly pointed out and distinctly claimed as examples in the
claims at the conclusion of the specification. The foregoing and
other objects, features, and advantages of the invention are
apparent from the following detailed description taken in
conjunction with the accompanying drawings in which:
[0025] FIG. 1 depicts one example of an inverted index consisting
of a lexicon and corresponding posting file.
[0026] FIG. 2 depicts one example of a posting list length
approximation table data structure, according to one aspect of the
present invention.
[0027] FIG. 3 is a flow diagram for one example of a method of
reading a posting list in accordance with one or more aspects of
the present invention.
[0028] FIG. 4 depicts one example of an inverted index with the
storage split between main memory and secondary storage.
[0029] FIG. 5 is an object oriented instance diagram showing one
example of a posting list reader and the main objects it uses, in
accordance with the present invention.
[0030] FIG. 6 is a block diagram of one example of a computing unit
incorporating one or more aspects of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0031] The present invention approximates posting list size,
preferably as a length in bytes, according to a term's document
frequency. The approximate posting list size is preferably
predetermined, and it covers, with high probability, the size of
the associated posting list in secondary storage. Knowing the
approximate size is useful for minimizing the number of accesses to
secondary storage when reading a posting list. For example, if the
approximate covering read size is several megabytes or less, a
highly efficient strategy is to scoop up the whole posting list in
a single access to secondary storage through a single read system
call. If the approximate covering read size is larger than the
largest available main memory input buffer, then the list can be
read, for example, by filling the largest available input buffer
several times using a single system call per buffer fill operation,
and then doing one more partial read to pick up the remainder of
the approximate covering read size. For the rare case where the
approximate covering read size does not cover the posting list
being read, additional supplemental reads can be issued as
necessary.
[0032] U.S. Non-Provisional Patent Application entitled "EFFICIENT
BUFFERED READING WITH A PLUG IN FOR INPUT BUFFER SIZE
DETERMINATION" (Attorney Docket No. 1634.069A) filed concurrently
herewith, describes an enhanced buffered reader that can be
configured with predetermined buffer fill size strategies. When the
posting file is in secondary storage, using an enhanced buffered
reader to read a posting list offers advantages over a conventional
buffered reader. An enhanced buffered reader can be configured with
a predetermined buffer fill size strategy that is based on both the
size of the posting list (in bytes, for example) and the size of
the available input buffer, ensuring that the fewest required
number of system calls to read from secondary storage are issued.
Another advantage of the enhanced buffered reader is that it neatly encapsulates buffer management details. The detailed description of the present invention assumes a working understanding of enhanced buffered readers.
[0033] Reading a posting list with a predetermined buffer fill size
strategy requires knowledge of the size of the posting list in
terms of buffer elements (typically bytes) before reading begins.
When an inverted index is small enough that the lexicon fits
entirely into memory, it is a simple matter to determine the size
of a posting list in bytes. For example, referring back to FIG. 1,
and assuming that the lexicon is entirely in main memory, simply
subtract adjacent posting list addresses to know the size in bytes
of a given posting list. As will become apparent in the detailed description below, an advantage of the present invention is that this sizing information, which is preferably used for efficient reading from secondary storage, can be instantly available in main memory without needing to store the full lexicon in main memory and without needing to store posting list size in bytes as a separate field in the lexicon.
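For example, the adjacent-address subtraction can be sketched as follows; the sentinel final offset equal to the posting file's total length is an assumption of this illustration, not part of the lexicon described above:

```python
def posting_list_size_from_lexicon(posting_list_addresses, i):
    """Size in bytes of posting list i when the lexicon is entirely
    in main memory: subtract adjacent posting list addresses (FIG. 1).

    posting_list_addresses is a sorted list of byte offsets into the
    posting file, followed by a sentinel offset equal to the posting
    file's total length (an assumption of this sketch).
    """
    return posting_list_addresses[i + 1] - posting_list_addresses[i]
```

For instance, with addresses [0, 100, 160, 400], the posting list at index 1 occupies 60 bytes.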
[0034] The present invention builds on the concept of a posting list range to predetermine approximate posting list size. A posting list range is a set of posting lists defined by an inclusive minimum and an inclusive maximum document frequency. A posting list is a member of a posting list range if its document frequency falls within the inclusive minimum and maximum of the range. Posting lists that are part of the same posting list range will have the same approximate posting list size.
[0035] As a prerequisite for populating the Posting List Length
Approximation Table data structure pictured in FIG. 2, and given an
inverted index, the posting lists in the inverted index are
partitioned into a collection of non-overlapping ranges whose union
is the complete set of posting lists in the index. Each of these
ranges is assigned a unique range identifier (rangeId).
[0036] One example of a way to accomplish this partitioning of
posting lists is through a function called
documentFrequencyToRangeIdTranslator, shown below and summarized
here. The function takes as input an integer that is the length of
a posting list in number of postings, also known as the document
frequency. The function returns the ID of the range that includes
the posting list whose document frequency was passed in. ln( ) is
the natural logarithm function, and ceil( ) is a function that
rounds a number with a fractional part to the next higher
integer.
TABLE-US-00001
documentFrequencyToRangeIdTranslator

int documentFrequencyToRangeIdTranslator(int documentFrequency) {
    return ceil(ln(documentFrequency) / ln(2.0));
}
[0037] Table I below shows how the implementation of
documentFrequencyToRangeIdTranslator above partitions posting lists
into posting list ranges. The implementation of
documentFrequencyToRangeIdTranslator has been found to work well in
practice with a natural language corpus in which the word
distribution adheres to Zipf's law. Each successive rangeId
includes twice as many document frequencies as the preceding
rangeId.
TABLE-US-00002
TABLE I. Sample Range Definitions

minDocumentFrequency | maxDocumentFrequency | rangeId
1                    | 1                    | 0
2                    | 2                    | 1
3                    | 4                    | 2
5                    | 8                    | 3
9                    | 16                   | 4
17                   | 32                   | 5
etc.
[0038] Other implementations of the
documentFrequencyToRangeIdTranslator are possible. The above is
merely one example. This function could be implemented in any way
that defines a complete non-overlapping partitioning of the posting
lists into ranges.
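A sketch of this translator in Python (using log base 2 directly, which is mathematically equivalent to ln(df)/ln(2.0) and exact at the power-of-two range boundaries):

```python
import math

def document_frequency_to_range_id(document_frequency: int) -> int:
    """Map a document frequency to its posting list rangeId.

    Boundaries double with each range: df 1 -> range 0, df 2 -> range 1,
    df 3-4 -> range 2, df 5-8 -> range 3, and so on (Table I).
    """
    if document_frequency < 1:
        raise ValueError("document frequency must be at least 1")
    return math.ceil(math.log2(document_frequency))
```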
[0039] Given an inverted index, the Posting List Length
Approximation table data structure pictured in FIG. 2 can be
created in one example as follows. For each posting list range
compute the mean and standard deviation of posting list size as
stored in secondary storage, preferably in bytes. Next, create a
Posting List Length Approximation object consisting of the rangeId
of the current range, the mean of posting list size, and the
standard deviation of posting list size. Finally, add a hash table
entry to the Posting List Length Approximation Table mapping the
rangeId to the Posting List Length Approximation object.
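The table construction described above can be sketched as follows; the population standard deviation and the tuple record shape are assumptions of this sketch:

```python
import math
from collections import defaultdict

def build_length_approximation_table(posting_lists):
    """Build a Posting List Length Approximation Table.

    posting_lists is an iterable of (document_frequency,
    size_in_bytes) pairs, one per posting list in the inverted index.
    Returns a dict (hash table) mapping rangeId to a
    (rangeId, mean_bytes, stddev_bytes) record.
    """
    sizes_by_range = defaultdict(list)
    for document_frequency, size in posting_lists:
        # ceil(log2(df)) partitions posting lists into ranges whose
        # document frequency boundaries double (Table I).
        range_id = math.ceil(math.log2(document_frequency))
        sizes_by_range[range_id].append(size)

    table = {}
    for range_id, sizes in sizes_by_range.items():
        mean = sum(sizes) / len(sizes)
        variance = sum((s - mean) ** 2 for s in sizes) / len(sizes)
        table[range_id] = (range_id, mean, math.sqrt(variance))
    return table
```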
[0040] FIG. 2 depicts one example of a data structure 200 for a
posting list length approximation table, in accordance with one
aspect of the present invention. The data structure comprises a
hash table 210 with keys 220 and associated values 230. The keys
comprise a plurality of range ID's 240, as described above. The
associated values comprise the posting list length approximation
information 250. In the presently preferred embodiment, the length
approximation information is based on a predetermined length. The
information comprises, for example, the corresponding range ID, a
mean posting list length, and a standard deviation for the posting
list length. The mean length and standard deviation are preferably
expressed, for example, in bytes.
[0041] In one example, in addition to the structure shown in FIG.
2, the posting list length approximation table has an access method
getPostingListLengthApproximation(documentFrequency) which returns
a Posting List Length Approximation object based on a document
frequency passed in. In the present example, the implementation of
this method translates the document frequency to a rangeId using
the documentFrequencyToRangeIdTranslator function discussed
earlier. This rangeId is then used to do a hash table lookup to
find the proper Posting List Length Approximation object to return.
The resulting Posting List Length Approximation object can then be
turned into an approximate covering read size by, for example,
adding the mean posting list length in bytes to the desired number
of standard deviations.
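Assuming a table shaped as a dict from rangeId to a (rangeId, mean, standard deviation) tuple, the getter and the covering read size computation might look like:

```python
import math

def get_posting_list_length_approximation(table, document_frequency):
    """Return the Posting List Length Approximation record for a term:
    translate document frequency to a rangeId, then do a hash table
    lookup, as described above."""
    range_id = math.ceil(math.log2(document_frequency))
    return table[range_id]

def covering_read_size(table, document_frequency, num_std_devs=2):
    """Approximate covering read size: mean plus the desired number of
    standard deviations (the default of 2 is an assumed value)."""
    _, mean_bytes, stddev_bytes = get_posting_list_length_approximation(
        table, document_frequency)
    return math.ceil(mean_bytes + num_std_devs * stddev_bytes)
```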
[0042] One example of how to use a posting list length
approximation table to read a posting list efficiently will now be
provided with reference to the flow diagram 300 of FIG. 3. The
present example has a similar structure to the inverted index of
FIG. 1. In the scenario of this example, the inverted index is
large enough that the posting file is entirely in secondary storage
and only half of the lexicon fits into main memory.
[0043] FIG. 4 shows how the inverted index 400 is divided between
main memory 402, storing the lexicon index 404, and secondary
storage 406, storing the full lexicon 408 and the posting file 410.
Referring to FIG. 4, let N be the total number of terms in the full lexicon in secondary storage. Only every second term, for a total of N/2 terms, is kept in the lexicon index in main memory due to memory constraints. The full lexicon in secondary storage is
preferably organized as a sequence of blocks, e.g., block 412, each
of a constant size k (e.g., in bytes) such that any block can
accommodate the largest pair of lexicon entries in the lexicon.
This causes some internal fragmentation within the full lexicon,
but the advantage is that the lexicon index does not need to store
explicit disk pointers into the full lexicon. Instead, to locate the full-lexicon block corresponding to the lexicon index record with zero-based index i, simply seek to offset i*k in the full lexicon.
By design, the lexicon index includes document frequency but does
not include posting list sizing in bytes. The goal is to keep the
main memory lexicon data structure as compact as possible. The
Posting List Length Approximation Table will provide needed sizing
information for efficient reading of posting lists in secondary
storage.
[0044] In this example, it is assumed that the search engine
implementation uses an object called a Posting List Reader 500,
shown in FIG. 5, to read postings from secondary storage during
query processing. The Posting List Reader uses a Posting List
Length Approximation Table 502 to accurately estimate the sizes of
posting lists to be read. It uses an Enhanced Buffered Reader 504
with an internal buffer of size bufsize bytes to read postings 506
from secondary storage using efficient predetermined buffer fill
size strategies. Preferably, bufsize is relatively large (for
example several megabytes) to facilitate reading large posting
lists with relatively few read system calls. The Posting List Reader provides the following access methods:
[0045] initialize(documentFrequency, postingListAddress) -- Prepares the Posting List Reader for reading based on a document frequency and posting list address of a term obtained from the lexicon. After initialization, the readPosting() method may be used.
[0047] readPosting() -- Reads the next posting from the posting list.
[0048] When a user runs a query, the search system first parses the
query, identifies the terms for which postings are needed to
process the query, and locates each of these terms in the lexicon
to obtain a document frequency and posting list address for each.
Assuming a lexicon structured similar to that shown in FIG. 4, a
term's document frequency and posting list address can be retrieved
without accessing secondary storage about half the time by doing a
binary search of the lexicon index in main memory, which is very
fast. If necessary, a disk seek can be used to find the term in the
full lexicon in secondary storage by seeking to offset i*k in the
full lexicon and reading the lexicon entries there, where i is the
zero-based record offset in the lexicon index of the lexically
greatest term that is lexically less than the sought term, and k is
the block size of the blocks in the full lexicon. Having obtained a
document frequency and posting list address for a term, the search
system initializes a Posting List Reader, preparing it to read
postings, as discussed below.
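The lexicon lookup just described can be sketched as follows; the tuple layout of the in-memory index and the return convention are illustrative assumptions:

```python
import bisect

def locate_term(lexicon_index, term, k):
    """Find a term via the main memory lexicon index (FIG. 4).

    lexicon_index is a sorted list of (term, document_frequency,
    posting_list_address) tuples holding every second lexicon entry;
    k is the constant block size of the full on-disk lexicon.
    Returns the in-memory entry when the term is indexed, otherwise
    the byte offset i*k of the full-lexicon block to read.
    """
    terms = [entry[0] for entry in lexicon_index]
    # Zero-based index of the lexically greatest indexed term that is
    # lexically less than or equal to the sought term.
    i = bisect.bisect_right(terms, term) - 1
    if i >= 0 and terms[i] == term:
        # Found in main memory: no secondary storage access needed.
        _, document_frequency, address = lexicon_index[i]
        return ("in_memory", document_frequency, address)
    # One disk seek needed: block i of the full lexicon starts at i*k.
    return ("on_disk", max(i, 0) * k, None)
```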
[0049] Returning to FIG. 3, the Posting List Reader receives an
initialize request (step 302) that includes a document frequency
and a posting list address. The document frequency is the length of
the posting list to read in number of postings, and the posting
list address is the byte offset in the posting file where the
posting list to read starts.
[0050] The Posting List Reader obtains a Posting List Length
Approximation object (step 304) by calling the
getPostingListLengthApproximation method on the Posting List Length
Approximation Table pictured in FIG. 2, passing the document
frequency to this getter. (The implementation of
getPostingListLengthApproximation in turn translates the document
frequency passed in to a rangeId using the
documentFrequencyToRangeIdTranslator function described earlier and
does a hash table lookup in the Posting List Length Approximation
table based on the rangeId to obtain the Posting List Length
Approximation object.)
[0051] The Posting List Reader next obtains the approximate size of
the posting list to read (step 306) by getting the mean and
standard deviation of posting list length from the Posting List
Length Approximation object and adding the desired number of
standard deviations to the mean. Let approximateReadSize be the
approximate read size calculated in this step.
[0052] The next step in initializing the Posting List Reader is to
build a predetermined buffer fill size strategy (step 308) for use
with the Enhanced Buffered Reader. A predetermined buffer fill size
strategy is an ordered sequence of (fillSize, numTimesToUse) pairs,
where fillSize indicates how much of the Enhanced Buffered Reader's
internal input buffer to fill when a buffer fill is needed, and
numTimesToUse indicates how many times to use the associated
fillSize. There are two cases to consider, based on the relative
sizes of the bufsize (the Enhanced Buffered Reader's internal
buffer size) and approximateReadSize.
[0053] Case 1: approximateReadSize<=bufsize; and
[0054] Case 2: approximateReadSize>bufsize.
[0055] A discussion of these cases follows.
Case 1: approximateReadSize<=bufsize
[0056] Build a two-stage predetermined buffer fill size strategy as
indicated below in Table II.
TABLE-US-00003
TABLE II

Stage | Fill Size           | Number of Times to Use
1     | approximateReadSize | 1
2     | 8 kilobytes         | Repeat as necessary
[0057] The above two-stage strategy, when installed in an Enhanced
Buffered Reader and used to read the posting list, will with high
probability result in a single disk seek and read of exactly
approximateReadSize bytes. As many supplemental 8 kilobyte reads as
necessary may then be issued to handle the relatively rare case
when the approximateReadSize is insufficient.
Case 2: approximateReadSize>bufsize
[0058] For this discussion, let "/" represent the operation of
integer division, and "%" represent the operation of integer
modulo.
[0059] In this case, we build a predetermined buffer fill size strategy that generally has three stages, as indicated in the following table. However, the second stage is not necessary when bufsize divides approximateReadSize evenly.
TABLE-US-00004
TABLE III
  Stage  Fill Size                      Number of Times to Use
  1      bufsize                        approximateReadSize/bufsize
  2      approximateReadSize % bufsize  1
  3      8 kilobytes                    Repeat as necessary
[0060] The above strategy, when installed in an Enhanced Buffered
Reader and used to read the posting list, will utilize the
available input buffer of size bufsize bytes to read
approximateReadSize bytes of data using a minimal number of disk
seeks and minimal data transfer. The approximateReadSize is
sufficient to read the entire posting list with high probability;
however, as many supplemental 8 kilobyte reads as necessary will be
issued to handle the relatively rare case when the
approximateReadSize is insufficient.
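Taken together, Tables II and III amount to a small strategy-construction routine. The following is a minimal sketch, assuming the strategy is represented as an ordered list of (fillSize, numTimesToUse) pairs with -1 as a sentinel for "repeat as necessary"; that representation and the names are assumptions for illustration.

```python
EIGHT_KB = 8 * 1024
REPEAT_AS_NECESSARY = -1  # sentinel: use this fill size indefinitely

def build_fill_size_strategy(approximate_read_size: int, bufsize: int):
    """Build an ordered list of (fill_size, num_times_to_use) pairs
    following Tables II and III of the description."""
    strategy = []
    if approximate_read_size <= bufsize:
        # Case 1 (Table II): one fill of the whole estimate.
        strategy.append((approximate_read_size, 1))
    else:
        # Case 2 (Table III): whole-buffer fills, then an optional
        # remainder fill; stage 2 is omitted when bufsize divides
        # approximate_read_size evenly.
        strategy.append((bufsize, approximate_read_size // bufsize))
        remainder = approximate_read_size % bufsize
        if remainder:
            strategy.append((remainder, 1))
    # Final stage in both cases: supplemental 8 KB reads for the
    # relatively rare case when the estimate is insufficient.
    strategy.append((EIGHT_KB, REPEAT_AS_NECESSARY))
    return strategy
```

With a 64 KB buffer, an estimate of 5,000 bytes falls under Case 1 and produces a two-stage strategy, while an estimate larger than the buffer produces the three-stage (or two-stage, on an even division) Case 2 strategy.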
[0061] Referring once again to FIG. 3, the next step in
initializing the Posting List Reader is to seek the Enhanced
Buffered Reader to the start of the posting list (step 310). The
posting list address that was passed to the initialize request
(step 302) is forwarded to the Enhanced Buffered Reader's seek
method.
[0062] Finally, the predetermined buffer fill size strategy of step
308 is installed in the Enhanced Buffered Reader (step 312), by
calling the appropriate setter. The posting list reader is now
ready to start processing read requests for postings (step 314). As
the search system's search logic issues read requests as desired,
the Enhanced Buffered Reader automatically initiates buffer
refilling as needed using read sizes consistent with good runtime
performance when accessing secondary storage.
[0063] As shown in FIG. 6, one example of a data processing system
600 suitable for storing and/or executing program code includes at
least one processor 610 coupled directly or indirectly to memory
elements through a system bus 620. As known in the art, the memory
elements include, for instance, data buffers 630 and 640, local
memory employed during actual execution of the program code, bulk
storage 650, and cache memory, which provides temporary storage of
at least some program code in order to reduce the number of times
code must be retrieved from bulk storage during execution.
[0064] Input/Output or I/O devices 660 (including, but not limited
to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs,
thumb drives and other memory media, etc.) can be coupled to the
system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the
data processing system to become coupled to other data processing
systems or remote printers or storage devices through intervening
private or public networks. Modems, cable modems, and Ethernet
cards are just a few of the available types of network
adapters.
[0065] The shortcomings of the prior art are overcome and
additional advantages are provided through the provision of a
computer program product for minimizing accesses to secondary
storage for a posting list when searching an inverted index for a
search term. The computer program product comprises a storage
medium readable by a processing circuit and storing instructions
for execution by a computer for performing a method. The method
includes, for instance, automatically obtaining a predetermined
size of a posting list for the search term, the predetermined size
based on document frequency for the search term, wherein the
posting list is stored in secondary storage, and reading at least a
portion of the posting list into memory based on the predetermined
size.
[0066] Methods and systems relating to one or more aspects of the
present invention are also described and claimed herein. Further,
services relating to one or more aspects of the present invention
are also described and may be claimed herein.
[0067] Additional features and advantages are realized through the
techniques of the present invention. Other embodiments and aspects
of the invention are described in detail herein and are considered
a part of the claimed invention.
[0068] In one aspect of the present invention, an application can
be deployed for performing one or more aspects of the present
invention. As one example, the deploying of an application
comprises providing computer infrastructure operable to perform one
or more aspects of the present invention.
[0069] As a further aspect of the present invention, a computing
infrastructure can be deployed comprising integrating computer
readable code into a computing system, in which the code in
combination with the computing system is capable of performing one
or more aspects of the present invention.
[0070] As yet a further aspect of the present invention, a process
for integrating computing infrastructure comprising integrating
computer readable code into a computer system may be provided. The
computer system comprises a computer readable medium, in which the
computer readable medium comprises one or more aspects of the present
invention. The code in combination with the computer system is
capable of performing one or more aspects of the present
invention.
[0071] As will be appreciated by one skilled in the art, aspects of
the present invention may be embodied as a system, method or
computer program product. Accordingly, aspects of the present
invention may take the form of an entirely hardware embodiment, an
entirely software embodiment (including firmware, resident
software, micro-code, etc.) or an embodiment combining software and
hardware aspects that may all generally be referred to herein as a
"circuit," "module" or "system". Furthermore, aspects of the
present invention may take the form of a computer program product
embodied in one or more computer readable medium(s) having computer
readable program code embodied thereon.
[0072] Any combination of one or more computer readable medium(s)
may be utilized. The computer readable medium may be a computer
readable storage medium. A computer readable storage medium may be,
for example, but not limited to, an electronic, magnetic, optical,
or semiconductor system, apparatus, or device, or any suitable
combination of the foregoing. More specific examples (a
non-exhaustive list) of the computer readable storage medium
include the following: an electrical connection having one or more
wires, a portable computer diskette, a hard disk, a random access
memory (RAM), a read-only memory (ROM), an erasable programmable
read-only memory (EPROM or Flash memory), an optical fiber, a
portable compact disc read-only memory (CD-ROM), an optical storage
device, a magnetic storage device, or any suitable combination of
the foregoing. In the context of this document, a computer readable
storage medium may be any tangible medium that can contain or store
a program for use by or in connection with an instruction execution
system, apparatus, or device.
[0073] In one example, a computer program product includes, for
instance, one or more computer readable media to store computer
readable program code means or logic thereon to provide and
facilitate one or more aspects of the present invention. The
computer program product can take many different physical forms,
for example, disks, platters, flash memory, etc., including those
above.
[0074] Program code embodied on a computer readable medium may be
transmitted using an appropriate medium, including but not limited
to wireless, wireline, optical fiber cable, RF, etc., or any
suitable combination of the foregoing.
[0075] Computer program code for carrying out operations for
aspects of the present invention may be written in any combination
of one or more programming languages, including an object oriented
programming language, such as Java, Smalltalk, C++ or the like, and
conventional procedural programming languages, such as the "C"
programming language or similar programming languages. The program
code may execute entirely on the user's computer, partly on the
user's computer, as a stand-alone software package, partly on the
user's computer and partly on a remote computer or entirely on the
remote computer or server. In the latter scenario, the remote
computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area
network (WAN), or the connection may be made to an external
computer (for example, through the Internet using an Internet
Service Provider).
[0076] Aspects of the present invention are described herein with
reference to flowchart illustrations and/or block diagrams of
methods, apparatus (systems) and computer program products
according to embodiments of the invention. It will be understood
that each block of the flowchart illustrations and/or block
diagrams, and combinations of blocks in the flowchart illustrations
and/or block diagrams, can be implemented by computer program
instructions. These computer program instructions may be provided
to a processor of a general purpose computer, special purpose
computer, or other programmable data processing apparatus to
produce a machine, such that the instructions, which execute via
the processor of the computer or other programmable data processing
apparatus, create means for implementing the functions/acts
specified in the flowchart and/or block diagram block or
blocks.
[0077] These computer program instructions may also be stored in a
computer readable medium that can direct a computer, other
programmable data processing apparatus, or other devices to
function in a particular manner, such that the instructions stored
in the computer readable medium produce an article of manufacture
including instructions which implement the function/act specified
in the flowchart and/or block diagram block or blocks.
[0078] The computer program instructions may also be loaded onto a
computer, other programmable data processing apparatus, or other
devices to cause a series of operational steps to be performed on
the computer, other programmable apparatus or other devices to
produce a computer implemented process such that the instructions
which execute on the computer or other programmable apparatus
provide processes for implementing the functions/acts specified in
the flowchart and/or block diagram block or blocks.
[0079] The flowchart and block diagrams in the figures illustrate
the architecture, functionality, and operation of possible
implementations of systems, methods and computer program products
according to various embodiments of the present invention. In this
regard, each block in the flowchart or block diagrams may represent
a module, segment, or portion of code, which comprises one or more
executable instructions for implementing the specified logical
function(s). It should also be noted that, in some alternative
implementations, the functions noted in the block may occur out of
the order noted in the figures. For example, two blocks shown in
succession may, in fact, be executed substantially concurrently, or
the blocks may sometimes be executed in the reverse order,
depending upon the functionality involved. It will also be noted
that each block of the block diagrams and/or flowchart
illustration, and combinations of blocks in the block diagrams
and/or flowchart illustration, can be implemented by special
purpose hardware-based systems that perform the specified functions
or acts, or combinations of special purpose hardware and computer
instructions.
[0080] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
the invention. As used herein, the singular forms "a", "an" and
"the" are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises" and/or "comprising", when used in this
specification, specify the presence of stated features, integers,
steps, operations, elements, and/or components, but do not preclude
the presence or addition of one or more other features, integers,
steps, operations, elements, components and/or groups thereof.
[0081] The corresponding structures, materials, acts, and
equivalents of all means or step plus function elements in the
claims below, if any, are intended to include any structure,
material, or act for performing the function in combination with
other claimed elements as specifically claimed. The description of
the present invention has been presented for purposes of
illustration and description, but is not intended to be exhaustive
or limited to the invention in the form disclosed. Many
modifications and variations will be apparent to those of ordinary
skill in the art without departing from the scope and spirit of the
invention. The embodiment was chosen and described in order to best
explain the principles of the invention and the practical
application, and to enable others of ordinary skill in the art to
understand the invention for various embodiments with various
modifications as are suited to the particular use contemplated.
* * * * *