Join algorithms over full text indexes

Patent Grant 8661019

U.S. patent number 8,661,019 [Application Number 12/696,013] was granted by the patent office on 2014-02-25 for join algorithms over full text indexes. This patent grant is currently assigned to International Business Machines Corporation. The grantee listed for this patent is Latha Sankar Colby, Quanzhong Li, Fatma Ozcan, Mir Hamid Pirahesh, Eugene J. Shekita, Zografoula Vagena. Invention is credited to Latha Sankar Colby, Quanzhong Li, Fatma Ozcan, Mir Hamid Pirahesh, Eugene J. Shekita, Zografoula Vagena.

United States Patent	8,661,019
Colby , et al.	February 25, 2014

Join algorithms over full text indexes

Abstract

According to one embodiment of the present invention, a method for processing join predicates in full-text indexes is provided. The method includes evaluating local predicates of an outer full text index to generate a first posting list of documents. For each document in the first posting list, the value of a join attribute is determined and an inner full text index is probed to obtain a second posting list of documents containing one of the join attributes determined for each document. Local predicates of an inner full text index are evaluated to generate a third posting list of documents, and the second posting list is merged with the third posting list to generate a merge list of documents. Documents in the first posting list may be paired up with documents in the merge list.

Inventors:

Colby; Latha Sankar (Sunnyvale, CA), Li; Quanzhong (San Jose, CA), Ozcan; Fatma (San Jose, CA), Pirahesh; Mir Hamid (San Jose, CA), Shekita; Eugene J. (San Jose, CA), Vagena; Zografoula (London, GB)

Applicant:

Name	City	State	Country	Type
Colby; Latha Sankar	Sunnyvale	CA	US
Li; Quanzhong	San Jose	CA	US
Ozcan; Fatma	San Jose	CA	US
Pirahesh; Mir Hamid	San Jose	CA	US
Shekita; Eugene J.	San Jose	CA	US
Vagena; Zografoula	London	N/A	GB

Assignee:

International Business Machines Corporation (Armonk, NY)

Family ID:

44309746

Appl. No.:

12/696,013

Filed:

January 28, 2010

Prior Publication Data


	Document Identifier	Publication Date
	US 20110184933 A1	Jul 28, 2011

Current U.S. Class:	707/714; 707/715
Current CPC Class:	G06F 16/2456 (20190101)
Current International Class:	G06F 7/00 (20060101); G06F 17/30 (20060101)

References Cited

U.S. Patent Documents


5809502	September 1998	Burrows
6067543	May 2000	Burrows
7685138	March 2010	Beyer et al.
7945578	May 2011	Leung et al.
2010/0299367	November 2010	Chakrabarti et al.

Other References

Whang et al., "Odysseus: A High-Performance ORDBMS Tightly-Coupled with IR Features", Proceedings of the 21st International Conference on Data Engineering, pp. 1104-1105, 2005, IEEE. cited by examiner .
Guo et al., "XRANK: Ranked Keyword Search over XML Documents," Proceedings of the Int'l Conf. on Management of Data, ACM SIGMOD, pp. 16-27, 2003. cited by examiner .
Halverson et al., "Mixed Mode XML Query Processing," Proceedings of the 29th Int'l Conf. on Very Large Data Bases, pp. 225-236, ACM, 2003. cited by examiner .
Carmel et al., "Searching XML documents via XML fragments", Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 151-158, 2003, ACM. cited by examiner .
Chakrabarti et al., "Compressed data structures for annotated web search", Proceedings of the 21st international conference on World Wide Web , pp. 121-130, 2012, ACM. cited by examiner .
Szydelski, Jakub "Combining Relational and Full-Text Search on Various Data Sources Under Retention of Their Original Access Rights," Aug. 16, 2002. http://wwwbruegge.in.tum.de/publications/da/szydelski2002.pdf. cited by applicant .
R. Baeza-Yates and G. Navaro. Integrating contents and structure in text retrieval. ACM SIGMOD Record, 25(1):67-79, 1996. cited by applicant .
H. Meuss and C. Stronhmaier. Improving Index Structures for Structured Document Retrieval. In Proc. of Annual Colloquium on IR Research, Feb. 1999. cited by applicant .
"Indexing and Querying XML Data for Regular Path Expressions", Q. Li et al, VLDB 2001. cited by applicant .
"Holistic Twig Join: Optimal XML Pattern Matching", N. Bruno et al, SIGMOD 2002. cited by applicant .
"Holistic Twig Joins on Indexed XML Documents", H. Jiang et al, VLDB 2003. pp. 273-284. cited by applicant .
"Efficient Processing of XML Twig Queries with OR-Predicates", H. Jiang et al, SIGMOD 2004. cited by applicant .
"Virtual Cursors for XML Joins", B. Yang et al, CIKM 2004. cited by applicant .
"Efficient Object Oriented Twig Query Evaluation over XML and Semantically Annotated Documents", S. Grennan et al, IBM paper, submitted (copy available). cited by applicant .
Inverted Index Support for Parametric Search, M. Fontoura et al, Internet Mathematics, 3(2), 153-185, 2006, also IBM RJ10329 2004. cited by applicant .
K. Beyer et al. System RX: One Part Relational, One Part XML. In Proc. of SIGMOD, Baltimore, Maryland, Jun. 2005. cited by applicant .
J.-M. Bremer and M. Gertz. Integrating Document and Data Retrieval Based on XML. VLDB Journal, 15(1):53-83, 2006. cited by applicant .
S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A Semantic Search Engine for XML. In Proc. of VLDB, pp. 45-56, Berlin, Germany, 2003. cited by applicant .
E. Curtmola, S. Amer-Yahia, P. Brown, and M. Fernandez. GalaTex: A Conformant Implementation of the XQuery FullText Language. In Proc. of XIME-P, 2005. cited by applicant .
D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query processing. Computer Networks, 33(1-6):119-135, 2000. cited by applicant .
M. Fontoura, V. Josifovski, E. Shekita, and B. Yang. Optimizing Cursor Movement in Holistic Twig Joins. In Proc. of CIKM, pp. 784-791, 2005. cited by applicant .
L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked Keyword Search over XML Documents. In Proc. of SIGMOD, pp. 16-27, San Diego, California, 2003. cited by applicant .
R. Kaushik, P. Bohannon, J. F. Naughton, and H. F. Korth. Covering indexes for branching path queries. In SIGMOD Conference, pp. 133-144, 2002. cited by applicant .
R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In SIGMOD '04, pp. 779-790, New York, NY, USA, 2004. cited by applicant .
Y. Mass et al. JuruXML--an XML Retrieval System. In Proc. of INEX, pp. 73-80, 2002. cited by applicant .
T. Milo and D. Suciu. Index Structures for Path Expressions. In ICDT, pp. 277-295, 1999. cited by applicant .
S. Pal et al. XQuery Implementation in a Relational Database System. In Proc. of VLDB, Aug. 2005. cited by applicant .
A. Theobald and G. Weikum. The Index-based XXL Search Engine for Querying XML data with Relevance Ranking. In Proc. of EDBT, pp. 477-495, Prague, Czech Republic, 2002. cited by applicant .
M. Theobald, R. Schenkel, and G. Weikum. An efficient and versatile query engine for TopX search. In VLDB '05, pp. 625-636, 2005. cited by applicant .
F. Weigel, H. Meuss, F. Bry, and K. U. Schulz. Content-Aware Dataguides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data. In Proc. of European Conference of Information Retrieval, Apr. 2004. cited by applicant .
World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Full-Text, Nov. 2005. W3C Working Draft, see http://www.w3.org/TR/xquery-full-text/. cited by applicant .
Lucene Search Engine. http://lucene.apache.org/. cited by applicant .
P. Mishra et al., "Join Processing in Relational Databases", ACM Comput. Surv., 24(1):63-113, 1992. cited by applicant .
J. Zobel et al., "Inverted files for Text Search Engines", ACM Comput. Surv., 38(2):6, 2006. cited by applicant.

Primary Examiner: Hicks; Michael
Attorney, Agent or Firm: Cantor Colburn LLP Kanehira; Yusuke

Claims

What is claimed is:

1. A computer hardware implemented method comprising: receiving a query by a processor, the query comprising: an inner full text index which comprises a first set of document ids, each of the documents ids having a first attribute and a first value; an outer full text index which comprises a second set of document ids, each of the documents ids having a second attribute and a second value; one or more inner local predicates comprising a first search term for the inner full text index; one or more outer local predicates comprising a second search term for the outer full text index, an inner join attribute comprising a third search term corresponding to the first value; and an outer join attribute comprising a fourth search term corresponding to the second value; and using said processor to: evaluate the one or more outer local predicates from said query of the outer full text index to generate a first posting list of documents; determine the second value of the outer join attribute from said query for each document in said first posting list; probe the inner full text index to obtain a second posting list of documents containing one of said inner join attributes determined for each document in said first posting list; evaluate the one or more inner local predicates of the inner full text index to generate a third posting list of documents; merge said second posting list with said third posting list to generate a merge list of documents; and pair up each document in said first posting list with documents in said merge list.

2. The method according to claim 1 wherein said first, second and third posting lists contain lists of document identifiers.

3. The method according to claim 1 wherein said inner and outer full text indexes comprise lists of documents that result from different full-text searches.

4. The method according to claim 1 wherein at least one of said local predicates is a full-text query including single keywords.

5. The method according to claim 4 wherein at least one of said full text queries is a complex query and wherein said method further comprises caching said third posting list.

6. The method according to claim 1 wherein said join attribute is a join condition on meta-data.

Description

BACKGROUND

The present invention relates to information retrieval, and more specifically, to the efficient access of full text indexes.

There has been a rapid increase in the volume of information available on the Internet and other sources. One widely used method for users to search and access this information is known as full text search, in which a search engine examines all of the words in every stored document as it tries to match search words supplied by the user. Full text search is usually divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, called a full text index. In the search stage, only the full text index is referenced rather than the text of the original documents.
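
The two stages described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercase normalization, and the names `build_index` and `search` are hypothetical helpers introduced only for illustration; a real engine would add analysis, positions and compression.

```python
from collections import defaultdict


def build_index(docs):
    """Indexing stage: map each lowercase token to a sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Search stage: consult only the index, never the original text."""
    return index.get(term.lower(), [])


docs = {
    0: "DB2 performance tuning",
    1: "Cooking with herbs",
    2: "Inside DB2 internals",
}
index = build_index(docs)
db2_hits = search(index, "DB2")
```

Here `db2_hits` is the posting list of document ids for the keyword "DB2"; the document texts are not touched during the search.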

Traditional structured databases increasingly store semi-structured and unstructured textual information, which requires full-text search to be integrated into them. Consequently, full-text indexes and efficient methods for accessing them are critical in modern information retrieval. Full-text indexes have been augmented to support requirements beyond simple keyword search, and a growing number of querying features on structured data are supported by full-text indexes directly. For example, advanced features like fielded search, numeric search, and XML support have been proposed and implemented inside full-text search.

A join is an operation that combines records from two tables in a relational database. A join can be used to combine fields from tables using values common to each. With the support of searching structured data in full-text indexes, similar join operations are also useful in full-text searches.

SUMMARY

According to one embodiment of the present invention, a method comprises: evaluating local predicates of an outer full text index to generate a first posting list of documents; determining the value of a join attribute for each document in the first posting list; probing an inner full text index to obtain a second posting list of documents containing one of the join attributes determined for each document; evaluating local predicates of an inner full text index to generate a third posting list of documents; merging the second posting list with the third posting list to generate a merge list of documents; and pairing up each document in the first posting list with documents in the merge list.

According to another embodiment of the present invention, a method comprises: performing a merge join of terms from dictionaries of inner and outer full text indexes to generate a list of matching term pairs; evaluating an outer local predicate of a query in the outer full text index to generate a first posting list; evaluating an inner local predicate of a query in the inner full-text index to generate a second posting list; for each matching term pair, probing to obtain a third posting list from the outer full-text index, and to obtain a fourth posting list from the inner full-text index; merging the first and third posting lists to generate a fifth posting list; merging the second and fourth posting lists to generate a sixth posting list; and pairing documents in the resulting fifth and sixth posting lists.

According to another embodiment of the present invention, a computer program product for processing join predicates in full-text indexes comprises: a computer usable medium having computer usable program code embodied therewith, the computer usable program code comprising: computer usable program code configured to: perform a merge join of terms from dictionaries of inner and outer full text indexes to generate a list of matching term pairs; evaluate an outer local predicate of a query in the outer full-text index to generate a first posting list with join values; evaluate an inner local predicate of the query in the inner full-text index to generate a second posting list with join values; and evaluate a join predicate on the first and second posting list.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 shows pseudo code for a full text nested loop join process in accordance with an embodiment of the invention;

FIG. 2 shows a diagram of a full-text nested loop join processor in accordance with an embodiment of the invention;

FIG. 3 shows pseudo code for a full text dictionary join process in accordance with an embodiment of the invention;

FIG. 4 shows the pseudo code for a full-text dictionary intersection step used with the full text dictionary join process shown in FIG. 3 in accordance with an embodiment of the invention;

FIG. 5 shows a full-text dictionary intersection join processor in accordance with an embodiment of the invention;

FIG. 6 shows pseudo code for an alternative process for use with the full-text dictionary join processor shown in FIG. 5 in accordance with an embodiment of the invention;

FIG. 7 shows pseudo code for a full-text merge join process in accordance with an embodiment of the invention; and

FIG. 8 shows a high level block diagram of an information processing system useful for implementing one embodiment of the present invention.

DETAILED DESCRIPTION

Embodiments of the invention provide techniques for performing join operations over full-text indexes. The embodiments of the invention will be explained with reference to the following example search problem. Suppose that we have full-text indexes over a data set about books and authors, and we would like to issue the query, "find the books about DB2 and whose author is affiliated with IBM".

A full-text index is good at finding the documents (books) having the keyword "DB2" or the documents (authors) having the keyword "IBM". However, in order to evaluate the whole query, we need to join the books and authors together through the author attribute of books and the name attribute of authors. In traditional databases, full-text indexes are normally integrated and used for full-text searches only. If the above join processing is supported by full-text indexes, we can exploit this capability by pushing join predicates down from the database to the full-text indexes, which can then combine the join processing with full-text searches. In this way, we can reduce the processing cost inside the database and the communication cost between the database and the full-text search engine, benefiting from direct access to the full-text indexing structures and from the use of specialized full-text search processes.

Embodiments of the invention address the problem of how to efficiently support join operations within full-text index engines. These embodiments include three join processes over full-text indexes, namely full-text nested-loop join, full-text dictionary join, and full-text merge join.

Support of join processing on document attributes, such as author attribute of book documents and name attribute of author documents, is facilitated by an understanding of how attributes and their values are stored and indexed within a full-text index. The values of an attribute can be stored inside a full-text index without having to be indexed, if the attribute is only used for retrieval purposes. For example, the URLs of web pages are returned along with the search results for users to retrieve the original page, but they may not be indexed for keyword search. On the other hand, some attributes can be stored and indexed at the same time. As described below, there are a number of typical methods to store attribute values in a full-text index.

Some full-text indexes (e.g., Apache Lucene) provide a document/attribute/field store (or a similar mechanism), which can be used to store attribute-value pairs that can be retrieved using document ids efficiently. Each entry in the document store may contain several attribute-value pairs. After the document id list is generated from a search, a user can retrieve the stored attribute values of those identified documents using their ids. To locate a document and its attributes in a document store, it is common that the document id is used as the index of an array of in-memory document pointers. In this case, this lookup process is just one indirect pointer access and is hence an inexpensive operation.
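
A minimal sketch of such a document store follows, assuming dense integer document ids used directly as array indexes. The `doc_store` layout and the `get_attribute` helper are hypothetical and do not correspond to the API of Lucene or any other engine.

```python
# Hypothetical in-memory document store: the doc id is the array index,
# so locating a document's attributes is a single indexed access.
doc_store = [
    {"title": "DB2 Basics", "author": "Joe"},
    {"title": "XML Primer", "author": "Ann"},
    {"title": "DB2 Tuning", "author": "Joe"},
]


def get_attribute(store, doc_id, name):
    """Return the stored value of attribute `name`, or None if not stored."""
    return store[doc_id].get(name)


# After a search has produced a list of doc ids, fetch a stored attribute.
hits = [0, 2]
authors = [get_attribute(doc_store, d, "author") for d in hits]
```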

A posting list of a full-text index contains document ids and positional information about a term/keyword. Some full-text indexes also support payloads, which allow meta-data or user-defined information to be stored inside posting lists. Payloads can be used to store additional information, such as formatting, grammar and linkage information for web pages, or XPath data. Payloads can also be used to store extra positional information for XML data. This payload information can be used to do additional filtering or to adjust term ranking. Document attributes can be stored in posting lists, where the attribute names are treated as terms in the term dictionary. For example, to store the author attribute for the book collection, we index author as a term in the dictionary, and the author values are stored inside the author posting list. To fetch author values, we add author as an additional term to a query, and retrieve the author payload from the posting list directly for qualifying document ids during query processing.

If multiple attributes are needed for a small number of documents, the document store is preferred. Otherwise, payloads are a better choice, since they cluster attribute values together and support sequential scanning and possible skipping, in the same way as other term posting lists.
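
The payload approach can be sketched as follows, assuming the posting list of the attribute term holds (doc id, payload) entries sorted by doc id. The function `fetch_payloads` and the sample data are hypothetical; the point is that attribute values for qualifying documents are picked up in one sequential pass.

```python
def fetch_payloads(attr_postings, qualifying_ids):
    """Return (doc_id, payload) entries for the qualifying documents.

    Both inputs are sorted by doc id, so a single forward scan suffices.
    """
    result = []
    i = 0
    for doc_id in qualifying_ids:
        # Skip entries of the attribute posting list that precede doc_id.
        while i < len(attr_postings) and attr_postings[i][0] < doc_id:
            i += 1
        if i < len(attr_postings) and attr_postings[i][0] == doc_id:
            result.append(attr_postings[i])
    return result


# Posting list of the term "author", with the author value as payload.
author_postings = [(0, "Joe"), (1, "Ann"), (2, "Joe"), (5, "Li")]
fetched = fetch_payloads(author_postings, [0, 2, 3])
```

Document 3 has no entry in the attribute posting list, so it contributes nothing to `fetched`.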

The values of an attribute can also be indexed regardless of whether it is stored or not. Numeric and short string types of attributes (such as date, price, country and name) are typically indexed. To index attribute values, the attribute name and value are concatenated together to form an index term. For example, the composite term author:Joe represents author and its value Joe. In this way, search on attribute and value can be supported using the term dictionary directly. When the cardinality of an attribute value domain is large, e.g., numeric attributes like price, each such term may have only one document, which is inefficient for scanning large numbers of values. In this case, the posting lists of several values in a certain range may be grouped together into one posting list.
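
As an illustration, the sketch below builds composite terms for a string attribute and groups a numeric attribute into fixed-width ranges. The bucket term format `price:[low,high)` and the helper names are assumptions made for this example, not a format prescribed by the description.

```python
from collections import defaultdict


def index_attribute(postings, doc_id, name, value):
    """Index an attribute value as the untokenized composite term name:value."""
    postings[f"{name}:{value}"].append(doc_id)


def index_numeric_bucketed(postings, doc_id, name, value, width):
    """Group numeric values into ranges so that each term has a longer list."""
    low = (value // width) * width
    postings[f"{name}:[{low},{low + width})"].append(doc_id)


postings = defaultdict(list)
index_attribute(postings, 0, "author", "Joe")
index_attribute(postings, 1, "author", "Ann")
index_attribute(postings, 2, "author", "Joe")
for doc_id, price in [(0, 12), (1, 18), (2, 35)]:
    index_numeric_bucketed(postings, doc_id, "price", price, width=10)
```

With range grouping, a query on an exact numeric value would still need to check the stored value of each document in the matching bucket.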

A first embodiment of the invention includes a process for performing a full-text nested-loop join (FTNJ). In a relational database, a nested-loop join reads the rows in the outer table, possibly with index access and the filtering of local predicates. Then each qualified row is joined with the rows from the inner table, preferably with an index access. For full-text indexes, there are no tables. The two join parties are two document sets from different keyword searches, and the join attributes are the document meta-data/attributes, which are stored and indexed within a full-text index, using methods described above.

For the sake of explaining the first embodiment of the invention, it is assumed that there are two full-text indexes (outer and inner indexes) to be joined through an equality condition on two attributes. For the above-described example, we have a book full-text index for the books collection (outer) and an author index for the authors collection (inner). The join condition is book:author=author:name, where author and name are attributes of the two indexes respectively. The whole query can be expressed as: book:contains(DB2) ^ book:author=author:name ^ author:contains(IBM), where ^ denotes conjunction. A join condition is also referred to herein as a join predicate, and the remaining conditions of the full-text query are referred to herein as local predicates. The full-text nested-loop join process (FTNJ) evaluates both local and join predicates at the same time.

FIG. 1 shows pseudo code for the FTNJ process in accordance with an embodiment of the invention. The inputs of the process specify the outer and inner indexes, outer and inner predicates if any, and the two join attributes. In FIG. 1, line 10, the function FT_Search( ) evaluates a full-text query and returns a list of document ids.

The join condition is on attribute values. For example, the join in the above-described example is on attributes book:author and author:name. Note that to probe the full-text index on the author collection, we need to have a value. So, we need to find the values of the join attribute, book:author in this case, so that we can use these values to probe the other index. In other words, we need a mechanism to obtain the values of join attributes using document ids from a full-text index directly. In this example, when the index is being built on the book collection, we also store the attribute author in the full-text index (using techniques described above). In this process, the function FT_GetAttribute( ) (line 12) returns the corresponding attribute value. Note that in order to evaluate the equality condition $\mathit{innerja} = \nu_o$, we should index each attribute value as a single value without being tokenized. Also, the full-text search should support searching the content under specified fields/attributes, as described above.

FIG. 2 shows a diagram of the components of a full-text nested loop join processor 10, also referred to herein as a join processor 10, in accordance with an embodiment of the invention. The join processor 10 includes a book index 11 and an author index 12. A field/attribute store unit 14 stores the field attributes, such as author. Posting lists 16, 18 and 20 are also shown in FIG. 2. The book index 11 and the author index 12 also each include a term dictionary 22, 24.

FIG. 2 shows four main steps in the FTNJ process, labeled 1 through 4. In step one, at the beginning of the process, the local predicates in the outer index (book index 11) are evaluated by probing the outer collection. In the present example, this step would identify books in the book index 11 that are about DB2 and place these document ids in the posting list 16. Although we used a single keyword in the example, in practice the local predicate could be an arbitrary full-text query. In step two, for each document id (e.g. $d_o$) from the result set of the outer local predicate, the process fetches the value of the author attribute as the join value. This join value for one of the documents in the example shown in FIG. 2 is the author name "Joe", which may be stored in the field/attribute store unit 14.

In step three, with the join value in hand, the process can now probe the inner full-text author index 12 and obtain the posting list 18 of the join value term. The process will then merge the posting list 18 with the resulting posting list 20 from the inner local query result. Posting list 20 represents documents related to the word "IBM". Note that if the inner local full-text query is a complex query (with conjunctions and disjunctions of many terms), the resulting posting list can be cached to avoid repeated evaluation of this local query. This merge will then find the authors, such as "Joe", who are associated with IBM.

In step four, $d_o$ is paired with the document ids generated from the merge of the posting lists from the join value and the inner local query. The result set is $\{(d_o, d_i)\}$, which, in this example, represents the set of books about DB2 whose authors are affiliated with IBM.
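
The four steps above can be sketched as follows. This is a minimal illustration rather than the pseudo code of FIG. 1: it assumes single-keyword local predicates, posting lists held as sorted Python lists, join values kept in a document store, and inner attribute values indexed as composite terms such as `name:Joe`. The names `intersect` and `ftnj` and all sample data are hypothetical.

```python
def intersect(a, b):
    """Merge two sorted doc-id lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def ftnj(outer_postings, outer_store, inner_postings,
         outer_term, inner_term, outer_attr, inner_attr):
    """Full-text nested-loop join over toy indexes; returns (d_o, d_i) pairs."""
    results = []
    # Inner local predicate: evaluated once and reused for every probe.
    inner_local = inner_postings.get(inner_term, [])
    # Step 1: evaluate the outer local predicate.
    for d_o in outer_postings.get(outer_term, []):
        # Step 2: fetch the join value of the outer document.
        value = outer_store[d_o][outer_attr]
        # Step 3: probe the inner index on the join value, then merge
        # with the posting list of the inner local predicate.
        probe = inner_postings.get(f"{inner_attr}:{value}", [])
        # Step 4: pair the outer document with each surviving inner document.
        for d_i in intersect(probe, inner_local):
            results.append((d_o, d_i))
    return results


book_postings = {"db2": [0, 2], "xml": [1]}
book_store = [{"author": "Joe"}, {"author": "Ann"}, {"author": "Sue"}]
author_postings = {"ibm": [0, 1], "name:Joe": [0],
                   "name:Ann": [1], "name:Sue": [2]}

pairs = ftnj(book_postings, book_store, author_postings,
             "db2", "ibm", "author", "name")
```

In this toy data, book 0 (about DB2, by Joe) pairs with author document 0 (Joe, affiliated with IBM), while book 2 is dropped because its author Sue does not match the inner local predicate.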

One assumption of this FTNJ process is that the join values of an outer index are stored inside a full-text index, as described above, and the attribute values of an inner index are indexed, also as described above. Full-text join conditions are typically on meta-data such as attributes or fields. Since most search engines already have efficient support for indexing and retrieving document fields/attributes (like titles, dates, URLs, etc.), the join value lookups can be supported efficiently in a similar way.

A second embodiment of the invention includes a process for performing a full-text dictionary join (FTDJ). This FTDJ relies on the observation that the terms in a full-text dictionary are already in sorted order. This embodiment simultaneously scans (merge-joins) the terms in the dictionaries of two full-text indexes, and efficiently identifies the document pairs that satisfy a join condition. It is assumed that values of both attributes are indexed in full-text indexes, as described above.

FIG. 3 shows the pseudo code for the FTDJ process in accordance with an embodiment of the invention. The inputs to this process are the same as those in the full-text nested-loop join, shown in FIG. 1. As shown in FIG. 3, the dictionaries of the two full-text indexes are intersected using the function FT_DI( ) (line 10). FIG. 4 shows the pseudo code for this full-text dictionary intersection step, FT_DI( ). In particular, the process in FIG. 4 performs a synchronized sequential scan over the dictionaries of the two full-text indexes. The function FT_Terms( ) (FIG. 4, lines 8 and 9) returns the dictionary terms under the specified attribute in a full-text index. As the terms in the dictionaries are sorted, a single scan is sufficient. Also note that this merge join only applies to the terms that are restricted by the input attribute. In FIG. 4, line 16, for each common term value, we remember the matching terms by collecting them in a result list.
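
A dictionary intersection of this kind can be sketched as below. The sketch assumes each dictionary is a sorted list of composite terms of the form attribute:value; `terms_under` and `dictionary_intersection` are hypothetical stand-ins for the roles played by FT_Terms( ) and FT_DI( ), not reproductions of the pseudo code in FIG. 4.

```python
def terms_under(dictionary, attr):
    """Return the dictionary terms indexed under the given attribute."""
    prefix = attr + ":"
    return [t for t in dictionary if t.startswith(prefix)]


def dictionary_intersection(outer_dict, inner_dict, outer_attr, inner_attr):
    """Synchronized scan over two sorted dictionaries.

    Returns (outer_term, inner_term) pairs whose attribute values are equal.
    """
    outer = terms_under(outer_dict, outer_attr)
    inner = terms_under(inner_dict, inner_attr)
    skip_o, skip_i = len(outer_attr) + 1, len(inner_attr) + 1
    i = j = 0
    matches = []
    while i < len(outer) and j < len(inner):
        v_o, v_i = outer[i][skip_o:], inner[j][skip_i:]
        if v_o == v_i:
            matches.append((outer[i], inner[j]))
            i += 1
            j += 1
        elif v_o < v_i:
            i += 1
        else:
            j += 1
    return matches


# Both dictionaries are already in sorted order, as in a term dictionary.
book_dictionary = ["author:Ann", "author:Joe", "author:Sue", "db2", "xml"]
author_dictionary = ["ibm", "name:Joe", "name:Li", "name:Sue"]
matches = dictionary_intersection(book_dictionary, author_dictionary,
                                  "author", "name")
```

Because terms sharing an attribute prefix stay sorted by value, no additional sort is needed before the scan.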

In the FTDJ process shown in FIG. 3, (lines 11-15), for each matching term pair, the document ids from the two corresponding posting lists, which also satisfy local full-text predicates, are paired as results. For ease of exposition, we conduct a merge of posting lists from term pairs and local predicates in a single FT_Search invocation (lines 12 and 13). In practice, the resulting posting list of a local predicate can be cached to avoid repeated evaluations.

FIG. 5 shows a full-text dictionary intersection join processor 30, also referred to as a join processor 30, in accordance with an embodiment of the invention. A book index 31 and an author index 32 are also shown. Posting lists 34, 36, 38, and 40 are also shown in FIG. 5. The book index 31 and the author index 32 also each include a term dictionary 42, 44.

FIG. 5 shows three main steps in the full-text dictionary intersection join, labeled 1 through 3. In the first step, the process starts the dictionary intersection, which is a merge join of the terms from the dictionaries of two indexes. Two terms form a matching pair if they are of the respective input attributes and have the same term value. In step 2, for a matching term pair, the posting list of each term is merged with the resulting posting list of the corresponding local predicate. If a local predicate is a complex query, the resulting posting list can be cached. In the third step, the resulting document ids from both indexes are paired as results, since the documents in each pair share the same term value and satisfy the local predicates on their corresponding indexes. After this, the process goes back to process the next matching term pair.
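
The three steps can be put together in a small sketch. It assumes single-keyword local predicates, indexes held as dictionaries from term to sorted doc-id list, and composite attribute:value terms; the dictionary intersection is written as a generator so that matching pairs are consumed one at a time. All names and data are hypothetical.

```python
def intersect(a, b):
    """Merge two sorted doc-id lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def matching_term_pairs(outer_terms, inner_terms, outer_attr, inner_attr):
    """Step 1: merge join of sorted dictionary terms, yielding matching pairs."""
    o_prefix, i_prefix = outer_attr + ":", inner_attr + ":"
    outer = [t for t in outer_terms if t.startswith(o_prefix)]
    inner = [t for t in inner_terms if t.startswith(i_prefix)]
    i = j = 0
    while i < len(outer) and j < len(inner):
        v_o, v_i = outer[i][len(o_prefix):], inner[j][len(i_prefix):]
        if v_o == v_i:
            yield outer[i], inner[j]
            i += 1
            j += 1
        elif v_o < v_i:
            i += 1
        else:
            j += 1


def ftdj(outer_postings, inner_postings,
         outer_term, inner_term, outer_attr, inner_attr):
    """Full-text dictionary join over toy indexes; returns (d_o, d_i) pairs."""
    # Local predicate results are computed once and reused (cached).
    outer_local = outer_postings.get(outer_term, [])
    inner_local = inner_postings.get(inner_term, [])
    results = []
    pairs = matching_term_pairs(sorted(outer_postings), sorted(inner_postings),
                                outer_attr, inner_attr)
    for t_o, t_i in pairs:
        # Step 2: merge each term's posting list with its local predicate.
        docs_o = intersect(outer_postings[t_o], outer_local)
        docs_i = intersect(inner_postings[t_i], inner_local)
        # Step 3: pair the surviving documents from both sides.
        results.extend((d_o, d_i) for d_o in docs_o for d_i in docs_i)
    return results


book_postings = {"db2": [0, 2], "xml": [1], "author:Joe": [0],
                 "author:Ann": [1], "author:Sue": [2]}
author_postings = {"ibm": [0, 1], "name:Joe": [0],
                   "name:Ann": [1], "name:Sue": [2]}
joined = ftdj(book_postings, author_postings, "db2", "ibm", "author", "name")
```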

In the pseudo code shown in FIG. 3, at lines 12-13, the process repeatedly applies local predicates on each returned term. In cases when a local predicate is complex or has a long posting list, re-evaluating the predicate or scanning a long list again and again may be costly. An alternative process that handles this situation, referred to herein as FTDJA, is shown in FIG. 6. In particular, FIG. 6 shows the pseudo code for an alternative process for full-text dictionary join in accordance with an embodiment of the invention. In FIG. 6, the process first invokes the FTDJ process without any local predicates to obtain a set of join results. Then, in line 10 and line 12, the process sorts the results on either outer or inner doc ids and treats this sorted list as a virtual posting list. This virtual posting list is further filtered by local predicates (line 11 and line 13). If the result of the join predicate is small but local predicates are complex, or the results of local predicates are large, FTDJA performs better, since it invokes local predicates only once. FTDJ and FTDJA are thus complementary to each other, and one of them is chosen depending on the selectivities of the join and local predicates.
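
The alternative can be sketched as follows, assuming the join without local predicates has already produced a list of (outer id, inner id) pairs and that each local predicate has been evaluated once into a sorted posting list. The helper names are hypothetical and the sketch does not reproduce the line numbering of FIG. 6.

```python
def filter_sorted(pairs, posting, key_index):
    """Keep pairs whose id at position key_index occurs in the sorted posting list.

    `pairs` must be sorted on that same id, so one forward scan suffices.
    """
    out = []
    j = 0
    for pair in pairs:
        doc_id = pair[key_index]
        while j < len(posting) and posting[j] < doc_id:
            j += 1
        if j < len(posting) and posting[j] == doc_id:
            out.append(pair)
    return out


def ftdja(join_pairs, outer_local, inner_local):
    """Filter join results by the local predicates, each applied once."""
    # Sort on outer ids to form a virtual posting list, then filter it.
    by_outer = sorted(join_pairs, key=lambda p: (p[0], p[1]))
    kept = filter_sorted(by_outer, outer_local, key_index=0)
    # Re-sort the survivors on inner ids and filter by the inner predicate.
    by_inner = sorted(kept, key=lambda p: (p[1], p[0]))
    return filter_sorted(by_inner, inner_local, key_index=1)


# Join results computed without local predicates (unsorted on purpose).
join_pairs = [(1, 1), (0, 0), (2, 2)]
outer_local = [0, 2]   # e.g. books matching "DB2"
inner_local = [0, 1]   # e.g. authors matching "IBM"
filtered = ftdja(join_pairs, outer_local, inner_local)
```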

From the description above, it can be seen that the full-text dictionary join process can utilize the full-text indexes of both collections. Moreover, both collections are treated symmetrically. If the selectivities of local queries are inaccurate or unknown, this process provides a safer bet over the nested-loop join. Furthermore, since the terms in a dictionary are already sorted, this process takes advantage of this and performs the intersection without the sorting cost. For ease of presentation, we separated the function FT_DI( ) from the rest of the process. In other embodiments, these two processes are actually merged together, such that it is not necessary to materialize the full list of matching term pairs. Also, the pointer to a posting list of a term can be carried to FT_Search (FIG. 3, lines 12-13), hence, it is not necessary to look up this term again in the dictionary.

In a third embodiment of the invention, a process to perform full-text merge join is provided. Note that, as described above, we can store and retrieve attribute values efficiently. In Full-Text Merge Join (FTMJ), local predicates are evaluated on each index first. This produces two lists of results in the tuple format: (docid, value). Then, these two tuple lists are merged (joined) on value. FIG. 7 shows the pseudo code for the Full-Text Merge Join process in accordance with an embodiment of the invention. In FIG. 7, the function FT_AttrSearch( ) evaluates local predicates and returns the specified attribute value together with document ids.

For the above-described example, the attributes author and name and their values are stored using the methods previously described. The FTMJ process evaluates the full-text search "DB2" ^ "author", and fetches the attribute values of author from the payload or document store for the result documents. The result list can be represented as a (docid, author) list. The process evaluates a similar query on the author index, "name" ^ "IBM", and obtains a second list of qualifying results with name values, a (docid, name) list. Then, the process joins these two lists using the condition author=name, and generates the document id pairs. For this last step, both search result lists can be sorted on attribute values and a sort-merge join can be used. Note that other join methods, such as hash join, can also be employed.
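
The final join step of FTMJ can be illustrated with a short Python sketch of a sort-merge join over two (docid, value) lists. It is not the pseudo code of FIG. 7; it assumes the two lists have already been produced by evaluating the local predicates (the role played by FT_AttrSearch( )), and the function name sort_merge_join and the sample data are hypothetical.

```python
def sort_merge_join(outer, inner):
    """Join two lists of (docid, value) tuples on equal value and return
    the matching (outer_docid, inner_docid) pairs."""
    outer = sorted(outer, key=lambda t: t[1])
    inner = sorted(inner, key=lambda t: t[1])
    result = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        outer_value, inner_value = outer[i][1], inner[j][1]
        if outer_value < inner_value:
            i += 1
        elif outer_value > inner_value:
            j += 1
        else:
            # Find the run of equal values on each side, then pair them up.
            i_end = i
            while i_end < len(outer) and outer[i_end][1] == outer_value:
                i_end += 1
            j_end = j
            while j_end < len(inner) and inner[j_end][1] == outer_value:
                j_end += 1
            for outer_doc, _ in outer[i:i_end]:
                for inner_doc, _ in inner[j:j_end]:
                    result.append((outer_doc, inner_doc))
            i, j = i_end, j_end
    return result


if __name__ == "__main__":
    # (docid, author) results from the first index (illustrative data).
    books = [(1, "Ann Lee"), (2, "Bob Ray"), (3, "Ann Lee")]
    # (docid, name) results from the author index (illustrative data).
    authors = [(10, "Ann Lee"), (11, "Cy Orr")]
    print(sort_merge_join(books, authors))
```

A hash join would replace the two sorts with a hash table built on one of the lists; the sort-merge form is shown here only because it is the variant named first in the description.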

As described in the various embodiments of the invention, the full-text index nested-loop join and full-text dictionary join processes incorporate the processing of join and local full-text predicates over both input collections. In full-text indexes, join attribute values can be obtained efficiently, and we have more options for join processing, such as sort-merge join.

The above-described full-text join processes can be directly used to support queries across multiple full-text indexes on different data collections. These processes can also be used in a database system that combines full-text indexes. For example, we can rewrite a SQL or XQuery query, and push the generated full-text predicates including join predicates down to the full-text indexes. By processing join predicates inside full-text indexes, we provide a database system with additional filtering capabilities, the avoidance of repeated evaluation of local full-text predicates, and the reduction of the cross-engine communication overhead.

As can be seen from the above disclosure, embodiments of the invention provide techniques for full-text join processes. As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

FIG. 8 is a high level block diagram showing an information processing system useful for implementing one embodiment of the present invention. The computer system includes one or more processors, such as processor 102. The processor 102 is connected to a communication infrastructure 104 (e.g., a communications bus, cross-over bar, or network). Various software embodiments are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person of ordinary skill in the relevant art(s) how to implement the invention using other computer systems and/or computer architectures.

The computer system can include a display interface 106 that forwards graphics, text, and other data from the communication infrastructure 104 (or from a frame buffer not shown) for display on a display unit 108. The computer system also includes a main memory 110, preferably random access memory (RAM), and may also include a secondary memory 112. The secondary memory 112 may include, for example, a hard disk drive 114 and/or a removable storage drive 116, representing, for example, a floppy disk drive, a magnetic tape drive, or an optical disk drive. The removable storage drive 116 reads from and/or writes to a removable storage unit 118 in a manner well known to those having ordinary skill in the art. Removable storage unit 118 represents, for example, a floppy disk, a compact disc, a magnetic tape, or an optical disk, etc. which is read by and written to by removable storage drive 116. As will be appreciated, the removable storage unit 118 includes a computer readable medium having stored therein computer software and/or data.

In alternative embodiments, the secondary memory 112 may include other similar means for allowing computer programs or other instructions to be loaded into the computer system. Such means may include, for example, a removable storage unit 120 and an interface 122. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 120 and interfaces 122 which allow software and data to be transferred from the removable storage unit 120 to the computer system.

The computer system may also include a communications interface 124. Communications interface 124 allows software and data to be transferred between the computer system and external devices. Examples of communications interface 124 may include a modem, a network interface (such as an Ethernet card), a communications port, or a PCMCIA slot and card, etc. Software and data transferred via communications interface 124 are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface 124. These signals are provided to communications interface 124 via a communications path (i.e., channel) 126. This communications path 126 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and/or other communications channels.

In this document, the terms "computer program medium," "computer usable medium," and "computer readable medium" are used to generally refer to media such as main memory 110 and secondary memory 112, removable storage drive 116, and a hard disk installed in hard disk drive 114.

Computer programs (also called computer control logic) are stored in main memory 110 and/or secondary memory 112. Computer programs may also be received via communications interface 124. Such computer programs, when executed, enable the computer system to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 102 to perform the features of the computer system. Accordingly, such computer programs represent controllers of the computer system.

From the above description, it can be seen that the present invention provides a system, computer program product, and method for implementing the embodiments of the invention. A reference in the claims to an element in the singular is not intended to mean "one and only" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above-described exemplary embodiment that are currently known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the present claims. No claim element herein is to be construed under the provisions of 35 U.S.C. section 112, sixth paragraph, unless the element is expressly recited using the phrase "means for" or "step for."

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
