Prefetching RDF Triple Data Pan; Yue ; et al. [INTERNATIONAL BUSINESS MACHINES CORPORATION]

Patent Application Summary

U.S. patent application number 13/305116 was filed with the patent office on 2011-11-28 and published on 2012-05-31 for prefetching RDF triple data. This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Yue Pan, Xing Zhi Sun, Qing Fa Wang, Shuo Wu, Lin Hao Xu.

Application Number	20120136875 13/305116
Document ID	/
Family ID	46091887
Filed Date	2012-05-31

United States Patent Application	20120136875
Kind Code	A1
Pan; Yue ; et al.	May 31, 2012

PREFETCHING RDF TRIPLE DATA

Abstract

Query requests for RDF triples are obtained, wherein each query request contains at least one triple pattern. For each triple pattern, the corresponding elementary pattern is determined, and the triple pattern is converted to a weighted elementary pattern. The occurrence frequency of each elementary pattern is computed based on the weighted elementary patterns; at least one elementary pattern is chosen at least according to the occurrence frequency; and the RDF triples corresponding to the chosen at least one elementary pattern are prefetched into the buffer. The corresponding apparatus is also provided. With the above method and apparatus, the frequently accessed RDF triples can be determined and prefetched into the buffer, which improves the query efficiency.

Inventors:	Pan; Yue; (Beijing, CN) ; Sun; Xing Zhi; (Beijing, CN) ; Wang; Qing Fa; (Beijing, CN) ; Wu; Shuo; (Beijing, CN) ; Xu; Lin Hao; (Beijing, CN)
Assignee:	INTERNATIONAL BUSINESS MACHINES CORPORATION Armonk NY
Family ID:	46091887
Appl. No.:	13/305116
Filed:	November 28, 2011

Current U.S. Class:	707/748 ; 707/E17.014
Current CPC Class:	G06F 16/953 20190101; G06F 16/24578 20190101; G06F 16/24558 20190101; G06F 16/2458 20190101; G06F 16/24539 20190101; G06F 16/9574 20190101; G06F 16/2455 20190101
Class at Publication:	707/748 ; 707/E17.014
International Class:	G06F 17/30 20060101 G06F017/30

Foreign Application Data

Date	Code	Application Number
Nov 29, 2010	CN	201010577037.2

Claims

1. A method for processing RDF triples, comprising: obtaining query requests for RDF triples, wherein said query requests each contain at least one triple pattern; determining elementary patterns corresponding to said triple patterns; performing weighting with respect to said corresponding elementary patterns to weighted elementary patterns; computing occurrence frequency of said elementary patterns based on the weighted elementary patterns; and prefetching those of the RDF triples corresponding to said elementary patterns into a buffer if the occurrence frequency of said elementary patterns meets at least one predetermined condition.

2. The method of claim 1, wherein obtaining the query requests for the RDF triples comprises reading query request records from a query log.

3. The method of claim 1, wherein said elementary patterns are in a form of <?s :p ?o>.

4. The method of claim 1, wherein said weighting comprises setting the weight of a certain triple pattern with respect to a corresponding elementary pattern as a constant.

5. The method of claim 1, wherein said weighting comprises determining weight of said at least one triple pattern with respect to a corresponding one of said elementary patterns by referring to statistical information in an RDF triple data storage system.

6. The method of claim 5, wherein said weighting comprises: setting the weight w (p, o) of triple patterns in <?s :p :o> form with respect to elementary patterns in <?s :p ?o> form as: w(p,o)=Num(p,o)/FACT(p), setting the weight w (s, p) of triple patterns in <:s :p ?o> form with respect to elementary patterns in <?s :p ?o> form as: w(s,p)=Num(s,p)/FACT(p), wherein Num(p,o) denotes the number of all triples with predicate p and object o, Num(s,p) denotes the number of all triples with predicate p and subject s, FACT(p) denotes the number of all triples with predicate p in said RDF triple data storage system.

7. The method of claim 5, wherein said weighting comprises: setting the weight w (p, o) of triple patterns in <?s :p :o> form with respect to elementary patterns in <?s :p ?o> form as: w(p,o)=MIN(DOM(p),DOM(o))/FACT(p), setting the weight w (s, p) of triple patterns in <:s :p ?o> form with respect to elementary patterns in <?s :p ?o> form as: w(s,p)=MIN(RNG(s),RNG(p))/FACT(p), wherein DOM (p) denotes the number of different subjects with predicate p; DOM (o) denotes the number of different subjects with object o; RNG(s) denotes the number of different objects with subject s; RNG(p) denotes the number of different objects with predicate p; and FACT (p) denotes the number of all triples with predicate p in said RDF triple data storage system.

8. The method of claim 1, wherein computing the occurrence frequency of said elementary patterns comprises summing up the weights of the same elementary pattern as the occurrence frequency of said elementary pattern.

9. The method of claim 1, wherein computing the occurrence frequency of said elementary patterns comprises computing the occurrence frequency of the elementary patterns corresponding to the triple patterns contained in said query requests based on the occurrence frequency of said query requests.

10. The method of claim 1, wherein prefetching the RDF triples corresponding to said elementary patterns into the buffer if the occurrence frequency of said elementary patterns meets at least one predetermined condition comprises making the total size of the RDF triples corresponding to said elementary patterns not exceed the buffer size, and making the occurrence frequency of said elementary patterns as high as possible.

11. An apparatus for processing RDF triples, comprising: a query obtaining unit, configured to obtain the query requests for RDF triples, wherein said query requests contain at least one triple pattern; a pattern analyzing unit, configured to convert said at least one triple pattern to a weighted elementary pattern; a frequency computing unit, configured to compute the occurrence frequency of said elementary patterns based on the weighted elementary patterns; and a data prefetching unit, configured to prefetch the RDF triples corresponding to said elementary patterns into the buffer if the occurrence frequency of said elementary patterns meets certain condition; wherein each of said query obtaining unit, said pattern analyzing unit, said frequency computing unit, and said data prefetching unit comprises at least one of: dedicated hardware; and software tangibly embodied in a non-transitory storage medium, loaded into a hardware memory, and executing on at least one hardware processor coupled to the memory.

12. The apparatus of claim 11, wherein said query obtaining unit is configured to read query request records from the query log.

13. The apparatus of claim 11, wherein said elementary patterns are in the form of <?s :p ?o>.

14. The apparatus of claim 11, wherein said pattern analyzing unit is configured to set the weight of certain triple patterns with respect to corresponding elementary patterns as a constant.

15. The apparatus of claim 11, wherein said pattern analyzing unit is configured to, determine the weight of said at least one triple pattern with respect to a corresponding one of said elementary patterns by referring to the statistical information in an RDF triple data storage system.

16. The apparatus of claim 15, wherein said pattern analyzing unit is configured to: set the weight w (p, o) of triple patterns in <?s :p :o> form with respect to elementary patterns in <?s :p ?o> form as: w(p,o)=Num(p,o)/FACT(p), and set the weight w (s, p) of triple patterns in <:s :p ?o> form with respect to elementary patterns in <?s :p ?o> form as: w(s,p)=Num(s,p)/FACT(p), wherein Num(p,o) denotes the number of all triples with predicate p and object o, Num(s,p) denotes the number of all triples with predicate p and subject s, FACT(p) denotes the number of all triples with predicate p in said RDF triple data storage system.

17. The apparatus of claim 15, wherein said pattern analyzing unit is configured to: set the weight w (p, o) of triple patterns in <?s :p :o> form with respect to elementary patterns in <?s :p ?o> form as: w(p,o)=MIN(DOM(p),DOM(o))/FACT(p), and set the weight w (s, p) of triple patterns in <:s :p ?o> form with respect to elementary patterns in <?s :p ?o> form as: w(s,p)=MIN(RNG(s),RNG(p))/FACT(p), wherein DOM (p) denotes the number of different subjects with predicate p; DOM (o) denotes the number of different subjects with object o; RNG(s) denotes the number of different objects with subject s; RNG(p) denotes the number of different objects with predicate p; and FACT (p) denotes the number of all triples with predicate p in said RDF triple data storage system.

18. The apparatus of claim 11, wherein said frequency computing unit is configured to sum up the weights of the same elementary pattern as the occurrence frequency of said elementary pattern.

19. The apparatus of claim 11, wherein said frequency computing unit is configured to compute the occurrence frequency of the elementary patterns corresponding to the triple patterns contained in said query requests based on the occurrence frequency of said query requests.

20. The apparatus of claim 11, wherein said data prefetching unit is configured to make the total size of the RDF triples corresponding to said elementary patterns not exceed the buffer size, and make the occurrence frequency of said elementary patterns as high as possible.

Description

CROSS-REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims foreign priority to P.R. China Patent application 201010577037.2 filed 29 Nov. 2010, the complete disclosure of which is expressly incorporated herein by reference in its entirety for all purposes.

FIELD OF THE INVENTION

[0002] This invention relates to the storage and management of RDF triple data, and more particularly relates to a method and an apparatus for accelerating the query and read of RDF triple data.

BACKGROUND OF THE INVENTION

[0003] RDF (Resource Description Framework) is a technical standard of markup language published by W3C (World Wide Web Consortium) to better describe and express the contents and the structure of Web resources. Particularly, RDF can be specially used to express the metadata about Web resources, such as the title, the author, the update time of Web pages, the copyright and the license of Web documents, the available schedule of some shared resources, and so on. However, when "Web resources" are generalized, RDF can be used to describe the information of anything that can be identified on the Web. Along with the development of semantic-based web description, RDF data are used more and more widely in various Web related applications, so the management of RDF data becomes more and more important.

[0004] Different from general relational data, RDF data are expressed in triple form, including <subject, predicate, object>. That is, RDF describes the relation between elements using such triples. When these RDF triples are stored into a storage system such as a database, usually they can be queried using SPARQL recommended by W3C.

[0005] FIG. 1 illustrates the structure of an existing RDF data storage and query system. System 100 comprises a database 101, a data loader 102, a data access module 103 and a query engine 104. Database 101 is configured to store RDF triple data. Specifically, database 101 contains an IRI table and a triple table. The IRI table stores the correspondence between each internal ID (or index) and the IRI string it represents, while the triple table stores the triples in their internal ID representation. Storing the data in this manner compresses it and saves storage space. When new RDF data are input from outside, data loader 102 receives and parses the input RDF data and transforms them into internal data models. For each IRI string in the internal data models, data access module 103 assigns a unique internal ID to it and stores the correspondence between the ID and the string in the IRI table. Then, for each RDF triple in the data models, data access module 103 stores its internal ID representation in the triple table. When the stored RDF triple data are queried, query engine 104 receives the user's SPARQL request and translates it into the corresponding standard SQL (Structured Query Language) statements. Data access module 103 retrieves the queried triples from database 101 according to the SQL statements and returns the results to query engine 104.
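The two-table layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `TripleStore`, its methods, and the sample terms `:rose` and `:db101` are hypothetical, and an in-memory list stands in for the database tables.

```python
class TripleStore:
    """Toy in-memory stand-in for the IRI table and the triple table."""

    def __init__(self):
        self.iri_to_id = {}  # IRI table: string -> internal ID
        self.id_to_iri = []  # reverse lookup: internal ID -> string
        self.triples = []    # triple table: (subject_id, predicate_id, object_id)

    def encode(self, term):
        """Return the internal ID of a term, assigning a new one if needed."""
        if term not in self.iri_to_id:
            self.iri_to_id[term] = len(self.id_to_iri)
            self.id_to_iri.append(term)
        return self.iri_to_id[term]

    def load(self, subject, predicate, obj):
        """Store one triple in its internal ID representation."""
        self.triples.append(tuple(self.encode(t) for t in (subject, predicate, obj)))

    def decode(self, triple_ids):
        """Translate a stored ID triple back to its strings."""
        return tuple(self.id_to_iri[i] for i in triple_ids)


store = TripleStore()
store.load(":rose", ":hasName", '"Rose"')
store.load(":rose", ":takeCourse", ":db101")
print(store.triples)
print(store.decode(store.triples[1]))
```

Because `:rose` occurs in both triples, it is stored as a string only once and referenced by the same ID twice, which is where the space saving comes from.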

[0006] The storage and query process of RDF data executed in the above system 100 will be described in detail in connection with specific examples. In one example, school course information is stored in database 101 in RDF triple form. Suppose that a user wants to know the names of the students who take Jack's course; then in query engine 104 the SPARQL query can be set as:

SELECT ?name
WHERE {
    ?student :hasName ?name.        (1)
    ?student :takeCourse ?course.   (2)
    ?course :toughtBy ?person.      (3)
    ?person :hasName "Jack".        (4)
}

[0007] In the above SPARQL query, all values of "name" are requested, wherein the sentences in WHERE{ } are the relations that the "name" should satisfy. Concretely, this query contains 4 triple-form sentences (1)-(4), each of which is called a triple pattern. It is understood that these sentences are numbered here for description convenience, and such numbers do not exist in the real query. Corresponding to RDF data, each triple pattern is also expressed in the form of <subject, predicate, object>, but a question mark can be added before at least one element of the triple so as to set it as a variable to be queried. For example, triple pattern (4) queries the variable person in the case that the corresponding predicate is hasName and the object is Jack; that is, the person whose name is Jack will be retrieved. Then, via triple pattern (3), the subject course will be queried in the case that the corresponding predicate is toughtBy and the object is the person retrieved above; that is, the course taught by that person will be retrieved. In triple pattern (2), all students who take the course will be queried, and finally in triple pattern (1), the names of those students are determined. Thus, via the above triple patterns (1)-(4), taking person, course and student as intermediate variables, the values of the queried name are finally determined.
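The evaluation order described above, from pattern (4) back to pattern (1), can be illustrated with a naive matcher over a handful of triples. This is a sketch, not how a query engine translating SPARQL to SQL would work; the function names and the sample identifiers `:rose`, `:db101` and `:jack` are hypothetical.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep any value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Join the triple patterns one after another (nested-loop style)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


triples = [
    (":rose", ":hasName", '"Rose"'),
    (":rose", ":takeCourse", ":db101"),
    (":db101", ":toughtBy", ":jack"),
    (":jack", ":hasName", '"Jack"'),
]
query = [
    ("?person", ":hasName", '"Jack"'),        # (4)
    ("?course", ":toughtBy", "?person"),      # (3)
    ("?student", ":takeCourse", "?course"),   # (2)
    ("?student", ":hasName", "?name"),        # (1)
]
names = [b["?name"] for b in evaluate(query, triples)]
print(names)
```

Every pattern scans the whole triple list here, which mirrors the repeated storage accesses that motivate prefetching.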

[0008] By executing the translated SQL query from the query engine 104, data access module 103 in FIG. 1 retrieves the query results accordingly from database 101 and returns them to query engine 104. In one example, the returned RDF triples are in the following form:

Subject	Predicate	Object
Course	toughtBy	person
Student	takeCourse	course
Person	hasName	"Jack"
Student	hasName	"Rose"

[0009] Through the above triples, the result of the above-described query can be obtained; that is, the name of the student who takes Jack's course is Rose.

[0010] In the above query process, data access module 103 continually searches and retrieves data from database 101 according to the query of each triple pattern. However, because there is a large amount of data stored in database 101, the database is usually realized using large capacity storage media, such as a large capacity hard disk. Thus, continually searching and retrieving data from the hard disk brings a high IO cost and further influences the query efficiency and system performance.

[0011] To improve query efficiency, one solution adopted in the database system is to prefetch a part of the data in the buffer which is easy to access, for example the memory or the cache of a computing system. Therefore, when the computing system queries or accesses this part of the data, it can read data directly from the buffer, thereby reducing IO cost. However, because the buffer size is usually very limited, which data should be prefetched into the buffer in order to optimize the query efficiency is an issue under investigation. For the general relational data, various methods have been proposed for prefetching a part of data in the existing techniques. However, because of the special format of RDF data, the existing techniques are not adapted to optimize RDF data query. Therefore, a method and an apparatus are needed for selectively prefetching a part of RDF data to the buffer so as to accelerate and optimize RDF data query.

SUMMARY OF THE INVENTION

[0012] In view of the above-mentioned problems, embodiments of the invention are provided to improve the query efficiency of RDF data.

[0013] According to a first aspect of the invention, a method for prefetching RDF triples from RDF triple data storage system is provided, wherein each RDF triple contains subject, predicate and object, the method comprises: obtaining the query requests for RDF triples, wherein the query requests contain at least one triple pattern; converting the at least one triple pattern to a weighted elementary pattern; computing the occurrence frequency of the elementary patterns based on the weighted elementary patterns; and prefetching the RDF triples corresponding to the elementary patterns into the buffer when the occurrence frequency of the elementary patterns meets certain condition(s).

[0014] According to a second aspect of the invention, an apparatus for prefetching RDF triples from RDF triple data storage system is provided, wherein each RDF triple contains subject, predicate and object, the apparatus comprises: a query obtaining unit, configured to obtain the query requests for RDF triples, wherein the query requests contain at least one triple pattern; a pattern analyzing unit, configured to convert the at least one triple pattern to a weighted elementary pattern; a frequency computing unit, configured to compute the occurrence frequency of the elementary patterns based on the weighted elementary patterns; and a data prefetching unit, configured to prefetch the RDF triples corresponding to the elementary patterns into the buffer when the occurrence frequency of the elementary patterns meets certain condition(s).

[0015] With the method and the apparatus of one or more embodiments of the invention, the query patterns with higher occurrence frequency can be determined, so that the RDF triples with higher access frequency can be determined and prefetched into the easy-to-access buffer. Then, in later queries, the frequently accessed RDF data can be read directly from the buffer, which can reduce IO cost and improve query efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] FIG. 1 illustrates the structure of an existing RDF data storage and query system;

[0017] FIG. 2 is a flowchart of the method according to one embodiment of the invention;

[0018] FIG. 3A illustrates some exemplary RDF triples stored in an RDF database;

[0019] FIG. 3B illustrates some statistical results of the data shown in FIG. 3A;

[0020] FIG. 4 illustrates the RDF data storage and query system comprising a prefetching apparatus according to one embodiment of the invention; and

[0021] FIG. 5 is a block diagram of the prefetching apparatus according to one embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0022] The following is the description of the embodiments in connection with the drawings. It is understood that the detailed description is illustrative, rather than restrictive, to the scope of the present invention.

[0023] FIG. 2 is a flowchart of the method according to one embodiment of the invention, wherein the method is used to prefetch a part of the RDF triples stored in an RDF data storage system to the buffer. Specifically, the method comprises step 201, obtaining the query requests for RDF triples, wherein the query requests contain at least one triple pattern; step 202, for each of the obtained at least one triple pattern, determining the corresponding elementary pattern and performing weighting with respect to the corresponding elementary pattern; step 203, computing the occurrence frequency of each elementary pattern based on the weighted elementary patterns; step 204, choosing at least one elementary pattern at least according to the occurrence frequency; and step 205, prefetching the RDF triples corresponding to the chosen at least one elementary pattern into the buffer. Through the above steps, the most frequently queried and accessed RDF triples can be determined and thus prefetched into the buffer in one or more embodiments of the invention, which improves the query efficiency.

[0024] The following is the description of the above steps shown in FIG. 2 in connection with the specific examples.

[0025] At step 201, the query requests for RDF triples are obtained. In one embodiment, those query requests are obtained in real time from the query engine. In another embodiment, the records of those query requests are read from the system query log. Optionally, a plurality of query requests, i.e. a query set, can be obtained at one time. Typically, the search and query requests for RDF data are SPARQL queries, each of which contains at least one triple pattern, such as triple patterns (1)-(4) shown in the prior art.

[0026] Then, at step 202, the obtained triple patterns are analyzed and converted. Firstly, for each triple pattern, the corresponding elementary pattern is determined. The elementary pattern is defined mainly according to the data feature in the RDF triple data storage system and the request feature of the data query. In one embodiment, the elementary pattern is defined as the triple pattern in which only the predicate is constant; that is, the triple pattern in the form of <?subject, predicate, ?object>. If s denotes the subject, p denotes the predicate, o denotes the object, prefix '?' denotes a query variable, and prefix ':' denotes a constant, then the elementary pattern can be represented as <?s :p ?o>. It is understood that the elementary pattern can be defined in other forms, for example, the triple pattern in <:s ?p ?o> form in which only the subject is constant, the triple pattern in <?s ?p :o> form in which only the object is constant, the triple pattern in <:s :p ?o> form with constant subject and constant predicate, etc. The following description of the embodiments uses elementary patterns in <?s :p ?o> form. Those skilled in the art can understand that the embodiments of this invention are also applicable to other elementary patterns.

[0027] The advantage of defining elementary patterns as <?s :p ?o>, and of classifying the triple patterns and the triple data by predicate, is that the number of different predicates of the RDF triples stored in an RDF database is much smaller than the number of RDF triples themselves. For example, in the RDF dataset of Wikipedia, the number of RDF triples is about 136.9 million, but the number of predicates referred to is only 927. Moreover, among all possible triple patterns, those with a constant predicate, namely <?s :p :o>, <?s :p ?o> and <:s :p ?o>, are the most common, while the patterns <?s ?p :o>, <:s ?p :o> and <:s ?p ?o>, which query the predicate, are seldom used, and <?s ?p ?o>, which queries all elements, even less so. Currently, the triple patterns contained in the standard test set of SPARQL are mostly these common triple patterns with a constant predicate.

[0028] For the above-mentioned common triple patterns <?s :p :o>, <?s :p ?o> and <:s :p ?o>, it can be seen that <?s :p ?o> itself is an elementary pattern, while <?s :p :o> and <:s :p ?o> query only the subject or only the object, so their query results must be a subset of the query results of the elementary pattern <?s :p ?o> with the same predicate. Therefore, each triple pattern contained in SPARQL queries whose predicate is a constant can be mapped to an elementary pattern defined above. Accordingly, the step of determining the elementary pattern corresponding to a triple pattern is to determine the elementary pattern having the same predicate as the triple pattern.

[0029] In the illustrated triple patterns (1)-(4), triple patterns (1)-(3) all have a constant predicate and query both the subject and the object, i.e. they are elementary patterns. In triple pattern (4), the object "Jack" is also a constant in addition to the constant predicate hasName, so it is not an elementary pattern. Its corresponding elementary pattern is therefore <?s :hasName ?o>, which has the same predicate.
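The mapping from a triple pattern to its elementary pattern can be sketched as follows. The tuple representation and the function names are assumptions made for illustration; patterns whose predicate is a variable are simply reported as having no elementary pattern of the <?s :p ?o> form.

```python
def is_variable(term):
    """A term prefixed with '?' is a query variable."""
    return term.startswith("?")


def is_elementary(pattern):
    """True for patterns of the form <?s :p ?o>."""
    s, p, o = pattern
    return is_variable(s) and not is_variable(p) and is_variable(o)


def elementary_pattern(pattern):
    """Return the elementary pattern with the same predicate, or None."""
    _, p, _ = pattern
    if is_variable(p):
        return None  # predicate is queried, so no <?s :p ?o> pattern applies
    return ("?s", p, "?o")


patterns = [
    ("?student", ":hasName", "?name"),       # (1)
    ("?student", ":takeCourse", "?course"),  # (2)
    ("?course", ":toughtBy", "?person"),     # (3)
    ("?person", ":hasName", '"Jack"'),       # (4)
]
for pattern in patterns:
    print(pattern, is_elementary(pattern), elementary_pattern(pattern))
```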

[0030] After the corresponding elementary pattern of each triple pattern is determined, the triple pattern is weighted with respect to the corresponding elementary pattern in terms of occurrence frequency, so as to convert it to the weighted elementary pattern. This is because an elementary pattern defines only the predicate, so its query results include all triples with the specified predicate, in other words, the complete set of the specified predicate. Therefore, the query of an elementary pattern results in accessing and retrieving the complete set of the specified predicate, whereas in the triple patterns that are not elementary patterns, the subject or the object is also defined and the query results are only a part of that complete set. That is, a triple pattern whose predicate is a constant but which is not an elementary pattern accesses only a part of the data accessed by its corresponding elementary pattern. Then, to evaluate the contribution of each triple pattern to the access frequency of the triple data, the non-elementary patterns should be discounted with respect to elementary patterns in terms of occurrence times; that is, they should be weighted.

[0031] In one embodiment, the weight of non-elementary patterns is simply defined as 0.5 relative to the corresponding elementary patterns. Then, triple patterns (1)-(4) can be converted to:

<?s :hasName ?o>	(1')
<?s :takeCourse ?o>	(2')
<?s :toughtBy ?o>	(3')
<?s :hasName ?o>*0.5	(4')
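The constant-weight conversion can be sketched as below. The constant name `NON_ELEMENTARY_WEIGHT` and the function `to_weighted_elementary` are hypothetical; only the value 0.5 comes from the embodiment above.

```python
NON_ELEMENTARY_WEIGHT = 0.5  # constant discount used in this embodiment


def to_weighted_elementary(pattern, discount=NON_ELEMENTARY_WEIGHT):
    """Convert a constant-predicate triple pattern to (elementary pattern, weight)."""
    s, p, o = pattern
    if p.startswith("?"):
        raise ValueError("the predicate must be a constant")
    elementary = s.startswith("?") and o.startswith("?")
    return ("?s", p, "?o"), (1.0 if elementary else discount)


patterns = [
    ("?student", ":hasName", "?name"),       # (1)
    ("?student", ":takeCourse", "?course"),  # (2)
    ("?course", ":toughtBy", "?person"),     # (3)
    ("?person", ":hasName", '"Jack"'),       # (4)
]
converted = [to_weighted_elementary(p) for p in patterns]
for elementary, weight in converted:
    print(elementary, weight)
```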

[0032] In some embodiments, the triple patterns are weighted by referring to the statistical information of the RDF database.

[0033] Specifically, in one embodiment, for triple pattern <?s :p :o>, Num (p, o) is defined as the number of triples in the RDF database with predicate p and object o, and FACT (p) is defined as the number of all triples with predicate p, i.e., the number of different <s,o> pairs. Then, the weight w (p, o) of triple pattern <?s :p :o> can be defined as:

w(p,o)=Num(p,o)/FACT(p)

[0034] Accordingly, for triple pattern <:s :p ?o>, Num (s, p) is defined as the number of triples in the RDF database with predicate p and subject s, and the weight w (s, p) of triple pattern <:s :p ?o> can be defined as:

w(s,p)=Num(s,p)/FACT(p)

[0035] For triple pattern <?s :p ?o>, since it is an elementary pattern, its weight is set as 1. Thereby, those three triple patterns contained in the SPARQL queries have been weighted.
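A minimal sketch of this count-based weighting follows, assuming the triples are available as an in-memory collection. The function name, the sample data and the choice to return 0 when no triple has the predicate are assumptions, not part of the described embodiment.

```python
from fractions import Fraction


def is_var(term):
    return term.startswith("?")


def count_weight(pattern, triples):
    """Weight of a constant-predicate pattern: Num(...)/FACT(p), or 1 if elementary."""
    s, p, o = pattern
    if is_var(p):
        raise ValueError("the predicate must be a constant")
    if is_var(s) and is_var(o):
        return Fraction(1)  # elementary pattern <?s :p ?o>
    pairs = {(ts, to) for ts, tp, to in triples if tp == p}
    fact_p = len(pairs)  # FACT(p): number of different <s,o> pairs
    if fact_p == 0:
        return Fraction(0)  # assumption: nothing stored, so nothing is accessed
    if is_var(s):
        num = sum(1 for _, to in pairs if to == o)  # Num(p,o)
    elif is_var(o):
        num = sum(1 for ts, _ in pairs if ts == s)  # Num(s,p)
    else:
        raise ValueError("<:s :p :o> is not covered by the text")
    return Fraction(num, fact_p)


# Hypothetical sample data
triples = [
    ("pub1", "type", "Article"),
    ("pub2", "type", "Article"),
    ("pub3", "type", "Book"),
    ("alice", "type", "Person"),
    ("pub1", "author", "alice"),
]
print(count_weight(("?x", "type", "Article"), triples))
print(count_weight(("alice", "type", "?t"), triples))
print(count_weight(("?x", "author", "?y"), triples))
```

Exact fractions are used so that weights such as 1/11 are not rounded before they are summed.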

[0036] In other embodiments, more statistical information of the RDF database can be considered. In one embodiment, Domain statistics and Range statistics of the triples in the RDF database are defined, wherein Domain statistics are used to compute the subject number and Range statistics are used to compute the object number.

[0037] Specifically, function DOM (p) is defined to denote the number of different subjects s with constant predicate p (the object can be any) in the RDF database; function DOM (o) is defined to denote the number of different subjects s with constant object o (the predicate can be any) in the RDF database.

[0038] Function RNG(s) is defined to denote the number of different objects o with constant subject s (the predicate can be any) in the RDF database; RNG(p) is defined to denote the number of different objects o with constant predicate p (the subject can be any) in the RDF database.

[0039] Furthermore, FACT (p) defined in the above embodiment is used to denote the number of different triples with predicate p, i.e., the number of different <s,o> pairs.

[0040] Based on the above statistics, the weight w (p, o) of triple pattern <?s :p :o> can be defined as:

$$w(p,o) = \frac{\mathrm{MIN}(\mathrm{DOM}(p),\ \mathrm{DOM}(o))}{\mathrm{FACT}(p)} \qquad \text{(i)}$$

[0041] For triple pattern <:s :p ?o>, its weight w(s,p) can be defined as:

$$w(s,p) = \frac{\mathrm{MIN}(\mathrm{RNG}(s),\ \mathrm{RNG}(p))}{\mathrm{FACT}(p)} \qquad \text{(ii)}$$

[0042] In the same way, for elementary pattern <?s :p ?o>, its weight is set as 1.

[0043] The process of weighting and converting triple patterns according to the above-mentioned embodiment is described below in connection with one example. FIG. 3A illustrates some exemplary RDF triples stored in an RDF database, and FIG. 3B illustrates some statistical results of the data shown in FIG. 3A. In the triples shown in FIG. 3A, taking predicate "type" as an example, it can be seen that the number of different subjects with predicate "type" is 10, i.e. DOM (type)=10; the number of different objects with predicate "type" is 6, i.e. RNG (type)=6; and the number of the triples with predicate "type" is 11, i.e. FACT (type)=11. Other predicates and functions can be analyzed similarly, and thus the statistical results shown in FIG. 3B can be obtained. These statistical results, as auxiliary information for further use, can be pre-stored in certain areas of the database and updated periodically, or updated when new data are received in the database.
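The Domain, Range and FACT statistics, and the weights of formulas (i) and (ii) built on them, can be sketched as follows. The data of FIG. 3A is not reproduced here, so the sample triples are hypothetical, as are the function names and the dictionary layout of the statistics.

```python
from collections import defaultdict
from fractions import Fraction


def statistics(triples):
    """Compute DOM(p), DOM(o), RNG(s), RNG(p) and FACT(p) in one pass."""
    dom_p, dom_o = defaultdict(set), defaultdict(set)
    rng_s, rng_p = defaultdict(set), defaultdict(set)
    fact = defaultdict(int)
    for s, p, o in set(triples):  # count each distinct triple once
        dom_p[p].add(s)  # subjects seen with predicate p
        dom_o[o].add(s)  # subjects seen with object o
        rng_s[s].add(o)  # objects seen with subject s
        rng_p[p].add(o)  # objects seen with predicate p
        fact[p] += 1

    def sizes(table):
        return {key: len(values) for key, values in table.items()}

    return {"DOM_p": sizes(dom_p), "DOM_o": sizes(dom_o),
            "RNG_s": sizes(rng_s), "RNG_p": sizes(rng_p), "FACT": dict(fact)}


def statistical_weight(pattern, stats):
    """Weight per formula (i) or (ii); elementary patterns get weight 1."""
    s, p, o = pattern
    if p.startswith("?"):
        raise ValueError("the predicate must be a constant")
    s_var, o_var = s.startswith("?"), o.startswith("?")
    if s_var and o_var:
        return Fraction(1)
    fact_p = stats["FACT"].get(p, 0)
    if fact_p == 0:
        return Fraction(0)  # assumption: unknown predicate contributes nothing
    if s_var:    # <?s :p :o>, formula (i)
        smaller = min(stats["DOM_p"][p], stats["DOM_o"].get(o, 0))
    elif o_var:  # <:s :p ?o>, formula (ii)
        smaller = min(stats["RNG_s"].get(s, 0), stats["RNG_p"][p])
    else:
        raise ValueError("<:s :p :o> is not covered by the text")
    return Fraction(smaller, fact_p)


# Hypothetical sample data, not the triples of FIG. 3A
triples = [
    ("pub1", "type", "Article"), ("pub2", "type", "Article"),
    ("pub3", "type", "Book"), ("pub1", "author", "alice"),
    ("pub2", "author", "alice"), ("alice", "name", "Alice"),
]
stats = statistics(triples)
print(stats["DOM_p"]["type"], stats["RNG_p"]["type"], stats["FACT"]["type"])
print(statistical_weight(("?x", "type", "Article"), stats))
print(statistical_weight(("pub1", "author", "?who"), stats))
```

Keeping DOM(p) and DOM(o) in separate tables avoids a clash when the same term is used both as a predicate and as an object.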

[0044] Suppose the first SPARQL query for the data in FIG. 3A is defined as:

SELECT ?publication
WHERE {
    ?publication type Article          (11)
    ?publication author ?researcher    (12)
    ?researcher workAt ?university     (13)
    ?university name NUS               (14)
}

[0045] Here, triple patterns (12) and (13) are elementary patterns whose weight is 1; the corresponding elementary pattern of triple pattern (11) is <?s type ?o>, and the corresponding elementary pattern of triple pattern (14) is <?s name ?o>. Substituting the statistical results in FIG. 3B into formulas (i) and (ii), it can be obtained that the weights of triple patterns (11) and (14) are 1/11 and 1/8 respectively. Therefore, the first query can be converted into:

<?s type ?o>*1/11	(11')
<?s author ?o>	(12')
<?s workAt ?o>	(13')
<?s name ?o>*1/8	(14')

[0046] Similarly, suppose the second SPARQL query is:

SELECT ?publication
WHERE {
    ?researcher supervise ?student       (21)
    ?researcher name "Ooi Beng Chin"     (22)
    ?publication author ?student         (23)
}

[0047] According to the above process, the second query can be converted into:

<?s supervise ?o>	(21')
<?s name ?o>*1/8	(22')
<?s author ?o>	(23')

[0048] Although some statistical methods and weighting methods are illustrated in the above embodiments, it is understood that those skilled in the art can modify the above methods or use other methods after reading this description. Any method for weighting the triple patterns, as long as it can reflect the effect of triple patterns on the access frequency of triples in the database in some aspect or in some degree, can be adopted for embodiments of the invention.

[0049] Moreover, the above embodiments are all described in connection with elementary patterns in the form of <?s :p ?o>. For other forms of elementary patterns, the corresponding weighting method can be adopted according to need to convert triple patterns to weighted elementary patterns, in order to reflect the effect of the triple patterns on triple data access frequency.

[0050] After weighting and converting the triple patterns in the queries, step 203 in FIG. 2 computes the occurrence frequency of each elementary pattern based on the weighted elementary patterns.

[0051] For example, for the first query and the second query above, the occurrence frequency of each elementary pattern can be obtained by summing up the weight factors of the same elementary pattern among the weighted elementary patterns (11') to (14') and (21') to (23'). Specifically, the occurrence frequency of <?s type ?o> is 1/11, the occurrence frequency of <?s author ?o> is 1+1=2, the occurrence frequency of <?s workAt ?o> is 1, the occurrence frequency of <?s name ?o> is 1/8+1/8=1/4, and the occurrence frequency of <?s supervise ?o> is 1.
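
By way of illustration only, the summation described in this paragraph may be sketched in Python as follows. The function name occurrence_frequencies and the representation of each weighted elementary pattern as a (predicate, weight) pair are assumptions made for this sketch; the weights 1/11 and 1/8 are the values stated above.

```python
from collections import defaultdict
from fractions import Fraction

# Weighted elementary patterns (11')-(14') and (21')-(23'), keyed by predicate.
# Each <?s :p ?o> is identified by its predicate p (an assumption of this sketch).
query_1 = [("type", Fraction(1, 11)), ("author", Fraction(1)),
           ("workAt", Fraction(1)), ("name", Fraction(1, 8))]
query_2 = [("supervise", Fraction(1)), ("name", Fraction(1, 8)),
           ("author", Fraction(1))]


def occurrence_frequencies(queries):
    """Sum the weights of identical elementary patterns across all queries."""
    freq = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at 0
    for weighted_patterns in queries:
        for predicate, weight in weighted_patterns:
            freq[predicate] += weight
    return dict(freq)


frequencies = occurrence_frequencies([query_1, query_2])
for predicate, value in frequencies.items():
    print(f"<?s {predicate} ?o>: {value}")
```

Exact fractions are used so that the sums reproduce the values 1/11, 2, 1, 1/4 and 1 without rounding.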

[0052] In one embodiment, for a plurality of queries, firstly the occurrence frequency of each query is computed, and then the occurrence frequencies of the elementary patterns involved in the queries are computed based on the occurrence frequency of each query. In a specific example, a query set Q is obtained, which contains a plurality of different queries, i.e. $Q=\{q_1, q_2, \ldots, q_m\}$. Suppose the occurrence frequency of a query $q_i$ is $f(q_i)$. For each query $q_i$, the corresponding elementary pattern p and the corresponding weight $w_{p,q_i}$ can be determined as mentioned above. Then, the occurrence frequency of elementary pattern p involved in query $q_i$ can be represented as $f(q_i) \times w_{p,q_i}$. For the above set Q, the occurrence frequency f(p) of elementary pattern p can be represented as:

$$f(p)=\sum_{q_i \in Q'} f(q_i) \times w_{p,q_i}$$

wherein Q' denotes the set of queries involving elementary pattern p.
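
A minimal sketch of this computation is given below, under stated assumptions. The query frequencies (3 and 2) are hypothetical values chosen only for illustration, and the dictionary layout and the name pattern_frequencies do not come from the description. The weight stored for a pattern p under a query q is taken to be the total weight of p in q.

```python
from collections import defaultdict
from fractions import Fraction


def pattern_frequencies(query_freq, pattern_weights):
    """Compute f(p) as the sum of f(q) * w[p, q] over the queries q involving p."""
    f = defaultdict(Fraction)
    for q, weights in pattern_weights.items():
        for p, w in weights.items():
            f[p] += query_freq[q] * w
    return dict(f)


# Hypothetical query log: the first query was seen 3 times, the second twice.
query_freq = {"q1": 3, "q2": 2}
pattern_weights = {
    "q1": {"type": Fraction(1, 11), "author": Fraction(1),
           "workAt": Fraction(1), "name": Fraction(1, 8)},
    "q2": {"supervise": Fraction(1), "name": Fraction(1, 8),
           "author": Fraction(1)},
}

f = pattern_frequencies(query_freq, pattern_weights)
print(f["author"], f["name"])
```

Only queries that actually list a pattern contribute to its sum, which corresponds to restricting the summation to Q'.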

[0053] Thereby, the occurrence frequency of each elementary pattern can be determined.

[0054] Based on the above computed occurrence frequency, at least one elementary pattern is chosen at step 204, and at step 205, RDF triples corresponding to the at least one elementary pattern are prefetched into the buffer. Generally, the chosen elementary patterns are those with higher occurrence frequency. Since these elementary patterns occur more frequently in the queries, their corresponding RDF triples are accessed more frequently in the RDF database, and thus prefetching these RDF triples into the buffer will speed up the queries.

[0055] In one embodiment, the obtained occurrence frequencies of the elementary patterns are ordered, and several elementary patterns with the highest occurrence frequency are simply chosen from the order. The RDF triples corresponding to the chosen elementary patterns are prefetched into the buffer.

[0056] In some embodiments, the capacity limit and the utilization ratio of the buffer are also taken into consideration. That is, it is expected that the RDF triples corresponding to elementary patterns with higher occurrence frequency are chosen while the total size of these triples does not exceed the buffer size and at the same time the benefit of the buffer is optimized. The optimized benefit of the buffer means that the triples stored in the buffer are as many as possible and the access frequency of these triples is as high as possible, etc.

[0057] This target can be generalized as a constrained optimization problem in mathematics. If M is the buffer size, $\mathrm{size}(p_i)$ is the size of the triples in the RDF database corresponding to elementary pattern $p_i$, and $a_i$ is the choosing factor of elementary pattern $p_i$, i.e. $a_i$ is 0 or 1, then for n elementary patterns, they should meet the constraint:

$$\sum_{i=1}^{n} a_i \times \mathrm{size}(p_i) \le M \qquad \text{(iii)}$$

[0058] Meanwhile, the benefit function is defined as:

$$B=\sum_{i=1}^{n} a_i \times \mathrm{size}(p_i) \times f(p_i).$$

[0059] Thus, the above problem can be represented as how to determine the values of $a_i$ so as to maximize the benefit function B while meeting constraint (iii).

[0060] One common method for solving the above optimization problem is to first order the elementary patterns in a queue according to their occurrence frequency, from high to low. For the elementary pattern with the highest occurrence frequency in the queue, its choosing factor is tentatively supposed to be 1, and it is judged whether constraint (iii) is then met. If the constraint is met, the choosing factor is set to 1; that is, the elementary pattern is chosen, and the next elementary pattern in the queue is judged in turn. If choosing a certain elementary pattern in the queue would violate constraint (iii), that elementary pattern is ignored; that is, its choosing factor is set to 0, and the judging continues with the next elementary pattern until the whole queue has been checked.
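
The queue-based procedure of this paragraph may be sketched as follows. The sizes and the buffer size M=100 are hypothetical figures, and the function name choose_patterns is an assumption of this sketch; the frequencies are those of the two example queries.

```python
def choose_patterns(patterns, buffer_size):
    """Greedy choice: visit patterns from highest to lowest frequency and
    keep each one whose triples still fit into the remaining buffer space.

    patterns: list of (name, frequency, size) tuples.
    Returns the names of the chosen patterns (choosing factor 1).
    """
    chosen, used = [], 0
    queue = sorted(patterns, key=lambda item: item[1], reverse=True)
    for name, _frequency, size in queue:
        if used + size <= buffer_size:  # constraint (iii) still holds
            chosen.append(name)
            used += size
        # otherwise the pattern is ignored (choosing factor 0)
    return chosen


# (elementary pattern, occurrence frequency, hypothetical size of its triples)
patterns = [
    ("type", 1 / 11, 10),
    ("author", 2, 40),
    ("workAt", 1, 30),
    ("name", 0.25, 20),
    ("supervise", 1, 50),
]
print(choose_patterns(patterns, buffer_size=100))
```

With these figures, the pattern for supervise is skipped because it would push the total to 120, while the smaller name and type patterns that follow it in the queue still fit.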

[0061] For the above constrained optimization problem, various approaches have been proposed in the existing technique to obtain optimized solutions, which need not be described here. It is understood that those skilled in the art can adopt a proper approach to choose elementary patterns according to needs in order to optimize the benefit of the buffer.
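
The description does not name any particular alternative approach. Purely as one possible example, the problem has the shape of a 0/1 knapsack problem, for which dynamic programming is a standard technique when the sizes are integers. The sketch below assumes integer sizes; the function name best_selection and the sample figures are hypothetical.

```python
def best_selection(patterns, buffer_size):
    """0/1 knapsack by dynamic programming (assumes integer sizes).

    patterns: list of (name, frequency, size) tuples.
    Returns (benefit B, names of the chosen patterns).
    """
    # best[m] holds the best (benefit, chosen names) within capacity m.
    best = [(0, [])] * (buffer_size + 1)
    for name, frequency, size in patterns:
        value = size * frequency  # this pattern's term in the benefit function B
        # Walk capacities downwards so that each pattern is used at most once.
        for m in range(buffer_size, size - 1, -1):
            candidate = best[m - size][0] + value
            if candidate > best[m][0]:
                best[m] = (candidate, best[m - size][1] + [name])
    return best[buffer_size]


# Hypothetical patterns: (name, occurrence frequency, size)
patterns = [("A", 3, 60), ("B", 2, 50), ("C", 2, 50)]
print(best_selection(patterns, buffer_size=100))
```

In this hypothetical case, choosing by frequency alone would take A (benefit 180) and then find that neither B nor C fits, whereas the dynamic program selects B and C for a benefit of 200.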

[0062] As described above, by determining the occurrence frequency of each elementary pattern involved in SPARQL queries, and prefetching the corresponding triples of some elementary patterns into the buffer according to the occurrence frequency of these elementary patterns, the frequently accessed data in the RDF database can be stored in the buffer in advance. Thus, later queries are likely to read the data directly from the buffer, which reduces IO cost and improves the query efficiency of RDF triples.

[0063] Based on the same inventive concept, this invention also provides an apparatus for prefetching RDF triple data. Advantageously, this apparatus is constructed, as far as possible, on the basis of the existing RDF data storage and query system shown in FIG. 1, with the existing architecture modified as little as possible. Therefore, one or more embodiments of the invention propose to add a prefetching apparatus to the existing RDF data storage and query system, in order to analyze and choose the triples with higher access frequency and prefetch them into the buffer.

[0064] Specifically, FIG. 4 illustrates an RDF data storage and query system comprising the prefetching apparatus according to one embodiment of the invention. Compared with the system in FIG. 1, the system in FIG. 4 additionally comprises a prefetching apparatus 500, which communicates with database 101, in order to prefetch the frequently queried triples into buffer 1011. Optionally, the prefetching apparatus 500 also connects with data loader 102 and/or query engine 104 in order to obtain the information about data storage and query.

[0065] FIG. 5 is a block diagram of the prefetching apparatus according to one embodiment of the invention. As shown in the figure, the prefetching apparatus 500 comprises a query obtaining unit 501, configured to obtain the query requests for RDF triples, wherein the query requests contain at least one triple pattern; a pattern analyzing unit 502, configured to, for each of the obtained at least one triple pattern, determine the corresponding elementary pattern and perform weighting with respect to the corresponding elementary pattern; a frequency computing unit 503, configured to compute the occurrence frequency of each elementary pattern based on the weighted elementary patterns; and a data prefetching unit 504, configured to choose at least one elementary pattern at least according to the occurrence frequency, and prefetch the RDF triples corresponding to the chosen at least one elementary pattern into the buffer.

[0066] Specifically, the query obtaining unit 501 obtains the query requests for RDF triples. In one embodiment, the query obtaining unit 501 connects with query engine 104, to acquire the query requests in real time. In another embodiment, the query obtaining unit 501 reads the query records from the system log. Optionally, multiple query requests, i.e. a query set, can be obtained at one time. For SPARQL queries for RDF data, each query contains at least one triple pattern. The query obtaining unit 501 sends the obtained queries and the contained triple patterns to pattern analyzing unit 502.

[0067] Pattern analyzing unit 502 analyzes and converts the received triple patterns. Firstly, for each triple pattern, pattern analyzing unit 502 determines the corresponding elementary pattern; that is, it determines the elementary pattern <?s :p ?o> having the same predicate as the triple pattern.
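
The mapping from a triple pattern to its elementary pattern may be sketched as below. Representing a triple pattern as a (subject, predicate, object) tuple of strings, and the helper names, are assumptions of this sketch; variables are recognized by the leading "?" or "$" that SPARQL uses for them.

```python
def is_variable(term):
    """SPARQL variables are written with a leading '?' or '$'."""
    return term.startswith(("?", "$"))


def elementary_pattern(triple_pattern):
    """Return the elementary pattern <?s :p ?o> with the same predicate."""
    _subject, predicate, _object = triple_pattern
    return ("?s", predicate, "?o")


def is_elementary(triple_pattern):
    """True if only the predicate of the triple pattern is a constant."""
    subject, predicate, obj = triple_pattern
    return (is_variable(subject) and not is_variable(predicate)
            and is_variable(obj))


print(elementary_pattern(("?publication", "type", "Article")))
print(is_elementary(("?publication", "author", "?researcher")))
```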

[0068] After determining the corresponding elementary pattern of each triple pattern, pattern analyzing unit 502 weights the triple pattern with respect to the corresponding elementary pattern in terms of occurrence times and thus converts triple patterns to weighted elementary patterns.

[0069] In one embodiment, pattern analyzing unit 502 simply sets the weight of non-elementary patterns compared with the elementary patterns as a fixed value, for example 0.5. In other embodiments, pattern analyzing unit 502 further connects with database 101 and/or data loader 102, in order to weight triple patterns by referring to the statistical information in the RDF database.

[0070] Specifically, in one embodiment, pattern analyzing unit 502 computes the weight of triple pattern <?s :p :o> using the formula w(p,o)=Num(p,o)/FACT(p), and computes the weight of triple pattern <:s :p ?o> using formula w(s,p)=Num(s,p)/FACT(p), wherein Num (p, o) denotes the number of different triples with predicate p and object o in RDF database; Num (s, p) denotes the number of different triples with predicate p and subject s in RDF database; FACT (p) denotes the number of all triples with predicate p. For elementary pattern <?s :p ?o>, pattern analyzing unit 502 sets its weight as 1.
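
A minimal sketch of these two formulas follows, assuming the RDF database is available as an in-memory collection of (subject, predicate, object) tuples. The sample triples are hypothetical and are not the data of FIG. 3A; returning a weight of 0 for a predicate that does not occur in the database is likewise an assumption, since the description does not address that case.

```python
from fractions import Fraction


def is_variable(term):
    return term.startswith("?")


def simple_weight(triple_pattern, triples):
    """Weight Num(p,o)/FACT(p) for <?s :p :o>, Num(s,p)/FACT(p) for <:s :p ?o>."""
    s, p, o = triple_pattern
    distinct = set(triples)  # "different triples": duplicates count once
    if is_variable(s) and is_variable(o):
        return Fraction(1)  # elementary pattern <?s :p ?o>
    fact_p = sum(1 for t in distinct if t[1] == p)
    if fact_p == 0:
        return Fraction(0)  # assumption: predicate absent from the database
    if is_variable(s):  # <?s :p :o>
        num = sum(1 for t in distinct if t[1] == p and t[2] == o)
    elif is_variable(o):  # <:s :p ?o>
        num = sum(1 for t in distinct if t[0] == s and t[1] == p)
    else:
        raise ValueError("pattern form not covered by this sketch")
    return Fraction(num, fact_p)


# Hypothetical triples
triples = [
    ("u1", "name", "NUS"), ("r1", "name", "Ooi Beng Chin"),
    ("r2", "name", "Alice"), ("s1", "name", "Bob"),
    ("p1", "author", "r1"), ("p1", "author", "s1"), ("p2", "author", "r2"),
]
print(simple_weight(("?university", "name", "NUS"), triples))
print(simple_weight(("p1", "author", "?who"), triples))
```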

[0071] In other embodiments, pattern analyzing unit 502 further considers more statistical information of the RDF database. In one example, pattern analyzing unit 502 considers Domain statistics and Range statistics of triple data in the RDF database. Specifically, pattern analyzing unit 502 computes the weight w(p, o) of triple pattern <?s :p :o> using formula (i) and computes the weight w(s, p) of triple pattern <:s :p ?o> using formula (ii):

$$w(p,o)=\frac{\mathrm{MIN}(\mathrm{DOM}(p),\mathrm{DOM}(o))}{\mathrm{FACT}(p)} \qquad \text{(i)}$$

$$w(s,p)=\frac{\mathrm{MIN}(\mathrm{RNG}(s),\mathrm{RNG}(p))}{\mathrm{FACT}(p)} \qquad \text{(ii)}$$

[0072] Wherein function DOM(p) denotes the number of different subjects s with predicate p (the object can be any) in the RDF database; function DOM(o) denotes the number of different subjects s with object o (the predicate can be any) in the RDF database; function RNG(s) denotes the number of different objects o with subject s (the predicate can be any) in the RDF database; and RNG(p) denotes the number of different objects o with predicate p (the subject can be any) in the RDF database. FACT(p) has the same meaning as in the above embodiment. Similarly, for elementary pattern <?s :p ?o>, its weight is set as 1.
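
Formulas (i) and (ii) may be sketched in the same manner. As before, the in-memory list of tuples, the sample triples and the function name domain_range_weight are assumptions made only for this illustration, as is the weight of 0 returned for a predicate that does not occur in the database.

```python
from fractions import Fraction


def domain_range_weight(triple_pattern, triples):
    """Weight of a triple pattern from the Domain and Range statistics."""
    s, p, o = triple_pattern
    s_var, o_var = s.startswith("?"), o.startswith("?")
    if s_var and o_var:
        return Fraction(1)  # elementary pattern <?s :p ?o>
    fact_p = len({t for t in triples if t[1] == p})  # FACT(p)
    if fact_p == 0:
        return Fraction(0)  # assumption: predicate absent from the database
    if s_var:  # <?s :p :o>, formula (i)
        dom_p = len({t[0] for t in triples if t[1] == p})  # subjects with p
        dom_o = len({t[0] for t in triples if t[2] == o})  # subjects with o
        return Fraction(min(dom_p, dom_o), fact_p)
    if o_var:  # <:s :p ?o>, formula (ii)
        rng_s = len({t[2] for t in triples if t[0] == s})  # objects with s
        rng_p = len({t[2] for t in triples if t[1] == p})  # objects with p
        return Fraction(min(rng_s, rng_p), fact_p)
    raise ValueError("pattern form not covered by this sketch")


# Hypothetical triples
triples = [
    ("p1", "type", "Article"), ("p2", "type", "Article"),
    ("r1", "type", "Researcher"),
    ("p1", "author", "r1"), ("p2", "author", "r1"),
    ("p1", "year", "2010"),
]
print(domain_range_weight(("?s", "type", "Article"), triples))
print(domain_range_weight(("p1", "author", "?o"), triples))
```

For the first pattern, DOM(type) is 3, DOM(Article) is 2 and FACT(type) is 3, giving 2/3; for the second, RNG(p1) is 3, RNG(author) is 1 and FACT(author) is 2, giving 1/2.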

[0073] Although some statistical methods and weighting methods are illustrated for pattern analyzing unit 502, it is understood that those skilled in the art can optionally use other methods as long as the weight can reflect the effect of triple patterns on the access frequency of triples in the database in some aspect or in some degree.

[0074] After pattern analyzing unit 502 weights and converts the triple patterns in the query requests, frequency computing unit 503 computes the occurrence frequency of each elementary pattern based on the weighted elementary patterns.

[0075] In one example, frequency computing unit 503 considers the weighted elementary patterns involved in each query one by one, and obtains the occurrence frequency of each elementary pattern by summing up the weighting factor of the same elementary pattern.

[0076] In one embodiment, when query obtaining unit 501 obtains multiple queries, it firstly computes the occurrence frequency of each query. Then frequency computing unit 503 can compute the occurrence frequencies of the elementary patterns involved in the queries based on the occurrence frequency of each query.

[0077] Then frequency computing unit 503 sends the computed occurrence frequency of each elementary pattern to data prefetching unit 504. Data prefetching unit 504 chooses at least one elementary pattern based on the received occurrence frequency, and prefetches the RDF triples corresponding to the chosen at least one elementary pattern into the buffer.

[0078] In one embodiment, data prefetching unit 504 orders the received occurrence frequency of each elementary pattern, and simply chooses several elementary patterns with the highest occurrence frequency from the order. Then data prefetching unit 504 prefetches the corresponding RDF triples of the chosen elementary patterns into the buffer.

[0079] In some embodiments, data prefetching unit 504 also considers the size limit and the utilization ratio of the buffer. That is, data prefetching unit 504 chooses the elementary patterns, such that the size of the triples to be prefetched does not exceed the buffer size and the benefit of the buffer is optimized. The optimized benefit of the buffer means that the triples stored in the buffer are as many as possible and the access frequency of these triples is as high as possible, etc.

[0080] To achieve the above optimization target, in one embodiment, data prefetching unit 504 firstly orders the elementary patterns in a queue according to their occurrence frequency from high to low. For the elementary pattern with the highest occurrence frequency in the queue, it is judged whether the constraint on the buffer size would be met if this elementary pattern is chosen. If the constraint is met, it is determined to choose the elementary pattern and to judge the next elementary pattern in the queue continually. For a certain elementary pattern in the queue, if the buffer size constraint is not met, then the elementary pattern is ignored and the next elementary pattern in the queue is judged continually until the whole queue is checked.

[0081] For the above constrained optimization problem, various approaches have been proposed in the existing technique(s) to obtain the optimized solutions. Data prefetching unit 504 can adopt other proper approaches to choose elementary patterns in order to optimize the benefit of the buffer.

[0082] Thereby, the prefetching apparatus 500 can determine the occurrence frequency of each elementary pattern involved in SPARQL queries and prefetch the corresponding triples of some elementary patterns into buffer 1011 according to the occurrence frequency of elementary patterns. Thus, the frequently accessed data in the RDF database can be prefetched, which improves the subsequent query efficiency. The detailed embodiments correspond to those of the above prefetching method and are not described again.

[0083] Through the above description of the embodiments, those skilled in the art will recognize that the above method and apparatus for prefetching RDF triple data can be practiced by executable instructions and/or controlling codes in the processors, e.g. codes in media like disc, CD or DVD-ROM; memories like ROM or EPROM; and carriers like optical or electronic signal carriers. The apparatus and its units can be realized using hardware like VLSI or gate arrays, like semiconductors, e.g. logic chips, transistors, etc., or like programmable hardware equipment, e.g. FPGA, programmable logic equipment, etc.; or using software executed by different kinds of processors; or using a combination of the hardware and software. The software and program codes for implementing the invention can be written using object-oriented languages like Java, Smalltalk, C++, etc., and traditional procedural languages like the C language or other similar languages. The source code can be executed locally or remotely.

[0084] Thus, having reviewed the disclosure herein, the skilled artisan will appreciate that aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon. Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[0085] A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

[0086] Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

[0087] Distinct software modules for carrying out aspects of embodiments of the invention can be, in at least some cases, embodied on a computer readable storage medium. The distinct software modules may include, for example, any one, some, or all of the modules and/or sub-modules in FIGS. 4 and 5.

[0088] The means mentioned herein can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable medium (or multiple such media).

[0089] The above-described exemplary embodiments of the method and apparatus for prefetching RDF triple data are intended to be illustrative in all respects, rather than restrictive, of the present invention. Those skilled in the art should recognize that the present invention is capable of many variations and modifications within the scope and spirit of the present invention. The scope of the present invention is defined only by the appended claims.

[0090] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0091] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

* * * * *