U.S. patent application number 13/305116 was filed with the patent office on 2011-11-28 and published on 2012-05-31 for prefetching RDF triple data.
This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Yue Pan, Xing Zhi Sun, Qing Fa Wang, Shuo Wu, Lin Hao Xu.
Application Number | 20120136875 13/305116 |
Family ID | 46091887 |
Filed Date | 2011-11-28 |
United States Patent
Application |
20120136875 |
Kind Code |
A1 |
Pan; Yue ; et al. |
May 31, 2012 |
PREFETCHING RDF TRIPLE DATA
Abstract
Query requests for RDF triples are obtained, wherein the query
request(s) contain(s) at least one triple pattern; for each triple
pattern, the corresponding elementary pattern is determined, and
each triple pattern is converted to a weighted elementary pattern.
The occurrence frequency of each elementary pattern is computed
based on the weighted elementary patterns; at least one elementary
pattern is chosen at least according to the occurrence frequency;
and the RDF triples corresponding to the chosen at least one elementary
pattern are prefetched into the buffer. The corresponding apparatus
is also provided. With the above method and apparatus, the
frequently accessed RDF triples can be determined and prefetched
into the buffer, which improves the query efficiency.
Inventors: |
Pan; Yue; (Beijing, CN)
Sun; Xing Zhi; (Beijing, CN)
Wang; Qing Fa; (Beijing, CN)
Wu; Shuo; (Beijing, CN)
Xu; Lin Hao; (Beijing, CN) |
Assignee: |
INTERNATIONAL BUSINESS MACHINES
CORPORATION
Armonk
NY
|
Family ID: |
46091887 |
Appl. No.: |
13/305116 |
Filed: |
November 28, 2011 |
Current U.S.
Class: |
707/748 ;
707/E17.014 |
Current CPC
Class: |
G06F 16/953 20190101;
G06F 16/24578 20190101; G06F 16/24558 20190101; G06F 16/2458
20190101; G06F 16/24539 20190101; G06F 16/9574 20190101; G06F
16/2455 20190101 |
Class at
Publication: |
707/748 ;
707/E17.014 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Foreign Application Data
Date |
Code |
Application Number |
Nov 29, 2010 |
CN |
201010577037.2 |
Claims
1. A method for processing RDF triples, comprising: obtaining query
requests for RDF triples, wherein said query requests each contain
at least one triple pattern; determining elementary patterns
corresponding to said triple patterns; performing weighting with
respect to said corresponding elementary patterns to weighted
elementary patterns; computing occurrence frequency of said
elementary patterns based on the weighted elementary patterns; and
prefetching those of the RDF triples corresponding to said
elementary patterns into a buffer if the occurrence frequency of
said elementary patterns meets at least one predetermined
condition.
2. The method of claim 1, wherein obtaining the query requests for
the RDF triples comprises reading query request records from a
query log.
3. The method of claim 1, wherein said elementary patterns are in a
form of <?s :p ?o>.
4. The method of claim 1, wherein said weighting comprises setting
the weight of a certain triple pattern with respect to a
corresponding elementary pattern as a constant.
5. The method of claim 1, wherein said weighting comprises
determining weight of said at least one triple pattern with respect
to a corresponding one of said elementary patterns by referring to
statistical information in an RDF triple data storage system.
6. The method of claim 5, wherein said weighting comprises: setting
the weight w (p, o) of triple patterns in <?s :p :o> form
with respect to elementary patterns in <?s :p ?o> form as:
w(p,o)=Num(p,o)/FACT(p), setting the weight w (s, p) of triple
patterns in <:s :p ?o> form with respect to elementary
patterns in <?s :p ?o> form as: w(s,p)=Num(s,p)/FACT(p),
wherein Num(p,o) denotes the number of all triples with predicate p
and object o, Num(s,p) denotes the number of all triples with
predicate p and subject s, FACT(p) denotes the number of all
triples with predicate p in said RDF triple data storage
system.
7. The method of claim 5, wherein said weighting comprises: setting
the weight w (p, o) of triple patterns in <?s :p :o> form
with respect to elementary patterns in <?s :p ?o> form as:
w(p,o)=MIN(DOM(p),DOM(o))/FACT(p),
setting the weight w (s, p) of triple patterns in <:s :p ?o>
form with respect to elementary patterns in <?s :p ?o> form
as: w(s,p)=MIN(RNG(s),RNG(p))/FACT(p),
wherein DOM (p) denotes the number of different
subjects with predicate p; DOM (o) denotes the number of different
subjects with object o; RNG(s) denotes the number of different
objects with subject s; RNG(p) denotes the number of different
objects with predicate p; and FACT (p) denotes the number of all
triples with predicate p in said RDF triple data storage
system.
8. The method of claim 1, wherein computing the occurrence
frequency of said elementary patterns comprises summing up the
weights of the same elementary pattern as the occurrence frequency
of said elementary pattern.
9. The method of any one of claim 1, wherein computing the
occurrence frequency of said elementary patterns comprises,
computing the occurrence frequency of the elementary patterns
corresponding to the triple patterns contained in said query
requests based on the occurrence frequency of said query
requests.
10. The method of claim 1, wherein prefetching the RDF triples
corresponding to said elementary patterns into the buffer if the
occurrence frequency of said elementary patterns meets at least one
predetermined condition comprises making the total size of the RDF
triples corresponding to said elementary patterns not exceed the
buffer size, and making the occurrence frequency of said elementary
patterns as high as possible.
11. An apparatus for processing RDF triples, comprising: a query
obtaining unit, configured to obtain the query requests for RDF
triples, wherein said query requests contain at least one triple
pattern; a pattern analyzing unit, configured to convert said at
least one triple pattern to a weighted elementary pattern; a
frequency computing unit, configured to compute the occurrence
frequency of said elementary patterns based on the weighted
elementary patterns; and a data prefetching unit, configured to
prefetch the RDF triples corresponding to said elementary patterns
into the buffer if the occurrence frequency of said elementary
patterns meets certain condition; wherein each of said query
obtaining unit, said pattern analyzing unit, said frequency
computing unit, and said data prefetching unit comprises at least
one of: dedicated hardware; and software tangibly embodied in a
non-transitory storage medium, loaded into a hardware memory, and
executing on at least one hardware processor coupled to the
memory.
12. The apparatus of claim 11, wherein said query obtaining unit is
configured to read query request records from the query log.
13. The apparatus of claim 11, wherein said elementary patterns are
in the form of <?s :p ?o>.
14. The apparatus of claim 11, wherein said pattern analyzing unit
is configured to set the weight of certain triple patterns with
respect to corresponding elementary patterns as a constant.
15. The apparatus of claim 11, wherein said pattern analyzing unit
is configured to, determine the weight of said at least one triple
pattern with respect to a corresponding one of said elementary
patterns by referring to the statistical information in an RDF
triple data storage system.
16. The apparatus of claim 15, wherein said pattern analyzing unit
is configured to: set the weight w (p, o) of triple patterns in
<?s :p :o> form with respect to elementary patterns in <?s
:p ?o> form as: w(p,o)=Num(p,o)/FACT(p), and set the weight w
(s, p) of triple patterns in <:s :p ?o> form with respect to
elementary patterns in <?s :p ?o> form as:
w(s,p)=Num(s,p)/FACT(p), wherein Num(p,o) denotes the number of all
triples with predicate p and object o, Num(s,p) denotes the number
of all triples with predicate p and subject s, FACT(p) denotes the
number of all triples with predicate p in said RDF triple data
storage system.
17. The apparatus of claim 15, wherein said pattern analyzing unit
is configured to: set the weight w (p, o) of triple patterns in
<?s :p :o> form with respect to elementary patterns in <?s
:p ?o> form as: w(p,o)=MIN(DOM(p),DOM(o))/FACT(p),
set the weight w (s, p) of triple patterns in
<:s :p ?o> form with respect to elementary patterns in <?s
:p ?o> form as: w(s,p)=MIN(RNG(s),RNG(p))/FACT(p),
wherein DOM (p) denotes the number of different
subjects with predicate p; DOM (o) denotes the number of different
subjects with object o; RNG(s) denotes the number of different
objects with subject s; RNG(p) denotes the number of different
objects with predicate p; and FACT (p) denotes the number of all
triples with predicate p in said RDF triple data storage
system.
18. The apparatus of claim 11, wherein said frequency computing
unit is configured to sum up the weights of the same elementary
pattern as the occurrence frequency of said elementary pattern.
19. The apparatus of claim 11, wherein said frequency computing
unit is configured to compute the occurrence frequency of the
elementary patterns corresponding to the triple patterns contained
in said query requests based on the occurrence frequency of said
query requests.
20. The apparatus of claim 11, wherein said data prefetching unit
is configured to make the total size of the RDF triples
corresponding to said elementary patterns not exceed the buffer
size, and make the occurrence frequency of said elementary patterns
as high as possible.
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
[0001] This application claims foreign priority to P.R. China
Patent application 201010577037.2 filed 29 Nov. 2010, the complete
disclosure of which is expressly incorporated herein by reference
in its entirety for all purposes.
FIELD OF THE INVENTION
[0002] This invention relates to the storage and management of RDF
triple data, and more particularly relates to a method and an
apparatus for accelerating the query and read of RDF triple
data.
BACKGROUND OF THE INVENTION
[0003] RDF (Resource Description Framework) is a technical standard
of markup language published by W3C (World Wide Web Consortium) to
better describe and express the contents and the structure of Web
resources. Particularly, RDF can be specially used to express the
metadata about Web resources, such as the title, the author, the
update time of Web pages, the copyright and the license of Web
documents, the available schedule of some shared resources, and so
on. However, when "Web resources" are generalized, RDF can be used
to describe the information of anything that can be identified on
the Web. Along with the development of semantic-based web
description, RDF data are used more and more widely in various Web
related applications, so the management of RDF data becomes more
and more important.
[0004] Different from general relational data, RDF data are
expressed in triple form, including <subject, predicate,
object>. That is, RDF describes the relation between elements
using such triples. When these RDF triples are stored into a
storage system such as a database, usually they can be queried
using SPARQL recommended by W3C.
[0005] FIG. 1 illustrates the structure of the existing RDF data
storage and query system. System 100 comprises a database 101, a
data loader 102, a data access module 103 and a query engine 104.
Database 101 is configured to store RDF triple data. Specifically,
database 101 contains an IRI table and a triple table. The IRI
table is used to store the correspondence relation between the
internal ID or index and the IRI string in the data, while the
triple table stores triple data with their internal ID
representation. It is understood that such storage manner is
advantageous for compressed data storage, which saves storage
space. When new RDF data are inputted from outside, data loader 102
receives and parses the inputted RDF data and transforms it into
internal data models. For each IRI string in the internal data
models, data access module 103 assigns a unique internal ID for it,
and inserts or stores the correspondence relation between the ID
and the string in the above IRI table. Then, for each RDF triple in
the data models, data access module 103 inserts or stores its
internal ID representation into the above triple table. For the
above stored RDF triple data, when the data are queried, query
engine 104 receives the user's SPARQL request and translates it
into the corresponding standard SQL (Structured Query Language)
sentences. Data access module 103 retrieves the queried triples
from database 101 according to SQL sentences, and returns the
results to query engine 104.
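The IRI table and triple table described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class name TripleStore, its methods, and the sample identifiers :student1 and :person1 are hypothetical and do not come from the system of FIG. 1, which stores the tables in a database and queries them through SQL.

```python
class TripleStore:
    """Toy in-memory stand-in for the IRI table and the triple table."""

    def __init__(self):
        self.iri_to_id = {}   # IRI table: string -> internal ID
        self.id_to_iri = []   # reverse lookup: internal ID -> string
        self.triples = []     # triple table: (subject_id, predicate_id, object_id)

    def _intern(self, term):
        # Assign a new internal ID the first time a string is seen.
        if term not in self.iri_to_id:
            self.iri_to_id[term] = len(self.id_to_iri)
            self.id_to_iri.append(term)
        return self.iri_to_id[term]

    def load(self, s, p, o):
        # Store the triple by its internal ID representation only.
        self.triples.append((self._intern(s), self._intern(p), self._intern(o)))

    def decode(self, triple):
        # Map an ID triple back to its strings.
        return tuple(self.id_to_iri[i] for i in triple)


# Hypothetical sample data.
store = TripleStore()
store.load(":student1", ":hasName", '"Rose"')
store.load(":person1", ":hasName", '"Jack"')
```

Because the predicate string :hasName is stored once and referenced by ID in both triples, the triple table stays compact, which is the storage advantage mentioned above.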
[0006] The storage and query process of RDF data executed in the
above system 100 will be described in detail in connection with
specific examples. In one example, school course information is
stored in database 101 in RDF triple form. Suppose that a user
wants to know the list of names of the students who elect Jack's
course, then in query engine 104 the SPARQL query can be set
as:
SELECT ?name
WHERE {
  ?student :hasName ?name.        (1)
  ?student :takeCourse ?course.   (2)
  ?course :toughtBy ?person.      (3)
  ?person :hasName "Jack".        (4)
}
[0007] In the above SPARQL query, all values of "name" are
requested, wherein the sentences in WHERE{ } are the relations that
the "name" should satisfy. Concretely, this query contains 4
triple-form sentences (1)-(4), each of which is called a triple
pattern. It is understood that these sentences are numbered here
for description convenience, and such numbers don't exist in the
real query. Corresponding to RDF data, each triple pattern is also
expressed in the form of <subject, predicate, object>, but a
question mark can be added before at least one element of the
triple so as to set it as a variable to be queried. For example,
triple pattern (4) means that it is to query the variable person in
the case that the corresponding predicate is hasName and the object
is Jack in the triples; that is, the person whose name is Jack will
be retrieved. Then, via triple pattern (3), subject course will be
queried in the case that the corresponding predicate is toughtBy
and the object is the above retrieved person; that is, the course
taught by the person will be retrieved. In triple pattern (2), all
students who elect the course will be queried, and finally in
triple pattern (1), the names of the students are determined. Thus,
via the above triple patterns (1)-(4), taking person, course and
student as intermediate variables, the values of the queried name are
finally determined.
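The evaluation just described, in which each triple pattern binds variables that later patterns reuse, can be sketched as a naive nested-loop join. This is a minimal illustration, not how query engine 104 works (it translates SPARQL into SQL); the function names match and evaluate and the identifiers :s1, :c1 and :p1 are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant must equal the stored value exactly.
            return None
    return extended


def evaluate(patterns, triples):
    """Join the triple patterns one after another (naive nested loops)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical data; the identifiers :s1, :c1 and :p1 are made up.
triples = [
    (":s1", ":hasName", '"Rose"'),
    (":s1", ":takeCourse", ":c1"),
    (":c1", ":toughtBy", ":p1"),
    (":p1", ":hasName", '"Jack"'),
]
# Patterns (4), (3), (2), (1), in the order the text walks through them.
query = [
    ("?person", ":hasName", '"Jack"'),
    ("?course", ":toughtBy", "?person"),
    ("?student", ":takeCourse", "?course"),
    ("?student", ":hasName", "?name"),
]
names = [b["?name"] for b in evaluate(query, triples)]
```

Every pattern scans the full triple list here, which mirrors the repeated storage access that motivates prefetching.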
[0008] By executing the translated SQL query from the query engine
104, data access module 103 in FIG. 1 retrieves the query results
accordingly from database 101 and returns them to query engine 104.
In one example, the returned RDF triples are in the following
form:
Subject    Predicate     Object
Course     toughtBy      person
Student    takeCourse    course
Person     hasName       "Jack"
Student    hasName       "Rose"
[0009] Through the above triples, the result of the above-described
query can be obtained; that is, the name of the student who elects
Jack's course is Rose.
[0010] In the above query process, data access module 103
continually searches and retrieves data from database 101 according
to the query of each triple pattern. However, because there is a
large amount of data stored in database 101, the database is
usually realized using large capacity storage media, such as a
large capacity hard disk. Thus, continually searching and
retrieving data from the hard disk brings a high IO cost and
further influences the query efficiency and system performance.
[0011] To improve query efficiency, one solution adopted in
database systems is to prefetch a part of the data into a buffer
that is easy to access, for example the memory or the cache of a
computing system. Therefore, when the computing system queries or
accesses this part of the data, it can read data directly from the
buffer, thereby reducing IO cost. However, because the buffer size
is usually very limited, which data should be prefetched into the
buffer in order to optimize the query efficiency is an issue under
investigation. For the general relational data, various methods
have been proposed for prefetching a part of data in the existing
techniques. However, because of the special format of RDF data, the
existing techniques are not adapted to optimize RDF data query.
Therefore, a method and an apparatus are needed for selectively
prefetching a part of RDF data to the buffer so as to accelerate
and optimize RDF data query.
SUMMARY OF THE INVENTION
[0012] In view of the above-mentioned problems, embodiments of the
invention are provided to improve the query efficiency of RDF
data.
[0013] According to a first aspect of the invention, a method for
prefetching RDF triples from RDF triple data storage system is
provided, wherein each RDF triple contains subject, predicate and
object, the method comprises: obtaining the query requests for RDF
triples, wherein the query requests contain at least one triple
pattern; converting the at least one triple pattern to a weighted
elementary pattern; computing the occurrence frequency of the
elementary patterns based on the weighted elementary patterns; and
prefetching the RDF triples corresponding to the elementary
patterns into the buffer when the occurrence frequency of the
elementary patterns meets certain condition(s).
[0014] According to a second aspect of the invention, an apparatus
for prefetching RDF triples from RDF triple data storage system is
provided, wherein each RDF triple contains subject, predicate and
object, the apparatus comprises: a query obtaining unit, configured
to obtain the query requests for RDF triples, wherein the query
requests contain at least one triple pattern; a pattern analyzing
unit, configured to convert the at least one triple pattern to a
weighted elementary pattern; a frequency computing unit, configured
to compute the occurrence frequency of the elementary patterns
based on the weighted elementary patterns; and a data prefetching
unit, configured to prefetch the RDF triples corresponding to the
elementary patterns into the buffer when the occurrence frequency
of the elementary patterns meets certain condition(s).
[0015] With the method and the apparatus of one or more embodiments
of the invention, the query patterns with higher occurrence
frequency can be determined, thereby the RDF triples with higher
access frequency can be determined, and these triples can be
prefetched into the easy-to-access buffer. Then, in the later
queries, the frequently accessed RDF data can be read directly from
the buffer, which can reduce IO cost and improve query
efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1 illustrates the structure of an existing RDF data
storage and query system;
[0017] FIG. 2 is a flowchart of the method according to one
embodiment of the invention;
[0018] FIG. 3A illustrates some exemplary RDF triples stored in an
RDF database;
[0019] FIG. 3B illustrates some statistical results of the data
shown in FIG. 3A;
[0020] FIG. 4 illustrates the RDF data storage and query system
comprising a prefetching apparatus according to one embodiment of
the invention; and
[0021] FIG. 5 is a block diagram of the prefetching apparatus
according to one embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0022] The following is the description of the embodiments in
connection with the drawings. It is understood that the detailed
description is illustrative, rather than restrictive, to the scope
of the present invention.
[0023] FIG. 2 is a flowchart of the method according to one
embodiment of the invention, wherein the method is used to prefetch
a part of the RDF triples stored in an RDF data storage system to
the buffer. Specifically, the method comprises step 201, obtaining
the query requests for RDF triples, wherein the query requests
contain at least one triple pattern; step 202, for each of the
obtained at least one triple pattern, determining the corresponding
elementary pattern and performing weighting with respect to the
corresponding elementary pattern; step 203, computing the
occurrence frequency of each elementary pattern based on the
weighted elementary patterns; step 204, choosing at least one
elementary pattern at least according to the occurrence frequency;
and step 205, prefetching the RDF triples corresponding to the
chosen at least one elementary pattern into the buffer. Through the
above steps, the most frequently queried and accessed RDF triples
can be determined and thus prefetched into the buffer in one or
more embodiments of the invention, which improves the query
efficiency.
[0024] The following is the description of the above steps shown in
FIG. 2 in connection with the specific examples.
[0025] At step 201, the query requests for RDF triples are
obtained. In one embodiment, those query requests are obtained in
real time from the query engine. In another embodiment, the records
of those query requests are read from the system query log.
Optionally, a plurality of query requests, i.e. a query set, can be
obtained at one time. Typically, the search and query requests for
RDF data are SPARQL queries, each of which contains at least one
triple pattern, such as triple patterns (1)-(4) shown in the
prior art.
[0026] Then, at step 202, the obtained triple patterns are analyzed
and converted. Firstly, for each triple pattern, the corresponding
elementary pattern is determined. The elementary pattern is defined
mainly according to the data feature in the RDF triple data storage
system and the request feature of the data query. In one
embodiment, the elementary pattern is defined as the triple pattern
in which only the predicate is constant; that is, the triple
pattern in the form of <?subject, predicate, ?object>. If s
denotes the subject, p denotes the predicate, o denotes the object,
prefix `?` denotes a query variable, prefix `:` denotes a constant,
then the elementary pattern can be represented as <?s :p ?o>.
It is understood that the elementary pattern can be defined in
other forms, for example, the triple pattern in <:s ?p ?o>
form in which only the subject is constant, the triple pattern in
<?s ?p :o> form in which only the object is constant, the
triple pattern in <:s :p ?o> form with constant subject and
constant predicate, etc. Following is the description of the
embodiments in connection with elementary patterns in <?s :p
?o> form. Those skilled in the art can understand that the
embodiments of this invention are also applicable for other
elementary patterns.
[0027] The strength of defining elementary patterns as <?s :p
?o> and classifying the triple patterns and the triple data
based on the predicate is that the number of different predicates
of RDF triples stored in the RDF database is much less than the
number of RDF triples themselves. For example, in the RDF dataset
of Wikipedia, the number of RDF triples is about 136.9 million, but
the number of the referred predicates is only 927. What is more, in
all possible triple patterns, triple patterns with constant
predicate <?s :p :o>, <?s :p ?o> and <:s :p ?o>
are the most common triple patterns, while triple patterns <?s
?p :o>, <:s ?p :o> and <:s ?p ?o>, which query the
predicate, are seldom used, and <?s ?p ?o>, which queries all
elements, is used even less. Currently, the triple patterns
contained in the standard test set of SPARQL are mostly the above
most common triple patterns with constant predicate.
[0028] For the above-mentioned common triple patterns <?s :p
:o>, <?s :p ?o> and <:s :p ?o>, it can be seen that
<?s :p ?o> itself is an elementary pattern, while <?s :p
:o> and <:s :p ?o> query only the subject or only the object,
so their query results must be a subset of the query results
of the elementary pattern <?s :p ?o> with the same predicate.
Therefore, each triple pattern contained in SPARQL queries whose
predicate is a constant can be mapped to an elementary pattern
as defined above. Accordingly, the step of determining the elementary
pattern corresponding to a triple pattern is to determine the
elementary pattern having the same predicate as the
triple pattern.
[0029] In the illustrated triple patterns (1)-(4), triple
patterns (1)-(3) are all triple patterns that have a constant
predicate and query the subject and the object, i.e. elementary
patterns. In triple pattern (4), object `Jack` is also a constant
besides constant predicate `hasName`, so it is not an elementary
pattern. Then, it can be determined that the corresponding
elementary pattern is elementary pattern <?s :hasName ?o>
with the same predicate.
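The mapping from a triple pattern to its elementary pattern can be sketched as follows. This is a minimal illustration assuming that patterns are represented as 3-tuples of strings in which variables start with `?`; the function names is_variable and elementary_pattern are hypothetical.

```python
def is_variable(term):
    """A term is a query variable when it carries the `?` prefix."""
    return term.startswith("?")


def elementary_pattern(triple_pattern):
    """Return <?s :p ?o> for a pattern with a constant predicate, else None."""
    _, predicate, _ = triple_pattern
    if is_variable(predicate):
        # Patterns that query the predicate are not covered by this mapping.
        return None
    return ("?s", predicate, "?o")


# Triple pattern (4) maps to the elementary pattern with predicate :hasName.
mapped = elementary_pattern(("?person", ":hasName", '"Jack"'))
```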
[0030] After the corresponding elementary pattern of each triple
pattern is determined, the triple pattern is weighted with respect
to the corresponding elementary pattern in terms of occurrence
frequency, so as to convert it to the weighted elementary pattern.
This is because an elementary pattern defines only the predicate, so
its query results include all triples with the specified predicate,
or, in other words, the complete set of the specified predicate.
Therefore, the query of an elementary pattern will result in the
accessing and the retrieving of the complete set of the specified
predicate, while in the triple patterns that are not elementary
patterns, the subject or the object is also defined and the query
results are a part of the complete set of the specified predicate.
That is, a triple pattern whose predicate is a constant but which is
not an elementary pattern accesses only a part of the data accessed by
its corresponding elementary pattern. Then, to evaluate the
contribution of each triple pattern to the accessing frequency on
the triple data, the non-elementary patterns should be discounted
with respect to elementary patterns in terms of occurrence times;
that is, they should be weighted.
[0031] In one embodiment, it is simply defined that the weight of
non-elementary patterns is 0.5 compared with corresponding
elementary patterns. Then, triple patterns (1)-(4) can be
converted as:
<?s :hasName ?o>        (1')
<?s :takeCourse ?o>     (2')
<?s :toughtBy ?o>       (3')
<?s :hasName ?o>*0.5    (4')
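The constant-weight conversion above can be sketched in a few lines. This is an illustration only: the function name to_weighted_elementary and the tuple representation are hypothetical, and treating a pattern with both subject and object constant as discounted is an assumption the text does not address.

```python
def to_weighted_elementary(triple_pattern, discount=0.5):
    """Return (elementary pattern, weight) for a pattern with a constant predicate."""
    s, p, o = triple_pattern
    if p.startswith("?"):
        raise ValueError("this sketch only handles constant predicates")
    # An elementary pattern leaves both subject and object as variables.
    is_elementary = s.startswith("?") and o.startswith("?")
    return ("?s", p, "?o"), (1.0 if is_elementary else discount)


# Triple patterns (1)-(4) of the course query.
query = [
    ("?student", ":hasName", "?name"),
    ("?student", ":takeCourse", "?course"),
    ("?course", ":toughtBy", "?person"),
    ("?person", ":hasName", '"Jack"'),
]
weighted = [to_weighted_elementary(tp) for tp in query]
```

The last entry of `weighted` corresponds to (4'), the only pattern that is discounted.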
[0032] In some embodiments, the triple patterns are weighted by
referring to the statistical information of the RDF database.
[0033] Specifically, in one embodiment, for triple pattern <?s
:p :o>, Num (p, o) is defined as the number of triples in the
RDF database with predicate p and object o, and FACT (p) is defined
as the number of all triples with predicate p, i.e., the number of
different <s,o> pairs. Then, the weight w (p, o) of triple
pattern <?s :p :o> can be defined as:
w(p,o)=Num(p,o)/FACT(p)
[0034] Accordingly, for triple pattern <:s :p ?o>, Num (s, p)
is defined as the number of triples in the RDF database with
predicate p and subject s, and the weight w (s, p) of triple
pattern <:s :p ?o> can be defined as:
w(s,p)=Num(s,p)/FACT(p)
[0035] For triple pattern <?s :p ?o>, since it is an
elementary pattern, its weight is set as 1. Thereby, those three
triple patterns contained in the SPARQL queries have been
weighted.
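The Num/FACT weighting can be sketched as below, under the assumption that the triples fit in a Python list; in the described system these counts would come from statistics kept by the RDF triple data storage system. The helper names build_counts and weight and the sample triples are hypothetical, and returning 0 for a predicate that never occurs is an added assumption.

```python
from collections import Counter
from fractions import Fraction


def build_counts(triples):
    """Count FACT(p), Num(p,o) and Num(s,p) over a list of (s, p, o) triples."""
    fact = Counter(p for _, p, _ in triples)
    num_po = Counter((p, o) for _, p, o in triples)
    num_sp = Counter((s, p) for s, p, _ in triples)
    return fact, num_po, num_sp


def weight(pattern, fact, num_po, num_sp):
    """Weight of a constant-predicate pattern relative to <?s :p ?o>."""
    s, p, o = pattern
    s_var, o_var = s.startswith("?"), o.startswith("?")
    if s_var and o_var:
        return Fraction(1)                       # elementary pattern
    if fact[p] == 0:
        return Fraction(0)                       # assumption: unseen predicate
    if s_var:
        return Fraction(num_po[(p, o)], fact[p])  # w(p,o) = Num(p,o)/FACT(p)
    if o_var:
        return Fraction(num_sp[(s, p)], fact[p])  # w(s,p) = Num(s,p)/FACT(p)
    raise ValueError("patterns with constant subject and object are not covered")


# Hypothetical sample data.
triples = [
    ("a", "type", "Article"),
    ("b", "type", "Article"),
    ("c", "type", "Book"),
    ("a", "author", "x"),
]
fact, num_po, num_sp = build_counts(triples)
w_po = weight(("?s", "type", "Article"), fact, num_po, num_sp)
w_sp = weight(("a", "type", "?o"), fact, num_po, num_sp)
```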
[0036] In other embodiments, more statistical information of the
RDF database can be considered. In one embodiment, Domain
statistics and Range statistics of the triples in the RDF database
are defined, wherein Domain statistics are used to compute the
subject number and Range statistics are used to compute the object
number.
[0037] Specifically, function DOM (p) is defined to denote the
number of different subjects s with constant predicate p (the
object can be any) in the RDF database; function DOM (o) is defined
to denote the number of different subjects s with constant object o
(the predicate can be any) in the RDF database.
[0038] Function RNG(s) is defined to denote the number of different
objects o with constant subject s (the predicate can be any) in the
RDF database; RNG(p) is defined to denote the number of different
objects o with constant predicate p (the subject can be any) in the
RDF database.
[0039] Furthermore, FACT (p) defined in the above embodiment is
used to denote the number of different triples with predicate p,
i.e., the number of different <s,o> pairs.
[0040] Based on the above statistics, the weight w (p, o) of triple
pattern <?s :p :o> can be defined as:
w(p,o) = MIN(DOM(p), DOM(o)) / FACT(p)    (i)
[0041] For triple pattern <:s :p ?o>, its weight w(s,p) can
be defined as:
w(s,p) = MIN(RNG(s), RNG(p)) / FACT(p)    (ii)
[0042] In the same way, for elementary pattern <?s :p ?o>,
its weight is set as 1.
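Formulas (i) and (ii) can be written as two small functions. This is a sketch under stated assumptions: the statistics are passed in as plain dictionaries, and the function names weight_po and weight_sp are hypothetical. The values for predicate type (DOM=10, RNG=6, FACT=11) are the ones this description quotes for FIG. 3A; DOM(Article)=1 is inferred from the stated weight 1/11, and RNG(pub1)=2 is made up purely for illustration.

```python
from fractions import Fraction


def weight_po(p, o, dom_p, dom_o, fact):
    """Formula (i): weight of <?s :p :o> relative to <?s :p ?o>."""
    return Fraction(min(dom_p[p], dom_o[o]), fact[p])


def weight_sp(s, p, rng_s, rng_p, fact):
    """Formula (ii): weight of <:s :p ?o> relative to <?s :p ?o>."""
    return Fraction(min(rng_s[s], rng_p[p]), fact[p])


dom_p = {"type": 10}     # DOM(type), from the description
rng_p = {"type": 6}      # RNG(type), from the description
fact = {"type": 11}      # FACT(type), from the description
dom_o = {"Article": 1}   # assumption, inferred from the weight 1/11
rng_s = {"pub1": 2}      # hypothetical subject and value

w_type_article = weight_po("type", "Article", dom_p, dom_o, fact)
w_pub1_type = weight_sp("pub1", "type", rng_s, rng_p, fact)
```

Taking the minimum gives an upper bound on how many triples the restricted pattern can touch without counting them exactly, which is presumably why only per-term statistics are needed.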
[0043] In connection with one example, following is the description
of the process of weighting and converting triple patterns
according to the above-mentioned embodiment. FIG. 3A illustrates
some exemplary RDF triples stored in an RDF database, and FIG. 3B
illustrates some statistical results of the data shown in FIG. 3A.
In the triples shown in FIG. 3A, taking predicate "type" as an
example, it can be seen that the number of different subjects with
predicate "type" is 10, i.e. DOM (type)=10; the number of different
objects with predicate "type" is 6, i.e. RNG (type)=6; and the
number of the triples with predicate "type" is 11, i.e. FACT
(type)=11. Other predicates and functions can be analyzed similarly
and thus the statistical results shown in FIG. 3B can be obtained.
These statistical results, as auxiliary storage information for
further use, can be pre-stored in certain areas of the database and
updated periodically, or updated when new data are received in the
database.
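The Domain, Range and FACT statistics can be gathered in one pass over the triples, as in the following sketch. The function name compute_statistics, the returned dictionary layout, and the sample triples are hypothetical; the sample is not the data of FIG. 3A.

```python
from collections import defaultdict


def compute_statistics(triples):
    """Collect DOM, RNG and FACT counts from a list of (s, p, o) triples."""
    subjects_of_pred = defaultdict(set)
    subjects_of_obj = defaultdict(set)
    objects_of_subj = defaultdict(set)
    objects_of_pred = defaultdict(set)
    pairs_of_pred = defaultdict(set)
    for s, p, o in triples:
        subjects_of_pred[p].add(s)
        subjects_of_obj[o].add(s)
        objects_of_subj[s].add(o)
        objects_of_pred[p].add(o)
        pairs_of_pred[p].add((s, o))

    def sizes(index):
        return {key: len(values) for key, values in index.items()}

    return {
        "DOM_p": sizes(subjects_of_pred),  # distinct subjects per predicate
        "DOM_o": sizes(subjects_of_obj),   # distinct subjects per object
        "RNG_s": sizes(objects_of_subj),   # distinct objects per subject
        "RNG_p": sizes(objects_of_pred),   # distinct objects per predicate
        "FACT": sizes(pairs_of_pred),      # distinct <s,o> pairs per predicate
    }


# Hypothetical sample data.
triples = [
    ("pub1", "type", "Article"),
    ("pub2", "type", "Article"),
    ("r1", "type", "Researcher"),
    ("pub1", "author", "r1"),
    ("pub2", "author", "r1"),
]
stats = compute_statistics(triples)
```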
[0044] Suppose the first SPARQL query for the data in FIG. 3A is
defined as:
SELECT ?publication
WHERE {
  ?publication type Article          (11)
  ?publication author ?researcher    (12)
  ?researcher workAt ?university     (13)
  ?university name NUS               (14)
}
[0045] Here, triple patterns (12) and (13) are elementary
patterns whose weight is 1; the corresponding elementary pattern of
triple pattern (11) is <?s type ?o>, and the corresponding
elementary pattern of triple pattern (14) is <?s name ?o>.
Substituting the statistical results in FIG. 3B into formulas (i)
and (ii), it can be obtained that the weights of triple patterns
(11) and (14) are 1/11 and 1/8 respectively. Therefore, the first
query can be converted into:
<?s type ?o>*1/11    (11')
<?s author ?o>       (12')
<?s workAt ?o>       (13')
<?s name ?o>*1/8     (14')
[0046] Similarly, suppose the second SPARQL query is:
SELECT ?publication
WHERE {
  ?researcher supervise ?student        (21)
  ?researcher name "Ooi Beng Chin"      (22)
  ?publication author ?student          (23)
}
[0047] According to the above process, the second query can be
converted into:
<?s supervise ?o>    (21')
<?s name ?o>*1/8     (22')
<?s author ?o>       (23')
[0048] Although some statistical methods and weighting methods are
illustrated in the above embodiments, it is understood that those
skilled in the art can modify the above methods or use other
methods after reading this description. Any method for weighting
the triple patterns, as long as it can reflect the effect of triple
patterns on the access frequency of triples in the database in some
aspect or in some degree, can be adopted for embodiments of the
invention.
[0049] Moreover, the above embodiments are all described in
connection with elementary patterns in the form of <?s :p
?o>. For other forms of elementary patterns, the corresponding
weighting method can be adopted according to need to convert triple
patterns to weighted elementary patterns, in order to reflect the
effect of the triple patterns on triple data access frequency.
[0050] After weighting and converting the triple patterns in the
queries, step 203 in FIG. 2 computes the occurrence frequency of
each elementary pattern based on the weighted elementary
patterns.
[0051] For example, for the weighted elementary patterns (11') to
(14') of the above first query and (21') to (23') of the second
query, summing the weight factors of the same elementary pattern
gives the occurrence frequency of each elementary pattern.
Specifically, the occurrence frequency of <?s type ?o> is 1/11, the
occurrence frequency of <?s author ?o> is 2 (1+1), the occurrence
frequency of <?s workAt ?o> is 1, the occurrence frequency of
<?s name ?o> is 1/4 (1/8+1/8), and the occurrence frequency of
<?s supervise ?o> is 1.
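The summation above can be checked with a short sketch. The Python
below is illustrative only: the list `weighted_patterns` and the
function name `occurrence_frequencies` are assumptions introduced
here, each elementary pattern <?s :p ?o> is identified by its
predicate, and exact fractions are used so that the two 1/8 weights
of <?s name ?o> add up to exactly 1/4.

```python
from collections import defaultdict
from fractions import Fraction

# Weighted elementary patterns (11')-(14') and (21')-(23') of the two
# example queries, written as (predicate, weight) pairs.
weighted_patterns = [
    ("type", Fraction(1, 11)),   # (11')
    ("author", Fraction(1)),     # (12')
    ("workAt", Fraction(1)),     # (13')
    ("name", Fraction(1, 8)),    # (14')
    ("supervise", Fraction(1)),  # (21')
    ("name", Fraction(1, 8)),    # (22')
    ("author", Fraction(1)),     # (23')
]


def occurrence_frequencies(patterns):
    """Sum the weights of identical elementary patterns."""
    freq = defaultdict(Fraction)  # a missing key starts at Fraction(0)
    for predicate, weight in patterns:
        freq[predicate] += weight
    return dict(freq)


frequencies = occurrence_frequencies(weighted_patterns)
for predicate, value in frequencies.items():
    print(f"<?s {predicate} ?o>: {value}")
```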
[0052] In one embodiment, for a plurality of queries, the
occurrence frequency of each query is computed first, and the
occurrence frequencies of the elementary patterns involved in the
queries are then computed based on the occurrence frequency of each
query. In a specific example, a query set Q is obtained, which
contains a plurality of different queries, i.e.
$Q = \{q_1, q_2, \ldots, q_m\}$. Suppose the occurrence frequency
of a query $q_i$ is $f(q_i)$. For each query $q_i$, the
corresponding elementary pattern p and the corresponding weight
$w_{p,q_i}$ can be determined as mentioned above. Then, the
occurrence frequency of elementary pattern p involved in query
$q_i$ can be represented as $f(q_i) \times w_{p,q_i}$. For the
above set Q, the occurrence frequency f(p) of elementary pattern p
can be represented as:
$$f(p) = \sum_{q_i \in Q'} f(q_i) \times w_{p,q_i}$$
wherein Q' denotes the set of queries involving elementary pattern
p.
[0053] Thereby, the occurrence frequency of each elementary pattern
can be determined.
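As a sketch of the query-frequency-weighted sum, the Python below
multiplies each query's occurrence frequency by the weight of every
elementary pattern it involves and accumulates the products per
pattern. The query identifiers `q1` and `q2`, the query counts 3 and
2, and the function name `pattern_frequencies` are hypothetical
values chosen for illustration; the weights are those of the two
example queries.

```python
from collections import defaultdict
from fractions import Fraction


def pattern_frequencies(query_freq, weights):
    """Accumulate f(q_i) * w_{p,q_i} over the queries involving p.

    query_freq: {query id: occurrence frequency f(q_i)}
    weights:    {query id: {predicate of p: weight w_{p,q_i}}}
    """
    f = defaultdict(Fraction)
    for query, fq in query_freq.items():
        for predicate, w in weights.get(query, {}).items():
            f[predicate] += fq * w
    return dict(f)


# Hypothetical counts: the first query was seen 3 times, the second twice.
query_freq = {"q1": 3, "q2": 2}
weights = {
    "q1": {"type": Fraction(1, 11), "author": 1, "workAt": 1,
           "name": Fraction(1, 8)},
    "q2": {"supervise": 1, "name": Fraction(1, 8), "author": 1},
}

f = pattern_frequencies(query_freq, weights)
for predicate, value in f.items():
    print(f"f(<?s {predicate} ?o>) = {value}")
```

Under these assumed counts, <?s author ?o> accumulates 3 + 2 = 5
and <?s name ?o> accumulates 3/8 + 2/8 = 5/8.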
[0054] Based on the above computed occurrence frequency, at least
one elementary pattern is chosen at step 204, and at step 205, RDF
triples corresponding to the at least one elementary pattern are
prefetched into the buffer. Generally, the above chosen elementary
patterns are the elementary patterns with higher occurrence
frequency. Since these elementary patterns have higher occurrence
frequency in the queries, accordingly, their corresponding RDF
triples have higher access frequency in the RDF database, and thus,
prefetching these RDF triples into the buffer will improve the
query speed.
[0055] In one embodiment, the obtained occurrence frequencies of
the elementary patterns are ordered, and several elementary
patterns with the highest occurrence frequency are simply chosen
from the order. The RDF triples corresponding to the chosen
elementary patterns are prefetched into the buffer.
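Choosing the elementary patterns with the highest occurrence
frequency can be sketched as below. This is only one possible
realization: the function name `top_k_patterns` and the parameter
`k` (how many patterns to keep) are assumptions, and the
frequencies are those of the two example queries.

```python
import heapq
from fractions import Fraction


def top_k_patterns(frequencies, k):
    """Return the k predicates with the highest occurrence frequency."""
    return heapq.nlargest(k, frequencies, key=frequencies.get)


frequencies = {
    "type": Fraction(1, 11),
    "author": Fraction(2),
    "workAt": Fraction(1),
    "name": Fraction(1, 4),
    "supervise": Fraction(1),
}

top = top_k_patterns(frequencies, 3)
print(top)
```

The triples matching each chosen pattern would then be loaded into
the buffer; how that load is issued depends on the database and is
not shown.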
[0056] In some embodiments, the capacity limit and the utilization
ratio of the buffer are also taken into consideration. That is, it
is expected that the RDF triples corresponding to elementary
patterns with higher occurrence frequency are chosen while the
total size of these triples does not exceed the buffer size and at
the same time the benefit of the buffer is optimized. The optimized
benefit of the buffer means that the triples stored in the buffer
are as many as possible and the access frequency of these triples
is as high as possible, etc.
[0057] This target can be generalized as a constrained
optimization problem in mathematics. If M is the buffer size,
$\mathrm{size}(p_i)$ is the size of the triples in the RDF database
corresponding to elementary pattern $p_i$, and $a_i$ is the
choosing factor of elementary pattern $p_i$, i.e. $a_i$ is 0 or 1,
then for n elementary patterns, they should meet the constraint:
$$\sum_{i=1}^{n} a_i \times \mathrm{size}(p_i) \leq M \qquad \text{(iii)}$$
[0058] Meanwhile, the benefit function is defined as:
$$B = \sum_{i=1}^{n} a_i \times \mathrm{size}(p_i) \times f(p_i).$$
[0059] Thus, the above problem can be represented as how to
determine the value of each choosing factor $a_i$ so as to maximize
the benefit function B while meeting constraint (iii).
[0060] One common method for solving the above optimization problem
is to first order the elementary patterns in a queue according to
their occurrence frequency from high to low. For the elementary
pattern with the highest occurrence frequency in the queue, its
choosing factor is tentatively supposed to be 1, and it is judged
whether constraint (iii) is then met. If the constraint is met, the
choosing factor is set as 1; that is, the elementary pattern is
chosen, and the next elementary pattern in the queue is judged in
the same way. If a certain elementary pattern in the queue does not
meet constraint (iii), the elementary pattern is ignored; that is,
its choosing factor is set as 0, and the next elementary pattern in
the queue is judged. This continues until the whole queue has been
checked.
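The queue-based procedure above can be sketched as follows. The
function name `greedy_choose`, the tuple layout (name, frequency,
size), the triple sizes, and the buffer size of 100 are all
hypothetical values for illustration; only the frequencies come
from the two example queries.

```python
from fractions import Fraction


def greedy_choose(patterns, buffer_size):
    """Walk the patterns from highest to lowest occurrence frequency and
    choose each one whose triples still fit in the remaining buffer.

    patterns: list of (name, frequency, size) tuples.
    Returns (chosen names, total size used).
    """
    chosen, used = [], 0
    queue = sorted(patterns, key=lambda t: t[1], reverse=True)
    for name, _freq, size in queue:
        if used + size <= buffer_size:   # constraint (iii) still met
            chosen.append(name)          # choosing factor a_i = 1
            used += size
        # otherwise a_i = 0: ignore this pattern and keep scanning
    return chosen, used


# Hypothetical triple sizes for each elementary pattern.
patterns = [
    ("type", Fraction(1, 11), 60),
    ("author", Fraction(2), 50),
    ("workAt", Fraction(1), 40),
    ("name", Fraction(1, 4), 10),
    ("supervise", Fraction(1), 30),
]

chosen, used = greedy_choose(patterns, buffer_size=100)
print(chosen, used)
```

With these assumed sizes, <?s supervise ?o> is skipped because it
would exceed the buffer, while the smaller <?s name ?o> further
down the queue still fits. This shows why the scan continues after
a pattern is ignored.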
[0061] For the above constrained optimization problem, various
approaches have been proposed in the existing techniques to obtain
optimized solutions, and they are not described here. It is
understood that those skilled in the art can adopt a proper
approach to choose elementary patterns according to needs in order
to optimize the benefit of the buffer.
[0062] As described above, by determining the occurrence frequency
of each elementary pattern involved in SPARQL queries, and
prefetching the corresponding triples of some elementary patterns
into the buffer according to the occurrence frequency of these
elementary patterns, the frequently accessed data in the RDF
database can be stored in the buffer in advance. Thus, later
queries are likely to read the data directly from the buffer, which
reduces IO cost and improves the query efficiency of RDF
triples.
[0063] Based on the same inventive concept, this invention also
provides an apparatus for prefetching RDF triple data.
Advantageously, this apparatus is constructed on the existing RDF
data storage and query system shown in FIG. 1 as far as possible,
with the existing architecture modified as little as possible.
Therefore, one or more embodiments of the invention propose to add
a prefetching apparatus to the existing RDF data storage and query
system, in order to analyze and choose the triples with higher
access frequency and prefetch them into the buffer.
[0064] Specifically, FIG. 4 illustrates an RDF data storage and
query system comprising the prefetching apparatus according to one
embodiment of the invention. Compared with the system in FIG. 1,
the system in FIG. 4 additionally comprises a prefetching apparatus
500, which communicates with database 101, in order to prefetch the
frequently queried triples into buffer 1011. Optionally, the
prefetching apparatus 500 also connects with data loader 102 and/or
query engine 104 in order to obtain the information about data
storage and query.
[0065] FIG. 5 is a block diagram of the prefetching apparatus
according to one embodiment of the invention. As shown in the
figure, the prefetching apparatus 500 comprises a query obtaining
unit 501, configured to obtain the query requests for RDF triples,
wherein the query requests contain at least one triple pattern; a
pattern analyzing unit 502, configured to, for each of the obtained
at least one triple pattern, determine the corresponding elementary
pattern and perform weighting with respect to the corresponding
elementary pattern; a frequency computing unit 503, configured to
compute the occurrence frequency of each elementary pattern based
on the weighted elementary patterns; and a data prefetching unit
504, configured to choose at least one elementary pattern at least
according to the occurrence frequency, and prefetch the RDF triples
corresponding to the chosen at least one elementary pattern into
the buffer.
[0066] Specifically, the query obtaining unit 501 obtains the query
requests for RDF triples. In one embodiment, the query obtaining
unit 501 connects with query engine 104, to acquire the query
requests in real time. In another embodiment, the query obtaining
unit 501 reads the query records from the system log. Optionally,
multiple query requests, i.e. a query set, can be obtained at one
time. For SPARQL queries for RDF data, each query contains at least
one triple pattern. The query obtaining unit 501 sends the obtained
queries and the contained triple patterns to pattern analyzing unit
502.
[0067] Pattern analyzing unit 502 analyzes and converts the
received triple patterns. Firstly, for each triple pattern, pattern
analyzing unit 502 determines the corresponding elementary pattern;
that is, it determines the elementary pattern <?s :p ?o> having
the same predicate as the triple pattern.
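A minimal sketch of this mapping is given below. It assumes, purely
for illustration, that a triple pattern is represented as a 3-tuple
of strings in which variables start with "?"; the function names
`is_variable`, `elementary_pattern` and `is_elementary` are likewise
hypothetical. Patterns with a variable predicate are rejected, since
only elementary patterns of the form <?s :p ?o> are considered here.

```python
def is_variable(term):
    """A term is treated as a variable if it starts with '?'."""
    return term.startswith("?")


def elementary_pattern(triple_pattern):
    """Return the elementary pattern <?s :p ?o> sharing the predicate."""
    _s, p, _o = triple_pattern
    if is_variable(p):
        raise ValueError("a constant predicate is assumed in this sketch")
    return ("?s", p, "?o")


def is_elementary(triple_pattern):
    """True if subject and object are variables and the predicate is not."""
    s, p, o = triple_pattern
    return is_variable(s) and is_variable(o) and not is_variable(p)


# Triple patterns (13) and (14) of the first example query
p13 = ("?researcher", "workAt", "?university")
p14 = ("?university", "name", "NUS")

print(elementary_pattern(p13), is_elementary(p13))
print(elementary_pattern(p14), is_elementary(p14))
```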
[0068] After determining the corresponding elementary pattern of
each triple pattern, pattern analyzing unit 502 weights the triple
pattern with respect to the corresponding elementary pattern in
terms of occurrence times and thus converts triple patterns to
weighted elementary patterns.
[0069] In one embodiment, pattern analyzing unit 502 simply sets
the weight of non-elementary patterns compared with the elementary
patterns as a fixed value, for example 0.5. In other embodiments,
pattern analyzing unit 502 further connects with database 101
and/or data loader 102, in order to weight triple patterns by
referring to the statistical information in the RDF database.
[0070] Specifically, in one embodiment, pattern analyzing unit 502
computes the weight of triple pattern <?s :p :o> using the
formula w(p,o)=Num(p,o)/FACT(p), and computes the weight of triple
pattern <:s :p ?o> using formula w(s,p)=Num(s,p)/FACT(p),
wherein Num (p, o) denotes the number of different triples with
predicate p and object o in RDF database; Num (s, p) denotes the
number of different triples with predicate p and subject s in RDF
database; FACT (p) denotes the number of all triples with predicate
p. For elementary pattern <?s :p ?o>, pattern analyzing unit
502 sets its weight as 1.
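The count-based weights can be sketched over a small in-memory
list of triples. Everything in the example is an assumption made
for illustration: the toy triples, the tuple representation with
"?"-prefixed variables, the function name `count_based_weight`, and
the choice to return 0 when no triple has the predicate (a case the
formulas leave undefined).

```python
from fractions import Fraction


def is_variable(term):
    return term.startswith("?")


def count_based_weight(triples, pattern):
    """w(p,o) = Num(p,o)/FACT(p) or w(s,p) = Num(s,p)/FACT(p)."""
    s, p, o = pattern
    distinct = set(triples)                       # count different triples
    if is_variable(s) and is_variable(o):
        return Fraction(1)                        # elementary pattern
    if not is_variable(s) and not is_variable(o):
        raise ValueError("pattern form not covered by this sketch")
    fact_p = sum(1 for t in distinct if t[1] == p)            # FACT(p)
    if fact_p == 0:
        return Fraction(0)                        # assumption: no such triples
    if is_variable(s):                            # <?s :p :o>
        num = sum(1 for t in distinct if t[1] == p and t[2] == o)
    else:                                         # <:s :p ?o>
        num = sum(1 for t in distinct if t[0] == s and t[1] == p)
    return Fraction(num, fact_p)


# Toy data, not taken from FIG. 3B
triples = [
    ("u1", "name", "NUS"),
    ("u2", "name", "MIT"),
    ("r1", "name", "Ooi Beng Chin"),
    ("r2", "name", "Alice"),
    ("r1", "workAt", "u1"),
]

w_po = count_based_weight(triples, ("?university", "name", "NUS"))
w_sp = count_based_weight(triples, ("r1", "name", "?n"))
w_el = count_based_weight(triples, ("?s", "name", "?o"))
print(w_po, w_sp, w_el)
```

In the toy data four triples have the predicate name and one of
them has the object NUS, so the pattern <?s name NUS> receives
weight 1/4.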
[0071] In other embodiments, pattern analyzing unit 502 further
considers more statistical information of the RDF database. In one
example, pattern analyzing unit 502 considers Domain statistics and
Range statistics of triple data in the RDF database. Specifically,
pattern analyzing unit 502 computes the weight w(p, o) of triple
pattern <?s :p :o> using formula (i) and computes the weight
w(s, p) of triple pattern <:s :p ?o> using formula (ii):
$$w(p,o) = \frac{\mathrm{MIN}(\mathrm{DOM}(p), \mathrm{DOM}(o))}{\mathrm{FACT}(p)} \qquad \text{(i)}$$
$$w(s,p) = \frac{\mathrm{MIN}(\mathrm{RNG}(s), \mathrm{RNG}(p))}{\mathrm{FACT}(p)} \qquad \text{(ii)}$$
[0072] Wherein function DOM(p) denotes the number of different
subjects s with predicate p (the object can be any) in the RDF
database; function DOM(o) denotes the number of different subjects
s with object o (the predicate can be any) in the RDF database;
function RNG(s) denotes the number of different objects o with
subject s (the predicate can be any) in the RDF database; and
function RNG(p) denotes the number of different objects o with
predicate p (the subject can be any) in the RDF database. FACT(p)
has the same meaning as in the above embodiment. Similarly, for
elementary pattern <?s :p ?o>, its weight is set as 1.
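Formulas (i) and (ii) can be sketched in the same style. The toy
triples, the tuple representation with "?"-prefixed variables, and
the function name `dom_rng_weight` are assumptions for illustration,
and the sketch assumes that at least one triple has the predicate p
so that FACT(p) is not zero.

```python
from fractions import Fraction


def is_variable(term):
    return term.startswith("?")


def dom_rng_weight(triples, pattern):
    """Weight from Domain/Range statistics, formulas (i) and (ii)."""
    s, p, o = pattern
    if is_variable(s) and is_variable(o):
        return Fraction(1)                                  # elementary pattern
    fact_p = sum(1 for t in triples if t[1] == p)           # FACT(p), assumed > 0
    if is_variable(s):                                      # <?s :p :o>, formula (i)
        dom_p = len({t[0] for t in triples if t[1] == p})   # DOM(p)
        dom_o = len({t[0] for t in triples if t[2] == o})   # DOM(o)
        return Fraction(min(dom_p, dom_o), fact_p)
    if is_variable(o):                                      # <:s :p ?o>, formula (ii)
        rng_s = len({t[2] for t in triples if t[0] == s})   # RNG(s)
        rng_p = len({t[2] for t in triples if t[1] == p})   # RNG(p)
        return Fraction(min(rng_s, rng_p), fact_p)
    raise ValueError("pattern form not covered by this sketch")


# Toy data, not taken from FIG. 3B
triples = [
    ("p1", "type", "Publication"),
    ("p2", "type", "Publication"),
    ("r1", "type", "Researcher"),
    ("r1", "name", "Ooi Beng Chin"),
    ("u1", "name", "NUS"),
    ("p1", "author", "r1"),
]

w_i = dom_rng_weight(triples, ("?x", "type", "Publication"))
w_ii = dom_rng_weight(triples, ("u1", "name", "?n"))
print(w_i, w_ii)
```

In the toy data DOM(type) is 3, DOM(Publication) is 2 and
FACT(type) is 3, giving 2/3 for the first pattern; RNG(u1) is 1,
RNG(name) is 2 and FACT(name) is 2, giving 1/2 for the second.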
[0073] Although some statistical methods and weighting methods are
illustrated for pattern analyzing unit 502, it is understood that
those skilled in the art can optionally use other methods as long
as the weight can reflect the effect of triple patterns on the
access frequency of triples in the database in some aspect or in
some degree.
[0074] After pattern analyzing unit 502 weights and converts the
triple patterns in the query requests, frequency computing unit 503
computes the occurrence frequency of each elementary pattern based
on the weighted elementary patterns.
[0075] In one example, frequency computing unit 503 considers the
weighted elementary patterns involved in each query one by one, and
obtains the occurrence frequency of each elementary pattern by
summing up the weighting factor of the same elementary pattern.
[0076] In one embodiment, when query obtaining unit 501 obtains
multiple queries, it firstly computes the occurrence frequency of
each query. Then frequency computing unit 503 can compute the
occurrence frequencies of the elementary patterns involved in the
queries based on the occurrence frequency of each query.
[0077] Then frequency computing unit 503 sends the computed
occurrence frequency of each elementary pattern to data prefetching
unit 504. Data prefetching unit 504 chooses at least one elementary
pattern based on the received occurrence frequency, and prefetches
the RDF triples corresponding to the chosen at least one elementary
pattern into the buffer.
[0078] In one embodiment, data prefetching unit 504 orders the
received occurrence frequency of each elementary pattern, and
simply chooses several elementary patterns with the highest
occurrence frequency from the order. Then data prefetching unit 504
prefetches the corresponding RDF triples of the chosen elementary
patterns into the buffer.
[0079] In some embodiments, data prefetching unit 504 also
considers the size limit and the utilization ratio of the buffer.
That is, data prefetching unit 504 chooses the elementary patterns,
such that the size of the triples to be prefetched does not exceed
the buffer size and the benefit of the buffer is optimized. The
optimized benefit of the buffer means that the triples stored in
the buffer are as many as possible and the access frequency of
these triples is as high as possible, etc.
[0080] To achieve the above optimization target, in one embodiment,
data prefetching unit 504 first orders the elementary patterns in a
queue according to their occurrence frequency from high to low.
Starting from the elementary pattern with the highest occurrence
frequency in the queue, it judges whether the constraint on the
buffer size would still be met if this elementary pattern were
chosen. If the constraint is met, the elementary pattern is chosen
and the next elementary pattern in the queue is judged. If the
buffer size constraint is not met, the elementary pattern is
ignored and the next elementary pattern in the queue is judged.
This continues until the whole queue has been checked.
[0081] For the above constrained optimization problem, various
approaches have been proposed in the existing technique(s) to
obtain the optimized solutions. Data prefetching unit 504 can adopt
other proper approaches to choose elementary patterns in order to
optimize the benefit of the buffer.
[0082] Thereby, the prefetching apparatus 500 can determine the
occurrence frequency of each elementary pattern involved in SPARQL
queries and prefetch the corresponding triples of some elementary
patterns into buffer 1011 according to the occurrence frequency of
elementary patterns. Thus, the frequently accessed data in the RDF
database can be prefetched, which improves the subsequent query
efficiency. The detailed embodiments are consistent with those of
the above prefetching method and are not described again.
[0083] Through the above description of the embodiments, those
skilled in the art will recognize that the above method and
apparatus for prefetching RDF triple data can be practiced by
executable instructions and/or control code in processors, e.g.
code on media such as a disc, CD or DVD-ROM; in memories such as
ROM or EPROM; or on carriers such as optical or electronic signal
carriers. The apparatus and its units can be realized using
hardware such as VLSI circuits or gate arrays, semiconductors such
as logic chips and transistors, or programmable hardware equipment
such as FPGAs and programmable logic equipment; or using software
executed by different kinds of processors; or using a combination
of the hardware and software. The software and program code for
implementing the invention can be written in object-oriented
languages such as Java, Smalltalk, and C++, or in traditional
procedural languages such as the C language or other similar
languages. The source code can be executed locally or remotely.
[0084] Thus, having reviewed the disclosure herein, the skilled
artisan will appreciate that aspects of the present invention may
take the form of a computer program product embodied in one or more
computer readable medium(s) having computer readable program code
embodied thereon. Any combination of one or more computer readable
medium(s) may be utilized. The computer readable medium may be a
computer readable signal medium or a computer readable storage
medium. A computer readable storage medium may be, for example, but
not limited to, an electronic, magnetic, optical, electromagnetic,
infrared, or semiconductor system, apparatus, or device, or any
suitable combination of the foregoing. More specific examples (a
non-exhaustive list) of the computer readable storage medium would
include the following: an electrical connection having one or more
wires, a portable computer diskette, a hard disk, a random access
memory (RAM), a read-only memory (ROM), an erasable programmable
read-only memory (EPROM or Flash memory), an optical fiber, a
portable compact disc read-only memory (CD-ROM), an optical storage
device, a magnetic storage device, or any suitable combination of
the foregoing. In the context of this document, a computer readable
storage medium may be any tangible medium that can contain, or
store a program for use by or in connection with an instruction
execution system, apparatus, or device.
[0085] A computer readable signal medium may include a propagated
data signal with computer readable program code embodied therein,
for example, in baseband or as part of a carrier wave. Such a
propagated signal may take any of a variety of forms, including,
but not limited to, electro-magnetic, optical, or any suitable
combination thereof. A computer readable signal medium may be any
computer readable medium that is not a computer readable storage
medium and that can communicate, propagate, or transport a program
for use by or in connection with an instruction execution system,
apparatus, or device.
[0086] Program code embodied on a computer readable medium may be
transmitted using any appropriate medium, including but not limited
to wireless, wireline, optical fiber cable, RF, etc., or any
suitable combination of the foregoing.
[0087] Distinct software modules for carrying out aspects of
embodiments of the invention can be, in at least some cases,
embodied on a computer readable storage medium. The distinct
software modules may include, for example, any one, some, or all of
the modules and/or sub-modules in FIGS. 4 and 5.
[0088] The means mentioned herein can include (i) hardware
module(s), (ii) software module(s) executing on one or more
hardware processors, or (iii) a combination of hardware and
software modules; any of (i)-(iii) implement the specific
techniques set forth herein, and the software modules are stored in
a computer readable medium (or multiple such media).
[0089] The above-described exemplary embodiments of the method and
apparatus for prefetching RDF triple data are intended to be
illustrative in all respects, rather than restrictive, of the
present invention. Those skilled in the art should recognize that
the present invention is capable of many variations and
modifications within the scope and spirit of the present invention.
The scope of the present invention is defined only by the appended
claims.
[0090] The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
the invention. As used herein, the singular forms "a", "an" and
"the" are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises" and/or "comprising," when used in this
specification, specify the presence of stated features, integers,
steps, operations, elements, and/or components, but do not preclude
the presence or addition of one or more other features, integers,
steps, operations, elements, components, and/or groups thereof.
[0091] The corresponding structures, materials, acts, and
equivalents of all means or step plus function elements in the
claims below are intended to include any structure, material, or
act for performing the function in combination with other claimed
elements as specifically claimed. The description of the present
invention has been presented for purposes of illustration and
description, but is not intended to be exhaustive or limited to the
invention in the form disclosed. Many modifications and variations
will be apparent to those of ordinary skill in the art without
departing from the scope and spirit of the invention. The
embodiment was chosen and described in order to best explain the
principles of the invention and the practical application, and to
enable others of ordinary skill in the art to understand the
invention for various embodiments with various modifications as are
suited to the particular use contemplated.
* * * * *