U.S. patent application number 11/683444 was filed with the patent office on 2008-09-11 for calibration and profile based synopses error estimation and synopses reconciliation.
This patent application is currently assigned to Ramot At Tel Aviv University Ltd. Invention is credited to Yariv Matia, YOSSI MATIAS.
Application Number | 20080222066 11/683444 |
Family ID | 39742636 |
Filed Date | 2008-09-11 |
United States Patent
Application |
20080222066 |
Kind Code |
A1 |
MATIAS; YOSSI ; et
al. |
September 11, 2008 |
CALIBRATION AND PROFILE BASED SYNOPSES ERROR ESTIMATION AND
SYNOPSES RECONCILIATION
Abstract
Apparatus, methods and computer code for estimating a synopsis
error are disclosed herein. In some embodiments, one or more
queries of a query workload are analyzed without running the
analyzed queries. In some embodiments, the synopsis error is first
estimated for a first memory allocation vector, and then, using
intermediate or final results of the first estimate, estimated for
a second memory allocation vector. In some embodiments, teachings
related to estimating synopsis error are applied to the memory
reconciliation problem.
Inventors: |
MATIAS; YOSSI; (Tel Aviv,
IL) ; Matia; Yariv; (Marburg, DE) |
Correspondence
Address: |
DR. MARK M. FRIEDMAN;C/O BILL POLKINGHORN - DISCOVERY DISPATCH
9003 FLORIN WAY
UPPER MARLBORO
MD
20772
US
|
Assignee: |
Ramot At Tel Aviv University
Ltd
Tel Aviv
IL
|
Family ID: |
39742636 |
Appl. No.: |
11/683444 |
Filed: |
March 8, 2007 |
Current U.S.
Class: |
706/19 |
Current CPC
Class: |
Y10S 707/99933 20130101;
G06F 16/24549 20190101; G06F 16/2462 20190101; G06F 16/24547
20190101 |
Class at
Publication: |
706/19 |
International
Class: |
G06E 1/00 20060101
G06E001/00 |
Claims
1. A method for computing an estimated synopsis error associated
with a plurality of data relations, a query workload, and a
plurality of data synopses of the data relations, the method
comprising: a) storing in memory descriptions of: i) the plurality
of data relations; and ii) the query workload; and b) computing the
estimated synopsis error from said stored description of the
plurality of data relations and from said stored description of the
query workload, wherein said computing includes: i) determining,
for each query of at least some queries of the query workload, a
respective indication of respective filtering criteria for said
each query; and ii) estimating, in accordance with said respective
determined filtering criteria, and without using query results on
the data relations for all queries of said at least some queries,
the synopsis error associated with the plurality of data relations,
the query workload, and the plurality of data synopses.
2. The method of claim 1 wherein said estimating, carried out in accordance with said determined query filtering criteria, is carried
out in accordance with at least one computed function selected from
the group consisting of: i) a query range function; ii) a query
dispersion function; iii) a query central tendency function; and
iv) a query distribution skew function.
3. The method of claim 1 wherein, for said each query, said
determining of said respective query filtering criteria includes
determining a respective indication of a respective query filter
field set associated with said each query.
4. The method of claim 1 wherein, for said each query, said
determining of said respective query filtering criteria includes
determining a respective indication of a respective query filter
value set associated with said each query.
5. The method of claim 4 wherein said estimating, carried out in accordance with said determined query filtering criteria, is carried
out in accordance with at least one computed function selected from
the group consisting of: i) a query range function of content of
said determined query filter value sets; ii) a query dispersion
function of content of said determined query filter value sets;
iii) a query central tendency function of content of said
determined query filter value sets; and iv) a query distribution
skew function of content of said determined query filter value
sets.
6. A method of computing a first estimated synopsis error E.sup.1
and a second estimated synopsis error E.sup.2 for a plurality of N
relations R={R.sub.1,R.sub.2 . . . R.sub.N}, the first estimated
synopsis error E.sup.1 associated with a first set of synopses
S.sup.1={S.sub.1.sup.1,S.sub.2.sup.1, . . . S.sub.N.sup.1}, the
second estimated synopsis error E.sup.2 associated with a second
set of synopses S.sup.2={S.sub.1.sup.2,S.sub.2.sup.2, . . .
S.sub.N.sup.2}, each synopsis S.sup.1.sub.i of the first synopsis
set and each synopsis S.sup.2.sub.i of the second synopsis set
associated with a respective ith relation R.sub.i, the method
comprising: a) computing E.sup.1 for a first memory allocation vector M.sup.1={m.sub.1.sup.1,m.sub.2.sup.1, . . . m.sub.N.sup.1} where: i) Σ.sub.i=1.sup.N m.sub.i.sup.1=M.sup.1; ii) for an ith relation R.sub.i, each m.sub.i.sup.1 represents a respective amount of memory allocated for a respective synopsis S.sup.1.sub.i of the first synopsis set; b) using at least one of: i) said computed first estimated synopsis error E.sup.1; and ii) stored intermediate results associated with said computing of E.sup.1, c) computing E.sup.2 for a second memory allocation vector M.sup.2={m.sub.1.sup.2,m.sub.2.sup.2, . . . m.sub.N.sup.2} where: i) Σ.sub.i=1.sup.N m.sub.i.sup.2=M.sup.2; ii) for an ith relation R.sub.i, each m.sub.i.sup.2 represents a respective amount of memory allocated for a respective synopsis S.sup.2.sub.i of the second synopsis set; wherein M.sup.1.noteq.M.sup.2.
7. The method of claim 6 wherein said used stored intermediate results associated with said computing of E.sup.1 are indicative of a profile of at least one relation of the plurality of relations.
8. A method of estimating a synopsis error for target relation
data, the method comprising: a) providing an at least partially
random data set; b) providing a synopsis of said at least partially
random data set; c) after said providing of said at least partially
random data set and said synopsis, receiving the target relation
data; and d) in accordance with: (i) said provided at least
partially random data set; (ii) said provided synopsis of said at
least partially random data set; and (iii) the target relation data,
computing a synopsis error estimation function for the target
relation data.
9. A method of estimating a synopsis error for target relation
data, the method comprising: a) selecting a functional form for an
error function, said functional form associated with a plurality of
function coefficients; b) effecting an offline calibration
procedure, said calibration procedure including: i) generating
synthetic relations using a random number generation procedure; ii)
computing a synopsis of said generated synthetic relations; and
iii) deriving said plurality of function coefficients from said
computed synopsis of said generated synthetic relations; c) after
said offline calibration procedure, receiving the target relation
data; and d) computing a profile of said target relation data; and
e) estimating the synopsis error by using: i) said derived
plurality of function coefficients; and ii) results of said
profiling.
10. Apparatus for computing an estimated synopsis error associated
with a plurality of data relations, a query workload, and a
plurality of data synopses of the data relations, the apparatus
comprising: a) a data storage operative to store representations
of: i) the plurality of data relations; and ii) the query workload;
and b) a synopsis error estimation engine operative to: i)
determine, for each query of at least some queries of the query
workload, a respective indication of respective filtering criteria
for said each query; and ii) estimate the synopsis error in
accordance with said: A) respective determined filtering criteria;
and B) said stored representation of the plurality of data
relations, without using query results on the data relations for
all queries of said at least some queries.
11. The apparatus of claim 10 wherein said synopsis error
estimation engine is operative such that said estimating, carried out in accordance with said determined query filtering criteria, is carried out in accordance with at least one of: i) a query range
function; ii) a query dispersion function; iii) a query central
tendency function; and iv) a query distribution skew function.
12. The apparatus of claim 10 wherein said synopsis error
estimation engine is operative such that, for said each query, said
determining of said respective query filtering criteria includes
determining a respective indication of a respective query filter
field set associated with said each query.
13. The apparatus of claim 10 wherein said synopsis error
estimation engine is operative such that, for said each query, said
determining of said respective query filtering criteria includes
determining a respective indication of a respective query filter
value set associated with said each query.
14. The apparatus of claim 13 wherein said synopsis error
estimation engine is operative such that said estimating, carried out in accordance with said determined query filtering criteria, is carried out in accordance with at least one of: i) a query range
function of content of said determined query filter value sets; ii)
a query dispersion function of content of said determined query
filter value sets; iii) a query central tendency function of
content of said determined query filter value sets; and iv) a query
distribution skew function of content of said determined query
filter value sets.
15. Apparatus for computing a first estimated synopsis error
E.sup.1 and a second estimated synopsis error E.sup.2 for a
plurality of N relations R={R.sub.1,R.sub.2 . . . R.sub.N}, the
first estimated synopsis error E.sup.1 associated with a first set
of synopses S.sup.1={S.sub.1.sup.1, S.sub.2.sup.1, . . .
S.sub.N.sup.1}, the second estimated synopsis error E.sup.2
associated with a second set of synopses
S.sup.2={S.sub.1.sup.2,S.sub.2.sup.2, . . . S.sub.N.sup.2}, each
synopsis S.sup.1.sub.i of the first synopsis set and each synopsis
S.sup.2.sub.i of the second synopsis set associated with a
respective ith relation R.sub.i, the apparatus comprising:
a) a data storage operative to store representations of a first
memory allocation vector M.sup.1={m.sub.1.sup.1,m.sub.2.sup.1, . .
. m.sub.N.sup.1} and a second memory allocation vector
M.sup.2={m.sub.1.sup.2,m.sub.2.sup.2, . . . m.sub.N.sup.2} wherein:
i) each m.sub.i.sup.1 represents a respective amount of memory
allocated for a respective synopsis S.sup.1.sub.i of the first
synopsis set; ii) each m.sub.i.sup.2 represents a respective amount of memory allocated for a respective synopsis S.sup.2.sub.i of the second synopsis set, associated with said ith relation R.sub.i; iii) Σ.sub.i=1.sup.N m.sub.i.sup.1=M.sup.1; iv) Σ.sub.i=1.sup.N m.sub.i.sup.2=M.sup.2; v) M.sup.1.noteq.M.sup.2; b) a synopsis error estimation engine operative to: i) compute the first synopsis error E.sup.1 for the first memory allocation vector M.sup.1; ii) compute the second synopsis error E.sup.2 for the second memory allocation vector M.sup.2, using one of: A) said computed first estimated synopsis error E.sup.1; and B) stored intermediate results associated with said computing of E.sup.1.
16. The apparatus of claim 15 wherein said synopsis error estimation engine is operative such that said used stored intermediate results associated with said computing of E.sup.1 are indicative of a profile of at least one relation of the plurality of relations.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to systems, apparatus and
computer-code for data synopsis error estimation.
BACKGROUND OF THE INVENTION
Measurement of Synopsis Error
[0002] As databases continue to grow in size, there is an ongoing
need to generate and utilize "data synopses" for efficient
approximate query processing. A data synopsis is a concise
representation of the data set whose purpose is to effectively
enable an approximate result, which is as close as possible to the
actual result given some space limitation (physical memory size) in
which the data synopsis must reside.
[0003] When analyzing the usability of synopses, it is advantageous
to derive an effective estimation of the synopsis approximation
error, so as to have an informed decision about when it is cost
effective to build a synopsis, and how much memory should be
allocated to that synopsis.
[0004] Some aspects of the present disclosure relate to the
following synopsis error estimation problem: Given a data set, a
query workload consisting of queries to this data set, an error
metric, a synopsis implementation and a space limit, find an
efficient method for effectively estimating the approximation error
of the synopsis for the given workload and data set.
[0005] Some aspects of the present disclosure relate to the
synopses reconciliation problem: Given a set of N relations
R={R.sub.1 . . . R.sub.N}, a query workload Q comprising queries
directed to these relations, a set of synopses types, an error
metric, and a space limit, find a combination of data synopses
(synopses sizes and synopses types) that will achieve a low
approximation error for the given query workload.
[0006] There are many situations where a database consists of
multiple relations, but only one physical memory resource that has
to be shared and partitioned among all the data synopses. The
selection of which data synopses to build, their sizes and their
types will determine the performance and accuracy of the queries
submitted to the data synopses. A basic operation in synopses
reconciliation algorithms is estimating the approximation-error of
candidate synopses in various combinations that are considered.
Existing algorithms use a straightforward error measurement,
involving building synopses and running queries against large
relations. Measuring the approximation-error is a costly operation,
and its execution dominates the running time of these algorithms.
Replacing these costly error measurements with an efficient method for estimating the approximation-error could significantly reduce the running-times of these algorithms.
[0007] FIG. 1 describes a prior art technique for measuring the synopsis error--i.e. the difference between the result of queries Q run on the original relations R={R.sub.1 . . . R.sub.N} and the result of queries Q run on the synopses of the original relations S={S.sub.1 . . . S.sub.N}.
[0008] In step S111, the set of synopses and the set of relations R={R.sub.1 . . . R.sub.N} are provided. In step S115, the query workload Q is run on the relations R={R.sub.1 . . . R.sub.N} to obtain Q(R). In step S115, the query workload Q is run on the synopses S={S.sub.1 . . . S.sub.N} to obtain Q(S).
[0009] In step S123, the synopses error is computed from Q(R) and Q(S) using some sort of error function. This error function may, for example, include differences between query results for a given query--i.e. the difference between the result when running the query on the relations and the synopses.
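The measurement flow of FIG. 1 can be sketched as follows. This is a minimal illustration, not the application's implementation: the toy range-COUNT query, the uniform-sample synopsis, and the mean-relative-error metric are all assumptions chosen for concreteness.

```python
# Hypothetical sketch of the FIG. 1 prior-art flow: run the workload on the
# relation (step S115) and on its synopsis, then compare the results (S123).
import random

def run_query(rows, lo, hi):
    """Toy range-COUNT query: how many values fall in [lo, hi)."""
    return sum(1 for v in rows if lo <= v < hi)

def measure_synopsis_error(relation, synopsis, workload, scale):
    """Average absolute relative error over the workload (step S123).
    `scale` corrects the synopsis counts for its sampling rate."""
    errors = []
    for lo, hi in workload:
        exact = run_query(relation, lo, hi)           # run on R (costly)
        approx = run_query(synopsis, lo, hi) * scale  # run on S (cheap)
        if exact:
            errors.append(abs(exact - approx) / exact)
    return sum(errors) / len(errors)

random.seed(0)
relation = [random.randint(0, 1000) for _ in range(10000)]
synopsis = random.sample(relation, 500)               # uniform-sample synopsis
workload = [(100, 300), (200, 600), (0, 1000)]
err = measure_synopsis_error(relation, synopsis, workload, scale=10000 / 500)
```

Even in this toy, the dominant cost is the pass over `relation` for every query, which is precisely the cost the disclosure seeks to avoid.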
[0010] One shortcoming of the technique described in FIG. 1 is that typically, step S115 is very resource-intensive or slow (i.e. due to the need to read many rows of a given table), especially for cases where the tables include million(s) or even billion(s) of rows. In particular, it may be time consuming to run the actual queries on the actual relations.
[0011] FIG. 2A provides a flow chart of a related prior art technique for computing synopses error. In the example of FIG. 2A, the synopses error is computed for a particular memory allocation candidate C associated with a `memory allocation vector` M={m.sub.1 . . . m.sub.N} determining the size (i.e. in memory) of each synopsis--i.e. synopsis S.sub.1 of relation R.sub.1 is not allowed to exceed a memory size that is m.sub.1, synopsis S.sub.2 of relation R.sub.2 is not allowed to exceed a memory size that is m.sub.2, and so on.
[0012] Referring to FIG. 2A, after defining S131 the memory allocation candidate C associated with memory allocation vector M={m.sub.1 . . . m.sub.N}, it is possible to compute S133 the synopsis error for the set of relations R for the memory allocation vector M.
[0013] It is noted that in S131, each `component` of the vector M
refers to a size of memory and/or a number of memory segments. The
variable M (for example, on the right hand side of the inequality
of S131) may refer, for example, to a total number of memory
segments (for example, memory segments of equal size).
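For concreteness, a memory allocation vector and the budget constraint of S131 can be sketched as follows; the numbers and names are illustrative only, not taken from the application.

```python
# Hypothetical memory allocation vector for N = 4 synopses sharing a budget
# of M = 100 equal-size memory segments (the constraint of step S131).
M_total = 100
M_vec = [40, 25, 20, 15]   # m_1 ... m_4: segments granted to each synopsis

def is_feasible(m_vec, budget):
    """A candidate is feasible when its non-negative components sum to no
    more than the total budget."""
    return all(m >= 0 for m in m_vec) and sum(m_vec) <= budget

feasible = is_feasible(M_vec, M_total)
```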
[0014] FIG. 2B provides a flow chart of a prior art technique for computing S133 synopsis error for the set of relations R for the memory allocation vector M. As shown in FIG. 2B, in the prior art of H. V. Jagadish, H. Jin, B. C. Ooi, and K. L. Tan. Global optimization of histograms. In SIGMOD '01, pages 223-234, 2001, a particular set of synopses S whose memory requirements satisfy the specifications of the memory allocation vector M must be generated S135 in order to compute S123 the synopsis error associated with the set of synopses S, the set of relations R, and the query workload Q. Unfortunately, step S135 is also often resource intensive and/or time consuming.
[0015] Thus, there is an ongoing need for systems, apparatus and
computer code for estimating the synopses error in a manner that is
less resource-intensive and/or is more efficient (for example, in a
manner that requires fewer row-reads) than known techniques.
Memory Reconciliation
[0016] It is noted that synopsis error estimation may be used when
determining how to best partition memory space of a given size
between different synopses so as to achieve a lower global error
for all synopses. This process, whereby a recommendation for memory
partition is generated, is referred to in the present disclosure as
`memory reconciliation.`
[0017] FIG. 3 provides a flow chart of a prior art technique for memory reconciliation. Thus, for the jth iteration (i.e. S151 and S163), a jth memory allocation candidate C.sup.j associated with a jth memory allocation vector M.sup.j is analyzed. In particular, a synopsis error is computed S133j for the jth memory allocation vector.
[0018] It is theoretically possible to use a brute force technique
to analyze every possible memory allocation vector M.sup.j whose
overall memory `size` does not exceed M (see the inequality in S131j
of FIG. 3). Nevertheless, there are known heuristic techniques (for
example, defined in H. V. Jagadish, H. Jin, B. C. Ooi, and K. L.
Tan. Global optimization of histograms. In SIGMOD '01, pages
223-234, 2001) for judiciously defining the specific memory
allocation candidates C.sup.j.
[0019] Unfortunately, every time step S133j is carried out, several
resource intensive and/or time-consuming steps must be carried
out--i.e. S135j (i.e. generating the synopses to be tested), and
S115j (i.e. running the query workload on the synopses).
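The brute-force version of the FIG. 3 reconciliation loop can be sketched as follows. Here `build_synopses` and `measure_error` are cheap stand-ins for the costly steps S135j and S115j; the grid enumeration, the budget, and the toy error model are all illustrative assumptions.

```python
# Hypothetical brute-force memory reconciliation over a grid of candidates.
import itertools

def build_synopses(m_vec):
    """Stand-in for step S135j: building synopses sized by m_vec (costly)."""
    return tuple(m_vec)

def measure_error(synopses):
    """Stand-in for steps S115j/S123j: a toy error that shrinks as each
    synopsis receives more memory."""
    return sum(1.0 / (1 + m) for m in synopses)

def reconcile_brute_force(n_relations, budget, step):
    """Enumerate every allocation vector on a grid whose components sum to
    the budget, and keep the candidate with the lowest measured error."""
    best_vec, best_err = None, float("inf")
    grid = range(0, budget + 1, step)
    for vec in itertools.product(grid, repeat=n_relations):
        if sum(vec) != budget:
            continue
        err = measure_error(build_synopses(vec))  # dominant cost in practice
        if err < best_err:
            best_vec, best_err = vec, err
    return best_vec, best_err

best_vec, best_err = reconcile_brute_force(n_relations=3, budget=30, step=10)
```

In the real setting each call to `build_synopses` and `measure_error` touches the underlying relations, which is why replacing these calls with an estimator changes the running time of reconciliation so dramatically.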
[0020] It would be highly desirable to have a system, method and
computer code for memory reconciliation that requires fewer
resources and/or is faster than known techniques.
SUMMARY
[0021] Embodiments of the present invention are based, in part, on the surprising discovery that it is possible to estimate a synopsis error for a given query workload, for a given set of one or more relations, and for a given set of synopses: (i) without actually generating the synopses or an approximation thereof; and (ii) without running queries (i.e. of the query workload) on the relation(s) and/or the synopses.
[0022] Instead, in some embodiments, it is possible to calculate a synopses-dependent error estimation function by carrying out two computational processes. According to the first process, a generic function is constructed (for example, using a random number generator or pseudo-random number generator) using a so-called `synthetic data set`--i.e. a data set whose content is not required to depend on the relations or on the particular queries. In the present disclosure, the process whereby this generic function (or set of functions) is created is referred to as `calibration.` It is possible to implement the calibration process, which does not need to involve the actual data of the relations and does not need to involve the actual queries, as a pre-processing stage.
[0023] According to the second process, referred to in the present disclosure as a `profiling process,` one or more functions indicative of the characteristics of the relations (i.e. profiling the `relations`) and/or the query workloads (i.e. profiling the queries) is computed. The synopsis error depends on the results of the first `calibration` process carried out with the synthetic data set and the second `profiling` process carried out using the actual query workload and the actual relations.
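The two processes can be sketched as a tiny offline calibration followed by a profile-based online estimate. The linear error model, its single feature (rows per unit of synopsis memory), and the least-squares fit are assumptions chosen for illustration; the application does not commit to this functional form.

```python
# Hypothetical calibration/profiling sketch. Assumed error model:
#   error ~ c0 + c1 * (relation size / synopsis memory)
import random

def true_error(n_rows, memory):
    """Stand-in for a synopsis error measured on a synthetic relation."""
    return 0.05 + 2.0 * (n_rows / memory) / 1000.0

def calibrate(trials=50):
    """Offline: fit (c0, c1) by ordinary least squares on synthetic data
    produced with a random number generator -- no real relations needed."""
    random.seed(1)
    xs, ys = [], []
    for _ in range(trials):
        n_rows = random.randint(1000, 100000)
        memory = random.randint(10, 1000)
        xs.append(n_rows / memory)
        ys.append(true_error(n_rows, memory))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    c0 = mean_y - c1 * mean_x
    return c0, c1

def estimate_error(profile, memory, coeffs):
    """Online: profile the target relation (here just its row count) and
    plug the profile into the calibrated model -- no queries are run."""
    c0, c1 = coeffs
    return c0 + c1 * (profile["n_rows"] / memory)

coeffs = calibrate()
est = estimate_error({"n_rows": 50000}, memory=500, coeffs=coeffs)
```

The point of the split is that `calibrate` runs once, offline, on synthetic data, while `estimate_error` needs only a cheap profile of the target relation.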
[0024] One particular type of function of the query workload is a
function of range parameters of the query workload. In particular,
for each individual query of at least some queries of the workload,
at least one parameter related to the range of the query is
determined--for example, a lower bound of the range, an upper bound
of the range, or a difference between an upper and lower bound.
When the range specified by a given query does not extend over all
possible values of a given field for a given relation, the range parameter is referred to as a partial range parameter. Exemplary functions of partial range parameters include but are not
limited to functions of central tendency of range parameters,
dispersion functions of range parameters, and functions indicative
of higher-order statistical moments of range parameters.
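Such functions of range parameters can be sketched as follows; the workload, the field domain, and the particular statistics collected are illustrative assumptions, not the application's prescribed profile.

```python
# Hypothetical workload profile: each query is a (lower_bound, upper_bound)
# range filter over one field of a relation whose domain is [0, 1000].
import statistics

workload = [(100, 250), (400, 900), (50, 120), (600, 640)]

widths = [hi - lo for lo, hi in workload]   # one range parameter per query

range_profile = {
    "mean_width": statistics.mean(widths),    # central tendency of ranges
    "stdev_width": statistics.stdev(widths),  # dispersion of ranges
    "min_lower": min(lo for lo, _ in workload),
    "max_upper": max(hi for _, hi in workload),
}
```

Every entry of `range_profile` is computed from the query text alone; none of the queries is ever executed against the relation.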
[0025] Surprisingly, the functions of the query workload are useful, even for situations where certain `analyzed` queries are never run on the relations and/or on the synopses. This obviates the need to run expensive queries on the relations in order to determine a synopsis-relation error.
[0026] The technique of analyzing queries without running queries in order to compute an estimated synopsis error is also useful in situations where the queries serve as `filters` to select a `sub-plurality` of rows from a plurality of rows in a given relation. Thus, it may be useful to analyze filter criteria defined by the queries--for example, range values associated with filters, fields associated with filters, etc. It is possible to compute the estimated synopsis error in accordance with the results of the query analysis without running the actual queries (or using the query results on the relations).
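Extracting filter criteria from query text rather than from query results can be sketched as a crude parse of simple WHERE clauses. A real system would use the database's own parser; the regular expression, the sample query, and all names here are illustrative assumptions.

```python
# Hypothetical extraction of filter fields and filter values from SQL text.
import re

def extract_filter_criteria(sql):
    """Pull (field, lower, upper) triples out of simple
    'field BETWEEN a AND b' predicates -- the query is never executed."""
    pattern = r"(\w+)\s+BETWEEN\s+(\d+)\s+AND\s+(\d+)"
    return [(f, int(lo), int(hi))
            for f, lo, hi in re.findall(pattern, sql, flags=re.IGNORECASE)]

query = "SELECT COUNT(*) FROM sales WHERE price BETWEEN 10 AND 99"
criteria = extract_filter_criteria(query)

# The query filter field set and query filter value set for this query:
filter_fields = {f for f, _, _ in criteria}
filter_values = {(lo, hi) for _, lo, hi in criteria}
```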
[0027] Furthermore, it is now disclosed that it may be useful to
compute a dispersion function or a query dispersion skew function
or a data dispersion skew function (or higher-order statistical
moment function) of one or more relations or queries during the
`profiling process,` and to estimate synopsis error in accordance
with the computed dispersion function or distribution skew function
(or higher-order statistical moment function). Surprisingly, the
aforementioned dispersion or skew function (or higher-order
statistical moment) function is useful for estimating
synopsis-relation error even in situations where the actual
synopses (or approximations thereof) are not generated and are not
used. This obviates the need to generate the synopses and/or run
queries on the relations.
[0028] It is noted that the aforementioned technique for synopsis
error estimation is useful when determining how to partition memory
space between different synopses so as to achieve a lower global
error for all synopsis. This process, whereby a recommendation for
memory partition is generated, is referred to in the present
disclosure as `memory reconciliation.`
A Brief Discussion of Techniques for Computing Synopsis Error by Analyzing the Content of Queries Without the Need to Run Certain Queries
[0029] In accordance with one aspect, it is now disclosed for the first time a method for computing an estimated synopsis error associated with a plurality of data relations, a query workload, and a plurality of data synopses of the data relations, the method comprising: a) storing in memory descriptions of: i) the plurality
of data relations; and ii) the query workload; and b) computing the
estimated synopsis error from the stored description of the
plurality of data relations and from the stored description of the
query workload, wherein the computing includes: i) determining, for
each query of at least some queries of the query workload, a
respective indication of respective filtering criteria for each
query; and ii) estimating, in accordance with the respective
determined filtering criteria, and without using query results on
the data relations for all queries of the at least some queries,
the synopsis error associated with the plurality of data relations,
the query workload, and the plurality of data synopses.
[0030] According to some embodiments, the estimating, carried out in accordance with the determined query filtering criteria, is carried out in accordance with at least one computed function selected from the group consisting of: i) a query range function; ii) a query dispersion function; iii) a query central tendency function; and iv) a query distribution skew function.
[0031] According to some embodiments, for each query, the
determining of the respective query filtering criteria includes
determining a respective indication of a respective query filter
field set associated with each query.
[0032] According to some embodiments, for each query, the
determining of the respective query filtering criteria includes
determining a respective indication of a respective query filter
value set associated with each query.
[0033] According to some embodiments, the estimating, carried out in accordance with the determined query filtering criteria, is carried out in accordance with at least one of: i) a query range
function of content of the determined query filter value sets; ii)
a query dispersion function of content of the determined query
filter value sets; iii) a query central tendency function of
content of the determined query filter value sets; and iv) a query
distribution skew function of content of the determined query
filter value sets.
[0034] In accordance with another aspect, it is now disclosed for
the first time an apparatus for computing an estimated synopsis
error associated with a plurality of data relations, a query
workload, and a plurality of data synopses of the data relations,
the apparatus comprising a) a data storage operative to store
representations of: i) the plurality of data relations; and ii) the
query workload; and b) a synopsis error estimation engine operative
to: i) determine, for each query of at least some queries of the
query workload, a respective indication of respective filtering
criteria for each query; and ii) estimate the synopsis error in
accordance with the: A) respective determined filtering criteria;
and B) the stored representation of the plurality of data
relations, without using query results on the data relations for
all queries of the at least some queries.
[0035] For the present disclosure, the term `data storage` may
include any combination of volatile and/or non-volatile memory (for
example, persistent storage including but not limited to magnetic
media and flash memory). The `data storage` may reside on a single
machine, or may be `distributed` over a plurality of machines, and
may be deployed in any architecture.
[0036] According to some embodiments, the synopsis error estimation engine is operative such that the estimating, carried out in accordance with the determined query filtering criteria, is carried out in accordance with at least one of: i) a query range function;
ii) a query dispersion function; iii) a query central tendency
function; and iv) a query distribution skew function.
[0037] According to some embodiments, the synopsis error estimation
engine is operative such that, for each query, the determining of
the respective query filtering criteria includes determining a
respective indication of a respective query filter field set
associated with each query.
[0038] According to some embodiments, the synopsis error estimation
engine is operative such that, for each query, the determining of
the respective query filtering criteria includes determining a
respective indication of a respective query filter value set
associated with each query.
[0039] According to some embodiments, the synopsis error estimation engine is operative such that the estimating, carried out in accordance with the determined query filtering criteria, is carried out in accordance with at least one of: i) a query range function
of content of the determined query filter value sets; ii) a query
dispersion function of content of the determined query filter value
sets; iii) a query central tendency function of content of the
determined query filter value sets; and iv) a query distribution
skew function of content of the determined query filter value
sets.
A Brief Discussion of Techniques Where Synopsis Errors are Computed
for Multiple Sets of Input Parameters Where Results of Earlier
Synopsis Error Estimations are Re-Used for Latter Synopsis Error
Computations
[0040] According to some aspects of the present invention, it is
possible to compute estimated synopsis errors for multiple
circumstances, where the results of a `first` synopsis estimation
(i.e. associated with a `first` set of input parameters--i.e. a
`first` memory allocation vector and/or a `first` query workload
and/or a `first` set of relations) are re-used when computing the
synopsis error for a `second` set of input parameters (i.e. when at
least one of the memory allocation vector, the query workload and
the relations are different).
[0041] According to some embodiments, it is now disclosed for the
first time a method of computing a first estimated synopsis error
E.sup.1 and a second estimated synopsis error E.sup.2 for a
plurality of N relations R={R.sub.1,R.sub.2 . . . R.sub.N}, the
first estimated synopsis error E.sup.1 associated with a first set
of synopses S.sup.1={S.sub.1.sup.1,S.sub.2.sup.1, . . . S.sub.N.sup.1}, the second estimated synopsis error E.sup.2 associated with a second set of synopses S.sup.2={S.sub.1.sup.2,S.sub.2.sup.2, . . . S.sub.N.sup.2} (i.e.
different from the first set of synopses), each synopsis
S.sup.1.sub.i of the first synopsis set and each synopsis
S.sup.2.sub.i of the second synopsis set associated with a
respective ith relation R.sub.i, the method comprising:
a) computing E.sup.1 for a first memory allocation vector M.sup.1={m.sub.1.sup.1,m.sub.2.sup.1, . . . m.sub.N.sup.1} where: i) Σ.sub.i=1.sup.N m.sub.i.sup.1=M.sup.1; ii) for an ith relation R.sub.i, each m.sub.i.sup.1 represents a respective amount of memory allocated for a respective synopsis S.sup.1.sub.i of the first synopsis set; b) using at least one of: i) the computed first estimated synopsis error E.sup.1; and ii) stored intermediate results (for example, computed profiles of relations and/or query workloads) associated with the computing of E.sup.1, c) computing E.sup.2 for a second memory allocation vector M.sup.2={m.sub.1.sup.2,m.sub.2.sup.2, . . . m.sub.N.sup.2} where: i) Σ.sub.i=1.sup.N m.sub.i.sup.2=M.sup.2; ii) for an ith relation R.sub.i, each m.sub.i.sup.2 represents a respective amount of memory allocated for a respective synopsis S.sup.2.sub.i of the second synopsis set, wherein M.sup.1.noteq.M.sup.2.
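The re-use of intermediate results can be sketched with a memoised per-relation profile: computing E.sup.1 caches each relation's profile, so computing E.sup.2 for a different allocation vector only re-evaluates a cheap per-relation error model. The cache layout and the error model below are assumptions for illustration only.

```python
# Hypothetical re-use of stored intermediate results (relation profiles)
# across two synopsis error estimations with different allocation vectors.
profile_cache = {}

def profile_relation(rel_id, rows):
    """Costly per-relation profiling, performed at most once per relation."""
    if rel_id not in profile_cache:
        profile_cache[rel_id] = {"n_rows": len(rows),
                                 "n_distinct": len(set(rows))}
    return profile_cache[rel_id]

def estimate_total_error(relations, m_vec):
    """Toy per-relation error model summed over the allocation vector."""
    total = 0.0
    for (rel_id, rows), m in zip(relations, m_vec):
        p = profile_relation(rel_id, rows)
        total += p["n_distinct"] / (p["n_rows"] * m)  # assumed error model
    return total

relations = [("R1", list(range(100)) * 4), ("R2", list(range(50)) * 10)]
E1 = estimate_total_error(relations, m_vec=[8, 2])   # profiles computed here
E2 = estimate_total_error(relations, m_vec=[5, 5])   # profiles re-used
```

Because the profiles survive in `profile_cache`, the second call never touches the relation data again; only the arithmetic over the new vector M.sup.2 is repeated.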
[0042] In accordance with another aspect, it is now disclosed for the first time an apparatus for computing a first estimated synopsis error E.sup.1 and a second estimated synopsis error E.sup.2 for a plurality of N relations R={R.sub.1,R.sub.2 . . .
R.sub.N}, the first estimated synopsis error E.sup.1 associated
with a first set of synopses S.sup.1={S.sub.1.sup.1,S.sub.2.sup.1,
. . . S.sub.N.sup.1}, the second estimated synopsis error E.sup.2
associated with a second set of synopses
S.sup.2={S.sub.1.sup.2,S.sub.2.sup.2, . . . S.sub.N.sup.2}, each
synopsis S.sup.1.sub.i of the first synopsis set and each synopsis
S.sup.2.sub.i of the second synopsis set associated with a
respective ith relation R.sub.i, the apparatus comprising: a) a
data storage operative to store representations of a first memory
allocation vector M.sup.1={m.sub.1.sup.1,m.sub.2.sup.1, . . .
m.sub.N.sup.1} and a second memory allocation vector
M.sup.2={m.sub.1.sup.2,m.sub.2.sup.2, . . . m.sub.N.sup.2}
wherein:
i) each m.sub.i.sup.1 represents a respective amount of memory
allocated for a respective synopsis S.sup.1.sub.i of the first
synopsis set; ii) each m.sub.i.sup.2 represents a respective amount
of memory allocated for a respective synopsis S.sup.2.sub.i of the
second synopsis set associated with the ith relation R.sub.i;
iii) .SIGMA..sub.i=1.sup.N m.sub.i.sup.1=M.sup.1;
iv) .SIGMA..sub.i=1.sup.N m.sub.i.sup.2=M.sup.2;
and v) M.sup.1.noteq.M.sup.2; and b) a synopsis error estimation
engine operative to: i) compute the first synopsis error E.sup.1
for the first memory allocation vector M.sup.1; and ii) compute the
second synopsis error E.sup.2 for the second memory allocation
vector M.sup.2, using at least one of: A) the computed first estimated
synopsis error E.sup.1; and B) stored intermediate results (for
example, computed profiles of relations and/or query workloads)
associated with the computing of E.sup.1.
[0043] In accordance with another aspect, it is now disclosed for
the first time a method of computing, for a plurality of N
relations {R.sub.1,R.sub.2 . . . R.sub.N}, a first estimated
synopsis error E.sup.1 associated with a first query workload
Q.sup.1 and a second estimated synopsis error E.sup.2 associated
with a second query workload Q.sup.2 different from Q.sup.1, the
method comprising:
a) computing E.sup.1 for the first query workload Q.sup.1; and b)
using at least one of: i) the estimated synopsis error E.sup.1; and ii)
stored intermediate results (for example, computed profiles of
relations) associated with the computing of E.sup.1, computing
E.sup.2 for the second query workload Q.sup.2 which is different
from the first query workload Q.sup.1.
[0044] In accordance with another aspect, it is now disclosed for
the first time an apparatus for computing, for a plurality of N
relations {R.sub.1,R.sub.2 . . . R.sub.N}, a first estimated
synopsis error E.sup.1 associated with a first query workload
Q.sup.1 and a second estimated synopsis error E.sup.2 associated
with a second query workload Q.sup.2 different from Q.sup.1, the
apparatus comprising:
a) a data storage operative to store representations of the first
query workload Q.sup.1 and the second query workload Q.sup.2 which
is different from the first query workload Q.sup.1; b) a synopsis
error estimation engine operative to: i) compute the first synopsis
error E.sup.1 for the first query workload Q.sup.1; and ii) compute
the second synopsis error E.sup.2 for the second query workload
Q.sup.2 using at least one of: A) the computed estimated synopsis
error E.sup.1; and B) stored intermediate results (for example,
computed profiles of relations) associated with the computing of
E.sup.1.
[0045] In accordance with another aspect, it is now disclosed for
the first time a method of computing a first estimated synopsis
error E.sup.1 associated with a first plurality of relations, and a
second estimated synopsis error E.sup.2 associated with a second
plurality of relations different from the first plurality of
relations, the method comprising:
a) computing E.sup.1 for the first plurality of relations; and b)
using at least one of: i) the computed estimated synopsis error
E.sup.1; and ii) stored intermediate results (for example, computed
profiles of a query workload) associated with the computing of
E.sup.1, computing E.sup.2 for the second plurality of relations
which is different from the first plurality of relations.
[0046] In accordance with another aspect, it is now disclosed for
the first time an apparatus for computing a first estimated
synopsis error E.sup.1 associated with a first plurality of
relations, and a second estimated synopsis error E.sup.2 associated
with a second plurality of relations different from the first
plurality of relations, the apparatus comprising: a) a data
storage operative to store representations of the first plurality
of relations and the second plurality of relations which is
different from the first plurality of relations; b) a synopsis
error estimation engine operative to:
i) compute E.sup.1 for the first plurality of relations; and ii)
compute E.sup.2 for the second plurality of relations which is
different from the first plurality of relations using at least one
of: A) the computed estimated synopsis error E.sup.1; and B) stored
intermediate results (for example, computed profiles of relations)
associated with the computing of E.sup.1.
A Discussion of Techniques Associated with Computing the Synopsis
Error in `Multiple Stages`--i.e. in a Calibration Stage and a
Profiling Stage
[0047] In accordance with another aspect, it is now disclosed for
the first time a method of estimating a synopsis error for target
relation data, the method comprising: a) providing an at least
partially random data set; b) providing a synopsis of the at least
partially random data set; c) after the providing of the at least
partially random data set and the synopsis, receiving the target
relation data; and d) in accordance with: (i) the provided at least
partially random data set; (ii) the provided synopsis of the at
least partially random data set; and (iii) the target relation data,
computing a synopsis error estimation function for the target
relation data.
[0048] In accordance with another aspect, it is now disclosed for
the first time a method of estimating a synopsis error for target
relation data, the method comprising: a) selecting a functional
form for an error function, the functional form associated with a
plurality of function coefficients; b) effecting an off-line
calibration procedure, the calibration procedure including: i)
generating synthetic relations (i.e. as opposed to the `real`
relations for which we are computing the synopsis error) using a
random number generation procedure; ii) computing a synopsis of the
generated synthetic relations; and iii) deriving the plurality of
function coefficients from the computed synopsis of the generated
synthetic relations; c) after the off-line calibration procedure,
receiving the target relation data; d) computing a profile of the
target relation data (for example, a function indicative of a range
or dispersion or any other parameter of the target relation data),
and e) estimating the synopsis error by using: i) the derived
plurality of function coefficients; and ii) results of the
profiling.
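By way of non-limiting illustration, the two-stage procedure of this paragraph may be sketched in a few lines of Python. All specifics here are assumptions introduced only for illustration: the power-law functional form, the `measure_error` stand-in (a real calibration would build a synopsis over the synthetic relation and query it) and every constant are hypothetical, so this is a sketch of the idea rather than the disclosed method.

```python
import numpy as np

# Hypothetical stand-in for measuring the error of a synopsis built over a
# synthetic (random) relation; the power-law behavior is an assumed ground
# truth used only to demonstrate the fitting step.
def measure_error(synopsis_size):
    noise = np.random.default_rng(0).normal(1.0, 0.01)
    return 50.0 * synopsis_size ** -0.5 * noise

def calibrate(sizes):
    """Off-line step: least-squares fit of error = b * size^c in log-log
    space over synthetic relations, yielding the function coefficients."""
    errs = np.array([measure_error(s) for s in sizes])
    c, log_b = np.polyfit(np.log(sizes), np.log(errs), 1)
    return np.exp(log_b), c

def estimate_error(b, c, synopsis_size):
    """On-line step: evaluate the fitted function; no synopsis is built."""
    return b * synopsis_size ** c

b, c = calibrate([16, 32, 64, 128, 256, 512])
```

Once calibrated, `estimate_error(b, c, m)` prices any candidate synopsis size m without touching the target relation, which is the point of separating the off-line calibration stage from the profiling stage.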
Additional Techniques Associated with Estimated Synopsis Error
[0049] In accordance with another aspect, it is now disclosed for
the first time a method of computing an estimated synopsis error E
associated with a plurality of N relations R={R.sub.1,R.sub.2, . .
. R.sub.N} and a plurality of N synopses S={S.sub.1,S.sub.2, . . .
S.sub.N}, the ith synopsis associated with the ith relation, the
method comprising: a) computing at least one profiling function
selected from the group consisting of: i) a dispersion function;
and ii) a data distribution skew function; for at least one
relation of the plurality of N relations; b) deriving the estimated
synopsis error E from the at least one computed profiling function,
wherein the method is carried out without computing the plurality
of N synopses S={S.sub.1,S.sub.2, . . . S.sub.N} or an
approximation thereof.
[0050] In accordance with another aspect, it is now disclosed for
the first time an apparatus for computing an estimated synopsis
error E associated with a plurality of N relations
R={R.sub.1,R.sub.2, . . . R.sub.N} and a plurality of N synopses
S={S.sub.1,S.sub.2, . . . S.sub.N}, the ith synopsis associated
with the ith relation, the apparatus comprising: a) a data storage
operative to store a representation of the plurality of N relations
R={R.sub.1,R.sub.2, . . . R.sub.N}; and b) a synopsis error
estimation engine operative to: i) compute at least one profiling
function selected from the group consisting of: A) a dispersion
function; and B) a data distribution skew function; for at least
one relation of the plurality of N relations; and ii) derive the
estimated synopsis error E from the at least one computed profiling
function, wherein the synopsis error estimation engine is operative
to carry out the computing of the at least one profiling function
and the deriving of the estimated synopsis error E without
computing the plurality of N synopses S={S.sub.1,S.sub.2, . . .
S.sub.N} or an approximation thereof.
[0051] In accordance with another aspect, it is now disclosed for
the first time a method of computing an estimated synopsis error E
associated with a plurality of N relations R={R.sub.1,R.sub.2, . .
. R.sub.N} and a plurality of N synopses S={S.sub.1,S.sub.2, . . .
S.sub.N}, the ith synopsis associated with the ith relation, where
row.sub.i is the number of rows in the ith relation, the method
comprising:
a) computing a profiling function of the plurality of N relations
including at least one of: i) a data dispersion function; and ii) a
data distribution skew function, and b) deriving the estimated
synopsis error E from the computed profiling function, wherein the
method is carried out without effecting more than
0.00001*.SIGMA..sub.i=1.sup.N row.sub.i
row read operations and without effecting more than 100*N row read
operations.
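A non-limiting sketch of how such a row-read budget can be respected: profile the data distribution skew from a fixed-size sample rather than from a full scan, so that the number of row reads is set by the sample size and not by the total row count. The rank-frequency fit below and the synthetic sample are illustrative assumptions, not the claimed profiling functions.

```python
import numpy as np

def zipf_skew_from_sample(values, min_count=3):
    """Estimate a Zipf-like skew parameter z from a small row sample:
    under f(rank) ~ rank^-z, log-frequency is linear in log-rank."""
    _, counts = np.unique(np.asarray(values), return_counts=True)
    freqs = np.sort(counts)[::-1].astype(float)
    freqs = freqs[freqs >= min_count]        # drop the noisy singleton tail
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Read only 4000 rows of a (hypothetical) heavily skewed attribute,
# regardless of how many rows the relation itself contains.
rng = np.random.default_rng(1)
sample = rng.zipf(2.0, size=4000)
z = zipf_skew_from_sample(sample)
```

The estimate is approximate, but FIG. 19 of this disclosure makes the same point: a small sample suffices to approximate the skew within a modest error.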
[0052] In accordance with another aspect, it is now disclosed for
the first time an apparatus for computing an estimated
error E associated with a plurality of N relations
R={R.sub.1,R.sub.2 . . . R.sub.N} and a plurality of N synopses
S={S.sub.1,S.sub.2, . . . S.sub.N}, the ith synopsis associated
with the ith relation, where row.sub.i is the number of rows in the
ith relation, the apparatus comprising: a) a data storage operative
to store a representation of the plurality of N relations
R={R.sub.1,R.sub.2, . . . R.sub.N}; and b) a synopsis error
estimation engine operative to: i) compute at least one profiling
function selected from the group consisting of:
A) a data dispersion function; and B) a data distribution skew
function; for at least one relation of the plurality of N
relations; and ii) derive the estimated synopsis error E from the
at least one computed profiling function, wherein the synopsis
error estimation engine is operative to carry out the computing of
the at least one profiling function and the deriving of the
estimated synopsis error E without effecting more than
0.00001*.SIGMA..sub.i=1.sup.N row.sub.i
row read operations and without effecting more than 100*N row
read operations.
[0053] Embodiments of the present invention provide apparatus for
carrying out any disclosed method.
[0054] Embodiments of the present invention provide a computer-readable
medium comprising program instructions, wherein when
executed the program instructions are operable to carry out any
disclosed method.
[0055] These and further embodiments will be apparent from the
detailed description and examples that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
[0056] FIGS. 1 and 2A-2B describe prior art routines for computing
synopsis error.
[0057] FIG. 3 describes a prior art memory reconciliation
technique.
[0058] FIG. 4 describes a technique for estimating synopsis error
in accordance with some embodiments of the present invention.
[0059] FIG. 5A describes a technique for estimating: a first
synopsis error associated with a first memory allocation vector and
a second synopsis error associated with a second memory allocation
vector in accordance with some embodiments of the present
invention.
[0060] FIG. 5B describes a technique for estimating: a first
synopsis error associated with a first query workload and a second
synopsis error associated with a second query workload in
accordance with some embodiments of the present invention.
[0061] FIG. 5C describes a technique for estimating: a first
synopsis error associated with a first set of relations and a
second synopsis error associated with a second set of relations in
accordance with some embodiments of the present invention.
[0062] FIG. 6 describes a memory reconciliation technique in
accordance with some embodiments of the present invention.
[0063] FIG. 7A describes a system for synopsis error estimation in
accordance with some embodiments of the present invention.
[0064] FIG. 7B describes a system for memory reconciliation in
accordance with some embodiments of the present invention.
[0065] FIG. 8 describes a routine for building a set of synopses
using a low-read error estimation in accordance with some
embodiments of the present invention.
[0066] FIG. 9 describes an exemplary synopses reconciliation
process.
[0067] FIGS. 10A-10B describe an exemplary calibration process.
[0068] FIG. 11 describes an exemplary profiling process.
[0069] FIG. 12 describes an exemplary process for building a
synopsis.
[0070] FIG. 13 describes the error as a function of the workload
query range for the Golden Estimator synopsis. The "measured" plot
is the error measured by testing the synopsis by querying the
database, and the "estimated" plot is the plot of the fitted
power-law function.
[0071] FIG. 14 describes the synopsis error as a function of the
synopsis size as a percent of the relation size using the Spline
synopsis (left), and the normalized error of the same instances
(right). Although the lines are not parallel, which suggests
correlation between the parameters, the error is less than 6%, and
on average only 0.8%.
[0072] FIG. 15 describes the error vs. relation data range (left)
and the error vs. number of zero or missing values in the relation
(right).
[0073] FIG. 16 describes pseudo-code of the calibration process.
The output of the first phase consists of the power-law
coefficients b.sub.1, . . . ,b.sub.5 and the output of the second
phase consists of the weight coefficients a.sub.1, . . . ,a.sub.6.
[0074] FIG. 17 describes the calibration process. The calibration
of a synopsis consists of least-square fittings for the various
instance-parameters, and an iterative refinement process of
selecting random instance-parameters, measuring the error, raising
the instance-parameters to the value of the power-law coefficients
and solving the linear equations. The process continues until the
variance of the total error of the solution is smaller than a
pre-specified error parameter e or until the number of iterations
is bigger than g. The result is the error-estimation function EE.sub.t.
[0075] FIG. 18 describes the error (y-axis) vs. the
instance-parameters: relation distinct count (from 256 to 2048),
relation data distribution skew (from 0.02 to 0.6 Zipf), workload
query distribution skew (from 0.02 to 0.6 Zipf), workload query
range (from 1% to 25% of the relation size) and synopsis size (from
1% to 70% of the relation size). The plot with triangles is the
measured error, and the plot with squares is the fitted power-law
function.
[0076] FIG. 19 describes the required sample size as a percent of
the relation distinct count to achieve a maximum approximation
error of 0.025 Zipf when calculating the relation data distribution
skew.
The results are sorted according to the sample size in descending
order.
[0077] FIG. 20 describes synopses error estimation ranges. The
diagram describes the ranges of the possible estimation errors
around the approximate query result. The first range is the
estimation of the synopsis error by the error-estimation function,
and after that the error of the error-estimation function (EE). The
intensity of the color represents the probability of the error
estimation error, as in FIG. 15.
[0078] FIG. 21 describes an exemplary solution to the dynamic
synopses reconciliation problem. The query workloads and the
updates to the relations are intercepted and collected by the
Updates Logger, and then transferred to the reconciliation server,
where they are profiled. The synopses reconciliation process
receives as input the profiles of the relations and query
workloads, the error-estimation functions, the available synopses
types and the memory space limit, and outputs the recommended
combination. The recommended combination is then transferred back
to the operational database for implementation.
[0079] FIG. 22 describes data flow relationships between the
different data structures and processes of the Synopses
Reconciliation Server.
[0080] FIG. 23 describes measured error vs. estimated error using
several synopses types on 50 instances with randomly selected
instance-parameters with synthetic data sets. The instance results
are sorted by the measured error in decreasing order. Note how the
estimated error is quite close to the measured error for all the
synopses types used.
[0081] FIG. 24 describes measured error vs. estimated error using
Spline synopsis on 50 instances with randomly selected
instance-parameters with a real-world data set. The instance
results are sorted by the measured error in decreasing order.
[0082] FIG. 25 describes robustness of the error-estimation to the
assumption that the data has a Zipf-like distribution. In this
experiment we took a relation with a perfect Zipfian distribution,
and gradually modified the data moving the distribution away from
Zipfian-like. The plot shows the measured and estimated error as a
function of the variance of the distribution distance from
Zipfian-like. Note how the estimated error remains the same while
the measured error decreases, since the Zipf parameter of the
Zipfian distribution that is computed to approximate the actual
distribution is not affected by the changes that we made, but the
actual data distribution was made easier to approximate.
[0083] FIG. 26 describes measured and estimated error using Golden
Estimator Synopsis (left) and Workload-based Greedy Wavelet (right)
on 50 instances with randomly selected instance-parameters using
large relations sizes 16 m to 128 m tuples. The instance results
are sorted by the measured error in decreasing order.
[0084] FIG. 27 describes measured and estimated error of running
150 queries of an instance consisting of a relation with 64 m
tuples with data distribution skew Zipf parameter of 0.4, and a
Workload-based Greedy Wavelet synopsis of 100 coefficients. The
queries are sorted by the measured error. Note that the EEE is less
than 20% for 96.7% of the queries.
[0085] FIG. 28 describes error of measured-based and
estimated-based synopses reconciliation, using the KW algorithm
over 20 instance-groups consisting of instances with randomly
selected instance-parameters and two synopses types--Golden
Estimator synopsis (left) and Spline synopsis (right). The
instance-groups are sorted by the error of the measured-based
combination in decreasing order.
[0086] FIG. 29 describes error of measured-based and
estimated-based combinations using both KW and JJOT algorithms over
20 instance-groups consisting of instances with randomly selected
instance-parameters and the Spline synopsis type. The instance
results are sorted by the measured error in decreasing order. The
results of the algorithms are so close together that the four
plots are almost indistinguishable.
[0087] FIG. 30 describes error of measured-based KW and the
estimated-based JJOT over 7 instance-groups consisting of instances
with randomly selected instance-parameters. The synopses
combinations are formed of a mixture of all the synopses types
together. The instance results are sorted by the measured error in
decreasing order.
[0088] FIG. 31 describes synopses reconciliation using Spline
synopsis alone, Workload-based greedy wavelet alone, and using a
combination of the two (All). The instance results are sorted by
the measured error of "All" in decreasing order. Note that the
combination of both synopses is always better (or at least equal)
to the total relative error achieved by using only one synopsis
type.
[0089] FIG. 32 describes comparison of the run time between the
measured-based KW and JJOT and the estimated-based KW and JJOT,
using 4 relations with increasing size (x-axis). Note that the
running times of the estimated-based KW and JJOT are constant (less
than 1 second), and are not affected by the size of the
relations.
[0090] While the invention is described herein by way of example
for several embodiments and illustrative drawings, those skilled in
the art will recognize that the invention is not limited to the
embodiments or drawings described. It should be understood that the
drawings and detailed description thereto are not intended to limit
the invention to the particular form disclosed, but on the
contrary, the invention is to cover all modifications, equivalents
and alternatives falling within the spirit and scope of the present
invention. As used throughout this application, the word "may" is
used in a permissive sense (i.e., meaning "having the potential
to"), rather than the mandatory sense (i.e., meaning "must").
DETAILED DESCRIPTION OF EMBODIMENTS
[0091] The present invention will now be described in terms of
specific, example embodiments. It is to be understood that the
invention is not limited to the example embodiments disclosed. It
should also be understood that not every feature of the presently
disclosed apparatus, device and computer-readable code for synopsis
error estimation and/or for reconciliation is necessary to
implement the invention as claimed in any particular one of the
appended claims. Various elements and features of devices are
described to fully enable the invention. It should also be
understood that throughout this disclosure, where a process or
method is shown or described, the steps of the method may be
performed in any order or simultaneously, unless it is clear from
the context that one step depends on another being performed first.
It is noted that any disclosed component may be implemented in
hardware, software or any combination thereof.
[0092] The present inventors are now disclosing a technique for
estimating a search synopsis error associated with: (A) a set of
relations R={R.sub.1 . . . R.sub.N}; (B) a set of synopses
S={S.sub.1 . . . S.sub.N} (i.e. where each synopsis is a synopsis
of a respective relation); and (C) optionally, a query workload Q
specifying a plurality of queries directed to the set of
relations.
[0093] Surprisingly, there is no need to: (i) generate the synopsis
or an approximation thereof; (ii) run queries on the relations; and
(iii) run queries on the synopses. Thus, in some embodiments, the
presently-disclosed techniques are `low-read` techniques requiring
only a small number of row reads from the relations. This allows
for a technique that is `fast` and/or uses fewer computational
resources.
[0094] Instead of running expensive queries and/or generating
relations, it is sufficient to: (i) carry out a so-called
`calibration` technique using `synthetic data` or `synthetic
relations`--i.e. data that need not depend on the relations and
need not depend on the query workload; (ii) carry out a
`light-footprint` or relatively low-read profiling technique of the
query workload and the relations.
[0095] Because the calibration process does not require the
relations and the query workload as input, it is possible to effect
the calibration as a `pre-processing` step--for example, as a device
configuration step that is carried out `in the factory` and is
valid once for the lifetime of the device.
[0096] During the profiling process, on the other hand, an
indication of the relations R={R.sub.1 . . . R.sub.N} and/or an
indication of the query workload is processed.
[0097] FIG. 4 provides a flowchart of an exemplary technique for
computing synopsis error for A) a set of relations R={R.sub.1 . . .
R.sub.N}; (B) a set of synopses S={S.sub.1 . . . S.sub.N} (i.e.
where each synopsis is a synopsis of a respective relation); and
(C) a query workload Q specifying a plurality of queries directed
to the set of relations.
[0098] In FIG. 4, the calibration process S167 is run using
so-called `synthetic` data. Because the calibration process does
not require the relations and does not require the query workload,
it is possible to carry this out in a pre-processing stage or
`offline.`
[0099] In step S133 a memory allocation candidate associated with a
memory allocation vector (i.e. where the total memory does not
exceed M) is defined.
[0100] Skipping ahead to step S183, it is noted that the synopsis
error may be computed without running all of the queries and
without generating all of the synopses.
[0101] In one non-limiting example, related to the function defined
in equation (1) of Theoretical Section 2.1, The Error-Estimation
Function (i.e. as provided in the latter section), a generic
functional form is provided, and the coefficients (the `a`
coefficients for the specific example of equation (1) of
Theoretical Section 2.1) are computed during the calibration
process S167.
[0102] In the non-limiting example of equation (1), parameters `L`
and `Z` are computed when profiling the relations S171. It is noted
that `Z` is just one particular example of a `relations skew`
function (i.e. a third statistical moment). Other examples relate
to indications of second-order statistical moments or `data
dispersion functions`, or to functions indicative of higher-order
statistical moments.
[0103] In the non-limiting example of equation (1), Q and R are
computed when profiling the workload S175. It is noted that there
is no specific requirement to profile the workload, and in some
embodiments, a pre-determined `generic` workload is used.
Nevertheless, it is noted that there are situations where the
accuracy of the estimation of the synopsis error may be improved
when utilizing the workload and profiling the workload.
[0104] In step S179, a function of the size of the synopsis (or a
function of memory allocation vector) is computed. In the example
of equation (1), this function relates to the `S` parameter.
[0105] Unlike the techniques described in FIGS. 1-3, for the
technique described in FIG. 4, there is no need to provide or
generate a representation of the actual synopses S={S.sub.1 . . .
S.sub.N}. Rather, it is sufficient to specify S131 a synopses size
vector m.sub.1 . . . m.sub.N where each m.sub.i specifies a
respective synopses size or allocated memory. In the example of
FIG. 4, the total memory size is less than or equal to some overall
size M.
[0106] In accordance with these specific functions, it is possible
to compute S183 an estimated synopsis error without: (i) running
the query workload and/or any portion thereof on the relations
and/or synopses; (ii) generating the synopses and/or providing a
representation of the actual synopses.
[0107] Functions Used in Profiling S187
[0108] It is stressed that equation (1) of Theoretical Section 2.1
is just one example of a technique for computing synopsis error in
accordance with FIG. 4. Different examples of the specific
`profiling` functions computed in steps S171 and S175 are now
provided.
[0109] In one example, one or more `central tendency` functions
(for example, including but not limited to mean, median and the
like) of data of one or more relations are computed in step S171.
These `central tendency` functions relate to the first statistical
moment of data of one or more of the relations R={R.sub.1 . . .
R.sub.N}.
[0110] In yet another example, one or more `data dispersion`
functions providing an indication of a magnitude of data
distribution around a central value (for example, including but not
limited to a Zipf function and a standard deviation function) of
data of one or more relations are computed in step S171. These
`data dispersion` functions relate to the second statistical moment
of data of one or more of the relations R={R.sub.1 . . .
R.sub.N}.
[0111] In yet another example, one or more `data skew` functions
providing an indication of a magnitude of data skew of data of one
or more relations are computed in step S171. These `data skew`
functions relate to the third statistical moment of data of one or
more of the relations R={R.sub.1 . . . R.sub.N}.
[0112] It is appreciated that, in some embodiments, higher order
(i.e. 4.sup.th moment or greater) functions may be computed.
Furthermore, it is appreciated that in some embodiments, some
combination of the aforementioned function(s) is computed.
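The relation-profiling alternatives above (central tendency, dispersion, skew) can be sketched as one light-footprint pass over a relation attribute. This is a generic standardized-moment illustration, not the patent's specific profiling functions; the column values are hypothetical.

```python
import math

def profile_relation(column):
    """First three standardized moments of one relation attribute:
    central tendency (mean), dispersion (standard deviation), skew."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(var)
    # Third standardized moment; zero dispersion implies zero skew.
    skew = sum((x - mean) ** 3 for x in column) / n / std ** 3 if std else 0.0
    return {"mean": mean, "std": std, "skew": skew}

# A long right tail (the value 8) yields a positive skew.
p = profile_relation([1, 1, 1, 2, 2, 3, 8])
```

Such a profile is cheap to compute (one pass, or a sampled pass as discussed above) and is the only relation-dependent input the error estimation needs.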
[0113] A Discussion of Profiling the Query Workload S175; A
Discussion of Filtering Criteria
[0114] In some embodiments, the `profiling of the query workload`
S175 is one example of computing a function of the `filtering
criteria` of a set of queries.
[0115] It is noted that a `query` for a given target (i.e. one or
more relations, one or more summaries, etc) provides a
specification of a sub-plurality of stored rows selected from a
given plurality of stored rows--i.e. the query provides `filtering
criteria.` As used herein, `filter criteria` refers to one or more
of (i) a list of one or more fields of the filter criteria--i.e. a
`filter field set` of one or more fields and (ii) a list of one or
more values specified for the fields--i.e. one or more `filter
values` of a `filter value set`; and (iii) a relationship between one or
more filter fields and one or more filter values. It is noted that
the filter criteria may refer to a single relationship or may refer
to more than one relationship.
[0116] EXAMPLE "select count(*) from employees where
salary>13000 and birth_day=`December 13` and last_name like `Smi
%`".
[0117] For this example, the `query filter field set` is
{salary, birth_day, last_name} and the `query value set` is
{13000, "December 13", "Smi"}. For the case of a string, the value set
may, in some examples, refer to a hash value of the string.
[0118] Although the aforementioned example refers to a query on a
single table or relationship, it is appreciated that filtering
criteria (and in particular, filter field sets and/or filter value
sets) of queries directed to a plurality of tables or relationships
may be analyzed--in particular, without running one or more of the
analyzed queries.
[0119] As used herein, an indication of a `query filter field set`
is an indication of one or more elements of the field set and/or a
cardinality of the field set. As used herein, an indication of a
`query filter value set` is an indication of one or more elements
of the value and/or a cardinality of the value set.
[0120] Typically, the query workload includes different queries
where each query that has different respective filter criteria (for
example, different respective filter value sets and/or filter field
sets) specified by the query.
[0121] It is now disclosed that when computing synopsis error, it
is useful to: (i) determine, for each given query of at least a
sub-set of the query workload, a respective filter value set for
the given query; (ii) to compute a function of the determined
filter value sets (for example, a range function or a distribution
skew function); and (iii) to estimate the synopsis error in
accordance with the computed function of the determined filter
value sets.
[0122] In one example, one or more `central tendency` functions
(for example, including but not limited to mean, median and the
like) of data of the determined filter value sets (i.e. for each
query) are computed in step S175. These `central tendency`
functions relate to the first statistical moment of the data.
[0123] In yet another example, one or more `data dispersion`
functions providing an indication of a magnitude of data
distribution around a central value (for example, including but not
limited to a Zipf function and a standard deviation function) of
data of the determined filter value sets (i.e. for each query) are
computed in step S175. These `data dispersion` functions relate to
the second statistical moment of data of the determined filter
value sets (i.e. for each query).
[0124] In yet another example, one or more `data distribution skew`
functions providing an indication of a magnitude of data
distribution skew of data of one or more relations are computed in
step S171. These `data distribution skew` functions relate to the
third statistical moment of data of the determined search value
sets (i.e. for each query).
[0125] It is appreciated that, in some embodiments, higher order
(i.e. 4.sup.th moment or greater) functions may be computed.
Furthermore, it is appreciated that in some embodiments, some
combination of the aforementioned function(s) is computed.
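The three moment-related function families above can be sketched as follows; the concrete formulas (mean, population standard deviation, standardized third moment) are common textbook choices used for illustration, not necessarily the ones a given embodiment would compute:

```python
import statistics

def central_tendency(values):
    # first statistical moment (as in step S175): e.g. the mean
    return statistics.mean(values)

def dispersion(values):
    # second statistical moment (as in step S175): e.g. population std. dev.
    return statistics.pstdev(values)

def distribution_skew(values):
    # third statistical moment (as in step S171): standardized skewness
    m, s, n = statistics.mean(values), statistics.pstdev(values), len(values)
    return sum(((v - m) / s) ** 3 for v in values) / n

data = [1, 2, 2, 3, 10]
print(central_tendency(data))   # → 3.6
print(distribution_skew(data))  # positive: long tail to the right
```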
A Brief Discussion of Techniques Where Synopsis Errors are Computed
for Multiple Circumstances of Input Parameters Where Results of
Earlier Synopsis Error-Estimations are Re-Used for Latter Synopsis
Error Computations
[0126] Once again, referring to the non-limiting example of
equation (1) of Theoretical Section 2.1, it is noted that the
synopsis error function includes different components: (COMPONENT
A) a component that depends exclusively on the results of profiling
the relations and that is independent of the query workload and the
size of the synopses (and the synopses themselves) (i.e. the first
two terms of the polynomial on the right hand side of equation (1)
associated with the `L` and the `Z` terms); (COMPONENT B) a
component that depends exclusively on the results of profiling the
query workload (i.e. the third and fourth terms of the polynomial
on the right hand side of equation (1) associated with the `Q` and
`R` terms) and that is independent of the relations and the size of
the synopses; and (COMPONENT C) a component that depends only on
the size of the synopses and that is independent of the relations
and the query workload (i.e. the fifth term of the polynomial on
the right hand side of equation (1) associated with the `S`
term).
[0127] The present inventors have thus noted that when computing
synopsis error for a plurality of circumstances, it is possible,
surprisingly, to reuse results associated with `earlier` estimated
synopsis computations (i.e. for a `first` estimated synopsis
computation for a first set of circumstances or first set of input
parameters) when effecting an additional estimated synopsis
computation (i.e. for a `second` estimated synopsis computation for
a second set of circumstances or a second set of input
parameters).
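A minimal sketch of this reuse, assuming the three-component split described for equation (1): components A and B (the expensive profiling-dependent parts) are cached after the first computation, and only component C is recomputed for new circumstances. The component formulas and coefficients below are invented for illustration:

```python
_cache = {}

def component_A(relation):
    # COMPONENT A: depends only on profiling the relations (expensive)
    return 0.5 * len(relation) ** -0.2

def component_B(workload):
    # COMPONENT B: depends only on profiling the query workload (expensive)
    return 0.3 * len(workload) ** -0.1

def component_C(size):
    # COMPONENT C: depends only on the synopsis size (cheap)
    return 2.0 * size ** -0.5

def estimated_error(relation, workload, size):
    # profile once, then re-use the stored intermediate results
    if "A" not in _cache:
        _cache["A"] = component_A(relation)
    if "B" not in _cache:
        _cache["B"] = component_B(workload)
    return _cache["A"] + _cache["B"] + component_C(size)

rel, wl = list(range(1000)), list(range(50))
e1 = estimated_error(rel, wl, 10)   # first circumstances: profiles computed
e2 = estimated_error(rel, wl, 20)   # second circumstances: profiles re-used
```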
[0128] This concept is explained with reference to: (i) synopsis
size with reference to FIG. 5A; (ii) query workload with reference
to FIG. 5B; (iii) the relations themselves with reference to FIG.
5C.
[0129] Thus, referring to FIG. 5A, it is noted that in step S211
the synopsis error is computed for a first memory allocation vector
associated with a first `synopses size.` In the course of such a
computation, it is possible that the relations (see step S171 of
FIG. 4) and/or workload (see step S175 of FIG. 4) are profiled.
[0130] In step S215, the synopses error is estimated for a second
memory allocation vector or synopsis of a different size. It is
noted that there is no need to re-compute the profiling of the
query workload and there is no need to re-compute the profiling of
the data relations when computing, in step S215, the synopsis error
for the second memory allocation vector. Instead, it is possible to
re-use the results of computing the first synopsis error
E.sup.1 or `stored intermediate results,` for example, results of
profiling the relations and/or the queries.
[0131] It is noted that the example of FIG. 5A is useful, for
example, for the case of memory reconciliation where the errors are
estimated for different memory allocation vectors as in FIG. 4.
Thus, when computing the estimated synopsis error for a given
memory allocation vector, it is possible to re-use earlier results
from an earlier-computed different memory allocation vector.
[0132] In FIG. 5B it is noted that the first synopsis error is
computed for a certain query workload in step S221. In step S225,
it is noted that it is unnecessary to re-compute profiles of the
relations and/or size parameter functions when the query workload
changes. Instead, it is possible to re-use earlier results from
step S221.
[0133] In FIG. 5C, it is noted that the first synopsis error is
computed for a certain set of relations in step S231. In step S235,
it is noted that it is unnecessary to re-compute profiles of the
workload and/or size parameter functions when the relations
themselves change. Instead, it is possible to re-use
earlier results from step S231.
[0134] It is noted that, in some embodiments, FIGS. 5A-5C may be
said to relate to the dynamic synopses reconciliation problem,
which is to update the recommended set of synopses in the face of
changes to the underlying relations and/or query workload and/or
memory allocation vector.
[0135] FIG. 6 provides an exemplary flow chart of an exemplary
technique for memory reconciliation in accordance with some
examples of the present invention. Thus, after calibrating S167 and
profiling S187, the synopsis error is computed S171j for each
defined S131j memory allocation candidate in accordance with the
results of the calibration S167 and profiling S187, and in
accordance with a computed S179 function of a synopsis size
parameter (i.e. a function of the memory allocation vector).
[0136] It is noted that the order of FIG. 6 is not intended to be
limiting. For example, it is possible to profile S187 after
defining a memory allocation candidate or vector.
[0137] It is noted that FIG. 6 is just one example of a technique
for memory reconciliation, and presently disclosed teachings
related to synopsis error estimation may be applied to memory
reconciliation in a number of ways, including in manners that are
different from or that contradict the example of FIG. 6.
[0138] FIGS. 7A and 7B provide diagrams of exemplary systems for
error estimation 50 and memory reconciliation 90.
[0139] Each system may include a `data storage` (not shown). For
the present disclosure, the term `data storage` may include any
combination of volatile and/or non-volatile memory (for example,
persistent storage including but not limited to magnetic media and
flash memory). The data storages may reside on a single machine, or
may be `distributed` over a plurality of machines, and may be
deployed in any architecture.
[0140] In FIG. 7A, the system 50 includes a synopsis error
estimation engine (or computation engine) 60 for computing an
estimated synopsis error.
[0141] In FIG. 7B, the system 90 may further include a memory
allocation candidate definition engine 70 (for defining candidate
memory allocation vectors as in step S131j) and a candidate
selection engine 80 (for selecting the best candidate as in step
S159).
USAGE EXAMPLES
[0142] Two usage examples are discussed in this section. As with
other sections, this section is provided for the purpose of
exemplification, and is not intended as limiting.
[0143] Large databases: Large databases with many large tables may
require a large number of synopses to improve performance by
providing approximate query answers when an exact result is not
necessary. Since the available memory is an expensive resource, it
would be very useful to have an automatic low-cost procedure that
will manage these synopses, achieving a low global error over all
synopses.
[0144] Large databases include but are not limited to data
warehouses and text databases. In one non-limiting example, the
text database is generated by a search engine or a component
thereof (including but not limited to, for example, an Internet
search engine or a so-called intranet search engine) which creates
an index of text content or of features of text content. This index
may include one or more stored relations. In a specific example,
certain search engine services may invoke queries on the indexed
and stored textual data, and it is desired to provide these search
engine services in an efficient manner.
[0145] In another non-limiting example, the search engine or a
component thereof indexes media content (or features
thereof)--i.e. audio and/or video media. In a specific example,
certain search engine services invoke queries of data derived from
the indexed media content, and it is desired to efficiently provide
search engine services.
[0146] Reconciliation of Internet Router Synopses: Internet routers
handle packets of data passing through them. This information can
be regarded as streaming data; synopses can be built over it, and
the same reconciliation process can be used to reconcile the space
used by these synopses to achieve better overall estimation
results.
Discussion of FIGS. 8-12
[0147] FIG. 8 describes a routine for building a set of synopses
using a low-read error estimation in accordance with some
embodiments of the present invention.
[0148] We start with the calibration of the synopses
implementations in Step 110, resulting in the error estimation
functions (120). Before we perform the reconciliation (for example,
periodically), we also need to run a Profiling process (130) on the
relations and query workloads, resulting in a data structure
holding the statistical information about the relations and query
workloads involved (140).
[0149] In the example of FIG. 8, it is possible to use the
previously calculated error estimation functions (120) and the
profiles of the relations and query workloads (140), and run the
reconciliation process (150) performing only low-read error
estimation.
[0150] The process of FIG. 8 produces the recommendations of
which synopses to build (160 & 260). Finally, it is possible to
choose to act upon the recommendations and build the recommended
synopses (170 & 270), resulting in the final set of synopses
(180 & 280). The process is iterative, and may, in some
embodiments, be repeated as the relations and query workloads
change.
[0151] FIG. 9 describes an exemplary synopses reconciliation
process in accordance with some embodiments of the present
invention. In step 310, we allocate one segment of memory to each
synopsis and perform only low-read error estimation using the
error-estimation functions and profiling information gathered in
previous processes. In step 320, we allocate one more memory
segment to each synopsis and again perform full-read (or, in the
invention, low-read) error estimation. We then loop, in step 330,
checking where the gain was greatest (step 340), and allocate one
more memory segment to that synopsis (step 350), until there is no
more free memory (step 360).
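The loop of FIG. 9 can be sketched as a greedy allocation. Here `error_fns[i](s)` stands in for the low-read error estimation of synopsis i at s memory segments, and the two synthetic error curves are invented for illustration:

```python
def reconcile(error_fns, total_segments):
    n = len(error_fns)
    alloc = [1] * n                       # step 310: one segment each
    free = total_segments - n
    while free > 0:                       # loop of steps 330 and 360
        # step 340: estimated gain from one extra segment, per synopsis
        gains = [error_fns[i](alloc[i]) - error_fns[i](alloc[i] + 1)
                 for i in range(n)]
        best = max(range(n), key=lambda i: gains[i])
        alloc[best] += 1                  # step 350
        free -= 1
    return alloc

# the steeper first curve benefits more from extra memory
fns = [lambda s: 10.0 / s, lambda s: 2.0 / s]
alloc = reconcile(fns, 10)
print(alloc)   # the first synopsis ends up with most of the segments
```

Since each iteration evaluates only the error-estimation functions, no relation or synopsis is touched during the loop.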
[0152] FIGS. 10A-10B describe an exemplary calibration process. In
the example of FIGS. 10A-10B, the equation has the same form as
equation (1) of theoretical section 2.1.
[0153] For the example of FIGS. 10A-10B, the Calibration process is
composed of two phases: 1) calculating the power-law coefficients
(b1 . . . b5 in steps 510-550) and 2) calculating the weight
coefficients (a1, . . . ,a6 in steps 560-590).
[0154] In the first phase we loop through the instance parameters
(510,540,550).
[0155] In step 520 we isolate each one of the instance parameters,
measure the synopsis error as we change it, and then in step 530 we
fit it with a Zipf parameter.
[0156] The second phase is an iterative process, running until a
good enough accuracy is achieved (590). In step 560 we randomly
pick the instance parameters of an instance and generate a
synthetic relation, query workload and synopsis accordingly. In
step 570 we raise each of the picked instance parameters to the
power of its coefficient calculated in phase 1. We then put it all
in the form of an equation with the weight coefficients as the
variables, and use a commercial solver to solve the set of
equations accumulated so far. We continue to add more and more
equations until the variance goes below a predefined constant e.
When it does, the commercial solver outputs the set of weight
coefficients.
[0157] Combining the coefficients of phase 1 and phase 2 results
in the complete error estimation function (595).
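The two phases can be sketched as follows for a two-parameter function. The stand-in `measure_error` fakes the synopsis-error measurement with known coefficients so the sketch is self-checking; the exponent-fitting trick (a ratio of differences, which cancels the additive contribution of the fixed parameters) and the 2x2 Cramer's-rule solve (in place of the commercial solver of step 580) are illustrative simplifications, not the claimed procedure:

```python
import math
import random

# 'true' coefficients hidden inside the stand-in measurement, so the
# calibration below can be checked against them
TRUE_A, TRUE_B = [2.0, 3.0], [-0.5, -1.0]

def measure_error(params):
    # stand-in for building a synopsis and measuring its actual error
    return sum(a * p ** b for a, p, b in zip(TRUE_A, params, TRUE_B))

def fit_exponent(i):
    # phase 1 (steps 510-550): vary parameter i in isolation; for
    # a*X^b plus a constant from the fixed parameters,
    # (E(4)-E(2)) / (E(2)-E(1)) = 2^b, so b = log2 of that ratio
    def E(x):
        p = [1.0, 1.0]
        p[i] = x
        return measure_error(p)
    return math.log2((E(4.0) - E(2.0)) / (E(2.0) - E(1.0)))

def fit_weights(b):
    # phase 2 (steps 560-590): accumulate equations from random
    # instances; a 2x2 Cramer's-rule solve stands in for the solver
    random.seed(0)
    rows, rhs = [], []
    for _ in range(2):
        p = [random.uniform(1.0, 10.0) for _ in range(2)]
        rows.append([p[0] ** b[0], p[1] ** b[1]])
        rhs.append(measure_error(p))
    det = rows[0][0] * rows[1][1] - rows[0][1] * rows[1][0]
    return [(rhs[0] * rows[1][1] - rows[0][1] * rhs[1]) / det,
            (rows[0][0] * rhs[1] - rhs[0] * rows[1][0]) / det]

b = [fit_exponent(i) for i in range(2)]   # recovers TRUE_B
a = fit_weights(b)                        # recovers TRUE_A
```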
[0158] In non-limiting embodiments, between 10 and 20 synthetic
tables (i.e., C of step 520) are used in the calibration process,
though any number is appropriate. In non-limiting embodiments, the
number of tests during the second phase of the calibration process
is between 15 and 100, though any number is appropriate.
[0159] In non-limiting embodiments, the attribute e which appears
in step 590, i.e., the maximal allowed variance of the error
estimation function, has values between 0.001 and 0.001, though any
value between 0 and 1 is appropriate.
[0160] FIG. 11 describes an exemplary profiling process according
to some embodiments. In Step 620, we perform low sampling on the
relation (less than 0.1%). In Step 630, we fit a Zipf parameter to
the sample taken (this requires sorting the sample and using linear
least-squares fitting to fit it with a Zipf parameter).
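A sketch of this profiling step, assuming Zipf-distributed value frequencies (frequency proportional to rank^(-z)) and a least-squares line fit in log-log space; the sample below is synthetic and the exact fitting procedure is an illustrative assumption:

```python
import math
from collections import Counter

def fit_zipf(sample):
    # step 630: sort the sampled value frequencies in descending order
    # and fit the Zipf parameter z of freq ~ rank^(-z) with a
    # least-squares line in log-log space
    freqs = sorted(Counter(sample).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# synthetic sample whose frequencies follow rank^(-1) exactly: 12, 6, 4, 3
sample = [1] * 12 + [2] * 6 + [3] * 4 + [4] * 3
print(round(fit_zipf(sample), 3))  # → 1.0
```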
[0161] FIG. 12 describes an exemplary process for building a
synopsis. The following examples are to be considered merely as
illustrative and non-limiting in nature. It will be apparent to one
skilled in the art to which the present invention pertains that
many modifications, permutations, and variations may be made
without departing from the scope of the invention.
[0162] It is noted that although the theoretical discussion relates
to two-column relations, this is not intended as a limitation and
that multi-column relations are within the scope of the present
invention. It is noted that in no way are this feature or any other
features of the disclosed examples intended as limiting; rather,
they are provided for the purposes of exemplification.
Theoretical Discussion of Exemplary Embodiments
[0163] A relevant factor in the effective utilization of data
synopses is the ability to have good a priori estimates on their
expected query approximation errors. Such estimates are useful for
the appropriate decisions regarding which synopses to build and how
much space to allocate to them, which are also at the heart of the
synopses reconciliation problem. In this section, a theoretical
discussion of a novel synopses error estimation method based on the
construction of synopses-dependent error estimation functions is
presented. These functions are computed per synopsis implementation
in a pre-processing phase using a calibration method, which does
not need to involve actual data or actual queries. Subsequently,
they are used to provide ad hoc error estimation w.r.t. given data
sets and query workloads based only on their statistical profiles,
which can also be obtained effectively in a pre-processing
phase.
[0164] In this theoretical section, we demonstrate the realization
of our approach for the particular non-limiting setup of relations
with a single attribute and range-sum queries. We define a specific
error-estimation function that works well under this setup for
several synopses implementations and the mean-relative-error
metric, and describe an efficient calibration process for computing
this function. The proposed function is in the form of a linear
combination of power-law functions over several basic parameters,
capturing the size and distribution of the relation and query
workloads. We substantiate the validity of our approach under this
setup by experimentation using several synopses implementations
over a wide variety of relations and query workloads.
[0165] We also present a novel approach to the synopses
reconciliation problem, which is to recommend which synopses to
build under a given memory space limit. Using the statistical
profile information and the error estimation functions within
synopses reconciliation algorithms results in significant
efficiency improvements by lowering to a minimum, and even
avoiding, interference to the operational databases. We show the utilization
of the error estimation functions by bounding the resulting error
when utilized for synopses reconciliation.
[0166] Furthermore, the described methods enable for the first time
a practical solution to the dynamic synopses reconciliation
problem, which is to update the recommended set of synopses in the
face of changes to the underlying relations and query workload.
[0167] The dynamic synopses reconciliation problem is to
efficiently update the recommended combination of synopses whenever
the relation or query workload change, with minimal impact to the
database. Dynamic changes to the relations or to the query workload
should be reflected with updated synopses (e.g., [1,4,8,20,23]). But
even when the synopses are kept up-to-date, these changes may
change the effectiveness of the various synopses and the overall
effectiveness of the synopses-combination. Hence, to effectively
address a dynamic setup one needs to have an effective solution to
the dynamic synopses reconciliation problem. That is, to
continuously provide input regarding the best synopses-combinations
and trigger a change of the synopses-combination when appropriate.
Tables 3 and 4 detail examples for the improvement that can be
gained from running a synopses reconciliation process after the
relations or query workloads have changed.
[0168] The estimation of the approximation-error can also be used
for other applications. For example, it can be used for reporting
the quality of an ad-hoc approximate query submitted by a user
(e.g., [2,5,9]). When submitting an approximate query the user will
receive both the approximate result and an estimation of the
approximation error.
[0169] The experiments in this theoretical section were performed
using the .tau.-Synopses system [18]. We have extended the system
with the calibration and reconciliation modules, enabling the easy
research of error-estimation functions and reconciliation
algorithms.
Theoretical Section 1
[0170] In this theoretical discussion, we present a novel approach
to synopses error estimation based on the construction of
synopses-dependent error estimation functions. These functions are
computed in a pre-processing stage using a calibration method,
which does not need to involve actual data or actual queries. They
accept as arguments statistical information on the relations and
query workloads, which can be efficiently collected in a
pre-processing phase by profiling or by using existing statistical
information in the DBMS. It is sufficient to perform the
calibration for the implemented synopses only once during load and
integration time, and the results can be used at any time
thereafter to provide ad hoc error estimation w.r.t. given data
sets and query workloads based only on their statistical
profiles.
[0171] The calibration approach is quite general, where its
particular implementation may depend on the particular data sets
and types of queries under consideration. In this theoretical
discussion, we have used Zipfian-like distributed relations with a
single numerical attribute, and query workloads consisting of
range-sum queries with a single numerical result. We describe an
error-estimation function with several basic parameters that works
well for the several synopses types that we have tested, including
Spline Synopsis [14], V-Optimal histograms [11], Standard Wavelet
[19], Workload-based Greedy Wavelet [21], and Golden Estimator
[25]. The function is in the form of a linear combination of
power-law functions over several basic parameters that efficiently
capture the size and distribution of the relations and query
workloads. We describe an efficient synopses calibration method for
deriving this error-estimation function.
[0172] In this theoretical discussion, we also present a novel
approach to synopses reconciliation, by having synopses
reconciliation algorithms use our error-estimation functions
instead of actual error testing. In particular, we modify the same
algorithms and heuristics recommended in [12,13,14,15,16] to use
the calibration-based error estimation functions and profile
information. By concentrating most of the effort on the
pre-processing phases, we are able to lower to a minimum, and even
completely avoid, interference to the operational databases during
the run-time of these algorithms, thus gaining significant
improvement in running time while maintaining good accuracy. We
have also extended our study to include support for the
reconciliation of multiple synopses types together, in order to
take advantage of the strengths of each synopsis type. In addition
to recommending the synopsis size, we also recommend which synopsis
type is most suitable, meaning that it would give a lower error
under the given space limitations.
[0173] As we utilize the profiling information and the
error-estimation functions, we are interested, in this theoretical
discussion, in the accuracy of the estimation obtained by these
functions. This accuracy is expressed in terms of the error
introduced by the error-estimation functions, denoted as the
error-estimation error. In our experimental tests, using a wide
variety of synopses, relations and query workloads, the derived
error-estimation functions estimate the mean-relative-error of the
used synopses with an average error-estimation error of 1.8%-2.3%.
For single queries the error-estimation error is less than 10% for
81%-83.7% of the queries. When used within a synopses
reconciliation process the mean-relative-error of the result is on
average 0.2%-0.38% worse than the error of the original method,
while showing a reduction of two orders of magnitude in the
run-time.
[0174] The effectiveness of performing calibration and
profiling-based synopses reconciliation is also shown in the
reconciliation error lemma, by proving a bound on the additional
error of the synopses combination recommended by a synopses
reconciliation algorithm using error estimation instead of the more
expensive measured error. In particular, we show in the
reconciliation error lemma that this additional error is bounded by
the combined error estimation errors of the recommended
combination, and of the combination that would be recommended with
measured error.
[0175] The efficient run-time of the synopses reconciliation
process and its minimal impact on the operational database enable
for the first time a practical solution to the dynamic synopses
reconciliation problem. The profiles of the relations and query
workloads can be approximated using a backing sample [6], which is
kept up-to-date as the relations and query workloads change. The
synopses reconciliation process itself uses only these profiles and
the error-estimation functions and does not access the database at
all. In this way, we can execute a synopses reconciliation process
on a separate server and generate a new recommended combination
without any impact on the database server or accessing any of the
original relations. The only cost required from the database is
the cost of rebuilding or updating the synopses according to the
new recommendations. The process of rebuilding the synopses can
also be optimized as in [8], by building the synopses based on the
same backing-sample used before. Running a second or a periodic
synopses reconciliation process, can significantly lower the total
approximation error of the system--In our experiments, a synopses
reconciliation process executed after the underlying relations and
query workload have changed, lowered the total approximation error
of the system by as much as 5%, even when only small changes in the
query workload were made.
[0176] The benefits of run-time error estimation are
significant--our experiments show a reduction of two orders of
magnitude in the run-time of the synopses reconciliation process.
For instance, an 8 minute algorithm run is reduced to less than 1
second. A summary of experimental results for synopses
reconciliation algorithms, their accuracy and run-times can be
found in Table 2.
[0177] Another application for the error-estimation functions would
be to provide the estimated error of the result of an approximate
query together with the result, reporting its quality. There are
several works on alerting the user submitting a query to the
possible approximation error of the query result. One such example
is the AQUA system [2,5], a framework for running queries on
synopses, that displays the upper limit of the approximation error
of a query. A similar approach is presented in [9]. Having a good
error-estimation would enable those frameworks and related
frameworks to display a better error-estimation for the
approximation, or provide a better prediction regarding the
accuracy of the approximate query answer.
[0178] The error-estimation function can be extended to support
relations with multiple attributes and other types of queries with
the addition of parameters to capture them (e.g., adding a
parameter with the number of attributes, and another parameter
measuring the correlation between the attributes). The methods for
both error-estimation and synopses reconciliation are applicable
also for non-relational data sets, as well as for streaming
data.
[0179] As the development and testing platform we used the
.tau.-Synopses system [18]. We have extended the system with the new
modules of calibration and reconciliation, while taking advantage
of the existing benchmarking tools, and the already integrated
synopses types, query workloads and relations. A demonstration of
the calibration and reconciliation implemented modules was given in
[17].
[0180] The rest of this theoretical discussion is organized as
follows: in Section 2 we will describe the technique for error
estimation of a given setup and in Section 3 we will show how it is
used within synopses reconciliation algorithms. Section 4 describes
the setup and experiments we made to support our approach, and
Section 5 contains the conclusions.
Theoretical Section 2 Synopsis Error Estimation
[0181] In this theoretical discussion we will use
mean-relative-error (MRE) as the error metric for measuring the
accuracy of a data synopsis .sigma. built over a relation .rho. w.r.t. a
query workload .omega.. The triplet (.rho.,.omega.,.sigma.) is
denoted as an instance. Let r.sub.1, . . . ,r.sub.n be the actual
results of query workload .omega.={q.sub.1, . . . ,q.sub.n}
submitted to the relation .rho., and let r.sub.1', . . . ,r.sub.n' be
the approximate results of submitting query workload .omega. to the
data synopsis .sigma. built over the relation .rho.. The error with
respect to the instance <.rho.,.omega.,.sigma.> is defined
as:
E(.rho.,.omega.,.sigma.)=(1/|.omega.|).SIGMA..sub.i=1.sup.|.omega.||r.sub.i-r.sub.i'|/r.sub.i
[0182] The objective of this section is to provide efficient
methods for estimating E(.rho.,.omega.,.sigma.) for any given
instance <.rho.,.omega.,.sigma.>; that is, computing an
estimated error eE(.rho.,.omega.,.sigma.). We denote
mE(.rho.,.omega.,.sigma.) as the actual error of the same instance,
which can be measured by building the synopsis .sigma. and running
the query workload .omega. on both the relation .rho. and the
synopsis .sigma.. The error estimation error, EEE, is defined
as:
EEE(.rho., .omega.,
.sigma.)=|eE(.rho.,.omega.,.sigma.)-mE(.rho.,.omega.,.sigma.)|
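The two definitions translate directly into code; the workload results below are invented for illustration:

```python
def mre(actual, approx):
    # E(.rho.,.omega.,.sigma.): mean relative error over the |omega|
    # workload results
    return sum(abs(r - ra) / r for r, ra in zip(actual, approx)) / len(actual)

def eee(estimated, measured):
    # EEE: absolute gap between estimated error eE and measured error mE
    return abs(estimated - measured)

actual = [100.0, 200.0, 400.0]
approx = [110.0, 190.0, 400.0]
m = mre(actual, approx)      # (0.10 + 0.05 + 0.00) / 3, i.e. about 0.05
print(m, eee(0.06, m))
```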
[0183] Given an instance <.rho.,.omega.,.sigma.>, the error
estimation problem is to provide effective error estimation, that
can efficiently compute an estimated error eE with small EEE. Our
general approach is to define error estimation functions,
EE.sub..tau., for different synopses types .tau.. These functions
receive as arguments some appropriately selected parameters
characterizing the instances. Thus for an instance
<.rho.,.omega.,.sigma.> the function in use is EE.sub..tau.,
for .tau.=.tau.(.sigma.). The function and instance-parameters are
defined so as to meet the following goals of efficiency and
effectiveness: (1) these instance-parameters are efficiently
computable, (2) the function is efficiently computable based on the
instance-parameters, and (3) the function is effective, meaning
that the EEE of the method defined by this function is low.
[0184] We will start with describing the nature of the
error-estimation function, and the instance-parameters that it
utilizes as arguments (Section 2.1). We then describe the method in
which the function is computed for each synopsis type, which we
call synopsis calibration (Section 2.2). The process runs a set of
tests on the synopses using synthetic relations and does not
require any particular data sets to be available. The output of
this process is a concrete error estimation function EE.sub..tau.
per synopsis type .tau.. Finally we detail the data and query
workload profiling process of efficiently obtaining the parameters
for a particular instance (Section 2.3). The output of this process
consists of the specifications of the relations and query
workloads. These are used as input for the error estimation
functions of various synopses types, and as a result we can now get
an error estimation without actually building the particular
synopses for the given instance.
Theoretical Section 2.1 The Error-Estimation Function
[0185] Our objective is to diminish or eliminate the need to access
the underlying relations and our approach is to have an
error-estimation function that accepts as parameters the
statistical characteristics of an instance. Accordingly, we have
identified a set of instance-parameters, and a general function
that could provide an effective error-estimation for the relative
error of several synopses types (detailed below, and in Section
4.2), for range-sum queries and for relations with a single
attribute whose distribution can be well approximated by a Zipf
distribution. The identified set of instance-parameters is
characterized as an instance-profile:
[0186] Instance-profile. We will define instance-profile to be a
5-tuple, .DELTA.=<L,Z,Q,R,S> consisting of Relation distinct
count (L), Relation data distribution skew (Z), Workload query
distribution skew (Q), Workload query range (R) and Synopsis size
(S). Table 1 shows several examples of instance-profiles. We have
tested other parameters as well, and these were the significant
ones for all tested synopses. A description of some of the other
tested parameters is given in Section 2.1.4.
TABLE-US-00001
What                                  Unit                Profile 1  Profile 2  Profile 3
Relation Distinct Count (L)           #tuples             2,000,000  500,000    1,000,000
Relation Data Skew (Z)                Zipf parameter      0.3        0.5        0.1
Workload Query Distribution Skew (Q)  Zipf parameter      0.1        0.4        0.1
Workload Query Range (R)              % of relation size  10         15         10
Synopsis Size (S)                     % of relation size  0.5        1          0.7
[0187] One Exemplary error-estimation function. The error
estimation functions of this theoretical discussion are linear
combinations of power law functions of the instance-parameters:
EE.sub..tau.(.DELTA.)=a.sub.1L.sup.b.sub.1+a.sub.2Z.sup.b.sub.2+a.sub.3Q.sup.b.sub.3+a.sub.4R.sup.b.sub.4+a.sub.5S.sup.b.sub.5+a.sub.6 (1)
where the coefficients a.sub.i=a.sub.i(.tau.) and
b.sub.i=b.sub.i(.tau.) depend on the synopsis type .tau. and will
be calculated in the calibration process described in Section 2.2.
The a.sub.i coefficients are also used to normalize the
instance-parameters which have different units of measure. The
proposed error-estimation function was tested successfully on the
following synopses types: Spline Synopsis [14], V-Optimal [11],
Standard Wavelet [19], Workload-based Greedy Wavelet [21] and
Golden Estimator [25]. It was tested using various instances, and
proved to be quite effective for the types of synopses, query types
and distributions we are focusing on. In our experiments, using a
wide variety of instances and the synopses types listed above, the
error estimation function achieves an average EEE of 2.3%, and a
maximum EEE of 7% when the range of the error was between 0% and
20%. The relative error [EEE/mE] is at most 2.6.
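Evaluating equation (1) for an instance-profile is a direct computation; the coefficients below are invented for illustration (real a.sub.i and b.sub.i come out of the calibration process of Section 2.2):

```python
def ee(profile, a, b):
    # equation (1): a linear combination of power-law functions of the
    # instance-parameters <L, Z, Q, R, S>, plus the constant a[5]
    L, Z, Q, R, S = profile
    return sum(c * p ** e for c, p, e in zip(a, (L, Z, Q, R, S), b)) + a[5]

profile_1 = (2_000_000, 0.3, 0.1, 10.0, 0.5)   # Profile 1 of Table 1
a = [0.4, 0.1, 0.1, 0.2, 0.3, 0.01]            # hypothetical weights a1..a6
b = [-0.1, 0.5, 0.5, 0.2, -0.4]                # hypothetical exponents b1..b5
print(ee(profile_1, a, b))
```

With a negative exponent on S, a larger synopsis yields a lower estimated error, as expected.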
Theoretical Section 2.1.1 Power-Law Approximation
[0188] We show that the error as a function of each parameter X can
be approximated using a power-law function (aX^b). That is, if
we fix all the instance-parameters except one, the error as a
function of the remaining parameter X can be effectively approximated
using a function of the form aX^b + c for some constants a, b and c.
FIG. 13 shows an example of the approximation of the workload query
range R to a power-law function. It shows the estimated power-law
function and the measured error vs. R which varies from 1% to 70%
of the relation distinct count. All the other instance-parameters
are fixed. It can be seen that a power-law function quite
accurately approximates the error as a function of this parameter.
Tests on several synopses using various instances have shown this
to be an effective approximation that is good for synopsis error
estimation.
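As an illustrative sketch (synthetic data, not the paper's measurements, and `fit_power_law` is a hypothetical helper), the fitting idea can be reproduced with a least-squares fit in log-log space when the additive constant c is negligible:

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ~ a * x**b by linear least squares in log-log space.

    This recovers a and b when the additive constant c in a*x**b + c
    is negligible; fitting the full three-constant form requires a
    nonlinear solver instead.
    """
    # log y = b * log x + log a is a straight line, so degree-1 polyfit works.
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_a), b

# Synthetic "error vs. query range" curve: R varied from 1% to 70% of
# the relation, error generated from 2.0 * R**-0.5.
R = np.linspace(0.01, 0.70, 20)
err = 2.0 * R ** -0.5
a, b = fit_power_law(R, err)   # recovers a close to 2.0 and b close to -0.5
```

On noisy measured data, additional test points (as the calibration process runs) simply add rows to the same least-squares fit.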
Theoretical Section 2.1.2 Independence of the
Instance-Parameters
[0189] For simplicity and efficiency, it is convenient to assume
that the instance-parameters are independent of each other,
enabling us to use a linear combination of power-law functions in
the function definition. The penalty of this convenience is
inaccuracy that would result from correlation between the different
parameters. Experimental studies show that this penalty is
reasonably small. FIG. 14 details an experiment performed using the
Spline synopsis over 10 instances with random instance-parameters.
The plots on the left show the measured error of each instance as a
function of the synopsis size as percent of the relation size. The
fact that the plots are not parallel suggests that the
instance-parameters have some correlation between them. The plots
on the right represent the same instances, only we normalized the
plots by subtracting the average of each plot. In this experiment,
the maximum error contributed by the correlation is 6%, and on
average 0.8%. Similar experiments were performed using the other
synopses types and the other instance-parameters, generating
similar results.
[0190] Because we assume the instance-parameters are independent,
the function can be described as a linear combination of their
individual effects on the error, and because each parameter's
effect on the error can be approximated using a power-law function,
the function takes the form of a sum of power-law functions.
Theoretical Section 2.1.3 Error Function Analysis
[0191] The strengths and weaknesses of each synopsis type can be
derived from its error-estimation function. Recall that the
error-estimation function is of the form:

EE_τ(Δ) = a_1·L^b_1 + a_2·Z^b_2 + a_3·Q^b_3 + a_4·R^b_4 + a_5·S^b_5 + a_6 (2)

[0192] The bigger the coefficients a_i, the bigger the error
will be. Since the value of L is bigger than 1, the bigger the
coefficient b_1 is, the bigger the error will be. Since the
values of Z, Q, R and S are between 0 and 1, the smaller the
coefficients b_i (i>1) are, the bigger the error will be.
Furthermore, since we assume the instance-parameters are
independent, one can look at each of the power-law sub-functions
separately. The coefficients of the sub-functions indicate the
behavior of the synopsis for that parameter. If, for example, a
synopsis has a high value of b_4, then this synopsis performs
poorly for large query ranges, and vice versa.
Theoretical Section 2.1.4 Parameters with Smaller Effect on the
Error
[0193] The following parameters were suggested in some articles,
and testing them, as we did for the instance-parameters, shows
that they have little effect on the error. We took a specific
instance, and modified only the parameter that we want to test.
Most of the results showed an arbitrary error, caused mostly by the
randomness of the workload and of the sampling used. [0194] 1.
FIG. 15 (left) shows the error as a function of the relation data
range. Since the error is normalized by the value of the real data,
we do not expect any effect at all of the data range on the error.
We used 10 relations, each with 500 k tuples, with different data
ranges (x-axis). The query workload had 300 random queries
with an average range of 40, and over each relation we built a
Golden Estimator synopsis with size 100 buckets. The results of
this experiment and similar experiments using other synopses types,
show very small fluctuations of less than 1%. [0195] 2. FIG. 15
(right) shows the error as a function of the number of non-zero
data values in the relation. For example, a value of 100% means
that all the keys exist in the relation, and a value of 60% means
that only 60% of the keys exist in the relation. We again used 10
relations with 500 k tuples, with different numbers of non-zero keys
(x-axis). The query workload had 300 random queries with an
average range of 40, and over each relation we built a Golden
Estimator synopsis with size 100 buckets. The results of this
experiment, and similar experiments using other synopses types,
show an arbitrary effect on the error.
Theoretical Section 2.2 Synopses Calibration
[0196] The calibration process computes the a_i and b_i
coefficients of the error-estimation function EE_τ for a
given synopsis type τ. This process only needs to be performed
once per synopsis type, as a pre-processing phase, and the
resulting function can later be used to estimate the error of a
synopsis in a given instance, without accessing the database. This
enables many applications to avoid expensive database accesses
during their run-time phase.
[0197] The calibration process for building function (1) consists
of the following steps: [0198] 1. Parameter Tests. We run a set of
tests on each parameter separately. Because we assume the
instance-parameters are independent, it is possible to measure the
effect of a single parameter by fixing all the others. So for each
parameter, we fix all the other instance-parameters, and run
multiple tests varying it through the entire range. Note that three
tests are enough to fit a power-law function, and additional tests
are run for better accuracy. For each variable we record the
results of the tests in an array with two columns: the value of the
variable and the error of the test. [0199] 2. Power-Law Fitting. We
use least-square fitting (as described in [24]) on the results of
the previous step to calculate the power-law coefficients b_1,
. . . , b_5. [0200] 3. Weight Coefficients Tests. We generate
instance j with randomly selected instance-parameters, measure the
error of the synopsis for the query workload of the instance, and
record the instance-profile and the measured error in an equation
of the form:
[0200]

a_1·L_j^b_1 + a_2·Z_j^b_2 + a_3·Q_j^b_3 + a_4·R_j^b_4 + a_5·S_j^b_5 + a_6 - M_j - EEE_j = 0

[0201] 4. where L_j, Z_j, Q_j, R_j and S_j are the parameters of
instance j, a_i are the variables that will hold the weight
coefficients of the instance-parameters, b_i are the power-law
coefficients of the instance-parameters calculated in the previous
step, M_j is the measured error of the query workload of
instance j, and EEE_j is a variable that will hold the
error-estimation error of instance j. Once this equation is built,
the only unknowns are the a_i coefficients
and EEE_j. [0202] 5. Calculate Weight Coefficients. We add the
equation from Step 3 to an accumulating set of equations and use
the ILOG CPLEX 8.0 commercial solver [10] to minimize the sum of
squares of the error-estimation errors EEE_j. The commercial
solver returns the coefficients a_1, . . . , a_6, building a
function that minimizes the sum of error-estimation errors for the
tests that were run. It also returns the variance of the
error-estimation errors, indicating the accuracy of the
error-estimation function. [0203] 6. Verify Accuracy. Note that
since we assume the instance-parameters are independent, and there
are only five instance-parameters, then under perfect conditions
five equations should have been enough. But given the errors, we
need a convergence mechanism, and we provide one by testing the
variance returned from the solver. If the variance of EEE_j is
greater than ε and the number of equations is smaller than γ (ε and
γ are pre-specified constants), go to Step 3.
[0204] FIG. 17 shows a flowchart of the calibration process, giving
an overview of the process. A pseudo-code of the calibration
process can be found in FIG. 16. First calculating the power-law
coefficients, and then solving linear equations, makes the
calculation much simpler and faster than solving equations of sums
of power-law functions, since linear equations are much easier to
solve. The accuracy of the resulting error-estimation function
depends on three constants λ, γ and ε, which control the
number of the tests and operations. The computation cost of the
error-estimation function is as follows: in Step 1 we run λ
tests per parameter, a total of 5λ tests. At most γ
more tests are performed in Step 3 (possibly fewer than γ,
depending on the value of ε), and for each of those tests, we run
the solver (Step 4). So the total cost of the calibration process
is 5λ+γ tests and at most γ executions of the solver.
Note that the number of tests and operations is a constant,
depending on the given parameters.
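The weight-coefficient step can be sketched as follows (a minimal illustration with made-up b_i coefficients and synthetic profiles, not the paper's calibration data). Once the b_i are fixed, EE is linear in a_1, . . . , a_6, so minimizing the sum of squared EEE_j reduces to ordinary least squares; here numpy's `lstsq` stands in for the commercial solver:

```python
import numpy as np

# Illustrative power-law coefficients b1..b5, as produced by Step 2.
b = np.array([-0.9, 0.1, -0.02, -0.8, -0.5])

rng = np.random.default_rng(0)
n = 12                                        # calibration instances (Step 3)
# Random instance-profiles <L, Z, Q, R, S>: L > 1, the rest in (0, 1).
profiles = np.column_stack([
    rng.uniform(1e5, 2e6, n),                 # L: relation distinct count
    rng.uniform(0.1, 0.8, n),                 # Z: relation data skew
    rng.uniform(0.1, 0.8, n),                 # Q: query distribution skew
    rng.uniform(0.01, 0.7, n),                # R: average query range
    rng.uniform(0.001, 0.05, n),              # S: synopsis size
])

# With the b_i fixed, EE is linear in a1..a6, so minimizing the sum of
# squared error-estimation errors EEE_j is a linear least-squares
# problem rather than a nonlinear one.
X = np.column_stack([profiles ** b, np.ones(n)])  # columns: L^b1 .. S^b5, 1
a_true = np.array([10.0, 0.2, 0.8, 0.002, 0.002, -1.0])
M = X @ a_true                                # stand-in "measured" errors
a_hat, *_ = np.linalg.lstsq(X, M, rcond=None)  # recovers a_true
```

This separation is exactly why the two-stage calibration is cheaper than jointly fitting all of a_i and b_i with a nonlinear solver.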
Theoretical Section 2.2.1 Example Calibration
[0205] Below is an example of running the calibration process on
the Golden Estimator data synopsis [25]. The first step was to
measure the effect of each parameter separately on the error. For
each parameter, we ran several tests, varying that parameter
through the entire range, while all the other instance-parameters
were held constant. FIG. 17 shows separate plots of the error as a
function of each parameter. Next we use least-square fitting to fit
each of the instance-parameters in FIG. 1 to the function aX^b + c.
The result of the fitting was:

b_1 = -0.874, b_2 = 0.112, b_3 = -0.015, b_4 = -0.815, b_5 = -0.509
[0206] Next we generate an instance with randomly selected
instance-parameters, measure its error, and for each parameter in
the instance-profile, we raise the parameter to the power of its
power-law coefficient, and record the equation. For example, we
generated the relation, workload and synopsis of the following
instance-profile: <1024,0.2,0.6,0.04,0.16> and measured its
error. The result of the measured error was 0.025, and so the
equation was:

a_1·1024^-0.874 + a_2·0.2^0.112 + a_3·0.6^-0.015 + a_4·0.04^-0.815 + a_5·0.16^-0.509 + a_6 = 0.025 + EEE_j
[0207] After recording each equation, we use a commercial solver to
minimize the sum of squares of EEE_j of all recorded
equations. We defined the constant ε to be 0.001, and continued to
add more and more equations, until the variance of EEE_j was
smaller than ε, after 12 tests. The commercial solver outputted the
values for a_i:

a_1 = 10.545, a_2 = 0.218, a_3 = 0.819, a_4 = 0.002, a_5 = 0.002, a_6 = -1.042

Finally, the derived error-estimation function is:

EE_τ(Δ) = 10.545·L^-0.874 + 0.218·Z^0.112 + 0.819·Q^-0.015 + 0.002·R^-0.815 + 0.002·S^-0.509 - 1.042
[0208] FIG. 23 depicts the results of 50 tests, comparing the
measured synopsis error with the estimated synopsis error returned
from the above function.
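As a quick arithmetic check (plain Python, using only the coefficients and example profile given above), plugging the example instance-profile <1024, 0.2, 0.6, 0.04, 0.16> into the derived function should land close to the measured error of 0.025, with the small gap being the error-estimation error EEE:

```python
# Coefficients of the derived error-estimation function for the
# Golden Estimator example above.
a = [10.545, 0.218, 0.819, 0.002, 0.002]
a6 = -1.042
b = [-0.874, 0.112, -0.015, -0.815, -0.509]
profile = [1024, 0.2, 0.6, 0.04, 0.16]   # <L, Z, Q, R, S>

# EE = sum of a_i * X_i ** b_i over the five parameters, plus a6.
ee = sum(ai * x ** bi for ai, x, bi in zip(a, profile, b)) + a6
# ee comes out a few thousandths below the measured 0.025.
```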
Theoretical Section 2.3 Data and Query Workload Profiling
[0209] The error-estimation function above requires as parameters
the specifications of the relation and query workload of the
instance. The function requires the relation distinct count (L),
the relation data distribution Zipf parameter (Z), the workload
query distribution Zipf parameter (Q) and the workload query
average range (R). These can also be obtained in a pre-processing
phase that needs to be performed only once, or whenever the
relations or query workload change significantly. For relations, we
estimate the cardinality of the relation (distinct count) using
techniques from [3], and fit a Zipf parameter to the relation data
distribution as described in Section 2.3.1. For query workloads,
the number and average range of the queries are calculated, and the
Zipf parameter of the query distribution is again fitted using the
same technique as described in Section 2.3.1. The computed
statistical data can be stored in a simple relational data
structure.
[0210] Any errors in the approximation of the relation or query
workload specifications will directly affect the synopsis
error-estimation. The level of the effect depends on the
coefficients calculated in the calibration process. If, for
example, a synopsis is very highly influenced by relation data
distribution skew, then it will have high weight and power
coefficients for this parameter, and any error in the approximation
of the relation data distribution skew will be immediately felt in
the result of the function. A significant change to the
specifications of the relation or query workload will have a
similar effect on the accuracy.
Theoretical Section 2.3.1 Estimating the Relation Data Distribution
Skew Zipf Parameter
[0211] It is well known that real-life relations can be well
approximated by a Zipf distribution. According to Zipf's law,
Zipfian distributed data, when aggregated and sorted, can be
approximated by a power-law function, and the Zipf parameter is the
power-law coefficient of that function. We can use the same
least-square-fitting algorithm described in [24] in order to fit a
power-law function to the sorted data, and thus find the Zipf
parameter of the relation. The same applies for query workloads--we
look at the "low" attribute of each query as the data, and profile
it the same way we profile a relation.
[0212] Sorting a very large relation consumes a lot of resources,
so in order to improve performance we identify the appropriate Zipf
parameter by using a smaller sample of the relation, using sampling
methods as in [8]. The sample is then sorted according to value,
and then least-square fitting is performed on the sorted sample
to derive the distribution skew Zipf parameter. Note that the
bigger the sample we choose, the more accurate the result will be,
but since this value is only intended for the error-estimation
function, a relatively small sample will suffice.
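The sample-based profiling procedure can be sketched as follows (a minimal illustration on synthetic data; `estimate_zipf_parameter` is a hypothetical helper, not the paper's implementation):

```python
import numpy as np

def estimate_zipf_parameter(relation, sample_size, seed=None):
    """Estimate the data-distribution skew of a relation from a sample.

    A random sample of the tuples is aggregated into per-value frequency
    counts, the counts are sorted in decreasing order, and a power law
    freq ~ a * rank**(-z) is fitted by least squares in log-log space;
    the fitted exponent z is returned as the Zipf parameter.
    """
    rng = np.random.default_rng(seed)
    sample = rng.choice(relation, size=sample_size, replace=False)
    _, counts = np.unique(sample, return_counts=True)
    freq = np.sort(counts)[::-1]              # aggregated, sorted frequencies
    rank = np.arange(1, len(freq) + 1)
    slope, _ = np.polyfit(np.log(rank), np.log(freq), 1)
    return -slope

# Synthetic relation: 400 k tuples over 1,000 distinct values, drawn with
# probability proportional to k**-0.6 (a Zipf skew of 0.6).
K, z_true = 1000, 0.6
p = np.arange(1, K + 1, dtype=float) ** -z_true
p /= p.sum()
relation = np.random.default_rng(1).choice(K, size=400_000, p=p)
z_hat = estimate_zipf_parameter(relation, sample_size=50_000, seed=2)
```

A larger sample tightens the estimate, but as noted above, a rough value suffices for the error-estimation function.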
Theoretical Section 2.3.2 Effectiveness of Profiling Using a
Sample
[0213] In order to improve performance during the profiling of the
relations, we utilize a backing-sample. We now consider how large a
sample is required in order to accurately estimate the data
distribution skew. We generated 230 synthetic relations, with
random sizes ranging from 4 million to 64 million tuples (with 4 k
to 64 k distinct counts, respectively), and random data
distribution skews ranging from 0.1 to 0.8 Zipf. We then profiled
the full relation,
and made several profiling attempts on a sample of the relation,
increasing the sample size every time, until the error in
identifying the Zipf parameter was less than 0.025. FIG. 19 shows
for each one of the relations, the minimal sample size (as percent
of the original relation) necessary for an accurate profiling of
the relation. The results show that on average a sample size of
0.0005% of the relation size (or 0.5% of the distinct count), and a
maximum of 0.006% of the relation size (or 6% of the distinct
count) is enough for an accurate profiling of the original
relation.
Theoretical Section 2.4 Single Query Error Estimation
[0214] One of the applications of the error-estimation functions is
the error estimation for a single query. When using a synopsis
to approximate a query answer, having an estimate of the accuracy
of the response is highly desirable. Some, but not all, of the
synopses give a bound on the error. This bound is not always very
tight, and the actual result might be much better.
[0215] The error estimation functions can be used to give an
estimate of the result quality for ad hoc queries. In our
experiments the error-estimation functions estimate the
approximation error of 81%-83.7% of the queries with an
error-estimation error which is less than 10%. FIG. 20 depicts how
the approximation error may be displayed, with an error range
around the approximate error result. The range consists of the
estimated error computed by EE.sub..tau. and the additional error
estimation error EEE, which is shown with different color
intensities, to represent some known probability distribution of
EEE. The probability distribution for EEE can be a global
probability distribution computed for that synopsis type, or it
could be a more refined probability distribution. For instance, it
may be the probability distribution of EE(EE.sub..tau.), that is,
for the particular value of EE.sub..tau. (as illustrated in FIG.
27).
Theoretical Section 3 Synopses Reconciliation
[0216] We show how to obtain efficient synopses reconciliation
algorithms, using the synopses error-estimation functions. Recall
that the problem of synopses reconciliation is defined as: Given a
set of relations, a query workload, a set of synopses types and a
space limit, find a combination of data synopses (synopses sizes
and synopses types) that will achieve a low approximation error for
the given workload.
[0217] The synopses reconciliation problem is a minimization
problem, and should be solved as such. There are several
challenges: there are many parameters that affect the error; the
effect is not linear; the effect is different for different
synopses types; and it is in some cases arbitrary because of the
random sampling used in many synopses. The algorithms proposed in
[12,13,14,15,16] overcome the above problems by physically testing
many combinations, and selecting the one with the lowest error. The
heuristic proposed in [12] relies on the basic assumption that as
the synopsis size grows the error shrinks, but it shrinks more and
more slowly--at first the gain is large, and the gain decreases as
the synopsis size grows. This assumption holds for most synopses
types. In [16] the authors reduce the number of accesses to the
database by eliminating many improbable combinations, and using
incremental update of the synopsis instead of fully rebuilding it.
But even with heuristics the number of accesses to the database,
which are performed only for the purpose of estimating the synopses
errors, is very large.
[0218] The main difference between our approach and previous
approaches is that during the optimization algorithm no tests are
performed against the original relations. The purpose of physically
testing an instance is to calculate the error of that instance, and
since the error-estimation function does exactly that, replacing
the physical tests with a call to the error-estimation function
will lead to an efficient synopses reconciliation algorithm.
[0219] Furthermore, the heuristic suggested in [12], "the greedy
algorithm with remedy", was built to work around the problem of the
sometimes arbitrary results output by the synopsis. The solution
was to always physically test two instances ahead instead of one in
order to avoid unexpected non-monotonic peaks. Since the
error-estimation function derived in the calibration process is
already monotonic and continuous, there won't be any peaks, and the
"remedy" they had to perform can be ignored, achieving an even
better performance gain during the synopses reconciliation process.
Estimation error of multiple synopses. The quality of a solution to
the synopses reconciliation problem is determined by the error of
the output synopses combinations. We describe here our approach for
measuring the quality of a solution. Recall that in Section 2 we
defined the estimation error of an instance
<ρ, ω, σ>. We extend this definition to
describe the error of multiple instances: Let P = ρ_1, . . . , ρ_n
be a set of relations, let Ω be the set of queries submitted to
those relations, and let Σ = σ_1, . . . , σ_n be the set of
synopses built over those relations. Given r_i, the actual result
of query q_i, and r'_i, the estimated result of query q_i when
submitted to the synopsis, the estimation error of multiple
synopses is defined as:

E*(P, Ω, Σ) = (1/|Ω|) ∑_{q_i ∈ Ω} |r_i - r'_i| / r_i
[0220] For convenience we divide the queries in Ω into query
workloads according to the relation they are submitted to, so that
Ω = ω_1, . . . , ω_n, where ω_i is the workload of queries
submitted to relation ρ_i. The definition of E* can then be
rewritten as:

E*(P, Ω, Σ) = (1/|Ω|) ∑_{j=1}^{n} |ω_j| · E(ρ_j, ω_j, σ_j)
[0221] We also define the measured and estimated errors of multiple
synopses, mE* and eE*, using the previously defined mE and eE
respectively:

mE*(P, Ω, Σ) = (1/|Ω|) ∑_{j=1}^{n} |ω_j| · mE(ρ_j, ω_j, σ_j)

eE*(P, Ω, Σ) = (1/|Ω|) ∑_{j=1}^{n} |ω_j| · eE(ρ_j, ω_j, σ_j)
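All three starred quantities are the same workload-weighted average of the per-relation errors, which a few lines of Python make concrete (an illustrative helper, not from the source):

```python
def multi_synopsis_error(workload_sizes, per_relation_errors):
    """Combine per-relation errors E(rho_j, omega_j, sigma_j) into
    E*(P, Omega, Sigma), weighting each relation's error by the number
    of queries |omega_j| in its workload."""
    total_queries = sum(workload_sizes)
    weighted = sum(w * e for w, e in zip(workload_sizes, per_relation_errors))
    return weighted / total_queries

# Two workloads: 300 queries at 2% error and 100 queries at 6% error
# combine to (300*0.02 + 100*0.06) / 400 = 0.03 overall.
e_star = multi_synopsis_error([300, 100], [0.02, 0.06])
```

The same helper computes mE* or eE* simply by passing the measured or estimated per-relation errors.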
[0222] Recall that EEE is defined to be the difference between eE
and mE. Given a set of relations P, a set of query workloads
Ω on those relations and a combination of synopses Σ,
we define EEE*(P, Ω, Σ) as:

EEE*(P, Ω, Σ) = |eE*(P, Ω, Σ) - mE*(P, Ω, Σ)|

[0223] For convenience, when P and Ω can be understood from
the context, we denote E*(P, Ω, Σ) and
EEE*(P, Ω, Σ) also as E*(Σ) and EEE*(Σ)
respectively.
[0224] The synopses reconciliation problem. The synopses
reconciliation problem can be more formally defined as follows:
Input:
[0225] P = ρ_1, . . . , ρ_n--a set of relations [0226]
Ω = ω_1, . . . , ω_n--a set of workloads
over those relations. [0227] T = τ_1, . . . , τ_k--a
set of synopses types. [0228] M--available memory. Output:
Σ = σ_1, . . . , σ_n--a set of synopses,
subject to ∑_i |σ_i| = M
Objective: Minimize E*(P, Ω, Σ)
[0229] A primitive used by all known synopses reconciliation
algorithms is to compute the error of a particular instance
<ρ_i, ω_i, σ'>, where σ' is a
candidate synopsis for the solution. The error computation is
usually done via straightforward measurement. Our profile-based
synopses reconciliation approach is to replace the error
measurement with an error estimation. In particular, E(ρ_i,
ω_i, σ') can be obtained by invoking the
error-estimation function EE_τ (with τ being the type
of σ') on the profile of the instance
<ρ_i, ω_i, σ'>, without actually
building the synopsis σ', and without accessing
ρ_i.
[0230] Given a synopses reconciliation algorithm, the combination
of synopses recommended by the algorithm may differ depending
on whether it uses an actual synopses-error measurement or the
synopsis-error estimation suggested in this theoretical discussion.
We define the measured-based combination as the
synopsis-combination recommended by the algorithm when using an
actual measurement of synopses errors, and we define the
estimated-based combination as the synopsis-combination recommended
by the algorithm when using a synopsis-error estimation
function.
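A profile-based greedy reconciliation loop can be sketched as follows (an illustrative stand-in for a [12]-style greedy algorithm, with a toy power-law error model in place of a calibrated EE_τ; all names are hypothetical). It relies on the diminishing-returns assumption discussed above:

```python
def greedy_reconcile(estimate_error, n_relations, total_units):
    """Greedy synopses reconciliation driven by an error-estimation function.

    estimate_error(i, units) returns the estimated error of relation i's
    synopsis when given `units` memory units; no synopsis is built and
    the database is never accessed during the optimization.
    """
    alloc = [1] * n_relations                 # every synopsis starts minimal
    for _ in range(total_units - n_relations):
        # Award the next unit where the estimated error drops the most.
        gain = lambda i: estimate_error(i, alloc[i]) - estimate_error(i, alloc[i] + 1)
        best = max(range(n_relations), key=gain)
        alloc[best] += 1
    return alloc

# Toy stand-in for EE_tau: the error decays as a power law in the synopsis
# size, with relation 0 "harder" to summarize than relation 1, so the
# greedy loop awards relation 0 the larger share of the memory.
coeff = [4.0, 1.0]
ee = lambda i, units: coeff[i] * units ** -0.5
alloc = greedy_reconcile(ee, n_relations=2, total_units=12)
```

Swapping `ee` for a calibrated EE_τ evaluated on stored instance-profiles yields the estimation-based variant described in [0218]; the loop itself is unchanged.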
Theoretical Section 3.1 Synopses Reconciliation Error vs. Synopsis
Error Estimation Error
[0231] Given a set of relations and a query workload, the synopses
reconciliation process attempts to find an optimal combination that
will yield a minimal error. Using an approximated error estimation
in order to improve the performance of the synopses reconciliation
process introduces errors to the optimization problem, moving the
recommended solution away from the optimal solution. We are
interested in this error and how it impacts the quality of the
synopses reconciliation results.
[0232] We show that the error-estimation error of the results of
the synopses reconciliation depends on the error-estimation error
of the error-estimation function, and that it is bounded. The
difference between the actual error of the measured-based
combination and the estimated-based combination is bounded by the
sum of the error-estimation errors at those combinations:
[0233] Lemma 1 [Reconciliation Error] Consider a synopses
reconciliation algorithm in which P is the set of relations,
Ω is the set of query workloads, C is the set of considered
synopses-combinations, Σ_m ∈ C is the
measured-based combination and Σ_e ∈ C is the
estimated-based combination. Then,

mE*(P, Ω, Σ_e) - mE*(P, Ω, Σ_m) ≤ EEE*(P, Ω, Σ_m) + EEE*(P, Ω, Σ_e)

[0234] Proof: By definition of Σ_m and Σ_e, for all
Σ ∈ C:

[0234] mE*(Σ_m) ≤ mE*(Σ) (3)

eE*(Σ_e) ≤ eE*(Σ) (4)

[0235] If eE*(Σ_e) ≤ mE*(Σ_m), then by (3) we
have

[0235]
eE*(Σ_e) ≤ mE*(Σ_m) ≤ mE*(Σ_e) [0236] and therefore

[0236]
|mE*(Σ_m) - mE*(Σ_e)| ≤ |eE*(Σ_e) - mE*(Σ_e)| = EEE*(Σ_e) [0237] If
mE*(Σ_m) ≤ eE*(Σ_e), then by (4) we
have

[0237] eE*(Σ_e) ≤ eE*(Σ_m) [0238] and
therefore

[0238] mE*(Σ_e) - mE*(Σ_m) =
mE*(Σ_e) - eE*(Σ_e) + eE*(Σ_e) - mE*(Σ_m) ≤
mE*(Σ_e) - eE*(Σ_e) + eE*(Σ_m) - mE*(Σ_m) ≤
|mE*(Σ_e) - eE*(Σ_e)| + |eE*(Σ_m) - mE*(Σ_m)| =
EEE*(Σ_m) + EEE*(Σ_e)
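The lemma can also be checked numerically (an illustrative sanity check on random data, not part of the original proof): for any finite candidate set C, the measured error of the estimated-based choice never exceeds that of the measured-based choice by more than the stated bound.

```python
import random

random.seed(0)
max_excess = 0.0
for _ in range(1000):
    # Random measured and estimated errors over 8 candidate combinations.
    mE = [random.uniform(0.0, 0.2) for _ in range(8)]
    eE = [random.uniform(0.0, 0.2) for _ in range(8)]
    m = min(range(8), key=lambda i: mE[i])    # measured-based combination
    e = min(range(8), key=lambda i: eE[i])    # estimated-based combination
    bound = abs(eE[m] - mE[m]) + abs(eE[e] - mE[e])  # EEE*(m) + EEE*(e)
    # Track the largest violation of the lemma's inequality (should be <= 0).
    max_excess = max(max_excess, (mE[e] - mE[m]) - bound)
```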
Theoretical Section 3.2 Run Time Analysis
[0239] The usage of the error-estimation functions requires several
pre-processing phases: [0240] 1. Every relation and query workload
undergoes a profiling process, measuring its specifications. Using
a backing sample as recommended above, the process will take
O(m log m) time for each relation, where m is the size of the sample.
[0241] 2. Every synopsis type goes through a calibration process,
which runs a constant number of tests on relatively small
relations.
[0242] By relying on the results of the pre-processing phases, it
is possible to complete the entire synopses reconciliation process
without accessing the original relations at all and without adding
anything to the complexity of the algorithm. Adding support for
multiple synopses types, requires the error estimation of every
instance using all synopses types, which means multiplying the
complexity of the synopses reconciliation process by the number of
synopses types.
Theoretical Section 3.3 Dynamic Synopses Reconciliation
[0243] Operational databases are dynamic, and their relations and
query workloads tend to change over time. These changes affect the
overall error of the system, and reallocation of synopses memory
space is required in order to maintain a low error. For example, if
one of the relations grows in size, while its synopsis does not,
the approximation error of that synopsis will increase, increasing
the approximation error of the entire system. In this case,
allocating more space to this synopsis, while decreasing the space
allocated to the other synopses, will probably achieve a lower
error for the entire system.
[0244] Existing synopses reconciliation algorithms rely on building
synopses and submitting queries to the original relations, tasks
that consume considerable amounts of resources and are usually not
tolerated in large operational systems. An algorithm for
efficient dynamic synopses reconciliation should minimize or
completely avoid accessing the operational database.
[0245] Dynamically updating the specific synopsis is an issue
separate from building and maintaining the best synopsis under the
given space. Both the error-estimation function and the synopses
reconciliation process assume that the synopsis is up-to-date, and
so if the relation changes, an incremental update of the synopsis
is required (see, e.g., [8,20]). Workload-based synopses should
also be dynamically updated as the query workload changes (e.g.
[22]).
[0246] The dynamic synopses reconciliation problem is as follows:
Given updates to P (that is, updates to the various relations
.rho..sub.i, and additions of new relations to the set), to
.OMEGA., to M and perhaps also to T, maintain a good approximation
for the appropriate updated .SIGMA..
[0247] The proposed architecture for the solution of the static and
dynamic synopses reconciliation problems is depicted in FIG. 9. It
is comprised of two separate servers: the Synopses Reconciliation
Server (SRS) and the Database Server, which can reside on the same
machine or on two separate machines. When a new synopsis type τ
is integrated into the system, it goes through a calibration
process (as described in Section 2.1), resulting in an
error-estimation function EE_τ. The calibration process is
performed in the calibration module of the SRS (as depicted in FIG.
9), or it can be performed on a completely different server. The
computation of the instance-profiles required for the synopses
reconciliation can be based on a backing sample of the relations
and of the query workloads [8], using the updates logger module of
the database. The synopses reconciliation process can be performed
on the SRS without any interaction with the database server, using
the pre-computed error-estimation functions and instance profiles.
The reconciliation process outputs the recommended combination and
the estimated gain of the new combination, which are then returned
to the operational database for implementation.
[0248] FIG. 10 describes the data flow relationships between the
different components of the Synopses Reconciliation Server and the
different stages where the components are used. In the
integration/registration phase the calibration and profiling
processes collect information on the synopses, relations and query
workloads and store them in the synopses characteristics and
instance profile data structures. These data structures are then
used as input for the reconciliation process and other error
estimation applications. The reconciliation results are stored in
the reconciliation recommendation data structure prior to
implementation. The approximate query optimizer executes
approximate queries submitted by users, also calling the error
estimation process to receive an estimate of the error.
[0249] The proposed solution to the problem does not require
accessing the original relations at all before or during the
synopses reconciliation process. We only need to access the
original relations for the purpose of building or rebuilding the
synopses after the synopses reconciliation process has finished.
Since the results, and an estimate of the gain, are received before
the synopses are built, one can choose whether or not to rebuild
the synopses, saving a lot of resources on fruitless attempts.
The process of rebuilding the synopses can also be optimized as in
[8], by building the synopses based on the same backing-sample used
before.
[0250] In Section 4.4 we give examples of instances where a
synopses reconciliation process executed after the underlying
relations and query workloads have changed, lowers the total
approximation error of the system by as much as 4.9%, even when
only small changes to the query workload or data distribution were
made.
[0251] In real commercial DBMSs the cost of the overhead is very
important, and the applicability of the dynamic synopses
reconciliation process depends on it. The query optimizer in most
commercial databases requires several statistics, which are usually
gathered periodically. These statistics usually include the ones
required by the error-estimation function, and even if not, the
process gathering the statistics already performs a scan of the
relation (or a sample of it), and can be easily modified to also
calculate the Zipf parameter of the distribution with no additional
I/O operations. The use of the backing sample is recommended only
for a few very large relations, where these statistics might not be
gathered regularly (because of the high cost). Most commercial
DBMSs also maintain a history (or cache) of the query workload, and
even if not, a small backing sample with a few hundred queries per
relation should be more than enough to profile the query
workload.
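A sketch of how such a statistics scan might fit the Zipf parameter from per-key frequency counts (a least-squares fit over log-log rank/frequency pairs; the helper name and the fitting choice are illustrative assumptions, not the patented implementation):

```python
import math

def fit_zipf_parameter(frequencies):
    # Assume freq(rank r) ~ C * r**(-z); fit z by least squares on
    # log(freq) = log(C) - z * log(rank).
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # the Zipf parameter is the negated slope

# Frequencies drawn exactly from a z = 0.4 Zipf law
freqs = [1000.0 * r ** (-0.4) for r in range(1, 257)]
print(round(fit_zipf_parameter(freqs), 3))  # 0.4
```

Since the fit needs only the frequency counts, it can indeed piggy-back on an existing statistics scan with no additional I/O.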
Theoretical Section 4 Experiments
[0252] We ran five sets of tests, measuring the accuracy of the
error-estimation function and effectiveness of its integration with
the different synopses reconciliation processes: [0253] 1. Measure
the EEE of the error estimation functions using several synopses
types over various instances. [0254] 2. Test the effectiveness of
the error-estimation function for estimating the approximation
error of single queries. [0255] 3. Perform synopses reconciliation
using only one synopsis type. The purpose of this experiment is to
compare our modified synopses reconciliation algorithms with the
original synopses reconciliation algorithms of [16] and [12], which
also used only one synopsis type in their synopses reconciliation
algorithms. [0256] 4. Perform synopses reconciliation using
multiple types of synopses, allowing the algorithm to select which
synopsis type is best suited for each relation in the system.
[0257] 5. Demonstrate the effectiveness and significance of having
dynamic synopses reconciliation by measuring the gain from running
a dynamic estimated-based synopses reconciliation process after the
underlying query workload has changed.
[0258] For the experiments we implemented the calibration
algorithm from Section 2 and both synopses reconciliation
algorithms presented in [12] and [16]. For convenience, we will
refer to the algorithm proposed in [16] as KW, and to "the greedy
algorithm with remedy" proposed in [12] as JJOT. The following
synopses types were used: Spline Synopsis [14], V-Optimal
histograms [11], Standard Wavelet [19], Workload-based Greedy
Wavelet [21], and Golden Estimator [25]. The experiments were
performed using a wide variety of Zipfian distributed synthetic and
real-world relations. The above components were implemented in the
.tau.-Synopses system [18]. A demonstration of the implemented
calibration and reconciliation modules was given in [17].
Theoretical Section 4.1 Summary of Experimental Results
[0259] The experimental results can be summarized as follows.
Theoretical Section 4.1.1 Error Estimation Function
[0260] Error-estimation using synthetic data sets. FIG. 23 compares
the measured with the estimated error of instances with synthetic
data sets. The tests show an average EEE of 1.8%, and a maximum EEE
of 6.4%, when the range of the measured error was between 0% and
20%. The relative error [EEE/mE] is at most 1.7.
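The EEE statistics reported throughout this section can be computed from paired measured and estimated errors; a minimal sketch, assuming EEE = |measured error - estimated error| and relative EEE = EEE/mE as in the notation above (the helper name is ours):

```python
def eee_stats(measured, estimated):
    # EEE: absolute difference between measured and estimated error
    eee = [abs(m - e) for m, e in zip(measured, estimated)]
    avg_eee = sum(eee) / len(eee)
    max_eee = max(eee)
    # relative EEE: EEE divided by the measured error mE
    max_rel = max(d / m for d, m in zip(eee, measured) if m > 0)
    return avg_eee, max_eee, max_rel

measured = [0.10, 0.05, 0.20]
estimated = [0.12, 0.04, 0.17]
avg_eee, max_eee, max_rel = eee_stats(measured, estimated)
```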
[0261] Error-estimation using real-world data sets. FIG. 24 shows
the performance of the error-estimation functions when using the
Spline synopsis with a real-world relation. The synthetic relations
that we use, are pure Zipfian distributed relations, but even
though the real-world relation that we use, CovType, is only partly
Zipfian distributed, the results are still quite accurate, with an
average FEE of 2%, a maximum BEE of 6.9%, when the range of the
error was between 0% and 20%. The relative error [EEE/mE] is at
most 2.41
[0262] Robustness of the error-estimation to the Zipf-similarity.
In FIG. 25 we tested how robust the estimation is to the assumption
that the data has a Zipf-like distribution. The results show that
the EEE depends on how different the distribution is from a pure
Zipfian-like distribution. Note that in this experiment, the
relative EEE reached as high as 45%, but in our experience, for most
real-world relations, the variance of the distribution distance
from a perfect Zipf distribution is less than 0.1, where the
relative EEE is only 3.1%.
[0263] Error-estimation using large relations. FIG. 26 shows
error-estimation experiments performed with the Golden Estimator
synopsis and the Work-load-based Greedy Wavelet synopsis using
large relations. The results are very similar to the results of the
tests performed on the smaller relations: an average EEE of 2.3%
and a maximum EEE of 6.9% when the range of the error is between 0%
and 20%. The relative error [EEE/mE] is at most 2.83.
[0264] Error-estimation for a single query. FIG. 27 details an
experiment measuring the error of single queries. Although the
maximal EEE is 41%, it very rapidly descends, and is under 20% for
96.3% of the queries, and under 10% for 83.7% of the queries.
Theoretical Section 4.1.2 Synopses Reconciliation with One Synopsis
Type
[0265] Synopses reconciliation with error-estimation functions.
FIG. 28 shows the comparison of the total error of all queries
between the estimated-based combination and the measured-based
combination. A synopses reconciliation process as in [16] was
performed using the Golden Estimator synopsis and the Spline
synopsis. The experiment demonstrates how the estimated-based
combination achieves results which are quite close to the
measured-based combination--the maximum difference between them is
1%, and the average difference between them is 0.3%, when the range
of error was between 0% and 10%.
[0266] Synopses reconciliation heuristics with error-estimation
functions. FIG. 29 shows the error of the estimated-based and
measured-based combinations of both the KW and JJOT algorithms. The
results show that the difference in the total error between the
measured-based combination using KW and the estimated-based
combination using JJOT (the technique described in this theoretical
discussion) is at most 3.3% and on average 1%, when the range of
the error was between 0% and 40%.
Theoretical Section 4.1.3 Synopses Reconciliation with Multiple
Synopses Types
[0267] FIG. 30 demonstrates the effectiveness of the proposed
algorithm, when using multiple synopses types together. It compares
the results of the more effective measured-based KW, with the
results of the estimated-based JJOT. Using the error-estimation
functions within the JJOT heuristic is a very fast way to perform
synopses reconciliation, and yet the effectiveness remains quite
close to that of the measured-based KW. The experiments show that
the difference in the total error between the two techniques is on
average 1% and at most 3.3%, when the range of the error was
between 0% and 12%.
[0268] FIG. 31 demonstrates the advantage of using multiple
synopses types together compared to using just one type. The plots
describe the total estimation error when using the Spline Synopsis
alone, the Workload-based greedy wavelet alone, and when using a
combination of the two. It can be seen that in some cases the
Spline synopsis alone achieves better results, in some the
Workload-based greedy wavelet alone is better than the spline
alone, but in all cases a combination of the two achieves the best
results--up to 4% better than using just one synopsis type.
Theoretical Section 4.1.4 Dynamic Synopses Reconciliation
[0269] Tables 3 and 4 detail instances where a synopses
reconciliation process, performed after the underlying relations and
query workloads have changed, significantly lowered the total error of
the system. We built the instances detailed in the tables, and then
made a small change: resizing one of the query workloads or
changing the distribution of one of the relations. A second
synopses reconciliation on the above instances lowered the error of
the entire system by as much as 4.9%.
Theoretical Section 4.2 Experiments Setup
[0270] As is customary, we will assume that the data of real-world
relations is well approximated using synthetic data with a Zipfian
distribution [26]. Most of the tests were performed on synthetic
relations with a Zipfian data distribution with Zipf parameter
ranging from 0.1 (very low skew, almost uniform) to 0.6 (very high
skew). In addition some tests were performed using relations from
the TPC-H data set and on the "Forest Cover" (CovType) real-world
relation provided by KDD Data of the University of California
(http://kdd.ics.uci.edu), which resembles a synthetic relation with
Zipf parameter 0.4. The tests were performed using relations with a
single numerical attribute and query workloads consisting of
range-sum queries with a single numerical result. Since we are
measuring the relative error, which is the error divided by the
real value, the range of the data is not important, so we chose it
randomly for each relation--usually at half of the maximum key.
With a goal to cover as wide a range as possible in our tests, we
performed tests on the entire range of each of the
instance-parameters of the instance: [0271] Relation distinct count
L varies from 256 to 2048 unique keys, with 256 k to 2048 k tuples,
respectively. Relations of similar sizes were also used in the
experiments of [12, 16]. We also performed tests on larger
relations with a distinct count of 16 k to 128 k (with 16 m to 128 m
tuples, respectively), which gave similar error-estimation results,
as can be seen below. [0272] The relation data Z and query
workloads Q have a Zipfian distribution with Zipf parameter between
0.1 and 0.6. To best simulate real-life relations, we used Zipfian
data distributions, as recommended in [26]. The range from 0.1 to
0.6 covers almost the entire spectrum: 0.1 means relations with very
low, almost uniform, data distribution skew, and 0.6 means very high
data distribution skew. [0273]
Workload query range R ranges from 10 keys to 80% of the relation
distinct count, covering almost the entire range. We assume no one
would run queries which select more than 80% of the relation.
[0274] Synopsis size S ranges from 2 buckets or coefficients to 80%
of the relation distinct count, again covering almost the entire
range. We assume that in real life no one would build a synopsis so
large, and at a size of 80% most synopses will give MRE=0 anyway.
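The instance-parameters above can be exercised by generating synthetic instances; a rough sketch under the stated ranges (the function names, the rounding scheme and the random key-to-rank assignment are illustrative assumptions, not the experimental harness):

```python
import random

def zipf_relation(distinct_count, num_tuples, z, seed=0):
    # Key frequencies follow a Zipf(z) law; keys get random ranks.
    rng = random.Random(seed)
    weights = [r ** (-z) for r in range(1, distinct_count + 1)]
    total = sum(weights)
    freqs = [round(num_tuples * w / total) for w in weights]
    keys = list(range(distinct_count))
    rng.shuffle(keys)
    return dict(zip(keys, freqs))

def range_sum_workload(distinct_count, num_queries, max_range, seed=1):
    # Each query is an inclusive (low, high) key range.
    rng = random.Random(seed)
    queries = []
    for _ in range(num_queries):
        width = rng.randint(1, max_range)
        low = rng.randint(0, distinct_count - width)
        queries.append((low, low + width - 1))
    return queries

rel = zipf_relation(distinct_count=256, num_tuples=256_000, z=0.4)
wl = range_sum_workload(256, num_queries=300, max_range=int(0.8 * 256))
print(len(rel), len(wl))  # 256 300
```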
[0275] The tests were performed using the following synopses types:
Spline synopsis [14]--a combination of histogram with curve fitting
resulting in a very high performance histogram, V-Optimal [11]--a
high performance histogram, Standard Wavelet [19]--the basic
wavelet synopsis, the Workload-based Greedy wavelet [21]--a high
performance wavelet synopsis that uses workload information to
achieve more accurate results, and Golden Estimator [25]--a
sampling-based histogram. The Spline synopsis was chosen because it
appeared in synopses reconciliation experiments in [14,13,15,16],
the V-Optimal was chosen because it was used in the synopses
reconciliation heuristics in [12], and in addition we used several
other synopses types. The above synopses perform some method of
lossy compression, whether it is wavelet calculations, sampling or
averaging in histograms, and they all follow the basic assumption
that the more space we have the less we need to compress, and the
less we need to compress the more accurate and close to the
original the synopsis will be. Therefore the behavior of the error
as a function of the synopsis size is that of a monotonically
decreasing function (it may have spikes, but the average is
monotonic), and since it is also bounded (after a certain size it
reaches 100% accuracy and cannot get any better), there is a
diminishing return on increasing its size. The greedy heuristic
suggested in [12] requires only that the synopsis' error, as a
function of the allocated space, be monotonic with decreasing
marginal gain as the allocated space increases. Other synopses
using a method of lossy compression are therefore also expected to
qualify for use by the greedy heuristic. Since most known synopses
use some method of lossy compression, this technique should also
work for most other synopses types.
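The diminishing-return property is exactly what a greedy space-allocation heuristic needs; a toy sketch in the spirit of the greedy heuristic of [12] (the unit-at-a-time loop and the 1/s error curves are illustrative assumptions, not the actual JJOT algorithm):

```python
def greedy_reconcile(error_fns, total_units):
    # Grant one memory unit at a time to the synopsis whose estimated
    # error drops the most; assumes each error function is monotone
    # decreasing in space with diminishing marginal gain.
    alloc = [1] * len(error_fns)  # every synopsis gets at least 1 unit
    for _ in range(total_units - sum(alloc)):
        best = max(range(len(error_fns)),
                   key=lambda i: error_fns[i](alloc[i]) - error_fns[i](alloc[i] + 1))
        alloc[best] += 1
    return alloc

# Hypothetical error curves err(s) = c / s (monotone, diminishing gain)
curves = [lambda s: 4.0 / s, lambda s: 1.0 / s]
print(greedy_reconcile(curves, 10))  # [7, 3]
```

Because each step only evaluates the error functions, the loop never touches the database, which is why the estimated-based variants in Section 4.4 finish in under a second.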
Theoretical Section 4.3 Calibration Process
[0276] FIG. 23 shows the measured and estimated error of several
synopses types in 50 instances with randomly selected
instance-parameters. We used the following synopses types: the
Workload-based Greedy Wavelet synopsis (upper-left), the Golden
Estimator synopsis (upper-right) and the Spline synopsis (lower).
FIG. 24 shows synopses measured and estimated error in 50 instances
using the Spline synopsis and the CovType real-world relation. The
instance-parameters of the query workloads and synopses were
randomly selected. In both figures, for each test, a new workload
was generated and the synopsis was rebuilt according to their
values in the instance-profile. For instance, in instance number
40, the measured error of the Golden Estimator was 3% and the
estimated error was 5.3%. Similar tests were performed using the
other synopses types generating similar results.
[0277] The results shown in FIGS. 23 and 24 show that the estimated
error is quite close to the measured error, and the difference--the
EEE--is at most 6.9%, and 2% on average, where the error range
was between 0% and 20%. The variance of the EEE is at most 0.00035
and the relative error-estimation error [EEE/mE] is at most 2.41.
Similar results were also achieved for the other synopses and for
other relations.
[0278] Recall that the calibration process uses synthetic Zipfian
distributed relations, and therefore the error-estimation function
is also built to estimate Zipfian distributed relations. Thus, the
accuracy of the error-estimation function depends on how close the
real-world relation distribution resembles a Zipfian distribution.
Since the relation used is only partly similar to a Zipfian
distributed relation, the results are slightly less accurate than
the results of using pure Zipfian distributed relations. In order
to measure how similar is a relation to a Zipfian distributed
relation, we measure for each tuple the difference between the
current data value and the data value it would have had if it were
a tuple in a pure Zipfian distributed relation. For example, the
real-world relation that we use, the KDD CovType has a distribution
distance variance of 0.032.
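One possible reading of this distance measure, as a sketch (comparing rank-aligned frequency fractions against a pure Zipf law of the same size is our assumption; the exact formula is not pinned down here):

```python
def distribution_distance_variance(freqs, z):
    # Compare rank-aligned frequency fractions against a pure Zipf(z)
    # relation of the same size; return the variance of the distances.
    actual = sorted(freqs, reverse=True)
    n = len(actual)
    total = sum(actual)
    zipf_w = [r ** (-z) for r in range(1, n + 1)]
    zipf_total = sum(zipf_w)
    dists = [a / total - w / zipf_total for a, w in zip(actual, zipf_w)]
    mean = sum(dists) / n
    return sum((d - mean) ** 2 for d in dists) / n

pure = [r ** (-0.4) for r in range(1, 101)]  # exactly Zipf(0.4)
skewed = [f * (1.1 if i % 2 else 0.9) for i, f in enumerate(pure)]
print(distribution_distance_variance(pure, 0.4) < 1e-12,
      distribution_distance_variance(skewed, 0.4) > 0)  # True True
```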
[0279] FIG. 25 shows the measured and estimated error of an
instance as a function of the variance of the distance of the
relation's distribution from a pure Zipfian distributed relation.
It demonstrates how the accuracy of the error-estimation function
depends on how similar the relation distribution is to a Zipfian
distribution. In this experiment we used a pure Zipfian data
distributed relation with Zipf parameter of 0.4 and 64 m tuples
with a distinct count of 64 k, and gradually changed the data
distribution away from a Zipfian-like one. For each measurement we
rebuilt a Workload-based Greedy Wavelet synopsis of 100
coefficients, and a query workload of 300 range-sum queries with a
random range of 90-110, and plotted the measured error and the
estimated error. Note how the measured error gradually decreases
since the changes make the distribution easier to approximate, but
the estimated error remains the same since the Zipf parameter
fitted to the relation remains the same. The relative EEE in this
experiment, for a distribution distance variance of 6, reached 45%.
For a relation with a distribution distance variance of 0.032,
similar to the CovType relation, the relative EEE is 4%.
[0280] FIG. 26 shows the measured and estimated error of 50
instances with randomly selected instance parameters. The
error-estimation experiments were performed with the Golden
Estimator synopsis (left) and the Workload-based Greedy Wavelet
synopsis (right) using larger relations. We used synthetic relations
of sizes 16 m to 128 m tuples, with a distinct count of 16 k to 128
k respectively, and a random data distribution skew ranging from
0.1 Zipf to 0.6 Zipf. For each test we randomly generated a query
workload with query range between 10 keys and 1% of the relation
distinct count. For instance, in instance number 25, the estimated
error for the Workload-based Greedy Wavelet (right) was 6.8% while
the measured error was 8.3%. The results are very similar to the
results of the tests performed on the smaller relations: an average
EEE of 2.3% and a maximum EEE of 6.9% when the error range was
between 0% and 20%. The relative error [EEE/mE] is at most 2.3 and
variance of the EEE is at most 0.00035.
[0281] Single approximate query error-estimation. Recall that one
of the applications of the error-estimation functions is the
error estimation for a single approximate query.
[0282] FIG. 27 details an experiment performed for estimating the
error for a single query, showing the measured and estimated error
of 150 queries. For the experiment we used a relation with 64 m
tuples, and a distinct count of 64 k, with a Zipfian distribution
skew with Zipf parameter of 0.4. Over the relation we built a
Workload-based Greedy wavelet synopsis, with 100 coefficients. The
query workload consisted of 150 random queries, and the results are
sorted according to measured error. The monotonically decreasing plot
is the measured-error, and the other plot is the estimated error.
For instance, in query number 51, the measured error was 9.15% and
the estimated error was 7.99%. The results of the experiment show
that the EEE is lower than 20% for 96.3% of the queries, and lower
than 10% for 83.7% of the queries. The variance of the EEE in this
test was 0.0036.
Theoretical Section 4.4 Synopses Reconciliation
[0283] FIG. 28 compares the error of the combination recommended by
the measured-based KW with the error of the combination recommended
by the estimated-based KW over 20 instance-groups. We use KW and
test every possible combination by querying the database, finding
the combination with the minimal error, and displaying that error
on the plot. Using the same instance and the same algorithm, we
find the combination with the minimal estimated error by using the
error-estimation function to estimate the error of each
combination. We then test the suggested combination by querying the
database, and display the measured error of that combination in the
plot. For each synopsis type we ran 20 synopses reconciliation
tests, each with an instance-group consisting of 3-4 instances with
randomly selected instance-parameters, following the specifications
above. The figure depicts the results of experiments performed
using the Golden Estimator synopsis (left) and the Spline Synopsis
(right). For instance, using the Spline synopsis (right), in
instance-group number 3, the measured-based KW combination achieved
an error of 3% while the estimated-based KW achieved an error of
4%. Tests performed using the other synopses types gave similar
results.
[0284] FIG. 29 shows the accuracy of the measured-based combination
and the estimated-based combination using both KW and JJOT. The
figure shows the error of the Spline synopsis type in 20
instance-groups consisting of instances with randomly selected
instance-parameters when using: KW measured-based combination, KW
estimated-based combination, JJOT measured-based combination and
JJOT estimated-based combination. As can be seen in the figure, and
as also noted by Jagadish et al. in [12], the results of JJOT are
quite close to those of KW. The figure also shows that the
estimated-based combination achieves very good results when
integrated with both the KW algorithm and the JJOT algorithm,
selecting a synopsis combination which is on average only 1.5%
worse than the others. Also, the difference between the KW
estimated-based combination and the JJOT estimated-based
combination is very small (on average 0.2%), which shows that the
error-estimation function works just as well with the heuristic in
[12], even without the "remedy".
[0285] Using multiple synopses types. In FIG. 30 we removed the
limitation of only one synopsis type and enabled combinations of
mixed synopses types together, showing again the error of the
estimated-based combination and the error of the measured-based
combination over 7 instance-groups consisting of instances with
randomly selected instance-parameters. Using multiple synopses
types together enables us to lower the error even further,
selecting the best suited synopsis for every relation and query
workload. As before, the estimated-based combination achieves an
error which is quite close to the measured-based combination--the
maximum difference between them is 3.3%, and the average difference
between them is 1%. Table 2 shows an example of the synopses
selected by each of the algorithms. Although the Workload-based
Greedy Wavelet (WB-Wavelet) dominates the recommendations, there
are occasions where the Spline synopsis is recommended, showing
that a mixture of several synopses will perform better than using
just one.
TABLE-US-00002
Reconciliation  <256k, 0.1, 0.1, 15>     <512k, 0.4, 0.2, 35>     <512k, 0.1, 0.6, 25>              Run-time
algorithm       Synopsis type, size (S)  Synopsis type, size (S)  Synopsis type, size (S)   Error   (seconds)
KW M-based      Spline, 180              WB-Wavelet, 180          WB-Wavelet, 140           0.0247  64
KW E-based      Spline, 180              WB-Wavelet, 200          Spline, 120               0.0264  0.8
JJOT M-based    Spline, 180              WB-Wavelet, 200          WB-Wavelet, 120           0.0255  18
JJOT E-based    Spline, 180              WB-Wavelet, 200          Spline, 120               0.0264  0.2
[0286] Table 2 describes synopses reconciliation with multiple
synopses types. Selected synopses for three instances with different
sizes L, skew parameter Z and query workload (Q and R), using the
KW and JJOT algorithms, with both the measured-based and
estimated-based synopses error estimation. The total allocated
memory is 500 bytes, and the participating synopses types were
Spline, V-Optimal, Standard Wavelet, Workload-based Greedy Wavelet
(WB-wavelet) and Golden Estimator. Every cell contains the type of
synopsis selected and the size in bytes. The results based on the
example methods are in bold face.
[0287] In FIG. 31 we ran 50 tests with randomly selected
instance-parameters, measuring the error of a measured-based KW
when using Spline synopsis alone, Workload-based Greedy Wavelet
alone and when using both synopses types together, demonstrating
the advantage of using a combination of multiple synopses types
together compared to using just one type. The results are sorted by
the measured error of "All" in decreasing order. For instance, in
instance-group number 23, the Spline synopsis achieved an error of
16%, the workload-based greedy wavelet synopsis achieved an error
of 12.5%, and using a combination of synopses from both types
achieved an error of 10%. Out of these 50 tests, in 13 of the tests
using only the Workload-based greedy wavelet gave the best
solution, in 7 of the tests using only the Spline synopsis type
gave the best solution, and for the rest, the best solution was
achieved by a combination of both synopses types. The total
relative error when using a combination of different synopses types
was up to 4% better than using just one synopsis type alone.
[0288] Dynamic Synopses Reconciliation. We demonstrate the
improvement gained by performing a second synopses reconciliation,
as the underlying relations and query workload change. In this
experiment, we run a synopses reconciliation process on an
instance-group and build the recommended synopses. We then change
the instance-group a little, rebuild all the synopses (to imitate
the dynamic update of the synopsis) and measure the total
approximation error of the instance-group. This measurement shows
the expected approximation error assuming a second synopses
reconciliation process is never performed, while the relations and
query workloads continue to change. Next we perform a second
synopses reconciliation process, build the recommended synopses,
and measure the total approximation error. The differences between
the first and second measurement show the benefit from a second
(and even a periodic) synopses reconciliation process. Below are
two examples of instance-groups demonstrating the gain from a
second synopses reconciliation process:
[0289] Table 3 details a given instance-group with 4 relations, and
the synopses built over them according to the recommendations from
the synopses reconciliation process (syn size before). The total
relative error of this instance-group with the above synopses was
7.4%. We then changed the query workload size of the fourth
instance from 50 to 2000, simulating an increase of the number of
queries to this relation. At this point the total approximation
error of the instance-group changed to 8%. Next we ran a second
synopses reconciliation process, which gave a new recommendation
(syn size after).
[0290] The total relative error of the system after this change was
3.1%, meaning that by running a second synopses reconciliation,
when the underlying query workload has changed, we managed to lower
the total relative error by 4.9%. We also ran a synopses
reconciliation process assuming equal workload size for each
relation, and tested this combination under the real query
workloads. Using this combination on the first workload resulted in
a total relative error of 9.2%, and using the combination on the
second workload resulted in a total relative error of 10.6%. Note
that assuming an equal size of query workload among the relations
gives the worst results, even when the instance-group has changed,
and a second synopses reconciliation has not been run.
TABLE-US-00003
                                    Instance 1  Instance 2  Instance 3  Instance 4
Relation distinct count             64k         16k         32k         16k
Data skew                           0.4         0.6         0.5         0.3
Query range                         10          20          20          20
#queries - .OMEGA..sub.1            200         100         150         50
Syn sizes .SIGMA.(.OMEGA..sub.1)    60          420         90          90
E(P, .OMEGA..sub.1, .SIGMA.(.OMEGA..sub.1))     7.4%
#queries - .OMEGA..sub.2            200         100         150         2000
E(P, .OMEGA..sub.2, .SIGMA.(.OMEGA..sub.1))     8%
Syn sizes .SIGMA.(.OMEGA..sub.2)    60          30          60          510
E(P, .OMEGA..sub.2, .SIGMA.(.OMEGA..sub.2))     3.1%
[0291] Table 3 describes an instance-group containing relations and
query workloads used to demonstrate the effectiveness of dynamic
synopses reconciliation. The first error line shows the error of
the original instance-group. Next we changed the workload of the
fourth instance, and the following line shows the resulting
increase in the error. Finally we ran a second synopses
reconciliation, which recommended a change of synopses, and the
bottom line displays the resulting error after the synopses were
rebuilt.
[0292] Table 4 details another example for the effectiveness of the
dynamic synopses reconciliation. In this example we changed the
data distribution Zipf parameter of the first relation, causing the
total relative error of the instance-group to increase from 3.4% to
9.3%. A second synopses reconciliation process lowered the total
relative error by 3.1%.
TABLE-US-00004
                                    Instance 1  Instance 2  Instance 3  Instance 4
Relation distinct count             64k         16k         32k         16k
Data skew (P.sub.1)                 0.1         0.6         0.5         0.3
Query range                         10          20          20          20
#queries                            300         100         150         50
Syn sizes .SIGMA.(P.sub.1)          30          420         480         60
E(P.sub.1, .OMEGA., .SIGMA.(P.sub.1))           3.4%
Data skew (P.sub.2)                 0.4         0.6         0.5         0.3
E(P.sub.2, .OMEGA., .SIGMA.(P.sub.1))           9.3%
Syn sizes .SIGMA.(P.sub.2)          360         300         300         40
E(P.sub.2, .OMEGA., .SIGMA.(P.sub.2))           6.2%
[0293] Table 4 describes another instance-group detailing relations
and query workloads used to demonstrate the effectiveness of
dynamic synopses reconciliation. The first error line shows the
error of the original instance-group. Next we changed the data
distribution Zipf parameter from 0.1 to 0.4, and the following line
shows the resulting increase in the error. Finally we ran a second
synopses reconciliation, which recommended a change of synopses,
and the bottom line displays the resulting error after the synopses
were rebuilt.
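The before/after experiments of Tables 3 and 4 can be mimicked with a toy model; in this sketch the per-relation error curve 1/s, the workload weighting and the greedy allocator are illustrative stand-ins for the real synopses and reconciliation process:

```python
def reconcile(weights, total_units):
    # Greedy: grant one unit at a time where the workload-weighted
    # error w/s drops the most (hypothetical error curve err(s) = 1/s).
    alloc = [1] * len(weights)
    for _ in range(total_units - len(weights)):
        gains = [w * (1.0 / a - 1.0 / (a + 1)) for w, a in zip(weights, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc

def total_error(weights, alloc):
    # Workload-weighted average of the per-relation error 1/s
    return sum(w / a for w, a in zip(weights, alloc)) / sum(weights)

budget = 40
w_before = [200, 100, 150, 50]   # query workload sizes, as in Table 3
alloc = reconcile(w_before, budget)
w_after = [200, 100, 150, 2000]  # fourth workload grows
stale = total_error(w_after, alloc)                       # keep old allocation
fresh = total_error(w_after, reconcile(w_after, budget))  # reconcile again
print(fresh < stale)  # True
```

Rerunning the allocator after the workload shift lowers the weighted error, mirroring the gain reported for the second synopses reconciliation above.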
[0294] Run-time improvement. FIG. 32 reports the run time of the
synopses reconciliation processes in 6 instance-groups. Each
instance-group contains 4 relations of the same size (x-axis),
and each with a query workload of 200 queries. For each
instance-group we measured the run-time of the estimated-based and
measured-based KW and JJOT algorithms as a function of the relation
size. The higher plot is the measured-based KW, the plot in the
middle is the measured-based JJOT, and the lower plot (coinciding
with the x-axis) describes both the estimated-based KW and JJOT.
For instance, using 4 relations of size 8 m tuples each, the
run-time of the measured-based KW was 514 seconds, the
measured-based JJOT ran for 169 seconds, while the estimated-based
KW and JJOT ran for less than 1 second. The results of the
experiment show how the run-times of the measured-based algorithms
grow as the size of the relations grows, while the run-times of the
estimated-based algorithms remain constant (less than 1 second).
This is easily explained: the measured-based algorithms must
build synopses in order to test them, and as the relations grow,
the build time of the synopses also grows. These
run-times assume that before the synopses reconciliation process
began, all the relevant synopses implementations were calibrated
and the relevant relations and query workloads were profiled.
Theoretical Section 5 Discussion
[0295] In this theoretical discussion, we have presented an
efficient way for the online error estimation of data synopses--a
basic requirement for any optimization of synopses space
utilization, enabling query quality reporting, "What-If" questions
without making changes to the actual data, and performing efficient
synopses reconciliation. One feature of the presented example
technique is defining an error estimation function that is
synopsis-type dependent, and not dependent on the particular
instance of relation and query workload. Thus, it can be computed
per synopsis at the time the synopsis module is integrated with the
system, rather than when a synopsis is actually built for a
particular relation. The function accepts as arguments the
statistical specifications of the instance relation and query
workload, and can thus avoid altogether database access for the
purpose of error estimation.
[0296] Using the error-estimation functions, we described in this
theoretical section a profile-based synopses reconciliation process
that can run on a separate server, with little impact on the
operational database server, and that also enables, for the first
time, a practical solution to the dynamic synopses reconciliation
problem. Previous works on synopses reconciliation required
performing a large number of accesses to the database and also did
not explicitly support the use of multiple synopses. Using the
example method, a synopses reconciliation process finishes in less
than 1 second instead of several minutes, obtaining a result whose
approximation relative error is on average only 0.3% larger than
the original error.
[0297] The calibration method presented here in this theoretical
section is performed using synthetic Zipfian distributed relations,
and therefore the error-estimation function is also built to
estimate Zipfian-like distributed relations. This is not intended
as a limitation.
[0298] This theoretical discussion relates to the particular case
of relations with a single numerical attribute and on query
work-loads consisting of range-sum queries with a single numerical
result. It is appreciated that this is not intended as a
limitation, and allows an in-depth treatment of one particular
non-limiting case.
[0299] Under this setup, the five instance-parameters which define
an instance-profile are sufficient to achieve a good estimation of
the synopsis error. The specific construction of the error
estimation function is based on the observation that the effect of
the different parameters on the error of an instance can be
approximated by power-law functions, and on the assumption that the
instance-parameters are near-independent of each other. This makes
it possible to lower the number of tests required to find the
error-estimation function, and enables the building of one simple
function to estimate the error of a synopsis w.r.t. a given
relation and query workload. In this theoretical section, we
presented an off-line calibration process for calculating the
characteristics of the data synopsis type, resulting in an on-line
synopsis error-estimation for error prediction based on the
synopsis and instance profiles.
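The power-law observation above lends itself to a compact calibration sketch. The code below is a hypothetical illustration, not the patent's calibration procedure: it assumes each instance-parameter is swept on its own (which the near-independence assumption permits, keeping the number of calibration tests small), fits error .apprxeq. c.multidot.value.sup.a by least squares in log space, and then combines the fitted per-parameter terms linearly, as the text later suggests; the `weights` are likewise assumed to come from calibration.

```python
import numpy as np

def fit_power_law(values, errors):
    """Off-line calibration step: fit error ~ c * value**a in log space.

    Run once per instance-parameter over a one-dimensional sweep of that
    parameter, with the remaining parameters held at baseline values.
    """
    A = np.column_stack([np.ones(len(values)), np.log(values)])
    (log_c, a), *_ = np.linalg.lstsq(A, np.log(errors), rcond=None)
    return np.exp(log_c), a

def estimate_error(terms, weights, params):
    """On-line estimation: a linear combination of the fitted power laws.

    terms:   list of (c, a) pairs from fit_power_law, one per parameter
    weights: combination weights (assumed found during calibration)
    params:  the instance-parameter values of the profile being estimated
    """
    return sum(w * c * p ** a for w, (c, a), p in zip(weights, terms, params))
```

Fitting in log space turns each power law into a line, so a handful of calibration measurements per parameter suffices, which is the point made above about lowering the number of required tests.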
[0300] Since the exemplary method deals with set items, the data
need not be in relational format, and the method can therefore also
support non-relational data sets. The error-estimation functions
are also applicable to streaming data, either by identifying the
relevant parameters and tracking them in the streaming model or by
maintaining a backing sample of the streaming data.
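One standard way to maintain such a backing sample over a stream is reservoir sampling; the sketch below is offered only as an illustration, since the text does not prescribe a specific sampling scheme. It keeps a uniform random sample of k items while seeing each stream element exactly once.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform backing sample of k items over a data stream.

    Classic reservoir sampling: after n items have arrived, each item
    is present in the sample with probability k / n, using O(k) memory.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform index in [0, n)
            if j < k:
                sample[j] = item         # replace a current sample member
    return sample
```

The instance-parameters needed by the error-estimation functions can then be computed from the backing sample instead of from the full stream.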
[0301] The error-estimation function can be extended to support
other types of setups by adding, to the error-estimation function,
parameters that capture the specifications of those setups. Instead
of assuming the parameters are independent of each other and using
a linear combination of power-law functions, the error-estimation
function can be adjusted to capture the dependencies between the
parameters, thus gaining better accuracy.
REFERENCES
[0302] [1] A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 181-192, 1999.
[0303] [2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 574-576, 1999.
[0304] [3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th ACM Symposium on Theory of Computing, pages 20-29, 1996.
[0305] [4] V. Ganti, M.-L. Lee, and R. Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. The VLDB Journal, pages 176-187, 2000.
[0306] [5] P. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, N.J., December 1997.
[0307] [6] P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science: Special Issue on External Memory Algorithms and Visualization, A:50, 1999.
[0308] [7] P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets (summary). In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 909-910, January 1999.
[0309] [8] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. VLDB, pages 466-475, 1997.
[0310] [9] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Ariz., May 1997.
[0311] [10] ILOG Inc. 8.0 user's manual, 2002.
[0312] [11] Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 233-244, 1995.
[0313] [12] H. V. Jagadish, H. Jin, B. C. Ooi, and K. L. Tan. Global optimization of histograms. In SIGMOD '01, pages 223-234, 2001.
[0314] [13] A. C. Koenig. Query estimation techniques in database systems. PhD thesis, Saarland University, 2001.
[0315] [14] A. C. Koenig and G. Weikum. Auto-tuned spline synopses for database statistics management. In 10th International Conference on Management of Data (COMAD), Pune, India, 2000.
[0316] [15] A. C. Koenig and G. Weikum. A framework for the physical design problem for data synopses. In Proceedings of the 8th International Conference on Extending Database Technology, 2002.
[0317] [16] A. C. Koenig and G. Weikum. Automatic tuning of data synopses. Information Systems, 28(1-2):85-109, 2003.
[0318] [17] Y. Matia, Y. Matias, and L. Portman. Synopses reconciliation via calibration in the .tau.-Synopses system. In International Conference on Extending Database Technology (EDBT), Software Demo, pages 1139-1142, March 2006.
[0319] [18] Y. Matias, L. Portman, and N. Drukh. The design and architecture of the .tau.-Synopses system. In International Conference on Extending Database Technology (EDBT), Industrial & Application, pages 1088-1091, March 2006.
[0320] [19] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 448-459, Seattle, Wash., June 1998.
[0321] [20] Y. Matias, J. S. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In Proceedings of the 2000 VLDB, pages 101-110, September 2000.
[0322] [21] L. Portman. Workload-based wavelet synopses. Master's thesis, School of Computer Science, Tel Aviv University, 2003.
[0323] [22] L. Portman and Y. Matias. Workload-based wavelet synopses. Technical report, Tel Aviv University, 2003.
[0324] [23] L. Qiao, D. Agrawal, and A. E. Abbadi. RHist: Adaptive summarization over continuous data streams. In CIKM '02: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 469-476, 2002.
[0325] [24] E. W. Weisstein. Least squares fitting. MathWorld--A Wolfram Web Resource. http://mathworld.wolfram.com/LeastSquaresFitting.html
[0326] [25] Y. L. Wu, D. Agrawal, and A. E. Abbadi. Applying the golden rule of sampling for query estimation. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 449-460, 2001.
[0327] [26] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass., 1949.
[0328] In the description and claims of the present application,
each of the verbs "comprise," "include," and "have," and conjugates
thereof, is used to indicate that the object or objects of the
verb are not necessarily a complete listing of members, components,
elements or parts of the subject or subjects of the verb.
[0329] All references cited herein are incorporated by reference in
their entirety. Citation of a reference does not constitute an
admission that the reference is prior art.
[0330] The articles "a" and "an" are used herein to refer to one or
to more than one (i.e., to at least one) of the grammatical object
of the article. By way of example, "an element" means one element
or more than one element.
[0331] The term "including" is used herein to mean, and is used
interchangeably with, the phrase "including but not limited to".
[0332] The term "or" is used herein to mean, and is used
interchangeably with, the term "and/or," unless context clearly
indicates otherwise. The term "such as" is used herein to mean, and
is used interchangeably with, the phrase "such as but not limited
to".
[0333] The present invention has been described using detailed
descriptions of embodiments thereof that are provided by way of
example and are not intended to limit the scope of the invention.
The described embodiments comprise different features, not all of
which are required in all embodiments of the invention. Some
embodiments of the present invention utilize only some of the
features or possible combinations of the features. Variations of
embodiments of the present invention that are described and
embodiments of the present invention comprising different
combinations of features noted in the described embodiments will
occur to persons skilled in the art.
* * * * *