U.S. patent application number 13/885032 was filed with the patent office on 2013-09-12 for analytical data processing.
The applicant listed for this patent is Muthian George. Invention is credited to Muthian George.
Application Number | 20130238548 13/885032 |
Document ID | / |
Family ID | 46581072 |
Filed Date | 2013-09-12 |
United States Patent
Application |
20130238548 |
Kind Code |
A1 |
George; Muthian |
September 12, 2013 |
ANALYTICAL DATA PROCESSING
Abstract
Systems and apparatuses are provided for integrating user
defined functions into an analytical data processing framework. A
query compiler (22) identifies a call to a user defined function
(28) and one of a GROUP BY expression and a defined online
analytics processing (OLAP) window associated with the user defined
function. A database engine (24) executes the user defined function
(28) with a partition set defined by the one of the GROUP BY
expression and the defined OLAP window, such that the user defined
function operates on the partition set to produce a function
output.
Inventors: |
George; Muthian; (Fremont,
CA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
George; Muthian |
Fremont |
CA |
US |
|
|
Family ID: |
46581072 |
Appl. No.: |
13/885032 |
Filed: |
January 25, 2011 |
PCT Filed: |
January 25, 2011 |
PCT NO: |
PCT/US11/22437 |
371 Date: |
May 13, 2013 |
Current U.S.
Class: |
707/600 |
Current CPC
Class: |
G06F 16/2453 20190101;
G06F 16/283 20190101 |
Class at
Publication: |
707/600 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. An analytical data processing system (10) comprising: a
processor (12); and a non-transitory computer readable medium (14)
storing machine readable instructions, the machine readable
instructions comprising: a query compiler (22) to identify a call
to a user defined function (28) and one of a GROUP BY expression
and a defined online analytics processing (OLAF) window associated
with the user defined function; and a processing engine (24) to
execute the user defined function (28) with a partition set defined
by the one of the GROUP BY expression and the defined OLAP window,
such that the user defined function operates on the partition set
to produce a function output.
2. The analytical data processing system of claim 1, the user
defined function (28) having metadata (69) exposed to the query
compiler and the processing engine (24), the metadata (69)
indicating that the user defined function (28) is executable within
the one of the GROUP BY expression and the defined OLAP window.
3. The analytical data processing system of claim 2, the exposed
metadata (69) comprising an application programming interface, the
application programming interface comprising rules for validating
and resolving arguments within the call to the user defined
function (28).
4. The analytical data processing system of claim 1, the call to
the user defined function (28) being one of a plurality of calls to
respective user defined functions identified by the query compiler
(22) within a query, and the processing engine (24) executing the
plurality of user defined functions as part of the query.
5. The analytical data processing system of claim 4, the user
defined function (28) consuming afield output by another of the
plurality of user defined functions as an input.
6. The analytical data processing system of claim 4, the partition
set being a first partition set in a query, and the query compiler
(22) further identifying a second partition set associated with
another of the plurality of user defined functions, the first
partition set differing from the second partition set.
7. The analytical data processing system of claim 4, another of the
plurality of user defined functions having metadata (69) exposed to
the query compiler (22) indicating that the user defined function
should be executed without reference to either of a GROUP BY
expression and an OLAP window.
8. The analytical data processing system of claim 1, the query
compiler (22) identifying the one of the GROUP BY expression and
the defined online analytics processing (OLAP) window from an OLAP
window expression associated with the call to the user defined
function.
9. An analytical processing system comprising: a processor (12);
and a non-transitory computer readable medium (14) storing machine
readable instructions, the machine readable instructions
comprising: a query compiler (22) to identify respective calls to a
plurality of user defined functions (68) within a structured query
language (SQL) query, the query compiler (22) identifying a
plurality of data partitions, each associated with one of the
plurality of user defined functions and being identified from one
of a GROUP BY expression and an OLAP window expression, such that
one of the plurality of data partitions differs from another of the
plurality of data partitions; and a processing engine (24) to
execute the plurality of user defined functions as part of the SQL
query such that one of the plurality of user defined functions
consumes, as an input, a field output by another of the plurality
of user defined functions; wherein each user defined function (68)
has metadata (69) exposed to the query compiler (22) and the
processing engine (24) indicating that the user defined function is
executable within the one of the GROUP BY expression and the OLAP
expression.
10. A non-transitory computer readable medium storing machine
executable instructions, the machine executable instructions
comprising: a query compiler (62) to identify a call to a plurality
of user defined functions (68) and one of a GROUP BY expression and
a defined online analytics processing (OLAP) window associated with
the user defined function within an SQL query; and a database
engine (64) to execute the plurality of user defined functions (68)
with a data partition defined by the one of the GROUP BY expression
and the defined OLAP window, such that a user defined function of
the plurality of user defined function operates on the data
partition to produce a function output.
11. The non-transitory computer readable medium of claim 10, one of
the plurality of user defined functions (68) consuming, as an
input, a field output by another of the plurality of user defined
functions.
12. The non-transitory computer readable medium of claim 10, the
query compiler (62) identifying the one of a GROUP BY expression
and a defined online analytics processing (OLAP) window from an
OLAP window expression associated with the call to one of the
plurality of user defined functions.
13. The non-transitory computer readable medium of claim 10, each
of the plurality of user defined functions (68) having metadata
(69) exposed to the query compiler (62) and the database engine
(64), the metadata (69) for each function indicating that the user
defined functions are executable within the one of the GROUP BY
expression and the defined OLAP window.
14. The non-transitory computer readable medium of claim 13, the
metadata (69) comprising a generalized application programming
interface function pointer for validating and resolving arguments
within the call to the one of the plurality of user defined
functions.
15. The non-transitory computer readable medium of claim 10, the
defined data partition being a first data partition associated with
one of the plurality of user defined functions (68) in the SQL
query, and the query compiler (62) further identifying a second
data partition defined by another OLAP window associated with
another of the plurality of user defined functions, the first data
partition differing from the second data partition.
Description
TECHNICAL FIELD
[0001] This invention relates to information processing, and more
particularly, to integration of user defined functions into an
analytical data processing framework.
BACKGROUND
[0002] Analytical processing systems consume tables of data which
are typically linked together by relationships that simplify the
storage of data and make queries of the data more efficient. A
standardized query language, such as Structured Query Language
(SQL), can be used for creating and operating on relational
databases. Analytics is the application of computer technology and
statistics to solve problems in business and industry. The science
of analytics is concerned with extracting useful properties of data
using computable functions and, generally speaking, involves the
extraction of desired properties of data sets from large databases.
Analytics therefore bridges the disciplines of computer science,
statistics, and mathematics.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 illustrates an example of an analytic data processing
system in which user defined functions (UDF) have been integrated
into an online analytics processing (OLAP) framework.
[0004] FIG. 2 illustrates one example of the integration of an
analytical processing system into a database system, such that user
defined functions (UDF) have been integrated into an online
analytics processing (OLAF) and GROUP BY processing framework.
[0005] FIG. 3 illustrates one method for executing a query.
[0006] FIG. 4 is a schematic block diagram illustrating an
exemplary system of hardware components capable of implementing
examples of the systems and methods for user defined function
integration disclosed in FIGS. 1-3.
DETAILED DESCRIPTION
[0007] Many databases have the framework for processing a limited
number of in-built analytic functions in queries using GROUP BY and
OLAP window clauses. Unfortunately, real-world analytical problems
in databases go beyond the analytic functions natively supported,
and datasets in databases that require intensive analytical
processing often end up using third party analytical packages for
analytical processing. In situations where the datasets are large,
they are often retained outside the databases due to the limited
analytical capabilities available in databases. These packages
require data to be retrieved from databases, persisted outside the
database, and reformatted before processing. Since these approaches
move data away from the database tables to a location associated
with the analytical processes, the result is a replication of same
large raw data sets in different formats in multiple locations,
with the accompanying security and data governance risk for
confidential data. Further, this approach results in excessive
manpower and hardware cost for data retrieval, management, and
storage as well as significant latency in processing, making it
impractical for real-time analytics.
[0008] FIG. 1 illustrates an example of an analytical processing
system 10 in which user defined functions (UDF) have been
integrated into an online analytics processing (OLAP) framework.
For example, the system 10 of FIG. 1 can represent a database
system, a data warehouse, a data mart, an in-memory database, a
standalone OLAP engine, a business intelligence report generation
system, a data mining system, or a federated query processing
system. In general, the analytics system can be represented as a
plurality of functional components 22, 24, 26, and 28, each of
which can be implemented as any appropriate combination of hardware
and programming configured to perform their associated function. In
the illustrated example, each of the query compiler 22, the
processing engine 24, the data tables 26, and the user defined
function 28 is stored on a non-transitory medium and executed by an
associated processor, but it will be appreciated that other
implementations of the functional components, for example, as
dedicated hardware or as a combination of hardware and machine
readable instructions, could be used.
[0009] The example system 10 includes a processor 12 and a memory
14 connected to a communications interface 16. It will further be
appreciated that the memory 14 can be a removable memory, connected
to the processor 12 and the communications interface 16 through an
appropriate port or drive, such as an optical drive, a USB port, or
other appropriate interface. Further, the memory 14 can be remote
from the processor 12, with machine readable instructions stored on
the memory provided to the processor via a communications link. It
will be appreciated that the communication interface 16 can
comprise any appropriate hardware and machine readable instructions
for receiving database queries from an associated query source (not
shown) and returning the results of the queries to the query
source. Accordingly, the communications interface 16 can include
any or all of a bus or similar data connection within a computer
system or a wired or wireless network adapter. The memory 14 can
include any appropriate standard storage devices associated with
computer systems, such as magnetic and optical storage media.
[0010] The device memory 14 can include a query compiler 22, a
processing engine 24 to compile and execute queries on data tables
26, and a user defined function (UDF) 28. The query complier 22 can
utilize any appropriate query language, for example, structured
query language (SQL) or multidimensional expression (MDX) language.
In one implementation, the user defined function 28 is stored as
one of a shared object or a dynamic link library. The query
compiler 22 identifies a call to the user defined function 28 and a
defined data partition. For example, the query compiler 22 can
identify the defined data partition from one of a GROUP BY and an
OLAP window expression associated with the call to the user defined
function 28. The processing engine 24 executes the user defined
function as a windowed function, such that the function operates on
the defined partition set to produce a function output.
[0011] It will be appreciated that OLAP processing is not
appropriate for all user defined functions. Accordingly, the user
defined function 28 has metadata exposed to the query compiler 22
and the processing engine 24 indicating that the user defined
function is executable within an OLAP window or GROUP BY
expression. The metadata can further include defined inputs,
outputs, and parameters for the user defined function. In one
instance, the exposed metadata can further include an application
programming interface (API) comprising rules for validating and
resolving arguments within the call to the user defined function.
It will be appreciated, however, that for the majority of user
defined functions, a generalized API function pointer in the
metadata will validate and resolve arguments unless the user
defined function requires special handling for resolving input,
output, and/or parameter arguments. The user defined function
metadata further comprises a second function pointer to construct
an object for processing during execution of the query.
[0012] One advantage of integrating the user defined functions into
the OLAP framework is that the output of each user defined function
is projected back into the SELECT list of the query, and is thus
available to other user defined functions and inbuilt functions
within a given query. Accordingly, the user defined function 28 can
be one of a plurality of user defined functions identified by the
query compiler 22 within a query, and the processing engine 24 can
execute the plurality of user defined functions as part of the
query such that one user defined function consumes, as an input,
field(s) output by another of the plurality of user defined
functions. Each of the user defined functions can have an
associated defined OLAP window, and the defined OLAP window for
each function can differ among the defined windows.
[0013] FIG. 2 illustrates one example of the integration of an
analytical processing system into a database system 50, such that
user defined functions (UDF) have been integrated into an online
analytics processing (OLAP) and GROUP BY processing framework. The
system 50 includes a processor 52 and a memory 54 connected to a
communications interface 56. It will be appreciated that the
communication interface 56 can comprise any appropriate hardware
and machine readable instructions for receiving database queries
from an associated query source (not shown) and returning the
results of the queries to the query source. Accordingly, the
communications interface 56 can include any or all of a bus or
similar data connection within a computer system or a wired or
wireless network adapter. The memory 54 can include any appropriate
standard storage devices associated with computer systems, such as
magnetic and optical storage media.
[0014] The device memory 54 can include a query complier 62 and a
database engine 64 to compile and execute queries on a database
table 66. The query complier 62 identifies a call to a user defined
function and a defined data partition, expressed, for example, as
an online analytics processing (OLAP) window or a GROUP BY
expression, for the user defined function, if the function is of an
appropriate class type for integrated OLAP processing. If the
defined OLAP data partition is provided, the database engine 64
executes the user defined function to provide an output having a
set of one or more output fields and projects the returned output
fields back into the select list. One benefit of performing the
processing within the OLAP framework is that multiple user defined
functions can be executed as part of a single query. The functions
can also be nested, with one user defined function consuming the
output of another user defined function.
[0015] During operation, the query complier 62 processes a call to
a user defined function in a database query using a standardized
user defined function syntax to distinctly map input, output, and
parameter expressions to appropriate objects. List of fields in the
input and output expressions define rows of data that can be
composed into tables. In the modified syntax, inputs are presented
as a list of fields within parentheses following the user defined
function name. Inputs are extracted from query fields, and composed
as input table. Outputs are denoted with a keyword OUTPUT and a
list of fields in parentheses. The query compiler 62 structures
these input and output fields into self-describing table objects
with field names, data types and data size to standardize
processing of all user defined function class types capable of
handling multiple records simultaneously. Each table object has
metadata to describe each field of the table object in terms of its
name, data type, and data length. Tables also have the ability to
store and retrieve rows of data based on row number. They are
capable of handling large row sets with the ability to manage
memory overflow into disk files.
[0016] The query complier 62 includes a query parser 74, which
extracts each of the input table, the output table, and any
parameter objects from each user defined function. It will be
appreciated that the basic syntax for all user defined functions,
regardless of their associated class, can be standardized, such
that they are parsed in a similar manner at the query parser 74.
One example of such a standardized syntax can include a statement
such as:
TABLE-US-00001 <UDF name> ([<Input Expression List>])
[OUTPUT(<Output Expression List>)] [[WITH] PARAMETER
(<key=valueExpression> [:...])]
[0017] In the above statement, items within brackets are optional,
items within parentheses are mandatory, and items given within
chevrons (< >) are replaced with appropriate expressions. The
names of the user defined functions are unique and
case-insensitive. The user defined functions support variable input
and output fields composed as table objects. The various expression
lists can comprise a series of comma separated items. The input
expression list, if present, can include columns or expressions
composed using columns from query table. A mapping for the output
of the user defined function is provided using the keyword OUTPUT,
with the output expression list comprising of one or more output
fields or expressions composed from output fields. Output fields
are field names from the user defined function output metadata or
field position identifiers using "$#" syntax, where $ represents a
special character and # represents an ordinal number of the output
field left to right starting from one. When the output is not
explicitly mapped in a query, default output fields defined within
the user defined function can be returned. Parameters are given as
"key=valueExpression" separated by colon using WITH PARAMETER
syntax when required. The "key" is the field name in the user
defined function parameter metadata. The "valueExpression" is a
constant or an expression that evaluates to a constant. The
parameters defined in the expression can be dates, time,
time-stamps, integers, decimal values (e.g., float, double, or long
double values), character strings, or comma separated array
constants formed from one of these data types.
[0018] In one implementation, the user defined functions are built
in a UDF library 68, for example as shared objects or dynamic link
libraries, and registered with the database engine. Each shared
object exposes the user defined functions in the form of
self-describing UDF metadata 69 that can be retrieved by the query
complier 62. The UDF metadata 69 can include a name of the user
defined function, a description, an associated class type, a
factory constructor function pointer to create a runtime processing
object instance, a function pointer for the validation and
resolution of input, output and parameters, and other runtime
optimization parameters, as well as defined input, output and
parameter fields for the function. The query complier 64 accesses
these fields for query resolution and validation. The input and
output fields can be either fixed or variable fields. Each fixed
field represents only one field during execution of the query,
while each variable field can represent zero, one, or multiple
fields during execution. When input and output fields are marked as
variable, they can be repeated zero, one, or multiple times at the
time of query processing. It will be appreciated that variable
fields can occur at any input or output position, and there can be
multiple variable fields in input and output. The size of a
variable field for a given instance of the function can be
provided, for example, as a parameter expression. The parameter
expression can also be used to define scalar or array values and
character parameters used for processing in the user defined
function itself.
[0019] Input and output fields are marked as fixed or variable
types in the input and output metadata of each user defined
function. It will be appreciated that a given user defined function
requires a standard order and number of inputs, as defined in the
input metadata for the function. Each field is referenced by a
corresponding input argument field or fields in the query.
Specifically, fixed fields will have only one corresponding input
argument field and variable fields can have zero, one or multiple
consecutive input argument fields in the query. Input and output
fields can have fixed or undefined data types within the metadata.
When the data type of an input field is undefined, the field
obtains its data types from the data types of input field arguments
in the query. Query input field defines if the field has null value
or not. For fields having a fixed data type, data from the input
query field is converted to the required input field type of the
user defined function at runtime when the data types do not match
and the data type promotion is valid. When the data type is
undefined in an output field, it is set to depend upon an input
field for resolution. At the query complier 62, undefined output
fields are resolved from the input fields. The metadata can also
include default output fields, which are returned when a query does
not explicitly assign output fields.
[0020] When there is only a single input variable field, it is
resolved at the query compiler 62 from the input arguments in the
query. Specifically, a repetition number for the variable field can
be determined as the difference between a total number of input
arguments and the number of fixed input fields defined in the user
defined function metadata. When there are multiple input or output
variable fields, parameter fields are used to provide the
repetition number, with each variable field having a corresponding
parameter field. When the query is processed, the parameter input
is used for validating and resolving the repetition number for the
variable fields. The names for the variable input and output fields
can be generated serially by extending the base name of the
variable field with a number incrementally starting from one. The
fixed data type of all the variable input and output fields is same
as the field marked as variable. The undefined data type of the
variable input field is resolved by the data type of the input
fields in the query. A general purpose function supplied with the
system can be used to validate input and output fields and resolve
their data types and lengths at the query compiler 62 when explicit
function is not supplied by the user defined function as a function
pointer in the metadata.
[0021] The listing of fields in the output expression allows the
output of a user defined function to be mapped to a list of fields
provided within the output metadata. When the number of output
fields in the query is less than the number of output fields from
the user defined function, the user defined function returns only
the fields that are mapped in the query. Output fields can be
mapped in any order in queries, using either the field name or
field position identifier. If output variable fields are mapped
using position identifiers, the validation and resolution function
substitutes position identifiers with output field names and sets
data type and size in the output table object at query compile
time.
[0022] Changing the order and mapping is particularly useful for
nested user defined function processing, to allow an inner user
defined function to return the fields in the order the outer user
defined function requires as input. In the projection list, output
mappings from user defined functions are merged with the rest of
the fields in the projection list in the order in which they are
mapped. When multiple user defined functions occur in a query,
projection fields are mapped similarly to queries with OLAP
multidimensional and multiple windows functions.
[0023] The metadata 69 for each user defined function can include
an associated class for each function out of a plurality of
function classes 72 to assist in the optimization of the query. The
user defined function class types implicitly set the rules for data
processing in the database engine along with the cardinality of
their output results. For example, user defined functions belonging
to some classes will be processed in OLAP windows, whereas such
processing is impractical or unnecessary for other classes of
functions. Unlike inbuilt functions that return only one output
field, all the user defined function class types may return one or
multiple output fields.
[0024] In one instance, user defined functions can be associated
with any of ten classes. 1) Aggregate user defined functions
process a set of rows partitioned by a GROUP BY expression or an
OLAP window expression and return only one output row per input row
set. Without a GROUP BY expression or an OLAP window, the whole
table data from the query is considered as the input set. 2)
Rank-type user defined functions process a set of rows partitioned
by an OLAP window and sorted in ORDER BY clause and return one row
per input row. 3) Series user defined functions are processed by
OLAP PARTITION BY in a subset window frame of ROWS or RANGE clause.
It will be appreciated that aggregate and series classes are not
mutually exclusive, and that a given user defined function can
belong to either or both the classes.
[0025] 4) Series user defined functions with Inbuilt Window Frame
(SIWF) are similar to series user defined functions except that
they use an inbuilt window frame for moving window aggregate
processing. A SIWF user defined function receives the window frame
size through a parameter specification. The function receives one
input row at a time and returns one output aggregate return row for
the result generated using the window frame size given. The SIWF
class supports fast computations for moving window aggregates away
from the OLAP window frames. The SIWF class of functions can use
the OLAP PARTITION BY expression or the OLAP ORDER BY clause if the
input is required to be in a sorted order.
[0026] 5) A Multiple Rows Returning Aggregate user defined function
(MRRA) is similar to the aggregate functions except that MRRA
functions return more than one row per set of rows processed. MRRA
functions can process a set of rows partitioned by GROUP BY or OLAP
PARTITION BY clause, or, in the absence of such a clause, the whole
table data from the query can be considered as the input set. 6) N
To M user defined function (NTM) functions are a special type of
user defined function that processes a set of N input rows and
generates M output rows. NTM functions can use OLAP PARTITION BY
and ORDER BY processing. The number of output rows can be more
than, less than, or equal to the number of input rows.
[0027] 7) A One To M user defined function (OTM) is a class of user
defined function that processes one input row at a time and
produces multiple rows of output. OTM functions do not require OLAP
windows support. 8) Value Generating user defined functions
generate one row of output per one row of input, generally
performing simple mathematical computations, such as log.sub.10. 9)
A row-sending UDF is a UDF that does not return any output, but
instead supplies data to an external process. A row-sending UDF can
be used in a SELECT query statement to send a set of rows to an
external process. In order to work with the SELECT query statement,
a row-sending UDF can return the number of rows sent. In one
instance, the database can be configured to automatically send any
modified rows via a row-sending function whenever the database
table is modified.
[0028] 10) Table Valued user defined functions (TVUDF) occur in the
FROM clause of SQL queries, where the table they return can
participate in relational set processing similar to regular
database tables. TVUDFs mostly connect to external structured or
unstructured data sources and return structured tabular data for
query processing. TVUDFs are without input arguments and are used
for generating tables from external data sources. In such a case,
inputs for the table extraction are given as parameters mostly in
the form of query statement and the function receives an empty
input table object to populate an output table from the external
source. When TVUDFs have parameterized external queries for
processing in the external databases, they will have input fields
from the local query corresponding to the external query
parameters. Input argument fields supply parameter values for
external query processing in the external database. In such cases,
TVUDFs behave exactly as OTM functions and are marked as members of
the OTM class in the metadata. There can be value generating user
defined functions without input arguments similar to CURRENT_DATE(
) or PI( ).
[0029] The query parser 74 can review the input query to ensure
that the call to each user defined function is valid. To this end,
the query parser 74 can retrieve an associated class of each user
defined function from the metadata 69 and apply a set of logical
rules to the query to evaluate the validity of each function call.
User defined functions belonging to any class beside the value
generating class and table-valued class occur in the projection
list of a SQL query similar to GROUP BY or OLAP window functions.
User defined functions belonging to the value generating user
defined function class can occur anywhere an expression can occur
in a SQL query. Table-valued user defined functions occur in the
FROM clause of a SQL query or in places where a sub-query is
allowed.
[0030] Similarly, user defined functions belonging to the
aggregate, rank, series, and SIWF classes can occur in a query with
multiple user defined functions in any combination, in a manner
similar to inbuilt GROUP BY or OLAP window functions. User defined
functions belonging to the MRRA, NTM and OTM classes, referred to
herein as singleton user defined functions, must be the only user
defined function in its associated query or sub-query. Singleton
user defined functions have an indeterminate output number of rows.
There is no restriction the number of singleton user defined
function sub-queries in a query, however. Row-sending user defined
functions also belong to the singleton user defined function type,
though its output is determinate in a query. In the case of
rank-type analytic OLAP functions, there are no input arguments for
the functions.
[0031] However, the lists of fields in the ORDER BY clause are
implicitly considered as the input arguments for the inbuilt
rank-type analytic functions. It will be appreciated, however, that
there can be input arguments in the UDF when there is an ORDER BY
expression for the function. In such cases, the input arguments in
the function are considered as the only input arguments and the
list of fields in the ORDER BY clause are considered as fields
required for data sorting purposes only. However, there is no
restriction in using the fields from the ORDER BY clause as input
arguments in the UDF.
[0032] A query optimizer 76 establishes a plan for executing the
received query, including any user defined functions in the query.
The query optimizer 76 retrieves metadata describing the associated
classes of the user defined functions, and uses the associated
class of each user defined function to determine if the number of
rows in the output of the function is known or if the output to the
function is indeterminate. For classes having known outputs, the
query optimizer 76 can proceed normally. Functions having
indeterminate outputs are handled as part of an alternative
optimization process, in which the potential impact of a function
having a large number of output rows is mitigated. Each of the data
objects extracted by the query parser 72 and the query plan
developed at the query optimizer 76 are provided to the database
engine 64 to execute the query, including the user defined
function.
[0033] The database engine 64 uses the extracted data objects to
execute the called function and return the results of the function
as the mapped output fields. The database engine 64 is provided
with the input fields in the order the input fields are defined in
the user defined function input field metadata. There is no need
for the input field names from the query to match the input field
names in the UDF metadata as the input field metadata names are
just placeholders. Variable input or output fields are expanded
with the right repetition numbers in consecutive positions. The
field name of the expanded variable field is appended with the
ordinal number of the field starting from one. Output fields from
user defined function are mapped in queries using field names from
the UDF metadata or field position identifier in a $# syntax.
Output fields can be renamed using an appropriate syntax, for
example: <$#/field name>[AS]<user defined name>. When
the field position identifiers are not renamed in the output
expression, the query compiler 62 replaces them with output field
names obtained from the user defined function metadata. User
queries can map a subset of output fields from user defined
function in any order irrespective of their order in the metadata.
When output is not explicitly mapped in a query, the results of the
output fields marked as default output fields in the metadata are
returned. These approaches simplify the application programming
interface (API) for user defined function development with an
object class having one member function for registering parameter
input, a second member function with input and output table objects
to initiate the processing, and a third member function to
iteratively process input table object with one or more records and
emit output rows into the output table object according to the
class type of the user defined function.
[0034] The database engine 64 executes the query plan to provide a
query result, and returns the query result to the query source via
the communications interface 56. During runtime, data are loaded
into the input table object according to the user defined function
class type. With the input and output table objects as arguments,
the data processing API of the user defined function is called to
process data. The user defined function emits output into the
output table object which is retrieved each time the processing
function is called. In case there is no input, as in the case of
some table-valued user defined functions, the processing function
is called with an empty input table object.
[0035] One example of an analytic function that can be implemented
in the database system is a multiple regression user defined
function. It will be appreciated that in a multiple regression
analysis, the number of independent variables varies according to
the regression model fitted which is illustrated in the example
below. Accordingly, a user defined function to perform this
function will have variable output fields for beta coefficients and
averages for independent variables, with the size of these variable
fields being defined by a parameter value representing the number
of independent variables in the desired model. The user defined
function, for example, can be implemented as aggregate class user
defined function because it returns one output row for N input rows
processed.
[0036] An example query using a specific instantiation of the
general multiple regression function is presented below. For
example, the query could be used for studying how three independent
expenses are affecting the dependent sales figures using a sales
and cost database table that has data, for example, for ten years,
with four quarters each, at each region and city in each
country:
TABLE-US-00002 SELECT country, MultipleRegression(salesDollarValue,
salespersonTrainingCost, advertizementCampaignCost,
couponPromotionCost) OUTPUT(countObservations, alphaIntercept,
beta1 AS betaSalespersonTrainingCost, beta2 AS
betaAdvertizementCampaignCost, beta3 AS betaCouponPromotionCost,
avgIndependent1 AS avgSalespersonTrainingCost, avgIndependent2 AS
avgAdvertizementCampaignCost, avgIndependent3 AS
avgCouponPromotionCost, avgDependent AS avgSalesDollarValue) WITH
PARAMETER(COUNT_INDEPENDENT_VAR=3) FROM salesAndCost GROUP BY
country;
[0037] In the above example, the first input field is the dependent
variable which is a fixed field. The second input field is the
starting point for the independent variable which is marked as a
variable field, giving the user defined function two input fields.
A first output field is the alpha intercept, `alphaIntercep` and a
second output is the average of the dependent variable,
`avgDependent,` each of which are fixed fields. A third output
field, `countObservations,` is a fixed field representing the
number of observations. A fourth field, marked as a variable field,
is named `beta` and represents the beta coefficients produced by
the regression analysis. A fifth field, also marked as a variable
field, is named `avgIndependent` and represents the average values
of the independent variables.
[0038] Note that the fourth field can be repeated until all the
beta co-efficient fields are mapped. As described previously, the
names of the fields are generated by appending an integer starting
from one. Thus, in the following example, there are three beta
co-efficient fields with field names beta1, beta2 and beta3. The
fifth field, avgIndependent starts only after all the beta fields
are mapped and repeats similarly with number appended to it
starting from one.
[0039] The user defined function requires one parameter and looks
for COUNT_INDEPENDENT_VAR keyword or parameter field name in the
parameter object to decide which multiple regression model to fit.
In case the parameter is not given, since there is only one input
variable field, the validation and resolution API function computes
the repetition number. In this example given, the model specifies
three independent variables and, therefore, three output fields
each for `beta` and `avgIndependent` are generated. Note that, in
the metadata, the variable output fields, beta and avgIndependent,
are marked for dependency resolution from the second input variable
field. If the query does not map `avgIndependent` from
independent-variables or any other field, the user defined function
understands it from the fields requested by the user query in the
output table object and does not project results for them.
[0040] The illustrated database system makes it possible to model
complex analytic applications, including statistical applications,
in SQL queries. Since the user defined function developer knows its
class type, input, output and parameter, the UDF metadata is built
as part of the user defined function for dynamic detection and
validation of syntax and semantics of its usage at query compile
time. There could be additional metadata captured in the database
for managing user defined functions, such as user defined function
usage and installation privileges. The use of the parameter field
allows for user defined functions general enough to allow
processing of many analytical application models by varying only
parameters in queries. Statistical applications such as multiple
regression, least-squares, multivariate analytics and frequency
distribution can be implemented with different statistical models
depending upon the requirements of a given analysis simply by
changing the parameter field of the query. Accordingly,
parameterization makes it easier to represent complex models in
simple key/value pair of parameters.
[0041] The system 50 further provides the ability to map only the
required output fields in any order from user defined functions
makes it easier to have nested user defined function queries where
inner user defined functions return outputs required for the input
of outer user defined functions. This helps in building complex
applications in parts and composed by nesting in queries according
to application requirement. The tight integration of user defined
functions into SQL brings analytic application processing closer to
the data, resulting in data security and less mobility of data
across systems. It also makes it more efficient to build user
defined functions for serving vertical applications through SQL in
a programming language. With table-valued user defined functions
available for use in FROM clause of queries for collecting data
from external sources in conjunction with the analytic user defined
functions, users will find many analytic applications easier to
model and process using data from external sources in SQL.
[0042] FIG. 3 illustrates one method 100 for executing a query. At
102, a call to a user defined function is identified in a
structured query language (SQL) query. At 104, a data partition
associated with the user defined function is identified within the
SQL query. For example, the data partitioning specification can be
extracted from one of a GROUP BY and or an OLAP window expression
(e.g., PARTITION BY, ORDER BY and data frame in ROWS or RANGE
specification) associated with the call to the user defined
function. At 106, the user defined function is executed on the
identified data partition within the OLAP framework to produce an
output.
[0043] FIG. 4 is a schematic block diagram illustrating an
exemplary system 200 of hardware components capable of implementing
the example systems and methods for user defined function
integration disclosed in FIGS. 1-3. The system 200 can include
various systems and subsystems. The system 200 can be a personal
computer, a laptop computer, a workstation, a computer system, an
appliance, an application-specific integrated circuit (ASIC), a
server, a server blade center, a server farm, or any other
appropriate processing component.
[0044] The system 200 can include a system bus 202, a processing
unit 204, a system memory 206, memory devices 208 and 210, a
communication interface 212 (e.g., a network interface), a
communication link 214, a display 216 (e.g., a video screen), and
an input device 218 (e.g., a keyboard and/or a mouse). The system
bus 202 can be in communication with the processing unit 204 and
the system memory 206. The additional memory devices 208 and 210,
such as a hard disk drive, server, stand alone database, or other
non-volatile memory, can also be in communication with the system
bus 202. The system bus 202 operably interconnects the processing
unit 204, the memory devices 206-210, the communication interface
212, the display 216, and the input device 218. In some examples,
the system bus 202 also operably interconnects an additional port
(not shown), such as a universal serial bus (USB) port.
[0045] The processing unit 204 can be a computing device and can
include an application-specific integrated circuit (ASIC). The
processing unit 204 executes a set of instructions to implement the
operations of examples disclosed herein. The processing unit can
include a processing core.
[0046] The additional memory devices 206, 208 and 210 can store
data, programs, instructions, database queries in text or compiled
form, and any other information that can be needed to operate a
computer. The memories 206, 208 and 210 can be implemented as
computer-readable media (integrated or removable) such as a memory
card, disk drive, compact disk (CD), or server accessible over a
network. In certain examples, the memories 206, 208 and 210 can
comprise text, images, video, and/or audio.
[0047] Additionally, the memory devices 208 and 210 can serve as
databases or data storage. Additionally or alternatively, the
system 200 can access an external data source or query source
through the communication interface 212, which can communicate with
the system bus 202 and the communication link 214.
[0048] In operation, the system 200 can be used to implement a
database system that executes user defined functions within an
online analytics processing (OLAP) framework in response to an
appropriate query. The queries can be formatted in accordance with
various query database protocols, including SQL. Computer
executable logic for implementing the real-time analytics system
resides on one or more of the system memory 206, and the memory
devices 208, 210 in accordance with certain examples. The
processing unit 204 executes one or more computer executable
instructions originating from the system memory 206 and the memory
devices 208 and 210. The term "computer readable medium" as used
herein refers to a medium that participates in providing
instructions to the processing unit 204 for execution.
[0049] What have been described above are examples of the present
invention. It is, of course, not possible to describe every
conceivable combination of components or methodologies for purposes
of describing the present invention, but one of ordinary skill in
the art will recognize that many further combinations and
permutations of the present invention are possible. Accordingly,
the present invention is intended to embrace all such alterations,
modifications, and variations that fall within the scope of the
appended claims.
* * * * *