Materializing Expressions Within In-memory Virtual Column Units To Accelerate Analytic Queries Mishra; Aurosish ; et al. [Oracle International Corporation]

Materializing Expressions Within In-memory Virtual Column Units To Accelerate Analytic Queries

Mishra; Aurosish ; et al.

Patent Application Summary

U.S. patent application number 15/146799 was filed with the patent office on 2017-02-02 for materializing expressions within in-memory virtual column units to accelerate analytic queries. The applicant listed for this patent is Oracle International Corporation. Invention is credited to Shasank K. Chavan, Amit Ganesh, Allison L. Holloway, Jesse Kamp, Ramesh Kumar, Tirthankar Lahiri, Zhen Hua Liu, Vineet Marwah, Aurosish Mishra, Niloy Mukherjee.

Application Number	20170031975 15/146799
Document ID	/
Family ID	56610006
Filed Date	2017-02-02

United States Patent Application	20170031975
Kind Code	A1
Mishra; Aurosish ; et al.	February 2, 2017

MATERIALIZING EXPRESSIONS WITHIN IN-MEMORY VIRTUAL COLUMN UNITS TO ACCELERATE ANALYTIC QUERIES

Abstract

Techniques are described for materializing pre-computed results of expressions. In an embodiment, a set of one or more column units are stored in volatile or non-volatile memory. Each column unit corresponds to a column that belongs to an on-disk table within a database managed by a database server instance and includes data items from the corresponding column. A set of one or more virtual column units, and data that associates the set of one or more column units with the set of one or more virtual column units, are also stored in memory. The set of one or more virtual column units includes a particular virtual column unit storing results that are derived by evaluating an expression on at least one column of the on-disk table.

Inventors:

Mishra; Aurosish; (Belmont, CA) ; Chavan; Shasank K.; (Menlo Park, CA) ; Holloway; Allison L.; (San Carlos, CA) ; Kamp; Jesse; (Castro Valley, CA) ; Kumar; Ramesh; (Foster City, CA) ; Liu; Zhen Hua; (San Mateo, CA) ; Mukherjee; Niloy; (Belmont, CA) ; Ganesh; Amit; (San Jose, CA) ; Lahiri; Tirthankar; (Palo Alto, CA) ; Marwah; Vineet; (San Ramon, CA)

Applicant:

Name	City	State	Country	Type
Oracle International Corporation	Redwood Shores	CA	US

Family ID:

56610006

Appl. No.:

15/146799

Filed:

May 4, 2016

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
62198572	Jul 29, 2015
62245959	Oct 23, 2015

Current U.S. Class:	1/1
Current CPC Class:	G06F 16/2282 20190101; G06F 16/2455 20190101; G06F 16/2454 20190101; G06F 16/2393 20190101
International Class:	G06F 17/30 20060101 G06F017/30

Claims

1. A method comprising: storing, in volatile or non-volatile memory, a set of one or more column units; wherein the set of one or more column units correspond to one or more columns; wherein each column unit of the set of one or more column units includes data items from the column to which the column unit corresponds; wherein the one or more columns belong to one or more on-disk tables within a database managed by a database server instance; wherein the set of one or more column units includes a particular column unit that corresponds to at least one particular column of a particular table within the database; storing, in volatile or non-volatile memory, a set of one or more virtual column units; wherein the set of one or more virtual column units includes a particular virtual column unit that stores results that are derived by evaluating an expression on the at least one particular column; storing, in volatile or non-volatile memory, data that associates the set of one or more column units with the set of one or more virtual column units.

2. The method of claim 1 further comprising maintaining, in volatile or non-volatile memory, an IMIU that stores pre-computed results that: are derived from values in the one or more columns; and do not have a one-to-one correspondence with rows of the one or more on-disk tables.

3. The method of claim 1, wherein the particular virtual column unit stores data for a particular virtual column; wherein the particular virtual column is defined using the expression; and wherein the particular virtual column does not correspond to an on-disk column.

4. The method of claim 1, wherein the set of one or more column units belong to a first in-memory columnar unit stored in volatile or non-volatile memory; wherein the set of one or more virtual column units belong to a first in-memory expression unit stored in memory; wherein the first in-memory columnar unit and the first in-memory expression unit correspond to a first set of rows from the one or more columns; wherein the results stored in the particular virtual column unit are derived using values in the first set of rows from the at least one particular column.

5. The method of claim 4, the method further comprising: storing, in volatile or non-volatile memory, a second in-memory columnar unit and a second in-memory expression unit; wherein the second in-memory columnar unit and the second in-memory expression unit correspond to a second set of rows from the one or more columns; wherein the second in-memory columnar unit includes a second set of one or more column units corresponding to the one or more columns; wherein the second in-memory expression unit includes a second set of one or more virtual column units; wherein the second set of one or more virtual column units includes a second virtual column unit stores results for the expression that are derived using values in the second set of rows from the at least one particular column.

6. The method of claim 4, further comprising: maintaining, within a shared metadata unit, data that indicates which rows in the first set of rows have become stale within the first in-memory columnar unit and the first in-memory expression unit; storing, in volatile or non-volatile memory, data that associates the shared metadata unit with the first in-memory columnar unit and the first in-memory expression unit.

7. The method of claim 1, further comprising: receiving a query that includes the expression that operates on the at least one particular column; in response to receiving the query, identifying, based on the data that associates the set of one or more column units with the set of one or more virtual column units, the particular virtual column unit; rewriting the query to access the results stored in the virtual column unit rather than re-computing the results for the expression.

8. The method of claim 1, further comprising: receiving a query that includes a second expression that operates on at least one column of the one or more columns; wherein results for the second expression are not stored in the set of one or more virtual column units; in response to receiving the query, updating a set of statistics for the second expression during evaluation of the second expression; determining, based on the set of statistics updated during evaluation of the second expression, whether to materialize results for the second expression within volatile or non-volatile memory; in response to determining to materialize results for the second expression, generating and storing, within volatile or non-volatile memory, at least one virtual column unit that includes results of evaluating the second expression.

9. The method of claim 1, further comprising: determining, based on a set of statistics maintained for the expression, that the results for the expression should be evicted from volatile or non-volatile memory; in response to determining that the results should be evicted from volatile or non-volatile memory, removing the particular virtual column unit from volatile or non-volatile memory.

10. The method of claim 1, further comprising repopulating the set of one or more virtual column units in response to detecting that the set of one or more column units has been repopulated with data from the one or more columns.

11. The method of claim 1, wherein the results that are populated in the particular virtual column unit are computed from a set of data as the set of data existed when the set of one or more column units were populated; wherein the results are populated in the particular virtual column unit by a background process.

12. One or more non-transitory computer-readable media storing instructions which, when executed by one or more processors, cause: storing, in volatile or non-volatile memory, a set of one or more column units; wherein the set of one or more column units correspond to one or more columns; wherein each column unit of the set of one or more column units includes data items from the column to which the column unit corresponds; wherein the one or more columns belong to one or more on-disk tables within a database managed by a database server instance; wherein the set of one or more column units includes a particular column unit that corresponds to at least one particular column of a particular table within the database; storing, in volatile or non-volatile memory, a set of one or more virtual column units; wherein the set of one or more virtual column units includes a particular virtual column unit that stores results that are derived by evaluating an expression on the at least one particular column; storing, in volatile or non-volatile memory, data that associates the set of one or more column units with the set of one or more virtual column units.

13. The one or more non-transitory computer-readable media of claim 12, wherein the particular virtual column unit stores data for a particular virtual column; wherein the particular virtual column is defined using the expression; and wherein the particular virtual column does not correspond to an on-disk column.

14. The one or more non-transitory computer-readable media of claim 12, wherein the set of one or more column units belong to a first in-memory columnar unit stored in volatile or non-volatile memory; wherein the set of one or more virtual column units belong to a first in-memory expression unit stored in memory; wherein the first in-memory columnar unit and the first in-memory expression unit correspond to a first set of rows from the one or more columns; wherein the results stored in the particular virtual column unit are derived using values in the first set of rows from the at least one particular column.

15. The one or more non-transitory computer-readable media of claim 14, wherein the instructions further cause: storing, in volatile or non-volatile memory, a second in-memory columnar unit and a second in-memory expression unit; wherein the second in-memory columnar unit and the second in-memory expression unit correspond to a second set of rows from the one or more columns; wherein the second in-memory columnar unit includes a second set of one or more column units corresponding to the one or more columns; wherein the second in-memory expression unit includes a second set of one or more virtual column units; wherein the second set of one or more virtual column units includes a second virtual column unit stores results for the expression that are derived using values in the second set of rows from the at least one particular column.

16. The one or more non-transitory computer-readable media of claim 14, wherein the instructions further cause: maintaining, within a shared metadata unit, data that indicates which rows in the first set of rows have become stale within the first in-memory columnar unit and the first in-memory expression unit; storing, in volatile or non-volatile memory, data that associates the shared metadata unit with the first in-memory columnar unit and the first in-memory expression unit.

17. The one or more non-transitory computer-readable media of claim 12, wherein the instructions further cause: receiving a query that includes the expression that operates on the at least one particular column; in response to receiving the query, identifying, based on the data that associates the set of one or more column units with the set of one or more virtual column units, the particular virtual column unit; rewriting the query to access the results stored in the virtual column unit rather than re-computing the results for the expression.

18. The one or more non-transitory computer-readable media of claim 12, wherein the instructions further cause: receiving a query that includes a second expression that operates on at least one column of the one or more columns; wherein results for the second expression are not stored in the set of one or more virtual column units; in response to receiving the query, updating a set of statistics for the second expression during evaluation of the second expression; determining, based on the set of statistics updated during evaluation of the second expression, whether to materialize results for the second expression within volatile or non-volatile memory; in response to determining to materialize results for the second expression, generating and storing, within volatile or non-volatile memory, at least one virtual column unit that includes results for the second expression.

19. The one or more non-transitory computer-readable media of claim 12, wherein the instructions further cause: determining, based on a set of statistics maintained for the expression, that the results for the expression should be evicted from volatile or non-volatile memory; in response to determining that the results should be evicted from volatile or non-volatile memory, removing the particular virtual column unit from volatile or non-volatile memory.

20. The one or more non-transitory computer-readable media of claim 12, wherein the instructions further cause repopulating the set of one or more virtual column units in response to detecting that the set of one or more column units has been repopulated with data from the one or more columns.

21. The one or more non-transitory computer-readable media of claim 12, wherein the results that are populated in the particular virtual column unit are computed from a set of data as the set of data existed when the set of one or more column units were populated; wherein the results are populated in the particular virtual column unit by a background process.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of Provisional Appln. 62/198,572, filed Jul. 29, 2015, and Provisional Appln. 62/245,959, filed Oct. 23, 2015, the entire contents for both of which are hereby incorporated by reference as if fully set forth herein, under 35 U.S.C. .sctn.119(e).

[0002] This application is related to: [0003] application Ser. No. ______, (Attorney Docket No. 50277-4852) entitled "TRACKING AND MAINTAINING EXPRESSION STATISTICS ACROSS DATABASE QUERIES", filed same day herewith, [0004] application Ser. No. ______, (Attorney Docket No. 50277-4854) entitled "MATERIALIZING INTERNAL COMPUTATIONS IN-MEMORY TO IMPROVE QUERY PERFORMANCE", filed same day herewith, and [0005] application Ser. No. 14/337,179, entitled "MIRRORING, IN MEMORY, DATA FROM DISK TO IMPROVE QUERY PERFORMANCE", filed Jul. 21, 2014, the entire contents for each of which are hereby incorporated by reference as if fully set forth herein.

FIELD OF THE DISCLOSURE

[0006] The present disclosure relates to storing and maintaining evaluation results for expressions and internal computations within in-memory storage units.

BACKGROUND

[0007] Database queries include constructs referred to as expressions, where an expression is typically a combination of one or more values, operators, and/or functions. During query execution, a database server instance evaluates the query expressions to produce a query result. Query expressions may cause a database server instance to perform a variety of operations such as filtering, projecting, aggregating, sorting, or otherwise manipulating values stored within a database object. Some query expressions require process-intensive operations that consume a significant amount of resources per evaluation. In some cases, these expressions appear across multiple queries or multiple parts of the same query. When these expressions are evaluated over large datasets repeatedly, the performance of a database server instance may be significantly impacted.

[0008] One approach that database administrators use to optimize execution of computationally-expensive queries is to create materialized views. A materialized view is a database object that contains the results of a query. For example, a materialized view may capture the final result of two tables joined together. Once the materialized view is created, subsequent queries may be rewritten to access the materialized view rather than performing another table join operation. Thus, materialized views obviate the need for multiple executions of queries that may require significant time to compute.

[0009] While materialized views are useful for capturing the final results of a query, materialized views do not capture data that tracks the usage of individual expressions within the query. Further, materialized views do not capture intermediary expression results that reside solely in memory during query execution. For example, a hash join operation computes hash values from a join attribute to determine which records to join from two different tables. A materialized view may store the combined records representing the final result of the join operation. However, the materialized view does not capture data relating to the hash operation since the resultant hash values are not part of the final query result and are not persistently stored within the database. Similarly, the results of other expression evaluations and internal computations that reside only in memory during query execution are not typically captured by a database server.

[0010] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] Various embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[0012] FIG. 1 is a block diagram depicting an example system design for creating and maintaining in-memory units that cache results of expression evaluations and internal computations, according to an embodiment;

[0013] FIG. 2 is a flowchart depicting an example process for performing background population for an in-memory expression, according to an embodiment;

[0014] FIG. 3 is a block diagram depicting an example Shared Metadata Unit that maintains metadata that is shared by an IMCU-IMEU pair, according to an embodiment;

[0015] FIG. 4 is a flowchart depicting an example process for identifying and using in-memory expressions to optimize query execution, according to an embodiment;

[0016] FIG. 5 is a block diagram depicting an example design for an in-memory internal unit, according to an embodiment;

[0017] FIG. 6 is a flowchart depicting an example process for populating an in-memory internal unit with computation results and using materialized computation results to optimize query processing, according to an embodiment; and

[0018] FIG. 7 is a block diagram that illustrates a computer system upon which an embodiment may be implemented.

DETAILED DESCRIPTION

[0019] In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

General Overview

[0020] Analytic queries often contain complex expressions or calculations that can consume a large amount of system resources during execution. By automatically identifying these frequently used complex expressions or calculations and caching their results, a database server may improve query performance and increase scalability by reducing repeated computations and thus, reducing the system resources consumed during execution. In addition, query processing on certain data-types can be improved by storing an efficient in-memory optimized representation.

[0021] In various embodiments, computer systems, stored instructions, and technical steps are described for capturing and processing "in-memory expressions" (also referred to herein as "IMEs". An "in-memory expression", as used herein, is broadly classified as any expression for which the results have been pre-computed and cached in volatile memory. The pre-computed results of an IME are referred to herein as "pre-computed results" or PCRs. PCRs may be both cached in volatile memory and persistently-stored, or may exist only in volatile memory (without any persistently-stored copy). PCRs may include results computed at different levels of granularity, which provides flexibility during query optimization and execution.

[0022] In order to store PCRs of IMEs, a set of in-memory units, referred to herein as "in-memory columnar units (IMCUs)" and "in-memory expression units (IMEUs)", are allocated and maintained by a database server instance. An IMCU stores a set of column units, where each column units corresponds to a column that belongs to an on-disk table within a database managed by a database server instance and includes data items from the corresponding column. An IMEU stores a set of one or more "virtual column units", where each virtual column unit may store PCRs that are derived by evaluating an expression on at least one of the one or more columns. A pointer and/or other association data is stored in volatile or non-volatile memory (e.g., in the header of the IMCU) to associate the set of column units with the set of virtual column units. Queries against the IMCU may access PCRs from the set of virtual columns rather than re-computing the expression results.

[0023] According to an embodiment, the database server instance also maintains in-memory units referred to herein as "in-memory internal units (IMIUs)". IMIUs store a set of metadata values for computations that are being tracked by a database server instance. The IMIU may further store one or more sets of results for one or more computations. In an embodiment, responsive to the database server instance receiving a query, the database server instance identifies a set of computations for evaluation during execution of the query. Responsive to identifying the set of computations, the database server instance evaluates at least one computation in the set of computations to obtain a first set of results for a first computation in the set of computations. After evaluating at least one computation, the database server instance stores, within an in-memory unit, the first set of computation results. The database server also stores mapping data that maps a set of metadata values associated with the first computation to the first set of computation results. If subsequent queries are received that involve the computation, the database server instance may access the cached computation results from IMIU rather than re-evaluating the computation.

System Overview

[0024] In an embodiment, a database server comprises an IME framework. The IME framework supports a variety of functions for capturing and processing IMEs including, without limitation [0025] Identifying expressions and other computations for in-memory materialization; [0026] Creating and maintaining in-memory units for caching IMEs and other computations;

[0027] and [0028] Querying the in-memory units for faster projection and predicate evaluation.

[0029] FIG. 1 is a block diagram depicting an example system design for creating and maintaining in-memory units that cache results of expression evaluations and internal computations, according to an embodiment. Database server instance 100 includes background process pool 110, in-memory area 120, expression statistics store 150, and foreground process pool 140.

[0030] Background process pool 110 includes a plurality of background processes, including in-memory coordinator (IMCO) 112 and worker processes 114, which are responsible for performing background tasks within in-memory area 120. In an embodiment, background population and other tasks for the background processes are enqueued as they arise. IMCO 112 may then coordinate execution of these tasks with worker processes 114. IMCO 112 may assign and balance tasks amongst worker processes 114, which may perform certain tasks in parallel.

[0031] In-memory area 120 stores a plurality of in-memory units including IMCUs 122a to 122n, in-memory expression units IMEUs 124a to 124n, in-memory internal units IMIUs 126a to 126n, and shared metadata units (SMUs) 128a to 128n. Each in-memory unit may comprise one or more contiguous chunks of memory, referred to herein as "in-memory extents", that are specifically allocated for the unit. Metadata that defines the various attributes of the in-memory units may also be maintained by database server instance 100. For instance, the metadata may include, without limitation, extent lists that identify which in-memory extents belong to a particular in-memory unit, free space pointers that point to available storage space within the unit, and identifiers that uniquely identify each in-memory unit. Each of the different types of in-memory units is described in further detail below.

[0032] Foreground process pool 140 includes a set of foreground processes that are responsible for processing queries as they are received. The foreground processes include query processes 142, which perform tasks such as query compilation, query execution plan generation and optimization, parallel query coordination, and query execution. The foreground processes further include transactional processes 144, which maintain the atomicity and consistency of database transactions with respect to in-memory area 120.

[0033] Expression statistics store (ESS) 150 tracks various statistics and other metadata values for expressions that have been submitted in queries. Although depicted as residing outside of in-memory area 120, in some implementations, ESS 150 may be maintained, in whole or in part, within in-memory area 120. In addition or alternatively, ESS 150 may be stored persistently on disk or in some other non-volatile storage media.

Expression Evaluation

[0034] During query execution, query processes 142 may evaluate one or more query expressions that are included within a query. An "expression" in this context refers to a query construct that specifies one or more operations and one or more operands on which the one or more operations are performed. The operations may be specified using a set of commands, such as structured query language (SQL) statements, mathematical operators, etc. The operands may correspond to objects, such as columns and column functions, constants, or other values. As an example, the expression "a*b" includes the operands "a" and "b", which correspond to different columns in a table. The expression further includes the multiplication operator "*", which causes a database server instance to multiply values from columns "a" and "b" when evaluating the expression. The combination and types of operators used may vary widely between different query expressions.

[0035] Expressions may occur in various locations within a database query. In a SQL statement, for instance, expressions may exist in a select list, aggregation function, a SQL clause, or within some other SQL construct. Table 1 below depicts example database queries, where each query includes one or more expressions. The following queries might not be useful in real world applications, but are provided for illustrative purposes to discuss the identification and tracking of query expressions. Each query is associated with a query identifier to aid in the discussion below.

TABLE-US-00001 TABLE 1 SAMPLE QUERIES INCLUDING DIFFERENT EXPRESSIONS QUERY ID QUERY Q1 Select sal + comm From emp Q2 Select * From emp Where upper(name) = `SCOTT` Q3 Select sum(sal) From emp Group by to_date(hiredate, format) - 2 Q4 Select e.sal, d.name From emp e, dept d Where e.deptno * 2 = d.deptno * 2 / 3 Queries Q1 to Q4 include various types of expressions in different parts of a query. For instance, query Q1 includes the expression "sal + comm." in the select list, Q2 includes the expression "upper(name)" in the where clause, and Q3 includes the expression "to_date(hiredate, format) - 2" in the group by clause. In Q4, two expressions "deptno * 2" and "deptno * 2/3" are located in the where clause.

[0036] An expression may be completely subsumed by another expression, according to an embodiment. Expressions that are subsumed by other expressions may also be referred to herein as sub-expressions. An expression that is a composite of multiple sub-expressions but that is not a sub-expression itself is herein referred to as a "top-level expression". For example, the composite expression "(a+b)*(c+d)" may be considered a top-level expression that includes the sub-expressions "a+b" and "c+d".

[0037] The example expressions provided above included constructs that were explicitly recited in a query. However, in some embodiments, a database server instance may identify and cache implicit expressions and/or other computations. An implicit expression is an expression that a query optimizer generates during query compilation to optimize a query execution plan even though the expression was not explicitly recited in the database query. As an example, if a database query includes the expression "alike `cat`", the optimizer may break the expression up into two parts: "a>`c`" and "alike `cat`". The first expression may be included as a preliminary filtering measure to reduce the number of times the relatively costly expression "a like `cat`" is evaluated. The implicit expression may be added to a query execution plan as a target operator that operates on a set of input rows to derive a set of output rows.

[0038] Explicit and implicit expressions are examples of computations where there is a one-to-one mapping between rows in the expression results and the table rows from one or more operand columns used to compute the expression results. For instance, each row of expression results for "a+b" may be derived from values of a corresponding table row in columns "a" and "b". Other forms of computations may not have a one-to-one mapping between table rows and computation results. As an example, a bloom filter operation may apply a hash function to the values of distinct dictionary entries of a join key column. As some values within the column may not be distinct, multiple rows may map to the same dictionary entry and hash value.

[0039] Some computations are computed internally during query execution without being explicitly recited in a query. For instance, operations such as data conversions, hashing, concatenations, etc. may be performed even though such operations were not explicitly recited in the query. In some cases, these operations are selected by a query optimizer during query compilation to optimize execution of the query. Database server instance 100 may track and cache explicit expressions, implicit expressions, internal computations, and/or other forms of computations.

In-Memory Columnar Units

[0040] IMCUs 122a to 122n each store a set of one or more column units (CUs), such as CUs 132, which correspond to one or more on-disk columns that belong to one or more on-disk tables within a database managed by database server instance 100. Conceptually, each of IMCUs 122a to 122n divides the rows of a table into separate chunks. The rows within a chunk are converted to column-major format and may be compressed based on a chosen in-memory format. The rows of the chunk are then loaded into an IMCU as one or more CUs, where the CUs adhere to the in-memory format. As an example, IMCU 122a may include a first set of one or more column units, where each column unit corresponds to a different respective column of table t and stores the first million rows for the respective column in compressed or uncompressed column-major format. Similarly, IMCU 122b may comprise a second set of one or more column units, where each column unit in the second set corresponds to a different respective column of table t and stores the second million rows for the respective column in compressed or uncompressed format. The remaining rows of table t may similarly be divided into chunks and cached in other IMCUs within in-memory area 120.

In-Memory Expression Units

[0041] Each of IMEUs 124a to 124n is an in-memory cache where IMEs are stored for future reuse to improve query performance. In an embodiment, database server instance 100 allocates a fixed amount of memory for IMEUs 124a to 124n in an on-demand basis and maintains each of IMEUs 124a to 124n as a bounded cache. Once space is allocated for an IMEU, database server instance 100 identifies expressions that will benefit from caching and stores the evaluation results for the IME in one or more of IMEUs 124a to 124n.

[0042] An IMEU may be implemented as a logical extension of an IMCU. An IMCU that an IMEU logically extends is herein referred to as a "parent" IMCU. Within the parent IMCU, mapping data, such a pointer, is maintained within the IMCU header to link the set of CUs maintained in the IMCU with the expression results cached in the IMEU. The IMEU may also store a back pointer to the parent IMCU, depending on the particular implementation. During scan operations on the IMCU, the pointer may be used to identify the corresponding IMEU and any expression results cached therein. Each of IMEUs 124a to 124n thus acts as a per-IMCU container for storing expression results for the parent IMCU.

[0043] In an embodiment, database server instance 100 may automatically allocate space for IMEUs 124a to 124n to extend the corresponding parent IMCUs. For instance, when an IMCU is created for a particular segment that has been enabled for in-memory storage, a corresponding IMEU may also be created and mapped to the newly created IMCU. If the parent IMCU is dropped from memory, then the corresponding IMEU may also dropped from memory.

[0044] In an embodiment, each IMEU stores PCRs for expressions that operate on one or more columns corresponding to the column units stored in the parent IMCU. As an example, IMCU 122a may store values for rows one to ten of columns "a" and "b" of a particular table while IMCU 122b stores values from rows eleven to twenty from the same table columns. If the expression "a*b" is identified as an IME, then IMCU 122a may include a pointer to IMEU 124a, which stores PCRs of "a*b" for the first ten rows. IMCU 122b may include a pointer to IMEU 124b, which stores the expressions results of "a*b" for the next ten rows. Similarly, others rows from the operand columns and PCRs for the IME may respectively be distributed across different IMCUs and IMEUs.

[0045] An IMEU may also inherit a set of attributes from the parent IMCU and a base on-disk segment that was used to populate the parent IMCU. For example, if the base segment and parent IMCU are part of a cluster that is distributed or duplicated, then the IMEU may also be distributed or duplicated across multiple nodes of a distributed database. As another example, data in the IMEU may inherit the compression scheme of the parent IMCU, thereby causing a database server instance to compress data in the IMEU using the same compression algorithm used to compress column units in the IMCU.

Virtual Column Units

[0046] PCRs are similar to columns in that the PCRs of a given expression include a single value per row. Consequently, PCRs can be considered virtual columns, and the structures that stores PCRs are referred to herein as "virtual column units" or VCUs. In an embodiment, each of IMEUs 124a to 124n stores PCRs as a set of one or more virtual column units. A VCU may store values from the PCRs of an expression in compressed or uncompressed one-result-value-per-row format. As an example, database server instance 100 may derive a column vector in response to evaluating the expression "a/b", where each row in the column vector includes a result value obtained by dividing a first value in a row of column "a" by a second value in the corresponding row of column "b". Thus, the first row in the column vector has a result value obtained by dividing the first value in column "a" by the first value in column "b", the second result value is obtained by dividing the second value in columns "a" by the second value in column "b", etc. If the expression "a/b" is identified for caching, then the database server instance creates, within an IMEU, a virtual column unit that stores the PCRs for "a/b" such that consecutive values within the column vector are stored contiguously in memory.

[0047] A virtual column may be comprised of multiple VCUs that are distributed across different IMEUs. This scenario may occur where actual columns of a table that serve as operands to an expression are distributed across different IMCUs. As an example, IMCU 122a may store the first one hundred rows of columns "a" and "b" from table "t", and IMCU 122b may store the second one hundred rows of columns "a" and "b". In order to cache the PCRs for the expression "a*b", a database server instance may store a first virtual column unit in IMEU 124a that contains the PCRs for the first hundred rows and a second virtual column unit in IMEU 124b that contains the PCRs for the second hundred rows. Other virtual column units that belong to the same virtual column may be distributed across other IMEUs such that the virtual column units maintain a one-to-one relationship between the number of rows in the virtual column unit and the number of rows in the column units of the associated IMCU.

[0048] When an IMEU caches PCRs for a single IME, a single VCU may be used to store the PCRs. If the IMEU caches PCRs for multiple IMEs, then the IMEU may store PCRs across multiple VCUs, where different VCUs store the PCRs for different IMEs. For instance, IMEU 124a stores VCUs 134, which may include a first virtual column unit materializing PCRs for a first expression (e.g., "a*b") and one or more additional virtual column units that materialize the PCRs for additional expressions (e.g., "a+c", "upper(b)", etc.). VCUs 134 may logically extend the CUs maintained within the parent IMCU per the mapping data that is maintained to link the parent IMCU to the IMEU.

[0049] A virtual column may be associated with one or more column identifiers, which may be associated with the one or more VCUs that are part of the virtual column. Example column identifiers include, without limitation: [0050] Internal column numbers that are based on a logical ordering of columns within a table; [0051] Segment column numbers that are based on a physical ordering of columns stored on disk; and [0052] User column numbers that are based on a user-defined logical ordering of columns within a table. Logical column numbers may be assigned to virtual columns sequentially or in any other manner that uniquely identifies a position of the virtual column within a corresponding table. Segment column numbers may be assigned a null or zero value since virtual columns are not persistently stored on disk. However, in some database management systems, certain layers of a database server instance may rely on the segment number to uniquely identify a column. In such scenarios, a false segment column number may be used for the virtual column. The false segment column number may be selected such that it does not conflict with segment column number used for non-virtual columns of the table that are physically stored on disk. Column numbers and other metadata that define properties of the virtual column may be stored in a header of the virtual column.

[0053] For a given expression, the pre-computed results of the expression for all rows in a table are collectively referred to as the "entire virtual column" for that expression. In some instances, the entire virtual column for an expression may be stored in one or more VCUs. When the entire virtual column for an expression is cached, evaluation result for every row is stored and accessible from at least one IMEU. In other instances, a first portion of the virtual column is stored in memory while a second portion is not pre-computed. In such scenarios, only a subset of the rows in the evaluation results are stored in one or more IMEUs. A different subset of rows may not be pre-computed when a portion of the columns that serve as operands are not stored in IMCUs due to memory constraints or other reasons. If this occurs, the portion of the VCU that is cached may be directly accessed from in-memory area 120. The portion that was not pre-computed may have values computed at runtime. If some of the pre-computed results are stored on disk instead of in volatile memory, then the pre-computed results may be loaded into volatile memory at runtime of a query that requires evaluation of the expression that corresponds to the results.

[0054] In an embodiment, VCUs are compressed using one or more compression techniques, such as dictionary encoding, run-length encoding, etc. The VCUs may be compressed at the same level as the CUs within the parent IMCU or at a different level of compression, depending on the particular implementation. If compressed at a different level, a compression analyzer may be used to analyze the benefits of various compression algorithms if applied to the VCU data and recommend a level of compression. The VCU may be compressed at a higher level than the CUs in the parent IMCU to minimize the space footprint within in-memory area 120 for each of the materialized expressions.

Identification of Expressions for In-Memory Materialization

[0055] In an embodiment, database server instance 100 includes logic for identifying expressions that will benefit from in-memory materialization within IMEUs 124a to 124n. The expressions may be identified from one or more sources. Example sources may include, without limitation, user-defined virtual columns that are explicitly identified by a user for caching and query expressions that have not been explicitly identified by a user for caching but that are automatically detected and selected for caching by database server instance 100.

[0056] A user, such as a database administrator, database client, or other application program, may explicitly define virtual columns for in-memory materialization. The user may define the virtual column, and the expressions used to populate the in-memory columns, through a database command. As an example, a virtual column may be defined through the following data definition language (DDL) statement: "create table t (a number, v as (2*a), c number)". Here, "a" and "c" are regular on-disk columns, and "v" is a virtual column. The virtual column "v" is defined as an expression on column "a" multiplied by the constant "2". Other virtual columns may similarly be defined as an expression on one or more operands, where the operands may include physical columns and/or constants. The virtual column has no physical allocation and its values are not physically stored on disk. The virtual column may appear like a physical columns when queried, but the values are computed or, if materialized in memory, retrieved from one or more IMEUs at runtime during query execution.

[0057] In addition or as an alternative to materializing user-defined virtual columns, database server instance 100 may automatically detect expressions that would benefit from in-memory materialization. Database server instance 100 may determine which expressions would benefit from in-memory materialization based at least in part on expression statistics maintained in ESS 150. The expressions statistics that are used to determine whether to materialize an expression may vary from implementation to implementation. As an example, database server instance 100 may select the top n most frequently evaluated query expressions from ESS 150 to materialize. In another example, database server instance 100 may select the top n most costly query expressions from ESS 150 to materialize. The cost of an expression may be determined based on the amount of processing and/or other computational resources consumed during expression evaluation. In another example, the database server instance 100 may multiply the evaluation count of an expression by a per evaluation cost of the expression to compute a ranking value for the expression. In-memory area 120 may cache the highest ranked expressions as determined by the ranking values associated with each expression. In other embodiments, other statistics and criteria may be used to select expressions to materialize in memory.

[0058] For each unique expression, ESS 150 may track [0059] Compile-time information, such as static costs, expression text, etc.; and [0060] Run-time information such as dynamic costs, expression evaluation counts, timestamps of expression evaluation, etc. Database server instance 100 may use one or more of the statistics to rank, sort, and select expressions for materialization within in-memory area 120. As an example, database server instance 100 may determine, based on the compile-time and run-time information stored in ESS 150, which expressions being tracked are currently the "hottest". A heat level value or ranking for an expression may be computed as a function of evaluation frequency and evaluation costs. ESS 150 may sort expression entries by their ranking to facilitate identification of the top n hottest expressions.

[0061] In an embodiment, expressions identified for in-memory materialization are added to a corresponding table as a hidden virtual columns. Hidden virtual columns, like user-defined virtual columns, do not have their values physically stored on disk and may be distributed across one or more in-memory VCUs. Hidden virtual columns differ from user-defined virtual columns in that the hidden virtual columns are not visible to the user. As an example, when a "select *" or "describe" operation is performed on a table, the user may view which physical and user-defined virtual columns belong to the table. However, the hidden virtual column information is not returned or displayed.

[0062] Predictive analytics may also factor into the selection of expressions for caching, according to an embodiment. Database server instance 100 may include an analytic component that predicts, based on statistics maintained in ESS 150, which expressions are likely to become hot based on trends detected from recently received queries. For example, if certain queries extract the day part of a date column, the analytic component may predict that storing other parts of the date (such as months, years) as hidden virtual columns may be helpful. Similarly, database server instance 100 may predict which other expressions will become "hot" based on frequent co-occurrence patterns tracked within ESS 150.

Background Population of IMEUs

[0063] Once an expression is identified for in-memory materialization, database server instance 100 evaluates the expression and populates one or more IMEUs with the PCRs. During population, a one-to-one mapping may be maintained between rows in the IMEU and rows in the parent IMCU. As an example, if database server instance 100 is populating in-memory area 120 with evaluation results for the expression "e/f", database server instance 100 may store the evaluation results in the IMCUs that contain values from the operand column used to compute the results. Thus, if IMCU 122a stores the first million rows for columns "e" and "f", then database server instance 100 generates a VCU in IMEU 124a that includes the first million rows of the PCRs. Similarly, if IMCU 122b stores the second million rows for columns "e" and "f", then database server instance 100 generates a VCU in IMEU 124b that includes the second million rows of the PCRs.

[0064] In an embodiment, background process pool 110 includes a set of background processes that are responsible for populating IMEUs with PCRs. By using a background process, expression result values may be loaded into VCUs without the foreground query processes suffering a drop in performance. Background process pool 110 includes IMCO 112 for coordinating background population of IMEUs 124a to 124n, and worker processes 114 that load expression result values into IMEUs 124a to 124n. IMCO 112 may distribute population tasks across different worker processes, which may operate in parallel to load PCRs into IMEUs 124a to 124n.

[0065] FIG. 2 depicts an example process for performing background population for an in-memory expression, according to an embodiment. At step 202, a background process, such as IMCO 112, queries ESS 150 to identify hot expressions for in-memory materialization. Querying of ESS 150 may be performed on a periodic basis, triggered by receipt of a database query, or performed responsive to some other event, depending on the particular implementation.

[0066] At step 204, the background process determines, based on the query, one or more candidate expressions to materialize in memory. For example, the background process may query ESS 150 for the top n "hottest" expression ranked as a function of evaluation count and cost, as previously described. If any of the top n expressions returned by ESS 150 are not yet materialized in memory, the background process may select the expression as a candidate for materialization.

[0067] At step 206, the background process determines whether to evict any materialized expressions from in-memory area 120 to make space for the candidate expression(s) selected for materialization. If there is not enough space to cache the candidate expression(s), then the process continues to step 208. Otherwise, the process continues to step 210.

[0068] At step 208, the background process evicts one or more virtual columns from in-memory area 120. The background process may implement a cache replacement policy to determine which virtual columns to evict. In an example, the background process determines, based on the statistics stored in ESS 150, which of the currently cached virtual columns is "coldest". A virtual column may be considered "coldest" if it was the least recently accessed, least frequently used, or based on some other criteria. The background process may evict the virtual column by marking it as unusable and/or dropping the virtual column from the table.

[0069] At step 210, the background process computes and loads, for each candidate expression, result values into one or more VCUs. For each respective IMEU, the background process may access the operand column values for the candidate expression from the parent IMCU. If values from the operand columns are stale or the parent IMCU does not currently exist, then the background process may retrieve these values from disk, create a new IMCU if one does not exist, and load these values into the parent IMCU. The values from the operand columns are then used to compute the evaluation results, which are stored in the corresponding IMEU.

IMEU Consistency Maintenance

[0070] In an embodiment, IMCUs 122a to 122n and IMEUs 124a to 124n are maintained as read-only units. In this state, transactional changes that occur after data is loaded into an IMEU and parent IMCU is not instantly materialized within these in-memory units. One reason for not materializing the results instantly is that CUs and VCUs store data in compressed or uncompressed column-major format. While column-major format provides significant advantages for processing many analytic queries, updates can generally be performed more efficiently on data stored in row-major format. Therefore, the updates may be directed to objects that store data in row-major format rather than IMCUs 122a to 122n and IMEUs 124a to 124n, which store data in column-major format.

[0071] In order to maintain transaction consistency, database server instance 100 monitors for changes to the operand column values. If a transaction commits an update to a value within a row of an operand column, then the PCRs that are cached for that row are no longer valid. To prevent stale PCRs from being returned, shared metadata units 128a to 128n track which row values cached within in-memory area 120 are no longer valid. In response to detecting an update to a row within a segment that is in-memory enabled, transaction processes 144 may update one or more of SMUs 128a to 128n to indicate the row, within the in-memory units, is no longer valid. In alternative embodiments, invalidation information may be maintained at a finer granularity than rows. For example, within a row that is updated, the database server instance may track which columns were actually updated. If a column was not updated even though the row was updated, then the value in that column may still be treated as valid. Similarly, the pre-computed result for a row may be treated as valid is the values from the column(s) that are involved in the expression were not updated, even if other columns for the same row were updated.

[0072] By maintaining a one-to-one mapping of table rows between an IMEU and the parent IMCU, the same SMU may be used to track changes for both in-memory units. Any row that is invalid in the IMCU is also invalid in the IMEU. If the row is invalid, then the values for that row are provided from the latest image either on disk or in a transaction journal rather than the IMCU or IMEU. Each of SMUs 128a to 128n may therefore be associated with a corresponding IMCU and IMEU to track the status of individual rows within the CUs and VCUs contained therein. For instance, SMU 128a tracks the validity of rows within CUs 132 and VCUs 134. Similarly, SMU 128b tracks the validity of rows for CUs stored in IMCU 122b and VCUs stored in IMEU 124b, and SMU 128n tracks the validity of rows for CUs stored within IMCU 122n and VCUS stored in IMEU 124n.

[0073] Referring to FIG. 3, it depicts SMU 330, which is used to track which rows are valid within IMCU 310 and IMEU 320. IMCU 310 includes CU 312 and CU 314, and IMEU 320 includes VCU 322. CU 312 stores values from column a, and CU 314 stores values from column b. IMEU stores PCRs for the expression "a*b" SMU 330 stores invalidity bit-vector (BV) 332, where each bit corresponds to a particular row and the value of the bit indicates whether the corresponding row is valid or invalid. If a data manipulation language (DML) transaction updates a row that has not previously been invalidated, then the bit value for that row in BV 332 is changed to indicate that that row in CU 312, CU 314, and VCU 322 is no longer valid. During query processing, query processes 142 may read BV 332 to determine which cached values to retrieve from in-memory area 120 and which values to retrieve from a transaction journal or disk. As illustrated, the same bit-vector may be used to track the validity of rows in both IMCUs and IMEUs.

[0074] As read-only units, IMCUs 122a to 122n and IMEUs 124a to 124n are built from data that was current as of a certain point in time. Changes made after that point in time would not be reflected in the IMCUs 122a to 122n and IMEUs 12a to 124n. For instance, IMCUs 122a to 122n and IMEUs 124a to 124n may store data that is current up to a load system change number (SCN) or other logical timestamp. SMUs 128 to 128n track transactional changes on a row spanned by the IMCU after the load time of the IMCU. In some cases, the changed rows may be logged in a transaction journal that records which transactions have been committed. In order to leverage the transactional notion of invalidation tracking in SMUs 128a to 128n, the database server instance 100 may populate an IMEU such that it is built as of the same logical time as the parent IMCU. When such consistency is maintained, rows that are marked invalid in the IMEU may also be treated as invalid in the parent IMCU and vice versa. In order to build the IMEU such that the PCRs are built as of the same logical time as the operand values in the IMCU, the background processes responsible for populating the IMEU may issue flashback queries based on the load SCN of the IMCU. With flashback queries, past data is returned as it existed at the time of the SCN specified in the flashback query. Thus, the background process disregards updates to operand column values that may occur between the time the IMCU is built and the time the IMEU is built. When such consistency is maintained, the addition of an IMEU logically widens each row of the parent IMCU. The IMEU may conceptually be thought of as adding more columns to the row. In the case of the IMEU, virtual column are added to the physical columns stored in the IMCU.

Repopulation of In-Memory Column Units

[0075] In order to refresh in-memory units that have become stale, database server instance 100 queues repopulation tasks to refresh invalid data within IMCUs 122a to 122n and IMEUs 124a to 124n. IMCO 112 may coordinate repopulation of the in-memory units by assigning and distributing repopulation tasks in the queue among worker processes 114. When assigned a repopulation task, the worker process reloads updated values into the parent IMCU. The worker process further re-computes expression result values for invalid rows and loads the updated expression result values into the IMEU. The worker process also updates the corresponding SMU to indicate that the updated rows are no longer invalid.

[0076] In an embodiment, IMEUs 124a to 124n are repopulated whenever the parent IMCU is repopulated. Thus, IMEU 124a may be repopulated whenever IMCU 122a is repopulated, IMEU 124b may be repopulated whenever IMCU 122b is repopulated, etc. By repopulating related in-memory units as part of the same repopulation task, the IMEU is kept in sync with the parent IMCU. Thus, consistency is maintained between the CUs in the IMCU and the VCUs in the IMEU.

[0077] Database server instance 100 may prioritize repopulation tasks based on a set of one or more replacement policies. The set of replacement policies may take into account various factors including, without limitation, the number of rows invalidated, the number of scans on a particular IMCU or IMEU, user-specified priority levels for a table, and compression levels. For instance, if database server instance 100 determines, from SMU 128a, that a threshold number of rows has been invalidated, then database server instance 100 may queue a repopulation task for IMCU 122a and IMEU 124a. The threshold may vary depending on how frequently IMCU 122a and IMEU 124a are accessed. As the frequency of accesses increases, the threshold number of rows that trigger a repopulation may be reduced.

[0078] When an IMCU and IMEU pair are scheduled for repopulation, the previous versions of the IMCU and IMEU may remain accessible and available to queries until the repopulation task is complete. For instance, if a background process is loading values into a new version of IMCU 122a and IMEU 124a, database server instance 100 may receive a query that requires access to a value which is cached in a previous version of IMCU 122a. In such a scenario, a foreground process executing the query may access the value from the previous version of IMCU 122a. Once repopulation is complete, the previous versions of the IMCU and IMEU may be immediately discarded or may continue to remain accessible for a configurable duration of time. By maintaining the previous versions, the storage footprint of the IMCU-IMEU pair is increased. However, queries that are received with a scan SCN lower than the load SCN of the new version of the IMCU-IMEU pair, but greater than the load SCN of the previous version of the IMCU-IMEU pair may benefit from accessing values from the previous version of the IMCU-IMEU pair.

Query Optimization Using In-Memory Expressions

[0079] By storing results of `hot` expressions on a per-row basis in IMEUs, a query optimizer may improve performance of subsequent queries that refer to the materialized expressions. Rather than re-evaluating the expression, database server instance 100 may access the PCRs from in-memory area 120. In addition, by materializing user-defined virtual columns, database server instance 100 can provide the user with scan performance similar to that of a physical on-disk column that has been materialized in-memory, but without the extra-disk overhead.

[0080] In an embodiment, queries are rewritten to access materialized expressions stored in hidden virtual expressions. If database server instance 100 determines that evaluation results for an expression are stored in a virtual column, database server instance 100 replaces the reference to the expression with a reference to the hidden virtual column. As an example, if the sub-expression "(a+b)" of expression "(a+c+b)" has been materialized as hidden virtual column "v", then the database query "select a+c+b from t where a>0" can be re-written as "select v+c from t where a>0". During query execution, the value of "v" is read directly from an IMEU, thereby retrieving the PCRs and avoiding re-evaluation of the sub-expression "(a+b)".

[0081] FIG. 4 is a flowchart depicting an example process for identifying and using in-memory expressions to optimize query execution, according to an embodiment. The process may be performed by one or more foreground processes, such as query processes 142. At step 402, database server instance 100 receives a database query. The database query may be input directly by a user, received over a network from a database client, or submitted through some other application.

[0082] At step 404, database server instance 100 identifies expressions included in the database query that may potentially be materialized within in-memory area 120. As an example, the query: "select a*b from T where c=1 and e/f=10" has two expressions, "a*b" and "e/f" that may potentially be materialized in an IMEU. An "identified expression" as used herein may refer to an expression explicitly recited in the query, an implicit expression, or a previously-defined virtual column that is derived from the results of an expression. If an identified expression is associated with a hidden virtual column, then the database query may be rewritten to access the hidden virtual column as previously described.

[0083] At step 406, database server instance 100 determines whether to update ESS 150 based on the identified expressions. In some embodiments, expression tracking may not be enabled, may be enabled only for certain expressions, or may not be supported. In other embodiments, expression tracking may be applied to all identified expressions or some subset thereof. If database server instance 100 decides to update ESS 150, then the process continues to step 408. If database server instance 100 determines that the identified expressions should not be tracked, then the process continues to step 410.

[0084] At step 408, database server instance 100 updates ESS 150 based on an occurrence of the identified expressions within the query. To query and update ESS 150, database server instance 100 generates an expression identifier for each distinct expression on a per-table basis. The expression ID may be computed based on the canonical form of the text representation of the expression. For example, expressions "a+b" and "b+a" are considered the same expression for a particular table. The expression ID can be a hash value of the canonical representation of an expression and the base table object number or some other value that uniquely identifies the expression. If an entry does not already exist in ESS 150 for an identified expression, then a new expression entry may be created. Otherwise, the existing entry may be updated based on the compile-time and/or runtime information collected for the expression.

[0085] At step 410, database server instance 100 scans in-memory area 120 to determine whether IMEUs 124a to 124n includes any VCUs that store results for an identified expression referenced by the database query. In order to perform the scan, database server instance 100 may assemble a set of IMCUs and IMEUs based on the columns referenced by the database query. Assembling the IMCUs and IMEUs may involve decompressing, if compressed, the CUs in the IMCU and VCUs in the IMEU and gathering the CUs and VCUs for the columns and expressions referenced by the database query. In an embodiment, database server instance 100 scans IMEUs 124a to 124n for VCUs that have a column number that is associated with the identified expression. If a VCU that stores results for the identified expression is identified, then the process continues to step 412. If no VCUs are identified, then the process continues to step 414.

[0086] At step 412, database server instance 100 retrieves the evaluation results for the identified expression from one or more VCUs. In some cases, only a portion of the results may be cached due to memory constraints or due to row invalidations. In such cases, database server instance 100 may retrieve the portion of the results that are valid and cached within the one or more VCUs. For the portion that is not valid or cached, the expression is evaluated using values from a transaction journal, disk, or some other source. If the entire expression result is cached and valid, then the process may proceed without performing any evaluations for the identified expression.

[0087] If there are no VCUs that store results for the identified expression, then at step 414, database server instance 100 computes the PCRs. If the operand columns are stored in one or more column units within IMCUs 122a to 122n, then values for the operand columns may be retrieved from in-memory area 120 to perform the computation. If the values are not stored in memory, database server instance 100 retrieves the values from another source such as a transaction journal or disk.

[0088] At step 416, database server instance 100 determines whether there are any identified expressions that have not yet been evaluated. If so, then the process returns to step 410 and repeats steps 410 to 414 for the next identified expression. Once all identified expressions have been evaluated, then the process continues to step 418.

[0089] At step 418, database server instance 100 generates and returns a final query result using the PCRs, either retrieved from in-memory area 120 or computed per the description above, for each identified expression.

In-Memory Internal Units

[0090] As previously described, IMEUs 124a to 124n may be used to materialize PCRs and/or user-defined virtual columns within in-memory area 120 for subsequent reuse during scans, thereby providing faster query response times. Database server instance 100 further comprises IMIUs 126a to 126n, which materializes computations that are not suitable to be stored as VCUs, as described in further detail below. While database server instance 100 is depicted as having both IMEUs and IMIUs, in other implementations, a database server instance may comprise IMEUs but not IMIUs or IMIUs instead of IMEUs.

[0091] In an embodiment, each of IMIUs 126a to 126n are managed as an in-memory cache with fixed upper bound sizes that can be allocated on-demand on a per-IMCU basis. Similar to IMEUs 124a to 124n, IMIUs 126a to 126n store the results of internal computations. However, IMIUs 126a to 126n are structured differently than IMEUs and provide greater flexibility in the types of results that may be stored. As previously described, IMEUs store results as VCUs where the rows have a one-to-one relationship with CUs in the parent IMCU. This format facilitates transaction processing, as the VCUs may be treated as logically widening the table rows cached by the parent IMCU. Certain expressions, such as certain data conversions, hash operations, etc., may not have a one-to-one mapping with the rows cached by the parent IMCU. For such computations, the results may be stored instead in an IMIU. To organize the PCRs, IMIUs 126a to 126n maintain result maps that map computation identifiers to materialized results. Thus, IMIU 126a includes results map 138, which may be implemented as a hash table as described in further detail below. Similarly, IMIU 126b and 126n may include result maps to manage materialized computation results.

[0092] FIG. 5 depicts an example design for an IMIU, according to an embodiment. IMIU 500 generally comprises header 502, IMIU piecemap 504, hash table of computations 506, and materialized results area (MRA) 508.

[0093] Header 502 stores information used to manage free space within IMIU 500. In an embodiment, header 502 may include, without limitation: [0094] A pointer to free space where bit-vector results should be stored (allocate forward); [0095] A pointer to free space where derived column results should be stored (allocate backward); and/or [0096] One or more values indicating the amount of available free space for either class (e.g., bit-vector, derived column) of PCRs.

[0097] Piecemap 504 stores an array of pointers to IMIU extension pieces, if any, that may be located in other areas of memory. If the head IMIU piece is completely consumed with materialized results, database server instance 100 may allocate more memory and construct a chain IMIU piece to store additional computations. Thus, the IMIU cache can grow or shrink in size.

[0098] Hash table of computations 506 tracks statistics about computations processed against the parent IMCU and, if materialized serves as an computation ID to result map, providing the pointer to the stored computation result in the IMIU. Each hash bucket within hash table of computations 506 may store information to facilitate internal computation tracking and lookups. This information may include, without limitation: [0099] A compare and Swap (CAS) value for handling concurrent accesses to the hash bucket; [0100] A type associated with the internal computation; [0101] A running evaluation count that tracks how many times the internal computation has been evaluated (computations that are pruned out do not affect this count value) [0102] A running access count that tracks how many times the internal computation is referenced/accessed (computations that are directed at the parent IMCU are counted regardless of whether they are pruned out) [0103] A timestamp identifying the last time the internal computation was last evaluated; [0104] A timestamp identifying the last time the internal computation was last accessed; [0105] A static cost associated with evaluating the internal computation; [0106] A running cost associated with evaluating the internal computation; [0107] A stored result pointer indicating a memory location where the results for the internal computation are stored; and [0108] A reader count to indicate how many processes are currently reading the hash bucket.

[0109] MRA 508 stores the results for internal computations. PCRs may store different classes of results including, without limitation, bit-vector results and derived column result sets. In one embodiment, bit-vector results are populated from the starting address of MRA 508 and allocated forward. Derived column or other result sets are populated from the end address of MRA 508 and allocated backward. Separating the result sets may help reduce fragmentation since bit-vector results may be managed in fixed sizes, allowing for bit-vectors to be replaced without causing gaps between result sets.

Computations to Materialize in IMIU

[0110] IMIUs 126a to 126n provide great flexibility in the type of computations that may be materialized within in-memory area 120. Due to this flexibility, the format of the results that are cached may vary depending on the particular type of computation that is materialized. Examples of PCRs may include, without limitation [0111] Bit-vectors derived during predicate evaluations; [0112] Function evaluation results (such the results of a hash function, data type conversion, expression evaluations) on columns; [0113] Concatenated columns for multi-column joins; and [0114] Function evaluation results (such as data type conversions) on dictionary values, column numbers, and other metadata values.

[0115] A bit-vector that is derived during predicate evaluation stores a set of bits, where the position of each bit corresponds to a different row and the value of each bit indicates whether the corresponding row satisfies the predicate. If the predicate expression "e/f=10" is evaluated against five rows in table t, for example, then a five-bit bit-vector may be generated and cached in MRA 508 to indicate which rows of table t have values in columns e and f that satisfy the predicate expression. In the present example, the bit vector "10110" may be stored to indicate that the first, third, and fourth rows of table t satisfy the predicate, while the second and fifth rows do not satisfy the predicate. Caching bit-vectors allows subsequent queries to be rewritten to simply refer to the bit-vector rather than perform a potentially expensive evaluation.

[0116] Hash values are another internal computation result that may be valuable to cache. To illustrate, TABLE 2 depicts an example "what if" query that determines how much average yearly revenue would be lost if orders are no longer filled for small quantities of certain parts.

TABLE-US-00002 TABLE 2 SAMPLE QUERY WHERE HASH COMPUTATION MAY BE PERFORMED QUERY ID QUERY Q5 select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem ,part where p_partkey = l_partkey and p_brand = `BRAND23` and p_container = `MED BOX` and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );

In Q5, "lineitem" is the fact table and "part" is a dimension table. The optimizer may pick a hash join to evaluate the query. When two tables are joined via a hash join, the dimension table is scanned and rows satisfying the where clause predicates for that table are used to create a hash table, based on the join key, in memory. Then the larger fact table is scanned and, for the rows that satisfy the where clause predicates for that table, the same hashing algorithm is performed on the join column. The join operation then probes the previously built hash table for each value and if there is a match, the row is returned. In some cases bloom filters may be used to perform the hash join operation. A bloom filter is a space-efficient probabilistic data structure that can be used to test whether an element is a member of a set or not. During the hash table creation for the dimension table, a bloom filter is also created based on the join column--the "p_partkey" column in the case of Q5. The bloom filter is then sent as an additional predicate to the scan of the "lineitem" table. After the "where" clause predicates have been applied to the "lineitem" table, the resultant row-set is further pruned by having the join column "(l_partkey)" hashed and probed in the bloom filter. If a match is found in the bloom vector, then that row is sent to the hash join operator. If no match is found then the row is discarded. Next, the passing rows from the fact table and dimension table are used to perform the actual join. This involves comparing the hash of the "l_partkey" for the passing rows from "lineitem", with the hash of "p_partkey" for the passing rows from "part", to weed out false positives from the bloom filter evaluation. The final set of rows is sent to an aggregator to perform a summation. Based on the foregoing, there is ample opportunity for hashing the values of the distinct dictionary entries of the join key columns and reusing them to improve join performance. For the rows that pass the predicates on the fact table "lineitem", a look-up is performed on the dictionary index of the join key column "l_partkey" to directly obtain the hash value from the materialized stream and use it to probe the bloom filter. This results in a significant speed-up for joins that have a dominant bloom filter evaluation cost because the cost of computing hash values during runtime may be completely eliminated. Further, hash values that are fixed in length (e.g., 8-bytes) may be cached as an aligned data stream within IMIU. By determining the start location where the stream of hash values are stored and accessing the stream directly the hash operation may be completed without looking up the dictionary symbols and their lengths.

[0117] An IMIU may also cache concatenated columns, according to an embodiment. TABLE 3 depicts an example query that may benefit from caching a concatenated column.

TABLE-US-00003 TABLE 3 SAMPLE QUERY WHERE COLUMN CONCATENTATION QUERY ID QUERY Q6 select * from T1, T2 where T1.id = T2.id and T1.name = T2.name;

Q6 effectively has two single column joins. Traditionally, rows are pruned out by performing a bloom filter operation on the first join operation (T1.id=T2.id), and subsequently, the second bloom filter operation (T1.name=T2.name) is performed on the passing rows from the first operation to prune out even more rows. As previously discussed, a single column join may be improved by materializing hash values of the join key columns in the IMIU. The same technique may be applied twice to Q6 to improve query performance. However, internally concatenating columns for each of the two tables involved in the multi-column join may improve query performance for analytic workload. For "T1", the internal column is "(IC1=T1.id o T1.name)", and for "T2", it is "(IC2=T2.id o T2.name)". With the concatenated column, the query may be rewritten internally to the following: "select * from T1, T2 where T1.IC1=T2.IC2". Thus, the original multi-column join query can be reduced to a single-column join query.

[0118] In an embodiment, IMIU may be used to cache decimal scaled binary (DSB) representations. A DSB representation comprises a DSB vector and DSB metadata. A DSB vector is an array of DSB values, which are native integers scaled by a common factor. The DSB metadata describes properties of the DSB vector and may include the following metadata: [0119] The width of each array element; [0120] An exponent indicating the number of base-100 digits to the right of the decimal point; [0121] Minimum and maximum values of the array; and [0122] A bit-vector indicating rows to ignore. A DSB vector may be used to represent different values, such as dictionary values for columns that store fixed and floating-point numbers. An entry in the IMIU may be used to store the DSB vector and the associated metadata listed above. For an arithmetic computation (e.g., "a+b"), a list of values for "a" and "b" may be obtained by looking up the dictionary entries for each value. For a particular IMCU, the DSB values may or may not be present for dictionary values of columns "a" and "b". If they are present in the IMIU, the DSB values may be retrieved in lieu of the fixed or floating point column values. Query processing may be improved as arithmetic operations may be performed more quickly with DSB values since floating point instructions are not needed. DSB conversions may be cached not only for physical columns, but also for materialized expression columns. For example, if a virtual column "v" is equal to the sum of physical columns "a" and "b", then a DSB representation of the unique values of the column "v", which is potentially materialized in an IMEU, may be stored in an IMIU. Thus queries of the form "select A+B+C from table" can be rewritten to effectively perform "DSB(V)+DSB(C)", thus improving query response times. This is an example where the IMEU and IMIU can work together to improve execution of a single query.

Hashing Computations

[0123] The manner in which an internal computation is mapped to hash table of computations 506 may vary from implementation to implementation. Each internal computation can be hashed based on a PCODE representation of the computation, a predicate log tree of the computation, and/or some other unique representation of the internal computation. A PCODE representation of an internal computation, for instance, typically includes a) an opcode b) column operands (with a data type for the column in some cases), and c) constant operands (as length/value pairs). The hash value may be obtained by applying a hash function to a combination of values included in the PCODE representation. Some examples are: [0124] HASH opcode, column 1, constant data, constant data length [0125] HASH opcode, type, column 1 The hash operator may further combine hash values for two or more different representations to derive a hash value for a nested/complicated internal computation. For instance the following command computes a hash value based on a combination of hash values: [0126] HASH opcode, hashval1, hashval2

[0127] To demonstrate the hashing of internal computations, the following multi-column predicate is given by way of example: "c=1 and e/f=10". In order to map this computation to the corresponding bucket within hash table of computations 506, the following sequence of commands may be executed: (1) HASH(EQ, c, 1); (2) HASH(EQ, HASH(DIV, e, f), 10); and (3) HASH(AND, hashval1, hashval2). The first command corresponds to a hash on the sub-expression c=1, and the resultant hash value is saved as "hashval1". The second command corresponds to a hash on the sub-expression "e/f=10", and the resultant hash value is saved as "hashval2". The third command evaluates the full predicate by applying a hash function to a combination of "hashval1" and "hashval2". The hash values for each expression and sub-expression map to corresponding hash buckets within hash table of computations 506 and may be used to perform lookups and updates of the computation information maintained therein.

Candidate Computations to Materialize in IMIU

[0128] In order to conserve memory, database server instance 100 may limit which internal computations are tracked and stored within the MRA for each IMIU. Database server instance 100 may employ a set of heuristics to build a set of selection criteria. Computations that satisfy the selection criteria and are deemed eligible for storage in the IMIU are henceforth referred to as "candidate computations". Factors that may be involved in the selection process may include, without limitation: [0129] Frequency of evaluation; [0130] Runtime costs of evaluation; [0131] Memory footprint of the computation if materialized; and [0132] Memory pressure experienced by the cache and/or the overall system.

[0133] Based on the selection criteria and the statistics stored in hash table of computations 506, each internal computation is given a score, according to an embodiment. If the score satisfies a threshold, then the internal computation is selected for materialization. If the score does not satisfy the threshold, then the computation results are not materialized within MRA 508. Even if not materialized within MRA 508, IMIU 500 may continue to track statistics for the internal computation within hash table of computations 506.

[0134] The manner in which scores are computed for candidate expression may vary depending on the particular implementation. For instance, the runtime cost of an expression used to compute a score may be based on the number of CPU cycles taken to evaluate the computation, the filter rate for the predicate as determined by the number of rows filtered divided by the number of rows evaluated, or some other measure of resource consumption directly tied to evaluating the internal computation. Generally, an internal computation may be scored higher as the runtime cost and frequency of evaluation increase and lower as these factors decrease.

[0135] The thresholds used to determine whether to materialize a candidate expression may also vary depending on various factors. As an example, the threshold may be different for different types of computations. For instance, as the memory pressure increases, the thresholds for materializing expressions may also increase. Further, predicates may have reduced thresholds owing to the relatively small memory footprint consumed by bit-vectors. In another example, thresholds between different types of predicates may vary. For instance a predicate result may subsume complex sub-predicate results with no additional cost, and caching predicate results may allow for greater opportunities for predicate reordering. On the other hand, if the constant value of the predicate changes, then the entire stored result becomes useless. These attributes may be factored in when setting the threshold score or otherwise selecting predicates and other computations for materialization.

[0136] The granularity with which candidate computations are selected for materialization may vary based on the type of computation, according to an embodiment. Some computations, such as predicate evaluations, may be selected on a per-IMCU basis. As an example, if a predicate of the form "col=<value>" is never evaluated against a particular IMCU because it is always pruned by min-max pruning, i.e. the specified value is not within the minimum and maximum value ranges of that column in the parent IMCU, then storing this predicate in an IMIU for the parent IMCU would be a waste of space. Thus, selections may be performed on a per-IMCU basis such that predicates are cached based on the likelihood that the predicate will be evaluated for the parent IMCU. Further, the predicate result may be materialized for column values in one IMCU but not for column values in another IMCU.

[0137] In other cases, computations may be performed on a per-table basis. For instance, hash values for a join key column and data conversion on a column are generally applied to all values within a column. If such a candidate computation is selected for materialization, the results may be cached for all IMCUs that cache data from the table column.

Materialization of Computations

[0138] Once a candidate computation is selected for materialization within IMIU 500, then the next task is to populate IMIU 500 with the computation result. In one embodiment, computation results are populated by a foreground process during query execution. By using a foreground process, computation results to materialize are readily available as they are computed as part of the evaluation process for the query triggering the materialization. In order to materialize the computation, a query process copies the computation result into MRA 508, adjusts the free space pointers in header 502, and updates, within hash table of internal computations 506, the result pointer in the hash table entry corresponding to this computation such that the result pointer maps to the beginning of the stored result in the MRA 508.

[0139] During materialization, if sufficient memory is not available to store the results, then another in-memory extent (a contiguous memory chunk) may be allocated for IMIU 500. The amount of memory allocated for the in-memory extent may be a default amount, selected based on the amount of memory consumed by results already stored in MRA 508, or determined based on some other criteria. Once allocated, the result is written into the new in-memory extent, and piece-map 504 is updated to show the chaining of the IMIU.

[0140] Once materialized, the results in MRA 508 may be accessed by subsequent queries. The PCODE representation or other representation of the computation may be hashed and probed in the hash table on a per-IMCU basis. If the result corresponding to the computation is present in MRA 508, a pointer to the result in MRA 508 is present as a separate entry in hash table of computations 506. If the pointer is present, queries may be rewritten to access the results directly from MRA 508 rather than performing the computation during query execution.

Evicting Computations

[0141] In order to free up space within IMIU 500, database server instance 100 may evict internal computation results from MRA 508. To select which computations to evict, database server instance 100 may use a set of one or more cache replacement policies to identify computations that have become "cold". Example cache replacement policies may include, without limitation, a least recently used (LRU) policy that evicts the least recently accessed or evaluated expression from memory, a least frequently used policy that evicts the computation that is least frequently evaluated or accessed, a random replacement policy that randomly selects a computation for eviction, a clock with adaptive replacements (CAR) policy that evicts results based on a virtual clock hand (iterator) indicating when a computation was last accessed, and score-based ranking that computes a score based on statistics maintained for the materialized computations and evicts the computation with the lowest score. Eviction of colder computations can either be done on the fly as new computations come in, or it may be done as part of a dedicated eviction pass in an attempt to reclaim space in the event of memory pressure.

[0142] When computation results are evicted from MRA 508, the result pointer is deleted from hash table of computations 506 and the free space pointer is updated in header 502 and/or piecemap 504 to reflect the newly allocated free space. The computation results may also be removed from MRA 508 or may remain and be overwritten by subsequent computation results.

[0143] Evicting bit-vector results may be performed with zero-fragmentation since the bit-vector sizes may be identical and any subsequent bit-vector materialization may replace the unoccupied space in IMIU 500. Eviction of other computations, however, may lead to fragmentation in IMIU 500 due to the variable sizes of such computations. To ameliorate this effect, a de-fragmentation pass may be performed where stored results in MRA 508 are shuffled around after an eviction to ensure optimal space utilization. The eviction pass may be performed by a dedicated background process to prevent foreground queries from suffering a drop in performance.

[0144] In the event of high memory pressure, database server instance 100 may evict entire an IMIU entirely from in-memory area 120. Once evicted, the in-memory extents allocated for that IMIU are freed back to the space layer, which may allocate those extents for different uses. High memory pressure may be detected as: [0145] A high ratio of expressions in a hash table of computations to materialized results in the MRA; [0146] An IMIU size exceeding a threshold level, which may be a configurable parameter; and [0147] Overall system memory availability being low.

Consistency Maintenance of IMIU

[0148] A computation materialized in IMIU 500 is based on the values stored in the parent IMCU. For computations that have one-to-one mappings with rows of the table, as is the case with predicate bit-vector results, the validity bit-vector maintained in the associated SMU may be used to identify which bits in the predicate bit-vector are stale. For instance, the bit-vector maintained in SMU 128a may be used as a mask against the predicate bit-vector results stored in MRA 136 to obtain correct results for rows that are valid in IMCU 122a.

[0149] However, MRA 136 may also store computations that do not have a one-to-one row mapping. For such computations, the manner in which validity is determined depends on the nature of the results being cached. As an example, if DSB values are cached within MRA 136 where the DSB values are for distinct entries of a column storing floating point numbers, then, during a scan of IMCU 122a, a process may obtain the DSB values for the dictionary entry for the valid rows from the materialized result. For invalid rows, the results are not accessed from IMIU 126a. Rather, a non-DSB encoding of the results may be used to evaluate the computation.

Concurrency Operations for IMIUs

[0150] A particular query process may have different roles when accessing data stored in IMIU 500. For instance, the query process may act in the following roles: [0151] A writer that updates statistics of a particular computation in hash table of computations 506; [0152] A reader that uses the materialized result from MRA 508 to obtain a final query result; and [0153] A writer that materializes a computation result in MRA 508 and/or evicts a computation already materialized in MRA 508.

[0154] Multiple processes with various roles may concurrently attempt to access an IMIU. In order to manage potential conflicts, database server instance 100 may implement a set of concurrency policies. Example concurrency policies may include, without limitation: [0155] Allowing two or more processes acting as readers to access, in parallel, different materialized results within MRA 508; [0156] Copying result pointers when multiple processes, acting as readers, attempt to access, in parallel, the same materialized results, whereby the stored result may be accessed in parallel; [0157] Preventing multiple processes from adding the same computation results to MRA 508; [0158] Serializing writes to MRA 508 when multiple processes attempt to materialize different computations concurrently; [0159] Preventing multiple processes from evicting the same computation from MRA 508; [0160] Preventing a process from evicting a materialized result from MRA 508 when there are existing readers on that computation; [0161] Serializing evictions of materialized computations when multiple processes are attempting to evict separate computations from MRA 508; [0162] Serializing updates to the statistics in the same hash table entry in hash table of computations 506 when multiple processes concurrently attempt to update the hash table entry; and [0163] Preventing a reader from accessing a results pointer when a separate process is writing to the result pointer field in the hash table entry.

[0164] Concurrency on the IMIU 500 is maintained by using atomic CAS operations, according to an embodiment. The default value of the CAS field in the hash table entry (e.g., "0") is set to indicate that the hash table entry is unlocked. When a process attempts to access the hash table entry, the process attempts to update the CAS value to its own process id in order to atomically lock access to the hash entry. An access to the hash table entry of a particular computation may be prevented until a CAS operation succeeds. Thus, updates to a particular hash table entry may be restricted to one process at a time.

[0165] Materializing new computations involves potentially changing free space pointers, obtaining new chained units and updating the piece map, and evicting colder computations. In order to maintain concurrency, a latching mechanism may be implemented to protect writes into MRA 508 whenever a new computation is being added. Even when latched, readers may still be permitted to read other computations by using the atomic CAS operations.

[0166] When evicting computations, concurrency may be maintained by checking for readers before the selected computation is evicted from MRA 508. If there is a reader using a particular materialized result, the result is not evicted until the reader count for the computation indicates that there are no longer any processes reading the computation results. The reader count value may be maintained by incrementing the value whenever there is a process using the particular materialized result. To decrement the reader count, the reader takes an atomic lock on the hash entry for the computation using CAS and updates the value. In this way, concurrency between a writer looking to evict a computation and a reader reading the result may be handled.

[0167] In some systems, it may be possible for a process to die when reading a computation result. If the reader count has already been bumped up, with the death of the process, there is a risk that the counter will not be decremented back down to 0 even though no active process is actually using the result. In such systems, a list of processes reading a particular computation result may be maintained in addition to the reader count. Thus, if a process wanting to evict a computation has waited a threshold period of time, the process may examine the list to see if any dead processes are present. If such a process is found, it is removed from the list and the reader count is appropriately decremented, thus allowing the eviction to proceed.

Example Dataflow for Managing an IMIU

[0168] FIG. 6 is a flowchart depicting an example process for populating an in-memory internal unit with computation results and using materialized computation results to optimize query processing, according to an embodiment. In an embodiment, the process is performed by one or more foreground processes that belong to databases server instance 100. By using foreground processes IMIUs may be updated and populated during query executions as computations are being evaluated. Referring to FIG. 6, at step 602, a representation of a computation i is hashed. The hash may be applied to a PCODE representation, a predicate log, or some other computation representation as previously described. In response to hashing the computation representation, a hash value is generated that identifies a corresponding entry within hash table of computations 506.

[0169] At step 604, a lock is obtained on the hash table entry for i. A process may obtain a lock by updating the value within the CAS field of the hash table entry as previously described. Once locked, other processes are prevented from updating values within the entry. Other processes may concurrently update different entries within hash table of computations 506. However, a single hash table entry is updated by only one process at a time.

[0170] At step 606, computation statistics are updated in the hash entry for i. For example, the evaluation count, access count, timestamp of last evaluation, timestamp of last access, and evaluation cost statistics may be updated based on runtime information collected during query evaluation.

[0171] At step 608, the foreground process determines if the result pointer is available within the hash entry. If there is a result pointer stored in the hash entry, then the results for computation i are materialized in the MRA, and the process continues to step 610. Otherwise, if the pointer does not exits (e.g., it is set to NULL), then the result have not been materialized for the computation. Thus, the process continues to step 616.

[0172] At step 610, the foreground process increments the reader count and releases the lock for the hash table entry. The reader count indicates that the process is currently accessing the materialized result for computation i, thereby preventing other processes from evicting the computation. Other processes may update the hash table entry with the exception of the result pointer, however, as the lock on the hash table is released.

[0173] At step 612, the process reads the materialized result from the location in the MRA identified by the result pointer and uses the materialized result to compute a final query result. The computation of a final query result may involve multiple internal computations and expression evaluations. Thus, the process may check the IMIU and/or IMEU for other cached results to compute the final query result. Once the process has finished accessing the result from the MRA, the process obtains a lock again on the hash table entry for computation i.

[0174] At step 614, the process decrements the reader count to indicate that it is no longer accessing the computation result from the MRA. After the reader count is decremented, the process releases the lock on the hash entry for computation i.

[0175] If the computation results are not cached in the MRA, the process determines whether the results should be cached. Thus, at step 616, a "hotness" score is calculated based on the statistics maintained in the hash table entry for computation i. The manner in which the score is calculated may depend on a variety of factors, as previously described.

[0176] At step 618, the process determines if the score for computation i exceeds a threshold. If not, then the process continues to step 624. Otherwise, the process continues to step 620.

[0177] If the score for computation i does not exceed the threshold, then the results are not materialized. Thus, at step 624, the process releases the lock on the hash table entry for computation i thereby allowing other processes to modify the entry, and the process computes the computation results. In order to compute the computation results, the process may access the operand values from an IMCU, from a transaction journal, and/or from disk.

[0178] If the score for computation i exceeds the threshold, then the process begins the task of materializing the computation results within an IMIU. In step 620, the process obtains an IMIU exclusive latch and releases a lock on the hash table entry for computation i. The latch prevents other processes from updating the header, piecemap, and MRA in the IMIU.

[0179] At step 622, the process determines whether there is free space in the IMIU. If there is not enough free space in the IMIU to store the computation results, then the process continues to step 626 to either allocate or free up more space. Otherwise, the process continues to step 632.

[0180] At step 626, the process determines whether to allocate an IMIU chain piece. The determination of whether to allocate a new chain piece may be made based on a variety of factors, such as memory pressure, maximum size threshold for an IMIU, etc. If the process determines that an additional IMIU piece should be allocated, then the process continues to step 628. Otherwise the process continues to step 630.

[0181] If the process determines that another IMIU chain piece should be added to the IMIU, then at step 628, the process allocates an IMIU chain piece. The process may request a set of one or more in-memory extents from a space management layer, as previously described. Once allocated, the piecemap of the IMIU is updated to show the chaining of the IMIU.

[0182] If the process determines that another IMIU chain piece should not be allocated, then at step 630, the process evicts a computation to free up space in memory for the computation results. In order to evict a column, the process may first identify a "cold" computation to evict based on a replacement policy as previously described. If the reader count for the identified computation indicates that another process is currently accessing the computation, then the process may wait to evict the computation or select another cold computation to evict. If the reader count indicates the computation is not be accessed by another process, then the process removes the computation results from the MRA, deletes the result pointer from the hash table entry for the evicted computation, and releases the atomic lock for the hash table entry. The process also updates the free space pointers in the header and/or piece-map, if needed, to reflect the newly freed up space.

[0183] Once there is enough free space in the IMIU to materialize the computation results, the process obtains a lock on the hash table entry for i at step 632.

[0184] At step 634, the computation is evaluated to obtain the results, which are then written to the MRA. In order to compute the computation results, the process may access the operand values from an IMCU, from a transaction journal, and/or from disk

[0185] At step 636, the process adds a result pointer in the hash table entry for i and release the lock. The pointer may identify the starting memory address where the computation results are stored within the MRA.

[0186] At step 638, the free space pointer in the header and/or piecemap of the IMIU is updated, if needed, and the IMIU latch is released. After this step, other processes may access and use the materialized computation results from the MRA. As various database queries are received and processed, the queries may be rewritten to access the results directly form the MRA rather than re-evaluating the computation another time.

In-Memory Statistics Store

[0187] The hash table of computations 506 tracks computations at a per-IMCU level as previously described. By tracking at this granular level, selections of which results to materialize may be performed at on an IMCU-basis. However, tracking on a per-IMCU basis involves a bit more overhead than tracking at a higher level of granularity, such as at a per-table basis. In addition, if the data is spread around uniformly among different IMCUs, the same computations may be evaluated across all IMCUs during query execution. Thus, in some instances, it may beneficial to track computation usage at a per-table level.

[0188] In order track computations at a per-table level, a dedicated area within in-memory area 120 may be created every time an in-memory database instance is started. The dedicated memory, herein referred to as an In-Memory Statistics Store (IMSS), is maintained separately from IMIUs 126a to 126n and is used to track computation statistics on a per table-basis. Thus, the IMSS may gather statistics for data stored across a plurality of IMCUs. Data within the IMSS may then be used to drive the allocation, population and maintenance of the IMIUs 126a to 126n.

[0189] The IMSS may contain a hash table of computations hashed by a computation id (predicate log trees or PCODE bytestream) and base table object identifier. Similar to the hash table in IMIUs 126a to 126n, the IMSS hash table may track several attributes for each computation, such as timestamp of last evaluation, running count of number of evaluations, average run-time evaluation cost, computation-specific metadata, etc. The IMSS may be rolled into ESS 150 and maintained as a single unit to form a more generic database statistics repository (DSR).

[0190] Similar concurrency control operations may be performed to avoid conflicts that may arise when multiple processes attempt to access the IMSS. The IMSS may be used in conjunction with the per-IMCU level tracking or may be implemented as an alternative to IMCU-level tracking. For instance, when a database server instance is initially started, IMCUs 122a to 122n and IMIUs 126a to 126n may be gradually built. During the build phase, the IMSS may act as the statistics repository for computations and can be used to boot-strap the hash table in IMIUs when the IMIUs become available. If, for any reason, the IMCU is unavailable (dropped or repopulated), statistics may continue to be tracked in the IMSS. These statistics may then be used to pre-populate the hash table in a new IMIU. The combination of the IMSS and hash table of computations may also be useful in the case of heavy analytics under memory pressure. Processes may begin tracking computations at a per-IMCU level, but end up reverting to table-level tracking as time progresses to reduce memory pressure.

Predicate Reordering

[0191] The results that are cached in the IMIU may be factored in to predicate reordering optimizations, according to an embodiment. A predicate that utilizes a cached predicate result may be moved for earlier evaluation to improve runtime performance in some instances. For example, if the predicates in the clause "where c=1 and d=2 and e/f=10" are evaluated in the order they appear in the database query, then the predicate "c=1" would be evaluated first, followed by "d=2" and finally "e/f=10". However, if the results of the predicate "e/f=10" are materialized in memory, then the predicates may be reordered such that "e/f=10" is evaluated (i.e., by fetching the expression result from the IMIU) before "c=1" and "d=2" are evaluated. If the predicate "e/f=10" also filters out a lot of rows, then the next two predicates are potentially evaluated against fewer number of rows.

[0192] Each predicate in a query may have a different degree of selectivity. For instance, one predicate may filter out a different number of rows than another predicate. While identifying predicate results to materialize in the IMIU, predicates may be selected based on selectivity to store the bit-vector results of highly selective predicates (high filter rate). During query execution, predicates may be reordered to evaluate the highly selective predicate, which has a higher likelihood of having cached results available in the IMIU. If cached, the results are obtained directly from the IMIU, thereby reducing the number of rows evaluated against the subsequent predicates. Thus, choosing high selective predicates to cache coupled with reordering of predicates may significantly improve query performance.

Additional Query Processing Optimizations

[0193] Materializing hot expressions or user-defined virtual columns in-memory in VCUs may improve scan performance by not only avoiding repeated expression evaluation, but also providing the scan engine with the ability to use vector processing techniques more efficiently. In systems that support vector processing, instructions may operate on data arrays referred to as vectors rather and is not limited to operating on a single data item per instruction. An example vector processing technique is referred to as single instruction, multiple data (SIMD) processing where a single instruction may cause the same operation to be performed on multiple data sets simultaneously.

[0194] Materialized IMEs may facilitate vector processing by allowing multi-column expressions evaluations to be treated as a single column expression evaluation. Once a process is operating on a single column, vectors may be formed more easily thereby reducing the number of memory and/or disk accesses. For example, with a single column unit materializing results of a multi-column expression, a single dictionary may be used to store all the column's distinct values. This reduces the number of gather operations and random memory references that are performed in comparison to when two dictionaries are involved in the operation.

[0195] To demonstrate how materialized expression can be more amenable to vector processing, consider a query that has a predicate "a+b=10". Both columns, "a" and "b" are separate column that have their own dictionaries, referred to herein as "D1" and "D2" respectively. If the expression is not materialized to a single virtual column, then during expression evaluation, for each row, "a+b" is computed for some symbol in D1 and some symbol in D2, prior to comparing that result to the value "10". In order to vectorize this operation without an IME, a first gather operation would be executed to retrieve N random symbols from D1 for column "a", and a second, separate gather operation to retrieve N random symbols from D2 for column "b". Then, a vector addition operation is performed using the two vectors of N symbols each to get the values of "a+b" for those N rows. Subsequently, the result of the addition operation is compared to a vector of N "10" values in it. The gather operation is very expensive, as random memory references are performed in the two dictionaries. Thus, the process of vectorizing a multi-column expression evaluation can be quite heavy. With an IME, the "a+b" expression may be materialized into a single VCU. Using the VCU, the equality predicate may be vectorized in a much more simple and efficient manner. For example, a single load may be performed to fetch N symbols in the data-stream without going through the dictionary and without performing a random memory access to obtain a symbol. Then, a single compare operation may be performed to compare that vector to a vector of N "10" values. The loads on the data-stream vector are done sequentially, and so, the hardware will pre-fetch its values in, thus, allowing the processor to operate near or at memory bandwidth speeds.

Hardware Overview

[0196] In some embodiments, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.

[0197] For example, FIG. 7 is a block diagram that illustrates a computer system 700 upon which an embodiment of the invention may be implemented. Computer system 700 includes a bus 702 or other communication mechanism for communicating information, and a hardware processor 704 coupled with bus 702 for processing information. Hardware processor 704 may be, for example, a general purpose microprocessor.

[0198] Computer system 700 also includes a main memory 706, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. Such instructions, when stored in non-transitory storage media accessible to processor 704, render computer system 700 into a special-purpose machine that is customized to perform the operations specified in the instructions.

[0199] Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk, optical disk, or solid-state drive is provided and coupled to bus 702 for storing information and instructions.

[0200] Computer system 700 may be coupled via bus 702 to a display 712, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

[0201] Computer system 700 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 700 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 700 in response to processor 704 executing one or more sequences of one or more instructions contained in main memory 706. Such instructions may be read into main memory 706 from another storage medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

[0202] The term "storage media" as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 710. Volatile media includes dynamic memory, such as main memory 706. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.

[0203] Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 702. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

[0204] Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 700 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 702. Bus 702 carries the data to main memory 706, from which processor 704 retrieves and executes the instructions. The instructions received by main memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.

[0205] Computer system 700 also includes a communication interface 718 coupled to bus 702. Communication interface 718 provides a two-way data communication coupling to a network link 720 that is connected to a local network 722. For example, communication interface 718 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

[0206] Network link 720 typically provides data communication through one or more networks to other data devices. For example, network link 720 may provide a connection through local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. ISP 726 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 728. Local network 722 and Internet 728 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 720 and through communication interface 718, which carry the digital data to and from computer system 700, are example forms of transmission media.

[0207] Computer system 700 can send messages and receive data, including program code, through the network(s), network link 720 and communication interface 718. In the Internet example, a server 730 might transmit a requested code for an application program through Internet 728, ISP 726, local network 722 and communication interface 718.

[0208] The received code may be executed by processor 704 as it is received, and/or stored in storage device 710, or other non-volatile storage for later execution.

Extensions and Alternatives

[0209] In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

* * * * *