U.S. patent application number 11/149489 was filed with the patent office on 2006-12-14 for apparatus and method for reducing size of intermediate results by analyzing having clause information during sql processing.
This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to John Matthew Santosuosso.
Application Number | 20060282420 11/149489 |
Document ID | / |
Family ID | 37525265 |
Filed Date | 2006-12-14 |
United States Patent
Application |
20060282420 |
Kind Code |
A1 |
Santosuosso; John Matthew |
December 14, 2006 |
Apparatus and method for reducing size of intermediate results by
analyzing having clause information during SQL processing
Abstract
A query processor includes a HAVING clause processing mechanism
that processes the HAVING clause in a query when generating the
intermediate results, instead of generating the intermediate
results without regard to the HAVING clause, then applying the
HAVING clause to the intermediate results. As a result, the size of
the intermediate results may be significantly reduced, and effort
is not wasted in placing data into the intermediate results in one
step that would be removed were the HAVING clause processed
separately.
Inventors: |
Santosuosso; John Matthew;
(Rochester, MN) |
Correspondence
Address: |
MARTIN & ASSOCIATES, LLC
P.O. BOX 548
CARTHAGE
MO
64836-0548
US
|
Assignee: |
INTERNATIONAL BUSINESS MACHINES
CORPORATION
ARMONK
NY
|
Family ID: |
37525265 |
Appl. No.: |
11/149489 |
Filed: |
June 9, 2005 |
Current U.S.
Class: |
1/1 ;
707/999.004 |
Current CPC
Class: |
G06F 16/24556
20190101 |
Class at
Publication: |
707/004 |
International
Class: |
G06F 17/30 20060101
G06F017/30 |
Claims
1. An apparatus comprising: at least one processor; a memory
coupled to the at least one processor; a database query residing in
the memory that includes a SELECT clause and a HAVING clause; and a
query processor residing in the memory and executed by the at least
one processor, the query processor including a HAVING clause
processing mechanism that analyzes the HAVING clause of the
database query and includes only data that satisfies the HAVING
clause in intermediate results for the database query.
2. The apparatus of claim 1 wherein the HAVING clause processing
mechanism analyzes an index in generating the intermediate
results.
3. The apparatus of claim 2 wherein the HAVING clause processing
mechanism skips entries in the index that do not satisfy the HAVING
clause.
4. The apparatus of claim 1 wherein the HAVING clause processing
mechanism performs a table scan in generating the intermediate
results.
5. The apparatus of claim 1 wherein the query processor generates a
final result set for the database query from the intermediate
results.
6. A computer-implemented method for processing a database query
that includes a HAVING clause, the method comprising the steps of:
analyzing the database query; and including only data that
satisfies the HAVING clause in intermediate results for the
database query.
7. The method of claim 6 further comprising the step of analyzing
an index in generating the intermediate results.
8. The method of claim 7 wherein the HAVING clause processing
mechanism skips entries in the index that do not satisfy the HAVING
clause.
9. The method of claim 6 wherein the HAVING clause processing
mechanism performs a table scan in generating the intermediate
results.
10. The method of claim 6 wherein the query processor generates a
final result set for the database query from the intermediate
results.
11. A program product comprising: (A) a query processor that
includes a HAVING clause processing mechanism that analyzes a
HAVING clause of a database query and includes only data that
satisfies the HAVING clause in intermediate results for the
database query; and (B) computer-readable signal bearing media
bearing the query processor.
12. The program product of claim 11 wherein the computer-readable
signal bearing media comprises recordable media.
13. The program product of claim 11 wherein the computer-readable
signal bearing media comprises transmission media.
14. The program product of claim 11 wherein the HAVING clause
processing mechanism analyzes an index in generating the
intermediate results.
15. The program product of claim 14 wherein the HAVING clause
processing mechanism skips entries in the index that do not satisfy
the HAVING clause.
16. The program product of claim 11 wherein the HAVING clause
processing mechanism performs a table scan in generating the
intermediate results.
17. The program product of claim 11 wherein the query processor
generates a final result set for the database query from the
intermediate results.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Technical Field
[0002] This invention generally relates to computer systems, and
more specifically relates to apparatus and methods for processing
database queries.
[0003] 2. Background Art
[0004] Database systems have been developed that allow a computer
to store a large amount of information in a way that allows a user
to search for and retrieve specific information in the database.
For example, an insurance company may have a database that includes
all of its policy holders and their current account information,
including payment history, premium amount, policy number, policy
type, exclusions to coverage, etc. A database system allows the
insurance company to retrieve the account information for a single
policy holder among the thousands and perhaps millions of policy
holders in its database.
[0005] Retrieval of information from a database is typically done
using queries. A query usually specifies conditions that apply to
one or more columns of the database, and may specify relatively
complex logical operations on multiple columns. The database is
searched for records that satisfy the query, and those records are
returned as the query result. Structured Query Language (SQL) is
one specific query language that may be used to query a
database.
[0006] An SQL query typically includes a "SELECT" statement or
clause that specifies what data is desired. SQL also includes a
construct known as a "HAVING" clause that may specify additional
restrictions on the data. When an SQL query has a HAVING clause,
the SELECT clause is processed first to generate intermediate
results, typically in the form of a temporary table. The HAVING
clause is then applied to the intermediate results to generate the
result set for the query. Sometimes the HAVING clause significantly
reduces the amount of data in the intermediate results. In these
cases, the database engine goes to significant work to assemble
data into the intermediate results, only to discard much of the
data once the HAVING clause is processed. Without a way to take the
HAVING clause into account while generating the intermediate
results, the database industry will continue to suffer from
inefficient methods for processing queries with a HAVING
clause.
DISCLOSURE OF INVENTION
[0007] According to the preferred embodiments, a query processor
includes a HAVING clause processing mechanism that processes the
HAVING clause in a query when generating the intermediate results,
instead of generating the intermediate results without regard to
the HAVING clause, then applying the HAVING clause to the
intermediate results. As a result, the size of the intermediate
results may be significantly reduced, and effort is not wasted in
placing data into the intermediate results in one step that would
be removed were the HAVING clause processed separately.
[0008] The foregoing and other features and advantages of the
invention will be apparent from the following more particular
description of preferred embodiments of the invention, as
illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
[0009] The preferred embodiments of the present invention will
hereinafter be described in conjunction with the appended drawings,
where like designations denote like elements, and:
[0010] FIG. 1 is a block diagram of an apparatus in accordance with
the preferred embodiments;
[0011] FIG. 2 is a sample database table to illustrate the concepts
of the preferred embodiments;
[0012] FIG. 3 is a first sample query that could be run against the
table in FIG. 2;
[0013] FIG. 4 is a flow diagram of a prior art method for
processing the query in FIG. 3;
[0014] FIG. 5 is a table showing intermediate results when
processing the SELECT clause of the query in FIG. 3;
[0015] FIG. 6 is a table showing the final result set for the query
in FIG. 3;
[0016] FIG. 7 is a flow diagram of a prior art method for
processing a query;
[0017] FIG. 8 is a block diagram of an index over the deptNumber
column for the table in FIG. 2;
[0018] FIG. 9 is a flow diagram of a method in accordance with the
preferred embodiments for processing a query that includes a HAVING
clause;
[0019] FIG. 10 is a flow diagram of a method in accordance with the
preferred embodiments for processing a query that includes a HAVING
clause;
[0020] FIG. 11 is a second sample table of data;
[0021] FIG. 12 is a second sample query that specifies joining the
table of FIG. 2 with the table of FIG. 11;
[0022] FIG. 13 is a block diagram of a prior art table of
intermediate results for the query in FIG. 12;
[0023] FIG. 14 is a block diagram of intermediate results for the
query in FIG. 12 after applying the HAVING clause to the
intermediate results in FIG. 13;
[0024] FIG. 15 is a third sample query; and
[0025] FIG. 16 is an index over the location and deptNumber columns
for the table in FIG. 2.
BEST MODE FOR CARRYING OUT THE INVENTION
1.0 Overview
[0026] The present invention relates to the processing of database
queries. For those not familiar with databases or queries, this
Overview section will provide background information that will help
to understand the present invention.
Known Databases and Database Oueries
[0027] There are many different types of databases known in the
art. The most common is known as a relational database (RDB), which
organizes data in tables that have rows that represent individual
entries or records in the database, and columns that define what is
stored in each entry or record.
[0028] To be useful, the data stored in databases must be able to
be efficiently retrieved. The most common way to retrieve data from
a database is to generate a database query. A database query is an
expression that is evaluated by a database manager. The expression
may contain one or more predicate expressions that are used to
retrieve data from a database. For example, lets assume there is a
database for a company that includes a table of employees, with
columns in the table that represent the employee's name, address,
phone number, gender, and salary. With data stored in this format,
a query could be formulated that would retrieve the records for all
female employees that have a salary greater than $40,000.
Similarly, a query could be formulated that would retrieve the
records for all employees that have a particular area code or
telephone prefix.
[0029] One popular way to define a query uses Structured Query
Language (SQL). SQL defines a syntax for generating and processing
queries that is independent of the actual structure and format of
the database. Note that an SQL query is expressed in terms of
columns defined on one or more database tables. Information about
the internal storage of the data is not required as long as the
query is written in terms of expressions that relate to values in
columns from tables.
Known Processing of Queries that Include a HAVING Clause
[0030] In the prior art, a query that includes a HAVING clause is
processed in two distinct steps. In the first step, the SELECT
clause is processed, and intermediate results are generated. In the
second step, the further restrictions of the HAVING clause are
applied to the intermediate results. Due to this two-step process,
the SELECT clause may cause large amounts of data to be placed into
the intermediate results in the first step that will have to be
removed in the second step. The overhead of adding data to the
intermediate results when processing the SELECT clause, only to
remove much of the same data when processing the HAVING clause,
negatively affects system performance.
2.0 Description of the Preferred Embodiments
[0031] The preferred embodiments recognize that it is inefficient
and wasteful of system resources to place data into intermediate
results, only to remove that same data to generate the result set.
As a result, the preferred embodiments take the HAVING clause into
account when generating the intermediate results, thereby reducing
the size of the intermediate results and enhancing system
performance.
[0032] Referring to FIG. 1, a computer system 100 is one suitable
implementation of an apparatus in accordance with the preferred
embodiments of the invention. Computer system 100 is an IBM eServer
iSeries computer system. However, those skilled in the art will
appreciate that the mechanisms and apparatus of the present
invention apply equally to any computer system, regardless of
whether the computer system is a complicated multi-user computing
apparatus, a single user workstation, or an embedded control
system. As shown in FIG. 1, computer system 100 comprises a
processor 110, a main memory 120, a mass storage interface 130, a
display interface 140, and a network interface 150. These system
components are interconnected through the use of a system bus 160.
Mass storage interface 130 is used to connect mass storage devices,
such as a direct access storage device 155, to computer system 100.
One specific type of direct access storage device 155 is a readable
and writable CD RW drive, which may store data to and read data
from a CD RW 195.
[0033] Main memory 120 in accordance with the preferred embodiments
contains data 121, an operating system 122, a database 123, one or
more database queries 124, and a query processor 126. Data 121
represents any data that serves as input to or output from any
program in computer system 100. Operating system 122 is a
multitasking operating system known in the industry as OS/400;
however, those skilled in the art will appreciate that the spirit
and scope of the present invention is not limited to any one
operating system. Database 123 is any suitable database, whether
currently known or developed in the future. Database 123 preferably
includes one or more tables. Database query 124 is a query in a
format compatible with the database 123 that allows retrieval of
information stored in the database 123 that satisfies the database
query 124. The preferred embodiments specifically apply to the
processing of queries that include a HAVING clause 125, as shown in
FIG. 1. Query processor 126 processes a database query 124, and
returns a result set that satisfies the query. The query processor
126 includes a HAVING clause processing mechanism 127 that avoids
the inefficiency in the two-step, prior art process of processing
queries with HAVING clauses by taking the HAVING clause information
into account when generating the intermediate results. By analyzing
the HAVING clause while generating intermediate results, the size
of the intermediate results may be greatly reduced.
[0034] Computer system 100 utilizes well known virtual addressing
mechanisms that allow the programs of computer system 100 to behave
as if they only have access to a large, single storage entity
instead of access to multiple, smaller storage entities such as
main memory 120 and DASD device 155. Therefore, while data 121,
operating system 122, database 123, database query 124, and query
processor 126 are shown to reside in main memory 120, those skilled
in the art will recognize that these items are not necessarily all
completely contained in main memory 120 at the same time. It should
also be noted that the term "memory" is used herein to generically
refer to the entire virtual memory of computer system 100, and may
include the virtual memory of other computer systems coupled to
computer system 100.
[0035] Processor 110 may be constructed from one or more
microprocessors and/or integrated circuits. Processor 110 executes
program instructions stored in main memory 120. Main memory 120
stores programs and data that processor 110 may access. When
computer system 100 starts up, processor 110 initially executes the
program instructions that make up operating system 122. Operating
system 122 is a sophisticated program that manages the resources of
computer system 100. Some of these resources are processor 110,
main memory 120, mass storage interface 130, display interface 140,
network interface 150, and system bus 160.
[0036] Although computer system 100 is shown to contain only a
single processor and a single system bus, those skilled in the art
will appreciate that the present invention may be practiced using a
computer system that has multiple processors and/or multiple buses.
In addition, the interfaces that are used in the preferred
embodiments each include separate, fully programmed microprocessors
that are used to off-load compute-intensive processing from
processor 110. However, those skilled in the art will appreciate
that the present invention applies equally to computer systems that
simply use I/O adapters to perform similar functions.
[0037] Display interface 140 is used to directly connect one or
more displays 165 to computer system 100. These displays 165, which
may be non-intelligent (i.e., dumb) terminals or fully programmable
workstations, are used to allow system administrators and users to
communicate with computer system 100. Note, however, that while
display interface 140 is provided to support communication with one
or more displays 165, computer system 100 does not necessarily
require a display 165, because all needed interaction with users
and other processes may occur via network interface 150.
[0038] Network interface 150 is used to connect other computer
systems and/or workstations (e.g., 175 in FIG. 1) to computer
system 100 across a network 170. The present invention applies
equally no matter how computer system 100 may be connected to other
computer systems and/or workstations, regardless of whether the
network connection 170 is made using present-day analog and/or
digital techniques or via some networking mechanism of the future.
In addition, many different network protocols can be used to
implement a network. These protocols are specialized computer
programs that allow computers to communicate across network 170.
TCP/IP (Transmission Control Protocol/Internet Protocol) is an
example of a suitable network protocol.
[0039] At this point, it is important to note that while the
present invention has been and will continue to be described in the
context of a fully functional computer system, those skilled in the
art will appreciate that the present invention is capable of being
distributed as a program product in a variety of forms, and that
the present invention applies equally regardless of the particular
type of computer-readable signal bearing media used to actually
carry out the distribution. Examples of suitable computer-readable
signal bearing media include: recordable type media such as floppy
disks and CD RW (e.g., 195 of FIG. 1), and transmission type media
such as digital and analog communications links. Note that the
preferred signal bearing media is tangible.
[0040] Referring now to FIG. 2, a sample database table 200 is
shown to illustrate the processing of a query in the prior art and
in accordance with the preferred embodiments. Table 200 includes a
deptNumber column, a salary column, a location column, and may
contain other data in other columns. A record number column recNo
is also shown, but this record number is used by the database
engine to uniquely identify each row in the table, and does not
represent user data. We now consider the query in FIG. 3. This
query includes a HAVING clause. Method 400 in FIG. 4 shows a prior
art method 400 for processing the query in FIG. 3. Method 400
begins when a query with a HAVING clause needs to be processed
(step 410). The SELECT clause is first processed to generate
intermediate results (step 420). The intermediate results are
typically in the form of a temporary table. Once the temporary
table is built, the HAVING clause may be applied to the temporary
table to further restrict the data (step 430). The final result set
may then be generated from the intermediate results (step 440). In
many cases, the final result set may be the same as the
intermediate results after applying the HAVING clause, requiring no
processing in step 440.
[0041] Method 400 may be applied to the query in FIG. 3 as shown in
FIGS. 5 and 6. The processing of the select clause in step 420
results in the intermediate results shown in the temporary table of
FIG. 5. The application of the HAVING clause to the data in the
temporary table of FIG. 5 results in the final result set shown in
FIG. 6. This example illustrates a problem with the prior art. Due
to the separate processing of the SELECT clause and the HAVING
clause, the SELECT clause may cause a large number or rows to be
included in the intermediate results that will immediately be
removed once the HAVING clause is processed. The result is
unnecessary overhead in processing a query with a HAVING
clause.
[0042] Another prior art method 700 is shown in FIG. 7. This method
is used when an index may exist that may speed up the execution of
the query. If a suitable index exists for the SELECT clause (step
710=YES), the index may be used to generate the intermediate
results (step 720). A suitable index over the deptNumber column
that could be used in step 720 is shown as index 800 in FIG. 8. If
no index exists for the SELECT clause (step 710=NO), a full table
scan is performed to generate the intermediate results (step 730).
Note that the intermediate results generated in steps 720 and 730
are the results of processing the SELECT clause before applying the
HAVING clause. Thus, for the query in FIG. 3, the intermediate
results shown in the temporary table of FIG. 5 are generated in
both steps 720 and 730. The HAVING clause is then applied to the
intermediate results in FIG. 5 (step 740), resulting in the result
set shown in FIG. 6 (step 750).
[0043] Regardless of whether the intermediate results shown in FIG.
5 are generated using an index or not, many rows are still included
in the temporary table of FIG. 5 that are subsequently deleted
during the processing of the HAVING clause. The preferred
embodiments address the inefficiency of separately processing the
HAVING clause after the intermediate results have been generated.
By processing the HAVING clause as part of the process that
generates the intermediate results, the intermediate results will
include only those rows that satisfy both the SELECT clause and the
HAVING clause.
[0044] Referring to FIG. 9, a method 900 in accordance with the
preferred embodiments begins when a query with a HAVING clause
needs to be processed (step 910). The SELECT clause and HAVING
clause are processed at the same time in generating the
intermediate results (step 920), which assures that only data that
satisfies both the SELECT and HAVING clauses is included in the
intermediate results. The final result set may then be generated
from the intermediate results (step 930). Note that in many cases
the intermediate results in step 920 will be the final result set,
resulting in no processing in step 930.
[0045] FIG. 10 shows a method 1000 in accordance with the preferred
embodiments that takes into account indexes in processing the
database query. The SELECT clause and HAVING clause are processed
to determine whether a suitable index exists (step 1010). If so
(step 1020=YES), the SELECT clause and HAVING clauses are both
processed using the index to generate the intermediate results
(step 1030). If no suitable index is found (step 1020=NO), the
SELECT clause and HAVING clauses are both processed using a table
scan to generate the intermediate results (step 1040). Once the
intermediate results have been generated in step 1030 or 1040, the
final result set may be generated from the intermediate results
(step 1050). Again, in many cases the intermediate results will be
the final result set, resulting in no processing in step 1050.
[0046] When a suitable index is found in step 1020, the processing
of the HAVING clause in generating the intermediate results may be
simplified. For example, let's assume the index 800 in FIG. 8
exists. The index helps in processing the HAVING clause by skipping
the entries in the index that do not satisfy the HAVING clause. The
query processor can tell from the index that only one department,
namely department 4, has enough employees to satisfy the HAVING
clause. For this reason, the department numbers 1, 2, 3, 5, 6, 7
and 8 may be skipped when using the index 800 to process the query
in FIG. 3. The preferred embodiments expressly extends to skipping
index entries that do not satisfy the HAVING clause.
[0047] A second example query is now presented to further
illustrate the concepts of the preferred embodiments. Referring to
FIG. 11, a personallnfoTable is shown with a column expectedRaise
that corresponds to a department number of the employee. We now
consider a second sample query shown in FIG. 12, which references
the employeeTable in FIG. 2 as well as the personalInfoTable in
FIG. 11. Because this query requires the employeeTable and
personallnfoTable to be joined, the preferred embodiments
illustrate that many join operations that would be performed in the
prior art may be avoided by processing the SELECT clause and HAVING
clause at the same time to generate the intermediate results.
[0048] When the query of FIG. 12 is processed using the prior art
method 400 in FIG. 4, the SELECT clause is first processed to
generate the intermediate results (step 420). Applying the SELECT
clause in FIG. 12 to the tables in FIGS. 2 and 11 generates
intermediate results as shown in the temporary table in FIG. 12.
Note that for every department number, a join operation must be
performed with a row from the personallnfoTable in FIG. 11. Thus,
the intermediate results in FIG. 13 that include 8 rows are
generated using 8 separate join operations. The HAVING clause is
then applied to the intermediate results in FIG. 12 (step 430),
resulting in the intermediate results shown in FIG. 14. Again,
seven of the eight rows that were inserted into the intermediate
results when processing the SELECT clause were deleted when later
processing the HAVING clause. The intermediate results in FIG. 14
may then be processed to generate the final result set shown in
FIG. 6. Because the expectedRaise column is not relevant to the
query results, this column may be omitted from the final result
set.
[0049] When the query of FIG. 12 is processed using method 900 in
FIG. 9, the SELECT and HAVING clauses are both used to generate the
intermediate results (step 920). As a consequence, the intermediate
results generated in step 920 are shown in FIG. 14. Taking the
HAVING clause into account when generating the intermediate results
therefore reduces the number of required join operations from 8 to
1. The final result set may then be generated from the intermediate
results (step 930). Again, because the expectedRaise column is not
relevant to the query results, this column may be omitted from the
final result set. The final result set is the same as shown in FIG.
6.
[0050] A third example query in FIG. 15 is used to illustrate
another benefit that arises from processing the SELECT clause and
HAVING clause when generating the intermediate results. By
processing both, an index may be used to process the query that
wouldn't normally be used in the prior art. The query in FIG. 15
includes a "where location=?" clause. Referring to FIG. 16, an
index 1600 over the location and deptNumber columns could be used
in processing the SELECT clause of FIG. 15. However, we note that
the only department number that has enough employees to satisfy the
HAVING clause is in the MN location (Minnesota). The query
processor can thus determine that the location is not relevant to
the query results. As a result, a different index, such as index
800 in FIG. 8, could be used instead. As described above, the
HAVING clause may be processed using the index 800 in FIG. 8 by
skipping over index entries that do not satisfy the HAVING clause.
Because department 4 is the only department with enough employees
to satisfy the HAVING clause, the index entries for 1, 2, 3, 5, 6,
7 and 8 are skipped by the query processor. By skipping index
entries that do not satisfy the HAVING clause, the query processor
succeeds in generating intermediate results that satisfy both the
SELECT clause and the HAVING clause, thereby eliminating the
overhead of inserting rows into a temporary table only to delete
them in the very next step when processing the HAVING clause.
[0051] The discussion above regarding the processing of the query
in FIG. 16 shows that the concept of which index is "suitable"
differs between the prior art and the preferred embodiments. In the
prior art method 700 in FIG. 7, index 1700 in FIG. 17 would be
considered suitable to process the query in FIG. 16 in step 710,
while index 800 in FIG. 8 would not be suitable. In contrast, in
method 1000 of the preferred embodiments, both index 1700 in FIG.
17 and index 800 in FIG. 8 are deemed to be suitable indexes in
step 1020, but index 800 is preferred because the query processor
may determine from index 1700 that the location is not relevant to
the query results. For this reason, the preferred embodiments allow
using indexes in processing a query that would not be used in prior
art methods.
[0052] The preferred embodiments process a query in a way that
generates intermediate results that satisfy both the SELECT clause
and the HAVING clause of the query. This prevents expending system
resources to retrieve rows that satisfy the SELECT clause, only to
throw them away when separately processing the HAVING clause, as is
done in the prior art. The preferred embodiments thus provide
enhanced system performance when processing queries with HAVING
clauses.
[0053] One skilled in the art will appreciate that many variations
are possible within the scope of the present invention. Thus, while
the invention has been particularly shown and described with
reference to preferred embodiments thereof, it will be understood
by those skilled in the art that these and other changes in form
and details may be made therein without departing from the spirit
and scope of the invention.
* * * * *