Methods for in-place online reorganization of a database Iyer, Balakrishna Raghavendra ; et al. [Iyer, Balakrishna Raghavendra]

Methods for in-place online reorganization of a database

Iyer, Balakrishna Raghavendra ; et al.

Patent Application Summary

U.S. patent application number 10/151393 was filed with the patent office on 2002-10-03 for methods for in-place online reorganization of a database. Invention is credited to Iyer, Balakrishna Raghavendra, Sockut, Gary Howard.

Application Number	20020143743 10/151393
Document ID	/
Family ID	22821987
Filed Date	2002-10-03

United States Patent Application	20020143743
Kind Code	A1
Iyer, Balakrishna Raghavendra ; et al.	October 3, 2002

Methods for in-place online reorganization of a database

Abstract

An in-place reorganization of a database achieves reasonably accurate results for users during high-throughput concurrent usage of the database. The reorganization's movement of records across a user transaction's position within a scan of the database is tracked. The behavior of the user transaction is corrected to account for the movement of the records.

Inventors:	Iyer, Balakrishna Raghavendra; (San Jose, CA) ; Sockut, Gary Howard; (San Jose, CA)
Correspondence Address:	Attention: George H. Gates Gates & Cooper LLP Howard Hughes Center 6701 Center Drive West, Suite 1050 Los Angeles CA 90045 US
Family ID:	22821987
Appl. No.:	10/151393
Filed:	May 20, 2002

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
10151393	May 20, 2002
09220084	Dec 23, 1998
6411964

Current U.S. Class:	1/1 ; 707/999.001; 707/E17.005
Current CPC Class:	G06F 16/2282 20190101; Y10S 707/99953 20130101
Class at Publication:	707/1
International Class:	G06F 007/00

Claims

What is claimed is:

1. A computer-implemented method of providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising: (a) tracking, in the computer, the reorganization's movement of records across a user transaction's position within a scan of the database; and (b) correcting, in the computer, a behavior of the user transaction to account for the movement of the records.

2. The method of claim 1, wherein the reorganization clusters data.

3. The method of claim 1, wherein the reorganization provides online restoration of clustering.

4. The method of claim 1, wherein the reorganization distributes free space evenly.

5. The method of claim 1, wherein the reorganization removes overflow.

6. The method of claim 1, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.

7. The method of claim 6, wherein the concurrent user transactions are allowed to use different types of access to the database.

8. The method of claim 1, wherein the reorganization allows repeatable read for the user transaction.

9. The method of claim 1, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.

10. The method of claim 1, wherein the tracking step comprises maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.

11. The method of claim 10, wherein the tracking step comprises concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.

12. The method of claim 10, wherein a type of access determines a type of movement list used in the reorganization.

13. The method of claim 10, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.

14. The method of claim 13, wherein the forward ordered list tracks records that the reorganization moves forward.

15. The method of claim 13, wherein the forward ordered list tracks records that the user transaction has already processed.

16. The method of claim 13, further comprising omitting redundant processing of records that appear in the forward ordered list.

17. The method of claim 13, wherein the backward ordered list tracks records that the reorganization moves backward.

18. The method of claim 13, wherein the backward ordered list tracks records that the user transaction has not yet processed.

19. The method of claim 13, further comprising processing records that appear on the backward ordered list.

20. The method of claim 13, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.

21. The method of claim 13, further comprising applying a differential list to a list of records maintained by the user transaction to correct the list of records.

22. The method of claim 21, wherein the list of records is derived from an index.

23. The method of claim 21, further comprising correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.

24. The method of claim 13, further comprising modifying the differential list when the reorganization moves a record.

25. The method of claim 13, further comprising maintaining both a recent differential list and a composite differential list.

26. A computer-implemented system for providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising: (a) a computer; (b) means, performed by the computer, for tracking a reorganization's movement of records across a user transaction's position within a scan of the database; and (c) means, performed by the computer, for correcting a behavior of the user transaction to account for the movement of the records.

27. The system of claim 26, wherein the reorganization clusters data.

28. The system of claim 26, wherein the reorganization provides online restoration of clustering.

29. The system of claim 26, wherein the reorganization distributes free space evenly.

30. The system of claim 26, wherein the reorganization removes overflow.

31. The system of claim 26, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.

32. The system of claim 31, wherein the concurrent user transactions ate allowed to use different types of access to the database.

33. The system of claim 26, wherein the reorganization allows repeatable read for the user transaction.

34. The system of claim 26, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.

35. The system of claim 26, wherein the means for tracking comprises means for maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.

36. The system of claim 35, wherein the means for tracking comprises means for concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.

37. The system of claim 35, wherein a type of access determines a type of movement list used in the reorganization.

38. The system of claim 35, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.

39. The system of claim 38, wherein the forward ordered list tracks records that the reorganization moves forward.

40. The system of claim 38, wherein the forward ordered list tracks records that the user transaction has already processed.

41. The system of claim 38, further comprising means for omitting redundant processing of records that appear in the forward ordered list.

42. The system of claim 38, wherein the backward ordered list tracks records that the reorganization moves backward.

43. The system of claim 38, wherein the backward ordered list tracks records that the user transaction has not yet processed.

44. The system of claim 38, further comprising means for processing records that appear on the backward ordered list.

45. The system of claim 38, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.

46. The system of claim 38, further comprising means for applying a differential list to a list of records maintained by the user transaction to correct the list of records.

47. The system of claim 46, wherein the list of records is derived from an index.

48. The system of claim 46, further comprising means for correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.

49. The system of claim 38, further comprising means for modifying the differential list when the reorganization moves a record.

50. The system of claim 38, further comprising means for maintaining both a recent differential list and a composite differential list.

51. An article of manufacture embodying logic for a computer-implemented method of providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising: (a) tracking the reorganization's movement of records across a user transaction's position within a scan of the database; and (b) correcting a behavior of the user transaction to account for the movement of the records.

52. The method of claim 51, wherein the reorganization clusters data.

53. The method of claim 51, wherein the reorganization provides online restoration of clustering.

54. The method of claim 51, wherein the reorganization distributes free space evenly.

55. The method of claim 51, wherein the reorganization removes overflow.

56. The method of claim 51, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.

57. The method of claim 56, wherein the concurrent user transactions are allowed to use different types of access to the database.

58. The method of claim 51, wherein the reorganization allows repeatable read for the user transaction.

59. The method of claim 51, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.

60. The method of claim 51, wherein the tracking step comprises maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.

61. The method of claim 60, wherein the tracking step comprises concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.

62. The method of claim 60, wherein a type of access determines a type of movement list used in the reorganization.

63. The method of claim 60, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.

64. The method of claim 63, wherein the forward ordered list tracks records that the reorganization moves forward.

65. The method of claim 63, wherein the forward ordered list tracks records that the user transaction has already processed.

66. The method of claim 63, further comprising omitting redundant processing of records that appear in the forward ordered list.

67. The method of claim 63, wherein the backward ordered list tracks records that the reorganization moves backward.

68. The method of claim 63, wherein the backward ordered list tracks records that the user transaction has not yet processed.

69. The method of claim 63, further comprising processing records that appear on the backward ordered list.

70. The method of claim 63, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.

71. The method of claim 63, further comprising applying a differential list to a list of records maintained by the user transaction to correct the list of records.

72. The method of claim 71, wherein the list of records is derived from an index.

73. The method of claim 71, further comprising correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.

74. The method of claim 63, further comprising modifying the differential list when the reorganization moves a record.

75. The method of claim 63, further comprising maintaining both a recent differential list and a composite differential list.

Description

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application is a Continuation of application Ser. No. 09/220,084, filed Dec. 23, 1998, entitled "METHODS FOR IN-PLACE ONLINE REORGANIZATION OF A DATABASE," which application is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates in general to database management systems performed by computers, and in particular, to a method of providing in-place reorganization of a database.

[0004] 2. Description of Related Art.

[0005] (Note: This application references a number of different publications as indicated throughout the specification by reference numbers enclosed in brackets, e.g., [x]. A list of these different publications ordered according to these reference numbers can be found in Section 8 of the "Detailed Description of the Preferred Embodiment." Each of these publications is incorporated by reference herein.)

[0006] Any database management system (DBMS) can need some type of reorganization. Reorganization of a database is defined as changing some aspect of the logical and/or physical arrangement of the database. A tutorial paper referenced in [12] discusses issues in reorganization and types of reorganization. This specification describes the problem in reorganizing offline, and the need for online reorganization. (See, e.g., [11]).

[0007] The type of reorganization described herein involves restoration of clustering. Clustering is the practice of storing records near each other if they meet certain criteria. One popular criterion is consecutive values in a column of the records. Clustering should reduce disk input/output for records that users often access together. When users write data into the database, this writing can decrease the amount of clustering and thus degrade performance.

[0008] Reorganization can restore clustering and performance. During most types of reorganization in a typical database, the area being reorganized is offline or only partially available; users cannot write (and perhaps cannot even read) data in that area. However, a highly available database (a database that is to be fully available 24 hours per day, 7 days per week) should not go offline for significant periods, of course. Applications that require high availability include reservations, finance (especially global finance), process control, hospitals, police, armed forces, and Internet service.

[0009] Even for less essential applications, many database administrators prefer 24-hour availability. The maximum tolerable period of unavailability is specific to the application. When queried, DBMS customers (not all of whom have highly available databases) state that the maximum tolerable period ranges from 0 to 5 hours. Even without such a preference for 24-hour availability, reorganizing a very large database might require much longer than the maximum tolerable period of unavailability.

[0010] As examples of very large databases, a survey paper [6] mentions a database with several terabytes of data and the desire for one with petabytes. The author of one book [14] considers offline reorganization such an important problem for very large databases that he defines a very large database as one "whose reorganization by reloading takes a longer time than the users can afford to have the database unavailable." These considerations call for the ability to reorganize the database online (concurrently with usage or incrementally within users' transactions), so that users can read and write the database during most or all phases of reorganization.

[0011] In the context of papers that do not concentrate on online reorganization, many people have stated the need for the ability to reorganize online. As the amount of information and dependence on computers both grow, the number of very large or highly available databases will grow. Therefore, the importance of online reorganization will grow.

[0012] The present invention provides methods for in-place online reorganization (specifically, for restoration of clustering). The data structures are those of IBM's DBMS Database 2 (DB2) for OS/390 [4], but the concepts in the methods presented herein should apply to many DBMS's. The methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized. To allow high-throughput concurrent usage by users of the database, the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement.

[0013] This specification describes relevant features of a DBMS, discusses the advantages of the present invention over previous research (including the novelty of the methods), presents the concepts in the methods, describes the methods in more detail, and proposes extensions based on the methods.

SUMMARY OF THE INVENTION

[0014] To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, system, and article of manufacture for providing in-place reorganization of a database that achieves reasonably accurate results for users during high-throughput concurrent usage of the database. The reorganization's movement of records across a user transaction's position within a scan of the database is tracked. The behavior of the user transaction is corrected to account for the movement of the records.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] Referring now to the drawings in which like reference numbers represent corresponding parts throughout:

[0016] FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention;

[0017] FIG. 2 shows the structure of file pages according to the preferred embodiment of the present invention;

[0018] FIG. 3 shows the structure of index pages according to the preferred embodiment of the present invention;

[0019] FIG. 4 shows an example of a clustering index, a table space, and a non-clustering index according to the preferred embodiment of the present invention;

[0020] FIG. 5 shows a state transition diagram for the reorganizer according to the preferred embodiment of the present invention;

[0021] FIG. 6 shows the structure of a forward ordered list (FOL) according to the preferred embodiment of the present invention;

[0022] FIG. 7 shows a state transition diagram for each record that a user will eventually process according to the preferred embodiment of the present invention;

[0023] FIG. 8 illustrates the structure of a differential list according to the preferred embodiment of the present invention;

[0024] FIG. 9 shows a diagram for the states of a record ID (RID) according to the preferred embodiment of the present invention;

[0025] FIG. 10 shows an associated chart for the states of a pair of record IDs (RIDs) (source and target) according to the preferred embodiment of the present invention;

[0026] FIG. 11 is a flowchart if the main loop used by the reorganizer according to the preferred embodiment of the present invention;

[0027] FIG. 12 is a flowchart of the CLUSTER_RECORD procedure according to the preferred embodiment of the present invention;

[0028] FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure according to the preferred embodiment of the present invention;

[0029] FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure according to the preferred embodiment of the present invention;

[0030] FIG. 15 is a flowchart of the MODIFY_DL procedure according to the preferred embodiment of the present invention;

[0031] FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan according to the preferred embodiment of the present invention;

[0032] FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure according to the preferred embodiment of the present invention;

[0033] FIG. 18 is a flowchart that describes the logic of the SCAN procedure according to the preferred embodiment of the present invention; and

[0034] FIGS. 19A-19D together are a flowchart that describes the logic of the CORRECT_RID_LISTS procedure according to the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0035] In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional changes may be made without departing from the scope of the present invention.

[0036] 1.0 Introduction

[0037] The present invention describes methods for performing one type of reorganization online. The type of reorganization distributes free space evenly, removes overflow, and clusters data. The methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized. To allow high-throughput concurrent usage by users of the database, the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement. The novelty provided by the present invention is online restoration of clustering that is both (1) in place (to obviate extra disk space for a copy of the data) and (2) supportive of high-throughput concurrent usage with reasonably accurate results.

[0038] 1.1 Hardware Environment

[0039] FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention. In the exemplary environment, a computer system 100 is comprised of one or more processors 102, each of which is connected to one or more data storage devices 104. The processor 102 executes database management system (DBMS) software 106 that manages one or more databases having one or more tables 108 and associated indices 110 stored on the data storage devices 104.

[0040] The tables 108 and indices 110 may be comprised of a plurality of partitions, as for example in the table 108 labeled as TS#1 and the index 110 labeled as IN#1, wherein the partitions are labeled as P1, P2, P3. This partitioning scheme, which is well known in the art, allows parallelized access and retrieval functions among both the data storage devices 104.

[0041] Alternatively, the tables 108 and indices 110 may be comprised of a single contiguous space, as for example in the table 108 labeled as TS#2 and the index 110 labeled as IN#2. In effect, this configuration comprises a single partition.

[0042] The processor 102 also executes reorganizer software 112 that reorganizes the tables 108 and associated indices 110 stored on the data storage devices 104. This reorganization comprises a logical and/or physical arrangement of the tables 108 and associated indices 110.

[0043] Generally, the DBMS 106, tables 108, associated indices 110, and reorganizer 112 comprise instructions and/or data that is embodied in or retrievable from a computer-readable device, medium, or carrier, e.g., a memory, a data storage device 104, a remote device coupled to the computer system 100 by a data communications device, etc. Moreover, these instructions and/or data, when read, executed, and/or interpreted by the computer system 100, cause the computer system 100 to perform the steps necessary to implement and/or use the present invention.

[0044] Thus, the present invention may be implemented as a method, system, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term "article of manufacture", or alternatively, "computer program carrier", as used herein is intended to encompass instructions and/or logic and/or data accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope of the present invention.

[0045] Those skilled in the art will recognize that any combination of the above components, or any number of different components, including computer programs, peripherals, and other devices, may be used to implement the present invention, so long as similar functions are performed thereby.

[0046] 2.0 Features of a DBMS

[0047] This section describes a set of DBMS 106 features (storage structures, types of concurrency control, and types of access), which come from System R [2] and DB2 [4, 3]; several other relational DBMS's use comparable features. This description applies to normal usage, i.e., without online reorganization; the types of access are later modified to accommodate online reorganization.

[0048] 2.1 Storage Structures, Structural Degradation, and Offline Reorganization

[0049] 2.1.1 Storage Structures for Data

[0050] The discussion of storage structures begins with the structures for data.

[0051] A row of a table 108 in a database is a logical unit within the table 108. For example, if Jones is an employee, a table 108 of employees includes a row for Jones. A row contains columns of data, e.g., for name, job tide, and salary. A column can have a fixed length (which does not change) or a variable length. The length of a variable-length column changes for each row according to the length of the data that users place in that column in that row. Ordinarily, the DBMS 106 implements a row by one data record, which is a lower-level (more physical) unit in storage. It will be explained shortly that sometimes the DBMS 106 implements a row by two data records. Users see rows, but do not directly see data records.

[0052] When users write rows (and thus the DBMS 106 writes records to implement the writing of rows), the DBMS 106 tracks the writing by appending corresponding entries to a collection of entries called the log. Later, it is possible to recover after an accidental loss of data by reloading from a backup copy of data and then applying (performing on the data) the log entries that the DBMS 106 appended after creation of the backup copy. The log record sequence number (LRSN) of a log entry is a number that represents the entry's position in the log.

[0053] A table 108 space is a region of storage that stores the data records for one or more tables 108. For simplicity, only one table 108 per table 108 space is used in the examples described herein, although any number of tables 108 and table 108 spaces could be used.

[0054] The DBMS 106 divides a table 108 space into units called file pages. FIG. 2 shows the structure 200 of file pages. The header of each file page includes the LRSN of the most recently written log entry that corresponds to writing of that page. A file page contains zero or more data records, which the DBMS 106 allocates at the beginning of the page (after the header). Deletion of records can cause gaps between remaining records. The end of a file page contains an ID map, which is an array of pointers (offsets of data records within the page). The term "slot" is used to mean the place (if any) to which an ID map entry points.

[0055] An ID map helps identify records. In DB2 and several other DBMS's that use the SQL [1] database language, not every table 108 has a unique key (a set of columns that identifies rows). Therefore, file pages, entries in the log, and indices 110 (structures that speed access to records) cannot use a key for identification. Instead, they use a record's record identifier (RID), which contains the record's page number and the offset of the record's entry in the ID map. A record's RID can change only during reorganization.

[0056] There are effects on storage structures during insertion of a row or during growth by update (modification) of a variable-length column of an existing row. During these operations, if the desired page lacks enough contiguous free space (the space available for insertions and growth), the DBMS 106 compacts the page to make its free space contiguous. During compaction, when the DBMS 106 moves a record, the DBMS 106 updates the ID map's pointer to the record; compaction does not change the record's RID. If compaction produces enough free space, the data goes into the desired page.

[0057] If such compaction does not produce enough space, the data goes into another page. On an insertion, a new record goes into that other page. On growth by update of existing data, the data moves into a new overflow data record in the other page, and the existing data record in the original page becomes a pointer data record, which contains the RID of the new overflow data record. Thus, the DBMS 106 sometimes implements a row by two records (a pointer record and an overflow record). Data records that do not involve overflow (hopefully, most data records) are regular data records.

[0058] In FIG. 2, file page 17 contains a regular data record and a pointer data record. File page 22 contains a regular data record and an overflow data record. The pointer record in page 17 contains the RID of the overflow record in page 22. The two bits in the header of each data record in FIG. 2 indicate whether the record is a pointer and whether it is an overflow, respectively.

[0059] 2.1.2 Storage Structures for Indices

[0060] A table 108 has zero or more indices 110, each of which uses an associated key (set of columns). For example, an employee table 108 might have an index 110 whose key is the table's 108 department number, an index 110 whose key is social security number, and an index 110 whose key is the combination of last name and first name. Within an index 110, the DBMS 106 maintains the key values in sorted order. Defining a key to be unique means that no two rows can have the same values in the key columns.

[0061] The DBMS 106 divides the storage of an index 110 into units called index 110 pages. FIG. 3 shows the structure 300 of index 110 pages, which the DBMS 106 arranges in a hierarchy. In this example, the index 110 pages that are numbered 101 and 109 are leaves of the hierarchy, and index 110 page 137 is a nonleaf. Each entry in a leaf page contains a key value and a list of RIDs whose records have that key value. The DBMS 106 designer (or, in some DBMS's, the database designer) optionally specifies that the DBMS 106 will sort each list by RID.

[0062] Each entry in a nonleaf page points to another nonleaf page or a leaf page, although this simple figure shows no entries that point to other nonleaf pages. Each entry in a nonleaf page also contains the value of the highest key of the page to which the entry points. For example, the first entry in index 110 page 137 contains 101 (the number of another index 110 page) and the highest key value of page 101. The second entry contains corresponding information for page 109. A possible alternative implementation is for each entry in a nonleaf page to contain the value of the lowest key of the page after the page to which the entry points.

[0063] For each table 108, the database designer declares at most one index 110 as a clustering index 110. In reorganization (and, whenever possible, in subsequent insertions), the assignment of data records to file pages reflects the data records' order in the key of the clustering index 110. For example, the records having the first few values of the key might reside (be stored) in one file page, the records having the next few values might reside in a second file page, etc. This clustering speeds some queries.

[0064] The database designer optionally declares the clustering index 110 to be a partitioning index 110. Here, the DBMS 106 divides the table 108 space (and the clustering index 110) into partitions according to values of the indexed key. The table 108 space is called a partitioned table 108 space. Partitions reside in separate files; a nonpartitioned table 108 space can reside in one file.

[0065] For a partitioned table 108 space, FIG. 4 shows an example of the clustering index 400, the table space 402, and a non-clustering index 404 (an index that is not the clustering index 400). For example, if the key of the clustering index 400 is an employee's name, partitions might cover names starting with A through F, G through M, and N through Z, respectively, as in the figure. The arrows represent RIDs in indices 400 and 404. In the set of leaf pages of the clustering index 400 for a partitioned table space 402, the RIDs for each partition are contiguous.

[0066] Within a nonclustering index 404, however, the RIDs for a partition need not be contiguous. For example, suppose that the key of a nonclustering index 404 is an employee's job title. The order of RIDs in the index 404 might be the RIDs for accountants (for all partitions), the RIDs for architects (for all partitions), the RIDs for artists (for all partitions), etc. Thus, the RIDs for a partition are scattered throughout the index 404, since some of them might be accountants, some might be architects, etc.

[0067] FIG. 4 illustrates lack of contiguity via two RIDs for the G-M partition with an intervening RID for the N-Z partition. Within each value of the index 404 key (e.g., "architect"), the RIDs might be sorted. Therefore, the RIDs for a partition might be contiguous within each key value, but they are not contiguous throughout the nonclustering index 404.

[0068] 2.1.3 Structural Degradation and Offline Reorganization

[0069] The storage structures described above can degrade. One type of degradation occurs when free space becomes unevenly distributed among the file pages of a table 108 space. After subsequent insertions, the order of some records no longer reflects the clustering index 110. This type of degradation slows some queries.

[0070] A second type of degradation occurs when variable-length data grows too large to fit in its original file page. The DBMS 106 then creates an overflow in another page and makes the original record a pointer. Indices 110 still contain the RID of the original record. This causes an extra page reference and thus extra time on some queries.

[0071] Reorganization removes such structural degradation. Specifically, reorganization distributes free space evenly, removes overflows (so that each row uses just one record, not two), and clusters data. Reorganization can move records between file pages. Therefore, the page that contains the record for a row after reorganization might differ from the page or pages that contained a record or records for that row before reorganization. Offline reorganization operates by (1) unloading (copying out) the data, (2) sorting the unloaded data by clustering key, and (3) reloading the data in sorted order. Users have read-only access (i.e., they can read but cannot write the data) during steps 1 and 2; they have no access during step 3.

[0072] To start offline reorganization, a database administrator issues a command. A parameter of the command specifies the name of the table 108 space that the DBMS 106 should reorganize. For a partitioned table 108 space, another parameter, which is optional, specifies the partition to reorganize; absence of this parameter signifies reorganization of the entire table 108 space. The term "area being reorganized" (often shortened to just "area") is used to mean the table 108 space or partition on which reorganization operates.

[0073] 2.2 Types of Concurrency Control

[0074] A user can choose one of these levels of isolation:

[0075] For repeatable read, a transaction holds locks until commitment. Several reads of the same data during one transaction will yield the same values.

[0076] For cursor stability, a transaction holds read-locks (if any) on only one page (per cursor [1]) at a time, except when crossing a page boundary, and holds write-locks (if any) until commitment. This dramatically improves concurrency and thus throughput, but several reads of the same data during one transaction will yield different values if another user has changed that data. Also, even if a transaction is scanning an area of the database and reading each record once (e.g., to calculate the sum of all the deposits in a bank), other users' writing (e.g., transferring money from one bank account to another) can cause an inconsistent calculation. Many users choose cursor stability, but this choice requires tolerance of inconsistency or knowledge that the concurrent writing will not cause inconsistency.

[0077] Variations on these levels are possible in some DBMS's.

[0078] 2.3 Types of Access

[0079] This section describes three types of access to a single table 108; although the simplifications that sometimes apply are omitted. For each type, an example query in SQL [1] is provided. In each query, the keyword "WHERE" precedes a combination of one or more predicates. A predicate is an expression whose instantiation returns "true," "false," or "unknown." An example of a predicate is "C=39."

[0080] (1) Table 108 space scan: An example query (where table 108 T has no index 110 on column C) is:

[0081] SELECT * FROM T WHERE C=39

[0082] The DBMS 106 reads every page in the table 108 space. For each page of the table 108, it scans the ID map to find the rows.

[0083] For each row, it evaluates the WHERE clause (if any) and returns the tow (to the application program) if it satisfies the clause. The DBMS 106 processes an overflow record only during the scan of the page containing its pointer.

[0084] (2) Index 110 scan: An example query (where T has an index 110 whose first column is C) is:

[0085] SELECT * FROM T WHERE (C BETWEEN 47 AND 51) . . .

[0086] The ellipsis represents possible additional predicates in the query. The DBMS 106 scans the relevant subset of the index 110 to select the RIDs whose key values satisfy the indexed part of the predicate on C. For each selected RID, it reads the row, evaluates the remainder (if any) of the WHERE clause for that row, and returns the row if it satisfies the clause.

[0087] (3) Multiple index 110 access: An example query (where T has an index 110 on C1, an index 110 on C2, and an index 110 on C3) is:

[0088] SELECT * FROM T WHERE ((C1=68 AND C2=7) OR C3=101)

[0089] The WHERE clause includes the conjunctions and/or disjunctions of n predicates, each of which the DBMS 106 can evaluate by just scanning an index 110. This type of access involves two lists of RIDs. One list (obtained from scanning an index 110) is called the recent RID list, and the other list (built incrementally by merging the recent RID list from each index 110) is called the composite RID list.

[0090] The DBMS 106 executes these phases:

[0091] For J=1 to n, perform phases 1, 2, and 3:

[0092] 1. Initialize the recent RID list to empty. Scan the relevant subset of the index 110 that enables evaluation of the jth predicate. For each RID whose key value satisfies the jth predicate, append the RID to the recent RID list.

[0093] 2. Sort the recent RID list (ordered by RID) if it comes from more than one key value. In a DBMS 106 that does not order RIDs within each entry on an index 110 leaf page, sort the RID list even if it comes from just one key value.

[0094] 3. If J=1, make the recent RID list the composite RID list. If J>1, merge the recent RID list into the composite RID list; i.e., calculate the intersection (for a conjunction) or union (for a disjunction).

[0095] Perform phase 4 once:

[0096] 4. For each RID in the composite RID list, read the row, evaluate the remainder (if any) of the WHERE clause (for repeatable read) or the entire WHERE clause (for cursor stability), and return the row if it satisfies the clause. For cursor stability, the evaluation of the entire WHERE clause accommodates other users' possible writing into this table 108.

[0097] 3.0 Advantages Over Previous Research

[0098] A survey paper [13] describes issues and surveys work in online reorganization of databases. Most of the work describes types of reorganization other than restoration of clustering (which is the type that is performed by the present invention). Some of the work deals with restoration of clustering, but most of this work in clustering does not discuss the issue of movement of data across a user's position in a scan.

[0099] Previous research in in-place online restoration of clustering requires repeatable read:

[0100] Omiecinski et al. [7, 8] describe an method for online reorganization in place. They use repeatable read with page-level locking.

[0101] Salzberg and Dimock [9] describe techniques for online changing of references to a moved record. For user transactions that perform scanning, the authors note that using repeatable read with page-level locking avoids movement of data across a user s position in a scan.

[0102] In contrast to the research above, the methods of the present invention allow, but do not require, repeatable read; they also allow cursor stability. The typical throughput using cursor stability exceeds the typical throughput using repeatable read, and many customers demand the opportunity to use cursor stability. With cursor stability, a user transaction can experience some inaccuracy due to concurrent user transactions, but customers could not tolerate the potentially massive amount of inaccuracy that could result from online reorganization in place without mechanisms to correct the inaccuracy. The reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct the inaccuracy. They allow cursor stability (with reasonably accurate results) during online reorganization in place.

[0103] One method for online restoration of clustering in DB2's structures [5, 11, 10] operates by copying data from the original copy of the area being reorganized (which users access) into a shadow copy of that area (in reorganized form). A survey paper [13] describes other methods for online reorganization by copying. Such an approach differs from our methods, which reorganize in place without making a shadow copy.

[0104] Following are disadvantages of reorganization by copying:

[0105] It can require more disk space for the area being reorganized, to store (1) the shadow copy of the area and (2) for some DBMS's, a mapping between records' RIDs in the original copy and corresponding records' RIDs in the shadow copy.

[0106] It involves a transition between directing users' accesses into the original copy and directing them into the shadow copy. This transition typically requires a period of limited or nonexistent user access.

[0107] The benefit for users (by letting them access reorganized data) begins after the transition; reorganization in place might begin the benefit immediately (as the area that users access is reorganized).

[0108] Advantages of reorganization by copying include (1) simplicity without sacrificing cursor stability and (2) reduced contention with users for access to the area being reorganized.

[0109] The novelty of the approach described herein is online restoration of clustering that is both (1) in place (to obviate extra disk space) and (2) supportive of cursor stability (for high throughput) with reasonably accurate results. There is previous work in online restoration of clustering that has just the first property or just the second, but there is no previous work that has both properties.

[0110] 4.0 Concepts in the Methods

[0111] 4.1 Overview of the Methods

[0112] The term reorganizer 112 is used to denote a DBMS 106 process that executes in the background to sweep through a table 108 space (or a partition of a table 108 space) and reorganize it. The methods involve any number of users and one reorganizer 112. The reorganizer 112 usually has the lowest priority in scheduling, but if the reorganizer's 112 locks have blocked a user, then the system should either preempt the reorganizer 112 or raise its allocation priority, to avoid a convoy of users waiting for the reorganizer 112.

[0113] The order in which a user's transaction accesses data or index 110 entries is the order in the table 108 space or in the index 110. At any time during a transaction, the user has reached some position in the order. Online reorganization can move records and thus change the order, and each record that moves between pages gets a new RID. Something that was ahead of the user's position can move behind, or something behind can move ahead. The methods described herein assure that if something moves, the user will access each appropriate piece of data exactly once (not twice and not zero times). Thus, the result of the transaction is consistent with the user's chosen level of isolation. This accommodation of the changed order is considered to be worthwhile, since the rate of movement by the reorganizer 112 might greatly exceed the rate of writing by regular transactions that execute concurrently with a cursor stability transaction.

[0114] 4.2 States of the Reorganizer

[0115] At any time, the reorganizer 112 is operating, suspended, or finished; FIG. 5 shows a state transition diagram. For each user, the reorganizer 112 (while operating) allocates certain data to track movement of records across the user's position. In any state, users can consume data from the database and can consume data that the reorganizer 112 has allocated, and the reorganizer 112 can deallocate its data after users have consumed it. Thus, the amount of allocation increases with the reorganizer's 112 activity and decreases with users' activity.

[0116] The reorganizer 112 can switch between the operating state 500 and suspended state 502 for either of these reasons:

[0117] Reorganization suspends when a period of peak usage begins and resumes when the period ends. Another possibility is for the reorganizer 112 to pause between its steps.

[0118] To bound the allocation of main storage, reorganization suspends when its total allocation grows past a parameter-controlled size; it resumes when the allocation shrinks below another parameter-controlled size. The storage allocator (or, in an alternative implementation, the reorganizer 112) tracks this allocation and changes the reorganizer's 112 state between the operating state 500 and suspended state 502. During reorganizer 112 operation, users give a higher priority to consuming data from the database, to take advantage of locality of reference. During reorganizer 112 suspension, users give a higher priority to consuming the reorganizer's 112 data, to allow deallocation and thus availability of storage. The use of two parameters (for suspension and resumption) instead of one reduces the frequency of state-switching and thus gives users longer periods of database consumption, when they can take advantage of locality of reference.

[0119] 4.3 State Information for a User

[0120] The methods of the present invention refer to a user's current position, whose definition depends on the user's type of access:

[0121] During a table 108 space scan, the position is the RID being processed.

[0122] During an index 110 scan, if the current key satisfies the query's predicate, the position is the key and the offset within the index's RID list for that key. If the key does not satisfy the predicate, the position is the end of the list for the previous satisfying key, if any.

[0123] During phase 1 of a multiple index 110 access, the position for the index 110 being scanned (and the recent RID list) is as in an index 110 scan; the position for already-scanned indices 110 (and the composite RID list) is the end of those indices 110.

[0124] During phases 2 and 3 of a multiple index 110 access, the position for all those indices 110 (and both RID lists) is the end of those indices 110.

[0125] During phase 4 of a multiple index 110 access (when the user is processing the associated data), the position is the RID in the composite RID list.

[0126] For each user, this state information (most of which is never paged out) is maintained:

[0127] The name(s) of the table(s) 108 being accessed

[0128] The type of access

[0129] The user's current position

[0130] 4.4 Record Movement and Movement Lists

[0131] Four terms are used to describe the possible directions of the reorganizer's 112 movement of a record with respect to a user's position:

[0132] The record moves forward from a position behind the user (in the already-scanned area) to a position ahead of the user (in the not-yet-scanned area).

[0133] It moves backward from a position ahead of the user to a position behind the user.

[0134] It moves behind the user from one position behind the user to another position behind the user.

[0135] It moves ahead of the user from one position ahead of the user to another position ahead of the user.

[0136] When reorganization moves a record and thus changes its RID, the RID can move within the index 110 if the DBMS 106 sorts indices' 110 RID lists by RID. The same four terms are used for the directions of movement of a RID within an index 110.

[0137] Determination of the direction of movement depends on the user's type of access:

[0138] During a table 108 space scan, movement can be in any direction, depending on the user's position (RID).

[0139] During an index 110 scan, for a record whose key equals that of the user's position in the index 110, movement can be in any direction, depending on the user's position in the index 110. For a record with an earlier or later key, movement is behind the user or ahead of the user, respectively.

[0140] During phase 1 of a multiple index 110 access, for the index 110 being scanned (and the recent RID list), movement is as in an index 110 scan.

[0141] During phases 2 and 3 of a multiple index 110 access, for the index 110 being scanned (and the recent RID list), movement is behind the user.

[0142] During phases 1, 2, and 3 of a multiple index 110 access, for already-scanned indices 110 (and the composite RID list), movement is behind the user. For not-yet-scanned indices 110, movement is ahead of the user.

[0143] During phase 4 of a multiple index 110 access, movement can be in any direction, depending on the user's position (RID) in the composite RID list.

[0144] For each user, the reorganizer 112 maintains movement lists to track the reorganizer's 112 movement of records that the user might access. The type of access determines the types of movement lists (forward and backward ordered lists or differential lists). The lists' structures and operations are explained below; Section 5 discusses how to use the lists, thus clarifying the reasons for the structures and operations.

[0145] 4.5 Forward and Backward Ordered Lists

[0146] Forward and backward ordered lists are used for users performing a table 108 space scan, an index 110 scan, or phase 4 of a multiple index 110 access. FIG. 6 shows the structure of a forward ordered list (FOL) 600, which tracks records that the reorganizer 112 moves forward; the user has already processed them. Each user has at most one FOL 600, which is a doubly linked list, ordered by RID and indexed. Each entry contains the new RID of a moved record. The logical head points to the first entry that is not logically deleted; the physical head might point to a logically deleted entry. When a RID is in a FOL 600, it means that the FOL 600 contains an entry that has that RID and is not logically deleted. The system tries to arrange for the beginning part of the FOL 600 (which probably matches the user's current file page) to be in main storage; the rest of the FOL 600 might be paged out.

[0147] These are the operations on a FOL 600; they maintain the FOL's 600 order, heads, and index 602 when necessary:

[0148] Determine whether a RID has an entry in the FOL 600 (reorganizer 112 invokes, using index 602).

[0149] Change a RID in an existing entry (reorganizer 112 invokes).

[0150] Read the RID of the entry to which the logical head points (user invokes): This returns null if the head is null.

[0151] Logically and physically insert an entry (reorganizer 112 invokes).

[0152] Logically and physically delete an entry (reorganizer 112 invokes). This includes deallocation of storage.

[0153] Logically delete the first entry (user invokes): Advance the logical head.

[0154] Collect garbage (reorganizer 112 invokes):

[0155] Do until physical head=logical head;

[0156] Advance physical head;

[0157] Deallocate storage of the formerly physically first entry;

[0158]

[0159] End;

[0160] A backward ordered list (BOL) tracks records that the reorganizer 112 moves backward; the user has not yet processed them. Each user has at most one BOL. Its structure and operations are like those of a FOL 600, as illustrated in FIG. 6.

[0161] Without online reorganization, a user transaction scans a list of RIDs (in a table 108 space, an index 110, or a RID list) and processes each record whose RID it finds. With online reorganization, the transaction omits redundant processing of records whose RIDs appear in the FOL 600 (since it has already processed them). It also eventually scans the BOL and processes the records whose RIDs appear there (since its scan of data would never find those records).

[0162] FIG. 7 shows a state transition diagram for each record that a user will eventually process, including the possible states 700, 702, 704 of a record.

[0163] 4.6 Differential Lists

[0164] A differential list (CL 800) tracks records that the reorganizer 112 moves forward, backward, or behind the user. DLs 800 are used for users performing phase 1, 2, or 3 of a multiple index 110 access. A user transaction's RID list can become invalid if, for example, the reorganizer 112 moves a record from a slot whose RID is in the RID list (thus changing the record's RID). Applying a differential list to a RID list will correct such invalidity in the RID list.

[0165] For each RID list (recent or composite), a user has a corresponding DL 800, which is a doubly linked list that the reorganizer 112 maintains. FIG. 8 illustrates the structure of a DL 800, an entry of which contains:

[0166] A RID

[0167] A flag (shown in the figure as "O" or "N") to tell whether the RID is an old RID (the source of a movement) or a new RID (the target)

[0168] In some but not all cases, a link to a paired entry with the opposite flag, to represent these possibilities:

[0169] O--An old entry (without a paired new entry) contains the old RID of a record that the reorganizer 112 moves forward. See the first entry in the figure.

[0170] N--A new entry (without a paired old entry) contains the new RID of a record that the reorganizer 112 moves backward. See the third entry in the figure.

[0171] O-N--Two paired entries together contain the old and new RIDs of a record that the reorganizer 112 moves behind the user. See the second and fourth entries in the figure.

[0172] A DL 800 is ordered by RID, with old and new entries interleaved.

[0173] A user transaction corrects a RID list by making a local copy of the corresponding DL 800 and then inserting new entries and deleting old entries in the RID list. Since a user extracts RIDs from only the part of an index 110 that satisfies an index 110 predicate, forward or backward movement can occur only for a record that satisfies the predicate.

[0174] Therefore, an unpaired entry (either old or new) in a DL 800 represents a record that satisfies the predicate. However, movement behind the user can occur for a satisfying or an unsatisfying record. Therefore, a pair of entries represents a satisfying record only if the user's RID list includes the old RID. This fact requires the pairing; a user transaction should insert a paired new RID into a RID list only if the RID list already contains the paired old RID (which it should delete).

[0175] When the reorganizer 112 moves a record, the corresponding modification of a DL 800 can be complex. At any time, any RID (e.g., the RID for the slot of the source of the move or the RID for the target) is in any of several possible states (with respect to a DL 800).

[0176] Three formats are used to describe the behavior of a DL 800:

[0177] A diagram shows the transitions among a RID's states. Each state has a state number.

[0178] A chart shows the post-move states for each possible pair of (1) pre-move state of the move's source and (2) pre-move state of the move's target. The chart refers to the state numbers.

[0179] Each operation on a DL 800 has a detailed description, which includes justifying comments and refers (when appropriate) to the chart's rows and columns.

[0180] The next three sections present the three formats. Most of the transitions in the state transition diagram are derived from the chart. Almost all of the contents of the chart are derived from the operations. The descriptions of operations contain all the necessary information; the diagram and chart were added only for clarity.

[0181] 4.6.1 Transitions Among a RID's States

[0182] With respect to a DL 800, a RID can be in any of several states, depending on the entries (if any) for that RID in that DL 800, whether the RID is behind or ahead of the user's position in the corresponding index 110, and whether the slot for that RID contains a record.

[0183] FIG. 9 shows a diagram 900 for the states of a RID. The rectangle for a state shows the state number and the four components of a state:

[0184] Whether the RID is behind ("B") or ahead of ("A") the user's position

[0185] Whether the slot for the RID contains a record ("R") or is empty ("E")

[0186] Whether the DL 800 contains a paired old entry ("O-"), an unpaired old entry ("O"), or no old entry (no "O") for the RID

[0187] Whether the DL 800 contains a paired new entry ("N-"), an unpaired new entry ("N"), or no new entry (no "N") for the RID

[0188] The possible initial states (when the DBMS 106 creates the DL 800) are 1 and 2 (if the DL 800 is for the index 110 that the user is now scanning or will scan) or 5 and 6 (otherwise). States 3-11 are the possible final states (when the user performs the operation that copies the DL 800 into a local copy); these states (i.e., all except 1 and 2) are also the states in which the RID is behind the user's position. Since the user locks an index 110 entry while reading it, the reorganizer 112 cannot concurrently move that entry's record (and thus change its RID's status in a DL 800), so the diagram omits states in which the user s position is exactly at the RID. The diagram contains only the states that are reachable from the initial states; unreachable states are ignored, of course. For each transition between states, the diagram shows an arrow, with text that describes the circumstances under which the transition occurs.

[0189] 4.6.2 States of a Pair of RIDs (Source and Target) During Movement

[0190] The reorganizer's 112 movement of a record changes both of these:

[0191] The state of the RID of the source of the move (with respect to a DL 800)

[0192] The state of the RID of the target of the move (with respect to a DL 800).

[0193] FIG. 10 shows an associated chart 1000 for the states of a pair of RIDs (source and target). The columns in FIG. 10 represent the possible pre-move states of the target of the move (the states in which the RID's slot does not yet contain a record). The center of the right side of each column header shows whether the RID has an old (O) entry or no entry. The state number appears below this indication. Any paired entry (for another RID) appears above the indication; the arrow goes from O to N (the direction of movement). Thus, in state 7, the RID has an O entry, which is paired with a different RID's N entry. In state 1, the RID is ahead of the user, and in states 5, 7, and 8, the RID is behind the user. The rows represent the possible pre-move states of the source of the move (the states in which the RID's slot already contains a record). The center of the bottom of each row header shows whether the RID has an old (O) entry, a new (N) entry, both, or neither. If the RID has both, the N is above the O. The state number appears to the right of this indication. Any paired entries (for other RIDs) appear to the left of the indication. In state 10, the O and N at the left are not for the same RID. In state 2, the RID is ahead of the user, and in states 3, 4, 6, 9, 10, and 11, the RID is behind the user.

[0194] In each intersection cell, the right side shows the post-move state of the target, and the bottom shows the post-move state of the source. For example, in the intersection of row 6 and column 8, a record has moved from a source slot (whose RID's state, at the bottom, has changed from 6 to 7) into a target slot (whose RID's state, at the right, has changed from 8 to 11). Some intersections in column 1 contain two cells (two pairs of post-move states), since the states depend on whether the RID whose O is paired with the N in the source also had an unpaired N. Similarly, some intersections in column 7 contain two cells, since the states depend on whether a record is returning to its original RID.

[0195] The chart does not explicitly show states of RIDs whose entries are paired with the source or target of the move. Also, the chart does not model the user's crossing of a RID during a scan. Therefore, the chart omits some of the state transitions of FIG. 9. For almost every transition of FIG. 9, several intersections of the chart show the transition.

[0196] In examining the chart, it is easiest to begin with the four simplest cases, in which the RID does not yet appear in the DL 800 before the movement:

[0197] Row state 2, column state 1: the only case of movement ahead of the user,

[0198] Row state 2, column state 5: one case of movement backward,

[0199] Row state 6, column state 1: one case of movement forward,

[0200] Row state 6, column state 5: one case of movement behind the user.

[0201]

[0202] 4.6.3 Operations on a Differential List

[0203] The operations on a DL 800 modify the DL's 800 order and index 802 when necessary. In the following description of FIG. 10, the operations that reflect a record's movement refer (in parentheses) to state numbers in the chart's 1000 rows and columns and include justifying comments. There is no operation to change the DL 800 to reflect movement ahead of the user (row 2, column 1).

[0204] (1) Change the DL 800 to reflect forward movement (reorganizer 112 invokes; parameter is source RID):

[0205] If the DL 800 has no N entry for the source RID (row 6; column 1), insert an unpaired O entry for the source RID. Reason: This record did not move previously.

[0206] If the DL 800 has an unpaired N entry for the source RID (row 4 or 9; column 1), delete it. Reason: Movement backward and then forward resembles no movement.

[0207] If the DL 800 has a paired N entry for the source RID (row 3, 10, or 11; column 1), delete it, thus unpairing its paired O entry. Reason: Movement behind the user and then forward resembles movement forward without prior movement behind.

[0208] Also, if the DL 800 contains an unpaired N entry for the same RID as the recently unpaired O entry (left part of column 1), delete both the N and O entries for that RID. Reason: The user's RID list will already correctly include that RID, since that RID's original record and current record both satisfy the predicate. This additional deletion is omitted if the DL 800 contains a paired N entry for the same RID as the recently unpaired O entry (right part of column 1), since the current record at that RID might not satisfy the predicate.

[0209] (2) Change the DL 800 to reflect backward movement (reorganizer 112 invokes; parameter is target RID):

[0210] If the DL 800 has a paired O entry for the target RID (row 2; column 7) or no O entry for the target RID (row 2; column 5), insert an unpaired N entry for the target RID. Reason: This record did not move previously.

[0211] If the DL 800 has an unpaired O entry for the target RID (row 2; column 8), delete it. Reason: Movement forward and then backward (for the same record or a different satisfying record) resembles no movement. This cannot be done for a paired O entry (column 7), since the original record at that RID might not satisfy the predicate, so the user's RID list might not include that RID.

[0212] (3) Change the DL 800 to reflect movement behind the user (reorganizer 112 invokes; parameters are source and target RIDs):

[0213] If the DL 800 has no N entry for the source RID (row 6; column 5, 7, or 8), insert a pair of an O entry for the source RID and an N entry for the target RID. Reason: This record did not move previously.

[0214] If the DL 800 has an unpaired N entry for the source RID (row 4 or 9; column 5, 7, or 8), delete it. Also:

[0215] If the DL 800 has a paired O entry for the target RID (column 7) or no O entry for the target RID (column 5), insert an unpaired N entry for the target RID. Reason: Movement backward and then behind into a RID that was already the source of a movement behind (i.e., a RID with a paired O entry) resembles the latter movement behind followed by the movement backward. Movement backward and then behind into a RID that had no movement (i.e., a RID with no O entry) resembles one movement backward into that RID.

[0216] If the DL 800 has an unpaired O entry for the target RID (column 8), delete the O entry for the target RID. Reason: Movement backward and then behind into a RID that was already the source of a movement forward (i.e., a RID with an unpaired O entry) resembles movement forward and then backward, which resembles no movement.

[0217] If the DL 800 has a paired N entry for the source RID (row 3, 10, or 11; column 5, 7, or 8), delete it. Also:

[0218] If the deleted N entry's formerly paired O entry is for the target RID (left part of column 7), delete that O entry. Reason: Returning a record to its original RID resembles no movement.

[0219] If the formerly paired O entry is not for the target RID (column 5 or 8 or right part of column 7), insert an N entry for the target RID, paired with the formerly paired O entry. Reason: Two movements behind the user constitute one combined movement behind the user.

[0220] (4) Copy the DL 800 into a local copy, and logically delete the original DL 800 (user invokes).

[0221] (5) Collect garbage (reorganizer 112 invokes on logically deleted DL 800): Deallocate storage.

[0222] Following is the reason for maintaining two DLs 800 (for the recent and composite RID lists): The placement of entries in a DL 800 depends upon the movement of records across a user's current position. A record might move across a user's current position in the index 110 that the user is scanning now, but it cannot move across the user's current position in an already-scanned index 110, since the user's "current" position is at the end of that index 110.

[0223] 5.0 The Methods

[0224] Following is a description of the reorganization methods using flowcharts to illustrate the methods.

[0225] 5.1 The Method for the Reorganizer

[0226] In effect, the reorganizer 112 scans the clustering index 110, and the reorganizer's 112 current position is defined as the current key and the offset within the index's RID list for that key. Section 4.3 contains a similar definition of the position of a user performing an index 110 scan. The description of the reorganizer's 112 method will apply comparison operators (e.g., ">") to positions; such a comparison involves comparing the keys, and, if the keys are identical, comparing the offsets. Incrementing the reorganizer's 112 position is defined as incrementing the offset, and, if there is no next RID, moving to the next key (if any).

[0227] FIG. 11 is a flowchart of the main loop used by the reorganizer 112.

[0228] Block 1100 is a decision block that represents the main loop.

[0229] Block 1102 is a decision block that determines whether there is any garbage collection ready to be performed. If so, control transfers to Block 1104, which performs the garbage collection; otherwise, control transfers to Block 1106. Note that, if the reorganizer 112 is operating, a possible alternative (instead of checking for garbage collection in every iteration of the main loop) is checking for garbage collection once in every few iterations of the main loop.

[0230] Block 1106 is a decision block that determines whether the state of the reorganizer 112 is "suspended". If so, control transfers to Block 1108, which causes the loop to "sleep" for a parameter-controlled period before returning to Block 1100; otherwise, control transfers to Block 1110.

[0231] Block 1110 is a decision block that determines whether the state of the reorganizer 112 is "finished". If so, control transfers to Block 1112; otherwise, control transfers to Block 1114.

[0232] Block 1112 is a decision block that determines whether any movement lists exist. If so, control transfers to Block 1108; otherwise, the loop is exited.

[0233] Block 1114 is a decision block that determines whether the state of the reorganizer 112 is "operating". If not, control transfers to Block 1100; otherwise, control transfers to Block 1116.

[0234] Block 1116 represents a shared lock being applied to the end of the clustering index for the table being reorganized.

[0235] Block 1118 represents a variable L being assigned the position in the clustering index of the logically last record in the index.

[0236] Block 1120 represents the unlocking of the end of the clustering index for the table being reorganized.

[0237] Block 1122 is a decision block that determines whether R>L, where R is the position of the reorganizer 112; initially, this is at the beginning of the clustering index. If so, control transfers to Block 1124; otherwise, control transfers to Block 1126.

[0238] Block 1124 represents the state of the reorganizer 112 being set to "Finished." Thereafter, control transfers to Block 1100.

[0239] Block 1126 represents D_P (Desired Page for R) being calculated, which is based on the size of R, page size, size of earlier records, desired frequency of free pages, desired percent of free space per page, etc.

[0240] Block 1128 is a decision block that determines whether D_P is the current page for R. If so, control transfers to Block 1130; otherwise, control transfers to Block 1132.

[0241] Block 1130 represents R being incremented. Thereafter, control transfers to Block 1100.

[0242] Block 1132 represents the procedure CLUSTER_RECORD being called with the RID of R and D_P as its parameters. Thereafter, control transfers to Block 1130.

[0243] 5.1.1 Performing the Restoration of Clustering

[0244] In the following procedure to cluster a record on a page, sometimes the system must first make room for the record by moving other records off the page. Since it is assumed that an overflow record does not point back to its pointer record, the system could not efficiently find the pointer for an overflow record (to update the pointer if the system moved its overflow). Therefore, the system does not move overflow records off the page when it needs to make room. If the DBMS 106 uses back pointers, this restriction is omitted.

[0245] FIG. 12 is a flowchart of the CLUSTER_RECORD procedure, which has as its parameters, R and D_P. This procedure clusters record R, preferably on desired page D_P.

[0246] The loop in this procedure initially tries the desired page D_P. For each tried page T_P, it tests for enough space, and if there is not enough space, the next iteration tries page T_P+1. Instead of iterating to the next page, the method could attempt to find a "nearby" page. For example, record R could be put on a page that contains a record with the same key, or if the record has a unique key, a record with a neighboring key. Another alternative is to put all such records at the end of the table space.

[0247] Block 1200 is a decision block that represents a loop comprising T_P=D_P to the last page of the table 108 space. This loop finds the target page, T_P, for clustering. Upon completion of the loop, control transfers to Block 1228; otherwise, control transfers to Block 1202.

[0248] Block 1202 represents an exclusive lock being applied to T_P.

[0249] Block 1204 is a decision block that determines whether the largest available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228; otherwise, control transfers to Block 1206.

[0250] Block 1206 represents T_P being compacted to make free space contiguous.

[0251] Block 1208 is a decision block that determines whether the available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228; otherwise, control transfers to Block 1210.

[0252] Block 1210 is a decision block that determines whether the available space on T_P plus the total space of non-overflow records on T_P that logically follow R is greater than or equal to the size of parameter R. If not, control transfers to Block 1212; otherwise, control transfers to Block 1214.

[0253] Block 1212 represents T_P being unlocked. This Block is reached when R cannot be put on T_P, and thus an iteration must occur.

[0254] Block 1214 is a decision block that represents a loop that iterates while available space on T_P is less than the size of parameter R. Upon completion of this loop, control transfers to Block 1224. This Block is reached when removing records from this page will succeed in providing enough space for R.

[0255] Block 1216 represents finding a non-overflow record A_R to be removed on T_P.

[0256] Block 1218 represents finding and applying an exclusive lock on another page A_P with space for record A_R.

[0257] Block 1220 represents the procedure MOVE_RECORD being called with the parameters A_R and A_P.

[0258] Block 1222 represents A_P being unlocked. Thereafter, control transfers to Block 1214.

[0259] Block 1224 is a decision block that determines whether the largest available space on T_P is less than the size of R. If so, control transfers to Block 1226; otherwise, control transfers to Block 1228.

[0260] Block 1226 represents T_P being compacted. Thereafter, control transfers to Block 1228.

[0261] Block 1228 represents an exclusive lock being applied to the page of R.

[0262] Block 1230 represents the procedure MOVE_RECORD being called with the parameters R and T_P.

[0263] Block 1232 represents T_P and the page of R being unlocked.

[0264] Finally, the procedure exits at 1234.

[0265] FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure, which has as its parameters, R and P. This procedure moves record R to page P.

[0266] Block 1300 represents the variable NEW_RID being assigned the first available offset in page P's ID map.

[0267] Block 1302 is a decision block that represents a loop for each user. Upon completion of the loop, control transfers to FIG. 13B via "A".

[0268] Block 1304 is a decision block that determines whether the user's type of access comprises a table space scan. If so, control transfers to Block 1306; otherwise, control transfers to Block 1310.

[0269] Block 1306 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1308; otherwise, control transfers to Block 1302.

[0270] Block 1308 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.

[0271] Block 1310 is a decision block that determines whether the user's type of access comprises an index scan. If so, control transfers to Block 1312; otherwise, control transfers to Block 1316.

[0272] Block 1312 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1314; otherwise, control transfers to Block 1302.

[0273] Block 1314 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.

[0274] Block 1316 is a decision block that determines whether the user's type of access comprises Phase 1, 2, or 3 of a multiple index access. If so, control transfers to Block 1318; otherwise, control transfers to Block 1322.

[0275] Block 1318 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1320; otherwise, control transfers to Block 1302.

[0276] Block 1320 represents the procedure MODIFY_DL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.

[0277] Block 1322 is a decision block that determines whether the user's type of access comprises Phase 4 of a multiple index access. If so, control transfers to Block 1324; otherwise, control transfers to Block 1302.

[0278] Block 1324 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1326; otherwise, control transfers to Block 1302.

[0279] Block 1326 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.

[0280] Upon completion of the loop at Block 1302, control transfers to Block 1328 via Block 1328 represents finding all the index entries that reference R.

[0281] Block 1330 represents finding all the referential integrity structures that reference R.

[0282] Block 1332 represents moving R to P, modifying the index entries, and (if necessary) modifying the referential integrity structures. If R is a pointer record, the data is moved from the overflow record to a new, regular data record. If deletion and insertion ordinarily trigger certain actions, e.g., deletion via referential integrity, those actions should not occur for record movement.

[0283] 5.1.2 Maintaining the Movement Lists

[0284] This section illustrates how to maintain the movement lists. The above procedure sometimes omits the maintenance of movement lists:

[0285] For a table 108 space scan or index 110 scan, repeatable read prevents movement of a record across the user's position, so the system omits movement lists.

[0286] For a multiple index 110 access, repeatable read prevents movement of a record whose RID the user has already read, so the system omits movement lists.

[0287] Also, the procedure is presented under the assumption that the DBMS 106 sorts index 110 leaf entries by RID. If the DBMS 106 does not sort them, then the system need not maintain a movement list for an index 110 scan, since it never moves a record across the user's position.

[0288] FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the FOL 600 and BOL 600 for one user. If the user is doing a table space scan, the reorganizer 112 performs FOL/BOL processing for data records; for an index scan, the reorganizer 112 performs FOL/BOL processing for index entries; and for Phase 4 of a multiple index access, the reorganizer 112 performs FOL/BOL processing for RID list entries.

[0289] Block 1400 represents the FOL 600 and BOL 600 being latched.

[0290] Block 1402 is a decision block that determines whether the record movement is forward of the user's position. If so, control transfers to Block 1404; otherwise, control transfers to Block 1410.

[0291] Block 1404 is a decision block that determines whether the old RID is in the BOL 600. If so, control transfers to Block 1406, which deletes the old RID from the BOL 600; otherwise, control transfers to Block 1408, which inserts the new RID into the FOL. Thereafter, control transfers to Block 1430.

[0292] Block 1410 is a decision block that determines whether the record movement is backward from the user's position. If so, control transfers to Block 1412; otherwise, control transfers to Block 1418.

[0293] Block 1412 is a decision block that determines whether the old RID is in the FOL 600. If so, control transfers to Block 1414, which deletes the old RID from the FOL 600; otherwise, control transfers to Block 1416, which inserts the new RID into the BOL 600. Thereafter, control transfers to Block 1430.

[0294] Block 1418 is a decision block that determines whether the record movement stays ahead of the user's position. If so, control transfers to Block 1420; otherwise, control transfers to Block 1424.

[0295] Block 1420 is a decision block that determines whether the old RID is in the FOL 600. If so, control transfers to Block 1422, which changes the old RID to the new RID in the FOL 600. Thereafter, control transfers to Block 1430.

[0296] Block 1424 is a decision block that determines whether the record movement stays behind the user's position. If so, control transfers to Block 1426; otherwise, control transfers to Block 1430.

[0297] Block 1426 is a decision block that determines whether the old RID is in the BOL 600. If so, control transfers to Block 1428, which changes the old RID to the new RID in the BOL 600. Thereafter, control transfers to Block 1430.

[0298] Block 1430 unlatches the FOL 600 and BOL 600.

[0299] FIG. 15 is a flowchart of the MODIFY_DL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the DL 800 for one user. If the user is doing Phase 1 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; any movement is possible. If the user is doing Phase 2 or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; all movement is behind the user. For Phases 1, 2, or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for already-scanned indices and the composite RID list; all movement is behind the user.

[0300] Block 1500 represents the recent DL 800 being latched.

[0301] Block 1502 is a decision block that determines whether the record movement is forward from the user's position in the index for the recent RID list. If so, control transfers to Block 1504; otherwise, control transfers to Block 1506.

[0302] Block 1504 represents the recent DL 800 being changed to reflect the forward movement.

[0303] Block 1506 is a decision block that determines whether the record movement is backward from the user's position in the index for the recent RID list. If so, control transfers to Block 1508; otherwise, control transfers to Block 1510.

[0304] Block 1508 represents the recent DL 800 being changed to reflect the backward movement.

[0305] Block 1510 is a decision block that determines whether the record stays ahead of the user's position in the index for the recent RID list. If so, control transfers to Block 1516; otherwise, control transfers to Block 1512.

[0306] Block 1512 is a decision block that determines whether the record stays behind the user's position in the index for the recent RID list. If so, control transfers to Block 1514; otherwise, control transfers to Block 1516.

[0307] Block 1514 represents the recent DL 800 being changed to reflect the movement behind.

[0308] Block 1516 represents the recent DL 800 being unlatched.

[0309] Block 1518 represents the composite DL 800 being latched.

[0310] Block 1520 represents the composite DL 800 being changed to reflect movement behind. This composite DL 800 was produced by earlier merges. The record always stays behind the user, who is already at the end of all indices that the RID list represents.

[0311] Block 1522 represents the composite DL 800 being unlatched.

[0312] 5.2 The Method for a User Performing a Table Space Scan

[0313] FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan.

[0314] Block 1600 represents the FOL 600 and BOL 600 structures being created.

[0315] Block 1602 is a decision block that represents a loop for all records.

[0316] Block 1604 represents the logical head of the BOL 600 being latched.

[0317] Block 1606 represents the variable H being assigned the RID of the logical head of the BOL 600.

[0318] Block 1608 represents the logical head of the BOL 600 being unlatched.

[0319] Block 1610 represents the state of the reorganizer 112 being latched.

[0320] Block 1612 represents the variable S being assigned the state of the reorganizer 112.

[0321] Block 1614 represents the state of the reorganizer 112 being unlatched.

[0322] Block 1616 is a decision block that determines whether POS (the position in the scan) has reached the end of the scan. If so, control transfers to Block 1618; otherwise, control transfers to Block 1626.

[0323] Block 1618 is a decision block that determines whether H is null. If so, control transfers to Block 1620, which logically deletes the FOL 600 and BOL 600, and then to Block 1622 where the procedure exits; otherwise, control transfers to Block 1624, which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602.

[0324] Block 1626 is a decision block that determines whether H is null. If so, control transfers to Block 1628, which calls the procedure SCAN, and then returns to Block 1602; otherwise, control transfers to Block 1630.

[0325] Block 1630 is a decision block that determines whether S is "Operating". If so, control transfers to Block 1628, which calls the procedure SCAN, and then returns to Block 1602; otherwise, control transfers to Block 1632, which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602.

[0326] FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure.

[0327] Block 1700 represents the shared lock being applied to the record that the logical head entry indicates.

[0328] Block 1702 represents the first entry in the BOL 600 being logically deleted.

[0329] Block 1704 represents that record being processed, e.g., evaluating the WHERE clause and returning the row if appropriate.

[0330] Block 1706 represents all locks being unlocked, i.e., if cursor stability is in effect.

[0331] Block 1708 represents an exit from the procedure.

[0332] FIG. 18 is a flowchart that describes the logic of the SCAN procedure.

[0333] Block 1800 represents the POS (position in the scan) being incremented.

[0334] Block 1802 represents a shared lock being applied to the current record.

[0335] Block 1804 is a decision block that determines whether POS =RID of the FOL 600 logical head. If so, then control transfers to Block 1806; otherwise, control transfers to Block 1808.

[0336] Block 1806 logically deletes the first entry in the FOL 600, e.g., because it moved forward. Thereafter, control transfers to Block 1810.

[0337] Block 1808 represents that record being processed, e.g., evaluating the WHERE clause and returning the tow if appropriate.

[0338] Block 1810 represents all locks being unlocked, i.e., if cursor stability is in effect.

[0339] Block 1812 represents an exit from the procedure.

[0340] 5.3 The Method for a User Performing an Index Scan

[0341] This resembles a table 108 space scan. The remainder (non-indexed part) of the WHERE clause, not the whole WHERE clause, is evaluated. The user has a position in a range (key value) of the index 110. At the end of scanning each range, the FOL 600 is empty, since no record can move beyond the range. Therefore, a FOL 600 can be sorted by RID; the system does not need a major sort by key value and a minor sort by RID. The BOL 600 might not be empty.

[0342] 5.4 The Method for a User Performing a Multiple Index Access

[0343] A user performing a multiple index 110 access (on n indices 110) does this:

[0344] Create DL 800 structures (indices 802, etc.).

[0345] For J=1 to n, perform phases 1, 2, and 3:

[0346] 1. Initialize the recent RID list to empty. Scan the relevant subset of the index 110 that enables evaluation of the jth predicate. For each RID whose key value satisfies the jth predicate, append the RID to the recent RID list.

[0347] 2. Sort the recent RID list (ordered by RID) if it comes from more than one key value. Correct both RID lists (as shown herein after) according to the DLs 800.

[0348] 3. If J=1, make the recent RID list the composite RID list. If j >1, merge the recent RID list into the composite RID list; i.e., calculate the intersection (for a conjunction) or union (for a disjunction). Correct the composite RID list according to the composite DL 800; this includes logically deleting the DLs 800 and creating the FOL 600 and BOL 600 structures for phase 4. The system no longer cares about correcting the recent RID list, since the system already merged it into the composite RID list, and any changes in the recent DL 800 will now be identical to changes in the composite DL 800.

[0349] Perform phase 4 once:

[0350] 4. For each RID in the composite RID list, read the row, evaluate the remainder (if any) of the WHERE clause (for repeatable read) or the entire WHERE clause (for cursor stability), and return the row if it satisfies the clause. Use the FOL 600 and the BOL 600, as in a table 108 space scan; the user has a position in the composite RID list.

[0351] When finished, logically delete FOL 600 and BOL 600 structures (indices 602, etc.).

[0352] Following is a description of how the user transaction corrects its RID lists according to the DLs 800. It does this at the end of each phase 2 (for both RID lists) and at the end of each phase 3 (for just the composite RID list):

[0353] FIGS. 19A-D is a flowchart that describes the logic of the CORRECT_RID_LISTS procedure, which accepts a parameter K, wherein K=1 signifies correction of just a composite RID list and K=2 signifies correction of both composite and recent RID lists.

[0354] Block 1900 represents the two DLs 800 being latched.

[0355] Block 1902 is a decision block that represents a loop which tests for emptiness, i.e., it iterates while either DL 800 contains anything. Upon completion of the loop, control transfers to Block 1936 via "C".

[0356] Block 1904 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite DL 800, while J=2 means recent DL 800.

[0357] Block 1906 represents the DL 800 being copied into a local copy and the DL 800 being logically deleted.

[0358] Block 1908 represents the creation of an empty new DL 800.

[0359] Block 1910 represents the two DLs 800 being unlatched.

[0360] Block 1912 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite RID list and J=2 means recent RID list. This loop makes a first pass at correcting RID lists (to nullify non-qualifying pairs). Upon completion of the loop, control transfers to Block 1922 via "B".

[0361] Block 1914 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J.

[0362] Block 1916 is a decision block that determines if an entry in the copy of the DL 800 is "old". If so, control transfers to Block 1918; otherwise, control transfers to Block 1914.

[0363] Block 1918 is a decision block that determines if an entry's RID does not exist in the RID list. If so, control transfers to Block 1920; otherwise, control transfers to Block 1914.

[0364] Block 1920 represents nullifying the entry and its paired entry (if any). Thereafter, control transfers to Block 1914.

[0365] Block 1922 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite RID list and J=2 means recent RID list. This loop makes a second pass at correcting RID lists. Upon completion of the loop, control transfers to Block 1934.

[0366] Block 1924 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J. Upon completion of the loop, control transfers to Block 1922.

[0367] Block 1926 is a decision block that determines if the local copy of the DL 800 contains an "old" entry for an RID but no "new" entry for that RID. If so, control transfers to Block 1928; otherwise, control transfers to Block 1930.

[0368] Block 1928 represents deleting the RID from the RID list. Thereafter, control transfers to Block 1924.

[0369] Block 1930 is a decision block that determines if the local copy of the DL 800 contains a "new" entry for an RID but no "old" entry for that RID. If so, control transfers to Block 1932; otherwise, control transfers to Block 1924.

[0370] Block 1932 represents inserting the RID into the RID list (maintaining sorted order). Thereafter, control transfers to Block 1924.

[0371] Block 1934 represents the two DLs 800 being latched. Control then transfers to Block 1902 via "D".

[0372] Block 1936 indicates that the user transaction is moving to the next phase.

[0373] Block 1938 is a decision block that determines if the user transaction is moving from phase 3 to phase 4. If so, control transfers to Block 1940; otherwise, control transfers to Block 1944.

[0374] Block 1940 represents logically deleting DL 800 structures.

[0375] Block 1942 represents creating the FOL 600 and BOL 600 structures.

[0376] Block 1944 represents the DLs 800 being unlatched. Thereafter, the procedure exists at 1946.

[0377] In the above method, a possible optimization, if the outer loop is executed more than a parameter-controlled number of times, is to suspend the reorganizer 112.

[0378] Following is the reason for using two passes rather than one: With one pass, suppose that a new entry and an old entry are paired, and the new entry has a lower RID. The user transaction will find the new entry first. The transaction should insert the new entry's RID in its RID list only if the old entry's RID matches a RID that is already in the RID list. Lack of a match means that the new entry's RID does not represent a qualifying record. To find the old entry, the user transaction must do a searched access (rather than a sequential access) into the local copy of the DL 800, but there is no index, so there must be a search via another sequential access, which is an extra expense.

[0379] 6.0 Extensions Based on the Methods

[0380] This section describes possible extensions that could be added-to the methods of the present invention.

[0381] 6.1 Logging and Recovery

[0382] Logging and recovery could be added (for both the reorganizer 112 and the users) during online reorganization. For logging, a new type of log record can simply be added, which means that a record has been moved from the old RID to the new.

[0383] Logging could also be included for movement lists.

[0384] Movement lists could also be stored in artificial data pages; i.e., not just in main storage. This implies that a transaction that is ordinarily considered read-only (because it just does a SELECT) is not really read-only, since it writes into the movement lists.

[0385] 6.2 Multiple-Table Queries

[0386] So far, only transactions that operate on one table 108 have been considered. An alternative would consider transactions that operate on several tables 108.

[0387] 6.3 Transactions That Write

[0388] So far, only "reader" transactions have been considered. This can be generalized in two ways:

[0389] Even if a transaction is a reader, consider writers' effect on this transaction's movement lists.

[0390] Consider movement lists and synchronization for writers. In general, the same methods that are used for readers can be used, recognizing that locks are held until commit, for serializability.

[0391] 6.4 Index Reorganization

[0392] Online reorganization of an index 110 could be added, not just online reorganization of data with updating of the RIDs in an index 110.

[0393] 6.5 Possible Reorganizations

[0394] To give priority to users, the reorganizer 112 should abort instead of waiting when the reorganizer's 112 lock request conflicts with a user's lock.

[0395] In addition, a warning bit could be maintained to indicate that the FOL 600 contains an entry for the current page.

[0396] 7.0 Summary

[0397] Any database management system can need some type of reorganization. To avoid taking a very large or highly available database offline for reorganization, a solution is to reorganize online. Methods have been described for performing a certain type of reorganization online. The reorganization restores clustering and removes overflows.

[0398] The methods include synchronization, and they use movement lists to track the reorganization's movement of records across a user's position within a scan of data. The reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct inaccuracy in user transactions and thus allow cursor stability during online reorganization in place. The novelty is in allowing high throughput concurrent usage during in-place online restoration of clustering. As the amount of information and dependence on computers both grow, the number of very large or highly available databases will grow. Therefore, the importance of online reorganization will grow.

[0399] 8.0 References

[0400] The following references are incorporated by reference herein:

[0401] 1. Amer. Nad. Standards Institute, "Database Language SQL," X3.135-1992, New York, 1992.

[0402] 2. M. M. Astrahan et al., "System R: Relational Approach to Database Management," ACM Trans. Database Syst., Vol. 1, No. 2, June 1976, pp. 97-137.

[0403] 3. C. J. Date and C. J. White, A Guide to DB2, 4th edition, Addison-Wesley, Reading, Mass., 1993.

[0404] 4. D. J. Haderle and R. D. Jackson, "IBM Database 2 Overview," IBM Syst. J., Vol. 23, No. 2, 1984, pp. 112-125.

[0405] 5. IBM Corp., "DB2 for OS/390 Version 5 Utility Guide and Reference," SC26-8967-00, June 1997.

[0406] 6. C. Mohan, "A Survey of DBMS Research Issues in Supporting Very Large Tables," in D. Lomet, Ed., Foundations of Data Organization and Methods (Proc. 4th Intl. Conf. Foundations of Data Organization and Methods, October 1993), Lecture Notes in Computer Science 730, Springer-Verlag, New York, pp. 279-300.

[0407] 7. E. Omiecinski, L. Lee, and P. Scheuermann, "Concurrent File Reorganization for Record Clustering: A Performance Study," Proc. 8th Intl. Conf Data Engineering, IEEE-CS, Feb. 1992, pp. 265-272. For more details, see their later work [8].

[0408] 8. E. Omiecinski, L. Lee, and P. Scheuermann, "Performance Analysis of a Concurrent File Reorganization Method for Record Clustering," IEEE Trans. Knowledge and Data Engin., Vol. 6, No. 2, Apt. 1994, pp. 248-257.

[0409] 9. B. Salzberg and A. Dimock, "Principles of Transaction-Based On-Line Reorganization," Proc. 18th Intl. Conf. Very Large Data Bases, Morgan Kaufmann Publishers, San Mateo, CA, Aug. 1992, pp. 511-520.

[0410] 10. G. H. Sockut and T. A. Beavin, "Interaction Between Application of a Log and Maintenance of a Table that Maps Record Identifiers During Online Reorganization of a Database," U.S. Pat. No. 5,721,915, February 1998.

[0411] 11. G. H. Sockut, T. A. Beavin, and C.-C. Chang, "A Method for On-line Reorganization of a Database," IBM Syst. J., Vol. 36, No. 3, 1997, pp. 411-436; erratum in Vol. 37, No. 1, 1998, p. 152.

[0412] 12. G. H. Sockut and R. P. Goldberg, "Database Reorganization--Principles and Practice," Computing Surveys, ACM, Vol. 11, No. 4, Dec. 1979, pp. 371-395.

[0413] 13. G. H. Sockut and B. R. Iyer, "Reorganizing Databases Concurrently with Usage: A Survey," Tech. Report 03.488, IBM Santa Teresa Lab., San Jose, Calif., June 1993.

[0414] 14. G. Wiederhold, Database Design, 2nd edition, McGraw-Hill, New York, 1983.

Conclusion

[0415] This concludes the description of the preferred embodiment of the invention. The following describes some alternative embodiments for accomplishing the present invention. For example, any type of computer, such as a mainframe, minicomputer, or personal computer, could be used with the present invention. In addition, any software program providing database management functions could benefit from the present invention.

[0416] In summary, the present invention discloses a method, system, and article of manufacture for in-place reorganization of a database that allows high-throughput concurrent usage by users of the database. The reorganization's movement of records are tracked across a user transaction's position within a scan of the database. The behavior of the user transaction is corrected to account for the movement of the records.

[0417] The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

* * * * *