U.S. patent application number 09/207927 was filed with the patent office on 1998-12-09 and published on 2002-07-11 for method and apparatus for detecting and recovering from data corruption of database via read logging.
Invention is credited to BOHANNON, PHILIP L., RASTOGI, RAJEEV, SESHADRI, SRINIVASAN, SILBERSCHATZ, ABRAHAM, SUDARSHAN, SUNDARARAJARAO.
Application Number: 20020091718 (09/207927)
Family ID: 27378787
Filed Date: 1998-12-09
Publication Date: 2002-07-11
United States Patent Application 20020091718
Kind Code: A1
BOHANNON, PHILIP L.; et al.
July 11, 2002
METHOD AND APPARATUS FOR DETECTING AND RECOVERING FROM DATA
CORRUPTION OF DATABASE VIA READ LOGGING
Abstract
A method of detecting and recovering from data corruption of a
database is characterized by the step of logging information about
reads of a database in memory to detect errors in data of the
database, wherein said errors in data of said database arise from
one of bad writes of data to the database, erroneous input of data
to the database by users, and logical errors in code of a
transaction. The read logging method may be implemented in a
plurality of database recovery models including a cache-recovery
model, a prior-state model, a redo-transaction model and a
delete-transaction model. In the delete-transaction model, it is
assumed that logical information is not available to allow a redo of
transactions after a possible error; the effects of transactions
that read corrupted data are deleted from history, and any data
written by a transaction that has read corrupted data is treated as
corrupted.
Inventors: BOHANNON, PHILIP L. (MORRIS, NJ); RASTOGI, RAJEEV (UNION, NJ); SESHADRI, SRINIVASAN (BASKING RIDGE, NJ); SILBERSCHATZ, ABRAHAM (UNION, NJ); SUDARSHAN, SUNDARARAJARAO (POWAI, BOMBAY, IN)
Correspondence Address: THOMAS H. JACKSON, BANNER AND WITCOFF LTD, 11TH FLOOR, 1001 G STREET NW, WASHINGTON, DC 20001-4597
Family ID: 27378787
Appl. No.: 09/207927
Filed: December 9, 1998
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
60099265 | Sep 4, 1998 |
60099271 | Sep 4, 1998 |
Current U.S. Class: 1/1; 707/999.202
Current CPC Class: G06F 11/1474 20130101; Y10S 707/99932 20130101; G06F 11/1471 20130101; G06F 2201/80 20130101; Y10S 707/99953 20130101
Class at Publication: 707/202
International Class: G06F 012/00
Claims
What we claim is:
1. A method of detecting and recovering from data corruption of a
database comprising the step of logging information about reads of
a database to detect and recover from physical corruption of the
data in the database, said physical corruption arising from bad
writes of data to the database or corruption arising indirectly
therefrom.
2. A method of detecting and recovering from data corruption as
recited in claim 1 further comprising the step of: storing data of
the database in main volatile memory of a data processor.
3. A method of detecting and recovering from data corruption as
recited in claim 1 further comprising the steps of maintaining a
log of writes of the database and combining said log of reads and
said log of writes into a combined log of reads and writes.
4. A method of detecting and recovering from data corruption as
recited in claim 1 further comprising the step of logging one of an
identity of information read from and an identity of information
written to the database.
5. A method of detecting and recovering from data corruption as
recited in claim 1 further comprising the step of: storing a
codeword in a record of said read-logging information.
6. A method of detecting and recovering from data corruption as
recited in claim 3 further comprising the step of storing a
codeword corresponding to a value of a data item in a record of
said combined log of reads and writes.
7. A method of detecting and recovering from data corruption as
recited in claim 1 further comprising the step of: protecting data
of the database with codewords, one codeword for each region of the
database.
8. A method as recited in claim 1 wherein recovery in a delete
transaction model comprises deleting effects of transactions from a
database image.
9. A method of detecting and recovering from data corruption as
recited in claim 3, said recovery from data corruption comprising
first and second phases, a first redo phase followed by an undo
phase.
10. A method as recited in claim 9 wherein said redo phase
comprises a forward scan of a log of read and write operations.
11. A method as recited in claim 10 wherein said forward scan adds
an identity of data to a corrupt data table whenever said data is
written by a corrupt transaction.
12. A method as recited in claim 10 wherein said forward scan adds
an identity of data to a corrupt data table whenever said data has
failed an audit.
13. A method as recited in claim 10 further comprising the step of
maintaining a corrupt transaction table.
14. A method as recited in claim 13 wherein said forward scan adds
transactions to said corrupt transaction table that are known
corrupt transactions.
15. A method as recited in claim 13 wherein said forward scan adds
transactions to said corrupt transaction table whenever data is
read that is identified in the corrupt data table.
16. A method as recited in claim 13 further comprising the step of
storing a codeword corresponding to a value of a data item in a
record of said combined log of reads and writes wherein said
forward scan adds transactions to said corrupt transaction table
whenever said codeword does not match the current value of said
data item read from a database.
17. A method as recited in claim 15 wherein in a redo phase,
actions such as writes are applied to a database image unless a
transaction is listed in the corrupt transaction table.
18. A method as recited in claim 10 wherein said undo phase
comprises undoing portions of corrupt transactions.
19. A method as recited in claim 10 wherein said undo phase
comprises undoing in-progress transactions at the time of a data
processor failure.
20. A method as recited in claim 1 for further recovering from
logical corruption, the method comprising the step of maintaining a
logical redo log.
21. A method as recited in claim 20 further comprising the
additional steps of maintaining a transaction code for redoing
transactions and of storing user inputs for a transaction in said
logical redo log.
22. A method as recited in claim 21 further comprising the step of
maintaining a commit record at a transaction level.
23. A method as recited in claim 20 comprising the further step of
returning a database to a transaction consistent state prior to a
first detected instance of possible data corruption.
24. A method as recited in claim 23 comprising the further step of
rerunning transactions affected by the first detected instance of a
possible data corruption in the same order as an original set of
transactions.
25. A method as recited in claim 23 comprising the further step of
deleting effects of transactions when logical information is
unavailable to permit redoing transactions.
26. A method as recited in claim 1, wherein direct physical
corruption has occurred to data in memory used for database cache,
further comprising the step of removing corruption from cache pages
without reflecting any corrupt data values in log records.
27. A method of detecting and recovering from data corruption as
recited in claim 3 further comprising the step of responsive to the
identity of a known corrupt data item, determining a probable
source of the corruption from said combined log of writes and
reads.
28. A method as recited in claim 27 wherein said determining step
comprises the substep of processing said log of writes and reads
backwards from an end of said combined log of writes and reads.
29. A method as recited in claim 28 further comprising the substeps
of maintaining a table of suspect corrupt data items and, for each
data item of said table, maintaining a set of known corrupt data
items whose corruption would be explained if said data item were
corrupt.
30. A method as recited in claim 29 further comprising the substep
of maintaining a table of suspect corrupt transactions and, for
each transaction of said table, maintaining said set of known
corrupt data items whose corruption would be explained if said
transaction were corrupt.
31. A method as recited in claim 30 further comprising the substeps
of adding a transaction to the table of suspect corrupt
transactions when a write of a data item in the suspect data table
by said transaction is encountered in said combined log during said
backwards processing step.
32. A method as recited in claim 31 further comprising the substep
of adding the set of known corrupt data items associated with said
data item in the suspect data table to the set of known corrupt
data items associated with said transaction.
33. A method as recited in claim 30 further comprising the
substeps of adding a data item to the suspect data table when a
read of said data item by a transaction in the suspect transaction
table is encountered in said combined log during said backwards
processing step.
34. A method as recited in claim 33 further comprising the substep
of adding the set of known corrupt data items associated with said
transaction to the set of known corrupt data items associated with
said data item.
35. A method as recited in claim 34 further comprising the step of
stopping said backwards processing whenever a suspect transaction
exists in the suspect transaction table such that the set of data
items associated with said transaction contains all known corrupt
data items, said transaction being output to said user as a
possible source of said corruption.
36. A method as recited in claim 35 further comprising the steps of
performing forward recovery from said transaction determined by
said user as a source of said corruption.
37. A method as recited in claim 1 wherein said information about
reads comprises one of a start and an end point and length of data
read.
38. A method as recited in claim 1 wherein, for detecting logical
corruption of data, said method comprises the step of logging a
codeword of a logical state found.
39. A method as recited in claim 1, further comprising the step of
logging lock information in a logical read log.
40. A method as recited in claim 39 wherein said lock information
comprises a name of a data item and a type of lock.
41. A method as recited in claim 1 further comprising the step of
ensuring a disc image of said database is free of corruption.
42. A method as recited in claim 1, wherein, in the event of
logical corruption, said method comprises the step of logging the
identity and codeword for a logical structure protected by a
lock.
43. A method of detecting and recovering from data corruption of a
database comprising the steps of logging information about reads of
a database to detect and recover from logical corruption of the
data in the database, maintaining a logical redo log and storing
user inputs for a transaction in said logical redo log.
44. A data corruption recovery method as recited in claim 43
further comprising the step of performing forward recovery in said
logical redo log from a determined point of initial corruption.
45. A corruption recovery method as recited in claim 44 further
comprising the steps of saving log records until a commit or an
abort for a transaction is seen and, if a commit is seen, then
scanning the read log records to determine if a transaction has
read corrupted data and marking the transaction as corrupt.
46. A corruption recovery method as recited in claim 45 further
comprising the steps of, if a commit is seen and the transaction is
marked corrupt, reexecuting the transaction logically and,
responsive thereto, replacing logical redo records with redo
records generated by the logical reexecution of the transaction and
adding data written by the transaction to a corrupt data table.
47. A corruption recovery method as recited in claim 45 further
comprising the steps of, if a commit is seen and the transaction is
not marked corrupt, executing its logical redo records.
48. A corruption recovery method as recited in claim 44 further
comprising the steps of saving log records until a commit or an
abort for a transaction is seen and, if an abort is seen,
discarding the log records.
49. A method of detecting and recovering from data corruption of a
database comprising the steps of logging information about reads of
a database to detect and recover from logical corruption of the
data in the database and deleting the effects of transactions from
an image of the database.
50. The method of detecting and recovering from data corruption as
recited in claim 49 further comprising the step of logging lock
information in said logical redo log.
51. A method of detecting and recovering from data corruption as
recited in claim 49, said recovery from data corruption comprising
first and second phases, a first redo phase followed by an undo
phase.
52. A method as recited in claim 51 wherein said redo phase
comprises a forward scan of a log of read and write operations.
53. A method as recited in claim 52 wherein said forward scan adds
an identity of data to a corrupt data table whenever said data is
written by a corrupt transaction.
54. A method as recited in claim 52 wherein said forward scan adds
transactions to a corrupt transaction table whenever data is read
that is identified in the corrupt data table.
55. A method as recited in claim 54 wherein in a redo phase,
actions such as writes are applied to a database image unless a
transaction is listed in the corrupt transaction table.
56. A method as recited in claim 51 wherein said undo phase
comprises undoing portions of corrupt transactions.
57. A method as recited in claim 51 wherein said undo phase
comprises undoing in-progress transactions at the time of a data
processor failure.
58. A method of detecting and recovering from data corruption of a
database comprising the steps of logging information about reads of
a database to detect and recover from logical corruption of the
data in the database and logging a codeword of a logical state
found.
Description
[0001] This application claims priority and is related by subject
matter to U.S. application Ser. No. 08/766,096, entitled "System
and Method for Restoring a Distributed Checkpointed Database," of
Bohannon et al., to U.S. application Ser. No. 08/767,048, entitled
"System and Method for restoring a Multiple Checkpointed Database
in View of Loss of Volatile Memory" of Bohannon et al., both
applications being filed Dec. 16, 1996, and to U.S. Provisional
Patent Application Serial Nos. 60/099,265 and 60/099,271, filed
Sep. 4, 1998, of Bohannon et al.
BACKGROUND OF THE INVENTION
[0002] 1. Technical Field
[0003] The present invention relates to the field of database
management systems generally and, more particularly, to method and
apparatus for detecting and recovering from data corruption of a
database by codewording regions of the database and by logging
information about reads of the database.
[0004] 2. Description of the Related Arts
[0005] A database is a collection of data organized usefully and
fundamental to many software applications. The database is
associated with a database manager and together with its software
application comprises a database management system (DBMS). In
recent years, extensible database systems such as Illustra (now
part of the Informix Universal Server) have been developed which
allow the integration of application code with database system
code. In these systems, the application code has direct access to
the buffer cache and other internal structures of the DBMS.
Similarly, application programs in many object oriented database
(OODB) systems have direct access to an object cache in their
address space. This OODB architecture was developed to minimize the
cost of accesses to data, for example, to support the needs of
Computer Aided Design (CAD) systems. Also, several recently
developed storage management systems provide memory resident or
memory mapped architectures. For example, the Dali main-memory
storage manager described in Bohannon et al., "The Architecture of
the Dali Main-Memory Storage Manager," Journal of Multimedia Tools
and Applications, 4(2), pp. 115-151 (1997) is designed to provide
applications with fast, direct access to data by keeping the entire
database in volatile main memory. In all these systems, direct
access to data (either in the database buffer cache or in a
memory-mapped portion of the database) by application programs is
critical to providing fast response times. The alternative to
memory mapping is to access data via a server process, but this
presents an unacceptable solution due to the high cost of
inter-process communication. Application code is typically less
trustworthy than database system code, and there is therefore a
significant risk that "wild writes" and other programming errors
can affect persistent data in systems that allow applications to
access such data directly. Since the systems described above are
increasingly popular, the risk of wild writes and associated
physical corruption is growing. Additionally, there is a risk of
damage due to software faults in the DBMS itself. It is therefore
important to develop techniques that can mitigate the risk of
corruption.
[0006] In our parent U.S. patent application Ser. No. 08/766,096,
filed Dec. 16, 1996 and entitled "System and Method for Restoring a
Distributed Checkpointed Database," we describe the application of
multiple checkpoints and the maintenance of a stable log record
stored on a server for tracking operations to be made to the
multiple checkpoints in a distributed environment. A companion
parent application, U.S. patent application Ser. No. 08/767,048,
entitled "System and Method for Restoring a Multiple Checkpointed
Database in View of Loss of Volatile Memory" filed the same day
describes recovery processes at multiple levels of a DBMS in the
event of loss of volatile memory. The '048 and the '096
applications should be deemed to be incorporated by reference
herein as to their entire contents. Both of these applications
relate to the preservation and restoration of a database (or
distributed database), for example, stored in main volatile memory
of a data processor.
[0007] The problem of detecting and recovering from corruption of
data in a database system still remains to be solved in a pragmatic
manner without adding considerable overhead to the DBMS. Data
corruption may be physical or logical and it may be direct or
indirect. Data is "directly" corrupted in a physical corruption
sense by "unintended" updates, such as wild writes as explained
above due to programming errors in the physical case, or arising
from incorrectly coded updates or input errors (human errors) in
the logical case. Once data is directly corrupted, it may be read
by a process, which then issues writes based on the value read.
Data written in this manner is indirectly corrupted, and the
process involved is said to have carried the corruption. While this
process may be a database maintenance process, we focus on
transaction-carried corruption, a problem in which the carrying
process is executing transactions.
[0008] Direct physical corruption can be mostly prevented with
hardware memory protection, using the virtual memory support
provided by most operating systems. One approach involves mapping
the entire database in a protected mode, and selectively
un-protecting and re-protecting pages as they are updated. However,
this can be very expensive, for example, on standard UNIX systems.
An alternative to the hardware approach would be programming
language techniques such as type-safe languages or sandboxing.
(Sandboxing is a technique whereby an assembly language programmer
adds code immediately before a write to ensure that the instruction
is not affecting protected space.) However, type-safe languages
have yet to be proven in high-performance situations, and
sandboxing may perform poorly on certain architectures. Finally,
communication across process domain boundaries to a database server
process provides protection, but such communication is orders of
magnitude slower than access in the same process space, even with
highly tuned implementations. The concern over physical corruption
is further motivated by the increasing number of systems in which
application code has direct access to system buffers, including
extensible systems, object databases, and memory-mapped or
in-memory architectures. Finally, some work has raised concern over
damage to data due to faults in the DBMS itself.
[0009] Integrity constraints are widely studied and prevent certain
cases of logical corruption in which rules about the data would be
violated. However, it is an object of the present invention to deal
with those cases in which integrity constraints and other input
validation techniques fail, and whether due to programming error or
invalid input, unintended updates are made to the database. We
consider such cases inherently impossible to prevent, and instead
assume that the problem is detected later, usually when a database
user notices incorrect output (on a bank statement, for
example).
[0010] In the field of accounting systems and audits, it is known
from L. A. Bjork, Jr., "Generalized Audit Trail Requirements and
Concepts for Data Base Applications", IBM Systems Journal, No. 3,
1975, pp. 229-245, how to create and maintain an audit trail--a
history of activities by transaction, posted because of operations
on specific data. Bjork describes that a time dimension can be
added to a stored record such that supplemental information in the
form of a descriptor and time frame are maintained. For example,
the time frame when the information was created and stored, and
each version of a data field, is maintained along with the action.
He refers to "create" (creation of data), "reference" (when
reference is made to x at time t), and "update" (when created data is
updated) as descriptors all having a time dimension. A further
descriptor is "refer to prior generation" (when data now updated is
referred to by a prior generation). Also, C. T. Davies, Jr. in his
article "Data Processing Spheres of Control" appearing in the IBM
Systems Journal, Vol. 17, No. 2, 1978, pp. 179-198 describes
"in-process recovery" as the control of recording and subsequent
use of data required to return to a previous point in a process,
and that the process that created or last modified data elements be
determined from a journal. System recovery is obtained via
establishing checkpoints that represent an early state in a data
base. Once a search backward is conducted to find an error, a
checkpoint behind the error is obtained from which to rebuild.
These accounting processes, typified, for example, by the recovery
from a payroll error, are not examined by Bjork or Davies for the
generic case of database recovery, nor are they automated. Bjork
and Davies provide no suggestions for real-time implementation, for
example, of a read-logging recovery system such as would be
required in a communications system and associated record-keeping
environment.
[0011] Thus, there appears a genuine need in the art of database
management systems to provide an improved method and apparatus for
detecting and recovering from corruption of a database via read
logging.
SUMMARY OF THE INVENTION
[0012] According to the present invention, it is a principal object
to apply several new techniques for the prevention or detection of
corruption. The new techniques may be suitable for application in a
real-time environment such as in a telecommunications system.
[0013] For detecting indirect logical or physical corruption, it is
a feature of the present invention to log information about reads
(Read Logging). Interestingly, any negative impact of Read Logging
is limited, as the actual values read are not logged according to
one embodiment of the present invention, just the identity of the
item read and optionally a checksum of the value. Moreover, it is
an extension of the present invention to store codewords for each
read in the read log records.
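By way of illustration only, the following C++ sketch shows what such a read log record might look like when only the item identity and an optional checksum are recorded; the types and field names are hypothetical and not the literal record format of the invention.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical read log record: only the identity of the item read and an
// optional checksum of its value are logged, never the value itself.
struct ReadLogRecord {
    uint64_t txn_id;    // transaction performing the read
    uint64_t item_id;   // identity of the data item read
    uint32_t checksum;  // optional codeword of the value read (0 if unused)
};

// Word-wise parity checksum of the value read, matching the parity codeword
// scheme described later in this application.
uint32_t parityChecksum(const uint32_t* words, std::size_t n) {
    uint32_t c = 0;
    for (std::size_t i = 0; i < n; ++i) c ^= words[i];
    return c;
}
```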
[0014] When corruption is detected rather than prevented,
techniques for corruption recovery are employed to restore the
database to an uncorrupted state. As will be further described
herein, once codewording of data and read logging is performed,
models and algorithms are presented for recovery from
transaction-carried indirect corruption. In these models, the read
log records are preferably combined with known write log records
and operated on more efficiently as a combined log to detect and
recover from transaction based corruption (although in other
embodiments of the present invention, the read log and write log
records may be separately maintained). One model, the
redo-transaction model, uses logical descriptions of transactions
to repair the database state. A second model, the
delete-transaction model, focuses on removing the effects of
corruption from the database image. The algorithms presented herein
can be applied to recovery from logical or physical corruption. In
addition, tracing techniques are presented which aid in determining
the scope of logical corruption.
[0015] To ascertain the performance of our algorithms for detecting
and recovering from physical corruption, we have studied the impact
of these schemes on a TPC-B style workload implemented in the Dali
main-memory storage manager. Our goal was to evaluate the relative
impact on normal processing of schemes that can be easily ported
across a variety of architectures and operating systems. In
addition to our schemes, we study a hardware-based protection
technique. For detection of direct corruption, the overheads
imposed cause throughput of update transactions to be decreased by
8%. Prevention of transaction-carried corruption with Read
Prechecking costs between 12% and 72%, but requires a significant
space overhead to achieve the better performance numbers. Detection
of transaction-carried corruption with Read Logging costs between
17% and 22%. Our study indicates that the corruption prevention
algorithms of Sullivan et al., "Using Write Protected Data
Structures to Improve Fault Tolerance in Highly Available DBMS" in
Proceedings of the International Conference on Very Large
Databases, pp 171-179, 1991, when using standard OS support for
memory protection, decrease throughput by about 38%. Thus, the
codeword and read logging based detection and prevention schemes of
the present invention perform significantly better than the
hardware-based protection.
[0016] With the present invention, it is possible to identify a
subset of the later transactions that were (directly or indirectly)
affected by the error, and to selectively roll them back and redo
them manually (or even automatically in some cases). Also, the
techniques of the present invention are language and
instruction-set independent.
[0017] Thus, a method of detecting and recovering from data
corruption of a database according to the present invention is
characterized by the step of logging information about reads of a
database to detect and recover from physical corruption of the data
in the database, wherein the physical corruption arises from bad
writes of data to the database or arises indirectly from the bad
writes. In a delete transaction model, the corruption recovery
comprises first and second phases, a first redo phase followed by
an undo phase. In another embodiment, a method of detecting and
recovering from database corruption comprises the steps of logging
information about reads of a database to detect and recover from
logical corruption, maintaining a logical redo log and storing user
inputs for a transaction in the logical redo log. Alternatively, in
the logical corruption recovery method, the method comprises the
step of logging a checksum of a logical state found.
[0018] In our co-pending, concurrently filed patent application, we
describe a Read Prechecking scheme that associates one word
codewords with each region of data, and prevents
transaction-carried corruption by verifying that the codeword
matches the data each time it is read. A Data Codeword scheme, a
less expensive variant of Read Prechecking, allows detection of
direct physical corruption by asynchronously auditing the
codewords. This scheme is also referred to herein and in our
co-pending application as deferred codeword maintenance and
involves performing codeword updates during a process called "log
flushing" at the same time as data is flushed to disc from main
memory. These schemes are disclosed and claimed in concurrently
filed, copending U.S. patent application Ser. No. (Attorney Docket
Number Bohannon et al. 8-25-3-38-11) entitled "Method and Apparatus
for Detecting and Recovering from Data Corruption via Read
Prechecking and Deferred Maintenance of Codewords," of the same
inventors.
[0019] These and other features of the present invention will be
best understood from considering the drawings and the following
detailed description of the preferred embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] FIG. 1 is a functional schematic drawing of the overall
architecture of the Dali main memory storage manager useful for
explaining an implementation of the present invention.
[0021] FIG. 2 is a functional block diagram providing an overview
of database recovery structures which may be existent, for example,
within an architecture of FIG. 1.
[0022] FIG. 3 shows a database memory comprising a plurality of
pages and codewords for each page with a latch for processes
wanting access to a page or a codeword for that page useful in
describing a simple read prechecking scheme.
[0023] FIG. 4A also shows a database memory organized with
codewords for page regions having a page portion protection latch
permitting only one process to read or modify a codeword at a time
useful in describing deferred codeword maintenance and FIG. 4B is
useful in describing how deferred maintenance operates at log flush
and associated recovery and audit processes.
[0024] FIGS. 5A and 5B together comprise a flowchart of a recovery
algorithm for the delete transaction model of the present invention
comprising a redo phase followed by an undo phase before
checkpointing the database.
[0025] FIG. 6 comprises an alternative algorithmic approach to
recovery for the redo-transaction model.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
[0026] Introduction
[0027] Referring briefly to FIG. 1, there is shown an overall
architecture of the Dali storage manager on which the present
invention has been implemented. The Dali system should be
considered only an exemplary implementation of the present
invention and the present detection and recovery algorithms may be
applied in other architectures as well. The database in a Dali
system consists of one or more database files, along with a special
system database file. User data itself is stored in database files
while all data related to database support, such as log and lock
data, is stored in the system database file. This enables storage
allocation routines to be uniformly used for persistent user data
as well as non-persistent system data like locks and logs. The Dali
system database file also persistently stores information about the
database files in the system.
[0028] As shown, database files opened by a process are directly
mapped into the address space of that process. In Dali, either
memory-mapped files or shared-memory segments can be used to
provide this mapping. Different processes may map different sets of
database files, and may map the same database file to different
locations in their address space.
[0029] This feature precludes using virtual memory addresses as
physical pointers to data (in database files), but provides two
benefits. First, a database file may be easily resized. Second, the
total active database space on the system may exceed the addressing
space of a single process. This is useful on machines with 32-bit
addressing in which physical memory can significantly exceed the
amount of memory addressable by a single process. In a 64-bit
machine, both of these considerations may be mitigated, so we also
consider physical addressing. For example, if a single database
file can be limited to the order of 64 gigabytes, then each process
can still map close to a billion database files (which can be
expected to far exceed the total database space).
[0030] There is a single active transaction table (ATT), stored in
the system database, that stores redo and undo logs for each active
transaction. A dirty page table (dpt) is maintained for the
database (also in the system database) which records the pages that
have been updated since the last checkpoint. The ATT (with undo
logs) and the dirty page table are also stored with each
checkpoint. The dirty page table (dpt) in a checkpoint is referred
to as ckpt_dpt in the figure.
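The sketch below illustrates, with hypothetical C++ types, the general shape of the ATT and dirty page table just described; the actual Dali layouts differ.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Illustrative shapes only; the actual Dali structures differ.
struct LogRecord { uint64_t addr; std::vector<uint8_t> image; };

// One ATT entry per active transaction, holding its local undo and redo
// logs (local logging).
struct AttEntry {
    std::vector<LogRecord> undo_log;
    std::vector<LogRecord> redo_log;
};

// Active transaction table (ATT), keyed by transaction identifier.
using ActiveTransactionTable = std::map<uint64_t, AttEntry>;

// Dirty page table (dpt): pages updated since the last checkpoint. The copy
// stored in a checkpoint is referred to as ckpt_dpt.
using DirtyPageTable = std::set<uint64_t>;
```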
[0031] Models and Overview
[0032] In this section we describe our models of corruption and
recovery and the database model in which our algorithms are
described.
[0033] Models of Corruption
[0034] As described above, data corruption may be either direct or
indirect, logical or physical. Direct physical corruption is
defined to be an update to the in-memory image of database data
which did not happen through prescribed methods. For example, in a
standard disk-based system, an update to an internal buffer without
a latch and without calls to logging routines (usually made via a
FixBuffer or similar function call) would constitute direct
physical corruption. For applications with direct access to data,
such corruption could result due to application errors (e.g., stray
pointers in the application code).
[0035] Examples of direct logical corruption include any update
made in error, for example, an accidental deletion of a customer,
updates to account balances made by a program which computed
interest incorrectly, or simply entering an incorrect amount for
sales commission which is then used in payroll computations. Once
direct corruption has occurred, corrupted data may be read by a
subsequent transaction. If that transaction, considered a corrupt
transaction, then writes other data in the database, that data is
also assumed to be corrupt. Note that the newly "corrupted" data
may then be read by other transactions, and the corruption
propagated further. We call such corruption indirect
corruption.
[0036] As with corruption of user data, corruption of control
structures may be handled with hardware memory protection, or with
codeword based techniques.
[0037] Models of Recovery
[0038] We define four models of corruption recovery: a
cache-recovery model, a prior-state model, a redo-transaction model
and a delete-transaction model. The first model deals only with
direct physical corruption. The other models deal with logical and
indirect physical corruption.
[0039] In the cache-recovery model, corruption is removed from
cache pages, assuming that indirect corruption has not occurred,
and because of that, corrupt data values are not reflected in any
log records.
[0040] In the prior-state model, the goal is to return the database
to a transaction consistent state prior to the first possible
occurrence of corruption. Most commercial systems support this
model. Clearly, this recovery model leaves much to be desired, as
it may well be impractical to request that users re-submit any work
done following the introduction of the corruption.
[0041] In an extension of the prior-state model, once the direct
error has been identified and corrected, transactions affected by
the error must be logically re-run in history; that is, the
transactions must be re-run in the same serialization order as the
original set of transactions. We call this extension of the
prior-state model the redo-transaction model of corruption
recovery.
[0042] In our final model, the delete-transaction model, we assume
that logical information is not available to allow a redo, and
corruption is dealt with by deleting the effects of transactions
from the database image. Any transaction that read corrupted data
must be deleted from history, and any data that such a transaction
wrote after reading corrupt data is treated as being corrupted by
the transaction.
[0043] To implement a recovery algorithm for the delete transaction
model, it must be clearly understood what it means to "delete a
transaction from history". One possible interpretation would be to
allow any serializable execution of the remaining, undeleted,
transactions. However, this definition is not acceptable, since the
values read by other transactions, and thus the values exposed to
the outside world, might change in the modified history.
[0044] To define correctness in the delete transaction model, we
consider two transaction execution histories, the original history,
H_o, and the delete history, H_d, in which all reads and writes of
certain transactions no longer appear. These histories include the
values read or written by each operation, and for a given operation,
the value read or written in the delete history is the same as in
the original history. A delete history is conflict-consistent with
the original history if any read in H_d is preceded by the same
write which preceded it in H_o. Similarly, H_d is view-consistent
with H_o if each read in H_d returns the value returned to it in
H_o. Note that the notions of conflict- or view-consistency are
distinct from the standard notions of conflict- or view-equivalence.
A correct recovery algorithm in the delete-transaction model
recovers the database according to a delete history which is
conflict- or view-consistent with the original history. Note that it
follows from this definition that in a conflict-consistent delete
history, the final state of any data item written by a transaction
in the delete set will have the value it had before being written by
the first deleted transaction.
[0045] Levels of Protection
[0046] Having covered the models of corruption and corruption
recovery, we now consider different levels of protection. In
general, it is possible to prevent or detect either direct or
indirect physical corruption. It may further be possible to detect
indirect logical corruption, once the fact that a certain
transaction introduced the logical corruption has been determined
by a human. The human then can initiate appropriate action. In each
of these cases, if a form of corruption is detected but not
prevented, a recovery mechanism for that corruption should
exist.
[0047] Note that prevention of direct corruption is equivalent to
prevention of all corruption, since other forms only propagate
direct corruption. Preventing direct logical corruption is
impossible, and preventing direct corruption using hardware may be
expensive as discussed above. In the next best alternative, direct
corruption is detected and transaction-carried corruption is
prevented. Finally, direct corruption and transaction-carried
corruption may both be detected. Detecting or preventing only
transaction-carried corruption with no detection of direct
corruption makes little sense, as the source of the corruption
would remain in the database indefinitely. However, detection of
only direct physical corruption may be used, if one feels that the
corruption will usually be detected and recovery performed before
the direct corruption leads to transaction-carried corruption. As
unsatisfactory as it seems, such minimal protection is a great
improvement over no protection at all. Corrupt data left in the
database indefinitely is far more likely to cause problems than
corruption removed after a few minutes or even hours.
[0048] Some of the schemes we describe below associate a codeword
with a region of data known as a protection region. When data in
the region is updated through the prescribed interface, the
codeword is updated along with the data. By using word-wise parity,
or a similar scheme, updates to the codeword can be made based only
on the portion of the protection region which was actually
updated--the remainder of the region need not be read.
[0049] Using Codewords To Detect Corruption
[0050] The basic (and well known) idea of codeword-based corruption
detection is as follows. Each protection region also stores a
codeword for the region. Updates via the prescribed interface also
update the codeword for the region (either immediately on update,
or in a deferred fashion, as we shall see). If an update not via
the prescribed interface occurs, the stored codeword will not be
updated. Thus (with a very high probability assuming a good
codeword scheme), if we compute the codeword for the region it will
not match the stored code (we ensure that the deferred codeword
updates have been performed before matching the codewords).
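A minimal sketch of this idea, assuming a one-word parity codeword per protection region (the scheme adopted below); all names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal sketch of codeword-based corruption detection using one-word
// parity per protection region. All names are illustrative.
struct Region {
    std::vector<uint32_t> words;
    uint32_t codeword = 0;  // stored parity of all words in the region
};

// Update through the prescribed interface: the codeword is maintained
// incrementally from only the old and new values of the updated word.
void prescribedWrite(Region& r, std::size_t i, uint32_t value) {
    r.codeword ^= r.words[i] ^ value;  // delta = old XOR new
    r.words[i] = value;
}

// A "wild write" bypasses the interface and leaves the codeword stale.
void wildWrite(Region& r, std::size_t i, uint32_t value) { r.words[i] = value; }

// Audit: recompute the parity and compare against the stored codeword.
bool audit(const Region& r) {
    uint32_t c = 0;
    for (uint32_t w : r.words) c ^= w;
    return c == r.codeword;  // false => direct physical corruption detected
}

int main() {
    Region r;
    r.words.assign(4, 0);
    prescribedWrite(r, 1, 0xDEADBEEF);
    assert(audit(r));               // codeword tracks prescribed updates
    wildWrite(r, 2, 0x12345678);    // corruption: codeword not updated
    assert(!audit(r));              // audit catches the mismatch
}
```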
[0051] There are several issues in codeword based corruption
detection, such as:
[0052] When and how to update the codewords (Immediately on update?
Deferred to later?)
[0053] When and how to check for codeword mismatch (When flushing
data to disk or archival copy?)
[0054] How to integrate codeword detection with concurrency control
and recovery (ensuring concurrency levels stay high; ensuring
recoverability from failures).
[0055] These issues are addressed later, in Section 5; in this
section we introduce some basic requirements on the use of
codewords for detecting corruption.
[0056] Error detecting codes, such as the Cyclic Redundancy Check
(CRC) code are widely used, for example to verify integrity of
sectors of disk systems. However, our algorithms have non-standard
requirements on the codeword schemes that may be used:
[0057] 1. It should be possible to update a codeword incrementally
when part of the protection region that it covers is updated. In
particular, to avoid locking unrelated data and to provide a high
degree of concurrency, we must be able to compute the new codeword
using only the old value of the codeword, and the old and new
values of the part of the region that is updated.
[0058] 2. Since the actual update of the codeword may be deferred,
we require that the effect on the codeword of an update M to part
of a region can be summarized as Δ(M). The update M consists
of the old and new values of the updated part of the protection
region, and hence Δ(M) must be a function of only the
location and the old and new values of the updated area. In
particular, Δ(M) should not depend on the old codeword. Since
in one of our schemes Δ(M) must be stored in the redo log
record, we require that the information in Δ(M) should be
small, only about as big as the codeword itself. We also assume
there is a codeword update operation ⊕, such that if the
codeword for a region prior to an update M is C_old, the
codeword after the update is

C_new = C_old ⊕ Δ(M)

[0059] We shall assume that there is a value, denoted by 0, that
results in no change to the codeword; that is, for all C, we have
C ⊕ 0 = C.

[0060] 3. There may be a sequence of updates to a region whose
Δs are applied out-of-order to the codeword. We assume that
while an update is in progress it has exclusive access to the
updated data; such exclusive access can be ensured by means of
region locks. Further, the codeword change Δ(M) is computed
while the update M has exclusive access to the updated data. Given
the above, we require that for any pair of updates M_1 and M_2

(C ⊕ Δ(M_1)) ⊕ Δ(M_2) = (C ⊕ Δ(M_2)) ⊕ Δ(M_1).

[0061] We often treat M as if it is composed of two updates, one
from the initial value of the updated area U to a value of all
zeros, and a second from all zeros to the value R it holds after M.
We use Δ⁻(U), which we call the undo value, to denote
Δ(M′) where M′ is the update from U to all zeros, while
Δ⁺(R), which we call the redo value, denotes Δ(M″)
where M″ is the update from all zeros to R.

[0062] Thus, we have
C ⊕ Δ(M) = (C ⊕ Δ⁻(U)) ⊕ Δ⁺(R). We
shall assume a function ⊗ that can be used to combine
Δ⁻(U) and Δ⁺(R) such that
Δ(M) = Δ⁻(U) ⊗ Δ⁺(R).
[0063] There are several codeword schemes that satisfy our
requirements. One such scheme is parity encoding, in which the size
of each codeword is one word. The parity code is the bitwise
exclusive-or of the words in the region. Thus the i'th bit of the
codeword represents the parity of the i'th bit of each word in the
region. For the case of parity, ⊗, ⊕ and Δ
are all the same--they compute bitwise exclusive-or. It is easy to
check that the parity code has the properties that we require. In
particular, Δ(M) for an update M simply consists of the
exclusive-or of the before-update and after-update values of words
involved in the update. Also, Δ(M) requires only one word of
storage.
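The following sketch instantiates the above requirements for the parity code, where Δ, ⊕ and ⊗ all reduce to bitwise exclusive-or; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// For word-wise parity, the delta of an update M (old value U -> new value R)
// is simply U XOR R; applying a delta to the codeword is also XOR.
uint32_t deltaMinus(uint32_t undoU) { return undoU; }  // delta^-(U): U -> 0
uint32_t deltaPlus(uint32_t redoR)  { return redoR; }  // delta^+(R): 0 -> R
uint32_t combine(uint32_t a, uint32_t b) { return a ^ b; }         // the (x) operator
uint32_t apply(uint32_t codeword, uint32_t delta) { return codeword ^ delta; }  // (+)

int main() {
    uint32_t C = 0x0F0F0F0F;          // codeword before the update
    uint32_t U = 0x1111, R = 0x2222;  // old and new values of the updated word

    // delta(M) = delta^-(U) (x) delta^+(R); for parity this is U XOR R.
    uint32_t dM = combine(deltaMinus(U), deltaPlus(R));

    // C_new = C_old (+) delta(M)
    uint32_t Cnew = apply(C, dM);

    // Requirement 3: deltas commute when applied out of order.
    uint32_t d1 = 0xAAAA, d2 = 0x5555;
    assert(apply(apply(C, d1), d2) == apply(apply(C, d2), d1));

    // The zero value leaves the codeword unchanged: C (+) 0 = C.
    assert(apply(Cnew, 0) == Cnew);
}
```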
[0064] There are other choices for error-detection codes, for
example a summation scheme where the codeword is the word-wise
summation of all data on the region modulo 2.sup.32. Any such
codeword scheme satisfying our requirements can be used. Our focus
here is not on the choice of the best error detecting code, but
rather on the integration of such a code with a transaction
processing system.
[0065] Schemes for Prevention, Detection and Recovery
[0066] In this section, we briefly outline the schemes we will
present for the respective levels of protection described
above.
[0067] We begin with schemes which prevent corruption. The Hardware
Memory Protection scheme may be used to prevent direct corruption.
Furthermore, since the database is not corrupted, no explicit
recovery is required. If, however, the performance hit of hardware
protection is unacceptable, an alternative is to detect direct
physical corruption, while preventing indirect physical corruption.
This may be accomplished with the codeword-based scheme of Read
Prechecking, in which codewords are associated with regions of the
database, and these codewords are checked against the actual
contents prior to every read.
[0068] Direct physical corruption may be detected rather than
prevented with this same codeword arrangement by running periodic
audits of the protected data. Based on the performance of Read
Prechecking, we recommend using Read Logging to detect
transaction-carried corruption. In fact, the use of read logging
opens up several new possibilities in error detection and recovery.
In particular, if transaction-carried corruption has occurred the
read logs provide a way of tracing history and detecting the
transactions that were affected by the corruption. Because this
ability to trace affected transaction extends to logical
corruption, this technique aids in recovery from errors against
which the previously known techniques for corruption prevention are
ineffective.
[0069] Database Model
[0070] FIG. 2 provides an overview of the structures used for
recovery. Our corruption detection and recovery algorithms are
expressed in terms of the Dali recovery algorithm. The Dali
recovery algorithm provides very general support for high
concurrency, multilevel operations and minimal interference with
transaction processing in a main-memory database system. More
details can be found in our companion applications and our article
describing the Dali main memory manager, "The Architecture of the
Dali Main-Memory Storage Manager," referred to above and
incorporated herein by reference.
[0071] Multi-level Recovery: Dali implements a main-memory version
of multi-level recovery. A multi-level transaction processing
system consists of n logical levels of abstraction, with operations
at each level invoking operations at lower levels. Transactions
themselves are modeled as operations at level n, with level 0
consisting of physical updates. Multi-level recovery permits some
locks to be released early to increase concurrency. For example,
many systems have two levels defined below the transaction level: a
tuple level and a page level. Locks on pages are released early,
but locks on tuples are held for transaction duration.
[0072] Latching Protocol: In Dali, the extent of physical latching
used to protect the lowest level physical updates is left to the
database implementor. (Often these physical updates are covered by
higher level locks, leading to efficient implementations of
concurrency control for main-memory.) Thus, we merely assume that a
physical update occurs on a locking region (not to be confused with
a protection region, though they may be the same in a given
system), which may be a physical space like a page, or a logical
space like a linked list or tree. These lowest level updates are
assumed to be covered by a region lock.
[0073] Undo and Redo Logging: Updates in Dali are done in-place,
and updates by a transaction must be bracketed by calls to the
functions beginUpdate and endUpdate. Each physical update to a
database region generates an undo to part of a locking region image
and a redo image for use in transaction abort and crash recovery.
Undo and redo logs in Dali are stored on a per-transaction basis
(local logging). When a lower-level operation is committed, the
redo log records are moved from the local redo log to the system
log tail in memory, and the undo information for that operation is
replaced with a logical undo record. Both steps take place prior to
the release of lower level locks. A copy of the logical undo
description is included in the operation commit log record for use
in restart recovery. The Dali recovery algorithm repeats history on
a physical level, so during undo processing, redo records are
generated as for forward processing.
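A schematic sketch of this local-logging flow follows; beginUpdate and endUpdate are named in the text, but the surrounding C++ types and the operationCommit helper are hypothetical simplifications.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Schematic sketch of Dali-style local logging; a hypothetical simplification.
struct PhysicalRecord { uint64_t addr; std::vector<uint8_t> image; };
struct UndoEntry { bool logical; PhysicalRecord phys; std::string logical_desc; };

struct Transaction {
    std::vector<UndoEntry> undo_log;       // local undo log
    std::vector<PhysicalRecord> redo_log;  // local redo log
};

std::vector<PhysicalRecord> system_log_tail;  // in-memory system log tail

// A physical update is bracketed by beginUpdate/endUpdate: the undo image is
// noted before the in-place write, the redo image after it.
void beginUpdate(Transaction& t, uint64_t addr, std::vector<uint8_t> before) {
    t.undo_log.push_back({false, {addr, std::move(before)}, {}});
}
void endUpdate(Transaction& t, uint64_t addr, std::vector<uint8_t> after) {
    t.redo_log.push_back({addr, std::move(after)});
}

// On operation commit, and before lower-level locks are released: local redo
// records move to the system log tail, and the operation's physical undo
// records are replaced by a single logical undo record.
void operationCommit(Transaction& t, const std::string& logical_undo) {
    for (auto& r : t.redo_log) system_log_tail.push_back(std::move(r));
    t.redo_log.clear();
    t.undo_log.clear();  // simplification: a real system removes only this
                         // operation's physical undo records
    t.undo_log.push_back({true, {}, logical_undo});
}
```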
[0074] Log Flush: The contents of the system log tail are flushed
to the stable system log on disk when a transaction commits, or
during a checkpoint. The system log latch must be obtained before
performing a flush, so that flushes do not occur concurrently. The
stable system log and the tail are together called the system log.
The variable end_of_stable_log stores a pointer into the system log
such that all records prior to the pointer are known to have been
flushed to the stable system log. While flushing physical log
records, we also note which pages were written ("dirtied") by the
log record. This information about dirty pages is noted in the
dirty page table (dpt).
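A hypothetical sketch of this flush step, with the actual I/O elided:

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical sketch of the log flush: the system log latch serializes
// flushes, and pages dirtied by flushed records are noted in the dpt.
struct RedoRecord { uint64_t addr; };
std::vector<RedoRecord> system_log_tail;   // in-memory tail
std::mutex system_log_latch;
std::set<uint64_t> dirty_page_table;       // dpt
std::size_t end_of_stable_log = 0;         // index of first unflushed record

uint64_t pageOf(uint64_t addr) { return addr >> 12; }  // assuming 4K pages

void flushSystemLog() {
    std::lock_guard<std::mutex> guard(system_log_latch);
    for (std::size_t i = end_of_stable_log; i < system_log_tail.size(); ++i) {
        // write system_log_tail[i] to the stable system log (I/O elided)
        dirty_page_table.insert(pageOf(system_log_tail[i].addr));
    }
    end_of_stable_log = system_log_tail.size();
}
```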
[0075] Logical Undo: All redo actions are physical, but when an
operation commits, an operation commit log record is added to the
redo log, containing logical undo description for that operation.
These records are used so that at system recovery, logical undo
information is available for all committed operations whose
enclosing operation (which may be the transaction itself) has not
committed. For transaction rollback during normal execution, the
corresponding undo records in the transaction's local undo log are
used instead.
[0076] Due to the requirements of multi-level recovery, at any
point in time a transaction's undo log consists of some number of
logical undo actions followed by some number of physical undo
actions. If a transaction aborts, it executes the undo log in
reverse order.
[0077] Transaction Rollback: When a transaction is rolled back
during normal operation its local undo log is traversed in reverse
order, and the undo descriptions are used to undo operations of the
transaction. In the case of physical update operations, the undo is
done physically, while in the case of logical operations, the undo
is done by executing the undo operation.
[0078] Following the philosophy of repeating history, both these
actions generate redo logs representing the physical updates taken
on their behalf. Additionally, each undo operation also generates
redo log records to note the begin and commit of the undo
operation, just like a regular operation, indicating that they are
an action taken on behalf of an undo log record. At the end of
rollback a transaction abort record is written out.
[0079] Checkpoints: Again, FIG. 2 provides an overview of the
structures used for recovery. The database is mapped into the
address space of each process. During a checkpoint, dirty pages
from the in-memory database image are written to disk. In fact, two
checkpoint images Ckpt_A and Ckpt_B are stored on disk, as is
cur_ckpt, an "anchor" pointing to the most recent valid checkpoint
image for the database. During subsequent checkpoints, the newly
dirty portions of the database are written alternately to the two
checkpoint images (this is called ping-pong checkpointing). The
anchor is switched to point to the new checkpoint only after
checkpointing has been successfully completed.
[0080] Thus, even if one checkpoint image is corrupted (due to
writing corrupted data during checkpointing or due to failure
during checkpointing) the other checkpoint image is still available
for recovery.
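A sketch of ping-pong checkpointing under these assumptions (names illustrative, I/O elided):

```cpp
#include <cstdint>
#include <set>

// Sketch of ping-pong checkpointing: dirty pages are written alternately to
// two checkpoint images, and the anchor cur_ckpt is switched only after the
// new checkpoint completes, so one valid image always survives a crash.
enum class Image { CkptA, CkptB };
Image cur_ckpt = Image::CkptA;  // anchor: most recent valid checkpoint

void checkpoint(std::set<uint64_t>& dpt) {
    Image target = (cur_ckpt == Image::CkptA) ? Image::CkptB : Image::CkptA;
    for (uint64_t page : dpt) {
        (void)page;  // write the dirty page to the target image (I/O elided)
    }
    // the ATT (with local undo logs) and the dpt (as ckpt_dpt) are also
    // written with the checkpoint
    dpt.clear();
    cur_ckpt = target;  // switch the anchor only after the image is complete
}
```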
[0081] Information about active transactions is stored in an active
transaction table, referred to herein as the ATT. Due to local
logging, the entry for each transaction in the ATT contains local
undo and redo logs. In addition to the database image, a copy of
the ATT with the local undo logs and a copy of the dirty page table
(dpt) are stored with each checkpoint.
[0082] Note that physical undo information is moved to disk only
during a checkpoint. The undo information is taken by the
checkpointer directly from the local undo logs of each transaction.
(Thus, physical undo log records are never written to disk for
transactions which take place between checkpoints.)
[0083] Restart Recovery: Restart recovery starts from the last
completed checkpoint image, and replays all redo logs (repeating
history). When the end of log is reached, incomplete transactions
(those without operation commit or abort records) are rolled back,
using the logical undo information stored in either the ATT or
operation commit log records. Undo information is available in the
redo log for all operations at level 1 and higher. For level 0
(physical updates) the undo information is available in the
checkpointed undo log. (Or is reconstructed from the checkpoint
image during redo.)
[0084] Due to multi-level recovery, the rollback is done level by
level, with all incomplete operations at level i being rolled back
before any incomplete actions at level i+1 are rolled back.
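Schematically, and assuming hypothetical record types and helpers, restart recovery proceeds as follows:

```cpp
#include <vector>

// Schematic restart recovery: replay all redo records from the last completed
// checkpoint (repeating history), then roll back incomplete operations level
// by level, lowest level first.
struct RedoRecord { /* physical redo image */ };
struct IncompleteOp { int level; /* logical undo description */ };

void restartRecovery(const std::vector<RedoRecord>& redo_log,
                     const std::vector<IncompleteOp>& incomplete,
                     int max_level) {
    for (const RedoRecord& r : redo_log) {
        (void)r;  // apply the physical redo to the checkpoint image
    }
    // Multi-level rollback: all incomplete operations at level i are rolled
    // back before any incomplete actions at level i+1.
    for (int level = 0; level <= max_level; ++level) {
        for (const IncompleteOp& op : incomplete) {
            if (op.level == level) {
                // execute the logical undo from the ATT or from the
                // operation commit log record
            }
        }
    }
}
```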
[0085] Preventing Corruption
[0086] The prevention of physical corruption can take place at two
times--at the time of the direct corruption (the "bad write"), or
at the time of an attempt to read the corrupt data (indirect
corruption). In this section, we present the Hardware
Protection scheme which takes the first approach, and the Read
Prechecking scheme which takes the second.
[0087] Hardware Protection
[0088] Direct physical corruption can be largely prevented by using
virtual memory support provided by most operating systems. The
basic approach is to un-protect a page before an access, and
re-protect it afterwards. The un-protection and re-protection are
done in the database code that surrounds an update, for example,
fixing and unfixing a page in the buffer pool. Thus, if an update
is done outside of these routines, the page is very likely to be
protected, and the hardware mechanism will cause the program to be
terminated. If database processes are threaded, however,
un-protected pages will be vulnerable to other threads, as hardware
memory protection is a per-process resource.
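A minimal sketch of this un-protect/re-protect pattern using the POSIX mprotect() call; the page pointer must be page-aligned, and error handling is omitted for brevity.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstring>

// Un-protect a page, perform the update, and re-protect it. In a DBMS these
// calls surround the update, e.g. in fixing and unfixing a buffer-pool page.
void protectedWrite(void* page, std::size_t page_size,
                    std::size_t offset, const void* src, std::size_t len) {
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  // un-protect
    std::memcpy(static_cast<char*>(page) + offset, src, len);
    mprotect(page, page_size, PROT_READ);               // re-protect
}
// Any stray write while the page is protected raises SIGSEGV, terminating
// the program before the corruption can reach persistent data.
```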
[0089] We note that greater speed with less safety may be available
if the page is left un-protected for a period of time--until the
end of the enclosing operation, or transaction.
[0090] Turning protection on and off as above can be very expensive
since each protect/un-protect involves a system call (for security
reasons), which on most current generation operating systems
continues to be very expensive (around 20,000 instructions). For
example, on a SPARCStation 20 Model 50, only about 16,000
protect/un-protect pairs can be performed in a second.
[0091] While relatively good performance has been obtained by
specially modifying a research operating system to take additional
advantage of hardware features, we are interested in solutions
which are applicable to today's DBMS on a wide variety of standard
hardware and operating systems. Unfortunately, system calls and in
particular memory protection and un-protection are expensive on
these systems, and the problem does not seem likely to get
better.
[0092] Read Prechecking
[0093] An alternative to preventing direct corruption of data is
preventing the use of that corrupted data by a transaction. To
accomplish this, codewords are maintained for page-sized or smaller
regions of data known as protection regions. When data in the
region is updated through the prescribed interface, the codeword is
updated along with the data. By using word-wise parity, or a
similar scheme, updates to the codeword can be made based only on
the portion of the protection region which was actually
updated--the remainder of the region need not be read. Selection of
an encoding and parity protection scheme may be accomplished in any
manner known to those in the art appropriate.
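As a concrete illustration, word-wise parity with XOR as the encoding
operator might be implemented as in the following C sketch (the
function names are hypothetical). Note that the delta for an update
is computed from only the changed words, so the remainder of the
protection region need not be read.

    #include <stdint.h>
    #include <stddef.h>

    /* Codeword of an entire protection region: the XOR of its words
     * (used by audits and by read prechecks). */
    uint32_t region_codeword(const uint32_t *region, size_t nwords)
    {
        uint32_t cw = 0;
        for (size_t i = 0; i < nwords; i++)
            cw ^= region[i];
        return cw;
    }

    /* Delta(M) for an update M with undo image U and redo image R.
     * With XOR parity the codeword is maintained incrementally as
     * C = C ^ Delta(M). */
    uint32_t codeword_delta(const uint32_t *undo, const uint32_t *redo,
                            size_t nwords)
    {
        uint32_t d = 0;
        for (size_t i = 0; i < nwords; i++)
            d ^= undo[i] ^ redo[i];
        return d;
    }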
[0094] In the Read Prechecking scheme, the consistency between the
data in a protection region and its codeword is checked during each
read of persistent data. This scheme and the Data Codeword scheme
for direct corruption detection, described below, both use the same
codewords maintained during application updates.
[0095] FIG. 3 shows a database memory 300, useful in describing a
simple read prechecking scheme, comprising a plurality of pages
301-1 to 301-3 and a codeword for each page, for example, codeword
321-2 for page 301-2, with a latch 311-2 for processes 331-1, 331-2
to 331-n wanting access to a page 301-2 or the codeword 321-2 for
that page. A protection latch, for example, 311-2, is associated
with each protection region. Updates to data in the region hold this
latch 311-2 in shared mode, so that updates to different portions of
the region may go on concurrently; a reader that needs to check the
region against the codeword acquires the latch in exclusive mode,
thereby obtaining an update-consistent image of the region and the
codeword. Each codeword has an associated codeword latch, which is
used to serialize updates to the codeword. More than one codeword
can share a single latch; the loss in concurrency is small, since
codeword updates take very little time and thus the latch is
acquired only for very short durations.
[0096] A flag, for example with the name codeword-applied, is stored
in the undo log record for a physical update to indicate whether the
change to the codeword corresponding to the update has been applied.
This flag allows transaction rollback between a beginUpdate and an
endUpdate, the functions which, in Dali, bracket a physical
update.
[0097] We now present the actions taken during normal transaction
processing. In the description of steps taken for update and
initialization processing, we assume these actions are taken for an
update M, made up of an undo image U and a redo image R. For abort
processing, M⁻¹ is used to represent the inverse update, which
restores the data to its original value U.
[0098] Begin Update: At the beginning of an update, the protection
latch for the region is acquired, the undo image of the update is
noted in the undo log, and the flag codeword-applied in the undo
log record is set to false.
[0099] End Update: When an update is completed, the redo image is
recorded from the updated region, and the undo image in the undo
log record is used with this image to determine the change in the
codeword: the operator ⊕ is used to compute Δ(M) from the undo
image U and the redo image R in the log.
[0100] The codeword latch is acquired in exclusive mode, the
codeword C is changed to C ⊕ Δ(M), and the flag "codeword-applied"
in the undo log record for the update is set to true. Finally, the
protection latch is released.
[0101] Undo Update: When undo processing executes logical actions,
the Dali recovery algorithm generates redo log records for updates,
just as during normal processing. Correspondingly, the codeword is
changed to reflect each update, just as during forward processing.
When undo processing encounters a physical undo log record, it must
be handled differently based on whether the codeword update had
already taken place, as represented by the flag codeword-applied in
the undo log record. If this flag is set to true, the protection latch
is acquired and the codeword for the protection region is modified
to back out the update. If the flag is false, no change is made to
the codeword, and the protection latch need not be acquired, as it
is already held from forward processing. Regardless of the value of
the flag, the other undo actions, such as applying the update and
generating a log record for the undo, are executed. Finally, the
protection latch is released.
[0102] Read: To verify the integrity of the data in the protection
region, the reader acquires the protection latch for the region,
computes the codeword value of the region and compares this with
the stored codeword for that region.
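The protocol above might be sketched as follows in C, using a POSIX
read-write lock for the protection latch (shared mode for updaters,
exclusive mode for the reader's check) and a mutex for the codeword
latch. The types and functions are hypothetical illustrations, not
the Dali interface, and undo-image logging is elided.

    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        pthread_rwlock_t protection_latch; /* shared: updaters; exclusive: readers */
        pthread_mutex_t  codeword_latch;   /* serializes codeword updates */
        uint32_t  codeword;
        uint32_t *words;
        size_t    nwords;
    } region_t;

    typedef struct { int codeword_applied; /* undo image elided */ } undo_rec_t;

    /* Begin Update: acquire the protection latch, note the undo image
     * (elided) and clear codeword-applied in the undo log record. */
    void begin_update(region_t *r, undo_rec_t *u)
    {
        pthread_rwlock_rdlock(&r->protection_latch);
        u->codeword_applied = 0;
    }

    /* End Update: fold Delta(M) into the codeword under the codeword
     * latch, mark it applied, then release the protection latch. */
    void end_update(region_t *r, undo_rec_t *u, uint32_t delta_m)
    {
        pthread_mutex_lock(&r->codeword_latch);
        r->codeword ^= delta_m;                 /* C = C (+) Delta(M) */
        pthread_mutex_unlock(&r->codeword_latch);
        u->codeword_applied = 1;
        pthread_rwlock_unlock(&r->protection_latch);
    }

    /* Read: the exclusive latch yields an update-consistent image of
     * the region and its codeword; a mismatch signals corruption. */
    int precheck_read(region_t *r)
    {
        pthread_rwlock_wrlock(&r->protection_latch);
        uint32_t cw = 0;
        for (size_t i = 0; i < r->nwords; i++)
            cw ^= r->words[i];
        int ok = (cw == r->codeword);
        pthread_rwlock_unlock(&r->protection_latch);
        return ok;
    }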
[0103] In order to prevent corruption from reaching the disk image,
the checkpointer performs the same integrity check as the reader.
If either a reader or the checkpointer finds that a region does not
match the codeword for that region, steps are taken as will be
described herein to recover the data from this corruption. Note
that since corruption cannot be propagated, recovery under the
cache recovery model is sufficient.
[0104] Read Prechecking in ARIES
[0105] In this section, we describe how the Read Precheck scheme
would need to be modified to work on a page-based architecture such
as the ARIES system. In ARIES or other page-based systems, the
protection region may be chosen to be the page, in which case the
page latch can be used as the protection latch. However, our
performance study indicates that this may not lead to acceptable
performance; thus it may be necessary to associate multiple
codewords with a given page. The codeword is associated either with
the page, for example, if the codeword will be used to check the
integrity of disk writes and reads, or with the buffer control block
(BCB) otherwise. The "codeword-applied" flag is stored in the
BCB. This flag is needed in case rollback occurs between the
FixForUpdate and UnFix calls. If rollback cannot occur in this
interval, this flag can be dispensed with, and the steps below
should behave as if it is always set. As mentioned earlier, we
assume only one update occurs within a FixForUpdate and UnFix pair.
Our scheme can be extended easily to allow multiple updates, by
tracking a region which is the union of the individual regions, and
eliminating overlaps.
[0106] The actions taken at various points during an update are
described below.
[0107] FixForUpdate: First the page is latched in exclusive mode to
prevent any concurrent updates on the page or to the codeword of
the page. The undo image of the update is then noted in the undo
log, and the flag codeword-applied is set to false. At this time,
the codeword is updated with Δ⁻(U) (that is, the Δ for an update
which changes the value from U to all zeros) and the region to be
updated is noted in the BCB (by storing its offset and length). If
the update is physiological (that is, log records that
identify a page physically, but record the update to be performed
logically as an operation) then, if it is possible to determine an
area within the page that may be modified, the offset and length of
the area may be stored in the BCB. All updates by the physiological
operation must then be within this area. If at this time it is not
possible to determine the area affected by the update, the area to
be updated is simply assumed to be the entire page.
[0108] Let us define M as an update consisting of the old and new
values of the updated part of a protected region; then the effect
on the codeword for the region may be defined as Δ(M). If we treat
M as if it is composed of two updates, one from the initial value U
of the updated area to a value of all zeros, and a second from all
zeros to the value R it holds after M, then we define Δ⁻(U) as the
delta for the undo half (from U to all zeros) and Δ⁺(R) as the
delta for the redo half (from all zeros to R).
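With the XOR encoding used in the earlier sketch, the split of Δ(M)
into the undo and redo halves is immediate; the helper names below
are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    /* Delta-(U): the delta for an update from U to all zeros. */
    uint32_t delta_minus(const uint32_t *undo, size_t nwords)
    {
        uint32_t d = 0;
        for (size_t i = 0; i < nwords; i++)
            d ^= undo[i];                /* U ^ 0 == U */
        return d;
    }

    /* Delta+(R): the delta for an update from all zeros to R. Note
     * that Delta(M) == delta_minus(U) ^ delta_plus(R). */
    uint32_t delta_plus(const uint32_t *redo, size_t nwords)
    {
        uint32_t d = 0;
        for (size_t i = 0; i < nwords; i++)
            d ^= redo[i];
        return d;
    }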
[0109] UnFix: At the UnFix call, if the original fix was for an
update, then the updated region R is determined from the region
noted in the BCB, and the codeword for the page is updated with
Δ⁺(R) (that is, the Δ for an update which changes the value from
all zeros to R). The flag codeword-applied is set.
Finally the latch on the page is released.
[0110] Undo Update: When undo processing executes logical actions,
all codeword updates are done just as for forward processing. When
undo processing encounters a physical/physiological undo log
record, it must be handled differently based on whether the
codeword has been updated during UnFix or not. If the flag
codeword-applied is not set in the BCB, then the abort was called
between the FixForUpdate and the UnFix. In this case, the update to
the area during undo is taken to be from all zeros to the old value;
thus the codeword update Δ⁺(U) for the restored value U of the
affected area is used to update the codeword for the page. In
addition, the flag "codeword-applied" is set. If the flag
codeword-applied was already set, the update is taken to be from
the current value to the old value, and the codeword for the page
is correspondingly updated.
[0111] Page Steal: In order to write out a page, the page latch is
acquired in shared mode, and the codeword is computed for the page.
This computed codeword is compared with the codeword stored for the
page in the BCB, and if they are not equal, then the data in the
page has been corrupted. Appropriate corruption recovery actions
are then taken, as will be described below.
[0112] Detecting Corruption
[0113] In this section, we describe the Data Codeword scheme for
detecting direct physical corruption and the Read Logging scheme
for detecting transaction-carried physical and logical
corruption.
[0114] Data Codeword
[0115] Detecting (but not preventing) direct physical corruption
can be accomplished with a variant of the Read Prechecking scheme
described above. The maintenance of the codewords is accomplished
in the same manner; however, the check of the codeword on each read
is dropped in favor of periodic audits.
[0116] Since transaction-carried physical corruption is possible in
this scheme, additional care must be taken during checkpointing to
ensure that an uncorrupted image exists on disk. The process of
auditing is nothing more than an asynchronous check of consistency
between the contents of a protection region and the codeword for
that region. This can be carried out just as if a read of the
region were taking place in the Read Prechecking scheme.
[0117] Since prechecks are not being performed, and audits are
asynchronous, it makes sense to use significantly larger protection
regions. In this case the protection latch may become a concurrency
bottleneck. If so, a new latch, the codeword latch, may be
introduced to guard the update to the actual codewords, and the
protection latch need only be held in shared mode by updaters.
During audit, the protection latch must be taken in exclusive mode
to obtain a consistent image of the protection region and
associated codeword. In particular, data is audited during the
propagation to disk by the checkpointer (or page-steal in a
page-based system).
[0118] Note that this scheme by itself supports only the cache
recovery model; thus indirect corruption is not prevented. Rather,
one attempts to audit frequently enough to repair direct physical
corruption before it is encountered.
[0119] Corruption Detection with Deferred Maintenance
[0120] We now describe an alternate scheme for physical error
detection in which the maintenance of codewords is deferred to
improve concurrency. This scheme stores codeword updates in log
records and updates codewords, for example, during a log flush
rather than during a data update (or immediately as described
above). Updaters need not obtain any page latches, and
checkpointing, codeword-auditing and page-flushing do not interfere
with updates. FIG. 4A shows a database memory 400, useful in
describing deferred codeword maintenance, organized with codewords
421 for page regions and a page-portion protection latch 441a
permitting only one process to read or modify a codeword at a time.
Since codeword updates are done during log flush, the system-log
latch 441, which prevents concurrent flushes, serves to serialize
access to the codewords. Read processes can proceed at any time.
Also, latch 441 allows several processes to modify a page
concurrently, but only one process at a time can verify a page.
Thus, the deferred maintenance scheme can result in increased
throughput for update transactions, especially in main-memory
databases, where the relative cost of latching can be substantial.
Other locks/latches 411-1, 411-2 . . . 411-N in the system control
access to individual parts of the page 401-1.
[0121] For this scheme, each redo log record has an additional
field, the "redo delta", which stores the value Δ(M), where M is
the update denoted by the log record. Also, each undo record has a
codeword-applied flag. We assume that the area covered by a redo
log record does not span two or more protected regions; this
assumption can easily be ensured by splitting log records that span
protection boundaries into multiple pieces.
[0122] The actions taken at various steps are described below and
are in addition to the regular actions taken during these
steps.
[0123] Begin Update: Add an undo log record for the update to the
local undo log of the transaction that performed the update, and
set codeword-applied in the undo log record to false.
[0124] End Update: Create a redo log record for the update. Find
the undo log record for the update, and determine the updated area
from the undo log. Compute Δ(M) from the undo image in the undo log
and the new value of the updated area, and store it in the redo Δ
field of the redo log record. Add the redo log record for the
update to the redo log, and set the codeword-applied flag in the
undo log record to true.
[0125] Undo Update: For each physical update in the undo log, a
proxy redo log record is generated, whose redo image is the old
value from the undo record. If the codeword-applied flag for the
undo record is true, then a redo record has already been generated
with Δ(M), so Δ(M⁻¹) must be included in the proxy record to
reverse the effect. Δ(M⁻¹) is computed from the current contents of
the region and the undo log record, and stored in the redo Δ of the
proxy log record.
[0126] If the codeword-applied flag is false, then no redo log
record has been generated, and thus no codeword delta will be
applied by the flusher process. Since the codeword has not been
changed and should not be changed, the proxy record is created as
usual and 0 (a special value that results in no change to the
codeword) is stored in its redo Δ.
[0127] In either case, the physical undo log record is processed by
replacing the image in the database with the undo image in the log.
Note that the logical undo actions generate physical updates;
codeword processing is done for these updates as described
above.
[0128] Flush: When flushing a physical redo log record, the flusher
applies the redo Δ(M) from the log record to the codeword for that
page. Note that the system log latch is held for the duration of
the flush. Also, the variable end_of_stable_log is updated to
reflect the amount of log which is known to have made it to
disk.
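A minimal sketch of the flush step under deferred maintenance
follows; the record layout is a hypothetical illustration, and the
caller is assumed to hold the system-log (flush) latch.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t page_id;     /* protected region covered by the record */
        uint32_t redo_delta;  /* Delta(M); 0 for proxy records that must
                                 leave the codeword unchanged */
        /* redo image, offset and length elided */
    } redo_rec_t;

    /* Apply the deltas of the records being flushed to the codeword
     * table. Running under the flush latch serializes all codeword
     * updates, so updaters never take page latches. */
    void flush_apply_deltas(uint32_t codeword_table[],
                            const redo_rec_t *recs, size_t nrecs)
    {
        for (size_t i = 0; i < nrecs; i++)
            codeword_table[recs[i].page_id] ^= recs[i].redo_delta;
        /* end_of_stable_log is advanced once the records reach disk */
    }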
[0129] Auditing
[0130] An algorithm for performing an audit under the deferred
maintenance scheme will now be described with reference to FIG. 4A.
A database is shown in dashed line with some pages 470, 475, and
478 identified and their corresponding codewords 471, 476 and 479
of a codeword table 430. Clients attempt to read/write and maintain
client private logs 440-1, 440-2 and a log for client n (not called
out). A flusher process to disk is shown as flusher process 460,
with a flush latch 461. A checkpoint image 450 is maintained for
changes to pages of the database made by the clients.
[0131] In the logging-based scheme, while the updater's task is
simpler, the job of the auditor has become significantly more
difficult. When the auditor algorithm reads a page, it does so
fuzzily, and partial updates may be captured. Log information must
then be used to bring the page to an update-consistent state, so
that it can be compared with a codeword value. This requires
performing a small-scale version of the Dali recovery algorithm at
checkpoint time. To avoid the expense of executing this algorithm
for the entire database we introduce a fuzzy precheck. The idea of
the precheck is simply to compute the codeword value of the page,
ignoring ongoing updates, and compare that to the codeword value in
the codeword table. If the two match, we assume that the page is
correct. Note that this introduces a slightly increased risk of
missing an error, because a valid, in-progress set of updates which
have not made it to the table might exactly match the effect in the
table of a corruption. However, we consider the probability of this
to be approximately the same as the probability of an error going
undetected by the codeword which is dependent on the codeword
scheme, but is extremely small. We store the set of pages which
fail this precheck test in the set AU_needed.
[0132] We now present the steps taken to audit the database:
[0133] 1. For each page
[0134] (a) Note the value in the codeword table 430 for the page,
for example, page codeword value 471 for page 470 or codeword value
476 for page 475.
[0135] (b) Compute its codeword (without latches or locks).
[0136] (c) Note the value, for example, value 471, in the codeword
table 430 for the page, for example, page 470, again.
[0137] (d) If the computed value does not match either noted value,
add the page, for example, page 470, to AU_needed.
[0138] 2. Note end_of_stable_log into AU_begin.
[0139] 3. Copy pages in AU_needed to the side, for example area
450.
[0140] 4. Extract the trailing physical undo records for in-process
transactions from the active transaction table ATT. Call this
collection of physical records the AU_att. Records are gathered
from different transactions independently, using a latch on the
entry to ensure that operations are not committed by a transaction
while we are gathering its log.
[0141] 5. Get flush latch 461, and execute a flush to cause
codewords from outstanding log records to be applied to the
codeword table 430. Note the new end_of_stable_log in AU_end, and
note the codeword values for all pages in AU_needed into a copy of
the codeword table called AU_codewords. Finally, release the flush
latch 461.
[0142] 6. Scan the system log from AU_begin to AU_end. Physical
redo records which apply to pages in AU_needed are applied to the
side copy 450 of those pages. Also, if the undo corresponding to
this physical redo is in the AU_att, it is removed.
[0143] 7. All remaining physical undo records from AU_att which
affect pages in AU_needed are applied to the checkpoint image
450.
[0144] 8. Compute codewords of each page in AU_needed, and compare
to the value in AU_codewords. If any differ, report that basic
corruption has occurred on this page.
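As an illustration of step 1, the fuzzy precheck might be coded as
below. The db_t layout and the AU_needed set are hypothetical;
noting the table value both before and after the latch-free
computation prevents a concurrent flush from triggering a false
alarm.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t  npages;
        uint32_t  words_per_page;
        uint32_t *codeword_table;  /* one codeword per page */
        uint32_t *words;           /* page p starts at p * words_per_page */
    } db_t;

    /* Step 1: latch-free precheck. A page whose computed codeword
     * matches neither the before nor the after value of its table
     * entry is recorded in AU_needed for the full audit. */
    size_t fuzzy_precheck(const db_t *db, uint32_t *au_needed)
    {
        size_t n = 0;
        for (uint32_t p = 0; p < db->npages; p++) {
            uint32_t before = db->codeword_table[p];          /* (a) */
            uint32_t computed = 0;                            /* (b) */
            const uint32_t *w = db->words + (size_t)p * db->words_per_page;
            for (uint32_t i = 0; i < db->words_per_page; i++)
                computed ^= w[i];
            uint32_t after = db->codeword_table[p];           /* (c) */
            if (computed != before && computed != after)      /* (d) */
                au_needed[n++] = p;
        }
        return n;
    }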
[0145] At the end of the flush in step 5, a certain set of updates
for each page has been applied to the codeword table 430, and this
is captured in AU_codewords. The purpose of the rest of the
algorithm through step 7 is to ensure that the side copy of each
page being checked contains exactly those updates. All prior
updates which may have been partially or completely missing from
the checkpoint image 450 are reapplied by step 6. Since in Dali redo
log records are not immediately placed in the system log, some
updates may have taken place which are recorded only in a
transaction's local redo and undo logs. Note that the codeword
delta for these updates is not reflected in the codeword table 430.
Therefore, these updates are undone in step 7, leaving the side
image of a page consistent with the updates reflected in the
codeword for the page in AU_codewords.
[0146] In this algorithm, individual transactions are blocked long
enough to gather the outstanding physical undo records for the
current operation.
[0147] Read Logging
[0148] In order to detect indirect physical and logical corruption,
we introduce the idea of limited read logging. When a data item is
read (the data item could be a tuple, an index node or any
persistent data structure required for database consistency), a
read log record identifying the start point and the length of the
data that was read is written out. Of course, equivalent schemes,
such as logging an end point and the length of the data read, may
occur to one of ordinary skill in the art. Since
most writes are preceded by reads of the same data, one may assume
that a write of a region is implicitly a read also, and thus
significantly reduce the number of read log records required.
Should an audit indicate that certain data is corrupt, the read log
records can help determine if a transaction has in fact read the
corrupt data. We shall see that the read log records provide a
mechanism for tracing the flow of indirect corruption in the
database and thus determining the set of transactions affected by
an instance of direct corruption.
[0149] In database applications, a write log record is
conventionally maintained for writes to a database. In a preferred
embodiment of the present invention, the write log record is
combined with the presently suggested read log record as a combined
record. In an alternative embodiment, the read log record may be
maintained separately from the write log record, with a possible
loss of efficiency in the recovery algorithms.
[0150] Optionally, a codeword for the data that was read can also
be written out with the read log record. In the event of a crash,
the codeword can be used to detect any corruption which may have
occurred since the last audit, which would not otherwise be
detected. The codeword is also used to detect when exactly an item
was physically corrupted, given that a corruption has been detected
by some other means (either an audit or by external means). For
example, when an audit detects a corrupted protection region it
must be assumed, in the absence of a codeword in the read log
record, that any transaction reading the region since the time of
the previous successful audit read the corrupted data. With the
codeword, a precise check can be made.
[0151] Read Logging, unlike the other techniques presented so far,
provides the benefit of being able to detect the propagation of
logical corruption. When read logging is used for this purpose,
differences detected at the physical level may overestimate the
flow of corruption. In this case, operations may log a checksum of
the logical state found. How exactly to do so depends on the
operation. However, the following property must be satisfied: if
the checksums are the same, then (with high probability) the two
operation executions gave the same logical result. For example, an
index lookup would log a checksum of the index entries that it
retrieved, ignoring physical details such as location in the
index.
[0152] There are two alternative ways of creating the read log
record; the system designer must choose and use only one of
them:
[0153] 1) Physical read logs, which specify the start point and the
length of the data that was read. Additionally the records can
store a checksum of the data that was read, used during corruption
recovery to detect reads of corrupted data.
[0154] 2) Logical read logs, which use lock information for read
logging. When a transaction locks a data item, a log record noting
the lock information (name of the item and type of lock) is written
out. This is done for both read and update locks, so that it is
possible to detect which transaction has read information written
by which other transaction. In the absence of write lock
information, the physical redo log records alone may not be
sufficient to infer which transactions read items written by a
given transaction.
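For concreteness, a physical read log record (alternative 1) might
be laid out as in the following sketch; the field names are
hypothetical.

    #include <stdint.h>

    /* Physical read log record: identifies the data read and,
     * optionally, carries a checksum (codeword) of that data for a
     * precise corruption check during recovery. */
    typedef struct {
        uint64_t txn_id;    /* transaction performing the read */
        uint64_t start;     /* start point of the data read */
        uint32_t length;    /* length of the data read */
        uint32_t checksum;  /* optional codeword of the data read */
    } read_log_rec_t;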
[0155] Generating Checkpoints Free of Corruption
[0156] Since Read Logging supports recovery from indirect
corruption, it becomes crucial that the disk image be free not only
of direct corruption, but indirect corruption as well, so that
recovery does not require loading an archive image. Thus, when
propagating dirty pages from memory to disk, it is not sufficient
to audit the pages being written. Even if none of the dirty pages
has direct physical corruption, it is possible that a "clean" page
has direct corruption, and a transaction carried this corruption
over to a page that was written out. Thus the checkpoint would have
data that is indirectly corrupted.
[0157] With deferred maintenance of codewords or read logging, the
correct way of ensuring that the checkpoint is free of corruption
is to create the checkpoint, and after the checkpoint has been
written out, audit every page in the database. If no page in the
database has direct corruption, no indirect corruption could have
occurred either. We can then certify the checkpoint free of
corruption.
[0158] This technique cannot be directly applied to page flushes in
a disk-based system, since it amounts to auditing all pages in the
buffer cache before any write of a dirty page to disk (at page
steal time). However, a similar strategy can be followed if a set
of pages are copied to the side, and then an audit of all pages is
performed before writing them to the database image on disk. To
ensure that direct physical corruption does not escape undetected,
a clean page which is being discarded to make room for a new page
must also be audited.
[0159] Corruption Spread
[0160] Unlike prevention, the success of corruption detection and
recovery is dependent on the rate at which corruption spreads and
how quickly it is detected. The speed at which direct corruption is
detected depends on the frequency of audits in the system, unless
Read Prechecking is used, in which case the corruption may also be
detected by a failed precheck. The rate of corruption spread may
not be easily quantified, as it depends on two factors which are
dependent on details of DBMS implementation and a particular
application's use of the database.
[0161] First is the probability that an initial corrupt read will
take place in a given time period. The second is the probability
that a new transaction will read data corrupted indirectly by
previous transactions. The probability of an initial corrupt read
varies based on the application workload, internal DBMS
implementation techniques, and which data is directly corrupted.
Certainly, there is some data in any database that will corrupt
almost any transaction that follows, for example, a key or pointer
in the root node of a tree index. Similarly, there may be data that
will cause no damage, such as currently free space, and data which
will cause little damage, such as a text description which is not
used in processing but only for display.
[0162] The second factor, the probability that a new transaction
will read data which is already corrupted, depends on the frequency
of contact between transactions, which again depends on the
application and DBMS implementation. In addition, this probability
will grow over time as more data is indirectly corrupted. Consider
two variants of an application which updates customer information.
In one variant, each transaction updates global summaries of the
total business done with all customers, and in the other variant
such global statistics are computed when they are needed, and
individual transactions touch only one customer. Clearly,
corruption would propagate extremely quickly in the first case, and
much more slowly in the second.
[0163] In the worst case, any transaction would touch data written
by a recent transaction. Any application in which each transaction
both updates and uses a global statistic would exhibit this
worst-case behavior. For example, a database which tracked the
contents of a warehouse might have a summary of how much space is
available, which would be checked for each incoming transaction
before looking for space of the right size, etc. At the other
extreme, transactions may deal with customer accounts with no
overlap between customers (maintaining global statistics as in the
previous example can lead to very poor concurrency). In this case,
corruption would only spread outside of a particular customer
record through DBMS structures, such as index nodes and free space
lists. While corruption could still spread, the spread would be
much slower.
[0164] In summary, the total amount of corruption allowed to arise
in the database is determined by the speed at which the corruption
spreads in a particular application/DBMS pair, and the frequency
with which the data is audited.
[0165] Corruption Recovery
[0166] So far we have seen how to detect direct physical corruption
by means of codeword schemes, and to log information about reads to
detect indirect physical and logical corruption. We now consider
how to recover from corruption once it is detected. Note that these
algorithms for recovery from corruption are tightly integrated with
restart recovery. On detecting an error, we simply note the region
or regions failing the audit, and cause the database to crash.
Corruption recovery is handled as part of the subsequent restart
recovery.
[0167] Cache Recovery Model
[0168] This algorithm is useful to recover from direct corruption
when recovery from indirect corruption is not required. It is
invoked when a precheck fails in the Read Prechecking scheme (as
indirect corruption could not have occurred), or when an audit
detects a codeword error in the Data Codeword scheme (as the Data
Codeword scheme does not address indirect corruption). By ensuring
that directly corrupted data does not get propagated to the disk
image, recovery from a corrupted cache image can be accomplished
with standard techniques. Such a technique is known for ARIES for
use when a page is latched by a process which was unexpectedly
terminated. For Dali, a similar technique can be used to reload the
latest checkpoint and replay physical log records forward.
[0169] Prior-State Recovery Model
[0170] The prior-state model can be used to recover from any form
of corruption, once the source of that corruption is determined,
and the recovery can be accomplished with standard techniques. In
this case, the techniques are similar to media failure. An archive
image of the database must be loaded, and the log played forward to
the latest point known to be prior to the introduction of
corruption. Once that point is reached, the log is truncated, and
restart recovery is executed (possibly generating additional log
records).
[0171] Redo-Transaction Recovery Model
[0172] In order to recover from corruption under the
redo-transaction model for corruption recovery, we require that a
logical redo log is available along with the transaction code
necessary to redo transactions. In particular, the logical log must
be at the transaction level; that is, the transaction's external
inputs (e.g., from the user) are saved. In the following, we assume
the transaction writes this logical description with its commit
record.
[0173] We shall also require that the commit order be the same as
the serialization order; this can be ensured by requiring that
transactions hold all locks (or in multi-level recovery schemes,
all locks at level n-1) till end of transaction.
[0174] Recovery
[0175] Corruption recovery begins as in the prior-state model--the
database is recovered to a transaction-consistent state known to be
prior to the introduction of corruption. The tail of the log not
used by this recovery is saved in OldLog before the log is
truncated, as described above. The re-executed transactions will
read non-corrupted data, and produce non-corrupted results.
[0176] When comparing transactions, we scan the logs in parallel;
if we find that they both have operation begin log records
indicating the operation has a logical checksum, and the checksums
in the operation begin log records are the same, we stop comparing
the checksums on the redo log records generated by the operations.
When we reach the operation commit on both logs, we compare the
logical checksum to see if the operation results were the same.
[0177] Handling External Writes and User Notifications
[0178] Changes in external writes have to be handled manually, and
even changes to the database may require notification to the users
in some cases. Identifying which transactions were corrupted
enables humans to focus on these transactions, thereby reducing the
human workload. Detecting what external writes changed as a result
of re-execution simply requires external writes to be logged. The
log comparison procedure above will then detect changes in external
writes.
[0179] User Notification
[0180] Depending on the nature of the corruption, it may be helpful
to determine which results exposed to the outside world were
corrupt. To accomplish this, any transaction which reports to the
user must log the values output to the user. During re-execution,
these values can be checked against the values logged by the
transaction as it re-executes. If they do not match, the user is
notified.
[0181] Extension: Logical Corruption
[0182] The above algorithm can be easily extended to recover from
logical corruption. This is accomplished by amending OldLog, prior
to recovery, in order to correct an erroneous transaction or
transactions. For instance, we may have fixed errors in the
transaction code, or we may use correct user inputs in place of
wrong user inputs. We may even delete the transaction or replace it
with one or more other transactions. If it is not known when the
direct logical corruption occurred, techniques as described below
can aid in making this determination.
[0183] Delete-Transaction Model
[0184] If a logical log is not available, recovery may take place
under the delete-transaction model discussed above. For our
delete-transaction model recovery algorithm, we need a checkpoint
which is update-consistent in addition to being free from
corruption. However, in Dali, a checkpoint being used for recovery
is not necessarily update-consistent until recovery has completed
(that is, physical changes may only be partially reflected in the
checkpoint image, and certain updates may be present when earlier
updates are not).
[0185] The Dali algorithm to obtain an update-consistent checkpoint
uses a portion of the redo log and ATT to bring the checkpoint to a
consistent state before the anchor is toggled and the checkpoint
made active. Once performed, the checkpoint is update-consistent
with a point in the log, CK_end.
[0186] We will outline how to get a checkpoint that is update
consistent with a specific point CK_end in the redo log, by
applying undo logs. Such a checkpoint has some important additional
properties:
[0187] 1) Effects of redo log records after CK_end are not
reflected in the checkpoint image and
[0188] 2) The checkpoint is codeword consistent as of CK_end.
[0189] As a result, when we replay log records after CK_end,
whenever we encounter a read log record the database state will be
exactly what the read found when the read log record was generated,
unless the database was corrupted. That is, we can identify a point
in the log such that all physical updates noted in the log up to
the point are reflected in the checkpoint, and none of the physical
updates in the log after that point are reflected.
[0190] Such a checkpoint can be obtained by a simple modification
of the Dali checkpointing scheme. We create a checkpoint exactly as
done in Dali, but then perform some physical redos and undos on the
checkpoint image as described below, to make it
update-consistent.
[0191] Recovery Algorithm
[0192] The main idea of the following scheme is that corruption is
removed from the database by refusing during recovery to perform
writes which could have been influenced by corrupt data. In order
to do this, the transactions which performed those writes must, at
the end of the recovery, appear to have aborted instead of
committed. Certain other transactions may also be removed from
history (by refusing to perform their writes) in order to ensure
that these "corrupt" transactions can be effectively removed, and
thus ensuring the final history as executed by the recovery
algorithm is delete-consistent with the original execution as
discussed above.
[0193] Recovery must start from a database image that is known to
be non-corrupt. Note that since errors are only detected during
checkpointing or auditing, we may not know exactly when the error
occurred; the error may have been propagated through several
transactions before being detected. FIGS. 5A and 5B comprise a
flowchart of the recovery algorithm for the delete-transaction
model. The starting point of the algorithm is box 500. The
algorithm conservatively assumes that the error occurred
immediately after Audit_LSN, the point in the log at which the last
clean audit began. Box 505 relates to the repetitive process of
reading log records, and box 515 asks whether we are at Audit_LSN.
If yes, we add all data noted as corrupt by the last audit to the
CorruptDataTable. If not, we enter a redo phase followed by an undo
phase.
[0194] Two tables, a CorruptTransTable and a CorruptDataTable, are
maintained. A transaction is said to have read corrupt data if the
data noted in a read or write log record of that transaction is in
the CorruptDataTable.
[0195] Restart recovery consists of the redo phase followed by the
undo phase. The redo phase begins, at box 520, with the question of
what kind of log record has been encountered.
[0196] Redo Phase
[0197] The checkpointed database is loaded into memory and the redo
phase of the Dali recovery algorithm is initiated, starting the
forward log scan from CK_end.
[0198] During the forward scan, the following steps are taken (any
log record types not mentioned below are handled as during normal
recovery):
[0199] If a read log record or a physical write log record is
found, the path in FIG. 5A to box 525 is followed; if the record
indicates, at box 525, that the transaction has read corrupted
data, then the transaction is added to CorruptTransTable (where it
may already appear) at box 565.
[0200] If a log record for a physical write is found, then there
are two cases to consider:
[0201] 1. The transaction that generated the log record is not in
CorruptTransTable at box 575: In this case, the redo is applied to
the database image as in the Dali recovery algorithm at box
585.
[0202] 2. The transaction that generated the log record is in the
CorruptTransTable (CTT): In this case at box 580, the data it would
have written is inserted into CorruptDataTable (CDT). However, the
data is not updated because the transaction is in the CTT.
[0203] If a begin operation log record is found for a transaction
that is not in CorruptTransTable at box 530, then it is checked
against the operations in the undo logs of all transactions
currently in CorruptTransTable. If it conflicts with one of these
operations, then the transaction is added to CorruptTransTable at
box 560. This ensures that the earlier corrupt transaction can be
rolled back. If it does not conflict, then it is handled as in the
normal restart recovery algorithm at box 555.
[0204] If a logical record, such as a commit operation, commit
transaction or abort transaction, is found, the record is ignored
at box 535 if the transaction that generated the log record is in
CorruptTransTable. Otherwise, the record is handled as in normal
restart recovery at box 550.
[0205] When Audit_LSN is passed at box 515, all data noted to be
corrupt by the last audit, or by analysis of a logical error, is
added to CorruptDataTable at box 570.
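The redo-phase dispatch described above might be sketched as
follows; the record and table types are opaque placeholders, the
helper functions are hypothetical, and the box numbers refer to
FIGS. 5A and 5B.

    #include <stdint.h>

    typedef enum { READ_REC, PHYS_WRITE_REC, BEGIN_OP_REC,
                   COMMIT_OP_REC, COMMIT_TXN_REC, ABORT_TXN_REC } rec_kind_t;
    typedef struct { rec_kind_t kind; uint64_t txn; uint64_t region; } log_rec_t;
    typedef struct ctt ctt_t;   /* CorruptTransTable (opaque) */
    typedef struct cdt cdt_t;   /* CorruptDataTable  (opaque) */
    typedef struct db  db_t;

    extern int  ctt_contains(ctt_t *, uint64_t txn);
    extern void ctt_add(ctt_t *, uint64_t txn);
    extern int  cdt_contains(cdt_t *, uint64_t region);
    extern void cdt_add(cdt_t *, uint64_t region);
    extern int  conflicts_with_corrupt_undo(ctt_t *, const log_rec_t *);
    extern void apply_redo(db_t *, const log_rec_t *);
    extern void handle_as_normal(db_t *, const log_rec_t *);

    /* One step of the forward scan for the delete-transaction model. */
    void redo_phase_step(const log_rec_t *r, ctt_t *ctt, cdt_t *cdt, db_t *db)
    {
        switch (r->kind) {
        case READ_REC:
        case PHYS_WRITE_REC:
            if (cdt_contains(cdt, r->region))    /* read of corrupt data */
                ctt_add(ctt, r->txn);            /* boxes 525/565 */
            if (r->kind == PHYS_WRITE_REC) {
                if (ctt_contains(ctt, r->txn))
                    cdt_add(cdt, r->region);     /* box 580: taint, no write */
                else
                    apply_redo(db, r);           /* box 585 */
            }
            break;
        case BEGIN_OP_REC:
            if (ctt_contains(ctt, r->txn))
                break;                           /* already corrupt */
            if (conflicts_with_corrupt_undo(ctt, r))
                ctt_add(ctt, r->txn);            /* box 560 */
            else
                handle_as_normal(db, r);         /* box 555 */
            break;
        default:                                 /* commit/abort records */
            if (!ctt_contains(ctt, r->txn))
                handle_as_normal(db, r);         /* box 550; else box 535 */
            break;
        }
    }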
[0206] Undo Phase
[0207] At the end of the forward scan, incomplete transactions are
rolled back, following the undo phase of the normal Dali restart
recovery algorithm. As in the normal Dali algorithm, undo of all
incomplete transactions is performed logically, level by level.
Note that at the end of the redo phase, each transaction in
CorruptTransTable has a (possibly empty) undo log. This log records
the actions taken by the transaction before it first read corrupted
data. During the undo phase at box 590, these portions of the
corrupt transactions are undone along with the transactions which
were in progress at the time of the crash.
[0208] Checkpoint
[0209] The extended restart recovery algorithm is completed by
performing a checkpoint at box 595, to ensure that recovery
following any further crashes will find a clean database free of
corruption, and to avoid the insertion of records into the log
during the rollback of a corrupt transaction. If the checkpoint
were not performed, a future recovery might rediscover the same
corruption, and in fact additionally declare transactions that
started after this recovery phase to also be corrupted. Note that
this checkpoint invalidates all archives. The log may be amended
during recovery to avoid this problem.
[0210] Discussion
[0211] The database image at the end of the above algorithm should
reflect a delete history that is consistent (in this case,
conflict-consistent) with the original transaction history. To see
(informally) that this is the case, first observe that all
top-level reads of non-deleted transactions read the same value in
the history played during recovery as in the original history. This
is because any data that could possibly have been read with
different values was previously placed in CorruptDataTable, and
top-level reads must be implemented in terms of physical-level
reads where corruption is tracked. The second observation is that
the database image is consistent and contains the original contents
plus the writes of those transactions which do not appear in the
delete set. This follows from the correctness of the original
recovery algorithm, and the fact that the initial portion of
corrupted transactions can be rolled back during the undo phase
along with normal incomplete transactions to produce a consistent
image. This is ensured since we do not allow any subsequent
operations which conflict with these operations to begin.
[0212] Extension: Codewords in Read Log Records
[0213] If codewords are stored in read log records, then detection
of indirect corruption becomes more precise. In particular, the
CorruptDataTable can be dispensed with, and instead, the definition
of reading corrupt data given above is replaced by the following
two cases:
[0214] 1. A codeword is stored in a read log record, and it does
not match the computed codeword for the corresponding region in the
database being recovered.
[0215] 2. A codeword is stored in a write log record (indicating
that it should be treated as a read followed by a write) and the
codeword does not match the computed codeword for the corresponding
region in the database.
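Under this extension, the test for reading corrupt data reduces to a
direct comparison, as in the following sketch (the record layout and
the region_words helper are hypothetical).

    #include <stdint.h>

    typedef struct {
        int      has_codeword;
        uint32_t codeword;   /* codeword logged with the read or write */
        uint64_t start;      /* region read (a write implies a read) */
        uint32_t nwords;
    } rw_log_rec_t;

    typedef struct db db_t;
    extern const uint32_t *region_words(const db_t *, uint64_t start);

    /* Returns nonzero if the logged codeword fails to match the
     * codeword computed over the image being recovered. */
    int record_read_corrupt(const rw_log_rec_t *rec, const db_t *db)
    {
        if (!rec->has_codeword)
            return 0;
        uint32_t cw = 0;
        const uint32_t *w = region_words(db, rec->start);
        for (uint32_t i = 0; i < rec->nwords; i++)
            cw ^= w[i];
        return cw != rec->codeword;
    }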
[0216] A second benefit of storing codewords in read log records is
that it is possible to detect physical corruption which occurred
after the last audit but before the crash. More precisely,
physically corrupt data will be detected if any transaction read
it, since during recovery the codeword for these transactions will
not match the database image being recovered. Thus, if codewords
are present, the corruption recovery algorithm should be executed
not only when an error is detected, but also on every restart.
[0217] Note that the modified algorithm produces a recovery
schedule which is view-consistent with the original history, thus
not propagating corruption when a corrupt transaction wrote the
same data to a data item as it would have in the delete-history.
However, these benefits must be traded against a slight degradation
in performance (see the performance section below for details).
[0218] Extension: Recovering from Logical Corruption
[0219] The algorithm given above can be easily adapted to recover
from logical corruption. This corruption may have occurred before
the last checkpoint; thus an archive image that was taken prior to
the introduction of the error is restored, and CK_end (the point at
which recovery starts) is taken from that archive. Audit_LSN is
defined to be the latest point in the log such that the corruption
is known to have occurred after that point. If this point is not
known, backwards analysis techniques discussed below can be used to
help locate it. Finally, when the point Audit_LSN is passed during
recovery, any data which was directly affected by the corrupting
error must be added to CorruptDataTable.
[0220] Logical Corruption
[0221] Approaches to recovering from logical corruption have
already been given in the context of the redo-transaction and
delete-transaction models. In this section, we discuss some issues
involved in determining when logical corruption was introduced, and
outline an efficient variant of the redo-transaction recovery
scheme for a restricted recovery model.
[0222] The techniques described so far do not consider transactions
with faulty code that updated the database erroneously, but through
the prescribed interface. Similarly, transactions that were
executed incorrectly, for instance with wrong user input, have not
been considered so far herein. We call such transactions erroneous
transactions. Of course, it is not possible to prevent erroneous
transactions, since users as well as programmers can always make
mistakes.
[0223] Detecting that there has been an erroneous transaction
execution is quite non-trivial, and cannot be done automatically by
the system, since the differences between good and erroneous
transaction executions are at a semantic level above what the
database may be aware of. Integrity constraints may help detect
some such errors, but cannot detect all.
[0224] We assume that humans have somehow become aware that an
error has occurred. For instance, a customer may complain of a
wrong balance, or an operator may discover that a transaction he
ran yesterday ought not to have been executed.
[0225] We have two models of error detection. In the first, it is
known exactly which transaction originally caused the error, or
which was the first erroneous transaction; in this case, only
forward recovery is required. An example is the case where the
operator realizes a transaction he ran ought not to have been
executed. The second model, in which the source of the error is not
known, is addressed by the backward analysis described below.
[0226] Once the cause of the error is known, we can find the latest
consistent checkpoint (archive image) prior to executing the
erroneous transaction, and perform recovery using one of the
algorithms given in our corruption recovery section.
[0227] If recovering under the redo-transaction model, however, we
do not simply re-execute erroneous transactions, but re-execute a
corrected version of the transaction. For instance we may have
fixed errors in the transaction code, or we may use correct user
inputs in place of wrong user inputs. We may even delete the
transaction or replace it with one or more other transactions.
[0228] Backward Analysis--Determining the Source of Corruption
[0229] Even if it is known that logical corruption has occurred, it
may not be known which transaction initially introduced the error.
For example, a customer has detected a wrong balance but does not
know what caused it. Here, two steps are required: detecting the
root cause, and then forward recovery from there.
[0230] The database system cannot automatically detect the root
cause, but can provide support to humans to detect the problem by
supporting backward analysis of the log using the read log
records.
[0231] The following algorithm is applicable when certain data are
known to be corrupt, and the source (originating transaction) for
the corruption is sought. This algorithm assumes that the database
log has been enhanced with logs of data items read using the read
logging techniques described earlier. The database log contains a
record of both writes and reads, and each write is assumed to imply
a prior read of the item. The idea is to trace backward through the
log, tracking how corruption could have flowed into the
known-corrupt data items.
[0232] Assume there are n data items known to be corrupt. Let
SuspectData be a set of data suspected of being corrupt. Associate
with each suspect data item D in SuspectData a set of integers from
1 . . . n, A(D), which represents the known-corrupt data which
could have been affected by corruption if D itself were found to be
corrupt. Let SuspectTrans be a set of suspect transactions, each
with an associated set A(T) whose semantics are analogous to those
of A(D): if i is in A(T), then if transaction T has read corrupt
data or was itself the source of direct logical corruption, the
known corruption of data item i could be explained as resulting
from this error. The goal is to find a single transaction which
could explain all the known corruption, that is, one for which all
the items 1 . . . n are in the set A(T).
[0233] In our algorithm, we would process the log backwards; that
is, for each log record, do:
[0234] If the log record is a write of data item D by transaction T
and if D is in SuspectData, add T to SuspectTrans (if not already
present), and set A(T) equal to A(D) ∪ A(T).
[0235] If the log record is a read of data item D by transaction T
and if T is in SuspectTrans, add D to SuspectData (if not already
present) and set A(D) to be A(D) ∪ A(T).
[0236] If any A(T) contains all elements from 1 . . . n, offer
transaction T to the user as a possible root cause of the
corruption. If the user accepts, the algorithm terminates; else it
continues.
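A sketch of the backward scan follows, assuming n ≤ 64 so that the
sets A(D) and A(T) fit in bitmasks; the map type, log record layout
and user-interaction hook are hypothetical, and map_get is assumed
to return 0 for absent keys.

    #include <stdint.h>
    #include <stddef.h>

    typedef enum { READ_REC, WRITE_REC } rec_kind_t;
    typedef struct { rec_kind_t kind; uint64_t txn; uint64_t item; } log_rec_t;
    typedef struct map map_t;                 /* item/txn -> bitmask */

    extern int      map_has(map_t *, uint64_t key);
    extern uint64_t map_get(map_t *, uint64_t key);   /* 0 if absent */
    extern void     map_put(map_t *, uint64_t key, uint64_t mask);
    extern int      user_accepts_candidate(uint64_t txn);

    /* Scan the log newest-to-oldest. a_data holds A(D), seeded with a
     * singleton bit for each of the n known-corrupt items; a_trans
     * holds A(T). Stops when a transaction explains all corruption
     * and the user accepts it as the root cause. */
    void backward_analysis(const log_rec_t *log, size_t nrecs,
                           map_t *a_data, map_t *a_trans,
                           uint64_t all_known)  /* bits 0..n-1 set */
    {
        for (size_t i = nrecs; i-- > 0; ) {
            const log_rec_t *r = &log[i];
            if (r->kind == WRITE_REC && map_has(a_data, r->item)) {
                uint64_t a = map_get(a_trans, r->txn) |
                             map_get(a_data, r->item);  /* A(T) u A(D) */
                map_put(a_trans, r->txn, a);
                if (a == all_known && user_accepts_candidate(r->txn))
                    return;                     /* root cause accepted */
            } else if (r->kind == READ_REC && map_has(a_trans, r->txn)) {
                map_put(a_data, r->item,
                        map_get(a_data, r->item) |
                        map_get(a_trans, r->txn));      /* A(D) u A(T) */
            }
        }
    }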
[0237] Now we will discuss an example where the corruption
detection and recovery algorithms of the present invention may be
applied to advantage. Consider an example where five customers of a
bank, Bank A, call over the course of a day and complain that their
balance is incorrect. A corruption detection algorithm according to
the present invention will proceed by processing the log backwards
from the latest time when a bad balance is reported. It may soon be
found that a transaction which added interest to accounts had
recently updated all five accounts. This transaction has an
associated set, A(T), which contains all five known corrupt data
items, and the transaction will be presented to a user of the
system as a potential source of corruption. The user upon examining
the transaction may determine, however, that the interest rate was
computed correctly, and thus prompt the algorithm to continue
searching for an explanation for the corruption.
[0238] Subsequently, it may be found that earlier in the log of
events, updates to the third and fifth complaining customers'
balances had been made due to an ATM (automatic teller machine)
withdrawal at Bank B. Since the transaction that posted these
withdrawals would read the main data record for Bank B, that record
would be added to the SuspectData set. If subsequently in
processing (earlier in time) it is discovered that the other
complaining customers' accounts had been subject to withdrawals
from ATM's owned by the same Bank B, which also read the main data
record for Bank B, then the A(D) associated with the main data
record for Bank B (D), would contain all known corrupt data items.
The next (earlier) transaction to update the main data record for
Bank B would be presented to the user as a potential source of
corruption. In this example, it may turn out that the amount of
money charged by Bank B as an overhead for using their ATM had been
incorrectly entered into the database, resulting in the eventual
incorrect balances. A user may use any one of the corruption
recovery algorithms described herein to help determine which other
customers are affected by this incorrect update, as well as other
data items which may be affected indirectly such as payments of
these charges to Bank B.
[0239] As written, the detection algorithm assumes that all the
known corruption originated from a single source. However, the
algorithm can be easily modified to consider multiple sources of
corruption. For example, it could report to the user whenever two
transactions together explained all the corruption, that is, when
the union of A(T) and A(T1) contained all the data elements 1 . . .
n, where T and T1 are two transactions in the SuspectTrans table.
[0240] Optimizing Recovery in the Redo-Transaction Model
[0241] Unfortunately, under the redo-transaction model of
corruption recovery, the logical re-execution of a transaction may
take time similar to its original execution; if an error is
discovered after several days, it can reasonably be expected to
take several days to recover the database. During this time the
database would be unavailable.
[0242] Referring to FIG. 6, we now outline an alternative approach
to implementing the redo-transaction model which assumes a
two-level recovery model with logical or physiological redo
logging. Thus, record-level operations are logged, record-level
locks are held for the duration of the transaction, and latches are
held on pages for the duration of an operation. The goal of the
algorithm is to use primarily the log for recovery, only
re-executing a select few transactions.
[0243] In the algorithm given below, we may use a CorruptDataTable
as in the delete-transaction model recovery algorithm, or use
logical read logging as described above. A transaction is said to
read corrupted data if:
[0244] It reads data in the CorruptDataTable, if such a table is
used; or
[0245] The logical codeword is computed for an operation during
recovery, and it does not match a codeword recorded in the log.
[0246] In addition to the actions normally taken during recovery,
the corruption recovery algorithm proceeds by processing records as
follows:
[0247] Save log records per step 601 for a transaction until a
commit or an abort for the transaction is seen.
[0248] If the commit log record for a transaction is encountered
via path 605, then
[0249] 1) The read log records for the transaction are scanned at
step 610 to determine if the transaction has read corrupted data.
If so, the transaction is marked as corrupt at step 615.
[0250] 2) If the transaction is marked as corrupt, it is
re-executed logically, and the new logical redo records are used to
replace its redo records in the log at step 620. Any data which the
original transaction wrote, or was written during the re-execution,
is added to the CorruptDataTable, if one is used.
[0251] 3) If the transaction is not marked as corrupt, its log
records are executed at step 625.
[0252] 4) If an abort record for a transaction is found at step
630, the log records for that transaction are discarded at step
635.
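The commit-point processing of steps 601 through 635 might be
sketched as follows; all types and helper functions are hypothetical
placeholders.

    typedef struct txn_log txn_log_t;  /* saved log records of one txn */
    typedef struct cdt cdt_t;          /* CorruptDataTable (opaque) */
    typedef struct db  db_t;

    extern int        read_set_touches(const txn_log_t *, const cdt_t *);
    extern txn_log_t *reexecute_logically(db_t *, const txn_log_t *);
    extern void       replace_redo_records(db_t *, const txn_log_t *old_t,
                                           const txn_log_t *new_t);
    extern void       cdt_add_writes(cdt_t *, const txn_log_t *);
    extern void       execute_log_records(db_t *, const txn_log_t *);

    /* Called when a commit record is reached (path 605); aborted
     * transactions are simply discarded (steps 630/635). */
    void process_committed_txn(txn_log_t *t, cdt_t *cdt, db_t *db)
    {
        if (read_set_touches(t, cdt)) {              /* steps 610/615 */
            txn_log_t *redone = reexecute_logically(db, t);  /* step 620 */
            replace_redo_records(db, t, redone);
            cdt_add_writes(cdt, t);       /* data the original wrote */
            cdt_add_writes(cdt, redone);  /* data written on re-execution */
        } else {
            execute_log_records(db, t);              /* step 625 */
        }
    }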
[0253] Since the redo records are logical and record-level locks
are held to the end of transaction, executing the log records at
the point where the commit record appears is correct. Note that
when a transaction is re-executed, it could generate different log
records. In fact, the operations it performs may be completely
different from the ones it originally performed. Since transactions
are executed at their commit point, this new transaction will
serialize with respect to other transactions which might have
originally executed concurrently with it. It will read the data as
written by any transaction which serialized before it, and if its
actions cause any new conflicts with transactions that serialize
after it, these transactions will read corrupt data and be
re-executed themselves.
[0254] Performance
[0255] The goal of our performance study was to compare the
relative cost of different levels of protection, for example
detection versus prevention, as well as comparing different
techniques for obtaining the same level of protection. In each
case, we are interested in the impact of the scheme on normal
processing as opposed to the time taken for recovery. Corruption
recovery is expected to be relatively rare, and the time required
is highly dependent on the application and workload. The algorithms
studied were implemented in the DataBlitz Storage Manager, a
storage manager being developed at Bell Labs based on the Dali main
memory storage manager.
[0256] Performance of mprotect
[0257] Before describing the comparison of schemes, we begin by
looking at the relative performance of memory protection primitives
on commonly available UNIX hardware. In the Table below, we
evaluate the basic performance of the memory protection feature on
a number of hardware platforms locally available to us. In each
case, 2000 pages were protected and then unprotected, and this was
repeated 50 times. The number reported is the average number of
these pairs of operations which were accomplished per second.
[0258] As shown in this table, the UltraSPARC on which our
benchmarks were run is significantly faster at memory protection
than the other UNIX platforms to which we had access, leading us to
expect that hardware protection may fare worse on other
platforms.
Table 1: Performance of Protect/Unprotect

    Platform              Pairs/second
    SPARCstation 20             15,600
    UltraSPARC 2                43,000
    HP 9000 C110                 3,300
    SGI Challenge DM             8,200
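For concreteness, a microbenchmark along the lines used to produce
Table 1 might be written as follows. This is a minimal sketch using
the standard POSIX mmap and mprotect calls; the timing method shown
(the C clock call) is an assumption, not the instrumentation
actually used.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const int npages = 2000, reps = 50;
        long pagesz = sysconf(_SC_PAGESIZE);

        /* Map npages of anonymous memory to protect and unprotect. */
        char *buf = mmap(NULL, (size_t)npages * pagesz,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        clock_t start = clock();
        for (int rep = 0; rep < reps; rep++) {
            for (int p = 0; p < npages; p++) {
                /* One protect/unprotect pair per page. */
                mprotect(buf + p * pagesz, pagesz, PROT_READ);
                mprotect(buf + p * pagesz, pagesz,
                         PROT_READ | PROT_WRITE);
            }
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%.0f protect/unprotect pairs per second\n",
               npages * (double)reps / secs);
        return 0;
    }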
[0259] Implementing Read Logging and Prechecking
[0260] Computation of codeword values for reads takes place in two
contexts: as part of a codeword precheck or when computing the
codeword to be stored in a read log record. In general,
implementation difficulty must be traded against the window of
potentially undetected corruption. To minimize the window for
undetected corruption, a codeword computation should be made after
the read has occurred. If the computation is made before the read,
direct corruption could take place in the window between the
computation of the codeword and the read itself.
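The following is a minimal sketch of such a codeword check,
assuming a simple word-wise XOR codeword; the actual codeword
function, and the means of locating the stored codeword for a
protection domain, may differ in a given implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* Compute a simple word-wise XOR codeword over a region. The
     * codeword function is an assumption for this sketch; any cheap,
     * incrementally maintainable codeword would serve. */
    uint32_t compute_codeword(const uint32_t *region, size_t nwords)
    {
        uint32_t cw = 0;
        for (size_t i = 0; i < nwords; i++)
            cw ^= region[i];
        return cw;
    }

    /* Precheck a read: recompute the codeword of the region and
     * compare it with the codeword maintained for its protection
     * domain. Called after the read, per the discussion above, to
     * minimize the undetected-corruption window. */
    int precheck_read(const uint32_t *region, size_t nwords,
                      uint32_t stored_codeword)
    {
        return compute_codeword(region, nwords) == stored_codeword;
    }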
[0261] In the case of prechecking, if a read is performed, and a
write is made based on that read before the codeword for the read
is checked, then this write can potentially cause
transaction-carried corruption, which the Read Prechecking scheme
is designed to prevent. More precisely, such a write can be
allowed, but cannot be visible to another transaction, and it must
be possible to undo the write physically, as any logical undo based
on a corrupt read cannot be trusted.
[0262] For example, in a page-based system, one approach is to
perform the read precheck when a page is Fix'ed in a shared mode,
which leaves some window between the computation and the read in
which undetected corruption could occur. Alternatively, the
precheck can occur at UnFix time, introducing the possibility that
a write is issued which is based on the Fix'ed data before the
UnFix takes place, in which case provision must be made to
physically undo that write if an error is detected at UnFix
time.
[0263] For read logging, the read log record must appear in the log
before any subsequent writes so that the recovery algorithms will
consider the writes as suspect. In this case, it is simpler to
compute the codeword at Fix time.
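A sketch of this ordering follows; the record layout and the
append_log helper are hypothetical. The point is simply that the
read log record, carrying the codeword of the data as read, is
appended before any log records for writes that depend on that
read.

    #include <stdint.h>
    #include <stddef.h>

    extern uint32_t compute_codeword(const uint32_t *region,
                                     size_t nwords);
    extern void append_log(const void *rec, size_t len); /* hypothetical */

    /* A read log record carrying the identity of the data read and
     * the codeword of its contents as read. The layout is
     * illustrative only. */
    typedef struct {
        int      type;       /* read-record tag */
        int      txn_id;
        long     item_id;    /* identity of the data read */
        uint32_t codeword;   /* codeword of the data as read */
    } ReadLogRecord;

    void log_read(int txn_id, long item_id,
                  const uint32_t *data, size_t nwords)
    {
        ReadLogRecord rec;
        rec.type     = 0;  /* read-record tag */
        rec.txn_id   = txn_id;
        rec.item_id  = item_id;
        rec.codeword = compute_codeword(data, nwords);
        /* The read record must reach the log before any records for
         * writes based on this read, so that recovery treats those
         * writes as suspect. */
        append_log(&rec, sizeof rec);
    }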
[0264] A conservative solution, which may have a significant
performance cost, is to copy the information to be read into
private space and perform the test and subsequent reads on this
copy. In the page-based example, this would happen at Fix time.
[0265] To eliminate any window between checking the codeword for
some data and then using it, an explicit call to perform the check
must be inserted between reads and any writes which are possibly
exposed to another transaction. For example, if pages are fixed in
read mode for the duration of a complicated action, then some
analysis of the code will be required to add the additional
calls.
[0266] In our implementation, which is not page based, we assumed
that any write was also a read, so the codeword was checked during
beginUpdate calls. We added explicit check calls before all reads
of persistent data which were not part of a write. When a codeword
is included with the read log record, these codewords were added to
the write log records as well as to the read log records. Thus,
calls for each read which was not part of a write were added to the
portions of the code exercised by our performance tests, including
the allocation code, the relation manager, and one index structure,
a hash table.
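In outline, the convention just described might look as follows;
begin_update, stored_codeword_for and handle_corruption are
illustrative names for this sketch, not the DataBlitz interfaces.

    #include <stdint.h>
    #include <stddef.h>

    extern int precheck_read(const uint32_t *region, size_t nwords,
                             uint32_t stored_codeword); /* sketched earlier */
    extern uint32_t stored_codeword_for(const void *item); /* hypothetical */
    extern void handle_corruption(const void *item);       /* hypothetical */

    /* Explicit check inserted before reads that are not part of a
     * write. */
    void check_before_read(const uint32_t *item, size_t nwords)
    {
        if (!precheck_read(item, nwords, stored_codeword_for(item)))
            handle_corruption(item);
    }

    /* Every write is treated as also being a read, so the codeword
     * is checked on beginUpdate. */
    void begin_update(uint32_t *item, size_t nwords)
    {
        check_before_read(item, nwords);
        /* ... log the update, maintain the data codeword, etc. ... */
    }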
[0267] Workload
[0268] The workload examined is a single process executing TPC-B
style transactions. The database consists of four tables, Branch,
Teller, Account, and History, each with 100 bytes per record. Our
database contained 100,000 accounts, with 10,000 tellers and 1,000
branches. These ratios are higher than in TPC-B, in order to limit
CPU caching effects on the smaller tables. The benchmarks were run
on an UltraSPARC with two 200 MHz processors and 1 gigabyte of
memory. All tables are in memory during each run, with logging and
checkpointing ensuring recoverability. In each run, 50,000
operations were done, where an operation consists of updating the
(non-key) balance fields of one account, teller and branch, and
adding a record to the history table. Transactions were committed
after 500 operations, so that commit times do not dominate. The
alternative was to design a highly concurrent test with group
commits, introducing a great deal of complexity and variability
into the test. Each test was run six times, and the results
averaged. The results are reported in terms of number of operations
completed per second.
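In outline, the driver loop for this workload might be sketched as
follows; the transaction and table-access helpers are hypothetical,
and the uniform random key selection shown is an assumption.

    #include <stdlib.h>

    /* Hypothetical transaction and table-access interfaces. */
    extern void begin_txn(void);
    extern void commit_txn(void);
    extern void update_balance(const char *table, long key,
                               double delta);
    extern void append_history(long acct, long teller, long branch,
                               double delta);

    #define N_OPS        50000   /* operations per run          */
    #define OPS_PER_TXN    500   /* commit every 500 operations */

    void run_workload(void)
    {
        begin_txn();
        for (int i = 0; i < N_OPS; i++) {
            long acct   = rand() % 100000;  /* 100,000 accounts */
            long teller = rand() % 10000;   /* 10,000 tellers   */
            long branch = rand() % 1000;    /* 1,000 branches   */
            double delta = (double)(rand() % 1000 - 500);

            /* Update the non-key balance fields and add a history
             * record. */
            update_balance("Account", acct,   delta);
            update_balance("Teller",  teller, delta);
            update_balance("Branch",  branch, delta);
            append_history(acct, teller, branch, delta);

            if ((i + 1) % OPS_PER_TXN == 0) {
                commit_txn();
                begin_txn();
            }
        }
        commit_txn();
    }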
[0269] Prechecking and Protection Domain Size
[0270] The table below reports operations per second and percent
slowdown for protection domain sizes from 64 bytes to 8K bytes.
Table 2: Prechecking Cost by Protection Domain Size

    Size (bytes)    Ops/Sec    % Slower
    None                417        0%
    64                  366       12.2%
    128                 348       16.5%
    256                 329       21.1%
    512                 311       25.4%
    1024                277       33.5%
    8192                115       72.4%
[0271] Precheck Domain Sizes
[0272] Before presenting general results, we discuss a tradeoff in
the implementation of the Read Prechecking algorithm. The Read
Prechecking algorithm verifies each read by computing the codeword
of the regions which the read intersects. Since one codeword is
stored for each protection domain, the size of the region leads to
a time-space tradeoff for this scheme. We present the performance
of Read Prechecking with Data Codeword maintenance for a variety of
sizes of protection domains from 64 bytes to the 8K page size of
our machine. With small protection domains, this scheme
performs well, but may add 3%-6% to the space usage of the
database. The scheme breaks even with hardware protection at about
1K protection domains. These results are shown in Table 2
above.
[0273] Results
[0274] The table below provides our results in terms of the cost of
corruption protection for the various schemes discussed herein:
Table 3: Cost of Protection

                                    Physical Corruption     Logical Corruption
    Algorithm                       Direct      Indirect    Indirect              Ops/Sec    % Slower
    Baseline                        None        None        None                      417       0%
    Data CodeWord (CW)              Correct     None        Nothing                   380       8.5%
    Data CW w/Precheck, 64 byte     Correct     Prevent     Nothing                   366      12.2%
    Data CW w/ReadLog               Correct     Correct     Correct                   345      17.1%
    Data CW w/CW ReadLog            Correct     Correct     Correct                   323      22.4%
    Data CW w/Precheck, 512 byte    Correct     Prevent     Nothing                   311      25.4%
    Memory Protection               Prevent     Unneeded    Nothing                   257      38.2%
    Data CW w/Precheck, 8K byte     Correct     Prevent     Nothing                   115      72.4%
[0275] In the above Cost of Protection Table, a representative
selection of the algorithms discussed herein is shown, along with
the average number of operations per second each algorithm achieved
in our tests and its slowdown relative to the baseline, which is
the system running with no corruption protection. Our experiments
show that detection of direct corruption can be achieved very
cheaply, at about an 8% cost, with simple data codeword protection.
The choice among these algorithms can be made on ease of
implementation in a particular system. Prechecking with a small
domain size is economical at a 12% cost, depending on the
acceptability of a 6% space overhead. Read logging lowers the space
overhead but raises the cost to 17%, which is significant but may
be worthwhile, since automatic support for repairing the database
can then be employed and the results of erroneous transactions can
be tracked. Logging the checksum of the data read, which increases
the accuracy of the corruption recovery algorithms, adds 5% to the
cost, bringing it to 22%. Memory protection using the standard
mprotect call costs 38%, more than double the performance hit of
codeword protection with read logging. Finally, prechecking with
large domain sizes fares very poorly.
[0276] Our conclusion from these results is that some form of
codeword protection should be implemented in any DBMS in which
application code has direct access to database data. Detection of
direct corruption is quite cheap and, limited as it is, is still
far better than allowing corruption to remain undetected in the
database. Other levels of protection may be implemented or offered
to users so that they may make their own safety/performance
tradeoff.
[0277] Thus there has been described a variety of schemes for
preventing or detecting physical corruption using codewords, and
for tracing and recovering from physical and logical corruption
using read logging. Finally, a performance study comparing
alternative techniques of corruption detection and recovery
demonstrated the utility and practicality of the present invention
involving codewording and read logging. Protection from direct
physical corruption is economical to implement, transaction-carried
corruption can be prevented cheaply if enough space is available
for small protection domains, and detection of transaction-carried
corruption for later correction through read logging imposes about
a 17% cost on update transaction performance. The technique of the
present invention opens up interesting possibilities in tracing
logical errors through the database system and aiding in their
correction. The new schemes are shown to be significantly cheaper
than using the memory protection and unprotection provided by UNIX
around every update.
[0278] Our techniques may deal with logical corruption and handle
errors which are not caught by integrity constraints. Our
techniques for handling physical corruption will be of increasing
importance since applications are increasingly being provided
direct access to persistent data. Since limited protection is very
cheap, we believe implementors of database systems in which
application code has direct access to database buffers should
provide some form of protection, at least as an option for users.
Our techniques have been shown to be highly portable, and use only
simple integer operations which will be efficient on all modern
processors.
[0279] In summary, then, the present invention may find application
in the efficient recovery from physical or logical corruption,
off-line generation of consistent checkpoints to be used to check
global integrity constraints, fault-induction tests of logical
corruption recovery, and implementation of these techniques in
other database management systems. The present invention may be
implemented in a real-time environment, for example, a
communications environment where the busy state of communications
channels, circuits and the like are preserved in a central (at a
switch or server) or distributed database management system. All
patent applications and articles referenced herein should be deemed
to be incorporated by reference as to their entire contents. The
present invention should only be deemed to be limited in scope by
the claims which follow.
* * * * *