U.S. patent application number 12/129665 was filed with the patent office on 2009-12-03 for backup coordinator for distributed transactions.
Invention is credited to Mark Cameron Little.
Application Number | 20090300405 12/129665 |
Document ID | / |
Family ID | 41381316 |
Filed Date | 2009-12-03 |
United States Patent
Application |
20090300405 |
Kind Code |
A1 |
Little; Mark Cameron |
December 3, 2009 |
BACKUP COORDINATOR FOR DISTRIBUTED TRANSACTIONS
Abstract
A primary coordinator generates a prepare message for a
two-phase commit distributed transaction, the prepare message
including an address of a backup coordinator. The primary
coordinator maintains a transaction log of the distributed
transaction, wherein the transaction log is accessible to both the
primary coordinator and the backup coordinator. The prepare message
is sent to a plurality of participants. The primary coordinator
fails over to the backup coordinator without interrupting the
distributed transaction.
Inventors: |
Little; Mark Cameron;
(Ebchester, GB) |
Correspondence
Address: |
RED HAT/BSTZ;BLAKELY SOKOLOFF TAYLOR & ZAFMAN LLP
1279 OAKMEAD PARKWAY
SUNNYVALE
CA
94085-4040
US
|
Family ID: |
41381316 |
Appl. No.: |
12/129665 |
Filed: |
May 29, 2008 |
Current U.S.
Class: |
714/3 ;
714/E11.071 |
Current CPC
Class: |
G06F 11/203 20130101;
G06F 11/2038 20130101; G06F 11/2028 20130101 |
Class at
Publication: |
714/3 ;
714/E11.071 |
International
Class: |
G06F 11/20 20060101
G06F011/20 |
Claims
1. A computer implemented method comprising: generating a prepare
message for a two-phase commit distributed transaction by a primary
coordinator, the prepare message including an address of a backup
coordinator; maintaining a transaction log of the distributed
transaction, wherein the transaction log is accessible to both the
primary coordinator and the backup coordinator; sending the prepare
message to a plurality of participants; and failing over to the
backup coordinator without interrupting the distributed
transaction.
2. The method of claim 1, further comprising: exchanging a
heartbeat message between the primary coordinator and the backup
coordinator; and failing over to the backup coordinator if the
primary coordinator fails to respond to the heartbeat message from
the backup coordinator.
3. The method of claim 1, further comprising: failing over to the
backup coordinator upon receiving a transaction status inquiry by
the backup coordinator from at least one of the plurality of
participants; and responding to the transaction status inquiry by
the backup coordinator.
4. The method of claim 1, further comprising: accessing the
transaction log by the backup coordinator to determine a current
status of the distributed transaction; and completing the
distributed transaction by the backup coordinator.
5. The method of claim 4, wherein the primary coordinator writes to
the transaction log if a commit response is received from each of
the plurality of participants, the method further comprising:
directing, by the backup coordinator, the plurality of participants
to roll back the distributed transaction if there is no entry in
the transaction log for said distributed transaction; and
directing, by the backup coordinator, the plurality of participants
to commit the distributed transaction if there is an entry in the
transaction log for said distributed transaction.
6. The method of claim 1, wherein the backup coordinator is remote
from the primary coordinator.
7. A computer implemented method comprising: receiving a prepare
message by a participant of a two-phase commit distributed
transaction from a primary coordinator of the distributed
transaction, the prepare message including an address of a backup
coordinator; sending a commit or abort message to the primary
coordinator; sending a backup transaction status inquiry message to
the backup coordinator using the address after failing to receive a
subsequent message from the primary coordinator within a specified
time period; and receiving a commit or roll back message from the
backup coordinator.
8. The method of claim 7, further comprising: sending a primary
transaction status inquiry message to the primary coordinator; and
timing out a timer before receiving a response to the primary
transaction status inquiry message.
9. The method of claim 7, wherein the backup coordinator is remote
from the primary coordinator.
10. A computer readable medium including instructions that, when
executed by a processing system, cause the processing system to
perform a method comprising: generating a prepare message for a
two-phase commit distributed transaction by a primary coordinator,
the prepare message including an address of a backup coordinator;
maintaining a transaction log of the distributed transaction,
wherein the transaction log is accessible to both the primary
coordinator and the backup coordinator; sending the prepare message
to a plurality of participants; and failing over to the backup
coordinator without interrupting the distributed transaction.
11. The computer readable medium of claim 10, the method further
comprising: exchanging a heartbeat message between the primary
coordinator and the backup coordinator; and failing over to the
backup coordinator if the primary coordinator fails to respond to
the heartbeat message from the backup coordinator.
12. The computer readable medium of claim 10, the method further
comprising: failing over to the backup coordinator upon receiving a
transaction status inquiry by the backup coordinator from at least
one of the plurality of participants; and responding to the
transaction status inquiry by the backup coordinator.
13. The computer readable medium of claim 10, the method further
comprising: accessing the transaction log by the backup coordinator
to determine a current status of the distributed transaction; and
completing the distributed transaction by the backup
coordinator.
14. The computer readable medium of claim 13, wherein the primary
coordinator writes to the transaction log if a commit response is
received from each of the plurality of participants, the method
further comprising: directing, by the backup coordinator, the
plurality of participants to roll back the distributed transaction
if there is no entry in the transaction log for said distributed
transaction; and directing, by the backup coordinator, the
plurality of participants to commit the distributed transaction if
there is an entry in the transaction log for said distributed
transaction.
15. A computer readable medium including instructions that, when
executed by a processing system, cause the processing system to
perform a method comprising: receiving a prepare message by a
participant of a two-phase commit distributed transaction from a
primary coordinator of the distributed transaction, the prepare
message including an address of a backup coordinator; sending a
commit or abort message to the primary coordinator; sending a
backup transaction status inquiry message to the backup coordinator
using the address after failing to receive a subsequent message
from the primary coordinator within a specified time period; and
receiving a commit or roll back message from the backup
coordinator.
16. The computer readable medium of claim 15, the method further
comprising: sending a primary transaction status inquiry message to
the primary coordinator; and timing out a timer before receiving a
response to the primary transaction status inquiry message.
17. The computer readable medium of claim 15, wherein the backup
coordinator is remote from the primary coordinator.
18. A distributed computing system comprising: a data store; a
primary coordinator node to generate a prepare message for a
two-phase commit distributed transaction, the prepare message
including an address of a backup coordinator node, to maintain a
transaction log of the distributed transaction in the data store,
and to send the prepare message to a plurality of participant
nodes; and the backup coordinator node networked to the primary
coordinator to, upon detecting a failure of the primary coordinator
node, access the transaction log to determine a state of the
distributed transaction and assume control of the distributed
transaction.
19. The distributed computing system of claim 18, further
comprising: the backup coordinator node to periodically exchange a
heartbeat message with the primary coordinator node, and to assume
control of the distributed transaction if the primary coordinator
node fails to respond to the heartbeat message.
20. The distributed computing system of claim 18, further
comprising: the primary coordinator node to write to the
transaction log if a commit response is received from each of the
plurality of participant nodes; and the backup coordinator node,
after assuming control of the distributed transaction, to direct
the plurality of participant nodes to roll back the distributed
transaction if there is no entry in the transaction log for said
distributed transaction, and to direct the plurality of participant
nodes to commit the distributed transaction if there is an entry in
the transaction log for said distributed transaction.
22. The distributed computing system of claim 18, further
comprising: a participant node, networked to the primary
coordinator node and to the backup coordinator node, to receive the
prepare message, to send a commit or abort message to the primary
coordinator node, to send a primary transaction status inquiry
message to the primary coordinator node upon failing to receive a
subsequent message pertaining to the distributed transaction, and
to send a backup transaction status inquiry message to the backup
coordinator node using the address after failing to receive a
response to the primary status inquiry message.
23. The distributed computing system of claim 22, further
comprising: the backup coordinator node to assume control of the
distributed transaction upon receiving the backup transaction
status inquiry message, and to respond to the backup transaction
status inquiry message.
Description
TECHNICAL FIELD
[0001] Embodiments of the present invention relate to distributed
transactions, and more specifically to improving reliability of
distributed transaction recovery.
BACKGROUND
[0002] Distributed transactions are often performed on distributed
computing systems. A distributed transaction is a set of operations
that update shared objects. Distributed transactions must satisfy
the properties of Atomicity, Consistency, Isolation and Durability,
known commonly as the ACID properties. According to the Atomicity
property, either the transaction successfully executes to
completion, and the effects of all operations are recorded, or the
transaction fails. The Consistency property requires that the
transaction does not violate integrity constraints of the shared
objects. The Isolation property requires that intermediate effects
of the transaction are not detectable to concurrent transactions.
Finally, the Durability property requires that changes to shared
objects due to the transaction are permanent.
[0003] To ensure the Atomicity property, all participants of the
distributed transaction must coordinate their actions so that they
either unanimously abort or unanimously commit to the transaction.
A two-phase commit protocol is commonly used to ensure Atomicity.
Under the two-phase commit protocol, the distributed system
performs the commit operation in two phases. In the first phase,
commonly known as the prepare phase or request phase, a coordinator
node (a node in the distributed computing system managing the
transaction) asks all participant nodes whether they are willing to
commit to the transaction. During the second phase, commonly known
as the commit phase, the coordinator node determines whether the
transaction should be completed. If during the prepare phase all
participant nodes committed to the transaction, the coordinator
node successfully completes the transaction. If during the prepare
phase one or more participant nodes failed to commit to the
transaction, the coordinator node does not complete the
transaction.
[0004] Typically, only the coordinator node has all the information
necessary to determine whether a transaction should commit or roll
back. Therefore, if the coordinator node fails during a distributed
transaction, all participants in the transaction must wait for the
coordinator to recover before completing the transaction. Thus,
significant delays may be caused when a coordinator fails.
[0005] To minimize delays caused by a failed coordinator, some
conventional transaction systems use clustering and/or group
communication protocols to provide standby coordinators. However,
clustering protocols and group communication protocols add
complexity to distributed transactions, and require a change to the
underlying distributed transaction protocol.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The present invention is illustrated by way of example, and
not by way of limitation, in the figures of the accompanying
drawings and in which:
[0007] FIG. 1A illustrates an exemplary distributed computing
system, in which embodiments of the present invention may
operate;
[0008] FIG. 1B is a transaction diagram illustrating messages
flowing through a distributed computing system, in accordance with
one embodiment of the present invention;
[0009] FIG. 2 illustrates a flow diagram of one embodiment for a
method of coordinating a two-phase commit distributed
transaction;
[0010] FIG. 3A illustrates a flow diagram of one embodiment for a
method of assuming control of a distributed transaction by a backup
coordinator;
[0011] FIG. 3B illustrates a flow diagram of another embodiment for
a method of assuming control of a distributed transaction by a
backup coordinator;
[0012] FIG. 4 illustrates a flow diagram of one embodiment for a
method of participating in a distributed transaction; and
[0013] FIG. 5 illustrates a block diagram of an exemplary computer
system, in accordance with one embodiment of the present
invention.
DETAILED DESCRIPTION
[0014] Described herein is a method and apparatus for performing
distributed transactions. In one embodiment, a primary coordinator
generates a prepare message for a two-phase commit distributed
transaction. The prepare message includes an address of a backup
coordinator. The primary coordinator maintains a transaction log of
the distributed transaction that is accessible to both the primary
coordinator and the backup coordinator. The prepare message is sent
to a plurality of participants. The backup coordinator may assume
control of the distributed transaction at any time if the primary
coordinator fails. In one embodiment, the backup coordinator and
the primary coordinator exchange a heartbeat message. If the
primary coordinator fails to respond to the heartbeat message, the
backup coordinator can assume control of the distributed
transaction. In one embodiment, the backup coordinator assumes
control of the distributed transaction upon receiving a status
inquiry message from a participant of the distributed transaction.
Failover of the primary coordinator to the backup coordinator can
occur in a seamless manner without interrupting the distributed
transaction.
[0015] In the following description, numerous details are set
forth. It will be apparent, however, to one skilled in the art,
that the present invention may be practiced without these specific
details. In some instances, well-known structures and devices are
shown in block diagram form, rather than in detail, in order to
avoid obscuring the present invention.
[0016] Some portions of the detailed descriptions which follow are
presented in terms of algorithms and symbolic representations of
operations on data bits within a computer memory. These algorithmic
descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the
substance of their work to others skilled in the art. An algorithm
is here, and generally, conceived to be a self-consistent sequence
of steps leading to a desired result. The steps are those requiring
physical manipulations of physical quantities. Usually, though not
necessarily, these quantities take the form of electrical or
magnetic signals capable of being stored, transferred, combined,
compared, and otherwise manipulated. It has proven convenient at
times, principally for reasons of common usage, to refer to these
signals as bits, values, elements, symbols, characters, terms,
numbers, or the like.
[0017] It should be borne in mind, however, that all of these and
similar terms are to be associated with the appropriate physical
quantities and are merely convenient labels applied to these
quantities. Unless specifically stated otherwise, as apparent from
the following discussion, it is appreciated that throughout the
description, discussions utilizing terms such as "sending",
"receiving", "managing", "directing", "generating", or the like,
refer to the action and processes of a computer system, or similar
electronic computing device, that manipulates and transforms data
represented as physical (electronic) quantities within the computer
system's registers and memories into other data similarly
represented as physical quantities within the computer system
memories or registers or other such information storage,
transmission or display devices.
[0018] The present invention also relates to an apparatus for
performing the operations herein. This apparatus may be specially
constructed for the required purposes, or it may comprise a general
purpose computer selectively activated or reconfigured by a
computer program stored in the computer. Such a computer program
may be stored in a computer readable storage medium, such as, but
not limited to, any type of disk including floppy disks, optical
disks, CD-ROMs, and magnetic-optical disks, read-only memories
(ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or
optical cards, or any type of media suitable for storing electronic
instructions, each coupled to a computer system bus.
[0019] The algorithms and displays presented herein are not
inherently related to any particular computer or other apparatus.
Various general purpose systems may be used with programs in
accordance with the teachings herein, or it may prove convenient to
construct a more specialized apparatus to perform the required
method steps. The required structure for a variety of these systems
will appear as set forth in the description below. In addition, the
present invention is not described with reference to any particular
programming language. It will be appreciated that a variety of
programming languages may be used to implement the teachings of the
invention as described herein.
[0020] The present invention may be provided as a computer program
product, or software, that may include a machine-readable medium
having stored thereon instructions, which may be used to program a
computer system (or other electronic devices) to perform a process
according to the present invention. A machine-readable medium
includes any mechanism for storing or transmitting information in a
form readable by a machine (e.g., a computer). For example, a
machine-readable (e.g., computer-readable) medium includes a
machine (e.g., a computer) readable storage medium (e.g., read only
memory ("ROM"), random access memory ("RAM"), magnetic disk storage
media, optical storage media, flash memory devices, etc.), a
machine (e.g., computer) readable transmission medium (electrical,
optical, acoustical or other form of propagated signals (e.g.,
carrier waves, infrared signals, digital signals, etc.)), etc.
[0021] FIG. 1A illustrates an exemplary distributed computing
system 100, in which embodiments of the present invention may
operate. The distributed computing system 100 may include a service
oriented architecture (SOA) (an information system architecture
that organizes and uses distributed capabilities (services) for one
or more applications). An SOA provides a uniform means to offer,
discover, interact with and use capabilities (services) distributed
over a network. Through the SOA, applications may be designed that
combine loosely coupled and interoperable services. In one
embodiment, the distributed computing system 100 includes an
enterprise service bus (ESB). An ESB is an event-driven and
standards-based messaging engine that provides services for more
complex architectures. The ESB provides an infrastructure that
links together services and clients to enable distributed
applications and processes. The ESB may be implemented to
facilitate an SOA. In one embodiment, the ESB is a single bus that
logically interconnects all available services and clients.
Alternatively, the ESB may include multiple busses, each of which
may logically interconnect different services and/or clients.
[0022] In one embodiment, the distributed computing system 100
includes one or more clients 102, a first server 105 and a second
server 110 connected via a network 155. Alternatively, the
distributed computing system 100 may only include a single server
and/or the client 102 may be directly connected with the first
server 105 or the second server 110.
[0023] Client(s) 102 may be, for example, personal computers (PC),
palm-sized computing devices, personal digital assistants (PDA),
etc. Client(s) 102 may also be applications run on a PC, server,
database, etc. In the SOA, client(s) 102 include applications that
access services. Client(s) 102 may be fat clients (client that
performs local processing and data storage), thin clients (client
that performs minimal or no local processing and minimal to no data
storage), and/or hybrid clients (client that performs local
processing but little to no data storage).
[0024] Each of the first server 105 and second server 110 may host
services, applications and/or other functionality that is available
to clients 102 on the distributed computing system 100. The first
server 105 and second server 110 may be a single machine, or may
include multiple interconnected machines (e.g., machines configured
in a cluster). The network 155 may be a private network (e.g.,
local area network (LAN), wide area network (WAN), intranet, etc.),
a public network (e.g., the Internet), or a combination
thereof.
[0025] In one embodiment, first server 105 includes a first
resource manager 115 and a second resource manager 125. A resource
manager is a software module that manages a persistent and stable
storage system. Examples of resource managers include databases and
file managers.
[0026] In one embodiment, first server 105 is coupled with one or
more data stores on which first resource manager 115 and second
resource manager 125 maintain transaction logs. In one embodiment,
the transactions logs are maintained on data store 122.
Alternatively, the transaction logs may be maintained on a
different data store or data stores. The data store(s) may include
a file system, a database, or other data storage arrangement, and
may be internal or external to first server 105. The transaction
logs maintained by first resource manager 115 and second resource
manager 125 may be undo logs (log of committed changes that occur
during a distributed transaction) and/or redo logs (log of
uncommitted changes that occur during a distributed transaction).
The redo logs and/or undo logs can be used to rollback any changes
that occurred during a distributed transaction if the transaction
is aborted.
[0027] Each resource manager that participates in a distributed
transaction may be a participant node of the transaction. During a
prepare phase of a two-phase commit distributed transaction, a
participant node is asked whether it can commit to the transaction
by a coordinator node (described in greater detail below). If the
resource manager can commit to the transaction, it sends a commit
response to the coordinator node. If the resource manager cannot
commit to the transaction, it sends an abort message to the
coordinator node.
[0028] During a commit phase of a two-phase commit distributed
transaction, each resource manager receives a commit command if all
resource managers indicated that they were able to commit. Each
resource manager then commits to the transaction and sends a
confirmation to the coordinator that the transaction was
successfully completed. If one or more of the participating
resource managers sent an abort response, then all resource
managers receive an abort command during the commit phase. Each
resource manager then rolls back the transaction, and may send a
confirmation to the coordinator that the transaction was rolled
back.
[0029] In one embodiment, the first server 105 includes a first
transaction manager 120, and the second server 110 includes a
second transaction manager 145. A transaction manager is a software
module that coordinates multiple participants during a distributed
transaction. A participant may be another transaction manager
(e.g., second transaction manager 145) or a local resource manager
(e.g., first resource manager 115 and second resource manager 125).
Coordinating a distributed transaction may include assigning
identifiers to the transaction, monitoring progress of the
transaction, taking responsibility for transaction completion, and
providing fault recovery for the transaction. Taking responsibility
for transaction completion may include determining whether each
participant can commit to a transaction, directing each participant
to commit if all are able, and directing each participant to
rollback if not all participating nodes are able to commit.
Providing fault recovery may include maintaining a log of
transactions that can be used by participants to recover from a
system failure.
[0030] Any transaction manager in the distributed computing system
100 is capable of operating as a coordinator node. Generally, it is
a transaction manager that is located at a node at which a
transaction is begun or requested that operates as the coordinator
node for that distributed transaction. However, it is not a
requirement that a node that begins a transaction act as
coordinator node for that transaction. Moreover, a transaction
manager can hand responsibility over to another node, causing a
transaction manager of that other node to become the coordinator
node.
[0031] In one embodiment, first transaction manager 120 acts as a
master coordinator node, and coordinates a distributed transaction
between first resource manager 115, second resource manager 125
and/or one or more remote transaction managers and resource
managers. A master coordinator node is a transaction manager that
acts on behalf of a process that initiates a distributed
transaction (e.g., by initiating a commit operation) to coordinate
all participants of the distributed transaction. A master
coordinator node must arrive at a commit or abort decision and
propagate that decision to all participants. In one embodiment,
first transaction manager 120 is configured to initiate a two-phase
commit distributed transaction if there are multiple resource
managers and/or transaction managers that will participate in the
transaction.
[0032] In another embodiment, first transaction manager 120 may act
as an intermediate coordinator node, and coordinate a distributed
transaction between only first resource manager 115 and second
resource manager 125. An intermediate coordinator node is a
transaction manager that acts on behalf of a process that
participates in a distributed transaction to coordinate local
resource managers and/or additional transaction managers that are
participants in the distributed transaction. An intermediate
coordinator node gathers information about the participants that it
manages, and reports the information to a master coordinator node.
An intermediate coordinator node also receives commit or abort
decisions from a master coordinator node, and propagates the
decisions to participants that it manages.
[0033] In one embodiment, first transaction manager 120 acts as a
primary coordinator node and second transaction manager 145 acts as
a backup coordinator node. As a primary coordinator node, first
transaction manager 120 coordinates a transaction until a failure
occurs. If the first transaction manager 120 fails, the second
transaction manager 145 (acting as backup coordinator node) assumes
control of the transaction.
[0034] During a prepare phase of a two-phase distributed
transaction, first transaction manager 120 notifies all
participants that second transaction manager 145 will act as a
backup coordinator node. In one embodiment, first transaction
manager 120 performs this notification by inserting an address of
the second transaction manager 145 into a prepare message that is
sent to all participants. Participants may then use the address of
second transaction manager 145 to inquire about a status of the
transaction if communication with first transaction manager 120 is
lost.
[0035] In one embodiment, first transaction manager 120 and second
transaction manager 145 exchange heartbeat messages at regular
intervals. The heartbeat messages are used to inform each
transaction manager of the other transaction manager's operating
status. If, for example, second transaction manager 145 fails to
receive a heartbeat message from first transaction manager 120,
second transaction manager 145 may determine that first transaction
manager 120 has failed. Second transaction manager 145 may then
assume control of any transaction coordinated by first transaction
manager 120.
[0036] In one embodiment, first server 105 is connected with a data
store 122. Data store 122 may include a file system, a database, or
other data storage arrangement. In one embodiment, data store 122
is internal to first server 105. In another embodiment, data store
122 may be external to first server 105, and connected with first
server 105 either directly or via a network. In yet another
embodiment, data store 122 is a shared data store that can be read
from and written to by both first transaction manager 120 and
second transaction manager 145.
[0037] In one embodiment, first transaction manager 120 maintains a
transaction log 130 of active transactions in the data store 122.
The first transaction manager 120 may operate using the presumed
nothing, presumed commit or presumed abort optimizations. In the
presumed nothing optimization, information about a transaction is
maintained in the transaction log 130 until all participants
acknowledge an outcome of the transaction whether the transaction
is to be committed or aborted (rolled back). According to the
presumed abort optimization the first transaction manager 120 only
maintains information about a transaction in transaction log 130 if
the transaction is committed. Therefore, in the absence of
information about a transaction, the first transaction manger 120
presumes that the transaction has been aborted. In the presumed
commit optimization, the first transaction manager 120 maintains
records of aborted transactions. In the absence of information
about a transaction the transaction manager 120 presumes that the
transaction was successfully completed.
[0038] In one embodiment, first transaction manager 130 maintains
an entry in the transaction log 130 for each participant of a
specified transaction. As the first transaction manager 120
receives confirmation from participants that they successfully
committed or rolled back the transaction, it may remove the entry
corresponding to that participant from the transaction log 130.
Once the first transaction manager 120 receives messages from all
participants indicating that they have successfully committed or
rolled back the transaction, first transaction manager 120 deletes
the log corresponding to that transaction. This minimizes the
amount of storage capacity of data store 122 that is required to
maintain the transaction log 130.
[0039] To successfully perform its role as a backup coordinator,
second transaction manager 145 has access to a transaction log 130
maintained by first transaction manager 120. Therefore, second
transaction manager 145 can direct participants of a distributed
transaction managed by first transaction manager 120 whether the
transaction was committed or rolled back if first transaction
manager 120 fails. In the absence of a backup coordinator,
participants of the transaction would have to wait for first
transaction manager 120 to recover from its failure before the
participants can resolve the transaction. This can cause
considerable delay in some distributed computing systems.
[0040] In one embodiment, failures of the primary coordinator are
only tolerated after the transaction log 130 is written to. This
ensures that the atomicity property required for distributed
transactions is achieved. However, in such an embodiment if a
failure in the first transaction manager 120 occurs prior to the
first transaction manager 120 writing to transaction log 130, the
second transaction manager 145 directs all participants to roll
back the transaction. Such a system design requires a minimum of
resources at the expense of occasionally rolling back transactions
that would otherwise have been successful if the first transaction
manager 120 had not failed. Such a system design may be preferable
for systems that have very low failure rates.
[0041] First server 105 includes a first memory 157 and second
server 110 includes a second memory 159. In one embodiment, first
memory 157 and second memory 159 are components of a shared memory.
In one embodiment, the first memory 157 and second memory 159 are
volatile memories (e.g., random access memory (RAM)) that are
components of a shared backplane bus. In such an embodiment, the
first server and second server are networked via a local area
network. In another embodiment, first memory 157 and second memory
159 are components of a shared file system that includes a memory
mapped file (not shown). The memory mapped file may be a file that
includes all the relevant information pertaining to a distributed
transaction that is resident in volatile memory of first
transaction manager 120. In yet another embodiment, first memory
157 and second memory 159 are components of a shared virtual
memory. In still another embodiment, second transaction manager 145
is a replicated service of first transaction manager 120 that
maintains volatile information only in memory.
[0042] A shared memory between first transaction manager 120 and
second transaction manager 145 permits a distributed transaction to
be assumed by second transaction manager 145 whether or not first
transaction manager 120 has written to transaction log 130.
Therefore, for example, if first transaction manager 120 fails
after sending out prepare messages, and before receiving commit or
rollback responses from all participants, second transaction
manager 145 may determine which participants have not yet
responded. Second transaction manager 145 may then complete the
transaction by receiving commit or rollback messages from the
remaining participants, and issuing a commit or rollback command to
all participants.
[0043] FIG. 1B is a transaction diagram illustrating messages
flowing through a distributed computing system 160, in accordance
with one embodiment of the present invention. In one embodiment,
the distributed computing system 160 includes a primary coordinator
node 165, a backup coordinator node 172, a transaction log 166 and
multiple participant nodes (e.g., first participant node 167 and
second participant node 169). Each node represents a specific
resource manager or transaction manager that participates in a
distributed transaction. Each node is connected with each other
node directly or via a network, which may be a private network
(e.g., local area network (LAN), wide area network (WAN), intranet,
etc.), a public network (e.g., the Internet), or a combination
thereof.
[0044] In one embodiment, primary coordinator node 165 includes a
first transaction manager that initiates and manages a specific
distributed transaction, and backup coordinator node 172 includes a
second transaction manager that assumes management of the specific
distributed transaction if primary coordinator node 165 fails.
Managing the distributed transaction includes determining whether
each participating node 167, 169 can commit to a transaction,
directing each participating node 167, 169 to commit if all are
able, and directing each participating node 167, 169 to rollback
(undo changes caused by the transaction) if not all participating
nodes are able to commit.
[0045] In one embodiment, the primary coordinator node 165
coordinates a two-phase commit distributed transaction between the
first participant node 167 and the second participant node 169.
During a prepare phase of the two-phase commit distributed
transaction, the primary coordinator node 165 sends a prepare
message 174 to each of the participant nodes 167, 169 asking
whether they can commit to the transaction. The prepare message 174
may identify the primary coordinator node, the participant nodes,
and the distributed transaction. The prepare message 174 may also
include an address of the backup coordinator node 172, thereby
notifying the participant nodes 167, 169 how to contact the backup
coordinator node 172 if the primary coordinator node 165 fails. The
primary coordinator node 165 then waits for a response from each of
the participant nodes 167, 169.
[0046] Each participant node 167, 169 then sends a response 176
(e.g., a commit response or an abort response) to the primary
coordinator node 165 responding to the prepare message 174. The
participant nodes 167, 169 will then wait for a commit or roll-back
(abort) command from the primary coordinator 165. If no commit or
roll-back command is received within a specified time frame, the
first participant node 167 and second participant node 169 send a
primary status inquiry message 180 to the primary coordinator node
165. The primary status inquiry message 180 requests an outcome of
the distributed transaction. If no response to the primary status
inquiry message 180 is received in another specified time frame,
the first participant node 167 and second participant node 169 send
a backup status inquiry message 184 to the backup coordinator node
172 (e.g., using the address provided in the prepare message
174).
[0047] Upon receiving a backup status inquiry message 184, backup
coordinator node 172 may assume control of the distributed
transaction. Alternatively, or in addition, backup coordinator node
172 may exchange heartbeat messages 182 with primary coordinator
node 165. If primary coordinator node 165 fails to send a heartbeat
message 182 to backup coordinator node 172, backup coordinator node
172 may assume control of the distributed transaction.
[0048] Upon assuming control of the distributed transaction, backup
coordinator node 172 accesses transaction log 166, which was
maintained 178 by primary coordinator node 165 until primary
coordinator node 165 failed. The transaction log 166 may have
varying degrees of information, depending on a two-phase commit
optimization used by the distributed computing system 160 and/or
depending on a stage of the transaction. For example, if a presumed
abort optimization is used, primary coordinator node 165 may only
have written to transaction log 166 if it received commit response
messages 176 from both first participant 167 and second participant
169. If, for example, a presumed nothing optimization or presumed
commit optimization is used, the transaction log 166 may include an
entry for the distributed transaction even if primary coordinator
node 165 failed before receiving response messages 176. In one
embodiment, the distributed computing system 160 includes a shared
memory between primary coordinator node 165 and backup coordinator
node 172. The shared memory may provide backup coordinator node 172
with all information pertaining to the distributed transaction that
was stored in volatile memory of primary coordinator node 165.
Therefore, backup coordinator node 172 may have more information
available to it from which to base a commit or abort decision than
what is stored in the transaction log 166.
[0049] After receiving the backup status inquiry message 184,
backup coordinator 172 may send a status response 188 to first
participant 167 and second participant 169 notifying them of a
status of the transaction. If backup coordinator node 172 does not
have enough information from which to base an abort or commit
decision, it may include in the status response 188 a request to
identify whether the participant nodes 167, 169 had responded to
the prepare message 174 with a commit or abort response. Upon
receiving the updated responses, backup coordinator node 172 may
then command the participant nodes 167, 169 to commit or roll back
the transaction, as appropriate. Alternatively, if there is not
enough information available, backup coordinator node 172 may
command first participant node 167 and second participant node 169
to roll back the transaction via the status response 188. If there
is enough information, backup coordinator node 172 may include a
commit or abort command as appropriate in the status response
188.
[0050] Though FIG. 1B has been described with two participant
nodes, embodiments of the present invention work equally well for
distributed transactions that include greater or fewer than two
participant nodes. Moreover, though a single backup coordinator 172
has been described, embodiments of the present invention operate
equally well with multiple backup coordinators. In such
embodiments, the multiple backup coordinators are ranked. That way
there is no confusion as to which backup coordinator node is to
assume control of the distributed transaction when the primary
coordinator node or another backup coordinator node fails.
[0051] FIG. 2 illustrates a flow diagram of one embodiment for a
method 200 of coordinating a two-phase commit distributed
transaction. The method is performed by processing logic that
comprises hardware (e.g., circuitry, dedicated logic, programmable
logic, microcode, etc.), software (such as instructions run on a
processing device), or a combination thereof. In one embodiment,
method 200 is performed by a transaction manager (e.g., first
transaction manager 120) of FIG. 1A acting as a primary coordinator
node.
[0052] Referring to FIG. 2, method 200 includes initiating a
two-phase commit distributed transaction (block 205) by a primary
coordinator node. At block 210, participants for the distributed
transaction are determined. Appropriate participants include
resource managers that will contribute data or services to the
transaction and/or transaction managers that manage those resource
managers. Appropriate participants may be determined by
broadcasting a transaction participation query, and receiving
responses from all nodes that will participate in the queried
transaction. Alternatively, appropriate participants may be
determined, for example, based on a nature of the transaction, an
initiator of the transaction, or other criteria.
[0053] At block 215, a prepare message is generated that includes
an address of a backup coordinator. In one embodiment, the backup
coordinator is a transaction manager that remains in a standby mode
during the transaction. If the primary coordinator node fails
during the transaction, the backup coordinator may assume control
of the transaction. At block 220, the prepare message is sent to
each of the participants of the distributed transaction.
[0054] At block 225, a response message is received from a
participant. If the response message is an abort response, then the
transaction will be aborted, and the method proceeds to block 235.
If the response message is a commit message, the method proceeds to
block 230.
[0055] At block 235, an abort command is sent to all
participants.
[0056] At block 230, processing logic determines whether responses
have been received from all participants. If responses have not
been received from all participants, the method returns to block
225. If responses have been received from all participants, the
method proceeds to block 240.
[0057] At block 240, a commit command is sent to all participants.
The participants may then commit to the transaction, and send back
a commit acknowledgement. Upon receiving commit acknowledgements
from all participants, the primary coordinator node may close the
transaction and delete all transaction logs for the
transaction.
[0058] In embodiments of the present invention, the primary
coordinator node writes to a transaction log at certain stages of
method 200. For example, if the presumed nothing optimization is
used, the primary coordinator node may write to the transaction log
upon determining the participants of the transaction, after sending
out the prepare messages, upon receiving each response from a
participant, upon sending an abort command to the participants
and/or upon sending a commit command to the participants. In
another example, if the presumed abort optimization is used, the
primary coordinator node may write to the transaction log upon
receiving a commit response from all participants. This transaction
log may then be used by the backup coordinator to complete the
transaction in the event that the primary coordinator should
fail.
[0059] FIG. 3A illustrates a flow diagram of one embodiment for a
method 300 of assuming control of a distributed transaction by a
backup coordinator. The method is performed by processing logic
that comprises hardware (e.g., circuitry, dedicated logic,
programmable logic, microcode, etc.), software (such as
instructions run on a processing device), or a combination thereof.
In one embodiment, method 300 is performed by a transaction manager
(e.g., second transaction manager 145) of FIG. 1A.
[0060] Referring to FIG. 3A, method 300 includes sending a
heartbeat message from the backup coordinator of a distributed
transaction to a primary coordinator of the distributed transaction
(block 305). At block 310, the backup coordinator waits for a
response heartbeat message from the primary coordinator. If a
response backup message is received from the primary coordinator,
the method returns to block 305. If no response heartbeat message
is received within a specified time period, the method continues to
block 315.
[0061] At block 315, the backup coordinator assumes control of the
distributed transaction. At block 320, the backup coordinator
accesses a transaction log for the distributed transaction that was
maintained by the primary coordinator. In one embodiment, the
backup coordinator also has access to a shared memory that includes
contents of a volatile memory of the primary coordinator prior to
failure of the primary coordinator. The transaction log and/or the
shared memory can be used to determine a current state of the
transaction.
[0062] At block 325, the backup coordinator completes the
distributed transaction. Completing the transaction may include
requesting abort or commit responses from participants, sending
abort or commit commands to participants, receiving
acknowledgements from the participants, etc. The method then
ends.
[0063] FIG. 3B illustrates a flow diagram of another embodiment for
a method 350 of assuming control of a distributed transaction by a
backup coordinator. The method is performed by processing logic
that comprises hardware (e.g., circuitry, dedicated logic,
programmable logic, microcode, etc.), software (such as
instructions run on a processing device), or a combination thereof.
In one embodiment, method 350 is performed by a transaction manager
(e.g., second transaction manager 145) of FIG. 1A.
[0064] Referring to FIG. 3B, method 350 includes receiving a
transaction status inquiry message from a participant (block 355).
The transaction status inquiry message may have been sent by a
participant node after the participant node failed to receive a
response to another status inquiry message sent to a primary
coordinator.
[0065] At block 360, the backup coordinator determines if the
primary coordinator has failed. In one embodiment, such a
determination is made by sending a heartbeat message to the primary
coordinator and waiting for a response heartbeat message.
Alternatively, it may be automatically determined that the primary
coordinator has failed based on receipt of the transaction status
inquiry message by the backup coordinator. If it is determined that
the primary coordinator has failed, the method continues to block
365. If it is determined that the primary coordinator has not
failed, the method proceeds to block 370, and the status inquiry
message is forwarded to the primary coordinator.
[0066] At block 365, the backup coordinator assumes control of the
distributed transaction. At block 375, the backup coordinator
accesses a transaction log for the distributed transaction that was
maintained by the primary coordinator. In one embodiment, the
backup coordinator also has access to a shared memory that includes
contents of a volatile memory of the primary coordinator prior to
failure of the primary coordinator. The transaction log and/or the
shared memory can be used to determine a current state of the
transaction.
[0067] At block 380, the backup coordinator responds to the status
inquiry message based on the contents of the transaction log and/or
the shared memory. If the transaction log and/or shared memory
indicate that the transaction was committed or rolled back, then a
commit or abort command may be included in the status inquiry
response, as appropriate. If the transaction log and/or shared
memory indicated that a commit or abort decision was still pending
due to failure to receive a commit or abort response from certain
participants, a query may be sent to those participants. If no
record of the transaction is included in the transaction log, then
a roll back command may be sent to all participants.
[0068] At block 385, the backup coordinator completes the
distribute transaction. The method then ends.
[0069] FIG. 4 illustrates a flow diagram of one embodiment for a
method 400 of participating in a distributed transaction. The
method is performed by processing logic that comprises hardware
(e.g., circuitry, dedicated logic, programmable logic, microcode,
etc.), software (such as instructions run on a processing device),
or a combination thereof. In one embodiment, method 400 is
performed by a resource manager (e.g., first resource manager 115)
or transaction manager (e.g., first transaction manager 120) of
FIG. 1A.
[0070] Referring to FIG. 4, method 400 includes receiving a prepare
message for a two-phase commit distributed transaction. The
received prepare message includes an address of a backup
coordinator. The distributed transaction is a transaction in which
the recipient resource manager or transaction manager is a
participant.
[0071] At block 410, the participant (resource manager or
transaction manager) sends a commit or abort message to a primary
coordinator from which the prepare message was received. The
participant then expects to receive a commit or abort command from
the primary coordinator.
[0072] At block 415, a status inquiry message is sent to the
primary coordinator. The status inquiry message is sent if no
message has been received from the primary coordinator in a
predetermined time period. At block 420, the participant waits for
a response to the status inquiry message from the primary
coordinator. If no subsequent messages are received from the
primary coordinator, the method proceeds to block 425. If a
subsequent message is received from the primary coordinator, the
method proceeds to block 430, and the participant commits or aborts
the transaction as per an instruction included in the subsequent
message.
[0073] At block 425, the participant sends a status inquiry message
to the backup coordinator using the address that was included in
the prepare message. At block 435, the participant receives a
commit or roll-back command from the backup coordinator. The
participant then commits or rolls back the transaction as
appropriate.
[0074] FIG. 5 illustrates a diagrammatic representation of a
machine in the exemplary form of a computer system 500 within which
a set of instructions, for causing the machine to perform any one
or more of the methodologies discussed herein, may be executed. In
alternative embodiments, the machine may be connected (e.g.,
networked) to other machines in a Local Area Network (LAN), an
intranet, an extranet, or the Internet. The machine may operate in
the capacity of a server or a client machine in a client-server
network environment, or as a peer machine in a peer-to-peer (or
distributed) network environment. The machine may be a personal
computer (PC), a tablet PC, a set-top box (STB), a Personal Digital
Assistant (PDA), a cellular telephone, a web appliance, a server, a
network router, switch or bridge, or any machine capable of
executing a set of instructions (sequential or otherwise) that
specify actions to be taken by that machine. Further, while only a
single machine is illustrated, the term "machine" shall also be
taken to include any collection of machines (e.g., computers) that
individually or jointly execute a set (or multiple sets) of
instructions to perform any one or more of the methodologies
discussed herein.
[0075] The exemplary computer system 500 includes a processor 502,
a main memory 504 (e.g., read-only memory (ROM), flash memory,
dynamic random access memory (DRAM) such as synchronous DRAM
(SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 506 (e.g.,
flash memory, static random access memory (SRAM), etc.), and a
secondary memory 518 (e.g., a data storage device), which
communicate with each other via a bus 530.
[0076] Processor 502 represents one or more general-purpose
processing devices such as a microprocessor, central processing
unit, or the like. More particularly, the processor 502 may be a
complex instruction set computing (CISC) microprocessor, reduced
instruction set computing (RISC) microprocessor, very long
instruction word (VLIW) microprocessor, processor implementing
other instruction sets, or processors implementing a combination of
instruction sets. Processor 502 may also be one or more
special-purpose processing devices such as an application specific
integrated circuit (ASIC), a field programmable gate array (FPGA),
a digital signal processor (DSP), network processor, or the like.
Processor 502 is configured to execute the processing logic 526 for
performing the operations and steps discussed herein.
[0077] The computer system 500 may further include a network
interface device 508. The computer system 500 also may include a
video display unit 510 (e.g., a liquid crystal display (LCD) or a
cathode ray tube (CRT)), an alphanumeric input device 512 (e.g., a
keyboard), a cursor control device 514 (e.g., a mouse), and a
signal generation device 516 (e.g., a speaker).
[0078] The secondary memory 518 may include a machine-readable
storage medium (or more specifically a computer-readable storage
medium) 531 on which is stored one or more sets of instructions
(e.g., software 522) embodying any one or more of the methodologies
or functions described herein. The software 522 may also reside,
completely or at least partially, within the main memory 504 and/or
within the processing device 502 during execution thereof by the
computer system 500, the main memory 504 and the processing device
502 also constituting machine-readable storage media. The software
522 may further be transmitted or received over a network 520 via
the network interface device 508.
[0079] The machine-readable storage medium 531 may also be used to
store a transaction manager (e.g., first transaction manager 120 of
FIG. 1A) or resource manager (e.g., first resource manager 115 of
FIG. 1A), and/or a software library containing methods that call
transaction managers or resource managers. While the
machine-readable storage medium 531 is shown in an exemplary
embodiment to be a single medium, the term "machine-readable
storage medium" should be taken to include a single medium or
multiple media (e.g., a centralized or distributed database, and/or
associated caches and servers) that store the one or more sets of
instructions. The term "machine-readable storage medium" shall also
be taken to include any medium that is capable of storing or
encoding a set of instructions for execution by the machine and
that cause the machine to perform any one or more of the
methodologies of the present invention. The term "machine-readable
storage medium" shall accordingly be taken to include, but not be
limited to, solid-state memories, and optical and magnetic
media.
[0080] It is to be understood that the above description is
intended to be illustrative, and not restrictive. Many other
embodiments will be apparent to those of skill in the art upon
reading and understanding the above description. Although the
present invention has been described with reference to specific
exemplary embodiments, it will be recognized that the invention is
not limited to the embodiments described, but can be practiced with
modification and alteration within the spirit and scope of the
appended claims. Accordingly, the specification and drawings are to
be regarded in an illustrative sense rather than a restrictive
sense. The scope of the invention should, therefore, be determined
with reference to the appended claims, along with the full scope of
equivalents to which such claims are entitled.
* * * * *