U.S. patent application number 10/042,763 was filed with the patent office on November 20, 2002, and published on September 4, 2003, for reliable distributed shared memory. Invention is credited to Parsons, Eric W.

United States Patent Application 20030167420
Kind Code: A1
Parsons, Eric W.
September 4, 2003

Reliable distributed shared memory
Abstract
In implementing a reliable distributed shared memory, a weak
consistency model is modified to ensure that all vital data
structures are properly replicated at all times. Write notices and
their corresponding diffs are stored on a parameterizable number of
nodes. Whenever a node (acting in the primary role) releases a lock,
it sends its current vector timestamp, write notices generated during
the time the lock was held and their corresponding diffs to a
secondary node. The secondary node keeps this information separate
from its own private data structures. If a node fails (detected by
all nodes simultaneously through a group membership protocol) while
holding a lock, then all nodes complete a lock release method, and
enter a recovery operation. During this recovery operation, all
nodes exchange all write notices and corresponding diffs, including
backup write notices and diffs held by nodes on behalf of the
failed node. After the information has been exchanged, diffs are
applied and all nodes may start fresh.
Inventors: Parsons, Eric W. (Ashton, CA)

Correspondence Address:
SMART AND BIGGAR
438 UNIVERSITY AVENUE, SUITE 1500, BOX 111
TORONTO, ON M5G 2K8
CA

Family ID: 23704402
Appl. No.: 10/042,763
Filed: November 20, 2002
Related U.S. Patent Documents

Application Number   Filing Date     Patent Number
10/042,763           Nov 20, 2002
09/429,712           Oct 29, 1999    6,574,749
Current U.S. Class: 714/10; 709/214; 709/230; 714/E11.008
Current CPC Class: G06F 11/1482 20130101; G06F 11/1425 20130101; Y10S 707/99952 20130101
Class at Publication: 714/10; 709/214; 709/230
International Class: G06F 015/167; G06F 015/16; H04L 001/22
Claims
We claim:
1. At a first node in a distributed shared memory system, said
system implemented using a weak consistency protocol, a method of
replicating state comprising: completing access to a
synchronization variable; after completing said access, sending a
message to a second node, said message comprising: an indication of
a global ordering of access to said synchronization variable; an
indication that a page of shared memory has undergone a
modification, said page of shared memory including memory
referenced by said synchronization variable; and a record of said
modification.
2. The method of claim 1 wherein said weak consistency protocol
comprises the lazy release consistency protocol.
3. The method of claim 2 wherein said synchronization variable is a
lock on a unit of shared memory and said access completing
comprises releasing said lock.
4. At a first node in a distributed shared memory system, said
system implemented using a weak consistency protocol, a method of
replicating state comprising: releasing a lock on a unit of shared
memory; after releasing said lock, sending a message to a second
node, said message comprising: a vector timestamp; a write notice
indicating that a page of shared memory underwent a modification
while said lock was held; and a record of said modification.
5. At a first node in a distributed shared memory system, said
system implemented using a weak consistency protocol, a processor
operable to: complete access to a synchronization variable; after
completing said access, send a message to a second node, said
message comprising: an indication of a global ordering of access to
said synchronization variable; an indication that a page of shared
memory has undergone a modification, said page of shared memory
including memory referenced by said synchronization variable; and a
record of said modification.
6. A computer readable medium for providing program control to a
processor, said processor included in a node in a distributed
shared memory system, said system implemented using a weak
consistency protocol, said computer readable medium adapting said
processor to be operable to: complete access to a synchronization
variable; after completing said access, send a message to a second
node, said message comprising: an indication of a global ordering
of access to said synchronization variable; an indication that a
page of shared memory has undergone a modification, said page of
shared memory including memory referenced by said synchronization
variable; and a record of said modification.
7. A computer data signal embodied in a carrier wave comprising: an
indication of a global ordering of access to said synchronization
variable; an indication that a page of shared memory has undergone
a modification, said page of shared memory including memory
referenced by said synchronization variable; and a record of said
modification.
8. A method for synchronization variable managing in a distributed
shared memory system, said system implemented using a weak
consistency protocol, said method comprising: receiving an access
request related to a synchronization variable, where said
synchronization variable is for a unit of shared memory;
determining a most recent node to have held said synchronization
variable; if said most recent node to have held said
synchronization variable has failed, and said failure has occurred
subsequent to sending a replication message, determining a node
possessed of said replication message, said replication message
including an indication of a global ordering of access to said
synchronization variable, an indication that a page has undergone a
modification while said synchronization variable was held, said
page of shared memory including memory referenced by said
synchronization variable, and a record of said modification; and
forwarding said access request to said node determined to be
possessed of said replication message.
9. The method of claim 8 wherein said access request is a request
to acquire said synchronization variable.
10. The method of claim 8 wherein said access request is a request for a
record of a modification undergone while said synchronization
variable was held.
11. The method of claim 8 wherein said determining a node possessed
of said replication message comprises polling nodes for possession
of said replication message.
12. At a node acting as a synchronization variable manager in a
distributed shared memory system, said system implemented using a
weak consistency protocol, a processor operable to: receive an
access request related to a synchronization variable, where said
synchronization variable is for a unit of shared memory; determine
a most recent node to have held said synchronization variable; if
said most recent node to have held said synchronization variable
has failed, and said failure has occurred subsequent to sending a
replication message, determine a node possessed of said replication
message, said replication message including an indication of a
global ordering of access to said synchronization variable, an
indication that a page has undergone a modification while said
synchronization variable was held, said page of shared memory
including memory referenced by said synchronization variable, and a
record of said modification; and forward said access request to
said node determined to be possessed of said replication
message.
13. A computer readable medium for providing program control to a
processor, said processor included in a node acting as a
synchronization variable manager in a distributed shared memory
system, said system implemented using a weak consistency protocol,
said computer readable medium adapting said processor to be
operable to: receive an access request related to a synchronization
variable, where said synchronization variable is for a unit of
shared memory; determine a most recent node to have held said
synchronization variable; if said most recent node to have held
said synchronization variable has failed, and said failure has
occurred subsequent to sending a replication message, determine a
node possessed of said replication message, said replication
message including an indication of a global ordering of access to
said synchronization variable, an indication that a page has
undergone a modification while said synchronization variable was
held, said page of shared memory including memory referenced by
said synchronization variable, and a record of said modification;
and forward said access request to said node determined to be
possessed of said replication message.
14. At a first node in a group of nodes in a distributed shared
memory system, said system implemented using a weak consistency
protocol, a method of recovering from a failure of a second node in
said group, said method comprising: detecting, via a group
membership protocol, said failure in said second node; releasing
each currently held synchronization variable; waiting for each
currently held synchronization variable to be released or expire;
entering a recovery operation, wherein said recovery operation
comprises: sending, to all nodes in said group, an indication of a
global ordering of access to each said synchronization variable
along with an indication of each page that has undergone a
modification while one said synchronization variable was held, and
a record of said modification; receiving from other nodes in said
group a plurality of indications of global ordering of access to
each said synchronization variable currently held by other nodes,
each said indication of global ordering sent with an indication of
each page that has undergone a modification while one said
synchronization variable was held, and a record of said
modification; and subsequent to completion of said sending and
receiving, applying each said received record to a shared memory.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to distributed shared memory
and, in particular, a method for replicating state to result in
reliable distributed shared memory.
BACKGROUND OF THE INVENTION
[0002] Distributed Shared Memory (DSM) has been an active field of
research for a number of years. A variety of sophisticated
approaches have been developed to allow processes on distinct
systems to share a virtual memory address space, but nearly all of
this work has been focussed on enabling shared memory based
parallel scientific applications to be run on distributed systems.
Examples of such scientific applications appear in computational
fluid dynamics, biology, weather simulation and galaxy evolution.
In studying such parallel systems, the principal focus is on
achieving a high degree of performance.
[0003] While the domain of parallel scientific applications is
important, distributed shared memory can also play a valuable role
in the design of distributed applications. The rapid adoption of
distributed object frameworks (e.g., CORBA and Java RMI) is leading
to an increased number of distributed applications, whose
functionality is partitioned into coarse-grained components which
communicate through object interfaces. These distributed object
frameworks are well suited for locating and invoking distributed
functionality, and may transparently provide "failover"
capabilities. Failover is the capability of a system to detect
failure of a component and to transfer operations to another
functioning component. Many distributed applications, however,
require the ability to share simple state (i.e., data) across
distributed components, for which distributed shared memory can
play a role.
[0004] To illustrate the need for the ability to share simple state
across distributed components, consider a typical web-based service
framework which allows new services to be readily added to the
system. Some components of the framework deal with authenticating
the user, establishing a session and presenting the user with a
menu of services. The services are implemented as distinct
distributed components, as are the various components of the
framework itself. In this type of system, a user-session object,
encapsulating information about a user's session with the system,
would represent simple state that must be available to every
component and which is accessed frequently during the handling of
each user request. If such a user-session object were only
accessible through a remote interface, obtaining information such
as the user's "customer-id" would be very expensive. Ideally, the
user-session object would be replicated on nodes where it is
required, and a session identifier would be used to identify each
session.
[0005] As soon as data replication is considered, data consistency
becomes an issue. There are a number of approaches that can be used
for this purpose. As just mentioned, a typical starting point is to
store shared objects on a single server and use remote object
communication to access various fields. When performance is
important, one will typically introduce caching mechanisms to allow
local access to certain objects whenever possible. In practice, ad
hoc caching and consistency schemes are used for this purpose,
individually tailored for each object in question. Given the
complexity and ensuing maintainability issues, such steps are not
undertaken lightly.
[0006] Distributed shared memory (DSM), however, is ideally suited
to this problem domain. Using "weak consistency" DSM techniques,
state can be very efficiently replicated onto nodes where it is
required, with very little additional software complexity. When an
object is first accessed on a node, its data pages are brought onto
the local processor and subsequent accesses occur at memory access
speeds. An underlying DSM layer maintains consistency among the
various copies.
[0007] Weak consistency refers to the way in which shared memory
that is replicated on different nodes is kept consistent. With weak
consistency, accesses to synchronization variables are sequentially
consistent, no access to a synchronization variable is allowed to
be performed until all previous writes have completed everywhere
and no data access (read or write) is allowed to be performed until
all previous accesses to synchronization variables have been
performed (see M. Dubois, C. Scheurich, and F. Briggs, "Memory
Access Buffering in Multiprocessors," International Symposium on
Computer Architecture, 1986, pp. 434-442, incorporated herein by
reference).
[0008] Since existing DSM research has been focussed on parallel
scientific computation, there are a number of issues that have not
been addressed. First, existing DSM systems typically assume that
all the nodes and processes involved in a computation are known in
advance, which is not true of most distributed applications.
Second, many existing DSM systems are not designed to tolerate
failures, either at the node level or the application level, which
will almost certainly occur in any long-running distributed
application. Third, distributed applications will often have
several processes running on a given node, which should be taken
into account in the design of the DSM system. Finally, distributed
applications have a much greater need for general-purpose memory
allocation and reclamation facilities, when one is not dealing with
fixed-sized multidimensional arrays allocated during application
initialisation (which is typical of scientific applications).
[0009] In addition, DSM systems can be augmented to be fault
tolerant by ensuring that all data is replicated to a
parameterizable degree at all times. Although doing so leads to
some level of overhead (on write operations), this cost may be
warranted for some types of data and may still provide much better
performance than storing data in secondary storage (via a
database). By using a fault-tolerant DSM system for all in-memory
critical data, a distributed application can easily be made to be
highly available. A highly available system is one that continues
to function in the presence of faults. However, unlike most
fault-tolerant systems, failures in a highly available system are
not transparent to clients.
[0010] Traditional in-memory data-replication schemes include
primary site replication and active replica replication. In primary
site replication, read and write requests for data are made to a
primary (or master) site, which is responsible for ensuring that
all replicas are kept consistent. If the primary fails, one of the
replicas is chosen as the primary site. In active replica
replication, write requests for data are made to all replica sites,
using an algorithm that ensures that all writes are performed in
the same order on all hosts. Read requests can be made to any
replica.
[0011] Adapting a DSM system for fault tolerance is quite different
than traditional in-memory data replication schemes in that the set
of nodes replicating data are the ones that are actively using it.
In the best case, where a single node is the principal accessor
for an object and where the majority of memory operations are read
operations (a read-mostly object), performance using distributed
lock leasing algorithms approaches that of a local object. A
distributed lock leasing algorithm is an algorithm that allows one
node among a set of nodes to acquire a "lock" on a unit of shared
memory for some period of time. A lock is an example of a
synchronization variable since it synchronizes the modification of
a unit of memory, i.e. ensures the unit is only modified by one
processor at a time. If a node should fail while holding a lock,
the lock is reclaimed and granted to some other node in such a way
that all correctly functioning nodes agree as to the state of the
lock. Read-mostly objects that are actively shared amongst several
nodes may be more costly as more lock requests will involve remote
communications. Most expensive may be highly shared objects that
are frequently modified.
[0012] Some alternative approaches to introducing fault tolerance
have been to make the DSM "recoverable", that is, allowing the
system to be recovered to a previous consistent checkpoint in the
event of failure. This approach is very well suited to long running
parallel scientific applications, where the loss of a partial
computation can be costly. However, in the context of an
interactive (e.g., web-based) application, recovering to an
outdated previous state provides no benefit. As such, a distributed
shared memory system that offers transactional-like guarantees is
desirable.
SUMMARY OF THE INVENTION
[0013] A weak consistency shared memory model is modified to result
in reliable distributed shared memory that ensures that all vital
data structures are properly replicated at all times. Whenever a
node records changes to a unit of shared memory according to a weak
consistency protocol, the node sends to a secondary node vital data
structures related to that change.
[0014] In accordance with an aspect of the present invention there
is provided, at a first node in a distributed shared memory system,
the system implemented using a weak consistency protocol, a method
of replicating state including completing access to a
synchronization variable, and, after completing the access, sending
a message to a second node. The message includes an indication of a
global ordering of access to the synchronization variable, an
indication that a page of shared memory has undergone a
modification, the page of shared memory including memory referenced
by the synchronization variable and a record of the modification.
In further aspects of the invention a processor for carrying out
this method is provided as well as a software medium that permits a
general purpose computer to carry out this method.
[0015] In accordance with an aspect of the present invention there
is provided, at a first node in a distributed shared memory system,
the system implemented using a weak consistency protocol, a method
of replicating state including releasing a lock on a unit of shared
memory and after releasing the lock, sending a message to a second
node. The message includes a vector timestamp, a write notice indicating
that a page of shared memory underwent a modification while the
lock was held and a record of the modification.
[0016] In accordance with a further aspect of the present invention
there is provided a computer data signal including an indication of
a global ordering of access to the synchronization variable, an
indication that a page of shared memory has undergone a
modification, the page of shared memory including memory referenced
by the synchronization variable and a record of the
modification.
[0017] In accordance with another aspect of the present invention
there is provided a method for synchronization variable managing in
a distributed shared memory system, the system implemented using a
weak consistency protocol, the method including receiving an access
request related to a synchronization variable, where the
synchronization variable is for a unit of shared memory,
determining a most recent node to have held the synchronization
variable. If the most recent node to have held the synchronization
variable has failed, and the failure has occurred subsequent to
sending a replication message, the method further includes
determining a node possessed of the replication message, the
replication message including an indication of a global ordering of
access to the synchronization variable, an indication that a page
has undergone a modification while the synchronization variable was
held, the page of shared memory including memory referenced by the
synchronization variable, and a record of the modification. The
method also includes forwarding the access request to the node
determined to be possessed of the replication message. In a further
aspect of the invention a processor, in a node manager, for
carrying out this method is provided. In a still further aspect of
the invention a software medium permits a general purpose computer
to carry out this method.
[0018] In accordance with another aspect of the present invention
there is provided, at a first node in a group of nodes in a
distributed shared memory system implemented using a weak
consistency protocol, a method of recovering from a failure of a
second node in the group including detecting, via a group
membership protocol, the failure in the second node, releasing each
currently held synchronization variable, waiting for each currently
held synchronization variable to be released or expire and entering
a recovery operation. The recovery operation includes sending, to
all nodes in the group, an indication of a global ordering of
access to each synchronization variable along with an indication of
each page that has undergone a modification while one
synchronization variable was held, and a record of the
modification, receiving from other nodes in the group a plurality
of indications of global ordering of access to each synchronization
variable currently held by other nodes, each indication of global
ordering sent with an indication of each page that has undergone a
modification while one synchronization variable was held, and a
record of the modification and, subsequent to completion of the
sending and receiving, applying each received record to a
shared memory.
[0019] Other aspects and features of the present invention will
become apparent to those ordinarily skilled in the art upon review
of the following description of specific embodiments of the
invention in conjunction with the accompanying figures.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] In the figures which illustrate an example embodiment of
this invention:
[0021] FIG. 1 schematically illustrates a distributed shared memory
system.
[0022] FIG. 2 illustrates chronological operation of the lazy
release consistency algorithm.
[0023] FIG. 3 illustrates chronological operation of the lazy
release consistency algorithm with reliability added in accordance
with an embodiment of the present invention.
[0024] FIG. 4 illustrates, in a flow diagram, lock releasing method
steps followed by a node in an embodiment of the invention.
[0025] FIG. 5 illustrates, in a flow diagram, failure recovery
method steps followed by a node in an embodiment of the
invention.
[0026] FIG. 6 illustrates, in a flow diagram, lock acquire request
forwarding method steps followed by a lock manager in an embodiment
of the invention.
[0027] FIG. 7 illustrates, in a flow diagram, diff request
forwarding method steps followed by a lock manager in an embodiment
of the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0028] A variety of memory consistency models have been defined in
the context of parallel computing (see, for a survey thereof, S.
Adve and K. Gharachorloo, "Shared memory consistency models: A
tutorial", Computer, vol. 29, no. 12, pp. 66-76, December 1996, the
contents of which are incorporated herein by reference). A
symmetric shared memory system typically implements sequential
consistency, in that updates made by processors are viewed by other
processors in exactly the same sequence. This type of consistency
may be prohibitively expensive to implement in loosely coupled
distributed systems, however, due to the overhead of propagating
individual writes and ordering these writes globally. As such,
distributed shared memory typically relies on weak consistency
models, where memory is consistent only at well-defined points. It
has been shown that many parallel applications will function
correctly without modification in weak consistency systems. One
well known weak consistency memory model is the lazy release
consistency (LRC) algorithm, found in TreadMarks (see P. Keleher,
A. Cox, S. Dwarkadas and W. Zwaenepoel, "TreadMarks:
Distributed Shared Memory on Standard Workstations and Operating
Systems", Proceedings of the Winter 1994 USENIX Conference, pp.
115-131, 1994, the contents of which are incorporated herein by
reference).
[0029] From "TreadMarks: Distributed Shared Memory on Standard
Workstations and Operating Systems":
[0030] In lazy release consistency (LRC), the propagation of
modifications is postponed until the time of the acquire. At this
time, the acquiring processor determines which modifications it
needs to see according to the definition of RC. To do so, LRC
divides the execution of each process into intervals, each denoted
by an interval index. Every time a process executes a release or an
acquire, a new interval begins and the interval index is
incremented. Intervals of different processes are partially
ordered: (i) intervals on a single processor are totally ordered by
program order, and (ii) an interval on processor p precedes an
interval on processor q if the interval of q begins with the
acquire corresponding to the release that concluded the interval of
p. This partial order can be represented concisely by assigning a
vector timestamp to each interval. A vector timestamp contains an
entry for each processor. The entry for processor p in the vector
timestamp of interval i of processor p is equal to i. The entry for
processor q ≠ p denotes the most recent interval of processor q
that precedes the current interval of processor p according to the
partial order. A processor computes a new vector timestamp at an
acquire according to the pair-wise maximum of its previous vector
timestamp and the releaser's vector timestamp.
[0031] RC requires that, before a processor p may continue past an
acquire, the updates of all intervals with a smaller vector
timestamp than p's current vector timestamp must be visible at p.
Therefore, at an acquire, p sends its current vector timestamp to
the previous releaser, q. Processor q then piggybacks on the
release-acquire message to p, write notices for all intervals named
in q's current vector timestamp but not in the vector timestamp it
received from p.
[0032] A write notice is an indication that a page has been
modified in a particular interval, but it does not contain the
actual modifications. The timing of the actual data movement
depends on whether an invalidate, an update, or a hybrid protocol
is used. TreadMarks currently uses an invalidate protocol: the
arrival of a write notice for a page causes the processor to
invalidate its copy of that page. A subsequent access to that page
causes an access miss, at which time the modifications are
propagated to the local copy.
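
By way of illustration only, the invalidate protocol described in the passage above can be sketched as follows. This is a minimal sketch with invented names (PageTable, on_write_notice); it is not code from TreadMarks or from the present application:

```python
# Invalidate-protocol sketch: a write notice marks the page invalid;
# the modifications themselves are fetched only on the next access miss.
# All names are invented for illustration.

class PageTable:
    def __init__(self):
        self.valid = {}     # page id -> bool (missing means valid)
        self.pending = {}   # page id -> write notices not yet satisfied

    def on_write_notice(self, page_id, notice):
        # Arrival of a write notice invalidates the local copy.
        self.valid[page_id] = False
        self.pending.setdefault(page_id, []).append(notice)

    def on_access_miss(self, page_id, page, fetch_diffs, apply_diff):
        # Access to an invalid page: propagate modifications, revalidate.
        if not self.valid.get(page_id, True):
            for diff in fetch_diffs(self.pending.pop(page_id, [])):
                apply_diff(page, diff)
            self.valid[page_id] = True
```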
[0033] Note that a write notice relates to a page, yet a lock
relates to an individual unit of shared memory, which may be
smaller or larger than a page and may span more than one page. Note
also that a lock is a data structure and that a vector timestamp
relating to a lock is an indication of the global ordering of
access operations (acquire, release) performed on the lock.
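
The interval and vector-timestamp bookkeeping quoted above can be made concrete with a short sketch. This is illustrative only; the class below and its method names are invented, not taken from the application:

```python
# Vector-timestamp sketch: one interval index per processor; an acquire
# merges with the releaser's timestamp by pair-wise maximum.

class VectorTimestamp:
    def __init__(self, num_procs):
        self.entries = [0] * num_procs

    def advance(self, p):
        # Each release or acquire on processor p begins a new interval.
        self.entries[p] += 1

    def merge(self, releaser):
        # Pair-wise maximum of our previous timestamp and the releaser's.
        self.entries = [max(a, b)
                        for a, b in zip(self.entries, releaser.entries)]

# Processor 1 acquires a lock last released by processor 0:
ts0 = VectorTimestamp(3)
ts0.advance(0)        # p0's release ends an interval on p0
ts1 = VectorTimestamp(3)
ts1.merge(ts0)        # p0's interval is now visible at p1
ts1.advance(1)        # p1 begins a new interval of its own
print(ts1.entries)    # [1, 1, 0]
```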
[0034] Illustrated in FIG. 1 is a reliable distributed shared
memory system 100 including four nodes 102, 104, 106, 110 each with
a corresponding processor 112, 114, 116, 120 connected to a
corresponding memory 122, 124, 126, 130 where both have access to a
network 138 through a corresponding network interface 142, 144,
146, 150. One node (110) may act as a lock manager. Note that the
memory shown (122, 124, 126, 130) is physical memory and that
virtual memory is a portion of the physical memory. Virtual memory
is the portion of memory which is of interest to currently running
processes and it is virtual memory that is shared in a distributed
shared memory system.
[0035] In an exemplary manner, processor 112 (of node 102) is shown
as loaded with state replicating software for executing a method of
this invention from software medium 118. Similarly, processor 120
(of lock manager 110) is shown as loaded with lock management
software for executing a method of this invention from software
medium 128. Each of software media 118, 128 may be a disk, a tape,
a chip or a random access memory containing a file downloaded from
a remote source.
[0036] To illustrate the known LRC algorithm, consider, with
reference to FIG. 2, three processors 112, 114, 116 of nodes 102,
104, 106, respectively (FIG. 1) sharing a page 202 of single
distributed shared memory 108 (FIG. 1) that contains two units, U1
and U2, each protected by a lock, L1 and L2, respectively. FIG. 2
depicts a sequence of actions taken by the processors. Initially,
page 202 is considered valid and write protected at all three
processors. All processors can read the units, U1 and U2. Next, at
time T220, processor 112 sends a lock acquire request 20A to lock
manager 110 and receives a reply 22A through which it acquires lock
L1. Roughly at the same time processor 116 acquires lock L2
(through messages 20B and 22B). At time T222, processors 112 and
116 modify the units U1, U2, corresponding to the locks L1, L2. Due
to the modifications to the units U1, U2, and since page 202 was
initially write protected, a page fault occurs at each processor
112, 116. In response to the page fault, a local copy 204, 206 of
page 202 is made in each processor 112, 116. These copies, or
twins, 204, 206 can later be used to determine which portions of
the page have been modified. Once the copies are made, page 202 may
be unprotected at processors 112, 116, allowing reads and writes to
proceed. Later, when the processors release the locks, the fact
that the page has been modified may be recorded in a write
notice.
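
The twinning step can be sketched as follows. The sketch is illustrative only: in a real DSM the copy is triggered by a hardware write fault on the write-protected page, and the protection change (e.g., an mprotect call) is merely stubbed here:

```python
# Twinning sketch: on the first write fault to a write-protected page,
# snapshot the unmodified contents so a diff can be computed later.

twins = {}  # page id -> immutable snapshot taken before modification

def on_write_fault(page_id, page):
    """page is a bytearray holding the current page contents."""
    if page_id not in twins:
        twins[page_id] = bytes(page)   # the "twin"
    unprotect(page_id)                 # allow reads and writes to proceed

def unprotect(page_id):
    # Stand-in for removing write protection in a real DSM system.
    pass
```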
[0037] At time T224, processors 112, 116 release locks L1 and L2 by
sending a "lock release request" 26A, 26B to lock manager 110. Any
processor then may acquire the locks. At time T226 of the example
of FIG. 2, a message 20C, including the current vector timestamp
for node 104 and a lock acquire request for each of L1 and L2, is
sent from processor 114 to lock manager 110. Lock manager 110
forwards lock acquire request for L1 20D to processor 112 and
forwards lock acquire request for L2 20E to processor 116.
Processor 112 sends to processor 114 a write notice 30A for page
202 while processor 116 sends to processor 114 a write notice 30B
for page 202, both at time T228. The write notices cause the copy
of page 202 at processor 114 to be invalidated, as shown at time
T230. Also at time T230, note the state of pages 202A (page 202
with a change to unit U1) and 202B (page 202 with a change to unit
U2).
[0038] When processor 114 subsequently accesses page 202, the
invalidity of page 202 is noted. Processor 114 requests "diffs",
which record the changes in a page, from processors 112, 116, at
time T232 with diff requests 32A, 32B. At time T234, a diff 208 is
computed at processor 112 by comparing current copy 202A of page
202 against its twin 204. Similarly, a diff 210 is computed at
processor 116 by comparing current copy 202B of page 202 against
its twin 206. After diffs 208, 210 have been computed and sent to
processor 114 by diff reply messages 34A, 34B, twins 204, 206 can
be safely discarded.
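
Computing a diff amounts to recording the runs of bytes where the current page differs from its twin; applying the diff patches another copy in place. A minimal sketch, with invented helper names:

```python
# Diff sketch: record (offset, new_bytes) runs where the page differs
# from its twin, then apply those runs to another copy of the page.

def compute_diff(twin, page):
    diff, i, n = [], 0, len(page)
    while i < n:
        if page[i] != twin[i]:
            j = i
            while j < n and page[j] != twin[j]:
                j += 1
            diff.append((i, bytes(page[i:j])))
            i = j
        else:
            i += 1
    return diff

def apply_diff(page, diff):
    for offset, data in diff:
        page[offset:offset + len(data)] = data

twin = bytes(b"the quick brown fox")
page = bytearray(b"the quack brown fax")
other = bytearray(twin)
apply_diff(other, compute_diff(twin, page))
assert other == page
```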
[0039] At time T236, processor 114 receives diffs 208, 210 and
updates page 202 with the modifications made by processors 112 and
116 to result in page 202C. Hence, once processor 114 has received
and applied both diffs, there are three different versions of the
page: pages 202A and 202B at processors 112 and 116, respectively,
each reflecting the changes made to the page locally, and page 202C
at processor 114, containing the changes made by both processors 112
and 116.
[0040] In order for this multiple writer algorithm to work, it is
assumed that overlapping memory regions (units) are not associated
with different locks, since that would cause the diffs to partly
relate to the same addresses and the final state of a shared page
would depend on the order in which the diffs were applied.
[0041] To augment existing distributed shared memory algorithms for
building high availability applications, several issues must be
addressed, two being the following:
[0042] some mechanism must be used to maintain an accurate view as
to the set of nodes participating in a reliable distributed shared
memory (RDSM); and
[0043] to achieve fault tolerance, the DSM data structures, namely
write notices and diffs, have to be replicated, which involves
remote communications every time a processor releases a lock where
data has been modified.
[0044] To address the first of the above issues, a group
communication protocol, such as Isis™ (Stratus Computer of
Marlboro, Massachusetts), Ensemble (Cornell University) or Totem
(University of California, Santa Barbara) can be used. The group
membership protocol ensures that all correctly functioning nodes in
the network share a common view of membership at all times. That
is, all nodes agree as to the set of nodes that are in the group.
Although group communication protocols may be limited in terms of
performance, they need only be invoked when a new node joins a
group or when a failure is detected in communicating with an
existing node. The group communication system may also be used to
recover locks that are in the possession of a failed processor.
[0045] In overview, to address the second of the above issues, the
present invention may be employed. According to the present
invention, following a node releasing a lock, the node sends
information, including its current vector timestamp, any write
notices generated during the time the lock was held and the diffs
corresponding to the write notices, to a secondary node. The
secondary node is preferably the one that requires the lock next,
but in a case in which no node has yet requested the lock, the
secondary node may be the node that last held the lock. The
secondary node may keep this information separate from its own
private data structures, only accessing it, or making it available
to other nodes, if required due to a failure of the primary
node. If a node fails (detected by all nodes simultaneously through
the group membership protocol), then all nodes complete a lock
release code sequence, and enter a recovery operation. During this
recovery operation, all nodes exchange all write notices and
corresponding diffs, including backup write notices and diffs held
by nodes on behalf of the failed node. After all information has
been exchanged, diffs are applied and all nodes may start
fresh.
[0046] To implement reliable distributed shared memory based on the
above lazy release consistency algorithm, we must ensure that all
vital data structures are properly replicated at all times.
Assuming we use a replication factor of two for tolerating a single
point of failure, we may ensure that write notices and their
associated diffs are always stored on two nodes, except during the
recovery operation when one node fails.
[0047] Therefore, the present invention requires that a node, upon
releasing a lock, send to at least one other node a vector
timestamp related to the lock, any write notices generated while
the lock was held and the diffs corresponding to the write notices.
This replication, at a secondary node, of the lock information
(timestamp, write notices and diffs) provides a back up which
allows this information to reach the next node to request the lock,
even if the last node to hold the lock fails.
[0048] To illustrate a reliable lazy release consistency algorithm,
consider, with reference to FIG. 3, processors 112, 114, 116, of
nodes 102, 104, 106, respectively (FIG. 1) sharing memory. Each
unit of the shared memory is associated with a global lock in such
a way that no units overlap. Consider that in advance of time T320
in FIG. 3, processor 112 of node 102 had sent a lock acquire
request to manager 110 for a lock L3 on shared memory unit U3 and
subsequently had sent a lock release request for lock L3. After
completing the lock release request, processor 112 of node 102 had
also sent a message (including a vector timestamp, any write
notices relating to pages of memory changed while the lock was held
and the diffs corresponding to the write notices) to a secondary
node (not shown). At time T320, before accessing shared memory unit
U3, processor 114 of node 104 sends a lock acquire request 20F for
lock L3, to global lock manager 110. Note that lock acquire request
20F includes the current vector timestamp for node 104.
[0049] At time T322, global lock manager 110 forwards, in a message
20G, lock acquire request 20F (including the vector timestamp from
node 104) to the last node to hold lock L3, node 102, which sends,
at time T324, a reply 22C to node 104. Reply 22C includes a write
notice relating to the page of memory that includes shared memory
unit U3. The page of memory is then invalidated in memory 124
corresponding to processor 114. At time T326, when an application
running on processor 114 accesses the page that has been
invalidated, the corresponding diff is requested, via diff request
32C, from node 102. Node 102, having the vector timestamp from node
104, computes a diff and replies, at time T328, with diff reply
message 34C, and the diff is applied to update the page.
[0050] At time T330, after completing access to unit of shared
memory U3, node 104 sends to lock manager 110 a lock release
request 26C (including updated vector timestamp) for lock L3.
Further, processor 114 computes a write notice for this release
operation and any necessary diffs. Node 104 then, at time T332,
sends a replication message 44A (including vector timestamp, write
notice and diffs) to node 106.
[0051] Consider a situation illustrated at time T334 wherein node
104 fails. Through the group membership protocol, nodes 102 and 106
and lock manager 110 learn of the failure. At time T336, upon
learning of the failure, nodes 102 and 106 enter a failure recovery
phase. Initially, nodes 102 and 106 release any currently held
locks with lock release requests 26D, 26E. Each of nodes 102 and
106 also computes write notices and corresponding diffs and sends a
replication message to a secondary node (not shown). Subsequently,
at time T338, node 102 sends all currently held vector timestamps,
write notices and diffs to all nodes in the group via failure
messages 36A, 36B. At time T340, node 106 sends all currently held
vector timestamps, write notices and diffs to all nodes in the
group via failure messages 36C, 36D.
[0052] At time T342, lock manager 110 sends all currently held
vector timestamps, write notices and diffs to all nodes in the
group via failure messages 36E, 36F. Once all nodes have sent and
received all vector timestamps, write notices and diffs, i.e. at
time T344, each node may apply the diffs and update the shared
memory to a condition common to all other nodes. Note that, after
completion of failure recovery, the shared memory includes changes
made by node 104, because node 106 had received a replication
message 44A from node 104 before its failure.
[0053] The actions of a node upon the release of a lock may be
summarized with reference to FIG. 4. The release of the lock (step
402) comprises sending a lock release request to the current lock
manager. Subsequent to the release of the lock, the node computes
write notices and corresponding diffs (step 404). A replication
message including information relating to the lock is then sent to a
secondary node (step 406). This information includes a vector
timestamp, write notices and diffs corresponding to the write
notices.
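
Steps 402 through 406 can be sketched as follows. The message layout and function names below are invented for illustration; the transport and the diff computation are passed in as callables so the sketch stays self-contained:

```python
# Release-path sketch (FIG. 4): release the lock, compute write notices
# and diffs for the interval just ended, then replicate them.

from dataclasses import dataclass, field

@dataclass
class ReplicationMessage:
    lock_id: int
    vector_timestamp: list                    # global ordering of accesses
    write_notices: list                       # pages modified under the lock
    diffs: dict = field(default_factory=dict) # page id -> diff runs

def release_lock(lock_id, vector_timestamp, compute_notices_and_diffs,
                 send_release, send_to_secondary):
    send_release(lock_id, vector_timestamp)              # step 402
    notices, diffs = compute_notices_and_diffs(lock_id)  # step 404
    msg = ReplicationMessage(lock_id, list(vector_timestamp),
                             notices, diffs)
    send_to_secondary(msg)                               # step 406
```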
[0054] The actions of all nodes in a group performed in response to
a failure of a node in the group are outlined in FIG. 5. It is via
a group membership protocol that the nodes detect a failure of a
node in the group (step 502). Upon detection of this failure, all
nodes release their currently held locks (step 504) following the
procedure outlined in FIG. 4. The nodes then enter a failure
recovery phase. The failure recovery phase includes sending a
message (step 506) that includes vector timestamps, write notices
and corresponding diffs held on behalf of other nodes to all other
nodes in the group. Consequently, the failure recovery phase also
includes receiving such messages (step 508) from other nodes in the
group. After all the information has been exchanged, the diffs are
applied (step 510) and the shared memory may be considered to be as
it was before the failure.
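
A sketch of the recovery phase of FIG. 5 follows. The group-communication primitives (broadcast, receive) are assumed rather than implemented, and all names are invented:

```python
# Recovery-phase sketch (FIG. 5): release held locks, exchange all
# replicated state with the group, and only then apply the diffs.

def recover_from_failure(held_locks, release_lock, local_state,
                         broadcast, receive_from_each, pages, apply_diff):
    for lock_id in list(held_locks):          # step 504: release locks
        release_lock(lock_id)
    broadcast(local_state)                    # step 506: send timestamps,
                                              # write notices and diffs
    received = list(receive_from_each())      # step 508: gather the same
                                              # from every other node
    # Step 510: apply diffs only after the full exchange has completed.
    for state in received:
        for page_id, diff in state.diffs.items():
            apply_diff(pages[page_id], diff)
```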
[0055] An alternative to the failure recovery phase approach is
outlined in FIG. 6. In this alternate approach, upon receiving a
lock acquire request (step 602), a lock manager determines the
status of the last node to hold the requested lock (step 604)
through the use of a group membership protocol. If the last node to
hold the requested lock has not failed, the lock acquire request is
forwarded, as is known, to that node (step 606) such that the
requesting node may be provided with write notices. If the last
node to hold the requested lock has failed (after having released
the lock), the lock manager polls the nodes in the group to
determine the identity of a node that holds information replicated
from the last node to hold the requested lock (step 608). Once the
node holding the replicated information is identified, the lock
acquire request is forwarded to that node (step 610) such that the
requesting node may be provided with the necessary write
notices.
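
The forwarding decision of FIG. 6 reduces to a few lines. This sketch assumes the group membership layer supplies has_failed and that poll_for_replica implements the poll of step 608; all names are invented:

```python
# Lock-manager sketch (FIG. 6): forward an acquire to the last holder,
# or, if that node has failed, to a node holding its replicated state.

def forward_acquire(lock_id, request, last_holder, has_failed,
                    poll_for_replica, forward):
    if not has_failed(last_holder):                 # step 604
        forward(last_holder, request)               # step 606
    else:
        replica_holder = poll_for_replica(lock_id)  # step 608
        forward(replica_holder, request)            # step 610
```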
[0056] It may be the case that a last node to hold a requested lock
fails between the time at which it supplied write notices to a lock
acquiring node and the time at which the lock acquiring node
requests the corresponding diffs. In such a case, a lock acquiring
node, knowing of the failure of the last node to hold the requested
lock through the group membership protocol, may send the diff
request to the lock manager. Turning to FIG. 7, the lock manager
receives the diff request (step 702). The lock manager may then
poll, as above, the other nodes in the group to determine the
identity of a node that holds information replicated from the last
node to hold the requested lock (step 704). Once the node holding
the replicated information is identified, the diff request may be
forwarded to that node (step 706).
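
The diff-request path of FIG. 7 reuses the same poll; a correspondingly small sketch (invented names):

```python
# Diff-request sketch (FIG. 7): the failed holder cannot supply the
# diff, so the manager polls for a replica holder and forwards.

def forward_diff_request(lock_id, diff_request, poll_for_replica, forward):
    replica_holder = poll_for_replica(lock_id)      # step 704
    forward(replica_holder, diff_request)           # step 706
```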
[0057] Whichever group membership protocol is used, if the lock
manager fails, the failure is detected and another node in the
group becomes lock manager. Subsequently, upon receiving a lock
request, this new lock manager polls each node for the one that
last held the lock to which the request relates.
[0058] Note that, after acquiring a lock, a process running on a
node may freeze or otherwise fail and thus fail to release the
lock. A properly configured lock manager may maintain a counter
relating to each lock such that after a process freezes, a "time
out" may occur. Such a time out may be flagged to the group
membership protocol as a failure of the node with the frozen
process. As the node has failed without releasing the lock, a
replication message has not been sent to a secondary node.
Consequently, when recovering from the failure, nodes in the group
can only recover information replicated when the lock was last
released, thus any changes made to shared memory by the frozen
process may be lost.
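
A counter of the kind described might be realized as a simple lease table. This sketch is illustrative only; report_failure stands in for whatever interface the group membership protocol actually exposes:

```python
# Lease-timeout sketch: locks held past their lease are treated as
# evidence that the holder has frozen; the holder is reported as failed.

import time

class LeaseTable:
    def __init__(self, lease_seconds, report_failure):
        self.lease_seconds = lease_seconds
        self.report_failure = report_failure   # hook into membership layer
        self.grants = {}                       # lock id -> (holder, time)

    def grant(self, lock_id, holder):
        self.grants[lock_id] = (holder, time.monotonic())

    def release(self, lock_id):
        self.grants.pop(lock_id, None)

    def check(self):
        now = time.monotonic()
        for lock_id, (holder, granted) in list(self.grants.items()):
            if now - granted > self.lease_seconds:
                self.report_failure(holder)    # flag as a node failure
                del self.grants[lock_id]
```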
[0059] As will be apparent to a person skilled in the art, a
replication factor of greater than two may be used. Such a strategy
would increase the reliability of a distributed shared memory
system, at the cost of increased data traffic. With a replication
factor greater than two, there will be more than one node holding
information replicated from the last node to hold the requested
lock. The lock manager need only determine one.
[0060] It is possible to also base reliable distributed shared
memory on types of weak consistency algorithms other than lazy
release consistency, in particular, entry consistency (see B. Bershad, M.
Zekauskas, and W. Sawdon, "The Midway Distributed Shared Memory
System", Proceedings of COMPCOM '93, pp. 528-537, February, 1993,
incorporated herein by reference).
[0061] Other modifications will be apparent to those skilled in the
art and, therefore, the invention is defined in the claims.
* * * * *