U.S. patent application number 11/475716 was filed with the patent office on 2006-06-27 and published on 2007-08-23 as publication number 20070198979, for methods and apparatus to implement parallel transactions.
The invention is credited to David Dice and Nir N. Shavit.
Application Number | 11/475716 |
Publication Number | 20070198979 |
Family ID | 38429749 |
Filed Date | 2006-06-27 |
Publication Date | 2007-08-23 |
United States Patent Application | 20070198979 |
Kind Code | A1 |
Dice; David; et al. |
August 23, 2007 |
Methods and apparatus to implement parallel transactions
Abstract
For each of multiple processes executing in parallel, as long as
corresponding version information associated with a respective set
of one or more shared variables used for computational purposes has
not changed during execution of a respective transaction, results
of the respective transaction can be globally committed to memory
without causing data corruption. If version information associated
with one or more respective shared variables (used to produce the
transaction results) happens to change during a process of
generating respective results, then a respective process can
identify that another process modified the one or more respective
shared variables during execution and that its transaction results
should not be committed to memory. In this latter case, the
transaction repeats itself until it is able to commit respective
results without causing data corruption.
Inventors: | Dice; David; (Foxborough, MA); Shavit; Nir N.; (Cambridge, MA) |
Correspondence Address: | BARRY W. CHAPIN, ESQ.; CHAPIN INTELLECTUAL PROPERTY LAW, LLC; WESTBOROUGH OFFICE PARK; 1700 WEST PARK DRIVE; WESTBOROUGH, MA 01581, US |
Family ID: | 38429749 |
Appl. No.: | 11/475716 |
Filed: | June 27, 2006 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
60775580 | Feb 22, 2006 | |
60775564 | Feb 22, 2006 | |
60789483 | Apr 5, 2006 | |
Current U.S. Class: | 718/100 |
Current CPC Class: | G06F 9/52 20130101; G06F 12/0893 20130101; G06F 9/466 20130101; G06F 9/526 20130101; G06F 12/0806 20130101; G06F 12/0815 20130101 |
Class at Publication: | 718/100 |
International Class: | G06F 9/46 20060101 G06F009/46 |
Claims
1. A method comprising: executing a transaction defined by a
corresponding set of instructions to produce a respective
transaction outcome based on use of at least one shared variable;
in lieu of locking and modifying a given shared variable during
execution of the transaction, initiating a lock on the given shared
variable after producing the respective transaction outcome via use
of locally modified data values, the lock preventing other
processes from modifying a data value associated with the given
shared variable; and after obtaining the lock, initiating a
modification of the data value associated with the given shared
variable even though at least one of the other processes performed
a computation using the data value associated with the given shared
variable before the lock and during execution of the
transaction.
2. A method as in claim 1, wherein executing the transaction
includes: maintaining version information in a locally managed read
set associated with the transaction, the read set not being
accessible by the other processes using the shared variables, the
read set identifying versions associated with each of multiple
shared variables used to generate the respective transaction
outcome, the version information indicating respective versions of
the multiple shared variables at a time when the transaction
retrieves respective data values associated with the multiple
shared variables from a globally accessible repository.
3. A method as in claim 2, wherein executing the transaction
further includes: after acquiring the lock associated with the
given shared variable and before modifying the data value
associated with the given shared variable, verifying that newly
read version information associated with each of the multiple
shared variables used to generate the respective transaction
outcome matches the version information in the locally managed read
set associated with the transaction.
4. A method as in claim 3, wherein the newly read version
information indicates that the data values associated with the
multiple shared variables used to generate the transaction outcome
have not been changed by the other processes during execution of
the transaction to produce the respective transaction outcome.
5. A method as in claim 2, wherein initiating the lock includes:
identifying that another process has a respective lock on the given
shared variable; and utilizing a specified backoff time to acquire
the lock on the given shared variable, the backoff time being a
random value relative to the other processes that also attempt to
acquire the lock associated with the given shared variable.
6. A method as in claim 1, wherein executing the transaction
includes: complying with a respective rule indicating size
limitations associated with the transaction to enhance efficiency
of multiple processes executing different transactions using a same
set of shared variables including the given shared variable to
produce respective transaction outcomes.
7. A method as in claim 1 further comprising: maintaining version
information associated with each of multiple shared variables, the
version information indicating occurrences of data value changes
associated with each of the multiple shared variables; and wherein
initiating the lock on the given shared variable includes: if the
given shared variable was read at any time during execution of the
transaction, atomically: i) acquiring the lock on the shared
variable, and ii) validating that a present version value
associated with the given shared variable matches a previous
version value of the given shared variable when read during
execution of the transaction.
8. A method as in claim 1 further comprising: in response to
identifying that a corresponding data value associated with the at
least one shared variable was modified during execution of the
transaction, aborting the transaction in lieu of modifying the data
value associated with the given shared variable; and initiating
execution of the transaction again to produce the respective
transaction outcome.
9. A method as in claim 1 further comprising: maintaining a locally
managed and accessible write set of data values associated with
each of multiple shared variables that are locally but not globally
modified during execution of the transaction, the local write set
representing data values: i) not yet globally committed and ii) not
yet globally accessible by the other processes.
10. A method as in claim 9 further comprising: after completing
execution of the transaction, initiating locks on each of the
multiple shared variables specified in the write set which were
modified during execution of the transaction, the locks preventing
the other processes from changing data values associated with the
multiple shared variables.
11. A method as in claim 10 further comprising: utilizing a
hash-based filter function during execution of the transaction to
identify whether a corresponding data value associated with a
respective globally accessible variable already exists locally in
the write set and should be modified in lieu of performing a
respective read to globally accessible shared data.
12. A method as in claim 1 further comprising: after the
modification of the data value associated with the given shared
variable in a global environment accessible by the other processes,
incrementing globally accessible version information associated
with the shared variable to indicate that the given shared variable
has been modified.
13. A method as in claim 1 further comprising: initiating a compare
function to verify that the at least one shared variable has not
been modified during execution of the corresponding set of
instructions prior to initiating the lock on the given shared
variable; and aborting execution of the transaction if the at least
one shared variable has been modified.
14. A method as in claim 1, wherein steps of executing the
transaction, initiating the lock, and initiating the modification
are carried out in software, the method further comprising
utilizing hardware transactional memory as an accelerator for
executing the transaction.
15. A method as in claim 1 further comprising: maintaining a
locally managed and accessible write set of data values associated
with each of multiple shared variables that are locally but not
globally modified during execution of the transaction, the local
write set representing data values: i) not yet globally committed
and ii) not yet globally accessible by the other processes;
initiating locks on each of the multiple shared variables specified
in the write set which were modified during execution of the
transaction to prevent the other processes from changing data
values associated with the multiple shared variables; verifying
that respective data values associated with the multiple shared
variables accessed during the transaction have not been globally
modified by the other processes during execution of the transaction
by checking that respective version values associated with the
multiple shared variables have not changed during execution of the
transaction; and after modifying data values associated with the
multiple shared variables, releasing the locks on each of the
multiple shared variables.
16. A method comprising: maintaining segments of information that
are shared by multiple processes executing in parallel; for each of
at least two of the segments, maintaining a corresponding location
to store a respective version value representing a relative version
of a respective segment, the relative version being changed each
time contents of the respective segment is modified; and enabling
the multiple processes to compete and secure an exclusive access
lock with respect to each of the at least two segments to prevent
other processes from modifying a respective locked segment.
17. A method as in claim 16 further comprising: for each of at
least two of the segments, maintaining a corresponding location to
store globally accessible lock information indicating whether one
of the multiple processes executing in parallel has locked a
respective segment for: i) changing a respective data value
therein, and ii) preventing other processes from reading respective
data values from the respective segment; and enabling the multiple
processes to retrieve version information associated with the
respective at least two segments to identify whether contents of a
respective segment have changed over time.
18. A method comprising: in a given process of multiple processes
executing in parallel: maintaining a locally managed write set of
data values associated with globally accessible shared variables,
the locally managed write set accessible only by the given process,
the globally accessible shared variables accessible by the multiple
processes; while executing a transaction including multiple
instructions, modifying data values associated with the locally
managed write set in lieu of modifying the globally accessible
shared variables; and after completion of execution of the
transaction, initiating locks on each of the globally accessible
shared variables specified in the write set in order to: i) prevent
other processes from changing data values associated with
respective locked shared variables and ii) commit data values in
the locally managed write set to the globally accessible shared
variables.
19. A method comprising: performing at least one transactional
access to segments of information in transactional memory that are
shared by multiple processes executing in parallel; and competing
amongst multiple other processes to secure an exclusive access lock
with respect to a segment in the transactional memory to prevent
other processes from modifying a respective locked segment, use of
respective access locks enabling transactional memory to
interoperate with any malloc and free operations.
20. A method as in claim 19 further comprising: utilizing a
hash-based filter function during execution of a respective
transaction to identify whether a corresponding data value
associated with a respective globally accessible variable already
exists locally in a write set, the write set being a scratchpad for
temporarily maintaining data values locally in lieu of modifying
the data values in the transactional memory.
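The commit sequence recited in claims 15 and 18 above (buffer writes in a local write set, lock the written variables, validate the read-set versions, then commit and release) can be illustrated with a minimal sketch. This is an illustrative reconstruction in Python, not code from the application; the names (`SharedVar`, `run_transaction`, `read`, `write`) are hypothetical, and a real implementation would rely on atomic hardware operations rather than Python locks:

```python
import threading

class SharedVar:
    """A globally accessible shared variable with a lock and version information."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

def run_transaction(body):
    """Execute body(read, write) optimistically, then lock, validate, and commit."""
    while True:
        read_set = {}    # shared variable -> version observed at first read
        write_set = {}   # shared variable -> locally buffered (uncommitted) value

        def read(var):
            if var in write_set:               # value already modified locally
                return write_set[var]
            read_set.setdefault(var, var.version)
            return var.value

        def write(var, value):
            write_set[var] = value             # modify locally in lieu of globally

        body(read, write)

        locked = sorted(write_set, key=id)     # fixed order to avoid deadlock
        for var in locked:
            var.lock.acquire()                 # prevent others from changing it
        try:
            # Verify no variable read during the transaction was changed by others.
            if all(var.version == v for var, v in read_set.items()):
                for var, value in write_set.items():
                    var.value = value          # globally commit the buffered value
                    var.version += 1           # advertise the modification
                return
        finally:
            for var in locked:
                var.lock.release()
        # Validation failed: another process interfered; repeat the transaction.

a, b = SharedVar(10), SharedVar(0)
run_transaction(lambda read, write: (write(a, read(a) - 3),
                                     write(b, read(b) + 3)))
```

Because writes are buffered locally until the commit point, other processes never observe a half-finished transfer between `a` and `b`.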
Description
RELATED APPLICATION
[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 60/775,580 (Attorney's docket no. SUN06-02 (060720)p), filed on Feb. 22, 2006, entitled "Transactional Locking," the entire teachings of which are incorporated herein by this reference.
[0002] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-03(060711), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/775,564 (Attorney's docket no. SUN06-01(060711)p), filed on Feb.
22, 2006, entitled "Switching Between Read-Write Locks and
Transactional Locking," the entire teachings of which are
incorporated herein by this reference.
[0003] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-06(060908), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/789,483 (Attorney's docket no. SUN06-05(060908)p), filed on Apr.
5, 2006, entitled "Globally Versioned Transactional Locking," the
entire teachings of which are incorporated herein by this
reference.
[0004] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-08(061191), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/775,564 (Attorney's docket no. SUN06-01(060711)p), filed on Feb.
22, 2006, entitled "Switching Between Read-Write Locks and
Transactional Locking," the entire teachings of which are
incorporated herein by this reference.
BACKGROUND
[0005] There has been an ongoing trend in the information
technology industry to execute software programs more quickly. For
example, there are various conventional advancements that provide
for increased execution speed of software programs. One technique
for increasing execution speed of a program is called parallelism.
Parallelism is the practice of executing or performing multiple
things simultaneously. Parallelism can be possible on multiple
levels, from executing multiple instructions at the same time, to
executing multiple threads at the same time, to executing multiple
programs at the same time, and so on. Instruction Level Parallelism
or ILP is parallelism at the lowest level and involves executing
multiple instructions simultaneously. Processors that exploit ILP
are typically called multiple-issue processors, meaning they can
issue multiple instructions in a single clock cycle to the various
functional units on the processor chip.
[0006] There are different types of conventional multiple-issue
processors. One type of multiple-issue processor is a superscalar
processor in which a sequential list of program instructions is
dynamically scheduled. A respective processor determines which
instructions can be executed on the same clock cycle, and sends
them out to their respective functional units to be executed. This
type of multi-issue processor is called an in-order-issue processor
since issuance of instructions is performed in the same sequential
order as the program sequence, but issued instructions may complete
at different times (e.g., short instructions requiring fewer cycles
may complete before longer ones requiring more cycles).
[0007] Another type of multi-issue processor is called a VLIW (Very
Long Instruction Word) processor. A VLIW processor depends on a
compiler to do all the work of instruction reordering and the
processor executes the instructions that the compiler provides as
fast as possible according to the compiler-determined order. Other
types of multi-issue processors issue instructions out of order, meaning the issue order need not match the order in which the instructions appear in the program.
[0008] Conventional techniques for executing instructions using ILP
can utilize look-ahead techniques to find a larger amount of
instructions that can execute in parallel within an instruction
window. Looking-ahead often involves determining which instructions
might depend upon others during execution for such things as shared
variables, shared memory, interference conditions, and the like.
When scheduling, a handler associated with the processor detects a
group of instructions that do not interfere or depend on each
other. The processor can then issue execution of these instructions
in parallel thus conserving processor cycles and resulting in
faster execution of the program.
[0009] One type of conventional parallel processing involves the use of coarse-grained locking. As its name suggests, coarse-grained locking uses lockouts to prevent conflicting groups of code running in different processes from operating on shared data at the same time. Accordingly, this technique enables non-conflicting transactions or sets of instructions to execute in parallel.
[0010] Another type of conventional parallel processing involves the use of fine-grain locking. As its name suggests, fine-grain locking uses lockouts to prevent conflicting instructions from being executed simultaneously. This technique enables non-conflicting instructions to execute in parallel.
SUMMARY
[0011] Conventional applications that support parallel processing
can suffer from a number of deficiencies. For example, although
easy to implement from the perspective of a software developer,
coarse-grained locking techniques provide very poor performance
because they can severely limit parallelism. Although fine-grain
lock-based concurrent software can perform exceptionally well
during run-time, developing such code can be a very difficult task for software developers.
[0012] Techniques discussed herein deviate from conventional approaches such as those discussed above, as well as from other techniques known in the prior art. For example, embodiments
herein include techniques for enhancing performance associated with
transactions executing in parallel.
[0013] In general, a transactional memory programming technique
according to embodiments herein provides an alternative type of
"lock" method over the conventional techniques as discussed above.
For example, one embodiment herein involves use and/or maintenance
of version information indicating whether any of multiple
"globally" shared variables has been modified during a course of
executing a respective transaction (e.g., a set of software
instructions initiating a respective computation). Any one of
multiple possible processes executing in parallel can update
respective version information associated with a globally shared
variable (e.g., a shared variable accessible by any of multiple
processes) in order to indicate that the shared variable has been
modified. Accordingly, other processes keeping track of the version information during execution of their own respective transactions can identify if and when any shared variables have been modified during a window of use. If any critical
variables have been modified, a respective process can prevent
corresponding computational results from being committed to
memory.
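The version-tracking scheme just described can be sketched as follows. This is an illustrative Python sketch; the class and method names (`VersionedVar`, `read`, `update`) are hypothetical, not taken from the application:

```python
class VersionedVar:
    """A shared variable paired with version information (a change counter)."""
    def __init__(self, value):
        self._value = value
        self._version = 0

    def read(self):
        # A process records the version alongside the value it reads.
        return self._value, self._version

    def update(self, value):
        # Any process that modifies the variable also updates its version,
        # so processes tracking version information can detect the change.
        self._value = value
        self._version += 1

x = VersionedVar(5)
_, v_seen = x.read()   # a transaction reads x and records its version
x.update(6)            # another process modifies the shared variable
_, v_now = x.read()
modified_during_window = (v_now != v_seen)   # change is now detectable
```

A process that observes `modified_during_window` set would refrain from committing results computed from the stale value.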
[0014] That is, for each of multiple processes executing in
parallel, as long as version information associated with a
respective set of one or more shared variables used for
computational purposes has not changed during execution of a
respective transaction, results of the respective transaction can
be committed globally without causing data corruption by one or
more processes simultaneously using the shared variable. If version
information associated with one or more respective shared variables
(used to produce the transaction results) happens to change during
a process of generating respective results, then a respective
process can identify that another process modified the one or more
respective shared variables during execution and prevent global
committal of the respective results. In this latter case, the
transaction can repeat itself (e.g., execute again or retry) until
the process is able to commit respective results without causing
data corruption. In this way, each of multiple processes executing
in parallel can "blindly" initiate computations using the shared
variables even though there is a chance that another process
executing in parallel modifies a mutually used shared variable and
prevents the process from globally committing its results.
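The retry behavior described above (compute "blindly", commit only if the version is unchanged, otherwise execute the transaction again) might be sketched like this, using a single shared variable for brevity; all names are hypothetical:

```python
class Var:
    """Minimal shared variable with version information."""
    def __init__(self, value):
        self.value, self.version = value, 0
    def update(self, value):
        self.value, self.version = value, self.version + 1

def commit_when_unchanged(var, compute):
    """Repeat the transaction until its results commit without interference."""
    attempts = 0
    while True:
        attempts += 1
        value, version = var.value, var.version   # snapshot at transaction start
        result = compute(value)                   # "blind" computation
        if var.version == version:                # no one modified var meanwhile
            var.update(result)                    # globally commit the result
            return result, attempts
        # Version changed during execution: result is discarded and retried.

x = Var(10)
state = {"calls": 0}
def double(v):
    state["calls"] += 1
    if state["calls"] == 1:
        x.update(11)   # simulate another process interfering mid-transaction
    return v * 2

result, attempts = commit_when_unchanged(x, double)
```

The first attempt is aborted by the simulated interference; the second attempt reads the updated value 11 and commits 22.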
[0015] In view of the specific embodiment discussed above, more
general embodiments herein are directed to maintaining version
information associated with shared variables. In one embodiment, a
computer environment includes segments of information (e.g., groupings, sections, portions, etc. of a repository for storing data values associated with one or more variables) that are shared
by multiple processes executing in parallel. For each of at least
two of the segments, the computer environment includes a
corresponding location to store a respective version value (e.g.,
version information) representing a relative version of a
respective segment. A relative version associated with a segment is
changed or updated by a respective process each time any contents
(e.g., data values of one or more respective shared variables) in a
respective segment have been modified. Accordingly, other processes
keeping track of version information associated with a respective
segment can identify if and when contents of the respective segment
have been modified.
[0016] In one embodiment, one or more processes in the computer
environment can use contents stored in the one or more segments to
generate new data values for storage in a segment. A respective
process can initiate modification of a data value associated with a
shared variable. For example, in one embodiment, the processes can
compete to secure an exclusive access lock with respect to each of
multiple segments to prevent other processes from modifying a
respective locked segment. Locking of a segment (e.g., a single or
multiple shared variables) can prevent two or more processes from
modifying a same data segment. Locking of a segment also may notify other processes that they should not use the contents of the respective segment for a current transaction and/or that previous computations associated with a current transaction must be aborted.
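The competition for an exclusive per-segment lock described above can be sketched with a non-blocking acquire: whichever process wins may modify the segment, while losers must abort or retry. This is an illustrative Python sketch (a real implementation would typically use an atomic compare-and-swap on a lock word rather than a Python lock):

```python
import threading

class Segment:
    """A lockable segment of shared data (e.g., one or more shared variables)."""
    def __init__(self, data):
        self.data = data
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking attempt; at most one competing process succeeds.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

segment = Segment({"balance": 100})
winners = []

def contend(pid):
    if segment.try_acquire():
        winners.append(pid)   # only the winner may modify segment.data
    # Losers see the segment as locked and must abort or retry their transaction.

threads = [threading.Thread(target=contend, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since the winner never releases the lock here, exactly one of the four competing threads succeeds.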
[0017] According to further embodiments, a computer environment can
be configured to maintain, for each of multiple segments of shared
data, a corresponding location to store globally accessible lock
information indicating whether one of the multiple processes
executing in parallel has locked a respective segment for: i)
changing one or more respective data values therein, and ii)
preventing other processes from reading respective data values from
the respective segment. In other words, acquiring a lock on a
segment prevents other processes from accessing data values in the
locked segment.
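One common way to realize a per-segment location holding both lock information and version information is a single word whose high bit is the lock indicator and whose remaining bits hold the version. This encoding is an assumption made for illustration, not a detail taken from the application:

```python
LOCK_BIT = 1 << 63   # high bit: lock indicator; low bits: version information

class SegmentWord:
    """One globally accessible word per segment holding lock + version state."""
    def __init__(self):
        self.word = 0                   # unlocked, version 0

    def is_locked(self):
        return bool(self.word & LOCK_BIT)

    def version(self):
        return self.word & ~LOCK_BIT

    def lock(self):
        assert not self.is_locked()     # caller must have won the race
        self.word |= LOCK_BIT           # other processes now see it locked

    def unlock_and_bump(self):
        # Release the lock and publish an incremented version in one store,
        # telling readers that the segment's contents have changed.
        self.word = self.version() + 1

w = SegmentWord()
w.lock()
seen_locked = w.is_locked()   # a reader observing this would back off
w.unlock_and_bump()
```

Packing both fields into one word lets a reader sample the lock state and the version in a single load.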
[0018] Additionally, the computer environment can enable the
multiple processes to maintain (e.g., store, retrieve, use, etc.)
version information associated with the respective multiple
segments to identify whether contents of a respective segment have
changed over time. For example, a computer environment can include
globally accessible version information enabling a respective one
of the processes to modify respective version value information
associated with shared variables. The version value information represents a relative version of a given segment; a respective process sets it to a new unique value to indicate that it modified a data value associated with the given segment.
[0019] As a more specific example, a first process can retrieve a
data value associated with a shared variable as well as retrieve a
current version value associated with the shared variable when the
shared variable is accessed. The first process stores the version
value associated with the shared variable and then can perform
computations (e.g., a transaction) using the shared variable. Prior
to globally committing results associated with the transaction, the
first process can verify that no other process modified the shared
variable by checking current version information associated with
the shared variable. If the version information associated with one
or more shared variables at a committal phase of the transaction
matches corresponding originally obtained version information
associated with the one or more shared variables during an
execution phase of the transaction, then the first process can
globally commit results of the transaction to memory.
Alternatively, the first process can abort and repeat a transaction
until it is able to complete without interference. If and when the first process is able to globally commit its results from a respective transaction to memory, then the first process updates
version information associated with any data values (or segments)
that are modified during the commit phase. Accordingly, a second
process (or multiple other processes) can identify if and when a
data value associated with the one or more shared variables changes
and prevent or initiate its own global committal depending on
current processing circumstances.
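The first-process behavior just described, recording versions at read time, re-checking them at the committal phase, and either committing or aborting, might look like this hypothetical Python sketch:

```python
class Shared:
    """Shared variable plus globally accessible version information."""
    def __init__(self, value):
        self.value, self.version = value, 0

def execute_and_commit(inputs, transaction):
    """Run transaction on a snapshot; commit only if versions still match."""
    observed = {v: v.version for v in inputs}        # execution-phase versions
    results = transaction({v: v.value for v in inputs})
    # Committal phase: verify no other process modified the inputs meanwhile.
    if all(v.version == observed[v] for v in inputs):
        for var, new_value in results.items():
            var.value = new_value
            var.version += 1   # a second process can now detect the change
        return True
    return False               # abort; the caller repeats the transaction

a, b = Shared(2), Shared(3)
ok = execute_and_commit([a, b], lambda vals: {a: vals[a] + vals[b]})

def interfering(vals):
    b.version += 1             # simulate another process modifying b mid-run
    return {a: vals[a] * 10}

ok2 = execute_and_commit([a, b], interfering)   # fails validation, aborts
```

The first call commits (`a` becomes 5 and its version advances); the second is aborted because `b`'s version no longer matches the one recorded during execution, leaving `a` untouched.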
[0020] Techniques herein are well suited for use in applications
such as those supporting parallel processing and use of shared
data. However, it should be noted that configurations herein are
not limited to such use and thus configurations herein and
deviations thereof are well suited for use in other environments as
well.
[0021] In addition to the embodiments discussed above, other
embodiments herein include a computerized device (e.g., a host
computer, workstation, etc.) configured to support the techniques
disclosed herein such as supporting parallel execution of transactions performed by different processes. In such embodiments,
a computer environment includes a memory system, a processor (e.g.,
a processing device), a respective display, and an interconnect
connecting the processor and the memory system. The interconnect
can also support communications with the respective display (e.g.,
display screen or display medium). The memory system is encoded
with an application that, when executed on the processor, supports
parallel processing according to techniques herein.
[0022] Yet other embodiments of the present disclosure include
software programs to perform the method embodiment and operations
summarized above and disclosed in detail below in the Detailed
Description section of this disclosure. More specifically, one
embodiment herein includes a computer program product (e.g., a
computer-readable medium). The computer program product includes
computer program logic (e.g., software instructions) encoded
thereon. Such computer instructions can be executed on a
computerized device to support parallel processing according to
embodiments herein. For example, the computer program logic, when
executed on at least one processor associated with a computing
system, causes the processor to perform the operations (e.g., the
methods) indicated herein as embodiments of the present disclosure.
Such arrangements as further disclosed herein can be provided as
software, code and/or other data structures arranged or encoded on
a computer readable medium such as an optical medium (e.g.,
CD-ROM), floppy or hard disk, or other medium such as firmware or
microcode in one or more ROM or RAM or PROM chips or as an
Application Specific Integrated Circuit (ASIC). The software or
firmware or other such configurations can be installed on a
computerized device to cause one or more processors in the
computerized device to perform the techniques explained herein.
[0023] Yet another more particular technique of the present
disclosure is directed to a computer program product that includes
a computer readable medium having instructions stored thereon to facilitate use of shared information among multiple processes.
The instructions, when carried out by a processor of a respective
computer device, cause the processor to perform the steps of: i)
executing a transaction defined by a corresponding set of
instructions to produce a respective transaction outcome based on
use of at least one shared variable; ii) after producing the
respective transaction outcome, initiating a lock on a given shared
variable to prevent other processes from modifying a data value
associated with the given shared variable; and iii) initiating a
modification of the data value associated with the given shared
variable based on the respective transaction outcome even though at
least one of the other processes performed a computation using the
data value associated with the given shared variable before the
lock. Other embodiments of the present application include software
programs to perform any of the method embodiment steps and
operations summarized above and disclosed in detail below.
[0024] It is to be understood that the system of the invention can
be embodied as a software program, as software and hardware, and/or
as hardware alone. Example embodiments of the invention may be
implemented within computer systems, processors, and computer
program products and/or software applications manufactured by Sun
Microsystems Inc. of Palo Alto, Calif., USA.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The foregoing and other objects, features, and advantages of
the present application will be apparent from the following more
particular description of preferred embodiments of the present
disclosure, as illustrated in the accompanying drawings in which
like reference characters refer to the same parts throughout the
different views. The drawings are not necessarily to scale, with
emphasis instead being placed upon illustrating the embodiments,
principles and concepts.
[0026] FIG. 1 is a diagram illustrating a computer environment
enabling multiple processes to access shared variable data
according to embodiments herein.
[0027] FIG. 2 is a diagram illustrating maintenance and use of
version and lock information associated with shared data according
to embodiments herein.
[0028] FIG. 3 is a diagram of a sample process including a read-set
and write-set according to embodiments herein.
[0029] FIG. 4 is a diagram of a flowchart illustrating execution of
a transaction according to an embodiment herein.
[0030] FIG. 5 is a diagram of a flowchart illustrating execution of
a transaction according to embodiments herein.
[0031] FIG. 6 is a diagram of a sample architecture supporting
shared use of data according to embodiments herein.
[0032] FIG. 7 is a diagram of a flowchart according to an
embodiment herein.
[0033] FIG. 8 is a diagram of a flowchart according to an
embodiment herein.
[0034] FIG. 9 is a diagram of a flowchart according to an
embodiment herein.
DETAILED DESCRIPTION
[0035] For each of multiple processes executing in parallel, as
long as corresponding version information associated with a
respective set of one or more shared variables used for
computational purposes has not changed during execution of a
respective transaction, results of the respective transaction can
be globally committed to memory without causing data corruption. If
version information associated with one or more corresponding
shared variables (used to produce the transaction results for the
respective transaction) happens to change thus indicating that
another process modified shared data used to generate results
associated with the respective transaction, then results associated
with the respective transaction are not committed to memory for
global access. In this latter case, the respective transaction
repeats itself until the respective transaction is able to commit
respective results without causing potential data corruption as a
result of data changing during execution of the respective
transaction.
[0036] FIG. 1 is a block diagram of a computer environment 100
according to an embodiment herein. As shown, computer environment
100 includes shared data 125 and corresponding metadata 135 (e.g.,
in a respective repository) that is globally accessible by multiple
processes 140 such as process 140-1, process 140-2, . . . process
140-M. In one embodiment, each of processes 140 is a processing
thread. Metadata 135 enables each of processes 140 to identify
whether portions of shared data 125 have been "locked" and/or
whether any portions of shared data 125 have changed during
execution of a respective transaction.
[0037] Each of processes 140 includes a respective read-set 150 and
write-set 160 for storing information associated with shared data
used to carry out computations with respect to a transaction. For
example, process 140-1 includes read-set 150-1 and write-set 160-1
to carry out a respective one or more transactions associated with
process 140-1. Process 140-2 includes read-set 150-2 and write-set
160-2 to carry out a respective transaction associated with process
140-2. Process 140-M includes read-set 150-M and write-set 160-M to
carry out one or more transactions associated with process
140-M.
[0038] Transactions executed by respective processes 140 can be
defined by one or more instructions of software code. Accordingly,
each of processes 140 can execute a respective set of instructions
to carry out a respective transaction. In one embodiment, the
transactions executed by the processes 140 come from the same
overall program or application running on one or more computers.
Alternatively, the processes 140 execute transactions associated
with different programs.
[0039] In the context of a general embodiment herein such as
computer environment 100 in which multiple processes 140 (e.g.,
processing threads) execute transactions in parallel, each of
processes 140 accesses shared data 125 to generate computational
results (e.g., transaction results) that are eventually committed
for storage in a respective repository storing shared data 125.
Shared data 125 is considered to be globally accessible because
each of the multiple processes 140 can access the shared data
125.
[0040] Each of processes 140 can store data values locally that are
not accessible by the other processes 140. For example, process
140-1 can globally access a data value and store a respective copy
locally in write-set 160-1 that is not accessible by any of the
other processes. During execution of a respective transaction, the
process 140-1 is able to locally modify the data value in its
write-set 160. Accordingly, one purpose of write-set 160 is to
store globally accessed data that is modified locally.
[0041] As will be discussed later in this specification, the
results of executing the respective transaction can be globally
committed back to a respective repository storing shared data 125
depending on whether globally accessed data values happened to
change during the course of the transaction executed by process
140-1. In general, a respective read-set 150-1 associated with each
process stores information for determining which shared data 125
has been accessed during a respective transaction and whether any
respective data values associated with globally accessed shared
data 125 happens to change during execution of a respective
transaction.
[0042] In one embodiment, each of one or more processes 140
complies with a respective rule or set of rules indicating
transaction size limitations associated with the parallel
transactions. This enhances efficiency when multiple processes
execute different transactions using a same set of shared variables
(including the given shared variable) to produce respective
transaction outcomes. For example, each transaction can be limited
to a certain number of lines of code, a certain number of data
value modifications, a time limit, etc., so that potentially
competing transactions do not end up in a deadlock.
[0043] As will be further discussed, embodiments herein include: i)
maintaining a locally managed and accessible write set of data
values associated with each of multiple shared variables that are
locally modified during execution of the transaction, the local
write set representing data values not yet a) globally committed
and b) accessible by the other processes; ii) initiating locks on
each of the multiple shared variables specified in the write set
which were locally modified during execution of the transaction to
prevent the other processes from changing data values associated
with the multiple shared variables to be modified; iii) verifying
that respective data values associated with the multiple shared
variables accessed during the transaction have not been globally
modified by the other processes during execution of the transaction
by checking that respective version values associated with the
multiple shared variables have not changed during execution of the
transaction; and iv) after modifying data values associated with
the multiple shared variables, releasing the locks on each of the
multiple shared variables.
[0044] FIG. 2 is a diagram illustrating shared data 125 and
corresponding metadata 135 according to embodiments herein. As
shown, shared data 125 can be partitioned to include segment 210-1,
segment 210-2, . . . , segment 210-J. A respective segment of
shared data 125 can be a resource such as a single variable, a set
of variables, an object, a stripe, a portion of memory, etc.
Metadata 135 includes respective version information 220 and lock
information 230 associated with each corresponding segment 210 of
shared data 125. In one embodiment, version information 220 is a
multi-bit value that is incremented each time a respective process
140 modifies contents of a corresponding segment 210 of shared data
125. The lock information 230 and version information 220 can make
up a single 64-bit word.
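To make this layout concrete, the following Python sketch packs the lock information 230 and version information 220 into a single word, using the low bit as the lock flag and the remaining bits as the version counter. This encoding is illustrative only; the patent does not prescribe a specific bit layout, and the function names are assumptions.

```python
LOCK_BIT = 1        # bit 0: 1 = segment locked, 0 = free
VERSION_SHIFT = 1   # bits 1..63: version counter

def make_word(version, locked=False):
    """Pack a version number and a lock flag into one 64-bit word."""
    return (version << VERSION_SHIFT) | (LOCK_BIT if locked else 0)

def version_of(word):
    """Extract the version counter from the packed word."""
    return word >> VERSION_SHIFT

def is_locked(word):
    """Test the lock bit of the packed word."""
    return (word & LOCK_BIT) != 0

# Example: version 1326, unlocked; then version 1327, locked.
w = make_word(1326)
assert version_of(w) == 1326 and not is_locked(w)
w = make_word(1327, locked=True)
assert version_of(w) == 1327 and is_locked(w)
```

Because both fields share one word, a process can read the version and the lock state in a single load, which matters for the atomic validation discussed later.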
[0045] In one embodiment, each of processes 140 (e.g., software)
need not be responsible for updating the version information 220.
For example, a monitor function, separate from or integrated with
processes 140, automatically initiates a change to version
information 220 each time contents of a respective segment are
modified.
[0046] As an example, assume that process 140-2 (e.g., a software
processing entity) modifies contents of segment 210-1 during a
commit phase of a respective executed transaction. Prior to
committing transaction results globally to shared data 125, process
140-2 would read and store version information 220-1 associated
with segment 210-1 (i.e., the corresponding shared variable). After modifying contents of
segment 210-1 during the commit phase, the process 140-2 would
modify the version information 220-1 in metadata 135 to a new
value. More specifically, prior to modifying segment 210-1, the
version information 220-1 may have been a count value of 1326.
After modifying segment 210-1, the process 140-2 updates (e.g.,
increments) the version information 220-1 to be a count value of
1327. Each of the processes 140 performs a similar updating of
corresponding version information 220 each time a respective
process 140 modifies a respective segment 210 of shared data 125.
Accordingly, the processes can monitor the version information
220-1 to identify when changes have been made to a respective
segment 210 of shared data 125.
[0047] Note that metadata 135 also maintains lock information 230
associated with each respective segment 210 of shared data 125. In
one embodiment, the lock information 230 associated with each
segment 210 is a globally accessible single bit indicating whether
one of processes 140 currently has "locked" a corresponding segment
for purposes of modifying its contents. For example, a respective
process such as process 140-1 can set the lock information 230-J to
a logic one indicating that segment 210-J has been locked for use.
Other processes know that contents of segment 210-J should not be
accessed, used, modified, etc. during the lock phase initiated by
process 140-1. Upon completing a respective modification to
contents of segment 210-J, process 140-1 sets the lock information
230-J to a logic zero. All processes 140 can then compete again to
obtain a lock with respect to segment 210-J.
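The set-and-clear lock protocol just described can be sketched as follows. This is a hypothetical single-threaded model; a real implementation would use an atomic compare-and-swap so that two competing processes 140 cannot both observe the lock bit as clear.

```python
def try_lock(cell):
    """Attempt to set the lock bit of the packed word in cell[0].
    Returns False if another process already holds the lock."""
    word = cell[0]
    if word & 1:            # lock bit already set (logic one)
        return False
    cell[0] = word | 1      # acquire: set lock bit to logic one
    return True

def unlock(cell):
    """Clear the lock bit (logic zero) so other processes can
    compete again for the lock."""
    cell[0] &= ~1

# Example: a word holding version 1326, initially unlocked.
word = [1326 << 1]
assert try_lock(word)        # first process wins the lock
assert not try_lock(word)    # others see the segment as locked
unlock(word)
assert try_lock(word)        # the lock can be won again after release
```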
[0048] FIG. 3 is a diagram more particularly illustrating details
of respective read-sets 150 and write-sets 160 associated with
processes 140 according to embodiments herein. As shown, process
140-1 executes transaction 351 (e.g., a set of software
instructions). Read-set 150-1 stores retrieved version information
320-1, retrieved version information 320-2, . . . , retrieved
version information 320-K associated with corresponding data values
(or segments) accessed from shared data 125 during execution of
transaction 351. Accordingly, the process 140-1 can keep track of
version information associated with any globally accessed data.
[0049] Write-set 160-1 stores shared variable identifier
information 340 (e.g., address information, variable identifier
information, etc.) for each respective globally shared variable
that is locally modified during execution of the transaction 351.
Local modification involves maintaining and modifying locally used
values of shared variables in write-set 160-1 rather than actually
modifying the global variables during execution of transaction 351.
As discussed above and as will be further discussed, the process
140-1 attempts to globally commit information in write-set 160-1 to
shared data 125 upon completion of transaction 351. In the context
of the present example, process 140-1 maintains write-set 160-1 to
include i) shared variable identifier information 340-1 (e.g.,
segment or variable identifier information) of a respective
variable accessed from shared data 125 and corresponding locally
used value of shared variable 350-1, ii) shared variable identifier
information 340-2 (e.g., segment or variable identifier
information) of a variable or segment accessed from shared data 125
and corresponding locally used value of shared variable 350-2, and
so on. Accordingly, process 140-1 uses write-set 160-1 as a
scratch-pad to carry out execution of transaction 351 and keep
track of locally modified variables and corresponding identifier
information.
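The read-set 150 and write-set 160 scratch-pads can be modeled as simple maps, as in this hypothetical Python sketch (the class and attribute names are illustrative, not taken from the patent):

```python
class Transaction:
    """Per-process scratch-pad state for one transaction."""
    def __init__(self):
        self.read_set = {}    # variable id -> version seen at first read
        self.write_set = {}   # variable id -> locally modified value

    def record_read(self, var_id, version):
        # Remember the version observed when the shared variable was
        # first fetched; later reads keep the original entry.
        self.read_set.setdefault(var_id, version)

    def record_write(self, var_id, value):
        # Buffer the new value locally; shared data 125 is not
        # touched until the commit phase.
        self.write_set[var_id] = value

txn = Transaction()
txn.record_read("x", 1326)
txn.record_write("x", 42)
assert txn.read_set["x"] == 1326 and txn.write_set["x"] == 42
```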
[0050] FIG. 4 is a flowchart illustrating a more specific use of
read-sets 150, write-sets 160, version information 220, and lock
information 230 according to embodiments herein. In general,
flowchart 400 indicates how each of multiple processes 140 utilizes
read-sets 150 and write-sets 160 while carrying out a
respective transaction.
[0051] Step 405 indicates a start of a respective transaction. As
previously discussed, a transaction can include a set of software
instructions indicating how to carry out one or more computations
using shared data 125.
[0052] In step 410, a respective process 140 executes an
instruction associated with the transaction identifying a specific
variable in shared data 125.
[0053] In step 415, the respective process checks whether the
variable exists in its respective write-set 160. If the variable
already exists in its respective write-set 160, then processing
continues at step 440 in which the respective process 140 fetches a
locally maintained value from its write-set 160.
[0054] If a locally stored data value associated with the variable
does not already exist in its respective write-set 160 (e.g.,
because the variable has not yet been fetched and/or modified locally)
as identified in step 415, then processing continues at step 420 in
which the respective process 140 attempts to globally fetch a data
value associated with the variable based on a respective access to
shared data 125. For example, as further indicated in step 425, the
process 140 checks whether the variable to be globally fetched is
locked by another process. As previously discussed, another process
may lock variables, segments, etc. of shared data 125 to prevent
others from accessing the variables. Globally accessible lock
information 230 (e.g., a single bit of information) in metadata 135
indicates which variables have been locked for use.
[0055] If an active lock is identified in step 425, the respective
process initiates step 430 to abort and retry a respective
transaction or initiate execution of a so-called back-off function
to access the variable. In the latter instance, the back-off
function can specify a random or fixed amount of time for the
process to wait before attempting to read the variable again with
hopes that a lock will be released. The respective lock on the
variable may be released by the time of a second or subsequent
attempt to read the variable.
[0056] If no lock is present on the variable during execution of
step 425, the respective process initiates step 435 to globally
fetch a data value associated with the specified variable from
shared data 125. In addition to globally accessing the data value
associated with the shared variable, the respective process
retrieves version information 220 associated with the globally
fetched variable. The process stores retrieved version information
associated with the variable in its respective read-set 150 for
later use during a commit phase.
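Steps 415 through 440 can be sketched as a single read routine. This is a hypothetical Python illustration: the `meta` mapping (variable id to a (version, locked) pair) and all function names are assumptions, not structures defined by the patent.

```python
class AbortTransaction(Exception):
    """Raised to abort and retry when a lock conflict is detected."""

def txn_read(var_id, write_set, read_set, shared_data, meta):
    """Transactional read: prefer the local write-set copy (steps
    415/440); otherwise fetch globally, aborting if the variable is
    locked (steps 425/430) and recording its version (step 435)."""
    if var_id in write_set:                  # local copy wins
        return write_set[var_id]
    version, locked = meta[var_id]
    if locked:                               # another process holds it
        raise AbortTransaction(var_id)
    read_set.setdefault(var_id, version)     # remember version for commit
    return shared_data[var_id]

# Example: first read fetches globally; after a local write, the
# write-set copy is returned instead.
shared = {"x": 10}
meta = {"x": (1326, False)}
ws, rs = {}, {}
assert txn_read("x", ws, rs, shared, meta) == 10
assert rs["x"] == 1326
ws["x"] = 11
assert txn_read("x", ws, rs, shared, meta) == 11
```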
[0057] In step 445, the respective process utilizes the fetched
data value associated with the variable to carry out one or more
computations associated with the transaction. Based on the paths
discussed above, the data value associated with the variable can be
obtained from either write-set 160 or shared data 125.
[0058] In step 450, the process performs a check to identify
whether use of the fetched variable (in the transaction) involves
modifying a value associated with the fetched variable. If so, in
step 455, the process modifies the locally used value of shared
variable 350 in write-set 160. The respective process skips
executing step 455 if use of the variable (as specified by the
executed transaction) does not involve modification of the
variable.
[0059] In step 460, the respective process identifies whether a
respective transaction has completed. If not, the process continues
at step 410 to perform a similar loop for each of additional
variables used during a course of executing the transaction. If the
transaction has completed in step 460, the respective process
continues at step 500 (e.g., the flowchart 500 in FIG. 5) in which
the process attempts to globally commit values in its write-set 160
to globally accessible shared data 125.
[0060] Accordingly, in response to identifying that a corresponding
data value associated with one or more shared variables was modified
during execution of the transaction, a respective process can abort
a respective transaction in lieu of modifying a data value
associated with shared data 125 and initiate execution of the
transaction again at a later time to attempt to produce a
respective transaction outcome.
[0061] FIG. 5 is a flowchart 500 illustrating a technique for
committing results of a transaction to shared data 125 according to
embodiments herein. Up until this point, the process executing the
respective transaction has not initiated any locks on any shared
data yet although the process does initiate execution of
computations associated with accessed shared data 125. Waiting to
obtain locks at the following "commit phase" enables other
processes 140 to perform other transactions in parallel because a
respective process initiating storage of results during the commit
phase holds the locks for a relatively short amount of time.
[0062] In step 505, the respective process that executed the
transaction attempts to obtain locks associated with each variable
in its write-set 160. For example, the process checks whether lock
information in metadata 135 indicates whether the variables to be
written to (e.g., specific portions of globally accessible shared
data 125) are locked by another process. The process initiates
locking the variables (or segments as the case may be) to block
other processes from using or locking the variables. In one
embodiment, a respective process attempts to obtain locks according
to a specific ordering such as an order of initiating local
modifications to retrieved shared variables during execution of a
respective transaction, addresses associated with the globally
shared variables, etc.
[0063] If all locks cannot be immediately obtained in step 510,
then the process can abort and retry a transaction or initiate a
back-off function to acquire locks associated with the variables
that are locally modified during execution of the transaction.
[0064] After all appropriate locks have been obtained by writing
respective lock information 230, processing continues at step 520
in which the process obtains the stored version information
associated with variables read from shared data 125. As previously
discussed, the version information 220 of metadata 135 indicates a
current version of the respective variables at a time when they
were read during execution of the transaction.
[0065] In step 525, the respective process compares the retrieved
version information in the read-set 150 saved at a time of
accessing the shared variables to the current globally available
version information 220 from metadata 135 for each variable in the
read-set 150.
[0066] In step 530, if the version information is different in step
525, then the process acknowledges that another process modified
the variables used to carry out the present transaction.
Accordingly, the process releases any obtained locks and retries
the transaction again. This prevents the respective process from
causing data corruption.
[0067] In step 535, if the version information is the same in step
525, then the process acknowledges that no other process modified
the variables used to carry out the present transaction.
Accordingly, the process can initiate modification of shared data
to reflect the data values in the write-set 160. This prevents the
respective process from causing data corruption during the commit
phase.
[0068] Finally, in step 540, after updating the shared data 125
with the data values in the write-set 160, the process updates
version information 220 associated with modified variables or
segments and releases the locks. The locks can be released in any
order or in a reverse order relative to the order of obtaining the
locks.
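The commit sequence of steps 505 through 540 can be sketched as follows. This is an illustrative, single-threaded Python model: the dictionaries and function names are assumptions, and a real implementation would acquire each lock with an atomic operation rather than a plain read and write.

```python
def txn_commit(write_set, read_set, shared_data, meta):
    """Commit phase sketch: lock write-set entries in a fixed order
    (step 505), validate read-set versions (steps 520-530), write
    back and bump versions, then release locks (steps 535-540).
    Returns True on commit, False when the transaction must retry."""
    acquired = []

    def release_all():
        for v in acquired:
            ver, _ = meta[v]
            meta[v] = (ver, False)

    for var_id in sorted(write_set):          # ordered locking avoids deadlock
        version, locked = meta[var_id]
        if locked:                             # step 510: abort / back off
            release_all()
            return False
        meta[var_id] = (version, True)
        acquired.append(var_id)

    for var_id, seen_version in read_set.items():   # steps 520-530
        if meta[var_id][0] != seen_version:          # another process changed it
            release_all()
            return False

    for var_id, value in write_set.items():   # steps 535-540
        shared_data[var_id] = value
        version, _ = meta[var_id]
        meta[var_id] = (version + 1, False)    # bump version, release lock
    return True

# Example: a commit succeeds while versions match, then a second
# attempt with a stale read-set version fails and changes nothing.
shared = {"x": 10}
meta = {"x": (1326, False)}
assert txn_commit({"x": 42}, {"x": 1326}, shared, meta)
assert shared["x"] == 42 and meta["x"] == (1327, False)
assert not txn_commit({"x": 7}, {"x": 1326}, shared, meta)
assert shared["x"] == 42
```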
[0069] Note that during the commit phase as discussed above in
flowchart 500, if a lock associated with a location in the
process's write-set 160 also appears in the read-set 150, then the
process must atomically: a) acquire a respective lock and b)
validate that current version information associated with the
variable (or variables) is the same as the retrieved version
information stored in the read-set 150. In one embodiment, a CAS
(Compare and Swap) operation can be used to accomplish both a) and
b).
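Because the lock bit and version counter share one word, a single compare-and-swap can perform both a) and b) at once: the CAS succeeds only if the word still holds the expected version with the lock bit clear. A minimal sketch follows; the `cas` helper emulates the hardware operation single-threadedly, and the function names are illustrative.

```python
def cas(cell, expected, new):
    """Single-threaded emulation of an atomic compare-and-swap on
    cell[0]; real code would use a hardware CAS instruction."""
    if cell[0] == expected:
        cell[0] = new
        return True
    return False

def acquire_and_validate(cell, expected_version):
    """One CAS both a) acquires the lock and b) validates the version:
    it succeeds only if the word is still (expected_version, unlocked)."""
    unlocked_word = expected_version << 1   # low bit clear = unlocked
    return cas(cell, unlocked_word, unlocked_word | 1)

# Example: the CAS succeeds once, then fails because the lock bit
# is now set (and would also fail if the version had advanced).
word = [1326 << 1]
assert acquire_and_validate(word, 1326)
assert not acquire_and_validate(word, 1326)
```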
[0070] Also, note that each of the respective processes 140 can be
programmed to occasionally, periodically, sporadically,
intermittently, etc. check (prior to the committal phase in
flowchart 500) whether current version information 220 in metadata
135 matches retrieved version information in its respective
read-set 150 for all variables read from shared data 125.
Additionally, each of the respective processes 140 can be
programmed to also check (in a similar way) whether a data value
and/or corresponding segment has been locked by another process
prior to completion. If a change is detected in the version
information 220 (e.g., there is a difference between retrieved
version information 320 in read-set 150 and current version
information 220) and/or a lock is implemented on a data value or
segment used by a given process, the given process can abort and
retry the current transaction, prior to executing the transaction
to the commit phase. Early abortion of transactions doomed to fail
(because another process has locked and modified relied-upon data)
can increase overall efficiency associated with parallel processing.
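The early validation check described above can be sketched as follows (a hypothetical illustration; `meta` maps each variable to its current (version, locked) pair, and the names are assumptions):

```python
def still_valid(read_set, meta):
    """Periodic mid-transaction check: the transaction is doomed if
    any variable it has read changed version or is now locked by
    another process, so it can abort early rather than run to the
    commit phase."""
    for var_id, seen_version in read_set.items():
        current_version, locked = meta[var_id]
        if locked or current_version != seen_version:
            return False
    return True

# Example: the read-set is valid until another process commits a
# change that advances the version.
meta = {"x": (1326, False)}
rs = {"x": 1326}
assert still_valid(rs, meta)
meta["x"] = (1327, False)   # another process committed a change
assert not still_valid(rs, meta)
```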
[0071] Use of version information and lock information according to
embodiments herein can prevent corruption of data. For example,
suppose that, as an alternative to the above technique of using
version information to verify that relied-upon information
(associated with a respective transaction) has not changed by the
end of a transaction, a process reads data values (as identified in
a respective read-set) from shared data 125 again at commit time to
ensure that the data values are the same as when they were first
fetched by the respective process. Unfortunately, this technique
can be misleading and cause errors because of the occurrence of
race conditions. For example, a first process may read and verify
that a globally accessible data value in shared data 125 has not
changed while soon after (or at nearly the same time) another
respective process modifies the globally accessible data value.
This would result in corruption if the first process committed its
results to shared data 125. The techniques herein are advantageous
because use of version and lock information in the same word
prevents corruption as a result of two different processes
accessing the word at the same or nearly the same time.
[0072] FIG. 6 is a block diagram illustrating an example computer
system 610 (e.g., an architecture associated with computer
environment 100) for executing parallel processes 140 and other
related processes according to embodiments herein. Computer system
610 can be a computerized device such as a personal computer,
workstation, portable computing device, console, network terminal,
processing device, etc.
[0073] As shown, computer system 610 of the present example
includes an interconnect 111 that couples a memory system 112
storing shared data 125 and metadata 135, one or more processors
113 executing processes 140, an I/O interface 114, and a
communications interface 115. Peripheral devices 116 (e.g., one or
more optional user controlled devices such as a keyboard, mouse,
display screens, etc.) can couple to processor 113 through I/O
interface 114. I/O interface 114 also enables computer system 610
to access repository 180 (that also potentially stores shared data
125 and/or metadata 135). Communications interface 115 enables
computer system 610 to communicate over network 191 to transmit and
receive information from different remote resources.
[0074] Note that functionality associated with processes 140 can be
embodied as software code such as data and/or logic instructions
(e.g., code stored in the memory or on another computer readable
medium such as a disk) that support functionality according to
different embodiments described herein. Alternatively, the
functionality associated with processes 140 can be implemented via
hardware or a combination of hardware and software code.
[0075] It should be noted that, in addition to the processes 140
themselves, embodiments herein include a respective application
and/or set of instructions to carry out processes 140. Such a set
of instructions associated with processes 140 can be stored on a
computer readable medium such as a floppy disk, hard disk, optical
medium, etc. The set of instructions can also be stored in a memory
type system such as in firmware, RAM (Random Access Memory), read
only memory (ROM), etc. or, as in this example, as executable
code.
[0076] Attributes associated with processes 140 will now be
discussed with respect to the flowcharts in FIGS. 7-9. For purposes of
this discussion, each of the multiple processes 140 in computer
environment 100 can execute or carry out the steps described in the
respective flowcharts. Note that the steps in the below flowcharts
need not always be executed in the order shown.
[0077] Now, more particularly, FIG. 7 is a flowchart 700
illustrating a technique supporting execution of parallel
transactions in computer environment 100 according to an embodiment
herein. Note that techniques discussed in flowchart 700 overlap and
summarize some of the techniques discussed above.
[0078] In step 710, a respective one of multiple processes 140
executes a transaction defined by a corresponding set of
instructions to produce a respective transaction outcome based on
use of at least one shared variable from shared data 125.
[0079] In step 720, after producing the respective transaction
outcome (e.g., locally storing computational results in its
respective write-set 160), the respective process 140 initiates a
lock on a given shared variable of shared data 125 to prevent other
processes from modifying a data value associated with the given
shared variable.
[0080] In step 730, the respective process 140 initiates a
modification of the data value associated with the given shared
variable based on the respective transaction outcome even though at
least one of the other processes 140 in computer environment 100
also performed a computation using the data value associated with
the given shared variable before the lock and during execution of
the transaction by the respective one of multiple processes
140.
[0081] FIG. 8 is a flowchart 800 illustrating processing steps
associated with processes 140 according to an embodiment herein.
Note that techniques discussed in flowchart 800 overlap with the
techniques discussed above in the previous figures.
[0082] In step 810, each of multiple processes 140 maintains
version information in a respective locally managed read set 150
associated with an executed transaction. In one embodiment, the
read set 150 is generally not accessible by the other processes 140
using the shared variables from shared data 125. Accordingly, the
read set 150 and write-set 160 serve as a local scratch-pad
function. As previously discussed, the read set 150 can store and
identify version information (e.g., includes retrieved version
information) associated with each of multiple shared variables used
to generate a respective transaction outcome associated with a
given process. The version information stored in the read-set 150
indicates respective versions of the multiple shared variables in
shared data 125 at a time when the transaction retrieves respective
data values associated with the multiple shared variables (e.g.,
shared data 125) from a corresponding globally accessible
repository.
[0083] In step 815, after producing a respective transaction
outcome associated with an executed transaction, each of multiple
processes 140 potentially competes to initiate a respective lock on
a given one or more shared variables (e.g., portions of shared data
125) locally modified (as indicated in write-set 160) during the
transaction to prevent other processes from modifying a data value
associated with the given one or more shared variables.
[0084] In step 820, after acquiring respective locks associated
with the given one or more shared variables and before globally
modifying respective data values associated with the given one or
more shared variables, a respective process attempting to globally
commit its results verifies that newly read (e.g., present or
current) version information associated with each of the given one
or more shared variables used to generate the respective
transaction outcome matches the version information in the locally
managed read set associated with the transaction. The newly read
version information can be used to identify whether the data values
associated with the multiple shared variables have been changed by
the other processes during execution of the transaction. There was
no change if the newly retrieved version information matches the
version information in the read-set 150.
[0085] In step 825, after verifying that "before-and-after" version
information matches and obtaining locks, a respective one of the
multiple processes 140 initiates a modification of data values
associated with the given one or more shared variables based on the
respective transaction outcome. The respective process globally
modifies the data values associated with the transaction outcome
even though one or more of the other processes 140 performed a
computation using the data value associated with the given shared
variable before the respective process obtains the lock.
[0086] In step 830, after the modification of the data values in
the shared data 125 associated with the given one or more shared
variables in write-set 160, the respective process modifies
globally accessible version information 220 associated with the
modified segments of shared data 125 (e.g., one or more shared
variable) to indicate to other processes that contents of a
respective segment have been modified.
[0087] FIG. 9 is a flowchart 900 illustrating another technique
associated with use of lock and version information according to
embodiments herein. Note that techniques discussed in flowchart 900
overlap and summarize some of the techniques discussed above.
[0088] In step 910, computer environment 100 maintains segments 210
of information (e.g., shared data 125) that are shared by multiple
processes 140 executing in parallel.
[0089] In step 915, for each of multiple segments 210, the computer
environment 100 maintains a corresponding location (e.g., a portion
of storage) to store a respective version value representing a
relative version of contents in a respective segment 210. As
previously discussed, the relative version associated with a
segment is updated each time contents of the respective segment are
modified by a respective process. For example, after
committing results to shared data 125, a respective process can
increment the version value by one over the previous version value
to notify other processes 140 that the shared data 125 has
changed.
[0090] In step 920, computer environment 100 enables the multiple
processes to compete and secure an exclusive access lock with
respect to each of the multiple segments 210 to prevent other
processes 140 from modifying a respective locked segment.
[0091] In step 925, for each of the multiple segments 210, computer
environment 100 maintains a corresponding location to store
globally accessible lock information (e.g., lock information 230)
indicating whether one of the multiple processes 140 executing in
parallel has locked a respective segment 210 for: i) changing a
respective data value in the respective segment 210, and ii)
preventing other processes from reading respective data values from
the respective segment 210.
[0092] In step 930, computer environment 100 enables the multiple
processes 140 to retrieve version information 220 associated with
the respective multiple segments 210 to identify whether contents
of a respective segment have changed over time.
[0093] In sub-step 935 of step 930, one embodiment of computer
environment 100 enables a respective one of the processes 140 to
modify a respective version value (representing a relative version
of a given segment 210) to a new unique data value, indicating that
a data value associated with the given segment has been modified.
[0094] As discussed above, techniques herein are well suited for
use in applications such as those that support parallel processing
of threads in the same or different processors. However, it should
be noted that configurations herein are not limited to such use and
thus configurations herein and deviations thereof are well suited
for use in other environments as well.
Further Embodiments Associated with Transactional Locking
[0095] A leading approach for simplifying concurrent programming is
a class of non-blocking software (and hardware) mechanisms called
transactional memories. Transactional memories can be static or
dynamic, indicating whether the locations transacted on are known
in advance (like an n-location CAS) or decided dynamically within
the scope of the transaction's execution, the latter type being
more general and expressive. Unfortunately, current implementations
of dynamic non-blocking software transactional memories (STMs) have
unsatisfactory performance.
[0096] This disclosure presents a new software based dynamic
transactional memory mechanism which we call Transactional Locking
(TL). TL is essentially a way of using static (and therefore
simple) non-blocking transactions in software or hardware to
transform sequential code into deadlock-free dynamic transactions
based on fine grained locks.
[0097] Initial performance benchmarks of an "all-software" TL
mechanism are surprisingly good. TL implementations of concurrent
data structures significantly outperform the most effective
STM-based implementations, and, more importantly, are within a
competitive margin of the most efficient hand-crafted
implementations. These surprising performance results bring us to
question two assumptions that have recently taken hold in the
transactional memory development community: that software
transactions should be non-blocking, and that to be useful,
hardware transactions need to be dynamic.
1.0 Introduction
[0098] A goal of current multiprocessor software design is to
introduce parallelism into software applications by allowing
operations that do not conflict in accessing memory to proceed
concurrently. As discussed above, a key tool in designing
concurrent data structures has been the use of locks.
Unfortunately, coarse-grained locking, though easy to program with,
provides very poor performance because of limited parallelism,
while designing fine-grained lock-based concurrent data structures
has long been recognized as a difficult task better left to
experts. If concurrent programming and data structure design is to
become ubiquitous, researchers agree that one must develop
alternative approaches that simplify code design and verification.
This disclosure addresses "mechanical" methods for transforming
sequential code or coarse-grained lock-based code to concurrent
code. In one embodiment, by mechanical we mean that the
transformation, whether done by hand, by a preprocessor, or by a
compiler, does not require any program specific information (such
as the programmer's understanding of the data flow
relationships).
1.1 Transactional Programming
[0099] The transactional memory programming paradigm is gaining
momentum as the approach of choice for replacing locks in
concurrent programming. Combining sequences of concurrent
operations into atomic transactions seems to promise a great
reduction in the complexity of both programming and verification,
by making parts of the code appear to be sequential without the
need to use fine-grained locking. Transactions will hopefully
remove from the programmer the burden of figuring out the
interaction among concurrent operations that happen to conflict
when accessing the same locations in memory. Transactions that do
not conflict in accessing memory will run uninterrupted in
parallel, and those that do will be aborted and retried, without the
programmer having to worry about issues such as deadlock. There are
currently proposals for hardware implementations of transactional
memory (HTM), purely software based ones (i.e., Software
Transactional Memories (STMs)), and hybrid schemes that combine
hardware and software.
[0100] A preferred unifying theme of parallel processing is that
the transactions provided to the programmer, in either hardware or
software, will be non-blocking, unbounded, and dynamic.
Non-blocking means that transactions do not use locks, and are thus
obstruction-free, lock-free, or wait-free. Unbounded means that
there is no limit on the number of locations accessed by the
transaction. Dynamic means that the set of locations accessed by
the transaction is not known in advance and is determined during
its execution. Providing all three properties in hardware seems to
introduce large degrees of complexity into the design. Providing
them in software seems to limit performance: hand-crafted
lock-based code, though hard to program and prove correct, greatly
outperforms the most effective current software STMs, even when
they are programmed using an understanding of the data access
relationships. When the STM programmer does not make use of such
information, performance of STMs is in general an order of
magnitude slower than the hand-crafted counterparts.
1.2 Transactional Locking
[0101] This disclosure, according to one embodiment, suggests that
it is perhaps time to re-examine these basic development
requirements. We contend that on modern operating systems, deadlock
avoidance is the only compelling reason for making transactions
non-blocking, and that there is no reason to provide it for
transactions at the user level. Conventional mechanisms already
exist whereby threads might yield their quanta to other threads. In
particular, one conventional method such as so-called "schedctl"
(e.g., a feature in the Solaris.TM. operating system) allows
threads to transiently defer preemption while holding locks. In a
sense, rather than trying to improve on hand-crafted lock-based
implementations by being non-blocking, we propose to get as close
to their behavior as one can with a mechanical approach, that is,
one that does not require the programmer to understand their data
access relationships.
[0102] With this in mind, the disclosure introduces a new approach
called Transactional Locking (TL), a blocking approach to designing
software based transactional memory mechanisms. TL according to
embodiments herein transforms sequential code into unbounded
concurrent dynamic transactions that synchronize using
deadlock-free fine grained locking. The scheme itself is highly
efficient because it does not try to provide a non-blocking
progress guarantee for the transaction as a whole. Instead, static
(and therefore simple) non-blocking transactions are used only to
provide deadlock freedom when acquiring the set of locks needed to
safely complete a transaction. These simple static transactions can
be implemented in a trivial manner using today's hardware
synchronization operations such as compare-and-swap (CAS), or using
hardware transactions when these become available. We note that
implementing static transactions in hardware may prove
significantly simpler than implementing the more general dynamic
ones proposed in current HTM schemes.
1.3 A TL Approach in a Nutshell
[0103] One TL mechanism is based on coordination via a special
versioned-read-write-lock. Each shared variable is associated with
and protected by one lock. The mapping between variables and locks
can be one-to-one or many-to-few. For instance there may be one
lock per variable, where the lock is allocated adjacent to the
variable; one lock per object; or a separate array of locks indexed
by a hash of the variable address. Other mappings are possible as
well. A versioned-read-write lock has a version field in the lock
word, and the lock's version number is incremented on every successful
write attempt. In an example embodiment the versioned-read-write
lock would consist of a word where the low-order bit served as a
lock-bit and the remaining bits served as a version subfield. On a
high level a dynamic transaction is executed as follows: [0104] 1.
Run the transactional code, reading the locks of all fetched-from
shared locations and building a local read-set and write-set (use a
safe load operation to avoid running off null pointers as a result
of reading an inconsistent view of memory). [0105] A transactional
load first checks to see if the load address appears in the
write-set. If so, the transactional load returns the last value
written to the address. This provides the illusion of processor
consistency and avoids so-called read-after-write hazards. If the
address is not found in the write-set the load operation then
fetches the lock value associated with the variable, saving the
version in the read-set, and then fetches from the actual shared
variable. If the transactional load operation finds the variable
locked the load may either spin until the lock is released or abort
the operation. [0106] Transactional stores to shared locations are
handled by saving the address and value into the thread's local
write-set. The shared variables are not modified during this step.
That is, transactional stores are deferred and contingent upon
successfully completing the transaction. [0107] 2. Attempt to
commit the transaction. Acquire the locks of locations to be
written. If a lock in the write-set (or more precisely a lock
associated with a location in the write-set) also appears in the
read-set then the acquire operation must atomically (a) acquire the
lock and, (b) validate that the current lock version subfield
agrees with the version found in the earliest read-entry associated
with that same lock. An atomic CAS can accomplish both (a) and (b).
In its simplest form, acquire locks in ascending lock address
order, avoiding deadlocks. [0108] Alternately, the implementation
might acquire the locks in some other order, using bounded spinning
to avoid indefinite deadlock. [0109] 3. Re-read the locks of all
read-only locations to make sure version numbers haven't changed.
If version does not match, roll-back (release) the locks, abort the
transaction, and retry. [0110] 4. The prior observed reads in step
(1) have been validated as mutually consistent. The transaction is
now committed. Write-back all the entries from the local write-set
to the appropriate shared variables. [0111] 5. Release all the
locks identified in the write-set by atomically incrementing the
version and clearing the write-lock bit. Critically, the
write-locks are held for only a brief time.
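The lock-word manipulation in steps 2 and 5 above can be sketched as follows. This is an illustrative fragment, assuming the one-word layout described earlier (low-order lock bit, upper bits version); the names (`vlock_try_acquire`, etc.) are hypothetical, not identifiers from the disclosure.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative versioned-write-lock word: low-order bit is the lock
 * bit, the remaining bits hold the version (names are hypothetical). */
typedef _Atomic uint64_t vlock_t;

#define VLOCK_BIT 1ULL

/* Step 2: attempt to acquire the lock with a single CAS; *saved
 * receives the unlocked word so the release can advance the version. */
static int vlock_try_acquire(vlock_t *lw, uint64_t *saved)
{
    uint64_t v = atomic_load(lw);
    if (v & VLOCK_BIT)
        return 0;                       /* held by another thread */
    *saved = v;
    return atomic_compare_exchange_strong(lw, &v, v | VLOCK_BIT);
}

/* Step 5: release with a plain store that simultaneously clears the
 * lock bit and increments the version subfield (version + 1 is +2 on
 * the word because the version occupies the bits above the lock bit). */
static void vlock_release_and_increment(vlock_t *lw, uint64_t saved)
{
    atomic_store(lw, saved + 2);
}
```

A reader validating its read-set simply re-fetches the word and checks that the lock bit is clear and the version subfield is unchanged.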
[0112] At a high level TL according to embodiments herein converts
coarse-grained lock operations into transactions, where the
transactional infrastructure is implemented with fine-grained
locks.
[0113] There are various other optimizations and contention
reduction mechanisms that one should add to this basic scheme to
improve performance, but, as can be seen, at its core it is
painfully simple. The acquisition of the locks in step 2 is
essentially a static obstruction-free transaction, one in which the
set of accessed locations is known in advance. It can alternately be
sped up using a hardware transaction such as an n-location
compare-and-swap (CAS) operation. As noted earlier, this type of
operation is simpler than a dynamic hardware transaction.
1.4 TL vs. STM and Hand-Crafted Locking
[0114] One aspect associated with TL is the observation that the
blocking part of a transaction can be limited to the acquisition of
a set of lock records. This observation has significant performance
implications because it allows one to eliminate all the overheads
associated with the mechanisms providing the non-blocking progress
guarantee for the transaction as a whole. As we show, this is a
major source of overhead of current STM systems.
[0115] When compared to hand-crafted lock-based structures, one can
think of TL as using a non-blocking transaction to overcome the
need to understand the data-access relationships, while keeping the
basic fine-grained locking structure of a lock per object or
field.
[0116] A few more detailed differences are as follows.
[0117] Like OSTM (Object-based STM) or HyTM (Hybrid TM), TL
associates a special coordination word with each transacted memory
location. However, while STM systems like OSTM and HyTM use this
word as a pointer to a transaction record, TL uses it as a lock, as
in the hand-crafted fine-locked structure. One immediate
implication is a saving of a level of indirection over STMs.
[0118] Unlike STMs, TL's rollback mechanism is simple and local.
There are no transaction records, and the collected read-set and
write-set is never shared with other threads.
[0119] OSTM derives a large part of its efficiency from the
programmer's help in deciding when to "open" a transacted object
for reading or writing. Without this help, it has been shown that
OSTM's performance is rather poor. The TL transformation requires
no programmer understanding of the data structure in order to make
the transformation efficient. We believe it should not be
difficult, given a simple set of constraints on program structure,
to turn it into a straightforward mechanical transformation.
[0120] There is an inherent overhead of the general mechanical (and
hence "dumb") transformation when compared to hand-crafted code.
For example, in Fraser's elegant fine-locked skiplist
implementation he makes use of his understanding of the structure's
semantics and the mechanics of his GC to allow list traversal to
ignore locks on nodes since the traversal still works even if a
node is concurrently removed. It is hard to imagine that a
mechanical approach could be made to ignore the fact that a node is
locked and might be removed from the list.
2. The TL Algorithm
[0121] According to one aspect of this disclosure, we associate a
special versioned-write-lock with every transacted memory location.
In the example embodiment a versioned write-lock is a simple
spinlock that uses a compare-and-swap (CAS) operation to acquire
the lock and a store to release it. Since one only needs a single
bit to indicate that the lock is taken, the rest of the lock word
holds a version number. This number is incremented by every
successful lock-release.
[0122] We allocate a collection of versioned-write-locks. We use
various schemes for associating locks with shared variables: per
object (PO), where a lock is assigned per shared object, per stripe
(PS), where we allocate a separate large array of locks and memory
is striped (divided up) using some hash function to map each
location to a separate stripe, and per word (PW) where each
transactionally referenced variable (word) is collocated adjacent
to a lock. Other mappings between transactional shared variables
and locks are possible. The PW and PO schemes require either manual
or compiler assisted automatic insertion of lock fields whereas PS
can be used with unmodified data structures. PO might be
implemented, for instance, by leveraging the header words of
Java.TM. objects. A single PS stripe-lock array may be shared and
used for different TL data structures within a single
address-space. For instance an application with two distinct TL
red-black trees and three TL hash-tables could use a single PS
array for all TL locks.
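The PS address-to-stripe mapping can be sketched as below. The table size, stripe granularity, and hash are illustrative assumptions, since the disclosure requires only "some hash function" from location to stripe.

```c
#include <stdint.h>

/* Illustrative per-stripe (PS) lock table: the stripe size (16 bytes)
 * and table size (2^16 entries) are assumptions, not values from the
 * disclosure. */
enum { STRIPE_SHIFT = 4, N_STRIPES = 1u << 16 };

static uint64_t stripe_locks[N_STRIPES];   /* versioned-write-locks */

/* Hash a variable's address to the lock that "covers" it. A single
 * such array can serve every TL data structure in the address space. */
static uint64_t *stripe_lock_for(const void *addr)
{
    return &stripe_locks[((uintptr_t)addr >> STRIPE_SHIFT)
                         & (N_STRIPES - 1)];
}
```

Because the mapping is many-to-few, unrelated variables may hash to the same stripe lock; that is safe (it only reduces potential parallelism).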
[0123] The following is a description of the PS algorithm although
most of the details carry through verbatim for PO and PW as well.
We maintain thread local read- and write-sets as linked lists. The
read-set entries contain the address of the lock and the observed
version number of the lock associated with the transactionally
loaded variable. The write-set entries contain the address of the
variable, the value to be written to the variable, and the address
of the lock that "covers" the variable. The write-set is kept in
chronological order to avoid write-after-write hazards.
[0124] We now describe how TL executes a sequential code fragment
that was placed within a TL transaction. We later describe the
limitations placed on the programmer in terms of structure of this
code so as to allow it to be mechanically transformed into a TL
transaction. The transaction proceeds through the code as
follows:
1. For every location read, read its lock value, and
[0125] (a) if it is not locked, add the lock's version number to
the read-set. We use a safe load operation to avoid running off
null pointers as a result of reading an inconsistent view of
memory. Safe loads may be implemented with SPARC.TM. non-faulting
loads or by an explicit user-level trap handler that skips over
potentially trapping safe load instructions.
[0126] (b) if it is locked by another thread then we spin briefly.
If the spin fails abort the transaction and retry.
2. For every location to be written, record the location and the
value to be written.
[0127] Upon completion of the pass through the code, reread the
version numbers of all locations in the read-set.
[0128] 1. Attempt to acquire all locks in the write-set in
ascending lock address order. Upon failing to acquire a lock, apply
some type of backoff policy or abort and retry the transaction. A
backoff policy could, for example, be to spin for a certain amount
of time before re-attempting to acquire the lock.
2. Once all locks are acquired, re-read the locks of all read-set
locations to make sure version numbers have not changed.
[0129] (a) If a location has changed, release locks, abort and
retry the transaction.
[0130] (b) If not, perform stores in write set and release locks in
any order. The transaction is complete.
[0131] The transaction's re-reading of all the locks of locations
in the read set before attempting to acquire the locks is only a
performance optimization. It is not required for correctness.
Empirically we have found that many transactions fail due to
modifications before locks are acquired. Pre-validating the lock
versions in the write-set avoids acquiring the locks for a
transaction that is fated to abort. We note that spinning as a
backoff policy does not introduce deadlocks because locks are
acquired in ascending order. The above algorithm, which we call
sorted TL, acquires locks in order. We have also experimented with
algorithms that acquire locks as they are encountered
(encounter-order TL) and use randomized backoff to avoid deadlock.
The advantage of the latter is that the transacting thread does not
need to search the write-set for values of locations it updated,
since locations are updated "in place."
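The commit sequence of the sorted-TL variant above can be sketched as a single-threaded illustration. Plain stores stand in for the CAS acquisition since there is no concurrency here, and the read-set/write-set overlap handled at acquire time is omitted; all names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative read-set/write-set entries (names are hypothetical). */
typedef struct { uint64_t *lock; uint64_t version; } read_entry;
typedef struct { uint64_t *addr; uint64_t value; uint64_t *lock; } write_entry;

/* Sort write-set entries by lock address, for deadlock-free acquisition. */
static int by_lock_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)((const write_entry *)a)->lock;
    uintptr_t y = (uintptr_t)((const write_entry *)b)->lock;
    return (x > y) - (x < y);
}

/* Re-read each read-set lock: a changed version or a set lock bit
 * means another transaction intervened. */
static int read_set_valid(const read_entry *rs, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (*rs[i].lock != rs[i].version || (*rs[i].lock & 1))
            return 0;
    return 1;
}

/* Commit: acquire locks in ascending order, validate, write back,
 * then release each lock while advancing its version (word + 1 turns
 * (version<<1)|1 into ((version+1)<<1)|0). */
static int tl_commit(read_entry *rs, size_t nr, write_entry *ws, size_t nw)
{
    qsort(ws, nw, sizeof *ws, by_lock_addr);
    for (size_t i = 0; i < nw; i++)
        *ws[i].lock |= 1;                      /* acquire (CAS in real TL) */
    if (!read_set_valid(rs, nr)) {
        for (size_t i = 0; i < nw; i++)
            *ws[i].lock &= ~1ULL;              /* roll back the locks */
        return 0;                              /* abort and retry */
    }
    for (size_t i = 0; i < nw; i++)
        *ws[i].addr = ws[i].value;             /* deferred write-back */
    for (size_t i = 0; i < nw; i++)
        *ws[i].lock += 1;                      /* release + increment */
    return 1;
}
```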
2.1 Intentionally Left Blank
2.2 Mechanical Transformation
[0132] As we discussed earlier, the algorithm we describe can be
added to code in a mechanical fashion, that is, without
understanding anything about how the code works or what the program
itself does. In our benchmarks, we performed the transformation by
hand. [0133] We do, however, believe that it should not be hard to
automate this process and allow a compiler to perform the
transformation given a few rather simple limitations on the code
structure within a transaction.
2.3 Software-Hardware Inter-Operability
[0134] Though we have described TL as a software based scheme, it
can be made inter-operable with HTM systems on several levels.
[0135] In its simplest form, one can use static bounded-size
obstruction-free hardware transactions to speed up software TL. This
is done by using the hardware transactions to acquire the write
locks of a TL transaction in order. Since the write set is known in
advance, we require only static hardware transactions. Because for
many data structures the number of writes is significantly smaller
than the number of reads, it may well be that in most cases these
hardware transactions can be bounded in size. If all write locks do
not fit in a single hardware transaction, one can apply several of
them in sequence using the same scheme we currently use to acquire
individual locks, avoiding deadlock because the locations are
acquired in ascending order.
[0136] One can also use TL as a hybrid backup mechanism to extend
bounded size dynamic hardware transactions to arbitrary size. We
can use a scheme similar to OSTM and HyTM where instead of their
object records, we use the versioned-write-lock associated with
each location.
[0137] Hardware transactions need to verify for each location that
they read or write that the associated versioned-write-lock is
free. For every write they also need to update the version number
of the associated stripe lock. This suffices to provide
inter-operability between hardware and software transactions. Any
software read will detect concurrent modifications of locations by
hardware writes because the version number of the associated lock
will have changed. Any hardware transaction will fail if a
concurrent software transaction is holding the lock to write.
Software transactions attempting to write will also fail in
acquiring a lock, since the lock is acquired with a hardware
synchronization operation (such as a CAS or a single-location
transaction) which will fail if the version number of the location
was modified by the hardware transaction.
3.0 Remarks
[0138] 1. One goal of the present disclosure is to allow the
programmer to convert coarse-grain locked data structures to TL so
as to enjoy the benefits of parallelism. This can be helpful when
transitioning to high-order parallelism with SMT/CMT processors
such as Niagara.TM.. One key attribute of TL is simplicity. It
allows the programmer to extract additional parallelism but without
unduly increasing the complexity of their code. The programmer can
"think serially" but the code will "execute concurrently".
[0139] For a given problem we deem TL successful if the resultant
performance exceeds that of the original coarse-grain locked form.
In many cases the TL form is competitive with the best-of-breed STM
forms. That having been said, for any given problem a specialized,
hand-coded, form written by a synchronization expert is likely to
be faster than the TL form. An expert in synchronization,
developing with concurrency in mind as a first-order requirement,
may be aware of relaxed data dependencies in the algorithm and take
advantage of domain-specific optimizations.
[0140] For example, a red-black tree transformed with TL will
out-perform a red-black tree protected by a naive lock. But an
exotic ad-hoc red-black tree designed by concurrency experts and
subject to considerable research, such as Hanke's red-black
algorithm, will generally outperform the TL-transformed red-black
tree.
[0141] 2. Broadly, TL works by transforming an operation protected by
a coarse-grained lock into optimistic transactional form. We then
implement the transactional infrastructure with fine-grain locks,
enabling additional parallelism as the access patterns permit.
[0142] 3. OSTM works by opening and closing records for reading and
writing. TL, in a sense, performs the open operations automatically
at transactional load- and store-time but leaves the record open
until commit time. TL has no way of knowing that prior loads
executed within a transaction might not have any bearing on the
results produced by the transaction.
[0143] In such cases the load could safely be removed from the
read-set, but TL doesn't currently provide that capability. As such,
the TL transaction is exposed to false-positive failures.
[0144] Consider the following scenario where we have a TL-protected
hash table. Thread T1 traverses a long hash bucket chain searching
for a value associated with a certain key, iterating over "next"
fields. We'll say that T1 locates the appropriate node at or near
the end of the linked list. T2 concurrently deletes an unrelated
node earlier in the same linked list. T2 commits. At commit-time T1
will abort because the linked-list "next" field written to by T2 is
in T1's read-set. T1 must retry the lookup operation (ostensibly
locating the same node). Given our domain-specific knowledge of the
linked list we understand that the lookup and delete operations
didn't really conflict and could have been allowed to operate
concurrently with no aborts. A clever "hand over hand" ad-hoc
hand-coded locking scheme would allow the desired concurrency.
[0145] 4. As described above, TL admits live-lock failure. Consider
where thread T1's read-set is A and its write-set is B. T2's
read-set is B and its write-set is A. T1 tries to commit and locks B.
T2 tries to commit and acquires A. T1 validates A, in its read-set,
and aborts as A is locked by T2. T2 validates B in its read-set
and aborts as B is locked by T1. We have mutual abort with no
progress. To improve "liveness" we use a back-off delay at
abort-time, similar in spirit to that found in CSMA-CD MAC
protocols. The delay interval is a function of (a) a random number
generated at abort-time, (b) the length of the prior (aborted)
write-set, and (c) the number of prior aborts for this
transactional attempt.
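An illustrative abort-time delay combining the three inputs named above might look like the following. The particular formula (a randomized window growing exponentially with the abort count and linearly with the write-set length) is an assumption; the disclosure states only that the delay is a function of these quantities.

```c
#include <stdint.h>

/* Hypothetical abort-time backoff: the wait window grows with the
 * number of prior aborts and with the aborted write-set length; the
 * random input picks a point inside the window. */
static uint64_t backoff_delay(uint64_t rnd, unsigned write_set_len,
                              unsigned prior_aborts)
{
    if (prior_aborts > 16)
        prior_aborts = 16;                       /* cap the exponent */
    uint64_t window = ((uint64_t)write_set_len + 1) << prior_aborts;
    return rnd % window;                         /* iterations to wait */
}
```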
[0146] 5. As described above, at commit-time the transactional
mechanism will acquire write-set locks, validate the read-set,
perform the write-back, and then release (and increment) the
write-locks. Lock acquisition is accomplished with CAS and
lock-release with a simple store. Given the availability of
restricted capacity hardware transactional memory, such as will be
available in Sun's forthcoming "ROCK" SPARC processor, we eliminate
the CAS operations and try to acquire the locks in groups
(replacing a set of CAS operations with a single ROCK hardware
transaction).
[0147] In addition, it is possible that the entire commit operation
might be feasible as a single ROCK hardware transaction even where
the original application transaction was too big (too many loads and
stores) to be feasible as a single ROCK transaction. The commit
operation will be able to make an accurate estimate of
ROCK-feasibility given that the size of the read-set and write-set
are available (or cheap to compute) at commit-time. Finally, if the
entire commit is feasible as a ROCK hardware transaction, we can
avoid changing the lock word from unlocked, to locked, to unlocked
(but incremented) by simply fetching the lock word at the start of
the commit, verifying that it is unlocked, and then increasing the
version sub-field at the end of the transaction, after the data
writes are complete.
[0148] 6. Changes to non-transactional variables, such as automatic
variables, must not be allowed to escape or "leak" out of an aborted
transaction. Where needed, the transactional infrastructure must
log such changes and roll back any updates at abort-time.
Similarly, exceptions in aborted or doomed transactions must not
propagate out of the transactional infrastructure. The SXM scheme,
where transactions are encapsulated in method calls, handily deals
with this issue.
[0149] 7. All accesses to shared variables within a transformed TL
critical section must be performed transactionally. Mixed-mode
access can be unsafe. Transactions should not perform or initiate
I/O or otherwise interface with non-transactional components.
Transactions should not access device-memory (memory mapped
devices) with transactional loads and stores, as loads from
device-memory are not necessarily idempotent and may have side
effects.
[0150] 8. Under TL, pure read operations don't require any store
operations. This is important as stores to shared variables under
typical snoop- or directory-based coherency protocols can result in
considerable coherency bus traffic. Such stores incur a local
latency penalty and cause scalability issues as the coherency
traffic consumes precious bandwidth on the shared coherency bus.
[0151] 9. Write-locks are held for a brief time--just long enough
to validate the read-set and write-back the deferred transactional
stores.
[0152] 10. If a transaction acquires many distinct locks, it can
suffer a local latency penalty as the CAS instruction is typically
slow. A balance must be struck between lots of locks (and increased
potential parallelism) and un-contended lock acquisition overhead.
The mapping strategy between variables and locks is critical.
[0153] 11. As noted above, the PW scheme may suffer undue local CAS
latency if many distinct write-locks must be acquired. One possible
solution is to add an indirection bit to the lock-word. When set, the
lock-word contains a pointer to the actual lock. Multiple
indirection is not allowed. Objects are initialized so that the
per-field lock words point to either a canonical non-indirect field
lock within the same object, or to a lock that protects the entire
data structure (e.g., the entire red-black tree or skip-list).
Initially we have coarse-grain locking with a many:1 relationship
between lock fields and actual locks, but as we encounter
contention we can convert automatically to fine-grain locking by
replacing the indirection pointer with a normal non-indirected
lock value. For safety, only the current lock-owner can "split"
or upgrade the lock from the indirected form (coarse-grained) to a
per-field lock (fine-grained). The transition is
unidirectional--we never try to aggregate multiple fine-grain
locks back into a single coarse-grain lock. The onset of
contention (or more precisely, aborts caused by encountering a
locked object) triggers splitting. When the contending thread
eventually acquires the lock it can perform the split operation. By
automatically splitting the locks and switching to finer grained
locking we minimize the number of high-latency CAS operations
needed to lock low-contention fields, but maximize the potential
parallelism for operations that access high-contention fields.
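The indirection scheme just described can be sketched as follows; the bit assignments and names are illustrative assumptions, not a layout specified by the disclosure.

```c
#include <stdint.h>

/* Illustrative lock-word with an indirection bit (bit 1): when set,
 * the remaining bits hold a pointer to the actual lock; when clear,
 * the word is itself a versioned-write-lock. Bit layout is assumed. */
#define LOCK_BIT     1ULL
#define INDIRECT_BIT 2ULL

/* Follow at most one level of indirection to the word that really
 * guards the field (multiple indirection is not allowed). */
static uint64_t *resolve_lock(uint64_t *lw)
{
    if (*lw & INDIRECT_BIT)
        return (uint64_t *)(uintptr_t)(*lw & ~INDIRECT_BIT);
    return lw;
}

/* Split: the current lock-owner converts a field from the indirected
 * (coarse-grained) form to a normal per-field lock, version 0. */
static void split_lock(uint64_t *lw)
{
    *lw = 0;                  /* non-indirected, unlocked, version 0 */
}
```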
[0154] One can apply the same optimization to PS, where the first
lock in the array is a normal lock and all other locks are indirect
locks, pointing to the first element.
[0155] 12. Broadly, TL operates better in environments with lower
mutation rates (that is, where the store:load ratio is low). For
example, consider a red-black tree and a skip-list that are protected
by a single lock and where the data structure is subject to many
concurrent modifications. The relative speedup achieved with TL as
compared to the classic lock will usually be higher with the
skip-list than with the red-black tree, as mutations to a skip-list
usually only require a few stores, whereas mutations to a red-black
tree may require adjustments to the tree structure that require
many stores.
[0156] 13. We claim that TL admits no schedules that were not
already possible for the data structure as protected by the
coarse-grained lock.
[0157] 14. TL could be used to implement the "atomic . . . "
construct where no lock is specified.
[0158] 15. Our example embodiment describes a 64-bit lock-word,
partitioned into a single lock bit and a 63-bit version subfield.
Assuming a 4 GHz processor and a maximum update rate of 1
transaction per clock, the version sub-field will overflow in 68
years. Other example embodiments allow for use of a 32-bit
lock-word field. When a counter overflows, for instance, a
so-called stop-the-world epoch might be used to stop all threads
outside transactions. At that point no thread can have a previously
fetched instance of the overflowed lock-word in its read-set; the
lock-word version can safely be reset to 0. All threads can then be
allowed to resume normal execution.
[0159] 16. Unlike some other STMs which incorporate and depend on
their own garbage-collection mechanisms, TL allows the C programmer
to use normal malloc( ) and free( ) operations to manage the
lifecycle of structures containing transactionally accessed shared
variables. The only requirement imposed by TL is that a structure
being free( )-ed must be allowed to quiesce. That is, any pending
transactional stores, detectable by checking the lock-bit in the
associated locks, must be allowed to drain into the structure
before the structure is freed. After the structure is quiesced it
can be accessed with normal load and store operations outside the
transactional framework.
[0160] 17. Concurrent mixed-mode transactional and
non-transactional accesses are proscribed. When a particular object
is being accessed with transactional load and store operations it
must not be accessed with normal non-transactional load and store
operations. (When any accesses to an object are transactional, all
accesses must be transactional.) Before an object can exit the
transactional domain and subsequently be accessed with normal
non-transactional loads and stores, we must sterilize the object
before it escapes. To motivate the need for sterilization, consider
the following scenario. We have a linked list of 3 nodes identified
by addresses A, B and C. A node contains Key, Value and Next
fields. The data structure implements a traditional key-value
mapping. The key-value map (the linked list) is protected by TL
using PS. Node A's Key field contains "1", its value field contains
"1001" and its Next field refers to B. B's Key field contains "2",
its Value field contains "1002" and its Next field refers to C. C's
Key field contains 3, the value field "1003" and its Next field is
NULL. Thread T1 calls Set("2", "2002"). The TL-based Set( )
operator traverses the linked list using transactional loads and
finds node B with a key value of "2". T1 then executes a
transactional store into B.Value to change "1002" to "2002". T1's
read-set consists of A.Key, A.Next, B.Key and the write-set
consists of "B.Value." T1 attempts to commit; it acquires the lock
covering
[0161] B.Value and then validates that the previously fetched
read-set is consistent by checking the version numbers in the locks
covering the read-set. Thread T1 stalls. Thread T2 executes
Delete("2"). The Delete( ) operator traverses the linked list and
attempts to splice-out Node B by setting A.Next to C. T2
successfully commits. The commit operator stores C into A.Next.
[0162] T2's transaction completes. T2 then calls free(B). T1
resumes in the midst of its commit and stores into B.Value. We have
a classic modify-after-free pathology. To avoid such problems T2
calls sterilize(B) after the commit finishes but before free( )ing
B. This allows T1's latent transactional ST to drain into B before
B is free( )ed and potentially reused. Note, however, that TL
(using sterilization) did not admit any outcomes that were not
already possible under the original coarse-grained lock.
[0163] 18. Consider the following problematic lifecycle based on
the A, B, C linked list, above. Let's say we're using TL in the "C"
language to moderate concurrent access to the list, but with either
PO or PW mode where the lock word(s) are embedded in the node.
Thread T1 calls Set("2", "2002"). The TL-based Set( ) method
traverses the list and locates node B having a key value of "2".
Thread T2 then calls Delete("2"). The Delete( ) operator commits
successfully. T2 sterilizes B and then calls free(B). The memory
underlying B is recycled and used by some other thread T3. T1
attempts to commit by acquiring the lock covering B.Value. The
lock-word is collocated with B.Value, so the CAS operation
transiently changes the lock-word contents. T1 then validates the
read-set, recognizes that A.Next changed (because of T2's Delete(
)) and aborts, restoring the original lock-word value. T1 has caused
the memory word underlying the lock for B.Value to "flicker",
however. Such modifications are unacceptable; we have a classic
modify-after-free error.
[0164] As such, we advocate using PS for normal C code as the
lock-words (metadata) are stored separately in type-stable memory
distinct from the data protected by the locks. This proviso can be
relaxed if the C-code uses some type of garbage collection (such as
Boehm-style conservative garbage collection for C, Michael-style
hazard pointers or Fraser-style Epoch-Based Reclamation) or
type-stable storage for the nodes. For type-safe garbage collected
managed runtime environments such as Java any of the mapping
policies (PS, PO or PW) are safe. Relatedly, use-after-free errors
are impossible in Java, so sterilization would be needed only for
objects that escape the transactional domain and will subsequently
be accessed with normal loads and stores.
[0165] Alternately, we could employ PO or PW with C-code but
replace the embedded lock-words with immutable words that point to
type-stable or immortal lock-words. Under PO, for instance, the
object would contain an immutable field that points to some other
lock-word. The field would be initialized to point to the associated
lock-word either at object construction-time, or initialization
could be deferred until the first transactional store or load.
[0166] 19. It is possible to use C++ operator overloading and
template functions to interpose on all load and store operations
for variables defined to be used in a transactional fashion. This
approach obviates the need to explicitly call transactional load
and store operators, making the set of modifications required to
switch to TL much smaller.
[0167] 20. We previously described the PW, PO and PS schemes for
associating variables with locks. More generally, TL might allow a
skilled programmer to explicitly control the mapping by allowing
the programmer to define a custom VariableToLock( ) function which
takes a variable address as input and returns a lock address. The
VariableToLock( ) function is optional.
[0168] 21. TL can easily be combined with STM interfaces or
transactional infrastructures such as Herlihy's SXM.
[0169] 22. TL protects data accessed within a critical section. TL
should not be used where a lock is used as an execution barrier and
shared data is accessed outside the lock. For instance, let's say
thread T1 acquires Lock A, and spawns thread T2, increments some
global variable B and then releases A. T2 will acquire A, release
A, and then increment B. Access to the shared variable B is
protected by the lock, but the accesses are outside the critical
section. In fact the critical section is empty and degenerate.
[0170] 23. Code that assumes memory barrier (fence)-equivalent
semantics for lock and unlock should not be transformed with
TL.
[0171] 24. We can extend the lock-word encoding from
LOCKED/UNLOCKED to READWRITE/READONLY/EXCLUSIVE as follows.
READWRITE corresponds to UNLOCKED and EXCLUSIVE corresponds to
LOCKED. The new state, READONLY, is an interim state used only at
commit-time. The commit operator is modified to attempt to change
all locks in the write-set from READWRITE to READONLY with CAS.
The commit operator must spin if the lock is found to be in
READONLY or EXCLUSIVE state. Once the write-set locks have been
made READONLY, the commit operator ratifies versions of the
read-set locks and ensures that the read-set locks are in READWRITE
state. If the read-set is invalid the commit operator restores the
write-set locks to READWRITE and aborts the transaction. Otherwise
the commit operator uses simple store operations to upgrade all the
write-set locks from READONLY to EXCLUSIVE. The commit operator
then writes back the deferred stores saved in the write-set and
then releases the locks and increments the versions, changing
(V, EXCLUSIVE) to (V+1, READWRITE) with a single atomic store. Note
that the upgrade to EXCLUSIVE, write-back, and release can be fused
into a single loop that iterates over the write-set in
chronological order. This modification decreases the lock-hold
time--that is the time that locks are in EXCLUSIVE state.
Critically, if a lock is in READONLY state because of a commit
operation being executed by thread T1, concurrent transactional
loads performed by thread T2 are allowed to proceed. (That is, when
a thread executing a commit has placed a lock in READONLY state,
concurrent transactional loads performed by other threads are
allowed to proceed).
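The lock-word progression described above can be sketched as follows. This single-threaded sketch encodes the three states in the two low bits of the lock word (an illustrative encoding the text does not fix) and elides the CAS, spinning, and read-set validation of a real implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { READWRITE = 0, READONLY = 1, EXCLUSIVE = 2 }; /* low 2 bits */
typedef uint64_t lw_t;
static lw_t     lw(uint64_t v, int s) { return (v << 2) | (lw_t)s; }
static int      lw_state(lw_t w)      { return (int)(w & 3u); }
static uint64_t lw_ver(lw_t w)        { return w >> 2; }

/* Walk the write-set locks READWRITE -> READONLY, then (after the
 * read-set would be validated) upgrade to EXCLUSIVE, write back,
 * and release as (version+1, READWRITE). In real code the first
 * transition uses CAS and the later ones plain stores. */
static void commit_locks(lw_t *locks, size_t n) {
    for (size_t i = 0; i < n; i++)                     /* CAS in real code */
        locks[i] = lw(lw_ver(locks[i]), READONLY);
    /* ...read-set validation happens here; concurrent readers may
     * still proceed while the locks sit in READONLY state... */
    for (size_t i = 0; i < n; i++) {
        locks[i] = lw(lw_ver(locks[i]), EXCLUSIVE);    /* plain store */
        /* deferred write-back for variables covered by lock i here */
        locks[i] = lw(lw_ver(locks[i]) + 1, READWRITE); /* release */
    }
}
```

The fused second loop mirrors the observation that upgrade, write-back, and release can share one pass over the write-set, minimizing EXCLUSIVE hold time.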
[0172] In yet another variation the commit operator would use CAS
to try to change all the write-set locks from READWRITE to
READONLY. Once in READONLY state commit would then use normal
atomic stores to upgrade the locks from READONLY to EXCLUSIVE. The
commit operator would then validate the read-set and,
conditionally, write-back the deferred stores saved in the
write-set and release the locks, incrementing the version
subfields. This adaptation minimizes aggregate lock-hold times.
Recall that CAS has high local latency even when successful.
Consider a transaction containing stores to variables V1 and V2
covered by distinct locks W1 and W2. The basic commit operator,
described earlier, uses CAS to lock W1 and then another CAS to lock
W2. The hold-time for W1 is increased because of the latency of the
CAS needed to acquire W2. The mechanism described here lessens the
impact of CAS latency.
[0173] 25. Transactions may be nested by folding or "flattening"
inner transactions into the outermost transaction. By nature,
longer transactions have a higher chance of failing because of
concurrent interference, however.
4.0 Additional Embodiments
[0174] In furtherance of the above discussion, embodiments herein
can operate in two modes which we will call encounter mode and
commit mode. These modes indicate how locks are acquired and how
transactions are committed or aborted. We will begin by further
describing our commit mode algorithm, later explaining how TL
operates in encounter mode.
[0175] We associate a special versioned-write-lock with every
transacted memory location. A versioned-write-lock is a simple spin
lock that uses a compare-and-swap (CAS) operation to acquire the
lock and a store to release it. Since one only needs a single bit
to indicate that the lock is taken, we use the rest of the lock
word to hold a version number. This number is incremented by every
successful lock-release. In encounter mode the version number is
displaced and a pointer into a thread's private undo log is
installed.
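A minimal versioned-write-lock along the lines just described might look as follows in C11; the bit layout (lock bit in bit 0, version above it) and the function names are our assumptions for the sketch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Versioned-write-lock: bit 0 is the lock bit, the remaining bits
 * hold the version number. Acquire uses compare-and-swap; release
 * is a single store that also increments the version. */
typedef _Atomic uint64_t vwlock_t;

static int vwlock_try_acquire(vwlock_t *l) {
    uint64_t old = atomic_load(l);
    if (old & 1u) return 0;                  /* already held by someone */
    return atomic_compare_exchange_strong(l, &old, old | 1u);
}
static void vwlock_release(vwlock_t *l) {
    uint64_t held = atomic_load(l);
    /* bump the version (upper bits) and clear the lock bit */
    atomic_store(l, ((held >> 1) + 1) << 1);
}
```

A caller that loses the CAS race simply observes a failed `vwlock_try_acquire` and may spin or abort, per the load rules described below.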
[0176] We allocate a collection of versioned-write-locks. We use
various schemes for associating locks with shared variables: per
object (PO), where a lock is assigned per shared object; per stripe
(PS), where we allocate a separate large array of locks and memory
is striped (divided up) using some hash function to map each
location to a separate stripe; and per word (PW), where each
transactionally referenced variable (word) is collocated adjacent
to a lock. Other
mappings between transactional shared variables and locks are
possible. The PW and PO schemes require either manual or
compiler-assisted automatic insertion of lock fields whereas PS can
be used with
unmodified data structures. Since in general PO showed better
performance than PW we will focus on PO and do not discuss PW
further. PO might be implemented, for instance, by leveraging the
header words of Java TM objects. A single PS stripe-lock array may
be shared and used for different TL data structures within a single
address-space. For instance an application with two distinct TL
red-black trees and three TL hash-tables could use a single PS
array for all TL locks. As our default mapping we chose an array of
2^20 entries of 32-bit lock words with the mapping function masking
the variable address with "0x3FFFFC" and then adding in the base
address of the lock array to derive the lock address.
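The default PS mapping just described reduces to a mask and an add. The sketch below takes the address as an integer to sidestep pointer-arithmetic concerns; the table and function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Per-stripe (PS) lock table: 2^20 entries of 32-bit lock words.
 * Masking with 0x3FFFFC keeps bits 2..21 of the variable address,
 * giving one stripe per 4-byte word, wrapping every 4 MB. */
static uint32_t lock_table[1u << 20];

static uint32_t *addr_to_lock(uintptr_t addr) {
    return (uint32_t *)((uintptr_t)lock_table + (addr & 0x3FFFFCu));
}
```

Note that addresses 4 MB apart alias to the same stripe, which is safe (it merely over-serializes) but is why the table size is a tunable.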
[0177] The following is a description of the PS algorithm although
most of the details carry through verbatim for PO and PW as well.
We maintain thread local read- and write-sets as linked lists. The
read-set entries contain the address of the lock and the observed
version number of the lock associated with the transactionally
loaded variable. The write-set entries contain the address of the
variable, the value to be written to the variable, and the address
of the lock that "covers" the variable. The write-set is kept in
chronological order to avoid write-after-write hazards.
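The thread-local sets just described might be declared as follows; the field names are illustrative, but the contents match the description (lock address plus observed version for reads; variable address, deferred value, and covering lock for writes):

```c
#include <assert.h>
#include <stdint.h>

/* Read-set entry: the lock's address and the version observed when
 * the associated variable was transactionally loaded. */
struct read_entry {
    uint32_t *lock;
    uint64_t  version;
    struct read_entry *next;
};
/* Write-set entry: kept in chronological order to avoid
 * write-after-write hazards at write-back time. */
struct write_entry {
    uintptr_t *addr;    /* address of the shared variable      */
    uintptr_t  value;   /* deferred value to store at commit   */
    uint32_t  *lock;    /* lock that covers the variable       */
    struct write_entry *next;
};
```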
4.1 Commit Mode
[0178] We now describe how TL executes in commit mode a sequential
code fragment that was placed within a TL transaction. As we
explain, this mode does not require type-stable garbage collection,
and works seamlessly with the memory life-cycle of languages like C
and C++.
1. Run the transactional code, reading the locks of all
fetched-from shared locations and building a local read-set and
write-set (use a safe load operation to avoid running off null
pointers as a result of reading an inconsistent view of
memory).
[0179] A transactional load first checks (using a filter such as a
Bloom filter) to see if the load address appears in the write-set;
if so, the transactional load returns the last value written to the
address. This provides the illusion of processor consistency and
avoids so-called read-after-write hazards. If the address is not
found in the write-set the load operation then fetches the lock
value associated with the variable, saving the version in the
read-set, and then fetches from the actual shared variable. If the
transactional load operation finds the variable locked the load may
either spin until the lock is released or abort the operation.
[0180] Transactional stores to shared locations are handled by
saving the address and value into the thread's local write-set. The
shared variables are not modified during this step. That is,
transactional stores are deferred and contingent upon successfully
completing the transaction. During the operation of the transaction
we periodically validate the read-set. If the read-set is found to
be invalid we abort the transaction. This avoids the possibility of
a doomed transaction (a transaction that has read inconsistent
global state) becoming trapped in an infinite loop.
[0181] 2. Attempt to commit the transaction. Acquire the locks of
locations to be written. If a lock in the write-set (or more
precisely a lock associated with a location in the write-set) also
appears in the read-set then the acquire operation must atomically
(a) acquire the lock and, (b) validate that the current lock
version sub-field agrees with the version found in the earliest
read-entry associated with that same lock. An atomic CAS can
accomplish both (a) and (b). Acquire the locks in any convenient
order using bounded spinning to avoid indefinite deadlock.
3. Re-read the locks of all read-only locations to make sure
version numbers haven't changed. If a version does not match,
roll-back (release) the locks, abort the transaction, and
retry.
4. The prior observed reads in step (1) have been validated as
forming an atomic snapshot of memory. The transaction is now
committed. Write-back all the entries from the local write-set to
the appropriate shared variables.
5. Release all the locks identified in the write-set by atomically
incrementing the version and clearing the write-lock bit (using a
simple store).
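Step (3) above, re-validating the read-set, can be sketched as a single pass over the saved entries; the entry layout follows the description, with bit 0 as the lock bit and the version above it (our assumed encoding):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rentry { const uint64_t *lock; uint64_t seen; };

/* Returns 1 if every read-set lock is unlocked and still carries
 * the version observed at load time; 0 means abort and retry. */
static int readset_valid(const struct rentry *rs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t w = *rs[i].lock;
        if ((w & 1u) || (w >> 1) != rs[i].seen) return 0;
    }
    return 1; /* reads form an atomic snapshot; safe to write back */
}
```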
[0182] A few things to note. The write-locks are held only for a
brief time while attempting to commit the transaction. This helps
improve performance under high contention. The Bloom filter allows
us to determine, by reading the single filter word, that a value is
not in the write-set and need not be searched for. Though locks could
have been acquired in ascending address order to avoid deadlock, we
found that sorting the addresses in the write set was not worth the
effort.
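A single-word Bloom filter of the kind just mentioned can be as simple as the following; the hash (word-granularity, low bits) is an illustrative choice, not one fixed by the text:

```c
#include <assert.h>
#include <stdint.h>

/* One bit of a 64-bit summary word per write-set address. A clear
 * bit proves the address is absent, so most transactional loads can
 * skip searching the write-set; a set bit may be a false positive,
 * in which case the write-set is searched anyway. */
static uint64_t bloom_bit(uintptr_t addr) {
    return 1ull << ((addr >> 3) & 63u);   /* illustrative hash */
}
static void bloom_add(uint64_t *filter, uintptr_t addr) {
    *filter |= bloom_bit(addr);
}
static int bloom_maybe_contains(uint64_t filter, uintptr_t addr) {
    return (filter & bloom_bit(addr)) != 0;
}
```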
4.2 Encounter Mode
The following is the TL encounter mode transaction. For reasons we
explain later, this mode assumes a type-stable closed memory pool
or garbage collection.
1. Run the transactional code, reading the locks of all
fetched-from shared locations and building a local read-set and
write-set (the write set is an undo set of the values before the
transactional writes).
[0183] Transactional stores to shared locations are handled by
acquiring locks as they are encountered, saving the address and
current value into the thread's local write-set, and pointing from
the lock to the write-set entry. The shared variables are written
with the new value during this step.
[0184] A transactional load checks to see if the lock is free or is
held by the current transaction and if so reads the value from the
location. There is thus no need to look for the value in the write
set. If the transactional load operation finds that the lock is
held it will spin. During the operation of the transaction we
periodically validate the read-set. If the read-set is found to be
invalid we abort the transaction. This avoids the possibility of a
doomed transaction (a transaction that has read inconsistent global
state) becoming trapped in an infinite loop.
2. Attempt to commit the transaction. Acquire the locks associated
with the write-set in any convenient order, using bounded spinning
to avoid deadlock.
3. Re-read the locks of all read-only locations to make sure
version numbers haven't changed. If a version does not match,
restore the values using the write-set, roll-back (release) the
locks, abort the transaction, and retry.
4. The prior observed reads in step (1) have been validated as
forming an atomic snapshot of memory. The transaction is now
committed.
5. Release all the locks identified in the write-set by atomically
incrementing the version and clearing the write-lock bit.
We note that the locks in encounter mode are held for a longer
duration than in commit mode, which accounts for weaker performance
under contention. However, one does not need to look-aside and
search through the write set for every read.
4.3 Contention Management
[0185] As described above TL can admit a live-lock failure.
Consider where thread T1's read-set is A and its write-set is B.
T2's read-set is B and write-set is A. T1 tries to commit and locks
B. T2 tries to commit and acquires A. T1 validates A, in its
read-set, and aborts as A is locked by T2. T2 validates B in its
read-set and aborts as B was locked by T1. We have mutual abort
with no progress. To provide liveness we use bounded spin and a
back-off delay at abort-time, similar in spirit to that found in
CSMA-CD MAC protocols. The delay interval is a function of (a) a
random number generated at abort-time, (b) the length of the prior
(aborted) write-set, and (c) the number of prior aborts for this
transactional attempt. It is important to note that unlike
conventional methods, we found that we do not need mechanisms for
one transaction to abort another to allow progress/liveness even in
encounter mode.
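One plausible shape for the abort-time delay just described, combining the random draw, the aborted write-set length, and the abort count, is sketched below; the text does not fix a formula, so the cap and the exponential widening are our assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Back-off window widens with the aborted write-set length and
 * (exponentially, capped) with the number of prior aborts for this
 * transactional attempt; the random draw supplies jitter, as in
 * CSMA-CD-style protocols. */
static uint32_t backoff_delay(uint32_t rnd, uint32_t writeset_len,
                              uint32_t aborts) {
    uint32_t shift  = aborts < 10 ? aborts : 10;      /* cap growth */
    uint32_t window = (writeset_len + 1u) << shift;   /* widen      */
    return rnd % window;                              /* jitter     */
}
```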
[0186] These mechanisms are unnecessary for performance or deadlock
avoidance, and in a sense contradict the very philosophy behind
transactional locking: rather than trying to improve on
hand-crafted lock-based implementations by being non-blocking
(hand-crafted lock-based data structures are not obstruction free),
we try to build lock-based STMs that will get us as close to their
behavior as one can with a completely mechanical approach, that is,
one that truly simplifies the job of the concurrent programmer.
4.4 The Pathology of Transactional Memory Management
[0187] For type-safe garbage collected managed runtime environments
such as Java any of the TL lock-mapping policies (PS, PO, or PW)
and modes (Commit or Encounter) are safe, as the GC assures that
transactionally accessed memory will only be released once no
references remain to the object. In C or C++, TL preferentially
uses the PS/Commit locking scheme to allow the C programmer to use
normal malloc( ) and free( ) operations to manage the lifecycle of
structures containing transactionally accessed shared
variables.
[0188] Concurrent mixed-mode transactional and non-transactional
accesses are proscribed. When a particular object is being accessed
with transactional load and store operations it must not be
accessed with normal non-transactional load and store operations.
(When any accesses to an object are transactional, all accesses
must be transactional). In PS/Commit mode an object can exit the
transactional domain and subsequently be accessed with normal
non-transactional loads and stores, but we must wait for the object
to quiesce before it leaves. There can be at most one transaction
holding the transactional lock, and quiescing means waiting for
that lock to be released, implying that all pending transactional
stores to the location have been "drained", before allowing the
object to exit the transactional domain and subsequently to be
accessed with normal load and store operations. Once it has
quiesced, the memory can be freed and recycled in a normal fashion,
because any transaction that may acquire the lock and reach the
disconnected location will fail its read-set validation.
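Since at most one committed transaction can hold the covering lock, quiescing reduces to waiting for that lock's lock-bit to clear. A minimal sketch, assuming the bit-0 lock-bit layout used above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Wait for any pending transactional store covered by this lock to
 * drain: at most one committing writer can hold the lock, so once
 * the lock-bit clears, the structure is quiesced and may be freed. */
static void quiesce(_Atomic uint64_t *lock) {
    while (atomic_load(lock) & 1u)
        ;  /* spin until the committing writer releases */
}
```

A caller would invoke `quiesce` on the covering lock(s) after its own commit finishes and before calling free( ) on the structure.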
[0189] To motivate the need for quiescing, consider the following
scenario with PS/Commit. We have a linked list of 3 nodes
identified by addresses A, B and C. A node contains Key, Value and
Next fields. The data structure implements a traditional key-value
mapping. The key-value map (the linked list) is protected by TL
using PS. Node A's Key field contains 1, its value field contains
1001 and its Next field refers to B. B's Key field contains 2, its
Value field contains 1002 and its Next field refers to C. C's Key
field contains 3, the value field 1003 and its Next field is NULL.
Thread T1 calls put(2, 2002). The TL-based put( ) operator
traverses the linked list using transactional loads and finds node
B with a key value of 2. T1 then executes a transactional store
into B.Value to change 1002 to 2002. T1's read-set consists of
A.Key, A.Next, B.Key and the write-set consists of B.Value. T1
attempts to commit; it acquires the lock covering B.Value and then
validates that the previously fetched read-set is consistent by
checking the version numbers in the locks covering the read-set.
Thread T1 stalls. Thread T2 executes delete(2). The delete( )
operator traverses the linked list and attempts to splice-out Node
B by setting A.Next to C. T2 successfully commits. The commit
operator stores C into A.Next. T2's transaction completes. T2 then
calls free(B). T1 resumes in the midst of its commit and stores
into B.Value. We have a classic modify-after-free pathology. To
avoid such problems T2 calls quiesce(B) after the commit finishes
but before free( )ing B. This allows T1's latent transactional ST
to drain into B before B is free( )ed and potentially reused. Note,
however, that TL (using quiescing) did not admit any outcomes that
were not already possible under a simple coarse-grained lock. Any
thread that attempts to write into B will, at commit-time, acquire
the lock covering B, validate A.Next and then store into B. Once B
has been unlinked there can be at most one thread that has
successfully committed and is in the process of writing into B.
Other transactions attempting to write into B will fail read-set
validation at commit-time as A.Next has changed.
[0190] Consider another problematic lifecycle scenario based on the
A, B, C linked list, above. Let's say we're using TL in
the C language to moderate concurrent access to the list, but with
either PO or PW mode where the lock word(s) are embedded in the
node. Thread T1 calls put(2, 2002). The TL-based put( ) method
traverses the list and locates node B having a key value of 2.
Thread T2 then calls delete(2). The delete( ) operator commits
successfully. T2 waits for B to quiesce and then calls free(B). The
memory underlying B is recycled and used by some other thread T3.
T1 attempts to commit by acquiring the lock covering B.Value. The
lock-word is collocated with B.Value, so the CAS operation
transiently changes the lock-word contents. T1 then validates the
read-set, recognizes that A.Next changed (because of T2's delete(
)) and aborts, restoring the original lock-word value. T1 has caused
the memory word underlying the lock for B.Value to "flicker",
however. Such modifications are unacceptable; we have a classic
modify-after-free error.
[0191] Finally, consider the following pathological scenario
admitted by PS/Encounter. T1 calls put(2,2002). Put( ) traverses
the list and locates node B. T2 then calls delete(2), commits
successfully, calls quiesce(B) and free(B). T1 acquires the lock
covering B.Value, saves the original B.Value (1002) into its
private write undo log, and then stores 2002 into B.Value. Later,
during read-set validation at commit time, T1 will discover that
its read-set is invalid and abort, rolling back B.Value from 2002
to 1002. As above, this constitutes a modify-after-free pathology
where B was recycled, but B.Value transiently "flickered" from 1002 to
2002 to 1002. We can avoid this problem by enhancing the encounter
protocol to validate the read-set after each lock acquisition but
before storing into the shared variable. This confers safety, but
at the cost of additional performance.
[0192] As such, we advocate using PS/Commit for normal C code as
the lock-words (metadata) are stored separately in type-stable
memory distinct from the data protected by the locks. This
provision can be relaxed if the C-code uses some type of garbage
collection (such as Boehm-style conservative garbage collection for
C, Michael-style hazard pointers or Fraser-style Epoch-Based
Reclamation) or type-stable storage for the nodes.
4.5 Mechanical Transformation of Sequential Code
[0193] As we discussed earlier, the algorithm we describe can be
added to code in a mechanical fashion, that is, without
understanding anything about how the code works or what the program
itself does. In our benchmarks, we performed the transformation by
hand. We do however believe that it may be feasible to automate
this process and allow a compiler to perform the transformation
given a few rather simple limitations on the code structure within
a transaction.
[0194] We note that hand-crafted data structures can always have an
advantage over TL, as TL has no way of knowing that prior loads
executed within a transaction might no longer have any bearing on
results produced by the transaction.
[0195] Consider the following scenario where we have a TL-protected
hash table. Thread T1 traverses a long hash bucket chain searching
for the value associated with a certain key, iterating over
"next" fields. We'll say that T1 locates the appropriate node at or
near the end of the linked list. T2 concurrently deletes an
unrelated node earlier in the same linked list. T2 commits. At
commit-time T1 will abort because the linked-list "next" field
written to by T2 is in T1's read-set. T1 must retry the lookup
operation (ostensibly locating the same node). Given our
domain-specific knowledge of the linked list we understand that the
lookup and delete operations didn't really conflict and could have
been allowed to operate concurrently with no aborts. A clever "hand
over hand" ad-hoc hand-coded locking scheme would have the
advantage of allowing this desired concurrency. Nevertheless, as
our empirical analysis later in the paper shows, in the data
structure we tested the beneficial effect of this added concurrency
on overall application scalability does not seem to be as profound
as one would think.
4.6 Software-Hardware Inter-Operability
[0196] Though we have described TL as a software based scheme, it
can be made inter-operable with HTM systems on several levels.
[0197] On a machine supporting dynamic hardware transactions,
transactions executed in hardware need only verify for each
location that they
read or write that the associated versioned-write-lock is free.
There is no need for the hardware transaction to store an
intermediate locked state into the lock word(s). For every write
they also need to update the version number of the associated
stripe lock upon completion. This suffices to provide
inter-operability between hardware and software transactions. Any
software read will detect concurrent modifications of locations by
hardware writes because the version number of the associated lock
will have changed. Any hardware transaction will fail if a
concurrent software transaction is holding the lock to write.
Software transactions attempting to write will also fail in
acquiring a lock on a location since lock acquisition is done using
an atomic hardware synchronization operation (such as CAS or a
single location transaction) which will fail if the version number
of the location was modified by the hardware transaction.
[0198] One can also think of using a static bounded size
obstruction-free hardware transaction to speed up software TL. This
may be done variously by attempting to complete the entire commit
operation with a single hardware transaction, or, alternately, by
using hardware transactions to acquire the write locks "in bulk".
This latter approach is beneficial if bulk acquisition of the
write-locks via hardware transactions is faster (has lower latency)
than acquiring one write lock at a time with CAS. Since the write
set is known in advance, we require only static hardware
transactions. Because for many data structures the number of writes
is significantly smaller than the number of reads, it may well be
that in most cases these hardware transactions can be bounded in
size. If all write locks do not fit in a single hardware
transaction, one can apply several of them in sequence using the
same scheme we currently use to acquire individual locks. However,
as we report above, we found the relative contribution of the lock
acquisition time to latency to be small, so it is not clear how
much of a saving a hardware transaction will provide over the use
of CAS operations.
[0199] One can also use TL as a hybrid backup mechanism to extend
bounded size dynamic hardware transactions to arbitrary size.
Again, our empirical testing suggests that there is not much of a
gain in this approach.
[0200] While this invention has been particularly shown and
described with references to preferred embodiments thereof, it will
be understood by those skilled in the art that various changes in
form and details may be made therein without departing from the
spirit and scope of the present application as defined by the
appended claims. Such variations are covered by the scope of this
present disclosure. As such, the foregoing description of
embodiments of the present application is not intended to be
limiting. Rather, any limitations to the invention are presented in
the following claims. Note that the different embodiments disclosed
herein can be combined or utilized individually with respect to
each other.
We claim:
* * * * *