U.S. patent application number 11/475716 was filed with the patent office on 2006-06-27 and published on 2007-08-23 as publication number 20070198979, for methods and apparatus to implement parallel transactions.
The invention is credited to David Dice and Nir N. Shavit.
Application Number | 11/475716 |
Publication Number | 20070198979 |
Family ID | 38429749 |
Filed Date | 2006-06-27 |
Publication Date | 2007-08-23 |
United States Patent Application | 20070198979 |
Kind Code | A1 |
Dice; David; et al. |
August 23, 2007 |
Methods and apparatus to implement parallel transactions
Abstract
For each of multiple processes executing in parallel, as long as
corresponding version information associated with a respective set
of one or more shared variables used for computational purposes has
not changed during execution of a respective transaction, results
of the respective transaction can be globally committed to memory
without causing data corruption. If version information associated
with one or more respective shared variables (used to produce the
transaction results) happens to change during a process of
generating respective results, then a respective process can
identify that another process modified the one or more respective
shared variables during execution and that its transaction results
should not be committed to memory. In this latter case, the
transaction repeats itself until it is able to commit respective
results without causing data corruption.
Inventors: | Dice; David; (Foxborough, MA); Shavit; Nir N.; (Cambridge, MA) |
Correspondence Address: | BARRY W. CHAPIN, ESQ.; CHAPIN INTELLECTUAL PROPERTY LAW, LLC; WESTBOROUGH OFFICE PARK; 1700 WEST PARK DRIVE; WESTBOROUGH, MA 01581, US |
Family ID: | 38429749 |
Appl. No.: | 11/475716 |
Filed: | June 27, 2006 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
60775580 | Feb 22, 2006 | |
60775564 | Feb 22, 2006 | |
60789483 | Apr 5, 2006 | |
Current U.S. Class: | 718/100 |
Current CPC Class: | G06F 9/52 20130101; G06F 12/0893 20130101; G06F 9/466 20130101; G06F 9/526 20130101; G06F 12/0806 20130101; G06F 12/0815 20130101 |
Class at Publication: | 718/100 |
International Class: | G06F 9/46 20060101 G06F009/46 |
Claims
1. A method comprising: executing a transaction defined by a
corresponding set of instructions to produce a respective
transaction outcome based on use of at least one shared variable;
in lieu of locking and modifying a given shared variable during
execution of the transaction, initiating a lock on the given shared
variable after producing the respective transaction outcome via use
of locally modified data values, the lock preventing other
processes from modifying a data value associated with the given
shared variable; and after obtaining the lock, initiating a
modification of the data value associated with the given shared
variable even though at least one of the other processes performed
a computation using the data value associated with the given shared
variable before the lock and during execution of the
transaction.
2. A method as in claim 1, wherein executing the transaction
includes: maintaining version information in a locally managed read
set associated with the transaction, the read set not being
accessible by the other processes using the shared variables, the
read set identifying versions associated with each of multiple
shared variables used to generate the respective transaction
outcome, the version information indicating respective versions of
the multiple shared variables at a time when the transaction
retrieves respective data values associated with the multiple
shared variables from a globally accessible repository.
3. A method as in claim 2, wherein executing the transaction
further includes: after acquiring the lock associated with the
given shared variable and before modifying the data value
associated with the given shared variable, verifying that newly
read version information associated with each of the multiple
shared variables used to generate the respective transaction
outcome matches the version information in the locally managed read
set associated with the transaction.
4. A method as in claim 3, wherein the newly read version
information indicates that the data values associated with the
multiple shared variables used to generate the transaction outcome
have not been changed by the other processes during execution of
the transaction to produce the respective transaction outcome.
5. A method as in claim 2, wherein initiating the lock includes:
identifying that another process has a respective lock on the given
shared variable; and utilizing a specified backoff time to acquire
the lock on the given shared variable, the backoff time being a
random value relative to the other processes that also attempt to
acquire the lock associated with the given shared variable.
6. A method as in claim 1, wherein executing the transaction
includes: complying with a respective rule indicating size
limitations associated with the transaction to enhance efficiency
of multiple processes executing different transactions using a same
set of shared variables including the given shared variable to
produce respective transaction outcomes.
7. A method as in claim 1 further comprising: maintaining version
information associated with each of multiple shared variables, the
version information indicating occurrences of data value changes
associated with each of the multiple shared variables; and wherein
initiating the lock on the given shared variable includes: if the
given shared variable was read at any time during execution of the
transaction, atomically: i) acquiring the lock on the shared
variable, and ii) validating that a present version value
associated with the given shared variable matches a previous
version value of the given shared variable when read during
execution of the transaction.
8. A method as in claim 1 further comprising: in response to
identifying that a corresponding data value associated with the at
least one shared variable was modified during execution of the
transaction, aborting the transaction in lieu of modifying the data
value associated with the given shared variable; and initiating
execution of the transaction again to produce the respective
transaction outcome.
9. A method as in claim 1 further comprising: maintaining a locally
managed and accessible write set of data values associated with
each of multiple shared variables that are locally but not globally
modified during execution of the transaction, the local write set
representing data values: i) not yet globally committed and ii) not
yet globally accessible by the other processes.
10. A method as in claim 9 further comprising: after completing
execution of the transaction, initiating locks on each of the
multiple shared variables specified in the write set which were
modified during execution of the transaction, the locks preventing
the other processes from changing data values associated with the
multiple shared variables.
11. A method as in claim 10 further comprising: utilizing a
hash-based filter function during execution of the transaction to
identify whether a corresponding data value associated with a
respective globally accessible variable already exists locally in
the write set and should be modified in lieu of performing a
respective read to globally accessible shared data.
12. A method as in claim 1 further comprising: after the
modification of the data value associated with the given shared
variable in a global environment accessible by the other processes,
incrementing globally accessible version information associated
with the shared variable to indicate that the given shared variable
has been modified.
13. A method as in claim 1 further comprising: initiating a compare
function to verify that the at least one shared variable has not
been modified during execution of the corresponding set of
instructions prior to initiating the lock on the given shared
variable; and aborting execution of the transaction if the at least
one shared variable has been modified.
14. A method as in claim 1, wherein steps of executing the
transaction, initiating the lock, and initiating the modification
are carried out in software, the method further comprising
utilizing hardware transactional memory as an accelerator for
executing the transaction.
15. A method as in claim 1 further comprising: maintaining a
locally managed and accessible write set of data values associated
with each of multiple shared variables that are locally but not
globally modified during execution of the transaction, the local
write set representing data values: i) not yet globally committed
and ii) not yet globally accessible by the other processes;
initiating locks on each of the multiple shared variables specified
in the write set which were modified during execution of the
transaction to prevent the other processes from changing data
values associated with the multiple shared variables; verifying
that respective data values associated with the multiple shared
variables accessed during the transaction have not been globally
modified by the other processes during execution of the transaction
by checking that respective version values associated with the
multiple shared variables have not changed during execution of the
transaction; and after modifying data values associated with the
multiple shared variables, releasing the locks on each of the
multiple shared variables.
16. A method comprising: maintaining segments of information that
are shared by multiple processes executing in parallel; for each of
at least two of the segments, maintaining a corresponding location
to store a respective version value representing a relative version
of a respective segment, the relative version being changed each
time contents of the respective segment is modified; and enabling
the multiple processes to compete and secure an exclusive access
lock with respect to each of the at least two segments to prevent
other processes from modifying a respective locked segment.
17. A method as in claim 16 further comprising: for each of at
least two of the segments, maintaining a corresponding location to
store globally accessible lock information indicating whether one
of the multiple processes executing in parallel has locked a
respective segment for: i) changing a respective data value
therein, and ii) preventing other processes from reading respective
data values from the respective segment; and enabling the multiple
processes to retrieve version information associated with the
respective at least two segments to identify whether contents of a
respective segment have changed over time.
18. A method comprising: in a given process of multiple processes
executing in parallel: maintaining a locally managed write set of
data values associated with globally accessible shared variables,
the locally managed write set accessible only by the given process,
the globally accessible shared variables accessible by the multiple
processes; while executing a transaction including multiple
instructions, modifying data values associated with the locally
managed write set in lieu of modifying the globally accessible
shared variables; and after completion of execution of the
transaction, initiating locks on each of the globally accessible
shared variables specified in the write set in order to: i) prevent
other processes from changing data values associated with
respective locked shared variables and ii) commit data values in
the locally managed write set to the globally accessible shared
variables.
19. A method comprising: performing at least one transactional
access to segments of information in transactional memory that are
shared by multiple processes executing in parallel; and competing
amongst multiple other processes to secure an exclusive access lock
with respect to a segment in the transactional memory to prevent
other processes from modifying a respective locked segment, use of
respective access locks enabling transactional memory to
interoperate with any malloc and free operations.
20. A method as in claim 19 further comprising: utilizing a
hash-based filter function during execution of a respective
transaction to identify whether a corresponding data value
associated with a respective globally accessible variable already
exists locally in a write set, the write set being a scratchpad for
temporarily maintaining data values locally in lieu of modifying
the data values in the transactional memory.
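The commit sequence recited in claims 15 and 18 above (buffer writes in a local write set, lock the written variables, validate the read-set versions, then commit and release) can be illustrated with a minimal sketch. This is an illustrative reconstruction in Python, not code from the application; the names (`SharedVar`, `run_transaction`, `read`, `write`) are hypothetical, and a real implementation would rely on atomic hardware operations rather than Python locks:

```python
import threading

class SharedVar:
    """A globally accessible shared variable with a lock and version information."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

def run_transaction(body):
    """Execute body(read, write) optimistically, then lock, validate, and commit."""
    while True:
        read_set = {}    # shared variable -> version observed at first read
        write_set = {}   # shared variable -> locally buffered (uncommitted) value

        def read(var):
            if var in write_set:               # value already modified locally
                return write_set[var]
            read_set.setdefault(var, var.version)
            return var.value

        def write(var, value):
            write_set[var] = value             # modify locally in lieu of globally

        body(read, write)

        locked = sorted(write_set, key=id)     # fixed order to avoid deadlock
        for var in locked:
            var.lock.acquire()                 # prevent others from changing it
        try:
            # Verify no variable read during the transaction was changed by others.
            if all(var.version == v for var, v in read_set.items()):
                for var, value in write_set.items():
                    var.value = value          # globally commit the buffered value
                    var.version += 1           # advertise the modification
                return
        finally:
            for var in locked:
                var.lock.release()
        # Validation failed: another process interfered; repeat the transaction.

a, b = SharedVar(10), SharedVar(0)
run_transaction(lambda read, write: (write(a, read(a) - 3),
                                     write(b, read(b) + 3)))
```

Because writes are buffered locally until the commit point, other processes never observe a half-finished transfer between `a` and `b`.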
Description
RELATED APPLICATION
[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application Ser. No. 60/775,580 (Attorney's docket no. SUN06-02 (060720)p), filed on Feb. 22, 2006, entitled "Transactional Locking," the entire teachings of which are incorporated herein by this reference.
[0002] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-03(060711), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/775,564 (Attorney's docket no. SUN06-01(060711)p), filed on Feb.
22, 2006, entitled "Switching Between Read-Write Locks and
Transactional Locking," the entire teachings of which are
incorporated herein by this reference.
[0003] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-06(060908), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/789,483 (Attorney's docket no. SUN06-05(060908)p), filed on Apr.
5, 2006, entitled "Globally Versioned Transactional Locking," the
entire teachings of which are incorporated herein by this
reference.
[0004] This application is related to U.S. patent application
identified by Attorney's docket no. SUN06-08(061191), filed on same
date as the present application, entitled "METHODS AND APPARATUS TO
IMPLEMENT PARALLEL TRANSACTIONS," which itself claims the benefit
of and priority to U.S. Provisional Patent Application Ser. No.
60/775,564 (Attorney's docket no. SUN06-01(060711)p), filed on Feb.
22, 2006, entitled "Switching Between Read-Write Locks and
Transactional Locking," the entire teachings of which are
incorporated herein by this reference.
BACKGROUND
[0005] There has been an ongoing trend in the information
technology industry to execute software programs more quickly. For
example, there are various conventional advancements that provide
for increased execution speed of software programs. One technique
for increasing execution speed of a program is called parallelism.
Parallelism is the practice of executing or performing multiple
things simultaneously. Parallelism can be possible on multiple
levels, from executing multiple instructions at the same time, to
executing multiple threads at the same time, to executing multiple
programs at the same time, and so on. Instruction Level Parallelism
or ILP is parallelism at the lowest level and involves executing
multiple instructions simultaneously. Processors that exploit ILP
are typically called multiple-issue processors, meaning they can
issue multiple instructions in a single clock cycle to the various
functional units on the processor chip.
[0006] There are different types of conventional multiple-issue
processors. One type of multiple-issue processor is a superscalar
processor in which a sequential list of program instructions is
dynamically scheduled. A respective processor determines which
instructions can be executed on the same clock cycle, and sends
them out to their respective functional units to be executed. This
type of multi-issue processor is called an in-order-issue processor
since issuance of instructions is performed in the same sequential
order as the program sequence, but issued instructions may complete
at different times (e.g., short instructions requiring fewer cycles
may complete before longer ones requiring more cycles).
[0007] Another type of multi-issue processor is called a VLIW (Very
Long Instruction Word) processor. A VLIW processor depends on a
compiler to do all the work of instruction reordering and the
processor executes the instructions that the compiler provides as
fast as possible according to the compiler-determined order. Other
types of multi-issue processors issue instructions out of order, meaning the issue order need not match the order in which the instructions appear in the program.
[0008] Conventional techniques for executing instructions using ILP
can utilize look-ahead techniques to find a larger amount of
instructions that can execute in parallel within an instruction
window. Looking-ahead often involves determining which instructions
might depend upon others during execution for such things as shared
variables, shared memory, interference conditions, and the like.
When scheduling, a handler associated with the processor detects a
group of instructions that do not interfere or depend on each
other. The processor can then issue execution of these instructions
in parallel thus conserving processor cycles and resulting in
faster execution of the program.
[0009] One type of conventional parallel processing involves the use of coarse-grained locking. As its name suggests, coarse-grained locking uses lockouts to prevent conflicting groups of code running in different processes from operating on shared data at the same time. Accordingly, this technique enables non-conflicting transactions or sets of instructions to execute in parallel.
[0010] Another type of conventional parallel processing involves the use of fine-grain locking. As its name suggests, fine-grain locking uses lockouts to prevent conflicting instructions from being executed simultaneously. This technique enables non-conflicting instructions to execute in parallel.
SUMMARY
[0011] Conventional applications that support parallel processing
can suffer from a number of deficiencies. For example, although
easy to implement from the perspective of a software developer,
coarse-grained locking techniques provide very poor performance
because they can severely limit parallelism. Although fine-grain
lock-based concurrent software can perform exceptionally well
during run-time, developing such code can be a very difficult task for software developers.
[0012] Techniques discussed herein deviate from conventional approaches such as those discussed above, as well as from other techniques known in the prior art. For example, embodiments
herein include techniques for enhancing performance associated with
transactions executing in parallel.
[0013] In general, a transactional memory programming technique
according to embodiments herein provides an alternative type of
"lock" method over the conventional techniques as discussed above.
For example, one embodiment herein involves use and/or maintenance
of version information indicating whether any of multiple
"globally" shared variables has been modified during a course of
executing a respective transaction (e.g., a set of software
instructions initiating a respective computation). Any one of
multiple possible processes executing in parallel can update
respective version information associated with a globally shared
variable (e.g., a shared variable accessible by any of multiple
processes) in order to indicate that the shared variable has been
modified. Accordingly, other processes keeping track of the version information during execution of their own respective transactions can identify if and when any shared variables have been modified during a window of use. If any critical
variables have been modified, a respective process can prevent
corresponding computational results from being committed to
memory.
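The version-tracking scheme just described can be sketched as follows. This is an illustrative Python sketch; the class and method names (`VersionedVar`, `read`, `update`) are hypothetical, not taken from the application:

```python
class VersionedVar:
    """A shared variable paired with version information (a change counter)."""
    def __init__(self, value):
        self._value = value
        self._version = 0

    def read(self):
        # A process records the version alongside the value it reads.
        return self._value, self._version

    def update(self, value):
        # Any process that modifies the variable also updates its version,
        # so processes tracking version information can detect the change.
        self._value = value
        self._version += 1

x = VersionedVar(5)
_, v_seen = x.read()   # a transaction reads x and records its version
x.update(6)            # another process modifies the shared variable
_, v_now = x.read()
modified_during_window = (v_now != v_seen)   # change is now detectable
```

A process that observes `modified_during_window` set would refrain from committing results computed from the stale value.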
[0014] That is, for each of multiple processes executing in
parallel, as long as version information associated with a
respective set of one or more shared variables used for
computational purposes has not changed during execution of a
respective transaction, results of the respective transaction can
be committed globally without causing data corruption by one or
more processes simultaneously using the shared variable. If version
information associated with one or more respective shared variables
(used to produce the transaction results) happens to change during
a process of generating respective results, then a respective
process can identify that another process modified the one or more
respective shared variables during execution and prevent global
committal of the respective results. In this latter case, the
transaction can repeat itself (e.g., execute again or retry) until
the process is able to commit respective results without causing
data corruption. In this way, each of multiple processes executing
in parallel can "blindly" initiate computations using the shared
variables even though there is a chance that another process
executing in parallel modifies a mutually used shared variable and
prevents the process from globally committing its results.
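The retry behavior described above (compute "blindly", commit only if the version is unchanged, otherwise execute the transaction again) might be sketched like this, using a single shared variable for brevity; all names are hypothetical:

```python
class Var:
    """Minimal shared variable with version information."""
    def __init__(self, value):
        self.value, self.version = value, 0
    def update(self, value):
        self.value, self.version = value, self.version + 1

def commit_when_unchanged(var, compute):
    """Repeat the transaction until its results commit without interference."""
    attempts = 0
    while True:
        attempts += 1
        value, version = var.value, var.version   # snapshot at transaction start
        result = compute(value)                   # "blind" computation
        if var.version == version:                # no one modified var meanwhile
            var.update(result)                    # globally commit the result
            return result, attempts
        # Version changed during execution: result is discarded and retried.

x = Var(10)
state = {"calls": 0}
def double(v):
    state["calls"] += 1
    if state["calls"] == 1:
        x.update(11)   # simulate another process interfering mid-transaction
    return v * 2

result, attempts = commit_when_unchanged(x, double)
```

The first attempt is aborted by the simulated interference; the second attempt reads the updated value 11 and commits 22.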
[0015] In view of the specific embodiment discussed above, more
general embodiments herein are directed to maintaining version
information associated with shared variables. In one embodiment, a
computer environment includes segments of information (e.g., groupings, sections, portions, etc. of a repository for storing data values associated with one or more variables) that are shared
by multiple processes executing in parallel. For each of at least
two of the segments, the computer environment includes a
corresponding location to store a respective version value (e.g.,
version information) representing a relative version of a
respective segment. A relative version associated with a segment is
changed or updated by a respective process each time any contents
(e.g., data values of one or more respective shared variables) in a
respective segment have been modified. Accordingly, other processes
keeping track of version information associated with a respective
segment can identify if and when contents of the respective segment
have been modified.
[0016] In one embodiment, one or more processes in the computer
environment can use contents stored in the one or more segments to
generate new data values for storage in a segment. A respective
process can initiate modification of a data value associated with a
shared variable. For example, in one embodiment, the processes can
compete to secure an exclusive access lock with respect to each of
multiple segments to prevent other processes from modifying a
respective locked segment. Locking of a segment (e.g., a single or
multiple shared variables) can prevent two or more processes from
modifying a same data segment. Locking of a segment also may notify other processes that they should not use the contents of the respective segment for a current transaction and/or that previous computations associated with a current transaction must be aborted.
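The competition for an exclusive per-segment lock described above can be sketched with a non-blocking acquire: whichever process wins may modify the segment, while losers must abort or retry. This is an illustrative Python sketch (a real implementation would typically use an atomic compare-and-swap on a lock word rather than a Python lock):

```python
import threading

class Segment:
    """A lockable segment of shared data (e.g., one or more shared variables)."""
    def __init__(self, data):
        self.data = data
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking attempt; at most one competing process succeeds.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

segment = Segment({"balance": 100})
winners = []

def contend(pid):
    if segment.try_acquire():
        winners.append(pid)   # only the winner may modify segment.data
    # Losers see the segment as locked and must abort or retry their transaction.

threads = [threading.Thread(target=contend, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since the winner never releases the lock here, exactly one of the four competing threads succeeds.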
[0017] According to further embodiments, a computer environment can
be configured to maintain, for each of multiple segments of shared
data, a corresponding location to store globally accessible lock
information indicating whether one of the multiple processes
executing in parallel has locked a respective segment for: i)
changing one or more respective data values therein, and ii)
preventing other processes from reading respective data values from
the respective segment. In other words, acquiring a lock on a
segment prevents other processes from accessing data values in the
locked segment.
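One common way to realize a per-segment location holding both lock information and version information is a single word whose high bit is the lock indicator and whose remaining bits hold the version. This encoding is an assumption made for illustration, not a detail taken from the application:

```python
LOCK_BIT = 1 << 63   # high bit: lock indicator; low bits: version information

class SegmentWord:
    """One globally accessible word per segment holding lock + version state."""
    def __init__(self):
        self.word = 0                   # unlocked, version 0

    def is_locked(self):
        return bool(self.word & LOCK_BIT)

    def version(self):
        return self.word & ~LOCK_BIT

    def lock(self):
        assert not self.is_locked()     # caller must have won the race
        self.word |= LOCK_BIT           # other processes now see it locked

    def unlock_and_bump(self):
        # Release the lock and publish an incremented version in one store,
        # telling readers that the segment's contents have changed.
        self.word = self.version() + 1

w = SegmentWord()
w.lock()
seen_locked = w.is_locked()   # a reader observing this would back off
w.unlock_and_bump()
```

Packing both fields into one word lets a reader sample the lock state and the version in a single load.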
[0018] Additionally, the computer environment can enable the
multiple processes to maintain (e.g., store, retrieve, use, etc.)
version information associated with the respective multiple
segments to identify whether contents of a respective segment have
changed over time. For example, a computer environment can include
globally accessible version information enabling a respective one
of the processes to modify respective version value information
associated with shared variables. The version value information represents a relative version of a given segment; a respective process sets it to a new unique value to indicate that it modified a data value associated with the given segment.
[0019] As a more specific example, a first process can retrieve a
data value associated with a shared variable as well as retrieve a
current version value associated with the shared variable when the
shared variable is accessed. The first process stores the version
value associated with the shared variable and then can perform
computations (e.g., a transaction) using the shared variable. Prior
to globally committing results associated with the transaction, the
first process can verify that no other process modified the shared
variable by checking current version information associated with
the shared variable. If the version information associated with one
or more shared variables at a committal phase of the transaction
matches corresponding originally obtained version information
associated with the one or more shared variables during an
execution phase of the transaction, then the first process can
globally commit results of the transaction to memory.
Alternatively, the first process can abort and repeat a transaction
until it is able to complete without interference. If and when the first process is able to globally commit its results from a respective transaction to memory, then the first process updates
version information associated with any data values (or segments)
that are modified during the commit phase. Accordingly, a second
process (or multiple other processes) can identify if and when a
data value associated with the one or more shared variables changes
and prevent or initiate its own global committal depending on
current processing circumstances.
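The first-process behavior just described, recording versions at read time, re-checking them at the committal phase, and either committing or aborting, might look like this hypothetical Python sketch:

```python
class Shared:
    """Shared variable plus globally accessible version information."""
    def __init__(self, value):
        self.value, self.version = value, 0

def execute_and_commit(inputs, transaction):
    """Run transaction on a snapshot; commit only if versions still match."""
    observed = {v: v.version for v in inputs}        # execution-phase versions
    results = transaction({v: v.value for v in inputs})
    # Committal phase: verify no other process modified the inputs meanwhile.
    if all(v.version == observed[v] for v in inputs):
        for var, new_value in results.items():
            var.value = new_value
            var.version += 1   # a second process can now detect the change
        return True
    return False               # abort; the caller repeats the transaction

a, b = Shared(2), Shared(3)
ok = execute_and_commit([a, b], lambda vals: {a: vals[a] + vals[b]})

def interfering(vals):
    b.version += 1             # simulate another process modifying b mid-run
    return {a: vals[a] * 10}

ok2 = execute_and_commit([a, b], interfering)   # fails validation, aborts
```

The first call commits (`a` becomes 5 and its version advances); the second is aborted because `b`'s version no longer matches the one recorded during execution, leaving `a` untouched.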
[0020] Techniques herein are well suited for use in applications
such as those supporting parallel processing and use of shared
data. However, it should be noted that configurations herein are
not limited to such use and thus configurations herein and
deviations thereof are well suited for use in other environments as
well.
[0021] In addition to the embodiments discussed above, other
embodiments herein include a computerized device (e.g., a host
computer, workstation, etc.) configured to support the techniques
disclosed herein such as supporting parallel execution of transactions performed by different processes. In such embodiments,
a computer environment includes a memory system, a processor (e.g.,
a processing device), a respective display, and an interconnect
connecting the processor and the memory system. The interconnect
can also support communications with the respective display (e.g.,
display screen or display medium). The memory system is encoded
with an application that, when executed on the processor, supports
parallel processing according to techniques herein.
[0022] Yet other embodiments of the present disclosure include
software programs to perform the method embodiment and operations
summarized above and disclosed in detail below in the Detailed
Description section of this disclosure. More specifically, one
embodiment herein includes a computer program product (e.g., a
computer-readable medium). The computer program product includes
computer program logic (e.g., software instructions) encoded
thereon. Such computer instructions can be executed on a
computerized device to support parallel processing according to
embodiments herein. For example, the computer program logic, when
executed on at least one processor associated with a computing
system, causes the processor to perform the operations (e.g., the
methods) indicated herein as embodiments of the present disclosure.
Such arrangements as further disclosed herein can be provided as
software, code and/or other data structures arranged or encoded on
a computer readable medium such as an optical medium (e.g.,
CD-ROM), floppy or hard disk, or other medium such as firmware or
microcode in one or more ROM or RAM or PROM chips or as an
Application Specific Integrated Circuit (ASIC). The software or
firmware or other such configurations can be installed on a
computerized device to cause one or more processors in the
computerized device to perform the techniques explained herein.
[0023] Yet another more particular technique of the present
disclosure is directed to a computer program product that includes
a computer readable medium having instructions stored thereon to facilitate use of shared information among multiple processes.
The instructions, when carried out by a processor of a respective
computer device, cause the processor to perform the steps of: i)
executing a transaction defined by a corresponding set of
instructions to produce a respective transaction outcome based on
use of at least one shared variable; ii) after producing the
respective transaction outcome, initiating a lock on a given shared
variable to prevent other processes from modifying a data value
associated with the given shared variable; and iii) initiating a
modification of the data value associated with the given shared
variable based on the respective transaction outcome even though at
least one of the other processes performed a computation using the
data value associated with the given shared variable before the
lock. Other embodiments of the present application include software
programs to perform any of the method embodiment steps and
operations summarized above and disclosed in detail below.
[0024] It is to be understood that the system of the invention can
be embodied as a software program, as software and hardware, and/or
as hardware alone. Example embodiments of the invention may be
implemented within computer systems, processors, and computer
program products and/or software applications manufactured by Sun
Microsystems Inc. of Palo Alto, Calif., USA.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] The foregoing and other objects, features, and advantages of
the present application will be apparent from the following more
particular description of preferred embodiments of the present
disclosure, as illustrated in the accompanying drawings in which
like reference characters refer to the same parts throughout the
different views. The drawings are not necessarily to scale, with
emphasis instead being placed upon illustrating the embodiments,
principles and concepts.
[0026] FIG. 1 is a diagram illustrating a computer environment
enabling multiple processes to access shared variable data
according to embodiments herein.
[0027] FIG. 2 is a diagram illustrating maintenance and use of
version and lock information associated with shared data according
to embodiments herein.
[0028] FIG. 3 is a diagram of a sample process including a read-set
and write-set according to embodiments herein.
[0029] FIG. 4 is a diagram of a flowchart illustrating execution of
a transaction according to an embodiment herein.
[0030] FIG. 5 is a diagram of a flowchart illustrating execution of
a transaction according to embodiments herein.
[0031] FIG. 6 is a diagram of a sample architecture supporting
shared use of data according to embodiments herein.
[0032] FIG. 7 is a diagram of a flowchart according to an
embodiment herein.
[0033] FIG. 8 is a diagram of a flowchart according to an
embodiment herein.
[0034] FIG. 9 is a diagram of a flowchart according to an
embodiment herein.
DETAILED DESCRIPTION
[0035] For each of multiple processes executing in parallel, as
long as corresponding version information associated with a
respective set of one or more shared variables used for
computational purposes has not changed during execution of a
respective transaction, results of the respective transaction can
be globally committed to memory without causing data corruption. If
version information associated with one or more corresponding
shared variables (used to produce the transaction results for the
respective transaction) happens to change thus indicating that
another process modified shared data used to generate results
associated with the respective transaction, then results associated
with the respective transaction are not committed to memory for
global access. In this latter case, the respective transaction
repeats itself until the respective transaction is able to commit
respective results without causing potential data corruption as a
result of data changing during execution of the respective
transaction.
[0036] FIG. 1 is a block diagram of a computer environment 100
according to an embodiment herein. As shown, computer environment
100 includes shared data 125 and corresponding metadata 135 (e.g.,
in a respective repository) that is globally accessible by multiple
processes 140 such as process 140-1, process 140-2, . . . process
140-M. In one embodiment, each of processes 140 is a processing
thread. Metadata 135 enables each of processes 140 to identify
whether portions of shared data 125 have been "locked" and/or
whether any portions of shared data 125 have changed during
execution of a respective transaction.
[0037] Each of processes 140 includes a respective read-set 150 and
write-set 160 for storing information associated with shared data
used to carry out computations with respect to a transaction. For
example, process 140-1 includes read-set 150-1 and write-set 160-1
to carry out a respective one or more transactions associated with
process 140-1. Process 140-2 includes read-set 150-2 and write-set
160-2 to carry out a respective transaction associated with process
140-2. Process 140-M includes read-set 150-M and write-set 160-M to
carry out one or more transactions associated with process
140-M.
[0038] Transactions executed by respective processes 140 can be
defined by one or more instructions of software code. Accordingly,
each of processes 140 can execute a respective set of instructions
to carry out a respective transaction. In one embodiment, the
transactions executed by the processes 140 come from the same
overall program or application running on one or more computers.
Alternatively, the processes 140 execute transactions associated
with different programs.
[0039] In the context of a general embodiment herein such as
computer environment 100 in which multiple processes 140 (e.g.,
processing threads) execute transactions in parallel, each of
processes 140 accesses shared data 125 to generate computational
results (e.g., transaction results) that are eventually committed
for storage in a respective repository storing shared data 125.
Shared data 125 is considered to be globally accessible because
each of the multiple processes 140 can access the shared data
125.
[0040] Each of processes 140 can store data values locally that are
not accessible by the other processes 140. For example, process
140-1 can globally access a data value and store a respective copy
locally in write-set 160-1 that is not accessible by any of the
other processes. During execution of a respective transaction, the
process 140-1 is able to locally modify the data value in its
write-set 160. Accordingly, one purpose of write-set 160 is to
store globally accessed data that is modified locally.
[0041] As will be discussed later in this specification, the
results of executing the respective transaction can be globally
committed back to a respective repository storing shared data 125
depending on whether globally accessed data values happened to
change during the course of the transaction executed by process
140-1. In general, a respective read-set 150-1 associated with each
process stores information for determining which shared data 125
has been accessed during a respective transaction and whether any
respective data values associated with globally accessed shared
data 125 happens to change during execution of a respective
transaction.
[0042] In one embodiment, each of one or more processes 140
complies with a respective rule or set of rules indicating
transaction size limitations associated with the parallel
transactions. This enhances efficiency when multiple processes
execute different transactions using a same set of shared variables
(including the given shared variable) to produce respective
transaction outcomes. For example, each transaction can be limited
to a certain number of lines of code, a certain number of data
value modifications, a time limit, etc., so that potentially
competing transactions do not end up in a deadlock.
[0043] As will be further discussed, embodiments herein include: i)
maintaining a locally managed and accessible write set of data
values associated with each of multiple shared variables that are
locally modified during execution of the transaction, the local
write set representing data values not yet a) globally committed
and b) accessible by the other processes; ii) initiating locks on
each of the multiple shared variables specified in the write set
which were locally modified during execution of the transaction to
prevent the other processes from changing data values associated
with the multiple shared variables to be modified; iii) verifying
that respective data values associated with the multiple shared
variables accessed during the transaction have not been globally
modified by the other processes during execution of the transaction
by checking that respective version values associated with the
multiple shared variables have not changed during execution of the
transaction; and iv) after modifying data values associated with
the multiple shared variables, releasing the locks on each of the
multiple shared variables.
[0044] FIG. 2 is a diagram illustrating shared data 125 and
corresponding metadata 135 according to embodiments herein. As
shown, shared data 125 can be partitioned to include segment 210-1,
segment 210-2, . . . , segment 210-J. A respective segment of
shared data 125 can be a resource such as a single variable, a set
of variables, an object, a stripe, a portion of memory, etc.
Metadata 135 includes respective version information 220 and lock
information 230 associated with each corresponding segment 210 of
shared data 125. In one embodiment, version information 220 is a
multi-bit value that is incremented each time a respective process
140 modifies contents of a corresponding segment 210 of shared data
125. The lock information 230 and version information 220 can make
up a single 64-bit word.
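To make this layout concrete, the following Python sketch packs the lock information 230 and version information 220 into a single word, using the low bit as the lock flag and the remaining bits as the version counter. This encoding is illustrative only; the patent does not prescribe a specific bit layout, and the function names are assumptions.

```python
LOCK_BIT = 1        # bit 0: 1 = segment locked, 0 = free
VERSION_SHIFT = 1   # bits 1..63: version counter

def make_word(version, locked=False):
    """Pack a version number and a lock flag into one 64-bit word."""
    return (version << VERSION_SHIFT) | (LOCK_BIT if locked else 0)

def version_of(word):
    """Extract the version counter from the packed word."""
    return word >> VERSION_SHIFT

def is_locked(word):
    """Test the lock bit of the packed word."""
    return (word & LOCK_BIT) != 0

# Example: version 1326, unlocked; then version 1327, locked.
w = make_word(1326)
assert version_of(w) == 1326 and not is_locked(w)
w = make_word(1327, locked=True)
assert version_of(w) == 1327 and is_locked(w)
```

Because both fields share one word, a process can read the version and the lock state in a single load, which matters for the atomic validation discussed later.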
[0045] In one embodiment, each of processes 140 (e.g., software)
need not be responsible for updating the version information 220.
For example, a monitor function, separate from or integrated with
processes 140, automatically initiates a change to version
information 220 each time contents of a respective segment are
modified.
[0046] As an example, assume that process 140-2 (e.g., a software
processing entity) modifies contents of segment 210-1 during a
commit phase of a respective executed transaction. Prior to
committing transaction results globally to shared data 125, process
140-2 would read and store version information 220-1 associated
with segment 210-1 (i.e., the corresponding shared variable). After modifying contents of
segment 210-1 during the commit phase, the process 140-2 would
modify the version information 220-1 in metadata 135 to a new
value. More specifically, prior to modifying segment 210-1, the
version information 220-1 may have been a count value of 1326.
After modifying segment 210-1, the process 140-2 updates (e.g.,
increments) the version information 220-1 to be a count value of
1327. Each of the processes 140 performs a similar updating of
corresponding version information 220 each time a respective
process 140 modifies a respective segment 210 of shared data 125.
Accordingly, the processes can monitor the version information
220-1 to identify when changes have been made to a respective
segment 210 of shared data 125.
[0047] Note that metadata 135 also maintains lock information 230
associated with each respective segment 210 of shared data 125. In
one embodiment, the lock information 230 associated with each
segment 210 is a globally accessible single bit indicating whether
one of processes 140 currently has "locked" a corresponding segment
for purposes of modifying its contents. For example, a respective
process such as process 140-1 can set the lock information 230-J to
a logic one indicating that segment 210-J has been locked for use.
Other processes know that contents of segment 210-J should not be
accessed, used, modified, etc. during the lock phase initiated by
process 140-1. Upon completing a respective modification to
contents of segment 210-J, process 140-1 sets the lock information
230-J to a logic zero. All processes 140 can then compete again to
obtain a lock with respect to segment 210-J.
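The set-and-clear lock protocol just described can be sketched as follows. This is a hypothetical single-threaded model; a real implementation would use an atomic compare-and-swap so that two competing processes 140 cannot both observe the lock bit as clear.

```python
def try_lock(cell):
    """Attempt to set the lock bit of the packed word in cell[0].
    Returns False if another process already holds the lock."""
    word = cell[0]
    if word & 1:            # lock bit already set (logic one)
        return False
    cell[0] = word | 1      # acquire: set lock bit to logic one
    return True

def unlock(cell):
    """Clear the lock bit (logic zero) so other processes can
    compete again for the lock."""
    cell[0] &= ~1

# Example: a word holding version 1326, initially unlocked.
word = [1326 << 1]
assert try_lock(word)        # first process wins the lock
assert not try_lock(word)    # others see the segment as locked
unlock(word)
assert try_lock(word)        # the lock can be won again after release
```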
[0048] FIG. 3 is a diagram more particularly illustrating details
of respective read-sets 150 and write-sets 160 associated with
processes 140 according to embodiments herein. As shown, process
140-1 executes transaction 351 (e.g., a set of software
instructions). Read-set 150-1 stores retrieved version information
320-1, retrieved version information 320-2, . . . , retrieved
version information 320-K associated with corresponding data values
(or segments) accessed from shared data 125 during execution of
transaction 351. Accordingly, the process 140-1 can keep track of
version information associated with any globally accessed data.
[0049] Write-set 160-1 stores shared variable identifier
information 340 (e.g., address information, variable identifier
information, etc.) for each respective globally shared variable
that is locally modified during execution of the transaction 351.
Local modification involves maintaining and modifying locally used
values of shared variables in write-set 160-1 rather than actually
modifying the global variables during execution of transaction 351.
As discussed above and as will be further discussed, the process
140-1 attempts to globally commit information in write-set 160-1 to
shared data 125 upon completion of transaction 351. In the context
of the present example, process 140-1 maintains write-set 160-1 to
include i) shared variable identifier information 340-1 (e.g.,
segment or variable identifier information) of a respective
variable accessed from shared data 125 and corresponding locally
used value of shared variable 350-1, ii) shared variable identifier
information 340-2 (e.g., segment or variable identifier
information) of a variable or segment accessed from shared data 125
and corresponding locally used value of shared variable 350-2, and
so on. Accordingly, process 140-1 uses write-set 160-1 as a
scratch-pad to carry out execution of transaction 351 and keep
track of locally modified variables and corresponding identifier
information.
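The read-set 150 and write-set 160 scratch-pads can be modeled as simple maps, as in this hypothetical Python sketch (the class and attribute names are illustrative, not taken from the patent):

```python
class Transaction:
    """Per-process scratch-pad state for one transaction."""
    def __init__(self):
        self.read_set = {}    # variable id -> version seen at first read
        self.write_set = {}   # variable id -> locally modified value

    def record_read(self, var_id, version):
        # Remember the version observed when the shared variable was
        # first fetched; later reads keep the original entry.
        self.read_set.setdefault(var_id, version)

    def record_write(self, var_id, value):
        # Buffer the new value locally; shared data 125 is not
        # touched until the commit phase.
        self.write_set[var_id] = value

txn = Transaction()
txn.record_read("x", 1326)
txn.record_write("x", 42)
assert txn.read_set["x"] == 1326 and txn.write_set["x"] == 42
```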
[0050] FIG. 4 is a flowchart illustrating a more specific use of
read-sets 150, write-sets 160, version information 220, and lock
information 230 according to embodiments herein. In general,
flowchart 400 indicates how each of multiple processes 140 utilizes
read-sets 150 and write-sets 160 while carrying out a
respective transaction.
[0051] Step 405 indicates a start of a respective transaction. As
previously discussed, a transaction can include a set of software
instructions indicating how to carry out one or more computations
using shared data 125.
[0052] In step 410, a respective process 140 executes an
instruction associated with the transaction identifying a specific
variable in shared data 125.
[0053] In step 415, the respective process checks whether the
variable exists in its respective write-set 160. If the variable
already exists in its respective write-set 160, then processing
continues at step 440 in which the respective process 140 fetches a
locally maintained value from its write-set 160.
[0054] If a locally stored data value associated with the variable
does not already exist in its respective write-set 160 (e.g.,
because the variable has not yet been fetched and/or modified locally)
as identified in step 415, then processing continues at step 420 in
which the respective process 140 attempts to globally fetch a data
value associated with the variable based on a respective access to
shared data 125. For example, as further indicated in step 425, the
process 140 checks whether the variable to be globally fetched is
locked by another process. As previously discussed, another process
may lock variables, segments, etc. of shared data 125 to prevent
others from accessing the variables. Globally accessible lock
information 230 (e.g., a single bit of information) in metadata 135
indicates which variables have been locked for use.
[0055] If an active lock is identified in step 425, the respective
process initiates step 430 to abort and retry a respective
transaction or initiate execution of a so-called back-off function
to access the variable. In the latter instance, the back-off
function can specify a random or fixed amount of time for the
process to wait before attempting to read the variable again with
hopes that a lock will be released. The respective lock on the
variable may be released by the time of a second or subsequent
attempt to read the variable.
[0056] If no lock is present on the variable during execution of
step 425, the respective process initiates step 435 to globally
fetch a data value associated with the specified variable from
shared data 125. In addition to globally accessing the data value
associated with the shared variable, the respective process
retrieves version information 220 associated with the globally
fetched variable. The process stores retrieved version information
associated with the variable in its respective read-set 150 for
later use during a commit phase.
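Steps 415 through 440 can be sketched as a single read routine. This is a hypothetical Python illustration: the `meta` mapping (variable id to a (version, locked) pair) and all function names are assumptions, not structures defined by the patent.

```python
class AbortTransaction(Exception):
    """Raised to abort and retry when a lock conflict is detected."""

def txn_read(var_id, write_set, read_set, shared_data, meta):
    """Transactional read: prefer the local write-set copy (steps
    415/440); otherwise fetch globally, aborting if the variable is
    locked (steps 425/430) and recording its version (step 435)."""
    if var_id in write_set:                  # local copy wins
        return write_set[var_id]
    version, locked = meta[var_id]
    if locked:                               # another process holds it
        raise AbortTransaction(var_id)
    read_set.setdefault(var_id, version)     # remember version for commit
    return shared_data[var_id]

# Example: first read fetches globally; after a local write, the
# write-set copy is returned instead.
shared = {"x": 10}
meta = {"x": (1326, False)}
ws, rs = {}, {}
assert txn_read("x", ws, rs, shared, meta) == 10
assert rs["x"] == 1326
ws["x"] = 11
assert txn_read("x", ws, rs, shared, meta) == 11
```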
[0057] In step 445, the respective process utilizes the fetched
data value associated with the variable to carry out one or more
computations associated with the transaction. Based on the paths
discussed above, the data value associated with the variable can be
obtained from either write-set 160 or shared data 125.
[0058] In step 450, the process performs a check to identify
whether use of the fetched variable (in the transaction) involves
modifying a value associated with the fetched variable. If so, in
step 455, the process modifies the locally used value of shared
variable 350 in write-set 160. The respective process skips
executing step 455 if use of the variable (as specified by the
executed transaction) does not involve modification of the
variable.
[0059] In step 460, the respective process identifies whether a
respective transaction has completed. If not, the process continues
at step 410 to perform a similar loop for each of additional
variables used during a course of executing the transaction. If the
transaction has completed in step 460, the respective process
continues at step 500 (e.g., the flowchart 500 in FIG. 5) in which
the process attempts to globally commit values in its write-set 160
to globally accessible shared data 125.
[0060] Accordingly, in response to identifying that a corresponding
data value associated with one or more shared variables was modified
during execution of the transaction, a respective process can abort
a respective transaction in lieu of modifying a data value
associated with shared data 125 and initiate execution of the
transaction again at a later time to attempt to produce a
respective transaction outcome.
[0061] FIG. 5 is a flowchart 500 illustrating a technique for
committing results of a transaction to shared data 125 according to
embodiments herein. Up until this point, the process executing the
respective transaction has not initiated any locks on any shared
data yet although the process does initiate execution of
computations associated with accessed shared data 125. Waiting to
obtain locks at the following "commit phase" enables other
processes 140 to perform other transactions in parallel because a
respective process initiating storage of results during the commit
phase holds the locks for a relatively short amount of time.
[0062] In step 505, the respective process that executed the
transaction attempts to obtain locks associated with each variable
in its write-set 160. For example, the process checks whether lock
information in metadata 135 indicates whether the variables to be
written to (e.g., specific portions of globally accessible shared
data 125) are locked by another process. The process initiates
locking the variables (or segments as the case may be) to block
other processes from using or locking the variables. In one
embodiment, a respective process attempts to obtain locks according
to a specific ordering such as an order of initiating local
modifications to retrieved shared variables during execution of a
respective transaction, addresses associated with the globally
shared variables, etc.
[0063] If all locks cannot be immediately obtained in step 510,
then the process can abort and retry a transaction or initiate a
back-off function to acquire locks associated with the variables
that are locally modified during execution of the transaction.
[0064] After all appropriate locks have been obtained by writing
respective lock information 230, processing continues at step 520
in which the process obtains the stored version information
associated with variables read from shared data 125. As previously
discussed, the version information 220 of metadata 135 indicates a
current version of the respective variables at a time when they
were read during execution of the transaction.
[0065] In step 525, the respective process compares the retrieved
version information in the read-set 150 saved at a time of
accessing the shared variables to the current globally available
version information 220 from metadata 135 for each variable in the
read-set 150.
[0066] In step 530, if the version information is different in step
525, then the process acknowledges that another process modified
the variables used to carry out the present transaction.
Accordingly, the process releases any obtained locks and retries
the transaction again. This prevents the respective process from
causing data corruption.
[0067] In step 535, if the version information is the same in step
525, then the process acknowledges that no other process modified
the variables used to carry out the present transaction.
Accordingly, the process can initiate modification of shared data
to reflect the data values in the write-set 160. This prevents the
respective process from causing data corruption during the commit
phase.
[0068] Finally, in step 540, after updating the shared data 125
with the data values in the write-set 160, the process updates
version information 220 associated with modified variables or
segments and releases the locks. The locks can be released in any
order or in a reverse order relative to the order of obtaining the
locks.
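The commit sequence of steps 505 through 540 can be sketched as follows. This is an illustrative, single-threaded Python model: the dictionaries and function names are assumptions, and a real implementation would acquire each lock with an atomic operation rather than a plain read and write.

```python
def txn_commit(write_set, read_set, shared_data, meta):
    """Commit phase sketch: lock write-set entries in a fixed order
    (step 505), validate read-set versions (steps 520-530), write
    back and bump versions, then release locks (steps 535-540).
    Returns True on commit, False when the transaction must retry."""
    acquired = []

    def release_all():
        for v in acquired:
            ver, _ = meta[v]
            meta[v] = (ver, False)

    for var_id in sorted(write_set):          # ordered locking avoids deadlock
        version, locked = meta[var_id]
        if locked:                             # step 510: abort / back off
            release_all()
            return False
        meta[var_id] = (version, True)
        acquired.append(var_id)

    for var_id, seen_version in read_set.items():   # steps 520-530
        if meta[var_id][0] != seen_version:          # another process changed it
            release_all()
            return False

    for var_id, value in write_set.items():   # steps 535-540
        shared_data[var_id] = value
        version, _ = meta[var_id]
        meta[var_id] = (version + 1, False)    # bump version, release lock
    return True

# Example: a commit succeeds while versions match, then a second
# attempt with a stale read-set version fails and changes nothing.
shared = {"x": 10}
meta = {"x": (1326, False)}
assert txn_commit({"x": 42}, {"x": 1326}, shared, meta)
assert shared["x"] == 42 and meta["x"] == (1327, False)
assert not txn_commit({"x": 7}, {"x": 1326}, shared, meta)
assert shared["x"] == 42
```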
[0069] Note that during the commit phase as discussed above in
flowchart 500, if a lock associated with a location in the
process's write-set 160 also appears in the read-set 150, then the
process must atomically: a) acquire a respective lock and b)
validate that current version information associated with the
variable (or variables) is the same as the retrieved version
information stored in the read-set 150. In one embodiment, a CAS
(Compare and Swap) operation can be used to accomplish both a) and
b).
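Because the lock bit and version counter share one word, a single compare-and-swap can perform both a) and b) at once: the CAS succeeds only if the word still holds the expected version with the lock bit clear. A minimal sketch follows; the `cas` helper emulates the hardware operation single-threadedly, and the function names are illustrative.

```python
def cas(cell, expected, new):
    """Single-threaded emulation of an atomic compare-and-swap on
    cell[0]; real code would use a hardware CAS instruction."""
    if cell[0] == expected:
        cell[0] = new
        return True
    return False

def acquire_and_validate(cell, expected_version):
    """One CAS both a) acquires the lock and b) validates the version:
    it succeeds only if the word is still (expected_version, unlocked)."""
    unlocked_word = expected_version << 1   # low bit clear = unlocked
    return cas(cell, unlocked_word, unlocked_word | 1)

# Example: the CAS succeeds once, then fails because the lock bit
# is now set (and would also fail if the version had advanced).
word = [1326 << 1]
assert acquire_and_validate(word, 1326)
assert not acquire_and_validate(word, 1326)
```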
[0070] Also, note that each of the respective processes 140 can be
programmed to occasionally, periodically, sporadically,
intermittently, etc. check (prior to the committal phase in
flowchart 500) whether current version information 220 in metadata
135 matches retrieved version information in its respective
read-set 150 for all variables read from shared data 125.
Additionally, each of the respective processes 140 can be
programmed to also check (in a similar way) whether a data value
and/or corresponding segment has been locked by another process
prior to completion. If a change is detected in the version
information 220 (e.g., there is a difference between retrieved
version information 320 in read-set 150 and current version
information 220) and/or a lock is implemented on a data value or
segment used by a given process, the given process can abort and
retry the current transaction, prior to executing the transaction
to the commit phase. Early abortion of transactions doomed to fail
(because another process has locked and modified relied-upon data)
can increase overall efficiency associated with parallel processing.
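The early validation check described above can be sketched as follows (a hypothetical illustration; `meta` maps each variable to its current (version, locked) pair, and the names are assumptions):

```python
def still_valid(read_set, meta):
    """Periodic mid-transaction check: the transaction is doomed if
    any variable it has read changed version or is now locked by
    another process, so it can abort early rather than run to the
    commit phase."""
    for var_id, seen_version in read_set.items():
        current_version, locked = meta[var_id]
        if locked or current_version != seen_version:
            return False
    return True

# Example: the read-set is valid until another process commits a
# change that advances the version.
meta = {"x": (1326, False)}
rs = {"x": 1326}
assert still_valid(rs, meta)
meta["x"] = (1327, False)   # another process committed a change
assert not still_valid(rs, meta)
```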
[0071] Use of version information and lock information according to
embodiments herein can prevent corruption of data. For example,
suppose that, as an alternative to the above technique of using
version information to verify that relied-upon information
(associated with a respective transaction) has not changed by the
end of a transaction, a process reads data values (as identified in
a respective read-set) from shared data 125 again at commit time to
ensure that the data values are the same as when they were first
fetched by the respective process. Unfortunately, this technique
can be misleading and cause errors because of the occurrence of
race conditions. For example, a first process may read and verify
that a globally accessible data value in shared data 125 has not
changed while soon after (or at nearly the same time) another
respective process modifies the globally accessible data value.
This would result in corruption if the first process committed its
results to shared data 125. The techniques herein are advantageous
because use of version and lock information in the same word
prevents corruption as a result of two different processes
accessing the word at the same or nearly the same time.
[0072] FIG. 6 is a block diagram illustrating an example computer
system 610 (e.g., an architecture associated with computer
environment 100) for executing parallel processes 140 and other
related processes according to embodiments herein. Computer system
610 can be a computerized device such as a personal computer,
workstation, portable computing device, console, network terminal,
processing device, etc.
[0073] As shown, computer system 610 of the present example
includes an interconnect 111 that couples a memory system 112
storing shared data 125 and metadata 135, one or more processors
113 executing processes 140, an I/O interface 114, and a
communications interface 115. Peripheral devices 116 (e.g., one or
more optional user controlled devices such as a keyboard, mouse,
display screens, etc.) can couple to processor 113 through I/O
interface 114. I/O interface 114 also enables computer system 610
to access repository 180 (that also potentially stores shared data
125 and/or metadata 135). Communications interface 115 enables
computer system 610 to communicate over network 191 to transmit and
receive information from different remote resources.
[0074] Note that functionality associated with processes 140 can be
embodied as software code such as data and/or logic instructions
(e.g., code stored in the memory or on another computer readable
medium such as a disk) that support functionality according to
different embodiments described herein. Alternatively, the
functionality associated with processes 140 can be implemented via
hardware or a combination of hardware and software code.
[0075] It should be noted that, in addition to the processes 140
themselves, embodiments herein include a respective application
and/or set of instructions to carry out processes 140. Such a set
of instructions associated with processes 140 can be stored on a
computer readable medium such as a floppy disk, hard disk, optical
medium, etc. The set of instructions can also be stored in a memory
type system such as in firmware, RAM (Random Access Memory), read
only memory (ROM), etc. or, as in this example, as executable
code.
[0076] Attributes associated with processes 140 will now be
discussed with respect to the flowcharts in FIGS. 7-9. For purposes of
this discussion, each of the multiple processes 140 in computer
environment 100 can execute or carry out the steps described in the
respective flowcharts. Note that the steps in the below flowcharts
need not always be executed in the order shown.
[0077] Now, more particularly, FIG. 7 is a flowchart 700
illustrating a technique supporting execution of parallel
transactions in computer environment 100 according to an embodiment
herein. Note that techniques discussed in flowchart 700 overlap and
summarize some of the techniques discussed above.
[0078] In step 710, a respective one of multiple processes 140
executes a transaction defined by a corresponding set of
instructions to produce a respective transaction outcome based on
use of at least one shared variable from shared data 125.
[0079] In step 720, after producing the respective transaction
outcome (e.g., locally storing computational results in its
respective write-set 160), the respective process 140 initiates a
lock on a given shared variable of shared data 125 to prevent other
processes from modifying a data value associated with the given
shared variable.
[0080] In step 730, the respective process 140 initiates a
modification of the data value associated with the given shared
variable based on the respective transaction outcome even though at
least one of the other processes 140 in computer environment 100
also performed a computation using the data value associated with
the given shared variable before the lock and during execution of
the transaction by the respective one of multiple processes
140.
[0081] FIG. 8 is a flowchart 800 illustrating processing steps
associated with processes 140 according to an embodiment herein.
Note that techniques discussed in flowchart 800 overlap with the
techniques discussed above in the previous figures.
[0082] In step 810, each of multiple processes 140 maintains
version information in a respective locally managed read set 150
associated with an executed transaction. In one embodiment, the
read set 150 is generally not accessible by the other processes 140
using the shared variables from shared data 125. Accordingly, the
read set 150 and write-set 160 serve as a local scratch-pad
function. As previously discussed, the read set 150 can store and
identify version information (e.g., includes retrieved version
information) associated with each of multiple shared variables used
to generate a respective transaction outcome associated with a
given process. The version information stored in the read-set 150
indicates respective versions of the multiple shared variables in
shared data 125 at a time when the transaction retrieves respective
data values associated with the multiple shared variables (e.g.,
shared data 125) from a corresponding globally accessible
repository.
[0083] In step 815, after producing a respective transaction
outcome associated with an executed transaction, each of multiple
processes 140 potentially competes to initiate a respective lock on
a given one or more shared variables (e.g., portions of shared data
125) locally modified (as indicated in write-set 160) during the
transaction to prevent other processes from modifying a data value
associated with the given one or more shared variables.
[0084] In step 820, after acquiring respective locks associated
with the given one or more shared variables and before globally
modifying respective data values associated with the given one or
more shared variables, a respective process attempting to globally
commit its results verifies that newly read (e.g., present or
current) version information associated with each of the given one
or more shared variables used to generate the respective
transaction outcome matches the version information in the locally
managed read set associated with the transaction. The newly read
version information can be used to identify whether the data values
associated with the multiple shared variables have been changed by
the other processes during execution of the transaction. There was
no change if the newly retrieved version information matches the
version information in the read-set 150.
[0085] In step 825, after verifying that "before-and-after" version
information matches and obtaining locks, a respective one of the
multiple processes 140 initiates a modification of data values
associated with the given one or more shared variables based on the
respective transaction outcome. The respective process globally
modifies the data values associated with the transaction outcome
even though one or more of the other processes 140 performed a
computation using the data value associated with the given shared
variable before the respective process obtains the lock.
[0086] In step 830, after the modification of the data values in
the shared data 125 associated with the given one or more shared
variables in write-set 160, the respective process modifies
globally accessible version information 220 associated with the
modified segments of shared data 125 (e.g., one or more shared
variable) to indicate to other processes that contents of a
respective segment have been modified.
[0087] FIG. 9 is a flowchart 900 illustrating another technique
associated with use of lock and version information according to
embodiments herein. Note that techniques discussed in flowchart 900
overlap and summarize some of the techniques discussed above.
[0088] In step 910, computer environment 100 maintains segments 210
of information (e.g., shared data 125) that are shared by multiple
processes 140 executing in parallel.
[0089] In step 915, for each of multiple segments 210, the computer
environment 100 maintains a corresponding location (e.g., a portion
of storage) to store a respective version value representing a
relative version of contents in a respective segment 210. As
previously discussed, the relative version associated with a
segment is updated each time contents of the respective segment are
modified by a respective process. For example, after
committing results to shared data 125, a respective process can
increment the version value by one over the previous version value
to notify other processes 140 that the shared data 125 has
changed.
[0090] In step 920, computer environment 100 enables the multiple
processes to compete and secure an exclusive access lock with
respect to each of the multiple segments 210 to prevent other
processes 140 from modifying a respective locked segment.
[0091] In step 925, for each of the multiple segments 210, computer
environment 100 maintains a corresponding location to store
globally accessible lock information (e.g., lock information 230)
indicating whether one of the multiple processes 140 executing in
parallel has locked a respective segment 210 for: i) changing a
respective data value in the respective segment 210, and ii)
preventing other processes from reading respective data values from
the respective segment 210.
[0092] In step 930, computer environment 100 enables the multiple
processes 140 to retrieve version information 220 associated with
the respective multiple segments 210 to identify whether contents
of a respective segment have changed over time.
[0093] In sub-step 935 of step 930, one embodiment of computer
environment 100 enables a respective one of the processes 140 to
modify a respective version value (representing a relative version
of a given segment 210) to a new unique data value, indicating that
a data value associated with the given segment has been modified.
[0094] As discussed above, techniques herein are well suited for
use in applications such as those that support parallel processing
of threads in the same or different processors. However, it should
be noted that configurations herein are not limited to such use and
thus configurations herein and deviations thereof are well suited
for use in other environments as well.
Further Embodiments Associated with Transactional Locking
[0095] A leading approach for simplifying concurrent programming is
a class of non-blocking software (and hardware) mechanisms called
transactional memories. Transactional memories can be static or
dynamic, indicating whether the locations transacted on are known
in advance (like an n-location CAS) or decided dynamically within
the scope of the transaction's execution, the latter type being
more general and expressive. Unfortunately, current implementations
of dynamic non-blocking software transactional memories (STMs) have
unsatisfactory performance.
[0096] This disclosure presents a new software based dynamic
transactional memory mechanism which we call Transactional Locking
(TL). TL is essentially a way of using static (and therefore
simple) non-blocking transactions in software or hardware to
transform sequential code into deadlock-free dynamic transactions
based on fine grained locks.
[0097] Initial performance benchmarks of an "all-software" TL
mechanism are surprisingly good. TL implementations of concurrent
data structures significantly outperform the most effective
STM-based implementations, and, more importantly, are within a
competitive margin of the most efficient hand-crafted
implementations. These surprising performance results bring us to
question two assumptions that have recently taken hold in the
transactional memory development community: that software
transactions should be non-blocking, and that to be useful,
hardware transactions need to be dynamic.
1.0 Introduction
[0098] A goal of current multiprocessor software design is to
introduce parallelism into software applications by allowing
operations that do not conflict in accessing memory to proceed
concurrently. As discussed above, a key tool in designing
concurrent data structures has been the use of locks.
Unfortunately, coarse-grained locking, though easy to program with,
provides very poor performance because of limited parallelism,
while designing fine-grained lock-based concurrent data structures
has long been recognized as a difficult task better left to
experts. If concurrent programming and data structure design is to
become ubiquitous, researchers agree that one must develop
alternative approaches that simplify code design and verification.
This disclosure addresses "mechanical" methods for transforming
sequential code or coarse-grained lock-based code to concurrent
code. In one embodiment, by mechanical we mean that the
transformation, whether done by hand, by a preprocessor, or by a
compiler, does not require any program specific information (such
as the programmer's understanding of the data flow
relationships).
1.1 Transactional Programming
[0099] The transactional memory programming paradigm is gaining
momentum as the approach of choice for replacing locks in
concurrent programming. Combining sequences of concurrent
operations into atomic transactions seems to promise a great
reduction in the complexity of both programming and verification,
by making parts of the code appear to be sequential without the
need to use fine-grained locking. Transactions will hopefully
remove from the programmer the burden of figuring out the
interaction among concurrent operations that happen to conflict
when accessing the same locations in memory. Transactions that do
not conflict in accessing memory will run uninterrupted in
parallel, and those that do will be aborted and retried, without the
programmer having to worry about issues such as deadlock. There are
currently proposals for hardware implementations of transactional
memory (HTM), purely software based ones (i.e., Software
Transactional Memories (STMs)), and hybrid schemes that combine
hardware and software.
[0100] A preferred unifying theme of parallel processing is that
the transactions provided to the programmer, in either hardware or
software, will be non-blocking, unbounded, and dynamic.
Non-blocking means that transactions do not use locks, and are thus
obstruction-free, lock-free, or wait-free. Unbounded means that
there is no limit on the number of locations accessed by the
transaction. Dynamic means that the set of locations accessed by
the transaction is not known in advance and is determined during
its execution. Providing all three properties in hardware seems to
introduce large degrees of complexity into the design. Providing
them in software seems to limit performance: hand-crafted
lock-based code, though hard to program and prove correct, greatly
outperforms the most effective current software STMs, even when
they are programmed using an understanding of the data access
relationships. When the STM programmer does not make use of such
information, performance of STMs is in general an order of
magnitude slower than the hand-crafted counterparts.
1.2 Transactional Locking
[0101] This disclosure, according to one embodiment, suggests that
it is perhaps time to re-examine these basic development
requirements. We contend that on modern operating systems, deadlock
avoidance is the only compelling reason for making transactions
non-blocking, and that there is no reason to provide it for
transactions at the user level. Conventional mechanisms already
exist whereby threads might yield their quanta to other threads. In
particular, one conventional method such as so-called "schedctl"
(e.g., a feature in the Solaris.TM. operating system) allows
threads to transiently defer preemption while holding locks. In a
sense, rather than trying to improve on hand-crafted lock-based
implementations by being non-blocking, we propose to get as close
to their behavior as one can with a mechanical approach, that is,
one that does not require the programmer to understand their data
access relationships.
[0102] With this in mind, the disclosure introduces a new approach
called Transactional Locking (TL), a blocking approach to designing
software based transactional memory mechanisms. TL according to
embodiments herein transforms sequential code into unbounded
concurrent dynamic transactions that synchronize using
deadlock-free fine grained locking. The scheme itself is highly
efficient because it does not try to provide a non-blocking
progress guarantee for the transaction as a whole. Instead, static
(and therefore simple) non-blocking transactions are used only to
provide deadlock freedom when acquiring the set of locks needed to
safely complete a transaction. These simple static transactions can
be implemented in a trivial manner using today's hardware
synchronization operations such as compare-and-swap (CAS), or using
hardware transactions when these become available. We note that
implementing static transactions in hardware may prove
significantly simpler than implementing the more general dynamic
ones proposed in current HTM schemes.
1.3 A TL Approach in a Nutshell
[0103] One TL mechanism is based on coordination via a special
versioned-read-write-lock. Each shared variable is associated with
and protected by one lock. The mapping between variables and locks
can be one-to-one or many-to-few. For instance there may be one
lock per variable, where the lock is allocated adjacent to the
variable; one lock per object; or a separate array of locks indexed
by a hash of the variable address. Other mappings are possible as
well. A versioned-read-write lock has a version field in the lock
word, and the lock's version number is incremented on every successful
write attempt. In an example embodiment the versioned-read-write
lock would consist of a word where the low-order bit served as a
lock-bit and the remaining bits served as a version subfield. On a
high level a dynamic transaction is executed as follows: [0104] 1.
Run the transactional code, reading the locks of all fetched-from
shared locations and building a local read-set and write-set (use a
safe load operation to avoid running off null pointers as a result
of reading an inconsistent view of memory). [0105] A transactional
load first checks to see if the load address appears in the
write-set. If so, the transactional load returns the last value
written to the address. This provides the illusion of processor
consistency and avoids so-called read-after-write hazards. If the
address is not found in the write-set the load operation then
fetches the lock value associated with the variable, saving the
version in the read-set, and then fetches from the actual shared
variable. If the transactional load operation finds the variable
locked the load may either spin until the lock is released or abort
the operation. [0106] Transactional stores to shared locations are
handled by saving the address and value into the thread's local
write-set. The shared variables are not modified during this step.
That is, transactional stores are deferred and contingent upon
successfully completing the transaction. [0107] 2. Attempt to
commit the transaction. Acquire the locks of locations to be
written. If a lock in the write-set (or more precisely a lock
associated with a location in the write-set) also appears in the
read-set then the acquire operation must atomically (a) acquire the
lock and, (b) validate that the current lock version subfield
agrees with the version found in the earliest read-entry associated
with that same lock. An atomic CAS can accomplish both (a) and (b).
In its simplest form, acquire locks in ascending lock address
order, avoiding deadlocks. [0108] Alternately, the implementation
might acquire the locks in some other order, using bounded spinning
to avoid indefinite deadlock. [0109] 3. Re-read the locks of all
read-only locations to make sure version numbers haven't changed.
If version does not match, roll-back (release) the locks, abort the
transaction, and retry. [0110] 4. The prior observed reads in step
(1) have been validated as mutually consistent. The transaction is
now committed. Write-back all the entries from the local write-set
to the appropriate shared variables. [0111] 5. Release all the
locks identified in the write-set by atomically incrementing the
version and clearing the write-lock bit. Critically, the
write-locks are held for only a brief time.
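The lock-word manipulation in steps 2 and 5 above can be sketched as follows. This is an illustrative fragment, assuming the one-word layout described earlier (low-order lock bit, upper bits version); the names (`vlock_try_acquire`, etc.) are hypothetical, not identifiers from the disclosure.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative versioned-write-lock word: low-order bit is the lock
 * bit, the remaining bits hold the version (names are hypothetical). */
typedef _Atomic uint64_t vlock_t;

#define VLOCK_BIT 1ULL

/* Step 2: attempt to acquire the lock with a single CAS; *saved
 * receives the unlocked word so the release can advance the version. */
static int vlock_try_acquire(vlock_t *lw, uint64_t *saved)
{
    uint64_t v = atomic_load(lw);
    if (v & VLOCK_BIT)
        return 0;                       /* held by another thread */
    *saved = v;
    return atomic_compare_exchange_strong(lw, &v, v | VLOCK_BIT);
}

/* Step 5: release with a plain store that simultaneously clears the
 * lock bit and increments the version subfield (version + 1 is +2 on
 * the word because the version occupies the bits above the lock bit). */
static void vlock_release_and_increment(vlock_t *lw, uint64_t saved)
{
    atomic_store(lw, saved + 2);
}
```

A reader validating its read-set simply re-fetches the word and checks that the lock bit is clear and the version subfield is unchanged.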
[0112] At a high level TL according to embodiments herein converts
coarse-grained lock operations into transactions, where the
transactional infrastructure is implemented with fine-grained
locks.
[0113] There are various other optimizations and contention
reduction mechanisms that one should add to this basic scheme to
improve performance, but, as can be seen, at its core it is
painfully simple. The acquisition of the locks in step 2 is
essentially a static obstruction-free transaction, one in which the
set of accessed locations is known in advance. It can alternately be
sped up using a hardware transaction such as an n-location
compare-and-swap (CAS) operation. As noted earlier, this type of
operation is simpler than a dynamic hardware transaction.
1.4 TL vs. STM and Hand-Crafted Locking
[0114] One aspect associated with TL is the observation that the
blocking part of a transaction can be limited to the acquisition of
a set of lock records. This observation has significant performance
implications because it allows one to eliminate all the overheads
associated with the mechanisms providing the non-blocking progress
guarantee for the transaction as a whole. As we show, this is a
major source of overhead of current STM systems.
[0115] When compared to hand-crafted lock-based structures, one can
think of TL as using a non-blocking transaction to overcome the
need to understand the data-access relationships, while keeping the
basic fine-grained locking structure of a lock per object or
field.
[0116] A few more detailed differences are as follows.
[0117] Like OSTM (Object-based STM) or HyTM (Hybrid TM), TL
associates a special coordination word with each transacted memory
location. However, while STM systems like OSTM and HyTM use this
word as a pointer to a transaction record, TL uses it as a lock, as
in the hand-crafted fine-locked structure. One immediate
implication is a saving of a level of indirection over STMs.
[0118] Unlike STMs, TL's rollback mechanism is simple and local.
There are no transaction records, and the collected read-set and
write-set is never shared with other threads.
[0119] OSTM derives a large part of its efficiency from the
programmer's help in deciding when to "open" a transacted object
for reading or writing. Without this help, it has been shown that
OSTM's performance is rather poor. The TL transformation requires
no programmer understanding of the data structure in order to make
the transformation efficient. We believe it should not be
difficult, given a simple set of constraints on program structure,
to turn it into a straightforward mechanical transformation.
[0120] There is an inherent overhead of the general mechanical (and
hence "dumb") transformation when compared to hand-crafted code.
For example, in Fraser's elegant fine-locked skiplist
implementation he makes use of his understanding of the structure's
semantics and the mechanics of his GC to allow list traversal to
ignore locks on nodes since the traversal still works even if a
node is concurrently removed. It is hard to imagine that a
mechanical approach could be made to ignore the fact that a node is
locked and might be removed from the list.
2. The TL Algorithm
[0121] According to one aspect of this disclosure, we associate a
special versioned-write-lock with every transacted memory location.
In the example embodiment a versioned write-lock is a simple
spinlock that uses a compare-and-swap (CAS) operation to acquire
the lock and a store to release it. Since one only needs a single
bit to indicate that the lock is taken, the rest of the lock word
holds a version number. This number is incremented by every
successful lock-release.
[0122] We allocate a collection of versioned-write-locks. We use
various schemes for associating locks with shared variables: per
object (PO), where a lock is assigned per shared object, per stripe
(PS), where we allocate a separate large array of locks and memory
is striped (divided up) using some hash function to map each
location to a separate stripe, and per word (PW) where each
transactionally referenced variable (word) is collocated adjacent
to a lock. Other mappings between transactional shared variables
and locks are possible. The PW and PO schemes require either manual
or compiler assisted automatic insertion of lock fields whereas PS
can be used with unmodified data structures. PO might be
implemented, for instance, by leveraging the header words of
Java.TM. objects. A single PS stripe-lock array may be shared and
used for different TL data structures within a single
address-space. For instance an application with two distinct TL
red-black trees and three TL hash-tables could use a single PS
array for all TL locks.
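The PS address-to-stripe mapping can be sketched as below. The table size, stripe granularity, and hash are illustrative assumptions, since the disclosure requires only "some hash function" from location to stripe.

```c
#include <stdint.h>

/* Illustrative per-stripe (PS) lock table: the stripe size (16 bytes)
 * and table size (2^16 entries) are assumptions, not values from the
 * disclosure. */
enum { STRIPE_SHIFT = 4, N_STRIPES = 1u << 16 };

static uint64_t stripe_locks[N_STRIPES];   /* versioned-write-locks */

/* Hash a variable's address to the lock that "covers" it. A single
 * such array can serve every TL data structure in the address space. */
static uint64_t *stripe_lock_for(const void *addr)
{
    return &stripe_locks[((uintptr_t)addr >> STRIPE_SHIFT)
                         & (N_STRIPES - 1)];
}
```

Because the mapping is many-to-few, unrelated variables may hash to the same stripe lock; that is safe (it only reduces potential parallelism).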
[0123] The following is a description of the PS algorithm although
most of the details carry through verbatim for PO and PW as well.
We maintain thread local read- and write-sets as linked lists. The
read-set entries contain the address of the lock and the observed
version number of the lock associated with the transactionally
loaded variable. The write-set entries contain the address of the
variable, the value to be written to the variable, and the address
of the lock that "covers" the variable. The write-set is kept in
chronological order to avoid write-after-write hazards.
[0124] We now describe how TL executes a sequential code fragment
that was placed within a TL transaction. We later describe the
limitations placed on the programmer in terms of structure of this
code so as to allow it to be mechanically transformed into a TL
transaction. The transaction proceeds through the code as
follows:
1. For every location read, read its lock value, and
[0125] (a) if it is not locked, add the lock's version number to
the read-set. We use a safe load operation to avoid running off
null pointers as a result of reading an inconsistent view of
memory. Safe loads may be implemented with SPARC.TM. non-faulting
loads or by an explicit user-level trap handler that skips over
potentially trapping safe load instructions.
[0126] (b) if it is locked by another thread then we spin briefly.
If the spin fails abort the transaction and retry.
2. For every location to be written, record the location and the
value to be written.
[0127] Upon completion of the pass through the code, reread the
version numbers of all locations in the read-set.
[0128] 1. Attempt to acquire all locks in the write-set in
ascending lock address order. Upon failing to acquire a lock, apply
some type of backoff policy or abort and retry the transaction. A
backoff policy could, for example, be to spin for a certain amount
of time before re-attempting to acquire the lock.
2. Once all locks are acquired, re-read the locks of all read-set
locations to make sure version numbers have not changed.
[0129] (a) If a location has changed, release locks, abort and
retry the transaction.
[0130] (b) If not, perform stores in write set and release locks in
any order. The transaction is complete.
[0131] The transaction's re-reading of all the locks of locations
in the read set before attempting to acquire the locks is only a
performance optimization. It is not required for correctness.
Empirically we have found that many transactions fail due to
modifications before locks are acquired. Pre-validating the lock
versions in the write-set avoids acquiring the locks for a
transaction that is fated to abort. We note that spinning as a
backoff policy does not introduce deadlocks because locks are
acquired in ascending order. The above algorithm, which we call
sorted TL, acquires locks in order. We have also experimented with
algorithms that acquire locks as they are encountered
(encounter-order TL) and use randomized backoff to avoid deadlock.
The advantage of the latter is that the transacting thread does not
need to search the write-set for values of locations it updated,
since locations are updated "in place."
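The commit sequence of the sorted-TL variant above can be sketched as a single-threaded illustration. Plain stores stand in for the CAS acquisition since there is no concurrency here, and the read-set/write-set overlap handled at acquire time is omitted; all names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative read-set/write-set entries (names are hypothetical). */
typedef struct { uint64_t *lock; uint64_t version; } read_entry;
typedef struct { uint64_t *addr; uint64_t value; uint64_t *lock; } write_entry;

/* Sort write-set entries by lock address, for deadlock-free acquisition. */
static int by_lock_addr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)((const write_entry *)a)->lock;
    uintptr_t y = (uintptr_t)((const write_entry *)b)->lock;
    return (x > y) - (x < y);
}

/* Re-read each read-set lock: a changed version or a set lock bit
 * means another transaction intervened. */
static int read_set_valid(const read_entry *rs, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (*rs[i].lock != rs[i].version || (*rs[i].lock & 1))
            return 0;
    return 1;
}

/* Commit: acquire locks in ascending order, validate, write back,
 * then release each lock while advancing its version (word + 1 turns
 * (version<<1)|1 into ((version+1)<<1)|0). */
static int tl_commit(read_entry *rs, size_t nr, write_entry *ws, size_t nw)
{
    qsort(ws, nw, sizeof *ws, by_lock_addr);
    for (size_t i = 0; i < nw; i++)
        *ws[i].lock |= 1;                      /* acquire (CAS in real TL) */
    if (!read_set_valid(rs, nr)) {
        for (size_t i = 0; i < nw; i++)
            *ws[i].lock &= ~1ULL;              /* roll back the locks */
        return 0;                              /* abort and retry */
    }
    for (size_t i = 0; i < nw; i++)
        *ws[i].addr = ws[i].value;             /* deferred write-back */
    for (size_t i = 0; i < nw; i++)
        *ws[i].lock += 1;                      /* release + increment */
    return 1;
}
```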
2.1 Intentionally Left Blank
2.2 Mechanical Transformation
[0132] As we discussed earlier, the algorithm we describe can be
added to code in a mechanical fashion, that is, without
understanding anything about how the code works or what the program
itself does. In our benchmarks, we performed the transformation by
hand. [0133] We do, however, believe that it should not be hard to
automate this process and allow a compiler to perform the
transformation given a few rather simple limitations on the code
structure within a transaction.
2.3 Software-Hardware Inter-Operability
[0134] Though we have described TL as a software based scheme, it
can be made inter-operable with HTM systems on several levels.
[0135] In its simplest form, one can use static bounded-size
obstruction-free hardware transactions to speed up software TL. This
is done by using the hardware transactions to acquire the write
locks of a TL transaction in order. Since the write set is known in
advance, we require only static hardware transactions. Because for
many data structures the number of writes is significantly smaller
than the number of reads, it may well be that in most cases these
hardware transactions can be bounded in size. If all write locks do
not fit in a single hardware transaction, one can apply several of
them in sequence using the same scheme we currently use to acquire
individual locks, avoiding deadlock because the locations are
acquired in ascending order.
[0136] One can also use TL as a hybrid backup mechanism to extend
bounded size dynamic hardware transactions to arbitrary size. We
can use a scheme similar to OSTM and HyTM where instead of their
object records, we use the versioned-write-lock associated with
each location.
[0137] Hardware transactions need to verify for each location that
they read or write that the associated versioned-write-lock is
free. For every write they also need to update the version number
of the associated stripe lock. This suffices to provide
inter-operability between hardware and software transactions. Any
software read will detect concurrent modifications of locations by
hardware writes because the version number of the associated lock
will have changed. Any hardware transaction will fail if a
concurrent software transaction is holding the lock to write.
Software transactions attempting to write will also fail in
acquiring a lock, since the lock is acquired with a hardware
synchronization operation (such as a CAS or a single-location
transaction) which will fail if the version number of the location
was modified by the hardware transaction.
3.0 Remarks
[0138] 1. One goal of the present disclosure is to allow the
programmer to convert coarse-grain locked data structures to TL so
as to enjoy the benefits of parallelism. This can be helpful when
transitioning to high-order parallelism with SMT/CMT processors
such as Niagara.TM.. One key attribute of TL is simplicity. It
allows the programmer to extract additional parallelism but without
unduly increasing the complexity of their code. The programmer can
"think serially" but the code will "execute concurrently".
[0139] For a given problem we deem TL successful if the resultant
performance exceeds that of the original coarse-grain locked form.
In many cases the TL form is competitive with the best-of-breed STM
forms. That having been said, for any given problem a specialized,
hand-coded, form written by a synchronization expert is likely to
be faster than the TL form. An expert in synchronization,
developing with concurrency in mind as a first-order requirement,
may be aware of relaxed data dependencies in the algorithm and take
advantage of domain-specific optimizations.
[0140] For example, a red-black tree transformed with TL will
out-perform a red-black tree protected by a naive lock. But an
exotic ad-hoc red-black tree designed by concurrency experts and
subject to considerable research, such as Hanke's red-black
algorithm, will generally outperform the TL-transformed red-black
tree.
[0141] 2. Broadly, TL works by transforming an operation protected by
a coarse-grained lock into optimistic transactional form. We then
implement the transactional infrastructure with fine-grain locks,
enabling additional parallelism as the access patterns permit.
[0142] 3. OSTM works by opening and closing records for reading and
writing. TL, in a sense, performs the open operations automatically
at transactional load- and store-time but leaves the record open
until commit time. TL has no way of knowing that prior loads
executed within a transaction might not have any bearing on the
results produced by the transaction.
[0143] In such cases the load could safely be removed from the
read-set, but TL doesn't currently provide that capability. As such,
the TL transaction is exposed to false-positive failures.
[0144] Consider the following scenario where we have a TL-protected
hash table. Thread T1 traverses a long hash bucket chain searching
for a value associated with a certain key, iterating over "next"
fields. We'll say that T1 locates the appropriate node at or near
the end of the linked list. T2 concurrently deletes an unrelated
node earlier in the same linked list. T2 commits. At commit-time T1
will abort because the linked-list "next" field written to by T2 is
in T1's read-set. T1 must retry the lookup operation (ostensibly
locating the same node). Given our domain-specific knowledge of the
linked list we understand that the lookup and delete operations
didn't really conflict and could have been allowed to operate
concurrently with no aborts. A clever "hand over hand" ad-hoc
hand-coded locking scheme would allow the desired concurrency.
[0145] 4. As described above, TL admits live-lock failure. Consider
where thread T1's read-set is A and its write-set is B. T2's
read-set is B and its write-set is A. T1 tries to commit and locks B.
T2 tries to commit and acquires A. T1 validates A, in its read-set,
and aborts as A is locked by T2. T2 validates B in its read-set
and aborts as B is locked by T1. We have mutual abort with no
progress. To improve "liveness" we use a back-off delay at
abort-time, similar in spirit to that found in CSMA-CD MAC
protocols. The delay interval is a function of (a) a random number
generated at abort-time, (b) the length of the prior (aborted)
write-set, and (c) the number of prior aborts for this
transactional attempt.
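An illustrative abort-time delay combining the three inputs named above might look like the following. The particular formula (a randomized window growing exponentially with the abort count and linearly with the write-set length) is an assumption; the disclosure states only that the delay is a function of these quantities.

```c
#include <stdint.h>

/* Hypothetical abort-time backoff: the wait window grows with the
 * number of prior aborts and with the aborted write-set length; the
 * random input picks a point inside the window. */
static uint64_t backoff_delay(uint64_t rnd, unsigned write_set_len,
                              unsigned prior_aborts)
{
    if (prior_aborts > 16)
        prior_aborts = 16;                       /* cap the exponent */
    uint64_t window = ((uint64_t)write_set_len + 1) << prior_aborts;
    return rnd % window;                         /* iterations to wait */
}
```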
[0146] 5. As described above, at commit-time the transactional
mechanism will acquire write-set locks, validate the read-set,
perform the write-back, and then release (and increment) the
write-locks. Lock acquisition is accomplished with CAS and
lock-release with a simple store. Given the availability of
restricted capacity hardware transactional memory, such as will be
available in Sun's forthcoming "ROCK" SPARC processor, we eliminate
the CAS operations and try to acquire the locks in groups
(replacing a set of CAS operations with a single ROCK hardware
transaction).
[0147] In addition, it is possible that the entire commit operation
might be feasible as a single ROCK hardware transaction even where
the original application transaction was too big (too many loads and
stores) to be feasible as a single ROCK transaction. The commit
operation will be able to make an accurate estimate of
ROCK-feasibility given that the size of the read-set and write-set
are available (or cheap to compute) at commit-time. Finally, if the
entire commit is feasible as a ROCK hardware transaction, we can
avoid changing the lock word from unlocked, to locked, to unlocked
(but incremented) by simply fetching the lock word at the start of
the commit, verifying that it is unlocked, and then increasing the
version sub-field at the end of the transaction, after the data
writes are complete.
[0148] 6. Changes to non-transactional variables, such as automatic
variables, must not be allowed to escape or "leak" out of an aborted
transaction. Where needed, the transactional infrastructure must
log such changes and roll back any updates at abort-time.
Similarly, exceptions in aborted or doomed transactions must not
propagate out of the transactional infrastructure. The SXM scheme,
where transactions are encapsulated in method calls, handily deals
with this issue.
[0149] 7. All accesses to shared variables within a transformed TL
critical section must be performed transactionally. Mixed-mode
access can be unsafe. Transactions should not perform or initiate
I/O or otherwise interface with non-transactional components.
Transactions should not access device-memory (memory mapped
devices) with transactional loads and stores, as loads from
device-memory are not necessarily idempotent and may have side
effects.
[0150] 8. Under TL, pure read operations don't require any store
operations. This is important as stores to shared variables under
typical snoop- or directory-based coherency protocols can result in
considerable coherency bus traffic. Such stores incur a local
latency penalty and cause scalability issues as the coherency
traffic consumes precious bandwidth on the shared coherency bus.
[0151] 9. Write-locks are held for a brief time--just long enough
to validate the read-set and write-back the deferred transactional
stores.
[0152] 10. If a transaction acquires many distinct locks, it can
suffer a local latency penalty as the CAS instruction is typically
slow. A balance must be struck between lots of locks (and increased
potential parallelism) and un-contended lock acquisition overhead.
The mapping strategy between variables and locks is critical.
[0153] 11. As noted above, the PW scheme may suffer undue local CAS
latency if many distinct write-locks must be acquired. One possible
solution is to add an indirection bit to the lock-word. When set, the
lock-word contains a pointer to the actual lock. Multiple
indirection is not allowed. Objects are initialized so that the
per-field lock words point to either a canonical non-indirect field
lock within the same object, or to a lock that protects the entire
data structure (e.g., the entire red-black tree or skip-list).
Initially we have coarse-grain locking with a many:1 relationship
between lock fields and actual locks, but as we encounter
contention we can convert automatically to fine-grain locking by
replacing the indirection pointer with a normal non-indirected
lock value. For safety, only the current lock-owner can "split"
or upgrade the lock from the indirected form (coarse-grained) to a
per-field lock (fine-grained). The transition is
unidirectional--we never try to aggregate multiple fine-grain
locks back into a single coarse-grain lock. The onset of
contention (or more precisely, aborts caused by encountering a
locked object) triggers splitting. When the contending thread
eventually acquires the lock it can perform the split operation. By
automatically splitting the locks and switching to finer grained
locking we minimize the number of high-latency CAS operations
needed to lock low-contention fields, but maximize the potential
parallelism for operations that access high-contention fields.
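The indirection scheme just described can be sketched as follows; the bit assignments and names are illustrative assumptions, not a layout specified by the disclosure.

```c
#include <stdint.h>

/* Illustrative lock-word with an indirection bit (bit 1): when set,
 * the remaining bits hold a pointer to the actual lock; when clear,
 * the word is itself a versioned-write-lock. Bit layout is assumed. */
#define LOCK_BIT     1ULL
#define INDIRECT_BIT 2ULL

/* Follow at most one level of indirection to the word that really
 * guards the field (multiple indirection is not allowed). */
static uint64_t *resolve_lock(uint64_t *lw)
{
    if (*lw & INDIRECT_BIT)
        return (uint64_t *)(uintptr_t)(*lw & ~INDIRECT_BIT);
    return lw;
}

/* Split: the current lock-owner converts a field from the indirected
 * (coarse-grained) form to a normal per-field lock, version 0. */
static void split_lock(uint64_t *lw)
{
    *lw = 0;                  /* non-indirected, unlocked, version 0 */
}
```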
[0154] One can apply the same optimization to PS, where the first
lock in the array is a normal lock and all other locks are indirect
locks, pointing to the first element.
[0155] 12. Broadly, TL operates better in environments with lower
mutation rates (that is, where the store:load ratio is low). For
example, consider a red-black tree and a skip-list that are protected
by a single lock and where the data structure is subject to many
concurrent modifications. The relative speedup achieved with TL as
compared to the classic lock will usually be higher with the
skip-list than with the red-black tree, as mutations to a skip-list
usually only require a few stores, whereas mutations to a red-black
tree may require adjustments to the tree structure that require
many stores.
[0156] 13. We claim that TL admits no schedules that were not
already possible for the data structure as protected by the
coarse-grained lock.
[0157] 14. TL could be used to implement the "atomic . . . "
construct where no lock is specified.
[0158] 15. Our example embodiment describes a 64-bit lock-word,
partitioned into a single lock bit and a 63-bit version subfield.
Assuming a 4 GHz processor and a maximum update rate of 1
transaction per clock, the version sub-field will overflow in 68
years. Other example embodiments allow for use of a 32-bit
lock-word field. When a counter overflows, for instance, a
so-called stop-the-world epoch might be used to stop all threads
outside transactions. At that point no thread can have a previously
fetched instance of the overflowed lock-word in its read-set; the
lock-word version can safely be reset to 0. All threads can then be
allowed to resume normal execution.
[0159] 16. Unlike some other STMs which incorporate and depend on
their own garbage-collection mechanisms, TL allows the C programmer
to use normal malloc( ) and free( ) operations to manage the
lifecycle of structures containing transactionally accessed shared
variables. The only requirement imposed by TL is that a structure
being free( )-ed must be allowed to quiesce. That is, any pending
transactional stores, detectable by checking the lock-bit in the
associated locks, must be allowed to drain into the structure
before the structure is freed. After the structure is quiesced it
can be accessed with normal load and store operations outside the
transactional framework.
[0160] 17. Concurrent mixed-mode transactional and
non-transactional accesses are proscribed. When a particular object
is being accessed with transactional load and store operations it
must not be accessed with normal non-transactional load and store
operations. (When any accesses to an object are transactional, all
accesses must be transactional.) Before an object can exit the
transactional domain and subsequently be accessed with normal
non-transactional loads and stores, we must sterilize the object
before it escapes. To motivate the need for sterilization, consider
the following scenario. We have a linked list of 3 nodes identified
by addresses A, B and C. A node contains Key, Value and Next
fields. The data structure implements a traditional key-value
mapping. The key-value map (the linked list) is protected by TL
using PS. Node A's Key field contains "1", its value field contains
"1001" and its Next field refers to B. B's Key field contains "2",
its Value field contains "1002" and its Next field refers to C. C's
Key field contains 3, the value field "1003" and its Next field is
NULL. Thread T1 calls Set("2", "2002"). The TL-based Set( )
operator traverses the linked list using transactional loads and
finds node B with a key value of "2". T1 then executes a
transactional store into B.Value to change "1002" to "2002". T1's
read-set consists of A.Key, A.Next, B.Key and the write-set
consists of "B.Value." T1 attempts to commit; it acquires the lock
covering
[0161] B.Value and then validates that the previously fetched
read-set is consistent by checking the version numbers in the locks
covering the read-set. Thread T1 stalls. Thread T2 executes
Delete("2"). The Delete( ) operator traverses the linked list and
attempts to splice-out Node B by setting A.Next to C. T2
successfully commits. The commit operator stores C into A.Next.
[0162] T2's transaction completes. T2 then calls free(B). T1
resumes in the midst of its commit and stores into B.Value. We have
a classic modify-after-free pathology. To avoid such problems T2
calls sterilize(B) after the commit finishes but before free( )ing
B. This allows T1's latent transactional ST to drain into B before
B is free( )ed and potentially reused. Note, however, that TL
(using sterilization) did not admit any outcomes that were not
already possible under the original coarse-grained lock.
[0163] 18. Consider the following problematic lifecycle based on
the A, B, C linked list, above. Let's say we're using TL in the "C"
language to moderate concurrent access to the list, but with either
PO or PW mode where the lock word(s) are embedded in the node.
Thread T1 calls Set("2", "2002"). The TL-based Set( ) method
traverses the list and locates node B having a key value of "2".
Thread T2 then calls Delete("2"). The Delete( ) operator commits
successfully. T2 sterilizes B and then calls free(B). The memory
underlying B is recycled and used by some other thread T3. T1
attempts to commit by acquiring the lock covering B.Value. The
lock-word is collocated with B.Value, so the CAS operation
transiently changes the lock-word contents. T1 then validates the
read-set, recognizes that A.Next changed (because of T2's Delete(
)) and aborts, restoring the original lock-word value. T1 has caused
the memory word underlying the lock for B.Value to "flicker",
however. Such modifications are unacceptable; we have a classic
modify-after-free error.
[0164] As such, we advocate using PS for normal C code as the
lock-words (metadata) are stored separately in type-stable memory
distinct from the data protected by the locks. This proviso can be
relaxed if the C-code uses some type of garbage collection (such as
Boehm-style conservative garbage collection for C, Michael-style
hazard pointers or Fraser-style Epoch-Based Reclamation) or
type-stable storage for the nodes. For type-safe garbage collected
managed runtime environments such as Java any of the mapping
policies (PS, PO or PW) are safe. Relatedly, use-after-free errors
are impossible in Java, so sterilization would be needed only for
objects that escape the transactional domain and will subsequently
be accessed with normal loads and stores.
[0165] Alternately, we could employ PO or PW with C-code but
replace the embedded lock-words with immutable words that point to
type-stable or immortal lock-words. Under PO, for instance, the
object would contain an immutable field that points to some other
lock-word. The field would be initialized to point to the associated
lock-word either at object construction-time, or initialization
could be deferred until the first transactional store or load.
[0166] 19. It is possible to use C++ operator overloading and
template functions to interpose on all load and store operations
for variables defined to be used in a transactional fashion. This
approach obviates the need to explicitly call transactional load
and store operators, making the set of modifications required to
switch to TL much smaller.
[0167] 20. We previously described the PW, PO and PS schemes for
associating variables with locks. More generally, TL might allow a
skilled programmer to explicitly control the mapping by allowing
the programmer to define a custom VariableToLock( ) function which
takes a variable address as input and returns a lock address. The
VariableToLock( ) function is optional.
[0168] 21. TL can easily be combined with STM interfaces or
transactional infrastructures such as Herlihy's SXM.
[0169] 22. TL protects data accessed within a critical section. TL
should not be used where a lock is used as an execution barrier and
shared data is accessed outside the lock. For instance, let's say
thread T1 acquires Lock A, and spawns thread T2, increments some
global variable B and then releases A. T2 will acquire A, release
A, and then increment B. Access to the shared variable B is
protected by the lock, but the accesses are outside the critical
section. In fact the critical section is empty and degenerate.
[0170] 23. Code that assumes memory barrier (fence)-equivalent
semantics for lock and unlock should not be transformed with
TL.
[0171] 24. We can extend the lock-word encoding from
LOCKED/UNLOCKED to READWRITE/READONLY/EXCLUSIVE as follows.
READWRITE corresponds to UNLOCKED and EXCLUSIVE corresponds to
LOCKED. The new state, READONLY, is an interim state used only at
commit-time. The commit operator is modified to attempt to change
all locks in the write-set from READWRITE to READONLY with CAS.
The commit operator must spin if the lock is found to be in
READONLY or EXCLUSIVE state. Once the write-set locks have been
made READONLY, the commit operator ratifies versions of the
read-set locks and ensures that the read-set locks are in READWRITE
state. If the read-set is invalid the commit operator restores the
write-set locks to READWRITE and aborts the transaction. Otherwise
the commit operator uses simple store operations to upgrade all the
write-set locks from READONLY to EXCLUSIVE. The commit operator
then writes back the deferred stores saved in the write-set and
then releases the locks and increments the versions, changing
(V, EXCLUSIVE) to (V+1, READWRITE) with a single atomic store. Note
that the upgrade to EXCLUSIVE, write-back, and release can be fused
into a single loop that iterates over the write-set in
chronological order. This modification decreases the lock-hold
time--that is the time that locks are in EXCLUSIVE state.
Critically, if a lock is in READONLY state because of a commit
operation being executed by thread T1, concurrent transactional
loads performed by thread T2 are allowed to proceed. (That is, when
a thread executing a commit has placed a lock in READONLY state,
concurrent transactional loads performed by other threads are
allowed to proceed).
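The lock-word progression described above can be sketched as follows. This single-threaded sketch encodes the three states in the two low bits of the lock word (an illustrative encoding the text does not fix) and elides the CAS, spinning, and read-set validation of a real implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { READWRITE = 0, READONLY = 1, EXCLUSIVE = 2 }; /* low 2 bits */
typedef uint64_t lw_t;
static lw_t     lw(uint64_t v, int s) { return (v << 2) | (lw_t)s; }
static int      lw_state(lw_t w)      { return (int)(w & 3u); }
static uint64_t lw_ver(lw_t w)        { return w >> 2; }

/* Walk the write-set locks READWRITE -> READONLY, then (after the
 * read-set would be validated) upgrade to EXCLUSIVE, write back,
 * and release as (version+1, READWRITE). In real code the first
 * transition uses CAS and the later ones plain stores. */
static void commit_locks(lw_t *locks, size_t n) {
    for (size_t i = 0; i < n; i++)                     /* CAS in real code */
        locks[i] = lw(lw_ver(locks[i]), READONLY);
    /* ...read-set validation happens here; concurrent readers may
     * still proceed while the locks sit in READONLY state... */
    for (size_t i = 0; i < n; i++) {
        locks[i] = lw(lw_ver(locks[i]), EXCLUSIVE);    /* plain store */
        /* deferred write-back for variables covered by lock i here */
        locks[i] = lw(lw_ver(locks[i]) + 1, READWRITE); /* release */
    }
}
```

The fused second loop mirrors the observation that upgrade, write-back, and release can share one pass over the write-set, minimizing EXCLUSIVE hold time.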
[0172] In yet another variation the commit operator would use CAS
to try to change all the write-set locks from READWRITE to
READONLY. Once in READONLY state commit would then use normal
atomic stores to upgrade the locks from READONLY to EXCLUSIVE. The
commit operator would then validate the read-set and,
conditionally, write-back the deferred stores saved in the
write-set and release the locks, incrementing the version
subfields. This adaptation minimizes aggregate lock-hold times.
Recall that CAS has high local latency even when successful.
Consider a transaction containing stores to variables V1 and V2
covered by distinct locks W1 and W2. The basic commit operator,
described earlier, uses CAS to lock W1 and then another CAS to lock
W2. The hold-time for W1 is increased because of the latency of the
CAS needed to acquire W2. The mechanism described here lessens the
impact of CAS latency.
[0173] 25. Transactions may be nested by folding or "flattening"
inner transactions into the outermost transaction. By nature,
longer transactions have a higher chance of failing because of
concurrent interference, however.
4.0 Additional Embodiments
[0174] In furtherance of the above discussion, embodiments herein
can operate in two modes which we will call encounter mode and
commit mode. These modes indicate how locks are acquired and how
transactions are committed or aborted. We will begin by further
describing our commit mode algorithm, later explaining how TL
operates in encounter mode.
[0175] We associate a special versioned-write-lock with every
transacted memory location. A versioned-write-lock is a simple spin
lock that uses a compare-and-swap (CAS) operation to acquire the
lock and a store to release it. Since one only needs a single bit
to indicate that the lock is taken, we use the rest of the lock
word to hold a version number. This number is incremented by every
successful lock-release. In encounter mode the version number is
displaced and a pointer into a thread's private undo log is
installed.
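A minimal versioned-write-lock along the lines just described might look as follows in C11; the bit layout (lock bit in bit 0, version above it) and the function names are our assumptions for the sketch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Versioned-write-lock: bit 0 is the lock bit, the remaining bits
 * hold the version number. Acquire uses compare-and-swap; release
 * is a single store that also increments the version. */
typedef _Atomic uint64_t vwlock_t;

static int vwlock_try_acquire(vwlock_t *l) {
    uint64_t old = atomic_load(l);
    if (old & 1u) return 0;                  /* already held by someone */
    return atomic_compare_exchange_strong(l, &old, old | 1u);
}
static void vwlock_release(vwlock_t *l) {
    uint64_t held = atomic_load(l);
    /* bump the version (upper bits) and clear the lock bit */
    atomic_store(l, ((held >> 1) + 1) << 1);
}
```

A caller that loses the CAS race simply observes a failed `vwlock_try_acquire` and may spin or abort, per the load rules described below.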
[0176] We allocate a collection of versioned-write-locks. We use
various schemes for associating locks with shared variables: per
object (PO), where a lock is assigned per shared object; per stripe
(PS), where we allocate a separate large array of locks and memory
is striped (divided up) using some hash function to map each
location to a separate stripe; and per word (PW), where each
transactionally referenced variable (word) is collocated adjacent
to a lock. Other
mappings between transactional shared variables and locks are
possible. The PW and PO schemes require either manual or
compiler-assisted automatic insertion of lock fields whereas PS can
be used with
unmodified data structures. Since in general PO showed better
performance than PW we will focus on PO and do not discuss PW
further. PO might be implemented, for instance, by leveraging the
header words of Java TM objects. A single PS stripe-lock array may
be shared and used for different TL data structures within a single
address-space. For instance an application with two distinct TL
red-black trees and three TL hash-tables could use a single PS
array for all TL locks. As our default mapping we chose an array of
2^20 entries of 32-bit lock words with the mapping function masking
the variable address with "0x3FFFFC" and then adding in the base
address of the lock array to derive the lock address.
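The default PS mapping just described reduces to a mask and an add. The sketch below takes the address as an integer to sidestep pointer-arithmetic concerns; the table and function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Per-stripe (PS) lock table: 2^20 entries of 32-bit lock words.
 * Masking with 0x3FFFFC keeps bits 2..21 of the variable address,
 * giving one stripe per 4-byte word, wrapping every 4 MB. */
static uint32_t lock_table[1u << 20];

static uint32_t *addr_to_lock(uintptr_t addr) {
    return (uint32_t *)((uintptr_t)lock_table + (addr & 0x3FFFFCu));
}
```

Note that addresses 4 MB apart alias to the same stripe, which is safe (it merely over-serializes) but is why the table size is a tunable.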
[0177] The following is a description of the PS algorithm although
most of the details carry through verbatim for PO and PW as well.
We maintain thread local read- and write-sets as linked lists. The
read-set entries contain the address of the lock and the observed
version number of the lock associated with the transactionally
loaded variable. The write-set entries contain the address of the
variable, the value to be written to the variable, and the address
of the lock that "covers" the variable. The write-set is kept in
chronological order to avoid write-after-write hazards.
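The thread-local sets just described might be declared as follows; the field names are illustrative, but the contents match the description (lock address plus observed version for reads; variable address, deferred value, and covering lock for writes):

```c
#include <assert.h>
#include <stdint.h>

/* Read-set entry: the lock's address and the version observed when
 * the associated variable was transactionally loaded. */
struct read_entry {
    uint32_t *lock;
    uint64_t  version;
    struct read_entry *next;
};
/* Write-set entry: kept in chronological order to avoid
 * write-after-write hazards at write-back time. */
struct write_entry {
    uintptr_t *addr;    /* address of the shared variable      */
    uintptr_t  value;   /* deferred value to store at commit   */
    uint32_t  *lock;    /* lock that covers the variable       */
    struct write_entry *next;
};
```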
4.1 Commit Mode
[0178] We now describe how TL executes in commit mode a sequential
code fragment that was placed within a TL transaction. As we
explain, this mode does not require type-stable garbage collection,
and works seamlessly with the memory life-cycle of languages like C
and C++.
1. Run the transactional code, reading the locks of all
fetched-from shared locations and building a local read-set and
write-set (use a safe load operation to avoid running off null
pointers as a result of reading an inconsistent view of
memory).
[0179] A transactional load first checks (using a filter such as a
Bloom filter) to see if the load address appears in the write-set;
if so, the transactional load returns the last value written to the
address. This provides the illusion of processor consistency and
avoids so-called read-after-write hazards. If the address is not
found in the write-set the load operation then fetches the lock
value associated with the variable, saving the version in the
read-set, and then fetches from the actual shared variable. If the
transactional load operation finds the variable locked the load may
either spin until the lock is released or abort the operation.
[0180] Transactional stores to shared locations are handled by
saving the address and value into the thread's local write-set. The
shared variables are not modified during this step. That is,
transactional stores are deferred and contingent upon successfully
completing the transaction. During the operation of the transaction
we periodically validate the read-set. If the read-set is found to
be invalid we abort the transaction. This avoids the possibility of
a doomed transaction (a transaction that has read inconsistent
global state) becoming trapped in an infinite loop.
[0181] 2. Attempt to commit the transaction. Acquire the locks of
locations to be written. If a lock in the write-set (or more
precisely a lock associated with a location in the write-set) also
appears in the read-set then the acquire operation must atomically
(a) acquire the lock and, (b) validate that the current lock
version sub-field agrees with the version found in the earliest
read-entry associated with that same lock. An atomic CAS can
accomplish both (a) and (b). Acquire the locks in any convenient
order using bounded spinning to avoid indefinite deadlock.
3. Re-read the locks of all read-only locations to make sure
version numbers haven't changed. If a version does not match,
roll-back (release) the locks, abort the transaction, and
retry.
4. The prior observed reads in step (1) have been validated as
forming an atomic snapshot of memory. The transaction is now
committed. Write-back all the entries from the local write-set to
the appropriate shared variables.
5. Release all the locks identified in the write-set by atomically
incrementing the version and clearing the write-lock bit (using a
simple store).
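Step (3) above, re-validating the read-set, can be sketched as a single pass over the saved entries; the entry layout follows the description, with bit 0 as the lock bit and the version above it (our assumed encoding):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rentry { const uint64_t *lock; uint64_t seen; };

/* Returns 1 if every read-set lock is unlocked and still carries
 * the version observed at load time; 0 means abort and retry. */
static int readset_valid(const struct rentry *rs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t w = *rs[i].lock;
        if ((w & 1u) || (w >> 1) != rs[i].seen) return 0;
    }
    return 1; /* reads form an atomic snapshot; safe to write back */
}
```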
[0182] A few things to note. The write-locks are held only for a
brief time while attempting to commit the transaction. This helps
improve performance under high contention. The Bloom filter allows
us to determine, by reading the single filter word, that a value is
not in the write-set and need not be searched for. Though locks could
have been acquired in ascending address order to avoid deadlock, we
found that sorting the addresses in the write set was not worth the
effort.
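A single-word Bloom filter of the kind just mentioned can be as simple as the following; the hash (word-granularity, low bits) is an illustrative choice, not one fixed by the text:

```c
#include <assert.h>
#include <stdint.h>

/* One bit of a 64-bit summary word per write-set address. A clear
 * bit proves the address is absent, so most transactional loads can
 * skip searching the write-set; a set bit may be a false positive,
 * in which case the write-set is searched anyway. */
static uint64_t bloom_bit(uintptr_t addr) {
    return 1ull << ((addr >> 3) & 63u);   /* illustrative hash */
}
static void bloom_add(uint64_t *filter, uintptr_t addr) {
    *filter |= bloom_bit(addr);
}
static int bloom_maybe_contains(uint64_t filter, uintptr_t addr) {
    return (filter & bloom_bit(addr)) != 0;
}
```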
4.2 Encounter Mode
The following is the TL encounter mode transaction. For reasons we
explain later, this mode assumes a type-stable closed memory pool
or garbage collection.
1. Run the transactional code, reading the locks of all
fetched-from shared locations and building a local read-set and
write-set (the write set is an undo set of the values before the
transactional writes).
[0183] Transactional stores to shared locations are handled by
acquiring locks as they are encountered, saving the address and
current value into the thread's local write-set, and pointing from
the lock to the write-set entry. The shared variables are written
with the new value during this step.
[0184] A transactional load checks to see if the lock is free or is
held by the current transaction and if so reads the value from the
location. There is thus no need to look for the value in the write
set. If the transactional load operation finds that the lock is
held it will spin. During the operation of the transaction we
periodically validate the read-set. If the read-set is found to be
invalid we abort the transaction. This avoids the possibility of a
doomed transaction (a transaction that has read inconsistent global
state) becoming trapped in an infinite loop.
2. Attempt to commit the transaction. Acquire the locks associated
with the write-set in any convenient order, using bounded spinning
to avoid deadlock.
3. Re-read the locks of all read-only locations to make sure
version numbers haven't changed. If a version does not match,
restore the values using the write-set, roll-back (release) the
locks, abort the transaction, and retry.
4. The prior observed reads in step (1) have been validated as
forming an atomic snapshot of memory. The transaction is now
committed.
5. Release all the locks identified in the write-set by atomically
incrementing the version and clearing the write-lock bit.
We note that the locks in encounter mode are held for a longer
duration than in commit mode, which accounts for weaker performance
under contention. However, one does not need to look-aside and
search through the write set for every read.
4.3 Contention Management
[0185] As described above TL can admit a live-lock failure.
Consider where thread T1's read-set is A and its write-set is B.
T2's read-set is B and write-set is A. T1 tries to commit and locks
B. T2 tries to commit and acquires A. T1 validates A, in its
read-set, and aborts as A is locked by T2. T2 validates B in its
read-set and aborts as B was locked by T1. We have mutual abort
with no progress. To provide liveness we use bounded spin and a
back-off delay at abort-time, similar in spirit to that found in
CSMA-CD MAC protocols. The delay interval is a function of (a) a
random number generated at abort-time, (b) the length of the prior
(aborted) write-set, and (c) the number of prior aborts for this
transactional attempt. It is important to note that unlike
conventional methods, we found that we do not need mechanisms for
one transaction to abort another to allow progress/liveness even in
encounter mode.
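One plausible shape for the abort-time delay just described, combining the random draw, the aborted write-set length, and the abort count, is sketched below; the text does not fix a formula, so the cap and the exponential widening are our assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Back-off window widens with the aborted write-set length and
 * (exponentially, capped) with the number of prior aborts for this
 * transactional attempt; the random draw supplies jitter, as in
 * CSMA-CD-style protocols. */
static uint32_t backoff_delay(uint32_t rnd, uint32_t writeset_len,
                              uint32_t aborts) {
    uint32_t shift  = aborts < 10 ? aborts : 10;      /* cap growth */
    uint32_t window = (writeset_len + 1u) << shift;   /* widen      */
    return rnd % window;                              /* jitter     */
}
```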
[0186] These mechanisms are unnecessary for performance or deadlock
avoidance, and in a sense contradict the very philosophy behind
transactional locking: rather than trying to improve on
hand-crafted lock-based implementations by being non-blocking
(hand-crafted lock-based data structures are not obstruction free),
we try to build lock-based STMs that will get us as close to their
behavior as one can with a completely mechanical approach, that is,
one that truly simplifies the job of the concurrent programmer.
4.4 The Pathology of Transactional Memory Management
[0187] For type-safe garbage collected managed runtime environments
such as Java any of the TL lock-mapping policies (PS, PO, or PW)
and modes (Commit or Encounter) are safe, as the GC assures that
transactionally accessed memory will only be released once no
references remain to the object. In C or C++, TL preferentially
uses the PS/Commit locking scheme to allow the C programmer to use
normal malloc( ) and free( ) operations to manage the lifecycle of
structures containing transactionally accessed shared
variables.
[0188] Concurrent mixed-mode transactional and non-transactional
accesses are proscribed. When a particular object is being accessed
with transactional load and store operations it must not be
accessed with normal non-transactional load and store operations.
(When any accesses to an object are transactional, all accesses
must be transactional). In PS/Commit mode an object can exit the
transactional domain and subsequently be accessed with normal
non-transactional loads and stores, but we must wait for the object
to quiesce before it leaves. There can be at most one transaction
holding the transactional lock, and quiescing means waiting for
that lock to be released, implying that all pending transactional
stores to the location have been "drained", before allowing the
object to exit the transactional domain and subsequently to be
accessed with normal load and store operations. Once it has
quiesced, the memory can be freed and recycled in a normal fashion,
because any transaction that may acquire the lock and reach the
disconnected location will fail its read-set validation.
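Since at most one committed transaction can hold the covering lock, quiescing reduces to waiting for that lock's lock-bit to clear. A minimal sketch, assuming the bit-0 lock-bit layout used above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Wait for any pending transactional store covered by this lock to
 * drain: at most one committing writer can hold the lock, so once
 * the lock-bit clears, the structure is quiesced and may be freed. */
static void quiesce(_Atomic uint64_t *lock) {
    while (atomic_load(lock) & 1u)
        ;  /* spin until the committing writer releases */
}
```

A caller would invoke `quiesce` on the covering lock(s) after its own commit finishes and before calling free( ) on the structure.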
[0189] To motivate the need for quiescing, consider the following
scenario with PS/Commit. We have a linked list of 3 nodes
identified by addresses A, B and C. A node contains Key, Value and
Next fields. The data structure implements a traditional key-value
mapping. The key-value map (the linked list) is protected by TL
using PS. Node A's Key field contains 1, its value field contains
1001 and its Next field refers to B. B's Key field contains 2, its
Value field contains 1002 and its Next field refers to C. C's Key
field contains 3, the value field 1003 and its Next field is NULL.
Thread T1 calls put(2, 2002). The TL-based put( ) operator
traverses the linked list using transactional loads and finds node
B with a key value of 2. T1 then executes a transactional store
into B.Value to change 1002 to 2002. T1's read-set consists of
A.Key, A.Next, B.Key and the write-set consists of B.Value. T1
attempts to commit; it acquires the lock covering B.Value and then
validates that the previously fetched read-set is consistent by
checking the version numbers in the locks covering the read-set.
Thread T1 stalls. Thread T2 executes delete(2). The delete( )
operator traverses the linked list and attempts to splice-out Node
B by setting A.Next to C. T2 successfully commits. The commit
operator stores C into A.Next. T2's transaction completes. T2 then
calls free(B). T1 resumes in the midst of its commit and stores
into B.Value. We have a classic modify-after-free pathology. To
avoid such problems T2 calls quiesce(B) after the commit finishes
but before free( )ing B. This allows T1's latent transactional ST
to drain into B before B is free( )ed and potentially reused. Note,
however, that TL (using quiescing) did not admit any outcomes that
were not already possible under a simple coarse-grained lock. Any
thread that attempts to write into B will, at commit-time, acquire
the lock covering B, validate A.Next and then store into B. Once B
has been unlinked there can be at most one thread that has
successfully committed and is in the process of writing into B.
Other transactions attempting to write into B will fail read-set
validation at commit-time as A.Next has changed.
[0190] Consider another problematic lifecycle scenario based on the
A, B, C linked list, above. Let's say we're using TL in
the C language to moderate concurrent access to the list, but with
either PO or PW mode where the lock word(s) are embedded in the
node. Thread T1 calls put(2, 2002). The TL-based put( ) method
traverses the list and locates node B having a key value of 2.
Thread T2 then calls delete(2). The delete( ) operator commits
successfully. T2 waits for B to quiesce and then calls free(B). The
memory underlying B is recycled and used by some other thread T3.
T1 attempts to commit by acquiring the lock covering B.Value. The
lock-word is collocated with B.Value, so the CAS operation
transiently changes the lock-word contents. T1 then validates the
read-set, recognizes that A.Next changed (because of T2's delete(
)) and aborts, restoring the original lock-word value. T1 has caused
the memory word underlying the lock for B.Value to "flicker",
however. Such modifications are unacceptable; we have a classic
modify-after-free error.
[0191] Finally, consider the following pathological scenario
admitted by PS/Encounter. T1 calls put(2,2002). Put( ) traverses
the list and locates node B. T2 then calls delete(2), commits
successfully, calls quiesce(B) and free(B). T1 acquires the lock
covering B.Value, saves the original B.Value (1002) into its
private write undo log, and then stores 2002 into B.Value. Later,
during read-set validation at commit time, T1 will discover that
its read-set is invalid and abort, rolling back B.Value from 2002
to 1002. As above, this constitutes a modify-after-free pathology
where B was recycled, but B.Value transiently "flickered" from 1002 to
2002 to 1002. We can avoid this problem by enhancing the encounter
protocol to validate the read-set after each lock acquisition but
before storing into the shared variable. This confers safety, but
at the cost of additional performance.
[0192] As such, we advocate using PS/Commit for normal C code as
the lock-words (metadata) are stored separately in type-stable
memory distinct from the data protected by the locks. This
provision can be relaxed if the C-code uses some type of garbage
collection (such as Boehm-style conservative garbage collection for
C, Michael-style hazard pointers or Fraser-style Epoch-Based
Reclamation) or type-stable storage for the nodes.
4.5 Mechanical Transformation of Sequential Code
[0193] As we discussed earlier, the algorithm we describe can be
added to code in a mechanical fashion, that is, without
understanding anything about how the code works or what the program
itself does. In our benchmarks, we performed the transformation by
hand. We do however believe that it may be feasible to automate
this process and allow a compiler to perform the transformation
given a few rather simple limitations on the code structure within
a transaction.
[0194] We note that hand-crafted data structures can always have an
advantage over TL, as TL has no way of knowing that prior loads
executed within a transaction might no longer have any bearing on
results produced by the transaction.
[0195] Consider the following scenario where we have a TL-protected
hash table. Thread T1 traverses a long hash bucket chain searching
for the value associated with a certain key, iterating over
"next" fields. We'll say that T1 locates the appropriate node at or
near the end of the linked list. T2 concurrently deletes an
unrelated node earlier in the same linked list. T2 commits. At
commit-time T1 will abort because the linked-list "next" field
written to by T2 is in T1's read-set. T1 must retry the lookup
operation (ostensibly locating the same node). Given our
domain-specific knowledge of the linked list we understand that the
lookup and delete operations didn't really conflict and could have
been allowed to operate concurrently with no aborts. A clever "hand
over hand" ad-hoc hand-coded locking scheme would have the
advantage of allowing this desired concurrency. Nevertheless, as
our empirical analysis later in the paper shows, in the data
structure we tested the beneficial effect of this added concurrency
on overall application scalability does not seem to be as profound
as one would think.
4.6 Software-Hardware Inter-Operability
[0196] Though we have described TL as a software based scheme, it
can be made inter-operable with HTM systems on several levels.
[0197] On a machine supporting dynamic hardware transactions,
transactions executed in hardware need only verify for each
location that they
read or write that the associated versioned-write-lock is free.
There is no need for the hardware transaction to store an
intermediate locked state into the lock word(s). For every write
they also need to update the version number of the associated
stripe lock upon completion. This suffices to provide
inter-operability between hardware and software transactions. Any
software read will detect concurrent modifications of locations by
hardware writes because the version number of the associated lock
will have changed. Any hardware transaction will fail if a
concurrent software transaction is holding the lock to write.
Software transactions attempting to write will also fail in
acquiring a lock on a location since lock acquisition is done using
an atomic hardware synchronization operation (such as CAS or a
single location transaction) which will fail if the version number
of the location was modified by the hardware transaction.
[0198] One can also think of using a static bounded size
obstruction-free hardware transaction to speed up software TL. This
may be done variously by attempting to complete the entire commit
operation with a single hardware transaction, or, alternately, by
using hardware transactions to acquire the write locks "in bulk".
This latter approach is beneficial if bulk acquisition of the
write-locks via hardware transactions is faster (has lower latency)
than acquiring one write lock at a time with CAS. Since the write
set is known in advance, we require only static hardware
transactions. Because for many data structures the number of writes
is significantly smaller than the number of reads, it may well be
that in most cases these hardware transactions can be bounded in
size. If all write locks do not fit in a single hardware
transaction, one can apply several of them in sequence using the
same scheme we currently use to acquire individual locks. However,
as we report above, we found the relative contribution of the lock
acquisition time to latency to be small, so it is not clear how
much of a saving a hardware transaction will provide over the use
of CAS operations.
[0199] One can also use TL as a hybrid backup mechanism to extend
bounded size dynamic hardware transactions to arbitrary size.
Again, our empirical testing suggests that there is not much of a
gain in this approach.
[0200] While this invention has been particularly shown and
described with references to preferred embodiments thereof, it will
be understood by those skilled in the art that various changes in
form and details may be made therein without departing from the
spirit and scope of the present application as defined by the
appended claims. Such variations are covered by the scope of this
present disclosure. As such, the foregoing description of
embodiments of the present application is not intended to be
limiting. Rather, any limitations to the invention are presented in
the following claims. Note that the different embodiments disclosed
herein can be combined or utilized individually with respect to
each other.
We claim:
* * * * *