U.S. patent application number 11/898363 was filed with the patent
office on 2007-09-11 and published on 2008-04-24 for analyzing
diagnostic data generated by multiple threads within an instruction
stream. Invention is credited to Simon Andrew Ford, Katherine
Elizabeth Kneebone, and Alastair David Reid.

Application Number: 11/898363
Publication Number: 20080098207
Family ID: 38219318
Publication Date: 2008-04-24

United States Patent Application 20080098207
Kind Code: A1
Reid; Alastair David; et al.
April 24, 2008

Analyzing diagnostic data generated by multiple threads within an
instruction stream
Abstract
A diagnostic method for outputting diagnostic data relating to
processing of instruction streams stemming from a computer program,
at least some of said instruction streams comprising multiple
threads, is disclosed. The method comprises the steps of: (i)
receiving diagnostic data; (ii) reordering said received diagnostic
data in dependence upon reordering data, said reordering data
comprising data relating to said computer program; and (iii)
outputting said reordered diagnostic data. In general, the
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel, said diagnostic data being received from said plurality
of processing units.
Inventors: Reid; Alastair David (Cambridge, GB); Ford; Simon Andrew
(Cambridge, GB); Kneebone; Katherine Elizabeth (Cambridge, GB)

Correspondence Address: NIXON & VANDERHYE, PC, 901 NORTH GLEBE
ROAD, 11TH FLOOR, ARLINGTON, VA 22203, US

Family ID: 38219318
Appl. No.: 11/898363
Filed: September 11, 2007
Related U.S. Patent Documents

Application Number | Filing Date  | Patent Number
60853756           | Oct 24, 2006 |
Current U.S. Class: 712/227; 714/E11.207
Current CPC Class: G06F 11/3636 20130101; G06F 11/362 20130101
Class at Publication: 712/227
International Class: G06F 9/44 20060101 G06F009/44
Claims
1. A diagnostic method for outputting diagnostic data relating to
processing of instruction streams stemming from a computer program,
at least some of said instruction streams comprising multiple
threads, said method comprising the steps of: (i) receiving
diagnostic data; (ii) reordering at least one received diagnostic
data item from said received diagnostic data with respect to at
least one other received diagnostic data item in dependence upon
reordering data, said reordering data comprising data relating to
said computer program; and (iii) outputting said reordered
diagnostic data.
2. A diagnostic method according to claim 1, said diagnostic method
comprising a further step: (iia) removing at least one received
diagnostic data item from said received diagnostic data in
dependence upon said reordering data.
3. A diagnostic method according to claim 1, said diagnostic method
comprising a further step: (iib) merging at least one received
diagnostic data item from said received diagnostic data with at
least one other received diagnostic data item in dependence upon
said reordering data.
4. A diagnostic method according to claim 1, wherein said
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel, said diagnostic data being received from said plurality
of processing units.
5. A diagnostic method according to claim 1, wherein said
reordering data further comprises at least one of generic data
relating to how computer programs are transformed prior to being
executed and rules regarding processing of said instructions
stemming from a hardware arrangement of said processing units.
6. A diagnostic method according to claim 1, wherein said
reordering data further comprises additional instruction processing
dependency data output from a compiler compiling said program.
7. A diagnostic method according to claim 1, wherein said
reordering step (ii) comprises identifying a thread within said
instruction stream that an item of said received diagnostic data is
produced by and grouping diagnostic data produced by a same thread
together in execution order.
8. A diagnostic method according to claim 7, wherein said
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel, said diagnostic data being received from said plurality
of processing units and wherein said thread migrates between said
processing units.
9. A diagnostic method according to claim 8, wherein at least one
of said processing units is programmable using a sequence of
instructions from an instruction set.
10. A diagnostic method according to claim 8, wherein at least two
of said processing units are responsive to different instruction
sets.
11. A diagnostic method according to claim 1, wherein said
reordering step (ii) comprises identifying diagnostic data
generated in response to at least one object being processed by
multiple threads and grouping diagnostic data relating to a same
object together.
12. A diagnostic method according to claim 1, wherein
said reordering step (ii) comprises reordering said received
diagnostic data such that diagnostic data produced by execution of
each instruction is arranged in program order.
13. A diagnostic method according to claim 1, wherein said
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel and said processing of said instruction streams by said
plurality of processing units involves interleaving of at least
some of said instructions such that said received diagnostic data
stream relates to at least some interleaved instructions and said
reordering step (ii) comprises determining allowed reordering
operations from said reordering data and performing at least one of
said allowed reordering operations in order to produce a diagnostic
data stream in which at least some interleaving of instructions due
to parallel processing has been removed.
14. A diagnostic method according to claim 1, wherein said
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel and said step of receiving diagnostic data comprises
receiving a plurality of streams of diagnostic data from each of
said plurality of processing units.
15. A diagnostic method according to claim 1, wherein said
diagnostic data comprises trace data.
16. A diagnostic method according to claim 15, wherein said
diagnostic data comprises specific trace data relating to a
synchronisation event and emitted in response to said
synchronisation event.
17. A diagnostic method according to claim 16, wherein said
synchronisation event comprises specific trace data relating to the
initiation or completion of a task on a processing element.
18. A diagnostic method according to claim 16, wherein said
synchronisation event comprises specific trace data relating to the
migration of a task between processing elements.
19. A diagnostic method according to claim 16, wherein said
synchronisation event comprises at least one of: a processing unit
processing one of said streams of instructions switching processing
between threads and communication between at least two of said
threads.
20. A computer program product which is operable, when run on a data
processor, to control the data processor to perform the steps of
the method according to claim 1.
21. A diagnostic apparatus for outputting diagnostic data relating
to processing of instruction streams stemming from a computer
program, at least some of said instruction streams comprising
multiple threads, said diagnostic apparatus comprising: a data
input for receiving diagnostic data from a plurality of
processing units; reordering logic adapted to reorder said received
diagnostic data in dependence upon reordering data, said reordering
data comprising data relating to said computer program; and a data
output for outputting said reordered diagnostic data.
22. A diagnostic apparatus according to claim 21, said reordering
logic being further operable to remove at least one received
diagnostic data item from said received diagnostic data in
dependence upon said reordering data, and to merge at least one
received diagnostic data item from said received diagnostic data
with at least one other received diagnostic data item in dependence
upon said reordering data.
23. A diagnostic apparatus according to claim 21, wherein said
instruction streams are processed by a plurality of processing
units arranged to process at least some of said instructions in
parallel.
24. A diagnostic apparatus according to claim 21, said diagnostic
apparatus comprising a plurality of data inputs for receiving a
plurality of streams of diagnostic data from each of said plurality
of processing units.
25. A diagnostic apparatus according to claim 21, wherein said
diagnostic data comprises trace data and said data output outputs
trace data reordered to more closely follow an ordering of said computer
program.
26. A method of compiling a computer program to be processed by a
plurality of processing units arranged to process at least some of
said instructions in parallel, comprising: generating a plurality
of program fragments to be executed by said plurality of processors
from source code of said computer program; generating additional
information indicative of semantically permissible reorderings of
threads of execution derived from said source code; and outputting said
program fragments and said generated information.
27. A method according to claim 26, wherein said additional
information comprises dependency information indicating conditions
where it is allowable to reorder diagnostic data generated by
execution of said instructions, said additional information being
output separately from said program fragments.
28. A computer program product which is operable when run on a data
processor to control the data processor to perform the steps of the
method according to claim 26.
29. A compiler for compiling a computer program to be processed by
a plurality of processing units arranged to process at least some
of said instructions in parallel, said compiler comprising: a
program fragment generator for generating a plurality of program
fragments to be executed by said plurality of processors from
source code of said computer program; an additional information
generator for generating additional information indicative of
semantically permissible reorderings of threads of execution
derived from said source code; and an output for outputting said
program fragments and said generated information.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention relates to the field of data
processing and in particular to diagnostic mechanisms for
monitoring data processing operations.
[0003] 2. Description of the Prior Art
[0004] Systems are being developed that have multiple processing
units operable to process data in parallel. A compiler will compile
a program to execute on a particular hardware system such that in a
multi-processor system portions of the program will be sent to be
executed by different processing units. Thus, much of the program
will be executed in parallel and execution times and processing
performance will be improved. However, a drawback of this is that
the view of the program execution derived from diagnostic
mechanisms such as debug is difficult for a programmer to
understand and to relate back to the original sequential
program.
[0005] When an application has been split into multiple threads
automatically, there are two conventional approaches to debugging
the application.
[0006] One can restrict the debug view at any time to only those
parts which are equivalent to the original application. For
example, if the program initializes a data-structure, then splits
into four threads to modify the data structure, then waits for the
four threads to complete, one can disallow observation on the data
structure during the time that all four threads are modifying it
because the state of the data structure will not reflect any valid
state of the original single-threaded program. Clearly this has the
disadvantage of not allowing diagnosis of a significant part of the
program.
[0007] Alternatively, one can allow the programmer to observe any
operation at any point at the cost of requiring the programmer both
to understand how the program was parallelized and to debug a
multithreaded program (which is hard to do).
[0008] In summary, when multiple processors are executing in
parallel and independently producing traces of diagnostic
information, it can be hard to understand the resulting traces
independently of each other. This problem can be addressed by
merging the data streams based on dependencies between the
different streams using a topological sort. An example of this is
shown in D. Kimelman and D. Zernik, "On-the-fly topological sort: a
basis for interactive debugging and live visualization of parallel
programs", Proceedings of the 1993 ACM/ONR Workshop on Parallel and
Distributed Debugging, San Diego, California, United States,
pp. 12-20, 1993. While this approach does significantly simplify the
event
stream, it leaves a significant semantic gap between the program as
written by the programmer and the program as executed on the
multi-processor system. For example, if the original program
requested execution of three tasks `T1`, `T2`, `T3` in order, a
parallelizing compiler might detect that `T1` and `T2` must be
executed in sequence but that `T3` may execute in parallel with
`T1` and `T2` and it might decide to execute `T1` and `T2` on
separate processors. If T1 and T2 are executed on separate
processors, the compiler will insert synchronization or
communication operations to ensure that the tasks execute in the
required sequence. By tracing synchronization and communication
operations, it is possible to recognise the dependency between
these tasks. The topological sort would take into account the
dependencies between `T1` and `T2` but, because there is no
dependency between `T3` and the other two tasks, the traces
produced could be any one of:
a) T1 T2 T3
b) T1 T3 T2
c) T3 T1 T2
and which of these three traces is produced may vary from one run
of the program to another. This makes debugging the system harder
because the programmer must disregard the order of some events
(e.g., exactly when T3 runs) while paying careful attention to
other events (e.g., that T1 and T2 execute in sequence).
[0009] The problem identified above comes from the topological sort
making sorting decisions based only on how the hardware runs the
output of the transforming compiler.
[0010] Embodiments of the present invention seek to use information
about the original program and how it was parallelized to present
the information in a way that relates to that original program. In
this example, since the original program requested that the three
tasks execute in order, embodiments of the invention will seek an
event rearrangement of the actual trace to match that order. That
is, they will display only the first trace (T1 T2 T3). The order of
events will not depend on the actual execution order.
[0011] Furthermore, diagnosis can be difficult when an application
has been written as a number of separate threads or even as a
number of separate programs or where it is not possible to modify
the part of the compiler that automatically parallelizes the
program. There are three conventional approaches to understanding
execution in these cases.
[0012] One can observe the operations as a single sequential trace.
Having a single trace is, in some sense, simple, but the trace is
hard to understand because of the interleaving of multiple
independent operations, which requires the programmer to keep many
things in their head at once.
[0013] One can split the trace into a number of separate threads
according to which thread/processor executes each operation. This
is complicated as it requires the programmer to reason carefully
about the communication protocol used between the different
threads.
[0014] One can split the trace into sequences of causally related
operations using statistical means as was done by, for example,
Aguilera et al., "Performance debugging for distributed systems of
black boxes", Proceedings of the Nineteenth ACM Symposium on
Operating Systems Principles, ACM Press, pp. 74-89, 2003. This can
help program understanding but not debugging because it can only
display a subset of the trace (those sequences whose probability of
being causally related is high) and because even though a sequence
is probably correct, it may still be wrong so there is a
possibility of confusion. Furthermore, because of its statistical
nature, long sequences of operations are hard to construct because
it takes a lot of observations before one can be confident that a
sequence of 100 fairly probable events is itself fairly
probable.
[0015] Lamport's work on sequential consistency and clocks, "Time,
Clocks, and the Ordering of Events in a Distributed System",
Communications of the ACM 21(7), July 1978, pp. 558-565, proposes
the "happens before" partial order and suggests extending it to an
arbitrary total order in order to implement a mutual exclusion
protocol. It does not suggest use in debugging.
[0016] It would be desirable to be able to analyse an execution
trace of a program having multiple threads in a similar manner to
the analysis that can be performed for a trace of a sequential
program run on a single processor.
SUMMARY OF THE INVENTION
[0017] A first aspect of the present invention provides a
diagnostic method for outputting diagnostic data relating to
processing of instruction streams stemming from a computer program,
at least some of said instruction streams comprising multiple
threads, said method comprising the steps of:
(i) receiving diagnostic data; (ii) reordering at least one
received diagnostic data item from said received diagnostic data
with respect to at least one other received diagnostic data item in
dependence upon reordering data, said reordering data comprising
data relating to said computer program; and (iii) outputting said
reordered diagnostic data.
[0018] In order to address the problem that a computer program
divided into a plurality of threads is hard to analyse when it is
being processed, the present invention analyses any diagnostic data
generated in conjunction with data relating to the original written
computer program. It uses the information from the original written
computer program as reordering data to help in reordering the
received diagnostic data, and it is this reordered diagnostic data
that is then analysed. Thus, the diagnostic data received can be
reordered into a form that more closely matches the original
computer program and thus is easier for a programmer to understand
and analyse. It should be noted that a computer program may be
divided into a plurality of threads prior to execution when it is
to be executed on multiple processing or execution units in
parallel, or it may be so divided by a compiler in the expectation
that the program will run on such a platform even though execution
is actually performed by a single processor.
[0019] In some embodiments said diagnostic method comprises a
further step: (iia) removing at least one received diagnostic data
item from said received diagnostic data in dependence upon said
reordering data.
[0020] In other embodiments said diagnostic method comprises a
further step: (iib) merging at least one received diagnostic data
item from said received diagnostic data with at least one other
received diagnostic data item in dependence upon said reordering
data.
[0021] In addition to simply reordering the data, data items can
also be removed or merged with other items in response to the
reordering data. For example, where several diagnostic data items
relate to a single higher level entity (for example diagnostic data
from program fragments derived from a single function in the user's
source code), it may be appropriate for those data items to be
merged into a compound item relating to the higher level entity.
Similarly, it may also be appropriate to remove data items which
are irrelevant or redundant to the user's chosen
viewpoint.
[0022] In general an implementation will typically perform one or
more of these operations on each item of data and may well perform
all three many times during a single run.
[0023] In some embodiments said instruction streams are processed
by a plurality of processing units arranged to process at least
some of said instructions in parallel, said diagnostic data being
received from said plurality of processing units.
[0024] The programs are split into multiple threads by a compiler
in order that they can be processed by different processing units
in parallel. It is known that the parallelising of processing has
increased efficiency in this field; however, the understanding, and
thus fault diagnosis, of programs that are processed in parallel is
difficult for a programmer. Thus, if this information can be
presented in a form that more closely matches the original
sequential computer program then this disadvantage of
parallelisation would be alleviated.
[0025] In some embodiments said reordering data further comprises
at least one of generic data relating to how computer programs are
transformed prior to being executed and rules regarding processing
of said instructions stemming from a hardware arrangement of said
processing units.
[0026] In addition to receiving data related to the computer
program itself, further data can also be used to assist reordering.
This further data may comprise generic data relating to rules that
are followed when transforming a computer program prior to
processing it, i.e. information on how it was parallelised, and it
may comprise rules regarding processing of the instructions
stemming from the hardware arrangement of the processing units. The
hardware arrangement of the processing units clearly affects the
way data can be processed in parallel, and a knowledge of this can
help determine how to unravel the parallel data.
[0027] In some embodiments said reordering data further comprises
additional instruction processing dependency data output from a
compiler transforming said program.
[0028] When the compiler compiles the instruction stream, it knows
the original order and also how it has divided it. Thus, it may be
advantageous if it generates dependency information which it then
outputs separately as metadata. This data can then be used when
analysing the program to determine, for example, additional
conditions under which it is legal to reorder events, and thus the
reordering of the instruction stream can be performed to a further
degree and/or more efficiently.
[0029] In some embodiments, said reordering step (ii) comprises
identifying a thread within said instruction stream that an item of
said received diagnostic data is produced by and grouping
diagnostic data produced by a same thread together in execution
order.
[0030] In order to make the diagnostic data more comprehensible to
a programmer it can be helpful if information from different
threads is grouped together. This allows some processing of each
thread to be followed and analysed by a programmer.
[0031] In some embodiments said instruction streams are processed
by a plurality of processing units arranged to process at least
some of said instructions in parallel, said diagnostic data being
received from said plurality of processing units and wherein said
thread migrates between said processing units.
[0032] In symmetric systems, threads may migrate between
programmable processing elements which respond to the same
instruction set. For example, the operating system may recognise
that one processor is heavily loaded while another processor is
idle and arrange to migrate one of the threads executing on the
heavily loaded processor to the idle processor. This migration can
be recognised in the trace data if the operating system emits
diagnostic data when a thread is migrated in this way. In
heterogeneous and asymmetric systems, it is common to arrange for
completion of a task on one processing element to trigger the start
of a task on another processing element. For example, one might
program a system such that:
[0033] 1) a control processor P1 starts a data transfer operation
on a DMA engine and sleeps waiting for an event;
[0034] 2) when the DMA engine completes the transfer, a task is
started on a second processor P2;
[0035] 3) when the task completes on processor P2, it signals
processor P1, which wakes up and resumes execution.
[0036] Such patterns are commonly viewed as a sequence of
individual events on separate processors which leads to a
fragmented view of the overall system behaviour. In some
embodiments it is useful to view such patterns of events as a
single thread migrating between processing elements because this
allows a unified view of the execution across multiple processing
elements.
[0037] Although in some embodiments the processing units may not be
programmable using a sequence of instructions, for example they may
be DMA engines that can be configured to copy particular data but
cannot be configured to copy it in a certain way, in other
embodiments at least one of said processing units is programmable
using a sequence of instructions from an instruction set.
[0038] In some embodiments at least two of said processing units
are responsive to different instruction sets.
[0039] In some embodiments, said reordering step (ii) comprises
identifying diagnostic data generated in response to at least one
object being processed by multiple threads and grouping diagnostic
data relating to a same object together.
[0040] A further way of reordering the diagnostic data to make it
easier to analyse may be to group it by an object that is processed
by multiple threads. This again is an arrangement that is easier
for a programmer to follow and understand than when data is
provided in the order in which it executed on different parallel
units.
[0041] In some embodiments, said instruction streams are processed
by a plurality of processing units arranged to process at least
some of said instructions in parallel and said processing of said
instruction streams by said plurality of processing units involves
interleaving of at least some of said instructions such that said
received diagnostic data stream relates to at least some
interleaved instructions and said reordering step (ii) comprises
determining allowed reordering operations from said reordering data
and performing at least one of said allowed reordering operations
in order to produce a diagnostic data stream in which at least some
interleaving of instructions due to parallel processing has been
removed.
[0042] The processing of the computer program by a plurality of
parallel processing units causes interleaving of at least some of
the instructions. This typically occurs due to instructions being
executed on different processing units at the same time. The
reordering of the diagnostic data generated by these interleaved
instructions so as to reduce the degree of interleaving reduces
potential confusion caused to a programmer by the diagnostic
data.
[0043] In some embodiments, said reordering step (ii) comprises
reordering said received diagnostic data such that diagnostic data
produced by execution of each instruction is arranged in program
order.
[0044] In ideal cases, the diagnostic data is rearranged in program
order. This is clearly a preferred reordering of the program and
the easiest for the programmer to understand. It may not be
possible to achieve such reordering and thus, in some examples, the
removal of some interleaving or the grouping together of threads
may be all that is possible. As usual, program order refers to the
execution sequence of a program when compiled with a compiler which
translates each instruction directly (i.e. without performance
improving transformations) and executed on a single in-order
processor.
[0045] In some embodiments said instruction streams are processed
by a plurality of processing units arranged to process at least
some of said instructions in parallel and said step of receiving
diagnostic data comprises receiving a plurality of streams of
diagnostic data from each of said plurality of processing
units.
[0046] Although the diagnostic data may be received as a single
data stream that is made up of data streams from the plurality of
data processing units that have been merged, in other embodiments
it is received as a plurality of data streams from each of the
processing units.
[0047] Although the diagnostic data may comprise a number of
things, in some embodiments it comprises trace data. Trace data
comprises a stream of trace elements produced in real-time and
representing activities of the data processing apparatus that are
desired to be traced. The activities of a processing unit that one
might want to trace include, but are not limited to, the
instructions being executed by that processor core (referred to as
instruction trace), and the memory accesses made by those
instructions (referred to as data trace). These activities may be
individually traced or traced together, so that the data trace can
be correlated with the instruction trace. The data trace itself
consists of two parts, the memory addresses and the data values,
referred to (respectively) as data address trace and data value
trace.
[0048] In some embodiments, said diagnostic data comprises specific
trace data relating to a synchronisation event and emitted in
response to said synchronisation event.
[0049] A synchronisation event is an event that occurs when two
threads are synchronised, i.e. execution of an instruction in one
thread depends on the other thread. For example, one thread may be
waiting for the other to reach a particular point before continuing
execution. Thus, this event can be used to link the two threads and
potentially link them to a particular place in the program. Such
events are clearly quite important when trying to reorder trace
data from multiple threads and relate it back to the original
written program.
[0050] In some embodiments said synchronisation event comprises
specific trace data relating to the initiation or completion of a
task on a processing element.
[0051] The tracing of the initiation or completion of a task on a
processing element allows migration of the thread through a
heterogeneous system to be tracked.
[0052] In other embodiments said synchronisation event comprises
specific trace data relating to the migration of a task between
processing elements.
[0053] This allows migration in symmetric systems to be traced.
[0054] In some embodiments, said synchronisation event comprises at
least one of: a processing unit processing one of said instruction
streams switching processing between threads and communication
between at least two of said threads.
[0055] When a processing unit switches to processing a different
thread, it would be helpful for the reordering of the trace data if
data is output at this moment as it is then clear that these
threads have been switched between and when relating diagnostic
data to a thread this information is important. Similarly, when
different threads communicate with each other then it is clear
which step needs to be performed first and thus, this information
is useful when trying to reorder diagnostic data and as such it is
helpful if this data is output. This information helps to determine
the order of operations between different threads.
[0056] A second aspect of the invention provides a computer program
product which is operable when run on a data processor to control
the data processor to perform the steps of the method according to
the first aspect of the present invention.
[0057] A third aspect of the invention provides a diagnostic
apparatus for outputting diagnostic data relating to processing of
instruction streams stemming from a computer program, at least some
of said instruction streams comprising multiple threads, said
diagnostic apparatus comprising:
[0058] a data input for receiving diagnostic data from a plurality
of processing units;
[0059] reordering logic adapted to reorder said received diagnostic
data in dependence upon reordering data, said reordering data
comprising data relating to said computer program; and
[0060] a data output for outputting said reordered diagnostic
data.
[0061] A fourth aspect of the present invention provides a method
of compiling a computer program to be processed by a plurality of
processing units arranged to process at least some of said
instructions in parallel, comprising:
[0062] generating a plurality of program fragments to be executed
by said plurality of processors from source code of said computer
program;
[0063] generating additional information indicative of semantically
permissible reorderings of threads of execution derived from said
source code;
[0064] outputting said program fragments and said generated
information.
[0065] As previously mentioned, when compiling a computer program
to be processed by a plurality of processing units, decisions are
made on how the program is divided and which parts are sent to
which processor. Thus, if additional information regarding this
process can be output by the compiler, it could considerably help
any reordering of diagnostic data that occurs later. Thus, it is
advantageous if compilers can generate and output information that
will give hints on how the parallelisation of the program was
performed.
[0066] In some embodiments, said additional information comprises
dependency information indicating conditions where it is allowable
to reorder diagnostic data generated by execution of said
instructions, said additional information being output separately
from said program fragments.
[0067] The additional information may comprise metadata relating
for example to dependency information indicating conditions where
it is allowable to reorder events controlled by the instructions.
Such information is clearly very useful when later trying to
reorder any diagnostic data.
[0068] A fifth aspect of the present invention provides a computer
program product which is operable when run on a data processor to
control the data processor to perform the steps of the method
according to the fourth aspect of the present invention.
[0069] A sixth aspect of the present invention provides a compiler
for compiling a computer program to be processed by a plurality of
processing units arranged to process at least some of said
instructions in parallel, said compiler comprising:
[0070] a program fragment generator for generating a plurality of
program fragments to be executed by said plurality of processors
from source code of said computer program;
[0071] an additional information generator for generating
additional information indicative of semantically permissible
reorderings of threads of execution derived from said source
code;
[0072] a decoder for decoding said plurality of streams of
instructions;
[0073] an output for outputting said decoded streams of
instructions and said generated information.
DESCRIPTION OF THE DRAWINGS
[0074] The present invention will be described further, by way of
example only, with reference to embodiments thereof as illustrated
in the accompanying drawings, in which:
[0075] FIG. 1 schematically shows an embodiment of the present
invention;
[0076] FIG. 2 illustrates a data processing apparatus and a
diagnostic apparatus according to an embodiment of the present
invention;
[0077] FIG. 3 schematically shows a system according to an
embodiment of the present invention;
[0078] FIGS. 4a and 4b are Hasse diagrams illustrating dependencies
between events;
[0079] FIGS. 5a and 5b show a Hasse diagram and corresponding data
structure;
[0080] FIGS. 6a and 6b illustrate simpler dependencies in Hasse
diagram and data structure form;
[0081] FIG. 6c shows a disadvantage of the simplified data
structure;
[0082] FIGS. 7a to 7d schematically show the splitting into
separately executable sections of a computer program according to
an embodiment of the present invention;
[0083] FIGS. 8a and 8b schematically show a method of splitting and
then merging sections of a computer program;
[0084] FIG. 9 schematically shows data communication between two
sections of a program;
[0085] FIG. 10a shows a simple computer program annotated according
to an embodiment of the present invention;
[0086] FIG. 10b shows the maximal set of threads for the program of
FIG. 10a;
[0087] FIG. 11 schematically illustrates an asymmetric
multiprocessing apparatus with an asymmetric memory hierarchy;
[0088] FIG. 12 illustrates an architectural description;
[0089] FIG. 13 illustrates a communication requirement; and
[0090] FIG. 14 illustrates communication support.
DESCRIPTION OF EMBODIMENTS
[0091] FIG. 1 shows very schematically an embodiment of the present
invention. This embodiment comprises a system 10 which is a
multi-threaded program running on a multiprocessor system. When
this system is being traced, a stream of trace data 12 which
consists of a complex sequence of events of all the different
threads and processes in the system is produced. This trace data is
very hard to analyse and thus the invention uses trace reordering
logic 20 to reorder this trace into an order which more closely
resembles the order of the original program and as such is easier
for a programmer to understand. This reordered trace is then input
to trace user interface 30.
[0092] The trace reordering logic can reorder the trace data in a
number of ways. In the embodiments shown, the complex sequence of
events that are output from the system being traced comprise
synchronisation events, which are indicated by capital letters in the
stream of data 12. These synchronisation events are events where
one processor is synchronised with another processor and thus,
different threads being processed on different processors are
synchronised at this point. The trace reordering logic can utilise
this information to reorder the trace to produce an alternative,
simpler sequence of events which could legally have been produced
by the program. This data is then input to the trace user interface
30 where it may be displayed to the user or compressed and stored
to disc for later replay.
[0093] Although this figure shows a single event stream 12, in
practice there may be multiple event streams, for example one for
each processor. These may then be merged prior to the trace
reordering process, or the merging could be performed as part of
the reordering process.
[0094] Events that are traced can be quite low level events, such
as individual memory accesses and instruction execution, or they
can be quite high level events, such as the start and stop of
remote procedure calls, data transfers between different
processors, or the insertion/removal of an entry from an
inter-thread communication channel. Events may be generated
automatically by hardware or they may be generated in response to
operations in the program. For example, a communication library may
explicitly generate an event on every send or receive, or an
operating system may explicitly generate an event on every context
switch.
[0095] Embodiments of the present invention may also comprise a
compiler 40 into which the original program is input to be compiled
to run on the multiprocessor system. The compiler parallelises the
program to run on the different processors and produces program
fragments. It should be noted that in some embodiments the
compiler provides multiple threads from the program which would be
suitable to run on a multiprocessor system but may be run on a
single processor system. In this embodiment, compiler 40 in
addition to producing the parallelised threads or program fragments
also produces debug information 42 and dependency information 44.
The dependency information indicates additional conditions under
which it is legal to reorder events. This information is input to
the trace reordering logic 20. By providing more information about
the dependencies present in the original program, the trace
reordering logic is able to provide further reordering and
potentially produce a simpler trace.
[0096] The debug information 42 which may also comprise dependency
data can be input to the user interface where it can be used when
displaying the reordered diagnostic data.
[0097] The compiler may also insert trace generating operations
into the program to reduce the number of possible re-orderings
available. This may allow the trace reordering to run faster or use
less memory, and it may help to make the reconstructed trace more
similar to a trace of the original program.
[0098] FIG. 2 shows another embodiment of the present invention
which is similar to FIG. 1 but illustrated in a different manner.
In this embodiment data processing apparatus 50 comprises a data
store 52 in which the source program is stored, a compiler 40 and
multiple processing units 60, 62, 64 and 66. The compiler 40
compiles the source program from data store 52 into different
threads of execution and these are sent to the multiple processors
60, 62, 64, 66. These each have a trace unit and output trace
information to output 54. Output 54 also serves to merge these
input streams and outputs a single trace data stream 12 to
diagnostic apparatus 70. Output 54 also receives dependency
information from the compiler 40 and this is output alongside the
merged trace data stream as metadata. Diagnostic apparatus 70
comprises a data store or buffer 72 for storing the trace data,
reorder logic 20 and a display 30.
[0099] FIG. 3 shows an alternative depiction of an embodiment of
the present invention. This embodiment shows the input data which
is the source program 80 and a description of the system 82. The
description of the system 82 comprises a description of the
hardware processing unit 90 that processes the program. There is
then a compiler 40 which is implemented in software in this
embodiment and which compiles the program into program fragments
for processing by the various processing units 60, 62 within
hardware 90. Each processing unit has its own trace unit 100, 102
which generates the trace data. This trace data is output as a
trace data stream 12 to reordering logic 20. This reordering logic
may be either hardware or software. This reordering logic also
receives program-specific dependency data from the compiler and
generic dependency data relating to the system. It may also receive
reordering criteria from the user. It uses this data to reorder the
trace data to produce a reordered trace which is then sent to user
interface 30.
[0100] The reordering logic uses different reordering rules to
reorder the trace data depending on the embodiment. In some
systems, different processors and different threads can run at
different rates from each other and the relative rates may be
unknown. Thus, any correct program must contain synchronisation
between threads and processors whenever the threads communicate.
These are synchronisation events. In such systems, we may assume
that events in different threads are not ordered with respect to
each other unless the two threads directly or indirectly
synchronise with each other. This assumption allows a significant
amount of trace reordering to be performed. In order to do this,
these events are identified as synchronisation events.
[0101] Some of the techniques/rules used for reordering depend on
the goals of the reordering. For example, if the goal is to gain a
high level understanding of the program, then the number of context
switches, that is, switches between different threads, may be
reduced. Furthermore, the maximum number of simultaneous live
threads at any point in the trace should be reduced. Furthermore,
the number of events between a causing event and a consequence of
that event should also be reduced to make the trace easier to
understand. The amount of reordering should be kept to only
reordering events where it simplifies the trace.
[0102] Although in some embodiments the trace may be complete for
the whole program, in other embodiments only a part of the program
is traced.
[0103] There are a number of ways that trace reconstruction can be
performed. The following describes a simple online algorithm that
can be used for reordering trace.
[0104] In common with many online algorithms, the event stream can
be split into three regions:
[0105] 1. Processed events--these have been reordered by the
algorithm.
[0106] 2. The reconstruction "window"--events which are in the
process of being reordered. The window may be of fixed or variable
size.
[0107] 3. Unobserved events--events that have not yet been
considered for reordering.
[0108] Events within the window are represented by an appropriate
data structure and trace reconstruction consists of repeatedly
either adding the next unobserved event to the window or removing
an event from the window and adding it to the processed events.
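By way of illustration only, a minimal sketch of this scheme is
given below. It is not taken from the embodiments: the Event
record, the dependsOn() rule and the container choices are
assumptions introduced for this example.

  #include <algorithm>
  #include <deque>
  #include <vector>

  // Illustrative sketch only; Event and dependsOn() are assumed.
  struct Event {
      int id;
      int thread;
      bool emitted = false;
      std::vector<Event*> deps;  // events that must precede this one
  };

  // Assumed dependency test: events on the same thread are totally
  // ordered by arrival; real reordering data would add rules for
  // synchronisation and communication events.
  static bool dependsOn(const Event* before, const Event* after) {
      return before->thread == after->thread;
  }

  class Reorderer {
      std::deque<Event*> window;      // region 2: being reordered
      std::vector<Event*> processed;  // region 1: final order
  public:
      // Region 3 -> region 2: admit the next unobserved event,
      // linking it to every window event that must precede it.
      void add(Event* e) {
          for (Event* w : window)
              if (dependsOn(w, e)) e->deps.push_back(w);
          window.push_back(e);
      }
      // Region 2 -> region 1: emit an event all of whose
      // predecessors have been emitted; this sketch takes the first
      // candidate, whereas a real implementation would apply the
      // heuristics described later.
      bool removeOne() {
          for (auto it = window.begin(); it != window.end(); ++it) {
              Event* e = *it;
              bool ready = std::all_of(e->deps.begin(), e->deps.end(),
                  [](const Event* d) { return d->emitted; });
              if (ready) {
                  e->emitted = true;
                  processed.push_back(e);
                  window.erase(it);
                  return true;
              }
          }
          return false;  // nothing removable; admit more events first
      }
  };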
[0109] The reordering is restricted by dependencies between events:
two events cannot be reordered relative to each other if one must
come before the other. For example, we write `x.fwdarw.y` to
indicate that event `x` must occur before event `y` in the
reordered trace. This relationship is transitive: if `x.fwdarw.y`
and `y.fwdarw.z` then it must be true that `x.fwdarw.z`. The usual
convention of using a Hasse diagram to represent the .fwdarw.
relation is used. For example, the relation `a.fwdarw.b`,
`b.fwdarw.c`, `a.fwdarw.c`, `b.fwdarw.d`, `c.fwdarw.d`,
`a.fwdarw.d` is represented by FIG.
4a.
[0110] It is assumed that the initial event stream is ordered such
that if `x.fwdarw.y` then event `x` comes before event `y`.
[0111] To add an event `x` to the window, we identify all events
`e` already within the window such that `e.fwdarw.x` and add a link
from each such event to `x`. As an optimisation, an edge can be
omitted or an existing edge deleted if there is an edge from event
`e` to some other event `f` and there is an edge from `f` to
`x`.
[0112] For example, if we add an event `a5` to FIG. 4a and
`a4.fwdarw.a5` and `c2.fwdarw.a5`, then the window is updated as is
depicted in FIG. 4b.
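For illustration, the edge-omission rule of paragraph [0111] might
be checked as in the following sketch; the Node type is an
assumption standing in for the window graph, and the recursion
relies on the window graph being acyclic (it is a Hasse diagram).

  #include <vector>

  // Illustrative sketch only; Node is an assumed graph node.
  struct Node {
      std::vector<Node*> out;  // outgoing .fwdarw. edges
  };

  // Depth-first reachability; no visited set is needed here
  // because the window graph is acyclic.
  static bool reaches(const Node* from, const Node* to) {
      if (from == to) return true;
      for (const Node* n : from->out)
          if (reaches(n, to)) return true;
      return false;
  }

  // The edge e -> x may be omitted (or deleted) if some other
  // successor f of e already reaches x, i.e. e -> f and f .fwdarw. x.
  bool redundant(const Node* e, const Node* x) {
      for (const Node* f : e->out)
          if (f != x && reaches(f, x)) return true;
      return false;
  }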
[0113] An event `y` can only be removed from the window (and added
to the end of the processed event list) if there is no other event
`x` within the window such that `x.fwdarw.y`. There are often
multiple events that can be removed from the window. For example,
in the above example, events `a1`, `b1` and `c1` can be removed
from the window. When there are multiple events that can be removed
from the window, heuristics are used to choose the best event to
remove.
[0114] A number of different data structures may be used to
represent the events within the window. An obvious representation
is to reflect the `.fwdarw.` relation directly using a directed
graph structure. For example, the events and dependencies between
them of FIG. 5a can be represented by the data structure of FIG.
5b. Instead of representing the .fwdarw. relation exactly, it is
also possible to use an approximation. For example, by restricting
the complexity of dependencies allowed within the window, it is
possible to use FIFO queues. For example, the window of FIG. 6a can
be represented by the data structure of FIG. 6b.
[0115] The rule here is that an event `x` comes before an event `y`
in a queue if and only if `x.fwdarw.y`. This has the effect of
restricting the window to a set of totally ordered events (e.g.,
all the events generated by a single thread or a single processor).
The disadvantage of this simplified representation is that it is
not capable of representing some types of dependency. For example,
if the event `b2` is added and `a2.fwdarw.b2`, then `a1` and `a2`
must be removed from the window, see FIG. 6c.
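A sketch of this queue-based window is given below; the per-thread
keying and the minimal event record are assumptions introduced for
this example.

  #include <deque>
  #include <map>

  // Illustrative sketch only; the minimal record below is assumed.
  struct QueuedEvent { int id; int thread; };

  // One FIFO queue per thread (or processor), as in FIG. 6b: within
  // a queue, event x comes before event y if and only if x.fwdarw.y,
  // so only the queue heads are candidates for removal.
  std::map<int, std::deque<QueuedEvent>> queues;

  void addEvent(const QueuedEvent& e) {
      queues[e.thread].push_back(e);  // arrival order is .fwdarw. order
  }

  // Cross-queue dependencies must still be checked separately
  // before a head is removed and, as FIG. 6c illustrates, some
  // dependency shapes cannot be represented by this structure.
  const QueuedEvent* head(int thread) {
      auto& q = queues[thread];
      return q.empty() ? nullptr : &q.front();
  }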
[0116] When an event is removed from the window (and added to the
list of processed events), there is often a choice of several
events that could be removed. In such cases, some simple rules for
choosing which event to remove are:
[0117] 1. If a removable event `e` is on the same processor as the
last event removed, then remove event `e`. (This reduces the amount
of context switching for the user.)
[0118] 2. If a removable event `e` is on the same thread as the
last event removed, then remove event `e`. (This reduces the amount
of context switching for the user.)
[0119] 3. If a removable event `e` depends on the last event
removed (that is, the last event removed was `f` and `f.fwdarw.e`),
then remove event `e`. (This makes causal relationships more
obvious.)
[0120] 4. If a removable event `e` is next in program order, then
remove event `e`.
[0121] 5. If a removable event `e` has many events that depend on
it, then remove event `e`. (This may encourage large clusters of
related events to be removed.)
[0122] 6. If a removable event `e` has few events that depend on
it, then remove event `e`. (This contradicts the previous heuristic
and helps reduce the total number of removable events at any one
time.)
[0123] 7. Remove the oldest event. (This helps reduce the amount of
reordering.)
[0124] 8. Assign each removable event a score based on the degree
to which it meets any of the above criteria and choose the event
with the highest score, as in the sketch below.
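For example, rule 8 might combine the earlier rules into a weighted
score as sketched here; the particular weights and fields are
arbitrary assumptions, not taken from the described embodiments.

  // Illustrative sketch of rule 8 only; weights are assumed.
  struct Candidate {
      bool sameProcessorAsLast;  // rule 1
      bool sameThreadAsLast;     // rule 2
      bool dependsOnLast;        // rule 3
      bool nextInProgramOrder;   // rule 4
      int  dependentCount;       // rules 5/6: events depending on it
      int  age;                  // rule 7: time spent in the window
  };

  int score(const Candidate& c) {
      int s = 0;
      if (c.sameProcessorAsLast) s += 2;
      if (c.sameThreadAsLast)    s += 3;
      if (c.dependsOnLast)       s += 4;
      if (c.nextInProgramOrder)  s += 5;
      s += c.dependentCount;  // favours large clusters (rule 5);
                              // negate this term to follow rule 6
      s += c.age;             // prefer older events (rule 7)
      return s;               // rule 8: remove the highest scorer
  }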
[0125] Some examples of events and dependencies for detailed trace
are given below. Some processors support generation of detailed
traces consisting of memory accesses and instructions. For such
processors, appropriate dependencies might be:
[0126] Instructions executed on the same thread are totally
ordered: if instruction i1 is executed before instruction i2, then
i1.fwdarw.i2. (On a superscalar processor, we mean that instruction
i1 occurs before i2 in program order or is retired before i2.)
[0127] If one thread waits for an event to happen and another
thread signals that event, then there is a dependency from the
signal instruction to the wait instruction.
[0128] If an instruction i2 in one thread reads from a memory
location previously written to by an instruction i1 in another
thread, there is a dependency i1.fwdarw.i2 from the instruction
that wrote to the memory location to the instruction that read it.
In many cases, these dependencies should only be considered if the
write is guaranteed to be observable--for example, if a memory
barrier operation or a thread-synchronization operation has been
used. Where the write is not guaranteed to be observable, it may be
appropriate to report a possible error.
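These three rules might be encoded as in the following sketch; the
instruction record is an assumed layout, introduced for
illustration only.

  // Illustrative sketch of the three rules above.
  struct Instr {
      int  thread;
      int  seq;         // per-thread execution (or retirement) order
      bool isSignal;
      bool isWait;
      int  eventId;     // pairs a signal with its wait; -1 if none
      long readAddr;    // -1 if the instruction reads no memory
      long writeAddr;   // -1 if the instruction writes no memory
      bool observable;  // write ordered by a barrier/synchronization
  };

  bool mustPrecede(const Instr& a, const Instr& b) {
      // Same-thread instructions are totally ordered.
      if (a.thread == b.thread) return a.seq < b.seq;
      // A signal precedes the wait it satisfies.
      if (a.isSignal && b.isWait && a.eventId == b.eventId &&
          a.eventId != -1) return true;
      // An observable write precedes a read of the same location.
      if (a.writeAddr != -1 && a.writeAddr == b.readAddr &&
          a.observable) return true;
      return false;  // an unobservable cross-thread write might
                     // instead be reported as a possible error
  }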
[0129] Some examples of events and dependencies for coarse-grained
trace are given below.
[0130] Some systems may generate much coarser-grained traces
consisting of when a given task (e.g., computing a result) starts
and stops and when threads/processors communicate/synchronize with
each other. For such systems, appropriate dependencies might be:
[0131] Tasks executed on the same thread are totally ordered: if
task t1 is executed before task t2, then t1.fwdarw.t2.
[0132] If completion of one task `t1` triggers another task `t2` to
start, then `t1.fwdarw.t2`.
[0133] Communication between threads can be `direct` or `buffered`.
In direct communication, if a thread `A` sends a value to a thread
`B`, the next value read by `B` will be the last value written by
`A`. In such communication, there is a dependency from the event
associated with sending a value to the next event associated with
receiving a value. In buffered communication, if a thread `A` sends
a value to a thread `B`, the next value read by `B` will be one of
the values previously sent by `A`. For example, when using
a FIFO channel between threads, values are received by B in the
order they are sent by A. In such communication, there is a
dependency from the event associated with sending a value to the
event associated with receiving that particular value from the
channel.
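For illustration, buffered (FIFO) pairing might be implemented as
sketched below, relying on the input ordering property of paragraph
[0110] (a send is observed before the receive that depends on it);
the channel and event identifiers are assumptions for this example.

  #include <map>
  #include <utility>
  #include <vector>

  // Illustrative sketch of FIFO send/receive pairing only.
  std::map<int, long> sendCount, recvCount;       // per channel
  std::map<std::pair<int, long>, int> sendEvent;  // (channel, n)
  std::vector<std::pair<int, int>> deps;          // send -> receive

  void onSend(int channel, int eventId) {
      sendEvent[{channel, sendCount[channel]++}] = eventId;
  }

  void onRecv(int channel, int eventId) {
      // The n-th receive on a FIFO channel depends on the n-th send.
      long n = recvCount[channel]++;
      deps.push_back({sendEvent[{channel, n}], eventId});
  }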
[0134] There is often an overhead associated with generating events
so it is desirable to reduce the amount of trace generated. This
can be done by using coarser-grained trace but it can also be done
by generating less detail in the trace and/or by exploiting the
semantics of the operations generating the trace. For example, to
model a buffered communication precisely, one must include enough
information in an event so that the event caused by reading a
particular value can be matched with the event that wrote that
particular value. The amount of trace can be reduced by omitting
this information, in which case all receive events are considered
to be dependent on all preceding send events. Alternatively, if the
communication is through a FIFO channel, the amount of trace can be
reduced by exploiting the fact that values are received in the same
order as they are written so the first send to that channel can be
paired with the first receive from the channel, etc.
[0135] If trace is generated by executing instructions which emit
events into the trace, it can be convenient to emit trace in
communication and scheduling libraries. For example, the scheduler
can emit an event on every context switch and communication
libraries can emit events before sending an event, after sending an
event, before receiving an event and after receiving an event.
[0136] It may also be convenient to generate trace in parallelising
compilers. If a compiler automatically parallelizes a program, the
compiler can insert instructions to emit trace events to allow more
complete reconstruction of the trace. If the program is split into
multiple threads and the threads frequently communicate with each
other, it is often sufficient to emit events when the threads
communicate which can be done using a communication library that
emits events as previously described. If the threads communicate
infrequently, it may be desirable for the compiler to emit
additional events. For example, if the original program is:
TABLE-US-00001
  for (int i = 0; i < 100; ++i) { f(); g(); }
and the compiler splits this program into two threads that do not
communicate with each other:
TABLE-US-00002
  for (int i = 0; i < 100; ++i) { f(); }
and
  for (int i = 0; i < 100; ++i) { g(); }
then there may not be enough information to reconstruct a properly
synchronized trace showing alternating calls to functions `f` and
`g`. In this case, the compiler can insert instructions to emit an
event after invoking `f` and before invoking `g` resulting in
threads:
TABLE-US-00003
  for (int i = 0; i < 100; ++i) { f(); emit(F); }
and
  for (int i = 0; i < 100; ++i) { emit(G); g(); }
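By way of illustration, the emitted markers might be turned into
dependencies as sketched below, restoring the alternation f, g, f,
g of the original loop; the encoding of event kinds and iteration
indices is an assumption introduced for this example.

  // Illustrative sketch only: the i-th G event depends on the i-th
  // F event (f() ran before g() in each original iteration), and
  // the i-th G event precedes the (i+1)-th F event.
  bool mustPrecede(char kind1, long i1, char kind2, long i2) {
      if (kind1 == kind2) return i1 < i2;                // same thread
      if (kind1 == 'F' && kind2 == 'G') return i1 <= i2; // F_i -> G_i
      if (kind1 == 'G' && kind2 == 'F') return i1 < i2;  // G_i -> F_i+1
      return false;
  }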
[0137] It should be noted that although the above described
techniques are particularly useful for multiprocessor SoCs, they
can also be applied to traces of VLIW execution (parallelism
between functional units) and to traces of remote execution
(parallelism between computers).
[0138] In summary, embodiments of the invention that are able to
modify the parallelizing compiler may insert enough trace
generation hints so that, instead of an arbitrary total order, a
total order corresponding to the original sequential program is
generated.
[0139] If an embodiment is not able to insert all the trace
generation hints needed to achieve a total order, our
reconstruction algorithm uses heuristics to reduce the number of
context switches in the trace.
[0140] To achieve complete reconstruction, it helps if the
parallelizing compiler inserts hints in the code that make it
easier to match up corresponding parts of the program. In the
absence of explicit hints, it may be possible to obtain full
reconstruction using debug information to match parts of the
program.
[0141] When there are no explicit hints or debug information,
partial reconstruction can be achieved by using points in the
program that synchronize with each other to guide the matching
process. The resulting trace will not be sequential but will be
easier to understand. A useful application is to make it simpler to
understand a trace of a program written using an event-based
programming style (e.g., a GUI, interrupt handlers, device drivers,
etc.)
[0142] Partial reconstruction could also be used to simplify
parallel programs running on systems that use release consistency.
Such programs must use explicit memory barriers at all
synchronization points so it will be possible to simplify traces to
reduce the degree of parallelism the programmer must consider.
[0143] One simple case of this is reconstructing a `message
passing` view of bus traffic.
[0144] In summary, reordering of trace can make use of multiple
sources of information including (in approximately increasing order
of specificity):
[0145] 1) Information about the compiler which is independent of
the compilation of the program such as:
[0146] a) The style of parallelization used
[0147] b) Naming conventions used for variables/functions
introduced during parallelization.
[0148] c) How programs are instrumented to produce trace (e.g., a
trace event might be generated at the end of every loop).
[0149] 2) Information about how a particular program was compiled
such as:
[0150] a) Which sections of code were parallelized.
[0151] b) Where threads and communication/synchronization between
threads was introduced.
[0152] c) How sections of code in the original program relate to
sections of code in the parallelized program, e.g., line 23 of the
original program might correspond to line 45 in the parallelized
program.
[0153] d) How variables in the original program relate to variables
in the parallelized program, e.g., a variable `x` in the original
program might have been split into two parts `x1` and `x2` in the
parallelized program. [0154] e) What instrumentation has been
introduced into this program, e.g., an event might be generated
indicating how many times a particular loop executed.
[0155] 3) Information about this particular execution. This
primarily consists of the trace data but might also include
information about which processors executed particular threads
(say).
[0156] 4) User preferences.
[0157] Details of further techniques are given below.
[0158] FIG. 7a shows a portion of a computer program comprising a
loop in which data items are processed: function f operates on the
data items, function g operates on the data items output by
function f, and function h then operates on those items. These
functions are performed n times in a row, for values of i from 1
to n.
[0159] Thus, the control flow can be seen as following the solid
arrows while data flow follows the dotted arrows. In order to try
to parallelise this portion of the computer program, it is analysed,
either automatically or by a programmer, and "decouple" indications
are inserted into the data flow where it is seen as desirable
to split the portion into sections that are decoupled from each
other and can thus be executed on separate execution mechanisms.
In this case, a decouple indication is provided between the data
processing operations f and g. This can be seen as being equivalent
to inserting a buffer in the data flow, as the two sections are
decoupled by providing a data store between them so that the
function f can produce its results, which can then be accessed at a
different time by function g.
[0160] FIG. 7c shows how the program is amended to enable this
decoupling by the insertion of "put" and "get" instructions into
the data stream. These result in the data generated by the f
function being put into a data store, from which it is retrieved by
the get instruction to be processed by function g. This enables the
program to be split into two sections, as shown in FIG. 7d. One
section performs function f on the data for i=1 to n and puts it
into a buffer data store. The other section then retrieves this
data and performs functions g and h on it. Thus, by the provision
of a data store the two sections of the program are in effect
decoupled from each other and can be executed on separate
execution mechanisms. This decoupling by the use of a specialised
buffer, and extra instructions to write and read data to it, is
only required for systems having heterogeneous memory, whereby two
execution mechanisms may not be able to access the same memory. If
the memory is shared, then the data path between the two sections
does not need a data copy but can simply be the provision of a data
store identifier. Thus, if the program is being processed by a data
processing apparatus having a number of different processors, the
two sections can be processed in parallel, which can improve the
performance of the apparatus. Alternatively, one of the functions
may be suitable for processing by an accelerator, in which case it
can be directed to an accelerator while the other portion is
processed by, say, the CPU of the apparatus.
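For concreteness, the split of FIG. 7d might be sketched in C as
follows, where the channel type and the put/get operations are
assumed primitives of the kind described here, and f, g and h are
the functions of FIG. 7a:

  typedef struct channel channel;           /* the buffer data store */
  void put(channel *ch, const int *item);   /* copy one item in      */
  void get(channel *ch, int *item);         /* copy next item out    */
  int f(int), g(int), h(int);               /* functions of FIG. 7a  */

  /* Section 1 of FIG. 7d: apply f and put the results in the store. */
  void section1(channel *ch, const int *in, int n) {
      for (int i = 0; i < n; ++i) {
          int t = f(in[i]);
          put(ch, &t);                  /* the inserted "put" of FIG. 7c */
      }
  }

  /* Section 2 of FIG. 7d: get each result and apply g then h. */
  void section2(channel *ch, int *out, int n) {
      for (int i = 0; i < n; ++i) {
          int t;
          get(ch, &t);                  /* the inserted "get" of FIG. 7c */
          out[i] = h(g(t));
      }
  }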
[0161] As can be seen from FIG. 7d, the splitting of the program
results in the control code of the program being duplicated in both
sections, while the data processing code is different in each
section.
[0162] It should be noted that the put and get operations used in
FIG. 7c can be used in programs both for scalar and non-scalar
values but they are inefficient for large (non-scalar) values as
they require a memory copy. In operating systems, it is
conventional to use "zero copy" interfaces for bulk data transfer:
instead of generating data into one buffer and then copying the
data to the final destination, the final destination is first
determined and the data directly generated into the final
destination. A different embodiment of the invention applies this
idea to the channel interface, by replacing the simple `put`
operation with two functions: put_begin obtains the address of the
next free buffer in the channel and put_end makes this buffer
available to readers of the channel:
[0163] void* put_begin(channel *ch);
[0164] void put_end(channel *ch, void* buf);
Similarly, the get operation is split into a get_begin and get_end
pair:
[0165] void* get_begin(channel *ch);
[0166] void get_end(channel *ch, void* buf);
Using these operations, sequences of code such as:
[0167] int x[100];
[0168] generate(x);
[0169] put(ch,x);
can be rewritten to this more efficient sequence:
[0170] int *px=put_begin(ch);
[0171] generate(px);
[0172] put_end(ch,px);
And similarly, for get:
[0173] int x[100];
[0174] get(ch,x);
[0175] consume(x);
to this more efficient sequence:
[0176] int *px=get_begin(ch);
[0177] consume(px);
[0178] get_end(ch,px);
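One possible realisation of this channel interface is sketched
below for a single producer and single consumer; the fixed entry
count and size, the busy-wait synchronisation, and the omission of
the memory barriers and cache maintenance a real heterogeneous
system would need are all simplifying assumptions:

  #define ENTRIES    2           /* e.g. double buffering: copies=>2  */
  #define ENTRY_SIZE 400         /* bytes per entry, e.g. int[100]    */

  typedef struct channel {
      char buf[ENTRIES][ENTRY_SIZE];
      volatile unsigned wr, rd;  /* counts of completed puts and gets */
  } channel;

  void *put_begin(channel *ch) {
      while (ch->wr - ch->rd == ENTRIES)
          ;                      /* wait for a free entry */
      return ch->buf[ch->wr % ENTRIES];
  }
  void put_end(channel *ch, void *buf) { (void)buf; ch->wr++; }

  void *get_begin(channel *ch) {
      while (ch->wr == ch->rd)
          ;                      /* wait for data */
      return ch->buf[ch->rd % ENTRIES];
  }
  void get_end(channel *ch, void *buf) { (void)buf; ch->rd++; }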
[0179] The use of puts and gets to decouple threads can be further
extended to cases where communication between threads is cyclic.
Cyclic thread dependencies can lead to "Loss of Decoupling"--that
is, two threads may not run in parallel because of data
dependencies between them--and thus, in devices of the prior art,
decoupling is generally limited to acyclic thread dependencies.
[0180] 1. A particularly common case of cyclic thread dependencies
is code such as
TABLE-US-00004
  y = 1;
  while(1) { x = f(y); y = g(x); }
[0181] Under conventional decoupling schemes, puts are always
inserted after assignment to any data boundary variable. This would
require both a put outside the loop and a put at the end of the
loop:
TABLE-US-00005
  y1 = 1;
  put(ch,y1);
  while(1) {
    y2 = get(ch);
    x = f(y2);
    y3 = g(x);
    put(ch,y3);
  }
[0182] Conventional decoupling schemes only generate matched pairs
of puts and gets (i.e., there is only one put on each channel and
only one get on each channel) so they cannot generate such
code.
[0183] Embodiments of the present invention use an alternative way
of decoupling this code and generate:
TABLE-US-00006
  y1 = 1;
  while(1) {
    put(ch,y1);
    y2 = get(ch);
    x = f(y2);
    y1 = g(x);
  }
[0184] This does have matched pairs of puts and gets but breaks the
rule of always performing a put after any assignment to a
variable.
[0185] FIGS. 8a and 8b schematically illustrate the program code
shown in FIG. 7. In these Figures a data store is provided to
decouple functions f and g, but one is not provided between g and
h. In this embodiment, analysis of the program to decouple it is
performed automatically and several potential sections are
produced; in this case these are loops having functions f, g and h
in them. The automatic analysis then checks that each loop can be
executed separately and, in this case, identifies a missing data
path between functions g and h. Thus, these two functions are
remerged, giving two sections with a data path between them.
[0186] FIG. 9 shows in more detail the data path between the two
program sections. As can be seen in this figure, it is a data array
that is transferred; that is, the data from the whole loop is
transferred in a single transaction. This is clearly advantageous
compared to transferring data on each pass through the loop. In
particular, by parallelizing at a coarse granularity, the need for
the low latency, high throughput communication mechanisms used in
prior art finer granularity devices is reduced.
[0187] Furthermore, parallelizing at a significantly coarser
granularity also allows the duplication of more control code
between threads, which reduces and simplifies inter-thread
communication, allowing the generation of distributed schedules.
That is, we can distribute the control code across multiple
processors, both by putting each control thread on a different
processor and by putting different parts of a single control thread
onto different processors.
[0188] The transfer of data may be done by writing the data to a
particular buffer such as a FIFO. Alternatively, it may simply be
done by providing the other section of the program with information
as to where the data has been stored.
[0189] The way of transferring the data depends on the system the
program is executing on. In particular, if the architecture does
not have shared memory, it is necessary to insert DMA copies from a
buffer in one memory to a buffer in a different memory. This can
lead to a lot of changes in the code: declaring both buffers,
performing the copy, etc. In embodiments of the invention an
analysis is performed to determine which buffers need to be
replicated in multiple memory regions and to determine exactly
which form of copy should be used. DMA copies are also inserted
automatically subject to some heuristics when the benefit from
having the programmer make the decision themselves is too
small.
[0190] Systems with multiple local memories often have tight memory
requirements which are exacerbated by allocating a copy of a buffer
in multiple memories. The analysis takes account of this and seeks
to reduce the memory requirement by overlapping buffers in a single
memory when they are never simultaneously live.
[0191] It should be noted that although in some programs it may be
appropriate to provide a FIFO type data store between the sections,
in others it may be that the section requiring the data does not
require it in a particular order, or it may not require all of the
data. This can be provided for by varying the way the data is
passed between the sections.
[0192] FIG. 10a shows a simple computer program annotated according
to an embodiment of the present invention. An analysis of this
program is performed initially, and parts of the program are
identified, by programmer annotation in this embodiment, although
they could be identified by some other analysis including static
analysis, profile driven feedback, etc. The parts identified are as
follows:
[0193] What can be regarded as the "decoupling scope". This is a
contiguous sequence of code that we wish to split into multiple
threads.
[0194] The "replicatable objects": that is variables and operations
which it is acceptable to replicate. A simple rule of thumb is that
scalar variables (i.e., not arrays) which are not used outside the
scope, scalar operations which only depend on and only modify
replicatable variables, and control flow operations should be
replicated but more sophisticated policies are possible.
[0195] Ordering dependencies between different operations: if two
function calls both modify a non-replicated variable, the order of
those two function calls is preserved in the decoupled code.
(Extensions to the basic algorithm allow this requirement to be
relaxed in various ways.)
[0196] The "data boundaries" between threads: that is, the
non-replicatable variables which will become FIFO channels. (The
"copies" data annotation described above determines the number of
entries in the FIFO.)
[0197] This degree of annotation is fine for examples but would be
excessive in practice so most real embodiments would rely on tools
to add the annotations automatically based on heuristics and/or
analyses.
[0198] At a high level, the algorithm splits the operations in the
scope into a number of threads whose execution will produce the
same result as the original program under any scheduling policy
that respects the FIFO access ordering of the channels used to
communicate between threads.
[0199] The particular decoupling algorithm used generates a maximal
set of threads such that the following properties hold: [0200] All
threads have the same control flow structure and may have copies of
the replicatable variables and operations. [0201] Each
non-replicatable operation is included in only one of the threads.
[0202] Each non-replicatable variable must satisfy one of the
following: [0203] The only accesses to the variable in the original
program are reads; or [0204] All reads and writes to the variable
are in a single thread; or [0205] The variable was marked as a data
boundary and all reads are in one thread and all writes are in
another thread. [0206] If two operations have an ordering
dependency between them which is not due to a read after write
(RAW) dependency on a variable which has been marked as a data
boundary, then the operations must be in the same thread.
[0207] FIG. 10b shows the maximal set of threads for the program of
FIG. 10a. One way to generate the set of threads shown in FIG. 10b
is as follows: [0208] 1. For each non-replicatable operation,
create a `protothread` consisting of just that operation plus a
copy of all the replicatable operations and variables. Each
replicatable variable must be initialized at the start of each
thread with the value of the original variable before entering the
scope and one of the copies of each replicatable variable should be
copied back into the master copy on leaving the scope. (Executing
all these protothreads is highly unlikely to give the same answer
as the original program, because it lacks the necessary
synchronization between threads. This is fixed by the next steps.)
[0209] 2. Repeatedly pick two threads and merge them into a single
thread if any of the following problems exist: [0210] a. One thread
writes a non-replicatable variable which is accessed (read or
written) by the other thread and the variable is not marked as a
data boundary. [0211] b. Two threads both write to a variable which
is marked as a data boundary. [0212] c. Two threads both read from
a variable that is marked as a data boundary. [0213] d. There is an
ordering dependency between an operation in one thread and an
operation in the other thread which is not a RAW dependency on a
variable marked as a data boundary. [0214] 3. When no more threads
can be merged, quit.
[0215] Another way is to pick an operation, identify all the
operations which must be in the same thread as that operation by
repeatedly adding operations which would be merged (in step 2
above). Then pick the next operation not yet assigned to a thread
and add all operations which must be in the same thread as that
operation. Repeat until there are no more non-replicatable
operations. It should be noted that this is just one possible way
of tackling this problem: basically, we are forming equivalence
classes based on a partial order and there are many other known
ways to do this.
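As one illustrative sketch of this equivalence-class formulation, a
union-find structure can record which operations must share a
thread; the indexing of operations as 0..n-1 is an assumption made
for the sketch:

  #define MAX_OPS 1024

  static int parent[MAX_OPS];    /* parent[i] == i for a class representative */

  static int find(int x) {       /* find representative, with path halving */
      while (parent[x] != x)
          x = parent[x] = parent[parent[x]];
      return x;
  }

  static void merge(int a, int b) { parent[find(a)] = find(b); }

  void init_protothreads(int n_ops) {
      for (int i = 0; i < n_ops; ++i)
          parent[i] = i;         /* one protothread per operation */
  }
  /* merge(a,b) is called for each pair that rules a-d force together;
   * afterwards, operations with the same find() root share a thread. */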
[0216] The above method splits a program into a number of sections
which can be executed in parallel. There are many possible
mechanisms that can be used to accomplish this task.
[0217] FIG. 11 schematically illustrates an asymmetric
multiprocessor apparatus comprising a first execution mechanism 100
and a second execution mechanism 102. An asymmetric memory
hierarchy within the system comprises a cache memory 104 connected
to the first execution mechanism 100 and a shared memory 106
connected to both the first execution mechanism 100 and the second
execution mechanism 102 via the cache memory 104. It will be
appreciated that FIG. 11 illustrates a highly simplified system,
but this is nevertheless asymmetric, contains an asymmetric memory
hierarchy and would represent some level of difficulty in deciding
which sections of a source program should execute on which
execution mechanism 100, 102 and how the data should be partitioned
between the different elements of the memory hierarchy 104, 106
(e.g. which data items used by the first processor 100 should be
made cacheable and which non-cacheable).
[0218] FIG. 12 schematically illustrates an at least partial
architectural description of the system of FIG. 11. This partial
architectural description is in the style of the Spirit format and
specifies which components are present and the interconnections
between those components. It will be appreciated that in practice a
Spirit architectural description will typically contain
considerably more detail and information concerning the nature and
interconnections of the various elements within the system.
Nevertheless, this basic information as to which elements are
present and how they are connected is used by a computer
implemented method for transforming a source computer program into
a transformed computer program for distributed execution on the
system of FIG. 11.
[0219] FIG. 13 gives an example of a communication requirement
which can be identified within a source computer program. This
communication requirement is a Move instruction. This Move
instruction is moving a variable A being manipulated within the
first execution mechanism 100 (PE0) to the second execution
mechanism 102 (PE1). Having identified this communication
requirement, the architectural description of the system as given
in FIG. 12 can be used to identify that an appropriate set of
communication supporting operations needs to be added to the code,
including those illustrated, i.e. performing a MemoryBarrier on PE0,
cleaning the variable A from the cache of PE0 and then loading the
variable A from the memory 106 into the processor PE1. This is a
considerably simplified example, but nevertheless illustrates the
identification of a communication requirement followed by the
associated communication support.
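A hedged sketch of the kind of communication support that might be
generated for this Move is given below; memory_barrier,
cache_clean_range and the semaphore calls stand for hypothetical
platform primitives rather than an actual API:

  extern int A;                          /* lives in the shared memory 106 */
  void memory_barrier(void);
  void cache_clean_range(const void *p, unsigned n);
  void semaphore_signal(int sem);
  void semaphore_wait(int sem);
  enum { A_READY };

  void move_A_pe0_side(void) {           /* on PE0, after A is produced */
      memory_barrier();                  /* order the prior writes to A */
      cache_clean_range(&A, sizeof A);   /* push A out of the cache 104 */
      semaphore_signal(A_READY);         /* synchronize with PE1 */
  }

  void move_A_pe1_side(int *local_a) {   /* on PE1, before A is consumed */
      semaphore_wait(A_READY);
      *local_a = A;                      /* load A from the memory 106 */
  }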
[0220] FIG. 14 schematically illustrates a section of source
computer program including data placement tags and processing
placement tags of the type described elsewhere herein. In
particular, in respect of the data element char x[1000], a data
placement tag is associated with the source computer program (in
this particular example added to it) indicating that this data
element should be stored within a memory MEM1. This information is
used by the computer implemented method which maps portions of the
source code to different execution mechanisms and compiles or
configures those portions appropriately.
[0221] Also illustrated in FIG. 14 are two programming functions
foo(x) and bar(x). It will be appreciated that these functions may
represent complex sequences of instructions in their own right. The
processing placement tags associated with each of these functions
indicate where that function is to be executed. As an example, the
function foo could be a general purpose control function, and this
is most appropriately performed using a general purpose processor
PE0. Conversely, the function bar may be a highly specialised FFT
task or other specific function for which there is provided a
specific accelerator in the form of the execution mechanism PE1,
and accordingly it is appropriate to specify that this function
should be executed on that particular execution mechanism.
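Source of the kind FIG. 14 describes might therefore look like the
following sketch, using the annotation syntax introduced elsewhere
herein (the names MEM1, PE0 and PE1 are those of the example):

  char x[1000] @ {memory => "MEM1"};  /* data placement tag           */

  foo(x) @ {processor => "PE0"};      /* general purpose control code */
  bar(x) @ {processor => "PE1"};      /* e.g. FFT on the accelerator  */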
The following describes language extensions/annotations,
compilation tools, analysis tools, debug/profiling tools, runtime
libraries and visualization tools to help programmers program
complex multiprocessor systems. It is primarily aimed at
programming complex SoCs which contain heterogeneous parallelism
(CPUs, DEs, DSPs, programmable accelerators, fixed-function
accelerators and DMA engines) and irregular memory hierarchies. The
compilation tools can take a program that is either sequential or
contains a few threads and map it onto the available hardware,
introducing parallelism in the process. When the program is
executed, we can exploit the fact that we know mappings between the
user's program and what is executing to efficiently present a debug
and profile experience close to what the programmer expects while
still giving the benefit of using the parallel hardware. We can
also exploit the high level view of the overall system to test the
system more thoroughly, or to abstract away details that do not
matter for some views of the system.
This provides a way of obtaining a full view for SoC
programming.
Overview
[0222] The task of programming a SoC is to map different parts of
an application onto different parts of the hardware. In particular,
blocks of code must be mapped onto processors, data engines,
accelerators, etc. and data must be mapped onto various memories.
In a heterogeneous system, we may need to write several versions of
each kernel (each optimized for a different processor) and some
blocks of code may be implemented by a fixed--function accelerator
with the same semantics as the code. The mapping process is both
tedious and error-prone because the mappings must be consistent
with each other and with the capabilities of the hardware. We
reduce these problems using program analyses which: [0223] detect
errors in the mapping; [0224] infer what mappings would be legal;
[0225] choose legal mappings automatically, subject to some
heuristics. The number of legal mappings is usually large, but once
the programmer has made a few choices, the number of legal options
usually drops significantly, so it is feasible to ask the programmer
to make a few key choices and then have the tool fill in the less
obvious choices automatically. Often the code needs minor changes
to allow some mappings. In particular, if the architecture does not
have shared memory, it is necessary to insert DMA copies from a
buffer in one memory to a buffer in a different memory. This
leads to a lot of changes in the code: declaring both buffers,
performing the copy, etc. Our compiler performs an analysis to
determine which buffers need to be replicated in multiple memory
regions and to determine exactly which form of copy should be used.
It also inserts DMA copies automatically subject to some heuristics
when the benefit from having the programmer make the decision
themselves is too small. Systems with multiple local memories often
have tight memory requirements which are exacerbated by allocating
a copy of a buffer in multiple memories. Our compiler uses lifetime
analysis and heuristics to reduce the memory requirement by
overlapping buffers in a single memory when they are never
simultaneously live. Programmable accelerators may have limited
program memory so it is desirable to upload new code while old code
is running. For correctness, we must guarantee that the new code is
uploaded (and I-caches made consistent) before we start running it.
Our compiler uses program analysis to check this and/or to schedule
uploading of code at appropriate places.
For applications with highly variable load, it is desirable to have
multiple mappings of an application and to switch dynamically
between different mappings.
[0226] Some features of our approach are: [0227] Using an
architecture description to derive the `rules` for what code can
execute where. In particular, we use the type of each processor and
the memories attached to each processor. [0228] The use of program
analysis together with the architecture description to detect
inconsistent mappings. [0229] Using our ability to detect
inconsistent mappings to narrow down the list of consistent
mappings to reduce the number of (redundant) decisions that the
programmer has to make. [0230] Selecting an appropriate copy of a
buffer according to which processor is using it and inserting
appropriate DMA copy operations. [0231] Use of lifetime analyses
and heuristics to reduce memory usage due to having multiple copies
of a buffer. [0232] Dynamic switching of mappings.
Annotations to Specify Mappings
[0233] To describe this idea further, we need some syntax for
annotations. Here we provide one embodiment of annotations which
provide the semantics we want. In this document, all annotations
take the form:
[0234] . . . @ {tag1=>value1, . . . , tagn=>valuen}
Or, when there is just one tag and it is obvious,
[0235] . . . @ value
The primary annotations are on data and on code. If a tag is
repeated, it indicates alternative mappings. The tags associated
with data include: [0236] {memory=>"bank3"} specifies which
region of memory a variable is declared in. [0237] {copies=>2}
specifies that a variable is double buffered [0238]
{processor=>"P1 "} specifies that a variable is in a region of
memory accessible by processor P1. For example, the annotation:
[0239] int x[100] @ {memory=>"bank3", copies=>2,
memory=>"bank4", copies=>1} indicates that there are 3
alternative mappings of the array x: two in memory bank3 and one in
memory bank4.
The tags associated with code include: [0240] {processor=>"P1"}
specifies which processor the code is to run on [0241]
{priority=>5} specifies the priority with which that code should
run relative to other code running on the same processor [0242]
{atomic=>true} specifies that the code is to run without
pre-emption. [0243] {runtime=>"<=10 ms"} specifies that the
code must be able to run in less than 10 milliseconds on that
processor. This is one method used to guide automatic system
mapping. For example, the annotation:
[0244] {fir(x); fft(x,y);} @ {processor=>"P1"}
Specifies that processor P1 is to execute fir followed by fft. The
semantics is similar to that of a synchronous remote procedure
call: when control reaches this code, free variables are marshalled
and sent to processor P1, processor P1 starts executing the code
and the program continues when the code finishes executing. It is
not always desirable to have synchronous RPC behaviour. It is
possible to implement asynchronous RPCs using this primitive either
by executing mapped code in a separate thread or by splitting each
call into two parts: one which signals the start and one which
signals completion. The tags associated with functions are: [0245]
{cpu=>"AR1DE"} specifies that this version of an algorithm can
be run on a processor/accelerator of type "AR1DE" [0246]
{flags=>"-O3"} specifies compiler options that should be used
when compiling this function [0247] {implements=>"fir"}
specifies that this version of an algorithm can be used as a drop
in replacement for another function in the system For example, the
annotation:
[0248] void copy_DMA(void* src, void* tgt, unsigned length) @
{cpu=>"PL081", implements=>"copy"};
Specifies that this function runs on a PL081 accelerator (an ARM
PrimeCell DMA engine) and can be used whenever a call to "copy" is
mapped to a PL081 accelerator.
Extracting Architectural Rules from the Architectural
Description
[0249] There are a variety of languages for describing hardware
architectures including the SPIRIT language and ARM SoCDesigner's
internal language. While the languages differ in syntax, they share
the property that we can extract information such as the following:
[0250] The address mapping of each processor. That is, which
elements of each memory region and which peripheral device
registers are accessed at each address in the address and I/O
space. A special case of this is being able to detect that a
component cannot address a particular memory region at all. [0251]
The type of each component including any particular attributes such
as cache size or type. [0252] That a processor's load-store unit, a
bus, a combination of buses in parallel with each other, a memory
controller or the address mapping makes it possible for accesses to
two addresses that map to the same component or to different
components from one processor to be seen in a different order by
another processor. That is, the processors are not sequentially
consistent with respect to some memory accesses. [0253] That a
combination of load-store units, caches, buffers in buses, memory
controllers, etc. makes it possible for writes by one processor to
the same memory location to suffer from coherency problems with respect to
another processor for certain address ranges. Thus, from the
architecture, we can detect both address maps which can be used to
fill in fine details of the mapping process and we can detect
problems such as connectivity, sequential consistency and
incoherence that can affect the correctness of a mapping.
Detecting Errors in a System Mapping
[0254] Based on rules detected in an architectural description
and/or rules from other sources, we can analyse both sequential and
parallel programs to detect errors in the mapping. Some examples:
[0255] If a piece of code is mapped to a processor P and that code
reads or writes data mapped to a memory M and P cannot access M,
then there is an error in the mapping. [0256] If two pieces of code
mapped to processors P1 and P2 both access the same variable x
(e.g. P1 writes to x and P2 reads from x), then any write by P1
that can be observed by a read by P2 must: [0257] have some
synchronization between P1 and P2 [0258] be coherent (e.g., there
may need to be a cache flush by P1 before the synchronization and a
cache invalidate by P2 after the synchronization) [0259] be
sequentially consistent (e.g., there may need to be a memory
barrier by P1 before the synchronization and a memory barrier by P2
after the synchronization) [0260] share memory (e.g., it may be
necessary to insert one or more copy operations (by DMA engines or
by other processors/accelerators) to transfer the data from one
copy of x to the other. [0261] Synchronization and signalling can
be checked [0262] Timing and bandwidth can be checked [0263]
Processor capability can be checked: a DMA engine probably cannot
play Pacman [0264] Processor speed can be checked: a processor may
not be fast enough to meet certain deadlines. [0265] Etc.
Thus, we can check the mapping of a software system against the
hardware system it is to run on based on a specification of the
architecture or additional information obtained in different
ways.
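The first of the example checks above can be sketched as follows,
where can_access would be derived from the architecture description
and the access list from program analysis; the types and sizes are
illustrative assumptions:

  #include <stdio.h>

  #define NPROC 4                /* processors, from the architecture description */
  #define NMEM  4                /* memories, likewise */

  typedef struct {
      int proc, mem;             /* code runs on proc; data lives in mem */
      const char *what;          /* the access being checked */
  } access;

  int check_mapping(const char can_access[NPROC][NMEM],
                    const access *acc, int n) {
      int errors = 0;
      for (int i = 0; i < n; ++i)
          if (!can_access[acc[i].proc][acc[i].mem]) {
              printf("mapping error: %s on P%d cannot reach M%d\n",
                     acc[i].what, acc[i].proc, acc[i].mem);
              ++errors;
          }
      return errors;             /* zero means this check passed */
  }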
Filling in Details and Correcting Errors in a System Mapping
[0266] Having detected errors in a system mapping, there are a
variety of responses. An error such as mapping a piece of code to a
fixed-function accelerator that does not support that function
should probably just be reported as an error that the programmer
must fix. Errors such as omitting synchronization can sometimes be
fixed by automatically inserting synchronization. Errors such as
mapping more variables to a memory bank than will fit can be
solved, to some extent, using overlay techniques. Errors such as
mapping an overly large variable to a memory can be resolved using
software managed paging though this may need hardware support or
require that the kernel be compiled with software paging turned on
(note: software paging is fairly unusual so we have to implement it
before we can turn it on!). Errors such as omitting memory
barriers, cache flush/invalidate operations or DMA transfers can
always be fixed automatically though it can require heuristics to
insert them efficiently and, in some cases, it is more appropriate
to request that the programmer fix the problem themselves.
Overview
[0267] Given a program that has been mapped to the hardware, the
precise way that the code is compiled depends on details of the
hardware architecture. In particular, it depends on whether two
communicating processors have a coherent and sequentially
consistent view of a memory through which they are passing
data.
Communication Glue Code
[0268] Our compiler uses information about the SoC architecture,
extracted from the architecture description, to determine how to
implement the communication requirements specified within the
program. This enables it to generate the glue code necessary for
communication to occur efficiently and correctly. This can include
generation of memory barriers, cache maintenance operations, DMA
transfers and synchronisation on different processing elements.
This automation reduces programming complexity, increases
reliability and flexibility, and provides a useful mechanism for
extended debugging options.
Communication Error Checking
[0269] Other manual and automatic factors may be used to influence
the communication mechanism decisions. Errors and warnings within
communication mappings can be found using information derived from
the architecture description.
SUMMARY
[0270] Some features of our approach are: [0271] Detecting
coherence and consistency problems of communication requirements
from a hardware description. [0272] Automatically inserting memory
barriers, cache maintenance, DMA transfers etc. to fix
coherence/consistency problems into remote procedure call stubs
(i.e., the "glue code") based on above. We take the concept of
Remote Procedure Calls (RPCs) which are familiar on fully
programmable processors communicating over a network, and adapt and
develop it for application in the context of a SoC: processors
communicating over a bus with fixed function, programmable
accelerators and data engines. Expressing execution of code on
other processing elements or invocation of accelerators as RPCs
gives a function based model for programmers, separating the
function from the execution-mechanism. This enables greater
flexibility and scope for automation and optimisation.
RPC Abstraction
[0273] An RPC abstraction can be expressed as functions mapped to
particular execution mechanisms:
TABLE-US-00007
  main() {
    foo();
    foo() @ {processor => p2};
  }
This provides a simple mechanism to express invocation of
functions, and the associated resourcing, communication and
synchronisation requirements.
[0274] Code can be translated to target the selected processing
elements, providing the associated synchronisation and
communication. For example, this could include checking the
resource is free, configuring it, starting it and copying the
results on completion. The compiler can select appropriate glue
mechanisms based on the source and target of the function call. For
example, an accelerator is likely to be invoked primarily by glue
on a processor using a mechanism specific to the accelerator.
The glue code may be generated automatically based on a high level
description of the accelerator or the programmer may write one or
more pieces of glue by hand.
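A hedged sketch of such generated glue for foo() @ {processor =>
p2} follows; every primitive named here (wait_until_free,
marshal_args, and so on) is a hypothetical placeholder for whatever
the target mechanism provides:

  typedef struct args args_t;    /* marshalled free variables of foo */
  enum { P2 };
  enum { FOO_ENTRY = 100 };      /* code/configuration selector on P2 */
  void wait_until_free(int pe);
  void marshal_args(int pe, args_t *a);
  void start_execution(int pe, int entry);
  void wait_for_completion(int pe);
  void unmarshal_results(int pe, args_t *a);

  void foo_stub(args_t *a) {
      wait_until_free(P2);            /* check the resource is free   */
      marshal_args(P2, a);            /* send free variables to P2    */
      start_execution(P2, FOO_ENTRY); /* configure and start it       */
      wait_for_completion(P2);        /* synchronous RPC blocks here  */
      unmarshal_results(P2, a);       /* copy results on completion   */
  }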
[0275] The choice of processor on which the operation runs can be
determined statically or can be determined dynamically. For
example, if there are two identical DMA engines, one might indicate
that the operation can be mapped onto either engine depending on
which is available first.
The compiler optimisations based on the desired RPC interface can
range from a dynamically linked interface to inter-procedural
specialisation of the particular RPC interface.
RPC Semantics
[0276] RPC calls may be synchronous or asynchronous. Asynchronous
calls naturally introduce parallelism, while synchronous calls are
useful as a simpler function call model, and may be used in
conjunction with fork-join parallelism. In fact, parallelism is not
necessary for efficiency; a synchronous call alone can get the
majority of the gain when targeting accelerators. Manually and
automatically selecting between asynchronous and synchronous
options can benefit debugging, tracing and optimisation.
RPC calls may be re-entrant or non-reentrant, and these decisions
can be made implicitly, explicitly or through program analysis to
provide benefit such as optimisation where appropriate.
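As an illustration of the start/completion split mentioned earlier,
an asynchronous call might be obtained from the synchronous
primitive as sketched below (foo_start and foo_finish are
hypothetical names):

  typedef struct args args_t;
  typedef int rpc_handle;
  rpc_handle foo_start(args_t *a);  /* signals the start; returns at once */
  void foo_finish(rpc_handle h);    /* waits for the completion signal    */
  void do_other_work(void);

  void caller(args_t *a) {
      rpc_handle h = foo_start(a);  /* foo now runs on its target        */
      do_other_work();              /* overlapped with foo               */
      foo_finish(h);                /* rejoin before using foo's results */
  }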
RPC Debugging
[0277] This mechanism enables a particular function to have a
number of different execution targets within a program, but each of
those targets can be associated back to the original function;
debugging and trace can exploit this information. This enables a
user to set a breakpoint on a particular function, and the debug
and trace mechanisms can be arranged such that it can be caught
wherever it executes, or on a restricted subset (e.g. a particular
processing element).
The details of the RPC interface implementation can be abstracted
away in some debugging views.
SUMMARY
[0278] Some features of our approach are: [0279] Using an RPC-like
approach for mapping functions on to programmable and fixed
function accelerators, including multiple variants. [0280]
Providing mechanisms for directing mapping and generation of the
marshalling and synchronisation to achieve it. [0281] Optimising
the RPC code based on inter-procedural and program analysis. [0282]
Providing debug functionality based on information from the RPC
abstraction and the final function implementations.
Overview
[0283] Increasingly, applications are being built using libraries
which define datatypes and a set of operations on those types. The
datatypes are often bulk datastructures such as arrays of data,
multimedia data, signal processing data, network packets, etc. and
the operations may be executed with some degree of parallelism on a
coprocessor, DSP processor, accelerator, etc. It is therefore
possible to view programs as a series of often quite coarse-grained
operations applied to quite large data structures instead of the
conventional view of a program as a sequence of `scalar` operations
(like `32 bit add`) applied to `scalar` values like 32-bit integers
or the small sets of values found in SIMD within a register (SWAR)
processing such as that found in NEON. It is also advantageous to
do so because this coarse-grained view can be a good match for
accelerators found in modern SoCs.
We observe that with some non-trivial adaptation and some
additional observations, optimization techniques known to work on
fine-grained operations and data can be adapted to operate on
coarse-grained operations and data.
Our compiler understands the semantics associated with the data
structures and their use within the system, and can manipulate them
and the program to perform transformations and optimisations to
enable and optimise execution of the program.
Conventional Analyses and their Extension
[0284] Most optimizing compilers perform a dataflow analysis prior
to optimization. For example, section 10.5 of Aho, Sethi and
Ullman's `Compilers: Principles, Techniques and Tools`, published by
Addison Wesley, 1986, ISBN: 0-201-10194-7, describes dataflow
analysis. The dataflow analysis is restricted to scalar values:
those that fit in a single CPU register. Two parts of a dataflow
analysis are: [0285] identifying the dataflow through individual
operations [0286] combining the dataflow analysis with a
control-flow analysis to determine the dataflow from one program
point to another. In order to use dataflow analysis techniques with
coarse-grained dataflow, we modify the first part so that instead
of identifying the effect of a single instruction on a single
element, we identify the effect of a coarse-grained operation
(e.g., a function call or coprocessor invocation) on an entire data
structure in terms of whether the operation is a `use`, a `def` or
a `kill` of the value in a data structure. Care must be taken if an
operation modifies only half of an array since the operation does
not completely kill the value of the array. For operations
implemented in hardware or in software, this might be generated
automatically from a precise description of the operation
(including the implementation of the operation) or it might be
generated from an approximate description of the main effects of
the operation or it might be provided as a direct annotation. In
particular, for software, these coarse-grained operations often
consist of a simple combination of nested loops and we can analyze
the code to show that the operation writes to an entire array and
therefore `kills` the old value in the array. In scalar analysis,
this is trivial since any write necessarily kills the entire old
value.
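The per-operation summaries such an analysis consumes might be
represented as sketched below; the fir entries are illustrative
assumptions about one operation's effects:

  typedef struct {
      const char *op, *var;
      unsigned uses : 1;         /* reads the old value            */
      unsigned defs : 1;         /* writes (part of) a new value   */
      unsigned kills : 1;        /* overwrites the old value fully */
  } effect;

  static const effect effects[] = {
      { "fir",      "x", 1, 0, 0 },  /* reads all of x                 */
      { "fir",      "y", 0, 1, 1 },  /* writes every element: a kill   */
      { "fir_half", "y", 0, 1, 0 },  /* writes half of y: def, no kill */
  };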
The following sections identify some of the uses of coarse-grained
dataflow analysis.
Multiple Versions of the same Buffer
[0287] Especially when writing parallel programs or when using I/O
devices and when dealing with complex memory hierarchies, it is
necessary to allocate multiple identically sized buffers and copy
between the different buffers (or use memory remapping hardware to
achieve the effect of a copy). We propose that in many cases these
multiple buffers can be viewed as alternative versions of a single,
logical variable. It is possible to detect this situation in a
program with multiple buffers, or the programmer can identify the
situation. One way the programmer can identify the situation is to
declare a single variable and then use annotations to specify that
the variable lives in multiple places or the programmer could
declare multiple variables and use annotations to specify that they
are the same logical variable. However the different buffers are
identified as being one logical variable, the advantages that can
be obtained include: [0288] more intelligent buffer allocation
[0289] detecting errors where one version is updated and that
change is not propagated to the other version before it is used [0290]
debug, trace and profile tools can treat a variable as one logical
entity so that, for example, if a programmer sets a watchpoint on x
then the tools watch for changes on any version of x. Likewise, if
the compiler has put x and y in the same memory location (following
liveness analysis), then the programmer will only be informed about
a write to x when that memory location is being used to store x,
not when it is being used to store y. When doing this, you might
well want to omit writes to a variable which exist only to preserve
the multi-version illusion. For example, if one accelerator writes
to version 1, then a DMA copies version 1 to version 2, and then
another accelerator modifies the variable, the programmer will
often not be interested in the DMA copy. We note that compilers do
something similar for scalar variables: the value of a scalar
variable `x` might sometimes live on the stack, sometimes in
register 3, sometimes in register 6, etc. and the compiler keeps
track of which copies currently contain the live value.
Allocation
[0291] By performing a liveness analysis of the data structures,
the compiler can provide improved memory allocation through memory
reuse because it can identify opportunities to place two different
variables in the same memory location. Indeed, one can use many
algorithms normally used for register allocation (where the
registers contain scalar values) to perform allocation of data
structures. One modification required is that one must handle the
varying size of buffers whereas, typically, all scalar registers
are the same size.
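One such adaptation is sketched below: a greedy, first-fit
placement of buffers at byte offsets, which differs from scalar
register allocation chiefly in having to respect the varying sizes.
It assumes the buffers are sorted by live-range start:

  typedef struct {
      int start, end;            /* live range */
      int size;                  /* bytes, unlike fixed-size scalar registers */
      int offset;                /* assigned placement in the memory bank */
  } buffer;

  /* b[] is sorted by live-range start; place each buffer at the lowest
   * offset not overlapping any simultaneously live, already placed one. */
  void place(buffer *b, int n) {
      for (int i = 0; i < n; ++i) {
          b[i].offset = 0;
          for (int moved = 1; moved; ) {
              moved = 0;
              for (int j = 0; j < i; ++j) {
                  int live  = b[i].start < b[j].end && b[j].start < b[i].end;
                  int clash = b[i].offset < b[j].offset + b[j].size &&
                              b[j].offset < b[i].offset + b[i].size;
                  if (live && clash) {          /* bump past buffer j */
                      b[i].offset = b[j].offset + b[j].size;
                      moved = 1;
                  }
              }
          }
      }
  }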
Scheduling
[0292] One thing that can increase memory use is having many
variables simultaneously live. It has been known for a long time
that you can reduce the number of scalar registers required by a
piece of code by reordering the scalar operations so that fewer
variables are simultaneously live. Using a coarse-grained dataflow
analysis, one can identify the lifetime of each coarse-grained data
structure and then reorder code to reduce the number of
simultaneously live variables. One can even choose to recalculate
the value of some data structure because it is cheaper to
recalculate it than to remember its value. When parallelising
programs, one can also deliberately choose to restrain the degree
of parallelism to reduce the number of simultaneously live values.
Various ways to restrain the parallelism exist: forcing two
operations into the same thread, using mutexes/semaphores to block
one thread if another is using a lot of resource, tweaking
priorities or other scheduler parameters. If a
processor/accelerator has a limited amount of available memory,
performing a context switch on that processor can be challenging.
Context switching memory-allocated variables used by that processor
solves the problem.
Optimisation
[0293] Compiler books list many other standard transformations that
can be performed on scalar code. Some of the mapping and
optimisation techniques that can be applied at the coarse-grain we
discuss include value splitting, spilling, coalescing, dead
variable removal, recomputation, loop hoisting and CSE.
Data structures will be passed as arguments, possibly as part of an
ABI. Optimisations such as specialisation and not conforming to the
ABI when it is not exposed can be applied.
Multigranularity Operation
[0294] In some cases, one would want to view a complex
datastructure at multiple granularities. For example, given a
buffer of complex values, one might wish to reason about dataflow
affecting all real values in the buffer, dataflow affecting all
imaginary values, or dataflow involving the whole buffer. (More
complex examples exist.)
Debugging
[0295] When debugging, it is possible for the data structure to
live in a number of different places throughout the program. We can
provide a single debug view of all copies, and watch a value
wherever it is throughout the lifetime of a program, optionally
omitting certain accesses such as DMAs.
The same is possible for tracing data structures within the
system.
Zero Copying
[0296] Using this coarse-grained view, one can achieve zero copy
optimization of a sequence of code like this:
[0297] int x[100];
[0298] generate(&x); // writes to x
[0299] put(channel,&x);
by inlining the definition of put to get:
[0300] int x[100];
[0301] generate(&x); // writes to x
[0302] int *px=put_begin(channel);
[0303] copy(px,&x);
[0304] put_end(channel,px);
then reordering the code a little:
[0305] int *px=put_begin(channel);
[0306] int x[100];
[0307] generate(&x); // writes to x
copy(px,&x);
[0308] put_end(channel,px);
and optimizing the memory allocation and copy:
[0309] int *px=put_begin(channel);
[0310] generate(px); // writes to *px
[0311] put_end(channel,px);
Trace
[0312] Most of this section is about coarse-grained data structures
but some benefits from identifying coarse-grained operations come
when we are generating trace. Instead of tracing every scalar
operation that is used inside a coarse-grained operation, we can
instead just trace the start and stop of the operation. This can
also be used for cross-triggering the start/stop of recording other
information through trace.
Likewise, instead of tracing the input to/output from the whole
sequence of scalar operations, we can trace just the values at the
start/end of the operation.
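A minimal sketch of this coarse-grained tracing, with emit_event as
a hypothetical trace primitive:

  enum { EV_FFT_START, EV_FFT_STOP };
  void emit_event(int ev, const int *data, int n);  /* hypothetical */
  void fft(const int *in, int *out, int n);

  void traced_fft(const int *in, int *out, int n) {
      emit_event(EV_FFT_START, in, n);   /* input values at the start */
      fft(in, out, n);                   /* internals left untraced   */
      emit_event(EV_FFT_STOP, out, n);   /* output values at the end  */
  }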
Validating Programmer Assertions
[0313] If we rely on programmer assertions, documentation, etc. in
performing our dataflow analysis, it is possible that an error in
the assertions will lead to an error in the analysis or
transformations performed. To guard against these we can often use
hardware or software check mechanisms. For example, if we believe
that a variable should be read but not written by a given function,
then we can perform a compile-time analysis to verify it ahead of
time, or we can program an MMU or MPU to watch for writes to that
range of addresses, or we can insert instrumentation to check for
such errors. We can also perform a `lint` check which looks for
things which may be wrong even if it cannot prove that they are
wrong. Indeed, one kind of warning is that the program is too
complex for automatic analysis to prove that it is correct.
SUMMARY
[0314] Some of the features of our approach are: [0315] Using a
register-like (aka scalar-like) approach to data structure
semantics within the system. [0316] Using liveness analysis to
influence memory allocation, parallelism and scheduling decisions.
[0317] Applying register optimisations found in compilers to data
structures within a program. [0318] Providing debugging and tracing
of variables as a single view.
Overview
[0319] Given a program that uses some accelerators, it is possible
to make it run faster by executing different parts in parallel with
one another. Many methods for parallelizing programs exist but many
of them require homogeneous hardware to work and/or require very
low cost, low latency communication mechanisms to obtain any
benefit. Our compiler uses programmer annotations (many/all of
which can be inserted automatically) to split the code that invokes
the accelerators (`the control code`) into a number of parallel
"threads" which communicate infrequently. Parallelizing the control
code is advantageous because it allows tasks on independent
accelerators to run concurrently. Our compiler supports a variety
of code generation strategies which allow the parallelized control
code to run on a control processor in a real time operating system,
in interrupt handlers or in a polling loop (using `wait for event`
if available to reduce power) and it also supports distributed
scheduling where some control code runs on one or more control
processors, some control code runs on programmable accelerators,
and some simple parts of the code are implemented using conventional
task-chaining hardware mechanisms. It is also possible to design
special `scheduler devices` which could execute some parts of the
control code. The advantage of not running all the control code on
the control processor is that it can greatly decrease the load on
the control processor.
Other parallelising methods may be used in conjunction with the
other aspects of this compiler.
[0320] Some of the features of our approach are: [0321] By applying
decoupled software pipelining to the task of parallelizing the
control code in a system that uses heterogeneous accelerators, we
significantly extend the reach of decoupled software pipelining and
by working on coarser grained units of parallelism, we avoid the
need to add hardware to support high frequency streaming. [0322] By
parallelizing at a significantly coarser granularity, we avoid the
need for low latency, high throughput communication mechanisms used
in prior art. [0323] Parallelizing at a significantly coarser
granularity also allows us to duplicate more control code between
threads which reduces and simplifies inter-thread communication
which allows us to generate distributed schedules. That is, we can
distribute the control code across multiple processors both by
putting each control thread on a different processor and by putting
different parts of a single control thread onto different
processors. [0324] By optionally allowing the programmer more
control over the communication between threads, we are able to
overcome the restriction of decoupled software pipelining to
acyclic `pipelines`. [0325] The wide range of backends including
distributed scheduling and use of hardware support for scheduling.
[0326] Our decoupling algorithm is applied at the source code level
whereas existing decoupling algorithms are applied at the assembly
code level after instruction scheduling. Some of the recent known
discussions on decoupled software pipelining are: [0327] Decoupled
Software Pipelining: http://liberty.princeton.edu/Research/DSWP/
http://liberty.princeton.edu/Publications/index.php?abs=1&setselect=pact04_dswp
http://liberty.cs.princeton.edu/Publications/index.php?abs=1&setselect=micro38_dswp [0328] Automatically partitioning packet
processing applications for pipelined architectures, PLDI 2005, ACM
http://portal.acm.org/citation.cfm?id=1065010.1065039
A Basic Decoupling Algorithm
[0329] The basic decoupling algorithm splits a block of code into a
number of threads that pass data between each other via FIFO
channels. The algorithm requires us to identify (by programmer
annotation or by some other analysis including static analysis,
profile driven feedback, etc.) the following parts of the program:
[0330] The "decoupling scope": that is a contiguous sequence of
code that we wish to split into multiple threads. This can be done
by marking a compound statement, or by inserting a `barrier`
annotation that indicates that some parallelism ends/starts here.
[0331] The "replicatable objects": that is,
variables and operations which it is acceptable to replicate. A
simple rule of thumb is that scalar variables (i.e., not arrays)
which are not used outside the scope, scalar operations which only
depend on and only modify replicatable variables, and control flow
operations should be replicated, but more sophisticated policies are
possible. [0332] Ordering dependencies between different
operations: if two function calls both modify a non-replicated
variable, the order of those two function calls is preserved in the
decoupled code. (Extensions to the basic algorithm allow this
requirement to be relaxed in various ways.) [0333] The "data
boundaries" between threads: that is, the non-replicatable
variables which will become FIFO channels. (The "copies" data
annotation described above determines the number of entries in the
FIFO.)
(Identifying replicatable objects and data boundaries are two of
the features of our decoupling algorithm.)
[0334] If we use annotations on the program to identify these
program parts, a simple program might look like this:
TABLE-US-00008
  void main() {
    int i;
    for(i=0; i<10; ++i) {
      int x[100] @ {copies=2, replicatable=false, boundary=true};
      produce(x) @ {replicatable=false, writes_to=[x]};
      DECOUPLE(x);
      consume(x) @ {replicatable=false, reads_from=[x]};
    }
  }
This degree of annotation is fine for examples but would be
excessive in practice so most real embodiments would rely on tools
to add the annotations automatically based on heuristics and/or
analyses.
[0335] At a high level, the algorithm splits the operations in the
scope into a number of threads whose execution will produce the
same result as the original program under any scheduling policy
that respects the FIFO access ordering of the channels used to
communicate between threads. The particular decoupling algorithm we
use generates a maximal set of threads such that the following
properties hold: [0336] All threads have the same control flow
structure and may have copies of the replicatable variables and
operations. [0337] Each non-replicatable operation is included in
only one of the threads. [0338] Each non-replicatable variable must
satisfy one of the following: [0339] The only accesses to the
variable in the original program are reads; or [0340] All reads and
writes to the variable are in a single thread; or [0341] The
variable was marked as a data boundary and all reads are in one
thread and all writes are in another thread. [0342] If two
operations have an ordering dependency between them which is not
due to a read after write (RAW) dependency on a variable which has
been marked as a data boundary, then the operations must be in the
same thread. For the example program above, the maximal set of
threads is:
TABLE-US-00009 [0342]
  void main() {
    int x[100] @ {copies=2};
    channel c @ {buffers=x};
    parallel sections {
      section {
        int i;
        for(i=0; i<10; ++i) { int x1[100]; produce(x1); put(c,x1); }
      }
      section {
        int i;
        for(i=0; i<10; ++i) { int x2[100]; get(c,x2); consume(x2); }
      }
    }
  }
One way to generate this set of threads is as follows: [0343] 1.
For each non-replicatable operation, create a `protothread`
consisting of just that operation plus a copy of all the
replicatable operations and variables. Each replicatable variable
must be initialized at the start of each thread with the value of
the original variable before entering the scope and one of the
copies of each replicatable variable should be copied back into the
master copy on leaving the scope. (Executing all these protothreads
is highly unlikely to give the same answer as the original program,
because it lacks the necessary synchronization between threads.
This is fixed by the next steps.) [0344] 2. Repeatedly pick two
threads and merge them into a single thread if any of the following
problems exist: [0345] a. One thread writes a non-replicatable
variable which is accessed (read or written) by the other thread
and the variable is not marked as a data boundary. [0346] b. Two
threads both write to a variable which is marked as a data
boundary. [0347] c. Two threads both read from a variable that is
marked as a data boundary. [0348] d. There is an ordering
dependency between an operation in one thread and an operation in
the other thread which is not a RAW dependency on a variable marked
as a data boundary. [0349] 3. When no more threads can be merged,
quit. Another way is to pick an operation, identify all the
operations which must be in the same thread as that operation by
repeatedly adding operations which would be merged (in step 2
above). Then pick the next operation not yet assigned to a thread
and add all operations which must be in the same thread as that
operation. Repeat until there are no more non-replicatable
operations. (There are lots of other ways of tackling this problem:
basically, we are forming equivalence classes based on a partial
order and there are many known ways to do that.)
Note that when doing dataflow analysis on arrays, one must
distinguish defs which are also kills (i.e., where the entire value
of a variable is overwritten by an operation), and that requires a
more advanced analysis than is normally used.
Decoupling Extensions
There are a number of extensions to this model
Range Splitting Preprocessing
[0350] It is conventional to use dataflow analysis to determine the
live ranges of a scalar variable and then replace the variable with
multiple copies of the variable: one for each live range. We use
the same analysis techniques to determine the live range of arrays
and split their live ranges in the same way. This has the benefit
of increasing the precision of later analyses, which can enable more
threads to be generated. In some compilers it also has the
undesirable effect of increasing memory usage; this can be mitigated
by later merging the copies if they end up in the same thread and by
being selective about splitting live ranges where the additional
decoupling has little overall effect on performance.
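As a hypothetical illustration of the same idea applied to an array,
suppose an array is fully overwritten partway through a scope; the
two live ranges can then be given separate copies:

    /* before: one array with two disjoint live ranges */
    int buf[100];
    produce_a(buf);            /* writes all of buf              */
    consume_a(buf);            /* last use of the first range    */
    produce_b(buf);            /* full overwrite: a def that is
                                  also a kill starts a new range */
    consume_b(buf);

    /* after range splitting: one copy per live range, so later
       analyses may place the two ranges in different threads */
    int bufA[100];
    produce_a(bufA);
    consume_a(bufA);
    int bufB[100];
    produce_b(bufB);
    consume_b(bufB);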
Zero Copy Optimizations
[0351] The put and get operations used when decoupling can be used
both for scalar and non-scalar values (i.e., both for individual
values (scalars) and arrays of values (non-scalars)), but they are
inefficient for large non-scalar values because they require a memory
copy. Therefore, for coarse-grained decoupling, it is desirable to
use an optimized mechanism to pass data between threads. In
operating systems, it is conventional to use "zero copy" interfaces
for bulk data transfer: instead of generating data into one buffer
and then copying the data to the final destination, we first
determine the final destination and generate the data directly into
it. Applying this idea to the channel interface, we can replace the
simple `put` operation with two functions: put_begin obtains the
address of the next free buffer in the channel and put_end makes this
buffer available to readers of the channel:

[0352] void* put_begin(channel *ch);
[0353] void put_end(channel *ch, void* buf);
Similarly, the get operation is split into a get_begin and get_end
pair:

[0354] void* get_begin(channel *ch);
[0355] void get_end(channel *ch, void* buf);

Using these operations, we can often rewrite sequences of code such
as:

[0356] int x[100];
[0357] generate(x);
[0358] put(ch,x);

to this more efficient sequence:

[0359] int *px = put_begin(ch);
[0360] generate(px);
[0361] put_end(ch,px);

And similarly, for get:

[0362] int x[100];
[0363] get(ch,x);
[0364] consume(x);

to this more efficient sequence:

[0365] int *px = get_begin(ch);
[0366] consume(px);
[0367] get_end(ch,px);
Note that doing zero copy correctly requires us to take the lifetime
of variables into account.
[0368] We can do that using queues with multiple readers, queues
with intermediate read/write points, reference counts, or by
restricting the decoupling (all readers must be in the same thread
and . . . ) so as to make lifetimes trivial to track. This can be
done by generating custom queue structures to match the code, or
custom queues can be built out of a small set of primitives.
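To make the put_begin/put_end interface concrete, here is a minimal
sketch of a zero-copy channel built over a ring of preallocated
buffers. It assumes a single writer and a single reader, busy-waits
instead of blocking, and omits the memory barriers a real
implementation would need; the sizes and field names are
illustrative.

    #define NBUF  4                     /* assumed channel depth        */
    #define BUFSZ 400                   /* assumed buffer size in bytes */

    typedef struct {
        char data[NBUF][BUFSZ];
        volatile unsigned head, tail;   /* reader and writer positions  */
    } channel;

    void *put_begin(channel *ch) {      /* claim next free buffer       */
        while (ch->tail - ch->head == NBUF) ;   /* spin until space     */
        return ch->data[ch->tail % NBUF];
    }
    void put_end(channel *ch, void *buf) {      /* publish to reader    */
        (void)buf;
        ch->tail++;
    }

    void *get_begin(channel *ch) {      /* borrow oldest full buffer    */
        while (ch->tail == ch->head) ;          /* spin until data      */
        return ch->data[ch->head % NBUF];
    }
    void get_end(channel *ch, void *buf) {      /* release for reuse    */
        (void)buf;
        ch->head++;
    }

Because data is generated directly into the channel's buffer, the
memory copy of the plain put/get interface disappears; get_end marks
the point at which the buffer's lifetime ends and it may be reused by
the writer.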
Dead Code and Data Elimination
This section illustrates both how to get better results and also
that, while we may not get exactly the same control structure, the
resulting structures are very similar.
Allowing Cyclic Thread Dependencies
[0369] Prior art on decoupling restricts the use of decoupling to
cases where the communication between the different threads is
acyclic. There are two reasons why prior art has done this: [0370]
1. Cyclic thread dependencies can lead to "Loss of
Decoupling"--that is, two threads may not run in parallel because
of data dependencies between them. [0371] 2. A particularly common
case of cyclic thread dependencies is code such as
TABLE-US-00010 [0371]
y = 1;
while(1) { x = f(y); y = g(x); }
Under existing decoupling schemes, puts are always inserted after
assignment to any data boundary variable. This would require both a
put outside the loop and a put at the end of the loop:
TABLE-US-00011
y1 = 1;
put(ch,y1);
while(1) { y2 = get(ch); x = f(y2); y3 = g(x); put(ch,y3); }
Existing decoupling schemes only generate matched pairs of puts and
gets (i.e., there is only one put on each channel and only one get
on each channel), so they cannot generate such code. An alternative
way of decoupling this code is to generate:
TABLE-US-00012
y1 = 1;
while(1) { put(ch,y1); y2 = get(ch); x = f(y2); y1 = g(x); }
This does have matched pairs of puts and gets, but it breaks the
rule of always performing a put after any assignment to a variable,
so it is also not generated by existing decoupling techniques.
Exposing Channels to the Programmer
[0372] It is possible to modify the decoupling algorithm to allow
the programmer to insert puts and gets (or put_begin/end,
get_begin/end pairs) themselves. The modified decoupling algorithm
treats the puts and gets in much the same way that the standard
algorithm treats data boundaries. Specifically, it constructs the
maximal set of threads such that: [0373] almost all the same
conditions hold as for the standard algorithm; [0374] all puts to a
channel are in the same thread; and [0375] all gets to a channel are
in the same thread. For example, given this program:
TABLE-US-00013 [0375]
channel ch1;
put(ch1,0);
for(int i=0; i<N; ++i) {
    int x = f( );
    put(ch1,x);
    int y = g(get(ch1));
    DECOUPLE(y);
    h(x,y);
}
The modified decoupling algorithm will produce:
TABLE-US-00014
channel ch1, ch2;
put(ch1,0);
parallel sections{
    section{
        for(int i=0; i<N; ++i) { x = f(); put(ch1,x); int y = get(ch2); h(x,y); }
    }
    section{
        for(int i=0; i<N; ++i) { int y = g(get(ch1)); put(ch2,y); }
    }
}
This extension of decoupling is useful for creating additional
parallelism because it allows f and g to be called in parallel.
[0376] Writing code using explicit puts can also be performed as a
preprocessing step. For example, we could transform:
TABLE-US-00015
for(i=0; i<N; ++i) {
    x = f(i);
    y = g(i,x);
    h(i,x,y);
}
[0377] To the following equivalent code:
TABLE-US-00016
x = f(0);
for(i=0; i<N; ++i) {
    y = g(i,x);
    h(i,x,y);
    if (i+1<N) x = f(i+1);
}
which, when decoupled, gives very similar code to the above.
(There are numerous variations on this transformation including
computing f(i+1) unconditionally, peeling the last iteration of the
loop, etc.)
Alternatives to FIFO Channels
[0378] A First-In First-Out (FIFO) channel preserves the order of
values that pass through it: the first value inserted is the first
value extracted, the second value inserted is the second value
extracted, etc. Other kinds of channel are possible including:
[0379] a "stack" which has Last in First out (LIFO) semantics.
Amongst other advantages, stacks can be simpler to implement [0380]
a priority queue where entries are prioritized by the writer or
according to some property of the entry and the reader always
received the highest priority entry in the queue. [0381] a merging
queue where a new value is not inserted if it matches the value at
the back of the queue or as a variant, if it matches any value in
the queue. Omitting duplicate values which may help reduce
duplicated work [0382] a channel which only tracks the last value
written to the queue. That is, the queue logically contains only
the most recently written entry. This is useful if the value being
passed is time-dependent (e.g., current temperature) and it is
desirable to always use the most recent value. Note that with
fine-grained decoupling the amount of time between generation of
the value and its consumption is usually small so being up to date
is not a problem; whereas in coarse-grained decoupling, a lot of
time may pass between generation and consumption and the data could
easily be out of date if passed using a FIFO structure. [0383] A
channel which communicates with a hardware device. For example, a
DMA device may communicate with a CPU using a memory mapped
doubly-linked list of queue entries which identify buffers to be
copied or a temperature sensor may communicate with a CPU using a
device register which contains the current temperature.
Using most of these alternative channels has an effect on program
meaning, so we either have to perform an analysis before using a
different kind of channel, or the programmer can indicate that a
different choice is appropriate/allowed.
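As one example, a sketch of the last-value channel described above,
using POSIX threads for mutual exclusion (all identifiers are
illustrative; the value type would in general be a buffer rather
than an int):

    #include <pthread.h>

    /* A channel that logically holds only the most recently
     * written value, e.g., the current temperature reading. */
    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int             value;
        int             valid;          /* anything written yet?   */
    } last_value_channel;

    void lv_put(last_value_channel *ch, int v) {
        pthread_mutex_lock(&ch->m);
        ch->value = v;                  /* older values are lost   */
        ch->valid = 1;
        pthread_cond_signal(&ch->cv);
        pthread_mutex_unlock(&ch->m);
    }

    int lv_get(last_value_channel *ch) {
        pthread_mutex_lock(&ch->m);
        while (!ch->valid)              /* block until first write */
            pthread_cond_wait(&ch->cv, &ch->m);
        int v = ch->value;              /* always the newest value */
        pthread_mutex_unlock(&ch->m);
        return v;
    }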
Using Locks
[0384] In parallel programming, it is often necessary for one
thread to have exclusive access to some resource while it is using
that resource, to avoid a class of timing-dependent behaviour known
as a "race condition" or just a "race". The regions of exclusive
access are known as "critical sections" and are often clearly
marked in a program. Exclusive access can be arranged in several
ways. For example, one may `acquire` (aka `lock`) a `lock` (aka
`mutex`) before starting to access the resource and `release` (aka
`unlock`) the lock after using the resource. Exclusive access may
also be arranged by disabling pre-emption (such as interrupts)
while in a critical section (i.e., a section in which exclusive
access is required). In some circumstances, one might also use a
`lock free` mechanism where multiple users may use a resource but
at some point during use (in particular, at the end), they will
detect the conflict, clean up and retry. Some examples of wanting
exclusive access include having exclusive access to a hardware
accelerator, exclusive access to a block of memory or exclusive
access to an input/output device. Note that in these cases, it is
usually not necessary to preserve the order of accesses to the
resource.
[0385] The basic decoupling algorithm avoids introducing race
conditions by preserving all ordering dependencies on statements
that access non-replicated resources. Where locks have been
inserted into the program, the basic decoupling algorithm is
modified as follows: [0386] The ordering dependencies on operations
which use shared resources can be relaxed. This requires programmer
annotation and/or program analysis which, for each operation which
may be reordered, identifies: [0387] Which other operations it can
be reordered relative to [0388] Which operations can simultaneously
access the same resource (i.e., without requiring exclusive access)
[0389] Which critical section each operation occurs in. [0390] For
example, one might identify a hardware device as a resource, then
indicate which operations read from the resource (and so can be
executed in parallel with each other) and which operations modify
the resource (and so must have exclusive access to the resource).
[0391] For simplicity, one might identify all operations inside a
critical section as having an ordering dependency between them
though one can sometimes relax this if the entire critical section
lies inside the scope of decoupling. [0392] One might determine
which critical section each operation occurs in using an analysis
which conservatively approximates the set of locks held at all
points in the program.
Multithreaded Input
[0393] Decoupling can be applied to any sequential section of a
parallel program. If the section communicates with the parallel
program, we must determine any ordering dependencies that apply to
operations within the section (a safe default is that the order of
such operations should be preserved). The point here is that one of
the useful properties of decoupling is that it interacts well with
other forms of parallelization, including manual parallelization.
Decoupling Backends
[0394] The decoupling algorithm generates sections of code that are
suitable for execution on separate processors but can be executed
on a variety of different execution engines by modifying the "back
end" of the compiler. That is, by applying a further transformation
to the code after decoupling to better match the hardware or the
context we wish it to run in.
Multiprocessor and Multithreaded Processor Backends
[0395] The most straightforward execution model is to execute each
separate section in the decoupled program on a separate processor
or, on a processor that supports multiple hardware contexts (i.e.,
threads), to execute each separate section on a separate thread.
Since most programs have at least one sequential section before the
separate sections start (e.g., there may be a sequential section to
allocate and initialize channels), execution will typically start
on one processor which will later synchronize with the other
processors/threads to start parallel sections on them.
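A sketch of this lowering for the producer/consumer example, using
POSIX threads (the section_* functions stand in for
compiler-generated section bodies, and init_channels for the
sequential prefix):

    #include <pthread.h>

    extern void init_channels(void);     /* sequential prefix           */
    extern void section_produce(void);   /* compiler-generated sections */
    extern void section_consume(void);

    static void *run_produce(void *a) { (void)a; section_produce(); return 0; }
    static void *run_consume(void *a) { (void)a; section_consume(); return 0; }

    int main(void) {
        init_channels();                 /* allocate/initialize channels */
        pthread_t t1, t2;
        pthread_create(&t1, 0, run_produce, 0);  /* start the sections  */
        pthread_create(&t2, 0, run_consume, 0);
        pthread_join(t1, 0);             /* resynchronize afterwards    */
        pthread_join(t2, 0);
        return 0;
    }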
Using Accelerators
[0396] In the context of an embedded system and, especially, a
System on Chip (SoC), some of the data processing may be performed
by separate processors such as general purpose processors, digital
signal processors (DSPs), graphics processing units (GPUs), direct
memory access (DMA) units, data engines, programmable accelerators
or fixed-function accelerators. This data processing can be
modelled as a synchronous remote procedure call. For example, a
memory copy operation on a DMA engine can be modelled as a function
call to perform a memory copy. When such an operation executes, the
thread will typically: [0397] acquire a lock to ensure it has
exclusive access to the DMA engine [0398] configure the DMA engine
with the source and destination addresses and the data size [0399]
start the DMA engine to initiate the copy [0400] wait for the DMA
engine to complete the copy which will be detected either by an
interrupt to a control processor or by polling [0401] copy out any
result from the copy (such as a status value) [0402] release the
lock on the accelerator. This mode of execution can be especially
effective because one `control processor` can keep a number of
accelerators busy, with the control processor possibly doing little
more than deciding which accelerator to start next and on what
data. This mode of execution can be usefully combined with all of
the following forms of execution.
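A sketch of the memory-copy example, modelling the DMA engine as a
synchronous remote procedure call; the register layout, the DMA0
base pointer and the polling loop are all hypothetical:

    #include <stddef.h>
    #include <pthread.h>

    /* Hypothetical memory-mapped DMA engine registers. */
    typedef struct {
        const void   *src;
        void         *dst;
        size_t        len;
        volatile int  start, done, status;
    } dma_regs;

    extern dma_regs *DMA0;
    static pthread_mutex_t dma_lock = PTHREAD_MUTEX_INITIALIZER;

    int dma_memcpy(void *dst, const void *src, size_t len) {
        pthread_mutex_lock(&dma_lock);   /* exclusive access to engine   */
        DMA0->src = src;                 /* configure source/destination */
        DMA0->dst = dst;
        DMA0->len = len;
        DMA0->done = 0;
        DMA0->start = 1;                 /* initiate the copy            */
        while (!DMA0->done) ;            /* poll (or sleep until IRQ)    */
        int status = DMA0->status;       /* copy out the result          */
        pthread_mutex_unlock(&dma_lock); /* release the accelerator      */
        return status;
    }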
RTOS Backend
[0403] Instead of a multiprocessor or multithreaded processor, one
can use a thread library, operating system (OS) or real time
operating system (RTOS) running on one or more processors to
execute the threads introduced by decoupling. This is especially
effective when combined with the use of accelerators because
running an RTOS does not, by itself, provide parallelism and hence
does not increase performance, but using accelerators does provide
parallelism and can therefore increase performance.
Transforming to Event-Based Execution
[0404] Instead of executing threads directly using a thread
library, OS or RTOS, one can transform threads into an
`event-based` form which can execute more efficiently than threads.
The methods can be briefly summarized as follows: [0405]
Transformations to data representation. [0406] The usual
representation of threads allocates thread-local variables on a
stack and requires one stack per thread. The overhead of managing
this stack and some of the space overhead of stacks can be reduced
by using a different allocation policy for thread-local variables
based on how many copies of the variable can be live at once and on
the lifetime of the variables. [0407] If only one copy of each
variable can be live at once (e.g., if the functions are not
required to be re-entrant), then all variables can be allocated
statically (i.e., not on a stack or heap). [0408] If multiple
copies of a variable can be live at once (e.g., if more than one
instance of a thread can be live at once), the variables can be
allocated on the heap. [0409] Transformations to context-switch
mechanism [0410] When one processor executes more threads than the
processor supports, the processor must sometimes switch from
executing one thread to executing another thread. This is known as
a `context switch`. The usual context-switch mechanism used by threads is
to save the values of all registers on the stack or in a reserved
area of memory called the "thread control block", then load all the
registers with values from a different thread control block and
restart the thread. The advantage of this approach is that a
context switch can be performed at almost any point during
execution so any code can be made multithreaded just by using a
suitable thread library, OS or RTOS. [0411] An alternative
mechanism for context switching is to transform each thread to
contain explicit context switch points where the thread saves its
current context in a thread control block and returns to the
scheduler which selects a new thread to run and starts it. The
advantages of this approach are that thread control blocks can be
made significantly smaller. If all context switches occur in the
top-level function and all thread-local variables can be statically
allocated, it is possible to completely eliminate the stack so that
the entire context of a thread can be reduced to just the program
counter value which makes context switches very cheap and makes
thread control blocks extremely small. [0412] A further advantage
of performing context switches only at explicit context switch
points is that it is easier and faster to ensure that a resource
shared between multiple threads is accessed exclusively by at most
one thread at a time because, in many cases, it is possible to
arrange that pre-emption only happens when the shared resource is
not being used by the current thread. Together, these
transformations can be viewed as a way of transforming a thread
into a state machine with each context switch point representing a
state and the code that continues execution from each context
switch point viewed as a transition function to determine the next
state. Execution of transformed threads can be viewed as having
been transformed to an event-based model where all execution occurs
in response to external events such as responses from input/output
devices or from accelerators. It is not necessary to transform all
threads: event-based execution can coexist with threaded
execution.
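A sketch of one thread after this transformation: each explicit
context-switch point becomes a state, the thread control block
shrinks to a program-counter value, and a transition function
resumes execution from the saved state (the task functions are
hypothetical):

    extern void start_dma_copy(void);     /* hypothetical task steps */
    extern int  dma_copy_done(void);
    extern void process_result(void);

    typedef struct { int pc; } thread_cb; /* entire saved context    */

    enum { RUNNING, BLOCKED, DONE };

    int step_thread(thread_cb *t) {       /* transition function     */
        switch (t->pc) {
        case 0:
            start_dma_copy();             /* kick off accelerator    */
            t->pc = 1;                    /* explicit switch point   */
            return BLOCKED;
        case 1:
            if (!dma_copy_done())
                return BLOCKED;           /* still waiting for event */
            process_result();
            t->pc = 2;
            return RUNNING;
        default:
            return DONE;
        }
    }

    /* An event-based scheduler repeatedly calls step_thread() on
     * runnable threads; all execution occurs in response to events. */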
Interrupt-Driven Execution
[0413] Transforming threads as described above to allow event-based
execution is a good match for applications that use accelerators
that signal task completion via interrupts. On receiving an
interrupt signalling task completion, the following steps occur:
[0414] the state of the associated accelerator is updated; [0415]
all threads that could be blocked waiting for that task to complete
or for that accelerator to become available are executed (this may
lead to further threads becoming unblocked); [0416] when there are
no runnable threads left, the interrupt handler completes.
Polling-Based Execution
[0417] Transforming threads as described above is also a good match
for polling-based execution where the control processor tests for
completion of tasks on a set of accelerators by reading a status
register associated with each accelerator. This is essentially the
same as interrupt-driven execution except that the state of the
accelerators is updated by polling and the polling loop executes
until all threads complete execution.
Distributed Scheduling
[0418] Distributed scheduling can be done in various ways. Some
part of a program may be simple enough that it can be implemented
using a simple state machine which schedules one invocation of an
accelerator after completion of another accelerator. Or, a control
processor can hand over execution of a section within a thread to
another processor. In both cases, this can be viewed as an RPC-like
mechanism ("{foo( ); bar( )@P0;}@P1"). In the first case, one way
to implement it is to first transform the thread to event-based
form and then opportunistically spot that a sequence of system
states can be mapped onto a simple state machine, and/or one may
perform transformations to make it map better.
Non-Work-Conserving Schedulers and Priorities/Deadlines
[0419] This section covers two techniques: 1) using a priority
mechanism and 2) using a non-work-conserving scheduler in the
context of decoupling. If a system has to meet a set of deadlines
and the threads within the system share resources such as
processors, it is
common to use a priority mechanism to select which thread to run
next. These priorities might be static or they may depend on
dynamic properties such as the time until the next deadline or how
full/empty input and output queues are.
[0420] In a multiprocessor system, using a priority mechanism can
be problematic because at the instant that one task completes, the
set of tasks available to run next is too small to make a
meaningful choice, and better schedules occur if one waits a small
period of time before making a choice. Such schedulers are known as
non-work-conserving schedulers.
Overview
[0421] A long-standing problem of parallelizing compilers is that
it is hard to relate the view of execution seen by debug mechanisms
to the view of execution the programmer expects from the original
sequential program. Our tools can take an execution trace obtained
from running a program on parallel hardware and reorder it to
obtain a sequential trace that matches the original program. This
is especially applicable to but not limited to the coarse-grained
nature of our parallelization method. To achieve complete
reconstruction, it helps if the parallelizing compiler inserts
hints in the code that make it easier to match up corresponding
parts of the program. In the absence of explicit hints, it may be
possible to obtain full reconstruction using debug information to
match parts of the program. When there are no explicit hints or
debug information, partial reconstruction can be achieved by using
points in the program that synchronize with each other to guide the
matching process. The resulting trace will not be sequential but
will be easier to understand. A useful application is to make it
simpler to understand a trace of a program written using an
event-based programming style (e.g., a GUI, interrupt handlers,
device drivers, etc.) Partial reconstruction could also be used to
simplify parallel programs running on systems that use release
consistency. Such programs must use explicit memory barriers at all
synchronization points so it will be possible to simplify traces to
reduce the degree of parallelism the programmer must consider.
One simple case of this is reconstructing a `message passing` view
of bus traffic.
[0422] HP has been looking at using trace to enable performance
debugging of distributed protocols. Their focus is on data mining
and performance, not reconstructing a sequential trace.
http://portal.acm.org/citation.cfm?id=945445.945454&dl=portal&dl=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM%20Symposium%20on%20Operating%20Systems%20Principles&CFID=111111111&CFTOKEN=2222222
Partial Reconstruction Based on Observed Dataflow
[0423] Suppose we can identify sections of the system execution and
we have a trace which lets us identify when each section was
running and we have a trace of the memory accesses they performed
or, from knowing properties of some of the sections, we know what
memory accesses they would perform without needing a trace. The
sections we can identify might be: [0424] function calls [0425]
remote procedure calls [0426] execution of a fixed-function
accelerator such as a DMA transfer [0427] message passing
We can summarize the memory accesses of each section in terms of
the input data and the output data (what addresses were accessed
and, perhaps, what values were read or written).
[0428] Given a sequence of traces of sections, we can construct a
dynamic dataflow graph where each section is a node in a directed
graph and there is an edge from a node M to a node N if the section
corresponding to M writes to an address x and the section
corresponding to N reads from address x and, in the original trace,
no write to x happens between M's write to x and N's read from x.
This directed dataflow graph shows how different sections
communicate with each other and can be used for a variety of
purposes: [0429] identify potential parallelism [0430] identify
timing-sensitive behaviour such as race conditions (when combined
with a trace of synchronizations between parallel threads): if M
writes to x and N reads from x and there is no chain of
synchronizations from M to N to ensure that N cannot read from x
before M does the write, there is a potential problem [0431]
identify redundant memory writes (if a value is overwritten before
it has been read) [0432] provides a simple way to show programmers
what is happening in a complex, possibly parallel, system [0433]
can be analyzed to determine the time between when data is being
generated and when it is consumed. If the time is long it might
suggest that memory requirements could be reduced by calculating
data nearer the time or, in a parallel or concurrent system that
the generating task can be executed later. [0434] can be analyzed
to identify number and identity of consumers of data: it is often
possible to manage memory more efficiently or generate data more
efficiently if we know what it is being used for, when it is being
used, etc.
Many other uses exist.
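A sketch of constructing the dynamic dataflow graph from a trace:
for every read we look up the last section that wrote the same
address. The trace record layout and the fixed table sizes are
assumptions for illustration.

    #include <string.h>

    #define MAX_ADDR  (1u << 16)  /* assumed traced address range  */
    #define MAX_NODES 256         /* assumed number of sections    */

    typedef struct { int section; unsigned addr; int is_write; } trace_rec;

    static int  last_writer[MAX_ADDR];       /* -1 = never written    */
    static char edge[MAX_NODES][MAX_NODES];  /* edge[M][N]: M feeds N */

    void build_dataflow(const trace_rec *tr, int n) {
        memset(edge, 0, sizeof edge);
        for (unsigned a = 0; a < MAX_ADDR; ++a) last_writer[a] = -1;

        for (int i = 0; i < n; ++i) {
            if (tr[i].is_write) {
                last_writer[tr[i].addr] = tr[i].section;
            } else {
                int w = last_writer[tr[i].addr];
                if (w >= 0 && w != tr[i].section)
                    edge[w][tr[i].section] = 1; /* M wrote x, N read x,
                                                   no write in between */
            }
        }
    }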
Full Reconstruction Based on Parallelization Transformations
[0435] The first part describes what is needed for the general
case of a program that has been parallelized, where one would like
to serialize trace from a run of the parallel program based on some
understanding of what transformations were done during
parallelization (i.e., knowing how different bits of the program
relate to the original program). The second part describes how one
would specifically do this if the parallelization process included
decoupling. The sketch describes the simplest case in
which it can work but it is possible to relax these restrictions
significantly. Here is a brief description of what is required to
do trace reconstruction for decoupled programs. That is, to be able
to take a trace from the decoupled program and reorder it to obtain
a legal trace of the original program. Most relevant should be
conditions 1-9 which say what we need from trace. Where the
conditions do not hold, there need to be mechanisms to achieve the
same effect or a way of relaxing the goals so that they can still
be met. For example, if we can only trace activity on the bus and
two kernels running on the same DE communicate by one leaving the
result in DE-local memory and the other using it from there, then
we either add hardware to observe accesses to local memories or we
tweak the schedule to add a spurious DMA copy out of the local
memory so that it appears on the bus or we pretend we didn't want
to see that kind of activity anyway. Condition 10 onwards relate
mainly to what decoupling aims to achieve. But, some conditions are
relevant such as conditions 5 and 6 because, in practice, it is
useful to be able to relax these conditions slightly. For example
(5) says that kernels have exclusive access to buffers but it is
obviously ok to have multiple readers of the same buffer and it
would also be ok (in most real programs) for two kernels to
(atomically) invoke `malloc` and `free` in the middle of the
kernels even though the particular heap areas returned will depend
on the precise interleaving of those calls and it may even be ok
for debugging printfs from each kernel to be ordered.
Initial assumptions (to be relaxed later):
[0436] 1. Trace can see the start and end of each kernel execution
and can identify which kernel is being started or is stopping.
[0437] 2. Trace can see context switches on each processor and can
identify which context we are leaving and which context we are
entering.
[0438] Consequences of (1)-(2): We can derive which kernel instance
is running on any processor at any time. [0439] 3. Trace has a
coherent, consistent view of all activity on all processors. [0440]
4. Trace can identify the source of all transactions it observes.
[0441] Two mechanisms that can make this possible are: [0442] 1.
Trace might observe directly which processor caused a transaction.
or [0443] 2. Trace might observe some property of the transaction
such as the destination address and combine that with some property
of the kernels running at that time.
[0444] Condition 2 can be satisfied if each kernel only
accesses buffers that are either: [0445] 1. At a static address
(and of static length); or [0446] 2. At an address (and of a
length) that are handed to the kernel at the start of kernel
execution and trace can infer what that address and length are.
Consequences of (1)-(4): We can identify each transaction with a
kernel instance and we can see all transactions a kernel
performs.
[0447] 5. Each kernel instance has exclusive access to each
buffer during its execution. That is, all inter-kernel
communication occurs at kernel boundaries. [0448] 6. Each kernel's
transactions only depend on the state of the buffers it accesses
and the state of those buffers only depends on the initial state of
the system and on transactions that kernels have performed since
then. Consequences of (1)-(6): Given a trace consisting of the
interleaved transactions of a set of kernel instances, we can
reorder the transactions such that all transactions of a kernel are
contiguous and the resulting trace satisfies all read after write
data dependencies. That is, we can construct a sequentially
consistent view of the transactions as though kernels executed
atomically and sequentially.
Note that there may be many legal traces: e.g., if A (only) writes
to address 0 and then 1, and B (only) writes to address 2 and then 3,
then the trace `0,2,1,3` could be reordered to `0,1,2,3` or to
`2,3,0,1`.
[0449] 7. Sequencing of each kernel instance is triggered by
a (single) state machine. There are a number of parallel state
machines. (State machines may be in dedicated hardware or a number
of state machines may be simulated on a processor.) [0450] 8. State
machines can synchronize with each other and can wait for
completion of a kernel and state transitions can depend (only) on
those synchronizations and on the results of kernels. [0451] 9.
Trace has a sequentially consistent, coherent view of all state
transitions of the sequencers and all synchronization. Consequences
of (7)-(9): Given a trace of the state transitions and
synchronizations, we can reorder them into any of the set of legal
transitions those state machines could have made where a transition
is legal if it respects synchronization dependencies.
Consequences of (1)-(9): Given a trace of all kernel transactions
and all state transitions and synchronizations, we can reorder them
into any legal trace which respects the same synchronization
dependencies and data dependencies.
The challenge of trace reconstruction is to show that, if you
decouple a program, then the following holds. (Actually, this is
what you want to show for almost any way you may parallelize a
program.)
[0452] 10. We assume that we have a single `master`
deterministic state machine that corresponds to the set of
parallel, deterministic state machines in the following way: [0453]
a. Any trace of the `master` state machine is a legal trace of the
parallel state machine. [0454] b. Some traces of the parallel state
machine can be reordered into a legal trace of the master state
machine. [0455] c. Those traces of the parallel state machine that
cannot be reordered to give a legal trace of the master, are a
prefix of a trace that can be reordered to give a legal trace of
the master. [0456] That is, any run of the parallel machine can be
run forward to a point equivalent to a run of the master state
machine.
(We further assume that we know how to do this reordering and how
to identify equivalent points.)
Consequences of (1)-(10): We can reorder any trace to match a
sequential version of the same program.
[0457] To show that decoupling gives us property (10) (i.e., that
any trace of the decoupled program can be reordered to give a trace
of the original program and to show how to do that reordering), we
need to establish a relationship between the parallel state machine
and the master state machine (i.e., the original program). This
relationship is an "embedding" (i.e., a mapping between states in
the parallel and the master machines such that the transitions map
to each other in the obvious way). It is probably easiest to prove
this by considering what happens when we decouple a single state
machine (i.e., a program) into two parallel state machines. When we
decouple, we take a connected set of states in the original and
create a new state machine containing copies of those states but:
[0458] 1. The two machines synchronize with each other on all
transitions into and out of that set of states. [0459] 2. The two
machines contain a partition of the kernel activations of the
original machine. [0460] 3. The two machines each contain a subset
(which may overlap) of the transitions of the original machine.
From this, it follows that the parallel machine _can_ execute the
same sequence as the original machine. To show that it _must_
execute an equivalent sequence (i.e., that we can always reorder the
trace), we need the following properties of decoupling: [0461] 4.
All data dependencies are respected: if kernel B reads data written
by kernel A, then both are executed in sequence on the same state
machine or the state machines will synchronize after A completes
and before B starts. [0462] Note that this depends on the fact that
channels are FIFO queues so data is delivered in order. Extensions
of decoupling allow the programmer to indicate that two operations
can be executed in either order even though there is a data
dependency between them (e.g., both increment a variable
atomically). This mostly needs us to relax the definition of what
trace reconstruction is meant to do. One major requirement is that
the choice of order doesn't have any knock-on effects on control
flow.
[0463] 5. Deadlock should not happen: [0464] threads cannot block
indefinitely on a put as long as each queue has space for at least
one value. [0465] threads cannot block indefinitely on a get:
either one thread is still making progress towards a put or, if
they both hit a get, at least one will succeed.
[0466] Outline proof: Because they share the same control flow, the
two threads perform opposing actions (i.e., a put/get pair) on
channels in the same sequence as each other. A thread can only
block on a get or a put if it has run ahead of the other thread.
Therefore, when one thread is blocked, the other is always
runnable.
Extensions of decoupling allow for the following.
[0467] 1. Locks are added by the programmer.
[0468] To avoid deadlock, we require: [0469] The standard condition
that locks must always be obtained in a consistent order. [0470] If
the leading thread blocks on a channel while holding a lock, then
the trailing thread cannot block on the same lock.
[0471] A sufficient (and almost necessary) condition is that a put
and a get on the same channel must not be inside corresponding
critical sections (in different threads):
TABLE-US-00017
// Not allowed
parallel_sections{
    section{ ... lock(l); ... put(ch,x); ... unlock(l); ... }
    section{ ... lock(l); ... get(ch,x); ... unlock(l); ... }
}
[0472] which means that the original code cannot have looked like
this: [0473] . . . lock(l); . . . DECOUPLE(ch,x); . . . unlock(l);
. . .
[0474] That is, extreme care must be taken if DECOUPLE occurs
inside a critical section, especially when inserting DECOUPLE
annotations automatically.
[0475] 2. Puts and gets don't have to occur in pairs in the
program.
[0476] A useful and safe special case is that all initialization
code does N puts, a loop then contains only put-get pairs and then
finalization code does at most N gets. It should be possible to
prove that this special case is ok.
[0477] It might also be possible to prove the following for
programs containing arbitrary puts and gets: if the original
single-threaded program does not deadlock (i.e., never does a get
on an empty channel or a put on a full channel), then neither will
the decoupled program.
Overview
[0478] A long-standing problem of parallelizing compilers is that
it is virtually impossible to provide the programmer with a
start-stop debugger that lets them debug in terms of their
sequential program even though it runs in parallel. In particular,
we would like to be able to run the program quickly (on the
parallel hardware) for a few minutes and then switch to a
sequential view when we want to debug. It is not necessary (and
hard) to seamlessly switch from running parallel code to running
sequential code but it is feasible to change the scheduling rules
to force the program to run only one task at a time. With compiler
help, it is possible to execute in almost the sequence that the
original program would have executed. With less compiler help or
where the original program was parallel, it is possible to present
a simpler schedule than on the original program. This method can be
applied to interrupt-driven programs too. This same method of
tweaking the scheduler while leaving the application unchanged can
be used to test programs more thoroughly. Some useful examples:
[0479] Testing the robustness of a real time system by modifying
the runtime of tasks. Making a task longer may cause a deadline to
be missed. Making a task longer may detect scheduling anomalies
where the system runs faster if one part becomes slower.
(Scheduling anomalies usually indicate that priorities have been
set incorrectly.) Making tasks take randomly longer times
establishes how stable a schedule is. [0480] Providing better test
coverage in parallel systems. Race conditions and deadlock often
have a small window of opportunity which it is hard to detect in
testing because the `windows` of several threads have to be aligned
for the problem to manifest. By delaying threads by different
amounts, we can cause different parts of each thread to overlap so
that we can test a variety of alignments. (We can also measure
which alignments we have tested so far for test-coverage statistics
and for guided search.) This is especially useful for
interrupt-driven code. John Regehr did some work on avoiding
interrupt overload by delaying and combining interrupts.
http://portal.acm.org/citation.cfm?id=945445.945454&dl=portal&dl=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM%20Symposium%20on%20Operating%20Systems%20Principles&CFID=11111111&CFTOKEN=2222222
but this is really about modifying the (hardware) scheduling of
interrupts to have more desirable properties for building real-time
systems, whereas we are more interested in: [0481] debugging,
tracing, and testing systems (and some of what we do might actually
break real-time properties of the system); [0482] thread schedulers
(but we still want to do some interrupt tweaking)
Testing Concurrent Systems
[0483] Errors in concurrent systems often stem from
timing-dependent behaviour. It is hard to find and to reproduce
errors because they depend on two independently executing sections
of code executing at the same time (on a single-processor system,
this means that one section is preempted and the other section
runs). The problematic sections are often not identified in the
code. Concurrent systems often have a lot of flexibility about when
a particular piece of code should run: a task may have a deadline
or it may require that it receive 2 seconds of CPU in every 10
second interval but tasks rarely require that they receive a
particular pattern of scheduling.
[0484] The idea is to use the flexibility that the system provides
to explore different sequences from those that a traditional
scheduler would provide. In particular, we can use the same
scheduler but modify task properties (such as deadlines or
priorities) so that the system should still satisfy real time
requirements or, more flexibly, use a different scheduler which
uses a different schedule.
Most schedulers in common use are `work conserving schedulers`: if
the resources needed to run a task are available and the task is
due to execute, the task is started. In contrast, a
non-work-conserving scheduler might choose to leave a resource idle
for a short time even though it could be used. Non-work-conserving
schedulers are normally used to improve efficiency where there is a
possibility that a better choice of task will become available if
the scheduler delays for a short time. A non-work-conserving
scheduler is useful for testing concurrent systems because it
provides more flexibility over the precise timing of different tasks
than does a work-conserving scheduler. In particular, we can exploit
flexibility in the following way: [0485] model the effect of
possibly increased runtime of different tasks. e.g., if task A
takes 100 microseconds and we want to know what would happen if it
took 150 microseconds, the scheduler can delay scheduling any tasks
for 50 microseconds after A completes. A special case is uniformly
slowing down all tasks to establish the `critical scaling factor`.
Another interesting thing to watch for is `scheduling anomalies`
where a small change in the runtime of a task can have a large
effect on the overall schedule and, in particular, where increasing
the runtime of one task can cause another task to execute earlier
(which can have both good and bad effects). [0486] model the effect
of variability in the runtime of different tasks by waiting a
random amount of time after each task completes [0487] cause two
tasks to execute at a range of different phases relative to each
other by delaying the start of execution of one or the other of the
tasks by different amounts. Where the tasks are not periodic,
(e.g., they are triggered by external events) you might delay
execution of one task until some time after the other task has been
triggered. In all these cases, the modification of the schedule is
probably done within the constraints of the real-time requirements
of the tasks. For example, when a task becomes runnable, one might
establish how much `slack` there is in the schedule and then choose
to delay the task for at most that amount. In particular, when
exploring different phases, if the second event doesn't happen
within that period of slack, then the first event must be sent to
the system and we will hope to explore that phase the next time the
event triggers.
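A sketch of the delay-injection idea at a single dispatch point:
before releasing a ready task, the test scheduler delays it by a
random amount bounded by the schedule's slack. slack_for, run_task
and log_delay are hypothetical hooks into the real scheduler and
test harness.

    #include <stdlib.h>
    #include <unistd.h>

    typedef struct task task_t;
    extern unsigned slack_for(const task_t *t); /* microseconds of slack */
    extern void     run_task(task_t *t);        /* hand task to system   */
    extern void     log_delay(const task_t *t, unsigned us);

    /* Called by the test harness in place of immediate dispatch. */
    void dispatch_with_jitter(task_t *t) {
        unsigned slack = slack_for(t);   /* respect real-time limits */
        unsigned delay = slack ? (unsigned)rand() % slack : 0;
        usleep(delay);                   /* shift this task's phase  */
        log_delay(t, delay);             /* record schedule explored */
        run_task(t);
    }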
[0488] It is often useful to monitor which different schedules have
been explored either to report to the programmer exactly what tests
have been performed and which ones found problems or to drive a
feedback loop where a test harness keeps testing different
schedules until sufficient test coverage has been achieved.
Debugging Parallel Systems
[0489] When a sequential program is parallelized, it is often the
case that one of the possible schedules that the scheduler might
choose causes the program to execute in exactly the same order that
the original program would have executed. (Where this is not true,
such as with a non-preemptive scheduler, it is sometimes possible to
insert pre-emption points into the code to make it true.) If the
scheduler is able to determine what is currently executing and what
would have run next in the original program, the scheduler can
choose to execute the thread that would run that piece of code.
(Again, it may be necessary to insert instrumentation into the code
to help the scheduler figure out the status of each thread so that
it can execute them in the correct order.)
Tracing Parallel Systems
The amount of reordering required can be reduced by reordering the
execution itself, which might reduce the size of the window
required, simplify the task of separating out parallel streams of
memory accesses, eliminate the need to reorder trace at all, etc.
Overview
[0490] Working with the whole program at once and following
compilation through many different levels of abstraction allows us
to exploit information from one level of compilation in a higher or
lower level. Some examples: [0491] Executing with very abstract
models of kernels can give us faster simulation which gives
visualization, compiler feedback and regression checks on meeting
deadlines. [0492] We can plug a high level simulator of one
component into a low level system simulation (using back-annotation
of timing) and vice-versa. [0493] We can simulate at various levels
of detail: trace start/stop events (but don't simulate kernels),
functional simulation using semihosting, bus traffic simulation,
etc. [0494] We can use our knowledge of the high level semantics to
insert checking to confirm that the high-level semantics is
enforced. For example, if a kernel is supposed to access only some
address ranges, we can use an MPU to enforce that. [0495] We can
reconstruct a `message-passing view` of bus traffic.
[0496] Although illustrative embodiments of the invention have been
described in detail herein with reference to the accompanying
drawings, it is to be understood that the invention is not limited
to those precise embodiments, and that various changes and
modifications can be effected therein by one skilled in the art
without departing from the scope and spirit of the invention as
defined by the appended claims.
* * * * *