U.S. patent application number 10/007211 was filed with the patent office on 2001-11-08 and published on 2002-12-19 for method of determining causal connections between events recorded during process execution.
Invention is credited to Curtis Hrischuk and Charles Murray Woodside.
Application Number | 20020194393 10/007211 |
Family ID | 25469373 |
Filed Date | 2001-11-08 |
United States Patent
Application |
20020194393 |
Kind Code |
A1 |
Hrischuk, Curtis; et al. |
December 19, 2002 |
Method of determining causal connections between events recorded
during process execution
Abstract
A method of determining scenario causality, along with
precedence causality, is disclosed. Information is recorded
relating to events occurring during execution of a process. The
information includes object related information and process related
information. The information is translated into a sequence of
scenario graph language statements, one or more events translated
to a statement. From the statements, process execution flow is
determined establishing some scenario causality and precedence
causality.
Inventors: |
Hrischuk, Curtis (Woodinville, WA); Woodside, Charles Murray (Ottawa, CA) |
Correspondence
Address: |
Clifford H. Kraft
320 Robin Hill Drive
Naperville
IL
60540
US
|
Family ID: |
25469373 |
Appl. No.: |
10/007211 |
Filed: |
November 8, 2001 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
10007211 | Nov 8, 2001 | |
08937023 | Sep 24, 1997 | |
Current U.S.
Class: |
719/318 |
Current CPC
Class: |
G06F 9/542 20130101 |
Class at
Publication: |
709/318 |
International
Class: |
G06F 009/46 |
Claims
What is claimed is:
1. A method of determining, from recorded information relating to
events occurring during execution of a process, a plurality of the
events that are causally connected by scenario causality and
precedence causality, the method comprising the steps of: (a)
translating the recorded information relating to the events to
statements in a first scenario graph language; (b) determining from
the first graph language statements, information relating to
execution flow of the process wherein each first graph language
statement comprises information relating to a predetermined
execution flow of the process; and, (c) based on the information
relating to an execution flow of the process, determining, for a
first plurality of events, events that precede each event from the
first plurality of the events that are causally connected by
scenario causality to said event from the first plurality of the
events.
2. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of:
performing run time behavior verification by analysis of the
scenario event graph, or combinations thereof, for one of race
conditions, live lock conditions, and deadlock conditions.
3. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of:
determining the identity type of a scenario during execution and
providing a different level or style of service based on this
determination.
4. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the steps of:
monitoring a process during execution; and, recording the
information relating to events occurring during execution of the
process, the recorded information comprising at least a time value
from each of at least two clocks and wherein at least one of the
clocks is a logical clock.
5. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein translating the
recorded information is performed in each of two domains; and,
wherein determining from the statements information relating to
execution flow of the process is performed in dependence upon the
statements in each domain.
6. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the process is a
process executed by a microprocessor.
7. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the process is a
process executed in software on at least two processors in a
distributed system and wherein the information relating to events
comprises information relating to a time measured by a logical
clock and another time measured by another clock.
8. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein a statement in
the first graph language represents a node having an out degree of
at least 2 and wherein a statement in the first graph language
represents a node having an in degree of at least 2.
9. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the recorded
information relating to events comprises process event information
and object event information.
10. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the statements
form a graph language that is complete and sound.
11. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the statements
relate to delimiting and progress events of a process and of an
object.
12. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the first graph
language has nodes and edges from a group of: external, thread
begin, and-join, and-fork, thread end, activity, object period
start, start process, next object event, next process node, next
object period, and process thread fork.
13. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of
determining a UML behavioural diagram relating to process
execution.
14. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of
determining a message sequence chart relating to process
execution.
15. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of
determining design related information for use in one of design
verification, business process modelling, performance modelling,
and optimisation.
16. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the recorded
events form an angio trace defined as $G_{Trace} = (N, \Sigma_n,
M_n, P, \Omega)$ where $N$ is a set of recorded events;
$\Sigma_n$ is the alphabet of event time stamps;
$M_n : N \to \Sigma_n$ is the mapping of events to time stamps; $P$
is a set of event predicates for identifying the type of an event;
and, $\Omega$ is a set of partial-ordering relations.
17. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the recorded
information relating to an event comprises an event type from
external event; process thread begin event; process activity event;
process thread fork event; process thread half-join event; and
process thread end event.
18. A method of determining a plurality of the events that are
causally connected comprising the steps of: during execution of an
event, recording process related information, recording object
related information, and recording event related information; using
the process related information and the object related information
for a plurality of events, translating the recorded information to
a graph language substantially indicative of scenario and
precedence causal connections between events; and, providing
information based on the causal connections between events.
19. A method of determining a plurality of events that are causally
connected for use with recorded information relating to the events
occurring during execution of a process, the method comprising the
steps of: analysing the recorded information to determine a partial
order of events from each of two relative perspectives; combining
the two partial orders of events to produce information relating to
some forms of scenario causality and precedence causality.
20. A method of determining a plurality of events that are causally
connected as defined in claim 19, wherein the recorded information
relating to the events comprises at least an event type and two
time stamps from each of two clocks wherein a clock from the two
clocks is a logical clock and wherein causality is deduced in
dependence upon precedence determined from the partial orders and
recorded event types.
21. A method of determining a plurality of the events that are
scenario and precedence causally connected comprising the steps of:
providing a process for execution; instrumenting the process for
monitoring of the process during execution; executing the
instrumented process to produce a trace of the process execution;
transforming the trace of the process execution into a plurality of
scenario graph language statements according to a plurality of
predetermined rules; and, transforming the scenario graph language
statements into a domain specific model.
22. A method of determining a plurality of the events that are
causally connected as defined in claim 21, wherein the process is
instrumented according to the rules of the following table:
23. A method of determining a plurality of the events that are
causally connected as defined in claim 21, wherein the process is
instrumented according to the rules of the following table:
24. A method of determining a plurality of the events that are
causally connected as defined in claim 1, comprising the step of:
the replaying of execution of a scenario and system behavior on one
of the actual system, a simulator tool, and a visualisation
tool.
25. A method of determining a plurality of the events that are
causally connected as defined in claim 1, wherein the process is a
process executed in computer software and comprising the step of:
performing pattern analysis on the statements to detect at least
one of software design and software execution patterns therein.
Description
[0001] This is a continuation-in-part of U.S. patent application
Ser. No. 08/937,023 filed Sep. 24, 1997.
FIELD OF THE INVENTION
[0002] This invention relates generally to process execution and
more particularly to determining causality for information stored
during concurrent and distributed software process execution.
BACKGROUND OF THE INVENTION
[0003] In application execution and analysis, tracing is a term
having many similar but distinct meanings. Tracing implies a
following of process execution. Often such tracing incorporates
recording information relating to a process during execution. In
essence, a process that executes and has information about it
recorded is considered a traced process.
[0004] In the past, tracing of computer software application
programs has been performed for two main purposes: debugging and
optimisation. In debugging, the purpose of tracing is to trace back
from an abnormal occurrence (a bug) to show a user the flow of
execution that occurred prior to the abnormal occurrence. This
allows the user to identify an error in the executed program.
Unfortunately, commands executed immediately previous to an
abnormality are often not a source of the error in execution.
Because of this, much research is currently being conducted to
better view trace related data in order to more easily identify
potential sources of bugs.
[0005] Debuggers are well known in the art of computer programming
and in hardware design. In commonly available debuggers, a user
sets up a trace process to store a certain set of variables upon
execution of a particular command while the program is in a
particular state. Upon this state and command occurring, the
variables are stored. A viewer is provided allowing the user to try
to locate errors in the program that result in the bug.
[0006] Usually, debuggers provide complex tracing tools which allow
for execution of a program on a line by line basis and also allow
for a variety of break commands and execution options. Some
debuggers allow modification of parameters such as variable values
or data during execution of the program. These tools facilitate
error identification and location.
[0007] Unfortunately, with multiprocessor or networked systems, it
is difficult to ensure that a system will function as desired, and
it is also difficult to ascertain that a system is actually
functioning as desired. Many large, multiprocessor systems appear
to execute software programs flawlessly for extended periods of
time before bugs are encountered. Tracing these bugs is very
difficult because a cause of a bug may originate from any of a
number of processors which may be geographically dispersed. Also,
many of these bugs appear intermittently and are difficult to
isolate. Using a debugger is difficult, if not impossible, because
multiple debugging sessions must be established and
coordinated.
[0008] In contrast, for optimisation, it is important to know which
commands are executed most often in order to optimise a software
program. For example, when an application during normal execution
executes a first subroutine once, a second subroutine twice, and a
third subroutine seventy times, each subroutine requiring a similar
time span for execution, optimising the subroutine which runs
seventy times is clearly most important. In system optimisation,
tracing is not actually performed except in so far as statistics of
routine execution and execution times are maintained. These
statistics are very important because they allow for a directed
optimisation effort at points where the software executes slowest
or where execution will benefit most. Statistics captured for
program optimisation are often useful in determining execution
bottlenecks and other non-obvious problems encountered. Examples of
optimisation based modelling or tracing include systems described
in the following references:
[0009] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M.
Siegle, and F. Sotz. "ZM4/SIMPLE: A general approach to
performance measurement and evaluation of distributed systems." In
T. Casavant and M. Singhal, editors, Readings in Distributed
Computing Systems, pages 286-309. IEEE Computer Society Press, Los
Alamitos, Calif., 1994; M. Heath and J. Etheridge. "Visualizing the
performance of parallel programs." IEEE Software, 8(5):29-39,
September 1991;
[0010] C. Kilpatrick and K. Schwan. "ChaosMON--application-specific
monitoring and display of performance information for parallel and
distributed systems." Proceedings of the ACM/ONR Workshop on
Parallel and Distributed Debugging, May 1991; and,
[0011] J. Yan. "Performance tuning with an automated
instrumentation and monitoring system for multicomputers AIMS."
Proceedings of the Twenty-Seventh Hawaii International Conference
on System Sciences, January 1994.
[0012] Software performance models of a design prior to product
implementation reduce risk of performance-related failures.
Performance models provide performance predictions under varying
environmental conditions or design alternatives and these
predictions are used to detect problems. To construct a model, a
software description in the form of a design document or source
code is analysed and translated into a model format. Examples of
model formats are a simulation model, queuing network model, or a
state-based model like a Petri-Net. The effort of model development
makes it unattractive, so performance is usually addressed only in
a final product. This has been termed the "fix-it-later" approach
and the seriousness of the problems it creates is well
documented.
[0013] Determining that a process is in fact executing as
desired, or constructing a performance model for optimisation,
requires an understanding of causality within a software
application. Commonly, the only causal connection determined
automatically is precedence. For example, in determining system
statistics, it is easily recorded which subroutine was executed
when. This results in knowledge of precedence when the entire
process is executed on a single processor. However, given this
knowledge, it is difficult to determine anything other than
precedence.
Time and Causality
[0014] For concurrent or distributed software computations a common
synchronised time reference is unavailable. A system operating on
the earth and another system operating in space illustrate this
problem. When the system on earth performs an activity and
transmits a message to the system in space, an evident time delay
occurs between message transmission and message reception. Once a
system is in space, synchronising its time source precisely with
that of an earth bound system is difficult. When the system in
space is moving, such a synchronisation is unlikely. The same
problem, though on a smaller scale, exists in earth bound networks.
Each computer is bound to an independent time source and
synchronisation of time sources is difficult. With advances in
computer technology and processing speeds, these synchronisation
difficulties are becoming no less significant than those
experienced with space bound systems.
[0015] The lack of a common time reference, as well as other
problems with observing a distributed system, have led to a notion
of causality that is probability based. This "probabilistic
causality" is a probability estimate of an event having occurred.
Probabilistic causality uses a database of information (e.g.,
application structure, network configuration), a sophisticated data
reduction algorithm (i.e., expert system), and trace records to
make an educated guess at the source of problems in a complex
system based on observable events. Although probabilistic causality
is useful for network fault diagnosis, it should not be confused
with the stricter definition of causality espoused here, which is
not probability based. Examples of probabilistic
causality are found in U.S. Pat. Nos. 5,661,668 and 5,483,637.
[0016] In order to determine causality, it is beneficial to
determine which events happened before which other events,
described here as precedence causality. Precedence is a commonly
known form of causality; for example, an executable instruction is
not executed until a previous instruction is executed given no
branching instructions. This precedence based causality is used
heavily for debugging. Often, once an anomaly is discovered during
execution, previous executed instructions are reviewed to determine
a cause of the anomaly. For single processor systems, such an
analysis is straightforward; however for network applications, time
source synchronisation presents problems and therefore, precedence
is not immediately evident.
[0017] Because of the above, when more than one computer is
networked together, precedence cannot be determined through recording
of time. Even when a synchronisation of clocks occurs via a
communication link, a time delay caused by communication times
exists between computers and the recorded times are inaccurate. The
resulting clock times are not useful for determining precedence
between instructions or activities executing on different
processors.
[0018] In an attempt to overcome this problem, it has been proposed
that a logical clock may be used to record time in the form of a
partial ordering of recorded times. Several types of logical clocks
are known for use in a classical model of a distributed system.
[0019] In the classical model of a distributed system, according to
a survey paper by Schwarz and Mattern entitled "Detecting causal
relationships in distributed computations: in search of the Holy
Grail" (Distributed Computing, 7(3):149-174, 1994), a distributed
system consists of N objects: $P_1, \ldots, P_N$. The objects
interact solely by point-to-point message communication with finite
but unpredictable delay; knowledge about structure of a
communication network is not available; first in first out (FIFO)
order of message delivery is not assumed; and a global clock, or
perfectly synchronised clocks local to each process, are not
available. Each object executes a local algorithm to determine its
reaction to incoming messages. Occurrences of actions performed by
the local algorithm, such as a local state change or the sending of
a message, are called events. Events are recorded atomically.
Concurrent and co-ordinated execution of all local algorithms
composes a distributed computation.
[0020] A distributed computation is described by ordering events to
agree with an order of execution. Let $E_i$ denote a set of events
occurring in object $P_i$ in the form of a history of events, and
let $E = E_1 \cup E_2 \cup \cdots \cup E_N$ denote a set of
all events of the distributed computation. These event sets evolve
dynamically as computation progresses. Since each $P_i$ is
strictly sequential, its sequence of events, $E_i$, is ordered
by occurrence and written as $E_i = \{e_{i1}, e_{i2},
e_{i3}, \ldots\}$.
[0021] For the classical model, three event types are recorded: a
send event, a receive event, and an internal event. A send event
reflects the fact that a message was sent asynchronously. A receive
event denotes the receipt of a message together with local state
changes according to the contents of that message. Internal events
reflect changes to local object states. This description does not
account for conflicts or non-determinism since it is based on
events that have actually occurred. The precedence relation is used
as a basis for constructing logical clocks. According to the
precedence relation, an event with a later logical time occurred
after an event with an earlier logical time. Also, two events
with the same logical times in an event set are concurrent, which
indicates that they may have occurred in any order or
simultaneously. Essentially, a concurrency relation indicates that
a precedence relation cannot determine which of two events happened
first.
Precedence Causality's Failure to Define Scenarios
[0022] The precedence relation does not identify when events are
independent because it identifies all past events as being possible
causes for the current event. This information can be useful but it
is usually overwhelming and it must be analysed by hand to prune
out precedence causal relationships. The context information that
is most valuable for understanding the system behaviour is the
scenario. A scenario is a "specific sequence of actions [events]
that illustrates behaviours [for an application]. A scenario may be
used to illustrate an interaction or the execution of a use case
instance." [Note 1] The interaction is "a specification of how stimuli
are sent between [object] instances to perform a specific task. The
interaction is defined in the context of a collaboration." [Note 2]
[Note 1] "OMG Unified Modelling Language (UML) Specification", Version
1.3, March 2000, which is the industry standard.
[Note 2] "OMG Unified Modelling Language (UML) Specification", Version
1.3, March 2000, which is the industry standard.
[0023] An observed scenario is, informally, a set of objects which
execute and interact together, recording events as they execute.
The observed scenario is produced by ordering of the events to
identify the objects' local interactions and their interactions
with each other.
[0024] In a sequential application with only one stimulus the order
of recorded events (i.e., the system behaviour) is one-to-one with
the observed scenario. This is true of every static system where
every execution of the application (i.e., scenario behaviour)
corresponds to the exact ordering of the events in the system (i.e.,
system behaviour).
[0025] If there are dynamic aspects to the system structure or
behaviour, then the one-to-one correspondence of scenario event
ordering with the system behaviour is no longer true. The scenario
structure cannot be recovered in this case because multiple
scenarios are intermingled with each other in the system behaviour.
The dynamic aspects involve: multiple simultaneous stimuli,
concurrent thread execution, dynamic construction of software
components, replication of software components, dynamic
communication paths, message queuing, asynchronous message sends,
etc. The following three canonical problems describe the problem of
recovering and isolating observed scenario structure using
precedence causality.
Canonical Problems of Precedence Causality
[0026] A fundamental limitation of the precedence causality
approach is that it cannot identify scenarios because it cannot
identify the end of a scenario, hereafter called "the problem of
finding the scenario end". This situation is illustrated in FIG. 1a
where there are two scenarios. Each scenario consists of a hidden
external event causing Object A to send a message to Object B with
each object doing some internal processing (not shown for clarity).
As shown in the figure there are two independent scenarios
initiated by Object A but there is a network delay such that the
second message send of Object A (event $e_{A2}$) overtakes the
first message it has sent (event $e_{A1}$).
[0027] The scenario causal ordering is that the events of the first
scenario are Scenario1 = $\{e_{A1} \to e_{B2}\}$ and the events of
the second scenario are Scenario2 = $\{e_{A2} \to e_{B1}\}$. Note
that each scenario is properly identified and can be analysed
independently of the other (e.g., comparing the actual behaviour
against the intended behaviour of a sequence diagram).
[0028] The precedence ordering of the two scenarios is shown in
FIG. 1a, including the transitive ordering components. The
precedence ordering includes the additional event orderings
$\{e_{A1} \to e_{A2}\}$, $\{e_{A1} \to e_{B1}\}$,
$\{e_{A2} \to e_{B2}\}$, and $\{e_{B1} \to e_{B2}\}$. These
event orderings would need to be filtered out before any analysis
could be performed because it is not possible to identify the
scenarios. It is possible to do the filtering manually for a small
example but these additional relationships grow exponentially with
the number of events recorded.
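The orderings that precedence adds on top of the two scenarios can be derived mechanically. The following is a minimal Python sketch, assuming the four events described for FIG. 1a (two sends on Object A, two receives on Object B, with the second message overtaking the first) and assuming each object's events are ordered locally. The event names and the helper `transitive_closure` are illustrative only and do not come from the application.

```python
from itertools import product

# Direct precedence edges, assumed from the description of FIG. 1a:
# the local order on each object plus one edge per message.
direct = {
    ("eA1", "eA2"),  # Object A sends its first message before its second
    ("eB1", "eB2"),  # Object B records its two receive events in this order
    ("eA1", "eB2"),  # first message, delayed, is received second (scenario 1)
    ("eA2", "eB1"),  # second message overtakes and is received first (scenario 2)
}

def transitive_closure(edges):
    """Return all (x, y) pairs such that x precedes y, directly or transitively."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

# Scenario causality keeps only the send/receive pair of each scenario.
scenario_edges = {("eA1", "eB2"), ("eA2", "eB1")}

precedence = transitive_closure(direct)
extra = sorted(precedence - scenario_edges)

for src, dst in extra:
    print(f"{src} -> {dst}  (precedence only, not scenario causality)")
```

Under these assumptions the precedence relation holds six orderings, of which only two describe the scenarios; the remainder would have to be filtered out before scenario analysis.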
[0029] A second fundamental limitation of the precedence ordering
relation is that an event can only belong to one scenario but it is
difficult to determine which scenario an event belongs to, hereafter
called the "problem of scenario association." This is illustrated
by FIG. 1b. Are there one or two scenarios in FIG. 1b? There can be
one scenario that consists of the events $S_1 = \{e_1, e_2,
e_3, e_4\}$, or the two scenarios $S_1 = \{e_1,
e_3\}$ and $S_2 = \{e_2, e_4\}$. This problem grows
linearly with the number of interactions between objects.
[0030] A third limitation of precedence ordering is that events are
recorded for a duration of time. Instead, monitoring should be
triggered based on the scenario that is being executed. This is the
problem of the scenario monitor trigger.
[0031] A fourth limitation of precedence ordering is that it is not
communication protocol aware. The communication protocol that is
used to send and receive information is important for analysis
purposes but precedence causality does not capture any information
related to it. This is a lack of communication protocol
characterization.
[0032] A new type of causality, called scenario causality, is
needed that overcomes these limitations.
Logical Clock Background
[0033] Discussions of implementation mechanics of logical clocks
are presented in the following articles:
[0034] M. Ahuja, T. Carlson, A. Gahlot, and D. Shands.
"Timestamping events for inferring `Affects` relation and potential
causality." In Proceedings 11th International Conference on
Distributed Computing Systems (COMPSAC 91), pages 274-281,
Arlington, Tex., 1991;
[0035] B. Charron-Bost. "Concerning the size of logical clocks in
distributed systems." Information Processing Letters, 39:11-16,
July 1991;
[0036] C. Diehl and C. Jard. "Interval approximations of message
causality in distributed executions." In Proceedings of the
Symposium on Theoretical Aspects of Computer Science, pages
363-374. Springer-Verlag, February 1992;
[0037] C. Fidge. "Logical time in distributed computing systems."
IEEE Computer, pages 28-33, August 1991;
[0038] J. Fowler and W. Zwaenepoel. "Causal distributed
breakpoints." In Proceedings of 10th International Conference on
Distributed Systems, pages 134-141, 1990;
[0039] L. Lamport. "Time, clocks, and the ordering of events in a
distributed system." CACM, 21(7):558-565, July 1978;
[0040] F. Mattern. "Time and global states of distributed systems."
in Proceedings International Workshop on Parallel and Distributed
Algorithms, pages 215-226, Amsterdam, 1988. Bonas, France,
North-Holland;
[0041] S. Meldal, S. Sankar, and J. Vera. "Exploiting locality in
maintaining potential causality." In Proceedings 10th Annual ACM
Symposium on Principles of Distributed Computing, pages 231-239,
Montreal, Canada, 1991;
[0042] M. Raynal and M. Singhal. "Logical time: Capturing causality
in distributed systems." Computer, 29(2):49-56, February 1996;
[0043] R. Schwarz and F. Mattern. "Detecting causal relationships in
distributed computations: in search of the Holy Grail." Distributed
Computing, 7(3):149-174, 1994;
[0044] M. Singhal and A. Kshemkalyani. "An efficient implementation
of vector clocks." Information Processing Letters, 43:47-52, August
1992; and,
[0045] C. Valot. "Characterizing the accuracy of distributed
timestamps." In Proceedings of the ACM/ONR Workshop on Parallel
and Distributed Debugging, pages 43-52, May 1993.
[0046] The implementations described in the above references have
several commonalties. Each event is assigned a time stamp from a
logical clock, which is used to establish relative ordering of
events. If a first event precedes a second event, then the time
stamp of the first event is smaller than the time stamp of the
second event. To generate the time stamp, every object maintains
its own local logical clock that is advanced using a set of
prescribed rules. An object's local clock represents its best
approximation to a global logical clock. A time stamp is included
with every message sent. A receiving object uses the included time
stamp to update its local clock. Internal, send, and receive events
advance an object's local clock.
[0047] Lamport, in the above noted reference, describes a logical
clock wherein each object has a scalar local clock in the form of a
counter that is incremented with each event. When a message is
received that has a larger time stamp than the receiving object's
current counter, the received time stamp replaces the current
counter value. A total ordering of events can be constructed by
appending an object's identifier to a time stamp value. In this
way, within an object a first event precedes a second event when
the first event has a time stamp that is less than that of the
second event. Unfortunately, between objects, it is often difficult
to assess an ordering since concurrent objects have their own local
counter which may increment faster or slower than that of another
object.
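The scalar clock described above can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the common formulation in which a receive event first adopts the larger of the local and received counter values and then counts the receive event itself. The class and method names are hypothetical.

```python
class LamportClock:
    """Scalar logical clock for one object (names are illustrative)."""

    def __init__(self, object_id):
        self.object_id = object_id
        self.counter = 0

    def tick(self):
        """Internal or send event: advance the counter and return a time stamp.

        The stamp is (counter, object_id); appending the identifier breaks
        ties so that stamps from different objects form a total order.
        """
        self.counter += 1
        return (self.counter, self.object_id)

    def receive(self, message_time):
        """Receive event: adopt the larger time, then count the event itself."""
        self.counter = max(self.counter, message_time) + 1
        return (self.counter, self.object_id)


a, b, c = LamportClock("A"), LamportClock("B"), LamportClock("C")

internal_a = a.tick()            # internal event on A
send_a = a.tick()                # A sends a message carrying send_a[0]
recv_b = b.receive(send_a[0])    # B receives it and jumps ahead of A's stamp
internal_c = c.tick()            # unrelated event on C

# Python compares tuples lexicographically, which gives the total order.
ordered = sorted([recv_b, internal_c, send_a, internal_a])
print(ordered)
```

Note that `internal_a` and `internal_c` carry the same counter value even though neither caused the other; the identifier only breaks the tie, which illustrates why ordering between objects is hard to assess from scalar stamps alone.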
[0048] In another logical clock implementation, each object
maintains a vector of integers that constitutes its local clock. A
timestamp consists of the entire vector and each message sent
includes an entire vector. Precedence order of two events is
determined by comparing two vector time stamps in a similar fashion
to that described by Raynal and Singhal as well as Fidge in
the above noted articles. Concurrency can be determined in both
cases.
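A vector clock and the comparison of two vector time stamps can be sketched as follows. This is a minimal Python sketch under the usual textbook rules (increment the local entry on every event, take the component-wise maximum on receipt); the names `VectorClock`, `precedes`, and `concurrent` are hypothetical and are not taken from the cited articles.

```python
class VectorClock:
    """Vector logical clock for object number `index` out of `n` objects."""

    def __init__(self, index, n):
        self.index = index
        self.v = [0] * n

    def tick(self):
        """Internal or send event: advance the local entry, return the stamp."""
        self.v[self.index] += 1
        return tuple(self.v)

    def receive(self, stamp):
        """Receive event: merge the incoming vector, then count the event."""
        self.v = [max(mine, theirs) for mine, theirs in zip(self.v, stamp)]
        return self.tick()


def precedes(u, v):
    """True if stamp u is component-wise <= v and differs from v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def concurrent(u, v):
    """True if neither stamp precedes the other."""
    return u != v and not precedes(u, v) and not precedes(v, u)


p0, p1 = VectorClock(0, 2), VectorClock(1, 2)

a = p0.tick()           # internal event on object 0
b = p1.tick()           # internal event on object 1
send = p0.tick()        # object 0 sends a message carrying its whole vector
recv = p1.receive(send) # object 1 merges and advances

print(a, b, send, recv)
print("a || b:", concurrent(a, b))
print("send -> recv:", precedes(send, recv))
```

Unlike the scalar clock, comparing two stamps here distinguishes "happened before" from "concurrent", at the cost of carrying one integer per object in every message.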
[0049] A known implementation difficulty of a vector clock is the
size and overhead of the time stamp. Characterising concurrency
requires using vector time stamps of integers of at least size N
when nothing is known about a computation except a number of
objects, N. When N is large, the amount of time stamp data
associated with each message and event becomes unacceptable.
[0050] There have been several approaches to reducing the overhead
associated with vector time stamps. Singhal and Kshemkalyani, in
the above noted reference, reduce communication bandwidth by sending
vector clock entries that have changed from a message last sent to
a receiver in place of an entire vector. Each object maintains two
additional vectors to store information between interactions.
However, communication channels must be FIFO. In this approach,
post-execution analysis is needed to recover the precedence
relation between different messages sent to a same receiver.
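The idea of sending only changed vector entries can be sketched as follows. This is an illustrative reading of the technique under the stated FIFO assumption; the class name `DiffClock` and the field names `last_sent` and `last_update` are assumptions standing in for the two additional vectors.

```python
from typing import Dict, List


class DiffClock:
    """Vector clock that sends only entries changed since the last send."""

    def __init__(self, index: int, size: int) -> None:
        self.index = index
        self.vt: List[int] = [0] * size
        # Own clock value at the time of the last send to each object.
        self.last_sent: List[int] = [0] * size
        # Own clock value at the time each entry was last changed.
        self.last_update: List[int] = [0] * size

    def _tick(self) -> None:
        self.vt[self.index] += 1
        self.last_update[self.index] = self.vt[self.index]

    def send(self, dest: int) -> Dict[int, int]:
        self._tick()
        # Include only entries updated after the previous send to dest.
        changed = {
            k: self.vt[k]
            for k in range(len(self.vt))
            if self.last_update[k] > self.last_sent[dest]
        }
        self.last_sent[dest] = self.vt[self.index]
        return changed

    def receive(self, changed: Dict[int, int]) -> None:
        self._tick()
        for k, value in changed.items():
            if value > self.vt[k]:
                self.vt[k] = value
                self.last_update[k] = self.vt[self.index]


a, b, c = DiffClock(0, 3), DiffClock(1, 3), DiffClock(2, 3)
m1 = a.send(1)
m2 = a.send(1)
a.receive(c.send(0))   # A learns about C between its sends to B
m4 = a.send(1)         # carries A's own entry and the new entry for C
m5 = a.send(1)         # carries A's own entry only
for message in (m1, m2, m4, m5):   # FIFO delivery to B
    b.receive(message)
```

Because each message omits entries the receiver is assumed to have seen already, delivery out of order would leave the receiver's vector incomplete, which is why the channels must be FIFO.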
[0051] Fowler and Zwaenepoel, in an above noted reference, describe
a direct-dependency technique reducing communication overhead by
maintaining precedence relations for direct interactions. A
transitive component of the precedence relation is constructed by
post-execution analysis. This allows an object's local clock to be
an event counter. Each object maintains information relating to
objects with which it directly communicates. Each message carries
with it a sending object's event counter value from when the
message was sent. The information that is recorded for each
communication event is a sending object, receiving object, and
appropriate event counters.
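The post-execution reconstruction of the transitive component can be sketched as a reachability search over the recorded direct dependencies. The record layout and the function name `precedes` are assumptions made for this example only.

```python
from collections import defaultdict, deque
from typing import List, Tuple

# One record per communication event (hypothetical layout):
# (sending object, sender's counter, receiving object, receiver's counter)
Record = Tuple[str, int, str, int]
Event = Tuple[str, int]  # (object, event counter)

records: List[Record] = [
    ("A", 1, "B", 1),   # A's event 1 sends; B's event 1 receives
    ("B", 2, "C", 1),   # B's event 2 sends; C's event 1 receives
]


def precedes(records: List[Record], first: Event, second: Event) -> bool:
    """Return True when `first` precedes `second`, directly or transitively."""
    if first == second:
        return False
    sends = defaultdict(list)
    for sender, s_count, receiver, r_count in records:
        sends[sender].append((s_count, receiver, r_count))
    frontier = deque([first])
    seen = {first}
    while frontier:
        obj, count = frontier.popleft()
        # Within one object, a smaller or equal counter reaches the target.
        if obj == second[0] and count <= second[1]:
            return True
        # Follow every message sent at or after the current event.
        for s_count, receiver, r_count in sends[obj]:
            if s_count >= count and (receiver, r_count) not in seen:
                seen.add((receiver, r_count))
                frontier.append((receiver, r_count))
    return False


a_before_c = precedes(records, ("A", 1), ("C", 1))
c_before_a = precedes(records, ("C", 1), ("A", 1))
```

No message travels directly from Object A to Object C in this example, so the relation between their events exists only after the analysis combines the two direct dependencies.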
[0052] Valot, in an above noted reference, suggests that there is a
trade-off between memory requirements and time stamp accuracy for
precedence relations. She describes a family of time stamps, which
she calls k-vectors, that can be tailored for particular analysis.
Instead of allocating a position in the vector to a single object,
each subset of the available objects is assigned a single position
in the vector. The size of the k-vector is the number of subsets
chosen. The appropriate selection of vector clock subsets provides
better time stamp accuracy for a given vector size. However, a
priori knowledge of simultaneous concurrency during execution is
required for optimal assignment of an object to a position in the
k-vector. This method, therefore, is only applicable to certain
cases and not to general implementation.
[0053] Other logical clocks such as those proposed by Meldal, et
al. require specific conditions or additional a priori knowledge to
result in a reduced size time stamp or approximate the precedence
relation. Using knowledge of fixed communication links between
objects, this method provides a precedence ordering between
messages arriving at a same object. This approach is used to
determine precedence relations between messages arriving at a same
object with overhead dependent upon network topology.
[0054] Interval clocks have been disclosed to approximate the
precedence relation with a constant time stamp size. Interval
clocks provide better results than scalar clocks having a same
overhead. By using a bit array vector value instead of a counter,
precedence relations are established by post-execution analysis. If
only blocking RPC style communication is used then interval clocks
describe the precedence relations with no additional post-execution
analysis.
[0055] All of these logical clocks and all prior research only
dealt with precedence causality. A scenario based causality is
needed.
Monitoring and Tracing a Process
[0056] Event records are produced by monitoring a process. There
are two aspects to monitoring: a monitoring system comprising means
for storing data relating to process execution, and monitoring
instrumentation, which uses the monitoring system to record
execution related information. The term monitor is
used in its general sense to incorporate both these aspects.
[0057] An event record contains information about an application's
activity and it consists of at least an event token and a time
stamp. The time stamp is generated by a monitor and represents the
acquisition time of the event record. The set of events is stored
as an event trace.
[0058] A monitor collects information by at least one of sampling
or tracing. Tracing consists of reporting all occurrences of an
event within a certain interval of time. Tracing is synchronous
with occurrence of events; it is performed when all occurrences of
an event are known or when each occurrence of an event is followed
by a certain action. With tracing, dynamic behaviour of a program
is abstracted to a sequence of events. On the other hand, sampling
is a collection of information upon request of the monitor.
Optionally, sampling is asynchronous with the occurrence of an
event; it is useful when an immediate reaction to an event is not
necessary. Sampling allows only statistical statements about
program behaviour. Profiling involves collecting execution counts
or performing timing at the procedure, statement, or instruction
level, using sampling or tracing.
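The difference between tracing and sampling can be sketched with a small simulated monitor. The names `EventRecord`, `Monitor`, and `run`, and the queue-length example itself, are assumptions introduced for illustration; the monitor's clock here is a simple counter rather than a real acquisition time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventRecord:
    token: str        # event token identifying what happened
    time_stamp: int   # acquisition time assigned by the monitor


@dataclass
class Monitor:
    clock: int = 0
    trace: List[EventRecord] = field(default_factory=list)
    samples: List[int] = field(default_factory=list)

    def record(self, token: str) -> None:
        # Tracing probe: invoked synchronously at every event occurrence.
        self.clock += 1
        self.trace.append(EventRecord(token, self.clock))

    def sample(self, value: int) -> None:
        # Sampling: the monitor reads program state at its own request.
        self.samples.append(value)


def run(monitor: Monitor, requests: List[str], sample_every: int) -> None:
    queue_length = 0
    for step, token in enumerate(requests, start=1):
        queue_length += 1 if token == "arrive" else -1
        monitor.record(token)
        # The sample is taken on a fixed schedule, not because of the event.
        if step % sample_every == 0:
            monitor.sample(queue_length)


monitor = Monitor()
run(monitor, ["arrive", "arrive", "depart", "arrive", "depart", "depart"], 2)
mean_queue_length = sum(monitor.samples) / len(monitor.samples)
```

The trace holds the exact sequence of six events, whereas the three samples support only a statistical statement such as an estimated mean queue length.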
[0059] Recorded information relating to events includes fields that
record encapsulated data that follows a prescribed format. Some
common approaches to specifying data to record are recording header
data in the trace file to describe the fields; a self-describing
trace format; an abstract information model based on
entity-relationship descriptions; and a trace description
language.
[0060] There is a large body of work in the prior art relating to
monitoring of parallel programs but there is little research on
monitoring distributed applications. There is an expectation in
prior art literature that much of the parallel program monitoring
research is applicable to a distributed application; however, it
has been found that monitoring of distributed applications has a
different set of requirements.
[0061] There are many different properties that a monitor may have.
Several that have been identified in the literature are machine
independence, using shadow processors, visualisation of performance
metrics as they are gathered, pre-execution, automated
instrumentation, instrumentation during execution, run-time
enabling of event probes, event ordering by precision hardware time
stamp, on-line program steering to control the program and
monitoring overhead as it executes, and post-execution compensation
for probe intrusion. Most of these monitoring systems sample and
aggregate measurements using specified criteria, and then present
the resulting metrics either visually for analysis or to an expert
system for evaluation.
[0062] Discussions of implementation mechanics of monitors are
presented in the following articles:
[0063] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M.
Siegle, and F. Sotz. "ZM4/Simple: A general approach to performance
measurement and evaluation of distributed systems." In T. Casavant
and M. Singhal, editors, Readings in Distributed Computing Systems,
pages 286-309. IEEE Computer Society Press, Los Alamitos, Calif.,
1994.
[0064] M. Heath and J. Etheridge. "Visualizing the performance of
parallel programs." IEEE Software, 8(5):29-39, September 1991;
[0065] M. J. Kaelbling and D. Ogle. "Minimizing monitoring costs:
Choosing between tracing and sampling." 23rd International Hawaii
Conference on System Sciences, Volume 1:314-320, January 1990;
[0066] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K.
Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and
T. Newhall. "The Paradyn parallel performance measurement tool."
Computer, 28(11):37-46, November 1995;
[0067] D. M. Ogle, K. Schwan, and R. Snodgrass.
"Application-dependent dynamic monitoring of distributed and
parallel systems." IEEE Transactions on Parallel and Distributed
Systems, 4(7):762-778, July 1993;
[0068] P. H. Worley. "A new PICL trace file format." Technical
Report ORNL/TM-12125, Oak Ridge National Laboratory, September 1992;
and,
[0069] J. Yan, S. Sarukkai, and P. Mehra. "Performance measurement,
visualization and modeling of parallel and distributed programs
using the AIMS toolkit." Software Practice and Experience,
25(4):429-461, April 1995.
Problem Summary
[0070] Though a tremendous amount of research and effort has been
expended attempting to better monitor and analyse software
execution, heretofore, no system exists for determining restricted
forms of causality such as scenario causality. Scenario causality
is a subset of precedence relationships and is indicative of a more
direct causal link. Precedence, of course, is considered a
requirement for scenario causality since current understandings of
time indicate that it is unlikely that a later event can cause an
earlier event to occur. It is desirable to determine forms of
causality other than mere precedence of an application during
execution. This would require solving the previously listed
problems of "finding the scenario end", "scenario association,"
"the scenario event trigger", and "characterization of
communication protocol." In so doing, causal connections detected
are likely more significant and less numerous. It is also desirable
to determine precedence for a multiprocessor or network based
application during execution.
Object of the Invention
[0071] It is an object of the invention to provide a method of
recording information relating to some events during execution of a
process, and of determining scenario causality and precedence
causality for some of the events.
[0072] It is an object of the invention to provide a method of
recording information relating to some events during execution of a
distributed software application, and of determining scenario
causality and precedence causality for some of the events.
[0073] It is an object of the invention to provide a method of
recording information relating to some events during execution of a
process, and of analysing the recorded information for the purpose
of determining aspects of process execution flow.
SUMMARY OF THE INVENTION
[0074] In accordance with the invention there is provided for a
system wherein information is recorded relating to events occurring
during execution of a process, a method of determining a plurality
of the events that are causally connected by precedence causality
or scenario causality. The method comprises the steps of:
[0075] (a) translating the recorded information relating to the
events to first graph language statements wherein one or more
events is translated to a statement;
[0076] (b) determining from the statements information relating to
process execution flow wherein each statement comprises information
relating to a predetermined process execution flow; and,
[0077] (c) based on the information relating to a predetermined
process execution flow, determining, for each of a plurality of
caused events, a plurality of events from the events that precede
each event from the plurality of caused events and are each
scenario causally or precedence causally connected to said event
from the plurality of caused events.
[0078] In accordance with the invention there is provided a method
of determining a plurality of the events that are scenario causally
or precedence causally connected comprising the steps of:
[0079] during execution of an event,
[0080] recording process related information,
[0081] recording object related information, and
[0082] recording event related information;
[0083] using the process related information and the object related
information for a plurality of events, translating the recorded
information to a graph language substantially indicative of
scenario and potential causal connections between events; and,
[0084] providing information based on the causal connections
between events.
[0085] In accordance with the invention there is provided a method
of determining a plurality of events that are scenario causally or
precedence causally connected for use with recorded information
relating to events occurring during execution of a process. The
method comprises the steps of:
[0086] analysing the recorded information to determine a partial
order of events from each of two relative perspectives; and,
[0087] combining the two partial orders of events to produce
information relating to some forms of scenario and potential
causality.
In accordance with the invention there is provided a
method of determining a plurality of the events that are scenario
causally or precedence causally connected comprising the steps
of:
[0088] providing a process for execution;
[0089] instrumenting the process for monitoring of an execution of
the process;
[0090] executing the instrumented process to produce a trace of the
process execution;
[0091] transforming the trace of the process execution into a
plurality of scenario graph language statements according to a
plurality of predetermined rules to reverse engineer scenarios;
[0092] transforming the scenario graph language statements into a
scenario event graph for analysis; and,
[0093] transforming the scenario event graph(s) into a domain
specific model for analysis in another domain.
BRIEF DESCRIPTION OF THE DRAWINGS
[0094] Exemplary embodiments of the invention will now be described
in conjunction with the following drawings, in which:
[0095] FIG. 1a is a simplified scenario diagram;
[0096] FIG. 1b is another simplified scenario diagram;
[0097] FIG. 1c is a high-level block diagram of a method according
to the invention;
[0098] FIGS. 2a and 2b are simplified flow diagrams of code
execution;
[0099] FIG. 3 is a simplified set diagram of different forms of
known causality;
[0100] FIG. 4 is a diagram showing a simple example of a difference
between scenario causality and potential causality;
[0101] FIG. 5 is a flow diagram showing an RPC having two blocking
interactions, one nested within the other;
[0102] FIG. 6 is a diagram showing the steps in applying MMAP in a
performance engineering context;
[0103] FIG. 7 is a diagram of symbols for use in process event
graphs according to the invention;
[0104] FIG. 8 shows a general representation of a Scenario Event
Graph (SEG) node as a six-port building block;
[0105] FIG. 9 is a diagram of a portion of a SEG of an RPC;
[0106] FIG. 10 is a diagram of a portion of a SEG of an
asynchronous interaction;
[0107] FIG. 11 is a diagram of a portion of a SEG of a case where a
message is sent using a blocking communication protocol that
results in a synchronisation;
[0108] FIG. 12 is a diagram of a portion of a SEG of an initiating
object using an asynchronous communication protocol that results in
a synchronisation;
[0109] FIG. 13 is a diagram of a portion of a SEG where a blocked
initiating object receives its reply to a service request that used
an RPC communication protocol and it is considered to be a
synchronisation;
[0110] FIG. 14 is a diagram of a portion of a SEG involving
acceptance of an external event that results in a
synchronisation;
[0111] FIG. 15 is a diagram of a portion of a SEG of an example
where: the initiating object (Object A) sends an RPC request and
blocks, the first responding object (Object B) processes the
request, and forwards it to another responding object (Object C),
Object C processes the request further and forwards it to Object D
which replies to the initiating object; and,
[0112] FIG. 16 is a graph rewriting operation for simplifying a
scenario event graph model.
DETAILED DESCRIPTION OF THE INVENTION
[0113] A sequentially executing software component that may execute
concurrently with other components is referred to as an object
throughout this specification and the claims that follow.
[0114] Software execution models of distributed and concurrent
systems characterise objects and their interactions in the context
of the process that they are part of, fully describing a scenario
of execution of an application. Software execution models are
design aids that are used during the development of a software
application. A software execution model (hereafter simply "model")
characterizes high-level aspects of an application's execution for
analysis. A forward engineering model will specify intended
behavior (e.g., specifying interaction diagrams such as use-cases,
sequence diagrams, and collaboration diagrams). Interactions between
objects are important because they affect parallelism and resource
contention experienced during execution when, for example, a
heavily used object queues arriving requests and becomes a
bottleneck. During the later phases of development, models are
constructed to characterize the realized behavior, to aid in
program understanding, re-engineering, reuse, performance analysis,
and debugging. Often these realized models are mental pictures that
the developer constructs from user requirements, design documents,
source code examination, and, most importantly, experiences with
the system. The realized models are critical for investigating
differences between the specified behavior and the observed
behavior. Manually reverse engineering a model from an actual
execution of a small software application is relatively easy but it
is expensive, difficult, and uncertain for a large or dynamic
application. A technique is needed to generate models of realized
behavior and object-oriented methodologies need to be adapted to
incorporate the realized models.
[0115] Performance models are a type of software execution model
for optimisation purposes. A performance model of distributed and
concurrent systems characterises objects and their interactions.
Interactions between objects are important because they affect
parallelism and resource contention experienced during execution
when, for example, a heavily used object queues arriving requests
and becomes a bottleneck. The Layered Queuing Network (LQN) model
has been proposed to evaluate such processes. The LQN model extends
queuing network models to include contention effects for software
resources such as server objects, as well as contention for
hardware devices. It is appropriate for assessing performance of
many kinds of distributed systems, including client-server
applications, peer-to-peer applications, communications switching
software, transaction processing systems, and systems based on
middleware software technologies. Using the invention for this
purpose is described in both:
[0116] C. E. Hrischuk. Trace-based Load Characterization for the
Automated Development of Software Performance Models. Ph.D. thesis,
Carleton University, Ottawa, Canada, 1998.
[0117] C. M. Woodside, C. Hrischuk, B. Selic, S. Bayarov.
"Automated performance modelling of software generated by a design
environment." Performance Evaluation, vol. 45:1, pages 107-123,
2001.
[0118] Referring to FIG. 1c, a high-level block diagram of a method
according to the invention is shown. A language statement in the
form of a design statement or executable code is instrumented to
support monitoring of the design or executable during simulation or
execution. The instrumentation interacts with storage devices and
other system resources to provide tracing of the simulation of a
design in the form of an abstract execution, simulation, or
emulation of the execution of an executable. Once traced, the trace
results form an angio trace. The angio trace is a particular form of
trace as defined hereinbelow. A plurality of scenario graph language
statements that characterise the observed scenario's behaviour is
determined from the angio trace. In an embodiment, the scenario
graph language is, as disclosed herein, "scenario event graph."
From the scenario graph language statements, domain specific models
are formed through transformation. Since a scenario event graph
language description is substantially indicative of scenario and
potential causality, the domain specific models may take a number
of forms. These include performance models, resource utilisation
models, design models, execution flow models, and so forth. By
determining design models, an executable program is verifiable
against design requirements from which it is derived. Further
information can be found in C. Hrischuk. "A Model Making Automation
Process (MMAP) using a Graph Grammar Formalism". Proceedings of the
Theory and Application of Graph Transformations, 1999.
[0119] Referring to FIGS. 2a and 2b, simplified flow diagrams of
code execution are shown. Circles represent code statements that
are fork events or join events. Solid boxes represent terminals and
hollow boxes represent default events. Lines joining code statement
representations are
indicative of potential causality. The flow diagrams are shown in
time with an earlier time to the left of a later time. The two flow
diagrams shown in FIGS. 2a and 2b are of identical executable code
executed at two different times. Upon a brief review of the two
flow diagrams, it is evident that a code statement 1 is executed at
two different times. In fact, this does not affect execution of the
process because the code statement 1 is not causally connected to
the join code statement 3. Unfortunately, when evaluating a system
based solely on precedence, it is difficult to determine when
causally identical situations such as that shown in FIGS. 2a and 2b
may occur.
[0120] In fact, though a flow diagram generated from the system
during testing may always be similar to that of FIG. 2a, the flow
diagram of FIG. 2b is an acceptable execution of the process
characterised by the two flow diagrams and may occur at some later
time. It is clearly advantageous to identify flow related issues
such as these and to test out their correctness in light of desired
design parameters. According to the present invention, a method of
evaluating and transforming recorded information relating to code
statements into process flow information and subsequently into
other information is provided.
Scenario Causality
[0121] Prior art research into implementations for logical clocks
has proven useful for ordering events but other than precedence
causality, characterisation of scenario causality has heretofore
been elusive.
[0122] Prior to discussing scenario event graph and its use for
determining causality other than precedence, causality should be
defined. In order to understand causality, some forms of causality
are outlined below. The terms as defined hereinbelow associated
with each form of causality are used throughout this specification
and the claims.
[0123] Precedence causality in the form of precedence relations is
a loose form of causality implying that a first event occurs
before a second event during an execution. This form of causality
is known in the art and is a common object of prior art systems.
Referring to FIG. 3, a simplified set diagram of different forms of
known causality is shown. As is evident from the diagram, imposed
causality is inclusive of several other forms of causality.
[0124] Realised causality is a term referring to an event ordering
that is consistent with both purpose and an execution. In theory,
when a process is correctly designed and implemented, realised
causality reflects both. Realised causality is summarised as a
first event is an intended cause of a second event if the second
event cannot occur unless the first event has already occurred. Of
course, when verification of process implementation against design
criteria is intended and process implementation is potentially
incorrect the statement "cannot" is modified to "should not."
Recovering realised causality from prior art post-execution traces
is impossible because it necessitates knowledge of the process
implementation in the form of software code of each object, the
initial value of variables, and the execution environment.
[0125] According to the present invention a form of causality
referred to as scenario causality is determined. Scenario causality
includes forms of causality other than precedence but does not
truly reflect realised causality in every instance. This is
indicated in FIG. 3 wherein scenario causality is a subset of
realised causality. Certain assumptions and limitations allow for a
broader applicability of the method of the present invention as
discussed below.
[0126] Different types of logical clocks result in different causal
ordering of events which may exclude important relationships.
Although each ordering is consistent with precedence relations,
some orderings are preferable for some applications. For example,
vector based logical clocks allow for a determination of potential
causality.
[0127] Precedence causality provides a partial ordering between
events that respects event ordering during execution. Precedence
causality is characterised as a future event being incapable of
influencing the past. A vector clock characterises precedence
causality because the event ordering is consistent with system
execution. Precedence causality is a weak approximation or
characterisation of realised causality because it results in all
previous events being potential causes for later events. This is a
consequence of causality being deduced solely from precedence
relations.
[0128] Imposed causality is obtained when the ordering between
events is imposed by an algorithm, and is not constrained to event
execution order. A scalar clock is an example of a logical clock
resulting in a determination of imposed causality. Because a clock
with imposed causality may include all other clocks as special
cases, imposed causality is shown as the largest set in FIG. 3.
[0129] The difference between precedence causality and realised
causality is well known but many prior art methods for determining
causality ignore the difference. Examples of some of these include
the following papers:
[0130] D. Bryan. "An algebraic specification of the partial orders
generated by concurrent Ada computations." In Proceedings of
Tri-Ada, pages 225-241, New York, N.Y., 1989. A.C.M. Press;
[0131] C. Fidge. "Partial orders for parallel debugging." In
Proceedings of ACM SIGPLAN/SIGOPS Workshop on Parallel and
Distributed Debugging, pages 183-194, 1988;
[0132] D. P. Helmbold, C. E. McDowell, and J. Z. Wang. "Determining
possible event orders by analyzing sequential traces." IEEE
Transactions on Parallel and Distributed Systems, 4(7):827-839,
July 1993;
[0133] M. Raynal and M. Singhal. "Logical time: Capturing causality
in distributed systems." Computer, 29(2):49-56, February 1996;
[0134] A. Schiper, J. Eggli, and A. Sandoz. "A new algorithm to
implement causal ordering." In Proceedings 3rd International
Workshop on Distributed Algorithms, number 392 in Lecture Notes in
Computer Science, pages 219-232. Springer-Verlag, Berlin, 1989;
and,
[0135] G. Winskel. "An introduction to event structures." In J. W.
de Bakker, W. P. de Roever, and G. Rozenberg, editors, Linear Time,
Branching Time and Partial Order in Logics and Models for
Concurrency, pages 364-397. Springer-Verlag, Berlin, 1989.
[0136] Scenario causality is a subset of realised causality,
including only those causal relationships for each application's
execution. Whereas imposed and precedence causality are overly
liberal inclusive approximations of realised causality, scenario
causality is an achievable conservative approximation of realised
causality. Scenario causality limits an event's influence to those
future application events it can affect. A criterion, called the
scenario ordering relation, is used to deduce scenario causality
from observations of the execution. The scenario ordering relation,
according to the invention, solves the previously listed
limitations of scenario causality by limiting the effects of an
event to both the unit of software modularity in the form of object
level effects and process of which the event forms part. This is
useful because it reflects the context with which an event is
associated, namely a software module and its process. The scenario
ordering relation is described as: "a first event is a cause of a
second event if there is a sequence of events from the first event
to the second event in the same process."
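The scenario ordering relation can be contrasted with plain precedence in a short sketch. The event names, the process identifiers, and the list of direct links below are assumptions made for illustration; they describe two independent scenarios in which Object A sends a message to Object B and the second message overtakes the first.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# event -> identifier of the process (scenario) it belongs to
events: Dict[str, int] = {"a1": 1, "b1": 1, "a2": 2, "b2": 2}

# Direct links: message send to receive, and local order within an object.
links: List[Tuple[str, str]] = [
    ("a1", "b1"),   # scenario 1 message from Object A to Object B
    ("a2", "b2"),   # scenario 2 message from Object A to Object B
    ("a1", "a2"),   # local order in Object A
    ("b2", "b1"),   # local order in Object B (second message arrives first)
]


def reachable(first: str, second: str, allowed: Set[str]) -> bool:
    # Breadth-first search that only walks through events in `allowed`.
    frontier = deque([first])
    seen = {first}
    while frontier:
        current = frontier.popleft()
        for src, dst in links:
            if src == current and dst in allowed and dst not in seen:
                if dst == second:
                    return True
                seen.add(dst)
                frontier.append(dst)
    return False


def precedes(first: str, second: str) -> bool:
    # Precedence: any chain of events counts, regardless of process.
    return reachable(first, second, set(events))


def scenario_cause(first: str, second: str) -> bool:
    # Scenario ordering: the whole chain must stay inside one process.
    if events[first] != events[second]:
        return False
    same_process = {e for e, p in events.items() if p == events[first]}
    return reachable(first, second, same_process)
```

For these events, `precedes("a1", "b2")` holds while `scenario_cause("a1", "b2")` does not, so the scenario relation reports fewer and more direct causal connections than precedence alone.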
[0137] In the specification and claims that follow, causality
refers to conservative estimations of causality. Alternatively
stated, causality as used herein refers to events that are scenario
causal and not merely precedence causal. When precedence causality
is intended, that term or a synonym thereof is used.
[0138] In the timing diagram of FIG. 4, there are two distributed
applications. Each application consists of Object A sending a
message to Object B. As shown, two independent external events
cause each application to execute, recording the events of the
first application as and the events of the second application as .
However, as the scenario causal ordering shows in FIG. 4, there is
a delay such that the second message sent by Object A (event)
overtakes the first message it has sent (event ). The precedence
ordering of the events is shown in FIG. 1c, including the
transitive ordering components. The precedence ordering includes
the additional event orderings of which are not causal orderings
because the scenarios are independent.
[0139] It is useful to identify blocking of objects when analysing
system execution for race detection and system visualisation among
other applications; however, a classical partial-order model has
difficulty characterising object blocking because communications
are recorded as asynchronous communications. Object blocking
introduced by blocking communication mechanisms, such as the Remote
Procedure Call (RPC), is not apparent within the classical model.
Analysis is further complicated when blocking interactions are
nested. For example in the flow diagram of FIG. 5, an RPC has two
blocking interactions, one nested within the other. Object A
initiates an RPC and blocks at event e.sub.1, and the nested
blocking interaction is initiated by Object B at event e.sub.3. One
approach to identifying object blocking is to augment time stamps
recorded through monitoring, in particular metrication within the
time stamps, with information about a communication mechanism, as
is attempted in some debugging applications. Other approaches
modify the precedence relation. The topology of the scenario event
graph language according to the invention directly characterises
object blocking by labelling the elements of blocking and
non-blocking communications differently with different events. Then
a causal chain of events through an RPC is immediately
identifiable. According to the invention, a characterisation of
message-based synchronisation between objects is performed.
[0140] Performance model construction using the model making
automation process (MMAP), previously called Trace Based Load
Characterization, comprises three steps. First, an appropriate
trace of execution is recorded. Such a trace is referred to as an
angio trace throughout this document and the claims that follow.
Second, the trace is analysed to produce a scenario event graph
that characterises the execution of the scenario: the involved
objects, their individual activities, and their interactions with
each other. A scenario event graph forms a scenario model. Third,
the scenario event graphs are combined to make a performance model
that merges several scenario models and additional configuration
information necessary to determine performance.
[0141] There are several benefits to using MMAP as opposed to a
"source code examination" approach for constructing models. Traces
incorporate dynamic details of a design that are difficult to
determine from source code or documentation. Some of these are data
dependent branching, identity of objects involved in anonymous or
dynamically bound interactions, and involvement of polymorphism and
inheritance hierarchy of an object-based system. Automated trace
processing results in more accurate performance model construction
at a lower cost because a larger volume of details is included
during model development. An area where automation has a decided
advantage is for correctly identifying interaction types. For
example, MMAP identifies a synchronous interaction constructed from
asynchronous messages. This is important when the nature of an
interaction cannot be explicitly identified in a trace. Optionally,
MMAP is used to model a production version of a software process,
to provide full life-cycle support for modelling.
[0142] The steps in applying MMAP in a performance engineering
context are shown in FIG. 6. A first step is to select scenarios
which are important for performance modelling and to add
instrumentation to identify where the execution of each scenario
begins and ends. In a second step, angiotrace events are recorded
during the execution of a scenario. For analysis purposes the
events of a trace are reordered into an intermediate format that is
then processed into an LQN sub-model. The user completes the LQN
construction in a fifth step by combining several LQN sub-models
with system configuration information.
Software Execution Tracing
[0143] In tracing a process in the form of a software process
during execution, it is preferable to have a predetermined set of
desired information. Such a set of desired information is
determined in dependence upon information sought through processing
of the trace results. Essentially, for use in the present
invention, a trace must capture, in an automated fashion,
information sufficient for determining scenario causality from a
distributed application's execution history.
[0144] In order to record sufficient information regarding software
process execution, angio tracing is employed. Angio tracing
according to the invention identifies scenario and potential
precedence relationships between recorded events of an application
and properly characterises concurrency. Angio tracing according to
the invention characterises communication protocol elements in the
form of blocking request initiation, non-blocking request
initiation, request acceptance, synchronisation acceptance, sending
a reply to a blocking request initiation, and acceptance of a
reply. Angio tracing supports integration of information from a
heterogeneous environment because it is independent of
implementation technology, execution environment, and monitoring
approach. Optionally, multiple angio traces are recorded
simultaneously. Automated trace analysis is possible because angio
tracing according to an embodiment is based on a formal model.
[0145] Angio tracing has been successfully implemented in many
environments. Software monitoring is a preferable means for
characterising a distributed application. An angio trace has at
least a logical clock which can serve many purposes. The approach
adopted by angio tracing is to provide an event format which
includes time stamp information and user defined application data
payload.
[0146] Angio traces are extractable from a plurality of different
sources at various steps of the development. Examples of sources of
an angio trace include annotated specifications in the form of
use-cases and Message Sequence Charts, functional prototypes,
detailed simulations, and an executable production system.
Successful experiments have been conducted with several of these
sources. In the embodiment described below, angio traces are
derived from a design prototype environment; the method is
applicable with necessary modifications to other sources.
[0147] There are four requirements limiting the application of
parallel program monitoring research to distributed applications.
First, hardware or hybrid monitoring of a distributed application
is not possible because of the geographically dispersed
environment. Therefore, a software monitor approach must be
adopted.
[0148] Secondly, a strategy for minimising tracing overhead is
required. Parallel program monitors have used several strategies.
The simplest strategy is to enable trace sensors at run-time. A
more elaborate strategy is the on-line control of the program and
monitoring overhead as it executes. Examples are used in Falcon,
Paradyn, and Pablo. These strategies are difficult to apply to
distributed applications because software components are not known
in advance, making instrumentation adjustments a priori impossible.
Angio tracing uses a different strategy; event recording is enabled
per application during execution, and other applications which are
executing simultaneously do not necessarily have events recorded.
[0149] When distributed applications are considered in isolation,
tracing should be used; however, most parallel program monitors use
sampling. Sampling is justified in a parallel programming
environment because parallel applications have a static structure
and run in isolation. This is not true of distributed applications,
where the sampled data values can be attributed to incorrect
applications, since applications execute simultaneously and share
resources. Angio tracing is tailored for monitoring distributed
applications by tracing.
[0150] Another concern is the need for ordering recorded events
once tracing is done. Ordering recorded events in a distributed
system is difficult for two reasons. First, a global clock is not
available because the system is geographically dispersed. Secondly,
perfectly synchronised clocks local to each object are not possible
because of poor clock granularity, poor clock synchronisation,
clock drift, or unpredictable communication delay. This is well
known in the prior art.
[0151] Although the precedence relation is useful for system-wide
analysis it is not useful for analysing execution of a distributed
application. The first limitation of the precedence relation is
that it does not distinguish between blocking and non-blocking
communication protocols; it assumes all communication is
non-blocking. Secondly, it introduces ordering relationships
between events from different scenarios, treating independent
scenarios as if they were part of a single, system-wide scenario.
As mentioned previously, this is because the precedence relation
does not distinguish between different scenarios.
[0152] Angio tracing overcomes these limitations by using a special
scenario precedence ordering relation that also recovers potential
causality. This scenario precedence ordering relation is used to
answer a particular class of questions, such as: "Does an event
happen before another event in scenario A?" The precedence ordering,
by contrast, answers questions such as, "Did an event occur before
another event in the system?"
[0153] Angio tracing is useful for monitoring a distributed system
which heretofore has been a difficult environment to monitor. A
distributed system is composed of geographically dispersed,
heterogeneous hardware with a set of executing, concurrent software
objects, which are referred to as objects. A distributed
application is a subset of objects that interact in a dynamic,
coordinated fashion solely by point-to-point message communication
with finite but unpredictable delay. The communication protocol is
assumed to be reliable, but first in first out (FIFO) ordering of
message delivery is not assured.
[0154] A system that executes distributed applications differs from
a classical distributed system model. Some differences are: several
different applications or instances of the same application can
execute simultaneously sharing the software resources (objects) and
hardware resources; object execution is periodic, beginning when a
service request message is accepted and ending when the service
request is satisfied; an object's lifetime may extend beyond that
of an application; an object can be added or removed so the
software structure is dynamic; and, communication links between
objects are dynamically established. The communication is at least one
of blocking (i.e., Remote Procedure Call) and non-blocking (i.e.,
asynchronous). DCE RPC, CORBA, Java, mobile agents, HTTP requests,
and web services are examples of technologies used to build
distributed applications. Angio tracing accommodates the above
noted differences and supports tracing using these technologies as
well as others.
[0155] Angio tracing characterises execution of a distributed
application independent of other executing applications. To ensure
that trace event information properly captures concurrency and
event ordering, angio tracing was derived from a formal model
called the scenario event graph. The scenario event graph is a
scenario graph language, with typed nodes and edges, which fully
describe an execution of an application or the behaviour cycles in
an application that are independent of each other. The relationship
between the scenario event graph language and angio tracing is
significant in implementing a method according to the invention.
Essentially, appropriate event recording requires some knowledge of
information necessary to produce a desired output. The scenario
event graph language provides a formal model from which many
different output views or data sets are determinable and, therefore,
the scenario event graph language is a desirable model. As set out
below, angio tracing supports the formation of models in the
scenario event graph language. The scenario event graph language is
described in more detail below.
[0156] Properties of angio tracing which make it unique follow. It
is a new type of logical clock allowing reconstruction of a
scenario and precedence causal ordering of events for each
distributed application. An angio trace is capable of
transformation for analysis into a model using the scenario event
graph language. For example, an angio trace is used to
automatically generate a performance model of a distributed
application. An angio trace characterises communication protocol
elements.
[0157] Angio tracing, as herein disclosed, is successfully
implemented in experimental systems in the following environments:
a functional prototyping environment, a commercial prototyping
environment, a distributed software system simulator called
Parasol, coarse-grained UNIX operating system processes, in the DCE
RPC environment using data collected by the POET debugger, and on
the Microsoft Windows platform.
Event Graph Types
[0158] Three approaches to formally characterising the execution of
a distributed application are: a partial order, a regular
expression language, or a graph language. A partial order
characterises concurrency but it is difficult to characterise
blocking interactions or synchronisation between objects. The most
frequently used partial ordering relation, the precedence relation,
is discussed in detail above.
[0159] A regular expression language characterises blocking and
synchronisation but it loses information about software structure
and concurrency because applications are described by event
interleaving. Two regular expression languages are path expressions
and flow expressions.
[0160] The scenario event graph is a graph language for
characterising a distributed application that overcomes limitations
of prior art characterisation methods. The scenario event graph
language has labelled nodes that are types of application events
and labelled, directed edges that are different types of causal
relationships. It characterises communication protocol elements and
object concurrency during application and system execution. The
communication protocol elements are: blocking request initiation,
non-blocking request initiation, request acceptance,
synchronisation acceptance, sending a reply to a blocking request
initiation, and acceptance of the reply.
[0161] The scenario event graph's implementation of scenario
causality is fully described in:
[0162] C. E. Hrischuk. Trace-based Load Characterization for the
Automated Development of Software Performance Models. Ph.D. thesis,
Carleton University, Ottawa, Canada, 1998.
[0163] C. Hrischuk, C. M. Woodside. "Logical clock requirements for
reverse engineering scenarios from a distributed system." Accepted
for publication in IEEE Trans. on Software Engineering.
[0164] The scenario event graph language is the basis for the angio
trace specification because there is a correspondence between
elements of the graph language and the angio-trace specification.
To better understand the properties of angio tracing a brief
description of the scenario event graph language is given here.
[0165] The scenario event graph language combines two types of
graphs to describe object, process, and system execution. It
characterises a process's execution as a process event graph.
Object execution is characterised by an object event graph.
According to the scenario event graph language these two points of
view are combined as a Scenario Event Graph (SEG), which has more
information than the graphs considered in isolation. The graphs are
causal models, where the nodes are recorded events and an edge
identifies a causal relationship between two nodes.
Object Event Graph
[0166] The object event graph characterises periodic execution of
an object. An object satisfies service requests of other objects
one at a time, with the subsequent processing of each request being
described as a service period. An object event graph consists of a
sequence of linear sub-graphs, one for each service period. Each
service period is also a linear sub-graph of object activities. An
object event graph has a beginning, but it may not have an end;
this occurs, for example, when an object continuously operates.
[0167] An object event graph is composed of two types of nodes and
edges as follows:
[0168] "Period start" node: the object has started a new service
request period and this is the first node.
[0169] "Object activity" node: a node that represents an activity
that the object performed.
[0170] "Object's next node" edge: its target is the node in the
same object period that succeeds the source node.
[0171] "Object's next period" edge: its source is the last node of
an object's period and its target is the period start node of the
object's next service period.
[0172] The target of an object's next period edge is a service
period in the same process or in a different process. So, the next
period edge sometimes connects different process event graphs
together, characterising system execution, provided there are
objects which are common to the processes.
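By way of illustration only, the object event graph described above may be sketched in Python as follows. The class name ObjectEventGraph and the methods add_period and add_activity are hypothetical and are not part of the disclosure; the sketch merely shows one possible way to hold period start nodes, object activity nodes, and the two edge types as a sequence of linear sub-graphs.

```python
from dataclasses import dataclass, field

PERIOD_START = "period start"
OBJECT_ACTIVITY = "object activity"
NEXT_NODE = "object's next node"
NEXT_PERIOD = "object's next period"


@dataclass
class ObjectEventGraph:
    """Hypothetical container for the service periods of one object."""
    object_id: str
    nodes: list = field(default_factory=list)  # (period index, node type, label)
    edges: list = field(default_factory=list)  # (source index, target index, edge type)

    def add_period(self, label=""):
        """Start a new service period and link it to the previous period's last node."""
        period = self.nodes[-1][0] + 1 if self.nodes else 0
        self.nodes.append((period, PERIOD_START, label))
        new = len(self.nodes) - 1
        if new > 0:
            self.edges.append((new - 1, new, NEXT_PERIOD))
        return new

    def add_activity(self, label):
        """Append an activity node to the current service period."""
        if not self.nodes:
            raise ValueError("an activity needs a period start node first")
        period = self.nodes[-1][0]
        self.nodes.append((period, OBJECT_ACTIVITY, label))
        new = len(self.nodes) - 1
        self.edges.append((new - 1, new, NEXT_NODE))
        return new


# Two service periods of a hypothetical object named "server".
g = ObjectEventGraph("server")
g.add_period("request 1")
g.add_activity("read")
g.add_period("request 2")
g.add_activity("write")
```

In this sketch each period is a linear chain joined by next node edges, and consecutive periods are joined by a single next period edge, which corresponds to the structure set out in paragraphs [0166] to [0171].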
[0173] There are four types of roles that an object assumes in a
process. A role limits node connection types as indicated by the
column in Table 1 called "Allowed Protocol Role". The first role
type is an initiating object, where requests for services from
other objects are communicated. The second role type is as a
responding object, where acceptance of a service request from an
initiating object occurs and the service request is satisfied. The
third role type is as a forwarding object, where a service request
is accepted, some processing is performed, and then the service
request is forwarded to another forwarding object for further
processing. The fourth role type is as a replier object, where a
responding object sends a reply back to a blocked initiating object
to indicate that its service request has been satisfied.
Process Event Graph
[0174] A process event graph characterises execution of a process
as an attributed, edge-labelled, binary, finite, directed, acyclic
graph. Each concurrent thread of execution is a linear sub-graph
called a process thread. Each process thread is also a linear
sub-graph of process activities. For example, when an application
has several objects interacting by blocking RPC, the process event
graph is a single process thread because there is no concurrent
execution. When a process event graph has concurrent process
threads, special node and edge types are used to characterise
causal relationships between process threads.
[0175] The process event graph node types are as set out below.
[0176] "External" node: a marker for the external initiation of a
process. A process may have more than one external node.
[0177] "Thread begin" node begins a process thread.
[0178] "Process activity" node has an attribute to store process
information.
[0179] "And-fork" node forks a new process thread to characterise
the introduction of logical concurrency.
[0180] "And-join" node joins two process threads into a single
thread of execution.
[0181] "Thread end" node finishes a process thread.
[0182] All of the node types, except the activity node type, are
considered atomic, having no duration, allowing chaining of nodes
to describe complex interactions between objects.
[0183] The different edge types of a process event graph are as
follows:
[0184] "Start the process" edge (st): its source is an external
node and its target is the thread begin node of the first process
thread.
[0185] "Process thread's next node" edge: its target is the next
node in the same process thread that succeeds the source node.
[0186] "Process thread's fork" edge (f): its source is an and-fork
node and its target is the thread begin node of the forked
thread.
[0187] The default edge type is the "Process thread's next node"
edge which is abbreviated to next process edge.
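By way of illustration only, the node and edge types listed above may be sketched in Python as follows. The names ProcessEventGraph, add_node, add_edge, and EDGE_RULES are hypothetical. The sketch checks only the source and target constraints stated above for the start edge and the fork edge, and treats the next edge as the default; no further connection rules are assumed.

```python
from dataclasses import dataclass, field

NODE_TYPES = {"external", "thread begin", "process activity",
              "and-fork", "and-join", "thread end"}
# Required (source type, target type) for the two restricted edge types.
EDGE_RULES = {"start": ("external", "thread begin"),
              "fork": ("and-fork", "thread begin")}


@dataclass
class ProcessEventGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source, target, edge type)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, target, edge_type="next"):
        """Add an edge; "next" is the default edge type."""
        if edge_type != "next":
            if edge_type not in EDGE_RULES:
                raise ValueError(f"unknown edge type: {edge_type}")
            expected = EDGE_RULES[edge_type]
            actual = (self.nodes[source], self.nodes[target])
            if actual != expected:
                raise ValueError(
                    f"{edge_type} edge must run {expected[0]} -> {expected[1]}")
        self.edges.append((source, target, edge_type))


# One externally started thread that forks a second, concurrent thread.
peg = ProcessEventGraph()
for node_id, node_type in [("x", "external"), ("b0", "thread begin"),
                           ("a0", "process activity"), ("f", "and-fork"),
                           ("e0", "thread end"), ("b1", "thread begin"),
                           ("e1", "thread end")]:
    peg.add_node(node_id, node_type)
peg.add_edge("x", "b0", "start")
peg.add_edge("b0", "a0")
peg.add_edge("a0", "f")
peg.add_edge("f", "e0")          # next edge within the parent thread
peg.add_edge("f", "b1", "fork")  # fork edge to the child thread
peg.add_edge("b1", "e1")
```

The and-fork node in this example has two outgoing edges, one next edge and one fork edge, which is consistent with the graph being binary.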
[0188] The execution of a single program statement is described by
a sub-graph to separate a program statement identifier from its
effect on the process behaviour. To ensure consistency of
representation, several rules govern introduction of a sub-graph.
First, if a program statement is characterised by a sub-graph of
and-fork node(s) and an activity node, the activity node is the
first node in the sub-graph. Conversely, if a program statement is
characterised by an activity node with a begin node or and-join
node(s), the activity node is the last node in the sub-graph.
Scenario Event Graph (SEG)
[0189] A scenario event graph combines a process event graph with
several object event subgraphs. We start from the process subgraph
of the scenario, and superpose those parts of the object subgraphs
representing service periods within the scenario as overlays. Two
nodes representing the same event are merged, and have a dual type,
one type for the scenario subgraph and one for the object subgraph.
Similarly where an edge exists in both the scenario and the object
subgraphs it has a dual type, one type for each.
[0190] Symbols of the SEG are shown in FIG. 7. No icon is provided
for the default node type, object activity node. Figures showing SEGs
follow several conventions: time proceeds from left to right, and
the consecutive nodes of an object are at the same vertical
level.
[0191] The interpretation of a SEG restricts the manner in which
nodes and edges are connected. Causal relationships during
execution restrict the node and edge connections. For completeness,
object period start edges are shown where they may occur.
[0192] RPC, synchronisation, and asynchronous communication
protocols are also characterised by the following elements:
[0193] Blocking request initiation: An object cannot proceed until
it receives a reply to a request it has just made;
[0194] Non-blocking request initiation: An initiating or forwarding
object makes a service request to another object and the initiating
object does not block to wait for a reply;
[0195] Request acceptance: A blocked responding object accepts a
new service request and begins a new period;
[0196] Synchronisation acceptance: A responding object is already
processing a service request but it is blocked, waiting to accept
another message to continue the service;
[0197] Sending a reply to a blocking request: A replier object
sends the reply to the blocked initiating object; and,
[0198] Acceptance of a reply: A blocked initiating object receives
the reply and continues execution.
[0199] With a formal specification of the scenario event graph
language, a deduction of the information required to generate a
scenario event graph model from a trace is possible. Trace
requirements are discussed below with reference to angio
tracing.
A Global Event Graph
[0200] Finally, where the event records include many scenarios, a
global event graph is defined as the superposition of all of the
object event graphs and scenario event graphs in the system.
[0201] The global event graph is important because it means several
scenario causal descriptions can be combined to characterize
precedence causality. However, precedence causality cannot
characterize scenario causality.
Angio Trace
[0202] An angio trace provides a precedence ordering of separate
sets of execution related information - object level information
and process level information. These are easily visualised as two
graphs related to each of two time stamp values. The ordering of
events is achieved by a set of ordering relations and event
predicates. An event predicate identifies a type of an event and it
serves as a guard condition for selecting an ordering relation. Once
an ordering relation is selected, event ordering for two events is
established. Essentially, during tracing sufficient information is
collected to allow for determination of event ordering according to
scenario and precedence causality.
[0203] An angio trace is defined as:
$$G_{\mathrm{Trace}} = (N, \Sigma_n, M_n, P, \Omega)$$
[0204] where
[0205] $N$ is a set of recorded events;
[0206] $\Sigma_n$ is the alphabet of event time stamps;
[0207] $M_n : N \to \Sigma_n$ are the rules for assigning
time stamp values to events;
[0208] $P$ is a set of event predicates; and
[0209] $\Omega$ is the mapping of a predicate to one, or more, valid
ordering relations.
[0210] To develop the two graphs, each angio trace event records an
object time stamp for the object event graph and a process time
stamp for the process event graph. Before describing each of these
time stamp values, the logical clock requirements satisfied thereby
are outlined.
[0211] There are three properties that the time stamp values have
when used as a logical clock. Firstly, each time stamp has a unique
value or the event ordering relations provide a default scheme for
ordering events with identical time stamps. According to an
embodiment of the invention each time stamp value is unique.
Secondly, the time stamp values are monotonically increasing,
although there may be gaps in the time stamp values. For example,
the process time stamps are sequentially indexed so that missing
events are easily detected. The object time stamp value is allowed
to have gaps. An additional property needed for angio tracing is
that the two time stamp values of successive events in the same
object are synchronised: two events A and B cannot have time stamp
values where the object time stamps indicate that event A occurred
before event B and the process time stamp indicates that event A
occurred after event B.
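By way of illustration only, the three time stamp properties described above may be checked over a recorded trace as sketched below in Python. The function names check_clock and missing_indices and the flat tuple layout of an event are hypothetical. As a simplifying assumption, the synchronisation check compares only events that share both an object and a process thread, since ordering across process threads requires the additional event type information described later.

```python
from itertools import combinations

# Each event is a tuple:
# (object id, period index, object event index,
#  process name, thread id, thread event index)


def check_clock(events):
    """Return a list of violations of uniqueness and synchronisation."""
    problems = []
    object_stamps = [e[:3] for e in events]
    process_stamps = [e[3:] for e in events]
    if len(set(object_stamps)) != len(object_stamps):
        problems.append("object time stamps are not unique")
    if len(set(process_stamps)) != len(process_stamps):
        problems.append("process time stamps are not unique")
    for a, b in combinations(events, 2):
        same_object = a[0] == b[0]
        same_thread = a[3:5] == b[3:5]
        if same_object and same_thread:
            object_order = (a[1], a[2]) < (b[1], b[2])
            process_order = a[5] < b[5]
            if object_order != process_order:
                problems.append(f"time stamps disagree for {a} and {b}")
    return problems


def missing_indices(events):
    """Report gaps in the sequential thread event index of each thread."""
    by_thread = {}
    for e in events:
        by_thread.setdefault(e[3:5], []).append(e[5])
    gaps = {}
    for thread, indices in by_thread.items():
        expected = set(range(min(indices), max(indices) + 1))
        missing = sorted(expected - set(indices))
        if missing:
            gaps[thread] = missing
    return gaps
```

Because the process time stamps are sequentially indexed, a gap reported by missing_indices suggests a lost event, whereas gaps in the object event index are permitted and are therefore not reported.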
[0212] An object time stamp consists of a unique object identifier
for each object event graph; an object period index that is a
counter ordering service periods of an object; and an object event
index that is a value ordering events within a service period.
[0213] Object time stamp monotonicity is a result of period and
event index values always increasing. The object identifier
provides uniqueness of the time stamp values.
[0214] A process time stamp consists of a unique process name that
associates an event with an application's scenario; a unique
process thread identifier that is assigned as a process thread
begins; a thread event index ordering events of a process thread;
and event type information for ordering process threads.
[0215] The process time stamp monotonicity is provided by the
thread event index values always increasing. Uniqueness of a
process time stamp is provided by the process name and process
thread identifier. Process thread identifiers are unique within the
scope of a process name and the process name must be globally
unique.
[0216] The event type information of the process time stamp closely
follows node types of the scenario event graph as set out below.
[0217] External event (Ex): is a marker for the external
initiation of a process.
[0218] Process thread begin event (Be): identifies the start of a
process thread.
[0219] Process activity event (Ac): records an identifier for an
action taken or the executed program statement.
[0220] Process thread fork event (Fk): connects a child process
thread with its parent process thread.
[0221] Process thread half-join event (HJo): signals the end of the
current process thread but not the service period of the object.
[0222] Process thread end event (En): indicates an end of the
process thread and the object's service period.
[0223] The process thread begin event, process thread fork event,
and process thread half-join event are each recorded with additional
information, alongside the event type, to order process threads.
[0224] A fork event results in and is the cause of two subsequent
events; one is placed in the same process thread and the other is
taken as the beginning of a new child process thread. To identify
the child process thread the fork event results in recording a new
process thread identifier.
[0225] A half-join event differs from a scenario event graph
language and-join node. In the scenario event graph language, the
and-join node is a target of and preceded by two process threads.
In angio tracing, half-join events are the cause of and precede a
new process thread that results from the joining of two process
threads. The joining process threads end with half-join events.
[0226] The event notation that is used combines the object time
stamp and the process time stamp as follows: An event e has the
time stamp values
$$e = \begin{bmatrix} \text{ProcessEventGraph} \\ \text{ObjectEventGraph} \end{bmatrix} = \begin{bmatrix} j, k, m, l \\ i, c, v \end{bmatrix},$$
[0227] j is the process name for each process event graph,
[0228] k is the process thread identifier,
[0229] m is the thread event index,
[0230] l={Ex, Be, Ac, Fk, HJo, En} is the event type information
including information
[0231] specific to each event type,
[0232] i is the object identifier for each object event graph,
[0233] c is an object service period index, and
[0234] v is an object event index.
[0235] A process thread is identified with the process scenario
name and the process thread identifier, such as
|j,k|. If an object-oriented system is being
monitored then the object identifier should include class name and
instance number of an executing object.
[0236] Some fields require a particular initialisation value. These
values are specified as v.sub.0 for the object event index, c.sub.0
for the object period index and m.sub.0 for the thread event index;
these initial values are commonly initialised to 0 or to 1.
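By way of illustration only, the event notation above may be sketched as Python data classes. The class names and the constants M0, C0, and V0 are hypothetical; the field names follow the symbols j, k, m, l, i, c, and v defined above, and the initial values are assumed here to be 0.

```python
from dataclasses import dataclass
from enum import Enum

M0 = 0  # assumed initial thread event index
C0 = 0  # assumed initial object period index
V0 = 0  # assumed initial object event index


class EventType(Enum):
    EX = "Ex"    # external event
    BE = "Be"    # process thread begin event
    AC = "Ac"    # process activity event
    FK = "Fk"    # process thread fork event
    HJO = "HJo"  # process thread half-join event
    EN = "En"    # process thread end event


@dataclass(frozen=True)
class ProcessTimeStamp:
    j: str        # process name
    k: str        # process thread identifier
    m: int        # thread event index
    l: EventType  # event type information


@dataclass(frozen=True)
class ObjectTimeStamp:
    i: str  # object identifier, for example class name and instance number
    c: int  # object service period index
    v: int  # object event index


@dataclass(frozen=True)
class Event:
    process: ProcessTimeStamp
    obj: ObjectTimeStamp

    def thread(self):
        """Identify the process thread by process name and thread identifier."""
        return (self.process.j, self.process.k)


# The first event of a hypothetical scenario "scenarioA" in object "Server#1".
e = Event(ProcessTimeStamp("scenarioA", "t0", M0, EventType.BE),
          ObjectTimeStamp("Server#1", C0, V0))
```

The data classes are declared frozen so that a recorded time stamp cannot be altered after recording, which is a design choice of this sketch rather than a requirement stated in the disclosure.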
[0237] Information recorded with each event is used by the
following event predicates:
[0238] fork(e, k) True if event e is a fork event that forked the
process thread |j,k|, otherwise it is false. This
is deduced as follows: (1) the parent event e is a fork event type,
(2) event e recorded the child process thread's identifier, and (3)
the child begin event recorded the process execution time stamp of
its parent fork event. To test for a fork event, the process thread
field takes on an arbitrary value--fork(e,-).
[0239] hJoin(E, |j,k|) If process thread
|j,k| is caused by one or more half-join events,
then the half-join events are assigned to set E and the predicate
is true; otherwise it is false. This is deduced as follows: (1)
half-join events are determined based on event types, (2) half-join
events record resulting process thread's identifier, and (3) the
begin event of the resulting process thread records the process
time stamp of its parent half-join event(s).
[0240] isHJoin(e) True if event e is a half-join event; otherwise,
it is false.
[0241] external(e) True if event e is an external event; otherwise,
it is false.
[0242] begin(e) True if event e is a begin event, otherwise it is
false.
[0243] end(e) True if event e is an end event, otherwise it is
false.
[0244] activity(e, V) True if event e is an activity event that
also recorded the process level information V, otherwise it is
false. To test for an activity event, the process information field
takes on an arbitrary value, such as activity(e,-).
[0245] last(i, c, e) True if event e is the last event recorded in
period c of object i, otherwise it is false. This is determined by
traversing the object event graph of object i in period c until the
period index changes or there are no further events recorded for
the object.
[0246] exist(e) True if event e is an event within the trace,
otherwise it is false.
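By way of illustration only, several of the event predicates above may be sketched in Python as follows. The dictionary layout of an event is hypothetical: "type" holds the event type, "process" holds (j, k, m), "object" holds (i, c, v), "child" holds the thread identifier recorded by a fork or half-join event, "parents" holds the process time stamps recorded by a begin event, and "info" holds process level information. Passing None stands in for the arbitrary value written as "-" above.

```python
def fork(trace, e, k=None):
    """fork(e, k): e is a fork event that forked process thread |j,k|."""
    if e["type"] != "Fk":
        return False
    if k is None:  # fork(e, -): any child thread
        return True
    j = e["process"][0]
    return e.get("child") == k and any(
        b["type"] == "Be"
        and b["process"][:2] == (j, k)
        and e["process"] in b.get("parents", ())
        for b in trace)


def h_join(trace, j, k):
    """hJoin(E, |j,k|): return the set E as a list; the predicate holds if non-empty."""
    begins = [b for b in trace
              if b["type"] == "Be" and b["process"][:2] == (j, k)]
    parents = {p for b in begins for p in b.get("parents", ())}
    return [e for e in trace
            if e["type"] == "HJo"
            and e.get("child") == k
            and e["process"] in parents]


def activity(e, info=None):
    """activity(e, V): e is an activity event that recorded information V."""
    return e["type"] == "Ac" and (info is None or e.get("info") == info)


def last(trace, i, c, e):
    """last(i, c, e): e is the last event recorded in period c of object i."""
    period = [x["object"][2] for x in trace if x["object"][:2] == (i, c)]
    return (e in trace and e["object"][:2] == (i, c)
            and e["object"][2] == max(period))


def exist(trace, e):
    """exist(e): e is an event within the trace."""
    return e in trace


# A hypothetical trace: thread t0 forks t1, then both half-join into t2.
fk = {"type": "Fk", "process": ("s", "t0", 1), "object": ("A", 0, 1), "child": "t1"}
be = {"type": "Be", "process": ("s", "t1", 0), "object": ("B", 0, 0),
      "parents": [("s", "t0", 1)]}
ac = {"type": "Ac", "process": ("s", "t1", 1), "object": ("B", 0, 1), "info": "compute"}
hj1 = {"type": "HJo", "process": ("s", "t0", 2), "object": ("A", 0, 2), "child": "t2"}
hj2 = {"type": "HJo", "process": ("s", "t1", 2), "object": ("B", 0, 2), "child": "t2"}
be2 = {"type": "Be", "process": ("s", "t2", 0), "object": ("A", 0, 3),
       "parents": [("s", "t0", 2), ("s", "t1", 2)]}
trace = [fk, be, ac, hj1, hj2, be2]
```

The predicates external, begin, end, and isHJoin reduce to a comparison of the event type and are omitted from the sketch for brevity.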
[0247] These predicates serve as conditions for the event ordering
relations of angio tracing. An angio trace has six event ordering
relations that use the time stamp information. These relations
identify a given event's succeeding or preceding event in the
object event graph or the process event graph. Each relation is
reflexive, antisymmetric, and transitive. The ordering relations
are
$$\Omega \in \{>^{T}, <^{T}, >^{At}, >^{Ao}, <^{At}, <^{Ao}\},$$
[0248] where
[0249] $>^{T}$ orders the succeeding events in an object event
graph,
[0250] $<^{T}$ orders the preceding events in an object event
graph,
[0251] $>^{At}$ orders succeeding process event graph events in
the same process thread,
[0252] $>^{Ao}$ orders succeeding process event graph events
that are not in the same process thread such as a fork event and
its child begin event and a half-join event and its child begin
event,
[0253] $<^{At}$ orders preceding process event graph events that
are in the same process thread, and
[0254] $<^{Ao}$ orders preceding process event graph events that
are not in the same process thread.
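By way of illustration only, the six ordering relations may be sketched in Python as lookups over a trace. The function names and the dictionary layout of an event are hypothetical: "process" holds (j, k, m), "object" holds (i, c, v), and a begin event records the process time stamps of its parent fork or half-join events under "parents". The sketch assumes that each relation identifies the immediately succeeding or preceding event.

```python
def succ_object(trace, e):
    """Next event of the same object, by (period index, event index)."""
    later = [x for x in trace
             if x["object"][0] == e["object"][0]
             and x["object"][1:] > e["object"][1:]]
    return min(later, key=lambda x: x["object"][1:], default=None)


def pred_object(trace, e):
    """Previous event of the same object."""
    earlier = [x for x in trace
               if x["object"][0] == e["object"][0]
               and x["object"][1:] < e["object"][1:]]
    return max(earlier, key=lambda x: x["object"][1:], default=None)


def succ_thread(trace, e):
    """Next event in the same process thread, by thread event index."""
    later = [x for x in trace
             if x["process"][:2] == e["process"][:2]
             and x["process"][2] > e["process"][2]]
    return min(later, key=lambda x: x["process"][2], default=None)


def pred_thread(trace, e):
    """Previous event in the same process thread."""
    earlier = [x for x in trace
               if x["process"][:2] == e["process"][:2]
               and x["process"][2] < e["process"][2]]
    return max(earlier, key=lambda x: x["process"][2], default=None)


def succ_other_thread(trace, e):
    """Begin events of other threads that recorded e as a parent."""
    return [x for x in trace
            if x["type"] == "Be" and e["process"] in x.get("parents", ())]


def pred_other_thread(trace, e):
    """Fork or half-join events recorded as parents by begin event e."""
    return [x for x in trace if x["process"] in e.get("parents", ())]


# A hypothetical trace in which thread t0 forks thread t1.
e0 = {"type": "Be", "process": ("s", "t0", 0), "object": ("A", 0, 0)}
e1 = {"type": "Fk", "process": ("s", "t0", 1), "object": ("A", 0, 1), "child": "t1"}
e2 = {"type": "Be", "process": ("s", "t1", 0), "object": ("B", 0, 0),
      "parents": [("s", "t0", 1)]}
e3 = {"type": "En", "process": ("s", "t0", 2), "object": ("A", 0, 2)}
trace = [e0, e1, e2, e3]
```

In this example the fork event e1 is followed by e3 in its own process thread and by the begin event e2 in the child thread, which corresponds to the two subsequent events of a fork described in paragraph [0224].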
[0255] An angio trace event description of an application's
execution is transformed into a scenario event graph model for
further analysis. This transformation consists of converting events
to nodes, adding edges between nodes, and replacing half-join and
external event types with simplified event types. The conversion of
an event to a node is a one-to-one mapping. There are four
operators that are used to add a labelled edge between two adjacent
nodes.
[0256] nextObject(e.sub.1, e.sub.2) adds a next object edge from
the source node e.sub.1 to the target node e.sub.2;
[0257] nextperiod(e.sub.1, e.sub.2) adds a next period edge from
the source node e.sub.1 to the target node e.sub.2;
[0258] nextAppTh(e.sub.1, e.sub.2) adds a next process edge from
the source node e.sub.1 to the target node e.sub.2; and,
[0259] andFork (e.sub.1, e.sub.2) adds an and-fork edge from the
source node e.sub.1 to the target node e.sub.2.
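By way of illustration only, the four edge operators may be applied to a trace as sketched below in Python. The function build_edges and the dictionary layout of an event are hypothetical, and the rules used to select an operator are assumptions of this sketch: consecutive events of an object receive a next object edge when they share a period and a next period edge otherwise, consecutive events of a process thread receive a next process edge, and a fork event is connected to each begin event that recorded it as a parent.

```python
from itertools import groupby


def build_edges(trace):
    """Return labelled edges as (operator name, source id, target id)."""
    edges = []

    # Object event graph edges: order each object's events by (c, v).
    by_object = sorted(trace, key=lambda e: e["object"])
    for _, group in groupby(by_object, key=lambda e: e["object"][0]):
        group = list(group)
        for e1, e2 in zip(group, group[1:]):
            same_period = e1["object"][1] == e2["object"][1]
            label = "nextObject" if same_period else "nextperiod"
            edges.append((label, e1["id"], e2["id"]))

    # Process event graph edges within a thread: order by m.
    by_thread = sorted(trace, key=lambda e: e["process"])
    for _, group in groupby(by_thread, key=lambda e: e["process"][:2]):
        group = list(group)
        for e1, e2 in zip(group, group[1:]):
            edges.append(("nextAppTh", e1["id"], e2["id"]))

    # Fork edges: a fork event to the begin event of its child thread.
    for e1 in trace:
        if e1["type"] != "Fk":
            continue
        for e2 in trace:
            if e2["type"] == "Be" and e1["process"] in e2.get("parents", ()):
                edges.append(("andFork", e1["id"], e2["id"]))
    return edges


# A hypothetical trace: scenario "s" forks a thread, then scenario "s2"
# starts a new service period in the same object "A".
trace = [
    {"id": 0, "type": "Be", "object": ("A", 0, 0), "process": ("s", "t0", 0)},
    {"id": 1, "type": "Fk", "object": ("A", 0, 1), "process": ("s", "t0", 1),
     "child": "t1"},
    {"id": 2, "type": "Be", "object": ("B", 0, 0), "process": ("s", "t1", 0),
     "parents": [("s", "t0", 1)]},
    {"id": 3, "type": "Ac", "object": ("A", 0, 2), "process": ("s", "t0", 2)},
    {"id": 4, "type": "Be", "object": ("A", 1, 0), "process": ("s2", "t0", 0)},
]
edges = build_edges(trace)
```

The next period edge from event 3 to event 4 in this example connects the process event graphs of two different scenarios through their common object, as described in paragraph [0172].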
[0260] Table 3 shows identifying operators that are invoked to add
edges that are identified by the partial order relations, the node
type, and some additional time stamp information.
[0261] Once edges are added to nodes, graph modifications as shown
in Table 4 are applied to remove angio half-join and angio external
event types, as well as to provide some simplifications of a
resulting model.
[0262] The angio trace representations of the four possible styles
of synchronisation that occur in a scenario event graph are shown in
Table 4. These illustrate how the half-join events are components
of a scenario event graph and-join node.
[0263] It is common in practice for a task to interleave the
processing of several service requests, maintaining state
information for each outstanding request. This is typically
implemented as a responding task polling several message queues and
servicing the first message it finds. Angio tracing can accommodate
this service period interleaving without violating proper time's
assumption that a task's service period should not be interleaved.
This is done using the two instrumentation operators suspend and
resume. They are defined as:
[0264] suspend(ts.sub.i, ts.sub.t) Copies task i's time stamp value
ts.sub.i into the temporary time stamp storage location ts.sub.t. Then
task i's time stamp value is updated by clearing its operation time
stamp, incrementing its task period index, and resetting its task
period index value to c.sub.0.
[0265] resume(ts.sub.t, ts.sub.i) Copies the contents of the
temporary time stamp storage location ts.sub.t into task i's time
stamp value ts.sub.i.
[0266] In each case an activity event is recorded with its
application information set to either "suspend" or "resume", so
that post-processing of the trace can determine if service periods
are being interleaved.
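The suspend and resume operators can be sketched as follows. This is a minimal illustration, not the instrumentation of the invention: the field names, the `trace` list, and the initial index value `C0 = 1` are hypothetical, and `suspend` returns the temporary storage instead of writing into a location passed as a second argument. The text says the task period index is both incremented and reset; the sketch assumes the value that is reset to c.sub.0 is the event index within the new period.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

C0 = 1  # assumed initial index value (the text calls it c.sub.0)


@dataclass
class TaskTimeStamp:
    """Hypothetical layout of a task's time stamp."""
    task_id: str
    period_index: int                # which service period the task is in
    event_index: int                 # position of the next event in the period
    operation_stamp: Optional[str]   # operation (process) time stamp, if any


trace: List[Tuple[str, str]] = []    # (task id, application information)


def suspend(ts_i: TaskTimeStamp) -> TaskTimeStamp:
    """Save ts_i into temporary storage and start a fresh service period."""
    ts_t = replace(ts_i)             # copy into the temporary storage ts_t
    ts_i.operation_stamp = None      # clear the operation time stamp
    ts_i.period_index += 1           # move to a new task period
    ts_i.event_index = C0            # assumption: the reset applies to the event index
    trace.append((ts_i.task_id, "suspend"))
    return ts_t


def resume(ts_t: TaskTimeStamp, ts_i: TaskTimeStamp) -> None:
    """Restore the saved time stamp ts_t into the task's time stamp ts_i."""
    ts_i.period_index = ts_t.period_index
    ts_i.event_index = ts_t.event_index
    ts_i.operation_stamp = ts_t.operation_stamp
    trace.append((ts_i.task_id, "resume"))
```

Post-processing can then scan `trace` for "suspend" and "resume" entries to detect interleaved service periods.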
[0267] The transformation from an angio trace to a scenario event
graph model is known as a valid transformation because the partial
order of the scenario is the same in both cases and the event
ordering does not change; the meaning is preserved because there is
a correspondence from the node connection specifications to the
scenario event graph node connection strategies; and each node
connection specification is unique so there is no conflict and
corresponding non-determinism during the transformation
process.
Scenario Graph Language Verification of Properties
[0268] The scenario graph language presented and defined herein is
known to be complete and sound. This is important for several
reasons. First, it ensures that all possible executions will be
able to be interpreted for analysis with known semantics. Second,
all necessary information will be captured in an unambiguous,
non-contradictory fashion: no information is missing. Third, it
ensures that only the necessary information is recorded so there is
no redundancy in the data or extra overhead for data capture.
Lastly, it facilitates automated analysis techniques because of the
previous reasons. This permits its use in a wide variety of
situations. Such a complete and sound graph language statement set
is preferred.
[0269] The scenario graph language's node connections of Table 1
are complete and sound. They capture the valid ways to connect
nodes and maintain causal relationships of a distributed
application. A proof of correctness is described in both of
the following references, which are herein incorporated by
reference:
[0270] C. E. Hrischuk. Trace-based Load Characterization for the
Automated Development of Software Performance Models. Ph.D. thesis,
Carleton University, Ottawa, Canada, 1998.
[0271] C. Hrischuk, C. M. Woodside. "Logical clock requirements for
reverse engineering scenarios from a distributed system." Accepted
for publication in IEEE Trans. on Software Engineering.
Example Scenario Event Graphs
[0272] Example sub-graphs are now provided for RPC,
asynchronous, synchronisation, and forwarding interactions.
[0273] A SEG of an RPC is shown in FIG. 9. In an RPC interaction,
the responding object is also the replier object. The process event
graph resembles a procedure call graph if the objects were
procedures.
[0274] A SEG of an asynchronous interaction is shown in FIG.
10.
[0275] A synchronisation interaction occurs when the synchronising
object has started a service period and it must accept another
message to continue execution. There are four possible ways a
synchronisation occurs. The first case is where the message was
sent using a blocking communication protocol (shown in FIG. 11).
The second case is where the initiating object used an asynchronous
communication protocol (shown in FIG. 12). The third case occurs
where a blocked initiating object receives its reply to a service
request that used an RPC communication protocol (shown in FIG. 13).
The procedure call graph analogy breaks down in this case because
there are two concurrent threads of execution, since the responding
object continues execution after sending the reply. This third case
is characterised as a new process thread being forked for the
reply. The last synchronisation case involves an external event
being accepted (shown in FIG. 14).
[0276] A forwarding interaction involves an initiating object, a
responding object that receives the initiating object's request,
other responding objects that forward the request in an object
pipeline, and a replier object. An example is shown in FIG. 15,
where: the initiating object (Object A) sends an RPC request and
blocks, the first responding object (Object B) processes the
request, and forwards it to another responding object (Object C),
Object C processes the request further and forwards it to Object D
which replies to the initiating object.
Model Transformation Using Graph Rewrite Rules
[0277] A graph rewrite operation occurs by finding a sub-graph,
identifying adjacent nodes and edges to the selected sub-graph, and
then replacing the identified sub-graph with another, ensuring that
the adjacent nodes and edges are undisturbed by the embedding of
the new graph. In Table 4, graph rewriting rules are shown. In
Table 4, the adjacent nodes and edges are numbered the same in the
identification and replacement sub-graphs to ensure the embedding
operation does not alter the adjacent nodes.
[0278] A graph rewrite rule preserves those nodes and their
modified attribute values and adjacent edges. Graph rewriting
operations are used to simplify a scenario event graph model during
analysis as well as to establish graph properties. FIG. 16 provides
two examples of this. In the first example, if the sub-graph to
replace is found then it is proven that the sub-graph has that
property. In the second case, the graph is rewritten and
simplified, ready for another set of graph rewriting rules to prove
a property or simplify the model.
[0279] A scenario event graph model is analysed or translated into
a domain specific model. An analysis is done by first describing
the properties to be assessed as a sub-graph template, which is
then compared with the host scenario event graph model using an
algorithm supplied by an analyst. A sub-graph template has
variables and values. Translation of a scenario event graph model
from one domain to another begins similarly to analysis, except that
a second sub-graph is supplied replacing each occurrence of a first
sub-graph in the host scenario event graph model.
[0280] An example of this approach is shown in FIG. 16, which is a
graph rewriting operation for simplifying a scenario event graph
model. In this example, an RPC interaction occurs using
asynchronous messages. By removing unnecessary nodes and replacing
arcs, this is simplified. The input sub-graph template uses the
numbered nodes to establish glue points to embed the output
sub-graph template. The algorithmic graph grammar approach is ideal
for this purpose and it is supported by a graph rewriting
specification language and tool set called PROGRES.
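A simplifying rewrite of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions and is neither the rule of FIG. 16 nor PROGRES syntax: the function name `splice_out`, the edge-set representation, and the node type labels are hypothetical. It removes each node of a chosen type that has exactly one incoming and one outgoing edge and joins its two neighbours directly, so the neighbouring (glue) nodes are left undisturbed.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def splice_out(nodes: Dict[str, str], edges: Set[Edge], kind: str) -> Set[Edge]:
    """Remove pass-through nodes of type `kind` and reconnect their neighbours.

    `nodes` maps node name -> node type and is updated in place.
    A new edge set is returned; the input set is not modified.
    """
    edges = set(edges)
    for name in [n for n, t in nodes.items() if t == kind]:
        preds = [a for a, b in edges if b == name]
        succs = [b for a, b in edges if a == name]
        # Only rewrite the simple pattern: pred -> name -> succ, no self-loop.
        if len(preds) == 1 and len(succs) == 1 and preds[0] != name:
            edges -= {(preds[0], name), (name, succs[0])}
            edges.add((preds[0], succs[0]))   # replacement arc between glue nodes
            del nodes[name]
    return edges
```

For example, with nodes `{"1": "send", "x": "relay", "2": "receive"}` and edges `{("1", "x"), ("x", "2")}`, calling `splice_out(nodes, edges, "relay")` leaves the single edge `("1", "2")`.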
Angio Trace Instrumentation and Time Stamps
[0281] The instrumentation for a method of the invention for use
with an unreliable monitor is shown in Table 5 (Instrumentation
Specification for an Unreliable Monitor), and that for a reliable
monitor is shown in Table 6 (An Optimized Instrumentation
Specification for a Reliable Monitor).
[0282] A principle that governs implementation of angio tracing for
a reliable monitor is to minimise the data recorded. There are
several approaches that are used for an optimised implementation.
First, only one event is recorded per instrumentation item, which
requires that event type information be combined together. The
event identifier syntax of Table 5 is still used but merged events
will have combined subscripts. For example, two events e.sub.2 and
e.sub.3 are described as the merged event e.sub.2,3. Secondly, only
the time stamp fields that change between events are recorded.
Thirdly, only one ordering direction is recorded because the
reverse ordering can be deduced by post-processing.
[0283] For an implementation description with a reliable monitor,
the monitor has several characteristics. First, each object's
events are stored serially, in-order. Optionally, different objects
may store their events to the same buffer, so that events from
different objects are stored in an interleaved fashion. Secondly,
the monitor is able to detect missing events or guarantee that no
events went missing during recording. Clocks local to each
processing node need not be synchronised.
[0284] An object time stamp consists of an object identifier, an
object period index, and an object event index. There are several
optimisations for these time stamp fields. The object identifier is
recorded with each event because the monitoring system is recording
values for several objects simultaneously and interleaving the
events. The object identifier is used to separate the events during
post-processing.
[0285] The object periods are sequentially ordered because object
events are serially recorded. The object period values need not be
recorded with each event, but they are recorded when an object
period ends. In this fashion, a change in an object period value
means that a new object period has started and the object index of
the succeeding event is reset to one. The object index values are
not necessarily recorded because all object events are recorded
sequentially; object index values are determinable from this
ordering.
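The way omitted object time stamp fields can be recovered in post-processing may be sketched as follows. This is an illustration only: the record layout (an object identifier plus a flag marking the event that ends an object period) and the function name are assumptions, standing in for the recorded object period value described above. Indices begin at one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One recorded event: (object identifier, True if this event ends an object period).
Recorded = Tuple[str, bool]


def rebuild_object_stamps(stream: Iterable[Recorded]) -> List[Tuple[str, int, int]]:
    """Return (object id, period index, event index) for each event in `stream`.

    Events of different objects may be interleaved; the object identifier
    separates them, and the serial order supplies the missing indices.
    """
    period: Dict[str, int] = defaultdict(lambda: 1)
    index: Dict[str, int] = defaultdict(lambda: 1)
    stamps = []
    for obj, ends_period in stream:
        stamps.append((obj, period[obj], index[obj]))
        if ends_period:
            period[obj] += 1     # a new object period starts with the next event
            index[obj] = 1       # and its event index is reset to one
        else:
            index[obj] += 1
    return stamps
```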
[0286] A process time stamp consists of a process name, a process
thread identifier, a thread event index, and event type
information. Optionally, each of these values is optimised as
follows. Process name is recorded by external events as long as the
process thread identifiers are globally unique, because the process
name does not change throughout a process. Process thread
identifier is recorded when a message is received since that is the
only time it changes. Thread event index is changed after sending
or receiving a message, so that an order of events in different
objects is determinable.
[0287] The angio trace is unique because it has two time stamps used
to establish an event ordering. One of those time stamps is
dedicated to providing a scenario time stamp to order the events in
the scenario.
[0288] An angio trace event is recorded by instrumentation embedded
within an application that interfaces to the program monitor. A
minimal set of instrumentation primitive operations must be
supported by the program monitor and they are described here.
[0289] Process time stamp information is added to a message before
it is sent to implement the ordering relations
{>.sup.A0,<.sup.A0}. A message carries the process time stamp
S.sub.1 of the sending object's event that is the cause of the
receive event. The process time stamp S.sub.2 will replace the
current process time stamp value of the receiving object. The
sending object is responsible for generating S.sub.2.
[0290] The monitor provides four operators. The first two are used
to manipulate the time stamp values of a message. The operators are:
[0291] send(e, S.sub.1, S.sub.2) appends the process time stamps
S.sub.1 and S.sub.2 to the message that is associated with the send
event e;
[0292] rcv(e, S.sub.1, S.sub.2) retrieves the process time stamps
S.sub.1 and S.sub.2 that were sent with the message received by
event e;
[0293] record(e) atomically records and stores the event e; and
[0294] unique(x) assigns a globally unique value to variable x.
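A minimal sketch of a monitor offering these four primitives is given below. It is an assumption-laden illustration, not the monitor of the invention: the operator that appends time stamps to an outgoing message is named `send` here, a message is modelled as a dictionary with a hypothetical "stamps" field, the event argument is omitted, and `unique` is only unique within one running monitor instance (a distributed system would need a stronger scheme).

```python
import itertools
import threading
from typing import Any, Dict, List, Tuple

_lock = threading.Lock()
_counter = itertools.count(1)
event_log: List[Any] = []


def send(message: Dict[str, Any], s1: Any, s2: Any) -> None:
    """Append process time stamps S1 and S2 to an outgoing message."""
    message["stamps"] = (s1, s2)


def rcv(message: Dict[str, Any]) -> Tuple[Any, Any]:
    """Retrieve the process time stamps carried by a received message."""
    return message["stamps"]


def record(event: Any) -> None:
    """Record and store an event atomically (guarded by a lock)."""
    with _lock:
        event_log.append(event)


def unique() -> int:
    """Return a value not handed out before by this monitor instance."""
    with _lock:
        return next(_counter)
```

On receipt, the second time stamp returned by `rcv` would replace the receiving object's current process time stamp, as described in paragraph [0289].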
[0295] The instrumentation is listed in Table 5 as well as Table 6.
The instrumentation is defined by the last three columns. The
instrumentation defines the time stamp information that each event
must record and not how the instrumentation is to be coded or
executed. Each row in the table should be interpreted as follows:
"if the precondition values of object i.sub.1 are met, then execute the
instrumentation primitives to record the identified events." The
instrumentation for the suspension and resumption of a process
thread is presented as (P) and (Q). The conventions that are used
to describe the instrumentation follow.
[0296] The documentation columns of the table are the columns
"Event Connection Interpretation" and "Instrumentation Comments".
The "Event Connection Interpretation" column describes the purpose
of the specification. The "Instrumentation Comments" column details
the finer points of each event connection specification,
identifying the purpose for recording each event.
[0297] The "Recorded Event Observations" is the most important
column because it illustrates the recorded events and their
ordering, using the conventions of the scenario event graph, the
time stamp field values, and the icon for the half-join event. The
illustrations show the recorded events with the angio trace time
stamp information overlaid against the scenario event graph edges
and nodes (where applicable). For the sake of simplicity there are
two exceptions. First, the icon is added to represent any type of
event provided it has the specified time stamp field values.
Secondly, the domain data recorded with the activity event
information is not shown because it is domain dependent.
[0298] Each illustration identifies the recorded events, as well as
their preceding and succeeding event. Events in dashed boxes are
the events recorded by the instrumentation and, if there is more
than one, they are recorded together atomically. The time stamp
field values of the recorded events are the actual values that are
recorded.
[0299] In all the illustrations, event e.sub.1 precedes the event(s) to
be recorded by the instrumentation. The events which may succeed
the recorded events are also shown. In some cases, boundary
conditions exist. The illustrations will show additional events to
describe the boundary condition, but these events will not be
recorded in all cases. For example, the source or target node of an
object's next period edge may not exist.
[0300] The table refers to recording events for the object i.sub.1 and
its instrumentation state vector may be used to determine which
events to record. The "Precondition State of Object i.sub.1" column
lists the predicates and conditions which must all be true for the
instrumentation primitives to be executed. This state information
is the object's instrumented state just prior to recording the
events; it is not the state of the object when event e.sub.1 was
recorded.
[0301] The executed instrumentation primitives are described in a
column of the same name. The instrumentation primitives also
identify object i.sub.1's instrumentation state vector values after
executing the primitives.
[0302] The time stamp notation is as follows. A field value may be
a symbolic, subscripted variable. Variables in the same event
connection specification with the same subscripts have the same
value. A time stamp field value with a "-" can take on any value. A
field with the place holder value "-" can take on the empty value.
All time stamp values are natural numbers, beginning at one.
Time Stamp Optimizations
[0303] Optimisations for each event type are as follows:
[0304] External event always has an index value of one so the index
values need not be stored.
[0305] Begin event always has an index value of one as well.
[0306] Activity event is a default event type, therefore, an
activity event type label is not recorded.
[0307] The information recorded for a Half-join event is reduced if
the object time stamp information is used.
[0308] Fork event is not recorded because the corresponding begin
event of the child process thread will provide ordering
information.
[0309] End event is not recorded since it occurs predictably and
its occurrence is determinable through post-processing.
[0310] According to one particular embodiment, although the event
connection specifications define the events and their time stamp
values to be recorded, some implementation considerations must be
addressed. To minimise the burden of instrumentation on the
analyst, it is intended that the angio trace instrumentation is
embedded in system activities for that particular environment. Then
the analyst's instrumentation effort is limited to identifying
external events and object periods. There are also standardisation
concerns for use in a heterogeneous environment.
[0311] To amortise the instrumentation effort, it is expected that
the instrumentation once designed remains embedded in message
passing system functions of a distributed system programming
language, system libraries, interface definition language
compilers, and operating system kernel calls.
[0312] The analyst adds process specific instrumentation to
identify where the execution of each distributed process begins and
ends. Optionally, this instrumentation is added manually.
Alternatively, the instrumentation for the end of a process is
reduced by assuming that a process implicitly ends when another
process begins. Software interrupts which signify an external event
are easily instrumented as an external event and generate a unique
process name automatically to start an angio trace. Also, the
generation of angio traces may be transparently incorporated into
the design testing effort with little additional cost, as well as
providing additional information for debugging.
[0313] Object specific instrumentation is also necessary to
identify beginning and ending of object periods. There are three
independent approaches to reduce object period instrumentation;
since object endings are likely more frequent than process endings,
optimisation of this instrumentation is advantageous. When
optimised, the start of an object period is deduced automatically
for some system activities such as an RPC message acceptance or
object creation. This is also true for the ending of an object
period.
[0314] Another optimisation approach is to identify where
synchronisation between process threads occurs. A service period
serves as a boundary between different angio traces and it
identifies synchronisation between process threads. However, angio
trace separation is determinable from the process name values in
the process time stamp. So, if the synchronisation points are
instrumented then the service periods are determinable. For
example, synchronisation is automatically identified by nested
accept statements in ADA, nested interleaved RPC interactions, or
synchronisation barriers in parallel programs.
[0315] Yet another optimisation approach is to introduce
constraints on a process and use heuristics to deduce the start and
end of an object period. If a process is constrained to being
initiated by a single external event then the history of the
process is used to infer the start of an object period, the end of
an object period, and where a synchronisation occurs. When a single
test-driver is used to initiate a process then this is a feasible
approach.
[0316] The selection of which approach to adopt should be assessed
for each application; however, it should be noted that the object
period information is important design documentation, which is
generally not captured.
[0317] An implementation concern to be addressed is standardisation
of the format for use in a heterogeneous environment, including a
trace format specification and the primitive ordinal types that are
used by the specification.
MMAP Process Application
[0318] The method as described herein is also applicable to
verifying application functionality. When an application is
specified in a graph language such as use cases or message sequence
charts, the graph language statements provided by the method
according to the invention are translatable into said scenario
graph language. A comparison between the specification scenario
graph language description and the execution scenario graph
language description results in design specification verification
and improves overall design verification.
[0319] Also, since the method provides as an output a scenario
graph language description of process and object execution,
transformation of the output to provide different views of system
execution is possible. Though the graph language described herein
is complete and sound, optionally the transformations eliminate
these properties in order to provide data in a manner that is more
useful to an operator, designer, or a corporate executive. Many
such transforms are applicable to each graph language output from
such a system according to the invention.
[0320] There are other uses for a scenario event graph aside from
scenario causality or precedence ordering of events. It is
applicable to automatically generating software performance models
from traces of execution. Generic event templates are used to
identify interactions and object behaviour. The interactions and
object behaviour are mapped onto a performance model. Race
detection and system visualisation make use of the interaction
information. In this fashion, system optimisation and resource
allocation are improved. Also, the process ordering relation
provides a more selective view of potential causes of an event,
which is a useful starting point for debugging.
[0321] In accordance with the invention a physical process is
modelled. Examples of physical processes for modelling include
manufacturing, purchasing, workflow, chemical processes, etc. By
tracing events occurring through a scenario in a predetermined
fashion, flow graphs relating to objects and processes within the
scenario are determined. These graphs are then used to automate
certain objects which are commonly repeated and therefore in need
of optimisation, which form bottlenecks, or which are performed
inefficiently due to flow related issues. In
workflow modelling, a plurality of people and systems record events
during normal work. These events are then constructed into scenario
graph language models which are transformed into different domains
for different purposes. An evaluation of overtime efficiency is one
such application. Elimination of inefficient but required
activities, identification of resource shortages, automation of
objects within processes, reduction of cost, and other
optimisations are determined based on domain specific output.
[0322] Similarly in manufacturing, common sources of delay are
identified and analysed to determine a cost for delays and a cost
of implementing preventative action to eliminate delay. A simple
business decision follows to determine whether or not to implement
a delay preventing process. Essentially, gathering of event related
information and automatically transforming same into process flow
related information is beneficial in many fields.
[0323] Similar to the concept of "proper time" in relativity, a
frame of reference may be any object or, in this embodiment, a
response thread. Selection of a frame of reference does not affect
validity of the results obtained. There is a duality between an
object and a process thread. An observer that chooses an object as
a frame of reference sees a succession of process threads, whereas,
if the process thread is chosen as a reference, the observer sees a
succession of objects.
Angio Trace Application
[0324] Alternatively, angio traces as described herein have several
applications beyond model construction. An angio trace is so named
because it is similar to medical applications, angiograms, where a
dye is injected into a patient and its movement through the body is
monitored. Similarly, when using an angio trace, monitoring permits
analysis of flow of communications through an application. The term
angio dye is used herein to describe an identifier forming part of
a time stamp that allows for analysis of application execution and
communication flow during execution, abstract execution,
simulation, emulation, etc.
[0325] The use of an angio dye as used in an angio tracing system
described herein, allows for tracking information flow in a process
during execution. As such, angio dyes are applicable to system
self-monitoring. An example of a self-monitoring application
includes monitoring of network objects for crashes or resource
overload. For example, when a process is divided among several
processors in distributed systems, each system is required to
transmit angio dye related information at predetermined
intervals. The information is used to monitor
progress on provided objects or applications and to establish that
each distributed processor is in operation. Failure to receive
angio dye related information or failure of a processor to progress
fast enough results in corrective action such as providing the
same object to another processor for execution. Optionally, the
first object execution request is not withdrawn and results from a
first processor to complete the object are used.
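The self-monitoring check can be sketched as a simple timeout test. This is an illustration under assumptions, not a described implementation: the function names, the use of plain numeric report times, and the single spare processor are hypothetical.

```python
from typing import Dict, List


def overdue(last_report: Dict[str, float], now: float, interval: float) -> List[str]:
    """Return processors whose last angio dye report is older than `interval`."""
    return sorted(p for p, t in last_report.items() if now - t > interval)


def reassign(assignments: Dict[str, str], late: List[str], spare: str) -> Dict[str, List[str]]:
    """Give each object held by a late processor to `spare` as well.

    The original request is not withdrawn, so the result of whichever
    processor completes the object first can be used.
    """
    plan = {obj: [proc] for obj, proc in assignments.items()}
    for obj, proc in assignments.items():
        if proc in late:
            plan[obj].append(spare)
    return plan
```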
[0326] In another example application, angio dye related
information is used to prevent record-playback of an encryption
key. Since a time stamp as used in angio tracing is substantially
unique, packaging an encryption key with such a time stamp
prevents its use at a later time. This allows for a traced system
employing angio tracing to distinguish between current communicated
information and previous or stale communicated information. Of
course, many other applications of angio dye related information,
scenario graph language models, and tracing may be envisioned
without departing from the spirit or scope of the invention.
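The record-playback check can be sketched as follows. This is a minimal illustration assuming a receiver that remembers every time stamp it has accepted; the class name `ReplayGuard` and the package layout (time stamp, key) are hypothetical, and no cryptographic protection of the package is shown.

```python
from typing import Hashable, Set, Tuple


class ReplayGuard:
    """Accept a (time stamp, key) package only the first time its stamp is seen."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def accept(self, package: Tuple[Hashable, bytes]) -> bool:
        stamp, _key = package
        if stamp in self._seen:
            return False          # stale: this time stamp was already used
        self._seen.add(stamp)
        return True
```

A package whose time stamp has already been accepted is treated as previous or stale information and rejected.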
[0327] In an alternative embodiment, angio tracing and a scenario
event graph are used to model a hardware process. For example, in
design and implementation of a large hardware device, simulation is
often employed. During simulation, an angio trace and a scenario
event graph model of the simulation, or of the simulation as well as
a software simulation, are constructed and analysed. This permits
design verification, design optimisation, and performance
evaluation. Similarly when a design is intended for mass production
or is implemented in a programmable device, a hardware based angio
trace for analysis in forming a scenario event graph model is
employed. As much integrated circuit design involves library
circuit blocks, such an implementation of a monitor for angio
tracing is not unrealistic and provides numerous advantages as
disclosed herein.
[0328] It is known to perform pattern analysis for design of
software applications, workflow engineering, and process design.
According to an embodiment of the invention, a SEG is analysed to
determine patterns therein. These patterns are in the form of at
least one of predetermined patterns and patterns identified through
analysis of the SEG. Identification of patterns within the SEG
provides valuable information for use in system optimisation,
reverse engineering, design review, implementation analysis, and so
forth.
[0329] In order to identify patterns within a SEG a generic
mathematical approach to pattern recognition is applicable.
Patterns within the SEG are identified as being identical or
substantially similar in some aspect. For example, graph segments
with identical flow are identified. Non flow related events
within the graph segments are then compared in order to determine
whether a correlation exists. When optimisation is possible on one
of the identified graph segments, the other graph segments are
reviewed to determine an applicability of a same or similar
optimisation.
[0330] Alternatively, when substantially similar or identical graph
segments are identified, design analysis to optimise a process in
the form of a computer software program for memory utilisation,
speed, reliability, or other known goals of design analysis and
optimisation is performed. The design analysis, because it is of an
executing software program, is an accurate and pertinent analysis
of the process as implemented.
[0331] Numerous other embodiments may be envisioned without
departing from the spirit or scope of the invention.
TABLE 2
Node Connection Axiom | Node Type | Previous Node Connection Axiom in the Same Object Event Graph (InProc and InObject) | Previous Node Connection Axiom in a Different Object Event Graph (InProcExt) | Successor Node Connection Axiom in the Same Object Event Graph (OutProc and OutObject) | Successor Node Connection Axiom in a Different Object Event Graph (OutProcExt)
A | External | n/a | n/a | n/a | M, N, O
B | Thread end | C, E, G, H, I, J, K, L, M, N, O | n/a | E, I, M | n/a
C | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | B, C, D, F, H, J, L, N | n/a
D | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | G, K, O | E, J
E | Action | B, F | D | B, C, D, F, H, J, L, N | n/a
F | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | E, I, M | G
G | Action | D | F | B, C, D, F, H, J, L, N | n/a
H | Fork | C, E, G, H, I, J, K, L, M, N, O | n/a | B, C, D, F, H, J, L, N | I, K, L
I | Thread begin | B, F | H | B, C, D, F, H, J, L, N | n/a
J | And-join | C, E, G, H, I, J, K, L, M, N, O | D | B, C, D, F, H, J, L, N | n/a
K | And-join | D | H | B, C, D, F, H, J, L, N | n/a
L | And-join | C, E, G, H, I, J, K, L, M, N, O | H | B, C, D, F, H, J, L, N | n/a
M | Thread begin | B, F | A | B, C, D, F, H, J, L, N | n/a
N | And-join | C, E, G, H, I, J, K, L, M, N, O | A | B, C, D, F, H, J, L, N | n/a
O | And-join | D | A | B, C, D, F, H, J, L, N | n/a
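The entries of Table 2 can be checked mechanically for mutual consistency: if axiom X lists Y as a permitted successor, then Y should list X as a permitted predecessor in the corresponding column, and vice versa. The sketch below encodes the table as read from the text above; the encoding, the names `TABLE_2`, `ALL_IN`, `ALL_OUT`, and `is_dual` are illustrative assumptions and not part of the scenario graph language.

```python
ALL_IN = "C E G H I J K L M N O".split()
ALL_OUT = "B C D F H J L N".split()

# axiom: (previous same object, previous different object,
#         successor same object, successor different object)
TABLE_2 = {
    "A": ([], [], [], ["M", "N", "O"]),
    "B": (ALL_IN, [], ["E", "I", "M"], []),
    "C": (ALL_IN, [], ALL_OUT, []),
    "D": (ALL_IN, [], ["G", "K", "O"], ["E", "J"]),
    "E": (["B", "F"], ["D"], ALL_OUT, []),
    "F": (ALL_IN, [], ["E", "I", "M"], ["G"]),
    "G": (["D"], ["F"], ALL_OUT, []),
    "H": (ALL_IN, [], ALL_OUT, ["I", "K", "L"]),
    "I": (["B", "F"], ["H"], ALL_OUT, []),
    "J": (ALL_IN, ["D"], ALL_OUT, []),
    "K": (["D"], ["H"], ALL_OUT, []),
    "L": (ALL_IN, ["H"], ALL_OUT, []),
    "M": (["B", "F"], ["A"], ALL_OUT, []),
    "N": (ALL_IN, ["A"], ALL_OUT, []),
    "O": (["D"], ["A"], ALL_OUT, []),
}


def is_dual(table) -> bool:
    """True if every successor entry has a matching predecessor entry and vice versa."""
    for x, (prev_same, prev_diff, succ_same, succ_diff) in table.items():
        if any(x not in table[y][0] for y in succ_same):
            return False
        if any(x not in table[y][1] for y in succ_diff):
            return False
        if any(x not in table[y][2] for y in prev_same):
            return False
        if any(x not in table[y][3] for y in prev_diff):
            return False
    return True
```

A check of this kind is one way to confirm that a transcription of the table has not dropped or misplaced an entry.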
[0332]
TABLE 3 Edge Type Assignment of Nodes e.sub.1 to e.sub.2
Event type of e.sub.1, where e.sub.1 = j.sub.1, k.sub.1, m.sub.1 i.sub.1, c.sub.1, v.sub.1 and e.sub.2 = j.sub.2, k.sub.2, m.sub.2 i.sub.2, c.sub.2, v.sub.2 | Successor Event e.sub.2 in the Task Event Graph >.sup.T(e.sub.1, e.sub.2) | Successor Event e.sub.2 in the Same Operation Thread >.sup.AT(e.sub.1, e.sub.2) | Successor Event e.sub.2 that is not in the Same Operation Thread >.sup.Ao(e.sub.1, e.sub.2)
Activity event activity(e.sub.1, -) | (c.sub.1 = c.sub.2) .fwdarw. nextTask(e.sub.1, e.sub.2) V (c.sub.1 .noteq. c.sub.2) .fwdarw. nextPeriod(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
External event external(e.sub.1, -) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
Begin event begin(e.sub.1) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
Fork event fork(e.sub.1, -) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | andFork(e.sub.1, e.sub.2)
Half-join event isHJoin(e.sub.1) | nextTask(e.sub.1, e.sub.2) | N/A | N/A
End event end(e.sub.1) | (c.sub.1 = c.sub.2) .fwdarw. nextTask(e.sub.1, e.sub.2) V (c.sub.1 = c.sub.2) .fwdarw. nextPeriod(e.sub.1, e.sub.2) | N/A | N/A
[0333]
* * * * *