Method of determining causal connections between events recorded during process execution

Hrischuk, Curtis; et al.

Patent Application Summary

U.S. patent application number 10/007211 was filed with the patent office on 2001-11-08 and published on 2002-12-19 for method of determining causal connections between events recorded during process execution. Invention is credited to Curtis Hrischuk and Charles Murray Woodside.

Application Number	20020194393 10/007211
Family ID	25469373
Filed Date	2001-11-08
Publication Date	2002-12-19

United States Patent Application	20020194393
Kind Code	A1
Hrischuk, Curtis ; et al.	December 19, 2002

Abstract

A method of determining scenario causality, along with precedence causality, is disclosed. Information is recorded relating to events occurring during execution of a process. The information includes object related information and process related information. The information is translated into a sequence of scenario graph language statements, one or more events translated to a statement. From the statements, process execution flow is determined establishing some scenario causality and precedence causality.

Inventors:	Hrischuk, Curtis; (Woodinville, WA) ; Woodside, Charles Murray; (Ottawa, CA)
Correspondence Address:	Clifford H. Kraft 320 Robin Hill Drive Naperville IL 60540 US
Family ID:	25469373
Appl. No.:	10/007211
Filed:	November 8, 2001

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
10007211	Nov 8, 2001
08937023	Sep 24, 1997

Current U.S. Class:	719/318
Current CPC Class:	G06F 9/542 20130101
Class at Publication:	709/318
International Class:	G06F 009/46

Claims

What is claimed is:

1. A method of determining, from recorded information relating to events occurring during execution of a process, a plurality of the events that are causally connected by scenario causality and precedence causality, the method comprising the steps of: (a) translating the recorded information relating to the events to statements in a first scenario graph language; (b) determining from the first graph language statements, information relating to execution flow of the process wherein each first graph language statement comprises information relating to a predetermined execution flow of the process; and, (c) based on the information relating to an execution flow of the process, determining, for a first plurality of events, events that precede each event from the first plurality of the events that are causally connected by scenario causality to said event from the first plurality of the events.

2. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of: performing run time behavior verification by analysis of the scenario event graph, or combinations thereof, for one of race conditions, live lock conditions, and deadlock conditions.

3. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of: determining the identity type of a scenario during execution and providing a different level or style of service based on this determination.

4. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the steps of: monitoring a process during execution; and, recording the information relating to events occurring during execution of the process, the recorded information comprising at least a time value from each of at least two clocks and wherein at least one of the clocks is a logical clock.

5. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein translating the recorded information is performed in each of two domains; and, wherein determining from the statements information relating to execution flow of the process is performed in dependence upon the statements in each domain.

6. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the process is a process executed by a microprocessor.

7. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the process is a process executed in software on at least two processors in a distributed system and wherein the information relating to events comprises information relating to a time measured by a logical clock and another time measured by another clock.

8. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein a statement in the first graph language represents a node having an out degree of at least 2 and wherein a statement in the first graph language represents a node having an in degree of at least 2.

9. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the recorded information relating to events comprises process event information and object event information.

10. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the statements form a graph language that is complete and sound.

11. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the statements relate to delimiting and progress events of a process and of an object.

12. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the first graph language has nodes and edges from a group of: external, thread begin, and-join, and-fork, thread end, activity, object period start, start process, next object event, next process node, next object period, and process thread fork.

13. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of determining a UML behavioural diagram relating to process execution.

14. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of determining a message sequence chart relating to process execution.

15. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of determining design related information for use in one of design verification, business process modelling, performance modelling, and optimisation.

16. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the recorded events form an angio trace defined as $G_{Trace} = (N, \Sigma_n, M_n, P, \Omega)$, where $N$ is a set of recorded events; $\Sigma_n$ is the alphabet of event time stamps; $M_n : N \rightarrow \Sigma_n$ is the mapping of events to time stamps; $P$ is a set of event predicates for identifying the type of an event; and, $\Omega$ is a set of partial-ordering relations.

17. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the recorded information relating to an event comprises an event type from external event; process thread begin event; process activity event; process thread fork event; process thread half-join event; and process thread end event.

18. A method of determining a plurality of the events that are causally connected comprising the steps of: during execution of an event, recording process related information, recording object related information, and recording event related information; using the process related information and the object related information for a plurality of events, translating the recorded information to a graph language substantially indicative of scenario and precedence causal connections between events; and, providing information based on the causal connections between events.

19. A method of determining a plurality of events that are causally connected for use with recorded information relating to the events occurring during execution of a process, the method comprising the steps of: analysing the recorded information to determine a partial order of events from each of two relative perspectives; combining the two partial orders of events to produce information relating to some forms of scenario causality and precedence causality.

20. A method of determining a plurality of events that are causally connected as defined in claim 19, wherein the recorded information relating to the events comprises at least an event type and two time stamps from each of two clocks wherein a clock from the two clocks is a logical clock and wherein causality is deduced in dependence upon precedence determined from the partial orders and recorded event types.

21. A method of determining a plurality of the events that are scenario and precedence causally connected comprising the steps of: providing a process for execution; instrumenting the process for monitoring of the process during execution; executing the instrumented process to produce a trace of the process execution; transforming the trace of the process execution into a plurality of scenario graph language statements according to a plurality of predetermined rules; and, transforming the scenario graph language statements into a domain specific model.

22. A method of determining a plurality of the events that are causally connected as defined in claim 21, wherein the process is instrumented according to the rules of the following table:

23. A method of determining a plurality of the events that are causally connected as defined in claim 21, wherein the process is instrumented according to the rules of the following table:

24. A method of determining a plurality of the events that are causally connected as defined in claim 1, comprising the step of: the replaying of execution of a scenario and system behavior on one of the actual system, a simulator tool, and a visualisation tool.

25. A method of determining a plurality of the events that are causally connected as defined in claim 1, wherein the process is a process executed in computer software and comprising the step of: performing pattern analysis on the statements to detect at least one of software design and software execution patterns therein.

Description

[0001] This is a continuation-in-part of U.S. patent application Ser. No. 08/937,023 filed Sep. 24, 1997.

FIELD OF THE INVENTION

[0002] This invention relates generally to process execution and more particularly to determining causality for information stored during concurrent and distributed software process execution.

BACKGROUND OF THE INVENTION

[0003] In application execution and analysis, tracing is a term having many similar but distinct meanings. Tracing implies a following of process execution. Often such tracing incorporates recording information relating to a process during execution. In essence, a process that executes and has information there about recorded is considered a traced process.

[0004] In the past, tracing of computer software application programs has been performed for two main purposes: debugging and optimisation. In debugging, the purpose of tracing is to trace back from an abnormal occurrence (a bug) to show a user the flow of execution that occurred prior to the abnormal occurrence. This allows the user to identify an error in the executed program. Unfortunately, the commands executed immediately before an abnormality are often not the source of the error in execution. Because of this, much research is currently being conducted to better view trace related data in order to more easily identify potential sources of bugs.

[0005] Debuggers are well known in the art of computer programming and in hardware design. In commonly available debuggers, a user sets up a trace process to store a certain set of variables upon execution of a particular command while the program is in a particular state. Upon this state and command occurring, the variables are stored. A viewer is provided allowing the user to try to locate errors in the program that result in the bug.

[0006] Usually, debuggers provide complex tracing tools which allow for execution of a program on a line by line basis and also allow for a variety of break commands and execution options. Some debuggers allow modification of parameters such as variable values or data during execution of the program. These tools facilitate error identification and location.

[0007] Unfortunately, with multiprocessor or networked systems, it is difficult to ensure that a system will function as desired, and it is also difficult to ascertain that a system is actually functioning as desired. Many large, multiprocessor systems appear to execute software programs flawlessly for extended periods of time before bugs are encountered. Tracing these bugs is very difficult because the cause of a bug may originate from any of a number of processors, which may be geographically dispersed. Also, many of these bugs appear intermittently and are difficult to isolate. Using a debugger is difficult, if not impossible, because multiple debugging sessions must be established and coordinated.

[0008] In contrast, for optimisation, it is important to know which commands are executed most often in order to optimise a software program. For example, when an application during normal execution executes a first subroutine once, a second subroutine twice, and a third subroutine seventy times, each subroutine requiring a similar time span for execution, optimising the subroutine which runs seventy times is clearly most important. In system optimisation, tracing is not actually performed except insofar as statistics of routine execution and execution times are maintained. These statistics are very important because they allow for a directed optimisation effort at points where the software executes slowest or where execution will benefit most. Statistics captured for program optimisation are often useful in determining execution bottlenecks and other unobvious problems. Examples of optimisation based modelling or tracing include systems described in the following references:

[0009] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M. Siegle, and F. Sotz. "ZM4/SIMPLE: A general approach to performance measurement and evaluation of distributed systems." In T. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, pages 286-309. IEEE Computer Society Press, Los Alamitos, Calif., 1994; M. Heath and J. Etheridge. "Visualizing the performance of parallel programs." IEEE Software, 8(5):29-39, September 1991;

[0010] C. Kilpatrick and K. Schwan. "ChaosMON--application-specific monitoring and display of performance information for parallel and distributed systems." Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, May 1991; and,

[0011] J. Yan. "Performance tuning with an automated instrumentation and monitoring system for multicomputers AIMS." Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, January 1994.

[0012] Software performance models of a design prior to product implementation reduce risk of performance-related failures. Performance models provide performance predictions under varying environmental conditions or design alternatives and these predictions are used to detect problems. To construct a model, a software description in the form of a design document or source code is analysed and translated into a model format. Examples of model formats are a simulation model, queuing network model, or a state-based model like a Petri-Net. The effort of model development makes it unattractive, so performance is usually addressed only in a final product. This has been termed the "fix-it-later" approach and the seriousness of the problems it creates is well documented.

[0013] Determining that a process is in fact executing as desired, or constructing a performance model for optimisation, requires an understanding of causality within a software application. Commonly, the only causal connection determined automatically is precedence. For example, in determining system statistics, it is easily recorded which subroutine was executed when. This results in knowledge of precedence when the entire process is executed on a single processor. However, given this knowledge, it is difficult to determine anything other than precedence.

Time and Causality

[0014] For concurrent or distributed software computations, a common synchronised time reference is unavailable. A system operating on the earth and another system operating in space illustrate this problem. When the system on earth performs an activity and transmits a message to the system in space, an evident time delay occurs between message transmission and message reception. Once a system is in space, synchronising its time source precisely with that of an earth bound system is difficult. When the system in space is moving, such a synchronisation is unlikely. The same problem, though on a smaller scale, exists in earth bound networks. Each computer is bound to an independent time source and synchronisation of time sources is difficult. With advances in computer technology and processing speeds, these synchronisation difficulties are becoming no less significant than those experienced with space bound systems.

[0015] The lack of a common time reference, as well as other problems with observing a distributed system, have led to a notion of causality that is probability based. This "probabilistic causality" is a probability estimate of an event having occurred. Probabilistic causality uses a database of information (e.g., application structure, network configuration), a sophisticated data reduction algorithm (i.e., expert system), and trace records to make an educated guess at the source of problems in a complex system based on observable events. Although probabilistic causality is useful for network fault diagnosis it should not be confused with the stricter definition of causality that is being espoused here which is not probability based. Examples of probabilistic causality are found in U.S. Pat. Nos. 5,661,668 and 5,483,637.

[0016] In order to determine causality, it is beneficial to determine which events happened before which other events, described here as precedence causality. Precedence is a commonly known form of causality; for example, an executable instruction is not executed until a previous instruction is executed given no branching instructions. This precedence based causality is used heavily for debugging. Often, once an anomaly is discovered during execution, previous executed instructions are reviewed to determine a cause of the anomaly. For single processor systems, such an analysis is straightforward; however for network applications, time source synchronisation presents problems and therefore, precedence is not immediately evident.

[0017] Because of the above, when more than one computer is networked together, precedence is not determined through recording of time. Even when a synchronisation of clocks occurs via a communication link, a time delay caused by communication times exists between computers and the recorded times are inaccurate. The resulting clock times are not useful for determining precedence between instructions or activities executing on different processors.

[0018] In an attempt to overcome this problem, it has been proposed that a logical clock may be used to record time in the form of a partial ordering of recorded times. Several types of logical clocks are known for use in a classical model of a distributed system.

[0019] In the classical model of a distributed system, according to a survey paper by Schwarz and Mattern entitled "Detecting causal relationships in distributed computations: in search of the Holy Grail" (Distributed Computing, 7(3):149-174, 1994), a distributed system consists of $N$ objects: $P_1, \ldots, P_N$. The objects interact solely by point-to-point message communication with finite but unpredictable delay; knowledge about the structure of the communication network is not available; first in first out (FIFO) order of message delivery is not assumed; and a global clock, or perfectly synchronised clocks local to each process, are not available. Each object executes a local algorithm to determine its reaction to incoming messages. The occurrences of actions performed by the local algorithm, such as a local state change or sending a message, are called events. Events are recorded atomically. Concurrent and co-ordinated execution of all local algorithms composes a distributed computation.

[0020] A distributed computation is described by ordering events to agree with an order of execution. Let $E_i$ denote the set of events occurring in object $P_i$ in the form of a history of events, and let $E = E_1 \cup E_2 \cup \ldots \cup E_N$ denote the set of all events of the distributed computation. These event sets evolve dynamically as computation progresses. Since each $P_i$ is strictly sequential, its sequence of events, $E_i$, is ordered by occurrence and written as $E_i = \{e_{i1}, e_{i2}, e_{i3}, \ldots\}$.

[0021] For the classical model, three event types are recorded: a send event, a receive event, and an internal event. A send event reflects the fact that a message was sent asynchronously. A receive event denotes the receipt of a message together with local state changes according to the contents of that message. Internal events reflect changes to local object states. This description does not account for conflicts or non-determinism since it is based on events that have actually occurred. The precedence relation is used as a basis for constructing logical clocks. According to the precedence relation, an event with a later logical time occurred after an event with an earlier logical time. Also, two events with the same logical time in an event set are concurrent, which indicates that they may have occurred in any order or simultaneously. Essentially, a concurrency relation indicates that a precedence relation cannot determine which of two events happened first.

Precedence Causality's Failure to Define Scenarios

[0022] The precedence relation does not identify when events are independent because it identifies all past events as being possible causes for the current event. This information can be useful but it is usually overwhelming and it must be analysed by hand to prune out precedence causal relationships. The context information that is most valuable for understanding the system behaviour is the scenario. A scenario is a "specific sequence of actions [events] that illustrates behaviours [for an application]. A scenario may be used to illustrate an interaction or the execution of a use case instance."¹ The interaction is "a specification of how stimuli are sent between [object] instances to perform a specific task. The interaction is defined in the context of a collaboration."²

¹,² "OMG Unified Modelling Language (UML) Specification", Version 1.3, March 2000, which is the industry standard.

[0023] An observed scenario is, informally, a set of objects which execute and interact together, recording events as they execute. The observed scenario is produced by ordering of the events to identify the objects' local interactions and their interactions with each other.

[0024] In a sequential application with only one stimulus, the order of recorded events (i.e., the system behaviour) is one-to-one with the observed scenario. This is true of every static system where every execution of the application (i.e., scenario behaviour) corresponds to the exact ordering of the events in the system (i.e., system behaviour).

[0025] If there are dynamic aspects to the system structure or behaviour, then the one-to-one correspondence of scenario event ordering with the system behaviour is no longer true. The scenario structure cannot be recovered in this case because multiple scenarios are intermingled with each other in the system behaviour. The dynamic aspects involve: multiple simultaneous stimuli, concurrent thread execution, dynamic construction of software components, replication of software components, dynamic communication paths, message queuing, asynchronous message sends, etc. The following three canonical problems describe the problem of recovering and isolating observed scenario structure using precedence causality.

Canonical Problems of Precedence Causality

[0026] A fundamental limitation of the precedence causality approach is that it cannot identify scenarios because it cannot identify the end of a scenario, hereafter called "the problem of finding the scenario end". This situation is illustrated in FIG. 1a where there are two scenarios. Each scenario consists of a hidden external event causing Object A to send a message to Object B with each object doing some internal processing (not shown for clarity). As shown in the figure there are two independent scenarios initiated by Object A but there is a network delay such that the second message send of Object A (event $e_{A2}$) overtakes the first message it has sent (event $e_{A1}$).

[0027] The scenario causal ordering is that the events of the first scenario are $Scenario_1 = \{e_{A1} \rightarrow e_{B2}\}$ and the events of the second scenario are $Scenario_2 = \{e_{A2} \rightarrow e_{B1}\}$. Note that each scenario is properly identified and can be analysed independently of the other (e.g., comparing the actual behaviour against the intended behaviour of a sequence diagram).

[0028] The precedence ordering of the two scenarios is shown in FIG. 1a, including the transitive ordering components. Beyond the two scenario orderings, the precedence ordering includes the additional event orderings $\{e_{A1} \rightarrow e_{A2}\}$, $\{e_{A1} \rightarrow e_{B1}\}$, $\{e_{B1} \rightarrow e_{B2}\}$, and $\{e_{A2} \rightarrow e_{B2}\}$. These event orderings would need to be filtered out before any analysis could be performed because, with them present, it is not possible to identify the scenarios. It is possible to do the filtering manually for a small example but these additional relationships grow exponentially with the number of events recorded.
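
The FIG. 1a situation can be reproduced with a minimal Python sketch. It assumes hypothetical event labels (`eA1`, `eA2`, `eB1`, `eB2`) and assumes that Object B records the overtaking message first; it is an illustration of the precedence relation described above and not part of the disclosed method. The sketch builds the direct "happened before" edges, takes their transitive closure, and lists the orderings that precedence adds on top of the two scenario orderings.

```python
from itertools import product

# Direct "happened before" edges for the FIG. 1a example (labels are assumptions):
# local order within each object, plus one edge per message.
direct = {
    ("eA1", "eA2"),  # Object A sends the first message, then the second
    ("eB1", "eB2"),  # Object B receives the overtaking message first
    ("eA2", "eB1"),  # second message: sent at eA2, received at eB1
    ("eA1", "eB2"),  # first message: sent at eA1, received at eB2
}

def transitive_closure(edges):
    """Return the smallest transitive relation that contains `edges`."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

# The two scenario orderings: one send/receive pair per scenario.
scenario_orderings = {("eA1", "eB2"), ("eA2", "eB1")}

precedence = transitive_closure(direct)
extra = sorted(precedence - scenario_orderings)
print(extra)
```

Under these assumptions the closure holds six pairs, four of which do not belong to either scenario. Those four are what an analyst would have to filter out by hand.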

[0029] A second fundamental limitation of the precedence ordering relation is that an event can only belong to one scenario but it is difficult to determine which scenario an event belongs to, hereafter called the "problem of scenario association." This is illustrated by FIG. 1b. Are there one or two scenarios in FIG. 1b? There can be one scenario that consists of the events $S_1 = \{e_1, e_2, e_3, e_4\}$, or the two scenarios $S_1 = \{e_1, e_3\}$ and $S_2 = \{e_2, e_4\}$. This problem grows linearly with the number of interactions between objects.

[0030] A third limitation of precedence ordering is that events are recorded for a duration of time. Instead, monitoring should be triggered based on the scenario that is being executed. This is the problem of the scenario monitor trigger.

[0031] A fourth limitation of precedence ordering is that it is not communication protocol aware. The communication protocol that is used to send and receive information is important for analysis purposes but precedence causality does not capture any information related to it. This is a lack of communication protocol characterization.

[0032] A new type of causality, called scenario causality, is needed that overcomes these limitations.

Logical Clock Background

[0033] Discussions of implementation mechanics of logical clocks are presented in the following articles:

[0034] M. Ahuja, T. Carlson, A. Gahlot, and D. Shands. "Timestamping events for inferring `Affects` relation and potential causality." In Proceedings 11th International Conference on Distributed Computing Systems (COMPSAC 91), pages 274-281, Arlington, Tex., 1991;

[0035] B. Charron-Bost. "Concerning the size of logical clocks in distributed systems." Information Processing Letters, 39:11-16, July 1991;

[0036] C. Diehl and C. Jard. "Interval approximations of message causality in distributed executions. " In Proceedings of the Symposium on Theoretical Aspects of Computer Science, pages 363-374. Springer-Verlag, February 1992;

[0037] C. Fidge. "Logical time in distributed computing systems." IEEE Computer, pages 28-33, August 1991;

[0038] J. Fowler and W. Zwaenepoel. "Causal distributed breakpoints." In Proceedings of 10th International Conference on Distributed Systems, pages 134-141, 1990;

[0039] L. Lamport. "Time, clocks, and the ordering of events in a distributed system." CACM, 21(7):558-565, July 1978;

[0040] F. Mattern. "Time and global states of distributed systems." In Proceedings International Workshop on Parallel and Distributed Algorithms, pages 215-226, Amsterdam, 1988. Bonas, France, North-Holland;

[0041] S. Meldal, S. Sankar, and J. Vera. "Exploiting locality in maintaining potential causality." In Proceedings 10th Annual ACM Symposium on Principles of Distributed Computing, pages 231-239, Montreal, Canada, 1991;

[0042] M. Raynal and M. Singhal. "Logical time: Capturing causality in distributed systems." Computer, 29(2):49-56, February 1996;

[0043] R. Schwarz and F. Mattern. "Detecting causal relationships in distributed computations: in search of the Holy Grail." Distributed Computing, 7(3):149-174, 1994;

[0044] M. Singhal and A. Kshemkalyani. "An efficient implementation of vector clocks." Information Processing Letters, 43:47-52, August 1992; and,

[0045] C. Valot. "Characterizing the accuracy of distributed timestamps." In Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, pages 43-52, May 1993.

[0046] The implementations described in the above references have several commonalties. Each event is assigned a time stamp from a logical clock, which is used to establish relative ordering of events. If a first event precedes a second event, then the time stamp of the first event is smaller than the time stamp of the second event. To generate the time stamp, every object maintains its own local logical clock that is advanced using a set of prescribed rules. An object's local clock represents its best approximation to a global logical clock. A time stamp is included with every message sent. A receiving object uses the included time stamp to update its local clock. Internal, send, and receive events advance an object's local clock.

[0047] Lamport, in the above noted reference, describes a logical clock wherein each object has a scalar local clock in the form of a counter that is incremented with each event. When a message is received that has a larger time stamp than the receiving object's current counter, the received time stamp replaces the current counter value. A total ordering of events can be constructed by appending an object's identifier to a time stamp value. In this way, within an object a first event precedes a second event when the first event has a time stamp that is less than that of the second event. Unfortunately, between objects, it is often difficult to assess an ordering since concurrent objects have their own local counter which may increment faster or slower than that of another object.
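
A scalar logical clock of this kind can be sketched in a few lines of Python. The class and method names below (`LamportClock`, `tick`, `send`, `receive`) are assumptions chosen for illustration; the sketch follows the rules summarised above (every event advances the counter, a larger received time stamp replaces the local counter) and is not taken from the cited reference.

```python
class LamportClock:
    """Scalar logical clock for one object (illustrative sketch)."""

    def __init__(self, object_id):
        self.object_id = object_id
        self.counter = 0

    def stamp(self):
        # Appending the object identifier breaks ties between equal counters,
        # which yields a total order when the tuples are compared.
        return (self.counter, self.object_id)

    def tick(self):
        """Record an internal event."""
        self.counter += 1
        return self.stamp()

    def send(self):
        """Record a send event; the returned stamp travels with the message."""
        return self.tick()

    def receive(self, message_stamp):
        """Record a receive event, adopting the sender's counter if it is larger."""
        received_counter, _sender = message_stamp
        self.counter = max(self.counter, received_counter)
        return self.tick()


a = LamportClock("A")
b = LamportClock("B")
first = a.tick()           # internal event on A
message = a.send()         # send event on A
arrival = b.receive(message)
print(first, message, arrival)
```

Within one object the stamps increase with every event, and a receive is always stamped later than the matching send. A smaller stamp on one object does not, however, imply precedence over a larger stamp on an unrelated object, which is the limitation noted above.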

[0048] In another logical clock implementation, each object maintains a vector of integers that constitutes its local clock. A time stamp consists of the entire vector and each message sent includes an entire vector. Precedence order of two events is determined by comparing two vector time stamps in a similar fashion to that described by Raynal and Singhal as well as Fidge in the above noted articles. Concurrency can be determined in both cases.
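
The comparison of vector time stamps can be illustrated with the following Python sketch. The names `VectorClock`, `happened_before`, and `concurrent` are assumptions made for this example, and the update rule (component-wise maximum on receive, then advance the local entry) is the commonly described one rather than a quotation from the cited articles.

```python
class VectorClock:
    """Vector logical clock for the object at position `index` (illustrative sketch)."""

    def __init__(self, index, size):
        self.index = index
        self.clock = [0] * size

    def tick(self):
        """Record an internal event and return its time stamp."""
        self.clock[self.index] += 1
        return tuple(self.clock)

    def send(self):
        """Record a send event; the whole vector travels with the message."""
        return self.tick()

    def receive(self, stamp):
        """Merge the received vector component-wise, then record the receive event."""
        self.clock = [max(own, other) for own, other in zip(self.clock, stamp)]
        return self.tick()


def happened_before(u, v):
    """True when stamp u precedes stamp v: no entry larger, at least one smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def concurrent(u, v):
    """True when neither stamp precedes the other."""
    return u != v and not happened_before(u, v) and not happened_before(v, u)


a = VectorClock(0, 2)
b = VectorClock(1, 2)
e1 = a.tick()            # internal event on A
msg = a.send()           # send event on A
f1 = b.tick()            # internal event on B, unrelated to A so far
f2 = b.receive(msg)      # receive event on B
print(happened_before(e1, f2), concurrent(e1, f1))
```

Unlike the scalar clock, the vector comparison distinguishes "precedes" from "concurrent" for events on different objects, at the cost of carrying one integer per object in every time stamp.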

[0049] A known implementation difficulty of a vector clock is the size and overhead of the time stamp. Characterising concurrency requires using vector time stamps of integers of at least size N when nothing is known about a computation except a number of objects, N. When N is large, the amount of time stamp data associated with each message and event becomes unacceptable.

[0050] There have been several approaches to reducing the overhead associated with vector time stamps. Singhal and Kshemkalyani, in the above noted reference, reduce communication bandwidth by sending only those vector clock entries that have changed since the message last sent to a receiver, in place of an entire vector. Each object maintains two additional vectors to store information between interactions. However, communication channels must be FIFO. In this approach, post-execution analysis is needed to recover the precedence relation between different messages sent to a same receiver.
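The idea of sending only changed entries can be sketched as follows. This is a simplified illustration under stated assumptions rather than the cited algorithm as published: the class DiffVectorClock and the names last_update and last_sent for the two additional vectors are hypothetical, and FIFO channels are assumed, as required above.

```python
class DiffVectorClock:
    """Hypothetical vector clock that transmits only changed entries."""

    def __init__(self, index, n):
        self.index = index
        self.vector = [0] * n
        # Own-clock value at which each entry last changed (assumed name).
        self.last_update = [0] * n
        # Own-clock value when a message was last sent to each object (assumed name).
        self.last_sent = [0] * n

    def _tick(self):
        self.vector[self.index] += 1
        self.last_update[self.index] = self.vector[self.index]

    def send(self, receiver):
        """Return only the entries changed since the last send to `receiver`."""
        self._tick()
        changed = {k: self.vector[k] for k in range(len(self.vector))
                   if self.last_update[k] > self.last_sent[receiver]}
        self.last_sent[receiver] = self.vector[self.index]
        return changed

    def receive(self, changed):
        """Merge the received entries into the local vector."""
        self._tick()
        for k, value in changed.items():
            if value > self.vector[k]:
                self.vector[k] = value
                self.last_update[k] = self.vector[self.index]


p0, p1, p2 = DiffVectorClock(0, 3), DiffVectorClock(1, 3), DiffVectorClock(2, 3)
p0.receive(p2.send(0))   # Object 2 informs Object 0
first = p0.send(1)       # carries the entries for Objects 0 and 2
p1.receive(first)
second = p0.send(1)      # only Object 0's own entry has changed since
p1.receive(second)
```

The second message carries a single entry instead of three, which is the bandwidth saving the approach aims for.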

[0051] Fowler and Zwaenepoel, in an above noted reference, describe a direct-dependency technique reducing communication overhead by maintaining precedence relations for direct interactions. A transitive component of the precedence relation is constructed by post-execution analysis. This allows an object's local clock to be an event counter. Each object maintains information relating to objects with which it directly communicates. Each message carries with it a sending object's event counter value from when the message was sent. The information that is recorded for each communication event is a sending object, receiving object, and appropriate event counters.
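The post-execution construction of the transitive component can be sketched as below. The record layout (sender, sender counter, receiver, receiver counter) and the function names are assumptions made for illustration; events are identified by an (object, event counter) pair.

```python
from collections import defaultdict


def build_direct_edges(event_counts, messages):
    """Build direct precedence edges from per-object counters and message records.

    event_counts: object -> number of events recorded for that object.
    messages: iterable of (sender, send_counter, receiver, receive_counter).
    """
    edges = defaultdict(set)
    for obj, count in event_counts.items():
        for n in range(1, count):
            edges[(obj, n)].add((obj, n + 1))   # local event order
    for sender, send_n, receiver, recv_n in messages:
        edges[(sender, send_n)].add((receiver, recv_n))   # direct interaction
    return edges


def precedes(edges, first, second):
    """Post-execution analysis: follow direct edges transitively."""
    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == second:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


edges = build_direct_edges(
    {"A": 2, "B": 3, "C": 2},
    [("A", 1, "B", 2), ("B", 3, "C", 2)],
)
a_before_c = precedes(edges, ("A", 1), ("C", 2))   # via Object B
c_before_a = precedes(edges, ("C", 1), ("A", 2))
```

Object A never communicates directly with Object C, so the precedence of ("A", 1) over ("C", 2) is recovered only by the transitive search.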

[0052] Valot, in an above noted reference, suggests that there is a trade-off between memory requirements and time stamp accuracy for precedence relations. She describes a family of time stamps, which she calls k-vectors, that can be tailored for particular analysis. Instead of allocating a position in the vector to a single object, each subset of the available objects is assigned a single position in the vector. The size of the k-vector is the number of subsets chosen. The appropriate selection of vector clock subsets provides better time stamp accuracy for a given vector size. However, a priori knowledge of simultaneous concurrency during execution is required for optimal assignment of an object to a position in the k-vector. This method, therefore, is only applicable to certain cases and not to general implementation.

[0053] Other logical clocks such as those proposed by Meldal, et al. require specific conditions or additional a priori knowledge to result in a reduced size time stamp or approximate the precedence relation. Using knowledge of fixed communication links between objects, this method provides a precedence ordering between messages arriving at a same object. This approach is used to determine precedence relations between messages arriving at a same object with overhead dependent upon network topology.

[0054] Interval clocks have been disclosed to approximate the precedence relation with a constant time stamp size. Interval clocks provide better results than scalar clocks having a same overhead. By using a bit array vector value instead of a counter, precedence relations are established by post-execution analysis. If only blocking RPC style communication is used then interval clocks describe the precedence relations with no additional post-execution analysis.

[0055] All of these logical clocks and all prior research only dealt with precedence causality. A scenario based causality is needed.

Monitoring and Tracing a Process

[0056] Event records are produced by monitoring a process. There are two aspects to monitoring: a monitoring system, comprising means for storing data relating to process execution, and monitoring instrumentation, which uses the monitoring system to record execution related information. The term monitor is used in its general sense to incorporate both these aspects.

[0057] An event record contains information about an application's activity and it consists of at least an event token and a time stamp. The time stamp is generated by a monitor and represents the acquisition time of the event record. The set of events is stored as an event trace.
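A minimal sketch of an event record and an event trace follows. The names EventRecord and Monitor, the token strings, and the use of a simple counter in place of a real acquisition clock are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical event record: at least an event token and a time stamp."""
    event_token: str
    time_stamp: int


class Monitor:
    """Hypothetical monitor that stamps records at acquisition time."""

    def __init__(self):
        # Assumed stand-in for an acquisition clock.
        self._ticks = itertools.count(1)
        self.event_trace = []

    def record(self, event_token):
        record = EventRecord(event_token, next(self._ticks))
        self.event_trace.append(record)
        return record


monitor = Monitor()
monitor.record("request_sent")
monitor.record("request_accepted")
tokens = [r.event_token for r in monitor.event_trace]
```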

[0058] A monitor collects information by at least one of sampling or tracing. Tracing consists of reporting all occurrences of an event within a certain interval of time. Tracing is synchronous with occurrence of events; it is performed when all occurrences of an event are known or when each occurrence of an event is followed by a certain action. With tracing, dynamic behaviour of a program is abstracted to a sequence of events. On the other hand, sampling is a collection of information upon request of the monitor. Optionally, sampling is asynchronous with the occurrence of an event; it is useful when an immediate reaction to an event is not necessary. Sampling allows only statistical statements about program behaviour. Profiling involves collecting execution counts or performing timing at the procedure, statement, or instruction level, using sampling or tracing.

[0059] Recorded information relating to events includes fields that record encapsulated data that follows a prescribed format. Some common approaches to specifying data to record are recording header data in the trace file to describe the fields; a self-describing trace format; an abstract information model based on entity-relationship descriptions; and a trace description language.

[0060] There is a large body of work in the prior art relating to monitoring of parallel programs but there is little research of monitoring distributed applications. There is an expectation in prior art literature that much of the parallel program monitoring research is applicable to a distributed application; however, it has been found that monitoring of distributed applications has a different set of requirements.

[0061] There are many different properties that a monitor may have. Several that have been identified in the literature are machine independence, using shadow processors, visualisation of performance metrics as they are gathered, pre-execution, automated instrumentation, instrumentation during execution, run-time enabling of event probes, event ordering by precision hardware time stamp, on-line program steering to control the program and monitoring overhead as it executes, and post-execution compensation for probe intrusion. Most of these monitoring systems sample and aggregate measurements using specified criteria, and then present the resulting metrics either visually for analysis or to an expert system for evaluation.

[0062] Discussions of implementation mechanics of monitors are presented in the following articles:

[0063] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M. Siegle, and F. Sotz. "ZM4/Simple: A general approach to performance measurement and evaluation of distributed systems." In T. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, pages 286-309. IEEE Computer Society Press, Los Alamitos, Calif., 1994;

[0064] M. Heath and J. Etheridge. "Visualizing the performance of parallel programs." IEEE Software, 8(5):29-39, September 1991;

[0065] M. J. Kaelbling and D. Ogle. "Minimizing monitoring costs: Choosing between tracing and sampling." 23rd International Hawaii Conference on System Sciences, Volume 1:314-320, January 1990;

[0066] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. "The Paradyn parallel performance measurement tool." Computer, 28(11):37-46, November 1995;

[0067] D. M. Ogle, K. Schwan, and R. Snodgrass. "Application-dependent dynamic monitoring of distributed and parallel systems." IEEE Transactions on Parallel and Distributed Systems, 4(7):762-778, July 1993;

[0068] P. H. Worley. "A new PICL trace file format." Technical Report ORNL/TM-12125, Oak Ridge National Laboratory, September 1992; and,

[0069] J. Yan, S. Sarukkai, and P. Mehra. "Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit." Software Practice and Experience, 25(4):429-461, April 1995.

Problem Summary

[0070] Though a tremendous amount of research and effort has been expended attempting to better monitor and analyse software execution, heretofore, no system exists for determining restricted forms of causality such as scenario causality. Scenario causality is a subset of precedence relationships and is indicative of a more direct causal link. Precedence, of course, is considered a requirement for scenario causality since current understandings of time indicate that it is unlikely that a later event can cause an earlier event to occur. It is desirable to determine forms of causality other than mere precedence of an application during execution. This would require solving the previously listed problems of "finding the scenario end", "scenario association," "the scenario event trigger", and "characterization of communication protocol." In so doing, causal connections detected are likely more significant and less numerous. It is also desirable to determine precedence for a multiprocessor or network based application during execution.

Object of the Invention

[0071] It is an object of the invention to provide a method of recording information relating to some events during execution of a process, and of determining scenario causality and precedence causality for some of the events.

[0072] It is an object of the invention to provide a method of recording information relating to some events during execution of a distributed software application, and of determining scenario causality and precedence causality for some of the events.

[0073] It is an object of the invention to provide a method of recording information relating to some events during execution of a process, and of analysing the recorded information for the purpose of determining aspects of process execution flow.

SUMMARY OF THE INVENTION

[0074] In accordance with the invention there is provided for a system wherein information is recorded relating to events occurring during execution of a process, a method of determining a plurality of the events that are causally connected by precedence causality or scenario causality. The method comprises the steps of:

[0075] (a) translating the recorded information relating to the events to first graph language statements wherein one or more events is translated to a statement;

[0076] (b) determining from the statements information relating to process execution flow wherein each statement comprises information relating to a predetermined process execution flow; and,

[0077] (c) based on the information relating to a predetermined process execution flow, determining, for each of a plurality of caused events, a plurality of events from the events that precede each event from the plurality of caused events and are each scenario causally or precedence causally connected to said event from the plurality of caused events.

[0078] In accordance with the invention there is provided a method of determining a plurality of the events that are scenario causally or precedence causally connected comprising the steps of:

[0079] during execution of an event,

[0080] recording process related information,

[0081] recording object related information, and

[0082] recording event related information;

[0083] using the process related information and the object related information for a plurality of events, translating the recorded information to a graph language substantially indicative of scenario and potential causal connections between events; and,

[0084] providing information based on the causal connections between events.

[0085] In accordance with the invention there is provided a method of determining a plurality of events that are scenario causally or precedence causally connected for use with recorded information relating to events occurring during execution of a process. The method comprises the steps of:

[0086] analysing the recorded information to determine a partial order of events from each of two relative perspectives;

[0087] combining the two partial orders of events to produce information relating to some forms of scenario and potential causality.

In accordance with the invention there is provided a method of determining a plurality of the events that are scenario causally or precedence causally connected comprising the steps of:

[0088] providing a process for execution;

[0089] instrumenting the process for monitoring of an execution of the process;

[0090] executing the instrumented process to produce a trace of the process execution;

[0091] transforming the trace of the process execution into a plurality of scenario graph language statements according to a plurality of predetermined rules to reverse engineer scenarios;

[0092] transforming the scenario graph language statements into a scenario event graph for analysis, and,

[0093] transforming the scenario event graph(s) into a domain specific model for analysis in another domain.

BRIEF DESCRIPTION OF THE DRAWINGS

[0094] Exemplary embodiments of the invention will now be described in conjunction with the following drawings, in which:

[0095] FIG. 1a is a simplified scenario diagram;

[0096] FIG. 1b is another simplified scenario diagram;

[0097] FIG. 1c is a high-level block diagram of a method according to the invention;

[0098] FIGS. 2a and 2b are simplified flow diagrams of code execution;

[0099] FIG. 3 is a simplified set diagram of different forms of known causality;

[0100] FIG. 4 is a diagram showing a simple example of a difference between scenario causality, and potential causality;

[0101] FIG. 5 is a flow diagram showing an RPC having two blocking interactions, one nested within the other;

[0102] FIG. 6 is a diagram showing the steps in applying MMAP in a performance engineering context;

[0103] FIG. 7 is a diagram of symbols for use in process event graphs according to the invention;

[0104] FIG. 8 shows a general representation of a Scenario Event Graph (SEG) node as a six-port building block;

[0105] FIG. 9 is a diagram of a portion of a SEG of an RPC;

[0106] FIG. 10 is a diagram of a portion of a SEG of an asynchronous interaction;

[0107] FIG. 11 is a diagram of a portion of a SEG of a case where a message is sent using a blocking communication protocol that results in a synchronisation;

[0108] FIG. 12 is a diagram of a portion of a SEG of an initiating object using an asynchronous communication protocol that results in a synchronisation;

[0109] FIG. 13 is a diagram of a portion of a SEG where a blocked initiating object receives its reply to a service request that used an RPC communication protocol and it is considered to be a synchronisation;

[0110] FIG. 14 is a diagram of a portion of a SEG involving acceptance of an external event that results in a synchronisation;

[0111] FIG. 15 is a diagram of a portion of a SEG of an example where: the initiating object (Object A) sends an RPC request and blocks, the first responding object (Object B) processes the request, and forwards it to another responding object (Object C), Object C processes the request further and forwards it to Object D which replies to the initiating object; and,

[0112] FIG. 16 is a graph rewriting operation for simplifying a scenario event graph model.

DETAILED DESCRIPTION OF THE INVENTION

[0113] A sequentially executing software component that may execute concurrently with other components is referred to as an object throughout this specification and the claims, which follow.

[0114] Software execution models of distributed and concurrent systems characterise objects and their interactions in the context of the process that they are part of, fully describing a scenario of execution of an application. Software execution models are design aids that are used during the development of a software application. A software execution model (hereafter simply "model") characterizes high-level aspects of an application's execution for analysis. A forward engineering model will specify intended behavior (e.g., specifying interaction diagrams such as use-cases, sequence diagrams, and collaboration diagrams). Interactions between objects are important because they affect parallelism and resource contention experienced during execution when, for example, a heavily used object queues arriving requests and becomes a bottleneck. During the later phases of development, models are constructed to characterize the realized behavior, to aid in program understanding, re-engineering, reuse, performance analysis, and debugging. Often these realized models are mental pictures that the developer constructs from user requirements, design documents, source code examination, and, most importantly, experiences with the system. The realized models are critical for investigating differences between the specified behavior and the observed behavior. Manually reverse engineering a model from an actual execution of a small software application is relatively easy but it is expensive, difficult, and uncertain for a large or dynamic application. A technique is needed to generate models of realized behavior and object-oriented methodologies need to be adapted to incorporate the realized models.

[0115] Performance models are a type of software execution model for optimisation purposes. A performance model of distributed and concurrent systems characterises objects and their interactions. Interactions between objects are important because they affect parallelism and resource contention experienced during execution when, for example, a heavily used object queues arriving requests and becomes a bottleneck. The Layered Queuing Network (LQN) model has been proposed to evaluate such processes. The LQN model extends queuing network models to include contention effects for software resources such as server objects, as well as contention for hardware devices. It is appropriate for assessing performance of many kinds of distributed systems, including client-server applications, peer-to-peer applications, communications switching software, transaction processing systems, and systems based on middleware software technologies. Using the invention for this purpose is described in both:

[0116] C. E. Hrischuk. Trace-based Load Characterization for the Automated Development of Software Performance Models. Ph.D. thesis, Carleton University, Ottawa, Canada, 1998.

[0117] C. M. Woodside, C. Hrischuk, B. Selic, S. Bayarov. "Automated performance modelling of software generated by a design environment." Performance Evaluation, vol. 45:1, pages 107-123, 2001.

[0118] Referring to FIG. 1c, a high-level block diagram of a method according to the invention is shown. A language statement in the form of a design statement or executable code is instrumented to support monitoring of the design or executable during simulation or execution. The instrumentation interacts with storage devices and other system resources to provide tracing of the simulation of a design in the form of an abstract execution, simulation, or emulation of the execution of an executable. Once traced, the trace results form an angio trace. The angio trace is a particular form of trace as defined hereinbelow. From the angio trace, a plurality of scenario graph language statements that characterise the observed scenario's behaviour is determined. In an embodiment, the scenario graph language is, as disclosed herein, "scenario event graph." From the scenario graph language statements, domain specific models are formed through transformation. Since a scenario event graph language description is substantially indicative of scenario and potential causality, the domain specific models may take a number of forms. These include performance models, resource utilisation models, design models, execution flow models, and so forth. By determining design models, an executable program is verifiable against design requirements from which it is derived. Further information can be found in C. Hrischuk. "A Model Making Automation Process (MMAP) using a Graph Grammar Formalism". Proceedings of the Theory and Application of Graph Transformations, 1999.

[0119] Referring to FIGS. 2a and 2b, simplified flow diagrams of code execution are shown. Code statements represented by circles represent fork events and join events. Code statements represented by solid boxes represent terminals and hollow boxes represent default events. Lines joining code statement representations are indicative of potential causality. The flow diagrams are shown in time with an earlier time to the left of a later time. The two flow diagrams shown in FIGS. 2a and 2b are of identical executable code executed at two different times. Upon a brief review of the two flow diagrams, it is evident that a code statement 1 is executed at two different times. In fact, this does not affect execution of the process because the code statement 1 is not causally connected to the join code statement 3. Unfortunately, when evaluating a system based solely on precedence, it is difficult to determine when causally identical situations such as that shown in FIGS. 2a and 2b may occur.

[0120] In fact, though a flow diagram generated from the system during testing may always be similar to that of FIG. 2a, the flow diagram of FIG. 2b is an acceptable execution of the process characterised by the two flow diagrams and may occur at some later time. It is clearly advantageous to identify flow related issues such as these and to test out their correctness in light of desired design parameters. According to the present invention, a method of evaluating and transforming recorded information relating to code statements into process flow information and subsequently into other information is provided.

Scenario Causality

[0121] Prior art research into implementations for logical clocks has proven useful for ordering events but other than precedence causality, characterisation of scenario causality has heretofore been elusive.

[0122] Prior to discussing scenario event graph and its use for determining causality other than precedence, causality should be defined. In order to understand causality, some forms of causality are outlined below. The terms as defined hereinbelow associated with each form of causality are used throughout this specification and the claims.

[0123] Precedence causality, in the form of precedence relations, is a loose form of causality implying only that a first event occurs before a second event during an execution. This form of causality is known in the art and is a common object of prior art systems. Referring to FIG. 3, a simplified set diagram of different forms of known causality is shown. As is evident from the diagram, imposed causality is inclusive of several other forms of causality.

[0124] Realised causality is a term referring to an event ordering that is consistent with both purpose and an execution. In theory, when a process is correctly designed and implemented, realised causality reflects both. Realised causality is summarised as a first event is an intended cause of a second event if the second event cannot occur unless the first event has already occurred. Of course, when verification of process implementation against design criteria is intended and process implementation is potentially incorrect the statement "cannot" is modified to "should not." Recovering realised causality from prior art post-execution traces is impossible because it necessitates knowledge of the process implementation in the form of software code of each object, the initial value of variables, and the execution environment.

[0125] According to the present invention a form of causality referred to as scenario causality is determined. Scenario causality includes forms of causality other than precedence but does not truly reflect realised causality in every instance. This is indicated in FIG. 3 wherein scenario causality is a subset of realised causality. Certain assumptions and limitations allow for a broader applicability of the method of the present invention as discussed below.

[0126] Different types of logical clocks result in different causal ordering of events which may exclude important relationships. Although each ordering is consistent with precedence relations, some orderings are preferable for some applications. For example, vector based logical clocks allow for a determination of potential causality.

[0127] Precedence causality provides a partial ordering between events that respects event ordering during execution. Precedence causality is characterised as a future event being incapable of influencing the past. A vector clock characterises precedence causality because the event ordering is consistent with system execution. Precedence causality is a weak approximation or characterisation of realised causality because it results in all previous events being potential causes for later events. This is a consequence of causality being deduced solely from precedence relations.

[0128] Imposed causality is obtained when the ordering between events is imposed by an algorithm, and is not constrained to event execution order. A scalar clock is an example of a logical clock resulting in a determination of imposed causality. Because a clock with imposed causality may include all other clocks as special cases, imposed causality is shown as the largest set in FIG. 3.

[0129] The difference between precedence causality and realised causality is well known but many prior art methods for determining causality ignore the difference. Examples of some of these include the following papers:

[0130] D. Bryan. "An algebraic specification of the partial orders generated by concurrent Ada computations." In Proceedings of Tri-Ada, pages 225-241, New York, N.Y., 1989. A.C.M. Press;

[0131] C. Fidge. "Partial orders for parallel debugging." In Proceedings of ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, pages 183-194, 1988;

[0132] D. P. Helmbold, C. E. McDowell, and J. Z. Wang. "Determining possible event orders by analyzing sequential traces." IEEE Transactions on Parallel and Distributed Systems, 4(7):827-839, July 1993;

[0133] M. Raynal and M. Singhal. "Logical time: Capturing causality in distributed systems." Computer, 29(2):49-56, February 1996;

[0134] A. Schiper, J. Eggli, and A. Sandoz. "A new algorithm to implement causal ordering." In Proceedings of the 3rd International Workshop on Distributed Algorithms, number 392 in Lecture Notes in Computer Science, pages 219-232. Springer-Verlag, Berlin, 1989; and,

[0135] G. Winskel. "An introduction to event structures." In J. W. de Bakker, W. P. de Roever, and G. Rozenberg, editors, Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, pages 364-397. Springer-Verlag, Berlin, 1989.

[0136] Scenario causality is a subset of realised causality, including only those causal relationships for each application's execution. Whereas imposed and precedence causality are overly liberal inclusive approximations of realised causality, scenario causality is an achievable conservative approximation of realised causality. Scenario causality limits an event's influence to those future application events it can affect. A criterion, called the scenario ordering relation, is used to deduce scenario causality from observations of the execution. The scenario ordering relation, according to the invention, solves the previously listed limitations of scenario causality by limiting the effects of an event to both the unit of software modularity in the form of object level effects and process of which the event forms part. This is useful because it reflects the context with which an event is associated, namely a software module and its process. The scenario ordering relation is described as: "a first event is a cause of a second event if there is a sequence of events from the first event to the second event in the same process."
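The scenario ordering relation quoted above can be illustrated with a small sketch. It is an assumption-laden illustration and not the method of the invention: the function name scenario_causes, the event identifiers, and the representation of observed ordering as a successor map are all hypothetical. The example mirrors a situation in which Object A sends one message in each of two independent processes and the second message overtakes the first.

```python
def scenario_causes(process_of, successors, first, second):
    """True if a chain of events within one process leads from first to second.

    process_of: event id -> process id (assumed to be recorded with each event).
    successors: event id -> iterable of directly following event ids.
    """
    process = process_of[first]
    if process_of[second] != process:
        return False
    stack, seen = [first], {first}
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            # Events of other processes never extend the chain.
            if process_of[nxt] != process or nxt in seen:
                continue
            if nxt == second:
                return True
            seen.add(nxt)
            stack.append(nxt)
    return False


# Hypothetical events: a1/b1 belong to process 1, a2/b2 to process 2.
process_of = {"a1": 1, "b1": 1, "a2": 2, "b2": 2}
successors = {
    "a1": ["a2", "b1"],   # A's local order, and the first message
    "a2": ["b2"],         # the second message
    "b2": ["b1"],         # B's local order: the second message arrived first
}
same_scenario = scenario_causes(process_of, successors, "a1", "b1")
cross_scenario = scenario_causes(process_of, successors, "a2", "b1")
```

Event a2 precedes b1 through b2 in the precedence ordering, yet it is not reported as a cause because the two events belong to different processes.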

[0137] In the specification and claims that follow, causality refers to conservative estimations of causality. Alternatively stated, causality as used herein refers to events that are scenario causal and not merely precedence causal. When precedence causality is intended, that term or a synonym thereof is used.

[0138] In the timing diagram of FIG. 4, there are two distributed applications. Each application consists of Object A sending a message to Object B. As shown, two independent external events cause each application to execute, recording the events of the first application as and the events of the second application as . However, as the scenario causal ordering shows in FIG. 4, there is a delay such that the second message sent by Object A (event) overtakes the first message it has sent (event ). The precedence ordering of the events is shown in FIG. I c, including the transitive ordering components. The precedence ordering includes the additional event orderings of which are not causal orderings because the scenarios are independent.

[0139] It is useful to identify blocking of objects when analysing system execution for race detection and system visualisation among other applications; however, a classical partial-order model has difficulty characterising object blocking because communications are recorded as asynchronous communications. Object blocking introduced by blocking communication mechanisms, such as the Remote Procedure Call (RPC), is not apparent within the classical model. Analysis is further complicated when blocking interactions are nested. For example in the flow diagram of FIG. 5, an RPC has two blocking interactions, one nested within the other. Object A initiates an RPC and blocks at event e.sub.1, and the nested blocking interaction is initiated by Object B at event e.sub.3. One approach to identifying object blocking is to augment time stamps recorded through monitoring, in particular metrication within the time stamps, with information about a communication mechanism, as is attempted in some debugging applications. Other approaches modify the precedence relation. The topology of the scenario event graph language according to the invention directly characterises object blocking by labelling the elements of blocking and non-blocking communications differently with different events. Then a causal chain of events through an RPC is immediately identifiable. According to the invention, a characterisation of message-based synchronisation between objects is performed.

[0140] Performance model construction using the model making automation process (MMAP), previously called Trace Based Load Characterization, comprises three steps. First, an appropriate trace of execution is recorded. Such a trace is referred to as an angio trace throughout this document and the claims that follow. Then the trace is analysed to produce a scenario event graph that characterises the execution of the scenario: the involved objects, their individual activities, and their interactions with each other. A scenario event graph forms a scenario model. Thirdly, the scenario event graphs are combined to make a performance model that merges several scenario models and additional configuration information necessary to determine performance.

[0141] There are several benefits to using MMAP as opposed to a "source code examination" approach for constructing models. Traces incorporate dynamic details of a design that are difficult to determine from source code or documentation. Some of these are data dependent branching, identity of objects involved in anonymous or dynamically bound interactions, and involvement of polymorphism and inheritance hierarchy of an object-based system. Automated trace processing results in more accurate performance model construction at a lower cost because a larger volume of details is included during model development. An area where automation has a decided advantage is for correctly identifying interaction types. For example, MMAP identifies a synchronous interaction constructed from asynchronous messages. This is important when the nature of an interaction cannot be explicitly identified in a trace. Optionally, MMAP is used to model a production version of a software process, to provide full life-cycle support for modelling.

[0142] The steps in applying MMAP in a performance engineering context are shown in FIG. 6. A first step is to select scenarios which are important for performance modelling and to add instrumentation to identify where the execution of each scenario begins and ends. In a second step, angio trace events are recorded during the execution of a scenario. For analysis purposes, the events of a trace are then reordered into an intermediate format, which is in turn processed into an LQN sub-model. The user completes the LQN construction in a fifth step by combining several LQN sub-models with system configuration information.

Software Execution Tracing

[0143] In tracing a process in the form of a software process during execution, it is preferable to have a predetermined set of desired information. Such a set of desired information is determined in dependence upon information sought through processing of the trace results. Essentially, for use in the present invention, a trace must capture, in an automated fashion, information sufficient for determining scenario causality from a distributed application's execution history.

[0144] In order to record sufficient information regarding software process execution, angio tracing is employed. Angio tracing according to the invention identifies scenario and potential precedence relationships between recorded events of an application and properly characterises concurrency. Angio tracing according to the invention characterises communication protocol elements in the form of blocking request initiation, non-blocking request initiation, request acceptance, synchronisation acceptance, sending a reply to a blocking request initiation, and acceptance of a reply. Angio tracing supports integration of information from a heterogeneous environment because it is independent of implementation technology, execution environment, and monitoring approach. Optionally, multiple angio traces are recorded simultaneously. Automated trace analysis is possible because angio tracing according to an embodiment is based on a formal model.

[0145] Angio tracing has been successfully implemented in many environments. Software monitoring is a preferable means for characterising a distributed application. An angio trace has at least a logical clock which can serve many purposes. The approach adopted by angio tracing is to provide an event format which includes time stamp information and user defined application data payload.

[0146] Angio traces are extractable from a plurality of different sources at various steps of the development. Examples of sources of an angio trace include annotated specifications in the form of use-cases and Message Sequence Charts, functional prototypes, detailed simulations, and an executable production system. Successful experiments have been conducted with several of these sources. In the embodiment described below, angio traces are derived from a design prototype environment; the method is applicable with necessary modifications to other sources.

[0147] There are four requirements limiting the application of parallel program monitoring research to distributed applications. First, hardware or hybrid monitoring of a distributed application is not possible because of the geographically dispersed environment. Therefore, a software monitor approach must be adopted.

[0148] Secondly, a strategy for minimising tracing overhead is required. Parallel program monitors have used several strategies. The simplest strategy is to enable trace sensors at run-time. A more elaborate strategy is the on-line control of the program and monitoring overhead as it executes. Examples are used in Falcon, Paradyn, and Pablo. These strategies are difficult to apply to distributed applications because software components are not known in advance, making instrumentation adjustments a priori impossible. Angio tracing uses a different strategy: event recording is enabled per application during execution, and other applications that are executing simultaneously do not necessarily have events recorded.

[0149] When distributed applications are considered in isolation, tracing should be used; however, most parallel program monitors use sampling. Sampling is justified in a parallel programming environment because parallel applications have a static structure and run in isolation. This is not true of distributed applications, where the sampled data values can be attributed to incorrect applications, since applications execute simultaneously and share resources. Angio tracing is tailored for monitoring distributed applications by tracing.

[0150] Another concern is the need for ordering recorded events once tracing is done. Ordering recorded events in a distributed system is difficult for two reasons. First, a global clock is not available because the system is geographically dispersed. Secondly, perfectly synchronised clocks local to each object are not possible because of poor clock granularity, poor clock synchronisation, clock drift, or unpredictable communication delay. This is well known in the prior art.

[0151] Although the precedence relation is useful for system-wide analysis, it is not useful for analysing execution of a distributed application. The first limitation of the precedence relation is that it does not distinguish between blocking and non-blocking communication protocols; it assumes all communication is non-blocking. Secondly, it introduces ordering relationships between events from different scenarios, treating independent scenarios as if they were part of a single, system-wide scenario. As mentioned previously, this is because the precedence relation does not distinguish between different scenarios.

[0152] Angio tracing overcomes these limitations by using a special scenario precedence ordering relation that also recovers potential causality. This scenario precedence ordering relation is used to answer a particular class of questions, such as: "Does an event happen before another event in scenario A?" By contrast, the precedence ordering answers questions such as: "Did an event occur before another event, in the system?"

[0153] Angio tracing is useful for monitoring a distributed system which heretofore has been a difficult environment to monitor. A distributed system is composed of geographically dispersed, heterogeneous hardware with a set of executing, concurrent software objects, which are referred to as objects. A distributed application is a subset of objects that interact in a dynamic, coordinated fashion solely by point-to-point message communication with finite but unpredictable delay. The communication protocol is assumed to be reliable and first in first out (FIFO) ordering of message delivery is not assured.

[0154] A system that executes distributed applications differs from a classical distributed system model. Some differences are: several different applications or instances of the same application can execute simultaneously sharing the software resources (objects) and hardware resources; object execution is periodic, beginning when a service request message is accepted and ending when the service request is satisfied; an object's lifetime may extend beyond that of an application; an object can be added or removed so the software structure is dynamic; and, communication links between objects are dynamically established. The communication is at least one of blocking (e.g., Remote Procedure Call) and non-blocking (i.e., asynchronous). DCE RPC, CORBA, Java, mobile agents, HTTP requests, and web services are examples of technologies used to build distributed applications. Angio tracing accommodates the above noted differences and supports tracing using these technologies as well as others.

[0155] Angio tracing characterises execution of a distributed application independent of other executing applications. To ensure that trace event information properly captures concurrency and event ordering, angio tracing was derived from a formal model called the scenario event graph. The scenario event graph is a scenario graph language, with typed nodes and edges, which fully describes an execution of an application or the behaviour cycles in an application that are independent of each other. The relationship between the scenario event graph language and angio tracing is significant in implementing a method according to the invention. Essentially, appropriate event recording requires some knowledge of information necessary to produce a desired output. The scenario event graph language provides a formal model from which many different output views or data sets are determinable and, therefore, the scenario event graph language is a desirable model. As set out below, angio tracing supports the formation of models in the scenario event graph language. The scenario event graph language is described in more detail below.

[0156] Properties of angio tracing which make it unique follow. A new type of logical clock is used, allowing reconstruction of a scenario and precedence causal ordering of events for each distributed application. An angio trace is capable of transformation for analysis into a model using the scenario event graph language. For example, an angio trace is used to automatically generate a performance model of a distributed application. An angio trace characterises communication protocol elements.

[0157] Angio tracing, as herein disclosed, is successfully implemented in experimental systems in the following environments: a functional prototyping environment, a commercial prototyping environment, a distributed software system simulator called Parasol, coarse-grained UNIX operating system processes, in the DCE RPC environment using data collected by the POET debugger, and on the Microsoft Windows platform.

Event Graph Types

[0158] Three approaches to formally characterising the execution of a distributed application are: a partial order, a regular expression language, or a graph language. A partial order characterises concurrency but it is difficult to characterise blocking interactions or synchronisation between objects. The most frequently used partial ordering relation, the precedence relation, is discussed in detail above.

[0159] A regular expression language characterises blocking and synchronisation but it loses information about software structure and concurrency because applications are described by event interleaving. Two regular expression languages are path expressions and flow expressions.

[0160] The scenario event graph is a graph language for characterising a distributed application that overcomes limitations of prior art characterisation methods. The scenario event graph language has labelled nodes that are types of application events and labelled, directed edges that are different types of causal relationships. It characterises communication protocol elements and object concurrency during application and system execution. The communication protocol elements are: blocking request initiation, non-blocking request initiation, request acceptance, synchronisation acceptance, sending a reply to a blocking request initiation, and acceptance of the reply.

[0161] The scenario event graph's implementation of scenario causality is fully described in:

[0162] C. E. Hrischuk. Trace-based Load Characterization for the Automated Development of Software Performance Models. Ph.D. thesis, Carleton University, Ottawa, Canada, 1998.

[0163] C. Hrischuk, C. M. Woodside. "Logical clock requirements for reverse engineering scenarios from a distributed system." Accepted for publication in IEEE Trans. on Software Engineering.

[0164] The scenario event graph language is the basis for the angio trace specification because there is a correspondence between elements of the graph language and the angio-trace specification. To better understand the properties of angio tracing a brief description of the scenario event graph language is given here.

[0165] The scenario event graph language combines two types of graphs to describe object, process, and system execution. It characterises a process's execution as a process event graph. Object execution is characterised by an object event graph. According to the scenario event graph language these two points of view are combined as a Scenario Event Graph (SEG), which has more information than the graphs considered in isolation. The graphs are causal models, where the nodes are recorded events and an edge identifies a causal relationship between two nodes.

Object Event Graph

[0166] The object event graph characterises periodic execution of an object. An object satisfies service requests of other objects one at a time, with the subsequent processing of each request being described as a service period. An object event graph consists of a sequence of linear sub-graphs, one for each service period. Each service period is also a linear sub-graph of object activities. An object event graph has a beginning, but it may not have an end; this occurs, for example, when an object continuously operates.

[0167] An object event graph is composed of two types of nodes and edges as follows:

[0168] "Period start" node: the object has started a new service request period and this is the first node.

[0169] "Object activity" node: a node that represents an activity that the object performed.

[0170] "Object's next node" edge: its target is the node in the same object period that succeeds the source node.

[0171] "Object's next period" edge: its source is the last node of an object's period and its target is the period start node of the object's next service period.

[0172] The target of an object's next period edge is a service period in the same process or in a different process. So, the next period edge sometimes connects different process event graphs together, characterising system execution, provided there are objects which are common to the processes.
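By way of non-limiting illustration, the manner in which next period edges can link different process event graphs may be sketched as follows. The tuple layout, the process names and the `next_period_links` helper are assumptions made for this sketch only.

```python
def next_period_links(events):
    """events: (period_index, event_index, process_name) tuples for ONE object.

    Returns a (from_process, to_process) pair for each "object's next period"
    edge, i.e. for each step from the last node of one service period to the
    period start node of the following service period.
    """
    ordered = sorted(events)
    links = []
    for (p1, _, proc1), (p2, _, proc2) in zip(ordered, ordered[1:]):
        if p2 != p1:   # crossing into a new service period
            links.append((proc1, proc2))
    return links


# One object serving two hypothetical processes over three service periods.
events = [
    (0, 0, "scenarioX"), (0, 1, "scenarioX"),
    (1, 0, "scenarioY"), (1, 1, "scenarioY"),
    (2, 0, "scenarioY"),
]
links = next_period_links(events)
```

In this sample the first next period edge leaves a period served for one process and enters a period served for another, which is the case in which two process event graphs become connected through a common object.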

[0173] There are four types of roles that an object assumes in a process. A role limits node connection types as indicated by the column in Table 1 called "Allowed Protocol Role". The first role type is an initiating object, where requests for services from other objects are communicated. The second role type is as a responding object, where acceptance of a service request from an initiating object occurs and the service request is satisfied. The third role type is as a forwarding object, where a service request is accepted, some processing is performed, and then the service request is forwarded to another forwarding object for further processing. The fourth role type is as a replier object, where a responding object sends a reply back to a blocked initiating object to indicate that its service request has been satisfied.

Process Event Graph

[0174] A process event graph characterises execution of a process as an attributed, edge-labelled, binary, finite, directed, acyclic graph. Each concurrent thread of execution is a linear sub-graph called a process thread. Each process thread is also a linear sub-graph of process activities. For example, when an application has several objects interacting by blocking RPC, the process event graph is a single process thread because there is no concurrent execution. When a process event graph has concurrent process threads, special node and edge types are used to characterise causal relationships between process threads.

[0175] The process event graph node types are as set out below.

[0176] "External" node: a marker for the external initiation of a process. A process may have more than one external node.

[0177] "Thread begin" node begins a process thread.

[0178] "Process activity" node has an attribute to store process information.

[0179] "And-fork" node forks a new process thread to characterise the introduction of logical concurrency.

[0180] "And-join" node joins two process threads into a single thread of execution.

[0181] "Thread end" node finishes a process thread.

[0182] All of the node types, except the activity node type, are considered atomic, having no duration, allowing chaining of nodes to describe complex interactions between objects.

[0183] The different edge types of a process event graph are as follows:

[0184] "Start the process" edge (st): its source is an external node and its target is the thread begin node of the first process thread.

[0185] "Process thread's next node" edge: its target is the next node in the same process thread that succeeds the source node.

[0186] "Process thread's fork" edge (f): its source is an and-fork node and its target is the thread begin node of the forked thread.

[0187] The default edge type is the "Process thread's next node" edge which is abbreviated to next process edge.
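By way of non-limiting illustration, the node types and the source and target constraints stated above for the "st" and "f" edges can be sketched as a small check. The `Node` enumeration and the `edge_is_valid` helper are hypothetical names; the connection rules of Table 1 are not reproduced, so the default next process edge is accepted here without further checks.

```python
from enum import Enum


class Node(Enum):
    EXTERNAL = "external"
    THREAD_BEGIN = "thread begin"
    ACTIVITY = "process activity"
    AND_FORK = "and-fork"
    AND_JOIN = "and-join"
    THREAD_END = "thread end"


def edge_is_valid(edge_type, source, target):
    """Check the constraints stated in the text for "st" and "f" edges."""
    if edge_type == "st":
        # Source is an external node, target is a thread begin node.
        return source is Node.EXTERNAL and target is Node.THREAD_BEGIN
    if edge_type == "f":
        # Source is an and-fork node, target is the forked thread's begin node.
        return source is Node.AND_FORK and target is Node.THREAD_BEGIN
    if edge_type == "next":
        # Assumption: Table 1 rules are out of scope, so no check is applied.
        return True
    return False
```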

[0188] The execution of a single program statement is described by a sub-graph to separate a program statement identifier from its effect on the process behaviour. To ensure consistency of representation, several rules govern introduction of a sub-graph. First, if a program statement is characterised by a sub-graph of and-fork node(s) and an activity node, the activity node is the first node in the sub-graph. Conversely, if a program statement is characterised by an activity node with a begin node or and-join node(s), the activity node is the last node in the sub-graph.

Scenario Event Graph (SEG)

[0189] A scenario event graph combines a process event graph with several object event subgraphs. We start from the process subgraph of the scenario, and superpose those parts of the object subgraphs representing service periods within the scenario as overlays. Two nodes representing the same event are merged, and have a dual type, one type for the scenario subgraph and one for the object subgraph. Similarly where an edge exists in both the scenario and the object subgraphs it has a dual type, one type for each.

[0190] Symbols of the SEG are shown in FIG. 7. No icon is provided for the default node type, object activity node. FIGS. showing SEGs follow several conventions: time proceeds from left to right, and the consecutive nodes of an object are at the same vertical level.

[0191] The interpretation of a SEG restricts the manner in which nodes and edges are connected. Causal relationships during execution restrict the node and edge connections. For completeness, object period start edges are shown where they may occur.

[0192] RPC, synchronisation, and asynchronous communication protocols are also characterised by the following elements:

[0193] Blocking request initiation: An object cannot proceed until it receives a reply to a request it has just made;

[0194] Non-blocking request initiation: An initiating or forwarding object makes a service request to another object and the initiating object does not block to wait for a reply;

[0195] Request acceptance: A blocked responding object accepts a new service request and begins a new period;

[0196] Synchronisation acceptance: A responding object is already processing a service request but it is blocked, waiting to accept another message to continue the service;

[0197] Sending a reply to a blocking request: A replier object sends the reply to the blocked initiating object; and,

[0198] Acceptance of a reply: A blocked initiating object receives the reply and continues execution.

[0199] With a formal specification of the scenario event graph language, a deduction of the information required to generate a scenario event graph model from a trace is possible. Trace requirements are discussed below with reference to angio tracing.

A Global Event Graph

[0200] Finally, where the event records include many scenarios, a global event graph is defined as the superposition of all of the object event graphs and scenario event graphs in the system.

[0201] The global event graph is important because it means several scenario causal descriptions can be combined to characterize precedence causality. However, precedence causality cannot characterize scenario causality.

Angio Trace

[0202] An angio trace provides a precedence ordering of separate sets of execution related information - object level information and process level information. These are easily visualised as two graphs, each related to one of two time stamp values. The ordering of events is achieved by a set of ordering relations and event predicates. An event predicate identifies a type of an event and serves as a guard condition for selecting an ordering relation. Once an ordering relation is selected, event ordering for two events is established. Essentially, during tracing sufficient information is collected to allow for determination of event ordering according to scenario and precedence causality.

[0203] An angio trace is defined as:

$$G_{\mathrm{Trace}} = (N, \Sigma_n, M_n, P, \Omega)$$

[0204] where

[0205] $N$ is a set of recorded events;

[0206] $\Sigma_n$ is the alphabet of event time stamps;

[0207] $M_n : N \to \Sigma_n$ are the rules for assigning time stamp values to events;

[0208] $P$ is a set of event predicates; and

[0209] $\Omega$ is the mapping of a predicate to one, or more, valid ordering relations.

[0210] To develop the two graphs, each angio trace event records an object time stamp for the object event graph and a process time stamp for the process event graph. Before describing each of these time stamp values, the logical clock requirements satisfied thereby are outlined.

[0211] There are three properties that the time stamp values have when used as a logical clock. Firstly, each time stamp has a unique value or the event ordering relations provide a default scheme for ordering events with identical time stamps. According to an embodiment of the invention each time stamp value is unique. Secondly, the time stamp values are monotonically increasing, although there may be gaps in the time stamp values. For example, the process time stamps are sequentially indexed so that missing events are easily detected. The object time stamp value is allowed to have gaps. An additional property needed for angio tracing is that the two time stamp values of successive events in the same object are synchronised: two events A and B cannot have time stamp values where the object time stamps indicate that event A occurred before event B and the process time stamp indicates that event A occurred after event B.
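By way of non-limiting illustration, two of the properties above can be checked mechanically: detection of missing events from sequential thread event indices, and the synchronisation of the two time stamp values. The helper names and the simplified representation of a time stamp as a comparable value are assumptions made for this sketch.

```python
def missing_thread_indices(indices, start=0):
    """Return thread event indices absent from a sequentially indexed thread."""
    seen = set(indices)
    if not seen:
        return []
    return [i for i in range(start, max(seen) + 1) if i not in seen]


def is_synchronised(events):
    """events: (object_order, process_order) pairs for events of ONE object
    recorded within ONE process thread.

    True if ordering the events by object time stamp never contradicts the
    order given by the process time stamp. Gaps in either value are allowed.
    """
    by_object = sorted(events)                      # order by object time stamp
    process_orders = [p for _, p in by_object]
    return all(a < b for a, b in zip(process_orders, process_orders[1:]))


gaps = missing_thread_indices([0, 1, 3, 4, 6])
ok = is_synchronised([(0, 0), (2, 1), (5, 2)])      # object stamps have gaps
bad = is_synchronised([(0, 1), (1, 0)])             # the two orders disagree
```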

[0212] An object time stamp consists of a unique object identifier for each object event graph; an object period index that is a counter ordering service periods of an object; and an object event index that is a value ordering events within a service period.

[0213] Object time stamp monotonicity is a result of period and event index values always increasing. The object identifier provides uniqueness of the time stamp values.

[0214] A process time stamp consists of a unique process name that associates an event with an application's scenario; a unique process thread identifier that is assigned as a process thread begins; a thread event index ordering events of a process thread; and event type information for ordering process threads.

[0215] The process time stamp monotonicity is provided by the thread event index values always increasing. Uniqueness of a process time stamp is provided by the process name and process thread identifier. Process thread identifiers are unique within the scope of a process name and the process name must be globally unique.

[0216] The event type information of the process time stamp closely follows the node types of the scenario event graph as set out below.

[0217] External event (Ex): is a marker for the external initiation of a process.

[0218] Process thread begin event (Be): identifies the start of a process thread.

[0219] Process activity event (Ac): records an identifier for an action taken or the executed program statement.

[0220] Process thread fork event (Fk): connects a child process thread with its parent process thread.

[0221] Process thread half-join event (HJo): signals the end of the current process thread but not the service period of the object.

[0222] Process thread end event (En): indicates an end of the process thread and the object's service period.

[0223] The process thread begin event, the process thread fork event, and the process thread half-join event are each recorded with additional information, alongside the event type, that is used to order process threads.

[0224] A fork event results in and is the cause of two subsequent events; one is placed in the same process thread and the other is taken as the beginning of a new child process thread. To identify the child process thread the fork event results in recording a new process thread identifier.

[0225] A half-join event differs from a scenario event graph language and-join node. In the scenario event graph language, the and-join node is a target of and preceded by two process threads. In angio tracing, half-join events are the cause of and precede a new process thread that results from the joining of two process threads. The joining process threads end with half-join events.
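By way of non-limiting illustration, the relationship between half-join events and an and-join node can be sketched by grouping half-join events according to the process thread that results from them. The dictionary keys and the `and_joins` helper are hypothetical; the sketch assumes each half-join event records the identifier of the resulting process thread.

```python
from collections import defaultdict


def and_joins(events):
    """Group half-join events by the process thread that results from them.

    events: dicts with "name" and "type" keys and, for half-join events, a
    "result_thread" key (an assumed field name). Each returned group
    corresponds to one and-join node of the scenario event graph.
    """
    joins = defaultdict(list)
    for ev in events:
        if ev["type"] == "HJo":
            joins[ev["result_thread"]].append(ev["name"])
    return dict(joins)


events = [
    {"name": "a3", "type": "HJo", "thread": 1, "result_thread": 3},
    {"name": "b2", "type": "HJo", "thread": 2, "result_thread": 3},
    {"name": "c0", "type": "Be", "thread": 3},   # begin of the joined thread
]
joins = and_joins(events)
```

Here the two joining process threads 1 and 2 each end with a half-join event, and both events precede the begin event of thread 3.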

[0226] The event notation that is used combines the object time stamp and the process time stamp as follows. An event e has the time stamp values

$$e = \begin{bmatrix} \text{ProcessEventGraph} \\ \text{ObjectEventGraph} \end{bmatrix} = \begin{bmatrix} j, k, m, l \\ i, c, v \end{bmatrix},$$

where

[0227] j is the process name for each process event graph,

[0228] k is the process thread identifier,

[0229] m is the thread event index,

[0230] l={Ex, Be, Ac, Fk, HJo, En} is the event type information including information

[0231] specific to each event type,

[0232] i is the object identifier for each object event graph,

[0233] c is an object service period index, and

[0234] v is an object event index.

[0235] A process thread is identified with the process scenario name and the process thread identifier, such as |j,k|. If an object-oriented system is being monitored then the object identifier should include class name and instance number of an executing object.

[0236] Some fields require a particular initialisation value. These values are specified as v.sub.0 for the object event index, c.sub.0 for the object period index and m.sub.0 for the thread event index; these initial values are commonly initialised to 0 or to 1.

[0237] Information recorded with each event is used by the following event predicates:

[0238] fork(e, k) True if event e is a fork event that forked the process thread |j,k|, otherwise it is false. This is deduced as follows: (1) the parent event e is a fork event type, (2) event e recorded the child process thread's identifier, and (3) the child begin event recorded the process execution time stamp of its parent fork event. To test for a fork event, the process thread field takes on an arbitrary value, written fork(e,-).

[0239] hJoin(E, |j,k|) If process thread |j,k| is caused by one or more half-join events, then the half-join events are assigned to set E and the predicate is true; otherwise it is false. This is deduced as follows: (1) half-join events are determined based on event types, (2) half-join events record the resulting process thread's identifier, and (3) the begin event of the resulting process thread records the process time stamp of its parent half-join event(s).

[0240] isHJoin(e) True if event e is a half-join event; otherwise, it is false.

[0241] external(e) True if event e is an external event; otherwise, it is false.

[0242] begin(e) True if event e is a begin event, otherwise it is false.

[0243] end(e) True if event e is an end event, otherwise it is false.

[0244] activity(e, V) True if event e is an activity event that also recorded the process level information V, otherwise it is false. To test for an activity event, the process level information field takes on an arbitrary value, written activity(e,-).

[0245] last(i, c, e) True if event e is the last event recorded in period c of object i, otherwise it is false. This is determined by traversing the object event graph of object i in period c until the period index changes or there are no further events recorded for the object.

[0246] exist(e) True if event e is an event within the trace, otherwise it is false.
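By way of non-limiting illustration, several of the above event predicates can be sketched over an in-memory trace. The `Event` record follows the event notation [j, k, m, l / i, c, v]; the `child` and `info` fields, and the passing of the trace as an explicit argument, are assumptions made for this sketch. The fork predicate is simplified: it does not verify that the child begin event recorded the time stamp of its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    j: str                        # process name
    k: int                        # process thread identifier
    m: int                        # thread event index
    l: str                        # event type: Ex, Be, Ac, Fk, HJo or En
    i: str                        # object identifier
    c: int                        # object service period index
    v: int                        # object event index
    child: Optional[int] = None   # assumed: thread id recorded by Fk/HJo events
    info: Optional[str] = None    # assumed: process level information of Ac events


def fork(e, k=None):
    """fork(e, k); k=None plays the role of the arbitrary value in fork(e,-)."""
    return e.l == "Fk" and (k is None or e.child == k)


def h_join(trace, j, k):
    """Return the set E of half-join events that cause process thread |j,k|."""
    return [e for e in trace if e.l == "HJo" and e.j == j and e.child == k]


def activity(e, V=None):
    """activity(e, V); V=None plays the role of activity(e,-)."""
    return e.l == "Ac" and (V is None or e.info == V)


def last(trace, i, c, e):
    """True if e is the last event recorded in period c of object i."""
    period = [x for x in trace if x.i == i and x.c == c]
    return bool(period) and max(period, key=lambda x: x.v) == e


def exist(trace, e):
    return e in trace


e0 = Event("S", 0, 0, "Be", "A", 0, 0)
e1 = Event("S", 0, 1, "Fk", "A", 0, 1, child=1)
e2 = Event("S", 1, 0, "Be", "B", 0, 0)
e3 = Event("S", 0, 2, "Ac", "A", 0, 2, info="compute")
trace = [e0, e1, e2, e3]
```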

[0247] These predicates serve as conditions for the event ordering relations of angio tracing. An angio trace has six event ordering relations that use the time stamp information. These relations identify a given event's succeeding or preceding event in the object event graph or the process event graph. Each relation is reflexive, antisymmetric, and transitive. The ordering relations are

$$\Omega \in \{>^{T}, <^{T}, >^{At}, >^{Ao}, <^{At}, <^{Ao}\},$$

[0248] where

[0249] $>^{T}$ orders the succeeding events in an object event graph,

[0250] $<^{T}$ orders the preceding events in an object event graph,

[0251] $>^{At}$ orders succeeding process event graph events in the same process thread,

[0252] $>^{Ao}$ orders succeeding process event graph events that are not in the same process thread such as a fork event and its child begin event and a half-join event and its child begin event,

[0253] $<^{At}$ orders preceding process event graph events that are in the same process thread, and

[0254] $<^{Ao}$ orders preceding process event graph events that are not in the same process thread.
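By way of non-limiting illustration, the three succeeding relations can be sketched as immediate-successor lookups, from which the full relations could be obtained by closure. The function names and the `child` field, which is assumed to hold the thread identifier recorded by a fork or half-join event, are hypothetical.

```python
from collections import namedtuple

# Fields follow the event notation [j, k, m, l / i, c, v]; `child` is assumed.
Event = namedtuple("Event", "j k m l i c v child", defaults=(None,))


def succ_T(trace, e):
    """Next event of the same object, ordered by (period index, event index)."""
    later = [x for x in trace if x.i == e.i and (x.c, x.v) > (e.c, e.v)]
    return min(later, key=lambda x: (x.c, x.v), default=None)


def succ_At(trace, e):
    """Next event in the same process thread, ordered by thread event index."""
    later = [x for x in trace if (x.j, x.k) == (e.j, e.k) and x.m > e.m]
    return min(later, key=lambda x: x.m, default=None)


def succ_Ao(trace, e):
    """Begin event of the thread started by a fork or half-join event."""
    if e.l not in ("Fk", "HJo"):
        return None
    for x in trace:
        if x.l == "Be" and (x.j, x.k) == (e.j, e.child):
            return x
    return None


e0 = Event("S", 0, 0, "Be", "A", 0, 0)
e1 = Event("S", 0, 1, "Fk", "A", 0, 1, child=1)
e2 = Event("S", 1, 0, "Be", "B", 0, 0)
e3 = Event("S", 0, 2, "En", "A", 0, 2)
trace = [e0, e1, e2, e3]
```

The preceding relations can be sketched symmetrically by reversing the comparisons.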

[0255] An angio trace event description of an application's execution is transformed into a scenario event graph model for further analysis. This transformation consists of converting events to nodes, adding edges between nodes, and replacing half-join and external event types with simplified event types. The conversion of an event to a node is a one-to-one mapping. There are four operators that are used to add a labelled edge between two adjacent nodes.

[0256] nextObject($e_1$, $e_2$) adds a next object edge from the source node $e_1$ to the target node $e_2$;

[0257] nextperiod($e_1$, $e_2$) adds a next period edge from the source node $e_1$ to the target node $e_2$;

[0258] nextAppTh($e_1$, $e_2$) adds a next process edge from the source node $e_1$ to the target node $e_2$; and,

[0259] andFork($e_1$, $e_2$) adds an and-fork edge from the source node $e_1$ to the target node $e_2$.
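By way of non-limiting illustration, the addition of labelled edges by the four operators can be sketched as follows. The selection rules of Table 3 are not reproduced; the rules used here (adjacent events of an object, adjacent events of a process thread, and a fork event with the first event of its child thread) are assumptions, as are the `name` and `child` fields and the `build_edges` helper.

```python
from collections import defaultdict, namedtuple

# `name` labels the node; `child` is the thread id recorded by a fork event.
Event = namedtuple("Event", "name j k m l i c v child", defaults=(None,))


def build_edges(trace):
    """Return (source name, target name, operator) triples for a trace."""
    edges = []
    objects, threads = defaultdict(list), defaultdict(list)
    for e in trace:
        objects[e.i].append(e)
        threads[(e.j, e.k)].append(e)

    # Object event graph: same period -> nextObject, new period -> nextperiod.
    for evs in objects.values():
        evs.sort(key=lambda e: (e.c, e.v))
        for a, b in zip(evs, evs[1:]):
            op = "nextObject" if a.c == b.c else "nextperiod"
            edges.append((a.name, b.name, op))

    # Process event graph: consecutive events of one process thread.
    for evs in threads.values():
        evs.sort(key=lambda e: e.m)
        for a, b in zip(evs, evs[1:]):
            edges.append((a.name, b.name, "nextAppTh"))

    # Fork events are connected to the first event of their child thread.
    for e in trace:
        if e.l == "Fk":
            child = threads.get((e.j, e.child))
            if child:
                edges.append((e.name, child[0].name, "andFork"))
    return edges


trace = [
    Event("a0", "S", 0, 0, "Be", "A", 0, 0),
    Event("a1", "S", 0, 1, "Fk", "A", 0, 1, child=1),
    Event("b0", "S", 1, 0, "Be", "B", 0, 0),
    Event("a2", "S", 0, 2, "En", "A", 0, 2),
    Event("a3", "T", 0, 0, "Be", "A", 1, 0),   # next period, other process
]
edges = build_edges(trace)
```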

[0260] Table 3 shows identifying operators that are invoked to add edges that are identified by the partial order relations, the node type, and some additional time stamp information.

[0261] Once edges are added to nodes, graph modifications as shown in Table 4 are applied to remove angio half-join and angio external event types, as well as to provide some simplifications of a resulting model.

[0262] The angio trace representation of the four possible styles of synchronisations that occur in a scenario event graph are shown in Table 4. These illustrate how the half-join events are components of a scenario event graph and-join node.

[0263] It is common in practice for a task to interleave the processing of several service requests, maintaining state information for each outstanding request. This is typically implemented as a responding task polling several message queues and servicing the first message it finds. Angio tracing can accommodate this service period interleaving without violating proper time's assumption that a task's service period should not be interleaved. This is done using the two instrumentation operators suspend and resume. They are defined as:

[0264] suspend($ts_i$, $ts_t$) Copies task i's time stamp value $ts_i$ into the temporary time stamp storage location $ts_t$. Then task i's time stamp value is updated by clearing its operation time stamp, incrementing its task period index, and resetting its task period index value to $c_0$.

[0265] resume($ts_t$, $ts_i$) Copies the contents of the temporary time stamp storage location $ts_t$ into task i's time stamp value $ts_i$.

[0266] In each case an activity event is recorded with its application information set to either "suspend" or "resume", so that post-processing of the trace can determine if service periods are being interleaved.

[0267] The transformation from an angio trace to a scenario event graph model is known as a valid transformation because the partial order of the scenario is the same in both cases and the event ordering does not change; the meaning is preserved because there is a correspondence from the node connection specifications to the scenario event graph node connection strategies; and each node connection specification is unique so there is no conflict and corresponding non-determinism during the transformation process.

Scenario Graph Language Verification of Properties

[0268] The scenario graph language presented and defined herein is known to be complete and sound. This is important for several reasons. First, it ensures that all possible executions will be able to be interpreted for analysis with known semantics. Second, all necessary information will be captured in an unambiguous, non-contradictory fashion: no information is missing. Third, it ensures that only the necessary information is recorded so there is no redundancy in the data or extra overhead for data capture. Lastly, it facilitates automated analysis techniques because of the previous reasons. This permits its use in a wide variety of situations. Such a complete and sound graph language statement set is preferred.

[0269] The scenario graph language's node connections of Table 1 are complete and sound. They capture the valid ways to connect nodes and maintain causal relationships of a distributed application. A proof of the correctness is described in both of the following articles, which are herein incorporated by reference:

[0270] C. E. Hrischuk. Trace-based Load Characterization for the Automated Development of Software Performance Models. Ph.D. thesis, Carleton University, Ottawa, Canada, 1998.

[0271] C. Hrischuk, C. M. Woodside. "Logical clock requirements for reverse engineering scenarios from a distributed system." Accepted for publication in IEEE Trans. on Software Engineering.

Example Scenario Event Graphs

[0272] Example sub-graphs are now provided for an RPC, asynchronous, synchronisation, and forwarding interactions.

[0273] A SEG of an RPC is shown in FIG. 9. In an RPC interaction, the responding object is also the replier object. The process event graph would resemble a procedure call graph if the objects were procedures.

[0274] A SEG of an asynchronous interaction is shown in FIG. 10.

[0275] A synchronisation interaction occurs when the synchronising object has started a service period and it must accept another message to continue execution. There are four possible ways a synchronisation occurs. The first case is where the message was sent using a blocking communication protocol (shown in FIG. 11). The second case is where the initiating object used an asynchronous communication protocol (shown in FIG. 12). The third case occurs where a blocked initiating object receives its reply to a service request that used an RPC communication protocol (shown in FIG. 13). The procedure call graph analogy breaks down in this case because there are two concurrent threads of execution, since the responding object continues execution after sending the reply. This third case is characterised as a new process thread being forked for the reply. The last synchronisation case involves an external event being accepted (shown in FIG. 14).

[0276] A forwarding interaction involves an initiating object, a responding object that receives the initiating object's request, other responding objects that forward the request in an object pipeline, and a replier object. An example is shown in FIG. 15, where: the initiating object (Object A) sends an RPC request and blocks, the first responding object (Object B) processes the request, and forwards it to another responding object (Object C), Object C processes the request further and forwards it to Object D which replies to the initiating object.

Model Transformation Using Graph Rewrite Rules

[0277] A graph rewrite operation occurs by finding a sub-graph, identifying adjacent nodes and edges to the selected sub-graph, and then replacing the identified sub-graph with another, ensuring that the adjacent nodes and edges are undisturbed by the embedding of the new graph. In Table 4, graph rewriting rules are shown. In Table 4, the adjacent nodes and edges are numbered the same in the identification and replacement sub-graphs to ensure the embedding operation does not alter the adjacent nodes.

[0278] A graph rewrite rule preserves those nodes and their modified attribute values and adjacent edges. Graph rewriting operations are used to simplify a scenario event graph model during analysis as well as to establish graph properties. FIG. 16 provides two examples of this. In the first example, if the sub-graph to replace is found then it is proven that the sub-graph has that property. In the second case, the graph is rewritten and simplified, ready for another set of graph rewriting rules to prove a property or simplify the model.

[0279] A scenario event graph model is analysed or translated into a domain specific model. An analysis is done by first describing the properties to be assessed as a sub-graph template, which is then compared with the host scenario event graph model using an algorithm supplied by an analyst. A sub-graph template has variables and values. Translation of a scenario event graph model from one domain to another begins similar to analysis, except that a second sub-graph is supplied replacing each occurrence of a first sub-graph in the host scenario event graph model.

[0280] An example of this approach is shown in FIG. 16, which is a graph rewriting operation for simplifying a scenario event graph model. In this example, an RPC interaction occurs using asynchronous messages. By removing unnecessary nodes and replacing arcs, this is simplified. The input sub-graph template uses the numbered nodes to establish glue points to embed the output sub-graph template. The algorithmic graph grammar approach is ideal for this purpose and it is supported by a graph rewriting specification language and tool set called PROGRES.
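The simplification described above can be sketched in Python as a single rewrite rule that removes pass-through nodes. This is an illustration only and does not use PROGRES; the function name and the edge representation (a set of source/target pairs) are assumptions. The neighbouring nodes play the role of the numbered glue points and are left untouched.

```python
def collapse_pass_through(edges, removable):
    """Rewrite u -> n -> v into u -> v for each node n in `removable`
    that has exactly one incoming and one outgoing edge."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for n in sorted(removable):
            ins = [e for e in edges if e[1] == n]
            outs = [e for e in edges if e[0] == n]
            if len(ins) == 1 and len(outs) == 1:
                u, v = ins[0][0], outs[0][1]
                if u == n or v == n:
                    continue  # ignore self-loops, nothing to simplify
                # Replace the matched sub-graph with a single direct edge.
                edges -= {ins[0], outs[0]}
                edges.add((u, v))
                changed = True
    return edges
```

For example, a request path 1 -> send -> recv -> 2 in which only `send` and `recv` are removable is reduced to the single edge 1 -> 2.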

Angio Trace Instrumentation and Time Stamps

[0281] The instrumentation for a method of the invention for use with an unreliable monitor is shown in Table 5 (Instrumentation Specification for an Unreliable Monitor), and that for use with a reliable monitor is shown in Table 6 (An Optimized Instrumentation Specification for a Reliable Monitor).

[0282] A principle that governs implementation of angio tracing for a reliable monitor is to minimise the data recorded. There are several approaches that are used for an optimised implementation. First, only one event is recorded per instrumentation item, which requires that event type information be combined together. The event identifier syntax of Table 5 is still used but merged events will have combined subscripts. For example, two events e.sub.2 and e.sub.3 are described as the merged event e.sub.2,3. Secondly, only the time stamp fields that change between events are recorded. Thirdly, only one ordering direction is recorded because the reverse ordering can be deduced by post-processing.
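The second approach, recording only the time stamp fields that change between events, can be sketched in Python as follows. The function names and the use of dictionaries for time stamps are hypothetical; the sketch assumes every event carries the same set of fields and that the monitor is reliable, so no event is lost between encoding and decoding.

```python
def delta_encode(events):
    """Keep, for each event, only the fields that differ from the previous event."""
    previous = {}
    encoded = []
    for event in events:
        changed = {k: v for k, v in event.items()
                   if k not in previous or previous[k] != v}
        encoded.append(changed)
        previous = {**previous, **event}
    return encoded


def delta_decode(deltas):
    """Post-processing step that restores the full time stamp of every event."""
    current = {}
    decoded = []
    for delta in deltas:
        current = {**current, **delta}
        decoded.append(dict(current))
    return decoded
```

Decoding relies on the events being stored serially and in order, which is why this optimisation is tied to a reliable monitor.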

[0283] For an implementation description with a reliable monitor, the monitor has several characteristics. First, each object's events are stored serially, in-order. Optionally, different objects may store their events to the same buffer, so that events from different objects are stored in an interleaved fashion. Secondly, the monitor is able to detect missing events or guarantee that no events were missed during recording. Clocks local to each processing node need not be synchronised.

[0284] An object time stamp consists of an object identifier, an object period index, and an object event index. There are several optimisations for these time stamp fields. The object identifier is recorded with each event because the monitoring system is recording values for several objects simultaneously and interleaving the events. The object identifier is used to separate the events during post-processing.

[0285] The object periods are sequentially ordered because object events are serially recorded. The object period values need not be recorded with each event, but they are recorded when an object period ends. In this fashion, a change in an object period value means that a new object period has started and the object index of the succeeding event is reset to one. The object index values are not necessarily recorded because all object events are recorded sequentially; object index values are determinable from this ordering.
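How post-processing might recover the unrecorded object period and object index values is sketched below in Python. The record layout, an object identifier paired with a flag marking the last event of an object period, is an assumption made for illustration; the disclosure states only that the period value is recorded when a period ends and that index values follow from the serial ordering.

```python
def reconstruct_object_stamps(records):
    """Rebuild (object id, period, index) from an interleaved, in-order stream.

    `records` is an iterable of (object_id, period_ended) pairs, where
    period_ended is True for the last event of an object period.
    """
    state = {}  # object_id -> (current period, next event index)
    stamps = []
    for object_id, period_ended in records:
        period, index = state.get(object_id, (1, 1))  # values begin at one
        stamps.append((object_id, period, index))
        if period_ended:
            state[object_id] = (period + 1, 1)  # index resets to one
        else:
            state[object_id] = (period, index + 1)
    return stamps
```

The object identifier is the only field needed on every record, because it separates the interleaved events of different objects.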

[0286] A process time stamp consists of a process name, a process thread identifier, a thread event index, and event type information. Optionally, each of these values is optimised as follows. Process name is recorded by external events as long as the process thread identifiers are globally unique, because the process name does not change throughout a process. Process thread identifier is recorded when a message is received since that is the only time it changes. Thread event index is changed after sending or receiving a message, so that an order of events in different objects is determinable.

[0287] The angio trace is unique because it has two timestamps used to establish an event ordering. One of those time stamps is dedicated to providing a scenario time stamp to order the events in the scenario.

[0288] An angio trace event is recorded by instrumentation embedded within an application that interfaces to the program monitor. A minimal set of instrumentation primitive operations must be supported by the program monitor and they are described here.

[0289] Process time stamp information is added to a message before it is sent to implement the ordering relations {>.sup.A0,<.sup.A0}. A message carries the process time stamp S.sub.1 of the sending object's event that is the cause of the receive event. The process time stamp S.sub.2 will replace the current process time stamp value of the receiving object. The sending object is responsible for generating S.sub.2.

[0290] The monitor provides four operators. The first two are used to manipulate the time stamp values of a message. They are:

[0291] send(e, S.sub.1, S.sub.2) appends the process time stamps S.sub.1 and S.sub.2 to the message that is associated with the send event e;

[0292] rcv(e, S.sub.1, S.sub.2) retrieves the process time stamps S.sub.1 and S.sub.2 that were sent with the message received by event e;

[0293] record(e) atomically records and stores the event e; and

[0294] unique(x) assigns a globally unique value to variable x.
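The four monitor operators can be sketched as a small Python class. The class, its method names, and the dictionary message format are hypothetical; the disclosure defines only what each operator does, not how it is coded. The sketch assumes a single process, uses a lock to make recording atomic, and uses a random UUID as a stand-in for a globally unique value.

```python
import threading
import uuid


class Monitor:
    def __init__(self):
        self._lock = threading.Lock()
        self.trace = []

    def append_stamps(self, message, s1, s2):
        """Attach process time stamps S1 and S2 to an outgoing message."""
        return {**message, "stamps": (s1, s2)}

    def retrieve_stamps(self, message):
        """Return the process time stamps carried by a received message."""
        return message["stamps"]

    def record(self, *events):
        """Store one or more events together, without interleaving."""
        with self._lock:
            self.trace.extend(events)

    def unique(self):
        """Return a value assumed to be unique across all processing nodes."""
        return uuid.uuid4().hex
```

A sending object would call `append_stamps` before transmitting, and the receiving object would call `retrieve_stamps` and replace its current process time stamp with S.sub.2.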

[0295] The instrumentation is listed in Table 5 as well as Table 6. The instrumentation is defined by the last three columns. The instrumentation defines the time stamp information that each event must record and not how the instrumentation is to be coded or executed. Each row in the table should be interpreted as follows: "if the precondition values of object i1 are met, then execute the instrumentation primitives to record the identified events." The instrumentation for the suspension and resumption of a process thread is presented as (P) and (Q). The conventions that are used to describe the instrumentation follow.

[0296] The documentation columns of the table are the columns "Event Connection Interpretation" and "Instrumentation Comments". The "Event Connection Interpretation" column describes the purpose of the specification. The "Instrumentation Comments" column details the finer points of each event connection specification, identifying the purpose for recording each event.

[0297] The "Recorded Event Observations" is the most important column because it illustrates the recorded events and their ordering, using the conventions of the scenario event graph, the time stamp field values, and the icon for the half-join event. The illustrations show the recorded events with the angio trace time stamp information overlaid against the scenario event graph edges and nodes (where applicable). For the sake of simplicity there are two exceptions. First, the icon is added to represent any type of event provided it has the specified time stamp field values. Secondly, the domain data recorded with the activity event information is not shown because it is domain dependent.

[0298] Each illustration identifies the recorded events, as well as their preceding and succeeding event. Events in dashed boxes are the events recorded by the instrumentation and, if there is more than one, they are recorded together atomically. The time stamp field values of the recorded events are the actual values that are recorded.

[0299] In all the illustrations, event e1 precedes the event(s) to be recorded by the instrumentation. The events which may succeed the recorded events are also shown. In some cases, boundary conditions exist. The illustrations will show additional events to describe the boundary condition, but these events will not be recorded in all cases. For example, the source or target node of an object's next period edge may not exist.

[0300] The table refers to recording events for the object i1 and its instrumentation state vector may be used to determine which events to record. The "Precondition State of Object i1" column lists the predicates and conditions which must all be true for the instrumentation primitives to be executed. This state information is the object's instrumented state just prior to recording the events; it is not the state of the object when event e1 was recorded.

[0301] The executed instrumentation primitives are described in a column of the same name. The instrumentation primitives also identify object i1's instrumentation state vector values after executing the primitives.

[0302] The time stamp notation is as follows. A field value may be a symbolic, subscripted variable. Variables in the same event connection specification with the same subscripts have the same value. A time stamp field value with a "-" can take on any value. A field with the place holder value "-" can take on the empty value. All time stamp values are natural numbers, beginning at one.

Time Stamp Optimizations

[0303] Optimisations for each event type are as follows:

[0304] External event always has an index value of one so the index values need not be stored.

[0305] Begin event always has an index value of one as well.

[0306] Activity event is a default event type, therefore, an activity event type label is not recorded.

[0307] The information recorded for a Half-join event is reduced if the object time stamp information is used.

[0308] Fork event is not recorded because the corresponding begin event of the child process thread will provide ordering information.

[0309] End event is not recorded since it occurs predictably and its occurrence is determinable through post-processing.

[0310] According to one particular embodiment, although the event connection specifications define the events and their time stamp values to be recorded, some implementation considerations must be addressed. To minimise the burden of instrumentation on the analyst, it is intended that the angio trace instrumentation is embedded in system activities for that particular environment. Then the analyst's instrumentation effort is limited to identifying external events and object periods. There are also standardisation concerns for use in a heterogeneous environment.

[0311] To amortise the instrumentation effort, it is expected that the instrumentation once designed remains embedded in message passing system functions of a distributed system programming language, system libraries, interface definition language compilers, and operating system kernel calls.

[0312] The analyst adds process specific instrumentation to identify where the execution of each distributed process begins and ends. Optionally, this instrumentation is added manually. Alternatively, the instrumentation for the end of a process is reduced by assuming that a process implicitly ends when another process begins. Software interrupts which signify an external event are easily instrumented as an external event and generate a unique process name automatically to start an angio trace. Also, the generation of angio traces may be transparently incorporated into the design testing effort with little additional cost, as well as providing additional information for debugging.

[0313] Object specific instrumentation is also necessary to identify beginning and ending of object periods. There are three independent approaches to reduce object period instrumentation; since object endings are likely more frequent than process endings, optimisation of this instrumentation is advantageous. When optimised, the start of an object period is deduced automatically for some system activities such as an RPC message acceptance or object creation. This is also true for the ending of an object period.

[0314] Another optimisation approach is to identify where synchronisation between process threads occurs. A service period serves as a boundary between different angio traces and it identifies synchronisation between process threads. However, angio trace separation is determinable from the process name values in the process time stamp. So, if the synchronisation points are instrumented then the service periods are determinable. For example, synchronisation is automatically identified by nested accept statements in ADA, nested interleaved RPC interactions, or synchronisation barriers in parallel programs.

[0315] Yet another optimisation approach is to introduce constraints on a process and use heuristics to deduce the start and end of an object period. If a process is constrained to being initiated by a single external event then the history of the process is used to infer the start of an object period, the end of an object period, and where a synchronisation occurs. When a single test-driver is used to initiate a process then this is a feasible approach.

[0316] The selection of which approach to adopt should be assessed for each application; however, it should be noted that the object period information is important design documentation, which is generally not captured.

[0317] An implementation concern to be addressed is standardisation of the format for use in a heterogeneous environment, including a trace format specification and the primitive ordinal types that are used by the specification.

MMAP Process Application

[0318] The method as described herein is also applicable to verifying application functionality. When an application is specified in a graph language such as use cases or message sequence charts, the graph language statements provided by the method according to the invention are translatable into said scenario graph language. A comparison between the specification scenario graph language description and the execution scenario graph language description results in design specification verification and improves overall design verification.

[0319] Also, since the method provides as an output a scenario graph language description of process and object execution, transformation of the output to provide different views of system execution is possible. Though the graph language described herein is complete and sound, optionally the transformations eliminate these properties in order to provide data in a manner that is more useful to an operator, designer, or a corporate executive. Many such transforms are applicable to each graph language output from such a system according to the invention.

[0320] There are other uses for a scenario event graph aside from scenario causality or precedence ordering of events. It is applicable to automatically generating software performance models from traces of execution. Generic event templates are used to identify interactions and object behaviour. The interactions and object behaviour are mapped onto a performance model. Race detection and system visualisation make use of the interaction information. In this fashion, system optimisation and resource allocation are improved. Also, the process ordering relation provides a more selective view of potential causes of an event, which is a useful starting point for debugging.

[0321] In accordance with the invention a physical process is modelled. Examples of physical processes for modelling include manufacturing, purchasing, workflow, chemical processes, etc. By tracing events occurring through a scenario in a predetermined fashion, flow graphs relating to objects and processes within the scenario are determined. These graphs are then used to automate certain objects that are commonly repeated and therefore in need of optimisation, that form bottlenecks, or that are performed inefficiently due to flow related issues. In workflow modelling, a plurality of people and systems record events during normal work. These events are then constructed into scenario graph language models which are transformed into different domains for different purposes. An evaluation of overtime efficiency is one such application. Elimination of inefficient but required activities, identification of resource shortages, automation of objects within processes, reduction of cost, and other optimisations are determined based on domain specific output.

[0322] Similarly in manufacturing, common sources of delay are identified and analysed to determine a cost for delays and a cost of implementing preventative action to eliminate delay. A simple business decision follows to determine whether or not to implement a delay preventing process. Essentially, gathering of event related information and automatically transforming same into process flow related information is beneficial in many fields.

[0323] Similar to the concept of "proper time" in relativity, a frame of reference may be any object, or, in this embodiment a response thread. Selection of a frame of reference does not affect validity of the results obtained. There is a duality between an object and a process thread. An observer that chooses an object as a frame of reference sees a succession of process threads, whereas, if the process thread is chosen as a reference, the observer sees a succession of objects.

Angio Trace Application

[0324] Alternatively, angio traces as described herein have several applications beyond model construction. An angio trace is so named because it is similar to medical applications, angiograms, where a dye is injected into a patient and its movement through the body is monitored. Similarly, when using an angio trace, monitoring permits analysis of flow of communications through an application. The term angio dye is used herein to describe an identifier forming part of a time stamp that allows for analysis of application execution and communication flow during execution, abstract execution, simulation, emulation, etc.

[0325] The use of an angio dye as used in an angio tracing system described herein allows for tracking information flow in a process during execution. As such, angio dyes are applicable to system self-monitoring. An example of a self-monitoring application is monitoring of network objects for crashes or resource overload. For example, when a process is divided among several processors in distributed systems, each system is required to transmit angio dye related information at predetermined intervals. The information is used to monitor progress on provided objects or applications and to establish that each distributed processor is in operation. Failure to receive angio dye related information, or failure of a processor to progress fast enough, results in corrective action such as providing the same object to another processor for execution. Optionally, the first object execution request is not withdrawn and results from a first processor to complete the object are used.
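The self-monitoring check described above can be sketched in Python. The function name, the use of numeric time values, and the `max_silence` threshold are assumptions made for illustration; the disclosure does not specify how overdue processors are detected.

```python
def overdue_processors(last_report, now, max_silence):
    """Return the processors whose last angio dye report is too old.

    `last_report` maps a processor name to the time of its latest report;
    `now` and `max_silence` use the same (assumed) time unit.
    """
    return sorted(p for p, t in last_report.items() if now - t > max_silence)
```

Each processor returned by the check is a candidate for corrective action, such as providing its object to another processor for execution.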

[0326] In another example application, angio dye related information is used to prevent record-playback of an encryption key. Since a time stamp as used in angio tracing is substantially unique, packaging an encryption key with such a time stamp prevents its use at a later time. This allows for a traced system employing angio tracing to distinguish between current communicated information and previous or stale communicated information. Of course many other applications of angio dye related information, scenario graph language models, and tracing may be envisioned without departing from the spirit or scope of the invention.
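A minimal Python sketch of the record-playback check follows. The `ReplayGuard` class is hypothetical, and the sketch assumes the receiver can remember every time stamp it has already accepted. It does not authenticate the pairing of key and time stamp, which a practical system would also need to do.

```python
class ReplayGuard:
    """Accept a key only the first time its accompanying time stamp is seen."""

    def __init__(self):
        self._seen = set()

    def accept(self, key, stamp):
        if stamp in self._seen:
            return False  # stale: this time stamp was already presented
        self._seen.add(stamp)
        return True
```

A recorded message that is played back later carries a time stamp the receiver has already seen, so the key packaged with it is rejected.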

[0327] In an alternative embodiment, angio tracing and scenario event graph are used to model a hardware process. For example, in design and implementation of a large hardware device, simulation is often employed. During simulation, an angio trace and a scenario event graph model of the simulation or of the simulation as well as a software simulation is constructed and analysed. This permits design verification, design optimisation, and performance evaluation. Similarly when a design is intended for mass production or is implemented in a programmable device, a hardware based angio trace for analysis in forming a scenario event graph model is employed. As much integrated circuit design involves library circuit blocks, such an implementation of a monitor for angio tracing is not unrealistic and provides numerous advantages as disclosed herein.

[0328] It is known to perform pattern analysis for design of software applications, workflow engineering, and process design. According to an embodiment of the invention, a SEG is analysed to determine patterns therein. These patterns are in the form of at least one of predetermined patterns and patterns identified through analysis of the SEG. Identification of patterns within the SEG provides valuable information for use in system optimisation, reverse engineering, design review, implementation analysis, and so forth.

[0329] In order to identify patterns within a SEG a generic mathematical approach to pattern recognition is applicable. Patterns within the SEG are identified as being identical or substantially similar in some aspect. For example, flow of a graph segment when identical is identified. Non flow related events within the graph segments are then compared in order to determine whether a correlation exists. When optimisation is possible on one of the identified graph segments, the other graph segments are reviewed to determine an applicability of a same or similar optimisation.
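One simple way to find graph segments with identical flow is sketched below in Python. The segment representation (a list of typed edges) and the function name are assumptions made for illustration; the disclosure leaves the pattern recognition approach generic. Segments are grouped by the sequence of their edge types, ignoring node names.

```python
from collections import defaultdict


def group_by_flow(segments):
    """Group segment names whose edge-type sequences are identical.

    `segments` maps a segment name to a list of (edge_type, source, target).
    """
    groups = defaultdict(list)
    for name, edges in segments.items():
        signature = tuple(edge_type for edge_type, _, _ in edges)
        groups[signature].append(name)
    # Only groups with at least two members indicate a repeated pattern.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

The non flow related events of the segments within each group can then be compared to decide whether an optimisation found for one segment applies to the others.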

[0330] Alternatively, when substantially similar or identical graph segments are identified, design analysis to optimise a process in the form of a computer software program for memory utilisation, speed, reliability, or other known goals of design analysis and optimisation is performed. The design analysis, because it is of an executing software program, is an accurate and pertinent analysis of the process as implemented.

[0331] Numerous other embodiments may be envisioned without departing from the spirit or scope of the invention.

TABLE 2
Connection Axiom | Node Type | Previous Node Connection Axiom in the Same Object Event Graph (InProc and InObject) | Previous Node Connection Axiom in a Different Object Event Graph (InProcExt) | Successor Node Connection Axiom in the Same Object Event Graph (OutProc and OutObject) | Successor Node Connection Axiom in a Different Object Event Graph (OutProcExt)
A | External | n/a | n/a | n/a | M, N, O
B | Thread end | C, E, G, H, I, J, K, L, M, N, O | n/a | E, I, M | n/a
C | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | B, C, D, F, H, J, L, N | n/a
D | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | G, K, O | E, J
E | Action | B, F | D | B, C, D, F, H, J, L, N | n/a
F | Action | C, E, G, H, I, J, K, L, M, N, O | n/a | E, I, M | G
G | Action | D | F | B, C, D, F, H, J, L, N | n/a
H | Fork | C, E, G, H, I, J, K, L, M, N, O | n/a | B, C, D, F, H, J, L, N | I, K, L
I | Thread begin | B, F | H | B, C, D, F, H, J, L, N | n/a
J | And-join | C, E, G, H, I, J, K, L, M, N, O | D | B, C, D, F, H, J, L, N | n/a
K | And-join | D | H | B, C, D, F, H, J, L, N | n/a
L | And-join | C, E, G, H, I, J, K, L, M, N, O | H | B, C, D, F, H, J, L, N | n/a
M | Thread begin | B, F | A | B, C, D, F, H, J, L, N | n/a
N | And-join | C, E, G, H, I, J, K, L, M, N, O | A | B, C, D, F, H, J, L, N | n/a
O | And-join | D | A | B, C, D, F, H, J, L, N | n/a

[0332]

TABLE 3 Edge Type Assignment of Nodes e.sub.1 to e.sub.2
where e.sub.1 = [j.sub.1, k.sub.1, m.sub.1; i.sub.1, c.sub.1, v.sub.1] and e.sub.2 = [j.sub.2, k.sub.2, m.sub.2; i.sub.2, c.sub.2, v.sub.2]
Event type of e.sub.1 | Successor Event e.sub.2 in the Task Event Graph, >.sup.T(e.sub.1, e.sub.2) | Successor Event e.sub.2 in the Same Operation Thread, >.sup.AT(e.sub.1, e.sub.2) | Successor Event e.sub.2 that is not in the Same Operation Thread, >.sup.Ao(e.sub.1, e.sub.2)
Activity event, activity(e.sub.1, -) | (c.sub.1 = c.sub.2) → nextTask(e.sub.1, e.sub.2) ∨ (c.sub.1 ≠ c.sub.2) → nextPeriod(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
External event, external(e.sub.1, -) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
Begin event, begin(e.sub.1) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | N/A
Fork event, fork(e.sub.1, -) | nextTask(e.sub.1, e.sub.2) | nextOpTh(e.sub.1, e.sub.2) | andFork(e.sub.1, e.sub.2)
Half-join event, isHJoin(e.sub.1) | nextTask(e.sub.1, e.sub.2) | N/A | N/A
End event, end(e.sub.1) | (c.sub.1 = c.sub.2) → nextTask(e.sub.1, e.sub.2) ∨ (c.sub.1 = c.sub.2) → nextPeriod(e.sub.1, e.sub.2) | N/A | N/A

[0333]

* * * * *