Storage Of Software Execution Data By Behavioral Identification Puthuff; Neil Craig [ZeroDee, Inc.]

Patent Application Summary

U.S. patent application number 14/304050 was filed with the patent office on 2014-06-13 and published on 2014-11-13 for storage of software execution data by behavioral identification. The applicant listed for this patent is ZeroDee, Inc. Invention is credited to Neil Craig Puthuff.

Publication Number	20140337822
Application Number	14/304050
Family ID	46878420
Filed Date	2014-06-13
Publication Date	2014-11-13

United States Patent Application	20140337822
Kind Code	A1
Puthuff; Neil Craig	November 13, 2014

STORAGE OF SOFTWARE EXECUTION DATA BY BEHAVIORAL IDENTIFICATION

Abstract

Methods and systems for analyzing software. For example, one method can include executing a software program including a function by a computer. The method also includes producing an execution sequence for the function when, during execution, the software program executes the function. The method further includes generating an identifier for the execution sequence, wherein the identifier uniquely identifies a path of execution through the function represented by the execution sequence. In addition, the method includes saving the identifier and making the identifier available to at least one user through a user interface.

Inventors:

Puthuff; Neil Craig; (McLean, VA)

Applicant:

Name	City	State	Country	Type
ZeroDee, Inc.	Alexandria	VA	US

Family ID:

46878420

Appl. No.:

14/304050

Filed:

June 13, 2014

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
13428572	Mar 23, 2012	8776029
14304050
13428597	Mar 23, 2012
13428572
61466818	Mar 23, 2011
61466828	Mar 23, 2011

Current U.S. Class:	717/125 ; 717/131
Current CPC Class:	G06F 11/28 20130101; G06F 11/3636 20130101; G06F 11/3612 20130101
Class at Publication:	717/125 ; 717/131
International Class:	G06F 11/36 20060101 G06F011/36

Claims

1. A method for processing software, the method comprising: executing a software program, by a computer, the software program comprising a function; when, during execution, the software program executes the function, producing an execution sequence of the function; generating an identifier for the execution sequence, wherein the identifier uniquely identifies a path of execution through the function represented by the execution sequence; saving the identifier; and making the identifier available to at least one user through a user interface.

2. The method of claim 1, further comprising: accessing at least one data storage medium storing previously-generated identifiers associated with functions of the software program; and comparing the identifier to the previously-generated identifiers to determine whether the identifier is already stored in the at least one data storage medium.

3. The method of claim 2, wherein saving the identifier includes saving the identifier when the identifier is not already stored in the at least one data storage medium.

4. The method of claim 2, further comprising incrementing a count value associated with the identifier when the identifier is previously stored in the at least one data storage medium.

5. The method of claim 2, wherein the function includes a defined function or set of instructions.

6. The method of claim 1, wherein the identifier for the execution sequence includes a sum of operational code hash values or conditional execution instruction hash values for the execution sequence.

7. The method of claim 1, further comprising: executing a second function in the software program when encountering a function call, a call stack, a context switch, a switch statement, a branch point, or a conditional execution instruction; producing a second execution sequence of the second function; generating a second identifier for the execution sequence, wherein the second identifier uniquely identifies a path of execution through the second function represented by the second execution sequence; and saving the second identifier when the identifier is not already stored in the at least one data storage medium.

8. The method of claim 1, further comprising: generating a hash table of identifiers associated with functions of the software program, wherein each identifier includes a hash value; counting a number of times each execution sequence is encountered in the execution of the software program represented by the identifier for each execution sequence and associating a count with the corresponding identifier; and displaying the hash table of identifiers and the count associated with functions of the software program.

9. The method of claim 1, further comprising: selecting the identifier; identifying source code or function variables representing the execution sequence of the function; and displaying the identifier with a link to the source code or function variables representing the execution sequence of the function.

10. The method of claim 1, further comprising: identifying source code or function variables representing the execution sequence of the function; and saving at least one selected from the group comprising the identifier with a link to the source code or function variables representing the execution sequence of the function and the identifier with the source code or values of the function variables representing the execution sequence of the function.

11. At least one non-transitory machine readable storage medium comprising a plurality of instructions adapted to be executed to implement the method of claim 1.

12. A system for processing software, the system comprising: a processor configured to execute a software program comprising a function; produce an execution sequence of the function during execution of the function; generate an identifier for the execution sequence, wherein the identifier uniquely identifies a path of execution through the function represented by the execution sequence; and at least one data storage medium configured to save the identifier.

13. The system of claim 12, further comprising a user interface configured to make the identifier available to at least one user.

14. The system of claim 12, wherein the processor is further configured to generate an index table of identifiers associated with functions of the software program, wherein each identifier includes an index value; the at least one data storage medium configured to save the index table of identifiers; and the user interface configured to make the index table of identifiers accessible to the at least one user.

15. The system of claim 12, wherein the processor is further configured to access the at least one data storage medium storing previously-generated identifiers associated with functions of the software program; and compare the identifier to the previously-generated identifiers to determine whether the identifier is already stored in the at least one data storage medium.

16. The system of claim 15, wherein the processor is further configured to save the identifier when the identifier is not already stored in the at least one data storage medium.

17. The system of claim 16, further comprising a counter configured to increment a count value associated with the identifier when the identifier is previously stored in the at least one data storage medium.

18. The system of claim 12, wherein the function includes a defined function or a specific code segment with sequential code instructions.

19. The system of claim 12, wherein the identifier for the execution sequence is derived from an arithmetic or logic operation on the operational code hash values or conditional execution instruction hash values for the execution sequence.

20. The system of claim 12, further comprising a data buffer configured to collect execution sequences of functions in real-time during the execution of the software program.

Description

RELATED APPLICATIONS

[0001] This application is a continuation-in-part of U.S. application Ser. No. 13/428,572, filed on Mar. 23, 2012, which claims priority to U.S. Provisional Application No. 61/466,818, filed on Mar. 23, 2011; the entire content of these applications is hereby incorporated by reference. This application is also a continuation-in-part of U.S. patent application Ser. No. 13/428,597, filed on Mar. 23, 2012, which claims priority to U.S. Provisional Application Ser. No. 61/466,828, filed on Mar. 23, 2011; the entire content of these applications is hereby incorporated by reference.

FIELD

[0002] Embodiments of the present invention relate to developing and analyzing computer software. For example, embodiments of the invention provide methods and systems for identifying unique behaviors of a software execution sequence, storing the unique behaviors, and using and/or exporting the stored unique behaviors to assess the computer software.

BACKGROUND

[0003] Software is created from source code that is written by software developers. In the process of writing software, many defects are unintentionally introduced into the software code. These defects are generally referred to as "bugs," and can be very difficult to isolate and understand using existing tools and methods. Accordingly, defect-free computer software has always been difficult to create. In all but a few instances, computer software knowingly contains many residual defects that are too elusive or subtle to economically remove.

[0004] For example, consider the following example of a small software function:

int example(char x, char y, char z)
{
    int rtnVal = 0;
    switch(z)
    {
        case 0: rtnVal = (x-y); break;
        case 1: rtnVal = ((int)(x*100)) / (x+y); break;
        case 2: rtnVal = (x<<y); break;
        case 3: rtnVal = 100; break;
    }
    return rtnVal;
}

[0005] From initial inspection this function might be expected to behave in only four possible ways (i.e., one path for each "case" statement reached by evaluating argument "z"). However, there are additional behaviors to this example function that can be difficult to detect. For example, there is no "default" condition for the "switch" statement. Therefore, if the value of argument "z" is something other than 0, 1, 2, or 3, then no case statement will be reached and the "switch" statement will fall through and return 0. The effects of this defect can range from benign to catastrophic. Similarly, if the sum of arguments "x" and "y" results in a value of 0 when argument "z" is set to 1, the result will be a divide-by-zero exception (see "case 1"), which is generally viewed as a catastrophic error condition. Also, if argument "y" is greater than 31 when argument "z" is 2, the overflow of the shift operation will cause the return value to be 0 or -1 regardless of the value of argument "x." Any of these behaviors can be very difficult to detect using conditional-capture methods. Also, the effects of any of these unwanted behaviors can be so catastrophic (such as a system reset) that they eradicate the evidence of the cause of the error. Similarly, in some situations, the effects of any of these unwanted behaviors can be so benign that nobody notices that something is incorrect, or can happen so infrequently that they cannot be reproduced within a reasonable time frame (and, therefore, cannot be properly debugged using traditional software debugging tools). Note that the above example function is simple and used solely for illustration purposes. In real software development, functions are likely more complex, which leads to more potential behaviors (both wanted and unwanted). This complexity further complicates the debugging process.
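As an illustration only, the following C sketch probes the missing "default" behavior described above. It reproduces the example function unchanged and adds a hypothetical helper, count_zero_results, which is not part of the original disclosure. The helper fixes x = 5 and y = 3 (arbitrary values chosen to avoid the divide-by-zero and shift-overflow cases) and counts how many values of "z" from 0 to 127 silently return 0.

```c
#include <assert.h>

/* The example function from the text, reproduced unchanged. */
int example(char x, char y, char z)
{
    int rtnVal = 0;
    switch (z) {
    case 0: rtnVal = (x - y); break;
    case 1: rtnVal = ((int)(x * 100)) / (x + y); break;
    case 2: rtnVal = (x << y); break;
    case 3: rtnVal = 100; break;
    }
    return rtnVal;
}

/* Hypothetical probe (an assumption for illustration): with x = 5 and
 * y = 3, none of the four handled cases returns 0, so every 0 result
 * below comes from the switch falling through with no matching case. */
int count_zero_results(void)
{
    int zeros = 0;
    for (int z = 0; z <= 127; z++) {
        if (example(5, 3, (char)z) == 0)
            zeros++;
    }
    return zeros;
}
```

With these inputs the four handled selectors return 2, 62, 40, and 100, while each of the remaining 124 selectors returns 0 without any indication that no case was reached.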

SUMMARY

[0006] When considering the task of discovering the root cause of a software defect, all of the answers are in the computer chip or system. In particular, the computer chip or system contains the cause of every bug, the value of every program variable, how every line of software actually behaves, and every software vulnerability and optimization opportunity. If it were possible to access and analyze this information in its entirety, then software development could be much easier and result in fewer residual bugs. This superabundance of information is always present in a computer that is running software, yet for much of the computer age this was too much information to export, collect, or process economically. In response, software debugging tools have been designed to limit the export of execution information to a tiny portion of the total available, to give software developers only the information they specifically request using tools and methods of conditional debugging.

[0007] For example, software developers traditionally have relied on tools and methods of conditional debugging. Conditional debugging requires software developers to pre-determine a condition or sequence of conditions that must be satisfied in the target computer before enabling the capture of execution data. Examples of conditional debuggers include breakpoint debuggers (where one or more predefined breakpoint conditions are set at fixed locations in the software code to enable data capture), single-step debuggers (wherein program code can be stepped instruction-by-instruction, resulting in manual data capture at instruction boundaries), print debugging (wherein the target software has additional instructions inserted to export data from predetermined locations), and real-time trace debuggers (wherein dedicated circuitry performs the real-time export of software execution data while the computer system is running at full speed, and includes triggering circuitry to enable data capture around a predefined condition or a predefined sequence of conditions).

[0008] A shortcoming of conditional debugging is that the developer must know in advance the exact condition around which to capture data for each and every behavior of interest that the software exhibits. For example, a software developer may become aware of a defect or undesirable behavior of software and begin searching for its cause. Using conditional debugging, the developer can set a breakpoint condition or trigger condition based on the developer's best guess of the possible cause of the incorrect behavior. The software program is then executed until the breakpoint or trigger condition is satisfied. When the condition is satisfied, execution data is collected. However, the collected execution data may not necessarily reveal the underlying cause of the incorrect behavior. In particular, in many situations, the developer needs to modify the breakpoint or trigger condition to more closely match the conditions of the incorrect behavior. The developer repeats this process until the defect is located. This iterative process can take hours or days to complete and typically results in the correction of just one software defect.

[0009] These forms of conditional debugging are highly intrusive. In particular, these techniques can alter the flow of program execution enough to make the original problems non-reproducible during debugging. Furthermore, these methods are created on the premise that a software developer will search for the cause of one known, reproducible bug at a time. Searching for one bug at a time requires the developer to first make an educated guess about where a particular defect originates (i.e., to set a breakpoint, trigger, or other mechanism to capture the exact portion of execution data that contains evidence of the cause of the present problem). This search for defects is usually an iterative process, since the causes of software errors are often not easy to determine, and a series of iterations can add up to a long time spent finding and correcting just one error, particularly if the error has a low recurrence rate or is otherwise difficult to reproduce. Furthermore, these debugging techniques may only help a developer isolate software defects that the developer becomes aware of through external symptoms. Defects with subtle symptoms or very low recurrence rates can often elude detection through the entire development process, and end up shipping with the final product.

[0010] Breakpoint debuggers and other traditional conditional debugging tools are rooted in a past era wherein technical limitations prevented the economic export and capture of the vast amount of information available on the computer chip. A recent development is the real-time trace ("RTT") port, such as ARM ETM, MIPS PDTrace and IEEE/ISTO Nexus-5001, which is specialized logic added to a computer system to non-intrusively export the vast amount of execution data present in the computer as it runs at full speed. As these RTT ports are capable of exporting very large quantities of data and as an aid to conditional debugging methods, they generally include condition-detection logic to signal that a pre-defined triggering event or sequence has occurred, which is then used to indicate the exported data should be captured for analysis in either an in-system buffer or by an external system.

[0011] Accordingly, software debuggers using RTT have been developed with a similar mindset as breakpoint debuggers. For example, the debuggers are used to capture a relatively small quantity of data around a pre-defined event or sequence. These RTT debuggers offer a similar set of features as their conditional debugger predecessors: breakpoints, single-stepping, examining variables, etc. using the data that has been captured from the RTT port.

[0012] Recent improvements in RTT debuggers involving the collection of larger quantities of real-time trace data show some promise as a more effective means of software debugging. These systems use fixed-size buffers of up to 4 gigabytes for high-bandwidth collection of several seconds of execution data, or employ spool-to-disk methods for low-bandwidth execution data collection over extended periods. The captured data can then be analyzed to obtain profiling or code coverage information, or replayed as though debugging a live computer target with a conditional debugger. For example, Lauterbach GmbH's "Real-time Streaming (ETMv3)" technology performs extended-duration recording of real-time trace data and creates profiling and code coverage summaries on-the-fly. Execution profiling and code coverage are useful and have been available for many years, but neither of these will detect the unique individual behaviors of the called functions. Correct and incorrect behaviors will be included ambiguously in the profiling and coverage summaries just like any other function-behavioral iteration. In short, these enhancements continue to rely on the developer to manually locate any behavioral anomalies. This crucial shortcoming is inherent in all conditional debuggers: they do not detect variations in the behavior of the software, nor do they use this as a basis for data collection.

[0013] As newly written software will typically contain many defects, the process of debug and test can take an unpredictably long period of time to complete, and can account for 80% of the total cost of software development. This has made computer software the most expensive and unpredictable component in many of the intelligent, connected devices that utilize computer software for enhanced functionality. These difficulties have remained remarkably constant for decades, despite continuing advances and repeated "breakthroughs" in software debugging technology.

[0014] Given that the answer to every software defect is on the computer chip, conditional debugging methods are hindering developers from getting the answers they need and are the direct cause of the high costs, unpredictable schedules, and poor resulting quality in software development.

[0015] Accordingly, embodiments of the present invention provide means to uniquely identify software behaviors at a point where the execution information is most abundant--inside the computer system. If implemented inside the computer system, this more effectively manages the limited capacity of conventional debug collection or export facilities for the exclusive use of unique software behaviors. Given sufficient capacity for behavior identifiers and execution data export or capture, continuous software behavioral analysis and behavioral anomaly capture can be accomplished for entire software programs and multi-program systems. This can also be implemented external to the target computer system, receiving execution data from a high-capacity RTT port or other resources. Both implementations improve the software development process by eliminating the need for conditional debugging and by enabling a more rigorous approach to software quality through providing a means to individually review and approve-or-improve every unique behavior exhibited by every software function.

[0016] Therefore, embodiments of the present invention provide methods and systems for identifying behavioral uniqueness of software execution sequences as a basis for collection and/or export of software execution data and related information. One method can include executing a software program and continuously producing a sequence of execution information. The method can also include determining if the execution information is within a functional boundary of the software program, and determining if the execution sequence of the execution information is a new execution sequence or a repeat execution sequence.

[0017] One system can include a functional boundary detector for continuously analyzing execution information of a software program to determine if the execution information is within a functional boundary of said software program, an execution behavior identification number generator to create unique behavioral identifiers for unique execution sequences, and a comparator provided for determining if an execution sequence of the execution information is a new execution sequence or a repeat execution sequence and producing a unique detection signal if the new execution sequence is detected. Therefore, the system identifies behavioral uniqueness of software execution sequences.
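The three components named above (functional boundary detector, identifier generator, and comparator) can be pictured with a minimal C sketch. This is a software illustration under stated assumptions, not the disclosed circuitry: the event kinds, the behavior_monitor structure, the monitor_feed function, the multiply-by-31 mixing step, and the fixed table size are all hypothetical, and nested function calls are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace event kinds; real trace ports encode these differently. */
enum event_kind { EV_ENTER, EV_INSTR, EV_EXIT };

struct trace_event {
    enum event_kind kind;
    uint32_t value;   /* instruction address for EV_INSTR, unused otherwise */
};

#define MAX_KNOWN 32  /* assumed capacity of the identifier store */

struct behavior_monitor {
    int inside;                 /* boundary detector state: 1 while in a function */
    uint32_t running_id;        /* identifier accumulated for the current sequence */
    uint32_t known[MAX_KNOWN];  /* identifiers of sequences seen so far */
    size_t known_count;
};

/* Feeds one event to the monitor. Returns 1 (standing in for the "unique
 * detection signal") when a function exit completes an execution sequence
 * whose identifier has not been seen before; returns 0 otherwise. */
int monitor_feed(struct behavior_monitor *m, struct trace_event ev)
{
    assert(m != NULL);
    switch (ev.kind) {
    case EV_ENTER:              /* functional boundary: start of a sequence */
        m->inside = 1;
        m->running_id = 0;
        return 0;
    case EV_INSTR:              /* identifier generator: assumed mixing step */
        if (m->inside)
            m->running_id = m->running_id * 31u + ev.value;
        return 0;
    case EV_EXIT:               /* functional boundary: end of a sequence */
        if (!m->inside)
            return 0;
        m->inside = 0;
        for (size_t i = 0; i < m->known_count; i++) {
            if (m->known[i] == m->running_id)
                return 0;       /* comparator: repeat execution sequence */
        }
        if (m->known_count < MAX_KNOWN)   /* a full store drops the new entry */
            m->known[m->known_count++] = m->running_id;
        return 1;               /* comparator: new execution sequence */
    }
    return 0;
}
```

Feeding the same enter, instruction, and exit events twice yields the signal only the first time; a sequence that visits a different instruction address yields it again.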

[0018] In particular, embodiments of the present invention provide methods and systems for analyzing and debugging a software program. In one embodiment, a method for processing software includes executing a software program including a function, by a computer. The method also includes producing an execution sequence of the function when the software program executes the function. In addition, the method includes generating an identifier for the execution sequence, saving the identifier, and making the identifier available to at least one user through a user interface. The identifier uniquely identifies a path of execution through the function represented by the execution sequence.

[0019] The method can also include accessing at least one data storage medium storing previously-generated identifiers associated with functions of the software program, and comparing the identifier to the previously-generated identifiers to determine whether the identifier is already stored in the at least one data storage medium. The operation of saving the identifier can include saving the identifier when the identifier is not already stored in the at least one data storage medium. The method can further include incrementing a count value associated with the identifier when the identifier is previously stored in the at least one data storage medium. The function includes a defined function or a specific code segment with sequential code instructions. A high count value can represent a higher frequency of execution of the execution sequence. Conversely, a low count value can be used to identify infrequently used execution sequences, which may represent an execution sequence with an error. In an example, the identifier for the execution sequence includes a sum of operational code hash values or conditional execution instruction hash values for the execution sequence.
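The identifier generation and save-or-count steps described above can be sketched in C as follows. This is a minimal sketch under stated assumptions: opcode_hash is a hypothetical stand-in (the text does not specify a hash function), and the names sequence_identifier, id_store, and record_identifier, along with the fixed capacity, are introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixing function standing in for an "operational code hash
 * value"; the choice of hash is an assumption, not taken from the text. */
static uint32_t opcode_hash(uint32_t opcode)
{
    uint32_t h = opcode * 2654435761u;   /* multiplicative mixing constant */
    return h ^ (h >> 16);
}

/* Identifier for one execution sequence: the wrapping 32-bit sum of the
 * hashes of the opcodes that were executed. */
uint32_t sequence_identifier(const uint32_t *opcodes, size_t n)
{
    assert(opcodes != NULL || n == 0);
    uint32_t id = 0;
    for (size_t i = 0; i < n; i++)
        id += opcode_hash(opcodes[i]);
    return id;
}

#define MAX_IDS 64   /* assumed capacity of the data storage medium */

struct id_store {
    uint32_t ids[MAX_IDS];
    unsigned counts[MAX_IDS];   /* times each identifier was encountered */
    size_t used;
};

/* Compares the identifier with those already stored. Returns 1 if it was
 * new and has been saved, 0 if it was already stored (its count is
 * incremented instead), or -1 if the store is full. */
int record_identifier(struct id_store *s, uint32_t id)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->used; i++) {
        if (s->ids[i] == id) {
            s->counts[i]++;
            return 0;
        }
    }
    if (s->used == MAX_IDS)
        return -1;
    s->ids[s->used] = id;
    s->counts[s->used] = 1;
    s->used++;
    return 1;
}
```

Note that a plain sum is insensitive to instruction order and that hash values can collide, so this sketch by itself does not guarantee that distinct paths receive distinct identifiers.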

[0020] In another configuration, the method can further include executing a second function in the software program when encountering a function call, a call stack, a context switch, a switch statement, a branch point, or a conditional execution instruction. The next operations can include producing a second execution sequence of the second function, generating a second identifier for the execution sequence, where the second identifier uniquely identifies a path of execution through the second function represented by the second execution sequence, and saving the second identifier when the identifier is not already stored in the at least one data storage medium. The software program can include multiple distinct functions.

[0021] In another configuration, the method can further include generating a hash table of identifiers associated with functions of the software program, wherein each identifier includes a hash value, counting a number of times each execution sequence is encountered in the execution of the software program represented by the identifier for each execution sequence and associating a count with the corresponding identifier, and displaying the hash table of identifiers and the count associated with functions of the software program. The hash table can improve the accessibility and visualization of the identifiers and execution sequence of the software program. The method can include selecting the identifier, identifying source code or function variables representing the execution sequence of the function, and displaying the identifier with a link to the source code or function variables representing the execution sequence of the function. The method can include identifying source code or function variables representing the execution sequence of the function; and saving the identifier with a link to the source code or function variables representing the execution sequence of the function, or saving the identifier with the source code or values of the function variables representing the execution sequence of the function. Linking the identifier to the source code in the source file can allow a user to quickly replay, analyze, or visualize the source file for an identifier.
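One way to picture the displayed table of identifiers, counts, and source links is the following C sketch. The behavior_row layout, the rarest-first ordering, the "file:line" form of the link, and the output format are assumptions made for illustration; the text does not prescribe any of them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical table row: one identifier, its count, and a source link. */
struct behavior_row {
    uint32_t id;
    unsigned count;
    const char *file;   /* source file containing the function */
    int line;           /* line where the function begins */
};

/* qsort comparator: ascending count, so rarely seen behaviors come first. */
static int by_count(const void *a, const void *b)
{
    const struct behavior_row *ra = a;
    const struct behavior_row *rb = b;
    return (ra->count > rb->count) - (ra->count < rb->count);
}

/* Orders the table so that infrequent execution sequences are listed first. */
void sort_rarest_first(struct behavior_row *rows, size_t n)
{
    if (n == 0)
        return;
    assert(rows != NULL);
    qsort(rows, n, sizeof rows[0], by_count);
}

/* Formats one display line: identifier, count, and "file:line" link.
 * Returns the number of characters snprintf would have written. */
int format_row(char *buf, size_t size, const struct behavior_row *row)
{
    assert(buf != NULL && row != NULL);
    return snprintf(buf, size, "0x%08x  count=%u  %s:%d",
                    (unsigned)row->id, row->count, row->file, row->line);
}
```

Listing low-count rows first is a design choice of this sketch, motivated by the observation that infrequently executed sequences are the ones most likely to warrant review.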

[0022] Another embodiment of the invention can provide a system for processing software that includes a processor and at least one data storage medium. The processor is configured to execute a software program comprising a function, produce an execution sequence of the function during execution of the function, and generate an identifier for the execution sequence. The identifier uniquely identifies a path of execution through the function represented by the execution sequence. The at least one data storage medium is configured to save the identifier.

[0023] The processor can further be configured to access the at least one data storage medium storing previously-generated identifiers associated with functions of the software program, and compare the identifier to the previously-generated identifiers to determine whether the identifier is already stored in the at least one data storage medium. The processor is further configured to save the identifier when the identifier is not already stored in the at least one data storage medium. The system can include a counter configured to increment a count value associated with the identifier when the identifier is previously stored in the at least one data storage medium. The function includes a defined function or a specific code segment with sequential code instructions. The identifier for the execution sequence is derived from an arithmetic and/or logic operation on the operational code hash values or conditional execution instruction hash values for the execution sequence. The system can include a data buffer configured to collect execution sequences of functions in real-time during the execution of the software program.

[0024] Another embodiment of the invention can provide a user interface configured to make unique behavior identifiers available to at least one user. A processor can be configured to generate an index table of identifiers associated with functions of the software program, where each identifier includes an index value. The at least one data storage medium can be configured to save the index table of identifiers. The user interface can be configured to make the index table of identifiers accessible to at least one user.

[0025] Other aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] The accompanying drawings are incorporated in and constitute a part of the specification. The drawings, together with the general description given above and the detailed description of the exemplary embodiments and methods given below, serve to explain the principles of the invention. The objects and advantages of the invention will become apparent from a study of the following specification when viewed in light of the accompanying drawings.

[0027] FIG. 1 is a flowchart of a method for analyzing and debugging a software program.

[0028] FIG. 2A schematically illustrates a system including a computer processor equipped with a real-time trace subsystem using existing software analyzing tools.

[0029] FIG. 2B schematically illustrates a system including a computer processor equipped with a real-time trace subsystem implementing the method of FIG. 1.

[0030] FIG. 3 schematically illustrates a behavioral identifier calculation system used by the system of FIG. 2B according to one embodiment of the invention.

[0031] FIG. 4 schematically illustrates a behavior uniqueness detection system used by the system of FIG. 2B according to one embodiment of the invention.

[0032] FIG. 5 schematically illustrates an alternative implementation of embodiments of the present invention using an external system for processing real-time trace data exported from an unmodified computer system.

[0033] FIG. 6A is a screen shot illustrating behavioral information and launcher for a replay debugger.

[0034] FIG. 6B illustrates a portion of FIG. 6A.

[0035] FIG. 7 schematically illustrates another system for performing the method of FIG. 1.

[0036] FIG. 8 provides further details regarding the functionality performed by the system of FIG. 7.

[0037] FIG. 9 illustrates functionality performed by a behavioral identifier 330 included in the system of FIG. 7.

[0038] FIG. 10 illustrates functionality performed by a comparator included in the system of FIG. 7.

[0039] FIG. 11 illustrates the system of FIG. 7 used with a multi-user storage system.

[0040] FIG. 12 schematically illustrates a system implementing the method of FIG. 1 using data compression.

[0041] FIG. 13 is a flow chart illustrating a compression operation performed by the system of FIG. 12.

[0042] FIGS. 14 and 15 illustrate functionality performed by an execution path identification creator included in the system of FIG. 12.

[0043] FIG. 16 illustrates accumulator logic performed by the system of FIG. 12.

[0044] FIG. 17 is a flow chart illustrating embodiments of the present invention implemented using existing instruction trace information.

[0045] FIG. 18 illustrates an example software function.

[0046] FIG. 19 is a flow chart illustrating a method of analyzing a target software program and inserting additional software instructions in the target software program to implement embodiments of the present invention in a software-only manner.

[0047] FIG. 20 schematically illustrates decompression of stored execution path identifier sequences.

[0048] FIG. 21 schematically illustrates post-processing to remove intervening interrupt or exception handling code from a software execution path.

[0049] FIG. 22 schematically illustrates compression of software execution instructions.

[0050] TABLE 1 illustrates computer instructions for the sample software function of FIG. 18.

[0051] TABLES 2A, 2B, and 2C illustrate execution of instructions in TABLE 1.

[0052] TABLE 3 illustrates implementation of a software-only solution using the sample software function of FIG. 18.

[0053] TABLE 4 illustrates effect of interrupts or exceptions on the processing of instruction trace compression.

DETAILED DESCRIPTION

[0054] Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways.

[0055] Reference will now be made in detail to exemplary embodiments and methods of the invention as illustrated in the accompanying drawings, in which like reference characters designate like or corresponding parts throughout the drawings. It should be noted, however, that the invention in its broader aspects is not limited to the specific details, representative devices and methods, and illustrative examples shown and described in connection with the exemplary embodiments and methods.

[0056] This description of exemplary embodiments is intended to be read in connection with the accompanying drawings, which are to be considered part of the entire written description. The word "a" as used in the claims means "at least one" and the word "two" as used in the claims means "at least two."

[0057] As used herein, a behavioral identifier, a captured execution behavior, compressed behavioral data, or an execution path identifier can refer to an identifier. A functional boundary can refer to the beginning or end of an execution sequence for the function. Execution information or execution path can refer to the execution sequence for a single function or multiple functions. A function can include a defined function that has a function initialization and a return value at the completion of the function. Alternatively, the function can include a specific code segment that can be grouped or blocked together because of the sequential nature of the code instructions or the repeatability of the code instructions. For example, a function can include a specific code segment that is manually defined by a user or that is automatically identified (e.g., based on a predefined number of instructions or the location of particular types of instructions, such as breaks, returns, repeats, etc.).

[0058] As noted above, embodiments of the present invention provide methods and systems for analyzing and debugging a software program. FIG. 1A illustrates a method 100 of analyzing and debugging a software program. The method 100 includes executing a software program, by a computer (at block 110). The software program includes a function. When, during execution, the software program executes the function, an execution sequence of the function is produced (at block 120). The execution sequence represents a path followed through the function (e.g., what instructions were executed, what order the instructions were executed in, and what data was used or generated during the execution). As described in more detail below, a unique identifier is defined for the execution sequence (at block 130). The unique identifier is then saved (at block 140), where it can be accessible by a user (e.g., through a graphical user interface) (at block 150). In some embodiments, before saving the unique identifier, the method 100 can include accessing at least one data storage device storing previously-saved unique identifiers. If the data storage device already stores the unique identifier generated for the currently collected execution sequence, the unique identifier is not stored again, which prevents duplication of execution information. In other words, if the exact execution sequence for a function has already been observed and recorded, no further information is stored for the current execution sequence. This check and comparison helps reduce the amount of information collected by the system and made available to a user for analyzing and debugging software. For example, storing two occurrences of the same execution sequence through a particular function provides a developer with no more information than storing a single occurrence. Furthermore, by limiting the storing of duplicate execution sequences, a developer can use the saved data to quickly identify how many paths have been recorded through a particular function. Accordingly, if more paths were recorded than are inherent in the structure of the function (e.g., four possible paths contained in a switch statement), the developer can efficiently identify a bug. In some embodiments, counts can be maintained to track how often particular unique execution sequences are observed, if a developer needs this information to track software performance.
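
The duplicate check and optional occurrence counts described for the method 100 can be sketched in a few lines of Python. This is a minimal illustration only, not the disclosed implementation; the class name BehaviorStore, its method names, and the use of a (function name, identifier) pair as the key are assumptions introduced here.

```python
from collections import Counter


class BehaviorStore:
    """Hypothetical store: one record per unique identifier, plus a count."""

    def __init__(self):
        # Maps (function name, identifier) -> number of times observed.
        self.counts = Counter()

    def record(self, func_name, identifier):
        """Return True only the first time this path through func_name is seen."""
        key = (func_name, identifier)
        is_new = key not in self.counts
        self.counts[key] += 1
        return is_new

    def paths_for(self, func_name):
        """Number of distinct execution paths recorded for one function."""
        return sum(1 for name, _ in self.counts if name == func_name)
```

With such a store, a developer who expects four paths through a switch statement could compare that expectation against paths_for() for the function and investigate any surplus.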

[0059] As such, embodiments of the present invention provide methods and systems for identifying behavioral uniqueness of software execution sequences. In particular, execution information is continuously analyzed to determine if a behavioral iteration of the computer program is unique or merely a repeat of previously-observed behavior. When a unique behavior is detected, the data of interest is captured, stored, and indexed by a behavioral identifier. The input data used to create a behavioral identification can include but is not limited to: execution trace data, program variables, execution timing, and related signals, conditions, and events. These data values are progressively combined into a behavioral identifier as the program executes and exported on software functional boundaries to be evaluated for uniqueness. Using the example software function described above in the Summary section, embodiments of the present invention uniquely identify the four case statements (i.e. cases 0-3) and the three additional behaviors (e.g., default condition, divide-by-zero condition, overflow condition) discussed above (i.e., if actually executed). A software developer could then review the collected behaviors at their leisure to determine if each behavior is correct or incorrect.

[0060] Using the behavioral capture method as described above provides benefits over conditional capture methods. For example, software developers no longer have to set conditional breakpoints or triggers in an iterative attempt to capture evidence of just one incorrect software behavior after another. Rather, every behavior is automatically captured the first time it occurs. This nearly eliminates the need to find and fix software bugs in an iterative approach, which commonly is one of the most expensive components of software development. In addition, since every behavior is uniquely identified and captured, including incorrect behaviors with otherwise subtle symptoms or low recurrence rates, defects can be corrected as soon as they happen at least one time. The result is improved software quality, with very low residual defect rates achievable without undue expense. Furthermore, the identification and capture can be performed on the entirety of executing software, not just those functions of interest to an individual developer. This enables a software developer to quickly gain an intimate knowledge of unfamiliar code, a process that is very difficult using existing methods.

[0061] Additional details regarding the method 100 illustrated in FIG. 1A are provided below. For example, FIG. 2A illustrates a computer processor equipped with a real-time trace ("RTT") subsystem and existing software debugging tools (i.e., conditional debugging tools). During software program execution, processor core logic 160 produces signals 162 indicative of the current software instructions being executed within the processor core logic 160 and on-chip peripheral systems. These signals 162 are interpreted by program trace logic 164 to produce a reduced-size encoding of the executed instructions and/or memory accesses made during software execution. Event detection logic 166 monitors the execution signals 162 to detect whether any user-defined events (conditions or triggers) have happened and creates applicable enable/disable/event signals to control capture of trace data 168 into the in-system buffer or to an off-system export portal 170. Accordingly, only trace data relating to user-defined events are captured in the buffer.

[0062] In contrast, FIG. 2B schematically illustrates the processor system of FIG. 2A implementing the method 100 according to one embodiment of the invention. As illustrated in FIG. 2B, the system includes a behavior ID generation system 180 that processes and converts the execution signals 162 into a series of behavioral identifiers 182. These behavioral identifiers are passed to a behavior ID data set 184 and to the in-system buffer or export 170 for possible inclusion into the trace data sequence. The behavior ID data set 184 evaluates each behavior ID it receives to determine if this value already exists in the data set, or if it is a new value that has not yet been observed. If the behavior ID is new, indicating that a not-yet-observed instruction execution sequence has taken place, a "New BehavID" signal 188 is asserted, which indicates that the related RTT data should be captured or exported for analysis by trace buffer or export system 170.

[0063] FIG. 3 schematically illustrates functionality performed by the behavior ID generation system 180 in one embodiment of the invention. As illustrated in FIG. 3, the system 180 modifies the opcode 190 or actual instruction word currently being executed to remove any memory address encoding. This provides position independence to the resulting value, yielding a consistent result regardless of the physical address location encoded within the instruction. The position independent opcode is then passed to a hash function 192 to amplify the effects of relatively small differences between different opcodes. The result is then passed through an exclusive-OR block 194. The exclusive-OR block 194 has a complementary effect on the hashed opcode value. For example, if the related instruction was conditionally not executed, the hashed opcode is bitwise inverted. Otherwise the hashed opcode is passed through unmodified. Similar to the hash function 192, the exclusive-OR block 194 has the effect of amplifying changes to a hashed opcode value to reflect different program behavior relating to instructions not conditionally executed. The resulting hashed and conditionally inverted opcode is passed to an accumulator 196. The accumulator 196 sums the received result with a series of opcode hashes executed along the current path of execution (i.e., within a function). Control logic 198 maintains sequences of actions in behavior ID generation, such as temporarily storing the in-process behavior ID of a function onto a call stack if that function calls another function before completion and restoring the in-process behavior ID to the accumulator when any called function returns execution flow back to the original function. If used in a multi-process system, control logic 198 also manages multiple call stacks on a per-process basis. 
At the completion point of a function (i.e., a functional boundary), control logic 198 exports the resulting behavior ID from the accumulator 196 and an accompanying IDVALID indicator signal. The configuration illustrated in FIG. 3 provides position independence, consistency between different target program builds, and high sensitivity to program changes as small as a single bit. It should be understood, however, that other forms and combinations of execution data could be used to create behavioral identifiers without deviating from the scope of the present invention. For example, additional details regarding behavioral identifiers are provided below with respect to FIGS. 12-15.
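
The pipeline of FIG. 3 (address removal, hashing, conditional inversion, accumulation, and call-stack handling) can be approximated in software as follows. This is a sketch under stated assumptions rather than the circuit itself: the 32-bit width, the address mask 0x0000FFFF, the use of truncated MD5 as the hash function 192, and all class and method names are hypothetical choices made for illustration.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # assumed identifier width of 32 bits


def hash_opcode(canonical_opcode):
    """Stand-in for hash function 192: first 4 bytes of an MD5 digest."""
    digest = hashlib.md5(canonical_opcode.to_bytes(4, "little")).digest()
    return int.from_bytes(digest[:4], "little")


class BehaviorIdGenerator:
    def __init__(self, address_mask=0x0000FFFF):
        # Hypothetical encoding: the low 16 bits of each opcode hold an address.
        self.address_mask = address_mask
        self.acc = 0           # accumulator 196
        self.call_stack = []   # in-process IDs of suspended callers

    def step(self, opcode, executed=True):
        """Fold one instruction into the in-process behavior ID."""
        canonical = opcode & ~self.address_mask & MASK32  # position independence
        hashed = hash_opcode(canonical)
        if not executed:
            hashed ^= MASK32   # exclusive-OR block 194: bitwise inversion
        self.acc = (self.acc + hashed) & MASK32

    def call(self):
        """A function call: park the caller's in-process ID and start fresh."""
        self.call_stack.append(self.acc)
        self.acc = 0

    def ret(self):
        """A functional boundary: export the ID and restore the caller's ID."""
        behavior_id = self.acc
        self.acc = self.call_stack.pop() if self.call_stack else 0
        return behavior_id
```

In this sketch, two runs of the same function loaded at different addresses yield the same value from ret(), while a run in which one conditional instruction was skipped yields a different value.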

[0064] FIG. 4 schematically illustrates functionality performed by the behavior ID data set 184 according to one embodiment of the invention. As illustrated in FIG. 4, the resulting behavior ID and IDVALID signal from the control logic 198 (see FIG. 3) are presented to content addressable memory ("CAM") read/write interface block 200, which initiates a search within CAM 202 for a matching behavior ID. For each comparison, a "MVALID" signal is asserted. Also, if a matching behavior ID is present within the CAM 202, a "MATCH" signal is also asserted. In some embodiments, user-settable signals "AddNew," "TrigOnMatch," and "TrigOnNotMatch" control the resulting actions taken if a behavior ID does or does not exist in the CAM 202. For example, if the "AddNew" signal is enabled, any behavior ID that does not exist in the CAM 202 is added to the available space, to be available for subsequent comparisons. If the "TrigOnMatch" signal is enabled, any comparison that results in a match causes a "CAPTURE" signal to be asserted. Similarly, if the "TrigOnNotMatch" signal is enabled, any comparison that does not result in a match causes a "CAPTURE" signal to be asserted. The "CAPTURE" signal causes trace data to be captured and stored in the buffer or export 170.

[0065] A CAMData bus provides read/write access to the contents of the CAM 202 to the host system and debugging tools. These configuration and access interfaces provide the user with options to pre-load the CAM 202 with known good behavior identifiers which can then be ignored by the system resulting in the capture of unknown behaviors exclusively. Similarly, a user can pre-load known bad behavior identifiers, reserving capture for only these behaviors of interest. Furthermore, these behavior identifiers can be read from the CAM 202 and stored externally for future use. The CAM block 200 can also include event counters for each behavioral identifier element to indicate the accumulated total number of times each behavior has occurred during a given interval. In some embodiments, the CAM block 200 can also be paired with a secondary cache system to pre-load the related behavior identifiers for functional sections of a computer program as they are executed in a running system, thereby expanding the coverage of the system by effectively increasing the working size of the CAM 202.
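
The match and capture behavior of the behavior ID data set, including pre-loading and per-identifier event counters, can be modeled in software as shown below. This is an illustrative sketch only: a Python set stands in for the CAM 202, and the class name BehaviorCam and its method names are assumptions, although the three flags mirror the "AddNew," "TrigOnMatch," and "TrigOnNotMatch" signals.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BehaviorCam:
    add_new: bool = True            # "AddNew"
    trig_on_match: bool = False     # "TrigOnMatch"
    trig_on_not_match: bool = True  # "TrigOnNotMatch"
    entries: set = field(default_factory=set)             # stands in for CAM 202
    event_counts: Counter = field(default_factory=Counter)

    def preload(self, behavior_ids):
        """Load known-good or known-bad identifiers before execution starts."""
        self.entries.update(behavior_ids)

    def lookup(self, behavior_id):
        """Return (match, capture) for one presented behavior ID."""
        match = behavior_id in self.entries
        capture = (match and self.trig_on_match) or (
            not match and self.trig_on_not_match
        )
        if not match and self.add_new:
            self.entries.add(behavior_id)
        if behavior_id in self.entries:
            self.event_counts[behavior_id] += 1
        return match, capture
```

With the default flags and a pre-loaded set of known-good identifiers, only unknown behaviors trigger capture. Setting add_new=False, trig_on_match=True, and trig_on_not_match=False and pre-loading known-bad identifiers reserves capture for those behaviors alone.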

[0066] FIG. 5 schematically illustrates an alternate implementation of embodiments of the invention that provides an external solution to an existing processor system equipped with a RTT. This implementation performs additional processing to decode exported RTT data and reconstruct signals that would otherwise be available to an on-chip implementation (see FIG. 2B). Otherwise, these systems work in an approximately equivalent manner. As illustrated in FIG. 5, the target microprocessor system 240 continuously exports trace data during full speed execution. The trace data is captured and correlated to the target software image to reconstruct the information that would be available to an on-chip implementation. The correlated image data can include opcodes, addresses, data values, etc. The resulting execution information is then presented to a behavior ID generation system to create a series of behavioral identifiers, which, as described above, are passed to a behavior ID set 184 to determine if they are newly observed or repeat software behaviors. If a new behavior ID is detected, then the related trace data, behavior ID, and related execution information are presented to one or more database and mass storage systems 242, along with the software image and source files to facilitate on-demand replay. This implementation acts as an external retrofit to existing processors with RTT where no additional on-chip logic is required.

[0067] As described above, after storing unique behaviors, the behaviors are made available to users for analysis. For example, FIG. 6A is a screen shot 300 illustrating a user interface that retrieves and displays behavior results obtained by running embodiments of the present invention against the software example code described above (see the Background section) (e.g., collected using the alternate implementation of FIG. 5 from an actual microprocessor system running at full execution speed). The screen 300 includes a table 302 displaying the executed functions and the observed behaviors during an execution session. No user configuration or breakpoints were required to obtain these results, and the associated execution data (in the form of RTT data) was automatically collected for each unique behavior and is available for on-demand replay using a conventional replay debugger application.

[0068] FIG. 6B illustrates a portion of the table 302. As illustrated in FIG. 6B, each of the four expected behaviors was executed many thousands of times. Also, a single instance of an unexpected behavior was executed, which upon replay would reveal a transient error case of variable "z" being outside the expected range of 0 to 3. This type of defect can be notoriously difficult to correct using conventional debugging methods as it is both transient and symptomless. However, as illustrated in FIG. 6B, using embodiments of the present invention, the defect can be identified and captured based on its first and only appearance.

[0069] FIG. 7 schematically illustrates an alternative configuration of a system for implementing the method 100. Referring to FIG. 7, the system 308 includes a computer system 310 (physical or simulated) executing one or more software programs of interest. The system 308 also includes a functional boundary detector 314, a comparator 316, and a data buffer 318. While the computer system 310 executes a software program, execution information 312 (including execution data and related information) is continuously created by the computer system 310. This execution information 312 is continuously collected and presented to both the functional boundary detector 314 and the comparator 316 through the data buffer 318. The boundary detector 314 analyzes the execution information to determine if a functional boundary within the software program, such as a function call, a call stack change, a context switch, or the like, has been crossed. In other words, the functional boundary detector 314 determines if the execution information is within a functional boundary of the software program. If a functional boundary is detected, the boundary detector 314 asserts the boundary detection signal 320, which signals the comparator 316 to evaluate the contents of the preceding execution segment against the contents of the previously-collected execution information from a previous execution data buffer 322. Accordingly, the comparator 316 determines if an execution sequence of the execution information has been previously observed or if the most-recently collected execution information represents a new, unique behavior (i.e., a new and unique path through the function). If the behavior is determined to be unique (i.e., new and not previously observed), the comparator 316 produces a unique detection signal 324, which instructs a storage system 326 to store the related data contents in the data buffer 318. 
The comparator 316 also produces a behavioral identifier that is stored in the data buffer 318 for future comparisons.

[0070] FIG. 8 provides further details regarding functionality performed by the system 308. As illustrated in FIG. 8, the computer system 310 produces the execution information 312, which may be composed of any combination of execution trace information, program variables, memory accesses, input/output operations, execution timing, and other related signals, events, or conditions. The execution information 312 is presented to the functional boundary detector 314, the data buffer 318, and the comparator 316. As illustrated in FIG. 8, the comparator 316 includes behavioral identifier creation logic 330 and a uniqueness detector 332. The behavioral identifier creation logic 330 sequentially processes the execution and related data (i.e., the execution information) using arithmetic and/or logic operations to produce a behavioral identifier 334 of the execution data sequence 312 for the period defined between the boundaries established by the boundary detection signal 320. When complete, the behavioral identifier 334 is presented to the uniqueness detector 332. The detector 332 determines if the received identifier 334 is a repeat of a previously-received behavioral identifier (i.e., relates to a previously-observed execution sequence) or represents a new behavioral identifier (a new execution sequence not previously observed). If the identifier is new, the unique detection signal 324 is asserted, which instructs the storage system 326 to save the related execution data sequence contained in the first in, first out ("FIFO") buffer 318 along with the behavioral identifier 334. The behavioral identifier 334 is also saved in the previous execution data buffer (or store) 322. As additionally illustrated in FIG. 8, related program source files and executable software images 336 can be stored in the storage system 326 to enable future replay, analysis, or visualization using the correct source and executable files for selected behaviors, even if those files receive many edits and modifications during development. Accordingly, using the system illustrated in FIGS. 7 and 8, a user can replay stored execution sequences against the then-existing code even if the code has changed since storing the execution sequence.
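
The interaction of the boundary detector, the FIFO buffer, and the uniqueness detector can be sketched as a single loop. The following is a simplified software analogy under stated assumptions: execution information is modeled as a list of strings, CRC-32 is chosen arbitrarily as the identifier function, and the names crc_id and capture_unique are hypothetical.

```python
import zlib


def crc_id(segment):
    """Illustrative identifier: CRC-32 over the records of one segment."""
    return zlib.crc32("\n".join(segment).encode())


def capture_unique(events, is_boundary, make_id=crc_id, seen=None):
    """Return (identifier, segment) pairs for behaviors not seen before.

    events      -- iterable of execution records (strings in this sketch)
    is_boundary -- predicate playing the role of boundary detector 314
    seen        -- optional set of previously recorded identifiers (store 322)
    """
    seen = set() if seen is None else seen
    stored, fifo = [], []
    for event in events:
        fifo.append(event)                 # data buffer 318
        if is_boundary(event):
            ident = make_id(fifo)          # identifier creation logic 330
            if ident not in seen:          # uniqueness detector 332
                seen.add(ident)
                stored.append((ident, list(fifo)))  # storage system 326
            fifo.clear()
    return stored
```

Passing a non-empty seen set models a store initialized with previously recorded identifiers, so that only previously unseen execution sequences are returned.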

[0071] FIG. 9 illustrates functionality performed by the behavioral identifier creation logic 330 according to one embodiment of the invention. Input data from a variety of sources that are affected by or have an effect on the software execution are candidates for input data to create the behavioral identifiers. Instruction trace is a source of the input data as the instruction trace provides the most direct indication of the software behavior. However, distinctive identifiers can be obtained from alternate combinations of sources, such as program variables and execution timing. The internal arithmetic/logic operation performed on the input data within the behavioral identifier creation logic 330 can vary depending on implementation conditions. For example, the internal functionality can include checksums or cyclic redundancy check ("CRC") totals, cumulative hashes such as MD5, or a minimally-processed linear representation of the input data. Any of these approaches may be suitable provided they produce consistent identifiers for repeated input sequences. Further details regarding generating a behavior identifier are provided below with respect to FIGS. 12-15.
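
Two of the options named above, a running CRC and a cumulative MD5 hash, can be illustrated with Python's standard library. This is a sketch for comparison only; the function names and the representation of input data as byte strings are assumptions.

```python
import hashlib
import zlib


def crc_identifier(records):
    """Running CRC-32 over a sequence of byte-string records."""
    crc = 0
    for record in records:
        crc = zlib.crc32(record, crc)  # second argument continues the CRC
    return crc


def md5_identifier(records):
    """Cumulative MD5 hash over the same sequence, as a hex string."""
    digest = hashlib.md5()
    for record in records:
        digest.update(record)
    return digest.hexdigest()
```

Both functions process the input progressively and return the same identifier whenever the same input sequence is repeated, which is the consistency property required above.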

[0072] FIG. 10 illustrates decision flow within the comparator 316, which implements a non-duplicating memory set with detection for new item addition. In some embodiments, a local behavioral identifier store can be initialized with previously-recorded values to prevent the re-recording of these execution sequences, which saves storage capacity for recording only previously-unseen execution sequences.

[0073] FIG. 11 illustrates an embodiment of the system 308 using a multi-user storage system, such as a database or distributed file system. As illustrated in FIG. 11, in this embodiment, individual computer systems 310 paired with the behavior identifier 330 and uniqueness detector 332 have their resulting behavioral identifiers and related execution information, source files, and executable software images stored in a multi-user storage system 340. This arrangement shares the collected execution information among all users. Sharing this information makes a defect or other unique behavior that happens on any connected computer system immediately available to all users. Accordingly, this embodiment enables a team synergy where all developers contribute their collected software behavior data to the common store automatically. Therefore, as they execute software on a target system (e.g., seeking to quickly expose as many defects as possible in their own code), they're also executing other parts of the target software that may contain code written by others, which potentially exposes new behaviors that had not been seen before. The result is that every developer becomes a tester of other developers' code without expending any extra effort.

[0074] As noted above, embodiments of the invention can use different techniques for generating a unique behavior identifier. For example, FIG. 12 schematically illustrates a system 408 that accepts software execution instructions and outputs compressed behavioral data according to some embodiments of the present invention. The compressed behavioral data can include behavior identifiers representing unique execution sequences of executed instructions.

[0075] Referring to FIG. 12, the system 408 includes a computer system 410 (physical or simulated) running a software program by continuously executing software information (such as software instructions or software execution data). The execution information can be in the form of conditional execution instructions. Moreover, the conditional execution instructions can be in the form of operation codes (or opcodes) and condition flags. In computer science, an opcode (operation code) is the portion of a machine language instruction that specifies the operation to be performed. The specification and format of opcodes are specified by the instruction set architecture of the processor in question (which may be a general CPU or a more specialized processing unit).

[0076] An input stream or trace of the software execution information, generally depicted with the reference numeral 411 (e.g., the software instructions, the execution status, the address, and the like), is supplied (e.g., continuously) to an execution path identification creator 412 while the computer system 410 executes a software program. The trace represents an execution path through the software program or a portion thereof. For example, an execution path can be the path through which input data (i.e., the software execution instructions) passes during the period of being processed in operation modules of the computer system 410. In each operation module of the computer system 410, there are typically various branch points so that different input data can pass through different branches at these branch points. The branches through which the input data passes form an execution path of the input data.

[0077] The execution path identification creator 412 converts the input stream or trace of the software execution information 411 from the computer system 410 into a stream of encoded data values representing a specific path taken by the software execution information executed within each path. The data values are uniquely created for every specific execution path and serve as behavior identifiers for the executing software program. The stream of encoded data values represents at least one unique execution sequence of the software execution instructions. For example, in one embodiment, the execution path identification creator 412 continuously accesses the execution instructions of the computer software, identifies execution sequences of the software execution instructions, and creates a unique execution path identifier 414 of each of the execution sequences by summing the conditional execution instructions when the conditional execution instructions are within a functional boundary. Therefore, the execution path identification creator 412 creates a unique execution path identifier 414 representing a compressed unique execution sequence of the execution instructions. The resulting execution path identifier 414 is then available for writing to one or more storage devices 420.

[0078] As further illustrated in FIG. 12, the system 408 also includes comparison logic 428 and a local storage medium 430 collecting the stream of the execution path identifiers for later retrieval and decompression. The execution path identifier 414 is supplied to the comparison logic 428 as a means of detecting when a previously-unseen (i.e., not previously observed) execution sequence has occurred. In other words, the comparison logic 428 determines whether each execution sequence is a new execution sequence or a repeat execution sequence by comparing the execution path identifier 414 determined by the execution path identification creator 412 with the execution path identifiers previously stored in the local storage medium 430.

[0079] As illustrated in FIG. 13, a compression operation performed by the execution path identification creator 412 for encoding an input data stream starts when the execution path identification creator 412 receives the next software execution information 411 (at block 502). The execution path identification creator 412 decodes the software execution information 411, and address-dependent portions are removed by a decode logic device 416 of the execution path identification creator 412 (at block 504). The result is converted to a hash value (at block 506) before being summed within an accumulator 418 (at block 508). The accumulator 418 can be a register in a central processing unit ("CPU"), in which intermediate results are stored.

[0080] The above-described process continues until the functional boundary is reached in the program image (at block 510). At this point, a resulting sum 416 in the accumulator 418 is exported as a unique, repeatable representation of the behavior of that segment of the software program (at block 512). The accumulator 418 is then reset to a base value to begin accumulation of the path identification of the software execution information of the next segment of software program (e.g., the next function executed by the computer 410). The resulting sum represents an execution path identifier 414.

[0081] For example, FIGS. 14 and 15 illustrate further details of generating an execution path identifier 414 implemented in computer logic. As illustrated in FIG. 14, a next-in-stream instruction opcode 411 is received by the decode logic device 416, which detects functional boundaries and removes the address portions of the opcode 411. Removing the address portion creates an address-independent canonical form of the instruction. The address-independent form provides that the execution type will produce the same results, even if that software program is executed from different address locations in a computer memory. The canonical opcode is then presented to a hash function, such as MD5 or murmurhash3, that converts the sometimes subtle differences in software execution information encodings into different resulting values. Accordingly, the hash helps ensure uniqueness in the resulting execution path identifier 414 value. The hashed canonical opcode is then combined with any preceding values in the accumulator 418. In some embodiments, the opcode can be inverted to further distinguish conditionally executed instructions. For example, in some embodiments, a hashed canonical opcode can be provided as a non-inverted value if the associated instruction was conditionally executed or as an inverted value if the associated instruction was conditionally not executed. This inversion helps to distinguish the execution paths that include conditional non-branch instructions, which may or may not execute depending on conditions. When the functional boundary is detected by decode logic 416, the resulting execution path identifier 414 in the accumulator 418 is exported to a results register 422, where the presented identifier is available to additional logic, storage, or export.

[0082] FIG. 16 illustrates accumulator logic 424, which can optionally include additional resources to collect and export the associated address in memory of an execution path sequence and the number of instructions contained within that sequence of execution. Including these optional values can assist decompression to quickly determine an exact match for the path of execution and can reduce the possibility of unresolvable collisions in execution path identification values. As illustrated in FIG. 16, the accumulator logic 424 can include the accumulator 418, a counter 426, and the register 422.
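
The compression loop of FIG. 13 combined with the optional address and instruction count of FIG. 16 can be sketched as a generator. This is an illustration under stated assumptions: the 32-bit width, the address field mask 0x00000FFF, the truncated MD5 hash, the choice of the segment's first address as the exported address, and the names PathRecord and path_records are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
ADDRESS_FIELD = 0x00000FFF  # hypothetical address bits within an opcode


@dataclass(frozen=True)
class PathRecord:
    identifier: int         # execution path identifier 414
    start_address: int      # assumed: address of the segment's first instruction
    instruction_count: int  # value of counter 426


def opcode_hash(opcode):
    """Strip the assumed address field (block 504), then hash (block 506)."""
    canonical = opcode & ~ADDRESS_FIELD & MASK32
    digest = hashlib.md5(canonical.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")


def path_records(trace, base=0):
    """trace yields (address, opcode, at_boundary); yields one record per segment."""
    acc, count, start = base, 0, None
    for address, opcode, at_boundary in trace:        # block 502
        if start is None:
            start = address
        acc = (acc + opcode_hash(opcode)) & MASK32    # block 508
        count += 1
        if at_boundary:                               # block 510
            yield PathRecord(acc, start, count)       # block 512
            acc, count, start = base, 0, None         # reset to base value
```

Two executions of the same code from different load addresses produce records with equal identifier and instruction_count fields but different start_address fields, which a decompressor could use to narrow the candidate paths.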

[0083] FIG. 17 is a flow chart illustrating embodiments of the present invention implemented using existing instruction trace information, such as that produced by real-time trace interfaces including ARM ETM, MIPS PDTrace, IEEE/ISTO Nexus-5001, or instrumented trace information available from processor simulators or emulators. In these embodiments, the software execution information may be encoded and may contain gaps in the information that require reconstruction.

[0084] As illustrated in FIG. 17, one or more program reference tables are initially loaded. Then the next software execution information 411 is received from the computer system 410 (at block 518). A determination is made whether the execution information has a relative or absolute address value (at block 520). If the execution information 411 is determined to have an absolute address value, a determination is made whether the address value matches the expected execution address (at block 522). If the answer is "no," gap reconstruction is conducted (at block 524). Next, the address is looked up in a reference table (at block 526). Similarly, if the address value matches the expected execution address (see block 522), the address is looked up in a reference table (at block 526).

[0085] Alternatively, if the execution information 411 has a relative value (at block 520), a determination is made whether the current execution address is known to the system 408 (at block 528). If so, then the relative address information is summed with the current known address and the address is looked up in a reference table (at block 526). If not, the next software execution information 411 is obtained (at block 518).

[0086] After the address is looked up in the reference table, an opcode hash is added to and summarized in the accumulator 418 (at block 530). Then, a determination is made whether the functional boundary is reached (at block 532). If the boundary has not been reached, the next software execution information 411 is obtained (at block 518). If the boundary has been reached, however, the resulting sum in the accumulator 418 is exported as a unique, repeatable representation of the behavior of that segment of the software program (at block 534). The resulting sum represents the execution path identifier 414. The accumulator 418 is then reset to a base value to begin accumulation of the path identification of the software execution information of the next segment of the software program.
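
The flow of blocks 518 through 534 can be approximated as a loop over trace records. The sketch below is illustrative only: the record format (`"abs"` or `"rel"` plus a value), the fixed 4-byte instruction size, and the reference table layout (address mapped to a pre-computed hash and a boundary flag) are assumptions. Gap reconstruction is not implemented; the sketch only counts where it would be needed.

```python
MASK32 = 0xFFFFFFFF


def accumulate_from_trace(records, table, insn_size=4):
    """Return (path identifiers, number of gaps detected).

    records: iterable of ("abs", address) or ("rel", offset) tuples.
    table:   dict mapping address -> (opcode_hash, is_boundary).
    """
    identifiers = []
    gaps = 0
    acc = 0
    current = None  # last known execution address

    for kind, value in records:
        if kind == "abs":
            # Block 522: does the address match the expected execution address?
            if current is not None and value != current + insn_size:
                gaps += 1  # block 524 (gap reconstruction) would run here
            address = value
        else:
            # Block 528: a relative value is usable only if the current address is known.
            if current is None:
                continue
            address = current + value
        current = address

        opcode_hash, is_boundary = table[address]  # block 526
        acc = (acc + opcode_hash) & MASK32         # block 530
        if is_boundary:                            # block 532
            identifiers.append(acc)                # block 534
            acc = 0
    return identifiers, gaps
```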

[0087] In some embodiments, the decoding and gap reconstruction are performed by the above-described flow steps, and their results are used with a reference table to look up the current instruction opcode and the current instruction opcode's pre-computed canonical hash, as well as the pre-computed functional boundaries and locations of conditional instructions. These are then presented to the accumulator 418 as described above.

[0088] In some embodiments, the system 408 continuously collects and categorizes execution information, thus imposing no limits on the software developer's visibility into the executing software program.

[0089] FIG. 18 shows an example software function. The function tests the value of an argument and returns one of two possible values as a result of that test. TABLE 1 shows that same function along with a possible implementation using ARM architecture instructions for each source line. The opcodes for each instruction can have a high degree of similarity. TABLE 2A shows an exemplary execution sequence of the sample function in FIG. 18, with a value of a variable "a" set to 0 before the function is called. As illustrated in TABLE 1, the condition for the execution of the instruction at <example-0xc> has not been met, so the branch is not taken and the instruction is treated as a no-operation instruction by the processor. Therefore, the accumulator 418 is presented with the inverse of the canonical hash of the opcode to reflect this conditional execution in the present path. The accumulator results are shown at each step, with a final value of "ad8f9a33" as the unique execution path identification, with optional inclusion of the starting address and count of instructions. TABLE 2B shows an exemplary execution sequence of the example function of FIG. 18, with the value of the variable "a" set to 1 before the function is called. Note that the change in execution path results in a drastic change in the unique value of the resulting path identification value.

[0090] TABLE 2C illustrates how even small changes in the executing software program result in changes to the resulting execution path identification value for the affected path(s) but may leave other execution paths in the same software program unaffected. In this modified example, the value returned for the values of the variable "a" less than 1 has changed from 25 to 24, which represents a small change to the software program. However, the resulting execution path identification value changes from "d4b696cd" to "7146c1b4." This change only affects the path taken when the value of the variable "a" is greater than "0." The execution path identifier produced when the variable "a" is less than "0" remains the same as before.

[0091] TABLE 3 illustrates insertion points for a software-only embodiment of the present invention. Using the same sample code from FIG. 18, additional software can be inserted into the resulting executable software information at the indicated locations. The inserted software uses pre-computed values that represent accumulated sums of canonical opcode hash values. Accordingly, these values can be added to an accumulator value held in a designated location in the computer system. Additional instructions can be inserted at function boundaries to initialize the accumulator value and export the results to the appropriate destination.

[0092] FIG. 19 is a flow chart illustrating a method of analyzing a target software program and inserting additional software instructions in the target software program to implement the present invention in a software-only manner. The method includes designating resources to hold the in-process execution path identifier and resources for the export or storage of the resulting identifiers (at block 610), analyzing the target software executable to identify functional boundaries and conditional instructions (at block 620), and analyzing the instructions within the segments between and including the conditional instructions to create a sum of the opcodes with optional removal of address information and hashing (at block 620). The method also includes inserting into the target executable the instructions for implementing the functionality of the present invention (at blocks 630, 640, and 650) and adjusting the program address references to compensate for the additional inserted instructions (at block 660). When the resulting target software is executed, a series of execution path identifiers will be produced per the present invention.
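
The effect of the inserted instructions can be illustrated at source level. In the sketch below, the function `example`, its return values, the segment constants, and the helper names are all hypothetical; a real embodiment would insert machine instructions into the executable and use sums actually pre-computed from canonical opcode hashes.

```python
MASK32 = 0xFFFFFFFF

# Placeholder constants standing in for pre-computed sums of canonical
# opcode hashes, one per straight-line segment.
SEG_ENTRY, SEG_TAKEN, SEG_NOT_TAKEN = 0x1111, 0x2222, 0x3333

exported = []   # designated destination for finished identifiers
_acc = 0        # designated location for the in-process identifier


def _enter():
    """Inserted at function entry: initialize the accumulator."""
    global _acc
    _acc = 0


def _add(segment_sum):
    """Inserted at each segment: add the pre-computed segment sum."""
    global _acc
    _acc = (_acc + segment_sum) & MASK32


def _exit():
    """Inserted at function exit: export the finished identifier."""
    exported.append(_acc)


def example(a):
    _enter()
    _add(SEG_ENTRY)
    if a > 0:
        _add(SEG_TAKEN)
        result = 1
    else:
        _add(SEG_NOT_TAKEN)
        result = 0
    _exit()
    return result
```

Calling `example` with arguments that follow the same branch appends the same identifier to `exported`; arguments that follow the other branch append a different one.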

[0093] FIG. 20 illustrates decompression of compressed execution trace data (the execution sequence of the execution instructions) back to a reconstructed representation of an original form using the execution path identifiers. A sequential record of the created execution path identifiers and optional starting addresses and instruction counts is passed in chronological order to simulator or data table logic 700, which reconstructs the equivalent execution path necessary to create the presented execution path identification value. The logic 700 can in some configurations include a simulator of the target computer system, iteratively searching for the identical path needed to produce a matching execution path identification value. This simulation operation can be accelerated by including either or both of the starting address and number of instructions of the presented execution path identifier. The simulation can also be accelerated by using a data table of pre-computed execution path identifiers and their associated execution paths. In another instance, contents of the logic 700 may be a larger data table containing the execution path identifiers and a complete pre-recorded or pre-computed execution trace record of the associated path. The resulting stream of execution trace information that matches the data represented by the presented execution path identifier is then produced, recreating the trace of execution in the target computer system.
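
The data-table variant of decompression can be sketched as a lookup. The table layout below (identifier mapped to a list of candidate paths, each with a starting address and an instruction list) and the function names are assumptions for illustration; the simulator-based search is not shown.

```python
def lookup(table, ident, start=None, count=None):
    """Return the instruction list for `ident`.

    The optional starting address and instruction count narrow the match
    when more than one path shares the identifier.
    """
    for path_start, instructions in table[ident]:
        if start is not None and path_start != start:
            continue
        if count is not None and len(instructions) != count:
            continue
        return instructions
    raise KeyError(f"no path matches identifier {ident:#x}")


def decompress(records, table):
    """Rebuild a trace from chronological (identifier, start, count) records."""
    trace = []
    for ident, start, count in records:
        trace.extend(lookup(table, ident, start, count))
    return trace
```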

[0094] TABLE 4 and FIG. 21 illustrate handling of interrupts or exceptions. Normal execution proceeds at the top of TABLE 4 until an interrupt or other exception event alters the flow of execution. At this point the in-process path identifier is exported, along with the starting address and count of instructions within this partial path execution, and an indicator that the execution has been interrupted. Path identification of the interrupt/exception-handling code then proceeds as normal, resulting in the export of one or more path identifiers. If execution resumes within the interrupted function, execution is treated as a start of a new function and the path identifier accumulator and instruction counters are reset and the resume address is recorded as a new starting address for that path segment. FIG. 21 illustrates the processing that could assemble the interrupted execution path segments back into a whole for direct comparison with uninterrupted peers, effectively removing the interrupt/exception processing from the execution path analysis. In some embodiments, no resolution is lost. For example, the moment an interrupt or exception occurs is preserved, and the interrupted function can be treated as though the interrupted function was not interrupted.
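
The reassembly of interrupted segments can be sketched as follows. This sketch assumes an additive accumulator, so that the identifiers of the partial segments sum to the identifier of the uninterrupted path, and it assumes each exported record carries hypothetical `interrupted` and `resumed` flags. Neither assumption is confirmed as the processing of FIG. 21.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class Segment:
    ident: int
    start: int
    count: int
    interrupted: bool = False  # exported because an interrupt/exception occurred
    resumed: bool = False      # begins where an interrupted function resumed


def reassemble(segments):
    """Split a record stream into (whole function paths, handler paths)."""
    whole, handlers, pending = [], [], []
    for seg in segments:
        if seg.resumed and pending:
            head = pending.pop()
            # Join the partial paths: sum identifiers and counts, keep first start.
            seg = Segment((head.ident + seg.ident) & MASK32, head.start,
                          head.count + seg.count, seg.interrupted, False)
        if seg.interrupted:
            pending.append(seg)       # still waiting for its resumed remainder
        elif pending:
            handlers.append(seg)      # ran while another function was suspended
        else:
            whole.append(seg)
    return whole, handlers
```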

[0095] It should be understood that embodiments of the present invention are amenable to additional compression logic, which can increase the compression of the execution trace data. For example, FIG. 22 illustrates one embodiment of the present invention using execution combine-and-store logic. In this example, a call graph depicts a series of software function calls, with each functional unit segment returning a unique path identifier. The additional logic sums the sequences of path identifiers and saves the resulting identifiers in a table, allowing subsequent calls resulting in the same series of path identifiers to use the combined identifier to replace the series. This results in ever-larger sequences of execution being represented by individual identifiers, which vastly increases compression of execution information. Furthermore, these path identifier sequences are combined using a simple sum operation, which increases compression with minimal additional resources. This compares favorably to general-purpose data compressors that require additional resources and can obfuscate the results and prevent direct use as a path identifier.
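
A minimal sketch of combine-and-store logic, assuming a 32-bit sum and a simple in-memory table (the class and method names are hypothetical):

```python
MASK32 = 0xFFFFFFFF


class CombineStore:
    def __init__(self):
        self.table = {}  # combined identifier -> series of path identifiers

    def combine(self, path_ids):
        """Sum a series of path identifiers into one combined identifier.

        Returns (combined identifier, True if it was already in the table).
        """
        combined = sum(path_ids) & MASK32
        seen = combined in self.table
        if not seen:
            self.table[combined] = tuple(path_ids)
        return combined, seen
```

Note that a plain sum does not depend on the order of the series, so a practical design would need to decide whether order matters and how to handle collisions.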

[0096] Therefore, the present invention provides a novel method and system of compressing software instruction execution trace sequences while simultaneously creating a unique identification for the sequence that is a direct representation of the software's behavior. The method and system of the present invention access information about the executed instructions in a computer system and convert that information into a uniquely representative identification of the specific conditions and execution path taken by a stream of execution.

[0097] In particular, embodiments of the present invention access execution trace data of a computer system. This trace data is analyzed to determine program functional boundaries. A behavioral identifier variable is initialized to a base value at the start of a program functional boundary. During execution within a program functional boundary, the execution trace data and other related data of interest is progressively combined with the behavioral identifier variable using arithmetic and/or logical operations until the end of the program functional boundary, at which point the behavioral identifier variable is exported to a behavior uniqueness detector. The behavior uniqueness detector maintains a store of behavioral identifiers to be compared with the newly presented behavioral identifiers as a test of uniqueness. If the presented identifier does not exist in the store, the presented identifier is added to the store and a signal is asserted that the behavior is unique, and the associated execution data around and including the unique behavior should be captured and stored in a storage system, such as a database, file system, or similar.
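
The behavior uniqueness detector can be sketched as a set-backed store. The names below are hypothetical, and the in-memory dictionary stands in for the database or file system mentioned above.

```python
class UniquenessDetector:
    def __init__(self):
        self.store = set()    # behavioral identifiers seen so far
        self.captured = {}    # identifier -> execution data captured on first sight

    def present(self, ident, execution_data=None):
        """Return True (behavior is unique) the first time `ident` is presented."""
        if ident in self.store:
            return False
        self.store.add(ident)
        # Unique behavior: keep the associated execution data.
        self.captured[ident] = execution_data
        return True
```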

[0098] Further according to the present invention, pre-collected execution data is analyzed to create unique behavioral identifiers corresponding to functional boundaries within the target software program. These identifiers can then be used to index the pre-collected data, to eliminate duplicate behavior sequences from the pre-collected execution data, or in the creation of a common index for multiple buffers of pre-collected execution data.

[0099] Moreover, the sequence of the behavioral identifiers may be stored in the storage system sequentially as they appear. This enables a continuous reconstruction of the entirety of observed software execution to be created from the data in the storage system.

[0100] Also according to some embodiments of the present invention, the relevant executable software image and associated source files can be saved in the storage system, thus facilitating the anytime retrieval, reconstruction, and replay of the entirety of captured execution behaviors. Storing this data enables the on-demand replay, analysis, and visualization of not only all behaviors of all executed software functions, but also of every revision of every executed software function, using the correct source files and program image for reconstruction and presentation in a replay debugger or analyzer. This stored data also results in the creation of a self-assembling knowledge base of the entirety of behaviors exhibited by the target software, spanning all changes incurred during development and maintenance. Existing tools and methods routinely discard this valuable execution data, and generally provide no facility for correlated storage of the associated source and executable files.

[0101] Despite the ever-growing size and complexity of software programs, an insight into reducing and simultaneously organizing the abundant execution data of a software program is that the software program is executed strictly within rigidly defined segments of instructions that are interconnected by branching junctions that have a finite number of connections. Furthermore, the execution path that is actually taken by a running software program is most often a very small subset of all possible paths.

[0102] With this insight, a means of compressing the execution information based on its behavior has been described in the present application. By replacing extended sequences of execution with a uniquely representative and consistently repeatable execution path identifier for every uniquely executed path in the software program, unexpected benefits are produced. For example, the execution path identifiers themselves are representative of distinct behaviors of the executed software functions, automatically classifying the execution trace data by its behavior. This simplifies software debugging, because every behavior of the software, correct or incorrect, is individually identified during compression, regardless of the behavior's transience or commonality. Reviewing the complete range of behaviors of the target program or any subset of interest can be done by decompressing the results at the appearance of each unique identifier type for the functions of interest. Also, the compression ratio can be an improvement over existing systems and can replace the trace data of thousands of instructions with a single representative value. In addition, because of the rigid-track nature of computer software execution, when observed over extended periods of time, a software program will spend the vast majority of time executing within a small subset of all possible paths and executing functions in frequently repeated sequences. This pattern of execution can be exploited to achieve extremely high compression ratios, by replacing extended sequences of already-observed functional unit executions with a single representative value.

[0103] Embodiments of the present invention therefore offer advantages by achieving higher compression ratios than existing systems, easing the burden of implementing into working computer systems, and providing compressor output that is a direct representation of the functional behavior of the target software. Embodiments of the present invention can also be used as an identifier for defect isolation and execution profiling, to assist software developers in rapidly learning intimate details about unfamiliar software code, and more.

[0104] The present invention is suitable for a plurality of embodiments, including implementation in computer logic (thereby reducing the required capacity for trace export and storage), implementation with existing real-time trace processors, and a software-only implementation for use with computer systems that may have no real-time trace export capabilities. Classifying the trace data by the behavior of the software being traced while compressing the trace data can overcome many of the difficulties found in existing systems and methods. Embodiments of the present invention can achieve higher compression ratios than the previous techniques discussed above, while producing a result that is simpler to use for the tasks of software debugging, software testing and analysis, and gaining a deeper understanding of how the software actually behaves during full-speed execution.

[0105] Also according to some embodiments of the present invention, methods and systems are provided for inserting pre-computed software instructions into specific points of a software application to create unique execution path identifiers using a software-only approach. One method can include analyzing the target software to determine the appropriate canonical hash values and appropriate insertion points in the application, inserting these additional instructions into the application at the appropriate conditional instructions and branch points, accumulating and storing the unique execution path identifiers at runtime to a designated memory buffer or output port, and retrieving the resulting execution path identifiers at runtime for immediate use or storage.

[0106] Through the methods and systems according to embodiments of the present invention, execution behavior identifiers can be created and collected from an operating computer system using minimal system resources. The identifiers can also be compared to a computed set of identifiers representing a full reconstruction of the execution path taken by the application. This results in abundant information that is pre-classified by behavioral type, making it easier to differentiate which identifier represents software that is running in normal, expected ways, and which represents software that is running in new, potentially anomalous, and unexpected ways. This is particularly useful for software debugging, where countless hours are spent using existing techniques attempting the capture of transient events that are not yet fully understood. Embodiments of the present invention are also useful to quickly gain a deep understanding of unfamiliar software, because every behavior the software exhibits can be immediately identified as the behavior occurs. These benefits can be amplified when embodiments of the present invention are paired with additional system data capture, such as correlated capture of program variables, execution timing information, or external system signals at runtime.

[0107] Some of the functional units described in this specification have been labeled as modules, in order to more particularly emphasize their implementation independence. For example, a module may be implemented as a hardware circuit comprising custom very-large-scale integration (VLSI) circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.

[0108] Modules may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more blocks of computer instructions, which may be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which comprise the module and achieve the stated purpose for the module when joined logically together.

[0109] Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices. The modules may be passive or active, including agents operable to perform desired functions.

[0110] The technology described here can also be stored on a non-transitory computer readable storage medium (e.g., data storage medium) that includes volatile and non-volatile, removable and non-removable media implemented with any technology for the storage of information such as computer readable instructions, data structures, program modules, or other data. Computer readable storage media include, but are not limited to, random-access memory ("RAM"), read-only memory ("ROM"), erasable programmable ROM ("EPROM"), electrically erasable programmable ROM ("EEPROM"), flash memory or other memory technology, compact disc read-only memory ("CD-ROM"), digital versatile disks ("DVD") or other optical storage, flash drive, solid state drive, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other computer storage medium which can be used to store the desired information and described technology.

[0111] The devices described herein may also contain communication connections or networking apparatus and networking connections that allow the devices to communicate with other devices. Communication connections are an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules and other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. A "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency, infrared, and other wireless media. The term computer readable media as used herein includes communication media.

[0112] The foregoing description of embodiments of the present invention has been presented for the purpose of illustration. The description is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obvious modifications or variations are possible in light of the above teachings. The embodiments disclosed herein were chosen in order to best illustrate the principles of the present invention and its practical application to thereby enable those of ordinary skill in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated, as long as the principles described herein are followed. Thus, changes can be made in the above-described invention without departing from the intent and scope thereof. For example, the various configurations of the system and methods described and illustrated in the present application can be combined and distributed in various ways. It is also intended that the scope of the present invention be defined by the claims appended thereto.

* * * * *