U.S. patent application number 14/304050 was filed with the patent office on 2014-06-13 and published on 2014-11-13 for storage of software execution data by behavioral identification.
The applicant listed for this patent is ZeroDee, Inc. Invention is credited to Neil Craig Puthuff.
Application Number | 20140337822 14/304050 |
Family ID | 46878420 |
Filed Date | 2014-06-13 |
United States Patent Application | 20140337822 |
Kind Code | A1 |
Puthuff; Neil Craig | November 13, 2014 |
STORAGE OF SOFTWARE EXECUTION DATA BY BEHAVIORAL IDENTIFICATION
Abstract
Methods and systems for analyzing software. For example, one
method can include executing a software program including a
function by a computer. The method also includes producing an
execution sequence for the function when, during execution, the
software program executes the function. The method further includes
generating an identifier for the execution sequence, wherein the
identifier uniquely identifies a path of execution through the
function represented by the execution sequence. In addition, the
method includes saving the identifier and making the identifier
available to at least one user through a user interface.
Inventors: | Puthuff; Neil Craig (McLean, VA) |
Applicant: |
Name | City | State | Country | Type |
ZeroDee, Inc. | Alexandria | VA | US | |
Family ID: | 46878420 |
Appl. No.: | 14/304050 |
Filed: | June 13, 2014 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
13428572 | Mar 23, 2012 | 8776029 |
14304050 | | |
13428597 | Mar 23, 2012 | |
13428572 | | |
61466818 | Mar 23, 2011 | |
61466828 | Mar 23, 2011 | |
Current U.S. Class: | 717/125; 717/131 |
Current CPC Class: | G06F 11/28 20130101; G06F 11/3636 20130101; G06F 11/3612 20130101 |
Class at Publication: | 717/125; 717/131 |
International Class: | G06F 11/36 20060101 G06F011/36 |
Claims
1. A method for processing software, the method comprising:
executing a software program, by a computer, the software program
comprising a function; when, during execution, the software program
executes the function, producing an execution sequence of the
function; generating an identifier for the execution sequence,
wherein the identifier uniquely identifies a path of execution
through the function represented by the execution sequence; saving
the identifier; and making the identifier available to at least one
user through a user interface.
2. The method of claim 1, further comprising: accessing at least
one data storage medium storing previously-generated identifiers
associated with functions of the software program; and comparing
the identifier to the previously-generated identifiers to determine
whether the identifier is already stored in the at least one data
storage medium.
3. The method of claim 2, wherein saving the identifier includes
saving the identifier when the identifier is not already stored in
the at least one data storage medium.
4. The method of claim 2, further comprising incrementing a count
value associated with the identifier when the identifier is
previously stored in the at least one data storage medium.
5. The method of claim 2, wherein the function includes a defined
function or set of instructions.
6. The method of claim 1, wherein the identifier for the execution
sequence includes a sum of operational code hash values or
conditional execution instruction hash values for the execution
sequence.
7. The method of claim 1, further comprising: executing a second
function in the software program when encountering a function call,
a call stack, a context switch, a switch statement, a branch point,
or a conditional execution instruction; producing a second
execution sequence of the second function; generating a second
identifier for the execution sequence, wherein the second
identifier uniquely identifies a path of execution through the
second function represented by the second execution sequence; and
saving the second identifier when the identifier is not already
stored in the at least one data storage medium.
8. The method of claim 1, further comprising: generating a hash
table of identifiers associated with functions of the software
program, wherein each identifier includes a hash value; counting a
number of times each execution sequence is encountered in the
execution of the software program represented by the identifier for
each execution sequence and associating a count with the
corresponding identifier; and displaying the hash table of
identifiers and the count associated with functions of the software
program.
9. The method of claim 1, further comprising: selecting the
identifier; identifying source code or function variables
representing the execution sequence of the function; and displaying
the identifier with a link to the source code or function variables
representing execution sequence of the function.
10. The method of claim 1, further comprising: identifying source
code or function variables representing the execution sequence of
the function; and saving at least one selected from the group
comprising the identifier with a link to the source code or
function variables representing execution sequence of the function
and the identifier with the source code or values of the function
variables representing execution sequence of the function.
11. At least one non-transitory machine readable storage medium
comprising a plurality of instructions adapted to be executed to
implement the method of claim 1.
12. A system for processing software, the system comprising: a
processor configured to execute a software program comprising a
function; produce an execution sequence of the function during
execution of the function; generate an identifier for the execution
sequence, wherein the identifier uniquely identifies a path of
execution through the function represented by the execution
sequence; and at least one data storage medium configured to save
the identifier.
13. The system of claim 12, further comprising a user interface
configured to make the identifier available to at least one
user.
14. The system of claim 12, wherein the processor is further
configured to generate an index table of identifiers associated
with functions of the software program, wherein each identifier
includes an index value; the at least one data storage medium
configured to save the index table of identifiers; and the user
interface configured to make the index table of identifiers accessible
to the at least one user.
15. The system of claim 12, wherein the processor is further
configured to access the at least one data storage medium storing
previously-generated identifiers associated with functions of the
software program; and compare the identifier to the
previously-generated identifiers to determine whether the
identifier is already stored in the at least one data storage
medium.
16. The system of claim 15, wherein the processor is further
configured to save the identifier when the identifier is not
already stored in the at least one data storage medium.
17. The system of claim 16, further comprising a counter configured
to increment a count value associated with the identifier when the
identifier is previously stored in the at least one data storage
medium.
18. The system of claim 12, wherein the function includes a defined
function or a specific code segment with sequential code
instructions.
19. The system of claim 12, wherein the identifier for the execution
sequence is derived from an arithmetic or logic operation on the
operational code hash values or conditional execution instruction
hash values for the execution sequence.
20. The system of claim 12, further comprising a data buffer
configured to collect execution sequences of functions in real-time
during the execution of the software program.
Description
RELATED APPLICATIONS
[0001] This application is a continuation-in-part of U.S.
application Ser. No. 13/428,572, filed on Mar. 23, 2012, which
claims priority to U.S. Provisional Application No. 61/466,818,
filed on Mar. 23, 2011, the entire content of these applications is
hereby incorporated by reference. This application is also a
continuation-in-part of U.S. patent application Ser. No.
13/428,597, filed on Mar. 23, 2012, which claims priority to U.S.
Provisional Application Ser. No. 61/466,828, filed on Mar. 23,
2011, the entire content of these applications is hereby
incorporated by reference.
FIELD
[0002] Embodiments of the present invention relate to developing
and analyzing computer software. For example, embodiments of the
invention provide methods and systems for identifying unique
behaviors of a software execution sequence, storing the unique
behaviors, and using and/or exporting the stored unique behavior to
assess the computer software.
BACKGROUND
[0003] Software is created from source code that is written by
software developers. In the process of writing software, many
defects are unintentionally introduced into the software code.
These defects are generally referred to as "bugs," and can be very
difficult to isolate and understand using existing tools and
methods. Accordingly, defect-free computer software has always been
difficult to create. In all but a few instances, computer software
knowingly contains many residual defects that are too elusive or
subtle to economically remove.
[0004] For example, consider the following small software
function:
TABLE-US-00001
    int example(char x, char y, char z)
    {
        int rtnVal = 0;
        switch(z) {
        case 0:
            rtnVal = (x-y);
            break;
        case 1:
            rtnVal = ((int)(x*100)) / (x+y);
            break;
        case 2:
            rtnVal = (x<<y);
            break;
        case 3:
            rtnVal = 100;
            break;
        }
        return rtnVal;
    }
[0005] From initial inspection this function might be expected to
behave in only four possible ways (i.e., one path for each "case"
statement reached by evaluating argument "z"). However, there are
additional behaviors to this example function that can be difficult
to detect. For example, there is no "default" condition for the
"switch" statement. Therefore, if the value of argument "z" is
something other than 0, 1, 2, or 3, then no case statement will be
reached, the "switch" statement will fall through, and the function
will return 0. The effects of this defect can range from benign to
catastrophic. Similarly, if the sum of arguments "x" and "y" results
in a value of 0 when argument "z" is set to 1, the result will be a
divide-by-zero exception (see "case 1"), which is generally viewed
as a catastrophic error condition. Also, if argument "y" is greater
than 31 when argument "z" is 2, the overflow of the shift operation
will cause the return value to be 0 or -1 regardless of the value
of argument "x." Any of these behaviors can be very difficult to
detect using conditional-capture methods. Also, the effects of any
of these unwanted behaviors can be so catastrophic (such as a
system reset) that they eradicate the evidence of the cause of the
error. Similarly, in some situations, the effects of any of these
unwanted behaviors can be so benign that nobody notices that
something is incorrect or can happen so infrequently that they
cannot be reproduced within a reasonable time frame (and,
therefore, cannot be properly debugged using traditional software
debugging tools). Note that the above example function is simple
and used solely for illustration purposes. In real software
development, functions are likely more complex, which leads to more
potential behaviors (both wanted and unwanted). This complexity
further complicates the debugging process.
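The behaviors described in this paragraph can be enumerated explicitly. The following is only an illustrative sketch and is not part of the original disclosure: the enumeration names and the classify_example helper are hypothetical, a 32-bit int is assumed, and a negative shift count is treated as out of range in addition to the greater-than-31 case discussed above. The helper labels which behavior a given set of arguments would select, without performing the division or the shift itself.

```c
/* Hypothetical labels for the behaviors of example() discussed above. */
enum example_behavior {
    BEHAVIOR_SUBTRACT,        /* z == 0 */
    BEHAVIOR_DIVIDE,          /* z == 1, divisor (x + y) is non-zero */
    BEHAVIOR_DIVIDE_BY_ZERO,  /* z == 1, x + y == 0 */
    BEHAVIOR_SHIFT,           /* z == 2, shift count within 0..31 */
    BEHAVIOR_SHIFT_OVERFLOW,  /* z == 2, shift count outside 0..31 */
    BEHAVIOR_CONSTANT,        /* z == 3 */
    BEHAVIOR_NO_CASE          /* any other z: rtnVal keeps its initial 0 */
};

/* Assumption: int is 32 bits wide, so valid shift counts are 0..31. */
enum example_behavior classify_example(char x, char y, char z)
{
    int divisor = x + y;  /* operands promote to int, as in example() */
    int shift = y;

    switch (z) {
    case 0:
        return BEHAVIOR_SUBTRACT;
    case 1:
        return divisor == 0 ? BEHAVIOR_DIVIDE_BY_ZERO : BEHAVIOR_DIVIDE;
    case 2:
        return (shift < 0 || shift > 31) ? BEHAVIOR_SHIFT_OVERFLOW
                                         : BEHAVIOR_SHIFT;
    case 3:
        return BEHAVIOR_CONSTANT;
    default:
        return BEHAVIOR_NO_CASE;
    }
}
```

Under these assumptions, the sketch distinguishes seven behaviors for a function that appears at first glance to have only four.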
SUMMARY
[0006] When considering the task of discovering the root cause of a
software defect, all of the answers are in the computer chip or
system. In particular, the computer chip or system contains the
cause of every bug, the value of every program variable, how every
line of software actually behaves, and every software vulnerability
and optimization opportunity. If it were possible to access and
analyze this information in its entirety, then software development
could be much easier and result in fewer residual bugs. This
superabundance of information is always present in a computer that
is running software, yet for much of the computer age this was too
much information to export, collect, or process economically. In
response, software debugging tools have been designed to limit the
export of execution information to a tiny portion of the total
available, to give software developers only the information they
specifically request using tools and methods of conditional
debugging.
[0007] For example, software developers traditionally have relied
on tools and methods of conditional debugging. Conditional
debugging requires software developers to pre-determine a condition
or sequence of conditions that must be satisfied in the target
computer before enabling the capture of execution data. Examples of
conditional debuggers include breakpoint debuggers (where one or
more predefined breakpoint conditions are set at fixed locations in
the software code to enable data capture), single-step debuggers
(wherein program code can be stepped instruction-by-instruction,
resulting in manual data capture at instruction boundaries), print
debugging (wherein the target software has additional instructions
inserted to export data from predetermined locations), and
real-time trace debuggers (wherein dedicated circuitry performs the
real-time export of software execution data while the computer
system is running at full speed, and includes triggering circuitry
to enable data capture around a predefined condition or a
predefined sequence of conditions).
[0008] A shortcoming of conditional debugging is that the developer
must know in advance the exact condition around which to capture
data for each and every behavior of interest that the software
exhibits. For example, a software developer may become aware of a
defect or undesirable behavior of software and begin searching for
its cause. Using conditional debugging, the developer can set a
breakpoint condition or trigger condition based on the developer's
best guess of the possible cause of the incorrect behavior. The
software program is then executed until the breakpoint or trigger
condition is satisfied. When the condition is satisfied, execution
data is collected. However, the collected execution data may not
necessarily reveal the underlying cause of the incorrect behavior.
In particular, in many situations, the developer needs to modify
the breakpoint or trigger condition to more-correctly match the
conditions of the incorrect behavior. The developer repeats this
process until the defect is located. This iterative process can
take hours or days to complete and typically results in the
correction of just one software defect.
[0009] These forms of conditional debugging are highly intrusive.
In particular, these techniques can alter the flow of program
execution enough to make the original problems non-reproducible
during debugging. Furthermore, these methods are created on the
premise that a software developer will search for the cause of one
known, reproducible bug at a time. Searching for one bug at a time
requires the developer to first make an educated guess about where
a particular defect originates (i.e., to set a breakpoint, trigger,
or other mechanism to capture the exact portion of execution
data that contains evidence of the cause of the present problem).
This search for defects is usually an iterative process, since the
causes of software errors are often not easy to determine, and a
series of iterations can take a long time to
find and correct just one error, particularly if the error has a
low recurrence rate or is otherwise difficult to reproduce.
Furthermore, these debugging techniques may only help a developer
isolate software defects that the developer becomes aware of
through external symptoms. Defects with subtle symptoms or very low
recurrence rates can often elude detection through the entire
development process, and end up shipping with the final
product.
[0010] Breakpoint debuggers and other traditional conditional
debugging tools are rooted in a past era wherein technical
limitations prevented the economic export and capture of the vast
amount of information available on the computer chip. A recent
development is the real-time trace ("RTT") port, such as ARM ETM,
MIPS PDTrace and IEEE/ISTO Nexus-5001, which is specialized logic
added to a computer system to non-intrusively export the vast
amount of execution data present in the computer as it runs at full
speed. Because these RTT ports can export very large quantities of
data, and to aid conditional debugging methods, they generally
include condition-detection logic that signals when a pre-defined
triggering event or sequence has occurred. That signal is then used
to indicate that the exported data should be captured for analysis,
either in an in-system buffer or by an external system.
[0011] Accordingly, software debuggers using RTT have been
developed with a similar mindset as breakpoint debuggers. For
example, the debuggers are used to capture a relatively small
quantity of data around a pre-defined event or sequence. These RTT
debuggers offer a similar set of features as their conditional
debugger predecessors: breakpoints, single-stepping, examining
variables, etc. using the data that has been captured from the RTT
port.
[0012] Recent improvements in RTT debuggers involving the
collection of larger quantities of real-time trace data show some
promise as a more effective means of software debugging. These
systems use fixed-size buffers of up to 4 gigabytes for
high-bandwidth collection of several seconds of execution data, or
employ spool-to-disk methods for low-bandwidth execution data
collection over extended periods. The captured data can then be
analyzed to obtain profiling or code coverage information, or
replayed as though debugging a live computer target with a
conditional debugger. For example, Lauterbach GmbH's "Real-time
Streaming (ETMv3)" technology performs extended-duration recording
of real-time trace data and creates profiling and code coverage
summaries on-the-fly. Execution profiling and code coverage are
useful and have been available for many years, but neither of these
will detect the unique individual behaviors of the called
functions. Correct and incorrect behaviors will be included
ambiguously in the profiling and coverage summaries just like any
other function-behavioral iteration. In short, these enhancements
continue to rely on the developer to manually locate any behavioral
anomalies. This crucial shortcoming is inherent in all conditional
debuggers: they do not detect variations in the behavior of the
software, nor do they use this as a basis for data collection.
[0013] As newly written software will typically contain many
defects, the process of debugging and testing can take an unpredictably
long period of time to complete, and can account for 80% of the
total cost of software development. This has made computer software
the most expensive and unpredictable component in many of the
intelligent, connected devices that utilize computer software for
enhanced functionality. These difficulties have remained remarkably
constant for decades, despite continuing advances and repeated
"breakthroughs" in software debugging technology.
[0014] Given that the answer to every software defect is on
the computer chip, conditional debugging methods hinder
developers from getting the answers they need and are the direct
cause of the high costs, unpredictable schedules, and poor
resulting quality in software development.
[0015] Accordingly, embodiments of the present invention provide
means to uniquely identify software behaviors at a point where the
execution information is most abundant--inside the computer system.
If implemented inside the computer system, this approach manages the
limited capacity of conventional debug collection or export
facilities more effectively by reserving it for unique software
behaviors. Given sufficient capacity for behavior identifiers and
execution data export or capture, continuous software behavioral
analysis and behavioral anomaly capture can be accomplished for
entire software programs and multi-program systems. This can also
be implemented external to the target computer system, receiving
execution data from a high-capacity RTT port or other resources.
Both implementations improve the software development process by
eliminating the need for conditional debugging and by enabling a
more rigorous approach to software quality through providing a
means to individually review and approve-or-improve every unique
behavior exhibited by every software function.
[0016] Therefore, embodiments of the present invention provide
methods and systems for identifying behavioral uniqueness of
software execution sequences as a basis for collection and/or
export of software execution data and related information. One
method can include executing a software program and continuously
producing a sequence of execution information. The method can also
include determining if the execution information is within a
functional boundary of the software program, and determining if the
execution sequence of the execution information is a new execution
sequence or a repeat execution sequence.
[0017] One system can include a functional boundary detector for
continuously analyzing execution information of a software
program to determine if the execution information is within a
functional boundary of said software program, an execution behavior
identification number generator to create unique behavioral
identifiers for unique execution sequences, and a comparator
provided for determining if an execution sequence of the execution
information is a new execution sequence or a repeat execution
sequence and producing a unique detection signal if the new
execution sequence is detected. Therefore, the system identifies
behavioral uniqueness of software execution sequences.
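One way to picture the functional boundary detector, the identifier generator, and the comparator working together is sketched below. This is a minimal illustration under stated assumptions, not the disclosed implementation: the event stream, the structure and function names, the fixed table size, and the use of a wrapping 32-bit sum as the identifier are all hypothetical choices made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace events: function entry, one executed instruction,
 * and function exit. */
enum event_kind { EV_ENTER, EV_INSTR, EV_EXIT };

struct trace_event {
    enum event_kind kind;
    uint32_t        opcode_hash;  /* used only for EV_INSTR */
};

#define MAX_IDS 64  /* arbitrary capacity chosen for the sketch */

struct behavior_monitor {
    bool     in_function;    /* boundary detector state */
    uint32_t current_id;     /* identifier generator accumulator */
    uint32_t seen[MAX_IDS];  /* identifiers observed so far */
    size_t   seen_count;
};

void monitor_init(struct behavior_monitor *m)
{
    assert(m != NULL);
    *m = (struct behavior_monitor){0};
}

/* Feeds one event to the monitor. Returns true only when an EV_EXIT
 * closes an execution sequence whose identifier has not been seen
 * before, which plays the role of the unique detection signal. */
bool monitor_feed(struct behavior_monitor *m, struct trace_event ev)
{
    assert(m != NULL);
    switch (ev.kind) {
    case EV_ENTER:
        m->in_function = true;
        m->current_id = 0;
        return false;
    case EV_INSTR:
        if (m->in_function)
            m->current_id += ev.opcode_hash;  /* wraps modulo 2^32 */
        return false;
    case EV_EXIT:
        if (!m->in_function)
            return false;
        m->in_function = false;
        for (size_t i = 0; i < m->seen_count; i++)
            if (m->seen[i] == m->current_id)
                return false;  /* repeat execution sequence */
        if (m->seen_count < MAX_IDS)
            m->seen[m->seen_count++] = m->current_id;
        return true;           /* new execution sequence */
    }
    return false;
}
```

Feeding the same enter, instruction, and exit events twice yields true for the first pass and false for the second, so only the first occurrence of a behavior would need to trigger data capture.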
[0018] In particular, embodiments of the present invention provide
methods and systems for analyzing and debugging a software program.
In one embodiment, a method for processing software includes
executing a software program including a function, by a computer.
The method also includes producing an execution sequence of the
function when the software program executes the function. In
addition, the method includes generating an identifier for the
execution sequence, saving the identifier, and making the
identifier available to at least one user through a user interface.
The identifier uniquely identifies a path of execution through the
function represented by the execution sequence.
[0019] The method can also include accessing at least one data
storage medium storing previously-generated identifiers associated
with functions of the software program, and comparing the
identifier to the previously-generated identifiers to determine
whether the identifier is already stored in the at least one data
storage medium. The operation of saving the identifier can include
saving the identifier when the identifier is not already stored in
the at least one data storage medium. The method can further
include incrementing a count value associated with the identifier
when the identifier is previously stored in the at least one data
storage medium. The function includes a defined function or a
specific code segment with sequential code instructions. A high
count value indicates that the execution sequence is executed
frequently. Conversely, a low count value can be used to identify an
infrequently used execution sequence, which may represent an execution
sequence with an error. In an example, the identifier for the execution sequence
includes a sum of operational code hash values or conditional
execution instruction hash values for the execution sequence.
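The compare, save, and count operations of this paragraph can be sketched as follows. The sketch is illustrative only and rests on assumptions that the text does not specify: the names, the fixed-capacity linear table, the 32-bit wrapping sum, and the choice to start a newly saved identifier at a count of 1 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_CAPACITY 128  /* arbitrary capacity chosen for the sketch */

struct id_entry {
    uint32_t      id;
    unsigned long count;
};

struct id_table {
    struct id_entry entries[TABLE_CAPACITY];
    size_t          used;
};

/* Identifier formed as a sum of per-instruction hash values. */
uint32_t sequence_identifier(const uint32_t *opcode_hashes, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += opcode_hashes[i];  /* wraps modulo 2^32 */
    return sum;
}

/* Saves the identifier if it is new, otherwise increments its count.
 * Returns the count after recording, or 0 if the table is full. */
unsigned long record_identifier(struct id_table *t, uint32_t id)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->used; i++)
        if (t->entries[i].id == id)
            return ++t->entries[i].count;
    if (t->used == TABLE_CAPACITY)
        return 0;
    t->entries[t->used].id = id;
    t->entries[t->used].count = 1;
    t->used++;
    return 1;
}

/* Returns the entry with the lowest count, or NULL for an empty table. */
const struct id_entry *rarest_entry(const struct id_table *t)
{
    assert(t != NULL);
    const struct id_entry *rare = NULL;
    for (size_t i = 0; i < t->used; i++)
        if (rare == NULL || t->entries[i].count < rare->count)
            rare = &t->entries[i];
    return rare;
}
```

A plain sum is simple but can collide: hash values 3, 4, 5 and hash values 3, 9 both sum to 12, so an actual design would likely need hash values or a combining operation chosen to keep distinct paths apart.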
[0020] In another configuration, the method can further include
executing a second function in the software program when
encountering a function call, a call stack, a context switch, a
switch statement, a branch point, or a conditional execution
instruction. The next operations can include producing a second
execution sequence of the second function, generating a second
identifier for the execution sequence, where the second identifier
uniquely identifies a path of execution through the second function
represented by the second execution sequence, and saving the second
identifier when the identifier is not already stored in the at
least one data storage medium. The software program can include
multiple distinct functions.
[0021] In another configuration, the method can further include
generating a hash table of identifiers associated with functions of
the software program, wherein each identifier includes a hash
value, counting a number of times each execution sequence is
encountered in the execution of the software program represented by
the identifier for each execution sequence and associating a count
with the corresponding identifier, and displaying the hash table of
identifiers and the count associated with functions of the software
program. The hash table can improve the accessibility and
visualization of the identifiers and execution sequence of the
software program. The method can include selecting the identifier,
identifying source code or function variables representing the
execution sequence of the function, and displaying the identifier
with a link to the source code or function variables representing
execution sequence of the function. The method can include
identifying source code or function variables representing the
execution sequence of the function; and saving the identifier with
a link to the source code or function variables representing
execution sequence of the function, or saving the identifier with
the source code or values of the function variables representing
execution sequence of the function. Linking the identifier to the
source code in the source file can allow a user to quickly replay,
analyze, or visualize the source file for an identifier.
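A stored identifier with its count and a link back to the source code might be represented as in the sketch below. The record layout, the use of a file name and line number as the "link," and the one-line display format are assumptions made for illustration; the text does not prescribe any of them.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record tying an identifier to a source location. */
struct behavior_record {
    uint32_t      id;           /* identifier of the execution sequence */
    unsigned long count;        /* times the sequence was encountered */
    const char   *source_file;  /* assumed form of the source link */
    int           source_line;
};

/* Formats one display row as "<hex id> <count> <file>:<line>".
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small or formatting fails. */
int format_behavior_record(const struct behavior_record *rec,
                           char *buf, size_t size)
{
    assert(rec != NULL && buf != NULL);
    int n = snprintf(buf, size, "%08" PRIx32 " %lu %s:%d",
                     rec->id, rec->count,
                     rec->source_file, rec->source_line);
    return (n >= 0 && (size_t)n < size) ? n : -1;
}
```

A user interface could render one such row per identifier and use the file and line fields to open the corresponding source location when a row is selected.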
[0022] Another embodiment of the invention can provide a system for
processing software that includes a processor and at least one data
storage medium. The processor is configured to execute a software
program comprising a function, produce an execution sequence of the
function during execution of the function, and generate an
identifier for the execution sequence. The identifier uniquely
identifies a path of execution through the function represented by
the execution sequence. The at least one data storage medium
is configured to save the identifier.
[0023] The processor can further be configured to access the at
least one data storage medium storing previously-generated
identifiers associated with functions of the software program, and
compare the identifier to the previously-generated identifiers to
determine whether the identifier is already stored in the at least
one data storage medium. The processor is further configured to
save the identifier when the identifier is not already stored in
the at least one data storage medium. The system can include a
counter configured to increment a count value associated with the
identifier when the identifier is previously stored in the at least
one data storage medium. The function includes a defined function
or a specific code segment with sequential code instructions. The
identifier for the execution sequence is derived from an arithmetic
and/or logic operation on the operational code hash values or
conditional execution instruction hash values for the execution
sequence. The system can include a data buffer configured to
collect execution sequences of functions in real time during the
execution of the software program.
[0024] Another embodiment of the invention can provide a user
interface configured to make unique behavior identifiers available
to at least one user. A processor can be configured to generate an
index table of identifiers associated with functions of the
software program, where each identifier includes an index value.
The at least one data storage medium can be configured to save the
index table of identifiers. The user interface can be configured to
make the index table of identifiers accessible to at least one
user.
[0025] Other aspects of the invention will become apparent by
consideration of the detailed description and accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The accompanying drawings are incorporated in and constitute
a part of the specification. The drawings, together with the
general description given above and the detailed description of the
exemplary embodiments and methods given below, serve to explain the
principles of the invention. The objects and advantages of the
invention will become apparent from a study of the following
specification when viewed in light of the accompanying drawings.
[0027] FIG. 1 is a flowchart of a method for analyzing and
debugging a software program.
[0028] FIG. 2A schematically illustrates a system including a
computer processor equipped with a real-time trace subsystem using
existing software analyzing tools.
[0029] FIG. 2B schematically illustrates a system including a
computer processor equipped with a real-time trace subsystem
implementing the method of FIG. 1.
[0030] FIG. 3 schematically illustrates a behavioral identifier
calculation system used by the system of FIG. 2B according to one
embodiment of the invention.
[0031] FIG. 4 schematically illustrates a behavior uniqueness
detection system used by the system of FIG. 2B according to one
embodiment of the invention.
[0032] FIG. 5 schematically illustrates an alternative
implementation of embodiments of the present invention using an
external system for processing real-time trace data exported from
an unmodified computer system.
[0033] FIG. 6A is a screen shot illustrating behavioral information
and launcher for a replay debugger.
[0034] FIG. 6B illustrates a portion of FIG. 6A.
[0035] FIG. 7 schematically illustrates another system for
performing the method of FIG. 1.
[0036] FIG. 8 provides further details regarding the functionality
performed by the system of FIG. 7.
[0037] FIG. 9 illustrates functionality performed by a behavioral
identifier 330 included in the system of FIG. 7.
[0038] FIG. 10 illustrates functionality performed by a comparator
included in the system of FIG. 7.
[0039] FIG. 11 illustrates the system of FIG. 7 used with a
multi-user storage system.
[0040] FIG. 12 schematically illustrates a system implementing the
method of FIG. 1 using data compression.
[0041] FIG. 13 is a flow chart illustrating a compression operation
performed by the system of FIG. 12.
[0042] FIGS. 14 and 15 illustrate functionality performed by an
execution path identification creator included in the system of
FIG. 12.
[0043] FIG. 16 illustrates accumulator logic performed by the
system of FIG. 12.
[0044] FIG. 17 is a flow chart illustrating embodiments of the
present invention implemented using existing instruction trace
information.
[0045] FIG. 18 illustrates an example software function.
[0046] FIG. 19 is a flow chart illustrating a method of analyzing
a target software program and inserting additional software
instructions in the target software program to implement
embodiments of the present invention in a software-only manner.
[0047] FIG. 20 schematically illustrates decompression of stored
execution path identifier sequences.
[0048] FIG. 21 schematically illustrates post-processing to remove
intervening interrupt or exception handling code from a software
execution path.
[0049] FIG. 22 schematically illustrates compression of software
execution instructions.
[0050] TABLE 1 illustrates computer instructions for the sample
software function of FIG. 18.
[0051] TABLES 2A, 2B, and 2C illustrate execution of instructions
in TABLE 1.
[0052] TABLE 3 illustrates implementation of a software-only
solution using the sample software function of FIG. 18.
[0053] TABLE 4 illustrates effect of interrupts or exceptions on
the processing of instruction trace compression.
DETAILED DESCRIPTION
[0054] Before any embodiments of the invention are explained in
detail, it is to be understood that the invention is not limited in
its application to the details of construction and the arrangement
of components set forth in the following description or illustrated
in the following drawings. The invention is capable of other
embodiments and of being practiced or of being carried out in
various ways.
[0055] Reference will now be made in detail to exemplary
embodiments and methods of the invention as illustrated in the
accompanying drawings, in which like reference characters designate
like or corresponding parts throughout the drawings. It should be
noted, however, that the invention in its broader aspects is not
limited to the specific details, representative devices and
methods, and illustrative examples shown and described in
connection with the exemplary embodiments and methods.
[0056] This description of exemplary embodiments is intended to be
read in connection with the accompanying drawings, which are to be
considered part of the entire written description. The word "a" as
used in the claims means "at least one" and the word "two" as used
in the claims means "at least two."
[0057] As used herein, a behavioral identifier, a captured
execution behavior, compressed behavioral data, or an execution
path identifier can refer to an identifier. A functional boundary
can refer to the beginning or end of an execution sequence for the
function. Execution information or execution path can refer to the
execution sequence for a single function or multiple functions. A
function can include a defined function that has a function
initialization and return value at the completion of the function.
Alternatively, the function can include a specific code segment
that can be grouped or blocked together because of the sequential
nature of the code instructions or the repeatability of the code
instructions. For example, a function can include a specific code
segment that is manually defined by a user or can be automatically
identified (e.g., based on a predefined number of instructions,
location of particular types of instructions, such as breaks,
returns, repeats, etc.).
[0058] As noted above, embodiments of the present invention provide
methods and systems for analyzing and debugging a software program.
FIG. 1A illustrates a method 100 of analyzing and debugging a
software program. The method 100 includes executing a software
program, by a computer (at block 110). The software program
includes a function. When, during execution, the software program
executes the function, an execution sequence of the function is
produced (at block 120). The execution sequence represents a path
followed through the function (e.g., what instructions were
executed, what order the instructions were executed in, and what
data was used or generated during the execution). As described in
more detail below, a unique identifier is defined for the execution
sequence (at block 130). The unique identifier is then saved (at
block 140), where it can be accessible by a user (e.g., through a
graphical user interface) (at block 150). In some embodiments,
before saving the unique identifier, the method 100 can include
accessing at least one data storage device storing previously-saved
unique identifiers. If the data storage device already stores the
unique identifier generated for the currently collected execution
sequence, the unique identifier is not stored to prevent
duplication of execution information. In other words, if the exact
execution sequence for a function has already been observed and
recorded, no further information is stored for the current
execution sequence. This check and comparison helps reduce the
amount of information collected by the system and made available to
a user for analyzing and debugging software. For example, storing
two occurrences of the same execution sequence through a particular
function gives a developer no more information than storing only
one occurrence. Furthermore, by
limiting the storing of duplicate execution sequences, a developer
can use the saved data to quickly identify how many paths have been
recorded through a particular function. Accordingly, if more paths
were recorded than are inherent in the structure of the function
(e.g., four possible paths contained in a switch statement), the
developer can efficiently identify a bug. In some embodiments,
counts can be maintained to track how often particular unique
execution sequences are observed, if a developer needs this
information to track software performance.
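The duplicate check and optional occurrence counts described in
this paragraph can be pictured with a minimal sketch. The class
name BehaviorStore, its methods, and the sample identifier values
are hypothetical and chosen only for illustration; they do not
appear in the embodiments.

```python
from collections import Counter


class BehaviorStore:
    """Hypothetical store: one entry per unique identifier, plus a count."""

    def __init__(self):
        # (function name, identifier) -> number of times observed
        self.counts = Counter()

    def record(self, function_name, identifier):
        """Count the observation; return True only the first time it is seen."""
        key = (function_name, identifier)
        is_new = key not in self.counts
        self.counts[key] += 1
        return is_new

    def paths_for(self, function_name):
        """Number of distinct paths recorded through one function."""
        return sum(1 for name, _ in self.counts if name == function_name)


store = BehaviorStore()
observed = [("f", 0xA1), ("f", 0xB2), ("f", 0xA1), ("f", 0xC3)]
new_flags = [store.record(name, ident) for name, ident in observed]
```

In this sketch only the observations flagged True would be saved,
and paths_for("f") gives the path count a developer could compare
against the number of paths expected from the function's structure.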
[0059] As such, embodiments of the present invention provide
methods and systems for identifying behavioral uniqueness of
software execution sequences. In particular, execution information
is continuously analyzed to determine if a behavioral iteration of
the computer program is unique or merely a repeat of
previously-observed behavior. When a unique behavior is detected,
the data of interest is captured, stored, and indexed by a
behavioral identifier. The input data used to create a behavioral
identification can include but is not limited to: execution trace
data, program variables, execution timing, and related signals,
conditions, and events. These data values are progressively
combined into a behavioral identifier as the program executes and
exported on software functional boundaries to be evaluated for
uniqueness. Using the example software function described above in
the Summary section, embodiments of the present invention uniquely
identify the four case statements (i.e., cases 0-3) and the three
additional behaviors (e.g., default condition, divide-by-zero
condition, overflow condition) discussed above (i.e., if actually
executed). A software developer could then review the collected
behaviors at their leisure to determine if each behavior is correct
or incorrect.
[0060] Using the behavioral capture method as described above
provides benefits over conditional capture methods. For example,
software developers no longer have to set conditional breakpoints
or triggers in an iterative attempt to capture evidence of just one
incorrect software behavior after another. Rather, every behavior
is automatically captured the first time it occurs. This nearly
eliminates the need to find and fix software bugs in an iterative
approach, which commonly is one of the most expensive components of
software development. In addition, since every behavior is uniquely
identified and captured, including incorrect behaviors with
otherwise subtle symptoms or low recurrence rates, defects can be
corrected as soon as they happen at least one time. The result is
improved software quality, with very low residual defect rates
achievable without undue expense. Furthermore, the identification
and capture can be performed on the entirety of executing software,
not just those functions of interest to an individual developer.
This enables an intimate knowledge of unfamiliar code to be gained
quickly by a software developer, a process that is very difficult
using existing methods.
[0061] Additional details regarding the method 100 illustrated in
FIG. 1A are provided below. For example, FIG. 2A illustrates a
computer processor equipped with a real-time trace ("RTT")
subsystem and existing software debugging tools (i.e., conditional
debugging tools). During software program execution, processor core
logic 160 produces signals 162 indicative of the current software
instructions being executed within the processor core logic 160 and
on-chip peripheral systems. These signals 162 are interpreted by
program trace logic 164 to produce a reduced-size encoding of the
executed instructions and/or memory accesses made during software
execution. Event detection logic 166 monitors the execution signals
162 to detect whether any user-defined events (conditions or
triggers) have happened and creates applicable enable/disable/event
signals
to control capture of trace data 168 into the in-system buffer or
to an off-system export portal 170. Accordingly, only trace data
relating to user-defined events are captured in the buffer.
[0062] In contrast, FIG. 2B schematically illustrates the processor
system of FIG. 2A implementing the method 100 according to one
embodiment of the invention. As illustrated in FIG. 2B, the system
includes a behavior ID generation system 180 that processes and
converts the execution signals 162 into a series of behavioral
identifiers 182. These behavioral identifiers are passed to a
behavior ID data set 184 and to the in-system buffer or export 170
for possible inclusion in the trace data sequence. The behavior ID
data set 184 evaluates each behavior ID it receives to determine if
this value already exists in the data set, or if it is a new value
that has not yet been observed. If the behavior ID is new,
indicating that a not-yet-observed instruction execution sequence
has taken place, a "New BehavID" signal 188 is asserted, which
indicates that the related RTT data should be captured or exported
for analysis by trace buffer or export system 170.
[0063] FIG. 3 schematically illustrates functionality performed by
the behavior ID generation system 180 in one embodiment of the
invention. As illustrated in FIG. 3, the system 180 modifies the
opcode 190 or actual instruction word currently being executed to
remove any memory address encoding. This provides position
independence to the resulting value, yielding a consistent result
regardless of the physical address location encoded within the
instruction. The position independent opcode is then passed to a
hash function 192 to amplify the effects of relatively small
differences between different opcodes. The result is then passed
through an exclusive-OR block 194. The exclusive-OR block 194 has a
complementary effect on the hashed opcode value. For example, if
the related instruction was conditionally not executed, the hashed
opcode is bitwise inverted. Otherwise the hashed opcode is passed
through unmodified. Similar to the hash function 192, the
exclusive-OR block 194 has the effect of amplifying changes to a
hashed opcode value to reflect different program behavior relating
to instructions not conditionally executed. The resulting hashed
and conditionally inverted opcode is passed to an accumulator 196.
The accumulator 196 sums the received result with a series of
opcode hashes executed along the current path of execution (i.e.,
within a function). Control logic 198 maintains sequences of
actions in behavior ID generation, such as temporarily storing the
in-process behavior ID of a function onto a call stack if that
function calls another function before completion and restoring the
in-process behavior ID to the accumulator when any called function
returns execution flow back to the original function. If used in a
multi-process system, control logic 198 also manages multiple call
stacks on a per-process basis. At the completion point of a
function (i.e., a functional boundary), control logic 198 exports
the resulting behavior ID from the accumulator 196 and an
accompanying IDVALID indicator signal. The configuration
illustrated in FIG. 3 provides position independence, consistency
between different target program builds, and high sensitivity to
program changes as small as a single bit. It should be understood,
however, that other forms and combinations of execution data could
be used to create behavioral identifiers without deviating from the
scope of the present invention. For example, additional details
regarding behavioral identifiers are provided below with respect to
FIGS. 12-15.
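The chain described in this paragraph (address removal, hashing,
conditional inversion, accumulation, and call-stack save and
restore) can be sketched in software as follows. This is only an
illustration under stated assumptions: 32-bit instruction words, a
hypothetical ADDRESS_BITS mask marking where an address is encoded,
CRC-32 standing in for the hash function 192, and a 32-bit
accumulator. None of these specifics are given in the description.

```python
import zlib

MASK32 = 0xFFFFFFFF
ADDRESS_BITS = 0x0000FFFF  # hypothetical: low 16 bits carry an address field


def hashed_opcode(opcode, executed):
    """Strip the address field, hash, and invert if conditionally skipped."""
    canonical = opcode & ~ADDRESS_BITS & MASK32       # position independence
    h = zlib.crc32(canonical.to_bytes(4, "little"))   # stand-in hash function
    return h if executed else h ^ MASK32              # bitwise inversion


class BehaviorIdGenerator:
    """Hypothetical software model of the accumulator and control logic."""

    def __init__(self):
        self.accumulator = 0
        self.call_stack = []   # in-process IDs of suspended callers
        self.exported = []     # IDs emitted at functional boundaries

    def on_call(self):
        self.call_stack.append(self.accumulator)
        self.accumulator = 0

    def on_instruction(self, opcode, executed=True):
        total = self.accumulator + hashed_opcode(opcode, executed)
        self.accumulator = total & MASK32

    def on_return(self):
        self.exported.append(self.accumulator)
        self.accumulator = self.call_stack.pop() if self.call_stack else 0


def behavior_id(instructions):
    """Behavior ID of one function given (opcode, executed) pairs."""
    gen = BehaviorIdGenerator()
    gen.on_call()
    for opcode, executed in instructions:
        gen.on_instruction(opcode, executed)
    gen.on_return()
    return gen.exported[-1]
```

Under these assumptions, two instruction words that differ only in
the masked address field yield the same behavior ID, while the same
instruction executed versus conditionally skipped yields different
IDs.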
[0064] FIG. 4 schematically illustrates functionality performed by
the behavior ID data set 184 according to one embodiment of the
invention. As illustrated in FIG. 4, the resulting behavior ID and
IDVALID signal from the control logic 198 (see FIG. 3) are
presented to content addressable memory ("CAM") read/write
interface block 200, which initiates a search within CAM 202 for a
matching behavior ID. For each comparison, a "MVALID" signal is
asserted. Also, if a matching behavior ID is present within the CAM
202, a "MATCH" signal is also asserted. In some embodiments,
user-settable signals "AddNew," "TrigOnMatch," and "TrigOnNotMatch"
control the resulting actions taken if a behavior ID does or does
not exist in the CAM 202. For example, if the "AddNew" signal is
enabled, any behavior ID that does not exist in the CAM 202 is
added to the available space, to be available for subsequent
comparisons. If the "TrigOnMatch" signal is enabled, any comparison
that results in a match causes a "CAPTURE" signal to be asserted.
Similarly, if the "TrigOnNotMatch" signal is enabled, any
comparison that does not result in a match causes a "CAPTURE"
signal to be asserted. The "CAPTURE" signal causes trace data to be
captured and stored in the buffer or export 170.
[0065] A CAMData bus provides read/write access to the contents of
the CAM 202 to the host system and debugging tools. These
configuration and access interfaces provide the user with options
to pre-load the CAM 202 with known good behavior identifiers which
can then be ignored by the system resulting in the capture of
unknown behaviors exclusively. Similarly, a user can pre-load known
bad behavior identifiers, reserving capture for only these
behaviors of interest. Furthermore, these behavior identifiers can
be read from the CAM 202 and stored externally for future use. The
CAM block 200 can also include event counters for each behavioral
identifier element to indicate the accumulated total number of
times each behavior has occurred during a given interval. In some
embodiments, the CAM block 200 can also be paired with a secondary
cache system to pre-load the related behavior identifiers for
functional sections of a computer program as they are executed in a
running system, thereby expanding the coverage of the system by
effectively increasing the working size of the CAM 202.
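The lookup, trigger, pre-load, and counter behavior described for
the CAM 202 in the two preceding paragraphs can be modeled in
software with an ordinary set. The following is a sketch only: the
class BehaviorIdSet and its parameter names are hypothetical
stand-ins for the "AddNew," "TrigOnMatch," and "TrigOnNotMatch"
signals, and a Python set is not a hardware CAM.

```python
from collections import Counter


class BehaviorIdSet:
    """Hypothetical software stand-in for the CAM and its control signals."""

    def __init__(self, preload=(), add_new=True,
                 trig_on_match=False, trig_on_not_match=True):
        self.ids = set(preload)          # pre-loaded known identifiers
        self.add_new = add_new
        self.trig_on_match = trig_on_match
        self.trig_on_not_match = trig_on_not_match
        self.counts = Counter()          # per-identifier event counters

    def lookup(self, behavior_id):
        """Return (match, capture) for one presented behavior ID."""
        match = behavior_id in self.ids
        capture = self.trig_on_match if match else self.trig_on_not_match
        if not match and self.add_new:
            self.ids.add(behavior_id)    # available for later comparisons
        self.counts[behavior_id] += 1
        return match, capture


known_good = {0x1111, 0x2222}            # illustrative values only
cam = BehaviorIdSet(preload=known_good)
results = [cam.lookup(b) for b in (0x1111, 0x9999, 0x9999)]
```

With the defaults shown, pre-loaded known good identifiers never
trigger capture, and an unknown identifier triggers capture only on
its first appearance. Setting trig_on_match to True and
trig_on_not_match to False would instead reserve capture for
pre-loaded identifiers of interest.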
[0066] FIG. 5 schematically illustrates an alternate implementation
of embodiments of the invention that provides an external solution
to an existing processor system equipped with a RTT. This
implementation performs additional processing to decode exported
RTT data and reconstruct signals that would otherwise be available
to an on-chip implementation (see FIG. 2B). Otherwise, these
systems work in an approximately equivalent manner. As illustrated
in FIG. 5, the target microprocessor system 240 continuously
exports trace data during full speed execution. The trace data is
captured and correlated to the target software image to reconstruct
the information that would be available to an on-chip
implementation. The correlated image data can include opcodes,
addresses, data values, etc. The resulting execution information is
then presented to a behavior ID generation system to create a
series of behavioral identifiers, which, as described above, are
passed to a behavior ID set 184 to determine if they are newly
observed or repeat software behaviors. If a new behavior ID is
detected, then the related trace data, behavior ID, and related
execution information are presented to one or more database and
mass storage systems 242, along with the software image and source
files to facilitate on-demand replay. This implementation acts as
an external retrofit to existing processors with RTT where no
additional on-chip logic is required.
[0067] As described above, after storing unique behaviors, the
behaviors are made available to users for analysis. For example,
FIG. 6A is a screen shot 300 illustrating a user interface that
retrieves and displays behavior results obtained by running
embodiments of the present invention against the software example
code described above (see the Background section) (e.g., collected
using the alternate implementation of FIG. 5 from an actual
microprocessor system running at full execution speed). The screen
300 includes a table 302 displaying the executed functions and the
observed behaviors during an execution session. No user
configuration or breakpoints were required to obtain these results,
and the associated execution data (in the form of RTT data) was
automatically collected for each unique behavior and is available
for on-demand replay using a conventional replay debugger
application.
[0068] FIG. 6B illustrates a portion of the table 302. As
illustrated in FIG. 6B, each of the four expected behaviors was
executed many thousands of times. Also, a single instance of an
unexpected behavior was executed, which upon replay would reveal a
transient error case of variable "z" being outside the expected
range of 0 to 3. This type of defect can be notoriously difficult
to correct using conventional debugging methods as it is both
transient and symptomless. However, as illustrated in FIG. 6B,
using embodiments of the present invention, the defect can be
identified and captured based on its first and only appearance.
[0069] FIG. 7 schematically illustrates an alternative
configuration of a system for implementing the method 100.
Referring to FIG. 7, the system 308 includes a computer system 310
(physical or simulated) executing one or more software programs of
interest. The system 308 also includes a functional boundary
detector 314, a comparator 316, and a data buffer 318. While the
computer system 310 executes a software program, execution
information 312 (including execution data and related information)
is continuously created by the computer system 310. This execution
information 312 is continuously collected and presented to both the
functional boundary detector 314 and the comparator 316 through the
data buffer 318. The boundary detector 314 analyzes the execution
information to determine if a functional boundary within the
software program, such as function calls, call stacks, context
switches, and the like, has been crossed. In other words, the
functional boundary detector 314 determines if the execution
information is within a functional boundary of the software
program. If a functional boundary is detected, the boundary
detector 314 asserts the boundary detection signal 320, which
signals the comparator 316 to continuously evaluate the contents of
the preceding execution segment against the contents of the
previously-collected execution information from a previous
execution data buffer 322. Accordingly, the comparator 316
determines if an execution sequence of the execution information has
been previously observed or if the most-recently collected
execution information represents a new, unique behavior (i.e., a
new and unique path through the function). If the behavior is
determined to be unique (i.e., new and not previously observed), the
comparator 316 produces a unique detection signal 324, which
instructs a storage system 326 to store the related data contents
in the data buffer 318. The comparator 316 also produces a
behavioral identifier that is stored in the data buffer 318 for
future comparisons.
[0070] FIG. 8 provides further details regarding functionality
performed by the system 308. As illustrated in FIG. 8, the computer
system 310 produces the execution information 312, which may be
composed of any combination of execution trace information, program
variables, memory accesses, input/output operations, execution
timing, and other related signals, events, or conditions. The
execution information 312 is presented to the functional boundary
detector 314, the data buffer 318, and the components of the
comparator 316. As illustrated in FIG. 8, the comparator 316
includes behavioral identifier creation logic 330 and a uniqueness
detector 332. The behavior identifier creation logic 330
sequentially processes the execution and related data (i.e., the
execution information) using arithmetic and/or logic operations to
produce a behavioral identifier 334 of the execution data sequence
312 for the period defined between the boundaries established by
the boundary detection signal 320. When complete, the behavioral
identifier 334 is presented to the uniqueness detector 332. The
detector 332 determines if the received identifier 334 is a repeat
of a previously-received behavioral identifier (i.e., relates to a
previously-observed execution sequence) or represents a new
behavioral identifier (a new execution sequence not previously
observed). If the identifier is new, the unique detection signal
324 is asserted, which instructs the storage system 326 to save the
related execution data sequence contained in the first in, first
out ("FIFO") buffer 318 along with the behavioral identifier 334.
The behavioral identifier 334 is also saved in the previous
execution data buffer (or store) 322. As illustrated in FIG. 8,
additionally, related program source files and executable software
images 336 can be stored in the storage system 326 to enable future
replay, analysis, or visualization using the correct source and
executable files for selected behaviors, even if those files
receive many edits and modifications during development.
Accordingly, using the system illustrated in FIGS. 7 and 8, a user
can replay stored execution sequences against the then-existing
code even if the code has changed since storing the execution
sequence.
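The interaction of the data buffer 318, the boundary detection
signal 320, the uniqueness detector 332, and the storage system 326
can be sketched as a single loop. This is an illustrative model
only: the function capture_unique_segments, the use of text records
for execution information, and the choice of MD5 over the joined
records as the behavioral identifier are all assumptions made for
the example.

```python
import hashlib


def capture_unique_segments(stream, is_boundary):
    """Store each execution segment only the first time it is observed.

    stream      -- iterable of text records (hypothetical execution information)
    is_boundary -- predicate marking the record that ends a function
    """
    buffer = []    # stands in for the data buffer 318
    seen = set()   # stands in for the previous execution data store 322
    stored = []    # stands in for the storage system 326
    for record in stream:
        buffer.append(record)
        if is_boundary(record):
            joined = "\n".join(buffer).encode()
            identifier = hashlib.md5(joined).hexdigest()
            if identifier not in seen:             # unique detection
                seen.add(identifier)
                stored.append((identifier, list(buffer)))
            buffer.clear()                         # start the next segment
    return stored


stream = ["cmp", "beq", "ret", "cmp", "add", "ret", "cmp", "beq", "ret"]
stored = capture_unique_segments(stream, lambda r: r == "ret")
```

In this example the third segment repeats the first, so only two
segments are stored, each alongside its identifier.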
[0071] FIG. 9 illustrates functionality performed by the behavioral
identifier 330 according to one embodiment of the invention.
Input data from a variety of sources that are affected by or have
an effect on the software execution are candidates for input data
to create the behavioral identifiers. Instruction trace is a source
of the input data as the instruction trace provides the most direct
indication of the software behavior. However, distinctive
identifiers can be obtained from alternate combinations of sources,
such as program variables and execution timing. The internal
arithmetic/logic operation performed on the input data within the
behavioral identifier 334 can vary depending on implementation
conditions. For example, the internal functionality can include
checksums or cyclic redundancy check ("CRC") totals, cumulative
hashes such as MD5, or a minimally-processed linear representation
of the input data. Any of these approaches may be suitable provided
they produce consistent identifiers for repeated input sequences.
Further details regarding generating a behavior identifier are
provided below with respect to FIGS. 12-15.
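The three kinds of internal operation named in this paragraph can
be compared side by side in a short sketch. The function names and
the use of text items as input data are hypothetical; the sketch
relies on Python's standard zlib.crc32 and hashlib.md5 and simply
shows that each approach returns the same identifier for a repeated
input sequence.

```python
import hashlib
import zlib


def crc_identifier(items):
    """Running CRC-32 over the input items."""
    crc = 0
    for item in items:
        crc = zlib.crc32(item.encode() + b"\0", crc)
    return crc


def md5_identifier(items):
    """Cumulative MD5 hash over the input items."""
    digest = hashlib.md5()
    for item in items:
        digest.update(item.encode() + b"\0")
    return digest.hexdigest()


def linear_identifier(items):
    """Minimally processed linear representation of the input data."""
    return tuple(items)


first_run = ["load", "cmp", "jmp"]
repeat_run = ["load", "cmp", "jmp"]
other_run = ["load", "cmp", "ret"]
```

The separator byte appended to each item is a design choice of the
sketch, so that differently split inputs such as "ab", "c" and "a",
"bc" are not fed to the hash as the same byte stream. The trade-off
among the three is identifier size versus processing cost: the
linear form is exact but as large as its input, while the CRC and
MD5 forms are fixed in size.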
[0072] FIG. 10 illustrates decision flow within the comparator 316,
which implements a non-duplicating memory set with detection for
new item addition. In some embodiments, a local behavioral
identifier store can be initialized with previously-recorded values
to prevent the re-recording of these execution sequences, which
saves storage capacity for recording only previously-unseen
execution sequences.
[0073] FIG. 11 illustrates an embodiment of the system 308 using a
multi-user storage system, such as a database or distributed file
system. As illustrated in FIG. 11, in this embodiment, individual
computer systems 310 paired with the behavior identifier 330 and
uniqueness detector 332 have their resulting behavioral identifiers
and related execution information, source files, and executable
software images stored in a multi-user storage system 340. This
arrangement shares the collected execution information among all
users. Sharing this information makes a defect or other unique
behavior that happens on any connected computer system immediately
available to all users. Accordingly, this embodiment enables a team
synergy where all developers contribute their collected software
behavior data to the common store automatically. Therefore, as they
execute software on a target system (e.g., seeking to quickly
expose as many defects as possible in their own code), they're also
executing other parts of the target software that may contain code
written by others, which potentially exposes new behaviors that had
not been seen before. The result is that every developer becomes a
tester of other developers' code without expending any extra
effort.
[0074] As noted above, embodiments of the invention can use
different techniques for generating a unique behavior identifier.
For example, FIG. 12 schematically illustrates a system 408 that
accepts software execution instructions and outputs compressed
behavioral data according to some embodiments of the present
invention. The compressed behavioral data can include behavior
identifiers representing unique execution sequences of executed
instructions.
[0075] Referring to FIG. 12, the system 408 includes a computer
system 410 (physical or simulated) running a software program by
continuously executing software information (such as software
instructions or software execution data). The execution information
can be in the form of conditional execution instructions. Moreover,
the conditional execution instructions can be in the form of
operation codes (or opcodes) and condition flags. In computer
science, an opcode (operation code) is the portion of a machine
language instruction that specifies the operation to be performed.
The specification and format of opcodes are specified by the
instruction set architecture of the processor in question (which
may be a general CPU or a more specialized processing unit).
[0076] An input stream or trace of the software execution
information, generally depicted with the reference numeral 411
(e.g., the software instructions, the execution status, the
address, and the like), is supplied (e.g., continuously) to an
execution path identification creator 412 while the computer system
410 executes a software program. The trace represents an execution
path through the software program or a portion thereof. For
example, an execution path can be the path through which input data
(i.e., the software execution instructions) passes during the
period of being processed in operation modules of the computer
system 410. In each operation module of the computer system 410,
there are typically various branch points so that different input
data can pass through different branches at these branch points.
The branches through which the input data passes form an execution
path of the input data.
[0077] The execution path identification creator 412 converts the
input stream or trace of the software execution information 411
from the computer system 410 into a stream of encoded data values
representing a specific path taken by the software execution
information executed within each path. The data values are uniquely
created for every specific execution path and serve as behavior
identifiers for the executing software program. The stream of
encoded data values represents at least one unique execution
sequence of the software execution instructions. For example, in
one embodiment, the execution path identification creator 412
continuously accesses the execution instructions of the computer
software, identifies execution sequences of the software execution
instructions, and creates a unique execution path identifier 414 of
each of the execution sequences by summing the conditional
execution instructions when the conditional execution instructions
are within a functional boundary. Therefore, the execution path
identification creator 412 creates a unique execution path
identifier 414 representing a compressed unique execution sequence
of the execution instructions. The resulting execution path
identifier 414 is then available for writing to one or more storage
devices 420.
[0078] As further illustrated in FIG. 12, the system 408 also
includes comparison logic 428 and a local storage medium 430
collecting the stream of the execution path identifiers for later
retrieval and decompression. The execution path identifier 414 is
supplied to the comparison logic 428 as a means of detecting when a
previously-unseen (i.e., not previously observed) execution sequence
has occurred. In other words, the comparison logic 428 determines
whether each execution sequence is a new execution sequence or a
repeat execution sequence by comparing the execution path
identifier 414 determined by the execution path identification
creator 412 with the execution path identifiers previously stored
in the local storage medium 430.
[0079] As illustrated in FIG. 13, a compression operation performed
by the execution path identification creator 412 for
encoding an input data stream starts when the execution path
identification creator 412 receives the next software execution
information 411 (at block 502). The execution path identification
creator 412 decodes the software execution information 411, and
address-dependent portions are removed by a decode logic device 416
of the execution path identification creator 412 (at block 504).
The result is converted to a hash value (at block 506) before being
summed within an accumulator 418 (at block 508). The accumulator
418 can be a register in a central processing unit ("CPU"), in
which intermediate results are stored.
[0080] The above-described process continues until the functional
boundary is reached in the program image (at block 510). At this
point, a resulting sum 416 in the accumulator 418 is exported as a
unique, repeatable representation of the behavior of that segment
of the software program (at block 512). The accumulator 418 is then
reset to a base value to begin accumulation of the path
identification of the software execution information of the next
segment of software program (e.g., the next function executed by
the computer 410). The resulting sum represents an execution path
identifier 414.
[0081] For example, FIGS. 14 and 15 illustrate further details of
generating an execution path identifier 414 implemented in computer
logic. As illustrated in FIG. 14, a next-in-stream instruction
opcode 411 is received by the decode logic device 416, which
detects functional boundaries and removes the address portions of
the opcode 411. Removing the address portion creates an
address-independent canonical form of the instruction. The
address-independent form ensures that a given instruction produces
the same result even if the software program is executed
from different address locations in a computer memory. The
canonical opcode is then presented to a hash function, such as MD5
or murmurhash3, that converts the sometimes subtle differences in
software execution information encodings into different resulting
values. Accordingly, the hash helps ensure uniqueness in the
resulting execution path identifier 414 value. The hashed canonical
opcode is then combined with any preceding values in the
accumulator 418. In some embodiments, the opcode can be inverted to
further distinguish conditionally executed instructions. For
example, in some embodiments, a hashed canonical opcode can be
provided as a non-inverted value if the associated instruction was
conditionally executed or as an inverted value if the associated
instruction was conditionally not executed. This inversion helps to
distinguish the execution paths that include conditional non-branch
instructions, which may or may not execute depending on conditions.
When the functional boundary is detected by decode logic 416, the
resulting execution path identifier 414 in the accumulator 418 is
exported to a results register 422, where the presented identifier
is available to additional logic, storage, or export.
[0082] FIG. 16 illustrates accumulator logic 424, which can
optionally include additional resources to collect and export the
associated address in memory of an execution path sequence and the
number of instructions contained within that sequence of execution.
Including these optional values can assist decompression to quickly
determine an exact match for the path of execution and can reduce
the possibility of unresolvable collisions in execution path
identification values. As illustrated in FIG. 16, the accumulator
logic 424 can include the accumulator 418, a counter 426, and the
register 422.
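A software analogue of the accumulator, counter, and results
register might look like the sketch below. The class and field names
are hypothetical, and the 32-bit additive accumulator is an
assumption; the text only requires that the starting address and
instruction count be collected alongside the identifier.

```python
from dataclasses import dataclass
from typing import Optional

MASK32 = 0xFFFFFFFF  # assumed accumulator width


@dataclass(frozen=True)
class PathRecord:
    """Value exported to the results register at a boundary."""
    identifier: int
    start_address: int
    instruction_count: int


class AccumulatorLogic:
    def __init__(self, base: int = 0):
        self.base = base
        self._reset()

    def _reset(self) -> None:
        self.acc = self.base
        self.count = 0
        self.start_address = None

    def step(self, address: int, hashed_opcode: int,
             at_boundary: bool) -> Optional[PathRecord]:
        """Accumulate one instruction; export when a boundary is hit."""
        if self.start_address is None:
            self.start_address = address  # first instruction of path
        self.acc = (self.acc + hashed_opcode) & MASK32
        self.count += 1
        if not at_boundary:
            return None
        record = PathRecord(self.acc, self.start_address, self.count)
        self._reset()
        return record
```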
[0083] FIG. 17 is a flow chart illustrating embodiments of the
present invention implemented using existing instruction trace
information, such as that produced by real-time trace interfaces
including ARM ETM, MIPS PDTrace, IEEE/ISTO Nexus-5001, or
instrumented trace information available from processor simulators
or emulators. In these embodiments, the software execution
information may be encoded and may contain gaps in the information
that require reconstruction.
[0084] As illustrated in FIG. 17, one or more program reference
tables are initially loaded. Then the next software execution
information 411 is received from the computer system 410 (at block
518). A determination is made whether the execution information has
a relative or absolute address value (at block 520). If the
execution information 411 is determined to have an absolute address
value, a determination is made whether the address value matches
the expected execution address (at block 522). If the answer is
"no," gap reconstruction is conducted (at block 524). Next, the
address is looked up in a reference table (at block 526).
Similarly, if the address value matches the expected execution
address (see block 522), the address is looked up in a reference
table (at block 526).
[0085] Alternatively, if the execution information 411 has a
relative value (at block 520), a determination is made whether the
current execution address is known to the system 408 (at block
528). If so, then the relative address information is summed with
the current known address and the address is looked up in a
reference table (at block 526). If not, the next software execution
information 411 is obtained (at block 518).
[0086] After the address is looked up in the reference table, an
opcode hash is added to and summarized in the accumulator 418 (at
block 530). Then, a determination is made whether the functional
boundary is reached (at block 532). If the boundary has not been
reached, the next software execution information 411 is obtained
(at block 518). If the boundary has been reached, however, the
resulting sum in the accumulator 418 is exported as a unique,
repeatable representation of the behavior of that segment of the
software program (at block 534). The resulting sum represents the
execution path identifier 414. The accumulator 418 is then reset to
a base value to begin accumulation of the path identification of
the software execution information of the next segment of the
software program.
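The flow of FIG. 17 can be approximated by the following sketch.
Several details are assumptions made for illustration: trace items
are modeled as ("abs", address) or ("rel", offset) tuples, the
reference table maps an address to a hypothetical (hashed opcode,
instruction size, boundary flag) triple, gap reconstruction is
represented only by an optional callback, and the first absolute
address is accepted without a gap check because no expected address
exists yet.

```python
MASK32 = 0xFFFFFFFF  # assumed accumulator width


def process_trace(items, table, base=0, on_gap=None):
    """Return the list of exported execution path identifiers."""
    exported = []
    acc = base
    current = None    # last known execution address
    expected = None   # address expected for the next instruction
    for kind, value in items:                      # block 518
        if kind == "abs":                          # block 520
            address = value
            if expected is not None and address != expected:  # 522
                if on_gap is not None:
                    on_gap(expected, address)      # block 524
        else:
            if current is None:                    # block 528
                continue                           # address unknown
            address = current + value
        hashed, size, is_boundary = table[address]  # block 526
        acc = (acc + hashed) & MASK32              # block 530
        current, expected = address, address + size
        if is_boundary:                            # block 532
            exported.append(acc)                   # block 534
            acc = base                             # reset accumulator
    return exported
```

A relative item that arrives before any absolute address is known is
skipped, matching the "if not" branch at block 528.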
[0087] In some embodiments, the decoding and gap reconstruction are
performed by the above-described flow steps, and their results are
used with a reference table to look up the current instruction
opcode and the current instruction opcode's pre-computed canonical
hash, as well as the pre-computed functional boundaries and
locations of conditional instructions. These are then presented to
the accumulator 418 as described above.
[0088] In some embodiments, the system 408 continuously collects
and categorizes execution information, thus imposing no limits on
the software developer's visibility into the executing software
program.
[0089] FIG. 18 shows an example software function. The function
tests the value of an argument and returns one of two possible
values as a result of that test. TABLE 1 shows that same function
along with a possible implementation using ARM architecture
instructions for each source line. The opcodes for each instruction
can have a high degree of similarity. TABLE 2A shows an exemplary
execution sequence of the sample function in FIG. 18, with a value
of a variable "a" set to 0 before the function is called. As
illustrated in TABLE 1 the condition for the execution of the
instruction at <example-0xc> has not been met, so the branch
is not taken and the instruction is treated as a no-operation
instruction by the processor. Therefore, the accumulator 418 is
presented with the inverse of the canonical hash of the opcode to
reflect this conditional execution in the present path. The
accumulator results are shown at each step, with a final value of
"ad8f9a33" as the unique execution path identification, with
optional inclusion of the starting address and count of
instructions. TABLE 2B shows an exemplary execution sequence of the
example function of FIG. 18, with the value of the variable "a" set
to 1 before the function is called. Note that the change in
execution path results in a drastic change in the unique value of
the resulting path identification value.
[0090] TABLE 2C illustrates how even small changes in the executing
software program result in changes to the resulting execution path
identification value for the affected path(s) but may leave other
execution paths in the same software program unaffected. In this
modified example, the value returned for the values of the variable
"a" less than 1 has changed from 25 to 24, which represents a small
change to the software program. However, the resulting execution
path identification value changes from "d4b696cd" to "7146c1b4."
This change only affects the path taken when the value of the
variable "a" is greater than "0." The execution path identifier
produced when the variable "a" is less than "0" remains the same as
before.
[0091] TABLE 3 illustrates insertion points for a software-only
embodiment of the present invention. Using the same sample code
from FIG. 18, additional software can be inserted into the
resulting executable software information at the indicated
locations. The inserted software uses pre-computed values that
represent accumulated sums of canonical opcode hash values.
Accordingly, these values can be added to an accumulator value held
in a designated location in the computer system. Additional
instructions can be inserted at function boundaries to initialize
the accumulator value and export the results to the appropriate
destination.
[0092] FIG. 19 is a flow chart illustrating a method of analyzing
a target software program and inserting additional software
instructions in the target software program to implement the
present invention in a software-only manner. The method includes
designating resources to hold the in-process execution path
identifier and resources for the export or storage of the resulting
identifiers (at block 610), analyzing the target software
executable to identify functional boundaries and conditional
instructions (at block 620), and analyzing the instructions within
the segments between and including the conditional instructions to
create a sum of the opcodes with optional removal of address
information and hashing (at block 620). The method also includes
inserting into the target executable the instructions for
implementing the functionality of the present invention (at blocks
630, 640, and 650) and adjusting the program address references to
compensate for the additional inserted instructions (at block 660).
When the resulting target software is executed, a series of
execution path identifiers will be produced per the present
invention.
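The effect of the inserted instructions can be illustrated at source
level. The sketch below is only an analogy: the described method
inserts instructions into the executable, whereas here the inserted
statements are written by hand into a hypothetical function named
classify. The three segment constants are placeholders standing in
for pre-computed sums of canonical opcode hashes, and the exported
list stands in for the designated buffer or output port.

```python
MASK32 = 0xFFFFFFFF  # assumed accumulator width

exported = []  # stands in for the designated export destination

# Placeholder values, not derived from any real opcodes.
SEG_ENTRY = 0x1F2E3D4C       # segment up to the conditional
SEG_TAKEN = 0x0A0B0C0D       # segment executed when a < 1
SEG_NOT_TAKEN = 0x55667788   # segment executed otherwise


def classify(a):
    acc = 0                                   # inserted: initialize
    acc = (acc + SEG_ENTRY) & MASK32          # inserted
    if a < 1:
        acc = (acc + SEG_TAKEN) & MASK32      # inserted
        result = 25
    else:
        acc = (acc + SEG_NOT_TAKEN) & MASK32  # inserted
        result = 0                            # illustrative value
    exported.append(acc)                      # inserted: export
    return result
```

Each call appends one identifier; calls that follow the same path
append the same value.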
[0093] FIG. 20 illustrates decompression of compressed execution
trace data (the execution sequence of the execution instructions)
back to a reconstructed representation of an original form using
the execution path identifiers. A sequential record of the created
execution path identifiers and optional starting addresses and
instruction counts is passed in chronological order to simulator
or data table logic 700, which reconstructs the equivalent
execution path necessary to create the presented execution path
identification value. The logic 700 can in some configurations
include a simulator of the target computer system, iteratively
searching for the identical path needed to produce a matching
execution path identification value. This simulation operation can
be accelerated by including either or both of the starting address
and number of instructions of the presented execution path
identifier. The simulation can also be accelerated by using a data
table of pre-computed execution path identifiers and their
associated execution paths. In another instance, contents of the
logic 700 may be a larger data table containing the execution path
identifiers and a complete pre-recorded or pre-computed execution
trace record of the associated path. The resulting stream of
execution trace information that matches the data represented by
the presented execution path identifier is then produced,
recreating the trace of execution in the target computer
system.
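The data table variant of the logic 700 can be sketched as a lookup
keyed by identifier, starting address, and instruction count. The
function names are hypothetical, each path is modeled simply as a
list of hashed opcode values, and the additive identifier is an
assumption carried over for the example.

```python
MASK32 = 0xFFFFFFFF  # assumed identifier width


def identify(path):
    """Assumed identifier: modular sum of the path's hashed opcodes."""
    return sum(path) & MASK32


def build_table(known_paths):
    """known_paths: iterable of (start_address, path) pairs."""
    table = {}
    for start, path in known_paths:
        table[(identify(path), start, len(path))] = list(path)
    return table


def decompress(records, table):
    """records: (identifier, start_address, count) in time order."""
    trace = []
    for identifier, start, count in records:
        trace.extend(table[(identifier, start, count)])
    return trace
```

Keying on all three values lets two paths that happen to share an
identifier be told apart by their starting address or length.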
[0094] TABLE 4 and FIG. 21 illustrate handling of interrupts or
exceptions. Normal execution proceeds at the top of TABLE 4 until
an interrupt or other exception event alters the flow of execution.
At this point the in-process path identifier is exported, along
with the starting address and count of instructions within this
partial path execution, and an indicator that the execution has
been interrupted. Path identification of the
interrupt/exception-handling code then proceeds as normal,
resulting in the export of one or more path identifiers. If
execution resumes within the interrupted function, execution is
treated as a start of a new function and the path identifier
accumulator and instruction counters are reset and the resume
address is recorded as a new starting address for that path
segment. FIG. 21 illustrates the processing that could assemble the
interrupted execution path segments back into a whole for direct
comparison with uninterrupted peers, effectively removing the
interrupt/exception processing from the execution path analysis. In
some embodiments, no resolution is lost. For example, the moment an
interrupt or exception occurs is preserved, and the interrupted
function can be treated as though the interrupted function was not
interrupted.
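One way the reassembly of FIG. 21 could work is sketched below,
under assumptions that go beyond the text: identifiers are combined
by modular addition starting from a base value of zero (so the
partial sums of an interrupted path add up to the uninterrupted
sum), each exported segment carries hypothetical interrupted and
resumed flags, and interrupts do not nest.

```python
from typing import NamedTuple

MASK32 = 0xFFFFFFFF  # assumed identifier width


class Segment(NamedTuple):
    identifier: int
    start: int
    count: int
    interrupted: bool = False  # exported because of an interrupt
    resumed: bool = False      # begins at a resume address


def reassemble(segments):
    """Merge interrupted segments and drop handler segments."""
    whole = []
    pending = None  # partial path waiting for its continuation
    for seg in segments:
        if seg.resumed and pending is not None:
            seg = Segment(
                (pending.identifier + seg.identifier) & MASK32,
                pending.start,
                pending.count + seg.count,
                seg.interrupted,
            )
            pending = None
        elif pending is not None:
            continue  # interrupt/exception-handling code
        if seg.interrupted:
            pending = seg
        else:
            whole.append(seg)
    return whole
```

The merged segment can then be compared directly with the identifier
of an uninterrupted execution of the same path.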
[0095] It should be understood that embodiments of the present
invention are amenable to additional compression logic, which can
increase the compression of the execution trace data. For example,
FIG. 22 illustrates one embodiment of the present invention using
execution combine-and-store logic. In this example, a call graph
depicts a series of software function calls, with each functional
unit segment returning a unique path identifier. The additional
logic sums the sequences of path identifiers and saves the
resulting identifiers in a table, allowing subsequent calls
resulting in the same series of path identifiers to use the
combined identifier to replace the series. This results in
ever-larger sequences of execution being represented by individual
identifiers, which vastly increases compression of execution
information. Furthermore, these path identifier sequences are
combined using a simple sum operation, which increases compression
with minimal additional resources. This compares favorably to
general-purpose data compressors that require additional resources
and can obfuscate the results and prevent direct use as a path
identifier.
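The combine-and-store logic can be sketched as a table keyed by the
sum of a sequence of path identifiers. The class name, the 32-bit
width, and the collision check are assumptions for the example. Note
that a plain sum does not depend on order, so in this sketch two
different orderings of the same identifiers map to the same combined
value and are reported as a collision.

```python
MASK32 = 0xFFFFFFFF  # assumed identifier width


class CombineAndStore:
    def __init__(self):
        self.table = {}  # combined identifier -> identifier sequence

    def combine(self, sequence):
        """Return one identifier standing in for the whole sequence."""
        seq = tuple(sequence)
        combined = sum(seq) & MASK32
        known = self.table.setdefault(combined, seq)
        if known != seq:
            raise ValueError("combined identifier collision")
        return combined

    def expand(self, combined):
        """Recover the original sequence of path identifiers."""
        return self.table[combined]
```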
[0096] Therefore, the present invention provides a novel method and
system of compressing software instruction execution trace
sequences while simultaneously creating a unique identification for
the sequence that is a direct representation of the software's
behavior. The method and system of the present invention access
information about the executed instructions in a computer system
and converts that information into a uniquely representative
identification of the specific conditions and execution path taken
by a stream of execution.
[0097] In particular, embodiments of the present invention access
execution trace data of a computer system. This trace data is
analyzed to determine program functional boundaries. A behavioral
identifier variable is initialized to a base value at the start of
a program functional boundary. During execution within a program
functional boundary, the execution trace data and other related
data of interest are progressively combined with the behavioral
identifier variable using arithmetic and/or logical operations
until the end of the program functional boundary, at which point
the behavioral identifier variable is exported to a behavior
uniqueness detector. The behavior uniqueness detector maintains a
store of behavioral identifiers to be compared with the newly
presented behavioral identifiers as a test of uniqueness. If the
presented identifier does not exist in the store, the presented
identifier is added to the store and a signal is asserted that the
behavior is unique, and the associated execution data around and
including the unique behavior should be captured and stored in a
storage system, such as a database, file system, or similar.
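A behavior uniqueness detector of this kind reduces, in its simplest
form, to a membership test against a store of previously seen
identifiers. The sketch below uses an in-memory set as that store;
the class and method names are hypothetical, and a real system would
likely back the store with a database or file system as noted above.

```python
class BehaviorUniquenessDetector:
    def __init__(self):
        self.store = set()  # behavioral identifiers seen so far

    def present(self, identifier) -> bool:
        """Return True (the "unique" signal) for a new identifier."""
        if identifier in self.store:
            return False
        self.store.add(identifier)
        return True
```

A caller would capture the associated execution data only when
present returns True.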
[0098] Further according to the present invention, pre-collected
execution data is analyzed to create unique behavioral identifiers
corresponding to functional boundaries within the target software
program. These identifiers can then be used to index the
pre-collected data, to eliminate duplicate behavior sequences from
the pre-collected execution data, or in the creation of a common
index for multiple buffers of pre-collected execution data.
[0099] Moreover, the sequence of the behavioral identifiers may be
stored in the storage system sequentially as they appear. This
enables a continuous reconstruction of the entirety of observed
software execution to be created from the data in the storage
system.
[0100] Also according to some embodiments of the present invention,
the relevant executable software image and associated source files
can be saved in the storage system, thus facilitating the anytime
retrieval, reconstruction, and replay of the entirety of captured
execution behaviors. Storing this data enables the on-demand
replay, analysis, and visualization of not only all behaviors of
all executed software functions, but also of every revision of
every executed software function, using the correct source files
and program image for reconstruction and presentation in a replay
debugger or analyzer. This stored data also results in the creation
of a self-assembling knowledge base of the entirety of behaviors
exhibited by the target software, spanning all changes incurred
during development and maintenance. Existing tools and methods
routinely discard this valuable execution data, and generally
provide no facility for correlated storage of the associated source
and executable files.
[0101] Despite the ever-growing size and complexity of software
programs, an insight into reducing and simultaneously organizing
the abundant execution data of a software program is that the
software program is executed strictly within rigidly defined
segments of instructions that are interconnected by branching
junctions that have a finite number of connections. Furthermore,
the execution path that is actually taken by a running software
program is most often a very small subset of all possible
paths.
[0102] With this insight, a means of compressing the execution
information based on execution information's behavior has been
described in the present application. By replacing extended
sequences of execution with a uniquely representative and
consistently repeatable execution path identifier for every
uniquely executed path in the software program, unexpected benefits
are produced. For example, the execution path identifiers
themselves are representative of distinct behaviors of the executed
software functions, automatically classifying the execution trace
data by the execution trace data's behavior. This simplifies
software debugging, because every behavior of the software, correct
or incorrect, is individually identified during compression,
regardless of the behavior's transience or commonality. Reviewing
the complete range of behaviors of the target program or any subset
of interest can be done by decompressing the results at the
appearance of each unique identifier type for the functions of
interest. Also, the compression ratio can be an improvement over
existing systems and can replace the trace data of thousands of
instructions with a single representative value. In addition,
because of the rigid-track nature of computer software execution,
when observed over extended periods of time, a software program
will spend the vast majority of time executing within a small
subset of all possible paths and executing functions in frequently
repeated sequences. This pattern of execution can be exploited to
achieve extremely high compression ratios, by replacing extended
sequences of already-observed functional unit executions with a
single representative value.
[0103] Embodiments of the present invention therefore offer
advantages by achieving higher compression ratios than existing
systems, easing the burden of implementing into working computer
systems, and providing compressor output that is a direct
representation of the functional behavior of the target software.
Embodiments of the present invention can also be used as an
identifier for defect isolation and execution profiling, to assist
software developers in rapidly learning intimate details about
unfamiliar software code, and more.
[0104] Embodiments of the present invention are suitable for a
plurality of implementations, including implementation in computer
logic (thereby reducing the required capacity for trace export and
storage), implementation with existing real-time trace processors,
and a software-only implementation for use with computer systems
that may have no real-time trace export capabilities. By
classifying the trace data by the behavior of the software being
traced while compressing the trace data, embodiments of the present
invention can overcome many of the difficulties found in existing
systems and methods and can achieve higher compression ratios than
the previous techniques discussed above, while producing a result
that is simpler to use for the tasks of software debugging, software
testing and analysis, and in gaining a deeper understanding of how
the software actually behaves during full-speed execution.
[0105] Also according to some embodiments of the present invention,
methods and systems are provided for inserting pre-computed
software instructions into specific points of a software
application to create unique execution path identifiers using a
software-only approach. One method can include analyzing the target
software to determine the appropriate canonical hash values and
appropriate insertion points in the application, inserting these
additional instructions into the application at the appropriate
conditional instructions and branch points, accumulating and
storing the unique execution path identifiers at runtime to a
designated memory buffer or output port, and retrieving the
resulting execution path identifiers at runtime for immediate use
or storage.
[0106] Through the methods and systems according to embodiments of
the present invention, execution behavior identifiers can be
created and collected from an operating computer system using
minimal system resources. The identifiers can also be compared to a
computed set of identifiers representing a full reconstruction of
the execution path taken by the application. This results in
abundant information that is pre-classified by behavioral type,
making it easier to differentiate which identifiers represent
software that is running in normal, expected ways and which
represent software that is running in new, potentially anomalous,
and unexpected ways. This is particularly useful for software
debugging, where countless hours are spent using existing
techniques attempting to capture transient events that are not
yet fully understood. Embodiments of the present invention are also
useful to quickly gain a deep understanding of unfamiliar software,
because every behavior the software exhibits can be immediately
identified as the behavior occurs. These benefits can be amplified
when embodiments of the present invention are paired with
additional system data capture, such as correlated capture of
program variables, execution timing information, or external system
signals at runtime.
[0107] Some of the functional units described in this specification
have been labeled as modules, in order to more particularly
emphasize their implementation independence. For example, a module
may be implemented as a hardware circuit comprising custom
very-large-scale integration (VLSI) circuits or gate arrays,
off-the-shelf semiconductors such as logic chips, transistors, or
other discrete components. A module may also be implemented in
programmable hardware devices such as field programmable gate
arrays, programmable array logic, programmable logic devices or the
like.
[0108] Modules may also be implemented in software for execution by
various types of processors. An identified module of executable
code may, for instance, comprise one or more blocks of computer
instructions, which may be organized as an object, procedure, or
function. Nevertheless, the executables of an identified module
need not be physically located together, but may comprise disparate
instructions stored in different locations which comprise the
module and achieve the stated purpose for the module when joined
logically together.
[0109] Indeed, a module of executable code may be a single
instruction, or many instructions, and may even be distributed over
several different code segments, among different programs, and
across several memory devices. Similarly, operational data may be
identified and illustrated herein within modules, and may be
embodied in any suitable form and organized within any suitable
type of data structure. The operational data may be collected as a
single data set, or may be distributed over different locations
including over different storage devices. The modules may be
passive or active, including agents operable to perform desired
functions.
[0110] The technology described here can also be stored on a
non-transitory computer readable storage medium (e.g., data storage
medium) that includes volatile and non-volatile, removable and
non-removable media implemented with any technology for the storage
of information such as computer readable instructions, data
structures, program modules, or other data. Computer readable
storage media include, but are not limited to, random-access memory
("RAM"), read only memory ("ROM"), erasable programmable ROM
("EPROM"), electrically erasable programmable ROM ("EEPROM"), flash
memory or other memory technology, compact disc read-only memory
("CD-ROM"), digital
versatile disks ("DVD") or other optical storage, flash drive,
solid state drive, magnetic cassettes, magnetic tapes, magnetic
disk storage or other magnetic storage devices, or any other
computer storage medium which can be used to store the desired
information and described technology.
[0111] The devices described herein may also contain communication
connections or networking apparatus and networking connections that
allow the devices to communicate with other devices. Communication
connections are an example of communication media. Communication
media typically embodies computer readable instructions, data
structures, program modules and other data in a modulated data
signal such as a carrier wave or other transport mechanism and
includes any information delivery media. A "modulated data signal"
means a signal that has one or more of its characteristics set or
changed in such a manner as to encode information in the signal. By
way of example, and not limitation, communication media includes
wired media such as a wired network or direct-wired connection, and
wireless media such as acoustic, radio frequency, infrared, and
other wireless media. The term computer readable media as used
herein includes communication media.
[0112] The foregoing description of embodiments of the present
invention has been presented for the purpose of illustration. The
description is not intended to be exhaustive or to limit the
invention to the precise forms disclosed. Obvious modifications or
variations are possible in light of the above teachings. The
embodiments disclosed herein were chosen in order to best
illustrate the principles of the present invention and its
practical application to thereby enable those of ordinary skill in
the art to best utilize the invention in various embodiments and
with various modifications as are suited to the particular use
contemplated, as long as the principles described herein are
followed. Thus, changes can be made in the above-described
invention without departing from the intent and scope thereof. For
example, the various configurations of the system and methods
described and illustrated in the present application can be
combined and distributed in various ways. It is also intended that
the scope of the present invention be defined by the claims
appended thereto.
* * * * *