U.S. patent application number 09/765916, for a system and method for software code optimization, was published as application 20020066088 on 2002-05-30. The application is assigned to Cadence Design Systems, Inc. Invention is credited to Frederic Canut and Mustapha Derras.
Application Number: 20020066088 (09/765916)
Document ID: /
Family ID: 22808341
Publication Date: 2002-05-30
United States Patent Application 20020066088
Kind Code: A1
Canut, Frederic; et al.
May 30, 2002
System and method for software code optimization
Abstract
A method is provided of optimizing a software program for a
target processor in order to meet specific performance objectives
and yet maintain portability, where the software program is
initially coded in a high-level language. The method includes a
first step of optimizing the software program in the high-level
language, using optimizations that are substantially independent of
the target processor to host the application. If the performance
objectives are met after the completion of this step, then the
process preferably terminates successfully. However, if
the performance objectives are not met, then the method preferably
proceeds to a second step. In the second step, the initially
optimized form of the software program is again optimized in the
high-level language, although target processor-dependent
optimizations are used. If the performance objectives are met after
completing this second step, then the process preferably
terminates. If the performance objectives are not met, then the
process proceeds to a third step. In the third step, the
twice-optimized software program is optimized using a low-level
language of the target processor on key portions of the code, such
that although the software implementation becomes target-dependent,
it remains relatively portable.
Inventors: Canut, Frederic (Leuville-sur-Orge, FR); Derras, Mustapha (Antony, FR)
Correspondence Address: LYON & LYON LLP, 633 WEST FIFTH STREET, SUITE 4700, LOS ANGELES, CA 90071, US
Assignee: Cadence Design Systems, Inc.
Family ID: 22808341
Appl. No.: 09/765916
Filed: January 18, 2001
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
60216746 | Jul 3, 2000 |
Current U.S. Class: 717/151; 717/147
Current CPC Class: G06F 30/33 20200101; G01R 31/318342 20130101; G06F 30/39 20200101; G01R 31/318357 20130101; G06F 8/443 20130101; G06F 30/30 20200101
Class at Publication: 717/151; 717/147
International Class: G06F 009/45
Claims
What is claimed is:
1. A method of optimizing a software program for a target processor
to meet performance objectives, where the software program is coded
in a high-level language, the method comprising the steps of: (a)
optimizing the software program such that a resulting first
optimized form of the software program is substantially independent
of the target processor and is substantially coded in the
high-level language; (b) optimizing the first optimized form of the
software program such that a resulting second optimized form of the
software program is substantially dependent on the target processor
and is substantially coded in the high-level language; and (c)
optimizing the second optimized form of the software program such
that a resulting third optimized form of the software program is
substantially dependent on the target processor and includes
portions coded in a low-level language of the target processor.
2. The method of claim 1, further comprising steps of: (a1)
determining a first performance profile for the first optimized
form of the software program, and comparing the first performance
profile with the performance objectives; and (b1) determining a
second performance profile for the second optimized form of the
software program, and comparing the second performance profile with
the performance objectives.
3. The method of claim 2, wherein steps (b), (b1), and (c) are not
performed if the performance objectives are met after completing
step (a), and step (c) is not performed if the performance
objectives are met after completing step (b).
Description
[0001] This application claims priority to a U.S. Provisional
Application entitled "System-on-a-Chip-1," having Ser. No.
60/216,746 and filed on Jul. 3, 2000, and which is hereby
incorporated by reference into this application as though fully set
forth herein.
BACKGROUND OF THE INVENTION
[0002] 1. Field of the Invention
[0003] The present invention relates to the field of software
development, and particularly, to systems and methods for software
code optimization.
[0004] 2. Background
[0005] In the design of software for digital signal processing
(DSP) and other applications, programmers take advantage of the
low-level but high-speed capabilities that the particular target
processor (e.g., DSP processor, microcontroller) offers in order to
achieve the performance requirements for the applications. However,
the application of these tools early in the development process
leads to the development of a program that may be unportable should
a different target processor be subsequently used to host the
program. The development of code that is not portable from one
target processor to another may result in significant redesign and
development costs for the same basic application. Often, if the
program is portable, the most significant issue is the cost in time
and resources to port the application to a new host, such that the
performance requirements of the program on the new host are
met.
[0006] A need exists therefore for a system and method that
minimizes the likelihood that a development process for software
will result in a program that is unportable from one target
processor to another.
SUMMARY OF THE INVENTION
[0007] The present invention, in one aspect, provides systems and
methods for optimizing software for execution on a specific host
processor.
[0008] In one embodiment, a method is provided of optimizing a
software program for a target processor in order to meet specific
performance objectives, where the software program is coded in a
high-level language. The method includes the steps of first
optimizing the software program in the high-level language, using
optimizations that are substantially independent of the target
processor to host the application. If the performance objectives
are met after the completion of this step, then the process
preferably exits successfully. However, if the performance
objectives are not met, then the method preferably proceeds to a
second step.
[0009] In the second step, the initially optimized form of the
software program is again optimized in the high-level language,
although target processor-dependent optimizations are used. If the
performance objectives are met after completing this second step,
then the process preferably terminates. If the performance
objectives are not met, then the process proceeds to a third
step.
[0010] In the third step, the twice-optimized software program is
optimized using a low-level language of the target processor on key
portions of the code, such that although the software
implementation becomes target-dependent, it remains relatively
portable. Preferably, in evaluating whether the performance
objectives have been achieved, performance profiles are determined
for the intermediate forms of the optimized software program. These
performance profiles are then preferably quantitatively compared to
the previously defined performance objectives.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] FIG. 1 is a flow diagram generally depicting steps in a
process of optimizing a software program for a target processor
representing a preferred embodiment of the invention.
[0012] FIG. 2 is a flow diagram depicting a preferred embodiment of
a simplified representation of the main steps of the generic
implementation process represented as one step in FIG. 1.
[0013] FIG. 3 is a flow diagram depicting a preferred embodiment of
detailed steps of the generic implementation process depicted in
FIG. 2.
[0014] FIG. 4 is a graph depicting examples of curves of the
evolution of a software application with respect to its performance
and size in a target-independent optimization process.
[0015] FIG. 5 is a flow diagram depicting a preferred embodiment of
a simplified representation of the main steps of the specific
implementation process represented as one step in FIG. 1.
[0016] FIG. 6 is a flow diagram depicting a preferred embodiment of
detailed steps of the specific implementation process depicted in
FIG. 5.
[0017] FIG. 7 is a graph depicting examples of curves of the
evolution of a software application with respect to its performance
and size in a target-dependent optimization process.
[0018] FIG. 8 is a flow diagram depicting a preferred embodiment of
steps of the fully dedicated implementation process represented as
one step in FIG. 1.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0019] In a preferred embodiment of a software code optimization
method (or process) comprising multiple basic steps, each
successive basic step generally results in code that is closer to
being dedicated to operate on a particular target. Thus, to promote
portability, if the performance goals of the application are
reached after the completion of any step in the process, then the
optimization process is terminated.
[0020] The evaluation of performance, and thereby of whether the
performance meets stated, predefined objectives, preferably accounts
for several factors. Preferably, a key performance measure is the
real-time speed of the application when operating on the specified
target. Another performance measure is the accuracy and/or quality
of the output. Another factor that may be integrated into the
process evaluation is the binary code size. While the application
must fit in the target processor's memory, this factor generally
becomes less important as memories grow in capacity while becoming
physically smaller, cheaper, and less power-consuming.
[0021] Thus, one major step in the optimization process is to fix
the initial constraints that are applied to the development and
optimization of the software application. These constraints are
preferably used to quantitatively evaluate the application
implementation's performance in an overall sense, and facilitate
determining the feasibility of porting the application to a
specific target at each of the development stages. These measures
preferably inherently integrate processing performance
characteristics of the target processor, including its clock
frequency, which determines the number of cycles available to execute
the application.
[0022] Another set of parameters that is preferably calculated is
the global I/O data flow, used to determine whether the memory
accesses (read/write) are achievable on the specified target. This
set of parameters integrates elements like the aggregated data flow
over the internal and external buses (data exchanges).
[0023] FIG. 1 is a flow diagram generally depicting steps in a
process of optimizing a software program for a target processor. In
the embodiment represented in FIG. 1, an optimization process 100
comprises three optimization steps. In a first optimization step
102, the software for the DSP processor target is written in a
high-level language such as C, C++ or Ada. Preferably, the
programming language is one that is completely portable between all
probable DSP targets. In coding the software, optimization
techniques particular to the language are preferably used. The code
optimization in this step 102 preferably does not employ any
optimization tools that depend on the processor that is meant to
host the application.
[0024] Upon completion of this step 102, the new implementation, as
a next step 104, is evaluated to determine whether the performance
goals have been reached. If by completing the step 102 of
target-independent optimization the performance objectives are
achieved, then the overall optimization process 100 successfully
terminates 112. If the performance requirements for the application
have not been achieved, then in a next optimization step 106,
certain portions of the software code are re-implemented in the
high-level language to take advantage of the specific processing
capabilities of the DSP target.
[0025] While the code after this step 106 is less portable than the
code that results after the previous step 102, the software may
remain partially portable for a number of reasons. One reason is
that the modified code is preferably selected from a portion of
code that is short in terms of lines of source code, but is
repeatedly executed and is thus responsible for a relatively
significant percentage of the processing overhead. By first
modifying the code that fits these criteria, the amount of code
that must be modified is minimized, and may additionally be flagged
in the source file to indicate that it is target-specific code. If
the target processor later changes, only these identified portions
need be addressed for optimization. Further, the previously
unoptimized code, corresponding in functionality to the portion of
code that is optimized in this step 106, may remain coded in the
source file. This original unoptimized code may be used as a
starting point for optimizing the same portion of code of any
subsequent target processor. Another benefit is that although the
coding is specific to the DSP target for the application, the code
preferably remains in the high-level language. By remaining in a
high-level language (versus being re-coded in a low-level language
such as an assembly language), the resulting code is inherently
much easier to revisit and comprehend should modifications be
necessary.
[0026] Preferably, software-profiling tools are applied to readily
identify the portions of code that fit the criteria required to be
preferred candidates for optimization, so that they then can be
optimized for the particular DSP target as necessary.
[0027] Once a section of code has been optimized for a particular
DSP target, other portions of code that meet the criteria to be
candidates for optimization may also be optimized for the DSP
target, if the performance criteria for the application have still
not been achieved.
[0028] Once the portions of code that are candidates for optimization
have been optimized, in a next step 108, the implementation is
again evaluated to determine whether the performance objectives on
the target processor have been met. As in step 104, if the
performance objectives are achieved through the step 106 of
target-dependent optimization using the high-level language, then
the overall optimization process 100 successfully terminates 112.
However, if the performance goals of the application have not been
met, then the optimization process 100 proceeds to a third
optimization step 110. In this step 110, the software is configured
to be fully dedicated to the architecture and processing benefits
of the target processor. Various coding techniques that are
particular to the target processor for the application may be
employed. Some of these techniques include executing instructions
in parallel or using any pipeline processing or other specialized
processing capabilities. Further, tradeoffs may be made between
performance and throughput in order to meet pre-stated objectives
of the application. The result of the process is an efficiently
created program for a DSP, microcontroller, or other computing
target processor that meets pre-stated performance objectives, and
is optimal, or close to optimal, with respect to its portability,
thereby minimizing future software development efforts for the same
application.
[0029] In one embodiment of a system for performing the
optimization method, the method is performed automatically after
the software code has been initially developed in a high-level
language. Preferably, the system is provided the performance
parameters that are desired for the application, as well as the
architectural specification of the target processor. Given these
inputs, the system then processes the high-level language source
code, compiles and simulates the code's execution, and tests the
code against the specified performance requirements. If the
performance requirements are not met, the system profiles the code
and then optimizes the portions that are the best candidates for
optimization.
[0030] Preferably, the system comprises a software-optimizing
processor in conjunction with memory that automatically performs
the code profiling operations, code generation operating on
portions of code that are determined to be candidates for
optimization, and then subsequent performance analysis. The
software-optimizing processor may comprise any type of computer,
and has processing characteristics dependent upon, for example, the
processing requirements for the code generation, profiling and
performance assessment operations. It may comprise, e.g., a
workstation such as those manufactured by Sun Microsystems, a
mainframe computer, or a personal computer such as those
manufactured by IBM or Apple. A computer executing
optimization software is preferably used for the
software-optimizing processor, due to the utility and flexibility
of a computer in programming, modifying software, and observing
software performance. More generally, the software-optimizing
processor may be implemented using any type of processor or
processors that may perform the code optimization process as
described herein.
[0031] Thus, the term "processor," as used herein, refers to a
wide variety of computational devices or means including, for
example, multiple processors that perform different processing
tasks or have the same tasks distributed between
processors. The processor(s) may be general purpose CPUs or special
purpose processors such as are often conventionally used in digital
signal processing systems. Further, multiple processors may be
implemented in a server-client or other network configuration, as a
pipeline array of processors, etc. Some or all of the processing is
alternatively implemented with hard-wired circuitry such as an
application-specific integrated circuit (ASIC), a field
programmable gate array (FPGA) or other logic device. In
conjunction with the term "processor," the term "memory" refers to
any storage medium that is accessible to a processor that meets the
memory storage needs for a system for optimizing software.
Preferably, the memory buffer is random access memory (RAM) that is
directly accessed by the software-optimizing processor for ease in
manipulating and processing selected portions of data. Preferably,
the memory store comprises a hard disk or other nonvolatile memory
device or component.
[0032] Preferred embodiments of the method for performing each of
the basic steps illustrated in FIG. 1 are now provided.
Generic Implementation Process (Target independent
Optimization)
[0033] As used herein, the term generic means target-independent.
In the DSP domain, in a target-independent implementation, the
high-level language source code, normally C code, uses no function
calls, pragmas, or macros dedicated to the target.
target-independent implementation, the portability of the
application is maintained and some optimization is integrated into
the application at a high level, without using assembly language
code.
[0034] FIG. 2 is a flow diagram depicting a simplified
representation of the main steps of the generic implementation
process 200 represented as step 102 in FIG. 1. Preferred detailed
steps of this process 200 are depicted in the task flow diagram of
FIG. 3. Preferably, as shown in FIG. 2, there are four main steps
that take as input the mathematical theory related to a signal
processing algorithm and lead to an implementation that is later
used by the specific implementation process.
Stage G1: Floating Point Implementation
[0035] The floating-point implementation step takes as input the
theoretical solution of a process and transforms the solution into
a structured language implementation. A main purpose of the step is
to reflect as much of the mathematics of the theory as possible in
the implementation.
[0036] Precision in the calculations is important in the
floating-point implementation. In general cases, those applications
are implemented using double-precision floats and tools like the
Cadence.RTM. Cierto.TM. signal processing worksystem or MATLAB. Such tools
provide representation of the processed data, allow graphical
representation and comparison, and extract errors so that an
implementation can be qualified.
[0037] For an integer DSP, the floating-point implementation
transitions to a fixed-point implementation linked to the precision
that the DSP can handle. For example, the DSP may need a 16-bit
precision implementation. However, typically a group developing the
floating-point application is not the group developing the
fixed-point implementation. This means that there are at least the
following two approaches. In the first approach, the theoretical
implementation is made with no consideration of the precision. In
that case, the implementation is oriented to processing quality and
pushes the precision problem to the fixed-point porting. In the
second approach, a target precision is involved at an early stage
of the development and impacts the quality of the processing. This
provides a full precision-oriented implementation. However, this
implementation must be entirely redone if the target architecture
is changed. Details regarding floating point formats and related
issues in terms of implementation are provided in several articles,
including Morgan, Don, "Practical DSP Modeling, Techniques, and
Programming in C," John Wiley & Sons, 1995, pp. 263-298, and
Lapsley, P., Bier, J., Shoham, A., and Lee, E. A., "DSP
Processor Fundamentals: Architectures and Features," IEEE Press
Series on Signal Processing, 1997, pp. 1-30, which are hereby
incorporated by reference as though fully set forth herein.
[0038] Stage G2: Fixed Point Implementation
[0039] In the stage of deriving the fixed-point implementation,
trade-offs relating to precision may be made. The extent of these
trade-offs primarily depends on the target DSP capability. If the
target processor is a 16-bit precision DSP, the accepted deviation
of the output result will be greater than on a 32-bit DSP.
[0040] However, depending on the complexity of the algorithm,
another factor, the implementation architecture, is preferably
considered. If an implementation involves hundreds of function
calls, the real-time execution at the end of the implementation
flow is impacted. For this reason, two different steps in the
implementation activity are utilized.
[0041] Another consideration at this level is inheritance. A
common method of implementing signal processing is to take the
floating-point implementation and port it to a specific target.
Another method includes porting an existing fixed-point
implementation to a new target. The mechanisms are quite different
because of the availability of a first implementation. In the
latter case, it is more an adaptation of an existing application
than a new implementation. The advantage is that it shortens the
development process by reusing the existing code done for another
target.
Sub-stage G2.1: Processing Qualification
[0042] The goal of processing qualification is to obtain an
implementation that preferably provides the best trade-off on the
output result for a given precision. One of the tools that can
accelerate the completion of this step is the Cierto.TM. signal
processing worksystem. This tool provides the capability to
validate, compare, and qualify a process with a reference to the
floating-point implementation.
Fixing the deviation criteria depends primarily on the
application category. For image processing comparison, information
like texture, edge, contrast and distortion is considered. For
voice processing, the same elements may be taken into account, but
spectral analysis, tone, volume and saturation, etc. may also be
considered. Depending on the application domain, the criteria can
be completely different. Furthermore, within a given domain, the
criteria can change. Radar can be used in military or agricultural
activities, but the measurements made for those two applications of
radar imaging may be quite different.
Sub-stage G2.2: Implementation Sizing
[0044] With a qualified algorithm in terms of quality and
precision, the first sizing of the algorithm can be addressed.
Preferably, the information gathered includes the real-time data
flow, the implementation structure and architecture, the profiling
of instructions and cycles, and the performances of the target DSP.
These elements help determine if the code can fit inside the
target.
Real-Time Data Flow
[0045] The goal of real-time data flow is to understand the
different I/Os related to the algorithm that are to be integrated
into the DSP. On one level is the global data flow that globally
indicates the availability of the raw and the processed data. With
the global data flow, the developer identifies the processing
delays that are going to provide a basic characteristic of the
application relating to data flow.
[0046] However, because the global data flow is only a simplified
representation, the real behavior of the data coming in and out on
the data bus of the system is not necessarily clear from it. The
programmer may have to zoom in on the elementary time duration
(selected for the global data flow representation) to characterize
the behavior of the implementation when
confronted with the interruptions coming from the devices involved
in the process. This "elementary time duration" can be very
different from one application to another. It can be the duration
of, for example, an image frame, an image line, an audio frame, or
a dedicated time dictated by control software or the processor.
[0047] Another data flow consideration is application cadence.
Application cadence may impact all future decisions for the
application. For example, in an interrupt-driven architecture,
which is the case in most real-time-constrained DSP developments,
it is possible to make clear design choices like the use of a
first-in first-out (FIFO) buffer for the data. This
option provides a more flexible way to manage bus I/Os because it
allows a better optimization of the bandwidth usage. It is
generally a more expensive system design, but it is recommended for
processing that involves large amounts of data, like image
processing.
[0048] Alternatively, a designer may choose not to use a FIFO. This
means that each piece of data produced is either immediately saved
or eventually lost. This is the most constrained way of
implementing a signal processing application, but is cheaper and
well suited for processing that involves little data such as voice
processing. This example shows the impact of the application
cadence criteria on application development.
[0049] Another factor is bandwidth. Such a study exhaustively
integrates external data variables and code fetches. However, a
strict and exact representation of the detailed I/Os is generally
impossible. All of the representations reflect only the static
point of view. A real I/O study preferably indicates how temporal
drift impacts the overall bandwidth throughout the application
processing.
Implementation Organization
[0050] Also affecting implementation sizing is implementation
organization. Implementation organization preferably considers the
implementation structure, the architecture of the implementation,
and the behavior of the implementation.
Implementation Structure
[0051] The implementation structure generally means that the
developer knows the number of functions implemented, the number of
times they are called, split if possible into low-level and
high-level functions, and so on. This first measure can be made
manually or by using tools. One difficulty is identifying a tool
that indicates the number of times a function is called. Context
switching can be expensive if it occurs too many times. For this
purpose, one can use free profiling tools like gprof that provide
part of the necessary information. Other tools like SPARCworks
(Sun Microsystems) provide the call graph.
Implementation Architecture
[0052] The architecture of the implementation generally means
knowing the overall behavior of the application to know if it may
be necessary to revisit the algorithm construction to emphasize
real-time issues. Given a specific processing algorithm that
produces a signal processing development, the requirements can be
formalized as follows:
[0053] 1) Obtain an "elementary" signal sample. This can be an
audio value, an entire image, etc.
[0054] 2) Process the sample using the development made.
[0055] The first step in evaluating the feasibility of the
application includes determining the global data flow to fix the
limit of the input/output size and the precision (8, 16 or 32
bits). The first output of the data flow indicates if it is
possible to sustain the I/Os, but another indication concerns the
algorithm structure.
[0056] Effectively analyzing the processing flow in conjunction
with the data flow can indicate that some steps of the processing
cannot be done before others. In that case, it may sometimes be
possible to conclude that some delay constraints cannot be met, or
simply that the algorithm cannot run in real time.
[0057] There are several definitions of delay. There is the
intrinsic delay related to every processing step, called the real
processing delay: in any data process, there is always the
processing delay needed to perform the data transformation. But
there is also the architectural delay (AD) related to the structure
of the algorithm; this delay is imposed by the algorithm
architecture and cannot be reduced.
Application Behavior
[0058] A majority of applications integrate some computations that
use static correspondence tables or lookup tables to transform the
signal. However, depending on the calculation results, the tables
that are used will not be the same. If, for the same computation,
the signal processing uses two different conversion tables that
have different sizes, then the application is non-deterministic.
Thus, this part of an implementation is preferably clearly
identified so that all of the steps that follow result in useful
measures that can be accurately correlated with the performance
increase of the application.
Profiling
[0059] The objective of high-level profiling is to provide a first
indication of the number of cycles consumed by the implementation
and the binary code size. If necessary, a simulator can also
produce an instruction profile.
[0060] One difficulty is fixing the comparison criteria so that it
is known whether the application fits in the targeted DSP. However,
it is possible to derive a ratio based on the different benchmarks
run on the DSP. The benchmarks are generally provided by the DSP
vendors. This means that the cycle counts measured at a higher
level must be correlated with the performance of the target DSP to
establish a go/no-go process.
[0061] As an example, several DSP providers supply appropriate
benchmarks. Several DSPs are compared using kernel functions written
in C, including MAC (multiply-accumulate), Vect (simple vector
multiply), Fir (FIR filter with redundant load elimination), Lat
(lattice synthesis), Iir (IIR filter), and Jpeg (JPEG discrete
cosine transform).
[0062] From the application point of view, if the programmer takes
the average cycle reduction ratio, it is possible to obtain a value
of 2.8. From the DSP point of view, one can get a 2.78 gain factor.
This is one indicator.
[0063] On one hand, one thing that is not captured by such
benchmarks is the fact that a complete application merges many
kinds of functions, which means the optimization is less efficient
for parts of the implemented algorithm. Furthermore, the
application integrates several function calls that add, in some
cases, significant overhead.
[0064] On the other hand, such benchmarks do not assume that the
maximum potential of the DSPs is exploited. The gain factor
preferably is measured to obtain an effective comparison criterion
for the go/no-go decision on whether the developer should go
further down into assembly optimization.
[0065] Thus, if the generic C implementation's cycle count is more
than five to six times the targeted number of cycles, the developer
may consider that the real-time application is not reachable in a
reasonable amount of time.
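The go/no-go criterion described above can be sketched as a simple ratio test. The function name, the 6x threshold, and the way the cycle budget is expressed are illustrative assumptions, not part of any vendor toolchain:

```c
/* Hypothetical go/no-go check.  generic_cycles is the cycle count
 * measured on the generic C implementation; target_cycles is the
 * real-time budget on the target DSP. */
typedef enum { NO_GO = 0, GO = 1 } decision_t;

decision_t go_no_go(double generic_cycles, double target_cycles)
{
    /* Vendor benchmarks suggest roughly a 2.8x average cycle reduction
     * is available through C-level and assembly optimization, so a
     * generic implementation running at more than ~5-6x the target
     * budget is considered out of reach in reasonable development
     * time. */
    return (generic_cycles / target_cycles <= 6.0) ? GO : NO_GO;
}
```

A generic implementation at 4x the budget would pass this check, while one at 8x would not.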
Stage G3: Reference C Code/Optimization
[0066] After generic code qualification, this implementation is the
reference to be optimized at the C level. Based on this code, one
applies several rules concerning the method of implementing the
application that foster processing-time reduction.
[0067] A primary objective is to establish a test process that
guarantees the integrity of the processing: the goal is to reduce
cycle consumption, not to alter the result of the processing. It is
also possible to establish a specific test script to validate the
optimization and/or use tools to compare the processing results.
[0068] The script simplifies running several tests and allows the
programmer to gather information (traces) on the application's
behavior. The tools allow the creation of specific comparisons on
the processed data. The Cadence Cierto Signal Processing Worksystem
(SPW) is capable of such a task and can speed the development
cycle.
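A minimal sketch of such a comparison tool, assuming the processed data can be compared sample by sample within a tolerance (the function name and tolerance scheme are illustrative, not part of SPW):

```c
#include <math.h>

/* Validate an optimization step: the optimized output must match the
 * reference output within a tolerance, proving that cycles were
 * reduced without transforming the result of the processing. */
int outputs_match(const double *ref, const double *opt, int n, double tol)
{
    for (int i = 0; i < n; i++)
        if (fabs(ref[i] - opt[i]) > tol)
            return 0;   /* mismatch: the optimization changed the result */
    return 1;           /* all samples within tolerance */
}
```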
Stage G3.1: Optimization
[0069] For the high-level language implementation, optimization
preferably uses techniques such as loop reduction, loop merging,
test reduction, pointer usage, and in-line functions or macros to
reduce context switching. These techniques are generic and can be
used with most, if not all, high-level languages.
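As a hedged illustration of two of these techniques, loop merging and pointer usage (with an in-line macro in place of a function call); the data size and scaling operation are invented for the example:

```c
#include <string.h>

#define N 64
/* In-line macro avoids function-call overhead (a context switch) in
 * the inner loop. */
#define SCALE(x) ((x) * 3)

/* Before: two separate passes over the data, paying the loop-branch
 * overhead twice. */
static void scale_then_offset(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++) dst[i] = SCALE(src[i]);
    for (int i = 0; i < n; i++) dst[i] += 1;
}

/* After: the two loops are merged into a single pass and indexing is
 * replaced by post-incremented pointers. */
static void scale_offset_merged(int *dst, const int *src, int n)
{
    int *d = dst;
    while (n-- > 0)
        *d++ = SCALE(*src++) + 1;
}

/* Self-check: both versions must produce identical output. */
int merged_matches_naive(void)
{
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) a[i] = i;
    scale_then_offset(b, a, N);
    scale_offset_merged(c, a, N);
    return memcmp(b, c, sizeof b) == 0;
}
```

The self-check mirrors the test process of the preceding stage: the optimization must not alter the processing result.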
[0070] Another optimization step that can be integrated at this
level is development chain optimization by addressing the specific
options of the pre-processor, compiler, assembler, and/or linker.
This may be useful if the implementation is initially done within
the target development environment. Generally, however,
applications are initially developed on PCs or workstations; in
that case, tuning the generic compiler's options is not useful and
can lead to bad decisions in terms of performance and code size.
[0071] At the implementation level, the developer assumes that the
target DSP is fixed and that a simulator is available. Many C
optimizations as are known in the art are possible at this language
level.
Stage G3.2: Profiling
[0072] Each time a specific implementation is globally applied and
validated, the result preferably is benchmarked and, if possible,
frozen to facilitate further optimization. Preferably, at least
three parameters are integrated: the global effort, in terms of
time, to integrate a new optimization step; the processing-time
reduction that can be evaluated; and the code-size evolution. These
parameters preferably are correlated with the time dedicated to the
project and with whether or not the application is mandatory for
system functionality.
Processing Time Reduction
[0073] In most cases, the gain in processing time follows a 1/x
law. In-line functions, loop reduction, and/or unrolling produce
significant gains; integrating pointers is normally less
significant. A curve like that presented in FIG. 4, which regroups
the measures realized for the generic implementation process, may
be obtained.
[0074] A goal of these measures is to understand the impact of a
modification. There is no generic rule, applicable to all code and
all applications, that reduces the number of cycles; modifications
that appear to optimize the cycle count can actually increase it.
Another goal is to fix a time limit for the different optimization
steps. One rule, for example, may be to require more than a five
percent cycle reduction between two steps.
Code Size Evolution
[0075] This measure is necessary because embedded applications do
not have several Mbytes of memory available on the final system.
Experiments have shown that the generic C optimization process
considerably increases the code size, whereas a fully dedicated C
optimization process generally decreases it. In any case, the
programmer preferably guarantees that the code size does not exceed
the available memory of the target system.
Specific Implementation Process (Target-dependent Implementation in
the High-level Language)
[0076] In this process, some instructions to allow the use of
DSP-specific characteristics preferably are integrated into the C
or other high-level language implementation. Many of the
instructions may be addressed by using pragma instructions that are
placed in the code to take advantage of caches or internal RAM,
loop counters, multiply-accumulate capabilities (MAC), and
multiply-subtract capabilities (MSU). Other specific
characteristics, such as splittable ALUs or multipliers, parallel
instruction execution, and pipeline effects, are addressed at the
assembly level; for some DSPs, the only way to use these
characteristics is to handle them at the assembly level.
Furthermore, this step requires that the developer perform the
least amount of tuning on the code needed to comply with the DSP's
features.
[0077] Although the pragmas and intrinsics tend to detract from
portability, those parts of the code may be encapsulated and
isolated. With the use of "#ifdef" or other such
conditional-compilation flags, target-compiler-dependent flags can
be integrated into the code so that it is possible to recompile the
same application for all the targets to be addressed. However, this
method of implementation requires a clear and structured versioning
system as well as clear coding rules. One of the main issues arises
from the need to support more than three or four different
targets.
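A sketch of this encapsulation in C, where `TARGET_DSP_X` and the intrinsic `_mac()` are hypothetical stand-ins for a real vendor compiler's flag and MAC intrinsic:

```c
/* Target-specific intrinsic isolated behind a conditional-compilation
 * flag; the rest of the source stays portable. */
#ifdef TARGET_DSP_X
#define MAC(acc, a, b) ((acc) = _mac((acc), (a), (b)))
#else
/* Portable fallback: plain C multiply-accumulate. */
#define MAC(acc, a, b) ((acc) += (long)(a) * (b))
#endif

/* The same dot-product source recompiles for every target. */
long dot(const short *x, const short *y, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        MAC(acc, x[i], y[i]);
    return acc;
}
```

Only the macro definition varies per target, which is what makes a versioning system and clear coding rules manageable.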
[0078] Another task of this stage is to adjust the high-level
language (e.g., C) code and observe the effect on the generated
assembler. The goal is not to modify assembly code but to write C
code in a way that the assembler part of the compiler generates
optimized assembly code. The assumption is that there are some
specific C implementations that will impact the generated assembly
code in the same way for many compilers; examples are the "do {}
while" construct and the MAC integration. However, this is mainly
true for the second and third DSP generations. One can also use the
example of the post-register modification: if the developer has
converted the implementation to use pointers, the position of the
pointer increment in the code determines whether or not the
post-register modification is generated in the assembly.
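The "do {} while" and post-register-modification examples mentioned above can be illustrated in C as follows; both functions compute the same sum but are written so that many DSP compilers translate them differently:

```c
/* Indexed for-loop: the compiler typically emits a pre-test branch
 * plus explicit address arithmetic for x[i]. */
int sum_indexed(const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* "do {} while" with a post-incremented pointer: the pointer bump
 * after the load maps directly onto a DSP's post-register-modification
 * addressing mode, and the do-while shape removes the initial loop
 * test (valid only when n > 0). */
int sum_postinc(const int *x, int n)
{
    int s = 0;
    do {
        s += *x++;      /* load, then post-modify the address register */
    } while (--n);
    return s;
}
```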
[0079] FIG. 5 is a flow diagram depicting a simplified
representation of the main steps of the specific implementation
process 500 represented as step 106 in FIG. 1. Detailed steps of
the specific implementation process are depicted in the task flow
diagram of FIG. 6.
Stage SI: C-Optimized Code Impacting the Assembler
[0080] A key objective is to integrate specific pragmas and
intrinsics into the code. The pragmas allow the use of cache or
internal RAM memories and integration of loop counters to optimize
loop branches. The other aspect of this optimization concerns the
implementation modifications that take advantage of the specific
capabilities of the target DSP, including multiply-accumulate,
multiply-subtract, splittable multiply-add, and post-register
modification.
[0081] The goal is to generate the assembly code and observe what
can be modified in the C implementation so that it is translated
more efficiently by the compiler.
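As an illustrative sketch, a FIR-style loop written so that the compiler's code generator can recognize a MAC pattern; the `#pragma loop_count` directive shown in the comment is a hypothetical stand-in for a vendor pragma declaring the loop trip count:

```c
/* C written so the compiler recognizes a multiply-accumulate.
 * Keeping the multiply and the accumulation in a single expression
 * helps the code generator map the loop body onto a hardware MAC. */
long fir_tap(const short *x, const short *h, int taps)
{
    long acc = 0;
    /* #pragma loop_count(min=4)  -- vendor-specific, illustrative
     * only; such a pragma lets the compiler use a zero-overhead
     * hardware loop counter. */
    for (int i = 0; i < taps; i++)
        acc += (long)x[i] * h[i];   /* single-expression MAC pattern */
    return acc;
}
```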
Stage S2: Specialized Low-Level Functions
[0082] Depending on the implementation structure, it may be
necessary to tune some specific functions that are used
intensively. Some methods of accomplishing this include, for
example, removing code that is not used, avoiding overhead
introduced by recursive calls, moving loop invariant expressions
out of the loops, and reducing the scope of the variables (using
macros integrates this concept naturally).
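Two of these methods, moving loop-invariant expressions out of loops and avoiding recursive-call overhead, can be sketched in C as follows (both functions are invented examples):

```c
/* Loop-invariant expression hoisted out of the loop: gain/scale does
 * not change between iterations, so it is computed once. */
long scaled_sum(const int *buf, int n, int gain, int scale)
{
    long s = 0;
    int g = gain / scale;           /* hoisted invariant */
    for (int i = 0; i < n; i++)
        s += (long)buf[i] * g;
    return s;
}

/* Recursive factorial replaced by an iterative loop, avoiding the
 * call/return overhead that recursion incurs on a DSP. */
long fact_iter(int n)
{
    long r = 1;
    while (n > 1)
        r *= n--;
    return r;
}
```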
Code Portability
[0083] While the above-mentioned improvements may be viewed as
non-portable, they are portable in the sense that the overall
architecture of the implementation can be ported and reused. To
achieve this, the developer preferably encapsulates the
DSP-specific instructions by integrating target-compiler-specific
flags into the code and by using a source-versioning system to
handle the various target DSPs. Note that such flags enable or
disable specific parts of the code depending on which flags are
set.
[0084] FIG. 7 is a graph depicting examples of curves of the
evolution of a software application with respect to its performance
and size in a target-dependent optimization process. As shown in
FIG. 7, the size decreases slowly because specific points of the
application are addressed, but the impact on the performance can be
impressive.
Fully-Dedicated Implementation Process
[0085] A fully dedicated implementation process is the lowest stage
of the development process. Trade-offs on the application are
preferably made by removing some processing passes, wherever
possible. Assembly-specific optimization is also integrated to
finally reach the target performance.
[0086] A key challenge is to determine, after profiling, whether
the performance goal is reached. FIG. 8 is a task flow diagram
depicting steps of a dedicated implementation process 800
represented as step 108 in FIG. 1. The dedicated implementation
process includes two main steps, manual assembly optimization and
feature tuning/cutting.
Manual Assembly Optimization
[0087] At this point, very low-level assembly language optimization
is integrated. Key gains at this level generally come from parallel
instructions, pipeline effects, and not-fully-optimized assembly
code. Regarding parallel instructions, some DSPs are able to
execute several instructions in the same clock cycle; for example,
it is possible to execute a load-operation-store sequence in the
same instruction. The main objective is to account for the pipeline
effects that affect the availability of the processed data.
[0088] With respect to pipeline effects, it is mainly around branch
calls that it is possible to code specific instructions and take
advantage of the pipeline delay slots. This optimization can be
useful for loop-intensive applications, and handling it is
mandatory for parallel-instruction optimization.
[0089] For the computationally intensive parts of not-fully-optimized
assembly code, it may be necessary to reorganize the generated code
and integrate a more optimal use of accumulators and registers.
Feature Tuning/Cutting
[0090] Depending on the capabilities of the DSP used, it may be
necessary to re-adapt the application because it does not fit. If
such a decision is made, then the high-level steps of the process
were not well adapted or were neglected. It is then necessary to
re-evaluate the application's behavior in terms of processing,
which is normally a high-level task of the process, for example,
the floating- to fixed-point implementation.
[0091] One work-around is to drop some specific part of the
application that will have little impact on the quality of the
processed data. For example, in an audio processing application,
functions such as ring subtraction, a high-pass filter on the input
signal, or the compression stage could be dropped without a
significant loss of quality. Although the gain from each such cut
may not be high, cutting the compression stage can suppress enough
cycles to reach the target performance.
[0092] Reaching this step of the dedicated implementation process
may mean that the application has not been evaluated correctly. If
this is the case, then optionally, some of the highest process
levels may be re-addressed.
[0093] While preferred embodiments of the invention have been
described herein, and are further explained in the accompanying
materials, many variations are possible which remain within the
concept and scope of the invention. Such variations would become
clear to one of ordinary skill in the art after inspection of the
specification and the drawings. The invention therefore is not to
be restricted except within the spirit and scope of any appended
claims.
* * * * *