U.S. patent application number 09/185422 was filed with the patent office on 2002-03-21 for interactive instruction scheduling and block ordering.
Invention is credited to BHARADWAJ, JAYASHANKAR; MCKINSEY, CHRISTOPHER M.
Application Number | 20020035722 09/185422 |
Document ID | / |
Family ID | 22680905 |
Filed Date | 2002-03-21 |
United States Patent
Application |
20020035722 |
Kind Code |
A1 |
MCKINSEY, CHRISTOPHER M. ;
et al. |
March 21, 2002 |
INTERACTIVE INSTRUCTION SCHEDULING AND BLOCK ORDERING
Abstract
In some embodiments, the invention includes a method of
compiling instructions of a program. The method includes receiving
instructions for code motion and controlling the code motion while
interacting with block ordering. The code motion may be done as
part of various activities including instruction scheduling,
partial redundancy elimination, and loop invariant removal. The
scheduling may involve making an assessment of the cost of
scheduling an instruction that takes into account generation and/or
elimination of branches due to resulting block order update and
determining whether to make the code motion based on the cost.
Instruction scheduling may involve regeneration of predicate
expressions to invert conditional branches.
Inventors: |
MCKINSEY, CHRISTOPHER M.;
(CUPERTINO, CA) ; BHARADWAJ, JAYASHANKAR;
(SARATOGA, CA) |
Correspondence
Address: |
ALOYSIUS T C AUYEUNG
BLAKELY SOKOLOFF TAYLOR & ZAFMAN
12400 WILSHIRE BOULEVARD
SEVENTH FLOOR
LOS ANGELES
CA
900251026
|
Family ID: |
22680905 |
Appl. No.: |
09/185422 |
Filed: |
November 3, 1998 |
Current U.S.
Class: |
717/141 |
Current CPC
Class: |
G06F 8/445 20130101 |
Class at
Publication: |
717/5 |
International
Class: |
G06F 009/45 |
Claims
What is claimed is:
1. A method of compiling instructions of a program, comprising:
receiving instructions for code motion; and controlling the code
motion while interacting with block ordering.
2. The method of claim 1, wherein the code motion is done as part
of instruction scheduling.
3. The method of claim 1, wherein the code motion is done as part
of partial redundancy elimination.
4. The method of claim 1, wherein the block ordering is made in
response to blocks being emptied or populated due to code
motion.
5. The method of claim 1, wherein the block ordering involves
moving blocks within a physical order and eliminating or changing
branch instructions consistent with movement of the blocks.
6. The method of claim 1, wherein the block ordering involves
emptying some blocks and wherein certain of the empty blocks are
moved from a current region of memory to a remote region of memory
and adding an unconditional branch to the block having a target in
a current region of memory.
7. The method of claim 4, wherein branch instructions from the
current region have the empty blocks in the remote region as
targets and during a path compression the branch instructions to
the empty blocks in the remote region are removed.
8. The method of claim 2, wherein the scheduling involves making an
assessment of the cost of scheduling an instruction and determining
whether to make the code motion based on the cost.
9. The method of claim 8, wherein the cost is a global cost.
10. The method of claim 1, wherein the block ordering may expose
multiple branches for multi-way branching.
11. The method of claim 2, wherein the scheduling selectively
involves regeneration of predicate expressions to invert
conditional branches.
12. A method of compiling instructions of a program, comprising:
receiving instructions for scheduling; and scheduling the
instructions while interacting with block ordering.
13. A method of compiling instructions of a program, comprising:
scheduling an instruction; updating a block order of blocks of the
instructions where warranted in response to the scheduling of the
instruction; and scheduling an additional instruction.
14. The method of claim 13, wherein the block order updating
involves moving blocks within a physical order and eliminating or
changing branch instructions consistent with movement of the
blocks.
15. The method of claim 13, wherein the updating involves emptying
some blocks and wherein certain of the empty blocks are moved from
a current region of memory to a remote region of memory and adding
an unconditional branch to the block having a target in a current
region of memory.
16. The method of claim 15, wherein branch instructions from the
current region have the empty blocks in the remote region as
targets and during a path compression the branch instructions to
the empty blocks in the remote region are removed.
17. The method of claim 13, wherein the scheduler makes an
assessment of the cost of scheduling an instruction and determines
whether to make the instruction based on the cost.
18. The method of claim 13, wherein the updating exposes multiple
branches for multi-way branching.
19. The method of claim 13, wherein updating the block ordering
selectively involves regeneration of predicate expressions to
invert conditional branches.
20. A method of compiling instructions of a program, comprising:
selecting at least one of the instructions for scheduling; and
updating a block order of the instructions before conclusion of the
scheduling.
21. An article comprising: a machine readable medium having
instruction that when executed causes a processor to: receive
instructions for code motion; and control the code motion while
interacting with block ordering.
22. The article of claim 21, wherein the code motion is done as
part of instruction scheduling.
23. The article of claim 21, wherein the block ordering is made in
response to blocks being emptied or populated due to code
motion.
24. An article comprising: a machine readable medium having a
program thereon which is created by a compiler that: receives
instructions for code motion; and controls the code motion while
interacting with block ordering.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Technical Field of the Invention
[0002] The present invention relates to compilers.
[0003] 2. Background Art
[0004] A compiler is a program that reads a source program written
in a source language and translates it into a target program in a
target language. For example, a compiler may translate a high level
source program (such as C++) into compiled code that can be
understood by a processor, such as a microprocessor.
[0005] Block ordering (also called code placement) concerns the
order in which instructions and blocks of instructions are to
appear in physical memory. The block ordering may involve the
selection of certain branch instructions between some of the
blocks. It is generally true that it takes fewer cycles or other
processor resources if the instruction is able to fall through to
the next contiguous instruction in memory rather than branching to
another instruction. Accordingly, block ordering involves
attempting to pick the direction of a conditional branch such that
it falls through to an instruction that is more likely to occur and
branches to an instruction less likely to occur. Another benefit of
doing so is that spatial locality is more likely to exist in a
cache. Instruction scheduling involves moving instructions (called
code motion) to better assign instructions to an execution unit for
a particular cycle. The scheduler may move code within a block
(called local code motion) or between blocks (called global code
motion). Some schedulers are capable of only local code motion,
while other schedulers are capable of local and global code
motion.
[0006] In prior art compilers, block ordering and instruction
scheduling are independent activities. For example, in the
compiling process of some prior art compilers, first an instruction
order and accordingly a block order is chosen. Next, instruction
scheduling is performed. Instruction scheduling involves code
motion or moving instructions to different locations in physical
memory to attempt better utilization of execution units. If there
are three execution units, an attempt is made to have each
execution unit be busy during each cycle. Following the completion
of scheduling, the physical order is re-evaluated to see if it can be
improved. For example, if an unconditional branch branches to the
next sequential instruction in memory, the unconditional branch can
be removed without changing the operation of the program. However,
in making these changes to the physical order, the execution units
may be less well utilized. Good block ordering increases
performance. Good instruction scheduling also increases
performance. In the prior art compilers, however, by treating
instruction scheduling and ordering as sequential, independent
activities, both the instruction ordering and scheduling suffer.
Accordingly, performance suffers.
[0007] Accordingly, there is a need for a compiler with improved
instruction scheduling and ordering.
SUMMARY
[0008] In some embodiments, the invention includes a method of
compiling instructions of a program. The method includes receiving
instructions for code motion and controlling the code motion while
interacting with block ordering.
[0009] The code motion may be done as part of instruction
scheduling. The scheduling may involve making an assessment of the
cost of scheduling an instruction and determining whether to make
the code motion based on the cost.
[0010] The scheduling may involve regeneration of predicate
expressions to invert conditional branches.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] The invention will be understood more fully from the
detailed description given below and from the accompanying drawings
of embodiments of the invention which, however, should not be taken
to limit the invention to the specific embodiments described, but
are for explanation and understanding only.
[0012] FIG. 1 is a schematic block diagram representation of a
processor and memory used in compiling.
[0013] FIG. 2A is a representation of an intermediate physical
block order.
[0014] FIG. 2B is a representation of a final physical block
order.
[0015] FIG. 3 is a representation of a table.
[0016] FIG. 4A is a control flowgraph.
[0017] FIG. 4B is a control flowgraph.
[0018] FIG. 5 is a flowchart illustrating a feedback feature that
may be used by the compilation code.
[0019] FIG. 6A is a control flowgraph.
[0020] FIG. 6B is a representation of physical block order.
[0021] FIG. 7A is a control flowgraph.
[0022] FIG. 7B is a control flowgraph.
[0023] FIG. 7C is a representation of physical block order at
different times.
FIG. 8A is a control flowgraph.
[0024] FIG. 8B is a control flowgraph.
[0025] FIG. 8C is a representation of physical block order at
different times.
[0026] FIG. 9A is a control flowgraph.
[0027] FIG. 9B is a control flowgraph.
[0028] FIG. 9C is a representation of physical block order.
DETAILED DESCRIPTION
[0029] A. Overview
[0030] Referring to FIG. 1, a computer system 10 includes memory 14
and a processor 16, which executes a compiler program (called the
"compiler") to compile a source program in memory 14 to create a
compiled program. Memory 14 holds the source program to be
compiled, intermediate forms of the source program, and the
resulting compiled program. Memory 14 may also hold the compiler.
Memory 14 is intended as a generalized representation of memory and
may include a variety of forms of memory, such as a hard drive,
CD-ROM, and random access memory (RAM) and related circuitry. A
hard drive, CD-ROM, and RAM are examples of articles including
machine readable media. For example, the compiler may be included
on a CD-ROM and loaded from the CD-ROM to a hard drive.
[0031] The phrase "some embodiments" refers to at least some
embodiments of the invention. The various appearances of "some
embodiments" are not necessarily all referring to the same
embodiments.
[0032] During any phase of compilation where instructions are moved
around the program, the basic blocks (called blocks) may change.
That is, new blocks on edges in the flow graph may be created and
other blocks on edges may be emptied in response to code motion. In
some embodiments, the invention involves dynamically updating
physical instruction block placement during an instruction
scheduling phase of compilation or during another phase of
compiling involving code motion (e.g., partial redundancy
elimination (PRE) or loop invariant removal). Branch instructions
may be eliminated or changed as part of the updating. The
instruction scheduling and block order updating is interactive,
because the block ordering update follows the scheduling of some
instructions, but scheduling of other instructions is done with
knowledge of the updating of block order and related branch
instructions. The scheduler can keep the execution units busier and
with better code.
[0033] In some embodiments, the invention includes a candidate
selection mechanism that can measure the cost of populating an
otherwise empty block or emptying a block. For example, when a
block is populated, an additional unconditional branch instruction
may be added that otherwise would not be included. Further, that
unconditional branch may cost more overall runtime cycles than the
savings gained by populating the block. The unconditional branch
may or may not be in the same block that is populated so the branch
may be added in a block which is executed more than the block
populated. This also means that scheduling heuristics can be driven
to empty blocks for the sole goal of eliminating branches and their
pipeline bubbles. The cost may be a global or regional cost in
terms of an estimate of change in performance in executing the
compiled program once it is compiled.
[0034] The invention differs from prior art compilers in which the
physical block order is fixed during the course of scheduling the
code. Opportunities for code improvement are thereby missed. In
some of these prior art compilers, block ordering is performed
again after scheduling all the code, and the code may then be
rescheduled. However, several iterations of block ordering and
rescheduling may be needed to realize the benefit the present
invention provides, if it could be achieved at all by the prior art
compilers. Further, it would take a significant amount of time to
perform multiple iterations of scheduling, block ordering, and
rescheduling, which in many instances would not be practical. By
contrast, in embodiments of the present invention, the compiler
considers whether to change the physical order after merely one or
a small number of instructions has been scheduled or otherwise
considered for movement, even though many more instructions are yet
to be scheduled or otherwise considered for movement. As described
below, in so doing, various opportunities to improve performance
can be identified that are missed by the prior art compilers.
[0035] As instructions are moved globally around the control
flowgraph, basic blocks become populated or emptied. This opens
opportunity for improving the code placement over what it was
before the scheduler started. A side effect of rearranging the code
placement is the modification of branches. For example,
unconditional branches may need to be added or removed from the
graph and conditional branches may need to be inverted. In a
microprocessor where branches compete for resources with other
instructions to be scheduled, dynamic code placement (updating)
exposes those branches to the scheduler so that it has an exact
view of the instructions competing for resources.
[0036] Before providing examples, the following background
information is provided. A conditional branch instruction has a
target instruction and a fall through instruction. The fall through
instruction is the next instruction in memory. It is generally true
that fewer cycles or other processor resources are used when the
instruction following the conditional branch instruction in time
order is the fall through instruction rather than the target
instruction. Accordingly, the compiler may attempt to determine
which instruction is more likely to follow the conditional branch
in time order and to make that instruction the fall through
instruction. When all of the instructions are removed from a block
it is said to be empty.
[0037] A control flowgraph is a well known representation of code
that includes blocks and edges. The blocks (also called nodes,
basic blocks, or vertices) represent instructions (also called
code). An edge represents the transfer of control from one block to
another. Control is transferred either by the execution of a branch
instruction, or by falling sequentially into the code in the
physically contiguous next block.
[0038] Physical block order (sometimes called code layout, memory
order, or physical memory order) is the order that the blocks (and
hence the instructions of the block) are assigned for the
instruction memory. Referring to FIG. 2A, in some embodiments,
during intermediate stages of the compilation process, the
instruction memory includes a current region and a remote region.
An imaginary line 24 separates the current and remote regions.
Table 1 provides definitions.
TABLE 1
Populated Block | Block in current region of instruction memory having at least one instruction which is not an unconditional branch |
Partially Empty Block | Block in remote region of instruction memory having only one instruction, which is an unconditional branch instruction |
Fully Empty Block | Block having no instructions; it is not in either the current or remote region of instruction memory |
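The three categories of Table 1 can be read as a small classification rule. The following Python sketch is illustrative only: the `Block` record, the `"br"` opcode name, and the `"unclassified"` fallback are assumptions made for the example and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNCOND_BRANCH = "br"  # hypothetical opcode name for an unconditional branch


@dataclass
class Block:
    name: str
    instrs: List[str] = field(default_factory=list)  # opcodes only, for brevity
    region: Optional[str] = None  # "current", "remote", or None (in neither region)


def classify(block: Block) -> str:
    """Return the Table 1 category of a block, or "unclassified" if none applies."""
    if not block.instrs and block.region is None:
        return "fully empty"
    if block.region == "remote" and block.instrs == [UNCOND_BRANCH]:
        return "partially empty"
    if block.region == "current" and any(op != UNCOND_BRANCH for op in block.instrs):
        return "populated"
    return "unclassified"  # a state Table 1 does not describe
```

For instance, `classify(Block("B", ["br"], "remote"))` yields `"partially empty"` under these assumptions.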
[0039] Partially empty blocks are created by inserting an
unconditional branch in a block that has been made empty through
code motion. Fully empty blocks have no instructions and are in
neither the current region nor remote region of the instruction
memory. When a populated block is emptied by code motion, the block
is made a partially empty block if it affects the control flow
between other blocks (e.g., if it is between two blocks connected
by a branch). If the emptied block does not affect the control flow
between other blocks (e.g., it separates two blocks that would be
separated in a fall through condition), it is a fully empty
block.
[0040] Referring to FIG. 2B, at the end of compilation, a final
physical block order includes only populated blocks. A path
compression technique described below may be used to remove
partially empty blocks. There are no blocks in a remote region, so
current and remote regions are not shown in FIG. 2B. In some
embodiments, there are no partially empty blocks placed in a remote
region of memory.
[0041] Referring to FIG. 3, a block order table 30 contains
information regarding the order of blocks within the physical block
order. In some embodiments, table 30 contains information regarding
populated blocks, partially empty blocks, and fully empty blocks.
As the order changes, table 30 can be updated. A function which
accesses table 30 is called LookupOrder( ). One reason to keep
track of the order for all blocks is that if an empty block is
removed, and it is later decided to return the block to the control
flowgraph, it will be known where to return it. Also, blocks (some
of which may be JS blocks, described below) may be empty before
code motion is started for a particular scheduling phase. The block
may then get populated and need to be reintroduced into the current
region of the physical order. Further, in some embodiments, it is
desirable that all blocks that may be used during scheduling be
allocated before scheduling begins. Table 30 may then hold
information regarding all these possibly populated blocks. In this
way, if the control flowgraph does not include a particular block,
that block is still accounted for. In other embodiments, table 30
might not hold information regarding all possible blocks.
[0042] In some embodiments, the control flowgraph only holds
populated blocks and partially empty blocks. In other embodiments,
the control flowgraph may only include populated blocks. In still
other embodiments, the control flowgraph may include all blocks
(populated, partially empty, and fully empty), although on
different levels. On one level, the control flowgraph could include
only populated blocks (or only populated and partially empty
blocks). On another level, the control flowgraph would include the
position of all types of blocks. Table 30 may contain this
information. Table 30 may be organized in various ways.
[0043] Various methods may be used to update physical block order
and associated branches following code motion. The following section
discusses some of these methods. It will be apparent to those
skilled in the art having the benefit of this disclosure that other
methods may be used within the scope of the invention.
[0044] B. Pseudocode and Explanation
[0045] The following pseudocode provides an exemplary high level
view of certain aspects of compiling. Statements of the pseudocode
are numbered for convenience of discussion. Different embodiments
of the invention involve different statements of the pseudocode.
Other embodiments of the invention include aspects of some or all
of the statements (as explained below). The statements do not have
to be in the order provided in the pseudocode and certain
statements of the pseudocode could be combined.
 1 Construct an initial block ordering;
 2 NormalizeCriticalEdges(CFG);
 3 RemoveEmptyBlocksAndUpdateBranches(CFG);
 4 ConstructBlockOrderingTable(CFG);
 5 rdy ← DagRoots(DDG);
 6 while (rdy ≠ ∅) do
 7   best ← BestCandidate(rdy);
 8   from ← Block(best);
 9   to ← TargetBlock(best);
10   if (Block_empty(to))
11     Bo_PopulateBlock(to);
12   fi;
13   MoveInstr(best, from, to);
14   if (Block_empty(from))
15     Bo_EmptyBlock(from);
16   fi;
17   rdy ← rdy - best;
18   rdy ← rdy ∪ RdySuccs(best);
19 od;
20 PathCompress(CFG);
[0046] In line 1, an initial block ordering is made. An instruction
ordering is made as part of the block ordering. Branch instructions
are selected as part of the block order. Various currently known or
new algorithms may be used to make this initial order.
[0047] Line 2 concerns critical edges and blocks, called JS blocks,
that may be positioned on the critical edges if needed. (CFG stands
for control flowgraph.) In some embodiments, it is desirable that
the number of blocks and paths remains constant during scheduling.
Accordingly, at least in these embodiments, the JS blocks are
created before the scheduling begins. Referring to FIG. 4A, a
critical edge exists between a split node, i.e., a node with
multiple successors (e.g., block S), and a join node, i.e., a node
with multiple predecessors (e.g., block J). The JS block is
positioned on the critical edge, thereby replacing the edge with two
non-critical edges, one between S and the JS block, and the other
between the JS block and J. If later as part of scheduling, an
instruction I is moved from block J to block B, a copy of
instruction I can be moved to the JS block, as shown in FIG. 4B. A
copy of instruction I in the JS block is referred to as
compensation code. Accordingly, a JS block may be an empty block or
a populated block depending on whether it actually holds any
compensation code. In some embodiments, the JS blocks are placed in
the physical order and in the control flowgraph. In other
embodiments, the JS blocks are only placed in the control
flowgraph. (In other embodiments, the JS block is not created until
it is needed to hold compensation code.)
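The critical-edge normalization of line 2 can be sketched as follows. This is a minimal Python illustration, assuming the control flowgraph is given as a plain list of (source, destination) edges; the function name and the `JS_<S>_<J>` naming scheme are hypothetical and are not taken from the application.

```python
from collections import defaultdict


def normalize_critical_edges(edges):
    """Split every critical edge (S, J) by inserting an empty JS block on it.

    `edges` is a list of (source, destination) block names. An edge is treated
    as critical when its source has multiple successors and its destination
    has multiple predecessors. Returns (new_edges, js_blocks).
    """
    succs = defaultdict(set)
    preds = defaultdict(set)
    for src, dst in edges:
        succs[src].add(dst)
        preds[dst].add(src)

    new_edges, js_blocks = [], []
    for src, dst in edges:
        if len(succs[src]) > 1 and len(preds[dst]) > 1:
            js = f"JS_{src}_{dst}"  # hypothetical name for the inserted block
            js_blocks.append(js)
            new_edges += [(src, js), (js, dst)]  # two non-critical edges
        else:
            new_edges.append((src, dst))
    return new_edges, js_blocks
```

With the hypothetical edges `[("S", "J"), ("S", "X"), ("B", "J")]`, only the edge from S to J is split, since S is a split node and J is a join node.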
[0048] In line 3, empty blocks are removed from the initially
constructed control flowgraph and affected branches are removed or
changed. It may be that most of the empty blocks are JS blocks that
were inserted in the statement of line 2. The code after removal of
empty blocks is initial code as viewed by the instruction
scheduler. The code may be the final position the blocks and
branches would be in if there was no code motion (if no code was
moved outside of its own block).
[0049] In line 4, a physical global code ordering is constructed
for all blocks whether populated or empty to create table 30 (shown
in FIG. 3). Various algorithms, including well known graph layout
algorithms, may be used to create the ordering. Line 4 is similar
to prior art activities except that there may be partially empty
blocks in the remote region of the physical memory order. This
ordering may be computed ignorant of the number of instructions in
any block. This ordering provides the basis of the function
append_block ← LookupOrder(b), for any block b which needs to be
reintroduced into the graph. This answers the question of where to
place a newly populated block in the physical block order. The
following provides additional information regarding block order
table 30 and LookupOrder(b) in some embodiments. (In other
embodiments, the details are different.) Table 30 includes a block
order array, which is an array of pointers to blocks. The pointers
in the array are in the same order as the blocks in the ideal
physical block ordering computed by
"ConstructBlockOrderingTable(CFG)". For example, if the physical
block order computed by "ConstructBlockOrderingTable(CFG)" were A,
B, C, and D, then the Block Order Array (BOA) would contain:
1. Pointer to A.
2. Pointer to B.
3. Pointer to C.
4. Pointer to D.
[0050] "ConstructBlockOrderingTable(CFG)" associates a physical
order number (PON) with each block. That is, each block has a
number N such that it is the Nth block in the physical order from
the beginning. So initially block "C" has the number 3 in the above
example. As an example, Block_Order_Array[PON(B)] → B.
[0051] When an emptied block is determined to be moved to the
remote region of the physical block order, its pointer is removed
from the Block Order Array (BOA). That is, its pointer is set to
empty (Null). When a block is populated and moved to the current
region, a pointer to it is reinserted into its position
in the BOA. For example, BOA[PON(B)]=Pointer to B. The net effect
of this is that the BOA indicates which blocks are in the current
region of the physical block order. This may be used to indicate
which blocks it is believed will be path compressed away (although
that may change) and where to reinsert blocks which are to be moved
to the current region. For example, if block B were emptied and
moved to the remote region, then BOA[2] would be set to empty
(Null). Assume block C becomes emptied and moved to the remote
region. Its entry BOA[3] would be set to empty (Null). Finally,
assume block C is populated and is to be moved back to the current
region. The BOA table is used to indicate after which block C
should be appended. Since the BOA entry 2 immediately before C is
empty, we look at the entry 1 before that to find that A is indeed
in the true physical order. Block A becomes the block after which
the newly populated block C is appended.
[0052] The following is pseudo code for
append_block ← LookupOrder(b) in some embodiments:
index ← PON(b);
do {
    index ← index - 1;
    mark ← BOA[index];
} while (mark == Null);
return (mark);
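The Block Order Array and the lookup above can be exercised with the A, B, C, D example from the text. The following Python sketch is a hedged illustration: it stores block names rather than pointers, the class and method names are hypothetical, and returning `None` when no earlier block remains in the current region is an assumption, since the pseudocode does not say what happens in that case.

```python
class BlockOrderTable:
    """Sketch of table 30: a Block Order Array (BOA) indexed by physical order number (PON)."""

    def __init__(self, ideal_order):
        # PON values start at 1, as in the text; slot 0 of the BOA is unused.
        self.pon = {name: i + 1 for i, name in enumerate(ideal_order)}
        self.boa = [None] + list(ideal_order)

    def move_to_remote(self, name):
        # An emptied block moved to the remote region has its BOA entry set to empty.
        self.boa[self.pon[name]] = None

    def move_to_current(self, name):
        # A populated block moved back to the current region is reinserted at its own PON.
        self.boa[self.pon[name]] = name

    def lookup_order(self, name):
        # Scan backwards from the entry before `name` to the nearest block
        # that is still in the current region.
        index = self.pon[name] - 1
        while index >= 1 and self.boa[index] is None:
            index -= 1
        return self.boa[index] if index >= 1 else None  # None: assumed "no predecessor"


table = BlockOrderTable(["A", "B", "C", "D"])
table.move_to_remote("B")
table.move_to_remote("C")
append_block = table.lookup_order("C")  # block after which C would be appended
```

Under these assumptions `append_block` is `"A"`, matching the walk through entries 2 and 1 described in the text.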
[0053] Lines 5-20 provide a high level description of some
embodiments of an instruction scheduler that interfaces with block
ordering. Instruction scheduling is the assigning of an instruction
to an execution unit for a particular cycle.
[0054] In line 5, DDG refers to the data dependency graph. As is
well known, a data dependency graph lists dependencies of
instructions. If an instruction has a dependency, it cannot be
scheduled. DagRoots(DDG) provides those instructions that are not
dependent on another instruction for that cycle. In some
embodiments, the scheduler is a top down scheduler. Rdy are those
instructions that are ready to be scheduled.
[0055] Line 6 includes the start of a while do loop that extends
from line 6 to line 19. The do loop continues while there are
instructions to be scheduled. Note that "od" in line 19 is the end
of the do loop. In lines 12 and 16, "fi" is the end of the "if"
section beginning in lines 10 and 14, respectively.
[0056] In line 7, best is the best instruction ready to be
scheduled. Various techniques, including well known techniques, can
be used to determine which is the best instruction. However, as
described above, in addition to using general scheduling practices,
the "BestCandidate( )" statement can look ahead as to what would be
the global or regional cost of various possible instructions. The
result could be fed back to the BestCandidate( ) function. One
embodiment of this look-ahead feature is described in connection
with FIG. 5. Referring to FIG. 5, as shown in box 50, the
BestCandidate(rdy) function selects a possible best instruction
(similar to line 7). As shown in box 52, the "from" and "to" blocks
are selected (similar to lines 8 and 9). As shown in box 54, the
instruction is moved and populate and empty functions are performed
as needed (similar to lines 10-15). As shown in box 56, the cost of
the proposed move is assessed. (A negative cost is a benefit.) As
shown in box 58, the states of the control flowgraph and physical
memory order may be restored and the result of the assessment is
fed back to the scheduler in BestCandidate(rdy). Note that the code
used to predict and assess cost may be the same as, or different
from, the code used to update block order. The same or different
memory may be used for the two.
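The trial-move, assess, and restore sequence of boxes 54 to 58 might be sketched as below. This Python fragment is an assumption-laden illustration: the compilation state is modeled as a plain dictionary, and `apply_move` and `cost` are hypothetical callables supplied by the caller rather than functions defined in the application.

```python
import copy


def assess_move(state, apply_move, cost):
    """Return cost(after) - cost(before) for a trial move, leaving `state` as it was.

    A negative result is a benefit, following the convention in the text.
    """
    saved = copy.deepcopy(state)   # remember what to restore (box 58)
    before = cost(state)
    apply_move(state)              # trial code motion plus populate/empty (box 54)
    delta = cost(state) - before   # assess the proposed move (box 56)
    state.clear()
    state.update(saved)            # restore control flowgraph and physical order
    return delta


# Hypothetical trial: the move pulls block C into the order and adds one branch.
state = {"order": ["A", "B", "D"], "branches": 0}


def trial(s):
    s["order"].insert(2, "C")
    s["branches"] += 1


delta = assess_move(state, trial, cost=lambda s: s["branches"])
```

Here the toy cost function simply counts unconditional branches, so the trial reports a cost of 1 while `state` is left unchanged for the next candidate.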
[0057] An example of how a change may have a benefit for one small
area of code, but be globally harmful to performance, is provided
as follows with reference to FIG. 6A (control flowgraph) and FIG.
6B (physical order). Assume blocks A, B, and D each have
instructions. Block C is an empty JS block. Further assume path A B
D is more likely than path A C D. Instruction "i" is considered for
scheduling from block D into block B. In some situations, this
motion may place block C between block B and D in the physical
order. A side-effect of placing the block C into the current region
of the physical order is adding an unconditional branch into block
B (since it would no longer fall into D). Adding the unconditional
branch into B may cost more overall runtime cycles than the savings
from moving instruction i into B. Different heuristics may lead to
placing C in different places. However, as described above, the
cost of the different placements can be determined ahead of time
and used in the decision of scheduling.
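The trade-off in this example can be put in numbers. The short Python sketch below uses purely illustrative figures (execution counts and cycle penalties are assumptions, not values from the application) to show how a locally attractive move can have a positive, i.e. harmful, net cost.

```python
def net_cycle_cost(branch_block_count, branch_cycles, saving_block_count, cycles_saved):
    """Cycles added by a new unconditional branch minus cycles saved by the code motion.

    Counts are how often the affected blocks execute; a negative result is a benefit.
    """
    added = branch_block_count * branch_cycles
    saved = saving_block_count * cycles_saved
    return added - saved


# Hypothetical profile: block B runs 90 times and block D runs 100 times.
# Assume the new branch in B costs 2 cycles per execution and the move of
# instruction i saves 1 cycle per execution of D.
cost = net_cycle_cost(90, 2, 100, 1)
```

With these assumed figures the branch adds 180 cycles while the move saves 100, so the net cost is 80 cycles and the scheduler would reject the move or choose a different placement for C.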
[0058] In line 8, a block called "from" is identified. In line 9, a
block "to" is identified. Block "from" is the block the best
instruction moves from and block "to" is the block it moves to. The
"from" block may be called the source block and the "to" block may
be called the target block.
[0059] In line 10, it is determined whether the block "to" was
empty (including fully or partially empty) before the best
instruction was moved into it. In line 11, if the block was empty,
then it is inserted into the block order using
append_block ← LookupOrder(b). In the case of partially empty
blocks, for example, an unconditional branch may need to be
removed. Populating may involve introducing other blocks into the
control flowgraph, removing blocks from the control flowgraph, and
updating conditional and unconditional branches and the testing of
their readiness.
[0060] In line 13, the instruction "best" is moved from block
"from" to block "to".
[0061] In line 14, it is determined whether the block "from" is
empty after the instruction is moved out of it. In line 15, if it
is now empty, it may be removed from the current region if need be.
This may include removing other blocks from the control flowgraph,
adding blocks to the control flowgraph, or updating conditional or
unconditional branches and the testing of their readiness.
[0062] In line 17, the best instruction is removed from the set of
ready instructions.
[0063] In line 18, each instruction that depended on the best
instruction is now ready, as long as it is not dependent on
something else.
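The ready-set bookkeeping of lines 5, 6, 7, 17, and 18 can be sketched in isolation. The Python below is a simplified illustration that omits the block moves of lines 8-16; the dependency-map representation and the `pick` parameter (standing in for BestCandidate) are assumptions made for the example.

```python
def schedule_order(deps, pick=min):
    """Return an order in which instructions leave the ready set.

    `deps` maps each instruction to the set of instructions it depends on.
    `pick` stands in for BestCandidate; `min` is only a placeholder heuristic.
    """
    remaining = {instr: set(d) for instr, d in deps.items()}
    rdy = {instr for instr, d in remaining.items() if not d}  # line 5: DagRoots
    order = []
    while rdy:                                   # line 6
        best = pick(rdy)                         # line 7
        order.append(best)
        rdy.remove(best)                         # line 17
        del remaining[best]
        for instr, d in remaining.items():       # line 18: RdySuccs
            if best in d:
                d.discard(best)
                if not d:                        # no other unmet dependency
                    rdy.add(instr)
    return order


deps = {"i1": set(), "i2": {"i1"}, "i3": {"i1", "i2"}, "i4": set()}
order = schedule_order(deps)
```

In this toy graph `i3` becomes ready only after both `i1` and `i2` have been scheduled, which is the "not dependent on something else" condition of line 18.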
[0064] In line 20, a form of transitive reduction called path
compression is applied on the targets of conditional and
unconditional branches that have empty blocks as their targets.
This has the effect of removing any empty blocks that are not used
after the instruction scheduling phase. This reduction has no
effect on the modeling of branches or the ability to schedule
branches well and so is performed after scheduling. Path compression is
illustrated in examples below.
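The path compression step can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the dictionary-based representation and the names resolve, compress_paths, branch_targets, and successor are assumptions introduced here, and each empty block is assumed to lead to exactly one successor.

```python
def resolve(block, empty, successor):
    """Follow a chain of empty blocks to the first non-empty block."""
    seen = set()
    while block in empty:
        if block in seen:  # guard against a cycle made only of empty blocks
            raise ValueError("cycle of empty blocks")
        seen.add(block)
        block = successor[block]
    return block


def compress_paths(branch_targets, empty, successor):
    """Retarget every branch that points at an empty block.

    branch_targets: maps a branching block to the block its branch targets
    empty:          set of blocks left empty by scheduling
    successor:      maps each empty block to the single block it leads to
    Branches that originate in empty blocks are dropped with those blocks.
    """
    return {src: resolve(dst, empty, successor)
            for src, dst in branch_targets.items()
            if src not in empty}


# Hypothetical input loosely modeled on FIG. 7B: B is empty and leads to E.
targets = compress_paths({"A": "B", "B": "E"}, empty={"B"}, successor={"B": "E"})
```

In this sketch the branch in A ends up targeting E directly. A later step (not shown) could delete a retargeted branch whose target is also the fall through block.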
[0065] As an example, scheduling is included in lines 7-9 of the do
loop. However, in contrast to the prior art, the control flowgraph
and physical memory order may change (see lines 10-15) during
scheduling. From one perspective, the scheduler uses the populate
and empty functions as utilities. From another perspective, the
populate and empty functions are part of the scheduler. Branches
are added, removed, or inverted (switching target and fall through)
as part of the populate and empty functions. The compiler of the
present invention can take advantage of opportunities to improve
code on the fly. The scheduler knows of the change to
the physical order and the related changes to or elimination of
branches and can take them into account in scheduling later instructions.
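The interaction between scheduling and block ordering described for lines 7-18 can be sketched as a single loop. This is a simplified illustration under stated assumptions: the names schedule, plan, deps, and lookup_order are hypothetical, the priority function is a placeholder (alphabetical order), the physical order is modeled as a list holding only non-empty blocks, and the branch updates performed by the populate and empty functions are omitted.

```python
def schedule(blocks, order, plan, deps, lookup_order):
    """Schedule instructions while keeping the physical block order up to date.

    blocks:       block name -> list of instructions (modified in place)
    order:        physical order of the non-empty blocks (modified in place)
    plan:         instruction -> (from_block, to_block), an assumed stand-in
                  for the scheduler's choice of source and target blocks
    deps:         instruction -> set of instructions it depends on
    lookup_order: hypothetical callback giving the position at which a newly
                  populated block is inserted into the order
    """
    scheduled, done = [], set()
    ready = {i for i in plan if not deps.get(i, set())}
    while ready:
        best = min(ready)                  # placeholder priority function
        src, dst = plan[best]              # the "from" and "to" blocks
        if not blocks[dst]:                # target was empty: populate it
            order.insert(lookup_order(dst, order), dst)
        blocks[src].remove(best)           # move the instruction
        blocks[dst].append(best)
        if not blocks[src]:                # source is now empty: remove it
            order.remove(src)
        ready.remove(best)
        done.add(best)
        scheduled.append(best)
        for i in plan:                     # release instructions whose
            if i not in done and i not in ready and deps.get(i, set()) <= done:
                ready.add(i)               # dependences are all satisfied
    return scheduled


# Hypothetical region: b1 depends on a1 and moves from block B into empty block C.
blocks = {"A": ["a1"], "B": ["b1"], "C": []}
order = ["A", "B"]
result = schedule(blocks, order,
                  plan={"a1": ("A", "A"), "b1": ("B", "C")},
                  deps={"b1": {"a1"}},
                  lookup_order=lambda block, current: len(current))
```

After the call, block C has been inserted into the order and block B has been removed from it, so later scheduling decisions in the same pass see the updated physical order.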
[0066] C. Examples
[0067] FIGS. 7A, 7B, and 7C illustrate an example of how branches
can be changed during the scheduling process. Referring to FIG. 7A,
a control flowgraph 60 includes blocks A, B, C, D, and E. The
arrows represent edges between blocks. Assume that during the
course of scheduling, blocks B and C have their instructions moved
up out of their blocks so that blocks B and C are empty blocks.
[0068] FIG. 7B shows control flowgraph 60 following the code motion
of removing the instructions of blocks B and C. FIG. 7C includes
columns 64, 66, and 68 that illustrate the physical block order at
different stages of compilation. Column 64 shows the physical block
order before blocks B and C are emptied. Column 66 shows the
physical block order after blocks B and C are emptied, but before
path compression. Column 68 shows the physical order after path
compression.
[0069] As illustrated in column 66, when it is determined that
block B has been emptied, block B is placed in the remote region of
physical memory (see FIG. 2A). The unconditional branch
instruction from A to C is removed since C is on the fall through
to E. B branches back to E. B is taken to the remote region so that
the number of blocks and paths may remain constant during
scheduling. By moving B to the remote region, the branch in A may
be removed by path compression at the end of scheduling and there
is one fewer branch in the scheduled code. In embodiments in which
the number of blocks and paths does not have to remain constant
during scheduling, B may disappear without going to the remote
region. Another reason to place B in the remote region until path
compression is that if it is determined that B should be
re-populated, it may be easier to move it back to the current
region of memory. The unconditional branch that was removed can be
reinserted at the end of block A.
[0070] When C is emptied, it is not taken to the remote region
because when C is removed, A falls through to E rather than falls
through to C. There is no branch instruction in A to remove (other
than the one to B which will be removed through path compression).
An advantage of some embodiments of the present invention is
that the scheduler will know that the branch instruction will be
removed. Therefore, depending on the circumstances, it may be able
to schedule another instruction for the execution unit that would
have received the branch instruction, or other instructions for
execution units which would have been unavailable due to the branch
being needed that same cycle. If removal of the branch instruction
were deferred until the completion of scheduling, the
opportunity to schedule another instruction in its place may be
lost.
[0071] FIGS. 8A-8C illustrate an example of updating the block
order to expose a scheduling opportunity referred to as multi-way
branches. In certain processors, multi-way branching occurs when
multiple branches are concurrently executed in different execution
units in the same cycle. In some processors, the branch
instructions have to be in contiguous memory locations. Compilers
have been used to try to place branch instructions next to each
other in physical memory (when it otherwise is a good use of
resources) to take advantage of multi-way branching capability. The
inventors of the present invention do not claim to have invented
multi-way branching or using a compiler to align branches in
contiguous memory locations. However, the present invention can
identify opportunities for multi-way branching that might be missed
by prior art compilers.
[0072] For example, referring to FIG. 8A, a control flowgraph 70
includes blocks A, B, C, D, E, F, and G. (Note that in the examples
of FIGS. 7A-7C and 8A-8C, there may be additional blocks that are
not shown in the figures.) Assume that during the course of
scheduling, blocks B and D have their instructions moved up out of
their blocks. After this code motion, control flowgraph 70 would
look like it does in FIG. 8B. FIG. 8C includes columns 74, 76, and
78. Column 74 represents the physical order of blocks A-G before
code motion and corresponds to control flowgraph 70 in FIG. 8A.
Blocks A-E are in physically contiguous memory locations. The "***"
symbols in columns 74, 76, and 78 represent that blocks F and G are
in memory locations that are not necessarily physically contiguous
with block E. Blocks A and C each have conditional branches. Table
2, below, lists the target and fall through instructions of the
conditional branches before code motion (see FIG. 8A and column 74
of FIG. 8C) and after code motion and block order updating (see
FIG. 8B and column 78 of FIG. 8C).
TABLE 2

Conditional branch       Before code motion (i.e.,       After code motion and
instruction              moving instructions out of      block order updating
                         blocks B and D)
-----------------------  ------------------------------  ------------------------------
Block A, target          first instruction of block C    first instruction of block F
Block A, fall through    first instruction of block B    first instruction of block C
Block C, target          first instruction of block E    first instruction of block G
Block C, fall through    first instruction of block D    first instruction of block E
[0073] Column 76 illustrates an intermediate state of the physical
order during the block order updating. Switching the target and
fall through instruction of a conditional branch is referred to as
inverting the conditional branch. In the example, the conditional
branches are considered inverted because the target instruction
prior to code motion becomes the fall through instruction, although
the fall through instruction prior to code motion is removed from
blocks B and D. With the physical order of column 78, the
conditional branch instructions of blocks A and C may be used in a
multi-way branch of a processor that supports multi-way branching.
This type of opportunity cannot be exposed without updating the
block order dynamically in response to code motion. A prior art
compiler will not regularly find these opportunities created by
code motion.
[0074] The question arises: why not invert the conditional branch
of block A even if there is no code motion? The answer is that it
is assumed that for other reasons, the physical order of column 74
is preferred. The edge A→B may be a higher probability edge
so that block A would preferably fall into block B to save cycles.
However, once block B becomes empty in the example, then the
opportunity for improvement on the less probable path becomes
exposed.
[0075] In summary, the updating exposes added or changed branches
or other instructions to scheduling. Further, removed branches or
other instructions can make room for other instructions to be
scheduled.
[0076] D. Regeneration of Predicate Expressions to Invert
Conditional Branches
[0077] Another advantage of some embodiments of the invention
is that the scheduler can know when to regenerate the inverse sense
of a complex branch predicate expression for a branch that needs to
be inverted before those expressions are scheduled. In some cases,
the predicate qualifying the branch is defined by a very long
complex sequence of compares. In prior art compilers, the inverse
sense of the branch may be so complicated, that code generation may
have to be redone. However, with the present invention in which
scheduling and physical ordering are interactive, if it is noticed
the branch needs to be inverted, the compares can be regenerated
before they are scheduled. If-conversion may be used to regenerate
predicate expressions.
[0078] Consider an example in which predicate expressions are
regenerated to invert a conditional branch. FIG. 9A illustrates a
control flowgraph before an if-conversion. FIG. 9B illustrates a
control flowgraph for predicate region (1, 2, 3, and 4) after the
if-conversion. FIG. 9C illustrates a physical order after the
if-conversion. To generate the compares for the conditional branch
which ends block 1, the condition for block 5 or block 6 being true
is computed. The condition used depends on whether the conditional
branch at the end of block 1 is taken to reach block 5 or block 6.
This decision is made by LookupOrder( ) and may change during
the course of scheduling since block 5 or 6 may become emptied or
populated. Accordingly, when the conditional branch target changes,
the conditional branch at the end of block 1 is inverted which may
involve regenerating a different set of conditions for the branch
to be taken. For the example, "p" stands for the block predicate
(e.g., a Boolean value that is true if and only if control flows
through the associated block) and "c" stands for the Boolean
condition computed in the associated block. The associated block is
indicated by the number following the letter "c" or "p".
    p1 = True
    p2 = (c1 == True)
    p3 = (c1 == False)
    p4 = (c1 == False .or. (p2 == True .and. c2 == False))
    p5 = (p2 == True .and. c2 == True) .or. (p4 == True .and. c4 == True)
    p6 = (p4 == True .and. c4 == False)
[0079] From the Boolean algebra, computing the predicate for block
5 ("p5") to be executed has one more term than the expression for
computing the conditions for block 6 ("p6"). Therefore, assuming
each term of the expression takes one compare instruction to
compute, inverting the conditional branch at the end of block 1
will involve regenerating different comparison conditions. The two
predicate expressions have different resource requirements and so
should be exposed to the instruction scheduler as early as possible
to guarantee the best schedule. When one of the blocks is emptied
and it is known the conditional branch should be inverted, the
comparison expression instructions can be regenerated and there is
still a chance to schedule them in one top-down pass. (In other
embodiments, details of regeneration of predicate expressions may
be different.)
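The predicate expressions above can be checked with a short sketch that evaluates them for every combination of the block conditions. The function name block_predicates is hypothetical, and Python Boolean operators stand in for the .and. and .or. notation. The sketch only evaluates the listed expressions and does not model compare instructions or their scheduling.

```python
from itertools import product


def block_predicates(c1, c2, c4):
    """Evaluate p1..p6 from the conditions computed in blocks 1, 2, and 4."""
    p1 = True
    p2 = c1
    p3 = not c1
    p4 = (not c1) or (p2 and not c2)
    p5 = (p2 and c2) or (p4 and c4)   # two terms
    p6 = p4 and not c4                # one term
    return {1: p1, 2: p2, 3: p3, 4: p4, 5: p5, 6: p6}


# For every combination of conditions, control reaches exactly one of
# block 5 and block 6, so p5 and p6 are complementary.
complementary = all(
    block_predicates(c1, c2, c4)[5] != block_predicates(c1, c2, c4)[6]
    for c1, c2, c4 in product([False, True], repeat=3)
)
```

Because p5 and p6 are complementary, either one can qualify the conditional branch at the end of block 1. Which of the two is generated depends on which block is the branch target, and the two expressions differ in the number of terms to compute.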
[0080] E. Additional Information and Embodiments
[0081] The present invention may be used over an arbitrary number
of blocks (including the entire program).
[0082] If the specification states a component, feature, structure,
or characteristic "may", "might", or "could" be included, that
particular component, feature, structure, or characteristic is not
required to be included.
[0083] In FIGS. 2A and 2B, in a multithreaded version of the compiler,
there might or might not be more than one physical order in
parallel, depending on the implementation.
[0084] Those skilled in the art having the benefit of this
disclosure will appreciate that many other variations from the
foregoing description and drawings may be made within the scope of
the present invention. Accordingly, it is the following claims
including any amendments thereto that define the scope of the
invention.
* * * * *