U.S. patent number 5,497,499 [Application Number 08/219,425] was granted by the patent office on 1996-03-05 for superscalar risc instruction scheduling.
This patent grant is currently assigned to Seiko Epson Corporation. Invention is credited to Sanjiv Garg, Kevin R. Iadonato, Le T. Nguyen, Johannes Wang.
United States Patent |
5,497,499 |
Garg , et al. |
March 5, 1996 |
**Please see images for:
( Certificate of Correction ) ** |
Superscalar risc instruction scheduling
Abstract
A register renaming system for out-of-order execution of a set
of reduced instruction set computer instructions having addressable
source and destination register fields, adapted for use in a
computer having an instruction execution unit with a register file
accessed by read address ports and for storing instruction
operands. A data dependance check circuit is included for
determining data dependencies between the instructions. A tag
assignment circuit generates one of more tags to specify the
location of operands, based on the data dependencies determined by
the data dependance check circuit. A set of register file port
multiplexers select the tags generated by the tag assignment
circuit and pass the tags onto the read address ports of the
register file for storing execution results.
Inventors: |
Garg; Sanjiv (Freemont, CA),
Iadonato; Kevin R. (San Jose, CA), Nguyen; Le T. (Monte
Sereno, CA), Wang; Johannes (Redwood City, CA) |
Assignee: |
Seiko Epson Corporation (Tokyo,
JP)
|
Family
ID: |
25333867 |
Appl.
No.: |
08/219,425 |
Filed: |
March 29, 1994 |
Related U.S. Patent Documents
|
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
Issue Date |
|
|
860719 |
Mar 19, 1992 |
|
|
|
|
Current U.S.
Class: |
712/217;
712/E9.046; 712/E9.049; 712/23 |
Current CPC
Class: |
G06F
9/3824 (20130101); G06F 9/3013 (20130101); G06F
9/384 (20130101); G06F 9/3838 (20130101); G06F
9/3855 (20130101) |
Current International
Class: |
G06F
9/38 (20060101); G06F 009/38 () |
Field of
Search: |
;395/800,375,475,395 |
References Cited
[Referenced By]
U.S. Patent Documents
Foreign Patent Documents
|
|
|
|
|
|
|
0515166 |
|
Nov 1992 |
|
EP |
|
0533337 |
|
Mar 1993 |
|
EP |
|
WO91/20031 |
|
Dec 1991 |
|
WO |
|
Other References
Keller, "Lock-Ahead Processors"; Dec. 1975, pp. 177-194. .
Dywer, A Multiple, Out-of-Order, Instruction Issuing System For
SuperScaler Processors, (All); Aug. 1991. .
Lightner et al., "The Metaflow Lightning" Chip Set Mar. 1991 IEEE
Lightning Outlined, Microprocessor Report Sep. 1990. .
Hwu, Wen--mei, Steve Melvin, Mike Shebanow, Chein Chen, Jia--juin
Wei, Yale Patt, "An HPS Implementation of VAX: Initial Design and
Analysis", Proceedings of the Nineteenth Annual Hawaii
International Conference on System Sciences, pp. 282-291, 1986.
.
Hwu et al., "Experiments with HPS, a Restricted Data Flow
Microarchitecture for High Performance Computers", Compcon 86,
1986. .
Hwu, Wen--mei and Yale N. Patt, "HPSm, a High Performance
Restricted Data Flow Architecture Having Minimal Functionality",
Proceedings of the 18th International Symposium on Computer
Architecture, pp. 297-306, Jun. 1986. .
Yale N. Patt, Stephen W. Melvin, Wen--Mei Hwu, Michael C. Shebanow,
Chein Chen, Jiajuin Wei, "Run-Time Generation of HPS
Microinstructions From a VAX Instruction Stream", Proceedings of
Micro 19 Workshop, New York, pp. 1-7, Oct. 1986. .
Swenson, John A. and Yale N. Patt, "Hierarchical Registers for
Scientific Computers", St. Malo '88, University of California at
Berkeley, pp. 346-353, 1988. .
Butler, Michael and Yale Patt, "An Improved Area-Efficient Register
Alias Table for Implementing HPS", University of Michigan, Ann
Arbor, Michigan, pp. 1-15, Jan. 1990. .
Uvieghara, G. A., W. Hwu, Y. Nakagome, D. K. Jeong, D. Lee, D. A.
Hodges, Y. Patt, "An Experimental Single-Chip Data Flow CPU",
Symposium on ULSI Circuits Design Digest of Technical Papers, May,
1990. .
Melvin, Stephen and Yale Patt, "Exploiting Fine-Grained Parallelism
Through a Combination of Hardware and Software Techniques",
Proceedings From ISCA-18, pp. 287-296, May, 1990. .
Butler, Michael, Tse--Yu Yeh, Yale Patt, Mitch Alsup, Hunter Scales
and Michael Shebanow, "Single Instruction Stream Parallelism Is
Greater Than Two" Proceedings of ISCA-18, pp. 276-286, May. .
Uvieghara, Gregory A., Wen--mei, W. Hwu, Yoshinobu Nakagone,
Deog--Kyoon Jeong, David D. Lee, David A. Hodges and Yale Patt, "An
Experimental Single-Chip Data Flow CPU", IEEE Journal of
Solid-State Circuits, vol. 27, No. 1, pp. 17-28, Jan. 1992. .
Gee, Jeff, Stephen W. Melvin, Yale N. Patt, "The Implementation of
Prolog via VAX 8600 Microscode", Proceedings of Micro 19, New York
City, pp. 1-7, Oct. 1986. .
Hwu, Wen--mei Hwu and Yale N. Patt, "Design Choices for the HPSm
Microprocessor Chip", Proceedings of the Twentieth Annual Hawaii
International Conference on System Sciences, pp. 330-336, 1987.
.
Wilson, James E., Steve Melvin, Michael Shebanow, Wen--mei Hwu and
Yale N. Patt, "On Turning the Microarchitecture of an HPS
Implementation of the VAX", Proceedings of Micro 20, pp. 162-167,
Dec. 1987. .
Hwu, Wen--mei and Yale N. Patt, "HPSm2: A Refined Single-chip
Microengine", HICSS '88, pp. 30-40, 1988. .
"Critical Issues Regarding HPS, A High Performance
Microarchitecture", Yale N. Patt, Stephen W. Melvin, Wen--Mei Hwu
and Michael C. Shebanow; The 18th Annual Workshop on
Microprogramming, Pacific Grove, California, Dec. 3-6, 1985, IEEE
Computer Order No. 653, pp. 109-116. .
"HPS, A New Microarchitecture: Rationale and Introduction", Yale N.
Patt, Wen--mei Hwu and Michael C. Shebanow; The 18th Annual
Workshop on Microprogramming, Pacific Grove, California , Dec. 3-6,
1985; IEEE Computer Society Order No. 653, pp. 103-108. .
Peleg et al., "Future Trends in Microprocessors: Out-of-Order
Execution, Spec. Branching and Their CISC Performance Potential",
Mar. 1991. .
Lightner et al., "The Metaflow Architecture", pp. 11-12, 63-68,
IEEE Micro Magazine, June. 1991. .
John L. Hennessy & David A. Patterson, Computer Architecture A
Quantitative Approach, Ch. 6.4, 6.7 and p. 449, 1990. .
Smith et al., "Limits on Multiple Instruction Issue", Computer
Architecture News, vol. 17, No. 2, Apr. 1989. .
Popescu et al., "The Metaflow Architecture", IEEE Micro, vol. 11,
No. 3, Jun. 1991. .
"Superscaler Microprocessor Design", by Mike Johnson, (Prentice
Hall, Englewood Cliffs, New Jersey, 1991; Chapters 2, 6 and 7).
.
Gibbons et al., "Efficient Instruction Scheduling for a Pipelined
Architecture, " Sigplan Notices, vol. 21, No. 2, Jul. 1986, pp.
11-16. .
R. D. DeGroot, "Method for Prioritizing Waiting Arithmetic
Instructions, " IBM Technical Disclosure Bulletin, vol. 27, 7A,
Dec. 1984, pp. 3735-3736..
|
Primary Examiner: Bowler; Alyssa H.
Assistant Examiner: Donoghue; L.
Attorney, Agent or Firm: Sterne, Kessler, Goldstein &
Fox
Parent Case Text
This application is a continuation of application Ser. No.
07/860,719, filed Mar. 31, 1992, now abandoned.
CROSS-REFERENCE TO RELATED APPLICATIONS
The following are commonly owned, co-pending applications:
"Semiconductor Floor Plan and Method for a Register Renaming
Circuit", Ser. No. 07/860,718, concurrently filed with the present
application;
"High Performance RISC Microprocessor Architecture", Ser. No.
07/817,810, filed Jan. 8, 1992;
"Extensible RISC Microprocessor Architecture", Ser. No. 07/817,809,
filed Jan. 8, 1992.
The disclosures of the above applications are incorporated herein
by reference.
Claims
What is claimed is:
1. A register renaming method for performing out-of-order execution
of a set of computer instructions adapted for use in a computer,
the computer comprising an instruction execution unit with a
temporary buffer accessed by read address ports and for storing
instruction operands, the method comprising the steps of:
(1) determining data dependencies between instructions in an
instruction window to produce data dependency results;
(2) storing, in a temporary buffer, out-of-order execution results
in storage locations determined by the location of an instruction
in said instruction window;
(3) generating one or more tags to specify an address in the
temporary buffer at which said out of order results are temporarily
stored; and
(4) outputting a tag in place of a register address for an operand
of a first instruction if said data dependency results indicate
that said first instruction is dependent on a previous instruction
in said instruction window, wherein said tag represents an address
of said operand in said temporary buffer.
2. The method of claim 1, wherein said determining step further
comprises the step of comparing the addresses of the source
register field of each instruction to the addresses of the
destination register fields.
3. The method of claim 1, further comprising the step of storing
in-order results in the register file and the temporary storage
means.
4. The method of claim 1, further comprising the step of passing
said out-of-order results to the register files in-order after
execution of the set of instructions is completed.
5. The register renaming method of claim 1, further comprising the
step of transferring said execution results in said temporary
buffer to a register file in-order based on the order of
instructions in the instruction window.
6. The register renaming method of claim 5, further comprising the
step of transferring a group of retirable instructions from said
temporary buffer to said register file simultaneously.
7. The register renaming method of claim 6, further comprising the
step of transferring an execution result from said temporary buffer
to said register file when all execution results for all prior
instructions are retirable.
8. A register renaming system for a set of RISC instructions in an
instruction window, adapted for use in a computer comprising an
instruction execution unit which executes the instructions to
generate execution results, the system comprising:
data dependence check means for determining data dependencies
between the instructions in the instruction window;
means for assigning instruction tags to the execution results of
each instruction in the instruction window based on the location of
each instruction in the instruction window; and
storage means including,
register file locations for storing retired instruction results,
and
temporary storage locations for temporarily storing the execution
results according to said instruction tags prior to completion of
all the instructions in the set of RISC instructions, wherein said
instruction tags represent addresses within said temporary storage
locations.
9. The system of claim 8, wherein said data dependance check means
determines said data dependencies by comparing the addresses of the
source register field of each instruction to the addresses of the
destination register fields.
10. The system of claim 8, further comprising means for passing
out-of-order execution results to said register file locations in
order after execution of the set of instructions is completed.
11. A system for register renaming in a computer system capable of
out-of-order instruction execution, comprising:
an instruction window for storing a plurality of instructions;
a temporary buffer comprising a plurality of storage locations for
storing instruction data, wherein said instruction data for an
instruction are stored at one of said plurality of storage
locations, said one of a plurality of storage locations being
determined by a location of said instruction in said instruction
window;
dependency check means for identifying dependencies among said
plurality of instructions;
tag assignment means for receiving data dependency results from
said dependency check means and for outputting a tag in place of a
register address for an operand of a first instruction if said
first instruction is dependent on a previous one of said plurality
of instructions in said instruction window for said operand;
wherein said tag represents an address of said operand in one of
said plurality of storage locations.
12. The register renaming system of claim 11, further comprising
multiplexer means for selecting said tag for said first instruction
when said first instruction is to be issued, and for providing said
tag to said temporary buffer to retrieve said operand.
13. The register renaming system of claim 12, further comprising an
issuer, wherein said issuer comprises:
means for evaluating an instruction's resource requirements;
means for signalling said multiplexer means to select an
instruction to issue.
14. The register renaming system of claim 11, further comprising
means for transferring the execution results in said temporary
storage locations to said register file locations in-order based on
the order of instructions in the instruction window.
15. The register renaming system of claim 14, wherein said means
for transferring transfers a group of retirable instructions from
said temporary buffer to said register file simultaneously.
16. The register renaming system of claim 15, wherein said means
for transferring transfers an execution result for an instruction
from said temporary buffer to said register file only when all
execution results for all prior instructions are retirable.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to superscalar reduced instruction
set computers (RISC), more particularly, the present invention
relates to instruction scheduling including register renaming and
instruction issuing for superscalar RISC computers.
2. Related Art
A more detailed description of some of the basic concepts discussed
in this application is found in a number of references, including
Mike Johnson, Superscalar Microprocessor Design (Prentice-Hall,
Inc., Englewood Cliffs, N.J., 1991); John L. Hennessy et al.,
"Computer Architecture--A Quantitative Approach" (Morgan Kaufmann
Publishers, Inc., San Mateo, Calif., 1990). Johnson's text,
particularly Chapters 2, 6 and 7 provide an excellent discussion of
the register renaming issues addressed by the present
invention.
A major consideration in a superscalar RISC processor is to how to
execute multiple instructions in parallel and out-of-order, without
incurring data errors due to dependencies inherent in such
execution. Data dependency checking, register renaming and
instruction scheduling are integral aspects of the solution.
Storage Conflicts and Register Renaming
True dependencies (sometimes called "flow dependencies" or
"write-read" dependencies) are often grouped with anti-dependencies
(also called "read-write" dependencies) and output dependencies
(also called "write-write" dependencies) into a single group of
instruction dependencies. The reason for this grouping is that each
of these dependencies manifests itself through use of registers or
other storage locations. However, it is important to distinguish
true dependencies from the other two. True dependencies represent
the flow of data and information through a program. Anti- and
output dependencies arise because, at different points in time,
registers or other storage locations hold different values for
different computations.
When instructions are issued in order and complete in order, there
is a one-to-one correspondence between registers and values. At any
given point in execution, a register identifier precisely
identifies the value contained in the corresponding register. When
instructions are issued out of order and complete out of order,
correspondence between registers and values breaks down, and values
conflict for registers. This problem is severe when the goal of
register allocation is to keep as many values in as few registers
as possible. Keeping a large number of values in a small number of
registers creates a large number of conflicts when the execution
order is changed from the order assumed by the register
allocator.
Anti- and output dependencies are more properly called "storage
conflicts" because reusing storage locations (including registers)
causes instructions to interfere with one another even though
conflicting instructions are otherwise independent. Storage
conflicts constrain instruction issue and reduce performance. But
storage conflicts, like other resource conflicts, can be reduced or
eliminated by duplicating the troublesome resource.
Dependency Mechanisms
Johnson also discusses in detail various dependency mechanisms,
including: software, register renaming, register renaming with a
reorder buffer, register renaming with a future buffer, interlocks,
the copying of operands in the instruction window to avoid
dependencies, and partial renaming.
A conventional hardware implementation relies on software to
enforce dependencies between instructions. A compiler or other code
generator can arrange the order of instructions so that the
hardware cannot possibly see an instruction until it is free of
true dependencies and storage conflicts. Unfortunately, this
approach runs into several problems. Software does not always know
the latency of processor operations, and thus, cannot always know
how to arrange instructions to avoid dependencies. There is the
question of how the software prevents the hardware from seeing an
instruction until it is free of dependencies. In a scalar processor
with low operation latencies, software can insert "no-ops" in the
code to satisfy data dependencies without too much overhead. If the
processor is attempting to fetch several instructions per cycle, or
if some operations take several cycles to complete, the number of
no-ops required to prevent the processor from seeing dependent
instructions rapidly becomes excessive, causing an unacceptable
increase in code size. The no-ops use a precious resource, the
instruction cache, to encode dependencies between instructions.
When a processor permits out-of-order issue, it is not at all clear
what mechanism software should use to enforce dependencies.
Software has little control over the behavior of the processor, so
it is hard to see how software prevents the processor from decoding
dependent instructions. The second consideration is that no
existing binary code for any scalar processor enforces the
dependencies in a superscalar processor, because the mode of
execution is very different in the superscalar processor. Relying
on software to enforce dependencies requires that the code be
regenerated for the superscalar processor. Finally, the
dependencies in the code are directly determined by the latencies
in the hardware, so that the best code for each version of a
superscalar processor depends on the implementation of that
version.
On the other hand, there is some motivation against hardware
dependency techniques, because they are inherently complex.
Assuming instructions with two input operands and one output value,
as holds for typical RISC instructions, then there are five
possible dependencies between any two instructions: two true
dependencies, two anti-dependencies, and one output dependency.
Furthermore, the number of dependencies between a group of
instructions, such as a group of instructions in a window, varies
with the square of the number of instructions in the group, because
each instruction must be considered against every other
instruction.
Complexity is further multiplied by the number of instructions that
the processor attempts to decode, issue, and complete in a single
cycle. These actions introduce dependencies. The only aid in
reducing complexity is that the dependencies can be determined
incrementally, over many cycles to help reduce the scope and
complexity of the dependency hardware.
One technique for removing storage conflicts is by providing
additional registers that are used to reestablish the
correspondence between registers and values. The additional
registers are conventionally allocated dynamically by hardware, and
the registers are associated with values needed by the program
using "register renaming." To implement register renaming,
processors typically allocate a new register for every new value
produced (i.e., for every instruction that writes a register). An
instruction identifying the original register, for the purpose of
reading its value, obtains instead the value in the newly allocated
register. Thus, hardware renames the original register identifier
in the instruction to identify the new register and correct value.
The same register identifier in several different instructions may
access different hardware registers, depending on the locations of
register references with respect to register assignments.
Consider the following code sequence where "op" is an operation,
"Rn" represents a numbered register, and ":=" represents
assignment:
R3b:=R3a op R5a (1)
R4b:=R3b+1 (2)
R3c:=R5a+1 (3)
R7b:=R3c op R4b (4)
Each assignment to a register creates a new "instance" of the
register, denoted by an alphabetic subscript. The creation of a new
instance for R3 in the third instruction avoids the anti- and
output dependencies on the second and first instructions,
respectively, and yet does not interfere with correctly supplying
an operand to the fourth instruction. The assignment to R3 in the
third instruction supersedes the assignment to R3 in the first
instruction, causing R3c to become the new R3 seen by subsequent
instructions until another instruction assigns a value to R3.
Hardware that performs renaming creates each new register instance
and destroys the instance when its value is superseded and there
are no outstanding references to the value. This removes anti- and
output dependencies and allows more instruction parallelism.
Registers are still reused, but reuse is in line with the
requirements of parallel execution. This is particularly helpful
with out-of-order issue, because storage conflicts introduce
instruction issue constraints that are not really necessary to
produce correct results. For example, in the preceding instruction
sequence, renaming allows the third instruction to be issued
immediately, whereas, without renaming, the instruction must be
delayed until the first instruction is complete and the second
instruction is issued.
Another technique for reducing dependencies is to associate a
single bit (called a "scoreboard bit") with each register. The
scoreboard bit is used to indicate that a register has a pending
update. When an instruction is decoded that will write a register,
the processor sets the associated scoreboard bit. The scoreboard
bit is reset when the write actually occurs. Because there is only
one scoreboard bit indicating whether or not there is a pending
update, there can be only one such update for each register. The
scoreboard stalls instruction decoding if a decoded instruction
will update a register that already has a pending update (indicated
by the scoreboard bit being set). This avoids output dependencies
by allowing only one pending update to a register at any given
time.
Register renaming, in contrast, uses multiple-bit tags to identify
the various uncomputed values, some of which values may be destined
for the same processor register (that is, the same program-visible
register). Conventional renaming requires hardware to allocate tags
from a pool of available tags that are not currently associated
with any value and requires hardware to free the tags to the pool
once the values have been computed. Furthermore, since
scoreboarding allows only one pending update to a given register,
the processor is not concerned about which update is the most
recent.
A further technique for reducing dependencies is using register
renaming with a "reorder buffer" which uses associative lookup. The
associative lookup maps the register identifier to the reorder
buffer entry as soon as the entry is allocated, and, to avoid
output dependencies, the lookup is prioritized so that only the
value for the most recent assignment is obtained if the register is
assigned more than once. A tag is obtained if the result is not yet
available. There can be as many instances of a given register as
there are reorder buffer entries, so there are no storage conflicts
between instructions. The values for the different instances are
written from the reorder buffer to the register file in sequential
order. When the value for the final instance is written to the
register file, the reorder buffer no longer maps the register; the
register file contains the only instance of the register, and this
is the most recent instance.
However, renaming with a reorder buffer relies on the associative
lookup in the reorder buffer to map register identifiers to values.
In the reorder buffer, the associative lookup is prioritized so
that the reorder buffer always provides the most recent value in
the register of interest (or a tag). The reorder buffer also writes
values to the register file in order, so that, if the value is not
in the reorder buffer, the register file must contain the most
recent value.
In a still further technique for reducing dependencies, associative
lookup can be eliminated using a "future file." The future file
does not have the properties of the reorder buffer discussed in the
preceding paragraph. A value presented to the future file to be
written may not be the most recent value destined for the
corresponding register, and the value cannot be treated as the most
recent value unless it actually is. The future file therefore keeps
track of the most recent update and checks that each write
corresponds to the most recent update before it actually performs
the write.
When an instruction is decoded, it accesses tags in the future file
along with the operand values. If the register has one or more
pending updates, the tag identifies the update value required by
the decoded instruction. Once an instruction is decoded, other
instructions may overwrite this instructions's source operands
without being constrained by anti-dependencies, because the
operands are copied into the instruction window. Output
dependencies are handled by preventing the writing as a result into
the future file if the result does not have a tag for the most
recent value. Both anti- and output dependencies are handled
without stalling instruction issue.
If dependencies are not removed through renaming, "interlocks" must
use to enforce dependencies. An interlock simply delays the
execution of an instruction until the instruction is free of
dependencies. There are two ways to prevent an instruction from
being executed: one way is to prevent the instruction from being
decoded, and the other is to prevent the instruction from being
issued.
To improve performance over scoreboarding, interlocks are moved
from the decoder to the instruction window using a "dispatch
stack." The dispatch stack is an instruction window that augments
each instruction in the window with dependency counts. There is a
dependency count associated with the source register of each
instruction in the window, giving the number of pending prior
updates to the source register and thus the number of updates that
must be completed before all possible true dependencies are
removed. There are two similar dependency counts associated with
the destination register of each instruction in the window, giving
both the number of pending prior uses of the register (which is the
number of anti-dependencies) and the number of pending prior
updates to the register (which is the number of output
dependencies).
When an instruction is decoded and loaded into the dispatch stack,
the dependency counts are set by comparing the instruction's
register identifiers with the register identifiers of all
instructions already in the dispatch stack. As instructions
complete, the dependency counts of instructions that are still in
the window are decremented based on the source and destination
register identifiers of completing instructions (the counts are
decremented by a variable amount, depending on the number of
instructions completed). An instruction is independent when all of
its counts are zero. The use of counts avoids having to compare all
instructions in the dispatch stack to all other instructions on
every cycle.
Anti-dependencies can be avoided altogether by copying operands to
the instruction window (for example, to the reservation stations)
during instruction decode. In this manner, the operands cannot be
overwritten by subsequent register updates. Operands can be copied
to eliminate anti-dependencies in any approach, independent of
register renaming. The alternative to copying operands is to
interlock anti-dependencies, but the comparators and/or counters
required for these interlocks are costly, considering the number of
combinations of source and result registers to be compared.
A tag can be supplied for the operand rather than the operand
itself. This tag is simply a means for the hardware to identify
which value the instruction requires, so that, when the operand
value is produced, it can be matched to the instruction. If there
can be only one pending update to a register, the register
identifier can serve as a tag (as with scoreboarding). If there can
be more than one pending update to a register (as with renaming),
there must be a mechanism for allocating result tags and insuring
uniqueness.
An alternative to scoreboarding interlocking is to allow multiple
pending updates of registers to avoid stalling the decoder for
output dependencies, but to handle anti-dependencies by copying
operands (or tags) during decode. An instruction in the window is
not issued until it is free of output dependencies, so the updates
to each register are performed in the same order in which they
would be performed with in-order completion, except that updates
for different registers are out of order with respect to each
other. This alternative has almost all of the capabilities of
register renaming, lacking only the capability to issue
instructions so that updates to the same register occur out of
order.
There appears to be no better alternative to renaming other than
with a reorder buffer. Underlying the discussion of dependencies
has been the assumption that the processor performs out-of-order
issue and already has a reorder buffer for recovering from
mispredicted branches. Out-of-order issue makes it unacceptable to
stall the decoder for dependencies. If the processor has an
instruction window, it is inconsistent to limit the look ahead
capability of the processor by interlocking the decoder. There are
then only two alternatives: implement anti- and output dependency
interlocks in the window or remove these altogether with
renaming.
SUMMARY OF THE INVENTION
The present invention is directed to instruction scheduling
including register renaming and instruction issuing for superscalar
RISC computers. A Register Rename Circuit (RRC), which is part of
the scheduling logic allows a computer's Instruction Execution Unit
(IEU) to execute several instructions at the same time while
avoiding dependencies. In contrast to conventional register
renaming, the present invention does not actually rename register
addresses. The RRC of the present invention temporarily buffers the
instruction results, and the results of out-of-order instruction
execution are not transferred to the register file until all
previous instructions are done. The RRC also performs result
forwarding to provide temporarily buffered operands (results) to
dependant instructions. The RRC contains three subsections: a Data
Dependency Checker (DDC), Tag Assign Logic (TAL) and Register file
Port MUXes (RPM).
The function of the DDC is to locate the dependencies between the
instructions for a group of instructions. The DDC does this by
comparing the addresses of the source registers of each instruction
to the addresses of the destination registers of each previous
instruction in the group. For example, if instruction A reads a
value from a register that is written to by instruction B, then
instruction A is dependent upon instruction B and instruction A
cannot start until instruction B has finished. The DDC outputs
indicate these dependencies.
The outputs of the DDC go to the TAL. Because it is possible for an
instruction to be dependent on more than one previous instruction,
the TAL must determine which of those previous instructions will be
the last one to be executed. The present invention automatically
maps each instruction a predetermined temporary buffer location;
hence, the present invention does not need prioritized associative
look-up as used by convention reorder buffers, thereby saving chip
area/cost and execution speed.
Out-of-order results for several instructions being executed at the
same time are stored in a set of temporary buffers, rather that the
file register designated by the instruction. If the DDC determines,
for example, that a register that instruction 6's source is written
to by instructions 2, 3 and 5, then the TAL will indicate that
instruction 6 must wait for instruction 5 by outputting the "tag"
of instruction 5 for instruction 6. The tag of instruction 5 shows
the temporary buffer location where instruction 5's result is
stored. It also contains a one bit signal (called a "done flag")
that indicates if instruction 5 is finished or not. The TAL will
output three tags for each instruction, because each instruction
can have three source registers. If an instruction is not dependent
on any previous instruction, the TAL will output the register file
address of the instruction's input, rather an a temporary buffer's
address.
The last part of the RRC are the RPMs or Register file Port MUXes.
The inputs of the RPMs are the outputs of the TAL, and the select
lines for the RPMs come from another part of the IEU called the
Instruction Scheduler or Issuer. The Instruction Scheduler chooses
which instruction to execute (this decision is based partly on the
done flags) and then uses the RPMs to select the tags of that
instruction. These tags go to the read address ports of the
computer's register files. In the previous example, once
instruction 5 has finished, the Instruction Scheduler will start
instruction 6. It will select the RPM so that the address of
instruction 5's result (its tag) is sent to the register file, and
the register file will make the result of instruction 5 available
to instruction 6.
The foregoing and other features and advantages of the present
invention will be apparent from the following more particular
description of the preferred embodiments of the invention, as
illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be better understood if reference is made to the
accompanying drawings.
FIG. 1 shows a representative high level block diagram of the
register renaming circuit of the present invention.
FIG. 2 shows a representative block diagram of the data dependency
check circuit of the present invention.
FIG. 3 shows a representative block diagram of the tag assignment
logic of the present invention.
FIG. 4 shows a representative block diagram of the register port
file multiplexers of the present invention.
FIG. 5 is a representative flowchart showing a data dependency
check method for IXS1 and IYS/D in accordance with the present
invention.
FIGS. 6A and 6B are representative flowcharts showing a tag
assignment method in accordance with the present invention.
FIG. 7 shows a representative block diagram which compares an
instruction Y's source/destination operand with each operand of an
instruction X in accordance with an embodiment of the present
invention.
FIG. 8 shows a representative circuit diagram for comparator block
706 of FIG. 7.
FIG. 9 shows a representative block diagram of a Priority Encoder
in accordance with an embodiment of the present invention.
FIG. 10 shows a representative block diagram of the instruction
scheduling logic of the present invention.
DETAILED DESCRIPTION
FIG. 1 shows a representative high level block diagram of an
Instruction Execution Unit (IEU) 100 associated with the present
invention. The goal of IEU 100 is to execute as many instructions
as possible in the shortest amount of time. There are two basic
ways to accomplish this: optimize IEU 100 so that each instruction
takes as little time as possible or optimize IEU 1130 so that it
can execute several instructions at the same time.
Instructions are sent to IEU 100 from an Instruction Fetch Unit
(IFU, not shown) through an instruction FIFO (first-in-first-out
register stack storage device) 101 in groups of four called
"buckets." IEU 100 can decode and schedule up to two buckets of
instructions at one time. FIFO 101 stores 16 total instructions in
four buckets labeled 0-3. IEU 100 looks at the an instruction
window 102. In one embodiment of the present invention, window 102
comprises eight instructions (buckets 0 and 1). Every cycle IEU 100
tries to issue a maximum number of instructions from window 102.
Window 102 functions as a instruction buffer register. Once the
instructions in a bucket are executed and their results stored in
the processor's register file (see block 117), the bucket is
flushed out a bottom 104 and a new bucket is dropped in at a top
106.
In order to execute instructions in parallel or out of order, care
must be taken so that the data that each instruction needs is
available when the instruction needs it and also so that the result
of each instruction is available for any future instructions that
might need it. A Register Rename Circuit (RRC), which is part of
the scheduling logic of the computer's IEU performs this function
by locating dependencies between current instructions and then
renaming the sources (inputs) of the instruction.
As noted above, there are three types of dependencies: input
dependencies, output dependencies and anti-dependencies. Input
dependencies occur when an instruction, call it A, that performs an
operation on the result of a previous instruction, call it B.
Output dependencies occur when the outputs of A and B are to be
stored in the same place. Anti-dependencies occur when instruction
A comes before B in the instruction stream and B's result will be
stored in the same place as one of A's inputs.
Input dependencies are handled by not executing instructions until
their inputs are available. RRC 112 is used to locate the input
dependencies between current instructions and then to signal an
Instruction Scheduler or Issuer 118 when all inputs for a
particular instruction are ready. In order to locate these
dependencies, RRC 112 compares the register file addresses of each
instruction's inputs with the addresses of each previous
instruction's output using a data dependency circuit (DDC) 108. If
one instruction's input comes from a register where a previous
instruction's output will be stored, then the latter instruction
must wait for the former to finish.
This implementation of RRC 112 can check eight instructions at the
same time, so a current instruction is defined as any one of those
eight from window 102. It should become evident to those skilled in
the art that the present invention can easily be adapted to check
more or less instructions.
In one embodiment of the present invention, instructions can have
from 0 to 3 inputs and 0 or 1 outputs. Most instructions' inputs
and outputs come from, or are stored in, one of several register
files. Each register file 117 (e.g., separate integer, floating and
boolean register files) has 32 real entries plus the group of 8
temporary buffers 116. When an instruction completes, (The term
"complete" means that the operation is complete and the operand is
ready to be written to its destination register.) its result is
stored in its preassigned location in the temporary buffers 116.
Its result is later moved to the appropriate place in register file
117 after all previous instructions' results have been moved to
their places in the register file. This movement of results from
temporary buffers 116 to register file 117 is called "retirement"
and is controlled by termination logic, as should become evident to
those skilled in the art. More than one instruction may be retired
at a time. Retirement comprises updating the "official state" of
the machine including the computer's Program Counter, as will
become evident to those skilled in the art. For example, if
instruction I0 happens to complete directly before instruction I1,
both results can be stored directly into register file 117. But if
instruction I3 then completes, its result must be stored in
temporary buffer 116 until instruction I2 completes. By having IEU
100 store each instruction's result in its preassigned place in the
temporary buffers 116, IEU 100 can execute instructions out of
program order and still avoid the problems caused by output and
anti-dependencies.
RRC 112 sends a bit map to an Instruction Scheduler 118 via a bus
120 indicating which instructions in window 102 are ready for
issuing. Instruction decode logic (not shown) indicates to Issuer
118 the resource requirements for each instruction over a bus 123.
For each resource in IEU 100 (e.g., each functional unit being an
adder, multiplier, shifter, or the like), Issuer 118 scans this
information and selects the first and subsequent instructions for
issuing by sending issue signals over bus 121. The issue signals
select a group of Register File Port MUXes (RPMs) 124 inside RRC
112 whose inputs are the addresses of each instruction's
inputs.
Because the results may stay in temporary buffer 116 several cycles
before going to register file 117, a mechanism is provided to get
results from temporary buffer 116 before they go to register file
117, so the information can be used as operands for other
instructions. This mechanism is called "result forwarding," and
without it, Issuer 118 would not be able to issue instructions out
of order. This result forwarding is done in register file 117 and
is controlled by RRC 112. The control signals necessary for
performing the result forwarding will be come evident to those
skilled in the art, as should the random logic used for generating
such control signals.
If an instruction is not dependent on any of the current
instructions result forwarding is not necessary since the
instruction's inputs are already in register file 117. When Issuer
118 decides to execute that instruction, RRC 112 tells register
file 117 to output its data.
RRC 112 contains three subsections: a Data Dependency Checker (DDC)
108, Tag Assign Logic (TAL) 122 and Register File Port MUXes (RPM)
124. DDC 108 determines where the input dependencies are between
the current instructions. TAL 122 monitors the dependencies for
Issuer 118 and controls result forwarding. RPM 124 is controlled by
Issuer 118 and directs the outputs of TAL 122 to the appropriate
register file address ports 119. Instructions are passed to DDC 108
via bus 110. All source registers are compared with all previous
destination registers for each instruction in window 102.
Each instruction has only one destination, which may be a double
register in one embodiment. An instruction can only depend on a
previous instruction and may have up to three source registers.
There are various register file source and destination addresses
that need to be checked against each other for any dependencies. As
noted above, the eight bottom instructions corresponding to the
lower two buckets are checked by DDC 108. All source register
addresses are compared with all previous destination register
addresses for the instructions in window 102.
For example, let's say a program has the following instruction
sequence:
add R0, R1, R2 (0)
add R0, R2, R3 (1)
add R4, R5, R2 (2)
add R2, n3, R4 (3)
The first two registers in each instruction 0-3 are the source
registers, and the last listed register in each instruction is the
destination register. For example, R0 and R1 are the source
registers for instruction 0 and R2 is the destination register.
Instruction 0 adds the contents of registers 0 and 1 and stores the
result in R2. For instructions 1-3 in this example, the following
are the comparisons needed to evaluate all of the dependencies:
ILS1, I1S2 vs. I0D
I2S1, I2S2 vs. I1D, I0D
I3S1, I3S2 vs. I2D, I1D, I0D
The key to the above is as follows: IXRS1 is the address of source
(input) number 1 of instruction X; IXRS2 is the address of source
(input) number 2 of instruction X; and IXD is the address of the
destination (output) of instruction X.
Note also that RRC 112 can ignore the fact that instruction 2 is
output dependent on instruction 0, because the processor has a
temporary buffer where instruction 2's result can be stored without
interfering with instruction O's result. As discussed before,
instruction 2's result will not be moved from temporary buffers 116
to register file 117 until instructions 0 and 1's results are moved
to register file 117.
The number of instructions that can be checked by RRC 112 is easily
scaleable. In order to check eight instructions at a time instead
of four, the following additional comparisons would also need to be
made:
I4S1, I4S2 vs I3D, I2D, I1D, I0D
I5S1, I5S2 vs I4D, I3D, I2D, I1D, I0D
I6S1, I6S2 vs I5D, I4D, I3D, I2D, I1D, I0D
I7S1, I7S2 vs I6D, I5D, I4D, I3D, I2D, I1D, I0D
There are several special cases that RRC 112 must handle in order
to do the dependency check. First, there are some instructions that
use the same register as an input and an output. Thus, RRC 112 must
compare this source/destination register address with the
destination register addresses of all previous instructions. So for
instruction 7, the following comparisons would be necessary:
I7S1,I7S2, I7S/D vs. I6D,I5D, I4D,I3D, I2D,IID, I0D.
Another special case occurs when a program contains instructions
that generate 64 bit outputs (called long-word operations). These
instructions need two registers in which to store their results. In
this embodiment, these registers must be sequential. Thus if RRC
112 is checking instruction 4's dependencies and instruction 1 is a
long-word operation, then it must do the following comparisons:
I4S1,I4S2 vs. I3D,I2D, I1D, IID+1,I0D
Sometimes, instructions do not have destination registers. Thus RRC
112 must ignore any dependencies between instructions without
destination registers and any future instructions. Also,
instructions may not have only one valid source register, so RRC
112 must ignore any dependencies between the unused source register
(usually S2) and any previous instructions.
RRC 112 is also capable of dealing with multiple register files.
When using multiple register files, dependencies only occur when
one instruction's source register has the same address and is in
the same register file as some other instruction's destination
register. RRC 112 treats the information regarding which register
file a particular address is from as part of the address. For
example, in an implementation using four 32 bit register files, RRC
112 would do 7 bit compares instead of 5 bit compares (5 for the
address and 2 for the register file).
Signals indicating which instructions are long-word operations or
have invalid source or destination registers are sent to RRC 112
from Instruction Decode Logic (IDL; not shown). IDL also tells RRC
112 which register file each instruction's sources and destinations
will come from or go to.
A block diagram of DDC 108 is shown in FIG. 2. Source address
signals arrive from IFIFO 101 for all eight instructions of window
102. Additional inputs include long-word load operation flags,
register file decode signals, invalid destination register flags,
destination address signals and addressing mode flags for all eight
instructions.
DDC 208 comprises 28 data dependency blocks 204. Each block 204 is
described in a KEY 206. Each block 204 receives 3 inputs, IXS1,
IXS2 and IXS/D. IXS1 is the address of source (input) number 1 of
instruction X, IXS2 is the address of source (input) number 2 of
instruction X; and IXS/D is the address of the source/destination
(input) of instruction X. Each block 204 also receives input IYS/D,
which is the destination register address for some previous
instruction Y. A top row 208, for example, receives 10S/D, which is
the destination register address for instruction 0. Each block 204
outputs the data dependency results to one of a corresponding bus
line 114. For example, the address of I2S/D must be checked with
operand addresses S1, S2 and S/D of instructions 7, 6, 5, 4, and
3.
Each block 204 performs the three comparisons. To illustrate these
comparisons, consider a generic block 700 shown in FIG. 7, which
compares instruction Y's source/destination operand with each
operand of instruction X. In this example, the three following
comparisons must be made:
IXS1=IYS/D
IXS2=IYS/D
IXS/D=IYS/D
These comparisons are represented by three comparator blocks 702,
704 and 706, respectively. One set of inputs to comparator blocks
702, 704 and 706 are the bits of the IYS/D field, which is
represented by number 708. Comparator block 702 has as its second
set of inputs the bits of the IXS1. Similarly, comparator block 704
has as its second set of inputs the bits of the IXS1, and
comparator block 706 has as its second set of inputs the bits of
the IXS/D.
In a preferred embodiment, the comparisons performed by blocks 702,
704 and 706 can be performed by random logic. An example of random
logic for comparator block 706 is shown in FIG. 8. Instruction Y's
source/destination bits [6:0] are shown input from the right at
reference number 802 and instruction X's source/destination bits
[6:0] are shown input from the top at reference number 804. The
most significant bit (MSB) is bit 6 and the least significant bit
(LSB) is bit 0. The corresponding bits from the two operands are
fed to a set of seven exclusive NOR gates (XNORs) 806. The outputs
of XNORs 806 are then ANDed by a seven input AND gate 808. If the
corresponding bits are the same, the output of XNOR 806 will be
logic high. When all bits are the same, all seven XNOR 806 outputs
are logic high and the output of gate 808 is logic high, this
indicates that there is a dependency between IXS/D and IYS/D.
The random logic for comparator blocks 702 and 704 will be
identical to that shown in FIG. 8. The present invention
contemplates many other random logic circuits for performing data
dependency checking, as will become evident to those skilled in the
art without departing from the spirit of this example.
As will further become evident to those skilled in the art, various
implementation specific special cases can arise which require
additional random logic to perform data dependency checking. An
illustrative special data dependency checking case is for long word
handling.
As mentioned before, if a long word operation writes to register X,
the first 32 bits are written to register X and the second 32 bits
are written to register X+1. The data dependency checker therefore
needs to check both resisters when doing a comparison. In a
preferred embodiment, register X is an even register, X+1 is an odd
register and thus they only differ by the LSB. The easiest way to
check both registers at the same time is to simply ignore the LSB.
In the case of a store long (STLG) or load long (LDLG) operation,
if X and Y only differ by the LSB bit [0], the logic in FIG. 8
would cause there to be no dependency, when there really is a
dependency. Therefore, for a long word operation the STLG and LDLG
flags must be ORed with the output of the [0] bit XNOR to assure
that all dependencies are detected.
A data dependency check flowchart for IXS1 and IYS/D is shown in
FIG. 5. DDC 108 first checks whether IXS1 and IYS/D are in the same
register file, as shown at a conditional block 502. If they are not
in the same register file there is no dependency. This is shown at
block a 504. If there is a dependency, DDC 108 then determines
whether IXS1 and IYS/D are in the same register, as shown at a
block 506. If they are not in the same register, flow proceeds to a
conditional block 508 where DDC 108 determines whether IY is a long
word operation. If IY is not a long word operation there is no
dependency and flow proceeds to a block 504. If IY is a long word
operation, flow then proceeds to a conditional statement 510 where
DDC 108 determines whether IXS1 and IYS/D+1 are the same register.
If they are not, there is no dependency and flow proceeds to a
block 504. If IXS1 and IYS/D+1 are the same register, flow proceeds
to a conditional block 512 where DDC 108 determines if IY has a
valid destination. If it does not have a valid destination, there
is no dependency and flow proceeds to block 504. If IY does have a
valid destination, flow proceeds to a conditional block 514 where
DDC 108 determines if IXS1 has a valid source register. Again, if
no valid source register is detected there is no dependency, and
flow proceeds to a block 504. If a valid source register is
detected, DDC 108 has determined that there is a dependency between
IXS1 and IYX/D, as shown at a block 516.
A more detailed discussion of data dependency checking is found in
commonly owned, copending application Ser. No. 07/860,718, the
disclosure of which is incorporated herein by reference.
Because it is possible that an instruction might get one of its
inputs from a register that was written to by several other
instructions, the present invention must choose which one is the
real dependency. For example, if instructions 2 and 5 write to
register 4 and instruction 7 reads register 4, then instruction 7
has two possible dependencies. In this case, it is assumed that
since instruction 5 came after instruction 2 in the program, the
programmer intended instruction 7 to use instruction 5's result and
not instruction 2's. So, if an instruction can be dependent on
several previous instructions, RRC 112 will consider it to be
dependent on the highest numbered previous instruction.
Once TAL 122 has determined where the real dependencies are, it
must locate the inputs for each instruction. In a preferred
embodiment of the present invention, the inputs can come from the
actual register file or an array temporary buffers 116. RRC 112
assumes that if an instruction has no dependencies, its inputs are
all in the register file. In this case, RRC 112 passes the IXS1,
IXS2 and IXS/D addresses that came from IFIFO 102 to the register
file. If an instruction has a dependency, then RRC 112 assumes that
the data is in temporary buffers 116. Since RRC 112 knows which
previous instruction each instruction depends on, and since each
instruction always writes to the same place in temporary buffers
116, RRC 112 can determine where in temporary buffers 116 an
instruction's inputs are stored. It sends these addresses to
register file read ports 119 and register file 117 outputs the data
from temporary buffers 116 so that the instruction can use it.
The following is an example of tag assignments:
0: add r0, r1, r2
1: add r0, r2, r3
2: add r4, r5, r2
3: add r2, r3, r4
The following are the dependencies for the above operations
(dependencies are represented by the symbol "#"):
I152#I05/D
I3Sl#I0S/D
I3S1#I2S/D
I3S2#I1S/D
First, look at I0; since it has no dependencies, its tags are equal
to its original source register addresses:
I0S1 TAG=I0S1=r0
I0S2 TAG=I0S2=r1
I0S/D TAG=I0S/D=r2
I1 has one dependency, and its tags are as follows:
I1S1 TAG=I1S1=t0
I1S2 TAG=I0S/D=t0
where: (t0=inst. 0's slot in temporary buffer)
I1S/D TAG=I1S/D=r3
I2 is also independent:
I2S1 TAG=I2S1=r4
I2S2 TAG=I2S2=r5
I2S/D TAG=I2S/D=r2
I3S1 has two possible dependencies, I0S/D and 12S/D. Because TAL
122 must pick the last one (highest numbered one), I2S/D is
chosen.
I3S1 TAG=I2S/D=t2
I3S2 TAG=I1S/D=t1
I3S/D TAG=I3S/D=r4
These tags are then sent to RPM 124 via bus 126 to be selected by
Issuer 118. At the same time TAL 122 is preparing the tags, it is
also monitoring the outputs of DCL 130 and passing them on to
Issuer 118 using bus 120. TAL 122 chooses the proper outputs of
DCL's 130 to pass to Issuer 118 by the same method that it chooses
the tags that it sends to RPM 124.
Continuing the example, TAL 122 sends the following ready signals
to Issuer 118:
I0S1 INFO=1
(Inst 0 is independent so it can start immediately)
I0S2 INFO=1
I0S/D INFO=1
I1S1 INFO=1
I1S2 INFO=DONE[0]
(DONE[0]=1 when I0 is done)
I1S/D INFO=1
I2S1 INFO=1
I2S2 INFO=1
I2S/D INFO=1
I3S1 INFO=DONE[2]
I3S2 INFO=DONE[1 ]
I3S/D READ=1
(The DONE signals come from DCL 130 via a bus 132. In connection
with the present invention, the term "done" means the result of the
instruction is in a temporary buffer or otherwise available at the
output of a functional unit. Contrastingly, the term "terminate"
means the result of the instruction is in the register file.)
Turning now to FIG. 3, a representative block diagram of TAL 122
will be discussed. TAL 122 comprises 8 tag assignment logic blocks
302. Each TAL block 302 receives the corresponding data dependency
results via buses 114, as well as further signals that come from
the computer's Instruction Decode and control logic (not shown).
The BKT bit signal forms the least significant bit of the tag.
DONE[X] flags are for instructions 0 through 6, and indicate if
instruction X is done. DBLREG[X] flags indicates which, if any, of
the instructions is a double (long) word. Each TAL block 302 also
receives its own instructions register addresses as inputs. The
Misc. signals, DBLREG and BKT signals are all implementation
dependent control signals. Each TAL block 302 outputs 3 TAGs 126
labeled IXS1, IXS2 and IXS/D, which are 6 bits. TAL 122 outputs the
least significant 5 bits of each TAG signal to RPMs 124 and the
most significant TAG to Issuer 118.
Each block 302 of FIG. 3 comprises three Priority Encoders (PE),
one for S1, one for S2 and one for S/D. There is one exception
however. I0 requires no tag assignment. Its tags are the same as
the original S1, S2 and S/D addresses, because I0 is always
independent.
An illustrative PE is shown in FIG. 9. PE 902 has eight inputs 904
and eight outputs 906. Inputs 904 for PE 902 are outputs 114 from
DDC 108 which show where dependencies exist. For example, in the
case of source register 1 (S1), I7S1 tag assign PE 902's seven
inputs are the seven outputs 114 of DDC 108 that indicate whether
I7S1 is dependent on I6D, whether I7S1 is dependent on I5D, and so
on down to whether I7S1 is dependent on IOD. An eighth input, shown
at reference number 908, is always tied high because there should
always be an output from PE 902.
As stated before, if an instruction depends on several previous
instructions, PE 902 will select and output only the most previous
instruction (in program order) on which there is a dependency. This
is accomplished by connecting the signal showing if there is a
dependency on the most previous instruction to the highest priority
input of the PE 902 and the signal showing if there is a dependency
on the second most previous instruction to the input of PE 902 with
the second highest priority and so on for all previous
instructions. The input of the PE 902 with the lowest priority is
always tied high so that at least one of PE 902's outputs will be
asserted.
Outputs 906 are used as select lines for a MUX 910. MUX 910 has
eight inputs 912 to which the tags for each instruction are
applied.
To illustrate this, assume that I7 depends on I6 and I5, then,
since I6 has a higher priority than I5, the bit corresponding to I6
at outputs 906 of PE 902 will be high. At the corresponding input
912 of MUX 910 will be I6's tag for $1 (recall PE 902 is for I7S1).
Because I7 is dependent on I6, the location of I6's result must be
output from MUX 910 so that it can be used by I7. I6's tag will
therefore be selected and output on an output line 914. I6's done
flag, DONE[6] must also be output from MUX 910 so that Issuer 118
will know when I7's input is ready. This data is passed to Issuer
118 via bus 120. Since an instruction can have up to three sources,
TAL 122 monitors up to three dependencies for each instruction and
sends three vectors for each instruction (totalling 24 vectors) to
Issuer 118. If an instruction is independent, TAL 122 signals to
Issuer 118 that the instruction can begin immediately.
The MSB of the tag outputs which are sent to RPMs 124 is used to
indicate if the address is a register file address or a temporary
buffer address. If an instruction is independent, then the five LSB
outputs indicate the source register address. For instructions that
have dependencies: the second MSB indicates that the address is for
a 64 bit valve; the third through fifth MSB outputs specify the
temporary buffer address; and the LSB output indicates which bucket
is the current bucket, which is equal to the BKT signal in TAL
122.
Like DDC 108, TAL 122 has numerous implementation dependent, (i.e.,
special cases) that it handles. First, in an embodiment of the
present invention, register number 0 of the register file is always
equal to 0. Therefore, even if one instruction writes to register 0
and another reads from register 0, there will be no dependency
between them. TAL 122 receives three signals from Instruction
Decode Logic (IDL; not shown) for each instruction to indicate if
one of that instruction's sources is register 0. If any of those is
asserted, TAL 122 will ignore any dependencies for that particular
input of that instruction.
Another special case occurs because under some circumstances, an
instruction in bucket 0 will be guaranteed to not have any of the
instructions in bucket 1 dependent on it. A four bit signal called
BKTI.sub.-- NODEP.sub.-- is sent to RRC 112 from the IEU control
logic (not shown) and if BKT1.sub.-- NODEP[X]=1 then RRC 112 knows
to ignore any dependencies between instructions, 4,5,6 or 7 and
instruction X.
An example for TAG assignment of instruction 7's source 1 (I7S1) is
shown in a flowchart in FIGS. 6A-6B. TAL 122 first determines
whether I7S1 is register 0, as shown at a conditional block 602. If
the first source operand for I7 is register 0, the TAG is set equal
to zero, and the I7S1's INFO flag is set equal to one, as shown in
a block 604. If the first source operand ($1) for I7 is not
register 0, TAL 122 then determines if I7S1 is dependent on I6S/D,
as shown at a conditional block 606. If I7S1 is dependent on I6S/D
flow then proceeds to a block 610 where I7S1's TAG is set equal to
{1,DBLREG[6],0,1,0,BKT} and I7S1's INFO flag is set equal to
DONE[6], as shown at a block 6 10. If either of the condition
tested at a conditional block 606 is not met, flow proceeds to
conditional block 612 where TAL 122 determines if I7S1 is dependent
on I5S/D. If there is a dependency, flow then proceeds to block 616
where TAL 122 sets I7S1's TAG equal to {1,DBLREG[5],0,0,1,BKT} and
I7S1's INFO flag is set equal to DONE[5]. If the condition tested
at block 612 is not met, flow proceeds to a block 618 where TAL 122
determines if I7S1 is dependent on I4S/D.
As evident by inspection of the remaining sections of FIGS. 6A and
6B, similar TAG determinations are made depending on whether I7S1
is dependent on I45/D, I3S/D, I25/D, IIS/D and I05/D, as shown at
sections 620, 622, 624, 626 and 628, respectively. Finally, if
instruction 7 is independent of instruction 0 or if all
instructions in bucket 1 are independent of instruction 0 (i.e., if
BKT1.sub.-- NODEP[0]=1), as tested at a conditional block 630, the
flow proceeds to block 632 where TAL 122 sets I7S1's TAG equal to
{0, I7S1} and I7S1's INFO flag equal to 1. It should be noted for
the above example that I7S1 TAG signals are forwarded directly the
register file port MUXes of register file 117. The I7S1 INFO
signals are sent to Issuer 118 to tell it when I7's SI input is
ready.
A representative block diagram of Issuer 118 is shown in FIG. 10.
In a preferred embodiment, Issuer 118 has one scanner block 1002
for each resource (functional unit) that has to be allocated. In
this example, Issuer 118 has scanner blocks FU1, FU2, FU3, FU4
through FUn. Requests for functional units are generated from
instruction information by decoding logic (not shown) in a known
manner, which are sent to scanners 1002 via bus 123. Each scanner
block 1002 scans from instruction I0 to I7 and selects the first
request for the corresponding functional unit to be serviced during
that cycle.
In the case of multiple register files (integer, floating and/or
boolean), Issuer 118 is capable of issuing instructions having
operands stored in different register files. For example, an ADD
instruction may have a first operand from the floating point
register file and a second operand from the integer register file.
Instructions with operands from different register files are
typically given higher issue priority (i.e., they are issued
first). This issuing technique conserves processor execution time
and functional unit resources.
In a further embodiment in which IEU 100 may include two ALU's, ALU
scanning becomes a bit more complicated. For speed reasons, one ALU
scanner block scans from I0 to 17, while the other scanner block
scans from I7 to I0. This is how two ALU requests are selected.
With this scheme it is possible that an ALU instruction in bucket 1
will get issued before an ALU instruction in bucket 0, while
increasing scanning efficiency.
Scanner outputs 1003 are selected by MUXing logic 1004. A set of
SELect inputs 1006 for MUX 1004 receive three 8-bit vectors (one
for each operand) from TAL 122 via bus 120. The vectors indicate
which of the eight instructions have no dependencies and are ready
to be issued. Issuer 118 must wait for this information before it
can start to issue any instructions. Issuer 118 monitors these
vectors and when all three go high for a particular instruction,
Issuer 118 knows that the inputs for that instruction are ready.
Once the necessary functional unit is ready, the issuer can issue
that instruction and send select signals to the register file port
MUXes to pass the corresponding instructions outputs to register
file 117.
In a preferred embodiment of the present invention, after Issuer
118 is done it provides two 8-bit vectors per register file back to
RRC 112 via MUXOUTputs 1008 to bus 121. These vectors indicate
which instructions are issued this cycle, are used a select lines
for RPMs 124.
The maximum number of instructions that can be issued
simultaneously for each register file is restricted by the number
of register file read ports available. A data dependency with a
previous uncompleted instruction may prevent an instruction from
being issued. In addition, an instruction may be prevented from
being issued if the necessary functional unit is allocated to
another instruction.
Several instructions, such as load immediate instructions, Boolean
operations and relative conditional branches, may be issued
independently, because they may not require resources other than
register file read ports or they may potentially have no
dependencies.
The last section of RRC 112 is the register file port MUX (RPM)
section 124. The function of RPMs 124 is to provide a way for
Issuer 118 to get data out of register files 117 for each
instruction to use. RPMs 124 receive tag information via bus 126,
and the select lines for RPMs 124 come from Issuer 118 via a bus
121 and also from the computer's IEU control logic. The selected
TAGs comprise read addresses that are sent to a predetermined set
of ports 119 of register file 117 using bus 128.
The number and design of RPMs 124 depend on the number of register
files and the number of ports on each register file. One embodiment
of RPMs 124 is shown in FIG. 4. In this embodiment, RPMs 124
comprises 3 register port file MUXes 402, 404 and 406. MUX 402
receives as inputs the TAGs of instructions 0-7 corresponding to
the source register field S1 that are generated by TAL 122. MUX 404
receives as inputs the TAGs of instructions 0-7 corresponding to
the source register field $2 that are generated by TAL 122. MUX 406
receives as inputs the TAGs of instructions 0-7 corresponding to
the source/destination register field S/D that are generated by TAL
122. The outputs of MUXes 402, 404 and 406 are connected to the
read addresses ports of register file 117 via bus 128.
RRC 112 and Issuer 118 allow the processor to execute instructions
simultaneously and out of program order. An IEU for use with the
present invention is disclosed in commonly owned, co-pending
application Ser. No. 07/817,810, the disclosure of which is
incorporated herein by reference.
While various embodiments of the present invention have been
described above, it should be understood that they have been
presented by way of example, and not limitation. Thus the breadth
and scope of the present invention should not be limited by any of
the above-described exemplary embodiments, but should be defined
only in accordance with the following claims and their
equivalents.
* * * * *