U.S. patent application number 11/362763, for a compact linked-list-based multi-threaded instruction graduation buffer, was filed with the patent office on 2006-02-28 and published on 2007-08-30.
This patent application is currently assigned to MIPS Technologies, Inc. Invention is credited to Kjeld Svendsen.
Publication Number | 20070204139 |
Application Number | 11/362763 |
Family ID | 38445410 |
Filed Date | 2006-02-28 |
United States Patent Application | 20070204139 |
Kind Code | A1 |
Svendsen; Kjeld | August 30, 2007 |
Compact linked-list-based multi-threaded instruction graduation buffer
Abstract
A processor and instruction graduation unit for a processor. In
one embodiment, a processor or instruction graduation unit
according to the present invention includes a linked-list-based
multi-threaded graduation buffer and a graduation controller. The
graduation buffer stores identification values generated by an
instruction decode and dispatch unit of the processor as part of
one or more linked-list data structures. Each linked-list data
structure formed is associated with a particular program thread
running on the processor. The number of linked-list data structures
formed is variable and related to the number of program threads
running on the processor. The graduation controller includes
linked-list head identification registers and linked-list tail
identification registers that facilitate reading and writing
identification values to linked-list data structures associated
with particular program threads. The linked-list head
identification registers determine which executed instruction
result or results are next to be written to a register file.
Inventors: Svendsen; Kjeld (San Jose, CA)
Correspondence Address: STERNE, KESSLER, GOLDSTEIN & FOX P.L.L.C., 1100 NEW YORK AVENUE, N.W., WASHINGTON, DC 20005, US
Assignee: MIPS Technologies, Inc., Mountain View, CA
Family ID: 38445410
Appl. No.: 11/362763
Filed: February 28, 2006
Current U.S. Class: 712/218
Current CPC Class: G06F 9/384 20130101; G06F 9/3851 20130101; G06F 9/3867 20130101; G06F 9/30105 20130101; G06F 9/3836 20130101
Class at Publication: 712/218
International Class: G06F 9/30 20060101 G06F009/30
Claims
1. A processor, comprising: a results buffer having a plurality of
registers, each register for temporarily storing a result of an
executed instruction prior to the result being written to a
register file; a results buffer allocater that generates
identification values, wherein each identification value identifies
a register of the results buffer in which an executed instruction's
result can be temporarily stored; and a graduation buffer having a
plurality of registers, wherein identification values generated by
the results buffer allocater are temporarily stored as part of a
linked-list data structure.
2. The processor of claim 1, wherein identification values
generated by the results buffer allocater are temporarily stored as
part of a plurality of linked-list data structures, each
linked-list data structure being associated with a particular
program thread.
3. The processor of claim 1, further comprising: a graduation
controller, coupled to the graduation buffer, that includes a
linked-list head identification register and a linked-list tail
identification register.
4. The processor of claim 3, wherein the graduation controller
includes a plurality of linked-list head identification registers
and a plurality of linked-list tail identification registers.
5. The processor of claim 1, further comprising: a graduation
controller, coupled to the graduation buffer, that includes a
plurality of linked-list head identification units each having a
first linked-list head identification register and a second
linked-list head identification register, and a plurality of
linked-list tail identification units each having a first
linked-list tail identification register and a second linked-list
tail identification register.
6. The processor of claim 1, wherein the graduation controller
specifies a plurality of results to be written to the register file
in a particular clock cycle of the processor.
7. An instruction graduation unit for a processor, comprising: a
graduation buffer having a plurality of registers, wherein
identification values generated by an instruction decode and
dispatch unit of a processor are temporarily stored as part of a
linked-list data structure; and a graduation controller, coupled to
the graduation buffer, that includes a linked-list head
identification register and a linked-list tail identification
register.
8. The instruction graduation unit of claim 7, wherein
identification values generated by the instruction decode and
dispatch unit are temporarily stored as part of a plurality of
linked-list data structures, each linked-list data structure being
associated with a particular program thread.
9. The instruction graduation unit of claim 7, wherein the
graduation controller includes a plurality of linked-list head
identification registers and a plurality of linked-list tail
identification registers.
10. The instruction graduation unit of claim 7, wherein the
graduation controller specifies a plurality of results to be
written from a results buffer to a register file in a particular
clock cycle of a processor.
11. A computer readable storage medium comprising a processor
embodied in software, the processor comprising: a results buffer
having a plurality of registers, each register for temporarily
storing a result of an executed instruction prior to the result
being written to a register file; a results buffer allocater that
generates identification values, wherein each identification value
identifies a register of the results buffer in which an executed
instruction's result can be temporarily stored; and a graduation
buffer having a plurality of registers, wherein identification
values generated by the results buffer allocater are temporarily
stored as part of a linked-list data structure.
12. The computer readable storage medium of claim 11, wherein
identification values generated by the results buffer allocater are
temporarily stored as part of a plurality of linked-list data
structures, each linked-list data structure being associated with a
particular program thread.
13. The computer readable storage medium of claim 11, wherein the
processor further comprises: a graduation controller, coupled to
the graduation buffer, that includes a linked-list head
identification register and a linked-list tail identification
register.
14. The computer readable storage medium of claim 13, wherein the
graduation controller includes a plurality of linked-list head
identification registers and a plurality of linked-list tail
identification registers.
15. The computer readable storage medium of claim 11, wherein the
graduation controller specifies a plurality of results to be
written to the register file in a particular clock cycle of the
processor.
16. The computer readable storage medium of claim 11, wherein the
processor is embodied in hardware description language
software.
17. The computer readable storage medium of claim 16, wherein the
processor core is embodied in one of Verilog hardware description
language software and VHDL hardware description language
software.
18. A computer readable storage medium comprising an instruction
graduation unit for a processor embodied in software, the
instruction graduation unit comprising: a graduation buffer having
a plurality of registers, wherein identification values generated
by an instruction decode and dispatch unit of a processor are
temporarily stored as part of a linked-list data structure; and a
graduation controller, coupled to the graduation buffer, that
includes a linked-list head identification register and a
linked-list tail identification register.
19. The computer readable storage medium of claim 18, wherein
identification values generated by the instruction decode and
dispatch unit are temporarily stored as part of a plurality of
linked-list data structures, each linked-list data structure being
associated with a particular program thread.
20. The computer readable storage medium of claim 19, wherein the
graduation controller includes a plurality of linked-list head
identification registers and a plurality of linked-list tail
identification registers.
21. The computer readable storage medium of claim 18, wherein the
graduation controller specifies a plurality of results to be
written from a results buffer to a register file in a particular
clock cycle of a processor.
22. The computer readable storage medium of claim 18, wherein the
instruction graduation unit is embodied in hardware description
language software.
23. A method for controlling the order in which instruction results
are written to a register file of a processor, the method
comprising: assigning a first identification value to a first
instruction of a program thread, a second identification value to a
second instruction of the program thread, and a third
identification value to a third instruction of the program thread,
wherein the first identification value specifies where in a first
buffer of a processor a result of the first instruction is to be
written by an execution unit of the processor, the second
identification value specifies where in the first buffer a result
of the second instruction is to be written by the execution unit,
and the third identification value specifies where in the first
buffer a result of the third instruction is to be written by the
execution unit; writing the first identification value to a first
register of a second buffer of the processor, the second
identification value to a second register of the second buffer, and
the third identification value to a third register of the second
buffer, wherein the first identification value, the second
identification value, and the third identification value form part
of a linked-list data structure; and writing results stored in the
first buffer to a register file of the processor in an order
determined by the values that form the linked-list data
structure.
24. The method of claim 23, further comprising: assigning a fourth
identification value to a first instruction of a second program
thread, a fifth identification value to a second instruction of the
second program thread, and a sixth identification value to a third
instruction of the second program thread, wherein the fourth
identification value specifies where in the first buffer of the
processor a result of the first instruction of the second program
thread is to be written by the execution unit of the processor, the
fifth identification value specifies where in the first buffer a
result of the second instruction of the second program thread is to
be written by the execution unit, and the sixth identification
value specifies where in the first buffer a result of the third
instruction of the second program thread is to be written by the
execution unit; writing the fourth identification value to a fourth
register of the second buffer of the processor, the fifth
identification value to a fifth register of the second buffer, and
the sixth identification value to a sixth register of the second
buffer, wherein the fourth identification value, the fifth
identification value, and the sixth identification value form part
of a second linked-list data structure; and writing results stored
in the first buffer, for instructions belonging to the second
program thread, to the register file of the processor in an order
determined by the values that form the second linked-list data
structure.
Description
FIELD OF THE INVENTION
[0001] The present invention relates generally to processors and
more particularly to processors having an out-of-order execution
pipeline.
BACKGROUND OF THE INVENTION
[0002] Reduced Instruction Set Computer (RISC) processors are well
known. RISC processors have instructions that facilitate the use of
a technique known as pipelining. Pipelining enables a processor to
work on different steps of an instruction at the same time and
thereby take advantage of parallelism that exists among the steps
needed to execute an instruction. As a result, a processor can
execute more instructions in a shorter period of time.
Additionally, modern Complex Instruction Set Computer (CISC)
processors often translate their instructions into micro-operations
(i.e., instructions similar to those of a RISC processor) prior to
execution to facilitate pipelining.
[0003] Many pipelined processors, especially those used in the
embedded market, are relatively simple single-threaded in-order
machines. As a result, they are subject to control, structural, and
data hazard stalls. More complex processors are typically
multi-threaded processors that have out-of-order execution
pipelines. These more complex processors schedule execution of
instructions around hazards that would stall an in-order
machine.
[0004] A conventional multi-threaded out-of-order processor has
multiple dedicated buffers that are used to reorder instructions
executed out-of-order so that each instruction graduates (i.e.,
writes its result to a general purpose register file and/or other
memory) in program order. For example, a conventional N-threaded
out-of-order processor has N dedicated buffers for ensuring
instructions graduate in program order; one buffer for each thread
that can be run on the processor. A shortcoming of this approach,
for example, is that it requires a significant amount of integrated
circuit chip area to implement N separate buffers. This approach
can also degrade performance in some designs when only a single
program thread is running on a multi-threaded processor, for
example, if each of the N buffers is limited in size in order to
reduce the overall area of the N buffers.
[0005] What is needed is a processor that overcomes the limitations
noted above.
BRIEF SUMMARY OF THE INVENTION
[0006] The present invention provides a processor, an instruction
graduation unit for a processor, and applications thereof. In one
embodiment, a processor or an instruction graduation unit according
to the present invention includes a linked-list-based
multi-threaded graduation buffer and a graduation controller.
[0007] The graduation buffer is used to temporarily store
identification values generated by an instruction decode and
dispatch unit of the processor. The identification values specify
buffer registers used to temporarily store executed instruction
results until the results are written to a register file. The
identification values generated by the instruction decode and
dispatch unit are stored in the graduation buffer and form part of
one or more linked-list data structures. Each linked-list data
structure formed is associated with a particular program thread
running on the processor. Accordingly, the number of linked-list
data structures formed is variable and related to the number of
program threads running on the processor.
[0008] The graduation controller is coupled to the graduation
buffer and includes both linked-list head identification registers
and linked-list tail identification registers. The linked-list head
identification registers and the linked-list tail identification
registers facilitate reading and writing identification values
generated by the instruction decode and dispatch unit of the
processor to a linked-list data structure associated with a
particular program thread. The linked-list head identification
registers determine which executed instruction result or results
are next to be written to the register file.
[0009] Further embodiments, features, and advantages of the present
invention, as well as the structure and operation of the various
embodiments of the present invention, are described in detail below
with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES
[0010] The accompanying drawings, which are incorporated herein and
form a part of the specification, illustrate the present invention
and, together with the description, further serve to explain the
principles of the invention and to enable a person skilled in the
pertinent art to make and use the invention.
[0011] FIG. 1 is a diagram of a processor according to an
embodiment of the present invention.
[0012] FIG. 2 is a more detailed diagram of the processor of FIG.
1.
[0013] FIG. 3 is a diagram of a first embodiment of a graduation
buffer and a graduation controller according to the present
invention.
[0014] FIG. 4 is a simplified diagram of the graduation buffer and
the graduation controller of FIG. 3.
[0015] FIG. 5 is a first table illustrating operation of the
graduation buffer and the graduation controller of FIG. 3.
[0016] FIG. 6 is a second table illustrating operation of the
graduation buffer and the graduation controller of FIG. 3.
[0017] FIG. 7 is a diagram of a second embodiment of a graduation
buffer and a graduation controller according to the present
invention.
[0018] FIG. 8 is a simplified diagram of the graduation buffer and
the graduation controller of FIG. 7.
[0019] FIG. 9 is a first table illustrating operation of the
graduation buffer and the graduation controller of FIG. 7.
[0020] FIG. 10 is a second table illustrating operation of the
graduation buffer and the graduation controller of FIG. 7.
[0021] The present invention is described with reference to the
accompanying drawings. The drawing in which an element first
appears is typically indicated by the leftmost digit or digits in
the corresponding reference number.
DETAILED DESCRIPTION OF THE INVENTION
[0022] The present invention provides a processor, an instruction
graduation unit for a processor, and applications thereof. In the
detailed description of the invention that follows, references to
"one embodiment", "an embodiment", "an example embodiment", etc.,
indicate that the embodiment described may include a particular
feature, structure, or characteristic, but every embodiment may not
necessarily include the particular feature, structure, or
characteristic. Moreover, such phrases are not necessarily
referring to the same embodiment. Further, when a particular
feature, structure, or characteristic is described in connection
with an embodiment, it is submitted that it is within the knowledge
of one skilled in the art to effect such feature, structure, or
characteristic in connection with other embodiments whether or not
explicitly described.
[0023] FIG. 1 illustrates an example processor 100 according to an
embodiment of the present invention. As shown in FIG. 1, processor
100 includes an instruction fetch unit 102, an instruction cache
104, an instruction decode and dispatch unit 106, one or more
instruction execution unit(s) 108, a data cache 110, an instruction
graduation unit 112, a register file 114, and a bus interface unit
116. Processor 100 is capable of implementing multi-threading. As
used herein, multi-threading refers to an ability of an operating
system and a processor to execute different parts of a program,
called threads, simultaneously.
[0024] Instruction fetch unit 102 retrieves instructions from
instruction cache 104 and provides instructions to instruction
decode and dispatch unit 106. Instructions are retrieved in program
order, for example, for one or more program threads. In one
embodiment, instruction fetch unit 102 includes logic for recoding
compressed format instructions to a format that can be decoded and
executed by processor 100. In one embodiment, instruction fetch
unit 102 includes an instruction buffer that enables instruction
fetch unit 102 to hold multiple instructions for multiple program
threads, which are ready for decoding, and to issue more than one
instruction at a time to instruction decode and dispatch unit
106.
[0025] Instruction cache 104 is an on-chip memory array organized
as a direct associative or multi-way set associative cache such as,
for example, a 2-way set associative cache, a 4-way set associative
cache, an 8-way set associative cache, et cetera. In one
embodiment, instruction cache 104 is virtually indexed and
physically tagged, thereby allowing virtual-to-physical address
translations to occur in parallel with cache accesses. Instruction
cache 104 interfaces with instruction fetch unit 102.
[0026] Instruction decode and dispatch unit 106 receives one or
more instructions at a time from instruction fetch unit 102 and
decodes them prior to execution. In one embodiment, instruction
decode and dispatch unit 106 receives at least one instruction for
each program thread being implemented during a particular clock
cycle. As described herein, the number of program threads being
implemented at any given point in time is variable. Decoded
instructions are stored in a decoded instruction buffer and issued
to instruction execution unit(s) 108, for example, after it is
determined that selected operands are available. Instructions can
be dispatched from instruction decode and dispatch unit 106 to
instruction execution unit(s) 108 out of program order.
[0027] Instruction execution unit(s) 108 execute instructions
dispatched by instruction decode and dispatch unit 106. In one
embodiment, at least one instruction execution unit 108 implements
a load-store (RISC) architecture with single-cycle arithmetic logic
unit operations (e.g., logical, shift, add, subtract, etc.). Other
instruction execution unit(s) 108 can include, for example, a
floating point unit, a multiply-divide unit and/or other special
purpose co-processing units. In embodiments having multiple
instruction execution units 108, one or more of the units can be
implemented, for example, to operate in parallel. Instruction
execution unit(s) 108 interface with data cache 110, register file
114, and a results buffer (not shown).
[0028] Data cache 110 is an on-chip memory array. Data cache 110 is
preferably virtually indexed and physically tagged. Data cache 110
interfaces with instruction execution unit(s) 108.
[0029] Register file 114 represents a plurality of general purpose
registers, which are visible to a programmer. Each general purpose
register is a 32-bit or a 64-bit register, for example, used for
logical and/or mathematical operations and address calculations. In
one embodiment, register file 114 is part of instruction execution
unit(s) 108. Optionally, one or more additional register file sets
(not shown), such as shadow register file sets, can be included to
minimize context switching overhead, for example, during interrupt
and/or exception processing.
[0030] Bus interface unit 116 controls external interface signals
for processor 100. In one embodiment, bus interface unit 116
includes a collapsing write buffer used to merge write-through
transactions and gather writes from uncached stores. Processor 100
can include other features, and thus it is not limited to having
just the specific features described herein.
[0031] FIG. 2 is a more detailed diagram of processor 100. As
illustrated in FIG. 2, processor 100 performs four basic functions:
instruction fetch; instruction decode and dispatch; instruction
execution; and instruction graduation. These four basic functions
are illustrative and not intended to limit the present
invention.
[0032] Instruction fetch (represented in FIG. 1 by instruction
fetch unit 102) begins when a PC selector 202 selects amongst a
variety of program counter values and determines a value that is
used to fetch an instruction from instruction cache 104. In one
embodiment, the program counter value selected is the program
counter value of a new program thread, the next sequential program
counter value for an existing program thread, or a redirect program
counter value associated with a branch instruction or a jump
instruction. After each instruction is fetched, PC selector 202
selects a new value for the next instruction to be fetched.
[0033] During instruction fetch, tags associated with an
instruction to be fetched from instruction cache 104 are checked.
In one embodiment, the tags contain precode bits for each
instruction indicating instruction type. If these precode bits
indicate that an instruction is a control transfer instruction, a
branch history table is accessed and used to determine whether the
control transfer instruction is likely to branch or likely not to
branch.
[0034] In one embodiment, any compressed-format instructions that
are fetched are recoded by an optional instruction recoder 204 into
a format that can be decoded and executed by processor 100. For
example, in one embodiment in which processor 100 implements both
16-bit instructions and 32-bit instructions, any 16-bit
compressed-format instructions are recoded by instruction recoder
204 to form instructions having 32 bits. In another embodiment,
instruction recoder 204 recodes both 16-bit instructions and 32-bit
instructions to a format having more than 32 bits.
[0035] After optional recoding, instructions are written to an
instruction buffer 206. In one embodiment, this stage can be
bypassed and instructions can be dispatched directly to instruction
decoder 208.
[0036] Instruction decode and dispatch (represented in FIG. 1 by
instruction decode and dispatch unit 106) begins, for example, when
one or more instructions are received from instruction buffer 206
and decoded by an instruction decoder 208. In one embodiment,
following resolution of a branch misprediction, the ability to
receive instructions from instruction buffer 206 may be temporarily
halted until selected instructions residing within the instruction
execution portion and/or instruction graduation portion of
processor 100 are purged.
[0037] In parallel with instruction decoding, operands are renamed.
Register renaming map(s) located within instruction identification
(ID) generator and operand renamer 210 are updated and used to
determine whether required source operands are available, for
example, in register file 114 and/or a results buffer 218. A
register renaming map is a structure that holds the mapping
information between programmer visible architectural registers and
internal physical registers of processor 100. Register renaming
map(s) indicate whether data is available and where data is
available. As will be understood by persons skilled in the relevant
arts given the description herein, register renaming is used to
remove instruction output dependencies and to ensure that there is
a single producer of a given register in processor 100 at any given
time. Source registers are renamed so that data is obtained from a
producer at the earliest opportunity instead of waiting for the
processor's architectural state to be updated. In parallel with
instruction decoding, instruction ID generator and operand renamer
210 generates and assigns an instruction ID tag to each
instruction. An instruction ID tag assigned to an instruction is
used, for example, to determine the program order of the
instruction relative to other instructions. In one embodiment, each
instruction ID tag is a thread-specific sequentially generated
value that uniquely determines the program order of instructions.
The instruction ID tags can be used to facilitate graduating
instructions in program order, which were executed out of program
order.
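By way of illustration only, the renaming behavior described above
can be sketched in Python. The class RenameMap, its method names,
and the register names are hypothetical and do not appear in this
specification. The sketch assumes that a register with no map entry
is read from register file 114, and that a mapped register is read
from the results buffer entry of its most recent producer.

```python
class RenameMap:
    """Toy rename map: architectural register -> results buffer ID
    of the latest instruction that will produce that register."""

    def __init__(self):
        # A missing entry means the current value is in the register file.
        self.producer = {}

    def rename_source(self, reg):
        """Return where a source operand should be read from."""
        if reg in self.producer:
            return ("results_buffer", self.producer[reg])
        return ("register_file", reg)

    def rename_dest(self, reg, buffer_id):
        """Record buffer_id as the single current producer of reg."""
        self.producer[reg] = buffer_id

    def graduate(self, reg, buffer_id):
        """Drop the mapping only if buffer_id is still the latest producer."""
        if self.producer.get(reg) == buffer_id:
            del self.producer[reg]


rmap = RenameMap()
rmap.rename_dest("r1", 0)          # an instruction writing r1 gets buffer ID 0
source = rmap.rename_source("r1")  # a later consumer reads entry 0, not the register file
```

Keeping only the latest producer per register is what lets a
consumer obtain data at the earliest opportunity instead of waiting
for the architectural state to be updated.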
[0038] Each decoded instruction is assigned a results buffer
identification value or tag by a results buffer allocater 212. The
results buffer identification value determines the location in
results buffer 218 where instruction execution unit(s) 108 can
write calculated results for an instruction. In one embodiment, the
assignment of results buffer identification values is accomplished
using a free list. The free list contains as many entries as the
number of entries in results buffer 218. The free list can be
implemented, for example, using a bitmap. Each bit of the bitmap
can be used to indicate whether the corresponding results buffer
entry is available (e.g., if the bit has a value of one) or
unavailable (e.g., if the bit has a value of zero).
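A bitmap free list of this kind can be sketched as follows. This is
a minimal Python illustration, not the disclosed circuit: the class
name FreeList, its methods, and the policy of handing out the
lowest-numbered free entry first are assumptions made for the
example.

```python
class FreeList:
    """Bitmap free list: bit i is 1 when results buffer entry i is available."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.bitmap = (1 << num_entries) - 1   # all entries start available

    def allocate(self):
        """Return the lowest-numbered free entry ID, or None if none is free."""
        if self.bitmap == 0:
            return None
        lowest = self.bitmap & -self.bitmap    # isolate the lowest set bit
        self.bitmap &= ~lowest                 # mark the entry unavailable (bit = 0)
        return lowest.bit_length() - 1

    def release(self, entry_id):
        """Mark entry_id available again (bit = 1)."""
        self.bitmap |= 1 << entry_id
```

With four entries, four successive calls to allocate() consume the
whole bitmap, and a fifth call finds no free entry until release()
is called.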
[0039] As described in more detail below, assigned results buffer
identification values are written into a graduation buffer 224. In
one embodiment, results buffer completion bits associated with
newly renamed instructions are reset/cleared to indicate incomplete
results. As instructions complete execution, their corresponding
results buffer completion bits are set, thereby enabling the
instructions to graduate and release their associated results
buffer identification values. In one embodiment, control logic (not
shown) ensures that one program thread does not consume more than
its share of results buffer entries.
[0040] Decoded instructions are written to a decoded instruction
buffer 214. An instruction dispatcher 216 selects instructions
residing in decoded instruction buffer 214 for dispatch to
execution unit(s) 108. In embodiments, instructions can be
dispatched for execution out of program order. In one embodiment,
instructions are selected and dispatched, for example, based on
their age (ID tags) assuming that their operands are determined to
be ready.
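The age-based selection described above might look like the
following Python sketch. The names DecodedInstruction and
select_for_dispatch are hypothetical, and the sketch assumes that a
smaller ID tag always means an older instruction, which ignores any
wraparound of the tag values.

```python
from dataclasses import dataclass


@dataclass
class DecodedInstruction:
    id_tag: int           # sequentially generated tag; smaller means older here
    operands_ready: bool  # True once all source operands are available


def select_for_dispatch(decoded_buffer):
    """Pick the oldest instruction whose operands are ready, or None."""
    ready = [ins for ins in decoded_buffer if ins.operands_ready]
    if not ready:
        return None
    return min(ready, key=lambda ins: ins.id_tag)


decoded = [
    DecodedInstruction(id_tag=1, operands_ready=False),
    DecodedInstruction(id_tag=2, operands_ready=True),
    DecodedInstruction(id_tag=3, operands_ready=True),
]
chosen = select_for_dispatch(decoded)  # tag 2 is chosen ahead of the older tag 1
```

Because the oldest instruction is skipped while its operands are
pending, dispatch order can differ from program order.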
[0041] Instruction execution unit(s) 108 execute instructions as
they are dispatched. During execution, operand data is obtained as
appropriate from data cache 110, register file 114, and/or results
buffer 218. A result calculated by instruction execution unit(s)
108 for a particular instruction is written to a location/entry of
results buffer 218 specified by the instruction's associated
results buffer identification value.
[0042] Instruction graduation (represented in FIG. 1 by instruction
graduation unit 112) is controlled by a graduation controller 220.
Graduation controller 220 graduates instructions in accordance with
the results buffer identification values stored in graduation
buffer 224. When an instruction graduates, its associated result is
transferred from results buffer 218 to register file 114. In
conjunction with instruction graduation, graduation controller 220
updates, for example, the free list of results buffer allocater 212
to indicate a change in availability status of the graduating
instruction's assigned results buffer identification value.
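The graduation step can be sketched in Python as shown below. The
function name graduate and all of its parameters are hypothetical.
In particular, the dest_regs mapping from results buffer ID to
destination register, and the use of a plain set in place of the
free list, are assumptions made to keep the example self-contained.

```python
def graduate(buffer_id, results_buffer, completion_bits, dest_regs,
             register_file, free_ids):
    """Graduate the instruction that owns buffer_id if its result is complete.

    Returns True if the instruction graduated, False if it must wait.
    """
    if not completion_bits[buffer_id]:
        return False                       # result not yet written by an execution unit
    # Transfer the result from the results buffer to the register file.
    register_file[dest_regs[buffer_id]] = results_buffer[buffer_id]
    completion_bits[buffer_id] = False     # entry no longer holds a pending result
    free_ids.add(buffer_id)                # the ID becomes available for reallocation
    return True
```

Checking the completion bit first mirrors the requirement that an
instruction graduates, and releases its results buffer
identification value, only after it has completed execution.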
[0043] FIG. 3 is a diagram of a graduation controller 220a and a
graduation buffer 224a according to an embodiment of the present
invention.
In this embodiment, a single instruction is identified for
graduation by graduation controller 220a during each instruction
graduation cycle.
[0044] Graduation controller 220a includes a plurality of 2-to-1
multiplexers 302, a plurality of registers 304, and an N-to-1
multiplexer 306. Graduation controller 220a also includes a
plurality of registers 308 and an N-to-1 multiplexer 310.
Graduation buffer 224a stores one or more linked-list data
structures, each one being associated with a particular program
thread that is running on processor 100. Each of the linked-list
data structures has an associated head identification (ID) value
and an associated tail ID value.
[0045] As shown in FIG. 3, each of the 2-to-1 multiplexers 302 is
coupled to results buffer allocater 212 (see FIG. 2) and a read
data bus of graduation buffer 224a. The outputs of 2-to-1
multiplexers 302 are coupled to the inputs of registers 304. Each
register 304 stores a head ID value that is associated with a
particular linked-list data structure for a program thread. The
output of each register 304 is coupled to N-to-1 multiplexer 306.
The output of N-to-1 multiplexer 306 is coupled to a read address
bus of graduation buffer 224a and to results buffer 218. Results
buffer allocater 212 is also coupled to a write data bus of
graduation buffer 224a and to the input of each register 308. Each
register 308 stores a tail ID value that is associated with a
particular linked-list data structure for a program thread. The
output of each register 308 is coupled to N-to-1 multiplexer 310.
The output of N-to-1 multiplexer 310 is coupled to a write address
bus of graduation buffer 224a.
[0046] In an embodiment, graduation controller 220a operates as
follows. Results buffer allocater 212 assigns (allocates) a results
buffer ID value (new ID) to an instruction being decoded by
instruction decoder 208. This new ID is provided to the inputs of
2-to-1 multiplexers 302, a write data bus of graduation buffer
224a, and the inputs of registers 308. This new ID is stored by the
appropriate thread tail ID register 308 and, if appropriate, thread
head ID register 304. For example, if a first new ID value (e.g.,
buffer ID 0) is allocated by results buffer allocater 212 for an
instruction associated with program thread 1, and if graduation
buffer 224a currently does not store any ID values associated with
program thread 1, the new ID value is stored by thread head ID
register 304b and thread tail ID register 308b. If a second new ID
value (e.g., buffer ID 5) associated with program thread 1 is then
allocated before the instruction associated with the first new ID
graduates, the second new ID value (buffer ID 5) is written to a
memory location 312 (i.e., a memory location linked to buffer ID
0). Register 308b is accordingly updated to store the second new ID
(buffer ID 5) and point to the tail of the linked-list data
structure formed for program thread 1.
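The allocation behavior described above can be sketched in software. The following is a minimal illustrative model, not the circuit of FIG. 3: the class name GraduationList, its method names, and the use of None to mark an empty list are assumptions made for this sketch.

```python
class GraduationList:
    """Toy model of per-thread linked lists sharing one next-ID memory."""

    def __init__(self, num_entries, num_threads):
        # Graduation buffer: one "next ID" slot per results buffer ID.
        self.next_id = [None] * num_entries
        # Thread head ID registers (304) and tail ID registers (308).
        self.head = [None] * num_threads
        self.tail = [None] * num_threads

    def allocate(self, thread, new_id):
        """Append a newly allocated results buffer ID to a thread's list."""
        if self.head[thread] is None:
            # No list stored for this thread: the new ID becomes the head.
            self.head[thread] = new_id
        else:
            # Write the new ID at the address held by the tail register.
            self.next_id[self.tail[thread]] = new_id
        # In both cases the tail register now points at the new ID.
        self.tail[thread] = new_id


# Example from the text: thread 1 receives buffer ID 0, then buffer ID 5.
buf = GraduationList(num_entries=16, num_threads=4)
buf.allocate(1, 0)
buf.allocate(1, 5)
```

After the two calls, the head register for thread 1 holds 0, the tail register holds 5, and the next-ID slot for buffer ID 0 holds 5, which corresponds to memory location 312.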
[0047] As shown in FIG. 3, graduation buffer 224a also stores a
linked list data structure for program thread 0 and a linked list
data structure for program thread N. The linked-list data structure
stored for program thread 0 is {(6-7), (7-9)}. The linked-list data
structure stored for program thread N is {(10-N), N-1)}.
[0048] When an instruction graduates, the appropriate thread head
ID register 304 is updated to point to the new head value of the
linked-list data structure stored. For example, assume that the
next instruction to graduate is an instruction associated with
program thread 0. As can be seen by looking at thread head ID
register 304a, the calculated result for this instruction is stored
in results buffer entry 6. Thus, when the thread selection value
provided to N-to-1 multiplexer 306 selects thread 0, the output of
N-to-1 multiplexer 306 will be 6. This value (i.e., 6) is placed on
the read address bus of graduation buffer 224a, and the associated
next ID value (i.e., 7) is provided by the read data bus of
graduation buffer 224a to an input of 2-to-1 multiplexer 302a and
stored by thread 0 head ID register 304a. In a similar manner, if
the next instruction to graduate is an instruction associated with
program thread N, register 304n will be updated to store the next
ID value (i.e., 1) associated with buffer ID N.
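The head update on graduation can be sketched as follows, using the thread 0 values of FIG. 3. This is a simplified model: the function name graduate, the dictionary representation, and the rule that an emptied list is marked with None are assumptions of the sketch, since the text does not state how an empty list is encoded.

```python
def graduate(head, tail, next_id, thread):
    """Return the graduating buffer ID and advance the thread's head register."""
    graduating = head[thread]
    if graduating is None:
        return None                      # nothing to graduate for this thread
    if graduating == tail[thread]:
        # Last element graduated: the list becomes empty (assumed encoding).
        head[thread] = tail[thread] = None
    else:
        # Head ID drives the read address; read data is the next ID.
        head[thread] = next_id[graduating]
    return graduating


# Thread 0 list from FIG. 3: {(6-7), (7-9)}, head = 6, tail = 9.
next_id = {6: 7, 7: 9}
head = {0: 6}
tail = {0: 9}

first = graduate(head, tail, next_id, 0)   # buffer entry 6 graduates
head_after_first = head[0]                 # head register now holds 7
```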
[0049] As described herein, the total number of program threads
running on processor 100 at any given time is variable from one up
to a maximum number of threads (e.g., N) supported by processor
100. The number of graduation buffer entries that can be allocated
to a particular program thread is independent of the number of
threads that can run on processor 100. For example, a single thread
can be allocated all of the graduation buffer entries to achieve a
maximum single-threaded performance. This point is further
illustrated by FIG. 4.
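As an illustration of this independence, the sketch below models a free list that hands out buffer IDs without regard to the requesting thread. The class ResultsBufferAllocator, its first-in-first-out free-list order, and the owner bookkeeping are assumptions of the sketch; the text does not specify how results buffer allocater 212 orders its free list.

```python
from collections import deque


class ResultsBufferAllocator:
    """Toy free-list allocator: any free entry may go to any thread."""

    def __init__(self, num_entries):
        self.free = deque(range(num_entries))  # free list of buffer IDs
        self.owner = {}                        # buffer ID -> owning thread

    def allocate(self, thread):
        """Return a free buffer ID for the thread, or None if none is free."""
        if not self.free:
            return None
        buf_id = self.free.popleft()
        self.owner[buf_id] = thread
        return buf_id

    def release(self, buf_id):
        """Mark an entry as available again after its instruction graduates."""
        del self.owner[buf_id]
        self.free.append(buf_id)


# A single thread may take every entry; another thread must then wait.
alloc = ResultsBufferAllocator(num_entries=4)
taken = [alloc.allocate(thread=0) for _ in range(4)]
blocked = alloc.allocate(thread=1)   # None: no entry is free
alloc.release(2)                     # entry 2 graduates
reused = alloc.allocate(thread=1)    # entry 2 is reassigned to thread 1
```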
[0050] FIG. 4 illustrates the relationship between results buffer
allocater 212, results buffer 218, and graduation buffer 224a
according to an embodiment of the present invention. It also shows
the type of information stored by these components. In the example
depicted in FIG. 4, only a single program thread is running on
processor 100. Since only a single program thread (e.g., program
thread 0) is running on processor 100, the control logic required
to support multiple program threads (shown in FIG. 3) is not shown
for purposes of clarity.
[0051] As shown in FIG. 4, results buffer allocater 212 has
allocated six results buffer entries to store the results of six
instructions belonging to program thread 0. Results buffer entry 6
(represented as Buffer ID 6) has been assigned to an instruction
having instruction ID 0. Results buffer entries 0, 5, 7, 10, and N
have been assigned to instructions having instruction IDs 1, 2, 3,
4, and 5, respectively. As illustrated by these values, results
buffer allocater 212 assigns the entries of results buffer 218
independently of program threads (i.e., there is no limitation
regarding which entries of results buffer 218 can be assigned to an
instruction based on the program thread to which the instruction
belongs.)
[0052] In the example of FIG. 4, graduation buffer 224a stores a
single linked-list data structure associated with program thread 0.
The elements of the linked-list data structure are (6, 0), (0, 5),
(5, 7), (7, 10), and (10, N). The head ID value of the linked-list
data structure (6) is stored in register 304a. The tail ID value of
the linked-list data structure (N) is stored in register 308a. The
next instruction to graduate is instruction ID 0, whose calculated
resultant value (A) is stored in buffer entry 6 of results buffer
218. Upon graduation of instruction ID 0, the value A stored in
buffer entry 6 will be written to a general purpose register of
register file 114. Buffer entry 6 will then become available to be
assigned/allocated to a new instruction by results buffer allocater
212.
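The FIG. 4 example can be traced with a short sketch that walks the linked list from the head ID to the tail ID. The symbolic entry N is represented here by the string "N", and the function name graduation_order is hypothetical.

```python
def graduation_order(head, tail, next_id):
    """Follow next-ID links from head to tail and return the visit order."""
    order = [head]
    while order[-1] != tail:
        order.append(next_id[order[-1]])
    return order


# Linked-list elements of FIG. 4: (6, 0), (0, 5), (5, 7), (7, 10), (10, N).
next_id = {6: 0, 0: 5, 5: 7, 7: 10, 10: "N"}
# Values held in results buffer 218 for each buffer entry.
results = {6: "A", 0: "B", 5: "C", 7: "D", 10: "E", "N": "F"}

order = graduation_order(head=6, tail="N", next_id=next_id)
values = [results[buf_id] for buf_id in order]
```

The walk visits entries 6, 0, 5, 7, 10, and N, so the values A through F reach the register file in program order even though the buffer IDs themselves are not ordered.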
[0053] Results buffer 218 in FIG. 4 is shown storing a plurality of
values. For example, as noted above, buffer entry 6 stores the
value A. Buffer entries 0, 5, 7, 10, and N are shown storing values
B, C, D, E, and F, respectively. In one embodiment, whether or not
the stored values are valid is determined, for example, by a valid
bit stored with each entry of results buffer 218. However, bits
used to determine whether an entry is valid or not valid need not
be stored in results buffer 218. Other means for determining whether
an entry is valid or not valid can also be used.
[0054] FIG. 5 depicts a Table 1 that further illustrates operation
of processor 100. In the embodiment represented by FIG. 5,
processor 100 includes the graduation controller 220a and the
graduation buffer 224a shown, for example, in FIG. 3. As noted in
FIG. 5, Table 1 depicts an example ten-cycle clock-by-clock
progress of buffer entry allocations and graduation of values
stored in results buffer 218 for a case in which processor 100 is
executing a single program thread.
[0055] In clock cycle 1 of Table 1, results buffer allocater 212 of
instruction decode and dispatch unit 106 allocates entry 0 of
results buffer 218 to a first instruction of a program thread, for
example, program thread 0. It is assumed for this example that this
is the only buffer entry currently allocated to an instruction
belonging to program thread 0. Accordingly, there is no associated
linked-list data structure presently stored in graduation buffer
224a for program thread 0, and the thread head ID register and the
thread tail ID register do not yet contain valid values. The
allocated buffer entry ID 0 is provided to graduation controller
220a as the New ID shown, for example, in FIG. 4.
[0056] In clock cycle 2 of Table 1, as shown by arrows, graduation
controller 220a updates the thread head ID register 304 and the
thread tail ID register 308 with the buffer entry ID value 0 (i.e.,
the New ID) allocated by results buffer allocater 212 during clock
cycle 1. In clock cycle 2, as shown in FIG. 5, results buffer
allocater 212 allocates buffer entry 5 to a second instruction of
program thread 0. This value (5) is provided to graduation
controller 220a as illustrated, for example, in FIG. 4. The value 5
is stored in the next ID entry of buffer ID 0, which is the write
address specified by the value stored in tail ID register 308,
during clock cycle 3.
[0057] In clock cycle 3 of Table 1, results buffer allocater 212
allocates buffer entry 7 to a third instruction of program thread
0. As shown in FIG. 5 by arrows, the value 7 is stored in the next
ID entry of buffer ID 5, which is the write address specified by
the value stored in tail ID register 308, during clock cycle 4. As
noted above, in clock cycle 3, graduation controller 220a stores
the value 5 in the next ID entry of buffer ID 0 (see, e.g.,
location 312 of graduation buffer 224a in FIG. 3). Graduation
controller 220a also updates thread tail ID register 308 to contain
the value 5. As no instruction has yet graduated, the value of
thread head ID register 304 remains unchanged.
[0058] In clock cycle 4 of Table 1, results buffer allocater 212
allocates buffer entry 10 to a fourth instruction of program thread
0. Graduation controller 220a updates thread tail ID register 308
to contain the value 7, which was allocated by results buffer
allocater 212 in clock cycle 3. In clock cycle 4, the result stored
in entry 0 of results buffer 218 is graduated by instruction
graduation unit 112. As shown by arrows in FIG. 5, during clock
cycle 5, the value 5 stored in the Next ID entry of Buffer ID 0 of
graduation buffer 224a will be used to update head ID register
304a.
[0059] In clock cycle 5 of Table 1, results buffer allocater 212
does not allocate any buffer entry to a new instruction. This
situation might arise, for example, due to a branch misprediction
that resulted in a processing pipeline purge. During this clock
cycle, graduation controller 220a stores the value 10 in the next
ID entry of buffer ID 7 of graduation buffer 224a. As noted above,
because an instruction was graduated in the previous clock cycle,
graduation controller 220a updates thread head ID register 304 to
contain the new head value of the linked-list data structure (i.e.,
the value 5 that identifies the next instruction to be graduated by
instruction graduation unit 112). Graduation controller 220a also
updates thread tail ID register 308 to contain the value 10, which
was allocated during clock cycle 4. In clock cycle 5, the result
stored in entry 5 of results buffer 218 is graduated.
[0060] In clock cycle 6 of Table 1, the result stored in entry 7 of
results buffer 218 graduates. To reflect the fact that an
instruction graduated during clock cycle 5, graduation controller
220a updates thread head ID register 304 to contain the value 7
(i.e., the next to graduate).
[0061] In clock cycle 7 of Table 1, the result stored in entry 10
of results buffer 218 graduates. In this clock cycle, graduation
controller 220a updates thread head ID register 304 to contain the
value 10 (i.e., the next to graduate).
[0062] In clock cycle 8 of Table 1, no activity takes place.
[0063] In clock cycle 9 of Table 1, results buffer allocater 212 of
instruction decode and dispatch unit 106 allocates entry N of
results buffer 218 to a fifth instruction of program thread 0. This
value (N) is provided to graduation controller 220a and used to
update thread head ID register 304 and thread tail ID register 308
in clock cycle 10.
[0064] In clock cycle 10 of Table 1, graduation controller 220a
updates thread head ID register 304 and thread tail ID register 308
with the buffer entry ID value N allocated by results buffer
allocater 212 during clock cycle 9.
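The ten-cycle sequence described for Table 1 can be replayed with a small simulation. This is a sketch under stated assumptions: the function simulate and its trace format are hypothetical, register updates are modeled as taking effect one cycle after the allocation or graduation that causes them, and an emptied list is represented by None in both registers (the text does not state what the tail register holds once the list is empty).

```python
def simulate(allocs, grads):
    """Replay per-cycle allocations and graduations for a single thread.

    allocs[i] is the buffer ID allocated in cycle i+1, or None.
    grads[i] is True if the head entry graduates in cycle i+1.
    """
    head = tail = None
    next_id = {}
    pending_new, pending_grad = None, False
    trace = []
    for cycle, (new_id, grad) in enumerate(zip(allocs, grads), start=1):
        # Apply the graduation that happened in the previous cycle.
        if pending_grad:
            if head == tail:
                head = tail = None          # list emptied (assumed encoding)
            else:
                head = next_id[head]        # read the next ID at the head address
        # Apply the allocation that happened in the previous cycle.
        if pending_new is not None:
            if head is None:
                head = tail = pending_new   # new list: head and tail both updated
            else:
                next_id[tail] = pending_new # write the new ID at the tail address
                tail = pending_new
        graduating = head if grad else None
        trace.append((cycle, head, tail, graduating))
        pending_new, pending_grad = new_id, grad
    return trace, next_id


# Events described for cycles 1 through 10 ("N" stands for entry N).
allocs = [0, 5, 7, 10, None, None, None, None, "N", None]
grads = [False, False, False, True, True, True, True, False, False, False]
trace, next_id = simulate(allocs, grads)
```

Each trace tuple holds the cycle number, the head ID register, the tail ID register, and the entry graduating in that cycle. For example, the tuple for cycle 5 is (5, 5, 10, 5), matching the head value 5, tail value 10, and graduation of entry 5 described for that cycle.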
[0065] FIG. 6 depicts a Table 2 that further illustrates the
operation of graduation controller 220a. As noted in FIG. 6, Table
2 is a thread head ID and thread tail ID update logic table. This
logic table provides implementation information regarding
graduation controller 220a to persons skilled in the relevant
art(s).
[0066] FIG. 7 is a diagram of a graduation controller 220b and a
graduation buffer 224b in accordance with another embodiment of the
present invention. In this embodiment, two instructions (or their
resulting values stored in results buffer 218) are identified for
graduation by graduation controller 220b during each instruction
graduation cycle.
[0067] As shown in FIG. 7, graduation controller 220b includes a
plurality of thread head ID units 701 and a plurality of thread
tail ID units 703. The number of thread head ID units 701 and the
number of thread tail ID units 703 is a design choice. Each thread
head ID unit 701 is capable of holding two head ID values (head
ID-0 and head ID-1). Each thread tail ID unit 703 is capable of
holding two tail ID values (tail ID-0 and tail ID-1). The inputs to
graduation controller 220b include a new ID-0 value and a new ID-1
value generated, for example, by results buffer allocater 212.
[0068] The head ID units 701 each include a multiplexer 702 and a
register 704 that select and store a head ID-0 value. This head
ID-0 value is provided to an N-to-1 multiplexer 720a. The head ID
units 701 also each include a multiplexer 706 and a register 708
that select and store a head ID-1 value. This head ID-1 value is
provided to an N-to-1 multiplexer 720b. The interconnections of
these components are illustrated in FIG. 7.
[0069] The tail ID units 703 each include a multiplexer 712 and a
register 714 that select and store a tail ID-0 value. This tail
ID-0 value is provided to an N-to-1 multiplexer 722a. The tail ID
units 703 also each include a multiplexer 716 and a register 718
that select and store a tail ID-1 value. This tail ID-1 value is
provided to an N-to-1 multiplexer 722b. The interconnections of
these components are also illustrated in FIG. 7.
[0070] As shown in FIG. 7, graduation buffer 224b includes a
plurality of data and address buses. These buses are used to store
and to retrieve linked-list data used to determine the order in
which instructions are graduated by instruction graduation unit
112. The connections of these buses to graduation controller 220b
and to the new ID-0 value and the new ID-1 value generated, for
example, by results buffer allocater 212 are shown in FIG. 7.
[0071] In an embodiment, graduation controller 220b operates as
follows. Results buffer allocater 212 assigns (allocates) one or
two results buffer ID values (new ID-0 and new ID-1) to one or two
instructions of a program thread, respectively, during decoding by
instruction decoder 208. The new ID-0 value and the new ID-1 value
are processed by the thread tail ID unit 703 associated with the
program thread and used, if appropriate, to add one or two new
elements to a linked-list data structure residing within graduation
buffer 224b. If the new ID value(s) are associated with a program
thread for which there is no current linked-list data structure
stored within graduation buffer 224b, the new ID value(s) are
processed and stored by the appropriate register(s) 704 and 708 of
a thread head ID unit 701. When one or two instructions of a
program thread are graduated, the head ID unit associated with the
program thread is updated to store the value(s) of the next
instruction(s) of the program thread to be graduated.
[0072] To better understand the operation of graduation controller
220b and graduation buffer 224b, an example in which only a single
program thread is running on processor 100 is provided below. This
example is described with reference to FIGS. 8 and 9.
[0073] FIG. 8 is a simplified diagram of graduation controller 220b
and graduation buffer 224b. FIG. 8 represents an example
implementation in which only a single program thread (thread-0) is
running on processor 100. In particular, FIG. 8 depicts the state
of graduation controller 220b and graduation buffer 224b for clock
cycle 5 of Table 3 (see FIG. 9). Since only a single program thread
is running on processor 100 in this example, the control logic
required to support multiple program threads (shown in FIG. 7) is
not depicted for purposes of clarity.
[0074] As can be seen in FIG. 8, graduation controller 220b and
graduation buffer 224b store elements of a linked-list data
structure associated with program thread-0. The head of the
linked-list data structure (results buffer entry 10) is stored in
head ID-0 register 704a. The second element of the linked-list data
structure (results buffer entry 12) is stored in head ID-1 register
708a. The tail value of the linked-list data structure (results
buffer entry 15) is stored in tail ID-1 register 718a. The next to
the last element of the linked-list data structure (results buffer
entry 12) is stored in tail ID-0 register 714a. Based on this
information, one can discern that the elements of the linked-list
data structure are (10, 12) and (12, 15).
[0075] In the next clock cycle, if both the results stored in
results buffer entry 10 and results buffer entry 12 graduate, and
no new results buffer entry is allocated to an instruction
belonging to program thread 0, the value 15 will be read from
graduation buffer 224b and stored in head ID-0 register 704a.
Because no valid value is stored in graduation buffer 224b for
buffer ID 12, the value stored by head ID-1 register 708a will be
treated as invalid. The value 15 stored by tail ID-1 register 718a
will be transferred to tail ID-0 register 714a. The value stored by
tail ID-1 register 718a will be treated as invalid.
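The two-wide bookkeeping can be sketched in software as follows. This is an abstract, untimed model and rests on several assumptions: the class TwoWideThreadList and its count field are hypothetical, invalid registers are represented by None, and each graduation buffer entry is taken to hold the ID two positions later in program order, which is one reading of the FIG. 8 state in which the entry for buffer ID 10 holds 15.

```python
class TwoWideThreadList:
    """Toy model of one thread's list with two head and two tail registers."""

    def __init__(self):
        self.next2 = {}            # graduation buffer 224b (assumed: ID two places later)
        self.head = [None, None]   # head ID-0, head ID-1
        self.tail = [None, None]   # tail ID-0, tail ID-1
        self.count = 0             # entries in flight (bookkeeping assumption)

    def allocate(self, new_id):
        if self.count == 0:
            self.head[0] = self.tail[0] = new_id
        elif self.count == 1:
            self.head[1] = self.tail[1] = new_id
        else:
            # Link the new ID from the next-to-last element, then shift tails.
            self.next2[self.tail[0]] = new_id
            self.tail = [self.tail[1], new_id]
        self.count += 1

    def graduate(self, n):
        """Graduate one or two instructions and refill the head registers."""
        assert n in (1, 2) and n <= self.count
        old0, old1 = self.head
        done = [old0, old1][:n]
        if n == 1:
            self.head = [old1, self.next2[old0] if self.count >= 3 else None]
        else:
            self.head = [
                self.next2[old0] if self.count >= 3 else None,
                self.next2[old1] if self.count >= 4 else None,
            ]
        self.count -= n
        if self.count == 1:
            self.tail = [self.head[0], None]   # single element: tail ID-0 only
        elif self.count == 0:
            self.tail = [None, None]
        return done


# Build the FIG. 8 state, then graduate entries 10 and 12 together.
t = TwoWideThreadList()
for new_id in (0, 5, 7):
    t.allocate(new_id)
t.graduate(2)                  # entries 0 and 5
t.allocate(10)
t.allocate(12)
t.graduate(1)                  # entry 7
t.allocate(15)
fig8 = (list(t.head), list(t.tail))
done = t.graduate(2)           # entries 10 and 12
```

In this model the state before the final call is head IDs 10 and 12 and tail IDs 12 and 15. After both entries graduate, head ID-0 and tail ID-0 hold 15 and the ID-1 registers are invalid, as described above.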
[0076] In the next clock cycle, if only the result stored in
results buffer entry 10 is graduated, and no new results buffer
entry is allocated to an instruction belonging to program thread 0,
the value 12 stored by head ID-1 register 708a will be transferred
to head ID-0 register 704a, and the value 15 will be read from
graduation buffer 224b and stored in head ID-1 register 708a.
Because no valid value is stored in graduation buffer 224b for
buffer ID 12, the value stored by head ID-1 register 708a will be
treated as invalid. The value 15 stored by tail ID-1 register 718a
will be transferred to tail ID-0 register 714a. The value stored by
tail ID-1 register 718a will be treated as invalid.
[0077] A more detailed explanation of the operation of graduation
controller 220b and graduation buffer 224b is illustrated by Table
3 of FIG. 9.
[0078] FIG. 9 depicts a Table 3 that further illustrates the
operation of graduation controller 220b and graduation buffer 224b.
As noted in FIG. 9, Table 3 depicts an example eight-cycle
clock-by-clock progress of buffer entry allocation and graduation
of values stored in results buffer 218, for a case in which
processor 100 is executing a single program thread.
[0079] In clock cycle 1 of Table 3, results buffer allocater 212 of
instruction decode and dispatch unit 106 allocates entry 0 of
results buffer 218 to a first instruction of a program thread, for
example, program thread 0. This allocated buffer entry ID (e.g.,
New ID-0 shown in FIG. 8) is provided to graduation controller
220b. It is assumed for this example that this is the only buffer
entry currently allocated to program thread 0. Thus, there is no
associated linked-list data structure presently stored by
graduation controller 220b and graduation buffer 224b for program
thread 0, and the thread head ID unit 701a and the thread tail ID
unit 703a do not yet contain valid values.
[0080] In clock cycle 2 of Table 3, as shown by arrows, graduation
controller 220b updates thread head ID-0 register 704a and thread
tail ID-0 register 714a with buffer entry ID value 0, which was
allocated by results buffer allocater 212 during clock cycle 1. As
shown in FIG. 9, in clock cycle 2, results buffer allocater 212
allocates buffer entry 5 to a second instruction of program thread
0 and buffer entry 7 to a third instruction of program thread 0.
These values, as shown by arrows in FIG. 9, are used to update head
ID-1 register 708a, tail ID-0 register 714a, and tail ID-1 register
718a in clock cycle 3. No instructions are graduated during this
clock cycle.
[0081] In clock cycle 3 of Table 3, results buffer allocater 212
allocates buffer entry 10 to a fourth instruction of program thread
0 and buffer entry 12 to a fifth instruction of program thread 0.
During this clock cycle, graduation controller 220b stores the
value 7 in the next ID entry of buffer ID 0 of graduation buffer
224b, which was the address pointed to by tail ID-0 register 714a
during the previous clock cycle. Graduation controller 220b updates
head ID-1 register 708a and thread tail ID-0 register 714a to
contain the value 5. Tail ID-1 register 718a is updated to hold the
value 7. In clock cycle 3, the results stored in entries 0 and 5 of
results buffer 218 are graduated by instruction graduation unit
112.
[0082] In clock cycle 4 of Table 3, results buffer allocater 212
allocates buffer entry 15 to a sixth instruction of program thread
0. During this clock cycle, graduation controller 220b stores the
values 10 and 12 in the next ID entries of buffer IDs 5 and 7,
respectively, of graduation buffer 224b. Graduation controller 220b
updates head ID-0 register 704a to contain the value 7 read from
buffer ID entry 0 of graduation buffer 224b. Graduation controller
220b also updates head ID-1 register 708a and thread tail ID-0
register 714a to contain the value 10, and thread tail ID-1
register 718a to contain the value 12. In clock cycle 4, the result
stored in entry 7 of results buffer 218 is graduated by instruction
graduation unit 112.
[0083] In clock cycle 5 of Table 3, results buffer allocater 212
allocates buffer entry 21 to a seventh instruction of program
thread 0 and buffer entry 22 to an eighth instruction of program
thread 0. During this clock cycle, graduation controller 220b
stores the value 15 in the next ID entry of buffer ID 10 of
graduation buffer 224b. Graduation controller 220b updates head
ID-0 register 704a to contain the value 10 read from head ID-1
register 708a. Graduation controller 220b updates head ID-1
register 708a to contain the value 12 read from buffer ID entry 7.
Graduation controller 220b updates tail ID-0 register 714a to
contain the value 12 read from tail ID-1 register 718a. Graduation
controller 220b updates tail ID-1 register 718a to contain the
value 15 provided by results buffer allocater 212 as a new ID-0
value during clock cycle 4. In clock cycle 5, the results stored in
entries 10 and 12 of results buffer 218 are graduated by
instruction graduation unit 112. It is this logic state of
graduation controller 220b and graduation buffer 224b that is
depicted in FIG. 8.
[0084] In clock cycle 6 of Table 3, results buffer allocater 212
allocates buffer entry N to a ninth instruction of program thread
0. As shown by arrows in FIG. 9, the value N is stored in the next
ID entry of buffer ID 21, which is the write address specified by
the value stored in tail ID-0 register 714a, during clock cycle 7.
In clock cycle 6, graduation controller 220b updates head ID-0
register 704a to contain the value 15 read from buffer ID entry 10.
Graduation controller 220b updates head ID-1 register 708a and tail
ID-0 register 714a to contain the value 21 allocated by results
buffer allocater 212 in clock cycle 5. Graduation controller 220b
updates tail ID-1 register 718a to contain the value 22 provided by
results buffer allocater 212 as a new ID-1 value in clock cycle 5.
In clock cycle 6, the instruction result stored in entry 15 of
results buffer 218 is graduated by instruction graduation unit
112.
[0085] In clock cycle 7 of Table 3, graduation controller 220b
updates head ID-0 register 704a to contain the value 21. Graduation
controller 220b updates head ID-1 register 708a and tail ID-0
register 714a to contain the value 22. Graduation controller 220b
updates tail ID-1 register 718a to contain the value N. During this
clock cycle, the results stored in entries 21 and 22 of results
buffer 218 are graduated by instruction graduation unit 112.
[0086] In clock cycle 8 of Table 3, graduation controller 220b
updates head ID-0 register 704a and tail ID-0 register 714a to
contain the value N. In this clock cycle, the instruction result
stored in entry N of results buffer 218 is graduated by instruction
graduation unit 112.
[0087] FIG. 10 depicts a Table 4 that also illustrates the
operation of graduation controller 220b. As noted in FIG. 10, Table
4 is an example thread head ID and thread tail ID update logic
table. This logic table provides example state and implementation
information regarding the various inputs and outputs of graduation
controller 220b. For purposes of brevity and clarity, only a few of
the row entries are shown in Table 4. A person skilled in the
relevant art(s) will be able to populate all of the row entries of
Table 4 given the description of the present invention provided
herein.
[0088] While various embodiments of the present invention have been
described above, it should be understood that they have been
presented by way of example, and not limitation. It will be
apparent to persons skilled in the relevant computer arts that
various changes in form and detail can be made therein without
departing from the spirit and scope of the invention. Furthermore,
it should be appreciated that the detailed description of the
present invention provided herein, and not the summary and abstract
sections, is intended to be used to interpret the claims. The
summary and abstract sections may set forth one or more but not all
exemplary embodiments of the present invention as contemplated by
the inventors.
[0089] For example, in addition to implementations using hardware
(e.g., within or coupled to a Central Processing Unit ("CPU"),
microprocessor, microcontroller, digital signal processor,
processor core, System on Chip ("SOC"), or any other programmable
or electronic device), implementations may also be embodied in
software (e.g., computer readable code, program code, instructions
and/or data disposed in any form, such as source, object or machine
language) disposed, for example, in a computer usable (e.g.,
readable) medium configured to store the software. Such software
can enable, for example, the function, fabrication, modeling,
simulation, description, and/or testing of the apparatus and
methods described herein. For example, this can be accomplished
through the use of general programming languages (e.g., C, C++),
GDSII databases, hardware description languages (HDL) including
Verilog HDL, VHDL, SystemC Register Transfer Level (RTL) and so on,
or other available programs, databases, and/or circuit (i.e.,
schematic) capture tools. Such software can be disposed in any
known computer usable medium including semiconductor, magnetic
disk, optical disk (e.g., CD-ROM, DVD-ROM, etc.) and as a computer
data signal embodied in a computer usable (e.g., readable)
transmission medium (e.g., carrier wave or any other medium
including digital, optical, or analog-based medium). As such, the
software can be transmitted over communication networks including
the Internet and intranets.
[0090] It is understood that the apparatus and method embodiments
described herein may be included in a semiconductor intellectual
property core, such as a microprocessor core (e.g., embodied in
HDL) and transformed to hardware in the production of integrated
circuits. Additionally, the apparatus and methods described herein
may be embodied as a combination of hardware and software. Thus,
the present invention should not be limited by any of the
above-described exemplary embodiments, but should be defined only
in accordance with the following claims and their equivalents.
* * * * *