U.S. patent application number 12/354889, for a multithreaded microprocessor with register allocation based on number of active threads, was filed with the patent office on 2009-01-16 and published on 2009-07-16. This patent application is currently assigned to INTEL CORPORATION. Invention is credited to Debra Bernstein, Mark B. Rosenbluth, and Gilbert Wolrich.

Application Number: 12/354889
Publication Number: 20090182989
Family ID: 26907635
Filed: 2009-01-16
Published: 2009-07-16
United States Patent Application 20090182989
Kind Code: A1
Rosenbluth; Mark B.; et al.
July 16, 2009

MULTITHREADED MICROPROCESSOR WITH REGISTER ALLOCATION BASED ON NUMBER OF ACTIVE THREADS
Abstract
A mechanism in a multithreaded processor to allocate resources
based on configuration information indicating how many threads are
in use.
Inventors: Rosenbluth; Mark B.; (Uxbridge, MA); Wolrich; Gilbert; (Framingham, MA); Bernstein; Debra; (Sudbury, MA)
Correspondence Address: FISH & RICHARDSON, PC, P.O. BOX 1022, MINNEAPOLIS, MN 55440-1022, US
Assignee: INTEL CORPORATION, Santa Clara, CA
Family ID: 26907635
Appl. No.: 12/354889
Filed: January 16, 2009
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
10/212,945 | Aug 5, 2002 | 7,487,505
12/354,889 | |
60/315,144 | Aug 27, 2001 |
Current U.S. Class: 712/220; 712/E9.025; 712/E9.027; 712/E9.053; 712/E9.071
Current CPC Class: G06F 9/30098 20130101; G06F 9/3851 20130101; G06F 9/3885 20130101; G06F 9/30123 20130101
Class at Publication: 712/220; 712/E09.025; 712/E09.027; 712/E09.053; 712/E09.071
International Class: G06F 15/00 20060101 G06F015/00
Claims
1. A method of allocating resources in a multithreaded processor
comprising: providing resources for use by execution threads
supported by the multithreaded processor; and applying
configuration information to a selection of the resources to
allocate the resources among active ones of the execution
threads.
2. The method of claim 1 wherein the resources comprise: registers
in a general purpose register file.
3. The method of claim 1 wherein the configuration information
comprises: a configuration bit which when cleared indicates all of
the supported execution threads as the active ones and when set
indicates a portion of the supported execution threads as the
active ones.
4. The method of claim 1 wherein the configuration information
comprises: a configuration bit which when cleared indicates all of
the supported execution threads as the active ones and when set
indicates half of the supported execution threads as the active
ones.
5. The method of claim 3, wherein the configuration bit resides in
a control and status register.
6. The method of claim 2 wherein the general purpose register file
includes an address decode portion and a multiplexor coupled to the
address decode portion, the multiplexor to receive a thread number
and a register number as inputs and to select bits of the thread
number and the register number based on the configuration
information to form an address corresponding to one of the
registers.
7. The method of claim 6 wherein the configuration information
indicates selection of all but the least significant bit of the
thread number and all bits of the register number.
8. The method of claim 6 wherein the configuration information
indicates selection of all but the most significant bit of the
register number and all bits of the thread number.
9. The method of claim 6 wherein the selected bits of the register
number form a thread-relative register number.
10. A processor comprising: resources for use by execution threads
supported by the processor; and a resource selector to receive
configuration information and to allocate the resources among
active ones of the execution threads based on the configuration
information.
11. The processor of claim 10 wherein the resources comprise:
registers in a general purpose register file.
12. The processor of claim 10 wherein the configuration information
comprises: a configuration bit which when cleared indicates all of
the supported execution threads as the active ones and when set
indicates a portion of the supported execution threads as the
active ones.
13. The processor of claim 10 wherein the configuration information
comprises: a configuration bit which when cleared indicates all of
the supported execution threads as the active ones and when set
indicates half of the supported execution threads as the active
ones.
14. The processor of claim 12, wherein the configuration bit
resides in a control and status register.
15. The processor of claim 11 wherein the general purpose register
file includes an address decode portion and the resource selector
is a multiplexor coupled to the address decode portion, the
multiplexor to receive a thread number and a register number as
inputs and to select bits of the thread number and the register
number based on the configuration information to form an address
corresponding to one of the registers.
16. The processor of claim 15 wherein the configuration information
indicates selection of all but the least significant bit of the
thread number and all bits of the register number.
17. The processor of claim 15 wherein the configuration information
indicates selection of all but the most significant bit of the
register number and all bits of the thread number.
18. The processor of claim 15 wherein the selected bits of the
register number form a thread-relative register number.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation application and claims
priority to U.S. application Ser. No. 10/212,945, filed on Aug. 5,
2002, which in turn claims priority from U.S. Provisional Patent
Application Ser. No. 60/315,144, filed Aug. 27, 2001. The contents
of these applications are incorporated herein in their
entirety.
BACKGROUND
[0002] Typically, hardware implementations of multithreaded
microprocessors provide for use by each thread a fixed number of
resources, such as registers, program counters, and so forth.
Depending on the amount of parallelism in an application program
executing on the microprocessor, some of the threads may not be
used. Consequently, the resources of the unused threads and, more
specifically, the power and silicon area consumed by those
resources, are wasted.
DESCRIPTION OF DRAWINGS
[0003] FIG. 1 shows a block diagram of a communication system
employing a processor having multithreaded microengines to support
multiple threads of execution.
[0004] FIG. 2 shows a block diagram of the microengine (of FIG.
1).
[0005] FIG. 3 shows a microengine Control and Status Register (CSR)
used to select a number of "in use" threads.
[0006] FIG. 4 shows a schematic diagram of a dual-bank
implementation of a General Purpose Registers (GPR) file (of the
microengine of FIG. 2) that uses a selected number of "in use"
threads to allocate registers to threads.
[0007] FIG. 5 shows a table of thread GPR allocations for eight "in
use" threads and four "in use" threads.
[0008] FIGS. 6A and 6B show the partition of registers in the GPR
file in accordance with the thread GPR allocations for eight "in
use" threads and four "in use" threads, respectively.
DETAILED DESCRIPTION
[0009] Referring to FIG. 1, a communication system 10 includes a
processor 12 coupled to one or more I/O devices, for example,
network devices 14 and 16, as well as a memory system 18. The
processor 12 is a multithreaded processor and, as such, is
especially useful for tasks that can be broken into parallel
subtasks or functions. In one embodiment, as shown in the figure,
the processor 12 includes multiple microengines 20, each with
multiple hardware controlled program threads 22 that can be
simultaneously active and independently work on a task. In the
example shown, there are "n" microengines 20, and each of the
microengines 20 is capable of processing multiple program threads
22, as will be described more fully below. In the described
embodiment, the maximum number "N" of context threads supported is
eight, but other maximum amounts could be provided. Preferably, each
of the microengines 20 is connected to and can communicate with
adjacent microengines.
[0010] The processor 12 also includes a processor 24 that assists
in loading microcode control for other resources of the processor
12 and performs other general-purpose computer type functions such
as handling protocols and exceptions. In network processing
applications, the processor 24 can also provide support for higher
layer network processing tasks that cannot be handled by the
microengines 20. In one embodiment, the processor 24 is a StrongARM
(ARM is a trademark of ARM Limited, United Kingdom) core based
architecture. The processor (or core) 24 has an operating system
through which the processor 24 can call functions to operate on the
microengines 20. The processor 24 can use any supported operating
system, preferably a real-time operating system. Other processor
architectures may be used.
[0011] The microengines 20 each operate with shared resources
including the memory system 18, a PCI bus interface 26, an I/O
interface 28, a hash unit 30 and a scratchpad memory 32. The PCI
bus interface 26 provides an interface to a PCI bus (not shown).
The I/O interface 28 is responsible for controlling and interfacing
the processor 12 to the network devices 14, 16. The memory system
18 includes a Dynamic Random Access Memory (DRAM) 34, which is
accessed using a DRAM controller 36 and a Static Random Access
Memory (SRAM) 38, which is accessed using an SRAM controller 40.
Although not shown, the processor 12 would also include a
nonvolatile memory to support boot operations. The DRAM 34 and DRAM
controller 36 are typically used for processing large volumes of
data, e.g., processing of payloads from network packets. In a
networking implementation, the SRAM 38 and SRAM controller 40 are
used for low latency, fast access tasks, e.g., accessing look-up
tables, memory for the processor 24, and so forth. The microengines
20 can execute memory reference instructions to either the DRAM
controller 36 or the SRAM controller 40.
[0012] The devices 14 and 16 can be any network devices capable of
transmitting and/or receiving network traffic data, such as
framing/MAC devices, e.g., for connecting to 10/100BaseT Ethernet,
Gigabit Ethernet, ATM or other types of networks, or devices for
connecting to a switch fabric. For example, in one arrangement, the
network device 14 could be an Ethernet MAC device (connected to an
Ethernet network, not shown) that transmits packet data to the
processor 12 and device 16 could be a switch fabric device that
receives processed packet data from processor 12 for transmission
onto a switch fabric. In such an implementation, that is, when
handling traffic to be sent to a switch fabric, the processor 12
would be acting as an ingress network processor. Alternatively, the
processor 12 could operate as an egress network processor, handling
traffic that is received from a switch fabric (via device 16) and
destined for another network device such as network device 14, or a
network coupled to such a device. Although the processor 12 can
operate in a standalone mode, supporting both traffic directions,
it will be understood that, to achieve higher performance, it may
be desirable to use two dedicated processors, one as an ingress
processor and the other as an egress processor. The two dedicated
processors would each be coupled to the devices 14 and 16. In
addition, each network device 14, 16 can include a plurality of
ports to be serviced by the processor 12. The I/O interface 28
therefore supports one or more types of interfaces, such as an
interface for packet and cell transfer between a PHY device and a
higher protocol layer (e.g., link layer), or an interface between a
traffic manager and a switch fabric for Asynchronous Transfer Mode
(ATM), Internet Protocol (IP), Ethernet, and similar data
communications applications. The I/O interface 28 includes separate
receive and transmit blocks, each being separately configurable for
a particular interface supported by the processor 12.
[0013] Other devices, such as a host computer and/or PCI
peripherals (not shown), which may be coupled to a PCI bus
controlled by the PCI bus interface 26 are also serviced by the
processor 12.
[0014] In general, as a network processor, the processor 12 can
interface to any type of communication device or interface that
receives/sends large amounts of data. The processor 12 functioning
as a network processor could receive units of packet data from a
network device like network device 14 and process those units of
packet data in a parallel manner, as will be described. The unit of
packet data could include an entire network packet (e.g., Ethernet
packet) or a portion of such a packet, e.g., a cell or packet
segment.
[0015] Each of the functional units of the processor 12 is coupled
to an internal bus structure 42. Memory busses 44a, 44b couple the
memory controllers 36 and 40, respectively, to respective memory
units DRAM 34 and SRAM 38 of the memory system 18. The I/O
Interface 28 is coupled to the devices 14 and 16 via separate I/O
bus lines 46a and 46b, respectively.
[0016] Referring to FIG. 2, an exemplary one of the microengines 20
is shown. The microengine (ME) 20 includes a control unit 50 that
includes a control store 51, control logic (or microcontroller) 52
and a context arbiter/event logic 53. The control store 51 is used
to store a microprogram. The microprogram is loadable by the
processor 24.
[0017] The microcontroller 52 includes an instruction decoder and
program counter units for each of the supported threads. The
context arbiter/event logic 53 receives messages (e.g., SRAM event
response) from each one of the shared resources, e.g., SRAM 38, DRAM
34, or processor core 24, and so forth. These messages provide
information on whether a requested function has completed.
[0018] The context arbiter/event logic 53 performs arbitration among
the eight threads. In one embodiment, the arbitration is a round robin
mechanism. However, other arbitration techniques, such as priority
queuing or weighted fair queuing, could be used.
[0019] The microengine 20 also includes an execution datapath 54
and a general purpose register (GPR) file unit 56 that is coupled
to the control unit 50. The datapath 54 includes several datapath
elements, e.g., and as shown, a first datapath element 58, a second
datapath element 59 and a third datapath element 60. The datapath
elements can include, for example, an ALU and a multiplier. The GPR
file unit 56 provides operands to the various datapath elements.
The registers of the GPR file unit 56 are read and written
exclusively under program control. GPRs, when used as a source in
an instruction, supply operands to the datapath 54. When used as a
destination in an instruction, they are written with the result of
the datapath 54. The instruction specifies the register number of
the specific GPRs that are selected for a source or destination.
Opcode bits in the instruction provided by the control unit 50
select which datapath element is to perform the operation defined
by the instruction.
[0020] The microengine 20 further includes a write transfer
register file 62 and a read transfer register file 64. The write
transfer register file 62 stores data to be written to a resource
external to the microengine (for example, the DRAM memory or SRAM
memory). The read transfer register file 64 is used for storing
return data from a resource external to the microengine 20.
Subsequent to or concurrent with the data arrival, event signals 65
from the respective shared resource, e.g., memory controllers 36,
40, or core 24, can be provided to alert the thread that requested
the data that the data is available or has been sent. Both of the
transfer register files 62, 64 are connected to the datapath 54,
the GPR file unit 56, as well as the control unit 50.
[0021] Also included in the microengine 20 is a local memory 66.
The local memory 66, which is addressed by registers 68a, 68b, also
supplies operands to the datapath 54. The local memory 66 receives
results from the datapath 54 as a destination. The microengine 20
also includes local control and status registers (CSRs) 70 for
storing local inter-thread and global event signaling information,
as well as other information, and a CRC unit 72, coupled to the
transfer registers, which operates in parallel with the execution
datapath 54 and performs CRC computations for ATM cells. The local
CSRs 70 and the CRC unit 72 are coupled to the transfer registers,
the datapath 54 and the GPR file unit 56.
[0022] In addition to providing an output to the write transfer
unit 62, the datapath 54 can also provide an output to the GPR file
56 over line 80. Thus, each of the datapath elements can return a
result value from an executed instruction.
[0023] The functionality of the microengine threads 22 is
determined by microcode loaded (via the core processor 24) for a
particular user's application into each microengine's control store
51. For example, in one exemplary thread task assignment, one
thread is assigned to serve as a receive scheduler thread and
another as a transmit scheduler thread, a plurality of threads are
configured as receive processing threads and transmit processing
threads, and other thread task assignments include a transmit
arbiter and one or more core communication threads. Once launched,
a thread performs its function independently.
[0024] Referring to FIG. 3, the CSRs 70 include a context enable
register ("CTX_Enable") 90, which includes an "in use" contexts
field 92 to indicate a pre-selected number of threads or contexts
in use. The "in use" contexts field 92 stores a single bit, which
when cleared (X=0) indicates all of the 8 available threads are in
use, and which when set (X=1) indicates that only a predefined
number, e.g., 4, more specifically, threads 0, 2, 4 and 6, are in
use.
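The semantics of this single configuration bit can be sketched as follows (a minimal illustrative model; the function name and Python representation are assumptions, not part of the disclosure):

```python
def active_threads(in_use_bit: int) -> list:
    """Model of the "in use" contexts field 92 of the CTX_Enable register.

    Cleared (0): all 8 supported threads are in use.
    Set (1): only the four even-numbered threads 0, 2, 4, 6 are in use.
    """
    if in_use_bit == 0:
        return list(range(8))
    return [0, 2, 4, 6]
```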
[0025] As shown in FIG. 4, the GPRs of the GPR file unit 56 may be
physically and logically contained in two banks, an A bank 56a and
a B bank 56b. The GPRs in both banks include a data portion 100 and
an address portion 102. Coupled to each register address path 102
is a multiplexor 104, which receives as inputs a thread number 106
and a register number 108 (from the instruction) from the control
unit 50. The output of the multiplexor 104, that is, the form of
the "address" provided to the address path 102 to select one of the
registers 109, is controlled by an enable signal 110. The state of
the enable signal 110 is determined by the setting of the "In_Use"
Contexts bit in the field 92 of the CTX_Enable register 90.
[0026] Conventionally, each thread has a fixed percentage of the
registers allocated to it, for example, one-eighth for the case of
eight threads supported. If some threads are not used, the
registers dedicated for use by those unused threads go unused as
well.
[0027] In contrast, the use of the multiplexor 104 controlled by
"in use" contexts configuration information in the CTX_Enable CSR
90 enables a re-partitioning of the number of bits of active thread
number/instruction (register number) bits in the register address
and therefore a re-allocation of registers to threads. More
specifically, when the bit in field 92 is equal to a "0", the
number of "in use" threads is 8, and the enable 110 controls the
multiplexor 104 to select all of the bits of the active thread
number 106 and all but the most significant bit from the register
number 108 specified by the current instruction. Conversely, when
the bit in field 92 is set to a "1", the number of "in use" threads
is reduced by half, and the number of registers available for
allocation is redistributed so that the number of registers
allocated per thread is doubled.
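The bit selection performed by the multiplexor can be modeled in software as follows (an illustrative sketch assuming a 32-register bank, a 3-bit thread number, and a 3-bit instruction register number; the function name is hypothetical):

```python
def gpr_address(thread: int, reg: int, in_use_bit: int) -> int:
    """Form the 5-bit GPR bank address from thread and register numbers.

    in_use_bit == 0 (8 active threads): concatenate all three thread
    number bits with all but the most significant register-number bit.
    in_use_bit == 1 (4 active threads): concatenate all but the least
    significant thread-number bit with all three register-number bits.
    """
    if in_use_bit == 0:
        return ((thread & 0b111) << 2) | (reg & 0b011)
    return ((thread & 0b110) << 2) | (reg & 0b111)
```

For example, with eight threads in use, thread 2, register 0 maps to address 8; with four threads in use, thread 2, register 0 also maps to address 8, but thread 2 now spans addresses 8 through 15.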
[0028] FIG. 5 shows the thread allocation for a register file of 32
registers. For 8 threads, thread numbers 0 through 7, each thread
is allocated a total of four registers. For 4 threads, thread
numbers 0, 2, 4 and 6, each thread is allocated a total of eight
registers.
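The FIG. 5-style allocation can be reproduced programmatically (a sketch under the same 32-register assumption; the names are illustrative):

```python
def allocation(in_use_bit: int) -> dict:
    """Absolute GPR numbers allocated to each active thread (32-register bank)."""
    if in_use_bit == 0:
        # 8 threads, 4 registers each
        return {t: list(range(t * 4, t * 4 + 4)) for t in range(8)}
    # 4 threads (0, 2, 4, 6), 8 registers each
    return {t: list(range((t // 2) * 8, (t // 2) * 8 + 8)) for t in (0, 2, 4, 6)}
```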
[0029] FIGS. 6A and 6B show a register file (single bank, for
example, register file 56a) having 32 registers available for
thread allocation and re-allocation among a maximum of eight
supported threads. In an 8-thread configuration 120, that is, the
case of eight threads in use, shown in FIG. 6A, each of the threads
is allocated four registers. The multiplexor 104 selects all three
bits of the binary representation of the thread number and all bits
except the most significant bit (that is, selects two bits (bits 0
and 1)) of the binary representation of the register number from
the instruction because the enable 110 is low. For a 4-thread
configuration 122, that is, when enable 110 is high and thus four
threads, as illustrated in FIG. 6B, each of the four threads is
allocated eight registers. The multiplexor 104 selects all but the
least significant bit (in this case, selects two bits, bits 1 and
2) of the binary representation of the thread number and selects
all three bits (bits 0-2) of the binary representation of the
register number from the instruction. Thus, the address into the
register file is a concatenation of bits of the currently active
thread number with bits of the register number from the
instruction, and the contributing number of bits from each is
determined by the setting of the In_Use contexts bit 92 in the
CTX_Enable register 90 (from FIG. 3).
[0030] Thus, the GPRs are logically subdivided in equal regions
such that each context has relative access to one of the regions.
The number of regions is configured in the In_Use contexts field
92, and can be either 4 or 8. Thus, a context-relative register
number is actually associated with multiple different physical
registers. The actual register to be accessed is determined by the
context making the access request, that is, the context number
concatenated with the register number, in the manner described
above. Context-relative addressing is a powerful feature that
enables eight or four different threads to share the same code
image, yet maintain separate data. Thus, instructions specify the
context-relative address (register number). For eight active
contexts, the instruction always specifies registers in the range
of 0-3. For four active contexts, the instruction always specifies
registers in the range of 0-7.
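The mapping from a context-relative register number to an absolute one can be expressed as follows (an illustrative sketch, assuming a 32-register bank and the two region counts named above):

```python
def absolute_gpr(context: int, rel_reg: int, n_regions: int) -> int:
    """Map a context-relative register number to an absolute GPR number.

    n_regions == 8: 4 registers per region, one region per thread.
    n_regions == 4: 8 registers per region, even threads 0, 2, 4, 6 only.
    """
    regs_per_region = 32 // n_regions
    region = context if n_regions == 8 else context // 2
    return region * regs_per_region + rel_reg
```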
[0031] Referring back to the table shown in FIG. 5, the absolute
GPR register number is the register number that is actually used by
the register address path (decode logic) to access the specific
context-relative register. For example, with 8 active contexts,
context-relative register 0 for context (or thread) 2 is absolute
register 8.
[0032] The above thread GPR allocation scheme can be extended to
different numbers of threads (based on multiples of 2) and
registers, for example, re-allocating a total of 128 registers from
among a maximum number of 8 "in use" threads (16 registers each) to
4 "in use" threads (32 registers each), or re-allocating a total of
128 registers from among a maximum number of 16 "in use" threads (8
registers each) to 8 "in use" threads (16 registers each).
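The arithmetic behind this extension is straightforward (an illustrative helper; the name and signature are assumptions):

```python
def regs_per_thread(total_regs: int, max_threads: int, halved: bool) -> int:
    """Registers available to each active thread after optional halving."""
    active = max_threads // 2 if halved else max_threads
    return total_regs // active
```

With 128 registers, halving 8 threads to 4 doubles each thread's allocation from 16 to 32 registers; halving 16 threads to 8 doubles it from 8 to 16.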
[0033] Other embodiments are within the scope of the following
claims.
* * * * *