U.S. patent application number 13/230131 was filed with the patent office on 2011-09-12 and published on 2012-11-15 for lookahead priority collection to support priority elevation.
Invention is credited to Abhijeet Ashok Chachad, Raguram Damodaran, Ramakrishnan Venkatasubramanian, Joseph Raymond Michael Zbiciak.
Publication Number | 20120290755 |
Application Number | 13/230131 |
Document ID | / |
Family ID | 45870030 |
Filed Date | 2011-09-12 |
United States Patent Application | 20120290755 |
Kind Code | A1 |
Chachad; Abhijeet Ashok; et al. |
November 15, 2012 |
Lookahead Priority Collection to Support Priority Elevation
Abstract
A queuing requester for access to a memory system. Transaction
requests are received from two or more requestors for access to
the memory system. Each transaction request includes an
associated priority value. A request queue of the received
transaction requests is formed in the queuing requester. A
highest priority value of all pending transaction requests within
the request queue is determined. An elevated priority value is
selected when the highest priority value is higher than the
priority value of an oldest transaction request in the request
queue; otherwise the priority value of the oldest transaction
request is selected. The oldest transaction request in the
request queue is then provided to the memory system with the
selected priority value. An arbitration contest with other
requesters for access to the memory system uses the selected
priority value.
Inventors: |
Chachad; Abhijeet Ashok;
(Plano, TX) ; Damodaran; Raguram; (Plano, TX)
; Venkatasubramanian; Ramakrishnan; (Plano, TX) ;
Zbiciak; Joseph Raymond Michael; (Arlington, TX) |
Family ID: |
45870030 |
Appl. No.: |
13/230131 |
Filed: |
September 12, 2011 |
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number |
61387283 | Sep 28, 2010 | |
Current U.S. Class: | 710/244 |
Current CPC Class: |
G06F 2212/2532 20130101;
G06F 12/1081 20130101; G06F 2212/608 20130101; Y02D 30/50 20200801;
H03K 19/0016 20130101; H03K 21/00 20130101; G06F 13/1657 20130101;
Y02D 10/172 20180101; G06F 12/0815 20130101; G06F 13/1605 20130101;
H03M 13/353 20130101; G06F 9/3012 20130101; G06F 2212/283 20130101;
G06F 13/1652 20130101; G06F 2212/1032 20130101; H03M 13/2903
20130101; Y02D 10/124 20180101; Y02D 50/20 20180101; G06F 12/12
20130101; G06F 1/3296 20130101; G06F 11/1064 20130101; G06F 13/364
20130101; G06F 12/0811 20130101; G06F 7/483 20130101; Y02D 10/00
20180101; G06F 13/1663 20130101; G06F 12/0246 20130101; G06F 13/18
20130101; G06F 2212/1021 20130101; G06F 2212/221 20130101 |
Class at Publication: | 710/244 |
International Class: | G06F 13/14 20060101 G06F013/14 |
Claims
1. A method of operating a queuing requester for access to a memory
system, the method comprising: receiving transaction requests from
two or more requestors for access to the memory system, wherein
each transaction request includes an associated priority value;
forming a request queue of the received transaction requests in the
queuing requester, wherein each transaction request within the
request queue includes an associated priority value; determining a
highest priority value of all pending transaction requests within
the request queue; selecting an elevated priority value when the
determined highest priority value is higher than the priority value
of an oldest transaction request in the request queue, otherwise
selecting the priority value of the oldest transaction request; and
providing the oldest transaction request in the request queue with
the selected priority value to the memory system.
2. The method of claim 1, further comprising: defining a higher
priority value to use as an elevated priority value and a lower
priority value to use as a non-elevated priority value; and wherein
the higher priority value is selected when the determined highest
priority value is higher than the priority value of an oldest
transaction request in the queue, otherwise the lower priority
value is selected.
3. The method of claim 1, wherein the priority value of the oldest
transaction request is selected unless an elevated priority value
is selected.
4. The method of claim 1, wherein the priority value included with
each transaction request in the transaction queue is a bimodal
value indicating either urgent or normal priority.
5. The method of claim 4, further comprising: defining a high
priority value to use as an elevated priority value and a low
priority value to use as a non-elevated priority value; wherein the
high priority value is selected when any transaction request in the
transaction queue has an urgent priority value; and wherein the low
priority value is selected when all transaction requests in the
transaction queue have a normal priority value.
6. The method of claim 1, wherein the priority value of the queuing
requester is set to a lowest priority value when there are no
transaction requests in the transaction queue.
7. The method of claim 1, wherein determining the highest priority
value includes one or more transaction requests that are not yet in
the transaction queue.
8. The method of claim 1, further comprising: arbitrating pending
transaction requests from the queuing requester and from one or
more other requestors; and granting access to the memory system to
the requestor that wins the arbitration.
9. A system comprising: a queuing request module with two or more
requesters; queue logic configured to receive access requests from
the two or more requesters; lookahead merge logic coupled to the
queue logic, wherein the merge logic is configured to determine a
highest priority value of all pending transaction requests within
the queue logic and to select an elevated priority value when the
determined highest priority value is higher than the priority value
of an oldest transaction request in the queue logic, otherwise to
select the priority value of the oldest transaction request; and an
interface configured to provide the oldest transaction request in
the queue logic of the queuing request module using the selected
priority value to a shared resource.
10. The module of claim 9, further comprising a register coupled to
the merge logic, wherein the register is configured to store a
higher priority value for selection as an elevated priority value
and a lower priority value for selection as a non-elevated priority
value.
11. The module of claim 9, wherein the merge logic is configured to
select a priority value of the oldest transaction request unless an
elevated priority value is selected.
12. The module of claim 9, wherein the priority value included with
each transaction request in the queue logic is a bimodal value
indicating either urgent or normal priority.
13. The module of claim 9, wherein the merge logic is configured to
select a lowest priority value when there are no transaction
requests in the queue logic.
14. The system of claim 9, wherein the module is coupled to a
shared resource; and further comprising one or more additional
requesting modules coupled to the shared resource, wherein the
shared resource includes arbitration logic configured to provide an
arbitration contest for the requesting module and the additional
requesting modules using the selected priority provided by the
requesting module.
15. The system of claim 9 being a system on a chip.
Description
CLAIM OF PRIORITY UNDER 35 U.S.C. 119(e)
[0001] The present application claims priority to and incorporates
by reference U.S. Provisional Application number 61/387,283,
(attorney docket TI-69952PS) filed Sep. 28, 2010, entitled "Cache
Controller Architecture."
FIELD OF THE INVENTION
[0002] This invention generally relates to management of memory
access by multiple requesters, and in particular to access to a
shared memory resource in a system on a chip with multiple
cores.
BACKGROUND OF THE INVENTION
[0003] System on Chip (SoC) is a concept that strives to integrate
more and more functionality into a given device. This integration
can take the form of either hardware or solution software.
Performance gains are traditionally achieved by increased clock
rates and more advanced processor nodes. Many SoC designs pair a
digital signal processor (DSP) with a reduced instruction set
computing (RISC) processor to target specific applications. A more
recent approach to increasing performance has been to create
multi-core devices. In this scenario, management of competition for
processing resources is typically resolved using a priority
scheme.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Particular embodiments in accordance with the invention will
now be described, by way of example only, and with reference to the
accompanying drawings:
[0005] FIG. 1 is a functional block diagram of a system on chip
(SoC) that includes an embodiment of the invention;
[0006] FIG. 2 is a more detailed block diagram of one processing
module used in the SoC of FIG. 1;
[0007] FIGS. 3 and 4 illustrate configuration of the L1 and L2
caches;
[0008] FIG. 5 is a more detailed block diagram of one processing
module used in the SoC of FIG. 1;
[0009] FIG. 6 is a block diagram illustrating a transaction request
pipeline that performs priority elevation;
[0010] FIG. 7 illustrates a register used to indicate multiple
priority values;
[0011] FIG. 8 is a block diagram illustrating arbitration between
multiple requestors for shared resources in a shared memory system
of the SoC of FIG. 1;
[0012] FIG. 9 is a flow diagram illustrating operation of priority
elevation in a queuing requester; and
[0013] FIG. 10 is a block diagram of a system that includes the SoC
of FIG. 1.
[0014] Other features of the present embodiments will be apparent
from the accompanying drawings and from the detailed description
that follows.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
[0015] Specific embodiments of the invention will now be described
in detail with reference to the accompanying figures. Like elements
in the various figures are denoted by like reference numerals for
consistency. In the following detailed description of embodiments
of the invention, numerous specific details are set forth in order
to provide a more thorough understanding of the invention. However,
it will be apparent to one of ordinary skill in the art that the
invention may be practiced without these specific details. In other
instances, well-known features have not been described in detail to
avoid unnecessarily complicating the description.
[0016] High performance computing has taken on even greater
importance with the advent of the Internet and cloud computing. To
ensure the responsiveness of networks, online processing nodes and
storage systems must have extremely robust processing capabilities
and exceedingly fast data-throughput rates. Robotics, medical
imaging systems, visual inspection systems, electronic test
equipment, and high-performance wireless and communication systems,
for example, must be able to process an extremely large volume of
data with a high degree of precision. A multi-core architecture
that embodies an aspect of the present invention will be described
herein. In a typical embodiment, a multi-core system is
implemented as a single system on chip (SoC). As used herein, the
term "core" refers to a processing module that may contain an
instruction processor, such as a digital signal processor (DSP) or
other type of microprocessor, along with one or more levels of
cache that are tightly coupled to the processor.
[0017] A processor module, referred to as a core, that includes a
processor and a memory system and is intended for use in a
multi-core system architecture will be described in more detail
below. Each
core device includes several modules that may request memory
transactions and therefore implements a priority scheme both within
each core device and at the interconnect fabric interface where
multiple core devices can communicate with each other and other
devices, such as memory controllers and hardware accelerators.
[0018] A multi-level cache controller within a core module may
process different types of transfer requests from multiple
requestors that may be targeted to different resources. In a
multi-core scenario, these transfers may be divided into two
categories: 1) local core central processing unit (CPU) generated,
and 2) external master generated. In an embodiment that will be
described in more detail below, the CPU generated transactions may
target a local static random access memory (SRAM) within the core
module or shared resources located elsewhere in the SoC. CPU
transactions and DMA transactions may both be targeted for a same
resource, such as SRAM that may also be configured as a level 2
(L2) cache.
[0019] An External Direct Memory Access (EDMA) interface, also
known as the Master DMA (MDMA) interface of each core device
presents a single request per cycle with a single priority to an
external interconnect fabric. In a case where a low priority
request is present on the MDMA interface a higher priority request
could remain stalled inside the memory system of the device. In
order to minimize such a stall, a concept of elevated priority is
provided. Priority elevation allows a high priority request in the
core device's memory system to communicate and propagate its
priority directly to the MDMA interface even while it remains
stalled, effectively elevating the priority of all older requests
that are in a request queue, until the high priority request
reaches the interface itself. The concept of elevated priority
changes the arbitration in the multi-core environment from
arbitration between individual requests to arbitration between
cores. The difference is subtle, but important. In the first case,
an entire memory system is represented by the single request
currently present at the MDMA interface. In the second case,
through elevated priority, the entire memory system is represented
when arbitrating with other devices.
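The elevated-priority behavior described above can be sketched in C. This is an illustrative model only, not the patent's implementation; the queue layout and function names are assumptions. The oldest request is presented at the interface with a priority elevated to the highest priority of any pending request behind it:

```c
#include <stdbool.h>

/* Two-level priority (normal or urgent), as in this embodiment. */
enum prio { PRIO_NORMAL = 0, PRIO_URGENT = 1 };

struct request { int id; enum prio prio; };

/* req[0] is the oldest request, the one presented externally. */
struct queue { struct request req[8]; int count; };

/* Priority presented for the oldest request: elevated to urgent
   if any pending request in the queue is urgent. */
static enum prio effective_priority(const struct queue *q)
{
    if (q->count == 0)
        return PRIO_NORMAL;
    for (int i = 0; i < q->count; i++)
        if (q->req[i].prio == PRIO_URGENT)
            return PRIO_URGENT; /* a younger urgent request elevates the oldest */
    return q->req[0].prio;
}
```

In this sketch the stalled urgent request never moves; only the priority reported for the oldest request changes, which matches the idea of representing the whole memory system in arbitration.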
[0020] The core device's memory system implements a hierarchical
lookahead priority collection mechanism to calculate the overall
priority of the core's entire memory system and if necessary, to
elevate the priority of the oldest transaction in the system,
namely the transaction present at the EDMA boundary, as will be
described in more detail below.
[0021] FIG. 1 is a functional block diagram of a system on chip
(SoC) 100 that includes an embodiment of the invention. System 100
is a multi-core SoC that includes a set of processor modules 110
that each include a processor core, level one (L1) data and
instruction caches, and a level two (L2) cache. In this embodiment,
there are eight processor modules 110; however, other embodiments
may have a fewer or greater number of processor modules. In this
embodiment, each processor core is a digital signal processor
(DSP); however, in other embodiments other types of processor cores
may be used. A packet-based fabric 120 provides high-speed
non-blocking channels that deliver as much as 2 terabits per second
of on-chip throughput. Fabric 120 interconnects with memory
subsystem 130 to provide an extensive two-layer memory structure in
which data flows freely and effectively between processor modules
110, as will be described in more detail below. An example of SoC
100 is embodied in an SoC from Texas Instruments, and is described
in more detail in "TMS320C6678--Multi-core Fixed and Floating-Point
Signal Processor Data Manual", SPRS691, November 2010, which is
incorporated by reference herein.
[0022] External link 122 provides direct chip-to-chip connectivity
for local devices, and is also integral to the internal processing
architecture of SoC 100. External link 122 is a fast and efficient
interface with low protocol overhead and high throughput, running
at an aggregate speed of 50 Gbps (four lanes at 12.5 Gbps each).
Working in conjunction with a routing manager 140, link 122
transparently dispatches tasks to other local devices where they
are executed as if they were being processed on local
resources.
[0023] There are three levels of memory in the SoC 100. Each
processor module 110 has its own level-1 program (L1P) and level-1
data (L1D) memory. Additionally, each module 110 has a local
level-2 unified memory (L2). Each of the local memories can be
independently configured as memory-mapped SRAM (static random
access memory), cache or a combination of the two.
[0024] In addition, SoC 100 includes shared memory 130, comprising
internal memory 133 and optional external memory 135 connected
through the multi-core shared memory controller (MSMC) 132. MSMC
132 allows processor modules 110 to dynamically share the internal
and external memories for both program and data. The MSMC internal
RAM 133 offers flexibility to programmers by allowing portions to
be configured as shared level-2 RAM (SL2) or shared level-3 RAM
(SL3). SL2 RAM is cacheable only within the local L1P and L1D
caches, while SL3 is additionally cacheable in the local L2
caches.
[0025] External memory may be connected through the same memory
controller 132 as the internal shared memory via external memory
interface 134, rather than to chip system interconnect as has
traditionally been done on embedded processor architectures,
providing a fast path for software execution. In this embodiment,
external memory may be treated as SL3 memory and therefore
cacheable in L1 and L2.
[0026] SoC 100 may also include several co-processing accelerators
that offload processing tasks from the processor cores in processor
modules 110, thereby enabling sustained high application processing
rates. SoC 100 may also contain an Ethernet media access controller
(EMAC) network coprocessor block 150 that may include a packet
accelerator 152 and a security accelerator 154 that work in tandem.
The packet accelerator speeds the data flow throughout the core by
transferring data to peripheral interfaces such as the Ethernet
ports or Serial RapidIO (SRIO) without the involvement of any
module 110's DSP processor. The security accelerator provides
security processing for a number of popular encryption modes and
algorithms, including: IPSec, SCTP, SRTP, 3GPP, SSL/TLS and several
others.
[0027] Multi-core manager 140 provides single-core simplicity to
multi-core device SoC 100. Multi-core manager 140 provides
hardware-assisted functional acceleration that utilizes a
packet-based hardware subsystem. With an extensive series of more
than 8,000 queues managed by queue manager 144 and a packet-aware
DMA controller 142, it optimizes the packet-based communications of
the on-chip cores by practically eliminating all copy
operations.
[0028] The low latencies and zero interrupts ensured by multi-core
manager 140, as well as its transparent operations, enable new and
more effective programming models such as task dispatchers.
Moreover, software development cycles may be shortened
significantly by several features included in multi-core manager
140, such as dynamic software partitioning. Multi-core manager 140
provides "fire and forget" software tasking that may allow
repetitive tasks to be defined only once, and thereafter be
accessed automatically without additional coding efforts.
[0029] Two types of buses exist in SoC 100 as part of packet based
switch fabric 120: data buses and configuration buses. Some
peripherals have both a data bus and a configuration bus interface,
while others only have one type of interface. Furthermore, the bus
interface width and speed varies from peripheral to peripheral.
Configuration buses are mainly used to access the register space of
a peripheral and the data buses are used mainly for data transfers.
However, in some cases, the configuration bus is also used to
transfer data. Similarly, the data bus can also be used to access
the register space of a peripheral. For example, DDR3 memory
controller 134 registers are accessed through their data bus
interface.
[0030] Processor modules 110, the enhanced direct memory access
(EDMA) traffic controllers, and the various system peripherals can
be classified into two categories: masters and slaves. Masters are
capable of initiating read and write transfers in the system and do
not rely on the EDMA for their data transfers. Slaves on the other
hand rely on the EDMA to perform transfers to and from them.
Examples of masters include the EDMA traffic controllers, serial
rapid I/O (SRIO), and Ethernet media access controller 150.
Examples of slaves include the serial peripheral interface (SPI),
universal asynchronous receiver/transmitter (UART), and
inter-integrated circuit (I2C) interface.
[0031] FIG. 2 is a more detailed block diagram of one processing
module 110 used in the SoC of FIG. 1. As mentioned above, SoC 100
contains two switch fabrics that form the packet based fabric 120
through which masters and slaves communicate. A data switch fabric
224, known as the data switched central resource (SCR), is a
high-throughput interconnect mainly used to move data across the
system. The data SCR is further divided into two smaller SCRs. One
connects very high speed masters to slaves via 256-bit data buses
running at a DSP/2 frequency. The other connects masters to slaves
via 128-bit data buses running at a DSP/3 frequency. Peripherals
that match the native bus width of the SCR they are coupled to can
connect directly to the data SCR; other peripherals require a
bridge.
[0032] A configuration switch fabric 225, also known as the
configuration switch central resource (SCR), is mainly used to
access peripheral registers. The configuration SCR connects each
processor module 110 and masters on the data switch fabric to
slaves via 32-bit configuration buses running at a DSP/3 frequency.
As with the data SCR, some peripherals require the use of a bridge
to interface to the configuration SCR.
[0033] Bridges perform a variety of functions:
[0034] Conversion between configuration bus and data bus.
[0035] Width conversion between peripheral bus width and SCR bus
width.
[0036] Frequency conversion between peripheral bus frequency and
SCR bus frequency.
[0037] The priority level of all master peripheral traffic is
defined at the boundary of switch fabric 120. User programmable
priority registers are present to allow software configuration of
the data traffic through the switch fabric. In this embodiment, a
lower number means higher priority. For example: PRI=000b=urgent,
PRI=111b=low.
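As a minimal sketch of this inverted convention (assuming a 3-bit PRI field; the macro and helper names are illustrative, not from the device documentation), arbitration compares the numeric values with the sense reversed:

```c
#include <stdbool.h>

#define PRI_URGENT 0x0u /* 000b = highest priority */
#define PRI_LOW    0x7u /* 111b = lowest priority  */

/* Returns true if PRI value a outranks PRI value b: with this
   encoding, a numerically lower value means higher priority. */
static bool pri_outranks(unsigned a, unsigned b)
{
    return a < b;
}
```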
[0038] All other masters provide their priority directly and do not
need a default priority setting. Examples include the processor
module 110, whose priorities are set through software in a unified
memory controller (UMC) 216 control registers. All the Packet DMA
based peripherals also have internal registers to define the
priority level of their initiated transactions.
[0039] DSP processor core 112 includes eight functional units 214,
two register files 215, and two data paths. The two general-purpose
register files 215 (A and B) each contain 32 32-bit registers for a
total of 64 registers. The general-purpose registers can be used
for data or can be data address pointers. The data types supported
include packed 8-bit data, packed 16-bit data, 32-bit data, 40-bit
data, and 64-bit data. Multiplies also support 128-bit data.
40-bit-long or 64-bit-long values are stored in register pairs,
with the 32 LSBs of data placed in an even register and the
remaining 8 or 32 MSBs in the next upper register (which is always
an odd-numbered register). 128-bit data values are stored in
register quadruplets, with the 32 LSBs of data placed in a register
that is a multiple of 4 and the remaining 96 MSBs in the next 3
upper registers.
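The register-pair layout described above can be modeled in C; this is a sketch with illustrative names, not the device's actual register file:

```c
#include <stdint.h>

/* Store a 64-bit value as a register pair: 32 LSBs in the even
   register, 32 MSBs in the next (odd) register. */
static void store_pair(uint32_t regs[], int even, uint64_t v)
{
    regs[even]     = (uint32_t)v;         /* 32 LSBs -> even register */
    regs[even + 1] = (uint32_t)(v >> 32); /* 32 MSBs -> odd register  */
}

static uint64_t load_pair(const uint32_t regs[], int even)
{
    return ((uint64_t)regs[even + 1] << 32) | regs[even];
}
```

A 128-bit quadruplet follows the same pattern with four consecutive registers starting at a multiple of 4.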
[0040] The eight functional units 214 (.M1, .L1, .D1, .S1, .M2,
.L2, .D2, and .S2) are each capable of executing one instruction
every clock cycle. The .M functional units perform all multiply
operations. The .S and .L units perform a general set of
arithmetic, logical, and branch functions. The .D units primarily
load data from memory to the register file and store results from
the register file into memory. Each .M unit can perform one of the
following fixed-point operations each clock cycle: four 32×32-bit
multiplies, sixteen 16×16-bit multiplies, four 16×32-bit
multiplies, four 8×8-bit multiplies, four 8×8-bit multiplies with
add operations, and four 16×16 multiplies with add/subtract
capabilities. There is also support for Galois field
multiplication for 8-bit and 32-bit data. Many communications
algorithms such as FFTs and modems require complex multiplication.
Each .M unit can perform one 16×16-bit complex multiply with or
without rounding capabilities, two 16×16-bit complex multiplies
with rounding capability, and a 32×32-bit complex multiply with
rounding capability. The .M unit can also perform two 16×16-bit
and one 32×32-bit complex multiply instructions that multiply a
complex number with a complex conjugate of another number with
rounding capability.
[0041] Communication signal processing also requires an extensive
use of matrix operations. Each .M unit is capable of multiplying a
[1×2] complex vector by a [2×2] complex matrix per cycle with or
without rounding capability. Another version of the .M unit may
allow multiplication of the conjugate of a [1×2] vector with a
[2×2] complex matrix. Each .M unit also
includes IEEE floating-point multiplication operations, which
includes one single-precision multiply each cycle and one
double-precision multiply every 4 cycles. There is also a
mixed-precision multiply that allows multiplication of a
single-precision value by a double-precision value and an operation
allowing multiplication of two single-precision numbers resulting
in a double-precision number. Each .M unit can also perform one of
the following floating-point operations each clock cycle: one, two, or
four single-precision multiplies or a complex single-precision
multiply.
[0042] The .L and .S units support up to 64-bit operands. This
allows for arithmetic, logical, and data packing instructions to
allow parallel operations per cycle.
[0043] An MFENCE instruction is provided that will create a
processor stall until the completion of all the processor-triggered
memory transactions, including:
[0044] Cache line fills
[0045] Writes from L1D to L2 or from the processor module to MSMC
and/or other system endpoints
[0046] Victim write backs
[0047] Block or global coherence operations
[0048] Cache mode changes
[0049] Outstanding XMC prefetch requests.
[0050] The MFENCE instruction is useful as a simple mechanism for
programs to wait for these requests to reach their endpoint. It
also provides ordering guarantees for writes arriving at a single
endpoint via multiple paths, multiprocessor algorithms that depend
on ordering, and manual coherence operations.
[0051] Each processor module 110 in this embodiment contains a 1024
KB level-2 memory (L2) 216, a 32 KB level-1 program memory (L1P)
217, and a 32 KB level-1 data memory (L1D) 218. The device also
contains a 4096 KB multi-core shared memory (MSM) 132. All memory
in SoC 100 has a unique location in the memory map.
[0052] The L1P and L1D cache can be reconfigured via software
through the L1PMODE field of the L1P Configuration Register
(L1PCFG) and the L1DMODE field of the L1D Configuration Register
(L1DCFG) of each processor module 110 to be all SRAM, all cache
memory, or various combinations as illustrated in FIG. 3, which
illustrates an L1D configuration; L1P configuration is similar. L1D
is a two-way set-associative cache, while L1P is a direct-mapped
cache.
[0053] L2 memory can be configured as all SRAM, all 4-way
set-associative cache, or a mix of the two, as illustrated in FIG.
4. The amount of L2 memory that is configured as cache is
controlled through the L2MODE field of the L2 Configuration
Register (L2CFG) of each processor module 110.
[0054] Global addresses are accessible to all masters in the
system. In addition, local memory can be accessed directly by the
associated processor through aliased addresses, where the eight
MSBs are masked to zero. The aliasing is handled within each
processor module 110 and allows for common code to be run
unmodified on multiple cores. For example, address location
0x10800000 is the global base address for processor module 0's L2
memory. DSP Core 0 can access this location by either using
0x10800000 or 0x00800000. Any other master in SoC 100 must use
0x10800000 only. Conversely, 0x00800000 can be used by any of the
cores as their own L2 base addresses.
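The aliasing rule above can be sketched in C. The mask below is inferred from the example addresses in this paragraph (eight MSBs cleared); the function name is illustrative:

```c
#include <stdint.h>

/* Form the local alias of a global address by masking the eight
   most significant bits to zero. */
static uint32_t local_alias(uint32_t global_addr)
{
    return global_addr & 0x00FFFFFFu;
}
```

Because every core maps its own L2 base to the same local alias, the same binary can run unmodified on any core.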
[0055] Level 1 program (L1P) memory controller (PMC) 217 controls
program cache memory 267 and includes memory protection and
bandwidth management. Level 1 data (L1D) memory controller (DMC)
218 controls data cache memory 268 and includes memory protection
and bandwidth management. Level 2 (L2) memory controller, unified
memory controller (UMC) 216 controls L2 cache memory 266 and
includes memory protection and bandwidth management. External
memory controller (EMC) 219 includes Internal DMA (IDMA) and a
slave DMA (SDMA) interface that is coupled to data switch fabric
224. The EMC is coupled to configuration switch fabric 225.
Extended memory controller (XMC) is coupled to MSMC 132 and to dual
data rate 3 (DDR3) external memory controller 134. The XMC provides
L2 prefetching for L2 cache 216/266.
[0056] FIG. 5 is a more detailed block diagram of one processing
module 110 used in the SoC of FIG. 1 that illustrates distributed
bandwidth management. When multiple requestors contend for a single
processor module 110 resource, the conflict is resolved by granting
access to the highest priority requestor. The following four
resources are managed by the bandwidth management control hardware
516-519:
[0057] Level 1 Program (L1P) SRAM/Cache 217
[0058] Level 1 Data (L1D) SRAM/Cache 218
[0059] Level 2 (L2) SRAM/Cache 216
[0060] EMC 219
[0061] The priority level for operations initiated within the
processor module 110 is declared through registers within each
processor module 110. These operations are:
[0062] DSP-initiated transfers
[0063] User-programmed cache coherency operations
[0064] IDMA-initiated transfers
[0065] The priority level for operations initiated outside the
processor modules 110 by system peripherals is declared through the
Priority Allocation Register (PRI_ALLOC). System peripherals that
are not associated with a field in PRI_ALLOC may have their own
registers to program their priorities.
Priority Elevation
[0066] FIG. 6 is a block diagram illustrating a transaction request
pipeline 630 that may perform priority elevation within UMC 216.
Each core 110 implements a notion of "elevated priority" for
requests it sends externally. In this embodiment, a given request
may have one of two priority levels: normal or urgent. Elevated
priority in this embodiment only affects requests sent outside a
given core 110. It does not change the arbitration rules inside the
core.
[0067] As mentioned earlier, each core 110 includes a processor 112
and a memory hierarchy that includes a Level 1 Data Memory
Controller (DMC) 218, a Level 1 Program Memory Controller (PMC)
217, and a Level 2 Unified Memory Controller (UMC) 216. The PMC 217
and DMC 218 each provide a priority value 617, 618 individually and
report this status to UMC 216. Each access request from PMC 217 and
DMC 218 is processed in UMC pipeline 610 to check access tags for
L2 cache memory 266. If the requested data is not present in L2
SRAM 266, then an access request is made to L3 shared memory system
130 via XMC 570 by sending transaction requests to MSMC 132. XMC
prefetch engine 680 may perform prefetch transactions to the MSMC
for requests it receives from router 670.
[0068] Each access request that needs to be routed external to core
110 is placed into a pipeline 630 that may have one or more stages
631(1)-631(n). Each stage 631(n) has a priority field 634 that
indicates the original priority assigned to each access request by
PMC 217, DMC 218, or UMC 216. Merge logic 640 then combines the
priority elevation status of the pipeline stages with the elevation
information from DMC and PMC to compute a final priority elevation
status. This final priority
elevation status is used to consult a programmable configuration
register 650 to select the elevated priority 637 that is presented
on the MDMA interface.
[0069] DMC 218 may elevate its priority if its pipeline contains
any known cacheable read allocate requests. A DMC lookahead
priority scheme considers the transaction currently on the bus
interface to the UMC, as well as younger transactions in its
pipeline that have already consulted the cache tag ram. If any of
these transactions is a read allocate request, DMC 218 elevates its
priority 618 to UMC 216.
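The DMC lookahead check described in paragraph [0069] reduces to an OR across the transaction currently on the bus interface and the younger in-flight transactions that have already consulted the tag RAM. A minimal behavioral sketch follows; the transaction records and the `is_read_allocate` field name are illustrative assumptions, not the actual hardware signals:

```python
def dmc_elevated_priority(bus_txn, pipeline_txns):
    """Return True if DMC 218 should elevate its priority 618 to UMC 216.

    bus_txn: the transaction currently on the DMC-to-UMC bus interface
             (None if the interface is idle).
    pipeline_txns: younger DMC pipeline transactions that have already
                   consulted the cache tag RAM.
    Each transaction is modeled as a dict with an 'is_read_allocate'
    flag (an illustrative field name, not a hardware signal).
    """
    scope = ([bus_txn] if bus_txn is not None else []) + list(pipeline_txns)
    # A single cacheable read allocate anywhere in scope elevates priority.
    return any(t["is_read_allocate"] for t in scope)
```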
[0070] PMC 217 may elevate its priority 617 for fetch requests to
branch target addresses. Since more than one transaction request is
required to fill an L1P cache memory 268 line when a miss occurs
due to a branch, the transaction request for the branch target may
be elevated while the remaining requests to fill the cache line are
performed at normal priority.
[0071] In a similar manner, UMC 216 may elevate its priority for a
critical portion of an L2 fetch in response to an L2 cache miss.
The remainder of the requests for the L2 cache line may then be
performed at normal priority.
[0072] All other requests from PMC, DMC and UMC may proceed with
normal priority.
[0073] Lookahead merge logic 640 considers the following
transactions when deciding whether to elevate its priority: all
transactions in the UMC pipeline 610, 630; a single transaction
being requested by the DMC but not yet accepted by UMC; a single
transaction being requested by the PMC but not yet accepted by UMC;
and the oldest transaction 631(1) currently ready to be output on
the MDMA interface 572.
[0074] In this embodiment, all transactions under consideration are
categorized as urgent or non-urgent. An urgent transaction is one
of the following: L1D read miss; L1P fetch for a branch target;
critical portion of an L2 allocate. The presence of a single urgent
request in the UMC transaction scope results in the elevation of
the UMC priority.
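The classification and lookahead merge of paragraphs [0073]-[0074] amount to tagging each in-scope transaction as urgent or non-urgent and OR-reducing over the whole scope. A simplified model, with transaction kinds and field names chosen purely for illustration:

```python
# Transaction kinds classified as urgent in this embodiment.
URGENT_KINDS = {
    "l1d_read_miss",         # L1D read miss
    "l1p_branch_fetch",      # L1P fetch for a branch target
    "l2_critical_allocate",  # critical portion of an L2 allocate
}

def umc_priority_elevated(umc_pipeline, dmc_request, pmc_request, oldest_mdma):
    """OR-reduce urgency over the full UMC transaction scope: all
    pipeline transactions, the pending (not yet accepted) DMC and PMC
    requests, and the oldest transaction ready on the MDMA interface.
    A single urgent transaction anywhere in scope elevates UMC priority.
    """
    scope = list(umc_pipeline)
    for txn in (dmc_request, pmc_request, oldest_mdma):
        if txn is not None:
            scope.append(txn)
    return any(t["kind"] in URGENT_KINDS for t in scope)
```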
MDMAARBU Configuration Register for Programming Priority Values
[0075] FIG. 7 illustrates a memory mapped configuration register
650 used to indicate multiple priority values, referred to as the
MDMAARBU (MDMA ARBitration for Unified memory controller) register.
This register contains the priority levels UMC uses when asserting
requests on the MDMA interface. Field 702 is a three bit field that
holds the urgent priority value. In SoC 100, priority may be any
value from 0-7, where zero is the highest priority. Field 704 is a
three bit field that holds the normal priority value. MDMAARBU
register 650 may be programmed with various priority values by an
application program being executed on processor 112 of core 110.
Since all memory addresses in SoC 100 are global, a program being
executed in one core 110 may adjust the MDMAARBU registers in other
cores.
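As a sketch of how software might program register 650, the following assumes (purely for illustration) that the 3-bit PRI field 704 occupies the low-order bits and the 3-bit UPRI field 702 sits at bit 16; the actual bit positions are defined by the device documentation, not by this document:

```python
# Illustrative bit positions only; consult the device documentation
# for the real MDMAARBU field offsets.
PRI_SHIFT, UPRI_SHIFT, FIELD_MASK = 0, 16, 0x7

def pack_mdmaarbu(normal_pri, urgent_pri):
    """Pack the two 3-bit priority fields; 0 is the highest priority,
    7 the lowest, per the 0-7 priority range of SoC 100."""
    assert 0 <= normal_pri <= 7 and 0 <= urgent_pri <= 7
    return ((urgent_pri & FIELD_MASK) << UPRI_SHIFT |
            (normal_pri & FIELD_MASK) << PRI_SHIFT)

def unpack_mdmaarbu(reg):
    """Return (normal_pri, urgent_pri) from a packed register value."""
    return ((reg >> PRI_SHIFT) & FIELD_MASK,
            (reg >> UPRI_SHIFT) & FIELD_MASK)
```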
[0076] Referring again to FIG. 6, after completing its lookahead
merge, merge control logic 640 decides whether the priority of the
current request 631(1) provided to the MDMA interface 572 should be
elevated. If there are no pending requests on the MDMA interface a
lowest priority value of 7 is output on priority signal 637. If the
decision is to elevate priority, then the Urgent Priority (UPRI)
field 702 of the MDMAARBU register is used. If the decision is not
to elevate priority, then the PRI field 704 of the MDMAARBU
register is used.
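The selection described in paragraph [0076] reduces to a three-way choice. A behavioral sketch of that decision, not the hardware logic itself:

```python
LOWEST_PRIORITY = 7  # in SoC 100, zero is highest and seven is lowest

def mdma_output_priority(pending, elevated, upri, pri):
    """Select the priority value driven on signal 637.

    pending:  whether any request is pending on the MDMA interface
    elevated: the final priority elevation status from merge logic 640
    upri/pri: the UPRI (702) and PRI (704) fields of MDMAARBU
    """
    if not pending:
        return LOWEST_PRIORITY
    return upri if elevated else pri
```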
[0077] The oldest transaction request 631(1) is then output to
router 670. Each transaction request includes a command part 636, a
data part if it is a write, and priority value 637. Router 670 then
routes the transaction request to the appropriate interconnect
fabric based on the value of the address in command 636. For
example, if the transaction requested is addressed to shared memory
133, then it will be routed to the MSMC via link 672.
[0078] In this manner, transactions within the memory system
hierarchy of each core 110 are classified as high and low priority
based on functionality. A hierarchical lookahead priority scheme is
described, where the individual elevated priority status of each
memory controller in the hierarchy is combined to form a single
elevated priority for the entire memory system of the core 110. A
lookahead priority collection mechanism is provided for individual
memory controllers in the memory hierarchy whereby the presence of
a single high priority in-flight or pending transaction results in
the priority of the memory controller as a whole indicating
"elevated" priority. In another embodiment, there may be more
priority levels than just normal and urgent tracked in a
transaction request queue.
[0079] FIG. 8 is a block diagram illustrating arbitration between
multiple requestors for shared resources in shared memory system
130. Each of the plurality of cores 110 has a dedicated link 672
from router 670 in the XMC 570 of each core 110 to a slave port on
MSMC 132. As described above, each link 672 provides transaction
requests that include a command, priority and data if it is a write
request. The priority for each transaction request on each link 672
is determined as described above and may be elevated when a pending
access occurs in an access pipeline of any of the cores 110. Each
requesting slave interface is designed to present only one access
request among all endpoint arbiters in a given cycle so that the
return path for reads is reserved at the time a command wins
arbitration.
[0080] Each endpoint in MSMC 132 has an associated arbiter; there
is an arbiter for each memory bank 136 of shared memory 133 and one
each for system master port 814 and EMIF master port 134. For each
access to an endpoint, the arbitration logic utilizes a dynamic
multi-level priority scheme that attempts to fairly allocate the
number of access opportunities and bandwidth among the requestors.
Each endpoint arbiter may include additional arbitration logic
specific to the requirements of that endpoint.
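One simple way to model such an endpoint arbiter is a priority pick with round-robin tie-breaking among equal-priority requestors, which approximates fair allocation of access slots. This is an illustrative sketch only; the actual MSMC arbitration scheme is dynamic and multi-level, with additional endpoint-specific logic:

```python
class EndpointArbiter:
    """Pick the highest-priority request (0 is highest); break ties
    round-robin so equal-priority requestors share bandwidth fairly."""

    def __init__(self, num_requestors):
        self.n = num_requestors
        self.last_winner = self.n - 1  # so requestor 0 wins the first tie

    def arbitrate(self, requests):
        """requests: list of priority values (or None if no request),
        indexed by requestor; returns the winning requestor or None."""
        active = [i for i, p in enumerate(requests) if p is not None]
        if not active:
            return None
        best = min(requests[i] for i in active)
        tied = [i for i in active if requests[i] == best]
        # Round-robin: first tied requestor after the previous winner.
        for offset in range(1, self.n + 1):
            cand = (self.last_winner + offset) % self.n
            if cand in tied:
                self.last_winner = cand
                return cand
```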
[0081] FIG. 9 is a flow diagram illustrating operation of priority
elevation in a queuing requester, such as the queuing requester
illustrated in FIG. 6. Within a given module, such as core module
110, there may be several independent devices that request memory
transactions in a hierarchical manner, such as an L1 program cache,
an L1 data cache, an L2 cache controller, a DMA engine, etc.
Transaction requests from each independent device are received 1002
by a queuing requester and then formed 1004 into a request queue.
As described in more detail above, each request includes a priority
value that is included in the queue. In some embodiments, this may
be a multi-bit value. In other embodiments, it may be a simple
single bit value that indicates normal or urgent, for example.
[0082] The requests are ordered in the request queue, such that the
oldest request is the next request that is issued from the queuing
requester for accessing a shared resource. As requests are received
by the queuing requester, the priority value of each queued
request is examined to determine 1006 the highest priority value of
all pending transaction requests within the request queue.
[0083] Whenever a pending transaction within the queue, or one that
is about to enter the queue, has a priority that is higher than the
priority value of the oldest request in the queue, the priority
value for the queue is elevated 1008 by selecting a higher priority
value.
[0084] In some embodiments, determining 1006 the highest priority
value may include one or more transaction requests that are not yet
included in the transaction queue 1004.
[0085] The oldest transaction from the queue is then provided 1010
to the memory system using the selected priority value. In this
manner, a high priority value is selected when any transaction
request in the transaction queue has an urgent priority value, and
a low priority value is selected when all transaction requests in
the transaction queue have a normal priority value, for example.
Other embodiments may use more than urgent and normal priority
values in the queue.
[0086] A lowest priority value may be selected when there are no
transaction requests in the transaction queue.
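The flow of FIG. 9 (steps 1002-1010) can be sketched as a small queuing-requester model. Numerically smaller values mean higher priority here, matching the 0-is-highest convention; the class and method names are illustrative:

```python
from collections import deque

class QueuingRequester:
    """Model of steps 1002-1010: queue requests, look ahead across the
    queue, and issue the oldest request with a possibly elevated
    priority. Smaller values are higher priority (0 is highest)."""

    def __init__(self):
        self.queue = deque()  # oldest request at the left

    def receive(self, request, priority):
        """Steps 1002/1004: receive a request and place it in the queue
        together with its priority value."""
        self.queue.append((request, priority))

    def issue(self):
        """Return (request, selected_priority) for the oldest request,
        or None if the queue is empty."""
        if not self.queue:
            return None
        highest = min(p for _, p in self.queue)          # step 1006
        request, oldest_pri = self.queue.popleft()
        # Step 1008: elevate when a younger request is more urgent.
        selected = highest if highest < oldest_pri else oldest_pri
        return request, selected                         # step 1010
```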
[0087] Periodically, typically once each clock cycle, pending
transaction requests from the queuing requester 1010 and from one
or more other requestors 1020 are arbitrated 1022 using the
selected priority value for the pending request from the queuing
requester. One or more of the other requesters 1020 may also be
queuing requesters that each includes a selected priority value
that may be elevated for urgent requests.
[0088] After each arbitration contest, the winner of the
arbitration contest is granted access 1024 to the shared memory
system.
[0089] As mentioned in more detail above, a register or other means
may be provided to allow a program to dynamically update 1030 a
defined high priority value to use as an elevated priority value
and a low priority value to use as a non-elevated priority
value.
System Example
[0090] FIG. 10 is a block diagram of a base station for use in a
radio network, such as a cell phone network. SoC 1002 is similar to
the SoC of FIG. 1 and is coupled to external memory 1004 that may
be used, in addition to the internal memory 1005 within SoC 1002,
to store application programs and data being processed by SoC 1002.
Transmitter logic 1010 performs digital to analog conversion of
digital data streams transferred by the external DMA (EDMA3)
controller and then performs modulation of a carrier signal from a
phase locked loop generator (PLL). The modulated carrier is then
coupled to multiple output antenna array 1020. Receiver logic 1012
receives radio signals from multiple input antenna array 1021,
amplifies them in a low noise amplifier and then converts them to a
digital stream of data that is transferred to SoC 1002 under
control of external DMA EDMA3. There may be multiple copies of
transmitter logic 1010 and receiver logic 1012 to support multiple
antennas.
[0091] The Ethernet media access controller (EMAC) module in SoC
1002 is coupled to a local area network port 1006 which may supply
data for transmission and may transport received data to other
systems that may be coupled to the internet.
[0092] An application program executed on one or more of the
processor modules within SoC 1002 encodes data received from the
internet, interleaves it, modulates it and then filters and
pre-distorts it to match the characteristics of the transmitter
logic 1010. Another application program executed on one or more of
the processor modules within SoC 1002 demodulates the digitized
radio signal received from receiver logic 1012, deciphers burst
formats, and decodes the resulting digital data stream and then
directs the recovered digital data stream to the internet via the
EMAC internet interface. The details of digital transmission and
reception are well known.
[0093] By making use of elevated priority for transaction requests
for critical transaction requests to memory 1004 and memory 1005 by
the processor modules of SoC 1002, data drops are avoided while
transferring the time critical transmission data to and from the
transmitter and receiver logic.
[0094] Input/output logic 1030 may be coupled to SoC 1002 via the
inter-integrated circuit (I2C) interface to provide control,
status, and display outputs to a user interface and to receive
control inputs from the user interface. The user interface may
include a human readable media such as a display screen, indicator
lights, etc. It may include input devices such as a keyboard,
pointing device, etc.
OTHER EMBODIMENTS
[0095] Although the invention finds particular application to
Digital Signal Processors (DSPs), implemented, for example, in a
System on a Chip (SoC), it also finds application to other forms of
processors. A SoC may contain one or more megacells or modules
which each include custom designed functional circuits combined
with pre-designed functional circuits provided by a design
library.
[0096] While the invention has been described with reference to
illustrative embodiments, this description is not intended to be
construed in a limiting sense. Various other embodiments of the
invention will be apparent to persons skilled in the art upon
reference to this description. For example, in another embodiment,
the transaction request queue may be longer or shorter than
described here. In another embodiment, there may be multiple layers
of queues that each performs priority elevation. The priority
values may cover a wider range of values than just urgent or
normal. The selected priority may have a value that has more or
less than three bits, for example.
[0097] In another embodiment, the shared resource may be just a
memory that is not part of a cache. The shared resource may be any
type of storage device or functional device that may be accessed by
multiple masters in which an urgent request in a request queue in
one master needs to be elevated above requests from another
master.
[0098] In another embodiment, there may be more or fewer cores used
than described herein. They may be arranged in different
configurations and topologies than described herein.
[0099] Certain terms are used throughout the description and the
claims to refer to particular system components. As one skilled in
the art will appreciate, components in digital systems may be
referred to by different names and/or may be combined in ways not
shown herein without departing from the described functionality.
This document does not intend to distinguish between components
that differ in name but not function. In the following discussion
and in the claims, the terms "including" and "comprising" are used
in an open-ended fashion, and thus should be interpreted to mean
"including, but not limited to . . . ." Also, the term "couple" and
derivatives thereof are intended to mean an indirect, direct,
optical, and/or wireless electrical connection. Thus, if a first
device couples to a second device, that connection may be through a
direct electrical connection, through an indirect electrical
connection via other devices and connections, through an optical
electrical connection, and/or through a wireless electrical
connection.
[0100] Although method steps may be presented and described herein
in a sequential fashion, one or more of the steps shown and
described may be omitted, repeated, performed concurrently, and/or
performed in a different order than the order shown in the figures
and/or described herein. Accordingly, embodiments of the invention
should not be considered limited to the specific ordering of steps
shown in the figures and/or described herein.
[0101] It is therefore contemplated that the appended claims will
cover any such modifications of the embodiments as fall within the
true scope and spirit of the invention.
* * * * *