U.S. patent application number 13/276041 was filed with the patent office on 2011-10-18 and published on 2012-02-09 for an interconnect that eliminates routing congestion and manages simultaneous transactions.
This patent application is currently assigned to SONICS, INC. Invention is credited to Chien-Chun Chou, Stephen W. Hamilton, Ian Andrew Swarbrick, Vida Vakilotojar, and Drew E. Wingard.
Application Number | 20120036296 13/276041
Family ID | 40137732
Filed Date | 2011-10-18
Publication Date | 2012-02-09
United States Patent Application 20120036296
Kind Code: A1
Wingard; Drew E.; et al.
February 9, 2012

INTERCONNECT THAT ELIMINATES ROUTING CONGESTION AND MANAGES
SIMULTANEOUS TRANSACTIONS
Abstract
A method, apparatus, and system are described, which generally
relate to an integrated circuit having an interconnect. The flow
control logic for the interconnect applies a flow control splitting
protocol to permit transactions from each initiator thread and/or
each initiator tag stream to be outstanding to multiple channels in
a single aggregate target at once, and therefore to multiple
individual targets within an aggregate target at once. The combined
flow control logic and flow control protocol allows the
interconnect to manage simultaneous requests to multiple channels
in an aggregate target from the same thread or tag at the same
time.
Inventors: Wingard; Drew E. (Palo Alto, CA); Chou; Chien-Chun
(Saratoga, CA); Hamilton; Stephen W. (Pembroke Pines, FL);
Swarbrick; Ian Andrew (Sunnyvale, CA); Vakilotojar; Vida
(Mountain View, CA)
Assignee: SONICS, INC. (Milpitas, CA)
Family ID: 40137732
Appl. No.: 13/276041
Filed: October 18, 2011
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
12144987 | Jun 24, 2008 |
13276041 (present application) | October 18, 2011 |
60946096 | Jun 25, 2007 |
Current U.S. Class: 710/105
Current CPC Class: G11C 7/1072 20130101; G06F 15/17375 20130101;
G06F 12/0607 20130101; Y02D 10/00 20180101; Y02D 10/13 20180101
Class at Publication: 710/105
International Class: G06F 13/42 20060101 G06F013/42
Claims
1. An integrated circuit having multiple initiator IP cores and
multiple target IP cores that communicate request transactions over
an interconnect, where the interconnect provides a shared
communications bus between the multiple initiator IP cores and
multiple target IP cores, comprising: flow control logic for the
interconnect configured to apply a flow control splitting
protocol to permit transactions from a first initiator thread or a
first initiator tag stream to be outstanding to multiple channels
in a single aggregate target at once, and therefore to multiple
individual target IP cores within the aggregate target at once,
where the combination of the flow control logic and the flow
control splitting protocol allows the interconnect to manage
simultaneous requests to multiple channels in the aggregate target
from the same first initiator thread or the same first initiator tag stream at
the same time.
2. The integrated circuit of claim 1, where the flow control logic
also includes merger and thread splitter units in its architecture
to intentionally split request transactions in an initiator agent,
or as early as possible along the path in the interconnect to
target agents for the multiple channels of the aggregate target,
and this approach avoids creating a centralized point that could
act as a bandwidth choke point and routing congestion point.
3. The integrated circuit of claim 1, where a distribution of the
flow control logic eliminates a need to have all the communication
paths in the interconnect pass through a single choke point because
many distributed pathways exist in this shared communications bus,
and the flow control logic for the interconnect is configured to
apply a flow control splitting protocol to also split the request
transactions early where it makes sense due to a physical routing
of parts of that set of request transactions being routed on
separate physical pathways in the interconnect as well as being
routed to target IP cores physically located in different areas on
the integrated circuit, and where the flow control splitting protocol
is also configured to allow multiple transactions to be issued and
serviced in parallel, which increases an efficiency of each
initiator in being able to start having more transactions serviced
in the same period of time, where a first and a second transaction
from a first initiator IP core are issued prior to the first
transaction being completely serviced by a first target IP core,
resulting in the first initiator IP core and the first target
IP core working on multiple transactions at the same time.
4. The integrated circuit of claim 1, where the interconnect has
multiple thread merger and thread splitter units in the flow
control logic distributed over the interconnect that maintain
request order for read and write request transactions over the
interconnect, where the one or more thread splitter units route
request transactions from a first initiator IP core generating a
set of request transactions in the first initiator thread down two
or more different physical paths to the target IP cores physically
located in different areas on the integrated circuit.
5. The integrated circuit of claim 1, where the interconnect
implements an address map with assigned addresses for the target IP
cores in the integrated circuit to route request transactions
between the target IP cores and the initiator IP cores in the
integrated circuit, where the interconnect is configured to
interrogate the address map based on a logical destination address
associated with a first request to the aggregate target with two or
more interleaved memory channels, and determines which memory
channels will service the first request and how to route the first
request to the physical IP addresses of each memory channel in the
aggregate target servicing that request so that an initiator IP
core need not know of the physical IP addresses of each memory
channel in the aggregate target, and where the flow control splitting
protocol implemented in the flow control logic is also configured
to allow multiple transactions from either 1) the same
initiator IP core thread or 2) the same initiator IP core set of
tags to be outstanding to the multiple channels of the aggregated
target at the same time, and the multiple channels in the
aggregated target map to target IP cores having physically
different addresses.
6. The integrated circuit of claim 1, where the flow control logic
in the interconnect maintains a request order within the first
initiator thread and an expected response order to those requests,
and where the interconnect includes three or more initiator agents
and three or more target agents, where two or more target agents
are located at physically different locations coupling to the
interconnect but belong to the same aggregate target with multiple
channels.
7. The integrated circuit of claim 1, where one or more thread
splitter units with the flow control logic in a request network
splits the path links to the aggregate target with multiple
channels, which is a Dynamic Random-Access Memory (DRAM) IP core,
where a first request travels a first link to a first channel in
the multi-channel target DRAM and a second request travels a second
link to a second channel in the multi-channel target DRAM and two
or more target agents are coupled to the multi-channel target DRAM,
and a first target agent is assigned to the first channel and a
second target agent is assigned to the second channel for the
multi-channel target DRAM, where the first and second target agents
are at physically different locations coupling to the interconnect
and belong to the same aggregate target with multiple channels, and
where the thread splitter units and other associated flow control
logic minimize the transaction and routing congestion issues
associated with a centralized channel splitter.
8. The integrated circuit of claim 1, where a distributed
implementation in each thread splitter unit and thread merger unit
in the flow control logic is configured to allow them to
interrogate a local system address map to determine both 1) thread
routing and 2) thread buffering until a switch of physical paths
can occur, and where the thread splitter units and thread merger
units cooperate end-to-end to ensure ordering without a need to
install one or more full transaction reorder buffers within the
interconnect.
9. The integrated circuit of claim 1, where the flow control logic
internal to 1) the interconnect or 2) the initiator agent
interrogates the address map and a known structural organization of
the aggregated target in the integrated circuit to decode an
interleaved address space of the aggregated target to determine any
physical distinctions between the target IP cores making up the
aggregated target IP core in order to determine which targets
making up the aggregated target need to service a given request
from an initiator IP core, and where the flow control logic applies
a flow control splitting protocol to allow multiple transactions
from the same thread to be outstanding to multiple channels of the
aggregated target at any given time and the multiple channels in
the aggregated target map to target IP memory cores having
physically different addresses.
10. The integrated circuit of claim 1, where an initiator agent
interfacing the interconnect for a first initiator IP core
interrogates an address map based on a logical destination address
associated with a request to the aggregate target that has
interleaved two or more memory channels, and determines which
memory channels will service the request and how to route the
request to the physical IP addresses of each memory channel in the
aggregate target servicing that request so that any initiator IP
core need not know of the physical IP addresses of each memory
channel in the aggregate target.
11. The integrated circuit of claim 1, where the flow control logic
is configured to apply a flow control splitting protocol to allow
multiple transactions from the same thread to be outstanding to the
multiple channels of the aggregated target at any given time and
the multiple channels in the aggregated target map to target memory
cores having physically different addresses.
12. The integrated circuit of claim 1, where chopping logic and the
flow control logic cooperate to allow requests that are part of a
request burst transaction to cross an interleave boundary of the
aggregate target such that some request transfers are sent to one
channel target while others are sent to another channel target
within the aggregate target, where the chopping logic is internal
to the interconnect and is configured to chop individual burst
transactions that cross channel boundaries headed for channels in
the aggregate target into two or more requests.
13. The integrated circuit of claim 1, where the initiator cores do
not need a priori knowledge of a memory's address structure and
organization in the aggregate target; rather, one or more initiator
agents have this structural and organizational knowledge of memory
channels to choose a true address of the target of a request
transaction, a route to the target of the request transaction from
a first initiator IP core across the interconnect, and then a
channel within the aggregated target.
14. The integrated circuit of claim 1, where address decoding of an
intended address of the request transaction from the first
initiator thread happens as soon as the request transaction enters
an interface of the interconnect, and the flow control logic
interrogates an address map and a known structural organization of
each aggregated target IP core in the integrated circuit to decode
an interleaved address space of the aggregated targets to determine
the physical distinctions between the target IP cores making up a
particular aggregated target IP core in order to determine which
target IP cores making up a first aggregated target need to
service a current request transaction.
15. The integrated circuit of claim 1, where two or more thread
splitter units with the flow control logic are configured to route
request transactions from an initiator IP core generating a set of
transactions in the first initiator thread down two or more
different physical paths in the interconnect by routing a first
request with a destination address headed to a first physical
location on the integrated circuit, which is a first target, and
other requests within that first initiator thread having a
destination address headed to different physical locations on the
integrated circuit from the first physical location, where the
first physical location is a first channel and the different
physical location is a second channel making up part of the
aggregate target, where the first and second channels share an
address region to appear as a single logical aggregated target, and
where a channel merger component in a response path maintains
response path ordering, where a mechanism to re-order responses in
the response path includes passing information from a channel
splitter in the request path to a corresponding channel merger
component in the response path, and the information passed over
tells the thread merger component which incoming thread the next
response burst transaction should come from.
16. The integrated circuit of claim 1, where a request path in the
interconnect includes a series of splitter and merger units in the
flow control logic distributed across the interconnect to create
different physical paths across the interconnect to the aggregate
target with multiple channels, and where the aggregate target with
multiple channels has two or more discrete memory channels
including on-chip IP cores and off-chip memory cores that are
interleaved with each other to appear to system software and other
IP cores as a single memory in a system address space.
17. The integrated circuit of claim 1, where the interconnect
implements the flow control logic and flow control protocol
internal to the interconnect itself to manage expected execution
ordering of a set of issued requests within the same first
initiator thread that are serviced and responses returned in order
with respect to each other but independent of an ordering of
another thread, and the flow control logic at a thread splitter
unit permits transactions from one initiator thread to be
outstanding to multiple channels at once and therefore to multiple
individual target IP cores within a multi-channel target at once,
where different channels are mapped to two individual target IP
cores within the aggregate target with multiple channels, and the
integrated circuit has chopping logic to chop individual burst
requests that cross the memory channel address boundaries from a
first memory channel to a second memory channel within the first
aggregate target into two or more burst requests from the same
thread, where the chopping logic cooperates with a detector to
detect when the starting address of an initial word of requested
bytes in the burst request and the ending address of the last word of
requested bytes in the burst request causes the requested bytes in
that burst request to span across one or more channel address
boundaries to fulfill all of the word requests in the burst request
transaction.
18. A method of communicating requests over an interconnect in an
integrated circuit having multiple initiator IP cores and multiple
target IP cores, where the interconnect provides a shared
communications bus between the multiple initiator IP cores and
multiple target IP cores, comprising: applying a flow control
splitting protocol to permit transactions from one initiator thread
or one initiator tag stream to be outstanding to multiple channels
in a single aggregate target at once, and therefore to multiple
individual target IP cores within the aggregate target at once,
where the combined flow control logic and flow control protocol
allows the interconnect to manage simultaneous requests to multiple
channels in the aggregate target from a same thread or tag at the
same time.
Description
RELATED APPLICATIONS
[0001] This application is a continuation of patent application Ser.
No. 12/144,987, filed Jun. 24, 2008, titled "Various methods and
apparatus to support outstanding requests to multiple targets while
maintaining transaction ordering," which is related to and claims
the benefit of U.S. Provisional Patent Application Ser. No.
60/946,096, titled "An interconnect implementing internal
controls," filed Jun. 25, 2007.
NOTICE OF COPYRIGHT
[0002] A portion of the disclosure of this patent document contains
material that is subject to copyright protection. The copyright
owner has no objection to the facsimile reproduction by anyone of
the software engine and its modules, as it appears in the Patent
and Trademark Office Patent file or records, but otherwise reserves
all copyright rights whatsoever.
FIELD OF THE INVENTION
[0003] Embodiments of the invention generally relate to an
interconnect implementing internal controls to eliminate routing
congestion.
BACKGROUND OF THE INVENTION
[0004] When an SOC has multiple DRAM interfaces for accessing
multiple DRAMs in parallel at differing addresses, each DRAM
interface can be commonly referred to as a memory "channel". In the
traditional approach, the channels are not interleaved, so the
application software and all hardware blocks that generate traffic
need to make sure that they spread their traffic evenly across the
channels to balance the loading. Also, in the past, systems used
address generators that split a thread into multiple requests, each
request being sent to its own memory channel. This forced the
software and system functional blocks to be aware of the
organization and structure of the memory system when generating
initiator requests. Also, some prior supercomputer systems forced
dividing up a memory channel at the size of a burst-length request.
Also, in some prior art, requests from a processor
perform memory operations that are expanded into individual memory
addresses by one or more address generators (AGs). To supply
adequate parallelism, each AG is capable of generating multiple
addresses per cycle to the multiple segments of a divided up memory
channel. The memory channel performs the requested accesses and
returns read data to a reorder buffer (RB) associated with the
originating AG. The reorder buffer collects and reorders replies
from the memory channels so they can be presented to the initiator
core.
[0005] In the traditional approach, the traffic may be split deeply
in the memory subsystem in central routing units, which increases
traffic and routing congestion, increases design and verification
complexity, eliminates topology freedom, and increases latencies.
The created centralized point can act as a bandwidth choke point, a
routing congestion point, and a cause of longer propagation path
lengths that would lower achievable frequency and increase
switching power consumption. Also, some systems use re-order
buffers to maintain an expected execution order of transactions in
the system.
[0006] In the typical approach, area-consuming reorder buffering is
used at the point where the traffic is being merged, to hold
response data that comes back too early from a target.
SUMMARY OF THE INVENTION
[0007] A method, apparatus, and system are described, which
generally relate to an integrated circuit having an interconnect
that has multiple initiator IP cores and multiple target IP cores
that communicate request transactions over the interconnect. The
interconnect provides a shared communications bus between the
multiple initiator IP cores and multiple target IP cores. The flow
control logic for the interconnect applies a flow control splitting
protocol to permit transactions from each initiator thread and/or
each initiator tag stream to be outstanding to multiple channels in
a single aggregate target at once, and therefore to multiple
individual targets within an aggregate target at once. The combined
flow control logic and flow control protocol allows the
interconnect to manage simultaneous requests to multiple channels
in an aggregate target from the same thread or tag at the same
time.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The drawings refer to embodiments of the invention as
follows.
[0009] FIG. 1 illustrates a block diagram of an embodiment of a
System-on-a-Chip having multiple initiator IP cores and multiple
target IP cores that communicate transactions over an
interconnect.
[0010] FIG. 2 illustrates an embodiment of a map of contiguous
address space in which distinct memory IP cores are divided up in
defined memory interleave segments and then interleaved with memory
interleave segments from other memory IP cores.
[0011] FIG. 3 shows an embodiment of a map of an address region for
multiple interleaved memory channels.
[0012] FIG. 4A illustrates a block diagram of an embodiment of an
integrated circuit having multiple initiator IP cores and multiple
target IP cores that maintains request order for read and write
requests over an interconnect that has multiple thread merger and
thread splitter units.
[0013] FIG. 4B illustrates a block diagram of an embodiment of flow
control logic implemented in a centralized merger splitter unit to
maintain request path order.
[0014] FIG. 5 illustrates a block diagram of an embodiment of one
or more thread splitter units to route requests from an initiator
IP core generating a set of transactions in a thread down two or
more different physical paths.
[0015] FIG. 6 illustrates an example timeline of the thread
splitter unit in an initiator agent's use of flow control protocol
logic that allows multiple write requests from a given thread to be
outstanding at any given time but restricts an issuance of a
subsequent write request from that thread.
[0016] FIG. 7a illustrates an example timeline of an embodiment of
flow logic to split a 2D WRITE Burst request.
[0017] FIG. 7b also illustrates an example timeline of an embodiment
of flow logic to split a 2D WRITE Burst request.
[0018] FIG. 7c illustrates an example timeline of an embodiment of
flow logic to split a 2D READ Burst.
[0019] FIG. 8 illustrates a block diagram of an embodiment of a
response path from two target agents back to two initiator agents
through two thread splitting units and two thread merger units.
[0020] FIG. 9 shows the internal structure of an example
interconnect maintaining the request order within a thread and the
expected response order to those requests.
[0021] FIG. 10 illustrates a diagram of an embodiment of chopping
logic to directly support chopping individual transactions that
cross the channel address boundaries into two or more
transactions/requests from the same thread.
[0022] FIG. 11 illustrates a diagram of an embodiment of a path
across an interconnect from an initiator agent to multiple target
agents including a multiple channel aggregate target.
[0023] FIGS. 12a-12e illustrate five types of channel based
chopping for block burst requests: normal block chopping, block row
chopping, block height chopping, block deadlock chopping, and block
deadlock chopping and then block height chopping.
[0024] FIG. 13 illustrates a flow diagram of an embodiment of an
example of a process for generating a device, such as a System on a
Chip, with the designs and concepts discussed above for the
Interconnect.
[0025] While the invention is subject to various modifications and
alternative forms, specific embodiments thereof have been shown by
way of example in the drawings and will herein be described in
detail. The invention should be understood to not be limited to the
particular forms disclosed, but on the contrary, the intention is
to cover all modifications, equivalents, and alternatives falling
within the spirit and scope of the invention.
DETAILED DISCUSSION
[0026] In the following description, numerous specific details are
set forth, such as examples of specific data signals, named
components, connections, number of memory channels in an aggregate
target, etc., in order to provide a thorough understanding of the
present invention. It will be apparent, however, to one of ordinary
skill in the art that the present invention may be practiced
without these specific details. In other instances, well known
components or methods have not been described in detail but rather
in a block diagram in order to avoid unnecessarily obscuring the
present invention. Further, specific numeric references, such as a
first target, may be made. However, the specific numeric reference
should not be interpreted as a literal sequential order but rather
interpreted that the first target is different than a second
target. Thus, the specific details set forth are merely exemplary.
The specific details may be varied from and still be contemplated
to be within the spirit and scope of the present invention.
[0027] In general, a method, apparatus, and system are described,
which generally relate to an integrated circuit having an
interconnect that implements internal controls. The interconnect
may maintain request path order; maintain response path order;
interleave channels in an aggregate target with unconstrained burst
sizes; have configurable parameters for channels in an aggregate
target; chop individual transactions that cross channel boundaries
headed for channels in an aggregate target; chop individual
transactions that cross channel boundaries headed for channels in
an aggregate target so that two or more of the chopped portions
retain their 2D burst attributes, as well as implement many other
internal controls.
[0028] In an embodiment, the flow control logic for the
interconnect applies a flow control splitting protocol to permit
transactions from each initiator thread and/or each initiator tag
stream to be outstanding to multiple channels in a single aggregate
target at once, and therefore to multiple individual targets within
an aggregate target at once. The combined flow control logic and
flow control protocol allows the interconnect to manage
simultaneous requests to multiple channels in an aggregate target
from the same thread or tag at the same time.
[0029] Most aspects of the invention may be applied in most
networking environments, and an example integrated circuit, such as
a System-on-a-Chip environment, will be used to flesh out these
aspects of the invention.
[0030] FIG. 1 illustrates a block diagram of an embodiment of a
System-on-a-Chip having multiple initiator IP cores and multiple
target IP cores that communicate read and write requests as well as
responses to those requests over an interconnect. Each initiator IP
core such as a CPU IP core 102, an on-chip security IP core 104, a
Digital Signal Processor (DSP) 106 IP core, a multimedia IP core
108, a Graphics IP core 110, a streaming Input-Output (I/O) IP core
112, a communications IP core 114, such as a wireless transmit and
receive IP core with devices or components external to the chip,
etc. and other similar IP cores may have its own initiator agent
116 to interface with the interconnect 118. Each target IP core
such as a first DRAM IP core 120 through a fourth DRAM IP core 126
as well as a FLASH memory IP core 128 may have its own target agent
130 to interface with the interconnect 118. Each DRAM IP core
120-126 may have an associated memory scheduler 132 as well as DRAM
controller 134.
[0031] The Intellectual Property (IP) cores have self-contained
designed functionality to provide that macro function to the
system. The interconnect 118 implements an address map 136 with
assigned addresses for the target IP cores 120-128 and potentially
the initiator IP cores 102-114 in the system to route the requests
and potentially responses between the target IP cores 120-128 and
initiator IP cores 102-114 in the integrated circuit. One or more
address generators may be in each initiator IP core to provide the
addresses associated with data transfers that the IP core will
initiate to memories or other target IP cores. All of the IP cores
may operate at different performance rates (i.e. peak bandwidth,
which can be calculated as the clock frequency times the number of
data bit lines (also known as data width), and sustained bandwidth,
which represents a required or intended performance level). Most of
the distinct IP cores communicate to each other through the memory
IP cores 120-126 on and off chip. The DRAM controller 134 and
address map 136 in each initiator agent 116 and target agent 130
abstracts the real IP core addresses of each DRAM IP core 120-126
from other on-chip cores by maintaining the address map and
performing address translation of assigned logical addresses in the
address map to physical IP addresses.
[0032] The address mapping hardware logic may also be located
inside an initiator agent. The DRAM scheduler & controller may
be connected downstream of a target agent. Accordingly, one method
for determining the routing of requests from initiators to targets
is to implement an address mapping apparatus that associates
incoming initiator addresses with specific target IP cores. One
embodiment of such an address mapping apparatus is to implement
target address decoding logic in each initiator agent. In order for
a single initiator to be able to access all of the target IP core
locations, the initiator may need to provide more total address
values than a single target IP core contains, so the interconnect
may translate the initiator address into a target IP core address.
One embodiment of such a translation is to remove the initiator
address bits that were used to decode the selected target IP core
from the address that is presented to the target IP core.
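As an illustration of this decode-and-translate step, the following C sketch strips hypothetical target-select bits from an initiator address before the remainder is presented to the selected target. The field widths and names are assumptions for illustration, not values taken from this application.

    #include <stdint.h>

    /* Hedged sketch: the top TARGET_BITS of the initiator address
     * select a target IP core and are removed before the address is
     * presented to that target. Widths are illustrative assumptions. */
    #define ADDR_BITS   32
    #define TARGET_BITS 4              /* up to 16 decoded targets (assumed) */

    typedef struct {
        unsigned target_id;            /* which target IP core was decoded */
        uint32_t local_addr;           /* address presented to that target */
    } decoded_addr_t;

    static decoded_addr_t decode_initiator_address(uint32_t initiator_addr)
    {
        decoded_addr_t d;
        d.target_id  = initiator_addr >> (ADDR_BITS - TARGET_BITS);
        /* Remove the decode bits so the target sees only its own range. */
        d.local_addr = initiator_addr & ((1u << (ADDR_BITS - TARGET_BITS)) - 1u);
        return d;
    }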
[0033] The interconnect 118 provides a shared communications bus
between IP core sub-systems 120-128 and 102-114 of the system. All
the communication paths in the shared communication bus need not
pass through a single choke point; rather, many distributed pathways
may exist in the shared communication bus. The on-chip interconnect
118 may be a collection of mechanisms that may be adapters and/or
other logical modules along with interconnecting wires that
facilitate address-mapped and arbitrated communication between the
multiple Intellectual Property cores 102-114 and 120-128.
[0034] The interconnect 118 may be part of an integrated circuit,
such as System-on-a-Chip, that is pipelined with buffering to store
and move requests and responses in stages through the
System-on-a-Chip. The interconnect 118 may have flow control logic
that 1) is non-blocking with respect to requests from another
thread as well as with respect to requiring a response to an
initial request before issuing a subsequent request from the same
thread, 2) implements a pipelined protocol, and 3) maintains each
thread's expected execution order. The interconnect also may
support multiple memory channels, with 2D and address tiling
features, response flow control, and chopping of individual burst
requests. Each initiator IP core may have its own initiator agent
to interface with the interconnect. Each target IP core may have
its own target agent to interface with the interconnect.
[0035] The System-on-a-Chip may be pipelined to store and move
requests and responses in stages through the System-on-a-Chip. The
flow control logic in the interconnect is non-blocking with respect
to requests from another thread as well as with respect to
requiring a response to a first request before issuing a second
request from the same thread, pipelined, and maintains each
thread's execution order.
[0036] Each memory channel may be an IP core or multiple external
DRAM chips ganged together to act as a single memory that makes up
the width of a data word, such as 64 bits or 128 bits. Each IP core and DRAM
chip may have multiple banks inside that IP core/chip. Each channel
may contain one or more buffers that can store requests and/or
responses associated with the channel. These buffers can hold
request addresses, write data words, read data words, and other
control information associated with channel transactions and can
help improve memory throughput by supplying requests and write data
to the memory, and receiving read data from the memory, in a
pipelined fashion. The buffers can also improve memory throughput
by allowing a memory scheduler to exploit address locality to favor
requests that target a memory page that is already open, as opposed
to servicing a different request that forces that page to be closed
in order to open a different page in the same memory bank.
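A minimal sketch of that page-locality preference, assuming a simple open-row model; the structures, the eight-bank assumption, and the oldest-first fallback are illustrative only, not the scheduler defined by this application.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        unsigned bank;
        unsigned row;
    } pending_req_t;

    typedef struct {
        unsigned open_row[8];    /* currently open row per bank (assumed 8 banks) */
        bool     row_valid[8];
    } channel_state_t;

    /* Return the index of the oldest pending request that hits an
     * already-open page, or the oldest request overall if none hits. */
    static size_t pick_request(const channel_state_t *ch,
                               const pending_req_t *q, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            unsigned b = q[i].bank;
            if (ch->row_valid[b] && ch->open_row[b] == q[i].row)
                return i;        /* page hit: no precharge/activate needed */
        }
        return 0;                /* fall back to the oldest pending request */
    }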
[0037] One benefit of a multi-channel aggregate target is that it
provides spatial concurrency to target access, thus increasing
effective bandwidth over that achievable with a single target of
the same width. An additional benefit is that the total burst size
of each channel is smaller than the total burst size of a single
channel target with the same bandwidth, since the single channel
target would need a data word that is as wide as the sum of the
data word sizes of each of the multiple channels in an aggregate
target. The multi-channel aggregate target can thus move data
between the SoC and memory more efficiently than a single channel
target in situations where the data size is smaller than the burst
size of the single channel target. In an embodiment, this
interconnect supports a strict super-set of the feature set of the
previous interconnects.
[0038] Connectivity of multi-channel targets may be primarily
provided by cross-bar exchanges that have a chain of pipeline
points to allow groups of channel targets to be separated on the
die. The multiple channel aggregate target covers the high
performance needs of digital media dominated SOCs in the general
purpose (memory reference and DMA) interconnect space.
[0039] Also, the memory channels in an aggregate target may support
configurable parameters. The configurable
parameters flexibly support a multiple channel
configuration that is dynamically changeable and enable a single
already-designed System-on-a-Chip design to support a wide range of
packaging or printed circuit board-level layout options that use
different on-chip or external memory configurations by
re-configuring channel-to-region assignments and interleaving
boundaries between channels to better support different modes of
operation of a single package.
Interleaved Channels in an Aggregate Target with Unconstrained
Burst Sizes
[0040] Many kinds of IP core target blocks can be combined and have
their address space interleaved. The discussion below will use
discrete memory blocks as the target blocks being interleaved to
create a single aggregate target in the system address space. An
example "aggregate target" described below is a collection of
individual memory channels, such as distinct external DRAM chips,
that share one or more address regions that support interleaved
addressing across the aggregate target set. Another aggregate
target is a collection of distinct IP blocks that are being
recognized and treated as a single target by the system.
[0041] FIG. 2 illustrates an embodiment of a map of contiguous
address space in which distinct memory IP cores are divided up in
defined memory interleave segments and then interleaved with memory
interleave segments from other memory IP cores. Two or more
discrete memory channels, including on-chip IP cores and off-chip
memory cores, may be interleaved with each other to appear to system
software and other IP cores as a single memory (i.e. an aggregate
target) in the system address space. Each memory channel may be an
on-chip IP memory core, an off-chip IP memory core, a standalone
memory bank, or similar memory structure. For example, the system
may interleave a first DRAM channel 220, a second DRAM channel 222,
a third DRAM channel 224, and a fourth DRAM channel 226. Each
memory channel 220-226 has two or more defined memory interleave
segments such as a first memory interleave segment 240 and a second
memory interleave segment 242. The two or more defined memory
interleave segments from a given discreet memory channel are
interleaved with two or more defined memory interleave segments
from other discreet memory channels in the address space of a
memory map 236b. The address map 236a may be divided up into two or
more regions such as Region 1 thru Region 4, and each interleaved
memory segment is assigned to at least one of those regions and
populates the system address space for that region as shown in
236b, eventually being mappable to a physical address, in the
address space.
[0042] For example, memory interleave segments from the first and
second DRAM channels 220 and 222 are sized and then interleaved in
region 2 of the address map 236b. Also, memory interleave segments
from the third and fourth DRAM channels 224 and 226 are sized (at a
granularity smaller than interleave segments in the first and
second DRAM channels) and then interleaved in region 4 of the
address map 236b. Memory interleave segments from the first and
second DRAM channels 220 and 222 are also interleaved in region 4
of the address map 236b. Thus, a memory channel may have defined
memory interleave segments in the address space of two or more
regions and can be implemented through an aliasing technique.
Memory interleave segments from the first DRAM channel 220 of a
first size, such as a first memory interleave segment 240, are
controlled by a configurable parameter of the second region in the
address map 236b and interleave segments of a second size, such as
a third memory interleave segment 244, are controlled by a
configurable parameter of the fourth region in the address map
236b.
[0043] Thus, each memory channel 220-226 has defined memory
interleave segments and may have memory interleave segments of
different sizes. Each corresponding region in the system address
map 236b has a configurable parameter, which may be programmable at
run time or design time by software, to control the size
granularity of the memory interleave segments in the address space
assigned to that region potentially based on anticipated type of
application expected to have transactions (including read and write
requests) with the memory interleave segments in that region. As
discussed, for example, the second region in the address map 236b
has defined memory interleave segments allocated to that region
from the first memory channel 220 that have a configured
granularity at a first amount of bytes. Also, the fourth region in
the address map 236b has defined memory interleave segments
allocated to that region from the first memory channel 220 that
have a configured granularity at a second amount of bytes. Also,
each region, such as region 4, may have defined memory interleave
segments allocated to that region from two or more memory channels
220-226.
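The per-region configuration just described might be modeled as below; every field name and value is an assumption chosen to mirror the FIG. 2 example (region 2 interleaving two channels coarsely, region 4 interleaving four channels at a finer granularity), not a definition from this application.

    #include <stdint.h>

    /* Illustrative model of an address-map region: each region has
     * its own set of channels and its own interleave granularity, so
     * one DRAM channel can appear in two regions with different
     * segment sizes. All names and values are assumptions. */
    typedef struct {
        uint64_t base;             /* region base address in system map */
        uint64_t size;             /* total region size in bytes        */
        uint32_t interleave_bytes; /* per-region segment granularity    */
        unsigned num_channels;     /* active channels in this region    */
        unsigned channel_ids[8];   /* which memory channels participate */
    } address_region_t;

    static const address_region_t region2 = {
        .base = 0x40000000, .size = 0x10000000,
        .interleave_bytes = 4096, .num_channels = 2, .channel_ids = {0, 1}
    };
    static const address_region_t region4 = {
        .base = 0xC0000000, .size = 0x20000000,
        .interleave_bytes = 1024, .num_channels = 4, .channel_ids = {2, 3, 0, 1}
    };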
[0044] FIG. 3 shows an embodiment of a map of an address region for
multiple interleaved memory channels. The address region 346 of the
address map 336 may have address space, for example, from 00000 to
3FFFF in the hexadecimal numbering system. The address region 346
has interleaved addressing across multiple channels in an
aggregated target. The global address space covered by the address
region 346 may be partitioned into the set of defined memory
interleave segments from the distinct memory channels. The defined
memory interleave segments are non-overlapping in address space and
collectively cover and populate the entire region 346 in that
address space. Each interleaved memory segment from an on-chip or
off-chip IP memory core/channel is then sequentially stacked with the
defined interleaved segments from the other on-chip IP memory cores
to populate address space in the address map. The maximum number of
channels associated with a region may be a static value derived
from the number of individual targets associated with the region,
and from the nature of the target. Individual targets and
multi-ported targets may have a single channel; multi-channel
targets have up to 2, 4, or 8 channels. In an embodiment, a
num_channels attribute is introduced for the "region" construct
provided in the RTL.conf syntax and is used to indicate the maximum
number of active channels an address region can have. It may be
possible to configure the address map to use fewer than the static
number of individual targets associated with the region. The first
defined memory interleave segment 340 in the region 336 is mapped
to channel 0. The second defined memory interleave segment 342 in
the region 336 is mapped to channel 1. The third defined memory
interleave segment 344 in the region 336 is mapped to channel 2.
The next defined memory interleave segment 346 in the region 336 is
mapped to channel 3. This process continues until a memory
interleave segment is mapped to the last channel active in this
region. This completes what is known as a "channel round". The
sequential stacking process of memory interleave segments in the
address space assigned to a region is then repeated until enough
channel rounds are mapped to completely cover the address space
assigned to a particular region. This address region 336 will be
treated as an aggregate target. A request for data, such as a
first request 348 to that aggregate target in this region, may
then require response data that spans multiple defined memory
interleave segments and thus multiple discrete memory IP
cores. Also, a physical memory location in an on-chip or off-chip
memory may actually be assigned to multiple regions in the system
address space and thus have multiple assigned system addresses from
that address map to the same physical memory location. Such
multiple mapping, sometimes termed address aliasing, can be used to
support multiple ways of addressing the same memory location or to
support dynamic allocation of the memory location to either one
region or the other, when the different regions have different
interleaving sizes or channel groupings and may therefore have
different access performance characteristics.
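The sequential-stacking rule above reduces to simple modular arithmetic: segment i of a region maps to channel (i mod num_channels), and one pass over all active channels is a "channel round". A hedged C sketch, with illustrative names:

    #include <stdint.h>

    typedef struct {
        unsigned channel;        /* which channel services this address */
        uint64_t channel_off;    /* byte offset within that channel     */
    } channel_route_t;

    static channel_route_t route_in_region(uint64_t addr, uint64_t region_base,
                                           uint32_t seg_bytes,
                                           unsigned num_channels)
    {
        uint64_t off     = addr - region_base;
        uint64_t segment = off / seg_bytes;      /* global segment index */
        channel_route_t r;
        r.channel     = (unsigned)(segment % num_channels);
        /* Within a channel, rounds stack contiguously: round index
         * times segment size, plus the offset inside the segment. */
        r.channel_off = (segment / num_channels) * seg_bytes + off % seg_bytes;
        return r;
    }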
[0045] Each memory interleave segment is defined and interleaved in
the system address space at a size granularity unconstrained by a
burst length request allowed by the DRAM memory design
specification by a system designer. The size granularity of memory
interleave segment may be a defined length between a minimum DRAM
burst length request allowed by the DRAM memory design
specification configured into the DRAM and an anticipated maximum
DRAM memory page length as recognized by the memory configuration.
The size of this granularity is a configurable value supplied by
user, such as software programmable. For example, the defined
length supplied by the user may be between 64 Bytes and 64
Kilobytes.
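A small sketch of that configurability constraint, using the 64 Byte and 64 Kilobyte example bounds from the text; the power-of-two requirement is an added assumption, not something this application states.

    #include <stdbool.h>
    #include <stdint.h>

    /* Check a user-supplied interleave granularity against the
     * example bounds above; power-of-two check is assumed. */
    static bool interleave_size_ok(uint32_t bytes)
    {
        if (bytes < 64u || bytes > 64u * 1024u)
            return false;
        return (bytes & (bytes - 1u)) == 0u;   /* power of two (assumed) */
    }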
[0046] Logically, this aggregated target presents itself as a
single target to other IP cores but interleaves the memory
interleave segments in the address map of the system from multiple
on-chip IP memory cores/memory channels. Thus, each DRAM IP
core/channel may be physically divided up into interleaving
segments at a size granularity supplied by the user. An initiator
agent interfacing the interconnect for a first initiator IP core
interrogates the address map based on a logical destination address
associated with a request to the aggregate target of the
interleaved two or more memory channels and determines which memory
channels will service the request and how to route the request to
the physical IP addresses of each memory channel in the aggregate
target servicing that request so that any IP core need not know of
the physical IP addresses of each memory channel in the aggregate
target.
[0047] Application traffic automatically statistically spreads
across the memory channels by
virtue of the system designer configuring the granularity of the
interleave segments based on the address patterns associated with
expected request traffic to that region/aggregated target. Requests
sent by a single initiating thread to a multi-channel address
region can cross the interleave boundary such that some transfers
are sent to one channel target while others are sent to another
channel target within the aggregate target. These requests can be
part of a request burst that crossed a channel interleave boundary
or independent transactions. Thus, if the expected request traffic
for the system is dominated by requests that linearly access
memory locations by virtue of the code in the programs they run, the
size granularity is set up such that the several requests will be
serviced by a first memory channel followed by maybe one request
falling on both sides of a memory channel boundary followed by
several requests being serviced by a second memory channel. The
traffic spreading is due to system addressing, size granularity of
the memory segment, and the memory channels being stacked
sequentially. Thus, for example, requests a-c 350 from the same
thread may be serviced exclusively by memory channel 2, while
request d 352 is partially serviced by both memory channel 2 and
memory channel 3. This way of sequentially stacking defined
memory interleave segments in the address space from different
memory cores/channels allows inherent spreading/load balancing
between memory cores and takes advantage of the principle of
locality (i.e., requests in a thread tend to access memory
addresses close to the last request and potentially reuse the
same access data).
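A toy demonstration of this spreading for a linear access pattern, with made-up addresses, a 1 KB interleave, and four channels: it prints several requests landing on one channel and the occasional request straddling a boundary, as with request d 352 above.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint32_t seg = 1024;          /* interleave granularity (assumed) */
        const unsigned nch = 4;             /* active channels (assumed)        */
        /* 384-byte requests walking linearly through the region. */
        for (uint64_t addr = 0x2800; addr < 0x3000; addr += 0x180) {
            uint64_t end   = addr + 0x17F;
            unsigned first = (unsigned)((addr / seg) % nch);
            unsigned last  = (unsigned)((end  / seg) % nch);
            if (first == last)
                printf("req @0x%llx -> channel %u\n",
                       (unsigned long long)addr, first);
            else
                printf("req @0x%llx -> split across channels %u and %u\n",
                       (unsigned long long)addr, first, last);
        }
        return 0;
    }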
[0048] Referring to FIG. 1, an initiator IP core itself and system
software are decoupled from knowing the details of the organization
and structure of the memory system when generating the addresses of
requests going to memory targets. Requests from the initiator
cores, such as a CPU 102, to perform memory operations can be
expanded into individual memory addresses by one or more address
generators (AGs). To supply adequate parallelism, an AG in the
initiator agent generates a single address per request, and several
AGs may operate in parallel, with each generating accesses from
different threads. The address generators translate system
addresses in the memory map into real addresses of memory cells
within a particular IP memory core or in some cases across a
channel boundary. A generated request may have an address with
additional fields for memory channel select bits, which aid in
decoding where to retrieve the desired information in a system
having one or more aggregated targets. The initiator agents, such
as a first initiator agent 158, may have address generators with
logic to add channel select bits into the address of a generated
request from an IP core. At least part of the address decode of a
target's address may occur at the interface where a request first
enters the interconnect, such as at an initiator agent. An address
decoder may decode an address of a request to route the request to
the proper IP memory core based on, for example, the low bits of
the memory address. The address decoder removes the channel select
bits from the address and then passes the address to the address
decoders/generator(s) in the memory controller. The addresses
presented to a channel target may be shifted, for example, to the
right to compensate for channel selection bit(s). The memory
scheduler 132 may also decode/translate a system's memory target
address sent in a request to determine a defined memory segment's
physical location on a chip (i.e., rank, bank, row, and column
address information). Each access can be routed to the appropriate
memory channel (MC) via a look-up table. The address map 136 with
details of the organization and structure of the memory system
exists in each initiator agent coupled to an IP core. The memory
scheduler 132 schedules pending accesses in a channel-buffer,
selecting one access during each DRAM command cycle, sending the
appropriate command to the DRAM, and updating the state of the
pending access. Note that a single memory access may require as
many as three DRAM commands to complete. The memory channel then
performs the requested accesses and returns one or more responses
with the read data to a buffer. The target agent collects replies
from the memory channels so they can be presented to the initiator
core in the expected in-order response order.
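The channel-select handling just described (select bits picked out of the address, then the upper bits shifted right to compensate) might look like the following sketch; the bit positions and names are assumptions for illustration.

    #include <stdint.h>

    /* The channel select bits are assumed to sit just above the
     * interleave-offset bits; the remaining high bits are shifted
     * down so the channel target sees a dense local address. */
    static uint64_t strip_channel_select(uint64_t addr,
                                         unsigned seg_shift,  /* log2(segment size) */
                                         unsigned ch_bits,    /* log2(num channels) */
                                         unsigned *channel_out)
    {
        uint64_t seg_off = addr & ((1ull << seg_shift) - 1);
        *channel_out     = (unsigned)((addr >> seg_shift)
                                      & ((1ull << ch_bits) - 1));
        /* Close the gap left by the removed select bits. */
        uint64_t high    = addr >> (seg_shift + ch_bits);
        return (high << seg_shift) | seg_off;
    }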
[0049] Thus, the IP cores 102-114 and 120-126 do not need
hardware and software built in to keep track of the memory address
structure and organization. The IP cores 102-114 and 120-126
do not need a priori knowledge of the memory address structure and
organization. The initiator agents 116 have this information and
isolate the cores from needing this knowledge. The initiator agents
116 have this information to choose the true address of the target,
the route to the target from the initiator across the interconnect
118, and then the channel route within an aggregated target. The
memory scheduler 132 may receive a request sent by the initiator
agent and translate the target address and channel route to rank,
bank, row, and column address information in the various memory
channels/IP cores. In an embodiment, the multiple channel nature of
an aggregate target is abstracted from the IP cores in the system,
and that structural and organizational knowledge of memory
channels is put onto either each initiator agent 116 in the system
or the centralized memory scheduler 132 in the system.
[0050] The flow control protocol and flow control logic ensure that
the transactions are re-assembled correctly in the response path
before the corresponding responses are returned to the initiator IP
core.
[0051] It is desirable in interleaved multi-channel systems that
each initiator distributes its accesses across the channels roughly
equally. The interleave size has an impact on this. The expected
method to allocate bandwidth N to a thread is to program each
channel QOS allocation as (N/channels) plus a small tolerance
margin. If the application is known to have a channel bias,
non-symmetric allocations can be made instead. If region
re-definition is used, the number of active channels may differ in
different boot setups. Having separate allocations at each channel
is useful to accommodate this.
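A minimal sketch of the (N/channels)-plus-margin programming rule above; the 5% tolerance margin and integer bandwidth units are assumed values, not figures from this application.

    /* Per-channel QoS allocation for a thread needing thread_bw
     * units of bandwidth across num_channels channels. */
    static unsigned per_channel_allocation(unsigned thread_bw,
                                           unsigned num_channels)
    {
        unsigned base = thread_bw / num_channels;
        return base + (base + 19) / 20;   /* += ~5% margin, rounded up */
    }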
[0052] For multiple channel DRAM, the percentage of service
bandwidth that is elastic is greater than for single DRAM. Each
channel still has the page locality based elasticity. Additionally,
there is elasticity related to the portion of service bandwidth
from a single channel that is available to each initiator. If the
address streams distribute nicely across channels, then there is a
certain level of contention at each channel. If the address streams
tend to concentrate at a few channels, then other channels are
lightly used (less than 100% utilized), and therefore the aggregate
service rate is reduced.
[0053] In an embodiment, the flow control logic internal to the
interconnect may interrogate the address map and a known structural
organization of an aggregated target in the integrated circuit to
decode an interleaved address space of the aggregated target to
determine the physical distinctions between the targets making up
the first aggregated target in order to determine which targets
making up the first aggregated target need to service a first
request. The flow control logic applies a flow control splitting
protocol to allow multiple transactions from the same thread to be
outstanding to multiple channels of an aggregated target at any
given time and the multiple channels in the aggregated target map
to IP memory cores having physically different addresses. The flow
control logic internal to the interconnect is configured to
maintain request order routed to the target IP core. The flow
control mechanism cooperates with the flow control logic to allow
multiple transactions from the same thread to be outstanding to
multiple channels of an aggregated target at any given time and the
multiple channels in the aggregated target map to IP memory cores
having physically different addresses.
[0054] The interconnect implements an address map with assigned
addresses for target IP cores in the integrated circuit to route
the requests between the target IP cores and initiator IP cores in
the integrated circuit. A first aggregate target of the target IP
cores includes two or more memory channels that are interleaved in
an address space for the first aggregate target in the address map.
Each memory channel is divided up in defined memory interleave
segments and then interleaved with memory interleave segments from
other memory channels. Each memory interleave segment of those
memory channels is defined and interleaved in the address space at
a size granularity, chosen by a system designer, that is
unconstrained by a burst length request allowed by the memory
design specification. The size granularity of a memory interleave
segment can be a defined
length between a minimum burst length request allowed by a DRAM
memory design specification configured into the DRAM and an
anticipated maximum DRAM memory page length as recognized by the
memory configuration and the size of this granularity is
configurable.
[0055] The two or more discrete memory channels may include on-chip
IP memory cores and off-chip memory cores that are interleaved with
each other to appear to system software and other IP cores as a
single memory in the address space.
[0056] An initiator agent interfacing the interconnect for a first
initiator IP core is configured to interrogate the address map
based on a logical destination address associated with a first
request to the aggregate target of the interleaved two or more
memory channels and determines which memory channels will service
the first request and how to route the first request to the
physical IP addresses of each memory channel in the aggregate
target servicing that request so that the first IP core need not
know of the physical IP addresses of each memory channel in the
aggregate target.
[0057] The two or more memory channels are interleaved in the
address space of the system address map to enable automatic
statistical spreading of application requests across each of the
memory channels over time, to avoid locations of uneven load
balancing between distinct memory channels that can arise when too
much traffic targets a subset of the memory channels making up the
aggregated target.
[0058] The address map can be divided up into two or more regions
and each interleaved memory interleave segment is assigned to at
least one of those regions and populates the address space for that
region. Memory channels can have defined memory interleave segments
in the address space of two or more regions. Memory interleave
segments in the address space assigned to a given region have a
unique tiling function used in two-dimensional (2D) memory page
retrieval for 2D block requests, and the memory interleave
segments are addressable through a memory scheduler.
[0059] Chopping logic internal to the interconnect chops
individual burst transactions that cross channel boundaries headed
for channels in the first aggregate target into two or more
requests. The chopping logic chops the individual transactions that
cross channel boundaries headed for channels in the aggregate
target so that the two or more resulting requests retain their
2D burst attributes.
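For an incrementing burst, the channel-boundary chop described above can be sketched as follows; 2D/block bursts (see FIGS. 12a-12e) require additional cases not shown, and all names here are illustrative assumptions.

    #include <stdio.h>
    #include <stdint.h>

    /* Cut a burst [addr, addr+len) at every interleave-boundary
     * multiple, producing one sub-request per segment touched. */
    static unsigned chop_burst(uint64_t addr, uint64_t len, uint64_t seg_bytes)
    {
        unsigned pieces = 0;
        while (len > 0) {
            uint64_t room  = seg_bytes - (addr % seg_bytes); /* to boundary */
            uint64_t chunk = len < room ? len : room;
            printf("  sub-request @0x%llx len %llu\n",
                   (unsigned long long)addr, (unsigned long long)chunk);
            addr += chunk;
            len  -= chunk;
            pieces++;
        }
        return pieces;
    }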
[0060] The address map register block contains the following two
types of registers: the base register and the control register.
Each pair of base and control registers corresponds to a
multi-channel address region. A base register contains the base
address of the multi-channel address region. The fields of the
control register contain the other configuration parameters.
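A hypothetical layout for one base/control register pair: since the text only specifies that the base register holds the region's base address, every control field suggested in the comment below is an assumption.

    #include <stdint.h>

    typedef struct {
        uint64_t base;      /* base address of the multi-channel region  */
        uint32_t control;   /* e.g. active-channel count, interleave
                               size, channel-to-region assignment (all
                               assumed fields, not from the patent)      */
    } region_regs_t;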
[0061] The system may also support enhanced concurrency management.
With support for Open Core Protocol (OCP) threads and OCP tags,
and connectivity to AXI with its master IDs, it is important
that the interconnect have flexible mappings between the external
and internal units of concurrency. This will likely take the form
of flexible thread/tag mappings. The interconnect has an efficient
mechanism for managing concurrency cost versus performance
trade-offs. Thread mapping and thread collapsing may be
used to manage concurrency cost versus performance trade-off needs
along with a fine granularity of control. Providing combined OCP
thread and OCP tag support is one way to address these needs. Also,
additional control may be supplied by specifying tag handling where
initiator thread merging to target threads occurs. Support for
partial thread collapsing is another feature that can address these
trade-off needs.
[0062] In an embodiment, if an initiator agent connects to one
individual target agent in a multi-channel target, this initiator
agent should connect to all individual target agents in the
multi-channel target.
Maintaining Request Path Order
[0063] FIG. 4A illustrates a block diagram of an embodiment of an
integrated circuit, such as a SoC, having multiple initiator IP
cores and multiple target IP cores that maintains request order for
read and write requests over an interconnect that has multiple
thread merger and thread splitter units. Each initiator IP core
such as a Central Processor Unit IP core 602 may have its own
initiator agent 658 to interface with the interconnect 618. Each
target IP core such as a first DRAM IP core may have its own
target agent to interface with the interconnect 618. Each DRAM
IP core 620-624 may have an associated memory scheduler 632, DRAM
controller 634, and PHY unit 635. The interconnect 618 implements
flow control logic internal to the interconnect 618 itself to
manage the order in which each issued request in a given thread
arrives at its destination address, on a per-thread basis. The
interconnect 618 also implements a flow control protocol
internal to the interconnect in the response network to enforce
ordering restrictions of when to return responses within a same
thread, in the order in which the corresponding requests were
transmitted. The interconnect 618 implements flow control logic and
a flow control protocol internal to the interconnect itself to
manage the expected execution ordering of a set of issued requests within the
same thread that are serviced and responses returned in order with
respect to each other but independent of an ordering of another
thread. The flow control logic at a thread splitter unit permits
transactions from one initiator thread to be outstanding to
multiple channels at once and therefore to multiple individual
targets within a multi-channel target at once. This includes a
single transaction targeted at two different channels, as well as two
transactions (from the same initiator thread) each targeted at a
single but different channel, where these two different channels
are mapped to two individual targets within a multi-channel
target.
[0064] Thread splitter units near or in an initiator agent send
parts of the thread, such as requests, to multiple separate
physical pathways on the chip. For example, a thread splitter unit
in the first initiator agent 658 associated with the CPU core 602
can route transactions in a given thread down a first physical
pathway 662 to a first combined thread merger-splitter unit 668,
down a second physical pathway 664 to a second combined thread
merger-splitter unit 670, or down a third physical pathway 666 to a
third combined thread merger-splitter unit 672. The flow control
logic applies the flow control splitting protocol to split the
traffic early, where doing so makes sense because parts of that set
of transactions are routed on separate physical pathways in the
system and to targets physically located in different areas in the
system/on the chip.
[0065] Thread merger units near or in a target agent ensure that
responses to the requests from that thread segment come back from
the target core to the initiator core in the expected in-order
response sequence. For example, the first thread merger unit 668
near the first target agent 631 ensures that responses to the
requests from a given thread come back from the first target DRAM
IP core 620 and the second target DRAM IP core 622 to the first
initiator core 602 in the expected in-order response sequence.
[0066] Threads from two different initiators may be combined into a
single third thread in a thread merger unit. Parts of a single
thread may be split into two different threads in a thread splitter
unit. The merger and splitter units may use thread ID mapping to
combine or split threads having different thread identifiers. Each
thread merger unit and thread splitter unit may maintain a local
order of transactions at that splitting or merging point and couple
that with a simple flow control mechanism for responses.
[0067] As discussed, a thread splitter unit in an initiator agent,
such as the first initiator agent 658, may split a set of
transactions in a given thread from a connected initiator IP core
where the split-up parts of the set of transactions are being
routed on separate physical pathways to their intended targets
(i.e., two different channels and two different target IP cores).
The flow control logic associated with that splitter unit stops the
issuance of a next request from the same thread headed to a
physical pathway other than the physical pathway being used by
outstanding requests in that same thread; the switch to route
requests from the same thread with destination addresses down the
other physical pathway occurs only when all acknowledge
notifications from outstanding requests in that same thread going
to the current physical pathway have been returned to the splitter
unit. The flow control logic may be part of a thread splitter unit
or a separate block of logic coordinating with a thread splitter
unit. Thus, the thread splitter unit implements flow control to
prevent the issuance of a next request from the same thread headed
to a first physical pathway 662, such as a link, other than the
current physical pathway being used by outstanding requests in that
same thread, until all acknowledge notifications from outstanding
requests in that same thread going to the current physical pathway
are communicated back to the thread splitter unit.
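The acknowledge-gated pathway switching described above can be
sketched as follows; this is an assumed model, not the patent's
implementation, and the class and method names are invented for
illustration:

    # Sketch of per-thread splitter flow control: requests may switch
    # to a different physical pathway only once every outstanding
    # request on the current pathway has been acknowledged.

    class ThreadSplitterFlowControl:
        def __init__(self):
            self.current_path = None     # pathway the thread currently uses
            self.outstanding_acks = 0    # acks still owed by that pathway

        def can_issue(self, path):
            """Issue if staying on the current pathway, or if switching
            pathways and no acknowledges remain outstanding."""
            return path == self.current_path or self.outstanding_acks == 0

        def issue(self, path):
            assert self.can_issue(path), "request must be buffered, not issued"
            self.current_path = path
            self.outstanding_acks += 1   # an ack travels back per request

        def ack_returned(self):
            self.outstanding_acks -= 1   # acknowledge notification came back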
[0068] The flow control logic tracks acknowledge notifications from
requests within the same thread, indicating safe arrival of those
requests, to ensure all previous requests headed toward an intended
target have reached a last thread merger unit prior to the intended
target IP core before requests from the same thread are routed
along a separate physical path to a second intended target. The flow
control logic applies a flow control protocol to stop issuance of
requests from the same thread only when requests from that thread
are being routed to separate physical pathways in the system. The
thread splitter unit and associated flow control logic allow much
more flexibility about where in the interconnect topology each
target or channel is attached, and minimize the traffic and routing
congestion issues associated with a centralized target/channel
splitter.
[0069] In an embodiment, address decoding the intended address of
the request from a thread happens as soon as the request enters the
interconnect interface such as at the initiator agent. The flow
control logic interrogates the address map and a known structural
organization of each aggregated target in the system to decode an
interleaved address space of the aggregated targets to determine
the physical distinctions between the targets making up a
particular aggregated target in order to determine which targets
making up the first aggregated target need to service a current
request. The multiple channels in the aggregated target 637 map to
IP memory cores 620 and 622 having physically different addresses.
The flow logic may cooperate with the chopping logic, which
understands the known structural organization of the aggregated
targets, including how the memory interleave segments wrap across
channel boundaries of different channels in a channel round,
returning to the original channel and then repeating this wrapping
pattern. Thus, the flow logic of an initiator agent may route
requests both to a proper channel, such as 620 and 622, in an
aggregated target 637, and to a specific target 628 amongst all the
other targets on the chip. Overall, the flow control logic applies
the flow control splitting protocol to allow multiple transactions
from the same thread to be outstanding to multiple channels at any
given time.
[0070] Requests being routed through separate physical pathways can
be split at an initiator agent as well as at other splitter units
in a cascaded, highly pipelined splitter system. FIG. 4B illustrates
a block diagram of an embodiment of flow control logic 657
implemented in a centralized merger splitter unit 668b to maintain
request path order.
[0071] FIG. 5 illustrates a block diagram of an embodiment of one
or more thread splitter units to route requests from an initiator
IP core 716 generating a set of transactions in a thread down two
or more different physical paths by routing a first request with a
destination address headed to a first physical location on the
chip, such as a first target 724, and other requests within that
thread having a destination address headed to different physical
locations on the chip from the first physical location such as a
first channel 722 and a second channel 720 making up an aggregate
second target 737. The first and second channels 720 and 722 share
an address region so as to appear as a single logical aggregated
target 737. The initiator agent 716 may route requests from the thread to
a first thread splitter unit. The first thread splitter unit 761
may route the request depending on its destination address down one
or more different physical pathways such as a first link 762 and a
second link 764.
[0072] In the IA 716, when the address lookup is done, the physical
route to the request destination and the return route for the
acknowledge notification are also looked up. The IA
716 looks up the acknowledge notification return route statically
at the time when the sending address/route lookup takes place. An
ordered flow queue, such as a first order flow queue 717, exists
per received thread in each thread splitter unit 761 and 763, and
thread merger unit 765, 767, 769 and 771. The ordered flow queue
may have a First-In-First-Out ordering structure. One turnaround
First-In-First-Out ordered flow queue may be maintained per
received thread in that first splitter unit. Logic circuitry and
one or more tables locally maintain a history of requests in each
ordered flow queue of transactions entering/being stored in that
ordered flow queue. As discussed, the flow logic tracks acknowledge
notifications/signals from requests from within the same thread to
ensure all previous requests headed toward an intended target have
reached the last merger unit prior to the intended target before
requests from the same thread are routed along a separate physical
path to a second intended target.
[0073] The first-in-first-out inherent ordering of the queue may be
used to establish a local order of received requests in a thread,
and this maintained local order of requests in a particular thread
may be used to compare requests to other requests in that same
thread, ensuring that a subsequent request to a different link is
not released from the splitter unit until all earlier requests from
that same thread going to the same target have communicated
acknowledge signals back to the splitter unit that is splitting
parts of that thread.
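A minimal sketch of such a per-thread ordered flow queue, assuming
acknowledge notifications return in FIFO order (the class and
method names are illustrative, not from the patent text):

    # Sketch of the ordered flow queue: a FIFO records which link each
    # request in a thread went down, so a switch to a different link
    # is held back until the earlier entries have been acknowledged.

    from collections import deque

    class OrderedFlowQueue:
        def __init__(self):
            self.history = deque()       # FIFO of (link, acked) entries

        def record(self, link):
            self.history.append({"link": link, "acked": False})

        def acknowledge(self):
            # Acks return in order: mark the oldest un-acked entry, then
            # retire fully acknowledged entries from the head.
            for entry in self.history:
                if not entry["acked"]:
                    entry["acked"] = True
                    break
            while self.history and self.history[0]["acked"]:
                self.history.popleft()

        def may_release(self, link):
            """A request to a new link waits until the history drains."""
            return all(e["link"] == link for e in self.history)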
[0074] The thread splitter units are typically located where a
single transaction in a given thread may be split into two or more
transactions, and the split up parts of the single transaction are
being routed on separate physical pathways to their intended
targets. In an embodiment, when a transaction transfer/part of the
original request reaches a last serialization point (such as the
last thread merger unit) prior to the intended target of that
transfer, then an acknowledge notification is routed back to the
initial thread splitter unit. Note, pending transactions are
serialized until an acknowledge signal is received from all
previous requests from a different physical path but not serialized
with respect to receiving a response to any of those requests in
that thread. The flow control protocol for requests is also
non-blocking with respect to other threads, and non-blocking in
that it does not require a response to a first request before a
second request from the same thread is issued.
[0075] A first thread splitting unit 761 may be cascaded in the
request path with a second thread splitting unit 763. Subsequent
thread splitter units in the physical path between an initial
thread splitter unit and the intended aggregated target channel may
be treated as a target channel by the flow control logic associated
with the initial thread splitter unit. Request path thread merger
units can be cascaded too, but the acknowledge notification for
each thread should come from the last thread merger on the path to
the intended target channel. As discussed, thread splitter units
can be cascaded, but acknowledge notification needs to go back to
all splitters in the path. The flow logic in each splitter unit in
the physical path blocks changes to a different `branch/physical
pathway` until all acknowledge notifications from the open branch
are received. Note, the response return network may be an exact
parallel of the forward request network illustrated in FIG. 5 and
could even use the same interconnect links 762 and 764 with reverse
flow control added.
[0076] In an embodiment, an upstream splitter unit will continue to
send multiple requests from a given thread to another splitter
until the subsequent request needs to be split down the separate
physical pathway at the downstream thread splitter unit. The
downstream splitter unit causing the pathway splitting then
implements flow control buffering of the subsequent request from
the same thread heading down a separate physical pathway from all
of the outstanding requests from that thread, until all of the
outstanding requests from that thread headed down the initial
physical pathway have communicated an acknowledge notification of
receipt of those outstanding requests to the downstream thread
splitter unit causing the pathway splitting.
[0077] In an embodiment, the interconnect for the integrated
circuit communicates transactions between the one or more initiator
Intellectual Property (IP) cores and multiple target IP cores
coupled to the interconnect. The interconnect may implement a flow
control mechanism having logic configured to support multiple
transactions issued from a first initiator in parallel with respect
to each other and issued to at least one of 1) multiple discrete
target IP cores and 2) an aggregate target that includes two or
more memory channels that are interleaved in an address space for
the aggregate target in an address map, while maintaining an
expected execution order within the transactions. The flow control
mechanism has logic that supports a second transaction to be issued
from the first initiator IP core to a second target IP core before
a first transaction issued from the same first initiator IP core to
a first target IP core has completed while ensuring that the first
transaction completes before the second transaction and while
ensuring an expected execution order within the first transaction
is maintained. The first and second transactions are part of a same
thread from the same initiator IP core. The first and second
transactions are each composed of one or more requests and one or
more optional responses. An initiator sending a request and a
target sending a response to the request would be a transaction.
Thus, a write request from the initiator and a write response from
the target to the original write would still be a transaction.
[0078] A thread splitting unit may be cascaded in the request path
with another thread splitting unit. An upstream thread splitter may
continuously send requests from a given thread from the upstream
thread splitter unit to a downstream thread splitter unit until the
subsequent request needs to be split down the separate physical
pathway at the downstream thread splitter unit. The downstream
thread splitter unit implements flow control buffering of the
subsequent request from the same thread heading down a separate
physical pathway from all of the outstanding requests from that
thread until all of the outstanding requests from that thread
headed down the initial physical pathway have communicated an
acknowledge notification of receipt of those outstanding requests
to the downstream thread splitter unit causing the pathway
splitting.
[0079] The system can be pipelined with buffers in the interconnect
component to store and move requests and responses in stages
through the system. The system also uses a pipeline storage system
so multiple requests may be sent from the same initiator, each
request sent out on a different cycle, without the initiator having
to wait to receive a response to the initial request before
generating the next request. The thread splitter units in the
interconnect must simply wait for the acknowledge notifications of
issued requests before sending a next request down a different
physical pathway than that used by the previous requests.
[0080] The flow logic prevents a request path deadlock by using
acknowledge notifications, which are propagated back up the request
network from the last thread merge unit. The flow logic uses the
above flow control protocol as an interlock that virtually assures
no initiator thread will have transactions outstanding to more than
one target at a time. Yet, the flow control protocol does permit
transactions from one initiator thread to be outstanding to
multiple channels in a single aggregate target at once, and
therefore to multiple individual targets within an aggregate target
at once. Since the rate of progress at these individual targets may
be different, it is possible that responses will be offered to an
initiator core out of order with respect to how the requests were
issued by the initiator core. A simple response flow control
protocol may be used to ensure responses to these requests will be
offered to the initiator core in the expected order with respect to
how the requests were issued by the initiator core. The combined
request flow control logic and simple response flow control
protocol allows the interconnect to manage simultaneous requests to
multiple channels in an aggregate target from the same thread at the
same time.
[0081] The combined request flow control logic and simple response
flow control protocol implemented at each thread splitter unit and
thread merger unit allows this control to be distributed over the
interconnect. The distributed implementation in each thread
splitter unit and thread merger unit allows them to interrogate a
local system address map to determine both thread routing and
thread buffering until a switch of physical paths can occur. This
results in lower average latency for requests. It also provides
software transparency, because software and, in fact, the IP cores
themselves need not be aware of the actual aggregated target structure. The
thread splitter units and thread merger units cooperate end-to-end
to ensure ordering without a need to install full transaction
reorder buffers within the interconnect.
[0082] Similarly, FIG. 11 illustrates a diagram of an embodiment of
a path across an interconnect from an initiator agent to multiple
target agents including a multiple channel aggregate target
1579.
[0083] As discussed, the interconnect for the integrated circuit is
configured to communicate transactions between one or more
initiator Intellectual Property (IP) cores and multiple target IP
cores coupled to the interconnect. The interconnect implements
logic configured to support multiple transactions issued from a
first initiator IP core to the multiple target IP cores while
maintaining an expected execution order within the transactions.
The logic supports a second transaction to be issued from the first
initiator IP core to a second target IP core before a first
transaction issued from the same first initiator IP core to a first
target IP core has completed while ensuring that the first
transaction completes before the second transaction. The logic does
not include any reorder buffering, and ensures that the expected
execution order for the first transaction and second transaction
is maintained. The first and second transactions may be part of a
same thread from the first initiator IP core, and the expected
execution order within the first transaction is independent of
ordering of other threads. The logic may be configured to support
one or more transactions issued from a second initiator IP core to
at least the first target IP core, simultaneous with the multiple
transactions issued from the first initiator IP core to the first
and second target IP cores while maintaining the expected execution
order for all of the transactions and thereby allowing transactions
from several initiators to be outstanding simultaneously to several
targets. The flow control logic is associated with a thread
splitter unit in a request path to a destination address of a
target IP core. The first and second transactions may be each
composed of one or more requests and one or more optional
responses. The aggregate target IP core of the multiple target IP
cores may include two or more memory channels that are interleaved
in an address space for the aggregate target in an address map. The
thread splitter unit implements flow control to prevent an issuance
of a next request from a same thread from the first initiator IP
core headed to a first physical pathway, other than a current
physical pathway being used by outstanding requests in that same
thread until all acknowledge notifications from outstanding
requests in that same thread going to the current physical pathway
are communicated back to the thread splitter unit.
[0084] FIG. 6 illustrates an example timeline of a thread splitter
unit in an initiator agent using flow control protocol logic that
allows multiple write requests from a given thread to be
outstanding at any given time, such as a first write burst request
851 and a second write burst request 853, but restricts the
issuance of a subsequent write request from that thread, such as a
third write burst request 855, that has a destination address down
a separate physical pathway from all of the outstanding requests in
that thread. All initiator agents may have a thread splitter unit
that splits requests from a given thread when requests in that set
of requests are routed down a separate physical pathway from
other requests in that thread. A burst request may be a set of word
requests that are linked together into a transaction having a
defined address sequence, defined pattern, and number of word
requests. The first write burst request 851 and the second write
burst request 853 have eight words in their request and a
destination address of channel 0. The third burst request 855 also
has eight words in its request but a destination address of channel
1 which is down a separate physical pathway from channel 0.
[0085] The flow control logic 857 associated with the thread
splitter unit that split the set of transactions in that given
thread issues the next third burst request 855 being routed down
the separate first physical pathway from other outstanding
requests, such as the first and second requests 851 and 853 in that
thread: 1) no earlier than one cycle after the amount of words in an
immediate previous request if the previous request was a burst
request or 2) no earlier than a sum of a total time of an amount of
anticipated time the immediate previous request will arrive at a
last thread merger unit prior to that previous request's target
address plus an amount of time to communicate the acknowledgement
notification back to the thread splitter. If the flow logic were
based only on the sum of the anticipated time for the immediately
previous request to arrive at the last thread merger unit plus the
amount of time/cycles to communicate an acknowledgement
notification of the previous request back to the thread splitter
unit, then the third request 855 could have issued 3
cycles earlier. Note, neither the response to the first burst
request 851 nor the response to the second burst request 853 needs
to be even generated let alone arrive in its entirety back at the
initiating core prior to the issuing of the third request 855.
[0086] FIGS. 7b and 7c illustrate additional example timelines of
embodiments of the flow control logic to split target request
traffic such as a 2D WRITE Burst and 2D READ Burst. Referring to
FIG. 5, in an embodiment, the acknowledgement mechanism generates
confirmation information from the last channel merge point at which
two links merge threads. This information confirms that the channel
requests from different links have been serialized. The
acknowledgement information is propagated back up the request
network to all the channel thread splitter units. If the channel
splitter and the last serialization point exist within the same
cycle boundary (i.e., there are no registers between them) then no
explicit acknowledgement signals are needed--the acceptance of a
transfer on the link between the channel splitter and channel
merger can also be used to indicate acknowledgement.
[0087] In an embodiment, the merger unit is configured structurally
to store the incoming branch/thread for a successful request that
has ack_req set. When an ack_req_return signal is set high, the
turnaround queue is `popped`, causing the corresponding
ack_req_return signal to be driven high on the correct
branch/thread. At the serialization merger for a given thread,
where the thread merging happens, the merger unit is configured
structurally to reflect the incoming ack_req signal back on the
ack_req_return signal on the incoming branch/thread that sent the
current request.
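The turnaround-queue behavior described above might look like the
following sketch; the class is an assumption for illustration, with
the reflect-versus-store distinction noted in comments:

    # Sketch of a merger unit's ack_req handling (names assumed).

    from collections import deque

    class MergerAckTurnaround:
        def __init__(self, is_serialization_point=False):
            self.is_serialization_point = is_serialization_point
            self.turnaround = deque()    # FIFO of branches owed a return

        def accept_request(self, branch, ack_req):
            if not ack_req:
                return None
            if self.is_serialization_point:
                # Serialization merger: reflect ack_req straight back on
                # the incoming branch/thread that sent the request.
                return branch
            self.turnaround.append(branch)   # remember who to answer later
            return None

        def on_ack_req_return(self):
            # ack_req_return set high: pop the queue and drive the return
            # signal on the recorded branch/thread.
            return self.turnaround.popleft() if self.turnaround else None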
[0088] The initiator agent generates m_ack_req signals. The signal
is driven low by default. The m_ack_req signal is driven high on
the first transfer of any split burst that leaves the initiator
agent that is going to a multi-channel target. Channel splitting
happens at a thread splitter in an embedded register point or in a
pipeline point, and is needed in the request path. Inside the
splitter, an acknowledge control unit (ACU) is added. The ACU
prevents requests from proceeding on a thread if the outgoing
splitter branch and/or thread changes from that of the previous
transfer and there are outstanding acknowledge signals. There is at
most one ACU for each (input) thread at the RS.
[0089] The m_ack_req signals travel in-band with a request
transfer. At some point the request transfer with the m_ack_req
will reach the serialization merger--this is the last point where
the connection merges with another connection on the same merger
(outgoing) thread. If the transfer wins arbitration at the merger,
the merger will extract the m_ack_req signal and return it back
upstream on the same request DL link path via the s_ack_req_return
signal. The s_ack_req_return signals are propagated upstream on the
request DL links. These signals do not encounter any backpressure
or have any flow control. Wherever there is a PP RS, the
s_ack_req_return signals will be registered. The s_ack_req_return
signals are used at each channel splitter ACU along the path. The
ACU keeps a count of outstanding acknowledgements. When
s_ack_req_return is set to one, the ACU will decrement its count of
outstanding acknowledgements. The s_ack_req_return propagates back
to the first channel split point in the request network. For the
example shown in FIG. 5, this first channel split point is at the
embedded register point RS just downstream to the initiator agent
component. However, the first channel split point in a request
acknowledgement network could also be at a PP RS component.
[0090] If the path leading into an RS that performs channel
splitting is thread collapsed, then the DL link is treated as
single threaded for the purposes of the acknowledgement
mechanism.
[0091] The architecture intentionally splits multi-channel paths in
the IA, or as early as possible along the path to the multiple
channel target agents. This approach avoids creating a centralized
point that could act as a bandwidth choke point, a routing
congestion point, and a cause of longer propagation path lengths
that would lower achievable frequency and increase switching power
consumption.
[0092] FIG. 7A illustrates an example timeline of an embodiment of
flow logic to split a 2D WRITE Burst request. In this example, the
number of words in the 2D WRITE Burst request is 4, N=3, M=2,
ChannelInterleaveSize=4<N+M. The WRITE Burst request 1081 is
shown over time. FIG. 7B also illustrates an example timeline of an
embodiment of flow logic to split a 2D WRITE Burst request 1083.
FIG. 7C illustrates an example timeline of an embodiment of flow
logic to split a 2D READ Burst 1085. The flow control logic, in
conjunction with the other features above, allows high throughput
and deep pipelining of transactions. As shown in FIG. 7A,
multiple transactions are being issued and serviced in parallel,
which increases the efficiency of each initiator in being able to
start having more transactions serviced in the same period of time.
Also, the utilization of the memory is greater because as seen in
the bubbles in FIG. 7A there are very few periods of idle time in
the system. The first four bubbles show the initial write burst
being issued. Next, two bubbles of inactivity occur. However, after
that, the next four bubbles of the next write burst are issued and
serviced by the system. The initiator and memory are working
on multiple transactions at the same time. The latency of the ACK
loop may limit the effective data bandwidth of the initiator. The
initiator has to wait for the first-row responses in order to
sequence them; there is no need to wait for the next rows. Channel
responses may become
available too early for the initiator to consume them. This will
create a back-pressure on this thread at the channels, forcing them
to service other threads. Initiators that send 2D bursts may have
dedicated threads, because of the way they can occupy their thread
on multiple channels for the duration of the 2D burst. Note: for 2D
WRITE bursts, because of the channel switching, the split WRITE
bursts will remain open until the original 2D burst is closed; that
is, while a splitter is sweeping all other branches before
switching back to a given branch, all the resources for that thread
of the branch remain idle (maybe minus N cycles). A similar
situation exists for 2D READ requests, on the response path.
[0093] As discussed, the interconnect for the integrated circuit is
configured to communicate transactions between the one or more
initiator Intellectual Property (IP) cores and the multiple target
IP cores coupled to the interconnect. Two or more memory channels
may make up a first aggregate target of the target IP cores. The
two or more memory channels may populate an address space assigned
to the first aggregate target and appear as a single target to the
initiator IP cores. The interconnect may be configured to implement
chopping logic to chop individual two-dimensional (2D) transactions
that cross the memory channel address boundaries from a first
memory channel to a second memory channel within the first
aggregate target into two or more 2D transactions with a height
value greater than one, as well as stride and width dimensions,
which are chopped to fit within memory channel address boundaries
of the first aggregate target. The flow control logic internal to
the interconnect may be configured to maintain ordering for
transactions routed to the first aggregate target IP core. The flow
control logic is configured to allow multiple transactions from the
same initiator IP core thread to be outstanding to multiple
channels of an aggregated target at the same time and the multiple
channels in the first aggregated target map to target IP cores
having physically different addresses. The transactions may include
one or more requests and one or more optional responses and the
transactions are part of a same thread from the same initiator IP
core.
Maintaining Response Path Order
[0094] FIG. 8 illustrates a block diagram of an embodiment of a
response path from two target agents back to two initiator agents
through two thread splitting units and two thread merger units. The
two target agents 1120, 1122 may each have one or more associated
thread splitting unit such as a first thread splitting unit 1141
for the first target agent 1120 and a second thread splitting unit
1143 for the second target agent 1122. The two target agents 1120,
1122 may each have one or more associated thread merging unit such
as a first thread merging unit 1145 for the first target agent 1120
and a second thread merging unit 1147 for the second target agent
1122. A target agent or memory scheduler may have FIFO response
flow buffers, such as a first response flow buffer 1149, which
cooperate with the merger units 1145, 1147 implementing a flow
control protocol to return responses within a same thread in the
order in which the corresponding requests were transmitted rather
than using re-order buffers.
[0095] The flow logic in the target agent and merger unit uses
first-in first-out inherent ordering to compare responses to other
responses in that same thread to ensure the next response is not
released from the target agent until all earlier responses from
that same thread have been transmitted back toward a thread merger
unit in the response path toward the initiator IP core issuing that
thread. The FIFO response flow buffers are filled on a per thread
basis. Alternatively, the turnaround state of the response buffers
may be distributed to other channels making up the aggregated
target or even just other targets on the chip to implement a
response flow order protocol.
[0096] The merger unit closest to the target/channel may determine
which physical branch pathway should be delivering the next
response, and routes a threadbusy from the correct branch back to
the target. The merger unit closest to the target agent or the
merger unit closest to the initiator IP core generating the thread
may assert this flow control protocol to backpressure all responses
from a particular thread from all physical pathways connected to
that thread merger unit except responses from the physical pathway
expected to send a next in order response for that thread. For
example, the first thread merger unit controls when responses come
from the first target agent 1120 and the second target agent 1122.
Logic, counters, and tables associated with the merger unit keep
track of which physical pathway, such as a link, should be
supplying the next response in sequential order for that thread and
stops responses from that thread from all other physical branches
until that next response in sequential order for that thread is
received on the active/current physical pathway.
[0097] The flow control logic maintains the expected execution
order of the responses within a given thread by referencing the
ordered history of which physical path requests were routed to,
from the maintained order history of the request queue, and the
expected execution order of the responses corresponding to those
requests; it allows only the target agent on the physical branch
from which the next expected in-order response is to come to send
responses for that thread to the merger unit, and blocks responses
from that thread from the other physical branches. The flow logic
in a merger unit establishes a local order with respect to issued
requests and thus the expected order of responses sent down those
separate physical pathways.
[0098] The flow control mechanism asserts response flow control on
a per thread basis and the flow control mechanism blocks with
respect to other out-of-order responses within a given thread and
is non-blocking with respect to responses from any other thread.
The flow control mechanism and associated circuitry maintain the
expected execution order of the responses from within a given
thread by 1) referencing an ordered history of which physical path
requests in that thread were routed to, 2) an expected execution
order of the responses corresponding to those requests, and 3)
allowing the target agent to send responses for that given thread
to the thread merger unit only from the physical pathway where a
next expected in-order response is to come from, while blocking
responses from that given thread from the other physical pathways.
[0099] The thread splitter and merger units in combination with
buffers in the memory controller eliminate the need for dedicated
reorder buffers and allow non-blocking flow control so that
multiple transactions may be serviced in parallel rather than
merely in series.
[0100] FIG. 9 shows the internal structure of an example
interconnect maintaining the request order within a thread and the
expected response order to those requests. The interconnect
includes three initiator agents 1331, 1333, and 1335 and three
target agents, where target agent0 1343 and target agent1 1339 are
target agents that belong to a multi-channel target, DRAM. Only one
multi-channel aggregate target 1337 exists in this example.
[0101] On the request network, for initiator agent0 1331, the
multi-channel path going to the multi-channel target DRAM splits at
initiator agent0's 1331 embedded, request-side thread splitter
unit, Req_rs10. Since there are two channels, the two outgoing
single-threaded (ST) DL links 1362, 1364 each goes to a different
channel target. The third outgoing ST DL link 1366 is a normal path
leading to a normal individual target agent TA2 1341. A
request-side channel splitter 1368b is embedded in the initiator
agent 1331. For the channel target agent0 1343, the merger splitter
unit component, tat00_ms0 1368a, upstream to target agent0 1343
acts as a channel merger and regulates channel traffic coming from
two different initiator agents, initiator agent0 1331 and initiator
agent1 1333.
[0102] On the response network, for target agent1 1339, the
embedded RS component, Resp_rs01, acts as a response channel
splitter--it has three outgoing links 1371, 1373, 1375 for
delivering channel responses back to initiator agent0 1331, normal
responses back to the normal initiator agent2 1335, and channel
responses back to initiator agent1 1333, respectively. For
initiator agent1 1333, its upstream merger splitter unit component,
lah11_ms0, is a channel merger, which not only regulates responses
coming back from channel 0 (i.e., target agent0) and channel 1
(i.e., target agent1) in the aggregate target 1337, but also
handles responses returned by the normal target agent2 1341. The
response-side channel merger 1381 receives responses from target
agent0 1343, target agent1 1339, and target agent2 1341.
[0103] Since a response-side channel merger unit needs to regulate
channel responses but it may not have enough information to act
upon, additional re-ordering information can be passed to the
merger unit from the request-side channel splitter of the initiator
agent. For instance, the DRL link 1391 is used to pass response
re-ordering information between the request-side channel thread
splitter unit, Req_rs11, and the response-side channel thread
merger unit, lah11_ms0, for initiator agent1 1333.
[0104] Target agent TA0 1343 is assigned to channel 0 and target
agent TA1 1339 is assigned to channel 1 for the multi-channel
target DRAM. Connectivity between initiators and individual targets
of the multi-channel target DRAM is done via connectivity
statements that specify the initiator agent (connected to an
initiator) and the specific target agent (connected to an
individual target of the multi-channel target DRAM) as shown in the
example.
[0105] Also disclosed are two multi-channel address regions:
SMS_reg and USB_mem. The specification of the SMS_reg region can be
explained as follows: The size of this region is 0x1000 bytes.
Having a channel_interleave_size of 8 means that each interleave is
of size 0x100 (2.sup.8) bytes. This results in 16 non-overlapping
memory interleave segments (region size 0x1000/interleave size
0x100=16).
As discussed, each interleave is assigned to a channel using the
"channel round" idea. In this case there are 2 channels so
interleaves 0, 2, 4, 6, 8, 10, 12, 14 are assigned to channel 0
(target agent TA0) and interleaves 1, 3, 5, 7, 9, 11, 13, 15 are
assigned to channel 1 (target agent TA1). Note that if an initiator
agent connects to one individual target agent in a multi-channel
target, this initiator agent should connect to all individual
target agents in the multi-channel target. That is, as indicated in
FIG. 9, the connection between IA2 and TA1 is NOT ALLOWED unless
IA2 is also connected to TA0 at the same time.
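The channel-round assignment just described can be checked with a
short worked sketch; the constants restate the SMS_reg example
above, and the code itself is illustrative:

    # Worked sketch of the SMS_reg example: a 0x1000-byte region with
    # 0x100-byte interleaves assigned round-robin across 2 channels.

    REGION_SIZE = 0x1000
    INTERLEAVE_SIZE = 1 << 8        # channel_interleave_size of 8 -> 0x100
    NUM_CHANNELS = 2

    segments = REGION_SIZE // INTERLEAVE_SIZE              # 16 segments
    assignment = {s: s % NUM_CHANNELS for s in range(segments)}

    # Channel 0 (TA0) gets segments 0, 2, ..., 14; channel 1 (TA1) gets
    # segments 1, 3, ..., 15, matching the "channel round" assignment.
    assert [s for s, c in assignment.items() if c == 0] == list(range(0, 16, 2))
    assert [s for s, c in assignment.items() if c == 1] == list(range(1, 16, 2))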
[0106] In an embodiment, in the response path ordering, the
interconnect maintains OCP thread order, and has a mechanism to
re-order responses in the response path. This is achieved by
passing information from a request path channel splitter RS
component to the corresponding response path channel merger MS
component. The information is passed via a turnaround queue, which
maintains FIFO order. The information passed over tells the thread
merger splitter unit component which incoming branch/thread the
next response burst should come from. The thread merger splitter
unit component applies backpressure to all branches/threads that
map to the same outgoing thread, except for the one indicated by
the turnaround queue. When the burst completes, then the turnaround
queue entry is popped. This mechanism ensures that all responses
are returned in the correct order.
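A minimal sketch, assuming the turnaround queue behavior described
above (class and method names are invented for illustration), of
how the response-path merger gates its incoming branches:

    # Sketch of the response-path merger: only the branch indicated at
    # the head of the turnaround queue may return responses; all other
    # branches mapping to the same outgoing thread see backpressure.

    from collections import deque

    class ResponseChannelMerger:
        def __init__(self):
            self.turnaround = deque()   # filled by the request-path splitter

        def expect(self, branch):
            self.turnaround.append(branch)

        def allow_response(self, branch):
            """Backpressure every branch except the next expected one."""
            return bool(self.turnaround) and self.turnaround[0] == branch

        def burst_complete(self):
            self.turnaround.popleft()   # pop when the response burst ends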
Chopping Individual Transactions that Cross Channel Boundaries
Headed for Channels in an Aggregate Target
[0107] FIG. 10 illustrates a diagram of an embodiment of chopping
logic to directly support chopping individual transactions that
cross the channel address boundaries into two or more
transactions/requests from the same thread, which makes the
software and hardware that generates such traffic less dependent on
the specific multiple channel configuration of a given SoC.
[0108] The interconnect implements chopping logic 1584 to chop
individual burst requests that cross the memory channel address
boundaries from a first memory channel 1520 to a second memory
channel 1522 within the first aggregate target into two or more
burst requests from the same thread. The chopping logic 1584
cooperates with a detector 1585 to detect when the starting address
of an initial word of requested bytes in the burst request 1548 and
ending address of the last word of requested bytes in the burst
request 1548 causes the requested bytes in that burst request 1548
to span across one or more channel address boundaries to fulfill
all of the word requests in the burst request 1548. The chopping
logic 1584 includes a channel chopping algorithm and one or more
tables 1586 to track thread ordering in each burst request 1548
issued by an IP initiator core, to maintain a global target
ordering among chopped-up portions of the burst request 1548 that
are spread over the individual memory channels 1520 and 1522.
Either in a distributed implementation with each initiator agent in
the system, or in a centralized memory scheduler 1587, the system
may have a detector 1585, chopping logic 1584, some buffers 1587, a
state machine 1588, and counters 1587 to facilitate the chopping
process as well as to ensure that the sequential order within the
original chopped transaction is maintained.
[0109] The chopping logic supports transaction splitting across
channels in an aggregate target. The chopping logic 1584 chops a
burst when an initiator burst stays within a single region but
spans a channel boundary. The chopping logic may be embedded in an
initiator agent at the interface between the interconnect and a
first initiator core. The chopping logic chops, an initial burst
request spanning across one or more memory channel address
boundaries to fulfill all of the word requests in the burst
request, into two or more burst requests of a same height dimension
for each memory channel. As shown in FIG. 12a the chopping
algorithm in the flow control logic 1657 chops a series of requests
in the burst request so that a starting address of an initial
request in the series has a same offset from a channel boundary in
a first memory channel as a starting address of the next request
starting in the series of requests in the burst request in a
neighboring row in the first memory channel as shown in FIG. 12b.
Also, if the burst request vertically crosses into another memory
channel, then the chopping algorithm chops a transaction series of
requests in the burst request so that a starting address of an
initial request has a same offset from a channel boundary in a
first DRAM page of a first memory channel as a starting address of
the next request starting the sequence of series of requests in the
burst request of a second DRAM page of the first memory channel as
shown in FIG. 12c.
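For the simple case of a one-row incrementing burst, the boundary
chop described above might look like the following sketch; the
function name and parameters are assumptions for illustration:

    # Sketch: chop a burst that crosses a channel interleave boundary
    # into requests that each fit within a single channel (assumes
    # word-aligned addresses and an incrementing address sequence).

    def chop_at_channel_boundary(start_addr, num_words, word_bytes,
                                 channel_interleave_size):
        """Return a list of (start, num_words) chunks split at boundaries."""
        chunks, addr, remaining = [], start_addr, num_words
        while remaining > 0:
            # Words left before the next channel interleave boundary.
            to_boundary = (channel_interleave_size
                           - (addr % channel_interleave_size)) // word_bytes
            n = min(remaining, max(to_boundary, 1))
            chunks.append((addr, n))
            addr += n * word_bytes
            remaining -= n
        return chunks

    # e.g., an 8-word burst of 4-byte words starting at 0x0F0 with a
    # 0x100 interleave chops into (0x0F0, 4 words) and (0x100, 4 words).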
[0110] The detector 1585 in detecting 2D block type burst requests
also detects whether the initial word of the 2D burst request
starts in a higher address numbered memory channel than memory
channels servicing subsequent requests in that 2D burst request
from the chopped transaction. If the detector detects that the
initial words in a first row of the 2D block burst that crosses a
memory channel boundary start in a higher address numbered memory
channel than subsequent requests to be serviced in a lower address
numbered memory channel, then the state machine chops this first
row into multiple bursts capable of being serviced independent of
each other. The request, containing the initial words in a first
row of the 2D block burst request, which is headed to the higher
address numbered memory channel must be acknowledged as being
received at a last thread merger unit prior to the intended higher
address numbered memory channel before the chopping logic allows
the second burst, containing the remainder of the first row, to be
routed to the lower address numbered memory channel.
[0111] A state machine 1588 in the chopping logic chops a
transaction based upon the type of burst request crossing the
memory channel address boundary. The detector 1585 detects the type
of burst. The detector detects for a request containing burst
information that communicates one or more read requests in a burst
from an initiator Intellectual Property (IP) core that are going to
related addresses in a single target IP core. A burst type
communicates the address sequence of the requested data within the
target IP core. The state machine 1588 may perform the actual
chopping of the individual transactions that cross the initial
channel address boundary into two or more transactions/requests
from the same thread and put chopped portions into the buffers
1587. The detector 1585 may then check whether the remaining words
in the burst request cross another channel address boundary. The
state machine will chop the transaction until the resulting
transaction fits within a single channel's address boundary. The
state machine 1588 may factor into the chop of a transaction: 1) the
type of burst request, 2) the starting address of initial word in
the series of requests in the burst request, 3) the burst length
indicating the number of words in the series of requests in the
burst request, and 4) word length involved in crossing the channel
address boundary. The word length and number of words in the burst
request may be used to calculate the ending address of the last
word in the original burst request. The design allows the
traffic-generating elements to have both their request and response
traffic cross such channel address boundaries.
[0112] In an embodiment, a burst length may communicate that
multiple read requests in this burst are coming from this same
initiator IP core and are going to related addresses in a single
target IP core. A burst type may indicate that the request is for a
series of incrementing addresses or non-incrementing addresses but
a related pattern of addresses such as a block transaction. The
burst sequence may be for non-trivial 2-dimensional block, wrap,
XOR or similar burst sequences. If the block transaction is for
two-dimensional data then the request also contains annotations
indicating 1) a width of the two-dimensional object that the
two-dimensional object will occupy measured in the length of the
row (such as a width of a raster line), 2) a height of the
two-dimensional object measured in the number of rows the
two-dimensional object will occupy, and 3) a stride of the
two-dimensional object that the two-dimensional object will occupy
that is measured in the address spacing between two consecutive
rows. Address spacing between two consecutive rows can be 1) a
length difference between the starting addresses of two consecutive
rows occupied by the target data, 2) a difference between the end
of a previous row and the beginning of the next row, or 3) similar
spacing. The single 2D block burst request may fully describe the
attributes of a two-dimensional data block across the Interconnect,
allowing a target to decode the single request.
[0113] A request generated for a block transaction may include
annotations indicating that an N number of read requests in this
burst are going to related addresses in a single target, a length
of a row occupied by a target data, a number of rows occupied by
the target data, and a length difference between starting addresses
of two consecutive rows occupied by the target data.
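The block-transaction annotations described above can be pictured
as a small record; the field names here are illustrative
assumptions, not the actual OCP signal names:

    # Illustrative sketch of the annotations a 2D block burst carries:
    # width (row length), height (row count), and stride (row spacing).

    from dataclasses import dataclass

    @dataclass
    class Block2DBurst:
        start_addr: int   # address of the first word of the first row
        width: int        # length of each row (e.g., a raster line), in words
        height: int       # number of rows the 2D object occupies
        stride: int       # address spacing between two consecutive rows

        def row_start(self, row: int) -> int:
            """Starting address of a given row of the 2D block."""
            return self.start_addr + row * self.stride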
Chopping Individual Transactions that Cross Channel Boundaries
Headed for Channels in an Aggregate Target so that Two or More of
the Chopped Portions Retain their 2D Burst Attributes
[0114] FIGS. 12a-12e illustrate five types of channel based
chopping for block burst requests: normal block chopping, block row
chopping, block height chopping, block deadlock chopping, and block
deadlock chopping and then block height chopping. The state machine
may be configured to implement channel based chopping rules as
follows:
[0115] For unknown pattern types of burst requests, the chopping
logic breaks the single initiator burst into a sequence of single
initiator word transfers with the same sequence code (chop to
initiator singles).
[0116] For detected types of bursts such as streaming, incrementing
address, XOR and wrap burst, the chop fits them within a single
channel. Streaming bursts, by definition, are always within a
single channel. An incrementing burst request is for a series of
incrementing addresses; XOR bursts are for non-incrementing
addresses with a related pattern of addresses that cross a channel
boundary. The state machine breaks the single initiator burst into
a sequence of two or more separate burst requests--each with a
burst length reduced to fit within each individual channel of an
aggregate target (chop to channels). Moreover, for any XOR bursts
crossing a channel boundary, the resulting channel bursts have a
burst byte length that is equal to 2 times
2.sup.channel_interleave_size bytes; and the second burst starts at
MAddr+/-2.sup.channel_interleave_size. For
WRAP bursts that cross a channel boundary, the state machine breaks
the single initiator burst into a sequence of single initiator word
transfers (chop to initiator singles). Normally interleave_size is
selected to be larger than the cache lines whose movement is the
dominant source of WRAP bursts. So channel crossing WRAPs will
usually not occur; and the chopping logic chops up a WRAP burst
into two INCR bursts when the WRAP burst crosses a channel boundary.
[0117] For any initiator 2-Dimensional block burst to a target that
is not capable of supporting the block burst, but the target does
support INCR bursts, the state machine performs block row chopping.
Block row chopping breaks the initiator burst into a sequence of
INCR bursts, one for each row in the block burst. If the row(s)
crosses a channel boundary, each row is broken into a sequence of 2
INCR bursts, one to each channel. Each such INCR burst may further
be chopped into smaller INCR bursts if the target has
user-controlled burst chopping and does not have sufficiently large
chop_length or the target supports a shorter OCP MBurstLength.
[0118] The chopping logic prevents a deadlock situation when each
smaller burst/portion of the transaction has requests that need to
be serviced by their own channel and these requests should be
serviced from each channel in a ping-pong fashion by making sure
that the a burst request headed to a lower address numbered memory
channel is serviced initially and then a burst request in the
second portion may be serviced by a higher address numbered memory
channel. If the initiator block row(s) crosses a channel boundary
and the burst starts in a higher address numbered memory channel
than memory channels servicing subsequent requests in that burst,
then block deadlock chopping creates 4 target bursts as shown in
FIG. 12d. The first of the 4 chopped bursts (resulting from the
deadlock block chopping) is a single row block with chopped length
for the highest-number channel. It corresponds to the leading part
of the first row of the initiator block burst that falls into the
highest-numbered channel. The last of the 4 chopped bursts
(resulting from the deadlock block chopping) is a single row block
with chopped length for the first channel (channel 0). It
corresponds to the trailing part of the last row of the initiator
block burst that falls into channel 0. The first and last single
row block bursts are separated by an even number of block bursts
each containing a series of rows that alternately fall into
channel 0 and then the highest-numbered channel, ch 3. Each pair of
such channel block bursts has a new and the largest
possible/affordable MBurstHeight that is a power of two. The 4
target bursts may have a new MBurstStride equal to the
initiator-supplied MBurstStride divided by num_active_channels.
[0119] Whenever normal block chopping or block deadlock chopping is
applied to a block Write burst or a block Multiple Request Multiple
response Data (MRMD) Read burst that is not translated to Single
Request Multiple response Data (SRMD) (MRMD Read to SRMD Read
translation is disabled for the given target), the initiator agent
sends the two resulting channel block bursts as a single atomic
sequence, called an interleaved block burst. The reason is to
prevent downstream
mergers from interleaving in other traffic from other initiators
while an upstream splitter switches among alternative rows of the
two-channel block bursts. i.e., the splitter has to lock
arbitration (using m_lockarb) on both of its outgoing
branches/threads until all rows are processed and then release the
lock on both branches/threads. In the alternative, the m_lockarb
action at the splitter may be the following: (a) the initiator
agent should set the m_lockarb properly among alternative rows to
prevent downstream mergers from interleaving in other traffic
before these alternative rows reach the first channel splitter
RS (only 1 channel crossing). At the channel splitter, the
m_lockarb needs to be set for the first block burst's last row.
[0121] In the interconnect, 2D block bursts are sent as Single
Request Multiple response Data bursts whenever possible (i.e., MRMD
to SRMD conversion of RD bursts is not disabled). Burst length
conversion for block channel bursts (post channel burst chopping)
is performed similar to INCR bursts. For example, for
wide-to-narrow conversion, burst length is multiplied by the ratio
of target to initiator data widths; for narrow-to-wide conversions,
initiator agent pads each row at start and end to align it to the
target data width, and the resulting initiator burst (row) length
is divided to get the target burst length.
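As a hedged sketch of the burst length conversion just described
(integer width ratios assumed; the padding and alignment steps are
omitted, and the function name is illustrative):

    # Sketch of burst length conversion between data widths, as for
    # INCR bursts; assumes the widths divide evenly.

    def convert_burst_length(initiator_len, initiator_width, target_width):
        if target_width <= initiator_width:
            # Wide-to-narrow: scale the length up by the width ratio.
            return initiator_len * (initiator_width // target_width)
        # Narrow-to-wide: each (padded, aligned) row length is divided
        # to get the target burst length.
        return initiator_len // (target_width // initiator_width)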
[0122] As shown in FIG. 12e, a round of block height chopping is
applied to the second and third of the 4 chopped bursts resulting
from the original block deadlock chopping.
[0123] In an embodiment, when the chopping logic chops a request
into two, the chopping logic maintains the width of the word
request being chopped by figuring out the number of bits in the
first portion of the chopped word request being serviced by a first
channel and subtracting that number of bits from the width of a
word to determine the width of the second portion of the chopped
word request being serviced by a next channel. See FIG. 3 and
chopped request d. The second portion of the chopped word request
being serviced by a second channel has a starting address of a
first row of the next channel. Also, each portion of a chopped
burst request may be chopped so that a start address for requested
bytes of an initial request in the series of requests in each
portion has the same relative position within a channel (the same
relative column offset from the channel boundary) as the other words
in the column. See FIG. 12a and the aligned portions in Channel 0.
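By way of a non-limiting example, the width bookkeeping described
above can be sketched in Python as follows; the names are invented
for illustration.

    def chop_word(word_width_bits, bits_in_first_channel, next_channel_base):
        """Split one word request across a channel boundary."""
        first_width = bits_in_first_channel
        # remainder of the word is served by the next channel
        second_width = word_width_bits - bits_in_first_channel
        # the second portion starts at the first row of the next channel
        second_start_addr = next_channel_base
        return first_width, second_width, second_start_addr

    # e.g., a 64-bit word with 24 bits falling in the first channel
    # leaves a 40-bit second portion for the next channel:
    # chop_word(64, 24, 0x1000_0000) -> (24, 40, 0x1000_0000)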
[0124] A DL link payload signal p_split_info may be used to notify
the splitter. The p_split_info field is zero for non-INT_block
bursts. For INT_block bursts, p_split_info identifies the downstream
splitter where the INT_block burst will split into two. The channel
splitter whose channel_splitter_id matches p_split_info will split
the INT_block burst and reset to 0 any m_lockarb=1 in that atomic
sequence that is accompanied by a p_burstlast=1.
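By way of a non-limiting example, the splitter-side rule can be
sketched in Python as follows; the signal names (p_split_info,
channel_splitter_id, m_lockarb, p_burstlast) follow the text, while
the transfer object and the helper functions are invented stand-ins.

    def forward(xfer):
        pass  # placeholder: pass the transfer through unchanged

    def split_int_block(xfer):
        pass  # placeholder: perform the two-way split described above

    def on_transfer(splitter, xfer):
        if xfer.p_split_info == 0:
            forward(xfer)            # not an INT_block burst: pass through
        elif xfer.p_split_info == splitter.channel_splitter_id:
            split_int_block(xfer)    # this splitter performs the split
            if xfer.m_lockarb == 1 and xfer.p_burstlast == 1:
                xfer.m_lockarb = 0   # reset the lock on the burst-last
                                     # transfer, ending the atomic sequence
        else:
            forward(xfer)            # matching splitter is further downstream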
Higher Performance Access Protection
[0125] The chopping logic in the interconnect may also employ a new,
higher-performance architecture for access protection mechanism
(PM) checking. The architecture is a dual look-up architecture.
Each request burst issued from the target agent is first qualified
by the PM using two look-ups performed in parallel. The first
look-up is based upon the starting address for the burst. The second
look-up is based upon the calculated ending address for the burst.
Qualification of the access as permitted requires all the conditions
currently required in SMX associated with the first look-up, plus
one new condition. The new condition is that the first
and second look-ups must hit the same protection region. This
disqualifies bursts that cross a protection region boundary, even
if the proper permissions are set in both the starting and the
ending regions. It is expected and required that a single
protection region covers data sets accessed by bursts.
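By way of a non-limiting example, the dual look-up check can be
sketched in Python as follows; region_of() and permits() are invented
stand-ins for the PM's internal protection tables.

    def access_permitted(pm, start_addr, end_addr, request):
        first = pm.region_of(start_addr)   # look-up 1: starting address
        second = pm.region_of(end_addr)    # look-up 2: calculated ending address
        if first is None or second is None:
            return False                   # an address missed every region
        if first is not second:
            # the burst crosses a protection region boundary: disqualified
            # even if both regions would grant the proper permissions
            return False
        return pm.permits(first, request)  # the existing SMX-style checks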
[0126] The second look-up is only performed for INCR bursts at
targets with burst_aligned=0, and for block bursts. For WRAP, XOR,
STRM, and burst-aligned INCR bursts, success of the second look-up
is guaranteed (by the aligned nature of the bursts, the range of
lengths supported, and the minimum granularity of protection region
sizes). UNKN and DFLT2 transactions are still only handled as
single-word transfers at protected target agents, so the second
look-up for these is also assured.
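By way of a non-limiting example, the conditions above reduce to the
following Python predicate; the burst codes and the burst_aligned
flag follow the text, while the string encoding of the burst type is
invented for this sketch.

    def needs_second_lookup(burst_type, target_burst_aligned):
        if burst_type == "BLCK":                       # 2D block bursts
            return True
        if burst_type == "INCR" and not target_burst_aligned:
            return True
        # WRAP, XOR, STRM, and burst-aligned INCR: success is guaranteed
        # by the bursts' alignment, the supported lengths, and the minimum
        # protection-region granularity. UNKN and DFLT2 are handled as
        # single-word transfers at protected target agents, so they are
        # assured as well.
        return False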
[0127] FIG. 13 illustrates a flow diagram of an embodiment of an
example of a process for generating a device, such as a System on a
Chip, with the designs and concepts discussed above for the
Interconnect. The example process for generating a device from
designs of the Interconnect may utilize an electronic circuit
design generator, such as a System on a Chip compiler, to form part
of an Electronic Design Automation (EDA) toolset. Hardware logic,
coded software, or a combination of both may be used to implement
the following design process steps using an embodiment of the EDA
toolset. The EDA toolset may be a single tool or a compilation
of two or more discrete tools. The information representing the
apparatuses and/or methods for the circuitry in the Interconnect,
etc., may be contained in an Instance such as in a cell library, soft
instructions in an electronic circuit design generator, or similar
machine-readable storage medium storing this information. The
information representing the apparatuses and/or methods stored on
the machine-readable storage medium may be used in the process of
creating the apparatuses, or representations of the apparatuses
such as simulations and lithographic masks, and/or methods
described herein.
[0128] Aspects of the above design may be part of a software
library containing a set of designs for components making up the
Interconnect and associated parts. The library cells are developed
in accordance with industry standards. The library of files
containing design elements may be a stand-alone program by itself
as well as part of the EDA toolset.
[0129] The EDA toolset may be used for making a highly
configurable, scalable System-On-a-Chip (SOC) inter block
communication system that integrally manages input and output data,
control, debug and test flows, as well as other functions. In an
embodiment, an example EDA toolset may comprise the following: a
graphic user interface; a common set of processing elements; and a
library of files containing design elements such as circuits,
control logic, and cell arrays that define the EDA tool set. The
EDA toolset may be one or more software programs comprising
multiple algorithms and designs for the purpose of generating a
circuit design, testing the design, and/or placing the layout of
the design in a space available on a target chip. The EDA toolset
may include object code in a set of executable software programs.
The set of application-specific algorithms and interfaces of the
EDA toolset may be used by system integrated circuit (IC)
integrators to rapidly create an individual IP core or an entire
System of IP cores for a specific application. The EDA toolset
provides timing diagrams, power and area aspects of each component
and simulates with models coded to represent the components in
order to run actual operation and configuration simulations. The
EDA toolset may generate a Netlist and a layout targeted to fit in
the space available on a target chip. The EDA toolset may also
store the data representing the interconnect and logic circuitry on
a machine-readable storage medium.
[0130] Generally, the EDA toolset is used in two major stages of
SOC design: front-end processing and back-end programming.
[0131] Front-end processing includes the design and architecture
stages, which include design of the SOC schematic. The front-end
processing may include connecting models, configuration of the
design, simulating, testing, and tuning of the design during the
architectural exploration. The design is typically simulated and
tested. Front-end processing traditionally includes simulation of
the circuits within the SOC and verification that they work
correctly. The tested and verified components then may be stored as
part of a stand-alone library or part of the IP blocks on a chip.
The front-end views support documentation, simulation, debugging,
and testing.
[0132] In block 2005, the EDA tool set may receive a user-supplied
text file having data describing configuration parameters and a
design for at least part of an individual IP block having multiple
levels of hierarchy. The data may include one or more configuration
parameters for that IP block. The IP block description may be an
overall functionality of that IP block such as an Interconnect. The
configuration parameters for the Interconnect IP block may include
the number of address regions in the system, the system addresses,
how data will be routed based on the system addresses, etc.
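By way of a non-limiting example, such user-supplied configuration
data might carry parameters of the following kind, here sketched as a
Python structure; every name and value is invented for illustration
only.

    # Hypothetical configuration parameters for an Interconnect IP block.
    interconnect_config = {
        "num_address_regions": 2,
        "regions": [
            {"base": 0x0000_0000, "size": 0x1000_0000, "route_to": "DRAM_ch0"},
            {"base": 0x1000_0000, "size": 0x1000_0000, "route_to": "DRAM_ch1"},
        ],
    }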
[0133] The EDA tool set receives user-supplied implementation
technology parameters such as the manufacturing process to
implement component level fabrication of that IP block, an
estimation of the size occupied by a cell in that technology, an
operating voltage of the component level logic implemented in that
technology, an average gate delay for standard cells in that
technology, etc. The technology parameters describe an abstraction
of the intended implementation technology. The user-supplied
technology parameters may be a textual description or merely a
value submitted in response to a known range of possibilities.
[0134] The EDA tool set may partition the IP block design by
creating an abstract executable representation for each IP sub
component making up the IP block design. The abstract executable
representation models timing, area, and power (TAP) characteristics
for each IP sub component
and mimics characteristics similar to those of the actual IP block
design. A model may focus on one or more behavioral characteristics
of that IP block. The EDA tool set executes models of parts or all
of the IP block design. The EDA tool set summarizes and reports the
results of the modeled behavioral characteristics of that IP block.
The EDA tool set also may analyze an application's performance and
allow the user to supply a new configuration of the IP block
design or a functional description with new technology parameters.
After the user is satisfied with the performance results of one of
the iterations of the supplied configuration of the IP design
parameters and the technology parameters, the user may settle on
the eventual IP core design with its associated technology
parameters.
[0135] The EDA tool set integrates the results from the abstract
executable representations with potentially additional information
to generate the synthesis scripts for the IP block. The EDA tool
set may supply the synthesis scripts to establish various
performance and area goals for the IP block after the results of the
overall performance and area estimates are presented to the
user.
[0136] The EDA tool set may also generate an RTL file of that IP
block design for logic synthesis based on the user supplied
configuration parameters and implementation technology parameters.
As discussed, the RTL file may be a high-level hardware description
describing electronic circuits with a collection of registers,
Boolean equations, control logic such as "if-then-else" statements,
and complex event sequences.
[0137] In block 2010, a separate design path in an ASIC or SOC chip
design is called the integration stage. The integration of the
system of IP blocks may occur in parallel with the generation of
the RTL file of the IP block and synthesis scripts for that IP
block.
[0138] The EDA toolset may provide designs of circuits and logic
gates to simulate and verify that the design operates correctly.
The system designer codes the system of IP blocks to
work together. The EDA tool set generates simulations of
representations of the circuits described above that can be
functionally tested, timing tested, debugged and validated. The EDA
tool set simulates the system of IP block's behavior. The system
designer verifies and debugs the system of IP blocks' behavior. The
EDA tool set packages the IP core. A machine-readable storage
medium may also store instructions for a test generation program to
generate instructions for an external tester and the interconnect
to run the test sequences for the tests described herein. One of
ordinary skill in the art of electronic design automation knows
that a design engineer creates and uses different representations
to help generate tangible, useful information and/or results. Many
of these representations can be high-level (abstracted, with fewer
details) or top-down views, and can be used to help optimize an
electronic design starting from the system level. In addition, a
design process usually can be divided into phases, and at the end of
each phase a representation tailor-made to that phase is usually
generated as output and used as input by the next phase. Skilled
engineers can make use of these representations and apply heuristic
algorithms to improve the quality of the final results coming out
of the final phase. These representations allow the electronic
design automation world to design circuits, test and verify
circuits, derive lithographic masks from Netlists of circuits, and
produce other similar useful results.
[0139] In block 2015, system integration may next occur in the
integrated circuit design process. Back-end programming generally
includes programming of the physical layout of the SOC such as
placing and routing, or floor planning, of the circuit elements on
the chip layout, as well as the routing of all metal lines between
components. The back-end files, such as a layout, physical Library
Exchange Format (LEF), etc. are generated for layout and
fabrication.
[0140] The generated device layout may be integrated with the rest
of the layout for the chip. A logic synthesis tool receives
synthesis scripts for the IP core and the RTL design file of the IP
cores. The logic synthesis tool also receives characteristics of
logic gates used in the design from a cell library. RTL code may be
generated to instantiate the SOC containing the system of IP
blocks. The system of IP blocks with the fixed RTL and synthesis
scripts may be simulated and verified. Synthesizing of the design
with Register Transfer Level (RTL) may occur. The logic synthesis
tool synthesizes the RTL design to create a gate level Netlist
circuit design (i.e. a description of the individual transistors
and logic gates making up all of the IP sub component blocks). The
design may be outputted into a Netlist of one or more hardware
design languages (HDL) such as Verilog, VHDL (Very-High-Speed
Integrated Circuit Hardware Description Language) or SPICE
(Simulation Program for Integrated Circuit Emphasis). A Netlist can
also describe the connectivity of an electronic design such as the
components included in the design, the attributes of each component
and the interconnectivity amongst the components. The EDA tool set
facilitates floor planning of components including adding of
constraints for component placement in the space available on the
chip such as XY coordinates on the chip, and routes metal
connections for those components. The EDA tool set provides the
information for lithographic masks to be generated from this
representation of the IP core to transfer the circuit design onto a
chip during manufacture, or other similar useful derivations of the
circuits described above. Accordingly, back-end programming may
further include the physical verification of the layout to verify
that it is physically manufacturable and the resulting SOC will not
have any function-preventing physical defects.
[0141] In block 2020, a fabrication facility may fabricate one or
more chips with the signal generation circuit utilizing the
lithographic masks generated from the EDA tool set's circuit design
and layout. Fabrication facilities may use a standard CMOS logic
process having minimum line widths such as 1.0 um, 0.50 um, 0.35
um, 0.25 um, 0.18 um, 0.13 um, 0.10 um, 90 nm, 65 nm or less, to
fabricate the chips. The size of the CMOS logic process employed
typically defines the smallest minimum lithographic dimension that
can be fabricated on the chip using the lithographic masks, which
in turn, determines minimum component size. According to one
embodiment, light including X-rays and extreme ultraviolet
radiation may pass through these lithographic masks onto the chip
to transfer the circuit design and layout for the test circuit onto
the chip itself.
[0142] The EDA toolset may have configuration dialog plug-ins for
the graphical user interface. The EDA toolset may have an RTL
generator plug-in for the SocComp. The EDA toolset may have a
SystemC generator plug-in for the SocComp. The EDA toolset may
perform unit-level verification on components that can be included
in RTL simulation. The EDA toolset may have a test validation
testbench generator. The EDA toolset may have a disassembler for
virtual and hardware debug port trace files. The EDA toolset may be
compliant with open core protocol standards. The EDA toolset may
have Transactor models, Bundle protocol checkers, OCPDis2 to
display socket activity, OCPPerf2 to analyze performance of a
bundle, as well as other similar programs.
[0143] As discussed, an EDA tool set may be implemented in software
as a set of data and instructions, such as an Instance in a
software library callable by other programs, or an EDA tool set
consisting of an executable program with the software cell library
in one program, stored on a machine-readable medium. A
machine-readable storage medium may include any mechanism that
provides (e.g., stores and/or transmits) information in a form
readable by a machine (e.g., a computer). For example, a
machine-readable medium may include, but is not limited to: read
only memory (ROM); random access memory (RAM); magnetic disk
storage media; optical storage media; flash memory devices; DVDs;
EPROMs; EEPROMs; FLASH memory; magnetic or optical cards; or any other
type of media suitable for storing electronic instructions. The
instructions and operations also may be practiced in distributed
computing environments where the machine-readable media is stored
on and/or executed by more than one computer system. In addition,
the information transferred between computer systems may either be
pulled or pushed across the communication media connecting the
computer systems.
[0144] Some portions of the detailed descriptions above are
presented in terms of algorithms and symbolic representations of
operations on data bits within a computer memory. These algorithmic
descriptions and representations are the means used by those
skilled in the data processing arts to most effectively convey the
substance of their work to others skilled in the art. An algorithm
is here, and generally, conceived to be a self-consistent sequence
of operations leading to a desired result. The operations are those
requiring physical manipulations of physical quantities. Usually,
though not necessarily, these quantities take the form of
electrical or magnetic signals capable of being stored,
transferred, combined, compared, and otherwise manipulated. It has
proven convenient at times, principally for reasons of common
usage, to refer to these signals as bits, values, elements,
symbols, characters, terms, numbers, or the like.
[0145] While some specific embodiments of the invention have been
shown, the invention is not to be limited to these embodiments. For
example, most functions performed by electronic hardware components
may be duplicated by software emulation. Thus, a software program
written to accomplish those same functions may emulate the
functionality of the hardware components in input-output circuitry.
A target may be single threaded or multiple threaded. The invention
is to be understood as not limited by the specific embodiments
described herein, but only by the scope of the appended claims.
* * * * *