U.S. patent application number 12/849958 was filed with the patent office on 2010-08-04 for determination of one or more partitionable endpoints affected by an i/o message, and was published on 2012-02-09 as publication number 20120036302. This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to ERIC N. LAIS and STEVE THURBER.
Application Number | 12/849958 |
Publication Number | 20120036302 |
Family ID | 44629800 |
Filed Date | 2010-08-04 |
Publication Date | 2012-02-09 |
United States Patent Application | 20120036302 |
Kind Code | A1 |
LAIS; ERIC N.; et al. | February 9, 2012 |
DETERMINATION OF ONE OR MORE PARTITIONABLE ENDPOINTS AFFECTED BY AN I/O MESSAGE
Abstract
A data processing system includes a processor core, a system
memory including a first data structure including a plurality of
entries mapping requester identifiers (IDs) to partitionable
endpoint (PE) numbers, and an input/output (I/O) subsystem
including a plurality of PEs each having an associated PE number,
where each of the plurality of PEs includes one or more requesters
each having a respective requester ID. An I/O host bridge,
responsive to receiving an I/O message including a requester ID and
an address, determines a PE number by reference to a first entry
from the first data structure, and responsive to determining the PE
number, accesses a second entry of a second data structure
utilizing the PE number as an index and validates the address by
reference to the accessed entry in the second data structure. The
I/O host bridge, responsive to successful validation, provides a
service indicated by the I/O message.
Inventors: | LAIS; ERIC N.; (TILLSON, NY); THURBER; STEVE; (AUSTIN, TX) |
Assignee: | INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY |
Family ID: | 44629800 |
Appl. No.: | 12/849958 |
Filed: | August 4, 2010 |
Current U.S. Class: | 710/308; 707/705; 707/E17.044; 710/306; 711/118 |
Current CPC Class: | G06F 13/36 20130101; G06F 2213/0058 20130101; G06F 13/28 20130101; G06F 13/404 20130101 |
Class at Publication: | 710/308; 707/705; 707/E17.044; 710/306; 711/118 |
International Class: | G06F 13/40 20060101 G06F013/40; G06F 17/30 20060101 G06F017/30 |
Claims
1. A method of data processing in a data processing system having
an input/output (I/O) subsystem including at least one I/O host
bridge and a plurality of partitionable endpoints (PEs) each having
an associated PE number, each of the plurality of PEs including one
or more requesters each having a respective requester identifier
(ID), the method comprising: receiving an I/O message at an I/O
host bridge, the I/O message including a requester ID and an
address; in response to receiving the I/O message, the I/O host
bridge determining a PE number of a PE affected by the I/O message
by reference to a first entry from a first data structure including
a plurality of entries mapping requester IDs to PE numbers; in
response to determining the PE number, the I/O host bridge
accessing a second entry of a second data structure utilizing the
PE number as an index and validating the address by reference to
the accessed entry in the second data structure; and in response to
successful validation of the address, the I/O host bridge providing
a service indicated by the I/O message.
2. The method of claim 1, wherein: the I/O message is a direct
memory access (DMA) request; and the method further
comprises translating the address into a system memory address.
3. The method of claim 1, wherein the I/O host bridge comprises a
PCI host bridge.
4. The method of claim 1, wherein the I/O host bridge determining a
PE number of a PE affected by the I/O message by reference to a
first entry comprises accessing the first entry in a cache in the
I/O host bridge that buffers entries from the first data
structure.
5. The method of claim 1, wherein: the first data structure resides
in a system memory of the data processing system; and the I/O host
bridge determining a PE number of a PE affected by the I/O message
by reference to a first entry comprises accessing the first entry
in the system memory.
6. The method of claim 1, and further comprising selecting the
second entry from among multiple entries of the second data
structure associated with the PE number by reference to an index in
the address.
7. A data processing system, comprising: a processor core; a system
memory coupled to the processor core, the system memory including a
first data structure including a plurality of entries mapping
requester identifiers (IDs) to partitionable endpoint (PE) numbers;
and an input/output (I/O) subsystem coupled to the processor core,
the I/O subsystem including: a plurality of PEs each having an
associated PE number, each of the plurality of PEs including one or
more requesters each having a respective requester ID; and at least
one I/O host bridge including a second data structure including a
plurality of entries, wherein the I/O host bridge, responsive to
receiving an I/O message including a requester ID and an address,
determines a PE number of a PE affected by the I/O message by
reference to a first entry from the first data structure, and
responsive to determining the PE number, accesses a second entry of
the second data structure utilizing the PE number as an index and
validates the address by reference to the accessed entry in the
second data structure, and wherein the I/O host bridge, responsive
to successful validation of the address, provides a service
indicated by the I/O message.
8. The data processing system of claim 7, wherein the I/O message
is a direct memory access (DMA) request.
9. The data processing system of claim 7, wherein the I/O host
bridge comprises a PCI host bridge.
10. The data processing system of claim 7, wherein: the I/O host
bridge includes a cache of entries from the first data structure in
system memory; and the I/O host bridge determines the PE number by
accessing the first entry in the cache.
11. The data processing system of claim 7, wherein: the second data
structure holds multiple entries associated with the PE number; and
the I/O host bridge selects the second entry from among the
multiple entries associated with the PE number by reference to an
index in the address.
12. The data processing system of claim 7, and further comprising
an integrated circuit chip including the processor core and the I/O
host bridge.
13. A processor for a data processing system having a system memory
and an input/output (I/O) subsystem including a plurality of
partitionable endpoints (PEs) each having an associated PE number,
each of the plurality of PEs including one or more requesters each
having a respective requester identifier (ID), the processor
comprising: a processor core; and an I/O host bridge that,
responsive to receiving an I/O message including a requester ID and
an address, determines a PE number of a PE affected by the I/O
message by reference to a first entry, from a first data structure
including a plurality of entries mapping requester IDs to PE
numbers, and responsive to determining the PE number, accesses a
second entry of a second data structure in the I/O host bridge
utilizing the PE number as an index and validates the address by
reference to the accessed entry in the second data structure, and
wherein the I/O host bridge, responsive to successful validation of
the address, provides a service indicated by the I/O message.
14. The processor of claim 13, wherein the I/O message is a direct
memory access (DMA) request.
15. The processor of claim 13, wherein the I/O host bridge
comprises a PCI host bridge.
16. The processor of claim 13, wherein: the first data structure
resides in system memory of the data processing system; the I/O
host bridge includes a cache of entries from the first data
structure in system memory; and the I/O host bridge determines the
PE number by accessing the first entry in the cache.
17. The processor of claim 13, wherein: the second data structure
holds multiple entries associated with the PE number; and the I/O
host bridge selects the second entry from among the multiple
entries associated with the PE number by reference to an index in
the address.
18. The processor of claim 13, and further comprising an integrated
circuit chip including the processor core and the I/O host bridge.
Description
CROSS-REFERENCE
[0001] The present application is related to the following
copending patent applications, which are assigned to the assignee
hereof, filed on even date herewith, and incorporated herein by
reference in their entireties: [0002] U.S. patent application Ser.
No. ______, (Docket No. AUS920100183US1); [0003] U.S. patent
application Ser. No. ______, (Docket No. AUS920100185US1); [0004]
U.S. patent application Ser. No. ______, (Docket No.
AUS920100187US1); and [0005] U.S. patent application Ser. No.
______, (Docket No. AUS920100188US1).
BACKGROUND OF THE INVENTION
[0006] 1. Technical Field
[0007] The present invention relates in general to data processing,
and in particular, to input/output (I/O) in a data processing
system.
[0008] 2. Description of the Related Art
[0009] A data processing system may include multiple processing
elements and multiple input/output adapters (IOAs) to support
connections to communication networks, storage devices and/or
storage networks, and peripheral devices. In such data processing
systems, the hardware resources of the data processing system may
be logically partitioned into multiple, non-intersecting sets of
resources, each controlled by a respective one of multiple possibly
heterogeneous operating system instances. The operating systems
concurrently execute on this common hardware platform in their
respective logical partitions (LPARs) under the control of system
firmware, which is referred to as a virtual machine monitor (VMM)
or hypervisor. Thus, the hypervisor allocates each LPAR a
non-intersecting subset of the resources of the data processing
system, and each operating system instance in turn directly
controls its distinct set of allocable resources, such as regions
of system memory and IOAs.
[0010] In any environment including multiple IOAs, it is desirable
to isolate IOAs so that each IOA can only obtain access to the
resources allocated to it. Isolating IOAs promotes reliability,
availability and serviceability of the data processing system, and
is especially important in environments supporting hardware
virtualization (or logical partitioning), so that IOAs can be
individually allocated to different logical partitions (LPARs) and
so that any IOA errors are isolated to the particular partition to
which the IOA is allocated. For example, for Peripheral Component
Interconnect (PCI) buses, if an IOA in one LPAR activates the
System Error (SERR) signal, the system must make the SERR signal
visible to all other LPARs absent some additional control. Making
I/O errors visible across LPAR boundaries is, of
course, contrary to the definition and intent of logical
partitioning.
[0011] One solution that addresses the partitioning problem with
PCI errors is to require assignment of all IOAs connected to one
PCI Host Bridge (PHB) to the same LPAR. However, this
restriction mandates a high resource granularity for IOAs that is
not very useful or flexible. Ideally, IOAs should be allocable to
different LPARs regardless of the PHB to which the IOA is
connected. Alternative solutions include the use of specially
designed bridge chips external to the PHBs as described in U.S.
Pat. No. 6,643,727 or incorporating additional logic and data
structures to enforce partitioning between IOAs in differing LPARs
within PHBs as described in U.S. Pat. No. 7,398,427.
[0012] As also appreciated by the present disclosure, it would be
desirable to reduce the size of data structures within PHBs
utilized in handling routine messages, such as DMA messages and I/O
error messages.
SUMMARY OF THE INVENTION
[0013] In at least one embodiment, a data processing system
includes a processor core, a system memory including a first data
structure including a plurality of entries mapping requester
identifiers (IDs) to partitionable endpoint (PE) numbers, and an
input/output (I/O) subsystem including a plurality of PEs each
having an associated PE number, where each of the plurality of PEs
includes one or more requesters each having a respective requester
ID. An I/O host bridge, responsive to receiving an I/O message
including a requester ID and an address, determines a PE number by
reference to a first entry from the first data structure, and
responsive to determining the PE number, accesses a second entry of
a second data structure utilizing the PE number as an index and
validates the address by reference to the accessed entry in the
second data structure. The I/O host bridge, responsive to
successful validation, provides a service indicated by the I/O
message.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] FIG. 1 is a high level block diagram of an exemplary data
processing system in accordance with one embodiment;
[0015] FIG. 2 is a logical view of a data processing system showing
the hardware and software resources of the data processing system
partitioned into multiple concurrently executing logical partitions
(LPARs);
[0016] FIG. 3 illustrates an I/O subsystem that provides I/O
resource isolation in a data processing system in accordance with
one embodiment;
[0017] FIG. 4A depicts a conventional Peripheral Component
Interconnect (PCI) host bridge (PHB);
FIG. 4B illustrates a conventional Translation and
Validation Entry (TVE) of a Translation and Validation Table (TVT)
in the PHB of FIG. 4A;
[0019] FIG. 5A depicts an improved Peripheral Component
Interconnect (PCI) host bridge (PHB) in one exemplary
embodiment;
[0020] FIG. 5B illustrates an improved Translation and Validation
Entry (TVE) of a Translation and Validation Table (TVT) in the PHB
of FIG. 5A;
[0021] FIG. 6 is a high level logical flowchart of an exemplary
process by which a PHB handles a DMA message in accordance with one
embodiment;
[0022] FIG. 7A depicts a conventional Peripheral Component
Interconnect (PCI) host bridge (PHB) including a PE lookup table
(PELT) in accordance with the prior art;
[0023] FIG. 7B illustrates a conventional PE Lookup Entry (PELE) of
the PELT in the prior art PHB of FIG. 7A;
[0024] FIG. 8A depicts an improved Peripheral Component
Interconnect (PCI) host bridge (PHB) in one exemplary
embodiment;
[0025] FIG. 8B illustrates an improved PE Lookup Entry (PELE)
utilized by the improved PHB of FIG. 8A; and
[0026] FIG. 9 is a high level logical flowchart of an exemplary
process by which a PHB handles an I/O error message in accordance
with one embodiment.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT
[0027] With reference now to the figures, and in particular with
reference to FIG. 1, there is depicted a high level block diagram
of an exemplary data processing system 100 in accordance with one
embodiment. In some embodiments, data processing system 100 may be,
for example, a symmetric multiprocessor (SMP) system including a
plurality of processors 102a-102n, each coupled for communication
to a system fabric 104, which may include one or more bused or
switched communication links. For example, data processing system
100 may be implemented with an IBM eServer, a product line of
International Business Machines Corporation of Armonk, N.Y. In
alternative embodiments, a data processing system with a single
processor 102 may be utilized.
[0028] In the depicted embodiment, each processor 102 is preferably
realized as a single integrated circuit chip having a substrate in
which semiconductor circuitry is fabricated as is known in the art.
As shown, processor 102 includes a plurality of processor cores 110
that process data through the execution and/or processing of
program code, which may include, for example, software and/or
firmware and associated data, if any. Processor 102 further
includes cache memory 112 providing one or more levels of
relatively low latency temporary storage for instructions and data
retrieved from lower levels of the data storage hierarchy. In
addition, processor 102 includes an integrated memory controller
114 that controls access to an associated one of off-chip system
memories 116.
[0029] Each processor 102 further includes a fabric interface (FIF)
by which processor 102 communicates with system fabric 104, as well
as one or more (and preferably multiple) host bridges supporting
input/output communication with various input/output adapters
(IOAs) 130. In the depicted embodiment, all of the host bridges are
implemented as Peripheral Component Interconnect (PCI) host bridges
(PHBs) 120, but in other embodiments the host bridges may implement
one or more additional or alternative I/O bus standards.
[0030] PHBs 120a, 120k, 120m and 120v provide interfaces to PCI
local buses 122a, 122k, 122m and 122v, respectively, to which IOAs
130, such as network adapters, storage device controllers,
peripheral adapters, etc., may be directly connected or indirectly
coupled. For example, PCI IOA 130a is coupled to PCI local bus 122a
optionally through an I/O fabric 124a, which may comprise one or
more switches and/or bridges. In a similar manner, PCI IOAs 130k
and 130l are coupled to PCI local bus 122k optionally through an
I/O fabric 124k, PCI IOA 130m is coupled to PCI local bus 122m
optionally through I/O fabric 124m, and PCI IOAs 130v and 130w,
which may comprise, for example, a display adapter and hard disk
adapter, are coupled to PCI local bus 122v optionally through I/O
fabric 124v.
[0031] Data processing system 100 further includes a service
processor 140 that manages the boot process of data processing
system 100 and thereafter monitors and reports on the performance
of and error conditions detected in data processing system 100.
Service processor 140 is coupled to system fabric 104 and is
supported by a local memory 142, which may include volatile (e.g.,
dynamic random access memory (DRAM)) and non-volatile memory (e.g.,
non-volatile random access memory (NVRAM) or static random access
memory (SRAM)). Service processor 140 is further coupled to a
mailbox interface 144 through which service processor 140
communicates I/O operations with PCI bus 122a.
[0032] Those of ordinary skill in the art will appreciate that the
architecture and components of a data processing system can vary
between embodiments. For example, other devices and interconnects
may alternatively or additionally be used. Accordingly, the
exemplary data processing system 100 given in FIG. 1 is not meant
to imply architectural limitations with respect to the claimed
invention.
[0033] Referring now to FIG. 2, there is depicted a logical view of
a data processing system 200 showing the hardware and software
resources of the data processing system partitioned into multiple
logical partitions (LPARs). Data processing system 200 may have,
for example, the same components and/or architecture as data
processing system 100 in FIG. 1 and accordingly identifies common
components with like reference numerals.
[0034] Data processing system 200 has a collection of partitioned
hardware 202, including processors 102a-102n, system memories
116a-116n and IOAs 130a-130w. Partitioned hardware 202 may of
course include additional unillustrated components, such as
additional volatile or nonvolatile storage devices, ports, bridges,
switches, etc. The hardware components comprising partitioned
hardware 202 (or portions thereof) can be assigned to various ones
of logical partitions (LPARs) 210a-210p in data processing system
200 by system firmware 204, also referred to herein as a virtual
machine monitor (VMM) or hypervisor. System firmware 204 supports
the simultaneous execution of multiple independent operating system
instances by virtualizing the partitioned hardware of data
processing system 200.
[0035] In addition to the hardware resources allocated by system
firmware 204, each of LPARs 210a-210p includes a respective one of
multiple concurrently executed operating system instances
212a-212p. In various embodiments, operating system instances
212a-212p, which may include, for example, instances of Linux, AIX
and/or Windows, may be homogeneous or heterogeneous. Each LPAR 210
may further include unillustrated application programs, as well as
a respective instance of partition firmware 214, which may be
implemented, for example, with a combination of initial boot strap
code, IEEE-1275 Standard Open Firmware, and runtime abstraction
software (RTAS). When LPARs 210a-210p are instantiated, a copy of
boot strap code is loaded onto partitions 210a-210p by system
firmware 204. Thereafter, system firmware 204 transfers control to
the boot strap code, which in turn loads the open firmware and
RTAS. The processor(s) 102 assigned to each LPAR 210 then execute
the partition firmware 214 of that LPAR 210 to bring up the LPAR
210 and initiate execution of the OS instance 212.
[0036] In the logically partitioned environment depicted in FIG. 2,
service processor 140 can be used to provide various services, such
as processing of errors in LPARs 210a-210p. These services may also
function as a service agent to report errors back to a system
administrator or vendor of data processing system 200. Operation of
the different LPARs 210 may further be controlled through a
hardware management console 220. In at least one embodiment,
hardware management console 220 can be implemented as a separate
data processing system from which a system administrator may
perform various functions within data processing system 200
including creating and destroying LPARs 210, as well as
reallocating hardware and software resources among LPARs 210.
[0037] In a logical partitioned environment such as that depicted
in FIG. 2, it is not permissible for the hardware or software
resources in one LPAR 210 to consume the resources of or affect the
operations in another LPAR 210. Furthermore, to be useful, the
assignment of resources to LPARs 210 needs to be fine-grained. For
example, it is often not acceptable to assign all IOAs 130 under a
particular PHB 120 to the same partition, as that will restrict
configurability of the system, including the ability to dynamically
reallocate resources between partitions. Accordingly, PHBs 120 are
able to assign resources, such as individual IOAs 130 (or portions
thereof) to different LPARs 210 while preventing the assigned
resources from accessing or affecting the resources of other LPARs
210.
[0038] To support such isolation between the resources of different
LPARs 210, the I/O subsystem of a data processing system is
subdivided into multiple partitionable endpoints. A "partitionable
endpoint" or "PE" is defined herein as any component or
subcomponent of an I/O subsystem that can be allocated to an LPAR
independently of any other component or subcomponent of the I/O
subsystem. For example, some PEs may comprise a plurality of IOAs
and/or I/O fabric components that function together and, thus,
should be allocated as a unit to a single LPAR. Another PE,
however, may comprise a portion of a single IOA, for example, a
separately configurable and separately assignable port of a
multi-port IOA. In general, a PE will be identified by its function
rather than by its structure.
[0039] Referring now to FIG. 3, there is depicted a block diagram
of at least a portion of the I/O subsystem 300 of a logically
partitioned data processing system, such as data processing system
100 of FIG. 1, which exhibits resource isolation between LPARs 210
in accordance with one embodiment.
[0040] In the depicted embodiment, I/O subsystem 300 includes a PHB
120 coupled to a plurality of IOAs 302a-302g through an I/O fabric
124. I/O fabric 124 in turn includes switches 310a, 310b,
PCI-Express (PCI-E) buses 320, 322, 324 and 326, PCI bridges 312a
and 312b, and secondary buses 340, 342, 344 and 346.
[0041] As further shown in FIG. 3, system firmware 204 groups
various components of I/O subsystem 300 to form a plurality of PEs
350a-350d that are each independently assignable to any of the
LPARs 210 of the data processing system. In the given example, PE
350a and PE 350c each comprise a single IOA, namely, IOAs 302a and
302d, respectively. PE 350b, in contrast, comprises two IOAs 302b
and 302c that must be assigned to the same LPAR 210. PE 350d
comprises three IOAs 302e, 302f and 302g and PCI bridge 312b, which
function together as a PE and therefore must be assigned to the
same LPAR 210. As noted previously, in other embodiments, a PE may
include only a portion (e.g., one or more ports) of an IOA.
[0042] In I/O subsystem 300, the respective state of each PE,
referred to herein as the partitionable endpoint state, is
maintained in the associated PHB 120. Thus, for example, PHB 120 of
I/O subsystem 300 includes partitionable endpoint state registers
360a-360d, which correspond to and indicate the states of PEs
350a-350d, respectively.
[0043] System firmware 204 assigns each PE one or more domain
numbers (or requester IDs (RIDs)) that associate its component(s)
with that PE. In an exemplary embodiment, the domain number
assigned each PE comprises a plurality of fields that can further
be used to differentiate between I/O components in the PE. For
example, these fields may include: [0044] Bus number (Bus) field:
provides the highest level of division between I/O resources, with
each bus under a PHB having a unique bus number. [0045] Device
number (Dev) field: provides an intermediate level of division
between I/O resources, with each IOA on a given bus having a
different device number. [0046] Function number (Func) field:
provides the lowest level of division between I/O resources, with
each distinct function of an IOA having a different function
number.
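For standard PCI, these three fields occupy fixed positions within the 16-bit RID (bus 8 bits, device 5 bits, function 3 bits). As a minimal sketch in C (the helper names are illustrative, not part of the disclosure), a bridge or firmware layer could extract them as follows:

    /* Decompose a standard 16-bit PCI requester ID (RID):
       bits 15:8 = bus number, bits 7:3 = device number,
       bits 2:0 = function number. */
    typedef unsigned short rid_t;

    static inline unsigned rid_bus(rid_t rid)  { return (rid >> 8) & 0xFFu; }
    static inline unsigned rid_dev(rid_t rid)  { return (rid >> 3) & 0x1Fu; }
    static inline unsigned rid_func(rid_t rid) { return rid & 0x7u; }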
[0047] As will be appreciated, the domain number (or RID) supports
the division of I/O resources down to the lowest level of I/O
functionality. For example, the domain number allows separate
functions of a multiple function IOA to be differentiated. In data
processing systems that do not require such a fine granularity, the
domain number can be defined by the Bus field alone, allowing
differentiation between the PEs connected to the same PHB, or by
the Bus field together with either the Dev field or the Func field
to permit differentiation between IOAs of a PE or differentiation
between functions of an IOA in a PE that contains a multiple
function IOA. The sparseness of the domain number space consisting
of the Bus, Bus/Dev, or Bus/Dev/Func fields makes it desirable in
many cases to condense the domain number space defined by these
fields to something less sparse for internal usage by the PHB
120.
[0048] Among the isolation functionalities included in PHB 120 of
FIG. 3 is the capability to isolate PE error domains. In logically
partitioned data processing systems, different PEs may be assigned
to different LPARs. Accordingly, PHBs 120 enable an error occurring
in one PE to be isolated to the particular LPAR to which the PE is
assigned. More particularly, each PHB 120 includes the capability
of stopping I/O operations to and from a PE when an error is
detected (referred to as the Stopped state). The stopping of I/O
operations is preferably accomplished in such a way that:
[0049] 1. The PE is prevented from completing an I/O operation in error,
[0050] a. such that the PE does not propagate an error to any LPAR, and
[0051] b. such that a requester of the I/O operation does not use erroneous data.
[0052] 2. The stopping of operations should appear to a device driver to be isolated to just that device driver.
[0053] 3. Software (at the device driver level or above) for one PE does not introduce an error that can cause another PE to enter the Stopped state.
[0054] 4. Fault information for problem determination can be captured after the Stopped state occurs.
[0055] 5. Firmware can access the configuration space below the PHB when any or all of the PEs are in the Stopped state.
[0056] In order to achieve error handling in accordance with these
criteria, each PHB preferably provides isolation functionality that
identifies a particular error domain for an I/O configuration
operation. In a preferred embodiment, the configuration operation
error domain capability is enabled by implementing a configuration
PE number field in a register of the PHB, which field can be set by
the system firmware. In addition, in a preferred embodiment, each
PHB determines one or more PE numbers affected by an I/O message
and routes the I/O message to only software specific to controlling
those PE(s).
[0057] In addition to providing effective isolation functionality,
it is also desirable to reduce the size of data structures within
PHBs utilized in handling routine messages, such as DMA messages
and I/O error messages, particularly in embodiments in which PHBs
are integrated into a common integrated circuit chip with the
processor. Accordingly, as discussed further herein below, the
footprint of data structures implemented within PHBs can be reduced
by an improved determination of the PE(s) affected by I/O messages,
such as DMA and I/O error messages.
[0058] Referring now to FIG. 4A, there is depicted a conventional
PHB 400 as described in U.S. Pat. No. 7,398,427, which is
implemented in an integrated circuit chip separate from the
processor. PHB 400 includes a wide data structure referred to as
Translation and Validation Table (TVT) 402. TVT 402 includes a
plurality of Translation and Validation Entries (TVEs) 404. As
shown in FIG. 4B, each conventional TVE 404 comprises a number of
fields including Requester ID (RID) and RID Validate Control field
430 specifying a RID and control information for validating the
RID, a PE# field 432 indicating a PE associated with the RID, a
Translation Control Entry (TCE) table size field 434, an I/O page
size field 436, and a TCE table start address field 438 indicating
the base address of the TCE table for the specified PE.
[0059] PHB 400 validates RIDs of Direct Memory Access (DMA)
requests and translates RIDs to particular PEs by reference to TVT
402. As shown, PHB 400 receives a Direct Memory Access (DMA) packet
including a RID 410 (which comprises a bus number, a device number
and a function number) and a DMA address 412. Several bits of DMA
address 412 form a TVE index (TVEI) 414 into TVT 402 that selects a
particular TVE 404 for access. Once the TVE 404 is selected, the
content of PE# field 432 is read out to determine the current state
of the PE. In addition, the content of RID and RID Validate Control
field 430 is compared with incoming RID 410 as shown at block 420.
If RID 410 does not match the RID specified in field 430, PHB 400
does not permit the requested DMA operation to be performed. As
indicated at block 422, PHB 400 also truncates the low order n bits
of DMA address 412 (where 2^n is the I/O page size specified by I/O
page size field 436 of the selected TVE 404) and compares the
remaining DMA address bits below TVEI 414 with TCE table size field
434 of the selected TVE 404. If DMA address 412 specifies an
address past the end of the relevant TCE table, PHB 400 disallows
the DMA operation. If, on the other hand, the validations shown at
block 420 and 422 are successful, PHB 400 performs the requested
DMA operation utilizing the DMA address-to-real address translation
contained in the in-memory TCE table for the PE, which is pointed
to by the contents of TCE table start address field 438.
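Rendered as sequential software for illustration only (the actual mechanism is PHB hardware, and the struct layout is an assumption), the conventional lookup of FIG. 4A amounts to the following; field numbers in the comments refer to FIG. 4B:

    #include <stdbool.h>

    /* Conventional TVE per FIG. 4B. */
    struct conventional_tve {
        unsigned short rid;         /* RID + RID validate control (field 430) */
        unsigned short pe_num;      /* PE# (field 432) */
        unsigned long  tce_entries; /* TCE table size (field 434) */
        unsigned       page_shift;  /* n, where 2^n = I/O page size (field 436) */
        unsigned long  tce_base;    /* TCE table start address (field 438) */
    };

    /* TVEI bits carved out of the DMA address select the TVE; the DMA
       proceeds only if the incoming RID matches (block 420) and the
       address falls within the PE's TCE table (block 422). */
    bool conventional_dma_ok(const struct conventional_tve *tvt,
                             unsigned tvei, unsigned short rid,
                             unsigned long dma_addr)
    {
        const struct conventional_tve *tve = &tvt[tvei];
        if (tve->rid != rid)
            return false;
        return (dma_addr >> tve->page_shift) < tve->tce_entries;
    }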
[0060] It should be noted that the conventional TVE 404 depicted in
FIGS. 4A-4B contains numerous multi-bit fields, and consequently
conventional TVT 402 is a large data structure that requires
considerable die area. In addition, each PE does not have use of
TVEI field 414 of DMA address 412 for its own application, meaning
that the DMA address space is carved into different discontiguous
spaces for the various PEs.
[0061] With reference now to FIG. 5A, there is illustrated a more
detailed view of an improved PHB 120 in accordance with one
embodiment. In general, it is desirable to reduce the die area of
PHB 120, particularly in preferred embodiments in which PHB 120 is
integrated within the integrated circuit chip of processor 102 as
shown in FIG. 1. One factor contributing to the reduction in the
die area of PHB 120 is a reduction in the size of data structures
within PHB 120 utilized to validate and translate DMA and I/O error
messages received from various I/O components, such as IOAs 130,
I/O bridges 312, and switches 310. Specifically, as detailed
further below, the 16-bit RID field and PE# field formerly found in
each conventional TVE 404 can be removed, leading to a significant
reduction in the width of TVEs and a concomitant reduction in the
overall footprint of the TVT and PHB 120.
[0062] In the arrangement shown in FIG. 5A, a RID Translation Table
(RTT) 500, which may be populated and maintained, for example, by
system firmware 204 based upon its allocation of I/O resources
among LPARs 210, includes a plurality of RID Translation Entries
(RTEs) 502. Each RTE 502 associates a respective RID, such as
conventional 16-bit PCI RID 410, with a PE. RTT 500 can be
implemented either in PHB 120, or more preferably, in an off-chip
storage location, such as system memory 116. In embodiments in
which RTT 500 is implemented off-chip, PHB 120 can optionally
include a small on-chip RID Translation Cache (RTC) 504 to provide
lower latency access to copies of the most recently accessed RTEs
502.
[0063] FIG. 5A further illustrates that PHB 120 includes a
streamlined TVT 510 including a plurality of TVEs 512. As depicted
in FIG. 5B, each TVE 512 comprises a small number of bit fields
including a Translation Control Entry (TCE) table size field 534
indicating a table size of the TCE table 560 for the PE originating
the DMA, an I/O page size field 536, and a TCE table start address
field 538 indicating the base address of the in-memory TCE table
560 for the source PE. It should be noted upon comparison to FIG.
4B that TVEs 512 lack fields corresponding to conventional fields
430 and 432, resulting in a significant size reduction in TVT
510.
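In data-structure terms, the reorganization can be pictured as follows; this is a sketch under assumed field widths (the patent specifies the contents of the entries, not their exact bit layouts):

    /* RTE per FIG. 5A: the 16-bit RID directly indexes RTT 500, and an
       entry yields only the associated PE number. */
    struct rte {
        unsigned short pe_num;
    };
    struct rte rtt[1 << 16];        /* RTT 500, preferably in system memory */

    /* Streamlined TVE per FIG. 5B: the RID and PE# fields (430, 432 of
       FIG. 4B) are gone, shrinking each entry substantially. */
    struct tve {
        unsigned long tce_entries;  /* TCE table size (field 534) */
        unsigned      page_shift;   /* n, where 2^n = I/O page size (field 536) */
        unsigned long tce_base;     /* TCE table start address (field 538) */
    };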
[0064] The operation of PHB 120 in servicing a DMA request will now
be described with reference to FIGS. 5A-5B and with additional
reference to the high level logical flowchart provided in FIG. 6.
The process begins at block 600 and then proceeds to block 602,
which illustrates PHB 120 receiving from one of the I/O components
to which it is coupled a Direct Memory Access (DMA) operation
including a conventional RID 410 and a DMA address 540. PHB 120
utilizes the RID 410 of the DMA operation to access a particular
RTE 502, either from RTC 504 (if present) or from RTT 500 (block
604). The accessed RTE 502 specifies a PE, which PHB 120 utilizes
to access the current state of the PE. PHB 120 also utilizes the
PE# specified by the accessed RTE 502 to access TVT 510 (block
606). In some embodiments in which each PE has a single associated
TVE 512, the PE# directly indexes into TVT 510. In alternative
embodiments in which each PE may have one or more TVEs 512 (e.g.,
to enable multiple I/O page sizes for at least some PEs), then PHB
120 can additionally utilize one or more PE index (PEI) bits 514
from DMA address 540 to select between the multiple TVEs 512
associated with the selected PE. It should be appreciated that the
use of PEI 514 does not carve up the DMA address space between
different PEs, as does TVEI 414 of FIG. 4A, but only divides the
DMA address space within the selected PE's address space, thus
advantageously making the entire DMA address space available to
each PE.
[0065] Following block 606, the process of FIG. 6 proceeds to block
608, which depicts DMA address validation logic 550 in PHB 120
truncating the low order n bits of DMA address 540 (where 2^n is
the I/O page size specified by I/O page size field 536 of the
selected TVE 512) and comparing the remaining upper order DMA
address bits with the contents of TCE table size field 534 of the
selected TVE 512. As indicated at block 610, if DMA address 540
specifies an address past the end of the relevant TCE table 560,
the validation fails, and PHB disallows the DMA operation as
indicated by the process terminating at block 616. If, on the other
hand, DMA address 540 passes validation, as indicated by a positive
determination at block 610, PHB 120 translates DMA address 540 to a
real address in system memory 116 (block 612). In one embodiment,
PHB 120 performs the address translation by reference to the
in-memory TCE table 560 utilizing the particular TCE therein
pointed to by an address formed by combining the contents of TCE
table start address field 538 of the selected TVE 512 and the
mid-order bits of DMA address 540 between PEI 514 and the n
low-order address bits. PHB 120 then transmits the DMA operation to
the integrated memory controller (IMC) 114 of the target system memory 116 using the system memory
(e.g., real) address obtained by the address translation in order
to invoke performance of the requested DMA operation (block 614).
Thereafter, the process shown in FIG. 6 terminates at block
616.
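Pulling the steps of FIG. 6 together, and reusing the structures sketched above, the flow might be rendered in software as follows. This is an illustrative sketch: tce_lookup and perform_dma are stand-ins for the in-memory TCE fetch and the handoff to the memory controller, and passing the PEI bits as a separate argument simplifies their extraction from the DMA address:

    extern struct tve tvt[];        /* TVT 510 in the PHB */
    extern unsigned long tce_lookup(unsigned long tce_base,
                                    unsigned long index);
    extern void perform_dma(unsigned long real_addr);

    int handle_dma(unsigned short rid, unsigned long dma_addr,
                   unsigned pei, unsigned tves_per_pe)
    {
        unsigned pe_num = rtt[rid].pe_num;              /* blocks 602-604 */
        const struct tve *tve = &tvt[pe_num * tves_per_pe + pei]; /* block 606 */

        unsigned long index = dma_addr >> tve->page_shift;
        if (index >= tve->tce_entries)                  /* blocks 608-610 */
            return -1;                                  /* DMA disallowed */

        unsigned long page = tce_lookup(tve->tce_base, index); /* block 612 */
        unsigned long mask = (1UL << tve->page_shift) - 1;
        perform_dma(page | (dma_addr & mask));          /* block 614 */
        return 0;
    }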
[0066] A similar technique for providing isolation between PEs
while minimizing the size of data structures in PHBs 120 is also
applicable to the isolation of I/O error messages, as discussed
further below with reference to FIGS. 7A-7B, 8A-8B and 9.
[0067] With reference first to FIG. 7A, there is illustrated a
second view of conventional PHB 400 of FIG. 4A that depicts the
data structure utilized in handling I/O (e.g., PCIe) error messages
in the prior art. As shown, in addition to the data structures
previously discussed, PHB 400 includes a wide data structure
referred to as PE Lookup Table (PELT) 700. PELT 700, which is
implemented in expensive content-addressable memory (CAM), includes
a plurality of PE Lookup Entries (PELEs) 702. As shown in FIG. 7B,
each conventional PELE 702 comprises Requester ID (RID) and RID
Validate Control field 710 specifying a RID and control information
for validating the RID, as well as a PE Lookup Vector (PELV) field
712 indicating by set bits (e.g., 1's) which PE number(s) are
affected by the I/O error.
[0068] In the prior art, PHB 400 receives a PCIe error message 704
together with a RID 410 identifying which I/O component is the
source of PCIe error message 704. In response, PHB 400 utilizes RID
410 to perform a CAM access to PELT 700 to identify a matching PELE
702 containing a matching RID in its RID and RID Validate Control
field 710. PHB 400 then processes the PCIe error message for each
PE specified by the PELV field 712 of the matching PELE 702.
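Functionally, the CAM performs an associative match of the incoming RID against every PELE at once; expressed as sequential software for illustration (the hardware does the comparisons in parallel):

    #include <stddef.h>

    struct conventional_pele {
        unsigned short rid;   /* RID + RID validate control (field 710) */
        unsigned long  pelv;  /* PE lookup vector (field 712), 1 bit per PE */
    };

    const struct conventional_pele *
    pelt_cam_match(const struct conventional_pele *pelt, size_t entries,
                   unsigned short rid)
    {
        for (size_t i = 0; i < entries; i++)
            if (pelt[i].rid == rid)
                return &pelt[i];
        return NULL;          /* no matching PELE */
    }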
[0069] Referring now to FIG. 8A, there is depicted a more detailed
view of an improved PHB 120 in accordance with one embodiment. As
noted above, it is desirable to reduce the die area of PHB 120,
particularly in preferred embodiments in which PHB 120 is
integrated within processor 102 as shown in FIG. 1. One factor
contributing to the reduction in the die area of PHB 120 is the
elimination of the RID field found in each conventional PELE 702,
leading to a significant reduction in the width of PELEs and a
concomitant reduction in the overall footprint of PHB 120. It is
further desirable to reduce or eliminate utilization of expensive
CAM, such as that utilized to implement conventional PELT 700.
[0070] Consequently, in the arrangement shown in FIG. 8A, RTT 500,
which is preferably implemented in system memory 116, is again
utilized to associate each possible RID that may be received by PHB
120, such as conventional 16-bit PCI RID 410, with a PE. As noted
above, to reduce access latency in embodiments in which RTT 500 is
implemented off-chip, PHB 120 can optionally include a small
on-chip RTC 504 to provide lower latency access to copies of the
most recently accessed RTEs 502.
[0071] FIG. 8A further illustrates that system memory 116, which is
preferably implemented with a low cost non-CAM technology (e.g.,
DRAM), preferably implements a streamlined PELT 800 including a
plurality of PELEs 802. As depicted in FIG. 8B, each PELE 802
comprises a PELV 804 containing a plurality of bits each
corresponding to a respective one of a plurality of PE numbers. As
described above, PELV 804 identifies with one or more set bits
(e.g., 1's) the PE(s) against which an error occurring at a given RID
should be processed. Multiple PEs can be implicated in an error,
for example, if the error is related to an I/O component coupled to
multiple PEs (e.g., a switch 310) or to multiple functions
associated with a single device (e.g., multiple ports of an IOA
130). It should be noted that PELEs 802 lack a field corresponding
to conventional field 710, resulting in a significant size
reduction in PELT 800.
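A PELE of FIG. 8B can therefore be as narrow as the PELV itself. A sketch, with the maximum PE count chosen arbitrarily for illustration:

    #define MAX_PES 256                 /* assumed implementation limit */
    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    /* PELE 802 consists solely of PELV 804: one bit per PE number. */
    struct pele {
        unsigned long pelv[(MAX_PES + BITS_PER_WORD - 1) / BITS_PER_WORD];
    };
    struct pele pelt[MAX_PES];          /* PELT 800, held in system memory */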
[0072] The operation of PHB 120 in handling an I/O error message
will now be described with additional reference to the high level
logical flowchart provided in FIG. 9. The I/O error message
handling process begins at block 900 and then proceeds to block
902, which illustrates a PHB 120 receiving an I/O error message
packet containing an error message 704 and a RID 410 identifying
the source of the I/O error message. PHB 120 utilizes the RID 410
of the I/O error packet to access a particular RTE 502, either from
RTC 504 (if present) or from RTT 500 (block 904). The accessed RTE
502 specifies a PE#, which PHB 120 utilizes as a direct index to
access PELT 800 (block 906). It should be noted that since a direct
index into PELT 800 is available, it is not necessary to implement
PELT 800 in expensive CAM.
[0073] Next, at block 908, PHB 120 determines which PEs are
affected by the I/O error by examining which bit or bits are set in
the PELV field 804 of the selected PELE 802 in PELT 800. In
response to the determination of the affected PE(s), I/O error
handling logic 810 in PHB 120 signals the I/O error as appropriate
to only the error handling software (e.g., device driver software
of one or more OSs 212) responsible for handling errors for the
affected PE(s) (block 910). The error handling process then
completes at block 912.
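Reusing the structures sketched above, the error-routing flow of FIG. 9 can be rendered as follows (signal_error_handler is a stand-in for notifying the error handling software responsible for a PE):

    extern void signal_error_handler(unsigned pe_num, const void *err_msg);

    void handle_io_error(unsigned short rid, const void *err_msg)
    {
        unsigned pe_num = rtt[rid].pe_num;          /* blocks 902-904 */
        const struct pele *pele = &pelt[pe_num];    /* block 906 */

        for (unsigned pe = 0; pe < MAX_PES; pe++) { /* block 908 */
            if (pele->pelv[pe / BITS_PER_WORD] &
                (1UL << (pe % BITS_PER_WORD)))
                signal_error_handler(pe, err_msg);  /* block 910 */
        }
    }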
[0074] As has been described, in one embodiment, a data processing
system includes a processor core, a system memory including a first
data structure including a plurality of entries mapping requester
identifiers (IDs) to partitionable endpoint (PE) numbers, and an
input/output (I/O) subsystem including a plurality of PEs each
having an associated PE number, where each of the plurality of PEs
includes one or more requesters each having a respective requester
ID. An I/O host bridge, responsive to receiving an I/O message
including a requester ID and an address, determines a PE number by
reference to a first entry from the first data structure, and
responsive to determining the PE number, accesses a second entry of
a second data structure utilizing the PE number as an index and
validates the address by reference to the accessed entry in the
second data structure. The I/O host bridge, responsive to
successful validation, provides a service indicated by the I/O
message.
[0075] The foregoing description has been presented for purposes of
illustration and elaboration, and is not intended to be exhaustive
or limited to the structures and processes disclosed. Many
modifications and variations will be apparent to those of ordinary
skill in the art. Various embodiments were chosen and described in
order to best explain the principles of operation, the practical
application, and to enable others of ordinary skill in the art to
understand and apply the disclosed teachings in various embodiments
with any modifications suitable for the particular use
contemplated.
While the present invention has been particularly shown and
described with reference to one or more preferred embodiments, it
will be understood by those skilled in the art that various changes
in form and detail may be made therein without departing from the
spirit and scope of the invention. For example, while the present
invention has been described in the context of a fully functioning
data processing system, those of ordinary skill in the art will
appreciate that the processes of the present invention are capable
of being distributed in the form of a computer program product
including a computer readable storage medium having program code
stored therein. Examples of computer readable storage media include
hard disk drives, RAM or other volatile memory, non-volatile
memory, and optical storage media.
* * * * *