U.S. patent application number 13/178486 was filed with the patent office on 2011-07-07 and published on 2013-05-09 for programmatic instantiation, and provisioning of servers.
The applicants listed for this patent application are Curtis M. Collins, Yuri Finkelstein, Fabio Onofrio Ingrao, Thomas Dean Lovett, Shyam Mittur, Cosmos Nicolaou, Nakul Pratap Saraiya, Shreyas B. Shah. Invention is credited to Curtis M. Collins, Yuri Finkelstein, Fabio Onofrio Ingrao, Thomas Dean Lovett, Shyam Mittur, Cosmos Nicolaou, Nakul Pratap Saraiya, Shreyas B. Shah.
Application Number | 20130117426 13/178486 |
Family ID | 42027094 |
Filed Date | 2011-07-07 |
United States Patent Application | 20130117426 |
Kind Code | A1 |
Saraiya; Nakul Pratap; et al. | May 9, 2013 |
PROGRAMMATIC INSTANTIATION, AND PROVISIONING OF SERVERS
Abstract
A networked computing system is taught that may be operated as
one or more provisioned servers, each of the provisioned servers
including capabilities as identified by a corresponding set of
specifications and attributes, according to various embodiments.
Typically the specifications (or constraints) and attributes are
specified with a Server Configuration File. The networked computing
system may be provisioned into any combination and number of
servers according to needed processing and I/O capabilities. Each
of these servers may include distinct compute, storage, and
networking performance. Provisioned servers may be managed similarly
to conventional servers, including operations such as booting and
shutting down.
Inventors: |
Saraiya; Nakul Pratap (Palo Alto, CA)
Collins; Curtis M. (San Jose, CA)
Finkelstein; Yuri (San Carlos, CA)
Lovett; Thomas Dean (Portland, OR)
Mittur; Shyam (Laguna Niguel, CA)
Nicolaou; Cosmos (Palo Alto, CA)
Shah; Shreyas B. (San Jose, CA)
Ingrao; Fabio Onofrio (Belmont, CA) |
Applicant: |
Name | City | State | Country | Type |
Saraiya; Nakul Pratap | Palo Alto | CA | US | |
Collins; Curtis M. | San Jose | CA | US | |
Finkelstein; Yuri | San Carlos | CA | US | |
Lovett; Thomas Dean | Portland | OR | US | |
Mittur; Shyam | Laguna Niguel | CA | US | |
Nicolaou; Cosmos | Palo Alto | CA | US | |
Shah; Shreyas B. | San Jose | CA | US | |
Ingrao; Fabio Onofrio | Belmont | CA | US | |
Family ID: | 42027094 |
Appl. No.: | 13/178486 |
Filed: | July 7, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
12728250 | Mar 21, 2010 | 7979552 |
13178486 | | |
11057048 | Feb 12, 2005 | 7685281 |
12728250 | | |
10889469 | Jul 12, 2004 | |
11057048 | | |
10889468 | Jul 12, 2004 | 7757033 |
10889469 | | |
10889467 | Jul 12, 2004 | |
10889468 | | |
60650707 | Feb 7, 2005 | |
60651027 | Feb 7, 2005 | |
60603250 | Aug 20, 2004 | |
60563222 | Apr 17, 2004 | |
60544916 | Feb 13, 2004 | |
60651026 | Feb 7, 2005 | |
Current U.S. Class: | 709/222 |
Current CPC Class: | G06F 13/387 20130101 |
Class at Publication: | 709/222 |
International Class: | G06F 15/177 20060101 G06F015/177 |
Claims
1. A method including the steps of: accessing a networked computing
system template; automatically instantiating a networked computing
system according to the template; and wherein the template includes
a hardware manifest of required hardware primitives, and the
hardware manifest includes a description of a required number of
partitions of a plurality of partitionable symmetric
multiprocessors.
2. The method of claim 1, wherein: the template is an instance of a
server specification.
3. The method of claim 1, wherein: the networked computing system
is an instance of a provisioned server.
4. The method of claim 1, wherein: the act of instantiating the
networked computing system is an instance of forming a provisioned
server.
5. The method of claim 1, wherein: the hardware manifest is an
instance of a hardware specification.
6. The method of claim 1, wherein: the act of automatically
instantiating includes programmatically managing a free primitives
pool.
7. The method of claim 6, wherein: the act of automatically
instantiating further includes programmatically allocating
resources from the free primitives pool according to the
template.
8. The method of claim 1, wherein: the partitions are physical
partitions and the partitionable symmetric multiprocessors are
physically-partitionable symmetric multiprocessors.
9. The method of claim 1, wherein: each of the partitions couples
to a switch fabric via a switch fabric interface.
10. The method of claim 9, wherein: each of the partitions is
designed to independently couple to the switch fabric.
11. The method of claim 1, wherein: each of the partitionable
symmetric multiprocessors is partitionable into a plurality of
configurations, including a first configuration having a first
number of partitions, and a second configuration having a second
number of partitions.
12. The method of claim 11, wherein: the configurations include a
third configuration having a single partition.
13. The method of claim 1, wherein: the template further includes
hardware configuration information.
14. The method of claim 1, wherein: the template further includes a
software manifest of required software primitives.
15. The method of claim 14, wherein: the template further includes
software configuration information.
16. The method of claim 1, wherein: the required hardware
primitives include a description of required compute
primitives.
17. The method of claim 16, wherein: the required compute
primitives include required application compute primitives.
18. The method of claim 17, wherein: the required number of
partitions corresponds to a subset of the required application
compute primitives.
19. The method of claim 16, wherein: the required compute
primitives include required service compute primitives.
20. The method of claim 1, further including: altering the
template.
21. The method of claim 20, wherein: the act of altering is at
least in part via a command line interface.
22. The method of claim 20, wherein: the act of altering is at
least in part via a graphical user interface.
23. A system comprising: a processor; and a memory coupled with and
readable by the processor and storing a set of instructions which,
when executed by the processor, cause the processor to perform a
process comprising: accessing a networked computing system
template; automatically instantiating a networked computing system
according to the template; and wherein the template includes a
hardware manifest of required hardware primitives, and the hardware
manifest includes a description of a required number of partitions
of a plurality of partitionable symmetric multiprocessors.
24. The system of claim 23, wherein: the template is an instance of
a server specification.
25. The system of claim 23, wherein: the networked computing system
is an instance of a provisioned server.
26. The system of claim 23, wherein: the act of instantiating the
networked computing system is an instance of forming a provisioned
server.
27. The system of claim 23, wherein: the hardware manifest is an
instance of a hardware specification.
28. The system of claim 23, wherein: the act of automatically
instantiating includes programmatically managing a free primitives
pool.
29. The system of claim 28, wherein: the act of automatically
instantiating further includes programmatically allocating
resources from the free primitives pool according to the
template.
30. The system of claim 23, wherein: the partitions are physical
partitions and the partitionable symmetric multiprocessors are
physically-partitionable symmetric multiprocessors.
31. The system of claim 23, wherein: each of the partitions couples
to a switch fabric via a switch fabric interface.
32. The system of claim 31, wherein: each of the partitions is
designed to independently couple to the switch fabric.
33. The system of claim 23, wherein: each of the partitionable
symmetric multiprocessors is partitionable into a plurality of
configurations, including a first configuration having a first
number of partitions, and a second configuration having a second
number of partitions.
34. The system of claim 33, wherein: the configurations include a
third configuration having a single partition.
35. The system of claim 23, wherein: the template further includes
hardware configuration information.
36. The system of claim 23, wherein: the template further includes
a software manifest of required software primitives.
37. The system of claim 36, wherein: the template further includes
software configuration information.
38. The system of claim 23, wherein: the required hardware
primitives include a description of required compute
primitives.
39. The system of claim 38, wherein: the required compute
primitives include required application compute primitives.
40. The system of claim 39, wherein: the required number of
partitions corresponds to a subset of the required application
compute primitives.
41. The system of claim 38, wherein: the required compute
primitives include required service compute primitives.
42. The system of claim 23, further including: altering the
template.
43. The system of claim 42, wherein: the act of altering is at
least in part via a command line interface.
44. The system of claim 42, wherein: the act of altering is at
least in part via a graphical user interface.
45. A computer-readable memory device storing a set of instructions
which, when executed by a processor, cause the processor to perform
a process comprising: accessing a networked computing system
template; automatically instantiating a networked computing system
according to the template; and wherein the template includes a
hardware manifest of required hardware primitives, and the hardware
manifest includes a description of a required number of partitions
of a plurality of partitionable symmetric multiprocessors.
46. The computer-readable memory of claim 45, wherein: the template
is an instance of a server specification.
47. The computer-readable memory of claim 45, wherein: the
networked computing system is an instance of a provisioned
server.
48. The computer-readable memory of claim 45, wherein: the act of
instantiating the networked computing system is an instance of
forming a provisioned server.
49. The computer-readable memory of claim 45, wherein: the hardware
manifest is an instance of a hardware specification.
50. The computer-readable memory of claim 45, wherein: the act of
automatically instantiating includes programmatically managing a
free primitives pool.
51. The computer-readable memory of claim 50, wherein: the act of
automatically instantiating further includes programmatically
allocating resources from the free primitives pool according to the
template.
52. The computer-readable memory of claim 45, wherein: the
partitions are physical partitions and the partitionable symmetric
multiprocessors are physically-partitionable symmetric
multiprocessors.
53. The computer-readable memory of claim 45, wherein: each of the
partitions couples to a switch fabric via a switch fabric
interface.
54. The computer-readable memory of claim 53, wherein: each of the
partitions is designed to independently couple to the switch
fabric.
55. The computer-readable memory of claim 45, wherein: each of the
partitionable symmetric multiprocessors is partitionable into a
plurality of configurations, including a first configuration having
a first number of partitions, and a second configuration having a
second number of partitions.
56. The computer-readable memory of claim 55, wherein: the
configurations include a third configuration having a single
partition.
57. The computer-readable memory of claim 45, wherein: the template
further includes hardware configuration information.
58. The computer-readable memory of claim 45, wherein: the template
further includes a software manifest of required software
primitives.
59. The computer-readable memory of claim 58, wherein: the template
further includes software configuration information.
60. The computer-readable memory of claim 45, wherein: the required
hardware primitives include a description of required compute
primitives.
61. The computer-readable memory of claim 60, wherein: the required
compute primitives include required application compute
primitives.
62. The computer-readable memory of claim 61, wherein: the required
number of partitions corresponds to a subset of the required
application compute primitives.
63. The computer-readable memory of claim 60, wherein: the required
compute primitives include required service compute primitives.
64. The computer-readable memory of claim 45, further including:
altering the template.
65. The computer-readable memory of claim 64, wherein: the act of
altering is at least in part via a command line interface.
66. The computer-readable memory of claim 64, wherein: the act of
altering is at least in part via a graphical user interface.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to the following
applications: U.S. patent application Ser. No. 12/728,250 (Docket
No. F7-2004-17-1, filed Mar. 21, 2010), now U.S. Pat. No.
7,979,552, and entitled PROGRAMMATIC INSTANTIATION, PROVISIONING,
AND MANAGEMENT OF FABRIC-BACKPLANE ENTERPRISE SERVERS, U.S. patent
application Ser. No. 11/057,048 (Docket No. F7.2004.17, filed Feb.
12, 2005), now U.S. Pat. No. 7,685,281, and entitled PROGRAMMATIC
INSTANTIATION, PROVISIONING, AND MANAGEMENT OF FABRIC-BACKPLANE
ENTERPRISE SERVERS, U.S. Provisional Patent Application Ser. No.
60/544,916 (Docket No. F7.2003.01, filed Feb. 13, 2004) entitled
DYNAMICALLY PROVISIONABLE DISTRIBUTED VIRTUAL SERVERS, U.S.
Provisional Patent Application Ser. No. 60/563,222 (Docket No.
F7.2004.02, filed Apr. 17, 2004) entitled DYNAMICALLY PROVISIONABLE
DISTRIBUTED VIRTUAL SERVERS, U.S. Provisional Patent Application
Ser. No. 60/603,250 (Docket No. F7.2004.07, filed Aug. 20, 2004)
entitled INSTANTIATION AND MANAGEMENT OF FABRIC-BACKPLANE
ENTERPRISE SERVERS, U.S. Provisional Patent Application Ser. No.
60/650,707 (Docket No. F7.2004.20, filed Feb. 7, 2005) entitled
CLASS INSTANCE RELATIONSHIP QUERY WITH OPTIONAL CONSTRAINTS, U.S.
Provisional Patent Application Ser. No. 60/651,027 (Docket No.
F7.2004.21, filed Feb. 7, 2005) entitled DYNAMIC USER INTERFACE
SPECIFICATION FOR A PREDETERMINED INTERFACE ENGINE, and U.S.
Provisional Patent Application Ser. No. 60/651,026 (Docket No.
F7.2004.22, filed Feb. 7, 2005) entitled DURABLE QUERY REGISTRATION
WITH REAL TIME RESPONSE; this application is also a
continuation-in-part of the following applications: U.S. patent
application Ser. No. 10/889,468 (Docket No. F7.2004.03, filed Jul.
12, 2004) entitled FABRIC-BACKPLANE ENTERPRISE SERVERS, U.S. patent
application Ser. No. 10/889,467 (Docket No. F7.2004.04, filed Jul.
12, 2004) entitled CLIENT-SERVER AND APPLICATION MODELS IN
FABRIC-BACKPLANE ENTERPRISE SERVERS; and U.S. patent application
Ser. No. 10/889,469 (Docket No. F7.2004.05, filed Jul. 12, 2004)
entitled COMPUTE RESOURCES FOR FABRIC-BACKPLANE ENTERPRISE SERVERS;
and wherein all of the foregoing applications are herein
incorporated by reference for all purposes.
[0002] This application is related to the following applications
filed Feb. 12, 2005: U.S. patent application Ser. No. 11/057,116
(Docket No. F7.2004.06) entitled FABRIC-BACKPLANE ENTERPRISE
SERVERS WITH VNICS AND VLANS, U.S. patent application Ser. No.
11/057,120 (Docket No. F7.2004.17B) entitled PROVISIONING AND
MANAGEMENT OF FABRIC-BACKPLANE ENTERPRISE SERVERS, U.S. patent
application Ser. No. 11/057,113 (Docket No. F7.2004.08) entitled
REAL TIME PROVISIONING AND MANAGEMENT OF FABRIC-BACKPLANE
ENTERPRISE SERVERS, U.S. patent application Ser. No. 11/057,038
(Docket No. F7.2004.08B) entitled REAL TIME PROVISIONING AND
MANAGEMENT OF FABRIC-BACKPLANE ENTERPRISE SERVERS, U.S. patent
application Ser. No. 11/057,121 (Docket No. F7.2004.08C) entitled
REAL TIME PROVISIONING AND MANAGEMENT OF FABRIC-BACKPLANE
ENTERPRISE SERVERS, U.S. patent application Ser. No. 11/057,037
(Docket No. F7.2004.09) entitled FABRIC-BACKPLANE ENTERPRISE
SERVERS WITH VNICS AND VLANS, U.S. patent application Ser. No.
11/057,036 (Docket No. F7.2004.09B) entitled FABRIC-BACKPLANE
ENTERPRISE SERVERS WITH VNICS AND VLANS, U.S. patent application
Ser. No. 11/057,119 (Docket No. F7.2004.09C) entitled
FABRIC-BACKPLANE ENTERPRISE SERVERS WITH VNICS AND VLANS, U.S.
patent application Ser. No. 11/057,112 (Docket No. F7.2004.09D)
entitled FABRIC-BACKPLANE ENTERPRISE SERVERS WITH VNICS AND VLANS,
U.S. patent application Ser. No. 11/057,046 (Docket No. F7.2004.10)
entitled LINK AGGREGATION GROUPS IN FABRIC-BACKPLANE ENTERPRISE
SERVERS, U.S. patent application Ser. No. 11/057,114 (Docket No.
F7.2004.11) entitled RAPID IP/MAC ADDRESS TAKEOVER VIA FORWARDING
TABLE UPDATES, U.S. patent application Ser. No. 11/057,034 (Docket
No. F7.2004.11B) entitled RAPID IP/MAC ADDRESS TAKEOVER VIA
FORWARDING TABLE UPDATES, U.S. patent application Ser. No.
11/057,035 (Docket No. F7.2004.13) entitled MULTI-CHASSIS
FABRIC-BACKPLANE ENTERPRISE SERVERS, and U.S. patent application
Ser. No. 11/057,117 (Docket No. F7.2004.13B) entitled MULTI-CHASSIS
FABRIC-BACKPLANE ENTERPRISE SERVERS; and wherein all of the
foregoing simultaneously filed applications are herein incorporated
by reference for all purposes.
FIELD OF THE INVENTION
[0003] The present invention relates generally to interprocess and
inter-module communications in servers and server clusters. More
specifically, it relates to the organization, provisioning,
management, and interoperation of compute, storage, and network
resources to enhance datacenter availability, efficiency, and
utilization.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] Various embodiments of the invention are disclosed in the
following detailed description and the accompanying drawings.
[0005] FIG. 1A illustrates a conceptual representation of an
embodiment of an ES system.
[0006] FIG. 1B illustrates a conceptual representation of selected
details of data transfer in an embodiment of an ES system.
[0007] FIG. 2 illustrates various example embodiments of packet and
process data flow in an ES embodiment.
[0008] FIG. 3A illustrates selected aspects of an embodiment of
packet transmission and reception in an ES embodiment.
[0009] FIG. 3B illustrates selected aspects of an embodiment of
packet and cell prioritized transmission in an ES embodiment.
[0010] FIG. 4A illustrates selected aspects of an embodiment of a
System Intelligence Module (SIM) configured as a pluggable module
including a System Control Module (SCM) and an associated Switch
Fabric Module (SFM).
[0011] FIG. 4B illustrates selected aspects of an embodiment of a
Processor Memory Module (PMM) configured as a pluggable module.
[0012] FIG. 4C illustrates selected aspects of an embodiment of a
Network Module (NM) configured as a pluggable module.
[0013] FIG. 4D illustrates selected aspects of an embodiment of a
Fibre Channel Module (FCM) configured as a pluggable module.
[0014] FIG. 4E illustrates selected aspects of an embodiment of an
OffLoad Board (OLB) configured as a pluggable module.
[0015] FIG. 5A illustrates selected aspects of embodiments of
SoftWare (SW) layers for executing on application processor
resources in an ES embodiment.
[0016] FIG. 5B illustrates selected aspects of embodiments of SW
layers for executing on management processor resources in an ES
embodiment.
[0017] FIG. 5C illustrates selected aspects of embodiments of SW
layers for executing on module-level configuration and management
processor resources in an ES embodiment.
[0018] FIG. 6A illustrates selected aspects of a logical view of an
embodiment of a plurality of virtual Network Interface Controllers
(VNICs), also known as virtualized Network Interface Cards.
[0019] FIG. 6B illustrates selected aspects of a logical view of an
embodiment of VNIC transmit queue organization and
prioritization.
[0020] FIG. 6C illustrates selected aspects of a logical view of an
embodiment of transmit output queue organization and
prioritization.
[0021] FIG. 6D illustrates selected aspects of a logical view of an
embodiment of receive input queue organization and
prioritization.
[0022] FIG. 6E illustrates selected aspects of a logical view of an
embodiment of VNIC receive queue organization and
prioritization.
[0023] FIG. 7A illustrates selected aspects of an embodiment of a
Virtual Input Output Controller (VIOC).
[0024] FIG. 7B illustrates selected aspects of egress operation of
an embodiment of a Virtual Input Output Controller (VIOC).
[0025] FIG. 7C illustrates selected aspects of ingress operation of
an embodiment of a VIOC.
[0026] FIG. 8A illustrates selected aspects of an embodiment of an
egress lookup key and result entries.
[0027] FIG. 8B illustrates selected aspects of an embodiment of an
ingress lookup key and entry.
[0028] FIGS. 9A and 9B illustrate a Hardware Resources view and a
Provisioned Servers and Switch view of an embodiment of an ES
system, respectively.
[0029] FIG. 9C illustrates an operational view of selected aspects
of provisioning and management SW in an ES embodiment.
[0030] FIG. 10 illustrates a conceptual view of an embodiment of a
Server Configuration File (SCF) and related SCF tasks.
[0031] FIG. 11 illustrates selected aspects of an embodiment of
server operational states and associated transitions.
[0032] FIGS. 12A and 12B are flow diagrams illustrating selected
operational aspects of real time server provisioning and management
in an ES embodiment.
[0033] FIG. 13A is a state diagram illustrating processing of
selected Baseboard Management Controller (BMC) related commands in
an ES embodiment.
[0034] FIG. 13B illustrates selected operational aspects of single
and dual PMM low-level hardware boot processing in an ES
embodiment.
[0035] FIG. 14 illustrates a conceptual view of selected aspects of
embodiments of Internet Protocol (IP) and Media Access Control
(MAC) address failover data structures and associated
operations.
[0036] FIG. 15 illustrates a flow diagram of an embodiment of rapid
IP address takeover.
[0037] FIG. 16 illustrates an embodiment of a multi-chassis
fabric-backplane ES system.
[0038] FIG. 17 illustrates an embodiment of two variations of
multi-chassis provisioning and management operations.
DETAILED DESCRIPTION
[0039] The invention can be implemented in numerous ways, including
as a process, an apparatus, a system, a composition of matter, a
computer readable medium such as a computer readable storage medium
or a computer network wherein program instructions are sent over
optical or electronic communication links. In this specification,
these implementations, or any other form that the invention may
take, may be referred to as techniques. In general, the order of
the steps of disclosed processes may be altered within the scope of
the invention.
[0040] A detailed description of one or more embodiments of the
invention is provided below along with accompanying figures that
illustrate the principles of the invention. The invention is
described in connection with such embodiments, but the invention is
not limited to any embodiment. The scope of the invention is
limited only by the claims and the invention encompasses numerous
alternatives, modifications and equivalents. Numerous specific
details are set forth in the following description in order to
provide a thorough understanding of the invention. These details
are provided for the purpose of example and the invention may be
practiced according to the claims without some or all of these
specific details. For the purpose of clarity, technical material
that is known in the technical fields related to the invention has
not been described in detail so that the invention is not
unnecessarily obscured.
INTRODUCTION
[0041] This introduction is included only to facilitate the more
rapid understanding of the Detailed Description. The invention is
not limited to the concepts presented in the introduction, as the
paragraphs of any introduction are necessarily an abridged view of
the entire subject and are not meant to be an exhaustive or
restrictive description. For example, the introduction that follows
provides overview information limited by space and organization to
only certain embodiments. There are in fact many other embodiments,
including those to which claims will ultimately be drawn, which are
discussed throughout the balance of the specification. As is
discussed in more detail in the Conclusions, the invention
encompasses all possible modifications and variations within the
scope of the issued claims, which are appended to the very end of
the issued patent.
[0042] A hybrid server/multi-layer switch system architecture,
referred to hereinafter as the Enterprise Fabric (EF) architecture,
forms the basis for a number of Enterprise Server (ES) chassis
embodiments. Each ES embodiment includes one or more
Processor Memory Modules (PMMs, each having a symmetric
multiprocessor complex), one or more Network Modules (NMs), and a System
Control Module (SCM). The SCM includes a cellified switching-fabric
core (SF) and a System Intelligence Module (SIM).
[0043] Each PMM has one or more resident Virtual IO Controller
(VIOC) adapters. Each VIOC is a specialized Input/Output (I/O)
controller that includes embedded layer-2 forwarding and filtering
functions and tightly couples the PMM to the SF. Thus the layer-2
switch functionality within the ES chassis is distributed over all
of the SCM, NM, and PMM modules. Through the use of VIOC/VNIC
device drivers, host operating system software (Host O/S) running
on the PMMs is presented with a plurality of Virtual Network
Interface Cards (VNICs). In some embodiments, each VNIC behaves as
a high-performance Ethernet interface at the full disposal of the
Host O/S. In other embodiments, at least some of the VNICs behave
as high-performance Fibre Channel Host Bus Adapters.
[0044] The SIM is responsible for provisioning and overall system
management. Via system control and management processes running on
the SIM, the server and switch functionality of the ES chassis are
provisioned via configuration files in accordance with respective
requirements specified by server and network administrators.
Configurable parameters for each server include the number of
processors, memory, the number of VNICs, and VNIC bandwidth.
Configurable parameters for the network include Virtual LAN (VLAN)
assignments for both Network Module ports and VNICs and Link
Aggregation Group (LAG) definitions.
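As an illustration of the configurable parameters listed above, the following minimal Python sketch models one server's settings and the network's settings as plain data records, with a few consistency checks. All class names, field names, and checks are hypothetical and chosen only for this example; the passage above does not specify the actual configuration file format. The VLAN ID range of 1 to 4094 follows IEEE 802.1Q.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VnicConfig:
    """Hypothetical per-VNIC settings."""
    name: str
    bandwidth_mbps: int  # provisioned VNIC bandwidth
    vlan_id: int         # VLAN assignment for this VNIC


@dataclass
class ServerConfig:
    """Hypothetical per-server parameters named in the text."""
    name: str
    processors: int
    memory_gb: int
    vnics: List[VnicConfig] = field(default_factory=list)


@dataclass
class NetworkConfig:
    """Hypothetical network parameters: port VLANs and LAG definitions."""
    port_vlans: Dict[str, int] = field(default_factory=dict)  # NM port -> VLAN ID
    lags: Dict[str, List[str]] = field(default_factory=dict)  # LAG name -> member ports


def check_vlan(vlan_id: int) -> None:
    # IEEE 802.1Q reserves VLAN IDs 0 and 4095, leaving 1..4094 usable.
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID {vlan_id} is outside 1..4094")


def validate(server: ServerConfig, network: NetworkConfig) -> None:
    """Reject obviously inconsistent settings (checks chosen for illustration)."""
    if server.processors < 1 or server.memory_gb < 1:
        raise ValueError("a server needs at least one processor and some memory")
    for vnic in server.vnics:
        if vnic.bandwidth_mbps <= 0:
            raise ValueError(f"{vnic.name}: bandwidth must be positive")
        check_vlan(vnic.vlan_id)
    for vlan_id in network.port_vlans.values():
        check_vlan(vlan_id)
    for lag, members in network.lags.items():
        # Assumption for this sketch: a LAG may only group ports that are defined.
        unknown = [p for p in members if p not in network.port_vlans]
        if unknown:
            raise ValueError(f"{lag}: unknown member ports {unknown}")
```

A caller would build one `ServerConfig` per server and a single `NetworkConfig`, then call `validate` before handing the settings to whatever provisioning step consumes them.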
[0045] An ES system may be operated as one or more provisioned
servers, each of the provisioned servers including capabilities as
identified by a corresponding set of specifications and attributes,
according to various embodiments. Typically the specifications (or
constraints) and attributes are specified with a Server
Configuration File. An ES system may be provisioned into any
combination and number of servers according to needed processing
and I/O capabilities. Each of these servers may include distinct
compute, storage, and networking performance. Provisioned servers
may be managed similarly to conventional servers, including
operations such as booting and shutting down.
[0046] One or more VNICs provide for communication among modules of
Enterprise Server (ES) embodiments via a switch fabric dataplane.
Processes executing on compute complexes of the servers exchange
data as packets or messages by interfaces made available through
VNICs. The VNICs further provide for transparent communication with
network and storage interfaces. VNIC provisioning capabilities
include programmable bandwidth, priority scheme selection, and
detailed priority control (such as round-robin weights). In some
embodiments, VNICs are implemented in Virtual Input/Output
Controllers (VIOCs). In another aspect, Virtual Local Area Networks
(VLANs) enable access to Layer-2 (L2) and selected Layer-3 (L3)
network functions while exchanging the packets and messages. VLAN
identification is provided in each VNIC, and VLAN processing is
partially performed in VIOCs implementing VNICs.
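The paragraph above names round-robin weights as one form of detailed priority control. The following minimal Python sketch shows a generic weighted round-robin over per-queue packet lists. It is an illustration of the general technique only, not a description of how a VIOC schedules traffic; the function name, queue names, and packet labels are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def weighted_round_robin(queues: Dict[str, Deque[str]],
                         weights: Dict[str, int]) -> List[str]:
    """Drain all queues, serving up to weights[name] packets per queue per round."""
    if any(weights[name] < 1 for name in queues):
        raise ValueError("every queue needs a weight of at least 1")
    order: List[str] = []
    # An empty deque is falsy, so the loop ends once every queue is drained.
    while any(queues.values()):
        for name, queue in queues.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                order.append(queue.popleft())
    return order


if __name__ == "__main__":
    queues = {
        "high": deque(["h1", "h2", "h3"]),
        "low": deque(["l1", "l2"]),
    }
    # With weights 2:1 the high queue sends two packets for each low packet.
    print(weighted_round_robin(queues, {"high": 2, "low": 1}))
    # prints ['h1', 'h2', 'l1', 'h3', 'l2']
```

Raising a queue's weight gives it a larger share of each round without starving the other queues, which is the property that makes weights a natural tuning knob alongside bandwidth limits.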
[0047] The compute complexes and interfaces are typically
configured as pluggable modules inserted into a backplane included
in a chassis. The switch fabric dataplane (sometimes simply
referred to as "a dataplane") is accessible via the backplane
(serving as a replacement for a conventional backplane bus), and
hence ES embodiments are known as "fabric-backplane" enterprise
servers. Various ES embodiments are comprised of varying numbers
and arrangements of modules.
[0048] The EF architecture provides for provisioning virtual
servers (also known as server-instances) with included virtual
networks from underlying ES hardware and software resources. The EF
architecture is applicable to application scenarios requiring
dynamic combinations of compute, network, and storage performance
and capabilities, and is a unifying solution for applications
requiring a combination of computation and networking performance.
Resources may be pooled, scaled, and reclaimed dynamically for new
purposes as requirements change, using dynamic reconfiguration of
virtual computing and communication hardware and software. This
approach offers the advantages of reduced cost, as provisioning is
"just-right" rather than over-provisioned. In addition, dynamic
configuration allows for quick performance or scale
modifications.
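The pooling and reclaiming of resources described above can be sketched as a small bookkeeping class. This is a minimal illustration under stated assumptions, not the provisioning software of any embodiment: the class name `ResourcePool`, its methods, and the resource kinds (`"cpus"`, `"vnics"`) are all hypothetical.

```python
from typing import Dict


class ResourcePool:
    """Tracks free resources and what each provisioned server holds."""

    def __init__(self, capacity: Dict[str, int]) -> None:
        self._free = dict(capacity)
        self._allocated: Dict[str, Dict[str, int]] = {}

    def provision(self, server: str, needs: Dict[str, int]) -> None:
        """Allocate everything in `needs`, or nothing if any kind is short."""
        if server in self._allocated:
            raise ValueError(f"{server} is already provisioned")
        short = {k: n for k, n in needs.items() if self._free.get(k, 0) < n}
        if short:
            raise RuntimeError(f"insufficient free resources for {server}: {short}")
        for kind, count in needs.items():
            self._free[kind] -= count
        self._allocated[server] = dict(needs)

    def reclaim(self, server: str) -> None:
        """Return a server's resources to the free pool for new purposes."""
        for kind, count in self._allocated.pop(server).items():
            self._free[kind] += count

    def free(self) -> Dict[str, int]:
        return dict(self._free)


if __name__ == "__main__":
    pool = ResourcePool({"cpus": 8, "vnics": 16})
    pool.provision("web", {"cpus": 4, "vnics": 2})
    pool.provision("db", {"cpus": 4, "vnics": 4})
    pool.reclaim("web")                      # requirements changed
    pool.provision("batch", {"cpus": 2, "vnics": 1})
    print(pool.free())                       # prints {'cpus': 2, 'vnics': 11}
```

Checking every requested kind before subtracting anything keeps a failed request from leaving the pool partially allocated, which matters when servers are created and torn down repeatedly.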
[0049] The EF architecture provides a radically different
underlying server architecture compared to traditional multi-way
Symmetric MultiProcessor (SMP) servers, including integrated fabric
interconnectivity to enable high-bandwidth, low-latency I/O
operation. Processing and I/O throughput are virtualized, providing
scalable, coordinated resources. Partitioning and fail-over are
hardware supported, including mechanisms for treating multiple
virtual servers as a single managed entity, resulting in new high
availability clustering and multi-site fail-over capabilities.
Compared to a current network switch, networking capabilities are
extended, including efficient Remote Direct Memory Access (RDMA)
and flexible resource pooling. Virtualized fabric services, such as
Server Load Balancing (SLB), Secure Sockets Layer (SSL) protocols
including Transport Layer Security (TLS) variants, eXtensible
Markup Language (XML), and so forth, are also provided.
[0050] In some embodiments, a data center or other installation
implemented in accordance with the EF architecture will include one
or more ES chassis. In a first embodiment, the ES chassis
capabilities include an 8-way SMP partition-configurable compute
complex. These compute resources include a plurality of 64-bit x86
processing elements. The ES chassis hardware configuration is
compatible with execution of software operating systems such as
Linux and Microsoft Windows. Processing elements in the ES chassis
are coupled to a low-latency high-bandwidth interconnect fabric via
virtualized I/O functions, providing for efficient communication
between processing elements and with network and fibre channel
interfaces coupled to the fabric. The virtualized I/O functions are
distributed throughout the plurality of processing elements. Among
the virtualized I/O functions, the ES chassis includes VNICs and
virtualized Host Bus Adapters (VHBAs). Via these VNICs and VHBAs,
the processing elements can selectively communicate with external
networks coupled to any of several high-performance network
interfaces (up to three 10 Gb Ethernet interfaces, or thirty 1 Gb
Ethernet interfaces, in the first embodiment) and with several
high-performance 2 Gb Fibre Channel interfaces (up to eight in the
first embodiment). Each VNIC/VHBA can be individually configured
such that it appears to be coupled to a multi-port switch coupled
to others of the VNICs/VHBAs and to the network/storage interfaces.
Alternatively, each VNIC/VHBA can be configured such that it
appears to be directly coupled to one of the network/storage
interfaces. Additional processing capabilities may be provided in
the chassis in the form of offload cards (or pluggable boards or
modules) supporting virtualized services, such as SLB, SSL, and XML
processing.
[0051] The ES chassis is further configured with capabilities to
provide for a high availability system, including modular
components, hot-swap of components, and fully redundant components.
Other high availability capabilities include multi-site fail-over
and mainframe class Reliability, Availability, and Serviceability
(RAS) features.
[0052] An ES system may be operated as one or more provisioned
servers, each of the provisioned servers including capabilities as
identified by a corresponding set of specifications and attributes,
according to various embodiments. Typically the specifications (or
constraints) and attributes are specified with an SCF (see the SCF
and Related Tasks section, elsewhere herein). An ES system may be
provisioned into any combination and number of servers according to
needed processing and I/O capabilities. Each of these servers may
include distinct compute, storage, and networking performance.
Provisioned servers may be managed similarly to conventional servers,
including operations such as boot and shutting down (see the Server
Operational States section, elsewhere herein).
[0053] The EF architecture further includes a Graphical User
Interface (GUI) for configuration management. The GUI may be
provided via a web browser, a network-based Java client, or some
other related mechanism, according to various embodiments. The GUI
provides role-based access and division of functions, and may be
used as a single point of management for all EF system functions.
System management personnel may use the GUI to control EF
virtualized configuration and provisioning settings. Resource
pooling and allocation of Central Processing Unit (CPU) and I/O
capabilities may be dynamically altered without requiring physical
changes or re-cabling. Network and storage capabilities may be
similarly dynamically modified, including Network Interface
Controller (NIC), Host Bus Adaptor (HBA), and bandwidth resources.
Redundancy, fail-over and other RAS capabilities may also be
configured via the GUI, including specific multi-site configuration
information. Various embodiments may also include a Command Line
Interface (CLI) with functions and capabilities similar to the
GUI.
[0054] The GUI further provides functions for monitoring various
aspects of the hardware and software performance and behavior of
systems including each ES chassis. The monitoring functions are
available for inspection of operations at several levels in the
system, from top-level application performance to low-level network
interconnect metrics.
[0055] The GUI provides hooks for integration of the functions
provided therein into higher-level application software and
standard applications, allowing for flexibility in specifying and
monitoring the EF system configuration.
[0056] EF configuration management and monitoring may also be
performed via other mechanisms. Alternate mechanisms include one or
more command line interfaces, a scripting interface, and remote
network-centric interfaces using standard capabilities provided for
in Simple Network Management Protocol (SNMP) and Remote MONitoring
(RMON).
[0057] Systems including EF capabilities may also provide for
upgrades to installed software, including operating system
software, application software, driver-level software, and firmware
software. The upgrades may include updates to address security
issues, to enable new or expanded functionality, or to repair
incorrect operation (a "bug fix"). A variety of sources may provide
upgrades, including EF vendors, or vendors of software installed or
used in EF-based systems. Additionally, Independent Software Vendor
(ISV) certifications of software associated with EF-based systems
may also be performed.
[0058] Illustrative application usage scenarios include a first
usage scenario including a first configuration adapted to replace a
server (having an Operating System selected from a list including
but not limited to Unix, Linux, Windows, etc.) or a collection of
such servers. The first configuration provides for virtualization
of data center capabilities, resource pooling, and consolidation of
functions otherwise performed in a plurality of heterogeneous
devices. Computing, networking, and services are completely
virtualized, enabling dynamic deployment, scaling, and reclamation
according to changing application requirements. Significant savings
in capital and operating expense result.
[0059] A second usage scenario includes a second configuration
adapted for I/O intensive applications. The second configuration
provides high-bandwidth and low-latency storage and networking
capabilities, enabling new classes of applications using fewer
infrastructure components than currently possible. The
high-bandwidth and low-latency capabilities are enabled in part by
use of a high-bandwidth, low-latency fabric. Efficient
intra-chassis communication is provided for in a transparent
manner, enabling increased I/O bandwidth and reduced latency
compared to existing solutions.
[0060] A third usage scenario includes a third configuration
adapted for consolidating tiers in a data center application. The
third configuration provides for collapsing the physical divisions
in present 3-tier data centers, enabling solutions with fewer
servers, a smaller number of network switches, and reduced needs
for specialized appliances. The concepts taught herein provide for
completely virtualized computing, networking, and services, in
contrast to existing solutions addressing tiered data systems.
Dynamic configuration enables pooling of resources and on-the-fly
deploying, scaling, and reclaiming of resources according to
application requirements, allowing for reduced infrastructure
requirements and costs compared to existing solutions.
[0061] A fourth usage scenario includes a fourth configuration
adapted for enhanced high availability, or RAS functionality,
including multi-site fail-over capabilities. The fourth
configuration provides for new redundancy and related
architectures. These new architectures reduce set-up and
configuration time (and cost), and also decrease on-going operating
expenses. Modular components of the ES chassis are hot-swap
compatible and all EF systems are configured with fully redundant
components, providing for mainframe-class RAS functionality.
Reduced networking latency capabilities enable enhanced multi-site
fail-over operation.
[0062] The concepts taught herein thus enable simple, fast,
straightforward, low-effort deployment of systems configured and
provisioned for efficient execution of heterogeneous mixes of
applications requiring varying computing, networking, and I/O
processing resources. Application fail-over architectures are also
readily enabled with little or no additional hardware, according to
further dynamic partitioning features. Accordingly, Information
Technology (IT) operations are reduced in complexity and cost.
[0063] The concepts taught herein consolidate multiple devices and
tiers in data center operations, requiring fewer servers (in type
and quantity), reduced supporting hardware, and smaller
infrastructure outlays compared to systems of the current art.
Significant reductions in the total cost of ownership are thus
provided for by the concepts taught herein.
[0064] The concepts taught herein ensure highly reliable and
available compute, network, storage, and application resources
while also dramatically improving storage and networking
performance and reliability. True multi-site fail-over and disaster
recovery are possible by use of the concepts taught herein,
enabling new classes of I/O and high availability applications.
Enterprise Server System
[0065] FIG. 1A illustrates System 100A, a conceptual representation
of an embodiment of an ES system. The system includes a particular
ES chassis embodiment, ES1 110A, which is coupled to various other
systems, including Fibre Channel Storage Network 106, Generic
Packet Network 107, and Ethernet Storage Network 108. Fibre Channel
Storage Network 106 provides mass storage via a collection of disks
organized, for example, as a Storage Area Network (SAN). Generic
Packet Network 107 conceptually includes arbitrary combinations of
Local Area Network (LAN), Metro Area Network (MAN), and Wide Area
Network (WAN) networks and typically includes Ethernet and Ethernet
derivative links for coupling to Internet 101, an arbitrary number
and arrangement of Client machines or servers, represented as
Client 102 and Client 103, as well as an arbitrary number and
arrangement of Personal Computers (PCs) or Workstations,
represented as PC 104 and PC 105. Ethernet Storage Network 108
provides mass storage via a collection of disks organized in a
Network Attached Storage (NAS) or Small Computer System Interface
over Transmission Control Protocol/Internet Protocol (iSCSI)
fashion.
[0066] ES1 110A includes a central I/O SFM (SFM 180) providing a
switch fabric dataplane coupling for FCMs 120, NMs 130, SCMs 140,
PMMs 150 (also known as Application Processor Modules), and OLBs
160, also known as Offload Modules (OLMs) or AppLication Modules
(ALMs). Cell-based communication is carried out over the switch
fabric dataplane between the aforementioned elements coupled to it.
In other words, the fabric functions as a cell-based switch. FCMs
120 include Fibre Channel Interfaces (FCIs) for coupling to Fibre
Channel standard storage devices and networks (such as SANs). NMs
130 include interfaces to standard network infrastructures. PMMs
150 include compute elements for execution of Application, Driver,
and Operating System (OS) processes, via SMP clusters illustrated
conceptually as SMP 151. A configurable Coherency Switch Fabric and
Interconnect (CSFI 170) is included for partitioning or combining
the CPU and Randomly Accessible read/write Memory (RAM) resources
of PMMs 150. OLBs 160 include compute elements for execution of
service processes, via various service acceleration modules.
Service acceleration modules include SLB accelerator 161, SSL
accelerator 162, and XML accelerator 163. SCMs 140 include compute
elements for providing system management, controlplane (L2/L3
bridging and routing, for example), and load balancing processing
for SFM 180 and the elements coupled to it. PMMs 150 also include
FCIs for coupling to mass storage systems, such as Optional Local
Disks 111-112, or SAN systems including mass storage.
[0067] In operation, Application, Driver, and OS processes are
executed on PMMs 150 via CPU and RAM elements included in SMP 151.
At least some of the data consumed and produced by the processes is
exchanged in packets formatted as cells for communication on SFM
180. The data may include network data exchanged with Generic
Packet Network 107 via NMs 130, and storage data exchanged with
Ethernet Storage Network 108 via NMs 130 or Fibre Channel Storage
Network 106 via FCMs 120. The data may also include service data
exchanged with OLBs 160 and SCMs 140, and other Application,
Driver, or OS data exchanged with other elements of PMMs 150.
[0068] Data communicated on SFM 180 is not limited to data
exchanged with PMMs 150, but may also include data communicated
between any of the modules (or fabric clients) coupled to the
fabric. For example, one NM may forward packets to itself or to
another NM via the fabric. An NM may also exchange packets with an
OLB for processing via the fabric. SCMs 140 may also exchange
configuration and forwarding update information with VIOCs via VIOC
Control Protocol (VIOC-CP) packets via the fabric. In some
embodiments, SCMs 140 may also exchange selected system management,
controlplane, and load balancing information with all modules
coupled to the fabric via in-band packets communicated on the
fabric.
[0069] In one embodiment, in conjunction with the VNIC
functionality mentioned previously, a modified Ethernet Driver
provides the illusion of local NIC functionality to Application,
Driver, and OS processes locally executing on any of SCMs 140, PMMs
150, and OLBs 160. The NIC functionality can be configured to
either appear to be coupled to a switch coupled to other NICs or
appear to be coupled directly to one of the networking interfaces
included on NMs 130. In another embodiment, this technique may be
used to access networked storage devices (i.e., NAS subsystems) via
the NMs 130.
[0070] As illustrated conceptually, SFM 180 includes a redundant
pair of fabrics, with one of the pair typically configured as a
Primary Fabric, while the other fabric is typically configured as a
Redundant Fabric. SCM-Fabric coupling 149 represents two fabric
dataplane couplings, a first Primary Coupling between a Primary SCM
of SCMs 140 and the Primary Fabric, and a Redundant Coupling
between a Redundant SCM of SCMs 140 and the Redundant Fabric. In
one mode of operation, all dataplane traffic is carried on the
Primary Fabric, managed by the Primary SCM, while the Redundant
Fabric and the Redundant SCM are maintained in a hot-standby
mode.
[0071] Further dataplane couplings to SFM 180 are illustrated
conceptually as FCM-Fabric coupling 129, NM-Fabric coupling 139,
PMM-Fabric coupling 159, and OLB-Fabric coupling 169. In FIG. 1A,
each coupling is abstractly portrayed as a single line between each
group of modules and the switch fabric. It will be understood that
for the FCM, NM, and OLB modules, each module instance has a
Primary Fabric coupling and a Redundant Fabric coupling. For the
PMM, each PMM instance has two Primary Fabric couplings and two
Redundant Fabric couplings.
[0072] All of the modules coupled to SFM 180 (FCMs 120, NMs 130,
SCMs 140, PMMs 150, and OLBs 160) include fabric interface
communication units for exchanging data as cells on the fabric. This
data exchange is described in more detail elsewhere herein.
[0073] In an illustrative embodiment, the components of ES1 110A
are included on a plurality of pluggable modules adapted for
insertion into and removal from a backplane while the server is
powered-up and operational (although software intervention to
cleanly shut down or start up various processes or functions may be
required). The backplane forms portions of FCM-Fabric coupling 129,
NM-Fabric coupling 139, CSFI-PMM coupling 179, PMM-Fabric coupling
159, and OLB-Fabric coupling 169. The Primary Fabric of SFM 180 and
the associated Primary SCM of SCMs 140 are included on pluggable
module Primary SIM as illustrated by SIMs 190. Similarly, the
Redundant Fabric of SFM 180 and the associated Redundant SCM of
SCMs 140 are included on pluggable module Redundant SIM of SIMs
190. All of the modules of FCMs 120, NMs 130, PMMs 150, and OLBs
160 are also configured as pluggable modules adapted for operation
with the backplane.
[0074] Each PMM of PMMs 150 is physically-partitionable, i.e.
configurable into one or more physical partitions. The physical
partitioning of PMMs 150 and related modes of CSFI 170 are
configured under program control. For example, PMMs 150 may be
configured as a single SMP complex in conjunction with CSFI 170.
The result is a first example of a physical partition. In a second
example, each PMM of PMMs 150 may instead be configured
individually as an independent SMP complex, resulting in a
plurality of physical partitions, one for each PMM. In a third
example, each PMM of PMMs 150 may instead be configured as a pair
of SMP complexes, resulting in two physical partitions per PMM. In
various embodiments, CSFI 170 may be implemented as any combination
of simple interconnect, coherency logic, and switching logic,
operating in conjunction with any combination of interconnect and
logic included on PMMs 150. Some of these embodiments are discussed
in more detail later herein.
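The three partitioning examples above can be sketched as a small
enumeration. This is an illustrative sketch only: the mode names
("combined", "per_pmm", "split"), the function name
physical_partitions, and the "/0" and "/1" portion labels are
hypothetical and do not appear in the text, which states only that
partitioning is configured under program control.

```python
def physical_partitions(pmm_ids, mode):
    """Return the physical partitions for a set of PMMs.

    Each partition is a tuple naming the PMMs (or PMM portions) that
    together form one SMP complex. Mode names are hypothetical.
    """
    if mode == "combined":
        # First example: all PMMs joined via the CSFI into one SMP complex.
        return [tuple(pmm_ids)]
    if mode == "per_pmm":
        # Second example: each PMM is its own independent SMP complex.
        return [(pmm,) for pmm in pmm_ids]
    if mode == "split":
        # Third example: each PMM is divided into a pair of SMP complexes.
        parts = []
        for pmm in pmm_ids:
            parts.append((pmm + "/0",))
            parts.append((pmm + "/1",))
        return parts
    raise ValueError(f"unknown partitioning mode: {mode}")
```

For two PMMs, the "combined" mode yields one partition, "per_pmm"
yields two, and "split" yields four.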
[0075] ES1 110A, as illustrated, is representative of a number of
embodiments configured with various Modules to provide differing
amounts of storage and network interface capability (connectivity
and bandwidth), as well as differing levels of compute capability
(cycles and memory). Typically, each embodiment includes at least a
redundant pair of Switch Fabrics and associated System Intelligence
Modules (for communication between Modules), at least one Processor
Memory Module (for execution of Application, Driver, and OS
processes), and at least one Network Module (for communication with
external agents). Some embodiments may optionally further include
any combination of additional Modules to provide additional
interface and compute capability, up to the physical limits of the
particular implementation. For example, additional Network Modules
may be included in an embodiment to provide additional network
bandwidth or connectivity. One or more Fibre Channel Modules may be
included in an embodiment to provide additional storage bandwidth
or connectivity. Additional Processor Memory Modules may be
included to provide additional compute cycles or memory. One or
more Offload Modules may be included to provide additional service
compute cycles or memory, and these Offload Modules may each be
individually configured with any combination of SLB, SSL, and XML
accelerators.
[0076] In the embodiments represented by ES1 110A, communication
between the Modules via SFM 180 is independent of the manner and
arrangement of the Modules. All of the Modules communicate as peers
on SFM 180 and interface to the fabric in a similar manner.
[0077] System 100A is also representative of a variety of system
embodiments, for example, differing in the number, type, and
arrangement of storage and network systems coupled to ES1 110A. For
example, any combination of Optional Local Disks 111-112 may be
included. Generic Packet Network 107 may include any combination of
LAN, MAN, or WAN elements. FCMs 120 may be coupled to a single SAN,
or a plurality of SANs. NMs 130 may be coupled to a plurality of
networks or storage systems. Couplings between ES1 110A and other
systems are limited only by the number and type of interfaces and
physical couplings available according to implementation.
[0078] FIG. 1B illustrates System 100B, a conceptual representation
of selected details of data transfer in an embodiment of an ES
system. An ES chassis embodiment, illustrated as ES1 110B, is
coupled to Fibre Channel Storage Network 106 and Ethernet Network
107, as described elsewhere herein. In the illustrated embodiment,
one module of each type (FCM 120A, NM 130A, PMM 150A, and OLB 160A)
are coupled to Primary Switch Fabric Module 180A, via FCM-Fabric
coupling 129A, NM-Fabric coupling 139A, PMM-Fabric couplings
159A/159A', and OLB-Fabric coupling 169A, respectively. FCM 120A
provides interfaces for storage network couplings, including a
coupling for Fibre Channel Storage Network 106. NM 130A provides
interfaces for network couplings, including a coupling for Ethernet
Network 107, coupled in turn to Client 102. In this illustrative
scenario, PMM 150A is configured as a first and a second physical
partition. The first physical partition includes SMP Portion
P.sub.A 152A, having RAM 153A, and is coupled by PMM-Fabric
coupling 159A to the fabric dataplane. The second physical
partition includes SMP Portion P.sub.A' 152A', having RAM 153A',
and is coupled by PMM-Fabric coupling 159A' to the fabric
dataplane. Note: several elements have been omitted from the figure
for clarity, including the SCMs, the Redundant Fabric, the CSFI,
and optional SANs.
[0079] Several illustrative data transfer paths are shown,
including Client-Server Data Exchange 115, Storage-Server Data
Exchange 116, Client-Service Data Exchange 117, and Service Data
Exchange 118. Each path illustrates the movement of data between
two clients of the switch fabric. Consider now three embodiments,
described from the perspective of these clients. In one embodiment,
data is organized as packets transferred via a stateless
connection-free (and unreliable) protocol. In another embodiment,
data is organized as messages, and transferred via a
connection-oriented reliable message protocol. In yet another
embodiment, data is selectively organized as either packets or
messages.
[0080] The underlying fabric transport process is transparent to
the clients and external devices coupled to the system and to most
of the host software processes on the PMMs and OLBs. As described
in more detail elsewhere herein, to external network devices, each
port of each NM acts as a switch port of a virtualized
high-performance L2/L3 switch. As will be discussed in further
detail elsewhere herein, the switch has advanced VLAN and
classification functionalities. The VLAN functionality provides for
selectively coupling or isolating the network segments coupled to
each switch port. Each segment associated with an NM port may have
one or more external physical network devices as in any
conventional network segment. The classification functionality
provides for special forwarding treatments in accordance with a
variety of attributes of the Ethernet frames received from external
network devices on the ports of the NMs. Select software processes
running on the PMMs and OLBs are provided the illusion that they
are coupled via high-performance Ethernet network interfaces to
dedicated ports of the VLAN-enabled switch. Connectivity between
the different network segments may be flexibly and arbitrarily
configured through VLAN management.
[0081] In systems having FCMs, a virtualized fibre channel switch
is similarly presented to external fibre channel devices. Likewise,
certain software processes running on the PMMs are provided the
illusion they are coupled to the fibre channel switch via
high-performance fibre channel interfaces. Multiple VLANs and
multiple fibre channel networks can simultaneously co-exist on top
of the fabric transport infrastructure while being completely
logically separate and secure.
[0082] The underlying fabric transport process is discussed in more
detail elsewhere herein, but it will now be briefly overviewed. The
fabric supports the transparent transport of both packets and
messages. Here, "packets" refers to conventional Ethernet frames
sent via some connectionless protocol that does not have integral
support for reliable delivery. The term "messages" refers to one or
more data transfers of quasi-arbitrarily sized data blocks reliably
delivered over a logical connection established between end-points.
Packets are transported over the fabric using "fabric packets,"
while messages are transported over the fabric using "fabric
messages." Both fabric packets and fabric messages make use of
highly similar fabric frames. A fabric packet is comprised of a
single fabric frame, sent over the fabric without any connection or
reliable delivery support. In contrast, fabric messages are
comprised of (potentially quite long) sequences of fabric frames,
sent over the fabric using a connection-oriented reliable delivery
protocol. Some of the fabric frames of a fabric message are for
transfer of the message data blocks while other fabric frames are
used for control to set up and take down connections and to
implement reliable delivery (e.g., via handshake and re-delivery
attempts).
[0083] Thus fabric messages require additional fabric frames for
messaging control beyond the fabric frames required to transport
the message data. Furthermore, the fabric frames of fabric messages
require additional processing at the source and destination ends
related to the management of reliable delivery, connections, and
the fragmentation (segmentation) and reassembly of data blocks.
Nevertheless, the transport over the fabric of individual fabric
frames is essentially the same for both fabric messages and fabric
packets. Since all sources and destinations on the fabric have
support for processing fabric packets and fabric messages, those of
ordinary skill in the art will understand that all of the data
exchange illustrations below that describe the transport of packets
using fabric packets are equally applicable to the transport of
messages using fabric messages.
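The difference in frame count between a fabric packet and a fabric
message can be sketched as follows. This is a simplified illustration
under stated assumptions: the frame payload size, the FabricFrame
type, and the "connect" and "disconnect" control frames are
hypothetical placeholders, and the handshake and re-delivery frames
mentioned above are omitted for brevity.

```python
from dataclasses import dataclass

FRAME_PAYLOAD_BYTES = 1024  # hypothetical per-frame payload limit


@dataclass
class FabricFrame:
    kind: str            # "packet", "control", or "data"
    payload: bytes = b""


def fabric_packet(ethernet_frame: bytes):
    """A fabric packet is a single fabric frame with no connection state."""
    return [FabricFrame("packet", ethernet_frame)]


def fabric_message(block: bytes, size: int = FRAME_PAYLOAD_BYTES):
    """A fabric message is a sequence of frames: control frames set up and
    take down the connection, and data frames carry the segmented block."""
    frames = [FabricFrame("control", b"connect")]
    for offset in range(0, len(block), size):
        frames.append(FabricFrame("data", block[offset:offset + size]))
    frames.append(FabricFrame("control", b"disconnect"))
    return frames
```

In this sketch a 2500-byte block becomes three data frames plus two
control frames, while an Ethernet frame of any size is carried as one
fabric frame.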
[0084] The Ethernet frames of the packets to be transported over
the fabric may originate in external clients or devices coupled to
the NM ports or from within the various processing modules. During
processing by the NM or the VIOC, respectively, a fabric packet is
formed to contain the data of each original Ethernet frame plus
additional information to facilitate transport over the fabric.
During the fabric packet formation, the protocol field (Ether-type)
of the original Ethernet frame is examined. The fabric packet is
generally labeled (tagged) in accordance with the Ether-type and
other information found in the original packet. E.g., if the
original Ethernet frame is an IP type frame, the fabric packet is
identifiable as an "IP fabric packet." IP fabric packets are
evaluated for L3 forwarding (a.k.a. IP forwarding) based upon their
included destination IP address. Otherwise, non-IP fabric packets
are evaluated for L2 forwarding based upon their included MAC
destination address (MACDA). L2/L3 forwarding is overviewed
next.
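The Ether-type examination described above can be sketched as a
tagging step. This is a minimal sketch, not the implementation used by
the NM or VIOC: the EthernetFrame type and the tag_fabric_packet
function are hypothetical, and only IPv4 (Ether-type 0x0800) is
treated as an IP type frame here.

```python
from dataclasses import dataclass
from typing import Optional

ETHERTYPE_IPV4 = 0x0800  # IEEE-assigned Ether-type value for IPv4


@dataclass
class EthernetFrame:
    macda: str                     # MAC destination address
    ethertype: int                 # protocol field of the frame
    dest_ip: Optional[str] = None  # present only for IP type frames


def tag_fabric_packet(frame: EthernetFrame):
    """Return (tag, forwarding layer, lookup key) for a new fabric packet."""
    if frame.ethertype == ETHERTYPE_IPV4 and frame.dest_ip is not None:
        # IP fabric packets are evaluated for L3 forwarding by destination IP.
        return ("IP", "L3", frame.dest_ip)
    # All other fabric packets are evaluated for L2 forwarding by MACDA.
    return ("non-IP", "L2", frame.macda)
```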
[0085] During the fabric packet formation a forwarding decision is
made that determines a fabric destination address that is embedded
in the fabric packet. The embedded fabric destination address
controls how the fabric packet is delivered to destinations within
the system. The fabric destination address includes a specification
for an egress port of the switch fabric. When multiple sub-ports
(corresponding to multiple L2 or L3 destination addresses) are
associated with a single egress port, the fabric destination
address will also include a fabric sub-address to specify a
particular one of the sub-ports.
[0086] The fabric packet is subsequently cellified (segmented into
cells) and presented to an ingress port of the switch fabric. Each
cell includes the fabric destination address and the cell is
transferred by the switch fabric to the egress port specified by
the fabric destination address. After being received by the module
coupled to the specified egress port, the cells are reformed into a
representation of the original Ethernet frame prior to presentation
to the destination. If the module at the egress port has multiple
sub-ports, the module will use the included fabric sub-address to
further direct the reformed Ethernet frame to the specified
sub-port.
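Cellification and reassembly, as described in the two preceding
paragraphs, can be sketched as follows. This is an illustrative
sketch under stated assumptions: the cell payload size, the sequence
number, and the last-cell flag are hypothetical fields chosen so that
the example is self-contained; the text states only that each cell
includes the fabric destination address.

```python
from dataclasses import dataclass

CELL_PAYLOAD_BYTES = 64  # hypothetical cell payload size


@dataclass
class Cell:
    egress_port: int   # fabric destination address: egress port
    sub_address: int   # fabric sub-address selecting a sub-port
    seq: int           # hypothetical ordering field
    last: bool         # hypothetical end-of-packet marker
    data: bytes


def cellify(frame: bytes, egress_port: int, sub_address: int,
            size: int = CELL_PAYLOAD_BYTES):
    """Segment a fabric packet into cells, each carrying the destination."""
    chunks = [frame[i:i + size] for i in range(0, len(frame), size)] or [b""]
    return [Cell(egress_port, sub_address, i, i == len(chunks) - 1, chunk)
            for i, chunk in enumerate(chunks)]


def reassemble(cells):
    """Reform the original frame bytes at the egress module."""
    ordered = sorted(cells, key=lambda c: c.seq)
    if not ordered or not ordered[-1].last:
        raise ValueError("incomplete cell sequence")
    return b"".join(c.data for c in ordered)
```

After reassembly, the egress module would use the sub_address of the
cells to direct the reformed frame to the specified sub-port.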
[0087] In L2 forwarding, the VLAN assignment of the network port or
processing module from which the original Ethernet frame was
sourced is also used with the MACDA in determination of the fabric
destination address. The determination is by way of a lookup in an
L2 Forwarding Information Base (L2 FIB). As discussed elsewhere
herein, an L2 FIB is implemented for each VIOC and NM in the system
using any combination of TCAM/SRAM structures and search engines,
according to embodiment. The L2 forwarding decision is thus
implemented completely within the module where the original
Ethernet frame was sourced and the next fabric destination is the
module most directly associated with the MACDA of the original
Ethernet frame.
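The L2 lookup described above is keyed on both the VLAN assignment and
the MACDA. A minimal sketch follows, assuming a Python dictionary as a
software stand-in for the TCAM/SRAM structures; the L2Fib class and
its learn and lookup methods are hypothetical names.

```python
from typing import NamedTuple, Optional


class FabricAddress(NamedTuple):
    egress_port: int
    sub_address: int


class L2Fib:
    """Per-module L2 FIB: (VLAN, MACDA) -> fabric destination address."""

    def __init__(self):
        self._entries = {}

    def learn(self, vlan: int, macda: str, addr: FabricAddress) -> None:
        # MAC addresses are normalized to lower case before storage.
        self._entries[(vlan, macda.lower())] = addr

    def lookup(self, vlan: int, macda: str) -> Optional[FabricAddress]:
        # The same MACDA may resolve differently, or not at all, per VLAN.
        return self._entries.get((vlan, macda.lower()))
```

Because the VLAN is part of the key, a frame sourced from a port in
one VLAN cannot resolve to a destination learned only in another.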
[0088] General L3 forwarding (i.e., to destinations beyond the IP
sub-net of the source IP) requires access (on the same L2 sub-net
as the source) to at least one gateway IP interface and associated
L3 FIB. The number and location of gateway IP interfaces and L3
FIBs varies by embodiment. In a first L3 embodiment, a gateway IP
interface and L3 FIB are implemented external to the system chassis.
In the first L3 embodiment, there is no gateway IP interface or L3
FIB on any NM or VIOC and the system hardware does not provide any
L3 forwarding assists. That is, the external gateway IP interface
and L3 FIB are not aware of the fabric transport process and
function conventionally to provide the Ethernet MACDA of the next
hop. In a second L3 embodiment, at least one gateway IP interface
is implemented via a media port (physical port) or pseudo-port
(virtual port) somewhere on at least one NM and an L3 FIB is
implemented within each NM having a gateway IP interface. In the
second L3 embodiment, there is not a gateway IP interface or L3 FIB
on any VIOC. The gateway IP interface is only visible from inside
the chassis if implemented on a pseudo-port. The gateway IP
interface is visible from inside and outside the chassis if
implemented on a media port. Combinations of multiple gateway IP
interfaces, some on media ports and others on pseudo ports, are
envisioned. In a third L3 embodiment, a gateway IP interface is
implemented somewhere on at least one NM and for each VNIC, and an
L3 FIB is implemented within each NM and VIOC. Gateway IP
interfaces and L3 FIBs implemented within the chassis are fabric
packet aware and assist L3 forwarding by providing the fabric
destination address of the next hop.
[0089] In the second and third L3 embodiments, L3 FIB management
processes maintain a master L3 FIB in the SCM and maintain
coherency between all L3 FIBs in the chassis. When a non-master L3
FIB encounters a destination IP address that has no corresponding
entry, the IP fabric packet undergoing the forwarding decision is
forwarded as an "exception packet" to the controlplane process
executing on the Primary SCM. The controlplane process determines
the proper fabric address for the missing entry, propagates a new
entry to all of the L3 FIBs, and forwards the IP fabric packet to
the destination IP (or at least one hop closer) using the newly
learned fabric address. (Fabric frames exist for a single cellified
hop across the fabric between fabric source and fabric destination.
In the above and the following it will be understood that when an
IP fabric packet is "forwarded" via an indirection or hop via a
gateway IP interface or the Primary SCM, the IP fabric packet is
being re-instantiated into a new fabric frame for each traverse
across the fabric.)
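The exception-packet handling described above can be sketched as
follows. This is a simplified model under stated assumptions: the
L3Fib and PrimaryScm classes, the forward function, and the resolver
callback are hypothetical, and longest-prefix matching is assumed for
the lookup because the text does not specify the match rule.

```python
import ipaddress


class L3Fib:
    """Maps IP prefixes to fabric destination addresses."""

    def __init__(self):
        self.routes = {}

    def install(self, prefix: str, fabric_addr) -> None:
        self.routes[ipaddress.ip_network(prefix)] = fabric_addr

    def lookup(self, dest_ip: str):
        addr = ipaddress.ip_address(dest_ip)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # miss: no corresponding entry
        # Assumption: the longest matching prefix wins.
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


class PrimaryScm:
    """Holds the master L3 FIB and keeps the other L3 FIBs coherent."""

    def __init__(self, fibs, resolver):
        self.master = L3Fib()
        self.fibs = fibs
        self.resolver = resolver  # hypothetical: dest_ip -> (prefix, addr)

    def handle_exception(self, dest_ip: str):
        prefix, fabric_addr = self.resolver(dest_ip)
        # Propagate the new entry to the master and all non-master FIBs.
        for fib in [self.master, *self.fibs]:
            fib.install(prefix, fabric_addr)
        return fabric_addr


def forward(fib: L3Fib, scm: PrimaryScm, dest_ip: str):
    """Return (fabric address, route taken) for one IP fabric packet."""
    hit = fib.lookup(dest_ip)
    if hit is not None:
        return hit, "direct"
    # Miss: forward as an exception packet to the Primary SCM.
    return scm.handle_exception(dest_ip), "via Primary SCM"
```

In this sketch only the first packet to an unknown destination takes
the extra hop; later packets hit the newly propagated entry in every
L3 FIB.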
[0090] In accordance with the foregoing, depending upon embodiment
and within which module an IP packet is first processed, an IP
fabric packet may first undergo an indirection to an IP gateway
interface, possibly on a different port or pseudo port on the same
or a different module or external to the system. All transport is
by conventional Ethernet frames outside the chassis and by fabric
frames within the chassis. Once at an IP gateway interface, the
destination IP address of the original Ethernet frame is used to
associatively access the L3 FIB and the lookup result is used to
forward the IP packet to the IP destination (or at least one hop
closer).
[0091] Those of ordinary skill in the art will understand that for
IP packet transfers over the fabric, generally a gateway IP
interface must be involved. In the following illustrated data
exchanges, the paths are drawn for scenarios that do not require
additional indirection. Nevertheless, it will be understood that if
an IP packet is received at an interface that is neither the IP
destination address nor a gateway IP interface, then generally the
corresponding data exchange path is modified by interposing an
intermediate hop to a gateway IP interface. Furthermore, when an IP
packet is received at a gateway IP interface, either directly or as
part of an indirection from a non-gateway IP interface, in the
relatively rare event that there is a miss in the associated L3
FIB, the corresponding data exchange path is modified by
interposing an intermediate hop to the Primary SCM. As previously
indicated, Primary SCM controlplane processing services the miss in
the master L3 FIB and updates the L3 FIBs throughout the chassis.
Once the miss is serviced, the Primary SCM forwards the IP packet
toward the originally intended destination. Thus, while not a
frequent occurrence, for some IP fabric packets two intermediate
hops are interposed in the data exchange paths: a first
intermediate hop to a gateway IP interface and a second
intermediate hop to the Primary SCM.
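The hop-interposition rules just described can be summarized in a short sketch. The Python below is illustrative only: the function name `exchange_path`, its boolean parameters, and the hop labels are hypothetical and are not part of the disclosed system.

```python
def exchange_path(ingress_is_destination, ingress_is_gateway, l3_fib_hit):
    """Return the ordered hops for an IP fabric packet.

    Hypothetical model of the rules above; hop labels are illustrative strings.
    """
    hops = ["ingress interface"]
    if ingress_is_destination:
        # The packet arrived at its IP destination, so no further hop is modeled.
        return hops
    if not ingress_is_gateway:
        # First intermediate hop: indirection to a gateway IP interface.
        hops.append("gateway IP interface")
    if not l3_fib_hit:
        # Second intermediate hop: the Primary SCM services the L3 FIB miss.
        hops.append("Primary SCM")
    hops.append("IP destination (or one hop closer)")
    return hops
```

In this model the common case (gateway ingress, FIB hit) yields no intermediate hops, while a non-gateway ingress combined with a FIB miss yields both.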
[0092] In at least L3 forwarding, the classification functionality
of the NMs facilitates more sophisticated forwarding decisions,
special data manipulation, and other data treatments, to be
optionally performed as a function of additional attributes of the
network data traffic encountered. In one particular example that
will be examined in more detail elsewhere herein, the fabric
destination address for IP fabric packets is at least in part
determined by the recognition of particular service requests (and
the lack thereof) embedded in the data traffic. More specifically,
the service request recognition takes the form of recognizing
particular Transmission Control Protocol/Internet Protocol (TCP/IP)
destination ports corresponding to particular applications. The L2
and L3 FIBs are also updated dynamically, both in response to
changes in the network configuration and optionally for dynamic
performance optimization, such as to achieve load balancing among
the processing resources of the system.
[0093] For the sake of clarity in the below initial expositions for
each of the data exchanges illustrated by FIGS. 1B and 2, the
foregoing details of the transport process just described will be
understood to apply, but will not be repeated. It will also be
understood that even though the data exchanges are described from a
flattened perspective that includes operations by the clients, the
modules, and the switch fabric, the underlying above described
fabric transport process (and the associated management of the L2
and L3 FIBs) remains transparent to the external clients, external
devices, and internal host software processes. These clients,
devices and host software processes are only aware of packets and
messages (and not the fabric packets or fabric messages).
[0094] The data exchanges will be described in terms of more than
one type of packet. References to packet transmission, packets
originating from the client, incoming packets, received packets,
reassembled packets, or simply packets, are references to Ethernet
frames. It will be understood that all such Ethernet frames are
transported across the fabric via the process of fabric packet
encapsulation, cellification, switch fabric traversal, and
reassembly. References to augmented packets or cellified packets
are references to fabric packets. References to cells or cellified
packets being forwarded refer to the providing of fabric-addressed
cells to the switch fabric for transfer by the switch fabric to the
module coupled to the switch fabric egress port specified by the
fabric address.
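The encapsulation, cellification, and reassembly steps may be pictured with the following minimal Python sketch. The cell payload size, the per-cell sequence number, and the tuple layout are assumptions made for illustration; the text above does not specify a cell format.

```python
CELL_PAYLOAD_BYTES = 64  # assumed payload size; not given in the text


def cellify(frame: bytes, fabric_address: int):
    """Split a frame into cells that each carry the fabric address."""
    chunks = [frame[i:i + CELL_PAYLOAD_BYTES]
              for i in range(0, len(frame), CELL_PAYLOAD_BYTES)]
    # Each cell is (fabric address, assumed sequence number, payload chunk).
    return [(fabric_address, seq, chunk) for seq, chunk in enumerate(chunks)]


def reassemble(cells):
    """Rebuild the original frame by ordering cells on the sequence number."""
    ordered = sorted(cells, key=lambda cell: cell[1])
    return b"".join(chunk for _, _, chunk in ordered)
```

For example, a 200-byte frame becomes four cells under the assumed 64-byte payload, and `reassemble` returns the original bytes regardless of the order in which the cells are presented.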
[0095] In each data exchange description, generic exchanges are
purposefully not explicitly characterized as being of the IP or
non-IP type. It will be understood that each of these data
exchanges is generally equally applicable to the exchange of IP
fabric packets or non-IP fabric packets. If IP packets are being
exchanged, then the forwarding is understood to be L3 forwarding.
Otherwise, the forwarding is understood to be L2 forwarding. Again
for the sake of clarity in the following initial expositions, some
details of the data exchanges, such as the handling of multicast,
are deferred to elsewhere herein.
[0096] Client-Server Data Exchange 115 includes packet transmission
from Client 102 via Ethernet Network 107 to NM 130A. Since the
system may in part be providing the functionality of an L2/L3
switch for any of many network segments, packets received in close
time proximity by NM 130A may be for any of multiple destinations
both internal and external to the system. The incoming packets are
classified, formed into fabric packets, subjected to a forwarding
decision to determine a fabric address, and selectively provided as
cells to Primary Switch Fabric Module 180A via a fabric ingress
port associated with NM-Fabric coupling 139A. Based in part on the
classification, NM 130A addresses the cells to PMM 150A, and more
specifically to SMP Portion P.sub.A 152A, as a result of the
forwarding decision identifying the fabric egress port associated
with PMM-Fabric coupling 159A as the destination fabric addresses
for the cells. Primary Switch Fabric Module 180A then transfers the
cells to the fabric egress port associated with PMM-Fabric coupling
159A. SMP Portion P.sub.A 152A receives the cells and reassembles
them into received packets corresponding to the packets originating
from Client 102. The received packets are formed directly in RAM
153A, typically via DMA write data transfers. Return packets follow
the flow in reverse, typically beginning with DMA read transfers
from RAM 153A. Although Client-Server Data Exchange 115 has been
described from the perspective of packets "originating" from Client
102 and return traffic flowing in reverse, this is only for
illustrative purposes. The flow from Client 102 to SMP Portion
P.sub.A 152A is entirely independent of the flow in the other
direction.
[0097] Not all packets received from Client 102 via Ethernet
Network 107 are destined to SMP Portion P.sub.A 152A. For example,
Client-Service Data Exchange 117 illustrates cellified packets
selectively forwarded by NM 130A toward OLB 160A via NM-Fabric
coupling 139A, Primary Switch Fabric Module 180A, and OLB-Fabric
coupling 169A. Based in part on the classification, packets from
Client 102 are determined to require transfer to OLB 160A (instead
of other fabric clients, such as SMP Portion P.sub.A 152A as in
Client-Server Data Exchange 115). NM 130A addresses the
corresponding cells to OLB 160A and executes a forwarding decision
identifying the fabric egress port associated with OLB-Fabric
coupling 169A as the fabric destination address for the cells.
Primary Switch Fabric Module 180A then transfers the cells to the
fabric egress port associated with OLB-Fabric coupling 169A. OLB
160A receives the cells and reassembles them into received packets
directly into a RAM local to the OLB. Return packets follow the
flow in reverse.
[0098] In one embodiment, messages are used to implement Fibre
Channel storage transactions. Storage-Server Data Exchange 116
includes establishing a reliable end-to-end logical connection,
directly reading message data from RAM 153A (included in SMP
Portion P.sub.A 152A), fragmenting (as required) the message data
into fabric frames, and providing corresponding cells addressed to
FCM 120A via PMM-Fabric coupling 159A. As a result of a forwarding
decision, the cell destination addresses specify the fabric egress
port associated with FCM-Fabric coupling 129A. For each fabric
frame sent, the cells are transferred, received, and reassembled in
a manner similar to that described for fabric packets in
conjunction with Client-Service Data Exchange 117. The storage
transactions are provided via a storage network coupling to at
least one storage device of external Storage Network 106. If more
than one storage network and associated storage network coupling is
associated with FCM 120A, the particular storage network coupling
is specified via a fabric sub-address portion of the cell
destination address. Returning storage transaction responses follow
the flow in reverse.
[0099] Service Data Exchange 118 is similar to Client-Service Data
Exchange 117. Packet data is read from RAM 153A' (included in SMP
Portion P.sub.A' 152A'), and cells are forwarded to OLB 160A by a
forwarding decision specifying the fabric egress port associated
with OLB-Fabric coupling 169A as the cell destination addresses.
The packets exchanged by Client-Server Data Exchange 115,
Client-Service Data Exchange 117, and Service Data Exchange 118
are typically but not necessarily IP packets.
[0100] In some embodiments, Data Exchanges 115-118 are overlapped
or partially concurrent with each other. For example, cells
corresponding to a portion of Client-Server Data Exchange 115
traffic may be intermixed with cells relating to Client-Service
Data Exchange 117 traffic, as the cells from both data exchanges
are coupled via NM-Fabric coupling 139A to the fabric. However,
each cell includes sufficient information in the corresponding
fabric destination address and other information to specify the
proper operation. For example, at NM-Fabric coupling 139A, cells of
Client-Server Data Exchange 115 are forwarded to SMP Portion
P.sub.A 152A by a first forwarding decision specifying the fabric
egress port associated with PMM-Fabric coupling 159A, while cells
of Client-Service Data Exchange 117 are forwarded to OLB 160A by a
second forwarding decision specifying the fabric egress port
associated with OLB-Fabric coupling 169A. Similarly, cells from
Client-Service Data Exchange 117 and Service Data Exchange 118 may
be intermixed on OLB-Fabric coupling 169A, because sub-port
destination address and other information in the cells enable
proper processing. In one embodiment, a portion of the sub-port
destination address is used to associate packets with a respective
input queue within the destination module.
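As a rough illustration of how intermixed cells can still be processed properly, the Python sketch below sorts cells into input queues keyed by egress port and sub-port. The tuple layout and the use of coupling numerals as port labels are assumptions for illustration only.

```python
from collections import defaultdict


def demultiplex(cells):
    """Group intermixed cells by (egress port, sub-port), keeping arrival order."""
    queues = defaultdict(list)
    for egress_port, sub_port, payload in cells:
        # The sub-port portion selects the input queue within the module.
        queues[(egress_port, sub_port)].append(payload)
    return dict(queues)
```

Cells belonging to different exchanges may arrive interleaved on one coupling, yet each queue receives only its own payloads, in the order they arrived.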
[0101] In some embodiments, the termini of Data Exchanges are
located in RAM that is directly accessible by one or more
processing elements. For example, Service Data Exchange 118
includes a first terminus in RAM 153A', and a second terminus in a
RAM within OLB 160A. In typical embodiments, packet data from the
RAMs are read and written by DMA logic units included in each of
the respective modules. Conceptually, data is streamed from a
source RAM as packets, cellified and provided to the fabric,
transferred to the egress port as specified by the cells,
reassembled, and stored into a destination RAM in packet form.
These operations, in some embodiments, are fully overlapped, or
pipelined, so that data from a first cell of a packet may be stored
into the destination RAM while data from a following cell of the
same source packet is being read from the source RAM.
Packet and Process Data Flow
[0102] FIG. 2 illustrates various example embodiments of packet and
process data flow in an ES1 110A embodiment. A plurality of FCMs
are illustrated by FCM 120A and FCM 120B, coupled to Primary Switch
Fabric Module 180A via FCM-Fabric coupling 129A and FCM-Fabric
coupling 129B, respectively. A plurality of NMs are illustrated by
NM 130A and NM 130B, coupled to Primary Switch Fabric Module 180A
via NM-Fabric coupling 139A and NM-Fabric coupling 139B,
respectively. A plurality of PMMs are illustrated by PMM 150A and
PMM 150B, coupled to Primary Switch Fabric Module 180A by
PMM-Fabric couplings 159A/159A' and PMM-Fabric couplings
159B/159B', respectively. CSFI 170 is coupled to PMM 150A and PMM
150B by CSFI-PMM coupling 179A and CSFI-PMM coupling 179B,
respectively. A plurality of OLBs are illustrated by OLB 160A and
OLB 160B, coupled to Primary Switch Fabric Module 180A by
OLB-Fabric coupling 169A and OLB-Fabric coupling 169B. Note: the
Redundant SIM and associated couplings are omitted from the figure
for clarity.
[0103] It will be understood that each of the active FCMs and NMs
of FIG. 2 is typically, but not necessarily, coupled to external
devices on external networks as illustrated in FIGS. 1A and 1B. It
remains the case that all transport is by conventional Ethernet
frames outside the chassis and by fabric frames within the chassis.
Nevertheless, when such external devices or networks are involved,
the termini of FCM-related and NM-related packet and message data
exchanges may be considered from a certain perspective to extend to
those external devices. However, even with coupled external
devices, exchanges for at least some fabric frames related to
system management and control will terminate within the various
modules. Furthermore, in certain embodiments and scenarios,
including scenarios with external devices, certain non-control data
exchanges terminate within the NMs. Specifically, for the case of
fabric IP packets unicast to the IP gateway interface on a pseudo
port within an NM, the data exchange to the pseudo port terminates
within the NM and is not visible externally.
[0104] As a first illustrative configuration scenario, PMM 150A is
shown configured as two physical partitions, P.sub.1 201, and
P.sub.2 202, while PMM 150B is shown configured as a single
physical partition P.sub.3 203. As a second illustrative
configuration scenario, mutually exclusive with respect to the
first configuration scenario, PMM 150A and PMM 150B are shown
configured as a single unified physical partition P.sub.4 204.
[0105] FCM-PMM Data Exchange 210 is representative of data
exchanged between a storage sub-system coupled to an FCM and a PMM,
or more specifically a physical partition of a PMM. As illustrated,
this traffic is typically storage related messages between
processes executing on P.sub.3 203 of PMM 150B (including any of
Application, Driver, or OS Processes) and an external storage
sub-system (such as SAN 106 of FIG. 1B). In operation,
bidirectional message information flows as cellified fabric frames
via FCM-Fabric coupling 129A, Primary Switch Fabric Module 180A,
and PMM-Fabric coupling 159B. For example, a storage sub-system
request is generated by a storage sub-system Driver process
executing on P.sub.3 203. The request is formed as a storage
sub-system message addressed to the external storage sub-system
coupled to FCM 120A, and delivered as cellified fabric frames to
Primary Switch Fabric Module 180A via PMM-Fabric coupling 159B.
Primary Switch Fabric Module 180A routes the cells to FCM-Fabric
coupling 129A.
[0106] FCM-Fabric coupling 129A delivers the cellified fabric
frames to FCM 120A. In turn the cells of each fabric frame are
reconstituted (or reconstructed) into the original storage
sub-system message request, which is then sent to the storage
sub-system attached to FCM 120A (such as Fibre Channel Storage
Network 106 of FIG. 1B, for example). The storage sub-system
returns a response message, which is formed by FCM 120A into one or
more fabric messages addressed to P.sub.3 203. The fabric messages
are fragmented into fabric frames that are delivered as cells to
Primary Switch Fabric Module 180A via FCM-Fabric coupling 129A.
Primary Switch Fabric Module 180A routes the cells via PMM-Fabric
coupling 159B to P.sub.3 203 of PMM 150B. P.sub.3 203 reconstitutes
the cells into fabric frames, then reassembles and delivers the
response message to the storage sub-system Driver process executing
on P.sub.3 203.
[0107] In various embodiments, FCM-PMM Data Exchange 210 may flow
via PMM-Fabric coupling 159B' instead of 159B, or it may flow
partially via PMM-Fabric coupling 159B and partially via PMM-Fabric
coupling 159B'. The operation is similar for these cases, as the
fabric messages may be forwarded to P.sub.3 203 via 159B and 159B'
with no other change in operation.
[0108] NM-OLB Data Exchange 211 is representative of data exchanged
between an NM and a service process executing on an OLB. NM 130A
receives information, typically but not necessarily in IP packet
form, from an external coupled client (such as Client 102 of FIG.
1B), and classifies the packets, in part to determine a subset of
the packets to be sent to OLB 160B. Based in part on the
classification, an appropriate subset of the information is formed
into like-kind fabric packets including the destination address of
OLB 160B. An appropriate L2/L3 forwarding decision is made and the
fabric packets are then communicated as cells to Primary Switch
Fabric Module 180A via NM-Fabric coupling 139A. Primary Switch
Fabric Module 180A forwards the cells toward OLB 160B. The cells
are ultimately received via OLB-Fabric coupling 169B, reconstituted
as packets, and provided directly to the service process executing
on OLB 160B. The reverse path is used to transfer information from
the service process to the client coupled to NM 130A. Another path
may also be used to transfer information from the service process
to other destinations, such as an application process executing on
a PMM.
[0109] In an example of an SSL service process executing on OLB
160B, NM 130A recognizes a variety of SSL IP packet forms during
classification, including HyperText Transfer Protocol Secure
(HTTPS) as TCP/IP destination port 443, Secure Simple Mail
Transfer Protocol (SSMTP) as TCP/IP destination port 465, and
Secure Network News Transfer Protocol (SNNTP) as TCP/IP destination
port 563. IP fabric packets are formed including the destination IP
address of OLB 160B. An L3 forwarding decision is made and the IP
fabric packets are provided as cells to the fabric and forwarded
toward OLB 160B. The SSL service process executing on OLB 160B,
upon receiving the reconstituted IP packets, performs SSL service
functions such as context switching, state look-up, protocol layer
demultiplexing, and decryption.
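The port-based recognition described above may be sketched as a simple table lookup. In the Python below, the port numbers and protocol labels come from the text; the function names and the default-destination parameter are hypothetical.

```python
# TCP/IP destination ports named in the text for SSL-protected protocols.
SSL_PORTS = {443: "HTTPS", 465: "SSMTP", 563: "SNNTP"}


def classify_ssl(tcp_dest_port):
    """Return the SSL protocol label for a port, or None for non-SSL traffic."""
    return SSL_PORTS.get(tcp_dest_port)


def fabric_destination(tcp_dest_port, default_destination):
    """Send recognized SSL traffic to the SSL OLB, everything else elsewhere."""
    if classify_ssl(tcp_dest_port) is not None:
        return "OLB 160B"
    return default_destination
```

A packet to port 443 would thus be steered toward OLB 160B, while a packet to an unlisted port keeps whatever destination the caller supplies.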
[0110] The SSL service process executing on OLB 160B produces
result data based in part on the packets received from the external
client via NM 130A. The result data typically includes IP packets
that may be sent back to the external client via NM 130A (a handshake
or acknowledgement, for example) as illustrated by NM-OLB Data
Exchange 211 or alternately addressed to P.sub.3 203 (decrypted
clear text, for example) as illustrated by PMM-OLB Data Exchange
216. In both cases fabric packets are provided as cells to Primary
Switch Fabric Module 180A via OLB-Fabric coupling 169B and
forwarded accordingly.
[0111] In an example of an SLB service process executing on OLB
160B, NM 130A recognizes TCP SYN packets during classification and
forms an IP fabric packet including the destination IP address of
OLB 160B. An L3 forwarding decision is made and the IP fabric
packet is provided as cells to the fabric and forwarded toward OLB
160B. The SLB service process executing on OLB 160B, upon receiving
a reconstituted packet, consults load information for the system,
and assigns the request to a relatively unloaded physical partition
of a PMM (such as one of P.sub.1 201, P.sub.2 202, and P.sub.3
203), establishing a new connection. The new connection is recorded
in the appropriate L3 FIBs, in order for NM 130A to properly
forward subsequent IP packets for the new connection to the
assigned physical partition, enabling information flow from NM 130A
to the assigned physical partition without the need for indirection
through OLB 160B.
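A minimal Python sketch of this assignment step follows. It assumes, purely for illustration, that load is measured as a count of open connections and that the L3 FIB can be modeled as a dictionary keyed by flow; neither assumption comes from the text.

```python
def assign_connection(flow, partition_load, l3_fib):
    """Assign a new connection to the least-loaded partition and record it."""
    partition = min(partition_load, key=partition_load.get)
    partition_load[partition] += 1  # assumed metric: open connections
    l3_fib[flow] = partition        # later packets can bypass the OLB
    return partition


def forward(flow, l3_fib):
    """Return where an NM sends a packet: the recorded partition, else the SLB."""
    return l3_fib.get(flow, "OLB 160B")
```

Before assignment, `forward` returns the OLB hosting the SLB service; after `assign_connection` records the flow, the same lookup returns the assigned partition directly.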
[0112] In one embodiment, SSL and SLB processing may be cascaded.
For example, NM 130A forwards cellified encrypted IP packet
information from an encrypted external client toward OLB 160B for
SSL processing, or decryption. OLB 160B in turn forwards cellified
decrypted (clear text) IP packet information to itself, another
OLB, or a PMM for subsequent SLB processing. After assignment of a
physical partition relating to a new encrypted connection,
cellified packets are then forwarded first from NM 130A to OLB 160B
for decryption, and then directly to the assigned physical
partition.
[0113] In an example of an XML service executing on OLB 160B, the
service process functions as an XML server. NM 130A identifies XML
requests from the external client, and forwards each request, in
the form of cellified IP packets, toward OLB 160B, where the XML
service process analyzes the request. Appropriate response
information is produced and provided as cellified packets forwarded
toward NM 130A.
[0114] While NM-OLB Data Exchange 211 illustrates data exchange
between NM 130A and OLB 160B, this is only an illustrative
scenario. For example, during classification, NM 130A may examine a
packet (typically but not necessarily an IP packet) received from
the client coupled to NM 130A to determine an associated flow, and
then selectively determine a destination OLB based on the
determined flow (OLB 160A or OLB 160B, as appropriate). This
provides a form of service processing load balancing. The
destination OLB may also be determined based on the type of service
(SLB, SSL, or XML), if a particular OLB includes hardware
acceleration specific to a service. For example, in an embodiment
where OLB 160A includes an SSL hardware accelerator, while OLB 160B
includes an XML hardware accelerator, then IP packets requiring SSL
processing are typically directed toward OLB 160A, and IP packets
requiring XML processing are typically directed toward OLB 160B.
Those of ordinary skill in the art will recognize that destination
OLB determination may be performed based on combining service
processing load balancing with selection based on hardware
accelerator availability and location.
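One way to combine the two selection criteria is sketched below in Python. The function name, the accelerator map, and the use of a CRC-32 hash of a flow identifier are assumptions chosen for illustration, not the disclosed method.

```python
import zlib


def select_olb(service, flow_id, accelerators, olbs):
    """Prefer OLBs with a matching accelerator, then spread flows by hash."""
    matching = [olb for olb in olbs if accelerators.get(olb) == service]
    candidates = matching or list(olbs)
    # A stable hash keeps every packet of one flow on the same OLB.
    return candidates[zlib.crc32(flow_id.encode()) % len(candidates)]
```

With OLB 160A mapped to an SSL accelerator and OLB 160B to an XML accelerator, SSL flows go to OLB 160A, XML flows go to OLB 160B, and services with no matching accelerator are balanced across both.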
[0115] Service processing (such as SLB, SSL, and XML) is not
restricted to OLBs, as PMMs and SCMs may also be configured to
provide service processing. In these embodiments, NMs take the
destination IP address assigned to a physical partition of a PMM
(such as P.sub.1 201, P.sub.2 202, or P.sub.3 203, for example) or
an SCM (such as Primary SCM 140A, for example) and perform an L3
forwarding decision to provide a fabric destination address in
preparation for transit on the fabric as cells. The cells are then
forwarded toward the appropriate PMM or SCM where the service
process is executing.
[0116] NM-NM Data Exchange 212 is representative of data exchanged
between NMs. This traffic is exemplary of network traffic between a
first external client and a second external client coupled
respectively to a port of NM 130A and a port of NM 130B, and
wherein ES1 110A performs as a bridge, switch, or router. (Clients
such as 102 and 103 of FIG. 1A are representative of the external
clients.) The low-level details of the data exchange are
substantially similar to NM-OLB Data Exchange 211. A port of NM
130A receives information, typically in packet form, from the first
external client (and potentially a plurality of other external
sources), and classifies the packets (which may be for a plurality
of destinations), in part to determine a subset of the packets to
be sent to the port of NM 130B. Based in part on the
classification, an appropriate subset of the information is formed
into fabric packets destination-addressed to the port of NM 130B.
NM 130A makes a forwarding decision that embeds a fabric address
into the fabric packet, which is then communicated as cells to
Primary Switch Fabric Module 180A via NM-Fabric coupling 139A.
Primary Switch Fabric Module 180A forwards the cells toward NM
130B. After arriving at NM 130B the cells are reconstituted as
packets, and sent to the second external client coupled to the port
of NM 130B. The reverse path is used to transfer information from
the second client to the first client, and operates in a symmetric
manner.
[0117] Under some circumstances, such as multiple clients coupled
to one NM via multiple physical network couplings (media ports), an
NM (such as NM 130A) may forward data toward itself via the fabric.
Operation in this scenario is similar to NM-NM Data Exchange 212,
except the packets are addressed to NM 130A, instead of NM 130B.
The multiple media ports are distinguished via the sub-address
portion of the fabric address.
[0118] NM-PMM Data Exchange 213 is representative of IP packets
exchanged between an NM and a process (Application, Driver, or OS)
executing on a PMM, typically under control of a higher-level
protocol, such as Transmission Control Protocol (TCP) or User
Datagram Protocol (UDP). The data exchange is substantially similar
to NM-OLB Data Exchange 211. NM 130B forms a portion of received
information (based on classification) as IP packets addressed to
P.sub.3 203. NM 130B executes a forwarding decision on the
destination IP address to obtain a fabric destination address in
preparation for providing a fabric packet as cells to Primary
Switch Fabric Module 180A via NM-Fabric coupling 139B. Primary
Switch Fabric Module 180A forwards the cells toward P.sub.3 203.
Upon arrival via PMM-Fabric coupling 159B (or alternatively
PMM-Fabric coupling 159B'), the cells are reconstituted as IP
packets by PMM 150B, and provided to P.sub.3 203. The process
transfers information to NM 130B (and typically on to a client
coupled to the NM) using the reverse path. A return IP fabric
packet is formulated by the process, IP destination-addressed to NM
130B, a corresponding fabric address is obtained from a forwarding
decision, and the IP fabric packet is provided as cells to Primary
Switch Fabric Module 180A for forwarding toward NM 130B.
[0119] NM-SCM Data Exchange 214 is representative of data exchanged
between an NM (or a client coupled thereto) and a management,
forwarding, or load balancing process executing on an SCM. The data
exchange is substantially similar to NM-OLB Data Exchange 211.
Packets addressed to Primary SCM 140A are formulated by either an
external client coupled to NM 130B or (as an alternative example)
by a control plane related process running on the Network Processor
of NM 130B. NM 130B forms corresponding fabric packets and a
forwarding decision is made to determine the embedded fabric
address. The fabric packets are then provided as cells to Primary
Switch Fabric Module 180A via NM-Fabric coupling 139B. Primary
Switch Fabric Module 180A forwards the cells toward Primary SCM
140A. Upon arrival at Primary SCM 140A via SCM-Fabric coupling
149A, the cells are reassembled as packets by Primary SCM 140A and
provided to the management, controlplane, or load balancing
process. Examples illustrating these processes are provided in the
following paragraphs. The management, controlplane, or load
balancing process transfers information back to NM 130B (or a
client coupled thereto) using the reverse path. A management packet
addressed to NM 130B (or the client coupled thereto) is
formulated by a process executing on SCM 140A, a corresponding
fabric packet is formed, and a forwarding decision is made to
determine the embedded fabric address. The fabric packet is
provided as cells to Primary Switch Fabric Module 180A for
forwarding toward NM 130B. Upon arrival at NM 130B, the management
packet is reconstructed. If addressed to NM 130B, the packet is
consumed therein. If addressed to the external client, the
reconstructed packet is provided thereto.
[0120] In an example of a management process executing on Primary
SCM 140A, a management client coupled externally to NM 130B
(typically for remote server management, provisioning,
configuration, or other related activities) sends a management
related packet via NM 130B with the destination address of the
management process executing on Primary SCM 140A. Within NM 130B,
the packet is classified and determined to be a management related
packet. A forwarding decision is then made and a cellified version
of the management packet is forwarded toward the management process
via Primary Switch Fabric Module 180A. Return information from the
management process to the management client uses the reverse path,
by formulation (within SCM 140A) of packets having the destination
address of the management client coupled to NM 130B. A forwarding
decision is then made and cellified versions of the return
information packets are forwarded toward the management client via
Primary Switch Fabric Module 180A and NM 130B. Typically but not
necessarily, IP packets would be used for the exchange between the
management client and process.
[0121] In an example of a controlplane process executing on Primary
SCM 140A, in some embodiments and under some circumstances NM 130B
classification determines that the proper L2 forwarding for a
received packet is not known, and designates the received packet as
an "exception packet". In one embodiment, NM 130B forwards a
cellified version of the exception packet to an L2 FIB management
process executing on the Primary SCM via Primary Switch Fabric
Module 180A. The L2 FIB management process examines the exception
packet, master L2 FIB, and other forwarding related information, to
determine the proper fabric address for the missing L2 FIB entry.
The updated forwarding information is then recorded in the master
L2 FIB, in some embodiments, and propagated to the ancillary L2
FIBs in order for NM 130B to properly forward subsequent packets
having the same or similar classification. Primary SCM 140A also
provides a correct fabric address for the exception packet and
emits an IP fabric packet equivalent to the exception packet (but
addressed to the updated fabric address) as corresponding cells to
Primary Switch Fabric Module 180A for forwarding to the interface
at the destination IP address (or at least one hop closer). In this
circumstance the fabric destination could be any of the elements
coupled to the dataplane of Primary Switch Fabric Module 180A,
including NM 130B or Primary SCM 140A (this general data exchange
is not illustrated in the figure).
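The exception-packet handling may be sketched as follows. In this Python illustration the FIBs are modeled as dictionaries and the `resolve` callable stands in for the examination of the exception packet and other forwarding related information; all names are hypothetical.

```python
def handle_exception_packet(dest_key, resolve, master_fib, ancillary_fibs):
    """Service an L2 FIB miss at the Primary SCM and return the fabric address."""
    fabric_address = resolve(dest_key)     # hypothetical resolution step
    master_fib[dest_key] = fabric_address  # record in the master L2 FIB
    for fib in ancillary_fibs:             # propagate to the ancillary L2 FIBs
        fib[dest_key] = fabric_address
    return fabric_address                  # used to re-emit the exception packet


def nm_forward(dest_key, nm_fib, scm_handler):
    """Forward using the NM's own FIB, deferring to the SCM on a miss."""
    if dest_key in nm_fib:
        return nm_fib[dest_key]
    return scm_handler(dest_key)
```

The first packet for an unknown destination takes the SCM path; once the update has propagated, subsequent packets with the same key are resolved from the NM's own FIB.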
[0122] In an example of a load balancing process executing on
Primary SCM 140A, operation is similar to the SLB service executing
on an OLB, except the IP packets are destination IP addressed to
Primary SCM 140A instead of an OLB.
[0123] PMM-SCM Data Exchange 215 is representative of data
exchanged between an Application, Driver, or OS process executing
on a physical partition of a PMM and a management, controlplane, or
load balancing process executing on an SCM. The data exchange is
substantially similar to the exchanges described elsewhere herein.
A PMM-to-SCM communication is formed as a packet addressed to
Primary SCM 140A by a Driver process, for example, executing on
P.sub.3 203. A fabric packet is formed and a forwarding decision is
made to determine the embedded fabric address. The fabric packet is
then provided as cells via PMM-Fabric coupling 159B (or 159B'), and
forwarded via Primary Switch Fabric Module 180A toward Primary SCM
140A. Upon arrival at Primary SCM 140A and subsequent reassembly,
the reassembled packet is provided to the management, controlplane,
or load balancing process. The reverse path is used for SCM-to-PMM
communication, with the management, controlplane, or load balancing
process formulating a packet addressed to P.sub.3 203, for
communication to the Driver process.
[0124] In an example of the load balancing process executing on
Primary SCM 140A, the communication includes server load
information relating to PMM 150B. PMM-SCM Data Exchange 215 is also
representative of a variety of paths between an SCM and all other
elements coupled to the fabric dataplane (such as FCMs, NMs, OLBs,
and other PMMs), to update forwarding information maintained in
each of the elements. For each element requiring a forwarding
information update, the controlplane process executing on Primary
SCM 140A formulates one or more packets to include the updated
forwarding information and addresses the packet(s) to the
appropriate fabric destination. The packets are provided as cells
to the fabric and the fabric forwards the cells according to the
fabric destination. In another embodiment the fabric destination
includes a multicast destination, and the cellified packets are
delivered to a plurality of destinations by the fabric.
[0125] PMM-OLB Data Exchange 216 is representative of data
exchanged between a process (Application, Driver, or OS) executing
on a physical partition of a PMM and a service process executing on
an OLB. The data exchange is substantially similar to PMM-SCM Data
Exchange 215, except that OLB 160B takes the place of Primary SCM
140A, and data is coupled via OLB-Fabric coupling 169B instead of
SCM-Fabric coupling 149A.
[0126] Data exchanges between processes executing on different
physical partitions are communicated on the fabric (PMM-PMM-Fabric
Data Exchange 217, for example). Data exchanges between processes
executing within the same physical partition are communicated by
coherent shared memory and coherent cache memory transactions
(PMM-Internal Data Exchange 218, for example). When multiple PMMs
are configured as a single physical partition, coherent shared
memory and coherent cache memory transactions travel between the
PMMs of the partition via CSFI 170 (PMM-PMM-CSFI Data Exchange
219, for example).
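The three cases above reduce to a small decision rule, sketched here in Python. The representation of each endpoint as a (partition, PMM) pair and the returned labels are assumptions for illustration.

```python
def transport(src, dst):
    """Pick the transport between two processes.

    Each endpoint is a (partition, pmm) pair, for example ("P3", "150B").
    """
    (src_partition, src_pmm), (dst_partition, dst_pmm) = src, dst
    if src_partition != dst_partition:
        return "fabric"                   # as in PMM-PMM-Fabric Data Exchange 217
    if src_pmm == dst_pmm:
        return "internal coherent links"  # as in PMM-Internal Data Exchange 218
    return "CSFI"                         # as in PMM-PMM-CSFI Data Exchange 219
```

Note that two partitions on the same PMM still communicate over the fabric in this model, because the partition comparison is made before the PMM comparison.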
[0127] PMM-PMM-Fabric Data Exchange 217 is representative of data
exchanged between a first process and a second process executing on
different physical partitions, i.e. message-passing InterProcess
Communication (IPC). The two processes may be any combination of
Application, Driver, or OS processes. The data exchange is
substantially similar to PMM-SCM Data Exchange 215, except P.sub.1
201 takes the place of Primary SCM 140A, and data is coupled via
PMM-Fabric coupling 159A' instead of SCM-Fabric coupling 149A.
Another example of this type of communication would be between
P.sub.1 201 and P.sub.2 202, (via PMM-Fabric coupling 159A' and
PMM-Fabric coupling 159A) even though these two physical partitions
are on the same PMM.
[0128] PMM-Internal Data Exchange 218 is representative of data
exchanged between two processes executing on the same physical
partition, and the physical partition resides entirely within a
single PMM. A source process, executing on a first compute element
of P.sub.3 203, writes to a shared memory location, and a sink
process, executing on a second compute element of P.sub.3 203,
reads the shared memory modified by the write. Communication is
provided by links internal to PMM 150B supporting coherent shared
memory and coherent cache memory.
[0129] PMM-PMM-CSFI Data Exchange 219 is representative of data
exchanged between two processes executing on the same physical
partition, the physical partition spans more than one PMM, and the
two processes execute on different PMMs. An example of this
physical partition configuration is illustrated as P.sub.4 204,
where P.sub.4 204 includes all of the compute elements of PMM 150A
and PMM 150B. Coherent shared memory and coherent cache memory
transactions are used to exchange data, as in PMM-Internal Data
Exchange 218. However, the transactions are communicated via CSFI
170, instead of links internal to the PMMs.
[0130] Data exchanges may also occur between processes executing on
physical partitions distributed across multiple PMMs via a
combination of paths similar to PMM-Internal Data Exchange 218 and
PMM-PMM-CSFI Data Exchange 219. That is, particular coherent memory
traffic (for both shared memory and cache memory) may travel via
both CSFI 170 and via links internal to the PMMs.
[0131] Data exchanges involving an NM (such as 211, 212, 213, and
214) typically include Tag processing. Specifically, during ingress
processing on an NM, incoming packets from a first client coupled
to the NM are classified, producing a condensed representation of
certain details of the incoming packet, typically by analyzing the
header of the incoming packet. A portion of the classification
result is represented by a Tag, and typically a portion of the Tag,
referred to as the egress Tag, is included in a header of the
fabric packet produced by the NM in response to the incoming
packet. The egress Tag may specify selected packet processing
operations to be performed by the NM during egress processing, thus
modifying the packet header, data, or both, before receipt by a
second client. Egress packet processing may be performed in
response to the corresponding Tag produced during ingress
processing (in the NM that did the ingress processing on the
packet) or in response to a Tag `manufactured` by a service process
on an OLB or an Application, Driver, or OS process executing on a
PMM. In one embodiment egress processing is specified directly by
the Tag, and in another embodiment egress processing is determined
indirectly by the Tag (via a table look-up, for example).
[0132] As an example, during NM-OLB Data Exchange 211, in the case
of an SSL service process, the egress Tag provided may include
information examined by the SSL process in order to perform any
combination of SSL processing related context switching, SSL
related per context state look-up, and early protocol layer
demultiplexing.
[0133] As another example, during NM-SCM Data Exchange 214, in the
case of a routing process, the exception packet delivered to the
routing process may include an egress Tag providing exception
information to determine a particular Quality of Service (QoS) for
the associated routing protocol. The routing process examines the
exception information of the egress Tag to determine the particular
route processing queue to insert the exception packet into.
[0134] In one embodiment, Primary Switch Fabric Module 180A
provides for only a single transport of cells at a time between any
pairing of ingress and egress ports. In another embodiment, Primary
Switch Fabric Module 180A provides for a plurality of simultaneous
transports of cells between multiple pairings of fabric ingress and
egress ports. This simultaneous transport may be by parallel
communication paths available in the fabric, by interleaving cells
from different transports in time on a single communication path,
or any combination of these and other techniques. Those of ordinary
skill in the art will recognize that the details of Primary Switch
Fabric Module 180A operation affect only the available bandwidth
and latency provided by the fabric, not details of data exchanges
as exemplified by FCM-PMM Data Exchange 210, NM-OLB Data Exchange
211, and so forth. In one embodiment, Primary Switch Fabric Module
180A includes sufficient parallel resources to provide
substantially simultaneous communication for all of the data
exchanges illustrated in the figure.
Packet Transmission and Reception
[0135] FIG. 3A illustrates Fabric Communication 300A, conceptually
showing selected aspects of an embodiment of packet transmission
and reception in an ES embodiment, used in FCM-PMM Data Exchange
210 and other similar data exchanges described elsewhere herein.
Fabric client elements coupled to Primary Switch Fabric Module 180A
include PMM 150A, OLB 160A, NM 130A, FCM 120A, and Primary SCM
140A. Each coupled client includes RAM, shown respectively as PMM
RAM 350, OLB RAM 360, NM RAM 330, FCM RAM 320, and SCM RAM 340.
Each RAM in turn includes a packet memory image, shown respectively
as Packet Memory Image PKT.sub.PMM 351, Packet Memory Image
PKT.sub.OLB 361, Packet Memory Image PKT.sub.NM 331, Packet Memory
Image PKT.sub.FCM 321, and Packet Memory Image PKT.sub.SCM 341. A
VIOC is included in each of PMM 150A, OLB 160A, FCM 120A, and
Primary SCM 140A, illustrated as VIOC 301.5, VIOC 301.6, VIOC
301.2, and VIOC 301.4, respectively. The VIOCs are shown
conceptually coupled to corresponding packet images as dashed
arrows 359, 369, 329, and 349, respectively. The VIOCs provide an
interface to the fabric via PMM-Fabric coupling 159A, OLB-Fabric
coupling 169A, FCM-Fabric coupling 129A, and SCM-Fabric coupling
149A, respectively.
[0136] NM 130A includes a Traffic Manager (TM 302), also known as a
Buffer and Traffic Manager (BTM), instead of a VIOC. The TM is shown
conceptually coupled to Packet Memory Image PKT.sub.NM 331 via
dashed arrow 339. TM 302 provides an interface for NM-Fabric
coupling 139A.
[0137] Data is exchanged between fabric clients (modules on the
fabric) by procedures of packet transmission, routing (as cells) on
the fabric, and packet reception. Arrow 315 conceptually shows the
direction of flow for packet transmission to the fabric, also
referred to as ingress to the fabric, while arrow 316 conceptually
shows the direction of flow for packet reception from the fabric,
also referred to as egress from the fabric. Conceptually, packet
transmission begins at the source fabric client by reading a packet
image from a source RAM and providing the packet as cells for
transmission via the fabric. The fabric routes the cells to the
appropriate destination fabric client. Packet reception
conceptually begins at the fabric edge of the destination fabric
client, where the cells are received from the fabric and
reassembled into a corresponding packet (including information from
the packet image as read from the source RAM) and written into a
destination RAM in the destination fabric client.
[0138] Each of TM 302 and the VIOCs (301.5, 301.6, 301.2, and
301.4) comprises various logic blocks, including a fabric interface
communication unit (also known as a packet-cell converter) for
performing the functions relating to packet transmission and packet
reception via cells on the fabric. The fabric communication
operation of all fabric clients (PMM 150A, OLB 160A, NM 130A, FCM
120A, and Primary SCM 140A) is substantially similar, but for
clarity is described in a context of data exchanged between PMM
150A and NM 130A (such as NM-PMM Data Exchange 213, for
example).
[0139] For a packet transmission operation, the fabric interface
communication units read a packet image from a RAM included in a
fabric client (such as Packet Memory Image PKT.sub.PMM 351 included
in PMM RAM 350). The packet image includes a header and packet body
data, illustrated as Header 311 and Packet Body 312 respectively.
The fabric interface communication unit (included in VIOC 301.5 in
this example) conceptually segments the packet into a plurality of
cells of equal length, illustrated as Cell C.sub.1 313.1' through
Cell C.sub.M-1 313.(M-1)'. The final Cell C.sub.M 313.M' is
typically a different length than the other cells as packets are
not restricted to integer multiples of cell sizes. Although
illustrated as contiguous in RAM, the packet body data may instead
be scattered in various non-contiguous buffers in RAM, according to
various embodiments.
[0140] Each of the cells is encapsulated with a header, illustrated
as Cell Header h.sub.1 314.1, corresponding to Cell C.sub.1 313.1,
and Cell Header h.sub.M 314.M corresponding to Cell C.sub.M 313.M.
The cell headers for each of the cells segmented from a given
packet are determined in part from the packet header, as shown
conceptually by arrow 310.1 and arrow 310.M flowing from Header 311
to Cell Header h.sub.1 314.1 and to Cell Header h.sub.M 314.M
respectively. Each of the resultant encapsulated cells is provided,
in order, to the fabric for routing to the destination. Segmenting
the packet into cells and encapsulating the cells is also referred
to as "cellification".
[0141] Header 311 includes a packet destination address, and VIOC
301.5 determines a cell destination address in part from the
destination address of the packet header in addition to routing
tables and state information available to the VIOC. The cell
destination address, also known as a fabric destination or egress
port address, is included in each of the cell headers (Cell Header
h.sub.1 314.1 through Cell Header h.sub.M 314.M).
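As a non-limiting illustration of the cellification described above,
the following Python sketch segments a packet into equal-length cells
(with a possibly shorter final cell) and encapsulates each cell with
a header carrying the cell destination address. The 64-byte payload
size, the Cell field names, and the function name cellify are
assumptions made for illustration only; the described embodiments do
not specify them, and the destination is passed in already resolved.

```python
from dataclasses import dataclass
from typing import List

CELL_PAYLOAD_BYTES = 64  # assumed payload size; the text does not give one


@dataclass
class Cell:
    dest_port: int   # cell destination (fabric egress port) address
    seq: int         # position of this cell within its packet (assumed field)
    last: bool       # True only for the final cell of the packet (assumed field)
    payload: bytes   # slice of the packet image carried by this cell


def cellify(packet: bytes, dest_port: int,
            cell_size: int = CELL_PAYLOAD_BYTES) -> List[Cell]:
    """Segment a packet image into cells and attach a header to each."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    # Equal-length chunks; the final chunk may be shorter because packets
    # are not restricted to integer multiples of the cell size.
    chunks = [packet[i:i + cell_size]
              for i in range(0, len(packet), cell_size)] or [b""]
    return [
        Cell(dest_port=dest_port, seq=i,
             last=(i == len(chunks) - 1), payload=chunk)
        for i, chunk in enumerate(chunks)
    ]
```

For example, a 150-byte packet with the assumed 64-byte payload size
yields two full cells and a final 22-byte cell, all carrying the same
cell destination address.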
[0142] This technique enables a process executing on a fabric
client to transparently address a packet to another fabric client
using a logical address for the addressed client. The resolution of
the packet address to a fabric egress port address corresponds to
resolving the logical address to a physical address, i.e. a
specific port of the switch that the addressed client is coupled
to.
[0143] The reverse direction receive path operates in a
complementary, conceptually symmetric, inverse manner. The
segmented cells are routed by the fabric, in order, to the fabric
port specified by the cell destination address. The fabric
interface communication unit included in the destination fabric
client receives the cells, processes the headers, and reconstitutes
(or reconstructs) the cells into a packet image in RAM, resulting
in a packet image substantially similar to the packet image
originally provided by the transmitting fabric client. Continuing
with the example, if the destination address of Packet Memory Image
PKT.sub.PMM 351 resolves to NM 130A, then the cells are routed to
NM-Fabric coupling 139A by Primary Switch Fabric Module 180A. TM
302 receives the cells via NM-Fabric coupling 139A, assembles them
back into a packet, and stores the resultant image as Packet Memory
Image PKT.sub.NM 331 in NM RAM 330.
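The receive path described above may be sketched, under stated
assumptions, as follows. The Cell fields (a sequence number and a
final-cell flag) and the function name reassemble are hypothetical
and are used only to show in-order reconstitution of a packet image
from cells; the actual cell header format is not given in the text.

```python
from typing import Iterable, NamedTuple


class Cell(NamedTuple):
    dest_port: int   # fabric egress port the cell was routed to
    seq: int         # assumed per-packet sequence number
    last: bool       # assumed marker for the final cell of a packet
    payload: bytes


def reassemble(cells: Iterable[Cell]) -> bytes:
    """Rebuild a packet image from cells delivered in order by the fabric."""
    parts = []
    expected = 0
    for cell in cells:
        if cell.seq != expected:
            raise ValueError(f"cell {cell.seq} arrived, expected {expected}")
        parts.append(cell.payload)
        expected += 1
        if cell.last:
            # Final cell seen: the packet image is complete.
            return b"".join(parts)
    raise ValueError("packet incomplete: no final cell seen")
```

The sketch relies on the in-order delivery stated above and therefore
treats any gap in the sequence as an error rather than reordering.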
[0144] Those of ordinary skill in the art will recognize that
packet transmission and reception procedures are not limited to
complete packet images in RAM. For example, on an NM, packet
information flowing to the fabric may be provided, in some
embodiments, directly from a network interface included on the NM,
without intermediate storage in RAM. Similarly, packet information
flowing from the fabric may, in some embodiments, be provided
directly to the network interface without intermediate storage in
RAM. The same techniques may be used on an FCM with respect to
information flowing to and from the storage interface. Those of
ordinary skill in the art will also recognize that the sans-header
intermediate cells, Cell C.sub.1 313.1' through Cell C.sub.M
313.M', are only conceptual in nature as some embodiments implement
cellification without copying packet data. Instead, packet data is
accessed in-situ in cell-sized chunks and encapsulation performed
on-the-fly.
[0145] The fabric interface communication units included in TMs and
VIOCs further include logic adapted to allocate and manage
bandwidth and priority for various flows as identified by any
combination of classification information, Tag, and a sub-port
portion of a fabric egress port address. This enables provisioning
of bandwidth and setting of priorities for transport according to
operational requirements.
[0146] In one embodiment, the particular priority at which
transport is performed is selectively determined by examination of
the sub-port portion of the fabric egress port address. For
example, NM 130A may be configured with a high-priority queue and a
low-priority queue, having corresponding high-priority and
low-priority sub-port addresses. With respect to NM-PMM Data
Exchange 213, for example, a portion of the data exchange may be
performed via the low priority queue and another portion performed
via the high-priority queue. A process desiring selective access to
the high-priority queue and low-priority queue addresses packets
accordingly, providing a corresponding high-priority packet
destination address to high-priority packets and a corresponding
low-priority packet destination address to low-priority packets.
During transmission to the fabric, the high-priority packet address
and the low-priority packet address are resolved by the appropriate
VIOC on PMM 150B to a corresponding high-priority fabric egress
port address and a corresponding low-priority egress port address.
The high-priority egress port address and the low-priority egress
port address include a fabric port number sub-portion that is
identical for the two egress port addresses, since both packets are
destined to the same pluggable module. However, the sub-port
portion is different to distinguish between high and low
priorities.
[0147] Upon receipt in the NM of high-priority cells and
low-priority cells (corresponding to cells from packets addressed
to the high-priority queue and the low-priority queue,
respectively), the TM on NM 130A examines the sub-port portion of
the fabric egress port addresses provided with each cell, and
selectively identifies packets as associated with the high-priority
queue or the low-priority queue as appropriate.
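One way to picture the port number and sub-port portions of a fabric
egress port address is as two bit fields packed into one integer. The
following Python sketch is illustrative only: the 4-bit sub-port
width, the numeric sub-port codes, and all function names are
assumptions and are not taken from the described embodiments.

```python
from typing import Tuple

SUBPORT_BITS = 4                       # assumed field width
SUBPORT_MASK = (1 << SUBPORT_BITS) - 1
HIGH_PRIORITY, LOW_PRIORITY = 0, 1     # hypothetical sub-port codes


def make_egress_address(port: int, subport: int) -> int:
    """Pack a fabric port number and a sub-port into one egress address."""
    if port < 0 or not 0 <= subport <= SUBPORT_MASK:
        raise ValueError("port or sub-port out of range")
    return (port << SUBPORT_BITS) | subport


def split_egress_address(address: int) -> Tuple[int, int]:
    """Return the (fabric port number, sub-port) portions of an address."""
    return address >> SUBPORT_BITS, address & SUBPORT_MASK


def queue_for_cell(address: int) -> str:
    """Receive side: choose a queue by examining only the sub-port portion."""
    _, subport = split_egress_address(address)
    return "high" if subport == HIGH_PRIORITY else "low"
```

Two addresses built for the same module with different sub-port
codes share the fabric port number sub-portion, so the fabric routes
both to the same port while the receiver separates them by priority.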
[0148] In another embodiment, the sub-port portion may also include
a bandwidth allocation portion to identify one of a plurality of
bandwidth allocations to associate with the packet assembled from
the cell. Still other embodiments provide for combining priority
and bandwidth allocation dependent processing according to the
sub-port portion of the fabric egress port address. Those of
ordinary skill in the art will recognize that these techniques are
not limited to communication between NMs and PMMs as in the
example, but may be used between all modules operating as fabric
clients (such as FCMs, SCMs, and OLBs). FCMs, for example, may
provide for allocation of bandwidth to various coupled storage
devices or networks via the sub-port mechanism.
[0149] In one embodiment, the fabric interface communication units
of TMs and VIOCs provide hardware support for a reliable message
protocol in addition to packet communication. State machines
implement a connection-oriented procedure including establishing a
connection via a connection request and a corresponding
acknowledgement, sending and receiving one or more messages using
the established connection, and terminating the connection after it
is no longer needed. Delivery of message content is guaranteed,
using a limited number of retries; otherwise an error is returned
to the sender. In operation, message images are constructed similarly
to packet images, with an indicator included in the message image
identifying the image as a message instead of a packet. The message
image includes a message header, similar in format to a packet
header, and message body data, similar to packet body data. The
message body data is communicated in the same manner as packet body
data. The message header includes a message destination address
similar in format to a packet destination address. The message destination
address is resolved into a cell destination address for inclusion
in the cells during cellification, similar to the resolution of a
packet destination address.
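The connection-oriented procedure with a limited number of retries
may be sketched in software as follows, although the text describes
hardware state machines. The class and method names, the two-state
model, the message kinds, and the transmit callback (assumed to
return True when the peer acknowledges) are all hypothetical.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    CLOSED = auto()
    CONNECTED = auto()


class DeliveryError(Exception):
    """Returned to the sender when the retries are exhausted."""


class ReliableSender:
    def __init__(self, transmit: Callable[[str, bytes], bool],
                 max_retries: int = 3):
        self._transmit = transmit        # True means the peer acknowledged
        self._max_retries = max_retries
        self.state = State.CLOSED

    def _send_with_retry(self, kind: str, body: bytes = b"") -> None:
        # One initial attempt plus a limited number of retries.
        for _ in range(1 + self._max_retries):
            if self._transmit(kind, body):
                return
        raise DeliveryError(
            f"{kind} not acknowledged after {self._max_retries} retries")

    def connect(self) -> None:
        self._send_with_retry("CONNECT")  # request plus acknowledgement
        self.state = State.CONNECTED

    def send(self, body: bytes) -> None:
        if self.state is not State.CONNECTED:
            raise RuntimeError("no established connection")
        self._send_with_retry("MESSAGE", body)

    def close(self) -> None:
        if self.state is State.CONNECTED:
            self._send_with_retry("DISCONNECT")
            self.state = State.CLOSED
```

A caller would invoke connect, then send one or more times, then
close, handling DeliveryError as the error returned to the sender.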
[0150] FIG. 3B illustrates Prioritized Fabric Communication 300B,
conceptually showing selected aspects of an embodiment of packet
and cell prioritized transmission in an ES embodiment, focusing on
transports originating from a VIOC. VIOCs 301.5 and 301.6, along
with TM 302, are coupled to Primary Switch Fabric Module 180A, via
couplings 159A, 169A, and 139A, respectively. A VIOC typically is
closely associated with 16 groups of four transmit queues each,
providing a total of 64 transmit queues. Here in this initial
illustrative embodiment the transmit queues are conceptualized as
existing within the VIOC. In a later discussed illustrative
embodiment, the transmit queues physically reside in shared
portions of the host memory, although their state is managed within
the VIOC and portions are buffered within the VIOC.
[0151] A subset of three transmit queues are shown conceptually in
VIOC 301.6, as Q.sub.1 309.1, Q.sub.2 309.2, and Q.sub.3 309.3.
Each transmit queue is configurable as operating according to a
specified priority or according to a specified bandwidth. For
example, Q.sub.1 309.1 may be configured to operate at a strictly
higher priority than Q.sub.2 309.2 and Q.sub.3 309.3, or the queues
may be configured to operate in a round-robin priority with respect
to each other. As another example, Q.sub.1 309.1 may be configured
to operate at twice the bandwidth of Q.sub.2 309.2 and Q.sub.3
309.3. As another example, Q.sub.1 309.1 may be configured for a
first maximum bandwidth and a first minimum bandwidth, Q.sub.2
309.2 may be configured for a second maximum and a second minimum
bandwidth, and Q.sub.3 309.3 may be configured for third maximum
and minimum bandwidths. Those of ordinary skill in the art will
recognize that many other priority and bandwidth schemes are
possible.
[0152] In addition to transmit queues, VIOCs typically implement
virtual output queues (VOQs) to prevent head of line blocking, in
order to maximize the bandwidth of transmission to the fabric. In
an illustrative embodiment discussed herein below, the VOQs are
implemented as pointer managed buffers within an egress shared
memory internal to the VIOC. A subset of VOQs in VIOC 301.6 are
illustrated as VOQ.sub.1 308.1 and VOQ.sub.2 308.2, one for each of
the two illustrated destinations (VIOC 301.5 and TM 302). The VOQs
are processed according to configurable priority algorithms,
including a straight (or strict) priority algorithm, a straight
round-robin algorithm (without weights), a weighted round-robin
algorithm, and a weighted round-robin algorithm with configurable
weights, according to embodiment.
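As one hedged illustration of the weighted round-robin option named
above, the following Python sketch serves up to a configured number
of entries from each VOQ per round. The queue names, the weights, and
the generator name are assumptions for illustration; the text does
not specify how any of the listed algorithms is implemented.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


def weighted_round_robin(
    voqs: Dict[str, Deque[str]],
    weights: Dict[str, int],
) -> Iterator[Tuple[str, str]]:
    """Yield (voq name, entry) pairs in weighted round-robin order."""
    for name in voqs:
        if weights.get(name, 1) < 1:
            raise ValueError(f"weight for {name} must be at least 1")
    # Keep cycling through the VOQs until every one of them is empty.
    while any(voqs.values()):
        for name, queue in voqs.items():
            # Serve at most `weight` entries from this VOQ per round.
            for _ in range(weights.get(name, 1)):
                if not queue:
                    break
                yield name, queue.popleft()
```

Setting every weight to one reduces the sketch to the straight
round-robin case.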
[0153] In operation, a transmit queue selection is made according
to the configured priority and bandwidth. Data is then made
available from the selected queue accordingly, and provided to the
fabric for transfer to the destination according to the virtual
output queue associated with the destination. These procedures
repeat as long as any of the queues are not empty. In the example,
one of Q.sub.1 309.1, Q.sub.2 309.2, and Q.sub.3 309.3 is selected,
and then examined to determine the next packet (or message) to
transmit. For illustrative purposes, Q.sub.1 309.1 is configured as
the highest priority, Q.sub.2 309.2 is the next highest priority,
and Q.sub.3 309.3 is configured as the lowest priority.
[0154] A first selection and subsequent transmission, shown
conceptually as Packet Transmission Path 317.1, is made from
Q.sub.1 309.1, as it is configured as the highest priority, and
includes at least one packet ready for transmission. Data is read
according to Q.sub.1 309.1, and determined to be destined for the
fabric egress port associated with PMM-Fabric coupling 159A. Data
is transferred to Primary Switch Fabric Module 180A under the
control of VOQ.sub.1 308.1, and further transferred to PMM-Fabric
coupling 159A.
[0155] Subsequently, a second selection and transmission, shown
conceptually as Packet Transmission Path 317.2, is made from
Q.sub.2 309.2, as it is configured as the next highest priority, it
includes at least one packet ready for transmission, and Q.sub.1
309.1 is empty. Data is read according to Q.sub.2 309.2, determined
to be destined for PMM-Fabric coupling 159A, provided to Primary
Switch Fabric Module 180A under the control of VOQ.sub.1 308.1, and
transferred to PMM-Fabric coupling 159A.
[0156] Subsequently, a third selection and transmission, shown
conceptually as Packet Transmission Path 317.3, is made from
Q.sub.3 309.3, as it is configured as the lowest priority, it is
not empty, and Q.sub.1 309.1 and Q.sub.2 309.2 are empty. Data is
read according to the selected transmit queue (Q.sub.3 309.3) as in
the previous two scenarios, but the destination is determined to be
the fabric egress port associated with NM-Fabric coupling 139A, and
therefore data is provided to the fabric under the control of
VOQ.sub.2 308.2.
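The three selections described above can be mirrored by a short
Python sketch of strict-priority transmit queue selection followed by
steering to the VOQ associated with each destination. The data
layout (a list of queues ordered highest priority first, entries as
port and packet pairs) and the function name are assumptions made
for illustration only.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Entry = Tuple[str, str]  # (fabric egress port, packet label), assumed layout


def drain_strict_priority(
    transmit_queues: List[Deque[Entry]],
    voq_for_port: Dict[str, str],
) -> List[Tuple[int, str, str]]:
    """Drain queues highest priority first; report the VOQ used per packet."""
    order = []
    while True:
        # Select the highest-priority transmit queue that is not empty.
        index = next(
            (i for i, q in enumerate(transmit_queues) if q), None)
        if index is None:
            return order  # every transmit queue is empty
        port, packet = transmit_queues[index].popleft()
        # The destination port determines which VOQ controls the transfer.
        order.append((index, voq_for_port[port], packet))
```

With one packet per queue, destined as in the described scenario, the
first two packets pass through the VOQ for the PMM-Fabric coupling
and the third through the VOQ for the NM-Fabric coupling.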
[0157] Although the three selections and transmissions
corresponding to Packet Transmission Paths 317.1, 317.2, and 317.3
have been described as occurring strictly sequentially,
transmission of data from differing virtual output queues may
instead be interleaved on the fabric. For example, transmission of
data from VOQ.sub.1 308.1 (such as Packet Transmission Paths 317.1
or 317.2) may overlap in time with the transmission of data from
VOQ.sub.2 308.2 (such as Packet Transmission Path 317.3). In this
circumstance, cells from the overlapping transmissions are wholly
or partially interleaved on the fabric.
[0158] In addition to prioritized selection among the transmit
queues, the switch fabric also typically provides for prioritized
transport. Each cell may specify a priority, and in one embodiment
there are four priority levels available. The cell priority is
developed from any combination of various parameters, including
packet size, packet type, packet class of service, packet quality
of service, transmit queue priority, and other packet header
information. As shown in the figure, Cell Transmission Path 318.1
provides for transmission of cells from VIOC 301.6 to VIOC 301.5,
and Cell Transmission Path 318.2 provides for transmission of cells
from VIOC 301.6 to TM 302. Each of Paths 318.1 and 318.2 may
transfer cells according to any of the four priorities. For
example, cells corresponding to Packet Transmission Path 317.1 may
be transferred at the highest priority, while cells corresponding
to Packet Transmission Path 317.2 may be transferred at a lower
priority.
[0159] Those of ordinary skill in the art will recognize that these
described scenarios are examples only, as entries in the transmit
queues may be destined for any of the egress ports of the fabric,
and therefore any of the corresponding VOQs. In addition, dynamic
transmit queue priorities (such as round-robin) and bandwidth
allocation associated constraints and processing may result in data
from various transmit queues being intermingled in various VOQs,
although all of the data for a given packet will necessarily be
present in the same VOQ, since the packet is destined for only a
single fabric egress port. Furthermore, fabric transport priorities
may vary amongst entries in the VOQs, such that the switch fabric
traffic typically appears intermixed and complicated.
Pluggable Modules
[0160] FIGS. 4A-4E illustrate various embodiments of pluggable
modules included in various ES embodiments. The modules share many
similar characteristics. For example, each of the modules includes
a fabric interface communication unit included in a TM or a VIOC.
Each of the modules typically includes one or more computation and
memory elements. Couplings between elements of the modules
typically operate in the same or substantially similar fashion.
These similarities are illustrated by common element presentation
and identification, as follows.
[0161] RAM elements are shown with identifiers prefixed with 411,
and these elements are typically Dynamic Random Access Memories
(DRAMs) organized as Dual Inline Memory Modules (DIMMs) in some
embodiments. CPU elements are shown with identifiers prefixed with
410, and these elements are typically Opteron processors. VIOC
identifiers are prefixed with 301. Elements representing combined
Ternary Content Addressable Memory (TCAM) and Static Random Access
Memory (SRAM) identifiers are prefixed with 403. BMC elements are
prefixed with 402. FCI elements are prefixed with 413, and the
associated optional coupling identifiers are prefixed with 414.
HyperTransport (HT) channel couplings are shown with identifiers
prefixed with 460.
[0162] Those of ordinary skill in the art will recognize that while
the aforementioned elements shown with common prefix identifiers
are closely related to each other in operation and function, they
are not identical, and may differ in various characteristics.
Differences may include size, organization, bandwidth, capacity, or
speed, according to embodiment. Such differences have no effect on
the overall operation of the associated pluggable module.
[0163] FIG. 4A illustrates SIM Detail 400A, including selected
aspects of an embodiment of a SIM configured as a pluggable module
including an SCM and an associated SFM. It will be understood that
the discussion of FIG. 4A is made with respect to the capabilities
and topology of the primary SIM and primary fabric, but the
discussion equally describes the redundant topology and latent
capabilities of the secondary SIM and secondary fabric. As
discussed elsewhere herein, the secondary fabric remains dormant
with respect to non-control dataplane functionality as long as the
primary fabric operates properly. Primary SCM 140A includes compute
and associated memory elements CPU 410.4L/RAM 411.4L coupled to CPU
410.4R/RAM 411.4R via HT coupling 460.4L. VIOC 301.4 is coupled to
CPU 410.4R via HT coupling 460.4R. VIOC 301.4 is in communication
with TCAM/SRAM 403.4 and provides a fabric interface for
SCM-Fabric coupling 149A. Management I/O 412 is coupled to CPU
410.4L via HT coupling 460.4M and provides an interface to the
intra-chassis BMCs via coupling 452. Primary SCM 140A also includes
BMC 402.4 coupled to VIOC 301.4 and Management I/O 412. As
discussed in more detail later herein, Mass Storage 412A is coupled
to Management I/O 412 via coupling 453 and provides local mass
storage.
[0164] Primary Switch Fabric Module 180A includes Primary SFM
Dataplane (SFDP) 404 having a plurality of fabric ports with
respective fabric port addresses. The fabric ports are coupled to
the various system modules via SCM-Fabric coupling 149A, FCM-Fabric
coupling 129A, NM-Fabric coupling 139A, PMM-Fabric coupling 159A,
and OLB-Fabric coupling 169A. Each module in the system may be
coupled to one or more of the fabric ports and at least some of the
foregoing illustrated fabric couplings represent more than one
full-duplex fabric coupling to the Primary Switch Fabric Module
180A. For example, in one embodiment, there may be up to two PMM
modules and each PMM module has two full-duplex fabric couplings.
Thus PMM-Fabric coupling 159A may be representative of four
full-duplex fabric couplings to four respective fabric ports, each
having a respective fabric port address. Each module or portion
thereof having its own fabric coupling to a fabric port is
addressable via the corresponding fabric port address. Primary
Switch Fabric Module 180A also includes Primary Switch Fabric
Scheduler 401 coupled to Primary SFDP 404.
[0165] In operation, SFDP 404, under the direction of Switch Fabric
Scheduler 401, routes data as cells provided to a fabric ingress
port to a fabric egress port, according to a cell destination
address, as described elsewhere herein. Each of dataplane couplings
149A, 129A, 139A, 159A, and 169A couples with a respective fabric
ingress port and a respective fabric egress port.
[0166] Primary SCM 140A executes any combination of management,
controlplane, and load balancing processes using compute and memory
resources provided by CPU 410.4L/RAM 411.4L and CPU 410.4R/RAM
411.4R. The CPUs operate as a single SMP complex, communicating
shared memory coherency and cache memory coherency transactions via
HT coupling 460.4L.
[0167] VIOC 301.4 operates as an intelligent I/O device responding
to commands from the CPUs, typically originating from a Driver
process. A Driver process executing on one of the CPUs forms a
packet image in one of the RAMs, including specifying a destination
address for the packet, and then notifies the VIOC that a new
packet is available for transmission. In response, the VIOC fabric
interface communication transmit unit directly accesses the packet
image from RAM via an included transmit Direct Memory Access (DMA)
unit. The VIOC examines the packet header and identifies the packet
destination address. The transmission of packets as cells proceeds
without direct assistance from any of the processes executing on
the CPUs. The packet address and other associated information are
referenced in accessing forwarding and state information maintained
in TCAM/SRAM 403.4 to determine the corresponding fabric egress
port address and other related information for inclusion in headers
of cells provided to the fabric to transmit the packet as
cells.
[0168] VIOC 301.4 also operates as an intelligent I/O device in the
reverse direction, in a conceptually symmetric fashion. Cells are
received from the fabric and reassembled as packets by the VIOC
fabric interface communication receive unit. In one embodiment the
packet data is partially reassembled directly into a packet image
stored in one of the RAMs via an included receive DMA unit. The
reception of packets as cells proceeds without direct assistance
from any of the processes executing on the CPUs. The VIOC notifies
one or more of the CPUs that new packet data is available, and
subsequently a process, typically a Driver process, accesses the
packet image provided in RAM and processes it accordingly.
[0169] The management process (or processes) executing on the CPUs
of Primary SCM 140A communicate management and configuration
control information via Management I/O 412 between Primary SCM 140A
and other modules via coupling 452 coupled to BMCs included in PMM
150A, NM 130A, FCM 120A, OLB 160A, and Primary SCM 140A (local BMC
402.4). This communication is typically via a dedicated management
Ethernet network, and is consequently out-of-band with respect to
Primary Switch Fabric Module 180A. BMC 402.4 provides baseboard
management functions, communicating with Management I/O 412 and
VIOC 301.4. The processes executing on the CPUs of Primary SCM 140A
collect selected management information from all BMCs in the server
and in response to the collected information and provisioning
commands received from elsewhere, provide management and
configuration commands to the BMCs. Management I/O 412 also
communicates configuration and control information via coupling 451
between management and controlplane processes executing on the CPUs
and Switch Fabric Scheduler 401. This provides, for example, for
static or dynamic configuration of the SCMs, one as the Primary SCM
and the other as the Redundant SCM.
[0170] In selected embodiments, a mass storage device (such as Mass
Storage 412A) or storage network interface is available locally to
provide boot images, management information, provisioning
information, or log file storage. Mass Storage 412A may include any
combination of mass storage device types including Flash memory,
Magnetic Disk memory, and Optical Disk memory. The mass storage
devices may be coupled via any combination of storage interface
types including but not limited to PC Card, Compact Flash,
Multi-Media Card, Memory Stick, Smart Card, Secure Digital,
Universal Serial Bus (USB), FireWire, SCSI (Small Computer System
Interface), IDE (Integrated Device Electronics), EIDE (Enhanced
IDE) and variations and successors thereof. In other embodiments,
the local mass storage is omitted, and this data is accessed from
mass storage devices or networks remotely via FCMs 120 or NMs
130.
[0171] FIG. 4B illustrates PMM Detail 400B, including selected
aspects of an embodiment of a PMM configured as a pluggable module.
The PMM is arranged as a pair of identical sections, Half-PMM 430
and Half-PMM 430'. Each section includes two CPU/RAM elements
coupled to each other by HT links, a VIOC/TCAM/SRAM element
interfacing to a fabric coupling, and an optional FCI. The coupling
of these elements is substantially similar to corresponding
elements of Primary SCM 140A, except that Management I/O 412 is
omitted. The two Half-PMMs share BMC 402.5.
[0172] The two Half-PMMs are coupled to each other by a pair of HT
links (HT coupling 460.5X and HT coupling 460.5Y). One of the CPUs
of each half also provides an HT interface for coupling to another
PMM (such as PMM 150B of FIG. 2) via CSFI-PMM coupling 179A and
CSFI-PMM coupling 179A'. As discussed in more detail below, in one
embodiment these couplings are coupled directly to another
identically configured PMM, and in other embodiments these
couplings are coupled indirectly to another PMM via CSFI 170 (with
variations illustrated in FIG. 1A and FIG. 2). Shared memory
coherency and cache memory coherency transactions are communicated
over the HT couplings internal to the PMM (460.5L, 460.5X, 460.5L',
and 460.5Y) and over HT couplings external to the PMM (179A, and
179A').
[0173] The HT couplings communicating shared memory coherency and
cache memory coherency transactions and CSFI 170 are
programmatically configurable to provide for physical partitioning
of the CPU/RAM elements of PMMs. The PMM is configured as a single
4-way physical partition by programming the internal HT links
(460.5L, 460.5X, 460.5L', and 460.5Y) for coherent operation, and
programming the external HT links (179A, and 179A') for "isolated"
operation (i.e. links 179A and 179A' are disabled). In another
embodiment, isolating a PMM for configuration as a single 4-way
physical partition (or as two 2-way physical partitions) is
performed by programmatically configuring CSFI 170 (of FIG. 1A) to
isolate the PMM from other PMMs.
[0174] The PMM is configured as a pair of identical 2-way physical
partitions (Half-PMM 430 and Half-PMM 430') by programmatically
configuring a portion of the internal HT links (460.5L, and
460.5L') for coherent operation, and another portion of the
internal HT links (460.5X, and 460.5Y) for isolated operation. The
external HT links (179A, and 179A') or CSFI 170 are also programmed
for isolated operation.
[0175] A plurality of PMMs are configured as a single unified 8-way
physical partition by programmatically configuring all of the
internal and external HT links of all of the PMMs (and also CSFI
170, depending on the embodiment) for coherent operation. Those of
ordinary skill in the art will recognize that an arbitrary number
and arrangement of physical partitions may be configured from PMMs
150, as limited only by the available resources, using the
techniques described.
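As a non-limiting illustration, the link programming for the 2-way, 4-way, and 8-way partitions described above can be sketched as a lookup from partition size to per-link mode. The link names reuse the reference numerals from the text; the `LinkMode` enum and the `link_modes` function are hypothetical names introduced only for this sketch, and CSFI 170 programming is not modeled.

```python
from enum import Enum
from typing import Dict


class LinkMode(Enum):
    COHERENT = "coherent"
    ISOLATED = "isolated"


# Link names reuse the reference numerals from the text.
INTERNAL_WITHIN_HALF = ("460.5L", "460.5L'")    # links inside each Half-PMM
INTERNAL_BETWEEN_HALVES = ("460.5X", "460.5Y")  # links joining the two halves
EXTERNAL = ("179A", "179A'")                    # links toward another PMM


def link_modes(ways: int) -> Dict[str, LinkMode]:
    """Return the mode of each HT link for a 2-, 4-, or 8-way partition."""
    if ways not in (2, 4, 8):
        raise ValueError("this sketch models only 2-, 4-, and 8-way partitions")
    # Links inside a Half-PMM are coherent in every configuration.
    modes = {name: LinkMode.COHERENT for name in INTERNAL_WITHIN_HALF}
    # Links between the halves are coherent for 4-way and 8-way partitions.
    between = LinkMode.COHERENT if ways >= 4 else LinkMode.ISOLATED
    # External links are coherent only when two PMMs form one 8-way partition.
    external = LinkMode.COHERENT if ways == 8 else LinkMode.ISOLATED
    modes.update({name: between for name in INTERNAL_BETWEEN_HALVES})
    modes.update({name: external for name in EXTERNAL})
    return modes
```

For example, `link_modes(4)` marks all four internal links coherent and both external links isolated, matching the single 4-way partition case.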
[0176] In operation, each PMM is programmatically partitioned
according to provisioning information. Physical partitions can be
established that have one-half of a PMM (2-way), a single PMM
(4-way), or two PMMs (8-way). It will be understood that the number
of SMP-ways per half of a PMM is merely illustrative and not
limiting, as is the configurable topology for aggregation of
SMP-ways. Application, Driver, and OS processes are executed on the
resultant physical partitions. Each resultant physical partition
always includes at least one VIOC. The VIOC provides for
communication between the executing processes and other clients of
the fabric via packet images in memory, operating as described
elsewhere herein.
[0177] In selected embodiments, one or both of optional FCIs 413.5
and FCI 413.5' are included, to access boot images or related
information, via couplings 414.5 and 414.5' and FCIs 413.5 and
413.5', from either a local mass storage device or via a mass
storage network. In other embodiments, the optional FCIs are
omitted, and this data is accessed via the fabric from mass storage
devices or networks via fabric-coupled FCMs 120 or NMs 130.
[0178] As suggested by the CSFI variations illustrated by FIGS. 1A
and 2, those of ordinary skill in the art will recognize that CSFI
170 may be wholly or partially implemented on the SIM, on the PMM,
on a separate module, or any combination of SIM, PMM, and separate
module, or any other convenient location. The coherent switch
functionality may be implemented in conjunction with the HT links
on the PMM, or implemented independently of these functions without
substantially modifying the operation. In some embodiments, CSFI
170 is limited to interconnect operating in conjunction with
coherency and switching logic implemented internal to the CPU
elements included on the PMMs. In other embodiments, CSFI 170
includes some portion of coherency and switching logic operating in
conjunction with coherency and switching logic included on the
PMMs.
[0179] FIG. 4C illustrates NM Detail 400C, including selected
aspects of an embodiment of a Network Module (NM) configured as a
pluggable module. NM 130A includes media interface hardware
specific to a particular type of network coupling (Interfaces 420
and 419 for couplings 426 and 427, respectively), coupled to
network processing elements adapted for packet processing,
including Packet Classification and Editor (PCE 417) and associated
CAM 418, coupled in turn to Traffic Manager (TM 302). TM 302 is in
communication with RAM 416, and provides a fabric interface for
NM-Fabric coupling 139A. Control Processor (CP) 429 is coupled to
PCE 417 and TM 302. NM 130A also includes BMC 402.3, coupled to CP
429, PCE 417 and TM 302. The BMC provides an interface for coupling
452. While the illustrated embodiment shows CP 429 coupled to
Management I/O 412 indirectly via BMC 402.3, in alternate
embodiments the CP is coupled to the Management I/O via a coupling
shared with the BMC, and in further alternate embodiments the CP is
coupled to the Management I/O via a dedicated (i.e. not shared)
coupling.
[0180] In operation, information (typically in the form of packets)
communicated with a network device (typically external to the
ES1) coupled to coupling 426 is processed at a low level and in an
interface-specific manner by Interface 420 (the operation of
coupling 427 and Interface 419 is substantially similar). Packets
received from the network device are provided to PCE 417 for
classification and Tag determination, as described elsewhere
herein. The packet data and Tag are stored in RAM 416, and provided
to the fabric as cells by TM 302 via NM-Fabric coupling 139A. In
the reverse direction, cells are reassembled by TM 302 as received
from the fabric via NM-Fabric coupling 139A, and the resultant
packets are stored in RAM 416. PCE 417 reads the stored packet
data, and dynamically modifies it according to any associated Tag
information, providing the result to Interface 420 for transmission
to the network device via coupling 426.
[0181] TM 302 operates as a fabric interface communication unit,
and includes a fabric interface communication transmit unit that
directly accesses the packet image from RAM via an included DMA
unit. The TM examines the packet header and identifies the packet
destination address. The packet address and other associated
information are referenced in accessing routing and state
information maintained in one or more of CAM 418 and RAM 416. The
resultant fabric egress port address and other related information
are included in headers of cells provided to the fabric to transmit
the packet as cells.
[0182] TM 302 also includes a fabric interface communication
receive unit that operates in a conceptually symmetric fashion.
Cells are received from the fabric and reassembled as packets
stored into RAM 416 via an included DMA unit. The TM notifies the
PCE as new packet data becomes available for editing and transport
to Interface 420.
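The transmit and receive behavior of the fabric interface can be pictured with the following minimal sketch, which segments a packet into cells carrying a looked-up egress port and reassembles them on the receive side. The cell payload size, the `Cell` fields, and the dictionary standing in for the CAM/RAM lookup are assumptions made for illustration; the text does not specify a cell format.

```python
from dataclasses import dataclass
from typing import Dict, List

CELL_PAYLOAD_BYTES = 64  # assumed payload size; the text does not give one


@dataclass
class Cell:
    egress_port: int  # fabric egress port address found by the lookup
    seq: int          # position of this cell within its packet (assumed field)
    last: bool        # True for the final cell of the packet (assumed field)
    payload: bytes


def segment(packet: bytes, dest: str, routes: Dict[str, int]) -> List[Cell]:
    """Look up the egress port for dest and split the packet into cells."""
    port = routes[dest]  # stands in for the CAM/RAM routing lookup
    chunks = [packet[i:i + CELL_PAYLOAD_BYTES]
              for i in range(0, len(packet), CELL_PAYLOAD_BYTES)] or [b""]
    return [Cell(port, n, n == len(chunks) - 1, chunk)
            for n, chunk in enumerate(chunks)]


def reassemble(cells: List[Cell]) -> bytes:
    """Rebuild the packet image from its cells, in sequence order."""
    ordered = sorted(cells, key=lambda cell: cell.seq)
    if not ordered or not ordered[-1].last:
        raise ValueError("incomplete packet")
    return b"".join(cell.payload for cell in ordered)
```

Every cell of a packet carries the same egress port, so the fabric can forward each cell without consulting the packet header again.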
[0183] CP 429 manages various HW resources on the NM, including PCE
417 and TM 302, and respective lookup elements CAM 418 and RAM 416.
The CP receives management information via coupling 452 (either
indirectly via the BMC or directly via a Management I/O coupling,
according to embodiment) and programs lookup, forwarding, and data
structure information included in CAM 418 (such as associatively
searched information) and RAM 416 (such as trie table
information).
[0184] FIG. 4D illustrates FCM Detail 400D, including selected
aspects of an embodiment of an FCM configured as a pluggable
module. FCM 120A includes Fibre Channel compatible couplings 428.1A
through 428.4B, coupled in pairs to Fibre Channel interface
Processors (FCPs 423.1-423.4). The FCPs are in turn coupled to a
compute and storage element including Input Output Processor (IOP
421) and associated RAM 422, coupled in turn to VIOC 301.2. The
VIOC provides a fabric interface for FCM-Fabric coupling 129A. FCM
120A also includes BMC 402.2 coupled to VIOC 301.2 and providing an
interface for coupling 452.
[0185] In operation, information communicated between Fibre Channel
compatible devices or networks coupled to couplings 428.1A through
428.4B is processed in a low-level manner by FCPs 423.1-423.4.
Information received from external storage devices is typically
stored as packets in RAM 422. After optional processing by IOP 421,
packet data is transmitted as cells to the fabric by the fabric
interface communication transmit unit of VIOC 301.2 via FCM-Fabric
coupling 129A (as described elsewhere herein). In the reverse
direction, cells are reassembled by VIOC 301.2 as received from the
fabric via FCM-Fabric coupling 129A, and the resultant packets are
stored in RAM 422. In one embodiment, IOP 421 reads the stored
data, providing the end result to FCPs 423.1-423.4 for transmission
to the coupled device or network. In an alternate embodiment, FCPs
423.1-423.4 access the data directly via DMA.
[0186] FIG. 4E illustrates OLB Detail 400E, including selected
aspects of an embodiment of an OLB configured as a pluggable
module. OLB 160A is similar in many respects to Primary SCM 140A,
and includes compute and associated memory elements CPU 410.6L/RAM
411.6L coupled to CPU 410.6R/RAM 411.6R via HT coupling 460.6L.
VIOC 301.6 is coupled to CPU 410.6R via HT coupling 460.6R. VIOC
301.6 is in communication with TCAM/SRAM 403.6 and provides a
fabric interface for OLB-Fabric coupling 169A. PCI sub-module 425
and HT sub-module 424 are optionally included in various
combinations and configurations in several embodiments to provide
additional service-specific computational capabilities as service
acceleration modules. OLB 160A also includes BMC 402.6 coupled to
VIOC 301.6 and providing an interface for coupling 452.
[0187] PCI sub-module 425 includes a PCI interface for interfacing
PCI-adapter based devices to HT coupling 460.6R. HT sub-module 424
includes an HT interface for coupling to CPU 410.6R via coupling
460.6X. Various embodiments of PCI sub-modules and HT sub-modules
further include any number and combination of service-specific
hardware accelerators according to implementation, such as an SLB
hardware accelerator, an SSL hardware accelerator, and an XML
hardware accelerator.
[0188] In operation, OLB 160A executes any combination of service
processes (relating to SLB, SSL, or XML, for example) using compute
and memory resources provided by CPU 410.6L/RAM 411.6L and CPU
410.6R/RAM 411.6R. The CPUs operate as a single SMP complex,
communicating shared memory coherency and cache memory coherency
transactions via HT coupling 460.6L. The VIOC provides for
communication between the executing processes and other clients of
the fabric via packet images in memory, operating as described
elsewhere herein. Service processes executing on embodiments
including any combination of PCI sub-module 425 and HT sub-module
424 access elements of the sub-modules in order to accelerate
processing related to the service processes.
Software Layers
[0189] FIG. 5A illustrates Application SW Layering 500A, including
selected aspects of embodiments of SW layers for executing on
application processor resources, such as CPUs included on PMMs, of
an ES embodiment. The layers are represented in a first context for
execution on P.sub.3 203, and in a second context for execution on
P.sub.1 201. As described elsewhere herein, P.sub.3 203 and P.sub.1
201 correspond to distinct physical partitions configured from one
or more PMMs.
[0190] Although Application SW Layering 500A is illustrated as
representative of a collection of code images, in some contexts it
may be useful to consider it as conceptually representative of
processes, or groups of processes associated with each of the
illustrated elements. For example, Hypervisor1 510 represents a
code image of a specific Hypervisor, but it may also be considered
conceptually representative of all processes and related execution
threads associated with executing any portion of the Hypervisor
code image. In typical embodiments, a plurality of concurrent
execution streams co-exists and cooperates while executing portions
of the code image. Similarly, OS1/Drivers 1 507, App1 501, and so
forth, may be considered representative of groups of respective
processes associated with each of the respective SW elements.
[0191] Illustrated at the highest (most fundamental and privileged)
SW layer level is a Hypervisor layer, as shown by Hypervisor1 510
and Hypervisor2 520. Hypervisors typically provide a software
environment for executing a plurality of OS instances in an
apparently concurrent manner via timesharing on a shared hardware
resource, such as P.sub.3 203 or P.sub.1 201, as illustrated in the
figure. At the middle SW layer level is an OS layer. As shown,
Hypervisor1 510 provides two logical partitions, one for
OS1/Drivers 1 507, and another for OS2/Drivers 2 508. Hypervisor2
520 provides a single logical partition for OS3/Drivers 3 509.
[0192] Illustrated within the OS layer are Drivers, including VIOC
and VNIC Drivers. A VIOC Driver provides an interface between
management and controlplane processes and VIOCs. VIOC Drivers
include VIOC Driver 1 511 in OS1/Drivers 1 507, VIOC Driver 2 512
in OS2/Drivers 2 508, and VIOC Driver 3 513 in OS3/Drivers 3 509.
VIOC Drivers, in some embodiments, are customized according to OS
environment, such that VIOC Driver 1 511, VIOC Driver 2 512, and
VIOC Driver 3 513 may be distinct if OS1/Drivers 1 507, OS2/Drivers
2 508, and OS3/Drivers 3 509 or associated environments are
distinct. An OS layer may include one or more VIOC Drivers,
depending on embodiment.
[0193] A VNIC Driver provides an interface between processes
(executing on application processor resources, for example) and
communication resources as provided by VNICs (implemented by VIOCs,
for example). In some embodiments, a VNIC Driver is conceptually
similar to a modified Ethernet Driver. VNIC Drivers include VNIC
Driver 1 514 and VNIC Driver 2 515 in OS1/Drivers 1 507, VNIC
Driver 3 516 and VNIC Driver 4 517 in OS2/Drivers 2 508, and VNIC
Driver 5 518 and VNIC Driver 6 519 in OS3/Drivers 3 509. VNIC
Drivers, in some embodiments, are customized according to OS
environment, such that VNIC Driver 1 514 and VNIC Driver 3 516 may
be distinct if OS1/Drivers 1 507 and OS2/Drivers 2 508 or
associated environments are distinct. In some embodiments, VNIC
drivers are further customized according to OS requirements or
contexts, such that VNIC drivers within the same OS are distinct
(VNIC Driver 1 514 being distinct with respect to VNIC Driver 2
515, for example). An OS layer may include one or more VNIC
Drivers, each having unique functions, parameters, or
customizations, depending on embodiment.
[0194] The lowest SW layer level, the Application layer, is
provided by the OSs for executing Applications. OS1/Drivers 1 507
hosts a plurality of Applications as shown by App1-App3 501-503.
Similarly, OS2/Drivers 2 508 hosts two applications, App4 504, and
App5 505. OS3/Drivers 3 509 hosts a single application, App6
506.
[0195] Execution of SW at the different layers typically results in
a plurality of processes or execution streams, corresponding to
program execution of various portions or sub-portions of code from
the illustrated SW layers. For example, execution of each of
Hypervisor1 510, OS1/Drivers 1 507, and App1 501 may result in a
plurality of concurrently running processes. One example of a
process spawned from OS1/Drivers 1 507 is a Device Driver process,
for example a Device Driver process corresponding to execution of
the modified Ethernet Driver described elsewhere herein.
[0196] In typical embodiments, VIOC Device Drivers are associated
with VIOC management and control, and VNIC Device Drivers are
associated with VNIC management and control. According to various
embodiments, OS1/Drivers 1 507, OS2/Drivers 2 508, and OS3/Drivers
3 509 include various combinations of VIOC and VNIC Drivers. VIOC
Drivers, VNIC Drivers, and associated functions may be wholly or
partially shared and combined according to embodiment.
[0197] Those of ordinary skill in the art will recognize that the
illustrated SW layering is only one example embodiment. Other
embodiments may include more layers (such as Application
sub-layers) or fewer layers (such as dedicated physical partitions
requiring no Hypervisor layer). There is also no restriction as to
the number and arrangement of Hypervisors, OSs, Drivers, and
Applications, other than restrictions imposed by the finite
resources in an embodiment.
[0198] FIG. 5B illustrates Management SW Layering 500B, including
selected aspects of SW layers for executing on management processor
resources, such as processor resources included on SCMs (Primary
SCM 140A, for example), OLBs (OLB 160A, for example), and PMMs (PMM
150A, for example), according to various ES system embodiments.
Generally execution of these SW layers results in various system
control and system operational processes such as management,
controlplane, and load balancing processes. As described elsewhere
herein, the system control and management processes program various
configuration and lookup state information to implement
provisioning, switching, and routing functions in the EF
architecture. Similar to Application SW Layering 500A, Management
SW Layering 500B represents a collection of code images, and each
element may be usefully considered as representative of one or more
processes executing portions of each respective element.
[0199] Enterprise Manager 530 supports multi-chassis management,
complex provisioning, interfaces to client GUIs, and generally
operates at a relatively high level of abstraction, as does CLI
532. Platform Manager 531 generally performs in-chassis (or
single-chassis) management operations and tends to manipulate
system objects directly at a relatively low level of abstraction.
Several SW modules operate in close cooperation with the Platform
Manager, including Chassis Manager (CM) 533, Query Engine 534,
Repository Manager 535, VIOC Manager 536, Interface Manager 537, L2
Forwarding DataBase (FDB) Manager 538, VLAN Manager 539, and Other
Management Code 540.
[0200] CLI 532 generally provides some combination of the CLI
proper and related services to support the interface. CM 533
discovers and manages chassis resources. Query Engine 534 processes
queries relating to persistent state managed by Repository Manager
535. VIOC Manager 536 generally provides an interface between the
system control and management processes and VIOCs in a system.
Interface Manager 537 generally provides an interface between the
system control and management processes and NMs in a system. L2 FDB
Manager 538 provides L2 forwarding information management by
maintaining one or more FDBs. The L2 FDB Manager further manages
dissemination of information related to portions of the FDB
throughout the system as needed (such as updating the forwarding
and state information maintained in TCAM/SRAM elements coupled to
VIOCs). VLAN Manager 539 generally provides system-wide management
functions relating to provisioning VLANs and maintaining VLAN
related information, such as associations between VNICs, VLANs, and
NM ports. Typically a Logical InterFace identifier (LIF) is
allocated per port per VLAN, identifying a "connection" to the
VLAN. Other Management Code 540 generally provides other
management, controlplane, and load balancing functions. The
platform manager and the aforementioned closely cooperating SW
modules are described in more detail elsewhere herein.
[0201] FIG. 5C illustrates BMC SW Layering 500C, including selected
aspects of SW layers for executing on module-level configuration
and management processor resources, such as BMCs (BMC 402.4 of FIG.
4A, for example), according to various ES system embodiments.
Module BMC SW 550 supports module-level operations, typically via
interfaces with HW components on the module the BMC is included on.
The Module BMC SW generally functions at a relatively low level of
abstraction. Similar to Application SW Layering 500A and Management
SW Layering 500B, BMC SW Layering 500C represents a collection of
code images, and each element may be usefully considered as
representative of one or more processes executing portions of each
respective element.
[0202] In some embodiments, optional IPMI Client 551 provides an
interface to IPMI services (typically part of Platform Manager 531)
and in some embodiments serves to export low-level platform
services to various elements of the Management SW. Event Agent 552
monitors module-level information (such as sensors and board
insertion detection logic) to recognize changes in status and
configuration of the module. The Event Agent then communicates
these changes as events delivered to various elements of the
Management SW (such as CM 533). The Selected BMC Event Details
section, included elsewhere herein, provides further details on BMC
events. Command Agent 553 receives BMC commands from various
elements of the Management SW (such as CM 533) and sequences and
forwards the commands. The Command Agent communicates results of
command execution as events via functions provided by the Event
Agent. The Selected BMC Command Details section, included elsewhere
herein, provides further details on BMC commands.
VNIC Overview and Queuing Operation
[0203] As previously described, VNICs provide each processing
element with access via a modified Ethernet Driver to other system
resources connected to the fabric. For example, each VNIC, in
conjunction with transport via the fabric and external
communication via an NM, provides capabilities similar to those of a
conventional NIC, made available to processes by way of a driver
functionally similar to a conventional Ethernet driver. In the
general case, the accessible system resources include networking
interfaces provided by NMs, storage interfaces provided by FCMs,
and computing capabilities provided by SCMs, PMMs, and OLBs. In
some embodiments, VNICs are implemented by VIOCs included in
modules such as PMMs, OLBs, FCMs, and SCMs, thereby providing
processing elements on such modules with access to VNIC
capabilities.
[0204] VNIC operation generally provides for communication of data
directly between processes executing on a Local Processor and the
fabric via Local Processor Memory. The Local Processor Memory is
typically accessible by the Local Processor and, for example, a
VIOC implementing VNIC functionality. Conceptually, a VNIC provides
transmit and receive queues for use by processes executing on a
Local Processor for communication of data (as packets or messages)
to and from the Local Processor Memory.
[0205] As VNICs (like conventional NICs) are bidirectional, VNICs
also provide access to each processing element from other system
resources connected to the fabric. For example, each VNIC on an OLB
provides a separate portal to execution resources provided by the
OLB. A first VNIC on the OLB may be allocated to and accessed by
processes executing on a first provisioned server, while a second
VNIC on the same OLB may be used by a second provisioned server. As
another example, each VNIC on an FCM provides a separate path to
storage resources enabled by the FCM, and each of the separate
paths may be accessed by distinct provisioned servers.
[0206] FIG. 6A illustrates selected aspects of a logical view of an
embodiment of a plurality of VNICs. According to one embodiment,
VIOC 301 implements 16 identical VNICs. In the figure only two of
the VNICs are shown (for clarity), VNIC #1 600.1 and VNIC #16
600.16. Each VNIC, such as VNIC #1 600.1, includes programmable
identification information illustrated as Address Block 601.1, VNIC
configuration registers illustrated as VNIC Configuration Block
618.1, and packet and message data communication interfaces
illustrated as I/O Block 605.1.
[0207] Address Block 601.1 includes Fabric Address 602.1, for
specifying the source fabric address of the VNIC, and MAC Address
603.1, defining the MAC address associated with the VNIC. In some
embodiments, a first portion of Fabric Address 602.1 is common to
some or all of the VNICs of a VIOC, and corresponds to the physical
fabric address of the VIOC. A second portion of Fabric Address
602.1, also known as the fabric sub-address, is unique to each
VNIC. Address Block 601.1 optionally includes, according to
embodiment, Public Internet Protocol (Public IP) Address 604.1, for
specifying the IP address associated with the VNIC, to enable
selected L3 processing.
[0208] Address Block 601.1 also includes a Virtual Local Area
Network (VLAN) Identifier (VLANid 617.1), for use according to
embodiment to map a destination address according to VLAN
information, or as a tag to identify the VNIC with respect to a
VLAN for selected L3 processing, or both. In some embodiments, a
separate L3 VLAN identifier (in addition to VLAN Identifier 617.1)
may be provided for use with selected L3 VLAN operations. Examples
of VLAN processing include formulating a packet or cell header
including a VLAN tag, and filtering incoming traffic with respect
to a VLAN. In various embodiments, programming of VLAN Identifier
617.1 further enables implementation of fabric-port-based, MAC
address-based, IP address-based, and general L3 type VLAN
functions.
[0209] VNIC Configuration Block 618.1 includes VNIC Enable 618.1a,
for enabling (and disabling) the corresponding VNIC, and priority
and bandwidth configuration registers. The priority configuration
registers include Priority Scheme Selection 618.1b and Priority
Weights 618.1c, for specifying priority processing related
information, such as priority scheme selection and weights,
respectively. The bandwidth configuration registers include
Bandwidth Minimum 618.1d and Bandwidth Maximum 618.1e, for
specifying bandwidth allocation and control configurations, such as
minimum and maximum bandwidth allocations, respectively, for the
corresponding VNIC.
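By way of example only, the per-VNIC Address Block and VNIC Configuration Block described above might be modeled as the following records. The field types, the 4-bit sub-address width (chosen because 16 VNICs need four bits), and the helper `make_fabric_address` are assumptions introduced for this sketch and are not register layouts taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SUB_ADDRESS_BITS = 4  # assumed width: enough to distinguish 16 VNICs


def make_fabric_address(vioc_address: int, vnic_index: int) -> int:
    """Combine the VIOC physical fabric address with a per-VNIC sub-address."""
    if not 0 <= vnic_index < (1 << SUB_ADDRESS_BITS):
        raise ValueError("sub-address out of range")
    return (vioc_address << SUB_ADDRESS_BITS) | vnic_index


@dataclass
class AddressBlock:
    fabric_address: int              # Fabric Address (source fabric address)
    mac_address: str                 # MAC Address
    vlan_id: int                     # VLANid
    public_ip: Optional[str] = None  # Public IP Address, optional per embodiment


@dataclass
class VnicConfigBlock:
    enable: bool = False                    # VNIC Enable
    priority_scheme: int = 0                # Priority Scheme Selection
    priority_weights: Tuple[int, ...] = ()  # Priority Weights
    bandwidth_min: int = 0                  # Bandwidth Minimum (units assumed)
    bandwidth_max: int = 0                  # Bandwidth Maximum (units assumed)

    def validate(self) -> None:
        """Reject a configuration whose minimum exceeds its maximum."""
        if self.bandwidth_min > self.bandwidth_max:
            raise ValueError("bandwidth minimum exceeds maximum")
```

In this sketch the VNICs of one VIOC share the upper address bits and differ only in the low four bits, mirroring the common first portion and the unique fabric sub-address.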
[0210] I/O Block 605.1 includes separate collections of queues for
packets and messages. Each collection includes transmit, receive,
and completion queues (the completion queues are also known as
receive completion queues). The packet and message queues are
organized by "context". Each context includes a transmit queue, a
receive queue, and either a completion queue or a pointer to a
completion queue. In a typical usage scenario, the transmit,
receive, and completion queues of a context are associated with a
particular protocol connection or a particular multiplexed set of
protocol connections. Each context is established by software
running on the Local Processors. In certain embodiments, while each
context has a respective pair of dedicated transmit and receive
queues, multiple contexts are associated with a common completion
queue. This is detailed further below.
[0211] A packet (or message) context may be used for IP traffic,
interprocess communication (directly or over IP), or other similar
communications. When accessed via physical buffer addresses, packet
(or message) contexts are typically used directly by OS and
Hypervisor processes, and typically used indirectly (via system
calls, for example) by application processes. When accessed via
virtual buffer addresses, packet (or message) contexts are
typically usable directly by application processes, in addition to
direct use by OS and Hypervisor processes.
[0212] A single packet context may be used to implement multiple
connections via SW, while a single message context typically
implements a single connection managed (at least in part) by HW.
Message contexts may be used by RDMA operations with fewer SW
operations than similar functionality implemented with packet
contexts. In some embodiments, message contexts are typically
directly accessible via application processes.
[0213] Each context has particular state associated with it,
organized within a Context State Block (CSB), which identifies the
context's transmit, receive, and completion queues. The CSBs are
maintained in various embodiments by any combination of hypervisor,
OS, and Driver processes providing control information for the
VNIC.
[0214] The queues contain descriptors that are organized within
circular buffers in contiguous memory and thus are also known as
descriptor rings. Each enqueued descriptor describes various
aspects (detailed herein below) of the packets or messages being
transmitted and received. In particular, the packet transmit and
receive descriptors correspond to the raw data buffers in the Local
Processor Memory where packet data is respectively read and written
by the VIOC. A common scenario is for these descriptors to have
been prepared by processes of the Local Processors (such as a
Driver).
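A descriptor ring of the kind just described can be sketched as a fixed-size circular buffer with separate producer and consumer positions. The class and field names below, and the use of an explicit occupancy count to tell a full ring from an empty one, are illustrative assumptions rather than the structure used by any particular embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    buffer_address: int  # location of the raw data buffer in memory
    buffer_size: int     # size of that buffer in bytes


class DescriptorRing:
    """Fixed-size circular buffer of descriptors."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Descriptor]] = [None] * size
        self.head = 0   # next slot the consumer reads
        self.tail = 0   # next slot the producer writes
        self.count = 0  # number of occupied slots

    def enqueue(self, desc: Descriptor) -> bool:
        """Producer side (for example, a Driver); returns False when full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = desc
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around the ring
        self.count += 1
        return True

    def dequeue(self) -> Optional[Descriptor]:
        """Consumer side (for example, a VIOC); returns None when empty."""
        if self.count == 0:
            return None
        desc = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)  # wrap around the ring
        self.count -= 1
        return desc
```

Because both positions wrap modulo the ring size, the same contiguous memory is reused indefinitely as descriptors are produced and consumed.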
[0215] Packet Contexts 606.1 provides four packet CSBs (also
referred to as packet contexts) of identical capability, two of
which are illustrated in the figure (Packet Context Block #1
606.1.1 and Packet Context Block #4 606.1.4). As shown, each packet
context includes a transmit queue, such as Packet Transmit Queue #1
607.1.1 and Packet Transmit Queue #4 607.1.4, and a receive queue
pointer, such as Packet Receive Queue Pointer #1 608.1.1 and Packet
Receive Queue Pointer #4 608.1.4.
[0216] Each packet receive queue pointer identifies one of 16
identical shared packet receive queues, as illustrated by
dashed-arrow 619r pointing toward Packet Receive Queues 616r. For
clarity, only two of the 16 packet receive queues are illustrated
in the figure, specifically Packet Receive Queue #1 616r.1 and
Packet Receive Queue #16 616r.16. The packet receive queue pointers
enable arbitrary sharing of packet receive queues among packet
contexts, including sharing among packet contexts in multiple
distinct VNICs.
[0217] In alternate embodiments, packet context receive queues are
not shared among VNICs, but are dedicated per VNIC as packet
context transmit queues are. In these alternate embodiments,
608.1.1 . . . 608.1.4 operate directly as the packet receive queues
for a given VNIC, and Packet Receive Queues 616r (of FIG. 6A) is
not provided.
[0218] Also included in each packet context is a completion queue
pointer, such as Packet Complete Queue Pointer #1 609.1.1 and
Packet Complete Queue Pointer #4 609.1.4. Each packet completion
queue pointer identifies one of 16 identical shared packet
completion queues, as illustrated by dashed-arrow 619c pointing
toward Packet Completion Queues 616c. For clarity, only two of the
16 packet completion queues are illustrated in the figure,
specifically Packet Completion Queue #1 616c.1 and Packet
Completion Queue #16 616c.16. Similar to the packet receive queue
pointers, the packet completion queue pointers enable arbitrary
sharing of packet completion queues among packet contexts,
including sharing among packet contexts in multiple distinct
VNICs.
[0219] The packet transmit queues of a VNIC (such as Packet
Transmit Queue #1 607.1.1, for example), also known as transmit
rings, are used to communicate transmit packet availability and
location in memory. The transmit queues include a plurality of
transmit descriptors, each of which refers to a buffer in memory
having a complete or partial packet image for transmission. The
descriptor includes the address of the buffer (in Local Processor
Memory), the buffer size, a packet state indicator, a valid
indicator, a done indicator, and other related information. The
packet state indicator describes the buffer information as
associated with the start, the middle, or the end of a packet.
Assertion of the valid indicator specifies that the descriptor
refers to packet image data ready for transmission. Deassertion
indicates otherwise: that no data is ready, the descriptor is
invalid, or some other related condition. The valid indicator
allows a VNIC implementation (such as a VIOC) to poll (or scan) the
transmit queues for valid descriptors to discover packet data ready
for transmission without requiring any further information or
stimulus from the processes executing on the Local Processor. The
processes may also interrupt the VIOC by accessing a Doorbell
location in the VIOC memory-mapped I/O address space to indicate
new or modified descriptors in a transmit queue, corresponding to
additional packet data ready for transmission. A plurality of
Doorbell locations are typically provided per VNIC. In some
embodiments, packet transmit queues and message contexts of all
VNICs are allocated distinct doorbells.
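The polling behavior described above can be sketched in software as
follows. This is a minimal illustrative model only, not the VIOC
implementation: the names TxDescriptor, TxRing, post, and poll are
hypothetical, the exact handling of the done indicator is an
assumption, and Doorbell accesses are not modeled.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class TxDescriptor:
    buffer_addr: int = 0        # address of the buffer in Local Processor Memory
    buffer_size: int = 0
    packet_state: str = "end"   # "start", "middle", or "end" of a packet
    valid: bool = False         # asserted when data is ready for transmission
    done: bool = False          # assumed: set by the device once consumed


class TxRing:
    """Circular buffer of transmit descriptors (a descriptor ring)."""

    def __init__(self, size: int):
        self.ring = [TxDescriptor() for _ in range(size)]
        self.head = 0  # next slot the driver fills
        self.tail = 0  # next slot the device polls

    def post(self, addr: int, size: int, state: str) -> bool:
        """Driver side: fill the next descriptor; False if the ring is full."""
        d = self.ring[self.head]
        if d.valid:
            return False
        d.buffer_addr, d.buffer_size, d.packet_state = addr, size, state
        d.done = False
        d.valid = True  # asserted last, after the other fields are written
        self.head = (self.head + 1) % len(self.ring)
        return True

    def poll(self) -> Optional[TxDescriptor]:
        """Device side: return a copy of the next valid descriptor, if any."""
        d = self.ring[self.tail]
        if not d.valid:
            return None
        snapshot = replace(d)  # copy taken before the slot is recycled
        d.valid = False
        d.done = True
        self.tail = (self.tail + 1) % len(self.ring)
        return snapshot
```

In this sketch the device discovers work purely by inspecting the
valid indicator, so the driver never has to signal it explicitly.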
[0220] The packet receive queues of a VNIC (such as those pointed
to by Packet Receive Queue Pointer #1 608.1.1, for example), also
known as receive rings, are used to communicate receive packet data
locations in memory. The receive queues include a plurality of
receive descriptors, each of which refers to a buffer in memory for
reception of a complete or partial packet image. The descriptor
includes the address of the buffer (in Local Processor Memory), the
buffer size, a valid indicator, and other related information. The
valid indicator specifies that the buffer the descriptor refers to
is ready to receive packet image data, enabling a VNIC
implementation (such as a VIOC) to determine receive buffer
availability without direct involvement of processes executing on
the Local Processor.
[0221] The shared packet completion queues (Packet Completion Queue
#1 616c.1, for example) are used to communicate completion
information, including receive packet data availability and status.
The completion queues include a plurality of entries, each of which
includes a packet status indicator and other related information.
The packet status indicator identifies buffer data as corresponding
to the start, middle, or end of a packet. When a receive buffer is
filled, corresponding to consumption of an associated receive
descriptor, an entry is created in one of the completion queues.
The completion queue data structure and related processing further
enable a VNIC implementation (such as a VIOC) to provide packet
data to processes executing on Local Processors with little direct
involvement of the processes. In some embodiments, the ingress
processing (e.g., packet reassembly from cells) performed by the
VIOC/VNIC frees up resources for other tasks, and may also simplify
implementation of other portions of the system.
[0222] In the illustrated embodiment, Message Contexts 610.1
provides up to a total of 4K-4 (i.e., 2 to the 12th power,
minus 4, or 4092) message CSBs (also referred to as message
contexts) per VNIC. Each message context has identical capability,
and two are illustrated in the figure (Message Context Block #1
610.1.1 and Message Context Block #N 610.1.N). In other
embodiments (not illustrated), the message contexts of the 16
VNICs are shared, for a total of 64K-16 (i.e., 2 to the 16th
power, minus 16, or 65,520) message contexts available collectively
to all of the VNICs. As shown, each message context includes a
transmit queue, such as Message Transmit Queue #1 611.1.1 and
Message Transmit Queue #N 611.1.N, and a receive queue, such as
Message Receive Queue #1 612.1.1 and Message Receive Queue #N
612.1.N. Also included in each message context is a completion
queue, such as Message Completion Queue #1 613.1.1 and Message
Completion Queue #N 613.1.N. The message transmit, receive, and
completion queue data structures are similar to those defined for
packet processing. The message data structures specify message
location and availability for transmission, for determining buffer
location and availability for reception, and for directly providing
message data and completion status to processes executing on Local
Processors.
[0223] Messages are processed via an additional level of
indirection not used for packet processing. Message Lists 615.1
includes two linked lists of messages to be processed (Message List
#1 615.1.1 and Message List #2 615.1.2). Each of the lists includes
an active entry pointing to (or selecting) one of the message
contexts, as illustrated by dashed-ellipse 614.1 and associated
dashed-arrows pointing to message contexts in Message Contexts
610.1. In embodiments with 4K-4 message contexts per VNIC the
active entries point to message contexts associated with the VNIC,
and in embodiments with 64K-16 shared message contexts, the active
entries point to any of the shared message contexts. Each of the
selected message contexts in turn includes transmit, receive, and
completion queues, as illustrated.
[0224] Transmit and receive packet and message buffer addresses may
be provided as physical addresses or virtual addresses subsequently
translated by a VNIC implementation (such as a VIOC), depending on
embodiment and configuration. In some embodiments and
configurations, address type varies according to queue, and may
also vary according to other configuration information.
[0225] In typical implementations, VNICs share resources amongst
each other and between queues within each respective VNIC. Shared
transmit resources include access to the fabric (based on
destination, for example) and effective bandwidth to the fabric
(allocated between VNICs and queues included in VNICs, for
example). Shared receive resources include input queuing (or
enqueuing) bandwidth from the fabric (based on a categorization
included in the received data, for example) and dequeuing bandwidth
from input queues to Local Processor Memory (allocated between
VNICs and queues included in VNICs, for example).
Egress Path VNIC/Queue Operation
[0226] Sharing of transmit resources is managed by three scheduling
policies. These are the transmit-descriptor-fetch scheduling
policy, the VOQ scheduling policy, and the
subsequent-transmit-data-read scheduling policy. From a high-level
perspective, the transmit-descriptor-fetch scheduling policy
decides which transmit tasks the VIOC will do next. More
specifically, the transmit-descriptor-fetch scheduling policy,
described in more detail below in conjunction with FIG. 6B,
determines the next transmit-descriptor to be fetched. (An initial
data read is also performed in conjunction with each transmit
descriptor fetch.) Each transmit descriptor describes a transmit
buffer in the local processor memory that holds one or more packets
awaiting transport to their addressed destinations. The transmit
descriptor to be next fetched is the descriptor identified by the
next descriptor pointer of a transmit queue selected by the
transmit-descriptor-fetch scheduling policy from the candidate pool
of all VNIC transmit queues. The selection is based in part on a
prioritization among the VNIC transmit queues, in a manner that is
configurable in part. The selection is also based in part on an
approximate fabric bandwidth allocation among the VNICs.
[0227] The VOQ scheduling policy, described in more detail below in
conjunction with FIG. 6C, determines the next cell to transmit to
the fabric. The cell to be next transmitted to the fabric is the
cell at the head of the VOQ selected by the VOQ scheduling policy
from the candidate pool of all VOQs. The selection is based in part
on a prioritization among the VOQs, in a manner that is
configurable in part.
[0228] The subsequent-transmit-data-read scheduling policy
determines the next transmit data to read (for all reads required
to retire a transmit descriptor made after the first data read).
The transmit data to be next read is the data (generally a
cache-line in memory) identified by the next data unit prefetch
pointer of a transmit queue packet buffer (in transmit shared
memory) selected by the subsequent-transmit-data-read scheduling
policy from the candidate pool of all transmit queue packet buffers
waiting to perform transmit data reads.
[0229] Note that the solid arrows drawn in FIGS. 6B through 6E do
not represent data flows (at least not required ones). Taking an
analogy from sports, each of the solid-arrow diagrams is very
similar to a "tournament bracket" (also known as a tournament
chart), wherein a pool of competitors is reduced by successive
stages of competition to arrive at an overall winner. In terms of
this sports analogy, the transmit queues represented at the start
(on the left for transmit, on the right for receive) "compete" in
accordance with rules that may differ at each stage to be chosen as
the queue for which an action (different in each figure) is next
performed.
[0230] In the following scheduling discussions, it will be
understood that although the description may refer to queue
selections in each of multiple stages ("prioritization levels"),
and although data is generally transferred subsequent to the
scheduling selection, each selection described is nevertheless a
logical abstraction that generally does not necessitate any data
movement corresponding to any of: the queues, identifiers (or
pointers) representing the queues, descriptors pointed to by the
queues, or data associated with the descriptors. Furthermore,
evaluation in each of the discrete stages described is also a
logical abstraction that need not be physically implemented in
order to realize the overall result. While in some embodiments the
selection logic could implement pass gate or other multiplexing
logic to pass transmit queue identifiers (or other related data
items) along circuit topographies that are fashioned after the
tournament bracket topologies illustrated, this is not at all
essential. Those of ordinary skill in the art will understand that
each scheduling policy of FIGS. 6B through 6E is reducible to logic
equations that can be implemented in many functionally equivalent
ways, including but not limited to: pass gates, multiplexers,
AND/OR-gating, memory or programmable-logic arrays,
micro-programming, and combinations thereof. Furthermore, only the
final result of the scheduling process need be actually realized.
That is, certain implementation approaches and optimizations may
not actually implement or expose the intermediate prioritization
level results.
[0231] FIG. 6B illustrates selected aspects of a logical view of an
embodiment of VNIC transmit-descriptor-fetch scheduling. For
clarity, only the transmit queue functions of two of the 16 VNICs
are illustrated (VNIC #1 600.1 and VNIC #16 600.16). The figure
represents operations related to selecting a transmit queue from
among all the transmit queues. The next descriptor pointer of the
selected transmit queue in turn identifies the next transmit
descriptor to fetch, according to various priority techniques. This
determines the relative processing order of packet and message
buffers described by the transmit descriptors, and thereby
approximates and manages a per-VNIC effective bandwidth allocation
to the fabric.
[0232] Packet Transmit Queues 621.1.P, as shown, includes the four
transmit queues (Packet Transmit Queue #1 607.1.1 . . . Packet
Transmit Queue #4 607.1.4) of VNIC #1 600.1. Message Transmit Queues
621.1.M, as shown, includes two message transmit queues (Message
Transmit Queue #5 621.1.M.5 and Message Transmit Queue #6
621.1.M.6) of VNIC #1 600.1. (The above and subsequent uses of " .
. . " can be read as "through.") These correspond to the two
message transmit queues identified by an active message identified
by each of Message Lists 615.1 (Message List #1 615.1.1 and Message
List #2 615.1.2), as shown by dashed-ellipse 614.1 (of FIG. 6A).
The other 15 VNICs are organized identically.
[0233] Several levels of prioritization are performed to determine
the next transmit descriptor to read (if not previously prefetched)
and process when a new transmit descriptor is needed. Generally,
the transmit descriptors are prefetched in advance of their need. A
first prioritization level selects (identifies), for each VNIC, one
packet transmit queue and one message transmit queue. The first
level includes Packet Transmit Queue Prioritization 622.1.P,
selecting one of Packet Transmit Queues 621.1.P according to either
a straight or weighted round-robin priority scheme. The first level
also includes Message Transmit Queue Prioritization 622.1.M,
selecting one of Message Transmit Queues 621.1.M according to
either a straight or weighted round-robin priority scheme.
Identical prioritization processing is performed for each of the 16
VNICs, selecting one potential packet queue and one potential
message queue for each VNIC. The resultant 32 candidate queues are
then evaluated in a second prioritization level.
[0234] The second prioritization level selects, for each VNIC,
between the packet queue and the message queue as selected by the
first prioritization level. The second level includes Packet vs.
Message Transmit Prioritization 623.1, selecting a packet queue or
a message queue according to a weighted round-robin priority
scheme. Identical prioritization processing is performed for each
of the 16 VNICs, selecting one candidate queue for each VNIC. These
16 queues are then evaluated in a third prioritization level.
[0235] The third prioritization level selects, across the 16 VNICs,
a single transmit queue for subsequent evaluation. The third level
includes VNIC Bandwidth Management 624, selecting one of the 16
queues provided by the second level according to a bandwidth
allocation priority scheme. Bandwidth schemes include enforcing a
maximum bandwidth per VNIC, a minimum bandwidth per VNIC, and
arbitrary combinations of maximum and minimum bandwidths
(individually selectable) per VNIC. A single resultant transmit
queue is selected for processing as indicated by dashed-arrow 639.
The processing includes accessing a transmit descriptor identified
by the selected transmit queue, reading data for transmission
according to the accessed descriptor, and then readying the data
for transmission over the fabric in accordance with the VOQ
scheduling policy discussed in conjunction with FIG. 6C below.
[0236] The selection of straight or weighted round-robin prioritization
(Packet Transmit Queue Prioritization 622.1.P and Message Transmit
Queue Prioritization 622.1.M, for example) is individually
programmable for each VNIC and may be distinct for packet and
message prioritization, in various embodiments. Weights for each of
the weighted round-robin prioritizations (Packet Transmit Queue
Prioritization 622.1.P, Message Transmit Queue Prioritization
622.1.M, and Packet vs. Message Transmit Prioritization 623.1, for
example), are individually programmable, in various embodiments.
The maximum and minimum bandwidths per VNIC (VNIC Bandwidth
Management 624) are also programmable in typical embodiments.
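The three prioritization levels of FIG. 6B can be modeled in software
roughly as follows. This is a hedged sketch under several
assumptions: the names WeightedRoundRobin, Vnic, and select_tx_queue
are hypothetical; a weighted round-robin arbiter stands in for VNIC
Bandwidth Management 624 (the maximum and minimum bandwidth schemes
are not modeled); and the levels are evaluated top-down so that only
the arbiters on the winning path advance, which yields a single
selected queue without realizing the intermediate candidates.

```python
class WeightedRoundRobin:
    """Grants index i up to weights[i] consecutive turns per cycle.

    Straight round-robin is the special case where every weight is 1.
    """

    def __init__(self, weights):
        self.weights = list(weights)
        self.current = 0
        self.credit = self.weights[0]

    def pick(self, ready):
        """Return the index of the next ready candidate, or None."""
        n = len(self.weights)
        for _ in range(n + 1):
            if self.credit > 0 and ready[self.current]:
                self.credit -= 1
                return self.current
            # Move on and refill the credit of the next candidate.
            self.current = (self.current + 1) % n
            self.credit = self.weights[self.current]
        return None


class Vnic:
    """Per-VNIC arbiters for the first and second prioritization levels."""

    def __init__(self, pkt_weights=(1, 1, 1, 1), msg_weights=(1, 1),
                 class_weights=(1, 1)):
        self.pkt = WeightedRoundRobin(pkt_weights)    # level 1, packets
        self.msg = WeightedRoundRobin(msg_weights)    # level 1, messages
        self.cls = WeightedRoundRobin(class_weights)  # level 2

    
def select_tx_queue(vnics, vnic_arbiter, ready):
    """Select one transmit queue across all VNICs.

    ready[v] is a pair (packet_ready, message_ready) of boolean lists.
    Returns (vnic_index, kind, queue_index), or None if nothing is ready.
    """
    # Level 3 stand-in: choose a VNIC that has any ready queue.
    v = vnic_arbiter.pick([any(p) or any(m) for p, m in ready])
    if v is None:
        return None
    pkt_ready, msg_ready = ready[v]
    # Level 2: packet versus message within the chosen VNIC.
    kind = vnics[v].cls.pick([any(pkt_ready), any(msg_ready)])
    # Level 1: the individual queue within the chosen class.
    if kind == 0:
        return (v, "packet", vnics[v].pkt.pick(pkt_ready))
    return (v, "message", vnics[v].msg.pick(msg_ready))
```

With equal weights, a VNIC that has both a packet queue and a message
queue ready alternates between the two classes on successive
selections.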
[0237] Once a transmit descriptor becomes scheduled in accordance
with the transmit-descriptor-fetch scheduling policy, the
descriptor is fetched and the first transfer is made (corresponding
to a fabric-cell-worth of data) from the first packet in the
transmit memory buffer pointed to by the descriptor to a packet
buffer in shared memory associated with the transmit queue. This
first read and the subsequent reads are performed by one or more
transmit DMA engines, which operate in accordance with the
transmit-descriptor-fetch scheduling policy and the
subsequent-transmit-data-read scheduling policy, discussed below.
As described elsewhere herein, a lookup is generally performed on
the MACDA contained in the first read, the nature of the
destination becomes known as a result of the lookup, and the data
can subsequently be appropriately directed to either a multicast
VOQ or a fabric-port-specific unicast VOQ. Operation of the VOQs is
discussed in more detail below, in conjunction with FIG. 6C.
[0238] On a cell-by-cell basis as space becomes available in the
appropriate VOQ, the data from the transmit queue packet buffer (in
shared memory) is read, additional information is added (such as
priority and the destination fabric address and fabric
sub-address), and the data is transferred to the appropriate VOQ as
cells. As a matter of policy, cells from different packets are not
co-mingled in the VOQs. The VOQs act as variable depth FIFOs,
wherein the order in which cells enter a VOQ determines the order
in which the cells leave the VOQ. As previously indicated, cells
are released by the VOQs and transmitted to the fabric in
accordance with the VOQ scheduling policy.
[0239] As data is transferred from the transmit queue packet
buffers to the VOQs, the packet buffers are depleted. Responsive to
the depletion, the subsequent-transmit-data-read scheduling policy
generally attempts to keep the transmit queue packet buffers full
of pre-fetched transmit read data up to the allocated pre-fetch
depth, which is 16 cache-lines in one embodiment. In doing so, it
gives higher priority to those transmit queue packet buffers that
are supplying cells for a packet that is at the head of a VOQ (a
packet being actively transferred as cells over the fabric). It
gives lower priority to those transmit queue packet buffers that
are not yet supplying cells to a VOQ.
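The refill preference described above might be expressed as in the
following sketch. It is illustrative only: the function name
next_buffer_to_refill and the dictionary keys "lines" and
"at_voq_head" are hypothetical, and breaking ties in favor of the
emptiest buffer is an assumption not stated in the description.

```python
def next_buffer_to_refill(buffers, prefetch_depth=16):
    """Pick the transmit queue packet buffer to prefetch into next.

    buffers: list of dicts with
      "lines"       - cache-lines currently held in the buffer
      "at_voq_head" - True if the buffer feeds a packet at the head of a VOQ
    Returns the index of the chosen buffer, or None if all are full.
    """
    candidates = [i for i, b in enumerate(buffers)
                  if b["lines"] < prefetch_depth]
    if not candidates:
        return None
    # Buffers feeding the head of a VOQ sort first; among equals, the
    # emptiest buffer wins (assumed tie-break).
    return min(candidates,
               key=lambda i: (not buffers[i]["at_voq_head"],
                              buffers[i]["lines"]))
```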
[0240] FIG. 6C illustrates selected aspects of a logical view of an
embodiment of a VOQ scheduling policy to provide efficient access
to the fabric. The figure represents processing to select the VOQ
to send the next data unit (cells in one embodiment) to the fabric.
The VOQs send information to the fabric interface as cells.
Accordingly, at least logically the VOQ receives information as
cells. Physically, the VOQ could receive cells as constituent
components (data, priority, fabric address, etc.) and assemble the
cells just prior to sending the cells to the fabric interface.
Furthermore, the VOQs may be implemented within the egress shared
memory. Thus a number of levels of virtualization and indirection
are possible. In one embodiment, the VOQs are implemented within
the egress shared memory and they hold cells that have been
pre-assembled and are ready for immediate transfer via the fabric
interface to the switch fabric. The pre-assembly is performed at
least in part by transfers to each VOQ from an appropriately
dynamically associated transmit queue packet buffer (also in the
egress shared memory).
[0241] As indicated previously, transmit logic included in the VNIC
implementation (such as in the egress logic of a VIOC as
illustrated in FIG. 7A and as discussed below), assembles cells in
preparation for providing them to the VOQs as represented
abstractly by dashed-arrow 640. Included within each cell is a
corresponding priority indication (one of four levels: P0, P1, P2,
and P3, in selected embodiments) and a corresponding fabric
destination address. The destination address may be a unicast
address (one of up to 11 destinations, in some embodiments), or a
multicast address. Multicast cells are enqueued into one of
Multicast Output Queues 641.M (also known as Multicast VOQs)
according to priority, as abstractly represented by dashed-arrow
640.M, illustrating insertion into the P0 priority multicast
VOQ.
[0242] Unicast cells are enqueued into one of 11 unicast VOQ groups
(Unicast VOQ Group #1 641.1 . . . Unicast VOQ Group #11 641.11)
according to the fabric destination address (VOQ Group #1 . . . VOQ
Group #11) and further according to priority (P0 . . . P3) within
each VOQ group. Since there are 11 destinations, each having four
priorities, there are a total of 44 unicast VOQs. The unicast
enqueue operation is illustrated by dashed-arrows 640.1 and
640.11. Arrow 640.1 abstractly represents a unicast transmit cell
addressed to the first fabric destination being enqueued into the
P1 priority VOQ of Unicast VOQ Group #1 641.1. Arrow 640.11
abstractly represents a unicast transmit cell addressed to fabric
destination #11 being enqueued into the P2 priority VOQ of Unicast
VOQ Group #11 641.11. For clarity, only two of the 11 Unicast VOQ
groups of identical operation are shown in the figure.
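The count of 44 unicast VOQs follows from 11 destinations times four
priorities. As a small illustration, one possible flat numbering of
those queues is sketched below; the function name unicast_voq_index
and the particular linear layout are assumptions made for the
example and are not taken from the description.

```python
NUM_DESTINATIONS = 11  # unicast VOQ groups, one per fabric destination
NUM_PRIORITIES = 4     # P0 through P3 within each group


def unicast_voq_index(dest: int, priority: int) -> int:
    """Map (destination 1..11, priority 0..3) to a flat index 0..43."""
    if not 1 <= dest <= NUM_DESTINATIONS:
        raise ValueError("destination out of range")
    if not 0 <= priority < NUM_PRIORITIES:
        raise ValueError("priority out of range")
    return (dest - 1) * NUM_PRIORITIES + priority
```

Under this layout, the cell of arrow 640.1 (destination #1, priority
P1) maps to index 1, and the cell of arrow 640.11 (destination #11,
priority P2) maps to index 42.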
[0243] Several levels of prioritization are performed to select the
VOQ that will be permitted to send the next transmit cell to the
fabric. A first prioritization level selects a candidate VOQ from
within the Multicast VOQs (Multicast Output Queues 641.M) and each
of the Unicast VOQ groups (Unicast VOQ Group #1 641.1 . . . Unicast
VOQ Group #11 641.11). The first level includes Multicast Output
Queue Prioritization 642.M, selecting a single multicast VOQ from
Multicast Output Queues 641.M according to either a straight or
weighted round-robin priority scheme. The first level also includes
Unicast Output Queue Prioritization 642.1, selecting one of Unicast
VOQ Group #1 641.1 according to either a straight or weighted
round-robin priority scheme. Identical prioritization processing is
performed for each of the 11 Unicast VOQ groups, selecting one
potential VOQ for each Unicast VOQ group. The resultant 11 unicast
VOQ candidates are then evaluated in a second prioritization level,
and the resultant single multicast VOQ candidate is then evaluated
in a third prioritization level.
[0244] The second prioritization level selects, on a per VOQ group
basis, one of the 11 unicast VOQ group VOQ candidates as selected
by the first level. The second level includes Destination
Prioritization 643, selecting a VOQ according to a round-robin
priority scheme. Since the VOQ groups are organized by fabric
destination, the second level is a fabric-destination-based
prioritization. The single resultant unicast VOQ candidate is then
evaluated in the third prioritization level.
[0245] The third and final prioritization level selects between the
multicast and unicast VOQ candidates as provided by the first and
second levels respectively. The third level includes Multicast vs.
Unicast Output Prioritization 644, selecting the final multicast or
the unicast VOQ candidate according to a weighted round-robin
priority scheme. The final selected VOQ is then permitted to
provide one cell to the fabric, as abstractly represented by
dashed-arrow 659.
[0246] The selection of straight or weighted round-robin prioritization is
individually programmable for Multicast Output Queue Prioritization
642.M and each of Unicast Output Queue Prioritization 642.1 . . .
642.11. Weights for each of the weighted round-robin
prioritizations (Multicast Output Queue Prioritization 642.M,
Unicast Output Queue Prioritization 642.1 . . . Unicast Output
Queue Prioritization 642.11, and Multicast vs. Unicast Output
Prioritization 644, for example) are individually programmable, in
various embodiments.
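The VOQ selection of FIG. 6C can be sketched as a bottom-up
"tournament" in which candidates are computed at each level but
round-robin state is committed only for the overall winner. The
sketch below is illustrative and rests on assumptions: the names
rr_next and VoqScheduler are hypothetical, the first and second
levels use straight round-robin only, and the third-level weights
are assumed to be at least 1.

```python
def rr_next(ready, last):
    """Index of the first ready entry after `last`, wrapping; else None."""
    n = len(ready)
    for step in range(1, n + 1):
        i = (last + step) % n
        if ready[i]:
            return i
    return None


class VoqScheduler:
    def __init__(self, groups=11, priorities=4, mc_weight=1, uc_weight=1):
        self.mc_last = priorities - 1            # first search starts at P0
        self.uc_last = [priorities - 1] * groups
        self.dest_last = groups - 1              # first search starts at group 0
        self.weights = {"multicast": mc_weight, "unicast": uc_weight}
        self.turn = "multicast"
        self.credit = mc_weight

    def select(self, mc_ready, uc_ready):
        """Return (kind, group, priority) for the VOQ that sends next.

        mc_ready: per-priority booleans for the multicast VOQs.
        uc_ready: per-group lists of per-priority booleans.
        group is None for multicast; the result is None if all are empty.
        """
        # Level 1: one candidate from the multicast VOQs and per group.
        mc = rr_next(mc_ready, self.mc_last)
        uc = [rr_next(r, last) for r, last in zip(uc_ready, self.uc_last)]
        # Level 2: round-robin across fabric destinations.
        g = rr_next([c is not None for c in uc], self.dest_last)
        # Level 3: weighted round-robin, multicast versus unicast.
        have = {"multicast": mc is not None, "unicast": g is not None}
        if not any(have.values()):
            return None
        other = "unicast" if self.turn == "multicast" else "multicast"
        if not have[self.turn] or (self.credit == 0 and have[other]):
            self.turn, self.credit = other, self.weights[other]
        elif self.credit == 0:
            self.credit = self.weights[self.turn]  # only this class has cells
        self.credit -= 1
        # Commit round-robin state for the winner only.
        if self.turn == "multicast":
            self.mc_last = mc
            return ("multicast", None, mc)
        self.dest_last = g
        self.uc_last[g] = uc[g]
        return ("unicast", g, uc[g])
```

Because losing candidates leave no trace in the scheduler state, the
intermediate prioritization results are never exposed, consistent
with treating each level as a logical abstraction.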
[0247] In some embodiments, the number of VOQ groups is equal to
the number of fabric destinations in the ES system (such as 11 VOQ
groups and 11 fabric destinations). In some embodiments, the number
of VOQ groups is greater than the number of fabric destinations
(such as 16 VOQ groups and 11 fabric destinations). In some
embodiments, more than one priority may share a VOQ, instead of
each priority having a separate VOQ. For example, P0 and P1
priorities may share a first VOQ within a VOQ group and P2 and P3
priorities may share a second VOQ within the VOQ group. These and
all similar variations are contemplated within the contexts of
various embodiments.
Ingress Path VNIC/Queue Operation
[0248] Shared receive resources include enqueuing bandwidth for
cells received from the fabric. The bandwidth is shared based on a
priority included in the received data units (see the following
FIG. 6D discussion). After enqueuing, the data units are classified
according to destination VNIC and associated queue, including
processing according to multicast and unicast destinations. Then
the data units are dequeued for storage into Local Processor Memory
according to priorities associated with the VNICs and the queues of
the VNICs. The shared receive resources further include the
dequeuing bandwidth (see the following FIG. 6E discussion).
[0249] FIG. 6D illustrates selected aspects of a logical view of an
embodiment to schedule the start of receive processing for incoming
cells. Received cells are pushed into the VIOC from the fabric,
typically via one or more First-In-First-Out (FIFO) or similar
buffering mechanisms external to the VIOC, as suggested by
dashed-arrow 660. As they enter, the cells are classified according
to multicast or unicast (dashed arrows 660.M and 660.U,
respectively), and inserted into Multicast Input Queues 661.M or
Unicast Input Queues 661.U accordingly. In the illustrated
embodiment, queue insertion is without regard to priority, as all
priorities (P0, . . . P3) share the same queue for a given traffic
type.
[0250] As illustrated, a single level of prioritization is
performed by the receive logic to select an input queue from a
candidate pool that includes Multicast Input Queues 661.M and
Unicast Input Queues 661.U. The single level selects between the
multicast and the unicast queues according to Multicast vs. Unicast
Input Prioritization 663, a weighted round-robin priority scheme.
As abstractly represented by dashed-arrow 679, the receive logic
then pulls one cell from the queue selected for storage into
receive logic memory (such as ISMem 738) and subsequent processing.
Weights for Multicast vs. Unicast Input Prioritization 663 are
individually programmable, according to various embodiments.
[0251] In a variation of FIG. 6D, some embodiments select the next
cell to pull from the input queues at least in part according to
priorities associated with the received cells. For example, the
multicast and unicast input queues may be managed with priority
information either included in the received data unit or determined
as a function of the fabric transport priority associated with the
received cell, according to embodiment. Multicast and unicast input
queue insertion is then partially determined by the priority
associated with the received cell, in addition to multicast versus
unicast categorization. Multicast Input Queues 661.M and Unicast
Input Queues 661.U are thus each modified to be organized with
respect to data unit priority.
[0252] In some embodiments, there are four defined priority levels
(P0, P1, P2, and P3), and each queue includes data associated with
a single priority (i.e., there is a queue per categorization and
priority pair). In some embodiments, each queue may include cells
from a pair of priorities (P0 and P1 in a first queue, and P2 and
P3 in a second queue, for example). In other embodiments, queue
insertion is further determined according to information provided
in response to a lookup operation based on information included in
the received cell (see the TCAM/SRAM lookup state section,
elsewhere herein).
[0253] In embodiments of FIG. 6D modified to have queues partially
managed by priority, a first prioritization level selects, on a per
queue priority basis, one candidate multicast input queue and one
candidate unicast input queue from the multicast and unicast input
queue groups respectively. The first prioritization may be straight
priority, straight round-robin, or weighted round-robin, according
to embodiment. The resultant two input queue candidates are then
evaluated in a second prioritization level. The second
prioritization level selects between the multicast and unicast
input queue candidates according to a straight priority, straight
round-robin, or weighted round-robin, according to embodiment.
[0254] In some embodiments that include priority-managed receive
path input queues, at least some of the queues may be implemented
with queue depths substantially larger than queue depths
implemented for embodiments lacking priority-managed queues. In
some of these embodiments having priority-managed queues, the
receive path input queue depths are substantially larger than the
effective queue depth of the fabric providing the receive data
units.
[0255] In some embodiments, selected received cells are ignored (or
dropped) according to programmable receive (or ingress) bandwidth
limitations or policies, to prevent overuse of subsequent VIOC
receive resources or associated Local Processor resources. In
various embodiments, the ingress bandwidth limitation policies
operate in parallel with the prioritization illustrated in FIG. 6D,
and in various other embodiments the ingress policies operate
either "before" or "after" the operations depicted in the figure.
Some embodiments implement dropping policies according to a single
cell, while other embodiments drop all subsequent cells of a packet
or a message after dropping a first cell in response to a dropping
policy operation.
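The second dropping behavior, in which all subsequent cells of a
packet are dropped once a first cell has been dropped, can be
sketched as follows. The sketch is illustrative only: the class name
PacketDropFilter, the should_drop callback, and the use of a packet
identifier and a last-cell flag to delimit packets are all
hypothetical and are not drawn from the description.

```python
class PacketDropFilter:
    """Once one cell of a packet is dropped, drop the rest of that packet."""

    def __init__(self, should_drop):
        # should_drop(packet_id) -> bool is a hypothetical ingress
        # bandwidth policy consulted for cells of intact packets.
        self.should_drop = should_drop
        self.dropping = set()  # packets that have already lost a cell

    def accept(self, packet_id, is_last_cell):
        """Return True if the cell is kept, False if it is dropped."""
        drop = packet_id in self.dropping or self.should_drop(packet_id)
        if drop and not is_last_cell:
            self.dropping.add(packet_id)      # remember for later cells
        if is_last_cell:
            self.dropping.discard(packet_id)  # packet finished, forget it
        return not drop
```

The policy callback is not consulted for cells of a packet that is
already being dropped, so such cells do not count against the
policy a second time.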
[0256] Incoming unicast cells have a fabric sub-address that
identifies the VNIC and receive queue with which the cell is
associated. As each unicast cell is pulled from the unicast queue,
the receive queue is ascertained and the data payload of the cell
is placed into a corresponding receive queue packet buffer. The
data carried by multicast cells is replicated in accordance with
the multicast group ID and appropriately placed into multiple
receive queue packet buffers. The data held within the receive
queue packet buffers accumulates until written to Local Processor
Memory in accordance with a receive-data-write scheduling policy,
as illustrated in FIG. 6E.
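The placement of unicast and multicast payloads into receive queue
packet buffers can be sketched as below. This is a simplified
illustration: the class name ReceiveDispatcher and the
multicast_groups membership table are hypothetical, and the fabric
sub-address is assumed to have already been decoded into a (VNIC,
receive queue) pair.

```python
from collections import defaultdict


class ReceiveDispatcher:
    def __init__(self, multicast_groups):
        # Hypothetical table: multicast group ID -> list of
        # (vnic, queue) pairs that are members of the group.
        self.multicast_groups = multicast_groups
        # (vnic, queue) -> receive queue packet buffer contents.
        self.buffers = defaultdict(bytearray)

    def unicast(self, vnic, queue, payload):
        """Place a unicast cell payload into its single target buffer."""
        self.buffers[(vnic, queue)] += payload

    def multicast(self, group_id, payload):
        """Replicate a multicast cell payload into every member buffer."""
        for target in self.multicast_groups.get(group_id, []):
            self.buffers[target] += payload
```

Data accumulates in each buffer until a separate write scheduler
(not modeled here) moves it to Local Processor Memory.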
[0257] The receive-data-write scheduling policy of FIG. 6E selects
the next receive queue to be serviced. The queue selection
determines the next receive data to write through the following
indirection. Each receive queue points to a currently active
receive descriptor, which describes a receive buffer in the Local
Processor Memory that is ready to receive one or more packets. Each
receive queue also has a corresponding receive queue packet buffer
in the ingress shared memory. The receive data to be next written
is the data unit (generally a cache-line in memory) identified by
the next data unit write pointer of the receive queue packet buffer
corresponding to the receive queue selected by the
receive-data-write scheduling policy from the candidate pool of all
VNIC receive queues. The selection is based in part on a
prioritization among the VNIC receive queues, in a manner that is
configurable in part. Each data unit written is scheduled
independently (generally on a cache-line by cache-line basis) by
the receive-data-write scheduling policy. The writes associated
with each receive descriptor carry out the reassembly of
corresponding data link layer frames (typically Ethernet
frames).
[0258] For clarity, only two of the 16 VNICs (VNIC #1 600.1 and
VNIC #16 600.16) are illustrated in FIG. 6E. Packet Receive Queues
684.1.P, as shown, includes four packet receive queues such as
Packet Receive Queue #1 684.1.P.1 and Packet Receive Queue #4
684.1.P.4 of VNIC #1 600.1. These correspond to the four packet
receive queues identified by each of Packet Receive Queue Pointer
#1 608.1.1 . . . Packet Receive Queue Pointer #4 608.1.4
respectively (of FIG. 6A). Message Receive Queues 684.1.M, as
shown, includes two message receive queues (Message Receive Queue
#5 684.1.M.5 and Message Receive Queue #6 684.1.M.6) of VNIC #1
600.1. These correspond to the two message receive queues
identified by an active message identified by each of Message Lists
615.1 (Message List #1 615.1.1 and Message List #2 615.1.2), as
shown by dashed-ellipse 614.1 (of FIG. 6A). The other 15 VNICs are
organized identically.
[0259] Several levels of prioritization are performed to select the
next receive queue. A first prioritization level selects, for each
VNIC, one candidate packet receive queue and one candidate message
receive queue. The first level includes Packet Receive Queue
Prioritization 682.1.P, selecting one of Packet Receive Queues
684.1.P according to a straight round-robin priority scheme. The
first level also includes Message Receive Queue Prioritization
682.1.M, selecting one of Message Receive Queues 684.1.M according
to a straight round-robin prioritization scheme. Identical
processing is performed for each of the 16 VNICs, selecting one
potential receive packet queue and one potential receive message
queue for each VNIC. The resultant 32 candidate queues are then
evaluated in a second prioritization level.
[0260] The second prioritization level selects, for each VNIC, between
the packet or the message receive queue as selected by the first
prioritization level. The second level includes Packet vs. Message
Receive Prioritization 681.1, selecting a packet or a message
receive queue according to a straight round-robin priority scheme.
Identical prioritization processing is performed for each of the 16
VNICs, selecting one candidate receive queue for each VNIC. These
16 candidate receive queues are then evaluated in a third
prioritization level.
[0261] The third and final prioritization level selects, across the
16 VNICs, a single receive queue. The third level includes VNIC
Prioritization 680, selecting one of the 16 receive queues provided
by the second level according to a straight round-robin priority
scheme. A final single resultant receive queue is selected and,
through the indirection process described previously, a single data
unit (generally a cache-line) is written via the HT interface into
Local Processor Memory as abstractly represented by dashed-arrow
699.
[0262] In some embodiments, weighted round-robin prioritization may
be performed for any combination of the first, second, and third
prioritization levels, and the associated weights may be fixed or
individually programmable, according to embodiment. Various
embodiments may also provide individual programmable selection
between straight and weighted round-robin for each of the first,
second, and third prioritization levels.
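The three prioritization levels described above may be sketched in
software as follows. This is an illustrative model only, not the
hardware implementation: the class names, the zero-based queue and
VNIC numbering, and the choice to advance a round-robin pointer only
for the arbiters on the winning path are all assumptions made for the
sketch. Straight round-robin is used at every level.

```python
class RoundRobin:
    """Straight (un-weighted) round-robin arbiter over `size` slots."""

    def __init__(self, size):
        self.size = size
        self.last = size - 1  # slot 0 is examined first

    def peek(self, eligible):
        """Return the next eligible slot after the last winner, or None."""
        for offset in range(1, self.size + 1):
            slot = (self.last + offset) % self.size
            if slot in eligible:
                return slot
        return None

    def commit(self, slot):
        """Record `slot` as the most recent winner."""
        self.last = slot


class ReceiveWriteScheduler:
    """Hypothetical model of the three-level receive queue selection."""

    def __init__(self, num_vnics=16, packet_queues=4, message_queues=2):
        self.num_vnics = num_vnics
        self.packet_rr = [RoundRobin(packet_queues) for _ in range(num_vnics)]
        self.message_rr = [RoundRobin(message_queues) for _ in range(num_vnics)]
        self.kind_rr = [RoundRobin(2) for _ in range(num_vnics)]  # 0=packet, 1=message
        self.vnic_rr = RoundRobin(num_vnics)

    def select(self, pending):
        """pending: set of (vnic, kind, queue) with a data unit ready.

        kind is "P" (packet) or "M" (message). Returns the winning
        (vnic, kind, queue) tuple, or None if nothing is pending.
        """
        candidates = {}
        for vnic in range(self.num_vnics):
            # Level 1: one candidate packet queue, one candidate message queue.
            pq = self.packet_rr[vnic].peek(
                {q for (v, k, q) in pending if v == vnic and k == "P"})
            mq = self.message_rr[vnic].peek(
                {q for (v, k, q) in pending if v == vnic and k == "M"})
            # Level 2: packet versus message within the VNIC.
            kinds = set()
            if pq is not None:
                kinds.add(0)
            if mq is not None:
                kinds.add(1)
            kind = self.kind_rr[vnic].peek(kinds)
            if kind is not None:
                candidates[vnic] = ("P", pq) if kind == 0 else ("M", mq)
        # Level 3: one receive queue across all VNICs.
        vnic = self.vnic_rr.peek(set(candidates))
        if vnic is None:
            return None
        kind, queue = candidates[vnic]
        # Assumption: only the arbiters on the winning path advance.
        self.vnic_rr.commit(vnic)
        self.kind_rr[vnic].commit(0 if kind == "P" else 1)
        (self.packet_rr if kind == "P" else self.message_rr)[vnic].commit(queue)
        return (vnic, kind, queue)
```

Each call to select() corresponds to choosing the receive queue for
one data unit write. A weighted variant would replace RoundRobin at
one or more levels.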
Queue Priorities
[0263] Transmit and receive priority algorithms may vary according
to embodiments. For example, in some embodiments, straight priority
may implement a static priority having queue #1 as the highest,
queue #2 as the next highest, and so forth with queue #4 as the
lowest priority. In other embodiments, the priority order may be
reversed (i.e. #4 is the highest and #1 is the lowest). In some
embodiments, straight or simple (i.e. un-weighted) round-robin may
be implemented instead of weighted round-robin. Round-robin
weighting may be based on data units (cells, for example) or bytes,
according to various embodiments. Weighted fair queuing may also be
provided by some embodiments in place of or in addition to weighted
round-robin, and the weighted fair queuing may be based on data
units or bytes, according to various embodiments. Round-robin
processing may be based on previously processed information or on
queue depth, also according to embodiment.
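As a minimal sketch of two of the alternatives named above, the
following contrasts straight (static) priority with weighted
round-robin measured in data units. The function names, the
dictionary representation of queue backlog, and the requirement that
every weight be a positive integer are assumptions of the sketch.

```python
def straight_priority(backlog, reverse=False):
    """Pick the highest-priority non-empty queue, or None.

    backlog maps queue number -> queued data units. By default queue
    #1 is highest; with reverse=True the highest number wins.
    """
    ready = [q for q, depth in backlog.items() if depth > 0]
    if not ready:
        return None
    return max(ready) if reverse else min(ready)


def weighted_round_robin(backlog, weights):
    """Return the order in which data units are served.

    In each round, queue q may send up to weights[q] data units.
    Weights are assumed to be positive integers.
    """
    backlog = dict(backlog)  # do not modify the caller's copy
    order = []
    while any(backlog.values()):
        for q in sorted(backlog):
            burst = min(weights[q], backlog[q])
            order.extend([q] * burst)
            backlog[q] -= burst
    return order
```

For example, with three data units queued on each of queues #1 and #2
and weights of 2 and 1 respectively, the service order is
1, 1, 2, 1, 2, 2.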
VLAN Overview
[0264] In some embodiments, each VNIC is a member of one VLAN,
which is a port-based VLAN (i.e., a virtual LAN defined by logical
connection to a designated subset of available logical L2 switch
ports). In other embodiments, each VNIC may be a member of a
plurality of VLANs, including at least one port-based VLAN. In
various embodiments, the VLANs may be port-based, MAC
address-based, IP address-based, and L3 type VLANs. VLANs may be
provisioned and managed by programming VNIC address information
accordingly (such as VLAN Identifier 617.1, for example) and by
writing corresponding lookup state (such as that retained in
TCAM/SRAMs). VLAN management operations may be relatively static,
as related to endpoints included within a server, or relatively
dynamic, as related to endpoints external to the server. Internal
endpoint VLAN operations include server and cluster provisioning
and re-provisioning, VLAN specific provisioning, pluggable module
insertion and removal, and failover responses, for example. VLAN
operations may be supervised by controlplane processes executing on
a SCM (such as Primary SCM 140A), Driver processes executing on
Local Processors, or combinations of the two, according to
embodiment.
[0265] In some embodiments, VLAN related processing for egress data
to the fabric includes determining a VLAN identifier. If the VLAN
identifier does not match the source VLAN identifier, then the
egress data may optionally be dropped, according to embodiment. The
source VLAN identifier may be provided directly from VLAN
Identifier 617.1 or derived from it, according to embodiment. If
the destination MAC is not identifiable, then the egress data may
be flooded to all destinations allowed by the source VLAN
configuration, according to embodiment.
[0266] In some embodiments, VLAN related processing for ingress
data from the fabric includes determining which VNICs, if any, are
members of the VLAN identified by the received data, and providing
the data to the member VNICs appropriately. If no VNICs are members
of the destination VLAN, then the ingress data may be optionally
dropped, according to embodiment. In some embodiments, VLAN related
broadcasts to VNICs may be wholly or partially implemented using
multicast group processing. VLAN ingress processing may optionally
include determining the VLAN identifier associated with learning a
MAC Source Address (MACSA) associated with the ingress data.
Optionally, processing may further include dropping the ingress data
if the learning VLAN (i.e. the VLAN from which the MACSA was
learned) is different from the destination VLAN.
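The ingress membership check may be sketched as follows. The
function name, the dictionary of per-VNIC VLAN memberships, and the
treatment of the optional learning-VLAN check as a simple parameter
are assumptions made for illustration.

```python
def ingress_targets(frame_vlan, membership, learned_vlan=None):
    """Return the VNICs that should receive an ingress frame.

    membership maps a VNIC name to the set of VLAN ids it belongs to.
    learned_vlan, if given, is the VLAN on which the frame's MAC
    source address was learned; a mismatch drops the frame.
    An empty list means the frame is dropped.
    """
    if learned_vlan is not None and learned_vlan != frame_vlan:
        return []  # optional check: learning VLAN differs
    return sorted(vnic for vnic, vlans in membership.items()
                  if frame_vlan in vlans)
```

A frame on a VLAN with no member VNICs yields an empty list, which
corresponds to the optional drop described above.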
[0267] In some embodiments, VLAN broadcasts are implemented by
assigning a Multicast Group IDentifier (MGID) to each of the VLAN
broadcast groups. In one embodiment there are 32K available MGIDs
sharable between VLAN and IP broadcast groups.
[0268] Some embodiments implement Independent VLAN Learning (IVL),
other embodiments implement Shared VLAN Learning (SVL), and still
other embodiments implement IVL and SVL modes and configurable
selection between them. IVL and SVL both enforce inter-VLAN
isolation (within the same abstraction layer) through confirmation
of VLAN membership based on MAC address. In IVL, forwarding entries
(i.e., entries of the FIBs) are shared throughout the VLAN on which
they are learned, but are not shared with other VLANs. In SVL,
forwarding entries learned on a first VLAN are "shared" with other
VLANs. Under SVL, a forwarding entry learned for a MAC address on a
first VLAN (and therefore unreachable at the same abstraction layer
by other VLANs) is used by the other VLANs for the limited purpose
of dropping frames addressed to the MAC address on the first VLAN.
The MAC address is known by the other VLANs to be unreachable only
because of the SVL sharing. In this way, SVL prevents unnecessary
flooding within any of the other VLANs, which under IVL would have
occurred in a futile effort to reach the MAC address on the first
VLAN (which under IVL, is guaranteed to be unknown to the other
VLANs). Further details of IVL and SVL, particularly with respect
to TCAM/SRAM configuration and use, are provided in conjunction with
the discussion of FIGS. 8A and 8B.
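The difference between the two learning modes may be illustrated
with a small model. The dictionary-based forwarding table, the
function names, and the "forward"/"flood"/"drop" result tuples are
assumptions of the sketch and do not reflect the TCAM/SRAM layout.

```python
def learn(fib, mode, vlan, mac, port):
    """Record where `mac` was seen. mode is "IVL" or "SVL"."""
    if mode == "IVL":
        fib[(vlan, mac)] = port   # entry is private to the VLAN
    else:
        fib[mac] = (vlan, port)   # entry is visible to all VLANs


def forward(fib, mode, vlan, dst_mac):
    """Decide what to do with a frame sent on `vlan` to `dst_mac`."""
    if mode == "IVL":
        port = fib.get((vlan, dst_mac))
        if port is None:
            return ("flood", vlan)    # unknown within this VLAN
        return ("forward", port)
    entry = fib.get(dst_mac)
    if entry is None:
        return ("flood", vlan)        # unknown everywhere
    learned_vlan, port = entry
    if learned_vlan != vlan:
        return ("drop", None)         # known, but unreachable from here
    return ("forward", port)
```

With a MAC address learned on VLAN 10, a frame sent to that address
on VLAN 20 is flooded within VLAN 20 under IVL and dropped under
SVL, matching the behavior described above.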
[0269] Static VLAN management operations typically include
distribution of VLAN membership information, removing the need for
learning VLAN membership changes relating to provisioning, module
insertion and removal, and failover responses. Typically VLAN
learning operations are performed under the supervision of
SCM-based management processes.
[0270] VIOC Internals
[0271] As previously described, at least one VIOC is included in
each of PMM 150A, OLB 160A, FCM 120A, and each of the SCMs 140
(including Primary SCM 140A). Each VIOC typically implements
functionality for a plurality of VNICs. The VIOC is illustrated in
these various operating contexts in FIGS. 3A, 4A, 4B, 4D, and 4E.
VIOC operation was summarized in the pluggable modules section
above, in conjunction with an examination of VIOC 301.4 of FIG. 4A.
As shown in FIG. 3A, the VIOC 301.5 is coupled and adapted to
directly communicate packets 351 between RAM elements 350 and the
Primary Switch Fabric Module 180A. The RAM elements are also
accessible by one or more CPUs, enabling processes executing on the
CPUs to directly exchange data via the fabric. With reference
particularly to FIGS. 4A and 4B, CPUs coupled to RAMs accessible by
VIOCs in this manner are examples of Local Processors, and the
coupled RAMs are examples of Local Processor Memory. RAM elements
411.4L and 411.4R are accessible via an HT Channel 460.4R, and the
fabric is accessible via a Common Switch Interface consortium
(CSIX) channel 149A.
[0272] The transmission and reception of packet (and message) data
directly between Local Processor Memory and the fabric corresponds
to dataplane functionality. Control of dataplane functionality
corresponds to controlplane functionality and conceptually includes
forwarding tables and related state information included in part in
the TCAM/SRAM. Control packets (also known as VIOC-CP packets)
generated by the SCM are received via the fabric and processed by
the VIOCs, resulting in selective accesses to configuration
registers and the TCAM/SRAM coupled to each VIOC. The forwarding
and state information of the TCAMs/SRAMs is typically initialized
and maintained in this way. In various embodiments, the control
packets are provided by management and controlplane processes
executing on any combination of the SCMs, PMMs, and OLBs.
Configuration information of a more general nature is typically
provided in part by a BMC.
[0273] The VIOC and processes executing on the Local Processors
communicate in part by sharing portions of the Local Processor
Memory space. Included in these shared portions are the packet and
message queues as described in the VNIC overview and queuing
operation section. In addition, the VIOC itself appears as an
intelligent memory-mapped I/O device residing in a portion of the
Local Processor Memory space. In this way, the VIOC provides access
to configuration registers and certain state relating to packet
(and message) transmission and reception.
[0274] The packet transmit and receive descriptors associated with
the VNICs describe raw data buffers in the Local Processor Memory
where packet data is respectively read and written by the VIOC, via
DMA operations, in order to implement VNIC functions. In some
embodiments, at least some of the packet transmit and receive
descriptors are prefetched into buffers onto the VIOC to improve
performance. In certain embodiments, all of the packet receive
descriptors corresponding to the VIOC's VNICs are buffered. When
reference is made herein to one or more descriptors within or on
the VIOC, those skilled in the art will understand that such
descriptors are the buffered copies of the same descriptors within
the Local Processor Memory. The VIOC operates on the completion
queues directly in the Local Processor Memory.
[0275] In certain embodiments, the packet CSB's are held within the
VIOC to improve performance. The message context state is kept in
either the Local Processor Memory, or in memory private to the VIOC
(such as the TCAM/SRAM or the DDR DRAM discussed herein below).
Since in certain embodiments the packet CSBs that represent the
packet queues are held on-chip, and since some descriptors are
buffered on-chip, for some conceptual purposes the queues may be
thought of as residing within the VIOC. Those skilled in the art
will understand that this is an informal abstraction, as the queues
actually reside in Local Processor Memory.
[0276] Packet and message transmission corresponds to data flowing
out from the VIOC to the fabric, also known as VIOC egress, or
simply as egress when discussing the VIOC operation. Conversely,
packet and message reception corresponds to VIOC ingress, or simply
ingress.
[0277] FIG. 7A illustrates selected aspects of one VIOC embodiment
as VIOC block diagram 700A. Included in VIOC 301 are several
interfaces, including a unit for coupling to Double Data Rate (DDR)
DRAM memories (DDR Interface 701) via coupling 721, a unit for
coupling to an HT channel (HT Interface 702) via coupling 722, and
a block for coupling to a BMC (BMC Interface 718 included in VIOC
Control 704) via coupling 733.
[0278] Further included in VIOC 301 are FICTX 714 (an instance of a
VIOC fabric interface communication transmit unit) and FICRX 715
(an instance of a VIOC fabric interface communication receive
unit). FICTX 714 includes egress path elements Vioc EGRess
interface (VEGR) 708, and CSix Transmit unit (CSTX) 710. VEGR 708
includes DMATX 716, an instance of a transmit DMA unit; ECSM 735,
an instance of Egress Control State Machines; and ESMem 736, an
instance of an Egress Shared Memory. FICRX 715 includes ingress
path elements Vioc INGress interface (VING) 709, and CSix Receive
unit (CSRX) 711. VING 709 includes DMARX 717, an instance of a
receive DMA unit; ICSM 737, an instance of Ingress Control State
Machines; and ISMem 738, an instance of an Ingress Shared Memory
that in some embodiments is an implementation of the receive logic
memory.
[0279] Flow control blocks Csix Flow Control Unit Transmit side
(CFCUTX) 712 and Csix Flow Control Unit Receive side (CFCURX) 713,
included in VIOC 301, are coupled from the receive path to the
transmit path. As described in more detail elsewhere herein, CFCUTX
712 is used to temporarily suspend sending by CSTX 710 upon
receiving an indication of fabric congestion and CFCURX 713 is used
to indicate VIOC congestion to other modules. Other VIOC elements
include RXDmgr 766, and shared egress and ingress elements Lookup
Engine (LE) 703 and Message State Machine 707. VIOC 301 control
elements include VIOC Control 704, in turn including SIM Interface
705, VIOC Configuration block 706, and BMC Interface 718.
[0280] For clarity and emphasis, only selected egress and ingress
data flows are indicated explicitly in the figure. Generally,
egress data enters VIOC 301 via HT Channel coupling 722, and flows
from HT Interface 702 to VEGR 708 via coupling 750, under control
of DMA read protocols implemented by DMATX 716. The egress data
continues to CSTX 710 via coupling 751, exiting CSTX 710 via
coupling 753, and exits VIOC 301 via Fabric Coupling 732.
Similarly, ingress data flows in a symmetric reverse path, entering
via Fabric Coupling 732 and continuing to CSRX 711 via coupling 763
and then to VING 709 via coupling 761. The ingress data proceeds to
HT Interface 702 via coupling 760 under control of DMA write
protocols implemented by DMARX 717 to exit VIOC 301 via HT Channel
coupling 722.
[0281] Information related to egress flow control is provided from
CSRX 711 to CFCUTX 712 via coupling 752r. Egress flow control
commands are provided from CFCUTX 712 to CSTX 710 via coupling
752t. Information related to ingress flow control is provided from
CSRX 711 to CFCURX 713 via coupling 762r. Ingress flow control
commands are provided from CFCURX 713 to CSTX 710 via coupling
762t. Control packet handshaking is provided from FICRX 715 to
FICTX 714 as shown by ingress-egress coupling 772.
[0282] Although couplings 750, 751, 753, 760, 761, 763, and 772 are
illustrated as unidirectional, this is only to highlight the
primary flow of data, as control and status information, for
example, flows along similar pathways in a bidirectional manner.
Internal egress path related coupling 770 and ingress path related
coupling 771 illustrate LE 703 request and returning status and
result communication with VEGR 708 and CSRX 711, respectively.
[0283] VIOC Configuration block 706 includes configuration and mode
information relating to operation of VIOC 301, generally organized
into registers, including system configuration registers and local
configuration registers. The system and local configuration
registers are typically accessed by management processes executing
on Primary SCM 140A, by control packets sent to Fabric Coupling
732, and then processed by CSRX 711 and SIM Interface 705. The
system registers are typically inaccessible to processes executing
on Local Processors, and include a plurality of scratchpad
registers typically used for communication with the management
processes. However, the local registers are typically accessible
via the HT channel by Hypervisor, OS, and Driver processes
executing on Local Processors. Hypervisor and OS processes
typically configure environments for Application processes so that
the local configuration registers are inaccessible to the
Application processes.
[0284] The system registers include VNIC related registers, such as
Address Block 601.1 (of FIG. 6A) for each of 16 VNICs. Also
included is a bit (or mask) per VNIC to enable and disable the
corresponding VNIC. The local registers include pointers and state
information associated with I/O Block 605.1 (of FIG. 6A) of each of
the 16 VNICs.
[0285] During initialization and debug, Local Processor access to
the system registers may be provided by manipulation of a field in
the system configuration registers. In addition, the system and
local configuration registers are accessible via BMC command and
data information received from BMC Interface 718.
[0286] In one embodiment, VIOCs included on controlplane modules
(such as SCMs included in SIMs) are initialized by BMC commands to
enable selected privileged operations, including transmission via
the fabric of control packets without lookup processing (these
packets are also referred to as `No Touch` packets). No Touch
packets may be used for control packets (to initialize or modify
forwarding information included in TCAM/SRAMs) and to forward an
exception packet from an SCM to the proper destination. VIOCs
included on other modules (such as PMMs, OLBs, and FCMs) are
initialized to disable No Touch packet transmission, i.e. packets
(and messages) are always processed with an egress lookup.
[0287] SIM Interface 705 is coupled to receive control packets from
CSRX 711 as typically provided by controlplane processes executing
on an SCM included in a SIM. The control packets are parsed to
determine the included command and any associated parameters, such
as address and data. SIM Interface 705 then passes the command and
parameters to the proper element of VIOC 301 for execution. Return
handshake status is typically provided in the form of a packet
addressed to the sending SCM from FICRX 715 to FICTX 714 via
ingress-egress coupling 772, and FICTX 714 provides the packet to
the fabric.
[0288] BMC Interface 718 includes logic for interfacing to a BMC,
including receiving, processing, and responding to BMC commands
received via coupling 733. The interface parses the command,
provides the command and any associated parameters to the proper
unit of VIOC 301, and returns response information as
appropriate.
[0289] HT Interface 702 includes an HT Channel compatible external
interface providing read and write style accesses to resources
available via coupling 722. Read response information (typically
associated with egress processing) is provided to VEGR 708 via
coupling 750. Write information (typically associated with ingress
processing) is provided from VING 709 via coupling 760. In a
typical usage scenario, the read and write accesses target memory
locations in RAMs coupled to CPUs coupled in turn to HT Channel
coupling 722 (i.e. Local Processor Memory). HT Channel coupling 722
is an illustrative instance of HT couplings including 460.4R,
460.5R, 460.5R', 460.6R (of FIGS. 4A, 4B, 4B, and 4E,
respectively).
[0290] DDR Interface 701 includes logic for coupling to DDR DRAMs
via coupling 721. DDR Interface 701 communicates with Message State
Machine 707 via coupling 767, as shown. DDR Interface 701 also
communicates with other elements of VIOC 301 via implicit
communication paths that allow access to the DRAMs via control
packets (SIM Interface 705), BMC commands (BMC Interface 718), and
processes executing on Local Processors (HT Channel coupling 722),
in addition to VIOC internal requests (Message State Machine 707,
VIOC Control 704, FICTX 714, and FICRX 715). The topology of these
paths will be understood by those of ordinary skill in the art. In
various embodiments, DDR SDRAM typically includes data structures
related to context and message related processing (such as CSBs),
as well as virtual to physical address translation for transmit and
receive buffer addresses in Local Processor Memory.
[0291] Message State Machine 707 manages message state (including
connecting, established, closing, and closed) for a plurality of
message contexts, such as those associated with the 16 VNICs,
according to a connection-oriented reliable protocol. According to
embodiment, message state is stored in part in DDR coupled via
coupling 721 to DDR Interface 701, and coupling 767 communicates
requests for DDR reads and writes, as well as resultant read data
between the state machine and the DDR interface. The state machine
also provides for message handshaking and re-delivery attempts by
appropriate processing and communication with FICTX 714 and FICRX
715, via implicit couplings that will be understood by those of
ordinary skill in the art. For example, message related inspection
and processing of incoming information may be performed in CSRX 711
under the direction of Message State Machine 707. Similarly,
message related processing and information insertion may be
performed in CSTX 710 also under the control of the state machine.
Further included in Message State Machine 707 are logic units for
performing RDMA.
[0292] RXDmgr 766 includes logic for fetching and prefetching
receive descriptors to support ingress operation. Receive
descriptor requirements and results are communicated with FICRX 715
via coupling 764. Requests to read descriptors from Local Processor
Memory are provided to HT Interface 702 via coupling 765, and the
read data is returned via the same coupling.
[0293] FICTX 714 includes logic (VEGR 708) implementing egress path
processing, including accessing packet data for transmission and
cellification using DMA protocols, according to configured
priorities and bandwidth allocations, and including one lookup (LE
703 via coupling 770). The lookup typically provides a fabric
egress port based in part on the packet destination address
(typically a MAC address) and relevant VLAN related information.
The included logic also implements packet data cellification and
CSIX cell-level processing (CSTX 710). An overview of selected
aspects of packet access and cellification is provided with respect
to FIG. 3A. In some embodiments, FICTX 714 processes selected
multicast packets (and hence cells) using cell-level multicast
capability provided by the fabric.
[0294] VEGR 708 includes logic blocks performing packet egress
processing functions including transmit queue management and
scheduling (see FIG. 6B and the related discussion), transmit
packet scheduling, packet segmentation into cells (including a
packet address processing lookup via LE 703), various control state
machines within ECSM 735, and an egress shared memory ESMem 736.
DMATX 716, included in VEGR 708, is configured to transfer packet
image data from Local Processor Memory to the egress shared memory,
and further configured to transfer data from the egress shared
memory to CSTX 710. In an illustrative embodiment, the VOQs are
implemented as pointer managed buffers that reside within the
egress shared memory. The DMA transfers are managed by the control
state machines in VEGR 708 according to bandwidth and priority
scheduling algorithms.
[0295] CSTX 710 includes logic blocks performing cell egress
processing functions. Logic units in CSTX 710 read cell data
according to the VOQs as scheduled by a VOQ prioritizing algorithm
(see FIG. 6C and the related discussion), calculate horizontal
parity, vertical parity, and CRC for each cell, and then send the
results and the cell data to the fabric. Logic units in CSTX 710
include CSIX egress queue structures and associated transmit data
path (FIFO) buffers, CSIX compatible transmit flow control logic
responsive to information received from CFCUTX 712, logic
responsive to information received from CFCURX 713 (to apply fabric
back-pressure using CSIX compatible receive flow control
instructions), and a transmit-side CSIX compatible external
interface for Fabric Coupling 732.
[0296] CFCUTX 712 (shown outside of FICTX 714 in the figure, but
closely associated with egress processing) includes fabric
congestion detection logic and VOQ feedback control logic to
instruct CSTX 710 to stop sending cell traffic from a VOQ when
fabric congestion is detected. When the congestion is relieved, the
logic instructs CSTX 710 to resume cell traffic from the stopped
VOQ. Fabric congestion information is provided to CFCUTX 712 from
CSRX 711 as it is received from the fabric.
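The stop and resume behavior may be modeled, very roughly, as a set
of stopped VOQs consulted before each cell is sent. The class and
method names below are hypothetical, and the per-VOQ granularity of
the congestion indication is an assumption of the sketch.

```python
class VoqFeedbackControl:
    """Hypothetical model of per-VOQ stop/resume feedback."""

    def __init__(self):
        self.stopped = set()

    def on_fabric_indication(self, voq, congested):
        """Apply a congestion indication received from the fabric."""
        if congested:
            self.stopped.add(voq)
        else:
            self.stopped.discard(voq)

    def sendable(self, voqs_with_cells):
        """Filter out VOQs that are currently stopped."""
        return [v for v in voqs_with_cells if v not in self.stopped]
```

The VOQ prioritizing algorithm would then choose only among the
queues returned by sendable().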
[0297] The VOQ prioritizing algorithm implemented in CSTX 710, as
illustrated in FIG. 6C, includes configurable weighted round-robin
priority between unicast output queues and multicast output queues,
round-robin priority among VOQ groups, and straight priority within
VOQ groups. The algorithm also guarantees that all cells associated
with a given packet are sent in order, and further that cells from
different packets from the same VOQ are not intermingled. In other
words, once a first cell for a packet from a selected one of the
VOQs is sent, then the remainder of the cells for the packet are
sent before any cells of any other packet from the selected VOQ are
sent.
[0298] FICRX 715 includes logic implementing ingress path
processing, including CSIX cell-level processing (CSRX 711), and
packet-level processing (VING 709), including storing reassembled
packets using DMA protocols. An optional lookup (LE 703) is
performed under the control of CSRX 711 via coupling 771. The
lookup provides information related to processing the packet,
including mapping the packet to the proper receive queue.
[0299] CSRX 711 includes logic blocks performing cell ingress
processing functions. Logic units in CSRX 711 receive, buffer, and
parse cell data from the fabric. Logic units in CSRX 711 include a
receive-side CSIX compatible external interface for Fabric Coupling
732, CSIX ingress queue structures and associated CSIX receive data
path (FIFO) buffers, a CSIX cell parser unit, and transmit and
receive flow control information detection logic.
[0300] CFCURX 713 (shown outside of FICRX 715 in the figure, but
closely associated with ingress processing) includes VIOC
congestion detection logic and fabric feedback control logic to
instruct the fabric to stop sending cell traffic of a specific
priority when VIOC congestion is detected for that priority. When
the congestion is relieved, the logic instructs the fabric to
resume cell transmission. Receive flow control instructions to the
fabric are communicated via CSTX 710. This method of congestion
relief is referred to elsewhere herein as applying fabric
back-pressure.
[0301] Cell data is received from the fabric, including horizontal
parity, vertical parity, and CRC. The parities and CRC are computed
for the received data, checked, and errors logged. Cell and packet
headers are parsed, and in some embodiments an optional lookup is
performed (LE 703) for selected unicast packets to determine in
part an appropriate receive queue. In some embodiments, an optional
lookup is performed for multicast packets, VLAN broadcast packets,
or both, according to embodiment, to determine in part one or more
appropriate receive queues or multicast group identifiers, also
according to embodiment. Unicast lookups (if performed) are
typically based in part on a source fabric port address and a
context key included in the packet header. Some embodiments omit
unicast lookups entirely. Control packet data is written into a
control packet portion of the CSRX's FIFOs, and subsequently sent
to SIM Interface 705 for further processing, while non-control
packet data is written to a data portion of the CSRX's FIFOs.
[0302] VING 709 includes logic blocks performing packet ingress
processing functions including receive and completion queue
management and scheduling, receive packet scheduling (see FIG. 6D
and the related discussion), cell reassembly into packets, various
control state machines, and an ingress shared memory. DMARX 717,
included in VING 709, is configured to transfer cell data into
Local Processor Memory from the Ingress Shared Memory (ISMem 738).
The DMA transfers are managed by the Ingress Control State Machines
(ICSM 737) in VING 709. A receive buffer is considered complete (or
consumed) when either the last available location in a buffer is
written, or the last cell of a packet is written. Buffer completion
is indicated by writing an entry to one of the completion queues,
with data including packet receive status (Error or OK), receive
processing (or thread) number, and context key (if the data
includes the last cell of the packet). The completion queue write
information optionally includes results of packet-level CRC and 1's
complement computations for use by Driver or other processes
executing on the Local Processors.
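The buffer completion rule may be sketched as follows. The sketch
counts abstract, equal-sized data units rather than bytes, and the
function name and the fields of each completion record are
assumptions; the status, thread number, and context key fields
described above are omitted.

```python
def completions(units, capacity):
    """Return completion records for a stream of data unit writes.

    units is an iterable of booleans, True when the data unit is the
    last one of a packet. capacity is the number of data units one
    receive buffer holds. A buffer completes when its last location
    is written or when the last unit of a packet is written.
    """
    events = []
    used = 0
    for last_of_packet in units:
        used += 1
        if used == capacity or last_of_packet:
            events.append({"units": used, "end_of_packet": last_of_packet})
            used = 0  # the next write goes to a fresh buffer
    return events
```

A four-unit packet written into three-unit buffers therefore
produces two completions: one for the full first buffer and one,
marked as end of packet, for the partly filled second buffer.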
[0303] VING 709 controls movement of cell data from the ingress
shared memory (ISMem 738) to Local Processor Memory during packet
reconstruction according to various receive scheduling algorithms
(see FIG. 6E and the related discussion), including determination
of buffers in the Local Processor Memory, selection of cell data to
transfer, and movement of the packet data to the Local Processor
Memory. Buffers are determined by processing receive descriptors,
and in some embodiments the receive descriptors are fetched and
processed according to a round-robin priority between the groups of
receive queues. Cell data is scheduled for transfer to the Local
Processor Memory according to a selection between unicast and
multicast queues for insertion into the receive queues, as
illustrated by FIG. 6D. This is followed by a weighted
round-robin scheduling algorithm between the groups of receive queues, as
illustrated by FIG. 6E, according to various embodiments. Packet
data movement into the Local Processor Memory is also directed in
part according to flow control information from HT Interface 702
that VING 709 responds to in order to prevent overrunning limited
resources in the HT interface.
[0304] Fabric Coupling 732, in one embodiment, includes a Fabric
Interface Chip (FIC) providing low-level functions relating to
coupling to an embodiment of SFM 180 that includes Fabric Switch
Chips (FSCs). Fabric Coupling 732 is an illustrative instance of
generic fabric couplings, which in the system as a whole further
includes FCM-Fabric coupling 129A, NM-Fabric coupling 139A,
OLB-Fabric coupling 169A, PMM-Fabric coupling 159A, and PMM-Fabric
coupling 159A', for example.
[0305] LE 703 includes TCAM and SRAM interfaces, and accepts egress
lookup requests from VEGR 708 and ingress lookup requests from CSRX
711. Lookup requests include a key and a lookup transaction
identifier. The TCAM is searched for a first entry matching the
key, and if a match is found, a corresponding entry from the SRAM
is read. When processing is complete and lookup information is
available, the requestor is notified by a handshake, and the
transaction identifier, a match indication, and result data (if a
match) are returned to the requestor (one of VEGR 708 and CSRX
711). Request processing is pipelined in LE 703, but if the Lookup
Engine is unable to accept a request, then an acceptance delayed
indicator is provided to the requestor. In one embodiment the key
and the results are each 64 bits, both are multiplexed in two
32-bit chunks, and the transaction identifier is 4 bits.
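The request and response handshake described above can be sketched in software. The following is a minimal illustration only, assuming exact-match comparison (masks are omitted) and assuming the high 32-bit chunk is transferred first; the names split_key, join_chunks, LookupResponse, and lookup are hypothetical and do not appear in the described embodiments.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MASK32 = 0xFFFFFFFF


def split_key(key64: int) -> Tuple[int, int]:
    """Split a 64-bit key into two 32-bit chunks (high chunk first: an assumption)."""
    if not 0 <= key64 < (1 << 64):
        raise ValueError("key must fit in 64 bits")
    return (key64 >> 32) & MASK32, key64 & MASK32


def join_chunks(high: int, low: int) -> int:
    """Rebuild the 64-bit value from its two 32-bit chunks."""
    return ((high & MASK32) << 32) | (low & MASK32)


@dataclass
class LookupResponse:
    txn_id: int             # 4-bit transaction identifier echoed to the requestor
    match: bool             # match indication
    result: Optional[int]   # result data, present only on a match


def lookup(tcam: List[int], sram: List[int], key64: int, txn_id: int) -> LookupResponse:
    """Return the SRAM entry paired with the first TCAM entry equal to the key.

    Masks are omitted in this sketch, so matching is exact.
    """
    if not 0 <= txn_id < 16:
        raise ValueError("transaction identifier must fit in 4 bits")
    for index, entry in enumerate(tcam):
        if entry == key64:
            return LookupResponse(txn_id, True, sram[index])
    return LookupResponse(txn_id, False, None)
```

The transaction identifier is simply echoed back, which is what would let a requestor with several pipelined requests outstanding pair each response with its request.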
[0306] In addition to lookup requests during egress and ingress
processing, LE 703 supports directly reading and writing the TCAM
and SRAM to examine and modify lookup information, via requests
from BMC Interface 718, SIM Interface 705, and HT Interface
702.
[0307] In operation, VIOC 301 and related components are
initialized to set configuration, mode, initial state, and other
related information. For example, selected management and
configuration control information maintained in VIOC Configuration
block 706 is written by an external BMC via coupling 733 and BMC
Interface 718. Additional information is optionally written by an
external agent via packets received from Fabric Coupling 732, CSRX
711, and SIM Interface 705. Additional information may also be
optionally written by an agent coupled to HT Channel coupling 722
via HT Interface 702. Typically, the management and configuration
control information is provided by management processes executing
on Primary SCM 140A, as described elsewhere herein.
[0308] Initial (as well as subsequent) ingress and egress lookup
information, including forwarding, VLAN, and multicast information,
is typically provided by controlplane and related processes
executing on Primary SCM 140A. The information is included in
packets sent by the processes and received via Fabric Coupling 732,
CSRX 711, and SIM Interface 705. The lookup information is stored
in TCAM/SRAM resources coupled to VIOC 301 via TCAM/SRAM coupling
723 by LE 703. Portions of state stored in the TCAM/SRAM may also
be optionally initialized by the agent coupled to HT Channel
coupling 722 via HT Interface 702 and LE 703.
[0309] After initialization, VIOC 301 and related elements are
available for directly communicating packets (and messages) between
clients coupled to the fabric, as described with respect to FIG. 3A
and FIG. 4A. The communication is bidirectional, including egress
(from Local Processor Memory to fabric) and ingress (from fabric to
Local Processor Memory), and is typically accessible to processes
executing on Local Processors via a VNIC-style interface as
illustrated by FIG. 6A.
[0310] Egress operation serves to directly transmit a buffer of
packet data, as provided by a Driver process in conjunction with an
OS executing on a Local Processor, to the fabric. The Driver (or
optionally an Application process) forms a packet image within the
buffer. A transmit descriptor, including the physical address of
the buffer, the buffer size, a valid indicator, and a done
indicator, is fabricated by the Driver and placed on one of the
transmit descriptor queues. The valid indicator is asserted by the Driver
to indicate the descriptor is ready for processing by VIOC 301. The
done indicator is initially deasserted by the Driver and later
asserted by VIOC 301 when the descriptor and the underlying buffer
data have been fully processed by the VIOC. Upon assertion of done
the buffer is available for subsequent use by the Driver.
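The ownership handoff implied by the valid and done indicators can be illustrated with a small software model. This is a sketch under stated assumptions, not the described hardware: the queue is modeled as a Python list, and the names TransmitDescriptor, driver_post, vioc_process, and driver_reclaim are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TransmitDescriptor:
    buffer_address: int   # physical address of the packet buffer
    buffer_size: int      # size of the buffer in bytes
    valid: bool = False   # asserted by the Driver: ready for the VIOC
    done: bool = False    # asserted by the VIOC: buffer fully processed


def driver_post(queue: List[TransmitDescriptor], address: int, size: int) -> TransmitDescriptor:
    """Driver side: build a descriptor with valid asserted and done deasserted."""
    descriptor = TransmitDescriptor(address, size, valid=True, done=False)
    queue.append(descriptor)
    return descriptor


def vioc_process(queue: List[TransmitDescriptor]) -> int:
    """VIOC side: process each valid, unfinished descriptor and assert done."""
    processed = 0
    for descriptor in queue:
        if descriptor.valid and not descriptor.done:
            descriptor.done = True
            processed += 1
    return processed


def driver_reclaim(queue: List[TransmitDescriptor]) -> List[int]:
    """Driver side: remove completed descriptors and return the reusable buffer addresses."""
    freed = [d.buffer_address for d in queue if d.done]
    queue[:] = [d for d in queue if not d.done]
    return freed
```

In this model the Driver never touches a buffer between posting it and observing done, which mirrors the rule that the buffer becomes available for reuse only upon assertion of done.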
[0311] After forming the packet image and the transmit descriptor,
in one illustrative scenario, the Driver informs VIOC 301 that
additional packet data is available for transmission by accessing a
corresponding Doorbell, asynchronously interrupting VIOC 301. The
Doorbell access is sent via HT Channel coupling 722, received by HT
Interface 702, and processed by VIOC Control 704. In another
illustrative scenario, instead of the Doorbell access, VIOC 301
polls transmit descriptors, examining the associated valid
indicators, to determine that additional packet data is available
for transmission.
[0312] When additional packet data is available for transmission,
VEGR 708 accesses the transmit queue using DMA processing functions
included in DMATX 716 according to the bandwidth and priority
scheduling algorithms of FIG. 6B. Algorithms implemented by the
priority scheduling of the transmit queue accesses include straight
priority, round-robin, and weighted round-robin, and priority is
determined between transmit packet queues and transmit message
queues. The information obtained from the queue includes a
descriptor including a pointer to the new packet images. VEGR 708
analyzes the descriptor, providing the pointer to DMATX 716 and
requesting additional accesses to begin to read in the packet
image. The packet image begins with a packet header, including
packet destination and priority information. VEGR 708 formats
information from the packet header, including the destination,
along with VLAN processing related information, into a lookup
request in a suitable format, and passes the request to LE 703. LE
703, in turn, accepts the request when room for the request is
available, and processes it by accessing the coupled TCAM/SRAM. The
lookup result, including a destination fabric port address, is used
in forming appropriate cell headers, including a fabric destination
port address. Packet data is cellified and sent to CSTX 710 for
further processing.
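Of the scheduling algorithms named above, weighted round-robin is the least obvious, so a minimal sketch follows. It assumes a simple per-round quota interpretation of the weights; the actual arbitration of FIG. 6B is not specified here, and the function name and queue names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple


def weighted_round_robin(queues: Dict[str, List[str]],
                         weights: Dict[str, int]) -> Iterator[Tuple[str, str]]:
    """Yield (queue name, item) pairs.

    In each round, every queue may send up to weights[name] items before
    the scheduler moves on to the next queue. Empty queues are skipped.
    """
    pending = {name: deque(items) for name, items in queues.items()}
    while any(pending.values()):
        for name, queue in pending.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                yield name, queue.popleft()
```

For example, with a weight of 2 for a packet queue and 1 for a message queue, the packet queue sends two descriptors for every one sent by the message queue while both are backlogged. Straight priority would instead drain the higher priority queue completely first, and plain round-robin corresponds to all weights being equal to one.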
[0313] CSTX 710 receives the cellified packet data, including cell
header information, and processes the data according to the VOQ
prioritizing algorithms of FIG. 6C. Cell data is sent according to
the configured priority algorithm, and CSTX 710 is further
responsive to flow control instructions received from CFCUTX 712.
Cell parities and CRC are calculated and provided to Fabric
Coupling 732 along with the cell header and data information.
[0314] Ingress operation is conceptually the reverse of egress
operation, and serves to directly receive packet data into a buffer
in Local Processor Memory, the buffer being directly accessible by
a Driver process (and optionally an Application process) executing
on a Local Processor. A receive descriptor, including the physical
address of the buffer, the buffer size, and a valid indicator, is
fabricated by the Driver and placed on one of the receive
descriptor queues. The valid indicator is asserted by the Driver to
indicate the descriptor is ready for use by VIOC 301. Typically
VIOC 301 prefetches (under the direction of RXDmgr 766) and
preprocesses one or more receive descriptors in preparation for
receiving cell data and storing it into a new receive buffer in
Local Processor Memory. A completion queue entry is written by VIOC
301 when the buffer has been filled with packet image data.
[0315] CSRX 711 receives CSIX cells, checks parities and CRC for
the received cells, parses cell headers, and for the first cells of
packets, parses a packet header. Information related to flow
control is provided to CFCURX 713, and fabric back-pressure is
applied (via CSTX 710) when VIOC congestion is detected. A lookup
is performed via LE 703 for the first cells of multicast packets,
to determine proper destinations and required replication of the
packet. Further within CSRX 711, control packet data is FIFOed for
presentation to and processing by SIM Interface 705, while
non-control packet data is FIFOed for further data path processing
in accordance with FIG. 6D as discussed elsewhere herein.
[0316] VING 709 directs DMARX 717 to store received non-control
packet data as complete or partially reassembled packets into Local
Host Memory via DMA transfers according to the various receive
scheduling algorithms of FIG. 6E. When a buffer is completely
filled (or the last datum of a last cell of a packet is stored),
VING 709 directs writing a corresponding completion queue entry,
including status (Error or OK), thread number, context key, and
optionally packet-level CRC and 1's complement results. This
completes the reception of the packet (if the last cell was
received) and the packet image is available for use directly by the
Driver (or optionally an Application) process executing on a Local
Processor.
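The buffer completion rule can be illustrated as follows. This is a sketch only: the CRC and 1's complement fields are omitted, the context key is included only when the last cell of the packet has been stored (as stated earlier for completion queue writes), and the names CompletionEntry and ReceiveBuffer are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletionEntry:
    status: str                  # "OK" or "Error"
    thread: int                  # receive processing (thread) number
    context_key: Optional[int]   # present only when the last cell was stored


class ReceiveBuffer:
    """Tracks how much of one receive buffer has been filled."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.used = 0

    def store_cell(self, cell: bytes, is_last_cell: bool, thread: int,
                   context_key: int, error: bool = False) -> Optional[CompletionEntry]:
        """Store one cell; return a completion entry if the buffer is now consumed."""
        if self.used + len(cell) > self.size:
            raise ValueError("cell does not fit in the remaining buffer space")
        self.used += len(cell)
        # A buffer is consumed when it is full or when the packet ends.
        if is_last_cell or self.used == self.size:
            return CompletionEntry("Error" if error else "OK", thread,
                                   context_key if is_last_cell else None)
        return None
```

A packet larger than one buffer therefore produces several completion entries, and only the final one carries the context key.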
[0317] Control packets are sent in-order to SIM Interface 705 for
further processing. SIM Interface 705 parses the control packet and
passes command, address, and data information to the appropriate
VIOC element for execution. Return handshake status and result
information is typically provided via ingress-egress coupling 772
as a packet (typically addressed to an SCM) for transmission to
Fabric Coupling 732. Control packets, typically provided from a
controlplane process executing on Primary SCM 140A, may be used at
any time to effect updates or changes to forwarding, VLAN,
multicast, and other related state information included in
TCAM/SRAM coupled to VIOC 301.
[0318] Egress and ingress buffer operation is not restricted to
physical addresses, as virtual addresses may be supplied in
transmit and receive descriptors. VIOC 301 references coupled DDR
DRAM via coupling 721 to access translation mapping information.
The VIOC then translates the virtual addresses to physical
addresses and processing proceeds accordingly.
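The translation step can be sketched as a table lookup. The organization of the mapping information in the DDR DRAM is not described here, so the following assumes a flat page table with 4096-byte pages purely for illustration; PAGE_SIZE, the dictionary layout, and the function name translate are all hypothetical.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size, not taken from the description


def translate(virtual_address: int, page_table: Dict[int, int]) -> int:
    """Map a virtual address to a physical address.

    page_table maps virtual page numbers to physical page numbers.
    The offset within the page is carried over unchanged.
    """
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"no mapping for virtual page {page:#x}")
    return page_table[page] * PAGE_SIZE + offset
```

Once a descriptor's address has been translated this way, the remaining egress or ingress processing is the same as for a descriptor that supplied a physical address directly.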
[0319] During message operation, Message State Machine 707 manages
various aspects of the reliable connection-oriented protocol, and
directs overall processing according to message related queues and
associated descriptors. The reliable protocol includes handshake
and retry information that is communicated from VING 709 to VEGR
708 via ingress-egress coupling 772 for communication as messages
on Fabric Coupling 732. Message operation is otherwise similar to
packet operation.
[0320] Some embodiments implement packet operation only, while
other embodiments implement packet and message operation.
Embodiments implementing only packet operation omit Message State
Machine 707 and associated processing logic.
[0321] Those of ordinary skill in the art will recognize that VIOC
301 is an example embodiment only. For example, consider that in
other embodiments the external couplings may have differing
bandwidths to that described heretofore. For example, a VIOC used
in a controlplane context, such as an SCM included on a SIM, may be
implemented with reduced bandwidth variations of Fabric Coupling
732 and HT Channel coupling 722 compared to a VIOC used in a
dataplane context, such as a PMM. In one embodiment, the VIOC
included in the SCM is coupled to the SFM via one-half the coupling
bandwidth of the VIOC included in the PMM. In another embodiment,
the VIOC included in the OLB is coupled to the SFM via one-half the
coupling bandwidth of the VIOC included in the PMM.
[0322] Those of ordinary skill in the art will also recognize that
the partitioning illustrated in VIOC 301 is only illustrative, and
that any number of other arrangements and variations may be
implemented. For example, the functions relating to SIM Interface
705, VIOC Configuration block 706, and Message State Machine 707,
may be incorporated within the control state machines of VEGR 708
and VING 709. Functions implemented in the blocks of FICTX 714 and
FICRX 715 may be rearranged in other equivalent organizations.
These and other such arrangements are contemplated in various
embodiments.
VIOC Egress Operation
[0323] FIG. 7B illustrates selected aspects of egress operation of
an embodiment of a VIOC as flow diagram Egress Overview 700B. For
clarity, only selected details related to packet processing are
shown (message processing is similar). Processing begins upon
receipt of "Doorbell Ring" 781, indicating availability of one or
more new transmit descriptors pointing to packets available for
transmission. Flow proceeds to "Descriptor Fetch" 780.1, where
transmit descriptors are fetched (in accordance with the scheduling
illustrated by FIG. 6B) and passed to "Valid?" 780.2 to determine
which descriptors are valid for processing by the VIOC. If an
invalid descriptor is detected, then an error condition is present,
and processing is complete ("No" 780.2N proceeds to "End" 780.14).
If the descriptor is valid, then flow continues to "Program DMATX
Transmit Q Fetch" 780.4 via "Yes" 780.2Y.
[0324] Processing continues as "Program DMATX Transmit Q Fetch"
780.4 analyzes the fetched descriptor information to determine the
buffer address and length, and configures DMATX 716 to fetch the
packet data located in the buffer and to store the data into ESMem
736. The fetched packet data is in turn analyzed to determine the
destination, and a lookup is performed according to the MAC
destination address (MACDA) and the VLAN associated with the
descriptor at "Lookup" 780.5. The lookup result, including a
destination fabric port address, is used in part by "Form Packet
Header" 780.6 to formulate a working image of a packet header.
Typically the packet header includes other information from the
address block of the VNIC sourcing the transmission (such as
Address Block 601.1, of FIG. 6A), including a MAC source address
(such as MAC Address 603.1, of FIG. 6A), and a VLAN tag (such as
VLAN Identifier 617.1, of FIG. 6A). Some embodiments implement VLAN
processing such as dropping the packet if source and destination
VLANs are different.
[0325] Processing continues as "Scheduled?" 780.7 determines if a
first cell of the packet is scheduled, and if not, loops back via
"No" 780.7N until the cell is scheduled. The scheduling is as
illustrated in FIG. 6C. When the cell is scheduled ("Yes" 780.7Y),
flow proceeds to "Program DMATX Output Q Fetch" 780.8 where DMATX
716 is programmed to fetch data from ESMem 736 for insertion into
the appropriate output queue. In one embodiment, the output queues
are implemented within ESMem 736. It will be understood that data
transfers within the same memory structure may be at least in part
performed logically via pointer manipulation rather than via
physical data transfers. A cell header is formulated in "Form Cell
Header" 780.8A, for encapsulating cells corresponding to the
packet. The fetched data is processed ("Compute Checksum, CRC"
780.8B) to determine packet-level error detection information in
CSTX 710 (of FIG. 7A). The cell header and cell data (corresponding
to cell-sized chunks of the packet data) are then ready for
transmission on the fabric by CSTX 710 ("Transmit Cells"
780.9).
[0326] Processing continues as "Packet Complete?" 780.10 determines
if the entire packet has been transmitted. If not ("No" 780.10N),
then flow returns to "Scheduled?" 780.7 to continue sending the
packet. If the entire packet has been transmitted ("Yes" 780.10Y),
then flow proceeds to "Modify Transmit Q Descriptor" 780.11 to
indicate that the buffer identified by the transmit descriptor has
been transmitted by setting the done indicator accordingly.
[0327] Processing continues as "Interrupt Requested?" 780.12
determines if an interrupt to the Local Processor is requested,
based in part on an interrupt request indicator included in the
transmit descriptor, in one embodiment. If an interrupt is
requested ("Yes" 780.12Y), then flow continues to request an
interrupt ("Interrupt" 780.13) and then processing of the
information related to the descriptor is complete ("End" 780.14).
If an interrupt is not requested ("No" 780.12N), then processing is
complete ("End" 780.14).
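The control flow of Egress Overview 700B, as described in the preceding paragraphs, can be traced with a short sketch that lists the steps visited for one descriptor. It is an illustration only: the wait loop at "Scheduled?" 780.7 is collapsed to a single visit per cell, and the function name egress_steps and its parameters are hypothetical.

```python
from typing import List


def egress_steps(valid: bool, cells: int, interrupt_requested: bool) -> List[str]:
    """Return the sequence of flow-diagram steps visited for one descriptor."""
    if cells < 1:
        raise ValueError("a packet contains at least one cell")
    steps = ["Doorbell Ring 781", "Descriptor Fetch 780.1", "Valid? 780.2"]
    if not valid:
        # An invalid descriptor is an error condition; processing ends.
        return steps + ["End 780.14"]
    steps += ["Program DMATX Transmit Q Fetch 780.4", "Lookup 780.5",
              "Form Packet Header 780.6"]
    for _ in range(cells):
        # One pass per cell, repeated until "Packet Complete?" answers yes.
        steps += ["Scheduled? 780.7", "Program DMATX Output Q Fetch 780.8",
                  "Form Cell Header 780.8A", "Compute Checksum, CRC 780.8B",
                  "Transmit Cells 780.9", "Packet Complete? 780.10"]
    steps += ["Modify Transmit Q Descriptor 780.11", "Interrupt Requested? 780.12"]
    if interrupt_requested:
        steps.append("Interrupt 780.13")
    steps.append("End 780.14")
    return steps
```

Calling egress_steps(True, 3, False) visits the per-cell steps three times and ends without an interrupt.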
[0328] The processing illustrated in Egress Overview 700B is
representative of the overall flow with respect to one cell,
including any special processing relating to completion of a
packet. However, according to various embodiments, such processing
may be wholly or partially overlapped for a plurality of cells. For
example, in some embodiments Descriptor Fetch 780.1 may provide a
plurality of descriptors, each pointing to a plurality of cells,
and each of the respective cells is processed according to the
remainder of the flow. In other embodiments, a first cell may
remain in ESMem 736 indefinitely once processed by Program DMATX
Transmit Q Fetch 780.4, while subsequent cells are processed
according to Program DMATX Transmit Q Fetch 780.4. In addition,
cells may be removed from ESMem 736 in a different order than
stored, according to Program DMATX Output Q Fetch 780.8. These and
all other such variations are contemplated in various
embodiments.
VIOC Ingress Operation
[0329] FIG. 7C illustrates selected aspects of ingress operation of
an embodiment of a VIOC as flow diagram Ingress Overview 700C. For
clarity, only selected details related to packet processing are
shown (message processing is similar and is omitted). Processing
begins when a cell is received from the fabric and enqueued
(according to priorities illustrated by FIG. 6D) by CSRX 711, as
indicated by "Cell Received" 791. Flow continues to "Check
Parities, CRC" 790.1, where cell-level error check computations are
performed. The error results are checked ("Data OK?" 790.2), and if
the data is incorrect ("No" 790.2N), then the error is recorded
("Log Error" 790.3) and processing for the cell is complete ("End"
790.16). If the data is correct ("Yes" 790.2Y), then processing
proceeds to "UC/MC?" 790.4.
[0330] Processing continues as "UC/MC?" 790.4 determines if the
cell is a multicast ("MC" 790.4M) or a unicast ("UC" 790.4U) type
cell. Unicast processing continues at "Enque By VNIC/Q" 790.5,
where the received cell is enqueued in a selected unicast receive
queue according to VNIC number and receive priority (or queue).
Multicast processing continues at "Lookup" 790.17, where a lookup
is performed by LE 703 (of FIG. 7A) based at least in part on the
MGID as discussed elsewhere herein in conjunction with FIG. 8B. The
lookup provides information describing the VNICs to receive the
multicast data, and the cell is enqueued accordingly ("Enqueue
Multicast" 790.18). Some embodiments implement selected VLAN
processing such as dropping the cell if the learning VLAN is
different from the destination VLAN. In one embodiment, the receive
queues are implemented within ISMem 738.
[0331] Unicast and multicast processing continue at "Descriptor
Fetched?" 790.6, where a determination is made as to whether a
receive descriptor appropriate for the cell received has been
fetched. If not ("No" 790.6N), then a descriptor is fetched by
RXDmgr 766 (of FIG. 7A) in "Fetch Descriptor" 790.7 and examined
for validity ("Fetch OK?" 790.7A). If the fetched descriptor is
invalid, or no descriptors are available, then the cell is dropped
(i.e. not received) along flow "No" 790.7N. The packet that the
cell is part of is also dropped ("Drop Packet" 790.7B), and
processing is complete at "End" 790.16. If the fetched descriptor
is valid, then flow proceeds ("Yes" 790.7Y) for receive scheduling
at "Scheduled?" 790.8. If an appropriate receive descriptor is
already available ("Yes" 790.6Y), then processing also continues at
"Scheduled?" 790.8.
[0332] Processing in "Scheduled?" 790.8 determines if the cell has
been scheduled (according to priorities illustrated in FIG. 6E),
and if not ("No" 790.8N), then processing loops back. If the cell
has been scheduled ("Yes" 790.8Y), then processing continues at
"Program DMARX DeQueue Fetch" 790.9, where DMARX 717 (of FIG. 7A)
is programmed to fetch the cell data from the shared ingress memory
and to store the cell data into local host memory according to the
fetched receive descriptor. Error checking information is then
computed ("Compute Checksum, CRC" 790.10), and for multicast cells
a check is made to determine if multicast replication is complete
("MC Finished?" 790.11). If multicast processing is not complete
("No" 790.11N), then processing returns to "Scheduled?" 790.8 to
await subsequent scheduling of the next multicast operation. If
multicast processing is complete (or the cell is a unicast cell),
then flow continues along "Yes" 790.11Y to "EOP?" 790.111. If the
cell is not an End Of Packet (EOP) cell and there is room available
for storage of additional cell data (according to the fetched
receive descriptor), then processing for the cell is complete
("End" 790.16). Otherwise, flow proceeds to "Write Completion Q
Descriptor" 790.12 to record completion of the receive descriptor.
In alternate embodiments, processing for multicast cells is
performed wholly or partially in parallel, including embodiments
where all multicast destinations for the cell are processed in
parallel.
[0333] Processing in "Write Completion Q Descriptor" 790.12 records
completion of the processing of the receive descriptor, or
consumption of the descriptor, by entering an entry on a designated
write complete queue. The entry includes packet and error status.
Then a determination is made ("Interrupt>Threshold ?" 790.13) as
to whether the number of receive events exceeds a specified
threshold. If so ("Yes" 790.13Y), then an interrupt is signaled to
the Local Processor ("Interrupt" 790.14). If the threshold has not
been reached ("No" 790.13N), then a further determination is made as
to whether a timer has expired ("Timer Expired?" 790.15). If so ("Yes"
790.15Y), then an interrupt request is sent to the Local Processor
("Interrupt" 790.14), and then processing for the cell is complete
("End" 790.16). If not ("No" 790.15N), then processing for the cell
is immediately complete ("End" 790.16).
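The threshold and timer decision described above can be sketched as a small moderation object. This is a hedged illustration: resetting the event count and restarting the timer after each interrupt are assumptions, time is passed in explicitly rather than read from a clock, and the class name InterruptModerator is hypothetical.

```python
class InterruptModerator:
    """Decide, per completion queue write, whether to interrupt the Local Processor."""

    def __init__(self, threshold: int, timeout: float) -> None:
        self.threshold = threshold    # interrupt when the event count exceeds this
        self.timeout = timeout        # otherwise interrupt once this much time has passed
        self.events = 0
        self.last_interrupt = 0.0

    def on_completion(self, now: float) -> bool:
        """Record one receive event at time `now`; return True to signal an interrupt."""
        self.events += 1
        over_threshold = self.events > self.threshold
        timer_expired = now - self.last_interrupt >= self.timeout
        if over_threshold or timer_expired:
            # Assumed behavior: both the counter and the timer restart.
            self.events = 0
            self.last_interrupt = now
            return True
        return False
```

The threshold bounds the interrupt rate under heavy load, while the timer bounds the latency of an isolated completion under light load.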
[0334] The processing illustrated in Ingress Overview 700C is
representative of the overall flow with respect to one cell,
including any required multicast processing and special processing
relating to an EOP cell or consumption of a receive descriptor.
However, according to various embodiments, such processing may be
wholly or partially overlapped for a plurality of cells. For
example, once a first cell has been processed according to "Enque
By VNIC/Q" 790.5, processing for the first cell may be suspended
indefinitely, and in the meantime a plurality of additional cells
may be received and processed up to and including "Enque By VNIC/Q"
790.5. In addition, cells may be processed according to "Program
DMARX DeQueue Fetch" 790.9 in a different order than processed
according to "Enque By VNIC/Q" 790.5. As another example,
"Scheduled?" 790.8 is conceptually performed for many (for example
all) enqueued cells on every cycle, even though only a subset of
cells is scheduled according to "Yes" 790.8Y (for example, only a
single cell may be scheduled). In some embodiments, ISMem 738 is
used to store received cells during some portion of the processing
time corresponding to "Enque By VNIC/Q" 790.5 through "Yes" 790.8Y.
As another example of overlapped processing, processing of a first
cell according to "Lookup" 790.17 may be wholly or partially
concurrent with processing of a second cell according to "Check
Parities, CRC" 790.1. These and all other such similar variations
are contemplated in various embodiments.
[0335] In some embodiments, packets are aligned on cache line
boundaries, and packets are segmented along cache line boundaries
into cells. In embodiments where local host memory may be
referenced a cache line at a time, there is a one-to-one
correspondence between local host memory cache line references and
cells. For example, during egress processing, reading the first
cache line of the packet provides data corresponding entirely to
the first cell. Reading the second cache line provides data
entirely for the second cell, and so forth. During ingress
processing, reception of the first cell of the packet provides a
full cache line of data that is also aligned with respect to the
receiving buffer, and the entire received cell may be written to
the local host memory in a single transaction. Subsequently
received cells may also be written one cell per cache line
transaction.
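The one-to-one correspondence between cache lines and cells can be illustrated with a short sketch. The cache line size is not stated above, so 64 bytes is assumed here; the names CACHE_LINE_BYTES, segment_into_cells, and write_cells are hypothetical.

```python
from typing import List

CACHE_LINE_BYTES = 64  # assumed cache line size


def segment_into_cells(packet: bytes, line: int = CACHE_LINE_BYTES) -> List[bytes]:
    """Egress: cut a cache-line-aligned packet into one cell per cache line."""
    return [packet[start:start + line] for start in range(0, len(packet), line)]


def write_cells(cells: List[bytes], buffer: bytearray,
                line: int = CACHE_LINE_BYTES) -> int:
    """Ingress: write each cell into an aligned buffer, one transaction per cell.

    Returns the number of write transactions performed.
    """
    if len(cells) * line > len(buffer):
        raise ValueError("receive buffer is too small")
    for index, cell in enumerate(cells):
        start = index * line
        buffer[start:start + len(cell)] = cell  # a single cache-line write
    return len(cells)
```

Because no cell straddles a cache line boundary, the number of memory transactions equals the number of cells in both directions.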
[0336] In some embodiments, packet-level error detection
information is computed, transmitted, and checked upon reception
irrespective of packet size. In other embodiments, if all of the
data for a packet fits in a single cell, then no packet-level error
detection information is computed or transmitted, enabling the
transmission of additional data bytes instead of the error
detection information. For example, if a two-byte CRC is used for
error detection, then two additional data bytes may be sent instead
of the CRC. In these circumstances the cell error detection
information (such as a CRC) provides error detection information
for the packet.
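The arithmetic of the single-cell optimization can be made concrete. The sketch below assumes that packet-level error detection bytes, when present, are carried in cell payload space after the packet data, and it uses a hypothetical payload size; the function name cells_and_crc_bytes is also hypothetical.

```python
import math
from typing import Tuple


def cells_and_crc_bytes(packet_len: int, cell_payload: int,
                        crc_len: int = 2) -> Tuple[int, int]:
    """Return (number of cells, packet-level CRC bytes sent) for one packet.

    A packet that fits in a single cell relies on the cell-level error
    detection alone, so no packet-level CRC bytes are sent.
    """
    if packet_len < 1 or cell_payload < 1:
        raise ValueError("lengths must be positive")
    if packet_len <= cell_payload:
        return 1, 0
    total = packet_len + crc_len
    return math.ceil(total / cell_payload), crc_len
```

With a 64-byte payload and a two-byte CRC, a 64-byte packet travels in one cell with no CRC bytes, whereas always appending the CRC would have spilled it into a second cell.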
[0337] Those of ordinary skill in the art will recognize that the
processing illustrated by the aforementioned flow diagrams is
illustrative of certain embodiments, and that other embodiments are
also possible. For example, operations may be partitioned or
ordered differently, or selected aspects of processing may be
altered. These and all other related variations are
contemplated.
TCAM/SRAM Lookup State
[0338] In their most general form, TCAM lookups may be performed
using a combination of "local" and "global" masks. Each entry in
the Tag array logically has a data field (holding a data value) and
an associated equal width local mask field (holding a local mask
value). Equivalently (and possibly physically), the Tag array may
also be described as having a data array and a corresponding mask
array. Also provided are one or more global mask registers (holding
a global mask value of the same width as the data and local mask
values) located outside the Tag array. In parallel for all entries
in the Tag array, the data value of each entry has applied to it
the associated local mask value of the entry and a selected global
mask value. The masked data value is then compared against the
search key. One or more entries in the Tag array may result in a
hit (a match with the key). A priority-encoder selects the highest
priority entry (the match at the lowest address), which is used as
an index to address the SRAM and retrieve the corresponding result
entry.
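A software model may help make the masked comparison concrete. The sketch below adopts the conventional ternary interpretation, in which a bit participates in the comparison only if it is set in both the local mask and the selected global mask; that interpretation, the sequential scan standing in for the parallel compare and priority encoder, and the names TagEntry, tcam_search, and lookup are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

ALL_ONES = (1 << 64) - 1


@dataclass
class TagEntry:
    data: int        # data value of the Tag array entry
    local_mask: int  # 1 bits participate in the comparison


def tcam_search(tags: List[TagEntry], key: int,
                global_mask: int = ALL_ONES) -> Optional[int]:
    """Return the lowest address whose unmasked bits equal the key, or None."""
    for address, entry in enumerate(tags):
        care = entry.local_mask & global_mask
        if (entry.data ^ key) & care == 0:
            return address  # lowest address wins, as a priority encoder would select
    return None


def lookup(tags: List[TagEntry], sram: List[Any], key: int,
           global_mask: int = ALL_ONES) -> Optional[Any]:
    """Use the matching Tag address as the index into the SRAM result array."""
    address = tcam_search(tags, key, global_mask)
    return None if address is None else sram[address]
```

Scanning from address zero upward returns the same entry that hardware would pick after comparing all entries in parallel, since the highest priority hit is the one at the lowest address.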
[0339] The mask values that will be used for a search are often
known well in advance and are often stable for extended periods of
operation. Accordingly, many mask values (in the local mask array
and the global mask registers) may be programmed well in advance of
the search. This permits the searches to proceed more quickly.
[0340] Generally, local mask values are required at least in part
if the use of simultaneous prioritized hits to more than one entry
is desired. Specifically, in some broadcast embodiments, local mask
values are used to establish an arbitrary-MACDA
low-priority broadcast entry for each VLAN. For the broadcast TCAM
entry, the data value includes the VLANid for the VLAN and a null
MACDA, and the MACDA-related bits of the local mask are cleared to
prevent the null MACDA field from participating in comparison with
presented keys. If there are no other entries in the TCAM to match
on the VLANid, then the multicast result (and included MGID)
corresponding to the broadcast entry will be returned. The
broadcast entry is used to flood the frame being forwarded to all
members of the VLAN. In some addressing embodiments, this is
accomplished by providing a multicast result that includes a
Multicast Group ID (MGID) that has been assigned to the VLAN. When
a MACDA on a VLAN is learned, a higher-priority (lower addressed)
MACDA-specific non-broadcast entry is created wherein the data
value generally includes the VLANid for the VLAN and the learned
MACDA, and the MACDA-related bits of the local mask are set to
enable the MACDA field to participate in the comparison with
presented keys. Subsequent searches using a key having both the
VLANid and the MACDA will cause hits to both the non-broadcast and
broadcast entries. Since the broadcast entry is created at a higher
address, the TCAM's priority encoder only returns the
MACDA-specific non-broadcast entry.
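The interplay between the per-VLAN broadcast entry and learned MACDA-specific entries can be sketched as follows. This is an illustrative model only: the key is reduced to a 12-bit VLANid above a 48-bit MACDA, learning is modeled by inserting at address zero of a Python list, and the names make_key, first_hit, and learn are hypothetical.

```python
from typing import Any, List, Optional, Tuple

MACDA_BITS = 48
MACDA_MASK = (1 << MACDA_BITS) - 1
VLAN_MASK = 0xFFF << MACDA_BITS       # only the VLANid bits participate
ALL_BITS = VLAN_MASK | MACDA_MASK     # VLANid and MACDA bits participate

# Each Tag entry is (data value, local mask); list position is the TCAM address.
Tag = Tuple[int, int]


def make_key(vlan_id: int, macda: int) -> int:
    """Form a key with the VLANid above the MACDA."""
    return ((vlan_id & 0xFFF) << MACDA_BITS) | (macda & MACDA_MASK)


def first_hit(tags: List[Tag], key: int) -> Optional[int]:
    """Return the lowest matching address, as the priority encoder would."""
    for address, (data, local_mask) in enumerate(tags):
        if (data ^ key) & local_mask == 0:
            return address
    return None


def learn(tags: List[Tag], results: List[Any], vlan_id: int,
          macda: int, result: Any) -> None:
    """Add a MACDA-specific entry at a lower address than any broadcast entry."""
    tags.insert(0, (make_key(vlan_id, macda), ALL_BITS))
    results.insert(0, result)


# Broadcast entry for VLAN 5: null MACDA, MACDA bits of the local mask cleared.
example_tags: List[Tag] = [(make_key(5, 0), VLAN_MASK)]
example_results: List[Any] = ["multicast, MGID for VLAN 5"]
```

Before a MACDA is learned, any key on VLAN 5 hits only the broadcast entry and the frame is flooded. After learning, the same key hits both entries, and the lower addressed MACDA-specific entry is the one returned.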
[0341] The following TCAM illustrations are intentionally general
to encompass a number of implementation options and variations. For
example, the use of both local and global masks is allowed for, as
are TCAM Tag-entry and SRAM result-entry organizations that support
both IVL and SVL modes of operation. The combination of local and
global masks is illustrative and not limiting. Within a given
implementation, either local masks or global masks could be
eliminated with an associated reduction in the logic associated
with the eliminated functionality. (However, elimination of local
masks generally requires performing a subsequent broadcast lookup
upon encountering a previously unlearned MACDA.) Likewise, the
combination of IVL and SVL is merely illustrative and not limiting.
One of either IVL or SVL could be eliminated with possibly
associated reductions in certain fields within the TCAM Tag-entries
and SRAM result-entries.
[0342] It will be understood that some operations could be carried
out equivalently (although generally not as quickly) through use of
just individual local mask values or just global mask values, and
likewise could be carried out through values formulated at the time
of the search (and stored into either a global mask register or the
local mask array). In particular, the selective masking of the
VLANid field within the TCAM Tag array to selectively implement one
of the IVL and SVL modes of operation could be done through use of
a global mask value or through use of the local mask values.
Whether operations are based upon the use of local mask values, the
use of global mask values, or some combination of local and global
masks, and whether a given mask value is pre-programmed or
formulated and programmed at the time of the search, will be based
upon designer preference and situational specifics in a manner that
is understood by those of ordinary skill in the art.
[0343] The following TCAM illustrations are also described in terms
of a single overall TCAM/SRAM combination. It will be understood
that the overall TCAM/SRAM may be physically implemented using a
plurality of smaller TCAM/SRAM primitives (i.e., smaller in
entry-width, number of entries, or both) arranged in ways known to
those of ordinary skill in the art to provide the desired number of
entries and entry-width of the overall TCAM/SRAM.
[0344] As previously described with respect to egress operation,
"Lookup" 780.5 (see FIG. 7B) includes a lookup in a TCAM/SRAM
coupled to a VIOC (such as VIOC 301.5 coupled to TCAM/SRAM 403.5),
as performed by LE 703. Similarly, with respect to ingress
operation, "Lookup" 790.17 (see FIG. 7C) includes a lookup in the
TCAM/SRAM. The lookup operations are performed by formulating a
lookup key, optionally selecting a global mask register, and
presenting the key and optional global mask value to the TCAM
portion of the TCAM/SRAM. A result is then produced by reading the
first matching entry (if any) as determined by the TCAM from the
SRAM portion of the TCAM/SRAM. The TCAM/SRAM is programmed
according to various provisioning, switching, and routing
functions, as described elsewhere herein.
[0345] Egress TCAM/SRAM keys, masks, and results are formulated to
provide for transparent processing of various L2 switching related
activities, and selected L3 switching and routing related
activities. The L2 and L3 operations include multicast and unicast,
with and without Link Aggregation Group (LAG) processing, and
further include VLAN processing. A lookup key is formed without
specific knowledge concerning the destination, other than the MAC
Destination Address (MACDA). In other words, the key is formed in
the same manner for multicast and unicast destinations. As
described in more detail below, the lookup result provides
information specifying the type of destination (multicast, unicast,
or unicast LAG), along with information specific to the destination
according to the destination type.
[0346] FIG. 8A illustrates selected aspects of an embodiment of an
egress lookup key and result entries as TCAM/SRAM Egress Layout
800A. The egress layout is an example of a portion of a MAC
Forwarding Information Base (MACFIB) implementation. A 64-bit
lookup key, Egress Key 801, has several fields, including: Egress
PathID 899A (two bits), Egress TableID 898A (two bits), VLANid 805
(12 bits), and MACDA 806 (48 bits). In various embodiments, the
PathID is 0x0 and the TableID is 0x0.
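As a hedged illustration, the four fields of Egress Key 801 can be packed into a single 64-bit value as sketched below. The field widths come from the description above; the bit positions (fields packed most-significant-first in the order listed) and the helper name `make_egress_key` are assumptions for the example only, since the ordering within the key is shown in FIG. 8A rather than stated in the text.

```python
def make_egress_key(vlan_id: int, macda: int,
                    path_id: int = 0x0, table_id: int = 0x0) -> int:
    """Pack PathID (2), TableID (2), VLANid (12), MACDA (48) into 64 bits.

    Assumed layout (illustrative only): PathID in bits 63-62, TableID in
    bits 61-60, VLANid in bits 59-48, MACDA in bits 47-0.
    """
    for name, value, width in (("path_id", path_id, 2),
                               ("table_id", table_id, 2),
                               ("vlan_id", vlan_id, 12),
                               ("macda", macda, 48)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return (path_id << 62) | (table_id << 60) | (vlan_id << 48) | macda
```

The widths sum to 2 + 2 + 12 + 48 = 64 bits, matching the stated key size.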
[0347] Embodiments implementing IVL formulate lookup keys including
the VLANid associated with the source (such as a VNIC or NM port),
such as VLAN Identifier 617.1 or a value derived from it, according
to embodiment. If the VLANid and the MACDA of the key match to a
TCAM entry having both the same VLANid and the same MACDA as the
key, then a non-broadcast (unicast or multicast) entry has been
identified and the corresponding SRAM result is retrieved.
[0348] In a first SVL embodiment, lookup keys and TCAM data array
values are formed with a common predetermined value (such as all
zeros) in the VLANid field of the key and in the corresponding
VLANid field of the TCAM entries. Since the TCAM entries so
programmed always match on the common predetermined VLANid,
forwarding entries learned for one VLAN are accessible by other
VLANs. If no entry matches on the common predetermined VLANid, then
the MACDA has yet to be learned, and some subsequent means must be
used to broadcast over the VLAN. The previously discussed catch-all
broadcast entry (wherein the MACDA field is masked) could still be
used, being looked-up using a key with the VLANid of the source
VLAN in a follow-up TCAM search.
[0349] A second SVL embodiment uses two global mask registers. In a
first search, lookup keys are formed with the MACDA specified by
the source, but with arbitrary values in the VLANid portion of the
key. In a first global mask register, mask bit locations
corresponding to the VLANid bits are cleared to remove the VLANid
bits of the data array from participation in the TCAM search, while
mask bit locations corresponding to the MACDA bits are set to ensure
their participation in the search. As a result, for previously
learned MACDAs, TCAM entries match on the MACDA of the key
regardless of the VLANid of the key, and forwarding entries learned
for one VLAN are accessible by other VLANs. If no entry matches,
then the MACDA has yet to be learned. In order to broadcast over
the VLAN of the source, a second lookup is performed, this time
with a key having the VLANid of the source and arbitrary values in
the MACDA portion of the key. In the second lookup, a second global
mask register is used, this time having the mask bit locations
corresponding to the VLANid set and the mask bit locations
corresponding to the MACDA bits cleared. In the second SVL
embodiment, the VLANid field is not strictly required in the lookup
key for SVL-only implementations. However, if the VLANid field is
not present in the TCAM, then in order to implement broadcast when
the MACDA is yet to be learned by the TCAM, the VLANid to broadcast
address mapping must be resolved using other logic.
[0350] A third SVL embodiment uses local masks. For non-broadcast
entries, the mask bit locations in the local mask array
corresponding to the VLANid field are cleared to remove the VLANid
bits of the data array from participation in the TCAM search (the
mask bit locations in the local mask array corresponding to the
MACDA field are set). As before, for previously learned MACDAs,
TCAM entries match on the MACDA of the key regardless of the VLANid
of the key, and forwarding entries learned for one VLAN are
accessible by other VLANs. A broadcast entry as described above,
which reverses the local masking between the VLANid field and the
MACDA field, would within the same lookup still provide the
broadcast match if the MACDA has yet to be learned.
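The third SVL embodiment can be sketched, under stated assumptions, as a single first-match search over entries that carry their own local masks. The bit positions, the helper names, and the placement of broadcast entries after learned entries (so that a learned MACDA takes priority) are assumptions of this example; PathID and TableID bits are omitted for brevity.

```python
VLAN_SHIFT = 48                      # assumed position of the VLANid field
MAC_FIELD = (1 << 48) - 1            # bits of the 48-bit MACDA field
VLAN_FIELD = 0xFFF << VLAN_SHIFT     # bits of the 12-bit VLANid field


def key_of(vlan_id: int, macda: int) -> int:
    return (vlan_id << VLAN_SHIFT) | macda


def learned_entry(vlan_id: int, macda: int):
    # Non-broadcast entry: VLANid mask bits cleared, MACDA mask bits set.
    return (key_of(vlan_id, macda), MAC_FIELD)


def broadcast_entry(vlan_id: int):
    # Broadcast entry: masking reversed, so only the VLANid is compared.
    return (key_of(vlan_id, 0), VLAN_FIELD)


def first_match(entries, key: int):
    """Return the index of the first entry matching under its local mask."""
    for index, (data, local_mask) in enumerate(entries):
        if (key ^ data) & local_mask == 0:
            return index
    return None
```

With one learned entry followed by per-VLAN broadcast entries, a key from any VLAN finds the learned MACDA, while a key with an unlearned MACDA falls through to the broadcast entry of the source VLAN.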
[0351] The MACDA is a value provided in the packet (or message)
header included in the buffer image formed in the Local Processor
Memory, or the MACDA is derived in part from the header. Since the
destination may be one of several types, the format of the 64-bit
result returned varies accordingly, including: multicast, unicast,
and unicast LAG formats, as illustrated by Multicast Result 802,
Unicast Result 803, and Unicast (LAG) Result 804, respectively.
[0352] Multicast Result 802 has several fields, including:
Multicast Bit (Multicast) 807.M (one bit), VLANid (Multicast) 808.M
(12 bits), and Egress MGID 809 (16 bits). The multicast bit is
asserted if the result corresponds to a multicast destination, and
deasserted otherwise. As this is a multicast result, the multicast
bit is asserted. The VLANid identifies the VLAN membership of the
destination. The MGID identifies the destination multicast group,
and may be associated with a VLAN broadcast group or an IP
broadcast group. Subsequent processing uses the MGID to replicate
information to one or more destinations, as determined by the
identified multicast group.
[0353] Under SVL, in order to enforce inter-VLAN isolation, if the
destination VLAN identifier does not match the source VLAN
identifier, then the egress data is discarded. The VLANid field is
not strictly required in the multicast, unicast, or LAG results of
IVL-only implementations. Under IVL, TCAM matches (and subsequent
result retrievals) are predicated upon the destination residing
within the VLAN of the source, thus checking the VLANid field of
the result is superfluous.
[0354] Unicast Result 803 has several fields, including: Multicast
Bit (Unicast) 807.U (one bit), LAG Bit (Unicast) 810.U (one bit),
VLANid (Unicast) 808.U (12 bits), DstFabAddr 811 (eight bits),
DstSubAddr 812 (four bits), Egress DstQ 813 (four bits), and
DstLIFID (Unicast) 814.U (12 bits). The multicast bit is
deasserted to indicate the result is a unicast result. The LAG bit
is deasserted to indicate the result is not a LAG result. VLANid
(Unicast) 808.U is identical in format and function to VLANid
(Multicast) 808.M.
[0355] DstFabAddr 811 identifies the destination fabric port
address (typically associated with a slot having an inserted
pluggable module). DstSubAddr 812 identifies a sub-address
distinguishing one of a plurality of sub-destinations associated
with the destination fabric port address. In embodiments including
a VIOC implementing VNICs, DstSubAddr 812 typically identifies
either a) one of the plurality of VNICs implemented in the VIOC at
the destination, or b) one of the plurality of network ports of a
multi-ported NM. DstQ 813 typically identifies a packet receive
queue associated with the identified VNIC.
[0356] Certain L3 processing operations (such as port filtering)
are implemented in some embodiments according to DeSTination
Logical InterFace IDentifiers (DstLIFIDs). DstLIFIDs are typically
global, and may be used by software or hardware components (such as
NMs), according to embodiment. DstLIFID (Unicast) 814.U is a
DstLIFID associated with the destination identified by MACDA
806.
[0357] Unicast (LAG) Result 804 has several fields, including:
Multicast Bit (LAG) 807.UL (one bit), LAG Bit (LAG) 810.UL (one
bit), VLANid (LAG) 808.UL (12 bits), LagID 815 (eight bits), and
DstLIFID (LAG) 814.UL (16 bits). The multicast bit is deasserted to
indicate the result is a unicast result. The LAG bit is asserted to
indicate the result is a LAG result. VLANid (LAG) 808.UL is
identical in format and function to VLANid (Multicast) 808.M. LagID
815 identifies the LAG the destination is associated with to enable
load balancing, failover, and other related operations with respect
to the identified LAG. DstLIFID (LAG) 814.UL is identical in format
and function to DstLIFID (Unicast) 814.U.
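For illustration, the way the multicast and LAG bits select among the three result formats, together with the SVL isolation check on the result VLANid, might be expressed as follows. The class and function names are hypothetical, and the result is modelled as already-decoded fields because the bit positions within the 64-bit result are not stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressResult:
    """Decoded view of the flag and VLAN fields common to the result formats."""
    multicast: bool  # Multicast Bit 807
    lag: bool        # LAG Bit 810 (listed only for the unicast formats)
    vlan_id: int     # VLANid 808


def destination_type(result: EgressResult) -> str:
    """Classify a result as multicast, unicast, or unicast LAG."""
    if result.multicast:
        return "multicast"
    return "unicast LAG" if result.lag else "unicast"


def svl_permits(result: EgressResult, source_vlan_id: int) -> bool:
    """Under SVL, egress data is discarded unless the VLAN identifiers match."""
    return result.vlan_id == source_vlan_id
```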
[0358] Subsequent processing of Unicast Result 803 and Unicast
(LAG) Result 804 provides portions of the lookup result, or
information derived in part from the lookup result, for use in
forming selected egress cell headers. For example, various
combinations of DstFabAddr 811, DstSubAddr 812, Egress DstQ 813,
and DstLIFID (Unicast) 814.U (or data derived from them, according
to embodiment) are included in selected cell headers formulated
during data transmission, according to embodiment. Similarly,
various combinations of LagID 815 and DstLIFID (LAG) 814.UL (or
data derived from them, according to embodiment) are included in
selected cell headers during data transmission, according to
embodiment.
[0359] Providing destination information such as the destination
fabric port address, sub-address (or VNIC identifier), destination
queue, and destination logical interface in the lookup result
enables transparent L2 and selected L3 operations with respect to
processes sending data. The sending processes address data by
MACDA, and are not aware of multicast, unicast, and LAG properties
associated with the destination. The lookup destination information
further enables transparent management of bandwidth and other
related resources by agents other than the Local Processor process
sending the data. Such agents include management, controlplane, and
load balancing processes executing elsewhere.
[0360] FIG. 8B illustrates selected aspects of an embodiment of an
ingress lookup key and result entry, as TCAM/SRAM Ingress Layout
800B. The illustrated ingress layout embodiment is an example of an
implementation of an MGID table. A 64-bit lookup key, Ingress Key
821, has several fields, including: Ingress PathID 899B (two bits),
Ingress TableID 898B (two bits), Ingress Mask 823 (44 bits),
Multicast Key Bit 897 (one bit), and Ingress MGID 824 (15 bits). In
various embodiments, the PathID is 0x1 and the TableID is 0x0.
[0361] To enable use of the same TCAM on ingress and egress, the
same size key is used for both ingress and egress searches. At a
physical level the TCAM operates identically for ingress and egress
searches, comparing each presented key in parallel with all of the
stored data values as masked by the local mask values and the
global mask value, as described previously. As discussed in further
detail below, the PathID and TableID bits are commonly laid out
between the ingress and egress entries. These bits participate in
the TCAM comparisons, allowing if so desired the ingress and egress
entries, and entries from multiple tables, to be co-mingled in the
TCAM while remaining logically distinct subsets. Beyond the PathID
and TableID bits, ingress searches only require the 16 bits
corresponding to the Ingress MGID 824 bits and the Multicast Key
Bit 897. Multicast Key Bit 897 is asserted to indicate the lookup
is a multicast type search. Ingress MGID 824 is taken directly from an
MGID field included in the received cell header, or is derived from
the header, according to embodiment.
[0362] The remaining 44 bit positions of the common key layout,
denoted by Ingress Mask 823, are null place-holders, being neither
required nor logically used on ingress lookups. In a first ingress
embodiment, for each ingress entry, the mask bit locations within
the local mask array corresponding to Ingress Mask 823 are cleared
to ensure that the bit locations within the data array
corresponding to Ingress Mask 823 do not participate in ingress
searches. In a second ingress embodiment, the corresponding mask
bits within a global mask register are cleared to accomplish the
same result.
[0363] Ingress Result 822 is 64 bits and has several fields,
including: Ingress DstQ 825 (four bits) and VNIC BitMask 826 (16
bits). In some embodiments, Ingress DstQ 825 identifies one of a
plurality of multicast packet receive queues for insertion of the
received data into (see the discussion relating to FIG. 6D). In
embodiments including a VIOC implementing VNICs, VNIC BitMask 826
identifies destination VNICs for replication of the received data.
Typically there is a one-to-one correspondence between asserted
bits in VNIC BitMask 826 and VNICs that are to receive the
multicast data.
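As a small illustrative sketch, the destination VNICs can be recovered from VNIC BitMask 826 by testing each bit. The mapping of bit position i to VNIC index i is an assumption made for the example; the text states only that the correspondence between asserted bits and VNICs is typically one-to-one.

```python
from typing import List


def vnics_from_bitmask(bitmask: int, vnic_count: int = 16) -> List[int]:
    """Return the indices of VNICs whose bit is asserted in the bitmask.

    Assumes (for illustration) that bit i of the mask selects VNIC i.
    """
    return [i for i in range(vnic_count) if (bitmask >> i) & 1]
```

For example, a mask with bits 0, 2, and 15 asserted selects three VNICs for replication of the received multicast data.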
[0364] Egress PathID 899A and Ingress PathID 899B are arranged to
be in the same location in the egress and ingress lookup keys,
respectively. An egress path lookup is identified by the value 0x0
and an ingress path lookup is identified by the value 0x1, thus
enabling selected embodiments to include egress and ingress lookup
information in a shared TCAM/SRAM (such as TCAM/SRAM 403.4). Other
embodiments may provide separate TCAM/SRAM resources for egress and
ingress processing. Egress TableID 898A and Ingress TableID 898B
are in the same location and may be used to specify one of a
plurality of tables to facilitate additional lookup information for
use in other scenarios.
[0365] Those of ordinary skill in the art will recognize that the
egress and ingress key and result layouts described are example
embodiments only, and that there are any number of possible
arrangements of keys and results that provide similar information.
For example, the widths and encodings of fields may vary, selected
fields may be omitted, and additional fields may be added.
Server Provisioning and Management
[0366] An ES system provides one or more provisioned servers in
conjunction with a provisioned L2/L3 switch and associated network
topology. Each of the provisioned servers and the provisioned
switch and network include capabilities as identified by a
corresponding set of specifications and attributes, according to
various embodiments. Typically the specifications (or constraints)
and attributes are specified with an SCF (see the SCF and Related
Tasks section, elsewhere herein).
[0367] An ES system may be provisioned into a wide range of server
combinations according to needed processing and I/O capabilities.
For example, servers may be provisioned according to various
illustrative application usage scenarios described elsewhere herein
including: a Unix server, an I/O intensive server, a data-center
tier-consolidation server, and an enhanced high availability
server. Each of these servers may include distinct compute,
storage, and networking performance. Provisioned servers may be
managed similarly to conventional servers, including operations such
as booting and shutting down (see the server operational states
section, elsewhere herein).
[0368] An ES system may also be provisioned to configure a wide
range of networking capabilities and logical switch topologies
within the ES system (i.e., internal to the chassis of the ES
system). For example, the networking may be provisioned such that a
logical L2/L3 switch provides L2 and L3 forwarding for VNICs of
provisioned servers within the ES system and other network
interfaces external to the ES system. Any of the logical ports of
the L2/L3 switch may be configured to be part of a VLAN and
multiple simultaneous VLANs are possible. A provisioned server may
optionally be provided with a dedicated (a.k.a. "pinned") network
port for direct non-switched coupling to an external network.
Another option is the implementation of one or more LAGs, where
multiple physical network ports are aggregated to form one logical
port of the L2/L3 switch.
[0369] FIG. 9A illustrates a Hardware Resources View 900A of an
embodiment of an ES system. Provisioned servers and an associated
network and switch complex are formed by assigning hardware
resources from a collection of available hardware resources (such
as any or all of the elements of Hardware Resources View 900A) and
then programming configuration and management state associated with
the assigned hardware resources. FIG. 9B illustrates a Provisioned
Servers and Switch View 900B of an embodiment of an ES system, and
conceptually represents the result of provisioning several servers
and network functionality from the hardware elements of Hardware
Resources View 900A.
[0370] More specifically, Hardware Resources View 900A illustrates
one embodiment having a Primary SFM 180A, a Primary SCM 140A, a
first NM 130A, a second NM 130B, a first PMM 150A, and a second PMM
150B. NM 130A and NM 130B each provide a plurality of network ports
for interfacing with networks external to the ES system and further
adapt those ports to couple with the cell-based Primary SFM 180A.
As illustrated, the ports of NM 130A are coupled to the Internet
910 while the ports of NM 130B are coupled to Data Center Network
920. It will be understood that this configuration is merely
illustrative and the ES system may be configured for use with a
wide range of external network topologies. Each NM has an
associated fabric address on the Primary SFM 180A and each network
port of the NMs has an associated fabric sub-address.
[0371] Each PMM has two 2-way SMP Processing Complexes that can be
optionally coupled (via configurable couplings 990A and 990B
respectively) as a single 4-way SMP Processing Complex. (Each of
couplings 990A and 990B represents coupling pair 460.5X and 460.5Y
of FIG. 4B.) Additionally, couplings 179A and 179B may be
configured in conjunction with CSFI 170 to couple the two PMMs
together as a single 8-way SMP Processing Complex. In an
illustrative scenario, 2-way and 4-way configurations have been
chosen and thus couplings 179A and 179B are configured as inactive
(as suggested by dashed lines, rather than solid lines). PMM 150A
is configured such that coupling 990A is not active (as suggested
by dashed lines), facilitating the formation of 2-way SMP
Processing Complex 901.2 and 2-way SMP Processing Complex 901.1.
PMM 150B is configured such that coupling 990B is active (as
suggested by solid lines), facilitating the formation of 4-way SMP
Processing Complex 901.0. It will be understood that since the two
PMMs are physically identical, the illustrated scenario is an
arbitrary choice. Both PMMs can just as easily be configured in any
combination of 2-way and 4-way SMP Processing Complexes (up to four
2-ways, one 4-way and up to two 2-ways, up to two 4-ways, or one
8-way). In this way, the ES system provides for an easily scalable
number of SMP processor-ways from which to configure physical
partitions, which are the basis for provisioned servers.
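The permitted combinations listed above for the illustrated two-PMM embodiment can be checked with a short sketch. It assumes, for illustration, that the only constraint is the number of 2-way SMP Processing Complexes consumed (one per 2-way partition, two per 4-way, four per 8-way) out of the four available; the function name is hypothetical.

```python
TOTAL_2WAY_COMPLEXES = 4  # two PMMs, each with two 2-way complexes


def is_valid_partitioning(ways) -> bool:
    """Check a list of partition sizes (each 2, 4, or 8) against capacity."""
    complexes_needed = 0
    for size in ways:
        if size not in (2, 4, 8):
            return False
        complexes_needed += size // 2  # each 2-way complex supplies two ways
    return complexes_needed <= TOTAL_2WAY_COMPLEXES
```

Because a 4-way partition occupies one whole PMM and an 8-way partition occupies both, any list that passes this count can be placed on the two PMMs.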
[0372] Each 2-way SMP Processing Complex is associated with a VIOC
that adapts the Complex to the cell-based Primary SFM 180A. While
in FIG. 9A each VIOC is represented by only one VNIC, in one
illustrative embodiment each VIOC includes 16 VNICs. Each VIOC has
an associated fabric address on the Primary SFM 180A and each VNIC
has an associated fabric sub-address. In conjunction with
appropriate device drivers, each VNIC appears to the operating
software on the SMP Complex as a high-performance Ethernet
compatible NIC. Each VNIC can be selectively enabled, thus any
subset of the 16 VNICs may be provisioned for use by the associated
2-way SMP Processing Complex. In this way, the ES system provides
easily scalable virtualized I/O services to the provisioned
servers. In one embodiment, 2-way, 4-way, and 8-way physical
partitions are envisioned that make use of the 2-way SMP Processing
Complex and its associated VIOC and VNICs as an underlying
primitive. Accordingly, each of up to four 2-way physical
partitions consists of a 2-way SMP Processing Complex and up to 16
VNICs, each of up to two 4-way physical partitions consists of two
coupled 2-way SMP Processing Complexes and up to 32 VNICs (up to 16
VNICs in each of two groups), and an 8-way physical partition
consists of four coupled 2-way SMP Processing Complexes and up to
64 VNICs (up to 16 VNICs in each of four groups). As illustrated,
physical partition P.sub.1 201 consists of 2-way SMP Processing
Complex 901.2 and VNIC 154A.1, physical partition P.sub.2 202
consists of 2-way SMP Processing Complex 901.1 and VNIC 154A'.1,
and physical partition P.sub.3 203 consists of 4-way SMP Processing
Complex 901.0 and VNICs 154B.1 and 154B'.1.
[0373] The fabric infrastructure of FIG. 9A is not seen by the
software operating on the SMP Complexes or by external sub-systems.
These entities need not be concerned with (and in fact are isolated
from) knowledge about the fabric addresses and fabric
sub-addresses. Software and external sub-systems operate
conventionally, communicating through packets and messages over
Ethernet frames using either Ethernet-style MACDAs or IP
addressing. The SCM 140A maintains separate master L2 and L3 FIBs
(an L2 FIB and a separate L3 FIB). The L2 FIB maps Ethernet-style
MACDAs to fabric addresses and sub-addresses and the L3 FIB maps
destination IP addresses to fabric addresses and sub-addresses. The
L2 and L3 FIBs are initialized and updated with mappings for the
provisioned hardware resources internal to the ES system and other
mappings are learned over time based on network traffic
processed.
[0374] Each NM and VIOC makes L2 forwarding decisions for packets
they receive (via the network ports of the NM and via the SMP
Complex associated with each VIOC) by referencing a local copy of
the L2 FIB. That is, each NM and VIOC does a local L2 FIB lookup on
the MACDA of each packet received (packets heading toward the SFM)
and determines the fabric address and sub-address within the ES
system where the packet should be delivered. The NM or VIOC then
provides a fabric-addressed cellified version of the packet to the
Primary SFM 180A, which delivers the cellified packet to the module
specified by the fabric address (PMM 150A, PMM 150B, NM 130A, or NM
130B). In turn, each module delivers the reconstructed packet to
the network port or VNIC specified by the fabric sub-address.
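The two-stage delivery just described (fabric address selects the module, fabric sub-address selects the port or VNIC within it) can be sketched as below. The dictionary-based FIB, the `FabricDest` type, and the function name are assumptions for illustration; in the described system the lookup is performed in TCAM/SRAM and the packet is cellified for transit.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class FabricDest(NamedTuple):
    fabric_addr: int  # selects the destination module on the SFM
    sub_addr: int     # selects a network port or VNIC within that module


def deliver(l2_fib: Dict[str, FabricDest],
            modules: Dict[int, Tuple[str, Dict[int, str]]],
            macda: str) -> Optional[Tuple[str, str]]:
    """Resolve a MACDA to (module name, endpoint name), or None if unknown."""
    dest = l2_fib.get(macda)
    if dest is None:
        return None  # unlearned MACDA; broadcast handling is not modelled here
    module_name, endpoints = modules[dest.fabric_addr]  # step done by the SFM
    return module_name, endpoints[dest.sub_addr]        # step done by the module
```

The sending side supplies only the MACDA; the fabric address and sub-address never appear outside the FIB, which mirrors the isolation of software from fabric addressing.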
[0375] As detailed elsewhere herein, for multicast packets
(including packets broadcast on VLANs), the packets are replicated
as required both prior to fabric insertion (for each fabric address
in the multicast group) and after fabric egress (for each fabric
sub-address in the multicast group). In one embodiment, the
multicast to fabric sub-address mapping is determined via multicast
group lookups in the L2 FIBs.
[0376] Provisioned Servers and Switch View 900B (of FIG. 9B)
illustrates three example resultant provisioned servers as
Provisioned Server PS.sub.2 902.2, Provisioned Server PS.sub.1
902.1, and Provisioned Server PS.sub.0 902.0. These correspond
respectively to physical partition P.sub.1 201, physical partition
P.sub.2 202, and physical partition P.sub.3 203, of the Hardware
Resources View 900A of FIG. 9A. Each provisioned server includes
compute and I/O resources. For example, Provisioned Server PS.sub.1
902.1 includes 2-way SMP Processing Complex 901.1 and network
connectivity via NIC Function 903.1. Provisioned Server PS.sub.0
902.0 includes 4-way SMP Processing Complex 901.0 and network
connectivity via NIC Function 903.0A and NIC Function 903.0B.
[0377] In accordance with the above described L2 forwarding
decision functionality of the NMs and VIOCs, the network ports and
VNICs readily lend themselves to the logical view (of FIG. 9B) that
they are coupled to the ports of an L2 switch (exemplified by L2/L3
Switch 930). (While in the hardware view of FIG. 9A, the L2 switch
forwarding decisions are in fact performed within the NMs and
VIOCs.) Switch 930 provides selective isolation between the logical
network segments coupled to each switch port. Switch 930 forwards
packets between the switch ports (network segments) only when
warranted to reach a known MACDA on an otherwise isolated segment
(or when the MACDA is unknown and thus can only be reached via a
broadcast on the otherwise isolated segments).
[0378] As illustrated, Switch 930 provides its selective
isolation/forwarding functionality to resources both within and
external to the ES system. For example, packets originating in Data
Center Network 920 are selectively isolated-from/forwarded-to
destinations associated with Internet 910, Provisioned Server
PS.sub.1 902.1, and Provisioned Server PS.sub.0 902.0, as warranted
by the specified MACDA. Unicast packets exchanged solely between
two end-point nodes are not observable by other nodes either
inside or outside the ES system.
[0379] VLANs are a common networking administration tool to
selectively isolate or couple multiple network segments for
collaboration, security, and performance motives. A common VLAN
implementation is the so-called port-based VLAN, wherein each
logical switch-port is defined to be associated with a particular
VLAN. Switch ports (and associated network segments) belonging to
the same VLAN are logically coupled for forwarding and broadcast
events (they are part of the same broadcast domain), while switch
ports (and associated network segments) belonging to different
VLANs are L2 isolated for all events (they are in different
broadcast domains).
[0380] The L2 FIB lookup architecture of the ES system has integral
support for port-based VLANs. In the logical view of FIG. 9B,
port-based VLANs are defined by associating each of the
switch-ports of Switch 930 with a particular VLAN. In the hardware
resource view of FIG. 9A, SCM 140A has a VLAN Manager that
associates each NM port and each VNIC with a particular VLAN.
[0381] Unless expressly identified as a member of another VLAN,
each NM port and VNIC in the system is by default a member of the
VLAN known as VLAN1. As illustrated in FIG. 9B, a subset of the
switch-ports of Switch 930 are expressly associated with VLAN 905
(a different VLAN than VLAN1). Specifically, those switch-ports
associated with interconnect 937, 938, and 940, are members of VLAN
905. Switch ports and associated network segments within VLAN1 are
selectively coupled for forwarding and broadcast events as
warranted. Network segments (switch ports) within VLAN 905 are
similarly selectively coupled for forwarding and broadcast events
as warranted. From a strict L2 view, network segments within VLAN1
are never coupled to network segments within VLAN 905. In effect,
the two VLANs function as though each were topographically
segregated including having two separate switches (one for each
VLAN).
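The default-membership rule and the resulting broadcast domains can be illustrated with the following sketch. The port labels are placeholders: "p937", "p938", and "p940" stand for the switch-ports associated with interconnects 937, 938, and 940, while "pA" and "pB" are hypothetical ports left in the default VLAN. The function names are likewise assumptions for the example.

```python
DEFAULT_VLAN = 1  # VLAN1: membership unless expressly assigned elsewhere


def vlan_of(port: str, assignments: dict) -> int:
    """Return the VLAN of a port, falling back to the default VLAN."""
    return assignments.get(port, DEFAULT_VLAN)


def broadcast_domain(port: str, all_ports, assignments: dict) -> list:
    """Return the other ports that share a broadcast domain with `port`."""
    vlan = vlan_of(port, assignments)
    return sorted(p for p in all_ports
                  if p != port and vlan_of(p, assignments) == vlan)


# Example data: three ports expressly placed in VLAN 905, two left in VLAN1.
ALL_PORTS = ["p937", "p938", "p940", "pA", "pB"]
ASSIGNMENTS = {"p937": 905, "p938": 905, "p940": 905}
```

With this data, a broadcast entering on "p937" reaches only "p938" and "p940", and a broadcast entering on "pA" reaches only "pB".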
[0382] Provisioned Server PS.sub.2 902.2 includes 2-way SMP
Processing Complex 901.2 and network connectivity as illustrated by
(virtualized) NIC Function 903.2. In an example of a special
networking option, VNIC 154A.1 is "pinned" to the port of NM 130A
associated with interconnect 131. That is, the port of NM 130A
associated with interconnect 131 has been dedicated to data traffic
to and from VNIC 154A.1. More specifically, data traffic coming
into the dedicated network port goes only to VNIC 154A.1 and data
traffic coming from VNIC 154A.1 goes only to the dedicated network
port. This creates the logical view shown in FIG. 9B, where NIC
903.2 appears to be coupled directly to Internet 910 via
interconnect 931, without use of L2/L3 Switch 930.
[0383] As described elsewhere herein, provisioned servers may
include OLB and FCM resources (not shown explicitly in FIGS. 9A and
9B). In contrast to PMMs, where granularity of allocation is
typically by physical partition (such as 2-way, 4-way, or 8-way),
OLB and FCM allocation granularity is by number, bandwidth
capacity, and relative queuing priorities of VNICs implemented in
VIOCs included on OLBs and FCMs. A first level of resource
partitioning between provisioned servers may be accomplished by
allocating appropriate numbers of VNICs to each of the provisioned
servers (a greater number of VNICs generally corresponding to
larger allocations). For example, a first provisioned server may be
allocated a single first VNIC of an OLB, while a second provisioned
server may be allocated second, third, and fourth VNICs of the OLB.
In some embodiments, VNICs implement various combinations of
minimum and maximum bandwidth, providing a second level of resource
control, such that the first VNIC may be allocated a first maximum
bandwidth and the second, third, and fourth VNICs may be allocated
respective second, third, and fourth maximum bandwidths. A third
level of resource sharing control may be implemented by proper
variation of VNIC queuing priorities, as described elsewhere
herein. FCM resource partitioning may be accomplished in the same
manner, by assigning proper number, bandwidth capacity, and
relative queuing priorities of VNICs implemented by VIOCs on
FCMs.
[0384] Provisioned servers are logically isolated from each other.
For example, each provisioned server may be characterized by a
group of VNICs corresponding to resources allocated to the server.
By assigning each group of VNICs to distinct VLANs, at least one
VLAN per server, each of the provisioned servers remains
completely isolated from the others, even though multiple servers
may be using resources from the same module (such as an OLB or
FCM). For example, first and second provisioned servers may be
using a shared OLB via respective first and second VNICs. If the
first and second VNICs are assigned respectively to distinct first
and second VLANs, then the provisioned servers are isolated, even
though both are using the shared OLB. Similar isolation may be
provided when provisioned servers share an FCM, by associating the
groups of VNICs to distinct VLANs.
Link Aggregation
[0385] Link aggregation provides a way to linearly scale
connectivity bandwidth and also offers reliability and availability
benefits. The L2 FIB lookup architecture of the ES system supports
link aggregation. The NM 130A ports associated with interconnect
133 and 134 in FIG. 9A (interconnect 933 and 934 in FIG. 9B) are
aggregated to form one logical port (with double the bandwidth)
illustrated in FIG. 9B as LAG 950. While not expressly illustrated,
link aggregation across multiple NMs is also possible. Load
balancing processing is performed in conjunction with the L2 FIB
lookup architecture to distribute the traffic quasi-evenly over the
ports that comprise the aggregate. In some embodiments, NM ports
that are aggregated must be of the same media type.
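The text does not specify how traffic is distributed over the member ports of a LAG. One common technique, shown here purely as an assumed example and not as the method of the described system, is to hash per-flow header fields and use the hash to select a member port, which keeps each flow on a single port while spreading distinct flows roughly evenly. The function name and the choice of CRC-32 over the source and destination MAC addresses are assumptions.

```python
import zlib
from typing import Sequence


def pick_lag_port(macsa: bytes, macda: bytes,
                  member_ports: Sequence[int]) -> int:
    """Select a LAG member port from a hash of the MAC address pair."""
    if not member_ports:
        raise ValueError("a LAG needs at least one member port")
    flow_hash = zlib.crc32(macsa + macda)  # same flow always hashes the same
    return member_ports[flow_hash % len(member_ports)]
```

For the aggregate illustrated as LAG 950, `member_ports` would hold identifiers for the two NM 130A ports associated with interconnects 133 and 134.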
Control and Management Processes
[0386] FIG. 9C illustrates an operational view of selected aspects
of various system control and system management SW and processes
(including provisioning) in an embodiment of an ES system. Many
provisioning functions store, access, and communicate data using
abstractions and representations derived from the Common
Information Model (CIM), as described by CIM Schema v2.8.2 from
http://www.dmtf.org, herein incorporated by reference for all
purposes, for example. Selected command and result communications
use abstractions and representations derived from the Common Object
Request Broker Architecture (CORBA), as described by CORBA: Core
Specification Version 3.0.3, March 2004, from http://www.omg.org,
herein incorporated by reference for all purposes, for example.
[0387] SW layers (Platform Manager 531, and Enterprise Manager 530,
of FIG. 5B) conceptually surround HW core ES1 110A (FIG. 1A). As
illustrated, Platform Manager 531 is representative of not only
Platform Manager 531 but also the SW modules operating in close
cooperation with the Platform Manager. Such modules may include CLI
532, CM 533, Query Engine 534, Repository Manager 535, VIOC Manager
536, Interface Manager 537, L2 FDB Manager 538, VLAN Manager 539,
and Other Management Code 540 (see FIG. 5B), according to
embodiment. The SW layers execute using resources included in the
HW core (such as Primary SCM 140A), and provide system-level
control and management of the HW core. Persistent state relating to
these operations is maintained in CIM-compatible formats in two
repositories, Primary CIM Instance Repository 917.P and Secondary
CIM Instance Repository 917.S. Non-volatile storage for these
repositories may be included in various elements of Flash memory,
Magnetic Disk memory, and Optical Disk memory, according to
embodiment, and interfaced to various elements of ES1 110A (such as
SCMs, PMMs, or OLBs), or directly included in such elements, also
according to embodiment. In some embodiments, the secondary
repository is not specifically stored in persistent storage, but is
stored as data structures in portions of processor main memory
(such as DRAM and cache memory), although OS related paging
operations may swap memory pages including portions of the
repository to disk. In some embodiments, the secondary repository
is not present, and accesses to the secondary repository are
replaced with accesses to the primary repository.
[0388] Enterprise Manager 530 provides an interface for client GUIs
(such as GUI 914.1 and GUI 914.2) and maintains Secondary CIM
Instance Repository 917.S. Platform Manager 531 provides an
interface for remote shell window CLIs (such as CLI 911.1 and CLI
911.2), and maintains Primary CIM Instance Repository 917.P. The
Enterprise and Platform Managers cooperate to keep the Primary and
Secondary Repositories synchronized. Communication with Client
GUIs and remote shell CLIs is generally via CIM and CORBA standard
representations (or similar techniques), as shown by CIM
Communication 1 916.1, CIM Communication 2 916.2, CORBA
Communication 1 916.3, and CORBA Communication 2 916.4.
[0389] GUIs enable performance of various system management and
configuration control operations by system management personnel,
including various aspects of HW and SW operation. Typically GUIs
are provided via network-based Java clients (such as Client 1 913.1
and Client 2 913.2) executing on PCs, Workstations, or other
similar computing elements. The clients include interface functions
(such as Interface 915.1 and Interface 915.2) to facilitate
processing of GUI commands and display of GUI data, as well as
communication of commands and data. GUIs generally operate on
managed objects (such as provisioned servers), and typically
independently maintain state information about the managed objects
(i.e. the GUIs are "stateful"). GUIs update in real time as status
of managed objects changes in real time. Communications between the
GUIs and the Enterprise Manager (such as CIM Communication 1 916.1
and CIM Communication 2 916.2) may be transported via any
combination of WAN (including the Internet), MAN, LAN, or a direct
connection to any compatible communication interface provided by
ES1 110A, according to various embodiments. For example,
communications between GUIs and the Enterprise Manager may be
coupled via an Ethernet port provided by one of NMs 130, or by
Management I/O 412 (FIG. 4A), according to embodiment.
[0390] CLIs enable all or any subset of system management and
configuration control operations available via GUIs, according to
embodiment. In some embodiments, CLIs enable somewhat limited
functionality with respect to the GUIs. CLIs are typically provided
via network-based text-oriented command shells (such as Shell 1
910.1 and Shell 2 910.2) executing on PCs, Workstations, or other
similar computing elements. The shells and related SW include
interface functions (such as Interface 912.1 and Interface 912.2)
similar in operation to the client interface functions. CLIs are
typically stateless, relying on the Platform Manager 531 to manage
objects on their behalves. The CLIs send commands to the Platform
Manager for translation into operations on managed objects (such as
provisioned servers). Communications between the CLIs and the
Platform Manager (such as CORBA Communication 1 916.3 and CORBA
Communication 2 916.4) may be transported via any of the mechanisms
provided for the communications between the clients and the
Enterprise Manager. In some embodiments, CLIs, low-level services
supporting CLIs, or both, are provided by SW executing on resources
of ES1 110A, such as CLI 532 (FIG. 5B) executing on Primary SCM
140A.
[0391] One or more GUI sessions may be simultaneously active and in
communication with the Enterprise Manager, receiving data and
providing commands in real time. The Enterprise Manager updates
Secondary CIM Instance Repository 917.S according to the commands
received from all of the GUI sessions. Similarly, one or more CLI
sessions may be simultaneously active and in communication with the
Platform Manager, receiving data and providing commands in real
time. The Platform Manager updates Primary CIM Instance Repository
917.P according to the commands received from all of the CLI
sessions. Any number of GUI sessions and any number of CLI sessions
may also be active concurrently, and the Enterprise and Platform
Managers receive and process the respective commands, synchronizing
the two repositories as necessary.
[0392] In some embodiments, GUIs and CLIs provide an integrated
view of processing and networking functions available in an ES
system. In other embodiments, GUIs and CLIs provide a "legacy" view
of logically separate elements including switches, routers,
accelerators for Layer-4 to Layer-7 processing (such as SSL
accelerators), management servers, and enterprise servers. In still
other embodiments, GUIs and CLIs providing integrated and legacy
views may be operated simultaneously.
[0393] Functions performed by Platform Manager 531 (and closely
cooperating SW modules) include HW and SW inventory discovery and
maintenance, SW services relating to internal subsystem management
and RAS, networking services, low-level user interface services,
and component level services. A single Application Programming
Interface (API) is provided by Platform Manager 531 to enable
accessing the aforementioned functions by other SW modules (such as
Enterprise Manager 530, and CLI 532, for example).
[0394] HW and SW inventory discovery functions include any
combination of several elements, according to various embodiments.
A chassis manager (such as CM 533 of FIG. 5B) discovers and manages
chassis resources, a query engine (such as Query Engine 534)
processes queries, and a persistent state manager (such as
Repository Manager 535) records and provides system configuration,
status, and query information. A VIOC manager (such as VIOC Manager
536) provides control information directly to VIOCs via the SFM,
and indirectly interfaces to VIOC Drivers via the scratchpad
registers included in the VIOC. A write to any of the scratchpad
registers typically results in an interrupt being delivered to a
VIOC Driver. An interface manager (such as Interface Manager 537)
discovers interfaces on NMs and detects changes in state on NMs. In
some embodiments, a VLAN Manager (such as VLAN Manager 539)
provides services and operations relating to VLANs, such as some
aspects of provisioning VLANs in relation to VIOCs. An event
service includes a general publish and subscribe message layer and
an alarm service enables management processes to set and clear
alarms. A software versioning and upgrade service enables
management of binary software releases.
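The publish and subscribe message layer and the alarm service
described above can be pictured with a minimal sketch. The class
names EventService and AlarmService, the topic string "alarm", and
the choice to announce alarm changes through the event layer are
illustrative assumptions and are not taken from any embodiment.

```python
from collections import defaultdict


class EventService:
    """Hypothetical publish and subscribe message layer."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Iterate over a snapshot so a callback may subscribe safely.
        for callback in list(self._subscribers.get(topic, ())):
            callback(message)


class AlarmService:
    """Hypothetical alarm service; set and clear are sent as events."""

    def __init__(self, events):
        self._events = events
        self._active = set()

    def set_alarm(self, name):
        # Setting an alarm that is already active is a no-op.
        if name not in self._active:
            self._active.add(name)
            self._events.publish("alarm", (name, "set"))

    def clear_alarm(self, name):
        if name in self._active:
            self._active.remove(name)
            self._events.publish("alarm", (name, "clear"))

    def active(self):
        return sorted(self._active)
```

A management process would subscribe to the "alarm" topic once and
then be notified of each set and clear without polling.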
[0395] Internal subsystem management SW services include, according
to various embodiments, an Interface Definition Language (IDL)
based communication infrastructure for use between various system
components. The IDL-based infrastructure is also used for process
management and monitoring (via SNMP, for example) of system
components, services, and applications, and also for use for
information logging from one or more processors.
[0396] Other internal subsystem management SW services are
provided, according to embodiment. Security services and
virtualization services (relating to modularity and ownership
records of system components and resources, for example) may be
provided. Maintenance services (relating to a persistent database
to store configuration and other related information across system
restarts and failures, for example) may be provided. A naming
service (serving name and location information to processes
executing on resources inside a chassis, and to executing agents
external to the chassis, for example) may be provided. An IPC
communication framework and associated services (for communication
between management and controlplane processes) may be provided. A
low-level CLI (for accessing various platform manager functions)
may be provided.
[0397] RAS SW services include, according to various embodiments,
state replication, quorum protocols, fast restart mechanisms,
product validation techniques, support for in-service upgrades, and
statistics and tracing collection and storage. Platform manager
functions further include, according to embodiment, an interface
for Enterprise Manager 530, a debugging infrastructure, a highly
available process infrastructure (with disaster recovery), and
various support functions relating to security, logging in,
filtering and secure communication.
[0398] Networking services include, according to various
embodiments, L2 and L3 protocols and functions, such as those
associated with management of FIB data, and Routing Information
Base (RIB) data, respectively. According to embodiment, networking
services further include selected Layer 4 and above services, and
protocols and services relating to SNMP Management Information Base
(MIB) data and SNMP agent support for external management
systems.
[0399] Component level services include, according to various
embodiments, services and frameworks relating to management of
processing resources included on NMs, FCMs, OLBs (including system
and user code), and PMMs. An example of such a framework is a
configuration framework to be used by CLIs, SNMP agents, the
Enterprise Manager (via a GUI), and other similar mechanisms to
deliver configuration information to components. An example of such
a service is a boot management service to provide support and boot
image management for booting pluggable modules (such as PMMs, NMs,
FCMs, and OLBs) in a system.
[0400] Functions performed by Enterprise Manager 530 include
multi-chassis functions analogous to functions performed by the
Platform Manager, selected complex provisioning operations, and
interfacing to GUIs (described elsewhere herein). In some
embodiments, the Enterprise Manager includes functions to integrate
one or more ES systems into surrounding management
infrastructure.
SCF and Related Tasks
[0401] FIG. 10 illustrates a conceptual view of an embodiment of a
Server Configuration File (SCF) and related SCF tasks. SCF Tasks
1010 operate on SCF 1020 as illustrated by Interaction 1030 between
SCF Tasks 1010 and SCF 1020. SCF 1020 serves to describe a desired
server by specifying a list of required (or desired) resources,
typically in the form of constraints. A server specification, as
illustrated by SCF 1020, may include HardWare (HW) details such as
HW Specifications 1021, and SW details such as OS Specifications
1022. Additionally, various HW and SW attributes and organization
and operational details may also be included in SCF 1020, as
illustrated by Server Attributes 1023.
[0402] According to various embodiments, SCF 1020 may be
implemented as a human-readable text file or as a machine-readable
binary file. Typically text file implementations enable editing and
viewing operations via standard text editors. Typically binary file
implementations enable editing and viewing operations via a
specialized Command Line Interface (CLI) or a specialized GUI. In
some embodiments, text file SCF implementations may provide CLI and
GUI driven editing and viewing operations.
[0403] Tasks that may be performed on an SCF include creating an
SCF, as illustrated by Create 1011, and modifying an SCF (including
editing), as illustrated by Modify 1012. An SCF may be explicitly
locked to prevent inadvertent or unauthorized modifications, and
explicitly unlocked to allow modifications, as illustrated by
Lock/Unlock 1013. Viewing an SCF, as illustrated by View 1014,
enables examining the SCF to inspect various details (such as
parameters in HW Specifications 1021, OS Specifications 1022, and
Server Attributes 1023). In some embodiments, View 1014 may provide
error checking operations to determine if the SCF is legal and is
syntactically correct. An SCF may be copied, as illustrated by Copy
1015, moved (or renamed) as illustrated by Move 1016, and removed,
as illustrated by Remove 1017.
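The SCF tasks described above (create, modify, lock/unlock, view,
copy, move, and remove) can be sketched as operations on a small
in-memory store. The SCF and SCFStore classes, their field names,
and the use of dictionaries for the three SCF sections are
hypothetical and serve only to illustrate the tasks. In this sketch
only the modify task checks the lock.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class SCF:
    """Hypothetical in-memory form of a Server Configuration File."""
    name: str
    hw_specifications: dict = field(default_factory=dict)   # cf. 1021
    os_specifications: dict = field(default_factory=dict)   # cf. 1022
    server_attributes: dict = field(default_factory=dict)   # cf. 1023
    locked: bool = False


class SCFStore:
    """Hypothetical store offering tasks like those of SCF Tasks 1010."""

    def __init__(self):
        self._scfs = {}

    def create(self, name, **sections):
        if name in self._scfs:
            raise KeyError(f"SCF {name!r} already exists")
        self._scfs[name] = SCF(name=name, **sections)

    def modify(self, name, section, key, value):
        scf = self._scfs[name]
        if scf.locked:
            raise PermissionError(f"SCF {name!r} is locked")
        getattr(scf, section)[key] = value

    def lock(self, name):
        self._scfs[name].locked = True

    def unlock(self, name):
        self._scfs[name].locked = False

    def view(self, name):
        # Return a copy so that viewing cannot alter the stored SCF.
        return deepcopy(self._scfs[name])

    def copy(self, source, target):
        if target in self._scfs:
            raise KeyError(f"SCF {target!r} already exists")
        duplicate = deepcopy(self._scfs[source])
        duplicate.name = target
        self._scfs[target] = duplicate

    def move(self, source, target):
        self.copy(source, target)
        self.remove(source)

    def remove(self, name):
        del self._scfs[name]
```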
[0404] HW Specifications 1021 may describe constraints in a logical
manner (for example `CreateServer 2way SMP`) or in a physical
manner (for example `CreateServer 2way SMP-slot 2`, referring
specifically to HW inserted into slot 2). The HW specifications may
include a combination of logical and physical specifications. In
some embodiments, the constraints are interpreted by default as
minimum (i.e. `CreateServer 2way SMP` may be satisfied by one or
more 2-way SMPs), and may be specified explicitly to match exactly
(for example `CreateServer 2way SMP-exact` only allows for a single
2-way SMP). Constraints for some types of resources (such as VNICs,
for example), may be described by a full or a partial
specification. The full specification is expressed by itemizing and
fully specifying each individual resource (enumerating bandwidth,
priority scheme selection and associated weights, and other similar
parameters, for example). Alternatively, the partial specification
is expressed by listing a number of a particular type of resource
required (each of the resources is assumed to require identical
parameters, for example).
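The difference between minimum and exact constraints, and between
full and partial specifications, can be illustrated with a short
sketch. The function names, the "count" key, and the
"bandwidth_mbps" parameter are assumptions introduced for
illustration; the actual SCF syntax is as described above.

```python
def smp_constraint_met(assigned_2way_smps, exact=False):
    """Check a `CreateServer 2way SMP` style constraint.

    assigned_2way_smps is the number of 2-way SMPs that would be
    assigned. By default the constraint is a minimum; with exact=True
    only a single 2-way SMP satisfies it.
    """
    if exact:
        return assigned_2way_smps == 1
    return assigned_2way_smps >= 1


def expand_vnic_specification(spec):
    """Return one fully specified parameter dict per VNIC."""
    if isinstance(spec, list):
        # Full specification: each VNIC is itemized individually.
        return [dict(vnic) for vnic in spec]
    # Partial specification: a count plus parameters shared by all.
    shared = {k: v for k, v in spec.items() if k != "count"}
    return [dict(shared) for _ in range(spec["count"])]
```

For example, a partial specification of two VNICs with one shared
bandwidth setting expands to two identical per-VNIC entries, while a
full specification is passed through item by item.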
Server Operational States
[0405] FIG. 11 illustrates selected aspects of an embodiment of
server operational states and associated transitions. Flow
typically begins at "Unused" 1101 with a defined SCF (such as SCF
1020 of FIG. 10, for example) and an available resource collection
(such as Hardware Resources View 900A of FIG. 9A). When a command
to pin a server is received, flow proceeds along "PinServer" 1111
to "Pinned" 1102, where resources are assigned (or allocated) from
the available resource collection according to constraints in the
SCF. In some embodiments, resource allocation (or binding) is
managed by a platform manager (such as Platform Manager 531 of FIG.
5B). The result of the pinning is a provisioned server (such as any
of Provisioned Server PS.sub.2 902.2, Provisioned Server PS.sub.1
902.1, and Provisioned Server PS.sub.0 902.0 illustrated in FIG.
9B).
[0406] In various embodiments, there may be further processing
relating to pinning a server, such as programming resources based
in part on the assignment, based in part on attributes specified in
the SCF, or any combination thereof. For example, the SCF may
specify a two-processor constraint, and during processing relating
to "Pinned" 1102 an assignment of SMP Portion P.sub.A 152A is made
from an available resource pool originally including SMP Portion
P.sub.A 152A and SMP Portion P.sub.A' 152A' (see FIGS. 9A and 9B,
for example). SMP 151A (see FIG. 1B, for example) may then be
configured as the pair of two-processor physical partitions 152A
and 152A' (if this has not already been accomplished).
[0407] VNIC 154A.1 may also be configured with an IP address, a MAC
address, a VLANid, and so forth, according to information in the
SCF or according to other management related criteria. Selected
lookup state information (such as keys and results illustrated in
FIG. 8A and FIG. 8B) may be programmed, including a VNIC MAC
destination address (MACDA) to corresponding fabric port mapping,
for subsequent reference by VIOCs when processing packets (and
messages). Other programming may be performed relating to VLAN
membership, bandwidth, queuing behaviors, or other related
properties (such as programming any combination of VNIC registers
including those of Address Block 601.1 and VNIC Configuration Block
618.1 of FIG. 6A).
[0408] When a command to install the server is received, flow
proceeds along "InstallServer" 1112 to "Installing" 1103 where
initial operating software is installed. The initial operating
software includes any combination of a boot image, an OS boot
loader, a root file system image, portions of an OS image, and an
entire OS image. The software is customized as required according
to attributes included in the SCF, and stored so that it is
accessible by hardware resources assigned during processing
relating to "Pinned" 1102. If installation fails, then flow
proceeds along "Failure" 1113.F to "Failed" 1109, where corrective
action may be taken. If installation is successful, then flow
proceeds along "Success" 1113 to "Installed" 1104.
[0409] When a command to boot the server is received, flow proceeds
along "BootServer" 1114 to "Booting" 1105, where execution of the
boot image begins, followed by execution of the OS boot loader,
with references to the root file system image as needed. If booting
fails, then flow proceeds along "Failure" 1115.F to "Failed" 1109,
where corrective action may be taken. If booting is successful,
then flow proceeds along "Success" 1115 to "Running" 1106. The
server is then fully operational with an executing OS, and may
begin executing various application processes. In situations where
installation has been performed previously, the server may be
booted directly from "Pinned" 1102 via "BootServer" 1112.B.
[0410] "Running" 1106 is exited when any of several server
operational commands, including shutdown, reboot, and reclaim are
received, as illustrated by "ShutdownServer, RebootServer,
ReclaimServer" 1116. In the case of a shutdown command, flow
proceeds to "Shutting Down" 1107, where any executing applications
are closed and the executing OS is terminated. When termination is
complete, then flow proceeds along "ShutdownServer" 1117 to
"Pinned" 1102, awaiting the next command. In the case of a reboot
command, flow proceeds to "Shutting Down" 1107 (where software
execution is terminated) and then proceeds along "RebootServer"
1117.B to "Booting" 1105 to boot the server again.
[0411] In the case of a reclaim server command, flow proceeds to
"Shutting Down" 1107 (terminating software) and then proceeds along
"ReclaimServer" 1117.R to "Unused" 1101, where processing frees the
resources assigned when pinning the server and returns them to the
available resource collection. A reclaim server command may also be
processed from "Installed" 1104 (via "ReclaimServer" 1114.R) and
from "Pinned" 1102 (via "ReclaimServer" 1112.R). When an SCF is
unused (as in "Unused" 1101, after processing of a reclaim server
command, for example), the unused SCF may be deleted, as indicated
by flow "DeleteServer" 1111.D proceeding to "Deleted" 1108.
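The operational states and transitions described with respect to
FIG. 11 can be summarized as a transition table. The following is a
minimal sketch; the table layout and the function names next_state
and run are illustrative assumptions. As a simplification, the exit
from "Shutting Down" is keyed by the command label on the outgoing
flow (1117, 1117.B, or 1117.R), and no transitions out of "Failed"
are modeled because the corrective action is not detailed here.

```python
TRANSITIONS = {
    ("Unused", "PinServer"): "Pinned",                 # 1111
    ("Unused", "DeleteServer"): "Deleted",             # 1111.D
    ("Pinned", "InstallServer"): "Installing",         # 1112
    ("Pinned", "BootServer"): "Booting",               # 1112.B
    ("Pinned", "ReclaimServer"): "Unused",             # 1112.R
    ("Installing", "Success"): "Installed",            # 1113
    ("Installing", "Failure"): "Failed",               # 1113.F
    ("Installed", "BootServer"): "Booting",            # 1114
    ("Installed", "ReclaimServer"): "Unused",          # 1114.R
    ("Booting", "Success"): "Running",                 # 1115
    ("Booting", "Failure"): "Failed",                  # 1115.F
    ("Running", "ShutdownServer"): "Shutting Down",    # 1116
    ("Running", "RebootServer"): "Shutting Down",      # 1116
    ("Running", "ReclaimServer"): "Shutting Down",     # 1116
    ("Shutting Down", "ShutdownServer"): "Pinned",     # 1117
    ("Shutting Down", "RebootServer"): "Booting",      # 1117.B
    ("Shutting Down", "ReclaimServer"): "Unused",      # 1117.R
}


def next_state(state, event):
    """Return the state reached from `state` on `event`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"{event!r} is not valid in state {state!r}") from None


def run(events, state="Unused"):
    """Apply a sequence of events starting from `state`."""
    for event in events:
        state = next_state(state, event)
    return state
```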
[0412] The initial operating software, in various embodiments, may
correspond to a generic operating system environment (such as
Linux, Windows, or any similar commercially available OS). In other
embodiments, the initial operating software may correspond to a
wholly or partially customized operating system environment,
according to user requirements. Such customization may be based in
part on one of the commercially available generic operating system
environments. As illustrated in FIG. 5A, in some embodiments the
initial operating software may include a Hypervisor (such as
Hypervisor1 510 or Hypervisor2 520).
[0413] In some embodiments, operations such as those illustrated in
SCF Tasks 1010 may be performed on an SCF only while there is no
corresponding provisioned server, as designated by "Unused" 1101.
The previous discussion of SCF Tasks assumes this scenario, and is
roughly analogous to operating on the SCF when a server is
"off-line". However, other embodiments enable some or all SCF Tasks
to be performed while a server is "on-line" (i.e. in any
combination of operational states "Pinned" 1102, "Installing" 1103,
"Installed" 1104, "Booting" 1105, "Running" 1106, and "Failed"
1109, in addition to "Unused" 1101). For example, Modify 1012 may
be performed while a server is running, in order to add resources
associated with the server, and has the effect of changing the SCF
as well as pinning additional resources that are then made
available to the running server. Similarly, resources may be
subtracted from the running server via Modify 1012. Parameters that
may be modified while a server is running are considered dynamic
parameters, and parameters that may not be modified are considered
static parameters. Various parameters may be dynamic or static,
according to implementation.
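The distinction between dynamic and static parameters can be
expressed as a simple check performed before Modify is applied. The
parameter names below and the rule that every parameter is
modifiable while the server is "Unused" are assumptions for
illustration; which parameters are dynamic varies by implementation.

```python
# Hypothetical set of parameters treated as dynamic in this sketch.
DYNAMIC_PARAMETERS = frozenset({"vnic_count", "memory_gb"})


def modify_allowed(parameter, server_state):
    """Return True if `parameter` may be modified in `server_state`."""
    if server_state == "Unused":
        # No provisioned server exists, so any parameter may change.
        return True
    # While a server is on-line only dynamic parameters may change.
    return parameter in DYNAMIC_PARAMETERS
```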
[0414] The operation of various SCF Tasks with respect to server
operational state (such as Running and Unused) may also vary by
embodiment. For example, in some embodiments, when View 1014 is
performed, information specific to the operational state of the
server corresponding to the selected SCF may be provided. This
information may include current operating state, any relevant error
information, data concerning processing load, and other similar
information relating to status of an on-line server. In alternate
embodiments, variants of the View operation may provide a
system-wide listing of defined SCFs and any corresponding server
operational state.
[0415] A server as defined by an SCF may be physically relocated
from one set of resources to another, including resources in
distinct geographical locations, according to various embodiments.
In a first embodiment, an SCF may be developed on a first ES system
and then the SCF may be used on a second ES system. This
effectively deassigns resources for the server from the first ES
system and then assigns resources for the server from the second ES
system. The two systems may be co-located in a single server
facility, or located in geographically distinct server facilities.
In a second embodiment, a server provisioned according to an SCF
may be unpinned with respect to a first assignment of resources,
and then pinned anew to a second assignment of resources.
Real Time Server Provisioning and Management
[0416] Real time server provisioning and management includes
recognizing system status changes and responding to the system
status changes at approximately the same rates, and performing
these operations with low latency. As an example, when a pluggable
module is inserted into a running ES, availability of new HW
resources included in the pluggable module is reflected immediately
to an operator of the ES, according to a previously recorded query.
The operator may then request booting a new provisioned server, and
this request is performed immediately (provided sufficient
resources are available to provision the server).
[0417] At the level of a single chassis, CM 533, executing on an
SCM (such as Primary SCM 140A), manages many aspects of real time
server provisioning and management, including managing modules in
the chassis and maintaining an inventory of modules in the chassis.
The CM also monitors operational status of the modules and provides
user interfaces for configuration of modules and provisioning of
servers (via CLIs or GUIs, as provided directly by the CM or in
conjunction with other SW, according to various embodiments).
Additional information concerning the CM is included elsewhere in
this section (see also the Chassis Manager Operation section,
elsewhere herein). Operating in parallel with the CM, instances of
Module BMC SW 550, executing on BMCs in the system (such as BMCs
402.4, 402.5, and so forth), provide events to and receive commands
from the CM. The events convey information regarding module status
changes and command execution results, providing information
necessary for module management and inventory maintenance.
[0418] FIG. 12A is a flow diagram illustrating selected operational
aspects of real time server provisioning and management in an ES
system embodiment, including "CM SW Flow" 1201.20 and "Module BMC
SW Flow" 1201.21. The CM flow illustrates portions of processing
performed directly by or under the control of CM 533 (FIG. 5B). The
BMC flow illustrates portions of processing performed directly by or
under the control of Module BMC SW 550 (FIG. 5C) executing on
the BMCs.
[0419] Processing begins when a module (a PMM, such as PMM 150A,
for example) is introduced into an ES chassis backplane ("Module
Insertion" 1201.1), and continues as a presence interrupt is
generated and delivered to CM 533, indicating insertion of the
pluggable module ("Detect Module Insertion and Generate Presence
Interrupt" 1201.2). Processing then continues under the control of
the CM, as illustrated in "CM SW Flow" 1201.20.
[0420] When the CM receives the presence interrupt, a request is
made to establish communication between the CM and a BMC included
on the inserted module, such as BMC 402.5 of PMM 150A ("Establish
TCP IP with Module BMC" 1201.3). In parallel, the module BMC has
been powered (due to insertion of the module) and has begun
booting. Depending on various implementation dependent timing
behaviors, the module BMC may have completed booting. Eventually
the BMC completes booting, responds to the TCP/IP communication
channel, and listens for commands from the CM (by executing various
portions of Command Agent 553 of FIG. 5B, for example).
[0421] At this point, CM 533 is aware only that a module has been
inserted, but is not aware of any particular details of the module
(such as if the module is a PMM, NM, FCM, or OLB). The CM then
interrogates the module for Vital Product Data (VPD) to determine
the particular details of the module ("Request VPD" 1201.4) by
issuing a Module Get VPD command to the module BMC. The CM then
awaits a BMC event in response to the command ("Event Available?"
1201.5), looping back ("No" 1201.5N) until a response is received
("Yes" 1201.5Y).
[0422] In parallel, the BMC SW receives the command (as illustrated
conceptually by dashed-arrow 1201.4V) and begins to gather the VPD
for the module. Depending on embodiment, optional power-up
processing may occur ("Optional Power-Up" 1201.10 via dashed-arrow
1201.4P) to enable various components on the module to respond to
BMC interrogatories concerning various capacities and capabilities.
For example, in some embodiments it may be necessary for a portion
of the module to operate in a reduced or full manner in order to
determine a maximum operational frequency, which may be an element
of the VPD. The various elements of the VPD are eventually gathered
from components of the module ("Gather VPD" 1201.11). The BMC SW
flow then proceeds to send an event ("Send VPD Event" 1201.12) to
the CM in response to the command (as illustrated conceptually by
dashed-arrow 1201.12V). Processing relating to sending the event is
generally performed by executing various portions of Event Agent
552 (FIG. 5B), for example.
[0423] In parallel, the CM has been awaiting a response from the
BMC, and when an event arrives conveying the response, the VPD
included in the response is parsed and corresponding entries are
stored into a repository ("Post Event Data to Repository" 1201.6
via "Yes" 1201.5Y). In some embodiments, the repository is Primary
CIM Instance Repository 917.P (FIG. 9C) and Repository Manager 535
accesses the repository at the request of CM 533. In some
embodiments, the repository includes any combination of Primary CIM
Instance Repository 917.P and Secondary CIM Instance Repository
917.S.
[0424] The CM then processes queries that depend on at least one of
the corresponding entries stored in the repository ("Pre-Select
Queries and Respond to Activated Queries" 1201.7). Processing
includes determining queries that are dependent on any of the newly
stored entries (or "pre-selecting" such queries), evaluating the
pre-selected queries (to determine which, if any, are activated or
deactivated), and processing any resultant triggered queries (and
ceasing processing of any queries that are no longer triggered). In
some embodiments, query processing is performed via SW routines
included in Query Engine 534 (FIG. 5B). Flow then loops back to
await a subsequent event ("Event Available?" 1201.5).
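The CM-side steps of posting event data to the repository,
pre-selecting dependent queries, and responding to newly activated
queries can be sketched as follows. The Query and QueryProcessor
classes, the dictionary used as a repository, and the key naming
are hypothetical; they are not the implementation of Query Engine
534 or Repository Manager 535.

```python
class Query:
    """Hypothetical registered query with a triggered action."""

    def __init__(self, name, depends_on, predicate, action):
        self.name = name
        self.depends_on = frozenset(depends_on)  # repository keys
        self.predicate = predicate               # repository -> bool
        self.action = action                     # run when activated
        self.active = False


class QueryProcessor:
    def __init__(self):
        self.repository = {}
        self.queries = []

    def register(self, query):
        self.queries.append(query)

    def post_event(self, entries):
        """Store entries, then evaluate only the dependent queries."""
        self.repository.update(entries)
        changed = set(entries)
        # Pre-select: queries that depend on a newly stored entry.
        preselected = [q for q in self.queries if changed & q.depends_on]
        for query in preselected:
            now_active = bool(query.predicate(self.repository))
            if now_active and not query.active:
                query.action(self.repository)    # newly activated
            query.active = now_active            # may also deactivate
        return [q.name for q in preselected]
```

Queries that do not depend on the stored entries are never
evaluated, which is the purpose of the pre-selection step.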
[0425] In parallel with the processing of the VPD by the CM, the
BMC SW has entered a loop monitoring for status changes on the
module ("Status Change?" 1201.13). If no change has occurred, then
processing loops back ("No" 1201.13N). If a change has occurred,
then processing flows forward ("Yes" 1201.13Y) to send a status
change event indicating and describing the new status to the CM
("Send StatusChange Event" 1201.14). The event communication is
indicated conceptually by dashed-arrow 1201.14E, pointing to "Event
Available?" 1201.5, where the CM is looping while awaiting a newly
available event.
[0426] In some embodiments, processing of triggered queries may
result in one or more commands being sent to the BMC to alter the
status or configuration of the module (as illustrated conceptually
by dashed-arrow 1201.7C, for example). For example, a query may be
registered that is activated whenever a module is inserted, and the
query may result in an action including provisioning a server. If
the module is a PMM, then provisioning the server may require
sending a command to the BMC on the PMM module to partition the PMM
according to the requirements of the server to be provisioned.
Other such scenarios are possible, such as re-provisioning a failed
server when a replacement module is inserted.
[0427] Generally the BMC SW is enabled to receive and process
commands in parallel with performing other processing. The received
BMC commands are typically generated by the CM, and in some
embodiments are provided in response to server provisioning and
management commands, that may be provided manually by a user, or
generated automatically in response to an activated query,
according to various usage scenarios. Examples include booting a
server, such as processing related to "BootServer" 1114 (FIG. 11),
and shutting down a server, such as processing relating to
"ShutdownServer" 1117.
[0428] Typically a command is sent asynchronously to the BMC
("Command" 1201.15), as a result, for example, of processing
related to an activated query (see dashed-arrow 1201.7C originating
from "Pre-Select Queries and Respond to Activated Queries" 1201.7).
The command is then received, any accompanying parameters are
parsed, and the required operation is performed ("Perform Command"
1201.16). Status that may change as a result of executing the
command is updated ("Update Status" 1201.17) and processing of the
command is complete ("End" 1201.18). Updating the status, as shown
conceptually by dashed-arrow 1201.17S, eventually results in
recognition of the status change ("Yes" 1201.13Y exiting "Status
Change?" 1201.13). In turn, an event is sent to the CM
(dashed-arrow 1201.14E), typically including a response to the
command.
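The BMC-side behavior described above, in which performing a
command updates status and the status monitoring loop subsequently
reports the change to the CM as an event, can be sketched as
follows. The ModuleBMC class, the command names "power_on" and
"power_off", and the status dictionary are assumptions made for
illustration and do not reflect actual BMC commands.

```python
import queue


class ModuleBMC:
    """Hypothetical model of BMC SW command and status handling."""

    def __init__(self, events):
        self.events = events                  # queue read by the CM
        self.status = {"power": "off"}
        self._reported = dict(self.status)    # last status sent

    def perform_command(self, name):
        # "Perform Command" followed by "Update Status".
        if name == "power_on":
            self.status["power"] = "on"
        elif name == "power_off":
            self.status["power"] = "off"
        else:
            raise ValueError(f"unknown command {name!r}")

    def poll_status(self):
        # One pass of the "Status Change?" check.
        if self.status == self._reported:
            return False
        self._reported = dict(self.status)
        self.events.put(("StatusChange", dict(self.status)))
        return True
```

Because the event is generated by the monitoring pass rather than by
the command handler, a status change from any other cause is
reported through the same path.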
[0429] Recognized status changes ("Status Change?" 1201.13) are not
limited to those occurring as a result of processing a command, but
may include a change in any monitored parameter, state, or other
related variable associated with the module. Such status changes
may include a module failing or becoming operational or powered up,
a sensor crossing a threshold, or completion of a boot operation.
See the Selected BMC Event Details section, included elsewhere
herein, for other examples.
[0430] Those of ordinary skill in the art will recognize that the
figure is illustrative only, as only selected portions of the CM
and BMC SW flows are shown. Furthermore, the CM is typically in
communication with an instance of BMC SW associated with each
inserted module of an ES system embodiment. Thus there are
typically multiple instances (operating in parallel) of the
illustrated flow.
[0431] Those of ordinary skill in the art will also recognize that
although the figure illustrates loop back control (such as "Event
Available?" 1201.5) implemented as a "busy loop", in alternate
embodiments an interrupt may be used instead. These and all similar
variations are contemplated according to various embodiments.
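As one illustration of an alternative to a busy loop, the waiting
step may block until an event is delivered. The sketch below uses a
blocking queue read as a software analogue of an interrupt; the
function names, the STOP sentinel, and the event tuples are
hypothetical.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel used to end the loop


def cm_event_loop(events, handle):
    """Wait for events without polling; the get() call blocks."""
    while True:
        event = events.get()   # sleeps until an event is available
        if event is STOP:
            return
        handle(event)


def demo():
    events = queue.Queue()
    handled = []
    worker = threading.Thread(
        target=cm_event_loop, args=(events, handled.append))
    worker.start()
    events.put(("VPD", "slot 3"))
    events.put(("StatusChange", "slot 3"))
    events.put(STOP)
    worker.join()
    return handled
```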
[0432] FIG. 12B is a flow diagram illustrating selected operational
aspects of real time server provisioning and management in an ES
system embodiment, including selected details relating to
provisioning VNICs and booting PMMs, as typically performed during
processing related to "Booting" 1105 (see FIG. 11). FIG. 12B is
representative of operations performed by various SW elements,
including the CM, the BMC SW, the VIOC Manager, and the BIOS, OS,
and VIOC drivers executing on a PMM.
[0433] Flow starts ("Begin" 1202.1) when a command or request to
boot a server is processed. In response the CM instructs the BMC to
partition the PMM according to an SCF, and the BMC configures the
HT links on the PMM to form the appropriate physical partitions
("Partition PMM" 1202.2). In some embodiments the BMC also
"constructs" or "routes" a VIOC implementation in all or a portion
of a Field Programmable Gate Array (FPGA) device ("Instantiate VIOC
from FPGA" 1202.3). In some usage scenarios a plurality of VIOCs
may be instantiated, such as VIOC 301.5 and VIOC 301.5' of PMM 150A
(FIG. 4B). In other usage scenarios a further plurality of VIOCs,
included on a plurality of PMMs, may be instantiated, depending on
the requirements stated in the SCF.
[0434] After the VIOC (or VIOCs) are instantiated, the CM provides
VNIC provisioning information from the SCF to a controlplane
process (such as VIOC Manager 536 of FIG. 5B) responsible for
configuring VNICs in the VIOC ("VNIC Provisioning Info to VIOCmgr"
1202.4). The VNICs are then configured according to the
provisioning information ("Configure VNICs" 1202.5), typically by
asserting corresponding VNIC enable bits (such as VNIC Enable
618.1a of FIG. 6A) of respective VNICs. In some embodiments, the
VNIC configuration further includes setting minimum and maximum
bandwidth parameters (such as Bandwidth Minimum 618.1d and
Bandwidth Maximum 618.1, both of FIG. 6A), as specified in the SCF.
Subsequently, lookup information is programmed into TCAM/SRAMs
coupled to the VIOC ("Configure TCAMs" 1202.6), based in part on
the SCF and also based on additional system configuration and
topological information.
[0435] The PMM configuration (including partitioning and VIOC
setup) is now complete and processing continues by booting the PMM
(or PMMs) used in the instantiated server ("Boot PMMs" 1202.7).
Processing in the PMM continues as an OS (such as OS1/Drivers 1 507
of FIG. 5A) is booted ("Start OS" 1202.8). A kernel mode VIOC
Driver (such as VIOC Driver 1 511) is in turn initialized and
spawned by the OS ("Start VIOC Driver" 1202.9). The VIOC Driver is
typically responsible for communication between the OS and selected
VIOC resources, including VNICs. The VIOC Driver subsequently
instantiates OS-level interfaces for each of the configured VNICs,
presenting the VNICs as NIC resources to the OS ("Present NICs to
OS" 1202.10). Presenting the NICs includes the VIOC Driver reading
the VNIC enable bits implemented in the VIOC, and for each asserted
bit (indicating an active VNIC), allocating and initializing
driver-level SW data structures for the respective enabled VNIC to
enable the OS to access the VNIC as a NIC. The illustrated
processing is then complete ("End" 1202.9).
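As a minimal sketch of the enable-bit scan described above, the
following Python fragment models the VNIC enable register as an
integer with one bit per VNIC. The 16-VNIC width, the mapping of bit
i to VNIC i, the record layout, and the names enabled_vnics and
present_nics are assumptions made for illustration only; they are not
taken from the VIOC Driver.

```python
def enabled_vnics(enable_register: int, vnic_count: int = 16) -> list[int]:
    """Return indices of VNICs whose enable bit is asserted (bit i = VNIC i)."""
    return [i for i in range(vnic_count) if (enable_register >> i) & 1]


def present_nics(enable_register: int, vnic_count: int = 16) -> dict[int, dict]:
    """Allocate one driver-level record per enabled VNIC (hypothetical layout)."""
    return {
        i: {"name": f"vnic{i}", "initialized": True}
        for i in enabled_vnics(enable_register, vnic_count)
    }
```

For example, a register value of 0b1011 would yield records for VNICs
0, 1, and 3 in this model.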
[0436] Removal of a VNIC (as a result of operations relating to
server management, for example) is typically accomplished in
several stages. The VIOC Manager disables a corresponding VNIC
enable bit (included in a VNIC enable register of a VIOC). In
response to writing the VNIC enable register, the VIOC generates an
interrupt that is delivered to the VIOC Driver executing on the
module including the VIOC (such as a PMM). The VIOC Driver reads
the VNIC enable register, determines that a VNIC has been disabled,
and deletes any corresponding OS-level interfaces previously
configured and presented to the OS. Reconfiguration of a VNIC, with
respect to properties visible to the OS, is accomplished in a
similar manner: the VIOC Manager writes VIOC registers, an
interrupt is delivered to the VIOC Driver, and the VIOC Driver
modifies corresponding OS-level interfaces appropriately. Results
may be returned by the VIOC Driver by writing one or more VIOC
scratchpad registers with return value information. Typically the
VIOC Manager polls the proper scratchpad registers awaiting
results.
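The removal sequence can be sketched as an interrupt handler that
compares the current enable register against the interfaces already
presented to the OS. This is an illustrative model only: the handler
name, the dictionary of interfaces keyed by VNIC index, and the
bit-per-VNIC register layout are assumptions, not details from the
VIOC Driver.

```python
def handle_enable_interrupt(interfaces: dict[int, str], enable_register: int) -> list[int]:
    """Delete interfaces whose enable bit is now clear; return removed indices."""
    # An interface is stale when its VNIC's enable bit has been cleared.
    removed = [i for i in sorted(interfaces) if not (enable_register >> i) & 1]
    for i in removed:
        del interfaces[i]
    return removed
```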
[0437] In some embodiments, the VIOC Manager executes using
resources on an SCM (such as Primary SCM 140A), and in such
embodiments the VIOC Driver typically communicates via an SFM with
the Manager (as shown by PMM-SCM Data Exchange 215, for example).
In some embodiments, one or more of the VIOC Drivers communicate
with the same VIOC Manager, and one or more of the VIOC Drivers
communicate with the same VLAN Manager. In some embodiments, there
is only a single VIOC Manager and a single VLAN Manager.
BMC Command and Low-Level Hardware Boot Processing
[0438] FIG. 13A is a state diagram illustrating processing of
selected BMC related commands in an ES embodiment. In some
embodiments, all or any portion of BMC commands are implemented as
IPMI type commands, and in some of these embodiments, an IPMI
Client (such as IPMI Client 551 of FIG. 5C) may provide interface
functions for communication with Command Agent 553. The BMC
commands are typically provided by CM 533 and processed by BMC SW,
as illustrated by "Perform Command" 1201.16 (FIG. 12A).
[0439] In some embodiments, the BMC SW implements processing of BMC
commands via a BMC Operational State Machine, and the figure is
representative of states and transitions of these embodiments. In
other embodiments, the BMC command processing is performed by any
combination of SW and HW. In some embodiments, a plurality of BMC
state machine instances are provided on a module (such as a PMM),
one for each possible partition the module may be partitioned into.
In some embodiments, the BMC Operational State Machine receives
IPMI commands and passes them through to an IPMI implementation,
returning IPMI status from the IPMI implementation if needed. Some
BMC events correspond to pass-through IPMI events, such as Sensor
Threshold Breached, described elsewhere herein. The figure conforms
to the convention that the state machine remains in a state unless
one of the illustrated transitions is activated ("loopback"
transitions are omitted for clarity).
[0440] In some embodiments, whenever the state machine transitions
to a new state, the transition is reported to Chassis Manager 533
as one or more events that describe the resultant state. The
transitions are typically recognized as a status change, such as
those detected by "Status Change?" 1201.13 (FIG. 12A), and event
signaling is as illustrated by "Send StatusChange Event" 1201.14.
In some embodiments, the events include information concerning the
transition or the cause for the transition.
[0441] Consider the BMC Operational State Machine as starting in
state P 1301. In this state, the module the BMC is included in is
powered up only sufficiently for operation of the BMC sub-system,
and other elements of the module remain powered down. BMC
sub-system elements related to detecting selected module status
information, receiving CM commands, and delivering event
information to the CM are operational. In some embodiments, these
elements include execution of all or portions of Event Agent 552 and
Command Agent 553 of FIG. 5C.
[0442] While in state P 1301, if the CM sends commands including
any of Module Power Up, Module Hold Reset, and Module Boot to the
BMC, then the state machine transitions to state C1 1302 via
transition "Power Up OR Hold Reset OR Boot" 1301.C1. State C1
generally corresponds to a first or minimal configuration. If the
command was Module Power Up, then state C1 1302 is the end state
for processing the command. If the command was Module Hold Reset,
then when conditions for transition "Valid BCT AND Configuration
Complete OK" 1302.C2 are met, the state machine transitions to
state C2 1303, and this is the end state for processing the Module
Hold Reset command. State C2 generally corresponds to a second or
customized configuration. If the command was Module Boot, then the
state machine transitions first to state C2 1303, as for Module Hold
Reset. Then, when conditions for transition "Boot" 1303.S1 are met,
the state machine transitions to state S1/D 1304, followed by a
transition to state S2 1305 when conditions for transition
"Heartbeat OK" 1304.S2 are met, and this is the end state for
processing the Module Boot command.
[0443] While in state C1 1302, power is applied to all of the
elements on the module, in addition to the BMC sub-system, and
reset is active to any system or application processing elements.
For example, CPUs included on PMMs, PCEs and TMs included on NMs,
IOPs included on FCMs, and CPUs included on OLBs are continuously
reset. If a Module Power Up command was being processed, then C1 is
the final state, and is exited only upon receipt of another
command. If a Module Power Down command is received, then the state
machine transitions to state P 1301 via transition "Power Down"
1302.P. If a Module Force Fault command is received, then the state
machine transitions to state F 1306 via transition "Force Fault"
1302.F.
[0444] If a Module Hold Reset or a Module Boot command is being
processed, then C1 is a transient state, and is exited when the BMC
sub-system detects or determines that the conditions for transition
"Valid BCT AND Configuration Complete OK" 1302.C2 are met.
Specifically, the Boot Configuration Table (BCT) information
received with the command being processed has been found to be
valid, and any module configuration information included in the BCT
information has been successfully applied to the module. Thus the
module configuration is complete and the machine transitions to
state C2 1303.
[0445] While in state C2 1303, reset remains active to the system
and application processing elements. If a Module Hold Reset command
was being processed, then C2 is the final state, and is exited only
upon receipt of another command. If any command that provides new
BCT information is received, then the state machine transitions to
state C1 1302 via transition "BCT Change" 1303.C1. If a Module
Power Down command is received, then the state machine transitions
to state P 1301 via transition "Power Down" 1303.P. If a Module
Force Fault command is received, then the state machine transitions
to state F 1306 via transition "Force Fault" 1303.F.
[0446] If a Module Boot command is being processed, then C2 is a
transient state, and is exited when the BMC sub-system detects or
determines that the conditions for transition "Boot" 1303.S1 are
met. Specifically, the BMC determines that an implementation
dependent delay has transpired, and the state machine transitions
to state S1/D 1304.
[0447] While in state S1/D 1304, reset is released, allowing the
system and application processing elements to begin fetching and
executing code. If a Module Hold Reset command is received, then
the state machine transitions to state C2 1303 via transition "Time
Out OR Hold Reset" 1304.C2. If a Module Power Down command is
received, then the state machine transitions to state P 1301 via
transition "Power Down" 1304.P. If a Module Force Fault command is
received, then the state machine transitions to state F 1306 via
transition "Force Fault" 1304.F.
[0448] If a Module Boot command is being processed, then S1/D is a
transient state. If the conditions for transition "Heartbeat OK"
1304.S2 are met, then the state machine transitions to state S2
1305 via transition "Heartbeat OK" 1304.S2. Specifically, the BMC
sub-system receives a heartbeat indication from the system or
application processor after the processor has executed sufficient
start-up code (such as BIOS boot for a PMM) to communicate the
heartbeat indication to the BMC. In some embodiments, BIOS
execution communicates heartbeat information to the BMC via VIOC
scratchpad registers.
[0449] However, if the BMC does not receive a heartbeat indication
after a predetermined time interval, then the BMC Operational State
Machine times out the operation, and transitions to state C2 1303
via transition "Time Out OR Hold Reset" 1304.C2. A special sub-case
of the Module Boot command may specify (via information in the BCT)
that off-line diagnostics are to be performed instead of a full
boot. In this context, completion of the Module Boot command occurs
when the off-line diagnostics are completed or have timed out. When
this occurs, the state machine transitions to state C1 1302 via
transition "Offline Diagnostics Finished" 1304.C1.
[0450] While in state S2 (the terminus of successful processing of
a Module Boot command), reset to the processing elements remains
released, and the processing elements continue executing
instructions and periodically generating heartbeat indications to
the state machine. If a predetermined period of time elapses
without a heartbeat indication, then the state machine transitions
to state F 1306 via transition "Heartbeat Timeout OR Force Fault"
1305.F.
[0451] If a Module Boot or a Module Hold Reset command is received,
then the state machine transitions to state C2 1303 via transition
"Boot OR Hold Reset" 1305.C2. If a Module Power Down command is
received, then a transition is made to state P 1301 via transition
"Power Down" 1305.P. If a Module Force Fault command is received,
then a transition is made to state F 1306 via transition "Heartbeat
Timeout OR Force Fault" 1305.F.
[0452] State F is a transient fault recovery state where an attempt
is made to recover from whatever condition led to the transition
into the state. If recovery from any non-fatal faults relating to
state S2 is made, then the machine transitions to state S2 1305 via
transition "Recover" 1306.S2. If recovery from any fatal faults
relating to states C2, S1/D, or S2 is made, then the machine
transitions to state C2 1303 via transition "Recover" 1306.C2. If
recovery from any fatal faults relating to state C1 is made, then a
transition is made to state C1 1302 via transition "Recover OR Hold
Reset" 1306.C1. A Module Hold Reset command overrides any
in-progress or attempted recovery, and in response the machine
transitions to state C1 1302 via transition "Recover OR Hold Reset"
1306.C1. A Module Power Down command is similarly overriding, and
the machine moves to state P 1301 via transition "Power Down"
1306.P. The CM may issue a Module Power Down command in response to
event information sent from the BMC indicating that the fault is an
unrecoverable HW or SW failure, according to embodiment, or
represents a catastrophic fault, also according to embodiment.
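The states and transitions described for FIG. 13A can be summarized
as a lookup table. The sketch below is one possible encoding and not
the BMC SW implementation: the trigger names (such as
config_complete_ok and boot_delay_elapsed) are hypothetical labels
for the commands and conditions named in the text, and the table does
not track which command is in progress, so the caller is assumed to
supply only triggers that apply. Pairs not listed leave the state
unchanged, matching the omitted "loopback" transitions.

```python
# Keys are (current state, trigger); values are the next state.
# State names follow FIG. 13A; trigger names are illustrative assumptions.
TRANSITIONS = {
    ("P", "power_up"): "C1",                         # 1301.C1
    ("P", "hold_reset"): "C1",                       # 1301.C1
    ("P", "boot"): "C1",                             # 1301.C1
    ("C1", "config_complete_ok"): "C2",              # 1302.C2
    ("C1", "power_down"): "P",                       # 1302.P
    ("C1", "force_fault"): "F",                      # 1302.F
    ("C2", "bct_change"): "C1",                      # 1303.C1
    ("C2", "boot_delay_elapsed"): "S1/D",            # 1303.S1
    ("C2", "power_down"): "P",                       # 1303.P
    ("C2", "force_fault"): "F",                      # 1303.F
    ("S1/D", "heartbeat_ok"): "S2",                  # 1304.S2
    ("S1/D", "time_out"): "C2",                      # 1304.C2
    ("S1/D", "hold_reset"): "C2",                    # 1304.C2
    ("S1/D", "offline_diagnostics_finished"): "C1",  # 1304.C1
    ("S1/D", "power_down"): "P",                     # 1304.P
    ("S1/D", "force_fault"): "F",                    # 1304.F
    ("S2", "heartbeat_timeout"): "F",                # 1305.F
    ("S2", "force_fault"): "F",                      # 1305.F
    ("S2", "boot"): "C2",                            # 1305.C2
    ("S2", "hold_reset"): "C2",                      # 1305.C2
    ("S2", "power_down"): "P",                       # 1305.P
    ("F", "recover_to_s2"): "S2",                    # 1306.S2
    ("F", "recover_to_c2"): "C2",                    # 1306.C2
    ("F", "recover_to_c1"): "C1",                    # 1306.C1
    ("F", "hold_reset"): "C1",                       # 1306.C1
    ("F", "power_down"): "P",                        # 1306.P
}


def step(state: str, trigger: str) -> str:
    """Return the next state; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, trigger), state)


def run(state: str, triggers: list[str]) -> str:
    """Apply a sequence of triggers and return the final state."""
    for trigger in triggers:
        state = step(state, trigger)
    return state
```

In this model, a Module Boot issued in state P followed by
config_complete_ok, boot_delay_elapsed, and heartbeat_ok ends in
state S2.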
[0453] In some embodiments, some server provisioning and management
operations typically require issuing one or more BMC commands that
are processed according to the illustrated state diagram. A first
example is booting a server, such as processing related to
"BootServer" 1114 (FIG. 11). If the server to be booted is
configured with multiple modules (such as two PMMs), then two
separate BMC command streams will be issued, one to each of the two
PMMs (see FIG. 13B for an example). A second example is shutting
down a server, such as processing relating to "ShutdownServer"
1117, resulting in separate Module Power Down commands to some or
all of the modules the server was provisioned from.
[0454] While the BMC Operational State Machine has been described
with respect to selected BMC commands (such as Module Power Up,
Module Power Down, Module Hold Reset, Module Boot, and Module Force
Fault), this is only a representative embodiment. Any combination
of BMC commands (such as those described in the Selected BMC
Command Details section, elsewhere herein) may be implemented by
the BMC state machine. Additionally, in some embodiments, any
combination of BMC Commands illustrated with respect to the BMC
state machine may be implemented by other mechanisms.
[0455] Those of ordinary skill in the art will note that the BMC
Operational State Machine may be implemented in HW, SW, or any
combination of the two, according to embodiment. It is also
apparent that any number of state machine states and transitions
may be implemented to provide similar functionality, according to
embodiment.
[0456] FIG. 13B illustrates selected operational aspects of single
and dual PMM low-level hardware boot processing in an ES
embodiment, as represented by "Boot PMMs" 1202.7 (FIG. 12B), for
example. The processing illustrated in FIG. 13B conceptualizes
selected paths through states as illustrated by FIG. 13A, with
corresponding states and transitions named accordingly. FIG. 13B
illustrates boot processing for a single PMM configuration (such as
P.sub.3 203 of FIG. 2) and a dual PMM configuration (such as
P.sub.4 204 of FIG. 2), and as such the generic Module commands
named in the transitions of FIG. 13A correspond to specific PMM
Module commands in the context of FIG. 13B.
[0457] Operation in the single PMM configuration is as follows.
"Boot Flow" 1312 begins at state P 1301.M, when the BMC on PMM 150B
receives a PMM Boot command via CM 533. The BMC Operational State
Machine then moves to state C1 1302.M via transition "Boot"
1301.C1.M, and asserts reset to the PMM. When the BCT has been
found to be valid and the configuration included therein has been
properly applied to the PMM, the state machine moves to state C2
1303.M via transition "Configuration Complete OK" 1302.C2.M. The
state machine then continues to state S1/D 1304.M via transition
"Boot" 1303.S1.M, and releases reset to the PMM. The PMM then boots
BIOS and generates a valid heartbeat. When the valid heartbeat from
the PMM has been detected by the BMC, the machine moves to state S2
1305.M via transition "Heartbeat OK" 1304.S2.M, and the PMM boot
flow is complete.
[0458] Each of the state machine transitions is reported to CM 533
via events describing the resultant state. For example, when the
state machine has completed the transition to state C1 1302, an
event describing the new state machine state as "C1" is generated
and delivered to the CM. Events are similarly generated and
delivered for all of the state machine transitions.
[0459] Operation in the dual PMM configuration is as follows, with
PMM 150B operating as the master, and PMM 150A operating as the
slave. Conceptually, the master PMM is partially booted ("Hold
Reset Flow" 1313M), then the slave PMM is booted ("Hold Reset Flow"
1313S and "Release Reset Flow" 1314S), and then the master PMM is
fully booted ("Release Reset Flow" 1314M). The final slave PMM boot
state is different than the master PMM boot state, as the slave PMM
omits booting of BIOS and hence generates no heartbeat.
Coordination of transitions between the master and slave PMMs is
managed by CM 533, via reception and processing of state transition
events and issuing of appropriate commands to the master and slave
BMCs on the respective PMMs.
[0460] More specifically, "Hold Reset Flow" 1313M begins at state P
1301.M, when the BMC on the master PMM (PMM 150B) receives a PMM
Hold Reset command from CM 533. The BMC Operational State Machine
then moves to state C1 1302.M (asserting reset to the master PMM)
and then to state C2 1303.M as in "Boot Flow" 1312. However, the
state machine remains in state C2 1303.M when processing the PMM
Hold Reset command (leaving reset asserted), instead of continuing
as when processing a PMM Boot command. As in all of the BMC
Operational State Machine transitions, an event is generated upon
arrival in state C2 1303.M and delivered to the CM.
[0461] In response (indicated conceptually by dashed-arrow 1310)
the CM sends a PMM Hold Reset command to the BMC on the slave PMM
(PMM 150A). The slave BMC Operational State Machine then
transitions from state P 1301.S to state C1 1302.S (asserting reset
to the slave PMM) and then to state C2 1303.S, where it remains,
awaiting further CM commands. An event is generated and delivered
to the CM indicating the slave BMC is now in the "C2" state. In
response the CM provides a PMM Release Reset command to the slave
BMC. The slave BMC then releases reset to the slave PMM and
transitions to state S1/D 1304.S, whereupon another event is
delivered to the CM indicating arrival in the "S1/D" state.
[0462] In response (indicated conceptually by dashed-arrow 1311)
the CM sends a Release Reset command to the master BMC. The master
BMC then transitions to state S1/D 1304.M and releases reset to the
master PMM. When BIOS boot is complete and the resultant heartbeat
is detected, the master BMC Operational State Machine transitions
to state S2 1305.M and reports the new state to the CM. Booting of
the dual PMM configuration is now complete, with both PMMs out of
reset and the master PMM having booted BIOS.
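The CM-coordinated dual PMM sequence can be sketched as an ordered
list of commands, each gated on the state reported by the preceding
step. The SimulatedBmc class below is a toy stand-in for a BMC
Operational State Machine, and the command strings, step table, and
function names are assumptions for illustration; a real CM would
receive the state events asynchronously rather than as return
values.

```python
class SimulatedBmc:
    """Toy BMC that reports the states it passes through for each command."""

    def __init__(self, generates_heartbeat: bool):
        self.state = "P"
        self.generates_heartbeat = generates_heartbeat

    def handle(self, command: str) -> list[str]:
        if command == "hold_reset" and self.state == "P":
            path = ["C1", "C2"]  # reset asserted, then configuration applied
        elif command == "release_reset" and self.state == "C2":
            # Only a PMM that boots BIOS produces the heartbeat needed for S2.
            path = ["S1/D", "S2"] if self.generates_heartbeat else ["S1/D"]
        else:
            path = []
        if path:
            self.state = path[-1]
        return path


# (target BMC, command, state the CM waits for before the next step)
DUAL_PMM_STEPS = [
    ("master", "hold_reset", "C2"),
    ("slave", "hold_reset", "C2"),
    ("slave", "release_reset", "S1/D"),
    ("master", "release_reset", "S2"),
]


def boot_dual_pmm(bmcs: dict[str, SimulatedBmc]) -> dict[str, str]:
    """Issue each command in order, checking the last reported state."""
    for target, command, expected in DUAL_PMM_STEPS:
        events = bmcs[target].handle(command)
        if not events or events[-1] != expected:
            raise RuntimeError(f"{target} did not reach {expected}")
    return {name: bmc.state for name, bmc in bmcs.items()}
```

With a master that generates a heartbeat and a slave that does not,
the model finishes with the master in S2 and the slave in S1/D,
consistent with the differing final boot states noted above.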
[0463] According to embodiment, CM communication with BMCs is via
any combination of transports and protocols. The transports include
Ethernet (coupling 452 of FIG. 4A, for example, as described
elsewhere herein), an Intelligent Chassis Management Bus (ICMB), an
Intelligent Platform Management Bus (IPMB), RS-485, RS-232, PCI
mailboxes, in-band or out-of-band signaling over the SFM, and any
other similar mechanisms. The protocols include TCP/IP and any
similar protocols. The communications include events from BMCs to
the CM, and commands from the CM to the BMCs.
[0464] Some embodiments provide for larger than 8-way SMPs, and in
a first group of implementations, BMC coordination is via explicit
CM control, as illustrated in FIG. 13B. In a second group of
implementations, BMC SW instances communicate and cooperate with
each other in a peer-to-peer mode, independent of explicit CM
control coordination and sequencing.
Selected BMC Event Details
[0465] BMC events are generated when a change in specific
characteristics of an ES system or a pluggable module included
therein occurs, and are also generated in response to most BMC
commands (even those effecting no change in characteristics). The
CM is the primary consumer of the generated events. At boot up of
each pluggable module (such as a SIM, PMM, NM, FCM, and OLB), the
CM establishes a separate TCP connection for each respective BMC,
for communication of the events as TCP messages. Each of the TCP
messages may include a returned data structure providing specific
details regarding the event, such as detailed status or log
information, according to embodiment. The data structure typically
includes fields identifying the pluggable module type sourcing the
event, and the event classification or number. SIMs, PMMs, NMs, and
OLBs may be identified as pluggable module types 1, 2, 3, and 4,
respectively, with unknown modules identified as module type 0,
according to embodiment. In some embodiments, a dedicated packet
format is used to convey event information. In some embodiments,
BMC events are conveyed as SNMP traps.
[0466] Pluggable modules including VIOCs (for example SIMs, PMMs,
FCMs, and OLBs) communicate events specific to VIOC operation, such
as VIOC Initialization Complete/Fail, and VIOC Reset Request. The
VIOC Initialization Complete event is sent when the BMC has
successfully initialized the VIOC after module reset has been
released, and the VIOC Initialization Fail event is sent if the
VIOC initialization fails. In some embodiments, the VIOC
Initialization Complete and Fail events are implemented as a single
event with a field in the associated return data structure
specifying success or failure. The VIOC Reset Request event is sent
by the BMC in response to receipt of a corresponding VIOC reset
request from a VIOC Driver executing on the module. The CM
determines if and when permission for the request is to be given,
and if so sends a corresponding Module Reset VIOC command to the
BMC, providing a mechanism for the VIOC Driver to reset an
associated VIOC under control of the CM.
[0467] Other BMC events include Module Operational Status Up/Down,
Release Reset Failure, and Sensor Threshold Breached. The Module
Operational Status Up/Down event is sent when the BMC successfully
establishes a heartbeat with SW executing on the module. The
associated return data structure specifies that the module is
operational (Up). Subsequently, if the heartbeat is not received
within a predetermined time interval, then the BMC sends the event
with the data structure indicating the module is not operational
(Down). In some embodiments, separate events are used for Module
Operational Status Up and Module Operational Status Down. The SW
executing on the module and providing the heartbeat may be any
combination of OS SW, Driver SW, and BIOS SW, varying according to
module type and embodiment.
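The Up/Down behavior can be illustrated with a small heartbeat
monitor. This is a sketch under stated assumptions: the class name,
the explicit now arguments (used here instead of a real clock so the
example is deterministic), and the single-event-with-status return
values are illustrative choices and are not drawn from the BMC SW.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks heartbeats and reports Up/Down status changes."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: Optional[float] = None
        self.up = False

    def heartbeat(self, now: float) -> Optional[str]:
        """Record a heartbeat; return "Up" if the module just became operational."""
        self.last_seen = now
        if not self.up:
            self.up = True
            return "Up"
        return None

    def poll(self, now: float) -> Optional[str]:
        """Return "Down" once if no heartbeat arrived within the timeout."""
        if self.up and now - self.last_seen > self.timeout:
            self.up = False
            return "Down"
        return None
```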
[0468] In some embodiments, the Module Operational Status Up/Down
event is sent when the BMC Operational State Machine transitions to
state S2 1305 (FIG. 13A), with the return data structure indicating
the module is operational. In some embodiments, a general BMC State
Change event may be used to communicate transitions of the BMC
Operational State Machine, including transitions such as the
transition to state S2 1305, as well as other transitions of the
state machine.
[0469] The Release Reset Failure event is sent when the BMC detects
that a module fails to respond to a request to release reset,
typically delivered to the module by the BMC in response to a
corresponding command from the CM. The Sensor Threshold Breached
event is sent when any sensors included in the BMC sub-system
report a value that crosses any predefined thresholds (for example
an over-temperature or over-voltage detection). The event data
structure may optionally include the sensor value at the time the
event is detected, according to sensor type and embodiment.
[0470] PMM specific events generally relate to a BCT, which is
typically a superset of a Partition Configuration Table (PCT), used
to specify the configuration of a PMM, particularly with respect to
the number of CPUs in a partition (such as 2-way, 4-way, or 8-way).
PMM specific events include a BCT Valid event that is sent in
response to a BMC command that communicates a BCT. The BMC checks
the communicated BCT to determine that it is valid for the module
(such as determining that a requested partitioning is possible for
the module), and if so, then configures the module according to the
information in the BCT. If the configuration is successful, then
the BMC sends a BCT Valid event indicating that the BCT was valid
for the module, and the module was successfully configured as
specified by the BCT.
[0471] SIM specific events include Power Up/Down and Fan Up/Down
events. The Power Up/Down event is sent when there is a change in
the operational status of a power module in the system. The event
data structure specifies if the module has become operational (Up)
or has become non-operational (Down). Similarly, the Fan Up/Down
event is sent to notify the CM of a change in a fan module
operational state. In some embodiments, separate events are used
for Power Up, Power Down, Fan Up, and Fan Down events.
Selected BMC Command Details
[0472] BMC commands are generally sent by the CM to determine
information or status regarding pluggable modules, or to effect a
change in configuration or status of pluggable modules. BMC
commands may be directed to BMCs on any type of pluggable module
(such as a SIM, PMM, NM, FCM, and OLB), via the separate TCP
connections for each module established at module boot. In some
embodiments the TCP connections are also used to communicate BMC
events.
[0473] Some commands are specific to one module type (such as a
PMM), and other commands may be applicable to more than one module
type (such as any module including a VIOC, or any module having a
configurable power supply). Typically commands directed toward a
SIM are directed to a Redundant SIM by a Primary SIM, since the CM
typically executes at least in part using computing resources
included in a Primary SIM (such as Primary SCM 140A of FIG. 2).
[0474] Each BMC command generally includes a command parameter data
structure defining specific details or values associated with the
command. The data structure typically includes fields identifying
the pluggable module type receiving the command, and the command
identifier (or number). SIMs, PMMs, NMs, and OLBs may be
identified as pluggable module types 1, 2, 3, and 4, respectively,
with unknown modules identified as module type 0, according to
embodiment. In some embodiments, a dedicated packet format is used
to convey command information. In some embodiments, processing of a
BMC command may include generating a response event (directed to
the CM, for example) acknowledging receipt of the command and
describing the outcome of the command in the form of a return
code.
[0475] BMC commands specific to BMCs on PMM modules include Module
BCT. The Module BCT command (identifier 0x000F) is used to instruct
the BMC to configure the associated PMM (or a previously
partitioned portion of it) according to a BCT (provided as a
command parameter data structure), and is typically issued in the
context of provisioning a server. The BMC parses the provided BCT
to determine if the configuration is valid for the PMM in which the
BMC is included. If the configuration is valid, then the BMC
configures components on the PMM according to the
configuration.
[0476] The parameters include structures for general, server, VNIC,
boot, partition, console, and OS information. The general parameter
structure includes a table version number (such as 0), and an
action identifier describing an action to take based on the
configuration information (such as provision identified by the
value 1, and release or reclaim provision identified by the value
2). The general parameter structure further includes a count of the
number of BMCs involved in applying the configuration (one BMC for
a 2-way or a 4-way configuration, and two BMCs for an 8-way
configuration). The general parameters further include an IP
address identifying a master BMC associated with the configuration,
and a list of IP addresses for all of the BMCs involved in the
configuration.
[0477] The server structure includes a server type identifier
(having values such as 1 for 2-way, 2 for 4-way, and 3 for 8-way),
and a slot number and valid bit to associate with the provisioned
server (having values such as 0 and 1). The server structure
further includes a system number and valid bit to associate with
the provisioned server (having values such as 0 and 1), and a boot
method identifier (such as 1 for network booting and 2 for local
fibre channel booting). The server structure further includes a
count of VNICs for the server (from 1 to 64, for example), a VNIC
structure for each of the VNICs, and a list and count of fibre
channel boot paths. Each VNIC structure includes a VNIC identifier
that is unique throughout the server (such as a 32-bit integer), a
bandwidth specification, and a MAC address for the VNIC. Each fibre
channel boot path includes a port identifier of an associated fibre
channel port, a world wide name of a fibre channel destination, and
a logical unit number for the fibre channel destination.
[0478] The partition structure includes a boot policy identifier
(such as 1 for `Wait-for-SIM`, 2 for `autoBoot`, 3 for `oneShot`
and 4 for `Debug`), and a sticky bit to indicate if the
configuration remains over reboots (such as 0 for not sticky and 1
for sticky). The console structure includes information describing
a baud rate, a number of data bits, a parity type, a number of stop
bits, and a console type (such as 1 for VT-100). The OS structure
includes an OS identifier (such as 1 for Linux and 2 for
Windows).
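One way to picture a subset of the BCT parameter structures is as
nested records with enumerated identifiers. The sketch below uses the
example identifier values given above, but the class and field names,
the choice of fields, the string form of the MAC address, and the
expected_bmc_count helper are assumptions for illustration; the text
does not specify a binary layout or bandwidth units.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Action(IntEnum):
    PROVISION = 1
    RELEASE = 2


class ServerType(IntEnum):
    TWO_WAY = 1
    FOUR_WAY = 2
    EIGHT_WAY = 3


class BootMethod(IntEnum):
    NETWORK = 1
    LOCAL_FIBRE_CHANNEL = 2


class BootPolicy(IntEnum):
    WAIT_FOR_SIM = 1
    AUTO_BOOT = 2
    ONE_SHOT = 3
    DEBUG = 4


@dataclass
class Vnic:
    vnic_id: int       # unique throughout the server
    bandwidth: int     # units are not specified in the text
    mac_address: str


@dataclass
class Server:
    server_type: ServerType
    boot_method: BootMethod
    vnics: list[Vnic] = field(default_factory=list)


@dataclass
class Bct:
    version: int
    action: Action
    server: Server
    boot_policy: BootPolicy
    sticky: bool = False

    def expected_bmc_count(self) -> int:
        """One BMC for a 2-way or 4-way server, two BMCs for 8-way."""
        return 2 if self.server.server_type == ServerType.EIGHT_WAY else 1
```

Using integer-valued enumerations keeps the example values from the
text visible while giving each identifier a readable name.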
[0479] Processing performed in response to the Module BCT command
analyzes the BCT and configures PMM HW accordingly. For example,
when the action is to provision a server (action identifier equal
to 1), the server type identifier is decoded to determine how to
configure the HT links on the PMM. More specifically, if the server
type is 2-way (type identifier equal to 1), then in the context of
PMM 150A (FIG. 4B), HT couplings 460.5L and 460.5R are configured
for coherent operation by BMC 402.5, and HT couplings 460.5X and
460.5Y are configured for isolated operation. If the server type is
4-way (type identifier equal to 2), then HT couplings 460.5L,
460.5R, 460.5L', 460.5R', 460.5X, and 460.5Y are configured for
coherent operation. The MAC address in each VNIC structure is
stored into corresponding configuration state in one or more VIOCs,
such as setting MAC Address 603.1 (FIG. 6A), by writing
corresponding configuration state implemented in VIOC Configuration
block 706 (FIG. 7A).
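The decoding of the server type identifier into HT coupling modes may be sketched as follows. This is a minimal illustration, not BMC firmware: the function name and the string labels are hypothetical, and only the couplings named in the text for each server type are included.

```python
COHERENT = "coherent"
ISOLATED = "isolated"

SERVER_TYPE_2_WAY = 1
SERVER_TYPE_4_WAY = 2


def ht_coupling_modes(server_type: int) -> dict:
    """Map a server type identifier to the mode of each HT coupling."""
    if server_type == SERVER_TYPE_2_WAY:
        return {
            "460.5L": COHERENT,
            "460.5R": COHERENT,
            "460.5X": ISOLATED,
            "460.5Y": ISOLATED,
        }
    if server_type == SERVER_TYPE_4_WAY:
        names = ("460.5L", "460.5R", "460.5L'", "460.5R'", "460.5X", "460.5Y")
        return {name: COHERENT for name in names}
    raise ValueError(f"unsupported server type identifier {server_type}")
```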
[0480] The response event generated upon completion of the Module
BCT command includes a return code selected from the set including
Valid BCT Configuration Successful (encoded as 0x0000), Invalid
Slot Information (0x0100), Valid BCT Configuration Failure
(0x0200), and Invalid BCT (0x0300). Valid BCT Configuration
Successful is returned when the BMC has determined that the
provided BCT is valid for the associated module, and the specified
configuration has been successfully applied to the module (or
portion thereof). Valid BCT Configuration Failure is returned when
the BCT is valid but the configuration has not been successfully
applied. Invalid Slot Information is returned when the slot
information in the BCT is not valid for the module. Invalid BCT is
returned when the BMC determines that the BCT is not valid for the
module (no attempt is made to configure the module).
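The selection of a Module BCT return code may be sketched as below. The encodings are taken from the text; the order in which the checks are made and the `apply_configuration` callable are assumptions introduced for illustration.

```python
from enum import IntEnum
from typing import Callable


class BctReturnCode(IntEnum):
    VALID_BCT_CONFIGURATION_SUCCESSFUL = 0x0000
    INVALID_SLOT_INFORMATION = 0x0100
    VALID_BCT_CONFIGURATION_FAILURE = 0x0200
    INVALID_BCT = 0x0300


def module_bct_return_code(
    bct_valid: bool,
    slot_valid: bool,
    apply_configuration: Callable[[], bool],
) -> BctReturnCode:
    """Pick a return code; configuration is attempted only for a valid BCT."""
    if not bct_valid:
        # No attempt is made to configure the module.
        return BctReturnCode.INVALID_BCT
    if not slot_valid:
        return BctReturnCode.INVALID_SLOT_INFORMATION
    if apply_configuration():
        return BctReturnCode.VALID_BCT_CONFIGURATION_SUCCESSFUL
    return BctReturnCode.VALID_BCT_CONFIGURATION_FAILURE
```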
[0481] BMC commands specific to BMCs on pluggable modules including
VIOCs (such as SIMs, PMMs, FCMs, and OLBs) include Module Reset
VIOC. The Module Reset VIOC command (identifier 0x000E) causes the
BMC to reset a selected VIOC on the module (without resetting any
other elements) and is typically issued in response to a request by
a VIOC Driver to reset a VIOC. The parameters include a slot
number, and a VIOC number to select which VIOC on the module to
reset (such as 0 or 1). The return codes include VIOC Reset
Successful (0x0000), Invalid Slot Information (0x0100), Invalid
VIOC Number (0x0200), and VIOC Reset Failure (0x0300).
[0482] BMC commands specific to BMCs on pluggable modules having
system or application processing elements include Module Reset
Partition, Module Hold Reset, Module Release Reset, Module Boot,
Module Firmware Update, and Module Firmware Update Status. Such
modules include PMMs (having CPUs), NMs (having PCEs and TMs), FCMs
(having IOPs), and OLBs (having CPUs).
[0483] The Module Reset Partition command (identifier 0x0006)
causes the BMC to assert and then release reset for an entire
module or a partition of a module (such as a partition of a PMM).
If the module has been previously configured into partitions (by a
Module BCT command, for example), then the command operates on a
specified partition of the module. If the module is a partitionable
module (such as a PMM) and there has been no previous partitioning
of the module, then the entire module is reset and an error is
returned.
[0484] The parameters include a slot number and a partition
identifier. The associated return codes include Reset Partition
Successful (0x0000), Invalid Slot Information (0x0100), Invalid
Partition (0x0200), and Reset Partition Failure (0x0300). Reset
Partition Successful is returned when the partition identifier is
valid and reset has been successfully applied and released. Invalid
Slot Information is returned when the slot information is not valid
for the module (for example when the module is inserted in a
different slot than the command was intended for, or an incorrect
BMC received the command). Invalid Partition is returned when the
partition identifier is incorrect for the module. In some embodiments,
Invalid Partition is returned when the module has not been
previously partitioned (although the entire module is also
reset).
[0485] The Module Hold Reset command (identifier 0x0005) causes the
BMC to begin asserting reset to system and application processing
elements on the module, a selected partition of the module, or a
CPU sub-system on the module, and to continue asserting reset until
a command to release reset is received. If the module has not been
previously partitioned (or is not partitionable), then the entire
module (or CPU sub-system) is reset and continues to be reset. The
parameters include a slot number and a partition identifier. The
return codes include Hold Reset Successful (encoding 0x0000) for
indicating the partition identifier is valid (or ignored) and reset
has been successfully applied, Invalid Slot Information (0x0100),
Invalid Partition (0x0200), and Hold Reset Failure (0x0300).
[0486] The Module Release Reset command (identifier 0x0004) causes
the BMC to stop asserting reset to system and application
processing elements on the module, a selected partition of the
module, or a CPU sub-system on the module. Typically the Module
Release Reset command enables the module (or the selected partition
or CPU sub-system) to boot. It may be used, for example, when
directed to a PMM as in "Release Reset Flow" 1314S (FIG. 13B). The
parameters include a slot number and a partition identifier. The
return codes include Release Reset Successful (encoding 0x0000),
Invalid Slot Information (0x0100), and Release Reset Failure
(0x0200).
[0487] The Module Boot command instructs the BMC to power up,
reset, and release reset to system and application processing
elements on the module, a selected partition of the module, or a
CPU sub-system on the module. The Module Boot command typically
enables the module (or the selected partition or CPU sub-system) to
proceed from being not powered to a fully booted state without
additional BMC commands. The parameters include a slot number and a
partition identifier. The return codes include Boot Successful
(encoding 0x0000), Invalid Slot Information (0x0100), and Boot
Failure (0x0200). In some embodiments, intermediate events return
information as the module proceeds through various stages of
executing the Module Boot command.
[0488] The Module Firmware Update command (identifier 0xFFFE)
instructs the BMC to download and program firmware into
non-volatile memory (such as flash memory) on the module.
Downloading typically uses the Trivial File Transfer Protocol
(TFTP). The parameters include an IP address (in binary format) and
a number and list of file names. The return codes include Firmware
Update Successful (encoding 0x0000), indicating all of the
requested files have been downloaded and stored into the
non-volatile memory, and Firmware Update Failure (0x0100),
indicating otherwise.
[0489] The Module Firmware Update Status command (identifier
0xFFFF) instructs the BMC to provide information concerning the
most recent Module Firmware Update command. In some embodiments,
there are no parameters. Multiple return codes are provided in
response, including an overall status indicator, a stage indicator,
and a completion/error indicator. The overall status indicator
states include Success (encoding 0x0000) and Failure (0x0100). The
stage indicator states include Update Complete (0), Update
Downloading (1), and Updating Flash (2). The completion/error
indicator states include percent completion from 0% to 100%
(encodings 0x00 to 0x64), Update Successful (0x70), No TFTP Server
(0x71), File Not Found (0x72), Checksum Invalid (0x73), Bad Sector
Number (0x74), TFTP Connection Closed (0x75), and Canceled
(0x76).
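A requester might decode the stage and completion/error indicators as sketched below. The encodings come from the text; the function name and the wording of the percent-complete string are hypothetical.

```python
FIRMWARE_UPDATE_STAGES = {
    0: "Update Complete",
    1: "Update Downloading",
    2: "Updating Flash",
}

COMPLETION_STATES = {
    0x70: "Update Successful",
    0x71: "No TFTP Server",
    0x72: "File Not Found",
    0x73: "Checksum Invalid",
    0x74: "Bad Sector Number",
    0x75: "TFTP Connection Closed",
    0x76: "Canceled",
}


def describe_completion(code: int) -> str:
    """Turn a completion/error indicator into readable text."""
    if 0x00 <= code <= 0x64:
        # Encodings 0x00 to 0x64 carry percent completion directly.
        return f"{code}% complete"
    if code in COMPLETION_STATES:
        return COMPLETION_STATES[code]
    raise ValueError(f"unrecognized completion/error indicator 0x{code:02X}")
```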
[0490] Other BMC commands include Module Get VPD, Module SIM Fabric
Port, Module Power Up, Module Power Down, Module Get Sensors,
Module Get Sensor Information, Module Get Operational Status, and
Module Force Fault. The Module Get VPD command (identifier 0x0002)
causes the BMC to collect and report information describing the
capabilities of the module. The parameters include a slot number.
The return information includes several fields and structures,
including a status code, a table identifier, a count of table
entries, a variable number of table entries, an end marker tag, and
a checksum. The status codes include VPD Retrieval Successful
(0x0000) and VPD Retrieval Failure (0x0100). The table identifier
is an 8-bit field in some embodiments. The count of table entries
specifies the number of individual VPD table entries that follow
the count. The end marker tag (encoded as 0x79) marks the end of
the VPD table. The checksum is used to verify integrity of the
response data, and is an 8-bit field in some embodiments.
[0491] In some embodiments, VPD entries are compatible with those
described by the Conventional PCI v2.2 Specification, available
from http://www.pcisig.com, and hereby incorporated by reference
herein for all purposes. Each VPD table entry includes an entry
beginning marker tag (0x90) followed by a count of fields in the
entry and a variable number of fields as indicated by the count.
Each field in turn includes a field name (a 3-character string in
some embodiments), a field length, and a field value string having
a length as indicated by the field length. The general format of
the VPD table enables essentially unlimited information to be
provided by the BMC to the CM, as the format is not
restrictive.
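The VPD table layout described above may be walked with a short parser such as the following sketch. The marker tags (0x90 and 0x79), the 3-character field names, and the 8-bit table identifier and checksum come from the text. The one-byte widths assumed for the entry count, field count, and field length are not stated in the text, and neither is the checksum algorithm, so the checksum byte is returned without being verified.

```python
from typing import Dict, List, Tuple

ENTRY_BEGIN_TAG = 0x90
END_MARKER_TAG = 0x79


def parse_vpd_table(data: bytes) -> Tuple[int, List[Dict[str, bytes]], int]:
    """Return (table identifier, list of entries, checksum byte)."""
    table_id = data[0]
    entry_count = data[1]          # assumed one byte wide
    pos = 2
    entries: List[Dict[str, bytes]] = []
    for _ in range(entry_count):
        if data[pos] != ENTRY_BEGIN_TAG:
            raise ValueError(f"expected entry tag at offset {pos}")
        field_count = data[pos + 1]  # assumed one byte wide
        pos += 2
        fields: Dict[str, bytes] = {}
        for _ in range(field_count):
            name = data[pos:pos + 3].decode("ascii")
            length = data[pos + 3]   # assumed one byte wide
            value = data[pos + 4:pos + 4 + length]
            if len(value) != length:
                raise ValueError("truncated field value")
            fields[name] = value
            pos += 4 + length
        entries.append(fields)
    if data[pos] != END_MARKER_TAG:
        raise ValueError(f"expected end marker at offset {pos}")
    checksum = data[pos + 1]
    return table_id, entries, checksum
```

Because each field carries its own name and length, a reader can skip fields it does not recognize, which is consistent with the open-ended nature of the format noted above.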
[0492] The VPD returned may vary according to module and
embodiment. In some embodiments, and for some modules, VPD may
include descriptions of number and capabilities of system and
application processing elements present on or associated with the
module. Examples include number and frequency of CPUs included on
PMMs, PCEs and TMs included on NMs and included on daughter cards
coupled to NMs, IOPs included on FCMs, CPUs included on OLBs, and
CPUs and Accelerators included on daughter cards coupled to OLBs.
VPD may include memory size and organization on the module. VPD may
include MAC address information associated with the module, such as
a MAC address associated with a VIOC on the module.
[0493] VPD returned for SIM modules may indicate the presence and
capabilities of Mass Storage 412A (FIG. 4A), and information
concerning Primary Switch Fabric Module 180A. VPD returned for PMM
modules may indicate the presence and capabilities of FCI 413.5 and
FCI 413.5' (FIG. 4B). VPD returned for NM modules may describe
Interface 420 and IOP 421 (FIG. 4C), including bandwidth capacity
and physical interface type. VPD returned for FCM modules may
describe operational parameters associated with FCPs, such as FCP
423.4 (FIG. 4D). VPD returned for OLB modules may describe the
presence and capabilities of optional daughter cards or modules,
such as PCI sub-module 425 and HT sub-module 424 (FIG. 4E),
including descriptions of specific services or protocols
accelerated by the daughter elements. The aforementioned
module-specific VPD information may vary in specific details and
may be provided in various combinations, according to
embodiment.
[0494] The Module SIM Fabric Port command (identifier 0x000D)
informs the BMC of the physical fabric port number of the Primary
SIM (having an included Primary SCM), such as the fabric port
associated with Primary SCM-Fabric coupling 149A (FIG. 2) as
coupled to Primary Switch Fabric Module 180A. The parameters
include a slot number and a fabric port number. The fabric port
number corresponds to the fabric port number of the primary (or
master) SIM of the chassis (such as 4 or 5). The return codes
include SIM Fabric Port Success (0x0000) and SIM Fabric Port
Failure (0x0100). In some embodiments, SIM Fabric Port Failure is
returned when the BMC fails to register the SIM fabric port.
[0495] The Module Power Up and Module Power Down commands
(identifiers 0x0003 and 0x0007, respectively) instruct the BMC to
apply and remove, respectively, operating power for the remainder
of the module. The Module Power Up command leaves reset to system
and application processing elements of the module asserted. The
Module Power Down command optionally fails unless the module (such
as a PMM) has no booted, active, or running partitions, or has no
active heartbeat established, according to embodiment. The
parameters include a slot number. The return codes include Success
(0x0000), Invalid Slot Information (0x0100), and Failure
(0x0200).
[0496] The Module Get Sensors command (identifier 0x000A) causes
the BMC to return information regarding sensors available on the
module, such as the number and types of sensors. The parameters
include a slot number. The return information includes a status
code, a count of sensors available, and a variable number of sensor
identifiers. The status codes include Success (0x0000), Invalid
Slot Information (0x0100), and Failure (0x0200). The count of
sensors specifies the number of sensors available on the module and
individually identified by the information following the count.
Each of the sensor identifiers is a 32-bit integer in some
embodiments.
[0497] The Module Get Sensor Information command (identifier
0x000B) causes the BMC to return information about a selected
sensor or list of sensors, as specified by the command. The
parameters include a slot number, a count of sensors requested, and
a variable number of sensor identifiers. The count of sensors
requested specifies the number of sensors for which information is
requested and individually identified by the information following
the count. Each of the sensor identifiers is a 32-bit integer in
some embodiments. The return information includes a status code,
and a sensor information structure for the sensors selected by the
sensor identifiers. The status codes include Success (0x0000),
Invalid Slot Information (0x0100), and Failure (0x0200). In some
embodiments, sensor information structures are compatible with IPMI
v1.5, available from
ftp://download.intel.com/design/servers/ipmi/IPMIv1_5rev1_1.pdf,
and hereby incorporated by reference herein for all purposes.
Each of the sensor information structures includes a sensor
identifier (32 bits, for example), a length specification (16 bits,
for example) equal to the length of the following name, value, and
type fields (including nulls), a name string, a value string
(representing the current value of the sensor), and a data type
field for the value string.
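One sensor information structure may be decoded as in the sketch below. The 32-bit identifier and 16-bit length follow the example sizes in the text. Big-endian byte order, ASCII encoding, and the representation of the name, value, and type fields as three null-terminated strings are assumptions, suggested only by the phrase "including nulls".

```python
import struct
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SensorInfo:
    sensor_id: int
    name: str
    value: str
    data_type: str


def parse_sensor_info(data: bytes, offset: int = 0) -> Tuple[SensorInfo, int]:
    """Decode one structure; return it with the offset of the next one."""
    sensor_id, length = struct.unpack_from(">IH", data, offset)
    start = offset + 6
    body = data[start:start + length]
    if len(body) != length:
        raise ValueError("truncated sensor information structure")
    parts = body.split(b"\x00")
    # Three null-terminated strings split into three parts plus an
    # empty trailing part.
    if len(parts) != 4 or parts[3] != b"":
        raise ValueError("expected three null-terminated strings")
    name, value, data_type = (p.decode("ascii") for p in parts[:3])
    return SensorInfo(sensor_id, name, value, data_type), start + length
```

Returning the next offset allows a caller to iterate over the variable number of structures in a single response.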
[0498] The Module Get Operational Status command (identifier
0x000C) instructs the BMC to return the operational status of a
module or a selected partition of a previously partitioned module
(such as a PMM). Typically the operational status is determined by
the presence of a heartbeat between the BMC and an OS (or BIOS)
executing on the module or the selected partition. The parameters
include a slot number and a partition identifier. The return
information includes a status code and an operational code. The
status codes include Get Operational Status Successful (0x0000),
Invalid Slot Information (0x0100), Invalid Partition (0x0200), and
Get Operational Status Failure (0x0300). The operational codes
include Down/Non-operational (0x0000) and Up/Operational
(0x0100).
[0499] In some embodiments, the Module Force Fault command
instructs the BMC to force the BMC Operational State Machine
associated with the module (or a selected partition of a module) to
transition to state F 1306 (FIG. 13A), and may be used when the CM
detects operational errors requiring the module to be failed. The
parameters may include a slot number and a partition identifier,
according to embodiment.
[0500] Other BMC commands are IPMI-compliant, and relate to
collecting and managing information in a System Event Log (SEL)
maintained by a BMC, and include Module Get SEL and Module Clear
SEL. The Module Get SEL command causes the BMC to provide selected
log entries from the associated SEL. The parameters include a slot
number, an offset, and a maximum count. The offset specifies a
starting point in the SEL from which the BMC is to return data, to
prevent resending older data. The maximum count specifies the
maximum number of entries to provide in the return information.
[0501] The return information includes several fields and
structures, including a status code, a count of returned log
entries, and a variable number of log entries. The status codes
include Get SEL Successful (0x0000), Invalid Slot Information
(0x0100), Invalid Offset (0x0200), and Get SEL Failure (0x0300).
The count of log entries specifies the number of individual log
entries that follow the count. Each returned log entry, in some
embodiments, includes a 16-byte field encoded according to an IPMI
standard (such as is described on page 308 of IPMI specification
Rev 1.1, incorporated herein by reference for all purposes).
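The use of the offset to avoid resending older entries may be sketched as follows. The `fetch` callable stands in for issuing a Module Get SEL command and receiving its response, and is hypothetical. Treating the offset as an index of log entries (rather than a byte position) is likewise an assumption.

```python
from typing import Callable, List, Tuple

GET_SEL_SUCCESSFUL = 0x0000
SEL_ENTRY_SIZE = 16  # bytes per log entry in some embodiments

# fetch(slot, offset, max_count) -> (status code, list of log entries)
Fetch = Callable[[int, int, int], Tuple[int, List[bytes]]]


def poll_sel(fetch: Fetch, slot: int, offset: int,
             max_count: int) -> Tuple[List[bytes], int]:
    """Fetch new log entries; return them with the next offset to use."""
    status, entries = fetch(slot, offset, max_count)
    if status != GET_SEL_SUCCESSFUL:
        raise RuntimeError(f"Module Get SEL returned 0x{status:04X}")
    if len(entries) > max_count:
        raise ValueError("more entries returned than requested")
    for entry in entries:
        if len(entry) != SEL_ENTRY_SIZE:
            raise ValueError("log entry is not 16 bytes")
    return entries, offset + len(entries)
```

A caller that stores the returned offset and passes it to the next call receives each log entry only once.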
[0502] The Module Clear SEL command (identifier 0x0009) causes the
BMC to clear all or selected log entries from the associated SEL,
according to embodiment. The parameters include a slot number. The
return codes include Clear SEL Successful (0x0000), Invalid Offset
(0x0200), and Clear SEL Failure (0x0300).
[0503] In some embodiments, the aforementioned command identifiers
vary according to target module. For example, a prefix may be
inserted identifying the module type (such as 0x0001 for SIMs,
0x0002 for PMMs, and 0x0003 for NMs, according to embodiment). In
some embodiments, the aforementioned return codes are 16-bit
values, and the status codes are 8-bit values. In some embodiments,
the slot number is 0 or 1 for PMMs, 2 or 3 for SIMs, 4, 5, or 6 for
NMs, and other values for other modules. In some embodiments, the
partition identifier is 0 or 1 to select a first or a second
partition of a PMM that is partitioned as a 2-way element. In some
embodiments, the partition identifier is optional, and is provided
only for a command directed to a PMM. In some embodiments (or
contexts, such as a PMM), the partition identifier is ignored
unless the module has been partitioned as a 2-way element.
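The command identifiers given in the preceding paragraphs, together with the optional module-type prefix and example slot ranges, may be collected as in the sketch below. Placing the prefix in the upper 16 bits of a 32-bit value is an assumption, since the text says only that a prefix may be inserted. The helper names are hypothetical.

```python
from enum import IntEnum


class BmcCommand(IntEnum):
    MODULE_GET_VPD = 0x0002
    MODULE_POWER_UP = 0x0003
    MODULE_RELEASE_RESET = 0x0004
    MODULE_HOLD_RESET = 0x0005
    MODULE_RESET_PARTITION = 0x0006
    MODULE_POWER_DOWN = 0x0007
    MODULE_CLEAR_SEL = 0x0009
    MODULE_GET_SENSORS = 0x000A
    MODULE_GET_SENSOR_INFORMATION = 0x000B
    MODULE_GET_OPERATIONAL_STATUS = 0x000C
    MODULE_SIM_FABRIC_PORT = 0x000D
    MODULE_RESET_VIOC = 0x000E
    MODULE_FIRMWARE_UPDATE = 0xFFFE
    MODULE_FIRMWARE_UPDATE_STATUS = 0xFFFF


MODULE_TYPE_PREFIX = {"SIM": 0x0001, "PMM": 0x0002, "NM": 0x0003}

# Example slot numbers from the text; other modules use other values.
EXAMPLE_SLOTS = {"PMM": (0, 1), "SIM": (2, 3), "NM": (4, 5, 6)}


def prefixed_command(module_type: str, command: BmcCommand) -> int:
    """Combine a module-type prefix with a command identifier.

    Assumes the prefix occupies the upper 16 bits of a 32-bit value.
    """
    return (MODULE_TYPE_PREFIX[module_type] << 16) | int(command)


def slot_is_valid(module_type: str, slot: int) -> bool:
    """Check a slot number against the example ranges."""
    return slot in EXAMPLE_SLOTS.get(module_type, ())
```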
[0504] Those of ordinary skill in the art will recognize that
specifics related to BMC event and command communication, such as
specific encodings, field sizes, and organization of structures,
may vary according to embodiment.
Chassis Manager Operation
[0505] This section provides additional information on selected
aspects of some embodiments of the CM. Typically the CM provides a
single source of chassis information to all other processes in the
system. It provides other processes with information such as
presence of modules, properties of the modules and status of the
modules. It also provides information about failure of modules and
changes in module configurations. To provide such detailed
information about each of the modules, the CM peers with the BMC on
each of the modules in a chassis and obtains vital data to maintain
a persistent database.
[0506] The CM may be considered to provide a window into an ES
system embodiment and an interface to users and operators to view
and modify various system level behaviors. Typically there is a
plurality of slots (10, for example) in an ES system chassis
embodiment. In some embodiments, each slot in the chassis is
enabled to accommodate only one type of pluggable module. In these
embodiments, the slot assignments and the module types in the
chassis are predefined.
[0507] At startup (or ES system boot) the CM performs various
initialization steps, including resetting values of global
variables, initializing an event library, and initializing the BMC
interface of the SCM the CM is executing on. Typical SCM
embodiments include a Mastership Module (MM), and the CM
initializes an interface of the MM. The CM then issues a Process
Initialized Event.
[0508] The initialization of the various interfaces triggers an
associated set of activities in the CM. The CM performs the
initialization functions and then enters a loop for listening to
events occurring in the system, such as those reported by the
Module BMC SW.
[0509] The MM is typically implemented in all or a portion of an
FPGA, according to embodiment, and provides various functions, also
according to embodiment. The functions may include an application
level heartbeat, and an identification of the slot in which the SCM
that the CM is executing on is inserted. Other functions may include
presence information of various modules inserted in the chassis,
notification of pluggable module insertion (such as that associated
with "Detect Module Insertion and Generate Presence Interrupt"
1201.2 of FIG. 12A), and notification of pluggable module removal.
Further functions may include various indications of whether or not
inserted modules are capable of powering up and powering down.
Further functions may enable failover from a Primary SCM to a
Redundant SCM (such as the Primary and Redundant SCMs 140 of FIG.
1A), either manually via a user or operator request, or
automatically as a result of a system failure.
[0510] In some embodiments, the MM includes a Mastership state
machine. In operation, the CM indicates it has booted, and in
response the state machine transitions from a Waiting state to a
Booted state. In the Booted state, the state machine determines
whether the SCM the CM is executing on is a Primary SCM or a
Redundant SCM, and transitions to a Primary or Redundant state
accordingly.
[0511] In the Booted state, the SCM is determined to be the Primary
SCM if there is currently no Primary SCM, and otherwise it is the
Redundant SCM. If the determination of Primary versus Redundant
roles is not possible, then an error is recognized and the state
machine transitions to an Error state. If there is a failover
(either manually or automatically) then a transition is made from
the Redundant to the Primary state, and the SCM becomes a new
Primary SCM and the CM changes roles from Redundant to Primary
accordingly.
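The Mastership state machine transitions described above may be sketched in software as follows, although the MM is typically implemented in an FPGA. The class and method names are hypothetical, and the use of `None` to represent an undeterminable role is an assumption made for illustration.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    WAITING = auto()
    BOOTED = auto()
    PRIMARY = auto()
    REDUNDANT = auto()
    ERROR = auto()


class MastershipStateMachine:
    def __init__(self) -> None:
        self.state = State.WAITING

    def _require(self, expected: State) -> None:
        if self.state is not expected:
            raise RuntimeError(f"invalid in state {self.state.name}")

    def cm_booted(self) -> None:
        """The CM indicates it has booted: Waiting -> Booted."""
        self._require(State.WAITING)
        self.state = State.BOOTED

    def resolve_role(self, primary_exists: Optional[bool]) -> None:
        """Booted -> Primary, Redundant, or Error.

        primary_exists is None when the role cannot be determined.
        """
        self._require(State.BOOTED)
        if primary_exists is None:
            self.state = State.ERROR
        elif primary_exists:
            self.state = State.REDUNDANT
        else:
            self.state = State.PRIMARY

    def failover(self) -> None:
        """Manual or automatic failover: Redundant -> Primary."""
        self._require(State.REDUNDANT)
        self.state = State.PRIMARY
```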
[0512] During initialization the CM also sets a watchdog time
interval in the MM, corresponding to a heartbeat for SW executing
on the SCM. The CM sets a watchdog bit at regular intervals
(shorter than the watchdog time interval). If the CM is unable to
set the watchdog bit within the timeout of the watchdog interval,
then the MM assumes that the SW executing on the SCM is locked up
and the SCM becomes unavailable. If the SCM is the Primary SCM,
then an automatic failover occurs, and the Redundant SCM becomes a
new Primary SCM. If the SCM was the Redundant SCM, then the SCM is
no longer available for failover, and there is no longer a
Redundant SCM.
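The watchdog behavior may be illustrated with the sketch below, which passes timestamps explicitly rather than reading a clock so that it can be exercised deterministically. The class, the function, and the returned strings are hypothetical; an actual MM would implement the timer in hardware.

```python
class Watchdog:
    """Tracks whether the CM has set the watchdog bit in time."""

    def __init__(self, interval: float, now: float = 0.0) -> None:
        self.interval = interval
        self.last_set = now

    def set_bit(self, now: float) -> None:
        """Called by the CM at intervals shorter than the timeout."""
        self.last_set = now

    def expired(self, now: float) -> bool:
        return now - self.last_set > self.interval


def on_watchdog_expiry(scm_is_primary: bool) -> str:
    """Describe the consequence of a lockup on the given SCM."""
    if scm_is_primary:
        return "failover: the Redundant SCM becomes the new Primary SCM"
    return "the Redundant SCM is no longer available for failover"
```

For example, with a 5-second interval and the bit set every 2 seconds, `expired` remains false; if the CM stops setting the bit, `expired` becomes true once more than 5 seconds have elapsed since the last set.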
[0513] When an SCM becomes a Primary SCM, the CM reads the
identification of the SCM slot from the MM and stores it in an
internal variable. The CM then obtains the presence information
from the MM (such as by reading selected MM registers, in some
embodiments) and determines the modules that are present in the
chassis. The CM then populates a database of modules (such as
Primary CIM Instance Repository 917.P of FIG. 9C) and attempts to
set up communication channels with BMCs on each of the modules that
were indicated as being present.
[0514] When a module is added or removed from the chassis, the MM
detects this change and notifies the CM via an event. The CM
receives the event, determines the affected slot, and carries out
any necessary actions as determined by the specific module
involved. There is also a notification when the SCM has changed
from Redundant to Primary (such as during failover processing).
[0515] In some embodiments, the CM maintains a map of the slots in
the chassis. In some embodiments, the slots are restricted to
selected modules (for example, PMMs may only be inserted in slots 0
or 1, SIMs in slots 2 or 3, NMs in slots 4, 5, or 6, and so forth
according to embodiment). In some embodiments, the map includes
information concerning the type of module that may be inserted in
each slot according to the restrictions. The module type
information may vary according to any combination of product type,
chassis type, or other similar customization information, according
to various embodiments.
[0516] The CM attempts to establish a TCP connection for each
module in the chassis by issuing a connect request to the BMC on
each respective module. In some embodiments, the requests are issued
without blocking, and the results arrive asynchronously. As each connection is
established, the CM typically requests VPD for the corresponding
module (using a Module Get VPD command, for example). Returned
information arrives via a corresponding BMC event, and is processed
and stored in the module database. The information is used, for
example, to determine a module type and various properties
associated with each respective module.
[0517] Once the VPD information has been collected and processed
into the database, the CM issues module presence events to any
other processes that may be listening for the module presence
events. In some embodiments, the presence is published (i.e., module
presence events generated) only if the VPD information is obtained.
If there is a failure in retrieving the VPD data, then the module
is considered of an unknown or unrecognized type.
[0518] The CM then collects other information such as module
properties, sensor properties and anything else that may be
necessary for CM and related functions. The CM may also poll the
SEL maintained by the BMC to determine if there were any new system
events logged. System events in the SEL may also be dumped into a
system log file along with appropriate information to identify
sources of the dumped system events.
[0519] When a module is prepared to boot, the CM may initiate
booting by issuing a command (such as a Module Boot or a Module BCT
command) to the BMC of the module. The CM also initiates module
resets, reloads, and other related operations by issuing
corresponding BMC commands. The various commands from the CM to the
BMCs may be results of manual user input or automatic provisioning
or configuration processing.
[0520] The CM stores module information (such as presence, sensor
values, and so forth) in the database. Thresholds and policies
relating to these values may also be stored in the database, and in
some embodiments are implemented as queries having corresponding
actions.
[0521] In some embodiments, booting of some pluggable modules that
include system or application processing elements (such as PMMs,
NMs, FCMs, and OLBs) includes providing one or more data images to
the booting module. In some embodiments, after the CM has received
VPD from the module, the CM updates a Dynamic Host Configuration
Protocol (DHCP) configuration file and creates or updates a
Pre-boot eXecution Environment (PXE) configuration file for the
module. The CM then restarts a DHCP daemon and issues a BMC command
to boot the module. The module subsequently issues a DHCP request,
and the DHCP daemon responds with IP address and PXE configuration
information, according to the updates the CM has performed. The
module then requests a kernel image and RAM disk image via TFTP,
the images are transferred, and the module boots using the
images.
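The configuration updates made by the CM before booting a module may be sketched as text generation. The text does not name a particular DHCP daemon or PXE boot loader; the sketch assumes an ISC dhcpd-style host declaration and a PXELINUX-style configuration file, and all function names, host names, addresses, and file names are hypothetical.

```python
def dhcp_host_entry(name: str, mac: str, ip: str, tftp_server: str,
                    boot_file: str = "pxelinux.0") -> str:
    """Build an ISC dhcpd-style host declaration for one module."""
    return (
        f"host {name} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f"  next-server {tftp_server};\n"
        f'  filename "{boot_file}";\n'
        f"}}\n"
    )


def pxelinux_config_name(mac: str) -> str:
    """PXELINUX looks up per-host files named 01-<mac with dashes>."""
    return "01-" + mac.lower().replace(":", "-")


def pxelinux_config(kernel: str, ramdisk: str) -> str:
    """Build a PXELINUX-style file naming the kernel and RAM disk."""
    return (
        "DEFAULT module\n"
        "LABEL module\n"
        f"  KERNEL {kernel}\n"
        f"  APPEND initrd={ramdisk}\n"
    )
```

In this sketch the MAC address obtained from the VPD keys both the DHCP host declaration and the PXE configuration file name, so a module added to a live system needs only one new entry of each kind before the DHCP daemon is restarted.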
[0522] Since the DHCP configuration file is accessed during module
boot, and modules may be dynamically configured in or added to live
systems and then booted, the CM must dynamically alter the DHCP and
PXE information as module configuration changes, or as modules are
inserted into the chassis. Additionally, in these embodiments, the
DHCP configuration file may also include entries corresponding to
each of the bootable modules. In some embodiments, the VPD includes
a MAC address, and the CM may determine some portion of DHCP and
PXE configuration information based in part on the MAC address. In
some embodiments, the configuration information may also be based
in part on processing of an SCF when performing an install server
command (such as "Installing" 1103 of FIG. 11).
[0523] In some embodiments, the CM stores portions of configuration
data and portions of runtime information, including portions of the
database, in the form of CIM instances, providing a standard
interface for querying the information and other operational
parameters. In some embodiments, Chassis Manager 533 may issue
events to Repository Manager 535 to create CIM instances
corresponding to modules inserted in the chassis, for example when
VPD is returned and processed. Additional CIM instances may be
created to represent components of inserted modules. In some
embodiments, creation and maintenance of the CIM instances (or
instances of classes) may be performed in a library form in a
platform process (such as a process associated with Platform
Manager 531).
[0524] In some embodiments, commands or requests originating from
CLI and GUI operations trigger the CM to carry out operations
necessary to perform the requested command. The operations may
include accessing the database to view CIM instances (corresponding
to modules or components in the chassis), and delivering BMC
commands as needed. The operations may further include updating CIM
instances as module state changes, as indicated by received BMC
events. Some of the BMC events may be generated due to processing
the BMC commands, and may indicate success or failure of a command.
This enables a user to determine success or failure of a request by
requesting a display of appropriate module status information.
[0525] In some embodiments, asynchronous requests are posted to the
CM without blocking, and a requestor determines success or failure
by specifically requesting status from the CM. In some embodiments,
synchronous requests are posted to the CM with blocking, and wait
for status returned from the CM before completion.
[0526] In some embodiments, one or more processes associated with
any combination of Enterprise Manager 530 and Platform Manager 531
may require portions of module sensor information (and portions of
other monitoring information) to be visible via CIM instances. In
some embodiments, the CM acts as the instance provider for some or
all of the sensor and monitoring instances, creating the instances
as corresponding information is received (perhaps in response to
commands) from the modules. In some embodiments, all
non-configuration type CIM instances are managed in the CM context
(the CM performs as the associated instance provider) and all
configuration type CIM instances are managed by Platform Manager
531.
[0527] The embodiments described in the Chassis Manager Operation
section are illustrative only, as those of ordinary skill in the
art will recognize that selected CM functions may be performed
elsewhere while still under the direct control of the CM.
Additionally, some of the functions may be modified, added, or
deleted, according to embodiment.
Rapid IP/MAC Address Takeover
[0528] Layer-3 and above networking protocols typically identify
and name sources, destinations, and resources using one or more IP
addresses, and the IP addresses are mapped to associated MAC
addresses while performing various switching and routing functions.
A pluggable module of an ES system embodiment is typically
associated with (or assigned) one or more IP addresses, such as
Public IP Address 604.1 (FIG. 6A), and one or more MAC addresses,
such as MAC Address 603.1. IP and MAC addresses are typically
assigned or configured when a server is provisioned (see the Server
Operational States section, elsewhere herein). For modules having
VIOCs implementing one or more VNICs, appropriate values are
written into each of the respective public IP and MAC address
registers corresponding to the assignments.
[0529] Layer-3 forwarding information (including correspondence
between selected IP addresses and respective MAC addresses) is
maintained by system management, controlplane, and load balancing
processes (also referred to collectively as "SCM processes" since
they are executed by an SCM that is part of a SIM). In some
embodiments, the SCM processes provide portions of the layer-3
forwarding information to pluggable modules, typically as L3 FIB
updates. For example, in some embodiments NMs include search
engines accessing IP to MAC forwarding information that is managed
by the SCM processes, and in some embodiments VIOCs access
forwarding information (stored in coupled TCAM/SRAMs) that is also
managed by the SCM processes.
[0530] Layer-2 networking protocols typically communicate source
and destination information using MAC addresses, and pluggable
modules in an ES system embodiment typically map each pluggable
module MAC address to a corresponding fabric port address. The
correspondence between module MAC addresses and fabric port
addresses is maintained by the SCM processes, according to
embodiment, and may be modified when a server is provisioned. The
MAC address to fabric port address mapping (or forwarding)
information is typically provided to pluggable modules, typically
as L2 FIB updates. For example, in some embodiments the NM search
engines access and manage a cache of MAC to fabric port forwarding
information that is provided by the SCM processes. In some
embodiments, VIOCs access and manage a cache of similar forwarding
information (such as MACFIB information as discussed in the
TCAM/SRAM Lookup State section, elsewhere herein) that is also
managed by the SCM processes.
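As an illustrative sketch only, the relationship between the layer-3
and layer-2 forwarding information may be modeled as two lookups
applied in sequence. The table names, the use of Python dictionaries,
and the symbolic address values are assumptions made for illustration
and do not describe any particular embodiment.

```python
# Illustrative model only; table names and layout are assumptions.
l3_fib = {"IP1": "MAC1", "IP2": "MAC2"}        # IP address -> MAC address
l2_fib = {"MAC1": "FPort0", "MAC2": "FPort1"}  # MAC address -> fabric port


def resolve_fabric_port(ip, l3, l2):
    """Resolve an IP address to a fabric port address via both FIBs.

    Returns None when either the layer-3 or the layer-2 lookup misses.
    """
    mac = l3.get(ip)
    if mac is None:
        return None
    return l2.get(mac)


port_for_ip1 = resolve_fabric_port("IP1", l3_fib, l2_fib)
port_for_unknown = resolve_fabric_port("IP9", l3_fib, l2_fib)
```

In this model a change to either table alters the fabric port reached
for a given IP address, which is why address takeover may require
updates to layer-3 information, layer-2 information, or both.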
[0531] In some embodiments, server provisioning and management
functions enable detection of a failed module, identification of a
standby module (already available in the system), and automatic
failover replacement of the failed module by the standby module.
According to various embodiments, any combination of the IP address
and the MAC address assigned to the failed module is re-assigned
to the standby module. Consider an illustrative scenario such that
before the failure, the module that is going to fail is associated
with a first IP address and a first MAC address, and such that the
standby module is associated with a second IP address and a second
MAC address. Also consider that there may be an optional mapping
between a service provided by the module that is going to fail and
a service IP address, such that before the failure the service IP
address is equal to the first IP address.
[0532] In a first group of takeover embodiments, where only the IP
address is re-assigned, the standby module is associated with the
first IP address (replacing or "taking over" the first IP address)
as part of performing the module failover. The standby module
remains associated with the second MAC address, and thus the first
IP address should no longer be resolved to the first MAC address,
but to the second MAC address. In some implementations, an Address
Resolution Protocol (ARP) compatible address discovery mechanism is
used to discover the new mapping when the remapped IP address is
referenced. The new mapping is then propagated to the layer-3
forwarding information tables (such as those accessed by the NM
search engine and the VIOCs, according to embodiment). In other
implementations, the SCM processes intercede during the
ARP-compatible processing, recognizing a "local" IP address and
providing a corresponding local MAC address without overheads
typically associated with ARP-compatible processing. Local IP
addresses include IP addresses allocated to pluggable modules (such
as SIMs, PMMs, NMs, FCMs, and OLBs) within an ES system or within
an ES chassis. In still other implementations, the SCM processes
actively update the new mapping in the layer-3 forwarding
information tables upon the replacement event, irrespective of if
or when the remapped IP address is referenced. Since the MAC
addresses are unchanged in the first group of embodiments, the
layer-2 forwarding information (such as mappings to fabric port
addresses) is also unchanged. If there is a mapping between the
service and an associated service address, then since the standby
module has been assigned the first IP address, no change in the
service address is made.
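The interception of ARP-compatible processing for local IP addresses
described above may be sketched as follows. This is a minimal model
under stated assumptions: the function name, the returned source tag,
and the arp_discover callable (standing in for conventional
ARP-compatible discovery) are hypothetical and are introduced only
for illustration.

```python
def resolve_mac(ip, local_l3_fib, arp_discover):
    """Resolve an IP address to a MAC address.

    Local IP addresses are answered from the SCM-managed table without
    invoking discovery. Other addresses fall back to arp_discover, a
    hypothetical callable standing in for ARP-compatible discovery.
    """
    mac = local_l3_fib.get(ip)
    if mac is not None:
        return mac, "local"
    return arp_discover(ip), "arp"


# After an IP takeover, IP1 resolves to the standby module MAC2.
local_l3_fib = {"IP1": "MAC2"}
external_hosts = {"IP9": "MAC9"}  # hypothetical external host

local_answer = resolve_mac("IP1", local_l3_fib, external_hosts.get)
external_answer = resolve_mac("IP9", local_l3_fib, external_hosts.get)
```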
[0533] In a second group of embodiments, where only the MAC address
is re-assigned, the standby module is associated with the first MAC
address (taking over the first MAC address) as part of performing
module failover. The second group of embodiments is typically used
in conjunction with local service IP addresses (i.e. the service
address is not visible external to the ES system), or in
conjunction with a proxy, or in circumstances where changes to the
service address are inconsequential. The standby module remains
associated with the second IP address, and thus the mapping between
the first IP address and the first MAC address is no longer valid,
and a new mapping between the second IP address and the first MAC
address is created. As in the first group of embodiments, some
implementations use the ARP-compatible mechanism and some
implementations use the active update of the new mapping. Since the
MAC address is changed, the layer-2 forwarding information is also
changed accordingly, and the SCM processes actively propagate new
MAC to fabric port address mapping information to the pluggable
modules. If there is a mapping between the service and an
associated service address, then since the standby module is
assigned the second IP address, the service address is changed to
the second IP address. Some implementations perform passive
discovery of this new mapping via the ARP-compatible mechanism and
some implementations use the active updating of the new
mapping.
[0534] In a third group of embodiments, where both the IP and the
MAC addresses are re-assigned, the standby module is associated
with the first IP address and the first MAC address as part of
performing module failover. The mapping between the first IP
address and the first MAC address remains valid, however the
layer-2 mapping between the first MAC address and the associated
fabric port is updated, and the associated layer-2 forwarding
information is changed by active propagation to the pluggable
modules. If there is a mapping between the service and an
associated service address, then since the standby module has been
assigned the first IP address, no change in the service address is
made.
[0535] FIG. 14 illustrates a conceptual view of selected aspects of
embodiments of IP and MAC address failover data structures and
associated operations, including HW elements 1404 and IP/MAC
Address and Forwarding Chart 1405. Three techniques are
illustrated, corresponding to one embodiment of each of the
aforementioned three groups of embodiments.
[0536] For clarity, the illustrated HW elements include only
selected portions of an ES system embodiment, Primary Switch Fabric
Module 180A providing communication between included modules NM
130B, Primary SCM 140A, PMM 150A, and PMM 150B. The NM includes
Search Engine 1406 to search state information included on the NM
to perform layer-3 forwarding functions, including supplying a
forwarding MAC address for a provided IP address. The SCM executes
the SCM processes. PMM 150A illustrates a "failure" PMM, in other
words a PMM that is initially functioning properly, but then
becomes non-functional. PMM 150B illustrates a "standby" PMM, in
other words a PMM that is initially unused (or spare), but is later
used to replace the failed PMM. Each of the PMMs includes a VIOC
(301.5 and 301.5B) and a TCAM/SRAM (403.5 and 403.5B) accessed in
part for layer-2 and optionally for layer-3 forwarding functions,
according to various implementations.
[0537] IP/MAC Address and Forwarding Chart 1405 illustrates address
and forwarding information before PMM 150A fails, and corresponding
information after PMM 150B has replaced PMM 150A, for each of the
three techniques. The chart is organized in rows and columns.
"Initial MAC/IP" column 1410 shows information before the failure,
and "Technique 1 MAC/IP" column 1411, "Technique 2 MAC/IP" column
1412, and "Technique 3 MAC/IP" column 1413 show final information
after failure processing for the three techniques.
[0538] Failure PMM Address Row 1420A and Standby PMM Address Row
1420B show IP and MAC address information stored in VIOCs 301.5 and
301.5B included in PMMs 150A and 150B respectively, for the initial
and final states. More specifically, the failure and standby IP
address information are stored in respective instances of Public IP
Address 604.1 (FIG. 6A), corresponding to VNICs implemented in
VIOCs 301.5 and 301.5B. The failure and standby MAC addresses are
stored in respective instances of MAC Address 603.1.
[0539] MAC to Fabric Port Forwarding Rows 1430 show destination
MAC address to destination fabric port address forwarding
information stored in TCAM/SRAMs 403.5 and 403.5B and referenced by
VIOCs 301.5 and 301.5B respectively. More specifically, 1430 shows
key and result pairs as described in association with FIG. 8B. For
example, instances of Egress Key 801 are programmed with MAC
addresses as shown in 1430 (MAC1 and MAC2), and corresponding
Unicast Result 803 instances are programmed with DstFabAddr 811 as
shown in 1430 (FPort0 and FPort1). Typically identical information
is stored in TCAM/SRAMs 403.5 and 403.5B, such that both VIOC 301.5
and VIOC 301.5B have access to the same mappings. IP to Fabric Port
Forwarding Rows 1431 show destination IP address to destination
fabric port address forwarding information referenced by Search
Engine 1406. Service Address Row 1432 shows an IP address
associated with a service provided by PMM 150A before failing, and
by PMM 150B after completion of failover processing. In some
implementations the IP address to service mapping of 1432 is also
referenced by Search Engine 1406.
[0540] In the subsequent discussion the terminology "address(es)
stored in PMM" is shorthand for "address(es) stored in a VNIC
implemented in a VIOC included in PMM". The shorthand terminology
is meant to refer to storage in an instance of Public IP Address
604.1 for an IP address, and to storage in an instance of MAC
Address 603.1 for a MAC address.
[0541] Initially, as illustrated by the chart ("Initial MAC/IP"
column 1410), the IP address stored in PMM 150A (the PMM that is to
fail) is IP1, and the MAC address stored in PMM 150A is MAC1. The
corresponding initial addresses stored in PMM 150B (the standby
PMM) are IP2 and MAC2. The initial MAC address to fabric port
forwarding information stored in the TCAM/SRAMs of both PMM 150A
and PMM 150B associates MAC address MAC1 (of PMM 150A) with fabric
port 0 (FPort0, corresponding to slot 0), and MAC2 (of PMM 150B) is
associated with fabric port 1 (FPort1, corresponding to slot 1).
The initial IP to fabric port address forwarding information
referenced by Search Engine 1406 associates IP1 to FPort0 and IP2
to FPort1. The initial mapping for the service is to IP address IP1
(that of PMM 150A).
[0542] After a failover processed according to the first technique
(an IP takeover as in the first of the aforementioned groups of
embodiments), the IP address stored in PMM 150B (the standby PMM
that has replaced the failed PMM) is set to the value of the IP
address previously stored in PMM 150A (the failed PMM). This is
illustrated by PMM_B IP address 1450 (of "Technique 1 MAC/IP"
column 1411) having the value IP1. The IP and MAC address
information stored in the VIOC of PMM 150A (the failed PMM) is no
longer relevant, as the module is no longer being used. The IP to
fabric port address forwarding information has changed, since the
replacement module has taken on the IP address of the failed module
without also taking on the fabric port address of the failed module
(i.e. the modules remain in their original slots). This is
illustrated by IP to fabric port address entry 1454 having the
value FPort1. The MAC address to fabric port forwarding and service
IP address mapping information are not changed (see the
intersection of rows 1430 and 1432, respectively, with column
1411), as the initial mappings remain applicable. Note that the MAC
address to fabric port forwarding information previously associated
with MAC1 is no longer valid, as the MAC1 address is no longer
being used.
[0543] After a failover processed according to the second technique
(a MAC takeover as in the second of the aforementioned groups of
embodiments), the MAC address stored in PMM 150B (the replacement
PMM) is set to the value of the MAC address previously stored in
PMM 150A (the failed PMM). This is illustrated by PMM_B MAC address
1451 (of "Technique 2 MAC/IP" column 1412) having the value MAC1.
As in the first technique, the IP and MAC address information
stored in PMM 150A is no longer relevant, as the module is no
longer being used. The MAC address to fabric port forwarding
information is changed, since the replacement PMM has a new MAC
address but has remained inserted in the same slot. This is
illustrated by MAC address to fabric port address entry 1455 (of
"Technique 2 MAC/IP" column 1412) having the value FPort1. With
reference to the intersection of column 1412 with row 1430, the MAC
address to fabric port forwarding information associated with MAC2
is no longer valid, and the MAC1 address is now associated with a
different fabric port address. The IP to fabric port address
forwarding associated with the IP address of the failed module is
now invalid. The service IP address mapping has changed, since the
replacement module is known by a different IP address than the
failed module. This is illustrated by service IP address 1456
having the value IP2.
[0544] After a failover processed according to the third technique
(both IP and MAC address takeover as in the third of the
aforementioned groups of embodiments), the IP and MAC addresses
stored in PMM 150B (the replacement PMM) are set to the
corresponding values previously stored in PMM 150A (the failed
PMM). This is illustrated by PMM_B IP address 1452 (of "Technique 3
MAC/IP" column 1413) having the value IP1, and PMM_B MAC address
1453 having the value MAC1. As in the first and second techniques,
the IP and MAC address information stored in PMM 150A is no longer
relevant, as the module is no longer being used. As in the second
technique, the MAC address to fabric port forwarding information is
changed, as illustrated by MAC to fabric port address entry 1457
having the value FPort1. With reference to the intersection of
column 1413 with row 1430, the MAC address to fabric port
forwarding information associated with MAC2 is no longer valid, and
the MAC1 address is now associated with a different fabric port
address. As in the first technique, the IP to fabric port address
forwarding is changed, as illustrated by IP to fabric port address
entry 1458 having the value FPort1. The service IP address mapping
information associated with IP1 is not changed, as the initial
mappings remain applicable.
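The three techniques may be compared with a small model of the chart
of FIG. 14. The sketch below is illustrative only. The dictionary
layout, the function name take_over, and the technique labels "ip",
"mac", and "both" are assumptions. Entries for addresses that are no
longer in use are left in place rather than removed, whereas the
discussion above describes such entries simply as no longer valid.

```python
import copy

# Initial state per "Initial MAC/IP" column 1410 (layout is assumed).
INITIAL = {
    "modules": {
        "PMM_A": {"ip": "IP1", "mac": "MAC1", "fport": "FPort0"},
        "PMM_B": {"ip": "IP2", "mac": "MAC2", "fport": "FPort1"},
    },
    "mac_to_fport": {"MAC1": "FPort0", "MAC2": "FPort1"},
    "ip_to_fport": {"IP1": "FPort0", "IP2": "FPort1"},
    "service_ip": "IP1",
}


def take_over(state, technique, failed="PMM_A", standby="PMM_B"):
    """Return a new state in which the standby replaces the failed module."""
    if technique not in ("ip", "mac", "both"):
        raise ValueError("technique must be 'ip', 'mac', or 'both'")
    new = copy.deepcopy(state)
    old = new["modules"].pop(failed)  # failed module is no longer used
    repl = new["modules"][standby]    # standby stays in its original slot
    if technique in ("ip", "both"):
        repl["ip"] = old["ip"]
    if technique in ("mac", "both"):
        repl["mac"] = old["mac"]
    # Forwarding entries for the addresses now held by the standby
    # module point at the standby module's fabric port.
    new["mac_to_fport"][repl["mac"]] = repl["fport"]
    new["ip_to_fport"][repl["ip"]] = repl["fport"]
    # The service follows whichever IP address the standby module holds.
    new["service_ip"] = repl["ip"]
    return new


technique_1 = take_over(INITIAL, "ip")
technique_2 = take_over(INITIAL, "mac")
technique_3 = take_over(INITIAL, "both")
```

In this model only the second technique changes the service IP
address, and only the first technique leaves the MAC address to
fabric port forwarding information untouched.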
[0545] FIG. 15 illustrates a flow diagram of an embodiment of rapid
IP address takeover in a context of replacing a failed module with
a standby module. An IP address originally associated with a first
MAC address (corresponding to the failed module), is re-associated
with a second MAC address (corresponding to the standby module).
Although failover processing that directs the re-association is
typically executed outside of the standby module, the
re-association is often described as the standby module "taking
over" the IP address from the failed module, and corresponds to the
aforementioned first group of embodiments.
[0546] Processing begins at "Start" 1501, and then flows to "Detect
Failed Module" 1510 upon determination that a module is no longer
functional (such as PMM 150A as shown in FIG. 14). Flow then
proceeds to "Identify Replacement Module" 1511 to determine a
standby module to serve in place of the failed module (such as PMM
150B replacing PMM 150A). Processing continues at "Determine
Replacement MAC Address" 1512 where the MAC address of the standby
module is ascertained. This may be performed by consulting
appropriate MAC address assignment or allocation tables maintained
by the SCM processes, by reading state managed by the module (such
as an instance of MAC Address 603.1), or by other similar
mechanisms, according to embodiment. Note that this operation is
distinct from determining an IP to MAC address mapping, as there is
no specific IP address involved in "Determine Replacement MAC
Address" 1512.
[0547] The standby module MAC address, and its correspondence to
the IP address previously associated with the failed module, is
made known throughout the SCM processes by updating a master
Layer-3 FIB table ("Update Master L3 FIB" 1513). Flow continues to
"Update Module L3 FIBs" 1514, where the correspondence between the
IP address and the standby module MAC address is actively
disseminated to module tables (such as forwarding information
consulted by Search Engine 1406), replacing the stale
correspondence to the failed module MAC address. Typically the SCM
processes communicate with interface management processes that in
turn provide updates to search engine look up state via the switch
fabric module. This contrasts with a more passive replacement of IP
to MAC correspondence information for IP addresses external to an
ES system (such as Client 103 of FIG. 1A) via Address Resolution
Protocol (ARP) requests. Note: as ARP responders, the SCM processes
reply to subsequent ARP requests for the re-associated IP address
with the standby module MAC address. Processing is then complete at
"End" 1598.
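The flow of FIG. 15 may be sketched as follows. The sketch is a
simplified model and not an implementation of any embodiment. The
function names, the mac_table allocation table, and the
representation of the master and module L3 FIBs as dictionaries are
assumptions. Detection of the failed module and identification of
the standby module (1510 and 1511) are taken as inputs.

```python
def rapid_ip_takeover(failed, standby, mac_table, master_l3_fib,
                      module_l3_fibs):
    """Re-associate the failed module's IP addresses with the standby MAC.

    Returns the list of IP addresses that were re-associated.
    """
    # "Determine Replacement MAC Address" 1512: consult the (assumed)
    # MAC allocation table; no IP address is involved in this step.
    standby_mac = mac_table[standby]
    failed_mac = mac_table[failed]

    # "Update Master L3 FIB" 1513.
    moved = [ip for ip, mac in master_l3_fib.items() if mac == failed_mac]
    for ip in moved:
        master_l3_fib[ip] = standby_mac

    # "Update Module L3 FIBs" 1514: actively push the new mapping to
    # every module table, replacing the stale correspondence.
    for fib in module_l3_fibs.values():
        for ip in moved:
            fib[ip] = standby_mac
    return moved


def arp_reply(ip, master_l3_fib):
    """Answer a later ARP request from the updated master table."""
    return master_l3_fib.get(ip)


mac_table = {"PMM_A": "MAC1", "PMM_B": "MAC2"}
master = {"IP1": "MAC1", "IP2": "MAC2"}
module_fibs = {"NM": dict(master)}  # e.g. state searched by an NM

moved_ips = rapid_ip_takeover("PMM_A", "PMM_B", mac_table, master,
                              module_fibs)
```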
[0548] The illustrated embodiment of FIG. 15 is shown determining
and propagating a new IP address to MAC address association
(leaving an original IP address for an associated service intact)
when replacing a failing module with a standby module
(corresponding to the aforementioned first group of embodiments).
Some embodiments also determine and propagate a new IP address to
fabric port address association in conjunction with propagating a
new IP address to MAC address association. Other embodiments
determine and propagate a MAC address update, or both MAC and IP
address updates (corresponding to the aforementioned second and
third groups of embodiments, respectively), including propagating
updates as appropriate for the following mappings: IP address to
MAC address, MAC address to fabric port address, and IP address to
fabric port address.
[0549] In some embodiments, the SCM processes program the Layer-2
and Layer-3 module tables (L2 FIBs and L3 FIBs) in their entirety
with respect to all elements known to reside within an ES system.
For example, with respect to FIG. 14, all IP and MAC addresses to
fabric port address correspondences are programmed into the
TCAM/SRAM structures included on the PMMs and into the search
engines included in the NMs. The pre-programming of mapping
information guarantees that references to local IP and MAC
addresses will be found in the module tables (i.e. will be "hits").
Subsequently, when IP and MAC takeover occurs (for example during
failover processing), the SCM processes update the L2 and L3 FIBs
immediately, guaranteeing that later references to local IP and MAC
addresses will continue to be hits in the module tables.
[0550] In some embodiments, the L2 and L3 FIB preprogramming is
limited according to VLAN configuration but still sufficient to
guarantee that local IP and MAC address references are hits, to
conserve TCAM/SRAM entry usage. For example, TCAM/SRAM 403.5 would
be initially preprogrammed only with entries corresponding to the
VLANs that VNICs implemented in VIOC 301.5 were members of, while
TCAM/SRAM 403.5B would be initially programmed according to VLAN
membership of VNICs implemented by VIOC 301.5B. Depending on
specific VLAN membership of VNICs, there could be entries uniquely
present in TCAM/SRAM 403.5, entries uniquely present in TCAM/SRAM
403.5B, and entries present in both TCAM/SRAMs 403.5 and 403.5B.
Failover processing would immediately update and add TCAM/SRAM
403.5B entries in order to continue to guarantee local IP
and MAC address hits.
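VLAN-limited preprogramming may be illustrated with the following
sketch. The VLAN identifiers, the addresses MAC3 and MAC4, the fabric
ports FPort2 and FPort3, and the function name preprogram are
hypothetical values chosen for illustration. The two dictionaries
merely stand in for TCAM/SRAMs 403.5 and 403.5B.

```python
def preprogram(entries, member_vlans):
    """Select the forwarding entries whose VLAN is in member_vlans.

    entries is an iterable of (vlan, address, fabric_port) tuples.
    """
    return {(vlan, addr): fport
            for vlan, addr, fport in entries
            if vlan in member_vlans}


# Hypothetical system-wide forwarding information.
ALL_ENTRIES = [
    (10, "MAC1", "FPort0"),
    (10, "MAC2", "FPort1"),
    (20, "MAC3", "FPort2"),
    (30, "MAC4", "FPort3"),
]

table_a = preprogram(ALL_ENTRIES, {10, 20})  # stands in for 403.5
table_b = preprogram(ALL_ENTRIES, {10, 30})  # stands in for 403.5B

only_a = set(table_a) - set(table_b)
only_b = set(table_b) - set(table_a)
shared = set(table_a) & set(table_b)

# Failover: the standby takes over VNICs that are members of VLAN 20,
# so the entries for that VLAN are added to its table immediately.
table_b.update(preprogram(ALL_ENTRIES, {20}))
```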
[0551] According to various embodiments, NMs and PMMs may implement
any combination of L2 and L3 FIBs and perform corresponding L2 and
L3 forwarding lookups. In some embodiments, the L2 and L3 module
tables are distinct, while in other embodiments the L2 and L3
module tables are implemented in a single combined module table,
with L2 and L3 type entries being differentiated by a table
identification field (of one or more bits) stored in the table and
included in the lookup key.
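A combined module table keyed by a table identification field may be
sketched as follows. The one-bit identifier values, the function
names, and the numeric address 0x0A000001 are assumptions made for
illustration only.

```python
L2_TABLE = 0  # MAC address -> fabric port address entries
L3_TABLE = 1  # IP address -> MAC address entries

combined_table = {}


def program(table_id, address, result):
    """Store an entry; the table identifier is part of the key."""
    combined_table[(table_id, address)] = result


def lookup(table_id, address):
    """Look up an address in the logical table selected by table_id."""
    return combined_table.get((table_id, address))


program(L2_TABLE, "MAC1", "FPort0")
program(L3_TABLE, "IP1", "MAC1")

# The same address bits may appear in both logical tables without
# colliding, because the identifier field differentiates the keys.
program(L2_TABLE, 0x0A000001, "FPort2")
program(L3_TABLE, 0x0A000001, "MAC3")
```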
[0552] The foregoing discussion focuses on module tables typically
implemented as TCAM/SRAM elements associated with VIOCs included on
PMMs. However, TCAM/SRAM elements associated with VIOCs included on
SCMs, FCMs, and OLBs, as well as TCAM/SRAM elements included on NMs
function similarly. Those of ordinary skill in the art will readily
appreciate how to extend the failover techniques to other module
types having TCAM/SRAM elements implementing Layer-2 and Layer-3
module tables.
Multi-Chassis Provisioning and Management
[0553] FIG. 16 illustrates an embodiment of a multi-chassis
fabric-backplane ES system, also referred to simply as a
"multi-chassis system". Servers may be provisioned from compute,
storage, and I/O resources available via three chassis (ES 110X, ES
110Y, and ES 110Z), each similar to ES1 110A (see FIG. 1A).
Conceptually the multi-chassis provisioning process is similar to
that of a single chassis, as illustrated in FIGS. 9A, 9B, and 11
and their respective discussions, except that resources for
provisioning are distributed amongst more than one chassis.
[0554] As illustrated in FIG. 16, each chassis includes an SFM (SFM
180X of ES 110X, for example) coupled to various compute, storage,
and I/O modules. The compute modules include two OLBs (OLB 160XA
and OLB 160XB of ES 110X, for example), two PMMs (PMM 150XA and PMM
150XB of ES 110X, for example), and an SCM (SCM 140X of ES 110X,
for example). Storage modules include two FCMs (FCM 120XA and FCM
120XB of ES 110X, for example). I/O modules include two NMs (NM
130XA and NM 130XB of ES 110X, for example). ES 110Y and ES 110Z
are similar to ES 110X (similar elements are identified with a `Y`
and a `Z`, respectively, in the identifier name instead of an `X`).
Fibre channel standard storage arrays (or networks coupled to
arrays, according to implementation) are coupled to each ES
chassis, as illustrated by Fibre Channel Array/Networks 106X, 106Y,
and 106Z, coupled to ESs 110X, 110Y, and 110Z, respectively.
[0555] Each ES system chassis is coupled to LAN/WAN/MAN/Internet
network 1619, ES 110X via NM 130XB and coupling 1614, ES 110Y via
NM 130YB and coupling 1615, and ES 110Z via NM 130ZB and coupling
1621. In some embodiments, network 1619 (or similar network
interconnectivity reached via NMs) provides a majority of
communication bandwidth between various chassis of a multi-chassis
system, including communication between the SCMs of the multiple
chassis. In some embodiments, additional connectivity between SCMs
of a multi-chassis system is included, as illustrated by dashed-line
coupling 1616 between SCM 140X and SCM 140Y, and dashed-line
coupling 1622 between SCM 140Y and SCM 140Z. Client 1601A is
coupled to ES 110X (via coupling 1617 to SCM 140X) and to ES 110Z
(via coupling 1618 to SCM 140Z). Client 1601B is coupled to network
1619 via coupling 1620, and the network provides indirect coupling
between Client 1601B and SCMs 140X, 140Y, and 140Z.
[0556] At least two major variations of multi-chassis provisioning
and management operations are possible. A first, or "Master-Free"
variation, is illustrated by Client 1601B operating in conjunction
with a host SCM (such as any one of SCM 140X, SCM 140Y, and SCM
140Z) executing multi-chassis management and provisioning
operations (such as those provided by Enterprise Manager 530 of
FIG. 5B). A second, or "Client-Mastered" variation, is illustrated
by Client 1601A operating as a master to two or more SCMs (such as
SCM 140X and SCM 140Z), with each SCM executing single-chassis
management and provisioning operations (such as those provided by
CM 533). The two variations are described in more detail with
respect to FIG. 17.
[0557] The illustrated embodiment shows couplings 1617 and 1618
coupling Client 1601A to SCM 140X of ES 110X and SCM 140Z of ES
110Z respectively. Other embodiments may provide communication
between a Client acting as a multi-chassis master and a plurality
of SCMs via any network connectivity between the SCMs and the
client, such as portions of network 1619. It is important only that
the master Client establish logical connectivity with a CM
executing in each SCM included in each respective chassis of the
multi-chassis system under management of the master Client.
[0558] While the aforementioned multi-chassis system embodiments
include identical chassis, other embodiments may include chassis of
varying capabilities. For example, a first chassis in a
multi-chassis system may not include any FCMs, a second chassis may
not include any OLBs, and a third chassis may include four OLBs.
These and all other similar combinations of compute, storage, and
I/O resources are contemplated. Furthermore, embodiments of
multi-chassis systems may include from as few as two chassis to tens,
hundreds, or thousands of chassis, interconnected without
restriction via various combinations and topologies of LANs, WANs,
MANs, and the Internet. In addition, multi-chassis management may
be performed via either or both of the two aforementioned
variations and embodiments may include multiple instances of each
variation operating substantially in parallel, provisioning and
managing a plurality of provisioned servers.
[0559] FIG. 17 illustrates an embodiment of two variations of
multi-chassis provisioning and management operations. Generally, in
the Client-Mastered variation, a client (such as Client 1601A) acts
as a master for gathering information and distributing commands
(rather than relying on an Enterprise Manager proxy). The client,
acting as the master, logically couples to a plurality of CMs
executing on respective SCMs (such as SCMs 140X and 140Z). The CMs,
acting as slaves, individually provide the information to the
client (master) and individually receive and act on the distributed
commands from the client.
[0560] Generally, in the Master-Free variation, a host Enterprise
Manager executing on a host SCM operates as a proxy for receiving
system information from all chassis in the system, passing the
information to a client (such as Client 1601B) as necessary,
depending on state information required by the client. Such
information may include event information for processing with
respect to queries, system status changes, or similar management
and provisioning data. The host Enterprise Manager further operates
as a proxy for broadcasting or distributing commands from Client
1601B to the proper chassis in the system. Such commands may
include commands to provision or manage servers, or to perform
other similar multi-chassis operations.
[0561] Communication of system information and commands between the
host Enterprise Manager and the other Enterprise Managers of the
multi-chassis system is typically performed via an Open Shortest
Path First (OSPF) protocol (an industry standard link-state
protocol for routing IP packets). An example of an OSPF protocol is
described in "RFC 2328--OSPF Version 2", April 1998,
(http://www.faqs.org/rfcs/rfc2328.html), herein incorporated by
reference for all purposes. In some embodiments, data distribution
is implemented via extensions to an OSPF protocol, such as an
opaque Link State Advertisement (LSA) option (an industry standard
technique of data distribution for arbitrary purposes throughout an
OSPF domain). An example of an opaque LSA option extension is
described in "RFC 2370--The OSPF Opaque LSA Option", July 1998,
(http://www.faqs.org/rfcs/rfc2370.html), herein incorporated by
reference for all purposes. Other embodiments communicate
management and provisioning information, events, commands, and
other related data via other OSPF-like protocols adapted for
propagating and distributing data in an environment of dynamically
varying bandwidth and topology.
[0562] With respect to OSPF processing, each ES system of the
multi-chassis system acts as a router, originating OSPF Link State
Update messages (according to opaque LSAs, in some embodiments)
indicating the current state of the respective ES system. The ES
systems further operate to propagate similar information from
adjacent ES systems (also acting as routers with respect to OSPF
processing) throughout the multi-chassis system (often referred to
as an autonomous system with regard to OSPF operations). The propagated
information includes data values required for proper determination
and processing of server provisioning and management operations,
including data values referenced by queries being monitored for
activation. For example, command execution results and module
status changes (such as Module Operational Status Up/Down events,
as described in the Selected BMC Event Details section, elsewhere
herein) are packaged for dissemination via Link State Update
messages, and provided to the adjacent ES systems. Since the Link
State Update messages are communicated reliably (via the Link State
Ack acknowledgement handshake), all ES systems in a multi-chassis
system have access to reliable near real time status information
for all of the chassis in the multi-chassis system. In some
embodiments, OSPF Hello and Link State Request messages may be used
to discover ES systems present in the multi-chassis system.
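The dissemination of status information between adjacent ES systems
may be modeled, very loosely, as sequence-numbered flooding. The
sketch below is not an OSPF implementation and uses no OSPF message
formats. The Chassis class, its methods, the payload layout, and the
chain topology are assumptions made for illustration. The boolean
return value merely stands in for the acknowledgement handshake.

```python
class Chassis:
    """Simplified model of an ES system acting as a router."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.seq = 0
        self.db = {}  # origin name -> (sequence number, status payload)

    def originate(self, payload):
        """Advertise this chassis's current state to the system."""
        self.seq += 1
        self.receive(self.name, self.seq, payload, sender=None)

    def receive(self, origin, seq, payload, sender):
        """Install a newer update and flood it to the other neighbors."""
        known = self.db.get(origin)
        if known is not None and known[0] >= seq:
            return True  # duplicate or stale: acknowledge, do not re-flood
        self.db[origin] = (seq, payload)
        for neighbor in self.neighbors:
            if neighbor is not sender:
                # The return value models the acknowledgement; a real
                # protocol would retransmit until acknowledged.
                neighbor.receive(origin, seq, payload, sender=self)
        return True


def link(a, b):
    """Make two chassis adjacent."""
    a.neighbors.append(b)
    b.neighbors.append(a)


es_x, es_y, es_z = Chassis("ES 110X"), Chassis("ES 110Y"), Chassis("ES 110Z")
link(es_x, es_y)
link(es_y, es_z)

es_x.originate({"PMM 150XA": "Down"})
```

In this model ES 110Z learns of the status change on ES 110X even
though the two are not adjacent, since ES 110Y propagates the update.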
[0563] Referring to FIG. 17, Master-Free Management Flow 1701
illustrates processing specific to an embodiment of the first
variation, and Client-Mastered Management Flow 1702 illustrates
processing specific to an embodiment of the second variation. The
Master-Free flow starts at "Begin" 1701.1. The flow then continues
to "Register to Receive OSPF ES Information" 1701.2. Here the
client (such as Client 1601B) registers with a host enterprise
management process executing on a host SCM (such as SCM 140X) to
receive and broadcast information via OSPF opaque LSAs, with the
host enterprise management process acting as a proxy. Communication
between the client and the host enterprise management process is
via coupling 1620, network 1619, and coupling 1614. The host SCM in
turn communicates with the other SCMs (or more specifically
enterprise management processes executing thereon) of the
multi-chassis system via network 1619 and coupling 1613, and
optionally via coupling 1616. Processing in "Register to Receive
OSPF ES Information" 1701.2 completes the Master-Free-specific
portion of the overall flow.
[0564] The Client-Mastered flow starts at "Begin" 1702.1 and
continues to "Connect to Next Chassis Manager" 1702.2. Here
communication between the client (such as Client 1601A) and a CM
process executing on the SCM of the first ES system (such as SCM
140X via coupling 1617) in the multi-chassis system is established.
A check is then made to determine if connections have been
established to all of the ES systems of the multi-chassis system at
"Connected to All?" 1702.3. If there are more systems to connect
to, then flow proceeds along "No" 1702.3N back to "Connect to Next
Chassis Manager" 1702.2. The next connection is then established
(such as to SCM 140Z via coupling 1618). If connections have been
established to all desired ES systems of the multi-chassis system,
then flow continues via "Yes" 1702.3Y, and the
Client-Mastered-specific portion of the overall flow is
complete.
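The Client-Mastered connection loop may be sketched as follows. The
function name and the connect callable (standing in for whatever
mechanism establishes a session with a CM) are hypothetical. The
sketch tests for remaining systems before each connection, which
matches the illustrated flow whenever at least one ES system is to
be managed.

```python
def connect_to_all(chassis_managers, connect):
    """Establish a connection to the CM of every listed ES system.

    connect is a hypothetical callable that returns a session object.
    """
    connections = {}
    pending = list(chassis_managers)
    while pending:                     # "Connected to All?" 1702.3: "No"
        cm = pending.pop(0)
        connections[cm] = connect(cm)  # "Connect to Next Chassis Manager"
    return connections                 # "Yes" 1702.3Y


sessions = connect_to_all(["SCM 140X", "SCM 140Z"],
                          lambda cm: "session:" + cm)
```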
[0565] Flow continues from "Register to Receive OSPF ES
Information" 1701.2 and via "Yes" 1702.3Y to "Monitor System
Information" 1702.4, where information received from each chassis
of the multi-chassis system is monitored. The Master-Free variation
receives status change and event information via the host
enterprise management process, and the host enterprise management
process in turn collects the information via OSPF opaque LSA
processing. The Client-Mastered variation receives the information
directly from each chassis via the respective connections
established in Client-Mastered Management Flow 1702.
[0566] In parallel with the system information monitoring, two
other flows are occurring, Monitor and Process Changes 1703 and
Monitor User Interface 1704. Generally, Monitor and Process Changes
1703 includes receiving and acting upon information collected from
the system monitoring activity (status changes and events),
including updating information displayed to one or more operators
(such as users, system managers, and other related personnel). The
functions further include processing queries and sending commands
generated by any activated queries to member systems of the
multi-chassis system that are required to execute the commands.
Monitor User Interface 1704 includes waiting for commands from the
operators (provided in response to the updating of information
displayed, for example) and distributing the commands to
appropriate members of the multi-chassis system.
[0567] More specifically, information regarding status changes and
events is propagated to "Change?" 1703.1 from "Monitor System
Information" 1702.4, as shown conceptually by dashed-arrow 1702.4C.
"Change?" 1703.1 determines if there has been a system status
change or an event. If not ("No" 1703.1N), then processing loops
back to "Change?" 1703.1. If there has been a change or an event
("Yes" 1703.1Y), then flow continues to "Update Display" 1703.2,
where the client updates any relevant display information according
to user interface requirements. Processing then proceeds to
"Process Queries" 1703.3, where queries are processed with respect
to the status changes and events, and commands may be formulated in
response to activated queries (similar to "Pre-Select Queries and
Respond to Activated Queries" 1201.7 of FIG. 12A). Flow continues
to "Broadcast Resultant Commands" 1703.4, where any resultant
commands are distributed to appropriate chassis, in a manner
dependent on whether the operating mode is the Master-Free or the
Client-Mastered variation. After completion of command broadcast,
flow returns to "Change?" 1703.1 to await another status change or
event.
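One way to picture the Monitor and Process Changes loop is sketched below. It is a hedged illustration under stated assumptions: the function name `process_changes`, the use of a `queue.Queue` to carry status changes and events, the three callables, and the `None` sentinel that ends the loop are all hypothetical and are not taken from the described embodiments. A blocking queue read stands in for looping on "Change?" 1703.1 until a change or event arrives.

```python
import queue


def process_changes(events, update_display, process_queries, broadcast):
    """Act on each status change or event in arrival order.

    `events` is a queue.Queue of change records. A value of None is a
    hypothetical sentinel used here only so that the sketch terminates.
    """
    while True:
        change = events.get()                # "Change?" 1703.1 (blocks until one arrives)
        if change is None:
            break
        update_display(change)               # "Update Display" 1703.2
        commands = process_queries(change)   # "Process Queries" 1703.3
        if commands:
            broadcast(commands)              # "Broadcast Resultant Commands" 1703.4


if __name__ == "__main__":
    q = queue.Queue()
    q.put({"chassis": "A", "status": "down"})
    q.put(None)
    process_changes(
        q,
        update_display=print,
        process_queries=lambda c: ["restart " + c["chassis"]],
        broadcast=print,
    )
```

Passing the display, query, and broadcast steps as callables keeps the loop independent of whether the Master-Free or the Client-Mastered variation is in use.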
[0568] "Broadcast Resultant Commands" 1703.4 processing, in the
Master-Free variation, unfolds in a two-stage process. The commands
are first sent from the client to the host enterprise management
process, and then the commands are broadcast via OSPF opaque LSAs
to all of the managed ES systems of the multi-chassis system. In
the Client-Mastered variation, commands are broadcast by the
client to each managed ES system. The broadcasting may be performed
directly by the client, using a form of unicast transmission, or
indirectly by the client using a form of multicast transmission,
according to various embodiments.
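The difference between the two variations in how commands leave the client can be sketched as below. This is an assumption-laden illustration: `broadcast_commands`, the mode labels "master-free" and "client-mastered", and the `host_send` and `chassis_sends` parameters are hypothetical names chosen for the sketch. The second stage of the Master-Free variation (distribution via OSPF opaque LSAs) is performed by the host enterprise management process and is not modeled here.

```python
def broadcast_commands(commands, mode, host_send=None, chassis_sends=None):
    """Distribute commands according to the operating variation.

    mode: "master-free" or "client-mastered" (labels chosen for this sketch).
    host_send: callable that hands commands to the host enterprise
        management process, which then distributes them to the chassis.
    chassis_sends: mapping of chassis id -> callable that sends to that chassis.
    Returns the number of transmissions made by the client itself.
    """
    if mode == "master-free":
        host_send(commands)      # stage 1 only; stage 2 occurs in the host process
        return 1
    if mode == "client-mastered":
        for send in chassis_sends.values():   # one unicast send per managed ES system
            send(commands)
        return len(chassis_sends)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(broadcast_commands(["reboot"], "master-free", host_send=print))
```

The sketch shows only the unicast form of the Client-Mastered variation; a multicast form would replace the per-chassis loop with a single send to a group address.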
[0569] Processing in "New Command?" 1704.1 determines if the
operator has provided a command (frequently this occurs in response
to changes in displayed information, as indicated conceptually by
dashed-arrow 1703.2N). If not ("No" 1704.1N), then processing loops
back to "New Command?" 1704.1. If a command has been received
("Yes" 1704.1Y), then flow proceeds to "Process Command" 1704.2,
where the operator command is processed into one or more commands
for communication to the elements of the multi-chassis system.
Processing continues in "Broadcast Command" 1704.3, where the
command (or commands) is sent to the appropriate multi-chassis
system members, in a variation-dependent way, as in "Broadcast
Resultant Commands" 1703.4.
[0570] In some embodiments, commands sent to the multi-chassis
system (as generated by "Broadcast Resultant Commands" 1703.4 and
"Broadcast Command" 1704.3, for example), may be multi-stage
commands (or sub-commands) requiring one or more synchronization
events with respect to execution of sub-commands by members of the
multi-chassis system. Synchronization may extend across any number
and combination of the multi-chassis members, depending on
specifics of the command, topology of the multi-chassis system, and
implementation-specific characteristics.
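As a rough in-process analogy for multi-stage commands with synchronization events, the sketch below runs each sub-command stage on every member and holds all members at a barrier before any proceeds to the next stage. It is illustrative only: the function `run_multistage` is a hypothetical name, threads stand in for members of the multi-chassis system, and Python's `threading.Barrier` stands in for whatever synchronization mechanism an embodiment actually uses between chassis.

```python
import threading


def run_multistage(members, stages):
    """Run each stage on every member, synchronizing between stages.

    members: list of member names (threads stand in for chassis here).
    stages: list of callables, each invoked as stage(member).
    Returns the log of (stage_index, member) completions in the order recorded.
    """
    barrier = threading.Barrier(len(members))
    log = []
    lock = threading.Lock()

    def worker(member):
        for index, stage in enumerate(stages):
            stage(member)                 # execute this member's sub-command
            with lock:
                log.append((index, member))
            barrier.wait()                # synchronization event: wait for all members

    threads = [threading.Thread(target=worker, args=(m,)) for m in members]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    result = run_multistage(["A", "B", "C"], [lambda m: None, lambda m: None])
    print([stage for stage, _ in result])
```

Because each member records completion of a stage before waiting at the barrier, every stage-0 entry in the log precedes every stage-1 entry, regardless of thread scheduling.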
CONCLUSION
[0571] Although the foregoing embodiments have been described in
some detail for purposes of clarity of understanding, the invention
is not limited to the details provided. There are many alternative
ways of implementing the invention. The disclosed embodiments are
illustrative and not restrictive. It will be understood that many
variations in construction, arrangement and use are possible
consistent with the teachings and within the scope of the claims
appended to the issued patent. For example, interconnect and
function-unit bit-widths, clock speeds, and the type of technology
used may generally be varied in each component block. Also, unless
specifically stated to the contrary, the value ranges specified,
the maximum and minimum values used, or other particular
specifications (such as the quantity and type of processors and
memory on the modules, the channel bandwidths, the degree of
redundancy for any particular component or module, the particular
version of an interface standard or component, the number of
entries in a buffer), are merely those of the illustrative
embodiments, can be expected to track improvements and changes in
implementation technology, and should not be construed as
limitations.
[0572] Functionally equivalent techniques known to those of
ordinary skill in the art may be employed instead of those
illustrated to implement various components or sub-systems. The
names given to interconnect and logic are merely illustrative, and
should not be construed as limiting the concepts taught. It is also
understood that many design functional aspects may be carried out
in either hardware (i.e., generally dedicated circuitry) or
software (i.e., via some manner of programmed controller or
processor), as a function of implementation-dependent design
constraints and the technology trends of faster processing (which
facilitates migration of functions previously in hardware into
software) and higher integration density (which facilitates
migration of functions previously in software into hardware).
Specific variations may include, but are not limited to:
differences in module and chassis partitioning; different module
and chassis form factors and configurations; use of different
operating systems and other system software; use of different
interface standards, network protocols, or communication links; and
other variations to be expected when implementing the concepts
taught herein in accordance with the unique engineering and
business constraints of a particular application.
[0573] The embodiments have been illustrated with detail and
environmental context well beyond that required for a minimal
implementation of many aspects of the concepts taught. Those of
ordinary skill in the art will recognize that variations may omit
disclosed components without altering the basic cooperation among
the remaining elements. It is thus understood that many of the
details disclosed are not required to implement various aspects of
the concepts taught. To the extent that the remaining elements are
distinguishable from the prior art, omitted components are not
limiting on the concepts taught herein.
[0574] All such variations in design comprise insubstantial changes
over the teachings conveyed by the illustrative embodiments. It is
also understood that the concepts taught herein have broad
applicability to other computing and networking applications, and
are not limited to the particular application or industry of the
illustrated embodiments. The invention is thus to be construed as
including all possible modifications and variations encompassed
within the scope of the claims appended to the issued patent.
* * * * *