U.S. patent application number 10/873,329 was published by the patent office on 2005-04-28 for port congestion notification in a switch. The invention is credited to Scott Carlsen, Steven G. Schmidt, and Anthony G. Tornetta.

United States Patent Application 20050088969
Kind Code: A1
Carlsen, Scott; et al.
April 28, 2005

Port congestion notification in a switch
Abstract
A congestion notification mechanism provides a congestion status
for all destinations in a switch at each ingress port. Data is
stored in a memory subsystem queue associated with the destination
port at the ingress side of the crossbar. A cell credit manager
tracks the amount of data in this memory subsystem for each
destination. If the count for any destination exceeds a threshold,
the credit manager sends an XOFF signal to the XOFF masks. A lookup
table in the XOFF masks maintains the status for every switch
destination based on the XOFF signals. An XON history register
receives the XOFF signals to allow queuing procedures that do not
allow a status change to XON during certain states. Flow control
signals directly from the memory subsystem are allowed to flow to
each XOFF mask, where they are combined with the lookup table
status to provide a congestion status for every destination.
Inventors: Carlsen, Scott (Mount Laurel, NJ); Tornetta, Anthony G. (King of Prussia, PA); Schmidt, Steven G. (Westampton, NJ)

Correspondence Address:
BECK AND TYSVER
2900 THOMAS AVENUE SOUTH, SUITE 100
MINNEAPOLIS, MN 55416 US

Family ID: 21801585
Appl. No.: 10/873,329
Filed: June 21, 2004
Related U.S. Patent Documents

Application Number | Filing Date | Patent Number
10/873,329 | Jun 21, 2004 |
10/020,968 | Dec 19, 2001 |
Current U.S. Class: 370/229; 370/389; 707/999.1; 709/235
Current CPC Class: H04L 2012/5683 20130101; H04L 49/30 20130101; H04L 49/3009 20130101
Class at Publication: 370/229; 370/389; 709/235; 707/100
International Class: H04L 001/00; G06F 011/00; G08C 015/00; H04L 012/28; G06F 007/00; G06F 017/00
Claims
What is claimed is:
1. A method for congestion notification within a switch comprising:
a) maintaining a plurality of lookup tables having multiple
entries, each entry containing a congestion status for a different
destination in the switch; b) sending a congestion update to the
plurality of lookup tables, the congestion update containing a
destination identifier and an updated congestion status; and c)
updating the entry in the lookup table corresponding to the
destination identifier using the updated congestion status.
2. The method of claim 1, wherein each lookup table contains an
entry for all available destinations in the switch.
3. The method of claim 2, wherein a separate lookup table is
maintained at each ingress to the switch.
4. The method of claim 1, wherein the switch is a Fibre Channel
switch.
5. The method of claim 1, further comprising: d) maintaining an
indicator of an amount of data within a buffer for each
destination; and e) triggering the sending of the congestion update
when the indicator passes a threshold value.
6. The method of claim 5, wherein a credit module maintains the
indicators and sends the congestion updates.
7. The method of claim 6, wherein the credit module uses a single
indicator for each destination to track data entering the switch
from a plurality of ingress ports.
8. The method of claim 7, wherein i) data from each ingress port
passes through a fabric module before entering the buffer, ii) each
fabric module submits a first credit event to the credit module for
each grouping of data submitted to the buffer, and iii) the credit
module uses the first credit event to alter the indicator so as to
reflect additional data entering the buffer.
9. The method of claim 8, wherein iv) the buffer informs the fabric
module each time a grouping of data leaves the buffer, v) the
fabric module responds to such information from the buffer by
submitting a second credit event to the credit module, and vi) the
credit module uses the second credit event to alter the indicator
so as to reflect data leaving the buffer.
10. The method of claim 9, wherein the first credit event is a
decrement event decreasing a value of the indicator, and the second
credit event is an increment event increasing the value of the
indicator.
11. The method of claim 9, wherein a plurality of fabric modules
submit first and second credit events to the credit module, which
stores the credit events in a plurality of FIFOs.
12. The method of claim 11, wherein the credit events are retrieved
from the FIFOs and applied to the indicator.
13. The method of claim 8, wherein each lookup table responds to a
switch destination address by returning the congestion status for
the destination associated with the switch destination address.
14. The method of claim 13, wherein the congestion status returned
by the lookup table is combined with a congestion indicator
generated by the fabric module to return a final congestion status
for the switch destination address.
15. The method of claim 14, wherein a first fabric module shares
the congestion indicator with a second fabric module within the
switch, with the second fabric module submitting the congestion
indicator to at least one additional lookup table.
16. The method of claim 7, wherein each congestion update from the
credit module is sent to the lookup tables used by the plurality of
ingress ports for which the credit module maintains the
indicators.
17. The method of claim 16, wherein the credit module is a master
credit module, further comprising a plurality of slave credit
modules each serving a different subset of ingress ports on the
switch.
18. The method of claim 17, wherein each slave credit module
receives information on the data entering the buffer from its own
subset of ingress ports and forwards that information to the master
credit module.
19. The method of claim 18, wherein the master credit module uses
the information received from the slave credit modules to maintain
the indicators, and furthermore wherein the master credit module
directs the slave credit modules to submit congestion updates to
their subset of served ports.
20. The method of claim 5, wherein different threshold values are
maintained for different destinations, and further wherein the
grouping of data is a fixed-size data cell.
21. The method of claim 1, wherein each lookup table responds to a
switch destination address by returning the congestion status for
the destination associated with the switch destination address.
22. The method of claim 21, wherein the congestion status returned
by the lookup table is combined with a congestion indicator to
return a final congestion status.
23. A method for congestion notification within a switch
comprising: a) maintaining at each ingress port a lookup table
having multiple entries, each entry containing a congestion status
for a different destination in the switch, each lookup table
containing entries for all available destinations in the switch,
each lookup table returning the congestion status in response to a
status query for a particular destination; b) maintaining at a
first module an indicator of an amount of data submitted for each
destination; and c) when the indicator passes a threshold value,
sending a congestion update from the first module to a first lookup
table, the congestion update containing a destination identifier
and an updated congestion status; and d) updating the entry in the
first lookup table corresponding to the destination identifier
using the updated congestion status.
24. The method of claim 23, wherein the first module services a
plurality of ports and their associated lookup tables, with all
data passing through the serviced ports being reflected in the
indicators of the first module.
25. The method of claim 24, wherein data from each serviced port
passes through a separate second module, each second module
submitting credit events to the first module reflecting data being
submitted to and exiting a memory subsystem.
26. The method of claim 25, wherein cell credit events are stored
by the first module in FIFOs to be later applied to the indicators
for each destination.
27. The method of claim 25, wherein the congestion status returned
by the lookup table is combined with a congestion signal generated
by the second module to return a final congestion status.
28. The method of claim 27, wherein the congestion signal is in
response to an XOFF/XON signal from the memory subsystem.
29. The method of claim 23, wherein different threshold values are
maintained for different destinations.
30. A method for sharing congestion information in a switch
comprising: a) interfacing with an ingress memory subsystem for a
crossbar component through a plurality of fabric interfaces, the
crossbar component handling data in predefined units; b)
associating a set of fabric interfaces to a credit module; c)
transmitting a first data event from one of the fabric interfaces
to the credit module when a unit of data for a destination is
submitted to the ingress memory subsystem, the first data event
identifying the destination; d) transmitting a second data event
from one of the fabric interfaces to the credit module when the
ingress memory subsystem informs the fabric interface that a unit
of data has been submitted to the crossbar from the ingress memory
subsystem; e) using the first and second data events at the credit
module to track a congestion status for the destinations in the
switch.
31. The method of claim 30, further comprising: f) sending a
congestion event from the credit module to a plurality of ingress
ports to indicate a change in the congestion status for a
destination.
32. The method of claim 31, further comprising: g) upon receiving a
flow control signal from the ingress memory subsystem, sending a
congestion signal from one of the fabric interfaces to one of the
ingress ports.
33. The method of claim 32, further comprising: h) sending the
congestion signal from the one of the fabric interfaces to a second
fabric interface, and then sending the congestion signal from the
second fabric interface to a second ingress port.
34. A method for distributing information regarding port congestion
on a switch having a switch fabric and a plurality of I/O boards,
each board having a plurality of ports, the method comprising: a)
submitting incoming data on a first I/O board to the switch fabric
via a single ingress memory subsystem; b) organizing the ingress
memory subsystem so as to establish a separate queue for each
destination on the switch; c) monitoring an amount of data in each
queue in the ingress memory subsystem; d) submitting a congestion
event to each port on the first I/O board when the amount of data
in a first queue passes a threshold value; and e) maintaining at
each port a destination lookup table containing a congestion value
for each destination on the switch based upon the congestion
events.
35. The method of claim 34, wherein each I/O board has a plurality
of protocol devices servicing a plurality of ports, and further
wherein a credit module on each protocol device performs the
monitoring step based on the amount of data in each queue that
originated from ports on its protocol device, wherein the credit
module submits the congestion event to each port on its protocol
device.
36. The method of claim 34, wherein each I/O board has a plurality
of protocol devices each servicing a plurality of ports, and
further wherein slave credit modules on at least some of the
protocol devices submit information to a master credit module
concerning the data entering each queue that originated from ports
on its protocol device, wherein the master credit module instructs
the slave credit modules to submit the congestion event to each
port that it services.
37. The method of claim 36, wherein i) all data passes through a
fabric interface before being submitted to the ingress memory
subsystem, ii) multiple fabric interfaces exist on each protocol
device, iii) the fabric interfaces receive congestion signals from
the ingress memory subsystem, and iv) the fabric interfaces submit
a fabric congestion signal to at least one port after receiving
congestion signals from the ingress memory subsystem.
38. The method of claim 37, wherein the fabric interfaces indicate
to each other when fabric congestion signals are created.
39. The method of claim 37, wherein the fabric interfaces track the
data entering and leaving the ingress memory subsystem and all the
fabric interfaces on a single protocol device report this
information to a single credit module.
40. A data communication switch having a plurality of destinations
comprising: a) a crossbar component; and b) a plurality of I/O
boards, each I/O board having i) a memory subsystem for queuing
data for submission to the crossbar component, ii) a credit
component for tracking an amount of data within the memory
subsystem for each destination, and iii) a plurality of protocol
devices, each protocol device having (1) a plurality of ports, (2)
a port congestion indicator at each port, the port congestion
indicator having an indication of a congestion status for each
destination in the switch, and (3) a congestion communication link
connecting the credit component with each of the port congestion
indicators.
41. The switch of claim 40, wherein each destination in the switch
has a switch destination address and further wherein the credit
component contains a decrement FIFO containing switch destination
addresses and an increment FIFO containing switch destination
addresses.
42. The switch of claim 41, wherein a first switch destination
address for a first destination is added to the decrement FIFO when
a unit of data for the first destination is submitted to the memory
subsystem and further wherein the first switch destination address
for the first destination is added to the increment FIFO when the
unit of data for the first destination exits the memory
subsystem.
43. The switch of claim 42, wherein each port communicates to the
memory subsystem through a fabric interface module, and further
wherein the fabric interface modules submit the switch destination
addresses to the FIFOs.
44. The switch of claim 43, wherein the fabric interface modules
receive flow control signals from the memory subsystem.
45. The switch of claim 44, wherein each fabric interface module
sends congestion signals to an associated port congestion indicator
upon receipt of the flow control signals.
46. The switch of claim 45, wherein each fabric interface module
sends a congestion signal to the other fabric interface modules on
its I/O board upon receipt of the flow control signals.
47. The switch of claim 40, wherein the credit component submits a
congestion event to at least one of the port congestion indicators
when an amount of data in the memory subsystem for a first
destination crosses a threshold.
48. The switch of claim 47, wherein the credit component is a
master credit component, and further comprising a plurality of
slave credit components, wherein when the master credit component
submits the congestion event to the at least one port congestion
indicators, the master credit component also submits an instruction
to the slave credit components to submit the congestion event to
other port congestion indicators.
49. The switch of claim 40, wherein each port communicates to the
memory subsystem through a fabric interface module, and further
wherein the fabric interface modules communicate to the credit
components events related to data entering and leaving the memory
subsystem.
50. The switch of claim 40, wherein the port congestion indicator
is an XOFF mask lookup table.
51. A data communication switch having a plurality of destinations
comprising: a) a crossbar component; and b) at least one I/O board
having i) an output queuing means for queuing data for submission
to the crossbar component, ii) a plurality of ports, iii) a
congestion indicator means at each port for indicating a congestion
status for each destination in the switch, iv) a congestion
signaling means for signaling a need to update the congestion
indicator means with a new congestion status for at least one
port.
52. The switch of claim 51, further comprising a means for sharing
the congestion signaling means with all ports on an I/O board.
53. The switch of claim 51, wherein the destinations include the
ports and at least one microprocessor.
Description
RELATED APPLICATION
[0001] This application is a continuation-in-part application based
on U.S. patent application Ser. No. 10/020,968, entitled "Deferred
Queuing in a Buffered Switch," filed on Dec. 19, 2001, which is
hereby incorporated by reference.
[0002] This application is related to U.S. patent application
entitled "Fibre Channel Switch," Ser. No. ______, attorney docket
number 3194, filed on even date herewith with inventors in common
with the present application. This related application is hereby
incorporated by reference.
FIELD OF THE INVENTION
[0003] The present invention relates to congestion notification in
a switch. More particularly, the present invention relates to
maintaining and updating a congestion status for all destination
ports within a switch.
BACKGROUND OF THE INVENTION
[0004] Fibre Channel is a switched communications protocol that
allows concurrent communication among servers, workstations,
storage devices, peripherals, and other computing devices. Fibre
Channel can be considered a channel-network hybrid, containing
enough network features to provide the needed connectivity,
distance and protocol multiplexing, and enough channel features to
retain simplicity, repeatable performance and reliable delivery.
Fibre Channel is capable of full-duplex transmission of frames at
rates extending from 1 Gbps (gigabits per second) to 10 Gbps. It is
also able to transport commands and data according to existing
protocols such as Internet protocol (IP), Small Computer System
Interface (SCSI), High Performance Parallel Interface (HIPPI) and
Intelligent Peripheral Interface (IPI) over both optical fiber and
copper cable.
[0005] In a typical usage, Fibre Channel is used to connect one or
more computers or workstations together with one or more storage
devices. In the language of Fibre Channel, each of these devices is
considered a node. One node can be connected directly to another,
or can be interconnected such as by means of a Fibre Channel
fabric. The fabric can be a single Fibre Channel switch, or a group
of switches acting together. Technically, the N_ports (node ports)
on each node are connected to F_ports (fabric ports) on the switch.
Multiple Fibre Channel switches can be combined into a single
fabric. The switches connect to each other via E_Ports (expansion
ports), forming an interswitch link, or ISL.
[0006] Fibre Channel data is formatted into variable length data
frames. Each frame starts with a start-of-frame (SOF) indicator and
ends with a cyclical redundancy check (CRC) code for error
detection and an end-of-frame indicator. In between are a 24-byte
header and a variable-length data payload field that can range from
0 to 2112 bytes. The switch uses a routing table and the source and
destination information found within the Fibre Channel frame header
to route the Fibre Channel frames from one port to another. Routing
tables can be shared between multiple switches in a fabric over an
ISL, allowing one switch to know when a frame must be sent over the
ISL to another switch in order to reach its destination port.
[0007] Fibre Channel switches are required to deliver frames to any
destination in the same order that they arrive from a source. One
common approach to ensure in-order delivery in this context is to
process frames in strict temporal order at the input or ingress
side of a switch. This is accomplished by managing its input buffer
as a first in, first out (FIFO) buffer. Sometimes, however, a
switch encounters a frame that cannot be delivered due to
congestion at the destination port. This frame remains at the top
of the buffer until the destination port becomes un-congested, even
when the next frame in the FIFO is destined for a port that is not
congested and could be transmitted immediately. This condition is
referred to as head of line blocking.
[0008] Various techniques have been proposed to deal with the
problem of head of line blocking. Scheduling algorithms, for
instance, do not use true FIFOs. Rather, they search the input FIFO
buffer looking for matches between waiting data and available
output ports. If the top frame is destined for a busy port, the
scheduling algorithm merely scans the FIFO buffer for the first
frame that is destined for an available port. Such algorithms must
take care to avoid sending Fibre Channel frames out of order.
Another approach is to divide the input buffer into separate
buffers for each possible destination. However, this requires large
amounts of memory and a good deal of complexity in large switches
having many possible destination ports. A third approach is the
deferred queuing solution described in detail in the incorporated
references. Deferred queuing requires that all incoming data frames
that are destined for a congested port be placed in a deferred
queue, which keeps these frames from unduly interfering with frames
destined for uncongested ports. This technique requires a
dependable method for determining the congestion status for all
destinations at each input port.
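The scan past blocked frames that such scheduling algorithms perform can be sketched in a few lines. This is a simplified illustration (the function and variable names are ours, not the application's), and a real implementation must also preserve ordering among frames bound for the same destination:

```python
def next_transmittable(fifo, congested):
    """Scan the input FIFO for the first frame whose destination port is
    not congested; return its index, or None if every queued frame is
    blocked. `fifo` is a list of destination ports in arrival order."""
    for i, dest in enumerate(fifo):
        if dest not in congested:
            return i
    return None
```

With a strict FIFO, a single frame for a congested port would block everything behind it; the scan instead skips it, which is exactly the head-of-line blocking the deferred-queuing approach is designed to avoid.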
[0009] Congestion and blocking are especially troublesome when the
destination port is an E_Port providing an interswitch link to
another switch. One reason that the E_Port can become congested is
that the input port on the second switch has filled up its input
buffer. The flow control between the switches prevents the first
switch from sending any more data to the second switch. Oftentimes
the input buffer on the second switch becomes filled with frames
that are all destined for a single congested port on that second
switch. This filled buffer has congested the ISL, so that the first
switch cannot send any data to the second switch--including data
that is destined for an un-congested port on the second switch.
Several manufacturers have proposed the use of virtual channels to
prevent the situation where congestion on an interswitch link is
caused by traffic to a single destination. In these proposals,
traffic on the link is divided into several virtual channels, and
no virtual channel is allowed to interfere with traffic on the
other virtual channels. Inrange Technologies Corporation has
proposed a technique for flow control over virtual channels that is
described in the incorporated Fibre Channel Switch application.
This flow control technique monitors the congestion status of all
destination ports at the downstream switch. If a destination port
becomes congested, the flow control process determines which
virtual channel on the ISL is affected, and sends an XOFF message
so informing the upstream switch. The upstream switch will then
stop sending data on the affected virtual channel.
[0010] Like the deferred queuing solution, the virtual channel flow
control solution requires that every input port in the downstream
switch know the congestion status of all destinations in the
switch. Unfortunately, the existing solutions for providing this
information are not satisfactory, as they do not easily present
accurate congestion status information to each of the ingress ports
in a switch.
SUMMARY OF THE INVENTION
[0011] The foregoing needs are met, to a great extent, by the
present invention, which provides a method for detecting port
congestion and informing ingress ports of the congestion. The
present invention utilizes a switch that submits data to a crossbar
component for making connections to a destination port. Before data
is submitted to the crossbar, it is stored in a virtual output
queue structure in a memory subsystem. A separate virtual output
queue is maintained for each destination within the switch. When a
connection is made over the crossbar to a destination port, data is
removed from the virtual output queue associated with that
destination port and transmitted over the connection. When a
destination port becomes congested, flow control within the switch
will prevent data from leaving the virtual output queues associated
with that destination.
[0012] The present invention utilizes a cell credit manager at the
ingress to the switch. The cell credit manager tracks credits
associated with each virtual output queue in order to obtain
knowledge about the amount of data within each queue. If the credit
count in the cell credit manager drops below a threshold value, the
cell credit manager views the associated port as a congested port
and asserts an XOFF signal. The XOFF signal includes three
components: an internal switch destination address for the relevant
destination port, an XOFF/XON status bit, and a validity signal to
indicate that a valid XOFF signal is being sent.
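The credit-tracking behavior described in this paragraph can be modeled in software as a sketch. The class name, method names, and credit values below are assumptions for illustration; only the threshold-crossing logic and the three-component XOFF signal reflect the text:

```python
class CellCreditManager:
    """Illustrative model of the cell credit manager: one credit count
    per destination, asserting XOFF when the count drops below the
    threshold and XON when it recovers."""

    def __init__(self, destinations, initial_credits, threshold):
        self.credits = {d: initial_credits for d in destinations}
        self.threshold = threshold
        # Each XOFF signal carries three components: the internal switch
        # destination address, the XOFF/XON status bit, and a validity bit.
        self.signals = []

    def decrement(self, dest):
        """A cell bound for `dest` entered the ingress memory subsystem."""
        self.credits[dest] -= 1
        if self.credits[dest] == self.threshold - 1:  # just dropped below
            self.signals.append((dest, True, True))   # assert XOFF

    def increment(self, dest):
        """A cell bound for `dest` left the memory subsystem."""
        self.credits[dest] += 1
        if self.credits[dest] == self.threshold:      # just recovered
            self.signals.append((dest, False, True))  # assert XON
```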
[0013] The XOFF signal of the cell credit manager is received by a
plurality of XOFF mask modules. One XOFF mask is utilized at each
ingress to the switch. Each XOFF mask receives the XOFF signal, and
assigns the designated destination port to the indicated XOFF/XON
status. The XOFF mask maintains the status for every destination
port in a look up table that assigns a single bit to each port. If
the bit assigned to a port is set to "1," the port has an XOFF
status. If the bit is "0," the port has an XON status and is free
to receive data.
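The one-bit-per-destination lookup table of the XOFF mask can be modeled with a simple bitmap; the names here are illustrative, not taken from the application:

```python
class XoffMask:
    """One bit per switch destination: 1 means XOFF (congested),
    0 means XON (free to receive data)."""

    def __init__(self):
        self.bits = 0  # bit i holds the status of destination address i

    def update(self, address, xoff):
        """Apply an XOFF signal from the cell credit manager."""
        if xoff:
            self.bits |= 1 << address
        else:
            self.bits &= ~(1 << address)

    def is_congested(self, address):
        return bool((self.bits >> address) & 1)
```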
[0014] The present invention recognizes that the XOFF mask should
not set the status of the destination port to XON during certain
portions of the deferred queuing procedure. Consequently, the
present invention utilizes an XON history register that also tracks
the current status of all ports. This XON history register receives
the XOFF signals from the cell credit manager and reflects those
changes in its own lookup table. The values in the look up table in
the XON history register are then used to periodically update the
values in the look up table in the XOFF mask.
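The XON history register's role, mirroring the XOFF signals and applying them to the XOFF mask only at safe points, can be sketched as follows (an illustrative model; the names and the dict-based mask table are assumptions):

```python
class XonHistoryRegister:
    """Mirrors every XOFF signal so that XON transitions can be applied
    to the XOFF mask only at safe points in the deferred-queuing
    procedure."""

    def __init__(self):
        self.history = {}  # destination address -> True (XOFF) / False (XON)

    def record(self, address, xoff):
        """Reflect an XOFF signal in the history register's own table."""
        self.history[address] = xoff

    def sync(self, mask_table):
        """Periodically copy the remembered statuses into the XOFF
        mask's lookup table (modeled here as a plain dict)."""
        mask_table.update(self.history)
```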
[0015] The present invention also recognizes flow control signals
directly from the memory subsystem that request that all data stop
flowing to that subsystem. When these signals are received, a
"gross_xoff" signal is sent to the XOFF mask. The XOFF mask is then
able to combine the results of this signal with the status of every
destination port as maintained in its lookup table. When another
portion of the switch wishes to determine the status of a
particular port, the internal switch destination address is
submitted to the XOFF mask. This address is used to reference the
status of that destination in the lookup table, and the result is
ORed with the value of the gross_xoff signal. The resulting signal
indicates the status of the indicated destination port.
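The final status computation, the lookup-table entry ORed with the gross_xoff signal, reduces to a one-line function (the dict representation of the table is an assumption):

```python
def congestion_status(lookup_table, address, gross_xoff):
    """Congestion status for a switch destination address: the
    per-destination lookup-table bit ORed with the gross_xoff
    flow-control signal from the memory subsystem."""
    return lookup_table.get(address, False) or gross_xoff
```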
[0016] The present invention utilizes a single cell credit manager
to track the inputs to the memory subsystem for a plurality of
ports. Since each port has its own XOFF mask, the XOFF signals must
be sent to the XOFF mask for each port that the cell credit manager
tracks. Other cell credit managers exist within the switch. The
present invention utilizes a special bus to transfer XOFF signals
between the various cell credit managers within a switch. In
addition, the present invention provides a technique for a stop_all
signal to be shared with all XOFF masks utilizing a single memory
subsystem. This signal will ensure that when the gross_xoff signal
is set, it will prevent all traffic from flowing into the memory
subsystem.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] FIG. 1 is a block diagram of one possible Fibre Channel
switch in which the present invention can be utilized.
[0018] FIG. 2 is a block diagram showing the details of the port
protocol device of the Fibre Channel switch shown in FIG. 1.
[0019] FIG. 3 is a block diagram showing the details of the memory
controller of the port protocol device shown in FIG. 2.
[0020] FIG. 4 is a block diagram showing the queuing utilized in an
upstream switch and a downstream switch communicating over an
interswitch link.
[0021] FIG. 5 is a block diagram showing XOFF flow control between
the ingress memory subsystem and the egress memory subsystem in the
switch of FIG. 1.
[0022] FIG. 6 is a block diagram showing backplane credit flow
control between the ingress memory subsystem and the egress memory
subsystem in the switch of FIG. 1.
[0023] FIG. 7 is a block diagram showing flow control between the
ingress memory subsystem and the protocol interface module in the
switch of FIG. 1.
[0024] FIG. 8 is a block diagram showing flow control between the
fabric interface module and the egress memory subsystem in the
switch of FIG. 1.
[0025] FIG. 9 is a block diagram showing the interactions of the
fabric interface modules, the XOFF masks, and the cell credit
manager in the switch of FIG. 1.
[0026] FIG. 10 is a block diagram showing the details of the cell
credit manager, the XON history register, and the XOFF mask in the
switch of FIG. 1.
DETAILED DESCRIPTION OF THE INVENTION
[0027] 1. Switch 100
[0028] The present invention is best understood after examining the
major components of a Fibre Channel switch, such as switch 100
shown in FIG. 1. The components shown in FIG. 1 are helpful in
understanding the applicant's preferred embodiment, but persons of
ordinary skill will understand that the present invention can be
incorporated in switches of different construction, configuration,
or port counts.
[0029] Switch 100 is a director class Fibre Channel switch having a
plurality of Fibre Channel ports 110. The ports 110 are physically
located on one or more I/O boards inside of switch 100. Although
FIG. 1 shows only two I/O boards, namely ingress board 120 and
egress board 122, a director class switch 100 would contain eight
or more such boards. The preferred embodiment described in the
application can contain thirty-two such I/O boards 120, 122. Each
board 120, 122 contains a microprocessor 124 that, along with its
RAM and flash memory (not shown), is responsible for controlling
and monitoring the other components on the boards 120, 122 and for
handling communication between the boards 120, 122.
[0030] In the preferred embodiment, each board 120, 122 also
contains four port protocol devices (or PPDs) 130. These PPDs 130
can take a variety of known forms, including an ASIC, an FPGA, a
daughter card, or even a plurality of chips found directly on the
boards 120, 122. In the preferred embodiment, the PPDs 130 are
ASICs, and can be referred to as the FCP ASICs, since they are
primarily designed to handle Fibre Channel protocol data. Each PPD
130 manages and controls four ports 110. This means that each I/O
board 120, 122 in the preferred embodiment contains sixteen Fibre
Channel ports 110.
[0031] The I/O boards 120, 122 are connected to one or more
crossbars 140 designed to establish a switched communication path
between two ports 110. Although only a single crossbar 140 is
shown, the preferred embodiment uses four or more crossbar devices
140 working together. In the preferred embodiment, crossbar 140 is
cell-based, meaning that it is designed to switch small, fixed-size
cells of data. This is true even though the overall switch 100 is
designed to switch variable length Fibre Channel frames.
[0032] The Fibre Channel frames are received on a port, such as
input port 112, and are processed by the port protocol device 130
connected to that port 112. The PPD 130 contains two major logical
sections, namely a protocol interface module 150 and a fabric
interface module 160. The protocol interface module 150 receives
Fibre Channel frames from the ports 110 and stores them in
temporary buffer memory. The protocol interface module 150 also
examines the frame header for its destination ID and determines the
appropriate output or egress port 114 for that frame. The frames
are then submitted to the fabric interface module 160, which
segments the variable-length Fibre Channel frames into fixed-length
cells acceptable to crossbar 140.
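The segmentation performed by the fabric interface module 160 can be sketched as follows. The 64-byte cell payload size and the field names are illustrative assumptions, not the actual cell format of crossbar 140:

```python
CELL_PAYLOAD = 64  # assumed fixed cell payload size in bytes

def segment_frame(frame: bytes, cell_payload: int = CELL_PAYLOAD):
    """Split a variable-length frame into fixed-size cells, padding the last."""
    cells = []
    for off in range(0, len(frame), cell_payload):
        chunk = frame[off:off + cell_payload]
        chunk = chunk.ljust(cell_payload, b"\x00")  # pad the final cell with fill
        cells.append({
            "sop": off == 0,                          # start-of-packet marker
            "eop": off + cell_payload >= len(frame),  # end-of-packet marker
            "payload": chunk,
        })
    return cells
```

The start-of-packet and end-of-packet markers let the receiving side treat the cell sequence as one variable-length packet, as described for the iMS 180 below.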
[0033] The fabric interface module 160 then transmits the cells to
an ingress memory subsystem (iMS) 180. A single iMS 180 handles all
frames received on the I/O board 120, regardless of the port 110 or
PPD 130 on which the frame was received.
[0034] When the ingress memory subsystem 180 receives the cells
that make up a particular Fibre Channel frame, it treats that
collection of cells as a variable length packet. The iMS 180
assigns this packet a packet ID (or "PID") that indicates the cell
buffer address in the iMS 180 where the packet is stored. The PID
and the packet length are then passed on to the ingress Priority
Queue (iPQ) 190, which organizes the packets in iMS 180 into one or
more queues, and submits those packets to crossbar 140. Before
submitting a packet to crossbar 140, the iPQ 190 submits a "bid" to
arbiter 170. When the arbiter 170 receives the bid, it configures
the appropriate connection through crossbar 140, and then grants
access to that connection to the iPQ 190. The packet length is used
to ensure that the connection is maintained until the entire packet
has been transmitted through the crossbar 140, although the
connection can be terminated early.
[0035] A single arbiter 170 can manage four different crossbars
140. The arbiter 170 handles multiple simultaneous bids from all
iPQs 190 in the switch 100, and can grant multiple simultaneous
connections through crossbar 140. The arbiter 170 also handles
conflicting bids, ensuring that no output port 114 receives data
from more than one input port 112 at a time.
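The arbiter's conflict rule described above can be sketched as follows. The first-bid-wins policy is an assumption for illustration; the actual grant policy of arbiter 170 is not specified here:

```python
def grant_bids(bids):
    """Resolve simultaneous bids so that each egress port is granted to at
    most one ingress port at a time (sketch of the arbiter's conflict rule).
    Bids arrive as (ingress, egress) pairs."""
    granted = {}
    for ingress, egress in bids:
        if egress not in granted:   # first bid for an egress wins in this sketch
            granted[egress] = ingress
    return granted
```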
[0036] The output or egress memory subsystem (eMS) 182 receives the
data cells comprising the packet from the crossbar 140, and passes
a packet ID to an egress priority queue (ePQ) 192. The egress
priority queue 192 provides scheduling, traffic management, and
queuing for communication between egress memory subsystem 182 and
the PPD 130 in egress I/O board 122. When directed to do so by the
ePQ 192, the eMS 182 transmits the cells comprising the Fibre
Channel frame to the egress portion of PPD 130. The fabric
interface module 160 then reassembles the data cells and presents
the resulting Fibre Channel frame to the protocol interface module
150. The protocol interface module 150 stores the frame in its
buffer, and then outputs the frame through output port 114.
[0037] In the preferred embodiment, crossbar 140 and the related
components are part of a commercially available cell-based switch
chipset, such as the nPX8005 or "Cyclone" switch fabric
manufactured by Applied Micro Circuits Corporation of San Diego,
Calif. More particularly, in the preferred embodiment, the crossbar
140 is the AMCC S8705 Crossbar product, the arbiter 170 is the AMCC
S8605 Arbiter, the iPQ 190 and ePQ 192 are AMCC S8505 Priority
Queues, and the iMS 180 and eMS 182 are AMCC S8905 Memory
Subsystems, all manufactured by Applied Micro Circuits
Corporation.
[0038] 2. Port Protocol Device 130
[0039] a) Link Controller Module 300
[0040] FIG. 2 shows the components of one of the four port protocol
devices 130 found on each of the I/O Boards 120, 122. As explained
above, incoming Fibre Channel frames are received over a port 110
by the protocol interface 150. A link controller module (LCM) 300
in the protocol interface 150 receives the Fibre Channel frames and
submits them to the memory controller module 310. One of the
primary jobs of the link controller module 300 is to compress the
start of frame (SOF) and end of frame (EOF) codes found in each
Fibre Channel frame. By compressing these codes, space is created
for status and routing information that must be transmitted along
with the data within the switch 100. More specifically, as each
frame passes through PPD 130, the PPD 130 generates information
about the frame's port speed, its priority value, the internal
switch destination address (or SDA) for the source port 112 and the
destination port 114, and various error indicators. This
information is added to the SOF and EOF in the space made by the
LCM 300. This "extended header" stays with the frame as it
traverses through the switch 100, and is replaced with the original
SOF and EOF as the frame leaves the switch 100.
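The packing of status and routing information into the space freed by SOF/EOF compression can be sketched as follows. The field widths shown (other than the ten-bit SDAs described below) are illustrative assumptions:

```python
def pack_extended_header(speed, priority, src_sda, dst_sda, errors=0):
    """Pack port speed, priority, source/destination SDAs, and error flags
    into one integer 'extended header'.  Only the 10-bit SDA width comes
    from the text; the other widths are assumptions."""
    assert 0 <= src_sda < 1024 and 0 <= dst_sda < 1024  # 10-bit SDAs
    word = speed
    word = (word << 3) | priority
    word = (word << 10) | src_sda
    word = (word << 10) | dst_sda
    word = (word << 4) | errors
    return word

def unpack_dst_sda(word):
    """Recover the destination SDA field from the packed header."""
    return (word >> 4) & 0x3FF
```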
[0041] The LCM 300 uses a SERDES chip (such as the Gigablaze SERDES
available from LSI Logic Corporation, Milpitas, Calif.) to convert
between the serial data used by the port 110 and the 10-bit
parallel data used in the rest of the protocol interface 150. The
LCM 300 performs all low-level link-related functions, including
clock conversion, idle detection and removal, and link
synchronization. The LCM 300 also performs arbitrated loop
functions, checks frame CRC and length, and counts errors.
[0042] b) Memory Controller Module 310
[0043] The memory controller module 310 is responsible for storing
the incoming data frame on the inbound frame buffer memory 320.
Each port 110 on the PPD 130 is allocated a separate portion of the
buffer 320. Alternatively, each port 110 could be given a separate
physical buffer 320. This buffer 320 is also known as the credit
memory, since the BB_Credit flow control between switch 100 and the
upstream device is based upon the size or credits of this memory
320. The memory controller 310 identifies new Fibre Channel frames
arriving in credit memory 320, and shares the frame's destination
ID and its location in credit memory 320 with the inbound routing
module 330.
[0044] The routing module 330 of the present invention examines the
destination ID found in the frame header of the frames and
determines the switch destination address (SDA) in switch 100 for
the appropriate destination port 114. The router 330 is also
capable of routing frames to the SDA associated with one of the
microprocessors 124 in switch 100. In the preferred embodiment, the
SDA is a ten-bit address that uniquely identifies every port 110
and processor 124 in switch 100. A single routing module 330
handles all of the routing for the PPD 130. The routing module 330
then provides the routing information to the memory controller
310.
[0045] As shown in FIG. 3, the memory controller 310 consists of a
memory write module 340, a memory read module 350, and a queue
control module 400. A separate memory controller 310 exists for
each of the four ports 110 on the PPD 130. The memory write module
340 handles all aspects of writing data to the credit memory 320.
The memory read module 350 is responsible for reading the data
frames out of memory 320 and providing the frame to the fabric
interface module 160. The queue control module 400 handles the
queuing and ordering of data on the credit memory 320. The XON
history register 420 can also be considered a part of the memory
controller 310, although only a single XON history register 420 is
needed to service all four ports 110 on a PPD 130.
[0046] c) Queue Control Module 400
[0047] The queue control module 400 stores the routing results
received from the inbound routing module 330. When the credit
memory 320 contains multiple frames, the queue control module 400
decides which frame should leave the memory 320 next. In doing so,
the queue module 400 utilizes procedures that avoid head-of-line
blocking.
[0048] The queue control module 400 has four primary components,
namely the deferred queue 402, the backup queue 404, the header
select logic 406, and the XOFF mask 408. These components work in
conjunction with the XON History register 420 and the cell credit
manager or credit module 440 to control ingress queuing and to
assist in managing flow control within switch 100. The deferred
queue 402 stores the frame headers and locations in buffer memory
320 for frames waiting to be sent to a destination port 114 that is
currently busy. The backup queue 404 stores the frame headers and
buffer locations for frames that arrive at the input port 112 while
the deferred queue 402 is sending deferred frames to their
destination. The header select logic 406 determines the state of
the queue control module 400, and uses this determination to select
the next frame in credit memory 320 to be submitted to the FIM 160.
To do this, the header select logic 406 supplies to the memory read
module 350 a valid buffer address containing the next frame to be
sent. The functioning of the backup queue 404, the deferred queue
402, and the header select logic 406 are described in the
incorporated Fibre Channel Switch application.
[0049] The XOFF mask 408 contains a congestion status bit for each
port 110 in the switch. The XON history register 420 is used to
delay updating the XOFF mask 408 under certain conditions. These
two components 408, 420 and their interaction with the cell credit
manager 440 and FIM 160 are described in more detail below.
[0050] d) Fabric Interface Module 160
[0051] When a Fibre Channel frame is ready to be submitted to the
ingress memory subsystem 180 of I/O board 120, the queue control
400 passes the frame's routed header and pointer to the memory read
portion 350. This read module 350 then takes the frame from the
credit memory 320 and provides it to the fabric interface module
160. The fabric interface module 160 converts the variable-length
Fibre Channel frames received from the protocol interface 150 into
fixed-sized data cells acceptable to the cell-based crossbar 140.
Each cell is constructed with a specially configured cell header
appropriate to the cell-based switch fabric. When using the Cyclone
switch fabric of Applied Micro Circuits Corporation, the cell
header includes a starting sync character, the switch destination
address of the egress port 114 and a priority assignment from the
inbound routing module 330, a flow control field and ready bit, an
ingress class of service assignment, a packet length field, and a
start-of-packet and end-of-packet identifier.
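The cell header fields enumerated above can be collected into a structure like the following. The types and any default widths are assumptions; the text names the fields but not their encodings:

```python
from dataclasses import dataclass

@dataclass
class CellHeader:
    """Fields named in the text for a Cyclone-style cell header (sketch)."""
    sync: int          # starting sync character
    dst_sda: int       # switch destination address of the egress port
    priority: int      # priority assignment from the inbound routing module
    flow_ctrl: int     # queue flow control field
    rdy: bool          # ready bit (RDY=0 signals XOFF toward the source)
    ingress_cos: int   # ingress class of service assignment
    pkt_len: int       # packet length field
    sop: bool          # start-of-packet identifier
    eop: bool          # end-of-packet identifier
```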
[0052] When necessary, the preferred embodiment of the fabric
interface 160 creates fill data to compensate for the speed
difference between the memory controller 310 output data rate and
the ingress data rate of the cell-based crossbar 140. This process
is described in more detail in the incorporated Fibre Channel
Switch application.
[0053] Egress data cells are received from the crossbar 140 and
stored in the egress memory subsystem 182. When these cells leave
the eMS 182, they enter the egress portion of the fabric interface
module 160. The FIM 160 then examines the cell headers, removes
fill data, and concatenates the cell payloads to re-construct Fibre
Channel frames with extended SOF/EOF codes. If necessary, the FIM
160 uses a small buffer to smooth gaps within frames caused by cell
header and fill data removal.
[0054] In the preferred embodiment, there are multiple links
between each PPD 130 and the iMS 180. Each separate link uses a
separate FIM 160. Preferably, each port 110 on the PPD 130 is given
a separate link to the iMS 180, and therefore each port 110 is
assigned a separate FIM 160.
[0055] e) Outbound Processor Module 450
[0056] The FIM 160 then submits the frames to the outbound
processor module (OPM) 450. A separate OPM 450 is used for each
port 110 on the PPD 130. The outbound processor module 450 checks
each frame's CRC, and handles the necessary buffering between the
fabric interface 160 and the ports 110 to account for their
different data transfer rates. The primary job of the outbound
processor modules 450 is to handle data frames received from the
cell-based crossbar 140 that are destined for one of the Fibre
Channel ports 110. This data is submitted to the link controller
module 300, which replaces the extended SOF/EOF codes with
standard Fibre Channel SOF/EOF characters, performs 8b/10b
encoding, and sends data frames through its SERDES to the Fibre
Channel port 110.
[0057] The components of the PPD 130 can communicate with the
microprocessor 124 on the I/O board 120, 122 through the
microprocessor interface module (MIM) 360. Through the
microprocessor interface 360, the microprocessor 124 can read and
write registers on the PPD 130 and receive interrupts from the PPDs
130. This communication occurs over a microprocessor communication
path 362. The microprocessor 124 also uses the microprocessor
interface 360 to communicate with the ports 110 and with other
processors 124 over the cell-based switch fabric.
[0058] 3. Queues
[0059] a) Class of Service Queue 280
[0060] FIG. 4 shows two switches 260, 270 that are communicating
over an interswitch link 230. The ISL 230 connects an egress port
114 on upstream switch 260 with an ingress port 112 on downstream
switch 270. The egress port 114 is located on the first PPD 262
(labeled PPD 0) on the first I/O Board 264 (labeled I/O Board 0) on
switch 260. This I/O board 264 contains a total of four PPDs 130,
each containing four ports 110. This means I/O board 264 has a
total of sixteen ports 110, numbered 0 through 15. In FIG. 4,
switch 260 contains thirty-one other I/O boards 120, 122, meaning
the switch 260 has a total of five hundred and twelve ports 110.
This particular configuration of I/O Boards 120, 122, PPDs 130, and
ports 110 is for exemplary purposes only, and other configurations
would clearly be within the scope of the present invention.
[0061] I/O Board 264 has a single egress memory subsystem 182 to
hold all of the data received from the crossbar 140 (not shown) for
its sixteen ports 110. The data in eMS 182 is controlled by the
egress priority queue 192 (also not shown). In the preferred
embodiment, the ePQ 192 maintains the data in the eMS 182 in a
plurality of output class of service queues (O_COS_Q) 280. Data for
each port 110 on the I/O Board 264 is kept in a total of "n" O_COS
queues, with the number n reflecting the number of virtual channels
240 defined to exist on the ISL 230. When cells are received from
the crossbar 140, the eMS 182 and ePQ 192 add the cell to the
appropriate O_COS_Q 280 based on the destination SDA and priority
value assigned to the cell. This information was placed in the cell
header as the cell was created by the ingress FIM 160. The cells
are then removed from the O_COS_Q 280 and are submitted to the PPD
262 for the egress port 114, which converts the cells back into a
Fibre Channel frame and sends it across the ISL 230 to the
downstream switch 270.
[0062] b) Virtual Output Queue 290
[0063] The frame enters switch 270 over the ISL 230 through ingress
port 112. This ingress port 112 is actually the second port
(labeled port 1) found on the first PPD 272 (labeled PPD 0) on the
first I/O Board 274 (labeled I/O Board 0) on switch 270. Like the
I/O board 264 on switch 260, this I/O board 274 contains a total of
four PPDs 130, with each PPD 130 containing four ports 110. With a
total of thirty-two I/O boards 120, 122, switch 270 has the same
five hundred and twelve ports as switch 260.
[0064] When the frame is received at port 112, it is placed in
credit memory 320. The D_ID of the frame is examined, and the frame
is queued and a routing determination is made as described above.
Assuming that the destination port on switch 270 is not XOFFed
according to the XOFF mask 408 servicing input port 112, the frame
will be subdivided into cells and forwarded to the ingress memory
subsystem 180.
[0065] The iMS 180 is organized and controlled by the ingress
priority queue 190, which is responsible for ensuring in-order
delivery of data cells and packets. To accomplish this, the iPQ 190
organizes the data in its iMS 180 into a number ("m") of different
virtual output queues (V_O_Qs) 290. To avoid head-of-line blocking,
a separate V_O_Q 290 is established for every destination within
the switch 270. In switch 270, this means that there are at least
five hundred forty-four V_O_Qs 290 (five hundred twelve physical
ports 110 and thirty-two microprocessors 124) in iMS 180. The iMS
180 places incoming data on the appropriate V_O_Q 290 according to
the switch destination address assigned to that data by the routing
module 330 in PPD 272.
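The destination count given above follows directly from the configuration described for switch 270:

```python
BOARDS = 32          # I/O boards in switch 270
PPDS_PER_BOARD = 4   # port protocol devices per board
PORTS_PER_PPD = 4    # Fibre Channel ports per PPD

ports = BOARDS * PPDS_PER_BOARD * PORTS_PER_PPD   # 512 physical ports
destinations = ports + BOARDS                     # plus one microprocessor per board
# One V_O_Q per destination avoids head-of-line blocking.
```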
[0066] Data in the V_O_Qs 290 is handled like the data in O_COS_Qs
280, such as by using round robin servicing. When data is removed
from a V_O_Q 290, it is submitted to the crossbar 140 and provided
to an eMS 182 on the switch 270.
[0067] c) Virtual Input Queue 282
[0068] FIG. 4 also shows a virtual input queue structure 282 within
each ingress port 112 in downstream switch 270. Each of these
V_I_Qs 282 corresponds to one of the virtual channels 240 on the
ISL 230, which in turn corresponds to one of the O_COS_Qs 280 on
the upstream switch. By assigning frames to a V_I_Q 282 in ingress
port 112, the downstream switch 270 can identify which O_COS_Q 280
in switch 260 was assigned to the frame. As a result, if a
particular data frame encounters a congested port within the
downstream switch 270, the switch 270 is able to communicate that
congestion to the upstream switch by performing flow control for
the virtual channel 240 assigned to that V_I_Q 282.
[0069] 4. Flow Control in Switch
[0070] a) XOFF Flow Control between iMS 180 and eMS 182
[0071] The cell-based switch fabric used in the preferred
embodiment of the present invention can be considered to include
the memory subsystems 180, 182, the priority queues 190, 192, the
cell-based crossbar 140, and the arbiter 170. As described above,
these elements can be obtained commercially from companies such as
Applied Micro Circuits Corporation. This switch fabric utilizes a
variety of flow control mechanisms to prevent internal buffer
overflows, to control the flow of cells into the cell-based switch
fabric, and to receive flow control instructions to stop cells from
exiting the switch fabric.
[0072] XOFF internal flow control within the cell-based switch
fabric is shown as communication 500 in FIG. 5. This flow control
serves to stop data cells from being sent from iMS 180 to eMS 182
over the crossbar 140 in situations where the eMS 182 or one of the
O_COS_Qs 280 in the eMS 182 is becoming full. If there were no flow
control, congestion at an egress port 114 would prevent data in the
port's associated O_COS_Qs 280 from exiting the switch 100. If the
iMS 180 were allowed to keep sending data to these queues 280, eMS
182 would overflow and data would be lost.
[0073] This flow control works as follows. When cell occupancy of
an O_COS_Q 280 reaches a threshold, an XOFF signal is generated
internal to the switch fabric to stop transmission of data from the
iMS 180 to these O_COS_Qs 280. The preferred Cyclone switch fabric
uses three different thresholds, namely a routine threshold, an
urgent threshold, and an emergency threshold. Each threshold
creates a corresponding type of XOFF signal to the iMS 180.
[0074] Unfortunately, since the V_O_Qs 290 in iMS 180 are not
organized into the individual class of services for each possible
output port 114, the XOFF signal generated by the eMS 182 cannot
simply turn off data for a single O_COS_Q 280. In fact, due to the
manner in which the cell-based fabric addresses individual ports,
the XOFF signal is not even specific to a single congested port
110. Rather, in the case of the routine XOFF signal, the iMS 180
responds by stopping all cell traffic to the group of four ports
110 found on the PPD 130 that contains the congested egress port
114. Urgent and Emergency XOFF signals cause the iMS 180 and
Arbiter 170 to stop all cell traffic to the affected egress I/O
board 122. In the case of routine and urgent XOFF signals, the eMS
182 is able to accept additional packets of data before the iMS 180
stops sending data. Emergency XOFF signals mean that new packets
arriving at the eMS 182 will be discarded.
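The three-threshold scheme can be sketched as follows. The numeric threshold values are illustrative assumptions; the text specifies only the three levels and their effects:

```python
# Illustrative cell-occupancy thresholds; the real values are fabric-specific.
ROUTINE, URGENT, EMERGENCY = 100, 200, 250

def xoff_level(occupancy):
    """Map O_COS_Q occupancy to the XOFF level described in the text."""
    if occupancy >= EMERGENCY:
        return "emergency"   # new packets arriving at the eMS are discarded
    if occupancy >= URGENT:
        return "urgent"      # stop all cell traffic to the affected I/O board
    if occupancy >= ROUTINE:
        return "routine"     # stop traffic to the 4-port group on that PPD
    return "xon"
```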
[0075] b) Backplane Credit Flow Control
[0076] The iPQ 190 also uses a backplane credit flow control 510
(shown in FIG. 6) to manage the traffic from the iMS 180 to the
different egress memory subsystems 182 more granularly than the
XOFF signals 500 described above. For every packet submitted to an
egress port 114, the iPQ 190 decrements its "backplane" credit
count for that port 114. When the packet is transmitted out of the
eMS 182, a backplane credit is returned to the iPQ 190. If a
particular O_COS_Q 280 cannot submit data to an ISL 230 (such as
when the associated virtual channel 240 has an XOFF status),
credits will not be returned to the iPQ 190 that submitted those
packets. Eventually, the iPQ 190 will run out of credits for that
egress port 114, and will stop making bids to the arbiter 170 for
these packets. These packets will then be held in the iMS 180.
[0077] Note that even though only a single O_COS_Q 280 is not
sending data, the iPQ 190 maintains credits only on a port 110
basis, not a class of service basis. Thus, the affected iPQ 190
will stop sending all data to the port 114, including data with a
different class of service that could be transmitted over the port
114. In addition, since the iPQ 190 services an entire I/O board
120, all traffic to that egress port 114 from any of the ports 110
on that board 120 is stopped. Other iPQs 190 on other I/O boards
120, 122 can continue sending packets to the same egress port 114
as long as those other iPQs 190 have backplane credits for that
port 114.
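The per-egress-port credit count maintained by the iPQ 190 can be sketched as follows. The initial credit allotment is an assumed value:

```python
class BackplaneCredits:
    """Per-egress-port backplane credit counter as kept by an iPQ (sketch)."""
    def __init__(self, initial=8):
        self.initial = initial
        self.credits = {}

    def can_bid(self, port):
        """An iPQ only bids to the arbiter while credits remain for the port."""
        return self.credits.get(port, self.initial) > 0

    def sent(self, port):
        """A packet was submitted toward this egress port: spend one credit."""
        self.credits[port] = self.credits.get(port, self.initial) - 1

    def returned(self, port):
        """The packet left the eMS: the backplane credit comes back."""
        self.credits[port] = self.credits.get(port, self.initial) + 1
```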
[0078] Thus, the backplane credit system 510 can provide some
internal switch flow control from ingress to egress on the basis of
a virtual channel 240, but it is inconsistent. If two ingress ports
112 on two separate I/O boards 120, 122 are each sending data to
different virtual channels 240 on the same ISL 230, the use of
backplane credits will flow control those channels 240 differently.
One of those virtual channels 240 might have an XOFF condition.
Packets to that O_COS_Q 280 will back up, and backplane credits
will not be returned. The lack of backplane credits will cause the
iPQ 190 sending to the XOFFed virtual channel 240 to stop sending
data. Assuming the other virtual channel does not have an XOFF
condition, credits from its O_COS_Q 280 to the other iPQ 190 will
continue, and data will flow through that channel 240. However, if
the two ingress ports 112 sending to the two virtual channels 240
utilize the same iPQ 190, the lack of returned backplane credits
from the XOFFed O_COS_Q 280 will stop traffic to all virtual
channels 240 on the ISL 230.
[0079] c) Input to Fabric Flow Control 520
[0080] The cell-based switch fabric must be able to stop the flow
of data from its data source (i.e., the FIM 160) whenever the iMS
180 or a V_O_Q 290 maintained by the iPQ 190 is becoming full. The
switch fabric signals this XOFF condition by setting the RDY
(ready) bit to 0 on the cells it returns to the FIM 160, shown as
520 on FIG. 7. Although this XOFF is an input flow control signal
between the iMS 180 and the ingress portion of the PPD 130, the
signals are communicated from the eMS 182 into the egress portion
of the same PPD 130. When the egress portion of the FIM 160
receives the cells with RDY set to 0, it informs the ingress
portion of the PPD 130 to stop sending data to the iMS 180.
[0081] There are three situations where the switch fabric may
request an XOFF or XON state change. In every case, flow control
cells 520 are sent by the eMS 182 to the egress portion of the FIM
160 to inform the PPD 130 of this updated state. These flow control
cells use the RDY bit in the cell header to indicate the current
status of the iMS 180 and its related queues 290.
[0082] In the first of the three different situations, the iMS 180
may fill up to its threshold level. In this case, no more traffic
should be sent to the iMS 180. When a FIM 160 receives the flow
control cells 520 indicating this condition, it sends a congestion
signal (or "gross_xoff" signal) 522 to the XOFF mask 408 in the
memory controller 310. This signal informs the MCM 310 to stop all
data traffic to the iMS 180, as described in more detail below. The
FIM 160 will also broadcast an external signal called STOP_ALL 164
to the FIMs 160 on its PPD 130, as well as to the other three PPDs
130 on its I/O board 120. The STOP_ALL congestion signal 164 may
take the same form as the gross_xoff congestion signal 522, or it
may be differently formatted. The interconnection between the PPDs
130 and the STOP_ALL signal 164 is shown in FIG. 9, although it is
possible to use the physical linkage 454 between the cell credit
managers 440 to communicate this signal. When a FIM 160 receives
the STOP_ALL signal 164, the gross_xoff signal 522 is sent to its
memory controller 310. Since all FIMs 160 on a board 120 will
receive the STOP_ALL signal 164, this signal will stop all traffic
to the iMS 180. The gross_xoff signal 522 will remain on until the
flow control cells 520 received by the FIM 160 indicate the buffer
condition at the iMS 180 is over.
[0083] In the second case, a single V_O_Q 290 in the iMS 180 fills
up to its threshold. When this occurs, the signal 520 back to the
PPD 130 will behave just as it did in the first case, with the
generation of a gross_xoff congestion signal 522 and a STOP_ALL
congestion signal 164. Thus, the entire iMS 180 stops receiving
data, even though only a single V_O_Q 290 has become
congested.
[0084] The third case involves a failed link between a FIM 160 and
the iMS 180. Flow control cells indicating this condition will
cause a gross_xoff signal 522 to be sent only to the MCM 310 for
the corresponding FIM 160. No STOP_ALL signal 164 is sent in this
situation.
[0085] d) Output from Fabric Flow Control 530
[0086] When an egress portion of a PPD 130 wishes to stop traffic
coming from the eMS 182, it signals an XOFF to the switch fabric by
sending a cell from the input FIM 160 to the iMS 180, which is
shown as flow control 530 on FIG. 8. The cell header contains a
queue flow control field and a RDY bit to help define the XOFF
signal. The queue flow control field is eleven bits long, and can
identify the class of service, port 110 and PPD 130, as well as the
desired flow status (XON or XOFF).
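A packing of the eleven-bit queue flow control field can be sketched as follows. The text states only that the field is eleven bits and identifies the class of service, port 110, PPD 130, and flow status; the split into a 3-bit class of service, 5-bit PPD, 2-bit port, and 1-bit XON/XOFF flag is an assumption:

```python
def pack_queue_flow_ctrl(cos, ppd, port, xoff):
    """Pack an 11-bit queue flow control field (assumed field layout)."""
    assert cos < 8 and ppd < 32 and port < 4
    return (cos << 8) | (ppd << 3) | (port << 1) | (1 if xoff else 0)
```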
[0087] The PPD 130 might desire to stop the flow of data from the
eMS 182 for several reasons. First, an internal buffer within the
egress portion of the FIM 160 may be approaching an overflow
condition. Second, the egress portion of the PIM 150 may have
received a switch-to-switch flow control signal. This signal may
request stopping the flow of data over the entire link.
Alternatively, the signal may reflect only a desire to stop traffic
over a particular virtual channel 240 on a link. Regardless of the
reason, when the FIM 160 needs to stop data traffic from the eMS
182, the FIM 160 sends an XOFF to the switch fabric in an ingress
cell header directed toward iMS 180. The iMS 180 extracts each XOFF
instruction from the cell header, and sends it to the eMS 182,
directing the eMS 182 to XOFF or XON a particular O_COS_Q 280. If
the O_COS_Q 280 is sending a packet to the FIM 160, it finishes
sending the packet. The eMS 182 then stops sending fabric-to-port
or fabric-to-microprocessor packets to the FIM 160.
[0088] 5. Congestion Notification
[0089] a) XOFF Mask 408
[0090] The XOFF mask 408 shown in FIG. 10 is responsible for
notifying the ingress ports 112 of the congestion status of all
egress ports 114 and microprocessors 124 in the switch. Every port
has its own XOFF mask 408, as shown in FIG. 9. The XOFF mask 408
is considered part of the queue control module 400 in the memory
controller 310, and is therefore shown within the MCM 310 in FIG.
9.
[0091] Each XOFF mask 408 contains a separate status bit for all
destinations within the switch 100. In one embodiment of the switch
100, there are five hundred and twelve physical ports 110 and
thirty-two microprocessors 124 that can serve as a destination for
a frame. Hence, the XOFF mask 408 uses a 544 by 1 look up table 410
to store the "XOFF" status of each destination. If a bit in XOFF
look up table 410 is set, the port 110 corresponding to that bit is
busy and cannot receive any frames.
[0092] In the preferred embodiment, the XOFF mask 408 returns a
status for a destination by first receiving the switch destination
address for that port 110 or microprocessor 124 on SDA input 412.
The look up table 410 is examined for the SDA on input 412, and if
the corresponding bit is set, the XOFF mask 408 asserts a signal on
"defer" output 414, which indicates to the rest of the queue
control module 400 that the selected port 110 or processor 124 is
busy. This construction of the XOFF mask 408 is the preferred way
to store the congestion status of possible destinations at each
port 110. Other ways are possible, as long as they can quickly
respond to a status query about a destination with the congestion
status for that destination.
[0093] In the preferred embodiment, the output of the XOFF look up
table 410 is not the sole source for the defer signal 414. In
addition, the XOFF mask 408 receives the gross_xoff signal 522 from
its associated FIM 160. This signal 522 is ORed with the output of
the lookup table 410 in order to generate the defer signal 414.
This means that whenever the gross_xoff signal 522 is set, the
defer signal 414 will also be set, effectively stopping all traffic
to the iMS 180. In another embodiment (not shown), a force defer
signal that is controlled by the microprocessor 124 is also able to
cause the defer signal 414 to be asserted. When the defer signal 414 is
set, it informs the header select logic 406 and the remaining
elements of the queue module 400 that the port 110 having the
address on next frame header output 415 is congested, and this
frame should be stored on the deferred queue 402.
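The defer logic described above, with the lookup table output ORed with the gross_xoff signal, can be sketched as follows:

```python
class XoffMask:
    """544-entry congestion bitmap; defer = table bit OR gross_xoff (sketch)."""
    def __init__(self, n_dest=544):
        self.table = [False] * n_dest   # one XOFF status bit per destination
        self.gross_xoff = False         # congestion signal from the FIM

    def defer(self, sda):
        """A set table bit means the destination is busy; a set gross_xoff
        signal stops all traffic regardless of destination."""
        return self.table[sda] or self.gross_xoff
```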
[0094] b) XON History Register 420
[0095] The XON history register 420 is used to record the history
of the XON status of all destinations in the switch 100. Under the
procedure established for deferred queuing, the XOFF mask 408
cannot be updated with an XON event when the queue control 400 is
servicing deferred frames in the deferred queue 402. During that
time, whenever a port 110 changes status from XOFF to XON, the XOFF
mask 408 will ignore (or not receive) the XOFF signal 452 from the
cell credit manager 440 and will therefore not update its lookup
table 410. The signal 452 from the cell credit manager 440 will,
however, update the lookup table 422 within the XON history
register 420. Thus, the XON history register 420 maintains the
current XON status of all ports 110. When the update signal 416 is
made active by the header select 406 portion of the queue control
module 400, the entire content of the lookup table 422 in the XON
history register 420 is transferred to the lookup table 410 of the
XOFF mask 408. Registers within the table 422 containing a
zero (having a status of XON) will cause corresponding registers
within the XOFF mask lookup table 410 to be reset to zero. The dual
register setup allows for XOFFs to be written directly to the XOFF
mask 408 at any time the cell credit manager 440 requires traffic
to be halted, and causes XONs to be applied only when the logic
within the queue control module 400 allows for a change to an XON
value. While a separate queue control module 400 and its associated
XOFF mask 408 is necessary for each port 110 in the PPD 130, only
one XON history register 420 is necessary to service all four ports
110 in the PPD 130, which again is shown in FIG. 9.
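The transfer from the XON history register to the XOFF mask on the update signal can be sketched as follows; note that only zeros (XON) in the history clear bits in the mask, since XOFFs are written to the mask directly at any time:

```python
def apply_xon_history(xoff_table, history_table):
    """On the update signal, zeros (XON) in the history register reset the
    corresponding XOFF mask bits; set bits in the mask are otherwise kept."""
    for sda, xoff in enumerate(history_table):
        if not xoff:                 # history holds XON for this destination
            xoff_table[sda] = False  # clear the XOFF status in the mask
    return xoff_table
```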
[0096] c) Cell Credit Manager 440
[0097] The cell credit manager or credit module 440 sets the
XOFF/XON status of the possible destination ports 110 in the lookup
tables 410, 422 of the XOFF mask 408 and the XON history register
420. To update these tables 410, 422, the cell credit manager 440
maintains a cell credit count of every cell in the virtual output
queues 290 of the iMS 180. Every time a cell addressed to a
particular SDA leaves the FIM 160 and enters the iMS 180, the FIM
160 informs the credit module 440 through a cell credit event
signal 442. The credit module 440 then decrements the cell count
for that SDA. Every time a cell for that destination leaves the iMS
180, the credit module 440 is again informed and adds a credit to
the count for the associated SDA. The iPQ 190 sends this credit
information back to the credit module 440 by sending a cell
containing the cell credit back to the FIM 160 through the eMS 182.
The FIM 160 then sends an increment credit signal 442 to the cell
credit manager 440. This cell credit flow control is designed to
prevent the occurrence of more drastic levels of flow control from
within the cell-based switch fabric described above, since these
flow control signals 500-520 can result in multiple blocked ports
110, shutting down an entire iMS 180, or even the loss of data.
[0098] In the preferred embodiment, the cell credits are tracked
through increment and decrement credit events 442 received from FIM
160. These events are stored in dedicated increment FIFOs 444 and
decrement FIFOs 446. Each FIM 160 is associated with a separate
increment FIFO 444 and a separate decrement FIFO 446, although
ports 1-3 are shown as sharing FIFOs 444, 446 for the sake of
simplicity. Decrement FIFOs 446 contain SDAs for cells that have
entered the iMS 180. Increment FIFOs 444 contain SDAs for cells
that have left the iMS 180. These FIFOs 444, 446 are handled in
round robin format, decrementing and incrementing the credit count
that the credit module 440 maintains for each SDA in its cell
credit accumulator 447. In the preferred embodiment, the cell
credit accumulator 447 is able to handle one increment event from
one of the FIFOs 444 and one decrement event from one of the FIFOs
446 at the same time. Event select logic services the FIFOs 444,
446 in a round robin manner while monitoring the status of each
FIFO 444, 446 so as to avoid granting empty FIFOs 444, 446 access
to the accumulator 447.
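The round robin servicing of the increment and decrement FIFOs can be sketched as follows. The function and variable names are illustrative; the sketch simply drains one increment event and one decrement event per cycle, skipping empty FIFOs, and applies them to per-SDA credit counts.

```python
from collections import deque

def service_credit_fifos(increment_fifos, decrement_fifos, counts):
    """Sketch of the event select logic: per cycle, take one increment
    event and one decrement event from the next non-empty FIFO in round
    robin order, and adjust the per-SDA counts (accumulator 447)."""
    inc_idx = dec_idx = 0
    while any(increment_fifos) or any(decrement_fifos):
        # One increment event (cell left the iMS) from a non-empty FIFO.
        for _ in range(len(increment_fifos)):
            fifo = increment_fifos[inc_idx]
            inc_idx = (inc_idx + 1) % len(increment_fifos)
            if fifo:
                counts[fifo.popleft()] += 1
                break
        # One decrement event (cell entered the iMS) in the same cycle.
        for _ in range(len(decrement_fifos)):
            fifo = decrement_fifos[dec_idx]
            dec_idx = (dec_idx + 1) % len(decrement_fifos)
            if fifo:
                counts[fifo.popleft()] -= 1
                break
    return counts
```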
[0099] The accumulator 447 maintains separate credit counts for
each SDA, with each count reflecting the number of cells contained
within the iMS 180 for a given SDA. A compare module 448 detects
when the count for an SDA within accumulator 447 crosses an XOFF or
XON threshold stored in threshold memory 449. When a threshold is
crossed, the compare module 448 causes a driver to send the
appropriate XOFF or XON event 452 to the XOFF mask 408 and the XON
history register 420. If the count gets too low, then that SDA is
XOFFed. This means that Fibre Channel frames that are to be routed
to that SDA are held in the credit memory 320 by queue control
module 400. After the SDA is XOFFed, the credit module 440 waits
for the count for that SDA to rise to a certain level, and then the
SDA is XONed, which instructs the queue control module 400 to
release frames for that destination from the credit memory 320. The
XOFF and XON thresholds in threshold memory 449 can be different
for each individual SDA, and are programmable by the processor
124.
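The compare module's threshold test amounts to a hysteresis check, which might be sketched as below. The threshold values and names are illustrative; as the text notes, the actual thresholds are per-SDA and programmable by the processor 124.

```python
def check_thresholds(count, xoffed, xoff_threshold, xon_threshold):
    """Sketch of compare module 448: XOFF when the credit count for an
    SDA falls to or below the XOFF threshold; XON only once the count
    recovers to or above the (higher) XON threshold.

    Returns (new_xoffed_state, event), where event is "XOFF", "XON",
    or None when no threshold is crossed."""
    if not xoffed and count <= xoff_threshold:
        return True, "XOFF"    # hold frames for this SDA in credit memory
    if xoffed and count >= xon_threshold:
        return False, "XON"    # release frames for this destination
    return xoffed, None        # no event; state unchanged
```

Because the XON threshold sits above the XOFF threshold, a count hovering near one boundary does not generate a stream of alternating events.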
[0100] When an XOFF event or an XON event occurs, the credit module
440 sends an XOFF instruction 452 to the XON history register 420
and all four XOFF masks 408 in its PPD 130. In the preferred
embodiment, the XOFF instruction 452 is a three-part signal
comprising the SDA, the new XOFF status, and a validity
signal.
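The three-part instruction might be represented as a simple record; the field names below are illustrative, not taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class XoffInstruction:
    """Sketch of the three-part XOFF instruction 452."""
    sda: int        # switch destination address being updated
    xoff: bool      # new status: True = XOFF, False = XON
    valid: bool     # validity signal qualifying the instruction
```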
[0101] In the above description, each cell credit manager 440
receives communications from the FIMs 160 on its PPD 130 regarding
the cells that each FIM 160 submits to the iMS 180. The FIMs 160
also report back to the cell credit manager 440 when those cells
are transmitted by the iMS 180 over the crossbar 140. As long as the
system works as described, the cell credit managers 440 are able to
track the status of all cells submitted to the iMS 180. Even though
each cell credit manager 440 is only tracking cells related to its
PPD 130 (approximately one fourth of the total cells passing
through the iMS 180), this information could be used to implement a
useful congestion notification system.
[0102] Unfortunately, the preferred embodiment ingress memory
system 180 manufactured by AMCC does not return cell credit
information to the same FIM 160 that submitted the cell. In fact,
the cell credit relating to a cell submitted by the first FIM 160
on the first PPD 130 might be returned by the iMS 180 to the last
FIM 160 on the last PPD 130. Consequently, the cell credit managers
440 cannot assume that each decrement credit event 442 they receive
relating to a cell entering the iMS 180 will ever result in a
related increment credit event 442 being returned to it when that
cell leaves the iMS 180. The increment credit event 442 may very
well end up at another cell credit manager 440.
[0103] To overcome this issue, an alternative embodiment of the
present invention has the four cell credit managers 440 on an I/O
board 120, 122 combine their cell credit events 442 in a
master/slave relationship. In this embodiment, each board 120, 122
has a single "master" cell credit manager 441 and three "slave"
cell credit managers 440. When a slave unit 440 receives a cell
credit event signal 442 from a FIM 160, the signal 442 is forwarded
to the master cell credit manager 441 over a special XOFF bus 454
(as seen in FIG. 9). The master unit 441 receives cell credit event
signals 442 from the three slave units 440 as well as the FIMs 160
that directly connect to the master unit 441. In this way, the
master cell credit manager 441 receives the cell credit event
signals 442 from all of the FIMs 160 on an I/O board 120. This
allows the master unit to maintain a credit count for each SDA in
its accumulator 447 that reflects all data cells entering and
leaving the iMS 180.
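The master/slave aggregation can be sketched as follows, with illustrative class names: slave units forward their credit events over the XOFF bus rather than counting locally, so only the master's accumulator reflects the whole board.

```python
class MasterCreditManager:
    """Sketch of master unit 441: accumulates credit events from its own
    FIMs and from all slave units, so its counts cover every cell
    entering (delta = -1) or leaving (delta = +1) the iMS."""
    def __init__(self):
        self.counts = {}   # per-SDA credit count (accumulator 447)

    def credit_event(self, sda, delta):
        self.counts[sda] = self.counts.get(sda, 0) + delta

class SlaveCreditManager:
    """Sketch of a slave unit 440: forwards each credit event to the
    master over the XOFF bus 454 instead of maintaining its own count."""
    def __init__(self, master):
        self.master = master

    def credit_event(self, sda, delta):
        self.master.credit_event(sda, delta)
```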
[0104] The master cell credit manager 441 is solely responsible for
maintaining the credit counts and for comparing the credit counts
with the threshold values stored in its threshold memory 449. When
a threshold is crossed, the master unit 441 sends an XOFF or XON
event 452 to its associated XON history register 420 and XOFF masks
408. In addition, the master unit 441 sends an instruction to the
slave cell credit managers 440 to send the same XOFF or XON event
452 to their XON history registers 420 and XOFF masks 408. In this
manner, the four cell credit managers 440, 441 send the same
XOFF/XON event 452 to all four XON history registers 420 and all
sixteen XOFF masks 408 on the I/O board 120, 122, effectively
unifying the cell credit congestion notification across the board
120, 122.
[0105] Because errors can occur, the cell credit counts in
accumulator 447 may drift from the actual values over time. The
present invention overcomes this issue by
periodically re-syncing these counts. To do this, the FIM 160
toggles a `state` bit in the headers of all cells sent to the iMS
180 to reflect a transition in the system's state. At the same
time, the credit counters in cell credit accumulator 447 are
restored to full credit. Since each of the cell credits returned
from the iMS 180/eMS 182 includes an indication of the value of the
state bit in the cell, it is possible to differentiate credits
relating to cells sent before the state change from those sent
after it. Any credits
received by the FIM 160 that do not have the proper state bit are
ignored. After the iMS 180 recognizes the state change, credits
will only be returned for those cells indicating the new state. In
the preferred embodiment, this changing of the state bit and the
re-syncing of the credit in cell credit accumulator 447 occurs
approximately every eight minutes, although this time period is
adjustable under the control of the processor 124.
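The re-sync mechanism can be sketched in a few lines; the names and the full-credit value are illustrative. The state bit is toggled, every counter is restored to full credit, and a returned credit is honored only if its state bit matches the current state.

```python
def resync(counters, full_credit, current_state):
    """Sketch of the periodic re-sync: toggle the state bit carried in
    cell headers and restore all per-SDA counters to full credit.
    Returns the new state bit."""
    for sda in counters:
        counters[sda] = full_credit
    return current_state ^ 1

def accept_credit(credit_state_bit, current_state):
    # Credits tagged with the old state bit belong to cells sent before
    # the re-sync and are ignored.
    return credit_state_bit == current_state
```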
[0106] The many features and advantages of the invention are
apparent from the above description. Numerous modifications and
variations will readily occur to those skilled in the art. For
instance, persons of ordinary skill could easily reconfigure the
various components described above into different elements, each of
which has a slightly different functionality than those described.
The component reconfigurations do not fundamentally alter the
present invention. Since such modifications are possible, the
invention is not to be limited to the exact construction and
operation illustrated and described. Rather, the present invention
should be limited only by the following claims.
* * * * *