U.S. patent application number 13/428606 was filed with the patent office on March 23, 2012, and published on September 26, 2013, as application 20130250757, for reducing headroom. This patent application is currently assigned to BROADCOM CORPORATION. The applicants listed for this patent are Bruce Kwan and Vahid Tabatabaee. The invention is credited to Bruce Kwan and Vahid Tabatabaee.
Application Number: 13/428606 (Publication No. 20130250757)
Document ID: /
Family ID: 47602745
Publication Date: 2013-09-26
United States Patent Application: 20130250757
Kind Code: A1
Tabatabaee; Vahid; et al.
September 26, 2013

Reducing Headroom
Abstract
The various embodiments of the invention provide mechanisms to
reduce headroom size while minimizing dropped packets. In general,
this is done by using a shared headroom space between all ports,
and providing a randomized delay in transmitting a flow-control
message.
Inventors: Tabatabaee; Vahid (Cupertino, CA); Kwan; Bruce (Sunnyvale, CA)

Applicant:
Name | City | State | Country | Type
Tabatabaee; Vahid | Cupertino | CA | US |
Kwan; Bruce | Sunnyvale | CA | US |
Assignee: BROADCOM CORPORATION (Irvine, CA)
Family ID: 47602745
Appl. No.: 13/428606
Filed: March 23, 2012
Current U.S. Class: 370/230; 370/236
Current CPC Class: H04L 47/266 20130101; H04L 47/30 20130101; H04L 47/29 20130101
Class at Publication: 370/230; 370/236
International Class: H04L 12/24 20060101 H04L012/24
Claims
1. In a switch for receiving packets in a packet-switching network,
a method comprising: setting an upper value (XOFF_MAX) for a
buffer, the XOFF_MAX being indicative of a buffer usage that always
triggers a transmission of a flow-control message (XOFF); setting a
lower value (XOFF_MIN) for the buffer, the XOFF_MIN being indicative
of a buffer usage that never triggers a transmission of XOFF;
determining a random value (XRAND) at a packet arrival time, the
random value being between the XOFF_MIN and the XOFF_MAX; monitoring
usage of the buffer to determine whether the buffer usage exceeds
XRAND; and transmitting a flow-control message (XOFF) when the
buffer usage exceeds XRAND.
2. A system, comprising: a buffer; and a randomized threshold value
for the buffer, the randomized threshold value to trigger
transmission of a flow-control message to a source.
3. The system of claim 2, the flow-control message being XOFF.
4. The system of claim 2, further comprising: a second buffer; and
a second randomized threshold value for the second buffer, the second
randomized threshold value to trigger a second flow-control message
to the source.
5. The system of claim 2, further comprising means for triggering
transmission of a flow-control message to a source.
6. The system of claim 2, further comprising an upper limit.
7. The system of claim 6, further comprising a random offset, the
randomized threshold being set by subtracting the random offset
from the upper limit.
8. The system of claim 7, further comprising means for subtracting
the random offset from the upper limit.
9. The system of claim 6, further comprising a lower limit, the
randomized threshold being set between the lower limit and the
upper limit.
10. The system of claim 6, further comprising means for setting the
randomized threshold.
11. The system of claim 2, further comprising a shared headroom,
the shared headroom comprising the buffer.
12. A method, comprising: determining a randomized threshold value
for a buffer; monitoring usage of the buffer; determining whether
the monitored buffer usage exceeds the randomized threshold value;
and transmitting a flow-control message when the monitored buffer
usage exceeds the randomized threshold value.
13. The method of claim 12, the flow-control message being
XOFF.
14. The method of claim 12, further comprising: determining a
second randomized threshold value for a second buffer; monitoring
usage of the second buffer; determining whether the monitored usage
of the second buffer exceeds the second randomized threshold value;
and transmitting a second flow-control message when the monitored
buffer usage of the second buffer exceeds the second randomized
threshold value.
15. The method of claim 12, further comprising: resetting a flow
control; and determining a different randomized threshold value for
the buffer.
16. The method of claim 12, the step of determining the randomized
threshold value, comprising: setting an upper limit.
17. The method of claim 16, the step of determining the randomized
threshold value, further comprising: setting a lower limit; and
setting the randomized threshold value between the upper limit and
the lower limit.
18. The method of claim 16, the step of determining the randomized
threshold value, further comprising: determining a random offset;
and setting the randomized threshold value by subtracting the
random offset from the upper limit.
Description
BACKGROUND
[0001] In packet-switching networks, switches have buffers that
facilitate lossless operation. However, when incoming packet rates
from a source are high, and data accumulates within the buffer,
packets can be dropped due to exceeding the buffer size. Insofar as
dropped packets are problematic for packet-switching networks,
there are ongoing developments that attempt to ameliorate the
problem of dropped packets.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Many aspects of the disclosure can be better understood with
reference to the following drawings. The components in the drawings
are not necessarily to scale, emphasis instead being placed upon
clearly illustrating the principles of the present disclosure.
Moreover, in the drawings, like reference numerals designate
corresponding parts throughout the several views.
[0003] FIG. 1 is a diagram of one embodiment of a buffer having a
randomized flow-control threshold.
[0004] FIG. 2 is a diagram of another embodiment of a buffer having
a different randomized flow-control threshold.
[0005] FIG. 3 is a flowchart showing one embodiment of a method for
transmitting a flow-control signal.
[0006] FIG. 4 is a flowchart showing another embodiment of a method
for transmitting a flow-control signal.
[0007] FIG. 5 is a diagram showing one embodiment of a
packet-switching architecture, which may employ the buffers of
FIGS. 1 and 2.
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0008] In packet-switching networks, switches have buffers that
facilitate lossless operation. However, when incoming packet rates
from a source are high, and data accumulates within the buffer,
packets can be dropped due to exceeding the buffer size. To
ameliorate this problem, an Ethernet switch sends a link-level
flow-control message when the data buffer usage of a particular
queue or ingress port and priority exceeds a specified threshold,
called an XOFF threshold. This flow-control message is sent to the
source to instruct the source to stop transmitting packets. Due to
delays in receiving the flow-control message by the source, the
switch can still receive frames from the source, even after
transmitting the XOFF message. In view of this delay, a portion of
the switch buffer is normally reserved and provisioned to admit the
packets that may arrive after the flow-control is set. This
reserved buffer is referred to as the lossless headroom, or,
simply, headroom.
[0009] One of the main reasons for this delay, and one of the main
drivers in provisioning the headroom, is the waiting time in the
switch for sending out the XOFF signal. Upon detection of
congestion in the switch, an XOFF message is generated. However, if
the port is already occupied with sending a packet, then the XOFF
message cannot be sent until transmission of the current outgoing
packet is finished. In the worst case, the switch will wait for a
full maximum transmission unit (MTU) size packet to depart the port
before transmitting the XOFF message. In other words, if the port
has just initiated transmission of a Jumbo packet before the flow
control message is generated, then the delay will be equal to the
time that it takes to complete transmission of the Jumbo packet.
Thus, even though the average waiting time is about half of a Jumbo
packet, the worst-case situation results in a waiting time of a full
Jumbo packet.
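The waiting-time arithmetic above can be made concrete with a short sketch; the frame size and port speed below are illustrative assumptions, since the disclosure does not fix either value.

```python
# Illustrative sketch; the disclosure does not fix an MTU or port speed.
JUMBO_BYTES = 9216   # assumed Jumbo-frame size, in bytes
LINK_BPS = 10e9      # assumed 10 Gb/s port speed

def xoff_wait_seconds(frame_bytes: int, link_bps: float) -> float:
    """Time an XOFF message may wait behind one in-flight frame."""
    return frame_bytes * 8 / link_bps

worst_case = xoff_wait_seconds(JUMBO_BYTES, LINK_BPS)  # full Jumbo frame
average = worst_case / 2                               # about half a frame
```

At these assumed values, the worst-case wait is roughly 7.4 microseconds, twice the average wait.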
[0010] One example of when this worst-case situation occurs is
during benchmark testing for switches. Under these benchmark tests,
all ingress ports transmit traffic to a single egress port, causing
simultaneous congestion on all ingress ports. At the same time, one
ingress port sends multicast Jumbo frames to all egress ports.
Therefore, it is possible that the waiting time for flow-control
messages on all ports will be almost equal, and very close to the
worst case, during these benchmark tests. For these situations,
flow-control triggering events cannot be considered independent
events.
[0011] In order for a switch to be lossless, headroom has normally
been provisioned based on these and other types of worst-case
assumptions. However, the worst-case scenario is often based on the
occurrence of a highly unlikely sequence of events. As such,
provisioning the headroom based on these worst-case events results
in headroom that is unnecessarily large for normal operation.
[0012] Current technology and methods are based on dedicated
headroom per ingress port and port group. However, as switch
sizes (i.e., number of ports, speed of ports, number of lossless
priorities) increase, this approach requires larger headroom based
on the worst-case assumptions for each ingress port and port group,
thereby resulting in large headroom reservation and low utilization
of the switch buffer. Additionally, the flow-control setting is
typically based on fixed thresholds, which results in
synchronization of the flow-control setting between different ports
and speeds. Another method that is used for controlling the headroom
size relies on setting the flow control on every port when the
switch memory buffer gets full. This method is very disruptive and
can result in throughput degradation and unfair flow controlling of
a port.
[0013] The various embodiments of the invention provide mechanisms
to reduce headroom size while minimizing dropped packets. In
general, this is done by using a shared headroom space between all
ports, and providing a randomized delay in transmitting the XOFF
message. In particular, in one embodiment, a pseudo-random
threshold is inserted for triggering the flow control on ports. The
randomized flow control offset causes triggering of the flow
control on ports to become sufficiently uncorrelated. Thus,
headroom sizing can be done based on the average waiting time for
the transmission of the XOFF message from the switch, rather than on
worst-case assumptions.
[0014] To reduce the required headroom size and to size the
headroom based on the average waiting time in the switch rather
than the worst case, one embodiment of the invention provides for a
shared headroom space between all ports and lossless priorities.
The efficiency of the shared headroom, and its advantage over
dedicated headroom per (ingress port, priority), rest on the premise
that the delay in transmitting the flow-control message for each
port, after the flow control is triggered, is a random variable that
depends on waiting until transmission of the current packet from
that port is finished. If the times at which the flow control is set
for different ports and priorities are uncorrelated (or have low
correlation), then the required headroom sizes for different ports
and priorities can also be considered uncorrelated.
[0015] With this said, reference is now made in detail to the
description of the embodiments as illustrated in the drawings.
While several embodiments are described in connection with these
drawings, there is no intent to limit the disclosure to the
embodiment or embodiments disclosed herein. On the contrary, the
intent is to cover all alternatives, modifications, and
equivalents.
[0016] FIG. 1 is a diagram of one embodiment of a buffer having a
randomized flow-control threshold. Specifically, the embodiment of
FIG. 1 shows a buffer 110, an upper threshold 130, labeled as a
deterministic threshold (XOFF_DETERMINISTIC), a randomized offset
150 (XOFF_RAND_OFFSET), and a lower threshold 140 that is derived
from the XOFF_RAND_OFFSET 150 being subtracted from the
XOFF_DETERMINISTIC 130. The XOFF_DETERMINISTIC 130 is derived in the
same way that a conventional XOFF threshold is computed in current
switches. The randomized XOFF_RAND_OFFSET 150 is derived using a
pseudo-random number generator and its range is from zero to one
maximum transmission unit (MTU). The random component is initially
computed per ingress port and priority, and uploaded. Thereafter, a
newly-generated random number is uploaded every time that the
ingress port priority resets the flow control. Therefore, in this
particular embodiment, the flow control is always set based on a
newly-selected random number. In the embodiment of FIG. 1, as
additional frames or data 120 enter the buffer 110, the buffer
usage increases. And, as the buffer usage exceeds the lower
threshold 140, the switch generates and transmits the XOFF message
to the data source. In this way, the flow control setting events on
different ports are not synchronized. Furthermore, there is no
fixed bias among ports since the offset is randomly selected for
each ingress port and port group after it is used once.
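The threshold derivation of FIG. 1 can be sketched in a few lines of Python; the XOFF_DETERMINISTIC and MTU values here are hypothetical placeholders, and a real switch would implement this in hardware per ingress port and priority.

```python
import random

MTU = 9216                    # assumed maximum transmission unit, in bytes
XOFF_DETERMINISTIC = 100_000  # hypothetical deterministic XOFF threshold

def new_threshold() -> float:
    """Draw XOFF_RAND_OFFSET in [0, MTU] and subtract it from
    XOFF_DETERMINISTIC, as done each time the flow control resets."""
    xoff_rand_offset = random.uniform(0, MTU)
    return XOFF_DETERMINISTIC - xoff_rand_offset

threshold = new_threshold()   # recomputed whenever flow control is reset

def should_send_xoff(buffer_usage: float) -> bool:
    """Transmit XOFF once buffer usage crosses the randomized threshold."""
    return buffer_usage > threshold
```

Because each (port, priority) draws its own offset, two ports with identical deterministic thresholds still trigger XOFF at different buffer depths, which is what decorrelates their flow-control events.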
[0017] FIG. 2 is a diagram of another embodiment of a buffer having
a different randomized flow-control threshold. The embodiment of
FIG. 2 is based on having two XOFF thresholds: XOFF_MIN 240 and
XOFF_MAX 230. The flow control signal is set based on the buffer
usage of an ingress port and priority using the following rules.
First, if the buffer usage is below XOFF_MIN 240, then flow control
is not set. In other words, the flow-control message is never
transmitted when data 120 in the buffer 110 is below XOFF_MIN.
Second, if the buffer usage is above XOFF_MAX 230, then flow
control is set with a probability of one. Stated differently, the
flow-control message is always transmitted when data 120 in the
buffer 110 exceeds XOFF_MAX 230. Last, if the buffer usage is at a
threshold 250 that is between XOFF_MIN 240 and XOFF_MAX 230, then
the flow control is set with a probability (which, for some
embodiments, can be a fixed probability, while for other embodiments
can be a variable probability). As such, one can see that the
probability of triggering a transmission of the flow-control
message ranges from zero to one for each buffer 110.
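The three rules of FIG. 2 can be sketched as follows; the XOFF_MIN and XOFF_MAX byte values and the fixed in-between probability are hypothetical, chosen only to illustrate the scheme.

```python
import random

XOFF_MIN = 80_000    # hypothetical lower threshold, in bytes
XOFF_MAX = 100_000   # hypothetical upper threshold, in bytes

def set_flow_control(buffer_usage: float, p_between: float = 0.25) -> bool:
    """Decide, on a cell arrival, whether to transmit XOFF.

    Below XOFF_MIN the probability is zero; above XOFF_MAX it is one;
    in between, XOFF is sent with probability p_between (fixed here,
    though some embodiments make it vary with buffer usage).
    """
    if buffer_usage < XOFF_MIN:
        return False
    if buffer_usage >= XOFF_MAX:
        return True
    return random.random() < p_between
```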
[0018] In comparison, in the embodiment of FIG. 2, the switch
generates a pseudo-random number for every cell arrival when the
buffer usage is between XOFF_MIN 240 and XOFF_MAX 230. In the
embodiment of FIG. 1, however, the switch subtracts the
XOFF_RAND_OFFSET 150 from the XOFF_DETERMINISTIC 130 only once per
flow-control reset, rather than for every cell arrival.
[0019] Various embodiments of the invention can also be viewed as
methods, for which two embodiments are shown with reference to FIG.
3 and FIG. 4. As shown in FIG. 3, one embodiment of the method
begins with the switch setting 310 an upper limit (XOFF_MAX), and
also setting 320 a lower limit (XOFF_MIN). The switch then
determines 330 a random threshold (XRAND) that resides between
XOFF_MIN and XOFF_MAX. Once the threshold is determined 330, the
switch monitors 340 buffer usage and determines 350 whether the
buffer usage exceeds XRAND. As long as the buffer usage does not
exceed XRAND, the switch continues to monitor 340 buffer usage as
packets flow in and out of the buffer. If, however, the buffer
usage exceeds XRAND, then the switch transmits 360 a flow-control
signal (XOFF), and waits until the flow-control is reset 370. Once
the flow control is reset, the switch again determines 330 a random
threshold and monitors 340 the buffer usage.
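The loop of FIG. 3 can be modeled over a trace of buffer-usage samples; the limit values below, and the use of XOFF_MIN as the reset condition, are assumptions for illustration, since the reset mechanism is left unspecified here.

```python
import random

XOFF_MIN = 80_000    # hypothetical lower limit, in bytes
XOFF_MAX = 100_000   # hypothetical upper limit, in bytes

def run_flow_control(usage_samples):
    """Return the sample indices at which XOFF would be transmitted.

    A fresh XRAND between XOFF_MIN and XOFF_MAX is drawn at the start,
    and again each time the flow control resets; the reset is modeled
    here as buffer usage falling back below XOFF_MIN.
    """
    xrand = random.uniform(XOFF_MIN, XOFF_MAX)
    xoff_set = False
    events = []
    for i, usage in enumerate(usage_samples):
        if not xoff_set and usage > xrand:
            events.append(i)       # transmit XOFF
            xoff_set = True
        elif xoff_set and usage < XOFF_MIN:
            xoff_set = False       # flow control reset
            xrand = random.uniform(XOFF_MIN, XOFF_MAX)
    return events
```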
[0020] FIG. 4 is a flowchart showing another embodiment of a method
for transmitting a flow-control signal. As shown in FIG. 4, this
embodiment begins by setting 410 an upper limit
(XOFF_DETERMINISTIC), and determining 420 a random offset
(XOFF_RAND_OFFSET). Thereafter, a buffer threshold is set 430 to a
value that is XOFF_RAND_OFFSET subtracted from XOFF_DETERMINISTIC.
The switch monitors 440 buffer usage as packets flow into and out
of the buffer, and determines 450 whether or not the buffer usage
exceeds the set 430 threshold. If the buffer usage does not exceed
the set 430 threshold, then the switch continues to monitor 440 the
buffer usage. When the buffer usage exceeds the set 430 threshold,
the switch transmits 460 a flow-control signal. Thereafter, the
switch waits until the flow control is reset 470, at which time the
switch again determines 420 a new random offset, and sets 430 a new
threshold based on the random offset.
[0021] FIG. 5 is a diagram showing one embodiment of a
packet-switching architecture, which can employ the buffers of
FIGS. 1 and 2, or employ the methods of FIGS. 3 and 4. As shown in
FIG. 5, the packet-switching architecture includes a plethora of
components that are operatively coupled to a network 505 (e.g., the
Internet). In some embodiments, the architecture includes multiple
server racks 515, 535, 555, each having a bank of servers 510a . .
. 510n (collectively 510), 530a . . . 530n (collectively 530), 550a
. . . 550n (collectively 550). Each server rack 515, 535, 555 is
operatively coupled to its respective top-of-the-rack (TOR) switch
520, 540, 560, which allows the servers 510, 530, 550 to transmit
and receive data packets through their respective TOR switches 520,
540, 560. The TOR switches 520, 540, 560 are, in turn, operatively
coupled to aggregators 570, 580, which allow the TOR switches 520,
540, 560 to access the network 505 through the aggregators 570,
580. Each switch includes one or more buffers, such as those shown
in FIG. 1 or 2.
[0022] Insofar as each TOR switch 520, 540, 560 has access to both
of the aggregators 570, 580, data packets from one server 510a can
reach another server 550n through many different circuitous paths.
For example, data packets can travel from an originating server
510a, through its TOR switch 520, then through one of the
aggregators 570, to another TOR switch 560, eventually arriving at
an endpoint server 550n. Alternatively, the data packets can travel
from the originating server 510a, through its TOR switch 520, then
through the other aggregator 580, to the other TOR switch 560, to
arrive at the endpoint server 550n. Given that the data traffic
through the switches can be enormous, the reduction in headroom,
which can be accomplished by employing the buffers as shown in FIG.
1 or 2, can be quite significant.
[0023] As one can see from the embodiments of FIGS. 1 through 4,
the various embodiments of the invention provide for shared
headroom to reduce the memory required for lossless switches.
Additionally, the disclosed embodiments have the advantage of
provisioning the headroom in a way that is based on an average
waiting time to transmit the flow control message, rather than on a
worst-case-situation. Also, for some embodiments, the flow control
can be set incrementally, if the shared headroom is getting full to
reduce the frequency of setting the flow control when there is
short term congestion. Additionally, the proposed mechanisms are
simple and are amenable to hardware implementation. Furthermore,
the number of new attributes in the switch that should be set are
limited and are easy to provide guidance to the users and
customers.
[0024] The randomized threshold may be implemented in hardware,
software, firmware, or a combination thereof. In the preferred
embodiment(s), the randomized threshold is implemented in hardware
using any or a combination of the following technologies, which are
all well known in the art: a discrete logic circuit(s) having logic
gates for implementing logic functions upon data signals, an
application specific integrated circuit (ASIC) having appropriate
combinational logic gates, a programmable gate array(s) (PGA), a
field programmable gate array (FPGA), etc. In an alternative
embodiment, the randomized threshold is implemented in software or
firmware that is stored in a memory and that is executed by a
suitable instruction execution system.
[0025] Any process descriptions or blocks in flow charts should be
understood as representing modules, segments, or portions of code
which include one or more executable instructions for implementing
specific logical functions or steps in the process, and alternate
implementations are included within the scope of the preferred
embodiment of the present disclosure in which functions may be
executed out of order from that shown or discussed, including
substantially concurrently or in reverse order, depending on the
functionality involved, as would be understood by those reasonably
skilled in the art of the present disclosure.
[0026] Although exemplary embodiments have been shown and
described, it will be clear to those of ordinary skill in the art
that a number of changes, modifications, or alterations to the
disclosure as described may be made. For example, multiple parallel
implementations of the different embodiments can exist in a switch
for the different entities that set the flow control (e.g., queues,
ingress ports, etc.). Furthermore, it should be appreciated that
multiple shared headrooms can be employed in a switch. For example,
one shared headroom can be used for low-priority traffic, while
another shared headroom can be used for high-priority traffic. All
such changes, modifications, and alterations should therefore be
seen as within the scope of the disclosure.
* * * * *