U.S. patent application number 15/611283 was published by the patent office on 2018-12-06 as application 20180351855 for an all-or-none switchover to address split-brain problems in multi-chassis link aggregation groups. The applicant listed for this patent is Ciena Corporation. The invention is credited to Hossein BAHERI, Vijay Mohan CHANDRA MOHAN, Wei-Chiuan CHEN, Leela Sankar GUDIMETLA, and Ankit SOOD.

United States Patent Application 20180351855
Kind Code: A1
SOOD; Ankit; et al.
December 6, 2018

ALL-OR-NONE SWITCHOVER TO ADDRESS SPLIT-BRAIN PROBLEMS IN
MULTI-CHASSIS LINK AGGREGATION GROUPS
Abstract
Systems and methods utilize an all-or-none switchover to prevent
split-brain problems in a Multi-Chassis Link Aggregation Group
(MC-LAG) network. A standby node in the MC-LAG network can perform
the steps of remaining in a standby state responsive to a loss of
adjacency with an active node, wherein, in the standby state, all
standby links between the standby node and a common endpoint are
non-distributing; monitoring frames transmitted by the common
endpoint to the standby node over the standby links; and
determining based on the monitoring frames whether all active links
between the active node and the common endpoint have failed and
entering an active state with all the standby links distributing
based thereon.
Inventors: SOOD; Ankit (San Jose, CA); BAHERI; Hossein (Monte Sereno, CA); GUDIMETLA; Leela Sankar (San Jose, CA); CHANDRA MOHAN; Vijay Mohan (San Jose, CA); CHEN; Wei-Chiuan (San Jose, CA)
Applicant: Ciena Corporation, Hanover, MD, US
Family ID: 64459042
Appl. No.: 15/611283
Filed: June 1, 2017
Current U.S. Class: 1/1
Current CPC Class: H04L 45/28 20130101; H04L 43/0811 20130101; H04L 45/245 20130101; H04L 41/0668 20130101; Y02D 30/50 20200801
International Class: H04L 12/709 20060101 H04L012/709; H04L 12/721 20060101 H04L012/721; H04L 12/24 20060101 H04L012/24
Claims
1. A method utilizing all-or-none switchover to prevent split-brain
problems in a Multi-Chassis Link Aggregation Group (MC-LAG) network
implemented by a standby node, the method comprising: remaining in
a standby state responsive to a loss of adjacency with an active
node, wherein, in the standby state, all standby links between the
standby node and a common endpoint are non-distributing; monitoring
frames transmitted by the common endpoint to the standby node over
the standby links; and determining based on the monitoring frames
whether all active links between the active node and the common
endpoint have failed and entering an active state with all the
standby links distributing based thereon.
2. The method of claim 1, further comprising: determining based on
the monitoring frames whether less than all of the active links
have failed and remaining in the standby state and continuing
monitoring the frames transmitted by the common endpoint over the
standby links based thereon.
3. The method of claim 1, wherein the monitoring checks for a
presence of SYNC bits from the common endpoint with each SYNC bit
set to TRUE indicative of a switch by the common endpoint of one of
the active links to one of the standby links.
4. The method of claim 1, wherein the common endpoint is
communicatively coupled to both the active node and the standby
node in an active/standby triangle topology.
5. The method of claim 1, wherein the common endpoint is configured
to operate Link Aggregation Control Protocol (LACP) and an N:N
link-level redundancy between the active node and the standby
node.
6. The method of claim 1, wherein the common endpoint is unaware
the active node and the standby node are in separate network
elements.
7. The method of claim 1, wherein the loss of adjacency with the
active node is based on a failure or fault on a link between the
active node and the standby node used for coordination of the
active node and the standby node in a Redundant Group, while the
active node and the standby node are both operational.
8. A standby node in a Multi-Chassis Link Aggregation Group
(MC-LAG) network configured with all-or-none switchover to prevent
split-brain problems, the standby node comprising: a plurality of
ports in a logical Link Aggregation Group (LAG) with an active
node, wherein the plurality of ports form standby links with a
common endpoint; a communication link with an active node; and a
switching fabric between the plurality of ports, wherein the
standby node is configured to remain in a standby state responsive
to a loss of the communication link, wherein, in the standby state,
all the standby links are non-distributing; monitor frames
transmitted by the common endpoint to the standby node over the
standby links; and determine based on the monitored frames whether
all active links between the active node and the common endpoint
have failed and enter an active state with all the standby links
distributing based thereon.
9. The standby node of claim 8, wherein the standby node is further
configured to determine based on the monitoring frames whether less
than all of the active links have failed and remain in the standby
state and continue monitoring the frames transmitted by the common
endpoint over the standby links based thereon.
10. The standby node of claim 8, wherein the frames are monitored
to check for a presence of SYNC bits from the common endpoint with
each SYNC bit set to TRUE indicative of a switch by the common
endpoint of one of the active links to one of the standby
links.
11. The standby node of claim 8, wherein the common endpoint is
communicatively coupled to both the active node and the standby
node in an active/standby triangle topology.
12. The standby node of claim 8, wherein the common endpoint is
configured to operate Link Aggregation Control Protocol (LACP) and
an N:N link-level redundancy between the active node and the
standby node.
13. The standby node of claim 8, wherein the common endpoint is
unaware the active node and the standby node are in separate
network elements.
14. The standby node of claim 8, wherein the loss of adjacency with
the active node is based on a failure or fault on the communication
link, while the active node and the standby node are both
operational.
15. An apparatus configured for all-or-none switchover to prevent
split-brain problems in a Multi-Chassis Link Aggregation Group
(MC-LAG) network located at a standby node, the apparatus
comprising: circuitry configured to remain in a standby state
responsive to a loss of adjacency with an active node, wherein, in
the standby state, all standby links between the standby node and a
common endpoint are non-distributing; circuitry configured to
monitor frames transmitted by the common endpoint to the standby
node over the standby links; and circuitry configured to determine
based on the monitored frames whether all active links between the
active node and the common endpoint have failed and enter an active
state with all the standby links distributing based thereon.
16. The apparatus of claim 15, further comprising: circuitry
configured to determine based on the monitored frames whether less
than all of the active links have failed and remain in the standby
state and continue monitoring the frames transmitted by the common
endpoint over the standby links based thereon.
17. The apparatus of claim 15, wherein the circuitry configured to
monitor checks for a presence of SYNC bits from the common endpoint
with each SYNC bit set to TRUE indicative of a switch by the common
endpoint of one of the active links to one of the standby
links.
18. The apparatus of claim 15, wherein the common endpoint is
communicatively coupled to both the active node and the standby
node in an active/standby triangle topology.
19. The apparatus of claim 15, wherein the common endpoint is
configured to operate Link Aggregation Control Protocol (LACP) and
an N:N link-level redundancy between the active node and the
standby node.
20. The apparatus of claim 15, wherein the common endpoint is
unaware the active node and the standby node are in separate
network elements.
Description
FIELD OF THE DISCLOSURE
[0001] The present disclosure generally relates to networking
systems and methods. More particularly, the present disclosure
relates to systems and methods performing an all-or-none switchover
to address split-brain problems in Multi-Chassis Link Aggregation
Groups (MC-LAGs).
BACKGROUND OF THE DISCLOSURE
[0002] Link aggregation relates to combining various network
connections in parallel to increase throughput, beyond what a
single connection could sustain, and to provide redundancy between
the links. Link aggregation including the Link Aggregation Control
Protocol (LACP) for Ethernet is defined in IEEE 802.1AX, IEEE
802.1aq, IEEE 802.3ad, as well in various proprietary solutions.
IEEE 802.1AX-2008 and IEEE 802.1AX-2014 are entitled Link
Aggregation, the contents of which are incorporated by reference.
IEEE 802.1aq-2012 is entitled Shortest Path Bridging, the contents
of which are incorporated by reference. IEEE 802.3ad-2000 is
entitled Link Aggregation, the contents of which are incorporated
by reference. Multi-Chassis Link Aggregation Group (MC-LAG) is a
type of LAG with constituent ports that terminate on separate
chassis, primarily for the purpose of providing nodal redundancy in
the event one of the chassis fails. The relevant standards for LAG
do not mention MC-LAG, but do not preclude it. MC-LAG
implementation varies by vendor.
[0003] LAG is a technique for inverse multiplexing over multiple
Ethernet links, thereby increasing bandwidth and providing
redundancy. IEEE 802.1AX-2008 states "Link Aggregation allows one
or more links to be aggregated together to form a Link Aggregation
Group, such that a MAC (Media Access Control) client can treat the
Link Aggregation Group as if it were a single link." This layer 2
transparency is achieved by LAG using a single MAC address for all
the device's ports in the LAG group. LAG can be configured as
either static or dynamic. Dynamic LAG uses a peer-to-peer protocol
for control, called Link Aggregation Control Protocol (LACP). This
LACP protocol is also defined within the 802.1AX-2008 standard the
entirety of which is incorporated herein by reference.
[0004] LAG can be implemented in multiple ways, namely LAG N and
LAG N+N/M+N. LAG N is the load sharing mode of LAG and LAG N+N/M+N
provides the redundancy. The LAG N protocol automatically
distributes and load balances the traffic across the working links
within a LAG, thus maximizing the use of the group if Ethernet
links go down or come back up, providing improved resilience and
throughput. For a different style of resilience between two nodes,
a complete implementation of the LACP protocol supports separate
worker/standby LAG subgroups. For LAG N+N, the worker links as a
group will fail over to the standby links if one, several, or all
of the links in the worker group fail. Note, LACP marks links as in
standby mode using an "out of sync" flag.
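As an illustrative sketch of the load-sharing behavior described above (the function and field names here are assumptions, not taken from any standard), a LAG N implementation might hash a flow key over the set of working links:

```python
import zlib

# Hypothetical LAG N load-sharing sketch: map each flow to one of the
# currently working links. Real implementations hash MAC/IP/port tuples
# in hardware; crc32 here is purely illustrative.
def pick_link(flow_key: bytes, links):
    up = [link for link in links if link["up"]]  # working links only
    if not up:
        raise RuntimeError("no distributing links in the LAG")
    # A deterministic hash keeps a given flow pinned to one link,
    # avoiding frame reordering within the flow.
    return up[zlib.crc32(flow_key) % len(up)]
```

Because the hash is taken only over links currently up, a link failure automatically redistributes its flows across the surviving members of the group.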
[0005] Advantages of Link Aggregation include increased
throughput/bandwidth (physical link capacity multiplied by the
number of physical links), load balancing across aggregated links,
and link-level redundancy (failure of a link does not result in a
traffic drop; rather, standby links can take over the active role
for traffic distribution). One of the limitations of Link Aggregation is that
it does not provide node-level redundancy. If one end of a LAG
fails, it leads to a complete traffic drop as there is no other
data path available for the data traffic to be switched to the
other node. To solve this problem, "Multi-Chassis" Link Aggregation
Group (MC-LAG) was introduced, providing node-level redundancy
in addition to the link-level redundancy and other merits provided by
LAG.
[0006] MC-LAG allows two or more nodes (referred to herein as a
Redundant Group (RG)) to share a common LAG endpoint (Dual Homing
Device (DHD)). The multiple nodes present a single logical LAG to
the remote end. Note that MC-LAG implementations are
vendor-specific, but cooperating chassis remain externally
compliant to the IEEE 802.1AX-2008 standard. Nodes in an MC-LAG
cluster communicate to synchronize and negotiate automatic
switchovers (failover). Some implementations may support
administrator-initiated (manual) switchovers.
[0007] The multiple nodes in the redundant group maintain some form
of adjacency with one another, such as the Inter-Chassis
Communication Protocol (ICCP). Since the redundant group requires
the adjacency to operate the MC-LAG, a loss in the adjacency (for
any reason including a link fault, a nodal fault, etc.) results in
a so-called split-brain problem where all peers in the redundant
group attempt to take an active role considering corresponding
peers as operationally down. This can lead to the introduction of
loops in the MC-LAG network and result in the rapid duplication of
packets.
[0008] Thus, there is a need for a solution to the split-brain
which is solely implemented between the RG members that are
interoperable with any vendor supporting standard LACP on the DHD
and which does not increase switchover time.
BRIEF SUMMARY OF THE DISCLOSURE
[0009] There are some conventional solutions to addressing this
problem. One conventional solution introduces configuration changes
on the common LAG endpoint where the DHD detects the split-brain
and configures packet flow accordingly. However, this solution is a
proprietary solution requiring the DHD to participate in the
MC-LAG. It would be advantageous to avoid configuration on the DHD
due to the split-brain problem since the DHD may or may not be
aware of the MC-LAG; preferably, the DHD simply believes it is
participating in a conventional LAG supporting standard LACP.
Another conventional solution includes changing the system MACs on
RG members during a split-brain along with the use of an
out-of-band management channel as a backup to verify communication
between the RG members. However, this solution may lead to a
significant switchover time since the underlying LACP would have to
re-converge with the new system MACs.
[0010] In an embodiment, a method utilizing all-or-none switchover
to prevent split-brain problems in a Multi-Chassis Link Aggregation
Group (MC-LAG) network implemented by a standby node includes
remaining in a standby state responsive to a loss of adjacency with
an active node, wherein, in the standby state, all standby links
between the standby node and a common endpoint are
non-distributing; monitoring frames transmitted by the common
endpoint to the standby node over the standby links; and
determining based on the monitoring frames whether all active links
between the active node and the common endpoint have failed and
entering an active state with all the standby links distributing
based thereon. The method can further include determining based on
the monitoring frames whether less than all of the active links
have failed and remaining in the standby state and continuing
monitoring the frames transmitted by the common endpoint over the
standby links based thereon. The monitoring can check for a
presence of SYNC bits from the common endpoint with each SYNC bit
set to TRUE indicative of a switch by the common endpoint of one of
the active links to one of the standby links. The common endpoint
can be communicatively coupled to both the active node and the
standby node in an active/standby triangle topology.
[0011] The common endpoint can be configured to operate Link
Aggregation Control Protocol (LACP) and an N:N link-level
redundancy between the active node and the standby node. The common
endpoint can be unaware the active node and the standby node are in
separate network elements. The loss of adjacency with the active
node can be based on a failure or fault on a link between the
active node and the standby node used for coordination of the
active node and the standby node in a Redundant Group, while the
active node and the standby node are both operational.
[0012] In another embodiment, a standby node in a Multi-Chassis
Link Aggregation Group (MC-LAG) network configured with all-or-none
switchover to prevent split-brain problems includes a plurality of
ports in a logical Link Aggregation Group (LAG) with an active
node, wherein the plurality of ports form standby links with a
common endpoint; a communication link with an active node; and a
switching fabric between the plurality of ports, wherein the
standby node is configured to remain in a standby state responsive
to a loss of the communication link, wherein, in the standby state,
all the standby links are non-distributing; monitor frames
transmitted by the common endpoint to the standby node over the
standby links; and determine based on the monitored frames whether
all active links between the active node and the common endpoint
have failed and enter an active state with all the standby links
distributing based thereon.
[0013] The standby node can be further configured to determine
based on the monitoring frames whether less than all of the active
links have failed and remain in the standby state and continue
monitoring the frames transmitted by the common endpoint over the
standby links based thereon. The frames can be monitored to check
for a presence of SYNC bits from the common endpoint with each SYNC
bit set to TRUE indicative of a switch by the common endpoint of
one of the active links to one of the standby links. The common
endpoint can be communicatively coupled to both the active node and
the standby node in an active/standby triangle topology. The common
endpoint can be configured to operate Link Aggregation Control
Protocol (LACP) and an N:N link-level redundancy between the active
node and the standby node. The common endpoint can be unaware the
active node and the standby node are in separate network elements.
The loss of adjacency with the active node can be based on a
failure or fault on the communication link, while the active node
and the standby node are both operational.
[0014] In a further embodiment, an apparatus configured for
all-or-none switchover to prevent split-brain problems in a
Multi-Chassis Link Aggregation Group (MC-LAG) network located at a
standby node includes circuitry configured to remain in a standby
state responsive to a loss of adjacency with an active node,
wherein, in the standby state, all standby links between the
standby node and a common endpoint are non-distributing; circuitry
configured to monitor frames transmitted by the common endpoint to
the standby node over the standby links; and circuitry configured
to determine based on the monitored frames whether all active links
between the active node and the common endpoint have failed and
enter an active state with all the standby links distributing based
thereon.
[0015] The apparatus can further include circuitry configured to
determine based on the monitored frames whether less than all of
the active links have failed and remain in the standby state and
continue monitoring the frames transmitted by the common endpoint
over the standby links based thereon. The circuitry configured to
monitor can check for a presence of SYNC bits from the common
endpoint with each SYNC bit set to TRUE indicative of a switch by
the common endpoint of one of the active links to one of the
standby links. The common endpoint can be communicatively coupled
to both the active node and the standby node in an active/standby
triangle topology. The common endpoint can be configured to operate
Link Aggregation Control Protocol (LACP) and an N:N link-level
redundancy between the active node and the standby node. The common
endpoint can be unaware the active node and the standby node are in
separate network elements.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The proposed solution is illustrated and described herein
with reference to the various drawings, in which like reference
numbers are used to denote like system components/method steps, as
appropriate, and in which:
[0017] FIG. 1 illustrates an active/standby Multi-Chassis Link
Aggregation Group (MC-LAG);
[0018] FIG. 2 illustrates the MC-LAG of FIG. 1 with a fault and
associated node-level redundancy;
[0019] FIG. 3 illustrates the MC-LAG of FIG. 1 with the
Inter-Chassis Communication Protocol (ICCP) link failed and
associated operation with no other faults;
[0020] FIG. 4 illustrates the MC-LAG of FIG. 1 with the ICCP link
failed and associated operation with a fault on one of the active
links causing the split-brain problem of the prior art;
[0021] FIG. 5 illustrates the MC-LAG of FIG. 1 with the ICCP link
failed and associated operation with a fault on any but the last
active link in an all-or-none (AON) switchover to prevent the
split-brain problem in accordance with an embodiment of the
proposed solution;
[0022] FIG. 6 illustrates the MC-LAG of FIG. 1 with the ICCP link
failed and associated operation with a fault on all of the active
links in the AON switchover in accordance with an embodiment of the
proposed solution;
[0023] FIG. 7 illustrates a flowchart of an AON switchover process
in accordance with an embodiment of the proposed solution
implemented by the standby RG member node subsequent to the loss of
connectivity with the active Redundant Group (RG) member node such
as due to the fault on the ICCP link; and
[0024] FIG. 8 illustrates an example network element for the
proposed systems and methods described herein.
DETAILED DESCRIPTION OF THE DISCLOSURE
[0025] In various embodiments, the present disclosure relates to
systems and methods performing an all-or-none switchover to address
split-brain problems in Multi-Chassis Link Aggregation Groups
(MC-LAGs). In particular, the systems and method solve the
split-brain problem in an active/standby MC-LAG in a triangle
topology (a DHD connected to a plurality of RG members). The
proposed systems and methods are implemented between the RG members
only without the involvement of the DHD; thus, the systems and
methods can interoperate with any vendor's DHD. Also, the systems
and methods do not change system MAC addresses thereby avoiding
increased switchover time.
Active/Standby MC-LAG
[0026] FIG. 1 illustrates an active/standby MC-LAG 10. MC-LAG 10
simply means dual-homing an endpoint to two or more upstream
devices, i.e., allowing two or more upstream nodes to share a
common endpoint thereby providing node-level redundancy. The MC-LAG
10 includes a Redundant Group (RG) 12 which includes RG member
nodes 14, 16 which are the two or more upstream devices. The common
endpoint is a Dual Homing Device (DHD) 18. The nodes 14, 16 and the
DHD 18 can be Ethernet switches, routers, packet-optical devices,
etc. supporting Layer 2 connectivity. The multiple nodes 14, 16 in
the RG 12 present a single logical LAG interface 20 which is an
MC-LAG to a DHD LAG 22. Specifically, the nodes 14, 16 each have a
separate LAG 24, 26 which are logically operated as the logical LAG
interface 20 based on adjacency and coordination between the nodes
14, 16. In this manner, the RG 12 can appear to the DHD 18 as a
single node with the logical LAG interface 20.
[0027] In order to present the RG 12 as the logical LAG interface
20, the nodes 14, 16 rely on LACP as an underlying communication
protocol between one another. The nodes 14, 16 can exchange their
configuration and dynamic state data over an Inter-Chassis
Communication Protocol (ICCP) link 28. Again, the nodes 14, 16 are
different physical network elements which can be in the same
location or in different locations. In either situation, the nodes
14, 16 are interconnected via a network 30, such as a G.8032
Ethernet network, a Multiprotocol Label Switching (MPLS) network,
or the like. The ICCP link 28 can be a physical connection in the
network 30. Also, the ICCP link 28 can be a dedicated link between
the nodes 14, 16 such as when they are in the same location or
chassis.
[0028] RG 12 implementation is typically vendor-specific, i.e., not
specified by the relevant LAG standards. However, in general, the
objective of the RG 12 is to present the nodes 14, 16 and the
logical LAG interface 20 as a single virtual endpoint to a
standards-based LAG DHD 18. Various vendors use different
terminology for the MC-LAG, including MLAG, distributed split
multi-link trunking, multi-chassis trunking, etc. The
proposed systems and methods described herein can apply to any
implementation of the RG 12 and seek to avoid coordination with the
DHD 18 such that the RG 12 appears to any LAG-compliant DHD 18 as
the single logical LAG interface 20. Also, other terminology may be
used for the ICCP link 28, but the objective is the same--to enable
adjacency and coordination between the nodes 14, 16.
[0029] The ICCP link 28 can be monitored via keep-alive message
exchanges that determine whether the link is operational. For faster ICCP Link
Failure detection/recovery, Connectivity Fault Management (CFM) or
Bidirectional Forwarding Detection (BFD) services can be configured
across the RG member nodes 14, 16.
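A minimal sketch of such keep-alive monitoring, with illustrative interval and miss-limit values (the class and its parameters are assumptions for illustration, not part of ICCP, CFM, or BFD):

```python
import time

class KeepAliveMonitor:
    """Hypothetical heartbeat monitor for an ICCP-style adjacency link."""

    def __init__(self, interval=1.0, miss_limit=3):
        self.interval = interval      # expected heartbeat period (seconds)
        self.miss_limit = miss_limit  # missed heartbeats before declaring down
        self.last_rx = time.monotonic()

    def on_heartbeat(self):
        # Called whenever a keep-alive frame arrives from the peer
        self.last_rx = time.monotonic()

    def link_down(self, now=None):
        # Link is deemed operationally down after miss_limit silent intervals
        now = time.monotonic() if now is None else now
        return (now - self.last_rx) > self.interval * self.miss_limit
```

CFM or BFD serve the same role as `link_down` here, but with hardware-assisted detection at millisecond timescales rather than a software timer.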
[0030] In the example of FIG. 1, the DHD 18 includes four ports 32
into the LAG 22, two ports 34 are active and connected to the LAG
26 and two ports 36 that are standby connected to the LAG 24. In
this manner, the MC-LAG 10 is an active/standby MC-LAG. From the
perspective of the DHD 18, the four ports 32 appear as a standard
LAG, and the DHD 18 is unaware that the ports 34, 36 terminate on
separate nodes 14, 16. The ICCP link 28 coordination between the RG
member nodes 14, 16 causes them to appear as a single node from the
DHD 18's perspective.
[0031] FIG. 2 illustrates the MC-LAG 10 with a fault 50 and
associated node-level redundancy. Specifically, FIG. 2 shows
two states 52, 54 that illustrate how node-level redundancy is
performed. At the state 52, the ports 34 are active such that the
node 14 is the active RG member node and the ports 36 are standby
such that the node 16 is the standby RG member node. In LACP, the
ports 34, 36 exchange frames (LACPDUs--LACP Protocol Data
Units) carrying SYNC bits between the DHD 18 and the nodes 14, 16.
Prior to the fault 50, the ports 34 have the LACPDU SYNC bits set
to 1 indicating the ports 34 are active and the ports 36 have the
LACPDU SYNC bits set to 0 indicating the ports 36 are standby.
[0032] At step 60-1, assume the node 14 fails, and the active RG
member node's failure causes protection switching of traffic to the
standby RG member node 16. As soon as the standby RG member node 16
loses connectivity with the active RG member node 14 (the ICCP link 28
failure in step 60-2 due to the fault 50), the standby RG member
node 16 takes the active role by setting the SYNC bit=1 on all its
member ports 36 at step 60-3. Since the DHD 18 also gets a link
failure for all active links on the ports 34 at step 60-4, all the
standby links on the DHD 18 take the active role by setting their
SYNC bit=1 at step 60-5. This makes the backup links "distributing"
and hence, traffic switches to the new active RG member node 16
(node-level redundancy).
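The failover sequence in steps 60-1 through 60-5 can be modeled, purely for illustration, with SYNC bits as simple flags (the port model below is a hypothetical sketch, not an LACP implementation):

```python
class Port:
    """Toy LACP port model: sync=1 is active/distributing, 0 is standby."""
    def __init__(self, name, sync=0):
        self.name = name
        self.sync = sync

def node_failover(standby_node_ports, dhd_standby_ports):
    """Model steps 60-2 through 60-5 of the node-level redundancy above."""
    for p in standby_node_ports:   # standby RG member takes the active role
        p.sync = 1
    for p in dhd_standby_ports:    # DHD promotes its standby links too,
        p.sync = 1                 # having seen all active links fail
    # The backup links distribute once both ends advertise SYNC=1
    return all(p.sync == 1 for p in standby_node_ports + dhd_standby_ports)
```

The key point modeled here is that both ends of each standby link must advertise SYNC=1 before traffic moves, which is why the DHD's own link-failure detection (step 60-4) is a necessary half of the switchover.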
Split-Brain in Active/Standby MC-LAG Triangle Topology
[0033] An MC-LAG supports triangle, square, and mesh topologies.
Particularly, the disclosure herein focuses on the split-brain
problem and solution in the MC-LAG triangle topology such that the
DHD 18 is not required to participate in the diagnosis or
correction and such that the ports 34, 36 do not require new MAC
addresses.
[0034] The split-brain problem is an industry-wide known problem
that happens in the case of dual homing. It may occur when
communication between two MC-LAG nodes 14, 16 is lost (i.e., the
ICCP link 28 failed/operational down) while both the nodes 14, 16
are still up and operational. When the split-brain problem happens,
both the nodes 14, 16, being no longer aware of each other's
existence, try to take the active role, each considering the other
operationally down. This can lead to the introduction of loops in
the MC-LAG 10 network and can result in rapid duplication of packets at
the DHD 18.
[0035] The ICCP link 28 communication can be lost between the nodes
14, 16 for various reasons, such as misconfigurations, network
congestion, network errors, hardware failures, etc. For
misconfigurations, example problems can include configuring or
administratively enabling the ICCP link 28 only on one RG member
node 14, 16, configuring different ICCP heartbeat interval or
timeout multiplier on the RG member nodes 14, 16, incorrectly
configuring CFM or BFD Monitoring over the ICCP link 28,
configuring CFM Maintenance End Points (MEPs) incorrectly that may
result in MEP Faults (MEP Faults will be propagated to the ICCP
link 28 deeming the ICCP link 28 operationally down), etc. Network
congestion may lead to CFM/BFD/ICCP frame loss that in turn may
cause the ICCP link 28 to appear operationally down while some data
traffic may still be switched across. For network errors, high bit
errors may result in CFM/BFD/ICCP packet drops. For hardware
failure, Operations, Administration, and Maintenance (OAM) engine
failures may result in faults in the ICCP link 28 monitoring. For
example, the OAM engine may be implemented in hardware as a Field
Programmable Gate Array (FPGA), a Network Processor Unit (NPU), an
Application Specific Integrated Circuit (ASIC), etc.
[0036] FIG. 3 illustrates the MC-LAG 10 with the ICCP link 28
failed and associated operation with no other faults. At step
100-1, there is a fault 102 that causes the ICCP link 28 to fail.
The reason for fault 102 is irrelevant. At step 100-2, since the
ICCP link 28 connectivity is lost between the RG member nodes 14,
16, both the RG member nodes 14, 16 try to take the active role by
setting the SYNC bit to 1 on all their member ports 34, 36. The
node 14 already is the active node, so the node 14 does not change
the SYNC bit, but the node 16 is in standby and goes into
standalone active at step 100-3.
[0037] This scenario, however, does not cause the split-brain
problem to occur because of the configured link-level redundancy
(N:N) on the DHD 18. Since all N links on the ports 34 from the
active RG member node 14 are active, the DHD 18 does not set its
SYNC bit on the N standby links on the ports 36 at step 100-4. This
prevents the standby path from going to the distribution state even
though standby RG member node 16 (after taking the new active role)
sets the SYNC Bit to 1 on the backup path.
[0038] FIG. 4 illustrates the MC-LAG 10 with the ICCP link 28
failed and associated operation with a fault 104 on one of the
active links (34) causing the split-brain problem. At step 150-1,
there is fault 102 that causes the ICCP link 28 to fail. Again, the
fault 102 could be for any reason. At step 150-2, since the ICCP
link 28 connectivity is lost between the RG member nodes 14, 16,
both the RG member nodes 14, 16 try to take the active role by
setting the SYNC bit to 1 on all their member ports 34, 36.
[0039] An issue, however, arises if any distributing link fails on
the ports 34 between the DHD 18 and the active RG member node 14.
At step 150-3, the fault 104 causes a failure on one of the ports
34, so this port can no longer send LACPDUs with its SYNC bit set. In this
scenario, the DHD 18, unaware of the fault 102 affecting the ICCP
link 28, selects one of the standby links on the ports 36 to take
an active role and sets its SYNC Bit to 1 at step 150-4.
[0040] The SYNC bit has already been set to 1 on the standby RG
member node 16 because of the ICCP link 28 fault 102. Thus, the
backup path on the ports 36 goes to the distribution state. Since,
there is at least one link distributing from the DHD 18 to both the
RG member nodes 14, 16; it results in the formation of a loop
resulting in packet duplication towards the DHD at step 150-5. The
result is the split-brain problem where the member nodes 14, 16
cause the loop due to their lack of adjacency and coordination. The
split-brain problem can only occur when there is more than one
physical port between the DHD 18 and each RG member node 14, 16.
If there is only one physical port between the DHD 18 and each
RG member node 14, 16, the DHD 18's 1:1 redundancy ensures that
only one port can be active at any point in time, thus preventing an
active-active situation. However, N:N/M:N redundancy
is desired over 1:1 redundancy and employing N:N/M:N redundancy
exposes the arrangement to the split-brain problem.
All-or-None Switchover in Split-Brain in Active/Standby MC-LAG
Triangle Topology
[0041] FIGS. 5 and 6 illustrate the MC-LAG 10 with the ICCP link 28
failed and associated operation with a fault 104 on one of the
active links with an all-or-none (AON) switchover to prevent the
split-brain problem in accordance with the proposed solution.
Specifically, FIG. 5 illustrates the MC-LAG 10 with the ICCP link
28 failed and associated operation with a fault 104 on any but the
last active link (34) in the AON switchover. FIG. 6 illustrates
the MC-LAG 10 with the ICCP link 28 failed and associated operation
with a fault 104 on all of the active links in the AON
switchover.
[0042] The AON switchover can be implemented by each of the RG
member nodes 14, 16 with the restriction that the standby RG member
node 16 will only take the active role when all of the active links
(34) on the active RG member node 14 fail. Of course, the RG member
nodes 14, 16 cannot coordinate this with one another due to the
fault 102 and the lack of adjacency. Instead, this is achieved by
making optimal use of the SYNC bit as employed by the DHD 18. When the
ICCP link 28 goes down operationally, the standby RG member node 16
will not set its members' SYNC bits to 1 immediately, but will instead
rely on the SYNC bits received from the DHD 18's ports to set its own
members' SYNC bits. The standby RG member node 16 will set its ports'
SYNC bits to 1 only if it receives SYNC bit=1 on all the operational
ports from the DHD 18.
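By way of non-limiting illustration, the rule just described can be sketched as a small predicate. The `StandbyPort` structure and function name below are hypothetical simplifications assumed for illustration only; they are not taken from the claimed implementation.

```python
# Illustrative sketch of the AON rule: after loss of the ICCP link,
# the standby RG member node sets its SYNC bits to 1 only when every
# operational standby port is receiving SYNC=1 in LACPDUs from the DHD.
# The data structures here are hypothetical, not from the patent.

from dataclasses import dataclass

@dataclass
class StandbyPort:
    operational: bool      # link is up
    rx_sync_from_dhd: int  # SYNC bit last received from the DHD (0 or 1)

def standby_should_go_active(ports):
    """Return True only if ALL operational standby ports see SYNC=1."""
    operational = [p for p in ports if p.operational]
    return bool(operational) and all(p.rx_sync_from_dhd == 1 for p in operational)
```

With N:N redundancy, a single SYNC=1 from the DHD (one active link failed) leaves the predicate false; only when every operational standby port reports SYNC=1 does the standby node take the active role.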
[0043] The AON switchover eliminates a loop during a split-brain
situation where the MC-LAG 10 is configured with N:N link redundancy
and there is no link failure on the standby path (on the ports 36).
With the AON switchover, when the ICCP link 28 fails, the standby
RG member node 16 will not go active and will keep the SYNC bits at
FALSE (0) while continuing to monitor the SYNC bits coming from the
DHD 18. Again, the DHD 18 may not know it is part of the MC-LAG 10 but
may instead assume this is a standard LAG. This AON switchover approach
does not require the DHD 18 to have a special configuration, but only
to operate standard LACP. Further, the AON switchover does not
require new MAC addresses and/or re-convergence.
[0044] If the RG member nodes 14, 16 are upgraded at runtime to employ
the functionality of the proposed solution, the standby RG member node
16 should preferably be upgraded first (before the active RG member
node 14).
[0045] FIG. 7 is a flowchart of an AON switchover process 300
implemented by the standby RG member node 16 subsequent to the loss
of connectivity with the active RG member node 14 such as due to
the fault 102 on the ICCP link 28. The standby RG member node 16
performs the AON switchover process 300 to eliminate chances that
the split-brain problem may cause a loop. The standby RG member
node 16 begins the AON switchover process 300 subsequent to the
loss of adjacency with the active RG member node 14 (step 302).
Subsequent to loss of adjacency (the ICCP link 28 failure), the
standby RG member node 16 remains in the standby state on all of
the ports 36 keeping the SYNC bits set to 0 with the standby RG
member node 16 monitoring LACPDUs from the DHD 18 for their
associated SYNC bit (step 304). Specifically, this monitoring does
not require the DHD 18 to make changes, but simply assumes the DHD 18
operates standard LACP in an N:N link-level redundancy
scheme.
[0046] The standby RG member node 16 can infer the operational
status of the active ports 34 based on the SYNC bits from the DHD
18 on the standby ports 36. Specifically, the standby RG member
node 16 knows the value of N (N:N) and can infer the number of
active/failed links on the ports 34 based on the number of SYNC bit
values equal to 1 coming from the DHD 18 on the ports 36. Thus, the
AON switchover process 300 operates in a triangle MC-LAG with N:N
active/standby configurations.
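By way of non-limiting illustration, the inference just described can be sketched as a simple count. The list-of-bits representation and function name are hypothetical simplifications, not part of the claimed implementation.

```python
# Illustrative sketch of the inference in paragraph [0046]: with N:N
# redundancy, the number of failed active links on the ports 34 can be
# inferred from how many standby ports 36 receive SYNC=1 from the DHD,
# since the DHD promotes one standby port per failed active link.

def infer_failed_active_links(rx_sync_bits):
    """rx_sync_bits: SYNC bit received from the DHD on each standby port
    (one entry per port 36). Returns the inferred number of failed
    active links on the ports 34."""
    return sum(1 for bit in rx_sync_bits if bit == 1)
```

For example, with N=2, receiving SYNC bits [1, 0] on the standby ports implies one failed active link, while [1, 1] implies all active links have failed.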
[0047] Based on the monitoring, the standby RG member node 16 can
determine if any active links have failed (step 306). Specifically,
no active links have failed if none of the ports 36 have the SYNC
bit set to 1 coming from the DHD 18, in which case the standby RG
member node 16 remains in the standby state on all of the ports 36,
keeping the SYNC bits set to 0, and continues to monitor LACPDUs from
the DHD 18 for their associated SYNC bit (step 304).
[0048] Active links have failed if any link on the ports 36
has the SYNC bit set to 1 coming from the DHD 18 (step 306). The
standby RG member node 16 determines whether all of the active
links have failed or whether some, but not all of the active links
have failed (step 306). The standby RG member node 16 will only
become active when all of the active links (34) have failed. This
prevents the loops and does not require coordination with the DHD
18 or changes to system MAC addresses.
[0049] The standby RG member node 16 can determine whether or not
all of the active links have failed by determining the number of
links on the ports 36 from the DHD 18 which are showing the SYNC
bit as 1. That is, if all of the ports 36 are showing LACPDUs from
the DHD 18 with the SYNC bit as 1, then all of the active links
(34) have failed, i.e., if N links on the ports 36 show SYNC=1 from
the DHD 18, then the N links on the ports 34 have failed.
[0050] If not all of the active links have failed (step 306), then
the standby RG member node 16 remains in the standby state on all
ports keeping the SYNC bits set to 0 and continues to monitor
LACPDUs from the DHD 18 (step 304). If all of the active links (34)
have failed (step 306), the standby RG member node 16 enters the
active state on all of the ports 36, changing the SYNC bits to 1 (step
308). This will result in the backup path going to the distribution
state, and traffic will resume after protection switching.
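By way of non-limiting illustration, the flow of the AON switchover process 300 (steps 302 through 308 of FIG. 7) might be sketched as follows. The function name, the round-by-round list representation of received SYNC bits, and the state strings are hypothetical; LACPDU reception and SYNC-bit transmission are abstracted away.

```python
# Sketch of the AON switchover process 300 as run on the standby RG
# member node after loss of adjacency (step 302). Each element of
# rx_sync_rounds is one round of monitoring: a list with the SYNC bit
# received from the DHD on each standby port (step 304).

def aon_switchover(n_links, rx_sync_rounds):
    """Return 'ACTIVE' once all N active links are inferred failed
    (steps 306/308); otherwise remain 'STANDBY' with SYNC=0."""
    state = "STANDBY"                      # step 302: keep SYNC=0 on all ports 36
    for sync_bits in rx_sync_rounds:       # step 304: monitor LACPDUs from the DHD
        failed_active = sum(1 for b in sync_bits if b == 1)
        if failed_active == n_links:       # step 306: all N active links failed?
            state = "ACTIVE"               # step 308: set SYNC=1, go distributing
            break
    return state
```

With N=2, a sequence of rounds [0, 0], then [1, 0], then [1, 1] only triggers the switchover on the final round, when both standby ports see SYNC=1 from the DHD.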
[0051] Again, the AON switchover process 300 is implemented on the
RG 12 and therefore is interoperable with any vendor's DHD 18
supporting standard LACP and the switchover time is not compromised
since no re-convergence is required. Also, the AON switchover
process 300 can be configurable and selectively enabled/disabled on
both of the member nodes 14, 16.
[0052] Referring back to FIGS. 5 and 6, an operation of the AON
switchover process 300 is illustrated. In FIG. 5, similar to FIG.
4, at step 350-1, there is a fault 102 that causes the ICCP link 28
to fail. Again, the fault 102 could be for any reason. At step
350-2, the member nodes 14, 16 detect the ICCP link 28 failure and
report the same to the MC-LAG 10. At step 350-3, the active RG
member node 14 goes to standalone (active), and the SYNC bit remains at
1 on the operational links in the ports 34. Also at step 350-3, if
the standby RG member node 16 is configured with the AON switchover
process 300 enabled, the standby RG member node 16 goes to a
standalone mode, but non-distributing, keeping the SYNC bits set at
0 for all links in the ports 36.
[0053] Now, in the standalone mode, but non-distributing, the
standby RG member node 16 monitors the LACPDUs from the DHD 18 on
the ports 36. At step 350-4, the DHD 18 determines the fault 104 on
the ports 34 and since this is N:N redundancy, the DHD 18 selects a
standby port as active on the ports 36 setting the SYNC bit to 1.
Note, since the standby RG member node 16 is operating the AON
switchover process 300, the standby RG member node 16 remains in
the standalone mode, but non-distributing with all links in the
ports 36 transmitting SYNC=0 to the DHD 18.
[0054] In FIG. 6, at step 350-5, the last link in the ports 34
fails. The active RG member node 14 goes into standalone,
non-distributing and the SYNC bits are 0 on all links on the ports
34. At step 350-6, the DHD 18 selects another standby port of the
ports 36 to set as active and sets the SYNC bit to 1. At step
350-7, the standby RG member node 16 determines that all of the
active links (34) have failed. In this example, this is due to the
DHD 18 sending SYNC=1 on two ports of the ports 36 (N=2 here). At
this point (step 350-7), the standby RG member node 16 sets the SYNC bit
to 1 on all of the ports 36 since the DHD 18 also has the SYNC bit
set to 1 on all of the ports 36 and the ports 36 go into
distribution, such that the traffic switches from the ports 34 to
the ports 36.
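By way of non-limiting illustration, the N=2 sequence of FIGS. 5 and 6 can be traced with a simple counting rule. The helper below is hypothetical and assumed for illustration only.

```python
# Trace of the N=2 scenario of FIGS. 5 and 6: the standby RG member
# node goes active only when the number of standby ports 36 receiving
# SYNC=1 from the DHD equals N (all active links on the ports 34 failed).

N = 2  # N:N link-level redundancy on the DHD

def standby_state(rx_sync_bits):
    """rx_sync_bits: SYNC bit received from the DHD on each port 36."""
    return "ACTIVE" if sum(rx_sync_bits) == N else "STANDBY"

# Step 350-3: ICCP link down; DHD still sends SYNC=0 on both standby ports.
print(standby_state([0, 0]))  # STANDBY
# Step 350-4: fault 104 fails one active link; DHD sets SYNC=1 on one standby port.
print(standby_state([1, 0]))  # STANDBY -- AON prevents the loop here
# Steps 350-5/350-6: last active link fails; DHD sets SYNC=1 on the second port.
print(standby_state([1, 1]))  # ACTIVE -- ports 36 go to distribution (step 350-7)
```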
Network Element
[0055] FIG. 8 illustrates an example network element 400 for the
systems and methods described herein. In this embodiment, the
network element 400 is an Ethernet, MPLS, IP, etc. network switch,
but those of ordinary skill in the art will recognize the systems
and methods described herein can operate with other types of
network elements and other implementations. Specifically, the
network element 400 can be the RG member nodes 14, 16. Also, the
network element 400 can be the DHD 18 as well. In this embodiment,
the network element 400 includes a plurality of blades 402, 404
interconnected via an interface 406. The blades 402, 404 are also
known as line cards, line modules, circuit packs, pluggable
modules, etc. and generally refer to components mounted on a
chassis, shelf, etc. of a data switching device, i.e., the network
element 400. Each of the blades 402, 404 can include numerous
electronic devices and optical devices mounted on a circuit board
along with various interconnects including interfaces to the
chassis, shelf, etc. Those skilled in the art will recognize that
the network element 400 is illustrated in an oversimplified manner
and may include other components and functionality.
[0056] Two blades are illustrated with line blades 402 and control
blades 404. The line blades 402 include data ports 408 such as a
plurality of Ethernet ports. For example, the line blade 402 can
include a plurality of physical ports disposed on an exterior of
the blade 402 for receiving ingress/egress connections.
Additionally, the line blades 402 can include switching components
to form a switching fabric via the interface 406 between all of the
data ports 408 allowing data traffic to be switched between the
data ports 408 on the various line blades 402. The switching fabric
is a combination of hardware, software, firmware, etc. that moves
data coming into the network element 400 out by the correct port
408 to the next network element 400. "Switching fabric" includes
switching units, or individual boxes, in a node; integrated
circuits contained in the switching units; and programming that
allows switching paths to be controlled. Note, the switching fabric
can be distributed on the blades 402, 404, in a separate blade (not
shown), or a combination thereof. The line blades 402 can include
an Ethernet manager (i.e., a processor) and a Network Processor
(NP)/Application Specific Integrated Circuit (ASIC).
[0057] The control blades 404 include a microprocessor 410, memory
412, software 414, and a network interface 416. Specifically, the
microprocessor 410, the memory 412, and the software 414 can
collectively control, configure, provision, monitor, etc. the
network element 400. The network interface 416 may be utilized to
communicate with an element manager, a network management system,
etc. Additionally, the control blades 404 can include a database
420 that tracks and maintains provisioning, configuration,
operational data and the like. In this embodiment, the network
element 400 includes two control blades 404 which may operate in a
redundant or protected configuration such as 1:1, 1+1, etc. In
general, the control blades 404 maintain dynamic system information
including packet forwarding databases, protocol state machines, and
the operational status of the ports 408 within the network element
400.
[0058] When operating as the standby RG member node 16, the various
components of the network element 400 can be configured to
implement the AON switchover process 300.
[0059] It will be appreciated that some embodiments described
herein may include one or more generic or specialized processors
("one or more processors") such as microprocessors; Central
Processing Units (CPUs); Digital Signal Processors (DSPs);
customized processors such as Network Processors (NPs) or Network
Processing Units (NPUs), Graphics Processing Units (GPUs), or the
like; Field Programmable Gate Arrays (FPGAs); and the like along
with unique stored program instructions (including both software
and firmware) for control thereof to implement, in conjunction with
certain non-processor circuits, some, most, or all of the functions
of the methods and/or systems described herein. Alternatively, some
or all functions may be implemented by a state machine that has no
stored program instructions, or in one or more Application Specific
Integrated Circuits (ASICs), in which each function or some
combinations of certain of the functions are implemented as custom
logic or circuitry. Of course, a combination of the aforementioned
approaches may be used. For some of the embodiments described
herein, a corresponding device in hardware and optionally with
software, firmware, and a combination thereof can be referred to as
"circuitry configured or adapted to," "logic configured or adapted
to," etc. perform a set of operations, steps, methods, processes,
algorithms, functions, techniques, etc. on digital and/or analog
signals as described herein for the various embodiments.
[0060] Moreover, some embodiments may include a non-transitory
computer-readable storage medium having computer readable code
stored thereon for programming a computer, server, appliance,
device, processor, circuit, etc. each of which may include a
processor to perform functions as described and claimed herein.
Examples of such computer-readable storage mediums include, but are
not limited to, a hard disk, an optical storage device, a magnetic
storage device, a ROM (Read Only Memory), a PROM (Programmable Read
Only Memory), an EPROM (Erasable Programmable Read Only Memory), an
EEPROM (Electrically Erasable Programmable Read Only Memory), Flash
memory, and the like. When stored in the non-transitory computer
readable medium, software can include instructions executable by a
processor or device (e.g., any type of programmable circuitry or
logic) that, in response to such execution, cause a processor or
the device to perform a set of operations, steps, methods,
processes, algorithms, functions, techniques, etc. as described
herein for the various exemplary embodiments.
[0061] Although the present disclosure has been illustrated and
described herein with reference to preferred embodiments and
specific examples thereof, it will be readily apparent to those of
ordinary skill in the art that other embodiments and examples may
perform similar functions and/or achieve like results. All such
equivalent embodiments and examples are within the spirit and scope
of the present disclosure, are contemplated thereby, and are
intended to be covered by the following claims.
* * * * *