Stateful Subnet Manager Failover In A Middleware Machine Environment Johnsen; Bjorn-Dag ; et al. [ORACLE INTERNATIONAL CORPORATION]

Patent Application Summary

U.S. patent application number 13/235113, for stateful subnet manager failover in a middleware machine environment, was filed with the patent office on 2011-09-16 and published on 2012-03-29. This patent application is currently assigned to ORACLE INTERNATIONAL CORPORATION. Invention is credited to Roy Arntsen, Line Holen, Bjorn-Dag Johnsen.

Application Number	20120079090 13/235113
Family ID	44872584
Filed Date	2011-09-16

United States Patent Application	20120079090
Kind Code	A1
Johnsen; Bjorn-Dag ; et al.	March 29, 2012

STATEFUL SUBNET MANAGER FAILOVER IN A MIDDLEWARE MACHINE ENVIRONMENT

Abstract

A system and method can provide stateful subnet manager failover in a middleware machine environment. The system includes a policy daemon associated with each master subnet manager candidate in a subnet in the middleware machine environment. The policy daemon manages one or more policies for the subnet. The system also includes a transactional interface associated with the policy daemon co-located with a current master subnet manager. The transactional interface allows for updating the one or more policies using a policy update transaction. The policy daemon co-located with the master subnet manager operates to replicate the policy update transaction to one or more policy daemons co-located with the subnet managers that are master candidates associated with the master subnet manager, before committing the policy update transaction. Additionally, when the master subnet manager fails, the subnet managers operate to negotiate with each other and elect a new master subnet manager.

Inventors:	Johnsen; Bjorn-Dag; (Oslo, NO) ; Holen; Line; (Fetsund, NO) ; Arntsen; Roy; (Oslo, NO)
Assignee:	ORACLE INTERNATIONAL CORPORATION Redwood Shores CA
Family ID:	44872584
Appl. No.:	13/235113
Filed:	September 16, 2011

Related U.S. Patent Documents


Application Number	Filing Date	Patent Number
61384228	Sep 17, 2010
61484390	May 10, 2011
61493330	Jun 3, 2011
61493347	Jun 3, 2011
61498329	Jun 17, 2011

Current U.S. Class:	709/223
Current CPC Class:	H04L 45/50 20130101; H04L 45/00 20130101; H04L 12/462 20130101; H04L 69/40 20130101; H04L 41/0659 20130101
Class at Publication:	709/223
International Class:	G06F 15/173 20060101 G06F015/173

Claims

1. A system for supporting policy transaction in a middleware machine environment, comprising: one or more microprocessors; a policy daemon, running on the one or more microprocessors, associated with a master subnet manager in a subnet in the middleware machine environment, wherein the policy daemon manages one or more policies for the subnet; a transactional interface associated with the policy daemon, wherein the transactional interface allows for updating the one or more policies managed by the policy daemon associated with the master subnet manager using a policy update transaction; and wherein the master subnet manager is associated with one or more subnet managers that are master candidates in the subnet, and the policy daemon associated with the master subnet manager operates to replicate the policy update transaction to the one or more subnet managers before committing the policy update transaction.

2. The system according to claim 1, wherein: the subnet is an Infiniband (IB) subnet that includes a plurality of management nodes connecting with a plurality of host servers.

3. The system according to claim 2, wherein: the plurality of management nodes include one or more network switches, wherein each said subnet manager resides on a network switch.

4. The system according to claim 1, wherein: each said subnet manager is associated with a different policy daemon.

5. The system according to claim 4, wherein: the policy update transaction is committed only when a quorum of said different policy daemons agrees.

6. The system according to claim 1, wherein: when the master subnet manager fails, the one or more subnet managers operate to negotiate with each other and elect a new master subnet manager, which is responsible for configuring and managing the middleware machine environment.

7. The system according to claim 1, wherein: the subnet uses an in-band communication protocol to connect the master subnet manager with the one or more subnet managers.

8. The system according to claim 1, wherein: a said policy is a partition policy that can define a partition configuration in the subnet, and wherein the partition policy can be supplied to the subnet through an initialization policy transaction.

9. The system according to claim 1, further comprising: a command interface that is responsible for providing policies to the master subnet manager via the transactional interface.

10. The system according to claim 1, wherein: the master subnet manager can use a default partitioning policy for initialization when no partitioning policy is specified.

11. The system according to claim 1, wherein: the master subnet manager ensures that functioning of the middleware machine environment is not interrupted when a standby subnet manager takes over and becomes a new master subnet manager.

12. The system according to claim 1, wherein: all stale policy information can be removed before applying the new policy or the policy updates.

13. The system according to claim 1, wherein: the policy update transaction can include either a new policy or a set of policy updates, and the policy update transaction can be represented using a unique version number.

14. The system according to claim 13, wherein: the master subnet manager considers a policy associated with a highest version number as the current policy used in the middleware machine environment, and the subnet allows one policy update transaction in progress at any point of time in the subnet.

15. The system according to claim 1, wherein: the policy daemon ensures that current policy is in synch within a quorum of policy daemons before allowing a newly elected master subnet manager to complete initialization of the subnet.

16. The system according to claim 1, wherein: a consensus based scheme is used when it is impossible to establish a quorum following a single point of failure, wherein the consensus based rules can implement a current policy when at least one single master subnet manager is established and the current policy cannot be changed when any master subnet manager candidate is not a part of the upgrade transaction.

17. The system according to claim 1, wherein: the subnet manager is implemented with a core logic based on third party source code.

18. The system according to claim 17, wherein: the policy daemon can inject critical policy information for the middleware machine environment into the core logic implementation in the subnet manager.

19. A method for supporting policy transaction in a middleware machine environment, comprising: associating a policy daemon running on one or more microprocessors with a master subnet manager in a subnet in the middleware machine environment, wherein the policy daemon manages one or more policies for the subnet; associating a transactional interface with the policy daemon, wherein the transactional interface allows for updating the one or more policies managed by the policy daemon associated with the master subnet manager using a policy update transaction; and replicating, via the policy daemon associated with the master subnet manager, the policy update transaction to one or more subnet managers that are master candidates associated with the master subnet manager before committing the policy update transaction.

20. A machine readable medium having instructions stored thereon that when executed cause a system to perform the steps of: associating a policy daemon running on one or more microprocessors with a master subnet manager in a subnet in the middleware machine environment, wherein the policy daemon manages one or more policies for the subnet; associating a transactional interface with the policy daemon, wherein the transactional interface allows for updating the one or more policies managed by the policy daemon associated with the master subnet manager using a policy update transaction; and replicating, via the policy daemon associated with the master subnet manager, the policy update transaction to one or more subnet managers that are master candidates associated with the master subnet manager before committing the policy update transaction.

Description

CLAIM OF PRIORITY

[0001] This application claims the benefit of priority on U.S. Provisional Patent Application No. 61/384,228, entitled "SYSTEM FOR USE WITH A MIDDLEWARE MACHINE PLATFORM" filed Sep. 17, 2010; U.S. Provisional Patent Application No. 61/484,390, entitled "SYSTEM FOR USE WITH A MIDDLEWARE MACHINE PLATFORM" filed May 10, 2011; U.S. Provisional Patent Application No. 61/493,330, entitled "STATEFUL SUBNET MANAGER FAILOVER IN A MIDDLEWARE MACHINE ENVIRONMENT" filed Jun. 3, 2011; U.S. Provisional Patent Application No. 61/493,347, entitled "PERFORMING PARTIAL SUBNET INITIALIZATION IN A MIDDLEWARE MACHINE ENVIRONMENT" filed Jun. 3, 2011; and U.S. Provisional Patent Application No. 61/498,329, entitled "SYSTEM AND METHOD FOR SUPPORTING A MIDDLEWARE MACHINE ENVIRONMENT" filed Jun. 17, 2011, each of which applications is herein incorporated by reference.

COPYRIGHT NOTICE

[0002] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF INVENTION

[0003] The present invention is generally related to computer systems and software such as middleware, and is particularly related to supporting a middleware machine environment.

BACKGROUND

[0004] Infiniband (IB) Architecture is a communications and management infrastructure that supports both I/O and interprocessor communications for one or more computer systems. An IB Architecture system can scale from a small server with a few processors and a few I/O devices to a massively parallel installation with hundreds of processors and thousands of I/O devices.

[0005] The IB Architecture defines a switched communications fabric allowing many devices to concurrently communicate with high bandwidth and low latency in a protected, remotely managed environment. An end node can communicate over multiple IB Architecture ports and can utilize multiple paths through the IB Architecture fabric. A multiplicity of IB Architecture ports and paths through the network are provided for both fault tolerance and increased data transfer bandwidth.

[0006] These are generally the areas that embodiments of the invention are intended to address.

SUMMARY

[0007] Described herein is a system and method that can provide stateful subnet manager failover in a middleware machine environment. In accordance with an embodiment, the system includes a policy daemon associated with each master subnet manager candidate in a subnet in the middleware machine environment. The policy daemon manages one or more policies for the subnet. The system also includes a transactional interface associated with the policy daemon that is co-located with a current master subnet manager. The transactional interface allows for updating the one or more policies using a policy update transaction. The policy daemon co-located with the master subnet manager operates to replicate the policy update transaction to one or more policy daemons co-located with the subnet managers that are master candidates associated with the master subnet manager, before committing the policy update transaction. Additionally, when the master subnet manager fails, the one or more subnet managers operate to negotiate with each other and elect a new master subnet manager.

BRIEF DESCRIPTION OF THE FIGURES

[0008] FIG. 1 shows an illustration of an exemplary configuration for a middleware machine, in accordance with an embodiment of the invention.

[0009] FIG. 2 shows an illustration of a middleware machine environment, in accordance with an embodiment of the invention.

[0010] FIG. 3 shows an illustration of a middleware machine environment that supports a policy transaction, in accordance with an embodiment of the invention.

[0011] FIG. 4 illustrates an exemplary flow chart for supporting a policy transaction in a middleware machine environment, in accordance with an embodiment of the invention.

[0012] FIG. 5 shows an illustration of a stateful subnet manager failover scenario in a middleware machine environment, in accordance with an embodiment of the invention.

[0013] FIG. 6 illustrates an exemplary flow chart for implementing a system that supports stateful subnet manager failover in a middleware machine environment, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

[0014] Described herein is a system and method for providing a middleware machine or similar platform. In accordance with an embodiment of the invention, the system comprises a combination of high performance hardware (e.g. 64-bit processor technology, high performance large memory, and redundant InfiniBand and Ethernet networking) together with an application server or middleware environment, such as WebLogic Suite, to provide a complete Java EE application server complex which includes a massively parallel in-memory grid, that can be provisioned quickly, and that can scale on demand. In accordance with an embodiment of the invention, the system can be deployed as a full, half, or quarter rack, or other configuration, that provides an application server grid, storage area network, and InfiniBand (IB) network. The middleware machine software can provide application server, middleware and other functionality such as, for example, WebLogic Server, JRockit or Hotspot JVM, Oracle Linux or Solaris, and Oracle VM. In accordance with an embodiment of the invention, the system can include a plurality of compute nodes, one or more IB switch gateways, and storage nodes or units, communicating with one another via an IB network. When implemented as a rack configuration, unused portions of the rack can be left empty or occupied by fillers.

[0015] In accordance with an embodiment of the invention, referred to herein as "Sun Oracle Exalogic" or "Exalogic", the system is an easy-to-deploy solution for hosting middleware or application server software, such as the Oracle Middleware SW suite, or WebLogic. As described herein, in accordance with an embodiment the system is a "grid in a box" that comprises one or more servers, storage units, an IB fabric for storage networking, and all the other components required to host a middleware application. Significant performance can be delivered for all types of middleware applications by leveraging a massively parallel grid architecture using, e.g. Real Application Clusters and Exalogic Open storage. The system delivers improved performance with linear I/O scalability, is simple to use and manage, and delivers mission-critical availability and reliability.

[0016] FIG. 1 shows an illustration of an exemplary configuration for a middleware machine, in accordance with an embodiment of the invention. As shown in FIG. 1, the middleware machine 100 uses a single rack configuration that includes two gateway network switches, or leaf network switches, 102 and 103 that connect to twenty-eight server nodes. Additionally, there can be different configurations for the middleware machine. For example, there can be a half rack configuration that contains a portion of the server nodes, and there can also be a multi-rack configuration that contains a large number of servers.

[0017] As shown in FIG. 1, the server nodes can connect to the ports provided by the gateway network switches. As shown in FIG. 1, each server machine can have connections to the two gateway network switches 102 and 103 separately. For example, the gateway network switch 102 connects to the port 1 of the servers 1-14 106 and the port 2 of the servers 15-28 107, and the gateway network switch 103 connects to the port 2 of the servers 1-14 108 and the port 1 of the servers 15-28 109.
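The cabling described for FIG. 1 can be written down as a small lookup. The following Python sketch is illustrative only: the function name `switch_for` is hypothetical, and the values 102 and 103 are simply the reference numerals of the two gateway network switches in the figure.

```python
def switch_for(server: int, port: int) -> int:
    """Return the reference numeral (102 or 103) of the gateway switch
    that a given server port connects to in the single rack of FIG. 1."""
    if not 1 <= server <= 28:
        raise ValueError("server must be in the range 1..28")
    if port not in (1, 2):
        raise ValueError("port must be 1 or 2")
    lower_half = server <= 14
    # Switch 102: port 1 of servers 1-14 and port 2 of servers 15-28.
    # Switch 103: port 2 of servers 1-14 and port 1 of servers 15-28.
    if (lower_half and port == 1) or (not lower_half and port == 2):
        return 102
    return 103
```

With this arrangement every server reaches both gateway network switches, one through each of its two ports.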

[0018] In accordance with an embodiment of the invention, each gateway network switch can have multiple internal ports that are used to connect with different servers, and the gateway network switch can also have external ports that are used to connect with an external network, such as an existing data center service network.

[0019] In accordance with an embodiment of the invention, the middleware machine can include a separate storage system 110 that connects to the servers through the gateway network switches. Additionally, the middleware machine can include a spine network switch 101 that connects to the two gateway network switches 102 and 103. As shown in FIG. 1, there can be optionally two links from the storage system to the spine network switch.

IB Fabric/Subnet

[0020] In accordance with an embodiment of the invention, an IB Fabric/Subnet in a middleware machine environment can contain a large number of physical hosts or servers, switch instances and gateway instances that are interconnected in a fat-tree topology.

[0021] FIG. 2 shows an illustration of a middleware machine environment, in accordance with an embodiment of the invention. As shown in FIG. 2, the middleware machine environment 200 includes an IB subnet or fabric 220 that connects with a plurality of end nodes. The IB subnet includes a plurality of subnet managers 211-214, each of which resides on one of a plurality of network switches 201-204. The subnet managers can communicate with each other using an in-band communication protocol 210, such as the Management Datagram (MAD)/Subnet Management Packet (SMP) based protocols or other protocol such as the Internet Protocol over IB (IPoIB).

[0022] In accordance with an embodiment of the invention, a single IP subnet can be constructed on the IB fabric allowing the switches to communicate securely among each other in the same IB fabric (i.e. full connectivity among all switches). The fabric based IP subnet can provide connectivity between any pair of switches when at least one route with operational links exists between the two switches. Recovery from link failures can be achieved if an alternative route exists by re-routing.

[0023] The management Ethernet interfaces of the switches can be connected to a single network providing IP level connectivity between all the switches. Each switch can be identified by two main IP addresses: one for the external management Ethernet and one for the fabric based IP subnet. Each switch can monitor connectivity to all other switches using both IP addresses, and can use either operational address for communication. Additionally, each switch can have a point-to-point IP link to each directly connected switch on the fabric. Hence, there can be at least one additional IP address.

[0024] IP routing setups allow a network switch to route traffic to another switch via an intermediate switch using a combination of the fabric IP subnet, the external management Ethernet network, and one or more fabric level point-to-point IP links between pairs of switches. IP routing allows external management access to a network switch to be routed via an external Ethernet port on the network switch, as well as through a dedicated routing service on the fabric.

[0025] The IB fabric includes multiple network switches with management Ethernet access to a management network. There is in-band physical connectivity between the switches in the fabric. In one example, there is at least one in-band route of one or more hops between each pair of switches, when the IB fabric is not degraded. Management nodes for the IB fabric include network switches and management hosts that are connected to the IB fabric.

[0026] A subnet manager can be accessed via any of its private IP addresses. The subnet manager can also be accessible via a floating IP address that is configured for the master subnet manager when the subnet manager takes on the role as a master subnet manager, and that is un-configured when the subnet manager is explicitly released from the role. A master IP address can be defined for both the external management network as well as for the fabric based management IP network. No special master IP address needs to be defined for point-to-point IP links.

[0027] In accordance with an embodiment of the invention, each physical host can be virtualized using virtual machine based guests. There can be multiple guests existing concurrently per physical host, for example one guest per CPU core. Additionally, each physical host can have at least one dual-ported Host Channel Adapter (HCA), which can be virtualized and shared among guests, so that the fabric view of a virtualized HCA is a single dual-ported HCA just like a non-virtualized/shared HCA.

[0028] The IB fabric can be divided into a dynamic set of resource domains implemented by IB partitions. Each physical host and each gateway instance in an IB fabric can be a member of multiple partitions. Also, multiple guests on the same or different physical hosts can be members of the same or different partitions. The number of the IB partitions for an IB fabric may be limited by the P_Key table size.

[0029] In accordance with an embodiment of the invention, a guest may open a set of virtual network interface cards (vNICs) on two or more gateway instances that are accessed directly from a vNIC driver in the guest. The guest can migrate between physical hosts while either retaining or having updated vNIC associations.

[0030] In accordance with an embodiment of the invention, switches can start up in any order and can dynamically select a master subnet manager according to different negotiation protocols, for example an IB specified negotiation protocol. If no partitioning policy is specified, a default partitioning enabled policy can be used. Additionally, the management node partition and the fabric based management IP subnet can be established independently of any additional policy information and independently of whether the complete fabric policy is known by the master subnet manager. In order to allow fabric level configuration policy information to be synchronized using the fabric based IP subnet, the subnet manager can start up initially using the default partition policy. When fabric level synchronization has been achieved, the partition configuration, which is current for the fabric, can be installed by the master subnet manager.
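The two-phase startup described above (default partition policy first, current fabric policy after synchronization) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class name `MasterStartup`, the method `on_fabric_synchronized`, and the contents of `DEFAULT_PARTITION_POLICY` are hypothetical and are not taken from the described system.

```python
# Hypothetical default policy; its actual contents are not specified here.
DEFAULT_PARTITION_POLICY = {"version": 0, "partitions": {"default": "all-ports"}}


class MasterStartup:
    """Tracks which partition policy a starting master subnet manager uses."""

    def __init__(self):
        # Start with the default policy so that the fabric based management
        # IP subnet can be established before the full fabric policy is known.
        self.active_policy = DEFAULT_PARTITION_POLICY
        self.synchronized = False

    def on_fabric_synchronized(self, current_fabric_policy=None):
        """Install the policy that is current for the fabric, if there is one."""
        self.synchronized = True
        if current_fabric_policy is not None:
            self.active_policy = current_fabric_policy
        return self.active_policy
```

If synchronization yields no partitioning policy, the sketch simply keeps the default policy in place.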

Policy Transaction in a Middleware Machine Environment

[0031] In accordance with an embodiment of the invention, a system and method can support a policy transaction in a middleware machine environment. The system includes a policy daemon associated with a master subnet manager in an IB subnet in the middleware machine environment. The policy daemon manages one or more policies for the IB subnet. The system also includes a transactional interface associated with the policy daemon. The transactional interface allows for updating the one or more policies using a policy update transaction. Additionally, the master subnet manager is associated with one or more subnet managers that are master candidates in the middleware machine environment. The policy daemon associated with the master subnet manager operates to replicate the policy update transaction to the one or more subnet managers before committing the policy update transaction.

[0032] FIG. 3 shows an illustration of a middleware machine environment that supports a policy transaction, in accordance with an embodiment of the invention. As shown in FIG. 3, the middleware machine environment 300 includes an IB subnet or fabric 320 that manages a plurality of end nodes. The IB subnet includes a plurality of subnet managers 321-324, each of which resides on one of a plurality of network switches 301-304. The subnet managers can communicate with each other using an in-band communication protocol 310, such as the Internet Protocol over Infiniband (IPoIB). The subnet managers can negotiate among each other and elect a master subnet manager A 321, which is responsible for configuring and managing the middleware machine environment. Additionally, the subnet managers B-D are standby master candidates in the middleware machine environment, each of which is ready to take over as the master subnet manager when necessary.

[0033] In accordance with an embodiment of the invention, each network switch can connect with one or more end nodes, such as the host servers within the middleware machine environment. Both the network switch and the subnet managers residing on top of the network switch can be considered as management nodes from the perspective of a network high availability management model. The network switch can be either a leaf switch that communicates directly with the end nodes, or a spine switch that communicates with the end nodes through the leaf switches. The network switches can communicate with the host servers via the switch ports of the network switches and the host ports of the host servers. In an IB network, partitions can be defined to specify which end ports are able to communicate with other end ports.

[0034] In accordance with an embodiment of the invention, the middleware machine environment employs a fat-tree topology, which allows a small number of switches to sit at the top layers of the fat tree while maintaining a large number of end nodes as leaves of the tree.

[0035] In accordance with an embodiment of the invention, the system can provide a plurality of policy daemons 311-314, each of which is associated with a subnet manager. The policy daemon that is co-located with the master subnet manager is responsible for configuring and managing the end nodes in the middleware machine environment using one or more policies. One exemplary policy managed by a policy daemon in a middleware machine environment can be a partition configuration policy. The partition configuration policy can be supplied to the subnet through an initialization policy transaction.

[0036] For example, a middleware machine environment that includes end nodes A, B and C can be partitioned into two groups: a Group I that includes nodes A and B and a Group II that includes node C. A partition configuration policy can define a partition update that requires deleting node B from the Group I, before adding node B into the Group II. This partition configuration policy can require that the master subnet manager will not allow a new partition to add node B into Group II without first deleting node B from Group I. This partition configuration policy can be enforced by the master subnet manager using a policy daemon.
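The ordering constraint in this example can be sketched in a few lines of Python. The class `PartitionTable` and its methods are hypothetical names chosen for illustration; the sketch enforces only the particular example policy above (a node must leave its old group before joining another) and is not a general model of IB partition membership, in which a host can belong to multiple partitions.

```python
class PartitionTable:
    """Toy partition table enforcing a delete-before-add policy."""

    def __init__(self, groups):
        # Map each group name to the set of nodes it currently contains.
        self.groups = {name: set(members) for name, members in groups.items()}

    def remove(self, node, group):
        self.groups[group].discard(node)

    def add(self, node, group):
        # Refuse the update while the node is still a member of another group.
        for other, members in self.groups.items():
            if other != group and node in members:
                raise ValueError(
                    f"node {node} must be deleted from Group {other} first"
                )
        self.groups[group].add(node)


table = PartitionTable({"I": ["A", "B"], "II": ["C"]})
table.remove("B", "I")  # delete node B from Group I ...
table.add("B", "II")    # ... before adding it to Group II
```

Calling `add("B", "II")` without the preceding `remove` raises `ValueError`, which is the rejection the example policy calls for.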

[0037] In accordance with an embodiment of the invention, the system can provide a transactional interface 308 that is associated with the policy daemon. The transactional interface allows for updating the one or more policies managed by the policy daemon using a policy update transaction 309. The policy daemon associated with the master subnet manager can replicate the policy update transaction to the subnet manager master candidates before committing the policy update transaction. Additionally, the system provides a command interface that is responsible for providing policies to the master subnet manager.

[0038] By replicating the policy updates from the master subnet manager to the subnet manager master candidates, the system can ensure that the policies are synchronized within the middleware machine environment. When the standby subnet manager takes over and becomes the new master subnet manager, the functioning of the middleware machine environment can be uninterrupted and the communication in the middleware machine environment can remain undisturbed. Additionally, the system can remove all stale policy information before applying the new policy or the policy updates, in order to prevent inconsistency between the master subnet manager and different instances of the subnet manager master candidates.

[0039] In accordance with an embodiment of the invention, a policy update transaction can include either a new policy or a set of policy updates. Each policy update transaction can be represented using a unique version number. A master subnet manager can consider a policy associated with the highest version number, in its knowledge, as the current policy to be used in the middleware machine environment. In one embodiment, the system is configured so that there is only one policy update transaction in progress at any point of time in the subnet.
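The versioning rules in this paragraph (a unique version number per transaction, the highest known version being current, and at most one transaction in progress) can be illustrated with a short Python sketch. The class `PolicyVersions` and its method names are hypothetical and serve only to make the rules concrete.

```python
class PolicyVersions:
    """Keeps committed policies by version and allows one open transaction."""

    def __init__(self):
        self._committed = {}   # version number -> policy
        self._pending = None   # (version, policy) of the open transaction

    def begin(self, version, policy):
        if self._pending is not None:
            raise RuntimeError("another policy update transaction is in progress")
        if version in self._committed:
            raise ValueError("version numbers must be unique")
        self._pending = (version, policy)

    def commit(self):
        if self._pending is None:
            raise RuntimeError("no policy update transaction in progress")
        version, policy = self._pending
        self._committed[version] = policy
        self._pending = None

    def abort(self):
        self._pending = None

    def current(self):
        """Return the policy with the highest known version, or None."""
        if not self._committed:
            return None
        return self._committed[max(self._committed)]
```

An uncommitted transaction never affects `current()`, so an aborted update leaves the previously current policy in force.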

[0040] In accordance with an embodiment of the invention, if the replication of the policy updates from the master subnet manager to the standby subnet managers fails, or alternatively the master subnet manager fails, then the policy update transaction is not committed, and the system does not apply the policy updates on the middleware machine environment in order to preserve consistency in policy configuration. Furthermore, the system allows a user or an administrator to intervene and manually set up the environment.
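The replicate-before-commit behavior can be sketched as a simple two-phase update. The Python below is a minimal illustration, assuming that every daemon must accept the staged update before anything is committed; the names `Daemon`, `stage`, `commit`, `abort`, and `update_policy` are hypothetical, and the `reachable` flag merely simulates a replication failure.

```python
class Daemon:
    """Toy policy daemon that stages an update before committing it."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable  # simulates whether replication can succeed
        self.staged = {}            # version -> policy, not yet in force
        self.committed = {}         # version -> policy, in force

    def stage(self, version, policy):
        if not self.reachable:
            return False
        self.staged[version] = policy
        return True

    def commit(self, version):
        self.committed[version] = self.staged.pop(version)

    def abort(self, version):
        self.staged.pop(version, None)


def update_policy(master, standbys, version, policy):
    """Replicate to every daemon first; commit only if all of them staged it."""
    daemons = [master, *standbys]
    if not all(d.stage(version, policy) for d in daemons):
        # Replication failed somewhere: nothing is committed anywhere.
        for d in daemons:
            d.abort(version)
        return False
    for d in daemons:
        d.commit(version)
    return True
```

When `update_policy` returns `False`, no daemon holds the new version as committed, which leaves room for the manual intervention mentioned above.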

[0041] In accordance with an embodiment of the invention, each policy daemon can have a close relationship with the subnet manager instance on the corresponding node. The subnet manager can perform a synchronization operation with the local policy daemon whenever it becomes a master subnet manager. This ensures that the policy daemon can prepare all required current policy information that the subnet manager needs in order to initialize and maintain the state of the subnet. For example, such information includes the current partitioning configuration which can be provided as a local file.

[0042] Additionally, the various policy daemon instances can cooperate to replicate the policy information that is supposed to be shared among the subnet managers. For example, the local synchronization with the local subnet manager instance can ensure that the replication of a new version of a configuration file is complete and accurate before the master subnet manager starts to apply the new policy.

[0043] In accordance with an embodiment of the invention, when a standby subnet manager reboots, it can synchronize with the current master or other currently available master candidates for the current fabric policy.

[0044] FIG. 4 illustrates an exemplary flow chart for supporting a policy transaction in a middleware machine environment, in accordance with an embodiment of the invention. As shown in FIG. 4, at step 401, a policy daemon can be associated with a master subnet manager in a subnet in the middleware machine environment, wherein the policy daemon manages one or more policies for the subnet. Furthermore, a transactional interface can be associated with the policy daemon at step 402. The transactional interface allows for updating the one or more policies managed by the policy daemon associated with the master subnet manager using a policy update transaction. Then, at step 403, the policy daemon co-located with the master subnet manager can replicate the policy update transaction to one or more policy daemons co-located with the one or more subnet managers that are master candidates before committing the policy update transaction.

Stateful Subnet Manager Failover

[0045] FIG. 5 shows an illustration of a stateful subnet manager failover scenario in a middleware machine environment, in accordance with an embodiment of the invention. As shown in FIG. 5, the middleware machine environment 500 includes a plurality of network switches 501-504 together with a plurality of subnet managers 521-524 that manage a plurality of end nodes. The plurality of subnet managers can communicate with each other using an in-band communication protocol 510, such as the Internet Protocol over Infiniband (IPoIB).

[0046] In the example as shown in FIG. 5, when an old master subnet manager A 521 fails, the rest of the subnet managers B-D can negotiate with each other and elect a new master subnet manager C, which is responsible for configuring and managing the subnet in the middleware machine environment. The new master subnet manager C can determine the most recent versions of the fabric configuration policy information along with all available subnet managers B-D. Additionally, a transaction interface 508 associated with the new master subnet manager C is used by the system to support a new policy update transaction 509.
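As a non-limiting sketch of how a newly elected master might determine the most recent policy version among the available subnet managers, the example below assumes that each policy daemon reports a monotonically increasing version number together with a policy identifier. The function names `most_recent_policy` and `stale_managers`, and the report format, are assumptions made for illustration and are not specified above.

```python
from typing import Dict, List, Tuple

# Hypothetical report format: subnet manager name -> (version, policy identifier)
Reports = Dict[str, Tuple[int, str]]


def most_recent_policy(reports: Reports) -> Tuple[int, str]:
    """Return the (version, policy identifier) pair with the highest version."""
    if not reports:
        raise ValueError("no subnet managers available")
    return max(reports.values(), key=lambda report: report[0])


def stale_managers(reports: Reports) -> List[str]:
    """Return the managers whose version is older than the most recent one."""
    newest_version, _ = most_recent_policy(reports)
    return sorted(name for name, (version, _) in reports.items()
                  if version < newest_version)
```

For example, if subnet managers B, C, and D report versions 3, 4, and 4 respectively, the sketch selects version 4 and identifies B as needing synchronization.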

[0047] In order to determine the current fabric configuration policy information, the system can use a quorum-based policy in the policy daemon, which specifies a minimum number of the subnet managers needed in the middleware machine environment in order to support a policy update. For example, a quorum-based policy can require that more than half of all the standby subnet managers be in synchronization. If half or fewer of the standby subnet managers have the same policy, then a quorum cannot be reached, and no fabric policy changes can be implemented until either a quorum has been reached or a system administrator redefines the master candidate set. As a further example, a "split-brain" condition can be detected in a middleware machine environment with only two subnet managers. When a single point of failure disables one subnet manager, only one subnet manager remains in the system, and the quorum-based policy can prevent the subnet manager that detects the "split-brain" condition from taking on any master role.
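The quorum test described above can be illustrated with the following minimal sketch. It assumes that each reachable policy daemon reports an integer policy version and that the quorum is counted against the total number of configured master candidates; the function name `quorum_version` and both parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def quorum_version(versions: Dict[str, int],
                   total_candidates: int) -> Optional[int]:
    """Return the policy version held by more than half of all configured
    candidates, or None when no such quorum exists."""
    if not versions:
        return None
    version, count = Counter(versions.values()).most_common(1)[0]
    # Strict majority: exactly half is not enough.
    return version if 2 * count > total_candidates else None
```

With two configured candidates and only one reachable, `quorum_version({"B": 1}, 2)` returns `None`, which corresponds to the "split-brain" case in which the surviving subnet manager is prevented from taking on the master role.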

[0048] In accordance with an embodiment of the invention, a policy update transaction 509 is committed only when a quorum of different policy daemons all agree on the policy update. Additionally, the policy daemon 513 can ensure that the current policy is in sync within a quorum of policy daemons before allowing a newly elected master subnet manager 523 to complete initialization of the subnet.

[0049] In accordance with an embodiment of the invention, a quorum based scheme has the advantage of being able to both implement and change a policy following one or more failures as long as a sufficient level of redundancy is provided, i.e. the system is configured with a sufficient number of independent master subnet manager candidate instances.

[0050] For example, a decision about implementing, or changing and implementing, a policy can be based on an assumption that there are no conflicting policy decisions made among other potential master candidates. If only exactly half of the configured standby subnet managers are available (e.g. 2 out of a total of 4 subnet managers or 1 out of a total of 2 subnet managers), then no decision to implement or change the policy is permitted. Thus, a population of 3 or 4 master candidates can survive a single point of failure and still be able to establish a quorum that can make decisions about implementing or changing the policy. Furthermore, a population of 5 or 6 master candidates can tolerate 2 failures, 7 and 8 master candidates can tolerate 3 failures, and so on.
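The failure-tolerance figures above follow from requiring a strict majority. A minimal sketch of the arithmetic, using the hypothetical helper names `majority` and `tolerated_failures`, is:

```python
def majority(candidates: int) -> int:
    """Smallest number of candidates that is strictly more than half."""
    if candidates < 1:
        raise ValueError("at least one master candidate is required")
    return candidates // 2 + 1


def tolerated_failures(candidates: int) -> int:
    """Number of candidates that can fail while a majority still remains."""
    return candidates - majority(candidates)
```

Under this rule, populations of 3 and 4 candidates each tolerate 1 failure, 5 and 6 tolerate 2, and 7 and 8 tolerate 3, while a population of 2 tolerates none. An even-sized population therefore tolerates no more failures than the next smaller odd-sized one.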

[0051] In accordance with an embodiment of the invention, a consensus based scheme can be used in the system when it is impossible to establish a quorum (or majority) following a single point of failure, for example for a configuration with only two master subnet manager candidates. The consensus based rules can implement the current policy when at least one single master subnet manager can be established. However, in order to preserve consistency in the system, the current policy may not be changed when any master subnet manager candidate is not a part of the update transaction.

[0052] The advantage of the consensus based scheme is that any subnet manager that becomes the master can immediately configure the subnet based on the current policy as long as the local policy daemon can determine that the policy state reflects a committed update transaction. The drawback is that a single point of failure that makes a single subnet manager master candidate unavailable can prevent any further policy update transactions.
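The two consensus based rules can be contrasted in a short sketch. The function names and the representation of master candidates as sets of names are assumptions made for illustration only.

```python
from typing import Set


def may_implement_policy(local_state_committed: bool) -> bool:
    """A new master may configure the subnet as soon as its local policy
    daemon confirms that the policy state reflects a committed update."""
    return local_state_committed


def may_update_policy(configured: Set[str], participating: Set[str]) -> bool:
    """A policy update is allowed only when every configured master
    candidate takes part in the update transaction."""
    return bool(configured) and configured <= participating
```

With two configured candidates A and B, a surviving master A can still implement the committed policy, but `may_update_policy({"A", "B"}, {"A"})` is `False`, reflecting the drawback that a single unavailable candidate blocks further updates.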

Implementation without Third Party Constraints

[0053] In accordance with an embodiment of the invention, an implementation of the system as described above requires only a minimal change in an existing subnet manager implementation, and allows the subnet manager implementation to be based on third party source code, such as open source shared code. The system allows the handling of stateful fail-over to be implemented without the open source constraints and also can be independent of the IB fabric that the subnet manager relates to.

[0054] FIG. 6 illustrates an exemplary flow chart for implementing a system that supports stateful subnet manager failover in a middleware machine environment, in accordance with an embodiment of the invention. As shown in FIG. 6, at step 601, the subnet manager employs a core logic implementation. The core logic of the subnet manager can implement an IB standard such as the OpenSM standard, which only supports stateless failover. Furthermore, at step 602, a policy daemon can be associated with the subnet manager in order to inject critical policy information for the middleware machine environment into the core logic implementation in the subnet manager. The core logic implementation in the subnet manager may not be aware of the policy daemon. The implementation of the policy daemon requires only minimal change to the core logic implementation, and can be independent of and separated from the core logic implementation in the subnet manager. Additionally, a transactional interface can be associated with the policy daemon, at step 603. The transactional interface allows transactional behavior for policy updates in the middleware machine environment. The transactional behavior can ensure full ACID (atomicity, consistency, isolation, durability) properties for the policy updates without a need of including the transaction logic within the core logic implementation in the subnet manager.
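The separation between the core logic and the policy daemon can be illustrated with the following minimal sketch. The classes `CoreSubnetManager` and `PolicyDaemon`, the `load_policy` hook, and the `transaction` context manager are hypothetical and do not correspond to any actual subnet manager API. The sketch keeps all state in memory, so it illustrates atomicity and isolation only, not durability.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class CoreSubnetManager:
    """Stand-in for third party core logic; it knows nothing about the daemon."""

    def __init__(self) -> None:
        self.active_policy: Dict[str, str] = {}

    def load_policy(self, policy: Dict[str, str]) -> None:
        # The single hook assumed to be added to the core logic.
        self.active_policy = dict(policy)


class PolicyDaemon:
    """Hypothetical daemon that owns the policy state and transaction logic."""

    def __init__(self, core: CoreSubnetManager) -> None:
        self._core = core
        self._committed: Dict[str, str] = {}

    @contextmanager
    def transaction(self) -> Iterator[Dict[str, str]]:
        draft = dict(self._committed)  # isolated working copy
        yield draft                    # an exception raised here skips the commit
        self._committed = draft        # all-or-nothing swap
        self._core.load_policy(self._committed)  # inject into the core logic
```

A caller would write `with daemon.transaction() as policy: policy["partition"] = "default"`. If the body raises an exception, the committed policy and the core logic remain unchanged, and the core logic itself contains no transaction code.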

[0055] The present invention may be conveniently implemented using one or more conventional general purpose or specialized digital computers, computing devices, machines, or microprocessors, including one or more processors, memory and/or computer readable storage media programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.

[0056] In some embodiments, the present invention includes a computer program product which is a storage medium or computer readable medium (media) having instructions stored thereon/in which can be used to program a computer to perform any of the processes of the present invention. The storage medium can include, but is not limited to, any type of disk including floppy disks, optical discs, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.

[0057] The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.

* * * * *