U.S. patent application number 11/121605 was filed with the patent office on 2005-05-04 and published on 2006-11-09 for a method and apparatus for expressing high availability cluster demand based on probability of breach.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Ming Chen, Thomas Juergen Lumpp, Juergen Peter Schneider, Andrew Neil Trossman.
Application Number | 20060253725 11/121605 |
Family ID | 37395346 |
Filed Date | 2005-05-04 |
United States Patent
Application |
20060253725 |
Kind Code |
A1 |
Chen; Ming ; et al. |
November 9, 2006 |
Method and apparatus for expressing high availability cluster
demand based on probability of breach
Abstract
A method, apparatus, and computer instructions are provided for
expressing high availability (H/A) cluster demand based on
probability of breach. When a failover occurs in the H/A cluster,
event messages are sent to a provisioning manager server. The
mechanism of embodiments of the present invention filters the event
messages and translates the events into probability of breach data.
The mechanism then updates the data model of the provisioning manager
server and makes a recommendation to the provisioning manager
server as to whether reprovisioning of a new node should be
performed. The provisioning manager server makes the decision and
either reprovisions new nodes to the H/A cluster or notifies the
administrator of a detected poisoning problem.
Inventors: |
Chen; Ming; (Markham,
CA) ; Lumpp; Thomas Juergen; (Reutlingen, DE)
; Schneider; Juergen Peter; (Althengstett, DE) ;
Trossman; Andrew Neil; (North York, CA) |
Correspondence
Address: |
IBM CORP (YA);C/O YEE & ASSOCIATES PC
P.O. BOX 802333
DALLAS
TX
75380
US
|
Assignee: |
International Business Machines
Corporation
Armonk
NY
|
Family ID: |
37395346 |
Appl. No.: |
11/121605 |
Filed: |
May 4, 2005 |
Current U.S.
Class: |
714/47.1 |
Current CPC
Class: |
G06F 11/2028 20130101;
G06F 11/2025 20130101 |
Class at
Publication: |
714/004 |
International
Class: |
G06F 11/00 20060101
G06F011/00 |
Claims
1. A method in a data processing system for expressing high
availability cluster demand based on probability of breach, the
method comprising: detecting event messages from a high
availability cluster; calculating a probability of breach based on
the event messages detected from the high availability cluster;
updating a data model of a provisioning server based on the event
messages, responsive to calculating the probability of breach; and
providing a recommendation of reprovisioning to the provisioning
server based on the probability of breach.
2. The method of claim 1, wherein the event messages are sent by
the high availability cluster responsive to detecting a failover of
an active server to a standby server.
3. The method of claim 1, wherein the calculating step comprises:
determining a difference between a number of required active
servers and a number of failed servers; and raising a service class
factor to a power of the difference, wherein the service class
factor indicates a service level agreement of a cluster
application.
4. The method of claim 1, wherein the updating step includes:
marking a failed server; marking an observed or a desired state of
a master server; removing the failed server from an application
tier; and updating cluster resource states for failed and failover
servers.
5. The method of claim 1, wherein the providing step includes:
providing a recommendation of reprovisioning a new server node to
the cluster if the probability of breach is low; and providing a
recommendation of not reprovisioning a new server node to the
cluster if the probability of breach is high.
6. The method of claim 1, wherein the event messages include a name
of the cluster, a name of a failed node, a name of a node to which
clustering software fails over resources, a service class factor
defined for the cluster, wherein the service class factor indicates
an importance of failover to an application, timestamps indicating
when failure and failover occur, and event descriptions describing
a reason for failure.
7. The method of claim 1, further comprising: responsive to
receiving the recommendation, determining whether to reprovision a
new server node to the cluster; responsive to determining that a
new server node is to be reprovisioned, executing a workflow to
configure the new server node to the cluster; and responsive to
determining that the new server node is not to be reprovisioned,
notifying a system administrator of a potential poisoning
problem.
8. The method of claim 7, wherein the determining step comprises
determining if a poisoning situation occurs, wherein the
determining step for determining if a poisoning situation occurs
comprises: maintaining a history of events received from the
cluster for a period of time; determining, from the history of
events received, if the events received and reasons are the same
for the period of time; and if the events received and reasons are
the same for the period of time, marking the cluster as a poisoning
situation.
9. The method of claim 1, wherein the probability of breach
includes a value from zero to one, and wherein the probability of
breach is calculated by normalizing a plurality of probability of
breaches of a plurality of applications.
10. A data processing system comprising: a bus; a memory connected
to the bus, wherein a set of instructions are located in the
memory; and a processing unit connected to the bus, wherein the
processing unit executes the set of instructions to detect event
messages from a high availability cluster; calculate a probability
of breach based on the event messages detected from the high
availability cluster; update a data model of a provisioning server
based on the event messages, responsive to calculating the
probability of breach; and provide a recommendation of
reprovisioning to the provisioning server based on the probability
of breach.
11. The data processing system of claim 10, wherein the processing
unit, in executing the set of instructions to calculate a
probability of breach based on event messages, determines a
difference between a number of required active servers and a number
of failed servers; and raises a service class factor to a power of
the difference, wherein the service class factor indicates a
service level agreement of a cluster application.
12. The data processing system of claim 10, wherein the processing
unit, in executing the set of instructions to provide a
recommendation of reprovisioning to a provisioning server based on the
probability of breach, provides a recommendation of reprovisioning
a new server node to the cluster if the probability of breach is
low; and provides a recommendation of not reprovisioning a new
server node to the cluster if the probability of breach is
high.
13. The data processing system of claim 10, wherein the processing
unit further executes the set of instructions to determine whether
to reprovision a new server node to the cluster responsive to
receiving the recommendation; execute a workflow to configure the
new server node to the cluster responsive to determining that a new
server node is to be reprovisioned; and notify a system
administrator of a potential poisoning problem responsive to
determining that a new server node is not to be reprovisioned.
14. The data processing system of claim 13, wherein the processing
unit, in executing the set of instructions to determine whether to
reprovision a new server node to the cluster, determines if a
poisoning situation occurs, and wherein the processing unit, in
executing the set of instructions to determine if a poisoning
situation occurs, maintains a history of events received from the
cluster for a period of time; determines, from the history of
events received, if the events received and reasons are the same for
the period of time; and marks the cluster as a poisoning situation
if the events received and reasons are the same for the period of
time.
15. The data processing system of claim 10, wherein the probability
of breach includes a value from zero to one, and wherein the
probability of breach is calculated by normalizing a plurality of
probability of breaches of a plurality of applications.
16. A computer program product in a computer usable medium for
expressing high availability cluster demand based on probability of
breach, the computer program product comprising: first instructions
for detecting event messages from a high availability cluster;
second instructions for calculating a probability of breach based
on the event messages detected from the high availability cluster;
third instructions for updating a data model of a provisioning
server based on the event messages responsive to calculating the
probability of breach; and fourth instructions for providing a
recommendation of reprovisioning to the provisioning server based
on the probability of breach.
17. The computer program product of claim 16, wherein the second
instructions comprise: first sub-instructions for determining the
difference between a number of required active servers and a number
of failed servers; and second sub-instructions for raising a
service class factor to the power of the difference, wherein the
service class factor indicates a service level agreement of a
cluster application.
18. The computer program product of claim 16, wherein the fourth
instructions comprise: first sub-instructions for providing a
recommendation of reprovisioning a new server node to the cluster
if the probability of breach is low; and second sub-instructions
for providing a recommendation of not reprovisioning a new server
node to the cluster if the probability of breach is high.
19. The computer program product of claim 16, further comprising:
fifth instructions for determining whether to reprovision a new
server node to the cluster responsive to receiving the
recommendation; sixth instructions for executing a workflow to
configure the new server node to the cluster responsive to
determining that a new server node is to be reprovisioned; and
seventh instructions for notifying a system administrator of a
potential poisoning problem responsive to determining that a new
server node is not to be reprovisioned.
20. The computer program product of claim 19, wherein the fifth
instructions comprise: first sub-instructions for determining if a
poisoning situation occurs, wherein the determining step comprises:
second sub-instructions for maintaining a history of events
received from the cluster for a period of time; third
sub-instructions for determining, from the history of events
received, if the events received and reasons are the same for the
period of time; and fourth sub-instructions for marking the cluster
as a poisoning situation if the events received and reasons are the
same for the period of time.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Technical Field
[0002] The present invention relates to an improved data processing
system. In particular, the present invention relates to expressing
high availability cluster demand. Still more particularly, the
present invention relates to expressing high availability cluster
demand based on probability of breach.
[0003] 2. Description of Related Art
[0004] In a large data center, High Availability (H/A) clusters are
often used to ensure that servers are available to meet business
needs. A H/A cluster is designed, implemented, and deployed with
sufficient components to satisfy the functional requirements and
with sufficient redundancy in components (hardware, software and
procedures) to mask certain defined faults. When using
H/A clusters, it is desirable to minimize server downtime and to
reduce business losses due to system errors. Currently, there are a
few software products in the market that provide such functions and
features. Examples of these software products include Veritas™
clustering server available from Veritas™ Software Corporation
and High Availability Cluster Multiprocessing for AIX 5L V5.2.0
available from International Business Machines Corporation.
[0005] However, with existing H/A clusters, a problem exists when
more resources are needed while there are no more redundant or
standby resources available. One problem scenario occurs when there
are two servers running in a H/A cluster. One server is active and
the other server is standing by. When the active server fails, the
clustering software is capable of failing over all resources to the
standby server and making the standby server active. However, if
the standby server also fails and there are no more resources
available in the cluster, the H/A cluster can no longer provision
resources to serve the client requests.
[0006] Another problem scenario is when a poisoning problem is
detected. A poisoning problem occurs when the servers in the
cluster keep failing no matter how many resources are added to the
cluster. This poisoning problem may be caused by software errors,
for example, memory leaks, software bugs, etc. When a poisoning
problem is detected, the cluster notifies the administrator to fix
the problem manually. However, it is often difficult for the
cluster to detect the problem. Usually, a provisioning manager
server, which provisions and deprovisions resources, predicts how
and when the poisoning problem may occur.
[0007] Therefore, it would be advantageous to have an improved
method, apparatus, and computer instructions for expressing high
availability cluster demand to a provisioning manager server, such
that based on a probability of breach data, the provisioning
manager server may respond to the problem accordingly.
SUMMARY OF THE INVENTION
[0008] Embodiments of the present invention provide a method, an
apparatus, and computer instructions in a data processing system
for expressing high availability cluster demand based on
probability of breach. Responsive to detecting event messages from
a high availability cluster, a probability of breach based on the
event messages is calculated. The mechanism of embodiments of the
present invention then updates a data model of a provisioning
server based on the event messages and provides a recommendation of
reprovisioning to a provisioning server based on the probability of
breach.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The novel features believed characteristic of embodiments of
the invention are set forth in the appended claims. Embodiments of
the invention itself, however, as well as a preferred mode of use,
further objectives and advantages thereof, will best be understood
by reference to the following detailed description of an
illustrative embodiment when read in conjunction with the
accompanying drawings, wherein:
[0010] FIG. 1 depicts a pictorial representation of a network of
data processing systems in which embodiments of the present
invention may be implemented;
[0011] FIG. 2 is a block diagram of a data processing system that
may be implemented as a server, in accordance with an illustrative
embodiment of the present invention;
[0012] FIG. 3 is a block diagram of a data processing system in
which an illustrative embodiment of the present invention may be
implemented;
[0013] FIG. 4 is a diagram illustrating an exemplary data center,
in accordance with an illustrative embodiment of the present
invention;
[0014] FIG. 5 is a diagram illustrating exemplary interactions
between a H/A cluster and the provisioning manager server in
accordance with an illustrative embodiment of the present
invention;
[0015] FIG. 6 is a diagram illustrating an exemplary probability of
breach data formula in accordance with an illustrative embodiment
of the present invention;
[0016] FIG. 7 is a diagram containing examples of calculating a
probability of breach in accordance with an illustrative embodiment
of the present invention; and
[0017] FIG. 8 is a flowchart of an exemplary process for expressing
high availability demand based on probability of breach in
accordance with an illustrative embodiment of the present
invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0018] With reference now to the figures, FIG. 1 depicts a
pictorial representation of a network of data processing systems in
which embodiments of the present invention may be implemented.
Network data processing system 100 is a network of computers in
which embodiments of the present invention may be implemented.
Network data processing system 100 contains network 102, which is
the medium used to provide communications links between various
devices and computers connected together within network data
processing system 100. Network 102 may include connections, such as
wire, wireless communication links, or fiber optic cables.
[0019] In the depicted example, server 104 is connected to network
102 along with storage unit 106. In addition, clients 108, 110, and
112 are connected to network 102. Clients 108, 110, and 112 may be,
for example, personal computers or network computers. In the
depicted example, server 104 provides data, such as boot files,
operating system images, and applications to clients 108-112.
Clients 108, 110, and 112 are clients to server 104. Network data
processing system 100 may include additional servers, clients, and
other devices not shown. In the depicted example, network data
processing system 100 is the Internet with network 102 representing
a worldwide collection of networks and gateways that use the
Transmission Control Protocol/Internet Protocol (TCP/IP) suite of
protocols to communicate with one another. At the heart of the
Internet is a backbone of high-speed data communication lines
between major nodes or host computers, consisting of thousands of
commercial, government, educational, and other computer systems
that route data and messages. Of course, network data processing
system 100 also may be implemented as a number of different types
of networks, such as for example, an intranet, a local area network
(LAN), or a wide area network (WAN). FIG. 1 is intended as an
example, and not as an architectural limitation for embodiments of
the present invention.
[0020] Referring to FIG. 2, a block diagram of a data processing
system that may be implemented as a server, such as server 104 in
FIG. 1, is depicted in accordance with an embodiment of the present
invention. Data processing system 200 may be a symmetric
multiprocessor (SMP) system including a plurality of processors 202
and 204 connected to system bus 206. Alternatively, a single
processor system may be employed. Also connected to system bus 206
is memory controller/cache 208, which provides an interface to
local memory 209. I/O Bus Bridge 210 is connected to system bus 206
and provides an interface to I/O bus 212. Memory controller/cache
208 and I/O Bus Bridge 210 may be integrated as depicted.
[0021] Peripheral component interconnect (PCI) bus bridge 214
connected to I/O bus 212 provides an interface to PCI local bus
216. A number of modems may be connected to PCI local bus 216.
Typical PCI bus implementations will support four PCI expansion
slots or add-in connectors. Communications links to clients 108-112
in FIG. 1 may be provided through modem 218 and network adapter 220
connected to PCI local bus 216 through add-in connectors.
[0022] Additional PCI bus bridges 222 and 224 provide interfaces
for additional PCI local buses 226 and 228, from which additional
modems or network adapters may be supported. In this manner, data
processing system 200 allows connections to multiple network
computers. Memory-mapped graphics adapter 230 and hard disk 232 may
also be connected to I/O bus 212 as depicted, either directly or
indirectly.
[0023] Those of ordinary skill in the art will appreciate that the
hardware depicted in FIG. 2 may vary. For example, other peripheral
devices, such as optical disk drives and the like, also may be used
in addition to or in place of the hardware depicted. The depicted
example is not meant to imply architectural limitations with
respect to embodiments of the present invention.
[0024] The data processing system depicted in FIG. 2 may be, for
example, an IBM eServer pSeries system, a product of International
Business Machines Corporation in Armonk, N.Y., running the Advanced
Interactive Executive (AIX) operating system or the LINUX operating
system.
[0025] With reference now to FIG. 3, a block diagram illustrating a
data processing system is depicted in which embodiments of the
present invention may be implemented. Data processing system 300 is
an example of a client computer. Data processing system 300 employs
a peripheral component interconnect (PCI) local bus architecture.
Although the depicted example employs a PCI bus, other bus
architectures such as Accelerated Graphics Port (AGP) and Industry
Standard Architecture (ISA) may be used. Processor 302 and main
memory 304 are connected to PCI local bus 306 through PCI Bridge
308. PCI Bridge 308 also may include an integrated memory
controller and cache memory for processor 302. Additional
connections to PCI local bus 306 may be made through direct
component interconnection or through add-in boards. In the depicted
example, local area network (LAN) adapter 310, small computer
system interface (SCSI) host bus adapter 312, and expansion bus
interface 314 are connected to PCI local bus 306 by direct
component connection. In contrast, audio adapter 316, graphics
adapter 318, and audio/video adapter 319 are connected to PCI local
bus 306 by add-in boards inserted into expansion slots. Expansion
bus interface 314 provides a connection for keyboard and mouse
adapter 320, modem 322, and memory 324. SCSI host bus adapter 312
provides a connection for hard disk drive 326, tape drive 328, and
CD-ROM drive 330. Typical PCI local bus implementations will
support three or four PCI expansion slots or add-in connectors.
[0026] An operating system runs on processor 302 and is used to
coordinate and provide control of various components within data
processing system 300 in FIG. 3. The operating system may be a
commercially available operating system, such as Windows® XP,
which is available from Microsoft Corporation. An object-oriented
programming system, such as Java, may run in conjunction with the
operating system and provide calls to the operating system from
Java programs or applications executing on data processing system
300. "Java" is a trademark of Sun Microsystems, Inc. Instructions
for the operating system, the object-oriented programming system,
and applications or programs are located on storage devices, such
as hard disk drive 326 and may be loaded into main memory 304 for
execution by processor 302.
[0027] Those of ordinary skill in the art will appreciate that the
hardware in FIG. 3 may vary depending on the implementation. Other
internal hardware or peripheral devices, such as flash read-only
memory (ROM), equivalent non-volatile memory, or optical disk
drives and the like, may be used in addition to or in place of the
hardware depicted in FIG. 3. Also, the processes of embodiments of
the present invention may be applied to a multiprocessor data
processing system.
[0028] As another example, data processing system 300 may be a
stand-alone system configured to be bootable without relying on
some type of network communication interfaces. As a further
example, data processing system 300 may be a personal digital
assistant (PDA) device, which is configured with ROM and/or flash
ROM in order to provide non-volatile memory for storing operating
system files and/or user-generated data.
[0029] The depicted example in FIG. 3 and above-described examples
are not meant to imply architectural limitations. For example, data
processing system 300 also may be a notebook computer or hand held
computer in addition to taking the form of a PDA. Data processing
system 300 also may be a kiosk or a Web appliance.
[0030] Turning now to FIG. 4, a diagram illustrating an exemplary
data center is depicted, in accordance with an embodiment of the
present invention. As shown in FIG. 4, in this illustrative
example, data center 400 includes resources, such as, customer 402,
server 404, Virtual Local Area Network (VLAN) 406, subnet 408,
router 410, switch 412, software products 416, load balancer 418,
and data container 420.
[0031] Customer 402 may be, for example, a client or an
administrator who uses a data processing system, such as data
processing system 300 in FIG. 3. Server 404 may be implemented as a
data processing system, such as data processing system 200 in FIG.
2. Server 404 may also be implemented as an application server,
which hosts Web services or other types of servers. Router 410 and
switch 412 facilitate communications between different devices.
VLAN 406 is a network of computers that behave as if they are
connected to the same wire even though they may actually be
physically located on different segments of a local area network.
Subnet 408 is a portion of a network, which may be a physically
independent network segment and shares a network address with other
portions of the network.
[0032] Software products 416 are applications that may be deployed
to a client or a server. Load balancer 418 spreads traffic among
multiple systems such that no single system is overwhelmed. Load
balancer 418 is normally implemented as software running on a data
processing system. Data container 420 may be a database, such as
DB2 Universal Database, a product available from International
Business Machines Corporation.
[0033] Data center 400, as depicted in FIG. 4, is presented for
purposes of illustrating embodiments of the present invention.
Other resources, such as, for example, a cluster of servers and
switch ports, may also be included in data center 400. The
mechanism of an embodiment of the present invention reports
failover and potential poisoning situation of resources, such as
server 404, to a provisioning manager server. An example of a
provisioning manager server is Tivoli Intelligent ThinkDynamics
Orchestrator (TITO), a product available from International
Business Machines Corporation.
[0034] Embodiments of the present invention provide a method,
apparatus, and computer instructions for expressing high
availability cluster demand based on a probability of breach data.
A probability of breach data indicates the probability of a service
level agreement breach. A service level agreement breach is a
failure of a cluster to meet the specified service level agreement
(SLA) for an application. The processes of embodiments of the
present invention are performed by a processing unit that comprises
one or more processors, such as processor 302 in FIG. 3, using
computer implemented instructions, which may be located in a memory
such as, for example, main memory 304, memory 324, or in one or
more peripheral devices 326 and 330 in FIG. 3.
[0035] In an illustrative embodiment, the mechanism of embodiments
of the present invention may be implemented in a provisioning
manager server that detects events reported by clustering software
when failovers and potential poisoning problems are encountered.
When the clustering software in the H/A cluster detects that
resources have failed over to a standby server due to failure of the
active server, event messages are generated by the clustering
software and sent to the provisioning manager server.
[0036] In an illustrative embodiment, event messages may include
information, such as, for example, the name of the cluster domain,
the name of the failed node, the name of the node to which the
clustering software fails over resources, a service class factor
indicating the importance of the failover to an application,
timestamps indicating when the failure and failover occurred, and
event descriptions that describe reasons for the failure.
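The fields listed above can be pictured as a simple record. The following Python sketch is illustrative only: the class name `FailoverEvent` and all attribute names are hypothetical, and the description does not prescribe any particular encoding or wire format for event messages.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FailoverEvent:
    """Hypothetical in-memory form of one event message (names are assumptions)."""
    cluster_domain: str          # name of the cluster domain
    failed_node: str             # name of the node that failed
    failover_node: str           # node to which resources were failed over
    service_class_factor: float  # importance of the failover to the application
    failed_at: datetime          # timestamp of the failure
    failed_over_at: datetime     # timestamp of the failover
    description: str             # reason reported for the failure


# Example values are invented for illustration.
event = FailoverEvent(
    cluster_domain="ha-cluster-502",
    failed_node="node1",
    failover_node="node2",
    service_class_factor=0.5,
    failed_at=datetime(2005, 5, 4, 12, 0, 0),
    failed_over_at=datetime(2005, 5, 4, 12, 0, 30),
    description="active server stopped responding",
)
```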
[0037] When the provisioning manager server receives the generated
events, the mechanism of embodiments of the present invention first
filters out the irrelevant events and then translates the events
into a probability of breach data based on a mathematical formula.
This formula may be calculated using the service class factor, the
number of active servers, and the number of failed servers. The
service class factor is defined for the H/A cluster based on a
service level agreement and reflects the high availability service
class level, for example, gold, silver, or bronze. The clustering
software in the H/A cluster typically decides what service class
factor is to be sent to the provisioning manager server based on the
service level agreement for the cluster application.
[0038] In addition, the mechanism of embodiments of the present
invention in the provisioning manager server updates the cluster
model accordingly to synchronize changes in the H/A cluster based
on the event messages received. Thus, the state of the hardware and
the data model is synchronized.
[0039] Based on the calculated probability of breach data, the
mechanism of embodiments of the present invention in the
provisioning manager server makes a recommendation as to whether
servers should be reprovisioned to the H/A cluster in order to
maintain high availability for the application. If reprovisioning
is recommended, the mechanism of embodiments of the present
invention reprovisions the servers to the H/A cluster by executing
workflows to add or configure servers to the cluster. In this way,
by expressing H/A cluster demand to the provisioning manager server
based on a probability of breach, poisoning problems may be
avoided.
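The poisoning check recited in claim 8 (keep a history of events for a period of time, and mark the cluster as poisoned if the events and reasons are the same throughout that period) might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `is_poisoned`, the tuple layout of the history, and the `min_events` cutoff are all hypothetical and do not appear in the description.

```python
from datetime import datetime, timedelta


def is_poisoned(history, now, window, min_events=3):
    """Return True if every event in the window has the same type and reason.

    history    -- iterable of (timestamp, event_type, reason) tuples (assumed layout)
    now        -- end of the observation period
    window     -- timedelta giving the length of the observation period
    min_events -- assumed minimum number of events before a verdict is made
    """
    recent = [
        (event_type, reason)
        for timestamp, event_type, reason in history
        if now - window <= timestamp <= now
    ]
    if len(recent) < min_events:
        return False  # too little evidence to call it a poisoning situation
    # Identical events and reasons collapse to a single distinct pair.
    return len(set(recent)) == 1
```

A cluster flagged this way would be reported to the administrator rather than given additional nodes, since adding resources does not help when the same failure keeps recurring.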
[0040] Turning now to FIG. 5, a diagram illustrating exemplary
interactions between H/A cluster and the provisioning manager
server is depicted in accordance with an illustrative embodiment of
the present invention. As shown in FIG. 5, data center 500 includes
H/A cluster 502 and provisioning manager server 504.
[0041] Within H/A cluster 502, two server nodes are present: node 1
and node 2. Node 1 is an online server, while node 2 is a standby
server. When node 1 experiences a failure, node 2 becomes active
and the clustering software in H/A cluster 502 then generates and
sends event messages to provisioning manager server 504. In an
alternative embodiment, other situations may also trigger event
messages to be sent, for example, when the configuration of an
existing node changes, or when a new calculation of probability of
breach is needed because new nodes are provisioned or existing
nodes are deprovisioned.
[0042] Event message 506 may include the name of the cluster
domain, in this example, H/A cluster 502; the name of the failed
node, in this example, node 1; the name of the node to which the
clustering software fails over resources, in this example, node 2;
the service class factor indicating importance of the failover
situation to an application; timestamps when the failure and
failover occur; and event descriptions that describe reasons for
the failure.
[0043] The service class factor is defined for H/A cluster 502 and
reflects the high availability service class level, for example,
gold, silver, and bronze. The clustering software in H/A cluster
502 typically decides what service class factor is to be sent to
provisioning manager server 504 based on the service level
agreement for the cluster application.
[0044] Once provisioning manager server 504 receives event message
506, the mechanism of embodiments of the present invention in
provisioning manager server 504 filters out events 508 that are not
relevant and calculates a probability of breach (PoB) 510 using a
mathematical formula. More details regarding the mathematical
formula are discussed with reference to FIG. 6. Once the PoB is calculated, the
mechanism of embodiments of the present invention updates data
model 512 in provisioning manager server 504 to synchronize with
the hardware state of H/A cluster 502. Then, the mechanism of
embodiments of the present invention makes a recommendation as to
whether to reprovision new server nodes 514 to H/A cluster 502.
Typically, the higher the PoB, the less likely reprovisioning is
recommended, since there is a higher probability of a poisoning
problem. If, however, new node reprovisioning is recommended 516,
the mechanism of embodiments of the present invention executes a
workflow to add or configure new node 518 to the H/A cluster
502.
[0045] Turning now to FIG. 6, a diagram illustrating an exemplary
probability of breach data formula is depicted in accordance with
an illustrative embodiment of the present invention. As shown in
FIG. 6, probability of breach (PoB) of an application group 600 is
calculated using a mathematical formula. An application group is
composed of managed resources or resource groups. An H/A cluster
usually accommodates more than one resource group.
[0046] As shown in FIG. 6, PoB 600 is calculated by raising the service
class factor of application group 602 to the power of the
difference between the number of required active servers 604 and
the number of failed or missing servers 606. This difference may
range between zero and the number of required servers 604. The
higher the difference, the lower PoB 600 will be, since there are
fewer failed servers. If the difference is zero, a PoB 600 of 1 is
returned as a result.
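The calculation described for FIG. 6 can be sketched as a short function. This is an illustrative sketch rather than the claimed implementation: the function and parameter names are hypothetical, and the input checks (a factor greater than 0 and at most 1, and a failed-server count no larger than the required count) are assumptions drawn from the stated range of the difference.

```python
def probability_of_breach(service_class_factor: float,
                          required_servers: int,
                          failed_servers: int) -> float:
    """Raise the service class factor to (required servers - failed servers)."""
    # Assumption: the factor lies in (0, 1] so the result stays within 0..1.
    if not 0.0 < service_class_factor <= 1.0:
        raise ValueError("service class factor must be in (0, 1]")
    # The difference ranges between zero and the number of required servers.
    if not 0 <= failed_servers <= required_servers:
        raise ValueError("failed servers must be between 0 and required servers")
    return service_class_factor ** (required_servers - failed_servers)
```

With a factor of 0.75 and two required servers, the function returns 0.5625 with no failures, 0.75 with one failure, and 1.0 when both servers have failed.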
[0047] In an illustrative embodiment, PoB 600 is an abstract
expression of breaching a goal. The value of PoB 600 is between 0
and 1. A PoB of 0 reflects no breach, while a PoB of 1 reflects a
certain, existing breach. As described above, service class
factor 602 is defined by the clustering software of the H/A cluster
based on the service level agreement of the application.
[0048] Turning now to FIG. 7, a diagram containing examples of
calculating probability of breach is depicted in accordance
with an illustrative embodiment of the present invention. As shown
in FIG. 7, case 1 700 calculates PoB for a two-node cluster with a
first node used to host the workload and a second node used as a
cold backup. The number of required active servers is 2 while the
service class factor is 0.75 (gold level). In case 1 700, during
normal production with no additional servers required, PoB
702 is 0.75^2 = 0.56. However, if one server fails and only one
additional server is required, PoB 704 becomes 0.75^(2-1) = 0.75.
If both servers fail and two additional servers are required, PoB
706 becomes 0.75^(2-2) = 1.0.
[0049] Case 2 708 calculates a PoB for a four-node cluster hosting
two applications, App1 and App2. While each application uses
two nodes separately, every other node may be used as backup. The
customer has required a `silver` level goal, 0.5, accepting that a
backup server is used. In case 2 708, during normal production with
no additional server required, PoB.sub.app1 710 is
0.5^4 = 0.0625. If two servers fail, two additional servers are
required and PoB.sub.app1 714 becomes 0.5^(4-2) = 0.25. If all
four servers fail, four additional servers are required and
PoB.sub.app1 718 becomes 0.5^(4-4) = 1.0. Similarly, PoB.sub.app2
712 is also 0.5^4 = 0.0625 for normal production,
0.5^(4-2) = 0.25 for losing two servers 716, and 0.5^(4-4) = 1.0
for losing all four servers 720.
[0050] Case 3 722 is similar to case 2 708 except that the customer
requested level is gold, 0.75. This means that for each
application, two additional backup servers are needed and each
server in the cluster is dedicated to one of the
applications. In case 3 722, during normal production, PoB.sub.app1
724 is 0.75^4 = 0.316. Similarly, PoB.sub.app2 726 is also
0.75^4 = 0.316. When one of the servers for App1 fails, two
additional servers are required. Thus, PoB.sub.app1 728 is
0.75^(4-2) = 0.562. However, unlike case 2 708, PoB.sub.app2 730
stays the same since the two lost servers were dedicated to App1.
If the complete cluster is lost, PoB.sub.app1 732 and PoB.sub.app2
734 are 0.75^(4-4) = 1.0.
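The dedicated-server behavior of case 3, in which failures count only against the application group that owns the lost servers, can be sketched as follows. The class, function, and field names are hypothetical, and the sketch assumes the exponents given in the text (four required servers per application at the 0.75 gold factor).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ApplicationGroup:
    """Hypothetical per-application view of a cluster with dedicated servers."""
    name: str
    service_class_factor: float
    required_servers: int
    failed_servers: int = 0

    def pob(self) -> float:
        # Same formula as FIG. 6, applied to this group's own servers only.
        return self.service_class_factor ** (
            self.required_servers - self.failed_servers
        )


def case3_pobs(app1_failed: int, app2_failed: int) -> Dict[str, float]:
    """Evaluate both application groups of case 3 for given failure counts."""
    groups = [
        ApplicationGroup("App1", 0.75, 4, app1_failed),
        ApplicationGroup("App2", 0.75, 4, app2_failed),
    ]
    return {group.name: group.pob() for group in groups}
```

Calling `case3_pobs(2, 0)` raises only the App1 value, while `case3_pobs(4, 4)` yields 1.0 for both applications, matching the complete-cluster-loss row.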
[0051] Turning now to FIG. 8, a flowchart of an exemplary process
for expressing high availability demand based on probability of
breach is depicted in accordance with an illustrative embodiment of
the present invention. As depicted in FIG. 8, the process begins
when the mechanism of embodiments of the present invention in the
provisioning manager server detects event messages sent by
clustering software in the H/A cluster (step 802). Event messages
may be sent when a node fails over, when the configuration of existing
nodes changes, or when a new node is provisioned or an existing node is
deprovisioned.
[0052] Once the events are received, the mechanism of embodiments
of the present invention filters out irrelevant events (step 804)
and translates the events into a probability of breach (step 806) to
indicate the probability of a service level agreement breach. The
probability of breach is calculated using the mathematical formula
shown in FIG. 6. Typically, PoB is calculated for each application
group. If more than one application group is present, the PoBs of
all application groups are normalized to a single PoB.
[0053] After the PoB is calculated, the mechanism of embodiments of
the present invention updates the data model of the provisioning
manager server based on the events received (step 808). Examples of
updating the data model include marking the failed servers, marking
the observed state or desired state of the new master server,
removing a failed server from the application tier, and updating
the cluster resource states for both failed and failover
servers.
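The data model updates listed above for step 808 might look like the following sketch. The structure of the data model is not specified in this description, so the class, its fields, and the state strings ("failed", "online") are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterModel:
    """Hypothetical, simplified data model of one H/A cluster."""
    application_tier: List[str]
    server_state: Dict[str, str] = field(default_factory=dict)
    master: str = ""


def apply_failover(model: ClusterModel, failed_node: str,
                   failover_node: str) -> None:
    """Synchronize the model with a failover reported by the cluster."""
    # Mark the failed server.
    model.server_state[failed_node] = "failed"
    # Mark the observed state of the new master server.
    model.server_state[failover_node] = "online"
    model.master = failover_node
    # Remove the failed server from the application tier.
    if failed_node in model.application_tier:
        model.application_tier.remove(failed_node)
```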
[0054] Once the data model of the provisioning manager server is
updated, the mechanism of embodiments of the present invention
makes a recommendation for reprovisioning new nodes based on the
calculated PoB (step 810). A determination is then made by the
provisioning manager server as to whether to reprovision new nodes
to the H/A cluster (step 812). This decision is based on not only
the PoB but also whether a potential poisoning situation is
detected. A potential poisoning situation may occur when the PoB is
high and the number of failed servers detected by the provisioning
manager server for a period of time is abnormal.
[0055] To handle the poisoning situation, the provisioning manager
server maintains an event history of all events received from a
given H/A cluster. The provisioning manager server may use an event
analysis algorithm to analyze the pattern of failed events received
for a certain period of time. If the same failed event is received
with the same failure reason within a period of time after a number of
reprovisionings are executed, the provisioning manager server may
mark this case as a potential poisoning situation.
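One possible form of such an event analysis is sketched below. The description does not specify the algorithm, so everything here is an assumption: the record type, the function name, the look-back window, and the rule that at least `min_reprovisionings` reprovisionings plus one earlier identical failure inside the window mark a potential poisoning situation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical entry in the event history of one H/A cluster."""
    name: str            # kind of failed event
    reason: str          # failure reason reported with the event
    timestamp: datetime  # when the event was received


def is_potential_poisoning(history: List[FailureEvent],
                           reprovision_times: List[datetime],
                           window: timedelta,
                           min_reprovisionings: int = 2) -> bool:
    """Flag a repeat of the same failure after several reprovisionings."""
    if not history:
        return False
    latest = max(history, key=lambda event: event.timestamp)
    window_start = latest.timestamp - window
    # Reprovisionings executed inside the look-back window.
    recent = [t for t in reprovision_times
              if window_start <= t <= latest.timestamp]
    # Earlier events with the same name and reason inside the window.
    repeats = [event for event in history
               if event is not latest
               and event.name == latest.name
               and event.reason == latest.reason
               and window_start <= event.timestamp <= latest.timestamp]
    return len(recent) >= min_reprovisionings and len(repeats) >= 1
```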
[0056] If the provisioning manager server determines that a new
node should be reprovisioned, the provisioning manager server
executes a workflow to add or configure the new node to the H/A
cluster (step 814). However, if the provisioning manager server
determines not to reprovision the new node, the provisioning
manager server notifies the system administrator (step 816) of the
detected poisoning problem, such that the administrator may take
appropriate actions. The process terminates thereafter.
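The branch between steps 814 and 816 can be sketched as a small decision function. The thresholds below (`pob_high` and `abnormal_failures`) are hypothetical values chosen for illustration; the description states only that a potential poisoning situation may occur when the PoB is high and the number of failed servers over a period of time is abnormal.

```python
from enum import Enum


class Step(Enum):
    """Outcomes of the determination made in step 812."""
    REPROVISION = 814   # execute workflow to add or configure the new node
    NOTIFY_ADMIN = 816  # notify the administrator of the poisoning problem


def poisoning_suspected(pob: float, failed_servers_in_period: int,
                        pob_high: float = 0.75,
                        abnormal_failures: int = 3) -> bool:
    """Assumed rule: high PoB together with an abnormal failure count."""
    return pob >= pob_high and failed_servers_in_period >= abnormal_failures


def decide(pob: float, failed_servers_in_period: int) -> Step:
    """Choose between reprovisioning and notifying the administrator."""
    if poisoning_suspected(pob, failed_servers_in_period):
        return Step.NOTIFY_ADMIN
    return Step.REPROVISION
```

A single failover with a PoB of 0.75 leads to step 814 under these assumed thresholds, while a PoB of 1.0 accompanied by five failed servers in the period leads to step 816.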
[0057] In summary, embodiments of the present invention provide a
way to express high availability cluster demand to a provisioning
manager server based on probability of breach data for a service
level agreement. With embodiments of the present invention,
potential poisoning problems may be detected as new nodes are
reprovisioned or as existing nodes are deprovisioned. In this way,
administrators may be notified to take the appropriate actions.
[0058] It is important to note that while embodiments of the
present invention have been described in the context of a fully
functioning data processing system, those of ordinary skill in the
art will appreciate that the processes of embodiments of the
present invention are capable of being distributed in the form of a
computer usable medium of instructions and a variety of forms and
that embodiments of the present invention apply equally regardless
of the particular type of signal-bearing media actually used to
carry out the distribution. Examples of computer usable media
include recordable-type media such as a floppy disc, a hard disk
drive, a RAM, and CD-ROMs and transmission-type media such as
digital and analog communications links.
[0059] The description of embodiments of the present invention has
been presented for purposes of illustration and description but is
not intended to be exhaustive or limited to embodiments of the
invention in the form disclosed. Many modifications and variations
will be apparent to those of ordinary skill in the art. The
embodiment was chosen and described in order to best explain the
principles of embodiments of the invention, the practical
application, and to enable others of ordinary skill in the art to
understand the invention for various embodiments with various
modifications as are suited to the particular use contemplated.
* * * * *