U.S. patent application number 10/250345 was filed with the patent office on 2004-08-05 for method for managing faults in a computer system environment.
Invention is credited to Gravestock, Peter, O'Brien, Michael.
Application Number: 20040153692 10/250345
Family ID: 32770069
Filed Date: 2004-08-05

United States Patent Application 20040153692
Kind Code: A1
O'Brien, Michael; et al.
August 5, 2004
Method for managing faults in a computer system environment
Abstract
Many computing system environments require continuous
availability and high operational readiness. The need to find,
diagnose, and correct actual faults and potential faults in these
systems is therefore great. Combining a continually updated database of
computing system performance with the ability to analyze that
information to detect faults, and then communicating that fault
information to correct the fault or provide appropriate
notification of the fault, achieves the goals of high
availability and operational readiness. FIG. 1 shows how the data
collectors (102, 104, 106 and 108), fault detectors (110), and policy
actions (112) are combined to meet these goals.
Inventors: O'Brien, Michael (Bellevue, WA); Gravestock, Peter (New Farm, AU)

Correspondence Address:
James L Davison
GoAhead Software
Suite 1200
10900 NE 8th Street
Bellevue, WA 98004-1455
US
Family ID: 32770069
Appl. No.: 10/250345
Filed: March 8, 2004
PCT Filed: December 28, 2001
PCT No.: PCT/US01/49945
Current U.S. Class: 714/4.12; 714/E11.023
Current CPC Class: G06F 11/0793 20130101; G06F 11/0709 20130101
Class at Publication: 714/004
International Class: H04L 001/22
Claims
We claim:
1. A method for determining faults in a computing system
environment comprising the acts of: a) detecting the occurrence of
a fault event; b) comparing the fault event to a predetermined
criteria to determine if said fault event corresponds to said
predetermined criteria; and c) if said correspondence occurs then
communicating said fault event to a policy module.
2. The method of claim 1 wherein communicating said fault
event triggers a policy action.
3. The method of claim 2 wherein said policy action generates a
pager signal.
4. The method of claim 2 wherein said policy action generates an
email.
5. The method of claim 2 wherein said policy action generates a
system reboot.
6. The method of claim 2 wherein said policy action generates a
system restart.
7. The method of claim 2 wherein said policy action causes a
switchover to another device.
8. The method of claim 2 wherein said policy action generates an
SNMP trap.
9. The method of claim 1 for determining faults in a computing
environment further comprising the acts of: a) collecting operating
data from the computing system and populating a database with said
operating data; and b) making said operating data in said database
available to a detection module.
10. A system for diagnosing faults in a computing system
environment comprising: a) means for collecting operating data from
the computing system; b) means for populating a database with said
data; c) means for detecting faults derived from said operating
data; d) means for communicating said faults to a policy module;
and e) means for said policy module to take appropriate action
based on said communication.
11. A method for diagnosing faults in a computing system
environment comprising the acts of: a) collecting operating system
data; b) storing the operating system data in a database accessible
to a detector; c) detecting a fault by reading the data using a
first detector and comparing said data to a predetermined criteria;
d) communicating the fault to a second detector with said second
detector capable of receiving multiple inputs from several said
first detectors; and e) causing a second communication to a policy
module if the information received from the several first detectors
meets a second predetermined criteria.
Description
BACKGROUND--FIELD
[0001] The present invention applies to the field of fault
diagnostics in computing systems using detectors and policies.
BACKGROUND--DESCRIPTION OF RELATED ART
[0002] Comprehensive fault management plays an important role in
keeping critical computing systems in a continuous highly available
mode of operation. These systems must incur minimum downtime,
typically in the range of seconds or minutes per year. In order to
meet this goal every critical component (a critical component is
one that, upon failing, fails the entire corresponding system) must
be closely monitored for both occurring faults and potentially
occurring faults. In addition, it is important that these faults be
handled in real time and within the system, rather than remotely as
is done in many monitoring systems today. An example of a remote
monitoring system is one that follows the Simple Network
Management Protocol (SNMP). For the foregoing reasons there is a
need for a fast, small-footprint, real-time system to detect and
diagnose problems. It is also preferred that this system
be cross-platform, extensible, and modular.
SUMMARY
[0003] The present invention uses a method for detecting faults in
a computing environment and then taking action on those faults. If
the fault detected meets predetermined criteria, then the detection
module sends an event signal to a policy module that in turn takes
a programmed action depending on predetermined criteria that
analyze the variables associated with the event signal. The
resulting action may range from sending email to causing a
switchover from a defective device to a correctly operating device.
The detection modules are also capable of sending event signals to
other detection modules. These other detection modules may react
only if multiple signals are received from the primary detection
modules. This aids in the diagnosis of the system fault. Data is
continually collected from the computing system, and this data is
kept in a readily accessible database that may be read by the
detection modules. The computing system data is continually updated
so the information is current and fresh. Each detection module
continually scans the data appropriate to its particular interest.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] FIG. 1 shows an overview of the fault detection system.
[0005] FIG. 2 shows an example of fault monitoring system
hardware.
[0006] FIG. 3 shows how fault monitoring helps in system
diagnostics.
DETAILED DESCRIPTION OF THE INVENTION
[0007] The preferred embodiment and best mode of this invention
provides a framework for diagnosing faults in a computing system
environment. It includes the capabilities of detecting and
diagnosing computing system problems and individual device
problems within that system.
[0008] The detection capability identifies an undesirable condition
that may lead to the loss of service from the system or device.
Detection also includes the discovery of a fault using error
detection or inference. The detection may occur through direct
observation, by correlating multiple events in time, or by
inference, that is, by observing other types of behavior of the
system. Some sets of circumstances may lead to the conclusion that
an event is a fault whereas another set of circumstances may lead
to a different conclusion (e.g. the event is normal system
behavior).
[0009] Diagnosis occurs when one or more events and system
parameters are used to determine the nature and location of a
fault. This step can be performed by the fault detection system or
invoked separately as a user diagnostic. The diagnosis may be acted
upon automatically by the system or may be reported to the user for
some user action to be taken. In some systems it's possible that a
single fault may lead to multiple errors being detected. By doing a
root cause analysis, the fault may be isolated and acted
upon. Isolation actions contain the error or problem and keep it
from spreading throughout the system. Isolation actions and
recovery actions are often done in conjunction with each other. An
example of an isolation action is one in which a memory usage size
is imposed upon an application when the fault management system
recognizes that the application is continually absorbing memory
without a follow-on release of said used memory when no longer
needed by the application. Another isolation example is where the
power to a board is terminated when the board is recognized as
having a failed component. Recovery occurs when a fault management
system takes action to restore a system to service once a fault has
occurred. The recovery actions cover a wide range of activities,
from restarting a failed application to failing over to a standby
hardware card. The recovery process often takes multiple steps,
wherein those steps comprise several actions that must be taken in
a specific order. In some cases, the recovery process is multitiered
in that, if a specific action doesn't recover the system, some
other action must be taken.
[0010] Notifying or logging, to either the system or the user, the
diagnosis made and the resultant action taken is known as
reporting. For example, if an application crashes it might be both
recovered, for example by restarting, and reported via email or
paging. Repair is defined as the replacement of the hardware or
software components as necessary. Hardware components, for example
network interface cards, may be hot swapped (taken out and replaced
while the system, still running, switches over to another
component); alternatively, the system may be shut down and the
failed part manually replaced and repaired.
[0011] Detectors and policies can be arranged in networks of
arbitrary complexity to capture the dependencies between events,
faults, and errors. The actions taken may be arbitrarily complex
functions or even calls to other programs.
[0012] In the current embodiment the detectors and policies are
encoded in multiple XML-based files, which help achieve the
cross-platform, extensible, and modular design goals. Table 1 shows
a typical database table for a detector. The columns of the table
specify the attributes of the detector component. Because detectors
and policies are implemented in XML and embedded JavaScript, changes
to policies and reconfiguration of detectors can be done easily and
without recompilation. This run-time modification of behavior
supports faster development. Detectors and policies can be
developed independently and plugged into the fault management
system.
TABLE 1 - Detector

Column Name  Description

name         Name that identifies the detector. Must be globally unique
             among all detectors defined for all SelfReliant extensions.
             Used primarily when running one detector from another or
             when specifying detector sets in schedules.

description  Description of what the detector detects.

type         The type of detector. The type specified here can be used
             by other detectors as well. Policies are triggered by
             detectors based on their type, so this field is the link
             between a detector and its policy.

url          URL to a Web page served by the SelfReliant WebServer .TM.
             that provides an HTML-based explanation of the detector's
             output.

events       Space-delimited list of events that the detector listens
             to. Multiple detectors can listen to the same event.

types        Space-delimited list of the types of detectors that cause
             this detector to fire. The detector rule will run if a
             detector of a type listed here fires with a non-zero
             output.

enable       Boolean value that indicates whether the rule for this
             detector should run. If 0, the detector rule will not be
             run, no matter how the detector is invoked.

rule         Embedded JavaScript rule that is run when the detector is
             invoked. The rule has access to all global variables and
             functions defined for use within the detector namespace.
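For illustration, a minimal detector row using these columns might look like the following sketch. The detector name, type, watched table, and threshold are all hypothetical; only dbRead() and setOutput(), which appear in the example given later in this description, are taken from the source.

<TBL name="detector">
  <ROW>
    <COL name="name">exampleDetector</COL>
    <COL name="description">Fires when a watched value crosses a threshold</COL>
    <COL name="type">exampleType</COL>
    <COL name="url"></COL>
    <COL name="events"></COL>
    <COL name="enable">1</COL>
    <COL name="rule"><SCRIPT>
      // Hypothetical rule: read a value from a collector database
      // table and fire (with a non-zero output) if it exceeds a
      // threshold. A policy listening for type "exampleType" would
      // then run.
      var value = dbRead("base", "someTable", "someColumn", "0");
      if (value > 50) {
        setOutput(value);
      }
    </SCRIPT></COL>
  </ROW>
</TBL>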
[0013] Detectors "listen" for specified events and can also be
made aware if other detectors have triggered. This approach is the
opposite of function calling because it allows new detectors to be
added to listen for new events without requiring an edit of the set
of functions. This capability, along with run-time interpreting of
detectors and policies, provides support for modularity and
reusability.
[0014] The procedural part of detectors and policies is coded in
"embedded JavaScript" which is a very small footprint subset of the
JavaScript language. Any function written in the C language can be
nominated into and called from the JavaScript namespace. This
embodiment of the invention makes extensive use of an in-memory
database to store data and code.
[0015] Detectors gather data from various sources including
collector databases, events, and applications and even from other
detectors. Based on this information, decisions are made about the
condition of the system and how the system parameters compare to
the predetermined parameters that judge whether or not the system
is running optimally. If a detector decides that the
information it has obtained represents a problem condition, then the
detector fires (sends a message) and passes that information to a
policy or another detector. Note that the detector doesn't decide
what action needs to be taken to correct the situation; it just
passes the condition to one or more policies for analysis and
decision making. Detectors can be activated asynchronously by
responding to fault management events originated from the system
hardware, application software, or the operating system software.
The detectors may also be executed in a synchronous or polled
manner according to a predetermined schedule. Detectors can also
run other detectors through an embedded JavaScript API and
detectors may be triggered by other detectors if the first
detectors are configured to listen to other detector types. FIG. 1
shows a hierarchy of the detector and policy objects. Data on a
process 102, a memory load 104, the network traffic 106, and the
time of day 108 is collected and made available to the appropriate
detectors 110. Note that detector 110 can trigger another detector
111. The detectors in turn trigger the appropriate policies 112.
Note that some policies 113 can respond to more than one detector.
The policies can, in turn, trigger various recovery actions such as
paging 114, sending an email 116, rebooting the system 120,
restarting the system 122, switching over resources 124, engaging an
SNMP trap 126, or some other custom action 128. To prevent a
recursive event the detectors are locked out from listening to
themselves. When a detector is run, it invokes its rule to
determine the status of the information it watches. This rule is
implemented in embedded JavaScript and contained in an XML file.
When a value watched by a detector violates the rule the detector
triggers one or more policies. When a detector triggers, its output
can be set to a "fuzzy" value ranging from zero to a hundred as
determined by the rule. The detector can also pass other
information to a listening detector or policy to help analyze the
information. FIG. 2 shows an interface between a system's hardware,
detectors, events, and policies. A typical piece of hardware can be
a fan 200 or a network interface card (NIC) 202. The detectors 203
can monitor the performance of the hardware devices through the
operating system (for example, using heartbeats). A hardware
anomaly is flagged by detector 203 that can be set to trigger
another detector 206, which in turn triggers a policy 208. Note
that it is possible to trigger detector 206 only if there is also
an event occurrence triggered by an outside condition 210. An
application 212 can also provide input into a detector 203 that
either combines data from elsewhere to cause the detector 203 to
trigger or, on the other hand, prevents the detector 203 from
triggering by the event's presence.
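As a sketch of the detector-triggers-detector arrangement (detector 110 triggering detector 111 in FIG. 1), a second-level detector might list the first detector's type in its "types" column. All names here are hypothetical, and it is assumed that getOutput() returns the firing detector's fuzzy value inside a listening detector's rule, as it does for policies.

<TBL name="detector">
  <ROW>
    <COL name="name">secondLevelDetector</COL>
    <COL name="description">Re-fires only on severe first-level output</COL>
    <COL name="type">severeCondition</COL>
    <COL name="types">lowResource</COL>
    <COL name="enable">1</COL>
    <COL name="rule"><SCRIPT>
      // Runs whenever a detector of type "lowResource" fires with a
      // non-zero output. Re-fire only above a stricter threshold.
      var pct = getOutput();
      if (pct > 98) {
        setOutput(pct);  // triggers policies listening for "severeCondition"
      }
    </SCRIPT></COL>
  </ROW>
</TBL>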
[0016] Policies decide what action to take, based on information
provided by detectors. Policies can be configured to listen to a
set of detectors as specified by the detector type. If a policy
listening to a detector sees the detector fire (that is, have an
output value greater than zero) then the policy rule runs. Policies
can react to multiple detectors and invoke multiple actions.
Policies use the output and any passed information of the relevant
detectors to determine the recovery and/or notification action to
take. For example, if a fault is detected on a Monday afternoon
during business hours, the policy may page a technician in real
time; if the fault is detected after hours, the policy may send
an email to the technician (see the sketch following Table 2).
Table 2 below shows the attributes of the policy component of the
fault management system.
TABLE 2 - Policy

Column Name  Description

name         Name that identifies the policy. Must be globally unique
             among all policies defined.

description  Description of what actions the policy takes given the
             detector types that it is triggered by.

types        Space-delimited list of the types of detectors that cause
             this policy to fire. The policy rule will run if a
             detector of a type listed here fires.

enable       Boolean value that indicates whether the rule for this
             policy should run. If 0, the rule will not be run.

rule         Embedded JavaScript rule that is run when the policy is
             triggered. The rule has access to all global variables and
             functions defined for use within the policy namespace.
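As a sketch of the business-hours example in paragraph [0016] above, a policy rule might branch on the time of day. The detector type, the action functions pageTechnician() and emailTechnician() (stand-ins for C recovery functions nominated into the JavaScript namespace), and the availability of Date in the embedded subset are all assumptions.

<TBL name="policy">
  <ROW>
    <COL name="name">notifyTechnician</COL>
    <COL name="description">Page during business hours, email after hours</COL>
    <COL name="types">hardwareFault</COL>
    <COL name="enable">1</COL>
    <COL name="rule"><SCRIPT>
      // Hypothetical sketch: choose the notification action based on
      // the hour of day when a listened-to detector fires.
      var hour = new Date().getHours();
      if (hour >= 8 && hour < 17) {
        pageTechnician();   // hypothetical nominated C function
      } else {
        emailTechnician();  // hypothetical nominated C function
      }
    </SCRIPT></COL>
  </ROW>
</TBL>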
[0017] When a policy responds to a fault occurrence it may call a
recovery action. Recoveries can be either a corrective action or a
notification. Recovery functions are usually implemented using the
C programming language and they are called by the embedded
JavaScript rules in the policies. Actions can include failovers to
standby components. Although detectors and policies both run
embedded JavaScript rules in response to certain conditions, they
serve different functions. The primary function of detectors is to
detect certain conditions, evaluate the output of other detectors,
and, if necessary, fire to signal that a specific condition or set
of conditions has been found. Detector rules should be relatively
short and fast. Networks of detectors help produce a more accurate
and complete diagnosis by evaluating the results of multiple
inputs. A policy rule, on the other hand, needs to take action given
that a certain condition was detected. A policy is invoked when a
detector of a specified type fires. This allows one policy to
respond to several different detectors in the same way. A policy
rule simply allows configuration of what actions will be taken in
response to various conditions or faults detected. The detectors,
the policies, and the schedules are defined in XML database
tables.
[0018] This embodiment of a multinode fault management system
allows a certain degree of multithreading. Each individual detector
and policy that is running is locked. This prevents another thread
from running the same detector or policy simultaneously. However,
the other detectors remain unlocked and can run at the same time the
first detector and policy are running. If one detector triggers or
sends an event to another that is momentarily locked by another
thread, the first thread will wait until it can acquire the lock.
Each detector and policy has a local scope that is populated when
data is passed from one to another. During this data transfer both
objects are locked. After the transfer is complete the object that
fired is unlocked.
[0019] Scheduled Detector
[0020] In the following example, an XML detector description
defines a scheduled detector that monitors memory use through a
database collector. If the amount of memory used exceeds a certain
threshold, the policy triggers and calls a logging action. See
additional comments in the XML file below for more information.
[0021] <TBL name="detector">
  <!--
    Low Memory Detector

    This detector collects the base memory table, causing the table
    to be updated with current values relating to memory usage.

    If more than ninety percent of the available memory is used, the
    detector will publish the name of the resource that is low to
    any listening policies and fire with a value equal to that of
    the percentage of used memory.
  -->
  <ROW>
    <COL name="name">lowMemory</COL>
    <COL name="description">Low Memory</COL>
    <COL name="type">lowResource</COL>
    <COL name="url">/fm/memorySmartExplanation.htm</COL>
    <COL name="enable">1</COL>
    <COL name="public">1</COL>
    <COL name="events"></COL>
    <COL name="rule"><SCRIPT>
      var total, free, usage;
      var memthresh = 90;
      dbCollectTable("base", "baseMem");
      total = dbRead("base", "baseMem", "physical", "0") / 1000;
      free = dbRead("base", "baseMem", "physFree", "0") / 1000;
      usage = ((total - free) * 100) / total;
      if (usage >= memthresh) {
        var resource = "Memory";
        publish("resource");
        setOutput(usage);
      }
    </SCRIPT></COL>
  </ROW>
</TBL>

<TBL name="policy">
  <!--
    Low Resource Policy

    This policy listens to detectors of type "lowResource". Any
    number of detectors can detect low resources for various system
    components, and this policy will handle all of them.

    This policy assumes that the output of the detectors is the
    amount of resource utilization. It also assumes that a variable
    named "resource" will be published to determine which resource
    is low.

    Using this information, errors are written to the error log
    according to how severe the resource situation is.
  -->
  <ROW>
    <COL name="name">lowResourcePolicy</COL>
    <COL name="description">Low Resource</COL>
    <COL name="url"></COL>
    <COL name="enable">1</COL>
    <COL name="public">1</COL>
    <COL name="types">lowResource</COL>
    <COL name="rule"><SCRIPT>
      var pct = getOutput();
      if (pct > 95) {
        logError("Very low " + resource + " (" + pct + "%)");
      } else {
        logError("Low " + resource + " (" + pct + "%)");
      }
    </SCRIPT></COL>
  </ROW>
</TBL>

<!--
  Resource check schedule

  This schedule runs every five seconds, causing the lowMemory
  detector to run and fire the policy if the memory usage is high.

  Additional resource detectors can be added to this schedule set
  to allow more resources to be monitored.
-->
<TBL name="schedule">
  <ROW>
    <COL name="name">resourceCheck</COL>
    <COL name="description">Check system resources</COL>
    <COL name="enable">1</COL>
    <COL name="period">5000</COL>
    <COL name="schedule"></COL>
    <COL name="set">lowMemory</COL>
  </ROW>
</TBL>
</DB>
</GOAHEAD>
[0046] Networks of detectors are useful in diagnosing intermittent
problems that may not be directly testable because of interface
limitations or the intermittence of the problem. In these cases, it
is useful to correlate faults that have occurred in other related
components, and make a diagnosis based on those faults.
[0047] FIG. 3 illustrates a scenario that assumes a hardware
platform with five PCI slots and a PCI bridge chip. Assume the
bridge chip is not able to notify the system of its failure. One
symptom of the bridge chip failure is that the five cards bridged
by the chip become unable to communicate with the host processor.
The loss of a hardware heartbeat is detectable by the fault
management process. An individual card can also stop responding to
heartbeats because of electrical or physical disconnection, or
other hardware and software faults. By determining the correct
cause of a failure, the system is better equipped to ensure rapid
failover between the correct components.
[0048] A lost heartbeat event from a card will cause the lost card
heartbeat detector 314 to run. This detector populates a table that
stores the name of the card that missed a heartbeat, the current
time, and the number of times the heartbeat has failed. This
information is important because it allows the second level
detectors to provide fault correlation. This detector 314 always
fires.
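A sketch of what the lost card heartbeat detector 314 might look like follows. The event name, the table layout, the published "card" variable (assumed to be supplied by the event), and dbWrite() are all assumptions; only dbRead(), publish(), and setOutput() appear in the example above.

<TBL name="detector">
  <ROW>
    <COL name="name">lostCardHeartbeat</COL>
    <COL name="description">Records missed card heartbeats for correlation</COL>
    <COL name="type">lostHeartbeat</COL>
    <COL name="events">cardHeartbeatMissed</COL>
    <COL name="enable">1</COL>
    <COL name="rule"><SCRIPT>
      // Hypothetical sketch: count the miss for the card named by the
      // event, publish the card name to listeners, and always fire so
      // that the second-level detectors can correlate.
      var misses = dbRead("fm", "cardHeartbeats", "misses", card) + 1;
      dbWrite("fm", "cardHeartbeats", "misses", card, misses);  // dbWrite is an assumed API
      publish("card");
      setOutput(100);  // this detector always fires
    </SCRIPT></COL>
  </ROW>
</TBL>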
[0049] Both the bridge 310 detector and the card failure detector
306 listen to the lost heartbeat detector 314. The detectors will
run serially, in the order defined in the XML file, but in general,
the rules for each are designed so that the order in which they run
does not matter. For this example, we assume the bridge failure
detector 310 runs first.
[0050] If the bridge supports diagnostics, they can be called from
the bridge failure detector 310. The results of the tests can be
used to determine that the bridge has failed, and fire the detector
immediately. The bridge detector, by firing, causes the bridge
failure policy 316 to run. If the problem is intermittent, or the
diagnostics cannot detect certain conditions, event correlation
must be done by the bridge failure detector 310. The bridge failure
detector 310 looks at the card database table to determine if all
of the cards have had heartbeat failures within a given period of
time. If they have, the bridge is assumed to be bad, and the bridge
failure detector 310 fires.
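Under the same assumed table layout, the correlation step in the bridge failure detector 310 might look like this sketch. The slot names are hypothetical, array and loop support in the embedded subset is assumed, and the time-window test described above is omitted for brevity.

<TBL name="detector">
  <ROW>
    <COL name="name">bridgeFailure</COL>
    <COL name="description">Infers a bridge failure when all bridged cards miss heartbeats</COL>
    <COL name="type">bridgeFault</COL>
    <COL name="types">lostHeartbeat</COL>
    <COL name="enable">1</COL>
    <COL name="rule"><SCRIPT>
      // Hypothetical correlation: if every one of the five bridged
      // cards has missed a heartbeat, blame the bridge and fire,
      // which invokes the bridge failure policy 316.
      var cards = ["card1", "card2", "card3", "card4", "card5"];
      var allMissed = 1;
      for (var i = 0; i < 5; i++) {
        if (dbRead("fm", "cardHeartbeats", "misses", cards[i]) == 0) {
          allMissed = 0;
        }
      }
      if (allMissed) {
        setOutput(100);
      }
    </SCRIPT></COL>
  </ROW>
</TBL>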
[0051] The card failure detector 306 engages in a similar process.
The card failure detector can invoke the card failure policy 312.
If card diagnostics can show the card has failed, the detector can
run those diagnostics to determine whether to fire based on that
condition. Because the diagnostics may not run correctly in the
case of a bridge failure or other intermittent problem, the
correlation table once again comes into play. If the card that lost
a heartbeat has repeatedly lost heartbeats recently, and at least
one card in the correlation table has not lost any heartbeats, the
bridge chip has not failed, but the card has. The bridge failure
event and the card failure event show two additional methods by
which a failure in these components can be detected. If driver code
(the interface software between the operating system and the
device) can internally detect a card or bridge failure, the event
can be sent directly. In this case, if either second level detector
was triggered through an external event, no additional diagnosis or
correlation would be required, and the detector would fire.
Detectors can determine whether or not an event caused them to fire
by looking at the local "_EVENT" embedded JavaScript variable.
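A sketch of the _EVENT test described above: the rule fires immediately when a driver-originated event triggered it and otherwise falls back to correlation. It is assumed that _EVENT is non-zero when an event caused the invocation.

<COL name="rule"><SCRIPT>
  // If a driver-originated event triggered this detector directly,
  // no further diagnosis or correlation is needed: fire at once.
  // Otherwise run the heartbeat correlation sketched above.
  if (_EVENT) {
    setOutput(100);
  } else {
    // ... correlation against the cardHeartbeats table ...
  }
</SCRIPT></COL>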
[0052] The abovementioned description of a method for fault
managing in a multinode networked computing environment according
to the preferred embodiments of the present invention is merely
exemplary in nature and is in no way intended to limit the invention
or its application or uses. Further, in the abovementioned
description, numerous specific details are set forth to provide a
more thorough understanding of the present invention. It will be
apparent, however, to one skilled in the art, that the present
invention may be practiced without these specific details. In other
instances, characteristics and functions of well-known
processes have not been described so as not to obscure the present
invention.
* * * * *