U.S. patent application number 14/971187 was filed with the patent office on 2015-12-16 and published on 2016-04-14 for management system for managing computer system comprising multiple monitoring-target devices.
This patent application is currently assigned to HITACHI, LTD. The applicant listed for this patent is HITACHI, LTD. Invention is credited to Masataka NAGURA, Jun NAKAJIMA.
Publication Number | 20160103727 |
Application Number | 14/971187 |
Family ID | 50182791 |
Filed Date | 2015-12-16 |
Publication Date | 2016-04-14 |
United States Patent Application | 20160103727 |
Kind Code | A1 |
NAKAJIMA; Jun ; et al. | April 14, 2016 |
MANAGEMENT SYSTEM FOR MANAGING COMPUTER SYSTEM COMPRISING MULTIPLE
MONITORING-TARGET DEVICES
Abstract
A management system manages a computer system including multiple
monitoring-target devices. A storage device of the management
system stores a general rule, general plan information, unresolved
information, and configuration information. A control device of the
management system creates multiple expanded rules based on the
general rule and the configuration information, and if an event
related to any of the multiple monitoring-target devices has
occurred, identifies, based on the multiple expanded rules, a first
conclusion event constituting a candidate for the cause of the
occurred event, creates, based on the general plan information, one
or more expanded plans, which are recovery plans that can be
implemented if the first conclusion event is a cause, identifies an
unresolved event based on the unresolved information, identifies a
risk site based on the identified unresolved event, and displays
data showing the first conclusion event, expanded plan, and risk
site.
Inventors: | NAKAJIMA; Jun; (Yokohama, JP) ; NAGURA; Masataka; (Yokohama, JP) |
Applicant: |
Name | City | State | Country | Type |
HITACHI, LTD. | Tokyo | | JP | |
Assignee: | HITACHI, LTD., Tokyo, JP |
Family ID: | 50182791 |
Appl. No.: | 14/971187 |
Filed: | December 16, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
13991671 | Jun 5, 2013 | 9244800 |
PCT/JP2012/072310 | Sep 3, 2012 | |
14971187 | | |
Current U.S. Class: | 714/37 |
Current CPC Class: | G06F 11/3051 20130101; G06F 11/0727 20130101; G06F 11/0751 20130101; G06F 11/1484 20130101; G06F 11/0793 20130101; G06F 11/2257 20130101; G06F 11/0709 20130101; G06F 11/079 20130101 |
International Class: | G06F 11/07 20060101 G06F011/07 |
Claims
1. A management system for managing a computer system, which
comprises multiple monitoring-target devices, comprising: a storage
device; and a control device coupled to the storage device, wherein
the storage device is configured to store: a general plan that
includes a recovery plan available in a case where a failure event
happens in the multiple monitoring-target devices; and
configuration information, that includes a coupling relationship
between the multiple monitoring-target devices, wherein the control
device is configured to: create, based on the failure event and the
configuration information, one or more expanded plans into which
the general plan is expanded by taking into account configurations of
the monitoring-target devices, and display the one or more expanded
plans and predicted information corresponding to the one or more
expanded plans.
2. The management system according to claim 1, wherein the
predicted information comprises information about one of the
multiple monitoring-target devices which is related to the failure
event.
3. The management system according to claim 1, wherein the
predicted information comprises information about one of the
multiple monitoring-target devices which is a subject of one or
more of the expanded plans.
4. The management system according to claim 1, wherein the
predicted information comprises information about an implementation
of one or more of the expanded plans.
5. The management system according to claim 2, wherein the
predicted information is information about a prediction of state
after implementation of one or more of the expanded plans, and
wherein the predicted information comprises at least one of
information intending a risk after implementation of one or more of
the expanded plans, information intending performance, information
intending capacity and information intending cost.
6. The management system according to claim 3, wherein the
predicted information is information intending a prediction about
state after implementation of one or more of the expanded plans,
and wherein the predicted information comprises at least one of
information intending a risk after implementation of one or more of
the expanded plans, information intending performance, information
intending capacity and information intending cost.
7. The management system according to claim 4, wherein the
predicted information is information intending a prediction about
one or more of the expanded plans itself, and wherein the predicted
information comprises at least one of information intending time
required for implementation of one or more of the expanded plans
and information intending costs for implementation of one or more
of the expanded plans.
8. The management system according to claim 1, wherein the control
device is further configured to create, when the general plan
includes a volume migration, a first expanded plan related to the
volume migration, that includes identifying a volume related to the
failure event as a migration-source volume, and identifying a
migration-destination volume based on the coupling relationship
with the migration-source volume, compute, based on an I/O response
time for the migration-source volume and the migration-destination
volume with respect to the first expanded plan, a predicted value
for the I/O response time of the migration-source volume and the
migration-destination volume after the first expanded plan has been
implemented, and display the predicted value for the I/O response
time.
9. The management system according to claim 8, wherein the control
device is further configured to: create, when the general plan
includes a disk addition to a pool, a second expanded plan related
to disk addition to a pool that identifies the pool related to the
failure event as a disk addition-target pool, compute, based on an
I/O response time for the disk addition-target pool and a
before-and-after disk addition capacity ratio of the disk
addition-target pool with respect to the first expanded plan, a
second predicted value for the I/O response time of the disk
addition-target pool after the second expanded plan has been
implemented, and display the second predicted value for the I/O
response time.
10. The management system according to claim 9, wherein the control
device is configured to compute a third predicted value based on a
performance value related to the monitoring-target devices before
and after implementation of the first expanded plan and the second
expanded plan and the performance value related to the
monitoring-target devices, and further display the third predicted
value for the performance value.
11. The management system according to claim 10, wherein the
control device is configured to consolidate similar expanded plans
from multiple expanded plans that includes the first expanded plan
and the second expanded plan into a single consolidated expanded
plan, and display data showing the single consolidated expanded
plan.
12. The management system according to claim 11, wherein the
storage device is configured to further store maintenance schedule
information, which shows a schedule of maintenance operations to be
performed for any of the multiple monitoring-target devices, and
wherein the control device is configured to further display data
showing the schedule of maintenance operations to be performed for
the monitoring-target devices, which are related to one or more of
the expanded plans.
13. The management system according to claim 12, wherein the
storage device is configured to further store cost information
showing, with respect to the general plan, the cost required for
implementing the multiple expanded plans, and wherein the control
device is configured to compute, with respect to each of the
multiple expanded plans, the cost required for implementing the
multiple expanded plans based on a cost required for implementing
the general plan, and further display the computed cost.
14. A non-transitory computer-readable storage medium having
instructions stored thereon, execution of which by a processor
causes the processor to perform operations, the operations
comprising: storing a general plan that includes a recovery plan
available in a case where a failure event happens in multiple
monitoring-target devices; storing configuration information, that
includes a coupling relationship between the multiple
monitoring-target devices; and creating, based on the failure event
and the configuration information, one or more expanded plans into
which the general plan is expanded by taking into account configurations
of the monitoring-target devices.
15. A method for managing a computer system, comprising: storing a
general plan that includes a recovery plan available in a case
where a failure event happens in multiple monitoring-target
devices; storing configuration information, that includes a
coupling relationship between the multiple monitoring-target
devices; creating, based on the failure event and the configuration
information, one or more expanded plans into which the general
plan is expanded by taking into account configurations of the
monitoring-target devices; and displaying the one or more expanded
plans and predicted information corresponding to the one or more
expanded plans.
Description
TECHNICAL FIELD
[0001] The present invention, for example, relates to technology
for managing a computer system comprising monitoring-target
devices, such as a host computer, a network switch, and a storage
apparatus.
BACKGROUND ART
[0002] In the management of a computer system, the use of Event
Correlation technology or another such technology for identifying
the cause of an event-based failure makes it possible for the
administrator of the computer system to detect the cause of a
failure, which has occurred in the computer system (refer to Patent
Literature 1).
[0003] There exists technology in which an analysis engine for
analyzing the causal relationship between multiple failure events,
which have occurred in a machine that is under management, applies
a general rule, which comprises a predetermined condition statement
and conclusion statement, to an event related to the machine under
management, for example, an event in which a performance value
exceeds a prescribed threshold, to create an expanded rule
comprising a conclusion event, which is the cause of a drop in
performance, and a condition event group caused thereby, and
identifies the failure based on the created expanded rule (refer to
Patent Literature 2).
[0004] There are numerous useful methods capable of being
implemented in recent computer systems as recovery methods with
respect to the cause of a failure identified in accordance with
cause analysis, and, for example, there is a method for recovering
from a failure by performing an appropriate data migration at
system resource (virtual machine, data) allocation. As data
migration technology in an environment in which multiple virtual
host computers (that is, virtual machines, hereinafter called "VM")
are running on a physical host computer, for example, technology
(first VM migration) for transferring a VM operating environment
from a certain physical host computer to another physical host
computer, and technology (second VM migration) for migrating a VM
stored in a certain storage area to another storage area, in
accordance with information denoting VM performance and resource
utilization information are known. Here, the VM is a type of data
stored in a storage area, and the VM migration (the first VM
migration and the second VM migration) is a type of data migration
between storage areas. Data migration technology (volume migration)
between data storage areas (volumes) in a storage apparatus is also
known (refer to Patent Literature 3).
CITATION LIST
Patent Literature
[0005] [PTL 1] U.S. Pat. No. 7,107,185 (Description)
[0006] [PTL 2] Japanese Patent Application Laid-open No. 2010-86115
[0007] [PTL 3] U.S. Pat. No. 6,108,748 (Description)
SUMMARY OF INVENTION
Technical Problem
[0008] When dealing with a failure identified using Event
Correlation technology such as that of Patent Literature 1, the
administrator does not specifically know which method should be
implemented to recover from the failure, and recovering from the
failure is costly.
[0009] In order to deal with a failure identified using the Event
Correlation technology, cases can exist in which a variety of
plans are capable of being implemented, such as a plan for
recovering from the failure cause itself, and a plan that places
the highest priority on resuming a business operation, which was
affected by the failure cause. For example, in the case of a drop
in performance in business responses resulting from a performance
failure in a storage port, there is a plan for migrating the
business operation to another server (for example, a plan related
to the VM migration) so as not to use the failed port, and a plan
for replacing the port.
[0010] When a plan that does not achieve recovery from the failure
cause itself has been implemented, the risk of a failure arising
from that cause can continue to exist even after the plan has been
executed, yet the administrator does not know what problems remain
in which parts of the computer system. Thus, in a case where the
risk of a failure is likely to remain, the administrator could
unintentionally select a plan that does not achieve recovery from
the failure cause itself (for example, a plan that places the
highest priority on resuming business operations).
Solution to Problem
[0011] A management system manages a computer system comprising
multiple monitoring-target devices. A storage device of the
management system stores: a general rule, which shows a
corresponding relationship between one or more condition events,
which are related to any of the multiple monitoring-target devices,
and a conclusion event, which is related to any of the multiple
monitoring-target devices and constitutes a cause in a case where
the one or more condition events have occurred, this general rule
moreover denoting a monitoring-target device related to the
condition event and the conclusion event according to the relevant
monitoring-target device type; general plan information, which
shows a corresponding relationship between the general rule and one
or more general plans, which are recovery plans available in a case
where the conclusion event of the general rule is a cause;
unresolved information, which shows, for each combination of the
general rule and the general plan, a condition event of the
relevant general rule that remains unresolved in a case where the
relevant general plan has been implemented; and configuration
information, which shows a coupling relationship between the
multiple monitoring-target devices. A control device of the management system:
creates, based on the general rule and the configuration
information, multiple expanded rules, which denote, by means of
data showing an identified monitoring-target device, the type of
monitoring-target device related to the condition event and the
conclusion event; performs, based on the created multiple expanded
rules in a case where an event related to any of the multiple
monitoring-target devices has occurred, a cause analysis by
regarding the occurred event as the condition event, and identifies
a first conclusion event, which constitutes a candidate for the
cause of the occurred event; creates, based on the general plan
information, one or more expanded plans, which are recovery plans
that can be implemented in a case where the first conclusion event
is a cause, and which are obtained by expanding, in accordance with the actual
configuration of the computer system, a general plan corresponding
to the general rule, which constitutes the basis of an expanded
rule comprising the first conclusion event; identifies, based on
the unresolved information for each of the created one or more
expanded plans, an unresolved event, which remains unresolved in a
case where the relevant expanded plan has been implemented, and
identifies, based on the identified unresolved event, a risk site,
which is a monitoring-target device for which a problem remains
even after the relevant expanded plan has been implemented; and
displays data showing the first conclusion event, the created one
or more expanded plans, and the identified risk site.
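The cause analysis step described above can be illustrated with a minimal sketch. It assumes that an expanded rule is a set of condition events on concrete devices plus one conclusion event, and that a rule becomes a cause candidate as soon as at least one of its condition events has occurred. The class names, field names, device identifiers, and this matching policy are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    device_id: str   # hypothetical identifier of a concrete device
    metric: str      # e.g. "response_time"
    status: str      # e.g. "threshold_exceeded"


@dataclass(frozen=True)
class ExpandedRule:
    rule_id: str
    conditions: frozenset   # condition events on concrete devices
    conclusion: Event       # event regarded as the cause


def find_cause_candidates(rules, occurred):
    """Return (rule, matched condition events) for each rule hit by an occurred event."""
    occurred = set(occurred)
    candidates = []
    for rule in rules:
        matched = rule.conditions & occurred
        if matched:  # assumption: one matching condition event is enough
            candidates.append((rule, matched))
    return candidates


slow_drive = Event("ServerA:/dev/sda", "response_time", "threshold_exceeded")
slow_volume = Event("StorageA:VOL1", "response_time", "threshold_exceeded")
busy_pool = Event("StorageA:POOL1", "response_time", "threshold_exceeded")

rule = ExpandedRule("ExRule1", frozenset({slow_drive, slow_volume}), busy_pool)
candidates = find_cause_candidates([rule], [slow_drive])
for r, matched in candidates:
    print(r.conclusion.device_id, len(matched), "of", len(r.conditions))
```

In this sketch the conclusion event of each returned rule plays the role of the first conclusion event, and the matched count only indicates how much of the rule was observed.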
BRIEF DESCRIPTION OF DRAWINGS
[0012] FIG. 1 is a constitutional diagram of an example of a
computer system related to Example 1.
[0013] FIG. 2 is a constitutional diagram of an example of a
management server related to Example 1.
[0014] FIG. 3 is a constitutional diagram of an example of a
storage apparatus related to Example 1.
[0015] FIG. 4 is a constitutional diagram of an example of a
physical server related to Example 1.
[0016] FIG. 5 is a constitutional diagram of an example of a
configuration information table related to Example 1.
[0017] FIG. 6 is a constitutional diagram of an example of a
performance information table related to Example 1.
[0018] FIG. 7 is a constitutional diagram of an example of an event
management table related to Example 1.
[0019] FIG. 8 is a constitutional diagram of an example of a
general rule table related to Example 1.
[0020] FIG. 9 is a constitutional diagram of an example of a
general plan table related to Example 1.
[0021] FIG. 10 is a constitutional diagram of an example of a
rule-plan association table related to Example 1.
[0022] FIG. 11 is a constitutional diagram of an example of an
expanded rule table related to Example 1.
[0023] FIG. 12A is a first constitutional diagram of an expanded
plan table related to Example 1.
[0024] FIG. 12B is a second constitutional diagram of an expanded
plan table related to Example 1.
[0025] FIG. 13 is a constitutional diagram of an example of an
analysis result management table related to Example 1.
[0026] FIG. 14 is a flowchart of an event analysis process related
to Example 1.
[0027] FIG. 15 is a flowchart of a rule expansion process related
to Example 1.
[0028] FIG. 16 is a flowchart of a plan creation process related to
Example 1.
[0029] FIG. 17 is a flowchart of a post-plan execution risk
selection process related to Example 1.
[0030] FIG. 18 is a flowchart of a plan presentation process
related to Example 1.
[0031] FIG. 19 is a flowchart of an effect-risk presentation
process related to Example 1.
[0032] FIG. 20 is a constitutional diagram of an example of a plan
presentation screen related to Example 1.
[0033] FIG. 21 is a constitutional diagram of an example of a plan
details screen related to Example 1.
[0034] FIG. 22 is a constitutional diagram of an example of a
management server related to Example 2.
[0035] FIG. 23 is a constitutional diagram of an example of a
physical server related to Example 2.
[0036] FIG. 24 is a logical constitutional diagram of an example of
a physical server related to Example 2.
[0037] FIG. 25 is a constitutional diagram of an example of a
switch related to Example 2.
[0038] FIG. 26 is a constitutional diagram of an example of a
configuration information table related to Example 2.
[0039] FIG. 27 is a constitutional diagram of an example of a VM
configuration information table related to Example 2.
[0040] FIG. 28 is a constitutional diagram of an example of a
performance information table related to Example 2.
[0041] FIG. 29 is a constitutional diagram of an example of an
event management table related to Example 2.
[0042] FIG. 30A is a first constitutional diagram of a general rule
table related to Example 2.
[0043] FIG. 30B is a second constitutional diagram of a general
rule table related to Example 2.
[0044] FIG. 31 is a constitutional diagram of an example of a
rule-plan association table related to Example 2.
[0045] FIG. 32A is a first constitutional diagram of an expanded
rule table related to Example 2.
[0046] FIG. 32B is a second constitutional diagram of an expanded
rule table related to Example 2.
[0047] FIG. 33A is a first constitutional diagram of an expanded
plan table related to Example 2.
[0048] FIG. 33B is a second constitutional diagram of an expanded
plan table related to Example 2.
[0049] FIG. 33C is a third constitutional diagram of an expanded
plan table related to Example 2.
[0050] FIG. 34 is a constitutional diagram of an example of an
analysis result management table related to Example 2.
[0051] FIG. 35 is a flowchart of a plan presentation process
related to Example 2.
[0052] FIG. 36 is a constitutional diagram of an example of a plan
presentation screen related to Example 2.
[0053] FIG. 37 is a constitutional diagram of an example of a
management server related to Example 3.
[0054] FIG. 38 is a constitutional diagram of an example of a
general plan table related to Example 3.
[0055] FIG. 39 is a constitutional diagram of an example of a
rule-plan association table related to Example 3.
[0056] FIG. 40A is a first constitutional diagram of an expanded
plan table related to Example 3.
[0057] FIG. 40B is a second constitutional diagram of an expanded
plan table related to Example 3.
[0058] FIG. 41 is a constitutional diagram of an example of a
maintenance information management table related to Example 3.
[0059] FIG. 42 is a flowchart of an effect-risk presentation
process related to Example 3.
[0060] FIG. 43 is a constitutional diagram of an example of a plan
details screen related to Example 3.
DESCRIPTION OF EMBODIMENTS
[0061] A number of examples will be explained by referring to the
drawings. Furthermore, the examples explained below do not limit
the invention related to the claims, and not all of the elements
and combinations thereof explained in the examples are essential
for the solution provided by the invention. In the drawings, the
same reference signs denote the same components throughout the
multiple drawings. In the following explanation, information on the
present invention is explained using the expression "aaa table",
but this information may also be expressed using a data structure
other than a table. Thus, to show that the information is not
dependent on the data structure, "aaa table" may be called "aaa
information". In addition, when explaining the contents of the
respective information, the expressions "identification
information", "identifier", "name" and "ID" are used, but these
expressions are interchangeable.
[0062] In the following explanation, there may be cases where an
explanation is given using a "program" as the doer of the action,
but since the defined processing is performed in accordance with a
program being executed by a processor while using a memory and a
communication port (a communication device, a management I/F, and a
data I/F), the explanation may also give the processor as the doer
of the action. A process, which is disclosed as having a program as
the doer of the action, may be regarded as a process performed by a
management server or other such computer, and an information
processing apparatus. Furthermore, either all or a portion of a
program may be realized in accordance with dedicated hardware.
Various types of programs may be installed in respective computers
using a program delivery server or computer-readable storage
media.
[0063] Hereinafter, a set of one or more computers, which manage
the computer system and display the display information of the
present invention may be called a management system. In a case
where a management server displays the display information, the
management server is the management system, and a combination of a
management server and a display computer is also the management
system. Furthermore, processing similar to that of the management
server may be realized using multiple computers to increase the
speed and reliability of management processing, and in this case,
the relevant multiple computers (including a display computer in a
case where a display computer carries out displays) are the
management system.
EXAMPLE 1
[0064] First, a computer system related to Example 1 will be
explained.
[0065] FIG. 1 is a constitutional diagram of an example of a
computer system related to Example 1.
[0066] The computer system related to Example 1 comprises one or
more management servers 10000 (management server A in FIG. 1), one
or more storage apparatuses 20000 (storage apparatus A in FIG. 1),
and one or more physical servers 30000 (physical server A in FIG.
1). The physical server 30000 and the storage apparatus 20000 are
coupled together via a SAN (Storage Area Network) 40000
(specifically, a fibre channel). The management server 10000, the
storage apparatus 20000, and the physical server 30000 are coupled
together via a management network 50000.
[0067] The management server 10000 stores a plan creation program
11100, a post-plan execution risk selection program 11200, a plan
presentation program 11300, a configuration/performance information
repository 11800, and a rule-plan information repository 11900 in a
memory 11000 (refer to FIG. 2). The management server 10000 is able
to communicate with a program running on the storage apparatus
20000 and the physical server 30000 via the management network
50000.
[0068] One or more logical volumes 22100 are prepared in the
storage apparatus 20000. The logical volume 22100, for example, is
provided to the physical server 30000. In the example shown in FIG.
1, the storage apparatus A provides the logical volume 22100 to the
physical server A.
[0069] The physical server 30000 executes various business
operations using the logical volume 22100 provided from the storage
apparatus 20000. In the example shown in FIG. 1, the physical
server A and the storage apparatus A are coupled together via the
SAN 40000.
[0070] In the example shown in FIG. 1, the management server 10000
stores programs, such as the plan creation program 11100, the
post-plan execution risk selection program 11200, and the plan
presentation program 11300, but the constitution of the computer
system is not limited thereto. For example, either the storage
apparatus 20000 or the physical server 30000 may store a variety of
programs, and another apparatus, such as a switch (not shown in the
drawing) installed between the respective apparatuses, may also
store various programs. The connection between the storage
apparatus 20000 and the physical server 30000 is not limited to a
direct connection via a fibre channel, and the storage apparatus
20000 and the physical server 30000 may be coupled via network
equipment, such as one or more fibre channel switches. The
connection between the storage apparatus 20000 and the physical
server 30000 is a data communication network, and, for example, may
be an IP (Internet Protocol) network.
[0071] FIG. 2 is a constitutional diagram of an example of the
management server related to Example 1.
[0072] The management server 10000 comprises a memory 11000, a
storage device 12000, an input device 13000, an output device
14000, a processor 15000, and a communication device 16000, and
these elements are coupled together via an internal bus or other
such communication channel 17000.
[0073] The memory 11000 stores the plan creation program 11100, the
post-plan execution risk selection program 11200, the plan
presentation program 11300, an event analysis processing program
11400, a rule expansion program 11500, a configuration setting
management program 11600, a performance information collection
program 11700, the configuration/performance information repository
11800, and the rule-plan information repository 11900.
[0074] A configuration information table 11810 and a performance
information table 11820 are stored in the configuration/performance
information repository 11800. An event management table 11910, one
or more general rule tables 11920, a general plan table 11930, a
rule-plan association table 11940, one or more expanded rule tables
11950, one or more expanded plan tables 11960, and an analysis
result management table 11970 are stored in the rule-plan
information repository 11900.
[0075] The configuration information table 11810 manages
information showing an apparatus and a device that exist on an I/O
(input/output) route from the physical server 30000 to a physical
disk, which configures the logical volume 22100 used by the
physical server 30000, that is, information (hereinafter referred
to as "configuration information") showing the coupling
relationship of the apparatus and the device based on the I/O
route.
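As an illustration only, configuration information of this kind could be held as one record per I/O route, so that every device sharing a route with a given device can be looked up. The field names and device identifiers below are assumptions made for the sketch and do not reproduce the columns of the configuration information table 11810.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoRoute:
    # Hypothetical columns: one row per I/O route from a server to a disk pool.
    server: str
    drive: str
    storage: str
    volume: str
    pool: str


ROUTES = [
    IoRoute("ServerA", "/dev/sda", "StorageA", "VOL1", "POOL1"),
    IoRoute("ServerA", "/dev/sdb", "StorageA", "VOL2", "POOL1"),
    IoRoute("ServerB", "/dev/sda", "StorageA", "VOL3", "POOL2"),
]


def routes_through(routes, field_name, value):
    """Return every I/O route whose named field equals the given device."""
    return [r for r in routes if getattr(r, field_name) == value]


affected = routes_through(ROUTES, "pool", "POOL1")
print(sorted({r.volume for r in affected}))
```

Keeping one row per route makes the coupling relationship explicit: a lookup on any column yields the other devices on the same I/O route.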
[0076] The performance information table 11820 manages performance
information with respect to monitoring-target apparatuses coupled
to the SAN 40000 and the devices in the monitoring-target
apparatuses (monitoring-target devices).
[0077] The event management table 11910 manages information showing
when an event occurred in relation to what type of metrics for
which device in the computer system.
[0078] The general rule table 11920 manages a general rule showing
the corresponding relationship between one or more condition events
capable of occurring in the computer system and a conclusion event,
which is regarded as the cause of a failure with respect to the one
or more condition events.
[0079] The general plan table 11930 manages information showing a
recovery plan capable of being implemented in the computer system
with respect to a failure.
[0080] The rule-plan association table 11940 manages information
(general plan information) denoting the corresponding relationship
between the general rule and a general plan, which represents a
recovery plan corresponding to the relevant general rule, that is,
a recovery plan capable of being implemented in a case where the
conclusion event of the relevant general rule is the cause of the
failure, and information (unresolved information) denoting which of
the failure events in the general rule remains unresolved after the
respective general plans have been executed.
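The role of the unresolved information can be sketched as a lookup keyed by the combination of general rule and general plan. The rule IDs, plan IDs, device types, and device identifiers below are hypothetical, and the mapping from device type to concrete device is assumed to come from an expanded rule.

```python
# (general rule ID, general plan ID) -> condition events that remain unresolved
# after the plan has been executed. All entries are illustrative assumptions.
UNRESOLVED = {
    ("Rule1", "Plan1"): [],                              # nothing remains
    ("Rule1", "Plan2"): [("volume", "response_time")],   # one event remains
}


def risk_sites(rule_id, plan_id, device_of_type):
    """Map the unresolved condition events to concrete devices (risk sites)."""
    events = UNRESOLVED.get((rule_id, plan_id), [])
    return [device_of_type[dev_type] for dev_type, _metric in events]


devices = {
    "drive": "ServerA:/dev/sda",
    "volume": "StorageA:VOL1",
    "pool": "StorageA:POOL1",
}
print(risk_sites("Rule1", "Plan2", devices))
```

A plan with an empty list leaves no risk site, whereas a plan with remaining events yields the devices on which a problem may persist after execution.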
[0081] The expanded rule table 11950 stores information (an
expanded rule), which embodies the information of the general rule
table 11920 based on the information of the configuration
information table 11810.
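The expansion of a general rule into expanded rules can be sketched as follows, assuming that a general rule names only device types and that each row of configuration information supplies one set of concrete, mutually coupled devices. The structure names, keys, and ID scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralRule:
    rule_id: str
    condition_types: tuple   # device types of the condition events
    conclusion_type: str     # device type of the conclusion event


def expand_rule(rule, config_rows):
    """Create one expanded rule per configuration row of coupled devices."""
    expanded = []
    for i, row in enumerate(config_rows, start=1):
        expanded.append({
            "expanded_rule_id": f"{rule.rule_id}-{i}",   # assumed ID scheme
            "conditions": [row[t] for t in rule.condition_types],
            "conclusion": row[rule.conclusion_type],
        })
    return expanded


config_rows = [
    {"drive": "ServerA:/dev/sda", "volume": "StorageA:VOL1", "pool": "StorageA:POOL1"},
    {"drive": "ServerA:/dev/sdb", "volume": "StorageA:VOL2", "pool": "StorageA:POOL1"},
]
rule = GeneralRule("Rule1", ("drive", "volume"), "pool")
expanded = expand_rule(rule, config_rows)
for e in expanded:
    print(e)
```

Here a single general rule yields two expanded rules that share the same conclusion device, because both routes pass through the same pool.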
[0082] The expanded plan table 11960 manages information (an
expanded plan), which embodies the information of the general plan
table 11930 based on the information of the configuration
information table 11810 and the performance information table
11820.
[0083] The analysis result management table 11970 stores
information related to an apparatus and a device in which an event
judged to be the cause of a failure has occurred, and a failure
event judged to be the cause thereof.
[0084] The storage device 12000 is an HDD (Hard Disk Drive), SSD
(Solid State Drive) or the like for storing information. The input
device 13000 is a device for the administrator to input an
instruction to the management server 10000, and, for example, is a
keyboard. The output device 14000 is a device for outputting the
result of processing executed by the management server 10000, for
example, the result of the execution of the plan presentation
program 11300, and, for example, is a display. The processor 15000
executes a program deployed on the memory 11000. The communication
device 16000 is a device for coupling to the management network
50000.
[0085] In the example shown in FIG. 2, various programs and tables
are stored in the memory 11000, but these programs and tables may
be stored in either the storage device 12000 or another storage
medium (not shown in the drawing). In this case, at program
execution, the processor 15000 reads the target program into the
memory 11000, and executes the program that has been read.
Also, the above-mentioned programs and tables may be stored in
either the memory 21000 of the storage apparatus 20000 (refer to
FIG. 3) or the memory 31000 of the physical server 30000 (refer to
FIG. 4), and either the storage apparatus 20000 or the physical
server 30000 may execute a stored program. Another apparatus, such
as another physical server 30000 or a switch (not shown in the
drawing) may store the above-mentioned programs and tables, and may
execute a stored program.
[0086] FIG. 3 is a constitutional diagram of an example of the
storage apparatus related to Example 1.
[0087] The storage apparatus 20000 comprises a memory 21000, a
logical volume provision part 22000, a disk I/F controller 23000, a
management I/F 24000, a processor 25000, and a data I/F 26000, and
these elements are coupled via a communication channel 27000, such
as an internal bus.
[0088] The memory 21000 comprises a disk cache 21100. The memory
21000 stores a configuration/performance information collection
program 21200. The disk cache 21100 is a storage area for
temporarily storing information. The configuration/performance
information collection program 21200 is for sending/receiving the
management information and the performance information of the
storage apparatus 20000 and the like to/from the management server
10000.
[0089] The logical volume provision part 22000 comprises a disk
pool 22200 configured from a storage area of one or more physical
disks (not shown in the drawing), logically partitions the storage
area of the disk pool 22200, and provides the relevant logically
partitioned storage areas as logical volumes 22100. This makes it
possible to access a logical volume 22100 from an apparatus outside
the relevant storage apparatus 20000. A disk pool number is
appended to a disk pool 22200, and a logical volume number is
appended to a logical volume 22100. This makes it possible for the
storage apparatus 20000 to uniquely identify each disk pool 22200
and logical volume 22100.
[0090] In the example shown in FIG. 3, each of two disk pools 22200
(POOL1 and POOL2) is logically partitioned, and four logical
volumes 22100 (LV1, LV2, LV3 and LV4) are provided to an apparatus
(for example, the physical server 30000) outside the storage
apparatus 20000. The disk I/F controller 23000 is an interface
device for coupling to the logical volume provision part 22000. The
management I/F 24000 is an interface device for coupling to the
management network 50000. The processor 25000 executes a program
deployed on the memory 21000. The data I/F 26000 is an interface
device for coupling to the SAN 40000. Furthermore, there may be
multiple disk I/F controllers 23000, management I/Fs 24000, and
data I/Fs 26000.
[0091] In the example shown in FIG. 3, the storage apparatus 20000
comprises two data I/Fs 26000, data I/F (P1) and data I/F (P2). In
the example shown in FIG. 3, the configuration/performance
information collection program 21200 is stored in the memory 21000,
but this program 21200 may be stored in another storage device (not
shown in the drawing) or another storage medium (not shown in the
drawing). In this case, at process execution time, the
processor 25000 reads the configuration/performance information
collection program 21200 into the memory 21000, and executes the read
configuration/performance information collection program 21200.
[0092] The configuration/performance information collection program
21200 may be stored in the memory 11000 of the management server
10000, and the management server 10000 may execute the stored
program 21200. Also, another storage apparatus 20000 may store the
configuration/performance information collection program 21200 and
may execute the stored program 21200. The logical volume provision
part 22000 may make all the storage areas on a single physical disk
22200 into one logical volume 22100. The logical volume provision
part 22000 may also prepare a logical volume 22100 using the
storage area of a storage medium other than the physical disk
22200, for example, a flash memory.
[0093] FIG. 4 is a constitutional diagram of an example of the
physical server related to Example 1.
[0094] The physical server 30000 comprises a memory 31000, a data
I/F 32000, a processor 33000, and a management I/F 34000, and these
elements are coupled together via an internal bus or other such
communication channel 35000.
[0095] The memory 31000 stores a configuration performance
information collection program 31100, a business program 31200, and
a volume management program 31300.
[0096] The configuration performance information collection program
31100 is for sending/receiving physical server 30000 management
information, performance information, and the like to/from the
management server 10000. The business program 31200 is for
realizing a business operation executed by the physical server
30000, and, for example, is a DBMS (Data Base Management System) or
a file system. The volume management program 31300 is for
allocating a logical volume 22100 provided by the storage apparatus
20000 to the physical server 30000. The physical server 30000
executes a business operation using the logical volume 22100
allocated by the volume management program 31300.
[0097] The data I/F 32000 is an interface device for coupling to
the SAN 40000. The processor 33000 executes a program deployed on
the memory 31000. The management I/F 34000 is an interface device
for coupling to the management network 50000.
[0098] There may be multiple data I/Fs 32000 and management I/Fs
34000. In the example shown in FIG. 4, a variety of programs are
stored in the memory 31000, but these programs may be stored in
another storage apparatus (not shown in the drawing). In this case,
at process execution time, the processor 33000 reads the
target program into the memory 31000, and executes the read
program.
[0099] FIG. 5 is a constitutional diagram of an example of the
configuration information table related to Example 1.
[0100] The configuration information table 11810 stores information
related to an I/O route, which extends from the physical server
30000 to the physical disk configuring the logical volume 22100
provided to the relevant physical server 30000, and which the
physical server 30000 uses to access the logical volume 22100. An
entry is added to the configuration information table 11810 in
accordance with the configuration setting management program 11600
being executed.
[0101] The configuration information table 11810 comprises the
fields of a physical server 11811, a drive 11812, a server data I/F
11813, a storage 11814, a storage data I/F 11815, a logical volume
11816, and a disk pool 11817. An identifier for uniquely
identifying a physical server 30000 is stored in the physical
server 11811. An identifier for uniquely identifying a mount point
of a volume on the physical server 30000 is stored in the drive
11812. An identifier for uniquely identifying a physical server
30000 data I/F 32000 (hereinafter may be referred to as "server
data I/F") used when the physical server 30000 accesses the logical
volume 22100 shown by a logical volume 11816 identifier is stored
in the server data I/F 11813. An identifier for uniquely
identifying a storage apparatus 20000 constituting the access
destination of the physical server 30000 is stored in the storage
11814. An identifier for uniquely identifying a storage apparatus
20000 data I/F 26000 (hereinafter may be referred to as "storage
data I/F") used when the physical server 30000 accesses the logical
volume 22100 shown by the logical volume 11816 identifier is stored
in the storage data I/F 11815. An identifier for uniquely
identifying a logical volume 22100 is stored in the logical volume
11816. An identifier for uniquely identifying a disk pool 22200 by
which the logical volume 22100 shown by the logical volume 11816
identifier is prepared is stored in the disk pool 11817.
[0102] For example, the first entry from the top of FIG. 5 shows
that the logical volume "LV1" created from the disk pool "POOL1" of
the storage A (storage apparatus A) is coupled to the physical
server A via the storage data I/F "P1" and the server data I/F
"S1", and is recognized as the logical volume "/opt" on the
physical server A.
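As an illustration only, one entry of the configuration information
table 11810 can be sketched as a record whose fields mirror the fields
11811 through 11817 described above. The class name `ConfigEntry`, the
helper `io_route`, and the identifier spellings "ServerA" and
"StorageA" are assumptions made for this sketch and do not appear in
FIG. 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigEntry:
    """One row of the configuration information table 11810 (sketch)."""
    physical_server: str   # field 11811
    drive: str             # field 11812 (mount point on the server)
    server_data_if: str    # field 11813
    storage: str           # field 11814
    storage_data_if: str   # field 11815
    logical_volume: str    # field 11816
    disk_pool: str         # field 11817


def io_route(entry: ConfigEntry) -> list:
    """List the hops of the I/O route from the server down to the disk pool."""
    return [
        f"{entry.physical_server}:{entry.drive}",
        f"{entry.physical_server}:{entry.server_data_if}",
        f"{entry.storage}:{entry.storage_data_if}",
        f"{entry.storage}:{entry.logical_volume}",
        f"{entry.storage}:{entry.disk_pool}",
    ]


# Values follow the first entry of FIG. 5 as described in the text;
# the apparatus ID spellings are hypothetical.
first = ConfigEntry("ServerA", "/opt", "S1", "StorageA", "P1", "LV1", "POOL1")
print(" -> ".join(io_route(first)))
```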
[0103] Here, the configuration information table 11810 related to
this example comprises information on the physical server 30000,
the server data I/F, the storage apparatus 20000, the storage data
I/F, the logical volume 22100, and the disk pool 22200 as the
apparatuses and devices that exist on the access route, but the
constitution of this table is not limited thereto. For example, the
configuration information table 11810 may comprise information on a
switch and a switch data I/F, and may associatively store business
program (DBMS and so forth) information or VM information on the
physical server 30000, a snapshot volume holding a VM snapshot, and
a clone volume holding a clone. The configuration information table
11810 may hold configuration management operation historical
information, and may be linked to a Syslog server and hold detailed
log information denoting system operations.
[0104] FIG. 6 is a constitutional diagram of an example of the
performance information table related to Example 1.
[0105] The performance information table 11820 stores performance
information related to either an apparatus or a device inside an
apparatus comprising the computer system, for example, performance
information related to the logical volume 22100, the disk pool
22200, and so forth in each storage apparatus 20000. An entry is
added to the performance information table 11820 in accordance with
the performance information collection program 11700 being
executed.
[0106] The performance information table 11820 comprises the fields
of an apparatus ID 11821, a device ID 11822, a metrics 11823, an
equipment OS 11824, a performance value 11825, an alert execution
threshold 11826, a threshold type 11827, and a Status 11828.
[0107] An identifier (apparatus ID) for uniquely identifying an
apparatus is stored in the apparatus ID 11821. An identifier
(device ID) for uniquely identifying a device, which is the
performance information acquisition target, is stored in the device
ID 11822. Information showing a type of performance information,
such as a CPU utilization rate, a number of IOs (IOPS) per unit of
time (for example, one second) for a storage apparatus, and a
response time with respect to a request, is stored in the metrics
11823. Data showing a type of OS (Operating System) running on the
apparatus corresponding to the apparatus ID in the apparatus ID
11821 is stored in the equipment OS 11824. A value of the
performance information for the type shown by the metrics 11823 for
the device shown by the device ID 11822 is acquired from the
apparatus comprising the device and stored in the performance value
11825. Either an upper limit or a lower limit threshold for a
normal range of management-target performance values (hereinafter
referred to as "alert execution threshold") is specified by the
user and stored in the alert execution threshold 11826. Data
showing whether the alert execution threshold is the upper limit or
the lower limit of the normal values is stored in the threshold
type 11827. Data showing whether the performance value 11825 is a
normal value or an abnormal value is stored in the Status
11828.
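The relationship among the performance value 11825, the alert
execution threshold 11826, the threshold type 11827, and the Status
11828 can be sketched as follows. This is a minimal sketch, not the
implementation of the management server 10000; the function name
`status_of` and the string values "upper", "lower", "normal", and
"abnormal" are assumptions chosen for illustration.

```python
def status_of(performance_value: float, threshold: float, threshold_type: str) -> str:
    """Derive a Status value from a performance value and its threshold.

    threshold_type says whether the threshold is the upper or the lower
    limit of the normal range (the literal strings are hypothetical).
    """
    if threshold_type == "upper":
        abnormal = performance_value > threshold
    elif threshold_type == "lower":
        abnormal = performance_value < threshold
    else:
        raise ValueError(f"unknown threshold type: {threshold_type!r}")
    return "abnormal" if abnormal else "normal"


# A CPU utilization rate of 90 against an upper limit of 80 is abnormal.
print(status_of(90.0, 80.0, "upper"))
```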
[0108] In the performance information table 11820 shown in FIG. 6,
only one performance value is held for any one metrics of any one
device on any one apparatus, but the configuration may be such that
each piece of information of the performance information table 11820
is stored together with a value showing the time at which the
performance information collection program 11700 acquired the
information held by each apparatus by communicating with the
configuration performance information collection programs 21200 and
31100, and the performance values at multiple time points
corresponding to the acquired times may be held as historical
information.
[0109] The storage data I/F, the logical volume 22100, the disk
pool 22200, and the mount point recognized by the physical
server 30000 were cited as the performance information
acquisition-target device which is shown by the device ID in the
device ID 11822, but the constitution is not limited thereto, and
the performance information acquisition-target device may also be a
server data I/F, a physical disk, a switch, a switch port, and so
forth.
[0110] Furthermore, as examples of metrics, a CPU utilization rate,
an IOPS, a response time with respect to a request, and so forth
were given, but other performance indicators, such as an I/O busy
rate, a transfer rate, a throughput, a buffer hit rate and a number
of records inserted/updated/deleted in database management
software, a Web server response time, a free capacity or
utilization rate for a file system or disk, an amount of I/O data,
a utilization time, a number of network interface errors, a buffer
overflow, and frame errors may also be used.
[0111] As the alert execution threshold stored in the alert
execution threshold 11826, rather than a threshold specified by the
user, for example, a value capable of triggering an alert
notification, such as a value that differs from a baseline value
obtained using an average value of the performance information
historical information, may be employed.
[0112] FIG. 7 is a constitutional diagram of an example of the
event management table related to Example 1.
[0113] The event analysis processing program 11400 compares a
performance value registered in the performance information table
11820 to an alert execution threshold registered in the performance
information table 11820, and in a case where the performance value
exceeds the alert execution threshold, prepares an entry showing
the corresponding event, and registers the prepared entry in the
event management table 11910. The event analysis processing program
11400 receives an event message showing the occurrence of an event
from various apparatuses in the system, and may register an entry
showing the event corresponding to the received event message in
the event management table 11910. The event management table 11910
is referred to as needed in a rule expansion process (refer to FIG.
15).
[0114] The event management table 11910 comprises the fields of an
event ID 11911, an apparatus ID 11912, an apparatus component ID
11913, a metrics 11914, an equipment OS 11915, a status 11916, an
analyzed flag 11917, and an occurrence date/time 11918. An event
ID, which is the identifier of the event itself, is stored in the
event ID 11911. An apparatus ID, which is the identifier of the
apparatus in which the event occurred, is stored in the apparatus
ID 11912. An identifier of the device in which the event occurred
is stored in the apparatus component ID 11913. The name of a
metrics for which a threshold abnormality was detected is stored in
the metrics 11914. Data showing the type of OS in the apparatus in
which the threshold abnormality was detected is stored in the
equipment OS 11915. Data showing a state at event occurrence of a
device in which the event occurred is stored in the status 11916.
Data showing whether or not the event has been analyzed by the rule
expansion program 11500 is stored in the analyzed flag 11917. Data
showing the date and time at which the event occurred is stored in
the occurrence date/time 11918.
[0115] For example, the first entry from the top of FIG. 7 shows
that the management server 10000 detected a threshold abnormality
of the processor operating rate in the data I/F "P2" of the storage
apparatus A at 15:00:00 hours on 30 Jun. 2012, that the event ID
thereof is "EV1", and that this event has not been analyzed by the
rule expansion program 11500.
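The registration of entries in the event management table 11910 can
be sketched as below, assuming that each row of the performance
information table is available as a dictionary. The class `Event`,
the function `register_events`, the dictionary keys, the "EVn"
numbering scheme, and the sample rows are hypothetical; only the
field meanings (11911 through 11918) are taken from the description.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    """One row of the event management table 11910 (sketch)."""
    event_id: str          # field 11911
    apparatus_id: str      # field 11912
    component_id: str      # field 11913
    metrics: str           # field 11914
    equipment_os: str      # field 11915
    status: str            # field 11916
    analyzed: bool         # field 11917
    occurred_at: datetime  # field 11918


def register_events(performance_rows, event_table, now):
    """Append one event per row whose value violates its threshold."""
    for row in performance_rows:
        if row["threshold_type"] == "upper":
            abnormal = row["value"] > row["threshold"]
        else:  # treated as a lower limit in this sketch
            abnormal = row["value"] < row["threshold"]
        if abnormal:
            event_table.append(Event(
                event_id=f"EV{len(event_table) + 1}",
                apparatus_id=row["apparatus_id"],
                component_id=row["device_id"],
                metrics=row["metrics"],
                equipment_os=row["equipment_os"],
                status="Threshold abnormality",
                analyzed=False,  # not yet handled by the rule expansion program
                occurred_at=now,
            ))
    return event_table


# Hypothetical sample rows; only the first one exceeds its threshold.
rows = [
    {"apparatus_id": "StorageA", "device_id": "LV1", "metrics": "IOPS",
     "equipment_os": "n/a", "value": 1200.0, "threshold": 1000.0,
     "threshold_type": "upper"},
    {"apparatus_id": "StorageA", "device_id": "LV2", "metrics": "IOPS",
     "equipment_os": "n/a", "value": 300.0, "threshold": 1000.0,
     "threshold_type": "upper"},
]
events = register_events(rows, [], datetime(2012, 6, 30, 15, 0, 0))
```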
[0116] FIG. 8 is a constitutional diagram of an example of the
general rule table related to Example 1.
[0117] One or more general rule tables 11920 are stored in the
rule-plan information repository 11900. In this example, one
general rule is defined by one general rule table 11920. The
general rule (the same holds true for an expanded rule, which will
be explained further below) is data showing the relationship
between a combination of one or more condition events capable of
occurring in a node apparatus comprising the computer system, and a
conclusion event, which is regarded as the failure cause with
respect to the one or more condition events. Generally speaking, an
event propagation model for identifying the cause in a failure
analysis uses an "IF-THEN" format to describe a combination of
events, which are expected to occur as the result of a certain
failure, and the cause thereof. The general rule is not limited to
that given in FIG. 8, and a larger number of rules are allowed.
[0118] The general rule table 11920 comprises a condition part
11921, a conclusion part 11922, a general rule ID 11923, and an
applied topology 11924.
[0119] Data showing an observed event, that is, each of one or more
condition events, which is equivalent to the IF part of the general
rule described using the "IF-THEN" format, is stored in the
condition part 11921. The condition part 11921 comprises the fields
of an event ID 11925, an apparatus type 11926, an apparatus
component type 11927, a metrics 11928, and a status 11929. Data
showing a causal event, that is, a conclusion event, which is
equivalent to the THEN part of the general rule described using the
"IF-THEN" format, is stored in the conclusion part 11922. The
conclusion part 11922 comprises the fields of the apparatus type
11926, the apparatus component type 11927, the metrics 11928, and
the status 11929. A general rule ID, which is the identifier of the
general rule, is stored in the general rule ID 11923. Data showing
a topology which is referenced when the management server 10000
expands the general rule to an actual system and creates an
expanded rule, is stored in the applied topology 11924. An
identifier (event ID) for uniquely identifying an event (condition
event) included in an observed event equivalent to the IF part of
the general rule described using the "IF-THEN" format, is stored in
the event ID 11925. Data showing the type of apparatus in which
either the condition event or the conclusion event occurs is stored
in the apparatus type 11926. Data showing the type of device in
which either the condition event or the conclusion event occurs is
stored in the apparatus component type 11927. Information showing a
type of performance information, such as a CPU utilization rate, a
storage apparatus IOPS, and a response time with respect to a
request, is stored in the metrics 11928. Here, the same performance
information stored in the metrics 11823 of the performance
information table 11820, or other performance information may be
used as the performance information stored in the metrics 11928.
Data showing the state at event occurrence of the device in the
apparatus is stored in the status 11929.
[0120] In a case where one or more condition events described in
the condition part 11921 have been detected, it is determined that
the conclusion event described in the conclusion part 11922 is the
cause of the failure. In a case where the status of the conclusion
part 11922 transitions to normal, that is, a case in which the
performance value related to the conclusion event returns to
normal, the problem in the condition part 11921 is also resolved,
that is, the performance value related to each condition event can
also be expected to return to a normal value. In the example of
FIG. 8, three events are described in the condition part 11921, but
the number of events is not limited.
[0121] For example, the general rule given as an example in FIG. 8,
that is, the general rule for which the general rule ID is shown as
"RULE 1" shows that when a threshold abnormality in the response
time of a drive on a host computer, a threshold abnormality in the
I/Os per unit of time in the logical volume 22100 of the storage
apparatus 20000, and a threshold abnormality in the I/Os per unit
of time in the disk pool 22200 of the storage apparatus 20000 have
been detected as observed events, the threshold abnormality in the
I/Os per unit of time in the disk pool 22200 of the storage
apparatus 20000 is concluded to be the cause. The fact that a
certain condition is normal may also be defined as an event
included in the observed events. Topology information is acquired
from the configuration information table 11810 when an expanded
rule is created based on this general rule.
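The "IF-THEN" structure of a general rule can be sketched as a
record holding a condition part and a conclusion part. This sketch
makes the simplifying assumption that the conclusion is drawn only
when every condition event has been observed; the type names, the
condition event IDs "1" to "3", the type strings, and the applied
topology value are hypothetical placeholders rather than the contents
of FIG. 8.

```python
from typing import NamedTuple, Optional


class EventPattern(NamedTuple):
    apparatus_type: str   # field 11926
    component_type: str   # field 11927
    metrics: str          # field 11928
    status: str           # field 11929


class GeneralRule(NamedTuple):
    rule_id: str              # field 11923
    conditions: dict          # event ID (11925) -> EventPattern (IF part)
    conclusion: EventPattern  # THEN part
    applied_topology: str     # field 11924


ABNORMAL = "Threshold abnormality"
POOL_IOPS = EventPattern("Storage", "DiskPool", "IOPS", ABNORMAL)

# Rule 1 as described in the text; literal strings are illustrative.
RULE_1 = GeneralRule(
    rule_id="Rule 1",
    conditions={
        "1": EventPattern("Server", "Drive", "Response time", ABNORMAL),
        "2": EventPattern("Storage", "LogicalVolume", "IOPS", ABNORMAL),
        "3": POOL_IOPS,
    },
    conclusion=POOL_IOPS,
    applied_topology="(placeholder)",
)


def conclude(rule: GeneralRule, observed: set) -> Optional[EventPattern]:
    """Return the conclusion event if all condition events were observed."""
    if all(pattern in observed for pattern in rule.conditions.values()):
        return rule.conclusion
    return None
```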
[0122] FIG. 9 is a constitutional diagram of an example of the
general plan table related to Example 1.
[0123] The general plan table 11930 shows a list of available plans
in the computer system. The general plan table 11930 comprises the
fields of a general plan ID 11931 and a plan 11932. A general plan
ID, which is the identifier of the general plan, is stored in the
general plan ID 11931. Information showing the plans, which are
available in the computer system, is stored in the plan 11932. As
plans, for example, there are reboot host, change switch setting,
storage apparatus 20000 volume migration, and VM migration, among
others. The plans are not limited to those given in FIG. 9.
[0124] FIG. 10 is a constitutional diagram of an example of the
rule-plan association table related to Example 1.
[0125] The rule-plan association table 11940 shows the
corresponding relationship between a general rule, a list of plans
available in a case where this general rule was applied and the
cause of a failure was identified, and an event (hereinafter
referred to as "unresolved event"), which remains in an unresolved
state in a case where the plans have been executed. The rule-plan
association table 11940 comprises a general rule ID 11941, a
general plan ID 11942, and an unresolved event ID 11943. A general
rule ID, which is the identifier of a general rule, is stored in
the general rule ID 11941. The general rule ID stored in the
general rule ID 11941 corresponds to the general rule ID stored in
the general rule ID 11923 of the general rule table 11920. A
general plan ID, which is the identifier of a general plan, is
stored in the general plan ID 11942. The general plan ID stored in
the general plan ID 11942 corresponds to the general plan ID stored
in the general plan ID 11931 of the general plan table 11930. An
unresolved event ID, which is the identifier of an event
(unresolved event), which remains in an unresolved state in a case
where the plans have been executed, is stored in the unresolved
event ID 11943. The unresolved event ID corresponds to the
identifier (event ID) of a condition event stored in the event ID
11925 of the general rule table 11920. In the unresolved event ID
11943, for example, in a case where an unresolved event does not
exist, "NONE" is stored, and in a case where all of the condition
events remain as unresolved events, "ALL" is stored.
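A lookup against the rule-plan association table 11940, including
the handling of "NONE" and "ALL", might look like the following
sketch. The sample rows, the dictionary keys, and the function names
are hypothetical; in particular, the unresolved event IDs shown here
are invented for illustration and are not the values of FIG. 10.

```python
# Hypothetical rows of the rule-plan association table 11940.
ASSOCIATIONS = [
    {"general_rule_id": "Rule 1", "general_plan_id": "Plan 1",
     "unresolved_event_id": "NONE"},
    {"general_rule_id": "Rule 1", "general_plan_id": "Plan 5",
     "unresolved_event_id": "3"},
]


def unresolved_events(unresolved_field, condition_event_ids):
    """Translate the unresolved event ID field into a list of event IDs."""
    if unresolved_field == "NONE":
        return []                         # nothing remains unresolved
    if unresolved_field == "ALL":
        return list(condition_event_ids)  # every condition event remains
    return [unresolved_field]             # a single condition event ID


def plans_for_rule(rule_id, associations, condition_event_ids):
    """Map each plan available for a rule to its unresolved event IDs."""
    return {
        row["general_plan_id"]: unresolved_events(
            row["unresolved_event_id"], condition_event_ids)
        for row in associations
        if row["general_rule_id"] == rule_id
    }


print(plans_for_rule("Rule 1", ASSOCIATIONS, ["1", "2", "3"]))
```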
[0126] FIG. 11 is a constitutional diagram of an example of the
expanded rule table related to Example 1.
[0127] One or more expanded rule tables 11950 are stored in the
rule-plan information repository 11900. In this example, one
expanded rule is defined by one expanded rule table 11950. The
expanded rule is data, which expands the general rule in a format
that depends on the actual configuration of the computer system.
The expanded rule shown in FIG. 11 is created in accordance with
replacing the values of the apparatus type 11926 and the apparatus
component type 11927 in the general rule shown in FIG. 8 with the
identifier (apparatus ID) of the identified apparatus and the
identifier (device ID) of the identified device defined in the
configuration information table 11810.
[0128] The expanded rule table 11950 comprises the fields of a
condition part 11951, a conclusion part 11952, an expanded rule ID
11953, and a pre-expansion general rule ID 11954.
[0129] Data showing an observed event, that is, each of one or more
condition events, which is equivalent to the IF part of the
expanded rule described using the "IF-THEN" format, is stored in a
condition part 11951. The condition part 11951 comprises the fields
of an event ID 11955, an apparatus ID 11956, an apparatus component
ID 11957, a metrics 11958, and a status 11959. Data showing a
causal event, that is, a conclusion event, which is equivalent to
the THEN part of the expanded rule described using the "IF-THEN"
format, is stored in a conclusion part 11952. The conclusion part
11952 comprises the fields of the apparatus ID 11956, the apparatus
component ID 11957, the metrics 11958, and the status 11959. An
expanded rule ID, which is the identifier of the expanded rule, is
stored in the expanded rule ID 11953. The general rule ID of the
general rule, which constitutes the basis of the expanded rule, is
stored in the pre-expansion general rule ID 11954. An identifier
for uniquely identifying a condition event included in an observed
event equivalent to the IF part of the expanded rule described
using the "IF-THEN" format, is stored in the event ID 11955. An ID
(apparatus ID) of an apparatus in which either the condition event
or the conclusion event occurs is stored in the apparatus ID 11956.
An ID (device ID) of the device in which either the condition event
or the conclusion event occurs is stored in the apparatus component
ID 11957. Information showing a type of performance information,
such as a CPU utilization rate, a storage apparatus IOPS, and a
response time with respect to a request, is stored in the metrics
11958. The same performance information stored in the metrics 11823
of the performance information table 11820, or other performance
information may be used as the performance information stored in
the metrics 11958. Data showing the state at event occurrence of
the device in the apparatus is stored in the status 11959.
[0130] The expanded rule is created by taking into account the
actual configuration of the computer system (for example, the
coupling relationship shown in the configuration information table
11810) and embodying the apparatus type and the device type related
to the condition event and the conclusion event as the identified
apparatus and the identified device in the actual configuration of
the computer system.
[0131] For example, the expanded rule, for which the expanded rule
ID is shown as "ExRule 1-1", given as an example in FIG. 11 is
created in accordance with replacing the values of the apparatus
type 11926 and the apparatus component type 11927 in the general
rule "Rule 1" shown in FIG. 8 with the identifiers of the
identified apparatuses (physical server A and storage apparatus A)
and the identifiers of the identified devices (drive "/var",
logical volume "LV1", and disk pool "POOL1") defined in the
configuration information table 11810.
[0132] As is clear from the expanded rule "ExRule 1-1" shown in the
expanded rule table 11950 of FIG. 11, the expanded rule "ExRule
1-1" is expanded on the basis of the general rule "Rule 1", and
shows that when a threshold abnormality in the response time of the
logical volume "/var" of the physical server A, a threshold
abnormality in the response time of the logical volume "LV2" of the
storage apparatus A, and a threshold abnormality in the response
time of the disk pool "POOL1" of the storage apparatus A have been
detected as observed events, the threshold abnormality in the
response time of the disk pool "POOL1" of the storage apparatus A
is concluded to be the cause.
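The expansion of a general rule into expanded rules can be sketched
as a substitution of apparatus types and component types with the
identifiers found in each configuration entry. This is a simplified
sketch under stated assumptions: the mapping `TYPE_TO_COLUMNS`, the
dictionary layout, the type strings, the sample configuration entry,
and the "Ex<rule ID>-<n>" numbering are hypothetical, and the actual
rule expansion process may differ.

```python
# Hypothetical mapping from (apparatus type, component type) in a general
# rule to the configuration-table columns holding the concrete IDs.
TYPE_TO_COLUMNS = {
    ("Server", "Drive"): ("physical_server", "drive"),
    ("Storage", "LogicalVolume"): ("storage", "logical_volume"),
    ("Storage", "DiskPool"): ("storage", "disk_pool"),
}

ABNORMAL = "Threshold abnormality"
RULE_1 = {
    "rule_id": "Rule 1",
    "conditions": [
        {"apparatus_type": "Server", "component_type": "Drive",
         "metrics": "Response time", "status": ABNORMAL},
        {"apparatus_type": "Storage", "component_type": "LogicalVolume",
         "metrics": "IOPS", "status": ABNORMAL},
        {"apparatus_type": "Storage", "component_type": "DiskPool",
         "metrics": "IOPS", "status": ABNORMAL},
    ],
    "conclusion": {"apparatus_type": "Storage", "component_type": "DiskPool",
                   "metrics": "IOPS", "status": ABNORMAL},
}


def embody(pattern, entry):
    """Replace the types in one event pattern with IDs from a config entry."""
    apparatus_col, device_col = TYPE_TO_COLUMNS[
        (pattern["apparatus_type"], pattern["component_type"])]
    return {"apparatus_id": entry[apparatus_col],
            "component_id": entry[device_col],
            "metrics": pattern["metrics"],
            "status": pattern["status"]}


def expand_rule(general_rule, config_entries):
    """Create one expanded rule per distinct I/O route in the configuration."""
    expanded, seen = [], set()
    for entry in config_entries:
        conditions = [embody(p, entry) for p in general_rule["conditions"]]
        key = tuple((c["apparatus_id"], c["component_id"]) for c in conditions)
        if key in seen:
            continue  # the same route was already expanded
        seen.add(key)
        expanded.append({
            "expanded_rule_id": f"Ex{general_rule['rule_id']}-{len(expanded) + 1}",
            "pre_expansion_general_rule_id": general_rule["rule_id"],
            "conditions": conditions,
            "conclusion": embody(general_rule["conclusion"], entry),
        })
    return expanded


# Hypothetical configuration entry used only for this sketch.
CONFIG = [{"physical_server": "ServerA", "drive": "/var",
           "storage": "StorageA", "logical_volume": "LV1",
           "disk_pool": "POOL1"}]
rules = expand_rule(RULE_1, CONFIG)
```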
[0133] FIG. 12A is a first constitutional diagram of the expanded
plan table related to Example 1. FIG. 12B is a second
constitutional diagram of the expanded plan table related to
Example 1.
[0134] One or more expanded plan tables 11960 are stored in the
rule-plan information repository 11900. In this example, one or
more expanded plans are defined by one expanded plan table 11960.
The expanded plan is information obtained by expanding a general plan
into a format that depends on the actual configuration of the computer
system. The expanded plan table 11960 is created by the plan
creation program 11100 based on the general plan table 11930, the
expanded rule table 11950, the configuration information table
11810, and the performance information table 11820.
[0135] The expanded plan table 11960 comprises the fields of a plan
details 11961, a general plan ID 11962, and an expanded rule ID
1196A. The general plan ID of the general plan, which constitutes
the basis of the expanded plan, is stored in the general plan ID
11962. The expanded rule ID of the expanded rule corresponding to
the expanded plan is stored in the expanded rule ID 1196A as
information for identifying the failure cause for which the
expanded plan is implemented.
[0136] Specific processing contents for each of one or more
expanded plans, which have been expanded, and information on the
post-execution state of the expanded plan are stored in the plan
details 11961. The plan details 11961 comprises the fields of an
expanded plan ID 11963, a plan target 11964, and a risk site 11969.
An expanded plan ID, which is the identifier of the expanded plan,
is stored in the expanded plan ID 11963. Data showing a problem
site (hereinafter referred to as "risk site"), which latently
remains as-is even after the plan has been executed, is stored in
the risk site 11969.
[0137] In the plan target 11964, for example, there is stored
information showing the configuration element (device) related to
the plan, and post-plan execution information. The post-plan
execution information here comprises a predicted value for a
post-plan execution performance value related to the device, which
is related to the plan. The predicted value of the post-plan
execution performance value, for example, is calculated in
accordance with the plan creation program 11100 referencing the
performance information table 11820 to simulate the situation
subsequent to plan execution. The fields included in the plan
target 11964 will differ in accordance with the plan contents.
[0138] The expanded plan defined by the expanded plan table 11960
of FIG. 12A is an expanded plan based on the general plan "Plan 1",
that is, an expanded plan related to a volume migration. In the
case of an expanded plan related to a volume migration, the plan
target 11964, for example, comprises the fields of a
migration-target volume 11965, a migration-source pool 11966, and a
migration-destination pool 11967. The migration-target volume 11965
comprises a volume ID 11965A in which is stored an identifier of
the logical volume 22100 (hereinafter referred to as
"migration-target volume"), which constitutes the volume migration
target, and an I/O Response Time prediction 11965B in which is
stored a predicted value for the I/O response time with respect to
the migration-target volume subsequent to volume migration
execution. The migration-source pool 11966 comprises a pool ID
11966A in which is stored an identifier of the disk pool 22200 to
which the migration-target volume belongs (that is, the
migration-source disk pool 22200, hereinafter referred to as
"migration-source pool"), and an I/O Response Time prediction
11966B in which is stored a predicted value for the I/O response
time with respect to the migration-source pool subsequent to volume
migration execution. The migration-destination pool 11967 comprises
a pool ID 11967A in which is stored an identifier of the disk pool
22200, which constitutes the migration destination of the
migration-target volume (hereinafter referred to as
"migration-destination pool"), and an I/O Response Time prediction
11967B in which is stored a predicted value for the I/O response
time with respect to the migration-destination pool subsequent to
volume migration execution.
[0139] The plan creation program 11100 acquires information from the
configuration information table 11810, and stores the information
in the volume ID 11965A, the pool ID 11966A, and the pool ID
11967A. Further, an arbitrary method may be used as the method for
calculating the values stored in the respective I/O Response Time
predictions 11965B, 11966B, and 11967B. For example, the values of
the respective I/O Response Time predictions 11965B, 11966B, and
11967B may be values (I/O response times) obtained, for example, by
the plan creation program 11100 acquiring the I/Os per unit of time
of the migration-target volume, the migration-source pool, and the
migration-destination pool from the performance information table
11820, subtracting the value of the I/Os per unit of time of the
migration-target volume from the I/Os per unit of time of the
migration-source pool, adding the value of the I/Os per unit of
time of the migration-target volume to the I/Os per unit of time of
the migration-destination pool, predicting the I/Os of the
migration-source pool and the migration-destination pool subsequent
to volume migration execution, and adopting a value (I/O response
time) obtained by taking the inverse thereof.
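The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the plan creation program 11100 itself: the function and parameter names are hypothetical, the I/Os per unit of time are assumed to have already been read from the performance information table 11820, and a pool predicted to carry no I/O is given no prediction because the inverse is undefined. The text does not specify how the prediction for the migration-target volume itself (11965B) is derived, so it is omitted here.

```python
from typing import Dict, Optional


def inverse_of_iops(iops: float) -> Optional[float]:
    """Return 1/iops as the predicted I/O response time, or None when there is no load."""
    return 1.0 / iops if iops > 0 else None


def predict_pool_response_times(volume_iops: float,
                                source_pool_iops: float,
                                destination_pool_iops: float) -> Dict[str, Optional[float]]:
    """Predict pool response times after a volume migration (hypothetical helper)."""
    # The migration-source pool no longer serves the migrated volume's I/Os.
    source_after = source_pool_iops - volume_iops
    # The migration-destination pool takes over the migrated volume's I/Os.
    destination_after = destination_pool_iops + volume_iops
    return {
        "migration_source_pool": inverse_of_iops(source_after),
        "migration_destination_pool": inverse_of_iops(destination_after),
    }
```

Because the prediction is a plain inverse of the predicted I/Os per unit of time, the result is only an index for comparing plans and is not calibrated to measured response times.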
[0140] The expanded plan defined in accordance with the expanded
plan table 11960 of FIG. 12B is an expanded plan based on the
general plan "Plan 5", that is, the expanded plan related to the
addition of a disk to the pool. In the case of an expanded plan
related to the addition of a disk to the pool, the plan target
11964, for example, comprises the fields of a pool ID 11968A in
which is stored the identifier of the disk pool 22200, which
constitutes the destination of the disk (hereinafter referred to as
"add-target pool"), an additional disk 11968B in which is stored
the identifier of the disk to be added, and an I/O Response Time
prediction 11968C in which is stored a predicted value for the I/O
response time with respect to the add-target pool.
[0141] The plan creation program 11100 acquires information from
the configuration information table 11810 and stores the
information in the pool ID 11968A and the additional disk 11968B.
Further, an arbitrary method may be used as the method for
calculating the value (a predicted value of I/O response time with
respect to add-target pool) of the I/O Response Time prediction
11968C. For example, the plan creation program 11100 may calculate
the I/O response time of the add-target pool subsequent to disk
addition by acquiring the value of the I/Os per unit of time of the
add-target pool, the value of the capacity of the add-target pool,
and the value of the capacity of the disk to be added from the
performance information table 11820, and assuming that the I/Os per
unit of time are distributed in proportion to the capacity values,
multiplying a value obtained by dividing the capacity of the
add-target pool prior to disk addition by the capacity of the
add-target pool subsequent to disk addition by the I/O response
time of the add-target pool prior to disk addition. In the example
of FIG. 12B, there is described an example in which performance
information is stored as the contents of the plan details 11961,
but cost information related to the plan, or information on system
downtime resulting from a failure when the plan is executed may be
stored as the contents of the plan details 11961.
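The disk-addition prediction described above reduces to a single proportion, sketched below. The function and parameter names are hypothetical, and the capacities and the current response time are assumed to have already been read from the performance information table 11820.

```python
def predict_response_time_after_disk_addition(pool_capacity: float,
                                              added_disk_capacity: float,
                                              response_time_before: float) -> float:
    """Scale the current response time by (capacity before) / (capacity after).

    Assumes, as in the text, that I/Os are distributed in proportion to capacity.
    """
    if pool_capacity <= 0 or added_disk_capacity < 0:
        raise ValueError("capacities must be positive")
    capacity_after = pool_capacity + added_disk_capacity
    return (pool_capacity / capacity_after) * response_time_before
```

For instance, adding a disk of capacity 100 to a pool of capacity 300 with a current response time of 8.0 yields a prediction of 6.0 in the same unit.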
[0142] In FIG. 12, there is described an example of an expanded
plan related to a volume migration, and an expanded plan related to
the addition of a disk to a pool, but an expanded plan
corresponding to each general plan of the other general plans
included in the general plan table 11930 is created in the same
way. In a case where another general plan is to be expanded into
an expanded plan, the plan creation program 11100, for example,
computes a predicted value for a post-plan execution performance
value related to a device, which is related to the plan, by
referencing the configuration information table 11810, listing
post-processing execution configuration information candidates,
referencing the performance information table 11820, and simulating
post-plan execution status information, such as performance
information, capacity information, cost information, and downtime
information.
[0143] FIG. 13 is a constitutional diagram of an example of the
analysis result management table related to Example 1.
[0144] The analysis result management table 11970 comprises the
fields of a cause apparatus ID 11971, a cause component ID 11972, a
metrics 11973, a certainty factor 11974, an expanded rule ID 11975,
and a received event ID 11976. The identifier (apparatus ID) of an
apparatus related to an event, which has been judged as the cause
of a failure in a failure cause analysis process, is stored in the
cause apparatus ID 11971. The identifier (device ID) of a device
related to the event judged to be the cause of the failure is
stored in the cause component ID 11972. Information showing a type
of performance information, such as a CPU utilization rate and a
response time with respect to a request, for which a threshold
abnormality has been detected, is stored in the metrics 11973. In
other words, in the metrics 11973 data showing the type of
performance information related to an event judged to be the cause
of a failure is stored. A value (certainty factor) showing the
probability of the event judged to be the cause of the failure
being the root cause is stored in the certainty factor 11974. In
this example, an incidence of the condition event, for example,
serves as the certainty factor. The expanded rule ID of the
expanded rule, which constitutes the basis upon which the event was
judged to be the cause of the failure, is stored in the expanded
rule ID 11975. The event ID of the event, which actually occurred
from among the condition events, is stored in the received event ID
11976.
[0145] For example, the first entry from the top of FIG. 13 shows
that, based on the expanded rule "ExRule 1-1", the management
server 10000 has judged that the response time threshold
abnormality in the disk pool "POOL1" of the storage apparatus A is
the failure cause, that the occurrence of events, which are shown
with the event IDs "EV 2", "EV 3", and "EV 5", is the basis of the
judgment, and that the certainty factor, that is, the incidence of
the condition event is 3/3. This entry, for example, shows a case
in which, in a case where it is assumed that the performance of the
disk comprising the disk pool "POOL1" has slowed, the performance
of the disk pool "POOL1" has deteriorated, and the performance of
the logical volume "LV2" has slowed, an event has not occurred
based on the logical volume "LV1" or the drive "/opt" since there
is no sending/receiving of I/Os to/from the drive "/opt" of the
physical server A or the logical volume "LV1" of the storage
apparatus A.
[0146] Next, the respective processes executed by the management
server 10000 will be explained. First, a configuration information
acquisition process executed by the management server 10000 will be
explained. The configuration information acquisition process is
executed in accordance with the management server 10000 processor
15000 executing the configuration setting management program 11600
deployed on the memory 11000. First, the configuration setting
management program 11600 communicates with a program, which
collects information on the respective apparatuses coupled to the
SAN 40000 (in this example, the configuration/performance
information collection program 21200 of the storage apparatus 20000
and the configuration/performance information collection program
31100 of the physical server 30000), and acquires the configuration
information held by the apparatuses.
[0147] Here, the management server 10000, for example, may use an
SCSI (Small Computer System Interface) Inquiry command to acquire
the coupling relationship between the physical server 30000, the
drive on the physical server 30000, the server data I/F, the
storage apparatus 20000, the storage data I/F, the logical volume
22100 accessed by the physical server 30000, and the disk pool
22200 to which the logical volume 22100 belongs. Also, the physical
server 30000 may acquire configuration information from the storage
apparatus 20000 in accordance with the physical server 30000
issuing an SCSI Inquiry command to the storage apparatus 20000. The
configuration information collection program 31100 of the physical
server 30000 may acquire configuration information related to the
physical server 30000 using any method, such as, for example, a
request to the OS. The configuration information collection program
21200 on the storage apparatus 20000 may acquire configuration
information related to the storage apparatus 20000 using any
method, such as, for example, acquiring the information from the
memory 21000. Next, the configuration setting management program
11600 stores the acquired configuration information in the
configuration information table 11810, and ends the configuration
information acquisition process.
[0148] Next, a performance information collection process executed
by the management server 10000 will be explained. The performance
information collection process is executed in accordance with the
management server 10000 processor 15000 executing the performance
information collection program 11700 deployed on the memory 11000.
First, the performance information collection program 11700
communicates with a program, which collects information on the
respective apparatuses coupled to the SAN 40000 (in this example,
the configuration/performance information collection program 21200
of the storage apparatus 20000 and the configuration/performance
information collection program 31100 of the physical server 30000),
and acquires the performance information held by the
apparatuses.
[0149] Here, the management server 10000, for example, may use the
SCSI Inquiry command to acquire the respective data registered in
the performance information table 11820, that is, the data of the
apparatus ID 11821, the device ID 11822, the metrics 11823, the
equipment OS 11824, the performance value 11825, the alert
execution threshold 11826, the threshold type 11827, and the Status
11828. Also, the physical server 30000 may acquire performance
information from the storage apparatus 20000 in accordance with the
physical server 30000 issuing an SCSI Inquiry command to the
storage apparatus 20000. The configuration/performance information
collection program 31100 of the physical server 30000 may acquire performance
information related to the physical server 30000 using any method,
such as, for example, a request to the OS. The
configuration/performance information collection program 21200 on
the storage apparatus 20000 may acquire performance information
related to the storage apparatus 20000 using any method, such as,
for example, acquiring the information from the memory 21000. Next,
the performance information collection program 11700 stores the
acquired performance information in the performance information
table 11820, and ends the performance information collection
process.
[0150] FIG. 14 is a flowchart of an event analysis process related
to Example 1.
[0151] The event analysis process is executed in accordance with
the management server 10000 processor 15000 executing the event
analysis processing program 11400 deployed on the memory 11000.
[0152] First, the event analysis processing program 11400
references the performance information table 11820, and acquires
the performance values and alert execution threshold for each
device (Step 1001). Next, the event analysis processing program
11400 carries out the processing of the following Steps 1002
through 1004 for the acquired performance values.
[0153] The event analysis processing program 11400 checks whether
or not the processing-target performance value exceeds the alert
execution threshold corresponding to the processing-target
performance value, and, based on the result, updates the value of
the Status 11828 of the performance information table 11820 (Step
1002). Specifically, in a case where the processing-target
performance value exceeds the alert execution threshold, the event
analysis processing program 11400 stores "threshold abnormality" in
the Status 11828, and in a case where the processing-target
performance value does not exceed the alert execution threshold,
stores "normal" in the Status 11828.
[0154] Next, the event analysis processing program 11400 determines
whether or not the value of the Status 11828 was changed by the
update of Step 1002 (Step 1003). In a case where a change
occurred (Step 1003: Yes), the event analysis processing program
11400 registers an entry related to the event showing that the
processing-target performance value exceeds the alert execution
threshold in the event management table 11910 (Step 1004), and
advances to the next step. At this time, the date and time at which
the performance information collection program 11700 collected the
processing-target performance value is stored in the entry
occurrence date/time 11918. Alternatively, in a case where the
value of the Status 11828 has not changed (Step 1003: No), the
event analysis processing program 11400 advances to the next step
without executing Step 1004.
[0155] After the processing (Steps 1002 through 1004) has been
completed for all the performance values, the event analysis
processing program 11400 determines whether or not there exists an
entry related to an event, which has been newly registered in the
event management table 11910 (Step 1005), and in a case where an
entry related to a newly registered event exists (Step 1005: Yes),
instructs the rule expansion program 11500 to execute a rule
expansion process (refer to FIG. 15) (Step 1006), and ends the
event analysis process. Alternatively, in a case where an entry
related to a newly registered event does not exist (Step 1005: No),
the event analysis processing program 11400 ends the event analysis
process.
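Steps 1001 through 1005 can be sketched as a single loop. This is an illustrative sketch only: the class, field, and function names are hypothetical, "exceeds" is treated as a simple greater-than comparison (the threshold type 11827 is ignored), and the event management table 11910 is modeled as a plain list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PerformanceEntry:
    """One row of a simplified performance information table."""
    device_id: str
    metric: str
    value: float
    alert_threshold: float
    status: str = "normal"


def run_event_analysis(entries: List[PerformanceEntry], event_table: List[Dict]) -> bool:
    """Update each Status and register an event for every status change.

    Returns True when at least one event was newly registered (Step 1005: Yes),
    which is the condition for starting the rule expansion process.
    """
    registered = False
    for entry in entries:
        # Step 1002: compare the performance value with its alert threshold.
        exceeded = entry.value > entry.alert_threshold
        new_status = "threshold abnormality" if exceeded else "normal"
        # Step 1003: did the Status change as a result of the update?
        changed = new_status != entry.status
        entry.status = new_status
        # Step 1004: register an event entry only when the Status changed.
        if changed:
            event_table.append({"device_id": entry.device_id,
                                "metric": entry.metric,
                                "status": new_status,
                                "analyzed": False})
            registered = True
    return registered
```

Running the function twice on unchanged performance values registers events only on the first pass, since the second pass finds no Status change.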
[0156] Here, in this example, the event analysis processing program
11400 registers an entry related to an event in the event
management table 11910 based on performance information acquired by
the performance information collection program 11700, but may update
the information in the event management table 11910 using
information notified from an apparatus, such as an SNMP (Simple
Network Management Protocol) trap.
[0157] FIG. 15 is a flowchart of the rule expansion process related
to Example 1.
[0158] The rule expansion process is executed in accordance with
the management server 10000 processor 15000 executing the rule
expansion program 11500 deployed on the memory 11000.
[0159] First, the rule expansion program 11500 acquires an entry
(event entry) related to a newly registered event from the event
management table 11910, and updates the analyzed flag 11917 of the
acquired event entry to "Yes" (Step 2001). Next, the rule expansion
program 11500 acquires one or more general rules shown by one or
more general rule tables 11920, which are being stored in the
rule-plan information repository 11900 (Step 2002). The rule
expansion program 11500 performs the processing of Steps 2004
through 2010 below for each acquired general rule (Step 2003). The
rule expansion program 11500 also performs the processing of Steps
2005 through 2010 below for each event entry acquired in Step
2001.
[0160] The rule expansion program 11500 determines whether or not
the event (processing-target event) shown in the processing-target
event entry matches any of the condition events in the
processing-target general rule (Step 2005). For example, the rule
expansion program 11500 determines that the processing-target event
and a condition event match in a case where the apparatus and
device related to the processing-target event correspond to the
types of the apparatus and device related to the condition event,
and, in addition, the type of the processing-target event matches
the type of the condition event.
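The matching test of Step 2005 can be sketched as a comparison of three type fields. The class and field names below are hypothetical, and representing the "type" of an event as a single string is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConditionEvent:
    """A condition event of a general rule, expressed in terms of types only."""
    event_id: str
    apparatus_type: str
    device_type: str
    event_type: str


@dataclass(frozen=True)
class OccurredEvent:
    """An event that actually occurred on a concrete apparatus and device."""
    apparatus_id: str
    apparatus_type: str
    device_id: str
    device_type: str
    event_type: str


def matching_condition_events(event: OccurredEvent,
                              conditions: List[ConditionEvent]) -> List[ConditionEvent]:
    """Return the condition events whose apparatus, device, and event types all match."""
    return [c for c in conditions
            if c.apparatus_type == event.apparatus_type
            and c.device_type == event.device_type
            and c.event_type == event.event_type]
```

An empty result corresponds to "Step 2005: No", and a non-empty result to "Step 2005: Yes".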
[0161] In a case where it has been determined that there is no
match in Step 2005 (Step 2005: No), the rule expansion program
11500 executes the processing (Steps 2004 through 2010) by making
the next event entry the processing target. Alternatively, in a
case where it has been determined that there is a match in Step
2005 (Step 2005: Yes), the rule expansion program 11500 acquires
from the configuration information table 11810 the identifiers of
the apparatus and device having the association shown in the
processing-target general rule with the apparatus and device
related to the processing-target event (Step 2006). Next, the rule
expansion program 11500 prepares an expanded rule based on the
processing-target general rule for each combination having the
association (Step 2007).
[0162] For example, in a case where the EV5 in the event management
table 11910 is regarded as the processing-target event, and the
general rule "Rule 1" is regarded as the processing-target general
rule, the processing-target event matches the condition event 1
(the condition event for which the event ID 11925 is "1") of the
general rule "Rule 1". Therefore, the rule expansion program 11500
lists all of the logical volume 22100 and disk pool 22200 of the
storage apparatus 20000, which are the apparatus and device having
the association described in the general rule "Rule 1" (a coupling
relationship exists between the components) with the physical
server A drive "/var", which is the site where the
processing-target event occurred. Then, the rule expansion program
11500 prepares an expanded rule, which corresponds to the
combination of the physical server A drive "/var", the logical
volume 22100 of the storage apparatus 20000, and the disk pool
22200 of the storage apparatus 20000.
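The listing of associated components in Steps 2006 and 2007 can be sketched as a filter over the configuration information. The row keys and the sample identifiers used below are illustrative assumptions; the actual layout of the configuration information table 11810 is defined elsewhere in the specification.

```python
from typing import Dict, List, Tuple


def expand_combinations(config_rows: List[Dict[str, str]],
                        server_id: str,
                        drive: str) -> List[Tuple[str, str, str, str, str]]:
    """List each (server, drive, storage, volume, pool) combination coupled to the drive.

    One expanded rule would be prepared for each returned combination.
    """
    combinations = []
    for row in config_rows:
        if row["server"] == server_id and row["drive"] == drive:
            combo = (row["server"], row["drive"],
                     row["storage"], row["volume"], row["pool"])
            # Skip duplicate rows so that each combination yields one expanded rule.
            if combo not in combinations:
                combinations.append(combo)
    return combinations
```

When the drive has no coupling relationship in the configuration information, the list is empty and no expanded rule is prepared for that event.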
[0163] Next, the rule expansion program 11500 determines whether or
not an expanded rule table 11950, which shows the same expanded
rule as the prepared expanded rule, already exists in the rule-plan
information repository 11900 (Step 2008).
[0164] In a case where an expanded rule table 11950 showing the
same expanded rule as the prepared expanded rule does not exist
(Step 2008: No), the rule expansion program 11500 stores the
expanded rule table 11950 showing the prepared expanded rule in the
rule-plan information repository 11900, and registers a new entry
in the analysis result management table 11970 (Step 2009).
Alternatively, in a case where an expanded rule table 11950 showing
the same expanded rule as the prepared expanded rule exists (Step
2008: Yes), the rule expansion program 11500 adds the event ID of
the processing-target event in the received event ID 11976 and
changes the value of the certainty factor 11974 of the analysis
result management table 11970 (Step 2010).
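Steps 2008 through 2010 can be sketched as an insert-or-update on the analysis result management table, with the certainty factor recomputed as the incidence of condition events. The dictionary layout and the function name are hypothetical, and the number of condition events in the expanded rule is assumed to be supplied by the caller.

```python
from typing import Dict


def record_analysis_result(analysis_results: Dict[str, Dict],
                           expanded_rule_id: str,
                           condition_event_count: int,
                           event_id: str) -> bool:
    """Register or update the entry for an expanded rule.

    Returns True when a new entry was registered (Step 2009) and
    False when an existing entry was updated (Step 2010).
    """
    entry = analysis_results.get(expanded_rule_id)
    is_new = entry is None
    if is_new:
        entry = {"received_event_ids": [], "certainty_factor": 0.0}
        analysis_results[expanded_rule_id] = entry
    if event_id not in entry["received_event_ids"]:
        entry["received_event_ids"].append(event_id)
    # Certainty factor: occurred condition events / all condition events.
    entry["certainty_factor"] = len(entry["received_event_ids"]) / condition_event_count
    return is_new
```

With three condition events, receiving all three corresponding events in turn raises the certainty factor from 1/3 to 3/3, which is the value shown for the first entry of FIG. 13.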
[0165] After the processing (Steps 2005 through 2010) for all the
entries in the general rule has been completed, the rule expansion
program 11500 executes the processing (Steps 2004 through 2010) by
making the next general rule the processing target. Then, after the
processing (Steps 2004 through 2010) for all the general rules has
been completed, the rule expansion program 11500 determines whether
or not a newly registered entry exists in the analysis result
management table 11970 (Step 2011), and in a case where a newly
registered entry exists (Step 2011: Yes), instructs the plan
creation program 11100 to execute a plan creation process (refer to
FIG. 16) (Step 2012), and ends the rule expansion process.
Alternatively, in a case where a newly registered entry does not
exist (Step 2011: No), the rule expansion program 11500 ends the rule
expansion process.
[0166] FIG. 16 is a flowchart of the plan creation process related
to Example 1.
[0167] The plan creation process is executed in accordance with the
management server 10000 processor 15000 executing the plan creation
program 11100 deployed on the memory 11000.
[0168] First, the plan creation program 11100 acquires a newly
registered entry related to an analysis result (analysis result entry)
from the analysis result management table 11970 (Step 3001). The
plan creation program 11100 implements the processing of Steps 3003
through 3008 below for each acquired analysis result entry (Step
3002).
[0169] The plan creation program 11100 acquires the expanded rule
ID stored in the expanded rule ID 11975 of the processing-target
analysis result entry in the analysis result management table
11970. Hereinafter, the expanded rule having the acquired expanded
rule ID will be called "processing-target expanded rule". Then, the
plan creation program 11100 acquires the general rule ID of the
general rule, which constitutes the basis of the processing-target
expanded rule, from the pre-expansion general rule ID 11954 of the
expanded rule table 11950 showing the processing-target expanded
rule (Step 3003).
[0170] Next, the plan creation program 11100 acquires from the
rule-plan association table 11940 one or more general plan IDs
corresponding to the general rule ID acquired in Step 3003. The
plan creation program 11100 also acquires from the rule-plan
association table 11940 an unresolved event ID corresponding to a
combination of the acquired general plan ID and the general rule ID
acquired in Step 3003 (Step 3004).
[0171] Next, the plan creation program 11100 references the
configuration information table 11810 and the general plan table
11930, and creates an expanded plan which is based on a general
plan having the general plan ID acquired in Step 3004 and which
corresponds to the processing-target expanded rule, and stores the
expanded plan table 11960 showing the created expanded plan in the
rule-plan information repository 11900 (Step 3005). For example, in
a case where a volume migration general plan is to be expanded, the
plan creation program 11100 references the configuration
information table 11810 and identifies all the disk pools 22200
that could be the migration-destination pool. For example, the plan
creation program 11100, based on the configuration information
table 11810, identifies a disk pool 22200, which differs from the
migration-source pool and can be accessed by the physical server
30000, which had been accessing the migration-target volume, and
regards the identified disk pool 22200 as the migration-destination
pool.
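The identification of migration-destination candidates can be sketched as a set filter. How accessibility is derived from the configuration information table 11810 is not detailed here, so the sketch assumes a hypothetical, precomputed mapping from each physical server to the disk pools it can access.

```python
from typing import Dict, List, Set


def candidate_destination_pools(accessible_pools: Dict[str, Set[str]],
                                server_id: str,
                                source_pool: str) -> List[str]:
    """Return the pools the server can access, excluding the migration-source pool.

    One expanded plan would be created per returned pool.
    """
    pools = accessible_pools.get(server_id, set())
    return sorted(pool for pool in pools if pool != source_pool)
```

Sorting is only for a stable presentation order; the text does not prescribe any ordering of the candidates.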
[0172] Next, the plan creation program 11100 repeatedly executes
the processing of Steps 3007 and 3008 with respect to each expanded
plan created in Step 3005 (Step 3006). The plan creation program
11100 references the performance information table 11820,
calculates a predicted value for the post-plan execution
performance value in accordance with simulating a post-plan
execution state, and, based on the simulated result information,
updates the value of the plan target 11964 in the expanded plan
table 11960 showing the processing-target expanded plan (Step
3007).
[0173] Next, the plan creation program 11100 instructs the
post-plan execution risk selection program 11200 to execute a
post-plan execution risk selection process (refer to FIG. 17) for
the processing-target expanded plan (Step 3008). At this time, the
plan creation program 11100 inputs to the post-plan execution risk
selection program 11200 an unresolved event ID of an unresolved
event corresponding to the processing-target expanded plan, that
is, an unresolved event, which corresponds to a combination of the
general rule constituting the basis of the processing-target
expanded rule and the general plan constituting the basis of the
processing-target expanded plan. A risk site with respect to the
processing-target expanded plan is identified in accordance with
the post-plan execution risk selection process.
[0174] The plan creation program 11100, after completing the
processing (Steps 3003 through 3008) for all the acquired analysis
result entries, instructs the plan presentation processing program
11300 to execute a plan presentation process (refer to FIG. 18)
(Step 3009). Thereafter, the plan creation program 11100 ends the
plan creation process.
[0175] This example takes up performance information, in
particular, a predicted value for an I/O response time, and gives
an example of a simulation method, but a value other than a
performance value may be used as the value to be stored in the
expanded plan table 11960 as long as the value is capable of
serving as an index, which denotes the characteristics of the plan.
The management server 10000, for example, may perform the same
simulation as that for a performance value by storing cost
information pertaining to plan execution and time information
pertaining to plan execution in the configuration information table
11810 or the performance information table 11820.
[0176] FIG. 17 is a flowchart of the post-plan execution risk
selection process related to Example 1.
[0177] The post-plan execution risk selection process is executed
in accordance with the management server 10000 processor 15000
executing the post-plan execution risk selection program 11200
deployed on the memory 11000.
[0178] First, the post-plan execution risk selection program 11200
uses an unresolved event ID received from the plan creation program
11100 to select an unresolvable event from among the actually
occurred condition events registered in the received event ID 11976
of the processing-target analysis result entry in the analysis
result management table 11970 (Step 4001). As used here,
unresolvable event refers to an event, which corresponds to the
condition event shown by the unresolved event ID, from among the
condition events that have actually occurred.
[0179] For example, in a case where the first entry from the top of
the analysis result management table 11970 of FIG. 13 (the entry in
which the disk pool "POOL1" of the storage apparatus A is the
failure cause) is selected as the processing-target analysis result
entry in Step 3002 of FIG. 16, and the expanded plan "ExPlan 1-1"
is selected as the processing-target expanded plan in Step 3006,
the processing-target expanded rule, that is, the expanded rule
shown by the expanded rule ID 11975 of the processing-target
analysis result entry is the expanded rule "ExRule 1-1", and the
general rule, which constitutes the basis of the processing-target
expanded rule, is the general rule "Rule 1". Therefore, the
unresolved event corresponding to the processing-target expanded
plan "ExPlan 1-1" is the unresolved event corresponding to the
combination of the general plan "Plan 1", which constitutes the
basis of the expanded plan "ExPlan 1-1", and the general rule "Rule
1", and the unresolved event "3" is acquired from the rule-plan
association table 11940 of FIG. 10. The unresolved event "3" shows
the condition event 3 of the general rule "Rule 1", that is, the
response time threshold abnormality event in the disk pool 22200 of
the storage apparatus 20000. Therefore, the event "EV3", which
corresponds to the response time threshold abnormality event in the
disk pool 22200 of the storage apparatus 20000, is selected as the
unresolvable event from among the events (event "EV2", event "EV3",
and event "EV5") registered in the received event ID 11976 of the
processing-target analysis result entry.
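The selection in Step 4001 can be sketched as a lookup from received events to the condition events they correspond to. The mapping passed in is a hypothetical structure; in the usage below, the correspondence of "EV3" to condition event 3 and of "EV5" to condition event 1 follows the text, while the correspondence of "EV2" to condition event 2 is an assumption.

```python
from typing import Dict, List, Set


def select_unresolvable_events(received_events: Dict[str, str],
                               unresolved_condition_ids: Set[str]) -> List[str]:
    """Return the received events whose condition event is listed as unresolved.

    received_events maps an event ID to the ID of its condition event.
    """
    return [event_id
            for event_id, condition_id in received_events.items()
            if condition_id in unresolved_condition_ids]


# Received events of the processing-target analysis result entry.
received = {"EV2": "2", "EV3": "3", "EV5": "1"}
unresolvable = select_unresolvable_events(received, {"3"})
```

Here `unresolvable` contains only "EV3", matching the event selected in the example above.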
[0180] Next, the post-plan execution risk selection program 11200
references the event management table 11910 and the expanded rule
table 11950, and identifies the occurrence sites (the occurrence
source apparatus and device) of the unresolvable event selected in
Step 4001 (Step 4002). Next, the post-plan execution risk selection
program 11200 references the configuration information table 11810,
and selects any one or more of the unresolvable event occurrence
sites and the sites (apparatus and device) associated on the I/O
path with the unresolvable event occurrence sites as risk sites
(Step 4003).
[0181] In a case where risk sites have been selected in Step 4003
(Step 4004: Yes), the post-plan execution risk selection program
11200 stores data showing the selected risk sites in the risk site
11969 of the processing-target expanded plan in the expanded plan
table 11960 (Step 4005), and ends the post-plan execution risk
selection process. Alternatively, in a case where a risk site was
not selected in Step 4003 (Step 4004: No), the post-plan execution
risk selection program 11200 ends the post-plan execution risk
selection process.
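Steps 4002 through 4004 can be sketched as a walk over the I/O paths that pass through the occurrence site. The row keys, the default choice of reporting only drives, and the sample identifiers in the test data are illustrative assumptions; which of the associated sites are finally presented as risk sites is a separate choice that this sketch does not model.

```python
from typing import Dict, List, Tuple


def select_risk_sites(config_rows: List[Dict[str, str]],
                      occurrence_kind: str,
                      occurrence_id: str,
                      risk_kinds: Tuple[str, ...] = ("drive",)) -> List[Tuple[str, str]]:
    """Collect sites of the requested kinds on every I/O path through the occurrence site.

    Each config row describes one I/O path, e.g. {"drive": ..., "volume": ..., "pool": ...}.
    An empty result corresponds to "Step 4004: No".
    """
    sites = []
    for row in config_rows:
        # Only paths that contain the unresolvable event's occurrence site are relevant.
        if row.get(occurrence_kind) != occurrence_id:
            continue
        for kind in risk_kinds:
            site = (kind, row[kind])
            if site not in sites:
                sites.append(site)
    return sites
```

Passing `risk_kinds=("drive", "volume")` would also report the logical volumes on the affected paths, in line with the option of selecting other sites on the I/O path.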
[0182] In this example, only business side information, which is
most important for the administrator, that is, only information of
the volume "/opt" of the physical server A is selected as the risk
site (for example, refer to FIG. 12), but other sites on the I/O
path shown by the entry of the configuration information table
11810, for example, the server data I/F and the logical volume of
the storage apparatus 20000 may also be selected as the risk
site.
[0183] FIG. 18 is a flowchart of the plan presentation process
related to Example 1.
[0184] The plan presentation process is executed in accordance with
the management server 10000 processor 15000 executing the plan
presentation program 11300 deployed on the memory 11000.
[0185] First, the plan presentation program 11300 acquires from the
analysis result management table 11970 information showing a
failure cause, that is, the values of the cause apparatus ID 11971,
the cause component ID 11972, the metrics 11973, and the certainty
factor 11974 (Step 5001).
[0186] Next, the plan presentation program 11300 implements the
processing of Step 5002 below with respect to each analysis result
entry of the analysis result management table 11970. Here, the
expanded rule having the expanded rule ID stored in the expanded
rule ID 11975 of the processing-target analysis result entry will
be referred to as "processing-target expanded rule".
[0187] The plan presentation program 11300 acquires from the
rule-plan information repository 11900 one or more expanded plan
tables 11960, which show one or more expanded plans corresponding
to the processing-target expanded rule (plans, which are candidates
for failure recovery) (Step 5002).
[0188] After completing the processing for all the analysis result
entries (Step 5002), the plan presentation program 11300 creates a
plan presentation screen (refer to FIG. 20) based on information
showing the failure cause and the certainty factor acquired in Step
5001 and the expanded plan table 11960 acquired in Step 5002, and
causes the output device 14000 to display the created plan
presentation screen (Step 5003). Thereafter, the plan presentation
program 11300 ends the plan presentation process.
[0189] FIG. 19 is a flowchart of an effect-risk presentation
process related to Example 1.
[0190] The plan presentation process is executed and the plan
presentation screen is displayed in accordance with the management
server 10000 processor 15000 executing the plan presentation
program 11300 deployed on the memory 11000.
[0191] The plan presentation program 11300 starts the effect-risk
presentation process when a desired expanded plan is selected and
an input requesting the display of the plan details screen with
respect to the selected expanded plan is received in the plan
presentation screen (Step 6001).
[0192] First, the plan presentation program 11300 references the
configuration information table 11810 and acquires risk site status
information, performance information, and setting information (Step
6002). Next, the plan presentation program 11300 references the
expanded plan table 11960 and the analysis result management table
11970, and acquires information showing which event of the
condition events included in the expanded rule corresponding to the
selected expanded plan occurred, and information showing which
event will be resolved in a case where the selected expanded plan
is implemented (Step 6003). Next, the plan presentation program
11300 selects I/O path information associated with the selected
expanded plan (Step 6004).
[0193] Thereafter, the plan presentation program 11300 creates a
plan details screen (refer to FIG. 21) based on the information
acquired in Steps 6002 through 6004, and causes the output device
14000 to display the created plan details screen (Step 6005).
Thereafter, the plan presentation program 11300 ends the
effect-risk presentation process.
[0194] FIG. 20 is a constitutional diagram of an example of the
plan presentation screen related to Example 1.
[0195] The plan presentation screen 9000 comprises a display area
9001 for displaying information, which is referenced when, in a
case where a failure has occurred in the computer system, the
administrator is investigating the cause of the failure and
implementing countermeasures, specifically, information showing the
corresponding relationship between a failure cause and a list of
plans available with respect to the failure, a plan details button
9002 for displaying the details of the plans, and an execute plan
button 9003 for executing a plan.
[0196] As information showing the failure cause, for example, the
ID of the apparatus related to the failure cause event, the ID of
the device related to the failure cause event, the type of the
failure cause event, and the certainty factor of the failure cause,
that is, the percentage of the number of condition events, which
have actually occurred relative to the total number of condition
events, are displayed in the display area 9001, which displays the
association between the failure cause and a plan for countering the
failure. The plan presentation program 11300 acquires these values
from the analysis result management table 11970 shown in FIG. 13 in
Step 5001 of FIG. 18.
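As a minimal sketch of the certainty factor described above, that is, the percentage of condition events that have actually occurred, the calculation may be written as follows. The function name, the tuple representation of an event, and the sample events are assumptions made for illustration and are not taken from the drawings.

```python
def certainty_factor(condition_events, occurred_events):
    """Return the percentage of condition events that have actually occurred."""
    if not condition_events:
        raise ValueError("an expanded rule needs at least one condition event")
    occurred = sum(1 for event in condition_events if event in occurred_events)
    return 100.0 * occurred / len(condition_events)


# Hypothetical condition events of one expanded rule, written as
# (apparatus, device, metric) tuples.
conditions = [
    ("ServerA", "/var", "ResponseTime"),
    ("StorageA", "LV2", "ResponseTime"),
    ("StorageA", "POOL1", "ResponseTime"),
]
# Hypothetical events that have been received so far.
received = {
    ("ServerA", "/var", "ResponseTime"),
    ("StorageA", "LV2", "ResponseTime"),
}

print(round(certainty_factor(conditions, received), 1))  # 66.7
```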
[0197] As information about the plan with respect to the failure,
for example, the display area 9001 displays information showing the
contents of candidate plans, the costs pertaining to plan
execution, the time required for plan execution (that is, the time
during which the failure will remain, hereinafter referred to as
"downtime"), post-plan execution performance information, and
information showing a risk site. The information showing a risk
site, for example, comprises data (for example, the name of the
risk site) showing the risk site stored in the risk site 11969 of
the expanded plan table 11960, and data (for example, the name of
the apparatus) showing the apparatus, which comprises the device
regarded as the risk site.
[0198] Regarding the cost information pertaining to the plan
execution, for example, the management server 10000 may prestore
the approximate cost pertaining to each general plan in the general
plan table 11930 shown in FIG. 9, and may determine the cost based
on this information. For example, with respect to the plan for
replacing the storage port of general plan "Plan 8", the management
server 10000 may store a sum obtained by adding up the cost of
purchasing a storage port and the labor costs for the maintenance
personnel needed to replace the storage port, and may display this
sum as the cost. With respect to the volume migration plan of
general plan "Plan 1", the management server 10000 may store the
cost per bit in a case where data is migrated from a certain
storage device to a different storage device, and may calculate the
cost displayed in the "Cost ($)" field of FIG. 20 in accordance
with the capacity of the volume to be migrated.
[0199] Regarding downtime, for example, in the case of the volume
migration plan, the management server 10000 may prestore data
showing about how much data can be migrated per unit of time (for
example, one second) for each combination of the media type of both
the migration source and the migration destination storage devices
and a RAID level, and may calculate the downtime displayed in the
"Downtime" field of FIG. 20 in accordance with the capacity of the
volume to be migrated. At this time, the management server 10000
may take into account the fact that the migration time will
fluctuate in accordance with the utilization state in the actual
environment, and, using past migration history information, may
calculate the capacity capable of being migrated per unit of time,
and may seek to revise the prestored information by averaging the
prestored information and the historical information. One example
of a method for determining the cost information and downtime
pertaining to plan execution has been given here, but another
determination method may be employed.
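The cost and downtime estimates for the volume migration plan described above can be sketched as follows. This is only one possible reading: the table of prestored rates, the units (MB and seconds), the function names, and all numeric values are assumptions, and the averaging step simply takes the mean of the prestored rate and the mean historical rate.

```python
# Hypothetical prestored migration rates in MB per second, keyed by
# (source media type, destination media type, RAID level).
PRESTORED_RATE_MB_PER_S = {
    ("SSD", "SATA", "RAID5"): 100.0,
    ("SATA", "SATA", "RAID5"): 50.0,
}


def migration_cost(capacity_bits, cost_per_bit):
    """Cost of a volume migration: volume capacity times the stored cost per bit."""
    return capacity_bits * cost_per_bit


def revised_rate(prestored_rate, past_rates):
    """Average the prestored rate with the mean rate seen in past migrations."""
    if not past_rates:
        return prestored_rate
    historical_rate = sum(past_rates) / len(past_rates)
    return (prestored_rate + historical_rate) / 2


def migration_downtime_s(capacity_mb, rate_mb_per_s):
    """Downtime in seconds: capacity divided by the capacity migratable per second."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return capacity_mb / rate_mb_per_s


rate = revised_rate(PRESTORED_RATE_MB_PER_S[("SSD", "SATA", "RAID5")], [80.0, 60.0])
print(rate)                                # 85.0
print(migration_downtime_s(8500.0, rate))  # 100.0
```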
[0200] As the post-plan execution performance information, for
example, a predicted value for the post-plan execution performance
value, which was simulated in Step 3007 of the plan creation
process shown in FIG. 16 and stored in the plan target 11964 of the
expanded plan table 11960 shown in FIG. 12, and, for example, the
values stored in the I/O Response Time predictions 11965B, 11966B,
11967B, and 11968C are displayed. In the example of FIG. 20, a
predicted value for the I/O response time with respect to the
migration-target volume is displayed for the first plan from the
top (the volume migration plan), and a predicted value for the I/O
response time with respect to the add-target pool is displayed for
the fifth plan from the top (the add disk to pool plan), but other
values, for example, a predicted value for the I/O response time
with respect to the migration-source pool and the
migration-destination pool may be displayed for the first plan from
the top (the volume migration plan), and other performance values
may also be displayed. Predicted values for multiple performance
values may also be displayed.
[0201] The order in which the candidate plans are displayed here
may be sorted on the basis of the characteristics of the plan, such
as being arranged in order from the plan with the lowest plan
execution cost, being arranged in order from the plan with the
shortest time required for plan execution, or being arranged in
order from the plan with no risk site.
[0202] Any method may be used as the arrangement method, such as,
for example, arranging the plans in order from the one having the
lowest cost by clicking on the "Cost ($)" in the display area
9001.
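The display orders mentioned above (lowest cost first, shortest execution time first, or plans with no risk site first) can be sketched as follows. The class name, field names, and sample rows are hypothetical and are not the columns of FIG. 20.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanRow:
    name: str
    cost: float                # plan execution cost
    downtime_s: float          # time required for plan execution
    risk_site: Optional[str]   # None when the plan leaves no risk site


def sort_plans(rows, key="cost"):
    """Return the rows in the requested display order."""
    if key == "cost":
        return sorted(rows, key=lambda row: row.cost)
    if key == "downtime":
        return sorted(rows, key=lambda row: row.downtime_s)
    if key == "risk":
        # False sorts before True, so plans without a risk site come first;
        # sorted() is stable, so the previous order is kept within each group.
        return sorted(rows, key=lambda row: row.risk_site is not None)
    raise ValueError(f"unknown sort key: {key}")


# Hypothetical candidate plans.
rows = [
    PlanRow("Volume migration", 500.0, 300.0, "Server A /opt"),
    PlanRow("Replace storage port", 2000.0, 3600.0, None),
    PlanRow("Add disk to pool", 800.0, 120.0, None),
]

print([row.name for row in sort_plans(rows, "cost")])
```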
[0203] The plan details button 9002 is for instructing the display
of the plan details screen (FIG. 21). When the administrator
selects a desired plan from within the display area 9001 using the
input apparatus 15000 and presses the plan details button 9002, the
management server 10000 starts the execution of the effect-risk
presentation process of FIG. 19, and displays the plan details
screen (FIG. 21) for displaying the selected plan details
information on the output apparatus 14000.
[0204] The execute plan button 9003 is for instructing the
execution of the selected plan, and when this button is pressed,
the management server 10000 issues a plan execution indication to
the program, which provides the function corresponding to the
selected plan. The program, which receives the plan execution
indication, executes the selected plan. The program, which executes
the plan here, for example, is a program in the memory 11000 of the
management server 10000, and, for example, is a volume migration
program or a VM migration program.
[0205] Also, for example, predicted values for a pre-plan execution
performance value and a post-plan execution performance value may
be displayed in the display area 9001 in graph form as trend
information.
[0206] FIG. 20 is an example of the plan presentation screen 9000, and
plan characteristics-denoting information other than the plan
execution cost and time required for plan execution, such as, for
example, a list of business operations, which are using
plan-related resources and could be affected at plan execution
time, may be displayed together in the display area 9001, and
another display mode may be used.
[0207] FIG. 21 is a constitutional diagram of an example of the
plan details screen related to Example 1.
[0208] The plan details screen 9010 comprises a display area 9011
for displaying information, which is referenced when, in a case
where a failure has occurred in the computer system, the
administrator is investigating the cause of the failure and
implementing countermeasures, specifically, information showing the
coupling relationship between the apparatus and the device related
to the failure, and a display area 9017 for displaying detailed
information regarding a risk site. The display area 9011 comprises
a server area 9012, which denotes the configuration of the physical
server 30000 in the computer system, a switch area 9013, which
denotes the configuration of the switch, a storage area 9014, which
denotes the configuration of the storage apparatus 20000, an area
9015 showing the coupling relationship and the setting relationship
between each apparatus and device prior to executing the plan
selected in the plan presentation screen 9000, and an area 9016
showing the coupling relationship and the setting relationship
between each apparatus and device after executing the selected
plan. The display area 9017 for displaying detailed information
regarding a risk site comprises a risk site 9018, which denotes a
site where a risk remains even after plan execution, and a risk
occurrence timing 9019, which denotes, when the risk site is left
unattended, the timing at which the relevant risk will occur.
[0209] In the example of FIG. 21, a diagram 9020 denoting the
physical server A, and a diagram 9021 denoting the storage
apparatus A are displayed in the server area 9012 of the display
area 9011. Also, diagrams denoting the drive "/opt" and the drive
"/var", which are mount points, are displayed inside the diagram
9020, and diagrams denoting the logical volume "LV1" and the disk
pool "POOL1" are displayed inside the diagram 9021. That is, the
server area 9012 is for expressing the coupling relationships and
the setting relationships of the computer system. In addition, a
mark 9022 showing a failure event, which has occurred in the system
and is being managed by the analysis result management table 11970,
is displayed in the server area 9012 on the site where the failure
event occurred, for example, on the data I/F "S2" of the physical
server A. Also, a mark 9023 showing a risk site, which is being
managed by the expanded plan table 11960, is displayed in the
server area 9012 on the risk occurrence site, for example, on the
drive "/opt" of the physical server A. Furthermore, the coupling
relationship of an apparatus and a device, such as the physical
server A and the storage apparatus A, in the area 9011 is expressed
using a solid line connecting the diagrams denoting the apparatus
and the device.
[0210] In the example shown in FIG. 21, the area 9015 showing the
state before plan execution shows the fact that the drive "/opt" of
the physical server A, the logical volume "LV1" and the disk pool
"POOL1" of the storage apparatus A are associated, the drive "/var"
of the physical server A, the logical volume "LV2" and the disk
pool "POOL1" of the storage apparatus A are associated, and the
logical volume "LV3" and the disk pool "POOL2" of the storage
apparatus A are associated. The area 9015 showing the state before
plan execution also shows the fact that failure events have
occurred in the drive "/var" on the physical server A, the data I/F
on the physical server A, the data I/F of the storage apparatus A,
the logical volume "LV2" of the storage apparatus A, and the disk
pool "POOL1" of the storage apparatus A, and that a risk exists in
the drive "/opt" on the physical server A.
[0211] The area 9016 showing the state after plan execution shows
that the logical volume "LV2", which existed in the disk pool
"POOL1", has come to exist in the POOL2, and that even after plan
execution, a failure event remains in the disk pool "POOL1" of the
storage apparatus A, and a risk remains in the drive "/opt" on the
physical server A.
[0212] Detailed information related to the device with the mark
9023 showing the risk site is displayed in the display area 9017
showing the risk details. In the example of FIG. 21, the risk
details information for the physical server A drive "/opt" in the
display area 9011 is displayed in the area 9017, which denotes that
the risk site is the physical server A drive "/opt", and that the
timing at which
the risk could occur is when an I/O is generated to the drive
"/opt". The risk occurrence timing, for example, is determined
based on the information acquired in Step 6002 of FIG. 19. For
example, the management server 10000 may acquire the information of
the performance value 11825 of the performance information table
11820, may detect the fact that the performance value related to
the physical server A drive "/opt" is 0 msec and that an I/O has
not occurred, may judge that the occurrence of an I/O could trigger
the occurrence of a risk, and may store the relevant information in
the risk occurrence timing 9019.
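The judgment described above, in which a performance value of 0 msec is taken to mean that no I/O has occurred and that a future I/O could trigger the risk, can be sketched as follows. The function name and the wording of the returned text are assumptions, and returning None for any other value is a simplification for illustration.

```python
def risk_occurrence_timing(apparatus, device, performance_msec):
    """Return a risk occurrence timing description, or None if no trigger is inferred."""
    if performance_msec == 0:
        # A value of 0 msec is read as "no I/O has occurred yet", so the
        # next I/O to the device is judged to be what could trigger the risk.
        return f'When an I/O is generated to {apparatus} drive "{device}"'
    return None


print(risk_occurrence_timing("physical server A", "/opt", 0))
```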
[0213] The plan details screen 9010 here comprises the display area
9011 and the display area 9017, but the display method is not
limited thereto, and, for example, when the mark 9023 showing the
risk site of the display area 9011 is clicked, the display area
9017 may be newly displayed as a separate screen. Or, when the risk
site information in the display area 9001 of the plan presentation
screen 9000 is clicked, the display area 9017 may be newly
displayed as a separate screen. Also, when a diagram showing either
an apparatus or a device, such as the physical server 30000 or the
storage apparatus 20000 of the display area 9011 is clicked, the
performance values of either the relevant apparatus or device
before and after plan execution may be displayed.
[0214] According to Example 1, it is possible to associatively
present a failure cause and a specific recovery plan for the
failure, to check how many of the failure events related to the
failure cause will be resolved by executing each plan, and, by
displaying the result, to present the administrator with a problem
site, which latently remains as-is even after plan execution,
together with the reason therefor as plan details information. This
makes it possible for the administrator to select an appropriate
plan, and to readily ascertain post-plan execution risk at plan
selection time.
EXAMPLE 2
[0215] Next, Example 2 will be explained. In the following
explanation, the focus will be on explaining the differences with
Example 1, and descriptions of similar components, programs having
similar functions, and tables having similar items will be
omitted.
[0216] FIG. 22 is a constitutional diagram of an example of a
management server related to Example 2.
[0217] The management server 10000 comprises the same components as
in Example 1, and, in addition, a VM configuration management table
11830 is stored in the configuration/performance information
repository 11800 of the memory 11000. The VM configuration
management table 11830 manages the corresponding relationship
between a VM and a hypervisor (hereinafter also called "HV"), which
logically creates and runs the VM, and VM setting information, for
example, power status information and the like.
[0218] In the example shown in FIG. 22, the various programs and
tables are stored in the memory 11000, but may be stored in either
the storage device 12000 or another storage medium (not shown in
the drawing). In accordance with this, the processor 15000 reads a
target program on the memory 11000 at program execution time, and
executes the program that was read. The above-described programs
and the above-described tables may be stored in either the memory
21000 of the storage apparatus 20000 or the memory 31000 of the
physical server 30000, and either the storage apparatus 20000 or
the physical server 30000 may execute the stored programs. Another
apparatus, such as a switch, may store the above-described programs
and tables, and may execute the stored programs.
[0219] FIG. 23 is a constitutional diagram of an example of a
physical server related to Example 2.
[0220] The physical server 30000 comprises the same components as
in Example 1, and, in addition, a VM management program 31400 is
stored in the memory 31000. The VM management program 31400 manages
the configuration information and the performance information of
the VM. The VM management program 31400 also carries out VM-related
control, such as VM migration.
[0221] In the example shown in FIG. 23, various programs are stored
in the memory 31000, but may be stored in another storage medium
(not shown in the drawing). In accordance with this, the processor
33000 reads a target program on the memory 31000 at process
execution time, and executes the program that was read.
[0222] FIG. 24 is a logical constitutional diagram of an example of
the physical server related to Example 2.
[0223] The physical server 30000 comprises an HV 80000, which
logically creates a VM 70000, and runs the created VM 70000. The HV
80000 can control multiple VMs 70000 at one time. Each of the
multiple VMs 70000 can execute an application just like a
standalone physical computer.
[0224] FIG. 25 is a constitutional diagram of an example of a
switch related to Example 2.
[0225] The switch 60000 comprises a memory 61000, a management I/F
62000, a processor 63000, and a switch data I/F 64000, and these
apparatuses are coupled to one another via an internal bus 65000.
The switch 60000 is an apparatus for selecting
a communication route from the data I/F 32000 of the physical
server 30000 to the data I/F 26000 of the storage apparatus 20000. A
configuration/performance information collection program 61100 is
stored in the memory 61000. The configuration/performance
information collection program 61100 is for sending/receiving the
management information and the performance information of the
switch 60000 and the like to/from the management server 10000.
[0226] FIG. 26 is a constitutional diagram of an example of a
configuration information table related to Example 2.
[0227] The configuration information table 11810 comprises the
field of switch information 11818 in addition to the fields of the
configuration information table 11810 related to Example 1. The
switch information 11818 comprises a switch 11818A in which is
stored the switch 60000 identifier, a switch data I/F IN 11818B in
which is stored data showing the input data I/F of the switch
60000, and a switch data I/F OUT 11818C in which is stored data
showing the output data I/F of the switch 60000. The switch
information 11818 is arranged between the fields of the physical
server 11811 and the storage 11814, and the relative position of
this field shows the relative positions of the apparatus and device
on the communication route. For example, the second entry from the
top of FIG. 26 shows that switch B and switch C exist between the
server data I/F "S2" of the physical server A and the storage data
I/F "P2" of the storage apparatus A. More specifically, this entry
shows that the physical server A server data I/F "S2" is coupled to
the switch B data I/F "R10", the switch B data I/F "R11" and the
switch C data I/F "R20" are coupled, and the switch C data I/F
"R21" and the storage apparatus A storage data I/F "P2" are
coupled. The VM 70000 configuration information is also stored in
the configuration information table 11810 in the same way as the
physical server 30000 configuration information. Therefore, an identifier
for uniquely identifying the VM 70000 is stored in the physical
server 11811. For example, the fifth entry from the top of FIG. 26
shows that the logical volume "LV10", which was created from the
disk pool "POOL3" of storage B, is coupled to the VM1 via the
storage B storage data I/F "P3", the switch A data I/F "R2", "R1"
and the server data I/F "S3", and is recognized as logical volume
"E:" on the VM1.
[0228] FIG. 27 is a constitutional diagram of an example of the VM
configuration management table related to Example 2.
[0229] An entry is added to the VM configuration management table
11830 in accordance with the configuration setting management
program 11600 being executed. The VM configuration management
table 11830 comprises the fields of a virtual server 11831, a
power status 11832, a physical server 11833, and a server data I/F
11834. An identifier for uniquely identifying a VM 70000 is stored
in the virtual server 11831. Data showing the power status of the
VM 70000, for example, "ON", "OFF", or "SUSPEND" is stored in the
power status 11832. An identifier for uniquely identifying the
physical server 30000 on which the VM 70000 is running is stored in
the physical server 11833. An identifier for uniquely identifying
the server data I/F of the physical server 30000 is stored in the
server data I/F 11834.
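One way to picture an entry of the VM configuration management table 11830 is as a small record with the four fields listed above. The sketch below is illustrative only: the class names, the lookup helper, and the sample identifiers (including the physical server names) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class PowerStatus(Enum):
    ON = "ON"
    OFF = "OFF"
    SUSPEND = "SUSPEND"


@dataclass(frozen=True)
class VMConfigEntry:
    virtual_server: str        # corresponds to the virtual server 11831
    power_status: PowerStatus  # corresponds to the power status 11832
    physical_server: str       # corresponds to the physical server 11833
    server_data_if: str        # corresponds to the server data I/F 11834


def vms_on(table, physical_server):
    """Return the identifiers of the VMs running on the given physical server."""
    return [entry.virtual_server for entry in table
            if entry.physical_server == physical_server]


# Hypothetical table contents.
table = [
    VMConfigEntry("VM1", PowerStatus.ON, "Server B", "S3"),
    VMConfigEntry("VM2", PowerStatus.SUSPEND, "Server B", "S3"),
    VMConfigEntry("VM3", PowerStatus.OFF, "Server C", "S4"),
]

print(vms_on(table, "Server B"))  # ['VM1', 'VM2']
```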
[0230] FIG. 28 is a constitutional diagram of an example of the
performance information table related to Example 2.
[0231] The configuration of the performance information table 11820
related to Example 2 is substantially the same as the configuration
of the performance information table 11820 related to Example 1. VM
70000 performance information, hypervisor 80000 performance
information, and switch 60000 performance information are also
stored in the performance information table 11820 related to
Example 2 as performance information related to either an apparatus
or a device comprising the computer system. Here, the VM 70000 and
the hypervisor 80000 are also treated as apparatuses. For example,
an identifier for uniquely identifying either the VM 70000, the
hypervisor 80000, or the switch 60000 is stored in the apparatus ID
11821.
FIG. 29 is a constitutional diagram of an example of the
event management table related to Example 2.
[0232] The configuration of the event management table 11910
related to Example 2 is substantially the same as the configuration
of the event management table 11910 related to Example 1. In the
event management table 11910 related to Example 2, information
related to an event that has occurred in the VM 70000, information
related to an event that has occurred in the hypervisor 80000, and
information related to an event that has occurred in the switch
60000 are also stored as information related to an event, which
occurs in either an apparatus or a device comprising the computer
system.
[0233] FIG. 30A is a first constitutional diagram of a general rule
table related to Example 2. FIG. 30B is a second constitutional
diagram of a general rule table related to Example 2.
[0234] The configuration of the general rule table 11920 related to
Example 2 is substantially the same as the configuration of the
general rule table 11920 related to Example 1. In Example 2, a VM
70000-related event, a hypervisor 80000-related event, and a switch
60000-related event are also used as a condition event, which is
defined by the condition part 11921 and the conclusion part 11922
of the general rule.
[0235] FIG. 31 is a constitutional diagram of an example of a
rule-plan association table related to Example 2.
[0236] The configuration of the rule-plan association table 11940
related to Example 2 is substantially the same as the configuration
of the rule-plan association table 11940 related to Example 1. In
Example 2, general rules "Rule 3" and "Rule 4" are taken up and
explained as general rules, and general plans "Plan 1" and "Plan 6"
are taken up and explained as general plans.
[0237] FIG. 32A is a first constitutional diagram of the expanded
rule table related to Example 2. FIG. 32B is a second
constitutional diagram of the expanded rule table related to
Example 2.
[0238] The configuration of the expanded rule table 11950 related
to Example 2 is substantially the same as the configuration of the
expanded rule table 11950 related to Example 1. In Example 2, a VM
70000-related event, a hypervisor 80000-related event, and a switch
60000-related event are also used as a condition event, which is
defined by the condition part 11951 and the conclusion part 11952
of the expanded rule.
[0239] FIG. 33A is a first constitutional diagram of the expanded
plan table related to Example 2. FIG. 33B is a second
constitutional diagram of the expanded plan table related to
Example 2. FIG. 33C is a third constitutional diagram of the
expanded plan table related to Example 2.
[0240] The configuration of the expanded plan table 11960 related
to Example 2 is substantially the same as the configuration of the
expanded plan table 11960 related to Example 1. The same as in
Example 1, the fields included in the plan target 11964 will differ
in accordance with the contents of the plan.
[0241] The expanded plan shown in either FIG. 33B or FIG. 33C is an
expanded plan based on the general plan "Plan 6", that is, an
expanded plan related to a VM migration, and the plan target 11964,
for example, comprises the fields of a target VM 1196B, a migration
source 1196C, and a migration destination 1196D. The target VM
1196B comprises the fields of an ID 1196BA in which is stored the
identifier of the VM 70000 (hereinafter referred to as "target
VM"), which is the target of the VM migration, and a performance
1196BB in which is stored the target VM post-migration performance
value. The migration source 1196C comprises the fields of an ID
1196CA in which is stored the identifier of the hypervisor 80000,
which is the migration source of the target VM (hereinafter
referred to as "migration-source hypervisor"), and a performance
1196CB in which is stored the performance value of the
migration-source hypervisor after the target VM has been migrated.
The migration destination 1196D comprises the fields of an ID
1196DA in which is stored the identifier of the hypervisor 80000,
which is the migration destination of the target VM (hereinafter
referred to as "migration-destination hypervisor"), and a
performance 1196DB in which is stored the performance value of the
migration-destination hypervisor after the target VM has been
migrated. The plan creation program 11100 acquires the identifiers
to be stored in the ID 1196BA, the ID 1196CA, and the ID 1196DA from
the configuration information table 11810 and so forth, and stores
these identifiers in the respective ID fields. Furthermore, any method may be
used to calculate the value of the performance information
predicted values stored in the performance 1196BB, the performance
1196CB, and the performance 1196DB, and, for example, the plan
creation program 11100 may determine the predicted values in
accordance with either adding or subtracting the IOPS as shown in
Example 1. An example of performance information has been described
here, but cost information, or time information, such as system
downtime resulting from a failure, may be stored. A single
hypervisor 80000 serves as the migration source and the migration
destination here, but a set of multiple hypervisors 80000, which
share resources, or a data store inside the hypervisor 80000 may
serve as the migration source and the migration destination.
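As one hedged illustration of determining the predicted values by adding or subtracting IOPS, the load of the target VM may be subtracted from the migration-source hypervisor and added to the migration-destination hypervisor. The function name and the sample figures are assumptions, and any other calculation method may equally be used.

```python
def predict_hypervisor_iops(source_iops, destination_iops, target_vm_iops):
    """Return (migration-source, migration-destination) IOPS after the VM migration."""
    # The migration source loses the target VM's load and the destination gains it.
    return source_iops - target_vm_iops, destination_iops + target_vm_iops


# Hypothetical IOPS values before the migration.
print(predict_hypervisor_iops(500, 200, 150))  # (350, 350)
```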
[0242] FIG. 34 is a constitutional diagram of an example of the
analysis result management table related to Example 2.
[0243] The configuration of the analysis result management table
11970 related to Example 2 is substantially the same as the
configuration of the analysis result management table 11970 related
to Example 1. In Example 2, the VM 70000 identifier, the hypervisor
80000 identifier, the switch 60000 identifier, and the identifiers
of the devices thereof can be stored in a cause apparatus ID 11971
and a cause component ID 11972. Also, the identifier of an event
related to the VM 70000, the identifier of an event related to the
hypervisor 80000, and the identifier of an event related to the
switch 60000 can be stored in the received event ID 11976.
[0244] FIG. 35 is a flowchart of the plan presentation process
related to Example 2.
[0245] The plan presentation process is executed in accordance with
the management server 10000 processor 15000 executing the plan
presentation program 11300 deployed on the memory 11000.
[0246] First, the plan presentation program 11300 acquires
information showing a failure cause from the analysis result
management table 11970, that is, the values of the cause apparatus
ID 11971, the cause component ID 11972, the metrics 11973, and the
certainty factor 11974 (Step 7001).
[0247] Next, the plan presentation program 11300 executes the
processing of Steps 7002 through 7005 below with respect to each of
one or more expanded plans shown by one or more expanded plan
tables 11960 stored in the rule-plan information repository 11900.
The plan presentation program 11300 determines whether or not there
exists among the one or more expanded plans shown by the one or
more expanded plan tables 11960 stored in the rule-plan information
repository 11900 an expanded plan (hereinafter referred to as
"first consolidation-target plan"), which has an expanded rule ID
1196A value that differs from that of the processing-target
expanded plan, that is, a corresponding expanded rule that is
different, and, in addition, has the same processing content as
that of the processing-target expanded plan (Step 7002).
[0248] In a case where a first consolidation-target plan does not
exist (Step 7002: No), the plan presentation program 11300 advances
the processing to Step 7004. Alternatively, in a case where a first
consolidation-target plan exists (Step 7002: Yes), the plan
presentation program 11300 deletes the first consolidation-target
plan from the expanded plan table 11960, updates the value of the
expanded rule ID 1196A of the expanded plan table 11960 comprising
the processing-target expanded plan (Step 7003), and advances the
processing to Step 7004.
[0249] For example, in the examples of FIGS. 33B and 33C, the
expanded plans are such that the expanded plan "ExPlan 6-1" and the
expanded plan "ExPlan 6-3", and the expanded plan "ExPlan 6-2" and
the expanded plan "ExPlan 6-4" have corresponding expanded rules
that differ from one another, and, in addition, have the same
processing content. Therefore, in a case where the
processing-target expanded plan is the expanded plan "ExPlan 6-1",
the plan presentation program 11300 identifies the expanded plan
"ExPlan 6-3" as the first consolidation-target plan in Step 7002,
deletes the expanded plan "ExPlan 6-3" from the expanded plan table
11960 in Step 7003, and updates the value in the expanded rule ID
1196A of the expanded plan table 11960 comprising the expanded plan
"ExPlan 6-1" to data showing the expanded rule "ExRule 3-1" and the
expanded rule "ExRule 4-1", for example, to "ExRule 3-1, ExRule
4-1". Furthermore, in a case where the processing-target expanded
plan is the expanded plan "ExPlan 6-2", the plan presentation
program 11300 identifies the expanded plan "ExPlan 6-4" as the
first consolidation-target plan in Step 7002, deletes the expanded
plan "ExPlan 6-4" from the expanded plan table 11960 in Step 7003,
and updates the value in the expanded rule ID 1196A of the expanded
plan table 11960 comprising the expanded plan "ExPlan 6-2" to
"ExRule 3-1, ExRule 4-1". The value in the expanded rule ID 1196A
of an existing expanded plan table 11960 is updated here, but a new
expanded plan table 11960 in which "ExRule 3-1, ExRule 4-1" is stored
in the expanded rule ID 1196A may be prepared.
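The consolidation of Steps 7002 and 7003 can be sketched as follows: expanded plans with the same processing content but different corresponding expanded rules are merged into one plan that lists all of those rules. The dictionary layout, the function name, and the VM and hypervisor names in the sample data are assumptions, as is the pairing of each plan with a single originating rule.

```python
def consolidate_same_content(plans):
    """Merge expanded plans that share processing content across expanded rules."""
    merged = {}
    for plan in plans:
        key = (plan["general_plan"], plan["content"])
        if key in merged:
            # Same content as an earlier plan: drop this plan and record its
            # expanded rule IDs on the plan that is kept.
            kept = merged[key]
            for rule in plan["rules"]:
                if rule not in kept["rules"]:
                    kept["rules"].append(rule)
        else:
            merged[key] = {**plan, "rules": list(plan["rules"])}
    return list(merged.values())


# Hypothetical expanded plans; content is (target VM, source HV, destination HV).
plans = [
    {"id": "ExPlan 6-1", "rules": ["ExRule 3-1"], "general_plan": "Plan 6",
     "content": ("VM1", "HV1", "HV2")},
    {"id": "ExPlan 6-2", "rules": ["ExRule 3-1"], "general_plan": "Plan 6",
     "content": ("VM1", "HV1", "HV3")},
    {"id": "ExPlan 6-3", "rules": ["ExRule 4-1"], "general_plan": "Plan 6",
     "content": ("VM1", "HV1", "HV2")},
    {"id": "ExPlan 6-4", "rules": ["ExRule 4-1"], "general_plan": "Plan 6",
     "content": ("VM1", "HV1", "HV3")},
]

for plan in consolidate_same_content(plans):
    print(plan["id"], ", ".join(plan["rules"]))
```

With this sample data, "ExPlan 6-3" and "ExPlan 6-4" are dropped, and the two remaining plans each carry "ExRule 3-1, ExRule 4-1".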
[0250] In Step 7004, the plan presentation program 11300 determines
whether or not there exists in the one or more expanded plans shown
by the one or more expanded plan tables 11960 stored in the
rule-plan information repository 11900 an expanded plan
(hereinafter referred to as "second consolidation-target plan"),
which has a general plan ID 11962 that is the same as the
processing-target expanded plan, that is, which has the same
general plan as a basis, and, in addition, has similar performance
information and has the same risk.
[0251] In a case where a second consolidation-target plan does not
exist (Step 7004: No), the plan presentation program 11300 advances
the processing to the next process without executing Step 7005.
Alternatively, in a case where a second consolidation-target plan
exists (Step 7004: Yes), the plan presentation program 11300
identifies from among the processing-target expanded plan and one
or more second consolidation-target plans an expanded plan
(hereinafter referred to as "best plan") having the best predicted
value for the post-plan execution performance value. Then, the plan
presentation program 11300 deletes from the expanded plan table
11960 every expanded plan, among the processing-target expanded plan
and the one or more second consolidation-target plans, that is not
the best plan, and updates the value in the expanded rule ID 1196A
of the expanded plan table 11960 (Step 7005).
[0252] For example, in the examples of FIGS. 33A and 33B, of the
expanded plan "ExPlan 1-1", the expanded plan "ExPlan 1-2", and the
expanded plan "ExPlan 1-3", which are expanded plans having the
same general plan as a basis, and, in addition, similar performance
information and the same risk, the expanded plan "ExPlan 1-1" is
the best plan in that the migration-target volume performance is
the best. Therefore, only the best plan "ExPlan 1-1" is left as-is,
and the other expanded plan "ExPlan 1-2" and expanded plan "ExPlan
1-3" are deleted. Also, of the expanded plan "ExPlan 6-1" and the
expanded plan "ExPlan 6-2", which are expanded plans having the
same general plan as a basis, and, in addition, similar performance
information and the same risk, the expanded plan "ExPlan 6-1" is
the best plan in that the target VM performance is the best.
Therefore, only the best plan "ExPlan 6-1" is left as-is, and the
other expanded plan "ExPlan 6-2" is deleted.
[0253] The range for judging that the performance information is
similar, for example, may be fixedly configured beforehand, such as
the I/O response times being within ±1 msec of one another, or may
be configured by the administrator via the input device 13000.
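The second consolidation (Steps 7004 and 7005) can be sketched as
follows. This is a minimal Python illustration, assuming that a lower
predicted I/O response time is "better" and that similarity is judged
against the best plan already kept. The dictionary keys, the sample
response times, and the risk labels are hypothetical, and the update
of the expanded rule ID 1196A is omitted for brevity.

```python
def consolidate_similar(plans, tolerance_msec=1.0):
    """Keep only the best plan among plans that share a general plan,
    have the same risk, and have similar predicted response times."""
    kept = []
    # Visit plans from best (lowest response time) to worst.
    for plan in sorted(plans, key=lambda p: p["response_msec"]):
        is_redundant = any(
            plan["general_plan"] == best["general_plan"]
            and plan["risk"] == best["risk"]
            and abs(plan["response_msec"] - best["response_msec"])
            <= tolerance_msec
            for best in kept
        )
        if not is_redundant:
            kept.append(plan)
    return kept


# Assumed sample values; they are not taken from FIGS. 33A and 33B.
sample = [
    {"id": "ExPlan 1-1", "general_plan": "Plan 1", "risk": "none", "response_msec": 1.0},
    {"id": "ExPlan 1-2", "general_plan": "Plan 1", "risk": "none", "response_msec": 1.5},
    {"id": "ExPlan 1-3", "general_plan": "Plan 1", "risk": "none", "response_msec": 1.8},
    {"id": "ExPlan 6-1", "general_plan": "Plan 6", "risk": "VM1", "response_msec": 2.0},
    {"id": "ExPlan 6-2", "general_plan": "Plan 6", "risk": "VM1", "response_msec": 2.4},
]
print([p["id"] for p in consolidate_similar(sample)])
```

The tolerance_msec parameter plays the role of the similarity range,
which may be fixed beforehand or supplied by the administrator.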
[0254] In Step 7005, the plan presentation program 11300 only
leaves the best plan, for example, the one expanded plan having the
fastest I/O response time, and deletes the other expanded plan(s),
but may leave as-is multiple expanded plans having good predicted
values for the post-plan execution performance values. The number
of expanded plans left after consolidation, for example, may be
fixedly decided beforehand, or may be configured by the
administrator via the input device 13000. Also, the number of
post-consolidation expanded plans may be decided so that all of the
expanded plans are capable of being displayed in the output screen.
The object of this processing is to prevent the plan selection
operation performed by the administrator from becoming troublesome
due to the display of a large number of similar expanded plans. For
example, rather than deleting an expanded plan, a method, which
displays only an expanded plan having a good post-plan execution
performance value and does not display another expanded plan, and
which switches between displaying and not displaying by clicking a
prescribed button, may be used.
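The alternative of hiding rather than deleting similar plans can be
sketched as follows. This is a minimal Python illustration; the
"visible" flag, the max_visible parameter, and the show_all switch
(standing in for the prescribed button) are hypothetical names, and a
lower response time is assumed to be the better predicted value.

```python
def mark_visibility(plans, max_visible=1):
    """Rank plans by predicted response time and flag the top ones as visible."""
    ranked = sorted(plans, key=lambda p: p["response_msec"])
    return [dict(p, visible=(i < max_visible)) for i, p in enumerate(ranked)]


def plans_to_display(marked, show_all=False):
    """Return the plan IDs to draw; show_all models the toggle button."""
    return [p["id"] for p in marked if show_all or p["visible"]]


# Assumed sample values for illustration only.
candidates = [
    {"id": "ExPlan 1-2", "response_msec": 1.5},
    {"id": "ExPlan 1-1", "response_msec": 1.0},
    {"id": "ExPlan 1-3", "response_msec": 1.8},
]
marked = mark_visibility(candidates, max_visible=1)
print(plans_to_display(marked))
print(plans_to_display(marked, show_all=True))
```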
[0255] After completing the processing (Steps 7002 through 7005)
for all the expanded plans, the plan presentation program 11300,
based on the information showing the failure cause and the
certainty factor acquired in Step 7001 and the expanded plan table
11960 stored in the rule-plan information repository 11900, creates
a plan presentation screen 9000 (refer to FIG. 36) and causes the
output device 14000 to display the created plan presentation screen
9000 (Step 7006). Thereafter, the plan presentation program 11300
ends the plan presentation process.
[0256] FIG. 36 is a constitutional diagram of an example of the
plan presentation screen related to Example 2.
[0257] The configuration of the plan presentation screen 9000
related to Example 2 is substantially the same as the configuration
of the plan presentation screen 9000 related to Example 1.
[0258] In Example 2, the expanded plans shown in FIGS. 33A, 33B,
and 33C are created in accordance with the plan creation process.
Specifically, the expanded plan "ExPlan 1-1", the expanded plan
"ExPlan 1-2", and the expanded plan "ExPlan 1-3" are created as
volume migration-related expanded plans, and the expanded plan
"ExPlan 6-1", the expanded plan "ExPlan 6-2", the expanded plan
"ExPlan 6-3", and the expanded plan "ExPlan 6-4" are created as VM
migration-related expanded plans. That is, a total of seven
expanded plans are created. According to the plan presentation
process related to Example 2 shown in FIG. 35, of the expanded plan
"ExPlan 1-1", the expanded plan "ExPlan 1-2", and the expanded plan
"ExPlan 1-3", only the expanded plan "ExPlan 1-1", which is the
plan having the best migration-target volume performance, is left
as-is, and the expanded plan "ExPlan 1-2" and the expanded plan
"ExPlan 1-3" are deleted. Also, according to the plan presentation
process, of the expanded plan "ExPlan 6-1", the expanded plan
"ExPlan 6-2", the expanded plan "ExPlan 6-3", and the expanded plan
"ExPlan 6-4", only the expanded plan "ExPlan 6-1", which is the one
plan via which the performance of the target VM will become the
best, is left as-is, and the expanded plan "ExPlan 6-2", the
expanded plan "ExPlan 6-3", and the expanded plan "ExPlan 6-4" are
deleted. This example shows that it is possible to consolidate the
expanded plan corresponding to the general rule in which the
failure cause is the storage apparatus 20000 and the expanded plan
corresponding to the general rule in which the failure cause is the
switch 60000. In this screen 9000, essentially, any display method
may be used as long as it makes it possible to illustrate the fact
that a plan with respect to failures having different root causes
is shared in common, and the display method is not limited to the
one shown in FIG. 36.
[0259] According to Example 2, it is possible to associatively
present a failure cause and a specific recovery plan for the
failure, and to check how many of the failure events related to the
failure cause are resolved by executing the plans, and by
displaying the result, to present
the administrator with a problem site, which latently remains as-is
even after plan execution, together with the reason therefor as
plan details information. This makes it possible for the
administrator to select an appropriate plan, and to readily
ascertain post-plan execution risk at plan selection time. Also, by
consolidating expanded plans via which identical or similar effects
are obtained into a single plan, it is possible to curb the
presentation of redundant plans, and, in a case where numerous
recovery plans exist with respect to a failure, it is possible to
reduce the number of plans presented to the administrator and to
lower the administrator's costs with respect to plan details
checking operations and plan selection operations.
EXAMPLE 3
[0260] Next, Example 3 will be explained. In the following
explanation, the focus of the explanation will be on the
differences from Example 1 and Example 2, and descriptions of
similar components, programs having similar functions, and tables
having similar items will be omitted.
[0261] FIG. 37 is a constitutional diagram of an example of a
management server related to Example 3.
[0262] The management server 10000 comprises the same components as
in Example 2, and, in addition, a maintenance information
management program 11110 is stored in the memory 11000. In
addition, a maintenance information management table 11980 is
stored in the rule-plan information repository 11900. The
maintenance information management table 11980 manages information
related to replacement and maintenance operations due to apparatus
renewal.
[0263] In the example shown in FIG. 37, the various programs and
tables are stored in the memory 11000, but these programs and
tables may be stored in either the storage device 12000 or another
storage medium (not shown in the drawing). In this case, the
processor 15000 reads a target program into the memory 11000 at
program execution time, and executes the program that was read. The
above-described programs and the above-described tables may be
stored in either the storage apparatus 20000 memory or the physical
server 30000 memory, and either the storage apparatus 20000 or the
physical server 30000 may execute the stored programs. Another
apparatus, such as a switch 60000, may store the above-described
programs and tables, and may execute the stored programs.
[0264] FIG. 38 is a constitutional diagram of an example of a
general plan table related to Example 3.
[0265] The general plan table 11930 manages a list of plans
available in the computer system. The general plan table 11930
comprises the fields of a general plan ID 11931, a plan 11932, and
a maintenance action 11933. A general plan ID, which is the
identifier of the general plan, is stored in the general plan ID
11931. Information showing the plans, which are available in the
computer system, is stored in the plan 11932. Information showing
whether or not the relevant plan is a plan related to a maintenance
schedule is stored in the maintenance action 11933. For example, a
plan in which physical hardware is to be replaced, such as the
storage port replacement of the general plan "Plan 8" and the
switch replacement of the general plan "Plan 9", is regarded as a
plan related to a maintenance schedule.
[0266] FIG. 39 is a constitutional diagram of an example of a
rule-plan association table related to Example 3.
[0267] The configuration of the rule-plan association table 11940
related to Example 3 is substantially the same as the configuration
of the rule-plan association table 11940 related to Example 1. In
Example 3, general rule "Rule 4" is taken up and explained as the
general rule, and general plans "Plan 6" and "Plan 9" are taken up
and explained as the general plans.
[0268] FIG. 40A is a first constitutional diagram of the expanded
plan table related to Example 3. FIG. 40B is a second
constitutional diagram of the expanded plan table related to
Example 3.
[0269] The configuration of the expanded plan table 11960 related
to Example 3 is substantially the same as the configuration of the
expanded plan table 11960 related to Example 1. The same as in
Example 1, the fields included in the plan target 11964 will differ
in accordance with the contents of the plan.
[0270] Regarding the expanded plan based on the general plan "Plan
9" of FIG. 40B, that is, the expanded plan related to replacing a
switch, the plan target 11964, for example, comprises the fields of
a replacement-target switch 1196E in which is stored the identifier
of the switch 60000, which is targeted for replacement (hereinafter
referred to as "replacement-target switch"), and a Cost 1196F in
which is stored data denoting the replacement cost. The plan
creation program 11100 acquires the identifier of the
replacement-target switch from the configuration information table
11810, and stores this identifier in replacement-target switch
1196E. The plan creation program 11100 acquires the replacement
cost from the cost 11986 of the maintenance information management
table 11980, and stores this value in the Cost 1196F. An example in
which only the replacement target identification information and the
cost information are stored has been described here, but other
information, for example, information showing approximately how much
time it will take to replace the switch 60000, may also be stored.
[0271] FIG. 41 is a constitutional diagram of an example of the
maintenance information management table related to Example 3.
[0272] The maintenance information management table 11980 manages
schedule information in accordance with which the administrator
performs a maintenance operation, such as hardware replacement.
This table 11980, for example, is created manually and inputted by
the administrator. The maintenance information management table
11980 comprises the fields of an apparatus 11981, an apparatus
component 11982, a replacement reason 11983, a replacement
date/time 11984, an affected service 11985, and a cost 11986. The
apparatus ID of an apparatus targeted for a maintenance operation
is stored in the apparatus 11981. The identifier of a device
targeted for a maintenance operation is stored in the apparatus
component 11982. Information showing the reason that a
replacement was scheduled is stored in the replacement reason
11983. Information showing the date and time of the replacement is
stored in the replacement date/time 11984. The identifier of a
service, which will be affected by replacing a device targeted for
the maintenance operation, is stored in the affected service 11985.
Information showing the cost of replacing the device targeted for
the maintenance operation is stored in the cost 11986.
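The fields of the maintenance information management table 11980, and
the way a replacement cost could be copied into the plan target of a
switch replacement plan, can be sketched as follows. This is a minimal
Python illustration; the MaintenanceEntry class, the function names,
the dictionary keys of the plan target, and every sample value are
hypothetical and are not taken from FIG. 41.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaintenanceEntry:
    apparatus: str             # apparatus 11981
    component: str             # apparatus component 11982
    replacement_reason: str    # replacement reason 11983
    replacement_datetime: str  # replacement date/time 11984
    affected_service: str      # affected service 11985
    cost: int                  # cost 11986


def replacement_cost(table, apparatus) -> Optional[int]:
    """Return the scheduled replacement cost for an apparatus, if any."""
    for entry in table:
        if entry.apparatus == apparatus:
            return entry.cost
    return None


def build_switch_replacement_target(table, switch_id):
    """Fill the plan target of a switch replacement plan (1196E, 1196F)."""
    return {
        "replacement_target_switch": switch_id,
        "cost": replacement_cost(table, switch_id),
    }


# Assumed sample row for illustration only.
maintenance_table = [
    MaintenanceEntry("Switch A", "", "Apparatus renewal",
                     "2016-01-10 09:00", "Service 1", 1000),
]
print(build_switch_replacement_target(maintenance_table, "Switch A"))
```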
[0273] FIG. 42 is a flowchart of the effect-risk presentation
process related to Example 3.
[0274] The processing of Steps 8001 through 8004 is the same
processing as the processing of Steps 6001 through 6004 in the
effect-risk presentation process related to Example 1, and as such,
explanations of these steps will be omitted.
[0275] In Step 8005, the plan presentation program 11300 references
the information in the maintenance action 11933 of the general plan
table 11930 and the maintenance information management table 11980, and
acquires information related to a maintenance schedule.
[0276] Thereafter, the plan presentation program 11300 creates a
plan details screen 9010 (refer to FIG. 43) based on the
information acquired in Steps 8002 through 8005, and causes the
output device 14000 to display the created plan details screen 9010
(Step 8006). Thereafter, the plan presentation program 11300 ends
the effect-risk presentation process.
[0277] FIG. 43 is a constitutional diagram of an example of the
plan details screen related to Example 3.
[0278] The configuration of the plan details screen 9010 related to
Example 3 is substantially the same as the configuration of the
plan details screen 9010 related to Example 1.
[0279] In the example shown in FIG. 43, the area 9015 showing the
state before plan execution shows the fact that the VM1 drive "E:",
the switch A, the data I/F "P3" of the storage B, the logical
volume "LV10" of the storage B, and the disk pool "POOL3" are
associated, the VM2 drive "F:", the switch A, the logical volume
"LV11" of the storage B, and the disk pool "POOL3" are associated,
and the VM3 drive "D:", the switch B, the logical volume "LV12" of
the storage B, and the disk pool "POOL4" are associated. The area
9015 showing the state before plan execution also shows the fact
that failure events have occurred in the VM2 drive "F:", the switch
A data I/F, and the storage B data I/F, and that a risk exists in
the VM1 drive "E:".
[0280] The area 9016 showing the state after plan execution shows
that the VM2, which had been running on the hypervisor 80000 of the
physical server B, is running on the hypervisor 80000 of the
physical server C, and that after plan execution, the VM2 drive
"F:", the switch B, the storage B data I/F, the logical volume
"LV11" of the storage B, and the disk pool "POOL3" are associated,
a failure event remains in the switch A data I/F and the storage B
data I/F, and a risk remains in the VM1.
[0281] In Example 3, the area 9017, which shows the risk details,
denotes that the risk site is the VM1 drive "E:", and that the
timing at which the risk could occur is when the power to the drive
"E:" is turned ON. The risk occurrence timing, for example, is
determined based on the information acquired in Step 8002 of FIG.
42. For example, the management server 10000 may acquire the
information in the power status 11832 of the VM configuration
management table 11830, may detect the fact that the VM1 power
status is OFF, may judge that the resumption of business operations
accompanying the powering ON of the VM could trigger the occurrence
of a risk, and may store the relevant information in the risk
occurrence timing 9019.
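The judgment of the risk occurrence timing described above can be
sketched as follows. This is a minimal Python illustration; the
function name, the dictionary standing in for the power status 11832
of the VM configuration management table 11830, and the wording of the
returned message are hypothetical.

```python
from typing import Optional


def risk_occurrence_timing(vm_name, power_status) -> Optional[str]:
    """Return a timing description if powering ON the VM could trigger a risk."""
    if power_status.get(vm_name) == "OFF":
        # A powered-off VM resumes business operations when powered ON,
        # which is treated here as the moment the risk could occur.
        return f"When {vm_name} is powered ON"
    return None


# Assumed sample power states for illustration only.
power_status = {"VM1": "OFF", "VM2": "ON"}
print(risk_occurrence_timing("VM1", power_status))
```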
[0282] A display area 9022 for showing the maintenance schedule,
for example, comprises the fields of an area 9023 in which is
displayed the identifier of either an apparatus or a device
targeted for replacement in a maintenance operation, an area 9024
in which is displayed a replacement reason, and an area 9025 in
which is displayed a replacement date/time. A replacement target,
which is displayed in the display area 9022 showing the maintenance
schedule, for example, is limited to either an apparatus or a
device in which a failure event has occurred, and this information
is acquired via the processing of Step 8005 in the effect-risk
presentation process of FIG. 42. Other information, for example, the
cost information shown in the maintenance information management
table 11980 of FIG. 41, may also be displayed in the display
area 9022. The administrator, who references the information
related to the maintenance schedule, for example, is able to check
the replacement date/time in accordance with the maintenance of
either the apparatus or device where the failure occurred, and,
along with taking actions with respect to the failure, to either
replace the switch ahead of schedule, or, since switch replacement
is near at hand, to select a VM migration-based plan as a temporary
measure even though a certain amount of risk remains.
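Limiting the maintenance schedule display to apparatuses or devices in
which a failure event has occurred can be sketched as follows. This is
a minimal Python illustration; the dictionary keys, the function name,
and the sample rows are hypothetical, and the three output fields
merely mirror the areas 9023, 9024, and 9025.

```python
def maintenance_rows_for_display(maintenance_table, failed_targets):
    """Select schedule rows whose apparatus or component has a failure event."""
    rows = []
    for entry in maintenance_table:
        if entry["apparatus"] in failed_targets or entry["component"] in failed_targets:
            rows.append({
                "target": entry["apparatus"],              # area 9023
                "reason": entry["replacement_reason"],     # area 9024
                "datetime": entry["replacement_datetime"], # area 9025
            })
    return rows


# Assumed sample rows for illustration only.
schedule = [
    {"apparatus": "Switch A", "component": "",
     "replacement_reason": "Apparatus renewal",
     "replacement_datetime": "2016-01-10 09:00"},
    {"apparatus": "Storage C", "component": "P1",
     "replacement_reason": "Apparatus renewal",
     "replacement_datetime": "2016-03-01 09:00"},
]
print(maintenance_rows_for_display(schedule, {"Switch A"}))
```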
[0283] In Example 3, the display area 9011, the display area 9017,
and the display area 9022 are displayed in the same screen, but the
display method is not limited thereto, and, for example, when a
diagram showing either the apparatus or the device configured in
the maintenance schedule of the display area 9011, for example,
switch A, is clicked, the display area 9022 may be newly displayed
as a separate screen. Or, when a plan displayed in the display area
9001 of the plan presentation screen 9000 is clicked, the display
area 9022 may be newly displayed as a separate screen. When a
diagram showing either an apparatus or a device, such as the
physical server 30000 or the storage apparatus 20000 in the display
area 9011 is clicked, the performance values of the relevant
apparatus or device before and after plan execution may be
displayed.
[0284] According to Example 3, it is possible to associatively
present a failure cause and a specific recovery plan for the
failure, and to check how many of the failure events related to the
failure cause are resolved by executing the plans, and by
displaying the result, to present
the administrator with a problem site, which latently remains as-is
even after plan execution, together with the reason therefor as
plan details information. This makes it possible for the
administrator to select an appropriate plan, and to readily
ascertain post-plan execution risk at plan selection time. Also, in
the case of a plan, which can be associated with a maintenance
schedule, a check can be performed in conjunction with a
maintenance schedule in the plan details screen 9010, making it
easier for the administrator to ascertain the importance of the
effects of the maintenance, thereby enabling the administrator to
select a plan that will reduce costs.
[0285] Needless to say, the present invention is not limited to the
examples explained hereinabove, and various changes are possible
without departing from the gist thereof.
REFERENCE SIGNS LIST
[0286] 10000 Management server
[0287] 20000 Storage apparatus
[0288] 30000 Physical server
* * * * *