U.S. patent application number 11/070942 was published by the patent office on 2006-09-07 for failure trend detection and correction in a data storage array.
This patent application is currently assigned to Seagate Technology LLC. Invention is credited to Robert Sherwood Gittins, Robert Michael Lester.
Application Number | 20060200726 11/070942 |
Family ID | 36945437 |
Filed Date | 2005-03-03 |
United States Patent Application | 20060200726 |
Kind Code | A1 |
Gittins; Robert Sherwood; et al. | September 7, 2006 |
Failure trend detection and correction in a data storage array
Abstract
Method and apparatus for detecting and correcting parametric
failure trends in a data storage array. A plurality of data storage
devices, such as hard disc drives, are arranged to form a
multi-device addressable memory array space. A controller controls
access to the array space, and is configured to accumulate
operational performance data from each of the devices into a
history log. A statistical analysis engine of the controller
analyzes the data to detect anomalous operation of the devices,
including a horizontal analysis of data across multiple devices.
The controller initiates a data storage device specific corrective
action event in response to the analysis, as required. The analysis
by the engine can be in addition to, or in lieu of, analysis by the
individual devices. A data request block requests additional data
samples for a given parameter, or requests additional parametric
data to further the analysis.
Inventors: |
Gittins; Robert Sherwood (Woodland Park, CO);
Lester; Robert Michael (Colorado Springs, CO) |
Correspondence
Address: |
Fellers, Snider, Blankenship, Bailey & Tippens, P.C.
Suite 1700
100 North Broadway
Oklahoma City
OK
73102-8820
US
|
Assignee: |
Seagate Technology LLC
|
Family ID: |
36945437 |
Appl. No.: |
11/070942 |
Filed: |
March 3, 2005 |
Current U.S.
Class: |
714/763 ;
714/E11.024; G9B/27.052 |
Current CPC
Class: |
G11B 2220/41 20130101;
G06F 11/0751 20130101; G11B 27/36 20130101; G06F 11/0727 20130101;
G06F 11/0793 20130101; G11B 2220/2516 20130101 |
Class at
Publication: |
714/763 |
International
Class: |
G11C 29/00 20060101
G11C029/00 |
Claims
1. An apparatus comprising a plurality of data storage devices
arranged to form a multi-device array space and a controller which
controls access to the array space, the controller configured to
accumulate operational performance data from each of the plurality
of data storage devices into a history log, to analyze said data to
detect anomalous operation of said devices, and to initiate a data
storage device specific corrective action event in relation to said
analysis.
2. The apparatus of claim 1, wherein each of the plurality of data
storage devices analyzes the operational performance data that is
accumulated into the history log associated with said device to
detect anomalous operation of said device.
3. The apparatus of claim 1, wherein the analysis of said data by
the controller comprises parametric data associated with multiple
ones of the plurality of the data storage devices.
4. The apparatus of claim 1, wherein the controller comprises a
statistical analysis engine which operates upon the data stored in
the data log to analyze said data.
5. The apparatus of claim 4, wherein the controller further
comprises a corrective action module which forwards an alarm
indication to a user of the system in response to detection of said
anomalous operation of said devices.
6. The apparatus of claim 4, further comprising a graphical user
interface in communication with the engine to facilitate
user-specified analysis by the engine upon the data accumulated in
the data log.
7. The apparatus of claim 4, further comprising a data request
block in communication with the engine which issues a request to at
least a selected one of the data storage devices to provide
additional data to the data log in response to the engine.
8. The apparatus of claim 1, wherein the data log is stored in the
array space established by the plurality of data storage
devices.
9. The apparatus of claim 1, wherein each of the plurality of data
storage devices is characterized as a hard disc drive comprising at
least one rotatable data storage medium accessed by a moveable
transducer.
10. An apparatus, comprising: a plurality of data storage devices
arranged to form a multi-device memory array space; and first means
for accumulating operational performance data from each of the
plurality of data storage devices, for performing an analysis of a
subset of said data associated with multiple ones of said devices,
and for providing an alarm indication to a user in response to
detection of an anomalous event as a result of said analysis.
11. The apparatus of claim 10, wherein at least one of the
plurality of data storage devices performs an analysis of the
accumulated operational performance data, and wherein the first
means operates in response to the analysis performed by the at
least one of the plurality of data storage devices.
12. The apparatus of claim 10, wherein the first means further
issues a data request command to at least one of the plurality of
data storage devices to supply additional data for accumulation and
analysis by the first means.
13. A method comprising: arranging a plurality of data storage
devices to form a multi-device memory array space; and providing a
controller which controls access to the array space, the controller
configured to accumulate operational performance data from each of
the plurality of data storage devices into a history log, to
analyze said data to detect anomalous operation of said devices,
and to initiate a data storage device specific corrective action
event in relation to said analysis.
14. The method of claim 13, further comprising a step of
configuring each of the plurality of data storage devices to
separately analyze the operational performance data that is
accumulated into the history log associated with said device to
detect anomalous operation of said device.
15. The method of claim 13, wherein the analysis of said data
during the providing step comprises an analysis of parametric data
associated with multiple ones of the plurality of the data storage
devices.
16. The method of claim 13, wherein the controller of the providing
step comprises a statistical analysis engine which operates upon
the data stored in the data log to analyze said data.
17. The method of claim 16, wherein the controller of the providing
step further comprises a corrective action module which forwards an
alarm indication to a user of the system in response to detection
of said anomalous operation of said devices.
18. The method of claim 16, wherein the controller of the providing
step further comprises a data request block in communication with
the engine which issues a request to at least a selected one of the
data storage devices to provide additional data to the data log in
response to the analysis performed by the engine.
19. The method of claim 13, wherein the data log of the providing
step is stored in the array space formed from the plurality of data
storage devices.
20. The method of claim 13, wherein each of the plurality of data
storage devices is characterized as a hard disc drive comprising at
least one rotatable data storage medium accessed by a moveable
transducer.
Description
FIELD OF THE INVENTION
[0001] The claimed invention relates generally to the field of data
storage systems and more particularly, but not by way of
limitation, to an apparatus and method for detecting and correcting
parametric failure trends in a data storage array.
BACKGROUND
[0002] Multi-device arrays (MDAs) are relatively large data space
storage systems comprising a number of data storage devices, such
as hard disc drives (HDDs), that are grouped together to provide an
inter-device addressable memory space. MDAs are increasingly used
in a wide variety of data intensive applications, web servers and
other network accessed systems.
[0003] Individual data storage devices can be equipped with
routines that monitor various operational parameters to provide
early failure trend detection capabilities. This allows a user to
take appropriate corrective action, such as reallocation or
replacement of the associated data storage device, prior to a
system failure event that adversely affects other portions of the
system.
[0004] While such device-level monitoring is operable, the continued
increase in the reliance upon and use of MDAs creates a continual need
for improvements in the manner in which failure trends can be analyzed
and system failure events can be avoided.
SUMMARY OF THE INVENTION
[0005] Preferred embodiments of the present invention are generally
directed to an apparatus and method for detecting and correcting
parametric failure trends in a data storage array.
[0006] In accordance with preferred embodiments, a plurality of
data storage devices, such as hard disc drives, are arranged to
form a multi-device addressable memory array space. A controller is
provided to control access to the array space.
[0007] The controller is configured to accumulate operational
performance data from each of the devices into a history log. A
statistical analysis engine of the controller analyzes the data to
detect anomalous operation of the devices, including a horizontal
analysis of data across multiple devices. The controller utilizes a
corrective action module to initiate a data storage device specific
corrective action event in response to the analysis, as
required.
[0008] The analysis by the engine can be in addition to, or in lieu
of, analysis by the individual devices. A data request block
requests additional data samples for a given parameter, or requests
additional parametric data to further the analysis. A graphical
user interface (GUI) reports alarm indications to a system user, as
well as facilitates user-specified data collection and
analyses.
[0009] These and various other features and advantages which
characterize the claimed invention will become apparent upon
reading the following detailed description and upon reviewing the
associated drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] FIG. 1 is an exploded view of a data storage device
constructed and operated in accordance with preferred embodiments
of the present invention.
[0011] FIG. 2 is a generalized functional block diagram of the
device of FIG. 1.
[0012] FIG. 3 illustrates relevant portions of a multi-device array
(MDA) formed from a plurality of data storage devices such as shown
in FIGS. 1 and 2.
[0013] FIG. 4 represents a network system utilizing a number of the
MDAs such as shown in FIG. 3.
[0014] FIG. 5 provides a generalized functional block diagram of
operation of a selected MDA/controller sub-system of FIG. 4 in
accordance with preferred embodiments of the present invention.
[0015] FIG. 6 shows a preferred format for the data log of FIG.
5.
[0016] FIG. 7 provides a flow of alternative statistical analysis
strategies carried out by the sub-system of FIG. 5.
[0017] FIG. 8 graphically illustrates a number of parametric data
sets to better set forth preferred operation of the sub-system of
FIG. 5.
DETAILED DESCRIPTION
[0018] FIG. 1 shows an exploded view of a data storage device 100.
The device 100 is preferably characterized as a 3.5 inch form
factor hard disc drive of the type used to store and retrieve
computerized data, but such is not limiting to the scope of the
claimed subject matter.
[0019] The device 100 includes a rigid, environmentally controlled
housing 102 formed from a base deck 104 and a top cover 106. A
spindle motor 108 is mounted within the housing 102 to rotate a
number of data storage media 110 at a relatively high speed.
[0020] Data are arranged on the media 110 in concentric tracks (not
shown) which are accessed by a corresponding array of data
transducing heads 112. The heads 112 (transducers) are supported by
an actuator 114 and moved across the media surfaces by application
of current to a voice coil motor, VCM 116. A flex circuit assembly
118 facilitates communication between the actuator 114 and control
circuitry on an externally mounted printed circuit board, PCB
120.
[0021] As shown in FIG. 2, the control circuitry preferably
includes an interface circuit 124 which communicates with a host
device using a suitable interface protocol. A top level processor
126 provides top level control for the device 100 and is preferably
characterized as a programmable, general purpose processor with
suitable programming to direct the operation of the device 100.
[0022] A read/write channel 128 operates in conjunction with a
preamplifier/driver circuit (preamp) 130 to write data to and to
recover data from the media 110. A servo circuit 132 provides
closed loop positional control for the heads 112.
[0023] It is contemplated that the processor 126 can include
programming routines to carry out failure trend detection during
operation of the device 100. As those skilled in the art will
recognize, various parameters associated with the operation of the
device 100 can be monitored over time, and variation in the values
of these parameters can signal the onset of degraded performance or
imminent failure. Parameters that can be monitored in this way
include, but are not limited to, read error rates, channel quality,
head bias current magnitudes, servo positioning times, spindle
motor speed, vibration levels, operational temperature levels, the
occurrence of thermal asperities or other grown defects on the
media, etc.
[0024] In one approach, preselected threshold levels for the
various parameters are established. When an associated threshold is
reached, the device 100 provides an alarm to the end user who can
then take appropriate corrective action to ensure system data
integrity, such as reallocation of the data stored by the device
and replacement of the failed device with a new unit.
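The threshold approach described above can be sketched in a few lines. This is a minimal illustration only, not the method of any particular device: the parameter names, the limit values, and the `check_thresholds` helper are all hypothetical.

```python
from typing import Dict, List

# Hypothetical preselected thresholds; real limits would be device specific.
THRESHOLDS: Dict[str, float] = {
    "read_error_rate": 1e-6,
    "temperature_c": 60.0,
}


def check_thresholds(sample: Dict[str, float],
                     thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of parameters whose value has reached its threshold."""
    return [name for name, limit in thresholds.items()
            if name in sample and sample[name] >= limit]


# One monitoring sample from a single device (illustrative values).
alarms = check_thresholds({"read_error_rate": 2e-6, "temperature_c": 41.5})
print(alarms)  # ['read_error_rate']
```

A fixed threshold of this kind reacts only once the limit is reached, which is one reason the trend-based analyses discussed later can be useful in addition.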
[0025] It is becoming increasingly common to incorporate multiple
sets of the devices 100 into a multi-device array (MDA), such as
generally represented at 140 in FIG. 3. The MDA 140 pools the data
storage capacity of the devices 100 to provide a single, relatively
large addressable memory space. Well-known RAID techniques are
preferably employed to distribute the recording of data across the
various devices 100.
[0026] The N devices 100 are arranged to communicate with a common
input/output block 142. A power supply block 144 and a battery
back-up supply 146 are included to meet the normal and standby
requirements of the MDA 140.
[0027] Although not depicted in FIG. 3, it will be understood that
the components are preferably arranged into a common housing so as
to provide a single plug-and-play unit which can be incorporated
into a rack or other system. Additional elements such as cooling
fans and interconnection backplanes are omitted for clarity of
illustration, and redundant sets of the components shown in FIG. 3
(e.g., two power supplies, two battery back-ups, etc.) are
preferably incorporated into the MDA 140 to enhance system
reliability and availability.
[0028] FIG. 4 illustrates a network 150 in which a number of MDAs
such as 140 are incorporated. Each MDA 140 is shown to have an
associated controller 152 which controls access to each respective
MDA 140. Each controller 152 preferably includes a relatively
powerful general purpose processor and a relatively large cache
memory space to control large scale data transfers with the MDA
140.
[0029] Although not shown, preferably two controllers 152 and two
MDAs 140 are operated in tandem at each location for redundancy.
The controllers 152 communicate with a number of host computers 154
through a fabric 156, which can comprise the Internet, a wide area
network, or other network connection system.
[0030] FIG. 5 illustrates a preferred operational architecture of
each controller/MDA combination from FIG. 4. As explained in
greater detail below, operational parametric data from each of the
devices 100 in the MDA 140 are accumulated by the controller 152
into a data log 160.
[0031] A statistical analysis engine 162 analyzes the data and,
when appropriate, initiates a data storage specific corrective
action event using a corrective action module 164. The module 164
interfaces with a GUI 166 (graphical user interface) to provide
visual and/or audible alarm indicators and other outputs to a user.
The GUI 166 further allows access to the engine 162 to initiate
user-specific data requests and analyses. The engine 162 further
provides parametric monitoring data requests via command block 168
to adjust the types and/or sampling frequency of parametric data
supplied to the log 160, as required.
[0032] The log 160 is preferably stored in a designated portion of
the non-volatile memory space provided by the devices 100 in the
MDA 140. From here, the entire log or selected portions thereof are
uploaded into the cache memory space of the controller 152 to allow
access by the engine 162. Alternatively, a separate memory space
(including a dedicated array) accessible by the controller 152 can
be provided to store the parametric data from the devices
100.
[0033] It is contemplated that the log 160 can take any number of
forms, depending on the requirements of a given application. A
particularly useful format is generally set forth by FIG. 6, which
provides individual parametric data from each device 100 in
separate "columns" using a common index (such as elapsed time).
[0034] Thus for example, the column for device 1 can comprise all
of the data for a single parameter (e.g., channel quality) in
historical sequence over time, with later obtained CQ measurements
appended at the end. Similar data are provided in adjacent columns
for each of the remaining devices 2-N. Separate "sheets" can be
formed to track each of the different operational parameters being
monitored.
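One way to picture the column-and-sheet layout is as a nested mapping. The sketch below is an assumption-laden illustration rather than the actual format of FIG. 6: the `DataLog` class and its method names are hypothetical, and list position stands in for the common index (such as elapsed time).

```python
from typing import Dict, List


class DataLog:
    """Hypothetical log: one 'sheet' per parameter, one 'column' per device."""

    def __init__(self, device_ids: List[int]) -> None:
        self.device_ids = list(device_ids)
        # sheets[parameter][device_id] -> samples in historical order
        self.sheets: Dict[str, Dict[int, List[float]]] = {}

    def _sheet(self, parameter: str) -> Dict[int, List[float]]:
        # Create an empty column for every device the first time a
        # parameter is seen.
        return self.sheets.setdefault(
            parameter, {dev: [] for dev in self.device_ids})

    def append(self, parameter: str, device_id: int, value: float) -> None:
        """Append a later-obtained measurement at the end of the column."""
        self._sheet(parameter)[device_id].append(value)

    def column(self, parameter: str, device_id: int) -> List[float]:
        """Return a copy of one device's history for one parameter."""
        return list(self._sheet(parameter)[device_id])


log = DataLog(device_ids=[1, 2, 3])
log.append("channel_quality", 1, 0.91)
log.append("channel_quality", 1, 0.89)
print(log.column("channel_quality", 1))  # [0.91, 0.89]
```

Keeping one sheet per parameter means every column on a sheet shares units and scale, which simplifies comparisons between devices.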
[0035] Other constructs for the data log 160 are readily
envisioned, however, including formats that group all or related
subsets of correlated parameters into the same table, or that
provide a different sheet per device. Regardless, the log
represents historical parametric data across all of the relevant
devices 100 in the MDA 140.
[0036] This facilitates the execution of a vertical analysis by the
engine 162 upon data associated with a single one of the devices
100, as represented by vertical data block 170, as well as a
horizontal analysis by the engine 162 across multiple devices, as
represented by horizontal data block 172.
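The distinction between the two data blocks can be illustrated with a single sheet of the log held as a mapping from device number to samples. The function names and sample values here are hypothetical, and list position is assumed to serve as the common index.

```python
from typing import Dict, List

# One sheet of the log: device id -> samples ordered by the common index.
Sheet = Dict[int, List[float]]


def vertical_block(sheet: Sheet, device_id: int) -> List[float]:
    """History of one device over time (compare vertical data block 170)."""
    return list(sheet[device_id])


def horizontal_block(sheet: Sheet, index: int) -> Dict[int, float]:
    """Values of every device at one index (compare horizontal data block 172)."""
    return {dev: col[index] for dev, col in sheet.items() if index < len(col)}


sheet: Sheet = {
    1: [0.9, 0.9, 0.8],
    2: [0.9, 0.8, 0.9],
    3: [0.9, 0.7, 0.5],
}
print(vertical_block(sheet, 3))    # [0.9, 0.7, 0.5]
print(horizontal_block(sheet, 2))  # {1: 0.8, 2: 0.9, 3: 0.5}
```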
[0037] A hierarchy of potential analysis modes is thus envisioned,
as set forth by FIG. 7. In some preferred embodiments, the
individual devices 100 continue as originally configured to carry
out separate monitoring of selected parameters during operation.
This is signified by block 174. Such operation is separately
carried out by the local top level processor 126 (FIG. 2) in each
device.
[0038] In this example, when a particular parameter is found to be
out-of-bounds, an alarm indication can be transmitted via the local
I/F block 124 to the MDA I/O block 142, which notifies the
controller 152. The controller 152 takes the appropriate action,
such as logging the event or notifying the user via the corrective
action module 164 and GUI 166. Depending upon the severity of the
event, the appropriate corrective action may be taken at the device
level, by the device in response to a specific command control
input by the controller, or by user intervention.
[0039] In addition to the foregoing operation, all of the
parametric data collected and analyzed by the individual devices
100 are preferably forwarded to the data log 160 to accumulate the
historical data into the log.
[0040] Another level of analysis provided in FIG. 7 is the
aforementioned vertical analysis by the engine 162, depicted at
block 176. Using the above example where the individual devices 100
continue to perform in situ parametric analysis, this provides a
second level of verification capability. That is, the engine 162
can carry out the same analysis in tandem with the local processor
126, enhancing system reliability and reducing false positives.
[0041] The engine 162 can alternatively rely upon the local
processors 126 to serve as first pass filter screens, so that
alarms set by the individual devices 100 serve as inputs to the
engine 162 to commence investigation and analysis at the controller
level. In this case, the engine 162 applies advanced statistical
analyses to the existing data, and may use heuristic methods to
request additional data not previously supplied by the associated
device 100 (i.e., greater frequency of samples, reporting of other
available but not normally reported parameters, etc.) in order to
evaluate the situation and arrive at a decision with regard to
whether a failure trend has in fact been detected and what
corrective action, if any, should be taken.
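A request for additional data following a device-level alarm might be modeled as below. The description does not specify a command format, so the `DataRequest` fields, the `escalate` helper, and the default speed-up factor and minimum interval are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataRequest:
    """Hypothetical request sent to one device via the command block."""
    device_id: int
    parameters: List[str]
    sample_interval_s: float


def escalate(device_id: int,
             alarmed_parameter: str,
             current_interval_s: float,
             extra_parameters: List[str],
             speedup: float = 4.0,
             min_interval_s: float = 1.0) -> DataRequest:
    """Ask for more frequent samples and for normally unreported parameters."""
    # Sample more often, but never faster than the assumed minimum interval.
    interval = max(current_interval_s / speedup, min_interval_s)
    # Keep the alarmed parameter first and avoid listing it twice.
    parameters = [alarmed_parameter] + [
        p for p in extra_parameters if p != alarmed_parameter]
    return DataRequest(device_id, parameters, interval)


request = escalate(3, "read_error_rate", 60.0, ["temperature_c"])
print(request)
```

The lower bound on the interval is there so that repeated escalation cannot flood the device with monitoring traffic.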
[0042] In another alternative embodiment, the localized parametric
optimization at the individual device level is eliminated, such
being carried out instead by the more powerful engine 162. In this
case the devices 100 merely upload the associated run-time
parametric data to the log with no or minimal analysis thereof.
[0043] An advantage of this particular approach is the
simplification of the design and programming of the individual
devices, since the power and resources required for such analysis
can be eliminated from the design. It will be appreciated by those
skilled in the art that such simplifications can result in a not
insignificant cost savings per device, which when multiplied by the
sheer volume of devices incorporated into the MDAs can result in
significant cost savings and system availability advances.
[0044] Alternatively, the freeing of system resources at the
individual device level on the analysis end can be used to budget
greater amounts of data (more samples as well as greater numbers of
parameters) to the data log 160 by the individual devices.
[0045] Accordingly, in this alternative approach the vertical
analysis represented by block 176 is envisioned as replacing the
localized parametric analysis performed by the individual devices
100 (block 174). As before, because of the greater processing power
of the controller 152, more complex and computationally intensive
statistical processes can be applied to the data than are presently
available. Moreover, detection of an initial trend can result in
tuned data requests via block 168 to the associated device 100 for
more data to enhance the analysis.
[0046] Block 178 in FIG. 7 depicts the aforementioned horizontal
analysis across multiple devices 100 in the MDA 140. This level of
analysis is preferably performed in addition to the device-level and
vertical analyses of blocks 174 and/or 176, such as on a time or parameter
basis. It will be noted that the horizontal analysis of block 178
involves performing an analysis on at least a subset of the data in
the history log 160, with the subset associated with at least
multiple ones of the devices 100 in the MDA 140 (i.e., spread
across multiple devices, or all of the devices in the array as
required).
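As one possible form of horizontal analysis, the sketch below compares the latest value from each device against the group using the median and the median absolute deviation (MAD). The description does not name a specific statistic; this robust outlier test, the `horizontal_outliers` name, and the 3.5 cutoff are swapped in here purely as an illustration.

```python
from statistics import median
from typing import Dict, List


def horizontal_outliers(latest: Dict[int, float],
                        cutoff: float = 3.5) -> List[int]:
    """Return the ids of devices whose value deviates strongly from the group.

    Uses a modified z-score based on the median absolute deviation (MAD).
    `latest` must contain at least one device.
    """
    values = list(latest.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to scale against; this sketch reports nothing in that case.
        return []
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    return [dev for dev, v in latest.items()
            if 0.6745 * abs(v - center) / mad > cutoff]


# Error-rate readings at one index for devices 1 through 4 (illustrative).
print(horizontal_outliers({1: 1.0, 2: 1.1, 3: 0.9, 4: 5.0}))  # [4]
```

A median-based test is used instead of a mean-based one because a single failing device would otherwise shift the reference it is being compared against.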
[0047] User-specified queries and analyses initiated through the
GUI 166 are depicted at block 180. It will be noted that the
various blocks in FIG. 7 can be utilized singly or in combination,
and the output of one can automatically trigger the execution of
another.
[0048] FIG. 8 illustrates one manner in which the analysis blocks
can be advantageously utilized. FIG. 8 provides a generic series of
parametric history curves 182, 184, 186 and 188, graphically
plotted against an index x-axis 190 and a common amplitude y-axis
192. It will be recognized that graphical depiction of the
parameter sets is not necessarily required by the engine 162 in
order to carry out the associated processes, but such graphs
facilitate the present discussion and can readily be provided to
the user via the GUI 166, as desired.
[0049] In a first example, it will be contemplated that the curves
182, 184, 186 and 188 represent data for each of the devices 1, 2,
3 and N respectively associated with a particular parameter, in
this case, error rate. The data are represented such that lower
values are "better" and higher values are "worse," although such is
merely one available formulation. Associated baseline values are
denoted via broken lines.
[0050] It can be seen that a significant upward trend in error rate
for device N (denoted locally at 194) can be readily detected,
either by trend analysis (moving average, etc.) or via cross-over
of an associated threshold (not shown).
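Trend detection by moving average can be sketched as follows. The window length, the tolerance above the baseline, and the function names are illustrative assumptions, and higher values are treated as worse, as in the example above.

```python
from typing import List, Optional


def moving_average(samples: List[float], window: int) -> List[float]:
    """Simple trailing moving average; one output per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(samples[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(samples))]


def first_upward_trend(samples: List[float],
                       baseline: float,
                       window: int = 3,
                       tolerance: float = 0.2) -> Optional[int]:
    """Index of the first sample whose moving average exceeds the baseline
    by more than `tolerance` (as a fraction), or None if it never does."""
    limit = baseline * (1 + tolerance)
    for offset, average in enumerate(moving_average(samples, window)):
        if average > limit:
            return offset + window - 1
    return None


error_rate = [1.0, 1.0, 1.1, 0.9, 1.0, 1.4, 1.8, 2.3]
print(first_upward_trend(error_rate, baseline=1.0))  # 6
```

Averaging over a window trades detection latency for noise immunity: the isolated rise at index 5 is not reported until the following sample confirms it.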
[0051] An increase in error rate in and of itself does not
necessarily suggest a particular cause, but does allow immediate
remedial corrective action to be taken, such as reallocation of the
affected data, etc. so as to minimize the effects of the trend upon
system performance. Further monitoring and diagnostics, however,
can take place to isolate one or more causes, leading to
elimination of the problem from the system. Exemplary corrective
actions include decommissioning of a particular head/media
combination, substitution of a particular device for a standby
"spare" within the MDA, application of a different RAID or ECC
level, performance of routine scheduled maintenance, etc.
[0052] Continuing with this example, it will be noted that
analyzing the data across multiple devices within the MDA 140
provides further important information with regard to this event,
namely, that only device N is presently experiencing the localized
increase in error rate and the other devices are apparently not
affected within the applicable time period. In other words, even at
this point it appears that the failure event is isolated to the
device N.
[0053] The reader may note that the same knowledge would appear to
be available simply relying upon the separate, individual device
level analysis of block 174, but this is not the case; the failure
of any of the other devices in the array to identify an
out-of-bounds condition trend is not the same as knowing globally
what the specific data are for each of the devices at the same
time. Accordingly, the unified data log approach provides superior
analysis and corrective action operations even when the data event
is isolated to a single device, and even when the same level of
analysis is performed as would be performed at the individual
device level.
[0054] Continuing with another example using FIG. 8, it will now be
contemplated that each of the curves 182, 184, 186, 188 represent
different parameters such as, for example, channel quality, servo
qualification time, rotational vibration and off-track errors,
respectively, for the same or different devices. In this case,
inter-parametric correlations such as at 196 and 198 can be
identified, allowing further insight into the inter-dependency of
respective parameters. Time lag relationships can also be
established such as, for example, the decrease at 198 inducing the
corresponding increase at 194. The identification of such
relationships can better isolate the true cause of a particular
event.
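A time lag relationship between two parameter histories can be estimated by correlating one series against shifted copies of the other. The sketch below uses the Pearson correlation coefficient as one plausible measure; the description does not specify a statistic, and the function names and data are hypothetical. Both series are assumed to have the same length and share the common index.

```python
from math import sqrt
from typing import List, Optional, Tuple


def pearson(x: List[float], y: List[float]) -> Optional[float]:
    """Pearson correlation, or None if it is undefined for the inputs."""
    n = len(x)
    if n < 2 or n != len(y):
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / sqrt(sxx * syy)


def best_lag(cause: List[float], effect: List[float],
             max_lag: int) -> Tuple[int, float]:
    """Lag (in samples) at which cause[t] correlates most strongly, in
    magnitude, with effect[t + lag], together with that correlation."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        r = pearson(cause[:len(cause) - lag], effect[lag:])
        if r is not None and abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# A drop in one parameter followed two samples later by a rise in another.
falling = [5, 5, 5, 2, 1, 1, 1, 1, 1, 1]
rising = [1, 1, 1, 1, 1, 4, 5, 5, 5, 5]
lag, r = best_lag(falling, rising, max_lag=4)
print(lag, round(r, 3))  # 2 -1.0
```

A strong correlation at a nonzero lag is only a hint about which parameter changed first; it does not by itself establish that one event caused the other.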
[0055] For example, it might be determined that the device
associated with curve 184 (device 2) is inducing the error in curve
188 (device N) by way of acting upon the device represented by
curve 186 (device 3). Thus, adjustment or replacement of device 2
would resolve the operational difficulties experienced by devices 3
and N, and so on.
[0056] It will now be appreciated that the preferred embodiments of
the present invention as set forth herein present advantages over
the prior art. Using the data log 160 to accumulate historical data
across a number of the devices 100 can provide cost savings and the
freeing of system resources, deeper and global analysis of the
parametric data on a per device basis, and analysis of the data
across multiple devices.
[0057] For purposes of the appended claims, the recited first means
will be understood to correspond to the controller structure set
forth in FIG. 5, with the engine configured to carry out horizontal
analyses as depicted in FIGS. 6 and 7.
[0058] It is to be understood that even though numerous
characteristics and advantages of various embodiments of the
present invention have been set forth in the foregoing description,
together with details of the structure and function of various
embodiments of the invention, this detailed description is
illustrative only, and changes may be made in detail, especially in
matters of structure and arrangements of parts within the
principles of the present invention to the full extent indicated by
the broad general meaning of the terms in which the appended claims
are expressed. For example, the particular elements may vary
depending on the particular control environment without departing
from the spirit and scope of the present invention.
[0059] In addition, although the embodiments described herein are
directed to a multi-device array that employs a number of hard
disc drives to present a common addressable memory space, it will
be appreciated by those skilled in the art that the claimed subject
matter is not so limited and various other data storage systems,
including optical based and solid state data storage devices, can
readily be utilized without departing from the spirit and scope of
the claimed invention.
* * * * *