U.S. patent application number 13/650827 was filed with the patent office on 2016-04-14 for automated upgrading method for capacity of it system resources.
This patent application is currently assigned to CAPLAN SOFTWARE DEVELOPMENT S.R.L.. The applicant listed for this patent is CAPLAN SOFTWARE DEVELOPMENT S.R.L.. Invention is credited to Paolo Cremonesi, Kanika Dhyani, Stefano Visconti.
Application Number | 20160105327 13/650827 |
Document ID | / |
Family ID | 43414796 |
Filed Date | 2016-04-14 |
United States Patent
Application |
20160105327 |
Kind Code |
A9 |
Cremonesi; Paolo ; et
al. |
April 14, 2016 |
AUTOMATED UPGRADING METHOD FOR CAPACITY OF IT SYSTEM RESOURCES
Abstract
Embodiments provide a method for performing an automatic
execution of a Box and Jenkins method for forecasting the behavior
of said dataset. The method may include pre-processing the dataset
including providing one or more missing values to the dataset,
removing level discontinuities and outliers, and removing one or
more last samples from the dataset, obtaining a trend of the
pre-processed dataset including identifying and filtering the trend
out of the dataset based on a coefficient of determination
methodology, detecting seasonality to obtain a resulting stationary
series including computing an auto correlation function of the
dataset, repeating the detecting step on an aggregate series of a
previous dataset, and removing detected seasonality based on a
seasonal differencing process, and modeling the resulting
stationary series under an autoregressive-moving-average (ARMA)
model.
Inventors: |
Cremonesi; Paolo; (Milano,
IT) ; Dhyani; Kanika; (Milano, IT) ; Visconti;
Stefano; (Brenta, IT) |
|
Applicant: |
Name | City | State | Country | Type |
CAPLAN SOFTWARE DEVELOPMENT S.R.L. | Milano | | IT | |
Assignee: |
CAPLAN SOFTWARE DEVELOPMENT
S.R.L.
Milano
IT
|
Prior
Publication: |
|
Document Identifier |
Publication Date |
|
US 20130041644 A1 |
February 14, 2013 |
|
|
Family ID: |
43414796 |
Appl. No.: |
13/650827 |
Filed: |
October 12, 2012 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
PCT/IB2011/051650 | Apr 15, 2011 | |
13650827 | | |
Current U.S.
Class: |
706/21 ;
703/13 |
Current CPC
Class: |
H04L 41/142 20130101;
G06F 11/3452 20130101; G06N 5/022 20130101; H04L 41/147 20130101;
G06Q 10/06 20130101; H04L 43/04 20130101; G06N 5/047 20130101 |
International
Class: |
H04L 12/24 20060101
H04L012/24; G06N 5/04 20060101 G06N005/04; G06N 5/02 20060101
G06N005/02 |
Foreign Application Data
Date | Code | Application Number |
Apr 15, 2010 | IT | PCT/IT2010/000165 |
Claims
1. A method for capacity of IT system resources including
monitoring over time a signal representing a capacity of an IT
system resource, collecting a dataset of said signal, analyzing
said dataset based on a prediction method to forecast a behavior of
said dataset, and upgrading said capacity at time t-n including
allocating additional resource to the IT system, when said behavior
shows that at time t said resource reaches a threshold, the
prediction method comprising: performing an automatic execution of
a Box and Jenkins method for forecasting the behavior of said
dataset including: (a) pre-processing the dataset, including
providing one or more missing values to the dataset, removing level
discontinuities and outliers, and removing one or more last samples
from the dataset, (b) obtaining a trend of the pre-processed
dataset including identifying and filtering the trend out of the
dataset based on a coefficient of determination methodology, (c)
detecting seasonality to obtain a resulting stationary series
including computing an auto correlation function of the dataset,
repeating the detecting step on an aggregate series of a previous
dataset, and removing detected seasonality based on a seasonal
differencing process, (d) modeling the resulting stationary series
under an autoregressive-moving-average (ARMA) model, wherein said
prediction method to forecast said behavior at time t is computed
based on execution of steps (b)-(d) in reverse order.
2. A computer readable medium including computer-readable
instructions loadable into an internal memory of a computer having
one or more processors, comprising instructions for performing the
steps of claim 1 when said instructions are executed by the one or
more processors.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to a prediction method for
capacity planning of IT system resources. In particular, it relates
to an automated prediction method.
BACKGROUND
[0002] Capacity planning is crucial for the efficient management of
available resources. In the context of IT infrastructure, it is the
science of estimating the space, computer hardware, software and
connection resources that will be needed at some time in the
future. The aim of a capacity planner is therefore to find the most
cost-efficient solution by determining appropriate tradeoffs, so
that the needed technical capacity/resources can be added in time
to meet the predicted demand while making sure that the resources
do not go unused for long periods of time. In other words, hardware
must be upgraded or updated at the right point in time, so as to
cope with the demand without anticipating the upgrade too much, and
so as to budget upgrading costs correctly.
[0003] Time series analysis is a vital process for predicting how
major aspects of various economic and social processes evolve over
time. It has long been applied extensively to predicting the growth
of key business activities, for instance the rise and fall of stock
prices and the determination of market trends, amongst others. Due
to the rising need to optimize IT infrastructure so as to offer
better services while minimizing the cost of buying and maintaining
that infrastructure, there is a growing need to develop advanced
methods that automatically trigger hardware upgrading or add-on
processes.
[0004] Time series analysis applied to an IT infrastructure is
based on collecting or sampling data related to signals issued by
monitored hardware, so as to build the historical behaviour and
hence estimate the future points of the model. This analysis,
projected in time, is apt to supply specific information for
establishing when and how said hardware or software resource will
require upgrading or substitution. Upgrading of a certain resource
in the IT infrastructure, for example one intended for a specific
task, may also occur as an automatic re-allocation of resources
(for example memory banks, disk space, CPU, ...) from another
system provisionally allocated to a different task: in such a case
the entire upgrading process can be carried out in a completely
automatic mode.
[0005] The same analysis supplies information about the occurrence
of events, errors on prediction bands, and the point in time when
given hardware changes should be made or when the given
infrastructure will break down.
[0006] As an example, based on the past behaviour of entities such
as the number of accesses to, or transactions in, a web site, a
time series analysis can help minimize user response time by
predicting future hardware requests. This constitutes a simple
capacity planning situation in a demand-supply scenario, where the
capacity planner needs to strike a balance between the amount of
hardware infrastructure to be installed on the basis of the
expected number of users and the loss of profit caused by slow web
access.
[0007] One of the algorithms most widely employed in the field of
time series prediction is the well-known Box and Jenkins prediction
algorithm (see, for example, G. E. P. Box and G. M. Jenkins, Time
Series Analysis: Forecasting and Control. San Francisco, Calif.:
Holden-Day, 1976 and J. G. Caldwell, (2007, February). Mathematical
forecasting using the Box-Jenkins methodology, available online at
www.foundationwebsite.org.); this algorithm is able to operate
reasonably well under any condition, regardless of the specific
domain in which it is used. Typically, tuning this algorithm to
supply good results for a specific application field requires a
certain amount of manual intervention to select a number of tuning
parameters, based on visual observation of the historical behaviour
of the specific acquired time series. Of course, this way of
proceeding is not suitable for completely automating the upgrading
process.
[0008] An object of the present invention is hence to supply a
method for hardware upgrading based on a robust time series
prediction in the domain of capacity planning of business and
workload performance metrics in IT infrastructure, such as business
drivers, technical proxy, CPU, memory utilization etc. To achieve
this goal, it is desired to develop a completely automated time
series prediction method. Having an automated method for
performance data has a two-fold advantage: (i) because large
volumes of data with constantly changing physical characteristics
need to be analyzed regularly, automated reading of data, updating
of internal parameters and thorough, extensive analysis are
imperative; (ii) human intervention in the time series prediction
process always has some drawbacks, as capacity planners are
engineers who generally lack the deep mathematical and statistical
knowledge that time series forecasting experts have.
SUMMARY OF THE INVENTION
[0009] The above object is obtained through a method as defined in
its essential characteristics in the attached claims.
[0010] In particular, the method specified relies on a forecasting
algorithm based on the Box and Jenkins prediction algorithm with
added functionalities which, on the basis of proper identification
of characteristic properties of the data set, is able to boost the
accuracy of prediction and of the hardware upgrading process.
[0011] The algorithm is completely automated and is designed for an
unskilled capacity planner, requiring no prior knowledge in this
area and no manual intervention. To this end, in addition to all
the other phases of the algorithm, its main core, comprising the
Box and Jenkins prediction algorithm, has also been completely
automated.
[0012] This algorithm is well suited and tailored to time series
coming from the workload and performance domains in IT systems,
since they exhibit a rich internal behaviour, such as long range
trends, long term and short term seasonalities and dynamics that
evolve independently of each other, representing the different
physical contributions to the final structure of the data.
[0013] For this specific domain of data, the method of the
invention has a clear edge over other popular forecasting methods
like Robust Linear regression (P. J. Rousseeuw and A. M. LeRoy.
Linear regression and outlier detection. Hoboken: Wiley, 2003),
which can only capture long term trends without giving any further
insight on smaller granularity data, Holt-Winters (P. S. Kalekar,
"Time series forecasting using Holt-Winters exponential smoothing",
December 2004), which provides a prediction based on the trend and
seasonality in the data but is not robust to anomalies, the Random
Walk algorithm (N. Guillotin-Plantard and R. Scott, Dynamic random
walks: theory and applications. Oxford, UK: Elsevier, 2006),
especially used for stock forecasting, which is suited only for a
short-range perspective as it predicts on the basis of the last
observation and does not take the general trend into account, and
the Moving Average set of algorithms (P. J. Brockwell and R. A.
Davis, Time series: theory and methods. 2nd ed. New York: Springer,
1991), which assume a relation between the short and long term
perspective by defining a user threshold and generally work well
only if seasonalities in the data are regular and cyclic.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] Further features and advantages of the system according to
the invention will in any case be more evident from the following
detailed description of some preferred embodiments of the same,
given by way of example and illustrated in the accompanying
drawings and tables, wherein:
[0015] FIG. 1 is a block diagram showing the main steps of the
prediction method according to the invention;
[0016] FIG. 2 is an exemplary time series representing the active
memory of an IT machine;
[0017] FIG. 3 is a plot showing forecast and prediction bands of a
test series.
[0018] FIG. 4 is a cross validation on workload series showing
comparison amongst different algorithms.
[0019] FIG. 5 is a cross validation on performance series showing
comparison amongst different algorithms.
[0020] FIG. 6 (Table 1) is a table of results for workload data
showing comparison amongst different algorithms.
[0021] FIG. 7 (Table 2) is a table of results for performance data
showing comparison amongst different algorithms.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
[0022] The Box and Jenkins approach is a complex forecasting method
which has been known since 1976. This framework is based on the
assumption that each time series x(t) can be modelled as
follows:
x(t)=f(T(t);P(t);S(t)); (1)
where T(t) represents the trend, P(t) the periodic components and
S(t) a stationary process.
[0023] A major pitfall of this algorithm is that it requires, as
mentioned above, substantial manual work and deep statistical
knowledge. For instance, this algorithm depends on determining the
parameters of the ARMA model, which cannot be trivially inferred
from the data. Until now, this was done by iterating over the
orders p and q, leading to a high number of possible combinations;
hence it is highly demanding in terms of computing resources and
does not allow a solution to be obtained in real time.
[0024] Due to the peculiarity of IT systems data, in addition to
completely automating the prediction process, several components
have been added to the Box and Jenkins algorithm to give an
informed, intelligent prediction. These procedures identify some
characteristic behaviours of the series, which are known as
features and are selectively used according to the invention. A
list of 35 important features has already been suggested in the
art; these can be broadly classified as domain knowledge,
functional form, context knowledge, causal forces, trend,
seasonality, uncertainty and instability. Some of the listed
features cannot be automated, while others are not suited to
performance and workload data. After an in-depth analysis,
according to the invention a subset of 20 features has been chosen
and then merged into 6 main characteristics to be detected. Those
characteristics have been split and allotted to two different
treatment stages of the process: some of them to a pre-processing
step, and the others to the subsequent sections.
[0025] In particular, the method according to the invention, after
having collected (over time) a dataset of hardware performance
signals coming from a monitored IT system, relies on a treatment
process of said dataset which is divided into 5 main phases (see
FIG. 1): (a) pre-processing of the data, (b) identification of
trend, (c) seasonal components analysis, (d) ARMA modelling and (e)
final prediction of the time series.
(a) Pre-Processing Stage
[0026] As the name suggests, in this stage the algorithm prepares
the dataset for further analysis. This preliminary step is crucial
to the accuracy of the final prediction of the dataset, as
resolving anomalies in the data leads to a cleaner exploration of
the series structure. In the following, some of the features that
strongly characterize each time series, and that are used to
compute an informed pre-processing of the data, are described.
(a.1) NA values
[0027] In real applications, i.e. in IT systems resources, many
series contain missing values (the abbreviation "NA" stands for
"Not Available"). These gaps are caused by data missing during
collection (for example, the machine in charge of the acquisition
is down for some reason) or by the inconsistency of the data with
the domain of the metric. According to the invention, each missing
observation is replaced by the median of the k closest samples
(default is k=5): the median, in fact, maintains the general
behaviour of the series and in addition is not affected by extreme
values (unlike the mean), thereby not distorting the model
components (trend and seasonality).
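The replacement of NA values described above can be sketched as
follows. This is only an illustrative sketch: it assumes that the
"k closest samples" are the k non-missing samples nearest in time to
the gap, and the name fill_missing is hypothetical.

```python
from statistics import median

def fill_missing(series, k=5):
    """Replace each None by the median of the k nearest observed samples.

    Assumption: "closest" means closest in time (index distance).
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is None and observed:
            # The k observed positions nearest to the gap at index i.
            nearest = sorted(observed, key=lambda j: abs(j - i))[:k]
            filled[i] = median(series[j] for j in nearest)
    return filled

cpu = [10, 12, None, 11, 95, 13, None, 12]
print(fill_missing(cpu))
```

In this example the extreme value 95 lies next to both gaps, yet the
median keeps the filled values at the general level of the series,
which is the property the paragraph above relies on.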
(a.2) Level Discontinuities
[0028] A level discontinuity is a change in the level of the
series, which usually appears in the form of a step. These jumps
can be caused by occasional hardware upgrades or changes in the
physical structure of the monitored IT applications. To detect
level discontinuities, a set of candidate points is created using
the Kullback-Leibler divergence (see W. N. Venables, D. M. Smith,
(December 2008). An introduction to R. Available at
http://www.r-project.org/): for each point of the time series, the
deviation between its backward and forward windows is computed. The
resulting vector is filtered through a significance threshold and
the remaining points constitute the change points in the series,
which are the required level discontinuities. To filter level jumps
further, the algorithm provides another method, which works in two
steps: (i) an alternative list of candidate jump points is created:
considering the second difference of the series, the procedure
selects the samples at which it is greater than half (by default)
of the standard deviation of the dataset; (ii) a seasonal analysis
is applied to this vector of candidates, to prevent peaks at known
periodic lags from being considered as level discontinuities.
Thanks to this 2-step filter, only jumps of considerable size and
concrete meaning are selected and adjusted.
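Step (i) of the second filter can be sketched as follows. The sketch
assumes that the second difference is compared in absolute value
against the threshold and that the population standard deviation is
used; the name jump_candidates is hypothetical, and step (ii), the
seasonal filtering of the candidates, is not shown.

```python
from statistics import pstdev

def jump_candidates(series, factor=0.5):
    """Indices where |second difference| exceeds factor * std of the series."""
    threshold = factor * pstdev(series)
    candidates = []
    for i in range(1, len(series) - 1):
        # Discrete second difference centred on sample i.
        second_diff = series[i + 1] - 2 * series[i] + series[i - 1]
        if abs(second_diff) > threshold:
            candidates.append(i)
    return candidates

memory = [4, 4, 4, 4, 4, 8, 8, 8, 8, 8]
print(jump_candidates(memory))
```

A single step produces two adjacent candidates (the last sample
before the jump and the first one after it), since the second
difference changes sign across the step.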
(a.3) Outliers
[0029] Outliers are values that deviate substantially from the
behaviour of the series. Their identification is crucial for the
quality of the final forecast of the data, as the presence of
anomalous samples in the time series can lead to large errors in
the determination of the parameters of the prediction model. In
addition to these points, there are samples which appear anomalous
because their behaviour differs from that of the rest of the data.
However, in the real world these points have a physical
significance and are known as events. Events normally represent
hardware upgrades or changes in the system, which occur either in
isolation or at periodic intervals. Both outliers and events are
found in the vector of change points obtained using the
Kullback-Leibler function. Change points are calculated using a
boxplot analysis conducted on all intervals; the values in these
intervals beyond the whisker point are generally considered
anomalous. To distinguish events from outliers, this list of
unlabelled points is then refined by the event detection box, which
identifies seasonalities in this vector (if any exist) by
automatically detecting the starting point of the seasonal sequence
and producing a list of events in the data. All the other points
are labelled as outliers. The event detection procedure ensures
that events are considered as seasonal elements and are properly
dealt with by the dedicated section of the algorithm.
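The boxplot analysis mentioned above can be sketched as follows. The
sketch assumes fixed-length consecutive intervals and the usual
whisker rule of 1.5 times the interquartile range; neither value is
stated in the text. The name whisker_outliers is hypothetical, and
the subsequent separation of events from outliers is not shown.

```python
from statistics import quantiles

def whisker_outliers(series, window, whisker=1.5):
    """Indices of samples lying beyond the boxplot whiskers of their interval."""
    flagged = []
    for start in range(0, len(series), window):
        chunk = series[start:start + window]
        if len(chunk) < 4:
            continue  # too few samples for meaningful quartiles
        q1, _, q3 = quantiles(chunk, n=4)
        spread = whisker * (q3 - q1)
        low, high = q1 - spread, q3 + spread
        for offset, value in enumerate(chunk):
            if value < low or value > high:
                flagged.append(start + offset)
    return flagged

load = [10, 11, 12, 11, 10, 50, 11, 12]
print(whisker_outliers(load, window=8))
```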
(a.4) Last Period Analysis
[0030] Statisticians recognize that predicting a time series with a
highly corrupted last period can lead to such serious conceptual
mistakes that the whole forecasting process could be totally
compromised. Suitably handling anomalies in the last part of the
series is therefore very important for the final prediction of the
data. Hence, according to the invention, the last P samples of the
dataset (P is the granularity of the data) are left out, while a
P-step ahead forecast (using the estimation method itself) is
computed. This produces the estimated values for the last period,
which is then judged to be unusual or not. If it is, the real
values are substituted by the prediction just computed. Then, the
last sample of the obtained data series is ignored, and its
unusualness is judged through a procedure mirroring the one above.
This filtering of the latest samples (i.e. the most significant
samples of the trailing edge) is very important, as dealing with
series that have a stable last portion leads to a more accurate
prediction of the data.
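The first part of the last period analysis can be sketched as
follows. The text does not specify how a period is judged unusual,
so both the forecaster and the test are passed in as functions; the
stand-ins seasonal_naive and far_from_forecast, and the tolerance
value, are purely hypothetical and only serve to make the sketch
runnable. The second check on the single last sample is not shown.

```python
def repair_last_period(series, period, forecast, is_unusual):
    """Replace the last `period` samples by their forecast if they look unusual."""
    history, last = list(series[:-period]), list(series[-period:])
    predicted = forecast(history, period)   # P-step ahead forecast
    if is_unusual(last, predicted):
        return history + predicted
    return history + last

# Hypothetical stand-ins, not the estimation method of the invention.
def seasonal_naive(history, horizon):
    return history[-horizon:]               # repeat the previous period

def far_from_forecast(actual, predicted, tolerance=5.0):
    return max(abs(a - p) for a, p in zip(actual, predicted)) > tolerance

disk = [1, 2, 3, 1, 2, 3, 1, 2, 30]
cleaned = repair_last_period(disk, 3, seasonal_naive, far_from_forecast)
print(cleaned)
```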
(a.5) Additional Features
[0031] In addition to these widely known features, the prediction
method of the invention has been equipped with a set of alternative
identifiers, which are useful to adequately characterize aspects of
the specific performance and workload data of IT systems. In
particular, IT time series are not allowed to take negative values,
as they represent percentages or natural metrics, which by
definition cannot be negative. Because of that, the prediction
shall be limited so as to prevent the forecast from reaching
negative values.
[0032] Further, it is detected whether the series constitutes a
utilization dataset, so that a lower and an upper bound can be set.
In some real situations, the series shows a non-trivial lower (or
upper) bound: this limit must be correctly detected and considered,
as letting forecasts go beyond that boundary produces situations
that are infeasible in terms of the physical meaning of the data.
To identify the bound, the mode of the series is calculated: if it
is the lower (or upper) bound of the data and has a sufficiently
high frequency, then it is considered the base of the data, and the
series is labelled as a trampoline series.
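The detection of the base of a trampoline series and the limiting of
the forecast can be sketched as follows. The text only requires a
"sufficiently high frequency" of the mode, so the parameter
min_share and its value are assumptions, as are the names
detect_base and clamp_forecast.

```python
from statistics import mode

def detect_base(series, min_share=0.3):
    """Return ('lower' | 'upper', value) when the mode sits on a bound, else None."""
    base = mode(series)
    share = series.count(base) / len(series)
    if share < min_share:
        return None                 # the mode is not frequent enough
    if base == min(series):
        return ("lower", base)
    if base == max(series):
        return ("upper", base)
    return None

def clamp_forecast(forecast, lower=0, upper=None):
    """Keep forecast values inside the feasible range of the metric."""
    clamped = [max(lower, v) for v in forecast]
    if upper is not None:
        clamped = [min(upper, v) for v in clamped]
    return clamped

usage = [0, 0, 5, 0, 7, 0, 3, 0]
print(detect_base(usage))
print(clamp_forecast([-2, 1, 120], lower=0, upper=100))
```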
(b) Identification of Trend
[0033] Once the time series has been cleaned from all anomalous
behaviours, and therefore can be considered a pure and meaningful
expression of the process underlying the data, it shall be treated
according to the three steps illustrated in FIG. 1.
[0034] The trend part of the data is, in most cases, the most
relevant one as it dominates the whole series. For this reason, a
good identification of the general direction of the data usually
leads to a good final prediction of the series. To accomplish this,
the coefficient of determination ($R^2$) technique is suggested
(see, for explanations, L. Huang and J. Chen, "Analysis of
variance, coefficient of determination and f-test for local
polynomial regression", The annals of statistics, vol. 36, no. 5,
pp. 2085-2109, October 2008; E. R. Dougherty, Kim and Y Chen,
"Coefficient of determination in non linear signal processing".
Signal processing, vol. 80, no. 10 pp. 2219-2235, October 2000.)
[0035] The fixed set of possible curves is composed of polynomial
functions (linear, quadratic and cubic) and a non-linear one
(exponential). Initially, a heuristic test to detect possible
exponential behaviour is computed on the series Y(t) (which is the
output of the pre-processing procedure): the natural logarithm of Y
is taken to obtain the slope of the resulting fitting line, which
is useful for the further analysis. Supposing, in fact, that
$Y(t) \approx T(t)$ (a condition satisfied in almost all real
situations),
[0036] if
$T(t) = a\,e^{bt}$,
then
$\log Y(t) = \log\left(a\,e^{bt}\right) = \log a + \log e^{bt} = \log a + bt$
[0037] The slope (b) of the fitting regression line can be used to
obtain the proper analytical expression for the exponential fit to
the series. Afterwards, it is possible to apply an $R^2$ test to
the dataset, involving the exponential modelling function just
calculated ($e^{bt}$) and the polynomial ones ($t$, $t^2$ and
$t^3$) cited above. The curve yielding the maximum value of $R^2$
is taken as the correct regression.
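The trend selection can be sketched as follows. The sketch assumes
that each candidate shape g(t) is fitted as Y = c0 + c1*g(t) by
ordinary least squares, which is one possible reading of the text,
and that all samples are strictly positive so that the logarithm
exists. The names linear_fit, r_squared and select_trend, and the
default threshold value, are hypothetical.

```python
import math

def linear_fit(x, y):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    c1 = sxy / sxx
    return my - c1 * mx, c1

def r_squared(y, fitted):
    """Coefficient of determination of the fitted values."""
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def select_trend(y, threshold=0.6):
    """Return (best shape or None, R^2 score of every candidate shape)."""
    t = list(range(1, len(y) + 1))
    # Slope b of the line fitted to log Y(t).
    _, b = linear_fit(t, [math.log(v) for v in y])
    shapes = {
        "linear": [float(ti) for ti in t],
        "quadratic": [float(ti ** 2) for ti in t],
        "cubic": [float(ti ** 3) for ti in t],
        "exponential": [math.exp(b * ti) for ti in t],
    }
    scores = {}
    for name, g in shapes.items():
        c0, c1 = linear_fit(g, y)
        scores[name] = r_squared(y, [c0 + c1 * gi for gi in g])
    best = max(scores, key=scores.get)
    # Reject the trend when even the best R^2 is below the threshold.
    return (best if scores[best] >= threshold else None), scores

growth = [2.0 * math.exp(0.3 * ti) for ti in range(1, 21)]
print(select_trend(growth)[0])
```

A constant series would make the sums of squares vanish and is not
handled by this sketch.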
[0038] Once the function is chosen, it is straightforward to find
the analytical expression of the best-fitting curve (all computer
programs for numerical calculations provide a built-in function to
fit generic analytical models to a dataset). Unfortunately, real
world data are sometimes very "dirty". The coefficient of
determination could be biased by some random circumstance in the
data, which could distort the output of the $R^2$ test (especially
if the samples are not numerous). To prevent unexpected and
undesirable situations of bad adaptation to the data, a threshold
is put on the trend test. It represents the value above which the
maximum $R^2$ must stay in order to be significant. This filter is
put in place to avoid overfitting of the model, which could rely
too much on the original data (which can instead be corrupted by
some disturbing random factor) with respect to future samples.
(c) Seasonality
[0039] In most real world situations in IT systems, every aspect of
the data is highly influenced by time. All time series have a
granularity, which is the interval at which data are collected.
Data, for example, can be gathered hourly, weekly or yearly; they
can even be collected at a certain granularity and then be
processed to obtain different time intervals. Often, datasets show
time-dependent correlations, such as events or recurring patterns,
that tend to be periodic in their appearance. IT data are
particularly expressive of this crucial aspect of datasets.
[0040] According to the invention there is provided an entire
process block that deals with seasonal traits in the data. It is
divided into 3 parts: the first two handle, respectively, the
original detrended dataset (Z(t)) and its aggregation with respect
to the basic seasonality, while the third one uses the information
acquired from the previous two to prepare the series for the next
steps of the process.
c.1 Original Data Investigation
[0041] To detect whether seasonality is a relevant component of the
series, an analysis of the Auto Correlation Function (ACF) over the
dataset is computed. The procedure for the identification of
seasonal components detects sufficiently high peaks (local minima
or maxima greater than a threshold) in the ACF, whose lags
represent the periods of the relevant seasonal components in the
data. This test highlights regular behaviours of the series, which
usually denote specific qualities of the process underlying the
data. The process can handle the ACF-test output with the following
3 different approaches.
[0042] 1. Granularity-based. A seasonal component in the original
data is considered relevant only if its period is equal to the time
interval. This approach makes it possible to discover regular
dependencies which rely on correlations that have a concrete
physical meaning connected to the passing of time. This is the
default option of the algorithm;
[0043] 2. Greedy. This approach instead supposes that every period
labelled as significant by the Auto Correlation Function analysis
is acceptable and potentially relevant. The algorithm chooses the
highest peak returned by the seasonality test and assumes the
corresponding period as the one driving the series;
[0044] 3. Custom. Finally, it is possible to leave to the user the
choice of inputting a set of feasible periods to the automatic
process. The procedure chooses the lowest period which is both in
this set and among the relevant periods returned by the ACF-test.
This option has been added to give the user the possibility of
customizing (if needed) the analysis, and also to enable the
algorithm to manage peculiar time series that can occasionally
appear in real applications. A data collection, for instance, can
be conducted from Monday to Friday and stopped at the week-end (due
to the closing of the offices, for example). In this case, a
possibly relevant seasonality would have period 5 (rather than 7,
as in usual weekly dependencies), and thanks to this selection the
algorithm can properly handle it. Obviously, this procedure could
return no period, to express that seasonality is not important in
the evolution of the data.
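The ACF-test and the three approaches listed above can be sketched
as follows. For brevity the sketch only looks for local maxima of
the sample ACF above a threshold (the text also mentions local
minima), and the threshold value 0.5 is an assumption. The names
acf, relevant_periods and choose_period are hypothetical.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (requires max_lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[i] - mean) * (series[i + k] - mean) for i in range(n - k)) / var
        for k in range(max_lag + 1)
    ]

def relevant_periods(series, max_lag, threshold=0.5):
    """(lag, height) of every local ACF maximum above the threshold."""
    r = acf(series, max_lag + 1)
    return [
        (k, r[k])
        for k in range(2, max_lag + 1)
        if r[k] > threshold and r[k] >= r[k - 1] and r[k] >= r[k + 1]
    ]

def choose_period(peaks, approach="granularity", granularity=None, allowed=()):
    """Pick the seasonal period, or None when seasonality is not relevant."""
    lags = [k for k, _ in peaks]
    if approach == "granularity":      # period must equal the time interval
        return granularity if granularity in lags else None
    if approach == "greedy":           # highest peak wins
        return max(peaks, key=lambda p: p[1])[0] if peaks else None
    if approach == "custom":           # lowest period also allowed by the user
        common = [k for k in lags if k in allowed]
        return min(common) if common else None
    raise ValueError(approach)

weekly = [1, 2, 3, 4, 5, 6, 7] * 6
peaks = relevant_periods(weekly, max_lag=20)
print([k for k, _ in peaks])
print(choose_period(peaks, "greedy"))
```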
c.2 Aggregated Data Investigation
[0045] Once the original series has been dealt with, a deeper
analysis is made on the dataset. The data vector is divided into K
groups, each composed of T elements, where T is the value returned
by the previous processing (if seasonality has not been considered
relevant in the data, then T is set to the time interval P; if the
last group has $\check{T}$ elements and $\check{T} < T$, it is
filled by adding $T - \check{T}$ samples equal to the mean of the
series). Hence, the so-called aggregate series is obtained, where
each sample $Z'(j)$ ($j \in \{1, 2, \ldots, K\}$) is defined as
$Z'(j) = \sum_{i=0}^{T-1} Z\big((j-1)T + i + 1\big)$
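The construction of the aggregate series can be sketched as follows,
assuming consecutive, non-overlapping groups of T samples that are
summed, with the last group padded by the series mean as described
above. The name aggregate is hypothetical.

```python
def aggregate(series, period):
    """Sum consecutive groups of `period` samples, padding the last group."""
    mean = sum(series) / len(series)
    padded = list(series)
    remainder = len(padded) % period
    if remainder:
        # Fill the incomplete last group with the mean of the series.
        padded += [mean] * (period - remainder)
    return [sum(padded[g:g + period]) for g in range(0, len(padded), period)]

print(aggregate([1, 2, 3, 4, 5, 6, 7], 3))
```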
[0046] At this point, the ACF-test can be applied to this series,
to find possible significant seasonal components in the data (T').
This additional periodic examination makes it possible to discover
specific patterns, which are connected to "double seasonalities" in
the structure of the metric that is modelling the data. The method
applied for this procedure is the greedy one, which does not
prevent any period from being considered significant by the
algorithm. This choice takes into account the complicated dynamics
of performance datasets. Aggregated IT time series, in fact, do not
have fixed periodic patterns but can show regularities at any
sample lag.
c.3 Seasonal Differencing
[0047] Once the aggregated series has also been fully analyzed,
Z(t) can be taken into consideration again and the information
obtained in the two previous sections can be used. Seasonal
differencing is a form of adjustment whose aim is to remove the
seasonality explicitly from the dataset. In general, given a time
series X(t) (of length n) and the season $\Delta$, the difference
series is obtained as follows:
$S(j) = X(j+\Delta) - X(j), \quad j \in \{1, 2, \ldots, n-\Delta\}$ (2)
[0048] The season parameter can vary, according to the possible
results of the previous inspection of the data:
[0049] if the aggregated series showed relevant seasonality
(independently of the original series' result), then
$\Delta = T \cdot T'$;
[0050] if the aggregated data did not reflect periodic
regularities, while the original data did, then $\Delta = T$;
[0051] otherwise, if neither the original series nor the aggregated
one showed seasonality, then seasonal differencing cannot be
applied, as periodicity is not considered relevant in the structure
of the data.
[0052] The obtained series S(t) is accordingly deprived of all its
seasonal components, and the parameter $\Delta$ is kept for the
later reapplication of the seasonality in the final prediction
stage.
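The choice of the season parameter, the differencing of expression
(2) and the later reapplication of the seasonality to the known
samples can be sketched as follows. Lists are indexed from 0 in the
code, whereas the expressions in the text count from 1. The names
choose_season, seasonal_difference and reapply_seasonality are
hypothetical.

```python
def choose_season(t_original, t_aggregated, granularity):
    """Season parameter, or None when no seasonality was detected.

    t_original / t_aggregated are the detected periods (None if absent);
    when the original series showed no seasonality, T falls back to the
    time interval P, as in the aggregation step.
    """
    if t_aggregated is not None:
        return (t_original or granularity) * t_aggregated
    if t_original is not None:
        return t_original
    return None

def seasonal_difference(x, delta):
    """S(j) = X(j + delta) - X(j)."""
    return [x[j + delta] - x[j] for j in range(len(x) - delta)]

def reapply_seasonality(first_season, s, delta):
    """Rebuild the series from its first `delta` samples and the differences."""
    z = list(first_season[:delta])
    for j, sj in enumerate(s):
        z.append(z[j] + sj)
    return z

x = [1, 5, 2, 6, 3, 7]
s = seasonal_difference(x, 2)
print(s)
print(reapply_seasonality(x[:2], s, 2))
```

The round trip returns the original samples, which is why only the
parameter and the first season need to be kept for the final stage.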
(d) ARMA Analysis
[0053] The dataset resulting from all the previous procedures is a
stationary series S(t), which can be modelled as an ARMA(p,q)
process (see, for example, G. E. P. Box and G. M. Jenkins, Time
Series Analysis: Forecasting and Control. San Francisco, Calif.:
Holden-Day, 1976.). Therefore, the investigations to be carried out
on the dataset are the detection of the order of the model and its
identification.
[0054] There are some different approaches to handle this portion
of the Box and Jenkins analysis (see for example, Y. Lu and S. M.
AbouRizk, "Automated Box Jenkins forecasting modelling", Automation
in construction, vol. 18, pp. 547-558, November 2008.) and the
details of the processes to be used will not be described here,
since they are well known in the field and they do not form
specifically part of the present invention.
[0055] According to a preferred embodiment of the invention, with
the aim of significantly reducing the computational time of this
procedure, a process tool has been constructed which is able to
accurately identify the most appropriate orders of the AR and MA
portions of the ARMA process, starting from reasonable
considerations on the structure of the data. First, the procedure
agrees with the one from Lu and AbouRizk about the bounds on the
orders, as p and q components greater than 3 would not bring
substantial improvements to the quality of the model and would only
complicate the abstraction of the data. This process tool then
executes an accurate inspection of the `acf` (auto correlation
function) and `pcf` (partial correlation function) functions,
basically applying the rules described in the document S. Bittanti,
"Identificazione dei modelli e sistemi adattativi," Bologna, Italy:
Pitagora editrice, 2005.
[0056] This procedure relies on the knowledge acquired directly
from the serial correlation among data samples (acf and pcf), which
is strongly indicative of the behaviour of the data and, in
addition, can be computed in a very short time.
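The inspection of the acf and pcf can be sketched as follows. This is a minimal illustration only: it uses the common textbook cut-off heuristic (last lag whose correlation lies outside an approximate 95% white-noise band), not the specific rules of the cited Bittanti reference, which are not reproduced in this document. The helper name `suggest_orders` and the band 1.96/sqrt(n) are assumptions; the cap of 3 on both orders follows the text.

```python
import math
import random


def acf(x, max_lag):
    """Sample autocorrelation r(0), ..., r(max_lag) of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf(r, max_lag):
    """Partial autocorrelations from r via the Durbin-Levinson recursion."""
    out = [1.0]
    phi_prev = []  # coefficients of the AR(k-1) fit
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


def suggest_orders(x, max_order=3):
    """Hypothetical helper: last significant pcf lag -> p, acf lag -> q."""
    bound = 1.96 / math.sqrt(len(x))  # assumed 95% white-noise band
    r = acf(x, max_order)
    partial = pacf(r, max_order)
    lags = range(1, max_order + 1)
    p = max((k for k in lags if abs(partial[k]) > bound), default=0)
    q = max((k for k in lags if abs(r[k]) > bound), default=0)
    return p, q


# Synthetic AR(1) series (coefficient 0.7) used only as demonstration input.
rng = random.Random(0)
series, prev = [], 0.0
for _ in range(2000):
    prev = 0.7 * prev + rng.gauss(0.0, 1.0)
    series.append(prev)

p, q = suggest_orders(series)
print("candidate orders:", p, q)
```

Because the acf of an autoregressive series decays slowly, this simple rule reaches the cap on q for the demonstration input; a fuller rule set would distinguish a slow decay from a genuine cut-off before fixing the orders.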
[0057] After the p and q components have been correctly identified,
the only thing left is the computation of the values of the model
parameters. This is done by the machine, which uses a traditional
MLE (Maximum Likelihood Estimate) method to estimate the
coefficients that best model the given time series S(t). Once the
ARMA model has been built, the system is ready for the final
prediction stage of the procedure.
(e) Final Prediction
[0058] At this stage, an abstraction of every component of the time
series has been produced; hence every obtained model can be used
and extended to any future collected data, to produce a forecast of
the dataset. The prediction of the time series is computed in the
inverse order with respect to the identification, applying the
results from the least relevant component first and from the
dominant one last.
e.1 ARMA
[0059] The modelling of S(t) makes it possible to produce an
analytical expression for the serial correlation of the series with
past values and noise sequences. The forecast generated from this
information is particularly relevant for a short prediction horizon
(the first future samples), while samples far in the future are
less affected by the ARMA contribution.
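The fading influence of the ARMA term over the horizon can be illustrated with a short sketch. It assumes a zero-mean stationary series, already-estimated coefficients, and future noise replaced by its expected value of zero; the function name `arma_forecast` and the numeric values are hypothetical.

```python
def arma_forecast(series, residuals, ar, ma, horizon):
    """Recursive forecast of a zero-mean ARMA(p,q) series.

    series    -- past values of S(t), at least len(ar) samples
    residuals -- past one-step errors, at least len(ma) samples
    ar, ma    -- assumed AR and MA coefficients (already estimated)
    """
    s = list(series)
    e = list(residuals)
    out = []
    for _ in range(horizon):
        value = sum(a * s[-i] for i, a in enumerate(ar, start=1))
        value += sum(b * e[-k] for k, b in enumerate(ma, start=1))
        s.append(value)
        e.append(0.0)  # expected value of future noise
        out.append(value)
    return out


# ARMA(1,1) with illustrative coefficients and a short history.
forecast = arma_forecast(series=[1.0, 2.0], residuals=[0.5, 1.0],
                         ar=[0.5], ma=[0.4], horizon=4)
print(forecast)
```

With these illustrative coefficients, each forecast step after the first is half the previous one, so the ARMA contribution shrinks towards zero as the horizon grows.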
e.2 Seasonal Differencing
[0060] Expression (2) explained how to obtain the difference series
without the seasonal components. Now seasonality must be reapplied
to build the desired dataset $\check{Z}(t)$. Known samples are
trivially reacquired with the following:
$$\check{Z}(j) = Z(j), \qquad j = 1, 2, \ldots, T \tag{3}$$
$$\check{Z}(j+T) = Z(j) + S(j), \qquad j = 1, 2, \ldots, n-T \tag{4}$$
Future samples, instead, are obtained, thanks to the ARMA forecast,
using:
$$\check{Z}(n+j) = Z\bigl(n-T+j\ (\mathrm{mod}\ T)\bigr) + \hat{S}(n-T+j), \qquad j = 1, 2, \ldots, H \tag{5}$$
where H is the desired forecasting horizon. This procedure takes
particular care over the last period of the data (which is, as
discussed previously, a very important portion of the dataset). In
equation (5), in effect, the forecast is focused on
Z(n-T+j(mod T)), which considers only the last period of the
series and joins it with the prediction of the stationary component
of the data, to replicate seasonality over time. Note that the
parameter T includes all relevant seasonalities of the data and
allows multiple periodic components to be handled without any
additional information.
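Equations (4) and (5) can be sketched in a few lines. The sketch makes one interpretive assumption that the text does not spell out: the term j (mod T) is read as wrapping j back into the last known period, so that for horizons longer than T the samples Z(n-T+1), ..., Z(n) are reused cyclically. Function names and sample values are hypothetical.

```python
def seasonal_difference(z, period):
    """S(j) = Z(j+T) - Z(j), the relation implied by equation (4)."""
    return [z[j + period] - z[j] for j in range(len(z) - period)]


def reapply_seasonality(z, s_forecast, period):
    """Future samples per equation (5).

    z          -- known samples Z(1..n), stored 0-based
    s_forecast -- ARMA forecasts of S for the next H steps
    Assumption: 'j (mod T)' wraps j into the last known period.
    """
    last_period = z[len(z) - period:]  # Z(n-T+1), ..., Z(n)
    out = []
    for j in range(1, len(s_forecast) + 1):
        position = (j - 1) % period    # 0-based slot in the last period
        out.append(last_period[position] + s_forecast[j - 1])
    return out


known = [10, 20, 30, 11, 21, 31]      # two periods of length T = 3
diffs = seasonal_difference(known, 3)
future = reapply_seasonality(known, s_forecast=[1, 1, 1, 1], period=3)
print(diffs, future)
```

Note that the fourth forecast sample reuses the first slot of the last known period rather than the first forecast sample, which reflects the emphasis of equation (5) on the last observed period.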
e.3 Trend
[0061] Finally, the identified regression curve of the data is
reapplied. In most performance data series, trend is the most
relevant component of the data. Therefore, for effective capacity
planning, the general behaviour of a time series is crucial, and
its proper detection and application become the most critical part
of any forecasting procedure. That is why so much attention is paid
to trend identification in this procedure. Finally, after the
detected features have been taken into account (reapplication of
level discontinuities and outliers, etc.), the definitive
prediction is computed.
[0062] Based on this computed prediction, the method according to
the invention further triggers a suitable procedure, depending on
the specific hardware employed and monitored, to upgrade said
hardware, either by allocating to the system some unused shared
resources of another system or by issuing an alarm for the IT
manager to start a manual upgrading procedure.
[0063] As an example, if the prediction method is based on a
dataset representing the storage space usage of a hard disk, the
prediction computed from the historical usage data versus time
gives an indication that at time t the hard-disk capacity will be
used up to 99%. The process is hence set so that, at time t-n,
before complete usage of said resource is reached, more disk space
capacity is allocated to that IT system.
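The hard-disk example can be expressed as a small decision rule. This is a sketch under stated assumptions: the forecast is a list of predicted usage values per time step, the 99% threshold comes from the example above, and the names `first_breach`, `upgrade_step` and `lead_steps` (the n in t-n) are hypothetical. The actual upgrade action depends on the monitored hardware and is not modelled here.

```python
def first_breach(forecast, capacity, threshold=0.99):
    """Index of the first forecast step whose usage reaches the threshold."""
    limit = threshold * capacity
    for step, usage in enumerate(forecast):
        if usage >= limit:
            return step
    return None


def upgrade_step(forecast, capacity, lead_steps, threshold=0.99):
    """Step at which to trigger the upgrade: lead_steps before the breach."""
    breach = first_breach(forecast, capacity, threshold)
    if breach is None:
        return None  # no upgrade needed within the forecast horizon
    return max(0, breach - lead_steps)


# Illustrative predicted disk usage (GB) for a 1000 GB disk.
predicted_usage = [700, 800, 900, 995, 1000]
when = upgrade_step(predicted_usage, capacity=1000, lead_steps=2)
print("trigger upgrade at forecast step:", when)
```

Returning `None` when no breach is predicted leaves the choice between automatic reallocation and an alarm to the caller.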
[0064] According to a preferred embodiment of the invention, to
provide a more robust estimation of unknown forecasted samples,
prediction bands are computed in addition to the estimated
predicted values of the dataset. These prediction bands are
calculated as a function of the forecasted value and the chosen
confidence on error of prediction: they represent the region (i.e.
upper and lower bounds) in which the prediction lies with a certain
probability.
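One simple way to obtain such bands is sketched below. The text does not give the exact formula, so the sketch assumes Gaussian prediction errors with a constant standard deviation, which yields symmetric bands around each forecast value; `prediction_bands` and `error_std` are hypothetical names.

```python
from statistics import NormalDist


def prediction_bands(forecast, error_std, confidence=0.75):
    """Symmetric (lower, upper) bounds around each forecast value.

    Assumes normally distributed errors with standard deviation
    error_std; confidence is the two-sided coverage probability.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * error_std
    return [(f - half_width, f + half_width) for f in forecast]


narrow = prediction_bands([10.0, 12.0], error_std=2.0, confidence=0.75)
wide = prediction_bands([10.0, 12.0], error_std=2.0, confidence=0.95)
print(narrow)
print(wide)
```

A higher confidence level widens the band, which is the trade-off an operator makes when choosing how conservative the capacity forecast should be.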
Evaluation and Example
[0065] The accuracy of the prediction method is evaluated on data
coming from IT metrics, roughly categorized into two subsets:
workload data and performance data. The first includes datasets
representing raw business data, taken directly from IT activities:
for example Business Driver series, which monitor user-based
metrics (number of requests, logins, orders, etc.), Technical
Proxy series, which indirectly measure business performance (rates,
number of hits, volumes of data involved in transactions, etc.),
and Disk Load series, describing the load addressed to memory
devices. The second is composed of time series which can be
workload data after the processing applied by queuing network
models, or performance series coming directly from IT architecture
devices (CPUs, storage systems, databases, etc.).
[0066] The accuracy of the method according to the invention has
been assessed through visual judgement of cross-validation,
comparing the results with those obtained from other popular
forecasting methods, such as robust linear (RL) and Holt-Winters
(HW). In both cases, the method according to the invention has been
judged superior.
[0067] FIG. 2 shows a real dataset obtained by monitoring the
number of bytes of active memory of a computing machine. The method
of the invention has been found able to automatically detect and
recognize the missing values (dashed ovals in the figure), which
are filled as explained above, the level discontinuities
(rectangular shapes), which are adjusted for the correct automatic
analysis, and the outliers (dashed circles), which are handled in
the pre-processing stage. In particular, some of the apparent
outliers were not detected as anomalous points, but were correctly
recognized, through the event detection procedure, as usual
periodic peaks due to seasonality. Trend analysis on the cleaned
series fits an upward linear trend to the series. The seasonal
detection, instead, discovers a double seasonality in the data. The
dataset, in fact, shows a seasonal component of period 24 (number
of hours in a day), while the aggregate series is dominated by a
seasonality of period 7 (number of days in the week). Hence, the
hourly time series tends to replicate its behaviour every week.
This aspect of the dataset is handled by the seasonal differencing,
which combines the ARMA prediction (p=q=3) and the identified trend
to make an appropriate forecast. FIG. 3 shows the initial series,
together with its final prediction (bold line) for 408 samples (17
days), along with the prediction bands (dashed line), computed for
a 75% confidence.
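The double seasonality described above (period 24 in the hourly series, period 7 in the aggregate series) can be reproduced on synthetic data. The sketch assumes a simple detection rule, the largest local peak of the sample autocorrelation function, which stands in for the detection step of the method and is not claimed to be identical to it. The data, the helper names and the lag limits are all hypothetical.

```python
def acf(x, max_lag):
    """Sample autocorrelation r(0), ..., r(max_lag) of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def dominant_period(x, max_lag):
    """Lag of the largest local acf peak in 2..max_lag, or None."""
    r = acf(x, max_lag + 1)
    peaks = [k for k in range(2, max_lag + 1)
             if r[k] > r[k - 1] and r[k] > r[k + 1]]
    return max(peaks, key=lambda k: r[k]) if peaks else None


def aggregate(x, size):
    """Sum consecutive blocks of `size` samples (e.g. hours -> days)."""
    return [sum(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


# Synthetic load: active from 08:00 to 20:00, lower at weekends.
profile = [1.0 if 8 <= h < 20 else 0.0 for h in range(24)]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 0.3, 0.3]
hourly = [profile[t % 24] * weights[(t // 24) % 7]
          for t in range(6 * 7 * 24)]  # six weeks of hourly samples

hourly_period = dominant_period(hourly, max_lag=36)
daily_period = dominant_period(aggregate(hourly, 24), max_lag=20)
print(hourly_period, daily_period)
```

Repeating the same detection on the aggregated series is what exposes the weekly component, which is hidden behind the daily cycle when only the hourly autocorrelation is inspected.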
[0068] While there has been illustrated and described what are
presently considered to be example embodiments, it will be
understood by those skilled in the art that various other
modifications may be made, and equivalents may be substituted,
without departing from claimed subject matter. Additionally, many
modifications may be made to adapt a particular situation to the
teachings of claimed subject matter without departing from the
central concept described herein. Therefore, it is intended that
claimed subject matter not be limited to the particular embodiments
disclosed, but that such claimed subject matter may also include
all embodiments falling within the scope of the appended claims,
and equivalents thereof.
Evaluation and Example of Comparison with Other Algorithms
[0069] To compare the efficiency of the automated prediction
algorithm, which intelligently uses the underlying behavior of the
time series, with that of other algorithms, we present extensive
cross-validation results. Our algorithm is compared with two
different forecasting methods widely used in common data analysis
applications: Robust Linear (RL) and Holt-Winters (HW). The first
is very popular in time series forecasting because it is easy to
understand and robust with respect to outliers and highly variable
behaviors. Robust Linear, however, extracts a very basic model from
the data, and so does not take the seasonal dynamics of the series
into consideration. Holt-Winters, instead, is an exponential
smoothing method which handles both trend and seasonality
behaviors. The major drawback of this forecasting algorithm is the
lack of an informed periodic analysis: data is just smoothed and
the seasonality of the last portion of the dataset is replicated
over time, without knowing whether the periodic component is
relevant and ignoring multiple seasonalities. Finally, both these
methods lack the ARMA analysis, which is important in detecting the
serial correlation in the time series.
The performance of the algorithm is evaluated through three
performance indices.
Performance Indices
[0070] Let $\hat{y}(j)$ represent the prediction of sample $y(j)$
and N the length of the portion of the series used to test the
prediction. Firstly, to have a quantitative indicator of the
accuracy of our prediction algorithm, we use the most basic
performance index: the Root Mean Squared Error (RMSE), which
computes the variation of the forecast with respect to the real
data, defined as
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{j=1}^{N} \left( y(j) - \hat{y}(j) \right)^{2}}{N}}$$
To obtain an absolute indicator of the goodness of the prediction,
a Mean Absolute Percentage Error (MAPE) can be calculated; it is
defined as follows:
$$\mathrm{MAPE} = \frac{\sum_{j=1}^{N} \left| \dfrac{y(j) - \hat{y}(j)}{y(j)} \right|}{N}$$
[0071] Finally, the third index we use to evaluate the performance
of our prediction algorithm computes a test on each predicted
sample separately, calculating the deviation of the forecasted
value from the real one. Error(j) is computed as follows:
$$\mathrm{Error}(j) = \begin{cases} 1 & \text{if } \dfrac{\left( y(j) - \hat{y}(j) \right)^{2}}{\sigma^{2}} > q_{1-\alpha} \\ 0 & \text{else,} \end{cases}$$
where $\sigma$ is the standard deviation of dataset y and
$q_{1-\alpha}$ is the quantile of a normal distribution. Error is
therefore a vector with as many zeroes as the number of samples
correctly predicted, with the chosen confidence of $(1-\alpha)$. To
obtain an absolute indicator of the accuracy of the forecast, we
compute the Error Ratio (ER):
$$\mathrm{ER} = \frac{\sum_{j=1}^{N} \mathrm{Error}(j)}{N}$$
We set the confidence level for the ER index to 95%. For each
considered time series, the last third of the data vector is left
out of the prediction and used to validate the computed forecast.
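The three indices and the hold-out of the last third can be sketched as follows. Several details are assumptions rather than statements of the source: MAPE is computed with an absolute value (as its name implies) and requires non-zero actual values, the Error test compares the squared deviation divided by the variance with the one-sided normal quantile at 1-alpha, and sigma is taken as the population standard deviation of the whole series. The data and the predicted values are invented for illustration.

```python
import math
from statistics import NormalDist, pstdev


def rmse(actual, predicted):
    """Root Mean Squared Error over the test portion."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean Absolute Percentage Error (as a fraction); actual must be non-zero."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def error_ratio(actual, predicted, sigma, alpha=0.05):
    """Share of samples whose scaled squared error exceeds the quantile."""
    q = NormalDist().inv_cdf(1.0 - alpha)  # assumed one-sided quantile
    flags = [1 if (a - p) ** 2 / sigma ** 2 > q else 0
             for a, p in zip(actual, predicted)]
    return sum(flags) / len(flags)


series = [10, 12, 11, 13, 12, 14, 13, 15, 14]
split = len(series) - len(series) // 3       # last third is held out
train, test = series[:split], series[split:]
predicted = [13, 14, 16]                     # illustrative forecast of `test`

sigma = pstdev(series)
scores = (rmse(test, predicted), mape(test, predicted),
          error_ratio(test, predicted, sigma))
print(scores)
```

Keeping the three functions separate makes it easy to tabulate them per algorithm and per metric type, as the evaluation does.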
For each subset a hypothesis test for the difference between means
is computed: we suppose that, for every comparison, the two
sampling populations (with mean $\mu$ and standard deviation
$\sigma$) are normally distributed. Considering two distributions
with sample parameters $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$ and
lengths $n_1$, $n_2$, we formulate the hypotheses:
$$H_0 : |\mu_1 - \mu_2| = 0$$
$$H_1 : |\mu_1 - \mu_2| > 0$$
We consider $\mu_x = |\mu_1 - \mu_2|$ and
$$\sigma_x = \sqrt{\frac{\sigma_1^{2}}{n_1} + \frac{\sigma_2^{2}}{n_2}}.$$
Then we calculate the z-score
$$\frac{\mu_x}{\sigma_x}$$
and the significance threshold $t_{1-\alpha}$, which is the
quantile of the t-distribution with $\min\{n_1, n_2\} - 1$ degrees
of freedom. If the z-score is greater than the threshold, then the
two distributions are different and the null hypothesis is
rejected. In Tables 1 and 2, results for different types of
workload and performance data are displayed. We do not report the
computational time taken by the execution of the algorithm, as it
is almost always less than a few seconds, which is reasonable for
the purposes of this study. The first column indicates the type of
parameter shown ($\mu$ is the mean and $\sigma$ is the standard
deviation); columns 2 to 4 show MAPE values for Box and Jenkins,
Robust Linear and Holt-Winters, while columns 5 to 7 and 8 to 10
show, for the three algorithms respectively, the RMSE and the ER.
Every table shows the results grouped with respect to the type of
metric that the series is monitoring. For each group, the mean and
the standard deviation of the results vector are displayed,
together with the output of the previously discussed test. In the
"test" row, the symbol `+` indicates that the null hypothesis has
been rejected in favor of the automated Box and
Jenkins algorithm, while a stands for the acceptance of the null
hypothesis. The results of these tests support the accuracy of our
algorithm in predicting IT time series, with respect to the
considered performance indices. The $\mu$ value for MAPE is never
over 40%, which in the literature is considered a reasonable
threshold for the goodness of the forecast.
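The test for the difference between means described in this paragraph can be sketched as follows. The threshold is passed in rather than computed, because the Python standard library has no t-distribution quantile; the value 1.833 used in the example is the commonly tabulated 0.95 quantile for 9 degrees of freedom (min{n1, n2} - 1 with n1 = n2 = 10). The function name and the input figures are hypothetical.

```python
import math


def mean_difference_test(mu1, sigma1, n1, mu2, sigma2, n2, threshold):
    """Return (z_score, reject_null) for H0: |mu1 - mu2| = 0."""
    mu_x = abs(mu1 - mu2)
    sigma_x = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z_score = mu_x / sigma_x
    return z_score, z_score > threshold


# Illustrative MAPE statistics of two algorithms over 10 series each.
z_score, reject = mean_difference_test(
    mu1=0.20, sigma1=0.05, n1=10,
    mu2=0.30, sigma2=0.08, n2=10,
    threshold=1.833,  # tabulated t quantile, 0.95, 9 degrees of freedom
)
print(round(z_score, 3), reject)
```

Rejecting the null hypothesis only says that the two means differ; which algorithm is favoured follows from comparing the two means themselves.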
[0072] Further, the $\mu$ value for the Error Ratio never
surpasses 0.2, that is, at most 20% of the samples are incorrectly
predicted, which is a reasonable percentage. The results of the
null hypothesis test show that our algorithm is never statistically
outperformed by its counterparts; in particular, it is
significantly better in 43% of the considered subsets of data.
Moreover, amongst all the types of data metrics, the algorithm
performs best on Business Driver (Events) and Storage data. We
present two visual examples of cross-validation which further
support our algorithm.
[0073] FIG. 4 shows the cross-validation performed on a time series
monitoring the number of events occurring in a web server on a
daily basis. This example clearly illustrates the suitability of
our method over the other algorithms for this type of time series.
In general, datasets of this type show a clear lower bound, which
is properly detected by the trampoline identifier, thereby avoiding
infeasible situations due to a possible negative trend in the data.
The instance illustrated in FIG. 4 shows a clear trampoline base
represented by the value 0, as datasets monitoring events cannot
have negative values (as Holt-Winters instead incorrectly
predicts). Furthermore, our automated prediction method detects all
basic seasonal components in the data, so that the forecast fits
the periodic behavior of the series appropriately (unlike the
Robust Linear algorithm, which is unaware of seasonalities). Our
second example, shown in FIG. 2, represents the cross-validation
test on an hourly sampled storage time series, representing the
disk memory used by a machine. Analyzing the data closely shows
that there is a double seasonality, arising from daily and weekly
fluctuations of the memory occupation over the trend of the series.
Capturing this characteristic behavior leads to a more informed
prediction, which correctly follows the recurrence of local and
global peaks in the data. For this specific time series, the
improvement in prediction accuracy of this algorithm with respect
to its counterparts is considerable: it reduces the MAPE by 35% and
the RMSE by 26% against Robust Linear, and by 80% and 75%
respectively against Holt-Winters. Further, none of the predicted
values is considered incorrect (with confidence 95%) and
accordingly the ER for the automated Box and Jenkins algorithm is
0.
* * * * *
References