U.S. patent number 11,216,743 [Application Number 16/103,220] was granted by the patent office on 2022-01-04 for learning sparsity-constrained Gaussian graphical models in anomaly detection.
This patent grant is currently assigned to International Business Machines Corporation. The grantee listed for this patent is International Business Machines Corporation. Invention is credited to Tsuyoshi Ide, Jayant R. Kalagnanam, Matthew Menickelly, Dzung Phan.
United States Patent 11,216,743
Phan, et al.
January 4, 2022
Learning sparsity-constrained Gaussian graphical models in anomaly detection
Abstract
A first dependency graph is constructed based on a first data
set by solving an objective function constrained with a maximum
number of non-zeros and formulated with a regularization term
comprising a quadratic penalty to control sparsity. The quadratic
penalty in constructing the first dependency graph is determined
as a function of the first data set. A second dependency graph is
constructed based on a second data set by solving the objective
function constrained with the maximum number of non-zeros and
formulated with the regularization term comprising a quadratic
penalty. The quadratic penalty in constructing the second
dependency graph is determined as a function of the first data set
and the second data set. An anomaly score is determined for each of
a plurality of sensors based on comparing the first dependency
graph and the second dependency graph, nodes of which represent
sensors.
Inventors: Phan; Dzung (Ossining, NY), Menickelly; Matthew (Downers Grove, IL), Kalagnanam; Jayant R. (Briarcliff Manor, NY), Ide; Tsuyoshi (Harrison, NY)
Applicant: International Business Machines Corporation (Armonk, NY, US)
Assignee: International Business Machines Corporation (Armonk, NY)
Family ID: 1000006031526
Appl. No.: 16/103,220
Filed: August 14, 2018
Prior Publication Data
Document Identifier: US 20200057956 A1
Publication Date: Feb 20, 2020
Current U.S. Class: 1/1
Current CPC Class: G06F 16/9024 (20190101); G06N 20/00 (20190101); G06F 11/3089 (20130101); G06N 5/003 (20130101); G06N 7/02 (20130101); G06F 11/3006 (20130101)
Current International Class: G06N 7/02 (20060101); G06N 5/00 (20060101); G06F 11/30 (20060101); G06N 20/00 (20190101); G06F 16/901 (20190101)
References Cited
[Referenced By]
U.S. Patent Documents
Other References
Alippi et al., Learning Causal Dependencies to Detect and Diagnose
Faults in Sensor Networks, IEEE, 2014, Total pp. 8 (Year: 2014).
cited by examiner .
He et al., A Dependency Graph Approach for Fault Detection and
Localization Towards Secure Smart Grid, IEEE Transactions on Smart
Grid, vol. 2, No. 2, Jun. 2011, pp. 342-351 (Year: 2011). cited by
examiner .
Liu, C., et al., "An unsupervised spatiotemporal graphical modeling
approach to anomaly detection in distributed CPS," ICCPS '16
Proceedings of the 7th International Conference on Cyber-Physical
Systems, Apr. 2016, 10 pages, Article No. 1. cited by applicant
.
Chakrabarti, A., et al., "Robust Anomaly Detection for Large-Scale
Sensor Data," BuildSys '16 Proceedings of the 3rd ACM International
Conference on Systems for Energy-Efficient Built Environments, Nov.
16-17, 2016, pp. 31-40. cited by applicant .
Phan, D., et al., "A Novel l0-Constrained Gaussian Graphical Model
for Anomaly Localization", 2017 IEEE International Conference on
Data Mining Workshops (ICDMW), New Orleans, Louisiana, USA, Nov.
2017, pp. 830-833, (Grace Period Disclosure). cited by applicant
.
Friedman, J., et al., "Sparse inverse covariance estimation with
the graphical lasso", Biostatistics (2007), Biostatistics Advance
Access, Dec. 12, 2007, pp. 1-10. cited by applicant .
Scheinberg, K., et al., "SINCO--a greedy coordinate ascent method
for sparse inverse covariance selection problem",
http://www.optimization-online.org/DB_FILE/2009/07/2359.pdf, Jul.
31, 2009, 21 pages. cited by applicant .
Hsieh, C.-J., et al., "QUIC: Quadratic Approximation for Sparse
Inverse Covariance Estimation", Journal of Machine Learning
Research (2014), Submitted Apr. 13, 2014, Revised Apr. 14, 2014,
Published Oct. 14, 2014, pp. 2911-2947, vol. 15. cited by applicant
.
Scheinberg, K., et al., "Practical Inexact Proximal Quasi-Newton
Method with Global Complexity Analysis",
https://arxiv.org/pdf/1311.6547v4.pdf, Jul. 15, 2015, pp. 1-29.
cited by applicant .
Olsen, P.A., et al., "Newton-Like Methods for Sparse Inverse
Covariance Estimation",
http://papers.nips.cc/paper/4601-newton-like-methods-for-sparse-inverse-covariance-estimation.pdf, Accessed on Aug. 14, 2018, pp. 1-9. cited
by applicant .
Lu, Z., et al., "Sparse Approximation via Penalty Decomposition
Methods", Siam J. Optim., Submitted Sep. 9, 2010, Accepted Sep. 3,
2013, Published online Dec. 5, 2013, pp. 2448-2478, vol. 23, No. 4.
cited by applicant .
Ide, T., et al., "Proximity-Based Anomaly Detection using Sparse
Structure Learning", Proceedings of the 2009 SIAM International
Conference on Data Mining, Apr. 30-May 2, 2009, pp. 97-108. cited
by applicant.
Primary Examiner: Chen; Alan
Attorney, Agent or Firm: Scully, Scott, Murphy &
Presser, P.C. Morris; Daniel P.
Claims
What is claimed is:
1. A method, comprising: receiving a first data set comprising
sensor data detected by a plurality of sensors coupled to
equipments in operation during a first period of time; constructing
a first dependency graph based on the first data set by solving a
first objective function constrained with a maximum number of
non-zeros and formulated with a regularization term comprising a
first quadratic penalty to control sparsity, wherein the first
quadratic penalty is determined as a function of the first data set
in constructing the first dependency graph; receiving a second data
set comprising sensor data detected by the plurality of sensors
coupled to the equipments in operation during a second period of
time; constructing a second dependency graph based on the second
data set by solving the first objective function constrained with
the maximum number of non-zeros and formulated with the
regularization term comprising a second quadratic penalty, wherein
the second quadratic penalty is determined as a function of the
first data set and the second data set in constructing the second
dependency graph; and determining an anomaly score for each of the
plurality of sensors based on comparing the first dependency graph
and the second dependency graph.
2. The method of claim 1, further comprising: responsive to
determining that an anomaly score associated with a sensor in the
plurality of sensors meets a threshold value indicative of abnormal
functioning, automatically performing a corrective action to
correct equipment coupled to the sensor.
3. The method of claim 1, wherein the first objective function is
solved by a projected gradient algorithm.
4. The method of claim 3, wherein the projected gradient algorithm
embeds an approximate Newton method, wherein feasibility with
respect to sparsity constraint is handled via projection, and a
symmetric positive-definiteness of iterates is ensured through a
line-search procedure.
5. The method of claim 1, wherein the determining of the anomaly
score for each of the plurality of sensors comprises implementing a
conditional expected Kullback-Leibler divergence between the first
dependency graph and the second dependency graph.
6. The method of claim 1, wherein the determining of the anomaly
score for each of the plurality of sensors comprises implementing a
stochastic nearest neighbors algorithm measuring dissimilarity
between neighborhood of i-th variable representing a sensor between
the first dependency graph and the second dependency graph.
7. The method of claim 1, wherein the determining of the anomaly
score for each of the plurality of sensors comprises implementing a
sparsest subgraph approximation based on the first dependency graph
and the second dependency graph.
8. The method of claim 1, wherein the equipments comprise
equipments of an industrial process.
9. The method of claim 1, wherein the equipments comprise
equipments of a manufacturing process manufacturing a product.
10. A computer readable storage device storing a program of
instructions executable by a machine to perform a method
comprising: receiving a first data set comprising sensor data
detected by a plurality of sensors coupled to equipments in
operation during a first period of time; constructing a first
dependency graph based on the first data set by solving a first
objective function constrained with a maximum number of non-zeros
and formulated with a regularization term comprising a first
quadratic penalty to control sparsity, wherein the first quadratic
penalty is determined as a function of the first data set in
constructing the first dependency graph; receiving a second data
set comprising sensor data detected by the plurality of sensors
coupled to the equipments in operation during a second period of
time; constructing a second dependency graph based on the second
data set by solving the first objective function constrained with
the maximum number of non-zeros and formulated with the
regularization term comprising a second quadratic penalty, wherein
the second quadratic penalty is determined as a function of the
first data set and the second data set in constructing the second
dependency graph; and determining an anomaly score for each of the
plurality of sensors based on comparing the first dependency graph
and the second dependency graph.
11. The computer readable storage device of claim 10, further
comprising: responsive to determining that an anomaly score
associated with a sensor in the plurality of sensors meets a
threshold value indicative of abnormal functioning, automatically
performing a corrective action to correct equipment coupled to the
sensor.
12. The computer readable storage device of claim 10, wherein the
first objective function is solved by a projected gradient
algorithm.
13. The computer readable storage device of claim 12, wherein the
projected gradient algorithm embeds an approximate Newton method,
wherein feasibility with respect to sparsity constraint is handled
via projection, and a symmetric positive-definiteness of iterates
is ensured through a line-search procedure.
14. The computer readable storage device of claim 10, wherein the
determining of the anomaly score for each of the plurality of
sensors comprises implementing a conditional expected
Kullback-Leibler divergence between the first dependency graph and
the second dependency graph.
15. The computer readable storage device of claim 10, wherein the
determining of the anomaly score for each of the plurality of
sensors comprises implementing a stochastic nearest neighbors
algorithm measuring dissimilarity between neighborhood of i-th
variable representing a sensor between the first dependency graph
and the second dependency graph.
16. The computer readable storage device of claim 10, wherein the
determining of the anomaly score for each of the plurality of
sensors comprises implementing a sparsest subgraph approximation
based on the first dependency graph and the second dependency
graph.
17. A system, comprising: a hardware processor coupled with a
communication interface; a memory device coupled to the hardware
processor; the hardware processor operable to receive via the
communication interface, a first data set comprising sensor data
detected by a plurality of sensors coupled to equipments in
operation during a first period of time; the hardware processor
operable to construct a first dependency graph based on the first
data set by solving a first objective function constrained with a
maximum number of non-zeros and formulated with a regularization
term comprising a first quadratic penalty to control sparsity,
wherein the first quadratic penalty is determined as a function of
the first data set in constructing the first dependency graph; the
hardware processor operable to receive a second data set comprising
sensor data detected by the plurality of sensors coupled to the
equipments in operation during a second period of time; the
hardware processor operable to construct a second dependency graph
based on the second data set by solving the first objective
function constrained with the maximum number of non-zeros and
formulated with the regularization term comprising a second
quadratic penalty, wherein the second quadratic penalty is
determined as a function of the first data set and the second data
set in constructing the second dependency graph; and the hardware
processor operable to determine an anomaly score for each of the
plurality of sensors based on comparing the first dependency graph
and the second dependency graph.
18. The system of claim 17, wherein the hardware processor is
further operable to, responsive to determining that an anomaly
score associated with a sensor in the plurality of sensors meets a
threshold value indicative of abnormal functioning, automatically
trigger a corrective action to correct an equipment coupled to the
sensor.
19. The system of claim 17, wherein the first objective function is
solved by a projected gradient algorithm.
20. The system of claim 19, wherein the projected gradient
algorithm embeds an approximate Newton method, wherein feasibility
with respect to sparsity constraint is handled via projection, and
a symmetric positive-definiteness of iterates is ensured through a
line-search procedure.
Description
STATEMENT REGARDING PRIOR DISCLOSURES BY THE INVENTOR OR A JOINT
INVENTOR
The following disclosure is submitted under 35 U.S.C.
102(b)(1)(A):
D. T. Phan, T. Ide, J. Kalagnanam, M. Menickelly and K. Scheinberg,
"A Novel l0-Constrained Gaussian Graphical Model for Anomaly
Localization," 2017 IEEE International Conference on Data Mining
Workshops (ICDMW), New Orleans, La., USA, November 2017, pp.
830-833.
BACKGROUND
The present disclosure relates to computer-implemented anomaly
detection and learning of dependency graphs via which anomalies are
detected, for example, in industries such as manufacturing and
heavy industries.
In industrial or manufacturing processes, or other like processes,
a network of sensors is deployed to enable automatic monitoring of
various operating conditions. Data mining techniques, for example,
can convert low-level sensor data collected from Internet-of-Things
(IoT) devices into actionable insights, for example, for
condition-based maintenance. For instance, identifying an unusual
occurrence or change in data can signal an anomaly.
In traditional problem settings, a goal of anomaly detection is to
compute or determine the degree of anomalousness for a multivariate
measurement, giving an overall anomaly score. In sensor networks,
for example, wireless sensor networks, it may be inadequate simply
to indicate whether or not a network is behaving anomalously, for
example, to perform anomaly detection. For example, if a sensor
network takes measurements from different car parts of a prototype
automobile on a test track, then it may be also valuable to the
field engineers to know which car parts are contributing to an
anomalous behavior, for example, in addition to knowing that the
car as a whole is behaving anomalously. Similar problem settings
can be found across different application domains, e.g., monitoring
mechanical stresses in buildings and bridges, identifying the
sources of flooding, and finding sources of threats in computer
networks.
BRIEF SUMMARY
A method and system, for example, which detect anomalies in sensor
networks, may be provided. The method, in one aspect, may include
receiving a first data set comprising sensor data detected by a
plurality of sensors coupled to equipments in operation during a
first period of time. The method may also include constructing a
first dependency graph based on the first data set by solving a
first objective function constrained with a maximum number of
non-zeros and formulated with a regularization term comprising a
quadratic penalty to control sparsity, wherein the quadratic
penalty is determined as a function of the first data set in
constructing the first dependency graph. The method may further
include receiving a second data set comprising sensor data detected
by the plurality of sensors coupled to the equipments in operation
during a second period of time. The method may also include
constructing a second dependency graph based on the second data set
by solving the first objective function constrained with the
maximum number of non-zeros and formulated with the regularization
term comprising a quadratic penalty, wherein the quadratic penalty
is determined as a function of the first data set and the second
data set in constructing the second dependency graph. The method
may also include determining an anomaly score for each of the
plurality of sensors based on comparing the first dependency graph
and the second dependency graph. In a further aspect, the method
may also include, responsive to determining that an anomaly score
associated with a sensor in the plurality of sensors meets a
threshold value indicative of abnormal functioning, automatically
performing a corrective action to correct equipment coupled to the
sensor.
A system, in one aspect, may include a hardware processor coupled
with a communication interface and a memory device coupled to the
hardware processor. The hardware processor may be operable to
receive via the communication interface, a first data set
comprising sensor data detected by a plurality of sensors coupled
to equipments in operation during a first period of time. The
hardware processor may be further operable to construct a first
dependency graph based on the first data set by solving a first
objective function constrained with a maximum number of non-zeros
and formulated with a regularization term comprising a quadratic
penalty to control sparsity, wherein the quadratic penalty is
determined as a function of the first data set in constructing the
first dependency graph. The hardware processor may be further
operable to receive a second data set comprising sensor data
detected by the plurality of sensors coupled to the equipments in
operation during a second period of time. The hardware processor
may be further operable to construct a second dependency graph
based on the second data set by solving the first objective
function constrained with the maximum number of non-zeros and
formulated with the regularization term comprising a quadratic
penalty, wherein the quadratic penalty is determined as a function
of the first data set and the second data set in constructing the
second dependency graph. The hardware processor may be further
operable to determine an anomaly score for each of the plurality of
sensors based on comparing the first dependency graph and the
second dependency graph. In a further aspect, the hardware
processor may be further operable to, responsive to determining
that an anomaly score associated with a sensor in the plurality of
sensors meets a threshold value indicative of abnormal functioning,
automatically perform a corrective action to correct equipment
coupled to the sensor.
A computer readable storage medium storing a program of
instructions executable by a machine to perform one or more methods
described herein also may be provided.
Further features as well as the structure and operation of various
embodiments are described in detail below with reference to the
accompanying drawings. In the drawings, like reference numbers
indicate identical or functionally similar elements.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating anomaly detection for multivariate
noisy sensor data in one embodiment.
FIG. 2 is a diagram illustrating computing of a change score of a
sensor in a dependency graph in one embodiment.
FIG. 3 is a diagram showing components of a system in one
embodiment, which performs anomaly localization for multivariate time series datasets and detects anomalies in sensor data.
FIG. 4 is a diagram illustrating a screen display of an analytics
tool, which integrates an anomaly detection technique, in one embodiment.
FIG. 5 illustrates a schematic of an example computer or processing
system that may implement a system in one embodiment of the present
disclosure.
DETAILED DESCRIPTION
In embodiments, methods, systems and techniques are provided, which detect anomalies in sensor networks. In embodiments, methods, systems and techniques may perform anomaly localization, where a variable-wise anomaly score is desired. In embodiments, methods, systems and techniques include learning a dependency graph of multivariate data, which may be noisy. The methods, systems and techniques automatically identify the number of dependencies and perform transfer learning from a training phase to a testing phase.
Learned dependency graphs are used to detect anomalies in sensor
networks. In one aspect, a learning performed according to an
embodiment in the present disclosure may be useful in tracking
inter-variable dependency, reducing the number of runs to obtain a
graph that is sparse enough for interpretability (e.g., in
detecting anomalies), while capturing real dependencies in the
multivariate data.
In an embodiment, methods, systems and techniques may be
implemented as a computer-implemented solution library, for
instance, which can be integrated or plugged into a
computer-implemented platform or server which may function as
an analytics tool, for example, in performing operations management in an industry.
In an embodiment, an optimization model is generated, which automatically identifies the number of conditional dependencies for a pair of variables and transfers a learned sparsity pattern and magnitude of correlations from a normal state to a testing phase.
In an embodiment, a two-step algorithm based on projections and active set identification is presented. In an embodiment, a full gradient may be used for solution quality, and a reduced subspace Newton direction may be used to accelerate it. Feasibility with respect to membership in the sparsity constraint is handled via projection; symmetric positive-definiteness is ensured through a line search; and a continuation step improves solution quality and running time.
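The line-search safeguard mentioned above can be sketched as follows. This is a hypothetical backtracking scheme that tests positive definiteness via a Cholesky factorization; the function name and parameters are illustrative, not the patent's exact procedure:

```python
import numpy as np

def pd_line_search(X, direction, beta=0.5, max_halvings=50):
    """Backtrack the step size until X + t * direction is symmetric
    positive definite, checked via a Cholesky factorization.
    Hypothetical sketch of the line-search safeguard."""
    t = 1.0
    for _ in range(max_halvings):
        candidate = X + t * direction
        try:
            np.linalg.cholesky(candidate)  # succeeds iff candidate is PD
            return candidate, t
        except np.linalg.LinAlgError:
            t *= beta  # shrink the step and retry
    return X, 0.0  # no PD step found; keep the current iterate
```

For example, starting from the identity with a step that would overshoot past positive definiteness, the routine halves the step until the Cholesky test passes.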
Anomaly detection, for example, via learned sparse dependency
graph, in one embodiment identifies asset behavior that is outside
of a defined norm that constitutes normal behavior. In an
embodiment, anomaly detection algorithm is based on a probabilistic
graphical model, and may detect anomalies with higher accuracy and
fewer false alarms. In an embodiment, early detection of anomalies
or faults in a system can lead to considerable savings of time and
cost. A system and/or method of the present disclosure, for
example, reduces maintenance cost, for example, incurred in vessel
monitoring, increases asset utilization, prevents unexpected
equipment breakdowns by identifying anomalous conditions that can
lead to accidents or equipment damage, and reduces false positive
and false negative alerts. In industrial processing, for example,
productivity of assets may be enhanced, and downtime and
maintenance costs may be reduced.
A system and/or method, in an embodiment, solves a sparsity-constrained model and allows transfer learning via an algorithm.
connectivity pattern and dependency magnitude from the normal
training phase to the testing phase for the dependency matrix. In
an embodiment, an additional regularization term may be included,
which keeps the magnitudes of precision matrix entries to
a manageable amount. In an embodiment, a system and/or method selects
the sparsity level of the dependency graphs a priori.
FIG. 1 is a diagram illustrating anomaly detection for multivariate
noisy sensor data in one embodiment. A model building phase 102 in
an embodiment includes using past dataset associated with sensors,
for example, operating in a domain (e.g., industry domain), for
example, in normal condition (e.g., without failure). Such dataset
is considered to be normal data. For instance, dataset 104, for
example, of past period (e.g., selected historical time period,
e.g., a number of time series sample data points (e.g., 100 sample
data points) associated with each sensor), may be obtained and a
dependency graph model 106 is built based on the normal data. The
dependency graph model includes nodes which represent sensors and
links between the sensors during operating conditions.
During a runtime phase 108, a current dependency graph 112 from
real-time sensor data 110 (data points associated with current
period of time, or window of time) is built or computed. The
dependency graph 106 associated with the normal state is compared
with the current dependency graph 112. A change in the dependency
among or between the nodes indicates anomaly in the current
processing in the domain. In this way, anomalies among sensors are
detected in real-world situations to predict when and where
maintenance is required.
For example, at 114, a dependency graph 106 may be discovered or
built based on the past sensor data 104 taken during normal
operations of a process (e.g., manufacturing or industrial
process). At 116, a current dependency graph 112 is computed from
real-time sensor data 110. The dependency graphs 106 and 112 may be
compared to identify or determine any anomaly in the current
operations of the process.
As an example application, nodes in a dependency graph represent
sensors, for example, monitoring different components of a process
such as a manufacturing or industrial process. Dependency arcs or
edges between nodes represent that the data of the sensors move
together, or have a pattern of moving together. If, for example,
two nodes are conditionally independent, then the value of the dependency edge is zero. In one embodiment, anomaly detection is carried out by comparing the change of the dependency graphs between the training phase and the testing phase.
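The graph-comparison step described above can be illustrated with a simplified per-node score: for each node, sum the absolute differences of its incident edge weights between the two graphs. This edge-difference sum is an editorial illustration only, not the patent's scoring formula:

```python
import numpy as np

def edge_change_scores(G_train, G_test):
    """Per-node change score: for each node i, sum the absolute
    differences of its incident edge weights between the training
    and testing dependency graphs (given as weight matrices).
    Simplified illustration of graph comparison."""
    diff = np.abs(G_train - G_test)
    np.fill_diagonal(diff, 0.0)   # ignore self-dependencies
    return diff.sum(axis=1)       # one score per sensor/node
```

A sensor whose dependencies vanish or change strength between phases receives a larger score than one whose neighborhood is unchanged.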
FIG. 2 is a diagram illustrating computing of a change score of a
sensor in a dependency graph in one embodiment. A preparation phase
202, in an embodiment, includes obtaining or receiving data matrix
X, which includes data such as sensor data output during an
operation of a manufacturing or industrial process, for example, by
a system coupled with sensors for operating the manufacturing or
industrial process. For instance, for each sensor, a time series of
data points (e.g., of a selected time window) associated with a
system operating under normal operating conditions (e.g., no
errors, faults or failures of equipment or process), may be
obtained.
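The preparation-phase data matrix and the sample covariance it yields can be sketched as follows; the shapes and the random data here are illustrative assumptions, with rows as time samples and columns as sensors:

```python
import numpy as np

# Hypothetical preparation-phase sketch: rows of X are time samples,
# columns are sensors; S is the sample covariance fed to the model.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # 100 time points, 5 sensors
S = np.cov(X, rowvar=False)         # 5 x 5 sample covariance matrix
```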
At 206, the data of data matrix X is input to an optimization
model, Equation (a), and Equation (a) is solved subject to a
specified constraint.
In an embodiment, the constraint specifies a maximum number of
non-zero entries. Non-zero entries indicate a relationship between
sensors. The partial correlation between two sensors, for example,
is non-zero when the corresponding element of dependency matrix is
non-zero. At 208, a solution to Equation (a) creates a dependency
graph G (e.g., a Gaussian graphical model).
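The link between non-zero entries of the dependency matrix and non-zero partial correlations noted above is the standard Gaussian graphical model identity r_ij = -P_ij / sqrt(P_ii * P_jj); a brief sketch (function name is illustrative):

```python
import numpy as np

def partial_correlation(P):
    """Partial correlations from a precision (dependency) matrix P:
    r_ij = -P_ij / sqrt(P_ii * P_jj). A zero entry of P gives a zero
    partial correlation, i.e., conditional independence."""
    d = np.sqrt(np.diag(P))
    R = -P / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R
```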
An operation phase 210, for example, at time t, in an embodiment,
includes obtaining or receiving data matrix X.sup.(t) 212, for
example, sensor data output during a manufacturing or industrial
process at time window (t), which may be the current time or near
current time. At 214, the data of data matrix X.sup.(t) 212 is
input to the optimization model, Equation (a), and Equation (a) is
solved subject to a specified constraint. At 216, the solution of
the optimization model (Equation (a)) creates a dependency graph
G.sub.t, (e.g., a Gaussian graphical model) associated with the
current data (operations data of time t window).
In another aspect, at 214, Equation (c) may be solved using both data matrix X and data matrix X.sup.(t) 212 to create the dependency graph G.sub.t, for example, in lieu of Equation (a).
The dependency graph G 208 and the dependency graph G.sub.t 216 are
input to another model, Equation (b), and Equation (b) is solved at
218. Solving Equation (b) produces a change score a.sub.i (anomaly
score for sensor i) at time t 220, for each sensor considered. For
example, an anomaly score for each sensor is determined, for
example, by solving Equation (b).
In an embodiment, the dependency graph is learned by solving the optimization model (Equation (a)):

$$\min_{X \succ 0} \; -\log\det X + \mathrm{tr}(SX) + \frac{\rho}{2}\|X\|_F^2 \quad \text{subject to} \quad \|X\|_0 \le \kappa \tag{a}$$

The change score for the i-th variable (Equation (b)) is a conditional expected Kullback-Leibler divergence:

$$a_i = \int p(x_{-i} \mid D)\, dx_{-i} \int p(x_i \mid x_{-i}, D)\, \ln\frac{p(x_i \mid x_{-i}, D)}{p(x_i \mid x_{-i}, D')}\, dx_i \tag{b}$$

where D is the training data set and D' is the testing data set. A detailed formula is given in Equation (8) below. Equation (c), which transfers the training-phase dependency graph, is:

$$\min_{X \succ 0} \; -\log\det X + \mathrm{tr}(SX) + \frac{\rho}{2}\|X - X_{\mathrm{train}}\|_F^2 \quad \text{subject to} \quad \|X\|_0 \le \kappa \tag{c}$$
In the above formulations, $S$ represents the sample covariance of the user-inputted data, $X$ represents the dependency graph to be learned at time $t$ (also referred to as $X^{(t)}$ above), $X_{\mathrm{train}}$ represents the dependency graph from the normal operation phase (i.e., the training phase), $\rho$ represents a regularization parameter, and $\|\cdot\|_F$ denotes the Frobenius norm. Details of the formulations are described further below.
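For Gaussian graphical models, the conditionals appearing in Equation (b) are univariate Gaussians, so the score reduces to expected univariate Gaussian KL terms. A minimal illustrative sketch using standard textbook identities (the helper names are hypothetical, not the patent's implementation):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) between univariate Gaussians,
    the building block of the conditional expected KL score.
    Standard closed form."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def conditional_params(P, i, z):
    """For a GGM with precision matrix P, p(z_i | z_-i) is Gaussian with
    variance 1/P_ii and mean -(1/P_ii) * sum_{j != i} P_ij z_j.
    Textbook identity, shown as a hypothetical helper."""
    var = 1.0 / P[i, i]
    mean = -var * (P[i] @ z - P[i, i] * z[i])
    return mean, var
```

Averaging `kl_gauss` of the two models' conditionals over test samples gives a per-variable change score of the form in Equation (b).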
In an embodiment, the $\ell_0$ constraint ($\|X\|_0 \le \kappa$) guarantees that the solution admits a level of sparsity. In an embodiment, the quadratic penalty $\frac{\rho}{2}\|X - X_{\mathrm{train}}\|_F^2$ encourages the capacity of selecting groups in the presence of highly correlated variables and transfer learning. The models of Equations (a) and (b) are non-convex models, which may be challenging to solve, and in embodiments, optimization algorithms are provided to solve such non-convex models.
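A minimal sketch of evaluating the penalized objective and projecting onto the $\ell_0$ ball follows. The function names are hypothetical, and a practical implementation would also preserve symmetry and the diagonal during projection; this is an editorial illustration of the model, not the patent's code:

```python
import numpy as np

def objective(X, S, X_train, rho):
    """-log det X + tr(S X) + (rho/2) * ||X - X_train||_F^2,
    matching Equation (c); for Equation (a), pass X_train = 0."""
    sign, logdet = np.linalg.slogdet(X)  # sign is +1 for X > 0
    return -logdet + np.trace(S @ X) + 0.5 * rho * np.linalg.norm(X - X_train) ** 2

def project_l0(X, kappa):
    """Keep the kappa largest-magnitude entries of X (projection onto
    ||X||_0 <= kappa); others are zeroed. Ties may keep extra entries
    in this simple sketch."""
    flat = np.abs(X).ravel()
    if kappa >= flat.size:
        return X.copy()
    thresh = np.partition(flat, -kappa)[-kappa]
    return np.where(np.abs(X) >= thresh, X, 0.0)
```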
In an embodiment, the problem of anomaly localization in a sensor
network for multivariate time-series data is considered by
computing anomaly scores for each variable separately. A system
and/or method in an embodiment evaluate a difference measure using
the conditional expected Kullback-Leibler divergence between two
sparse Gaussian graphical models (GGMs) learned from different
sliding windows of the dataset. To estimate the sparse GGMs, an
optimization model is provided and the system and/or method in an
embodiment constrain the number of nonzeros (l.sub.0 norm) in the
learned inverse covariance matrix and apply an additional
l.sub.2-regularization in the objective. In an embodiment, a
projected gradient algorithm efficiently solves this nonconvex
problem. Numerical evidence supports the benefits of using the
model and method over the usual convex relaxations (i.e.,
l.sub.1-regularized models) for learning sparse GGMs in this
anomaly localization setting.
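One iteration of a projected gradient scheme for this nonconvex problem can be sketched by combining the gradient of the smooth part ($S - X^{-1} + \rho(X - X_{\mathrm{train}})$), the $\ell_0$ projection, and the positive-definiteness line search. This is a hypothetical simplification of the patent's algorithm, with illustrative step-size and iteration-limit choices:

```python
import numpy as np

def projected_gradient_step(X, S, X_train, rho, kappa, step=0.1):
    """One projected-gradient iteration for
    min -log det X + tr(SX) + (rho/2)||X - X_train||_F^2, ||X||_0 <= kappa.
    Hypothetical sketch: take a gradient step, project onto the l0 ball,
    then halve the step until the iterate stays positive definite."""
    grad = S - np.linalg.inv(X) + rho * (X - X_train)
    t = step
    for _ in range(40):
        Y = X - t * grad
        # keep the kappa largest-magnitude entries
        flat = np.abs(Y).ravel()
        thresh = np.partition(flat, -kappa)[-kappa] if kappa < flat.size else 0.0
        Y = np.where(np.abs(Y) >= thresh, Y, 0.0)
        Y = 0.5 * (Y + Y.T)  # re-symmetrize after projection
        try:
            np.linalg.cholesky(Y)  # positive-definiteness check
            return Y
        except np.linalg.LinAlgError:
            t *= 0.5
    return X
```

Iterating this step until the change in the objective is small yields a sparse, positive-definite estimate of the dependency graph.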
Anomaly localization of the present disclosure in an embodiment
determines or computes a variable-wise anomaly score. In wireless
sensor networks, e.g., in addition to determining whether or not a
network is behaving anomalously, it is useful to determine or know
what aspect of the network is anomalous. For example, if a sensor
network takes measurements from different car parts of a prototype
automobile on a test track, then it is valuable to the field
engineers to know which car parts are contributing to an anomalous
behavior, e.g., more than simply knowing that the car as a whole is
behaving anomalously. Similar problem settings can be found across
different application domains: monitoring mechanical stresses in
buildings and bridges, identifying the sources of flooding or
forest fires, pinpointing the change points in stock price time
series, finding sources of serious threats in computer networks,
and identifying car loss.
In an embodiment, sparse dependency graphs (e.g., GGMs) are learned
in the context of anomaly localization. In an embodiment, a model
is presented to learn a sparse dependency graph where both the
sparsity pattern and the numeric values are taken into account. A
GGM-learning model employing both an ℓ₀ constraint and an
ℓ₂-regularization, along with an optimization algorithm to
solve it, outperforms other models, including typical
ℓ₁-regularized models. In one aspect, the system and/or method
leverage the conditional expected Kullback-Leibler (KL) divergence
method in computing anomaly scores for each variable. A comparative
study conducted on various GGM-learning models and scoring methods
for assigning anomaly scores demonstrates the efficiency of a system
and method in an embodiment. Using ℓ₀ norm-based methods can
present computationally challenging issues. The system and/or
method in an embodiment provide ℓ₀-constrained GGM learning
algorithms in the anomaly detection setting. In an embodiment, an
ℓ₀-constrained ℓ₂-regularized model learns a sparse
dependency graph for anomaly localization, and a proximal gradient
algorithm handles both the ℓ₀ constraint and a
positive-definite constraint.
In an embodiment, the following notations are used. For any
X ∈ R^{n×n}, ∇f(X) denotes the gradient of f, which is a matrix in
R^{n×n}. The notation tr(X) denotes the trace operator, i.e.,
Σ_{i=1}^n X_{i,i}. The vectorization operator vec(X) transforms X
into a vector in R^{n²} by stacking the columns from left to right.
σ₁(X) and σ_n(X) denote the minimum and maximum eigenvalues of X,
respectively. Given an index set F ⊆ {1, . . . , n} × {1, . . . , n},
[X]_F denotes the matrix obtained from the identity matrix I_n by
replacing the (i, j)-th entry by X_{i,j} for all (i, j) ∈ F. Let ⊗
denote the Kronecker product of matrices, and let e_i denote a unit
vector with a 1 in the i-th coordinate. For any finite set S, |S|
denotes the number of elements of S.
Learning Sparse Graphical Models
The following describes optimization models that the system and/or
method use in an embodiment to learn sparse precision matrices.
Given m data samples {y₁, . . . , y_m} ⊂ R^n, the system in an
embodiment considers fitting an n-variate Gaussian distribution
N(μ, X^{-1}) to the data, where μ denotes the mean and X is the
precision matrix. In an embodiment, it is assumed the data is
centered so that μ = 0. Zeros in X indicate conditional independence
between two variables. To estimate the unknown parameter X, a
maximum likelihood problem as follows may be solved:

    min_{X ≻ 0}  −log det(X) + tr(SX),    (1)

where X ≻ 0 denotes a positive-definite constraint. In an
embodiment, the empirical covariance matrix is defined as

    S = (1/m) Σ_{i=1}^m y_i y_iᵀ.

The solution X ∈ R^{n×n} is the estimated inverse covariance, which
describes a Gaussian graphical model (GGM). However, in general,
due to noise in the data, X obtained from Equation (1) is a dense
matrix, and thus the conditional dependencies suggested by X may
reveal very little. For reasons of interpretability, sparsity in
the matrix X is desired.
The sparse graphs for anomaly localization have been computed by
solving the well-studied ℓ₁-regularized maximum likelihood
problem, i.e.,

    min_{X ≻ 0}  −log det(X) + tr(SX) + λ‖X‖₁,    (2)

where ‖X‖₁ = Σ_{i,j=1}^n |X_{i,j}| is the element-wise ℓ₁ norm of
the matrix X, and λ > 0 is a regularization parameter to control
sparsity. Equation (2) is a convex problem, and generally admits a
unique global minimizer. Thus, given S, selecting a sparse graph is
a matter of computing a regularization path by varying λ and
cross-validating the solutions along the path. Since sensor data of
industrial applications are very noisy in general, some useful
interaction information among the variables in the graph can be
omitted.
Enforcing sparsity in the constraint set instead of in the
objective function may avoid bias. In an embodiment, the system
and/or method of the present disclosure provide and solve a model
that estimates sparse graphs. In an embodiment, the system and/or
method directly constrain sparsity by specifying a maximally
allowable number of nonzeros κ in the optimal solution and add a
quadratic penalty:

    min_{X ≻ 0}  −log det(X) + tr(SX) + (λ/2)‖X‖_F²  s.t.  ‖X‖₀ ≤ κ,    (3)

where λ > 0 is a regularization parameter and the Frobenius norm
term ‖X‖_F² = Σ_{i,j=1}^n |X_{i,j}|² is also referred to as
ℓ₂-regularization. The present disclosure denotes by ‖·‖₀ the
number of nonzeros of the argument, the so-called ℓ₀-norm. The
ℓ₀-constraint guarantees that the solution admits a level of
sparsity, while the quadratic penalty encourages the capacity of
selecting groups in the presence of highly correlated variables.
Without the regularization, some entries of the purely
ℓ₀-constrained model can have relatively large magnitudes. The
associated anomaly scores significantly dominate the others, and
thus some faulty variables can be overlooked. Hence, in an
embodiment, the ℓ₂-regularization term keeps the magnitudes of
all entries uniformly similar. From a computational point of view,
the ℓ₂-term makes the problem of Equation (3) better-behaved:
the objective function is strongly convex and the optimal solution
set is bounded. Thus, it is expected to be solved more efficiently.
In an embodiment, the ℓ₀-constraint and the ℓ₂-regularization
are combined. Because of the presence of the ℓ₀ constraint, it
may be challenging to solve the problem of Equation (3); in an
embodiment, an efficient proximal gradient algorithm is presented
to handle the problem.
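The smooth part of the model in Equation (3) can be made concrete in a few lines of code. The sketch below (function names are illustrative, not from the patent) evaluates f(X) = −log det(X) + tr(SX) + (λ/2)‖X‖_F² and its gradient ∇f(X) = S − X^{-1} + λX, which are the two quantities every first-order method in this disclosure repeatedly needs:

```python
import numpy as np

def objective(X, S, lam):
    """Smooth objective of Eq. (3): -log det(X) + tr(SX) + (lam/2)*||X||_F^2."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X) + 0.5 * lam * np.sum(X * X)

def gradient(X, S, lam):
    """Gradient of the smooth objective: S - X^{-1} + lam*X."""
    return S - np.linalg.inv(X) + lam * X
```

A quick sanity check is to compare the analytic gradient against a central finite difference on a single entry.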
In an embodiment, the model of Equation (3) can be used to learn
dependency graphs in both the training phase and the testing phase,
for example, learning a sparse GGM for anomaly detection. In an
embodiment, the system and/or method may also use the following for
the testing phase:

    min_{X ≻ 0}  −log det(X) + tr(SX) + (λ/2)‖X − X_normal‖_F²  s.t.  ‖X‖₀ ≤ κ,    (4)

where X_normal is a precision matrix from a normal state. The system
and/or method in one embodiment perform transfer learning of the
sparsity pattern and the magnitude of correlations from the normal
state to the testing phase.

    min_{X ≻ 0}  −log det(X) + tr(SX)  s.t.  ‖X‖₀ ≤ κ,    (5)

    min_{X ≻ 0}  −log det(X) + tr(SX) + λ₁‖X‖₁ + λ₂‖X‖_F².    (6)

Equations (5) and (6) represent additional models, which may be
applied in learning a sparse GGM for anomaly detection algorithms,
for instance, in the context of anomaly localization. Equation (6)
is an ℓ₁-based "elastic net" version of the GGM model.
Methods for Computing Anomaly Score of Each Variable
The following describes embodiments of techniques that perform a
change analysis, e.g., assigning anomaly scores to individual
variables to measure the magnitude of their contributions to the
observed differences between two GGMs from training and testing
datasets. The techniques, for example, in embodiments, implement
the processing shown at 120 in FIG. 1.
Conditional Expected KL-Divergence
The system and/or method in an embodiment may consider the
conditional expected Kullback-Leibler divergence between two
learned GGMs to compute anomaly scores. Given two probabilistic
models A and B with respective distributions p_A and p_B, the
system and/or method consider the expected KL divergence
d_i^{AB} between the conditional distributions p_A(x_i|z_i) and
p_B(x_i|z_i), where z_i collects all variables other than x_i:

    d_i^{AB} = ∫ p_A(z_i) dz_i ∫ p_A(x_i|z_i) ln [ p_A(x_i|z_i) / p_B(x_i|z_i) ] dx_i,    (7)

for each variable i. By swapping A and B in Equation (7), the
system and/or method obtain a symmetric formula for d_i^{BA}.
Because the system and/or method are learning GGMs, the formula of
Equation (7) takes a special form that reduces to matrix-vector
multiplications. For the i-th variable, and for a learned GGM X, let

    X = [ L   l ]        X^{-1} = [ W   w ]
        [ lᵀ  α ],                [ wᵀ  β ],    (7b)

be permuted such that the last row and column of X and X^{-1}
correspond to the i-th variable. Then, letting X^A and X^B denote
the GGMs learned from datasets A and B respectively, it can be
shown that

    d_i^{AB} = w_Aᵀ l_B + (l_Bᵀ W_A l_B)/(2α_B) + (1/2) ln(α_A/α_B) + (1/2)(α_B β_A − 1).    (8)

The system and/or method then define the anomaly score of the i-th
variable as

    d_i = max{d_i^{AB}, d_i^{BA}}.    (9)

When a threshold t > 0 is selected, d_i > t is an indication that
the i-th variable is anomalous, while d_i ≤ t is an indication that
the i-th variable is healthy. The scalars α and β and the blocks
L, l, W, and w are the components of the matrices X and X^{-1} when
they are partitioned as in Equation (7b).
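The score of Equations (8) and (9) reduces to a handful of matrix-vector products once variable i is permuted to the last position of each precision matrix. A minimal sketch (function and variable names are illustrative, not from the patent):

```python
import numpy as np

def kl_score_ab(XA, XB, i):
    """Expected conditional KL divergence d_i^{AB} of Eq. (8) for variable i."""
    n = XA.shape[0]
    perm = [j for j in range(n) if j != i] + [i]   # move variable i last
    A = XA[np.ix_(perm, perm)]
    B = XB[np.ix_(perm, perm)]
    SA = np.linalg.inv(A)                          # covariance of model A
    lA, aA = A[:-1, -1], A[-1, -1]                 # blocks l, alpha of X^A
    lB, aB = B[:-1, -1], B[-1, -1]                 # blocks l, alpha of X^B
    WA, wA, bA = SA[:-1, :-1], SA[:-1, -1], SA[-1, -1]  # W, w, beta of (X^A)^{-1}
    return (wA @ lB + (lB @ WA @ lB) / (2 * aB)
            + 0.5 * np.log(aA / aB) + 0.5 * (aB * bA - 1.0))

def anomaly_score(XA, XB, i):
    """d_i = max{d_i^{AB}, d_i^{BA}} as in Eq. (9)."""
    return max(kl_score_ab(XA, XB, i), kl_score_ab(XB, XA, i))
```

When the two models coincide, the score is zero for every variable, as a KL divergence should be.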
Stochastic Nearest Neighbors
The system and/or method in an embodiment may consider using a
stochastic nearest neighbors approach to anomaly scoring. Let
S^A and S^B denote the sample covariance matrices of datasets A and
B respectively, and let S_i^A and S_i^B denote the i-th columns of
S^A and S^B respectively. For a sparse GGM X^A learned from dataset
A, define an indicator vector ι_{A,i} associated with the i-th
variable coordinate-wise by

    (ι_{A,i})_j = 1 if X_{i,j}^A ≠ 0, and (ι_{A,i})_j = 0 otherwise.    (10)

Then, a measure of the dissimilarity between the neighborhoods of
the i-th variable in the GGMs for A and B, weighted by the sample
covariances of the two datasets, is given by

    d_i^{AB} = ι_{A,i}ᵀ S_i^A − ι_{B,i}ᵀ S_i^B.    (11)

Symmetrically, the system and/or method define d_i^{BA} and then
compute an anomaly score d_i as in Equation (9). "ᵀ" in the
equations stands for the transpose of a vector or matrix.
Sparsest Subgraph Approximation
The system and/or method in an embodiment may consider using a
sparsest subgraph approximation technique to score anomalies. Given
two (sparse) dependency graphs, e.g., X^A and X^B, consider a graph
given by an (also sparse) adjacency matrix Λ defined entrywise by
Λ_{ij} = |X_{ij}^A − X_{ij}^B|. The problem of finding an
independent set (a set of mutually disconnected nodes) of size k in
Λ is generalized to finding a "sparsest k-subgraph" in Λ:

    min_{x ∈ {0,1}^n}  xᵀ Λ x  s.t.  1_nᵀ x = k,    (12)

where 1_n denotes a vector of ones. Intuitively, the solution to
Equation (12) identifies a set of k nodes that change the least
between X^A and X^B, suggesting a healthy (non-anomalous) set of
nodes. However, Equation (12) is a mixed-integer quadratic
programming (MIQP) formulation, the solution of which is generally
NP-hard, and moreover requires foreknowledge of the magnitude of k,
the size of the supposedly healthy set of nodes. Therefore, the
system and/or method may use a convex QP relaxation (solvable in
polynomial time):

    min_{d ∈ R^n}  dᵀ Λ_μ d  s.t.  1_nᵀ d = k,  d ≥ 0_n,    (13)

where 0_n is a vector of zeros and Λ_μ = Λ + μI_n is the original
matrix Λ with an added scaled identity, with μ ≥ 0 sufficiently
bounded away from zero to enforce the positive definiteness of
Λ_μ. With this relaxation, the solution d* to Equation (13) can be
interpreted as anomaly scores.
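As a sketch of how a relaxation of the shape of Equation (13) might be solved in practice (the exact constraint form and the solver choice here are illustrative assumptions, not prescribed by the patent), a general-purpose constrained optimizer suffices for moderate n:

```python
import numpy as np
from scipy.optimize import minimize

def sparsest_subgraph_scores(XA, XB, k, mu):
    """Solve min d^T Lam_mu d  s.t. 1^T d = k, d >= 0 (relaxation as in Eq. (13)).
    mu should be large enough to make Lam_mu positive definite."""
    n = XA.shape[0]
    Lam = np.abs(XA - XB)              # entrywise |X^A_ij - X^B_ij|
    Lam_mu = Lam + mu * np.eye(n)      # shifted matrix Lam + mu*I_n
    fun = lambda d: d @ Lam_mu @ d
    jac = lambda d: 2.0 * Lam_mu @ d
    cons = [{"type": "eq", "fun": lambda d: d.sum() - k}]
    res = minimize(fun, np.full(n, k / n), jac=jac, method="SLSQP",
                   bounds=[(0.0, None)] * n, constraints=cons)
    return res.x                       # d*: interpretable as anomaly scores
```

For larger problems a dedicated QP solver would be preferred; the point of the sketch is only that the relaxed problem is a standard linearly constrained QP.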
In an embodiment, the system and/or method use the ℓ₀-ℓ₂
based model of Equations (3) and (4) to learn a sparse GGM in the
training phase and the testing phase, respectively, and compute
anomaly scores for each variable by applying the conditional
expected KL divergence score of Equation (9).
Optimization Algorithm for ℓ₀ Sparse Models
In embodiments, the system and/or method implements a projected
gradient algorithm for solving the learning problems or models of
Equation (3) and Equation (4).
Projected Gradient Newton Algorithm
In an embodiment, a projected gradient Newton algorithm with warm
starts may solve the ℓ₀-constrained inverse covariance
learning problems of Equations (3) and (4). The following describes
a method for solving Equation (3); the method also can be applied
to Equation (4). It is noted that when λ is small, Equation (3) is
difficult to solve. In an embodiment, instead of directly solving
Equation (3), the system and/or method solve a sequence of
subproblems parameterized by a larger λ > 0:

    min f(X) := −log det(X) + tr(SX) + (λ/2)‖X‖_F²  s.t.  ‖X‖₀ ≤ κ,  X_{i,j} = 0 ∀(i,j) ∈ ℱ,  X ≻ 0,    (14)

where ℱ is a prescribed set of entries constrained to zero, and the
system and/or method denote the Frobenius norm by
‖X‖_F² = Σ_{i,j=1}^n |X_{i,j}|². The objective of Equation (14) is
strongly convex and the optimal solution set of Equation (14) is
bounded. The ℓ₂ regularizer makes the subproblems
better-conditioned, possibly allowing for finding higher quality
local minima of Equation (14) faster. By driving λ to zero in
Equation (14), an optimal solution for Equation (3) can be
obtained. Thus, to yield a practical algorithm, the system and/or
method in an embodiment consider a sequence of strictly decreasing
values {λ_k}; in the k-th iteration of an algorithm, the system
and/or method seek an approximate local minimizer to Equation (14)
with λ = λ_k. On the (k+1)-th iteration, the system and/or method
warm start the solution of the (k+1)-th subproblem with the
solution to the k-th subproblem as an initial point.
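The warm-started continuation over a decreasing sequence {λ_k} can be sketched as follows. The inner solver here is a deliberately simple projected gradient loop with a positive-definiteness line search, standing in for the full algorithm of this disclosure; all names and parameter values are illustrative only:

```python
import numpy as np

def f(X, S, lam):
    sign, logdet = np.linalg.slogdet(X)
    return np.inf if sign <= 0 else -logdet + np.trace(S @ X) + 0.5 * lam * np.sum(X * X)

def proj_sparse(X, kappa):
    """A point in P_Omega(X): keep the kappa largest-magnitude entries.
    (Ties at the cutoff may keep a symmetric partner as well.)"""
    thresh = np.sort(np.abs(X).ravel())[::-1][kappa - 1]
    return np.where(np.abs(X) >= thresh, X, 0.0)

def inner_solve(S, lam, kappa, X0, iters=200):
    """Crude projected gradient for the subproblem (14) with fixed lambda."""
    X = X0.copy()
    for _ in range(iters):
        G = S - np.linalg.inv(X) + lam * X      # gradient of f
        alpha, accepted = 1.0, False
        for _ in range(30):                     # backtracking line search
            Xt = proj_sparse(X - alpha * G, kappa)
            if np.all(np.linalg.eigvalsh(Xt) > 0) and f(Xt, S, lam) < f(X, S, lam):
                X, accepted = Xt, True
                break
            alpha *= 0.5
        if not accepted:
            break                               # no further progress
    return X

def continuation(S, kappa, lams=(1.0, 0.3, 0.1)):
    """Solve a sequence of subproblems (14), warm starting each from the last."""
    X = np.eye(S.shape[0])                      # feasible, positive definite start
    for lam in lams:                            # strictly decreasing lambda_k
        X = inner_solve(S, lam, kappa, X)
    return X
```

Because S and each iterate are symmetric, the gradient step and the thresholding projection preserve symmetry, and the line search guarantees every accepted iterate stays positive definite.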
Before analyzing optimality conditions of Equations (3) or (14), a
hardness result may be proved, for example, demonstrating that for
any κ ≥ n in Equation (3), there exists no λ > 0 such that an
optimal solution to the ℓ₁-regularized model of Equation (2)
with regularization parameter λ recovers the global optimum of
Equation (3). This suggests solving Equation (3) directly, without
resorting to convex formulations.
Theorem 1. For any .kappa..gtoreq.n, there exists no value of
.lamda.>0 such that an optimal solution of (2) globally
minimizes (3).
Proof 1. Towards a contradiction, suppose there exist κ ≥ n and
λ > 0 such that X is a global minimizer of both (2) and (3). Define
Y = X^{-1} and ψ(X) = −log det(X) + tr(SX). Since S is positive
semi-definite and nonzero, there exists at least one diagonal entry
S_{ii} > 0. Without loss of generality, suppose S₁₁ > 0. For any α
such that |α| < min(X₁₁, σ₁(X)), both ‖X + αe₁e₁ᵀ‖₀ = ‖X‖₀ ≤ κ
holds and X + αe₁e₁ᵀ is positive definite. Additionally, since
(X + αe₁e₁ᵀ)_{kl} = 0 for any (k,l) ∈ ℱ, the prescribed zero-entry
set, it may be concluded that X + αe₁e₁ᵀ is feasible in Equation
(3) for every α such that |α| < min(X₁₁, σ₁(X)).
Moreover, using a property of determinants,

    ψ(X + αe₁e₁ᵀ) − ψ(X) = −log det(X + αe₁e₁ᵀ) + log det(X) + α tr(Se₁e₁ᵀ)
                         = −log(1 + α e₁ᵀ X^{-1} e₁) + α S₁₁
                         = −log(1 + α Y₁₁) + α S₁₁ ≜ g(α).

Note that if X is a minimizer of Equation (3), then g(α) ≥ 0 for
every |α| < min(X₁₁, σ₁(X)). Since g is a convex one-dimensional
function of α in the neighborhood of zero and g(0) = 0, it may be
concluded that g'(0) = 0. This implies that S₁₁ = Y₁₁. From a
characterization of optimal solutions of Equation (2), if X is
optimal for Equation (2), then S₁₁ = Y₁₁ − λ. Since λ > 0, this is
a contradiction.
First-Order Stationarity Conditions
The following characterizes properties of Equation (14), for
example, concerning the boundedness of the set of optimal solutions
and appropriate optimality conditions for Equation (3). In an
embodiment, it may be assumed that the ℓ₂-regularization
parameter λ > 0. For ease of notation, the sparsity constraint set
is defined as

    Ω ≜ {X ∈ R^{n×n} : ‖X‖₀ ≤ κ, X_{i,j} = 0 ∀(i,j) ∈ ℱ},

where ℱ denotes the prescribed set of entries constrained to zero,
and the projection operator as

    P_Ω(X) ≜ argmin_{Y ∈ Ω} ‖Y − X‖_F.

It may be noted that the gradient of the objective function in
Equation (14) is ∇f(X) = S − X^{-1} + λX, and the Hessian is
∇²f(X) = X^{-1} ⊗ X^{-1} + λI_{n²}.
It is shown in the following theorem that Equation (14) can be
safely reformulated in such a way that the feasible set is compact.
Theorem 2. Assume that the problem of Equation (14) is feasible and
λ > 0. Consider the problem

    min  −log det(X) + tr(SX) + (λ/2)‖X‖_F²  s.t.  ‖X‖₀ ≤ κ,  X_{i,j} = 0 ∀(i,j) ∈ ℱ,  σ̲ I_n ⪯ X ⪯ σ̄ I_n,    (15)

where σ̄ and σ̲ are the fixed, finite, positive constants given by
Equations (18) and (20) below, evaluated at X̄ = tI_n with t chosen
as in Equation (21). Then, the sets of global minima of Equation
(14) and Equation (15) are equivalent.
Proof 2. Suppose X* ∈ R^{n×n} is an optimal solution of Equation
(14) and 0 < σ₁ ≤ . . . ≤ σ_n are the eigenvalues of X*. The system
and/or method in an embodiment seek lower and upper bounds for σ₁
and σ_n, respectively.
Consider X̄ = tI_n for an arbitrary fixed scalar t > 0. X̄ is
feasible with respect to the constraints of Equation (14), and

    f(X̄) ≥ f(X*) ≥ −log det(X*) + (λ/2)‖X*‖_F²
          = −Σ_{i=1}^n log(σ_i) + (λ/2) Σ_{i=1}^n σ_i²
          ≥ −n log(σ_n) + (λ/(2n)) (Σ_{i=1}^n σ_i)²
          ≥ −n log(σ_n) + (λ/(2n)) σ_n².    (16)

The second inequality is due to the fact that S ⪰ 0 and X* ≻ 0, and
thus tr(SX*) ≥ 0. The third inequality uses the Cauchy-Schwarz
inequality, and the last inequality holds because
Σ_{i=1}^n X*_{ii} = Σ_{i=1}^n σ_i and σ_i > 0. Combining Equation
(16) with the fact that log(σ_n) ≤ σ_n − 1, the system and/or
method get

    f(X̄) ≥ −n(σ_n − 1) + (λ/(2n)) σ_n².    (17)

The right-hand side of Equation (17) tends to infinity as
σ_n → ∞; since f(X̄) is finite for the fixed value of t, σ_n is
bounded from above. Specifically, σ_n cannot exceed the larger of
the two solutions to the quadratic equation

    (λ/(2n)) σ² − nσ + (n − f(X̄)) = 0.

That is,

    σ_n ≤ σ̄ ≜ (n/λ) ( n + sqrt( n² − (2λ/n)(n − f(X̄)) ) ) < ∞.    (18)

Likewise, a lower bound for σ₁ is estimated as
f(X̄) ≥ f(X*) ≥ −log det(X*) ≥ −log(σ₁) − (n−1) log(σ̄), from which
it may be concluded

    σ₁ ≥ exp(−f(X̄)) σ̄^{1−n}.    (19)

In an embodiment, this bound can be tightened further. Starting
from an intermediate step in the derivation of Equation (16), there
is

    f(X̄) ≥ −log(σ₁) + Σ_{i=2}^n ( −log(σ_i) + (λ/2) σ_i² ) ≥ −log(σ₁) + ((n−1)/2)(1 + log(λ)),

since −log(σ) + (λ/2)σ² is minimized at σ = 1/sqrt(λ) with minimum
value (1 + log(λ))/2. Thus,

    σ₁ ≥ σ̲ ≜ exp( −f(X̄) + ((n−1)/2)(1 + log(λ)) ) > 0.    (20)

From Equations (18) and (20), in order to tighten the bounds σ̄ and
σ̲, the system and/or method may choose t so that f(X̄) = f(tI_n)
is minimized. Observe that

    f(tI_n) = −n log(t) + t·tr(S) + (λ/2) n t²

is a strongly convex one-dimensional function in t for t > 0. Thus
f(tI_n) is minimized when the necessary optimality condition
(d/dt) f(tI_n) = 0 is satisfied; thus the following is chosen:

    t = ( −tr(S) + sqrt( tr(S)² + 4λn² ) ) / (2λn),    (21)

which completes the proof.
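The closed form for t above is easy to check numerically: f(tI_n) = −n log(t) + t·tr(S) + (λ/2)nt² is strongly convex in t, and its stationary point solves λnt² + tr(S)t − n = 0. The sketch below (illustrative names) compares the closed form against a one-dimensional numerical minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def t_star(trS, n, lam):
    """Closed-form minimizer of f(t*I_n), as in Eq. (21)."""
    return (-trS + np.sqrt(trS**2 + 4.0 * lam * n**2)) / (2.0 * lam * n)

def f_diag(t, trS, n, lam):
    """f(t*I_n) = -n*log(t) + t*tr(S) + (lam/2)*n*t^2."""
    return -n * np.log(t) + t * trS + 0.5 * lam * n * t * t

n, trS, lam = 10, 7.5, 0.4
t_closed = t_star(trS, n, lam)
t_numeric = minimize_scalar(f_diag, args=(trS, n, lam),
                            bounds=(1e-8, 50.0), method="bounded").x
```

Both the derivative at t_closed and the gap to the numerical minimizer should vanish to solver tolerance.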
The following introduces in Theorem 3 optimality conditions for the
problem of Equation (14). For any positive definite matrix
H ∈ R^{n²×n²} and a scalar t > 0, define

    h_{t,H,X}(Y) ≜ ⟨∇f(X), Y − X⟩ + (1/(2t)) vec(Y − X)ᵀ H vec(Y − X),
    S_{t,H}(X) ≜ {Y* ∈ Ω : h_{t,H,X}(Y*) ≤ h_{t,H,X}(Y) for all Y ∈ Ω}.    (22)

Theorem 3. Suppose that X* is an optimal solution of (14) and H is
positive definite. Then there exists T > 0 such that

    X* = S_{t,H}(X*) for all t ∈ (0, T).    (23)
Proof 3. From Theorem 2, any optimal solution X* of (14) satisfies
σ̲ I_n ⪯ X*, where σ̲ > 0 is independent of X*. Since
∇²f(X*) = X*^{-1} ⊗ X*^{-1} + λI_{n²}, it follows that
∇²f(X*) ⪯ (σ̲^{-2} + λ) I_{n²}. Hence, from this curvature bound,

    f(Y) ≤ f(X*) + ⟨∇f(X*), Y − X*⟩ + (1/(2L_f)) ‖Y − X*‖_F²    (24)

for all Y ∈ Ω, where L_f = (σ̲^{-2} + λ)^{-1}. It is shown that
(23) holds for T = σ₁(H) L_f. First, it is demonstrated that
X* ∈ S_{t,H}(X*) for any t ∈ (0, T). This first claim is a
statement concerning existence of X*. The following subsequently
proves uniqueness of X*, which proves the theorem. From (24) and
since

    (1/(2t)) vec(Z)ᵀ H vec(Z) ≥ (σ₁(H)/(2t)) ‖Z‖_F² ≥ (1/(2L_f)) ‖Z‖_F²

for all t ∈ (0, T), there is, for all Y ∈ Ω,

    f(X*) ≤ f(Y) ≤ f(X*) + ⟨∇f(X*), Y − X*⟩ + (1/(2L_f)) ‖Y − X*‖_F².

It follows that

    ⟨∇f(X*), Y − X*⟩ + (1/(2L_f)) ‖Y − X*‖_F² ≥ 0

for all Y ∈ Ω. Thus X* ∈ S_{t,H}(X*) since h_{t,H,X*}(X*) = 0,
proving the claim.
To prove uniqueness, suppose towards contradiction that there
exists X̂ ∈ S_{t,H}(X*) satisfying (23), but X̂ ≠ X*. Noting that
by definition h_{t,H,X*}(X*) = 0, we must have
h_{t,H,X*}(X̂) = 0, and so

    ⟨∇f(X*), X̂ − X*⟩ = −(1/(2t)) vec(X̂ − X*)ᵀ H vec(X̂ − X*).    (25)

Combining (24) and (25), there is

    f(X̂) ≤ f(X*) − (1/(2t)) vec(X̂ − X*)ᵀ H vec(X̂ − X*) + (1/(2L_f)) ‖X̂ − X*‖_F² < f(X*),    (26)

which contradicts the global optimality of X* for (14).
For a general positive definite matrix H, solving the problem
defining S_{t,H}(X) in (22) is generally intractable due to the
cardinality constraint. As a result, verifying the optimality
condition (23) is computationally expensive. The following lemma
offers a simpler condition, which is an immediate consequence of
Theorem 3 when H = I_{n²}. A similar result was established, but
for a slightly different constraint set Ω.
Lemma 1. Suppose that X* is an optimal solution of (14). Then there
exists T > 0 such that

    X* = P_Ω(X* − t∇f(X*)) for all t ∈ (0, T).    (27)
In one aspect, a subsequent convergence analysis relies on Lemma 1.
The following states and proves a technical proposition regarding
the operator S_{t,I_{n²}}, concerning how it acts on convergent
sequences.
Proposition 1. Let t > 0. Suppose {X^k} is a sequence in R^{n×n}
converging to X*, and suppose there exists a corresponding sequence
{Y^k} such that Y^k ∈ S_{t,I_{n²}}(X^k) for each k. If there exists
Y* such that Y^k → Y*, then Y* ∈ S_{t,I_{n²}}(X*).
Proof 4. Since Y^k ∈ Ω for each k and Ω is closed, it is concluded
that Y* ∈ Ω. Hence, it is obtained from the definition of
S_{t,I_{n²}}(X*) that

    ‖Y* − (X* − t∇f(X*))‖_F ≥ min_{Y ∈ Ω} ‖Y − (X* − t∇f(X*))‖_F.    (28)

If inequality (28) holds as an equality, then the proposition is
proven. Towards a contradiction, suppose (28) holds as a strict
inequality. Then, for all k,

    ‖Y* − (X* − t∇f(X*))‖_F > min_{Y ∈ Ω} ‖Y − (X* − t∇f(X*))‖_F
        ≥ min_{Y ∈ Ω} ‖Y − (X^k − t∇f(X^k))‖_F − ‖(X^k − t∇f(X^k)) − (X* − t∇f(X*))‖_F
        = ‖Y^k − (X^k − t∇f(X^k))‖_F − ‖(X^k − t∇f(X^k)) − (X* − t∇f(X*))‖_F,    (29)

where the second inequality follows from the triangle inequality,
and the equality follows from the definition of Y^k. As k → ∞,
(29) implies
‖Y* − (X* − t∇f(X*))‖_F > ‖Y* − (X* − t∇f(X*))‖_F, a contradiction.
Algorithm Description
An algorithm is presented to solve a general problem of the form

    min { f(X) : X ∈ R^{n×n}, ‖X‖₀ ≤ κ, X_{i,j} = 0 ∀(i,j) ∈ ℱ (the prescribed zero-entry set), X ≻ 0, X = Xᵀ },    (30)

where f is twice differentiable on its domain. In an embodiment, a
strategy for solving the constrained optimization problem is to
embed an approximate Newton method into a projected gradient
method. In this algorithm, feasibility with respect to the sparsity
constraint Ω is handled via projection, while the symmetric
positive-definiteness of the iterates (i.e., feasibility with
respect to {X ≻ 0}) is ensured through a line-search procedure.
Convergence analysis can be applied, for example, when the
objective function has the special structure as in (14):

    f(X) = −log det(X) + tr(SX) + (λ/2)‖X‖_F².
A particular issue in the case of applying a projected gradient
method to (30) is that the constraint set of (30) is nonconvex and
not closed. This is a different situation from problems in the
conventional scope of projected gradient methods, which require the
feasible set to be convex and closed. In one aspect, due to the
openness of the positive-definite (PD) constraint, directly
applying a projected gradient method to Equation (30) is difficult.
In an embodiment, a methodology of the present disclosure avoids
dealing with the PD constraint in the projection step.
In an algorithm of the present disclosure in an embodiment, the
projected gradient direction and the approximate Newton step are
alternately employed. On each gradient search step, the algorithm
begins from a feasible iterate X^k and then backtracks along the
projection arc defined by P_Ω(X^k − α_k∇f(X^k)), initializing the
stepsize α_k > 0 with a Barzilai-Borwein (BB) stepsize [1]:

    α_k^{BB} = ‖X^k − X^{k−1}‖_F² / ⟨X^k − X^{k−1}, ∇f(X^k) − ∇f(X^{k−1})⟩.
##EQU00042## BB-stepsizes perform better when they are embedded in
a nonmonotone line search. In an embodiment, a scheme based on the
Grippo-Lampariello-Lucidi (GLL) stepsize rule is considered. Let
f.sub.max.sup.k denote the largest of the M most recent function
values:
f.sub.max.sup.k=max{f(X.sup.k-j):0.ltoreq.j<min(k,M)}.
This quantity f_max^k is used to define a sufficient decrease
condition. For instance, the algorithm terminates the backtracking
procedure for computing X^{k+1} = P_Ω(X^k − α_k∇f(X^k)) by
gradually reducing α_k once the following conditions (C1) and (C2)
are both satisfied:
(C1) A sufficient decrease condition for the objective function
value is attained, i.e.,

    f(X^{k+1}) ≤ f_max^k − (δ/2)‖X^{k+1} − X^k‖_F²,

for some algorithmic parameter δ > 0.
(C2) The next iterate X^{k+1} is feasible with respect to the
positive definite constraint, i.e., X^{k+1} ≻ 0.
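The BB stepsize and the GLL-style acceptance test are small, self-contained pieces. A sketch (illustrative names; the (C1) inequality is written in the form stated above):

```python
import numpy as np

def bb_stepsize(X_cur, X_prev, G_cur, G_prev, a_min=1e-4, a_max=1e4):
    """Barzilai-Borwein stepsize, safeguarded into [a_min, a_max]."""
    sk = X_cur - X_prev                 # iterate difference
    yk = G_cur - G_prev                 # gradient difference
    denom = np.sum(sk * yk)             # Frobenius inner product <s, y>
    alpha = np.sum(sk * sk) / denom if denom > 0 else a_max
    return min(a_max, max(a_min, alpha))

def gll_accept(f_trial, f_history, step_sq, delta=1e-4, M=5):
    """Nonmonotone (GLL) sufficient decrease as in condition (C1):
    compare against the largest of the M most recent objective values."""
    f_max = max(f_history[-M:])
    return f_trial <= f_max - 0.5 * delta * step_sq
```

For a quadratic with Hessian A, the BB stepsize reduces to sᵀs/sᵀAs, which always lies between the reciprocals of the extreme eigenvalues of A.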
In an embodiment, to accelerate convergence and make use of the
fact that the optimal solution is often very sparse in many
applications, i.e., κ ≪ n², the algorithm in the present
disclosure may additionally incorporate projected approximate
Newton steps, by using second-order information in a reduced
subspace. At each iterate X^k, a local quadratic model of the
objective function can be minimized subject to the constraint that
all but a working set W_k of the matrix entries are fixed to zero
in the minimization. In an embodiment of the algorithm, W_k may be
chosen as

    W_k = {(i,j) : X_{i,j}^k ≠ 0} ∪ O_k    (31)

or

    W_k = {(i,j) : Y_{i,j} ≠ 0 for some Y ∈ P_Ω(X^k − α_k∇f(X^k))} ∪ O_k,    (32)

where α_k > 0 is a step size and O_k is a set of indices
corresponding to the ⌈ηκ⌉ largest absolute values of entries in
X^k − X^{k−1} for some (small) η ∈ (0,1). The inclusion of O_k
helps to close the optimality gap faster when the condition (27) in
Lemma 1 does not hold. The determination for updating the working
set W_k from (31) or (32) is based on how X^k is computed, which is
described in detail below. In one aspect, the particular selection
of W_k is not critical to proving theoretical convergence results,
and the prescriptions in (31), (32) are intended as practical
guidance.
For any working set W_k ⊆ {1, . . . , n} × {1, . . . , n} and a
matrix Y, define Y_{W_k} coordinate-wise via

    (Y_{W_k})_{i,j} = Y_{i,j} if (i,j) ∈ W_k, and (Y_{W_k})_{i,j} = 0 otherwise.

Then, given a current iterate X^k, gradient ∇f(X^k), and Hessian
∇²f(X^k), solve the following reduced quadratic problem to obtain a
Newton direction D^k:

    D^k ∈ argmin_{D : D = D_{W_k}}  q_k(D) ≡ ⟨∇f(X^k), D⟩ + (1/2) vec(D)ᵀ ∇²f(X^k) vec(D).    (33)
To efficiently obtain an approximate solution of (33), an
embodiment of a methodology of the present disclosure makes use of
a conjugate gradient method. This conjugate gradient method takes
advantage of the particular Kronecker product structure of
∇²f(X^k), as well as the observation that |W_k| ≪ n², for
instance, since the methodology searches for a very sparse solution
in applications.
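Because ∇²f(X) = X^{-1} ⊗ X^{-1} + λI_{n²}, a Hessian-vector product never needs the n²×n² matrix: (X^{-1} ⊗ X^{-1})vec(D) = vec(X^{-1} D X^{-1}). A sketch of a matrix-free conjugate gradient solve of (33) restricted to the working set (illustrative names; W is a boolean mask over matrix entries):

```python
import numpy as np

def hess_vec(D, Xinv, lam, W):
    """Restricted Hessian-vector product: mask(X^{-1} D X^{-1} + lam*D)."""
    return np.where(W, Xinv @ D @ Xinv + lam * D, 0.0)

def reduced_newton_direction(X, G, lam, W, iters=200, tol=1e-14):
    """Conjugate gradient on the working set W for Eq. (33):
    solve H[D] = -G restricted to the entries in W."""
    Xinv = np.linalg.inv(X)
    D = np.zeros_like(X)
    R = np.where(W, -G, 0.0)        # residual, kept in the reduced subspace
    P = R.copy()
    rs = np.sum(R * R)
    for _ in range(iters):
        if rs < tol:
            break
        HP = hess_vec(P, Xinv, lam, W)
        a = rs / np.sum(P * HP)     # exact CG steplength
        D += a * P
        R -= a * HP
        rs_new = np.sum(R * R)
        P = R + (rs_new / rs) * P
        rs = rs_new
    return D
```

The iteration count is bounded by |W_k|, which is why the κ ≪ n² regime makes these Newton steps cheap.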
A detailed description of our algorithm for solving (30) is given
in Algorithm 1.
TABLE-US-00001
Algorithm 1: Reduced-space Projected Gradient Algorithm - RPGA(X⁰, Ω, λ)
 1  Choose parameters σ ∈ (0,1), δ > 0, η ∈ (0,1), α_min^Newton > 0,
    [α_min, α_max] ⊂ (0,∞), integers p, q such that 0 < p < q,
    integer M > 0, grad_step = true.
 2  Set k = 0.
 3  while some stopping criteria not satisfied do
 4    Step 1: Active-set Newton step
 5    if (mod(k,q) ≠ p) then
 6      if (grad_step = true) then
 7        W_k ← Eq. (31)
 8      else
 9        W_k ← Eq. (32)                    /* end of if-then-else of line 6 */
10      Approximately solve for D^k from (33)
11      Choose α_k = α_{k,0} = 1
12      j ← 0, ls ← true
13      while (ls = true) & (α_k ≥ α_min^Newton) do
14        α_k ← σ^j α_{k,0}
15        X_trial^{k+1} ← P_Ω(X^k + α_k D^k)
16        if f(X_trial^{k+1}) ≤ f_max^k − (δ/2)‖X_trial^{k+1} − X^k‖_F²
           and X_trial^{k+1} ≻ 0 then
17          X^{k+1} ← X_trial^{k+1}
18          ls ← false
19        else
20          j ← j + 1                       /* end of if-then-else of line 16 */
                                            /* end of while loop of line 13 */
                                            /* end of if-then of line 5 */
21    Step 2: Gradient projection step
22    grad_step ← false
23    if (mod(k,q) = p) or (α_k < α_min^Newton) then
24      Choose α_{k,0} = min(α_max, max(α_min, α_k^BB))
25      j ← 0, ls ← true
26      while (ls = true) do
27        α_k ← σ^j α_{k,0}
28        X_trial^{k+1} ← P_Ω(X^k − α_k ∇f(X^k))
29        if f(X_trial^{k+1}) ≤ f_max^k − (δ/2)‖X_trial^{k+1} − X^k‖_F²
           and X_trial^{k+1} ≻ 0 then
30          X^{k+1} ← X_trial^{k+1}
31          ls ← false
32        else
33          j ← j + 1                       /* end of if-then-else of line 29 */
                                            /* end of while loop of line 26 */
34      grad_step ← true                    /* end of if statement of line 23 */
35    Step 3: Iterate
36    k ← k + 1                             /* end of while loop of line 3 */
Remark 1. As mentioned, Ω is a non-convex set, and as such, the
operator P_Ω is generally set-valued, i.e., projections are
non-unique. A point in P_Ω(X) can be quickly obtained. If
X ∈ R^{n×n}, set X_{i,j} = 0 for every (i,j) ∈ I. Then P_Ω(X) is
computed by sorting the n² many entries of |X|, where |·| denotes an
entry-wise absolute value, and then replacing all but the κ largest
values of |X| with 0, where ties could be broken arbitrarily. In
order to attain a unique projection, when we compute P_Ω, ties are
not broken arbitrarily, but by a dictionary order on the matrix
entries. The dictionary order is chosen in such a way as to ensure
that the projected matrix is symmetric. As a consequence of sorting
n² matrix entries, the projection operation can be performed in
O(n² log(n)) time.
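The projection of Remark 1 can be sketched in a few lines of NumPy. For clarity this sketch breaks ties by flattened index rather than by the symmetry-preserving dictionary order described above, and the function name is hypothetical.

```python
import numpy as np

def project_sparse(X, kappa):
    """A point in P_Omega(X): zero all but the kappa largest-magnitude
    entries of X, via one sort of the n^2 entries (O(n^2 log n)).
    Ties are broken by flattened index, not by the symmetry-preserving
    dictionary order of Remark 1."""
    order = np.argsort(-np.abs(X).ravel(), kind="stable")[:kappa]
    mask = np.zeros(X.size, dtype=bool)
    mask[order] = True
    return np.where(mask.reshape(X.shape), X, 0.0)
```

For a symmetric input whose off-diagonal magnitudes come in tied pairs, the dictionary-order rule of Remark 1 would be needed to guarantee a symmetric result.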
On each iteration of the backtracking procedure, one would compute
an eigenvalue decomposition of the trial step X.sup.k+1 both in
order to compute the value of det(X.sup.k+1) in the objective
function and to determine the satisfaction of (C2). Computing a
determinant is generally an expensive operation. However, since a
method in the present disclosure is explicitly searching for a
positive definite matrix, it is enough to attempt a Cholesky
factorization of X.sup.k+1, which can be done more efficiently. If
the factorization fails (a complex number appears on the diagonal
of the lower triangular matrix), then the current X.sup.k+1 is not
positive definite and the method can immediately retract along the
projection arc. If the factorization is successful, then the
determinant of the matrix X.sup.k+1 can be computed from the
diagonal values of the lower triangular matrix. This sequence of
attempted Cholesky factorizations is repeated until the stopping
criteria are satisfied. The finite termination of the backtracking
step is proved below.
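The Cholesky-based test described above can be sketched as follows. The helper name is hypothetical, and note that NumPy's `cholesky` signals an indefinite matrix by raising an exception rather than by producing a complex diagonal entry.

```python
import numpy as np

def logdet_if_posdef(X):
    """Positive-definiteness test via an attempted Cholesky factorization
    (cheaper than a full eigenvalue decomposition): returns log det(X) if
    X is symmetric positive definite, else None, signalling the line
    search to retract along the projection arc."""
    try:
        L = np.linalg.cholesky(X)      # X = L L^T; fails if X is not PD
    except np.linalg.LinAlgError:
        return None
    # det(X) = prod(diag(L))^2, hence log det(X) = 2 * sum(log(diag(L))).
    return 2.0 * np.sum(np.log(np.diag(L)))
```

The same factorization thus serves both purposes named above: checking positive definiteness and evaluating the determinant term of the objective.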
In an embodiment, performing a line search in the projected
gradient direction in the full space may help a projected
approximate Newton algorithm identify a good working set subspace,
and enables a proof of global convergence. Thus, in an embodiment,
the general logic of Algorithm 1 is that, for fixed q ∈ ℕ, after
every q−1 iterations of Step 1 (the active-set Newton steps), the
algorithm performs a single iteration of Step 2, the gradient
projection step.
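One Step-2 iteration, with its nonmonotone backtracking line search, can be sketched as follows. The sketch assumes the objective f(X) = tr(SX) − log det X + (λ/2)∥X∥²_F, whose gradient S − X⁻¹ + λX appears in the convergence analysis; the function name is hypothetical, and the projection ignores the symmetry-preserving tie-breaking of Remark 1.

```python
import numpy as np

def grad_projection_step(X, S, lam, kappa, f_max, sigma=0.5, delta=1e-4,
                         alpha0=1.0, max_backtracks=50):
    """One gradient projection step (Step 2): retract along the projection
    arc until the trial point is positive definite and satisfies the
    sufficient-decrease test against f_max, the largest recent objective
    value in the nonmonotone line search."""
    def f(Z):
        # f(Z) = tr(SZ) - log det Z + (lam/2)||Z||_F^2; +inf if Z is not
        # positive definite, which forces further backtracking.
        sign, logdet = np.linalg.slogdet(Z)
        if sign <= 0:
            return np.inf
        return np.trace(S @ Z) - logdet + 0.5 * lam * np.linalg.norm(Z) ** 2

    def proj(Z):
        # Keep the kappa largest-magnitude entries of Z (a point in
        # P_Omega); ties broken by flat index for simplicity.
        order = np.argsort(-np.abs(Z).ravel(), kind="stable")[:kappa]
        mask = np.zeros(Z.size, dtype=bool)
        mask[order] = True
        return np.where(mask.reshape(Z.shape), Z, 0.0)

    G = S - np.linalg.inv(X) + lam * X          # gradient of f at X
    for j in range(max_backtracks):
        alpha = sigma ** j * alpha0
        trial = proj(X - alpha * G)
        if f(trial) <= f_max - delta * np.linalg.norm(trial - X) ** 2:
            return trial
    return X                                     # budget exhausted
```

Lemma 2 below is what guarantees that, under the stated assumptions, such a backtracking loop accepts a step after finitely many halvings.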
As is typical in nonconvex optimization, the iterates of Algorithm
1 need not converge to a global minimizer of (14), but only to a
point satisfying necessary conditions for optimality. The
difference between (14), which Algorithm 1 is to solve, and (3),
the real problem of interest, lies in the ℓ₂ regularization
term. In an embodiment, a method of the present disclosure uses an
outer homotopy method for solving (3). Algorithm 2 solves a
sequence of subproblems of the form (14) by calling Algorithm 1,
attempting to find increasingly better-quality minimizers. Two
strictly decreasing sets of integer values
κ₀ > κ₁ > . . . > κ_K = κ and real values λ₀ > λ₁ > . . . > λ_K = λ
are selected before running the algorithm. A method of the present
disclosure may first (approximately) solve (14) using a sparsity
parameter κ₀, regularization parameter λ₀, and some initial point
X⁰. Then, the method may sequentially solve (14) to some tolerance
for each pair (κ_j, λ_j): j = 1, . . . , K, initializing each
subproblem solve with the previous solution X^{j−1}. The final
regularization parameter λ_K should be quite small in practice.
TABLE-US-00002 Algorithm 2: Projected Gradient Newton Algorithm
with Warm Starts - PPNA
1 Choose homotopy sequence {(κ_j, λ_j): j = 0, 1, 2, . . . , K},
initial point X⁰.
2 for j = 0, 1, . . . , K do
3   Ω_j ← {X: ∥X∥₀ ≤ κ_j, X_{i,j} = 0 ∀(i,j) ∈ I}
4   X^{j+1} ← RPGA(X^j, Ω_j, λ_j)  /* end of for loop of line 2 */
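The warm-start loop of Algorithm 2 is a thin driver around Algorithm 1 and can be sketched as follows; the `rpga` argument is a caller-supplied stand-in for Algorithm 1, and the function name is hypothetical.

```python
def ppna(X0, schedule, rpga):
    """Homotopy with warm starts (Algorithm 2): solve subproblem (14) for
    a strictly decreasing sequence of (kappa_j, lambda_j) pairs,
    initializing each solve with the previous solution. `rpga` has the
    signature rpga(X, kappa, lam) and stands in for Algorithm 1."""
    X = X0
    for kappa_j, lam_j in schedule:
        X = rpga(X, kappa_j, lam_j)  # warm start from the previous solution
    return X
```

Because each subproblem starts from the previous minimizer, later solves with small λ_j and the target sparsity κ typically need far fewer iterations than a cold start would.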
Convergence Analysis
This section provides a proof that the iterates of Algorithm 1 for
solving problem (14) accumulate at a point satisfying the necessary
optimality conditions (27). It may be assumed that the initial
point X.sup.0 is feasible with respect to the constraints of
(14).
The following states an assumption on κ:
Assumption 1. Assume that κ in (14) (and (15)) is an integer
satisfying κ ∈ [n, n²). Observe that this assumption is reasonable
since, as demonstrated earlier, κ ≥ n is a sufficient condition for
(15) to have a non-empty feasible set, and if κ = n², then the
problem reduces to a convex one. It is demonstrated that the line
search in either Step 1 (the active-set Newton step) or Step 2 (the
gradient projection step) terminates in a finite number of
iterations.
Lemma 2. Suppose that problem (14) satisfies Assumption 1. Let
{X.sup.k} denote the sequence of iterates generated by Algorithm 1
for solving problem (14). Suppose the line search starting in
either Line 13 or Line 26 of Algorithm 1 is reached in the k-th
iteration and suppose X.sup.k is feasible with respect to the
constraints of (14). Then, the line search in steps 1 and 2
terminates in a finite number of iterations.
Proof 5. The line search in Step 1 terminates in a finite number of
iterations because the termination condition
α_k < α_min^Newton is enforced. The following proves the finite
termination when the line search occurs in Step 2 (the gradient
projection step). Assume ∇f(X^k) ≠ 0, since otherwise the algorithm
stops at a local solution X^k.
There exists T_p^k > 0 such that every matrix in
P_Ω(X^k − t∇f(X^k)) is positive definite for all t ∈ (0, T_p^k).
For a nonzero symmetric matrix U, denote the spectral norm of U by
∥U∥₂ ≜ max(|σ_min(U)|, |σ_max(U)|) > 0, where σ_min(U), σ_max(U)
denote the smallest and largest eigenvalues of U, respectively.
Thus, for all t ∈ (0, L₁), there is X^k + tU ≻ 0, where there is
defined L₁ ≜ σ_min(X^k)/∥U∥₂.
Given X^k, let I and A denote respectively the inactive and active
matrix indices of X^k with respect to the cardinality constraint,
i.e.: I ≜ {(i,j): X^k_{i,j} ≠ 0}, A ≜ {(i,j): X^k_{i,j} = 0}.
Additionally define
(i_x, j_x) ≜ argmin{|X^k_{i,j}|: (i,j) ∈ I},
(i_{g,n}, j_{g,n}) ≜ argmax{|(∇f(X^k))_{i,j}|: (i,j) ∈ I}, and
(i_{g,a}, j_{g,a}) ≜ argmax{|(∇f(X^k))_{i,j}|: (i,j) ∈ A}.
The following demonstrates the existence of the necessary
T_p^k > 0 by analyzing two separate cases concerning the
cardinality of the set of inactive indices, |I|.
Case 1. (|I| = κ): For every (i, j) ∈ I there is
|X^k_{i,j} − t(∇f(X^k))_{i,j}| ≥ |X^k_{i,j}| − t|(∇f(X^k))_{i,j}|
≥ |X^k_{i_x,j_x}| − t|(∇f(X^k))_{i_{g,n},j_{g,n}}|.
Defining
T₁^k ≜ |X^k_{i_x,j_x}| / (|(∇f(X^k))_{i_{g,n},j_{g,n}}| + |(∇f(X^k))_{i_{g,a},j_{g,a}}|),
it is concluded that for all (i, j) ∈ I,
|X^k_{i,j} − t(∇f(X^k))_{i,j}| > t|(∇f(X^k))_{i_{g,a},j_{g,a}}| ≥ t|(∇f(X^k))_{h,l}|, (34)
for all (h,l) ∈ A and t ∈ (0, T₁^k). Note that T₁^k is well-defined
since |X^k_{i_x,j_x}| and
|(∇f(X^k))_{i_{g,n},j_{g,n}}| + |(∇f(X^k))_{i_{g,a},j_{g,a}}| are
strictly positive. It is possible to decompose
X^k − t∇f(X^k) = [X^k − t∇f(X^k)]_I + [X^k − t∇f(X^k)]_A
= [X^k − t∇f(X^k)]_I − t[∇f(X^k)]_A. Combining this decomposition
with Remark 1 and (34), it follows that
[X^k − t∇f(X^k)]_I = P_Ω(X^k − t∇f(X^k)) for t ∈ (0, T₁^k). Now
consider U = [∇f(X^k)]_I, and define T₂^k ≜ σ_min(X^k)/∥U∥₂ if
U ≠ 0, and T₂^k ≜ +∞ otherwise. Then, letting
T_p^k = min(T₁^k, T₂^k), not only is
X^k − t[∇f(X^k)]_I = P_Ω(X^k − t∇f(X^k)), but also
X^k − t[∇f(X^k)]_I is positive definite for any t ∈ (0, T_p^k).
Case 2. (|I| < κ): Denote by Ā the set of indices
(i, j) ∈ A corresponding to the κ − |I| largest absolute magnitude
entries of [∇f(X^k)]_A. Denote
(i''_{g,a}, j''_{g,a}) ≜ argmin{|(∇f(X^k))_{i,j}|: (i,j) ∈ Ā}.
Define
T₃^k ≜ |X^k_{i_x,j_x}| / (|(∇f(X^k))_{i_{g,n},j_{g,n}}| + |(∇f(X^k))_{i''_{g,a},j''_{g,a}}|).
From (34), it follows that
|X^k_{i,j} − t(∇f(X^k))_{i,j}| > t|(∇f(X^k))_{i''_{g,a},j''_{g,a}}| ≥ t|(∇f(X^k))_{k,h}|, (35)
for all (i, j) ∈ I, (k,h) ∈ A\Ā and t ∈ (0, T₃^k). Similarly to
Case 1, it is possible to decompose
X^k − t∇f(X^k) = [X^k − t∇f(X^k)]_{I∪Ā} + [X^k − t∇f(X^k)]_{A\Ā}
= [X^k − t∇f(X^k)]_{I∪Ā} − t[∇f(X^k)]_{A\Ā}. From this
decomposition, Remark 1, and (35), there is
[X^k − t∇f(X^k)]_{I∪Ā} ∈ P_Ω(X^k − t∇f(X^k)) for
t ∈ (0, T₃^k). Define U = [∇f(X^k)]_{I∪Ā}, and analogous to the
definition of T₂^k in Case 1, define T₄^k ≜ σ_min(X^k)/∥U∥₂ if
U ≠ 0, and T₄^k ≜ +∞ otherwise. Choose T_p^k = min(T₃^k, T₄^k).
It is concluded that for any t ∈ (0, T_p^k), there is both
X^k − t[∇f(X^k)]_{I∪Ā} ∈ P_Ω(X^k − t∇f(X^k)) and
X^k − t[∇f(X^k)]_{I∪Ā} is positive definite. Since every matrix in
P_Ω(X^k − t∇f(X^k)) can be represented as
X^k − t[∇f(X^k)]_{I∪Ā}, the existence of the needed T_p^k is
proven.
The following proves that there exists a positive number
T_δ > 0, independent of k, such that
f(X^{k+1}) ≤ f_max^k − δ∥X^{k+1} − X^k∥²_F for all t ∈ (0, T_δ),
where X^{k+1} = P_Ω(X^k − t∇f(X^k)), and suppose that
f(X^k) ≤ f_max^{k−1}. Here, the notation X^{−1} = X⁰ and
f_max^{−1} = f(X⁰) is adopted. Define
L = {X ∈ R^{n×n}: f(X) ≤ f(X⁰)}. Note that
f(X^k) ≤ f_max^{k−1} ≤ f(X⁰), thus X^k ∈ L. Replacing X* and X by X
and X⁰ respectively in (18) and (20) yields
L ⊆ {X ∈ R^{n×n}: σ̲I_n ⪯ X ⪯ σ̄I_n}, where σ̲ and σ̄ are some
constants satisfying 0 < σ̲ < σ̄ < ∞. Define
K = {X ∈ R^{n×n}: 0.5σ̲I_n ⪯ X ⪯ (σ̄ + 0.5σ̲)I_n}. (37)
There is L ⊆ K. Since X^{k+1} minimizes h_{t,I_n,X^k}(Z), it
follows that
f(X^k) + tr(∇f(X^k)^T(X^{k+1} − X^k)) + (1/(2t))∥X^{k+1} − X^k∥²_F ≤ f(X^k).
Obtain ∥X^{k+1} − X^k∥²_F ≤ 2t tr(∇f(X^k)^T(X^k − X^{k+1})).
Taking norms yields
∥X^{k+1} − X^k∥_F ≤ 2t∥∇f(X^k)∥_F = 2t∥S − (X^k)^{−1} + λX^k∥_F.
Since X^k lies in the compact set K, there exists a constant c,
independent of X^k ∈ K, such that
∥∇f(X^k)∥_F = ∥S − (X^k)^{−1} + λX^k∥_F ≤ c. Choose
β ∈ (0, ∞) so that 2βc ≤ 0.5σ̲. Then X^{k+1} ∈ K when t ≤ β and
X^k ∈ L. Indeed, there is
σ_min(X^{k+1}) ≥ σ_min(X^k) − ∥X^{k+1} − X^k∥₂ ≥ σ̲ − 2tc
≥ σ̲ − 0.5σ̲ = 0.5σ̲ > 0 when t ≤ β. Furthermore, there is
σ_max(X^{k+1}) ≤ σ_max(X^k) + ∥X^{k+1} − X^k∥₂ ≤ σ̄ + 2tc
≤ σ̄ + 0.5σ̲.
Hence it is concluded that X^{k+1} ∈ K. Because the eigenvalues of
any matrix in K are bounded, f is Lipschitz continuously
differentiable on K. Denote by L_f the Lipschitz constant of ∇f on
K. In view of the convexity of K, the line segment connecting X^k
and X^{k+1} is contained in K. There is
f(X^{k+1}) ≤ f(X^k) + tr(∇f(X^k)^T(X^{k+1} − X^k)) + (L_f/2)∥X^{k+1} − X^k∥²_F
≤ f(X^k) − (1/(2t) − L_f/2)∥X^{k+1} − X^k∥²_F
≤ f_max^k − δ∥X^{k+1} − X^k∥²_F whenever 1/(2t) − L_f/2 ≥ δ, i.e.,
whenever t ≤ 1/(L_f + 2δ). Define
T_δ = min(β, 1/(L_f + 2δ));
then there is f(X^{k+1}) ≤ f_max^k − δ∥X^{k+1} − X^k∥²_F for all
t ∈ (0, T_δ), i.e., the sufficient decrease condition
in Line 29 holds for all t ∈ (0, T_δ).
Thus, letting T_p^k be the appropriate T_p^k from
either Case 1 or Case 2, and letting σ and α_{k,0} be
taken from Algorithm 1, it is concluded that within
⌈log_σ(min(T_p^k, T_δ)/α_{k,0})⌉ + 1 iterations,
X^{k+1} will satisfy both of the stopping criteria of the line
search.
Lemma 3. Suppose that problem (14) satisfies Assumption 1. Let
{X^k} denote the sequence of iterates generated by Algorithm 1
for (14). Then, ∥X^{k+1} − X^k∥ → 0.
Moreover, the sequence {f(X^k)} admits a limit point f*.
Proof 6. If ∇f(X^k) = 0 for any k, then the lemma is
immediate. So, assume ∇f(X^k) ≠ 0 for every k. Due
to the stopping criteria for the line search in Lines 16 and 29 of
Algorithm 1, there is f(X^{k+1}) ≤ f_max^k for all
k. Thus, f_max^{k+1} ≤ f_max^k for all k. The
sequence {f_max^k} is monotone decreasing and bounded from
below, hence there exists a limit point
{f_max^k} → f*. Since f is uniformly continuous on the
level set L ⊆ {X: σ̲I_n ⪯ X ⪯ σ̄I_n}, an
induction argument can be used to conclude that
∥X^{k+1} − X^k∥ → 0.
Theorem 4. Suppose that problem (14) satisfies Assumption 1. Let
{X^k} denote the sequence of iterates generated by Algorithm 1
for (14). If X* is an accumulation point of {X^k}, then X*
satisfies the necessary optimality conditions stated in Lemma 1,
i.e., there exists T > 0 such that (27) holds.
Proof 7. Again, assume ∇f(X^k) ≠ 0 for every k,
since otherwise the theorem is immediate. Since X* is an
accumulation point of {X^k}, there exists a subsequence of
indices {k_i}, i = 0, 1, 2, . . . , such that X^{k_i} → X* as
i → ∞. By the definition of limit, for
a sufficiently large k̄, the positions of the nonzero entries of
X^{k_i} are the same for k_i > k̄, and their magnitudes
are uniformly bounded away from zero. In Lemma 2, it was proven
that X^k ∈ L ⊆ {X: σ̲I_n ⪯ X ⪯ σ̄I_n} for all k, so the largest
eigenvalue of
∇f(X^k) = S − (X^k)^{−1} + λX^k is bounded
from above for all k. Thus, reusing the definition of T_p^k
from the proof of Lemma 2 for either Case 1 or Case 2, there exists
a constant T_p > 0 such that
T_p^{k_i} ≥ T_p for all k_i > k̄. So, the
step-size α_{k_i} taken at the end of the k_i-th
iteration is lower-bounded as
α_{k_i} ≥ σ^{⌈log_σ(min(T_p, T_δ)/α_{k_i,0})⌉+1} α_{k_i,0},
and so there exists ᾱ > 0 such that
lim inf_{i→∞} α_{k_i} ≥ ᾱ,
since there are bounds
α_min ≤ α_{k_i,0} ≤ α_max
as algorithmic parameters. Take any T such that
0 < T < min(ᾱ, T_p). Consider two possible cases for the
subsequence {X^{k_i+1}}:
Case 1: Suppose that infinitely many iterates in
{X^{k_i+1}}, i = 0, 1, . . . , are generated in Step 2
(the gradient projection step). Then, without loss of generality,
it can be assumed that the entire subsequence
{X^{k_i+1}}, i = 0, 1, . . . , is generated in Step 2. By the
triangle inequality, there is
lim_{i→∞} ∥X^{k_i+1} − X*∥ ≤ lim_{i→∞} ∥X^{k_i+1} − X^{k_i}∥ + lim_{i→∞} ∥X^{k_i} − X*∥,
and so it is concluded from Lemma 3 and the choice
of subsequence {k_i} that
X^{k_i+1} → X*. For any t > 0, let
Y^{k_i} ∈ S_{t,I_n}(X^{k_i}). It
follows from the definition of S_{t,I_n} that
tr(∇f(X^{k_i})^T(Y^{k_i} − X^{k_i})) + (1/(2t))∥Y^{k_i} − X^{k_i}∥²_F
≤ tr(∇f(X^{k_i})^T(X^{k_i+1} − X^{k_i})) + (1/(2t))∥X^{k_i+1} − X^{k_i}∥²_F. (39)
Since X^{k_i+1} ∈ S_{α_{k_i},I_n}(X^{k_i}) for all k_i,
there also is, from the definition of S_{α_{k_i},I_n}, that
tr(∇f(X^{k_i})^T(X^{k_i+1} − X^{k_i})) + (1/(2α_{k_i}))∥X^{k_i+1} − X^{k_i}∥²_F
≤ tr(∇f(X^{k_i})^T(Y^{k_i} − X^{k_i})) + (1/(2α_{k_i}))∥Y^{k_i} − X^{k_i}∥²_F. (40)
Combining (39) and (40), then, it is obtained for any t > 0 that
(1/(2t) − 1/(2α_{k_i}))∥Y^{k_i} − X^{k_i}∥²_F ≤ (1/(2t) − 1/(2α_{k_i}))∥X^{k_i+1} − X^{k_i}∥²_F. (41)
If t ∈ (0, T), then it may be concluded
from (41) that
∥Y^{k_i} − X^{k_i}∥_F → 0, since
∥X^{k_i+1} − X^{k_i}∥_F → 0 by Lemma 3 and since
1/(2t) − 1/(2α_{k_i}) > 0 by the choice of t. By the
triangle inequality,
lim_{i→∞} ∥Y^{k_i} − X*∥ ≤ lim_{i→∞} ∥Y^{k_i} − X^{k_i}∥ + lim_{i→∞} ∥X^{k_i} − X*∥,
and hence
Y^{k_i} → X*. So, using Proposition
1, it is concluded that X* ∈ P_Ω(X* − t∇f(X*)) for all t ∈ (0,
T).
Case 2: Suppose that only finitely many elements of the subsequence
{X^{k_i+1}} are generated in Step 2. Then, without loss
of generality, it can be assumed that every iterate of the
subsequence {X^{k_i+1}}, i = 0, 1, . . . , is generated in
Step 1. By the pigeonhole principle, there exists an integer
m ∈ [0, p) such that mod(k_i + 1, q) = m for infinitely
many k_i. Hence one has mod(k_i + 1 + p − m, q) = p for infinitely
many k_i; that is, every iterate of the subsequence
{X^{k_i+1+p−m}} is generated in Step 2. From the
triangle inequality,
lim_{i→∞} ∥X^{k_i+1+p−m} − X*∥ ≤ lim_{i→∞} Σ_{j=0}^{p−m} ∥X^{k_i+1+j} − X^{k_i+j}∥ + lim_{i→∞} ∥X^{k_i} − X*∥,
and so by Lemma 3 and the choice of
the subsequence {X^{k_i}},
X^{k_i+1+p−m} → X*. From here, the same argument as
in Case 1 can be applied to the subsequence
{X^{k_i+1+p−m}} to get X* ∈ P_Ω(X* − t∇f(X*)) for all
t ∈ (0, T). It now only remains to prove the uniqueness of
P_Ω(X* − t∇f(X*)), i.e., to show that
X* = P_Ω(X* − t∇f(X*)) as a set equality. Two
separate cases are analyzed: Case A. (|I(X*)| = κ): Assume that
Z* ∈ P_Ω(X* − t∇f(X*)) for all
t ∈ (0, T). Towards a contradiction, suppose that
Z* ≠ X* for some t' ∈ (0, T). By the definition of
P_Ω, it is concluded that because Z* ≠ X*,
|X*_{i',j'}| ≤ |Z*_{l',m'}(t')|, where there are defined
(i', j') ≜ argmin{|X*_{u,v}|: (u,v) ∈ I(X*)}
and
(l', m') ≜ argmax{|(X* − t∇f(X*))_{u,v}|: (u,v) ∈ A(X*)}.
This implies that (∇f(X*))_{l',m'} = 0, since otherwise
there exists t̃ ∈ (0, T) such that
|(X* − t̃∇f(X*))_{l',m'}| > |X*_{i',j'}|,
which would contradict X* ∈ P_Ω(X* − t̃∇f(X*)), since
(i', j') ∈ I(X*) and (l', m') ∈ A(X*). But, if
(∇f(X*))_{l',m'} = 0, then
(X* − t∇f(X*))_{l',m'} = (X*)_{l',m'} = 0 for any t ∈ (0, T),
since (l', m') ∈ A(X*), so that |X*_{i',j'}| ≤ |Z*_{l',m'}(t')| = 0,
which is impossible for (i', j') ∈ I(X*). This is a contradiction.
Case B. (|I(X*)| < κ): If (∇f(X*))_{i,j} ≠ 0
for some (i, j) ∈ A(X*), then there exists
t̃ ∈ (0, T) such that |(X* − t̃∇f(X*))_{i,j}| ≠ 0; hence
X* ∉ P_Ω(X* − t̃∇f(X*)), which would be
a contradiction. Thus, it is the case that
[∇f(X*)]_{A(X*)} = 0. Consider the reduced projection
problem over the inactive indices in I(X*),
min_H Σ_{(i,j)∈I(X*)} (H_{i,j} − (X* − t∇f(X*))_{i,j})². (42)
Since H* = [X*]_{I(X*)} is a solution to (42), it is derived
that [∇f(X*)]_{I(X*)} = 0; thus it is concluded that
∇f(X*) = 0, and so it follows that
X* = P_Ω(X* − t∇f(X*)) for any t ∈ (0, T).
The following result further shows that any
accumulation point produced by Algorithm 1 is a strict local
minimizer of (14).
Theorem 5. Any accumulation point X* of {X^k}, k = 0, 1, . . . ,
generated by Algorithm 1 is a strict local minimizer of (14).
Proof 8. It may be proven that there exists ε > 0 such
that for all Y so that X* + Y is feasible with respect to the
constraints of (14) and ∥Y∥_F ∈ (0, ε), it holds that
f(X* + Y) > f(X*). From Theorem 4,
there exists T > 0 such that X* = P_Ω(X* − t∇f(X*))
for all t ∈ (0, T). Equivalently, for all such t, the
matrix 0 is the unique global minimizer of
min_{Y: X*+Y ∈ Ω} f(X*) + tr(∇f(X*)^T Y) + (1/(2t))∥Y∥²_F.
Moreover, f is
strongly convex on K = {X: 0.5σ̲I_n ⪯ X ⪯ (σ̄ + 0.5σ̲)I_n}
defined in (37). Denote by ρ the
strong convexity constant. Note that
X^k ∈ L ⊆ {X: σ̲I_n ⪯ X ⪯ σ̄I_n}, hence the limit
X* ∈ {X: σ̲I_n ⪯ X ⪯ σ̄I_n}. In Lemma 2, it
was proven that if ∥Y∥_F ≤ 0.5σ̲
then X* + Y ∈ K. For any
∥Y∥_F ≤ 0.5σ̲, there is
f(X* + Y) ≥ f(X*) + tr(∇f(X*)^T Y) + (ρ/2)∥Y∥²_F
= f(X*) + (ρ/2)∥Y + (1/ρ)∇f(X*)∥²_F − (1/(2ρ))∥∇f(X*)∥²_F, (43)
where the equality is by
completing the square. As in the proof of Theorem 4, the analysis
may be split into two cases.
Case A (|I(X*)| = κ): Observe that there is Y_{i,j} = 0 for
every (i, j) ∈ A(X*), in order for X* + Y ∈ Ω
to hold in an arbitrarily small neighborhood of X*.
Also observe that (∇f(X*))_{i,j} = 0 for every
(i, j) ∈ I(X*); otherwise,
X* ∉ P_Ω(X* − t∇f(X*)) for sufficiently small t,
which would yield a contradiction. Taking
ε < min{0.5σ̲, min{|X*_{i,j}|: (i,j) ∈ I(X*)}}, (44)
there is X* + Y ≻ 0 since
0.5σ̲ < σ₁(X*). From (43), since Y_{i,j} = 0 for every
(i, j) ∈ A(X*) and (∇f(X*))_{i,j} = 0 for every
(i, j) ∈ I(X*), there is
tr(∇f(X*)^T Y) = Σ_{(i,j)∈I(X*)} (∇f(X*))_{i,j} Y_{i,j} = 0,
and hence
f(X* + Y) ≥ f(X*) + (ρ/2)∥Y∥²_F > f(X*),
where the last strict inequality comes from the fact
that not all the entries of Y are 0.
Case B. (|I(X*)| < κ): As in the proof of Case B in Theorem
4, there is ∇f(X*) = 0. Hence from (43), there is
f(X* + Y) ≥ f(X*) + (ρ/2)∥Y∥²_F > f(X*)
for any ∥Y∥_F ∈ (0, ε)
(ε defined in (44)), which completes the proof.
In one aspect, the number of nonzeros κ and the coefficient
λ that penalizes the Frobenius norm may be tuned. In one
aspect, it is noted that the performance of the model of Equation
(3), for example, evaluated based on real data, is stable for a
wide range of the parameters κ and λ.
A system and/or method in embodiments provides for anomaly
localization for multivariate time series datasets. An
ℓ₀-constrained model with an ℓ₂-regularization in the
objective function in one embodiment learns sparse dependency
graphs, e.g., creates a dependency graph structure. In an
embodiment, a projected gradient algorithm for the
ℓ₀-constrained problem is developed and proved to
converge to a stationary point. In embodiments, metrics are
applied for scoring anomalies. The ℓ₂-regularized
ℓ₀-constrained models of Equation (3) and Equation (4), for
example, combined with a metric, e.g., the conditional expected
Kullback-Leibler divergence anomaly scoring of Equation (8),
improve methods for detecting anomalies.
FIG. 3 is a diagram showing components of a system in one
embodiment, which performs anomaly localization for multivariate
time series datasets and detects anomalies in sensor data. One or
more hardware processors 302, such as a central processing unit
(CPU), a graphics processing unit (GPU), a Field Programmable Gate
Array (FPGA), an application specific integrated circuit (ASIC),
and/or another processor, may be coupled with a memory device 304,
and construct a first dependency graph based on a first data set by
solving a first objective function constrained with a maximum
number of non-zeros and formulated with a regularization term,
which includes a quadratic penalty to control sparsity. In an
embodiment, the quadratic penalty is determined as a function of
the first data set in constructing the first dependency graph. One
or more hardware processors 302 also construct a second dependency
graph based on a second data set by solving the first objective
function constrained with the maximum number of non-zeros and
formulated with the regularization term, which includes a quadratic
penalty. In constructing the second dependency graph, the quadratic
penalty is determined as a function of the first data set and the
second data set.
In an embodiment, one or more hardware processors 302 receive, via
a communication or network interface 308, a first data set, which
includes sensor data detected by a plurality of sensors coupled to
equipment in operation during a first period of time, for example,
shown at 312. During this first period of time, the equipment is
considered to have functioned in a normal manner (e.g., without
particular errors or malfunctioning), and hence the first data set
includes data associated with a normal state of operations of the
equipment. One or more hardware processors 302 also receive a
second data set comprising sensor data detected by the plurality of
sensors coupled to the equipment 312 in operation during a second
period of time. The second period of time may be the current time,
for example, a window of current time. One or more hardware
processors 302 determine an anomaly score for each of the
plurality of sensors based on comparing the first dependency graph
and the second dependency graph.
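A hedged sketch of per-sensor scoring by comparing the two dependency graphs: each sensor i is scored by the expected KL divergence between its Gaussian conditional distribution p(x_i | x_rest) under the two learned precision matrices, with the expectation taken under the first (training) model. This is one concrete instantiation of conditional KL scoring; the patent's Equation (8) is not reproduced in this section and may differ in detail, and the function name is hypothetical.

```python
import numpy as np

def sensor_anomaly_scores(Lam_train, Lam_test):
    """Score each sensor i by the expected KL divergence between its
    conditional p(x_i | x_rest) under two zero-mean Gaussian graphical
    models with precision matrices Lam_train and Lam_test. Under a
    precision matrix Lam, x_i | x_rest is Gaussian with variance
    1/Lam_ii and mean w^T x_rest, where w = -Lam[rest, i] / Lam_ii."""
    n = Lam_train.shape[0]
    Sigma_train = np.linalg.inv(Lam_train)
    scores = np.zeros(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        vA = 1.0 / Lam_train[i, i]
        vB = 1.0 / Lam_test[i, i]
        wA = -Lam_train[rest, i] / Lam_train[i, i]
        wB = -Lam_test[rest, i] / Lam_test[i, i]
        d = wA - wB
        # E[(m_A - m_B)^2] = d^T Sigma_rest d for zero-mean data.
        gap = d @ Sigma_train[np.ix_(rest, rest)] @ d
        # Closed-form KL(N(m_A, vA) || N(m_B, vB)) averaged over x_rest.
        scores[i] = 0.5 * (np.log(vB / vA) + (vA + gap) / vB - 1.0)
    return scores
```

A sensor whose neighborhood in the dependency graph is unchanged between the two data sets receives a score of zero; a score exceeding a chosen threshold would flag that sensor for the corrective action described below.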
One or more hardware processors 302, in an embodiment, responsive
to determining that an anomaly score associated with a sensor in
the plurality of sensors meets a threshold value indicative of
abnormal functioning, automatically trigger a corrective action to
correct the equipment coupled to the sensor. For instance, one or
more hardware processors 302 may send a signal to adjust one or
more parameters of the equipment 312, via a network
interface 308.
In an embodiment, one or more hardware processors 302 solve the
first objective function implementing a projected gradient
algorithm. The projected gradient algorithm embeds an approximate
Newton method, wherein feasibility with respect to the sparsity
constraint is handled via projection, and symmetric
positive-definiteness of iterates is ensured through a line-search
procedure.
The memory device 304 may include random access memory (RAM),
read-only memory (ROM) or another memory device, and may store data
and/or processor instructions for implementing various
functionalities associated with the methods and/or systems
described herein. A processor 302 may execute computer instructions
stored in the memory 304 or received from another computer device
or medium. The memory device 304 may, for example, store
instructions and/or data for functioning of the one or more
hardware processors 302, and may include an operating system and
other program of instructions and/or data. In one aspect, input
data such as the first data set and the second data set, and
program instructions which one or more of the hardware processors
302 may execute, can be stored on a storage device 306 or received
via a network interface 308 from a remote device, and may be
temporarily loaded into the memory device 304 to construct
dependency graphs and detect anomalies. One or more hardware
processors 302 may be coupled with interface devices such as a
network interface 308 for communicating with remote systems, for
example, via a network, and an input/output interface 310 for
communicating with input and/or output devices such as a keyboard,
mouse, display, and/or others.
FIG. 4 is a diagram illustrating a screen display of an analytics
tool, which integrates an anomaly detection technique in one
embodiment. Learning dependency models and detecting anomalies
according to embodiments in the present disclosure can be
integrated as an analytics function, for example, shown at 402,
which can be selectable via a user interface 400 of an analytics
tool. Selecting the function 402, for example, navigates a user
through building dependency graphs and detecting anomalies in a
connected sensor network, for example, of equipment in operation
in a manufacturing or another industrial process.
FIG. 5 illustrates a schematic of an example computer or processing
system that may implement a system in one embodiment of the present
disclosure. The computer system is only one example of a suitable
processing system and is not intended to suggest any limitation as
to the scope of use or functionality of embodiments of the
methodology described herein. The processing system shown may be
operational with numerous other general purpose or special purpose
computing system environments or configurations. Examples of
well-known computing systems, environments, and/or configurations
that may be suitable for use with the processing system shown in
FIG. 5 may include, but are not limited to, personal computer
systems, server computer systems, thin clients, thick clients,
handheld or laptop devices, multiprocessor systems,
microprocessor-based systems, set top boxes, programmable consumer
electronics, network PCs, minicomputer systems, mainframe computer
systems, and distributed cloud computing environments that include
any of the above systems or devices, and the like.
The computer system may be described in the general context of
computer system executable instructions, such as program modules,
being executed by a computer system. Generally, program modules may
include routines, programs, objects, components, logic, data
structures, and so on that perform particular tasks or implement
particular abstract data types. The computer system may be
practiced in distributed cloud computing environments where tasks
are performed by remote processing devices that are linked through
a communications network. In a distributed cloud computing
environment, program modules may be located in both local and
remote computer system storage media including memory storage
devices.
The components of computer system may include, but are not limited
to, one or more processors or processing units 12, a system memory
16, and a bus 14 that couples various system components including
system memory 16 to processor 12. The processor 12 may include a
module 30 that performs the methods described herein. The module 30
may be programmed into the integrated circuits of the processor 12,
or loaded from memory 16, storage device 18, or network 24 or
combinations thereof.
Bus 14 may represent one or more of any of several types of bus
structures, including a memory bus or memory controller, a
peripheral bus, an accelerated graphics port, and a processor or
local bus using any of a variety of bus architectures. By way of
example, and not limitation, such architectures include Industry
Standard Architecture (ISA) bus, Micro Channel Architecture (MCA)
bus, Enhanced ISA (EISA) bus, Video Electronics Standards
Association (VESA) local bus, and Peripheral Component
Interconnects (PCI) bus.
Computer system may include a variety of computer system readable
media. Such media may be any available media that is accessible by
computer system, and it may include both volatile and non-volatile
media, removable and non-removable media.
System memory 16 can include computer system readable media in the
form of volatile memory, such as random access memory (RAM) and/or
cache memory or others. Computer system may further include other
removable/non-removable, volatile/non-volatile computer system
storage media. By way of example only, storage system 18 can be
provided for reading from and writing to a non-removable,
non-volatile magnetic media (e.g., a "hard drive"). Although not
shown, a magnetic disk drive for reading from and writing to a
removable, non-volatile magnetic disk (e.g., a "floppy disk"), and
an optical disk drive for reading from or writing to a removable,
non-volatile optical disk such as a CD-ROM, DVD-ROM or other
optical media can be provided. In such instances, each can be
connected to bus 14 by one or more data media interfaces.
Computer system may also communicate with one or more external
devices 26 such as a keyboard, a pointing device, a display 28,
etc.; one or more devices that enable a user to interact with
computer system; and/or any devices (e.g., network card, modem,
etc.) that enable computer system to communicate with one or more
other computing devices. Such communication can occur via
Input/Output (I/O) interfaces 20.
Still yet, computer system can communicate with one or more
networks 24 such as a local area network (LAN), a general wide area
network (WAN), and/or a public network (e.g., the Internet) via
network adapter 22. As depicted, network adapter 22 communicates
with the other components of computer system via bus 14. It should
be understood that although not shown, other hardware and/or
software components could be used in conjunction with computer
system. Examples include, but are not limited to: microcode, device
drivers, redundant processing units, external disk drive arrays,
RAID systems, tape drives, and data archival storage systems,
etc.
The present invention may be a system, a method, and/or a computer
program product. The computer program product may include a
computer readable storage medium (or media) having computer
readable program instructions thereon for causing a processor to
carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that
can retain and store instructions for use by an instruction
execution device. The computer readable storage medium may be, for
example, but is not limited to, an electronic storage device, a
magnetic storage device, an optical storage device, an
electromagnetic storage device, a semiconductor storage device, or
any suitable combination of the foregoing. A non-exhaustive list of
more specific examples of the computer readable storage medium
includes the following: a portable computer diskette, a hard disk,
a random access memory (RAM), a read-only memory (ROM), an erasable
programmable read-only memory (EPROM or Flash memory), a static
random access memory (SRAM), a portable compact disc read-only
memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a
floppy disk, a mechanically encoded device such as punch-cards or
raised structures in a groove having instructions recorded thereon,
and any suitable combination of the foregoing. A computer readable
storage medium, as used herein, is not to be construed as being
transitory signals per se, such as radio waves or other freely
propagating electromagnetic waves, electromagnetic waves
propagating through a waveguide or other transmission media (e.g.,
light pulses passing through a fiber-optic cable), or electrical
signals transmitted through a wire.
Computer readable program instructions described herein can be
downloaded to respective computing/processing devices from a
computer readable storage medium or to an external computer or
external storage device via a network, for example, the Internet, a
local area network, a wide area network and/or a wireless network.
The network may comprise copper transmission cables, optical
transmission fibers, wireless transmission, routers, firewalls,
switches, gateway computers and/or edge servers. A network adapter
card or network interface in each computing/processing device
receives computer readable program instructions from the network
and forwards the computer readable program instructions for storage
in a computer readable storage medium within the respective
computing/processing device.
Computer readable program instructions for carrying out operations
of the present invention may be assembler instructions,
instruction-set-architecture (ISA) instructions, machine
instructions, machine dependent instructions, microcode, firmware
instructions, state-setting data, or either source code or object
code written in any combination of one or more programming
languages, including an object oriented programming language such
as Smalltalk, C++ or the like, and conventional procedural
programming languages, such as the "C" programming language or
similar programming languages. The computer readable program
instructions may execute entirely on the user's computer, partly on
the user's computer, as a stand-alone software package, partly on
the user's computer and partly on a remote computer or entirely on
the remote computer or server. In the latter scenario, the remote
computer may be connected to the user's computer through any type
of network, including a local area network (LAN) or a wide area
network (WAN), or the connection may be made to an external
computer (for example, through the Internet using an Internet
Service Provider). In some embodiments, electronic circuitry
including, for example, programmable logic circuitry,
field-programmable gate arrays (FPGA), or programmable logic arrays
(PLA) may execute the computer readable program instructions by
utilizing state information of the computer readable program
instructions to personalize the electronic circuitry, in order to
perform aspects of the present invention.
Aspects of the present invention are described herein with
reference to flowchart illustrations and/or block diagrams of
methods, apparatus (systems), and computer program products
according to embodiments of the invention. It will be understood
that each block of the flowchart illustrations and/or block
diagrams, and combinations of blocks in the flowchart illustrations
and/or block diagrams, can be implemented by computer readable
program instructions.
These computer readable program instructions may be provided to a
processor of a general purpose computer, special purpose computer,
or other programmable data processing apparatus to produce a
machine, such that the instructions, which execute via the
processor of the computer or other programmable data processing
apparatus, create means for implementing the functions/acts
specified in the flowchart and/or block diagram block or blocks.
These computer readable program instructions may also be stored in
a computer readable storage medium that can direct a computer, a
programmable data processing apparatus, and/or other devices to
function in a particular manner, such that the computer readable
storage medium having instructions stored therein comprises an
article of manufacture including instructions which implement
aspects of the function/act specified in the flowchart and/or block
diagram block or blocks.
The computer readable program instructions may also be loaded onto
a computer, other programmable data processing apparatus, or other
device to cause a series of operational steps to be performed on
the computer, other programmable apparatus or other device to
produce a computer implemented process, such that the instructions
which execute on the computer, other programmable apparatus, or
other device implement the functions/acts specified in the
flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the
architecture, functionality, and operation of possible
implementations of systems, methods, and computer program products
according to various embodiments of the present invention. In this
regard, each block in the flowchart or block diagrams may represent
a module, segment, or portion of instructions, which comprises one
or more executable instructions for implementing the specified
logical function(s). In some alternative implementations, the
functions noted in the block may occur out of the order noted in
the figures. For example, two blocks shown in succession may, in
fact, be executed substantially concurrently, or the blocks may
sometimes be executed in the reverse order, depending upon the
functionality involved. It will also be noted that each block of
the block diagrams and/or flowchart illustration, and combinations
of blocks in the block diagrams and/or flowchart illustration, can
be implemented by special purpose hardware-based systems that
perform the specified functions or acts or carry out combinations
of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing
particular embodiments only and is not intended to be limiting of
the invention. As used herein, the singular forms "a", "an" and
"the" are intended to include the plural forms as well, unless the
context clearly indicates otherwise. It will be further understood
that the terms "comprises" and/or "comprising," when used in this
specification, specify the presence of stated features, integers,
steps, operations, elements, and/or components, but do not preclude
the presence or addition of one or more other features, integers,
steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of
all means or step plus function elements, if any, in the claims
below are intended to include any structure, material, or act for
performing the function in combination with other claimed elements
as specifically claimed. The description of the present invention
has been presented for purposes of illustration and description,
but is not intended to be exhaustive or limited to the invention in
the form disclosed. Many modifications and variations will be
apparent to those of ordinary skill in the art without departing
from the scope and spirit of the invention. The embodiment was
chosen and described in order to best explain the principles of the
invention and the practical application, and to enable others of
ordinary skill in the art to understand the invention for various
embodiments with various modifications as are suited to the
particular use contemplated.
* * * * *