U.S. patent application number 12/115824 was published by the patent office on 2008-11-06 as publication number 20080275671 for systems and methods for structural clustering of time sequences.
This patent application is currently assigned to International Business Machines Corporation. Invention is credited to Vittorio Castelli, Michail Vlachos, Philip S. Yu.
Application Number: 20080275671 / 12/115824
Document ID: /
Family ID: 37071647
Publication Date: 2008-11-06

United States Patent Application 20080275671
Kind Code: A1
Castelli; Vittorio; et al.
November 6, 2008
SYSTEMS AND METHODS FOR STRUCTURAL CLUSTERING OF TIME SEQUENCES
Abstract
Arrangements and methods for performing structural clustering
between different time series. Time series data relating to a
plurality of time series is accepted, structural features relating
to the time series data are ascertained, and at least one distance
between different time series via employing the structural features
is determined. The different time series may be partitioned into
clusters based on the at least one distance, and/or the k closest
matches to a given time series query based on the at least one
distance may be returned.
Inventors: Castelli; Vittorio (Croton-on-Hudson, NY); Vlachos; Michail (Elmsford, NY); Yu; Philip S. (Chappaqua, NY)
Correspondence Address: FERENCE & ASSOCIATES LLC, 409 BROAD STREET, PITTSBURGH, PA 15143, US
Assignee: International Business Machines Corporation, Armonk, NY
Family ID: 37071647
Appl. No.: 12/115824
Filed: May 6, 2008
Related U.S. Patent Documents

Application Number   Filing Date    Patent Number
11096485             Mar 31, 2005   7369961
12115824
Current U.S. Class: 702/176
Current CPC Class: Y10S 707/99937 20130101; Y10S 706/90 20130101; Y10S 707/99953 20130101; Y10S 707/99936 20130101; Y10S 707/99945 20130101; G06K 9/00523 20130101
Class at Publication: 702/176
International Class: G06F 15/00 20060101 G06F015/00
Government Interests
[0002] This invention was made with Government support under
Contract No.: H98230-04-3-001 awarded by the U.S. Department of
Defense. The Government has certain rights in this invention.
Claims
1. A method of performing structural clustering between different
time series, said method comprising the steps of: accepting
distinct and diverse time series data relating to a plurality of
time series; ascertaining structural features relating to the time
series data; determining at least one distance between different
time series via employing the structural features; and partitioning
the different time series into time-invariant clusters containing
at least one of the time series based on the at least one distance;
wherein the clusters are stored in a computer memory.
2. The method according to claim 1, further comprising the step of
determining common periodicities corresponding to each of the
clusters.
3. The method according to claim 1, further comprising the step of
predetermining a number of structural features to compute.
4. The method according to claim 1, wherein: said ascertaining step
comprises ascertaining frequency content relating to the time
series data; and said ascertaining step further comprises
implementing a Discrete Fourier Transform.
5. The method according to claim 1, wherein: said ascertaining step
comprises determining an orthogonal transformation relating to the
time series data; the orthogonal transformation of the data
comprising a Discrete Wavelet Transform.
6. The method according to claim 1, further comprising the steps
of: selecting at least one of said structural features; and said
determining is performed via employing the at least one structural
feature selected.
7. The method according to claim 6, wherein said selecting step is
performed by a user.
8. The method according to claim 6, wherein: said selecting step is
performed automatically; and said selecting step comprises:
identifying candidate structural features; and verifying the
candidate structural features.
9. The method according to claim 8, wherein: said identifying step
comprises: computing a periodogram of the different time series;
and identifying peaks of the periodogram; and said verifying step
comprises: computing an autocorrelation; and selecting identified
peaks of the periodogram that lie on hills of the
autocorrelation.
10. The method according to claim 1 wherein said ascertaining step
comprises: computing all structural features; and automatically
selecting a number of most relevant features.
11. The method according to claim 10, wherein said step of
automatically selecting a number of most relevant features
comprises: selecting a parameter k corresponding to a number of
structural features to keep; and retaining the k features that
contain the highest amount of periodic content.
12. The method according to claim 10, wherein: said step of
automatically selecting a number of most relevant features
comprises: selecting a threshold; and retaining features having
value larger than the threshold; and said step of selecting a
threshold comprises selecting a threshold which serves to discard
features having values attributable to statistical variations via:
computing a resampling estimate of the distribution of feature
values attributable to statistical variations; selecting a value of
probability of type 1 error; and selecting as a threshold a value
that guarantees the selected value of probability of type 1 error
for a distribution equal to the resampling estimate of the
distribution.
13. An apparatus for performing structural clustering between
different time series, said apparatus comprising: an arrangement
for accepting distinct and diverse time series data relating to a
plurality of time series; an arrangement for ascertaining
structural features relating to the time series data; an
arrangement for determining at least one distance between different
time series via employing the structural features; and an
arrangement for partitioning the different time series into
time-invariant clusters containing at least one of the time series
based on the at least one distance; wherein the clusters are
stored in a computer memory.
14. The apparatus according to claim 13, further comprising an
arrangement for determining common periodicities corresponding to
each of the clusters.
15. The apparatus according to claim 13, further comprising an
arrangement for predetermining a number of structural features to
compute.
16. The apparatus according to claim 13, wherein: said ascertaining
arrangement is adapted to ascertain frequency content relating to
the time series data; and said ascertaining arrangement is further
adapted to implement a Discrete Fourier Transform.
17. The apparatus according to claim 13, wherein: said ascertaining
arrangement is adapted to determine an orthogonal transformation
relating to the time series data; the orthogonal transformation of
the data comprising a Discrete Wavelet Transform.
18. The apparatus according to claim 13, further comprising: an
arrangement for selecting at least one of said structural features;
and said determining arrangement is adapted to employ the at least
one structural feature selected.
19. The apparatus according to claim 18, wherein said selecting
arrangement is operable by a user.
20. The apparatus according to claim 18, wherein: said selecting
arrangement is operable automatically; and said selecting
arrangement is adapted to: identify candidate structural features;
and verify the candidate structural features.
21. The apparatus according to claim 20, wherein: said identifying
arrangement is adapted to: compute a periodogram of the different
time series; and identify peaks of the periodogram; and said
verifying arrangement is adapted to: compute an autocorrelation;
and select identified peaks of the periodogram that lie on hills of
the autocorrelation.
22. The apparatus according to claim 13 wherein said ascertaining
arrangement is adapted to: compute all structural features; and
automatically select a number of most relevant features.
23. The apparatus according to claim 22, wherein said arrangement
for automatically selecting a number of most relevant features is
adapted to: select a parameter k corresponding to a number of
structural features to keep; and retain the k features that contain
the highest amount of periodic content.
24. The apparatus according to claim 22, wherein: said arrangement
for automatically selecting a number of most relevant features is
adapted to: select a threshold; and retain features having value
larger than the threshold; and said arrangement for selecting a
threshold is adapted to select a threshold which serves to discard
features having values attributable to statistical variations via:
computing a resampling estimate of the distribution of feature
values attributable to statistical variations; selecting a value of
probability of type 1 error; and selecting as a threshold a value
that guarantees the selected value of probability of type 1 error
for a distribution equal to the resampling estimate of the
distribution.
25. A program storage device readable by machine, tangibly
embodying a program of instructions executed by the machine to
perform method steps for performing structural clustering between
different time series, said method comprising the steps of:
accepting distinct and diverse time series data relating to a
plurality of time series; ascertaining structural features relating
to the time series data; determining at least one distance between
different time series via employing the structural features; and
partitioning the different time series into time-invariant clusters
containing at least one of the time series based on the at least
one distance; wherein the clusters are stored in a computer
memory.
26. A method of quantifying the structural similarity between
different time series, said method comprising the steps of:
accepting distinct and diverse time series data relating to a
plurality of time series; ascertaining structural features relating
to the time series data; determining at least one distance between
different time series via employing the structural features; and
returning the k closest matches to a given time series query based
on the at least one distance; wherein the k closet matches are
stored in a computer memory.
27. The method according to claim 26, wherein the structural
features are based on at least one of: periodic features extracted
from the time-series; and burst features extracted from the
time-series.
28. An apparatus for quantifying the structural similarity between
different time series, said apparatus comprising: an arrangement
for accepting distinct and diverse time series data relating to a
plurality of time series; an arrangement for ascertaining
structural features relating to the time series data; an
arrangement for determining at least one distance between different
time series via employing the structural features; and an
arrangement for returning the k closest matches to a given time
series query based on the at least one distance; wherein the k
closest matches are stored in a computer memory.
29. The apparatus according to claim 28, wherein the structural
features are based on at least one of: periodic features extracted
from the time-series; and burst features extracted from the
time-series.
Description
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application is a continuation application of co-pending
U.S. patent application Ser. No. 11/096,485 filed on Mar. 31, 2005,
the contents of which are hereby incorporated by reference as if
set forth fully herein.
FIELD OF THE INVENTION
[0003] The present invention generally relates to the management of
data relating to time-series representations.
BACKGROUND OF THE INVENTION
[0004] Herebelow, numerals set forth in square brackets--[ ]--are
keyed to the list of references found towards the end of the
present disclosure.
[0005] In recent years, in a constant effort to effect ongoing
improvements in a crowded field of knowledge, there has been a
profusion of time-series distance measures and representations. The
majority of these attempts to characterize the similarity between
sequences is based solely on shape. However, it is becoming
increasingly apparent that structural similarities can provide more
intuitive sequence characterizations that adhere more tightly to
human perception of similarity.
[0006] While shape-based similarity methods seek to identify
homomorphic sequences using original raw data, structure-based
methodologies are designed to find latent similarities, possibly by
transforming the sequences into a new domain, where the resemblance
can be more apparent.
[0007] Generally, an evolving need has been recognized in
connection with providing an ever more effective and efficient
manner of managing time-series data.
SUMMARY OF THE INVENTION
[0008] Broadly contemplated herein, in accordance with at least one
presently preferred embodiment of the present invention, are
methods and arrangements considered for:
[0009] (i) efficiently capturing and characterizing (automatically)
the periodicity of time-series;
[0010] (ii) characterizing the periodic similarity of time series;
and
[0011] (iii) combining the above methods to perform periodic
clustering of time-series, where the periodicities of each cluster
are also provided.
[0012] Techniques such as those outlined above can be applicable in
a variety of disciplines, such as manufacturing, natural sciences
and medicine, which acquire and record large amounts of periodic
data. For the analysis of such data, first there is preferably
employed accurate periodicity estimation, which can be utilized
either for anomaly detection or for prediction purposes. Then, a
structural distance measure can preferably be deployed that can
effectively incorporate the periodicity for quantifying the degree
of similarity between sequences. It is recognized that a periodic
measure can allow for more meaningful and accurate clustering and
classification, and can also be used for interactive exploration
(and visualization) of massive periodic datasets.
[0013] In summary, one aspect of the invention provides a method of
performing structural clustering between different time series,
said method comprising the steps of: accepting time series data
relating to a plurality of time series; ascertaining structural
features relating to the time series data; determining at least one
distance between different time series via employing the structural
features; and partitioning the different time series into clusters
based on the at least one distance.
[0014] Another aspect of the invention provides an apparatus for
performing structural clustering between different time series,
said apparatus comprising: an arrangement for accepting time series
data relating to a plurality of time series; an arrangement for
ascertaining structural features relating to the time series data;
an arrangement for determining at least one distance between
different time series via employing the structural features; and an
arrangement for partitioning the different time series into
clusters based on the at least one distance.
[0015] A further aspect of the invention provides a program storage
device readable by machine, tangibly embodying a program of
instructions executed by the machine to perform method steps for
performing structural clustering between different time series,
said method comprising the steps of: accepting time series data
relating to a plurality of time series; ascertaining structural
features relating to the time series data; determining at least one
distance between different time series via employing the structural
features; and partitioning the different time series into clusters
based on the at least one distance.
[0016] Yet another aspect of the invention provides a method of
quantifying the structural similarity between different time
series, said method comprising the steps of: accepting time series
data relating to a plurality of time series; ascertaining
structural features relating to the time series data; determining
at least one distance between different time series via employing
the structural features; and returning the k closest matches to a
given time series query based on the at least one distance.
[0017] A yet further aspect of the invention provides an apparatus
for quantifying the structural similarity between different time
series, said apparatus comprising: an arrangement for accepting
time series data relating to a plurality of time series; an
arrangement for ascertaining structural features relating to the
time series data; an arrangement for determining at least one
distance between different time series via employing the structural
features; and an arrangement for returning the k closest matches to
a given time series query based on the at least one distance.
[0018] Furthermore, an additional aspect of the invention provides
a program storage device readable by machine, tangibly embodying a
program of instructions executed by the machine to perform method
steps for quantifying the structural similarity between different
time series, said method comprising the steps of: accepting time
series data relating to a plurality of time series; ascertaining
structural features relating to the time series data; determining
at least one distance between different time series via employing
the structural features; and returning the k closest matches to a
given time series query based on the at least one distance.
[0019] For a better understanding of the present invention,
together with other and further features and advantages thereof,
reference is made to the following description, taken in
conjunction with the accompanying drawings, and the scope of the
invention will be pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] FIG. 1 depicts a reconstruction of a signal from Fourier
coefficients.
[0021] FIG. 2 depicts a sequence and a corresponding periodogram
and autocorrelation graph.
[0022] FIG. 3 schematically depicts an "autoperiod"
methodology.
[0023] FIG. 4 provides a graphical demonstration of the method of
FIG. 3.
[0024] FIG. 5 depicts an algorithm, "getPeriodHints".
[0025] FIGS. 6(a) through 6(b) depict queries and corresponding
periodograms.
[0026] FIG. 7 depicts a segmentation of autocorrelation
intervals.
[0027] FIGS. 8(a) through 8(d) depict periodicity detection results
of the "autoperiod" method.
[0028] FIG. 9 provides a comparison between two time-series.
[0029] FIG. 10 depicts a dendrogram based on historical
features.
[0030] FIG. 11 depicts a two-dimensional mapping of pairwise
distances between different sequences.
[0031] FIG. 12 depicts a dendrogram for a pDist measure, which
achieves a perfect clustering.
[0032] FIG. 13 depicts incorrect grouping in a 2 class ECG
problem.
[0033] FIG. 14 depicts correct grouping in a 3 class ECG
problem.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0034] By way of background, provided herebelow is a brief
introduction to harmonic analysis using the discrete Fourier
Transform, because these tools will be used as the building blocks
of algorithms in accordance with at least one embodiment of the
present invention.
[0035] The normalized Discrete Fourier Transform of a sequence x(n), n = 0, 1, \ldots, N-1 is a sequence of complex numbers X(f):

X(f_{k/N}) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1
where the subscript k/N denotes the frequency that each coefficient
captures. Herethroughout there will also be utilized the notation
F(x) to describe the Fourier Transform. Since we are dealing with
real signals, the Fourier coefficients are symmetric around the
middle one (or, to be more exact, they will be the complex conjugates of their symmetric counterparts). The Fourier transform represents the original
signal as a linear combination of the complex sinusoids
s_f(n) = \frac{e^{j 2\pi f n / N}}{\sqrt{N}}.
Therefore, the Fourier coefficients record the amplitude and phase
of these sinusoids, after signal x is projected on them.
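By way of illustration, the normalized DFT above can be written directly in a few lines of Python (an O(N^2) direct-summation sketch for clarity rather than an FFT; the function name `dft` is illustrative). The assertions check the conjugate-symmetry property noted above, as well as the energy preservation implied by the 1/\sqrt{N} normalization:

```python
import cmath
import math

def dft(x):
    # Normalized DFT: X(f_{k/N}) = (1/sqrt(N)) * sum_n x(n) * e^(-j*2*pi*k*n/N)
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            / math.sqrt(N) for k in range(N)]

# A real test signal: coefficients are complex conjugates around the middle one.
x = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0, -2.0, 1.5]
X = dft(x)
N = len(x)
for k in range(1, N):
    assert abs(X[k] - X[N - k].conjugate()) < 1e-9

# The 1/sqrt(N) normalization makes the transform energy-preserving (Parseval).
assert abs(sum(abs(c) ** 2 for c in X) - sum(v * v for v in x)) < 1e-9
```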
[0036] One can return from the frequency domain back to the time domain using the inverse Fourier transform F^{-1}(X) \equiv x(n):

x(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X(f_{k/N})\, e^{j 2\pi k n / N}, \qquad n = 0, 1, \ldots, N-1
[0037] Note that if during this reverse transformation one discards
some of the coefficients (e.g., the last k), then the outcome will
be an approximation of the original sequence (see FIG. 1). By
carefully selecting which coefficients to record, one can perform a
variety of tasks such as compression, denoising, etc.
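The coefficient-discarding idea in the preceding paragraph can be sketched as follows, under the assumption of a toy signal whose sinusoids fall exactly on DFT bins (the names and the direct O(N^2) transforms are illustrative, not an optimized implementation):

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            / math.sqrt(N) for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N))
            / math.sqrt(N) for n in range(N)]

# Signal: a strong sinusoid plus a weak one.
N = 16
x = [math.sin(2 * math.pi * 2 * n / N) + 0.1 * math.cos(2 * math.pi * 5 * n / N)
     for n in range(N)]
X = dft(x)

# Keep the 4 largest-magnitude coefficients (the two conjugate pairs), zero the rest.
keep = set(sorted(range(N), key=lambda k: -abs(X[k]))[:4])
X_approx = [X[k] if k in keep else 0j for k in range(N)]
x_approx = [c.real for c in idft(X_approx)]

# Both sinusoids fall exactly on DFT bins here, so 4 coefficients suffice.
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_approx)))
assert err < 1e-6
```

With fewer retained coefficients the reconstruction degrades gracefully, which is the basis of the compression and denoising uses mentioned above.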
[0038] In order to discover potential periodicities of a
time-series, one needs to examine its power spectral density (PSD
or power spectrum). The PSD essentially tells us the expected signal power at each frequency of the signal. Since period
is the inverse of frequency, by identifying the frequencies that
carry most of the energy, we can also discover the most dominant
periods. There are two well-known estimators of the PSD: the periodogram and the circular autocorrelation. Both of these methods
can be computed using the DFT of a sequence (and can therefore
exploit the Fast Fourier Transform for execution in O(N log N)
time).
[0039] Suppose that X is the DFT of a sequence x. The periodogram P is provided by the squared length of each Fourier coefficient:

P(f_{k/N}) = \| X(f_{k/N}) \|^{2}, \qquad k = 0, 1, \ldots, N/2

[0040] where \| \cdot \| denotes the L_2 norm of a vector. Notice that one can only detect frequencies that are at most half of the maximum signal frequency, due to Nyquist's
most half of the maximum signal frequency, due to Nyquist's
fundamental theorem. In order to find the k dominant periods, one
should preferably pick the k largest values of the periodogram.
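A minimal sketch of the periodogram and the selection of the dominant period follows, assuming a toy sinusoid whose frequency falls exactly on a DFT bin (function names are illustrative):

```python
import cmath
import math

def periodogram(x):
    # P(f_{k/N}) = ||X(f_{k/N})||^2 for k = 0..N/2 (half-spectrum, real signals)
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         / math.sqrt(N) for k in range(N // 2 + 1)]
    return [abs(c) ** 2 for c in X]

N = 64
x = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]  # true period N/8 = 8
P = periodogram(x)

# The dominant period is N/k for the largest non-DC periodogram value.
k_dom = max(range(1, len(P)), key=lambda k: P[k])
dominant_period = N / k_dom
assert dominant_period == 8.0
```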
[0041] Each element of the periodogram provides the power at
frequency k/N or, equivalently, at period N/k. More precisely, each DFT `bin` corresponds to a range of periods (or frequencies). That is, coefficient X(f_{k/N}) corresponds to the range of periods

\left[ \frac{N}{k}, \frac{N}{k-1} \right)
It is easy to see that the resolution of the periodogram becomes
very coarse for longer periods. For example, for a sequence of
length N=256, the DFT bin margins will be N/1,N/2,N/3, . . . =256,
128, 64 etc.
[0042] Essentially, the accuracy of the discovered periods
deteriorates for large periods, due to the increasing width of the
DFT bins (N/k). Another related issue is spectral leakage, which causes frequencies that are not integer multiples of the DFT bin width to disperse over the entire spectrum. This can lead to
`false alarms` in the periodogram. However, the periodogram can
still provide an accurate indicator of important short (to medium)
length periods. Additionally, through the periodogram it is easy to
automate the extraction of important periods (peaks) by examining
the statistical properties of the Fourier coefficients.
[0043] The second way to estimate the dominant periods of a
time-series x, is to calculate the circular AutoCorrelation
Function (or ACF), which examines how similar a sequence is to its
previous values for different lags \tau:

ACF(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau)

[0044] where the index n+\tau is taken modulo N.
[0045] Therefore, the autocorrelation is formally a convolution,
and one can avoid the quadratic calculation in the time domain by
computing it efficiently as a dot product in the frequency domain
using the normalized Fourier transform:
ACF = F^{-1}\langle X, X^{*} \rangle
[0046] The star (*) symbol denotes complex conjugation.
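The identity above can be checked numerically. The sketch below uses the unnormalized transform pair for simplicity; with the normalized transform used elsewhere in this document the same identity holds up to a constant factor (all names are illustrative):

```python
import cmath
import math

def dft_u(x):  # unnormalized DFT
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft_u(X):  # unnormalized inverse (carries the 1/N factor)
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [1.0, 3.0, -2.0, 0.5, 2.0, -1.0]
N = len(x)

# Time domain: ACF(tau) = (1/N) * sum_n x(n) * x((n+tau) mod N)
acf_time = [sum(x[n] * x[(n + t) % N] for n in range(N)) / N for t in range(N)]

# Frequency domain: inverse transform of X * conj(X), scaled by 1/N
# for this normalization; the imaginary part vanishes for real x.
X = dft_u(x)
acf_freq = [c.real / N for c in idft_u([c * c.conjugate() for c in X])]

assert all(abs(a - b) < 1e-9 for a, b in zip(acf_time, acf_freq))
```

The frequency-domain route is what makes the O(N log N) complexity possible once the transforms are computed with an FFT.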
[0047] The ACF provides a more fine-grained periodicity detector
than the periodogram, hence it can pinpoint with greater accuracy
even larger periods. However, it is not sufficient by itself for
automatic periodicity discovery for the following reasons:
[0048] 1. Automated discovery of important peaks is more difficult
than in the periodogram, because the user must set a significance
threshold.
[0049] 2. Even if the user picks the level of significance,
multiples of the same basic period also appear as peaks. Therefore,
the method introduces many false alarms that need to be eliminated
in a post-processing phase.
[0050] Low-amplitude events of high frequency may appear less important (i.e., have lower peaks) than high-amplitude patterns, which nonetheless occur more rarely (see FIG. 2).
[0051] In relation to FIG. 2, the 7-day period is latent in the autocorrelation graph because it has lower amplitude (even though it occurs with higher frequency). However, the 7-day peak is very clear in the periodogram.
[0052] The advantages and shortcomings of the periodogram and the
ACF are summarized in Table 1.
[0053] From the above discussion one can realize that although the
periodogram and the autocorrelation cannot provide sufficient
spectral information separately, there is a lot of potential when
both methods are combined. An approach in accordance with at least
one preferred embodiment of the present invention is delineated in
the following section.
[0054] In accordance with at least one preferred embodiment of the
present invention, there is preferably utilized a two-tier
approach, by considering the information in both the
autocorrelation and the periodogram. One may call this method
AUTOPERIOD. Since the discovery of important periods is more
difficult on the autocorrelation, one can use the periodogram for
extracting period candidates. The period candidates may be termed
`hints`. These `hints` may be false (due to spectral leakage), or
provide a coarse estimate of the period (remember that DFT bins
increase gradually in size); therefore a verification phase using
the autocorrelation is required, since it provides a more
fine-grained estimation of potential periodicities. The intuition
is that if the candidate period from the periodogram lies on a hill
of the ACF, then one can consider it a valid period; otherwise one may preferably discard it as a false alarm. For the periods that
reside on a hill, further refinement may be required if the
periodicity hint refers to a large period.
[0055] FIG. 3 summarizes a methodology in accordance with at least
one embodiment of the present invention and FIG. 4 depicts the
visual intuition behind such an approach with a working example.
The sequence is obtained from the MSN query request logs and
represents the aggregate demand for the query `Easter` for 1000
days after the beginning of 2002. The demand for the specific query
peaks during Easter time and one can observe one yearly peak. The
intuition is that periodicity should be approximately 365 (although
not exactly, since Easter is not celebrated at the same date every
year). Indeed the most dominant periodogram estimate is 333.33=(
1000/3), which is located on a hill of the ACF, with a peak at 357
(the correct periodicity--at least for this 3 year span). The
remaining periodic hints can be discarded upon verification with
the autocorrelation.
[0056] FIG. 4 provides a visual demonstration of a method carried out in accordance with an embodiment of the present invention.
Candidate periods from the periodogram are verified against the
autocorrelation. Valid periods are further refined utilizing the
autocorrelation information.
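A highly simplified sketch of the two-tier idea follows. It is not the patent's method verbatim: the power threshold here is an arbitrary fraction of the maximum rather than the permutation-based threshold described later, and the hill test is a crude interior-maximum check; all names are illustrative.

```python
import cmath
import math

def periodogram(x):
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         / math.sqrt(N) for k in range(N // 2 + 1)]
    return [abs(c) ** 2 for c in X]

def acf(x):
    N = len(x)
    return [sum(x[n] * x[(n + t) % N] for n in range(N)) / N for t in range(N)]

def autoperiod(x):
    """Two-tier sketch: periodogram peaks give candidate periods (`hints`);
    a hint is kept only if it lies on a hill of the ACF, and is then refined
    to the nearest ACF peak."""
    N = len(x)
    P = periodogram(x)
    a = acf(x)
    strong = [k for k in range(1, len(P)) if P[k] > 0.5 * max(P[1:])]
    periods = []
    for k in strong:
        p = round(N / k)
        if p < 2 or p > N // 2:
            continue  # trim unreliable hints
        w = max(1, p // 4)
        lo, hi = max(1, p - w), min(N // 2, p + w)
        peak = max(range(lo, hi + 1), key=lambda t: a[t])
        if lo < peak < hi:  # interior maximum => hint sits on a hill
            periods.append(peak)
    return sorted(set(periods))

result = autoperiod([math.sin(2 * math.pi * n / 8) for n in range(64)])
assert result == [8]  # the period-8 hint survives verification and refinement
```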
[0057] Essentially, there has been leveraged the information of
both metrics for providing an accurate periodicity detector. In
addition, methods carried out in accordance with at least one
embodiment of the present invention are computationally efficient,
because both the periodogram and the ACF can be directly computed
through the Fast Fourier Transform of the examined sequence in O(N
log N) time.
[0058] For extracting a set of candidate periodicities from the
periodogram, one needs to determine an appropriate power threshold
that should distinguish only the dominant frequencies (or inversely
the dominant periods). If none of the sequence frequencies exceeds
the specific threshold (i.e., the set of periodicity `hints` is
empty), then one can regard the sequence as non-periodic.
[0059] In order to specify which periods are important, one should
first preferably identify how much of the signal energy is
attributed to random mechanisms, that is, everything that could not
have been attributed to a random process should be of interest.
[0060] Let us assume that one examines a sequence x. The outcome of a permutation on the elements of x is a sequence \tilde{x}. The new sequence will retain the first-order statistics of the original sequence, but will not exhibit any pattern or periodicities, because of the `scrambling` process (even though such characteristics may have existed in sequence x). Anything that has the structure of \tilde{x} is not interesting and should be discarded; therefore, at this step one can record the maximum power p_{max} that \tilde{x} exhibits at any frequency f:

p_{\max} = \max_f \| \tilde{X}(f) \|^{2}

where \tilde{X} is the Fourier transform of \tilde{x}.
[0061] Only a frequency of x whose power exceeds p_{max} can be considered interesting. If one would like to provide a 99% confidence interval on which frequencies are important, one should repeat the above experiment 100 times and record for each one the maximum power of the permuted sequence \tilde{x}. The 99th largest value of these 100 experiments will provide a sufficient estimator of the power threshold p_T being sought.
Periods (in the original sequence's periodogram) whose power is more than the derived threshold will be considered:

p_{hint} = \{ N/k : P(f_{k/N}) > p_T \}
[0062] Finally, an additional period `trimming` should be performed
for discarding periods that are either too large or too small and
therefore cannot be considered reliable. In this phase any periodic
hint greater than N/2 or smaller than 2 is removed.
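The permutation procedure of paragraphs [0060]-[0062] can be sketched as follows. The reading of `the 99th largest value' as the upper quantile (second-largest of 100 maxima) is an assumption, as are the function and parameter names:

```python
import cmath
import math
import random

def periodogram(x):
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         / math.sqrt(N) for k in range(N // 2 + 1)]
    return [abs(c) ** 2 for c in X]

def period_hints(x, trials=100, rng=None):
    """Shuffle x, record the maximum periodogram power of each shuffled copy,
    use a high quantile of those maxima as the threshold p_T, and keep periods
    N/k whose power exceeds p_T, trimmed to [2, N/2]."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    N = len(x)
    maxima = []
    for _ in range(trials):
        xs = list(x)
        rng.shuffle(xs)
        maxima.append(max(periodogram(xs)[1:]))  # ignore the DC term
    p_T = sorted(maxima)[-2]  # ~99th percentile for 100 trials (an assumption)
    P = periodogram(x)
    hints = [N / k for k in range(1, len(P)) if P[k] > p_T]
    return [p for p in hints if 2 <= p <= N / 2]  # trim unreliable periods

# With a strong period-8 sinusoid, shuffling destroys the periodicity, so the
# threshold stays well below the true peak and only the period-8 hint survives.
x = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
hints = period_hints(x)
```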
[0063] FIG. 5 captures a pseudo-code of the algorithm for
identifying periodic hints.
[0064] In [2] another algorithm for detection of important periods
was proposed, which follows a different concept for estimating the
periodogram threshold. The assumption there was that the
periodogram of non-periodic time-series will follow an exponential
distribution, which returned very intuitive period estimates for
real world datasets. In experimentation, it has been found that the
two algorithms return very comparable threshold values. However,
because the new method does not make any assumptions about the
underlying distribution, it can be applicable to a wider variety of time-series processes.
[0065] By way of concrete examples, there were employed sequences
from the MSN query logs (yearly span) to demonstrate the usefulness
of the discovered periodic hints. In FIG. 6(a) there is presented
the demand of the query `stock market`, where one can distinguish a
strong weekly component in the periodogram. FIG. 6(b) depicts the
query `weekend` which does not contain any obvious periodicities. A
method in accordance with at least one embodiment of the present
invention can set the threshold high enough, thereby avoiding false alarms.
TABLE-US-00001 TABLE 1
Concise comparison of approaches for periodicity detection.

Method           Easy to threshold   Accurate short periods   Accurate large periods   Complexity
Periodogram      Yes                 Yes                      No                       O(N log N)
Autocorrelation  No                  Yes                      Yes                      O(N log N)
Combination      Yes                 Yes                      Yes                      O(N log N)
[0066] After the periodogram peaks have been identified, there has
been obtained a candidate set of periodicities for the examined
sequence. The validity of these periods will be verified against
the autocorrelation. An indication that a period is important can be the fact that the corresponding period lies on a hill of the autocorrelation. If the period resides on a valley, then it can be considered spurious and therefore safely discarded.
[0067] After discovering that a periodicity `hint` resides on a
hill of the autocorrelation, one can refine it even further by
identifying the closest peak (i.e., local maximum). This is a
necessary step because the correct periodicity (i.e., the peak of the
hill) might not have been discovered by the periodogram if it was
derived from a `wide` DFT bin. This is generally true for larger
periods, where the resolution of the DFT bins drops significantly.
Below is a discussion of how to address such issues.
[0068] The significance of a candidate period ideally can be
determined by examining the curvature of the ACF around the
candidate period p. The autocorrelation is concave downward if the
second derivative is negative in an open interval (a, b):

    ∂²ACF(x)/∂x² < 0, for all x ∈ (a, b), a < p < b
[0069] Nevertheless, small perturbations of the ACF due to the
existence of noise may invalidate the above requirement. A more
robust estimator of the curvature will be sought by
approximating the ACF in the proximity of the candidate period with
two linear segments. It is then sufficient to examine whether the
approximating segments exhibit an upward-downward trend, in order
to identify a concave downward pattern (i.e., a hill).
[0070] The segmentation of a sequence of length N into k linear
segments can be computed optimally using a dynamic programming
algorithm in O(N.sup.2k) time, while a greedy merge algorithm
achieves results very close to optimal in O(N log N) time. For this
problem instance, however, one can employ a simpler algorithm,
because only a two-segment approximation of a specific portion of
the ACF is required.
[0071] Let Ŝ_a^b be the linear regression of a sequence x
between the positions [a . . . b] and ε(Ŝ_a^b) be the error
introduced by the approximating segment. The best split position
t_split is derived from the configuration that minimizes the total
approximation error:

    t_split = argmin_t [ ε(Ŝ_1^t) + ε(Ŝ_(t+1)^N) ]
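As a minimal sketch of this two-segment test (function names are illustrative, not from the original): for each split position, fit a least-squares line to each side, keep the split with the smallest total residual error, and declare a hill when the first segment slopes upward and the second downward.

```python
import numpy as np

def _segment_error(y):
    """Squared residual error and slope of the best-fit line
    through the points (0..len(y)-1, y)."""
    t = np.arange(len(y), dtype=float)
    slope, intercept = np.polyfit(t, y, 1)   # least-squares line fit
    resid = y - (slope * t + intercept)
    return float(np.sum(resid ** 2)), slope

def is_on_hill(acf_window):
    """Approximate the ACF window with two linear segments at the
    split minimizing total error; a hill exhibits an
    upward-then-downward trend (slope1 > 0, slope2 < 0)."""
    y = np.asarray(acf_window, dtype=float)
    best = None
    for t in range(2, len(y) - 2):           # each segment needs >= 2 points
        e1, s1 = _segment_error(y[:t])
        e2, s2 = _segment_error(y[t:])
        if best is None or e1 + e2 < best[0]:
            best = (e1 + e2, t, s1, s2)
    _, t_split, slope1, slope2 = best
    return slope1 > 0 and slope2 < 0, t_split

# Illustrative usage: a triangular hill versus a valley.
on_hill, split = is_on_hill([0, 1, 2, 3, 4, 3, 2, 1, 0])
on_valley, _ = is_on_hill([4, 3, 2, 1, 0, 1, 2, 3, 4])
```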
[0072] After it has been ascertained that a candidate period
belongs on a hill and not on a valley of the ACF, there is a need
to discover the closest peak, which will return a more accurate
estimate of the periodicity hint (particularly for larger periods).
One can proceed in two ways. The first would be to perform any
hill-climbing technique, such as gradient ascent, for discovering
the local maximum; in this manner the local search is directed
toward the positive direction of the first derivative.
Alternatively, one could derive the peak position directly from the
linear segmentation of the ACF, which has already been computed in
the hill-detection phase. The peak should be located either at the
end of the first segment or at the beginning of the second segment.
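The first alternative, discrete hill climbing on the sampled ACF, can be sketched as follows (an illustrative sketch, not the claimed implementation): repeatedly move toward the larger neighboring ACF value until a local maximum is reached.

```python
def refine_period(acf, p):
    """Hill-climb from candidate period index p to the nearest
    local maximum of the sampled autocorrelation `acf`."""
    n = len(acf)
    while True:
        left = acf[p - 1] if p - 1 >= 0 else float('-inf')
        right = acf[p + 1] if p + 1 < n else float('-inf')
        if left > acf[p] and left >= right:
            p -= 1                   # climb toward the left neighbor
        elif right > acf[p]:
            p += 1                   # climb toward the right neighbor
        else:
            return p                 # local maximum reached

# Illustrative usage: the ACF peaks at index 4; candidates on either
# slope of the hill are refined to the same peak.
acf = [0.0, 0.2, 0.5, 0.9, 1.0, 0.7, 0.4, 0.1]
peak = refine_period(acf, 2)
```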
[0073] Both methods have been implemented for the purpose of
experimentation and it has been found that both report accurate
results.
[0074] Several sequences from the MSN query logs were employed to
perform convincing experiments regarding the accuracy of a 2-tier
methodology in accordance with at least one embodiment of the
present invention. The specific dataset is ideal for the present
purposes because one can detect a number of different periodicities
according to the demand pattern of each query.
[0075] The examples in FIG. 8 demonstrate a variety of situations
that might occur when using both the periodogram and
autocorrelation.
[0076] Query `Easter` (MSN): Examining the demand for a period of
1000 days, one can discover several periodic hints above the power
threshold in the periodogram. In this example, the autocorrelation
information refines the original periodogram hint (from
333→357). Additional hints are rejected because they reside
on ACF valleys (in the figure, only the top 3 candidate periods are
displayed for reasons of clarity).
[0077] Query `Harry Potter` (MSN): For this specific query, although
there are no observed periodicities (duration 365 days), the
periodogram returns 3 periodic hints, which are mostly attributed
to the burst pattern during November, when the movie was released.
The hints are classified as spurious upon verification with the
ACF.
[0078] Query `Fourier` (MSN): This is an example where the
periodogram threshold effectively does not return candidate
periods. Notice that if one had utilized only the autocorrelation
information, it would have been more troublesome to discover which
(if any) periods were important. This represents another validation
that the choice to perform the period thresholding in the frequency
space was correct.
[0079] Economic Index (Stock Market): Finally, this last sequence
from a stock market index illustrates a case where both the
periodogram and autocorrelation information concur on the single
(albeit weak) periodicity.
[0080] Through this experimental testbed it has been demonstrated
that AUTOPERIOD can provide very accurate periodicity estimates
without upsampling the original sequence. In the sections that
follow, it will be shown how it can be used in conjunction with
periodic similarity measures, for interactive exploration of
sequence databases.
[0081] Structural measures can preferably be introduced that are
based on periodic features extracted from sequences. Periodic
distance measures can be used for providing more meaningful
structural clustering and visualization of sequences (whether they
are periodic or not). After sequences are grouped in `periodic`
clusters, using a `drill-down` process the user can selectively
apply the AUTOPERIOD method for periodicity estimation on the
sequences or clusters of interest. In the discussion of
experimentation, examples of this methodology using hierarchical
clustering trees are provided.
[0082] Let us consider first the utility of periodic distance
measures with an example. Suppose that one is examining the
similarity between the two time-series of FIG. 9. When sequence A
exhibits an upward trend, sequence B displays a downward drift.
Clearly, the Euclidean distance (or inner product) between
sequences A and B, will characterize them as very different.
However, if one exploits the frequency content of the sequences and
evaluates their periodogram, one will discover that it is almost
identical. In this new space, the Euclidean distance can easily
identify the sequence similarities. Even though this specific
example could have been addressed in the original space using the
Dynamic Time Warping (DTW) distance, it should be noted that the
methods broadly contemplated herein are significantly more
efficient (in terms of both time and space) than DTW. Additionally,
periodic measures can address more subtle similarities that DTW
cannot capture, such as different patterns/shapes occurring at
periodic (possibly non-aligned) intervals. Herebelow, there will be
examined cases where the DTW fails.
[0083] The new measure of structural similarity presented herein
exploits the power content of only the most dominant
periods/frequencies. By considering the most powerful frequencies,
the present method concentrates on the most important structural
characteristics, effectively filtering out the negative influence
of noise, and eventually allowing for expedited distance
computation. Additionally, the omission of the phase information
renders the new similarity measure shift invariant in the time
domain. One can therefore discover time-series with similar
patterns, which may occur at different chronological instants.
[0084] For comparing the periodic structure of two sequences, one
should preferably examine how different their harmonic content is.
One may achieve this by utilizing the periodogram, and specifically
the frequencies with the highest energy.
[0085] Suppose that X is the Fourier transform of a sequence x with
length n. One can discover the k largest coefficients of X by
computing its periodogram P(X) and recording the position of the k
frequencies with the highest power content (parameter k depends on
the desired compression factor). Let us denote by p⁺ the vector
holding the positions of the coefficients with the largest power
(so p⁺ ⊆ [1 . . . n]). To compare x with any other sequence q, one
needs to examine how similar their energies are in the dominant
periods of x. Therefore, one preferably evaluates P(Q(p⁺)), which
describes a sequence holding the equivalent coefficients as the
vector P(X(p⁺)). The distance pDist between these two vectors
captures the periodic similarity between sequences x and q:

    pDist = ‖P(Q(p⁺)) − P(X(p⁺))‖
Example: Let x and q be two sequences and let their respective
Fourier Transforms be X = {(1+2i), (2+2i), (1+i), (5+i)} and
Q = {(2+2i), (1+i), (3+i), (1+2i)}. The periodogram vector of X is
P(X) = ‖X‖² = (5, 8, 2, 26). The vector holding the positions of X
with highest energy is p⁺ = (2, 4) and therefore
P(X(p⁺)) = (0, 8, 0, 26). Finally, since P(Q) = (8, 2, 10, 5), it
follows that P(Q(p⁺)) = (0, 2, 0, 5).
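The worked example can be reproduced with a short sketch of the pDist computation (the function name is illustrative): the positions of the k dominant periodogram coefficients of X are retained and both periodograms are zeroed elsewhere before taking the Euclidean distance.

```python
import numpy as np

def pdist_measure(x_fft, q_fft, k):
    """pDist between two sequences given their Fourier transforms:
    keep the k largest-power positions of P(X), zero everything
    else in both periodograms, and take the Euclidean distance."""
    px = np.abs(np.asarray(x_fft)) ** 2     # periodogram P(X) = |X|^2
    pq = np.abs(np.asarray(q_fft)) ** 2
    p_plus = np.argsort(px)[-k:]            # positions of the k largest powers
    mask = np.zeros_like(px)
    mask[p_plus] = 1.0
    return float(np.linalg.norm(pq * mask - px * mask))

# The worked example above: P(X) = (5, 8, 2, 26), p+ keeps positions 2 and 4,
# so pDist = ||(0, 2, 0, 5) - (0, 8, 0, 26)|| = sqrt(36 + 441) = sqrt(477)
X = np.array([1 + 2j, 2 + 2j, 1 + 1j, 5 + 1j])
Q = np.array([2 + 2j, 1 + 1j, 3 + 1j, 1 + 2j])
d = pdist_measure(X, Q, k=2)
```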
[0086] Alternatively, if one does not want to provide a parameter k,
one could extract from a sequence those periodic features that
retain e% of the signal energy, or use the periodicity-detection
algorithm to extract the most important periods of a
sequence.
[0087] In order to meaningfully compare the power content of two
sequences, one should preferably normalize them so that they
contain the same amount of total energy. One can assign unit power
to any sequence x(n) by performing the following normalization:

    x̂(n) = ( x(n) − μ ) / sqrt( Σ_(i=1)^N ( x(i) − μ )² ),
    where μ = (1/N) Σ_(i=1)^N x(i)
[0088] The above transformation will lead to a zero mean value and
a sum of squared values equal to 1. Parseval's theorem dictates that
the energy in the time domain equals the energy in the frequency
domain; therefore the total energy in the frequency domain should
also be one:

    ‖x̂‖² = ‖F(x̂)‖² = 1
[0089] After this normalization, one can more meaningfully compare
the periodogram energies.
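A minimal sketch of this normalization, together with a numerical check of the Parseval identity (under the standard DFT convention, in which the frequency-domain energy is divided by N):

```python
import numpy as np

def unit_power(x):
    """Zero-mean, unit-energy normalization from the formula above:
    subtract the mean, then divide by the root sum of squares."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return centered / np.sqrt(np.sum(centered ** 2))

# Illustrative check on a random sequence: after normalization the
# time-domain energy is 1, and by Parseval so is the frequency-domain
# energy (with the 1/N factor of the unnormalized DFT).
x = unit_power(np.random.default_rng(0).standard_normal(128))
time_energy = np.sum(x ** 2)
freq_energy = np.sum(np.abs(np.fft.fft(x)) ** 2) / len(x)
```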
[0090] Presented herein are the results of extensive
experimentation that show the usefulness of the new periodic
measures; the measures are compared with widely used shape-based
measures and newly introduced structural distance
measures.
[0091] Using 16 sequences which record the yearly demand of several
keywords at the MSN search engine, one may preferably perform the
hierarchical clustering which is shown in FIG. 10. In the
dendrogram derived using the pDist as the distance function, one
can notice a distinct separation of the sequences/keywords into 3
classes. The first class contains no clear periodicities (no
specific pattern in the demand of the query), while the second one
exhibits only bursty seasonal trends (e.g., during Christmas). The
final category contains queries requested with high frequency
(weekly period); here one can find keywords such as `cinema`,
`bank`, `Bush`, etc.
[0092] One can utilize an extended portion of the same dataset for
exploring the visualization power of periodic distance measures.
Using the pairwise distance matrix between a set of MSN keyword
demand sequences (365 values, year 2002), there is evaluated a 2D
mapping of the keywords using Multidimensional Scaling (FIG. 11).
The derived mapping shows the high discriminatory efficacy of the
pDist measure; seasonal trends (low frequencies) are disjoint from
periodic patterns (high frequencies), allowing for a more
structural sequence exploration. Keywords like `fall`, `Christmas`,
`lord of the rings`, `Elvis`, etc., manifest mainly seasonal
bursts, which need not be aligned in the time axis. On the
contrary, queries like `dry cleaners` or `Friday` indicate a
natural weekly repeated demand. Finally, some queries do not
exhibit any obvious periodicities within a year's time (e.g.,
`icdm`, `kdd`, etc).
[0093] For a second experiment there is employed a combination of
periodic time series that are collected from natural sciences,
medicine and manufacturing, augmented by pairs of random noise and
random walk data.
[0094] All datasets come in pairs; hence, when performing a
hierarchical clustering algorithm on this dataset, one expects to
find a direct linkage of each sequence pair at the lowest level of
the dendrogram. If this happens, one may consider the clustering to
be correct. The dataset is made up of 12 pairs, therefore a measure
of the clustering accuracy can be the number of correct pair
linkages over twelve, the total number of pairs.
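One hedged way to compute such an accuracy figure, approximating "direct linkage at the lowest dendrogram level" by requiring the two members of a pair to be mutual nearest neighbors under the distance measure (the function name and this approximation are illustrative, not from the original):

```python
import numpy as np

def pair_linkage_accuracy(dist, pairs):
    """Fraction of ground-truth pairs whose members are mutual
    nearest neighbors under the pairwise distance matrix `dist`,
    i.e. would link directly at the bottom of the dendrogram."""
    n = dist.shape[0]
    # nearest neighbor of each item (excluding itself)
    nn = [min((j for j in range(n) if j != i), key=lambda j: dist[i, j])
          for i in range(n)]
    correct = sum(1 for a, b in pairs if nn[a] == b and nn[b] == a)
    return correct / len(pairs)

# Illustrative usage: 4 items forming the pairs (0,1) and (2,3);
# within-pair distances are smaller than any cross-pair distance.
d = np.array([[0, 1, 5, 6],
              [1, 0, 7, 8],
              [5, 7, 0, 2],
              [6, 8, 2, 0]], dtype=float)
acc = pair_linkage_accuracy(d, [(0, 1), (2, 3)])
```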
[0095] FIG. 12 displays the resulting dendrogram for the pDist
measure, which achieves a perfect clustering. One can also observe
that pairs derived from the same source/process are clustered
together as well, in the higher dendrogram level (Power Demand,
ECG, MotorCurrent etc). After the clustering, one can execute the
AUTOPERIOD method and annotate the dendrogram with the important
periods of every sequence. Some sequences, like the random walk or
the random data, do not contain any periodicities, indicated with
an empty set { }. When both sequences at the lower level display
the same periodicity, a single set is displayed on the bifurcation
for clarity.
[0096] For many datasets that came in 2 pairs (power demand,
video surveillance, motor current), all 4 instances demonstrated
the same basic period (as suggested by AUTOPERIOD). However,
the periodic measure can effectively separate them into two pairs,
because the power content of the respective frequencies was
different.
[0097] The last experiment is performed on the MIT-BIH Arrhythmia
dataset. There are employed two sets of sequences; one with 2
classes of heartbeats and another one with three (FIGS. 13, 14).
There is presented the dendrogram of the pDist measure and the DTW,
which is possibly one of the best shape-based distance
measures. To tune the single parameter of the DTW (corresponding to
the maximum warping length), there were probed several values, and
the one that returned the best clustering is reported
here.
[0098] For both dataset instances, pDist again returns an accurate
clustering, while DTW seems to perform badly on the high level
dendrogram aggregations, hence not leading to perfect class
separation. The Euclidean distance reported worse results. The CDM
measure is accurate on the 2-class separation test but does not
provide a perfect separation for the 3-class problem (see the
original paper [1] for the respective results).
[0099] By way of recapitulation, there have been presented herein
various methods for accurate periodicity estimation and for the
characterization of structural periodic similarity between
sequences. It is believed that these methods will find many
applications for interactive exploration of time-series databases
and for classification or anomaly detection of periodic sequences
(e.g., in auto manufacturing, biometrics and medical
diagnosis).
[0100] It is to be understood that the present invention, in
accordance with at least one presently preferred embodiment,
includes an arrangement for accepting time series data, an
arrangement for ascertaining structural features relating to the
time series data, and an arrangement for determining at least one
distance between different time series and/or an arrangement for
returning the k closest matches to a given time series query.
Together, these elements may be implemented on at least one
general-purpose computer running suitable software programs. They
may also be implemented on at least one integrated circuit or part
of at least one integrated circuit. Thus, it is to be understood
that the invention may be implemented in hardware, software, or a
combination of both.
[0101] If not otherwise stated herein, it is to be assumed that all
patents, patent applications, patent publications and other
publications (including web-based publications) mentioned and cited
herein are hereby fully incorporated by reference herein as if set
forth in their entirety herein.
[0102] Although illustrative embodiments of the present invention
have been described herein with reference to the accompanying
drawings, it is to be understood that the invention is not limited
to those precise embodiments, and that various other changes and
modifications may be effected therein by one skilled in the art
without departing from the scope or spirit of the invention.
REFERENCES
[0103] [1] E. Keogh, S. Lonardi, and A. Ratanamahatana. Towards
Parameter-Free Data Mining. In Proc. of SIGKDD, 2004.
[0104] [2] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos.
Identification of Similarities, Periodicities & Bursts for
Online Search Queries. In Proc. of SIGMOD, 2004.
* * * * *