U.S. patent application number 13/491425, for methods and systems for statistical aberrant behavior detection of time-series data, was filed with the patent office on 2012-06-07 and published on 2013-12-12.
This patent application is currently assigned to Verisign, Inc. The applicants listed for this patent are Sylvain Luiset and Matthew Thomas. Invention is credited to Sylvain Luiset and Matthew Thomas.
Application Number | 20130332109 13/491425 |
Document ID | / |
Family ID | 48703129 |
Filed Date | 2012-06-07 |
United States Patent
Application |
20130332109 |
Kind Code |
A1 |
Luiset; Sylvain ; et
al. |
December 12, 2013 |
METHODS AND SYSTEMS FOR STATISTICAL ABERRANT BEHAVIOR DETECTION OF
TIME-SERIES DATA
Abstract
Methods and systems for detecting aberrant behavior in
time-series observation data, such as non-existent domain data, are
disclosed. The methods and systems analyze the time-series
observation data to determine time-series prediction data. The
time-series observation data and time-series prediction data are
used to determine a threshold that is based on the standard
deviation of deviation values between the time-series observation
data and time-series prediction data. The threshold may be used to
detect aberrant behavior in subsequently obtained time-series
observation data.
Inventors: |
Luiset; Sylvain; (Fribourg,
CH) ; Thomas; Matthew; (Atlanta, GA) |
|
Applicant: |
Name | City | State | Country | Type |
Luiset; Sylvain | Fribourg | | CH | |
Thomas; Matthew | Atlanta | GA | US | |
Assignee: |
Verisign, Inc.
|
Family ID: |
48703129 |
Appl. No.: |
13/491425 |
Filed: |
June 7, 2012 |
Current U.S.
Class: |
702/179 |
Current CPC
Class: |
H04L 63/16 20130101;
H04L 63/14 20130101; H04L 63/1441 20130101; H04L 61/103 20130101;
H04L 2463/144 20130101; H04L 61/1511 20130101; H04L 63/1408
20130101; H04L 63/145 20130101 |
Class at
Publication: |
702/179 |
International
Class: |
G06F 17/18 20060101
G06F017/18 |
Claims
1. A computer-implemented method for detecting aberrant behavior in
time-series data, comprising: obtaining first time-series
observation data; determining time-series prediction data
representative of a predicted trend of the first observation data;
determining a standard deviation value representative of a
deviation between the first observation data and the prediction
data; determining a threshold based, at least in part, on the
standard deviation value; and detecting aberrant behavior in second
time-series observation data based, at least in part, on the
threshold.
2. The method of claim 1, wherein the first observation data is
representative of queries for non-existent domains.
3. The method of claim 2, wherein the aberrant behavior detection
provides an indication of botnet activity.
4. The method of claim 1, wherein the step of determining the
prediction data further comprises applying an exponential smoothing
technique to the first observation data.
5. The method of claim 1, wherein the step of determining the
standard deviation further comprises: determining time-series
deviation data between the first observation data and the
prediction data; determining the mean of the deviation data; and
determining the standard deviation by analyzing the deviation data
and the mean.
6. The method of claim 5, wherein the step of determining the
deviation data further comprises applying an exponential smoothing
technique to differences between the first observation data and the
prediction data.
7. The method of claim 1, wherein the step of detecting aberrant
behavior in the second observation data further comprises:
determining second time-series prediction data representative of a
predicted trend of the second observation data; determining second
time-series deviation data between the second observation data and
the second prediction data; and comparing one or more values of the
second deviation data with the threshold.
8. The method of claim 7, wherein the step of detecting aberrant
behavior in the second observation data further comprises
determining that a predetermined number of values of the second
deviation data exceed the threshold.
9. The method of claim 1, further comprising: replacing the
threshold with an updated threshold based, at least in part, on the
second observation data.
10. The method of claim 9, further comprising: detecting aberrant
behavior in third time-series observation data based, at least in
part, on the updated threshold.
11. A system for detecting aberrant behavior in time-series data,
comprising: a processor; a memory; program code stored on the
memory, which, when executed by the processor, causes the system to
perform the steps of: obtaining first time-series observation data;
determining time-series prediction data representative of a
predicted trend of the first observation data; determining a
standard deviation value representative of a deviation between the
first observation data and the prediction data; determining a
threshold based, at least in part, on the standard deviation value;
and detecting aberrant behavior in second time-series observation
data based, at least in part, on the threshold.
12. The system of claim 11, wherein the first observation data is
representative of queries for non-existent domains.
13. The system of claim 12, wherein the aberrant behavior detection
provides an indication of botnet activity.
14. The system of claim 11, wherein the step of determining the
prediction data further comprises applying an exponential smoothing
technique to the first observation data.
15. The system of claim 11, wherein the step of determining the
standard deviation further comprises: determining time-series
deviation data between the first observation data and the
prediction data; determining the mean of the deviation data; and
determining the standard deviation by analyzing the deviation data
and the mean.
16. The system of claim 15, wherein the step of determining the
deviation data further comprises applying an exponential smoothing
technique to differences between the first observation data and the
prediction data.
17. The system of claim 11, wherein the step of detecting aberrant
behavior in the second observation data further comprises:
determining second time-series prediction data representative of a
predicted trend of the second observation data; determining second
time-series deviation data between the second observation data and
the second prediction data; and comparing one or more values of the
second deviation data with the threshold.
18. The system of claim 17, wherein the step of detecting aberrant
behavior in the second observation data further comprises
determining that a predetermined number of values of the second
deviation data exceed the threshold.
19. The system of claim 11, further comprising program code stored
on the memory, which, when executed by the processor, causes the
system to perform the step of: replacing the threshold with an
updated threshold based, at least in part, on the second
observation data.
20. The system of claim 19, further comprising program code stored
on the memory, which, when executed by the processor, causes the
system to perform the step of: detecting aberrant behavior in third
time-series observation data based, at least in part, on the
updated threshold.
Description
TECHNICAL FIELD
[0001] The present disclosure relates to the field of data analysis
and, more particularly, methods and systems for aberrant behavior
detection for time-series data.
BACKGROUND
[0002] It is often desirable to analyze time-series data for
anomalies. For example, time-series data may be analyzed to monitor
stock exchange data or data recorded in logs reflecting traffic
through firewalls or telephone systems. Such analysis may also be
used in detection of "malware." Malware, short for "malicious
software," is software that is designed for hostile or intrusive
purposes. For example, malware may be designed with the intent of
gathering confidential information, denying or disrupting
operations, accessing resources without authorization, and other
abusive purposes. Types of malware include, for example, computer
viruses, worms, Trojan horses, spyware, adware, and botnets.
Malware developers typically distribute their software via the
Internet, often clandestinely. As Internet use continues to grow
around the world, malware developers have more incentives for
releasing such software.
[0003] Botnets are one example of malware that has become a major
security threat in recent years. A botnet is a network of
"innocent" host computers that have been infected with malicious
software in such a way that a remote attacker is able to control
the host computers. The malicious software used to infect the host
computers is referred to as a "bot," which is short for "robot."
Botnets operate under a command and control (C&C) architecture,
where a remote attacker is able to control the infected computers,
often referred to as "zombie" computers. An attacker may control
the infected computers to carry out online anti-social or criminal
activities, such as e-mail spam, click fraud, distributed
denial-of-service attacks (DDoS), or identity theft.
[0004] FIG. 1 illustrates an exemplary C&C architecture of a
botnet 100. The botnet master 101, often referred to as a
"botmaster" or "bot herder," distributes malicious bot software,
typically over the Internet 102. This bot software stores an
indication of a future time and of domain names to contact at the
indicated future time. The bot software infects a number of host
computers 103 causing them to become compromised. Users of host
computers 103 typically do not know that the bot software is
running on their computers. Botnet master 101 also registers
temporary domain names to be used as C&C servers 104. Then, at
the indicated future time, the bot software instructs host computers
103 to contact C&C servers 104 to get instructions. The
instructions are sent over a C&C channel via the Internet 102.
The ability to send instructions to host computers 103 provides
botnet master 101 with control over a large number of host
computers. This enables botnet master 101 to generate huge volumes
of network traffic, which can be used for e-mailing spam messages,
shutting down or slowing web sites through DDoS attacks, or other
purposes.
[0005] Botnets exploit the domain name system (DNS) to rally
infected host computers. The DNS allows people using the Internet
to refer to domain names, rather than Internet Protocol (IP)
addresses, when accessing websites and other online services.
Domain names, which employ text characters, such as letters,
numbers, and hyphens (e.g., "www.example.com"), will often be
easier to remember than IP addresses, which are numerical and do
not contain letters or hyphens (e.g., "128.1.0.0"). In addition, a
domain name may be registered before an IP address has been
acquired. The DNS is the Internet's hierarchical lookup service for
mapping character-based domain names meaningful to humans into
numerical IP addresses.
[0006] Botnets exploit the DNS by registering domain names to be
temporarily used as C&C servers 104. However, a botnet master
will often distribute bot software before registering the domains
indicated in the bot software. By the time bot software instructs
host computers 103 to contact C&C servers 104, botnet master
101 will often have registered only a subset of the domains
indicated in the bot software. Thus, when bot software instructs
host computers 103 to contact C&C servers 104, host computers
103 will often attempt to contact a number of unregistered
domains.
[0007] Legitimate internet user activity will include a mixture of
requests for existent domains (YXDs) and non-existent domains
(NXDs). In addition, legitimate internet user activity will have a
periodic nature such that activity is, on average, higher at some
predictable times and lower at other predictable times (e.g., an
internet user may be more active during the day than during the
night, and may be more active during weekdays than during
weekends). Because of the periodic nature of a typical internet
user's activity, an examination of NXD data will often reveal a
predictable pattern over one or more periods of time.
[0008] Illegitimate internet use, such as by host computers 103 in
botnet 100, will also include a mixture of requests for YXDs and
NXDs. However, because a botnet master 101 will typically only
register a small subset of the domain names that it provides in the
bot software, after host computers 103 attempt to access the
C&C servers 104 a spike in the overall quantity of NXDs will
arise that deviates from the predictable periodic nature of
legitimate internet user activity.
SUMMARY
[0009] In one disclosed embodiment, a computer-implemented method
for detecting aberrant behavior in time-series data is performed.
The method includes obtaining first time-series observation data.
The method further includes determining time-series prediction data
representative of a predicted trend of the first observation data.
The method further includes determining a standard deviation value
representative of a deviation between the first observation data
and the prediction data. The method further includes determining a
threshold based, at least in part, on the standard deviation value.
The method further includes detecting aberrant behavior in second
time-series observation data based, at least in part, on the
threshold.
[0010] In another disclosed embodiment, a system for detecting
aberrant behavior in time-series data is provided. The system
includes a processor, a memory, and program code stored on the
memory, which, when executed by the processor, causes the system to
obtain first time-series observation data, determine time-series
prediction data representative of a predicted trend of the first
observation data, determine a standard deviation value
representative of a deviation between the first observation data
and the prediction data, determine a threshold based, at least in
part, on the standard deviation value, and detect aberrant behavior
in second time-series observation data based, at least in part, on
the threshold.
[0011] Additional aspects related to the embodiments will be set
forth in part in the description which follows, and in part will be
obvious from the description, or may be learned by practice of the
invention.
[0012] It is to be understood that both the foregoing general
description and the following detailed description are exemplary
and explanatory only and are not restrictive of the invention, as
claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] FIG. 1 illustrates a command and control network
architecture of a botnet.
[0014] FIG. 2 illustrates an exemplary system that may be used for
implementing the disclosed embodiments.
[0015] FIG. 3 illustrates an exemplary system that may be used for
implementing the disclosed embodiments.
[0016] FIG. 4 illustrates an exemplary method for determining an
adaptive threshold.
[0017] FIG. 5 illustrates an exemplary method for determining that
time-series data exhibits aberrant behavior.
DETAILED DESCRIPTION
[0018] Reference will now be made in detail to the exemplary
embodiments, examples of which are illustrated in the accompanying
drawings. Wherever possible, the same reference numbers will be
used throughout the drawings to refer to the same or like
parts.
[0019] FIG. 2 is a diagram illustrating an exemplary computer
system 200 that may be used for implementing the disclosed
embodiments.
[0020] Computer system 200 may include one or more computers 210,
which may be servers, personal computers, and/or other types of
computing devices. Computer 210 may include, among other things,
one or more of the following components: a central processing unit
(CPU) 201 configured to execute computer program code to perform
various processes and methods, including the embodiments herein
described; tangible non-transitory computer-readable memory such as
random access memory (RAM) 202 and read only memory (ROM) 203
configured to access and store information and computer program
code; memory 204 to store data and information; database 205 to
store tables, lists, or other data structures; I/O devices 206;
interfaces 207; and antennas 208. Each of these components is
well-known in the art and will not be discussed further.
[0021] FIG. 3 illustrates an exemplary DNS traffic analyzing system
300.
[0022] System 300 may include a traffic processor 304, which may be
a CPU 201, a computer 210, or any other device capable of
processing data. Traffic processor 304 may obtain time-series data
such as DNS lookup data from a database 302. Database 302 may be
associated with DNS servers, and database 302 may contain DNS
lookup data concerning DNS queries. For example, the DNS lookup
data may include time-series data regarding queries for
non-existent domains (NXDs).
[0023] In some embodiments, database 302 may be a round-robin
database. In a round-robin database, several layers may exist, such
that data from one layer may be aggregated and archived in another
layer. A round-robin database allows for the total storage size to
be limited. In addition, a round-robin database allows for analysis of the
DNS lookup data to be performed at different levels of abstraction
based on which archive layer is selected for analysis. In
alternative embodiments, a traditional database may also be used to
store the DNS lookup data.
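The layered archive idea can be sketched as follows. This is a minimal illustration only, not the storage engine the disclosure relies on: the class name `RoundRobinArchive`, its parameters, and the choice of summing fine-grained values into each coarser point are all assumptions made for the example.

```python
from collections import deque


class RoundRobinArchive:
    """Hypothetical fixed-size archive layer (names are illustrative only)."""

    def __init__(self, capacity, steps_per_point):
        # A deque with maxlen discards the oldest point once the layer is
        # full, which keeps the total storage size bounded.
        self.points = deque(maxlen=capacity)
        self.steps_per_point = steps_per_point
        self._pending = []

    def add(self, value):
        # Collect raw values until enough have arrived to form one point.
        self._pending.append(value)
        if len(self._pending) == self.steps_per_point:
            # Assumption: counts are consolidated by summation.
            self.points.append(sum(self._pending))
            self._pending = []


fine = RoundRobinArchive(capacity=4, steps_per_point=1)
coarse = RoundRobinArchive(capacity=2, steps_per_point=2)
for count in [1, 2, 3, 4, 5, 6]:
    fine.add(count)
    coarse.add(count)
```

Selecting `fine` or `coarse` for analysis corresponds to choosing the level of abstraction at which the lookup data is examined.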
[0024] Based on the DNS lookup data, traffic processor 304 may
determine an adaptive threshold 306. The adaptive threshold 306 may
be used by traffic processor 304 to generate aberrant data 308.
Aberrant data 308 may provide an indication that portions of the
DNS lookup data exhibit aberrant and/or non-aberrant behavior.
[0025] FIG. 4 illustrates an exemplary method 400 for determining
an adaptive threshold.
[0026] In method 400, observation values are obtained over a
period, such as a seasonal period (step 402). In some embodiments,
the seasonal period is representative of an amount of time over
which one approximately repeating cycle occurs. For example, in
embodiments where the observation values are obtained from DNS
lookup data that includes time-series data regarding queries for
NXDs, a seasonal period may be one day or one week in order to
account for the approximately repeating behavior of legitimate
internet user activity over the course of a day or week.
[0027] In some embodiments, the observation values are obtained
from DNS lookup data that includes time-series data regarding
queries for NXDs. In such embodiments, an observation value at time
t may represent the number of queries for NXDs that occurred
between time t-1 and time t. In embodiments where a round-robin
database is used to store DNS lookup data, the time between
observation values will depend upon the archive layer selected for
analysis. For example, if the selected seasonal period is one week,
an archive layer may be selected such that the time between
observation values is one hour; in contrast, if the selected
seasonal period is one day, an archive layer may be selected such
that the time between observation values is one minute.
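As a rough sketch of how observation values might be formed from raw query records, the function below counts NXD query timestamps per fixed-width interval. The function name, the use of numeric timestamps in seconds, and the half-open interval convention are assumptions for illustration.

```python
def count_per_interval(timestamps, start, interval_seconds, num_intervals):
    """Count events falling in each of num_intervals consecutive intervals.

    Interval i covers [start + i * interval_seconds,
                       start + (i + 1) * interval_seconds).
    """
    counts = [0] * num_intervals
    for ts in timestamps:
        index = int((ts - start) // interval_seconds)
        # Timestamps outside the analyzed window are ignored.
        if 0 <= index < num_intervals:
            counts[index] += 1
    return counts


# Six hypothetical NXD query times (seconds), bucketed into one-minute steps.
observations = count_per_interval([0, 30, 59, 60, 125, 400], 0, 60, 3)
```

Changing `interval_seconds` from 60 to 3600 mirrors the choice between a one-minute and a one-hour spacing of observation values.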
[0028] The observation values may also be grouped based on other
factors, such as geographic location or time zone, that would
increase the predictability of legitimate internet activity. For
example, two internet users within the same time zone are more
likely to be using the internet at the same time than two internet
users that are not within the same time zone. By grouping
observation values in this way, deviations from legitimate internet
activity may be more apparent.
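One possible way to group observation values is to keep a separate count series per group key. In this sketch the group labels (such as "UTC-5") and the record layout of (group, interval index) pairs are hypothetical.

```python
from collections import defaultdict


def group_counts(records, num_intervals):
    """Build one observation series per group from (group, index) records."""
    series = defaultdict(lambda: [0] * num_intervals)
    for group, interval_index in records:
        # Records outside the analyzed window are ignored.
        if 0 <= interval_index < num_intervals:
            series[group][interval_index] += 1
    return dict(series)


grouped = group_counts(
    [("UTC-5", 0), ("UTC-5", 0), ("UTC+1", 1), ("UTC-5", 2)], 3
)
```

Each resulting series could then be analyzed with its own predictions and threshold.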
[0029] The observation values are used to determine prediction
values over the seasonal period (step 404). An initial prediction
value will be set for time 1. For example, the initial prediction
value may be set as being equal to the observation value at time 1.
Subsequent prediction values may be determined using an exponential
smoothing technique. For example, the prediction at time t may be
calculated by determining a weighted average of the observation at
time t-1 and the prediction at time t-1.
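The prediction step can be sketched with simple exponential smoothing. The smoothing weight `alpha` and its default value are assumptions; the disclosure does not state a particular weight, and other exponential smoothing variants would also fit the description.

```python
def predict_series(observations, alpha=0.5):
    """Return a prediction for each time step of the observation series.

    alpha is a hypothetical smoothing weight between 0 and 1.
    """
    if not observations:
        return []
    # The initial prediction is set equal to the first observation.
    predictions = [float(observations[0])]
    for t in range(1, len(observations)):
        # Weighted average of the previous observation and the
        # previous prediction.
        value = alpha * observations[t - 1] + (1 - alpha) * predictions[t - 1]
        predictions.append(value)
    return predictions


predictions = predict_series([10, 12, 20, 18], alpha=0.5)
# predictions == [10.0, 10.0, 11.0, 15.5]
```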
[0030] Once observation and prediction values are obtained over a
seasonal period (i.e., observation values are obtained for time 1
to t and prediction values are obtained for time 1 to t), the
method 400 may determine deviation values over the seasonal period
(step 406). A deviation value at a given time t may be calculated
by applying an exponential smoothing technique (e.g., a
Holt-Winters technique) to the difference between the observation
value at the given time t and the prediction value at the given
time t. For example, the deviation value at a given time t may be
calculated using the formula
$g_t = \epsilon\,|y_t - \hat{y}_t| + (1-\epsilon)\,g_{t-1}$, where
$g_t$ is the deviation at time t, $\hat{y}_t$ is the prediction at
time t, $y_t$ is the observation at time t, and $\epsilon$ is a
weighting parameter.
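The deviation recurrence can be sketched as below. The starting value for the deviation before the first time step (`initial`, defaulting to zero) and the default weight are assumptions, since the text does not specify them.

```python
def smoothed_deviations(observations, predictions, epsilon=0.5, initial=0.0):
    """Exponentially smooth the absolute observation/prediction differences."""
    if len(observations) != len(predictions):
        raise ValueError("observations and predictions must be equal length")
    deviations = []
    previous = initial  # assumed deviation before the first time step
    for y, y_hat in zip(observations, predictions):
        # New deviation: weighted mix of the current absolute error and
        # the previous deviation.
        g = epsilon * abs(y - y_hat) + (1 - epsilon) * previous
        deviations.append(g)
        previous = g
    return deviations


deviations = smoothed_deviations([10, 12, 20, 18], [10.0, 10.0, 11.0, 15.5])
# deviations == [0.0, 1.0, 5.0, 3.75]
```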
[0031] Once deviation values are obtained over a seasonal period
(i.e., deviation values are obtained for time 1 to t), the method
400 may determine the mean of the deviation values over the
seasonal period (step 408). Then, using the deviation values
determined in step 406 and the mean determined in step 408, the
method 400 may determine the standard deviation of the deviation
values over the seasonal period (step 410).
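Steps 408 and 410 might look like the following in Python. Using the population form of the standard deviation (dividing by n rather than n-1) is an assumption; the text does not say which form is intended.

```python
import math


def mean_and_std(deviations):
    """Return the mean and population standard deviation of the values."""
    n = len(deviations)
    if n == 0:
        raise ValueError("at least one deviation value is required")
    mean = sum(deviations) / n
    # Average squared distance from the mean, then the square root.
    variance = sum((g - mean) ** 2 for g in deviations) / n
    return mean, math.sqrt(variance)


mean, std = mean_and_std([2, 4, 4, 4, 5, 5, 7, 9])
# mean == 5.0, std == 2.0
```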
[0032] The standard deviation may be used to derive an adaptive
threshold value (step 412). In some embodiments, the adaptive
threshold will be determined by first determining the percentage of
deviation values that are to be considered non-aberrant. Using
Chebyshev's inequality, the number of standard deviations to
satisfy the desired percentage may be calculated. Chebyshev's
inequality provides that, for a random variable X with mean $\mu$
and standard deviation $\sigma$, the probability of $|X-\mu|$ being
less than k times $\sigma$ will be greater than or equal to $1-1/k^2$,
for any value k > 0. Thus, the adaptive threshold may be set to
equal k multiplied by the standard deviation, where k is determined
by finding a value for k that would cause $1-1/k^2$ to equal,
or approximately equal, the desired percentage of non-aberrant
deviation values. For example, if at least 93.75% of deviation
values should be non-aberrant, the adaptive threshold may be set to
equal 4 times the standard deviation, since $1-1/4^2$ is equal
to 0.9375, or 93.75%.
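Solving 1 - 1/k^2 = p for k gives k = 1/sqrt(1 - p), which the sketch below uses to derive a threshold. The function names are hypothetical, and the fraction is expressed as a number between 0 and 1 rather than a percentage.

```python
import math


def chebyshev_k(non_aberrant_fraction):
    """Solve 1 - 1/k**2 = non_aberrant_fraction for k."""
    if not 0.0 <= non_aberrant_fraction < 1.0:
        raise ValueError("fraction must be in the range [0, 1)")
    return 1.0 / math.sqrt(1.0 - non_aberrant_fraction)


def adaptive_threshold(std_dev, non_aberrant_fraction):
    """Threshold equal to k multiplied by the standard deviation."""
    return chebyshev_k(non_aberrant_fraction) * std_dev


k = chebyshev_k(0.9375)                      # 4.0, as in the example above
threshold = adaptive_threshold(2.0, 0.9375)  # 4 * 2.0
```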
[0033] FIG. 5 illustrates an exemplary method for determining that
time-series data exhibits aberrant behavior by utilizing an
adaptive threshold.
[0034] In method 500, observation values are obtained (step 502).
The observation values that are obtained are used to determine
prediction values (step 504). The obtained observation values and
determined prediction values of method 500 have the same
characteristics as described above with regard to the obtained
observation values and determined prediction values of method 400.
However, whereas method 400 waits to obtain observation values over
a seasonal period, some embodiments of method 500 may be performed
in real time as observation values are being obtained.
[0035] Once observation and prediction values are obtained (though
not necessarily for an entire seasonal period), the method 500 may
determine deviation values (step 506). As described above with
regard to step 406 of method 400, a deviation value at a given
time t may be calculated by applying an exponential smoothing
technique.
[0036] Deviation values are then compared to the adaptive threshold
value (step 508). Based on this comparison, a determination will be
made as to whether the obtained observation values exhibit aberrant
behavior (step 510). For example, when a deviation value is greater
than the adaptive threshold value, a determination may be made that
the time-series data is exhibiting aberrant behavior.
Alternatively, when a deviation value is not greater than the
adaptive threshold value, a determination may be made that the
time-series data is not exhibiting aberrant behavior.
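A real-time variant of steps 502 through 510 could be sketched as a small stateful detector that processes one observation at a time. The class name, the default weights, and the zero starting deviation are assumptions for illustration.

```python
class StreamingDetector:
    """Hypothetical per-observation detector using a fixed threshold."""

    def __init__(self, threshold, alpha=0.5, epsilon=0.5):
        self.threshold = threshold
        self.alpha = alpha          # assumed prediction smoothing weight
        self.epsilon = epsilon      # assumed deviation smoothing weight
        self.prediction = None
        self.last_observation = None
        self.deviation = 0.0        # assumed starting deviation

    def update(self, observation):
        """Process one observation; return True if it looks aberrant."""
        if self.prediction is None:
            # The first prediction equals the first observation.
            self.prediction = float(observation)
        else:
            self.prediction = (self.alpha * self.last_observation
                               + (1 - self.alpha) * self.prediction)
        # Smooth the absolute error into the running deviation value.
        self.deviation = (self.epsilon * abs(observation - self.prediction)
                          + (1 - self.epsilon) * self.deviation)
        self.last_observation = observation
        return self.deviation > self.threshold


detector = StreamingDetector(threshold=4.0)
flags = [detector.update(y) for y in [10, 12, 20, 18]]
# flags == [False, False, True, False]
```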
[0037] When the time-series data is NXD data, an aberrant behavior
determination may provide an indication that botnet activity
exists. This is because, as discussed above, botnets cause a spike
in NXD data. Thus, a spike in NXD data that causes a deviation
value to exceed the threshold value provides an indication of
botnet activity.
[0038] In some embodiments, an additional requirement may be
imposed that a predetermined number of deviation values greater
than the adaptive threshold value be found before a determination
is made that the time-series data is exhibiting aberrant behavior.
By imposing such a requirement, fewer false-positive determinations
of aberrant behavior would be made, but fewer correct
determinations of aberrant behavior would also be made.
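The additional requirement can be sketched as a simple count. Whether the exceeding values must be consecutive is not stated in the text; this example assumes a total count over the supplied values, and the default of three is arbitrary.

```python
def is_aberrant(deviations, threshold, required_count=3):
    """True when at least required_count deviations exceed the threshold."""
    exceedances = sum(1 for g in deviations if g > threshold)
    return exceedances >= required_count


three_spikes = is_aberrant([1, 5, 6, 2, 7], threshold=4, required_count=3)
two_spikes = is_aberrant([1, 5, 6, 2, 3], threshold=4, required_count=3)
# three_spikes is True, two_spikes is False
```

Raising `required_count` trades sensitivity for fewer false positives, as the paragraph above describes.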
[0039] In some embodiments, a new threshold value, to be used with
a subsequent set of time-series data, may be calculated using the
new set of time-series data in the manner described above with
regard to method 400. In other words, in some embodiments, each
set of time-series data may be analyzed with the threshold
calculated from the previous set of time-series data and used to
calculate a new threshold. However, in some embodiments, the same
threshold value could be used for multiple sets of time-series data
before an updated threshold value is calculated.
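The rolling update can be sketched end to end as follows: each set of observations is checked against the threshold derived from the previous set and then used to derive the next threshold. All names, the default weights, the fixed multiplier `k`, the population standard deviation, and the reset of smoothing state at the start of each set are assumptions made for this example.

```python
import math


def deviations_for(observations, alpha=0.5, epsilon=0.5):
    """Smoothed deviations for one set, with state reset per set."""
    deviations, prediction, previous_obs, g = [], None, None, 0.0
    for y in observations:
        if prediction is None:
            prediction = float(y)
        else:
            prediction = alpha * previous_obs + (1 - alpha) * prediction
        g = epsilon * abs(y - prediction) + (1 - epsilon) * g
        deviations.append(g)
        previous_obs = y
    return deviations


def threshold_for(deviations, k=4.0):
    """k times the population standard deviation of the deviations."""
    mean = sum(deviations) / len(deviations)
    variance = sum((d - mean) ** 2 for d in deviations) / len(deviations)
    return k * math.sqrt(variance)


def analyze_periods(periods, initial_threshold, k=4.0):
    """Flag each set using the threshold computed from the previous set."""
    threshold = initial_threshold
    results = []
    for observations in periods:
        deviations = deviations_for(observations)
        results.append(any(d > threshold for d in deviations))
        # Replace the threshold with one based on the set just analyzed.
        threshold = threshold_for(deviations, k)
    return results


results = analyze_periods([[10, 12, 20, 18], [10, 10, 40, 10]],
                          initial_threshold=100.0)
# results == [False, True]
```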
[0040] Other embodiments will be apparent to those skilled in the
art from consideration of the specification and practice of the
invention disclosed herein. It is intended that the specification
and examples be considered as exemplary only, with a true scope and
spirit of the invention being indicated by the following claims.
Further, it should be understood that, as used herein, the
indefinite articles "a" and "an" mean "one or more" in open-ended
claims containing the transitional phrase "comprising,"
"including," and/or "having."
* * * * *