U.S. patent application number 15/256292 was filed with the patent office on 2016-09-02 for monitoring network entities via a central monitoring system.
The applicant listed for this patent is Go Daddy Operating Company, LLC. Invention is credited to Chris Boltz, Craig Condit, Jeff Mink, Daymion Reynolds.
Application Number | 20160373328 15/256292 |
Document ID | / |
Family ID | 51224269 |
Filed Date | 2016-09-02 |
United States Patent Application | 20160373328 |
Kind Code | A1 |
Reynolds; Daymion; et al. | December 22, 2016 |
MONITORING NETWORK ENTITIES VIA A CENTRAL MONITORING SYSTEM
Abstract
Systems and methods of the present invention provide for one or
more server computers configured to receive a plurality of data
published by a network entity and identify, within the data: the
network entity that published the data, a sample of one or more
metrics for the network entity and a sample type of each of the one
or more samples. The one or more server computers may further be
configured to calculate a network resource usage score, using the
one or more metrics according to one or more rules for each of the
sample types identified, for the sample.
Inventors: |
Reynolds; Daymion (Phoenix, AZ); Mink; Jeff (Tempe, AZ); Condit; Craig (Scottsdale, AZ); Boltz; Chris (Phoenix, AZ) |
Applicant: |
Name | City | State | Country | Type |
Go Daddy Operating Company, LLC | Scottsdale | AZ | US | |
Family ID: | 51224269 |
Appl. No.: | 15/256292 |
Filed: | September 2, 2016 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
13756316 | Jan 31, 2013 | 9438493 |
15256292 | | |
Current U.S. Class: | 1/1 |
Current CPC Class: | H04L 41/22 20130101; H04L 43/14 20130101; H04L 43/08 20130101; H04L 43/022 20130101; H04L 41/06 20130101 |
International Class: | H04L 12/26 20060101 H04L012/26; H04L 12/24 20060101 H04L012/24 |
Claims
1. A method, comprising: retrieving, by a server computer from a
data stream provided by a first server in a plurality of servers, a
first encoded file identifying the first server, a first raw
performance datum of the first server, and a first sample type of
the first raw performance datum; normalizing, by the server
computer, the first raw performance datum using the first sample
type of the first raw performance datum to generate a first
normalized performance datum; calculating, by the server computer,
according to at least one scoring rule for the first sample type, a
first score using the first normalized performance datum; and
responsive to a determination by the server computer that the first
score is outside a minimum or maximum boundary determined by the
first sample type, generating, by the server computer, a user
interface including a recommended action to resolve a cause of an
alert as defined for the first sample type.
2. The method of claim 1, further comprising: retrieving, by the
server computer from a second data stream provided by a second
server in the plurality of servers, a second encoded file
identifying the second server, a second raw performance datum of
the second server, and a second sample type of the second raw
performance datum; normalizing, by the server computer, the second
raw performance datum using the second sample type of the second
raw performance datum to generate a second normalized performance
datum; calculating, by the server computer, according to at least
one scoring rule for the second sample type, a second score using
the second normalized performance datum; and comparing, by the
server computer, the first score to the second score to determine
the recommended action.
3. The method of claim 1, wherein the first server has not run
calculations on the first raw performance datum.
4. The method of claim 1, wherein the plurality of servers include
a plurality of unrelated network nodes across one or more server
clusters.
5. The method of claim 1, wherein the first encoded file is
published via a message broker software utilizing a
publish/subscribe model, wherein the first server publishes the
first sample type of the first raw performance datum as a message
topic.
6. The method of claim 1, wherein the first encoded file at least
partially determines the alert.
7. The method of claim 1, wherein the first raw performance datum
includes a collection of data containing at least one measurement
of data for the first server over a specific period of time.
8. The method of claim 1, wherein the first sample type identifies
a purpose of the first server and the at least one scoring rule is
at least partially determined by the purpose of the first
server.
9. A method, comprising: retrieving, by a server computer from a
data stream provided by a first server in a plurality of servers, a
first encoded file identifying the first server and a first raw
performance datum of the first server; normalizing, by the server
computer, the first raw performance datum to generate a first
normalized performance datum; and generating, by the server
computer, a user interface including a recommended action to
resolve a cause of an alert using the first normalized performance
datum.
10. The method of claim 9, further comprising: retrieving, by the
server computer from a second data stream provided by a second
server in the plurality of servers, a second encoded file
identifying the second server and a second raw performance datum of
the second server; normalizing, by the server computer, the second
raw performance datum to generate a second normalized performance
datum; and wherein the recommended action is at least partially
determined by the second normalized performance datum.
11. The method of claim 9, wherein the first server has not run
calculations on the first raw performance datum.
12. The method of claim 9, wherein the plurality of servers include
a plurality of unrelated network nodes across one or more server
clusters.
13. The method of claim 9, wherein the first encoded file is
published via a message broker software utilizing a
publish/subscribe model, wherein the first server publishes a first
sample type of the first raw performance datum as a message
topic.
14. The method of claim 9, wherein the first raw performance datum
includes a collection of data containing at least one measurement
of data for the first server over a specific period of time.
15. A system, comprising: a server computer executing a single
instance of a network monitoring software communicatively coupled
to a network and configured to: retrieve, from a data stream
provided by a first server in a plurality of servers, a first
encoded file identifying the first server, a first raw performance
datum of the first server, and a first sample type of the first raw
performance datum; normalize the first raw performance datum using
the first sample type of the first raw performance datum to
generate a first normalized performance datum; calculate, according
to at least one scoring rule for the first sample type, a first
score using the first normalized performance datum; and responsive
to a determination by the server computer that the first score is
outside a minimum or maximum boundary determined by the first
sample type, generate a user interface including a recommended
action to resolve a cause of an alert as defined for the first
sample type.
16. The system of claim 15, wherein the server computer is further
configured to: retrieve, from a second data stream provided by a
second server in the plurality of servers, a second encoded file
identifying the second server, a second raw performance datum of
the second server, and a second sample type of the second raw
performance datum; normalize the second raw performance datum using
the second sample type of the second raw performance datum to
generate a second normalized performance datum; calculate,
according to at least one scoring rule for the second sample type,
a second score using the second normalized performance datum; and
compare the first score to the second score to determine the
recommended action.
17. The system of claim 15, wherein the first server has not run
calculations on the first raw performance datum.
18. The system of claim 15, wherein the plurality of servers
include a plurality of unrelated network nodes across one or more
server clusters.
19. The system of claim 15, wherein the first encoded file is
published via a message broker software utilizing a
publish/subscribe model, wherein the first server publishes the
first sample type of the first raw performance datum as a message
topic.
20. The system of claim 15, wherein the first raw performance datum
includes a collection of data containing at least one measurement
of data for the first server over a specific period of time.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of and claims priority to
U.S. patent application Ser. No. 13/756,316 entitled "MONITORING
NETWORK ENTITIES VIA A CENTRAL MONITORING SYSTEM" and filed on Jan.
31, 2013.
FIELD OF THE INVENTION
[0002] The present inventions generally relate to shared server
hosting and, more particularly, to systems and methods for a central
software running on a server computer to monitor data published by
one or more network entities reflecting their performance metrics,
and to calculate scores for the network entities based on these
metrics.
SUMMARY OF THE INVENTION
[0003] An example embodiment of a method of monitoring one or more
network entities using a central monitoring system may comprise the
steps of one or more server computers receiving a plurality of data
published by one or more network entities, and identifying, within
the plurality of data: the network entity that published the data,
one or more samples of one or more metrics for each of the one or
more network entities, and a sample type of each of the one or more
samples. Additional steps may include the one or more server
computers calculating one or more network resource usage scores,
using the one or more metrics and according to one or more rules
for the sample type identified, for each of the one or more
samples.
[0004] An example embodiment of a system for monitoring one or more
network entities using a central monitoring system may comprise one
or more server computers communicatively coupled to a network and
configured to: receive a plurality of data published by a network
entity and identify, within the data: the network entity that
published the data, a sample of one or more metrics for the network
entity and a sample type of each of the one or more samples. The
one or more server computers may further be configured to calculate
a network resource usage score, using the one or more metrics
according to one or more rules for each of the sample types
identified, for each of the one or more samples.
[0005] The above features and advantages of the present inventions
will be better understood from the following detailed description
taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] FIG. 1 is a flow diagram illustrating a possible embodiment
of a method for monitoring one or more network entities using a
central monitoring system.
[0007] FIG. 2 illustrates a possible embodiment of a system for
monitoring one or more network entities using a central monitoring
system.
[0008] FIG. 3 is a flow diagram illustrating a possible embodiment
of a method for monitoring one or more network entities using a
central monitoring system.
DETAILED DESCRIPTION
[0009] The present inventions will now be discussed in detail with
regard to the attached drawing figures, which were briefly
described above. In the following description, numerous specific
details are set forth illustrating the Applicant's best mode for
practicing the inventions and enabling one of ordinary skill in the
art to make and use the inventions. It will be obvious, however, to
one skilled in the art that the present inventions may be practiced
without many of these specific details. In other instances,
well-known machines, structures, and method steps have not been
described in particular detail in order to avoid unnecessarily
obscuring the present inventions. Unless otherwise indicated, like
parts and method steps are referred to with like reference
numerals.
[0010] A network is a collection of links and nodes (e.g., multiple
computers and/or other devices connected together) arranged so that
information may be passed from one part of the network to another
over multiple links and through various nodes. Examples of networks
include the Internet, the public switched telephone network, the
global Telex network, computer networks (e.g., an intranet, an
extranet, a local-area network, or a wide-area network), wired
networks, and wireless networks.
[0011] The Internet is a worldwide network of computers and
computer networks arranged to allow the easy and robust exchange of
information between people or organizations that make use of
network or computer resources (users). Hundreds of millions of
people around the world have access to computers connected to the
Internet via Internet Service Providers (ISPs). Content providers
(e.g., website owners or operators) place multimedia information
(e.g., text, graphics, audio, video, animation, and other forms of
data) at specific locations on the Internet referred to as
websites. Websites comprise a collection of connected, or otherwise
related, web pages. The combination of all the websites and their
corresponding web pages on the Internet is generally known as the
World Wide Web (WWW) or simply the Web.
[0012] A network administrator may desire to monitor several
computers, servers and/or other network nodes (nodes) within a
network environment. Presently existing systems and methods may
rely on each of the nodes to monitor its own performance metrics
and perform all necessary algorithms to generate its own
performance scores. Each node may also be responsible for
monitoring these scores and generating and triggering alerts if and
when the node determines a score is out of bounds of acceptable
parameters.
[0013] Applicant has determined, however, that presently-existing
monitoring systems and methods do not provide optimal means for
monitoring performance information and contributing to network
statistics generally for nodes, clusters, other network resources
and/or the users who use them (entities). In a network environment
with a large number of such entities, adding, removing, or
repurposing individual network resources requires changing the
system configuration for each of the nodes in the network
environment. Specifically, adding scores and/or changing how scores
are generated requires sending an update to each node that
monitors, calculates and/or uses that score. Likewise, alert
behavior for each entity must be individually configured in such a
monitoring system.
[0014] If an individual node is a part of a cluster of nodes that
act in conjunction with one another, but the node generates and
handles all of its own monitoring, scoring, alert and behavior
data, then generating scores that include data from all other
entities in the cluster or other node groupings is very
complicated, if even possible. Presently-existing monitoring
systems only generate and present scores and/or alerts on an
individual entity basis, then report the findings of each
individual entity. They do not compare data feeds provided by
various and diverse entities and/or nodes. In other words,
presently existing systems and methods have no ability to normalize
and/or compare the data feeds from other nodes in the network.
[0015] Applicant has therefore determined that optimal monitoring
systems and methods may improve on presently-existing monitoring
systems by moving the responsibility for generating scores and
alerts to a central monitoring system, possibly one or more
software modules running on one or more central nodes in the
network which are accessible by all nodes in the network. In such a
model, the central monitoring system may have access to raw data
containing network, cluster, user and/or individual entity metrics
published by all nodes in a grouping of nodes and accessible via
messages and/or data feeds. Using this raw data, scoring scripts
may be written to use the data from all entities to calculate
scores and alerts, where needed, for all network entities within
the network.
[0016] Changes to the network, therefore, such as adding, removing
or changing network nodes' configurations, may be implemented
without needing to reconfigure the one or more software modules
that calculate these scores and alerts. The software modules only
consume the raw data provided by each network entity, so scoring
rules need only be changed within the one or more software modules,
which then apply the changes to the raw data received from any of
the entities within the network.
[0017] In addition, information may be analyzed across diverse
nodes and normalized accordingly. As a non-limiting example,
presently existing systems and methods may calculate scores and
alerts where memory utilization of an entity is measured as bytes
if the node is running one operating system, but may generate
scores and alerts where memory utilization of a second entity is
measured in kilobytes if the node is running a second, different
operating system. In presently existing methods and systems, such
variations in the information make it difficult, if not
impossible, to compare the memory utilization metrics in an "apples
to apples" comparison.
[0018] Applicant has therefore determined that moving
responsibility for generating scores to a centralized monitoring
system may allow the monitoring system to normalize the data
received from the various entities to compare the values and
determine how the values relate to each other. For example, the
central monitoring system described herein may be configured to
recognize the difference between the two operating systems, convert
bytes to kilobytes or vice-versa, compare the data from diverse
entities, and generate scores and/or alerts for these diverse
entities.
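As a non-limiting sketch of the normalization described in the
preceding paragraphs, the following Python fragment converts memory
readings reported in different units to a common unit before comparing
them. The unit table, the function name normalize_memory, the entity
names and the assumption that one kilobyte equals 1024 bytes are
hypothetical and are not taken from the disclosure.

```python
# Hypothetical unit table: multiplier that converts each unit to bytes.
# Assumes 1 kilobyte = 1024 bytes; the disclosure does not specify this.
UNIT_TO_BYTES = {"bytes": 1, "kilobytes": 1024}


def normalize_memory(value, unit):
    """Return the memory reading expressed in bytes."""
    try:
        return value * UNIT_TO_BYTES[unit]
    except KeyError:
        raise ValueError(f"unknown memory unit: {unit!r}")


# Two entities report memory utilization in different units.
readings = {
    "node-a": (2_097_152, "bytes"),
    "node-b": (4096, "kilobytes"),
}

normalized = {name: normalize_memory(v, u) for name, (v, u) in readings.items()}

# After normalization the values can be compared directly.
busiest = max(normalized, key=normalized.get)
```

Once both readings share a unit, the comparison reduces to an ordinary
numeric comparison, which is the "apples to apples" property the
central monitoring system is meant to provide.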
[0019] Likewise, a central monitoring system offers greater
flexibility than presently existing systems and methods. As
described herein, the raw data received from the network entities
may comprise information about a type of entity or metric sample
and/or the purpose of a cluster of nodes. Because a score is
generated, or an alert takes action, based on the entity/sample
type and/or score category, one or more nodes do not need to be
configured for each entity.
[0020] Where presently existing systems and methods only comprise
information about single nodes within the cluster and must be
configured for each entity, a central monitoring system may use a
sample type within the raw data which contains information about
the purpose of the cluster to generate scores and alerts customized
for specific entities and/or the purpose of the cluster and/or
entities in different clusters. The score(s) and/or alert(s) may be
set up to provide special handling for only those entities that are
exceptional, so no flexibility is lost when monitoring and
generating scores and/or alerts for diverse entities or nodes.
[0021] Methods and Systems for Monitoring Network Resources
[0022] FIGS. 1 and 2 illustrate embodiments of a method and a
system, respectively, for monitoring one or more network entities
using a central monitoring system. The method embodiment may
comprise the steps of one or more server computers 205,
communicatively coupled to a network 200, receiving a plurality of
data 215 published by one or more network entities 210 (Step 100).
The server(s) 205 may then identify, within the data 215 from each
of the one or more network entities 210, the network entity 210
that published the data 215 (Step 110), one or more samples 220 of
one or more metrics 220 for each of the network entities 210 (Step
120) and a sample type 225 for each of the samples 220 (Step 130).
The server(s) 205 may then calculate one or more network resource
usage scores 235 for the sample 220 and/or entity 210 (Step 140).
The network resource usage scores 235 may be calculated using one
or more rules 230 applied to each of the one or more metrics 220
for each of the one or more network entities 210 and determined by
the identified sample type 225.
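Steps 100 through 140 may be sketched, purely for illustration, as
follows. The JSON encoding, the field names, the two rules and the
function name score_message are assumptions made for this example;
the disclosure does not prescribe a message format or any particular
scoring formula.

```python
import json

# Hypothetical scoring rules keyed by sample type (rules 230).
# Each rule maps the metrics of one sample to a score from 0 to 100.
RULES = {
    "cpu": lambda m: m["busy_percent"],
    "memory": lambda m: 100.0 * m["used_kb"] / m["total_kb"],
}


def score_message(raw):
    """Identify the entity, sample, and sample type, then score the sample."""
    message = json.loads(raw)             # Step 100: receive published data
    entity = message["entity"]            # Step 110: publishing entity
    metrics = message["metrics"]          # Step 120: sample of metrics
    sample_type = message["sample_type"]  # Step 130: sample type
    rule = RULES[sample_type]             # rule determined by sample type
    return entity, sample_type, rule(metrics)  # Step 140: calculate score


raw = json.dumps({
    "entity": "web-01",
    "sample_type": "memory",
    "metrics": {"used_kb": 3072, "total_kb": 4096},
})
result = score_message(raw)
```

Because the rule is looked up by sample type, supporting a new kind of
sample in this sketch only requires adding an entry to the rule table
on the central server.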
[0023] As seen in FIG. 2, a system for monitoring one or more
network entities 210 using a central monitoring system may comprise
one or more central monitoring software modules (monitor module(s))
245 running on the server(s) 205 communicatively coupled to the
network 200 and configured to analyze the plurality of data 215 and
create and/or apply score and alert rules 230.
[0024] The example embodiments illustrated herein place no
limitation on network 200 configuration or connectivity. Thus, as
non-limiting examples, the network 200 could comprise the Internet,
the public switched telephone network, the global Telex network,
computer networks (e.g., an intranet, an extranet, a local-area
network, or a wide-area network), wired networks, wireless
networks, or any combination thereof. System components may be
communicatively coupled to the network 200 via any method of
network connection known in the art or developed in the future
including, but not limited to, wired, wireless, modem, dial-up,
satellite, cable modem, Digital Subscriber Line (DSL), Asymmetric
Digital Subscriber Line (ADSL), Virtual Private Network (VPN),
Integrated Services Digital Network (ISDN), X.25, Ethernet, token
ring, Fiber Distributed Data Interface (FDDI), IP over Asynchronous
Transfer Mode (ATM), Infrared Data Association (IrDA),
WAN technologies (T1, Frame Relay), Point-to-Point Protocol over
Ethernet (PPPoE), and/or any combination thereof.
[0025] Any of the servers 205, 210 described herein may comprise
computer-readable storage media storing instructions that, when
executed by a microprocessor, cause the server(s) 205, 210 to
perform the steps for which they are configured. Such
computer-readable media may comprise any data storage medium
capable of storing instructions for execution by a computing
device. It may comprise, as non-limiting examples, magnetic,
optical, semiconductor, paper, or any other data storage media, a
database or other network storage device, hard disk drives,
portable disks, CD-ROM, DVD, RAM, ROM, flash memory, and/or
holographic data storage. The instructions may, as non-limiting
examples, comprise software and/or scripts stored in the
computer-readable media that may be stored locally in the server(s)
or, alternatively, in a highly-distributed format in a plurality of
computer-readable media accessible via the network 200, perhaps via
a grid or cloud-computing environment.
[0026] Such instructions may be implemented in the form of software
modules. Any software modules described herein may comprise a
self-contained software component that may interact with the larger
system and/or other modules. A module may comprise an individual
(or plurality of) file(s) and may execute a specific task within a
larger software and/or hardware system. As a non-limiting example,
a module may comprise any software and/or scripts running on one or
more server(s) containing instructions (perhaps stored in
computer-readable media accessible by the server computer's
computer processor) that, when executed by the computer processor,
cause the server computer 205, 210 to perform the steps for which
it is configured.
[0027] The software modules, as well as the data generated and
stored, are not required to be stored and/or executed on a single
computer or within a single software module. As a non-limiting
example, in some embodiments, data may be stored across various
applications on the central servers 205. This configuration may run
functions and/or software modules such as the scoring and/or alert
function(s) within the monitor module(s) 245 as service instances
on multiple servers 205, each server handling a fraction of the
load while ensuring that each instance of the service has access to
the same data.
[0028] For example, if the scoring and/or alert functions within
the monitor module(s) 245 are distributed among three servers 205,
each of the three servers 205 may process scores 235 and/or alerts
240 for a third of the clusters. A non-limiting example of such a
service may include the open-source project Memcache.
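One possible way to divide clusters among several servers 205 is to
assign each cluster by a stable hash of its identifier, as in the
sketch below. The server names, the cluster identifiers and the use of
a CRC32 hash are assumptions chosen for illustration. A hash spreads
clusters only approximately evenly, so each server handles roughly,
not exactly, a third of the load.

```python
import zlib

SERVERS = ["monitor-1", "monitor-2", "monitor-3"]  # hypothetical names


def server_for(cluster_id, servers=SERVERS):
    """Pick the server responsible for a cluster using a stable hash."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
    return servers[zlib.crc32(cluster_id.encode("utf-8")) % len(servers)]


clusters = [f"cluster-{i}" for i in range(9)]
assignment = {c: server_for(c) for c in clusters}
```

Every instance that runs the same function reaches the same
assignment without coordination, provided all instances share the
same server list.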
[0029] The plurality of data 215 received by the one or more server
computers may comprise a plurality of raw data 215 monitored,
generated and transmitted to and/or from one or more network
entities 210, possibly using software modules configured to do so.
The data 215 may be a data feed, possibly comprising network
messages, which include samples of network, cluster and/or
individual entity metrics 220. This data may have no intelligence
on its own, meaning it has not been analyzed or had scoring, alert
and/or any other algorithms performed on it. These samples 220 may
include one or more point-in-time samples of performance data 220,
including the type of sample obtained 225, plus enough information
to determine (at minimum): the entity identification 250 and the
timestamp that the sample 220 was obtained.
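A minimal sketch of one such sample, assuming a Python representation,
might look as follows. The class name Sample, its field names and the
example values are hypothetical; the disclosure requires only that the
sample type, the entity identification and the timestamp be
recoverable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Sample:
    """One point-in-time sample 220 as published by a network entity 210."""
    entity_id: str       # entity identification 250
    sample_type: str     # sample type 225
    timestamp: datetime  # when the sample was obtained
    metrics: dict = field(default_factory=dict)  # raw, unscored values


sample = Sample(
    entity_id="node-17",
    sample_type="cpu",
    timestamp=datetime(2013, 1, 31, 12, 0, tzinfo=timezone.utc),
    metrics={"busy_percent": 42.5},
)
```

The record carries only raw values, consistent with data that has had
no scoring or alert algorithms performed on it.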
[0030] The entity identification 250 for the entity 210 which
monitored, generated and transmitted the plurality of data 215 may
include, as non-limiting examples, an identification of a node if
the sample is associated with a specific node, an identification of
a plurality of related or unrelated nodes, and/or an identification
of a user and/or customer making use of these or other network 200
resources. Accordingly, the one or more network entities 210 may
comprise any combination of a computer, a server computer (possibly
a node acting as part of a cluster), a cluster of server computers
(possibly one or more nodes grouped together for a common
purpose), unrelated individual nodes across one or more clusters
with a common purpose, unrelated individual nodes across one or
more clusters that generate monitored data which may be grouped
together, unrelated individual nodes across one or more clusters
with a common operating system, metrics reflecting resources within
the network 200 used by a user, etc.
[0031] In some embodiments, the entity data 215 may be transmitted
to the one or more server computers 205 running the monitor
module(s) 245 via software using a publish/subscribe model, where
one side publishes a message and the other, if interested,
subscribes to the message flow. This means that each of the network
entities 210 may publish the data 215 which contains their
individual network performance metric samples 220 and the monitor
module(s) 245 may subscribe to the data 215 published by these
network entities 210.
[0032] In some embodiments, the publish/subscribe model may be
accomplished via a message brokering software such as Apache
ActiveMQ, as a non-limiting example. The message brokering software may
route, transform, aggregate and/or store messages between
applications based on message brokering rules specified in the
software. In embodiments which use such software, the message
brokering software may be integrated into the monitor module(s)
245. In other embodiments, the message brokering software may run
independent of the monitor module(s) 245 but provide the monitor
module(s) 245 access to the messages published/subscribed to.
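The publish/subscribe relationship may be illustrated with a toy
in-memory broker, shown below. It is a stand-in written for this
example and does not use or reproduce the API of any real message
brokering software; the class name Broker and the topic names are
hypothetical. Using the sample type as the message topic follows the
arrangement recited in the claims.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory stand-in for message brokering software."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback for every message published on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to each subscriber of the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
received = []

# The monitor module subscribes to a topic named after the sample type.
broker.subscribe("cpu", received.append)

# A network entity publishes a raw sample; it does no scoring itself.
broker.publish("cpu", {"entity": "node-17", "busy_percent": 42.5})
broker.publish("disk", {"entity": "node-17", "free_gb": 120})  # no subscriber
```

The publisher needs no knowledge of who consumes its messages, which
is what allows entities to be added or removed without reconfiguring
the monitor module(s).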
[0033] The monitor module(s) 245 may be developed, installed and/or
run on one or more servers 205 in the network 200. These modules
245 may then be configured to monitor one or more of the network
nodes/entities 210, "listen" for metrics sample 220 messages
published by any monitored network entities 210, receive the
metrics sample 220 messages when published and apply the
appropriate rules 230 to perform all necessary algorithms to
calculate scores 235 and/or alerts 240 behavior, described in
detail herein. The monitor module(s) 245 may receive periodic data
from any node, cluster or other network entity 210 in the network
200.
[0034] The monitor module(s) 245 may calculate scores 235 and any
associated alert behavior 240, as described below, based on the
raw, unintelligent network entity data 215 received from the
network entities 210. Thus, in addition to calculating scores 235
and/or alerts 240 for a single node, the monitor module(s) 245
may also calculate scores 235 and/or alerts 240 for a plurality of
nodes/entities 210 acting in conjunction with each other.
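Because all node-level scores are available in one place, a score for
a group of nodes can be derived from them directly. The sketch below
averages node scores per cluster. Averaging is an assumption made for
illustration, as are the identifiers and values; the disclosure leaves
the aggregation rule open.

```python
from statistics import mean

# Hypothetical per-node scores 235 already calculated centrally,
# keyed by (cluster, node).
node_scores = {
    ("cluster-a", "node-1"): 80.0,
    ("cluster-a", "node-2"): 60.0,
    ("cluster-b", "node-3"): 30.0,
}


def cluster_scores(scores):
    """Average node scores per cluster (the averaging rule is an assumption)."""
    grouped = {}
    for (cluster, _node), score in scores.items():
        grouped.setdefault(cluster, []).append(score)
    return {cluster: mean(values) for cluster, values in grouped.items()}


by_cluster = cluster_scores(node_scores)
```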
[0035] Because the scores 235 are calculated and alerts 240
generated and triggered within the monitor module(s) 245 rather
than on each individual node/entity 210, the monitor module(s) 245
may be configured to normalize and compare diverse information
within the network entity data 215 from each of these diverse
nodes/entities 210. In addition, changes or updates within
configurations to calculate scores 235 and/or alerts 240 behavior
from the network entity performance data 215, may be accomplished
for all nodes/network entities 210 without sending updates to each
individual node that uses that score 235 and/or alert 240.
Likewise, changes or updates within network 200 configuration
(e.g., adding, removing or repurposing a network node/cluster 210)
may be accomplished for all nodes/network entities 210 without
sending updates to each individual node that is affected by that
network 200 configuration change.
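The effect of keeping rules only on the central system can be sketched
as follows. The boundary table, its values and the function name
out_of_bounds are hypothetical. The minimum and maximum boundary per
sample type mirrors the boundary recited in the claims.

```python
# Hypothetical alert boundaries per sample type, held only on the
# central monitor. Entities never see or store these values.
BOUNDS = {"cpu": (0.0, 90.0)}


def out_of_bounds(sample_type, score, bounds=BOUNDS):
    """Return True when a score falls outside the type's min/max boundary."""
    low, high = bounds[sample_type]
    return score < low or score > high


scores = {"node-1": 85.0, "node-2": 95.0}
before = sorted(n for n, s in scores.items() if out_of_bounds("cpu", s))

# One central change tightens the rule for every entity at once.
BOUNDS["cpu"] = (0.0, 80.0)
after = sorted(n for n, s in scores.items() if out_of_bounds("cpu", s))
```

In this sketch a single assignment changes alert behavior for every
entity of that sample type, and no update is sent to any node.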
[0036] In some embodiments, the monitor module(s) 245 may comprise
an application programming interface (API) 255. The API 255 may
comprise a service made available to third parties, which may
further comprise any individual, entity, system, hardware, or
software wishing to access the disclosed information and
functionality. Such an API 255 may comprise a software-to-software
interface that specifies the protocol defining how independent
computer programs interact or communicate with each other. It also
may comprise a collection of pre-configured building blocks
allowing a third party to easily configure their software for
compatibility and/or extensibility.
[0037] The API 255 may comprise any API type known in the art or
developed in the future including, but not limited to,
request-style, Berkeley Sockets, Transport Layer Interface (TLI),
Representational State Transfer (REST), Simple Object Access
Protocol (SOAP), Remote Procedure Calls (RPC) 285, Structured Query
Language (SQL), file transfer, message delivery, and/or any
combination thereof. The API 255 may comprise computer-readable
code that, when executed, causes the API 255 to receive an RPC
(i.e., function call) 285 requesting information services.
Responsive to receipt of the RPC 285, the API 255 may perform the
above-described processes and transmit the request results to the
requesting third party.
[0038] To submit the request via an RPC 285 to the API 255, the
server(s) may require authentication with the API 255. Computers or
servers may locate the API 255 via an access-protected URL mapped
to the API 255, and may then use an API key configured to
authenticate the one or more computers or servers prior to
accessing the API 255.
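An API key check of the kind described might be sketched as below.
The key store, the client identifier and the function name
authenticate are hypothetical, and the disclosure does not specify
how keys are stored or compared. The constant-time comparison is a
common precaution rather than a requirement stated in the text.

```python
import hmac

# Hypothetical key store; a real deployment would not hard-code keys.
VALID_API_KEYS = {"client-1": "s3cr3t-key"}


def authenticate(client_id, presented_key):
    """Return True if the presented API key matches the stored key."""
    expected = VALID_API_KEYS.get(client_id)
    if expected is None:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected.encode(), presented_key.encode())
```

A request would be passed to the API 255 only when authenticate
returns True for the caller's identifier and key.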
[0039] The disclosed system components may request and receive data
using requests and responses for all transfers of information
through the network 200 described herein (e.g. requests/responses
for published entity data 215, SOAP requests, etc.), using any data
transfer request including, as non-limiting examples, any
combination of web services data transfers, API function calls,
HTTP response/request, SQL queries, etc.
[0040] In embodiments that utilize web services, these transfers of
information may be accomplished via web services data transfers.
Web services may utilize a software system designed to support
interoperable machine-to-machine interaction between two electronic
devices over a network such as the World Wide Web by using an
interface described in a machine-processable format, such as Web
Services Description Language (WSDL), as a non-limiting example.
These system components may interact with the web service in a
manner prescribed by its description using, as one non-limiting
example, SOAP messages conveyed using HTTP with an XML
serialization in conjunction with other web-related standards.
[0041] In some embodiments, the monitor module(s) 245, the API 255,
the server computer 205 on which they run, any web services or
RPCs, and/or any other hardware or software may be configured via
one or more configuration files 260 which define the behavior of
the related software and/or hardware. In some embodiments, the
configuration file(s) 260 may comprise XML files and may comprise
any combination of configuration files and/or subdivisions of the
configuration file(s) 260, possibly distinguished by XML tags,
where appropriate.
[0042] As non-limiting examples, these configuration files 260
and/or subdivisions may include a configuration portion, a "message
handler" portion, a scoring/scoring rules portion and an
alert/alert rules portion of the configuration file(s) 260. The
configuration portion of the configuration file(s) 260 may
comprise: configurations to "listen" for, receive and handle the
network entity data 215, including any metrics samples 220 within
the data 215; a data retention policy; a system service model that
contains server and client settings; and, where separate
configuration files 260 are used, source paths to the message
handler portion, the scoring/scoring rules portion and/or the
alert/alert rules portion of the configuration file 260.
[0043] The monitor module(s) 245 may handle the incoming network
entity data 215, score the included metrics sample data 220 and/or
generate and trigger alerts 240 from the scores 235 according to
one or more rule sets 230 accessible to the monitor module(s) 245.
In embodiments with one or more configuration files 260, these
rule sets 230 may be contained within, accessible to, or referenced
by the configuration file(s) 260.
[0044] The rule sets 265 may contain various types of rules
including, but not limited to: rules for extracting identifying
information from the network entity data 215 to determine the
metrics sample(s) 220, a sample type 225 for each sample 220 and
the entities 210 the sample 220 applies to; rules for extracting
individual metrics sample(s) 220 from the network entity data 215;
optional rules for stripping extraneous data from the sample(s) 220
to reduce message storage requirements; rules to calculate scores
235; and rules to generate and trigger alerts 240 based on the
scores 235.
[0045] These rules/rule sets 230 may be written in a
general-purpose software language (e.g., C# software code, XML data
elements or any combination thereof) and may be applied to any
entity 210 which is represented in a sample 220. In other words,
the rule sets 230 may be configurable, meaning that these rules 230
may be used to create scores 235 and/or alerts 240 from the data
215 received from any entity 210 on the network, such as user,
server or cluster-based rules 230.
[0046] The rules 230 for extracting and identifying information
from the network entity data 215 may include one or more "message
handler" functions within the monitor module(s) 245 configured to
send and receive communication between the publisher and subscriber
of the network entity data 215, possibly including "topics" to
subscribe to and/or listen for (e.g., "endpoint.metric.hosting").
In embodiments that include a configuration file 260, the
configuration file 260 may further comprise a cluster monitoring
section (possibly including software code to be executed) which
contains settings to subscribe to, listen for and/or identify
topics within messages (e.g., Active MQ communications settings,
URI, username, password, etc. as a non-limiting example).
[0047] The message received may comprise a message body containing
a point-in-time sample of performance data 220, plus enough
information to determine at least a cluster identification 250, a
node identification 250 if the sample 220 is associated with a
specific node, a timestamp that the sample 220 was obtained and the
type 225 of sample 220 obtained.
[0048] The message handler functions may be further configured to
receive, analyze and identify, within the received data 215: the
network entity 210 (identified by the cluster and/or node acting as
a "data sample generator" or the "external sample source" of the
data); the sample of performance data 220; and the sample type 225
of the sample. The monitor module(s) 245 may use the identified
sample type 225 to determine the rule sets 265 to be applied to the
sample 220. The message handler functions may also apply optional
rules for stripping extraneous data from the sample 220 to reduce
message storage requirements.
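As a non-limiting illustration, the selection of rules by sample type described above might be sketched as follows. The sketch uses Python purely for illustration (the specification mentions C# within XML); the registry `RULES_BY_SAMPLE_TYPE`, the decorator `rule_for`, the sample type names and the message field names are all hypothetical and are not taken from the specification.

```python
from collections import defaultdict

# Hypothetical registry mapping a sample type to the rules applied to it.
RULES_BY_SAMPLE_TYPE = defaultdict(list)


def rule_for(sample_type):
    """Decorator registering a rule function for one sample type."""
    def register(func):
        RULES_BY_SAMPLE_TYPE[sample_type].append(func)
        return func
    return register


@rule_for("ResourceMetrics")
def cpu_rule(sample):
    return ("CPU %", sample["cpu_percent"])


@rule_for("SitePoll")
def poll_rule(sample):
    return ("Poll total time (ms)", sample["total_ms"])


def handle_message(message):
    """Identify the sample type, then apply only the rules registered for it."""
    rules = RULES_BY_SAMPLE_TYPE.get(message["sample_type"], [])
    return [rule(message["sample"]) for rule in rules]


# A message carrying cluster, node, timestamp and sample type, as in [0047].
message = {
    "cluster_id": 5,
    "node_id": 123,
    "timestamp": "2013-01-31T12:00:00Z",
    "sample_type": "ResourceMetrics",
    "sample": {"cpu_percent": 42.0},
}
print(handle_message(message))
```

Because the rules are looked up by sample type, a new sample type may be supported by registering rules for it, without changing the handler itself.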
[0049] The one or more samples 220, which may be contained in the
performance data 215 received, may be extracted and/or examined.
Samples 220 may comprise a collection of performance data that
contains measurements of the data 215 for the network entity 210
(e.g. clusters, nodes and/or user data) over a specific period of
time. As non-limiting examples, samples 220 may include CPU usage,
CPU usage per customer, Node CPU non-idle time, CPU time required
for each user on a node, memory required for each user on the node
for a 1 minute time period, statistics from memory used/free,
website traffic, which users have used which CPU time and how much,
etc.
[0050] Various software properties and methods may be available to
the monitor module(s) 245 to receive, analyze and identify samples
220. These properties and functions may utilize information
identifying the associated cluster, the associated node (if
applicable), the date and time the sample 220 was generated, the
sample type 225, and data comprising a wrapper of the sample data
(possibly in XML format), allowing access to the attributes and
elements of the data as properties. As a non-limiting example, if
the sample XML is <Sample
ClusterId='5'><SomeValue>32.7</SomeValue></Sample>, then
Data.ClusterId will return 5, and Data.SomeValue will return 32.7.
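As a non-limiting illustration, a wrapper of this kind might be sketched as follows. The sketch is written in Python for illustration only; the class name `SampleData`, the helper `_convert` and the conversion of numeric text to numbers are assumptions and are not part of the specification.

```python
import xml.etree.ElementTree as ET


def _convert(text):
    """Return an int or float when the text is numeric, else the raw text."""
    for cast in (int, float):
        try:
            return cast(text)
        except (TypeError, ValueError):
            continue
    return text


class SampleData:
    """Hypothetical wrapper exposing XML attributes and child elements as properties."""

    def __init__(self, xml_text):
        self._root = ET.fromstring(xml_text)

    @property
    def sample_type(self):
        # The root element name stands in for the sample type.
        return self._root.tag

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails.
        root = self.__dict__.get("_root")
        if root is None:
            raise AttributeError(name)
        if name in root.attrib:
            return _convert(root.attrib[name])
        child = root.find(name)
        if child is not None:
            return _convert(child.text)
        raise AttributeError(name)


data = SampleData("<Sample ClusterId='5'><SomeValue>32.7</SomeValue></Sample>")
print(data.ClusterId, data.SomeValue)
```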
[0051] This wrapper of sample data, in combination with the central
location of the monitor module(s) 245 and analysis of message data
215 received from the entities and comprising the sample type 225,
may create flexibility in monitoring, scoring and
calculating/generating alerts 240 for the nodes and entities 210
related to network performance. The sample type 225 may comprise a
format for a sample that defines the data that the sample contains,
and, in some embodiments, may include the fully-qualified name of
the root element of the sample XML. The monitor module(s) 245 may
examine the message to determine the sample type 225 contained in
the message. As non-limiting examples, a sample type 225 may
include computer resource usage data or external webpage response
times.
[0052] Any sample type 225 may be defined, since the sample type
225 contains no restriction on the type of data that can be included
with a sample 220 or sample type 225. In some embodiments, the
sample type 225 may contain information about the purpose of a
server cluster. Different scores 235 may therefore be generated
based on this purpose, and the use of different clusters, or
different purposes for the clusters, creates flexibility in
monitoring diverse clusters or nodes.
[0053] Additional software properties and methods (possibly
software objects within the monitor module(s) 245) may take
advantage of this flexibility to receive, analyze and score samples
220 across diverse entities 210 in the network. For example, the
monitor modules 245 may comprise properties and methods for
receiving samples 220 from individual nodes and groups of nodes on
the network 200. As non-limiting examples, these properties and
methods may analyze all samples 220 that have been received for and
associated with identifying information 250 for a particular node
or a collection of nodes and return sample 220 collections that
contain: samples 220 of a particular type 225; samples 220 and
identifying information from each of the nodes in the collection of
nodes; the most recent samples 220 received from all nodes in a
collection of nodes and/or clusters; etc.
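As a non-limiting illustration, two such collection methods might be sketched as follows, in Python for illustration only. The `Sample` record, its field names and the functions `of_type` and `latest_per_node` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Sample:
    cluster_id: int
    node_id: int
    sample_type: str
    timestamp: datetime
    values: dict


def of_type(samples, sample_type):
    """Return the samples of one sample type."""
    return [s for s in samples if s.sample_type == sample_type]


def latest_per_node(samples):
    """Return the most recent sample for each node, keyed by node id."""
    latest = {}
    for s in samples:
        current = latest.get(s.node_id)
        if current is None or s.timestamp > current.timestamp:
            latest[s.node_id] = s
    return latest


samples = [
    Sample(5, 1, "ResourceMetrics", datetime(2013, 1, 31, 12, 0), {"cpu": 10.0}),
    Sample(5, 1, "ResourceMetrics", datetime(2013, 1, 31, 12, 1), {"cpu": 20.0}),
    Sample(5, 2, "SitePoll", datetime(2013, 1, 31, 12, 0), {"status": "success"}),
]
print(len(of_type(samples, "ResourceMetrics")))
print(latest_per_node(samples)[1].values)
```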
[0054] These samples 220 may be monitored and analyzed at the level
of any of the network entities 210 such as nodes, clusters of nodes
and/or users of these network resources. At the node level,
non-limiting example samples 220 may include one or more nodes':
operating system; number of central processing units (CPU);
percentage of CPU time used by user-mode and/or kernel mode
operations across all processes; free or used physical, swap or
non-paged pool memory; NIC bytes transmitted; NIC bytes
transmitted/sec; NIC bytes received; NIC bytes received/sec; disk
input/output (I/O) bytes written or read; disk I/O writes or reads;
number of context switches; or any of the site poller metrics
described below, but applied to a node level.
[0055] At the cluster level, non-limiting examples of samples 220
may include polls of sites including a site poller status (e.g.,
"success," "timeout," "internal_error," "dns_lookup_failure,"
"connection_refused"), site poller connect time, first byte time,
total time and/or total bytes for various programs and operating
systems with local or network disk storage.
[0056] At the user level, non-limiting examples of samples 220 may
include: CPU time used by user-mode or kernel-mode operations; used
memory in the non-paged pool; number of hypertext transfer protocol
(HTTP) or secure HTTP (HTTPS) requests; bytes received or sent via
web requests; disk I/O bytes written or read; or disk I/O writes or
reads.
[0057] Once the samples 220 are received, and the related entity
210 and sample type 225 for each sample 220 is identified, the
monitor module(s) 245 may calculate and/or generate scores 235 for
the sample(s) 220 for any of the entities 210 in the network 200
accessible to the monitor module(s) 245. These scores 235 may give
an indication of performance or usage of the network entity 210,
and may comprise a numeric value, a category that indicates how the
numeric value should be interpreted, and the associated network
entity 210.
[0058] Scores 235 may be applied to entities 210 other than a
single node, may be generated for and applied to any entity 210
which is represented in a sample 220 and may have access to
previous samples 220 for the cluster and node. Because the monitor
module(s) 245 have access to samples 220 from all network entities
210, greater flexibility is available for the generation and
application of the scores 235 for each of these network entities
210 than would be available if each entity 210 determined its own
scores. As non-limiting examples, the score 235 may be applied to
the entire cluster or may be applied to a specific user across any
or all nodes in a cluster.
[0059] The sample type 225 for each received sample 220 may
determine the scoring and/or alert 240 rule(s) 230 for the sample
220. These scoring rules 230 may be determined by one or more rule
sets for scoring 230, which may ultimately determine the scores 235
assigned to the metrics 220. Because of the flexibility available
to the scoring rules 230 as outlined above: a single rule 230 may
generate any number of scores 235, one or more rules may use
samples 220 of different sample types 225 in generating the scores
235 and/or one or more rules may generate different categories of
scores 235 based on any and all information included in the various
samples 220. As a non-limiting example, if one of the samples 220
for a cluster contains operating system information or cluster
purpose, a scoring rule within the rule sets 230 may use that
information or purpose to generate scores 235 that only apply to
that operating system or purpose. Similarly, if a single cluster or
node has very specific scoring needs, a rule 230 may use the
cluster or node identifiers 250 to determine scoring behavior.
[0060] The monitor module(s) 245 may calculate and generate the
scores 235 according to a scoring script, possibly written in a
general purpose language such as C# within XML. In embodiments that
use a configuration file 260, the configuration file 260 may
include and/or call rules 230 and/or functions to execute such
software code to determine scores 235 for network entities 210
identified in relation to the sample(s) 220. As a non-limiting
example, a "Score" software object within the monitor module(s) 245
may represent a score 235 generated by a scoring rule 230. An
instance of this Score object may identify the associated network
entity 210, the score category, the numeric value of the score 235
and/or a list of string name-value pairs that may be used to
associate additional information with the score 235. For example,
if a score 235 represented the sum of the CPU percentage of the top
3 users on a node, the following code may include the
identification of the top 3 users as supporting data: var
top3CpuPercentScore=new Score(ScoreOwnerType.Node, nodeId, "Top 3
Users CPU %", top3CpuPercent, "UserId", userId1, "UserId", userId2,
"UserId", userId3).
[0061] A scoring rule 230 may likewise comprise a "ScoringRules"
software object, possibly invoked by the "message handler"
functions disclosed above. In embodiments which include one or more
configuration files 260, the scoring script(s) may be included in a
scoring section of the message handler configurations and
functions. As a non-limiting example, a ScoringRules software
object may have a method with the following signature:
IEnumerable<Score> GetScores(SampleView sample,
ClusterDataView clusterData), where "sample" contains the data 215
of a just-received sample 220, and "clusterData" contains data 215
for a target cluster.
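As a non-limiting illustration, a Score object and a scoring rule with the shape described above might be sketched as follows. The sketch uses Python for illustration rather than the C# mentioned in the specification; the field names, the dictionary layout of `sample` and the per-user CPU values are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class ScoreOwnerType(Enum):
    CLUSTER = "Cluster"
    NODE = "Node"
    USER = "User"


@dataclass
class Score:
    owner_type: ScoreOwnerType
    owner_id: int
    category: str
    value: float
    # Name-value pairs kept as a list so a name such as "UserId" may repeat.
    supporting_data: list = field(default_factory=list)


def get_scores(sample, cluster_data):
    """Yield scores for a just-received sample.

    `sample` is a dict holding a node id and per-user CPU percentages;
    `cluster_data` (earlier data for the cluster) is unused by this rule.
    """
    top3 = sorted(sample["cpu_percent_by_user"].items(),
                  key=lambda kv: kv[1], reverse=True)[:3]
    yield Score(
        ScoreOwnerType.NODE,
        sample["node_id"],
        "Top 3 Users CPU %",
        sum(percent for _, percent in top3),
        [("UserId", user_id) for user_id, _ in top3],
    )


sample = {"node_id": 123,
          "cpu_percent_by_user": {"u1": 10.0, "u2": 40.0, "u3": 5.0, "u4": 25.0}}
scores = list(get_scores(sample, cluster_data=None))
print(scores[0].value, scores[0].supporting_data)
```

Writing the rule as a generator mirrors the enumerable return type of the signature above and allows a single rule to emit any number of scores.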
[0062] Like the samples 220 described above, scores 235 and/or
score rules 230 may be broken down by cluster, node or user. At the
cluster level, non-limiting example scores 235 and/or score rules
230 may include the difference (in percent) in connection count
between the top 2 nodes in the cluster; average CPU usage (in
percent) across all nodes for the past minute; and physical memory
(in bytes) used or free across all nodes in the cluster.
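As a non-limiting illustration, two of the cluster-level scores listed above might be computed as sketched below, in Python for illustration only. The function name, the input layout and the score category labels are hypothetical; expressing the connection count difference relative to the busiest node is likewise an assumption, since the specification does not define the percentage base.

```python
def cluster_scores(cluster_id, latest_by_node):
    """Compute two cluster-level scores from each node's latest sample.

    latest_by_node maps node id -> {"cpu_percent": float, "connections": int}.
    """
    cpu = [s["cpu_percent"] for s in latest_by_node.values()]
    scores = {"Avg CPU %": sum(cpu) / len(cpu)}

    top = sorted((s["connections"] for s in latest_by_node.values()),
                 reverse=True)[:2]
    if len(top) == 2 and top[0] > 0:
        # Assumption: the difference is expressed relative to the busiest node.
        scores["Top 2 Connection Diff %"] = 100.0 * (top[0] - top[1]) / top[0]
    return {"cluster_id": cluster_id, "scores": scores}


result = cluster_scores(5, {
    1: {"cpu_percent": 90.0, "connections": 200},
    2: {"cpu_percent": 30.0, "connections": 150},
    3: {"cpu_percent": 60.0, "connections": 50},
})
print(result)
```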
[0063] At the node level, non-limiting examples of scores 235
and/or score rules 230 may include time (in minutes) since the most
recent resource metrics sample 220; average CPU usage (in percent)
for the past minute, 10 minutes etc.; estimated average number of
simultaneous in-bound HTTP and HTTPS connections over a 1 minute
period; physical (in bytes) or virtual (physical + swap - in bytes)
memory used or free on the node; percent of total physical or
virtual memory that is in use on the node; the memory (in bytes)
used by the `iissvcs` svchost or `inetinfo.exe` processes; average
rate (in bytes/sec) of network usage (transmit + receive) on the
node in the past minute; lowest, average or highest filer ping time
(in milliseconds) in the past minute; and the number of attempted
and/or successful pings in the past minute.
[0064] At the user level, non-limiting example scores 235 and/or
score rules 230 may include the amount of CPU (in seconds and/or
percent) consumed by the user's processes for the past minute, 10
minutes, etc. across a single node and/or all nodes, possibly as a
percent of the total CPU time available; estimated average number
of simultaneous in-bound HTTP and HTTPS connections to the user's
sites on a single node; average amount of CPU time (in seconds)
consumed by user processes per HTTP or HTTPS connection; ratio of
data read from disk by user processes to data returned over HTTP or
HTTPS connection; number of HTTP and/or HTTPS connections received
by the user's processes across all nodes in the past minute; amount
of data (in bytes) sent and/or received by HTTP(S) connections to
the user's processes across all nodes in the past minute; amount of
data (I/O in bytes) read from, written to or transmitted via other
(i.e. not read or write) operations to or from the local disk or
filer by the user's processes on the node in the past minute;
number of read, write or other (i.e. not read or write) operations
(I/O in bytes) from the local disk or filer by the user's processes
on the node in the past minute; number of threads, handles and/or
physical and/or virtual memory used by the user's processes on the
node in the past minute.
[0065] The score category included as a part of each score 235 may
indicate how the numeric value in the score 235 should be
interpreted, and may be used for purposes of generating and
triggering one or more alerts 240.
[0066] FIG. 3 illustrates that the method for monitoring one or
more network entities using a central monitoring system disclosed
in FIG. 1 may further comprise the steps of the server(s) 205
monitoring the network resource usage score(s) 235 (Step 300),
determining whether the score(s) 235 go out of a minimum or maximum
boundary for a specified period of time (Step 310), and if so,
generating and triggering an alert 240 indicating a detected issue
with the network entity 210 (Step 320). The server(s) 205 may then
transmit the alert 240 to a client computer 275 communicatively
coupled to the network 200 (Step 330).
[0067] Alerts 240 may indicate a detected issue with a network
entity 210 and may be triggered within the monitor module(s) 245 by
a specific score 235 when the score goes out-of-range of a minimum
and/or maximum boundary for the score 235 for a specified period of
time. The period of time that the score 235 must be "out-of-bounds"
before the alert 240 is triggered may indicate to the monitor
module(s) 245 that action needs to be taken to resolve a detected
issue with the one or more specific network entities 210. Alerts
240 may identify, from the network entity data 215, a type of
entity 210 or a score category that the alert 240 applies to, so
that the monitor module(s) 245 may take action based on the type of
entity, sample type 225 and/or score category identified.
[0068] In addition, alerts 240 may also review metrics 220 and
previous alerts 240 for a period of time to determine if there is a
pattern to the behavior and relate the data set back to the
offending network entity 210. The specific period of time may
include metrics 220 checked, for example, over the last hour, or
over a set of minutes, etc. for alerts 240. This data may be
analyzed to determine if there is a pattern of behavior and the
relevant rules 230 may be applied accordingly. As non-limiting
examples, alerts 240 may identify a poorly performing cluster
arrangement for a server cluster or an individual server.
[0069] Each of the alert rules 230 may comprise: a unique name that
identifies the alert rule 230; an owner type (e.g., "cluster,"
"node," "user," etc.) identifying the type of entity 210 that owns
the score 235 that the alert 240 is triggered by; a score category
identifying the category of the score 235 that the alert 240 is
triggered by; an optional minimum acceptable value, which is the
lower bound of the acceptable range of the score 235 (if this is
not specified, there is no lower bound, and the alert will only be
triggered when the score 235 exceeds the upper bound of the
acceptable range); an optional maximum acceptable value, which is
the upper bound of the acceptable range of the score 235 (if this
is not specified, there is no upper bound, and the alert will only
be triggered when the score 235 is below the lower bound of the
acceptable range); a total amount of time the score 235 can be out
of range before the alert 240 is triggered; a period of time over
which the out-of-range time is calculated; and the severity of the
alert (e.g., critical, major, minor, info).
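As a non-limiting illustration, the evaluation of such an alert rule might be sketched as follows, in Python for illustration only. The class and field names are hypothetical. The sketch further assumes that each score value holds until the next observation arrives, and that the alert triggers once the accumulated out-of-range time within the window exceeds the allowed time; the specification does not fix either detail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRule:
    name: str
    score_owner_type: str
    score_category: str
    allowed_out_of_range_time: timedelta
    allowed_out_of_range_window: timedelta
    severity: str
    min_value: Optional[float] = None  # no lower bound when None
    max_value: Optional[float] = None  # no upper bound when None

    def out_of_range(self, value):
        below = self.min_value is not None and value < self.min_value
        above = self.max_value is not None and value > self.max_value
        return below or above

    def is_triggered(self, history, now):
        """history: time-ordered (timestamp, value) pairs for one entity's score.

        Assumes each value holds until the next observation (or until `now`).
        """
        window_start = now - self.allowed_out_of_range_window
        total = timedelta()
        for i, (start, value) in enumerate(history):
            end = history[i + 1][0] if i + 1 < len(history) else now
            # Clip the interval to the evaluation window.
            start, end = max(start, window_start), min(end, now)
            if end > start and self.out_of_range(value):
                total += end - start
        return total > self.allowed_out_of_range_time


now = datetime(2013, 1, 31, 12, 0)
rule = AlertRule("High Cluster CPU", "Cluster", "CPU %",
                 timedelta(minutes=5), timedelta(minutes=30), "Major",
                 max_value=90.0)
history = [(now - timedelta(minutes=10), 95.0), (now - timedelta(minutes=4), 80.0)]
print(rule.is_triggered(history, now))
```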
[0070] As a non-limiting example, an alert 240 may be thrown for an
alert rule 230 named "Node Not Reporting." This alert rule 230 may
target individual nodes and may be triggered based on a score 235
for a most recent sample 220 age. The alert rules 230 may cause
this alert 240 to be triggered in this example if the resource
metrics 220 for the node have not been received in at least 5
minutes. If such an alert 240 is triggered, the alert rules 230 may
be configured to assign an alert severity of "Major" to this alert
240.
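As a non-limiting illustration, the "Node Not Reporting" example might be sketched as follows, in Python for illustration only. The function names and the input mapping are hypothetical, and the sketch simplifies the rule by comparing the sample age directly against the 5 minute limit rather than accumulating out-of-range time over a window.

```python
from datetime import datetime, timedelta


def sample_age_minutes(last_sample_time, now):
    """Score: minutes since the node's most recent resource metrics sample."""
    return (now - last_sample_time).total_seconds() / 60.0


def node_not_reporting(last_seen_by_node, now, max_age_minutes=5.0):
    """Return (node_id, severity) alerts for nodes silent for at least max_age_minutes."""
    return [
        (node_id, "Major")
        for node_id, last_seen in sorted(last_seen_by_node.items())
        if sample_age_minutes(last_seen, now) >= max_age_minutes
    ]


now = datetime(2013, 1, 31, 12, 0)
last_seen = {1: now - timedelta(minutes=2), 2: now - timedelta(minutes=7)}
print(node_not_reporting(last_seen, now))
```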
[0071] In embodiments which use one or more configuration files
260, the alert rules 230 may be defined through an alert rules 230
configuration file or an alert rules 230 section of a configuration
file 260. As a non-limiting example, the rules 230 for generating
alerts 240 based on scores 235 may comprise a collection of XML
alert rules objects, primarily used for SNMP messages associated
with the alert. For example, an XML alert rule object may appear as
follows: <AlertRules> <add Name='High Cluster CPU'
ScoreOwnerType='Cluster' ScoreCategory='CPU %' MaxValue='90'
AllowedOutOfRangeTime='00:05' AllowedOutOfRangeWindow='00:30'
Severity='Major'/> </AlertRules>
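As a non-limiting illustration, an alert rules section of this form might be loaded as sketched below, in Python for illustration only. The function names are hypothetical. Reading the time values as hours and minutes is an assumption, as is the `MinValue` attribute name, which does not appear in the example above and is inferred by analogy with `MaxValue`.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

ALERT_RULES_XML = """
<AlertRules>
  <add Name='High Cluster CPU' ScoreOwnerType='Cluster' ScoreCategory='CPU %'
       MaxValue='90' AllowedOutOfRangeTime='00:05'
       AllowedOutOfRangeWindow='00:30' Severity='Major'/>
</AlertRules>
"""


def parse_span(text):
    """Parse 'hh:mm' into a timedelta (the hh:mm reading is an assumption)."""
    hours, minutes = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes)


def load_alert_rules(xml_text):
    """Return one dict per <add> element in the alert rules section."""
    rules = []
    for node in ET.fromstring(xml_text).findall("add"):
        attrs = node.attrib
        rules.append({
            "name": attrs["Name"],
            "owner_type": attrs["ScoreOwnerType"],
            "category": attrs["ScoreCategory"],
            # Missing bounds stay None, meaning "no bound on that side".
            "min_value": float(attrs["MinValue"]) if "MinValue" in attrs else None,
            "max_value": float(attrs["MaxValue"]) if "MaxValue" in attrs else None,
            "allowed_time": parse_span(attrs["AllowedOutOfRangeTime"]),
            "window": parse_span(attrs["AllowedOutOfRangeWindow"]),
            "severity": attrs["Severity"],
        })
    return rules


rules = load_alert_rules(ALERT_RULES_XML)
print(rules[0]["name"], rules[0]["max_value"])
```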
[0072] As with samples 220 and scores 235, the centralized nature
of the monitor module(s) 245 within the network 200 circumvents the
need to configure each individual entity 210 for alerts 240. This
creates flexibility in their application within the system because
alerts 240 are triggered and acted upon in centralized software
based on entity type and score category. Thus, if specific entities
210 require different alert rules 230, the score rules 230 and/or
alert rules 230 may be set up to provide special handling for only
those entities 210 that are exceptional.
[0073] The monitor module(s) 245 and/or API 255 may include an
interface to send messages to external systems when an alert 240 is
triggered. This external system may comprise a "response engine"
280 that may examine alerts 240 transmitted from the monitor
module(s) 245 and make recommendations on "actions" to be taken to
correct an alert 240. Users may select either the recommended
action or another action to run through the interface. Actions may
include automated steps taken against a cluster or node, such as
"kill user processes on node 123." Actions performed repeatedly
which are easily automatable may be simplified to a simple click in
the interface.
[0074] The response engine 280 may process alerts 240 to determine
what actions can be run in response to an alert 240. The response
engine 280 may also provide a list of all available actions for a
cluster, node or user that is currently affected by an alert 240,
so that other or additional actions may be taken in response to the
alert 240. The list of available actions may comprise the following
information: network entities 210 that are currently alerting;
cluster information for any cluster, node or user; recommended
actions based on alerts 240; and a list of available actions for
all clusters, nodes and users.
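As a non-limiting illustration, a response engine lookup of recommended and available actions might be sketched as follows, in Python for illustration only. The catalogs `RECOMMENDED` and `AVAILABLE`, the alert field names and every action other than "kill user processes on node 123" are hypothetical.

```python
# Hypothetical catalog: (entity type, score category) -> recommended actions.
RECOMMENDED = {
    ("Node", "Top 3 Users CPU %"): ["kill user processes on node {id}"],
    ("Cluster", "CPU %"): ["rebalance cluster {id}"],
}
# Hypothetical actions that can always be run against an entity type.
AVAILABLE = {
    "Node": ["kill user processes on node {id}", "restart node {id}"],
    "Cluster": ["rebalance cluster {id}"],
}


def respond(alert):
    """Return recommended and other available actions for one alert."""
    def fill(templates):
        return [t.format(id=alert["entity_id"]) for t in templates]

    key = (alert["entity_type"], alert["score_category"])
    recommended = fill(RECOMMENDED.get(key, []))
    others = [a for a in fill(AVAILABLE.get(alert["entity_type"], []))
              if a not in recommended]
    return {"recommended": recommended, "other_available": others}


alert = {"entity_type": "Node", "entity_id": 123,
         "score_category": "Top 3 Users CPU %"}
print(respond(alert))
```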
[0075] In some embodiments, the score(s) 235 may be transmitted to
a client computer 275 in response to a data request from the client
computer 275. The monitor module(s) 245 and/or API 255 may include
an interface to return sample 220, score 235 and/or alert 240 data
to external systems (e.g. a user interface monitored by system
administrators) in response to requests from those systems. The
modules 245 may include a data request handler to handle these
requests. The monitor module(s) 245 and/or API 255 may include a
web service that exposes a SOAP interface that allows access to the
generated scores 235 and alerts 240, as well as the samples 220.
The SOAP interface 285 may send requests that: receive the samples
220, scores 235 and/or alerts 240 associated with a cluster; get
all clusters that have active alerts 240; get the identifications
associated with all clusters that are currently monitored; and get
the clusters with the highest scores 235 of the specified type.
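As a non-limiting illustration, two of these queries might be served by logic such as the following, sketched in Python for illustration only. The function names and the dictionary layout of scores and alerts are hypothetical, and the sketch does not show the SOAP transport itself.

```python
import heapq


def clusters_with_highest_scores(scores, category, limit=3):
    """Return cluster ids ordered by their highest score in one category."""
    best = {}
    for s in scores:
        if s["owner_type"] == "Cluster" and s["category"] == category:
            best[s["owner_id"]] = max(s["value"], best.get(s["owner_id"], s["value"]))
    return heapq.nlargest(limit, best, key=best.get)


def clusters_with_active_alerts(alerts):
    """Return the sorted ids of clusters that currently have an active alert."""
    return sorted({a["cluster_id"] for a in alerts if a["active"]})


scores = [
    {"owner_type": "Cluster", "owner_id": 1, "category": "CPU %", "value": 50.0},
    {"owner_type": "Cluster", "owner_id": 1, "category": "CPU %", "value": 70.0},
    {"owner_type": "Cluster", "owner_id": 2, "category": "CPU %", "value": 95.0},
    {"owner_type": "Cluster", "owner_id": 3, "category": "Memory", "value": 99.0},
]
alerts = [{"cluster_id": 2, "active": True}, {"cluster_id": 1, "active": False}]
print(clusters_with_highest_scores(scores, "CPU %", limit=2))
print(clusters_with_active_alerts(alerts))
```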
[0076] The SOAP interface 285 may utilize SOAP, a protocol
specification for exchanging structured information in the
implementation of web services in computer networks, which may
consist of three parts: an envelope, which defines what is in the
message and how to process it; a set of encoding rules for
expressing instances of application-defined data types; and a
convention for representing procedure calls and responses.
[0077] The web service may be accessible via a WSDL interface. This
XML-based interface description language may be used for describing
the functionality offered by a web service. A WSDL description of a
web service (also referred to as a WSDL file) may provide a
machine-readable description of how the service can be called, what
parameters it expects, and what data structures it returns.
[0078] Other embodiments and uses of the above inventions will be
apparent to those having ordinary skill in the art upon
consideration of the specification and practice of the inventions
disclosed herein. The specification and examples given should be
considered exemplary only, and it is contemplated that the appended
claims will cover any other such embodiments or modifications as
fall within the true scope of the inventions.
[0079] The Abstract accompanying this specification is provided to
enable the United States Patent and Trademark Office and the public
generally to determine quickly from a cursory inspection the nature
and gist of the technical disclosure and is in no way intended for
defining, determining, or limiting the present inventions or any of
their embodiments.
* * * * *