U.S. patent application number 12/880125 was filed with the patent office on 2010-09-12 and published on 2013-09-19 for system and method for clustering host inventories.
The applicants listed for this patent are Rishi Bhargava and David P. Reese, JR. The invention is credited to Rishi Bhargava and David P. Reese, JR.
Application Number | 12/880125 |
Publication Number | 20130246422 |
Family ID | 49158653 |
Filed Date | 2010-09-12 |
Publication Date | 2013-09-19 |
United States Patent Application | 20130246422 |
Kind Code | A1 |
Bhargava; Rishi; et al. | September 19, 2013 |
SYSTEM AND METHOD FOR CLUSTERING HOST INVENTORIES
Abstract
A method in one example implementation includes obtaining a
plurality of host file inventories corresponding respectively to a
plurality of hosts, calculating input data using the plurality of
host file inventories, and then providing the input data to a
clustering procedure to group the plurality of hosts into one or
more clusters of hosts. The method further includes each cluster of
hosts being grouped using predetermined similarity criteria. In
more specific embodiments, each of the host file inventories
includes a set of one or more file identifiers with each file
identifier representing a different executable software file on a
corresponding one of the plurality of hosts. In other more specific
embodiments, calculating the input data includes transforming the
host file inventories into a matrix of keyword vectors in Euclidean
space. In further embodiments, calculating the input data includes
transforming the host file inventories into a similarity
matrix.
Inventors: | Bhargava; Rishi (Cupertino, CA); Reese, JR.; David P. (Sunnyvale, CA) |
Applicant: |
Name | City | State | Country | Type |
Bhargava; Rishi | Cupertino | CA | US | |
Reese, JR.; David P. | Sunnyvale | CA | US | |
Family ID: | 49158653 |
Appl. No.: | 12/880125 |
Filed: | September 12, 2010 |
Current U.S. Class: | 707/737; 707/E17.089 |
Current CPC Class: | G06F 17/30017 20130101; H04L 41/0893 20130101; G06F 16/35 20190101 |
Class at Publication: | 707/737; 707/E17.089 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. A computer implemented method, comprising: obtaining a plurality
of host file inventories corresponding respectively to a plurality
of hosts in a network environment, wherein each of the plurality of
host file inventories includes one or more file identifiers, each
of the file identifiers of a particular host file inventory
representing a different executable file on one of the plurality of
hosts corresponding to the particular host file inventory;
calculating input data by transforming the plurality of host file
inventories into a matrix of keyword vectors in Euclidean space
based on a keyword sequence, wherein each keyword of the keyword
sequence is unique, wherein each one of the plurality of hosts
corresponds to one of the keyword vectors; and providing the input
data to a clustering procedure to group the plurality of hosts into
one or more clusters of hosts, wherein the one or more clusters of
hosts are grouped using a predetermined similarity criteria.
2. (canceled)
3. The method of claim 1, wherein each of the one or more file
identifiers in the plurality of host file inventories includes a
token sequence of one or more tokens.
4. The method of claim 3, further comprising: receiving input from
a user selecting a number of tokens for the token sequence and
selecting a configuration for each of the tokens.
5. The method of claim 3, wherein the token sequence includes a
first token and a configuration for the first token is a checksum
of an executable file.
6. The method of claim 3, wherein the token sequence includes a
first token and a second token, wherein a first configuration for
the first token is a checksum of an executable file and a second
configuration for the second token is a file path of the executable
file.
7. The method of claim 3, further comprising: determining the
keyword sequence, wherein each keyword of the keyword sequence is
unique with respect to other keywords in the keyword sequence.
8. The method of claim 7, wherein a first keyword of the keyword
sequence is equivalent to a first token of at least one of the one
or more file identifiers of the plurality of host file
inventories.
9. The method of claim 8, wherein a second keyword of the keyword
sequence is equivalent to another first token of at least one of
the one or more other file identifiers of the plurality of host
file inventories, wherein the first token and the other first token
are not equivalent.
10. The method of claim 8, wherein a third keyword of the keyword
sequence is equivalent to a second token of at least one of the one
or more file identifiers of the plurality of host file
inventories.
11. (canceled)
12. The method of claim 1, wherein a first one of the keyword
vectors includes a plurality of values, each value indicating
whether one of the keywords in the keyword sequence is included in
a first one of the plurality of host file inventories.
13. The method of claim 1, wherein a first one of the keyword
vectors includes a plurality of values, each value corresponding to
a frequency of occurrence of one of the keywords in a first one of
the plurality of host file inventories.
14. (canceled)
15. (canceled)
16. (canceled)
17. The method of claim 1, further comprising: generating
information indicating the one or more clusters of hosts, wherein
each of the one or more clusters includes at least one host.
18. The method of claim 17, wherein the information is a proximity
plot.
19. The method of claim 1, wherein the clustering procedure is an
agglomerative hierarchical clustering technique with the
predetermined similarity criteria including a cut point
determination to define a stopping point of the clustering
procedure.
20. The method of claim 1, wherein the clustering procedure is a
partitional clustering technique.
21. Logic encoded in one or more non-transitory media that includes
code for execution and when executed by a processor is operable to
perform operations comprising: obtaining a plurality of host file
inventories corresponding respectively to a plurality of hosts in a
network environment, wherein each of the plurality of host file
inventories includes one or more file identifiers, each of the file
identifiers of a particular host file inventory representing a
different executable file on one of the plurality of hosts
corresponding to the particular host file inventory; calculating
input data by transforming the plurality of host file inventories
into a matrix of keyword vectors in Euclidean space based on a
keyword sequence, wherein each keyword of the keyword sequence is
unique, wherein each one of the plurality of hosts corresponds to
one of the keyword vectors; and providing the input data to a
clustering procedure to group the plurality of hosts into one or
more clusters of hosts, wherein the one or more clusters of hosts
are grouped using a predetermined similarity criteria.
22. (canceled)
23. The logic of claim 21, wherein each of the one or more file
identifiers in the plurality of host file inventories includes a
token sequence of one or more tokens.
24. The logic of claim 21, the processor being operable to perform
further operations comprising: determining the keyword sequence,
wherein each keyword of the keyword sequence is unique with respect
to other keywords in the keyword sequence, wherein a first one of
the keyword vectors includes a plurality of values, each value
indicating whether one of the keywords in the keyword sequence is
included in a first one of the plurality of host file
inventories.
25. (canceled)
26. An apparatus, comprising: a host inventory preparation module;
a processor operable to execute instructions associated with the
host inventory preparation module, including: obtaining a plurality
of host file inventories corresponding respectively to a plurality
of hosts in a network environment, wherein each of the plurality of
host file inventories includes one or more file identifiers, each
of the file identifiers of a particular host file inventory
representing a different executable file on one of the plurality of
hosts corresponding to the particular host file inventory;
calculating input data by transforming the plurality of host file
inventories into a matrix of keyword vectors in Euclidean space
based on a keyword sequence, wherein each keyword of the keyword
sequence is unique, wherein each one of the plurality of hosts
corresponds to one of the keyword vectors; and providing the input
data to a clustering procedure to group the plurality of hosts into
one or more clusters of hosts, wherein the one or more clusters of
hosts are grouped using a predetermined similarity criteria.
27. (canceled)
28. The apparatus of claim 26, wherein each of the one or more file
identifiers in the plurality of host file inventories includes a
token sequence of one or more tokens.
29. The apparatus of claim 26, wherein the processor is operable to
perform further instructions, comprising: determining the keyword
sequence, wherein each keyword of the keyword sequence is unique
with respect to other keywords in the keyword sequence, wherein a
first one of the keyword vectors includes a plurality of values,
each value indicating whether one of the keywords in the keyword
sequence is included in a first one of the plurality of host file
inventories.
30. (canceled)
Description
TECHNICAL FIELD
[0001] This disclosure relates in general to the field of computer
network administration and support and, more particularly, to
identifying similar software inventories on selected hosts.
BACKGROUND
[0002] The field of computer network administration and support has
become increasingly important and complicated in today's society.
Computer network environments are configured for virtually every
organization and usually have multiple interconnected computers
(e.g., end user computers, laptops, servers, printing devices,
etc.). Typically, each computer has its own set of executable
software, each of which can be represented by an executable
software inventory. For Information Technology (IT) administrators,
congruency among executable software inventories of similar
computers (e.g., desktops and laptops) simplifies maintenance and
control of the network environment. Differences between executable
software inventories, however, can arise in even the most tightly
controlled network environments. In addition, each organization may
develop its own approach to computer network administration and,
consequently, some organizations may have very little congruency
and may experience undesirable diversity of executable software on
their computers. Particularly in very large organizations,
executable software inventories may vary greatly among computers
across departmental groups. Varied executable software inventories
on computers within organizations present numerous difficulties to
IT administrators to maintain, to troubleshoot, to service, and to
provide uninterrupted access for business or other necessary
activities. Innovative tools are needed to assist IT administrators
to successfully support computer network environments with
computers having incongruities between executable software
inventories.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] To provide a more complete understanding of the present
disclosure and features and advantages thereof, reference is made
to the following description, taken in conjunction with the
accompanying figures, wherein like reference numerals represent
like parts, in which:
[0004] FIG. 1 is a pictorial representation of an exemplary network
environment in which various embodiments of a system and method for
clustering host inventories may be implemented in accordance with
the present disclosure;
[0005] FIG. 2 is a simplified block diagram of a computer, which
may be utilized in embodiments in accordance with the present
disclosure;
[0006] FIG. 3 is a simplified flowchart illustrating a series of
example steps associated with the system in accordance with one
embodiment of the present disclosure;
[0007] FIG. 4 illustrates an n×m vector matrix format used in
accordance with an embodiment of the present disclosure;
[0008] FIG. 5 is a flowchart illustrating a series of example steps
for generating values for an n×m vector matrix as shown in
FIG. 4 in accordance with one embodiment of the present
disclosure;
[0009] FIG. 6 illustrates an example selected group of hosts in a
network environment to which embodiments of the present disclosure
may be applied;
[0010] FIG. 7 illustrates a vector matrix created by application of
the flow of FIG. 5 to the example selected group of hosts of FIG.
6;
[0011] FIG. 8 is an example cluster diagram of the hosts of FIG. 6
that could be created from the system in accordance with
embodiments of the present disclosure;
[0012] FIG. 9 is a simplified flowchart illustrating a series of
example steps associated with the system in accordance with another
embodiment of the present disclosure;
[0013] FIG. 10 illustrates an n×n similarity matrix format
used in accordance with one embodiment of the present
disclosure;
[0014] FIG. 11 is a flowchart illustrating a series of example
steps for generating values for an n×n similarity matrix as
shown in FIG. 10 in accordance with one embodiment of the present
disclosure; and
[0015] FIG. 12 illustrates a similarity matrix created by
application of the flow of FIG. 11 to the example selected group of
hosts of FIG. 6.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
Overview
[0016] A method in one example implementation includes obtaining a
plurality of host file inventories corresponding respectively to a
plurality of hosts, calculating input data using the plurality of
host file inventories, and providing the input data to a clustering
procedure to group the plurality of hosts into one or more clusters
of hosts. The method further includes each cluster of hosts being
grouped using a predetermined similarity criteria. More specific
embodiments include each of the plurality of host file inventories
including a set of one or more file identifiers with each file
identifier representing a different executable software file on a
corresponding one of the plurality of hosts. In another more
specific embodiment, the method includes each of the one or more
file identifiers including a token sequence of one or more tokens.
In other more specific embodiments, the calculating the input data
includes transforming the plurality of host file inventories into a
similarity matrix. In another more specific embodiment, the
calculating the input data includes transforming the plurality of
host file inventories into a matrix of keyword vectors in Euclidean
space, where each keyword vector corresponds to one of the
plurality of hosts.
EXAMPLE EMBODIMENTS
[0017] FIG. 1 is a pictorial representation of a computer network
environment 100 in which embodiments of a system for clustering
host inventories may be implemented in accordance with the present
disclosure. Computer network environment 100 illustrates a network
of computers including a plurality of hosts 110a, 110b, and 110c
(referred to collectively herein as hosts 110), which may each
have, respectively, a set of executable files 112a, 112b, and 112c
(referred to collectively herein as sets of executable files 112)
and a host inventory feed 114a, 114b, and 114c (referred to
collectively herein as host inventory feeds 114). Hosts 110 may be
operably connected to a central server 130 through communication
link 120. Central server 130 may include an administrative module
140, a host inventory preparation module 150, and a clustering
module 160. A management console 170 can also be suitably connected
to central server 130 to provide an interface for users such as
Information Technology (IT) administrators, network operators, and
the like.
[0018] In example embodiments, the system for clustering host
inventories may be utilized to provide valuable information to
users (e.g., IT administrators, network operators, etc.)
identifying computers having similar operating systems and
installed executable software files. In one example, when the
system for clustering host inventories is applied to a computer
network environment such as network environment 100 of FIG. 1,
software inventories from hosts 110 may be transformed by host
inventory preparation module 150 into input data for a clustering
algorithm or procedure. Clustering module 160 may apply the
clustering algorithm to the prepared input data to create a
clustering diagram or other information identifying logical
groupings of hosts 110 having similar operating systems and
installed sets of executable files 112. The clustering diagram may
also identify any outlier hosts 110 having significant differences
in operating systems and/or executable files relative to the other
hosts 110 in network environment 100. Thus, the IT administrator or
other user is provided with valuable information that enables the
discovery of trends and exceptions of computers, such as hosts 110,
in the particular network environment. As a result, common policies
may be applied to computers within logical groupings and remedial
action may be taken on any identified outlier computers.
[0019] For purposes of illustrating the techniques of the system
for clustering host inventories, it is important to understand the
activities occurring within a given network. The following
foundational information may be viewed as a basis from which the
present disclosure may be properly explained. Such information is
offered earnestly for purposes of explanation only and,
accordingly, should not be construed in any way to limit the broad
scope of the present disclosure and its potential applications.
[0020] Typical network environments used in organizations and by
individuals often include a plurality of computers such as end user
desktops, laptops, servers, network appliances, and the like, and
each may have an installed set of executable software. In large
organizations, network environments may include hundreds or
thousands of computers, which may span different buildings, cities,
and/or geographical areas around the world. IT administrators may
be tasked with the extraordinary responsibility of maintaining
these computers in a way that minimizes or eliminates disruption to
business activities.
[0021] One difficulty IT administrators face includes maintaining
multiple computers in a chaotic or heterogeneous network
environment. In such an environment, congruency between executable
software of the computers may be minimal. For example, executable
files may be stored in different memory locations on different
computers, different versions of executable files may be installed
in different computers, executable files may be stored on some
computers but not on others, and the like. Such networks may
require additional time and resources to be adequately supported as
IT administrators may need to individualize policies, maintenance,
upgrades, repairs, and/or any other type of support to suit
particular computers having nonstandard executable software and/or
operating systems.
[0022] Homogeneous network environments, in which executable
software of computers is congruent or at least similar, may also
benefit from a system and method for clustering host inventories.
In homogeneous environments or substantially homogeneous
environments, particular computers may occasionally deviate from
standard computers within the network environment. For example,
malicious software may break through the various firewalls and
other network barriers creating one or more deviant computers. In
addition, end users of computers may install various executable
software files from transportable disks or download such software
creating deviant computers. In accordance with the present
disclosure, a system for clustering host inventories could readily
identify any outliers having nonstandard and possibly malicious
executable software.
[0023] A system and method for clustering host inventories, as
outlined in FIG. 1, could greatly enhance abilities of IT
administrators or other users managing computer networks to
effectively support both heterogeneous and homogeneous network
environments. The system, which may be implemented in a computer
such as server 130, enables identification of logical groupings of
computers with similar executable file inventories and
identification of outliers (e.g., computers with drastically
different executable file inventories). In accordance with one
example implementation, host file inventories of executable files
from hosts 110 are provided for evaluation. The host file
inventories are transformed into input data for a clustering
algorithm. Once the input data is prepared, the clustering
algorithm is applied and one or more diagrams or charts may be
created to show logical clusters or groupings of hosts 110 having
the same or similar software inventories. In addition, the diagrams
or charts may also show any of the hosts 110 that drastically
deviate from other hosts 110. Thus, the system provides network or
IT administrators with valuable information that may be used to
more effectively manage hosts 110 within network environment
100.
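The flow described above (inventories in, groupings and outliers out) can be sketched in a few lines. The following is a minimal illustration only, not the disclosed implementation: it assumes each host file inventory is already a set of file identifiers, uses Jaccard distance as a stand-in similarity criterion (the disclosure does not fix one at this point), and applies single-linkage agglomerative clustering that stops at a cut distance. The names jaccard_distance, cluster_hosts, the cut value of 0.5, and the sample identifiers are all hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|; 0.0 means identical inventories."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_hosts(inventories: dict, cut: float) -> list:
    """Single-linkage agglomerative clustering, stopped at distance `cut`."""
    # Start with every host in its own cluster.
    clusters = [{host} for host in sorted(inventories)]
    while len(clusters) > 1:
        # Find the closest pair of clusters (distance between nearest members).
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(
                jaccard_distance(inventories[x], inventories[y])
                for x in clusters[i]
                for y in clusters[j]
            )
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cut:
            # Cut point reached: the remaining clusters are too dissimilar.
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical inventories: three similar hosts and one that deviates.
inventories = {
    "host_a": {"f1", "f2", "f3", "f4"},
    "host_b": {"f1", "f2", "f3", "f5"},
    "host_c": {"f1", "f2", "f3", "f4", "f5"},
    "host_d": {"x1", "x2", "f1"},
}
clusters = cluster_hosts(inventories, cut=0.5)
# A cluster with a single member is treated here as an outlier.
outliers = [next(iter(c)) for c in clusters if len(c) == 1]
print(clusters, outliers)
```

In this sketch hosts a, b, and c merge into one grouping, while host_d never comes within the cut distance of any other host and is reported as an outlier. A production system would more likely rely on an established clustering library than on this quadratic loop.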
[0024] Note that in this Specification, references to various
features (e.g., elements, structures, modules, components, steps,
etc.) included in "one embodiment", "example embodiment", "an
embodiment", "another embodiment", "some embodiments", "various
embodiments", "one example", "other embodiments", and the like are
intended to mean that any such features may be included in one or
more embodiments of the present disclosure, but may or may not
necessarily be included in the same embodiments.
[0025] Turning to the infrastructure of FIG. 1, the example network
environment 100 may be configured as one or more networks and may
be configured in any form including, but not limited to, local area
networks (LANs), wide area networks (WANs) such as the Internet, or
any combination thereof. In some embodiments, communication link
120 may represent any electronic link supporting a LAN environment
such as, for example, cable, Ethernet, wireless (e.g., WiFi), ATM,
fiber optics, etc. or any combination thereof. In other
embodiments, communication link 120 may represent a remote
connection to central server 130 through any appropriate medium
(e.g., digital subscriber lines (DSL), telephone lines, T1 lines,
T3 lines, wireless, satellite, fiber optics, cable, Ethernet, etc.
or any combination thereof) and/or through any additional networks
such as a wide area network (e.g., the Internet). In addition,
gateways, routers and the like may be used to facilitate electronic
communication between hosts 110 and central server 130.
[0026] In an example embodiment, hosts 110 may represent end user
computers that could be operated by end users. The end user
computers may include desktops, laptops, and mobile or handheld
computers (e.g., personal digital assistants (PDAs) or mobile
phones). Hosts 110 can also represent other computers (e.g.,
servers, appliances, etc.) having executable software, which could
be similarly evaluated and clustered by the system, using
executable file inventories derived from sets of executable files
112 on such hosts 110. It should be noted that the network
configurations and interconnections shown and described herein are
for illustrative purposes only. FIG. 1 is intended as an example
and should not be construed to imply architectural limitations in
the present disclosure.
[0027] Sets of executable files 112 on hosts 110 can include all
executable files on respective hosts 110. In this Specification,
references to "executable file", "program file", "executable
software file", and "executable software" are meant to encompass
any software file comprising instructions that can be understood
and processed by a computer such as executable files, library
modules, object files, other executable modules, script files,
interpreter files, and the like. In one embodiment, the system
could be configured to allow the IT administrator to select a
particular type of executable file to be clustered. For example, an
IT Administrator may choose only dynamic-link library (DLL) modules
for clustering. Thus, sets of executable files 112 would include
only DLL modules on the respective hosts 110. In addition, the IT
administrator may also be permitted to select particular hosts to
which clustering is applied. For example, all end user computers in
a network or within a particular part of the network may be
selected. In another example, all servers within a network or
within a particular part of the network may be selected.
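As a small illustration of restricting the clustering to one type of executable file, the sketch below filters a host's file list down to DLL modules. It assumes, purely for illustration, that file type can be recognized from the file name extension; the disclosure does not say how the type is determined. The function name select_files and the sample paths are hypothetical.

```python
from pathlib import PureWindowsPath


def select_files(paths, suffixes):
    """Keep only paths whose extension is in `suffixes` (case-insensitive)."""
    wanted = {s.lower() for s in suffixes}
    return [p for p in paths if PureWindowsPath(p).suffix.lower() in wanted]


# Hypothetical file list collected from one host.
host_files = [
    r"C:\Windows\System32\kernel32.dll",
    r"C:\Program Files\App\app.exe",
    r"C:\Program Files\App\plugin.DLL",
]
dll_only = select_files(host_files, {".dll"})
print(dll_only)
```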
[0028] Central server 130 of network environment 100 represents an
exemplary server or other computer linked to hosts 110, which may
provide services to hosts 110. The system for clustering host
inventories may be implemented in central server 130 using various
embodiments of host inventory preparation module 150 and clustering
module 160. For example, keyword techniques may be used with vector
based clustering in one example embodiment. In this example
embodiment, host inventory preparation module 150 creates an
(n×m) vector matrix where columns of the matrix may
correspond to a determined number (i.e., "m") of unique keywords,
each of which is associated with one or more executable files in a
selected number (i.e., "n") of hosts. The rows of the vector matrix
may correspond to the n selected hosts. Clustering module 160 can
then apply a clustering algorithm to the vector matrix to create
logical groupings of the n selected hosts. In another example
embodiment, compression techniques may be used with similarity
based clustering. In this example embodiment, host inventory
preparation module 150 may create an (n×n) similarity matrix
using compression techniques for a selected number (i.e., "n") of
hosts. Clustering module 160 may then apply a clustering algorithm
to the similarity matrix to create logical groupings of the n
selected hosts. In one embodiment, selected hosts may include all
of the hosts 110 in a particular network environment such as
network environment 100. In other embodiments, selected hosts may
include particular hosts selected by a user or predefined by
policy, with such hosts existing in one or more network
environments.
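The two kinds of input data described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation. The function keyword_matrix builds an n×m matrix with one row per host and one column per unique keyword, with either presence values or frequency counts (both variants appear in the claims). The function similarity_matrix builds an n×n matrix using normalized compression distance computed with zlib; the disclosure says only that "compression techniques" may be used, so that particular measure is an assumption. All function names, host names, and the generated identifiers are hypothetical.

```python
import hashlib
import zlib


def keyword_matrix(inventories, binary=True):
    """Return (keywords, matrix): one row per host, one column per keyword."""
    hosts = sorted(inventories)
    # Keyword sequence: every distinct identifier seen on any host, fixed order.
    keywords = sorted({kw for inv in inventories.values() for kw in inv})
    matrix = []
    for host in hosts:
        inv = inventories[host]
        if binary:
            row = [1 if kw in inv else 0 for kw in keywords]
        else:
            row = [inv.count(kw) for kw in keywords]
        matrix.append(row)
    return keywords, matrix


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def similarity_matrix(inventories):
    """Return an n x n matrix of 1 - NCD between serialized inventories."""
    hosts = sorted(inventories)
    blobs = ["\n".join(sorted(inventories[h])).encode() for h in hosts]
    n = len(hosts)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Compute once per pair and mirror so the matrix is symmetric.
            sim[i][j] = sim[j][i] = 1.0 - ncd(blobs[i], blobs[j])
    return sim


def fake_ids(prefix, count):
    """Generate stand-in file identifiers (hex digests) for the demo."""
    return [hashlib.sha256(f"{prefix}{i}".encode()).hexdigest() for i in range(count)]


common = fake_ids("common", 40)
inventories = {
    "host_a": common + fake_ids("a", 2),
    "host_b": common + fake_ids("b", 2),
    "host_c": fake_ids("c", 42),
}
keywords, vectors = keyword_matrix(inventories)
similarity = similarity_matrix(inventories)
print(len(vectors), len(keywords))
print(similarity[0][1] > similarity[0][2])
```

Either matrix can then be handed to a clustering procedure: the vector matrix suits algorithms that work on points in Euclidean space, while the similarity matrix suits algorithms that need only pairwise comparisons.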
[0029] Management console 170 linked to central server 130 may
provide viewable cluster data for the IT administrators or other
authorized users. Administrative module 140 may also be
incorporated to allow IT administrators or other authorized users
to add the logical groupings from a cluster analysis to an
enterprise management system and to apply common policies to
selected groupings. In addition, deviant or exceptional groupings
or outliers can trigger various remedial actions (e.g., emails,
vulnerability scans, etc.). In addition, management console 170 may
also provide a user interface for the IT Administrator to select
particular hosts and/or particular types of executable files to be
included in the clustering procedure, in addition to other user
provided configuration data for the system. One exemplary
enterprise management system that could be used includes
McAfee® electronic Policy Orchestrator (ePO) software
manufactured by McAfee, Inc. of Santa Clara, Calif.
[0030] Turning to FIG. 2, FIG. 2 is a simplified block diagram of a
general or special purpose computer 200, such as hosts 110, central
server 130, or other computing devices connected to network
environment 100. Computer 200 may include various components such
as a processor 220, a main memory 230, a secondary storage 240, a
network interface 250, a user interface 260, and a removable memory
interface 270. A bus 210, such as a system bus, may provide
electronic communication between processor 220 and the other
components, memory, and interfaces of computer 200.
[0031] Processor 220, which may also be referred to as a central
processing unit (CPU), can include any general or special-purpose
processor capable of executing machine readable instructions and
performing operations on data as instructed by the machine readable
instructions. Main memory 230 may be directly accessible to
processor 220 for accessing machine instructions and can be in the
form of random access memory (RAM) or any type of dynamic storage
(e.g., dynamic random access memory (DRAM)). Secondary storage 240
can be any non-volatile memory such as a hard disk, which is
capable of storing electronic data including executable software
files. Externally stored electronic data may be provided to
computer 200 through removable memory interface 270. Removable
memory interface 270 may provide connection to any type of external
memory such as compact discs (CDs), digital video discs (DVDs),
flash drives, external hard drives, or any other external
media.
[0032] Network interface 250 can be any network interface
controller (NIC) that provides a suitable network connection
between computer 200 and any networks to which computer 200
connects for sending and receiving electronic data. For example,
network interface 250 could be an Ethernet adapter, a token ring
adapter, or a wireless adapter. A user interface 260 may be
provided to allow a user to interact with the computer 200 via any
suitable means, including a graphical user interface display. In
addition, any appropriate input mechanism may also be included such
as a keyboard, mouse, voice recognition, touch pad, input screen,
etc.
[0033] Not shown in FIG. 2 is additional hardware that may be
suitably coupled to processor 220 and bus 210 in the form of memory
management units (MMU), additional symmetric multiprocessing (SMP)
elements, read only memory (ROM), peripheral component interconnect
(PCI) bus and corresponding bridges, small computer system
interface (SCSI)/integrated drive electronics (IDE) elements, etc.
Any suitable operating systems will also be configured in computer
200 to appropriately manage the operation of hardware components
therein. These elements, shown and/or described with reference to
computer 200, are intended for illustrative purposes and are not
meant to imply architectural limitations of computers such as hosts
110 and central server 130, utilized in accordance with the present
disclosure. As used herein in this Specification, the term
`computer` is meant to encompass any personal computers, network
appliances, routers, switches, gateways, processors, servers, load
balancers, firewalls, or any other suitable device, component,
element, or object operable to affect or process electronic
information in a network environment.
[0034] Turning to FIG. 3, an example system flow 300 of a
keyword-based embodiment of the system and method for clustering
host inventories is illustrated. Flow may begin at step 310 where
host file inventories (I₁ through Iₙ, with n=number of
selected hosts) are generated for each of the selected hosts 110.
Each host file inventory can include a set of file identifiers,
with each file identifier representing a different executable file
in the set of executable files 112a, 112b, or 112c of the
corresponding selected host 110a, 110b, or 110c. Each file
identifier may include a sequence of one or more tokens associated
with the executable file represented by the file identifier. In one
embodiment, the user may be provided with the option of choosing
the number of tokens and the configuration of each token. For
example, a simple file identifier could include a single token
having a checksum configuration. A checksum can be a mathematical
value or hash sum (e.g., a fixed string of numerical digits)
derived by applying an algorithm to an executable file. If the
algorithm is applied to another executable file that is identical
to the first executable file, then the checksums should match.
However, if the other executable file is different in any way
(e.g., different type of software program, same software program
but different version, same software program that has been altered
in some way, etc.), then the checksums are very unlikely to match.
Thus, the same executable file stored in different hosts or stored
in different locations on disk of the same host should have
identical checksums.
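As an illustration of the checksum token described above, the following minimal sketch derives a checksum from file contents. The use of SHA-256 and the function name file_checksum are assumptions made for this example only; the disclosure does not name a particular algorithm.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Return a hex digest usable as a checksum token (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents standing in for executable files.
original = b"\x7fELF example program bytes"
same_copy = bytes(original)      # identical file stored elsewhere
altered = original + b"\x00"     # same program, altered in some way

print(file_checksum(original) == file_checksum(same_copy))  # True
print(file_checksum(original) == file_checksum(altered))    # False
```

Identical contents always yield identical checksums regardless of host or location on disk, while any alteration is very unlikely to produce the same value.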
[0035] In other examples, more complex file identifiers could be
selected to provide a higher level of distinctiveness of an
executable file. In one such example, a file identifier could
include a sequence of first and second tokens having a checksum
configuration and file path configuration, respectively, where the
file path indicates where the executable file is stored on disk in
the particular host in which it is installed. Thus, if identical
executable files X and Y are installed on host 110a and host 110b,
respectively, but are stored in different locations on disk, then
the first token of the file identifier generated for executable
file X on host 110a could be the same as the first token of the
file identifier generated for executable file Y on host 110b.
However, the second token of the file identifier generated for
executable file X on host 110a could be different from the second
token of the file identifier generated for executable file Y on
host 110b.
[0036] Numerous other file identifiers may be configured by using
any number of tokens and configuring the tokens to include any
combination of available program file attributes, checksums, and/or
file paths. Program file attributes may include, for example,
creation date, modification date, security settings, vendor name,
and the like. Although file identifiers may be configured with any
number of such tokens, an executable file without a particular
program file attribute, which is selected as one of the tokens, may
have a file identifier with only the tokens available to that
executable file. For example, if the file identifier is configured
to include a first token (e.g., a checksum) and a second token
(e.g., a vendor name), then an executable file without an embedded
vendor name would have a file identifier with only a first token
corresponding to the file checksum. In contrast, an executable file
having an embedded vendor name would have a file identifier with
both first and second tokens corresponding to the file checksum and
vendor name, respectively.
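The handling of unavailable tokens described above could be sketched as follows. The dictionary layout, the attribute names, and the function make_file_identifier are hypothetical and serve only to illustrate the idea of a token sequence that omits attributes a file does not have.

```python
def make_file_identifier(attributes: dict, token_config: list) -> tuple:
    """Build a token sequence, keeping only the tokens available for this file."""
    return tuple(
        attributes[name]
        for name in token_config
        if attributes.get(name) is not None
    )


# Hypothetical configuration: first token is a checksum, second a vendor name.
config = ["checksum", "vendor"]
with_vendor = {"checksum": "ab12", "vendor": "XYZ"}
without_vendor = {"checksum": "cd34", "vendor": None}

print(make_file_identifier(with_vendor, config))     # ('ab12', 'XYZ')
print(make_file_identifier(without_vendor, config))  # ('cd34',)
```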
[0037] The file identifiers and resulting host file inventories
I.sub.1 through I.sub.n may be provided by various implementations.
In one embodiment, the file identifiers and resulting host file
inventories may be generated by host inventory feeds 114 for each
host 110 and pushed to central server 130. For embodiments in which
a user configures the file identifier by selecting a number of
tokens for the token sequence and by selecting individual token
configurations, central server 130 may provide the user selected
configuration criteria to each host 110. Host inventory feeds 114
may then generate file identifiers with token sequences having the
particular user-selected configuration. In another embodiment,
checksums for each executable file may be generated on hosts 110 by
host inventory feeds 114 and then pushed to central server 130
along with other file attributes and file paths such that host
inventory preparation module 150 of central server 130 can generate
the file identifiers and resulting host file inventories for each
of the selected hosts 110. In one embodiment, enumeration of
executable files from the sets of executable files 112 of selected
hosts 110 can be achieved by existing security technology such as,
for example, Policy Auditor software or Application Control
software, both manufactured by McAfee, Inc. of Santa Clara,
Calif.
[0038] Referring again to FIG. 3, after file identifiers and host
file inventories have been determined for all of the selected hosts
110 in step 310, flow then moves to step 320 where a keyword method
is used to transform host file inventories I.sub.1 through I.sub.n
into a vector matrix, which will be further described herein with
reference to FIGS. 4 and 5. Once a vector matrix is created, flow
moves to step 330 where a vector-based clustering analysis is
performed on the vector matrix. Exemplary types of clustering
analysis that may be performed on the vector matrix include
agglomerative hierarchical clustering and partitional clustering.
The results of such clustering techniques may be stored in a memory
element of central server 130 (e.g., secondary storage 240 of computer
200), or may be stored in a database or other memory element
external to central server 130.
[0039] After vector-based clustering has been performed on the
vector matrix in step 330, flow moves to step 340 where one or more
reports can be generated indicating the clustered groupings
determined during the clustering analysis and can be provided to
authorized users by various methods (e.g., screen displays,
electronic files, hard copies, emails, etc.). Exemplary reports may
include a textual report and/or a visual representation (e.g., a
proximity plot, a dendrogram, heat maps of a permuted keyword
matrix, heat maps of a reduced keyword matrix where rows and
columns have been merged to illustrate clusters, other cluster
plots, etc.) enabling the user to view logical groupings of the
selected hosts. For example, after the clustering analysis has been
performed, a graphical user interface of management console 170 may
display a proximity plot having physical representations of each
host, with identifiable logical groupings (e.g., uniquely colored
groupings, circled or otherwise enclosed groupings, representations
of groupings with connected lines, etc.). Once the similar
groupings and outlier hosts have been identified, an IT
Administrator or other authorized user can apply common policies to
hosts within the logical groupings and remedial action may be taken
on any identified outlier hosts. For example, outlier hosts may be
remediated to a standard software configuration as defined by the
IT Administrators.
[0040] Turning to FIG. 4, FIG. 4 illustrates a matrix format 400
used when generating a vector matrix in one embodiment of the
system and method of clustering host inventories. In this
embodiment, host file inventories are transformed into a vector
matrix in Euclidean space using a keyword method. The following
variables may be identified when generating a vector matrix:
[0041] n=number of selected hosts
[0042] m=number of keywords
[0043] H.sub.i=host in network, with i=1 to n (e.g., H.sub.1=host
110a, H.sub.2=host 110b, H.sub.3=host 110c)
[0044] K.sub.j=unique keyword, with j=1 to m
[0045] I.sub.i=host file inventory on H.sub.i, with i=1 to n
[0046] The number of keywords associated with an executable file
equals the number of tokens in the token sequence of the file
identifier representing the executable file. Therefore, one or more
keywords can be associated with each executable file in sets of
executable files 112 of selected hosts 110. In addition, each
keyword could be associated with multiple executable files in the
same or different hosts. Thus, a keyword sequence km may be defined
as a sequence of unique keywords K.sub.1 through K.sub.m, where
each keyword is associated with one or more executable files in
sets of executable files 112 of all selected hosts 110.
[0047] Vector matrix format 400 includes n rows 460 and m columns
470, with n and m defining the dimensions of the resulting n-by-m
(i.e., n.times.m) vector matrix. Each row of vector matrix format
400 is denoted by a unique host H.sub.i (i=1 to n), and each column
is denoted by a unique keyword K.sub.j (j=1 to m) of keyword
sequence km. Each entry 480 is denoted by a variable with
subscripts i and j (i.e., a.sub.i,j) where i and j correspond to
the respective row and column where the entry is located. For
example, entry a.sub.2,1 is found in row 2, column 1 of vector
matrix format 400. Each row of entries represents a row vector 410,
420, and 430 for its corresponding host H.sub.1, H.sub.2 and
H.sub.n. For example, a.sub.1,1, a.sub.1,2, through a.sub.1,m
define row vector 410 for host H.sub.1. Once each of the entries
480 has been filled with a determined value, row vectors 410, 420,
through 430 can be provided as input data to a vector-based
clustering algorithm to create a cluster graph or plot showing
logical groupings of hosts H.sub.1 through H.sub.n having similar
inventories of executable files and any host outliers having
dissimilar host inventories.
[0048] Turning to FIG. 5, FIG. 5 illustrates a flow 500 using a
keyword method to transform host file inventories into a list of
vectors in Euclidean space, represented by vector matrix format
400. Flow 500 corresponds to step 320 in flow 300 of FIG. 3 and may
be implemented, at least in part, by host inventory preparation
module 150 of central server 130 shown in FIG. 1. Flow may begin at
step 510 to determine keyword sequence km, which is a sequence of m
unique keywords (km=K.sub.1, K.sub.2, . . . K.sub.m) and is a basis
for m-dimensional keyword space. In one embodiment, to determine km
the file identifiers of all host file inventories of selected hosts
110 can be evaluated to find each unique keyword. In one example,
each of the file identifiers of the host file inventories (I.sub.1
through I.sub.n) includes a sequence of first and second tokens
with the first token having a file checksum configuration and the
second token having a file path configuration. If the same version
of Microsoft.RTM. Word software is installed on each of the
selected hosts 110a, 110b, and 110c in the same location on disk,
then each file identifier representing the Microsoft.RTM. Word
software in each of the host file inventories includes a first
token containing a checksum of the Microsoft.RTM. Word software and
a second token containing a file path of the software. In this
example, keyword sequence km could include a first keyword
(K.sub.1) containing the checksum, which is the same on each of the
selected hosts 110a, 110b, and 110c, and a second keyword (K.sub.2)
containing the file path, which is the same on each of the selected
hosts 110a, 110b, and 110c.
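One possible way to determine keyword sequence km in step 510 is sketched below. It assumes each host file inventory is held as a list of token tuples; the checksum string and the file path shown are placeholders invented for this example, and the function name keyword_sequence is hypothetical.

```python
def keyword_sequence(inventories):
    """Collect each unique keyword (token) once, in first-seen order."""
    seen = {}
    for inventory in inventories:
        for identifier in inventory:
            for token in identifier:
                seen.setdefault(token, None)
    return list(seen)


# Placeholder checksum and path for the same software on three hosts.
word = ("c0ffee", r"C:\Program Files\Office\WINWORD.EXE")
inventories = [[word], [word], [word]]

km = keyword_sequence(inventories)
print(len(km))  # 2
```

Because the checksum and file path are the same on all three hosts, the three file identifiers contribute only two unique keywords, K.sub.1 and K.sub.2.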
[0049] Once keyword sequence km has been determined, the algorithm
of flow 500 computes a list of position vectors for an n.times.m
vector matrix. Variables `i` and `j` are used to construct the
vector matrix having n.times.m vector matrix format 400, in
m-dimensional keyword space, for each host H.sub.i by iterating
over j through km and producing appropriate values for the position
vectors indicating whether each host file inventory I.sub.i
contains each keyword K.sub.j.
[0050] The iterative flow to find keywords of keyword sequence km
in file identifiers of host inventories is illustrated in steps 520
through 575 of FIG. 5. In step 520, variable i is set to 1 and
steps 530, 570, and 575 form an outer loop iterating through the
hosts. Variable j is set to 1 in step 530 and steps 540 through 565
form an inner loop iterating over j through km. After variables i
and j are set to 1, flow moves to step 540 where keyword K.sub.j is
retrieved. Flow moves to decision box 545 where a query is made as
to whether keyword K.sub.j is found in host file inventory I.sub.i
of host H.sub.i. Thus, host file inventory I.sub.i of host H.sub.i
is searched for a file identifier containing keyword K.sub.j. If
keyword K.sub.j is not found in host file inventory I.sub.i then
flow moves to step 550 where row i, column j (i.e., a.sub.i,j) in
the vector matrix may be updated with an appropriate value
indicating keyword K.sub.j was not found in host file inventory
I.sub.i. However, if in step 545, keyword K.sub.j is found in host
file inventory I.sub.i, then flow moves to step 555 where row i,
column j (i.e., a.sub.i,j) in the vector matrix may be updated with
an appropriate value indicating keyword K.sub.j was found in host
file inventory I.sub.i.
[0051] The values of entries a.sub.i,j in the vector matrix, which
indicate whether keyword K.sub.j is found in a host file inventory
I.sub.i, may vary depending upon the particular implementation of
the system. In one embodiment, an entry a.sub.i,j is assigned a `1`
value in step 555, indicating keyword K.sub.j was found in host
file inventory I.sub.i, or a `0` value in step 550, indicating
keyword K.sub.j was not found in host file inventory I.sub.i. Thus,
in this embodiment, the vector matrix contains only `0` and/or `1`
values. In another embodiment, entry a.sub.i,j is assigned a value
in step 555 or 550 corresponding to a frequency of occurrence of
keyword K.sub.j in host file inventory I.sub.i. For example, assume
file identifiers in a host file inventory I.sub.1 include a first
token configured as a checksum and a second token configured as a
vendor name, with three executable files on host H.sub.1 having the
same embedded vendor name, XYZ, resulting in keyword K.sub.2 of
keyword sequence km being assigned the embedded vendor name XYZ. In
this embodiment, when host file inventory I.sub.1 of host H.sub.1
is searched for keyword K.sub.2, entry a.sub.1,2 could be updated
with a value of 3 because of the three occurrences of vendor name
XYZ in file identifiers of host file inventory I.sub.1. Thus, in
this embodiment, the vector matrix may contain `0` values and/or
positive integer values.
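The two entry-value embodiments described above can be contrasted in a short sketch. The inventory contents and the function entry_value are hypothetical; the frequency is taken here to be the number of file identifiers containing the keyword, which is one reasonable reading of the text.

```python
def entry_value(inventory, keyword, use_frequency=False):
    """Return the matrix entry for one host inventory and one keyword."""
    # Count file identifiers (token tuples) that contain the keyword.
    count = sum(1 for identifier in inventory if keyword in identifier)
    if use_frequency:
        return count
    return 1 if count else 0


# Hypothetical inventory I1: three files share the embedded vendor name XYZ.
inventory_1 = [("c1", "XYZ"), ("c2", "XYZ"), ("c3", "XYZ"), ("c4", "ABC")]

print(entry_value(inventory_1, "XYZ"))                      # 1
print(entry_value(inventory_1, "XYZ", use_frequency=True))  # 3
print(entry_value(inventory_1, "QRS", use_frequency=True))  # 0
```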
[0052] After row i, column j is filled with an appropriate value in
step 555 or 550, flow moves to decision box 560 where a query is
made as to whether j<m. If j<m, then host file inventory
I.sub.i of host H.sub.i has not been checked for all of the
keywords in km. Therefore, flow moves to step 565 where j is set to
j+1, and flow loops back to step 540 to get the next keyword
K.sub.j (with j=j+1) in km and search for K.sub.j in host file
inventory I.sub.i. If, however, in decision box 560 it is
determined that j is not less than m (i.e., j.gtoreq.m), then host
file inventory I.sub.i has been searched for all of the keywords
K.sub.1 through K.sub.m in keyword sequence km, so flow moves to
decision box 570, which is part of the outer loop of flow 500. A
query is made in decision box 570 to determine whether i<n, and
if i<n, then not all of the hosts have been evaluated to
generate corresponding keyword vectors. Therefore, flow moves to
step 575 where i is set to i+1, and flow loops back to step 530. In
step 530, j is set to 1 again, so that a vector for the next host
H.sub.i (with i=i+1) can be generated by inner loop steps 540
through 565. With reference again to decision box 570, if i is not
less than n (i.e., i.gtoreq.n), then all of the hosts H.sub.1
through H.sub.n have been evaluated such that all of the vectors
have been created in n.times.m vector matrix and, therefore, the
flow ends.
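The outer and inner loops of flow 500 can be sketched as below, with comments keyed to the step numbers of FIG. 5. This is an illustrative rendering of the binary embodiment only; it assumes at least one host and one keyword, represents each inventory as a list of token tuples, and uses the hypothetical function name build_vector_matrix.

```python
def build_vector_matrix(inventories, km):
    """Fill an n-by-m matrix of 0/1 entries following steps 520 through 575."""
    n, m = len(inventories), len(km)
    matrix = [[0] * m for _ in range(n)]
    i = 1                                            # step 520
    while True:
        j = 1                                        # step 530
        while True:
            keyword = km[j - 1]                      # step 540
            found = any(keyword in identifier        # decision box 545
                        for identifier in inventories[i - 1])
            matrix[i - 1][j - 1] = 1 if found else 0  # step 555 or 550
            if j < m:                                # decision box 560
                j += 1                               # step 565
            else:
                break
        if i < n:                                    # decision box 570
            i += 1                                   # step 575
        else:
            break
    return matrix


# Hypothetical inventories for two hosts, one token per file identifier.
inventories = [[("a",), ("b",)], [("b",), ("c",)]]
print(build_vector_matrix(inventories, ["a", "b", "c"]))
```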
[0053] The embodiment of the flow 500 shown in FIG. 5 creates a
vector in keyword space successively for each host H.sub.1 through
H.sub.n. Other embodiments, however, could be configured to switch
the inner and outer loops in flow 500 such that for each keyword
K.sub.j, a column of vector matrix entries is produced by iterating
over i through hosts H.sub.i and producing an appropriate value
when keyword K.sub.j is found in host file inventory I.sub.i of
host H.sub.i and an appropriate value when K.sub.j is not found in
host file inventory I.sub.i of host H.sub.i. This processing could
be repeated until all columns are filled, thereby generating the
list of vectors in rows 1 through n.
[0054] The clustering analysis performed on the resulting vector
matrix may include commonly available clustering techniques such as
agglomerative hierarchical clustering or partitional clustering. In
agglomerative hierarchical clustering, each element begins as a
separate cluster and elements are merged into successively larger
clusters, which may be represented in a tree structure called a
dendrogram. A root of the tree represents a single cluster of all
of the elements and the leaves of the tree represent separated
clusters of the elements. Generally, merging schemes in
agglomerative hierarchical clustering used to achieve logical
groupings may include schemes well-known in the art such as
single-link (i.e., the distance between clusters is equal to the
shortest distance from any member of one cluster to any member of
another cluster), complete-link (i.e., the distance between
clusters is equal to the greatest distance from any member of one
cluster to any member of another cluster), group-average (i.e., the
distance between clusters is equal to the average distance from any
member of one cluster to any member of another cluster), and
centroid (i.e., the distance between clusters is equal to the
distance from the center of any one cluster to the center of
another cluster).
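The four merging schemes named above differ only in how the distance between two clusters is computed. The sketch below expresses each as a small function over clusters of points in Euclidean space; the function names and the sample points are chosen for this example only.

```python
import math


def single_link(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Greatest distance between any member of a and any member of b."""
    return max(math.dist(p, q) for p in a for q in b)


def group_average(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the centers of the two clusters."""
    center_a = [sum(col) / len(a) for col in zip(*a)]
    center_b = [sum(col) / len(b) for col in zip(*b)]
    return math.dist(center_a, center_b)


cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]
print(single_link(cluster_a, cluster_b))        # 3.0
print(centroid_distance(cluster_a, cluster_b))  # 3.0
```

For these sample clusters the complete-link distance is the square root of 13, and the group-average distance lies between the single-link and complete-link values.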
[0055] Known techniques may be implemented in which predetermined
similarity criteria set the point at which clustering is halted
(e.g., cut point determination). Cut point determination may be
made, for example, at a specified level of similarity or when
consecutive similarities are the greatest, which is known in the
art. As an example, a tree structure representing clusters could be
cut at a predetermined height resulting in more or fewer clusters
depending on the selected height at which the cut is made. Cut
point determinations may be based on a particular
network environment or particular hosts being clustered. In one
example embodiment, an IT administrator or other authorized user
could define the cut point determination used by the clustering
procedure by determining a desired threshold for similarity based
on the particular network environment.
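The effect of the cut point can be illustrated with a minimal single-link agglomerative sketch that stops merging once the closest pair of clusters is farther apart than a chosen threshold. The threshold parameter cut, the function name, and the sample points are assumptions for illustration; a production system would more likely rely on an existing clustering package.

```python
import math


def single_link_clusters(points, cut):
    """Merge the closest clusters until the smallest distance exceeds cut."""
    clusters = [[i] for i in range(len(points))]

    def distance(a, b):
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > 1:
        d, x, y = min(
            (distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cut:          # cut point reached: stop merging
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


points = [(0,), (1,), (10,), (11,), (50,)]
print(single_link_clusters(points, cut=2))   # [[0, 1], [2, 3], [4]]
print(single_link_clusters(points, cut=20))  # [[0, 1, 2, 3], [4]]
```

A lower cut yields more clusters and a higher cut yields fewer, which is the trade-off an IT administrator would tune for a particular network environment.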
[0056] In other embodiments, partitional clustering may be used.
Partitional clustering typically involves an algorithm that
determines all clusters at one time. In partitional clustering,
predetermined similarity criteria may provide, for example, a
selected number of clusters to be generated or a maximum diameter
for the clusters. One exemplary software package that implements
these various clustering techniques is CLUTO Software for
Clustering High-Dimensional Datasets developed by George Karypis,
Professor at the Department of Computer Science & Engineering,
University of Minnesota, Minneapolis and Saint Paul, Minn., which
may be found on the World Wide Web at
http://glaros.dtc.umn.edu/gkhome/view/cluto.
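As one illustration of partitional clustering with a selected number of clusters, the sketch below uses a basic k-means procedure. K-means is substituted here purely as a familiar example of a partitional method; the disclosure does not state that k-means is used, and this sketch does not reflect how CLUTO is implemented. The initialization from the first k points is a simplification.

```python
import math


def k_means(points, k, max_iter=100):
    """Assign each point to one of k clusters; returns a list of labels."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic initialization
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:   # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:            # keep the old centroid if a cluster empties
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(k_means(points, k=2))  # [0, 0, 1, 1]
```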
[0057] Turning to FIGS. 6, 7, and 8, an example selected plurality
of hosts 600 with executable files, a vector matrix 700 generated
using the executable files of selected hosts 600, and an example
resulting cluster plot 800 are illustrated, respectively. In FIG.
6, host 1 (H.sub.1) is shown with a set of executable files 601
including executable files 610, 620, 630, and 640. Host 2 (H.sub.2)
is shown with a set of executable files 602 including executable
files 610, 650, 660, and 670. Host 3 (H.sub.3) is shown with a set
of executable files 603 including executable files 610, 620, 630,
and 680. Host 4 (H.sub.4) is shown with a set of executable files
604 including executable files 610, 650, and 660. Host 5 (H.sub.5)
is shown with a set of executable files 605 including executable
files 640, 670, and 680.
[0058] FIG. 7 illustrates the resulting vector matrix 700 after the
keyword method of flow 500 has been applied to the sets of
executable files 601 through 605 of the selected plurality of hosts
600 of FIG. 6. Vector matrix 700 shows hosts 1 through 5 (H.sub.1
through H.sub.5) corresponding to rows 760 containing keyword
vectors 710, 720, 730, 740, and 750, respectively. Columns 770 of
vector matrix 700 are designated by keywords K.sub.1 through
K.sub.8. In this example vector matrix 700, entries 780 include a
`1` indicating that keyword K.sub.j is contained in a file
identifier of host file inventory I.sub.i, or a `0` indicating that
keyword K.sub.j is not contained in a file identifier of host file
inventory I.sub.i.
[0059] In the example scenario of applying the keyword method of
flow 500 to the sets of executable files 601 through 605 of
selected plurality of hosts 600 in order to create vector matrix
700, the following variables may be identified:
[0060] n=5 (host computers)
[0061] H.sub.1 through H.sub.5=hosts in network (e.g., H.sub.1=host
1, H.sub.2=host 2, etc.)
[0062] I.sub.1 through I.sub.5=host file inventories representing
sets of executable files 601 through 605, respectively
[0063] m=8 (keywords)
[0064] K.sub.1 through K.sub.8=unique keywords
Each of the host file inventories I.sub.1 through I.sub.5 includes
a set of file identifiers representing one of the sets of
executable files 601, 602, 603, 604, and 605, respectively. Each
executable file in a set of executable files is represented by a
separate file identifier in the particular host file inventory. In
this exemplary scenario, file identifiers each include a first
token having a checksum configuration. Unique keywords are
determined among all sets of executable files of selected hosts
H.sub.1 through H.sub.5. Thus, 8 unique keywords may be determined
for selected hosts 600:
[0065] K.sub.1=checksum for executable file 610
[0066] K.sub.2=checksum for executable file 620
[0067] K.sub.3=checksum for executable file 630
[0068] K.sub.4=checksum for executable file 640
[0069] K.sub.5=checksum for executable file 650
[0070] K.sub.6=checksum for executable file 660
[0071] K.sub.7=checksum for executable file 670
[0072] K.sub.8=checksum for executable file 680
A keyword sequence km can then be created in step 510 with the 8
unique keywords:
[0073] km=K.sub.1K.sub.2K.sub.3K.sub.4K.sub.5K.sub.6K.sub.7K.sub.8
Thus, in this example scenario, the following host file inventories
I.sub.1 through I.sub.5 could include file identifiers having first
tokens equivalent to the following keywords:
[0074] I.sub.1.fwdarw.K.sub.1, K.sub.2, K.sub.3, K.sub.4
[0075] I.sub.2.fwdarw.K.sub.1, K.sub.5, K.sub.6, K.sub.7
[0076] I.sub.3.fwdarw.K.sub.1, K.sub.2, K.sub.3, K.sub.8
[0077] I.sub.4.fwdarw.K.sub.1, K.sub.5, K.sub.6
[0078] I.sub.5.fwdarw.K.sub.4, K.sub.7, K.sub.8
[0079] Once keyword sequence km is determined, flow moves to step
520 where variable i is set to 1 and then the iterative flow begins
to create n.times.m (5.times.8) vector matrix 700 shown in FIG. 7.
In step 530, variable j is set to 1 and keyword K.sub.j (K.sub.1)
is retrieved from km in step 540. Flow moves to decision box 545
where host file inventory I.sub.i (I.sub.1) of host H.sub.i
(H.sub.1) is searched for keyword K.sub.j (K.sub.1). In this
example, keyword K.sub.1 is found in host file inventory I.sub.1 of
host H.sub.1, so flow moves to step 555 where a `1` entry is added
to row i, column j (row 1, column 1) of vector matrix 700. After
vector matrix 700 has been updated flow moves to decision box 560
where a query is made as to whether j<m. Since 1 is less than 8,
the flow moves to step 565 where j is set to 2 (i.e., j=j+1). Flow
then loops back to step 540 to search for the next keyword K.sub.j
(K.sub.2) in host file inventory I.sub.i (I.sub.1) of host H.sub.i
(H.sub.1). In this case, keyword K.sub.2 is found in host file
inventory I.sub.1 so a `1` entry is added to row i, column j (row
1, column 2) of vector matrix 700. The variable j is still less
than 8, (i.e., 2<8) as determined in decision box 560, so flow
moves to step 565 and j is set to 3 (i.e., j=j+1). This iterative
processing continues for each value of j until j=8, thereby filling
in each entry 780 of keyword vector 710 for host H.sub.i
(H.sub.1).
[0080] After the last entry 780 of keyword vector 710 has been
added to vector matrix 700, flow moves to decision box 560 where a
query is made as to whether j<m (i.e., Is 8<8?). Because j is
not less than 8, flow moves to decision box 570 where a query is
made as to whether i<n (i.e., Is 1<5?). Because 1 is less
than 5, flow moves to step 575 where i is set to 2 (i.e., i=i+1)
and flow loops back to step 530 where j is set to 1. The inner
iterative loop then begins in step 540 to search for all keywords
in host file inventory I.sub.i (I.sub.2) of host H.sub.i (H.sub.2)
beginning with keyword K.sub.j (K.sub.1). Thus, in the embodiment
used in this example scenario, rows 760 are successively filled
with a `1` or a `0` value for each entry a.sub.i,j until each
vector row 710 through 750 has been completed. As previously
discussed herein, however, another embodiment provides that each
entry a.sub.i,j in rows 760 could be filled with a value
corresponding to the frequency of occurrence of keyword K.sub.j
found in host file inventory I.sub.i.
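The walk-through above can be reproduced with a few lines that build the 5.times.8 matrix directly from the example inventories. In this sketch the reference numerals of the executable files (610 through 680) stand in for their checksums, which is an assumption made only to keep the example readable.

```python
# Executable files on each host, per FIG. 6; numerals stand in for checksums.
host_files = {
    "H1": [610, 620, 630, 640],
    "H2": [610, 650, 660, 670],
    "H3": [610, 620, 630, 680],
    "H4": [610, 650, 660],
    "H5": [640, 670, 680],
}

# Keyword sequence km = K1..K8, one keyword per distinct executable file.
km = [610, 620, 630, 640, 650, 660, 670, 680]

# Entry a[i][j] is 1 when keyword K_j appears in inventory I_i, else 0.
matrix_700 = [
    [1 if keyword in files else 0 for keyword in km]
    for files in host_files.values()
]

for host, row in zip(host_files, matrix_700):
    print(host, row)
```

The first row printed is H1 [1, 1, 1, 1, 0, 0, 0, 0], matching inventory I.sub.1 containing keywords K.sub.1 through K.sub.4.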
[0081] Vector matrix 700 can be provided as input data to a
vector-based clustering procedure, as previously described herein.
Information generated from the clustering procedure could be
provided in numerous ways such as, for example, reports, screen
displays, files, emails, etc. In one example, the information could
be provided in a proximity plot such as example proximity plot 800
illustrated in FIG. 8. Proximity plot 800 is an example graph that
could be created by a vector-based clustering procedure applied to
vector matrix 700 of FIG. 7. If agglomerative hierarchical
clustering is used, clusters 810, 820, and 830 may be determined
based on a cut point determination. If partitional clustering is
used, clusters 810, 820, and 830 may be generated based on a
predetermined number of clusters. Proximity plot 800 shows two
clusters and one outlier. Cluster 810 represents hosts H.sub.1 and
H.sub.3 and cluster 820 represents hosts H.sub.2 and H.sub.4. Hosts
H.sub.1 and H.sub.3 may be clustered together because they have
three common executable files 610, 620, and 630. Hosts H.sub.2 and
H.sub.4 may be clustered together because they also have three
common executable files 610, 650, and 660. Outlier 830 represents
host H.sub.5, which may be indicated as an outlier, because it has,
in this example, none or only one common executable file with each
of the other hosts. Although the clustering information is
displayed on proximity plot 800 shown in FIG. 8, other textual
reports and/or visual representations, as previously described
herein with reference to FIG. 3, may be used to show clusters 810
and 820 and outlier 830.
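The reasoning behind the groupings in proximity plot 800 can be checked by counting the executable files shared by each pair of hosts. The sketch below again uses the reference numerals of FIG. 6 as stand-ins for file identifiers; it illustrates the overlap counts only and is not itself a clustering procedure.

```python
from itertools import combinations

# Executable files on each host, per FIG. 6.
host_files = {
    "H1": {610, 620, 630, 640},
    "H2": {610, 650, 660, 670},
    "H3": {610, 620, 630, 680},
    "H4": {610, 650, 660},
    "H5": {640, 670, 680},
}

# Number of common executable files for every pair of hosts.
shared = {
    (a, b): len(host_files[a] & host_files[b])
    for a, b in combinations(host_files, 2)
}

print(shared[("H1", "H3")])  # 3
print(shared[("H2", "H4")])  # 3
print(max(count for pair, count in shared.items() if "H5" in pair))  # 1
```

Hosts H.sub.1 and H.sub.3 share three files, as do hosts H.sub.2 and H.sub.4, while host H.sub.5 shares at most one file with any other host.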
[0082] Turning to FIG. 9, FIG. 9 illustrates an example system flow
900 of a compression-based embodiment of a system and method for
clustering host inventories. Flow may begin at step 910 where file
identifiers and host file inventories (I.sub.1 through I.sub.n,
with n=number of selected hosts) may be generated for each of the
selected hosts 110, as previously described herein with reference
to FIG. 3.
[0083] After file identifiers and host file inventories have been
determined for each of the selected hosts 110, flow then moves to
step 920 where a compression technique may be used to transform
host file inventories into a similarity matrix, which will be
further described herein with reference to FIGS. 10 and 11. Once a
similarity matrix is created, flow moves to step 930 where a
similarity-based clustering analysis can be performed on the
similarity matrix. The similarity-based clustering analysis
performed on the similarity matrix may include, for example,
agglomerative hierarchical clustering or partitional clustering.
The results of such clustering techniques may be stored in a memory
element of central server 130 (e.g., secondary storage 240 of computer
200), or may be stored in a database or other memory element
external to central server 130.
[0084] After similarity-based clustering has been performed on the
similarity matrix in step 930, flow moves to step 940 where one or
more reports can be generated indicating the clustered groupings
determined during the clustering analysis, as previously described
herein with reference to FIG. 3. Such reports for similarity-based
clustering may include a textual report and/or a visual
representation (e.g., a proximity plot, a dendrogram, heat maps of
a similarity matrix where rows and columns have been merged to
illustrate clusters, other cluster plots, etc.) enabling the user
to view logical groupings of the selected hosts. Once the similar
groupings and outlier hosts have been identified, an IT
Administrator or other authorized user can apply common policies to
computers within the logical groupings and remedial action may be
taken on any identified outlier computers. For example, outlier
computers may be remediated to a standard software configuration as
defined by the IT Administrators.
[0085] Turning to FIG. 10, FIG. 10 illustrates a matrix format 1000
used when generating a similarity matrix in one embodiment of the
system and method of clustering host inventories. The similarity
matrix is generated by applying a compression method to a plurality
of host file inventories, each of which includes a set of file
identifiers. As an example, each of the sets of file identifiers
may represent one of the sets of executable files 112a, 112b, or
112c on the corresponding selected host 110a, 110b, or 110c. In
addition, the following variables may be identified when generating
a similarity matrix:
[0086] n=number of selected hosts
[0087] H.sub.i=host in network, with i=1 to n (e.g., H.sub.1=host
110a, H.sub.2=host 110b, H.sub.3=host 110c)
[0088] H.sub.j=host in network, with j=1 to n (e.g., H.sub.1=host
110a, H.sub.2=host 110b, H.sub.3=host 110c)
[0089] I.sub.i=host file inventory of H.sub.i
[0090] I.sub.j=host file inventory of H.sub.j
[0091] Similarity matrix format 1000 includes n rows 1060 and n
columns 1070, with `n` defining the number of dimensions of the
resulting n-by-n (i.e., n.times.n) similarity matrix. Each row of
similarity matrix format 1000 is denoted by host H.sub.i (i=1 to
n), and each column is denoted by host H.sub.j (j=1 to n). Each
entry 1080 is denoted by a variable with subscripts i and j (i.e.,
a.sub.i,j) where i and j correspond to the respective row and
column where the entry is located. For example, entry a.sub.2,1 is
found in row 2, column 1 of similarity matrix format 1000.
[0092] When a similarity matrix is created in accordance with one
embodiment of this disclosure, each entry a.sub.i,j has a numerical
value representing the similarity distance between host H.sub.i and
host H.sub.j with 1 representing the highest degree of similarity.
In one embodiment, the similarity distances represented by entries
a.sub.1,1 through a.sub.n,n can include any numerical value from 0
to 1, inclusively (i.e., 0.ltoreq.a.sub.i,j.ltoreq.1). In this
embodiment, the closer a.sub.i,j is to 1, the greater the
similarity is between host file inventories I.sub.i and I.sub.j of
hosts H.sub.i and H.sub.j, and the closer a.sub.i,j is to zero, the
greater the difference is between host file inventories I.sub.i and
I.sub.j of hosts H.sub.i and H.sub.j. Thus, a value of 1 in
a.sub.i,j may indicate hosts H.sub.i and H.sub.j have identical
host file inventories and therefore, identical sets of executable
files, whereas a value of zero in a.sub.i,j may indicate hosts
H.sub.i and H.sub.j have no common file identifiers in their
respective host file inventories and therefore, no common
executable files in their respective sets of executable files. Once
each of the entries 1080 has been filled with a calculated value,
the resulting similarity matrix can be provided as input data into
a similarity-based clustering algorithm to create a cluster graph
or plot showing logical groupings of hosts H.sub.1 through H.sub.n
having similar sets of executable files and outlier hosts having
dissimilar sets of executable files. The clustering analysis
performed on the resulting similarity matrix may include commonly
available clustering techniques such as agglomerative hierarchical
clustering or partitional clustering, as previously described
herein with reference to clustering analysis of a vector
matrix.
[0093] Turning to FIG. 11, FIG. 11 illustrates a flow 1100 using a
compression method to transform host file inventories I.sub.1
through I.sub.n of hosts H.sub.1 through H.sub.n, respectively,
into a similarity matrix. Flow 1100 corresponds to step 920 of FIG.
9 and may be implemented, at least in part, by host inventory
preparation module 150 of central server 130, shown in FIG. 1. When
flow 1100 begins, i is set to 1 in step 1110 and j is set to 1 in
step 1115. Variables `i` and `j` are used to construct the
n.times.n similarity matrix for the selected plurality of hosts
being clustered. Steps 1115, 1175, and 1180 form an outer loop
iterating through the rows of hosts and steps 1120 through 1170
form an inner loop iterating through the columns of hosts.
[0094] In step 1120, a list of file identifiers (e.g., checksums,
checksums combined with a file path, checksums combined with one or
more file attributes, etc.) representing a set of executable files
on host H.sub.i is extracted from host file inventory I.sub.i and
put in a file F.sub.i. In step 1125, a list of file identifiers
representing a set of executable files on host H.sub.j is
extracted from host file inventory I.sub.j and put in a file
F.sub.j. In step 1130, files F.sub.i and F.sub.j are concatenated
and put in file F.sub.ij. It will be apparent that the use of files
F.sub.i, F.sub.j, and F.sub.ij to store file identifiers is an
example implementation of the system, and that memory buffers or
any other suitable representation allowing concatenation,
compression, and length determination of data may also be used.
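As a minimal sketch of the memory-buffer alternative, steps 1120 through 1130 might be expressed as shown below. The helper name identifiers_to_buffer, the newline delimiter, and the sample checksum strings are assumptions made for illustration and do not appear in the flow itself. Sorting is included because some embodiments sort the identifiers before compression.

```python
def identifiers_to_buffer(identifiers):
    """Serialize a list of file identifiers into one bytes buffer.

    The buffer plays the role of file F_i (or F_j). Identifiers are sorted
    and joined with newlines; both choices are illustrative assumptions.
    """
    return "\n".join(sorted(identifiers)).encode("utf-8")


# Hypothetical, shortened checksums standing in for real file identifiers.
inventory_i = ["c3d4", "a1b2", "e5f6"]
inventory_j = ["a1b2", "9f8e"]

f_i = identifiers_to_buffer(inventory_i)   # step 1120
f_j = identifiers_to_buffer(inventory_j)   # step 1125
f_ij = f_i + b"\n" + f_j                   # step 1130: concatenation
```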
[0095] After files F.sub.i, F.sub.j, and F.sub.ij are prepared,
compression is applied to each of the files. A compression utility
such as, for example, gzip, bzip, bzip2, zlib, or zip may be used
to compress files F.sub.i, F.sub.j, and
F.sub.ij. Also, in some embodiments, the list of file identifiers
in files F.sub.i, F.sub.j, and F.sub.ij may be sorted to enable
more accurate compression by the compression utility. In step 1140,
file F.sub.i is compressed and the length of the result is
represented as C.sub.i. In step 1145, file F.sub.j is compressed and the
length of the result is represented as C.sub.j. In step 1150, file
F.sub.ij is compressed and the length of the result is represented
as C.sub.ij. After compressing each of the files, normalized
compression distance (NCD.sub.i,j) between H.sub.i and H.sub.j is
computed in step 1155.
[0096] Normalized compression distance (NCD) is used for clustering
and is based on normalized information distance (NID), a measure
defined in terms of Kolmogorov complexity. NCD is discussed in detail
in Rudi Cilibrasi's 2007 thesis entitled "Statistical Inference
Through Data Compression," which may be found at
http://www.illc.uva.nl/Publications/Dissertations/DS-2007-01.text.pdf
and can be used to compute the distance between similar data. NCD
may be computed using the following equation:
NCD.sub.i,j=[C.sub.ij-min{C.sub.i,C.sub.j}]/max{C.sub.i,C.sub.j}
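The equation above can be sketched in code as follows, assuming zlib as the compression utility and in-memory byte buffers in place of files F.sub.i and F.sub.j. The function names compressed_length, ncd, and similarity are hypothetical. With a real compressor the distance between identical buffers is typically small but not exactly zero.

```python
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (C_i, C_j, or C_ij)."""
    return len(zlib.compress(data, 9))


def ncd(f_i: bytes, f_j: bytes) -> float:
    """Normalized compression distance between two buffers."""
    c_i = compressed_length(f_i)
    c_j = compressed_length(f_j)
    c_ij = compressed_length(f_i + f_j)  # compressed length of the concatenation
    return (c_ij - min(c_i, c_j)) / max(c_i, c_j)


def similarity(f_i: bytes, f_j: bytes) -> float:
    """Similarity matrix entry a_ij = 1 - NCD_ij."""
    return 1.0 - ncd(f_i, f_j)
```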
[0097] Once NCD.sub.i,j has been computed, flow moves to step 1160
where a.sub.i,j is computed by the following equation:
a.sub.i,j=1-NCD.sub.i,j. The value a.sub.i,j is then used to
construct the similarity matrix by adding a.sub.i,j to row i,
column j. After the similarity matrix has been updated in step
1160, flow moves to decision box 1165 and a query is made as to
whether j<n. If j<n, then additional entries in row i of the
similarity matrix need to be computed (i.e., the similarity distance
has not yet been computed between host H.sub.i and all of the hosts
H.sub.j (j=1 to n)). In this case, flow moves to step 1170 where j
is set to j+1. Flow then loops back to step 1120 where the inner
loop of flow 1100 repeats and the similarity distance is computed
between host H.sub.i and the next host H.sub.j with j=j+1.
[0098] With reference again to decision box 1165, if j is not less
than n (i.e., j.gtoreq.n), then all of the entries in row i have
been computed and flow moves to decision box 1175 where a query is
made as to whether i<n. If i<n, then not all rows of
similarity matrix 1000 have been computed, and therefore, flow
moves to step 1180 where i is set to i+1. Flow then loops back to
step 1115 where j is set to 1 so that entries a.sub.i,j for the
next row i (H.sub.i, with i=i+1) can be generated by inner loop
steps 1120 through 1170. With reference again to decision box 1175,
if i is not less than n (i.e., i.gtoreq.n), then entries for all of
the rows 1 through n have been computed and, therefore, the
similarity matrix has been completed and flow ends.
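The outer and inner loops of flow 1100 can be sketched as a pair of nested loops. In this sketch the distance computation of steps 1120 through 1155 is passed in as a callable so that any compressor may be substituted. The names build_similarity_matrix, buffers, and distance are hypothetical, and indices are 0-based whereas the flow counts from 1.

```python
def build_similarity_matrix(buffers, distance):
    """Fill an n x n similarity matrix row by row, as in flow 1100.

    `buffers` holds one serialized identifier list per host.
    `distance(x, y)` returns the normalized compression distance of two buffers.
    """
    n = len(buffers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):          # outer loop: steps 1110, 1175, 1180
        for j in range(n):      # inner loop: steps 1115, 1165, 1170
            # Steps 1120 through 1160: compute NCD, then a_ij = 1 - NCD.
            matrix[i][j] = 1.0 - distance(buffers[i], buffers[j])
    return matrix
```

Passing the distance as a parameter keeps the loop structure independent of the compression utility that is chosen.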
[0099] It will be apparent that flow 1100 could be optimized in
numerous ways. One optimization technique includes caching the
lengths of compressed files C.sub.i and C.sub.j, which are used
multiple times during flow 1100 to calculate entries 1080 in the
similarity matrix. In addition, the extracted lists of file
identifiers F.sub.i and F.sub.j may also be cached for use during
flow 1100. It will also be noted that the matrix should be
symmetric along the diagonal a.sub.1,1 through a.sub.n,n. This
symmetry could be used in the implementation of the system to
compute only one-half of the matrix and then reflect the results
over the diagonal.
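The two optimizations just described, caching the compressed lengths and exploiting symmetry, might be combined as in the following sketch. The names similarity_matrix_cached and zlib_length are hypothetical, and zlib is an assumed choice of compressor. With a real compressor the compressed length of F.sub.ij can differ slightly from that of F.sub.ji, so reflecting over the diagonal yields an approximation of the fully computed matrix.

```python
import zlib


def zlib_length(data: bytes) -> int:
    """Compressed length using zlib (an assumed choice of compressor)."""
    return len(zlib.compress(data))


def similarity_matrix_cached(buffers, compressed_length=zlib_length):
    """Build the similarity matrix using cached lengths and symmetry."""
    n = len(buffers)
    # Cache: each C_i is computed once instead of once per matrix entry.
    single = [compressed_length(buf) for buf in buffers]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # diagonal and upper triangle only
            c_ij = compressed_length(buffers[i] + buffers[j])
            ncd = (c_ij - min(single[i], single[j])) / max(single[i], single[j])
            # Reflect the result over the diagonal.
            matrix[i][j] = matrix[j][i] = 1.0 - ncd
    return matrix
```

For n hosts this performs n single compressions plus n(n+1)/2 pair compressions, compared with 3n.sup.2 compressions when every entry is computed independently.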
[0100] Turning to FIG. 12, FIG. 12 shows an example similarity matrix
1200 generated by applying the compression method of flow 1100 of
FIG. 11 to host file inventories I.sub.1 through I.sub.5 of the
example selected plurality of hosts 600 of FIG. 6. FIG. 12 shows
hosts 1 through 5 (H.sub.1 through H.sub.5) corresponding to rows
1260 and columns 1270, forming a 5.times.5 similarity matrix 1200.
Entries 1280 of similarity matrix 1200 include values from 0 to 1,
inclusively. The closer the value is to 1, the closer the distance
or greater the similarity of the corresponding hosts in row i,
column j. For example, each entry in matrix 1200 with the same host
in the corresponding row and column (e.g., a.sub.1,1, a.sub.2,2,
a.sub.3,3, etc.) has a value of 1 because the hosts, and therefore
the host file inventories, are identical. In contrast, each entry
in similarity matrix 1200, in which the corresponding hosts H.sub.i
and H.sub.j have respective host file inventories I.sub.i and
I.sub.j with no common executable files, has a value of zero (e.g.,
a.sub.5,4, a.sub.4,5).
[0101] Applying the compression method flow 1100 of FIG. 11 to the
example selected plurality of hosts 600 of FIG. 6, in order to
transform host file inventories into similarity matrix 1200, the
following variables can be identified:
[0102] n=5 (hosts)
[0103] H.sub.1 through H.sub.5=hosts in network (e.g., H.sub.1=host 1, H.sub.2=host 2, etc.)
[0104] I.sub.1 through I.sub.5=host file inventories representing sets of executable files 601 through 605, respectively
[0105] Each of the host file inventories I.sub.1 through I.sub.5
includes a set of file identifiers representing one of the sets of
executable files 601, 602, 603, 604, and 605. Each executable file
in a set of executable files is represented by a separate file
identifier in the particular host file inventory. In this example
scenario in which each file identifier includes a single token
having a checksum configuration, the following host file
inventories of hosts H.sub.1 through H.sub.5 could include file
identifiers D.sub.1 through D.sub.8, which represent executable
files 610 through 680, respectively:
[0106] I.sub.1.fwdarw.D.sub.1, D.sub.2, D.sub.3, D.sub.4
[0107] I.sub.2.fwdarw.D.sub.1, D.sub.5, D.sub.6, D.sub.2
[0108] I.sub.3.fwdarw.D.sub.1, D.sub.2, D.sub.3, D.sub.8
[0109] I.sub.4.fwdarw.D.sub.1, D.sub.5, D.sub.6
[0110] I.sub.5.fwdarw.D.sub.4, D.sub.7, D.sub.8
[0111] In step 1110, i is set to 1 and then the iterative looping
begins to create an n.times.n (5.times.5) similarity matrix 1200
shown in FIG. 12. In step 1115 j is set to 1 and flow passes to
steps 1120 through 1130 where the following variables can be
determined:
[0112] F.sub.i (F.sub.1)=D.sub.1D.sub.2D.sub.3D.sub.4 (i.e., list of file identifiers for I.sub.i (I.sub.1))
[0113] F.sub.j (F.sub.1)=D.sub.1D.sub.2D.sub.3D.sub.4 (i.e., list of file identifiers for I.sub.j (I.sub.1))
[0114] F.sub.ij (F.sub.1F.sub.1)=D.sub.1D.sub.2D.sub.3D.sub.4D.sub.1D.sub.2D.sub.3D.sub.4 (i.e., concatenated files F.sub.i (F.sub.1) and F.sub.j (F.sub.1))
Flow then moves to steps 1140 through 1150 where compression is
applied to these files and the length of the compressed files is
represented as follows:
[0115] C.sub.i (C.sub.1)=length of compressed file F.sub.i (F.sub.1)
[0116] C.sub.j (C.sub.1)=length of compressed file F.sub.j (F.sub.1)
[0117] C.sub.ij (C.sub.1C.sub.1)=length of compressed file F.sub.ij (F.sub.1F.sub.1)
For simplicity of explanation, example arbitrary
values are provided in which each file identifier has a defined
length of 1, such that C.sub.1=4 and C.sub.1C.sub.1=4. It will be
apparent, however, that these values are provided for example
purposes only and may not accurately reflect actual values produced
by a compression utility. After compression has been applied to the
files, NCD.sub.i,j is computed using the compressed values C.sub.i,
C.sub.j, and C.sub.ij. In this example,
NCD.sub.1,1=[C.sub.1C.sub.1-min{C.sub.1,C.sub.1}]/max{C.sub.1,C.sub.1}=[4-min{4,4}]/max{4,4}=0
[0118] After the NCD.sub.1,1 value is computed in step 1155, flow
moves to step 1160 and a.sub.i,j is computed:
a.sub.1,1=1-NCD.sub.1,1=1-0=1
The `1` value is added to row i, column j (row 1, column 1) of
similarity matrix 1200. After similarity matrix 1200 has been
updated, flow moves to decision box 1165 where a query is made as
to whether j<n. Since 1 is less than 5, flow moves to step 1170
where j is set to 2 (i.e., j=j+1). Flow then loops back to step
1120 to determine the similarity distance between H.sub.i (H.sub.1)
and the next host H.sub.j (H.sub.2). In this case, after extraction
and compression are performed, NCD.sub.i,j (NCD.sub.1,2) is
computed as 0.75, because H.sub.1 and H.sub.2 have only one common
file identifier D.sub.1 and, therefore, only one common executable
file 610. In step 1160, NCD.sub.1,2 is used to compute a.sub.1,2 as
0.25, which is added to row i, column j (row 1, column 2) of
similarity matrix 1200. The variable j is still less than 5, (i.e.,
2<5) as determined in decision box 1165, so flow moves to step
1170 and j is set to 3 (i.e., j=j+1). This iterative processing
continues for each value of j until j=5, thereby filling in each
entry for H.sub.1 in row i (row 1) of similarity matrix 1200.
[0119] After the last entry of row i (row 1) has been added to
similarity matrix 1200, flow moves to decision box 1165 where a
query is made as to whether j<n (i.e., Is 5<5?). Because j is
not less than 5, flow moves to decision box 1175 where a query is
made as to whether i<n (i.e., Is 1<5?). Because 1 is less
than 5, flow moves to step 1180 where i is set to 2 (i.e., i=i+1)
and flow loops back to step 1115 where j is set to 1. The inner
iterative loop then begins in step 1120 to determine the similarity
distance between host file inventory I.sub.i (I.sub.2) of host
H.sub.i (H.sub.2) and each host file inventory I.sub.j (I.sub.1
through I.sub.5). Thus, rows 1260 are successively filled with
similarity distance values a.sub.i,j until each row has been
completed.
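The simplified arithmetic of this example can be checked with a short sketch. It assumes the idealization stated above, namely that each distinct file identifier contributes a length of 1 after compression, so that a compressed length is simply a count of distinct identifiers. This is not the behavior of a real compression utility. The names INVENTORIES, ideal_length, and ideal_similarity are hypothetical, and only the inventories of hosts 1, 4, and 5 are used, covering a diagonal entry and a pair of hosts with no common identifiers.

```python
# Host file inventories I_1, I_4, and I_5 as listed in the example.
INVENTORIES = {
    1: ["D1", "D2", "D3", "D4"],
    4: ["D1", "D5", "D6"],
    5: ["D4", "D7", "D8"],
}


def ideal_length(identifiers):
    """Idealized compressed length: one unit per distinct identifier."""
    return len(set(identifiers))


def ideal_similarity(inv_i, inv_j):
    """a_ij = 1 - NCD_ij under the idealized length model."""
    c_i = ideal_length(inv_i)
    c_j = ideal_length(inv_j)
    c_ij = ideal_length(inv_i + inv_j)  # concatenate, then "compress"
    ncd = (c_ij - min(c_i, c_j)) / max(c_i, c_j)
    return 1.0 - ncd


a_11 = ideal_similarity(INVENTORIES[1], INVENTORIES[1])  # identical inventories
a_45 = ideal_similarity(INVENTORIES[4], INVENTORIES[5])  # no common identifiers
```

Under this model a.sub.1,1 evaluates to 1 and a.sub.4,5 evaluates to 0, in agreement with the corresponding entries described for similarity matrix 1200. Values for other pairs depend on how a real compressor behaves and are not asserted here.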
[0120] After the compression method of flow 1100 has finished
processing, similarity matrix 1200 can be provided as input to a
similarity-based clustering procedure, as previously described
herein with reference to clustering techniques used with a vector
matrix. Information generated from the clustering procedure could
be provided in numerous ways, as previously described herein with
reference to FIG. 9. In one example, the information could be
provided in a proximity plot such as example proximity plot 800
illustrated in FIG. 8, which has been previously shown and
described herein.
[0121] Software for achieving the operations outlined herein can be
provided at various locations (e.g., the corporate IT headquarters,
end user computers, distributed servers in the cloud, etc.). In
other embodiments, this software could be received or downloaded
from a web server (e.g., in the context of purchasing individual
end-user licenses for separate networks, devices, servers, etc.) in
order to provide this system for clustering host inventories. In
one example implementation, this software is resident in one or
more computers sought to be protected from a security attack (or
protected from unwanted or unauthorized manipulations of data).
[0122] In other examples, the software of the system for clustering
host inventories in a computer network environment could involve a
proprietary element (e.g., as part of a network security solution
with McAfee.RTM. EPO software, McAfee.RTM. Application Control
software, etc.), which could be provided in (or be proximate to)
these identified elements, or be provided in any other device,
server, network appliance, console, firewall, switch, information
technology (IT) device, distributed server, etc., or be provided as
a complementary solution (e.g., in conjunction with a firewall), or
provisioned somewhere in the network.
[0123] In certain example implementations, the clustering
activities outlined herein may be implemented in software. This
could be inclusive of software provided in central server 130
(e.g., via administrative module 140, host inventory preparation
module 150 and clustering module 160) and hosts 110 (e.g., via host
inventory feed 114). These elements and/or modules can cooperate
with each other in order to perform clustering activities as
discussed herein. In other embodiments, these features may be
provided external to these elements, included in other devices to
achieve these intended functionalities, or consolidated in any
appropriate manner. For example, some of the processors associated
with the various elements may be removed, or otherwise consolidated
such that a single processor and a single memory location are
responsible for certain activities. In a general sense, the
arrangement depicted in FIG. 1 may be more logical in its
representation, whereas a physical architecture may include various
permutations/combinations/hybrids of these elements.
[0124] In various embodiments, all of these elements (e.g., hosts
110, central server 130) include software (or reciprocating
software) that can coordinate, manage, or otherwise cooperate in
order to achieve the clustering operations, as outlined herein. One
or all of these elements may include any suitable algorithms,
hardware, software, components, modules, interfaces, or objects
that facilitate the operations thereof. In the implementation
involving software, such a configuration may be inclusive of logic
encoded in one or more tangible media (e.g., embedded logic
provided in an application specific integrated circuit (ASIC),
digital signal processor (DSP) instructions, software (potentially
inclusive of object code and source code) to be executed by a
processor, or other similar machine, etc.), which may be inclusive
of non-transitory media. In some of these instances, one or more
memory elements (as shown in FIG. 2) can store data used for the
operations described herein. This includes the memory element being
able to store software, logic, code, or processor instructions that
are executed to carry out the activities described in this
Specification. A processor can execute any type of instructions
associated with the data to achieve the operations detailed herein
in this Specification. In one example, the processors (as shown in
FIG. 2) could transform an element or an article (e.g., data) from
one state or thing to another state or thing. In another example,
the activities outlined herein may be implemented with fixed logic
or programmable logic (e.g., software/computer instructions
executed by a processor) and the elements identified herein could
be some type of a programmable processor, programmable digital
logic (e.g., a field programmable gate array (FPGA), an erasable
programmable read only memory (EPROM), an electrically erasable
programmable read only memory (EEPROM)) or an ASIC that includes
digital logic, software, code, electronic instructions, or any
suitable combination thereof.
[0125] Any of the memory items discussed herein should be construed
as being encompassed within the broad term `memory element.`
Similarly, any of the potential processing elements, modules, and
machines described in this Specification should be construed as
being encompassed within the broad term `processor.` Each of the
computers, servers, and other devices may also include suitable
interfaces for receiving, transmitting, and/or otherwise
communicating data or information in a network environment.
[0126] Note that with the examples provided herein, interaction may
be described in terms of two, three, four, or more network
components. However, this has been done for purposes of clarity and
example only. It should be appreciated that the system can be
consolidated in any suitable manner. Along similar design
alternatives, any of the illustrated computers, modules,
components, and elements of FIG. 1 may be combined in various
possible configurations, all of which are clearly within the broad
scope of this Specification. In certain cases, it may be easier to
describe one or more of the functionalities of a given set of flows
by only referencing a limited number of components or network
elements. Therefore, it should also be appreciated that the system
of FIG. 1 (and its teachings) is readily scalable. The system can
accommodate a large number of components, as well as more
complicated or sophisticated arrangements and configurations.
Accordingly, the examples provided should not limit the scope or
inhibit the broad teachings of the system as potentially applied to
a myriad of other architectures.
[0127] It is also important to note that the operations described
with reference to the preceding FIGURES illustrate only some of the
possible scenarios that may be executed by, or within, the system.
Some of these operations may be deleted or removed where
appropriate, or these operations may be modified or changed
considerably without departing from the scope of the discussed
concepts. In addition, the timing of these operations may be
altered considerably and still achieve the results taught in this
disclosure. The preceding operational flows have been offered for
purposes of example and discussion. Substantial flexibility is
provided by the clustering system in that any suitable
arrangements, chronologies, configurations, and timing mechanisms
may be provided without departing from the teachings of the
discussed concepts.
* * * * *
References