U.S. patent application number 12/747868 was filed with the patent office on 2011-09-15 for method and system for abstracting information for use in link analysis.
This patent application is currently assigned to 21CT, INC. Invention is credited to Joon Hao Chuah, Joe Dente, Travis Hartwell, Morgan Hollinger, Robby Morgan, Jim Snyder, John Thele, Jimmy Wan, Robert Williams.
Application Number | 20110225158 12/747868 |
Family ID | 41398380 |
Filed Date | 2011-09-15 |
United States Patent Application | 20110225158 |
Kind Code | A1 |
Snyder; Jim ; et al. | September 15, 2011 |
Method and System for Abstracting Information for Use in Link Analysis
Abstract
Observable data points are collected and organized into a
link-oriented data set comprising nodes and links. Information is
abstracted for use in link analysis by generating links between the
collected data points, including deriving links and inducing links.
A link can be induced by linking together a pair of nodes that
satisfy a distance function. Exemplary distance functions that can
be used to induce links include geospatial proximity, attribute
nearness, and name similarity. Paths can be identified between
selected nodes of interest through a dataset operation, and nodes
and/or links can be selectively included or excluded from the data
set operation. The dataset can be augmented with pedigree
information or one or more association nodes. Link information,
including a trajectory and a connected path that selectively
produces or excludes one or more intermediate nodes, can be
displayed and/or produced in a specified format.
Inventors: |
Snyder; Jim; (Cedar Park, TX) ;
Chuah; Joon Hao; (Austin, TX) ;
Dente; Joe; (Austin, TX) ;
Hartwell; Travis; (Austin, TX) ;
Hollinger; Morgan; (Austin, TX) ;
Thele; John; (Austin, TX) ;
Wan; Jimmy; (Austin, TX) ;
Williams; Robert; (Austin, TX) ;
Morgan; Robby; (Austin, TX) |
Assignee: |
21CT, INC.
Austin
TX
|
Family ID: |
41398380 |
Appl. No.: |
12/747868 |
Filed: |
December 12, 2008 |
PCT Filed: |
December 12, 2008 |
PCT NO: |
PCT/US08/86729 |
371 Date: |
May 16, 2011 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
61013273 | Dec 12, 2007 | |
Current U.S. Class: | 707/737 ; 707/E17.089 |
Current CPC Class: | G06F 16/9024 20190101; G06Q 30/02 20130101; G06F 16/287 20190101 |
Class at Publication: | 707/737 ; 707/E17.089 |
International Class: | G06F 17/30 20060101 G06F017/30 |
Claims
1. A method for abstracting information for use in link analysis
comprising: representing a plurality of collected observable data
points in a link-oriented dataset; and abstracting information for
use in link analysis from the plurality of collected observable
data points by generating links between two or more of the
plurality of observable data points.
2. The method of claim 1 wherein the step of generating links
between two or more of the plurality of observable data points
comprises deriving at least one link between two or more of the
plurality of observable data points.
3. The method of claim 2 wherein the step of representing the
plurality of collected observable data points in a link-oriented
dataset further comprises organizing the plurality of collected
observable data points into one or more typed nodes and one or more
named links between the nodes, the named links comprising typed end
points.
4. The method of claim 3 further comprising displaying one or more
typed nodes and one or more derived links.
5. The method of claim 4 further comprising selectively concealing
one or more links.
6. The method of claim 2 wherein the step of abstracting
information for use in link analysis by deriving links between one
or more of the plurality of observable data points further
comprises the step of performing one or more dataset operations on
the link-oriented dataset.
7. The method of claim 4 wherein the one or more dataset operations
comprise a first dataset operation comprising one or more of the
following: subgraph matching, path matching, or matrix algebra.
8. The method of claim 7 wherein the one or more dataset operations
further comprise a second dataset operation on a result of the
first dataset operation.
9. The method of claim 8 wherein the second dataset operation
comprises aggregating all pairs of typed nodes that result from the
first dataset operation.
10. The method of claim 1 wherein the step of generating links
between two or more of the plurality of observable data points
comprises generating a link based solely on features inherent in
the link-oriented data set.
11. The method of claim 1 wherein the step of generating links
between two or more of the plurality of observable data points
comprises inducing at least one link between two or more of the
plurality of observable data points.
12. The method of claim 11 wherein the link-oriented dataset
comprises a plurality of nodes and wherein the step of inducing at
least one link between two or more of the plurality of observable
data points further comprises inducing a link between at least one
pair of nodes that satisfies a distance function.
13. The method of claim 12 wherein the distance function comprises
geospatial proximity.
14. The method of claim 12 wherein the distance function comprises
nearness of an attribute.
15. The method of claim 14 wherein the nearness of an attribute
comprises name similarity.
16. The method of claim 1 further comprising: selecting a first
node and a second node; and identifying at least one path between
the first node and the second node, wherein identifying at least
one path between the first node and second node comprises
performing a dataset operation.
17. The method of claim 16 wherein one or more links in the
link-oriented data set are selectively included or excluded from
the dataset operation.
18. The method of claim 17 wherein at least one of the one or more
links in the link-oriented data set that are selectively included
or excluded from the dataset operation is an induced link or a
derived link.
19. The method of claim 16 wherein one or more nodes in the
link-oriented data set are included or excluded from the dataset
operation.
20. The method of claim 19 wherein one or more nodes are
selectively included or excluded from the dataset operation based
on type.
21. The method of claim 16 wherein the dataset operation comprises
a shortest path algorithm.
22. The method of claim 16 wherein at least one path between the
first node and second node comprises an induced link or a derived
link.
23. The method of claim 1 in which the link-oriented data set
comprises data pedigree information.
24. The method of claim 1 in which the link-oriented data set
comprises at least one association node.
25. The method of claim 1 further comprising the step of producing
link information.
26. The method of claim 25 wherein the step of producing link
information comprises displaying the link information.
27. The method of claim 25 wherein the step of producing link
information comprises outputting the link information in a
specified format.
28. The method of claim 25 further comprising producing a
trajectory.
29. The method of claim 28 wherein the trajectory comprises two or
more data points having geospatial attributes.
30. The method of claim 28 wherein the trajectory comprises group
activity.
31. The method of claim 25 further comprising producing an
association node.
32. The method of claim 25 wherein the link-oriented dataset
includes a first node, a second node, and a connected path between
the first node and second node that includes one or more
intermediate nodes; and selectively producing or excluding from
production at least one of the one or more intermediate nodes.
33. A system for abstracting information for use in link analysis
comprising: a processor; a memory operatively connected to the
processor, the memory comprising a link-oriented dataset comprising
a plurality of collected observable data points; and graph
analytics platform software comprising instruction code which when
executed by the processor provides the function of: abstracting
information for use in link analysis from the plurality of
collected observable data points by generating links between two or
more of the plurality of observable data points.
34. The system of claim 33 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor derives at least one link between two or more of the
plurality of observable data points.
35. The system of claim 33 wherein the plurality of collected
observable data points are organized into one or more typed nodes
and one or more named links between the nodes, the named links
comprising typed end points.
36. The system of claim 35 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor displays one or more typed nodes and one or more
derived links.
37. The system of claim 36 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor selectively conceals one or more links.
38. The system of claim 34 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor performs one or more dataset operations on the
link-oriented dataset.
39. The system of claim 38 wherein the one or more dataset
operations comprise a first dataset operation comprising one or
more of the following: subgraph matching, path matching, or matrix
algebra.
40. The system of claim 39 wherein the one or more dataset
operations further comprise a second dataset operation on a result
of the first dataset operation.
41. The system of claim 40 wherein the second dataset operation
comprises aggregating all pairs of typed nodes that result from the
first dataset operation.
42. The system of claim 33 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor generates a link based solely on features inherent in
the link-oriented data set.
43. The system of claim 33 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor induces at least one link between two or more of the
plurality of observable data points.
44. The system of claim 43 wherein the link-oriented dataset
comprises a plurality of nodes and wherein the graph analytics
platform software further comprises instruction code which when
executed by the processor induces a link between at least one pair
of nodes that satisfies a distance function.
45. The system of claim 44 wherein the distance function comprises
geospatial proximity.
46. The system of claim 44 wherein the distance function comprises
nearness of an attribute.
47. The system of claim 46 wherein the nearness of an attribute
comprises name similarity.
48. The system of claim 33 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor provides the functions of: selecting a first node and
a second node; and identifying at least one path between the first
node and the second node, wherein identifying at least one path
between the first node and second node comprises performing a
dataset operation.
49. The system of claim 48 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor selectively includes or excludes one or more links in
the link-oriented data set from the dataset operation.
50. The system of claim 49 wherein at least one of the one or more
links in the link-oriented data set that are selectively included
or excluded from the dataset operation is an induced link or a
derived link.
51. The system of claim 48 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor selectively includes or excludes one or more nodes in
the link-oriented data set from the dataset operation.
52. The system of claim 51 wherein one or more nodes are
selectively included or excluded from the dataset operation based
on type.
53. The system of claim 48 wherein the dataset operation comprises
a shortest path algorithm.
54. The system of claim 48 wherein at least one path between the
first node and second node comprises an induced link or a derived
link.
55. The system of claim 33 in which the link-oriented data set
comprises data pedigree information.
56. The system of claim 33 in which the link-oriented data set
comprises at least one association node.
57. The system of claim 33 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor produces link information.
58. The system of claim 57 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor displays the link information on a display device
coupled to the processor.
59. The system of claim 57 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor outputs the link information in a specified
format.
60. The system of claim 57 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor produces a trajectory.
61. The system of claim 60 wherein the trajectory comprises two or
more data points having geospatial attributes.
62. The system of claim 60 wherein the trajectory comprises group
activity.
63. The system of claim 57 wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor produces an association node.
64. The system of claim 57 wherein the link-oriented dataset
includes a first node, a second node, and a connected path between
the first node and second node that includes one or more
intermediate nodes; and wherein the graph analytics platform
software further comprises instruction code which when executed by
the processor selectively produces or excludes from production at
least one of the one or more intermediate nodes.
65. A software program product comprising: a computer readable
medium; and program code on the computer readable medium that when
executed provides the functions of representing a plurality of
collected observable data points in a link-oriented dataset; and
abstracting information for use in link analysis from the plurality
of collected observable data points by generating links between two
or more of the plurality of observable data points.
66. A software program product of claim 65 further comprising
program code on the computer readable medium that when executed
provides the functions of deriving at least one link between two or
more of the plurality of observable data points.
67. A software program product of claim 65 further comprising
program code on the computer readable medium that when executed
provides the functions of inducing at least one link between two or
more of the plurality of observable data points.
68. A software program product of claim 65 further comprising
program code on the computer readable medium that when executed
provides the functions of: selecting a first node and a second
node; and identifying at least one path between the first node and
the second node, wherein identifying at least one path between the
first node and second node comprises performing a dataset
operation.
69. A software program product of claim 65 further comprising
program code on the computer readable medium that when executed
provides the function of producing link information.
70. The software program product of claim 69 wherein the function
of producing link information comprises displaying the link
information.
Description
RELATED PATENTS/PRIORITY CLAIM
[0001] This application claims the benefit of U.S. Provisional
Application No. 61/013,273, filed on 12 Dec. 2007, which is hereby
incorporated by reference.
TECHNICAL FIELD
[0002] The invention relates generally to a method and system for
abstracting or transforming data stored in a link-oriented data set
for use in link analysis.
BACKGROUND
[0003] Over the past several decades, criminal and terrorist
networks have grown more sophisticated in how they finance,
organize and communicate. These changes are reflected in events
ranging from 9/11 to violent crimes associated with drug trafficking
to Improvised Explosive Devices used all over the globe. Both law
enforcement and military intelligence agencies are charged with
detecting, disrupting, and dismantling these human networks in an
attempt to combat what have been called asymmetric threats.
[0004] An effective technique used to understand and combat these
kinds of threats is Link Analysis: the modeling of high-level
networks and linkages between people, organizations, and events to
gain insight into the weak or unknown portions so that some action
can be taken against these networks. FIG. 1 illustrates generally
how information flows during an exemplary intelligence analysis
cycle 100, and includes some major transition points that have
direct relevance to tools and technologies supporting analysts.
Information is collected from and by various sources and methods,
and its collection is highly dependent on the disciplines involved;
this information is usually collected independently or indirectly
from the analytical cycles that exploit it. The analysis phase is
highly subjective with respect to the skills and training of
individual analysts and agencies, and it includes manual and
computational techniques that can establish relevant relationships
in support of a perspective or recommendation. The preparation
phase is a graphical information presentation (e.g. network chart)
workflow that attempts to visually communicate the "punch line" of
an analytical process, that is, a condensed information model is
drawn. Lastly, the presentation phase is typically the assembly and
production of some kind of media that can be presented or
distributed for human consumption.
[0005] What is important to note about this process is that it
involves lots of people, organizations, and systems whereby
information is searched, analyzed, and organized in highly manual
and individual ways. Two kinds of computational tool support that
are relevant for this problem area are link charting and
computational link analysis.
[0006] Link Charting. The preparation phase of the analysis process
is widely supported by a variety of
commercial tools such as Analyst Notebook, Centrifuge, and Visual
Analytics. These tools focus on aiding the preparation of visually
compelling or communicative information so that the analytical
results can be consumed by the intended audience. However, the
proportion of time spent on the preparation phase is relatively
small because these tools amount to task-specialized drawing
tools, much as tools such as Visio or PowerPoint are to
presentation. The difficult task of wielding large amounts of
information or applying computationally intensive methods to
information is ignored by this class of tool.
[0007] As a result, analysts are compelled to assemble a set of
disconnected tools that are difficult to use in a uniform or smooth
technique, and they must also develop ad hoc (and often manual)
processes to manage the information discovery and exploration tasks
through their analysis methodology. An additional consequence is
that standard intelligence practices are hard to reproduce as they
are intermingled with the tool integration approaches themselves;
as tools change, the process changes, or the way in which an
analyst uses a tool can change thereby making processes brittle and
reproducibility difficult. There is an unmet need for improved
computer support of analysis and not just preparation.
[0008] Computational Link Analysis. Simply stated, Link Analysis
includes the methods and techniques used to produce the information
that feeds the presentation phase of the analysis process. The goal
is to incrementally capture, understand, refine, and analyze Human
Activity Networks such that latent or non-obvious relationships can
be discovered and definitive actions can take place such as
arrests, intelligence targeting, or threats being eliminated.
[0009] While many techniques for Link Analysis are known, they have
traditionally been manual and impractical to implement
computationally or at scale. For example, matrix operations can be
used to find associations between people but are used infrequently,
in large part because of the difficulty of capturing relevant
information. Many things need to come together to make
robust computational link analysis feasible: algorithms,
software architectures, hardware performance,
global-scale networking, and commoditized large data stores need to
converge at once to provide computational support.
[0010] A goal of computational link analysis is to create a
synergistic behavior between human analysts and computers where
information intensive tools and techniques can aid analysts in
"zeroing in" on relevant understanding or abstracting the linkages
between information. Said differently, computers should do what
they do best (which is lots of dumb calculations quickly) and
analysts should do what they do best: reason and analyze.
[0011] Link Analysis can be used to identify and qualify high-level
links in Human Activity Networks (HANs). These links are
combinations of many kinds of simple and complex linkages and
pathways in link-oriented data and very often are manually
constructed or simply drawn in pictures. Presented here is a
technology solution that allows such high-level links to be
realized in computational link analysis technologies.
SUMMARY
[0012] Disclosed are a method, system and computer program product
for abstracting information for use in link analysis. Observable
data points are collected and organized into a link-oriented data
set comprising nodes and links. Information is abstracted for use
in link analysis by generating links between the collected data
points. One exemplary embodiment of a generated link is a derived
link which may be derived as a result of one or more dataset
operations on the link-oriented data set. Another embodiment is a
link that is generated based solely on features inherent in the
data set. Another example of a generated link is an induced link,
which may be induced in an embodiment by linking together a pair of
nodes that satisfy a distance function. Exemplary distance
functions that can be used to induce links include geospatial
proximity, attribute nearness, and name similarity. In another
embodiment, two nodes of interest are selected and paths are
identified between the nodes of interest through a dataset
operation. In embodiments the analyst can selectively include or
exclude from the data set operation typed nodes and/or links,
including induced links or derived links. In another embodiment the
dataset is augmented with pedigree information or one or more
association nodes.
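As a minimal illustrative sketch, and not the disclosed implementation, link induction by geospatial proximity could be expressed in Python as follows. The node layout, the helper name induce_links, the link name nearTo, the sample coordinates, and the 1 km threshold are all assumptions introduced for this example.

```python
import math
from itertools import combinations

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def induce_links(nodes, distance, threshold, name="nearTo"):
    """Return (name, id_a, id_b) for every node pair whose distance is within threshold."""
    return [(name, a["id"], b["id"])
            for a, b in combinations(nodes, 2)
            if distance(a["pos"], b["pos"]) <= threshold]

# Hypothetical sample nodes; the coordinates are made up for illustration.
nodes = [
    {"id": "meeting-1", "pos": (30.2672, -97.7431)},
    {"id": "tower-7", "pos": (30.2700, -97.7400)},
    {"id": "airport-2", "pos": (30.1975, -97.6664)},
]
induced = induce_links(nodes, haversine_km, threshold=1.0)
```

Because the distance function is passed in as a parameter, the same helper could in principle be reused with an attribute-nearness or name-similarity function in place of the geospatial one.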
[0013] Additional embodiments are disclosed for producing link
information, including displaying link information or outputting
link information in a specified format. Embodiments of produced
link information include a trajectory and a connected path that
selectively produces or excludes one or more intermediate nodes.
The methods and systems disclosed herein transform the link and
node organization of collected information into a form and/or
format that is more useful to the analyst.
[0014] The above as well as additional objectives, features and
advantages of the present invention will become apparent in the
following detailed written description.
DESCRIPTION OF THE DRAWINGS
[0015] FIG. 1 depicts a high level view of an intelligence analysis
cycle.
[0016] FIG. 2 depicts exemplary link variations within a sample
population.
[0017] FIG. 3 depicts exemplary information layers.
[0018] FIG. 4 depicts an exemplary Human Activity Network.
[0019] FIG. 5 depicts pathways used to define an exemplary Derived
Link within an exemplary Human Activity Network.
[0020] FIG. 6 depicts an XML specification for possible Derived
Links in an embodiment.
[0021] FIG. 7 depicts an example of a pattern query used to derive
a Derived Link in an embodiment.
[0022] FIGS. 8a and 8b illustrate an exemplary application of type
exclusion to enhance Association Discovery.
[0023] FIGS. 9a and 9b illustrate an exemplary application of Link
Induction based on geospatial nearness.
[0024] FIGS. 10a and 10b illustrate an exemplary application of
Link Induction based on name similarity.
[0025] FIG. 11 illustrates an exemplary association node in an
embodiment.
[0026] FIG. 12 depicts an exemplary use of pedigree properties in
an embodiment.
[0027] FIGS. 13a and 13b illustrate an exemplary use of Path
Simplification in an embodiment.
[0028] FIG. 14 depicts an exemplary geospatial trajectory chart in
an embodiment.
[0029] FIGS. 15a, 15b and 15c illustrate stages of an exemplary
process for extracting a trajectory for an entity.
[0030] FIG. 16 depicts an exemplary trajectory path definition of
one mappable point for a Person node type in an embodiment.
[0031] FIG. 17 depicts the components of an embodiment of a system
for abstracting information for use in link analysis.
DETAILED DESCRIPTION
[0032] Technical Needs of Computational Link Analysis Methods. Very
often in link analysis, analysts need to show how people are
"connected" to each other or hypothesize that people are connected,
and then analysis results are combined to establish a non-obvious
linkage. FIG. 2 shows how two different linkages between the same
set of people can help illustrate what person is potentially
exploitable from a targeting perspective. The potentially
exploitable persons can be identified by constructing two networks,
one from the perspective of being part of the family by marriage
and one from the perspective of being in the blood or tribal
network.
[0033] What the example shows is the need to abstract or derive
many kinds of relationships such as spouseOf or parentOf into the
logical grouping of familyMemberOf. However, the vast number of
ways that these relationships exist and need to be stated presents
a serious technical challenge. Moreover, the definitions of these
relationships will change over time as more is learned about the
discipline, the network itself, or how data is collected.
Therefore, it may be neither possible nor feasible to enumerate all
the possible linkages of the data before the data is collected or
analyzed. Association Discovery and Cluster Analysis are two
computational analysis methods useful in deriving or identifying
data linkages of interest.
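One way to picture this kind of abstraction is a small Python sketch in which the mapping from observed link names to an abstract link name is kept as data, so that it can be revised as definitions change. The mapping table, the helper derive_links, the tuple layout, and the sample names are hypothetical and are not taken from the disclosure.

```python
# Hypothetical mapping from observed link names to an abstract link name.
ABSTRACTIONS = {"spouseOf": "familyMemberOf", "parentOf": "familyMemberOf"}

def derive_links(links, abstractions):
    """Return derived links for every observed link whose name has an abstraction.

    Each link is a (name, source, target) tuple. Derived links are treated as
    undirected here (an assumption), so end points are sorted and duplicates dropped.
    """
    derived = set()
    for name, source, target in links:
        if name in abstractions:
            a, b = sorted((source, target))
            derived.add((abstractions[name], a, b))
    return sorted(derived)

# Made-up observations for illustration only.
observed = [
    ("spouseOf", "Ann", "Bob"),
    ("parentOf", "Ann", "Cy"),
    ("parentOf", "Bob", "Cy"),
    ("employerOf", "Acme", "Ann"),
]
family = derive_links(observed, ABSTRACTIONS)
```

Links with no entry in the mapping, such as the employerOf link in the sample, are simply left out of the derived set.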
[0034] Association Discovery is the process of finding linkages
between two or more entities given a variety of complicated or
convoluted pathways between them. For example, it can be inferred
that two people know each other because they lived at the same
place or they committed crimes together or they belong to the same
gang, or even they give common information such as the same phone
number. Additionally, associations between people can be inferred
when the people show up at physical locations together or know many
of the same people in the same area or are in proximity to a large
number of similar crime locations.
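The inference described above, that people are associated when they share a phone number or an address, can be sketched in Python as a grouping over shared values. The observation format and the helper shared_attribute_associations are assumptions made for illustration, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations

def shared_attribute_associations(observations):
    """Map each pair of people to the set of (kind, value) items they have in common.

    observations: iterable of (person, kind, value) tuples,
    e.g. ("Ann", "phone", "555-0100").
    """
    holders = defaultdict(set)
    for person, kind, value in observations:
        holders[(kind, value)].add(person)

    associations = defaultdict(set)
    for key, people in holders.items():
        for a, b in combinations(sorted(people), 2):
            associations[(a, b)].add(key)
    return dict(associations)

observations = [
    ("Ann", "phone", "555-0100"),
    ("Bob", "phone", "555-0100"),
    ("Ann", "address", "12 Elm St"),
    ("Bob", "address", "12 Elm St"),
    ("Cy", "address", "12 Elm St"),
]
assoc = shared_attribute_associations(observations)
```

The number of shared items per pair gives a rough indication of how well supported an inferred association is; how such evidence should be weighted is left open here.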
[0035] As these associations are built up, linkages can be combined
together (possibly with other links) to form a network of activity.
Often these networks are focused around a particular theme such as
crimes. To support a more general approach of building networks
around multiple activities, a more general network structure, such
as a Human Activity Network, can be used.
[0036] When considering paths that may connect entities, preferably
the analyst has the ability to selectively include or exclude
entity types as intermediaries in candidate paths. For example, to
discover methods used by individuals for communication, it might be
beneficial to exclude non-communication pathways from consideration.
This may be accomplished by excluding entity types. FIGS. 8a and 8b
illustrate an exemplary application of this technique. FIG. 8a
shows source data model 800 comprising exemplary entity types,
including Person 810, PhoneNumber 815, PhoneCall 820, PhoneTower
825, Meeting 830, EmailAddress 840, Email 845, PersonName 860,
Organization 870, Flight 880, Airline 887, and Airport 885. Some of
these types (e.g., Organization 870, PersonName 860, and
Flight-related types (880)) do not directly relate to methods of
communication with other persons and can be excluded from display
so that the analyst can focus on communication-related data types
such as PhoneNumber 815 and EmailAddress 840, as shown in FIG. 8b.
The ability to selectively include or exclude types for path
consideration provides additional tools to the analyst to narrow
the results and pinpoint those relevant links of interest. Using
type exclusion also reduces the search space for Association
Discovery, which improves the efficiency of the search. An
embodiment allows the analyst several ways to selectively include
or exclude types for path consideration, including rules-based
techniques and manual selection.
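A minimal Python sketch of path finding with type exclusion follows. It uses a plain breadth-first search, which is an assumption chosen for brevity; the function shortest_path, the data layout, and the sample nodes are hypothetical, with type names borrowed from the data model of FIG. 8a.

```python
from collections import deque

def shortest_path(links, node_types, start, goal, excluded_types=frozenset()):
    """Breadth-first search for a shortest path that never passes through an
    intermediate node whose type is in excluded_types."""
    neighbors = {}
    for a, b in links:  # links are treated as undirected in this sketch
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt in seen:
                continue
            if nxt != goal and node_types[nxt] in excluded_types:
                continue  # skip excluded intermediaries
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no path survives the exclusion

node_types = {"Ann": "Person", "Bob": "Person", "Acme": "Organization",
              "555-0100": "PhoneNumber", "call-1": "PhoneCall",
              "555-0199": "PhoneNumber"}
links = [("Ann", "Acme"), ("Acme", "Bob"),
         ("Ann", "555-0100"), ("555-0100", "call-1"),
         ("call-1", "555-0199"), ("555-0199", "Bob")]

any_path = shortest_path(links, node_types, "Ann", "Bob")
comms_path = shortest_path(links, node_types, "Ann", "Bob",
                           excluded_types={"Organization"})
```

With no exclusions the search returns the short route through the Organization node; excluding that type leaves only the communication-related route through the phone numbers and the call.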
[0037] Cluster Analysis is a complementary technique to Association
Discovery in that once a high-quality abstracted network has been
built, people (or other nodes) in the network can be clustered
based on their associations. (Other names for this technique include
Group Detection or Community Extraction.) Referring back to the
example in FIG. 2, two kinds of clusters could be built to compare
and contrast the family versus blood networks.
[0038] More specifically, most of the algorithmic techniques for
clustering (i.e., the technique that produces the clusters) make
use of computational linear algebra and require highly abstracted
or pristine links. However, the data is not in such a state in its
intrinsic form. Typically, pre-processing is performed on small data
sets unless the data is intrinsically collected in pristine network
form. An example of this type of data is phone call logs.
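To illustrate why phone call logs suit linear-algebra techniques, the following Python sketch turns a small call log directly into an adjacency matrix and squares it. The record format, both helper functions, and the sample calls are assumptions for illustration; a practical system would likely rely on a numerical library rather than plain lists.

```python
def adjacency_from_calls(calls):
    """Build a symmetric call-count matrix from (caller, callee) records."""
    people = sorted({p for call in calls for p in call})
    index = {p: i for i, p in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for caller, callee in calls:
        i, j = index[caller], index[callee]
        matrix[i][j] += 1
        matrix[j][i] += 1
    return people, matrix

def matmul(a, b):
    """Plain-Python matrix product, adequate for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Made-up call records: Ann and Bob spoke twice, Bob and Cy once.
calls = [("Ann", "Bob"), ("Bob", "Cy"), ("Ann", "Bob")]
people, A = adjacency_from_calls(calls)
two_step = matmul(A, A)  # entry [i][j] counts two-call walks from person i to person j
```

In this sample Ann and Cy never call each other, yet the squared matrix shows a nonzero entry for the pair because both are in contact with Bob.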
[0039] Information Abstraction in Link Analysis. Typically there is
a gap between the information needed by analysts to make analysis
decisions and the data collected. This can be illustrated by
information abstraction layers. FIG. 3 shows three layers that
represent notional stable points in a data storage and
representation embodiment, namely the Observables Layer 310, the
Entities Layer 340 and the Aggregates Layer 370.
[0040] Observables Layer: The data collected throughout the course
of normal or daily activity is represented in Observables Layer
310, that is, things that can be observed and/or recorded (e.g.,
electronically). Examples include email, documents, phone call
logs, financial transactions, database records and any other type of
data used on, by or with a computer.
[0041] Entities Layer: Entities layer 340 reflects qualitative
evaluation, refinements and judgments about the data and what it
represents specifically with respect to the entities under scrutiny
(people, organizations, locations, etc.). For example, in an
exemplary embodiment of a Human Activity Network, the Entities
Layer would resolve two different names of a person to the same
person represented in the network.
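The text does not state how two names are resolved to one person. As one hedged possibility, a name-similarity score could be used to group name variants, as in the Python sketch below. The use of difflib from the standard library, the 0.85 threshold, the greedy grouping strategy, and the sample names are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def resolve_names(names, threshold=0.85):
    """Greedily group names: each name joins the first group whose first
    member is similar enough, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if name_similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

groups = resolve_names(["John Smith", "Jon Smith", "Maria Lopez"])
```

String similarity alone can merge distinct people with similar names, so in practice such a score would more plausibly serve as one input to an analyst's judgment than as a final decision.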
[0042] Aggregates Layer: The output of Cluster Analysis is an
example of what would be represented in Aggregates Layer 370, that
is, people can be aggregated into groups. Moreover,
people can be aggregated from countless perspectives, and
comparison of these variations can show how human networks
function. However, the techniques used to make aggregates
cannot simply use information represented at
the Observables Layer as is.
[0043] An important conclusion to draw from this observation about
how information needs to be tiered is that there is a gap in how
data is collected, processed and stored as compared to how the
behaviors of people and organizations via Human Activity Networks
can be analyzed and discovered.
[0044] In an embodiment, a graph analytics platform provides the
technology for generating linkages for use at a high level of
analysis. Preferably the graph analytics platform is implemented in
software (which may include portions implemented in hardware). It
is assumed that observable (or raw) data has been collected, and
the graph analytics platform preferably stores or organizes the
collected observable data in a form that is link-oriented, that is,
data is organized as Nodes and Links (or edges) between nodes.
Exemplary link-oriented data sets include graphs and trees, and can
be implemented with relational database technology such as a
relational database management system and query language using
methods well-known to those of ordinary skill in the art. Nodes
have types associated with them (e.g. People) and one or more
attributes, and Links are named (e.g. parentOf) and their end points
are also typed (e.g. links of People). Attributes are named scalar
value properties that express owned aspects of a given Node type
(e.g., a person's name, a vehicle's model, or a phone call's
duration). The features of the graph analytics platform are not
dependent on the definition of any one data set, but can adapt to
function against any data set that is or will be defined.
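By way of non-limiting illustration, the link-oriented organization
described above can be sketched as follows. This is a minimal
in-memory sketch, not the platform's storage format; the names Node,
Link, LinkDataSet and end_point_types, and the sample persons, are
hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str                                   # e.g. "Person"
    attributes: dict = field(default_factory=dict)   # named scalar values


@dataclass(frozen=True)
class Link:
    name: str        # e.g. "parentOf"
    source: str      # node_id of the source end point
    target: str      # node_id of the target end point


class LinkDataSet:
    """Hypothetical in-memory link-oriented data set."""

    def __init__(self):
        self.nodes = {}
        self.links = set()

    def add_node(self, node_id, node_type, **attributes):
        self.nodes[node_id] = Node(node_id, node_type, dict(attributes))

    def add_link(self, name, source, target):
        # Both end points must already exist, so that every link can be
        # typed by the node types it connects.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both end points must be nodes in the data set")
        self.links.add(Link(name, source, target))

    def end_point_types(self, link):
        return (self.nodes[link.source].node_type,
                self.nodes[link.target].node_type)


data = LinkDataSet()
data.add_node("p1", "Person", name="Alice")
data.add_node("p2", "Person", name="Bob")
data.add_link("parentOf", "p1", "p2")
```

Because the node and link types are ordinary values rather than fixed
classes, the same sketch can hold any schema that is defined later.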
[0045] The graph analytics platform includes search and segment
matching tools to search the data set efficiently and to match
segments or patterns or identify nodes or links that meet specified
criteria. Methods and techniques for searching and segment
matching, including without limitation graph tools including
sub-graph matching and relational database methods, are well-known
to those of ordinary skill in the art. In an embodiment the
link-oriented data set uses a strongly-typed node and link system,
where every node is of an identifiable type such as `Person` or
`Organization`. Links are typed and connected between identifying
node types, such as `Person memberOf Organization`. In an
embodiment, links are typed but do not have attributes, which
facilitates scalable, fast pattern matching. Preferably the graph
analytics platform uses strongly-typed link-oriented data,
segment matching for data set searches, an efficient storage format
and language, and query languages for building queries, all
as described in pending U.S. patent application Ser. No. 11/590,070
filed Oct. 30, 2006 entitled Segment Matching Search System and
Method, hereby incorporated by reference. A graph analytics
platform preferably also provides pattern search (including graph
pattern matching), and management and application development
(including client and server tools) functionality. An exemplary
embodiment of a graph analytics platform is the Lynxeon
Intelligence Analytics Enterprise product suite provided by 21st
Century Technologies.
[0046] FIG. 4 depicts an exemplary Human Activity Network 400. To
analyze the exemplary Human Activity Network 400, it would be
helpful to be able to identify links such as loyalTo, trained, etc.
The problem is that such data is not necessarily intrinsically
available in the captured data.
[0047] It is also assumed that the captured data will not be
directly organized like the idealized network 400 from FIG. 4, and
there will be more than one way to derive links such as, for
example, the link loyalTo. If there are multiple paths or
substructures in the data that justify the loyalTo association, it
will be valuable to derive all the links that are defined in terms
of all these paths, that is, to identify all pairs of nodes where
the desired pathways exist between the nodes. A derived link is
therefore a mechanism that allows dynamic synthesis of typed links
in link-oriented datasets. As a simple exemplary illustration, the
derived link loyalTo can be defined between persons who call each
other or have met at a safe house, as shown in FIG. 5 with
highlighted elements. In FIG.
5, for example, there are loyalTo derived links between persons 520
and 530, and between persons 540 and 550, because the observable
data shows phone calls between them (510, 560).
[0048] In an embodiment, Derived Links can be generalized as the
set of typed edges (i.e. typed node pairs) that are the result of a
disjunctive set of subgraph matching operations. Those of ordinary
skill in the art will appreciate that other dataset operations such
as, by way of non-limiting example, pattern matching or matrix
operations, can be used in addition to, or instead of, subgraph
matching operations, and further, that the results may be
aggregated in whole or in part, excluded in whole or in part, or
otherwise operated upon in any number of ways to generate a set of
typed edges. While there may be special cases in some embodiments
(for example the nodes may have to be the same type or capable of
path discovery via linear algebra), the general idea allows a
solution to be cast in many technology implementations. A
characteristic of one embodiment is to defer the generation of such
linkages as late as possible while balancing the computational
efficiency to produce answers sufficiently quickly.
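The disjunctive character of a Derived Link can be sketched, under
stated assumptions, as the union of the node pairs returned by several
pattern matches. The link names participantIn and metAt, the helper
functions shared_target and derive_links, and the safe house meeting
data are hypothetical; a simple shared-neighbor match stands in for
general subgraph matching.

```python
from itertools import combinations

# Observable links as (link_name, source, target) triples; hypothetical data.
observed = {
    ("participantIn", "p520", "call510"),
    ("participantIn", "p530", "call510"),
    ("participantIn", "p540", "call560"),
    ("participantIn", "p550", "call560"),
    ("metAt", "p520", "safehouse1"),
    ("metAt", "p540", "safehouse1"),
}


def shared_target(links, link_name):
    """Match pairs of sources that reach the same target via link_name."""
    by_target = {}
    for name, source, target in links:
        if name == link_name:
            by_target.setdefault(target, set()).add(source)
    pairs = set()
    for sources in by_target.values():
        pairs.update(combinations(sorted(sources), 2))
    return pairs


def derive_links(links, derived_name, patterns):
    """Union (disjunction) of the node pairs matched by each pattern."""
    pairs = set()
    for pattern in patterns:
        pairs |= pattern(links)
    return {(derived_name, a, b) for a, b in pairs}


loyal_to = derive_links(
    observed,
    "loyalTo",
    [lambda ls: shared_target(ls, "participantIn"),
     lambda ls: shared_target(ls, "metAt")],
)
```

Because derive_links is an ordinary function over the observed links,
it can be invoked only when an analysis needs the loyalTo edges, which
is one way to defer link generation.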
[0049] FIGS. 6 and 7 illustrate an embodiment implemented using
Lynxeon in which Derived Links can be used to condition
link-oriented data to perform cluster analysis. FIG. 6 shows an XML
specification 600 of the possible Derived Links for a particular
schema. Line 14 (610) shows that the Derived Link "friendOfAFriend"
is defined using a pattern query that is based on graph matching
approaches. FIG. 7 is an example of a pattern query that is
referenced in the Derived Link Specification of FIG. 6 (see 620).
The exported output of the pattern (710) in this embodiment
includes a source and destination node that ultimately is consumed
as a typed edge. In the example shown in FIGS. 6 and 7, friends is
a Derived Link that is derived from other data, e.g., observable
data. The Derived Link "friendOfAFriend" is derived by identifying
each subgraph that includes two persons, each of whom has a friends
Derived Link with a middle man.
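The friendOfAFriend derivation can be approximated in a few lines.
This sketch does not reproduce the XML specification or pattern query
of FIGS. 6 and 7; it assumes that friends links are symmetric, and the
function name friend_of_a_friend and the sample persons are
hypothetical.

```python
def friend_of_a_friend(friend_pairs):
    """Pairs of persons that share at least one common friend."""
    neighbours = {}
    for a, b in friend_pairs:
        # Assumption: a friends link holds in both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    derived = set()
    for around in neighbours.values():
        # Every two distinct friends of the same middle person are linked.
        for source in around:
            for destination in around:
                if source < destination:
                    derived.add((source, destination))
    return derived


friends = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}
fof = friend_of_a_friend(friends)
```

Each returned pair plays the role of the exported source and
destination nodes that are consumed as a typed edge.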
[0050] Observable data often arrives in discrete units of tightly
interconnected data. This results in the creation of many
disconnected subgraphs where each subgraph of observable data is
isolated from other subgraphs within the total link-oriented data
set. This often disconnected nature of data presents a challenge
for Link Analysis, as many analytic approaches will operate on each
of these data subgraphs in isolation, diminishing the ability to
discover new valuable links between entities within the data.
[0051] Link induction is an approach which creates new typed links
between nodes through comparative analysis of two subgraphs.
Specifically, link induction (like the creation of derived links)
is the result of an analytical or computational process that
preferably considers only features or attributes present (or
inherent) in the data set. This is distinct from general link
creation which encompasses link creation using knowledge not
necessarily present in the data set. The process of link induction
is comprised of two phases. The first phase is selection of two
subgraphs for comparison, and may be accomplished with many of the
well known or previously detailed analytic operations such as type
selection or subgraph matching. The second phase is comparison
between the selected subgraphs to select candidates to connect with
new links.
[0052] One exemplary subgraph comparison technique used for
inducing links is the process of computing `nearness` of selected
nodes in each subgraph. This `nearness` operation is used to
compare attribute proximity given a distance function, where the
distance function may be geospatial ranges, temporal values,
phonetic similarity, simple string comparison, or other distance
function between literal values assigned to node attributes.
Preferably the graph analytics platform software will provide a
variety of pre-configured distance functions for the analyst and
will also provide the analyst with the power to define custom
distance functions as needed during analysis.
[0053] FIGS. 9a and 9b illustrate how geospatial nearness can be
used to induce links between multiple people who were sighted near
each other about the same time. FIG. 9a depicts disconnected
observation data 900 containing two person-nodes representing Fred
910 and Joe 930. The person nodes 910 and 930 are
unconnected--there is no link or edge between them to identify any
possible connection between them. However, the analyst can specify
a distance function that compares geospatial attribute values
(such as, for example, geospatial coordinates and time of day of a
confirmed sighting) and induces a link between the nodes whose
geospatial attribute values are within an analyst-specified
distance from each other. The data set illustrated in FIG. 9a
includes geospatial attribute values 920, 940 relating to Fred and
Joe, namely that Fred has been sighted at 202 Maple St. and Joe has
been sighted at 210 Maple Street at or about a specified time. In
the example, 202 Maple St. and 210 Maple Street are within a mile
of each other. If the analyst-specified distance function is
satisfied by a pair of persons with geospatial coordinates of
sightings that are within one mile of each other and at or about a
specified time, link 950 will be induced between Fred 910 and Joe
930, as illustrated in FIG. 9b.
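A geospatial distance function of this kind might be sketched as
follows, assuming that each sighting carries latitude and longitude in
degrees and a time expressed in minutes. The haversine great-circle
formula is one common choice and is not stated in the description; the
coordinates, the third person Sam, the link name sightedNear and the
function names are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def induce_sighted_near(sightings, max_miles, max_minutes):
    """Induce a link for each pair of persons sighted close in space and time."""
    induced = set()
    for a, b in combinations(sightings, 2):
        if a["person"] == b["person"]:
            continue
        close = haversine_miles(a["lat"], a["lon"],
                                b["lat"], b["lon"]) <= max_miles
        concurrent = abs(a["minute"] - b["minute"]) <= max_minutes
        if close and concurrent:
            pair = tuple(sorted((a["person"], b["person"])))
            induced.add(("sightedNear",) + pair)
    return induced


sightings = [
    {"person": "Fred", "lat": 30.2672, "lon": -97.7431, "minute": 600},
    {"person": "Joe", "lat": 30.2680, "lon": -97.7425, "minute": 610},
    {"person": "Sam", "lat": 30.5000, "lon": -97.7431, "minute": 605},
]
links = induce_sighted_near(sightings, max_miles=1.0, max_minutes=30)
```

With a one mile and thirty minute tolerance, only Fred and Joe are
linked; Sam's sighting is roughly sixteen miles away.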
[0054] FIGS. 10a and 10b illustrate an exemplary use of attribute
"nearness" to induce links between persons who share similar names
(for example, a familial name such as Al-Sabah), which in turn may
identify persons that share a family, tribal or loyalty
relationship. FIG. 10a shows disconnected observation data 1000
including persons with name attributes Ahmad I Al-Jaber Al-Sabah
1010, Sabah III Al-Salim Al-Sabah 1020, and Mubarek Al-Sabah 1030.
A distance function that is satisfied by similarity of names can
be specified, and will induce links 1015, 1025 and 1035 between
the persons with the same familial name of Al-Sabah, as illustrated
in FIG. 10b. Geospatial proximity and name similarity are examples
of "nearness" based on comparison operations, and alternative
comparison operations within the scope of the methods described
herein will be familiar to those of ordinary skill in the art.
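A name-similarity distance function might be sketched as follows. The
sketch assumes that the familial name is the last whitespace-separated
token of the name attribute and uses the Python standard library
difflib.SequenceMatcher as a simple string comparison; phonetic
measures would be an alternative. The link name sameFamilyName, the
function names, the identifiers x1 and x2 and their two names are
hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def family_name(full_name):
    # Assumption: the familial name is the last whitespace-separated token.
    return full_name.split()[-1].casefold()


def name_distance(name_a, name_b):
    """Distance in [0, 1]; 0 means the familial names are identical."""
    matcher = SequenceMatcher(None, family_name(name_a), family_name(name_b))
    return 1.0 - matcher.ratio()


def induce_same_family(persons, max_distance):
    """Induce a link between every pair of persons with near familial names."""
    induced = set()
    for (id_a, name_a), (id_b, name_b) in combinations(sorted(persons.items()), 2):
        if name_distance(name_a, name_b) <= max_distance:
            induced.add(("sameFamilyName", id_a, id_b))
    return induced


persons = {
    "1020": "Sabah III Al-Salim Al-Sabah",
    "1030": "Mubarek Al-Sabah",
    "x1": "Jaber Al-Sabbah",   # hypothetical spelling variant
    "x2": "John Smith",        # hypothetical unrelated person
}
links = induce_same_family(persons, max_distance=0.15)
```

The tolerance lets a spelling variant of the familial name satisfy the
distance function while an unrelated name does not.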
[0055] Although people, and relationships between people, are the
primary entities under scrutiny in the foregoing examples, those of
ordinary skill in the art will appreciate that the methods
described herein are equally applicable to analyzing other types of
entities, such as organizations or events, and relationships
between the entities.
[0056] The link abstraction approaches described herein provide an
additional layer of interpretation to the observational data or
even other abstraction views. It is important to understand that
these abstractions can build on the results of previous abstraction
processes. For example, Association Discovery could utilize the
results generated from Derived Link processing, which can be built
upon data sets containing Induced Links. Further, the abstractions
can be chained together in any order. The results of any
computation analysis process can become input for any other
process. This chaining approach to the abstraction layer provides
the analyst a method for leveraging multiple insights to create a
more complete understanding of the relevant data for further
reasoning. The results of these abstraction approaches can be
persisted as edges defined by the pair of endpoints for the link
abstraction, and type name for the edge. This persistence form
permits the storage of link abstraction results into long term
storage, such as a relational database in local or remote
persistent storage or a data store, and incorporation into
computation processes across analyst sessions.
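Persisting abstraction results as typed edges can be sketched with a
relational store. The example below uses the SQLite module from the
Python standard library with an in-memory database purely for
illustration; the table name abstraction_edge, its columns and the
helper functions persist_edges and load_edges are hypothetical and do
not describe the platform's actual storage format.

```python
import sqlite3


def persist_edges(conn, edges):
    """Store (type_name, source, target) edges, ignoring duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstraction_edge ("
        "type_name TEXT NOT NULL, source TEXT NOT NULL, target TEXT NOT NULL, "
        "PRIMARY KEY (type_name, source, target))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO abstraction_edge VALUES (?, ?, ?)",
        sorted(edges),
    )
    conn.commit()


def load_edges(conn, type_name):
    """Reload all persisted edges of one type as a set of triples."""
    rows = conn.execute(
        "SELECT type_name, source, target FROM abstraction_edge "
        "WHERE type_name = ?",
        (type_name,),
    )
    return set(rows)


conn = sqlite3.connect(":memory:")
persist_edges(conn, {("sightedNear", "Fred", "Joe")})
persist_edges(conn, {("sightedNear", "Fred", "Joe")})   # duplicate is ignored

# A later abstraction step can take the reloaded edges as its input.
reloaded = load_edges(conn, "sightedNear")
```

Storing only the end point pair and the type name keeps the persisted
form independent of whichever abstraction process produced the edge.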
[0057] Links between nodes may require additional qualifiers to be
effective for computational analysis. Social networking analysis
(SNA) requires representation and visualization of links between
people, which may represent the relationships between them, such as
friendships, marriages, coworkers, associates, etc. Such
associations can be modeled as nodes with attributes, rather than
as links between people. This modeling method allows robust
representation of qualities of the association which may be
important for analytical understanding. A Person's membership in an
organization frequently has discrete temporal aspects, such as the
begin/end dates of the association. Observable events may have
varying levels of confidence dependent on the observation
methodology utilized. "Association nodes" can be used to capture
these qualified links in an embodiment. These association nodes can
be stored in the link set as intermediary nodes in a link between
graph entities. As illustrated in FIG. 11, exemplary data set
excerpt 1100 includes nodes 1110 for a person and 1120 for an
organization. The fact that a person 1110 is a member of
organization 1120 can be represented by association node 1130, and
important data about the membership (such as the startDate and
endDate) can be stored as attributes of the association node.
Association links 1115 and 1125 link the entities 1110 and 1120 to
the association node 1130. However, such modeling comes at a cost
of higher model complexity and potentially confusing visualization,
since humans typically think of such associations as simple links
rather than nodes themselves. Thus the analyst in an embodiment can
elect to display the entities 1110 and 1120 with the association
node 1130 and association links 1115 and 1125, or the analyst can
conceal the association node 1130 and association links 1115 and
1125 and display the association between entities 1110 and 1120 as
a single link (not shown).
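The choice between displaying and concealing an association node can
be sketched as follows. The node type Membership, the link name
associationLink, the association flag, the date values and the
function collapse_associations are hypothetical; the sketch also
assumes that association links run from the first entity to the
association node and from the association node to the second entity.

```python
nodes = {
    "1110": {"type": "Person"},
    "1120": {"type": "Organization"},
    "1130": {"type": "Membership", "association": True,
             "startDate": "2001-01-01", "endDate": "2003-06-30"},
}

# (link_name, source, target); corresponds to association links 1115 and 1125.
links = {
    ("associationLink", "1110", "1130"),
    ("associationLink", "1130", "1120"),
}


def collapse_associations(nodes, links):
    """Replace each association node and its two links with one simple link."""
    visible = set()
    ends = {}
    for name, source, target in links:
        if nodes[target].get("association"):
            ends.setdefault(target, {})["source"] = source
        elif nodes[source].get("association"):
            ends.setdefault(source, {})["target"] = target
        else:
            visible.add((name, source, target))   # ordinary link, kept as is
    for association_id, end in ends.items():
        if "source" in end and "target" in end:
            # The simple link is named after the association node's type.
            visible.add((nodes[association_id]["type"],
                         end["source"], end["target"]))
    return visible


simple_view = collapse_associations(nodes, links)
```

The underlying data set is left unchanged, so the qualifying
attributes remain available on the association node when the detailed
view is selected.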
[0058] Data is incorporated into a dataset selectively, based
on the intended analysis goals of the system. Over time these
intended goals or needs for the system will evolve, requiring the
original principles used in data selection to be reassessed and
new data to be brought into consideration from original sources. To
facilitate origin tracking of data incorporated into the data set,
a `Pedigree` layer is embodied in the system. This pedigree layer
tags any data incorporated into the data set with information about
the origin or source, and information about the agents or users
involved in the data creation, or its modification or deletion.
This provides the analyst the ability to return to original sources
for insight that may have been lost through the data preparation
process.
[0059] Pedigree records consist of a plurality of value properties
that describe the origin for a group of nodes and edges. This
plurality of properties is given a unique identifier, which is
subsequently recorded as a property on each affected node and edge
in the data set. The unbounded nature of the pedigree properties
allows adaptation of the pedigree mechanism to address data from
varied sources such as structured databases, records management
systems, and unstructured reports. FIG. 12 shows an example of how
origin information for data prepared from a traffic report is
recorded and associated with the data instances. Data from the
traffic report is organized as typed nodes and named edges 1200,
and pedigree record 1210 associated with the collection of nodes
and edges 1200 includes the source of the information (i.e.,
Traffic Report No. 211), the URI where the traffic report was
found, the date of creation, and the person who created it.
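The pedigree mechanism can be sketched as a table of open-ended
property sets keyed by a unique identifier that is stamped on each
affected node and edge. The function record_pedigree, the property
names and the URI, date and creator values are hypothetical; a random
UUID is assumed as the identifier scheme.

```python
import uuid

pedigree_records = {}


def record_pedigree(elements, **properties):
    """Create one pedigree record and tag every given node or edge with its id."""
    pedigree_id = str(uuid.uuid4())
    pedigree_records[pedigree_id] = dict(properties)   # unbounded property set
    for element in elements:
        element["pedigree"] = pedigree_id
    return pedigree_id


# Nodes and edges prepared from one traffic report (hypothetical content).
driver = {"type": "Person", "name": "J. Doe"}
vehicle = {"type": "Vehicle", "model": "sedan"}
drives = {"name": "drives", "source": driver, "target": vehicle}

pid = record_pedigree(
    [driver, vehicle, drives],
    source="Traffic Report No. 211",
    uri="https://example.org/reports/211",   # hypothetical location
    created="2008-05-02",                    # hypothetical date
    created_by="analyst-1",                  # hypothetical creator
)
```

Since each record is an unconstrained set of properties, a different
source type can supply different keys without changing the tagging of
nodes and edges.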
[0060] Charting tools facilitate human comprehension and reasoning
over results of the data abstraction layers and are valuable to the
analysis process. Charting tools provide methods to contextualize
the entities and understand the relationships between represented
entities. With visualization of large link-oriented datasets,
methods must be provided to abstract details, allowing the analyst
to focus on entities and relations that are important to provide
situational understanding and hiding those that are not relevant to
gain this understanding. Finally, charting tools must provide
access to the computational link analysis abilities detailed
previously to allow new insights to be added into the link
abstraction view. Through such charting capabilities, analysts are
provided a means to quickly comprehend, investigate, and evaluate
link-oriented data.
[0061] Link Charting is a method of presenting link-oriented data
visually to convey the relational nature of the underlying data.
Display of link-oriented data in a link charting visualization is
familiar to those skilled in the art, and supported in a range of
commercial tools. However, the visualization capabilities in an
embodiment provide enhancements through tight integration with
computational link analysis methods and visual data abstraction
abilities. Preferably, the graph analytics platform software
provides for the visual display of link-oriented data and also for
production of link-oriented data in other output formats, such as,
for example, ESRI shapes or the format used by Google Earth.
[0062] Path Simplification provides a method for abstracting
complex paths between nodes into single edge visual representation,
while preserving the most significant type representing this path.
FIGS. 13a and 13b illustrate an exemplary use of Path
Simplification. FIG. 13a illustrates a detailed view of source
graph 1300 showing the persons of interest 1310, 1340, and 1350 and
the pathways between them showing intermediate nodes 1315, 1320,
1325, 1335 representing the connecting path details associated with
phone calls between persons 1310, 1340 and 1350. FIG. 13b shows a
simplified view in which the intermediate nodes are concealed and
replaced with single links 1385, 1390 and 1395 preserving the
communication nature of the underlying details of the phone
communication. In this way the subgraph representing multiple
participants in a phone call is simplified to present direct
connectivity between the individuals. Thus details in the
link-oriented data set can be reduced to core concepts of interest
for analyst understanding and presentation, including the
relationship between the connectivity between persons 1310, 1340
and 1350 and person 1350's air travel 1360, 1370 to destination
1380.
[0063] The rules for path simplification are defined by indicating
the candidate set of node types that should be simplified. For each
of the candidate node types marked for simplification, a list of
all of the candidate's relation tuples to preserve must also be
recorded. Finally, the candidate simplification nodes are ranked
from most to least important concept. With these simplification
rules modeled, candidate nodes can then be replaced by edges
representing existence of instances of their relation tuples. Types
for the edges can then be assigned based on the type ranking and
comparison to connecting edge types.
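One possible reading of these rules is sketched below. The sketch
treats a relation tuple as a pair of neighboring node types, returns
only the replacement edges, and resolves competing candidates by rank
alone, without the comparison to connecting edge types. The function
simplify_paths, the Meeting node type and all identifiers are
hypothetical.

```python
from itertools import combinations


def simplify_paths(node_types, edges, rules, ranking):
    """Replace candidate intermediate nodes with single typed edges.

    rules maps each candidate node type to the set of (type, type)
    relation tuples to preserve; ranking lists candidate node types
    from most to least important.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    simplified = {}
    for node, node_type in node_types.items():
        if node_type not in rules:
            continue   # not a candidate for simplification
        for a, b in combinations(sorted(neighbours.get(node, ())), 2):
            tuple_types = (node_types[a], node_types[b])
            preserved = rules[node_type]
            if tuple_types not in preserved and tuple_types[::-1] not in preserved:
                continue
            current = simplified.get((a, b))
            # Keep the highest-ranked type when several candidates connect a pair.
            if current is None or ranking.index(node_type) < ranking.index(current):
                simplified[(a, b)] = node_type
    return {(edge_type, a, b) for (a, b), edge_type in simplified.items()}


node_types = {
    "p1310": "Person", "p1340": "Person", "p1350": "Person",
    "call1": "PhoneCall", "call2": "PhoneCall", "mtg1": "Meeting",
}
edges = [
    ("p1310", "call1"), ("call1", "p1340"),
    ("p1340", "call2"), ("call2", "p1350"),
    ("p1310", "mtg1"), ("mtg1", "p1340"),
]
rules = {
    "PhoneCall": {("Person", "Person")},
    "Meeting": {("Person", "Person")},
}
simple_edges = simplify_paths(node_types, edges, rules,
                              ranking=["PhoneCall", "Meeting"])
```

Here two persons joined by both a phone call and a meeting receive a
single PhoneCall edge, because PhoneCall is ranked as the more
important concept.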
[0064] Geospatial Charting is a method to present geospatial
information of given entities in a geospatial coordinate system. A
common embodiment of this method is to display entities at specific
locations within a map view of a region. Through locating entities
as geospatial coordinates, an understanding of proximity between
entities and geospatial features can be conveyed, providing
additional insight.
[0065] Trajectory tracking is the ability to display movement paths
of entities using multiple observation data relations over time.
This is done in an automated fashion by interpreting the semantics
of relations for a given entity to positional observations. In
tracking an individual it is common to have positional events
recorded from many methods such as cell phone tower positions,
personal observations, sales receipts, and other type events. It is
also possible to calculate and present group activity--a plurality
of points, representing members of a logical group, based on shared
behavior within a given timeframe--within a geospatial perspective.
For example: a group of phone devices found to be calling one
another and acting as a coordinated group may be plotted in a
geospatial context with display elements (e.g., visually bounded by
a geometric shape) to suggest the known range of movement and use
of the devices. Trajectory tracking combines the location data from
these multiple, different events and presents them in an ordered
display, preferably in temporal order, as illustrated in exemplary
geospatial trajectory chart 1400 in FIG. 14. This tracking provides
insight into understanding patterns of behavior and potentially
estimating future locations.
[0066] The graph data representing the knowledge of an entity must
be converted into a sequence of distinct points in time for
trajectory tracking. FIGS. 15a, 15b and 15c illustrate the process
by which the trajectory for an individual Person is extracted from
the graph model. FIG. 15a shows a trajectory source graph 1500.
Paths rooted at the tracked entity 1503 are defined within the
graph structure for identifying sequence information. These path
definitions define the type of entity providing the mappable
coordinate, such as a Sighting 1507, 1509, Residence, PhoneCall, or
Incident. From these mappable entities, additional paths are
defined to indicate the attributes providing coordinate values and
temporal bounds 1512, 1513, 1514, 1515, 1516, 1518. Additional
paths may be defined to incorporate other useful attribute
information into the mappable points to assist in comprehension of
the charted trajectory as desired. FIG. 15b illustrates trajectory
path selection by depicting trajectory paths 1533, 1536, 1539
defining mappable points which are extracted from the original
source graph data. Each of these paths contains the coordinate and
time bound values for a mappable point for the subject 1503. FIG.
15c illustrates Trajectory Sequence Creation. Using the path
definitions, trajectory paths 1533, 1536 and 1539 are converted
into row-oriented records 1573, 1576 and 1579 in table 1570
containing coordinate and time information. Additional values may
also be included in the results records to provide contextual
information for each of these mappable points. The records can be
ordered by their start time values to create a time-based sequence
of locations for the subject. FIG. 16 shows an exemplary trajectory
path definition of one mappable point for a Person node type, which
can contain a plurality of these mappable point definitions.
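Trajectory Sequence Creation can be sketched as a conversion from
graph paths to time-ordered rows. The sketch does not reproduce the
path definition format of FIG. 16; it assumes that each path
definition names a link from the tracked entity to a mappable entity
type, and that mappable entities carry lat, lon, start and end
attributes. All identifiers, link names, coordinates and times are
hypothetical.

```python
from datetime import datetime

nodes = {
    "person1503": {"type": "Person"},
    "sighting1": {"type": "Sighting", "lat": 30.27, "lon": -97.74,
                  "start": "2008-05-02T14:00", "end": "2008-05-02T14:10"},
    "call1": {"type": "PhoneCall", "lat": 30.40, "lon": -97.70,
              "start": "2008-05-01T09:30", "end": "2008-05-01T09:35"},
    "sighting2": {"type": "Sighting", "lat": 30.30, "lon": -97.72,
                  "start": "2008-05-03T08:00", "end": "2008-05-03T08:05"},
    "vehicle1": {"type": "Vehicle"},
}
links = [
    ("sightedAt", "person1503", "sighting1"),
    ("placedCall", "person1503", "call1"),
    ("sightedAt", "person1503", "sighting2"),
    ("owns", "person1503", "vehicle1"),       # not a mappable path
]

# Path definitions: link name from the tracked entity -> mappable entity type.
PATH_DEFINITIONS = {"sightedAt": "Sighting", "placedCall": "PhoneCall"}


def trajectory_rows(subject, nodes, links, path_definitions):
    """Convert the subject's mappable paths into rows ordered by start time."""
    rows = []
    for name, source, target in links:
        if source != subject:
            continue
        if path_definitions.get(name) != nodes[target]["type"]:
            continue
        point = nodes[target]
        rows.append({
            "entity": target,
            "lat": point["lat"],
            "lon": point["lon"],
            "start": datetime.fromisoformat(point["start"]),
            "end": datetime.fromisoformat(point["end"]),
        })
    return sorted(rows, key=lambda row: row["start"])


rows = trajectory_rows("person1503", nodes, links, PATH_DEFINITIONS)
```

Each row corresponds to one mappable point, and the ordering by start
time yields the sequence of locations for the subject.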
[0067] FIG. 17 depicts the components of an embodiment of a
computer system 1700 that is programmed to perform the methods
described above. In this embodiment, the methods and functions
described above are performed by graph analytics platform software
that is executed by the computer system 1700, although other
implementations are within the scope of the invention, such as
implementing the methods on a hardware device or a combination of
hardware and software. In the example shown in FIG. 17, the system
is implemented on computer 1710. In an
exemplary embodiment, computer 1710 preferably includes one or more
processing units 1712 operatively connected via system bus 1711 to
persistent storage units, including hard drive 1716 and CD/DVD
drive 1714, network interface device 1718, and memory 1720 that
stores the software instructions executed by processing unit 1712
to implement the methods described herein. Computer 1710 preferably
includes data store 1762 (which may be local or remote),
conventional input/output devices including display 1764, keyboard
1766 and pointing/trackball device 1768. Preferably hard drive 1716
includes at least 5 GB of free space. Display 1764 can be local or
remote, and includes conventional monitors, printers, any other
form of graphic display device, and virtual display devices such
as a specified file.
[0068] The size requirements of memory 1720, persistent storage
units 1716 and 1714 and data store 1762 will increase with the
size of the observational data included in the system. Memory 1720
preferably includes operating system software 1730 which may be
conventional, such as Windows or Linux. Memory 1720 also preferably
includes link-oriented data set 1748 and the graph analytics
platform software 1740, comprising a plurality of lines of program
instructions or code which are executed by processor 1712 to
perform the methods described herein, including support and
operation of the link-oriented dataset 1748 and link analysis and
abstracting information for use in link analysis as is discussed
herein. In an embodiment, graph analytics platform software 1740
includes computational analysis module 1742 and a Link Charter
module 1744 for production of link-oriented data. In an embodiment,
Link Charter module 1744 is a client of computational analysis
module 1742. In an embodiment, link-oriented dataset 1748 can be
stored in or distributed among memory 1720, persistent storage
1714, 1716 and/or data store 1762.
[0069] The computer system 1700 may comprise one computer 1710 or
in an embodiment one or more computers 1710 interconnected via
network 1785 to increase performance of the system as needed. In a
preferred embodiment, computer 1710 is a workstation connected via
network 1785 to servers 1788 comprising persistent storage, data
stores, and other network resources. In another embodiment,
computer 1710 is a server in a client-server architecture, with one
or more exemplary client devices such as desktop computer 1793, PDA
or handheld device 1795, laptop computer 1791, or tablet computer
1798. The system also may be implemented using other computer
architectures, including but not limited to a mainframe system with
terminals, ASP (application service provider), peer-to-peer, and
similar architectures, and network 1785 can include a local area
network, a wide area network, the internet, the World Wide Web, a
wireless network, a mobile phone network and the like, all of which
are within the scope of the invention since the invention is not
limited to any particular computer architecture or network.
Generally, computer system 1700 can be any system that provides
sufficient computing power, memory, and persistent storage to
execute the methods described herein, including link analysis and
abstracting information for use in link analysis as is discussed
herein.
[0070] Those of skill will recognize that the techniques of the
embodiments described herein may be implemented to advantage in a
variety of sequential orders and that the present invention may be
generally implemented in magnetic media such as RAM drives, USB
drives, disks, tapes, DVDs and CD-ROMs or other storage media for
introduction into a system for abstracting information for use in
link analysis. In such cases, program instructions for executing
the steps described herein in abstracting information for use in
link analysis will be embedded in the media.
[0071] The terms and descriptions used herein are set forth by way
of illustration only and are not meant as limitations. Those
skilled in the art will recognize that many variations are
possible within the spirit and scope of the invention as defined in
the following claims, and their equivalents, in which all terms are
to be understood in their broadest possible sense unless otherwise
indicated. The described embodiments illustrate the scope of the
claims but do not restrict the scope of the claims.
* * * * *