U.S. patent application number 13/252630 was filed with the patent office on 2012-04-19 for component based approach to building data integration tools.
This patent application is currently assigned to TATA CONSULTANCY SERVICES LIMITED. Invention is credited to Prasenjit Das, Sreedhar Sannareddy Reddy, Raghavendra Reddy Yeddula.
Application Number: 20120095957 / 13/252630
Family ID: 45934979
Filed Date: 2012-04-19
United States Patent Application 20120095957
Kind Code: A1
Reddy; Sreedhar Sannareddy; et al.
April 19, 2012
Component Based Approach to Building Data Integration Tools
Abstract
A method facilitating interoperability across the integrating
platforms is provided by composing purpose specific integrating
tools using conceptual modeling framework and a set of primitive
foundational components. The model driven architecture provides
easy migration from one solution architecture to another. Also, the
present invention discloses a method to migrate from one
implementation to another using set of composed purpose specific
integration tools.
Inventors: Reddy; Sreedhar Sannareddy; (Maharashtra, IN); Das; Prasenjit; (West Bengal, IN); Yeddula; Raghavendra Reddy; (Maharashtra, IN)
Assignee: TATA CONSULTANCY SERVICES LIMITED, Maharashtra, IN
Family ID: 45934979
Appl. No.: 13/252630
Filed: October 4, 2011
Current U.S. Class: 707/602; 707/763; 707/E17.005; 707/E17.07
Current CPC Class: G06F 16/24535 20190101; G06F 16/254 20190101; G06F 16/24542 20190101; G06F 16/256 20190101
Class at Publication: 707/602; 707/763; 707/E17.005; 707/E17.07
International Class: G06F 17/30 20060101 G06F017/30
Foreign Application Data
Date: Oct 18, 2010; Code: IN; Application Number: 2893/MUM/2010
Claims
1) A method for composing integrating tools for facilitating
interoperability across integrating platforms, the said method
comprising processor implemented steps of: modeling a plurality of
data entities at multiple levels of abstraction in a source schema
and capturing relationships between the data entities using a query
language in conjunction with path expressions; mapping between the
modeled data entities of the source schema to obtain a unified
model query using a mapping component; translating the unified
model query of the source schema into equivalent target queries
on a target schema and checking for containment relationship
between the queries thereof using a query translation component;
translating one or more target queries into equivalent target data
flow graphs using a query to DFG translation component and;
executing the data flow graphs directly or translating the target
data flow graphs into a third party integrating platform and
executing it thereof using a DFG to query translation
component.
2) The method of claim 1, wherein the modeling of data entities at
multiple levels is achieved by conceptual modeling framework using
an object model or an entity relationship model as a modeling
language.
3) The method of claim 1, wherein the source schema is defined at
multiple levels using a conceptual model for at least one level and a
physical or conceptual model for subsequent levels.
4) The method of claim 1, wherein the query language and the path
expressions capture association relationships between the data
entities and get translated into corresponding structured query
language following a translation rule.
5) The method of claim 1, wherein the path expressions are governed
by a set of rules to support association relationships; the said
rules including: selecting from a cartesian product of relations
listed in `from` clause of query, the relations that satisfy
`where` clause and the path expressions; and projecting out columns
specified during selection; wherein the path expressions have
implicit existential quantification and; allowing the path
expressions in `select` clause of the query provided *:1 (many to
one) associations are involved.
6) The method of composing platform specific integrating tools as
claimed in claim 5, wherein the governing rule allows each
association segment of the path expression to introduce two
relations in the `from` clause, corresponding to the association
and associated class, and corresponding join conditions in the
`where` clause of the query.
7) The method of claim 1, wherein the mapping is specified by
either of global-as-view (GAV) or local-as-view (LAV) or GLAV
mappings.
8) The method of claim 1, wherein the target query is translated
into an equivalent target data flow graph in steps of: segmenting a
given query containing subgoals from multiple sources such that
each maximally connected subgraph that contains at least one
relation node constitutes a query segment; checking for an already
existing subquery having an equivalent query segment for reuse;
factoring the query segments into subqueries executable on a single
source and; constructing the target data flow graph from segmented
query representation expressing nodes and edges corresponding to
subgoals of query.
9) The method of composing platform specific integrating tools as
claimed in claim 1, wherein the data flow graphs are represented by
a meta model, the said meta model further defined by data flow
graph operators wherein the said operators include a set of input
and output ports along with associated expressions.
10) The method of composing platform specific integrating tools as
claimed in claim 1, wherein the data flow graph (DFG) to query
translation component translates the data flow graph consisting of
one target relation from one target database and at least one
source relation from at least one source database, wherein the
translation is performed by: formulating an outer query
corresponding to target DFG operator in a suitable format and;
formulating sub queries corresponding to each inner operator of the
data flow graph in a nested manner and optimizing the formulated
query by removing duplicate sub queries.
11) The method of composing platform specific integrating tools as
claimed in claim 10, wherein the target DFG operator is formulated
into its corresponding query using a specified format, the said
format comprising select clause containing names of output ports;
from clause containing sub queries corresponding to the operators,
the said operators having their output port connected to an input
port of the operator; where clause derived from an associated
expression of the operator, wherein semantics of the said operator
deciding a form of said clause.
12) The method of claim 1, wherein the composed integrating tools
can be an EII tool, an ETL tool, a data migration tool or a data
synchronization tool, or a combination thereof.
13) The method of claim 1, wherein the integrating tool,
particularly a data integration tool can be composed in the steps
of: building a source model and a unified global model using a
modeling framework; mapping between the source model and the global
model using a mapping component; specifying entity matching
specifications on the global model to generate object map using a
matching component; translating query updates on the source model
into equivalent query updates on matching sources and; executing
the said query updates on the sources using a query translator
component.
14) The method of claim 13, wherein the matching component takes
matching specifications expressed on the global model and matches
instances of a class across a plurality of disparate databases by
presenting all matching object pairs from said instances as one or
more tuples called an object map.
15) The method of claim 1, wherein the integration tool,
particularly a data migration tool is composed in the steps of:
modeling a source schema and a target schema using a modeling
component; mapping between the source schema and the target schema
using a mapping component; formulating queries corresponding to
each target table and translating the said queries into source
specific queries using a query translation component and
translating the said source specific queries into equivalent data
flow graphs using a query to DFG translation component, and
executing the said data flow graphs to migrate data from the source
schema to the target schema.
16) A computer program product comprising a computer usable medium
having a computer readable program code embodied therein, said
computer readable program code adapted to be executed to implement
a method for composing integrating tools for facilitating
interoperability across integrating platforms, said method
comprising: modeling a plurality of data entities at multiple levels
of abstraction in a source schema and capturing relationships
between the data entities using a query language in conjunction
with path expressions; mapping between the modeled data entities of
the source schema to obtain a unified model query using a mapping
component; translating the unified model query of the source
schema into equivalent target queries on a target schema and
checking for containment relationship between the queries thereof
using a query translation component; translating one or more target
queries into equivalent target data flow graphs using a query to
DFG translation component and; executing the data flow graphs
directly or translating the target data flow graphs into third
party integrating platform and executing it thereof using a DFG to
query translation component.
Description
FIELD OF THE INVENTION
[0001] The present invention relates to data integration and more
particularly, to a method of composing purpose specific integration
tools for facilitating migration from one integrating platform
solution to another.
BACKGROUND OF THE INVENTION
[0002] The term data integration refers to the problem of combining
data residing in heterogeneous sources and currently, it relates to
wide range of technologies, from extract, transform and load (ETL)
to enterprise application integration (EAI) to enterprise
information integration (EII) and various change propagation
technologies. There has been extensive theoretical research on data
integration systems, exploring various mapping systems and
languages, and their complexity results and limitations. However, no
single technology suffices for all the needs of an enterprise and these
needs keep changing with growing data volumes and changing business
needs. Consequently, enterprises end up tweaking their integration
systems continually and sometimes summarily moving them from one
technology to another, to keep up with these demands. This consumes
a lot of effort--by some estimates as much as 40% of all IT efforts
in an enterprise.
[0003] One reason why this consumes so much effort is the rigidity
of the available integration technologies. Once a solution is
implemented in one of these technologies, moving to another is like
implementing the entire solution afresh, which requires a large
amount of time, effort and computational resources. As a result,
people end up building ad-hoc, quick-fix solutions, which over time
lead to data fragmentation and inconsistencies. Keeping these fragments
synchronized to avoid inconsistencies puts a lot of strain on these
systems.
[0004] Lack of a common reference architecture and lack of a common
set of foundational primitives from which purpose specific
solutions can be composed are the principal reasons for this state
of affairs.
[0005] Though there exist a large number of vendors with tool
offerings in ETL, EAI, data migration, EII and so on, each uses
its own proprietary technology with no interoperability,
sometimes even among tools of the same category (e.g. vendor X's
ETL tool to vendor Y's ETL tool). Some vendors offer tools in many
categories (for example both ETL and EII), but again with no
interoperability between a tool of one category and a tool of
another. The principal reason for lack of interoperability among
tools of the same category (say ETL) is the lack of common
reference architecture across tool implementations. The principal
reason for lack of interoperability across categories (say ETL and
EII) is that their specifications are too close to the
implementation platform i.e. they are not at a level of abstraction
that allows their semantics to be mapped easily.
[0006] Moreover, the data inter-operability problem arises from the
fact that data, even within a single domain of application, is
available at many different sites, in many different schemas, and
even in different data models. The integration and transformation
of such data has become increasingly important for many modern
applications that need to support their users with informed
decision making.
[0007] While a number of useful approaches have been devised for
designing and deploying specific integration processes, there
remains a need for tools to enable easy migration of the
integration processes themselves, once designed, among different
technology platforms.
OBJECTS OF THE INVENTION
[0008] In accordance with the present invention, a method of
composing platform specific integration tools to facilitate easy
migration from source implementation solution to target
implementation solution is provided.
[0009] It is an object of the present invention to employ a model
driven common reference data integration architecture and a set of
primitive foundational components for composing platform specific
integrating tools.
[0010] Another object of the invention is to identify primitive
foundational components and combine them to build effective
purpose specific integrating tools.
[0011] It is an object of the invention to identify primitive
foundational components and combine them to build effective
purpose specific migration tools which can be directly employed to
migrate from one solution architecture to another.
[0012] It is yet another object of the present invention to
facilitate interoperability across widely available data
integration technologies.
[0013] Yet another object of the present invention is to combine
GAV and LAV mappings to model complex data entities existing in a
hierarchy.
[0014] It is another object of the present invention to support
complex relationships ranging from hierarchical relationships to
association relationships existing between data models via model
driven architecture.
[0015] It is yet another object of the present invention to use
platform independent representational formalism to represent
complex query language, mappings and data flow graphs employed to
build model driven data integration architecture.
[0016] It is a further object of the present invention to present
data flow graphs using rich meta models capable of being mapped to
required platform specifications for representing data dependencies
between a number of operations.
[0017] One of the objects of the present invention is to provide a
query translation component for translation of query on source
schema into an equivalent query on target schema.
[0018] Another object of the present invention is to provide query
to data flow graph translation component to translate a given query
into an equivalent data flow graph that can process data from
multiple databases.
[0019] One of the objects of the present invention is to achieve
performance optimization by performing a set of semantically
invariant graph transformation operations on the generated data
flow graph to obtain an optimal execution order on data flow
operators.
SUMMARY OF THE INVENTION
[0020] It is the purpose of the present invention to provide a
system and method for composing platform specific integration tools
to facilitate easy migration across the integrating platforms by
composing purpose specific integrating tools using conceptual
modeling framework and a set of primitive foundational
components.
[0021] In one of the preferred embodiments of the present invention,
a reference model driven architecture, a set of representational
formalisms and a set of primitive foundational components that
conform to this architecture are provided, along with a method of
composing these components to derive purpose specific integration
tools, the said method comprising processor implemented steps of:
[0022] modeling the data at multiple levels of abstraction and
capturing relationships, constraints and rules governing the
behavior of the said models;
[0023] using the set of representational formalisms to represent
queries, mappings, rules and data flow graphs;
[0024] specifying a set of primitive foundational components using
which purpose specific data integration tools are composed;
[0025] mapping between source schema and global schema using
mapping component selected from primitive foundational
components;
[0026] translating queries on a given model into equivalent target
queries on other mapped data source models using primitive
foundational components;
[0027] translating the target query into equivalent data flow
graphs using primitive foundational components and;
[0028] executing the data flow graphs directly or translating the
data flow graph into third party ETL and executing it thereof.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
[0029] The foregoing summary, as well as the following detailed
description of preferred embodiments, are better understood when
read in conjunction with the appended drawings. For the purpose of
illustrating the invention, there is shown in the drawings example
constructions of the invention; however, the invention is not
limited to the specific methods and system disclosed. In the
drawings:
[0030] FIG. 100 represents a well delineated hierarchical view of an
enterprise in the present invention.
[0031] FIG. 200 highlights a meta model to specify ETL data flow
graphs.
[0032] FIG. 300 is a schematic illustration of an EII (Enterprise
information integration) tool composed from primitive
components.
[0033] FIG. 400 depicts an integration tool that combines ETL and
EII wherein ETL is used to build a warehouse and EII is used to
access data uniformly from both warehouse and original sources.
[0034] FIG. 500 is a schematic representation of a data
synchronization tool composed from primitive components.
[0035] FIG. 600 is a representation of a data migration tool for
translating queries from source model to target model, and for
generating ETL to migrate data from source database to target
database.
DETAILED DESCRIPTION OF THE INVENTION
[0036] Some embodiments of this invention, illustrating all its
features, will now be discussed in detail.
[0037] The words "comprising," "having," "containing," and
"including," and other forms thereof, are intended to be equivalent
in meaning and be open ended in that an item or items following any
one of these words is not meant to be an exhaustive listing of such
item or items, or meant to be limited to only the listed item or
items.
[0038] It must also be noted that as used herein and in the
appended claims, the singular forms "a," "an," and "the" include
plural references unless the context clearly dictates otherwise.
Although any systems and methods similar or equivalent to those
described herein can be used in the practice or testing of
embodiments of the present invention, the preferred systems and
methods are now described.
[0039] A variety of data integration tools exist, each with
particular strengths and weaknesses. As a given user's needs
evolve, the user may desire to move from using one tool to using
another. A problem for such a user is that the user may have
devoted significant time and resources to the development of data
integration jobs using one tool and would not like to repeat the
effort while moving to another tool. However, converting data
integration jobs has to date required very extensive coding
efforts. Thus, a need exists for improved methods and systems for
converting data integration jobs that use one integration tool into
data integration jobs that use a different integration tool.
[0040] As mentioned, there remains a need for the right reference
architecture and a set of foundational primitives using which data
integration problems can be addressed efficiently. The embodiments
of the present invention address precisely this need by providing
a method and a set of primitive components using which purpose
specific integration tools can be composed efficiently.
[0041] Referring now to the drawings, there are shown preferred
embodiments of the invention. FIG. 100 illustrates a hierarchical
structure of an enterprise where the data exists at enterprise
level, department level, function level and so on. The data
contained within an enterprise at each hierarchical level needs to be
modeled using richer conceptual models that allow modeling of data
entities at multiple levels of abstraction and capture
relationships existing between them. The object model is chosen as
a conceptual modeling language to define unified global schema. The
source model can however be defined at multiple levels using
conceptual modeling framework for at least one level and conceptual
model or a physical model for subsequent levels.
[0042] The model consists of Class, Attribute and Association.
Classes and associations can have subtype relationships. A subtype
relationship can be further qualified as complete or incomplete and
as overlapping or disjoint. A class can specify a subset of its
attributes as a key. Association cardinalities specify constraints
on participating objects. Objects are identified by unique IDs. If
a class has a defined key then its object ID is modelled as a
function over the key attributes. Mappings between object models
are specified in one of the following ways:
[0043] GAV (Global as view) mappings, where a class or an
association at some level is specified as a view over the models at
the next lower level.
[0044] LAV (Local as view) mappings, where a class or an
association at some level is specified as a view over the model at
the next higher level.
[0045] Rules that relate classes and associations of adjacent
levels. Rules are essentially like GLAV mappings.
[0046] Subtype relationships, where a class or an association at a
lower level is specified as a subtype of a class or association at
the higher level.
[0047] Mapping between an object model and a relational model is
specified as follows:
[0048] GAV mappings, where a class is specified as a view over a
set of relations. For a class that has a primary key defined, the
view must specify all its primary key attributes.
[0049] LAV mappings, where a relation is specified as a view over a
set of classes.
[0050] The mapping between the defined global schema and the source
schema employs a set of representational formalisms to represent
queries, mappings and data flow graphs. The representational
formalisms used in the conceptual modelling are described below.
[0051] Query Language: In general, structured query language (SQL)
is used as a language for managing data in relational databases.
Since conceptual models are richer in structure, we need a more
specialized query language that exploits this structure, hiding
implementation details such as how the model is implemented in a
relational database.
[0052] At a minimum we need a language that supports association
navigation. SQL does not offer any special constructs for
association navigation as it is designed for the relational model.
We extend SQL with path expressions to support association
navigation. We call this extended language PSQL. PSQL can be
pre-processed into equivalent SQL. Suitable examples specifying the
representational format and characteristic features of PSQL are
demonstrated below.
[0053] We use the following conceptual model in the examples:
[0054] Classes: Company, Department, Person
[0056] Associations: Company.(department/company; 1:*).Department;
Department.(employee/department; *:1).Person
[0059] Next, we present a set of rules that define the semantics of
PSQL, and explain these rules with the help of examples.
EXAMPLE 1
List the Names of Persons and Their Companies
[0060] Select p.name, c.name
from Person p, Company c
where
p.department.company = c;
[0064] In this example p.department.company is a path expression.
It navigates associations from person to department and from
department to company.
[0065] Rule 1: Path expressions can occur in the `where` clause.
Semantics are the same as in SQL--from the cartesian product of the
relations listed in the `from` clause, select those that satisfy
the `where` clause, and project out columns specified in the
`select` clause.
EXAMPLE 2
List the Names of Companies that Have Foreign Nationals
[0066] Select c.name
from Company c
where
c.department.employee.nationality <> `Indian`;
[0070] Rule 2: Path expressions have implicit existential
quantification. In the above expression, at least one employee of
at least one department of a company must be a non-Indian; it is
not required that all employees of all departments of a company be
non-Indians.
EXAMPLE 3
List the Names of All Persons Who Work for Company `X`, and Their
Departments
[0071] Select p.name, d.name
from Person p, Department d
where
p.department = d and d.company.name = `X`;
EXAMPLE 4
List the Names of Persons and Their Companies
[0075] Select p.name, p.department.company.name
from Person p;
[0077] Rule 3: Path expressions are allowed in `select` as well,
provided they are *:1 paths. The above example is correct because
the cardinalities of both the associations involved are `*:1`.
[0078] With regard to the above query language, it should be noted
that it can be translated into an equivalent SQL format. Assuming a
canonical representation where an association is stored as a
relation with two columns corresponding to the two roles, path
expressions can be translated to SQL format using the following
rule.
[0079] Translation rule: Each association segment of a path
expression, whether it occurs in where or select clause of the
query, introduces two relations in the `from` clause, corresponding
to the association and the associated class, and corresponding join
conditions in the where clause.
[0080] The above rule is elaborated below using examples.
EXAMPLE 1
[0081] PSQL:
[0082] Select p.name, c.name
from Person p, Company c
where
p.department.company = c;
[0086] Equivalent SQL:
[0087] Select p.name, c.name
from
Person p, Company c, Department_employee_department_Person dp,
Department d, Company_department_company_Department cd
where
dp.employee = p.id and dp.department = d.id and
cd.department = d.id and cd.company = c.id;
EXAMPLE 2
[0094] PSQL:
[0095] Select c.name
from Company c
where
c.department.employee.nationality <> `Indian`;
[0099] Equivalent SQL:
[0100] Select c.name
from
Company c, Company_department_company_Department cd,
Department d, Department_employee_department_Person dp, Person p
where
cd.company = c.id and cd.department = d.id and
dp.department = d.id and dp.employee = p.id and
p.nationality <> `Indian`;
EXAMPLE 3
[0107] PSQL:
[0108] Select p.name, p.department.company.name
from Person p;
[0110] Equivalent SQL:
[0111] Select p.name, c.name
from
Person p, Department_employee_department_Person dp, Department d,
Company_department_company_Department cd, Company c
where
dp.employee = p.id and dp.department = d.id and
cd.department = d.id and cd.company = c.id;
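The translation rule above can also be sketched procedurally. The following is a minimal illustration, assuming a schema dictionary that records, for each association name, its canonical association relation, the two role columns, and the associated class; all names here are assumptions for illustration, not the patented implementation.

```python
# A minimal sketch of the path-expression translation rule: each
# association segment introduces the association relation and the
# associated class into the FROM clause, plus join conditions in WHERE.

def expand_path(start_alias, segments, schema):
    """Expand a path expression (e.g. p.department.company) into the
    extra FROM relations and WHERE join conditions the rule prescribes."""
    from_clause, where_clause = [], []
    current = start_alias
    for assoc in segments:
        table, src_role, dst_role, dst_class = schema[assoc]
        a, c = f"{assoc}_a", f"{assoc}_c"  # aliases for the two new relations
        from_clause += [f"{table} {a}", f"{dst_class} {c}"]
        where_clause += [f"{a}.{src_role} = {current}.id",
                         f"{a}.{dst_role} = {c}.id"]
        current = c  # navigation continues from the associated class
    return from_clause, where_clause, current

# Example: the path p.department.company from Example 1 above.
schema = {
    "department": ("Department_employee_department_Person",
                   "employee", "department", "Department"),
    "company": ("Company_department_company_Department",
                "department", "company", "Company"),
}
from_c, where_c, last = expand_path("p", ["department", "company"], schema)
```

Applied to Example 1, this yields the same two association relations and four join conditions shown in the equivalent SQL above.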
[0118] Further, the representational formalisms also include view
definition language which is quite similar to the PSQL query
language as discussed above. The only exception is that a view must
specify attribute names of the target relation as aliases in the
select clause.
[0119] An example differentiating the view definition language from
PSQL query language is given below.
EXAMPLE
[0120] Suppose we have the following models:
[0121] Target: Employee(empName, address, salary, deptName)
[0122] Source: Emp(name, address, salary), Dept(name, manager) with
an association <emp:dept; *:1> from Emp to Dept.
[0123] Target view definition:
[0124] select e.name empName, e.address address, e.salary salary,
d.name deptName
from Emp e, Dept d
where
e.dept = d;
[0128] The conceptual modelling framework may have a set of rules
that specify relationships among its elements. The rule language is
thus defined to support and capture relationships which may be in
the form of a constraint, a derivation or a mapping. Rules can
exist in a range of forms--from rules that can be efficiently
executed by a machine to rules that can only be interpreted by
humans. However, the scope of the present invention is confined to
machine executable rules.
[0129] These rules have the following general form:
∀x (φ(x) → ∃y ψ(x, y))
[0130] Where x and y are tuples of variables, φ(x) is a
conjunction of atomic formulas over a model of interest with
variables in x and with each variable in x occurring in at least
one conjunct of φ(x), and ψ(x, y) is a conjunction of
atomic formulas over another model of interest (could be the same
as the first) with variables among those in x and y.
formulas may comprise of object terms, property terms, association
terms, arithmetic expressions, relational expressions and
aggregation functions. One may devise any suitable surface syntax
to represent the rules (such as the popular IF-THEN notation) as
long as they are reducible to the above general form. One can even
allow disjunctions in the surface syntax as long as they can be
decomposed into rules of the above form. One can readily verify
that rules, as specified above, are similar in form to GLAV
mappings discussed in data integration literature [1], and hence
can be decomposed into suitable GAV and LAV rules that can be used
by data integration components.
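As an illustration only (this example is not taken from the specification), a rule of the above general form relating a hypothetical source relation Emp to a hypothetical target class Employee could read:

```latex
\forall e \, \forall n \;\bigl( \mathit{Emp}(e) \wedge \mathit{name}(e, n)
  \;\rightarrow\; \exists t \,( \mathit{Employee}(t) \wedge \mathit{empName}(t, n) ) \bigr)
```

Here the antecedent plays the role of φ(x) over the source model and the consequent plays the role of ψ(x, y) over the target model, matching the general form stated above.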
[0131] The above discussed rule language also forms a part of
representational formalism. Next, representational formalism
includes data flow graphs using a meta model as shown in FIG. 200.
The model is abstract enough and rich enough to be mappable to ETL
specifications of most commercial tools.
[0132] DFG Operator is defined as the super class of all operators
while the sub classes include Source Qualifier, Target, Join,
Union, LookUp, Expression, etc. One can add a new operator by
adding a new subclass and implementing the operator interface. An
operator has a set of input ports, a set of output ports and a set
of associated expressions. An operator's input flows in through the
input ports and its output flows out through the output ports. The
semantics of the associated expressions depends on the operator
type (e.g. for join it is a relational expression). A port has an
associated variable that binds to the values that flow in or flow
out. The output port of one operator may be connected to the input
port of another operator.
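The operator/port structure just described can be sketched as follows; class and field names here are assumptions for illustration, not the patent's actual meta model schema.

```python
# Illustrative sketch of the DFG meta model: operators with input/output
# ports and associated expressions, and edges connecting ports.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Port:
    name: str  # the associated variable that binds to flowing values

@dataclass
class DFGOperator:  # super class of all operators
    name: str
    input_ports: List[Port] = field(default_factory=list)
    output_ports: List[Port] = field(default_factory=list)
    # semantics of the expressions depend on the operator type
    expressions: List[str] = field(default_factory=list)

class SourceQualifier(DFGOperator):
    pass

class Join(DFGOperator):  # its expression is a relational (join) expression
    pass

def connect(out_port: Port, in_port: Port, edges: List[Tuple[Port, Port]]):
    """Connect an output port of one operator to an input port of another."""
    edges.append((out_port, in_port))

# Wire a source qualifier into a join.
edges: List[Tuple[Port, Port]] = []
src = SourceQualifier("src", output_ports=[Port("out")])
join = Join("join", input_ports=[Port("in")], expressions=["a.id = b.id"])
connect(src.output_ports[0], join.input_ports[0], edges)
```

Adding a new operator amounts to adding a new subclass, mirroring the extension mechanism described in the text.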
[0133] The modeling architecture and representational
formalisms discussed above enable enterprise data assets to be
modeled and mapped at a conceptual level of abstraction, in an
implementation independent manner.
[0134] Next discussed is the set of primitive components from which
purpose specific integration tools can be composed. The
foundational primitive components so identified include the set of
following stated components, which are discussed below along with
supporting embodiments of the invention.
[0135] Mapping component: Mapping component helps develop mappings
between models. Given two models it can discover correspondences
between the models and interactively prompt a user to refine them
into full-fledged mappings. It provides a user-friendly interface
that hides much of the complexity of the mapping language discussed
earlier. For instance, with respect to the example given earlier,
when a user identifies Employee as the target and Emp and Dept as
the sources, it can automatically discover the join between Emp and
Dept by traversing the association between them.
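As a rough illustration of this join-discovery step, the sketch below proposes a join condition by scanning modeled associations between the chosen source entities; the tuple-based association representation is an assumption made here, not the mapping component's actual interface:

```python
def discover_joins(sources, associations):
    """Propose join conditions for the associations linking chosen sources.

    sources: entity names the user selected as sources (e.g. Emp, Dept)
    associations: list of (entity1, attr1, entity2, attr2) tuples modeling
    e.g. the Emp -> Dept association via a foreign key (illustrative form).
    """
    chosen = set(sources)
    joins = []
    for e1, a1, e2, a2 in associations:
        # an association between two chosen sources yields a join condition
        if e1 in chosen and e2 in chosen:
            joins.append(f"{e1}.{a1} = {e2}.{a2}")
    return joins
```

In the Employee example, selecting Emp and Dept as sources would surface the association between them as a candidate join for the user to confirm.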
[0136] Query Translation Component: Query translation component
provides the following functionality:
[0137] Given a set of source models S and a set of target models T
and a mapping between them (possibly via a number of intermediate
model layers), translate a query Q.sub.s on S into an equivalent
query Q.sub.t on T. Query translation can be performed by using a
suitable query rewriting approach provided in related prior
art.
[0138] Query equivalence testing component: Query equivalence
testing component provides the following functionality:
[0139] Given a set of source models S and a set of target models T
and a mapping between them (possibly via a number of intermediate
model layers), and given two queries Q.sub.s and Q.sub.t on models
S and T respectively, test if there is any containment relationship
between the two (i.e. Q.sub.s ⊆ Q.sub.t, Q.sub.t ⊆ Q.sub.s, or
Q.sub.s=Q.sub.t).
[0140] This involves first translating Q.sub.s into an equivalent
query Q.sub.st on target schema using the query translation
component and then testing for containment relationship between
Q.sub.st and Q.sub.t which is well discussed in literature.
[0141] Query to DFG translation component: This component provides
the following functionality:
[0142] Given a query consisting of relations that come from
multiple databases, translate it into an equivalent data-flow
graph. Given below is an approach that takes a query spanning
multiple databases and translates it into an equivalent data-flow
graph. The algorithm has three sequential steps: query
segmentation, DFG generation and DFG optimization.
[0143] Query segmentation: Input query may contain subgoals from
multiple sources. These subgoals need to be segmented into source
specific groups. Each such segment should then be factored out as a
subquery that can be executed on a single source. The emphasis is on
maximizing the size of each segment so that as many subgoals as
possible are pushed into a single source query.
[0144] Query Segmentation:
[0145] A graph data structure to identify query segments is constructed as described below.
[0146] A node is constructed for each relation subgoal and each expression subgoal.
[0147] An edge between two relation nodes is created if they belong to the same source and share at least one variable.
[0148] An expression node is labeled as belonging to a source S if each of its variables is either shared by a relation belonging to S or shared by another expression node that is labeled as belonging to source S.
[0149] Edges are created from an expression node labeled S to each of the relation nodes of S that share a variable with it.
[0150] Edges are created from an expression node labeled S to each expression node labeled S that shares a variable with it.
[0151] Each maximally connected subgraph that contains at least one
relation node is a separate query segment. Such query segments can
be separated out as subqueries. The head variables of a subquery
are determined by selecting the variables of the query segment that
are required outside the segment (i.e. either in the query head or
in other query segments). Calls to these subqueries together with
the remaining conditions make up the original query.
[0152] Before forming a subquery for a query segment, it is checked
whether there already exists a subquery with an equivalent query
segment. If so, that subquery is reused, which assists in performance
optimization. This is illustrated using the example described below:
[0153] E.g.
[0154] Q(a1(v1), a2(v2), a3(v3)):--R1(a4(v1), a5(v4)), R2(a6(v4), a7(v2)), R3(a8(v1), a9(v5)), v2>v5, R4(a10(v5), a11(v6)), R5(a12(v1), a13(v7)), v6>v7, v3=v5+v6.
[0155] Let's say R1, R2 & R3 are from source S1 and R4, R5 are from source S2.
[0156] Two query segments are then identified:
[0157] Qs1(a4(v1), a7(v2), a9(v5)):--R1(a4(v1), a5(v4)), R2(a6(v4), a7(v2)), R3(a8(v1), a9(v5)), v2>v5.
[0158] It is to be noted that v4 is not in the head since it is not required outside the query segment.
[0159] Qs2(a12(v1), a10(v5), a11(v6)):--R4(a10(v5), a11(v6)), R5(a12(v1), a13(v7)), v6>v7.
[0160] Similarly, v7 is not in the head since it is not required outside the query segment.
[0161] Now the original query is rewritten as:
[0162] Q(a1(v1), a2(v2), a3(v3)):--Qs1(a4(v1), a7(v2), a9(v5)), Qs2(a12(v1), a10(v5), a11(v6)), v3=v5+v6.
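The segmentation procedure above can be sketched as a connected-components computation over the graph of paragraphs [0145]-[0150]; the tuple representation of subgoals, and the simplification that an expression's source label is derived directly from the relation variables of a source, are assumptions of this sketch:

```python
from collections import defaultdict


def segment_query(relations, expressions):
    """Return (query segments, residual expressions).

    relations: list of (relation_name, source, variable set)
    expressions: list of (expression_text, variable set)
    The residual expressions stay in the outer (rewritten) query.
    """
    src_vars = defaultdict(set)
    for _, src, vs in relations:
        src_vars[src] |= vs

    # one graph node per relation subgoal and per expression subgoal
    parent = {n: n for n, _, _ in relations}
    parent.update({e: e for e, _ in expressions})

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # edge between two relation nodes of the same source sharing a variable
    for i, (n1, s1, v1) in enumerate(relations):
        for n2, s2, v2 in relations[i + 1:]:
            if s1 == s2 and v1 & v2:
                union(n1, n2)

    # an expression is labeled with source S when all its variables occur
    # in S; it is then linked to the S relations it shares a variable with
    residual = []
    for expr, vs in expressions:
        homes = [s for s, sv in src_vars.items() if vs <= sv]
        if not homes:
            residual.append(expr)
            continue
        for name, src, rv in relations:
            if src == homes[0] and vs & rv:
                union(expr, name)

    # each maximally connected subgraph with a relation node is a segment
    segments = defaultdict(set)
    for name, _, _ in relations:
        segments[find(name)].add(name)
    return [sorted(seg) for seg in segments.values()], residual
```

On the example above this yields the segments {R1, R2, R3} and {R4, R5} (the latter connected only through the expression v6>v7), with v3=v5+v6 left in the outer query.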
[0163] The next phase describes the generation of the data flow
graph from the segmented query representation, continuing the
example above:
[0164] Generating DFG from the segmented query representation:
[0165] The approach is however discussed only for certain key DFG
operators and the following naming convention is used in the
discussion below:
[0166] T--Target
[0167] S--Source
[0168] SQ--Source query segment
[0169] Q--Query in general
[0170] `{<in-flow>} <DFG-operator>[(<parameters>)] {<out-flow>}`--a
DFG operator with its in-flow, optional parameters and out-flow
[0171] DFG(<query>)--DFG transformation of the query fragment
<query>
[0172] The outer query corresponds to the target, inner queries
correspond to intermediate nodes, and source-specific queries
correspond to the sources. The DFG job-flow is generated backwards,
starting from the target.
[0173] In the following it is discussed how different query forms
map to different DFG operators:
TABLE-US-00001

Query form (join):
  Q(<head-var-list>) :- Q1(<Q1-var-list>), Q2(<Q2-var-list>), .., Qm(<Qm-var-list>), <Q cond>.
DFG operator:
  {DFG(Q1(<Q1-var-list>)), DFG(Q2(<Q2-var-list>)), .., DFG(Qm(<Qm-var-list>))} JOIN(<Q cond>) {<head-var-list>}

Query form (aggregation):
  Q(v1,..,vj, <aggr>(vk)) :- <body>
DFG operator:
  {DFG(`Q(v1,..,vj, vk) :- <body>`)} <aggr>(vk, vk-aggr-value) {v1,..,vj, vk-aggr-value}

Query form (source query segments):
  Q(v1,..,vn) :- SQ1(vi, vj,..vk), SQ2(..), .., SQm(..), <Q cond>.
  SQ1(v1,..,vl) :- R1(<arg1>), R2(..), .., Rs(..), <SQ1 cond>.
DFG operator:
  {DFG(R1(<arg1>)), DFG(R2(<arg2>)), .., DFG(Rs(<argS>))} JOIN(<SQ1 cond>) {v1,..,vl}
  Where, if possible, <cond> should be split and pushed down to R* (using the usual optimization techniques), resulting in:
  {R1} SELECT(<R1 cond>) {<R1 columns>} PROJECT(<arg1>) {<arg1>}

Query form (union):
  Q(v1,..,vn) :- <body1>
  Q(v1,..,vn) :- <body2>
  ..
  Q(v1,..,vn) :- <bodyk>
DFG operator:
  {DFG(`Q(v1,..,vn) :- <body1>`), DFG(`Q(v1,..,vn) :- <body2>`), .., DFG(`Q(v1,..,vn) :- <bodyk>`)} UNION {v1,..,vn}

Query form (shared subgoal):
  Q1(..) :- <body1>, Qk(<Qk-var-list>).
  Q2(..) :- <body2>, Qk(<Qk-var-list>).
  ..
  Qn(..) :- <bodyn>, Qk(<Qk-var-list>).
DFG operator:
  {DFG(Qk(<Qk-var-list>))} SPLIT {<Qk-var-list-1>, <Qk-var-list-2>, .., <Qk-var-list-n>}
  Where <Qk-var-list-i> is one of the inputs -- corresponding to the subgoal Qk(..) -- to the JOIN operator corresponding to `Qi(..) :- <bodyi>, Qk(<Qk-var-list>)`.

Query form (negation):
  Q(v1,..,vn) :- <body1>, not Q2(vi,..,vk).
  Refactor this query as:
  Q(v1,..,vn) :- <query that selects (vi,..,vk)>, <query that selects rest of (v1,..,vn)>, <cond>, not Q2(vi,..,vk).
DFG operator:
  { {DFG(<query that selects (vi,..,vk)>), DFG(Q2(vi,..,vk))} DIFF {vi,..,vk}, DFG(<query that selects rest of (v1,..,vn)>) } JOIN(<cond>) {v1,..,vn}

Query form (lookup):
  Q(v1,..,vl,..,vn) :- <body>, LT(vk, vl)
  Where `vk` is a key variable selected in <body>, and LT is a lookup table that stores surrogate keys.
DFG operator:
  {DFG(<body>)} LOOKUP(LT, vk, vl) {v1,..,vl,..,vn}
  We view lookup as a join operation, but on a table that is marked as a lookup table.

Query form (function application; functions occur in expression subgoals):
  Q(v1,..,vf,..,vn) :- <body that selects the rest of (v1,..,vn)>, vf = f(vi,..,vk).
DFG operator:
  {DFG(<body that selects the rest of (v1,..,vn)>)} FUNC(f, (vi,..,vk), vf) {v1,..,vf,..,vn}
[0174] As described in the above table, a DFG can be generated by
mapping different query forms to their corresponding DFG operators.
The generated DFG can be optimized further using techniques similar
to those used in query graph optimization, which are well documented
in the literature.
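The join and union rows of the table can be illustrated with a small recursive translation; the nested-dict query representation, and the operator names as dictionary entries, are assumptions of this sketch, not the internal representation used by the tool:

```python
def query_to_dfg(q):
    """Translate a query into a DFG node per the table above: the DFGs of
    the subgoals flow into a JOIN over the query condition (projecting the
    head variables), and alternative bodies flow into a UNION."""
    if q["kind"] == "source":
        # leaf: a source-specific query segment becomes a Source Qualifier
        return {"op": "SOURCE_QUALIFIER", "query": q["name"], "out": q["head"]}
    if q["kind"] == "join":
        return {"op": "JOIN",
                "cond": q["cond"],
                "in": [query_to_dfg(sq) for sq in q["subgoals"]],
                "out": q["head"]}                 # {<head-var-list>}
    if q["kind"] == "union":
        return {"op": "UNION",
                "in": [query_to_dfg(b) for b in q["bodies"]],
                "out": q["head"]}                 # {v1,..,vn}
    raise ValueError(f"unhandled query form: {q['kind']}")
```

Applied to the worked example, the rewritten query Q over Qs1 and Qs2 becomes a JOIN operator fed by two Source Qualifiers.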
[0175] DFG to query translation component: This component provides
the following functionality:
[0176] Given a data flow graph (DFG) that computes a target data
view from a set of relations that come from one or more source
databases, translate the DFG into an equivalent query on the
relations. The translation consists of two procedures as described
below:
[0177] Operator-to-Query Procedure:
[0178] The query corresponding to each operator (except Source Qualifier instances) has the following format:
[0179] The Select clause contains the names of the output ports.
[0180] The From clause contains sub-queries corresponding to the operators whose output ports are connected to the input ports of this operator, with sub-query aliases set to the names of the corresponding input operators.
[0181] The Where clause is derived from the expressions associated with the operator. Its precise form depends on the semantics of the operator type; for example, in the case of the join operator the expressions are simply dumped into the where clause after replacing all references to input ports by their corresponding subquery qualified names, i.e. <operatorName.portName>.
[0182] The Source Qualifier operator has an associated query which is simply dumped as is.
[0183] DFG-to-Query Procedure:
[0184] Formulate the query of the target operator using the
procedure described above. The sub-query creation iteratively
traverses backwards towards the sources, so queries corresponding to
all the operators are created. We are however interested only in the
query corresponding to the Target operator, as this corresponds to
the query of the DFG.
[0185] This query is then optimized, after conversion to an internal
representation, to remove duplicate sub-queries. This avoids
re-evaluation of the same sub-query, thereby contributing to
performance optimization.
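A minimal sketch of the operator-to-query procedure follows; the nested-dict DFG representation is an assumption, and only Source Qualifier and expression-bearing operators such as join are handled:

```python
def operator_to_query(op):
    """Translate a DFG operator into a query string per the procedure above."""
    if op["op"] == "SOURCE_QUALIFIER":
        return op["query"]  # the associated query is dumped as is
    select = ", ".join(op["out"])  # select clause: the output port names
    # from clause: aliased sub-queries of the connected input operators
    froms = ", ".join(
        f"({operator_to_query(child)}) {child['name']}" for child in op["in"])
    sql = f"SELECT {select} FROM {froms}"
    # where clause: the operator's associated expressions (join semantics)
    where = " AND ".join(op.get("expressions", []))
    if where:
        sql += f" WHERE {where}"
    return sql
```

Invoking this on the target operator recursively produces the query of the whole DFG, as the DFG-to-Query procedure describes.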
[0186] Discussed next is a matching component which can be used to
figure out matching entity instances in two databases. Its
capability to provide such functionality is exemplified below:
[0187] Matching Component: Given two database schemas S.sub.1 and
S.sub.2 that are mapped to a common global schema G, and given a
specification to detect duplicate instances in G (i.e. a
deduplication specification over G), and given two database
instances I.sub.1, and I.sub.2 of schemas S.sub.1 and S.sub.2
respectively, output all matching object pairs from I.sub.1 and
I.sub.2 as tuples of the form <S1:<entity>,
S2:<entity>, <pk1-value tuple>, <pk2-value
tuple>>.
[0188] Each tuple identifies an entity instance in I.sub.1 and its
corresponding matching entity instance in I.sub.2. This list of
tuples is called an object map.
[0189] The deduplication specification essentially identifies when
two instances of an entity in the global schema refer to the same
underlying entity.
[0190] E.g.
[0191] The following model exists in the global schema:
[0192] Dept, Emp
[0193] Dept.(emp(0 . . . *)/dept(1)).Emp
[0194] The following matching specification states when two employee
instances can be considered to refer to the same real-world
employee:
TABLE-US-00002
Match e1,e2:Emp when {
  e1.dob = e2.dob and
  approx_match(e1.name, e2.name, 80) and
  e1.dept.name = e2.dept.name
}
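A rough illustration of evaluating such a deduplication specification is sketched below; `approx_match` is implemented with `difflib` purely for illustration, since the patent does not fix a particular fuzzy-matching algorithm, and the flat record layout is an assumption:

```python
import difflib


def approx_match(a, b, threshold):
    """Hypothetical fuzzy string comparison on a 0-100 similarity scale."""
    return difflib.SequenceMatcher(None, a, b).ratio() * 100 >= threshold


def object_map(emps1, emps2):
    """Output matching pairs per the specification above: same dob,
    approximately matching name (threshold 80), same department name.
    Each result tuple names the entity on both sides plus the key values."""
    return [("S1:Emp", "S2:Emp", e1["pk"], e2["pk"])
            for e1 in emps1 for e2 in emps2
            if e1["dob"] == e2["dob"]
            and approx_match(e1["name"], e2["name"], 80)
            and e1["dept"] == e2["dept"]]
```

The returned list plays the role of the object map: each tuple pairs an instance of I.sub.1 with its match in I.sub.2.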
[0195] Platform specific component: A platform specific component
translates between internal representations of queries and DFGs and
their platform specific representations. One such component exists
for each supported platform. They provide the import/export bridge
between the tool and third-party platforms, and thus play a key
role in the overall architecture.
[0196] For example, from internal representation of DFG to
Informatica ETL representation, or from internal representation of
query to Oracle specific representation, and so on.
[0197] After identifying the essential set of primitive components,
what remains is the composition of purpose specific integration
tools, the method for which is discussed henceforth.
[0198] Composing Purpose Specific Integration Tools
[0199] We discuss how the modeling framework and the primitive
components discussed above can be used to compose integration
tools. This is illustrated with a few known tool categories as
given below:
[0200] Enterprise Information Integration (EII) Tool
[0201] EII tools enable a unified view of data by mapping a set of
database schemas to a unified global schema. They provide a query
interface in terms of the global model. The tool translates these
queries into data source specific queries and assembles the results
returned by the data sources. Composition of an EII tool from the
primitive components is shown in FIG. 300. The method for
constructing an EII tool using the defined foundational components
includes the steps of:
[0202] Using the conceptual modeling framework to create the data models.
[0203] Using the mapping component to specify mappings between these models.
[0204] Using the query translation component to translate a unified model query into an equivalent target query involving relations from across source schemas.
[0205] Using the `query-to-DFG` component to translate the target query into an equivalent DFG and finally
[0206] Translating the DFG into third party ETL and executing it, or executing the DFG directly using the DFG execution component.
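The composition steps above can be sketched as simple function composition, with each primitive component stubbed as a function; all names here are illustrative, not the components' actual interfaces:

```python
def compose_eii(translate_query, query_to_dfg, execute_dfg):
    """Return an EII query interface composed from primitive components:
    a global-model query is translated to a source-level query, turned
    into a DFG, and executed."""
    def answer(global_query):
        return execute_dfg(query_to_dfg(translate_query(global_query)))
    return answer
```

The same wiring pattern underlies the other tool compositions in this section: each tool is a different arrangement of the same primitive components.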
[0207] Extract, Transform and Load (ETL) Tool
[0208] An ETL tool can be composed from the primitive components as
shown in FIG. 400, the method being as follows:
[0209] Use the modeling tool to build the warehouse model, source models and global model, if any.
[0210] Use the mapping tool to map the warehouse schema either to the global schema or to the source schemas directly.
[0211] Treat these mapping views as source queries and use the query translation component to translate these queries into equivalent target queries on data source models.
[0212] Use the `query-to-DFG` component to translate the target queries into equivalent DFGs and
[0213] Translate the DFGs into third party ETL or execute the DFGs directly using the DFG execution component.
[0214] Data Synchronization Tool:
[0215] A tool to synchronize data sources can be composed as follows:
[0216] Use the modeling tool to build source models and a unified global model.
[0217] Use the mapping tool to map the source models to the global model.
[0218] Use the matching tool to specify entity matching specifications on the global model and generate the object map.
[0219] For each update on a source entity, consult the object map to get the matching entities in other sources, use the query translator component to translate update statements into equivalent update statements on matching sources, and execute these update statements on the sources as shown in FIG. 500.
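The update-propagation step can be sketched as follows, assuming a simplified object map of (source1, source2, pk1, pk2) tuples; the names and representations are illustrative:

```python
def propagate_update(update, object_map, translate):
    """For an update on one source entity, look up its matching entities
    via the object map and translate the update for each matching source.

    update: (source, entity, pk, changes)
    object_map: list of (src1, src2, pk1, pk2) tuples
    translate: stand-in for the query translator component
    """
    src, entity, pk, changes = update
    out = []
    for s1, s2, pk1, pk2 in object_map:
        # the object map is symmetric: match in either direction
        if s1 == src and pk1 == pk:
            out.append(translate(s2, entity, pk2, changes))
        elif s2 == src and pk2 == pk:
            out.append(translate(s1, entity, pk1, changes))
    return out
```

The translated updates would then be executed on the matching sources, completing the synchronization cycle of FIG. 500.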
[0220] Data Migration Tool
[0221] Data migration involves not only database migration but
migration of various data access programs as well, such as queries,
stored procedures, ETL specifications, etc.
[0222] The relevant tool can be composed as shown in FIG. 600, the
method being detailed as follows:
[0223] Use the modeling tool to model source schemas and target schemas.
[0224] Use the mapping component to specify mappings between them.
[0225] Form pseudo queries corresponding to each target table (select * from <target_table>) and use the query translation component to translate these queries into source specific queries.
[0226] Either execute these queries directly (if there is only one source database) or use the `query-to-DFG` component to generate equivalent DFGs (and therefrom tool specific ETLs, if required).
[0227] Translate queries on the source schema into equivalent queries on the target schema using the query translator component.
[0228] Similarly, source specific ETL specifications are translated
into equivalent target specific ETL specifications using the
following procedure:
[0229] Use the `DFG-to-query` component to translate the source ETL specification into equivalent composite source queries.
[0230] Use the query translator component to translate these composite source queries into equivalent target queries.
[0231] Use the `query-to-DFG` component to translate these target queries into equivalent target DFGs, and use the platform specific ETL translator to translate these DFGs into target specific ETL specifications.
[0232] Besides providing a solution for creating purpose specific
integration tools, the present invention also discloses a method by
which migration from one solution architecture to another can be
implemented. Interoperability among the different integration tools
available is discussed below.
[0233] Migrating from one solution architecture to another: The
migration process from one solution to another is illustrated by
cases involving currently available integration tools.
[0234] Migrating from an EII Solution to an ETL Solution
[0235] In one of the embodiments of the present invention, a method
of migrating from an EII solution platform to an ETL platform
solution is provided using the primitive foundational components.
Suppose that, instead of executing queries on the virtual global
model to extract the required data views, there is a need to
materialize those data views into a warehouse using an ETL like
approach, say in order to improve the response time. Given below is
the method of migrating to the new solution:
[0236] Translate the requisite global queries into equivalent source queries using the query translation component.
[0237] Translate these queries into equivalent DFGs using the `query-to-DFG` component.
[0238] Translate these DFGs into target platform specific ETLs.
[0239] Migrating from an ETL Solution to an EII Solution
[0240] In another embodiment of the present invention, a method of
migrating from an ETL solution to an EII platform solution is
provided. The method uses purpose specific tools composed from the
set of primitive foundational components. Suppose we have a
warehouse and an ETL mapping from sources to populate the warehouse,
and we want to migrate to a new solution where the warehouse views
are virtual and computed on demand, so as to improve latency.
Discussed below is how to migrate to the new solution:
[0241] Turn the warehouse schema into the global schema.
[0242] Translate platform specific ETL specifications into equivalent DFGs.
[0243] Translate each DFG into an equivalent composite source query using the `DFG-to-query` component. Turn this query into a mapping view specification of the corresponding global schema table and finally
[0244] Turn all warehouse queries into equivalent global schema queries.
[0245] Hence, the proposed conceptual modelling architecture allows
enterprise integration to be modelled at a semantic level, capturing
the rules and constraints that govern the behaviour of models and
their mappings. The set of primitive components presented is
sufficiently general purpose in nature that it can be used to
compose purpose specific integration tools.
[0246] The integration architecture presented in this disclosure
thus enables easy migration from one integration solution to
another.
[0247] The foregoing description of the specific embodiments will
so fully reveal the general nature of the invention that others
can, by applying current knowledge, readily modify and/or adapt for
various applications such specific embodiments without departing
from the generic concept, and, therefore, such adaptations and
modifications should and are intended to be comprehended within the
meaning and range of equivalents of the disclosed embodiments. It
is to be understood that the phraseology or terminology employed
herein is for the purpose of description and not of limitation.
Therefore, while the invention has been described in terms of
preferred embodiments, those skilled in the art will recognize that
the embodiments of the invention can be practiced with modification
within the spirit and scope of the appended claims.
* * * * *