U.S. patent application number 11/130737 was filed with the patent office on 2005-05-17 and published on 2006-11-16 for system method and program product to estimate cost of integrating and utilizing heterogeneous data sources.
This patent application is currently assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Invention is credited to Mickey Iqbal.
United States Patent Application 20060259442
Kind Code: A1
Application Number: 11/130737
Family ID: 37420359
Inventor: Iqbal; Mickey
Publication Date: November 16, 2006
Title: System method and program product to estimate cost of integrating and utilizing heterogeneous data sources
Abstract
System, method and program product for estimating a cost of
reconciling heterogeneous data sources. A transition cost for
integrating together a first program to identify semantic
conflicts, a second program to classify semantic conflicts and a
third program to reconcile semantic conflicts is estimated. A
steady state cost for managing and maintaining the integrated
first, second and third programs is estimated. Another system,
method and program product for estimating a cost of integrating
heterogeneous data sources. A steady state cost of managing and
maintaining a first program which identifies semantic conflicts
between a cross data source query and schema elements in a data
source is estimated. A steady state cost of managing and
maintaining a second program which classifies semantic conflicts
between the cross data source query and schema elements in the data
source is estimated. A steady state cost of managing and
maintaining a third program which reconciles semantic conflicts
between the cross data source query and schema elements in the data
source is estimated.
Inventors: Iqbal; Mickey (Suwanee, GA)
Correspondence Address:
IBM CORPORATION
IPLAW IQ0A/40-3
1701 NORTH STREET
ENDICOTT, NY 13760
US
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION, Armonk, NY
Family ID: 37420359
Appl. No.: 11/130737
Filed: May 17, 2005
Current U.S. Class: 705/400; 707/E17.032
Current CPC Class: G06Q 30/0283 20130101; G06F 16/21 20190101
Class at Publication: 705/400
International Class: G06F 17/00 20060101 G06F017/00
Claims
1. A method for estimating cost of reconciling heterogeneous data
sources, said method comprising the steps of: estimating a
transition cost for integrating together a first program to
identify semantic conflicts, a second program to classify semantic
conflicts and a third program to reconcile semantic conflicts; and
estimating a steady state cost for managing and maintaining the
integrated first, second and third programs.
2. A method as set forth in claim 1 further comprising the steps
of: estimating a transition cost for configuring and implementing
said third program.
3. A method as set forth in claim 1 further comprising the steps
of: estimating a transition cost for implementing a non-monotonic
semantic assertion maintenance system which supports context
queries generated by the integrated first, second and third
programs; and estimating a steady state cost for managing and
maintaining said non-monotonic semantic assertion maintenance
system.
4. A method for estimating a cost of integrating heterogeneous data
sources, said method comprising the steps of: estimating a steady
state cost of managing and maintaining a first program which
identifies semantic conflicts between a cross data source query and
schema elements in a data source; estimating a steady state cost of
managing and maintaining a second program which classifies semantic
conflicts between said cross data source query and schema elements
in said data source; and estimating a steady state cost of managing
and maintaining a third program which reconciles semantic conflicts
between said cross data source query and schema elements in said
data source.
5. A method as set forth in claim 4 further comprising the step of:
estimating a steady state cost of transforming at least some of
said cross data source queries to a canonical data model form.
6. A method for estimating a cost of integrating heterogeneous data
sources, said method comprising the steps of: estimating a
transition cost of implementing a first program which identifies
semantic conflicts between a cross data source query and schema
elements in a data source; estimating a transition cost of
implementing a second program which classifies semantic conflicts
between said cross data source query and schema elements in said
data source; and estimating a transition cost of implementing a
third program which reconciles semantic conflicts between said
cross data source query and schema elements in said data
source.
7. A method as set forth in claim 6, further comprising the steps
of: estimating a transition cost for reconciling cross data source
queries which require manual intervention for validation of
reconciled semantic conflicts; and estimating a steady state cost
for reconciling cross data source queries which require manual
intervention for validation of reconciled semantic conflicts.
8. A method for estimating a cost of integrating heterogeneous data
sources, said method comprising the steps of: estimating a percent
of schema elements in the data sources where functional computation
based on semantic mapping between schema terms and ontology terms
is required; estimating a steady state labor cost for said
functional computation for said percent of schema elements where
functional computation is required; estimating a percent of schema
elements in the data sources where structural heterogeneity
semantic mapping between schema terms and ontology terms is
required; and estimating a steady state labor cost for said
semantic mapping for said percent of schema elements where semantic
mapping is required.
9. A method as set forth in claim 8 further comprising the step of
estimating a steady state cost of mapping local data source schema
elements from the data sources to a shared ontology.
10. A method for estimating a cost of integrating heterogeneous
data sources, said method comprising the steps of: estimating a
percent of schema elements in each data source where functional
computation based on semantic mapping between schema terms and
ontology terms is required; estimating a transitional labor cost
for implementing a first program to perform said functional
computation for said percent of schema elements where functional
computation is required; estimating a percent of schema elements in
each data source where structural heterogeneity semantic mapping
between schema terms and ontology terms is required; and estimating
a transitional labor cost for implementing a second program to
perform said semantic mapping for said percent of schema elements
where semantic mapping is required.
11. A method as set forth in claim 10 further comprising the step
of estimating a transitional labor cost of mapping local data
source schema elements from each data source to a shared
ontology.
12. A method as set forth in claim 10 further comprising the steps
of: estimating a transitional labor cost for automatically mapping
the first said percent of schema elements to correct ontology
terms; and estimating a transitional labor cost for manually
mapping the first said percent of schema elements to correct
ontology terms.
13. A method as set forth in claim 10 further comprising the step
of: estimating a transitional labor cost for validating said
semantic mapping of the second said percent of schema elements to
correct ontology terms.
14. A method as set forth in claim 13 further comprising the step
of: estimating a steady state labor cost for validating said
semantic mapping of the second said percent of schema elements to
correct ontology terms.
15. A method as set forth in claim 10 further comprising the steps
of: estimating a transitional labor cost for implementing a wrapper
to convert semi-structured data from a semi-structured data source
to a structured schema which can interface with a semantic
reconciliation program; and estimating a steady state labor cost
for managing and maintaining said wrapper.
Description
BACKGROUND OF THE INVENTION
[0001] The invention relates generally to computer systems and
services, and more particularly to a tool to estimate the cost of
integrating heterogeneous data sources together so the data in
these heterogeneous information sources can be exchanged between
each other and accessed by users.
[0002] Organizations today often need to access multiple,
heterogeneous data sources. Existing middleware technology and the
World Wide Web enable physical connectivity between dispersed and
heterogeneous data sources. "Heterogeneous" data sources are data
repositories and data management systems that are incompatible. The
incompatibility, also called "semantic conflict", includes
differences in structural representations of data, differences in
data models, mismatched domains, and different naming and
formatting schemes employed by each data source. Thus,
heterogeneous data sources store data in different forms, and
require different formats and/or protocols to access their data and
exchange data between themselves.
[0003] The following are known examples of heterogeneous data
sources and their incompatibility or semantic conflicts.
[0004] Example 1 is a partial schema of an Oracle™ database, and Example 2 is a partial database schema of a Microsoft™ application for a SQL Server based employee database. (The term "database schema" refers to fields, attributes, tables and other categorizations of data in a database.)
EXAMPLE 1
Oracle database form used for employment data of a university.
Data Model: Non-Normalized Relational Schema (partial):
Faculty (SS#, Name, Dept, Sal_Amt, Sal_Type, Affiliation, Sponsor,
University . . . )
Faculty: Any tuple of the relation Faculty, identified by the key
SS#
SS#: An identifier, the social security number of a faculty
member
Name: An identifier, Name of a faculty member
Dept: The academic or nonacademic department to which a faculty
member is affiliated
Sal_Amt: The amount of annual Salary paid to a Faculty member
Sal_Type: The type of salary such as Base Salary, Grant, and
Honorarium
Affiliation: The affiliation of a faculty member, such as teaching,
non-teaching, research
University: The University where a Faculty member is employed
EXAMPLE 2
Microsoft database form of a SQL Server database of employees of
engineering related firms.
Data Model: Non-Normalized Relational Schema (partial):
Employee (ID, Name, Type, Employer, Dept, CompType, Comp,
Affiliation . . . )
Employee: Any tuple of the relation Employee, identified by the key
ID
ID: An identifier, the social security number of an Employee
Name: An identifier, Name of an employee
Type: An attribute describing the job category of an Employee, such
as Executive, Middle Manager, Consultant from another firm,
etc.
Employer: Name of the employer firm such as AT&T, Motorola,
General Motors, etc.
Dept: Name of the department where an Employee works
CompType: The type of compensation given to an employee, such as
Base Salary, Contract Amount
Comp: The amount of annual compensation for an employee
Affiliation: Name of the Consultant firm, such as a University
Name, Andersen Consulting, . . .
[0005] There exist several semantic correspondences between the
Oracle database format and the Microsoft database format, as
follows. Class `Faculty` in Oracle database and class `Employee` in
Microsoft database intersect. Instances of attribute `SS#` in
Oracle database correspond to instances of attribute `ID` in
Microsoft database where the employees are consultants from the
universities. Attributes `Dept` in Oracle database and Microsoft
database share some common domain values, as do `Sal_Type` in
Oracle database and `CompType` in Microsoft database, and
`Sal_Amt` in Oracle database and `Comp` in Microsoft database.
These three pairs may be considered either as synonyms or as
homonyms depending on the nature of the query posed against these
two databases. Attributes `Affiliation` in Oracle database and
Microsoft database are homonyms, as are attribute `University` in
Oracle database and attribute `Employer` in Microsoft database,
because their domains do not overlap. The fact that the domains do
not overlap is coincidental, and therefore cannot be assumed to be
true all the time. The two attributes could have overlapped in
other database instances. Attribute `University` in Oracle database
and `Affiliation` in Microsoft database may be considered as
synonyms for the subset of class `Employee` where
`Employee.Type=Consultant`, and where the values in the domain of
the attribute `Affiliation` in Microsoft database correspond to the
names of Chicago based Universities. To allow these different data
sources to exchange data, the corresponding attributes need to be
identified and reconciled, so the data for both corresponding
attributes can be treated as data for the same attribute. Likewise,
to permit a user to access data for corresponding attributes with a
single query, the attribute specified in the single query needs to
be reconciled to both of the corresponding attributes. Known
semantic reconciliation techniques include those designed to
identify and reconcile semantic incompatibilities and distinctions
such as those illustrated by the examples above. The number of
semantic conflicts increases as more data sources are included in a
data integration effort.
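The correspondences described above can be captured concretely. The following is a minimal, illustrative Python sketch (not part of the application text) of how mapping assertions between the Faculty schema of Example 1 and the Employee schema of Example 2 might be recorded; the class layout and field names are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Correspondence:
    source_attr: str      # attribute in the Oracle Faculty schema (Example 1)
    target_attr: str      # attribute in the SQL Server Employee schema (Example 2)
    relationship: str     # e.g. "synonym", "homonym", "conditional synonym"
    condition: str = ""   # predicate under which the correspondence holds

correspondences = [
    Correspondence("Faculty.SS#", "Employee.ID", "conditional synonym",
                   "Employee is a consultant from a university"),
    Correspondence("Faculty.Sal_Type", "Employee.CompType", "synonym or homonym"),
    Correspondence("Faculty.Sal_Amt", "Employee.Comp", "synonym or homonym"),
    Correspondence("Faculty.Affiliation", "Employee.Affiliation", "homonym"),
    Correspondence("Faculty.University", "Employee.Employer", "homonym"),
    Correspondence("Faculty.University", "Employee.Affiliation", "conditional synonym",
                   "Employee.Type = 'Consultant'"),
]

# A reconciliation step would consult assertions such as these to decide whether
# data from two corresponding attributes can be treated as data for the same attribute.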
[0006] The following describes known techniques for resolving
semantic conflicts while integrating heterogeneous data sources
together so their data can be exchanged between each other and
accessed by users.
[0007] A research paper entitled "A Comparative Analysis of
Methodologies for Database Schema Integration", by Batini,
Lenzerini and Navathe, published in ACM Computing Surveys 18(4),
1986 discloses standardization of data definitions and structures
through the use of a common conceptual (or a global) schema across
a collection of data sources. The global schema specifies field and
record definitions, structures and rules for updating data values.
Using various mappings and transformations, source data is
converted into a semantically equivalent, compatible form. Rules
for performing these mappings in heterogeneous data sources
typically exist as a separate layer above the component
databases.
[0008] A research paper entitled "Interoperability of Multiple
Autonomous Databases" by Litwin, Mark and Roussoupoulos, ACM
Computing Surveys 22(3), 1990 discloses a multi-database language
approach or federated database approach which is an alternative to
total integration. This approach provides relief from some of the
problems of creating a global schema by proposing a multi-database
language to facilitate semantic reconciliation. This language
shifts most of the burden of data integration to the users. For
example, the language provides users with easy access to schema
contents, such as attribute and entity names, domain values etc. of
all participating information sources in the network. It is the
responsibility of the users to determine the semantics of the data
items in each information source in the network.
[0009] An object of the present invention is to provide accurate
cost estimates based on a wide variety of parameters, for
reconciling heterogeneous data sources.
[0010] Another object of the present invention is to automate, in
whole or in part, cost estimates based on a wide variety of
parameters, for reconciling heterogeneous data sources.
SUMMARY OF THE INVENTION
[0011] The present invention resides in a system, method and
program product for estimating a cost of reconciling heterogeneous
data sources. A transition cost for integrating together a first
program to identify semantic conflicts, a second program to
classify semantic conflicts and a third program to reconcile
semantic conflicts is estimated. A steady state cost for managing
and maintaining the integrated first, second and third programs is
estimated.
[0012] The present invention resides in another system, method and
program product for estimating a cost of integrating heterogeneous
data sources. A steady state cost of managing and maintaining a
first program which identifies semantic conflicts between a cross
data source query and schema elements in a data source is
estimated. A steady state cost of managing and maintaining a second
program which classifies semantic conflicts between the cross data
source query and schema elements in the data source is estimated. A
steady state cost of managing and maintaining a third program which
reconciles semantic conflicts between the cross data source query
and schema elements in the data source is estimated.
[0013] The present invention resides in another system, method and
program product for estimating cost of integrating heterogeneous
data sources. A transition cost of implementing a first program
which identifies semantic conflicts between a cross data source
query and schema elements in a data source is estimated. A
transition cost of implementing a second program which classifies
semantic conflicts between the cross data source query and schema
elements in the data source is estimated. A transition cost of
implementing a third program which reconciles semantic conflicts
between the cross data source query and schema elements in the data
source is estimated.
[0014] The present invention resides in another system, method and
program product for estimating a cost of integrating heterogeneous
data sources. An estimation is made of a percent of schema elements
in the data sources where functional computation based on semantic
mapping between schema terms and ontology terms is required. An
estimation is made of a steady state labor cost for the functional
computation for the percent of schema elements where functional
computation is required. An estimation is made of a percent of
schema elements in the data sources where structural heterogeneity
semantic mapping between schema terms and ontology terms is
required. An estimation is made of a steady state labor cost for
the semantic mapping for the percent of schema elements where
semantic mapping is required.
[0015] The present invention also resides in a system, method and
program for estimating a cost of integrating heterogeneous data
sources. An estimation is made of a percent of schema elements in
each data source where functional computation based on semantic
mapping between schema terms and ontology terms is required. An
estimation is made of a transitional labor cost for implementing a
first program to perform the functional computation for the percent
of schema elements where functional computation is required. An
estimation is made of a percent of schema elements in each data
source where structural heterogeneity semantic mapping between
schema terms and ontology terms is required. An estimation is made
of a transitional labor cost for implementing a second program to
perform semantic mapping for the percent of schema elements where
semantic mapping is required.
BRIEF DESCRIPTION OF THE FIGURES
[0016] FIG. 1 is a block diagram of a computer system which
includes a cost estimation tool according to the present
invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0017] The present invention will now be described in detail with
reference to the figures. FIG. 1 illustrates a computer system 20
which includes the present invention. System 20 includes known CPU
14, operating system 15, RAM 16, ROM 18, keyboard 17, monitor 19
and internal and/or external disk storage 21. System 20 also
includes a cost estimation program tool 25 in accordance with the
present invention. Program 25 comprises a user interface program
module 30, a cost modeling program module 32, a decision analysis
and cost sensitivity analysis program module 39, a knowledge
management database and program module 41, and a systems framework
program module 40. Together they permit estimation of the initial
cost of integrating heterogeneous data sources and subsequent use
of the heterogeneous data sources.
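As a rough structural sketch only (assuming nothing beyond the module names given above), the relationship between tool 25 and its program modules could be expressed as follows in Python; the class and attribute names are illustrative, not defined by the application.

class UserInterfaceModule: pass          # module 30: gathers requirements, displays reports
class CostModelingModule: pass           # module 32: computes labor, hardware, software and infrastructure costs
class DecisionAnalysisModule: pass       # module 39: decision analysis and cost sensitivity analysis
class KnowledgeManagementModule: pass    # module 41: facts, business rules and data

class SystemsFramework:                  # module 40: integrates the other modules
    def __init__(self):
        self.user_interface = UserInterfaceModule()
        self.cost_model = CostModelingModule()
        self.decision_analysis = DecisionAnalysisModule()
        self.knowledge = KnowledgeManagementModule()

cost_estimation_tool_25 = SystemsFramework()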
[0018] The user interface program module 30 displays requirements
data submitted by the end-user of the heterogeneous data systems.
For example, the requirements data comprises a number of databases
to be reconciled, schema information for each type of database (for
example, relational or object oriented), a number of schema
elements in each schema to be considered, business policies to be
considered, network environment information, a type of semantic
conflict reconciliation method (or methods) to be used, a number of
users that will need access to the heterogeneous data system, and a
performance requirement for estimating time spent by the system.
Program 25 may need input parameters for each semantic
reconciliation business variable and technical variable that will
have an impact on the cost of the semantic reconciliation of the
heterogeneous data sources. For example, semantic conflicts related to location attributes (such as street addresses) are best resolved by applying a specific semantic reconciliation technique `J67` along with a more general semantic reconciliation technique `X45`. Semantic conflicts related to currency attributes (such as monetary values) are best resolved by applying a specific semantic reconciliation technique `G21` along with a more general semantic reconciliation technique `Y54`. Semantic conflicts involving computed functional attributes (such as pension, compound interest, etc.) are best resolved by applying a specific semantic reconciliation technique `G12` along with a more general semantic reconciliation technique `Y55`. Another example is
a cost of creating an information wrapper such as one for a
relational database with `N` tables with less than `n` attributes.
An "information wrapper" is a data container which translates
unstructured information into structured information. (For example,
a database schema is "structured information" and free text is
unstructured information.) The information wrapper recognizes
certain nouns of unstructured data as corresponding to certain
table or entity names of structured data. The user interface
program module 30 guides the tool-user in selecting these variables
from a predefined set of candidates. Typically, there is more than
one viable parameter from which the user can choose. The user
interface program module 30 also displays the cost estimates
subsequently generated by other program components within tool 25.
The user interface program module 30 also guides and allows the
user to make changes in requirements data and reconciliation
business and technical variables to obtain different cost
estimates.
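To make the preceding description concrete, the following hedged Python sketch shows one way the requirements data and the attribute-to-technique rules mentioned above might be represented; the technique codes come from the text, while the dictionary layout and example values are assumptions.

requirements_data = {
    "num_databases": 4,                                      # databases to be reconciled
    "schema_types": {"HR_DB": "relational", "ASSET_DB": "object oriented"},
    "schema_elements_per_source": {"HR_DB": 120, "ASSET_DB": 85},
    "business_policies": ["data retention", "privacy"],
    "reconciliation_methods": ["X45", "Y54"],
    "num_users": 250,
    "performance_requirement_seconds": 5,
}

# Attribute category -> (specific technique, general technique), per the examples above.
technique_rules = {
    "location": ("J67", "X45"),             # e.g. street addresses
    "currency": ("G21", "Y54"),             # e.g. monetary values
    "computed functional": ("G12", "Y55"),  # e.g. pension, compound interest
}

def techniques_for(attribute_category):
    # Returns the recommended (specific, general) technique pair, or None if unknown.
    return technique_rules.get(attribute_category)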
[0019] The cost modeling program module 32 calculates labor,
hardware, software and infrastructure costs for a given set of
end-user provided requirements and end-user's selection of
reconciliation business and technical variables that influence the
cost of the semantic reconciliation of heterogeneous data sources.
The cost modeling program 32 uses facts, business rules and data
maintained by the knowledge management database and program 41 in
making its cost estimates.
[0020] The knowledge management database and program module 41
stores facts, business rules and data pertaining to the
following:
[0021] (a) Selection of tool-sets/methodologies for creating
network and data integration subsystems (such as wrappers, shared
vocabulary, etc.) and inter-data source linkages. These are a
prerequisite for connectivity between heterogeneous data sources
before any semantic reconciliation or meaningful exchange of
information between the heterogeneous data sources can occur.
[0022] (b) Implementation of tool-sets/methodologies for creating
network and data integration subsystems (such as wrappers, shared
vocabulary, etc.) and inter-data source linkages. These are a
prerequisite for connectivity between heterogeneous data sources
before any semantic reconciliation or meaningful exchange of
information between the heterogeneous data sources can occur.
(c) Identification and estimation of hardware, software, labor and
infrastructure costs required to create the data integration
environment for a given set of heterogeneous data sources.
[0023] (d) Identification and estimation of hardware, software,
labor and infrastructure costs required to extend a data
integration environment to a given set of heterogeneous data
sources over time as more data sources become part of the
integrated environment; these can be expressed as cost per
additional component or additional utility model costs.
(e) Life-cycle management support costs for hardware, software,
labor and infrastructure required to manage the integrated set of
heterogeneous data sources over time.
(f) Selecting specific semantic reconciliation
techniques/tool-sets/methodologies for the identification,
classification and adequate reconciliation for each set of semantic
conflicts.
(g) Implementing and applying specific semantic reconciliation
techniques/tool-sets/methodologies for the identification,
classification and adequate reconciliation for each set of semantic
conflicts.
(h) Implementing and applying selected combination of semantic
reconciliation techniques/tool-sets/methodologies for the
identification, classification and adequate reconciliation for each
set of semantic conflicts.
(i) Estimating costs for implementing and applying specific
semantic reconciliations techniques/tool-sets/methodologies for the
identification, classification and adequate reconciliation for each
set of semantic conflicts.
(j) Estimating costs for implementing and applying selected
combination of semantic reconciliation
techniques/tool-sets/methodologies for the identification,
classification and adequate reconciliation for each set of semantic
conflicts.
(k) Historical cost data for implementing semantic reconciliation
environments from previous similar operational efforts.
[0024] Knowledge engineers provide the foregoing data and rules for
the knowledge management data base and program. The knowledge
engineers are domain experts in facts, business rules and data
pertaining to knowledge domains. The knowledge engineers
participate in the initial creation of the knowledge management
database and program, and periodically update the knowledge management database and program. The knowledge management database and program
module 41 also contains a data processing program engine which
performs data management operations on the data stored within the
repository.
[0025] The decision analysis and cost sensitivity analysis program
module 39 processes input variables (such as end-user requirements
data and reconciliation business and technical variables selections
by the tool-user) based on integration and usage facts and rules
maintained by the knowledge management database and program module
41. Then, the module 39 invokes the cost modeling program module 32
to calculate the actual cost for labor, hardware, software and
infrastructure. The decision analysis and cost sensitivity program
module 39 can also be used to subsequently alter input variables to
reduce cost.
[0026] The systems framework program module 40 integrates the
program modules 30, 32, 39 and 41 to work together. The systems
framework program module 40 also provides National Language Support
(NLS) for the user. The systems framework program module 40 can be
implemented in a variety of programming languages such as JAVA to
provide multi-operating system support, multi-hardware support,
distributed objects (including database support), and
multi-language support. In an alternate embodiment of the present
invention, program modules 30, 32, 39 and 41 can reside on
different computer systems, in which case the system framework
program module 40 can integrate these program modules using a
distributed network computing architecture. The system framework
program module also provides controlled and secure program
interfaces between the program modules 30, 32, 39 and 41 and the
external system environment (such as tool-users, end-users and
other outside applications) to manage updates, facilitate general
systems management tasks (such as program backup and recovery), and
support future interactions and integration with other useful
programs.
[0027] The following is a more detailed description of program
modules 30, 32, 39 and 41.
[0028] The user interface program module 30 allows a tool-user and
end-user to interact with the overall system to specify end-user
requirements and select technical and business variables regarding
the heterogeneous data systems and their proposed integration and
subsequent use. The following is a non-exhaustive list of the
end-user requirements and reconciliation business and technical
variables, for which the user's input is solicited:
(a) Number of data sources which will participate in the semantic
reconciliation/semantic data integration effort.
(b) Approximate number of additional data sources, which may be
added annually to the semantically integrated environment.
(c) Identification of applicable knowledge domain(s) (for example,
medical, mechanical engineering, finance, material resource
planning, enterprise resource planning, or human resources) for
each data source.
(d) Identification of attributes (such as total DASD, processor
speed, memory, etc.) of each data source's hardware platform.
(e) Identification of attributes (such as vendor, version, patch
level, etc.) of each data source's operating system platform.
(f) Identification of attributes (such as vendor, version, patch
level, etc.) of each data source's middleware software
component.
(g) Identification of attributes (such as vendor, version, patch
level, relational/object oriented, centralized/distributed, etc.)
of each data source's database management system software.
(h) Identification of attributes (such as vendor, version, patch
level, information wrapper enablement capability, etc.) of each
data source's application front end software component.
(i) Identification of each data source's supported network
protocols.
(j) Identification of each data source's supported network
infrastructure including attributes for network topology, network
interface cards, etc.
(k) Identification of each data source's location dependencies (and
attributes for any distributed data source).
(l) Identification of each data source's NLS requirements.
[0029] (m) Identification of each data source's physical
implementation (such as number of tables, and growth rate of number
of tables) with respect to database schema. (Information about
number of rows/table, data growth rate per table, primary indexes,
etc. is not required.)
[0030] (n) Identification of each data source's physical
implementation with respect to all data formats used by the data
source. (For example, if Oracle database software is used, tool 25
already knows all date formats, currency formats, number formats,
character formats, memo field usage, indexes, data dictionary
tables, referential integrity constraints, primary keys, secondary
keys, composite keys, Binary Large Object ("BLOB") fields,
etc.)
[0031] (o) Identification of each data source's physical
implementation with respect to capturing business rules governing
the data source's operations (for example whether business rules
are captured as stored procedures, database triggers or sequences,
or captured as part of business logic at the application layer or a
combination of all of the above).
[0032] (p) Identification of each data source's physical
implementation with respect to data dictionary/meta data
implementation used within the data source's meta data repository.
(For example, if Oracle data base software is used then tool 25
already knows all date formats, currency formats, number formats,
character formats, memo field usage, BLOB fields, etc. used by the
Oracle software.)
(q) Identification of each data source's physical implementation
with respect to all data formats used within the data source's data
dictionary/meta data repository.
(r) Identification of each data source's physical implementation
with respect to business rules governing the data source's data
dictionary/meta data operations.
(s) Identification of attributes of each data source's logical
implementation with respect to database schema (for example,
relational schema, object oriented schema, file based data
structures, etc.).
(t) Identification of attributes of each data source's logical
implementation with respect to all data formats used within the
data source (for example, relational schema, object oriented
schema, file based data structures, etc.).
[0033] (u) Identification of attributes of each data source's
logical implementation with respect to capturing business rules
governing the data source's operations (for example, relational
schema, object oriented schema, file based data structures,
application level business logic routines, etc.).
(v) Identification of attributes of each data source's logical
implementation with respect to data dictionary/meta data
implementation (for example, data dictionary/meta data schema).
[0034] (w) Identification of preferred tool-sets/methodologies for
creating network and data integration subsystems (such as WAN
links, LAN links, information wrappers, vocabulary shared by
heterogeneous databases, etc.) from amongst a system generated
choice list for creating inter-data source linkages. (The shared
vocabulary is also called "ontology".)
(x) Identification of preferred tool-sets/methodologies pertaining
to extension of data integration environment over time as more data
sources become part of the integrated environment.
(y) Identification of preferred and system recommended specific
semantic reconciliation techniques/tool-sets/methodologies for the
identification, classification and adequate reconciliation for each
set of semantic conflicts.
[0035] (z) An estimate of the number of queries posed against
heterogeneous data sources, which will require semantic
reconciliation, within the same knowledge domain.
(aa) An estimate of the number of queries posed against
heterogeneous data sources, which will require semantic
reconciliation, across different knowledge domains.
(bb) An estimate of the minimum tolerable time period as well as
the maximum allowable time period within which the desired semantic
reconciliation environment must be implemented.
(cc) An estimate of the minimum time period as well as the maximum
allowable time period over which the semantic reconciliation
environment will need to be managed.
(dd) Specification of fully burdened labor rates for each skill
level required for integration and use of the semantically
reconciled heterogeneous data sources.
(ee) Identification of geographic locations where data sources,
which will participate in the integrated semantic reconciliation
process, are located.
[0036] (ff) Identification and selection of preferred types of
servers (from a system provided list), with respect to hardware
vendor choice, server operating system choice, etc. required to
support the desired semantic reconciliation infrastructure
environment.
[0037] (gg) Identification and selection of hardware configuration,
operating system, software applications installed, and network
configurations for any server supporting the desired semantic
reconciliation infrastructure environment (from a system provided
list).
[0038] (hh) Identification and selection of network connections
details between each location where the heterogeneous data sources
are located. (An abstract level network diagram can be generated
graphically using the user interface program module 30 based on the
user's input.)
(ii) Identification of the owner's technology refresh policy (for replacing servers and other hardware), in terms of how many years of service a piece of hardware such as a server, router, switch, etc. may remain in use before being retired and replaced by newer equipment.
[0039] Once the user has input all end-user requirements and
business and technical variables, the user interface program module
30 can be used to request various types of cost estimation proposal
reports in various formats, including screen based reports, HTML reports, PDF files, text file reports, and CSV file based reports.
The reports include various cost accounts broken down by cost
categories (hardware, software, labor, etc. for semantic
reconciliation). The various types of reports are provided as a
result of the inputs fed into program module 30 and the back-end
processing performed by program modules 32, 39 and 41. The cost
data can also be reported in a granular format with reporting
categories.
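By way of illustration only, a CSV version of such a report might look like the following Python sketch; the cost categories reflect the text, while the column names and figures are invented placeholders.

import csv

cost_accounts = [
    {"category": "hardware",       "transition_cost": 42000,  "steady_state_cost_per_year": 6000},
    {"category": "software",       "transition_cost": 35000,  "steady_state_cost_per_year": 9000},
    {"category": "labor",          "transition_cost": 120000, "steady_state_cost_per_year": 48000},
    {"category": "infrastructure", "transition_cost": 18000,  "steady_state_cost_per_year": 4000},
]

with open("semantic_reconciliation_cost_estimate.csv", "w", newline="") as report:
    writer = csv.DictWriter(report, fieldnames=["category", "transition_cost",
                                                "steady_state_cost_per_year"])
    writer.writeheader()
    writer.writerows(cost_accounts)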
[0040] The knowledge management database and program module 41
stores facts, business rules and data pertaining to semantic
reconciliation amongst heterogeneous data sources. The following
are key facts, data and business rules which are stored within the
knowledge management database and program module 41:
(a) Classification of each possible semantic conflict between data
pairs as a combination of the following relationships: synonym,
homonym, function, aggregation, generalization, specialization, or
unrelated.
(b) Classification of each possible semantic conflict between data
pairs from a query and a database, based on structural levels
including class, instance, and attribute levels.
(c) Semantic reconciliation technique `X15` is suitable to use for data sources belonging to knowledge domains `A`, `B`, . . . `N`.
(d) Semantic reconciliation technique `X17` is suitable to use only for data sources belonging to knowledge domain `K`.
(e) Semantic reconciliation technique `Y` utilizes
lexicon/ontology/dictionary `p`, and does not work well with
lexicon/ontology/dictionary `f` or `g`.
(f) Semantic conflicts related to location attributes (such as street addresses) are best resolved by applying a specific semantic reconciliation technique `J67` along with the more general semantic reconciliation technique `X45`.
(g) Semantic conflicts related to currency attributes (such as monetary values) are best resolved by applying a specific semantic reconciliation technique `G21` along with the more general semantic reconciliation technique `Y54`.
(h) Semantic conflicts involving computed functional attributes (such as pension, compound interest, etc.) are best resolved by applying a specific semantic reconciliation technique `G12` along with the more general semantic reconciliation technique `Y55`.
(i) Cost of creating an information wrapper for a relational
database with `N` tables with less than `n` attributes is $YRT.
(j) Cost of creating an information wrapper for a semi-structured
data source with `N` information attributes is $JKL per
semi-structured page with less than `m` characters.
(k) Middleware software applications `TRX`, `LJK` or `YUT` are
preferred to create a data integration software layer between
relational databases, with license costs of $X, $Y, and $Z
respectively, per data source.
(l) Middleware `TRX`, `LJK` or `YUT` are preferred to create data
integration software layer between relational databases, with
implementation costs of $Xaj, $Yab, and $Zah respectively, per data
source.
(m) Middleware `XRT`, `JKL` or `TUY` are preferred to create data
integration software layer between object oriented databases, with
license costs of $A, $B, and $C respectively, per data source.
(n) Middleware `TRX`, `LJK` or `YUT` are preferred to create a data
integration software layer between object oriented and relational
databases, with implementation costs of $Aaj, $Bab, and $Cah,
respectively, per data source.
(o) Cost of semantically mapping `m` attributes from a database
schema to a common lexicon/ontology/dictionary is `$DFR` and
requires (m*r) hours of medium skilled labor (as identified by the
system in labor rates).
(p) Cost of installing and configuring semantic reconciliation technique `Y575` to semantically integrate `W` pairs of heterogeneous data sources created using the same DBMS is $VCU.
[0041] (q) Cost of installing and configuring semantic reconciliation technique `X176` for `N` number of heterogeneous data sources, which include `n` relational databases, `m` object oriented databases, and `k` semi-structured data sources requiring wrapper generation, is `$CXD`.
(r) Cost for managing semantic reconciliation technique `X15` for `n` databases is $TYV/month and $VRE/month, respectively.
(s) Cost of semantically reconciling each local database query composed of `n` terms against a remote database containing `m` schema elements while using semantic reconciliation technique `Y577` is `$PTX`.
(t) Semantic reconciliation technique `TUX456RS` can perform
semantic reconciliation for US English language databases only.
(u) Cost of manual intervention to resolve a semantic conflict
between each local database query term against a remote database
containing `m` schema elements is `$PTPPX`.
(v) Cost of providing a network communication link between `b` data
sources using `LKJ` network topology and `RHG` network protocols
and `YR` kb/sec bandwidth is $UQW.
(w) Rules pertaining to cost metrics (including software, hardware,
labor and infrastructure costs) for connecting `n` different and
previously unconnected geographic locations where data sources are
located is $QQQ.
[0042] (x) Rules pertaining to cost metrics (software, hardware,
labor and infrastructure costs) for procuring, configuring and
installing preferred types of servers (from a system provided
list), with respect to hardware vendor choice, server operating
system choice, etc. required to support the desired semantic
reconciliation infrastructure environment.
(y) Rules pertaining to cost metrics for data source integration
and usage based on hardware, software, labor and infrastructure
required to manage the integrated set of heterogeneous data sources
over time.
(z) Rules pertaining to cost metrics for technology refresh of each
type of hardware, which is to be used as part of the infrastructure
setup for the semantic reconciliation environment.
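Two of the stored rules above lend themselves to a small worked example. The following Python sketch evaluates rule (o), in which mapping `m` attributes to a shared lexicon/ontology/dictionary requires (m*r) hours of medium skilled labor, and rule (j), in which wrapping a semi-structured source costs a fixed amount per page; the dollar figures and rates are placeholders for the symbolic values ($DFR, $JKL) in the rules.

def ontology_mapping_cost(m_attributes, r_hours_per_attribute, medium_skill_hourly_rate):
    # Rule (o): semantically mapping m attributes requires (m*r) hours of medium skilled labor.
    return m_attributes * r_hours_per_attribute * medium_skill_hourly_rate

def wrapper_cost(num_pages, cost_per_page):
    # Rule (j): wrapper cost for a semi-structured data source, charged per semi-structured page.
    return num_pages * cost_per_page

# Example with assumed figures: 200 attributes, 0.5 hours per attribute, $60 per hour.
print(ontology_mapping_cost(200, 0.5, 60.0))  # 6000.0
print(wrapper_cost(350, 12.0))                # 4200.0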
[0043] After the user enters the end-user requirements and business
and technical variables into cost estimation program tool 25 using
the user interface program module 30, the user can invoke the
decision analysis and cost sensitivity analysis program module 39
to obtain an estimate of the initial integration or "transitional"
cost and subsequent use or "steady state" cost. This can be the
first cost estimation for the heterogeneous data sources or a
subsequent estimation based on sensitivity/"what-if" analysis on a
previously generated cost estimate after changing one or more
parameters. The decision analysis and cost sensitivity analysis
program module 39 first establishes an overall goal and criteria
for decision making based on the user's or owner's requirements
data. It then creates a decision hierarchy based on all known
criteria and identifies the various alternatives based on its
interaction with program modules 32 and 41. The decision hierarchy
is determined based on parameters and requirements provided by the
end user. For example, if the user enters a high-end budget constraint of $100K, then all hardware, software and semantic integration tasks will be sorted in order to provide an estimate which may not be the best possible solution but which meets the budget goal. The decision analysis and cost sensitivity analysis
program module (by invoking the knowledge management database and
program module 41 and the cost modeling program module 32 with new
sets of parameters for each sub path in the decision hierarchy)
determines the costs of all decision sub-paths within each
alternative to compute the lowest cost alternative that meets
required goals and criteria. The cost modeling program module 32
calculates specific labor, hardware, software and infrastructure
costs based on user provided requirements data and facts, data and
business rules maintained by the knowledge management database and
program module 41. The cost modeling program module 32 then tallies
the costs and supplies the total to the decision analysis and cost
sensitivity analysis program module 39 which presents them to the
user via the user interface. If the user changes the requirements
criteria by changing any of the values of earlier provided
technical/business variables, the decision analysis and cost
sensitivity analysis program module re-computes the cost estimate
and provides a new solution to the user.
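A hedged Python sketch of that what-if loop: each alternative in the decision hierarchy is costed by the cost model, filtered against the user's budget constraint (such as the $100K example above), and the lowest-cost feasible alternative is returned. The function and parameter names are assumptions, not interfaces defined by the application.

def estimate_cost(alternative, knowledge_base):
    # Stand-in for cost modeling program module 32: sums unit costs for the tasks in an alternative.
    return sum(knowledge_base.get(task, 0.0) for task in alternative["tasks"])

def lowest_cost_alternative(alternatives, knowledge_base, budget):
    costed = [(estimate_cost(alt, knowledge_base), alt) for alt in alternatives]
    feasible = [pair for pair in costed if pair[0] <= budget]
    if not feasible:
        return None   # no alternative meets the budget goal
    return min(feasible, key=lambda pair: pair[0])

knowledge_base = {"install middleware": 30000.0, "map schemas": 45000.0, "build wrapper": 15000.0}
alternatives = [{"tasks": ["install middleware", "map schemas"]},
                {"tasks": ["install middleware", "map schemas", "build wrapper"]}]
print(lowest_cost_alternative(alternatives, knowledge_base, budget=100000.0))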
[0044] The cost modeling program module 32 includes functions and
formulas to estimate the semantic reconciliation operation's
life-cycle management costs, as follows:
Let f(x)=upfront setup cost of the semantic reconciliation
environment based on a given set of business/technical variables.
(f(x) will also be referenced as "transition" costs.)
Let f(y)=cost of managing the semantic reconciliation operation
over time excluding f(x) based on a given set of business/technical
variables. (f(y) will also be referenced as "steady state"
costs.)
The estimated proposal costs to be incurred by the owner for
undertaking the overall semantic reconciliation life cycle
management operation are estimated by f(z), where
f(z)=f(x)+f(y).
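As a minimal computational sketch of this structure (assuming, purely for illustration, that the component costs combine additively, which the text does not require), f(x), f(y) and f(z) could be evaluated as follows in Python.

def f_x(transition_components):
    # Transition cost f(x): combine the transition components (DSMgt, Map t, InstSemInt t, HW t, QP t, Net t, SW t), here by simple addition.
    return sum(transition_components.values())

def f_y(steady_state_components):
    # Steady state cost f(y): combine the steady state components (DSMgs, Map s, MngSemInt s, HW s, QP s, Net s, SW s).
    return sum(steady_state_components.values())

def f_z(transition_components, steady_state_components):
    # Overall life-cycle cost: f(z) = f(x) + f(y).
    return f_x(transition_components) + f_y(steady_state_components)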
[0045] The following is a generalized form of cost modeling
performed by the cost modeling program module 32 to calculate both
f(x) and f(y) and yield f(z) based on a given set of user input
regarding business/technical variables, and the facts/business
rules embedded in the knowledge management database. Note that the
generalized form of the cost model described below assumes that all
requirements data pertaining to required technical and business
variables have already been provided by the user or owner and have
been stored in the knowledge management database. Also, the cost
function calculations are based on historical data, expert
knowledge, parametric knowledge, facts and business rules
maintained in the knowledge database.
Let DSMgt=Traditional data source management costs during
transition.
Let DSMgs=Traditional data source management costs during steady
state.
Let Map t=Costs for mapping local data source schema elements from
each data source to a shared lexicon/ontology/dictionary during
transition.
Let Map s=Costs for mapping local data source schema elements from each data source to a shared lexicon/ontology/dictionary during steady state.
Let InstSemInt t=Cost of configuring and implementing the system
selected semantic reconciliation technique(s) across heterogeneous
data sources environment during transition.
Let MngSemInt s=Cost of managing and maintaining the system
selected semantic reconciliation technique(s) across heterogeneous
data sources environment during steady state.
Let QP t=Cost of estimated number of queries, which will require semantic reconciliation and which will be posed against the integrated data sources during transition.
Let QP s=Cost of estimated number of queries, which will require semantic reconciliation and which will be posed against the integrated data sources during steady state.
Let HW t=Hardware costs during transition.
Let HW s=Hardware costs during steady state.
Let Net t=Network services costs during transition.
Let Net s=Network services costs during steady state.
Let SW t=Software costs during transition.
Let SW s=Software costs during steady state.
Let Tt=Transition duration.
Let St=Steady state duration.
Then f(x) is a function of the following costs:
f(x)=f(DSMgt, Map t, InstSemInt t, HW t, QP t, Net t, SW t, Tt).
and f(y) is a function of the following costs:
f(y)=f(DSMgs, Map s, MngSemInt s, HW s, QP s, Net s, SW s, St).
The costs, which are components of f(x) and f(y), can be further
broken down into the following computed functions:
Let DSn t=number of data sources which will participate in the
integrated semantic reconciliation environment during
transition.
Let DS(OS)n t=The number of data sources in DSn t categorized per
operating system (where the operating system may be Linux RedHat,
HP-UX, AIX, Win NT operating system, etc.) during transition.
Let DS(HW)n t=The number of data sources in DSn t categorized per
hardware (HW) model (where HW may be IBM x Series, IBM p series,
Sun Fire 3800, HP 9000 computer, etc.) during transition.
Let DS(LOGL)n t=The number of data sources in DSn t per logical
schema implementation (where LOGL may be relational, object
oriented, CODASYL, etc.) during transition.
Let DS(Ven)n t=The number of data sources in DSn t per vendor
(where Ven may be MS SQL Server, IBM DB2, Oracle, Informix
database, etc.) during transition.
Let DS(Size)n t(1 . . . DSn t)=The size of each data source(1 . . . DSn t) in gigabytes during transition.
Let DS(Up)n t(1 . . . DSn t)=The number of DBMS software updates
for each data source(1 . . . DSn t) during transition.
Let DS(Vers)n t(1 . . . DSn t)=The number of version upgrades for
each data source(1 . . . DSn t) during transition.
Let DS(Rec)n t(1 . . . DSn t)=The transition costs of disaster
recovery management system for each data source(1 . . . DSn t).
Let DSn s=The number of data sources including DSn t, which will
participate in the integrated semantic reconciliation environment
during steady state.
Let DS(OS)n s=The number of data sources in DSn s categorized per
operating system (where OS may be Linux RedHat, HP-UX, AIX, Win NT
operating system, etc.) during steady state.
Let DS(HW)n s=The number of data sources in DSn s categorized per
hardware (HW) model (where HW may be IBM x Series, IBM p series,
Sun Fire 3800, HP 9000 computer, etc.) during steady state.
Let DS(LOGL)n s=The number of data sources in DSn s categorized per
logical schema implementation (where LOGL may be relational, object
oriented, CODASYL, etc.) during steady state.
Let DS(Ven)n s=The number of data sources in DSn s categorized per
vendor (where Ven may be MS SQL Server, IBM DB2, Oracle, Informix,
etc.) during steady state.
Let DS(Size)n s(1 . . . DSn s)=The size of each data source(1 . . . DSn s) in gigabytes during steady state.
Let DS(Up)n s(1 . . . DSn s)=The number of DBMS software updates
for each data source(1 . . . DSn s) during steady state.
Let DS(Vers)n s(1 . . . DSn s)=The number of version upgrades for
each data source(1 . . . DSn s) during steady state.
Let DS(Rec)n s(1 . . . DSn s)=The steady state costs of disaster
recovery management system for each data source(1 . . . DSn s).
Then DSMgt is a function of the following costs:
DSMgt=f(DSn t, DS(OS)n t, DS(HW)n t, DS(LOGL)n t, DS(Ven)n t, DS(Size)n t(1 . . . DSn t), DS(Up)n t(1 . . . DSn t), DS(Vers)n t(1 . . . DSn t), DS(Rec)n t(1 . . . DSn t), Tt).
and DSMgs is a function of the following costs:
DSMgs=f(DSMgt, DSn s, DS(OS)n s, DS(HW)n s, DS(LOGL)n s, DS(Ven)n s, DS(Size)n s(1 . . . DSn s), DS(Up)n s(1 . . . DSn s), DS(Vers)n s(1 . . . DSn s), DS(Rec)n s(1 . . . DSn s), St).
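For illustration only, the following Python sketch evaluates a DSMgt of the general form above; the additive combination and the unit rates are assumptions, since the text defines DSMgt only as a function of these inputs.

def dsmgt(num_sources, sizes_gb, dbms_updates, version_upgrades,
          recovery_costs, transition_months, rates):
    # Traditional data source management cost during transition (DSMgt), under assumed additive unit rates.
    return (num_sources * rates["per_source_per_month"] * transition_months
            + sum(sizes_gb) * rates["per_gigabyte"]
            + sum(dbms_updates) * rates["per_dbms_update"]
            + sum(version_upgrades) * rates["per_version_upgrade"]
            + sum(recovery_costs))

# Example with assumed figures for three data sources over a six month transition.
rates = {"per_source_per_month": 800.0, "per_gigabyte": 2.0,
         "per_dbms_update": 500.0, "per_version_upgrade": 1500.0}
print(dsmgt(3, [120, 40, 300], [2, 1, 3], [1, 0, 1], [4000.0, 2500.0, 6000.0], 6, rates))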
Let ElemSct(k)=The number of schema elements in each data source
`k` (where k ranges from 1 . . . DSn t) during transition.
Let Ont t(k)=The transition labor cost for implementing the selected shared lexicon/ontology/dictionary for each data source `k` (where k ranges from 1 . . . DSn t).
Let Autot(k)=The transition labor cost for the % of schema elements
in each data source `k` (where k ranges from 1 . . . DSn t) which
are automatically mapped to correct ontology terms.
Let Mant(k)=The transition labor cost for the % of schema elements
in each data source `k` (where k ranges from 1 . . . DSn t) which
are manually mapped to correct ontology terms.
Let Valdt(k)=The transition labor cost for the % of schema elements
in each data source `k` (where k ranges from 1 . . . DSn t) where
validation of accurate semantic mappings between schema terms and
ontology terms is required.
Let Compt(k)=The transition labor cost for the % of schema elements in each data source `k` (where k ranges from 1 . . . DSn t) where functional computation based on semantic mapping between schema terms and ontology terms is required.
Let Structt(k)=The transition labor cost for the % of schema
elements in each data source `k` (where k ranges from 1 . . . DSn
t) where structural heterogeneity semantic mapping between schema
terms and ontology terms is required.
Let ElemScs(k)=The number of schema elements in each data source
`k` (where k ranges from 1 . . . DSn s) during steady state.
Let Ont s(k)=The steady state labor cost for implementing the
selected shared lexicon/ontology/dictionary for each data source
`k` (where k ranges from 1 . . . DSn s).
Let Autos(k)=The steady state labor cost for the % of schema
elements in each data source `k` (where k ranges from 1 . . . DSn
s) which are automatically mapped to correct ontology terms.
Let Mans(k)=The steady state labor cost for the % of schema
elements in each data source `k` (where k ranges from 1 . . . DSn
s) which are manually mapped to correct ontology terms.
Let Valds(k)=The steady state labor cost for the % of schema
elements in each data source `k` (where k ranges from 1 . . . DSn
s) where validation of accurate semantic mappings between schema
terms and ontology terms is required.
Let Comps(k)=The steady state labor cost for the % of schema elements in each data source `k` (where k ranges from 1 . . . DSn s) where functional computation based on semantic mapping between schema terms and ontology terms is required.
Let Structs(k)=The steady state labor cost for the % of schema
elements in each data source `k` (where k ranges from 1 . . . DSn
s) where structural heterogeneity semantic mapping between schema
terms and ontology terms is required.
Then Map t is a function of the following costs:
Map t=f(DSn t, ElemSct(k), Ont t(k), Autot(k), Mant(k), Valdt(k), Compt(k), Structt(k), Tt).
and Map s is a function of the following costs:
Map s=f(Map t, DSn s, ElemScs(k), Ont s(k), Autos(k), Mans(k), Valds(k), Comps(k), Structs(k), St).
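The per-source components of Map t can likewise be illustrated with a small Python sketch: for each data source k, the schema element count is split by the percentages of elements that are automatically mapped, manually mapped, validated, functionally computed, or structurally remapped, each with its own assumed labor effort. The additive form, the effort figures and the hourly rate are assumptions.

def mapping_cost_for_source(num_elements, pct_by_kind, hours_per_element, hourly_rate):
    # pct_by_kind and hours_per_element are dicts keyed by "auto", "manual", "validate", "compute", "structural".
    total_hours = sum(num_elements * pct_by_kind[kind] * hours_per_element[kind]
                      for kind in pct_by_kind)
    return total_hours * hourly_rate

def map_t(sources, hourly_rate):
    # Map t: total transition mapping cost across all participating data sources.
    return sum(mapping_cost_for_source(s["elements"], s["pct"], s["hours"], hourly_rate)
               for s in sources)

sources = [{"elements": 120,
            "pct":   {"auto": 0.6, "manual": 0.2, "validate": 0.1, "compute": 0.05, "structural": 0.05},
            "hours": {"auto": 0.05, "manual": 0.5, "validate": 0.25, "compute": 1.0, "structural": 1.5}}]
print(map_t(sources, hourly_rate=60.0))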
[0046] Let SemConf(x) t=Transition cost of implementing the
semantic conflict identification technique `x` for identifying
semantic conflicts between a cross data source query and the schema
elements in data source `k` (where k ranges from 1 . . . DSn t),
and where x has been selected for k by system 20 as one of the
applicable techniques.
[0047] Let SemClass(y) t=Transition cost of implementing the
semantic conflict classification technique `y` for classifying
semantic conflicts between a cross data source query and schema
elements in data source `k` (where k ranges from 1 . . . DSn t),
and where y has been selected for k by system 20 as one of the
applicable techniques.
[0048] Let SemRec(z) t=Transition cost of implementing the semantic
conflict reconciliation technique `z` for reconciling semantic
conflicts between a cross data source query and schema elements in
data source `k` (where k ranges from 1 . . . DSn t) and where z has
been selected for k by system 20 as one of the applicable
techniques.
[0049] Let CombSem(x,y,z) t=Transition costs for application
integration required for installation of combination of all system
selected semantic conflict identification, semantic conflict
classification, and semantic conflict reconciliation methods
selected for a data source `k` where x, y and z have been selected
for k by system 20 as the combined applicable techniques.
[0050] Let NonMont(x,y,z) t=Transition costs for implementing a
non-monotonic semantic assertion belief maintenance system for
supporting the query context generated by the combination of all
system selected semantic conflict identification, semantic conflict
classification, and semantic conflict reconciliation methods
selected for a data source `k`. A "non-monotonic semantic assertion belief maintenance system" means a system which makes initial assumptions as to correlations between table names in heterogeneous databases, but can later recognize that those assumptions were incorrect and backtrack to make other assumptions. For example, if one table uses a term "bridge" in one database and another table uses the same term "bridge" in another database, the system may initially assume that the bridge data in both databases refers to a similar attribute. However, if the system later determines, based on vastly different data or knowledge, that the terms mean something different, for example a highway bridge versus a dental bridge, then the system can undo the correlations that were previously made based on the incorrect assumption, and then proceed with a different assumption.
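A toy Python sketch of this non-monotonic behavior (illustrative only; the application does not specify an implementation): a correlation between two "bridge" tables is assumed, later contradicted, retracted, and replaced with a different assumption.

class AssertionStore:
    # Minimal non-monotonic assertion store: assumptions can be added and later retracted.
    def __init__(self):
        self.assumptions = {}   # (term_a, term_b) -> assumed relationship

    def assume(self, term_a, term_b, relationship):
        self.assumptions[(term_a, term_b)] = relationship

    def retract(self, term_a, term_b):
        self.assumptions.pop((term_a, term_b), None)

store = AssertionStore()
store.assume("db1.Bridge", "db2.Bridge", "synonym")   # initial assumption: same attribute

# Later evidence shows db1.Bridge holds highway bridges while db2.Bridge holds dental
# bridges, so the earlier correlation is withdrawn and a different assumption is made.
store.retract("db1.Bridge", "db2.Bridge")
store.assume("db1.Bridge", "db2.Bridge", "homonym")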
Let Wrap(k)t=Transition costs for implementing a wrapper to convert semi-structured data from a semi-structured data source `k` (where k ranges from 1 . . . DSn t) to a structured schema, which can interface with semantic reconciliation techniques.
[0051] Let SemConf(x) s=Steady state cost of managing and
maintaining the semantic conflict identification technique `x` for
identifying semantic conflicts between a cross data source query
and the schema elements in data source `k` (where k ranges from 1 .
. . DSn s), and where x has been selected for k by program module
32 as one of the applicable techniques.
[0052] Let SemClass(y) s=Steady state cost of managing and
maintaining the semantic conflict classification technique `y` for
classifying semantic conflicts between a cross data source query
and schema elements in data source `k` (where k ranges from 1 . . .
DSn s), and where y has been selected for k by program module 32 as
one of the applicable techniques.
[0053] Let SemRec(z) s=Steady state cost of managing and
maintaining the semantic conflict reconciliation technique `z` for
reconciling semantic conflicts between a cross data source query
and schema elements in data source `k` (where k ranges from 1 . . .
DSn s) and where z has been selected for k by program module 32 as
one of the applicable techniques.
[0054] Let CombSem(x,y,z) s=Steady state costs for management and
maintenance of the combination of all system selected semantic
conflict identification, semantic conflict classification, and
semantic conflict reconciliation methods selected for a data source
`k` where x, y and z have been selected for k by program module 32
as the combined applicable techniques.
[0055] Let NonMont(x,y,z) s=Steady state costs for the management
and maintenance of the non-monotonic semantic assertion belief
maintenance system for supporting the query context generated by
the combination of all system selected semantic conflict
identification, semantic conflict classification, and semantic
conflict reconciliation methods selected for a data source `k`.
[0056] Let Wrap(k)s=Steady state costs for management and
maintenance of the wrapper to convert semi-structured data from a
semi-structured data source `k` (where k ranges from 1 . . . DSn s)
to a structured schema, which can interface with semantic
reconciliation techniques.
Then InstSemInt t is a function of the following costs:
InstSemInt t=f(DSn t, SemConf(x) t, SemClass(y) t, SemRec(z) t,
CombSem(x,y,z) t, NonMont(x,y,z) t, Wrap(k)t, Tt)
and MngSemInt s is a function of the following costs:
MngSemInt s=f(InstSemInt t, DSn s, SemConf(x) s, SemClass(y) s,
SemRec(z) s, CombSem(x,y,z) s, NonMont(x,y,z) s, Wrap(k)s, St).
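By way of illustration only, InstSemInt t might be computed as the sum,
over all DSn t data sources, of the transition costs of the techniques x,
y and z selected for each source together with the related integration,
belief maintenance and wrapper costs. The following Python sketch assumes
that form; the technique names and the cost figures in the lookup tables
are placeholders, not values taken from the foregoing definitions.

# Hypothetical per-technique transition cost tables (SemConf(x) t,
# SemClass(y) t, SemRec(z) t); the names and figures are placeholders.
conf_cost_t = {"schema matching": 12.0, "ontology lookup": 8.0}
class_cost_t = {"rule based": 5.0, "statistical": 9.0}
rec_cost_t = {"view rewriting": 15.0, "mediator": 20.0}

def inst_sem_int_cost_t(selected, comb_cost_t, nonmon_cost_t, wrap_cost_t):
    # selected: one (x, y, z) tuple per data source k naming the techniques
    # chosen for that source; the remaining arguments give CombSem(x,y,z) t,
    # NonMont(x,y,z) t and Wrap(k)t for each source
    total = 0.0
    for k, (x, y, z) in enumerate(selected):
        total += (conf_cost_t[x] + class_cost_t[y] + rec_cost_t[z]
                  + comb_cost_t[k] + nonmon_cost_t[k] + wrap_cost_t[k])
    return total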
Let Qn t=An estimated total number of cross data source queries to
be posed against the integrated system during transition.
Let QnOO t=An estimated number of object oriented cross data source
queries to be posed against the integrated system during
transition.
Let QnR t=An estimated number of relational cross data source
queries to be posed against the integrated system during
transition.
Let QTnt=An estimated number of cross data source queries, which
must be translated to a canonical data model form in order to be
processed against the integrated system during transition.
Let QTnCt=Transition costs of transforming cross data source
queries to a canonical data model form. A "canonical" data model
form means a common data model.
Let QMn t=An estimated total number of cross data source queries to
be posed against the integrated system which will be flagged by
program module 32 as requiring manual intervention for validation
of reconciled semantic conflicts during transition.
Let QMnC t=Transition costs for reconciling the cross data source
queries, which will be flagged by program module 32 as requiring
manual intervention for validation of reconciled semantic conflicts
during transition.
Let Qn s=An estimated total number of cross data source queries to
be posed against the integrated system during steady state.
Let QnOO s=An estimated number of Object Oriented cross data source
queries to be posed against the integrated system during steady
state.
Let QnR s=An estimated number of Relational cross data source
queries to be posed against the integrated system during steady
state.
Let QTns=An estimated number of cross data source queries, which
must be translated to a canonical data model form in order to be
processed against the integrated system during steady state.
Let QTnCs=Steady state costs of transforming cross data source
queries to a canonical data model form.
[0057] Let QMn s=An estimated total number of cross data source
queries to be posed against the integrated system which will be
flagged by program module 32 as requiring manual intervention for
validation of reconciled semantic conflicts during steady
state.
Let QMnC s=Steady state costs for reconciling the cross data source
queries, which will be flagged by program module 32 as requiring
manual intervention for validation of reconciled semantic conflicts
during steady state.
Then QP t is a function of the following costs:
QP t=f(DSn t, Qn t, QnOO t, QnR t, QTnt, QTnCt, QMn t, QMnC t, Tt)
and QP s is a function of the following costs:
QP s=f(QP t, DSn s, Qn s, QnOO s, QnR s, QTns, QTnCs, QMn s, QMnC
s, St).
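Again by way of illustration only, QP t (and likewise QP s) might be
approximated as the canonical transformation cost plus the manual
reconciliation cost plus an assumed per-query overhead for all queries
posed during the period; the Python sketch below assumes that form, and
its parameter names are hypothetical.

def query_processing_cost(qn, qtnc, qmnc, overhead_per_query=0.0):
    # qn: estimated total number of cross data source queries in the period
    # qtnc: total cost of transforming queries to canonical form (QTnC)
    # qmnc: total cost of manual validation of flagged queries (QMnC)
    # overhead_per_query: assumed fixed processing overhead per query
    return qtnc + qmnc + overhead_per_query * qn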
Let NetHW t=number of network switches, hubs, routers, gateways
required during transition for production environment.
Let NetHW s=number of network switches, hubs, routers, gateways
required during steady state for production environment.
Let Serv t=number of servers required during transition for
semantic reconciliation infrastructure production environment.
Let Serv t-lab=number of servers required during transition for
semantic reconciliation lab infrastructure environment.
Let Serv s=number of servers required during steady state for
semantic reconciliation infrastructure production environment.
Let Serv s-lab=number of servers required during steady state for
semantic reconciliation lab infrastructure environment.
Let Wkst t=number of workstations required during transition for
production environment.
Let Wkst t-lab=number of workstations required during transition
for semantic reconciliation lab environment.
Let Wkst s=number of workstations required during steady state for
production environment.
Let Wkst s-lab=number of workstations required during steady state
for semantic reconciliation laboratory environment.
Let Med t=cost of media, such as CDs, DVDs, USB devices etc.
required during transition for semantic reconciliation
environment.
Let Med s=cost of media, such as CDs, DVDs, USB devices etc.
required during steady state for semantic reconciliation
environment.
Let Per t=cost of peripherals, such as printers, cables, toner
cartridges, etc. during transition for semantic reconciliation
environment.
Let Per s=cost of peripherals, such as printers, cables, toner
cartridges, etc. during steady state for semantic reconciliation
environment.
Let TRef t=cost of refreshing the required number of servers during
the transition period.
Let TRef s=cost of refreshing the required number of servers during
the steady state period.
Then HW t is a function of the following costs:
HW t=f(NetHW t, Serv t, Serv t-lab, Wkst t, Wkst t-lab, Med t,
Per t, TRef t, Tt)
and HW s is a function of the following costs:
HW s=f(NetHW s, Serv s, Serv s-lab, Wkst s, Wkst s-lab, Med s,
Per s, TRef s, St).
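The hardware cost functions above combine equipment counts with direct
costs. One illustrative possibility, assuming per-unit purchase prices
that are not part of the foregoing definitions, is the following Python
sketch.

def hardware_cost(net_hw_count, server_count, lab_server_count,
                  wkst_count, lab_wkst_count, media_cost,
                  peripherals_cost, refresh_cost, unit_prices):
    # unit_prices: assumed per-unit prices, e.g. {"net": 1500.0,
    # "server": 8000.0, "workstation": 1200.0}; the counts correspond to
    # NetHW, Serv, Serv-lab, Wkst and Wkst-lab for the period
    equipment = (net_hw_count * unit_prices["net"]
                 + (server_count + lab_server_count) * unit_prices["server"]
                 + (wkst_count + lab_wkst_count) * unit_prices["workstation"])
    return equipment + media_cost + peripherals_cost + refresh_cost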
Let NetHWServ t=services labor costs/month for supporting NetHW t
during transition.
Let NetHWServ s=services labor costs/month for supporting NetHW s
during steady state.
Then Net t is a function of the following costs:
Net t=f(NetHW t, NetHWServ t, DSn t, Tt)
and Net s is a function of the following costs:
Net s=f(Net t, NetHW s, NetHWServ s, DSn s, St).
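Since NetHWServ t and NetHWServ s are expressed as labor costs per month,
one illustrative assumption is that the network cost for a period is the
equipment cost plus the monthly support labor scaled by the period length
(Tt or St), as in the following Python sketch; the linear form is an
assumption.

def network_cost(equipment_cost, monthly_support_labor, months):
    # equipment_cost: network hardware cost for the period (from NetHW)
    # monthly_support_labor: NetHWServ labor cost per month
    # months: Tt for transition or St for steady state
    return equipment_cost + monthly_support_labor * months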
Let LicsDBMS(k) t=cost of acquiring DBMS licenses for data source
`k` (where k ranges from 1 . . . DSn t) and for Serv t-lab during
transition.
Let LicsOnt(k) t=cost of acquiring shared
ontology/lexicon/dictionary licenses for data source `k` (where k
ranges from 1 . . . DSn t) and for Serv t-lab during
transition.
Let LicsMDW(k) t=cost of acquiring system selected middleware
product's licenses for data source `k` (where k ranges from 1 . . .
DSn t) and for Serv t-lab during transition.
Let LicsSemInt(k) t=cost of acquiring system selected semantic
reconciliation tools/techniques/methodologies licenses for data
source `k` (where k ranges from 1 . . . DSn t) and for Serv t-lab
during transition.
Let LicsSemQ(k) t=cost of acquiring system selected query
transformation and semantic query processing product licenses for
data source `k` (where k ranges from 1 . . . DSn t) and for Serv
t-lab during transition.
Let LicsNetSoft(k) t=cost of acquiring system selected network
management product's licenses for data source `k` (where k ranges
from 1 . . . DSn t) and for Serv t-lab during transition.
Let LicsWkst t=cost of acquiring system selected client side
licenses for all workstations (Wkst t and Wkst t-lab) during
transition.
Let LicsServ t-lab=cost of acquiring system selected operating
system licenses for Serv t-lab during transition.
Let LicsServ t=cost of acquiring system selected operating system
licenses for data source `k` (where k ranges from 1 . . . DSn t)
during transition.
Let LicsDBMS(k) s=cost of maintaining DBMS licenses for data source
`k` (where k ranges from 1 . . . DSn s) and for Serv s-lab during
steady state.
Let LicsOnt(k) s=cost of maintaining shared
ontology/lexicon/dictionary licenses for data source `k` (where k
ranges from 1 . . . DSn s) and for Serv s-lab during steady
state.
Let LicsMDW(k) s=cost of maintaining system selected middleware
product's licenses for data source `k` (where k ranges from 1 . . .
DSn s) and for Serv s-lab during steady state.
Let LicsSemInt(k)s=cost of maintaining system selected semantic
reconciliation tools/techniques/methodologies licenses for data
source `k` (where k ranges from 1 . . . DSn s) and for Serv s-lab
during steady state.
Let LicsSemQ(k) s=cost of maintaining system selected query
transformation and semantic query processing product licenses for
data source `k` (where k ranges from 1 . . . DSn s) and for Serv
s-lab during steady state.
Let LicsNetSoft(k) s=cost of maintaining system selected network
management product's licenses for data source `k` (where k ranges
from 1 . . . DSn s) and for Serv s-lab during steady state.
Let LicsWkst s=cost of maintaining system selected client side
licenses for all workstations (Wkst s and Wkst s-lab) during steady
state.
Let LicsServ s-lab=cost of maintaining system selected operating
system licenses for Serv s-lab during steady state.
Let LicsServ s=cost of maintaining system selected operating system
licenses for data source `k` (where k ranges from 1 . . . DSn s)
during steady state.
Then SWt is a function of the following costs:
SWt=f(DSn t, LicsDBMS(k) t, LicsOnt(k) t, LicsMDW(k) t,
LicsSemInt(k) t, LicsSemQ(k) t, LicsNetSoft(k) t, LicsWkst t,
LicsServ t-lab, LicsServ t, Tt)
and SWs is a function of the following costs:
SWs=f(SWt, DSn s, LicsDBMS(k) s, LicsOnt(k) s, LicsMDW(k) s,
LicsSemInt(k) s, LicsSemQ(k) s, LicsNetSoft(k) s, LicsWkst s,
LicsServ s-lab, LicsServ s, St).
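By way of illustration only, SWt (and correspondingly SWs) might be
computed as the sum of the per-data-source license costs plus the
workstation, lab server and production server licenses; the Python sketch
below assumes that form, and its names are hypothetical.

def software_cost(per_source_licenses, lics_wkst, lics_serv_lab, lics_serv):
    # per_source_licenses: one dict per data source k holding that source's
    # DBMS, ontology, middleware, semantic integration, semantic query and
    # network management license costs
    per_source_total = sum(sum(src.values()) for src in per_source_licenses)
    return per_source_total + lics_wkst + lics_serv_lab + lics_serv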
[0058] The program modules 30, 32, 34, 41 and 42 can be loaded into
a computer for execution from computer storage media such as
magnetic tape or disk, optical disk, DVD, etc., or downloaded from
network media via the Internet through a TCP/IP adapter card. The
storage media and network media are collectively called computer
readable media.
[0059] Based on the foregoing, a system, method and program product
for estimating the cost of reconciling heterogeneous data sources
have been disclosed. However, numerous modifications and
substitutions can be made without deviating from the scope of the
present invention. Therefore, the present invention has been
disclosed by way of illustration and not limitation, and reference
should be made to the following claims to determine the scope of
the present invention.
* * * * *