U.S. patent application number 11/357134 was filed with the patent office on 2006-02-21 and published on 2007-08-23 for data quality management using business process modeling.
Invention is credited to Sugato Bagchi, Xue Bai, Jayant Ramarao Kalagnanam.
Application Number | 20070198312 11/357134 |
Document ID | / |
Family ID | 38429449 |
Filed Date | 2006-02-21 |
United States Patent Application | 20070198312 |
Kind Code | A1 |
Bagchi; Sugato; et al. | August 23, 2007 |
Data quality management using business process modeling
Abstract
A business process modeling framework is used for data quality
analysis. The modeling framework represents the sources of
transactions entering the information processing system, the
various tasks within the process that manipulate or transform these
transactions, and the data repositories in which the transactions
are stored or aggregated. A subset of these tasks is designated as
potential error introduction sources, and the rate and
magnitude of various error classes at each such task are
probabilistically modeled. This model can be used to predict how
changes in transaction volumes and business processes impact data
quality at the aggregate level in the data repositories. The model
can also account for the presence of error correcting controls and
assess how the placement and effectiveness of these controls alter
the propagation and aggregation of errors. Optimization techniques
are used for the placement of error correcting controls that meet
target quality requirements while minimizing the cost of operating
these controls. This analysis also contributes to the development
of business "dashboards" that allow decision-makers to monitor and
react to key performance indicators (KPIs) based on aggregation of
the transactions being processed. Data quality estimation in real
time provides the accuracy of these KPIs (in terms of the
probability that a KPI is above or below a given value), which may
condition the action undertaken by the decision-maker.
Inventors: | Bagchi; Sugato (White Plains, NY); Bai; Xue (Pittsburgh, PA); Kalagnanam; Jayant Ramarao (Tarrytown, NY) |
Correspondence Address: | Whitham, Curtis, & Christofferson, P.C.; Suite 340, 11491 Sunset Hills Road, Reston, VA 20190, US |
Family ID: | 38429449 |
Appl. No.: | 11/357134 |
Filed: | February 21, 2006 |
Current U.S. Class: | 705/7.41 |
Current CPC Class: | G06F 16/215 20190101; G06Q 10/06311 20130101; G06Q 10/06375 20130101; G06F 17/18 20130101; G06Q 10/06 20130101; G06Q 10/06395 20130101; G06Q 10/067 20130101 |
Class at Publication: | 705/007 |
International Class: | G06F 17/50 20060101 G06F017/50 |
Claims
1. A data quality management method comprising the steps of:
creating a model of a new or existing business process; utilizing a
modeling framework, identifying transaction sources, error sources,
and audit targets; running error propagation analysis to estimate
error rates and cost of error at the audit targets; utilizing a
control systems model to associate error sources with a set of
controls; and analyzing an impact of selected controls using an
assessment technique.
2. The data quality management method recited in claim 1, further
comprising the steps for transaction sources of obtaining or
estimating a volume of transactions over a given time period and
estimating transaction book values.
3. The data quality management method recited in claim 2, wherein
estimating transaction book values is based on a simple average
book value or a probability distribution based on historical
transaction data.
4. The data quality management method recited in claim 1, further
comprising the steps for error sources of obtaining a probability
of errors prior to application of any controls and a taint of the
error sources.
5. The data quality management method recited in claim 4, wherein
the probability of errors and the taint of the error sources are
obtained from logs of controls that already exist.
6. The data quality management method recited in claim 4, wherein
for a new business process or for error sources that do not have
logs of past control activity, an estimation is done based on
comparable error sources with available data.
7. The data quality management method recited in claim 1, further
comprising the steps for audit targets of specifying types of
errors of interest and if any error level requirements exist for
them.
8. The data quality management method recited in claim 1, further
comprising the step for a model with probability distributions of
performing a Monte Carlo simulation to estimate error rates and
costs in terms of probability distributions.
9. The data quality management method recited in claim 1, further
comprising the step for each control of estimating its error
detection and correction effectiveness.
10. The data quality management method recited in claim 1, further
comprising the step of maximizing the reliability level at audit
targets subject to meeting a budget for a cost of controls.
11. The data quality management method recited in claim 1, further
comprising the step of minimizing a cost of controls subject to
meeting a minimum reliability level at audit targets.
Description
BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present application generally relates to modeling and
quantitative analysis techniques for managing the quality of data
and, more particularly, to extending a business process model with
constructs to identify the sources of data whose quality is of
interest, the data transformative tasks where error may be
introduced, the error detection and correction controls in the
process, and the data repositories whose quality is to be
assessed.
[0003] 2. Background Description
[0004] As companies increasingly adopt information systems that
cover a range of functional areas, they have electronic access to
vast amounts of transactional data. Increasingly, companies are
looking to develop dashboards where a variety of key performance
indicators that are composed from the transactional data are
displayed to assist business decisions. The quality of data
contained in these enterprise information systems has important
consequences, both from the internal perspective of making business
decisions based on the data as well as the legal obligation to
provide accurate reporting to external agencies and stakeholders.
As a result, companies spend considerable time and money to assess
and improve the quality of data in the transactions that flow
through their information systems and are stored in their
repositories.
[0005] A considerable body of literature exists on the issue of
data quality assessment from the perspective of auditing a given
information processing system. The prior work on data quality
management comes from the fields of financial accounting and
auditing and information systems.
[0006] Data quality and control assessment has been studied in
accounting literature since the early 1970s. Most of the studies
have approached reliability assessment with the accounting system
viewed as a "black box" that transforms data into aggregations of
account balances contained in various ledgers (see, for example, W.
R. Knechel, "The use of Quantitative Models in the Review and
Evaluation of Internal Control: A Survey and Review", Journal of
Accounting Literature, (Vol. 2), Spring 1983:205-219). This
approach works well from the perspective of an auditor who is
interested in assessing the reliability with which the black box
performs the data transformations. We review this literature to
make note of the key concepts, definitions, and analyses that we
adopt and extend in order to develop data quality modeling and
analysis techniques at the detailed level of the transformational
tasks and processes that are contained within the accounting
system.
[0007] B. E. Cushing in "A Mathematical Approach to the Analysis and
Design of Internal Control Systems" in The Accounting Review 1974,
pp. 24-41, developed a mathematical formulation for measuring the
reliability of an accounting system. He used the probability that
the system makes no errors of any kind in its outputs as the system
reliability measure. He also derived a cost measurement by taking
into consideration the cost of executing error correction
controls and the risk of undetected errors in the system. This
formulation is useful for evaluating the reliability of a
given system. However, Cushing's control model takes the system
structure as given; it does not address any problem from the system
design perspective. We apply the same basic concepts of reliability
and cost measurement to the problems of evaluating system
reliability for a detailed process model and designing the optimal
set of corrective controls with the objective of cost
minimization.
[0008] S. S. Hamlen in "A Chance-Constrained Mixed Integer
Programming Model for Internal Control Systems", The Accounting
Review 1980, pp. 578-593, proposed a mixed integer programming
model for designing an internal control system. Her model minimizes
the cost of controls subject to a given percentage of quality
improvement desired in the output from the system. In order to
formulate a linear program, the model imposes instrumental
polynomial terms with their respective constraints which have the
drawback of growing exponentially with the number of terms. The
accounting system is modeled as a set of controls that can correct
a set of error types (which could be errors in various ledgers). We
extend Hamlen's approach to a more detailed model that identifies
error sources within the business process of the accounting system
and controls that may be selectively applied to these error
sources. Our model also allows us to assess the effect of applying
a control to an error source on the resulting probability of errors
at all the ledgers that are linked to that error source. This leads
to greater flexibility in selecting controls to apply with the
potential of better solutions. We also show how our optimization
problem formulation, though more detailed than Hamlen's, can be
reduced to a non-exponential series of knapsack problems without
having to convert a non-linear system into a linear one.
[0009] Other research in accounting literature focused on
probabilistic modeling and quantitative assessment of accounting
information system reliability. These studies have focused on
accounting system level modeling of reliability assessment using
probabilistic or deterministic methods. They treat the transaction
streams and transformative processes within the accounting
information systems as a black box. Recent studies have begun to
develop more detailed models for the assessment of accounting
system reliability.
[0010] R. B. Lea, S. J. Adams, and R. F. Boykin in "Modeling of the
audit risk assessment process at the assertion level within an
account balance", Auditing: A Journal of Practice & Theory 1992
(Vol. 11, Supplement): 152-179, discussed the audit risk assessment
models at different levels of detail within accounting systems.
They model how risks of error at the level of the various
transaction streams are related to the risk of error at the account
balance level to which they contribute. They note that the level of
tolerable error at the transaction stream level cannot be assumed
to be the same as that for the account balance level. Their risk
model covers both inherent risk (in the absence of internal
controls) and control risk. We follow their motivation to decompose
an account balance to its constituent transaction streams but
extend their purely additive model to include (a) the volume of
transactions in the various streams and (b) the probabilistic
network structure of these transaction streams, identifying the
various sources of errors (as represented by a process model). This
allows us to overcome the assumption made by their model that the
errors in the various transaction streams are independent.
[0011] R. Nado, M. Chams, J. Delisio, and W. Hamscher in "Comet: An
Application of Model-Based Reasoning to Accounting Systems",
Proceedings of the Eighth Innovative Applications of Artificial
Intelligence Conference AAAI Press (1996) pp. 1482-1490, developed
a process model based reasoning system, which they called "Comet",
for analyzing the effectiveness of controls. This is one of the
earliest attempts to decompose the accounting system structure into
the level of tasks that process transactions and implement internal
controls. They modeled accounting systems as a hierarchically
structured graph with nodes representing the transaction processing
activities and collection points. The potential for failure in each
activity is propagated to the collection points that are the
accounts being audited. Controls are modeled in terms of the
probability that they will not cover the failures. This model can
be used to select the key set of controls that reduce the risk of
failure below a threshold. However, the paper does not clarify the
quantitative model (if any) that is used. It models only the
probability of failures but ignores the magnitude of error in these
failures. It also implicitly assumes identical and fixed costs for
all controls. Our model adopts the basic process modeling concepts
introduced in this paper and extends them to develop the
quantitative framework described hereinafter. This enables the
performance of rigorous quantitative analysis including Monte Carlo
simulation of inherent and control risk and optimization of control
usage based on risk and cost.
[0012] Research on data quality in the information systems
literature has focused on identifying the important characteristics
that define the quality of data (see, for example, Y. Wand and R.
Y. Wang, "Anchoring data quality dimensions in ontological
foundations", Communications of the ACM (39:11) (1996), pp. 86-95,
and R. Y. Wang, "A Product Perspective on Total Data Quality
Management", Communications of the ACM, (41:2) (1998), pp. 58-65).
Recently, the management of data quality and the quality of
associated data management processes has been identified as a
critical issue (see D. Ballou, R. Wang, H. Pazer, and G. Tayi,
"Modeling Information Manufacturing Systems to Determine
Information Product Quality", Management Science (44:4), April,
1998, pp. 462-484). However, most of the papers describe the
criteria for the information systems design to improve or achieve
good data quality (DQ) or information quality (IQ). To our
knowledge, none of these papers has tackled data quality management
from the point of view of quantitative reliability assessment and
optimization, nor have they brought the costs of quality and quality
improvement into the DQ or IQ assessment. We consider
these issues to be critical from the practical perspective of
design and management of enterprise information systems.
[0013] Wand and Wang, supra, were among the first to study
data quality in the context of information systems design. They
suggested rigorous definitions of data quality dimensions by
anchoring them in ontological foundations and showed that such
dimensions can provide guidance to systems designers on data
quality issues. They developed a set of ontological concepts,
defined design deficiencies and data quality dimensions, and then
analyzed these dimensions and their implications for
information systems design. Wang, supra, and Ballou et al., supra,
developed the Total Data Quality Management methodology (TDQM).
TDQM consists of the concepts and principles of information
quality (IQ) and the information product (IP), and the procedures of
an information management system (IMS) for defining, measuring,
analyzing, and improving information products.
[0014] L. L. Pipino, Y. W. Lee, and R. Y. Wang, in "Data Quality
Assessment", Communications of the ACM, (45:4), (2002), pp.
211-218, introduced three functional forms of data quality: simple
ratio, min or max operators, and weighted average. Based on these
functional forms, they developed the illustrative metrics for
important data quality dimensions. Finally, they presented an
approach that combines the subjective and objective assessments of
data quality, and demonstrated how the approach can be used
effectively in practice.
[0015] H. Xu in "Managing accounting information quality: an
Australian study", Managing Accounting Information Quality, (2000),
pp. 628-634, developed and tested a model that identifies the
critical success factors (CSF) influencing data quality in
accounting information systems (AIS). He first proposed a list of
factors influencing the data quality of AIS from the literature, and
then conducted pilot case studies, using the findings from the pilot
studies together with the literature to identify possible critical
success factors for data quality in accounting information systems.
He then conducted case studies of accounting information quality in
Australian organizations to test and customize the initial
research model, and compared the proposed critical success factors
with the critical success factors observed in practice.
[0016] E. M. Pierce in "Assessing Data Quality with Control
Matrices", Communications of the ACM, (47:2), (2004), pp. 82-86,
developed a technique for information quality management based on
a practice from the auditing field: an information product control
matrix used to evaluate the reliability of an information product.
Pierce defined the components of the matrix, and presented a way to
link the data problems to the quality controls that should detect
and correct these data problems during the information
manufacturing process.
[0017] D. Strong, Y. W. Lee, and R. Wang in "Data Quality in
Context", Communications of the ACM, (40:5), (1997), pp. 58-65,
proposed a data-consumer perspective for data assessments as opposed
to the traditional intrinsic DQ assessment. They presented a set of
DQ dimensions that consists of not only Intrinsic DQ, but also
Accessibility DQ, Contextual DQ and Representational DQ. The latter
three concern the user-task context. They argued that data
quality assessment should incorporate the task context of users and
the processes by which users access and manipulate data to meet
their task requirements.
[0018] Adopting Strong et al.'s idea, C. Cappiello, C.
Francalanci, and B. Pernici in "Data quality assessment from the
user's perspective", International Workshop on Information Quality
in Information Systems, 2004, proposed a data quality assessment
model that takes user requirements into consideration in the
assessment phase. Their mathematical formulation introduces
parameters and matrices that capture the preferences and
requirements of users and user classes. Their model showed how data
quality assessment should take into account how user requirements
vary with the accessed service.
SUMMARY OF THE INVENTION
[0019] Our invention addresses the issue of data quality management
from the perspective of the owner or the consumer of the
information processing system: predicting and managing the
quality of its data when faced with anticipated changes in the
business environment in which the system operates. Such changes
could include:
[0020] Changes in the relative volume of transactions arriving from
different input sources. For example, a small but fast-growing
business unit alters the mix of sales transactions over time and
therefore impacts the overall quality of sales data.
[0021] Changes in the business processes and policies that transform
the data in the transactions. For example, automated systems replace
manual tasks, or sections of a process are outsourced.
[0022] Changes in the business controls that attempt to detect and
fix errors in the transactions. For example, the thresholds that
trigger a control are altered, or controls are added or removed as
part of process re-engineering.
[0023] This invention provides the modeling and analysis for
predicting how these changes impact data quality. Then, on the
basis of this predictive ability, optimization techniques are used
for the placement of error correcting controls that meet target
quality requirements while minimizing the cost of operating these
controls. This analysis also contributes to the development of
business "dashboards" that allow decision-makers to monitor and
react to key performance indicators (KPIs) based on aggregation of
the transactions being processed. Data quality estimation in real
time provides the accuracy of these KPIs (in terms of the
probability that a KPI is above or below a given value), which may
condition the action undertaken by the decision-maker.
[0024] Our approach to modeling data quality takes advantage of the
increasing emphasis in many businesses on the formal modeling of
business processes and their underlying information processing
systems. Although the initial objective of process modeling is
usually for resource planning, and services and workflow design
purposes, data quality estimation can be an important secondary
outcome.
[0025] A business process model can be used to represent the
sources of transactions entering the information processing system
and the various tasks within the process that manipulate or
transform these transactions. We designate a subset of these tasks
as potential error introduction sources and probabilistically
model the rate and magnitude of various error classes at each such
task. We also define the information repositories such as
accounting ledgers and other databases where the transactions are
eventually stored and whose quality needs to be assessed. A network
of links (often with probabilistic branches) connects the
transaction sources, error sources, and the information
repositories.
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] The foregoing and other objects, aspects and advantages will
be better understood from the following detailed description of a
preferred embodiment of the invention with reference to the
drawings, in which:
[0027] FIG. 1 is a block diagram of a process network consisting of
transaction sources, error sources and audit targets;
[0028] FIG. 2A is a block diagram illustrating preventive controls
on an error source, and FIG. 2B is a block diagram illustrating
feed-forward control on an error source;
[0029] FIG. 3 is a block diagram illustrating a sequence of
feed-forward controls at an error source; and
[0030] FIG. 4 is an influence diagram of a simple control
system.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
Process Model
[0031] A business process model represents the flow of physical
items or informational artifacts through a sequence of tasks and
sub-processes that operate on them. The flow may be controlled by
different types of "gateways" that can diverge or converge flows
using constructs such as branches, forks, merges, and joins. These
elements form a directed graph with the tasks and gateways as
nodes. The graphs may be cyclic (with the probability of a cycle
being less than one) as well as hierarchical, where one of the
nodes could be a sub-process containing its own directed graph.
[0032] We extend the business process modeling framework by adding
the following attributes relevant to modeling data quality.
Consider a business process with T tasks, including all the tasks
in its sub-processes. We assign some of these tasks to be
transaction sources, error sources, and audit targets as defined
next.
[0033] A start event or initial task in a process may be assigned
to be a transaction source. This is the origination point of a
transaction in which an error is yet to be introduced. A
transaction source is characterized by a volume of transactions over
a predefined time period and a random variable signifying the
quantitative value of the transaction. For financial accounting
data, this is typically the book value of the transaction.
[0034] Let $T_S \subset T$ be the set of transaction sources in the
process model.
[0035] Let $x_k$ be a random variable representing the book value of
a transaction originating from the transaction source $t_k \in T_S$.
Errors could occur when data originating from a transaction source
passes through a subsequent task that is assigned to be an error
source. Error sources are tasks that operate on the incoming
transactions and could introduce errors in them.
[0036] Let $T_E \subset T$ be the set of error sources in the process
model.
[0037] Let $p_i(\epsilon)$ be the error incidence probability for
error class $\epsilon$ in the error source $t_i \in T_E$.
Borrowing from financial accounting practice, we consider three
classes of error:
[0038] 1. Valuation error, which is defined as an error in the
magnitude or value of a valid transaction. This can happen when a
transaction's book value contains the wrong number due to data entry
or mathematical calculation error.
[0039] Substituting $\epsilon = v$, let $p_i(v)$ be the probability
that a valuation error is introduced at the error source $t_i$.
[0040] Let $z_i$ be a random variable representing the taint of the
valuation error. "Taint" is defined as the ratio of the error
magnitude to the book value. If a valuation error is introduced at
the error source $t_i$, the magnitude of that error in the book value
is defined as:
$$ e_i^{v} = z_i x_i, \tag{1} $$
[0041] where $x_i$ is the observable book value of a transaction at
error source $t_i$ and $e_i^{v}$ is the random discrepancy between
this book value and the true value of the transaction, known as its
audit value.
[0042] 2. Existence error is defined as the introduction of spurious
transaction entries at the error source. This can happen if the
task at the error source erroneously introduces a new or duplicate
transaction into the business process or fails to follow a business
rule that calls for the cancellation or rejection of a real
transaction.
[0043] Substituting $\epsilon = e$, let $p_i(e)$ be the probability
that an existence error is introduced at the error source $t_i$.
[0044] Let $x_i^{e}$ be the random variable for the book value of the
spurious transaction.
[0045] If an existence error is introduced at the error source $t_i$,
the magnitude of that error in the book value is defined as:
$$ e_i^{e} = x_i^{e}. \tag{2} $$
[0046] 3. Completeness error occurs when a valid transaction is
lost or goes missing at the error source. This can happen, for
example, when a valid transaction is erroneously deleted or canceled
or if there is a failure to create a new data record as required by
a business rule at the task.
[0047] Substituting $\epsilon = c$, let $p_i(c)$ be the probability
that a completeness error is introduced at the error source $t_i$.
[0048] If a completeness error is introduced at the error source
$t_i$, the magnitude of that error in the book value is defined as:
$$ e_i^{c} = x_i. \tag{3} $$
From the above definitions of the three error classes, note that an
error source can introduce only one class of error in any single
transaction.
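The three magnitude definitions in Equations (1) to (3) can be illustrated with a minimal Python sketch. The function and parameter names (`error_magnitude`, `draw_error_class`, `spurious_value`) are hypothetical and not part of the described method, and the way a single uniform draw is mapped to at most one error class is only one possible reading of the statement that an error source introduces a single class of error per transaction.

```python
def error_magnitude(error_class, book_value, taint=0.0, spurious_value=0.0):
    """Error magnitude for one transaction at an error source t_i.

    error_class:    "v" (valuation), "e" (existence) or "c" (completeness).
    book_value:     observable book value x_i of the transaction.
    taint:          z_i, ratio of error magnitude to book value (valuation only).
    spurious_value: x_i^e, book value of the spurious entry (existence only).
    """
    if error_class == "v":
        return taint * book_value   # Equation (1): e_i^v = z_i * x_i
    if error_class == "e":
        return spurious_value       # Equation (2): e_i^e = x_i^e
    if error_class == "c":
        return book_value           # Equation (3): e_i^c = x_i
    raise ValueError(f"unknown error class: {error_class!r}")


def draw_error_class(u, p_v, p_e, p_c):
    """Map a uniform draw u in [0, 1) to at most one error class.

    Assumption for illustration: because the classes are mutually exclusive
    for a single transaction, p_v + p_e + p_c must not exceed 1, and the
    remaining probability mass means "no error" (returned as None).
    """
    if p_v + p_e + p_c > 1.0:
        raise ValueError("class probabilities must sum to at most 1")
    if u < p_v:
        return "v"
    if u < p_v + p_e:
        return "e"
    if u < p_v + p_e + p_c:
        return "c"
    return None


if __name__ == "__main__":
    # A valuation error with a 5% taint on a transaction booked at 1000.
    print(error_magnitude("v", book_value=1000.0, taint=0.05))
    # A lost transaction contributes its full book value as error.
    print(error_magnitude("c", book_value=1000.0))
```

In a simulation, `u` would typically come from `random.random()`, with the taint and book values drawn from whatever distributions the modeler has estimated.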
[0049] Audit targets are repositories in the business process where
transactions can be stored and retrieved. These could be databases
containing business and financial data that is used by the company
in its decision-making and evaluation of its strategy, or used to
generate quarterly and annual financial reports to external parties
such as shareholders and regulatory agencies.
[0050] Let $T_A \subset T$ be the set of audit targets in the process
model (where we model repositories as tasks).
[0051] Let $X_j$ be the set of transactions in an audit target
$t_j \in T_A$ and $X_j^{\epsilon}$ be the subset of transactions
containing error of class $\epsilon$.
[0052] Let $x_j$ be the book value of a transaction in $X_j$ and
$e_j^{\epsilon}$ be the magnitude of error of an erroneous
transaction in $X_j^{\epsilon}$.
[0053] As described above, we consider three mutually exclusive
classes of error: valuation, existence, and completeness, denoted by
the set $\{v, e, c\}$. Let $E_j \subset \{v, e, c\}$ be the subset of
error classes of interest in the audit target. Our objective of data
quality assessment is to quantify the error in these repositories
according to various error metrics.
[0054] 1. Rate One: Error incidence is the ratio of the number of
erroneous transactions of error class $\epsilon$ to the total number
of transactions:
$$ R1_j^{\epsilon} = \frac{|X_j^{\epsilon}|}{|X_j|}. \tag{4} $$
[0055] 2. Rate Two: Proportion of net monetary error is the ratio
of total monetary error over all erroneous transactions of error
class $\epsilon$ to the total book value over all transactions:
$$ R2_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} e_j^{\epsilon}}{\sum_{X_j} x_j}. \tag{5} $$
The proportion of net monetary error can be decomposed into the two
following rates:
[0056] 3. Rate Three: Proportion of dollar unit in error, or
tainting, is the ratio of the total monetary error over all
erroneous transactions of error class $\epsilon$ to the total book
value over the same set of erroneous transactions:
$$ R3_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} e_j^{\epsilon}}{\sum_{X_j^{\epsilon}} x_j}. \tag{6} $$
[0057] 4. Rate Four: Proportion of dollar units containing error
is the ratio of the total book value of all erroneous transactions
of error class $\epsilon$ to the total book value over all the
transactions:
$$ R4_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} x_j}{\sum_{X_j} x_j}. \tag{7} $$
The transaction sources, error sources and audit targets are
connected to each other by a network of links and gateways. Gateways
are defined as the means by which (a) the output from a single task
diverges into the inputs of multiple tasks or (b) the outputs from
multiple tasks converge into the input of a single task. The
following types of gateways are common in a process network:
1. Branch: The branch gateway sends the output of a single task to
the input of one out of multiple alternative tasks. The branching
decision is probabilistic (either directly specified or derived from
other branching criteria).
2. Merge: The merge gateway allows the output of multiple tasks to
feed into the input of a single task which is performed when it
receives an input from any one of the tasks being merged.
3. Fork: The fork gateway sends the output of a single task to the
inputs of multiple tasks at the same time, resulting in the creation
of parallel streams of task activity.
4. Join: The join gateway allows the output of multiple tasks to
feed into the input of a single task which is performed only when it
receives input from all of the tasks being joined. This is usually
present to synchronize the parallel task activities created as a
result of a fork upstream in the process network.
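The four error rates of Equations (4) to (7) can be computed directly from a list of audited transactions. The sketch below is illustrative only: the tuple layout `(book_value, error_class, error_magnitude)` and the function name `error_rates` are assumptions made for this example, and the sample data uses valuation errors so that every erroneous transaction is present in the repository.

```python
def error_rates(transactions, error_class):
    """Compute (R1, R2, R3, R4) for one error class in one audit target.

    transactions: list of (book_value, error_class_or_None, error_magnitude)
                  tuples; this layout is a hypothetical choice for the sketch.
    """
    if not transactions:
        raise ValueError("audit target contains no transactions")
    erroneous = [t for t in transactions if t[1] == error_class]

    total_book = sum(t[0] for t in transactions)   # sum of x_j over X_j
    err_book = sum(t[0] for t in erroneous)        # sum of x_j over X_j^eps
    err_total = sum(t[2] for t in erroneous)       # sum of e_j^eps over X_j^eps

    r1 = len(erroneous) / len(transactions)        # Equation (4)
    r2 = err_total / total_book                    # Equation (5)
    r3 = err_total / err_book if err_book else 0.0 # Equation (6)
    r4 = err_book / total_book                     # Equation (7)
    return r1, r2, r3, r4


if __name__ == "__main__":
    ledger = [
        (100.0, None, 0.0),
        (200.0, "v", 20.0),
        (300.0, "v", 30.0),
        (400.0, None, 0.0),
    ]
    r1, r2, r3, r4 = error_rates(ledger, "v")
    print(r1, r2, r3, r4)
    # Rate Two decomposes into Rate Three times Rate Four.
    print(abs(r2 - r3 * r4) < 1e-12)
```

The final check reflects the decomposition stated in the text: the proportion of net monetary error equals the tainting rate multiplied by the proportion of dollar units containing error.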
[0058] We can traverse a process network with the objective of
identifying the following parameters that link transaction sources
to error sources, and error sources to audit targets:
[0059] Let $V_{ki}$ be the volume of transactions that flow from a
transaction source $t_k$ to a task $t_i$ designated as an error
source.
[0060] Let $P_{ij}$ be the probability that a transaction that
flows through an error source $t_i$ will subsequently be stored
in an audit target repository $t_j$.
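One way to obtain these parameters is a recursive traversal of the process graph. The following Python sketch makes simplifying assumptions that go beyond the text: the network is acyclic, it contains only branch and merge gateways (so the outgoing probabilities of each node sum to at most one), and the graph is given as a dictionary from a node name to a list of `(successor, probability)` pairs. The node names and the helper `reach_probability` are hypothetical.

```python
def reach_probability(graph, start, target, _memo=None):
    """Probability that a transaction at `start` eventually reaches `target`.

    graph: dict mapping a node to a list of (successor, branch_probability).
    Assumes an acyclic network with branch/merge gateways only; forks and
    cycles would need a different treatment.
    """
    if _memo is None:
        _memo = {}
    if start == target:
        return 1.0
    if start in _memo:
        return _memo[start]
    total = sum(
        prob * reach_probability(graph, nxt, target, _memo)
        for nxt, prob in graph.get(start, [])
    )
    _memo[start] = total
    return total


if __name__ == "__main__":
    # Hypothetical network: an error source "enter_order" branches to two
    # downstream tasks, which post to two ledgers (audit targets).
    graph = {
        "web_orders": [("enter_order", 1.0)],
        "enter_order": [("approve", 0.6), ("review", 0.4)],
        "approve": [("ledger_A", 1.0)],
        "review": [("ledger_A", 0.5), ("ledger_B", 0.5)],
    }
    # P_ij for error source "enter_order" and each audit target.
    print(reach_probability(graph, "enter_order", "ledger_A"))
    print(reach_probability(graph, "enter_order", "ledger_B"))
    # Under the same assumptions, V_ki is the source volume times the
    # probability of reaching the error source.
    source_volume = 1000
    print(source_volume * reach_probability(graph, "web_orders", "enter_order"))
```

The memo dictionary avoids re-evaluating shared downstream paths, which matters when merge gateways make many routes converge on the same tasks.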
[0061] FIG. 1 shows a network diagram linking the transaction
sources, error sources, and audit targets. The dashed links between
any two nodes denote a (possibly null) set of tasks and gateways
(hidden in the figure) that intermediate the flow of transactions
between the two nodes in the direction shown. By definition, these
hidden tasks cannot be transaction sources, error sources, or audit
targets.
[0062] As shown by the figure, an error introduced at an error
source may be stored in several audit targets. Also, a single audit
target may contain errors introduced at multiple error sources.
With the data quality attributes defined above and the network
interconnections depicted in FIG. 1, the propagation of
transactions and their errors can now be calculated. The volume of
transactions, $V_i$, reaching error source $t_i \in T_E$ from all
transaction sources $t_k \in T_S$ is:
$$V_i = \sum_{t_k \in T_S} V_{ki}. \tag{8}$$
The book value, $x_i$, of a transaction reaching error source $t_i$
is:
$$x_i = \sum_{t_k \in T_S} \frac{x_k V_{ki}}{V_i}. \tag{9}$$
The magnitude of error, $e_i^{\epsilon}$, of error class $\epsilon$
introduced by error source $t_i$ is given by Equations (1), (2) and
(3) for valuation, existence and completeness errors respectively.
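Equations (8) and (9) amount to a volume sum and a volume-weighted average. The following is a minimal sketch; the function name, the dictionary layout, and the sample numbers are illustrative assumptions, not part of the described framework.

```python
def aggregate_at_error_source(volumes, book_values):
    """Combine the flows from all transaction sources into one error source.

    volumes[k]     -- V_ki, volume flowing from transaction source k (assumed input)
    book_values[k] -- x_k, book value per transaction at source k (assumed input)
    """
    v_i = sum(volumes.values())                                    # Eq. (8)
    x_i = sum(book_values[k] * volumes[k] for k in volumes) / v_i  # Eq. (9)
    return v_i, x_i


# Hypothetical example: two transaction sources feeding one error source.
v_i, x_i = aggregate_at_error_source(
    volumes={"web": 300, "store": 100},
    book_values={"web": 50.0, "store": 90.0},
)
print(v_i, x_i)
```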
[0063] The transactions passing through error sources propagate to
audit targets based on the probability P.sub.ij which is determined
from the process network. The aggregation of all transactions from
all error sources results in X.sub.j defined above, the set of
transactions in an audit target, t.sub.j.epsilon.T.sub.A. The
subset of these transactions containing errors depends on the error
incidence probability p.sub.i(.epsilon.) for each error source and
the volume of transactions flowing through it.
[0064] At each audit target, t.sub.j.epsilon.T.sub.A, we calculate
the set of error rates defined above, corresponding to each of the
error classes. Let $\epsilon \in [v,e,c]$ denote the class of
error for which we calculate the error rates.
$$R1_j^{\epsilon} = \frac{|X_j^{\epsilon}|}{|X_j|} = \frac{\sum_{t_i \in T_E} V_i\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, P_{ij}}, \tag{10}$$
$$R2_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} e_j}{\sum_{X_j} x_j} = \frac{\sum_{t_i \in T_E} V_i\, e_i^{\epsilon}\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, x_i\, P_{ij}}, \tag{11}$$
$$R3_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} e_j}{\sum_{X_j^{\epsilon}} x_j} = \frac{\sum_{t_i \in T_E} V_i\, e_i^{\epsilon}\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, x_i\, p_i(\epsilon)\, P_{ij}}, \tag{12}$$
$$R4_j^{\epsilon} = \frac{\sum_{X_j^{\epsilon}} x_j}{\sum_{X_j} x_j} = \frac{\sum_{t_i \in T_E} V_i\, x_i\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, x_i\, P_{ij}}. \tag{13}$$
These equations calculate error rates for a single error class. As
described above, an error source may introduce up to three classes
of errors: valuation, existence, or completeness. For a given
transaction, however, only a single class of error is possible. Due
to this mutual exclusion, the sets of erroneous transactions,
$X_j^v$, $X_j^e$, $X_j^c$, have no transactions in common (i.e.,
their pair-wise intersections are null sets). As a result of this
property, the combined error rates for all error classes are:
$$R1_j = \sum_{\epsilon \in E_j} R1_j^{\epsilon} = \sum_{\epsilon \in E_j} \frac{\sum_{t_i \in T_E} V_i\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, P_{ij}} = \frac{\sum_{t_i \in T_E} \left( V_i P_{ij} \sum_{\epsilon \in E_j} p_i(\epsilon) \right)}{\sum_{t_i \in T_E} V_i\, P_{ij}} \tag{14}$$
$$R2_j = \sum_{\epsilon \in E_j} R2_j^{\epsilon} = \sum_{\epsilon \in E_j} \frac{\sum_{t_i \in T_E} V_i\, e_i^{\epsilon}\, p_i(\epsilon)\, P_{ij}}{\sum_{t_i \in T_E} V_i\, x_i\, P_{ij}} = \frac{\sum_{t_i \in T_E} \left( V_i P_{ij} \sum_{\epsilon \in E_j} e_i^{\epsilon}\, p_i(\epsilon) \right)}{\sum_{t_i \in T_E} V_i\, x_i\, P_{ij}} \tag{15}$$
where the set $E_j \subseteq [v,e,c]$ consists of the error classes
of interest at the audit target.
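As an illustration of the combined rates in Equations (14) and (15), the sketch below evaluates them for one audit target from point estimates. The record layout (keys `V`, `x`, `P`, `p`, `e`) and the sample numbers are assumptions made for this example only.

```python
def combined_error_rates(sources, classes):
    """Return (R1_j, R2_j) for one audit target j.

    Each entry of `sources` describes one error source t_i (hypothetical layout):
      V -- volume reaching the error source, x -- book value per transaction,
      P -- probability P_ij of reaching the audit target,
      p -- dict: error class -> incidence probability p_i(eps),
      e -- dict: error class -> error magnitude e_i^eps.
    """
    stored_count = sum(s["V"] * s["P"] for s in sources)
    stored_value = sum(s["V"] * s["x"] * s["P"] for s in sources)
    error_count = sum(
        s["V"] * s["P"] * sum(s["p"][c] for c in classes) for s in sources
    )
    error_value = sum(
        s["V"] * s["P"] * sum(s["e"][c] * s["p"][c] for c in classes)
        for s in sources
    )
    return error_count / stored_count, error_value / stored_value  # Eqs. (14), (15)


example_sources = [
    {"V": 1000, "x": 50.0, "P": 0.5,
     "p": {"v": 0.02, "e": 0.01}, "e": {"v": 10.0, "e": 50.0}},
    {"V": 500, "x": 100.0, "P": 1.0,
     "p": {"v": 0.04, "e": 0.0}, "e": {"v": 20.0, "e": 100.0}},
]
r1, r2 = combined_error_rates(example_sources, classes=("v", "e"))
print(r1, r2)
```

Passing a single class, e.g. `classes=("v",)`, gives the per-class rates of Equations (10) and (11).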
[0065] These metrics can be directly calculated if point estimates
(or means only) are given for the input random variables (such as
the transaction book values x.sub.k and the taints z.sub.i). If
instead, probability distributions are specified for the random
variables, Monte Carlo simulation can be done to arrive at
probability distributions for the outputs.
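A Monte Carlo run of this kind might look like the sketch below, which estimates a distribution for the monetary error rate of Equation (11). Everything specific here is an assumption for illustration: the lognormal book values, the beta-distributed taints, the relation error magnitude = taint x book value (Equations (1) to (3) are defined elsewhere in the document and may differ), and the parameter names.

```python
import random
import statistics


def simulate_value_error_rate(sources, n_draws=2000, seed=0):
    """Draw book values and taints, and return sampled monetary error rates."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        error_value = 0.0
        book_value = 0.0
        for s in sources:
            x = rng.lognormvariate(s["x_mu"], s["x_sigma"])    # assumed book value model
            z = rng.betavariate(s["taint_a"], s["taint_b"])    # assumed taint model, in [0, 1]
            flow = s["V"] * s["P"]                             # volume reaching the audit target
            error_value += flow * (z * x) * s["p"]             # assumed e_i = z_i * x_i
            book_value += flow * x
        draws.append(error_value / book_value)
    return draws


mc_sources = [
    {"V": 1000, "P": 0.5, "p": 0.02, "x_mu": 4.0, "x_sigma": 0.5,
     "taint_a": 2, "taint_b": 5},
    {"V": 500, "P": 1.0, "p": 0.04, "x_mu": 4.5, "x_sigma": 0.3,
     "taint_a": 2, "taint_b": 8},
]
draws = simulate_value_error_rate(mc_sources)
print(statistics.mean(draws), max(draws))
```

The list of draws can then be summarized with a histogram or quantiles to express the output as a distribution rather than a point estimate.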
[0066] Cost of error arises from the failure to reduce or correct
errors that accumulate at the audit targets of a transaction
process. The cost may arise due to the additional cost or losses
incurred because of operating the business with incorrect
information (for example poor targeting of potential customers due
to erroneous sales data). The cost could also be in the form of
penalties assessed by regulatory and legal agencies due to
misstatements made as a result of incorrect data in financial
ledgers. [0067] Let $\omega_1$ be the unit cost per erroneous
transaction. [0068] Let $\omega_2$ be the unit cost per unit of
monetary error. Then, the total cost due to the number of erroneous
transactions for audit target $t_j$ can be obtained as follows,
applying Equation (14):
$$\Omega_{1,j} = R1_j\, \omega_1\, |X_j| = \omega_1 \sum_{t_i \in T_E} \left( V_i P_{ij} \sum_{\epsilon \in E_j} p_i(\epsilon) \right) \tag{16}$$
The total cost due to the magnitude of monetary error for audit
target $t_j$ can be obtained as follows, applying Equation (15):
$$\Omega_{2,j} = R2_j\, \omega_2 \sum_{X_j} x_j = \omega_2 \sum_{t_i \in T_E} \left( V_i P_{ij} \sum_{\epsilon \in E_j} e_i^{\epsilon}\, p_i(\epsilon) \right) \tag{17}$$
The total cost across all audit targets $t_j \in T_A$ is:
$$\Omega = \sum_{t_j \in T_A} \left( \Omega_{1,j} + \Omega_{2,j} \right). \tag{18}$$
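The cost terms of Equations (16) to (18) can be sketched as follows for point estimates. The data layout (one list of error-source records per audit target) and the unit costs are hypothetical values chosen for the example.

```python
def cost_of_error(targets, classes, omega_1, omega_2):
    """Total cost of error across audit targets, Eq. (18).

    targets -- dict: audit target name -> list of error-source records, each with
               V, P (probability of reaching this target), p and e (dicts by class).
    omega_1 -- unit cost per erroneous transaction
    omega_2 -- unit cost per unit of monetary error
    """
    total = 0.0
    for sources in targets.values():
        count_cost = omega_1 * sum(                                   # Eq. (16)
            s["V"] * s["P"] * sum(s["p"][c] for c in classes) for s in sources
        )
        value_cost = omega_2 * sum(                                   # Eq. (17)
            s["V"] * s["P"] * sum(s["e"][c] * s["p"][c] for c in classes)
            for s in sources
        )
        total += count_cost + value_cost
    return total


ledger_sources = [
    {"V": 1000, "P": 0.5, "p": {"v": 0.02, "e": 0.01}, "e": {"v": 10.0, "e": 50.0}},
    {"V": 500, "P": 1.0, "p": {"v": 0.04, "e": 0.0}, "e": {"v": 20.0, "e": 100.0}},
]
omega = cost_of_error({"ledger": ledger_sources}, ("v", "e"), omega_1=2.0, omega_2=0.5)
print(omega)
```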
[0069] The set of equations introduced in this section enables the
assessment of data quality at an audit target both in terms of
error rates and cost. This assessment takes into account the
structure of the business process and the location of transaction
sources and error sources within it. Process owners can use this
assessment to quantify the impact of changes in process structure
or transaction volumes on the quality of data being stored. In
auditing terminology, this level of analysis estimates the inherent
risk of the accounting system. In the next section, we begin to
estimate the effect of applying error detection and correction
controls in order to reduce error rates and costs.
Control Model
[0070] Businesses implement internal control systems to reduce the
incidence of errors in their business processes. Controls may be
implemented either to prevent errors from being introduced or to
monitor for and detect errors after they have been generated at
error sources. In the latter case, the control could attempt to
correct the errors as they are detected (feed-forward control) or
to report them so that the error-producing action may be eventually
corrected (feedback control) (see B. E. Cushing, "A Further Note on
the Mathematic Approach to Internal Control", The Accounting
Review, Vol. 50, No. 1, 1975, pp. 141-154).
[0071] For our model, we consider the controls that have a direct
impact on reducing the number of erroneous transactions introduced
at an error source. This includes preventive and feed-forward
controls but excludes feedback controls because they lack the
direct corrective action on erroneous transactions. FIGS. 2A and 2B
show how these control types interact with an error source and
alter its probability of introducing an error from p(.epsilon.) to
p(.epsilon..sub.c). More particularly, FIG. 2A shows the impact of
preventive control on an error source, and FIG. 2B shows the impact
of feed-forward control on an error source. Note that the controls
may only impact the probability of an error, not its taint.
[0072] An error source may have a sequence of feed-forward controls
associated with it to monitor, detect, and fix errors that may be
introduced by the error source or by any of the intervening
controls. This is shown in FIG. 3. The figure also depicts the
possibility that not all the transactions that leave an error
source may be sent to a control. A random sampling or a business
rule may be used to select the subset of transactions that are sent
to each control.
[0073] To develop the mathematical formula for calculating
$p(\epsilon_K)$, the probability of error after a sequence of
controls K is applied, let us consider the simplest case of one
error source and one feed-forward control. Four variables describe
the state of this control system (an overbar denotes the
complementary status, e.g., $\bar{\epsilon}$ means that no error
exists):
[0074] the error status $E=(\epsilon, \bar{\epsilon})$,
[0075] the control signaling status $C_s=(c_s, \bar{c}_s)$,
[0076] the control fixing status $C_f=(c_f, \bar{c}_f)$,
and
[0077] the error status after the application of the control,
$E_c=(\epsilon, \bar{\epsilon})$.
[0078] There are eight possible states in the control system as
described in the table below, along with the resulting impact on
the error status E.sub.c after the application of the control:
TABLE 1 Control System States
State | $E$ | $C_s$ | $C_f$ | Description | $E_c$
1 | $\epsilon$ | $c_s$ | $c_f$ | An error exists, the control signals the error, and fixes it. | $\bar{\epsilon}$
2 | $\epsilon$ | $c_s$ | $\bar{c}_f$ | An error exists, the control signals the error, and does not fix it. | $\epsilon$
3 | $\epsilon$ | $\bar{c}_s$ | $c_f$ | An error exists, the control does not signal the error, but somehow takes an action of fixing it. | $\bar{\epsilon}$
4 | $\epsilon$ | $\bar{c}_s$ | $\bar{c}_f$ | An error exists, the control does not signal the error, nor fixes it. | $\epsilon$
5 | $\bar{\epsilon}$ | $\bar{c}_s$ | $c_f$ | An error does not exist, the control does not signal the error, but somehow takes an action of error "fixing". | $\epsilon$
6 | $\bar{\epsilon}$ | $\bar{c}_s$ | $\bar{c}_f$ | An error does not exist, the control does not signal the error, nor fixes it. | $\bar{\epsilon}$
7 | $\bar{\epsilon}$ | $c_s$ | $c_f$ | An error does not exist; the control signals an error, and takes an action of error "fixing". | $\epsilon$
8 | $\bar{\epsilon}$ | $c_s$ | $\bar{c}_f$ | An error does not exist; the control signals an error, but takes no fixing action. | $\bar{\epsilon}$
We define the following exogenous attributes of a feed-forward
control that represent the effectiveness of the control (we show
later that preventive controls can be formulated as a special
case):
[0079] $p(c_s|\epsilon)$: the probability that the control signals
an error $\epsilon$ in an error source, given that the error
$\epsilon$ exists.
[0080] $p(c_s|\bar{\epsilon})$: the probability that the control
signals an error $\epsilon$ in an error source, given that the
error $\epsilon$ does not exist (counterfactual).
[0081] $p(c_f|c_s)$: the probability that the control takes an
action of error fixing, given that it signals an error $\epsilon$
in an error source.
[0082] $p(c_f|\bar{c}_s)$: the probability that the control takes
an action of error fixing, given that it does not signal an error
$\epsilon$ in an error source (counterfactual).
The influence diagram of this control system is shown in FIG. 4.
The diagram shows that if the status of $C_s$ is known, then the
status of $C_f$ is independent of the status of $E$, i.e., $C_f$
and $E$ are conditionally independent given $C_s$:
$$p(E, C_f | C_s) = p(E|C_s)\, p(C_f|C_s) \tag{19}$$
From this conditional independence, we have:
$$p(C_f | C_s, E) = \frac{p(E, C_f | C_s)}{p(E|C_s)} = \frac{p(E|C_s)\, p(C_f|C_s)}{p(E|C_s)} = p(C_f|C_s) \tag{20}$$
Using this, we derive the probability of any state in the control
system as follows:
$$p(E, C_s, C_f) = p(C_f, C_s | E)\, p(E) = p(C_f | C_s, E)\, p(C_s|E)\, p(E) = p(C_f|C_s)\, p(C_s|E)\, p(E) \tag{21}$$
We assume the following for feed-forward controls:
[0083] If a control does not signal an error, there will never be
an action of fixing an error, i.e., $p(c_f|\bar{c}_s)=0$ and
$p(\bar{c}_f|\bar{c}_s)=1$.
[0084] If a control does signal an error, there will always be an
action of fixing an error, i.e., $p(c_f|c_s)=1$ and
$p(\bar{c}_f|c_s)=0$.
These assumptions are always true for preventive controls, along
with $p(c_s|\bar{\epsilon})=0$. That is, we formulate a preventive
control as a special case of the feed-forward control where
$p(c_s|\epsilon)$ is the only parameter that can have a value
between 0 and 1. This parameter represents the effectiveness of the
control in preventing an error from being generated by the error
source.
[0085] Under these assumptions, Equation (21) reduces to the
following for each of the eight states in the control system:
$$\begin{aligned}
p(\epsilon, c_s, c_f) &= p(c_s|\epsilon)\, p(\epsilon) \\
p(\epsilon, c_s, \bar{c}_f) &= 0 \\
p(\epsilon, \bar{c}_s, c_f) &= 0 \\
p(\epsilon, \bar{c}_s, \bar{c}_f) &= p(\bar{c}_s|\epsilon)\, p(\epsilon) \\
p(\bar{\epsilon}, \bar{c}_s, c_f) &= 0 \\
p(\bar{\epsilon}, \bar{c}_s, \bar{c}_f) &= p(\bar{c}_s|\bar{\epsilon})\, p(\bar{\epsilon}) \\
p(\bar{\epsilon}, c_s, c_f) &= p(c_s|\bar{\epsilon})\, p(\bar{\epsilon}) \\
p(\bar{\epsilon}, c_s, \bar{c}_f) &= 0
\end{aligned}$$
Now we derive $p(\epsilon_c)$, the probability of error $\epsilon$
in an error source after a single control c has been applied. The
error persists or is introduced in states 2, 4, 5 and 7 of Table 1:
$$\begin{aligned}
p(\epsilon_c) &= p(\epsilon, \bar{c}_s, \bar{c}_f) + p(\bar{\epsilon}, c_s, c_f) + p(\bar{\epsilon}, \bar{c}_s, c_f) + p(\epsilon, c_s, \bar{c}_f) \\
&= p(\epsilon, \bar{c}_s, \bar{c}_f) + p(\bar{\epsilon}, c_s, c_f) \\
&= p(\bar{c}_s|\epsilon)\, p(\epsilon) + p(c_s|\bar{\epsilon})\, p(\bar{\epsilon}) \\
&= p(\epsilon)\,(1 - p(c_s|\epsilon)) + (1 - p(\epsilon))\, p(c_s|\bar{\epsilon}).
\end{aligned} \tag{22}$$
If the control c is applied only to a fraction y of all the
transactions coming out of the error source, Equation (22) is
modified to:
$$\begin{aligned}
p(\epsilon_c) &= y \left[ p(\epsilon)\,(1 - p(c_s|\epsilon)) + (1 - p(\epsilon))\, p(c_s|\bar{\epsilon}) \right] + (1 - y)\, p(\epsilon) \\
&= p(\epsilon)\,(1 - y\, p(c_s|\epsilon)) + (1 - p(\epsilon))\, y\, p(c_s|\bar{\epsilon})
\end{aligned} \tag{23}$$
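As a cross-check of the derivation, the sketch below computes the post-control error probability twice: once from the closed form of Equation (23), and once by enumerating the eight states of Table 1 with the joint probability of Equation (21) under the feed-forward assumptions. Function and parameter names are illustrative assumptions.

```python
from itertools import product


def error_after_control(p_err, p_signal_err, p_signal_ok, fraction=1.0):
    """Closed form of Eq. (23); fraction=1.0 gives Eq. (22)."""
    return (p_err * (1 - fraction * p_signal_err)
            + (1 - p_err) * fraction * p_signal_ok)


def error_after_control_by_states(p_err, p_signal_err, p_signal_ok):
    """Sum Eq. (21) over the states of Table 1 that end with an error."""
    total = 0.0
    for has_error, signals, fixes in product([True, False], repeat=3):
        p_e = p_err if has_error else 1 - p_err
        p_sig = p_signal_err if has_error else p_signal_ok
        p_s = p_sig if signals else 1 - p_sig
        p_f = 1.0 if fixes == signals else 0.0   # assumption: a fix occurs iff a signal occurs
        joint = p_f * p_s * p_e                  # Eq. (21)
        # An error remains if it was not fixed, or if a non-error was "fixed".
        if (has_error and not fixes) or (not has_error and fixes):
            total += joint
    return total


closed_form = error_after_control(0.1, 0.9, 0.05)
enumerated = error_after_control_by_states(0.1, 0.9, 0.05)
half_sampled = error_after_control(0.1, 0.9, 0.05, fraction=0.5)
print(closed_form, enumerated, half_sampled)
```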
[0086] Next, we consider $p(\epsilon_K)$, the probability of
error $\epsilon$ after the application of a sequence of controls K
to an error source, as shown in FIG. 3. Let $K=[c_j],\ j=1 \ldots J$,
where $c_j$ is the j-th control in the sequence. Then, the
probability of error after the application of the j-th control is:
$$p(\epsilon_{c,j}) = p(\epsilon_{c,j-1})\,(1 - y_j\, p(c_{s,j}|\epsilon)) + (1 - p(\epsilon_{c,j-1}))\, y_j\, p(c_{s,j}|\bar{\epsilon}) \tag{24}$$
where the subscript j on the other variables denotes the respective
variables for the j-th control.
[0087] Equation (24) can be iteratively calculated to compute
p(.epsilon..sub.c)=p(.epsilon..sub.c,J) starting at
p(.epsilon..sub.c,0)=p(.epsilon.). This quantifies the effect of
applying a regime of controls K to a single error source. Using the
error propagation formulation described above, we can now assess
the impact of the controls on the error rates and cost of error at
the audit targets in the business process. In auditing terminology,
this level of analysis estimates the control risk in the accounting
system.
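The iteration of Equation (24) is a simple fold over the control sequence. A minimal sketch follows, assuming each control is given as a tuple (sampling fraction, detection probability, false-signal probability); that tuple layout is a choice made for this example.

```python
def error_after_sequence(p_err, controls):
    """Apply Eq. (24) once per control, starting from p(eps_c,0) = p(eps).

    controls -- list of (y_j, p(c_s,j | eps), p(c_s,j | not eps)) tuples
    """
    p = p_err
    for fraction, p_signal_err, p_signal_ok in controls:
        p = p * (1 - fraction * p_signal_err) + (1 - p) * fraction * p_signal_ok
    return p


# Hypothetical regime: a full check followed by a 50% sampled check,
# neither of which raises false signals.
p_after = error_after_sequence(0.2, [(1.0, 0.5, 0.0), (0.5, 0.8, 0.0)])
print(p_after)
```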
[0088] The application of controls at an error source incurs a
cost. We consider this cost to be linearly proportional to the
number of transactions passing through the control. This cost
consists of the cost to detect if an error exists and the cost to
fix the error if found. Let .omega.(c.sub.s) be the cost to
monitor, detect and signal an error (incurred on all transactions
passing through the control) and .omega.(c.sub.f) be the cost of
fixing each error (incurred only on the transactions deemed
erroneous). Then, the cost per transaction passing through the
control is:
$$\omega(c) = \omega(c_s) + \omega(c_f) \left( p(c_f, c_s, \bar{\epsilon}) + p(c_f, c_s, \epsilon) + p(c_f, \bar{c}_s, \bar{\epsilon}) + p(c_f, \bar{c}_s, \epsilon) \right)$$
Applying the assumptions for feed-forward controls and Equation
(21),
$$\omega(c) = \omega(c_s) + \omega(c_f) \left( p(c_s|\bar{\epsilon})\,(1 - p(\epsilon)) + p(c_s|\epsilon)\, p(\epsilon) \right). \tag{25}$$
Considering the set $T_E$ of error sources and a sequence of
controls $K_i$ at an error source $t_i \in T_E$, we have the total
cost of controls in the business process:
$$\Omega_C = \sum_{t_i \in T_E} \left( V_i \sum_{c_j \in K_i} y_j\, \omega(c_j) \right) \tag{26}$$
where $V_i$, as defined in Equation (8), is the volume of
transactions reaching the error source $t_i$.
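Equations (25) and (26) can be evaluated as in the sketch below. The argument names and the sample costs are assumptions for illustration; each control is represented by its sampling fraction and its already computed per-transaction cost.

```python
def control_unit_cost(cost_signal, cost_fix, p_err, p_signal_err, p_signal_ok):
    """Per-transaction cost of one feed-forward control, Eq. (25)."""
    p_fix = p_signal_ok * (1 - p_err) + p_signal_err * p_err  # every signal triggers a fix
    return cost_signal + cost_fix * p_fix


def total_control_cost(error_sources):
    """Total cost of controls, Eq. (26).

    error_sources -- list of dicts with V (volume) and controls,
                     a list of (y_j, omega(c_j)) pairs (hypothetical layout).
    """
    return sum(
        src["V"] * sum(fraction * unit_cost for fraction, unit_cost in src["controls"])
        for src in error_sources
    )


unit_cost = control_unit_cost(0.10, 2.0, p_err=0.1, p_signal_err=0.9, p_signal_ok=0.05)
omega_c = total_control_cost([{"V": 1000, "controls": [(1.0, unit_cost), (0.5, 0.2)]}])
print(unit_cost, omega_c)
```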
[0089] Now we are in a position to formulate optimization problems
that trade off the cost of controls at the error sources with the
cost of error at the audit targets. This is done in the next
section.
Optimization
[0090] The business process and control models developed above
allow us to formulate the following series of optimization
problems. [0091] For these formulations, we use the following
variables: The overall system reliability (1-R) across all the
audit targets, where R is either $\sum_{t_j \in T_A} R1_j$ as
defined by Equation (14) or $\sum_{t_j \in T_A} R2_j$ as defined by
Equation (15). [0092] The total cost of error across all audit
targets, $\Omega$, as given in Equation (18). [0093] The total cost
of controls in the business process, $\Omega_C$, as given in
Equation (26). [0094] The decision variables $y_j$, which are the
fractions of transactions at error source $t_i \in T_E$ that will
be sent to each control $c_j \in K_i$, where $K_i$ is the sequence
of controls available for the error source $t_i$.
[0095] Using the above notation, the optimization formulations are
as follows:
[0096] 1. Maximize the system reliability (1-R), subject to a
budget $\hat{\Omega}_C$ for the total control cost in the business
process, i.e., $\Omega_C \le \hat{\Omega}_C$.
[0097] 2. Minimize the control cost $\Omega_C$, subject to a target
system reliability $(1-\hat{R})$, i.e., $R \le \hat{R}$.
[0098] 3. Minimize the cost of error $\Omega$, subject to a budget
$\hat{\Omega}_C$ for the total control cost in the business
process, i.e., $\Omega_C \le \hat{\Omega}_C$.
[0099] 4. Minimize the control cost $\Omega_C$, subject to a budget
$\hat{\Omega}$ for the total cost of error in the business process,
i.e., $\Omega \le \hat{\Omega}$.
[0100] 5. Minimize the total cost in the process
$(\Omega+\Omega_C)$.
[0101] As a special case with a tractable solution, consider the
optimization problem 4 above, where the cost of control must be
minimized so as to keep the cost of error in the system below a
threshold budget $\hat{\Omega}$.
[0102] We solve this problem by dividing it into two sub-problems.
One sub-problem is at the audit targets stage, where we wish to
minimize the total cost of error, given sets of controlled error
levels and their corresponding control cost for each error source.
The second sub-problem is to come up with these sets at the error
sources stage, where we wish to minimize the control cost for a
given error level.
[0103] For the audit target stage sub-problem, Equation (18)
calculates the total cost of error across all audit targets,
$\Omega$, which can be written as follows if we consider only a
single class of error in our analysis:
$$\Omega = \sum_{t_j \in T_A} \left( \sum_{t_i \in T_E} V_i P_{ij}\, (\omega_1 + \omega_2 e_i)\, p_i(\epsilon) \right) \tag{27}$$
[0104] To meet the $\Omega \le \hat{\Omega}$ requirement, we need
to apply controls at one or more error sources to reduce the
"posterior" error rates $p(\epsilon_K)$ at some cost. For each
error source $t_i$, we characterize a set of pairs:
$\{(\omega_{i,k_i}, p_i(\epsilon_{k_i})) \mid k_i \in \{1, 2, \ldots, K_i\}\}$,
where $\omega_{i,k_i}$ is the cost of reducing the error level at
$t_i$ to $p_i(\epsilon_{k_i})$. As described below for the second
sub-problem, where we optimize the cost of controls at the error
sources stage, $k_i$ is a control strategy that can be applied at
the error source $t_i$. Table 2 below shows the different levels of
controls and the associated cost and reliability levels.
TABLE 2 The different error levels (by applying controls) and associated cost of control
error source 1 | $(\omega_{1,1}, p_1(\epsilon_1))$ | $(\omega_{1,2}, p_1(\epsilon_2))$ | ... | $(\omega_{1,K_1}, p_1(\epsilon_{K_1}))$
error source 2 | $(\omega_{2,1}, p_2(\epsilon_1))$ | $(\omega_{2,2}, p_2(\epsilon_2))$ | ... | $(\omega_{2,K_2}, p_2(\epsilon_{K_2}))$
error source 3 | $(\omega_{3,1}, p_3(\epsilon_1))$ | $(\omega_{3,2}, p_3(\epsilon_2))$ | ... | $(\omega_{3,K_3}, p_3(\epsilon_{K_3}))$
...
error source I | $(\omega_{I,1}, p_I(\epsilon_1))$ | $(\omega_{I,2}, p_I(\epsilon_2))$ | ... | $(\omega_{I,K_I}, p_I(\epsilon_{K_I}))$
The objective here is to pick an appropriate level of control at
each error source so as to keep the system level cost of error
below the threshold budget $\hat{\Omega}$. This can be written as
follows:
$$\begin{aligned}
\min\ & \sum_{t_i \in T_E} \sum_{k_i=1}^{K_i} \omega_{i,k_i}\, z_{i,k_i} \\
\text{s.t.}\ & \Omega = \sum_{t_j \in T_A} \left( \sum_{t_i \in T_E} V_i P_{ij}\, (\omega_1 + \omega_2 e_i) \left( \sum_{k_i=1}^{K_i} z_{i,k_i}\, p_i(\epsilon_{k_i}) \right) \right) \le \hat{\Omega}, \\
& \sum_{k_i=1}^{K_i} z_{i,k_i} \le 1, \\
& z_{i,k_i} = \begin{cases} 1, & \text{if control level } k_i \text{ is chosen for error source } i \\ 0, & \text{if control level } k_i \text{ is not chosen for error source } i \end{cases}
\end{aligned} \tag{28}$$
The decision variable is $z_{i,k_i}$, $k_i \in \{1, 2, \ldots, K_i\}$,
a binary variable that takes the value of 1 if the pair
$(\omega_{i,k_i}, p_i(\epsilon_{k_i}))$ is chosen for the error
source i. The constraint $\sum_{k_i=1}^{K_i} z_{i,k_i} \le 1$
implies that at most one reliability level can be chosen for each
error source i. This problem is the multiple choice knapsack
problem (see, for example, S. Martello and P. Toth, Knapsack
Problems, Algorithms and Computer Implementations, John Wiley and
Sons Ltd., England, 1990), which can be solved by dynamic
programming in O(K.times.W) time, where K is the total number of
levels across all error sources and W is related to the accuracy
with which $\hat{\Omega}$ needs to be achieved.
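One way such a dynamic program could be organized is sketched below. It is not taken from the cited reference: the discretization of the error cost into `resolution` units, the requirement that exactly one level (possibly a zero-cost "no control" level) be picked per error source, and all names are assumptions of this example. Each option is a pair (control cost, resulting cost of error contributed by that source).

```python
import math


def choose_control_levels(options, error_budget, resolution=1000):
    """Pick one (control_cost, error_cost) level per error source.

    Minimizes total control cost subject to total error cost <= error_budget.
    Returns (total_control_cost, chosen_indices), or None if infeasible.
    """
    scale = resolution / error_budget
    # best[w] = cheapest (control_cost, picks) whose rounded error cost equals w
    best = {0: (0.0, [])}
    for levels in options:
        nxt = {}
        for used, (cost, picks) in best.items():
            for k, (control_cost, error_cost) in enumerate(levels):
                # Round up so that a selected plan never exceeds the real budget.
                w = used + math.ceil(error_cost * scale - 1e-9)
                if w > resolution:
                    continue
                candidate = (cost + control_cost, picks + [k])
                if w not in nxt or candidate[0] < nxt[w][0]:
                    nxt[w] = candidate
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical levels for two error sources; index 0 is "no control".
levels = [
    [(0.0, 10.0), (3.0, 4.0), (5.0, 1.0)],
    [(0.0, 8.0), (1.0, 5.0), (6.0, 2.0)],
]
plan = choose_control_levels(levels, error_budget=10.0)
print(plan)
```

The table `best` holds at most `resolution + 1` entries per stage, so the work grows with the total number of levels times the resolution, in line with the O(K.times.W) bound quoted above.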
[0105] Next, we develop a control model to compute the minimum cost
control strategy for each level at each error source. Although this
implies the need to solve an optimization model to compute each
(cost, error level) pair, we will show that this optimization model
is a knapsack problem which is relatively easy to solve.
[0106] For the sub-problem at the level of the error sources, our
objective is to come up with a set of (.omega..sub.i,k.sub.i,
p.sub.i(.epsilon..sub.k.sub.i)) pairs for each error source. In
doing so, we wish to minimize the cost .omega..sub.i,k.sub.i of
reducing the error level at error source t.sub.i to
p.sub.i(.epsilon..sub.k.sub.i).
[0107] Equation (24) provides the means for iteratively calculating
$p_i(\epsilon_{k_i})$ for a given set of controls $K_i$ at error
source $t_i$ and a control strategy defined by the fraction of
transactions $y_j$, $j \in \{1, 2, \ldots, |K_i|\}$, reaching each
control $c_j \in K_i$. We can reasonably assume that a control
attempting to fix a non-error will not introduce an error, i.e.,
for the states 5 and 7 in Table 1, $E_c=\bar{\epsilon}$. With this
assumption, the second term of Equation (24) vanishes and the error
incidence rate $p(\epsilon_{K_i})$ simplifies to:
$$p(\epsilon_{K_i}) = p(\epsilon) \prod_{j=1}^{|K_i|} \left( 1 - y_j\, p(c_{s,j}|\epsilon) \right), \qquad \ln p(\epsilon_{K_i}) = \ln p(\epsilon) + \sum_{j=1}^{|K_i|} \ln \left( 1 - y_j\, p(c_{s,j}|\epsilon) \right) \tag{29}$$
Observe that we have linearized the expression for
$p(\epsilon_{K_i})$ using logarithms. This suggests that the
sequence in which the controls $c_j \in K_i$ are applied is
inconsequential. So a simple optimization formulation for a single
error source with multiple controls is as follows: Given the
control units $c_j \in K_i$ and the target error level
$p(\epsilon_{k_i})$, find the optimal control strategy $k_i$,
specified in terms of $y_j$, $j \in \{1, 2, \ldots, |K_i|\}$, that
minimizes the control cost:
$$\begin{aligned}
\min\ & \sum_{j=1}^{|K_i|} \omega(c_j)\, y_j \\
\text{s.t.}\ & \ln p(\epsilon) + \sum_{j=1}^{|K_i|} \ln \left( 1 - y_j\, p(c_{s,j}|\epsilon) \right) \le \ln p(\epsilon_{k_i}), \\
& y_j \in \{0, 1\}
\end{aligned} \tag{30}$$
where $\omega(c_j)$ is the per-transaction cost of applying the
j-th control to the error source as defined in Equation (25).
Although we have assumed (implicitly, by making $y_j$ binary) that
the controls are applied to all the transactions or none, this can
be easily relaxed to allow the control of a fraction of the
transactions. Notice that the above problem is a knapsack problem
that can be solved by dynamic programming (see, again, Martello and
Toth 1990, supra) in O(J.times.R) time, where J is the number of
controls and R is a number based on the accuracy desired of
$p(\epsilon_{k_i})$.
[0108] Noting from Equation (29) that the sequence of applying
controls does not impact the probability of error after the
application of controls, we construct a simple algorithm that can
find the optimal control strategy $k_i$ for a given target error
level $p(\epsilon_{k_i})$. This is shown in Table 3. We select
the control with the highest cost-effectiveness ratio and apply it
to all the transactions in the error source. If the resulting error
level is still higher than the target, we apply the control with
the next highest cost-effectiveness ratio. When the error level
falls below the target, we adjust the sampling fraction $y_j$ of
the last selected control so as to achieve the target error level.
Thus, the sampling fractions of all controls will be 1 or 0 with
the exception of one control, whose sampling fraction will be in
[0, 1].
TABLE 3 Algorithm for Control Strategy Selection
Given: target error level $p(\epsilon_{k_i})$; candidate control set $K = \{C_1, C_2, \ldots, C_J\}$; solution set $[y_j]$, $j \in \{1, 2, \ldots, |K|\}$, initialized to [0].
Set $P = p(\epsilon)$.
1. Calculate the cost-effectiveness ratio, $p(c_{s,j}|\epsilon) / \omega(c_j)$, for each candidate control unit in K.
2. Choose the control that has the highest value: $j^* = \arg\max_{c_j \in K} \left( p(c_{s,j}|\epsilon) / \omega(c_j) \right)$.
3. Update $P = P\,(1 - p(c_{s,j^*}|\epsilon))$.
4. If $P > p(\epsilon_{k_i})$: set $y_{j^*} = 1$ and take $C_{j^*}$ off the candidate list K; if $K \neq \emptyset$, go to step 2, else terminate the procedure with failure.
   Else: set $y_{j^*} = \dfrac{P - p(\epsilon_{k_i})\,(1 - p(c_{s,j^*}|\epsilon))}{P\, p(c_{s,j^*}|\epsilon)}$ and terminate the procedure with success.
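The procedure of Table 3 can be sketched in a few lines. The tuple layout for controls and the sample values are assumptions for illustration. The final sampling fraction is computed from the error level before the last control is applied, (P_prev - target) / (P_prev x detection probability), which is algebraically the same as the expression in step 4 but avoids a division by zero when a control detects every error.

```python
def select_control_strategy(p_err, target, controls):
    """Greedy control selection following Table 3.

    controls -- list of (name, detection probability p(c_s,j|eps), unit cost omega(c_j))
    Returns a dict name -> sampling fraction y_j, or None if the target is unreachable.
    """
    fractions = {name: 0.0 for name, _, _ in controls}
    if p_err <= target:
        return fractions  # nothing to do: the uncontrolled level already meets the target
    # Steps 1-2: rank candidates by cost-effectiveness ratio, best first.
    ranked = sorted(controls, key=lambda c: c[1] / c[2], reverse=True)
    p = p_err
    for name, detect, _unit_cost in ranked:
        p_full = p * (1 - detect)          # step 3: error level if applied to all transactions
        if p_full > target:                # step 4, first branch
            fractions[name] = 1.0
            p = p_full
        else:                              # step 4, second branch: partial sampling suffices
            fractions[name] = (p - target) / (p * detect)
            return fractions
    return None                            # candidates exhausted: failure


strategy = select_control_strategy(
    p_err=0.2,
    target=0.05,
    controls=[("A", 0.5, 1.0), ("B", 0.8, 4.0), ("C", 0.6, 2.0)],
)
print(strategy)
```

In this hypothetical run, control A is applied to every transaction, control C to a sampled fraction, and control B is not used at all.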
[0109] We have described a framework for the quantitative modeling
of data quality in a business process. We have shown how the model
can be used to make assessments of data quality in a pre-defined
process as well as to develop optimal control system designs that
meet reliability or cost requirements.
[0110] These techniques will be of value to business process owners
as well as to evaluators of data quality (such as auditors in case
of business processes with financial transactions and accounts).
However, the users of these techniques must adopt a methodology by
which the data quality model is developed and maintained. The
methodology comprises the following steps:
1. Create a model of an existing business process. Various modeling
tools are commercially available for this purpose.
2. Utilizing the modeling framework developed in the Process Model
section, identify the transaction sources, error sources, and audit
targets.
[0111] 2.1. For transaction sources, obtain or estimate the volume
of transactions over a given time period (e.g., per day, month,
quarter, etc.) and estimate the transaction book values. This
may be a simple average book value or a probability distribution
based on historical transaction data.
[0112] 2.2. For error sources, obtain the probability of errors
prior to the application of any controls. This may be obtained from
the logs of controls that already exist. For a new business process
or for error sources that do not have logs of past control
activity, an estimation must be done based on comparable error
sources with available data. The taint of the error sources must
also be obtained from historical logs or otherwise estimated. Note
that the taint may be a point estimate or a probability
distribution.
[0113] 2.3. For audit targets, specify the types of errors of
interest and if any error level requirements exist for them.
[0114] 3. Run the error propagation analysis described in the
Process Model section to estimate error rates and cost of error at
the audit targets. For a model with probability distributions, a
Monte Carlo simulation can be performed to estimate error rates and
costs in terms of probability distributions. The process analyst
may develop multiple scenarios to test different expectations of
future process changes, such as changes in transaction volumes and
business process topology and policies.
[0115] 4. Utilize the control systems model developed in the
Control Model section to associate error sources with a set of
controls. These may be existing or available controls. For each
control, estimate its error detection and correction effectiveness,
as defined by the probabilities $p(c_s|\epsilon)$ and
$p(c_s|\bar{\epsilon})$. This data is available if the controls are
periodically
subject to internal or external auditing, where they are evaluated
with test data with known errors. The cost of controls can be
estimated from the time spent on each control to search for and
then fix errors.
[0116] 5. Analyze the impact of selected controls using the
assessment technique described in the Control Model section. The
process analyst may run multiple scenarios with different control
selections as well as the scenarios developed in step 3 above. The
cost of the selected controls can be compared with the reliability
level or cost of error at the audit targets.
[0117] 6. When manual search for the optimal control design is
intractable, the optimization techniques of the Optimization section are
applicable. Here, we assume that each error source has a set of
potential controls and the problem is to select the fraction of the
total transactions to send to each.
[0118] Although our model and analyses have been motivated by the
types of transaction errors and error correction controls in the
accounting and auditing domain, they can extend to other domains and
definitions of data quality. For example, we can consider that
"error" sources introduce uncertainty about the data in a
transaction rather than mistakes. Sources of uncertainty could be
prices of raw material, customer demand, product development times,
service delivery times, etc. We can adapt the error propagation
techniques of this invention to propagate these uncertainties to
the data repositories. We can also then consider the analogues of
"controls" that may reduce these uncertainties, but at a cost. For
example, uncertainties about raw material prices can be reduced by
establishing long-term contracts or hedging with options.
Variability in delivery times may be reduced by automating
processes. These uncertainty reduction actions come at a cost and
we can trade off these costs with the consequent level or cost of
the uncertainty in the data repositories.
[0119] In conclusion, our invention contributes to the analysis of
data quality by incorporating a business process framework for the
assessment and optimization of data quality. This invention applies
not only to the literature and practice of financial accounting and
auditing, but also to business decision-support systems.
[0120] While the invention has been described in terms of a single
preferred embodiment, those skilled in the art will recognize that
the invention can be practiced with modification within the spirit
and scope of the appended claims.
* * * * *