U.S. patent application number 14/792379 was filed with the patent office on 2015-07-06 and published on 2015-10-29 for dynamic outlier bias reduction system and method.
The applicant listed for this patent is HARTFORD STEAM BOILER INSPECTION & INSURANCE COMPANY. Invention is credited to Richard B. Jones.
Application Number | 20150309964 14/792379 |
Family ID | 49043335 |
Publication Date | 2015-10-29 |
United States Patent Application | 20150309964 |
Kind Code | A1 |
Jones; Richard B. | October 29, 2015 |
DYNAMIC OUTLIER BIAS REDUCTION SYSTEM AND METHOD
Abstract
A system and method are described herein for data filtering to
reduce functional and trend line outlier bias. Outliers are
removed from the data set through an objective statistical method.
Bias is determined based on absolute error, relative error, or both.
Error values are computed from the data, model coefficients, or
trend line calculations. Outlier data records are removed when the
error values are greater than or equal to the user-supplied
criteria. For optimization methods or other iterative calculations,
the removed data are re-applied each iteration to the model
computing new results. Using model values for the complete dataset,
new error values are computed and the outlier bias reduction
procedure is re-applied. Overall error is minimized for model
coefficients and outlier removed data in an iterative fashion until
user defined error improvement limits are reached. The filtered
data may be used for validation, outlier bias reduction and data
quality operations.
Inventors: | Jones; Richard B. (Georgetown, TX) |
Applicant: |
Name | City | State | Country | Type |
HARTFORD STEAM BOILER INSPECTION & INSURANCE COMPANY | Hartford | CT | US | |
Family ID: | 49043335 |
Appl. No.: | 14/792379 |
Filed: | July 6, 2015 |
Related U.S. Patent Documents
Application Number | Filing Date | Patent Number |
13772212 | Feb 20, 2013 | 9111212 |
14792379 | | |
13213780 | Aug 19, 2011 | 9069725 |
13772212 | | |
Current U.S. Class: | 703/2 |
Current CPC Class: | G06K 9/6284 20130101; G06F 17/18 20130101; G06N 7/00 20130101; G06F 30/20 20200101 |
International Class: | G06F 17/18 20060101 G06F017/18; G06F 17/50 20060101 G06F017/50 |
Claims
1. A system specialized for assessing the viability of a data set
for developing a model for a facility, comprising: an input unit
for inputting one or more data sets to be processed, wherein the
input unit comprises a measuring device configured to: measure one
or more target variables for a facility; and provide a
corresponding data set for each of the target variables; a
computing unit coupled to the input unit and for processing the one
or more data sets, wherein the computing unit comprises a processor
and a non-transient storage subsystem; and an output unit coupled
to the computing unit and for outputting one or more of the
processed data sets received from the computing unit, a computer
program stored by the non-transient storage subsystem comprising
instructions, when executed by the processor, cause the system
specialized for assessing the viability of the corresponding data
set for developing a model to perform at least the following:
generate a random data set from the corresponding data set; obtain
a set of bias criteria values used to determine one or more
outliers; perform dynamic outlier bias reduction on the
corresponding data set for one or more bias criteria values of the
set of bias criteria values to generate one or more outlier bias
reduced target data sets; perform dynamic outlier bias reduction on
the random data set for the one or more bias criteria values of the
set of bias criteria values to generate one or more outlier bias
reduced random data sets; calculate a set of target error values
for the one or more outlier bias reduced target data sets and a set
of random error values for the one or more outlier bias reduced
random data sets; calculate a set of target correlation
coefficients for the one or more outlier bias reduced target data
sets and a set of random correlation coefficients for the outlier
bias reduced random data set; construct a first bias criteria curve
for the corresponding data set and a second bias criteria curve for
the random data set from the one or more bias criteria values, the
set of target error values, the set of random error values, the set
of target correlation coefficients, and the set of random
correlation coefficients; and compare the first bias criteria curve
and the second bias criteria curve for determining viability of the
corresponding data set used to develop the model.
2. The system of claim 1, wherein the output unit is configured to
display a plot for the first bias criteria curve and the second
bias criteria curve.
3. The system of claim 1, wherein the measuring device comprises a
sensor configured to detect a compound corresponding to one of the
target variables and quantify the compound corresponding to the one
of the target variables.
4. The system of claim 1, wherein the compound is a greenhouse
chemical gas compound, and wherein the sensor is further configured
to detect and quantify the compound corresponding to the one of the
target variables continuously.
5. The system of claim 1, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the corresponding data set for developing the model to
translate the comparison of the first bias criteria curve and the
second bias criteria curve to an automated advice message that
indicates the viability of the corresponding data set used to
develop the model.
6. The system of claim 1, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the corresponding data set for developing the model to
perform dynamic outlier bias reduction on the corresponding data
set for the one or more bias criteria values of the set of bias
criteria values to generate the one or more outlier bias reduced
target data sets by performing at least the following: for each of
the one or more bias criteria values: generate a plurality of model
predicted values for the corresponding data set by applying the
model to the corresponding data set; compute a plurality of error
values determined from the corresponding data set and the model
predicted values; compare the error values with the corresponding
bias criteria value; remove outliers within the corresponding data
set to form the corresponding outlier bias reduced target data set
determined from the comparison of the error values with the
corresponding bias criteria value; and optimize the model to form
an updated model determined from the corresponding outlier bias
reduced target data set.
7. The system of claim 6, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the corresponding data set for developing the model to
perform dynamic outlier bias reduction on the corresponding data
set for the one or more bias criteria values of the set of bias
criteria values to generate the one or more outlier bias reduced
target data sets by performing at least the following: for each of
the one or more bias criteria values: compare the error values with
a predefined termination criteria to determine termination of
optimizing the model; and generate a plurality of second model
predicted values for the corresponding data set by applying the
updated model to the corresponding data set when the comparison of
the error values and the predefined termination criteria do not
represent termination of optimizing the model.
8. The system of claim 1, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the corresponding data set for developing the model to
compare the first bias criteria curve and the second bias criteria
curve for determining viability of the corresponding data set used
to develop the model by performing at least the following:
determine a first bias criteria value on the first bias criteria
curve that corresponds to a first target error value of the set of
target error values; determine a second bias criteria value on the
second bias criteria curve that corresponds to a first random error
value of the set of random error values; and compare the first bias
criteria value with the second bias criteria value, wherein the
first target error value and the first random error value are the
same.
9. The system of claim 1, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the corresponding data set for developing the model to
determine the influence of the dynamic outlier bias reduction for
each bias criteria value by performing at least the following:
comparing a number of iterations to optimize the model for each of
the bias criteria values and comparing the differences in the set
of target correlation coefficients.
10. The system of claim 1, wherein the random data set comprises
all random data values based on the corresponding data set, and
wherein the instructions, when executed by the processor, cause the
system specialized for assessing the viability of the corresponding
data set for developing the model to perform dynamic outlier bias
reduction on the random data set for the one or more bias criteria
values of the set of bias criteria values to generate the one or
more outlier bias reduced random data sets by performing at least
the following: for each of the bias criteria values: generate a
plurality of model predicted values for the random data set by
applying the model to the random data set; compute a plurality of
error values using the random data set and the model predicted
values; compare the error values with the corresponding bias
criteria value; remove outliers within the random data set to form
the corresponding outlier bias reduced random data set determined
from the comparison of the error values with the corresponding bias
criteria value; and optimize, by the specially programmed computing
system, the model to form an updated model based on the
corresponding outlier bias reduced random data set.
11. The system of claim 1, wherein at least one of the set of
target error values is a standard error, and wherein at least one of
the set of target correlation coefficients is a coefficient of
determination value.
12. The system of claim 1, wherein the random data set comprises a
plurality of random data values generated within a range of a
plurality of predicted values of the model.
13. A system specialized for assessing the viability of a
target data set for developing a model for a financial instrument,
comprising: an input unit configured to receive a target data set
corresponding to a financial instrument, wherein the target data
set comprises a plurality of data values for at least one target
variable corresponding to the financial instrument; a computing
unit coupled to the input unit, wherein the computing unit
comprises a processor and a non-transient storage subsystem, a
computer program stored by the non-transient storage subsystem
comprising instructions, when executed by the processor, cause the
system specialized for assessing the viability of the target data
set for developing a model to perform at least the following:
generate a random data set based on the target data set; receive a
plurality of bias criteria values used to determine one or more
outliers; produce a plurality of outlier bias reduced target data
sets that are associated with the bias criteria values by applying
a mathematical model and a dynamic outlier bias reduction to the
target data set; produce a plurality of outlier bias reduced random
data sets that are associated with the bias criteria values by
applying the mathematical model and the dynamic outlier bias
reduction to the random data set; calculate at least one target
error value for each of the outlier bias reduced target data sets
and at least one random error value for each of the outlier bias
reduced random data sets; calculate at least one target correlation
value for each of the outlier bias reduced target data sets and at
least one random correlation value for each of the outlier bias
reduced random data sets; construct a first bias criteria curve for
the target data set on a graph based on the at least one target
error value and the at least one target correlation value for each
of the outlier bias reduced target data sets; construct a second
bias criteria curve for the random data set on the graph based on
the at least one random error value and the at least one random
correlation value for each of the outlier bias reduced random data
sets; and compare the first bias criteria curve and the second bias
criteria curve to determine viability of the target data set used
for the mathematical model.
14. The system of claim 13, wherein the financial instrument is a common
stock, and wherein the target variable is the price of the common
stock, and wherein the target variable for the financial instrument
represents at least one of: dividends, earnings, cash flow,
earnings per share, price-to-earnings ratio, and growth rate.
15. The system of claim 13, wherein the output unit is configured to
display a plot for the first bias criteria curve and the second
bias criteria curve.
16. The system of claim 13, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the
viability of the target data set for developing the model to
produce a plurality of outlier bias reduced target data sets that
are associated with the bias criteria values by applying a
mathematical model and a dynamic outlier bias reduction to the
target data set by performing at least the following: for each of
the one or more bias criteria values: generate a plurality of model
predicted values for the target data set by applying the
mathematical model to the target data set; compute a plurality of
error values determined from the target data set and the model
predicted values; compare the error values with the corresponding
bias criteria value; remove outliers within the target data set to
form the corresponding outlier bias reduced target data set
determined from the comparison of the error values with the
corresponding bias criteria value; and optimize the mathematical
model to form an updated mathematical model determined from the
corresponding outlier bias reduced target data set.
17. The system of claim 13, wherein the instructions, when executed
by the processor, cause the system specialized for assessing the viability of
the target data set for developing the model to compare the first
bias criteria curve and the second bias criteria curve for
determining viability of the target data set used to develop the
model by performing at least the following: determine a first bias
criteria value on the first bias criteria curve that corresponds to
the at least one target error value; determine a second bias
criteria value on the second bias criteria curve that corresponds
to the at least one random error value; and compare the first bias
criteria value with the second bias criteria value, wherein the at
least one target error value and the at least one random error
value are the same.
18. A system for reducing outlier bias in target variables measured
for a facility, comprising: an input unit for inputting one or more
data sets to be processed, wherein the input unit comprises a
measuring device configured to: measure one or more target
variables for the facility; and provide a corresponding data set
for each of the target variables; a computing unit coupled to the
input unit and for processing the one or more data sets, wherein
the computing unit comprises a processor and a non-transient
storage subsystem; an output unit coupled to the computing unit and
for outputting one or more of the processed data sets received from
the computing unit; and a computer program stored by the
non-transient storage subsystem comprising instructions, when
executed by the processor, cause the system specialized for
reducing outlier bias in target variables measured for the facility
to perform at least the following: receive at least one error
threshold criteria and the corresponding data set via the database;
perform a first iteration of outlier bias reduction for the
corresponding data set that comprises: determining a set of
predicted values by applying a model comprising at least one
coefficient to the data set; comparing the set of predicted values
to the data set to produce at least one set of error values;
removing a plurality of data outliers from the data set determined
from the at least one set of error values and the at least one
error threshold criteria to generate an outlier filtered data set;
and constructing an updated model comprising at least one updated
coefficient from the outlier filtered data set; and perform a
second iteration of outlier bias reduction for the data set based
upon a determination that at least one termination criteria is not
satisfied, wherein performing the second iteration of outlier bias
reduction comprises determining a set of second predicted values by
applying the updated model to the data set.
19. The system of claim 18, wherein the instructions, when executed
by the processor, cause the system specialized for reducing outlier
bias to perform the second iteration of outlier bias reduction for
the corresponding data set that further comprises recombining the
outlier filtered data set with the data outliers to produce the
data set.
20. The system of claim 18, wherein the instructions, when executed
by the processor, cause the system specialized for reducing outlier
bias to perform the second iteration of outlier bias reduction for
the corresponding data set that further comprises: comparing the
set of second predicted values to the corresponding data set to
produce at least one set of second error values; removing a
plurality of second data outliers from the corresponding data set
determined from the at least one set of second error values and the
at least one error threshold criteria to generate a second outlier
filtered data set; and constructing a second iteration updated model
comprising at least one second updated coefficient from the second
outlier filtered data set.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application is a continuation of U.S. patent
application Ser. No. 13/772,212, filed Feb. 20, 2013 by Richard
Bradley Jones and entitled "Dynamic Outlier Bias Reduction System
and Method," which is a continuation-in-part patent application
that claims the benefit of and priority to U.S. Non-Provisional
patent application Ser. No. 13/213,780, filed Aug. 19, 2011 by
Richard Bradley Jones and entitled "Dynamic Outlier Bias Reduction
System and Method," all of which are incorporated herein by
reference in their entirety.
STATEMENTS REGARDING FEDERALLY SPONSORED RESEARCH OR
DEVELOPMENT
[0002] Not applicable.
REFERENCE TO A MICROFICHE APPENDIX
[0003] Not applicable.
FIELD OF THE INVENTION
[0004] The present invention relates to the analysis of data where
outlier elements are removed (or filtered) from the analysis
development. The analysis may be related to the computation of
simple statistics or more complex operations involving mathematical
models that use data in their development. The purpose of outlier
data filtering may be to perform data quality and data validation
operations, or to compute representative standards, statistics, or
data groups that have applications in subsequent analyses,
regression analysis, time series analysis, or qualified data for
mathematical model development.
BACKGROUND
[0005] Removing outlier data in standards or data-driven model
development is an important part of the pre-analysis work to ensure
a representative and fair analysis is developed from the underlying
data. For example, developing equitable benchmarking of greenhouse
gas standards for carbon dioxide (CO₂), ozone (O₃), water
vapor (H₂O), hydrofluorocarbons (HFCs), perfluorocarbons
(PFCs), chlorofluorocarbons (CFCs), sulfur hexafluoride (SF₆),
methane (CH₄), nitrous oxide (N₂O), carbon monoxide (CO),
nitrogen oxides (NOx), and non-methane volatile organic compounds
(NMVOCs) emissions requires that collected industrial data used in
the standards development exhibit certain properties. Extremely
good or bad performance by a few of the industrial sites should not
bias the standards computed for other sites. It may be judged
unfair or unrepresentative to include such performance results in
the standard calculations. In the past, the performance outliers
were removed via a semi-quantitative process requiring subjective
input. The present system and method is a data-driven approach that
performs this task as an integral part of the model development,
and not at the pre-analysis or pre-model development stage.
[0006] The removal of bias can be a subjective process wherein
justification is documented in some form to substantiate data
changes. However, any form of outlier removal is a form of data
censoring that carries the potential for changing calculation
results. Such data filtering may or may not reduce bias or error in
the calculation, and in the spirit of full analysis disclosure,
strict data removal guidelines and documentation to remove outliers
need to be included with the analysis results. Therefore, there is
a need in the art to provide a new system and method for
objectively removing outlier data bias using a dynamic statistical
process useful for the purposes of data quality operations, data
validation, statistic calculations or mathematical model
development, etc. The outlier bias removal system and method can
also be used to group data into representative categories where the
data is applied to the development of mathematical models
customized to each group. In a preferred embodiment, coefficients
are defined as multiplicative and additive factors in mathematical
models and also other numerical parameters that are nonlinear in
nature. For example, in the mathematical model
$f(x, y, z) = a x + b y^{c} + d \sin(e z) + f$, the terms a, b, c, d, e, and f are all
defined as coefficients. The values of these terms may be fixed or
part of the development of the mathematical model.
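The example model in paragraph [0006] can be written out as a short function. This is only an illustrative sketch: the function name `model` and the tuple layout of `coeffs` are assumptions made here, not names from the application.

```python
import math


def model(x, y, z, coeffs):
    """Evaluate a*x + b*y**c + d*sin(e*z) + f for one record.

    `coeffs` is a hypothetical 6-tuple (a, b, c, d, e, f); a, b, d, and f are
    the multiplicative and additive factors, while c and e enter nonlinearly.
    """
    a, b, c, d, e, f = coeffs
    return a * x + b * y ** c + d * math.sin(e * z) + f


# Example evaluation with arbitrary illustrative coefficients.
value = model(1.0, 2.0, 0.0, (1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
```

Any of the six entries may be held fixed or handed to an optimizer, which matches the statement that coefficient values may be fixed or part of the model development.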
BRIEF SUMMARY
[0007] A preferred embodiment includes a computer implemented
method for reducing outlier bias comprising the steps of: selecting
a bias criteria; providing a data set; providing a set of model
coefficients; selecting a set of target values; (1) generating a
set of predicted values for the complete data set; (2) generating
an error set for the dataset; (3) generating a set of error
threshold values based on the error set and the bias criteria; (4)
generating, by a processor, a censored data set based on the error
set and the set of error threshold values; (5) generating, by the
processor, a set of new model coefficients; and (6) using the set
of new model coefficients, repeating steps (1)-(5), unless a
censoring performance termination criteria is satisfied. In a
preferred embodiment, the set of predicted values may be generated
based on the data set and the set of model coefficients. In a
preferred embodiment, the error set may comprise a set of absolute
errors and a set of relative errors, generated based on the set of
predicted values and the set of target values. In another
embodiment, the error set may comprise values calculated as the
difference between the set of predicted values and the set of
target values. In another embodiment, the step of generating the
set of new coefficients may further comprise the step of minimizing
the set of errors between the set of predicted values and the set
of actual values, which can be accomplished using a linear or a
non-linear optimization model. In a preferred embodiment, the
censoring performance termination criteria may be based on a
standard error and a coefficient of determination.
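Steps (1)-(6) of paragraph [0007] can be sketched as a loop. The sketch below rests on several assumptions that the paragraph does not fix: the model is a straight line fitted by ordinary least squares, the error is the squared absolute error, the bias criteria is a percentile of the error set (nearest-rank rule), and termination is declared when the standard error and the coefficient of determination change by less than `tol`. All function and parameter names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def percentile(values, pct):
    """Nearest-rank percentile; this thresholding rule is an assumption."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def performance(coeffs, xs, ys):
    """Standard error (RMS residual) and coefficient of determination."""
    slope, intercept = coeffs
    ss_res = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.sqrt(ss_res / len(ys)), 1.0 - ss_res / ss_tot


def reduce_outlier_bias(xs, ys, bias_pct=80.0, tol=1e-6, max_iter=50):
    coeffs = fit_line(xs, ys)  # initial model coefficients
    kept = list(range(len(xs)))
    previous = None
    for _ in range(max_iter):
        slope, intercept = coeffs
        # (1) predicted values for the complete data set
        predicted = [slope * x + intercept for x in xs]
        # (2) error set (squared absolute error per record)
        errors = [(p - y) ** 2 for p, y in zip(predicted, ys)]
        # (3) error threshold from the error set and the bias criteria
        threshold = percentile(errors, bias_pct)
        # (4) censored data set: drop records with error >= threshold
        candidates = [i for i, e in enumerate(errors) if e < threshold]
        if len(candidates) < 3:
            break  # too few records left to refit; keep the last result
        kept = candidates
        cx = [xs[i] for i in kept]
        cy = [ys[i] for i in kept]
        # (5) new model coefficients from the censored data
        coeffs = fit_line(cx, cy)
        # (6) stop once standard error and r^2 no longer change
        current = performance(coeffs, cx, cy)
        if previous is not None and all(
            abs(a - b) < tol for a, b in zip(current, previous)
        ):
            break
        previous = current
    return coeffs, kept
```

Note that every pass recomputes errors over the complete data set, so a record censored in one iteration can re-enter in the next if the updated coefficients fit it better. That re-application of removed data is what the abstract calls dynamic.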
[0008] Another embodiment includes a computer implemented method
for reducing outlier bias comprising the steps of: selecting an
error criteria; selecting a data set; selecting a set of actual
values; selecting an initial set of model coefficients; generating
a set of model predicted values based on the complete data set and
the initial set of model coefficients; (1) generating a set of
errors based on the model predicted values and the set of actual
values for the complete dataset; (2) generating a set of error
threshold values based on the complete set of errors and the error
criteria for the complete data set; (3) generating an outlier
removed data set, wherein the filtering is based on the complete
data set and the set of error threshold values; (4) generating a
set of new coefficients based on the filtered data set and the set
of previous coefficients, wherein the generation of the set of new
coefficients is performed by the computer processor; (5) generating
a set of outlier bias reduced model predicted values based on the
filtered data set and the set of new model coefficients, wherein
the generation of the set of outlier bias reduced model predicted
values is performed by a computer processor; (6) generating a set
of model performance values based on the model predicted values and
the set of actual values; repeating steps (1)-(6), while
substituting the set of new coefficients for the set of
coefficients from the previous iteration, unless: a performance
termination criteria is satisfied; and storing the set of model
predicted values in a computer data medium.
[0009] Another embodiment includes a computer implemented method
for reducing outlier bias comprising the steps of: selecting a
target variable for a facility; selecting a set of actual values of
the target variable; identifying a plurality of variables for the
facility that are related to the target variable; obtaining a data
set for the facility, the data set comprising values for the
plurality of variables; selecting a bias criteria; selecting a set
of model coefficients; (1) generating a set of predicted values
based on the complete data set and the set of model coefficients;
(2) generating a set of censoring model performance values based on
the set of predicted values and the set of actual values; (3)
generating an error set based on the set of predicted values and
the set of actual values for the target variable; (4) generating a
set of error threshold values based on the error set and the bias
criteria; (5) generating, by a processor, a censored data set based
on the data set and the set of error thresholds; (6) generating, by
the processor, a set of new model coefficients based on the
censored data set and the set of model coefficients; (7)
generating, by the processor, a set of new predicted values based
on the data set and the set of new model coefficients; (8)
generating a set of new censoring model performance values based on
the set of new predicted values and the set of actual values; using
the set of new coefficients, repeating steps (1)-(8) unless a
censoring performance termination criteria is satisfied; and
storing the set of new model predicted values in a computer data
medium.
[0010] Another embodiment includes a computer implemented method
for reducing outlier bias comprising the steps of: determining a
target variable for a facility, wherein the target variable is a
metric for an industrial facility related to its production,
financial performance, or emissions; identifying a plurality of
variables for the facility, wherein the plurality of variables
comprises: a plurality of direct variables for the facility that
influence the target variable; and a set of transformed variables
for the facility, each transformed variable is a function of at
least one direct facility variable that influences the target
variable; selecting an error criteria comprising: an absolute
error, and a relative error; obtaining a data set for the facility,
wherein the data set comprises values for the plurality of
variables; selecting a set of actual values of the target variable;
selecting an initial set of model coefficients; generating a set of
model predicted values based on the complete data set and the
initial set of model coefficients; generating a complete set of
errors based on the set of model predicted values and the set of
actual values, wherein the relative error is calculated using the
formula: $\text{Relative Error}_m = \left( \frac{\text{Predicted Value}_m - \text{Actual Value}_m}{\text{Actual Value}_m} \right)^2$,
wherein `m` is a reference
number, and wherein the absolute error is calculated using the
formula: $\text{Absolute Error}_m = \left( \text{Predicted Value}_m - \text{Actual Value}_m \right)^2$;
generating a set of model performance values
based on the set of model predicted values and the set of actual
values, wherein the set of overall model performance values
comprises: a first standard error, and a first coefficient of
determination; (1) generating a set of errors based on the model
predicted values and the set of actual values for the complete
dataset; (2) generating a set of error threshold values based on
the complete set of errors and the error criteria for the complete
data set; (3) generating an outlier removed data set by removing
data with error values greater than or equal to the error threshold
values, wherein the filtering is based on the complete data set and
the set of error threshold values; (4) generating a set of outlier
bias reduced model predicted values based on the outlier removed
data set and the set of model coefficients by minimizing the error
between the set of predicted values and the set of actual values
using at least one of: a linear optimization model, and a nonlinear
optimization model, wherein the generation of the new model
predicted values is performed by a computer processor; (5)
generating a set of new coefficients based on the outlier removed
data set and the previous set of coefficients, wherein the
generation of the set of new coefficients is performed by the
computer processor; (6) generating a set of overall model
performance values based on the set of new predicted model values
and the set of actual values, wherein the set of model performance
values comprise: a second standard error, and a second coefficient
of determination; repeating steps (1)-(6), while substituting the
set of new coefficients for the set of coefficients from the
previous iteration, unless: a performance termination criteria is
satisfied, wherein the performance termination criteria comprises:
a standard error termination value and a coefficient of
determination termination value, and wherein satisfying the
performance termination criteria comprises: the standard error
termination value is greater than the difference between the first
and second standard error, and the coefficient of determination
termination value is greater than the difference between the first
and second coefficient of determination; and storing the set of new
model predicted values in a computer data medium.
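The error formulas and the termination test of paragraph [0010] can be sketched as small helper functions. The relative and absolute error definitions follow the formulas in the paragraph. The root-mean-square form of the standard error, the use of the absolute value of each difference, and all function names are assumptions made for illustration.

```python
import math


def relative_errors(predicted, actual):
    """((Predicted_m - Actual_m) / Actual_m) ** 2 for every record m."""
    return [((p - a) / a) ** 2 for p, a in zip(predicted, actual)]


def absolute_errors(predicted, actual):
    """(Predicted_m - Actual_m) ** 2 for every record m."""
    return [(p - a) ** 2 for p, a in zip(predicted, actual)]


def standard_error(predicted, actual):
    """Root-mean-square residual (assumed definition of standard error)."""
    return math.sqrt(sum(absolute_errors(predicted, actual)) / len(actual))


def coefficient_of_determination(predicted, actual):
    """r^2 = 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum(absolute_errors(predicted, actual))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def termination_satisfied(se_first, se_second, r2_first, r2_second,
                          se_limit, r2_limit):
    """True when both termination values exceed the observed differences."""
    return (se_limit > abs(se_first - se_second)
            and r2_limit > abs(r2_first - r2_second))
```

For instance, `termination_satisfied(1.0, 0.999, 0.90, 0.9005, 0.01, 0.001)` evaluates to `True`, because both the standard error and the coefficient of determination changed by less than their termination values between iterations.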
[0011] Another embodiment includes a computer implemented method
for reducing outlier bias comprising the steps of: selecting an
error criteria; selecting a data set; selecting a set of actual
values; selecting an initial set of model predicted values;
determining a set of errors based on the set of model predicted
values and the set of actual values; (1) determining a set of error
threshold values based on the complete set of errors and the error
criteria; (2) generating an outlier removed data set, wherein the
filtering is based on the data set and the set of error threshold
values; (3) generating a set of outlier bias reduced model
predicted values based on the outlier removed data set and the
previous model predicted values, wherein the generation of the set
of outlier bias reduced model predicted values is performed by a
computer processor; (4) determining a set of errors based on the
set of new model predicted values and the set of actual values;
repeating steps (1)-(4), while substituting the set of new model
predicted values for the set of model predicted values from the
previous iteration, unless: a performance termination criteria is
satisfied; and storing the set of outlier bias reduced model
predicted values in a computer data medium.
[0012] Another embodiment includes a computer implemented method
for reducing outlier bias comprising the steps of: determining a
target variable for a facility; identifying a plurality of
variables for the facility, wherein the plurality of variables
comprises: a plurality of direct variables for the facility that
influence the target variable; and a set of transformed variables
for the facility, each transformed variable being a function of at
least one direct facility variable that influences the target
variable; selecting an error criteria comprising: an absolute
error, and a relative error; obtaining a data set, wherein the data
set comprises values for the plurality of variables, and selecting
a set of actual values of the target variable; selecting an initial
set of model coefficients; generating a set of model predicted
values by applying a set of model coefficients to the data set;
determining a set of performance values based on the set of model
predicted values and the set of actual values, wherein the set of
performance values comprises: a first standard error, and a first
coefficient of determination; (1) generating a set of errors based
on the set of model predicted values and the set of actual values
for the complete dataset, wherein the relative error is calculated
using the formula: $\text{Relative Error}_m = ((\text{Predicted Value}_m - \text{Actual Value}_m)/\text{Actual Value}_m)^2$, wherein
`m` is a reference number, and wherein the absolute error is
calculated using the formula: $\text{Absolute Error}_m = (\text{Predicted Value}_m - \text{Actual Value}_m)^2$; (2) generating a set of
error threshold values based on the complete set of errors and the
error criteria for the complete data set; (3) generating an outlier
removed data set by removing data with error values greater than or
equal to the set of error threshold values, wherein the filtering
is based on the data set and the set of error threshold values; (4)
generating a set of new coefficients based on the outlier removed
data set and the set of previous coefficients; (5) generating a set
of outlier bias reduced model predicted values based on the outlier
removed data set and the set of new model coefficients by minimizing
the error between the set of predicted values and the set of actual
values using at least one of: a linear optimization model, and a
nonlinear optimization model, wherein the generation of the model
predicted values is performed by a computer processor; (6)
generating a set of updated performance values based on the set of
outlier bias reduced model predicted values and the set of actual
values, wherein the set of updated performance values comprises: a
second standard error, and a second coefficient of determination;
repeating steps (1)-(6), while substituting the set of new
coefficients for the set of coefficients from the previous
iteration, unless: a performance termination criteria is satisfied,
wherein the performance termination criteria comprises: a standard
error termination value, and a coefficient of determination
termination value, and wherein satisfying the performance
termination criteria comprises the standard error termination value
is greater than the difference between the first and second
standard error, and the coefficient of determination termination
value is greater than the difference between the first and second
coefficient of determination; and storing the set of outlier bias
reduction factors in a computer data medium.
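As a non-limiting sketch, the relative and absolute error formulas and the removal of records at or above the error thresholds may be expressed as shown below. All function names are hypothetical. Two points are assumptions rather than statements of the embodiment: the threshold is taken as a nearest-rank percentile of the complete error set, and a record is removed when either of its errors reaches its threshold.

```python
import math


def squared_errors(predicted, actual):
    """Return (relative, absolute) squared errors, indexed by record m.

    Follows the two formulas in the text. Actual values of zero would make
    the relative error undefined and are not handled in this sketch.
    """
    relative = [((p - a) / a) ** 2 for p, a in zip(predicted, actual)]
    absolute = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return relative, absolute


def percentile_threshold(errors, criterion_pct):
    """ASSUMPTION: threshold = nearest-rank percentile of all errors."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(criterion_pct / 100 * len(ordered)))
    return ordered[rank - 1]


def accepted_indices(relative, absolute, rel_threshold, abs_threshold):
    """Keep records whose errors are both strictly below their thresholds.

    Records with an error greater than or equal to a threshold are removed.
    ASSUMPTION: exceeding either threshold is enough for removal.
    """
    return [m for m, (rel_err, abs_err) in enumerate(zip(relative, absolute))
            if rel_err < rel_threshold and abs_err < abs_threshold]


predicted = [10.0, 20.0, 30.0, 100.0]
actual = [10.0, 22.0, 30.0, 50.0]
relative, absolute = squared_errors(predicted, actual)
kept = accepted_indices(relative, absolute,
                        percentile_threshold(relative, 90),
                        percentile_threshold(absolute, 90))
# The last record (predicted 100 against actual 50) is removed; kept == [0, 1, 2].
```

Because the thresholds are computed from the errors of the complete data set, a record removed in one iteration remains eligible for acceptance in a later iteration.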
[0013] Another embodiment includes a computer implemented method
for assessing the viability of a data set as used in developing a
model comprising the steps of: providing a target data set
comprising a plurality of data values; generating a random target
data set based on the target dataset; selecting a set of bias
criteria values; generating, by a processor, an outlier bias
reduced target data set based on the data set and each of the
selected bias criteria values; generating, by the processor, an
outlier bias reduced random data set based on the random data set
and each of the selected bias criteria values; calculating a set of
error values for the outlier bias reduced data set and the outlier
bias reduced random data set; calculating a set of correlation
coefficients for the outlier bias reduced data set and the outlier
bias reduced random data set; generating bias criteria curves for
the data set and the random data set based on the selected bias
criteria values and the corresponding error value and correlation
coefficient; and comparing the bias criteria curve for the data set
to the bias criteria curve for the random data set. The outlier
bias reduced target data set and the outlier bias reduced random
target data set are generated using the Dynamic Outlier Bias
Removal methodology. The random target data set can comprise
randomized data values developed from values within the range of
the plurality of data values. Also, the set of error values can
comprise a set of standard errors, and the set of
correlation coefficients comprises a set of coefficient of
determination values. Another embodiment can further comprise the
step of generating automated advice regarding the viability of the
target data set to support the developed model, and vice versa,
based on comparing the bias criteria curve for the target data set
to the bias criteria curve for the random target data set. Advice
can be generated based on parameters selected by analysts, such as
a correlation coefficient threshold and/or an error threshold. Yet
another embodiment further comprises the steps of: providing an
actual data set comprising a plurality of actual data values
corresponding to the model predicted values; generating a random
actual data set based on the actual data set; generating, by a
processor, an outlier bias reduced actual data set based on the
actual data set and each of the selected bias criteria values;
generating, by the processor, an outlier bias reduced random actual
data set based on the random actual data set and each of the
selected bias criteria values; generating, for each selected bias
criteria, a random data plot based on the outlier bias reduced
random target data set and the outlier bias reduced random actual
data; generating, for each selected bias criteria, a realistic data
plot based on the outlier bias reduced target data set and the
outlier bias reduced actual target data set; and comparing the
random data plot with the realistic data plot corresponding to each
of the selected bias criteria.
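For illustration, the comparison of bias criteria curves for a target data set and a randomized counterpart may be sketched as follows. This is a simplified sketch under stated assumptions, not the Dynamic Outlier Bias Removal methodology itself: a single-pass filter that keeps the stated percentage of records with the smallest squared error stands in for the outlier bias reduction step, the standard error is taken as the root mean squared residual, and the coefficient of determination is taken as the squared Pearson correlation. All function names and the sample data are hypothetical.

```python
import math
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation, used here as the coefficient of determination."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def standard_error(predicted, actual):
    """ASSUMPTION: root mean squared residual."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def keep_smallest_errors(predicted, actual, criterion_pct):
    """Stand-in for outlier bias reduction: keep criterion_pct% of records."""
    pairs = sorted(zip(predicted, actual), key=lambda pa: (pa[0] - pa[1]) ** 2)
    n_keep = max(2, math.ceil(criterion_pct / 100 * len(pairs)))
    kept = pairs[:n_keep]
    return [p for p, _ in kept], [a for _, a in kept]


def bias_criteria_curve(predicted, actual, criteria):
    """One (criterion, standard error, r^2) point per bias criteria value."""
    curve = []
    for c in criteria:
        p, a = keep_smallest_errors(predicted, actual, c)
        curve.append((c, standard_error(p, a), pearson_r2(p, a)))
    return curve


def randomized(values, seed=0):
    """Random data set drawn uniformly from within the range of the values."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    return [rng.uniform(lo, hi) for _ in values]


actual = [float(i) for i in range(1, 21)]
target = [a + 0.3 * ((i % 5) - 2) for i, a in enumerate(actual)]
criteria = [100, 90, 80, 70]
realistic_curve = bias_criteria_curve(target, actual, criteria)
random_curve = bias_criteria_curve(randomized(target), actual, criteria)
```

A target data set that supports the model would be expected to show a lower error and a higher correlation than its randomized counterpart at each bias criteria value; curves that approach each other as more data are removed suggest that the apparent fit is produced by the filtering rather than by the data.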
[0014] A preferred embodiment includes a system comprising: a
server, comprising: a processor, and a storage subsystem; a
database stored by the storage subsystem comprising: a data set;
and a computer program stored by the storage subsystem comprising
instructions that, when executed, cause the processor to: select a
bias criteria; provide a set of model coefficients; select a set of
target values; (1) generate a set of predicted values for the data
set; (2) generate an error set for the dataset; (3) generate a set
of error threshold values based on the error set and the bias
criteria; (4) generate a censored data set based on the error set
and the set of error threshold values; (5) generate a set of new
model coefficients; and (6) using the set of new model
coefficients, repeat steps (1)-(5), unless a censoring performance
termination criteria is satisfied. In a preferred embodiment, the
set of predicted values may be generated based on the data set and
the set of model coefficients. In a preferred embodiment, the error
set may comprise a set of absolute errors and a set of relative
errors, generated based on the set of predicted values and the set
of target values. In another embodiment, the error set may comprise
values calculated as the difference between the set of predicted
values and the set of target values. In another embodiment, the
step of generating the set of new coefficients may further comprise
the step of minimizing the set of errors between the set of
predicted values and the set of actual values, which can be
accomplished using a linear, or a non-linear optimization model. In
a preferred embodiment, the censoring performance termination
criteria may be based on a standard error and a coefficient of
determination.
[0015] Another embodiment of the present invention includes a
system comprising: a server, comprising: a processor, and a storage
subsystem; a database stored by the storage subsystem comprising: a
data set; and a computer program stored by the storage subsystem
comprising instructions that, when executed, cause the processor
to: select an error criteria; select a set of actual values; select
an initial set of coefficients; generate a complete set of model
predicted values from the data set and the initial set of
coefficients; (1) generate a set of errors based on the model
predicted values and the set of actual values for the complete
dataset; (2) generate a set of error threshold values based on the
complete set of errors and the error criteria for the complete data
set; (3) generate an outlier removed data set, wherein the
filtering is based on the complete data set and the set of error
threshold values; (4) generate a set of outlier bias reduced model
predicted values based on the outlier removed data set and the set
of coefficients, wherein the generation of the set of outlier bias
reduced model predicted values is performed by a computer
processor; (5) generate a set of new coefficients based on the
outlier removed data set and the set of previous coefficients,
wherein the generation of the set of new coefficients is performed
by the computer processor; (6) generate a set of model performance
values based on the outlier bias reduced model predicted values and
the set of actual values; repeat steps (1)-(6), while substituting
the set of new coefficients for the set of coefficients from the
previous iteration, unless: a performance termination criteria is
satisfied; and store the set of overall outlier bias reduction
model predicted values in a computer data medium.
[0016] Yet another embodiment includes a system comprising: a
server, comprising: a processor, and a storage subsystem; a
database stored by the storage subsystem comprising: a target
variable for a facility; a set of actual values of the target
variable; a plurality of variables for the facility that are
related to the target variable; a data set for the facility, the
data set comprising values for the plurality of variables; and a
computer program stored by the storage subsystem comprising
instructions that, when executed, cause the processor to: select a
bias criteria; select a set of model coefficients; (1) generate a
set of predicted values based on the data set and the set of model
coefficients; (2) generate a set of censoring model performance
values based on the set of predicted values and the set of actual
values; (3) generate an error set based on the set of predicted
values and the set of actual values for the target variable; (4)
generate a set of error threshold values based on the error set and
the bias criteria; (5) generate a censored data set based on the
data set and the set of error thresholds; (6) generate a set of new
model coefficients based on the censored data set and the set of
model coefficients; (7) generate a set of new predicted values
based on the data set and the set of new model coefficients; (8)
generate a set of new censoring model performance values based on
the set of new predicted values and the set of actual values; using
the set of new coefficients, repeat steps (1)-(8) unless a
censoring performance termination criteria is satisfied; and
store the set of new model predicted values in the storage
subsystem.
[0017] Another embodiment includes a system comprising: a server,
comprising: a processor, and a storage subsystem; a database stored
by the storage subsystem comprising: a data set for a facility; and
a computer program stored by the storage subsystem comprising
instructions that, when executed, cause the processor to: determine
a target variable; identify a plurality of variables, wherein the
plurality of variables comprises: a plurality of direct variables
for the facility that influence the target variable; and a set of
transformed variables for the facility, each transformed variable
being a function of at least one direct variable that influences
the target variable; select an error criteria comprising: an
absolute error, and a relative error; select a set of actual values
of the target variable; select an initial set of coefficients;
generate a set of model predicted values based on the data set and
the initial set of coefficients; determine a set of errors based on
the set of model predicted values and the set of actual values,
wherein the relative error is calculated using the formula:
$\text{Relative Error}_m = ((\text{Predicted Value}_m - \text{Actual Value}_m)/\text{Actual Value}_m)^2$, wherein `m` is a reference
number, and wherein the absolute error is calculated using the
formula: $\text{Absolute Error}_m = (\text{Predicted Value}_m - \text{Actual Value}_m)^2$; determine a set of performance values based on
the set of model predicted values and the set of actual values;
wherein the set of performance values comprises: a first standard
error, and a first coefficient of determination; (1) generate a set
of errors based on the model predicted values and the set of actual
values; (2) generate a set of error threshold values based on the
complete set of errors and the error criteria for the complete data
set; (3) generate an outlier removed data set by filtering data
with error values outside the set of error threshold values,
wherein the filtering is based on the data set and the set of error
threshold values; (4) generate a set of new model predicted values
based on the outlier removed data set and the set of coefficients
by minimizing an error between the set of model predicted values
and the set of actual values using at least one of: a linear
optimization model, and a nonlinear optimization model, wherein the
generation of the outlier bias reduced model predicted values is
performed by a computer processor; (5) generate a set of new
coefficients based on the outlier removed data set and the set of
previous coefficients, wherein the generation of the set of new
coefficients is performed by the computer processor; (6) generate a
set of performance values based on the set of new model predicted
values and the set of actual values; wherein the set of model
performance values comprises: a second standard error, and a second
coefficient of determination; repeat steps (1)-(6), while
substituting the set of new coefficients for the set of
coefficients from the previous iteration, unless: a performance
termination criteria is satisfied, wherein the performance
termination criteria comprises: a standard error, and a coefficient
of determination, and wherein satisfying the performance
termination criteria comprises: the standard error termination
value is greater than the difference between the first and second
standard error, and the coefficient of determination termination
value is greater than the difference between the first and second
coefficient of determination; and store the set of new model
predicted values in a computer data medium.
[0018] Another embodiment of the present invention includes a
system comprising: a server, comprising: a processor, and a storage
subsystem; a database stored by the storage subsystem comprising: a
data set, a computer program stored by the storage subsystem
comprising instructions that, when executed, cause the processor
to: select an error criteria; select a data set; select a set of
actual values; select an initial set of model predicted values;
determine a set of errors based on the set of model predicted
values and the set of actual values; (1) determine a set of error
threshold values based on the complete set of errors and the error
criteria; (2) generate an outlier removed data set, wherein the
filtering is based on the data set and the set of error threshold
values; (3) generate a set of outlier bias reduced model predicted
values based on the outlier removed data set and the complete set
of model predicted values, wherein the generation of the set of
outlier bias reduced model predicted values is performed by a
computer processor; (4) determine a set of errors based on the set
of outlier bias reduction model predicted values and the
corresponding set of actual values; repeat steps (1)-(4), while
substituting the set of outlier bias reduction model predicted
values for the set of model predicted values unless: a performance
termination criteria is satisfied; and store the set of outlier
bias reduction factors in a computer data medium.
[0019] Another embodiment of the present invention includes a
system comprising: a server, comprising: a processor, and a storage
subsystem; a database stored by the storage subsystem comprising: a
data set, a computer program stored by the storage subsystem
comprising instructions that, when executed, cause the processor
to: determine a target variable; identify a plurality of variables
for the facility, wherein the plurality of variables comprises: a
plurality of direct variables for the facility that influence the
target variable; and a set of transformed variables for the
facility, each transformed variable is a function of at least one
primary facility variable that influences the target variable;
select an error criteria comprising: an absolute error, and a
relative error; obtain a data set, wherein the data set comprises
values for the plurality of variables, and select a set of actual
values of the target variable; select an initial set of
coefficients; generate a set of model predicted values by applying
the set of model coefficients to the data set; determine a set of
performance values based on the set of model predicted values and
the set of actual values, wherein the set of performance values
comprises: a first standard error, and a first coefficient of
determination; (1) determine a set of errors based on the set of
model predicted values and the set of actual values, wherein the
relative error is calculated using the formula: $\text{Relative Error}_k = ((\text{Predicted Value}_k - \text{Actual Value}_k)/\text{Actual Value}_k)^2$, wherein `k` is a reference number, and wherein
the absolute error is calculated using the formula: $\text{Absolute Error}_k = (\text{Predicted Value}_k - \text{Actual Value}_k)^2$; (2)
determine a set of error threshold values based on the set of
errors and the error criteria for the complete data set; (3)
generate an outlier removed data set by removing data with error
values greater than or equal to the error threshold values, wherein
the filtering is based on the data set and the set of error
threshold values; (4) generate a set of new coefficients based on
the outlier removed dataset and the set of previous coefficients;
(5) generate a set of outlier bias reduced model values based on
the outlier removed data set and the set of coefficients and
minimizing an error between the set of predicted values and the set
of actual values using at least one of: a linear optimization
model, and a nonlinear optimization model; (6) determine a set of
updated performance values based on the set of outlier bias reduced
model predicted values and the set of actual values, wherein the
set of updated performance values comprises: a second standard
error, and a second coefficient of determination; repeat steps
(1)-(6), while substituting the set of new coefficients for the set
of coefficients from the previous iteration, unless: a performance
termination criteria is satisfied, wherein the performance
termination criteria comprises: a standard error termination value,
and a coefficient of determination termination value, and wherein
satisfying the performance termination criteria comprises the
standard error termination value is greater than the difference
between the first and second standard error, and the coefficient of
determination termination value is greater than the difference
between the first and second coefficient of determination; and
store the set of outlier bias reduction factors in a computer
data medium.
[0020] Yet another embodiment includes a system for assessing the
viability of a data set as used in developing a model comprising: a
server, comprising: a processor, and a storage subsystem; a
database stored by the storage subsystem comprising: a target data
set comprising a plurality of model predicted values; a computer
program stored by the storage subsystem comprising instructions
that, when executed, cause the processor to: generate a random
target data set; select a set of bias criteria values; generate
outlier bias reduced data sets based on the target data set and
each of the selected bias criteria values; generate an outlier bias
reduced random target data set based on the random target data set
and each of the selected bias criteria values; calculate a set of
error values for the outlier bias reduced target data set and the
outlier bias reduced random target data set; calculate a set of
correlation coefficients for the outlier bias reduced target data
set and the outlier bias reduced random target data set; generate
bias criteria curves for the target data set and the random target
data set based on the corresponding error value and correlation
coefficient for each selected bias criteria; and compare the bias
criteria curve for the target data set to the bias criteria curve
for the random target data set. The processor generates the outlier
bias reduced target data set and the outlier bias reduced random
target data set using the Dynamic Outlier Bias Removal methodology.
The random target data set can comprise randomized data values
developed from values within the range of the plurality of data
values. Also, the set of error values can comprise a set of
standard errors, and the set of correlation coefficients comprises
a set of coefficient of determination values. In another
embodiment, the program further comprises instructions that, when
executed, cause the processor to generate automated advice based on
comparing the bias criteria curve for the target data set to the
bias criteria curve for the random target data set. Advice can be
generated based on parameters selected by analysts, such as a
correlation coefficient threshold and/or an error threshold. In yet
another embodiment, the system's database further comprises an
actual data set comprising a plurality of actual data values
corresponding to the model predicted values, and the program
further comprises instructions that, when executed, cause the
processor to: generate a random actual data set based on the actual
data set; generate an outlier bias reduced actual data set based on
the actual data set and each of the selected bias criteria values;
generate an outlier bias reduced random actual data set based on
the random actual data set and each of the selected bias criteria
values; generate, for each selected bias criteria, a random data
plot based on the outlier bias reduced random target data set and
the outlier bias reduced random actual data; generate, for each
selected bias criteria, a realistic data plot based on the outlier
bias reduced target data set and the outlier bias reduced actual
target data set; and compare the random data plot with the
realistic data plot corresponding to each of the selected bias
criteria.
[0021] Other embodiments include a system for reducing outlier bias
in target variables measured for a facility comprising a computing
unit for processing a data set, the computing unit comprising a
processor and a storage subsystem, an input unit for inputting the
data set to be processed, the input unit comprising a measuring
device for measuring a given target variable and for providing a
corresponding data set, an output unit for outputting a processed
data set, a computer program stored by the storage subsystem
comprising instructions that, when executed, cause the processor to
execute the following steps: selecting the target variable for a
facility; identifying a plurality of variables for the facility
that are related to the target variable; obtaining a data set for
the facility, the data set comprising values for the plurality of
variables; selecting a bias criteria; selecting a set of model
coefficients; (1) generate a set of predicted values for the data
set; (2) generate an error set for the data set; (3) generate a set
of error threshold values based on the error set and the bias
criteria; (4) generate a censored data set based on the error set
and the set of error threshold values; (5) generate a set of new
model coefficients; and (6) using the set of new model
coefficients, repeat steps (1)-(5), unless a censoring performance
termination criteria is satisfied.
[0022] Still other embodiments include a system for reducing
outlier bias in target variables measured for a financial
instrument, such as an equity security (e.g., common stock) or a
derivative contract (e.g., forwards, futures, options, and swaps),
comprising a computing unit for processing a data set, the
computing unit comprising a processor and a storage subsystem, an
input unit for receiving the data set to be processed, the input
unit comprising a storage device for storing data on a target
variable (e.g., stock price) and for providing a corresponding data
set, an output unit for outputting a processed data set, a computer
program stored by the storage subsystem comprising instructions
that, when executed, cause the processor to execute the following
steps: selecting the target variable for the financial instrument;
identifying a plurality of variables for the instrument that are
related to the target variable (e.g., dividends, earnings, and cash
flow); obtaining a data set for the financial instrument, the
data set comprising values for the plurality of variables;
selecting a bias criteria; selecting a set of model coefficients;
(1) generate a set of predicted values for the data set; (2)
generate an error set for the data set; (3) generate a set of error
threshold values based on the error set and the bias criteria; (4)
generate a censored data set based on the error set and the set of
error threshold values; (5) generate a set of new model
coefficients; and (6) using the set of new model coefficients,
repeat steps (1)-(5), unless a censoring performance termination
criteria is satisfied.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] FIG. 1 is a flowchart illustrating an embodiment of the data
outlier identification and removal method.
[0024] FIG. 2 is a flowchart illustrating an embodiment of the data
outlier identification and removal method for data quality
operations.
[0025] FIG. 3 is a flowchart illustrating an embodiment of the data
outlier identification and removal method for data validation.
[0026] FIG. 4 is an illustrative node for implementing a method of
the invention.
[0027] FIG. 5 is an illustrative graph for quantitative assessment
of a data set.
[0028] FIGS. 6A and 6B are illustrative graphs for qualitative
assessment of the data set of FIG. 5, illustrating the randomized
and realistic data set, respectively, for the entire data set.
[0029] FIGS. 7A and 7B are illustrative graphs for qualitative
assessment of the data set of FIG. 5, illustrating the randomized
and realistic data set, respectively, after removal of 30% of the
data as outliers.
[0030] FIGS. 8A and 8B are illustrative graphs for qualitative
assessment of the data set of FIG. 5, illustrating the randomized
and realistic data set, respectively, after removal of 50% of the
data as outliers.
[0031] FIG. 9 illustrates an exemplary system used to reduce
outlier bias in target variables measured for a facility.
DETAILED DESCRIPTION OF THE INVENTION
[0032] The following disclosure provides many different
embodiments, or examples, for implementing different features of a
system and method for accessing and managing structured content.
Specific examples of components, processes, and implementations are
described to help clarify the invention. These are merely examples
and are not intended to limit the invention from that described in
the claims. Well-known elements are presented without detailed
description so as not to obscure the preferred embodiments of the
present invention with unnecessary detail. For the most part,
details unnecessary to obtain a complete understanding of the
preferred embodiments of the present invention have been omitted
inasmuch as such details are within the skills of persons of
ordinary skill in the relevant art.
[0033] A mathematical description of one embodiment of Dynamic
Outlier Bias Reduction is shown as follows:
Nomenclature
[0034] $\hat{X}$ -- Set of all data records: $\hat{X} = \hat{X}_k + \hat{X}_{Ck}$, where:
[0035] $\hat{X}_k$ -- Set of accepted data records for the $k$th iteration
[0036] $\hat{X}_{Ck}$ -- Set of outlier (removed) data records for the $k$th iteration
[0037] $\hat{Q}_k$ -- Set of computed model predicted values for $\hat{X}_k$
[0038] $\hat{Q}_{Ck}$ -- Set of outlier model predicted values for data records, $\hat{X}_{Ck}$
[0039] $A$ -- Set of actual values (target values) on which the model is based
[0040] $\hat{\beta}_{k \to k+1}$ -- Set of model coefficients at the $(k+1)$st iteration computed as a result of the model computations using $\hat{X}_k$
[0041] $M(\hat{X}_k : \hat{\beta}_{k \to k+1})$ -- Model computation producing $\hat{Q}_{k+1}$ from $\hat{X}_k$ storing model derived and user-supplied coefficients: $\hat{\beta}_{k \to k+1}$
[0042] $C$ -- User supplied error criteria (%)
[0043] $\Psi(\hat{Q}_k, A)$ -- Error threshold function
[0044] $F(\Psi, C)$ -- Error threshold value ($E$)
$\hat{\Omega}_k$ -- Iteration termination criteria, e.g., iteration count, $r^2$, standard error, etc.
Initial Computation, k=0
Initial Step 1: Using initial model coefficient estimates, $\hat{\beta}_{0 \to 1}$, compute
initial model predicted values by applying the model to the
complete data set:
$$\hat{Q}_1 = M(\hat{X} : \hat{\beta}_{0 \to 1})$$
Initial Step 2: Compute initial model performance results:
$$\hat{\Omega}_1 = f(\hat{Q}_1, A, k=0, r^2, \text{standard error, etc.})$$
Initial Step 3: Compute model error threshold value(s):
$$E_1 = F(\Psi(\hat{Q}_1, A), C)$$
Initial Step 4: Filter the data records to remove outliers:
$$\hat{X}_1 = \{\forall x \in \hat{X} \mid \Psi(\hat{Q}_1, A) < E_1\}$$
[0045] Iterative Computations, k>0
Iteration Step 1: Compute predicted values by applying the model to
the accepted data set:
{circumflex over (Q)}.sub.k+1=M({circumflex over (X)}.sub.k:
{circumflex over (.beta.)}.sub.k.fwdarw.k+1)
Iteration Step 2: Compute model performance results:
{circumflex over (.OMEGA.)}.sub.k+1=f({circumflex over
(Q)}.sub.k+1,A,k,r.sup.2, standard error, etc.)
If termination criteria are achieved, stop; otherwise proceed to
Iteration Step 3.
Iteration Step 3: Compute results for removed data,
{circumflex over (X)}.sub.Ck={.A-inverted.x.di-elect
cons.{circumflex over (X)}|x.notin.{circumflex over (X)}.sub.k}, using
the current model:
{circumflex over (Q)}.sub.Ck+1=M({circumflex over
(X)}.sub.Ck:{circumflex over (.beta.)}.sub.k.fwdarw.k+1)
Iteration Step 4: Compute model error threshold values:
E.sub.k+1=F(.PSI.({circumflex over (Q)}.sub.k+1+{circumflex over
(Q)}.sub.Ck+1,A),C)
Iteration Step 5: Filter the data records to remove outliers:
{circumflex over (X)}.sub.k+1={.A-inverted.x.di-elect
cons.{circumflex over (X)}|.PSI.({circumflex over
(Q)}.sub.k+1+{circumflex over (Q)}.sub.Ck+1,A)<E.sub.k+1}
[0046] Another mathematical description of one embodiment of
Dynamic Outlier Bias Reduction is shown as follows:
Nomenclature
[0047] {circumflex over (X)}--Set of all data records: {circumflex
over (X)}={circumflex over (X)}.sub.k+{circumflex over (X)}.sub.Ck,
where: [0048] {circumflex over (X)}.sub.k--Set of accepted data
records for the k.sup.th iteration [0049] {circumflex over
(X)}.sub.Ck--Set of outlier (removed) data records for the k.sup.th
iteration [0050] {circumflex over (Q)}.sub.k--Set of computed model
predicted values for {circumflex over (X)}.sub.k [0051] {circumflex
over (Q)}.sub.Ck--Set of outlier model predicted values for
{circumflex over (X)}.sub.Ck [0052] A--Set of actual values (target
values) on which the model is based [0053] {circumflex over
(.beta.)}.sub.k.fwdarw.k+1--Set of model coefficients at the
k+1.sup.st iteration computed as a result of the model computations
using {circumflex over (X)}.sub.k [0054] M({circumflex over
(X)}.sub.k: {circumflex over (.beta.)}.sub.k.fwdarw.k+1)--Model
computation producing {circumflex over (Q)}.sub.k+1 from
{circumflex over (X)}.sub.k storing model derived and user-supplied
coefficients: {circumflex over (.beta.)}.sub.k.fwdarw.k+1 [0055]
C.sub.RE--User supplied relative error criterion(%) [0056]
C.sub.AE--User supplied absolute error criterion(%) [0057]
RE({circumflex over (Q)}.sub.k+{circumflex over (Q)}.sub.Ck,
A)--Relative error values for all data records [0058]
AE({circumflex over (Q)}.sub.k+{circumflex over (Q)}.sub.Ck,
A)--Absolute error values for all data records [0059]
P.sub.RE.sub.k--Relative error threshold value for the k.sup.th
iteration where
[0059] P.sub.RE.sub.k=Percentile(RE({circumflex over
(Q)}.sub.k+{circumflex over (Q)}.sub.Ck,A),C.sub.RE) [0060]
P.sub.AE.sub.k--Absolute error threshold value for the k.sup.th
iteration where
[0060] P.sub.AE.sub.k=Percentile(AE({circumflex over
(Q)}.sub.k+{circumflex over (Q)}.sub.Ck,A),C.sub.AE) [0061]
{circumflex over (.OMEGA.)}.sub.k--Iteration termination criteria,
e.g., iteration count, r.sup.2, standard error, etc. Initial
Computation, k=0
[0062] Initial Step 1: Using initial model coefficient estimates,
{circumflex over (.beta.)}.sub.0.fwdarw.1, compute initial model
predicted value results by applying the model to the complete data
set:
{circumflex over (Q)}.sub.1=M({circumflex over (X)}:{circumflex
over (.beta.)}.sub.0.fwdarw.1)
[0063] Initial Step 2: Compute initial model performance
results:
{circumflex over (.OMEGA.)}.sub.1=f({circumflex over
(Q)}.sub.1,A,k=0,r.sup.2, standard error, etc.)
Initial Step 3: Compute model error threshold values:
P.sub.RE.sub.1=Percentile(RE({circumflex over
(Q)}.sub.1,A),C.sub.RE)
P.sub.AE.sub.1=Percentile(AE({circumflex over
(Q)}.sub.1,A),C.sub.AE)
[0064] Initial Step 4: Filter the data records to remove
outliers:
\[ \hat{X}_1 = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{Bmatrix} RE(\hat{Q}_1, \hat{A}) \\ AE(\hat{Q}_1, \hat{A}) \end{Bmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_1 \right\} \]
[0065] Iterative Computations, k>0
[0066] Iteration Step 1: Compute model predicted values by applying
the model to the outlier removed data set:
{circumflex over (Q)}.sub.k+1=M({circumflex over
(X)}.sub.k:{circumflex over (.beta.)}.sub.k.fwdarw.k+1)
[0067] Iteration Step 2: Compute model performance results:
{circumflex over (.OMEGA.)}.sub.k+1=f({circumflex over
(Q)}.sub.k+1,A,k,r.sup.2, standard error, etc.)
[0068] If termination criteria are achieved, stop, otherwise
proceed to Step 3:
[0069] Iteration Step 3: Compute results for the removed data,
{circumflex over (X)}.sub.Ck={.A-inverted.x.di-elect
cons.{circumflex over (X)}|x.notin.{circumflex over (X)}.sub.k}, using
the current model:
{circumflex over (Q)}.sub.Ck+1=M({circumflex over
(X)}.sub.Ck:{circumflex over (.beta.)}.sub.k.fwdarw.k+1)
[0070] Iteration Step 4: Compute model error threshold values:
P.sub.RE.sub.k+1=Percentile(RE({circumflex over
(Q)}.sub.k+1+{circumflex over (Q)}.sub.Ck+1,A),C.sub.RE)
P.sub.AE.sub.k+1=Percentile(AE({circumflex over
(Q)}.sub.k+1+{circumflex over (Q)}.sub.Ck+1,A),C.sub.AE)
[0071] Iteration Step 5: Filter the data records to remove
outliers:
\[ \hat{X}_{k+1} = \left\{ \forall x \in \hat{X} \;\middle|\; \begin{Bmatrix} RE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \\ AE(\hat{Q}_{k+1} + \hat{Q}_{Ck+1}, \hat{A}) \end{Bmatrix} < \begin{pmatrix} P_{RE} \\ P_{AE} \end{pmatrix}_{k+1} \right\} \]
[0072] Increment k and proceed to Iteration Step 1.
[0073] After each iteration where new model coefficients are
computed from the current censored dataset, the removed data from
the previous iteration plus the current censored data are
recombined. This combination encompasses all data values in the
complete dataset. The current model coefficients are then applied
to the complete dataset to compute a complete set of predicted
values. The absolute and relative errors are computed for the
complete set of predicted values and new bias criteria percentile
threshold values are computed. A new censored dataset is created by
removing all data values whose absolute or relative errors are
greater than the threshold values, and the nonlinear optimization
model is then applied to the newly censored dataset to compute new
model coefficients. This process enables all data values to be
reviewed every iteration for their possible inclusion in the model
dataset. It is possible that some data values that were excluded in
previous iterations will be included in subsequent iterations as
the model coefficients converge on values that best fit the
data.
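The re-evaluation described in Iteration Steps 3 through 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the helper names `percentile` and `refilter` are hypothetical, the percentile is assumed to use linear interpolation, the data values are invented, and the two coefficient sets stand in for the results of two successive optimizations that are not performed here.

```python
def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def refilter(xs, actual, predict, c_re, c_ae):
    """Score the COMPLETE data set with the current model; return accepted indices."""
    q = [predict(x) for x in xs]                                # accepted and removed records
    re = [((qi - ai) / ai) ** 2 for qi, ai in zip(q, actual)]   # relative errors
    ae = [(qi - ai) ** 2 for qi, ai in zip(q, actual)]          # absolute errors
    p_re, p_ae = percentile(re, c_re), percentile(ae, c_ae)     # Iteration Step 4
    # Iteration Step 5: keep records whose errors are both below the thresholds.
    return {i for i in range(len(xs)) if re[i] < p_re and ae[i] < p_ae}


xs = [1, 2, 3, 4, 5]
actual = [2.0, 4.0, 6.0, 8.0, 15.0]   # invented data; last record is off the y = 2x trend

# Hypothetical coefficients from two successive iterations (not fitted here).
accepted_k = refilter(xs, actual, lambda x: 3 * x - 1.5, 80, 80)
accepted_k1 = refilter(xs, actual, lambda x: 2 * x, 80, 80)

print(sorted(accepted_k))    # record 3 is removed under the first model
print(sorted(accepted_k1))   # record 3 is re-admitted and record 4 is removed
```

Because every record is rescored each time, a record excluded under one set of coefficients can return once the coefficients change, which is the behavior the paragraph above describes.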
[0074] In one embodiment, variations in GHG emissions can result in
overestimation or underestimation of emission results leading to
bias in model predicted values. These non-industrial influences,
such as environmental conditions and errors in calculation
procedures, can cause the results for a particular facility to be
radically different from similar facilities, unless the bias in the
model predicted values is removed. The bias in the model predicted
values may also exist due to unique operating conditions.
[0075] The bias can be removed manually by simply removing a
facility's data from the calculation if analysts are confident that
a facility's calculations are in error or possess unique,
extenuating characteristics. Yet, when measuring facility
performance across many different companies, regions, and countries,
precise a priori knowledge of the data details is not realistic.
Therefore any analyst-based data removal procedure has the
potential for adding undocumented, non-data supported biases to the
model results.
[0076] In one embodiment, Dynamic Outlier Bias Reduction is applied
to a procedure that uses the data and a prescribed overall error
criterion to determine statistical outliers that are removed from
the model coefficient calculations. This is a data-driven process
that identifies outliers using a global error criterion derived
from the data using, for example, the percentile function. The use of
Dynamic Outlier Bias Reduction is not limited to the reduction of
bias in model predicted values, and its use in this embodiment is
illustrative and exemplary only. Dynamic Outlier Bias Reduction may
also be used, for example, to remove outliers from any statistical
data set, including use in calculation of, but not limited to,
arithmetic averages, linear regressions, and trend lines. The
outlier facilities are still ranked from the calculation results,
but the outliers are not used in the filtered data set applied to
compute model coefficients or statistical results.
[0077] A standard procedure, commonly used to remove outliers, is
to compute the standard deviation (.sigma.) of the data set and
simply define all data outside a 2.sigma. interval of the mean, for
example, as outliers. This procedure has statistical assumptions
that, in general, cannot be tested in practice. The Dynamic Outlier
Bias Reduction method applied in an embodiment of this invention,
outlined in FIG. 1, uses both a relative error and an absolute
error. For example, for a facility `m`:
Relative Error.sub.m=((Predicted Value.sub.m-Actual
Value.sub.m)/Actual Value.sub.m).sup.2 (1)
Absolute Error.sub.m=(Predicted Value.sub.m-Actual
Value.sub.m).sup.2 (2)
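Eqns. (1) and (2) can be written directly as two small Python functions. This is an illustrative sketch: the function names and the sample values are hypothetical, and the relative error assumes a nonzero actual value.

```python
def relative_error(predicted, actual):
    """Eqn. (1): squared relative deviation (assumes actual != 0)."""
    return ((predicted - actual) / actual) ** 2


def absolute_error(predicted, actual):
    """Eqn. (2): squared deviation."""
    return (predicted - actual) ** 2


# Invented values for three facilities.
predicted = [110.0, 95.0, 40.0]
actual = [100.0, 100.0, 50.0]

rel_err = [relative_error(p, a) for p, a in zip(predicted, actual)]
abs_err = [absolute_error(p, a) for p, a in zip(predicted, actual)]

# The first and third facilities have the same absolute error, but the
# third has four times the relative error because its actual value is smaller.
print(rel_err)
print(abs_err)
```

The example suggests why the two measures are used together: each ranks the same deviations differently depending on the magnitude of the actual value.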
[0078] In Step 110, the analyst specifies the error threshold
criteria that will define outliers to be removed from the
calculations. For example, using the percentile operation as the
error function, a percentile value of 80 percent for relative and
absolute errors could be set. This means that data values whose
relative error is less than the 80th percentile relative error and
whose absolute error is less than the 80th percentile absolute
error will be included, and the remaining values are removed or
considered outliers. In this example, for a data value to avoid
being removed, its relative and absolute errors must both be less
than the corresponding 80th percentile values. However, the percentile thresholds
for relative and absolute error may be varied independently, and,
in another embodiment, only one of the percentile thresholds may be
used.
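A short Python sketch of the percentile filter described for Step 110 follows. It assumes a linear-interpolation percentile; the function names and the error values are hypothetical and serve only to show the "both errors below their thresholds" rule.

```python
def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def accepted_indices(rel_err, abs_err, c_re=80.0, c_ae=80.0):
    """Indices of records whose relative AND absolute errors are below the thresholds."""
    p_re = percentile(rel_err, c_re)   # relative error threshold value
    p_ae = percentile(abs_err, c_ae)   # absolute error threshold value
    return [i for i, (r, a) in enumerate(zip(rel_err, abs_err))
            if r < p_re and a < p_ae]


# Invented error values for six records.
rel_err = [0.01, 0.02, 0.03, 0.04, 0.50, 0.05]
abs_err = [1.0, 2.0, 30.0, 4.0, 5.0, 6.0]

kept = accepted_indices(rel_err, abs_err)
print(kept)  # [0, 1, 3]
```

Record 2 fails only the absolute error test and record 4 fails only the relative error test, yet both are removed. Passing different values for `c_re` and `c_ae` varies the two thresholds independently.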
[0079] In Step 120, the model standard error and coefficient of
determination (r.sup.2) percent change criteria are specified.
While the values of these statistics will vary from model to model,
the percent change from the preceding iteration can be
preset, for example, at 5 percent. These values can be used to
terminate the iteration procedure. Another termination criterion
could be a simple iteration count.
[0080] In Step 130, the optimization calculation is performed,
which produces the model coefficients and predicted values for each
facility.
[0081] In Step 140, the relative and absolute errors for all
facilities are computed using Eqns. (1) and (2).
[0082] In Step 150, the error function with the threshold criteria
specified in Step 110 is applied to the data computed in Step 140
to determine outlier threshold values.
[0083] In Step 160, the data is filtered to include only facilities
where the relative error, absolute error, or both errors, depending
on the chosen configuration, are less than the error threshold
values computed in Step 150.
[0084] In Step 170, the optimization calculation is performed using
only the outlier removed data set.
[0085] In Step 180, the percent change of the standard error and
r.sup.2 are compared with the criteria specified in Step 120. If
the percent change is greater than the criteria, the process is
repeated by returning to Step 140. Otherwise, the iteration
procedure is terminated in Step 190 and the resultant model
computed by this Dynamic Outlier Bias Reduction procedure is
complete. The model results are applied to all facilities,
regardless of whether their data were removed or admitted in the
current or any past iteration.
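Steps 110 through 190 can be sketched end to end in Python for a simple straight-line model. This is an illustration under stated assumptions rather than the claimed implementation: ordinary least squares stands in for the optimization calculation, the percentile uses linear interpolation, nonzero actual values are assumed, and all names (`fit_line`, `fit_stats`, `pct_change`, `dobr_line`) and data values are hypothetical.

```python
import math


def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b (stand-in for the optimization)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx


def fit_stats(xs, ys, m, b):
    """Standard error of the estimate and r^2 on the given records."""
    my = sum(ys) / len(ys)
    sse = sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return math.sqrt(sse / (len(xs) - 2)), 1.0 - sse / sst


def pct_change(new, old):
    """Percent change relative to the previous value (inf if the previous value is 0)."""
    if old == 0:
        return 0.0 if new == 0 else math.inf
    return abs(new - old) / abs(old) * 100.0


def dobr_line(xs, ys, c_re=80.0, c_ae=80.0, tol_pct=5.0, max_iter=20):
    idx = range(len(xs))
    m, b = fit_line(xs, ys)                                   # Step 130
    se, r2 = fit_stats(xs, ys, m, b)
    keep = list(idx)
    for _ in range(max_iter):
        # Step 140: errors for ALL records under the current model.
        ae = [(m * x + b - y) ** 2 for x, y in zip(xs, ys)]
        re = [a / y ** 2 for a, y in zip(ae, ys)]
        p_re, p_ae = percentile(re, c_re), percentile(ae, c_ae)   # Step 150
        keep = [i for i in idx if re[i] < p_re and ae[i] < p_ae]  # Step 160
        kx, ky = [xs[i] for i in keep], [ys[i] for i in keep]
        m, b = fit_line(kx, ky)                               # Step 170
        new_se, new_r2 = fit_stats(kx, ky, m, b)
        done = (pct_change(new_se, se) <= tol_pct
                and pct_change(new_r2, r2) <= tol_pct)        # Step 180
        se, r2 = new_se, new_r2
        if done:
            break                                             # Step 190
    return m, b, keep


xs = list(range(1, 11))
noise = [0.05, -0.12, 0.08, -0.03, 0.11, -0.07, 0.02, -0.09, 0.06]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)] + [60.0]  # last record is an outlier

m0, b0 = fit_line(xs, ys)        # slope pulled upward by the outlier
m, b, keep = dobr_line(xs, ys)   # slope close to the underlying value of 2
print(m0, m, keep)
```

The `max_iter` argument plays the role of the simple iteration count mentioned for Step 120, so the loop stops even if the percent-change test is never satisfied.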
[0086] In another embodiment, the process begins with the selection
of certain iterative parameters, specifically:
[0087] (1) an absolute error and relative error percentile value
wherein one, the other or both may be used in the iterative
process,
[0088] (2) a coefficient of determination (also known as r.sup.2)
improvement value, and
[0089] (3) a standard error improvement value.
[0090] The process begins with an original data set, a set of
actual data, and either at least one coefficient or a factor used
to calculate predicted values based on the original data set. A
coefficient or set of coefficients will be applied to the original
data set to create a set of predicted values. The set of
coefficients may include, but is not limited to, scalars,
exponents, parameters, and periodic functions. The set of predicted
data is then compared to the set of actual data. A standard error
and a coefficient of determination are calculated based on the
differences between the predicted and actual data. The absolute and
relative error associated with each one of the data points is used
to remove data outliers based on the user-selected absolute and
relative error percentile values. Ranking the data is not
necessary, as all data falling outside the range associated with
the percentile values for absolute and/or relative error are
removed from the original data set. The use of absolute and
relative errors to filter data is illustrative and for exemplary
purposes only, as the method may be performed with only absolute or
relative error or with another function.
[0091] The data associated with the absolute and relative error
within a user-selected percentile range is the outlier removed data
set, and each iteration of the process will have its own filtered
data set. This first outlier removed data set is used to determine
predicted values that will be compared with actual values. At least
one coefficient is determined by optimizing the errors, and then
the coefficient is used to generate predicted values based on the
first outlier removed data set. The outlier bias reduced
coefficients serve as the mechanism by which knowledge is passed
from one iteration to the next.
[0092] After the first outlier removed data set is created, the
standard error and coefficient of determination are calculated and
compared with the standard error and coefficient of determination
of the original data set. If the difference in standard error and
the difference in coefficient of determination are both below their
respective improvement values, then the process stops. However, if
at least one of the improvement criteria is not met, then the
process continues with another iteration. The use of standard error
and coefficient of determination as checks for the iterative
process is illustrative and exemplary only, as the check can be
performed using only the standard error or only the coefficient of
determination, a different statistical check, or some other
performance termination criteria (such as number of
iterations).
[0093] Assuming that the first iteration fails to meet the
improvement criteria, the second iteration begins by applying the
first outlier bias reduced data coefficients to the original data
to determine a new set of predicted values. The original data is
then processed again, establishing absolute and relative error for
the data points as well as the standard error and coefficient of
determination values for the original data set while using the
first outlier removed data set coefficients. The data is then
filtered to form a second outlier removed data set and to determine
coefficients based on the second outlier removed data set.
[0094] The second outlier removed data set, however, is not
necessarily a subset of the first outlier removed data set and it
is associated with a second set of outlier bias reduced model
coefficients, a second standard error, and a second coefficient of
determination. Once those values are determined, the second
standard error will be compared with the first standard error and
the second coefficient of determination will be compared against
the first coefficient of determination.
[0095] If the improvement value (for standard error and coefficient
of determination) exceeds the difference in these parameters, then
the process will end. If not, then another iteration will begin by
processing the original data yet again; this time using the second
outlier bias reduced coefficients to process the original data set
and generate a new set of predicted values. Filtering based on the
user-selected percentile value for absolute and relative error will
create a third outlier removed data set that will be optimized to
determine a set of third outlier bias reduced coefficients. The
process will continue until the error improvement or other
termination criteria are met (such as a convergence criterion or a
specified number of iterations).
[0096] The output of this process will be a set of coefficients or
model parameters, wherein a coefficient or model parameter is a
mathematical value (or set of values), such as, but not limited to,
a model predicted value for comparing data, slope and intercept
values of a linear equation, exponents, or the coefficients of a
polynomial. The output of Dynamic Outlier Bias Reduction will not
be an output value in its own right, but rather the coefficients
that will modify data to determine an output value.
[0097] In another embodiment, illustrated in FIG. 2, Dynamic
Outlier Bias Reduction is applied as a data quality technique to
evaluate the consistency and accuracy of data to verify that the
data is appropriate for a specific use. For data quality
operations, the method may not involve an iterative procedure.
Other data quality techniques may be used alongside Dynamic Outlier
Bias Reduction during this process. The method is applied to the
arithmetic average calculation of a given data set. The data
quality criterion for this example is that successive data
values are contained within some range. Thus, any values that are
spaced too far apart in value would constitute poor quality data.
Error terms are then constructed of successive values of a function
and Dynamic Outlier Bias Reduction is applied to these error
values.
[0098] In Step 210 the initial data is listed in any order.
[0099] Step 220 constitutes the function or operation that is
performed on the dataset. In this example embodiment, the
operation is the ascending ranking of the data followed by
successive arithmetic average calculations, where each line
corresponds to the average of all data at and above that line.
[0100] Step 230 computes the relative and absolute errors from the
data using successive values from the results of Step 220.
[0101] Step 240 allows the analyst to enter the desired outlier
removal error criteria (%). The Quality Criteria Value is the
resultant value from the error calculations in Step 230 based on
the data in Step 220.
[0102] Step 250 shows the data quality outlier filtered dataset.
Specific values are removed if the relative and absolute errors
exceed the specified error criteria given in Step 240.
[0103] Step 260 shows the arithmetic average calculation comparison
between the complete and outlier removed datasets. As in all
applied mathematical or statistical calculations, the analyst
performs the final step: judging whether the removed data elements
are actually of poor quality. The Dynamic Outlier Bias
Reduction system and method eliminates the need for the analyst to
remove data directly, but best practice guidelines suggest that the
analyst review and check the results for practical relevance.
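One possible reading of Steps 210 through 260 is sketched below in Python. Several details are assumptions, since the text does not fix them: the errors are taken between successive running averages of the ranked data, the first record is given zero error, the percentile uses linear interpolation, and a record is removed only when both errors exceed their thresholds. The function names and data values are hypothetical.

```python
def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def quality_filter(data, c_re=90.0, c_ae=90.0):
    """Return the ranked data with records removed when both errors exceed thresholds."""
    s = sorted(data)                                   # Step 220: ascending ranking
    running, total = [], 0.0
    for count, value in enumerate(s, start=1):         # successive arithmetic averages
        total += value
        running.append(total / count)
    # Step 230: errors between successive averages (assumes nonzero averages);
    # the first record has no predecessor and is assigned zero error.
    steps = [running[i] - running[i - 1] for i in range(1, len(s))]
    ae = [0.0] + [d ** 2 for d in steps]
    re = [0.0] + [(d / running[i]) ** 2 for i, d in enumerate(steps)]
    p_ae, p_re = percentile(ae, c_ae), percentile(re, c_re)   # Step 240
    # Step 250: remove a record only if BOTH errors exceed their thresholds.
    return [v for v, a, r in zip(s, ae, re) if not (a > p_ae and r > p_re)]


data = [10, 11, 12, 9, 10, 11, 50]                 # invented values, listed in any order
filtered = quality_filter(data)

full_mean = sum(data) / len(data)                  # Step 260: compare the two averages
filtered_mean = sum(filtered) / len(filtered)
print(filtered, full_mean, filtered_mean)
```

In this invented example only the value 50 is removed, and the average drops from about 16.1 to 10.5. Whether that value is truly poor quality data remains the analyst's call.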
[0104] In another embodiment illustrated in FIG. 3, Dynamic Outlier
Bias Reduction is applied as a data validation technique that tests
the reasonable accuracy of a data set to determine if the data are
appropriate for a specific use. For data validation operations, the
method may not involve an iterative procedure. In this example,
Dynamic Outlier Bias Reduction is applied to the calculation of the
Pearson Correlation Coefficient between two data sets. The Pearson
Correlation Coefficient can be sensitive to values in the data set
that differ markedly from the other data points.
Validating the data set with respect to this statistic is important
to ensure that the result represents what the majority of data
suggests rather than the influence of extreme values. The data
validation criterion for this example is that successive data values
are contained within a specified range. Thus, any values that are
spaced too far apart in value (e.g. outside the specified range)
would signify poor quality data. This is accomplished by
constructing the error terms of successive values of the function.
Dynamic Outlier Bias Reduction is applied to these error values,
and the outlier removed data set is validated data.
[0105] In Step 310, the paired data is listed in any order.
[0106] Step 320 computes the relative and absolute errors for each
ordered pair in the dataset.
[0107] Step 330 allows the analyst to enter the desired data
validation criteria. In the example, both 90% relative and absolute
error thresholds are selected. The Quality Criteria Value entries
in Step 330 are the resultant absolute and relative error
percentile values for the data shown in Step 320.
[0108] Step 340 shows the outlier removal process where data that
may be invalid is removed from the dataset using the criteria that
the relative and absolute error values both exceed the values
corresponding to the user selected percentile values entered in
Step 330. In practice other error criteria may be used and when
multiple criteria are applied as shown in this example, any
combination of error values may be applied to determine the outlier
removal rules.
[0109] Step 350 computes the statistical results, in this case the
Pearson Correlation Coefficient, for both the validated data and the
original data. These results are then reviewed for practical
relevance by the analyst.
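The validation flow of Steps 310 through 350 might look like the following Python sketch. It rests on assumptions the text leaves open: the errors for each ordered pair are taken between the two members of the pair (with a nonzero first member), the percentile uses linear interpolation, and a pair is removed only when both errors exceed their thresholds. The function names and data are hypothetical.

```python
import math


def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def validate_pairs(xs, ys, c_re=90.0, c_ae=90.0):
    """Step 320 to Step 340: drop pairs whose errors BOTH exceed the thresholds."""
    ae = [(x - y) ** 2 for x, y in zip(xs, ys)]          # assumed pairwise absolute error
    re = [((x - y) / x) ** 2 for x, y in zip(xs, ys)]    # assumed pairwise relative error
    p_ae, p_re = percentile(ae, c_ae), percentile(re, c_re)
    return [(x, y) for x, y, a, r in zip(xs, ys, ae, re)
            if not (a > p_ae and r > p_re)]


# Invented paired data; the last pair departs sharply from the rest.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 2.0]

valid = validate_pairs(xs, ys)
r_all = pearson(xs, ys)                                           # original data
r_valid = pearson([x for x, _ in valid], [y for _, y in valid])   # validated data
print(len(valid), r_all, r_valid)
```

A single extreme pair lowers the coefficient noticeably in this invented data set, which illustrates the sensitivity that motivates validating the data before computing the statistic.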
[0110] In another embodiment, Dynamic Outlier Bias Reduction is
used to perform a validation of an entire data set. Standard error
improvement value, coefficient of determination improvement value,
and absolute and relative error thresholds are selected, and then
the data set is filtered according to the error criteria. Even if
the original data set is of high quality, there will still be some
data that will have error values that fall outside the absolute and
relative error thresholds. Therefore, it is important to determine
if any removal of data is necessary. If the outlier removed data
set passes the standard error improvement and coefficient of
determination improvement criteria after the first iteration, then
the original data set has been validated, since the filtered data
set produced changes in standard error and coefficient of
determination that are too small to be considered significant (e.g.
below the selected improvement values).
[0111] In another embodiment, Dynamic Outlier Bias Reduction is
used to provide insight into how the iterations of data outlier
removal are influencing the calculation. Graphs or data tables are
provided to allow the user to observe the progression in the data
outlier removal calculations as each iteration is performed. This
stepwise approach enables analysts to observe unique properties of
the calculation that can add value and knowledge to the result. For
example, the speed and nature of convergence can indicate the
influence of Dynamic Outlier Bias Reduction on computing
representative factors for a multi-dimensional data set.
[0112] As an illustration, consider a linear regression calculation
over a poor quality data set of 87 records. The form of the
equation being regressed is y=mx+b. Table 1 shows the results of
the iterative process for 5 iterations. Notice that, using relative
and absolute error criteria of 95%, convergence is achieved in 3
iterations. Changes in the regression coefficients can be observed,
and the Dynamic Outlier Bias Reduction method reduced the
calculation data set to 79 records. The relatively low
coefficient of determination (r.sup.2=39%) suggests that a lower
(<95%) criterion should be tested to study the additional outlier
removal effects on the r.sup.2 statistic and on the computed
regression coefficients.
TABLE-US-00001 TABLE 1 Dynamic Outlier Bias Reduction Example:
Linear Regression at 95%
Iteration   N    Error   r.sup.2   m        b
0           87   3.903   25%       -0.428   41.743
1           78   3.048   38%       -0.452   43.386
2           83   3.040   39%       -0.463   44.181
3           79   3.030   39%       -0.455   43.630
4           83   3.040   39%       -0.463   44.181
5           79   3.030   39%       -0.455   43.630
[0113] In Table 2 the results of applying Dynamic Outlier Bias
Reduction are shown using the relative and absolute error criteria
of 80%. Notice that a 15 percentage point (95% to 80%) change in
outlier error criteria produced a 35 percentage point (39% to 74%)
increase in r.sup.2 with an additional 35% decrease in admitted data
(79 to 51 records included). The analyst can use a graphical view
of the changes in the regression lines with the outlier removed
data and the numerical results of Tables 1 and 2 in the analysis
process to communicate the outlier removed results to a wider
audience and to provide more insights regarding the effects of data
variability on the analysis results.
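The figures quoted in this comparison can be checked with a few lines of Python. The variable names are illustrative; the numbers are the converged (iteration 5) rows of Tables 1 and 2.

```python
# Converged rows of Table 1 (95% criteria) and Table 2 (80% criteria).
n_95, r2_95 = 79, 39   # records admitted, r^2 in percent
n_80, r2_80 = 51, 74

criteria_change_pp = 95 - 80                      # percentage points
r2_gain_pp = r2_80 - r2_95                        # percentage points
records_drop_pct = (n_95 - n_80) / n_95 * 100     # percent of the 79 records

print(criteria_change_pp, r2_gain_pp, round(records_drop_pct, 1))  # 15 35 35.4
```

The first two quantities are differences between percentages (percentage points), while the third is a relative decrease measured against the 79 records admitted at the 95% criteria.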
TABLE-US-00002 TABLE 2 Dynamic Outlier Bias Reduction Example:
Linear Regression at 80%
Iteration   N    Error   r.sup.2   m        b
0           87   3.903   25%       -0.428   41.743
1           49   1.607   73%       -0.540   51.081
2           64   1.776   68%       -0.561   52.361
3           51   1.588   74%       -0.558   52.514
4           63   1.789   68%       -0.559   52.208
5           51   1.588   74%       -0.558   52.514
[0114] As illustrated in FIG. 4, one embodiment of a system used to
perform the method includes a computing system. The hardware
consists of a processor 410 that contains adequate system memory
420 to perform the required numerical computations. The processor
410 executes a computer program residing in system memory 420 to
perform the method. Video and storage controllers 430 may be used
to enable the operation of display 440. The system includes various
data storage devices for data input such as floppy disk units 450,
internal/external disk drives 460, internal CD/DVDs 470, tape units
480, and other types of electronic storage media 490. The
aforementioned data storage devices are illustrative and exemplary
only. These storage media are used to enter the data set and outlier
removal criteria into the system, store the outlier removed data
set, store calculated factors, and store the system-produced trend
lines and trend line iteration graphs. The calculations can apply
statistical software packages or can be performed from the data
entered in spreadsheet formats using Microsoft Excel, for example.
The calculations are performed using either customized software
programs designed for company-specific system implementations or by
using commercially available software that is compatible with Excel
or other database and spreadsheet programs. The system can also
interface with proprietary or public external storage media 300 to
link with other databases to provide data to be used with the
Dynamic Outlier Bias Reduction system and method calculations. The
output devices can be a telecommunication device 510 to transmit
the calculation worksheets and other system produced graphs and
reports via an intranet or the Internet to management or other
personnel, printers 520, electronic storage media similar to those
mentioned as input devices 450, 460, 470, 480, 490 and proprietary
storage databases 530. These output devices used herein are
illustrative and exemplary only.
[0115] As illustrated in FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B, in
one embodiment, Dynamic Outlier Bias Reduction can be used to
quantitatively and qualitatively assess the quality of the data set
based on the error and correlation of the data set's data values,
as compared to the error and correlation of a benchmark dataset
comprised of random data values developed from within an
appropriate range. In one embodiment, the error can be designated
to be the data set's standard error, and the correlation can be
designated to be the data set's coefficient of determination
(r.sup.2). In another embodiment, correlation can be designated to
be the Kendall rank correlation coefficient, commonly referred to
as Kendall's tau (.tau.) coefficient. In yet another embodiment,
correlation can be designated to be the Spearman's rank correlation
coefficient, or Spearman's .rho. (rho) coefficient. As explained
above, Dynamic Outlier Bias Reduction is used to systematically
remove data values that are identified as outliers, not
representative of the underlying model or process being described.
Normally, outliers are associated with a relatively small number of
data values. In practice, however, a dataset could be unknowingly
contaminated with spurious values or random noise. The graphical
illustrations of FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B show how
the Dynamic Outlier Bias Reduction system and method can be applied
to identify situations where the underlying model is not supported
by the data. The outlier reduction is performed by removing data
values for which the relative and/or absolute errors, computed
between the model predicted and actual data values, are greater
than a percentile-based bias criterion, e.g. 80%. This means that
a data value is removed if either its relative or its absolute
error is greater than the threshold value associated with the
80th percentile (80% of the data values have an error less than
this value).
[0116] As illustrated in FIG. 5, both a realistic model development
dataset and a dataset of random values developed within the range
of the actual dataset are compared. Because in practice the
analysts typically do not have prior knowledge of any dataset
contamination, such realization must come from observing the
iterative results from several model calculations using the dynamic
outlier bias reduction system and method. FIG. 5 illustrates
exemplary model development calculation results for both datasets.
The standard error, a measure of the amount of model unexplained
error, is plotted versus the coefficient of determination (%) or
r.sup.2, representing how much data variation is explained by the
model. The percentile values next to each point represent the bias
criteria. For example, 90% signifies that data values for relative
or absolute error values greater than the 90th percentile are
removed from the model as outliers. This corresponds to removing
10% of the data values with the highest errors each iteration.
[0117] As FIG. 5 illustrates, for both the random and realistic
dataset models, error is reduced by tightening the bias criteria
(i.e., lowering the percentile threshold so that more data values
are removed); the standard error and the coefficient of determination are
improved for both datasets. However, the standard error for the
random dataset is two to three times larger than the realistic
model dataset. The analyst may use a coefficient of determination
requirement of 80%, for example, as an acceptable level of
precision for determining model parameters. In FIG. 5, an r.sup.2
of 80% is achieved at a 70% bias criteria for the random dataset,
and at an approximately 85% bias criteria for the realistic data.
However, the corresponding standard error for the random dataset is
over twice as large as the realistic dataset. Thus, by
systematically running the model dataset analysis with different
bias criteria and repeating the calculations with a representative
spurious dataset and plotting the result as shown in FIG. 5,
analysts can assess acceptable bias criteria (i.e., the acceptable
percentage of data values removed) for a data set, and accordingly,
the overall dataset quality. Moreover, such systematic model
dataset analysis may be used to automatically render advice
regarding the viability of a data set as used in developing a model
based on a configurable set of parameters. For example, in one
embodiment wherein a model is developed using Dynamic Outlier Bias
Reduction for a dataset, the error and correlation coefficient values
for the model dataset and for a representative spurious dataset,
calculated under different bias criteria, may be used to
automatically render advice regarding the viability of the data set
in supporting the developed model, and inherently, the viability of
the developed model in supporting the dataset.
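The comparison against a benchmark of random values can be sketched in Python as below. This is a simplified illustration, not the procedure used to produce FIG. 5: a straight-line least squares fit stands in for the model, a fixed number of passes replaces the termination test, the "realistic" targets are synthetic, the benchmark is assumed to be random target values drawn uniformly within the range of the actual targets, and all names are hypothetical.

```python
import math
import random


def percentile(values, pct):
    """Percentile with linear interpolation; pct in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b (stand-in for the model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx


def dobr_stats(xs, ys, criterion, passes=5):
    """Standard error and r^2 of the fit on records kept at one bias criterion."""
    idx = range(len(xs))
    keep = list(idx)
    if criterion < 100:                                  # 100% means no bias reduction
        for _ in range(passes):
            m, b = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
            ae = [(m * x + b - y) ** 2 for x, y in zip(xs, ys)]   # all records
            re = [a / y ** 2 for a, y in zip(ae, ys)]             # assumes y != 0
            p_ae, p_re = percentile(ae, criterion), percentile(re, criterion)
            keep = [i for i in idx if ae[i] < p_ae and re[i] < p_re]
    kx, ky = [xs[i] for i in keep], [ys[i] for i in keep]
    m, b = fit_line(kx, ky)
    my = sum(ky) / len(ky)
    sse = sum((m * x + b - y) ** 2 for x, y in zip(kx, ky))
    sst = sum((y - my) ** 2 for y in ky)
    return math.sqrt(sse / (len(kx) - 2)), 1.0 - sse / sst


rng = random.Random(0)                                   # fixed seed for repeatability
xs = list(range(1, 41))
real = [2 * x + 50 + rng.gauss(0, 2) for x in xs]        # synthetic "realistic" targets
rand = [rng.uniform(min(real), max(real)) for _ in xs]   # random targets within range

criteria = [100, 90, 80, 70]
curve_real = {c: dobr_stats(xs, real, c) for c in criteria}
curve_rand = {c: dobr_stats(xs, rand, c) for c in criteria}
for c in criteria:
    print(c, curve_real[c], curve_rand[c])               # (standard error, r^2) pairs
```

Plotting the two sets of (standard error, r.sup.2) pairs against each other gives a chart of the kind described for FIG. 5, from which an analyst could judge how far the model dataset sits from the random benchmark at each bias criterion.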
[0118] As illustrated in FIG. 5, observing the behavior of these
model performance values for several cases provides a quantitative
foundation for determining whether the data values are
representative of the processes being modeled. For example,
referring to FIG. 5, the standard error for the realistic data set
at a 100% bias criteria (i.e., no bias reduction), corresponds to
the standard error for the random data set at approximately 65%
bias criteria (i.e., 35% of the data values with the highest errors
removed). Such a finding supports the conclusion that the data is not
contaminated.
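The comparison in the preceding paragraph amounts to finding the bias
criterion at which one curve reaches a reference standard error. The
short Python sketch below does this by linear interpolation between
tabulated points. The function name and the numeric values are
hypothetical and are not read from FIG. 5; they are chosen only so that
the crossing falls near 65%.

```python
def matching_criterion(curve, target):
    """Return the interpolated bias criterion at which the standard
    error equals `target`, or None if the curve never reaches it.

    `curve` is a list of (bias_criterion, standard_error) pairs.
    """
    points = sorted(curve, reverse=True)
    for (c_hi, e_hi), (c_lo, e_lo) in zip(points, points[1:]):
        if min(e_lo, e_hi) <= target <= max(e_lo, e_hi):
            if e_hi == e_lo:
                return c_hi
            t = (e_hi - target) / (e_hi - e_lo)
            return c_hi + t * (c_lo - c_hi)
    return None


# Hypothetical standard errors for a random dataset at each criterion.
random_curve = [(1.0, 6.0), (0.9, 5.0), (0.8, 4.0),
                (0.7, 3.2), (0.6, 2.6), (0.5, 2.0)]
# Hypothetical standard error of the realistic dataset at 100%.
realistic_at_100 = 2.9

crossing = matching_criterion(random_curve, realistic_at_100)
```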
[0119] In addition to the above-described quantitative analysis
facilitated by the illustrative graph of FIG. 5, Dynamic Outlier
Bias Reduction can be utilized in an equally powerful, if not more
powerful, subjective procedure to help assess a dataset's quality.
This is done by plotting the model predicted values against the
actual target values given by the data for both the outlier and
included results.
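A minimal Python sketch of preparing the two series for such a plot is
given here. It assumes that outliers are simply the records with the
largest absolute difference between actual and predicted values, which
is a simplification of the disclosed method, and it summarizes each
series by its mean perpendicular distance from the line on which
predicted values equal actual values. The function names and sample
values are hypothetical.

```python
import math


def split_for_plot(actual, predicted, keep_fraction):
    """Split (actual, predicted) pairs into included and outlier series,
    ranking records by absolute error (an assumed simplification)."""
    pairs = sorted(zip(actual, predicted), key=lambda p: abs(p[0] - p[1]))
    n_keep = round(keep_fraction * len(pairs))
    return pairs[:n_keep], pairs[n_keep:]


def mean_distance_to_identity(pairs):
    """Mean perpendicular distance of the points from the line on which
    predicted equals actual; returns 0.0 for an empty series."""
    if not pairs:
        return 0.0
    return sum(abs(a - p) for a, p in pairs) / (math.sqrt(2) * len(pairs))


# Hypothetical values for illustration only.
actual = [10, 12, 15, 20, 25]
predicted = [11, 12.5, 14, 28, 24]

included, outliers = split_for_plot(actual, predicted, keep_fraction=0.8)
```

The two series can then be drawn with different markers on a scatter
plot of predicted against actual values.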
[0120] FIGS. 6A and 6B illustrate these plots for the 100% points
of both the realistic and random curves in FIG. 5. The large
scatter in FIG. 6A is consistent with the arbitrary target values
and the resultant inability of the model to fit this intentional
randomness. FIG. 6B is consistent with, and typical of, practical
data collection in that the model predictions and actual values are
more closely grouped around the line whereon model predicted values equal
actual target values (hereinafter Actual=Predicted line).
[0121] FIGS. 7A and 7B illustrate the results from the 70% points
in FIG. 5 (i.e., 30% of data removed as outliers). In FIGS. 7A and
7B the outlier bias reduction is shown to remove the points most
distant from the Actual=Predicted line, but the large variation in
model accuracy between FIGS. 7A and 7B suggests that this dataset
is representative of the processes being modeled.
[0122] FIGS. 8A and 8B show the results from the 50% points in FIG.
5 (i.e., 50% of data removed as outliers). In this case about half
of the data is identified as outliers and even with this much
variation removed from the dataset, the model, in FIG. 8A, still
does not closely describe the random dataset. The general variation
around the Actual=Predicted line is about the same as in FIGS.
6A and 7A, taking into account the removed data in each case. FIG.
8B shows that with 50% of the variability removed, the model was
able to produce predicted results that closely match the actual
data. Analysts can use these types of visual plots, in addition to the
analysis of performance criteria shown in FIG. 5, to assess
the quality of actual datasets in practice for
model development. While FIGS. 5, 6A, 6B, 7A, 7B, 8A, and 8B
illustrate visual plots wherein the analysis is based on
performance criteria trends corresponding to various bias criteria
values, in other embodiments, the analysis can be based on other
variables that correspond to bias criteria values, such as model
coefficient trends corresponding to various bias criteria selected
by the analyst.
[0123] Various embodiments include a system for reducing outlier
bias in target variables measured for a facility. FIG. 9
illustrates an example of such embodiments. The system 1010 illustrated
in FIG. 9 comprises a computing unit 1012 by which a data set, such
as a data set containing various performance measurements for an
industrial facility, can be processed. The computing unit 1012
comprises a processor 1014 and a storage subsystem 1016 on which a
computer program embodying the Dynamic Outlier Bias Removal
methodology disclosed herein is stored. The system 1010 comprises an input
unit 1018 that further comprises a measuring device 1020 for
measuring a given target variable and for providing a corresponding
data set. The measuring device 1020 can be configured to measure
any target variable of interest, such as, for example, the number
of parts that leave an industrial plant facility per time unit, or
the volume of refined substances produced by a refining facility
per time unit. Beyond that, a plurality of target variables can be
measured simultaneously. In the embodiment shown, the measuring
device 1020 comprises a sensor 1022. One of ordinary skill in the
art would appreciate that the scope of the present invention includes
various sensors that may be used in measuring various physical
attributes of material and/or components used in or produced by
industrial facilities, such as, for example, sensors capable of
detecting and quantifying a chemical compound, e.g. greenhouse gas
emissions. In addition, one of ordinary skill in the art will
appreciate that measuring a target variable of interest includes
any means of collecting, receiving, measuring, accumulating, and
processing data. The target variables, data sets, and data can
comprise data of all kinds, including but not limited to industrial
process data, computer system data, financial data, economic data,
stock, bond and futures data, internet search data, security data,
voice and other human recognition data, cloud data, big data,
insurance data, and other data of interest; the scope and breadth of
the disclosure and invention are not limited to the type of target
variables, data sets or data. One skilled in the art will also
appreciate that the sensor and the measuring device can also be or
include computers, computer systems, and processors. Moreover, the
system 1010 comprises an output unit 1024 by which the processed
data can be outputted. The output unit 1024 may include a monitor, a
printer or a transmission device (not shown).
[0124] In one embodiment, the system 1010 initiates the sensor 1022,
which in turn detects and quantifies a given compound, e.g. carbon
dioxide. The detection and quantification can be done continuously
or within discrete time steps. Each time a measurement is
completed, a data set is generated, stored on the storage
subsystem 1016, and inputted into the computing unit 1012. The data
set is processed by the Dynamic Outlier Bias Removal computer
program stored by the storage subsystem 1016 whereby it is censored
according to the various embodiments of the methods disclosed
herein. Once the computer program has processed the data, the
processed data is outputted by the output unit 1024. In an
embodiment wherein the output unit 1024 is a monitor or a printer,
the results may be visualized in a diagram. In an embodiment
wherein the output unit 1024 comprises a transmission device, the
processed data is sent to a central database or a control center
where the data can be further processed (not shown). Accordingly,
the system according to the various disclosed embodiments provides
a powerful tool to compare different facilities within one company
or within one technical field with each other in an automated way
wherein outlier bias is reduced.
[0125] In a preferred embodiment, the measuring device 1020
comprises one or more sensors for detecting and quantifying a
chemical compound. Due to global warming, greenhouse gases
emitted by a facility are becoming an increasingly important target
variable. Facilities that emit small amounts of greenhouse gases
may be better ranked than those emitting higher amounts although
the overall productivity of the latter may be better. Examples of
greenhouse gases are carbon dioxide (CO2), ozone (O3), water vapor
(H2O), hydrofluorocarbons (HFCs), perfluorocarbons (PFCs),
chlorofluorocarbons (CFCs), sulphur hexafluoride (SF6), methane
(CH4), nitrous oxide (N2O), carbon monoxide (CO), nitrogen oxides
(NOx), and non-methane volatile organic compounds (NMVOCs). The
automated detection and quantification of these compounds may be
used to develop industrial standards regarding certain allowable
emissions of the greenhouse gases. However, applying Dynamic
Outlier Bias Removal leads to removing outliers that may be caused
by extraordinary circumstances in the production such as operating
errors or even accidents. Thus, using various embodiments disclosed
herein results in developing more accurate and meaningful
standards. Once the industrial standards are developed, the system
can be used to compare the emissions with the standards.
[0126] One of ordinary skill in the art would further appreciate
that the scope of the present invention includes application of the
various disclosed embodiments for reducing outlier bias in target
variables relating to financial instruments, such as equity
securities (e.g., common stock) or derivative contracts (e.g.,
forwards, futures, options, and swaps, etc.). For example, in one
embodiment, the system 1010 comprises an input unit 1018 that
receives data relating to a financial instrument, such as a common
stock, and provides a corresponding data set. The target variable
can be the stock price. Further, variables that relate to the
target variable can be determined using various known methods of
evaluating financial instruments, such as, for example, discounted
cash flow analysis. Such related variables may include the relevant
dividends, earnings, or cash flows, earnings per share,
price-to-earnings ratio, or growth rate, etc. Once the database of
target values and related variable values is formed, various
embodiments of the Dynamic Outlier Bias Removal disclosed herein
can be applied to the database, resulting in a more accurate model
to evaluate the financial instrument.
[0127] The foregoing disclosure and description of the preferred
embodiments of the invention are illustrative and explanatory
thereof and it will be understood by those skilled in the art that
various changes in the details of the illustrated system and method
may be made without departing from the scope of the invention.
* * * * *