U.S. patent application number 16/665413 was filed with the patent office on 2020-09-24 for specifying and applying rules to data.
The applicant listed for this patent is Ab Initio Technology LLC. Invention is credited to Joel Gould, Roy Leonard Procops.
Application Number | 20200301897 16/665413 |
Document ID | / |
Family ID | 1000004882061 |
Filed Date | 2020-09-24 |
United States Patent
Application |
20200301897 |
Kind Code |
A1 |
Procops; Roy Leonard ; et
al. |
September 24, 2020 |
SPECIFYING AND APPLYING RULES TO DATA
Abstract
Validation rules are specified for validating data included in
fields of elements of a dataset. Cells are rendered in a
two-dimensional grid that includes: one or more subsets of the
cells extending in a direction along a first axis, each associated
with a respective field, and multiple subsets of the cells
extending in a direction along a second axis, one or more of the
subsets associated with a respective validation rule. Validation
rules are applied to at least one element based on user input
received from at least some of the cells. Some cells, associated
with a field and a validation rule, can each include: an input
element for receiving input determining whether or not the
associated validation rule is applied to the associated field,
and/or an indicator for indicating feedback associated with a
validation result based on applying the associated validation rule
to data included in the associated field.
Inventors: |
Procops; Roy Leonard;
(Winchester, MA) ; Gould; Joel; (Arlington,
MA) |
|
Applicant: |
Name |
City |
State |
Country |
Type |
Ab Initio Technology LLC |
Lexington |
MA |
US |
|
|
Family ID: |
1000004882061 |
Appl. No.: |
16/665413 |
Filed: |
October 28, 2019 |
Related U.S. Patent Documents
|
|
|
|
|
|
Application
Number |
Filing Date |
Patent Number |
|
|
13653995 |
Oct 17, 2012 |
10489360 |
|
|
16665413 |
|
|
|
|
Current U.S.
Class: |
1/1 |
Current CPC
Class: |
G06F 40/18 20200101;
G06F 16/215 20190101; G06F 16/248 20190101 |
International
Class: |
G06F 16/215 20060101
G06F016/215; G06F 16/248 20060101 G06F016/248; G06F 40/18 20060101
G06F040/18 |
Claims
1. A computing system for validating data included in one or more
fields of a plurality of elements of a dataset, the computing
system comprising: a user interface module configured to render a
configurable two-dimensional grid comprising a plurality of cells
arranged as a plurality of rows and columns, with a first one of
the rows and columns corresponding to one or more validation rules
and a second one of the rows and columns corresponding to fields of
the elements of the dataset; with at least one of the cells
comprising: an input element region of the at least one of the
cells, with the input element region configured receive an input
parameter that configures a first one of the validation rules; and
an indicator element region of the cell that renders feedback
associated with execution of the first one of the validation rules.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of and claims priority to
U.S. application Ser. No. 13/653,995, filed on Oct. 17, 2012,
incorporated herein by reference.
BACKGROUND
[0002] This description relates to specifying and applying rules to
data.
[0003] Many modern applications, including business applications,
process large sets of data (i.e., "datasets") which may be compiled
from various sources. The various sources that provide data to the
dataset may have different levels of data quality. To ensure that
the applications function properly, an adequate level of data
quality in the dataset should be maintained. To maintain an
adequate level of data quality, the dataset can be processed by a
data validation system. Such a system applies validation rules to
the dataset before it is provided to the application. In some
examples, the data validation system uses the results of validation
rules to calculate a measure of data quality and alert an
administrator of the application if the measure of data quality
falls below a predetermined threshold. In other examples, the data
validation system includes modules for handling data that fails one
or more of the validation rules. For example, the data validation
system may discard or repair data that fails one or more of the
validation rules.
[0004] In general, the validation rules applied by the data
validation system are defined by an administrator of the data
validation system.
SUMMARY
[0005] In one aspect, in general, a computing system specifies one
or more validation rules for validating data included in one or
more fields of each element of a plurality of elements of a
dataset. The computing system includes a user interface module
configured to render a plurality of cells arranged in a
two-dimensional grid having a first axis and a second axis. The
two-dimensional grid includes: one or more subsets of the cells
extending in a direction along the first axis of the
two-dimensional grid, each subset of the one or more subsets
associated with a respective field of an element of the plurality
of elements of the dataset, and multiple subsets of the cells
extending in a direction along the second axis of the
two-dimensional grid, one or more of the multiple subsets
associated with a respective validation rule. The computing system
also includes a processing module configured to apply validation
rules to at least one element of the dataset based on user input
received from at least some of the cells. In some implementations,
at least some cells, associated with a field and a validation rule,
each include an input element for receiving input determining
whether or not the associated validation rule is applied to the
associated field. In some implementations, at least some cells,
associated with a field and a validation rule, each include an
indicator for indicating feedback associated with a validation
result based on applying the associated validation rule to data
included in the associated field of the element.
[0006] Aspects can include one or more of the following
features.
[0007] Applying validation rules to data included in a first field
of a first element includes: determining any selected validation
rules associated with cells from a subset of cells extending in the
direction along the first axis associated with the first field of
the first element, based on any input received in the input
elements of the cells; and determining validation results for the
data included in the first field of the first element based on the
selected validation rules.
[0008] The one or more subsets of the cells extending in a
direction along the first axis are rows of cells.
[0009] The multiple subsets of the cells extending in a direction
along the second axis are columns of cells.
[0010] The input element is configured to receive input specifying
one or more validation rule parameters.
[0011] One or more of the validation rules when evaluated yield a
validation result of set of at least two validation results, the
validation results including a result of valid and a result of
invalid.
[0012] The indicator for indicating feedback included in at least
some of the cells is configured to apply shading to a cell if the
validation result is a result of invalid.
[0013] The input element is further configured to determine a
correctness of each of the validation rule parameters.
[0014] The at least some cells associated with a field and a
validation rule each include a second indicator for displaying a
result of determining a correctness of the validation rule
parameters associated with the cell.
[0015] The indicator for indicating feedback includes a numeric
indicator which is configured to display a number of invalid
results, the number of invalid results determined by applying the
associated validation rule to data included in the associated field
for all of the elements of the dataset.
[0016] The dataset includes one or more tables of a database and
the elements of the dataset include database records.
[0017] One or more of the validation rules are user defined.
[0018] One or more of the validation rules are predefined.
[0019] One or more of the multiple subsets of the cells extending
in the direction along the second axis of the two-dimensional grid
includes a first cell associated with a first validation rule and a
second cell associated with a second validation rule, the second
validation rule different from the first validation rule.
[0020] One or more of the multiple subsets of the cells extending
in the direction along the second axis of the two-dimensional grid
includes a subset of cells that include an input element for
receiving a value to replace an existing value in a corresponding
field in response to a result of invalid for one of the validation
rules applied to the existing value.
[0021] One or more of the multiple subsets of the cells extending
in the direction along the second axis of the two-dimensional grid
includes a subset of cells that include an input element for
receiving an excluded value, such that the excluded value appearing
in a corresponding field results in preventing validation rules
from being applied to the existing value.
[0022] In another aspect, in general, a computing system specifies
one or more validation rules for validating data included in one or
more fields of each element of a plurality of elements of a
dataset. The computing system includes means for rendering a
plurality of cells arranged in a two-dimensional grid having a
first axis and a second axis. The two-dimensional grid includes:
one or more subsets of the cells extending in a direction along the
first axis of the two-dimensional grid, each subset of the one or
more subsets associated with a respective field of an element of
the plurality of elements of the dataset, and multiple subsets of
the cells extending in a direction along the second axis of the
two-dimensional grid, one or more of the multiple subsets
associated with a respective validation rule. The computing system
also includes means for applying validation rules to at least one
element of the dataset based on user input received from at least
some of the cells. In some implementations, at least some cells,
associated with a field and a validation rule, each include an
input element for receiving input determining whether or not the
associated validation rule is applied to the associated field. In
some implementations, at least some cells, associated with a field
and a validation rule, each include an indicator for indicating
feedback associated with a validation result based on applying the
associated validation rule to data included in the associated field
of the element.
[0023] In another aspect, a method specifies one or more validation
rules for validating data included in one or more fields of each
element of a plurality of elements of a dataset. The method
includes: rendering, by a user interface module, a plurality of
cells arranged in a two-dimensional grid having a first axis and a
second axis. The two-dimensional grid includes: one or more subsets
of the cells extending in a direction along the first axis of the
two-dimensional grid, each subset of the one or more subsets
associated with a respective field of an element of the plurality
of elements of the dataset, and multiple subsets of the cells
extending in a direction along the second axis of the
two-dimensional grid, one or more of the multiple subsets
associated with a respective validation rule. The method also
includes applying, by at least one processor, validation rules to
at least one element of the dataset based on user input received
from at least some of the cells. In some implementations, at least
some cells, associated with a field and a validation rule, each
include an input element for receiving input determining whether or
not the associated validation rule is applied to the associated
field. In some implementations, at least some cells, associated
with a field and a validation rule, each include an indicator for
indicating feedback associated with a validation result based on
applying the associated validation rule to data included in the
associated field of the element.
[0024] In another aspect, in general, a computer program, stored on
a computer-readable storage medium, specifies one or more
validation rules for validating data included in one or more fields
of each element of a plurality of elements of a dataset. The
computer program includes instructions for causing a computer
system to render a plurality of cells arranged in a two-dimensional
grid having a first axis and a second axis. The two-dimensional
grid includes: one or more subsets of the cells extending in a
direction along the first axis of the two-dimensional grid, each
subset of the one or more subsets associated with a respective
field of an element of the plurality of elements of the dataset,
and multiple subsets of the cells extending in a direction along
the second axis of the two-dimensional grid, one or more of the
multiple subsets associated with a respective validation rule. The
computer program also includes instructions for causing the
computer system to apply validation rules to at least one element
of the dataset based on user input received from at least some of
the cells. In some implementations, at least some cells, associated
with a field and a validation rule, each include an input element
for receiving input determining whether or not the associated
validation rule is applied to the associated field. In some
implementations, at least some cells, associated with a field and a
validation rule, each include an indicator for indicating feedback
associated with a validation result based on applying the
associated validation rule to data included in the associated field
of the element.
[0025] Aspects can have one or more of the following
advantages.
[0026] Among other advantages, the user interface can provide live
feedback of the results of applying the rules to a single data
element of a dataset as the rules are entered. In this way, the
user can test the effectiveness of their rules without having to
apply the rules to the entire dataset (a potentially time consuming
process).
[0027] The user interface allows a user to run the specified rules
over a dataset and receive feedback regarding the performance of
each of the specified rules over the entire dataset. The user then
has an opportunity to modify any of the specified rules that do not
meet the expectations of the user.
[0028] The user interface allows a user to quickly and intuitively
specify and modify rules, saving time and resources.
[0029] Other features and advantages of the invention will become
apparent from the following description, and from the claims.
DESCRIPTION OF DRAWINGS
[0030] FIG. 1 is a block diagram of a system for specifying
validation rules for validating data.
[0031] FIG. 2 is a user interface for specifying validation rules
for validating data.
[0032] FIG. 3 is a screen capture of the user interface for
specifying validation rules.
DESCRIPTION
[0033] FIG. 1 shows an exemplary data processing system 100 in
which the validation techniques can be used. The system 100
includes a data source 102 that may include one or more sources of
data such as storage devices or connections to online data streams,
each of which may store data (sometimes referred to as a "dataset")
in any of a variety of storage formats (e.g., database tables,
spreadsheet files, flat text files, or a native format used by a
mainframe). An execution environment 104 includes a user interface
(UI) module 106 and a processing module 108. The UI module 106
manages input received from a user 110 over a user interface 112
(e.g., a graphical view on a display screen) for specifying
validation rules to be used by the processing module 108 for
processing data from the data source 102.
[0034] The execution environment 104 may be hosted on one or more
general-purpose computers under the control of a suitable operating
system, such as the UNIX operating system. For example, the
execution environment 104 can include a multiple-node parallel
computing environment including a configuration of computer systems
using multiple central processing units (CPUs), either local (e.g.,
multiprocessor systems such as SMP computers), or locally
distributed (e.g., multiple processors coupled as clusters or
MPPs), or remote, or remotely distributed (e.g., multiple
processors coupled via a local area network (LAN) and/or wide-area
network (WAN)), or any combination thereof.
[0035] The processing module 108 reads data from the data source
102 and performs validation procedures based on validation
information obtained by the UI module 106. Storage devices
providing the data source 102 may be local to the execution
environment 104, for example, being stored on a storage medium
connected to a computer running the execution environment 104
(e.g., hard drive 114), or may be remote to the execution
environment 104, for example, being hosted on a remote system
(e.g., mainframe 116) in communication with a computer running the
execution environment 104, over a remote connection.
[0036] In general, a dataset accessed from the data source 102
includes a number of data elements (e.g., records formatted
according to a predetermined record structure, or rows in a
database table). Each element of the number of data elements can
include values for a number of fields (e.g., attributes defined
within a record structure, or columns in a database table) (e.g.,
"first name," "last name," "email address," etc.), possibly
including null or empty values. Various characteristics of values
in the fields (e.g., related to content or data type), or the
presence or absence of values in certain fields, may be considered
valid or invalid. For example, a "last name" field including the
string "Smith" may be considered valid, while a "last name" field
that is blank may be considered invalid.
[0037] The performance of an application that utilizes the dataset
from the data source 102 may be adversely affected if the dataset
includes a significant number of data elements with one or more
invalid fields. The processing module 108 performs data validation
procedures, including applying data validation rules to the
dataset, to ensure that the dataset meets a quality constraint
defined by validation rules. The data processing system 100 alerts
a system administrator if the quality of the dataset fails to meet
the quality constraint. In some examples, the processing module 108
may be configured to repair invalid data, if possible, or perform
various data cleansing procedures to generate a dataset of cleansed
data elements. In yet other examples, the processing module 108 may
be configured to generate a list of fields that include invalid
data from which reports can be generated. In some examples, the
reports include a count of records that included invalid data for
one or more of the fields in the list of fields. In other examples,
aggregations of invalid fields are calculated from the list of
fields.
[0038] In general, different applications process different types
of data. Thus, depending on the application, the elements of the
dataset may include different fields. The UI module 106 provides
the user interface 112, which enables a set of validation rules to
be specified and used to validate the dataset. The user interface
112 is able to provide a single view including multiple fields of a
particular data element structure (in some implementations, all the
available fields). Thus, for a given application, the user 110
(e.g., a system administrator) is able to specify appropriate
validation rules for the data.
1 Validation User Interface
[0039] Referring to FIG. 2, one example of the user interface 112
is configured to facilitate the user 110 specifying and verifying
one or more validation rules for validating the dataset.
1.1 Validation Rule Specification
[0040] The UI module 106 renders the user interface 112 (e.g., on a
computer monitor) including a number of cells 224 arranged in a
two-dimensional grid 225 having a first axis 226 and a second axis,
228. One or more subsets 230 of the cells 224 (i.e., referred to as
rows 230 in the remainder of the detailed description) extends in a
direction along the first axis 226 of the two-dimensional grid 225.
Each of the rows 230 is associated with a field 218. In some
examples, the first (i.e., leftmost) cell of each of the rows 230
includes the name of the field 218 associated with the row 230 (in
this example, the field names are "Field 1," "Field 2," . . .
"Field M").
[0041] Multiple subsets 232 of the cells 224 (i.e., referred to as
columns 232 in the remainder of the detailed description) extend in
a direction along the second axis 228 of the two-dimensional grid
225. One or more of the columns 232 is associated with a respective
validation rule 234. In some examples, the first (i.e., the
topmost) cell of each of the columns 232 includes the name of the
validation rule 234 associated with the column 232 (in this
example, the validation rule names are "Validation Rule 1,"
"Validation Rule 2," . . . "Validation Rule N"). It is noted that
in some examples, the directions of the first axis 226 and the
second axis 228 can be swapped, causing the rows 230 associated
with the fields 218 to become columns and the columns 232
associated with the validation rules 234 to become rows.
[0042] In some examples, the user interface 112 includes a list
(not shown) of pre-defined validation rules. The validation rules
234 are added to the two-dimensional grid 225, for example, by the
user 110 dragging one or more of the pre-defined validation rules
into the two-dimensional grid 225, or double-clicking one of the
pre-defined validation rules, resulting in one or more new columns
232 being added to the grid 225. The pre-defined validation rules
have a built-in function, which may accept a pre-defined set of
parameters as input that can be provided within a corresponding
cell. For many situations, the pre-defined list of validation rules
is sufficient for the user's 110 needs. However, in some examples,
as is described below, the user 110 can define custom validation
rules which can also be added as columns 232 to the two-dimensional
grid 225.
[0043] After one or more validation rule columns 232 are added to
the two-dimensional grid 225, the user 110 can specify which
validation rules 234 should be applied to which fields 218. To
specify that a given validation rule 234 should be applied to a
given field 218, the user 110 first selects a cell 224 where the
row 230 associated with the given field 218 intersects with the
column 232 associated with the given validation rule 234. The user
110 then enters one or more validation rule parameters 236 in an
input element (e.g., a text field or check box) of the selected
cell 224. In general, the inclusion of a rule parameter 236 in a
cell potentially serves two purposes. The first purpose is to
provide "configuration input" to configure the validation rule 234,
and the second purpose is to indicate that the given validation
rule 234 should be applied to the given field 218. It follows that
if a cell 224 does not include validation rule parameters 236
(i.e., the cell is left blank), the processing module 108 does not
apply the validation rule 234 associated with the cell 224 to the
field 218 associated with the cell 224.
[0044] Many different types of rule parameters 236 can be entered
in to the cells 224. In some cases, no configuration input is
needed to configure a rule, so the rule parameter 236 may simply be
a "confirmation input" rule parameter that confirms that a
corresponding validation rule is to be applied. For example, one
example of an input element for receiving a confirmation input rule
parameter is a checkbox which, when checked, indicates that the
validation rule 234 associated with a cell 224 should be applied to
the field 218 associated with the cell 224. Examples of various
types of validation rules are presented in the following list,
which indicates whether or not the validation rule is configured by
configuration input: [0045] Integer--validates that the filed
contains only integer numbers (no configuration input needed).
[0046] Invalid Values--validates that the field does not contain
user specified invalid values (provided as configuration input).
[0047] Max Precision--Validates that the field has no more than a
user specified number of digits (provided as configuration input)
after the decimal point. [0048] Maximum--Invalid if the field value
is greater than a user specified value (provided as configuration
input). [0049] Maximum Length--Validates that the field has no more
than a user specified number of characters or bytes (provided as
configuration input). [0050] Minimum--Invalid if the field is less
than a user specified value (provided as configuration input).
[0051] Not Blank--Invalid if the field is empty or contains only
blanks (no configuration input needed). [0052] Not Null--Invalid if
the field is null (provided as configuration input needed). [0053]
Pattern--Validates that a string field as the specified pattern
(provided as configuration input). [0054] Valid Values--Validates
that the field contains only user specified valid values (provided
as configuration input). [0055] Valid for Type--Validates that the
field data is valid for its type (no configuration input
needed).
[0056] It is noted that the above list of validation rules is not
necessarily comprehensive.
1.2 Validation Rule Verification
[0057] In some examples, the UI module 106 provides feedback to the
user 110 through the user interface 112 by displaying results of
the processing module 108 applying the user-specified validation
rules 234 to at least some of the elements of the dataset.
[0058] The user interface 112 shown in FIG. 2 is configured to
display the values 242 of the fields 218 for a given element 244 of
the dataset. As the user specifies (and/or modifies) validation
rules 234 and their associated parameters 236, the processing
module 108 automatically applies the specified validation rules 234
to the values 242 of the fields 218 of the given data element 244
and provides the results of applying the validation rules 234 to
the UI module 106, which in turn presents the results in the user
interface 112 as feedback to the user 110. In general, the result
of applying a validation rule is a pass/fail result. Such a
pass/fail result can be indicated to the user 110 by, for example,
filling the appropriate cell with a certain color, pattern, or
shading. In FIG. 2, the cell associated with field 1 and validation
rule 1 includes gray shading 238, indicating that the value of
field 1 failed validation rule 1. In other examples, a pass/fail
result can be indicated to the user 100 by the inclusion/exclusion
of an indicator icon in the appropriate cell. For example, a
failing result can be indicated by including a red exclamation
point icon in the cell and a passing result can be indicated by the
absence of the red exclamation point icon. In some examples, an
icon such as a green circle can be included in the cell to indicate
a passing result.
[0059] When specifying validation rules 234, it can be useful for
the user 110 to navigate through the dataset to evaluate the effect
of the validation rules on different elements of the dataset. Thus,
the user interface 112 includes a control 246 which allows the user
to select different elements of the dataset (in this example, by
entering a sequence number). As the user navigates from one element
to the next, the processing module 108 automatically applies the
validation rules 234 to the currently selected element.
[0060] In some examples, the user interface 112 includes a run
control 248, which permits the processing module 108 to apply the
specified validation rules 234 to all of the elements of the
dataset. Upon completion of applying the validation rules 234 to
the dataset, the processing module 108 provides the results of
applying the validation rules 234 to the dataset to the UI module
106, which in turn displays the results in the user interface 112
to the user 110. In some examples, each cell 234 associated with a
validation rule 234 that was applied includes a failed result count
indicator 240. The failed result count indicator 240 displays the
number of data elements that failed the validation rule 234
specified by the cell 224.
1.3 Mixed Columns and Custom Validation Rules
[0061] As was mentioned above, the user 110 may desire a validation
rule with functionality that is not included in any of the
pre-defined validation rules. In some examples, the user interface
112 includes an option for inserting one or more mixed validation
rule columns into the two-dimensional grid 225. A mixed validation
rule column allows the user 110 to specify a different validation
rule for each cell (associated with a given field 218) included in
the column. For example, one cell of the mixed validation rule
column could include a `Valid Values` test while another cell of
the mixed validation rule column could include a `Maximum` test. In
general, the user 100 specifies a validation rule for a given cell
of the mixed validation rule column by entering the name of the
test followed by the rule parameters for the test (if the test
accepts rule parameters). In general, any validation rule which can
be added to the two-dimensional grid 225 as a column can be entered
into a single cell of a mixed validation rule column. Some examples
of the contents of cells of the mixed validation rule column are
"Not Null," "Maximum(99)," and "Valid Values(VM,F)."
[0062] One advantage provided by the mixed validation rule column
is that the usability of the user interface 112 is improved by more
efficiently representing rarely used tests on the screen. In
particular, the user 110 does not have to devote an entire column
232 of the two-dimensional grid 225 to a validation rule that only
applies to a single field 218. For example, the mixed validation
rule column can avoid a situation where a "Valid Email" test
applies only to a single field 218 (e.g., an `email_addr` field)
but occupies an entire column 232 of the two-dimensional grid 225,
thereby wasting valuable screen real estate.
[0063] In other examples, the user 110 can augment the list of
pre-defined validation rules with a new, reusable, custom
validation rule 234. The user interface 112 provides a template for
the user 110 to define the functionality of the new validation rule
234. The user 110 defines the desired custom functionality within
the bounds of the template using, for example, a programming
language or an expression language, for example DML code decorated
with structured comments. Upon saving the new validation rule 234,
the validation rule 234 is added to the list of pre-defined
validation rules. The user 110 can later use the new custom
validation rule 234, for example, by dragging the validation rule
from the list of validation rules into the two-dimensional grid 225
or by double-clicking the validation rule. As is the case with the
pre-defined validation rules, dragging the new validation rule into
the grid 225 or double-clicking the new validation rule causes a
new column 232 to be added to the grid 225, the new column 232
associated with the new validation rule.
[0064] Validation rules, whether pre-defined or custom validation
rules, may have an attribute indicating whether the rule should be
applied to null values or blank values. If the rules specifies it
should not be applied to null values, the value is first tested for
null, and then if null the rule is not applied, or if not null the
rule is applied. If the rule specifies it should not be applied to
blank values, the value is first tested to see if it is blank, and
the rule is only applied if the value was found to be not
blank.
[0065] Validation rules, whether pre-defined or custom, may have
attributes indicating logic that can be used to determine the
whether a set of rule parameters 236 entered in a cell 224 are
valid for the validation rule. For example, the user interface 112
uses this logic to determine the correctness of each set of rule
parameters 236 entered in a cell 224, and if the rule parameters
are determined to be incorrect (e.g., due to a syntax error), and
an indicator (for example a red stop sign) is displayed in the
cell, and an error message determined by the logic is displayed
(for example in a list of errors, or as a hover tooltip when
hovering over the cell). Another example of checking the
correctness of a rule parameter is checking semantics, such as
checking that a specified lookup file identifier has in fact been
made known to the processing module 108.
1.4 Pre-Processing or Post-Processing Columns
[0066] In some examples, the user interface 112 may include a
pre-processing column, which can be used to apply any initial
processing to values in a field, or to specify any particular
values to be handled differently by validation rules of other
columns. The user interface 112 may also include a post-processing
column, which can be used to apply any actions in response to
results of a test performed by a validation rule. A pre-processing
column can be used, for example, to allow the user 110 to specify
values to be excluded from validation, and validation data types
for one or more of the fields 218. A post-processing column can be
used, for example, to allow the user 110 to specify replacement
values to replace existing values in an element (e.g., to replace
different types of invalid values with appropriate replacement
values).
[0067] In general, a replacement value is entered into a single
cell of the post-processing column and is associated with a given
field 218. The replacement value replaces the value 242 of the
given field 218 when one or more validation rules 236 associated
with the given field 218 fails. For example, if a `start date`
field is associated with two validation rules, Minimum(1900-01-01)
and Maximum(2011-12-31), one example of a replacement value is
1970-01-01. Thus, if the value of the `start date` field for a
given record is below the minimum (i.e., before 1900-01-01) or
above the maximum (i.e., later than 2011-12-31), the value is
replaced with the replacement value, 1970-01-01. Other types of
replacement values such as strings, date/times, etc. can also be
specified in the post-processing column.
[0068] As is noted above, the user 110 can also specify one or more
values to be excluded from validation in an excluded value type
pre-processing column. For example, valid data for a field such as
`end_date` generally includes only date information (e.g.,
1900-01-01). However, in some applications it may be desirable to
also specify that another value such as "ACTIVE" is also valid data
for the `end date` field. This can be done by entering the string
"ACTIVE" into the excluded value type pre-processing column,
indicating that the value "ACTIVE" is always allowable for the
`start date` field and that the validation rules do not need to be
applied to the specified excluded value.
[0069] A pre-processing column can also include a validation type
column that specifies a validation data type for one or more of the
fields 218. In some examples, the user 110 can enter a DML type
declaration which is used to validate a field. For example, if a
field 218 includes a string value that represents a date, the user
110 can enter DATE(`YYYY-MM-DD`) so specify that the string value
actually represents a date data type and therefore should be
validated as such. Similarly, to validate a string as a decimal
number, the user 110 can enter decimal(``).
1.5 Example User Interface
[0070] Referring to FIG. 3, a screen capture illustrates one
implementation of the user interface 112 of FIG. 2. The user
interface 112 is configured to allow a user 110 to specify
validation rules 234 for a dataset while receiving validation rule
feedback.
[0071] As is described above, the user interface 112 includes a
two-dimensional grid 225 of cells 224. The grid 225 includes a
number of rows 230 associated with fields 218 of the data elements
of the dataset. The first cell of each of the rows 230 includes the
name of the field 218 associated with the row 230 and, in
parentheses, the value 242 of the field 218 for a currently
selected data element 244 of the dataset. Other information about
the field can also be displayed visually, to aid in a user
specifying validation rules. In this example, the first cell also
includes an icon 220 that visually indicates a data type of the
values of the field 218.
[0072] In FIG. 3, the user 110 has added a number of validation
rules 234 to the grid 225. The validation rules 234 appear in the
grid as a number of columns 232. The name of each validation rule
234 is included at the top of the column 232 associated with the
validation rule 234 (e.g., "Maximum Length," "Not Blank,"
"Pattern," etc.).
[0073] The user 110 has specified that selected validation rules
234 should be applied to one or more fields 218 of the elements of
the dataset. To do so, for each validation rule 234 to be applied,
the user 110 has entered a rule parameter 236 at the intersection
of the column 232 associated with the validation rule 234 and the
row(s) 230 associated with the field(s) 218 to which the validation
rule 234 should be applied. For example, the user 110 has entered
the rule parameter S"99999" at the intersection of the "Pattern"
validation rule and the `zipcode` field. The entered rule parameter
configures the "Pattern" validation rule to evaluate the `zipcode`
field of each element of the dataset to determine if the value of
the `zipcode` field of each of the elements is a string with a
pattern of five consecutive numeric characters. Similarly, the
"Pattern" validation rule is configured to evaluate the `phonenum`
field of each element of the dataset to determine if the value 242
of the `phonenum` field of each element is a string with a pattern
of S"999-999-9999" (i.e., three numeric characters, a dash, three
more numeric characters, a dash, and four more numeric
characters).
[0074] Other types of validation rules 234 and rule parameters are
also illustrated in FIG. 3. For example, a "Valid Values"
validation rule is applied to the `statename` field with a rule
parameter of M"StateNames" which identifies the valid values for
the `statename` field as the set of state names for the United
States of America. The `M` before "StateNames" in the rule
parameter above indicates that the set of state names is defined
(e.g., by the user 110 or a system administrator) as a separate
dataset (sometimes referred to as a codeset), which is stored in a
metadata reference system that is accessible in the execution
environment 104. In this example, the dataset including the state
names is referred to by the variable name "StateNames."
[0075] In some examples, a codeset is stored in a lookup table. To
access the codeset in the lookup table, the rule parameter is
entered as, for example, L"StateNames" indicating that a lookup
file identified to the system with the name "StateNames" is the
source of valid `statename` values. In yet other examples, the user
110 can directly enter the set of valid values. For example, the
valid set of gender codes can be entered as V"M,F,U".
[0076] Another, "Not Blank," validation rule is applied to a number
of the fields. For example, the "Not Blank" validation rule is
applied to the `street` field due to the presence of a check mark
rule parameter in the cell at the intersection of the "Not Blank"
rule parameter column and the `street` field row.
[0077] As is described above, the user interface 112 is able to
display all of the values 242 of the fields 218 for a given element
244 to the user 110. The UI module 106 also receives input from the
user interface 112 that causes the processing module 108 to execute
some or all of the validation rules 234 associated with the fields
218 of the element 244. The result(s) generated by the processing
module 108 are provided to the UI module 106, which in turn
displays feedback based on the result(s) to the user 110 in the
user interface 112. In FIG. 3, the "Valid Values" validation rule
is applied to the `statename` field to test whether the value of
the `statename` field is a member of the set of state names. From
inspection, one can see that the value of the `statename` field is
`Pennsylvannia` which is a misspelling of the state name
`Pennsylvania.` Thus, the "Valid Values" validation rule fails for
the `statename` field for the given element 244. To indicate the
failure of the validation rule to the user 110, the cell associated
with the "Valid Values" validation rule and the `statename` field
is shaded.
[0078] The user 110 can navigate through the elements of the
dataset using a navigation control 246. In some examples,
navigation control 246 includes arrows, which allow the user 110 to
step through the elements of the dataset one at a time, and a
numeric field, which allows the user 110 to enter a dataset element
number that they would like to view. Whenever the user 110
navigates to a different element using the navigation control 246,
the processing module 108 executes the specified validation rules
on the values of the new element, and the values 242 and other
visual feedback indicating results of the validation tests (for
example shading of cells) are refreshed/updated.
[0079] The user interface 112 also includes a `Test` button 248
which, when actuated, causes the processing module 108 to execute
the specified validation rules for all of the elements of the
dataset. As is described above, the results of executing the
specified validation rules for all of the elements of the dataset
are summarized in the user interface 112 by the inclusion of a
failed element count indicator 240 in each cell for which one or
more elements have failed the specified validation rule. In the
implementation of FIG. 3, the failed element count indicator 240 is
a number that represents the number of elements of the dataset that
failed the validation rule specified by the cell. For example, the
failed element count indicator for the cell associated with the
`statename` field and the "Valid Values" validation rule indicates
that 3886 of the elements of the dataset include a state name that
is not a member of the set of valid state names. A user can click
on that cell to retrieve information about elements that
failed.
[0080] For each element that failed one or more validation rule
test results, a collection of issue information can be aggregated
over the validation issues and stored for later retrieval. For
example, a list of fields for which one or more validation rules
were specified can be displayed in another view, with counts of
number of elements that had a validation issue for that field,
including a count of zero elements if there were no validation
issues for that field. This enables a user to unambiguously
determine that no elements failed that particular validation rule,
while also confirming that the validation rules for that field were
actually performed. Stored validation issue information can also be
used to compute various metrics (e.g., percentages of records that
have particular quality issues), or to augment a dataset of data
elements with validation issue information.
2 Alternatives
[0081] In some examples, the failed result count indicator 240 is a
hyperlink which, when clicked by the user 110, causes the UI module
106 to display a window that summarizes all of the failed elements
to the user 110.
[0082] In some examples, the result of applying data validation
rules can be used to determine metrics of the dataset. For example,
metrics can include the percentage of records of the dataset which
have data quality issues. Other user interfaces which are not
described herein can be used to specify and present these metrics
to the user 110.
[0083] While the above description describes providing feedback to
users by shading cells, other types of feedback mechanisms (e.g.,
sounds, pop-up windows, special symbols, etc.) can be utilized.
[0084] The above description describes specifying rules while
working on a full dataset. However, in some examples, a test
dataset that has a reduced and more manageable size and is
representative of a full dataset can be used.
[0085] The techniques described above can be implemented using
software for execution on a computer. For instance, the software
forms procedures in one or more computer programs that execute on
one or more programmed or programmable computer systems (which may
be of various architectures such as distributed, client/server, or
grid) each including at least one processor, at least one data
storage system (including volatile and non-volatile memory and/or
storage elements), at least one input device or port, and at least
one output device or port. The software may form one or more
modules of a larger program, for example, that provides other
services related to the design and configuration of dataflow
graphs. The nodes and elements of the graph can be implemented as
data structures stored in a computer readable medium or other
organized data conforming to a data model stored in a data
repository.
[0086] The software may be provided on a storage medium, such as a
CD-ROM, readable by a general or special purpose programmable
computer, or delivered (encoded in a propagated signal) over a
communication medium of a network to a storage medium of the
computer where it is executed. All of the functions may be
performed on a special purpose computer, or using special-purpose
hardware, such as coprocessors. The software may be implemented in
a distributed manner in which different parts of the computation
specified by the software are performed by different computers.
Each such computer program is preferably stored on or downloaded to
a storage media or device (e.g., solid state memory or media, or
magnetic or optical media) readable by a general or special purpose
programmable computer, for configuring and operating the computer
when the storage media or device is read by the computer system to
perform the procedures described herein. The inventive system may
also be considered to be implemented as a computer-readable storage
medium, configured with a computer program, where the storage
medium so configured causes a computer system to operate in a
specific and predefined manner to perform the functions described
herein.
[0087] A number of embodiments of the invention have been
described. Nevertheless, it will be understood that various
modifications may be made without departing from the spirit and
scope of the invention. For example, some of the steps described
above may be order independent, and thus can be performed in an
order different from that described.
[0088] It is to be understood that the foregoing description is
intended to illustrate and not to limit the scope of the invention,
which is defined by the scope of the appended claims. For example,
a number of the function steps described above may be performed in
a different order without substantially affecting overall
processing. Other embodiments are within the scope of the following
claims.
* * * * *